From jje at gate.sinica.edu.tw Wed Aug 1 03:53:28 2007
From: jje at gate.sinica.edu.tw (J.J. Emerson)
Date: Wed, 1 Aug 2007 18:53:28 +0800
Subject: [Genome] Trace Coverage
Message-ID:

Hello,

I am using your genome browser, which is beautiful, by the way. Anyway, I have a few questions for you with regard to the tracks present for the public human genome project's trace coverage.

The first question I have is: how much information on coverage do you have? More specifically, do you have coverage information with regard to which sequencing method that coverage was collected from?

The questions I have don't seem to be directly addressed in the browser or in the descriptions of the tables underlying the browser's coverage track:

http://genome.cse.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_group=map&hgta_track=clonePos&hgta_table=clonePos&hgta_doSchema=describe+table+schema

I'm specifically looking for a way to determine the fold coverage of within-clone shotgun sequence (not finishing sequence) on as fine a scale as possible. Ideally, I would have a sequence of integers representing the fold coverage along a chromosome. But I can imagine the data coming in many other forms, such as a series of mappings between trace sequences and the assembled reference sequence.

Finally, do you have anything similar for the Celera project's WGS assembly/traces?

Thanks a lot! Again, the job you do with the data is amazing. My hat's off to you yet again.

Cheers,

J.J.

From rhead at soe.ucsc.edu Wed Aug 1 11:21:09 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Wed, 01 Aug 2007 11:21:09 -0700
Subject: [Genome] Trace Coverage
In-Reply-To:
References:
Message-ID: <46B0CF15.7020804@soe.ucsc.edu>

Hi J.J.,

Thank you for the compliments on the Browser! We do not display all of the information you are seeking, although the clone coverage track may be interesting to you and at least lead part way to what you need.
Also, we do have links to the NCBI human genome assembly documentation from our hg18 gateway page, here:

http://genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18

Scroll down to the "assembly details" section and look for the "NCBI Build 36.1 Statistics" link and the "NCBI Build 36.1 release notes" link.

We do not display the Celera WGS assembly of the human genome.

I hope this information is useful. If you have further questions, please do not hesitate to contact us again.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

J.J. Emerson wrote:
> Hello,
>
> I am using your genome browser, which is beautiful, by the way.
> Anyway, I have a few questions for you with regard to the tracks
> present for the public human genome project's trace coverage.
>
> The first question that I have, is how much information on coverage
> do you have? A bit more specifically, do you have coverage
> information with regards to which sequencing method that coverage was
> collected from?
>
> The questions I'm looking for don't seem to be directly addressed in
> the browser or descriptions of the tables underlying the brower's
> coverage track:
>
> http://genome.cse.ucsc.edu/cgi-bin/hgTables?
> db=hg18&hgta_group=map&hgta_track=clonePos&hgta_table=clonePos&hgta_doSc
> hema=describe+table+schema
>
> I'm specifically looking for a way to determine the fold coverage of
> within clone shotgun sequence (not finishing sequence) on as fine a
> scale as possible. Ideally, I would have a sequence of integers
> representing the fold coverage along a chromosome. But I can imagine
> the data coming in many other forms, such as in a series of mappings
> between trace sequences and the assembled reference sequence.
>
> Finally, do you have anything similar for the Celera project's WGS
> assembly/traces?
>
> Thanks a lot! Again, the job you do with the data is amazing. My
> hat's off to you yet again.
>
> Cheers,
>
> J.J.
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From Jianmin.Wang at roswellpark.org Wed Aug 1 12:39:46 2007
From: Jianmin.Wang at roswellpark.org (Wang, Jianmin)
Date: Wed, 1 Aug 2007 15:39:46 -0400
Subject: [Genome] Question about retrieving gene, EST/mRNA names from knownGene table
Message-ID: <6FF91AE4F1DC7743A6466E334EB865AE16EE567A@VERITY.roswellpark.org>

Hello,

This is Jianmin Wang from Roswell Park Cancer Institute. I was trying to get the known genes for a specific genomic location, but the knownGene table only contains names such as uc001aaa.1 instead of the gene, EST, or mRNA names that are shown in the genome browser. Could you tell me which table contains such information?

Your help is highly appreciated.

Best Regards,

Jianmin Wang
Roswell Park Cancer Institute
Buffalo, NY 14263

This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.

From rhead at soe.ucsc.edu Wed Aug 1 13:09:50 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Wed, 01 Aug 2007 13:09:50 -0700
Subject: [Genome] Question about retrieving gene, EST/mRNA names from knownGene table
In-Reply-To: <6FF91AE4F1DC7743A6466E334EB865AE16EE567A@VERITY.roswellpark.org>
References: <6FF91AE4F1DC7743A6466E334EB865AE16EE567A@VERITY.roswellpark.org>
Message-ID: <46B0E88E.7070302@soe.ucsc.edu>

Hello Jianmin,

The table that contains various accession numbers and gene symbols associated with UCSC Genes (a.k.a. 'knownGene') is called 'kgXref'.
If you are using the Table Browser, you can keep the UCSC Genes table selected, but choose the output format option of "selected fields from primary and related tables". The kgXref table will then be in the list of related tables, and you can use it to retrieve the mRNA IDs and gene symbols associated with your genes of interest.

I hope this information helps. Please contact us again if you have any further questions.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Wang, Jianmin wrote:
> Hello,
>
> This is Jianmin Wang from Roswell Park Cancer Institute. I was tring to
> get the known genes for a specific genomic location, but the knowGene
> table only contains the name such as uc001aaa.1 instead of gene, EST or
> mRNA names which are shown on the genome browser. Could you tell me
> which table contains such information?
>
> Your help is highly appriecated.
>
> Best Regards,
>
> Jianmin Wang
> Roswell Park Cancer Institute
> Buffalo, NY 14263
>
>
> This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From caleb.kallen at emory.edu Wed Aug 1 13:43:19 2007
From: caleb.kallen at emory.edu (Caleb Kallen)
Date: Wed, 01 Aug 2007 16:43:19 -0400
Subject: [Genome] problems
Message-ID:

Hi,
I am trying to blat search multiple sequences. I emptied all my caches and cookies, and it worked for me in one trial, using 3 sequences.
Then, 1) I got an error: No name in line starting with '>' In this case, I submitted the following (see below), which does appear to have a name in FASTA format: please advise. Also 2) Is there a way to limit output to 100% match for the entire sequence that is submitted...I am looking to find genomic coordinates that exist, not scroll for similar sequences? 3) finally, I had problems when submitting using an Excel sheet...there was no hyperlink to output and no error message offered. Any suggestions? thanks very much CBK >Hits_A1 CCTAGGCACAGCCAGGTTTCCTTTCCTAGAAGCCAGACTGCCTGGATTCGAATCCCATGGCTACCTGGGCTGTGTGACCT TGGGCAAGTATTTAACGTCTCTGAGCCCTAGTATTTCCATCTTTAAACTGGGAACAGTGACTGTGTCATAGGGTTGCTCT GAGGAGTAAATCAGATTGTCATGTAAATTGTTTAACACGTGGGTAGCACTTTATTAACAACAAAGACTGGTTGGAAGCCC CCGTCTGTGCTTGCAGCTGCCTTCAGCCATGGTCCACCCTACCAAGGCATGTAGGCTCAGAGAGGTCCAGTAACTTGCTC AAGGTCACACGGCAAGTGAATAGAAGAGCCAGCATATGAACTTTGGTGTGGTTGACTCCTGAACTCAAAATCTGAATCTC TATGCTTTACTGGCTCCTTGTGAAAGAAAGAACCGCACATCTGTGTGTGTGTGTGTGTACACGTACACACACACACACAC ACTCACACAAACATTTCCACTGACTTGAAGGCTCAGGAGATATCCCCAAAGCCAAGGCATGGATCTGCTGGCTGAGGAGG TGTGAGCCCTCGGAGGGCACCATGTGCCCAGTCCACACTGACCTTGTGTCTTGCCATCCCTTAAGGACCCCGTCTCCTCT GCTGTGTGCTCATAGGGAAGTGGCCCGGAGAGGCCAGGTCAGACTCTTAATTCTTTGGCATCTGAGATGTAAGAACATGG GCCGTGCAGCCAAGGGTGGACAGACGCCATGTGGGTGTGTCTGGCAGGACTGCCTGCCTGCATGAAGGTCACGCTGCCCT TTACAGTCACACTGGCCCCAGCTCCCCACGGTGACCTGCTACCAGGCCTGTGAGAGGGGCCACTGCTTGCTGGGCTGTGG GAGCACTAAAAGGAAGACCCCCCTCACCAACCACCTTGGAAGGACCCAATATCATCTCTGTTCTGCTCCCCTGCCCGCAG CTGGAGGGGAGGGAGAGCTGGCCTCTCGCACCTGTGAAGTATCACTAGCAACCATGACCTAGCAAAGGGCTTCCTTTCCT CTCATATCACAGTCTATCTCTCTATCTATCGTCTAGAAAAGTTTCCAAAGTCAGTCTGTTGACAACCTGCTGGACTTTGG TGGCTGAGCATTTTGCCTCTTTGAACTACAATCTCCTTGAAGGCAGAGATTTTAGTTCAACCCTGAAAAGTCCATTCTAT CTTTTTTTTCTTGGGGAGATGGGTGGGGGGTTGGGGGGCGCGGTGGTAAGGCATCAGTAAATGACCTGATAAGGTCTGCA GGATTAGCTCTGGAGGGCTGCCTGATACAGGCAGGCCTGGCTCTCCTGGCTGTTTCCACGTGGCATGACTGGAAGATGAG TGTTGTCCTTCCACAAGAAGGTGTACTGGACCCCTTGCTCCTGCTCATGGTATCCCCCCATCTGCCCTGCAGAAGCCGAC 
CTCCCGACTCCCAAACCCTATGGCTTGGCTGGGGCTGGTATACCCCTGGCTCAGAGTCTGGCCATGACCCACAGCTGGCC AAATAGAGCACCATTTCCTTGGTTCAGGCTCACCCAGTCACTGCCAGAGAGACCCAATTCCTGGACCCTTCCAGGACTTC TGGGCAAAGAACACCTGTCTGTCTACTTGAATGGGAGCTGGAAAGGTTACGTTACGAGCAGCAGGGGGCAGGTGTGAGCA GCCCCGTGGCAGAGAGGTCTGCCTCAAAGTGAGGGGAGAGAGAAACTCTCCTGAGTCCTGACAGCACTGAGCCCCTGGAG ACATG >A2 CTCCCTGAACTGCAAGCAAATTGGGAAATTTGAATGTGACGCAATAAGCCTGCAATAAAACAGAGCCTCCAAGAAGTGCA TGAAGCGAATCTCCTCCCAGCCTAGTCCACCACCCCTCTGCAGCTTCCCGAGCTGTCACCCTCCAGGGTGGCCACATACA GTATTTATTTTTTTTAATTAAGCTGCGTTGCTGTGTGGCTCACCCTCCACAGATGCTTTATAAAGCCCAGCCTTGTAAGT TGCTGAAAGTTTTTATTTTTTTTTAAGCGGATAGTAAAATTAAGCAGTTATTAGTATTCACGAAAAATGCTCGTGGCAGG CTTCGCACACGTAATGAGAAGCCACCTGCAGTTCAAAGCAGTACTTGATAATGCTGTAATTAGATTTGAGATGCAAAAAA GCTCATTAAAATTAAATAACATATTTGTCACTGTTTTAAGAGGTACAGATTGCCTAGAACCATGCTGGTCTCTTGACCTT GAGTATTTTGGCAGAGTCCTTGCTGCCGACAATTCAAAAATAACATCAGCAATTCTGCAGAGAGTCCTGAAAACATGCTC ATCCTTGGTTCAGTTTGTATTGAAAAAATTATATAGATATCCAGGGGAAATCACAGATTTGTACACAGTGTGTTCTGGGA GAAAAGGGTGGCCACAGTCCTCTCTGAGCCAGTAGCTCAGACGCTGTGAAGAAGAGATGGATGTGTGCCTGTGGGAGGCT GGGGCAGGTGGAGGGGTCCCCCATAGGACACGGGGACACTGTGAGGGAGTTTGCAGGAAGCGGCGGCCTTGGGCACCACT GCTTACCACTAAGCTGAGCTGGATGAAGCCACTTGCTCAGAGCAGGCCCAGGGCGACTGGTAATCAGATTGTCAGTCAGT GTTTCCCGGGTGCCTACCGGTGTACCGGGTCCCGGGCAGGGGATGGTGGGGCCGCAAGTGCCAGTGAGATTTGCCTCAGG CTGCAAAAGAGGTCATAGCCTCATCACGGAGATGAGGCATAAAGTCAGGTGCCAGGAAAGTATTTATTGTTAATAAAAGT AGCTAATGTTGACTGAGCGTGTCTTTCATAGTAGGTAGCCAAGCGTGGCATGCATTATCTTAGGAATCATCTCCATAGTC TGTGAGACAGCCCTGACTGCCCCATTTTCCTTGAGGTTACGTTGCTCTTGTAAGGTCACTTCACTAGTAAGTAACTAGTA AGCCCAGAGCTGAGCCGGGTTCAGATTCAGGTCTGGCCTTCTGCAGCACCCCTTCTGCATGAATCACAGAGTCACTCAGG TCCCATGGGCAGAGCAGAGAGACCCAGGGTGGGAGAGGACACACATGAAGGCACTCACTGAGGACCCACTCTTCACCAGA TCCTGCTCTAAACACTGCGTGCCTGTTCATCATTGTATCCCCATAGCAACCCTGTGATTCCATTTTACAGATGAGGAAAT TGAGGCATGAGAAATTAAATAACCAGTCCCAGGTCACCAGGTAGTTGGTGGCAGAGTTAGGATTAAGCTCTGACCTTGAC TGTTATTCTATGTTTAATTTTTCTTGAAATCTGTAAAGCATTCTATAAAGGTAGGATGTAATTTAAAAGCAATTATGATG 
ATTTTTGTTCTTGGAGAAAGAAAATGCAAGTTGCTTTCAAGGGGCAGAGACCAGAGGAGTTTTCCTGGGGGCAGAGAGCA GTTAAACGAGACGTGAGTGGAGGGGGCCAGATGGAGGGAAAATGAGGCCAAGTTAGAAACAGAGGCCTGGCCTGGGCTCC AGGGACAGTGTGGGCCCCCATTTTGGAGGAGCTGCGTGTCCTCCAGGGCAGGGAGTCATTGGGAAGGCAGTCTGGGCCGA CTGCCCCTCCTCCCGGTGGCTGGATTGATTTTCCAGGCTGACTCTCTTGGCACTGGGTCCAGATTTAGTTCTGTGGAGAG CTCGTCTTTCAAAGTCCAGCTCAAATGTCACCTCCTCGGCAAAGACTCCCCCACCGCCTGAAGCAAAATTAATCACTCCT CCTCCATGTCCCACAGCACTCTGTTGATACCTCCGGAATAGTCTTATCACATCGGACACGGATTTATGTGTTTACAAGTC TGTCTCTCCAGCTAGACAGCGACTTCTCGAGGGCAGAGCCCACTCTGCCTTTGTGTCATTGTGCCTGGGGCAGTGCTGCG TGTGCAGTGGGTGCTGAGGAAATGATGCATACATGGAGGACCAGTGGATGGTTTAATTTGATTTCAGTGACCAAGCACAG AGCTCTTACTTTGTGCAGGTCTTGTTCTAAGCTCTTTACATATGTGAAAGGATTTAGCCTTTATAACCCCATGACATCAG TTAGTTCTATGATGAGCCTGTTTTACAGGTGAGGACACTGAGGCTCAGAGAGGTTGTGTAACTTCTCAATATCATACATC TGGTAAGTGGTAGAGCTGGGAGTTTTAAACCCAGGCAGCCTGGAGTGAAAACATCTGCTTAAAGTGATTTTGTAGCTACA GATTGTATATAGTGATTCTTATTAGGGGAGTTTCTGGAATGAATTCTCATTTCACTCTTCTTTTCCCTAACTTGTCTGAA TTTAGGAAGGTTTGGAGAAAGCGATTGGGCTGCGATGCAGTTAAGTTTCCTTTGTTGTAAGCCGTGGAAACCAACTCTAG CTAAATTAAGCTAAACAGGGATTCTGTTGTGAGGATCTAAGTAGCTTGCAGAAGGAGCACGCTGGAAGGAAGGGCTTTGG GGAGGCC >A3 CTGGGATTACAAGCGTGAGCCACCGCGCCCAGCCTGTCATTTGCCTTTTACTTTTGTTTTGGGTGGGGTTAATTTGGCTC ACGGTAAAAGAACAGTTTTCCTGCATGTTTTATTAACACTAAAATGAAACGCATTTAGGGTCACAGAGATTTGGGGCCAT ACTAAACCAAAATTTTAAAAATAAAGTCCTTGGCTGGGCGTGGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCA AGGTGAGTGGATTGCTGGAGCCCAGGAGTTTGAGACCAGCCTGGGCAACATGGCGAAACCCCATCTCTACTAAAAATACA AAAAATTAGCCAGGCGTGGTGGTGGTACCCTCCTGTAGTCCCAGCTACTCAGGAGGCTGAGGTGGGAAGATCCCTGGAGC CCGGGAGGCGGAGGTTGCAGGGAGCCAAGATCAGCCACTTCACTCCAGCCTGGGCGGTAGAGGAGATCCTGTCTCAAAAA ATAAAATAAAATAAATAATTTCCTCCGTATCCTCAGAAACAATTCAAAGCCATGTGTAAACCCAAAACTGAGTTCATCAC TTCTAAGATCAGAAGAATTTGTGTTTGTTTATGATGAACTCACACACCATAATCTTCCAAAATCATTGAATTCCAGGTGT GACTAAAGCAAAGTTTCAGATTGCAAGTTGACACTGTTACCTGAAAGCTGCACACAACTGCTGAATCACTTTTTAAACAA GCATGAAAACAACAATTTCTCTTGTTCCCAGTACAGCTTTGAGCCACAAAAATTAATTACATCACCTAGGAAATTAGCCC 
TGCGTTTTGCATCCAAAGGTCAATGTTCAGACTGAGAAATGGTTTAGTGCTTCCAAACACAGGGCTCAGCACATAAATAT TTATCATGCAGAGGACAGGAACTCATTGTACATGGCTTGTTTACCTTATCAGCCAGTTACCTGGGGCAAAGAGTCCACTG TCTGCTTCCTGTCCTAATTTCCCTTGCCCAAGACCTTGGGCAATTGTTTATTTGACAGAGTGTTCTGTCCCATTCCCTTT CTTCTGGGAATAAGGGCACAGGATACTAAACAAATAGATCACCTGGTCTGGGAACACCTGATTTCTGGAGAGGCTGAGAC TTCTAGGTGTTTGCTAGAAAGATCCAAATACAGGAAAGAGAAAACAGCTGCTGCGTTCACAAGAAAACGGAGGATGCATT TCTCCCCATGACTTCGGCTTGTTTGGGTTCGCATTTCAAGTGCTCTGACTGGGCTTTATGGAATGTCCTGTCAAGCTTTT GATTGTGAAAATAAAGTTACAGAGACACAGGGGAACAATAATCACAGACGCGATTTGTAGCACAGACACGGCCTTTCATT TCATTCCTCTTTTCTGTGTTTACAAATTGCGACTGTTAGAGAAACGTTGGCCAAGCTGGCTCTTTAACCTGGCCACATAT TTATTTTGGCCACTCTTAGCTTTTATTTCGCAAGGTCTCATACTCACCAGCTCAGGTGCTTGTTACATAAGAAAGCTCTG CCGGCGGAGTGCCAAAAATTCAGCTCAGCATTTTCTGCCTCCCTGCAGTGGTCCCAAGGAATGGATACCTCTGCAACCAT CTGCAACCTGGACCTTTGCAACAGGAATCTCAGGTTAGGTGACACTGGCCAACATGCAGGGAAACTATGTCAACAAGCTA GTTTTCCTTTCCTTCCTTCCTCCTTCCCTTCTTCCCTCTCTCCCTCTCCTCACCTCCTTTCCTTTCTTTCATTTTTGATA CAGTAGGGTGTCTCACTATGTTGCCCAGGCTGGTCTCAAACCACTGGCCTCCAGTGATCCTCCTGCCTCAGCCTCCCGAA GCTCTGAGATTACAGGTGTGAGCCTCCAGGCCTGGCCCCAAACTAGTTTTCAAGGGTTGGTTATCACATCCCTTGTATCT TGGTTCTCACCAGTCAATCATTACGTACTTTGCATGGATTTATTAGAATAGATTTTAAATCTCTTGTTTTAGCCTCTTCT ACTTGTAGAGCTACCAATCCTAGAAGTTTACTAGCAACTCTGCATTGTTACATTTTAATTTGCATTTTGTTTTGCATATA TTTAATTTGCATTTAGTTTTGCATTTAGTTTGATTGT >A4 GAAGTAAATTACAGGCACGCTTTTCAAGCTCTGAGCTCTTGAGAAGACCACTGAGCTCAAAGTTAGTCCAACAAATAATG TTCATCATACACAGCATCTTCTGTTTAAGAAGGCAGGAAATTGGCCGGGTGCGGTGGCTCACACCTATAATCCTAGGACT TTGAGAGGCCGAGGCTGGCAGATCACTTGAAGTCAGCAGTTCGAGAGACCAGCCCACATGGCGAAACCCTATTTCTACTA AAAATACAAAAAATTAGCCAGGCATGATGGCGTGTGCCTATAGTCTCAGCTACTCGGGAGGCTGAGGTTGAGAATTGCTT GAACCCAGGAGGCGGATGTTGCAGTGAGCCCAGATCATGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTGTATCA AAACAATCAAGCAAAAATGGTTGTTGTCAACACAGTGGGTAACATCTGAACCAAACAAGCCAGGAAATTGAGCACTTGGT TTTAGAAAACTTTTCATCCAAAAAGGGTTAACAGATTTATAATGTGAGTTTAAAGGGGTCATCTCGAGTTTTCAAGAACA TTCATTTATGCCAAGCAGCTCCTTCTCCACTGATATCAAAGATATTTTACTGTAGAGGGCACCAAATACACCCCTTCATT 
ATTTTCTCCTTTTAAAGAAAACAGGACAAATTAGTGACCTGGAAGGTTTGGATAAAAACACATCAGAGAAATCATTGTTT TGAGAGTTCTTTTCAGCTTAATTAATTCACATTGGCTGCTCAAATTCATTTTAGGTCTGGAGGCGATTTTCAAACAAAAT GTTATGACAGAGATCATAACAAAGATTTGGAGCAAAGATTTCTGTTTGCTGAGCTGTCACTCAAGCTAAACAGATAGCAA CAGGGAATCTTTAGGCAAATTAACTTTCCCCTGAATGGGTTCTTTTATCCTGTCCATCTGCATTCATTCCCATGGGTACT TCGAGACCGAGCTAACAAGTCAGATATACATAATAAAGAAAATTGCTCAAGGGAAAAAAAAATCAACTGTTCATTTGGGA AAATGTGGTGCTAGTCTAAATACAATTCTCTATTAATTTCAAGGGGTCTAAAAATCTATCACCAGTTCTTCAAAAAGTTA AACAGACTTATCCTGTGGCCCAGCAATTACATTCCTAGGTACGCACCCCGAAGAACTGAAAACAGATAGTCAAACAAATA CTTGCACACAAATGTTCATCGCAGCACTATTCTCAATAGCCAAAAAGTGGAAATTACTCAAAATTCCATCAACAGATGAA TGGAGAAACAAAATGTGGTATATACGTATACAACAGAATATCATCCAGCCGTGAAAACAAATTAACTACTGATGTGTGAT ACAACGTAGATGAACCTCAGAAACATGACCTTAAGTGTTAAAAGCCAGACATAAAGATCATGATTCCGCTTATGTGAAAT ATATGAATGATCCCCTTTATACGAAATATTCAGAATAGGTGAATTTATCAAGACAGAAAGCAGATCAGTGGTTGTCAAGG GATGGGGAGGGGGGAATGAGGAATGACCGCTAATGGGTACGGGGTTTTCTTTCGGGGTGATGAAAATGTTTTGGAACAAG >Hits_A5 TGTGTTGTGTGTGTTGTGTGGATGGGTGTGTGTGCTGTCTTGTGTGTTGTGTGGTGTGTTGTGTGCTATGTGGATGGGTG TGTGTGGTGTGTTGTGTGTTGTGTGGTGTGTGTGGTGTGTGGGTGTGTGTGGTGTGTTGTGTGTTGTGTGGTGTGTTGTG TGCTGTGTGGATGGGTGTGTGTGATGTGTGTGGTGTAGTGTGTGGTGTGTGTGTGGTGTGTGTGGTGTGTTGTGTGGTGT GTTGTGTGCTGTGTGGATGGGTGTGTGTGGTGTGAGTGGTGTGTGGTGTGGTGTGTGGTGTGTGATGTGTGTGGTGTGTT GTGTGTTGTGTGGTGTGTGGATGGGTGTGTGTGGTGTGTTGTGTGGTGTGTTGTGTGCTGTGTGGATGGGTGTGTGTGGT GTGGTGAGTGGTATGTGGTGTGGTGTGTTGTGTGTTGTGTGGTGTGTGTGGTGTGTGGATGGGTGTGTGTGGTGTTTTGT GTGTTGTGTGGTGTGTGTGGTGTGTGGTGTGGTGTGTGGTGTGTGGTGTGTGTGGGAGTGCTCTCTGGCCAGGTGCAATC TCCTTCCTGTTCAGCTCCTGGGGCTGCTGCTACTCTTAGTTCAAAAGACTTTTTATCCCTGGCAAGGACAATTATTGGGG TCACGGTGACCTGCCTGTTTCAGCGCCTGCTTGGGCTTGTTTCGTTTGGCCTAAGACAAAAGCTTGAACTCTGGAACTGC CCAGAGGGGCGAGGGGTGCAAATTCCCTCTGCCTGACCTAACTGCAGTGTTCACAGGACTGACTCATTGTCCCCCCGGGT CAGGACAGGAGTGACATTCAAAAGAACTCATAAGCAGTTTGGCCAGGTACTGATCACTCCGAATGGATTCTGTGCTTGAG GAATCCACCAAAGGCCTTTTTGGTTTGTTTTCCCAGAAGCCCATCAGTCACCATCTGGGGAGGACAAGGAGGGAGCAGGG 
CTGATCCAAGGAGCTCTGGGGCAGATGGACTGTCCCTGTCACGGCTTCTTTTACTGGTAGTAATGACTCCCACTGGTTGA GTATGCAGGCGCCACTTTGCTCCATGCCAGGTACCATCCATGCCTGCACTTTCTTTTTCTTTTTGTTTTTTGAGATGGAG TCCCGCTGTGTTGCCCAGGCTGGAGTGCAGTGGCGAGATCTTGACTCACTGCAACCTCCACCTCCTGGGTTCAAGTGATG CTCCTGCCTCAGTCTCCTGAGTAGCTCGGATTACAGGCATGCACCACCATGCCCGGCTAATTTTTGTATTTTTAATAGAG ATGGGTTTCACCGTGTTGGTCAGGATGGTCTCGATCTCCTGACCTCATGATCCGCCCACCTTGGCCTTCCAAAGTGCTGG GATTACAGGCGTGAGCCACCGCACTTGGCCCATGCCAGCACTTTCTAGGTATAATCTCACTTGATCCCATCCGGCTCTTT GAGGCAGGCTATATGATCATGCCCATTTTGCAAATCAGGGAGCTGAGCCTGAGGAGCTGGGGTCCTTGCCCAGTCACACA GCCAGTAACAGCAAGAGTTCTGATTTACCCCTAGCTGGTCTGGATTCTCTTGGTCCCCTTATACACCTGTTCCCTGCTGG G >Hits_A6 AAGTACAGATAAAAGCTGGTTGAGGAGGTGTTGATAGCACAGAGTGTTTTTACGGAGTGTCCTCGGAGGGTTACATGCTC CTTTTTGTTGAATCCGAATCTTTTTTTTTTTCTAATATGGAGCCTCATAATCATTCAAGGACTCAGTGTGATGCCACAGG TTTCAAACTTGCCAGTTCGTCTCTTATGTTCTGCCAATTTTTTAAAAGTGGGAATATGAGTGCATGTGCACGTGTACACA CACGTGTGTGTGTGTGTGTCTTTGTGTATGTAAAATGCTTGGGATAAAAAAAGATGGATCAAATTTTGTCTGGCATGGCA TAGCAACTCTCACCTCTAGTATCCTCTAAACATTTAACCAAATGGACTTTAACTGGTTTACTTCAATTCTATTTTTATTT TCGTAGTCATTTAAATGTATGGGGAGCCAACTCTTTCAAATGAGACATTTTGTCATTAAAACATGTTATTTAGTTTAACC GCATGGGCCCGGCTTCAGCAACATTGAGGACTGTGAAGAGGGTTCTGAGCGGCTGTTCTCTCCTGACCTGACCGCAGCAG TGACCCACCAGCTTGCAATGCACGCAAACTGAAACAGGCAGGATGGCCAGCCCAGGGTTGTCTGCACGGTGGTCTTGGCA TCTGTACCTTAGGTTGCCAGCTCCTGGCCCTCCTTGCCTGCATTTGGTGGCCCTTGGCCCTGCCGTGTCATATGGCACAA GAGCCCTTCTGGCCCGCATCACCCTGTGTACTGATTAGGCAGGAATCAGCAAATCCGTTGAGTGCATGTTTAATGTGACA CAGATGTGGGACATACTACAGATCTTTCTGCTTTACAAATTTTCAGCTTTTTGCATTTCATCAGTGTTGCTTATTATTTA CAGAATGTAACTTGAGACATCAGAACAGATTGTAATTTTTAAAAAGGTTGTTGGGGGGGTGGGCCCGCAGGGGATCTGGG AGAAAAAATCCAGTAGTGTGGAATGTTGTTTGCTGTCCAGACCAAACAAACCAGTTTGCCAAGGGTTGATGGACTATGAC AATCAGTCCAATTCATGTTAACACTGGGGGCACTTAATTAAAAGGCAGCTTGACCTCCTTATCACCAAATAAAGAATAAT CTTTTGTTACCAGGTTGACCGTCGCAGCACCAGGCAGAAATGTACCCCTTGTAGAGGCACTGAAGGGTACCACCAAGCTG CCAGCTCGTGTCCTGCCTCCTTCCTGTCCACAGAGAAAGGCACCATTGAACCCAGTCCCCACTCAGACATGTGGATATTC 
TTGCCTCATGCCTGCTGTGGCTTTACACCCAGGCGCTATGGCAAGAGGCGGAAAAAAGAAATCTCTTCCTTTGCCCAAAA ATCCAAGATACTCTTGAGGGCATTGGAATGATGAGCCAGGCCAGTTTGACACAAGTTCAAGGAGAAGGTACCATCTCTTG CTTCTAGAGGGGCACATCTGGCTAGAGGTGGCCCTGGCTTCTGAGAAAGCTCAGCCATAGGTTACATTTAAGGGCAAACA ATGAAAAGCACCAGCTGGTCGGGAGGAAGATGAGGAGGAGTGGGTGATTGGTTGCTTCTTTCCTGTGTCTAGTTCTAAAG AACTTAAAACTTGGGTCCAAAATAATGAGAGCCGTGTTGTAAAAATCATTTGAAGAGAGTTCTCCTGGGGAATGTGAGCT AGTGCCGTGACCTGATCCCTTGGTCAGATGGTTCTGAGCTGTTTAAGTGCTGAGTGTCAACTGTCTACAGGGAGCCAGGC CCCTGAGGAGCGGCCAGGGTTCCAAGCCCCACTGTTCTTTTCCTTCCTGGAAAAGTGCAGACGGTGCACGTTCCCCGGAG GCCACCAGGCAGCTCTCCAGTTGCCCTCTGGGTGAACGTGACCCAGGTCGGTAACTTTCTAGTGTCCCCCCCACGTTGGA ATGATCAGCATTCCTGTACTCCAGGGGTTCTTGAGTTCTTGAGTTGTTACCAACGCCCTGCCTTTCTCTGGCTCAGATTC AGAGGAGGAGTTCCACTGCGGACTGCCGCTTCTTCACCCTCTGGGCTTCGGAAGCAGCACCTCTGGCTCCCCTGCACCCA GTCCTCAGTCTTGTGGCCCTGGTCTGCCCCTTTATCACATGTCTTGAGCACCCTCTCAGCACACAGCACCGTTCTGAGCG CTGAGGTGCGACCATGCATAGAAGAGACAAGAATCCCTGGTTCCGCATGGCTTCCCTCAACACCTACGTTTCCAGGCCAT GGCTTGGGTTCAGAAGAATCACCCTCCCTGGAAGAGATGAGTGACTGGAAGGAGAGCATGTCACTGGCTCCCCTGAGACC CTCCCAAGAGCCTCAGTTTCTCCCTCTACCACCTGAAGTCTAACTGCAGGGCCTGAAACAGATTCTCTTTAAAAACCCAT GAGTGTGGTGACTGCATAGGTTTTCATTCTCCTGAGTTCTTAATTTCCGTGTCCCTAGCACTTTTTGAAGCATCTGCTCC TGTTGCCCCTTTCTTTTCCTCCTCACCCTCCTCCATAGTCTCTGTTTGATAGTAGAAGGAGATTTAAAACCTCCAGGCTG AGTGAACAGGCGGTCTTGAAAAAGCAAACAAATCGGGTCAGGCACGGTGGCTCACGCCTGTAATCCCCAGCACTTTGGGA GGCCGAGGTGGGCGGATCACAAGGTCAGGAGTTCGAGACCAGCCTGGCCAATATGGTGAAACCCCGTCTCTACTAAAAAT ACAAAAAAATTAGCCGGGCATGGTGCCGTATACCTGTA From rhead at soe.ucsc.edu Wed Aug 1 16:46:17 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Wed, 01 Aug 2007 16:46:17 -0700 Subject: [Genome] problems In-Reply-To: References: Message-ID: <46B11B49.4010403@soe.ucsc.edu> Hello CBK, I will try to answer each of your questions below: 1) I am not able to reproduce the error you got. The example sequence pasted at the end of your email seems to be in perfectly acceptable FASTA format. 
Is there a possibility that a stray ">" character somehow got into the text when you BLATed it? If you continue to have problems with it, let us know, and we will try to look into it further.

2) There is not a way to limit the output of BLAT with the web interface. However, BLAT output can be sorted so that the information more relevant to you is at the top. For instance, if you sort just on score (and not also by query or chrom), the 100% identity hits for each query tend to be at the top. (Note that sometimes BLAT hits that show 100% identity can be lower-scoring, if only a portion of the query aligns to the genome, but that portion aligns perfectly.)

If you have a lot of sequences to BLAT, you may be interested in downloading and running command-line BLAT. It is free for non-commercial, educational, and personal use. With the command-line version of BLAT, there are more options for setting parameters and sorting output. BLAT downloads and documentation are available here:

http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads

3) We do not support Excel-formatted files for BLAT, only text files. If you can save the FASTA sequence as a simple text file, BLAT shouldn't have any problems with it.

I hope this helps. If we can be of further assistance, please do not hesitate to contact us again.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Caleb Kallen wrote:
> Hi,
> I am trying to blat search multiple sequences. I emptied all my caches and
> cookies, and it worked for me in one trial, using 3 sequences. Then,
>
> 1) I got an error:
>
> No name in line starting with '>'
>
> In this case, I submitted the following (see below), which does appear to
> have a name in FASTA format:
>
> please advise. Also
>
> 2) Is there a way to limit output to 100% match for the entire sequence
> that is submitted...I am looking to find genomic coordinates that exist, not
> scroll for similar sequences?
>
> 3) finally, I had problems when submitting using an Excel sheet...there was
> no hyperlink to output and no error message offered. Any suggestions?
>
>
> thanks very much
> CBK
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From Anna.Git at cancer.org.uk Thu Aug 2 06:31:09 2007
From: Anna.Git at cancer.org.uk (Anna Git)
Date: Thu, 2 Aug 2007 14:31:09 +0100
Subject:
[Genome] tRNA genes Message-ID: <250169CA5A44114B86538611B148E48301E44698@uk-lif-lexc02.crwin.crnet.org> Hi! Which (repeat?) track contains tRNA genes and why aren't they granted a separate menu like sno/miRNAs? Many thanks in advance, Anna Dr Anna Git Cancer Research UK Cambridge Research Institute (CRI) Li Ka Shing Centre Robinson Way Cambridge CB2 0RE +44 (1223) 404423 Anna.Git at cancer.org.uk "Man will occasionally stumble over the truth, but most of the time he will pick himself up and continue on." Winston Churchill From Peter.Brooks at integragen.com Thu Aug 2 09:47:35 2007 From: Peter.Brooks at integragen.com (Peter Brooks) Date: Thu, 02 Aug 2007 18:47:35 +0200 Subject: [Genome] GC Percent selected fields Message-ID: <46B20AA7.9000602@integragen.com> Hello In Table Browser, I tried to download a table of GC percent with a limited choice of fields. In "output format", only the 3 choices: data points, bed format or custom track were available. Output format did not include the usual prompt "selected fields form primary and related tables". (So the only possible output was the listing of GC% for sequential 5 bp segments.) I tried another type of table and found the "selected fields form primary and related tables" as usual in Output format. Then when I went Back in Firefox, and reselected gc5Base, I was able to select fields and download. So I got the data but I would appreciate any suggestions about how to use Table Browser to avoid this problem. Thank you Peter Brooks From rhead at soe.ucsc.edu Thu Aug 2 11:42:11 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Thu, 02 Aug 2007 11:42:11 -0700 Subject: [Genome] GC Percent selected fields In-Reply-To: <46B20AA7.9000602@integragen.com> References: <46B20AA7.9000602@integragen.com> Message-ID: <46B22583.5010206@soe.ucsc.edu> Hello Peter, I am not sure how you were able to get "selected fields" as an output option for the gc5base table. 
The gc5Base table, like all wiggle-type tables, simply contains pointers to data files. The data files contain the actual information. (We do this because the tables would otherwise be extremely large.) The "data points" output option fetches the actual data in the region specified from the data files, and presents it in variableStep format. You can read about the wiggle data format here: http://genome.ucsc.edu/goldenPath/help/wiggle.html The information in the gc5Base table is not very interesting to most users (just confusing), so the "all fields" and "selected fields" output options are not made available. However, if you are actually looking for the contents of the table, you can find it on our downloads server, here: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ (the file is called gc5Base.txt.gz). If you are looking for the actual data underlying the GC Percent track, you can use the "data points" option in the Table Browser. This can fetch up to 10 million rows of information at one time. If you are trying to download this information for an entire genome, this will take multiple Table Browser queries (about 60 for the hg18 assembly). If this is indeed what you are looking for, let us know, and we can make the data file available for download. I hope this is helpful. If you have further questions, please do not hesitate to contact us again. -- Brooke Rhead UCSC Genome Bioinformatics Group Peter Brooks wrote: > Hello > > In Table Browser, I tried to download a table of GC percent with a > limited choice of fields. > In "output format", only the 3 choices: data points, bed format or > custom track were available. > Output format did not include the usual prompt "selected fields form > primary and related tables". > (So the only possible output was the listing of GC% for sequential 5 bp > segments.) > > I tried another type of table and found the "selected fields form > primary and related tables" as usual in Output format. 
> > Then when I went Back in Firefox, and reselected gc5Base, I was able to > select fields and download. > So I got the data but I would appreciate any suggestions about how to > use Table Browser to avoid this problem. > > Thank you > > Peter Brooks > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From rhead at soe.ucsc.edu Thu Aug 2 11:58:05 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Thu, 02 Aug 2007 11:58:05 -0700 Subject: [Genome] tRNA genes In-Reply-To: <250169CA5A44114B86538611B148E48301E44698@uk-lif-lexc02.crwin.crnet.org> References: <250169CA5A44114B86538611B148E48301E44698@uk-lif-lexc02.crwin.crnet.org> Message-ID: <46B2293D.70708@soe.ucsc.edu> Hello Anna, The 'repeatMasker' track contains an RNA category, of which tRNA is a subcategory. If you would like to get a list of only the tRNAs in this track, you can go to the Table Browser, select the repeatMasker track (rmsk table), and create a filter such that "repClass does match tRNA". The reason there is a separate track for sno/miRNAs is because two sources provide them to us: snoRNABase and miRBase. I hope this is helpful. Please do not hesitate to contact us again if you have further questions. -- Brooke Rhead UCSC Genome Bioinformatics Group Anna Git wrote: > Hi! > > Which (repeat?) track contains tRNA genes and why aren't they granted a > separate menu like sno/miRNAs? > > Many thanks in advance, > > Anna > > Dr Anna Git > Cancer Research UK Cambridge Research Institute (CRI) > Li Ka Shing Centre > Robinson Way > Cambridge > CB2 0RE > +44 (1223) 404423 > Anna.Git at cancer.org.uk > > "Man will occasionally stumble over the truth, but most of the time he > will pick himself up and continue on." 
> Winston Churchill > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From zhang.kangyu at gene.com Thu Aug 2 13:54:51 2007 From: zhang.kangyu at gene.com (Kangyu Zhang) Date: Thu, 2 Aug 2007 13:54:51 -0700 Subject: [Genome] Question on gene length information retrieval Message-ID: <003501c7d547$5ebd3140$1f0c2c0a@gne.windows.gene.com> Hello, We are interested in retrieving the human gene length information from the UCSC genome browser. In particular, for an NCBI gene id identifier (refseq identifier NP_ or NM_ are good too, but that's not one to one mapping because of alternative splicing), retrieve its protein sequence length, mRNA length (total exon length), number of exon, and the total length of this gene on the genome (including the intron length). Could you help figure out the way to retrieve the above information via the UCSC genome browser? Thanks, Kangyu Zhang Department of Bioinformatics Genentech, Inc. 1 DNA Way, MS 93 South San Francisco, CA 94080 From rhead at soe.ucsc.edu Thu Aug 2 14:44:18 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Thu, 02 Aug 2007 14:44:18 -0700 Subject: [Genome] Question on gene length information retrieval In-Reply-To: <003501c7d547$5ebd3140$1f0c2c0a@gne.windows.gene.com> References: <003501c7d547$5ebd3140$1f0c2c0a@gne.windows.gene.com> Message-ID: <46B25032.2030905@soe.ucsc.edu> Hi Kangyu, Can you clarify a couple of things? Do you have only a single gene to look at, or a list of several genes? Also, I'm not sure which NCBI gene ID you are referring to. Can you give an example? Also, do you already have a corresponding track in mind from which to retrieve the information (like UCSC Genes, RefSeq Genes, etc.)? Thanks. This will help us answer your questions. 
-- Brooke Rhead UCSC Genome Bioinformatics Group Kangyu Zhang wrote: > Hello, > > > > We are interested in retrieving the human gene length information from the > UCSC genome browser. In particular, for an NCBI gene id identifier (refseq > identifier NP_ or NM_ are good too, but that's not one to one mapping > because of alternative splicing), retrieve its protein sequence length, mRNA > length (total exon length), number of exon, and the total length of this > gene on the genome (including the intron length). > > > > Could you help figure out the way to retrieve the above information via > the UCSC genome browser? > > > > Thanks, > > Kangyu Zhang > > > > Department of Bioinformatics > > Genentech, Inc. > > 1 DNA Way, MS 93 > > South San Francisco, CA 94080 > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From zhang.kangyu at gene.com Thu Aug 2 16:00:15 2007 From: zhang.kangyu at gene.com (Kangyu Zhang) Date: Thu, 2 Aug 2007 16:00:15 -0700 Subject: [Genome] Question on gene length information retrieval In-Reply-To: <46B25032.2030905@soe.ucsc.edu> Message-ID: <000b01c7d558$e3bb26c0$1f0c2c0a@gne.windows.gene.com> Hi, Brooke, Sorry for my unclearness. I need to retrieve the entire genome. The NCBI gene ID I referred to is entrez ID, for example, 324 corresponds to genename APC, and 10297 is APC2; however, other IDs such as RefSeq genes are fine too. Sorry I don't understand what is the meaning of "corresponding track"? But I guess the RefSeq Genes shall serve well as the identifier, though the Entrez ID is better. Kangyu -----Original Message----- From: Brooke Rhead [mailto:rhead at soe.ucsc.edu] Sent: Thursday, August 02, 2007 2:44 PM To: Kangyu Zhang Cc: genome at soe.ucsc.edu Subject: Re: [Genome] Question on gene length information retrieval Hi Kangyu, Can you clarify a couple of things? Do you have only a single gene to look at, or a list of several genes? 
Also, I'm not sure which NCBI gene ID you are referring to. Can you give an example? Also, do you already have a corresponding track in mind from which to retrieve the information (like UCSC Genes, RefSeq Genes, etc.)? Thanks. This will help us answer your questions. -- Brooke Rhead UCSC Genome Bioinformatics Group Kangyu Zhang wrote: > Hello, > > > > We are interested in retrieving the human gene length information from the > UCSC genome browser. In particular, for an NCBI gene id identifier (refseq > identifier NP_ or NM_ are good too, but that's not one to one mapping > because of alternative splicing), retrieve its protein sequence length, mRNA > length (total exon length), number of exon, and the total length of this > gene on the genome (including the intron length). > > > > Could you help figure out the way to retrieve the above information via > the UCSC genome browser? > > > > Thanks, > > Kangyu Zhang > > > > Department of Bioinformatics > > Genentech, Inc. > > 1 DNA Way, MS 93 > > South San Francisco, CA 94080 > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From dmauldin at fhcrc.org Thu Aug 2 16:44:52 2007 From: dmauldin at fhcrc.org (Denise Mauldin) Date: Thu, 2 Aug 2007 16:44:52 -0700 (PDT) Subject: [Genome] Closest Gene Message-ID: Hello, I have a list of genome positions that I got from Blat. What's the easiest programmatic way to get the closest gene to those positions? I currently work in Perl, but could adapt to other suggestions. Thanks, Denise From zhaoshi at u.washington.edu Thu Aug 2 17:01:08 2007 From: zhaoshi at u.washington.edu (zhaoshi) Date: Thu, 02 Aug 2007 17:01:08 -0700 Subject: [Genome] question on chain data format Message-ID: <46B27044.5070906@u.washington.edu> Hi-- I was trying to get the pairwise sequence identity information from the humans self chain data. 
I downloaded the chain files and read
http://genome.ucsc.edu/goldenPath/help/chain.html, and I have some questions:

1) It seems that the chain file does not contain sequence identity
information. Is there any other data that contains this information, or do I
need to compute it on my own based on these chain data?

2) I read the chain format explanation, but I do not quite understand what
'dt, dq' mean. The explanation states:

dt -- the difference between the end of this block and the beginning of the next block (reference sequence)
dq -- the difference between the end of this block and the beginning of the next block (query sequence)

What does this really mean? What does "difference" mean here? Mismatch?

Thanks for your help.

Zhaoshi

From zxu at uhnres.utoronto.ca Fri Aug 3 08:12:13 2007
From: zxu at uhnres.utoronto.ca (zxu)
Date: Fri, 03 Aug 2007 11:12:13 -0400
Subject: [Genome] try to get gene symbol
Message-ID: <46B345CD.C6AEAA4E@uhnres.utoronto.ca>

hi,

if I have a list of knowngene ID, refseq ID, and some pseudo gene ID,
how can I get their gene symbol in a batch mode?

for example knowngene NM_015023 ----> gene symbol WDTC1.

thanks

zd

From ann at soe.ucsc.edu Fri Aug 3 11:27:38 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 03 Aug 2007 11:27:38 -0700
Subject: [Genome] try to get gene symbol
In-Reply-To: <46B345CD.C6AEAA4E@uhnres.utoronto.ca>
References: <46B345CD.C6AEAA4E@uhnres.utoronto.ca>
Message-ID: <46B3739A.70706@cse.ucsc.edu>

Hello Zd,

You can do this using the Table Browser. Read more about how to use the
Table Browser here:

I'll show you how to do it for the one RefSeq gene you listed in your
question, but you can also do it in bulk (more than one at a time).

Configure the Table Browser like so:

clade: Vertebrate
genome: Human
assembly: Mar. 2006 (I assume you want the most recent human assembly)
group: Genes and Gene Prediction Tracks
track: RefSeq Genes
table: refGene
region: genome
identifiers: paste list (Paste the list of RefSeq identifiers.
For this example, I will paste only NM_015023).
output format: selected fields from primary and related tables

On the next page, choose two fields from the refGene table: name and name2.

Your output will look similar to this:

#name        name2
NM_015023    WDTC1

The table for your Known Gene IDs is called knownGene.

Hope this is helpful to you. Feel free to write back to the list if you
need more detailed instruction.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Please feel free to search the Genome mailing list archives by visiting
our home page, clicking on "Contact Us", then typing a word or phrase
into the search box. On that same page
(http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome
mailing list.

zxu wrote:
> hi,
>
> if I have a list of knowngene ID, refseq ID, and some pseudo gene ID,
> how can I get their gene symbol in a batch mode?
>
> for example knowngene NM_015023 ----> gene symbol WDTC1.
>
> thanks
>
> zd
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From ann at soe.ucsc.edu Fri Aug 3 11:41:34 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 03 Aug 2007 11:41:34 -0700
Subject: [Genome] custom coloring a bedGraph track?
In-Reply-To: <46898C25.2050509@gs.washington.edu>
References: <46898C25.2050509@gs.washington.edu>
Message-ID: <46B376DE.5000404@cse.ucsc.edu>

Hello Ginger,

First let me apologize for taking so long to answer your question;
somehow it must have fallen through the cracks and I am sorry. I hope by
now you have found the answer or a suitable work-around, but in case you
haven't, let me answer here.

You can color elements in a bedGraph track based on the score value (but
not, as you have shown below, based on a separate field). What you can do
is choose the color scheme, then ask that the elements be shaded in that
color scheme based on their score (or dataValue).
Perhaps this will be useful to you. You can read about how to enable this coloring in "Example 3" on this page: http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#CustomTracks Please let us know if you have more questions. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Ginger Cheng wrote: > Hello, Browser Gurus, > > Just wondering if there is any quick way to color a bedGraph track based > on a field of the data table. So it would be sth like > > chrom | chromStart | chromEnd | dataValue | color > > | chr1 | 16190712 | 16190763 | -0.548 | red > | chr1 | 16190782 | 16190803 | -0.001 | gray > > > THank you for any help in advance > From rhead at soe.ucsc.edu Fri Aug 3 13:05:32 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Fri, 03 Aug 2007 13:05:32 -0700 Subject: [Genome] Closest Gene In-Reply-To: References: Message-ID: <46B38A8C.8030005@soe.ucsc.edu> Hello Denise, I have two suggestions for you from our developers, one using Perl and one using our public MySQL database. I will describe them both. -- First, the Perl solution: Download the knownGene table, either from the downloads server, here: http://hgdownload.cse.ucsc.edu/downloads.html (go to the appropriate assembly, and then look under the "annotation database" link for the knownGene.txt.gz file), or from the Table Browser. Then, follow this logic: hash chromosomes => lists of [gene positions, gene names] sorted by position for each position from BLAT, do a binary search for that position in the appropriate list. If search ends without an exact match, compute the distance of the BLAT position to the final position from the search and its adjacent positions, and pick the closest one. -- Now, the MySQL solution. Information on connecting to our MySQL server is here: http://genome.ucsc.edu/FAQ/FAQdownloads#download29. 
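As a side note on the Perl-style logic described above (a hash of chromosomes pointing to position-sorted gene lists, then a binary search per BLAT position), here is a minimal sketch of that idea. It is written in Python rather than Perl purely for illustration, and the function names (`build_index`, `distance`, `closest_gene`) are hypothetical, not part of any UCSC tool. It assumes knownGene-style rows reduced to (chrom, txStart, txEnd, name) with 0-based, half-open coordinates, and it only compares the two genes whose starts flank the query, so a long gene that starts far upstream but still overlaps the position can be missed.

```python
import bisect
from collections import defaultdict


def build_index(genes):
    """Group (chrom, tx_start, tx_end, name) rows by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end, name in genes:
        index[chrom].append((start, end, name))
    for entries in index.values():
        entries.sort()
    return index


def distance(pos, start, end):
    """Return 0 if pos lies inside [start, end), else the gap to the nearer edge."""
    if pos < start:
        return start - pos
    if pos >= end:
        return pos - end + 1
    return 0


def closest_gene(index, chrom, pos):
    """Binary-search the sorted starts, then pick the nearer of the two neighbours."""
    entries = index.get(chrom)
    if not entries:
        return None
    starts = [entry[0] for entry in entries]  # rebuilt per call to keep the sketch short
    i = bisect.bisect_right(starts, pos)
    candidates = entries[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda e: distance(pos, e[0], e[1]))


# Sample rows: coordinates and IDs taken from the example output in this thread.
GENES = [
    ("chr1", 690107, 703869, "uc001abo.1"),
    ("chr1", 665195, 665226, "uc001abn.1"),
    ("chr1", 665086, 665147, "uc001abm.1"),
    ("chr1", 752926, 778860, "uc001abp.1"),
]
INDEX = build_index(GENES)

print(closest_gene(INDEX, "chr1", 710000))  # (690107, 703869, 'uc001abo.1')
```

For many BLAT hits, the list of starts would be built once per chromosome rather than on every call; it is recomputed here only to keep the example compact.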
Using the public MySQL server:

#!/bin/sh

# given position chr1:710000-720000
# find a sample of genes near this upstream and downstream
C=chr1
S=710000
E=720000

echo "three upstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
ORDER BY e.txEnd DESC limit 3;' hg18

echo "three downstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit 3;' hg18

Produces results:

three upstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
| chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
| chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
+------+--------+--------+------------+----------+
three downstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
| chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
| chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
+------+--------+--------+------------+----------+

This approach is now also documented on our wiki site, here:

http://genomewiki.ucsc.edu/index.php?title=Finding_nearby_genes

--

I hope this information helps. Please feel free to create a genomewiki
account and edit or add to the wiki page(s). If you have further
questions, please do not hesitate to contact us again.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Denise Mauldin wrote:
> Hello,
>
> I have a list of genome positions that I got from Blat. What's the
> easiest programmatic way to get the closest gene to those positions? I
> currently work in Perl, but could adapt to other suggestions.
> > Thanks, > Denise > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From angelike at caltech.edu Fri Aug 3 13:08:31 2007 From: angelike at caltech.edu (Angela Stathopoulos) Date: Fri, 3 Aug 2007 13:08:31 -0700 Subject: [Genome] question - opening wig Message-ID: ? I am attempting to load this data into the genome browser but am getting an error message: Error File 'Zeitlinger.wig' - track load error: Couldn't find size of chromosome chrYh (note: chrom names are case sensitive) It worked last month. Can you tell me what might be wrong? Angela ++++++++++++++++++++++++++++++ Angelike Stathopoulos Asst. Professor Division of Biology - Caltech email: angelike at caltech.edu phone: 626-395-5855 fax: 626-395-5958 mailing address: Caltech - Broad Center Division of Biology 114-96 1200 E. California Blvd. Pasadena, CA 91125 ++++++++++++++++++++++++++++++ From hiram at soe.ucsc.edu Fri Aug 3 13:35:48 2007 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Fri, 03 Aug 2007 13:35:48 -0700 Subject: [Genome] question - opening wig In-Reply-To: References: Message-ID: <46B391A4.4060905@soe.ucsc.edu> Good Afternoon Angela: Which genome assembly are you on that has a chrYh ? --Hiram Angela Stathopoulos wrote: > I am attempting to load this data into the genome browser but am > getting an error message: > Error File 'Zeitlinger.wig' - track load error: > Couldn't find size of chromosome chrYh (note: chrom names are case > sensitive) > > > It worked last month. Can you tell me what might be wrong? > > Angela > > > ++++++++++++++++++++++++++++++ > Angelike Stathopoulos > Asst. Professor > Division of Biology - Caltech > email: angelike at caltech.edu > phone: 626-395-5855 > fax: 626-395-5958 > mailing address: > Caltech - Broad Center > Division of Biology 114-96 > 1200 E. California Blvd. 
> Pasadena, CA 91125 > ++++++++++++++++++++++++++++++ > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From ann at soe.ucsc.edu Fri Aug 3 13:43:44 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 03 Aug 2007 13:43:44 -0700 Subject: [Genome] question - opening wig In-Reply-To: References: Message-ID: <46B39380.3090700@cse.ucsc.edu> Hello Angela, Our mail program strips out attachments, so I was not able to view your input file. However, let me take a stab at answering anyway. The newest D. melanogaster assembly (dm3) has a chromosome named chrYHet. The previous assembly (dm2) has a chromosome named chrYh. Perhaps you have tried to upload your input file to dm3 by accident? Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list. Angela Stathopoulos wrote: > ? > I am attempting to load this data into the genome browser but am > getting an error message: > Error File 'Zeitlinger.wig' - track load error: > Couldn't find size of chromosome chrYh (note: chrom names are case > sensitive) > > > It worked last month. Can you tell me what might be wrong? > > Angela > > > ++++++++++++++++++++++++++++++ > Angelike Stathopoulos > Asst. Professor > Division of Biology - Caltech > email: angelike at caltech.edu > phone: 626-395-5855 > fax: 626-395-5958 > mailing address: > Caltech - Broad Center > Division of Biology 114-96 > 1200 E. California Blvd. 
> Pasadena, CA 91125 > ++++++++++++++++++++++++++++++ > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From puckj at peds.ucsf.edu Fri Aug 3 13:40:00 2007 From: puckj at peds.ucsf.edu (Jennifer Puck) Date: Fri, 03 Aug 2007 13:40:00 -0700 Subject: [Genome] Affymetrix 500k snp track? Message-ID: I am looking for a track of the Affymetrix 500K SNPs. I see some emails about it being on the test site; is there a way I can use it? Jennifer M. Puck, M.D. Department of Pediatrics, Box 0519 University of California, San Francisco 513 Parnassus Avenue, HSE 301A San Francisco, CA 94143-0519 Email: puckj at peds.ucsf.edu Phone: 415 476-3181 FAX: 415 502-5127 From ann at soe.ucsc.edu Fri Aug 3 13:55:33 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 03 Aug 2007 13:55:33 -0700 Subject: [Genome] Affymetrix 500k snp track? In-Reply-To: References: Message-ID: <46B39645.3030201@cse.ucsc.edu> Hello Jennifer, Those data were released earlier this week and are now available on the public browser. You will find them in the hg18 browser. The track is named "SNP Arrays" -- part of the Variation and Repeats track group. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Jennifer Puck wrote: > I am looking for a track of the Affymetrix 500K SNPs. I see some emails > about it being on the test site; is there a way I can use it? > > Jennifer M. Puck, M.D. 
> Department of Pediatrics, Box 0519 > University of California, San Francisco > 513 Parnassus Avenue, HSE 301A > San Francisco, CA 94143-0519 > > Email: puckj at peds.ucsf.edu > Phone: 415 476-3181 > FAX: 415 502-5127 > > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From angie at soe.ucsc.edu Fri Aug 3 14:13:16 2007 From: angie at soe.ucsc.edu (Angie Hinrichs) Date: Fri, 3 Aug 2007 14:13:16 -0700 (PDT) Subject: [Genome] question - opening wig In-Reply-To: References: Message-ID: Hi Angela, The BDGP Release 4 assembly of D. mel. (UCSC's dm2, Apr. 2004) has chrYh, but the latest Release 5 assembly (dm3, Apr. 2006) has chrYHet. We recently released dm3 on our public site, and since it is newest, it is now the default assembly for D. mel. Since your coords are probably for the dm2 / Apr. 2004 assembly, try selecting Apr. 2004 in the assembly menu, and then load the custom track again. If you have any more questions or if that doesn't work, please let us know at genome at soe.ucsc.edu. Angie On Fri, 3 Aug 2007, Angela Stathopoulos wrote: > I am attempting to load this data into the genome browser but am > getting an error message: > Error File 'Zeitlinger.wig' - track load error: > Couldn't find size of chromosome chrYh (note: chrom names are case > sensitive) > > > It worked last month. Can you tell me what might be wrong? > > Angela > > > ++++++++++++++++++++++++++++++ > Angelike Stathopoulos > Asst. Professor > Division of Biology - Caltech > email: angelike at caltech.edu > phone: 626-395-5855 > fax: 626-395-5958 > mailing address: > Caltech - Broad Center > Division of Biology 114-96 > 1200 E. California Blvd. 
> Pasadena, CA 91125
> ++++++++++++++++++++++++++++++
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

--
angie at soe.ucsc.edu
Software Developer, UCSC CBSE / Genome Bioinformatics Group

From ann at soe.ucsc.edu Fri Aug 3 14:30:11 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 03 Aug 2007 14:30:11 -0700
Subject: [Genome] How to check for homologies in other species ?
In-Reply-To:
References:
Message-ID: <46B39E63.6090701@cse.ucsc.edu>

Hello Daren,

First let me apologize for taking so long to answer your question;
somehow it must have fallen through the cracks and I am sorry. I hope by
now you have found the answer or a suitable work-around, but in case you
haven't, let me answer here.

From the data you have downloaded, the best one to use is the one that
supports the Net track in the Browser. The Net track is useful for
finding orthologous regions and for studying genome rearrangement
between two assemblies. The files will have names like this:

hg18.panTro2.net.gz

hg18.panTro2.net.gz: "net" file that describes rearrangements between
the species and the best Chimp match to any part of the Human genome.
The net format is described in
http://genome.ucsc.edu/goldenPath/help/net.html.

Depending on which assemblies you want to map to, you can, instead, use
the blastTab tables in the hg18 database to find orthologous genes in
other species. The species are limited to: mouse, rat, zebrafish,
D. melanogaster, C. elegans, S. cerevisiae. The tables you will use for
this will be in the hg18 database and will have names similar to:

mmBlastTab (for mouse: mm8)
rnBlastTab (for rat: rn4)

This should get you started. Please feel free to write back if you need
more detailed instructions.
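To illustrate how a downloaded blastTab table might be used once you have it, here is a rough Python sketch that keeps the highest-scoring target for each query. It assumes the table is a tab-separated dump whose columns follow the usual blastTab layout (query, target, identity, aliLength, mismatch, gapOpen, qStart, qEnd, tStart, tEnd, eValue, bitScore); please check that against the table schema before relying on it. The function name `best_hits` and the sample IDs are hypothetical and made up for the example.

```python
import csv

# Assumed column order for a blastTab dump; verify against the table schema.
BLASTTAB_COLUMNS = [
    "query", "target", "identity", "aliLength", "mismatch", "gapOpen",
    "qStart", "qEnd", "tStart", "tEnd", "eValue", "bitScore",
]


def best_hits(lines):
    """Return {query: (target, bitScore)} keeping the top-scoring target per query."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        record = dict(zip(BLASTTAB_COLUMNS, row))
        score = float(record["bitScore"])
        current = best.get(record["query"])
        if current is None or score > current[1]:
            best[record["query"]] = (record["target"], score)
    return best


# Made-up rows for illustration only; real IDs come from the downloaded table.
SAMPLE = [
    "geneA\tmouseX\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-50\t350.0",
    "geneA\tmouseY\t80.0\t180\t36\t1\t1\t180\t5\t184\t1e-20\t150.0",
    "geneB\tmouseZ\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-40\t210.0",
]
HITS = best_hits(SAMPLE)
print(HITS)
```

In practice the lines would come from an opened file (for example a gzip-decompressed table dump) instead of the in-memory sample list.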
Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

daren76 daren76 wrote:
> Hi friends,
>
> I have downloaded the pairwise and multiple alignment informations
> from http://hgdownload.cse.ucsc.edu/goldenPath/hg18. I have a set of
> genes about 100 nts in length from human to check for homologies in
> other species. Hope someone can point me the steps.
>
> Thanks Daren
>
> _________________________________________________________________ Get
> the new Windows Live Messenger!
> http://get.live.com/messenger/overview
>
> _______________________________________________ Genome maillist -
> Genome at soe.ucsc.edu http://www.soe.ucsc.edu/mailman/listinfo/genome

From rhead at soe.ucsc.edu Fri Aug 3 14:52:18 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Fri, 03 Aug 2007 14:52:18 -0700
Subject: [Genome] Closest Gene
In-Reply-To: <46B38A8C.8030005@soe.ucsc.edu>
References: <46B38A8C.8030005@soe.ucsc.edu>
Message-ID: <46B3A392.5050503@soe.ucsc.edu>

Hello again Denise,

Another engineer has pointed out that "upstream" and "downstream" may
have at least three different definitions, based on the strand:

1. gene based -- 5' vs. 3'
2. blat based -- dependent on whether alignment is to top or bottom
   strand
3. raw genomic coordinates -- lower txStart is 'upstream' and higher
   txStart is 'downstream'

If you are not aware of the differences, it can potentially make things
very confusing! This is just a reminder to keep these different
meanings in mind when you deal with these data.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Brooke Rhead wrote:
> Hello Denise,
>
> I have two suggestions for you from our developers, one using Perl and
> one using our public MySQL database. I will describe them both.
>
> --
>
> First, the Perl solution:
>
> Download the knownGene table, either from the downloads server, here:
> http://hgdownload.cse.ucsc.edu/downloads.html (go to the appropriate
> assembly, and then look under the "annotation database" link for the
> knownGene.txt.gz file), or from the Table Browser.
>
> Then, follow this logic:
>
> hash chromosomes => lists of [gene positions, gene names] sorted by
> position
>
> for each position from BLAT, do a binary search for that position in
> the appropriate list. If search ends without an exact match, compute
> the distance of the BLAT position to the final position from the
> search and its adjacent positions, and pick the closest one.
>
> --
>
> Now, the MySQL solution. Information on connecting to our MySQL server
> is here: http://genome.ucsc.edu/FAQ/FAQdownloads#download29.
>
> Using the public MySQL server:
>
> #!/bin/sh
>
> # given position chr1:710000-720000
> # find a sample of genes near this upstream and downstream
> C=chr1
> S=710000
> E=720000
>
> echo "three upstream genes from ${C}:${S}-${E}"
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
> 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
> knownGene e,
> kgXref j
> WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
> ORDER BY e.txEnd DESC limit 3;' hg18
>
> echo "three downstream genes from ${C}:${S}-${E}"
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
> 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
> knownGene e,
> kgXref j
> WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
> ORDER BY e.txStart ASC limit 3;' hg18
>
> Produces results:
> three upstream genes from chr1:710000-720000
> +------+--------+--------+------------+----------+
> | chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
> | chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
> | chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
> +------+--------+--------+------------+----------+
> three downstream genes from chr1:710000-720000
> +------+--------+--------+------------+----------+
> | chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
> | chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
> | chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
> +------+--------+--------+------------+----------+
>
> This approach is now also documented on our wiki site, here:
>
> http://genomewiki.ucsc.edu/index.php?title=Finding_nearby_genes
>
> --
>
> I hope this information helps. Please feel free to create a genomewiki
> account and edit or add to the wiki page(s). If you have further
> questions, please do not hesitate to contact us again.
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
> Denise Mauldin wrote:
>> Hello,
>>
>> I have a list of genome positions that I got from Blat. What's the
>> easiest programmatic way to get the closest gene to those positions? I
>> currently work in Perl, but could adapt to other suggestions.
>>
>> Thanks,
>> Denise
>> _______________________________________________
>> Genome maillist - Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From rhead at soe.ucsc.edu Fri Aug 3 17:37:42 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Fri, 03 Aug 2007 17:37:42 -0700
Subject: [Genome] Question on gene length information retrieval
In-Reply-To: <000b01c7d558$e3bb26c0$1f0c2c0a@gne.windows.gene.com>
References: <000b01c7d558$e3bb26c0$1f0c2c0a@gne.windows.gene.com>
Message-ID: <46B3CA56.4080908@soe.ucsc.edu>

Hi Kangyu,

Thanks for clarifying. By "corresponding track" I just mean an
annotation track that contains the gene set you want to get the
information from. Let me see if I can be more clear.
When you look in the Genome Browser, under the section called "Genes and
Gene Prediction Tracks", there are several different gene annotation
tracks, or gene sets, available. Some of them are "UCSC Genes", "CCDS",
"RefSeq Genes", "MGC Genes", "Ensembl Genes", etc. They are all
slightly different, and may give you different results for each gene.

It sounds like the RefSeq Genes set might be good for your purposes.
The advantage of using that set is that we have it linked to the Entrez
gene ID in our tables. However, if you want more detailed information
about proteins, or if you want to include non-coding genes in this set,
you should consider using the UCSC Genes set instead. I will describe
how to get the information you seek for the RefSeq set, and if you
decide to use a different set, you can do a similar query using it
instead.

Also, once you get the data from the tables, you will need to do some
calculations to get the actual lengths. You can either do this using
your own tools, or you may be interested in another website called
Galaxy: http://main.g2.bx.psu.edu/

Galaxy is a site run by Penn State that works in conjunction with the
UCSC Genome Browser. They have tools that will perform calculations on
columns in a table. If you need help with Galaxy, their helpdesk email
is galaxy-user at bx.psu.edu.

Most of the information you are looking for can be gleaned from two
related tables: refGene and refLink. To get to the tables, go to the
Table Browser (the blue "tables" link at the top of the page), and make
the following selections:

clade: vertebrate
genome: human
assembly: Mar. 2006
group: Genes and gene prediction tracks
track: RefSeq Genes
table: refGene
region: genome
output format: selected fields from primary and related tables

After hitting "get output", select the hg18.refLink table from the list
on the bottom half of the page, so that you can get information from it
in addition to the refGene table. Scroll down to the bottom of the page
and hit the "allow selection from checked tables" button.

Now select the fields in which you are interested from both of these
tables. From the refLink table you will need to select the
'locusLinkId' field. This is the Entrez ID (Entrez used to be called
Locus Link). You can also get the RefSeq protein name from this table
(field 'protAcc'). You will need most of the information from the
refGene table.

Now you can use the values retrieved from the tables to get the lengths
you need. Note that items in our tables have a 0-based start and a
1-based end (see explanation here:
http://genome.ucsc.edu/FAQ/FAQtracks#tracks1).

The length of the gene on the genome (including the introns) will be:
txEnd - txStart.

The number of exons is listed in the field 'exonCount'.

The total length of all the exons in a gene (corresponding to the mRNA
length) will be the sum of (exonEnds - exonStarts) over all of its
exons. Note that this includes both coding and non-coding exons.

PROTEINS:

One way to get the approximate protein length from the data retrieved
above would be to take the total exon length, and then subtract the
non-coding regions from it. The cdsStart and cdsEnd values indicate
where the coding regions are. These values are usually within an exon.

If you would rather get more precise protein information, the method
will depend on which gene set you chose to use.

For the RefSeq Genes set: we do not store more detailed RefSeq protein
information in our tables, just the accession numbers (like the "NP_"
numbers from hg18.refLink.protAcc). So, if you wish to get more data
for these accessions, you will need to go to NCBI for it.

For the UCSC Genes set: the proteins associated with UCSC Genes are from
UniProt. The protein sequences are stored in one of our tables:
uniprot.protein.val (database.table.field). You can link these to the
knownGene table via the hg18.kgXref table.

Sorry to be so long-winded! I hope this information helps.
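
For anyone who would rather script the arithmetic above than use Galaxy,
here is a minimal Python sketch. It is only an illustration, not a UCSC
tool: the function name and the dictionary keys are made up for this
example. It assumes one refGene row has already been retrieved, that
exonStarts and exonEnds are the comma-separated text (with a trailing
comma) that the Table Browser returns, and that the number of codons is
only an approximation of the protein length.

```python
def parse_coords(field):
    """Turn a comma-separated coordinate field such as '1000,1400,' into ints."""
    return [int(x) for x in field.strip(",").split(",") if x]


def gene_lengths(tx_start, tx_end, cds_start, cds_end, exon_starts, exon_ends):
    """Compute the lengths described above for a single refGene row.

    All coordinates are taken as 0-based start, 1-based end, so that
    end - start is the length of an interval.
    """
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)

    # Length of the gene on the genome, introns included.
    genomic_length = tx_end - tx_start

    # Total exon length (coding plus non-coding), i.e. the mRNA length.
    mrna_length = sum(e - s for s, e in zip(starts, ends))

    # Exonic bases that fall inside [cds_start, cds_end): the total exon
    # length minus the non-coding parts.
    coding_bases = sum(
        max(0, min(e, cds_end) - max(s, cds_start))
        for s, e in zip(starts, ends)
    )

    return {
        "genomic_length": genomic_length,
        "exon_count": len(starts),
        "mrna_length": mrna_length,
        "coding_bases": coding_bases,
        # Approximate only: this may count a stop codon, and the genomic
        # alignment can differ slightly from the RefSeq sequence.
        "approx_codons": coding_bases // 3,
    }


if __name__ == "__main__":
    # Made-up coordinates, not a real gene.
    print(gene_lengths(1000, 2000, 1100, 1850,
                       "1000,1400,1800,", "1200,1500,2000,"))
```

The protein accession and the Entrez ID would still come from the
refLink fields mentioned above; this sketch covers only the length
calculations.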
-- Brooke Rhead UCSC Genome Bioinformatics Group Kangyu Zhang wrote: > Hi, Brooke, > > Sorry for my unclearness. I need to retrieve the entire genome. > > The NCBI gene ID I referred to is entrez ID, for example, 324 corresponds > to genename APC, and 10297 is APC2; however, other IDs such as RefSeq genes > are fine too. > > Sorry I don't understand what is the meaning of "corresponding track"? But > I guess the RefSeq Genes shall serve well as the identifier, though the > Entrez ID is better. > > > Kangyu > > > > -----Original Message----- > From: Brooke Rhead [mailto:rhead at soe.ucsc.edu] > Sent: Thursday, August 02, 2007 2:44 PM > To: Kangyu Zhang > Cc: genome at soe.ucsc.edu > Subject: Re: [Genome] Question on gene length information retrieval > > Hi Kangyu, > > Can you clarify a couple of things? Do you have only a single gene to > look at, or a list of several genes? Also, I'm not sure which NCBI gene > ID you are referring to. Can you give an example? Also, do you already > have a corresponding track in mind from which to retrieve the > information (like UCSC Genes, RefSeq Genes, etc.)? > > Thanks. This will help us answer your questions. > > -- > Brooke Rhead > UCSC Genome Bioinformatics Group > > > Kangyu Zhang wrote: >> Hello, >> >> >> >> We are interested in retrieving the human gene length information from > the >> UCSC genome browser. In particular, for an NCBI gene id identifier (refseq >> identifier NP_ or NM_ are good too, but that's not one to one mapping >> because of alternative splicing), retrieve its protein sequence length, > mRNA >> length (total exon length), number of exon, and the total length of this >> gene on the genome (including the intron length). >> >> >> >> Could you help figure out the way to retrieve the above information via >> the UCSC genome browser? >> >> >> >> Thanks, >> >> Kangyu Zhang >> >> >> >> Department of Bioinformatics >> >> Genentech, Inc. 
>> >> 1 DNA Way, MS 93 >> >> South San Francisco, CA 94080 >> >> >> >> _______________________________________________ >> Genome maillist - Genome at soe.ucsc.edu >> http://www.soe.ucsc.edu/mailman/listinfo/genome From vkullu at gmail.com Sun Aug 5 18:26:27 2007 From: vkullu at gmail.com (Vinayak Kulkarni) Date: Sun, 5 Aug 2007 20:26:27 -0500 Subject: [Genome] Refseq START and END sites Message-ID: <6643b6460708051826j1de5f121vabecffb35c916c7e@mail.gmail.com> Hi, I was wondering if using your browser tables I could pull out information of the start site and end site co-ordinates within the refseq transcript ( in terms of positions within the refseq) for all human coding genes. Could you please help me grab that data from your browser tables. Thank you, Vinayak. Translational Research, UT Southwestern Medical Center at Dallas. -- -- "Hesitating to act because the whole vision might not be achieved, or because others do not yet share it, is an attitude that only hinders progress"... Mahatma Gandhi From jje at gate.sinica.edu.tw Mon Aug 6 00:33:37 2007 From: jje at gate.sinica.edu.tw (J.J. Emerson) Date: Mon, 6 Aug 2007 15:33:37 +0800 Subject: [Genome] Syntenic Alignments Message-ID: <49A81C69-F4DF-44FF-AE6E-EDB37FE0F6AE@gate.sinica.edu.tw> Hello, I'm browsing the human genome downloads section (1) and looking at the human/chimp pairwise alignments (2). Specifically, I'm interested in the "syntenic net" file (3). Anyway, when I download that file, I notice a few things. The first is that, despite the '.gz' extension, it isn't in fact g-zipped. It is just plain text. Second, if I grep for "net" in the file, I notice that there are only entries for a few chromosomes, which I have listed below (4). The first question I have is, why do the "random" chromosomes show up as nets in species comparison files but not in the self comparison file (5)? The second question I have is, why is the syntenic net alignment (3) incomplete and unzipped, but tagged as zipped? 
Can this file be properly generated and posted to the downloads section?

Thanks a lot!

Cheers,

J.J. Emerson

References in e-mail:

1. http://hgdownload.cse.ucsc.edu/downloads.html#human

2. http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/

3. http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/hg18.panTro2.syn.net.gz

4. Results of "$ grep 'net' hg18.panTro2.syn.net.gz":
net chr1 247249719
net chr10 135374737
net chr10_random 113275
net chr11 134452384
net chr11_random 215294
net chr12 132349534
net chr13 114142980

5. http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsSelf/hg18.hg18.net.gz

From erezliv at gmail.com Sun Aug 5 02:41:17 2007
From: erezliv at gmail.com (=?UTF-8?B?15DXqNeWINec15nXkdeg15Q=?=)
Date: Sun, 5 Aug 2007 12:41:17 +0300
Subject: [Genome] Hello UCSC Genome browser staff
Message-ID: <7f010a680708050241s7977e0efw7c799f737cbc9446@mail.gmail.com>

Hello UCSC Genome browser staff,

I am seeking for specific tables in the annotation database (Human &
Mouse) and am a bit overwhelmed by the size of your database..

Would you be so kind to tell me which tables do I need in order to get
(by genomic coordinates):
- The nearest gene (accession, name,..)
- The nearest repeat
- The nearest CpG islands
- chromosome stain-band
- (if inside a gene) The exon number, The GO annotations

Also, are these tables given per-chromosome or for the entire genome?

Thank you in advance,
Erez Livneh

From j.d.ronde at nki.nl Mon Aug 6 06:39:09 2007
From: j.d.ronde at nki.nl (JJ de Ronde)
Date: Mon, 06 Aug 2007 15:39:09 +0200
Subject: [Genome] Questions about the conservation tracks
Message-ID: <1186407549.6315.57.camel@Beast>

Dear reader,

We are using the conservation scores and most-conserved tracks you
supply in the genome browser for various purposes and are now trying to
get a better understanding of what we are using exactly. To this
purpose we would also like to replicate the results we get in the
genome browser using the phastcons software.
Adam Siepel redirected us to you regarding the following and we would be grateful if you could answer a couple of questions for us about the (parameter) settings that you are using. - First of all: which parameter settings are you using for the 17-way and 28-way conservation tracks (smoothness and coverage constraints, rho, branch lengths, background frequencies, phylogenetic tree, substitution matrix)? Are the background frequencies, rho scaling factor, branch lengths and substitution matrix estimated as described in Siepel's paper ('Evolutionary conserved elements in vertebrate, insect, worm and yeast genomes', Genome Research 2005)? How do you derive the smoothness and coverage constraints? If, for example, you would add an 18th species to the 17-way conservation, would you re-tune all the settings? - As I understand it, the 'most conserved'-track displays the viterbi path conserved states. This would usually correspond to a high (posterior) conservation score. I've attached an image displaying some results we got with the genome browser. Between the 2nd and 3rd conserved elements there seems to be a high conservation score (top graph) but no conserved element. When would a case like this arise? Also: what do the little green graphs represent, are those the pair-wise phastcon scores (reference vs species)? The top (blue) graph is a representation of the posterior probabilities (=conservation score => P(state(i) = j) | data X, model parameters), where state(i) = j represents the fact that the state in the path at position i = j), right? Thanks a lot for your time, Jorma de Ronde p.s. 
The image comes from the following URL:
http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg18&hgsid=95943147

--
Jorma de Ronde                 E-mail: j.d.ronde at nki.nl
Bioinformatician               Tel: +31 (0)20 512 6285
Central Microarray Facility    Website: http://microarrays.nki.nl
Molecular Biology G2/H2
Netherlands Cancer Institute
Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands

From pfreire at mdanderson.org Mon Aug 6 09:33:21 2007
From: pfreire at mdanderson.org (pfreire at mdanderson.org)
Date: Mon, 6 Aug 2007 11:33:21 -0500
Subject: [Genome] UCSC idiogram retrieving
Message-ID:

Dear sirs,

I think there is a small problem for retrieving the raw tables for the
chromosome idiogram. When I selected the parameter 'region' as the whole
human genome, the information for the chromosome X was absent. However,
when I selected only the chromosome X, no problem has occurred.

Thanks.

Pablo Freire.

From kayla at soe.ucsc.edu Mon Aug 6 10:23:45 2007
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Mon, 06 Aug 2007 10:23:45 -0700
Subject: [Genome] Refseq START and END sites
In-Reply-To: <6643b6460708051826j1de5f121vabecffb35c916c7e@mail.gmail.com>
References: <6643b6460708051826j1de5f121vabecffb35c916c7e@mail.gmail.com>
Message-ID: <46B75921.1080903@cse.ucsc.edu>

Hello Vinayak,

The Table Browser ("Tables" on the blue bar on the top of the page) can
be used to retrieve this information from the refGene table. Use the
following options:

clade: Vertebrate
genome: Human
assembly: Mar. 2006
group: Genes and Gene Prediction Tracks
track: RefSeq Genes
table: refGene
region: select "genome"
output format: "selected fields from primary and related tables"
click "get output"

Now you can choose which fields from the refGene table you'd like to
retrieve. Then select "get output" again.

I hope this is helpful to you. Please don't hesitate to contact the
mailing list again if you require further assistance.
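
The refGene fields are genomic coordinates. If positions along the
spliced transcript are what is needed, a rough Python sketch of the
conversion could look like the following. This is an assumption-laden
illustration rather than a UCSC utility: the function names are made
up, exonStarts and exonEnds are assumed to be the comma-separated text
from refGene, and the result counts aligned exonic bases only, so it
can differ from positions in the RefSeq record itself when the mRNA
does not align to the genome exactly.

```python
def parse_coords(field):
    """Turn a comma-separated coordinate field such as '100,300,' into ints."""
    return [int(x) for x in field.strip(",").split(",") if x]


def exonic_bases_before(pos, starts, ends):
    """Count exonic bases that lie before genomic position pos."""
    return sum(max(0, min(pos, e) - s) for s, e in zip(starts, ends))


def cds_in_transcript(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Map genomic cdsStart/cdsEnd onto the spliced transcript.

    Returns (first, last) as a 0-based start and 1-based end, counted
    from the 5' end of the transcript.
    """
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    total = sum(e - s for s, e in zip(starts, ends))

    first = exonic_bases_before(cds_start, starts, ends)
    last = exonic_bases_before(cds_end, starts, ends)

    if strand == "-":
        # On the minus strand the transcript runs from txEnd down to
        # txStart, so the interval is mirrored.
        first, last = total - last, total - first
    return first, last


if __name__ == "__main__":
    # Made-up coordinates, not a real gene.
    print(cds_in_transcript("+", 150, 520, "100,300,500,", "200,400,600,"))
    print(cds_in_transcript("-", 150, 520, "100,300,500,", "200,400,600,"))
```

For a non-coding entry, where cdsStart equals cdsEnd, this sketch
returns an empty interval.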
Kayla Smith UCSC Genome Bioinformatics Group Vinayak Kulkarni wrote: > Hi, > I was wondering if using your browser tables I could pull out information of > the start site and end site co-ordinates within the refseq transcript ( in > terms of positions within the refseq) for all human coding genes. > Could you please help me grab that data from your browser tables. > Thank you, > Vinayak. > Translational Research, > UT Southwestern Medical Center at Dallas. From kayla at soe.ucsc.edu Mon Aug 6 11:27:17 2007 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Mon, 06 Aug 2007 11:27:17 -0700 Subject: [Genome] UCSC idiogram retrieving In-Reply-To: References: Message-ID: <46B76805.8040200@cse.ucsc.edu> Pablo, I used the Table Browser to retrieve all columns from the hg18.cytoBandIdeo table, and I do get chrX data. Notice that the chrX data is between the chr9 and chr10 data. Also, the table cytoBandIdeo.txt.gz has chrX data at: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ It's possible that you have a filter set up or that your cart needs to be reset. Try resetting your cart by going to this url: http://genome.ucsc.edu/cgi-bin/cartReset Then try retrieving your data again. I hope this information is helpful to you. Please don't hesitate to contact us again if you require further assistance. Kayla Smith UCSC Genome Bioinformatics Group pfreire at mdanderson.org wrote: > Dear sirs, > > > I think there is a small problem for retrieving the raw tables for the > chromosome idiogram. When I selected the parameter 'region' as the whole > human genome, the information for the chromosome X was absent. However, > when I selected only the chromosome X, no problem has occured. > > Thanks. > > Pablo Freire. 
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From kate at soe.ucsc.edu Mon Aug 6 11:52:41 2007
From: kate at soe.ucsc.edu (Kate Rosenbloom)
Date: Mon, 06 Aug 2007 11:52:41 -0700
Subject: [Genome] Questions about the conservation tracks
In-Reply-To: <1186407549.6315.57.camel@Beast>
References: <1186407549.6315.57.camel@Beast>
Message-ID: <46B76DF9.4060705@cse.ucsc.edu>

Hello Jorma,

Regarding the multiple alignment display -- the green graphs do
represent pairwise alignment quality, but are not phastCons scores.
They are computed on-the-fly from pairwise alignments extracted from
the multiple alignment, using the multiz scoring algorithm.

The parameter settings for the 28way are all available from the track
description and downloads area. Specifically:

The model file, which contains the background frequencies, substitution
matrix, phylogenetic tree and branch lengths, is here:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/phastCons28way/28way.mod

The other parameters (listed in the track description) are:

expected-length=45, target-coverage=.3, rho=.31

These are not the same parameters used in the Genome Research paper --
we generate new parameters with each alignment to approximate 5%
coverage of conserved elements over the human genome, with 70% coverage
of coding sequence (as reported by our 'featureBits' utility).

For the 17way, the parameters were:

expected-length=14, target-coverage=.008, rho=.28

(this will be added to the track description), and the model file (same
as that used for ENCODE alignments):

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/phastCons17way/elliotsEncode.mod

Our methods for generating phastCons parameters for 'deep' alignments
are based on recommendations in the "Working with Large Data Sets"
section of Adam Siepel's phastCons 'HOWTO':

http://compgen.bscb.cornell.edu/~acs/phastCons-HOWTO.html

We begin with the last set used (e.g. for the 28way, I started with the
17way parameters). The resulting coverage is assessed, and the
target-coverage is adjusted to meet the criteria above. The number of
conserved elements and smoothness of the conservation graph are then
used to adjust the expected-length parameter.

Regarding the region you asked about
(hg18.chr9:125,250,933-125,256,5300) -- it appears that the
conservation peak between the second and third elements may not be
above threshold for selection of conserved elements -- I suggest you
contact Adam for a more authoritative interpretation.

Hope this helps!

Kate

---
Kate Rosenbloom
UCSC Genome Bioinformatics

JJ de Ronde wrote:
> Dear reader,
>
> We are using the conservation scores and most-conserved tracks you
> supply in the genome browser for various purposes and are now trying to
> get a better understanding of what we are using exactly. To this purpose
> we would also like to replicate the results we get in the genome browser
> using the phastcons software. Adam Siepel redirected us to you regarding
> the following and we would be grateful if you could answer a couple of
> questions for us about the (parameter) settings that you are using.
>
> - First of all: which parameter settings are you using for the 17-way
> and 28-way conservation tracks (smoothness and coverage constraints,
> rho, branch lengths, background frequencies, phylogenetic tree,
> substitution matrix)?
> Are the background frequencies, rho scaling factor, branch lengths and
> substitution matrix estimated as described in Siepel's paper
> ('Evolutionary conserved elements in vertebrate, insect, worm and yeast
> genomes', Genome Research 2005)?
> How do you derive the smoothness and coverage constraints?
> If, for example, you would add an 18th species to the 17-way
> conservation, would you re-tune all the settings?
>
> - As I understand it, the 'most conserved'-track displays the viterbi
> path conserved states.
This would usually correspond to a high
> (posterior) conservation score. I've attached an image displaying some
> results we got with the genome browser. Between the 2nd and 3rd
> conserved elements there seems to be a high conservation score (top
> graph) but no conserved element. When would a case like this arise?
> Also: what do the little green graphs represent, are those the pair-wise
> phastcon scores (reference vs species)?
> The top (blue) graph is a representation of the posterior probabilities
> (=conservation score => P(state(i) = j) | data X, model parameters),
> where state(i) = j represents the fact that the state in the path at
> position i = j), right?
>
> Thanks a lot for your time,
> Jorma de Ronde
>
> p.s. The image comes from the following URL:
> http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg18&hgsid=95943147
>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From kayla at soe.ucsc.edu Mon Aug 6 15:18:42 2007
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Mon, 06 Aug 2007 15:18:42 -0700
Subject: [Genome] Syntenic Alignments
In-Reply-To: <49A81C69-F4DF-44FF-AE6E-EDB37FE0F6AE@gate.sinica.edu.tw>
References: <49A81C69-F4DF-44FF-AE6E-EDB37FE0F6AE@gate.sinica.edu.tw>
Message-ID: <46B79E42.8050507@cse.ucsc.edu>

Dear J. J.,

We don't consistently include the _random chromosomes when making self
chains. We did for hg17 but not for hg18, because they did not add
meaningful information and made the resulting data too large.

Thank you for pointing out the error in the Human/Chimp Net file. This
file has been regenerated:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/hg18.panTro2.syn.net.gz

I hope this is helpful to you. Please don't hesitate to contact us
again if you require further assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

J.J. Emerson wrote:
> Hello,
>
> I'm browsing the human genome downloads section (1) and looking at
> the human/chimp pairwise alignments (2). Specifically, I'm
> interested in the "syntenic net" file (3). Anyway, when I download
> that file, I notice a few things. The first is that, despite the
> '.gz' extension, it isn't in fact g-zipped. It is just plain text.
> Second, if I grep for "net" in the file, I notice that there are only
> entries for a few chromosomes, which I have listed below (4).
>
> The first question I have is, why do the "random" chromosomes show up
> as nets in species comparison files but not in the self comparison
> file (5)? The second question I have is, why is the syntenic net
> alignment (3) incomplete and unzipped, but tagged as zipped? Can
> this file be properly generated and posted to the downloads section?
>
> Thanks a lot!
>
> Cheers,
>
> J.J. Emerson
>
> References in e-mail:
>
> 1. http://hgdownload.cse.ucsc.edu/downloads.html#human
>
> 2. http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/
>
> 3. http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/
> hg18.panTro2.syn.net.gz
>
> 4. Results of "$ grep 'net' hg18.panTro2.syn.net.gz":
> net chr1 247249719
> net chr10 135374737
> net chr10_random 113275
> net chr11 134452384
> net chr11_random 215294
> net chr12 132349534
> net chr13 114142980
>
> 5. http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsSelf/
> hg18.hg18.net.gz
>
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From kayla at soe.ucsc.edu Mon Aug 6 16:06:11 2007
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Mon, 06 Aug 2007 16:06:11 -0700
Subject: [Genome] Hello UCSC Genome browser staff
In-Reply-To: <7f010a680708050241s7977e0efw7c799f737cbc9446@mail.gmail.com>
References: <7f010a680708050241s7977e0efw7c799f737cbc9446@mail.gmail.com>
Message-ID: <46B7A963.7050405@cse.ucsc.edu>

Hello Erez,

Tables can be downloaded here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/

Some tables you may be interested in are:

knownGene - this has UCSC gene names and coordinates, or
refGene - this has RefSeq accession names and coordinates.
Both have exon counts.
rmsk or simpleRepeat - have repeat information (the rmsk table is per
chromosome)
cpgIslandExt - has CpG island data
cytoBandIdeo - describes the positions of cytogenetic bands within a
chromosome

GO annotations are found in a separate database, the tables for which
can be downloaded from here:
http://hgdownload.cse.ucsc.edu/goldenPath/go/database/

Similar tables can be found in the mouse mm8 downloads section.

I hope this information is helpful to you. Please don't hesitate to
contact us again if you require further assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

> Hello UCSC Genome browser staff,
>
> I am seeking for specific tables in the annotation database (Human & Mouse)
> and am a bit overwhelmed by the size of your database..
>
> Would you be so kind to tell me which tables do I need in order to get (by
> genomic coordinates):
> - The nearest gene (accession, name,..)
> - The nearest repeat
> - The nearest CpG islands
> - chromosome stain-band
> - (if inside a gene) The exon number, The GO annotations
>
> Also, are these tables given per-chromosome or for the entire genome?
>
> Thank you in advance,
> Erez Livneh
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

From zhouyubio at gmail.com Tue Aug 7 06:00:46 2007
From: zhouyubio at gmail.com (Yu Zhou)
Date: Tue, 7 Aug 2007 21:00:46 +0800
Subject: [Genome] Question about no hits by using BLAT
Message-ID: <613ffb490708070600s7ede138eo35354cbb130aea90@mail.gmail.com>

Hello,

We have hundreds of sequences from sequencing PCR products in
experiments. We expect that they should be segments of the human genome
and could be mapped to the genome (with or without mismatches).
However, using web BLAT, around 100 of them don't have any hit. It is
very disappointing. I have pasted some of them below.

Can you get hits?
Could you tell me some reasons in those cases? (I don't know if errors
in PCR and sequencing could cause so many mutations that the sequences
couldn't be mapped to their original positions in the genome.)
How should I use BLAT, or are there other methods, to do better
mapping? (BLAT is so quick; are there slower programs with better
sensitivity?)

Thanks a lot!

>1
TCAGCGTTATCACCATTTGCTGGTCGATGGCTTTAGTTTCCCGGCCATTACCCGCCAGAT
>4
TACTAGTATCAGCGTTATCACCATTTGCTGGTCGATGGCTTTAGTTTCCCGGCCATTACCCGCCAGAT
>11
TCGATGGCTTTAGTTTCCCGGCCATTACCCGCCAGAT
>18
CGGTTCTTCCAGCAGGCTTTCGCTTAAGTCAGCGTTAC
>22
TACGCCATCACCAAAGATATCTT
>38
CGCACTGCGTCGCGAACCACGGCTGCGCTTTACGCTAACAGAAGTGAATGATCTACCGGTCCGGCAAA

--
Best regards,
Yu Zhou

From chrism at sanbi.ac.za Tue Aug 7 03:06:44 2007
From: chrism at sanbi.ac.za (chrism at sanbi.ac.za)
Date: Tue, 7 Aug 2007 12:06:44 +0200 (SAST)
Subject: [Genome] Refseq bug
Message-ID: <2366.192.168.2.109.1186481204.squirrel@webmail.sanbi.ac.za>

Hi -

I am getting the following message:

"Note: some of the identifiers (e.g. NM_138996) have no match in table
refGene, field name or in alias table refLink, field name.
Try the "describe table schema" button for more information about the table and field." when I paste a list of Refseq gene names to the 'Tables' section (clade:vertebrate; genome:human; assembly:March 2006; group:Genes and Gene Prediction Track; track:Refseq Genes; table:refGene). I find this odd since I can easily search for NM_138996 and because I am using this Refseq id because I downloaded it from an earlier analysis under the tables page of the browser. So why now does it not find this particular Refseq transcript? Cheers, Chris Computational Postdoc South African National Bioinformatics Institute (SANBI) University of Western Cape From moushengxu at gmail.com Tue Aug 7 07:41:52 2007 From: moushengxu at gmail.com (mousheng xu) Date: Tue, 7 Aug 2007 10:41:52 -0400 Subject: [Genome] Entrez Genes? Message-ID: <5b9ceee40708070741j41ca7955r18b02dd64ef018c8@mail.gmail.com> Dear Help Desk, Is there a table containing the Entrez Gene IDs at your site? If yes, it's not obvious to me. Your prompt response will be highly appreciated. Thanks a lot! Sincerely, Mousheng Xu Research Fellow Harvard Medical School From archanat at soe.ucsc.edu Tue Aug 7 09:40:50 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Tue, 07 Aug 2007 09:40:50 -0700 Subject: [Genome] Refseq bug In-Reply-To: <2366.192.168.2.109.1186481204.squirrel@webmail.sanbi.ac.za> References: <2366.192.168.2.109.1186481204.squirrel@webmail.sanbi.ac.za> Message-ID: <46B8A092.3020309@soe.ucsc.edu> Hello Chris, Here is the response from one of our developers to your question: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi shows that NM_138996.1 was permanently suppressed because currently there is not sufficient data to support this transcript. We just got a new refseq release, which is when we detect deleted entries. I hope this information is helpful to you. Please don't hesitate to contact us again if you require further assistance. 
Regards, Archana UCSC Genome Bioinformatics Group chrism at sanbi.ac.za wrote: > Hi - > > I am getting the following message: > > "Note: some of the identifiers (e.g. NM_138996) have no match in table > refGene, field name or in alias table refLink, field name. Try the > "describe table schema" button for more information about the table and > field." > > when I paste a list of Refseq gene names to the 'Tables' section > (clade:vertebrate; genome:human; assembly:March 2006; group:Genes and Gene > Prediction Track; track:Refseq Genes; table:refGene). I find this odd > since I can easily search for NM_138996 and because I am using this Refseq > id because I downloaded it from an earlier analysis under the tables page > of the browser. So why now does it not find this particular Refseq > transcript? > > Cheers, > Chris > > Computational Postdoc > South African National Bioinformatics Institute (SANBI) > University of Western Cape > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From archanat at soe.ucsc.edu Tue Aug 7 10:16:10 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Tue, 07 Aug 2007 10:16:10 -0700 Subject: [Genome] Entrez Genes? In-Reply-To: <5b9ceee40708070741j41ca7955r18b02dd64ef018c8@mail.gmail.com> References: <5b9ceee40708070741j41ca7955r18b02dd64ef018c8@mail.gmail.com> Message-ID: <46B8A8DA.10802@soe.ucsc.edu> Hello Mousheng Xu, The 'value' field in the table 'knownToLocusLink' and the 'locusLinkId' field in the table 'refLink' contain the Entrez Gene IDs. One thing to note is that Entrez Gene IDs used to be called LocusLink IDs. Our tables still use this old LocusLink nomenclature, even though they contain Entrez Gene IDs. I hope this information is helpful to you. Please don't hesitate to contact us again if you require further assistance.
Regards, Archana UCSC Genome Bioinformatics Group mousheng xu wrote: > Dear Help Desk, > > Is there a table containing the Entrez Gene IDs at your site? If yes, it's > not obvious to me. > > Your prompt response will be highly appreciated. > > Thanks a lot! > > Sincerely, > > Mousheng Xu > Research Fellow > Harvard Medical School > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From ikhrebtukova at illumina.com Tue Aug 7 11:33:04 2007 From: ikhrebtukova at illumina.com (Khrebtukova, Irina) Date: Tue, 7 Aug 2007 11:33:04 -0700 Subject: [Genome] non-redundant set of refseq or knownGenes exons In-Reply-To: <4668CB4F.80204@soe.ucsc.edu> References: <4668CB4F.80204@soe.ucsc.edu> Message-ID: Is there an easy way to get non-redundant set of all refseq and/or knownGenes exons? I mean if the same exon is shared by two or more transcripts it is redundant. I think this or similar question was already asked. Just can not find the answer. thanks! Irina Khrebtukova, PhD Sr. Staff Bioinformatics Scientist Illumina Inc. 25861 Industrial Blvd., Hayward, CA 94545 ph: 510-723-9219 ikhrebtukova at illumina.com From kayla at soe.ucsc.edu Tue Aug 7 12:18:58 2007 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Tue, 07 Aug 2007 12:18:58 -0700 Subject: [Genome] question on chain data format In-Reply-To: <46B27044.5070906@u.washington.edu> References: <46B27044.5070906@u.washington.edu> Message-ID: <46B8C5A2.2090504@cse.ucsc.edu> Hello Zhaoshi, 1. The chain files don't contain sequence data. Here is where you can download sequence data: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips You might also find the axt downloads useful (but note that these are for the nets and so will not include all chains): http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsSelf/axtNet/ 2. The dt/dq distances mean how much of your sequence is skipped before the next aligning block. Not everything aligns. 
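To make that concrete, here is a minimal Python sketch of how the block fields add up. It is only an illustration: the helper name chain_extents is made up and is not part of any UCSC tool, and it assumes each alignment block is supplied as (size, dt, dq), with zeros for the last block, which has no trailing gap. The block values are the ones from the example chain in this message.

```python
def chain_extents(blocks):
    """Return (reference_span, query_span) covered by a chain.

    `blocks` is a list of (size, dt, dq) tuples. The final block in a
    chain has no trailing gap, so pass dt = dq = 0 for it.
    """
    # Every block contributes its aligned size plus the unaligned bases
    # skipped before the next block on that sequence.
    t_span = sum(size + dt for size, dt, _ in blocks)
    q_span = sum(size + dq for size, _, dq in blocks)
    return t_span, q_span


blocks = [(16, 0, 2), (60, 4, 0), (10, 0, 4), (70, 0, 0)]
t_span, q_span = chain_extents(blocks)
print(t_span, q_span)  # 160 162
```

The two totals should equal tEnd-tStart and qEnd-qStart from the chain header line, which is a quick sanity check when parsing chain files.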
Here is an example (from the chain format help page):

chain 4900 chrY 58368225 + 25985406 25985566 chr5 151006098 - 43549808 43549970 2
 16   0   2
 60   4   0
 10   0   4
 70
size  dt  dq

This shows 4 ungapped alignment blocks, sizes 16, 60, 10, and 70, with one 4bp gap in the reference (between the 2nd and 3rd blocks). So the total extent of the chain is 16+60+4+10+70 (160bp) in the reference, which agrees with the tEnd-tStart from the header line (25985566-25985406). I hope this information is helpful to you. Please don't hesitate to contact us again if you require further assistance. Kayla Smith UCSC Genome Bioinformatics Group zhaoshi wrote: > Hi-- > > I was trying to get the pairwise sequence identity information from the > humans self chain data. > I download the chain files and read the > http://genome.ucsc.edu/goldenPath/help/chain.html and > have some questions: > > 1) it seems that chain file does not contain sequence identity > information, is there any other data that contain this > information or I need compute by my own base on these chain data? > > 2) I read the chain format explanation, but I do not quite understand > what does 'dt, dq' mean? > It states in the explanation like: > dt -- the difference between the end of this block and the beginning of > the next block (reference sequence) > dq -- the difference between the end of this block and the beginning of > the next block (query sequence) > what does this really mean? difference means what ? mismatch ? > > Thanks for your help.
> > Zhaoshi > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From george.zogopoulos at utoronto.ca Tue Aug 7 12:07:17 2007 From: george.zogopoulos at utoronto.ca (george.zogopoulos at utoronto.ca) Date: Tue, 07 Aug 2007 15:07:17 -0400 Subject: [Genome] UCSC structural variation submission Message-ID: <20070807150717.tm8dlh1odsdckwk0@webmail.utoronto.ca> Hi, We have recently published a paper describing copy number variation in a large (>1000 individuals) North American population. The size of our study allowed us to estimate the frequencies of these genomic changes in our control population. In addition, we have identified 183 previously unknown copy number changes. The citation is listed below. Would it be possible to add our findings to the UCSC structural variation track? It is the first population based study of more than 1000 individuals. Please let me know if you need additional information. Thank you, George Zogopoulos MD PhD Mount sinai Hospital Toronto, Canada 1: Hum Genet. 2007 Jul 19; [Epub ahead of print] Links Germ-line DNA copy number variation frequencies in a large North American population.Zogopoulos G, Ha KC, Naqib F, Moore S, Kim H, Montpetit A, Robidoux F, Laflamme P, Cotterchio M, Greenwood C, Scherer SW, Zanke B, Hudson TJ, Bader GD, Gallinger S. Sam Minuk Cancer Genetics and Biomarker Laboratories, Fred Litwin Centre for Cancer Genetics, Samuel Lunenfeld Research Institute, Toronto, Canada. Genomic copy number variation (CNV) is a recently identified form of global genetic variation in the human genome. The Affymetrix GeneChip 100 and 500 K SNP genotyping platforms were used to perform a large-scale population-based study of CNV frequency. We constructed a genomic map of 578 CNV regions, covering approximately 220 Mb (7.3%) of the human genome, identifying 183 previously unknown intervals. 
Copy number changes were observed to occur infrequently (<1%) in the majority (>93%) of these genomic regions, but encompass hundreds of genes and disease loci. This North American population-based map will be a useful resource for future genetic studies. From archanat at soe.ucsc.edu Tue Aug 7 14:09:52 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Tue, 07 Aug 2007 14:09:52 -0700 Subject: [Genome] Question about no hits by using BLAT In-Reply-To: <613ffb490708070600s7ede138eo35354cbb130aea90@mail.gmail.com> References: <613ffb490708070600s7ede138eo35354cbb130aea90@mail.gmail.com> Message-ID: <46B8DFA0.3000904@soe.ucsc.edu> Hello Yu, I blatted the below sequences to the Human genome and I couldn't get any hits either. Here are the possible explanations from one of our developers: BLAT won't find a hit if the sequences are not really from the genome, or if they are from rna-space (i.e. could contain introns that would split them into even tinier regions which would be insufficient to align). Another explanation is that the sequences are in repeat regions, in which case web-based BLAT won't necessarily find a hit (because we throw out the BLAT sequence tiles that are over-represented). Also, one of our developers blasted the first sequence below using this online free blast tool: http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 and it found nothing anywhere. I hope this information helps. If you have further questions, please don't hesitate to contact us. Regards, Archana UCSC Genome Bioinformatics Group Yu Zhou wrote: > Hello, > > We have hundreds of sequences from sequencing PCR products in experiments. > We expect that they should be segments of the human genome and could be > mapped to the genome (with or without mismatches). However, by using web > BLAT, around 100 of them don't have any hit. It is very disappointing. > > I pasted some of them as below. Whether you can get hits?
Could you tell me > some reasons in those cases? ( I don't know if errors in PCR and sequencing > could cause so many mutations that the sequences couldn't be mapped to their > orignal positions in genome.) How to use BLAT or are there other methods, to > do better mapping? (BLAT is so quick, whether exist slower programs with > better sensitivity?) Thanks a lot! >
> >1
> TCAGCGTTATCACCATTTGCTGGTCGATGGCTTTAGTTTCCCGGCCATTACCCGCCAGAT
> >4
> TACTAGTATCAGCGTTATCACCATTTGCTGGTCGATGGCTTTAGTTTCCCGGCCATTACCCGCCAGAT
> >11
> TCGATGGCTTTAGTTTCCCGGCCATTACCCGCCAGAT
> >18
> CGGTTCTTCCAGCAGGCTTTCGCTTAAGTCAGCGTTAC
> >22
> TACGCCATCACCAAAGATATCTT
> >38
> CGCACTGCGTCGCGAACCACGGCTGCGCTTTACGCTAACAGAAGTGAATGATCTACCGGTCCGGCAAA
>
From archanat at soe.ucsc.edu Tue Aug 7 16:15:54 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Tue, 07 Aug 2007 16:15:54 -0700 Subject: [Genome] non-redundant set of refseq or knownGenes exons In-Reply-To: References: <4668CB4F.80204@soe.ucsc.edu> Message-ID: <46B8FD2A.2040909@soe.ucsc.edu> Hello Irina, One of my colleagues pointed out to me that you could get this information by first making a Custom Track using the Table Browser of all the KnownGene exons. You could then output your custom track in a BED file and then do some PERL programming to get the unique exons.

1. Creating a custom track of exons. Set the following options in the Table Browser:

clade: Vertebrate
genome: Human
assembly: Mar 2006
group: Genes and Gene Prediction Tracks
track: UCSC Genes
table: knownGene
region: genome
output format: custom track

Click "get output". On the next page, select the radio button for "Coding Exons" and press "get custom track in table browser". You now have a custom track of the exons of the Known Genes.

2. Back on the Table Browser main page, now choose "Custom Tracks" as the group and the corresponding track and table. Then choose 'output format: BED' and hit 'get output'.
On this page select the radio button for 'Create one BED record per: Whole Gene" and press "get BED". This gives you the coordinates of all the coding exons in the BED format. More information on using the Table Browser is here: http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html I hope that this helps you. Please let us know if you have further questions. Regards, Archana UCSC Genome Bioinformatics Group Khrebtukova, Irina wrote: > Is there an easy way to get non-redundant set of all refseq and/or > knownGenes exons? I mean if the same exon is shared by two or more > transcripts it is redundant. > > I think this or similar question was already asked. Just can not find > the answer. > > thanks! > > Irina Khrebtukova, PhD > Sr. Staff Bioinformatics Scientist > Illumina Inc. > 25861 Industrial Blvd., > Hayward, CA 94545 > ph: 510-723-9219 > ikhrebtukova at illumina.com > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From ikhrebtukova at illumina.com Tue Aug 7 16:44:10 2007 From: ikhrebtukova at illumina.com (Khrebtukova, Irina) Date: Tue, 7 Aug 2007 16:44:10 -0700 Subject: [Genome] non-redundant set of refseq or knownGenes exons In-Reply-To: <46B8FD2A.2040909@soe.ucsc.edu> References: <4668CB4F.80204@soe.ucsc.edu> <46B8FD2A.2040909@soe.ucsc.edu> Message-ID: Hi Archana, my problem is that I do know now how to make a custom track and how to get a sequence from custom track. I was wondering (and I guess it's now a FAQ for you guys) how to get NON_REDUNDANT set of exons? Preferably of course dealing with overlapping exons too. Like selecting largest exon of all overlapping. ok, I guess it's still not that easy solution. But my prediction would be that more and more people would ask this question... thanks! and of course I just LOVE your browser! really it's the best despite of my minor desires of making even better. 
Irina -----Original Message----- From: Archana Thakkapallayil [mailto:archanat at soe.ucsc.edu] Sent: Tuesday, August 07, 2007 4:16 PM To: Khrebtukova, Irina Cc: genome at soe.ucsc.edu Subject: Re: [Genome] non-redundant set of refseq or knownGenes exons Hello Irina, One of my colleagues pointed out to me that you could get this information by first making a Custom Track using the Table Browser of all the KnownGene exons. You could then output your custom track in a BED file and then do some PERL programming to get the unique exons. 1. Creating a custom track of exons. Set the following options in the Table Browser: clade: Vertebrate genome: Human assembly: Mar 2006 group: Genes and Gene Prediction Tracks track: UCSC Genes table: knownGene region: genome output format: custom track Click "get output". On the next page, select the radio button for "Coding Exons" and press "get custom track in table browser". You now have a custom track of the exons of the Known Genes. 2. Back on the Table Browser main page, now choose "Custom Tracks" as the group and the corresponding track and table. Then choose 'output format: BED' and hit 'get output'. On this page select the radio button for 'Create one BED record per: Whole Gene" and press "get BED". This gives you the coordinates of all the coding exons in the BED format. More information on using the Table Browser is here: http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html I hope that this helps you. Please let us know if you have further questions. Regards, Archana UCSC Genome Bioinformatics Group Khrebtukova, Irina wrote: > Is there an easy way to get non-redundant set of all refseq and/or > knownGenes exons? I mean if the same exon is shared by two or more > transcripts it is redundant. > > I think this or similar question was already asked. Just can not find > the answer. > > thanks! > > Irina Khrebtukova, PhD > Sr. Staff Bioinformatics Scientist > Illumina Inc. 
> 25861 Industrial Blvd., > Hayward, CA 94545 > ph: 510-723-9219 > ikhrebtukova at illumina.com > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From hiram at soe.ucsc.edu Tue Aug 7 21:15:29 2007 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Tue, 7 Aug 2007 21:15:29 -0700 Subject: [Genome] non-redundant set of refseq or knownGenes exons Message-ID: <200708080415.VAA20900@sundance.cse.ucsc.edu> Good evening Irina: If I understand your query correctly, what you would like to know are the areas of the genome that are covered by any exon. This is an intersection of the exons with the genome. The only item you do not readily have is "the genome." For this, you need the information from the chromInfo table. For example, the chrom extents of hg18 in the form of a bed file to be used as a custom track can be created by:

echo 'track name=hg18_extent description="hg18 chromosome extents"' \
    > hg18.extent.bed
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N \
    -e "select chrom,0,size,chrom from chromInfo;" hg18 >> hg18.extent.bed

Or, from the chromInfo file: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chromInfo.txt.gz Or from the table browser, "all tables" then "chromInfo" and selected fields from that table, the chrom and size. Insert a 0 in a middle column from that output to obtain a bed file. Or, from a copy I made from the mysql command above: http://genome-test.cse.ucsc.edu/~hiram/hg18/hg18.extent.bed With hg18.extent.bed in hand, load that as a custom track. Then, create a custom track of "exons only" from your favorite gene set. Then, with the hg18 extent track chosen as first track, run a base-pair intersection of that with the exon custom track, obtain the results as a third track. You can view these tracks in the browser to verify they have the meaning you want from this exercise.
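If you would rather do the same collapse offline on a BED file, the base-pair union of exons can be sketched in a few lines of Python. This is only an illustration and not what the Table Browser runs internally; the function name merge_intervals and the sample coordinates are invented for the example, and it assumes BED-style half-open (chrom, start, end) intervals.

```python
def merge_intervals(records):
    """Collapse overlapping (chrom, start, end) intervals into their union."""
    merged = []
    # Sorting groups intervals by chromosome, then by start position.
    for chrom, start, end in sorted(records):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


exons = [
    ("chr1", 100, 200),  # exon shared by two transcripts
    ("chr1", 100, 200),
    ("chr1", 150, 250),  # overlapping alternative exon
    ("chr2", 50, 80),
]
print(merge_intervals(exons))
# [('chr1', 100, 250), ('chr2', 50, 80)]
```

Identical exons shared between transcripts collapse to one record, and overlapping exons are joined into a single covered region, which is the same meaning as the intersection track described here.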
Then, using the result of your intersection, obtain the fasta for those areas via the table browser "sequence" output. I have saved this exercise as a session "hg18 exon locus": http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Hiram&hgS_otherUserSessionName=hg18%20exon%20locus Note the three custom tracks. --Hiram > From: "Khrebtukova, Irina" > Subject: Re: [Genome] non-redundant set of refseq or knownGenes exons > Hi Archana, > my problem is that I do know now how to make a custom track and how to > get a sequence from custom track. I was wondering (and I guess it's now > a FAQ for you guys) how to get NON_REDUNDANT set of exons? Preferably of > course dealing with overlapping exons too. Like selecting largest exon > of all overlapping. > ok, I guess it's still not that easy solution. But my prediction would > be that more and more people would ask this question... > thanks! and of course I just LOVE your browser! really it's the best > despite of my minor desires of making even better. > Irina From ikhrebtukova at illumina.com Tue Aug 7 21:42:32 2007 From: ikhrebtukova at illumina.com (Khrebtukova, Irina) Date: Tue, 7 Aug 2007 21:42:32 -0700 Subject: [Genome] non-redundant set of refseq or knownGenes exons In-Reply-To: <200708080415.VAA20900@sundance.cse.ucsc.edu> References: <200708080415.VAA20900@sundance.cse.ucsc.edu> Message-ID: Hiram, THANKS!!!!! that's exactly what I need. (save it as FAQ as I'm sure there will be more queries like that). you guys are awesome!!!! Irina -----Original Message----- From: Hiram Clawson [mailto:hiram at soe.ucsc.edu] Sent: Tuesday, August 07, 2007 9:15 PM To: archanat at soe.ucsc.edu; Khrebtukova, Irina Cc: genome at soe.ucsc.edu Subject: Re: [Genome] non-redundant set of refseq or knownGenes exons Good evening Irina: If I understand your query correctly, what you would like to know are the areas of the genome that are covered by any exon. This is an intersection of the exons with the genome. 
The only item you do not readily have is "the genome." For this, you need the information from the chromInfo table. For example, the chrom extents of hg18 in the form of a bed file to be used as a custom track can be created by:

echo 'track name=hg18_extent description="hg18 chromosome extents"' \
    > hg18.extent.bed
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N \
    -e "select chrom,0,size,chrom from chromInfo;" hg18 >> hg18.extent.bed

Or, from the chromInfo file: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chromInfo.txt.gz Or from the table browser, "all tables" then "chromInfo" and selected fields from that table, the chrom and size. Insert a 0 in a middle column from that output to obtain a bed file. Or, from a copy I made from the mysql command above: http://genome-test.cse.ucsc.edu/~hiram/hg18/hg18.extent.bed With hg18.extent.bed in hand, load that as a custom track. Then, create a custom track of "exons only" from your favorite gene set. Then, with the hg18 extent track chosen as first track, run a base-pair intersection of that with the exon custom track, obtain the results as a third track. You can view these tracks in the browser to verify they have the meaning you want from this exercise. Then, using the result of your intersection, obtain the fasta for those areas via the table browser "sequence" output. I have saved this exercise as a session "hg18 exon locus": http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Hiram&hgS_otherUserSessionName=hg18%20exon%20locus Note the three custom tracks. --Hiram > From: "Khrebtukova, Irina" > Subject: Re: [Genome] non-redundant set of refseq or knownGenes exons > Hi Archana, > my problem is that I do know now how to make a custom track and how to > get a sequence from custom track. I was wondering (and I guess it's > now a FAQ for you guys) how to get NON_REDUNDANT set of exons? > Preferably of course dealing with overlapping exons too.
Like > selecting largest exon of all overlapping. > ok, I guess it's still not that easy solution. But my prediction would > be that more and more people would ask this question... > thanks! and of course I just LOVE your browser! really it's the best > despite of my minor desires of making even better. > Irina From ikhrebtukova at illumina.com Tue Aug 7 21:49:46 2007 From: ikhrebtukova at illumina.com (Khrebtukova, Irina) Date: Tue, 7 Aug 2007 21:49:46 -0700 Subject: [Genome] pack format algorithm In-Reply-To: <200708080415.VAA20900@sundance.cse.ucsc.edu> References: <200708080415.VAA20900@sundance.cse.ucsc.edu> Message-ID: Hi, could you please explain in simple words how the algorithm for the "pack" format works? For example, if I have all start & end positions (converted to screen coordinates, therefore x coordinates are known), what's the way to calculate y coordinates to "pack" features? I broke my head over the wall and it is still far from the perfect view in your browser. thanks, Irina Khrebtukova Illumina, Hayward. From oren.ben-ami at weizmann.ac.il Tue Aug 7 23:34:25 2007 From: oren.ben-ami at weizmann.ac.il (Oren Ben-Ami) Date: Wed, 08 Aug 2007 09:34:25 +0300 Subject: [Genome] A Question regarding the UCSC genome browser Message-ID: Hello, I have a question regarding the mouse genome presented by default on your Browser. What mouse strain does this genome sequence represent? I'm interested in the genome sequence of the mouse strain 129/SvEvTac. Is there a way to configure my UCSC browser to link me to the required genome? I would appreciate a quick response. Best regards, Oren Ben-Ami Dept. of Molecular Genetics The Weizmann Institute of Science Rehovot 76100 Israel Phone: 972-08-9342318 From atorrano at lsi.upc.edu Wed Aug 8 01:20:43 2007 From: atorrano at lsi.upc.edu (Alexis Torrano) Date: Wed, 8 Aug 2007 10:20:43 +0200 (CEST) Subject: [Genome] mirroring from BSC Message-ID: <2107.84.88.50.16.1186561243.squirrel@webmail.lsi.upc.edu> Hello I am Alexis Torrano.
I am mailing you from INB-BSC (National Institute in BioInformatics-Barcelona Supercomputing Center) asking for some advice. Our researchers make heavy use of your databases, and we are interested in keeping a mirror of those genome databases for them. We plan to add an update of your databases to our automatic database update process. Mainly we will follow the instructions on your website about rsync. We'd like to know if there are any other issues we should take into account that could make the update easier. We have observed that some files are of the kind *fa.gz, *fa.masked.gz and others like *Fa.zip, *FaMasked.zip, hardmask.fa.gz. Also, some file names match their species directory in some way, and others do not:

Anolis_carolinensis/bigZips/anoCar1.fa.masked.gz
Anopheles_gambiae/bigZips/chromFaMasked.zip
Canis_familiaris/bigZips/chromFaMasked.tar.gz
Bos_taurus/bigZips/bosTau2.hardmask.fa.gz

Do you have in mind future changes to the naming protocol for genome files (full name, abbreviation based on directory name, uniformity of masked vs. hardmask, etc.)? About mirroring UCSC genomes and offering the mirror through our website: are there any constraints we need to meet, or anything we must know? May we mirror only the databases without the Genome Browser? Thank you very much. Alexis Torrano.-- ----------------------------------------------------- Alexis Torrano Martinez e-mail: atorrano at bsc.es, atorrano at lsi.upc.edu Nodo Computacional GNHC-1 (inb.bsc.es) Instituto Nacional de Bioinformatica (www.inab.org) Barcelona Supercomputing Center Node (www.bsc.es) BSC-CNS (www.bsc.es) c/. Jordi Girona 29 Edifici Nexus II, despatx 1B Tel: (+34) 93 413 7605 E-08034 Barcelona Fax: (+34) 93 Catalunya (Spain) Team info: http://inb.lsi.upc.edu/ ----------------------------------------------------- Berlin's Law of Computing - Computers don't do what you ask them to do, they do what you tell them to do.
Named for Dean Berlin, noted observer of reality. From shlomitf at ekmd.huji.ac.il Wed Aug 8 04:38:45 2007 From: shlomitf at ekmd.huji.ac.il (shlomit farkash) Date: Wed, 8 Aug 2007 14:38:45 +0300 Subject: [Genome] lower case Letters and repetetive sequences Message-ID: <86D596CE0D90274CB81C987E2836216A83E0C0@EKEKEVS00.ekmd.huji.uni> Hello, First, thanks a lot for this very helpful site and for your helpful assistance in the past... I have downloaded a genome version (hg17). I assumed that the lower case letters are repetitive sequences; however, when I was looking for a pattern of 18bp, I found many patterns that were in lower case and appeared only once in the genome... Since it surprised me, I tried to look for one of these 18bp patterns (for example: gaattcaaaggctaatgg) in BLAT, but it failed (it found no matches)... My question is if there is a way to make sure that the sequence indeed occurs only once although it is a repetitive sequence (use BLAT in the command line with some flags?). I have many such examples, and the only explanation I have is that sequences that were defined as repetitive sequences don't have a clear consensus, but rather fall into some defined categories. What do you think, and am I missing anything here? Thanks a lot, Shlomit Amar-Farkash From hiram at soe.ucsc.edu Wed Aug 8 09:15:48 2007 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Wed, 08 Aug 2007 09:15:48 -0700 Subject: [Genome] A Question regarding the UCSC genome browser In-Reply-To: References: Message-ID: <46B9EC34.1000803@soe.ucsc.edu> Good Morning Oren Ben-Ami: We only display the strain C57BL/6J. There are 38 contigs in the 129/SvEvTac strain which you would find at NCBI for mouse build 37 (UCSC mm9 under construction).
You may also be able to find resources at MGI: http://www.informatics.jax.org/menus/strain_menu.shtml Attached below is a listing of the 38 contigs obtained from the NCBI seq_contig.md.gz file: ftp://ftp.ncbi.nih.gov/genomes/MapView/Mus_musculus/sequence/BUILD.37.1/initial_release/seq_contig.md.gz For mm8 (NCBI build 36) this file would be: ftp://ftp.ncbi.nih.gov/genomes/MapView/Mus_musculus/sequence/BUILD.36.1/initial_release/seq_contig.md.gz --Hiram Oren Ben-Ami wrote: > Hello, > I have a question regarding the mouse genome presented by default on your > Browser. What mouse strain this genome sequence represents? I'm interesed in > the genome sequence of the mouse strain 129/SvEvTac. Is there a way to > configure my UCSC browser to link me to the required genome? > I would appreciate a quick response. > Best regards, > > Oren Be