From rsinha at fli-leibniz.de Fri Jun 1 08:23:44 2007 From: rsinha at fli-leibniz.de (Rileen Sinha) Date: Fri, 01 Jun 2007 17:23:44 +0200 Subject: [Genome] The count of all annotated exons for a UCSC known gene ...... Message-ID: <46603A00.1040302@fli-leibniz.de> Hi, Is there some quick/convenient way of obtaining a list of the number of annotated exons in each given gene in the UCSC known gene collection? By that I mean all exons that appear in at least one transcript - a sort of "logical OR" of all exons in all transcripts. Is there also a corresponding solution for RefSeq genes? Thanks! Take care, Yours, Rileen -- ****************************************************************** "I know nothing, but i _know_ that." Rileen Sinha rileen at gmail.com Personal Phone : (0049)3641412276 (cheaper to call) (0049)17624078373 ****************************************************************** From ann at soe.ucsc.edu Fri Jun 1 11:22:19 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 01 Jun 2007 11:22:19 -0700 Subject: [Genome] The count of all annotated exons for a UCSC known gene ...... In-Reply-To: <46603A00.1040302@fli-leibniz.de> References: <46603A00.1040302@fli-leibniz.de> Message-ID: <466063DB.40908@cse.ucsc.edu> Hello Rileen, You can get the information you want by using the Table Browser tool ('Tables' in the blue navigation bar). Configure it like so: clade: Vertebrate genome: Human assembly: Mar 2006 group: Genes & Gene Prediction tracks track: UCSC Genes table: knownGene region: genome output format: selected fields from primary and related tables Press "get output" button. Then, on the next page choose: name exonCount (and any other fields you would like) Press "get output" button to see your list. Do the same for the refGene table to get the exon counts for all of the Ref Seq Genes. 
Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list. Rileen Sinha wrote: > Hi, > Is there some quick/convenient way of obtaining a list of the number > of annotated exons in each given gene in the UCSC known gene collection? > > By that I mean all exons that appear in at least one transcript - a sort of > "logical OR" of all exons in all transcripts. > > Is there also a corresponding solution for RefSeq genes? > > Thanks! > > Take care, > Yours, > Rileen > From ann at soe.ucsc.edu Fri Jun 1 11:43:26 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 01 Jun 2007 11:43:26 -0700 Subject: [Genome] motif quality assessment In-Reply-To: <11417a880705312358u925bea8n2f686b4bc3d7cbfc@mail.gmail.com> References: <11417a880705271009t1b957608g7c034b821e576081@mail.gmail.com> <11417a880705312358u925bea8n2f686b4bc3d7cbfc@mail.gmail.com> Message-ID: <466068CE.1080101@cse.ucsc.edu> Hello T. Joshi, To get conserved sequence for the 5' and 3' regions, you can use the Table Browser ('Tables' from the top blue navigation bar). First, you will create a Custom Track made up of the 5' and 3' regions of the gene set you are interested in. For more details on creating a Custom Track, see this User's Guide: http://genome.ucsc.edu/goldenPath/help/customTrack.html Then, you will intersect your Custom Track with the multiz17way table. This is the table that underlies the Conservation track. 
For more details on performing a table intersection using the Table Browser, see this User's Guide: http://hgw8.cse.ucsc.edu/goldenPath/help/hgTablesHelp.html#SimpleIntersection As for your second question, you can find the most conserved sections of your 5' and 3' UTRs by intersecting your original Custom Track with the phastConsElements17way table. This is the table which underlies the Most Conserved track. See this very similar previously-answered mail list question for more details: http://www.soe.ucsc.edu/pipermail/genome/2006-May/010525.html Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list. T Joshi wrote: > Hi ! > Thanks Kayla for your reply. > > I have some more questions, I hope someone can help : > 1)How can I get conserved sequences from 5' and 3' UTR region ? > I tried UCSC's conserved track, which gives me the entire > genome/chromosome's conserved sequence with its position coordinates. > But I am only interested in only the 5' or 3' region of the sequence > which is conserved. > Tools such as Blast on NCBI allows to query conserved amino acid > sequence, but not nucleotide sequences. > > 2) Provided I have a set of sequences from these UTR regions, is there > any tool which lets me find only conserved sub-sequences from the > input set of sequences? > > Thanks, > T Joshi > > On 29/05/07, Kayla Smith wrote: >> TJoshi, >> >> You may want to use the Improbizer: >> http://genome-test.cse.ucsc.edu/Improbizer/ >> >> This is a program that slowly crawls through DNA or RNA sequence >> looking for consensus motifs that happen improbably often. Note >> that this is on our test server and data/tools found here have not >> gone through our rigorous QA process.
>> >> I hope this is helpful to you. Please don't hesitate to contact us again >> if you require further assistance. >> >> Kayla Smith >> UCSC Genome Bioinformatics Group >> >> On Sun, 27 May 2007, T Joshi wrote: >> >>> Hello All ! >>> This is my first email to the list, and I am not sure if I am posting >>> at a right place. >>> Anyways, my problem refers to the quality assessment of motifs >>> discovered any of the motif discovery algorithms such as MEME, YMF or >>> Gemoda, applied to DNA sequence data. Given a set of known motifs, I >>> want to compare them, to evaluate predicted motifs generated by one of >>> these algorithms. I want to find the statistics such as sensibility, >>> specificity, false positives, false negatives, etc. >>> I searched for the tool for this purpose, but it gives me those for >>> the protein sequences and not for DNA sequences. Some of them compare >>> the predicted motifs with their own set of knownmotifs, but not with >>> user-given set of known motifs. >>> >>> thoughts? pointers? >>> Please help. >>> Thanks, >>> TJoshi >>> _______________________________________________ >>> Genome maillist - Genome at soe.ucsc.edu >>> http://www.soe.ucsc.edu/mailman/listinfo/genome >>> > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From archie_russell at merck.com Fri Jun 1 12:05:04 2007 From: archie_russell at merck.com (Russell, Archie) Date: Fri, 1 Jun 2007 12:05:04 -0700 Subject: [Genome] Restriction enzyme tracks In-Reply-To: <466068CE.1080101@cse.ucsc.edu> References: <11417a880705271009t1b957608g7c034b821e576081@mail.gmail.com> <11417a880705312358u925bea8n2f686b4bc3d7cbfc@mail.gmail.com> <466068CE.1080101@cse.ucsc.edu> Message-ID: <23B0A4FBD181A44D9B89C4FB3E96D59473E6CF@ussemx1100.merck.com> I imported a new assembly and noticed a track for restriction enzymes that I didn't create on my own showing up (not intentionally anyway). 
Does the browser include these by default? Thanks, Archie ------------------------------------------------------------------------------ Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station, New Jersey, USA 08889), and/or its affiliates (which may be known outside the United States as Merck Frosst, Merck Sharp & Dohme or MSD and in Japan, as Banyu - direct contact information for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential, proprietary copyrighted and/or legally privileged. It is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please notify us immediately by reply e-mail and then delete it from your system. ------------------------------------------------------------------------------ From ann at soe.ucsc.edu Fri Jun 1 12:16:23 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 01 Jun 2007 12:16:23 -0700 Subject: [Genome] Restriction enzyme tracks In-Reply-To: <23B0A4FBD181A44D9B89C4FB3E96D59473E6CF@ussemx1100.merck.com> References: <11417a880705271009t1b957608g7c034b821e576081@mail.gmail.com> <11417a880705312358u925bea8n2f686b4bc3d7cbfc@mail.gmail.com> <466068CE.1080101@cse.ucsc.edu> <23B0A4FBD181A44D9B89C4FB3E96D59473E6CF@ussemx1100.merck.com> Message-ID: <46607087.8040207@cse.ucsc.edu> Hi Archie, Yes, this track is created on the fly and by default for all assemblies. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Russell, Archie wrote: > > > I imported a new assembly and noticed a track for restriction enzymes > that I didn't create on my own showing up (not intentionally anyway). > Does the browser include these by default? 
> > Thanks, > Archie > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From yoyoq at yahoo.com Fri Jun 1 12:22:02 2007 From: yoyoq at yahoo.com (jp d) Date: Fri, 1 Jun 2007 12:22:02 -0700 (PDT) Subject: [Genome] improbizer question In-Reply-To: <466068CE.1080101@cse.ucsc.edu> Message-ID: <828743.34884.qm@web50405.mail.re2.yahoo.com> hi, I have a question on a kent/src program, I hope this is the right place for it. I am using ameme locally and no matter which option I send to ameme it always includes reverse complement I have tried rcToo=off,FALSE, on, TRUE any suggestions?
typical command line ~/bin/x86_64/ameme good=test.seq numMotifs=4 background=m2 motifOutput=test.out html=test.html gif=test.gif rcToo=FALSE thanks jpd From galt at soe.ucsc.edu Fri Jun 1 12:43:19 2007 From: galt at soe.ucsc.edu (Galt Barber) Date: Fri, 1 Jun 2007 12:43:19 -0700 (PDT) Subject: [Genome] improbizer question In-Reply-To: <828743.34884.qm@web50405.mail.re2.yahoo.com> References: <828743.34884.qm@web50405.mail.re2.yahoo.com> Message-ID: try not specifying it at all. -Galt On Fri, 1 Jun 2007, jp d wrote: > hi, > I have a question on a kent/src program, I hope > this is the right place for it. > I am using ameme locally and no matter > which option I send to ameme it always > includes reverse complement > I have tried rcToo=off,FALSE, on, TRUE > > any suggestions? > typical command line > > ~/bin/x86_64/ameme good=test.seq numMotifs=4 > background=m2 motifOutput=test.out html=test.html > gif=test.gif rcToo=FALSE > > > thanks > jpd > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From hartera at soe.ucsc.edu Fri Jun 1 14:20:09 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Fri, 1 Jun 2007 14:20:09 -0700 (PDT) Subject: [Genome] related to conservation track In-Reply-To: <20070531180338.ntnc5xiwycggo8cs@webmail.utoronto.ca> References: <20070531180338.ntnc5xiwycggo8cs@webmail.utoronto.ca> Message-ID: Dear Xiao, One of our engineers gave this explanation for what you are seeing: "The scores are generated by a hidden markov model, which has internal state. So instead of considering each base in isolation, the neighborhood is considered and the model does include a (tunable) smoothing parameter. In that tiny (11bp) region, the score seems to be slowly decaying from a previous high. Gaps (the --'s and ==='s) are treated as missing information, not as harshly penalized as mismatches I believe. 
Also, there are some matching bases in the most distant species (the run of A's in t_castaneum) that are probably softening the fall (some precipitous drops can be seen in the neighborhood, but in those cases most everything seems to be mismatching at the boundaries)." I hope that this helps you. Please let us know if you have further questions. Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Thu, 31 May 2007 alicexiao.li at utoronto.ca wrote: > > Dear UCSC Genome browser colleages: > > > This maybe a very silly question but it confused me alot. :) > > When I used ucsc genome brower to identify conserved region for CG4040 > 3'UTR between 12 drospholia species. I found in the region chrX > 8898540-8898550, the figure sures these region are relatively > conserved. However, it is obvious that at base level, it is not > conserved(lots of - and =). Detailed information can be seen in the > following website: > > http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=93161117&hgt.out1=1.5x&position=chrX%3A8898530-8898549 > > So is that actually correct? > > cheers > > Xiao > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From rileen at gmail.com Fri Jun 1 13:55:42 2007 From: rileen at gmail.com (Rileen) Date: Fri, 1 Jun 2007 22:55:42 +0200 Subject: [Genome] The count of all annotated exons for a UCSC known gene ...... In-Reply-To: <466063DB.40908@cse.ucsc.edu> References: <46603A00.1040302@fli-leibniz.de> <466063DB.40908@cse.ucsc.edu> Message-ID: <6dc7bd4a0706011355l76b37505m4bda172a3cd66dad@mail.gmail.com> Hi Ann, Thanks. I'm sorry, my question was ambiguous, I already knew how to do what you said. I wanted the number of exons per loci/"canonical" gene, which is why I said "By that I mean all exons that appear in at least one transcript - a sort of "logical OR" of all exons in all transcripts." which wasn't very clear, I admit. 
To clarify - let's take an example, say the gene EDA. There are 8 Refseqs, with varying numbers of exons - between 1 and 3. However, when you look at them together, there 5 exons in this locus, each of which appears in at least one Refgene. These are not exons which differ from each other by one splice site etc., these are distinct, non-overlapping exons. It is this number - 5 in this example - that I'm after. Take care, Yours, Rileen On 01/06/07, Ann Zweig wrote: > Hello Rileen, > > You can get the information you want by using the Table Browser tool > ('Tables' in the blue navigation bar). Configure it like so: > > clade: Vertebrate > genome: Human > assembly: Mar 2006 > group: Genes & Gene Prediction tracks > track: UCSC Genes > table: knownGene > region: genome > output format: selected fields from primary and related tables > > Press "get output" button. > > Then, on the next page choose: > name > exonCount > (and any other fields you would like) > > Press "get output" button to see your list. > > Do the same for the refGene table to get the exon counts for all of the > Ref Seq Genes. > > Regards, > > ---------- > Ann Zweig > UCSC Genome Bioinformatics Group > http://genome.ucsc.edu > > Please feel free to search the Genome mailing list archives by visiting > our home page, clicking on "Contact Us", then typing a word or phrase > into the search box. On that same page > (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome > mailing list. > > > > > Rileen Sinha wrote: > > Hi, > > Is there some quick/convenient way of obtaining a list of the number > > of annotated exons in each given gene in the UCSC known gene collection? > > > > By that I mean all exons that appear in at least one transcript - a sort of > > "logical OR" of all exons in all transcripts. > > > > Is there also a corresponding solution for RefSeq genes? > > > > Thanks! 
> > > > Take care, > > Yours, > > Rileen > > > From hartera at soe.ucsc.edu Fri Jun 1 14:53:29 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Fri, 1 Jun 2007 14:53:29 -0700 (PDT) Subject: [Genome] lost track entries / multiple configurations In-Reply-To: <465F057E.6030900@sammeth.net> References: <465F057E.6030900@sammeth.net> Message-ID: Hello Micha, If you have a large number of items in the same location, then the code will not allow display of all the data. We see this sometimes with ESTs and with chain tracks and also the SAGEtags in the ENCODE data on hg17. If you set visibility to dense for the custom track, then you will be able to view all the data collapsed into one item. What type of data are you loading? Is there a way that you can reduce the data in any way or maybe load the data in more than one track. The data is still available but you can not view it all. All of the data will be accessible in the Table Browser so you would still be able to query it there and intersect it with other tracks. With regard to your question about conflicting parameters from different sources, if you add a line to a custom track, then this just gives the default position for loading that track which will stay the same. So whenever you go to the custom track manager and select to view a custom track then it will take you to its default position (if specified) otherwise to the position of the first item with the default visibility. Once viewing the Genome Browser, you can change the visibility using the track control for the custom track. The assembly information is also retained so if you created the track on hg18 (Mar 2006), then it will only appear on hg18 and you would need to select hg18 in the custom track manager or the hg18 Genome Browser in order to view the custom track. If you change the database to say mouse mm8 in the URL, then it will take you to the mm8 Genome Browser. 
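As a small illustration of selecting an assembly through the URL, the sketch below builds an hgTracks link in Python. It assumes the `db` and `position` parameter names that appear in hgTracks URLs elsewhere in this archive; the helper name `browser_url` is hypothetical and not part of any UCSC tool.

```python
from urllib.parse import urlencode

# Base CGI address as it appears in other messages of this archive.
BASE = "http://genome.ucsc.edu/cgi-bin/hgTracks"

def browser_url(db, position=None):
    """Build an hgTracks URL for an assembly (db) and an optional position.

    Hypothetical helper: only the `db` and `position` parameters are used,
    both of which are assumptions taken from URLs quoted in this archive.
    """
    params = {"db": db}
    if position:
        params["position"] = position
    return BASE + "?" + urlencode(params)

# Switch to the mouse mm8 browser, or open hg18 at a given region.
print(browser_url("mm8"))
print(browser_url("hg18", "chrX:68752635-69176046"))
```

Note that `urlencode` percent-encodes the colon in the position, which matches the encoded form seen in browser-generated links.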
The organism and assembly information variables are kept in the cart, and this means that if you have just visited mm8 and then choose Blat or the Table Browser or Gene Sorter, then mm8 will be the default organism. Other variables from the URL and some that you do not see including the tracks' visibilities and the genome position are also kept in the cart. So if you go back to a Genome Browser for an assembly then you will see the same tracks and position as when you last visited that assembly. These settings are associated with an id (hgsid) which you will see appearing in the URL. You can reset everything to default using the "Click here to reset" on the Genomes Gateway page (click on the "Genomes" link on the top blue bar). This will also remove any custom tracks that you added. You can see all these settings in the cart by doing a cart dump: http://genome.ucsc.edu/cgi-bin/cartDump. I hope that this helps you. Please let us know if you have further questions. Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Thu, 31 May 2007, Micha Sammeth wrote: > Hi, > > I have two questions: > > - if you display too many data in a track in the browser it seems to > cut the lasts entries, displaying a line: "Last row XX:" > I guess this is due to limit the resources, I couldnt find a way to > deactivate it. Is there any way to get information about the cut entries? > > - what happens if you submit the browser contradicting information > via different sources? For instance you can set region, track display > features, ... via the URL with which you call the browser, but at the > same time, you may specify these parameters in the browser configuration > of a dynamically loaded custom track. Which configuration overrides > which, is there an order to the processing? > > > Thank you !! micha. 
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>
From ann at soe.ucsc.edu Fri Jun 1 15:22:05 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 01 Jun 2007 15:22:05 -0700
Subject: [Genome] The count of all annotated exons for a UCSC known gene ......
In-Reply-To: <6dc7bd4a0706011355l76b37505m4bda172a3cd66dad@mail.gmail.com>
References: <46603A00.1040302@fli-leibniz.de> <466063DB.40908@cse.ucsc.edu> <6dc7bd4a0706011355l76b37505m4bda172a3cd66dad@mail.gmail.com>
Message-ID: <46609C0D.4090703@cse.ucsc.edu>

Hello again, Rileen,

Sorry for misunderstanding your question the first time. Unfortunately, there is no quick way to do this, but I can help you gather all the information you need, then you'll need to do one final step on your own. This pertains only to the UCSC Gene set. The Ref Seq Gene tables are set up differently.

Each gene cluster is represented by a canonical gene noted in the knownCanonical table. Each cluster has a clusterId which also appears in the knownIsoforms table. Using your example of the EDA gene, the canonical gene is:

mysql> SELECT * FROM knownCanonical WHERE transcript = 'uc004dxs.1'\G
*************************** 1. row ***************************
     chrom: chrX
chromStart: 68752635
  chromEnd: 69176046
 clusterId: 25603
transcript: uc004dxs.1
   protein: uc004dxs.1

So, the clusterId in this case is 25603. Now use this information to search the knownIsoforms table for all of the genes in this cluster:

mysql> SELECT * FROM knownIsoforms WHERE clusterId = 25603;
+-----------+------------+
| clusterId | transcript |
+-----------+------------+
|     25603 | uc004dxq.1 |
|     25603 | uc004dxn.1 |
|     25603 | uc004dxm.1 |
|     25603 | uc004dxl.1 |
|     25603 | uc004dxp.1 |
|     25603 | uc004dxo.1 |
|     25603 | uc004dxs.1 |
|     25603 | uc004dxr.1 |
+-----------+------------+

Here are the 8 genes that you see displayed in the UCSC Gene track for "EDA".
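For the per-cluster exon count asked about in this thread, the scripting step comes down to pooling the exons of every transcript in a cluster and counting the distinct regions. The following is a minimal Python sketch, not a UCSC tool. It assumes each input row carries a clusterId together with the comma-terminated exonStarts and exonEnds strings stored in knownGene, and it assumes that overlapping exons should be merged into one region (a stricter reading of "logical OR" than counting unique start coordinates alone). The function names and the sample rows are hypothetical.

```python
from collections import defaultdict

def parse_coords(field):
    """Turn a knownGene-style list such as '100,300,' into [100, 300]."""
    return [int(x) for x in field.split(",") if x]

def count_merged_exons(rows):
    """Count distinct exon regions per cluster.

    rows: iterable of (clusterId, exonStarts, exonEnds) tuples, with the
    coordinate fields given as comma-terminated strings.
    Returns {clusterId: number of regions left after merging overlaps}.
    """
    intervals = defaultdict(list)
    for cluster_id, starts, ends in rows:
        intervals[cluster_id].extend(zip(parse_coords(starts), parse_coords(ends)))

    counts = {}
    for cluster_id, ivs in intervals.items():
        ivs.sort()
        merged = 0
        current_end = None
        for start, end in ivs:
            # Coordinates are treated as half-open, so an exon that starts
            # exactly where the previous one ends counts as a new region.
            if current_end is None or start >= current_end:
                merged += 1
                current_end = end
            else:
                current_end = max(current_end, end)
        counts[cluster_id] = merged
    return counts

# Hypothetical rows: cluster 1 has three transcripts, cluster 2 has one.
rows = [
    (1, "100,300,", "200,400,"),
    (1, "100,500,", "200,600,"),
    (1, "350,", "450,"),
    (2, "10,", "50,"),
]
print(count_merged_exons(rows))
```

If exons that merely overlap should be counted separately, replacing the merge with a set of (start, end) pairs per cluster would give the number of exactly distinct exons instead.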
Now you can search the knownGene table to get a list of the exonStarts (and exonEnds if needed). For this, I would suggest using our public mysql server. You can access the server using this command from a prompt:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
use hg18

To get a list of exonStarts for the set of genes in a cluster, you can issue this command:

mysql> SELECT clusterId, exonStarts FROM knownGene g, knownIsoforms i WHERE g.name = i.transcript and i.clusterId = 25603;
+-----------+--------------------------------------------------------------------------+
| clusterId | exonStarts                                                               |
+-----------+--------------------------------------------------------------------------+
|     25603 | 68752635,69093601,                                                       |
|     25603 | 68752635,68997412,                                                       |
|     25603 | 68752635,68806778,68997412,                                              |
|     25603 | 68752635,                                                                |
|     25603 | 68752635,68757464,69093601,                                              |
|     25603 | 68752635,68865711,68997412,69093601,                                     |
|     25603 | 68752635,69093601,69159792,69164431,69166078,69167043,69169972,69171932, |
|     25603 | 68752635,69093601,69159792,69164431,69166078,69167043,69169972,69171932, |
+-----------+--------------------------------------------------------------------------+
8 rows in set (0.04 sec)

This gives you your list of exonStarts for your cluster. To get the exonStarts for *all* gene clusters, simply leave off the part of the WHERE clause that limits it to a specific cluster (and i.clusterId = 25603), keeping the join condition. Once you have this list, you will need to write a perl script (or the like) to determine the number of unique exons for each cluster of genes.

Please be sure to write back if this is unclear.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Rileen wrote:
> Hi Ann,
> Thanks. I'm sorry, my question was ambiguous, I already knew
> how to do what you said. I wanted the number of exons per
> loci/"canonical" gene,
> which is why I said
>
> "By that I mean all exons that appear in at least one transcript - a
> sort of
> "logical OR" of all exons in all transcripts."
> > which wasn't very clear, I admit. > > To clarify - let's take an example, say the gene EDA. There are 8 Refseqs, > with varying numbers of exons - between 1 and 3. However, when you look > at them together, there 5 exons in this locus, each of which appears in > at least > one Refgene. These are not exons which differ from each other by one > splice site > etc., these are distinct, non-overlapping exons. > > It is this number - 5 in this example - that I'm after. > > Take care, > Yours, > Rileen > > > On 01/06/07, Ann Zweig wrote: >> Hello Rileen, >> >> You can get the information you want by using the Table >> Browser tool >> ('Tables' in the blue navigation bar). Configure it like so: >> >> clade: Vertebrate >> genome: Human >> assembly: Mar 2006 >> group: Genes & Gene Prediction tracks >> track: UCSC Genes >> table: knownGene >> region: genome >> output format: selected fields from primary and related tables >> >> Press "get output" button. >> >> Then, on the next page choose: >> name >> exonCount >> (and any other fields you would like) >> >> Press "get output" button to see your list. >> >> Do the same for the refGene table to get the exon counts for >> all of the >> Ref Seq Genes. >> >> Regards, >> >> ---------- >> Ann Zweig >> UCSC Genome Bioinformatics Group >> http://genome.ucsc.edu >> >> Please feel free to search the Genome mailing list archives by visiting >> our home page, clicking on "Contact Us", then typing a word or phrase >> into the search box. On that same page >> (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome >> mailing list. >> >> >> >> >> Rileen Sinha wrote: >> > Hi, >> > Is there some quick/convenient way of obtaining a list of the >> number >> > of annotated exons in each given gene in the UCSC known gene >> collection? >> > >> > By that I mean all exons that appear in at least one transcript - a >> sort of >> > "logical OR" of all exons in all transcripts. 
>> > >> > Is there also a corresponding solution for RefSeq genes? >> > >> > Thanks! >> > >> > Take care, >> > Yours, >> > Rileen >> > >> From hartera at soe.ucsc.edu Fri Jun 1 18:56:37 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Fri, 1 Jun 2007 18:56:37 -0700 (PDT) Subject: [Genome] remove custom track via URL parameter In-Reply-To: <465EBAD4.2090704@sammeth.net> References: <465EBAD4.2090704@sammeth.net> Message-ID: Hello Micha, One of our engineers said that this example will remove a custom track via the URL parameter: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg17&hgct_doRemoveCustomTrack=1&hgct_table=ct_boo will remove the track named 'boo'. I hope that this helps you. Please let us know if you have further questions. Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Thu, 31 May 2007, Micha Sammeth wrote: > Hi, > > I can load custom track(s) for human using "&hgt.customText=" in > the request to the browser, and currently I wonder, whether there is a > similar parameter to make the browser "forget" a custom track loaded in > an earlier session. Manual ways would be to deactivate it in the > configuration page or to delete the respective track(s) in manage custom > tracks, but do you have a CGI parameter for this? > > Thanks, micha. > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From hartera at soe.ucsc.edu Sat Jun 2 11:31:11 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Sat, 2 Jun 2007 11:31:11 -0700 (PDT) Subject: [Genome] 2 questions about BLAT In-Reply-To: <465E966A.2060907@eva.mpg.de> References: <465E966A.2060907@eva.mpg.de> Message-ID: Hello Christian, Here are the answers to your questions: 1) There are a couple of programs that you can use to post-process the BLAT output and select alignments where the alignment must cover the whole length.
You can obtain these programs by downloading the Genome Browser source tree: http://genome.ucsc.edu/FAQ/FAQlicense#license3 It is free for non-profit, academic and personal use. pslReps and pslCDnaFilter both use -minCover so you could set this to 1.0 to get the whole of the query aligned. pslReps is in the directory src/hg/pslReps/ and pslCDnaFilter is in src/hg/pslCDnaFilter/. Is that what you want? This would return only alignments where the whole of the query is aligned. I wasn't sure if this is what you want or if you want the percentage ID recalculated to include the whole query sequence. If you would like to recalculate the percentage identity in this way, then you will need to write your own program to do this. Here is a link to the C code that calculates the percent identity: http://genome.ucsc.edu/FAQ/FAQblat#blat4 I also have a script in Perl that calculates the percent identity if you would prefer that. 2) There is a program called pslPretty in the src/hg/pslPretty directory that will convert the PSL file from BLAT into a human readable alignment. This requires fasta, 2bit or nib files which can be obtained through the downloads server at: http://hgdownload.cse.ucsc.edu/ You can use the following programs to convert the PSL BLAT output file to a format which can then be converted to a FASTA file: src/utils/pslToBed src/utils/bedToGenePred src/hg/genePredToMrna They should be run in this order: pslToBed -> bedToGenePred -> genePredToMrna For the last program, you will need to use our databases. You can do this by using our public MySQL server and by setting up a .hg.conf file in your home directory with the settings as shown in this FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads#download29 The result will be the genomic sequence to which your query sequence aligned i.e. the target region from the PSL file. I hope that this helps you. Please let us know if you have further questions.
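For anyone writing their own post-processing program, the coverage and whole-query identity ideas from point 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads the standard 21-column tab-separated PSL line, defines coverage as aligned query bases divided by qSize, and defines a whole-query identity as matching bases divided by qSize. It does not reproduce the arithmetic used by pslReps, pslCDnaFilter, or the C code linked from the FAQ, and the function names are hypothetical.

```python
def parse_psl_line(line):
    """Pick the fields needed here out of one tab-separated PSL line."""
    f = line.rstrip("\n").split("\t")
    return {
        "matches": int(f[0]),
        "misMatches": int(f[1]),
        "repMatches": int(f[2]),
        "nCount": int(f[3]),
        "qName": f[9],
        "qSize": int(f[10]),
    }

def query_coverage(rec):
    """Fraction of the query that takes part in the alignment (assumed definition)."""
    aligned = rec["matches"] + rec["misMatches"] + rec["repMatches"] + rec["nCount"]
    return aligned / rec["qSize"]

def whole_query_identity(rec):
    """Percent of all query bases that match, so unaligned ends count against it."""
    return 100.0 * (rec["matches"] + rec["repMatches"]) / rec["qSize"]

# Hypothetical alignment: a 100-base query whose first 5 bases are unaligned.
line = "90\t5\t0\t0\t0\t0\t0\t0\t+\tquery1\t100\t5\t100\tchr1\t1000\t200\t295\t1\t95,\t5,\t200,"
rec = parse_psl_line(line)
print(rec["qName"], query_coverage(rec), whole_query_identity(rec))

# Keep only alignments that cover the whole query, in the spirit of -minCover=1.0.
full_length = [r for r in [rec] if query_coverage(r) >= 1.0]
```

In this example the alignment would be dropped by the full-length filter because only 95 of the 100 query bases are aligned.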
Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Thu, 31 May 2007, Christian Arnold wrote: > Hello, > > I am using BLAT for research and I have two questions regarding the > configuration of BLAT > > 1) Is there a way to define that the entered sequence has to be found in > the whole length? Often, I have alignments where the first X bases are > ignored because they did not had an significant alignment. But the > identity score ignores these parts which is not what I want to have. The > Identity number should include the whole sequence. > > 2) Is there an easy way to get the aligned sequence (not the one which I > typed in) as a fasta file in its complete length? > > I read the Manual but I didn't found anything, so it would be nice if > you can help me with these issues... > > > best > Christian > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From hartera at soe.ucsc.edu Sat Jun 2 15:26:09 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Sat, 2 Jun 2007 15:26:09 -0700 (PDT) Subject: [Genome] Expression of brain-specific genes Message-ID: Dear Dr. Paul Grossfeld, If you are referring to human, we have the human Genomics Institute of the Novartis Research Foundation (GNF) Atlas and human GNF Atlas 2 microarray expression data. You can view these by going to the track controls below the Browser image and changing the visibility to turn on the GNF Ratio (GNF Atlas 1) and GNF Atlas 2 tracks which are in the "Expression and Regulation" group. If you click on the links above the track controls, you will find the track descriptions and citations of the publications for this data. If you then copy the position of your region of interest into the position/search box, you can view the data in this region. 
The Affy U95 track shows alignments of the sequences used for the design of probes for the Affymetrix U95 GeneChip that was used for the GNF Ratio data and, similarly, Affy GNF1H contains the alignments for a custom GeneChip (GNF1H) that was used for the GNF Atlas 2 data. Sometimes several probeset sequences are from the same region so, in the track display, the colouring that you see in the GNF Ratio and GNF Atlas 2 tracks is a composite of that for several probesets. You can see each one individually if you click on an item in the track which will take you to a details page. The colouring represents a log (base 2) ratio where red shows overexpression of a gene and green is underexpression compared to reference. For this data, the reference is the median of the medians of the replicates for all tissues/cell lines that are non-cancerous. You can also see which genes are represented by these probesets by turning on some of the Genes tracks such as UCSC Genes (hg18, known Genes on hg17), RefSeq, mRNA etc. By clicking on the Tables link in the top blue bar you can query the tables that contain this data. If you select "All Tables" as the group, you will see that there is a knownToAffyU95 and knownToU95 and a knownToGnf1h which map the genes from the knownGene table (UCSC Genes in hg18 and Known Genes in hg17) to the probeset sequences from the Affymetrix U95 chip used for GNF Ratio and the GNF1H custom chip used for GNF Atlas2 respectively. The mapping is such that for each known gene it is mapped to the best overlapping probeset sequence that is aligned to the same strand. Another powerful tool for data mining is the Gene Sorter. You can reach this by clicking on the "Gene Sorter" link on the top blue bar. There, you can search for individual genes and then find other genes that have a similar expression pattern in the GNF Atlas1 (GNF Ratio track) or GNF Atlas2 data. 
The gene that you searched for will appear at the top and the genes below it are related to the first one in order of decreasing relatedness (in this case, similar expression patterns). The configure button allows you to expand or reduce the number of data columns shown. By default, only a subset of the tissues is shown for the expression data. There is also a filter button which takes you to filter controls. You could use these to select for tissues that have a certain level of expression.

I hope that this helps you to get started. There are links to Help and FAQ on the top blue menu bar of the Genome Browser pages. Also, please let us know if you have further questions. In the future, please direct your Genome Browser questions to our mailing list (genome at soe.ucsc.edu) for a faster response - thanks.

Rachel

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

-------- Original Message --------
Date: Fri, 25 May 2007 17:04:53 GMT
From: pgrossfeld at UCSD.Edu
To: cbseweb at cbse.ucsc.edu

Dear Sir/Mam,

I am writing to seek your help. I am interested in determining which genes are expressed in the brain in a specific chromosomal region: chr11:120,472,551-133,693,850. I understand you have an atlas that would have such information. I do not have any experience in performing this kind of analysis, and I would deeply appreciate your helping me to use this powerful database. Thank you in advance!

Sincerely Yours,
Paul Grossfeld, M.D.
Assistant Adjunct Professor
Division of Pediatric Cardiology
UCSD School of Medicine
858-966-5855

--
Branwyn Stewart Wagman
Communications & Human Resources
Center for Biomolecular Science and Engineering (CBSE)
Institute for Quantitative Biomedical Research (QB3)
501C Engineering 2 Building
UC Santa Cruz
1156 High Street, MS: CBSE/ITI
Santa Cruz CA 95064
Tel: (831) 459-3077
Fax: (831) 459-1809
bwagman at soe.ucsc.edu
http://www.cbse.ucsc.edu

From hartera at soe.ucsc.edu Sat Jun 2 15:45:02 2007
From: hartera at soe.ucsc.edu (Rachel Harte)
Date: Sat, 2 Jun 2007 15:45:02 -0700 (PDT)
Subject: [Genome] Entrez gene ID to refseq ID mapping
In-Reply-To:
References:
Message-ID:

Hello Guoliang,

I'm glad that was useful to you and that you are able to get your program working now. The reason that you are finding multiple UCSC Gene IDs for some of the Entrez Gene IDs is that each Entrez Gene ID represents a gene and each UCSC Gene ID represents a transcript. There can be several transcripts, representing alternative splice forms, associated with each gene. You can see this by searching for one of the UCSC Gene IDs in the Genome Browser. If you click on the blue/gray button at the left side of the track, you can select for the UCSC Gene IDs to be displayed and then you can see the alternate splice form transcripts.

I hope that this helps you.

Rachel

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

On Thu, 31 May 2007, Guoliang Xing wrote:

> Hi Rachel,
>
> The information you provided proved to be very useful. I actually
> downloaded the UCSC annotation tables and imported them to my local MySQL
> database, the mapping now is done by joining common UCSC gene IDs. My
> programs now work beautifully.
>
>
> I do have a new question coming up. I am mapping Affy U133plus2 array
> probesetID to EntrezGeneID via UCSC KnowGeneID, through the tables I
> downloaded.
> > Here is my query and result, I am joining 3 tables: > > SELECT k.`name` as UCSCGeneID, k.`value` as ProbeSetID, q.`value` as > EntrezGeneID > FROM knowntou133plus2 k, knowngene p, knowntolocuslink q > WHERE k.`name`= p.`name` AND k.`name` =q.`name`; > > UCSCGeneID, ProbeSetID, EntrezGeneID > 'uc001aab.1', '225035_x_at', '375690' > 'uc001aac.1', '225035_x_at', '375690' > 'uc001aad.1', '225035_x_at', '375690' > 'uc001aae.1', '225035_x_at', '375690' > 'uc001aaf.1', '225035_x_at', '375690' > 'uc001aag.1', '225035_x_at', '375690' > 'uc001aah.1', '225035_x_at', '375690' > 'uc001aai.1', '226340_x_at', '375690' > 'uc001aaw.1', '234369_at', '729759' > 'uc001abd.1', '234369_at', '729759' > > total 45366 rows > > What's interesting is that the same Entrez Gene ID matches multiple UCSC > known genes. I am curious why one Entrez gene may match multiple UCSC > known genes. > > Though this is not a problem for me since once I added the distinct > keyword and > without UCSC Known gene field, unique rows returned: > > SELECT distinct k.`value` as ProbeSetID, q.`value` as EntrezGeneID > FROM knowntou133plus2 k, knowngene p, knowntolocuslink q > WHERE k.`name`= p.`name` AND k.`name` =q.`name`; > > returns 20586 rows > > > It's a great experience utilizing UCSC annotation tables. > > Thanks, > > Guoliang > > > On Thu, 24 May 2007, Rachel Harte wrote: > > > Guoliang, > > > > The refFlat table contains all the alignments for the RefSeq mRNAs > > (excluding the predicted ones whose IDs begin with "XM_"). Each mRNA may > > align to the genome more than once with the criteria used to filter Blat > > alignments for RefSeqs: > > "RefSeq mRNAs were aligned against the human genome using blat; those with > > an alignment of less than 15% were discarded. When a single mRNA aligned > > in multiple places, the alignment having the highest base identity was > > identified. 
Only alignments having a base identity level within 0.1% of > > the best and at least 96% base identity with the genomic sequence were > > kept." > > > > There are 25271 rows now in refFlat for hg18 but if you look at unique > > names (accessions) then there are 24837. There are 215202 rows in refLink > > because it contains the data for all RefSeqs from human and also from > > other organisms if the assembly has an "Other RefSeq" track (xenoRefFlat, > > xenoRefGene, xenoRefSeqAli tables). The tables that you downloaded from NCBI > > has many more rows for human RefSeqs than there are RefSeqs found in our > > refFlat table because they include the predicted RefSeqs (XM_ accessions) that > > we do not think are reliable and therefore they are not aligned for the > > Genome Browser. > > > > I hope that this helps you. > > > > Rachel > > > > We invite you to give us your feedback on the UCSC Genome Browser through > > May 31, 2007: http://www.surveymonkey.com/s.asp?U=881163743177 > > > > Rachel Harte > > UCSC Genome Bioinformatics Group > > http://genome.ucsc.edu > > > > > > On Wed, 23 May 2007, Guoliang Xing wrote: > > > > > Hi Rachel, > > > > > > Thank you for your reply. This is what I want and it's very helpful. > > > > > > I am a little bit surprised to find that from the same human Mar 2006 > > > Genes and Gene Prediction Tracks group -> RefSeq Genes -> refFlat table > > > returns 25407 rows, but reflink option returns 203390 rows, almost 10 > > > times more. > > > > > > On the other hand, the gene2refseq.gz file I downloaded from > > > ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ , after filtering to keep human only > > > genes with Taxonomy ID of 9606, it produces 145360 rows (many of the gene > > > IDs are duplicates here). > > > > > > I guess the genes downloaded from UCSC table browser using the refFlat > > > option is a more reasonable snapshot of all coding genes on HG18. 
> > > > > > My key task is to figure out the genome coordinates of each gene and its > > > exons, then link this geneID to existing pathways, and/or SNPs associated > > > with it. > > > > > > If my guess is not good, please let me know. I understand that > > > gene annotation is not simple, there will be some inconsistences. > > > > > > Thanks, > > > > > > Guoliang > > > > > > On Wed, 23 May 2007, Rachel Harte wrote: > > > > > > > Hello Guoliang, > > > > > > > > There is a table called refLink. You can download through the Table > > > > Browser (click on the "Tables" link on the top blue menu bar). Select the > > > > assembly of interest and the "Genes and Gene Predictions" group and the > > > > "RefSeq Genes" track. Then you can select the refLink table. The Entrez > > > > Gene ID is in the locuslinkId column. > > > > Alternatively, go to the Downloads server: > > > > http://hgdownload.cse.ucsc.edu/downloads.html > > > > > > > > Once you have found the organism and assembly of interest, then click on > > > > the "Annotation database" link and there you can download the contents of > > > > any table in the database for that assembly. > > > > > > > > I hope that this helps you. For a rapid response in the future, please > > > > direct questions to our mailing list at: genome at soe.ucsc.edu > > > > > > > > Thanks. > > > > > > > > Rachel > > > > > > > > > > > > Rachel Harte UCSC Genome > > > > Bioinformatics Group http://genome.ucsc.edu > > > > > > > > > > > > On Tue, 22 May 2007, Guoliang Xing wrote: > > > > > > > > > Hi Rachel, > > > > > > > > > > Is there a way to download a mapping table from UCSC between Entrez > > > > > GeneID to Refseq ID? > > > > > > > > > > I used UCSC table browser, RefSeq Genes track, knownToRefSeq table, and > > > > > downloaded the data, it has a name field which is the transcript accession > > > > > name, and a gene name field. But not GeneID. 
> > > > > > > > > > I like all the info this table provides, but I also need the Entrez > > > > > GeneID (for Human). > > > > > > > > > > One the other hand, I parsed NCBI's gene2refseq table for human, but it > > > > > has too many redundant lines for the same geneID. > > > > > > > > > > > > > > > > > > > > You help will be appreciated. > > > > > > > > > > Guoliang > > > > > > > > > > > > > > > From gmicha at gmail.com Sat Jun 2 14:07:43 2007 From: gmicha at gmail.com (Micha Sammeth) Date: Sat, 02 Jun 2007 23:07:43 +0200 Subject: [Genome] cds info on mRNA track (human) Message-ID: <4661DC1F.60608@sammeth.net> Hi, (thank you Rachel for the very detailed and exhaustive explanations to my previous questions!!) When downloading an gtf export of the human (hg17) mRNA track via the table browser, it does not include information about ORF positions (feature cds). I also checked the table and didnt find a field that sounds to store this information. However, in the browser track I can see "CDSs" (thick start/end). Did I do something wrong or is there another way to obtain this information? Thank you, micha. From hartera at soe.ucsc.edu Sat Jun 2 19:09:31 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Sat, 2 Jun 2007 19:09:31 -0700 (PDT) Subject: [Genome] cds info on mRNA track (human) In-Reply-To: <4661DC1F.60608@sammeth.net> References: <4661DC1F.60608@sammeth.net> Message-ID: Hello Micha, The GTF output is determined by the information in the PSL table (all_mrna) which does not include the CDS position. We just use the Genbank record CDS to determine how to draw the mRNAs in the Human mRNA track. This information can be found by referring to the gbCdnaInfo table. In the cds column, there is an integer which can be used to search the id column of the cds table. The CDS region from the Genbank record is shown in the name column. 
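The lookup described above is a two-table join: the integer in the cds column of gbCdnaInfo points at the id column of the cds table, and the CDS region sits in the name column. The sketch below mimics that in Python with an in-memory SQLite database so that it runs without access to the UCSC server. The acc column name, the accessions and the CDS strings are assumptions made up for illustration; only the cds, id and name columns come from the description above.

```python
import sqlite3

# Toy stand-ins for the two UCSC tables; all rows are made-up example data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gbCdnaInfo (acc TEXT, cds INTEGER);
    CREATE TABLE cds (id INTEGER, name TEXT);
    INSERT INTO gbCdnaInfo VALUES ('FAKE0001', 7), ('FAKE0002', 9);
    INSERT INTO cds VALUES (7, '101..1300'), (9, '55..760');
""")


def cds_for_accession(conn, acc):
    """Return the CDS string for an accession, or None if it is absent."""
    row = conn.execute(
        "SELECT c.name FROM gbCdnaInfo g JOIN cds c ON g.cds = c.id "
        "WHERE g.acc = ?",
        (acc,),
    ).fetchone()
    return row[0] if row else None


def parse_simple_range(cds_string):
    """Parse a plain 'start..end' string into two integers.

    Assumption: real records may use more complex location syntax,
    which this helper does not handle.
    """
    start, end = cds_string.split("..")
    return int(start), int(end)


region = cds_for_accession(conn, "FAKE0001")
print(region, parse_simple_range(region))
```

The parsed positions are relative to the mRNA sequence from the GenBank record, so they still have to be projected through the PSL alignment blocks to obtain genomic thick start/end coordinates.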
The RefSeq Genes and UCSC Genes (or Known Genes track on hg17) tables use genePred format, which includes the coordinates of the CDS region, so these will appear in the GTF output for these gene tables.

I hope that this helps you. Please let us know if you have further questions.

Rachel

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

On Sat, 2 Jun 2007, Micha Sammeth wrote:

> Hi,
>
> (thank you Rachel for the very detailed and exhaustive explanations to
> my previous questions!!)
>
> When downloading an gtf export of the human (hg17) mRNA track via the
> table browser, it does not include information about ORF positions
> (feature cds). I also checked the table and didnt find a field that
> sounds to store this information. However, in the browser track I can
> see "CDSs" (thick start/end). Did I do something wrong or is there
> another way to obtain this information?
>
> Thank you, micha.
>

From JSam at coh.org Fri Jun 1 16:58:42 2007
From: JSam at coh.org (Sam, Judy)
Date: Fri, 1 Jun 2007 16:58:42 -0700
Subject: [Genome] BLAT question
Message-ID:

I have just used BLAT for the first time, and discovered several nearly identical sequences to my query sequence within ~200 kb. This "cluster" of related sequences would not have been otherwise observable, since only my query sequence was a recognized gene. Is there any way of generating a list of genes that are recently duplicated in this way, for example, all such genes on mouse Chr 7 between 65 and 84 Mb?

Thank you for your help.

Judy Sam
City of Hope
Duarte, CA 91010
email: jsam at coh.org
Tel. 626-301-8241

"EMF " made the following annotations.
------------------------------------------------------------------------------
SECURITY/CONFIDENTIALITY WARNING: This message and any attachments are intended solely for the individual or entity to which they are addressed.
This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to receive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. ============================================================================== From rileen at gmail.com Sat Jun 2 05:55:20 2007 From: rileen at gmail.com (Rileen) Date: Sat, 2 Jun 2007 14:55:20 +0200 Subject: [Genome] The count of all annotated exons for a UCSC known gene ...... In-Reply-To: <46609C0D.4090703@cse.ucsc.edu> References: <46603A00.1040302@fli-leibniz.de> <466063DB.40908@cse.ucsc.edu> <6dc7bd4a0706011355l76b37505m4bda172a3cd66dad@mail.gmail.com> <46609C0D.4090703@cse.ucsc.edu> Message-ID: <6dc7bd4a0706020555j559e0a3cq5dbc770af5dd0e33@mail.gmail.com> Hi Ann, Thanks :-) I guess then I might as well use the refFlat.txt file (and the UCSC genes file), which gives the start and end positions of all exons, and for each gene, simply drop "duplicates" where at least one end of the exon has already been found in another (transcript's) exon. I was just being lazy :-} Have a nice weekend! 
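The "logical OR" count discussed in this thread can be sketched in a few lines of Python. This is only one interpretation: it pools the exons of every transcript of a gene and merges any that overlap, so exons that differ by one splice site collapse into a single region. It assumes comma-terminated exonStarts and exonEnds strings as found in refFlat or knownGene, with 0-based half-open coordinates; the function names and the example coordinates are hypothetical.

```python
def parse_coords(field):
    """Turn a comma-terminated string such as '100,500,' into a list of ints."""
    return [int(x) for x in field.split(",") if x]


def count_union_exons(transcripts):
    """Count distinct, non-overlapping exon regions across transcripts.

    transcripts: iterable of (exonStarts, exonEnds) string pairs that all
    belong to the same gene or cluster.
    """
    intervals = []
    for starts, ends in transcripts:
        intervals.extend(zip(parse_coords(starts), parse_coords(ends)))
    intervals.sort()
    merged = []
    for start, end in intervals:
        # Half-open coordinates: an exon overlaps the previous region only
        # if it starts strictly before that region ends.
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return len(merged)


# Made-up gene with three transcripts.
gene = [
    ("100,500,", "200,600,"),
    ("100,900,", "200,1000,"),
    ("150,500,1400,", "200,650,1500,"),
]
print(count_union_exons(gene))
```

Grouping the input rows by gene name (refFlat) or by clusterId (knownIsoforms) before calling the function would give one count per gene.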
Take care,
Yours,
Rileen

On 02/06/07, Ann Zweig wrote:
> Hello again, Rileen,
>
> Sorry for misunderstanding your question the first time.
> Unfortunately, there is no quick way to do this, but I can help you
> gather all the information you need, then you'll need to do one final
> step on your own.
>
> This pertains only to the UCSC Gene set. The Ref Seq Gene tables are
> set up differently.
>
> Each gene cluster is represented by a canonical gene noted in the
> knownCanonical table. Each cluster has a clusterId which also appears
> in the knownIsoforms table. Using your example of the EDA gene, the
> canonical gene is:
>
> mysql> SELECT * FROM knownCanonical WHERE transcript = 'uc004dxs.1'\G
> *************************** 1. row ***************************
>      chrom: chrX
> chromStart: 68752635
>   chromEnd: 69176046
>  clusterId: 25603
> transcript: uc004dxs.1
>    protein: uc004dxs.1
>
>
> So, the clusterId in this case is 25603. Now use this information to
> search the knownIsoforms table for all of the genes in this cluster:
>
> mysql> SELECT * FROM knownIsoforms WHERE clusterId = 25603;
> +-----------+------------+
> | clusterId | transcript |
> +-----------+------------+
> |     25603 | uc004dxq.1 |
> |     25603 | uc004dxn.1 |
> |     25603 | uc004dxm.1 |
> |     25603 | uc004dxl.1 |
> |     25603 | uc004dxp.1 |
> |     25603 | uc004dxo.1 |
> |     25603 | uc004dxs.1 |
> |     25603 | uc004dxr.1 |
> +-----------+------------+
>
>
> Here are the 8 genes that you see displayed in the UCSC Gene track for
> "EDA". Now you can search the knownGene table to get a list of the
> exonStarts (and exonEnds if needed). For this, I would suggest using
> our public mysql server.
You can access the server using this command > from a prompt: > > mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A > use hg18 > > To get a list of exonStarts for the set of genes in a cluster, you can > issue this command: > > mysql> SELECT clusterId, exonStarts FROM knownGene g, knownIsoforms i > WHERE g.name = i.transcript and i.clusterId = 25603; > +-----------+--------------------------------------------------------------------------+ > | clusterId | exonStarts | > +-----------+--------------------------------------------------------------------------+ > | 25603 | 68752635,69093601, | > | 25603 | 68752635,68997412, | > | 25603 | 68752635,68806778,68997412, | > | 25603 | 68752635, | > | 25603 | 68752635,68757464,69093601, | > | 25603 | 68752635,68865711,68997412,69093601, | > | 25603 | > 68752635,69093601,69159792,69164431,69166078,69167043,69169972,69171932, | > | 25603 | > 68752635,69093601,69159792,69164431,69166078,69167043,69169972,69171932, | > +-----------+--------------------------------------------------------------------------+ > 8 rows in set (0.04 sec) > > > This gives you your list of exonStarts for your cluster. To get the > exonStarts for *all* gene clusters, simply leave off the WHERE part that > limits it to a specific cluster. > > Once you have this list, you will need to write a perl script (or the > like) to determine the number of unique number of exons for each cluster > of genes. > > Please be sure to write back if this is unclear. > > Regards, > > ---------- > Ann Zweig > UCSC Genome Bioinformatics Group > http://genome.ucsc.edu > > > > > Rileen wrote: > > Hi Ann, > > Thanks. I'm sorry, my question was ambiguous, I already knew > > how to do what you said. I wanted the number of exons per > > loci/"canonical" gene, > > which is why I said > > > > "By that I mean all exons that appear in at least one transcript - a > > sort of > > "logical OR" of all exons in all transcripts." > > > > which wasn't very clear, I admit. 
> > > > To clarify - let's take an example, say the gene EDA. There are 8 Refseqs, > > with varying numbers of exons - between 1 and 3. However, when you look > > at them together, there 5 exons in this locus, each of which appears in > > at least > > one Refgene. These are not exons which differ from each other by one > > splice site > > etc., these are distinct, non-overlapping exons. > > > > It is this number - 5 in this example - that I'm after. > > > > Take care, > > Yours, > > Rileen > > > > > > On 01/06/07, Ann Zweig wrote: > >> Hello Rileen, > >> > >> You can get the information you want by using the Table > >> Browser tool > >> ('Tables' in the blue navigation bar). Configure it like so: > >> > >> clade: Vertebrate > >> genome: Human > >> assembly: Mar 2006 > >> group: Genes & Gene Prediction tracks > >> track: UCSC Genes > >> table: knownGene > >> region: genome > >> output format: selected fields from primary and related tables > >> > >> Press "get output" button. > >> > >> Then, on the next page choose: > >> name > >> exonCount > >> (and any other fields you would like) > >> > >> Press "get output" button to see your list. > >> > >> Do the same for the refGene table to get the exon counts for > >> all of the > >> Ref Seq Genes. > >> > >> Regards, > >> > >> ---------- > >> Ann Zweig > >> UCSC Genome Bioinformatics Group > >> http://genome.ucsc.edu > >> > >> Please feel free to search the Genome mailing list archives by visiting > >> our home page, clicking on "Contact Us", then typing a word or phrase > >> into the search box. On that same page > >> (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome > >> mailing list. > >> > >> > >> > >> > >> Rileen Sinha wrote: > >> > Hi, > >> > Is there some quick/convenient way of obtaining a list of the > >> number > >> > of annotated exons in each given gene in the UCSC known gene > >> collection? 
> >> > > >> > By that I mean all exons that appear in at least one transcript - a > >> sort of > >> > "logical OR" of all exons in all transcripts. > >> > > >> > Is there also a corresponding solution for RefSeq genes? > >> > > >> > Thanks! > >> > > >> > Take care, > >> > Yours, > >> > Rileen > >> > > >> > From feyuval at wisemail.weizmann.ac.il Sun Jun 3 08:53:18 2007 From: feyuval at wisemail.weizmann.ac.il (Yuval Tabach) Date: Sun, 03 Jun 2007 18:53:18 +0300 Subject: [Genome] construct human promoters Message-ID: <46630FF8.B6A7.007D.0@wisemail.weizmann.ac.il> Hello 1. I would like to download all Human promoter from 1000bp upstream to 200bp down stream, from where and how can I download them? 2. Are you going to have new version of the human/mouse genome ? if so when? Thanks yuval Thabach From jing_gao at agilent.com Sun Jun 3 11:28:37 2007 From: jing_gao at agilent.com (jing_gao at agilent.com) Date: Sun, 3 Jun 2007 12:28:37 -0600 Subject: [Genome] question about URL construction In-Reply-To: <0b2201c78dc4$d4796360$0ba8a8c0@donnakLT> Message-ID: <1386D91EC3168B4D9E1F7FA9AE03B25C0149D401@wcosmb02.cos.agilent.com> Hi, Thanks for the help on this topic in the past. I'm still having a little problem here with constructing the URL and have the bed file loaded up to your browser properly. I'm attaching my test bed file here. When I loads it directly into your browser by going to your web site and do it manually, everything is beautiful. This is the UCSC URL I'm seeing from the browser: http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=93311163&Submit=go+to+genome+browser&position=chrX%3A000000007-000000054 When I load it in using a constructed URL, there are 2 problems. This is my constructed URL: http://genome.ucsc.edu/cgi-bin/hgTracks??org=human&position=chrX&hgt.customText=https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed Problems: 1. 
How do I add the position like: position=chrX%3A000000007-000000054 This gives the preferred resolution, but what is "3A"? How do I get that string for all chromosomes and all organisms? 2. Bigger problem, if you click on my URL you may see, on top of the page, there is an error line: Unrecognized format line 1 of custom track: https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed (note: chrom names are case sensitive) Why? The same file loads OK within your browser, why not with my URL? A million thanks. Jing <> Jing Gao MD PhD Sr. R&D Engineer/Scientist Agilent Technologies 5301 Stevens Creek Rd Santa Clara, CA 95051 Tel: (408) 553-2571 Email: jing_gao at agilent.com -----Original Message----- From: Donna Karolchik [mailto:donnak at soe.ucsc.edu] Sent: Thursday, May 03, 2007 1:47 PM To: Galt Barber Cc: jing_gao at agilent.com; Genome Subject: Re: [Genome] question about URL construction You have to be a bit careful when you use the db parameter -- it depends on what you want from your URL. If you want the URL to always point to that specific assembly version, it's better to use the db name. However, if you want your URL to always point to the latest assembly version for a genome (i.e. automatically use the newer version if an assembly is updated), then you should use the org name. -Donna ----- Original Message ----- From: "Galt Barber" To: "Donna Karolchik" Cc: ; Sent: Thursday, May 03, 2007 1:40 PM Subject: Re: [Genome] question about URL construction > > You can usually just specify the db (e.g. hg18) in the URL > (URL=http://server/path/...&db=hg18...), and the system > will figure out the org automatically. > > -Galt > > > On Thu, 3 May 2007, Donna Karolchik wrote: > >> hi Jing, >> >> To add to what Galt says, if you are using one of the org names >> that contains white space (e.g. D. melanogaster), be sure to >> put a >> "+" in place of the white space, i.e. D.+melanogaster. 
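On the "3A" question and the "+" convention mentioned in this thread: "%3A" is simply the percent-encoded form of the colon in a position such as chrX:7-54, and "+" is the form-encoded space, so neither string needs to be looked up per chromosome or per organism. A minimal Python sketch that builds such a URL with the standard library is shown below. The helper name is made up, and the sketch only covers the org and position parameters that appear in this thread; it makes no claim about the cause of the custom track error.

```python
from urllib.parse import urlencode

BASE = "http://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_url(org, position):
    """Build an hgTracks URL, letting urlencode escape spaces and colons."""
    # urlencode turns ' ' into '+' and ':' into '%3A' automatically.
    return BASE + "?" + urlencode({"org": org, "position": position})


print(browser_url("D. melanogaster", "chrX:7-54"))
```

The same call works unchanged for any organism name or chromosome, because the escaping depends only on the characters in the values.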
>> >> -Donna >> ----------------------------------- >> Donna Karolchik >> UCSC Genome Bioinformatics Group >> http://genome.ucsc.edu >> >> >> ----- Original Message ----- >> From: "Galt Barber" >> To: >> Cc: >> Sent: Thursday, May 03, 2007 1:31 PM >> Subject: Re: [Genome] question about URL construction >> >> >> > >> > Check out name, description, organism, etc. in dbDB in this >> > file: >> > >> > http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql >> > >> > -Galt >> > >> > >> > On Thu, 3 May 2007 jing_gao at agilent.com wrote: >> > >> >> Hi, >> >> >> >> >> >> >> >> I'm trying to figure out how to construct an URL to share my >> >> annotation tracks with others. I'm following your >> >> instruction >> >> on >> >> http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#SHARE, >> >> >> >> Which works well for human. And I could figure out org=mouse >> >> and org=rat. >> >> >> >> >> >> >> >> But could you tell me the String values I should use for >> >> "org=" >> >> in the case for the following species: >> >> >> >> >> >> >> >> A. thaliana >> >> >> >> C. elegans >> >> >> >> D. melanogaster >> >> >> >> D. rerio >> >> >> >> G. gallus >> >> >> >> H. sapiens human >> >> >> >> M. musculus mouse >> >> >> >> R. norvegicus rat >> >> >> >> S. cerevisiae >> >> >> >> S. pombe >> >> >> >> X. tropicalis >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Thank you very much. 
>> >> >> >> >> >> >> >> Jing >> >> >> >> _______________________________________________ >> >> Genome maillist - Genome at soe.ucsc.edu >> >> http://www.soe.ucsc.edu/mailman/listinfo/genome >> >> >> > _______________________________________________ >> > Genome maillist - Genome at soe.ucsc.edu >> > http://www.soe.ucsc.edu/mailman/listinfo/genome >> > >> > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From anton.kratz at googlemail.com Sun Jun 3 18:48:52 2007 From: anton.kratz at googlemail.com (Anton Kratz) Date: Mon, 4 Jun 2007 10:48:52 +0900 Subject: [Genome] Coding Sequence [hg17] Message-ID: <68c7994b0706031848j4d0983eci7230a74e49cbd5f7@mail.gmail.com> Dear UCSC Team, I'd like to ask what the definition of coding sequence in the UCSC Known Gene table is? Is Coding Sequence the transcribed or the translated sequence? I wonder because often there are several exons outside the CDS, espexcially in the non-CDS region from the beginning of the gene. One example is BC020953: BC020953 chr11 + 68100015 68139375 68120240 68137161 10 68100015,68112021,68114703,68115619,68120140,68124384,68125906,68127385,68133947,68137109, 68100087,68112068,68114758,68115750,68120262,68124538,68126013,68127536,68134067,68139375, Q7L4R8_HUMAN G173459 Best regards, Anton From elevanon at genetics.med.harvard.edu Sun Jun 3 20:19:29 2007 From: elevanon at genetics.med.harvard.edu (Erez Levanon) Date: Sun, 3 Jun 2007 23:19:29 -0400 Subject: [Genome] additional genomes Message-ID: <8a118b630706032019t451f8912j38306c7f88f8ed5e@mail.gmail.com> Hi I find the UCSC genome browser as an extremely useful data source. i wonder if you have a tentative list of the next additional genomes that are going to enter to the browser ? 
Thanks Erez -- Erez Levanon Postdoctoral Fellow George Church's laboratory Department of Genetics Harvard Medical School Harvard University From sungm at mail.nih.gov Mon Jun 4 05:52:29 2007 From: sungm at mail.nih.gov (Sung, Myong-Hee (NIH/NCI) [E]) Date: Mon, 4 Jun 2007 08:52:29 -0400 Subject: [Genome] Sequence retrieval Message-ID: Hi, I'm trying to retrieve genomic sequences by their BED format genomic coordinates. I uploaded my file of genomic coordinates onto the 'defined regions' in Table Browser and specified settings (Mouse; Feb 2006; All Tables; mm8; gold; sequence output). But what I get is genomic sequences covering much larger regions than my chromosome coordinates, perhaps something that could be contigs. I've tried querying a single region but had the same problem. My regions are genomic, not necessarily promoter or coding regions. I've browsed through your Q&A but couldn't figure it out. I'm guessing I should specify certain retrieval settings differently? What will do the trick? Thanks in advance. Mia Myong-Hee Sung, Ph.D. Laboratory of Receptor Biology and Gene Expression National Cancer Institute, NIH 41 Library Drive, Bldg 41, Rm B602 Bethesda, MD 20892 Office: 301-402-0364 Fax: 301-496-4951 email: sungm at mail.nih.gov http://ccr.cancer.gov/staff/staff.asp?profileid=11633 From CKOZAK at niaid.nih.gov Mon Jun 4 06:10:00 2007 From: CKOZAK at niaid.nih.gov (Kozak, Christine (NIH/NIAID) [E]) Date: Mon, 4 Jun 2007 09:10:00 -0400 Subject: [Genome] question/mistake? Message-ID: <0413F0716179994A9ED9995CA4020B3C01B356A0@NIHCESMLBX5.nih.gov> http://www.cbse.ucsc.edu/research/research_human_chimp_mouse.shtml This chart on your website shows the blocks of conserved synteny in human and chimp. It has been shown that two chimp chromosomes fused to form human chromosome 2. So human chromosome 2 should have two blocks of color corresponding to two chimp chromosomes. Also, chimps should have 1 more chromosome than humans. Christine Kozak, Ph. D. 
Viral Biology Section
LMM, NIAID, NIH
Bldg. 4, Room 329
Bethesda, MD
TEL: 301-496-0972
FAX: 301-480-6477

Disclaimer:
The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. The National Institute of Allergy and Infectious Diseases (NIAID) shall not accept liability for any statement made that are the sender's own and not expressly made on behalf of the NIAID by one of its representatives.

From kuhn at soe.ucsc.edu Mon Jun 4 10:07:21 2007
From: kuhn at soe.ucsc.edu (Robert Kuhn)
Date: Mon, 4 Jun 2007 10:07:21 -0700
Subject: [Genome] question/mistake?
Message-ID: <200706041707.KAA24751@moondance.cse.ucsc.edu>

Dr. Kozak,

Thanks for using our site. We are aware of the human chrom 2 color issue. This is an artifact of a change made by the sequencing consortium responsible for the chimp assembly. Beginning with the chimp assembly we call panTro2, they have chosen to renumber the chromosomes formerly called chr12 and chr13. These are now called a more human-centric chr2a and chr2b, so they appear as the same color in the browser and on the page you cite.

There has been some discussion within our group about how to make the display more graphically represent the situation, but we have not yet implemented any such change. On the previous human assembly (our hg17), which has alignments to the previous chimp assembly (panTro1), the color difference is clearly visible.

compare:

hg17:
http://genome.cse.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg17_chr2_chimp

hg18:
http://genome.cse.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg18_chr2_chimp

We're sorry for any confusion this may have caused.
best wishes, --b0b kuhn ucsc genome bioinformatics group > From genome-bounces at soe.ucsc.edu Mon Jun 4 09:03:59 2007 > To: > Subject: [Genome] question/mistake? > > http://www.cbse.ucsc.edu/research/research_human_chimp_mouse.shtml > > > > This chart on your website shows the blocks of conserved synteny in > human and chimp. It has been shown that two chimp chromosomes fused to > form human chromosome 2. So human chromosome 2 should have two blocks > of color corresponding to two chimp chromosomes. Also, chimps should > have 1 more chromosome than humans. > > > > Christine Kozak, Ph. D. > > Viral Biology Section > > LMM, NIAID, NIH > > Bldg. 4, Room 329 > > Bethesda, MD > > TEL: 301-496-0972 > > FAX: 301-480-6477 > > > > > > Disclaimer: > > The information in this e-mail and any of its attachments is > confidential and may contain sensitive information. It should not be > used by anyone who is not the original intended recipient. If you have > received this e-mail in error please inform the sender and delete it > from your mailbox or any other storage devices. The National Institute > of Allergy and Infectious Diseases (NIAID) shall not accept liability > for any statement made that are the sender's own and not expressly made > on behalf of the NIAID by one of its representatives. > > > > > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From heather at soe.ucsc.edu Mon Jun 4 10:38:10 2007 From: heather at soe.ucsc.edu (Heather Trumbower) Date: Mon, 4 Jun 2007 10:38:10 -0700 (PDT) Subject: [Genome] repeats in bosTau2.0 vs bosTau3.0 In-Reply-To: References: Message-ID: Joao: My colleague Rachel has forwarded this inquiry to me, I'll be working with you going forward. Thanks very much for your report. We always use the -species and the -s (sensitive) parameters to RepeatMasker. We don't use the -nolow option. 
I obtained the Ensembl bosTau3 repeats from ftp.ensembl.org in the file pub/release-44/bos_taurus_44_3a/data/mysql/bos_taurus_core_44_3a/repeat_feature.txt.table.gz. I used the bos_taurus_core_44_3a.sql file from the same directory to get the description of the repeat_feature table: `repeat_feature_id` int(10) unsigned NOT NULL auto_increment, `seq_region_id` int(10) unsigned NOT NULL default '0', `seq_region_start` int(10) unsigned NOT NULL default '0', `seq_region_end` int(10) unsigned NOT NULL default '0', `seq_region_strand` tinyint(1) NOT NULL default '1', `repeat_start` int(10) NOT NULL default '0', `repeat_end` int(10) NOT NULL default '0', `repeat_consensus_id` int(10) unsigned NOT NULL default '0', `analysis_id` smallint(5) unsigned NOT NULL default '0', `score` double default NULL, Next, I obtained the seq_region.txt.table.gz file, and used that to translate from contigs to chromosomes, using the content of our gold table. I loaded this into a track called EnsRep, in my development browser at http://hgwdev-heather.cse.ucsc.edu. To review it, I've been walking through chr1 and chr29, one contig at a time. What I see is definitely a concern. For the first few contigs (chr1 first 3, chr29 first 6), the two tracks are generally concordant. However, past that, they become wildly different. The fact that the first few look okay gives me confidence that I've translated the Ensembl repeats correctly. Could you take a look at the EnsRep track in the bosTau3 browser at http://hgwdev-heather.cse.ucsc.edu, and confirm that it matches the data that you are working with? Please let me know; that will help me decide on the best next step. One thing I'm strongly considering is rerunning RepeatMasker with -nolow to see if I get results that match Ensembl much more closely. Heather Trumbower UCSC Genome Bioinformatics Group On Thu, 31 May 2007, João Fadista wrote: > Dear Rachel Harte, > > Thanks for the useful reply.
I have now compared the number of repetitive events in chr1 > from Ensembl against the ones from UCSC genome browser: > > - For bosTau2, UCSC has 164666 repetitive events while ENSEMBL has 209523 > - For bosTau3, UCSC has 272914 repetitive events while ENSEMBL has 374749. > > My question now is why the numbers from ENSEMBL and UCSC are so different. > I asked the same question to the people at Ensembl and they said the following: > > "The only thing I can say is that Ensembl ran Repeatmasker with the parameters -nolow > -species cow using the repbase from 20050129. For btau3 also the > parameter -s (0-5% more sensitive) was used. I don't know what UCSC > exactly does." > > > > Kind regards, > João Fadista > > > ________________________________ > > From: Rachel Harte [mailto:hartera at soe.ucsc.edu] > Sent: Thu 31-05-2007 18:17 > To: João Fadista > Cc: genome at soe.ucsc.edu > Subject: Re: [Genome] repeats in bosTau2.0 vs bosTau3.0 > > > > Hello Joao, > > I took a look at the repeats in chr1 for bosTau2 and bosTau3 and I get > these results when using one of our programs: > > For bosTau2: >> faSize chr1.fa > 102834029 bases (20863006 N's 81971023 real 45985884 upper 35985139 lower) > in 1 sequences in 1 files > %34.99 masked total, %43.90 masked real > > For bosTau3: > faSize chr1.fa > 146199855 bases (9578504 N's 136621351 real 71742580 upper 64878771 lower) > in 1 sequences in 1 files > %44.38 masked total, %47.49 masked real > > Note also that there is a difference in size between chr1 of bosTau2 > and bosTau3 which could contribute to differences in the amount of repeats > in chr1 between the two assemblies. > > So when you look at the percentage of bases overall that are masked for > chr1 of each of the assemblies (bosTau2 and bosTau3), the amount of masked > sequence is quite different. However, if you just look at the percentage > of non-N sequence (so this is excluding all the gaps in the sequence), > then the percentage of repeats is very similar.
So the difference in > results that you are finding is due to there being a larger amount of gaps > (Ns) in the sequence for chr1 of bosTau2 (20,863,006 Ns) compared to chr1 > of bosTau3 (9,578,504 Ns). > > I hope that this helps you. Please let us know if you have further > questions. > > Rachel > > Rachel Harte > UCSC Genome Bioinformatics Group > http://genome.ucsc.edu > > > On Thu, 31 May 2007, João Fadista wrote: > >> Hello, >> >> I would like to ask you why do I get such a difference between the length of repeats in chr1 of the cow genome for the versions 2.0 and 3.0: >> >> - In bosTau2.0 the length of chr1 is 102,834,029 bp with 36,085,051 bp >> of repeats (35.1%) >> - In bosTau3.0 the length of chr1 is 146,199,855 bp with 65,482,014 bp of >> repeats (44.8%) >> >> To retrieve the position of the repeats I used the group "Variation and Repeats" and the track "Repeat Masker". >> >> >> >> Best regards >> >> João Fadista >> Ph.d. student >> >> >> >> UNIVERSITY OF AARHUS >> Faculty of Agricultural Sciences >> Dept. of Genetics and Biotechnology >> Blichers Allé 20, P.O. BOX 50 >> DK-8830 Tjele >> >> Phone: +45 8999 1900 >> Direct: +45 8999 1900 >> E-mail: Joao.Fadista at agrsci.dk >> Web: www.agrsci.org >> ________________________________ >> >> News and news media . >> >> This email may contain information that is confidential. Any use or publication of this email without written permission from Faculty of Agricultural Sciences is not allowed. If you are not the intended recipient, please notify Faculty of Agricultural Sciences immediately and delete this email.
>> >> _______________________________________________ >> Genome maillist - Genome at soe.ucsc.edu >> http://www.soe.ucsc.edu/mailman/listinfo/genome >> > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From kuhn at soe.ucsc.edu Mon Jun 4 10:46:09 2007 From: kuhn at soe.ucsc.edu (Robert Kuhn) Date: Mon, 4 Jun 2007 10:46:09 -0700 Subject: [Genome] additional genomes Message-ID: <200706041746.KAA25320@moondance.cse.ucsc.edu> Hello, Erez, we're happy that you find the browser useful. At the present time we are nearing the end of a flurry of new assembly releases, including since January, fugu, medaka, stickleback, cat, horse, lizard and platypus. We have five worm assemblies and a new D. melanogaster undergoing final quality assurance checks prior to release, probably within a couple of weeks. Additionally, on a longer timeframe, we expect to release updated cow and urchin assemblies and possibly, in the fall, a new mouse. best wishes, --b0b kuhn ucsc genome bioinformatics group > From genome-bounces at soe.ucsc.edu Mon Jun 4 09:05:01 2007 > To: genome at soe.ucsc.edu > Subject: [Genome] additional genomes > > Hi > I find the UCSC genome browser as an extremely useful data source. > i wonder if you have a tentative list of the next additional genomes that > are going to enter to the browser ?
> Thanks > Erez > > > -- > Erez Levanon > Postdoctoral Fellow > George Church's laboratory > Department of Genetics > Harvard Medical School > Harvard University > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From archanat at soe.ucsc.edu Mon Jun 4 10:59:36 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Mon, 04 Jun 2007 10:59:36 -0700 Subject: [Genome] construct human promoters In-Reply-To: <46630FF8.B6A7.007D.0@wisemail.weizmann.ac.il> References: <46630FF8.B6A7.007D.0@wisemail.weizmann.ac.il> Message-ID: <46645308.6000406@soe.ucsc.edu> Hello Yuval, 1) We provide zipped files of the 1000, 2000, and 5000 bases upstream of the annotated transcription start of RefSeq genes on our downloads server at: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ From the README file: upstream1000.zip - Sequences 1000 bases upstream of annotated transcription start of RefSeq genes. This includes only the cases where the transcription start is annotated separately from the coding region start. Note that upstream files are generated only when an assembly is released. Therefore, the data may be slightly out of synch with the RefSeq data in assemblies that are incrementally updated nightly. You could also download the sequences upstream and downstream for the Known Genes and RefSeq Genes using the Table Browser. Please see these previously answered mailing list questions, which are similar to yours: http://www.soe.ucsc.edu/pipermail/genome/2006-December/012449.html http://www.soe.ucsc.edu/pipermail/genome/2006-December/012402.html 2) We are planning to release a new version of the mouse assembly, probably this summer or fall. However, no new version of the human assembly is currently scheduled. I suggest you check with NCBI about this. We release after they make it public. I hope this information is helpful to you.
Please don't hesitate to contact us again if you require further assistance. Regards, Archana UCSC Genome Bioinformatics Group Yuval Tabach wrote: > Hello > > 1. I would like to download all Human promoter from 1000bp upstream to 200bp down stream, > from where and how can I download them? > > 2. Are you going to have new version of the human/mouse genome ? if so when? > > > Thanks > yuval Thabach > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From fanhsu at soe.ucsc.edu Mon Jun 4 11:07:37 2007 From: fanhsu at soe.ucsc.edu (Fan Hsu) Date: Mon, 4 Jun 2007 11:07:37 -0700 Subject: [Genome] Coding Sequence [hg17] In-Reply-To: <68c7994b0706031848j4d0983eci7230a74e49cbd5f7@mail.gmail.com> Message-ID: Hi Anton, The knownGene table contains genomic positions of a gene aligned to the base genome. Positions of exons are defined by exonStarts and exonEnds. Please note that UTRs are also counted as exons. Coding exons are defined by the portion of exons that are between cdsStart and cdsEnd. UTRs are exon regions outside cdsStart/cdsEnd. mysql> select * from knownGene where name='BC020953'\G *************************** 1. row *************************** name: BC020953 chrom: chr11 strand: + txStart: 68100015 txEnd: 68139375 cdsStart: 68120240 cdsEnd: 68137161 exonCount: 10 exonStarts: 68100015,68112021,68114703,68115619,68120140,68124384,68125906,68127385,68133947,68137109, exonEnds: 68100087,68112068,68114758,68115750,68120262,68124538,68126013,68127536,68134067,68139375, proteinID: Q7L4R8_HUMAN alignID: G173459 *************************** 2. row *************************** ... You can get mRNA sequence of a Known Gene via the knownGeneMrna table and the AA sequence of a Known Gene via the knownGenePep table. I suggest you use hg18 instead of hg17. Fan.
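The rule described in the reply above (coding exons are the parts of exons that fall between cdsStart and cdsEnd) can be sketched in a few lines of Python. This is only an illustration, not UCSC code: the function names `parse_coords` and `coding_blocks` are made up for this example, the values are copied from the BC020953 row shown in the message, and the sketch assumes the usual UCSC table convention of 0-based, half-open coordinates.

```python
def parse_coords(field):
    """Parse a comma-separated coordinate list such as exonStarts.

    Whitespace inside the field is treated as a line-wrap artifact and
    removed; the trailing comma used by the table format is ignored.
    """
    return [int(tok) for tok in field.replace(" ", "").split(",") if tok]


def coding_blocks(exon_starts, exon_ends, cds_start, cds_end):
    """Return the (start, end) part of each exon that lies inside the CDS."""
    blocks = []
    for start, end in zip(parse_coords(exon_starts), parse_coords(exon_ends)):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo < hi:  # the exon overlaps the coding region
            blocks.append((lo, hi))
    return blocks


# Values from the BC020953 row quoted in the message above.
EXON_STARTS = ("68100015,68112021,68114703,68115619,68120140,"
               "68124384,68125906,68127385,68133947,68137109,")
EXON_ENDS = ("68100087,68112068,68114758,68115750,68120262,"
             "68124538,68126013,68127536,68134067,68139375,")
CDS_START, CDS_END = 68120240, 68137161

if __name__ == "__main__":
    for lo, hi in coding_blocks(EXON_STARTS, EXON_ENDS, CDS_START, CDS_END):
        print(lo, hi)
```

For this row the first four exons end before cdsStart, so they are entirely 5' UTR; the fifth and tenth exons are clipped at cdsStart and cdsEnd respectively. A row with cdsStart equal to cdsEnd yields no coding blocks under this sketch.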
-----Original Message----- From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On Behalf Of Anton Kratz Sent: Sunday, June 03, 2007 6:49 PM To: genome at soe.ucsc.edu Subject: [Genome] Coding Sequence [hg17] Dear UCSC Team, I'd like to ask what the definition of coding sequence in the UCSC Known Gene table is? Is Coding Sequence the transcribed or the translated sequence? I wonder because often there are several exons outside the CDS, especially in the non-CDS region from the beginning of the gene. One example is BC020953: BC020953 chr11 + 68100015 68139375 68120240 68137161 10 68100015,68112021,68114703,68115619,68120140,68124384,68125906,68127385,68133947,68137109, 68100087,68112068,68114758,68115750,68120262,68124538,68126013,68127536,68134067,68139375, Q7L4R8_HUMAN G173459 Best regards, Anton _______________________________________________ Genome maillist - Genome at soe.ucsc.edu http://www.soe.ucsc.edu/mailman/listinfo/genome From hwang56 at jhmi.edu Mon Jun 4 10:39:07 2007 From: hwang56 at jhmi.edu (Hang Wang) Date: Mon, 04 Jun 2007 13:39:07 -0400 Subject: [Genome] search transcripts with alternative splicing or polyadenylation Message-ID: <466415FB020000E2000120ED@cis27.hosts.jhmi.edu> Hello! I want to ask "how to search alternative splicing or polyadenylation" for a particular gene from genome.ucsc.edu website. Thank you. Hang Wang Tel: 4105026868 From archanat at soe.ucsc.edu Mon Jun 4 11:36:45 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Mon, 04 Jun 2007 11:36:45 -0700 Subject: [Genome] Sequence retrieval In-Reply-To: References: Message-ID: <46645BBD.2050403@soe.ucsc.edu> Hello Mia, When you upload the genomic coordinates using the 'defined regions' feature in the Table Browser and retrieve the sequences by setting the 'table:gold', you are actually getting the sequences for the entire contigs that are included in your region of interest.
In order to get the sequences for your genomic coordinates, you could first upload your BED file as a custom track, using the button "add custom tracks" on the Genome Browser gateway. The custom track user's guide is here: http://www.genome.ucsc.edu/goldenPath/help/customTrack.html Then on the 'Table Browser', select 'Custom Tracks' as group and the corresponding track and table. Then choose 'region: genome'; 'output format: sequence' and hit 'get output'. I hope that this helps you. Please let us know if you have further questions. Regards, Archana UCSC Genome Bioinformatics Group Sung, Myong-Hee (NIH/NCI) [E] wrote: > Hi, > > I'm trying to retrieve genomic sequences by their BED format genomic > coordinates. I uploaded my file of genomic coordinates onto the > 'defined regions' in Table Browser and specified settings (Mouse; Feb > 2006; All Tables; mm8; gold; sequence output). But what I get is > genomic sequences covering much larger regions than my chromosome > coordinates, perhaps something that could be contigs. I've tried > querying a single region but had the same problem. My regions are > genomic, not necessarily promoter or coding regions. > I've browsed through your Q&A but couldn't figure it out. I'm guessing > I should specify certain retrieval settings differently? What will do > the trick? > > Thanks in advance. > > Mia > > Myong-Hee Sung, Ph.D. 
> Laboratory of Receptor Biology and Gene Expression > National Cancer Institute, NIH > 41 Library Drive, Bldg 41, Rm B602 > Bethesda, MD 20892 > > Office: 301-402-0364 > Fax: 301-496-4951 > email: sungm at mail.nih.gov > http://ccr.cancer.gov/staff/staff.asp?profileid=11633 > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From Joao.Fadista at agrsci.dk Mon Jun 4 13:24:42 2007 From: Joao.Fadista at agrsci.dk (João Fadista) Date: Mon, 4 Jun 2007 22:24:42 +0200 Subject: [Genome] repeats in bosTau2.0 vs bosTau3.0 In-Reply-To: References: Message-ID: Dear Heather Trumbower, Thank you very much for taking care of this "problem". I think that perhaps it is better if I forward to you the file that the people at Ensembl sent to me. Therefore in attachment I put the repeats file for chr1 bosTau3.0 and if you want I can also send you the file from chr1 bosTau2.0. Kind regards, Joao Fadista
From archanat at soe.ucsc.edu Mon Jun 4 14:27:21 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Mon, 04 Jun 2007 14:27:21 -0700 Subject: [Genome] search transcripts with alternative splicing or polyadenylation In-Reply-To: <466415FB020000E2000120ED@cis27.hosts.jhmi.edu> References: <466415FB020000E2000120ED@cis27.hosts.jhmi.edu> Message-ID: <466483B9.6070806@soe.ucsc.edu> Hello Hang Wang, You can get this information from the 'Alt Events' track under 'Genes and Gene Prediction Tracks' and the 'Poly(A)' track under the 'mRNA and EST tracks', on the recent Human assembly. To get to these, first open the Genome Browser to the Human assembly and then enter the name of the gene into the position/search box. Then turn on the 'Alt Events' and 'Poly(A)' tracks in the browser. You can read more about these tracks by clicking the "mini-button" to the left of each track in the main Genome Browser display (the tall, gray or blue box on the very far left of the display), or by clicking the link directly above the track control (the drop-down box that controls the visibility of the track). Information on the polyA tail insertion can also be obtained from the 'Human mRNAs' track, by turning on this track and then checking the box in the track settings page that says: Draw a vertical green line where query has a polyA tail insertion More information on the poly-A tail coloring is here: http://genome.ucsc.edu/goldenPath/help/hgIndelDisplay.html I hope that this helps you. If you have further questions please do not hesitate to contact us again. Regards, Archana UCSC genome Bioinformatics Group Hang Wang wrote: > Hello!
> > I want to ask "how to search alternative splicing or polyadenylation" for a particular gene from genome.ucsc.edu website. > > Thank you. > > Hang Wang > > Tel: 4105026868 > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From ann at soe.ucsc.edu Mon Jun 4 14:29:45 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Mon, 04 Jun 2007 14:29:45 -0700 Subject: [Genome] question about URL construction In-Reply-To: <1386D91EC3168B4D9E1F7FA9AE03B25C0149D401@wcosmb02.cos.agilent.com> References: <1386D91EC3168B4D9E1F7FA9AE03B25C0149D401@wcosmb02.cos.agilent.com> Message-ID: <46648449.3020904@cse.ucsc.edu> Hello Jing, Sorry to hear that you are still having problems constructing a URL to use for a custom track. Let me break down each of your questions and I think we can work this out. Answer to Problem 1. The position to which you would like the browser to open when you first view your custom track is defined in a 'browser line' in your custom track data file. For example, if you would like your custom track to open at the position you suggest, add this to the beginning of your data file: browser position chrX:000000007-000000054 The "%3A" is the ASCII Hex symbol for ":", it is not necessary for you to add that to the data file. You do not need to add the positional information to the URL that you construct -- the software will use the starting position from your data file. Read more about browser lines here: http://genome.ucsc.edu/goldenPath/help/customTrack.html#BROWSER Answer to Problem 2. The secure website on which your data are located does not appear to be accessible from outside of Agilent. When I tried to access it directly, my browser timed out. This could be due to a firewall at Agilent, or other privacy measures, or the data file could just be too large for my browser to open. 
At any rate, my Internet browser gives a "page cannot be displayed" error when trying to access your data here: https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed Your data must be located on a server that is accessible by our servers. If a login and password are required to access data loaded through a URL, this information can be included in the URL using the format protocol://user:password@server.com/somepath. Only Basic Authentication is supported for HTTP. Note that passwords included in URLs are not protected. If a password contains a non-alphanumeric character, such as @, the character must be replaced by the hexadecimal representation for that character. For example, in the password mypwd@wk, the @ character should be replaced by %40, resulting in the modified password mypwd%40wk. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu jing_gao at agilent.com wrote: > Hi, > > Thanks for the help on this topic in the past. I'm still having a little problem here with constructing the URL and have the bed file loaded up to your browser properly. > > I'm attaching my test bed file here. When I loads it directly into your browser by going to your web site and do it manually, everything is beautiful. This is the UCSC URL I'm seeing from the browser: > http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=93311163&Submit=go+to+genome+browser&position=chrX%3A000000007-000000054 > > > When I load it in using a constructed URL, there are 2 problems. This is my constructed URL: > http://genome.ucsc.edu/cgi-bin/hgTracks??org=human&position=chrX&hgt.customText=https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed > > > Problems: > > 1. How do I add the position like: position=chrX%3A000000007-000000054 > > This gives the preferred resolution, but what is "3A"? How do I get that string for all chromosomes and all organisms? > > > 2.
Bigger problem, if you click on my URL you may see, on top of the page, there is an error line: > > Unrecognized format line 1 of custom track: https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed (note: chrom names are case sensitive) > > > Why? The same file loads OK within your browser, why not with my URL? > > > A million thanks. > > > Jing > > > > <> > > Jing Gao MD PhD > Sr. R&D Engineer/Scientist > Agilent Technologies > 5301 Stevens Creek Rd > Santa Clara, CA 95051 > Tel: (408) 553-2571 > Email: jing_gao at agilent.com > > > > > -----Original Message----- > From: Donna Karolchik [mailto:donnak at soe.ucsc.edu] > Sent: Thursday, May 03, 2007 1:47 PM > To: Galt Barber > Cc: jing_gao at agilent.com; Genome > Subject: Re: [Genome] question about URL construction > > You have to be a bit careful when you use the db parameter -- it > depends on what you want from your URL. If you want the URL to > always point to that specific assembly version, it's better to use > the db name. However, if you want your URL to always point to the > latest assembly version for a genome (i.e. automatically use the > newer version if an assembly is updated), then you should use the > org name. > > -Donna > > ----- Original Message ----- > From: "Galt Barber" > To: "Donna Karolchik" > Cc: ; > Sent: Thursday, May 03, 2007 1:40 PM > Subject: Re: [Genome] question about URL construction > > >> You can usually just specify the db (e.g. hg18) in the URL >> (URL=http://server/path/...&db=hg18...), and the system >> will figure out the org automatically. >> >> -Galt >> >> >> On Thu, 3 May 2007, Donna Karolchik wrote: >> >>> hi Jing, >>> >>> To add to what Galt says, if you are using one of the org names >>> that contains white space (e.g. D. melanogaster), be sure to >>> put a >>> "+" in place of the white space, i.e. D.+melanogaster. 
>>> >>> -Donna >>> ----------------------------------- >>> Donna Karolchik >>> UCSC Genome Bioinformatics Group >>> http://genome.ucsc.edu >>> >>> >>> ----- Original Message ----- >>> From: "Galt Barber" >>> To: >>> Cc: >>> Sent: Thursday, May 03, 2007 1:31 PM >>> Subject: Re: [Genome] question about URL construction >>> >>> >>>> Check out name, description, organism, etc. in dbDB in this >>>> file: >>>> >>>> http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql >>>> >>>> -Galt >>>> >>>> >>>> On Thu, 3 May 2007 jing_gao at agilent.com wrote: >>>> >>>>> Hi, >>>>> >>>>> >>>>> >>>>> I'm trying to figure out how to construct an URL to share my >>>>> annotation tracks with others. I'm following your >>>>> instruction >>>>> on >>>>> http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#SHARE, >>>>> >>>>> Which works well for human. And I could figure out org=mouse >>>>> and org=rat. >>>>> >>>>> >>>>> >>>>> But could you tell me the String values I should use for >>>>> "org=" >>>>> in the case for the following species: >>>>> >>>>> >>>>> >>>>> A. thaliana >>>>> >>>>> C. elegans >>>>> >>>>> D. melanogaster >>>>> >>>>> D. rerio >>>>> >>>>> G. gallus >>>>> >>>>> H. sapiens human >>>>> >>>>> M. musculus mouse >>>>> >>>>> R. norvegicus rat >>>>> >>>>> S. cerevisiae >>>>> >>>>> S. pombe >>>>> >>>>> X. tropicalis >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Thank you very much. 
>>>>> >>>>> >>>>> >>>>> Jing >>>>> >>>>> _______________________________________________ >>>>> Genome maillist - Genome at soe.ucsc.edu >>>>> http://www.soe.ucsc.edu/mailman/listinfo/genome >>>>> >>>> _______________________________________________ >>>> Genome maillist - Genome at soe.ucsc.edu >>>> http://www.soe.ucsc.edu/mailman/listinfo/genome >>>> >> _______________________________________________ >> Genome maillist - Genome at soe.ucsc.edu >> http://www.soe.ucsc.edu/mailman/listinfo/genome >> > > > ------------------------------------------------------------------------ > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From jing_gao at agilent.com Mon Jun 4 15:46:01 2007 From: jing_gao at agilent.com (jing_gao at agilent.com) Date: Mon, 4 Jun 2007 16:46:01 -0600 Subject: [Genome] question about URL construction In-Reply-To: <46648449.3020904@cse.ucsc.edu> Message-ID: <1386D91EC3168B4D9E1F7FA9AE03B25C014E7DF8@wcosmb02.cos.agilent.com> Ann, Thank you for the help, let me retry with your advise and have you know. Thanks again. Jing Jing Gao MD PhD Sr. R&D Engineer/Scientist Agilent Technologies 5301 Stevens Creek Rd Santa Clara, CA 95051 Tel: (408) 553-2571 Email: jing_gao at agilent.com -----Original Message----- From: Ann Zweig [mailto:ann at soe.ucsc.edu] Sent: Monday, June 04, 2007 2:30 PM To: jing_gao at agilent.com Cc: genome at cse.ucsc.edu; Galt Barber; Donna Karolchik Subject: Re: [Genome] question about URL construction Hello Jing, Sorry to hear that you are still having problems constructing a URL to use for a custom track. Let me break down each of your questions and I think we can work this out. Answer to Problem 1. The position to which you would like the browser to open when you first view your custom track is defined in a 'browser line' in your custom track data file. 
For example, if you would like your custom track to open at the position you suggest, add this to the beginning of your data file: browser position chrX:000000007-000000054 The "%3A" in your URL is the percent-encoded form of ":" (3A is the ASCII hexadecimal code for a colon); you do not need to add it to the data file. You do not need to add the positional information to the URL that you construct -- the software will use the starting position from your data file. Read more about browser lines here: http://genome.ucsc.edu/goldenPath/help/customTrack.html#BROWSER Answer to Problem 2. The secure website on which your data are located does not appear to be accessible from outside of Agilent. When I tried to access it directly, my browser timed out. This could be due to a firewall at Agilent, or other privacy measures, or the data file could just be too large for my browser to open. At any rate, my Internet browser gives a "page cannot be displayed" error when trying to access your data here: https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed Your data must be located on a server that is accessible by our servers. If a login and password are required to access data loaded through a URL, this information can be included in the URL using the format protocol://user:password@server.com/somepath. Only Basic Authentication is supported for HTTP. Note that passwords included in URLs are not protected. If a password contains a non-alphanumeric character, such as @, the character must be replaced by the hexadecimal representation for that character. For example, in the password mypwd@wk, the @ character should be replaced by %40, resulting in the modified password mypwd%40wk. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu jing_gao at agilent.com wrote: > Hi, > > Thanks for the help on this topic in the past. I'm still having a little problem here with constructing the URL and have the bed file loaded up to your browser properly.
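The percent-encoding described in Ann's reply above (":" becomes %3A, "@" becomes %40) can be reproduced with a few lines of Python. This is only a minimal sketch using the standard urllib.parse module; the short position string, the user name "someuser" and the host "server.com" are placeholders for illustration, not values taken from the thread.

```python
from urllib.parse import quote, unquote

# "%3A" is just the percent-encoded form of ":" (ASCII hex 3A).
position = "chrX:7-54"  # hypothetical position string
encoded_position = quote(position, safe="")
print(encoded_position)  # chrX%3A7-54

# Decoding the value from the browser URL gives back the plain colon.
decoded_position = unquote("chrX%3A000000007-000000054")
print(decoded_position)  # chrX:000000007-000000054

# A password containing "@" has to be encoded before it is placed
# in a protocol://user:password@server/path style URL.
password = "mypwd@wk"
encoded_password = quote(password, safe="")
print(encoded_password)  # mypwd%40wk

# "someuser" and "server.com" are placeholders, not real credentials.
data_url = f"https://someuser:{encoded_password}@server.com/somepath"
print(data_url)
```

Passing safe="" makes quote() encode every reserved character, so "@" and ":" cannot be mistaken for the separators in the user:password@server part of the URL.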
> > I'm attaching my test bed file here. When I loads it directly into your browser by going to your web site and do it manually, everything is beautiful. This is the UCSC URL I'm seeing from the browser: > http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=93311163&Submit=go+to+genome+browser&position=chrX%3A000000007-000000054 > > > When I load it in using a constructed URL, there are 2 problems. This is my constructed URL: > http://genome.ucsc.edu/cgi-bin/hgTracks??org=human&position=chrX&hgt.customText=https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed > > > Problems: > > 1. How do I add the position like: position=chrX%3A000000007-000000054 > > This gives the preferred resolution, but what is "3A"? How do I get that string for all chromosomes and all organisms? > > > 2. Bigger problem, if you click on my URL you may see, on top of the page, there is an error line: > > Unrecognized format line 1 of custom track: https://wcosned2.cos.agilent.com/earray/userdata/probedesign/PI1234/output/GenomeTiling_bed (note: chrom names are case sensitive) > > > Why? The same file loads OK within your browser, why not with my URL? > > > A million thanks. > > > Jing > > > > <> > > Jing Gao MD PhD > Sr. R&D Engineer/Scientist > Agilent Technologies > 5301 Stevens Creek Rd > Santa Clara, CA 95051 > Tel: (408) 553-2571 > Email: jing_gao at agilent.com > > > > > -----Original Message----- > From: Donna Karolchik [mailto:donnak at soe.ucsc.edu] > Sent: Thursday, May 03, 2007 1:47 PM > To: Galt Barber > Cc: jing_gao at agilent.com; Genome > Subject: Re: [Genome] question about URL construction > > You have to be a bit careful when you use the db parameter -- it > depends on what you want from your URL. If you want the URL to > always point to that specific assembly version, it's better to use > the db name. However, if you want your URL to always point to the > latest assembly version for a genome (i.e. 
automatically use the > newer version if an assembly is updated), then you should use the > org name. > > -Donna > > ----- Original Message ----- > From: "Galt Barber" > To: "Donna Karolchik" > Cc: ; > Sent: Thursday, May 03, 2007 1:40 PM > Subject: Re: [Genome] question about URL construction > > >> You can usually just specify the db (e.g. hg18) in the URL >> (URL=http://server/path/...&db=hg18...), and the system >> will figure out the org automatically. >> >> -Galt >> >> >> On Thu, 3 May 2007, Donna Karolchik wrote: >> >>> hi Jing, >>> >>> To add to what Galt says, if you are using one of the org names >>> that contains white space (e.g. D. melanogaster), be sure to >>> put a >>> "+" in place of the white space, i.e. D.+melanogaster. >>> >>> -Donna >>> ----------------------------------- >>> Donna Karolchik >>> UCSC Genome Bioinformatics Group >>> http://genome.ucsc.edu >>> >>> >>> ----- Original Message ----- >>> From: "Galt Barber" >>> To: >>> Cc: >>> Sent: Thursday, May 03, 2007 1:31 PM >>> Subject: Re: [Genome] question about URL construction >>> >>> >>>> Check out name, description, organism, etc. in dbDB in this >>>> file: >>>> >>>> http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql >>>> >>>> -Galt >>>> >>>> >>>> On Thu, 3 May 2007 jing_gao at agilent.com wrote: >>>> >>>>> Hi, >>>>> >>>>> >>>>> >>>>> I'm trying to figure out how to construct an URL to share my >>>>> annotation tracks with others. I'm following your >>>>> instruction >>>>> on >>>>> http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#SHARE, >>>>> >>>>> Which works well for human. And I could figure out org=mouse >>>>> and org=rat. >>>>> >>>>> >>>>> >>>>> But could you tell me the String values I should use for >>>>> "org=" >>>>> in the case for the following species: >>>>> >>>>> >>>>> >>>>> A. thaliana >>>>> >>>>> C. elegans >>>>> >>>>> D. melanogaster >>>>> >>>>> D. rerio >>>>> >>>>> G. gallus >>>>> >>>>> H. sapiens human >>>>> >>>>> M. musculus mouse >>>>> >>>>> R. 
norvegicus rat >>>>> >>>>> S. cerevisiae >>>>> >>>>> S. pombe >>>>> >>>>> X. tropicalis >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Thank you very much. >>>>> >>>>> >>>>> >>>>> Jing >>>>> >>>>> _______________________________________________ >>>>> Genome maillist - Genome at soe.ucsc.edu >>>>> http://www.soe.ucsc.edu/mailman/listinfo/genome >>>>> >>>> _______________________________________________ >>>> Genome maillist - Genome at soe.ucsc.edu >>>> http://www.soe.ucsc.edu/mailman/listinfo/genome >>>> >> _______________________________________________ >> Genome maillist - Genome at soe.ucsc.edu >> http://www.soe.ucsc.edu/mailman/listinfo/genome >> > > > ------------------------------------------------------------------------ > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From archanat at soe.ucsc.edu Mon Jun 4 17:22:38 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Mon, 04 Jun 2007 17:22:38 -0700 Subject: [Genome] BLAT question In-Reply-To: References: Message-ID: <4664ACCE.9030406@soe.ucsc.edu> Hello Sam, Here is the response from one of our developers to your question: If you are trying to find nearby homologs, you could get a list of Known Gene sequences using our Table Browser, that are within the specified region of mouse chr7. Then take those sequences and run a batch BLAT job on them. You could even make a database that consisted just of the dna for the region and then search just that. More information on using the Table Browser is here: http://genome.cse.ucsc.edu/goldenPath/help/hgTablesHelp.html If you are interested in distant homologs, then you might need to use BLAST. I hope that this helps you. If this does not answer you question, please don't hesitate to write back to the list. 
Regards, -- Archana UCSC Genome Bioinformatics Group Sam, Judy wrote: > I have just used BLAT for the first time, and discovered several nearly > identical sequences to my query sequence within ~200 kb. This "cluster" of > related sequences would not have been otherwise observable, since only my > query sequence was a recognized gene. Is there any way of generating a list > of genes that are recently duplicated in this way, for example, all such > genes on mouse Chr 7 between 65 and 84 Mb? > > Thank you for your help. > > Judy Sam > City of Hope > Duarte, CA 91010 > > > email: jsam at coh.org > Tel. 626-301-8241 From yang21 at llnl.gov Mon Jun 4 18:05:49 2007 From: yang21 at llnl.gov (Shan Yang) Date: Mon, 04 Jun 2007 18:05:49 -0700 Subject: [Genome] Liftover question Message-ID: <7.0.0.16.2.20070604180258.025e1990@llnl.gov> Hi, I have a question about liftover. Sometimes I get a sequence that is split in the new assembly, but liftover does not give any further information. If a sequence in an old assembly is broken into only a few segments in the new assembly (e.g. 2 segments), is it possible to get the coordinates of both segments in the new assembly, or do I have to BLAT the old sequence against the new assembly? Thanks, Shan Yang, PhD Genome Biology Division, L-452 Chemistry, Materials & Life Sciences Directorate (CMLS) Lawrence Livermore National Laboratory 7000 East Ave, Livermore, CA, 94550 Ph: 925-422-7389 Fax: 925-422-2099 From kwong at bccrc.ca Tue Jun 5 10:30:48 2007 From: kwong at bccrc.ca (Kendy Wong) Date: Tue, 5 Jun 2007 10:30:48 -0700 Subject: [Genome] custom track submission Message-ID: <0BE438149FF2254DB4199E2682C8DFEB024E7AC8@crcmail1.BCCRC.CA> Hi, We have previously submitted a custom track of copy-number variants (see below) which is now publicly available at the UCSC goldenPath customTracks site. Copy-Number Variants (hg17) accompanying the paper Wong, K. et al. A Comprehensive Analysis of Common Copy-Number Variations in the Human Genome. American Journal of Human Genetics 80:91-104 (2007). We would like to include all the BAC clones on our array as well. Please let us know which would be the better approach: 1) Submit the whole array as a separate custom track and modify the CNV track's description to alert users that both tracks can be loaded simultaneously if desired.
2) Amend the current CNV track to include the remaining BAC clones from the array and color code these differently from the CNV clones. (If this is the choice, should we submit just the remaining clones or combine the two and resubmit a modified CNV track?) 3) Or another suggestion? Thanks, Kendy _______________________________ Kendy Wong, PhD Post-doctoral fellow Cancer Genetics & Developmental Biology BC Cancer Research Centre Vancouver, B.C. V5Z 1L3 Canada 604-675-8000 x7701 kwong at bccrc.ca From gsoltes at Cangene.com Tue Jun 5 11:12:30 2007 From: gsoltes at Cangene.com (Glenn Soltes) Date: Tue, 5 Jun 2007 13:12:30 -0500 Subject: [Genome] Hello UCSC Genome Staff Message-ID: <999DB85E8B3EF347A40A208EB5D6BF8D05C8CE36@WPGCHAEX01.cangene.com> I am trying to download your horse genome. The download always hangs at 34.8 MB of 765 MB, which suggests a problem with the file. Please fix it or suggest how I can get chromFa.tar.gz. Glenn From heather at soe.ucsc.edu Tue Jun 5 11:28:13 2007 From: heather at soe.ucsc.edu (Heather Trumbower) Date: Tue, 5 Jun 2007 11:28:13 -0700 (PDT) Subject: [Genome] custom track submission In-Reply-To: <0BE438149FF2254DB4199E2682C8DFEB024E7AC8@crcmail1.BCCRC.CA> References: <0BE438149FF2254DB4199E2682C8DFEB024E7AC8@crcmail1.BCCRC.CA> Message-ID: Hi Kendy, good to hear from you. Thanks very much for offering additional data.
Any of the options you describe below should work fine, it's really your preference. The only other possibility to consider would be to change your color format to match our Structural Variation track. Note we now have that track for both hg17 and hg18. Lars Feuk from Steve Scherer's group did the work to move the data to hg18, I'm sure he would like to have your data included as well. You can send files directly to me. Heather Trumbower UCSC Genome Bioinformatics Group On Tue, 5 Jun 2007, Kendy Wong wrote: > Hi, > > We have previously submitted a custom track of copy-number variants (see > below) which is now publicly available at the UCSC goldenPath > customTracks site. > > Copy-Number Variants > enome-test.cse.ucsc.edu/goldenPath/customTracks/hg17/wong.bed> (hg17) > accompanying the paper Wong, K. et al. A Comprehensive Analysis of > Common Copy-Number Variations in the Human Genome > 44048.abstract.html> . American Journal of Human Genetics 80:91-104 > (2007). > > We would like to include all the BAC clones on our array as well. > Please let us know which will be a better approach: > > 1) Submit the whole array as a separate custom track and modify > the CNV track's description to alert users that both tracks can be > loaded simultaneously if desired. > > 2) Amend the current CNV track to include the remaining BAC clones > from the array and color code these differently from the CNV clones. > (If this is the choice, should we submit just the remaining clones or > combine the two and resubmit a modified CNV track?) > > 3) Or other suggestion? > > Thanks, > > Kendy > > > _______________________________ > > Kendy Wong, PhD > Post-doctoral fellow > Cancer Genetics & Developmental Biology > BC Cancer Research Centre > Vancouver, B.C. 
> V5Z 1L3 Canada > 604-675-8000 x7701 > kwong at bccrc.ca > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From achu at bcgsc.ca Tue Jun 5 12:01:18 2007 From: achu at bcgsc.ca (Andy Chu) Date: Tue, 05 Jun 2007 12:01:18 -0700 Subject: [Genome] Linking to specific genome position on custom tracks saved in session In-Reply-To: <465E134D.8050304@soe.ucsc.edu> References: <1180566883.2187.88.camel@achu01.phage.bcgsc.ca> <465E134D.8050304@soe.ucsc.edu> Message-ID: <1181070078.15031.12.camel@achu01.phage.bcgsc.ca> Hi, I understand that I can change the position when I'm importing the custom track, but I'm trying to access a track that has already been saved in a session. When I load up the session http://genome.cse.ucsc.edu/cgi-bin/hgTracks? hgS_doOtherUser=submit&hgS_otherUserName=Mfield&hgS_otherUserSessionName=hg18_all_fpp it always goes back to the default position that was assigned when I imported my track. My question is can I go directly to a position using the session URL that I have? What I would like to do dynamically generate a URL that goes to that session and automatically display any position that I supply. Thanks, Andy Chu Canada's Micheal Smith Genome Sciences Centre On Wed, 2007-05-30 at 17:14 -0700, Brooke Rhead wrote: > Hi Andy, > > Positions specified in custom tracks overrides positions specified in > URLs. The position lines in the custom tracks will need to be removed > for the URL position to take effect. See these answers to similar > previously-asked questions: > > http://www.soe.ucsc.edu/pipermail/genome/2004-August/005326.html > > http://www.soe.ucsc.edu/pipermail/genome/2007-May/013446.html > > I hope this information helps. If you have further questions, please do > not hesitate to contact us again. 
> > -- > Brooke Rhead > UCSC Genome Bioinformatics Group > > We invite you to give us your feedback on the UCSC Genome Browser > through May 31, 2007: http://www.surveymonkey.com/s.asp?U=881163743177 > > > > Andy Chu wrote: > > Hi, > > > > I have several custom tracks loaded into the genome browser and saved in > > a session. I want to link to these tracks and specify the position > > where they open in a dynamically generated URL. Unfortunately, whenever > > custom tracks are loaded they always have a default position, so when I > > try to link to the session with a &position=chr:start-end added to the > > end of the URL, the default position is loaded up instead of the > > position I want. Is there any way to go straight to the position I want > > using the session URL? > > > > Thanks, > > > > Andy Chu > > Canada's Micheal Smith Genome Sciences Centre > > > > _______________________________________________ > > Genome maillist - Genome at soe.ucsc.edu > > http://www.soe.ucsc.edu/mailman/listinfo/genome From helixphoton at yahoo.com Tue Jun 5 11:41:44 2007 From: helixphoton at yahoo.com (Helix Photon) Date: Tue, 5 Jun 2007 11:41:44 -0700 (PDT) Subject: [Genome] Mapping SNP and protein Message-ID: <476111.69416.qm@web57403.mail.re1.yahoo.com> Hi, I am interested in the number of SNP present in the following genomic coordinate chr17:76705160-76721740 and when I queried through the Table browser, I got 63 hits. Now my question is: a) Like in dbSNP, can I get the information about the amino acid change and weather SNP is present near 3' or 5' position of the coordinate. b) If I need to use intersection between the tables for this, which tables to look for the protein information. thanks --------------------------------- Get the Yahoo! toolbar and be alerted to new email wherever you're surfing. 
From kayla at soe.ucsc.edu Tue Jun 5 12:59:03 2007 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Tue, 05 Jun 2007 12:59:03 -0700 Subject: [Genome] Liftover question In-Reply-To: <7.0.0.16.2.20070604180258.025e1990@llnl.gov> References: <7.0.0.16.2.20070604180258.025e1990@llnl.gov> Message-ID: <4665C087.8090807@cse.ucsc.edu> Shan, Here is how to use the hgLiftOver tool to convert from human (hg18) to mouse (mm8). Check the box that says "Allow multiple output regions:" and change "Data Format" to "BED". I was able to paste in "chrX 151073054 151383976 test". This gave six regions as results:

chrX 68673706 68913210 test 1
chr6 62978477 62978536 test 2
chrX 68761575 68761926 test 3
chrX 68842739 68844472 test 4
chrX 68844181 68844307 test 5
chr7 108825495 108827205 test 6

The tricky thing here is that you must enter your input coordinates as BED data, which means you have to put spaces between the chrom, chromStart, and chromEnd coordinates. Additionally, you have to put one more field after that (I used "test"). Please note that when you convert a position to BED coordinates, the start coordinate differs by one base: http://genome.ucsc.edu/FAQ/FAQtracks#tracks1 Also, please see a related, previously answered mailing list question: http://www.soe.ucsc.edu/pipermail/genome/2006-May/010693.html I hope this is helpful to you. Please don't hesitate to contact us again if you require further assistance. Kayla Smith UCSC Genome Bioinformatics Group Shan Yang wrote: > Hi, > > I have a question about liftover. Sometimes, I get sequence spilt in > the new assembly. But it does not give any further information. > If a sequence in an old assembly is broken into not too many segments > in the new assembly (e.g. 2 segments), is it possible to get the two > coordinates in the new assembly or I have to use old sequence to blat > on the new assembly?
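Kayla's note above about the start coordinate differing by one base can be sketched in Python. This is only an illustration, not part of any UCSC tool: the function name position_to_bed is made up here, and the browser-style position chrX:151073055-151383976 is assumed to be the 1-based equivalent of the BED line she pasted in. The sketch assumes browser positions are 1-based and inclusive, while BED starts are 0-based with an exclusive end.

```python
def position_to_bed(position: str, name: str = "test") -> str:
    """Convert 'chrom:start-end' (1-based, inclusive) to a space-separated BED line.

    Hypothetical helper for illustration only.
    """
    chrom, span = position.split(":")
    # Allow thousands separators such as 151,073,055.
    start_text, end_text = span.replace(",", "").split("-")
    start = int(start_text) - 1  # BED start is 0-based
    end = int(end_text)          # BED end is exclusive, so it stays the same
    if start < 0 or end <= start:
        raise ValueError(f"invalid position: {position}")
    # The fourth field is the extra name column mentioned in the reply.
    return f"{chrom} {start} {end} {name}"


bed_line = position_to_bed("chrX:151073055-151383976")
print(bed_line)  # chrX 151073054 151383976 test
```

Only the start changes in the conversion; the end coordinate is the same number in both conventions.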
> > Thanks, > > Shan Yang, PhD > Genome Biology Division, L-452 > Chemistry, Materials & Life Sciences Directorate (CMLS) > Lawrence Livermore National Laboratory > 7000 East Ave, Livermore, CA, 94550 > > Ph: 925-422-7389 > Fax: 925-422-2099 > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From frampton at MIT.EDU Tue Jun 5 14:18:44 2007 From: frampton at MIT.EDU (Garrett M Frampton) Date: Tue, 5 Jun 2007 17:18:44 -0400 Subject: [Genome] Uploading custom browser tracks Message-ID: <003a01c7a7b7$1971ca50$9133090a@ccdom.wi.mit.edu> Hello, I have recently begun getting the error "Broken pipe file write failed" when trying to upload custom tracks. What is going on? What should I do? Thanks, Garrett Frampton From riteshbioinfo at yahoo.com Tue Jun 5 13:34:36 2007 From: riteshbioinfo at yahoo.com (Ritesh Kumar) Date: Tue, 5 Jun 2007 13:34:36 -0700 (PDT) Subject: [Genome] GeneID along with variations Message-ID: <918229.81183.qm@web35111.mail.mud.yahoo.com> Hi When we submit the genomic position in the format chrX:15283001-15129000 for querying variations and SNPs, Is there any way such that Gene Symbol or GeneID could also be displayed in the result otherwise, for a list of genomic positions it becomes difficult to analyze the whole result. Ritesh --------------------------------- Fussy? Opinionated? Impossible to please? Perfect. Join Yahoo!'s user panel and lay it on us. From hiram at soe.ucsc.edu Tue Jun 5 15:03:08 2007 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Tue, 05 Jun 2007 15:03:08 -0700 Subject: [Genome] Uploading custom browser tracks In-Reply-To: <003a01c7a7b7$1971ca50$9133090a@ccdom.wi.mit.edu> References: <003a01c7a7b7$1971ca50$9133090a@ccdom.wi.mit.edu> Message-ID: <4665DD9C.5000000@soe.ucsc.edu> Good Afternoon Garrett: Is this something new with previously successful data ? Do you have a large amount of data in your custom track submission ? 
Which genome assembly. What type of data ? Can you load any of the example custom tracks ? Can you provide a URL to a set of data that fails ? --Hiram Garrett M Frampton wrote: > Hello, > > I have recently begun getting the error "Broken pipe file write failed" when > trying to upload custom tracks. What is going on? What should I do? > > Thanks, > > Garrett Frampton From kayla at soe.ucsc.edu Tue Jun 5 15:35:49 2007 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Tue, 05 Jun 2007 15:35:49 -0700 Subject: [Genome] Hello UCSC Genome Staff In-Reply-To: <999DB85E8B3EF347A40A208EB5D6BF8D05C8CE36@WPGCHAEX01.cangene.com> References: <999DB85E8B3EF347A40A208EB5D6BF8D05C8CE36@WPGCHAEX01.cangene.com> Message-ID: <4665E545.4070306@cse.ucsc.edu> Glenn, I'm sorry you are having trouble downloading this file. How are you trying to download chromFa.tar.gz? There are a few things that could be going wrong, but for now, try using FTP to get the file. Here is a link to our FAQ on this topic: http://genome.ucsc.edu/FAQ/FAQdownloads#download1 And a link to the FTP site for horse: ftp://hgdownload.cse.ucsc.edu/goldenPath/equCab1/bigZips/ I hope this helps. Please don't hesitate to write back if you are still having trouble. Kayla Smith UCSC Genome Bioinformatics Group Glenn Soltes wrote: > I am looking to download your horse gemone. > > It is hanging always at 34.8Mb of 765Mb suggesting you have a problem with your file. Please fix or suggest how I can get chromFa.tar.gz > > Glenn > > > ******************************************************************* > > CONFIDENTIAL - This e-mail transmission and any documents attached to it may contain informatio