From ann at soe.ucsc.edu Wed Jan 2 09:23:48 2008
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Wed, 02 Jan 2008 09:23:48 -0800
Subject: [Genome] [Fwd: A query on the Genome Browser]
In-Reply-To: <477BC36F.2050601@ucsc.edu>
References: <477BC36F.2050601@ucsc.edu>
Message-ID: <477BC8A4.2050704@soe.ucsc.edu>

Hello Siddarth,

In the genome browser and the associated tools, much of the upper- and
lower-case display is controlled by the user. When you ask for DNA
sequence output using most tools (Table Browser, 'DNA' link, etc.) the
case is configurable. So this inconsistency may simply be the result of
having two views configured differently.

When you retrieved the output from the Table Browser, you must have
chosen a gene set from which to extract the exonic sequence (e.g.
RefSeq Genes, UCSC Genes, etc.). I note that in the particular location
you mention (chrM:6903-7445) there is a UCSC Gene, but not a RefSeq
Gene. This could be causing the inconsistency you are seeing.

In the future, please direct your questions to the genome mailing list
at genome at soe.ucsc.edu -- our moderated forum for user questions and
discussion. You will likely get a quicker response to your question.

I hope this information is helpful to you. If this has not fully
addressed your issue, please contact the list again with further details
and a list of the exact steps you have taken to see the inconsistency.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Please feel free to search the Genome mailing list archives by visiting
our home page, clicking on "Contact Us", then typing a word or phrase
into the search box. On that same page
(http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome
mailing list.
> -------- Original Message --------
> Subject: A query on the Genome Browser
> Date: Mon, 31 Dec 2007 16:44:27 -0700
> From: SiddarthGautham Selvaraj
> To: cbseweb at cbse.ucsc.edu
>
> Respected Sir:
>
> I am Siddarth Gautham Selvaraj, a graduate student in Arizona State
> University. My interest lies in collecting all the coding exons in the
> human genome with its homologs.
>
> I used Table browser to get the sequences of all coding exons for each
> chromosome. An interesting fact which I found out and could not
> understand is why not all coding exons mentioned by the Table
> Browser are capitalized in the genome browser window. I find for example
> in chromosome M sequence in position 6903-7445 as coding exons in Table
> Browser but the same sequence is not capitalized which means is not
> considered coding by the genome browser. I found it very difficult to
> understand this concept.
>
> Also what exactly is the capitalized coding regions put by the genome
> browser?
>
> Thanks
> siddarth

From rhead at soe.ucsc.edu Wed Jan 2 13:10:50 2008
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Wed, 02 Jan 2008 13:10:50 -0800
Subject: [Genome] BAC track
In-Reply-To:
References:
Message-ID: <477BFDDA.4060903@soe.ucsc.edu>

Hello Shlomit,

Each BAC end pair consists of a + strand read from one end of the BAC
and a - strand read from the other end of the BAC. You should see one +
and one - alignment if you click on the details page for any BAC end
pair. Each BAC pair can be oriented in either direction when compared to
the genomic sequence, indicated in the "strand" field. See an
explanation in this previously-answered question:

http://www.soe.ucsc.edu/pipermail/genome/2006-July/011146.html

I hope this answers your question.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Shlomit wrote:
> Hello,
>
> First, thanks for the very useful site and your support.
>
> I was looking at your BAC track in the mouse genome, I couldn't understand
> what the "strand" (+ or -) field means. The BAC DNA is double-stranded,
> isn't it?
>
> Thanks a lot,
>
> Shlomit Amar-Farkash
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From galt at soe.ucsc.edu Wed Jan 2 13:20:41 2008
From: galt at soe.ucsc.edu (Galt Barber)
Date: Wed, 2 Jan 2008 13:20:41 -0800 (PST)
Subject: [Genome] BLAT miRNA
In-Reply-To: <07ce01c8455b$800bb310$d4fa69a2@msworkstation>
References: <07ce01c8455b$800bb310$d4fa69a2@msworkstation>
Message-ID:

Please try your stand-alone blat query using
"T" instead of "U". BLAT does not currently
understand the symbol "U", and treats it as
if it were an "N". For DNA/RNA, BLAT just
uses these symbols for queries: {A,C,G,T,N}

hgBlat does a little extra pre-processing
to translate U to T before calling gfServer,
but this does not happen in standalone blat.

You'll notice that even in the hgBlat output
detailed alignment, the alignment is reported
with T's, not U's.

I'm afraid "U" just doesn't get its due!

Thank you!

-Galt

On Sun, 23 Dec 2007, Gao, Ge wrote:

> Hi,
>
> I'm trying to align a miRNA (dme-mir-100) stem-loop sequence against dm2
> genome by local standalone BLAT.
>
> >dme-mir-100 MI0000378
> CCAUUAACAGAAACCCGUAAAUCCGAACUUGUGCUGUUUUAUAUCUGUUACAAGACCGGCAUUAUGGGAGUCUGUCAAUGCAAACAACUGGUUUUUGGCA
>
> When running the query on Webblat, I found the exact match:
> chr2L:18467363~18467462(+).
>
> However, when I do it on the local standalone BLAT with default settings, no
> result was found.
>
> After some surfing, I followed the suggestions from
> http://www.soe.ucsc.edu/pipermail/genome/2006-October/011872.html, and changed
> the command line to:
>
> blat dm2_genome.fn query.fn output.psl -t=dna -q=rna -stepSize=2 -tileSize=6 -minScore=0 -minIdentity=100
>
> This brought up 134 hits (11 on chr2L), but not the right one.
> :(
>
> Could you give me some clues? and are there recommended settings to run BLAT
> on miRNA?
>
> Thanks,
> Gao

From rhead at soe.ucsc.edu Wed Jan 2 14:56:42 2008
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Wed, 02 Jan 2008 14:56:42 -0800
Subject: [Genome] the kgAlias table
In-Reply-To:
References:
Message-ID: <477C16AA.40807@soe.ucsc.edu>

Hello Yuval,

Not every gene in GenBank is part of the "UCSC Genes" set, so there will
not be a corresponding UCSC Gene name for every one. The process used to
create the gene set is displayed when you click on the track name in the
Genome Browser (the blue link above the drop-down box that lets you hide
or show the track), or by clicking on the blue or grey "mini-button" to
the far left of the track in the main display.

Note that the three gene names that have matches in the Genome Browser
when using the search/position box (FAM21B, TMEM23, LIPL2) are not
matches to the UCSC Genes track, but to the "RefSeq Genes" and
"Non-Human RefSeq Genes" tracks (assuming you are using the hg18
assembly).

I hope this helps explain what you are seeing. If you have further
questions, please feel free to contact us again at the mailing list
address.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Yuval Nevo wrote:
> hi,
> I have a list of genes taken from the GenBank files (last version, 29 Aug.
> 2006) that I am trying to match with the unique UCSC gene names (e.g.
> uc001aaa.1), using the kgAlias table.
> I have encountered numerous gene names with no match. 1960 names start with
> 'LOC', another 408 do not. below are several examples:
> LOC642622
> LOC728065
> LOC728152
> (when searched for in the genome browser they can't be found)
> and
> FAM21B
> TMEM23
> LIPL2
> (are found in the browser but have a different name in the kgAlias table).
> Why can't I find these aliases in the kgAlias table? Should the table be
> updated?
> note: I can send the full list of these names if you like.
> Thanks, Yuval (a PhD student).

From ann at soe.ucsc.edu Wed Jan 2 15:37:41 2008
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Wed, 02 Jan 2008 15:37:41 -0800
Subject: [Genome] (no subject)
In-Reply-To:
References:
Message-ID: <477C2045.50108@soe.ucsc.edu>

Hello Kurt,

Thanks for pointing this out and sorry for the inconvenience. The
software on that blat/pcr server needs to be updated. Hopefully this
will happen later today or tomorrow.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Kurt311311 at aol.com wrote:
>
> http://genome.ucsc.edu/cgi-bin/hgPcr?hgsid=101894276&org=Lizard&db=anoCar1&wp_f=atacagtgttccctcactta&wp_r=atacagtgttccctcactta&Submit=submit&wp_size=5000&wp_perfect=15&wp_good=15&wp_flipReverse=on&boolshad.wp_flipReverse=1
>
> help!!! It's been doing this for days!!!!

From sgao at mcw.edu Thu Jan 3 06:34:40 2008
From: sgao at mcw.edu (Gao, Shouguo)
Date: Thu, 3 Jan 2008 08:34:40 -0600
Subject: [Genome] How to get snps locate between the start and end chromosomal location of each gene with Entrez GeneID
Message-ID:

It is really useful to connect to mySQL server. Now I have a large snp
list, and I am looking for the snps related to a gene list (Entrez
Geneid), which are located between 1000 upstream and 500 downstream of
genes. The most important thing is to find the start and end chromosomal
location of each gene with Entrez GeneID. I used knownGene and
knownToLocusLink tables, but got several locations for one Entrez GeneID
(Locuslink).

name        locuslink  chro  strand  start  end
uc001aab.1  375690     chr1  -       4558   14764
uc001aac.1  375690     chr1  -       4558   19346
uc001aad.1  375690     chr1  -       4558   7231
uc001aae.1  375690     chr1  -       4558   9622
uc001aaf.1  375690     chr1  -       4832   19672
uc001aag.1  375690     chr1  -       5658   7231
uc001aah.1  375690     chr1  -       6720   19346
uc001aai.1  375690     chr1  -       6720   9622

The reason should be that the same Entrez Gene ID matches multiple UCSC
known genes (transcript).

In NCBI one Entrez GeneID only has one start and end location. Could you
please tell me how to solve this issue? Is there any other table for
that?

Thanks

Shouguo

From sde at mrc-lmb.cam.ac.uk Thu Jan 3 09:42:31 2008
From: sde at mrc-lmb.cam.ac.uk (Subhajyoti De)
Date: Thu, 03 Jan 2008 17:42:31 +0000
Subject: [Genome] Gene Order in primates
In-Reply-To: <010201c84416$c459d510$0ba8a8c0@donnakLT>
References: <010201c84416$c459d510$0ba8a8c0@donnakLT>
Message-ID: <477D1E87.60106@mrc-lmb.cam.ac.uk>

Hi,

Greetings of new year. I have a naive question to ask about genome
assembly and gene order in chimpanzee and macaque.

I have a list of a few human genes, whose orthologs (or homologs) in
chimpanzee and macaque are located at different chromosomal locations,
while orthologs of most of the neighboring genes still maintain synteny.
For example, a gene is located in human chromosome 12. It has one
ortholog in chimpanzee and that is located in chimp chromosome 18 (and
no paralog or pseudogene is present at position equivalent to that in
human chromosome 12).

Many such cases are biologically known, driven by short rearrangements
or retrotransposon mediated activities, while a small number of others
can appear due to errors in genome assembly. Can you kindly give me a
rough estimate, how many such cases of altered gene order I can expect
in the current release of chimpanzee and macaque genome due to assembly
error and/or coverage?

Many thanks,
Subho

From Bart.Aelterman at ua.ac.be Thu Jan 3 05:51:44 2008
From: Bart.Aelterman at ua.ac.be (Aelterman Bart)
Date: Thu, 3 Jan 2008 14:51:44 +0100
Subject: [Genome] differences in annotation databases
Message-ID: <50F55D0085412B49B8C9AB7C2A6948D901C1154A@xmail01.ad.ua.ac.be>

Dear,

I want to download the ENCODE Annotation Database to run some tests
locally. Apparently there are two ways to do this:
- Download the Annotation Database from
  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encode/database/, or
- Download the Test Annotation Database from
  http://hgdownload.cse.ucsc.edu/goldenPath/hg18encodedev/database/.
I find this rather confusing and the difference between these two
databases is not explained in your documentation nor in the papers
describing the ENCODE Annotation data. Could you tell me what the
difference is and which one is the preferred reference annotation? I
assume this would be the Annotation Database, but the existence of an
undefined Test Annotation Database makes it a bit confusing.

Thanks in advance,

Bart Aelterman

From gaog at mail.cbi.pku.edu.cn Thu Jan 3 03:55:36 2008
From: gaog at mail.cbi.pku.edu.cn (Gao, Ge)
Date: Thu, 3 Jan 2008 19:55:36 +0800
Subject: [Genome] BLAT miRNA
References: <07ce01c8455b$800bb310$d4fa69a2@msworkstation>
Message-ID: <017001c84dff$8de9ad00$d4fa69a2@msworkstation>

Thanks, it works now!

But just another question: if so, what's the real difference between
"-q=rna" and "-q=dna"?

Thanks for your kind help again! :)

Gao, Ge

----- Original Message -----
From: "Galt Barber"
To: "Gao, Ge"
Cc:
Sent: Thursday, January 03, 2008 5:20 AM
Subject: Re: [Genome] BLAT miRNA

>
> Please try your stand-alone blat query using
> "T" instead of "U". BLAT does not currently
> understand the symbol "U", and treats it as
> if it were an "N".
> For DNA/RNA, BLAT just
> uses these symbols for queries: {A,C,G,T,N}
>
> hgBlat does a little extra pre-processing
> to translate U to T before calling gfServer,
> but this does not happen in standalone blat.
>
> You'll notice that even in the hgBlat output
> detailed alignment, the alignment is reported
> with T's, not U's.
>
> I'm afraid "U" just doesn't get its due!
>
> Thank you!
>
> -Galt
>
>
> On Sun, 23 Dec 2007, Gao, Ge wrote:
>
>> Hi,
>>
>> I'm trying to align a miRNA (dme-mir-100) stem-loop sequence against dm2
>> genome by local standalone BLAT.
>>
>> >dme-mir-100 MI0000378
>> CCAUUAACAGAAACCCGUAAAUCCGAACUUGUGCUGUUUUAUAUCUGUUACAAGACCGGCAUUAUGGGAGUCUGUCAAUGCAAACAACUGGUUUUUGGCA
>>
>> When running the query on Webblat, I found the exact match:
>> chr2L:18467363~18467462(+).
>>
>> However, when I do it on the local standalone BLAT with default settings,
>> no result was found.
>>
>> After some surfing, I followed the suggestions from
>> http://www.soe.ucsc.edu/pipermail/genome/2006-October/011872.html, and
>> changed the command line to:
>>
>> blat dm2_genome.fn query.fn output.psl -t=dna -q=rna -stepSize=2 -tileSize=6 -minScore=0 -minIdentity=100
>>
>> This brought up 134 hits (11 on chr2L), but not the right one. :(
>>
>> Could you give me some clues? and are there recommended settings to run
>> BLAT on miRNA?
>>
>> Thanks,
>> Gao

From kayla at soe.ucsc.edu Thu Jan 3 12:09:27 2008
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Thu, 03 Jan 2008 12:09:27 -0800
Subject: [Genome] BLAT miRNA
In-Reply-To: <017001c84dff$8de9ad00$d4fa69a2@msworkstation>
References: <07ce01c8455b$800bb310$d4fa69a2@msworkstation> <017001c84dff$8de9ad00$d4fa69a2@msworkstation>
Message-ID: <477D40F7.901@cse.ucsc.edu>

Hello Gao, Ge,

You can find the answer to your question in this previously answered
mailing list question.
http://www.soe.ucsc.edu/pipermail/genome/2006-August/011440.html

Kayla Smith
UCSC Genome Bioinformatics Group

Gao, Ge wrote:
> Thanks, it works now!
>
> But just another question: if so, what's the real difference between "-q=rna"
> and "-q=dna"?
>
> Thanks for your kind help again! :)
>
> Gao, Ge
>
> ----- Original Message -----
> From: "Galt Barber"
> To: "Gao, Ge"
> Cc:
> Sent: Thursday, January 03, 2008 5:20 AM
> Subject: Re: [Genome] BLAT miRNA
>
>
>> Please try your stand-alone blat query using
>> "T" instead of "U". BLAT does not currently
>> understand the symbol "U", and treats it as
>> if it were an "N". For DNA/RNA, BLAT just
>> uses these symbols for queries: {A,C,G,T,N}
>>
>> hgBlat does a little extra pre-processing
>> to translate U to T before calling gfServer,
>> but this does not happen in standalone blat.
>>
>> You'll notice that even in the hgBlat output
>> detailed alignment, the alignment is reported
>> with T's, not U's.
>>
>> I'm afraid "U" just doesn't get its due!
>>
>> Thank you!
>>
>> -Galt
>>
>>
>> On Sun, 23 Dec 2007, Gao, Ge wrote:
>>
>>> Hi,
>>>
>>> I'm trying to align a miRNA (dme-mir-100) stem-loop sequence against dm2
>>> genome by local standalone BLAT.
>>>
>>> >dme-mir-100 MI0000378
>>> CCAUUAACAGAAACCCGUAAAUCCGAACUUGUGCUGUUUUAUAUCUGUUACAAGACCGGCAUUAUGGGAGUCUGUCAAUGCAAACAACUGGUUUUUGGCA
>>>
>>> When running the query on Webblat, I found the exact match:
>>> chr2L:18467363~18467462(+).
>>>
>>> However, when I do it on the local standalone BLAT with default settings,
>>> no result was found.
>>>
>>> After some surfing, I followed the suggestions from
>>> http://www.soe.ucsc.edu/pipermail/genome/2006-October/011872.html, and
>>> changed the command line to:
>>>
>>> blat dm2_genome.fn query.fn output.psl -t=dna -q=rna -stepSize=2 -tileSize=6 -minScore=0 -minIdentity=100
>>>
>>> This brought up 134 hits (11 on chr2L), but not the right one. :(
>>>
>>> Could you give me some clues?
>>> and are there recommended settings to run
>>> BLAT on miRNA?
>>>
>>> Thanks,
>>> Gao

From rhead at soe.ucsc.edu Thu Jan 3 12:33:25 2008
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Thu, 03 Jan 2008 12:33:25 -0800
Subject: [Genome] How to get snps locate between the start and end chromosomal location of each gene with Entrez GeneID
In-Reply-To:
References:
Message-ID: <477D4695.1070104@soe.ucsc.edu>

Hello Shouguo,

There are two tables related to UCSC Genes that you can use to get a
single splice variant for each gene: knownCanonical and knownIsoforms.
The knownIsoforms table contains all of the items in knownGene grouped
into clusters, and knownCanonical contains a single isoform for each
cluster, along with its position. (See this previously-answered question
for how these tables are made:
http://www.soe.ucsc.edu/pipermail/genome/2005-July/008123.html .)
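
If a single interval per Entrez Gene ID is wanted rather than a single canonical transcript, one possible approach is sketched below in Python. This is only an illustration, not a UCSC tool: it assumes the rows have already been retrieved in the form shown in the question (transcript, Entrez Gene ID, chrom, strand, start, end), the function names gene_spans and snp_window are made up for this example, and the 1000/500 defaults are simply the distances mentioned in the question.

```python
# Rows as listed in the question:
# (transcript, Entrez Gene ID, chrom, strand, start, end)
ROWS = [
    ("uc001aab.1", 375690, "chr1", "-", 4558, 14764),
    ("uc001aac.1", 375690, "chr1", "-", 4558, 19346),
    ("uc001aad.1", 375690, "chr1", "-", 4558, 7231),
    ("uc001aae.1", 375690, "chr1", "-", 4558, 9622),
    ("uc001aaf.1", 375690, "chr1", "-", 4832, 19672),
    ("uc001aag.1", 375690, "chr1", "-", 5658, 7231),
    ("uc001aah.1", 375690, "chr1", "-", 6720, 19346),
    ("uc001aai.1", 375690, "chr1", "-", 6720, 9622),
]


def gene_spans(rows):
    """Merge all transcripts of a gene into one (start, end) span.

    Spans are keyed by (gene ID, chrom, strand) so that a gene ID that
    maps to more than one chromosome or strand is not merged by mistake.
    """
    spans = {}
    for _transcript, gene_id, chrom, strand, start, end in rows:
        key = (gene_id, chrom, strand)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


def snp_window(strand, start, end, upstream=1000, downstream=500):
    """Extend a span by `upstream` before the gene and `downstream` after it.

    On the minus strand the gene runs right to left, so "upstream" is
    added to the end coordinate and "downstream" subtracted from the start.
    """
    if strand == "+":
        low, high = start - upstream, end + downstream
    else:
        low, high = start - downstream, end + upstream
    return max(low, 0), high
```

For the rows in the question this merges the eight transcripts into chr1:4558-19672; on the minus strand the resulting search window is 4058-20672. The same snp_window call can be applied to the knownCanonical coordinates instead if the canonical transcript is preferred.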

In your example, the canonical transcript for the locusLink ID 375690 is
uc001aad.1:

mysql> select * from knownIsoforms where clusterId=2 order by transcript;
+-----------+------------+
| clusterId | transcript |
+-----------+------------+
|         2 | uc001aab.1 |
|         2 | uc001aac.1 |
|         2 | uc001aad.1 |
|         2 | uc001aae.1 |
|         2 | uc001aaf.1 |
|         2 | uc001aag.1 |
|         2 | uc001aah.1 |
|         2 | uc001aai.1 |
+-----------+------------+
8 rows in set (0.01 sec)

mysql> select * from knownCanonical where clusterId=2;
+-------+------------+----------+-----------+------------+------------+
| chrom | chromStart | chromEnd | clusterId | transcript | protein    |
+-------+------------+----------+-----------+------------+------------+
| chr1  |       4558 |     7231 |         2 | uc001aad.1 | uc001aad.1 |
+-------+------------+----------+-----------+------------+------------+
1 row in set (0.01 sec)

I hope this information helps. Please let us know if you have any
further questions.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Gao, Shouguo wrote:
> It is really useful to connect to mySQL server. Now I have a large
> snp list, and I am looking for the snps related to a gene list (Entrez
> Geneid), which are located between 1000 upstream and 500 downstream of
> genes. The most important thing is to find the start and end chromosomal
> location of each gene with Entrez GeneID. I used knownGene and
> knownToLocusLink tables, but got several locations for one Entrez GeneID
> (Locuslink).
>
> name        locuslink  chro  strand  start  end
> uc001aab.1  375690     chr1  -       4558   14764
> uc001aac.1  375690     chr1  -       4558   19346
> uc001aad.1  375690     chr1  -       4558   7231
> uc001aae.1  375690     chr1  -       4558   9622
> uc001aaf.1  375690     chr1  -       4832   19672
> uc001aag.1  375690     chr1  -       5658   7231
> uc001aah.1  375690     chr1  -       6720   19346
> uc001aai.1  375690     chr1  -       6720   9622
>
> The reason should be that the same Entrez Gene ID matches multiple UCSC
> known genes (transcript).
>
> In NCBI one Entrez GeneID only has one start and end location.
> Could you please tell me how to solve this issue? Is there any other
> table for that?
>
> Thanks
>
> Shouguo

From kayla at soe.ucsc.edu Thu Jan 3 16:59:27 2008
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Thu, 03 Jan 2008 16:59:27 -0800
Subject: [Genome] differences in annotation databases
In-Reply-To: <50F55D0085412B49B8C9AB7C2A6948D901C1154A@xmail01.ad.ua.ac.be>
References: <50F55D0085412B49B8C9AB7C2A6948D901C1154A@xmail01.ad.ua.ac.be>
Message-ID: <477D84EF.1030102@cse.ucsc.edu>

Hello Bart,

Coming from the main Encode page:
http://hgdownload.cse.ucsc.edu/goldenPath/encode/

The first link is the one you want:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encode/database/

The difference between the two links you've mentioned is that the second
of them may contain development, or test, data. That is, work in
progress. Such data may not have gone through our rigorous QA process
and so could have errors.

I hope this information is helpful to you. Please don't hesitate to
contact us again if you require further assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

Aelterman Bart wrote:
> Dear,
>
> I want to download the ENCODE Annotation Database to run some tests
> locally. Apparently there are two ways to do this:
> - Download the Annotation Database from
>   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encode/database/, or
> - Download the Test Annotation Database from
>   http://hgdownload.cse.ucsc.edu/goldenPath/hg18encodedev/database/.
> I find this rather confusing and the difference between these two
> databases is not explained in your documentation nor in the papers
> describing the ENCODE Annotation data. Could you tell me what the
> difference is and which one is the preferred reference annotation?
> I assume this would be the Annotation Database, but the existence of an
> undefined Test Annotation Database makes it a bit confusing.
>
> Thanks in advance,
>
> Bart Aelterman

From tingchiun at gmail.com Thu Jan 3 20:56:10 2008
From: tingchiun at gmail.com (Ting-chiun Chung)
Date: Fri, 4 Jan 2008 12:56:10 +0800
Subject: [Genome] promoter location
Message-ID:

Dear Sir:

I am planning to plot the Gene exons with Perl GD,
I have implied the refGene table which contains the exons location for
each refseq ID,
However, I also need to plot the promoter region of the related genes,
Do you have any table which contains this kind of information?
Thanks for your help

Sincerely

Ting-Chiun

From ann at soe.ucsc.edu Fri Jan 4 09:47:46 2008
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 04 Jan 2008 09:47:46 -0800
Subject: [Genome] promoter location
In-Reply-To:
References:
Message-ID: <477E7142.5060903@cse.ucsc.edu>

Hello Ting-Chiun,

We have several annotation tracks in the browser that show promoters.
Some are experimentally derived, others are computed bioinformatically.
You will find all of the tracks in the "Expression and Regulation" group
of the track controls.

In the hg18 assembly, there is a 'FirstEF' track which predicts exon,
promoter and CpG window. There is also a track called the 'TFBS
Conserved' that contains the location and score of transcription factor
binding sites conserved in the human/mouse/rat alignment. The 'Reg
Potential 7 species' track predicts regulatory regions. You can read the
track descriptions and decide for yourself which one best fits your
needs.

I hope this information is helpful to you. Please don't hesitate to
contact the mail list again if you require further assistance.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Please feel free to search the Genome mailing list archives by visiting
our home page, clicking on "Contact Us", then typing a word or phrase
into the search box. On that same page
(http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome
mailing list.

Ting-chiun Chung wrote:
> Dear Sir:
>
> I am planning to plot the Gene exons with Perl GD,
> I have implied the refGene table which contains the exons location for
> each refseq ID,
> However, I also need to plot the promoter region of the related genes,
> Do you have any table which contains this kind of information?
> Thanks for your help
>
> Sincerely
>
>
> Ting-Chiun

From watersto at u.washington.edu Fri Jan 4 10:25:19 2008
From: watersto at u.washington.edu (Bob Waterston)
Date: Fri, 4 Jan 2008 11:25:19 -0700
Subject: [Genome] custom tracks
Message-ID:

Dear Folks,

I'm having trouble with custom tracks. I load tracks (I've used these
successfully before many times), the window comes back showing that the
tracks have been loaded, but when I click on go to 'genome browser' the
tracks don't show up and indeed seem to have disappeared.

Any suggestions?

Thanks,
Bob Waterston

From watersto at u.washington.edu Fri Jan 4 10:34:11 2008
From: watersto at u.washington.edu (Bob Waterston)
Date: Fri, 4 Jan 2008 11:34:11 -0700
Subject: [Genome] custom tracks
Message-ID:

Dear Folks,

I'm having trouble with custom tracks. I load tracks (I've used these
successfully before many times), the window comes back showing that the
tracks have been loaded, but when I click on go to 'genome browser' the
tracks don't show up and indeed seem to have disappeared.

Any suggestions?

Thanks,
Bob Waterston

********************

I just got it to work -- I had opened an old page and had been trying to
add from that. When I went back to home and reopened the worm genome it
seems to have righted itself.

Sorry to have bothered you.

Bob

From hiram at soe.ucsc.edu Fri Jan 4 14:32:09 2008
From: hiram at soe.ucsc.edu (Hiram Clawson)
Date: Fri, 04 Jan 2008 14:32:09 -0800
Subject: [Genome] custom tracks
In-Reply-To:
References:
Message-ID: <477EB3E9.1070603@soe.ucsc.edu>

I'll check on this Bob. We are having unusual power outages today which
may cause this.

--Hiram

Bob Waterston wrote:
> Dear Folks,
>
> I'm having trouble with custom tracks. I load tracks (i've used
> these successfully before many times), the window comes back showing
> that the tracks have been loaded, but when I click on go to 'genome
> browser' the tracks don't show up and indeed seem to have disappeared.
>
> Any suggestions?
>
> Thanks,
> Bob Waterston

From hiram at soe.ucsc.edu Fri Jan 4 14:34:37 2008
From: hiram at soe.ucsc.edu (Hiram Clawson)
Date: Fri, 04 Jan 2008 14:34:37 -0800
Subject: [Genome] custom tracks
In-Reply-To:
References:
Message-ID: <477EB47D.1070600@soe.ucsc.edu>

You are correct Bob. I see that they seem to be functioning as usual.

--Hiram

Bob Waterston wrote:
> Dear Folks,
>
> I'm having trouble with custom tracks. I load tracks (i've used
> these successfully before many times), the window comes back showing
> that the tracks have been loaded, but when I click on go to 'genome
> browser' the tracks don't show up and indeed seem to have disappeared.
>
> Any suggestions?
>
> Thanks,
> Bob Waterston
>
> ********************
>
> I just got it to work -- I had opened an old page and had been trying
> to add from that. When I went back to home and reopened the worm
> genome it seems to have righted itself.
>
> Sorry to have bothered you.
>
> Bob

From hiram at soe.ucsc.edu Fri Jan 4 15:33:20 2008
From: hiram at soe.ucsc.edu (Hiram Clawson)
Date: Fri, 04 Jan 2008 15:33:20 -0800
Subject: [Genome] [Fwd: UCSC knownGene database file]
In-Reply-To: <477EBEC4.4060404@ucsc.edu>
References: <477EBEC4.4060404@ucsc.edu>
Message-ID: <477EC240.8060401@soe.ucsc.edu>

Good Afternoon Jeff:

You can read about the process for the current hg18 known gene procedure
in the details page for the known gene track:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=knownGene

And the previous procedure in the hg17 known gene track:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg17&g=knownGene

Note the "Methods" section on both of the pages.

Please refer your genome questions in the future to the genome email
alias which is copied on this email.

--Hiram

> -------- Original Message --------
> Subject: UCSC knownGene database file
> Date: Fri, 04 Jan 2008 11:33:16 -0600
> From: jeff Nie
> CC: Ron Stewart , shulan Tian
>
>
>
> Dear Sir/Madam,
> We have just downloaded the hg18 knownGene.txt file from UCSC download
> site and found that both format and total gene number changed. The
> original hg18 KG set has 39,288 genes, the new set has 56,722 genes.
> Could you tell us what is the difference between these two versions and
> any document that we could find to explain this?
> Thanks
> Jeff

From kayla at soe.ucsc.edu Mon Jan 7 13:42:36 2008
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Mon, 07 Jan 2008 13:42:36 -0800
Subject: [Genome] highlight multiple items in genome browser
In-Reply-To: <46df4b6d0801062311y32989b08ue075f1cb7588521a@mail.gmail.com>
References: <46df4b6d0801062311y32989b08ue075f1cb7588521a@mail.gmail.com>
Message-ID: <47829CCC.3090205@cse.ucsc.edu>

Hello Tao Peng,

On the SNP details page, which you can get to by clicking on the blue-
or grey-bar on the left-hand side of the track, there are some options
for coloring various subsets of SNPs.

Using these controls, you could for example, make all functional
non-synonymous SNPs "red".

You can also uncheck many of the checkboxes on the details page to
exclude sets of SNPs from being displayed.

Alternatively, you can create a Custom Track, containing only the data
that you want. Here is information on making a Custom Track:
http://genome.ucsc.edu/goldenPath/help/customTrack.html

I hope this information is helpful to you. Please don't hesitate to
contact us again if you require further assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

Tao Peng wrote:
> Dear Sir/Madam:
>
> I want to explore several SNPs in a region. But there are hundreds of
> them in this region. How can I highlight those I 'm interested in?
>
> Querying in the position/search window can highlight one each time. Can
> I highlight multiple items, like SNPs or EST ?
>
> Thank you very much!
>
> Tao Peng
>

From dglemay at ucdavis.edu Sun Jan 6 14:00:21 2008
From: dglemay at ucdavis.edu (Danielle Lemay)
Date: Sun, 06 Jan 2008 14:00:21 -0800
Subject: [Genome] Batch paste of URLs to upload custom tracks not working
Message-ID: <47814F75.3090205@ucdavis.edu>

Hello Team,

If a series of URLs are pasted into the custom track upload window
(http://genome.ucsc.edu/cgi-bin/hgCustom), they cannot be found ("Error
not found"). The browser appears to treat the series of URLs as one
file.
If extra newlines are inserted between the URLs, it works.

Example (non-working):
http://hgwdev.cse.ucsc.edu/~angie/cowGlean/bosTau3.glean52.gff
http://hgwdev.cse.ucsc.edu/~angie/cowGlean/perfect.psl
http://hgwdev.cse.ucsc.edu/~angie/cowGlean/perfectDupes.psl
http://hgwdev.cse.ucsc.edu/~angie/cowGlean/prettyGood.psl
http://hgwdev.cse.ucsc.edu/~angie/cowGlean/missing.psl
http://hgwdev.cse.ucsc.edu/~angie/cowGlean/nonChrUnPerfectDupes.psl

Example (works):
http://hgwdev.cse.ucsc.edu/~angie/cowGlean/bosTau3.glean52.gff

http://hgwdev.cse.ucsc.edu/~angie/cowGlean/perfect.psl

http://hgwdev.cse.ucsc.edu/~angie/cowGlean/perfectDupes.psl

http://hgwdev.cse.ucsc.edu/~angie/cowGlean/prettyGood.psl

http://hgwdev.cse.ucsc.edu/~angie/cowGlean/missing.psl

http://hgwdev.cse.ucsc.edu/~angie/cowGlean/nonChrUnPerfectDupes.psl

The non-working example previously worked so this is a recently
introduced bug (in the last 2 months).

Thanks,
Danielle

====================================================
Danielle Lemay
PhD Candidate, Nutritional Biology
German Lab
University of California at Davis
dglemay at ucdavis.edu
(530) 297 7688

From k.askland at gmail.com Mon Jan 7 11:11:53 2008
From: k.askland at gmail.com (kathleen askland)
Date: Mon, 7 Jan 2008 14:11:53 -0500
Subject: [Genome] (no subject)
Message-ID: <961ec6120801071111l356754fy944b085856f5a052@mail.gmail.com>

Hello,

I was wondering if someone could tell me how I might get the GNF ratio
data wherein missing values are appropriately placed in the sequence of
data for all the tissues (i.e., demarcated by commas on either side). I
am trying to get expression data (ratios), for all replicates, but
whether I pull up the data on the genome browser, or download it as a
table, all of the values available for each tissue are listed
consecutively, rather than skipping values for missing data. Thus, when
I import the data, the values are left-justified, and all the blanks are
at the end of the sequence.
Can someone advise on how to get the data already partitioned appropriately? Thank you, Kathleen Askland, MD From kayla at soe.ucsc.edu Mon Jan 7 17:09:32 2008 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Mon, 07 Jan 2008 17:09:32 -0800 Subject: [Genome] (no subject) In-Reply-To: <961ec6120801071111l356754fy944b085856f5a052@mail.gmail.com> References: <961ec6120801071111l356754fy944b085856f5a052@mail.gmail.com> Message-ID: <4782CD4C.3040108@cse.ucsc.edu> Hello Kathleen, I'm not exactly sure what you're asking for, so correct me if I am wrong. I'm assuming that you're looking at the hg18.gnfAtlas2 table. You're looking at the expId and the expScores columns of this table. There is also a database hgFixed which has some related gnf data in it, which you might be asking about. Are you trying to retrieve data for only a subset of tissues? Unfortunately, the expScores field from hg18.gnfAtlas2 is a comma separated list and can't be separated with our online tools. For that you'd need to write a program to do the parsing. If you can indicate which table(s) you're looking at, and an example row of data that doesn't look how you expect it to, I can take a closer look at figuring out what you need. I hope this information is helpful to you. Please don't hesitate to contact us again if you require further assistance. Kayla Smith UCSC Genome Bioinformatics Group kathleen askland wrote: > Hello, > > I was wondering if someone could tell me how I might get the GNF ratio data > wherein missing values are appropriately placed in the sequence of data for > all the tissues (i.e., demarcated by commas on either side). I am trying to > get expression data (ratios), for all replicates, but whether I pull up the > data on the genome browser, or download it as a table, all of the values > available for each tissue are listed consecutively, rather than skipping > values for missing data. 
Thus, when I import the data, the values are > left-justified, and all the blanks are at the end of the sequence. > > Can someone advise on how to get the data already partitioned appropriately? > > Thank you, > > Kathleen Askland, MD > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From pengtao.arthur at gmail.com Mon Jan 7 18:50:01 2008 From: pengtao.arthur at gmail.com (Tao Peng) Date: Tue, 8 Jan 2008 10:50:01 +0800 Subject: [Genome] highlight multiple items in genome browser In-Reply-To: <47829CCC.3090205@cse.ucsc.edu> References: <46df4b6d0801062311y32989b08ue075f1cb7588521a@mail.gmail.com> <47829CCC.3090205@cse.ucsc.edu> Message-ID: <46df4b6d0801071850u13c942f9w4d58cd07216841fb@mail.gmail.com> Thank you for the explanation. My SNPs can not be filtered out by certain combination of criteria. Creating a Custom Track can partly solve my problem. I suggest UCSC genome browser can add this feature, highlighting several items simultaneously. Sometimes I want to explore the targets among their neighbors. Constructing a Custom track will lose the information of those neighbors. Thank you for you attention. 2008/1/8, Kayla Smith : > > Hello Tao Peng, > > On the SNP details page, which you can get to by clicking on the blue- > or grey-bar on the left-hand side of the track, there are some options > for coloring various subsets of SNPs. > > Using these controls, you could for example, make all functional > non-synonymous SNPs "red". > > You can also uncheck many of the checkboxes on the details page to > exclude sets of SNPs from being displayed. > > Alternatively, you can create a Custom Track, containing only the data > that you want. Here is information on making a Custom Track: > http://genome.ucsc.edu/goldenPath/help/customTrack.html > > I hope this information is helpful to you. Please don't hesitate to > contact us again if you require further assistance. 
> > Kayla Smith > UCSC Genome Bioinformatics Group > > > Tao Peng wrote: > > Dear Sir/Madam: > > > > I want to explore several SNPs in a region. But there are hundreds > of > > them in this region. How can I highlight those I 'm interested in? > > > > Querying in the position/search window can highlight one each time. > Can > > I highlight multiple items, like SNPs or EST ? > > > > Thank you very much! > > > > Tao Peng > > > > -- Bioinformatics Division, TNLIST, Room 1-108, Information Science and Technology Building, Dept. of Automation, Tsinghua Univ. Beijing,China 100084 E-mail: pengtao.arthur at gmail.com From remi.terranova at fmi.ch Tue Jan 8 02:30:38 2008 From: remi.terranova at fmi.ch (Terranova, Remi) Date: Tue, 8 Jan 2008 11:30:38 +0100 Subject: [Genome] Clone RP23-473N24 References: <4850F76274F80D4D8F6D043FAD35F8270246E815@ex2.fmi.ch> Message-ID: <4850F76274F80D4D8F6D043FAD35F8270246E816@ex2.fmi.ch> To whom it may concern, Over the last two years, I have used one clone I chose from UCSC genome browser, RP23-473N24 for FISH analyses. It seems like this clone is not available anymore from the UCSC browser and I wondered what may be wrong with it. This is important for my purpose as I have based many of my analyses on FISH detection using this BAC clone. I would be grateful if you could tell me whether anything is wrong with this RP23-473N24 clone. With best wishes, Remi Terranova RP23-473N24, Mouse Chr7, qF5. Originally located between RP23-71O21 and MSMG01-379L13 - Assembly: Mouse - March 2005 - Track: BAC End Pairs - Kcnq1 locus: chr7: 137,000,000 - 138,000,000 --------------------------------------------------------------- Dr. Rémi Terranova Friedrich Miescher Institute (FMI) Novartis Research foundation Maulbeerstrasse 66 CH-4058 Basel, Switzerland Phone: +41 61 697 6100 E-mail: remi.terranova at fmi.ch --------------------------------------------------------------- * The name of the assembly (e.g.
"hg17", or "human - May 2004") * The name of the track (e.g. "RefSeq Genes") * The exact location you are viewing (e.g. chr7:127,471,196-127,495,720) And if the problem seems to be one of display, please include: * The name and version of the web browser you are using (e.g. Internet Explorer 6) * The name and version of your operating system (e.g. Microsoft XP) * The series of items you clicked on to reach the problem From sselvar1 at asu.edu Mon Jan 7 23:50:28 2008 From: sselvar1 at asu.edu (SiddarthGautham Selvaraj) Date: Tue, 8 Jan 2008 00:50:28 -0700 Subject: [Genome] Coding exons of genes in the human genome Message-ID: <5c826dfc0801072350k4511d382u1a3d7ea4b4f3bf89@mail.gmail.com> Hi: I am siddarth, a graduate student in Arizona State University. I have been trying to get the coding exon sequences of all genes in the human genome. I define genes as an unit of the genome which codes for something. I do not need gene variants. That is a gene may get alternatively spliced to produce different products. I am not bothered on the variants a gene can produce through alternate splicing. I tried KnownCanonical Table but I guess it does not focus on all exons. Thus where can I get information on coding exons of all genes in the human genome. Awaiting Your reply Thanks siddarth From ann at soe.ucsc.edu Tue Jan 8 10:30:31 2008 From: ann at soe.ucsc.edu (Ann Zweig) Date: Tue, 08 Jan 2008 10:30:31 -0800 Subject: [Genome] Coding exons of genes in the human genome In-Reply-To: <5c826dfc0801072350k4511d382u1a3d7ea4b4f3bf89@mail.gmail.com> References: <5c826dfc0801072350k4511d382u1a3d7ea4b4f3bf89@mail.gmail.com> Message-ID: <4783C147.1010100@soe.ucsc.edu> Hello Siddarth, If you have successfully extracted the sequence for all Known Genes, but you only want the sequence for the canonical gene in each gene cluster, then my suggestion would be to use the output from the knownCanonical table to filter out those genes from your list that you are not interested in. 
If this does not address your issue, please feel free to write back to the genome list. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list. SiddarthGautham Selvaraj wrote: > Hi: > > I am siddarth, a graduate student in Arizona State University. I have been > trying to get the coding exon sequences of all genes in the human genome. > > I define genes as an unit of the genome which codes for something. I do not > need gene variants. That is a gene may get alternatively spliced to produce > different products. I am not bothered on the variants a gene can produce > through alternate splicing. > > I tried KnownCanonical Table but I guess it does not focus on all exons. > > Thus where can I get information on coding exons of all genes in the human > genome. > > Awaiting Your reply > > Thanks > siddarth > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From hiram at soe.ucsc.edu Tue Jan 8 13:24:49 2008 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Tue, 08 Jan 2008 13:24:49 -0800 Subject: [Genome] Clone RP23-473N24 Message-ID: <4783EA21.5050106@soe.ucsc.edu> Good Afternoon Dr. Terranova: I am investigating why this clone does not show up on the UCSC mouse genome browser for July 2007 (UCSC mm9/NCBI 37) compared to the March 2005 (UCSC mm6/NCBI 34) version. I do note the clone end sequences AZ095043 and AZ095046 do map correctly to the July 2007 sequence in the location: chr7:150,015,500-150,194,000 on the same contig NT_039437 as the March 2005 sequence. This would indicate this is a valid clone even on the newer assemblies. 
I will have to determine why it is missing from the BacEnd track. Please note, the genome email list is the correct address for questions on the genome browser. --Hiram -------- Original Message -------- Subject: [Genome] Clone RP23-473N24 Date: Tue, 8 Jan 2008 11:30:38 +0100 From: Terranova, Remi To: To whom it may concern, Over the last two years, I have used one clone I chose from UCSC genome browser, RP23-473N24 for FISH analyses. It seems like this clone is not available anymore from the UCSC browser and I wondered what may be wrong with it. This is important for my purpose as I have based many of my analyses on FISH detection using this BAC clone. I would be grateful if you could tell me whether anything is wrong with this RP23-473N24 clone. With best wishes, Remi Terranova RP23-473N24, Mouse Chr7, qF5. Originally located between RP23-71O21 and MSMG01-379L13 - Assembly: Mouse - March 2005 - Track: BAC End Pairs - Kcnq1 locus: chr7: 137,000,000 - 138,000,000 --------------------------------------------------------------- Dr. Rémi Terranova Friedrich Miescher Institute (FMI) Novartis Research foundation Maulbeerstrasse 66 CH-4058 Basel, Switzerland Phone: +41 61 697 6100 From wanglunju at gmail.com Tue Jan 8 23:36:11 2008 From: wanglunju at gmail.com (=?GB2312?B?zfXotA==?=) Date: Wed, 9 Jan 2008 15:36:11 +0800 Subject: [Genome] found a puzzle in your database In-Reply-To: <14ad659e0801082228x1d12e861x1cf3eab7e0a9ec76@mail.gmail.com> References: <14ad659e0801082228x1d12e861x1cf3eab7e0a9ec76@mail.gmail.com> Message-ID: <14ad659e0801082336v3b2d6a02u7f4248ffb9b0fef4@mail.gmail.com> Dear principal of UCSC Genome Bioinformatics Group: I am a graduate student at Nanjing University, China, and my major is biochemistry. My current work concerns the promoter of the mouse PRL-3 gene (ENSMUSG00000059895).
In your database, the introns of this gene total about 30 kb, but in the Ensembl database the gene structure is different (see the attached picture). I figured out that the difference is the first intron. In your database, there is a 26 kb intron at the beginning of the gene, but in the Ensembl database the gene does not contain this intron. Maybe I made a mistake when searching for this gene; if so, can you tell me the position of the transcription start site of this gene? Sorry to trouble you with this puzzle, but I really hope that you can help me. Thank you! Best wishes. Yours Wang Lu From kayla at soe.ucsc.edu Wed Jan 9 10:29:48 2008 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Wed, 09 Jan 2008 10:29:48 -0800 Subject: [Genome] highlight multiple items in genome browser In-Reply-To: <46df4b6d0801071850u13c942f9w4d58cd07216841fb@mail.gmail.com> References: <46df4b6d0801062311y32989b08ue075f1cb7588521a@mail.gmail.com> <47829CCC.3090205@cse.ucsc.edu> <46df4b6d0801071850u13c942f9w4d58cd07216841fb@mail.gmail.com> Message-ID: <4785129C.6080602@cse.ucsc.edu> Hello Tao Peng, Thank you for your suggestion. We will consider it for a future release. Kayla Smith UCSC Genome Bioinformatics Group Tao Peng wrote: > Thank you for the explanation. > > My SNPs can not be filtered out by certain combination of criteria. > Creating a Custom Track can partly solve my problem. > > I suggest UCSC genome browser can add this feature, highlighting several > items simultaneously. > Sometimes I want to explore the targets among their neighbors. > Constructing a Custom track will lose the information of those neighbors. > > Thank you for you attention. > > > 2008/1/8, Kayla Smith < kayla at soe.ucsc.edu >: > > Hello Tao Peng, > > On the SNP details page, which you can get to by clicking on the blue- > or grey-bar on the left-hand side of the track, there are some options > for coloring various subsets of SNPs.
> > Using these controls, you could for example, make all functional > non-synonymous SNPs "red". > > You can also uncheck many of the checkboxes on the details page to > exclude sets of SNPs from being displayed. > > Alternatively, you can create a Custom Track, containing only the data > that you want. Here is information on making a Custom Track: > http://genome.ucsc.edu/goldenPath/help/customTrack.html > > I hope this information is helpful to you. Please don't hesitate to > contact us again if you require further assistance. > > Kayla Smith > UCSC Genome Bioinformatics Group > > > Tao Peng wrote: > > Dear Sir/Madam: > > > > I want to explore several SNPs in a region. But there are > hundreds of > > them in this region. How can I highlight those I 'm interested in? > > > > Querying in the position/search window can highlight one each > time. Can > > I highlight multiple items, like SNPs or EST ? > > > > Thank you very much! > > > > Tao Peng > > > > > > > -- > Bioinformatics Division, TNLIST, Room 1-108, > Information Science and Technology Building, > Dept. of Automation, Tsinghua Univ. > Beijing,China 100084 > E-mail: pengtao.arthur at gmail.com From viswanathl at mail.nih.gov Wed Jan 9 10:44:59 2008 From: viswanathl at mail.nih.gov (Viswanath, Lalitha (NIH/NCI) [C]) Date: Wed, 9 Jan 2008 13:44:59 -0500 Subject: [Genome] Is the UCSC FTP Server down? Message-ID: Hi I am encountering error message Error in server response, closing control connection. Error in server response, closing control connection. When trying to download wget -nv ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz Thanks Lalitha From hiram at soe.ucsc.edu Wed Jan 9 11:19:09 2008 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Wed, 09 Jan 2008 11:19:09 -0800 Subject: [Genome] Is the UCSC FTP Server down? In-Reply-To: References: Message-ID: <47851E2D.6030000@soe.ucsc.edu> Thank you for the report. Our FTP server was hung up due to too many connections. 
Evidently the back-log of requests from our outage last week is stressing the system. --Hiram Viswanath, Lalitha (NIH/NCI) [C] wrote: > Hi > > I am encountering error message > > Error in server response, closing control connection. > > Error in server response, closing control connection. > > When trying to download > > wget -nv > ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz > > > > Thanks > > Lalitha From kuhn at soe.ucsc.edu Wed Jan 9 11:55:43 2008 From: kuhn at soe.ucsc.edu (Robert Kuhn) Date: Wed, 9 Jan 2008 11:55:43 -0800 Subject: [Genome] Is the UCSC FTP Server down? Message-ID: <200801091955.LAA26414@sundance.cse.ucsc.edu> Lalitha, You may be able to get the file you need via the mysql site, as it is very small. mysql -h genome-mysql -u genome -A hg18 -e "SELECT * FROM cytoBand" > cytoBand.txt best wishes, --b0b kuhn ucsc genome bioinformatics group > From genome-bounces at soe.ucsc.edu Wed Jan 9 11:19:39 2008 > To: "Viswanath, Lalitha (NIH/NCI) [C]" > Cc: genome at soe.ucsc.edu > Subject: Re: [Genome] Is the UCSC FTP Server down? > > Thank you for the report. Our FTP server was hung up due to > too many connections. Evidently the back-log of requests from > our outage last week is stressing the system. > > --Hiram > > Viswanath, Lalitha (NIH/NCI) [C] wrote: > > Hi > > > > I am encountering error message > > > > Error in server response, closing control connection. > > > > Error in server response, closing control connection. 
> > > > When trying to download > > > > wget -nv > > ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz > > > > > > > > Thanks > > > > Lalitha > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From sselvar1 at asu.edu Wed Jan 9 13:03:36 2008 From: sselvar1 at asu.edu (SiddarthGautham Selvaraj) Date: Wed, 9 Jan 2008 14:03:36 -0700 Subject: [Genome] Coding exons of genes in the human genome In-Reply-To: <4783C147.1010100@soe.ucsc.edu> References: <5c826dfc0801072350k4511d382u1a3d7ea4b4f3bf89@mail.gmail.com> <4783C147.1010100@soe.ucsc.edu> Message-ID: <5c826dfc0801091303s58f400c2k2e066d7f490ef820@mail.gmail.com> Hi, I am a graduate student from Arizona State University. I am trying to extract the list of canonical genes from KnownCanonical table. To do that, I first need to organize genes based on gene symbols. The values contained as gene symbol in kgXref table are not very precise. It has entries like 1-Mar 2-Mar 3-Mar 4-Mar 5-Mar 6-Mar 7-Mar 8-Mar 9-Mar 1-Sep 2-Sep 3-Sep 4-Sep 5-Sep 6-Sep 7-Sep 8-Sep 9-Sep 10-Sep 11-Sep 12-Sep 15-Sep 1-Dec and C10orf10 C10orf104 C10orf107 C10orf108 C10orf11 C10orf110 C10orf111 C10orf113 C10orf114 C10orf116 C10orf118 C10orf119 C10orf12 C10orf120 C10orf122 C10orf125 C10orf129 C10orf130 C10orf132 C10orf137 C10orf18 C10orf22 C10orf25 C10orf26 C10orf27 C10orf28 C10orf30 C10orf32 C10orf33 C10orf35 C10orf38 C10orf4 C10orf46 C10orf47 C10orf49 C10orf53 C10orf54 C10orf55 C10orf56 C10orf57 C10orf58 C10orf59 C10orf6 C10orf61 C10orf62 C10orf63 C10orf64 C10orf65 C10orf67 C10orf68 C10orf71 C10orf72 C10orf73 C10orf76 C10orf78 which I guess are junk entries. Thus I would require HGNC symbols from some other table so that I can compare these values and get only the proper gene symbols and hence proper genes. Please let me know which table has the HGNC symbols. Please help me in this. 
Thanks siddarth On 08/01/2008, Ann Zweig wrote: > > Hello Siddarth, > > If you have successfully extracted the sequence for all Known > Genes, but you > only want the sequence for the canonical gene in each gene cluster, then > my > suggestion would be to use the output from the knownCanonical table to > filter > out those genes from your list that you are not interested in. > > If this does not address your issue, please feel free to write > back to the > genome list. > > > Regards, > > ---------- > Ann Zweig > UCSC Genome Bioinformatics Group > http://genome.ucsc.edu > > Please feel free to search the Genome mailing list archives by visiting > our home > page, clicking on "Contact Us", then typing a word or phrase into the > search > box. On that same page > (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome > mailing > list. > > > > > SiddarthGautham Selvaraj wrote: > > Hi: > > > > I am siddarth, a graduate student in Arizona State University. I have > been > > trying to get the coding exon sequences of all genes in the human > genome. > > > > I define genes as an unit of the genome which codes for something. I do > not > > need gene variants. That is a gene may get alternatively spliced to > produce > > different products. I am not bothered on the variants a gene can produce > > through alternate splicing. > > > > I tried KnownCanonical Table but I guess it does not focus on all exons. > > > > Thus where can I get information on coding exons of all genes in the > human > > genome. 
> > > > Awaiting Your reply > > > > Thanks > > siddarth > > _______________________________________________ > > Genome maillist - Genome at soe.ucsc.edu > > http://www.soe.ucsc.edu/mailman/listinfo/genome > From rhead at soe.ucsc.edu Wed Jan 9 13:34:09 2008 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Wed, 09 Jan 2008 13:34:09 -0800 Subject: [Genome] Coding exons of genes in the human genome In-Reply-To: <5c826dfc0801091303s58f400c2k2e066d7f490ef820@mail.gmail.com> References: <5c826dfc0801072350k4511d382u1a3d7ea4b4f3bf89@mail.gmail.com> <4783C147.1010100@soe.ucsc.edu> <5c826dfc0801091303s58f400c2k2e066d7f490ef820@mail.gmail.com> Message-ID: <47853DD1.7050508@soe.ucsc.edu> Hi Siddarth, Are you using the hg18 database? I do not see any entries in the hg18.kgXref table like '1-Mar' or 'C10orf10'. Is the kgXref table you are using identical to the one located here?: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/kgXref.txt.gz -- Brooke Rhead UCSC Genome Bioinformatics Group SiddarthGautham Selvaraj wrote: > Hi, > > > I am a graduate student from Arizona State University. I am trying to > extract the list of canonical genes from KnownCanonical table. To do that, I > first need to organize genes based on gene symbols. > > The values contained as gene symbol in kgXref table are not very precise. 
It > has entries like > > 1-Mar 2-Mar 3-Mar 4-Mar 5-Mar 6-Mar 7-Mar 8-Mar 9-Mar 1-Sep 2-Sep > 3-Sep 4-Sep 5-Sep 6-Sep 7-Sep 8-Sep 9-Sep 10-Sep 11-Sep 12-Sep > 15-Sep 1-Dec and > > C10orf10 C10orf104 C10orf107 C10orf108 C10orf11 C10orf110 C10orf111 > C10orf113 C10orf114 C10orf116 C10orf118 C10orf119 C10orf12 C10orf120 > C10orf122 C10orf125 C10orf129 C10orf130 C10orf132 C10orf137 C10orf18 > C10orf22 C10orf25 C10orf26 C10orf27 C10orf28 C10orf30 C10orf32 > C10orf33 C10orf35 C10orf38 C10orf4 C10orf46 C10orf47 C10orf49 > C10orf53 C10orf54 C10orf55 C10orf56 C10orf57 C10orf58 C10orf59 > C10orf6 C10orf61 C10orf62 C10orf63 C10orf64 C10orf65 C10orf67 > C10orf68 C10orf71 C10orf72 C10orf73 C10orf76 C10orf78 > which I guess are junk entries. Thus I would require HGNC symbols from some > other table so that I can compare these values and get only the proper gene > symbols and hence proper genes. > > Please let me know which table has the HGNC symbols. > > Please help me in this. > > Thanks > siddarth > > On 08/01/2008, Ann Zweig wrote: >> Hello Siddarth, >> >> If you have successfully extracted the sequence for all Known >> Genes, but you >> only want the sequence for the canonical gene in each gene cluster, then >> my >> suggestion would be to use the output from the knownCanonical table to >> filter >> out those genes from your list that you are not interested in. >> >> If this does not address your issue, please feel free to write >> back to the >> genome list. >> >> >> Regards, >> >> ---------- >> Ann Zweig >> UCSC Genome Bioinformatics Group >> http://genome.ucsc.edu >> >> Please feel free to search the Genome mailing list archives by visiting >> our home >> page, clicking on "Contact Us", then typing a word or phrase into the >> search >> box. On that same page >> (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome >> mailing >> list. 
>> >> >> >> SiddarthGautham Selvaraj wrote: >>> Hi: >>> >>> I am siddarth, a graduate student in Arizona State University. I have >> been >>> trying to get the coding exon sequences of all genes in the human >> genome. >>> I define genes as an unit of the genome which codes for something. I do >> not >>> need gene variants. That is a gene may get alternatively spliced to >> produce >>> different products. I am not bothered on the variants a gene can produce >>> through alternate splicing. >>> >>> I tried KnownCanonical Table but I guess it does not focus on all exons. >>> >>> Thus where can I get information on coding exons of all genes in the >> human >>> genome. >>> >>> Awaiting Your reply >>> >>> Thanks >>> siddarth >>> _______________________________________________ >>> Genome maillist - Genome at soe.ucsc.edu >>> http://www.soe.ucsc.edu/mailman/listinfo/genome > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From sselvar1 at asu.edu Wed Jan 9 14:29:08 2008 From: sselvar1 at asu.edu (SiddarthGautham Selvaraj) Date: Wed, 9 Jan 2008 15:29:08 -0700 Subject: [Genome] HGNC or HUGO Symbols Message-ID: <5c826dfc0801091429m6b151a0fm7c04682de62482ae@mail.gmail.com> Hi: I am using the hg18 database. I am using only the gene symbol field of the table KgXref. You can find entries like MARCH1, MARCH5, etc. The gene symbol field also has entries like C10orf10, etc. I have learnt that the field contains actual gene symbols, sometimes Swiss-Prot IDs and sometimes something else. Under what conditions are values other than gene symbols given in the 'Gene Symbol' field of the table KgXref? Also please let me know if some table has the HGNC symbols or HUGO symbols so that I can extract exactly those entries that have gene symbols in the field 'gene symbol' of the table KgXref.
Thanks siddarth From rhead at soe.ucsc.edu Wed Jan 9 14:53:53 2008 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Wed, 09 Jan 2008 14:53:53 -0800 Subject: [Genome] HGNC or HUGO Symbols In-Reply-To: <5c826dfc0801091429m6b151a0fm7c04682de62482ae@mail.gmail.com> References: <5c826dfc0801091429m6b151a0fm7c04682de62482ae@mail.gmail.com> Message-ID: <47855081.2050309@soe.ucsc.edu> Hi Siddarth, Pardon me, I did not see the names like 'C10orf10' in my original search of kgXref. I see them now. When there is a HGNC (HUGO) Gene Symbol available, it will appear in this field of this table: kgXref.geneSymbol. Not all genes in the knownGene table have a HGNC Gene Symbol assigned. There is another table in the 'proteome' database that contains HUGO gene symbols and HGNC IDs: 'hgncXref'. One of our team members has developed a MySQL database query that will retrieve the HGNC gene symbols mapped to the knownGene transcript name: SELECT geneSymbol,transcript FROM knownCanonical,kgXref,proteome.hgncXref WHERE kgXref.kgID=knownCanonical.transcript AND kgXref.geneSymbol = proteome.hgncXref.symbol Also, here are some previously answered questions that pertain to getting HUGO gene names from our tables: http://www.soe.ucsc.edu/pipermail/genome/2006-April/010350.html https://www.soe.ucsc.edu/pipermail/genome/2006-September/011570.html I hope this information is helpful. -- Brooke Rhead UCSC Genome Bioinformatics Group SiddarthGautham Selvaraj wrote: > Hi: > > I am using the hg18 database. I am using only the gene symbol field of the > table KgXref. > > You can find entries like MARCH1, MARCH5, etc. The gene symbol field also > has entries like C10orf10, etc. > > I have learnt that the field contains actual gene symbols, sometime > swiss-prot id and sometimes something else. > > Under what consitions, values other than gene symbol are given in the 'Gene > Symbol' field of the table KgXref? 
> > Also please let me know if some table has the HGNC symbols or HUGO symbols > so that I can extract exactly those entries that have Gene symbols in the > field 'gene symbol' of the table KgXref. > > Thanks > siddarth > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From elana.miriami at yale.edu Wed Jan 9 14:48:35 2008 From: elana.miriami at yale.edu (Miriami Elana) Date: Wed, 09 Jan 2008 17:48:35 -0500 Subject: [Genome] question: non-protein coding genes Message-ID: <20080109174835.zx7q85uem8o0o0s0@www.mail.yale.edu> Hello, I am interested in non-protein coding genes from all Known Genes (the coordinates in a bed file with the name of the gene). How can I get it using the UCSC genome browser? Thank you very much, Miriami Elana Yale University Department of Biology and MCB 219 Prospect St New Haven, CT 06511 USA From rhead at soe.ucsc.edu Wed Jan 9 15:27:42 2008 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Wed, 09 Jan 2008 15:27:42 -0800 Subject: [Genome] question: non-protein coding genes In-Reply-To: <20080109174835.zx7q85uem8o0o0s0@www.mail.yale.edu> References: <20080109174835.zx7q85uem8o0o0s0@www.mail.yale.edu> Message-ID: <4785586E.7010001@soe.ucsc.edu> Hello Elana, To get all of the non-protein-coding UCSC Genes in BED format, go to the Table Browser (the "Tables" link at the top of the page) and select the clade, genome and assembly of interest. Then select: group: Genes and Gene Prediction Tracks track: UCSC Genes table: knownGene region: genome output format: BED - Browser Extensible Data output file: enter a filename to save the results to your computer Then hit the "filter: create" button, and in the free-form query box enter: cdsStart=cdsEnd and hit submit. Now hit "get output". Note that only the "UCSC Genes" tracks in hg18 and mm9 contain non-protein-coding genes (the older "Known Genes" tracks do not). 
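[Editorial sketch, not part of the original reply.] The same cdsStart=cdsEnd filter can also be applied offline to a downloaded copy of the knownGene table. The Python below is a minimal sketch under stated assumptions: it assumes tab-separated rows whose first seven columns are name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd (check the table schema for your assembly before relying on this order), and the function name noncoding_to_bed and the sample rows are hypothetical, made up for illustration only.

```python
def noncoding_to_bed(rows):
    """Return BED6 lines for knownGene-style rows with an empty coding region.

    Assumed column order (an assumption, verify against the table schema):
    name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ...
    """
    bed = []
    for row in rows:
        row = row.rstrip("\n")
        # Skip blank lines and the optional "#..." header line.
        if not row or row.startswith("#"):
            continue
        f = row.split("\t")
        name, chrom, strand = f[0], f[1], f[2]
        tx_start, tx_end = int(f[3]), int(f[4])
        cds_start, cds_end = int(f[5]), int(f[6])
        # Same test as the Table Browser filter: cdsStart=cdsEnd
        if cds_start == cds_end:
            bed.append(f"{chrom}\t{tx_start}\t{tx_end}\t{name}\t0\t{strand}")
    return bed


if __name__ == "__main__":
    # Hypothetical rows for illustration; these are not real annotations.
    sample = [
        "#name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd",
        "ucHypA.1\tchr1\t+\t1000\t5000\t1000\t1000",
        "ucHypB.1\tchr2\t-\t200\t900\t300\t800",
    ]
    for line in noncoding_to_bed(sample):
        print(line)
```

Only the first sample row has cdsStart equal to cdsEnd, so only that transcript is written out. The score column is set to 0 here because the sketch has no score to report; the Table Browser's own BED output may differ in the optional columns.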
If you have further questions, please feel free to write back to this mailing list address. -- Brooke Rhead UCSC Genome Bioinformatics Group Miriami Elana wrote: > Hello, > > I am interested in non-protein coding genes from all Known Genes (the > coordinates > in a bed file with the name of the gene). How can I get it using the UCSC genome > browser? > > Thank you very much, > > Miriami Elana > Yale University > Department of Biology and MCB > 219 Prospect St > New Haven, CT 06511 > USA > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From rhead at soe.ucsc.edu Wed Jan 9 15:49:42 2008 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Wed, 09 Jan 2008 15:49:42 -0800 Subject: [Genome] HGNC or HUGO Symbols In-Reply-To: <47855081.2050309@soe.ucsc.edu> References: <5c826dfc0801091429m6b151a0fm7c04682de62482ae@mail.gmail.com> <47855081.2050309@soe.ucsc.edu> Message-ID: <47855D96.5060803@soe.ucsc.edu> Hello again Siddarth, A colleague has pointed out that the names like MARCH1 and C10orf10 are indeed approved gene symbols in HUGO/HGNC. If you look them up at http://www.genenames.org/ you will see entries for them: MARCH1 = membrane-associated ring finger (C3HC4) 1 C10orf10 = chromosome 10 open reading frame 10 -- Brooke Rhead UCSC Genome Bioinformatics Group Brooke Rhead wrote: > Hi Siddarth, > > Pardon me, I did not see the names like 'C10orf10' in my original search > of kgXref. I see them now. > > When there is a HGNC (HUGO) Gene Symbol available, it will appear in > this field of this table: kgXref.geneSymbol. Not all genes in the > knownGene table have a HGNC Gene Symbol assigned. > > There is another table in the 'proteome' database that contains HUGO > gene symbols and HGNC IDs: 'hgncXref'. 
> > One of our team members has developed a MySQL database query that will > retrieve the HGNC gene symbols mapped to the knownGene transcript name: > > SELECT geneSymbol,transcript > FROM knownCanonical,kgXref,proteome.hgncXref > WHERE kgXref.kgID=knownCanonical.transcript > AND kgXref.geneSymbol = proteome.hgncXref.symbol > > Also, here are some previously answered questions that pertain to > getting HUGO gene names from our tables: > > http://www.soe.ucsc.edu/pipermail/genome/2006-April/010350.html > https://www.soe.ucsc.edu/pipermail/genome/2006-September/011570.html > > I hope this information is helpful. > From ann at soe.ucsc.edu Wed Jan 9 16:05:11 2008 From: ann at soe.ucsc.edu (Ann Zweig) Date: Wed, 09 Jan 2008 16:05:11 -0800 Subject: [Genome] Clone RP23-473N24 In-Reply-To: <4783EA21.5050106@soe.ucsc.edu> References: <4783EA21.5050106@soe.ucsc.edu> Message-ID: <47856137.50001@cse.ucsc.edu> Hello Remi, We have added this clone to the mm9 BAC End Pairs annotation track. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Hiram Clawson wrote: > Good Afternoon Dr. Terranova: > > I am investigating why this clone does not show up on the > UCSC mouse genome browser for July 2007 (UCSC mm9/NCBI 37) compared > to the March 2005 (UCSC mm6/NCBI 34) version. I do note the > clone end sequences AZ095043 and AZ095046 do map correctly > to the July 2007 sequence in the location: chr7:150,015,500-150,194,000 > on the same contig NT_039437 as the March 2005 sequence. > > This would indicate this is a valid clone even on the newer assemblies. > I will have to determine why it is missing from the BacEnd track. > > Please note, the genome email list is the correct address > for questions on the genome browser. 
> > --Hiram > > -------- Original Message -------- > Subject: [Genome] Clone RP23-473N24 > Date: Tue, 8 Jan 2008 11:30:38 +0100 > From: Terranova, Remi > To: > > To whom it may concern, > > Over the last two years, I have used one clone I chose from UCSC genome browser, RP23-473N24 for FISH analyses. It seems like this > clone is not available anymore from the UCSC browser and I wondered what may be wrong with it. > > This is important for my purpose as I have based many of my analyses on FISH detection using this BAC clone. I would be grateful if > you could tell me whether anything is wrong with this RP23-473N24 clone. > > With best wishes, > > Remi Terranova > > > RP23-473N24, Mouse Chr7, qF5. Originally located between RP23-71O21 and MSMG01-379L13 > - Assembly: Mouse - March 2005 > - Track: BAC End Pairs > - Kcnq1 locus: chr7: 137,000,000 - 138,000,000 > > --------------------------------------------------------------- > > Dr. R?mi Terranova > Friedrich Miescher Institute (FMI) > Novartis Research foundation > Maulbeerstrasse 66 > > CH-4058 Basel, Switzerland > > Phone: +41 61 697 6100 > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From arhan at ucla.edu Wed Jan 9 16:19:08 2008 From: arhan at ucla.edu (Areum Han) Date: Wed, 9 Jan 2008 16:19:08 -0800 Subject: [Genome] download of spliced_Est annotation Message-ID: <003d01c8531e$6b330850$8a2743a4@AreumHan> The spliced_Est annotation for mm is downloadable? or need to be downloaded a crhom by chrom? 
From rhead at soe.ucsc.edu Wed Jan 9 17:14:48 2008
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Wed, 09 Jan 2008 17:14:48 -0800
Subject: [Genome] download of spliced_Est annotation
In-Reply-To: <003d01c8531e$6b330850$8a2743a4@AreumHan>
References: <003d01c8531e$6b330850$8a2743a4@AreumHan>
Message-ID: <47857188.2030901@soe.ucsc.edu>

Hello Areum Han,

The tables underlying the Spliced ESTs track are indeed split by
chromosome on our downloads page
(http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/). However, you
can get all of the split tables at once using the Table Browser. Hit the
blue "Tables" link at the top of the page, and choose the mouse assembly
you wish to use from the drop-down menus. Then select:

  group: mRNA and EST tracks
  track: Spliced ESTs
  table: intronEst
  region: genome
  output format: all fields from selected table
  output file: enter a file name here
  file type returned: gzip compressed

and hit "get output". When compressed, the file is about 63 Mbytes (for
the mm9 assembly).

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Areum Han wrote:
> The spliced_Est annotation for mm is downloadable?
> or need to be downloaded a chrom by chrom?
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From Mike.Mitchell at cancer.org.uk Thu Jan 10 06:23:28 2008
From: Mike.Mitchell at cancer.org.uk (Mike Mitchell)
Date: Thu, 10 Jan 2008 14:23:28 +0000
Subject: [Genome] Batch extraction
Message-ID:

Hello,

I have a list of 90+ regions for which I would like to extract data
from the TableBrowser.

If I upload these regions into the "define regions" interface, the data I
get back doesn't (seem to) identify which region any particular RefGene is
within.

I could c&p each region into the TableBrowser and do the 90+ searches in
turn, but that will be laborious (and no doubt prone to human error). Is
there a way to do a "batch" TableBrowser query?
Or is there a better tool I could use? Thanks in advance for your time and help. -- Mike Mitchell Bioinformatics & Biostatistics Service Cancer Research UK +44 (0) 207 269 3115 +++ Divide By Cucumber Error. Please Reinstall Universe And Reboot +++ From peter.shepard at gmail.com Thu Jan 10 07:43:47 2008 From: peter.shepard at gmail.com (Pete Shepard) Date: Thu, 10 Jan 2008 07:43:47 -0800 Subject: [Genome] go term level Message-ID: <5c2c43620801100743k317b00d8wf61d84f7cca739dc@mail.gmail.com> Good Morning Browser Folks, I am using the Table Browser to extract GO terms for a number of genes from go.goaPart table and joining this with the go.term field. I am hoping to get the level of each term in the ontology. For example, if I enter a single Trembleid=dbObjectId (O00139) I get the following output for the cellular component ontology: #go.goaPart.dbObjectId go.goaPart.dbObjectSymbol go.goaPart.aspect go.term.name go.term.term_type go.term.acc O00139 KIF2A_HUMAN C cytoskeleton cellular_component GO:0005856 O00139 KIF2A_HUMAN C microtubule cellular_component GO:0005874 O00139 KIF2A_HUMAN C microtubule associated complex cellular_component GO:0005875 I am hoping to get the term at the highest level for each ontology. Is there a way to accomplish this such as a number associated with each term. Thank you. From darnelr at mail.rockefeller.edu Thu Jan 10 08:37:33 2008 From: darnelr at mail.rockefeller.edu (Robert Darnell) Date: Thu, 10 Jan 2008 11:37:33 -0500 Subject: [Genome] Genome graph window size Message-ID: <9D306E9B-AB4F-416C-BDBE-16EA7495E40E@rockefeller.edu> Hi: I'm a molecular biologist (so have mercy please) interested in trying to graph BED files uploaded into Genome Graphs, but to have the data plotted in a narrower window than 10,000bp; even 1,000 bp would be an improvement I think for our data, or, optimally, 100bp. Is this impossible because of computation time, or feasible? Thanks, Bob _________________________________ Robert B. 
Darnell The Rockefeller University From hiram at soe.ucsc.edu Thu Jan 10 10:46:21 2008 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Thu, 10 Jan 2008 10:46:21 -0800 Subject: [Genome] Genome graph window size In-Reply-To: <9D306E9B-AB4F-416C-BDBE-16EA7495E40E@rockefeller.edu> References: <9D306E9B-AB4F-416C-BDBE-16EA7495E40E@rockefeller.edu> Message-ID: <478667FD.7060402@soe.ucsc.edu> Good Morning Robert: If your data can be calculated to a consistent window size, say one data point for every 100bp, you can use the wiggle format to view the data in the genome browser: http://genome.ucsc.edu/goldenPath/help/wiggle.html The genome graphs are whole genome viewpoints and wouldn't be able to show your data detail below a couple 100 thousand bases. Unless you view them in the genome browser, where you could instead view them as wiggle data plots. Wiggle data points are bar graphs at the specified locations with the bars of size "span". (span must be consistent) Whereas, the genome graph data points are points on a line graph. They are different types of graphs. The wiggle data can be fudged into looking like a line graph by setting the graph options to "points" (actually the lines of the tops of the bars) and set smoothing on to smear those little line segments into a continuous looking line. --Hiram Robert Darnell wrote: > Hi: > > I'm a molecular biologist (so have mercy please) interested in trying > to graph BED files uploaded into Genome Graphs, but to have the data > plotted in a narrower window than 10,000bp; even 1,000 bp would be an > improvement I think for our data, or, optimally, 100bp. Is this > impossible because of computation time, or feasible? > > Thanks, > > Bob > _________________________________ > Robert B. 
Darnell
> The Rockefeller University
>
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

From galt at soe.ucsc.edu Thu Jan 10 11:20:55 2008
From: galt at soe.ucsc.edu (Galt Barber)
Date: Thu, 10 Jan 2008 11:20:55 -0800 (PST)
Subject: [Genome] Genome graph window size
In-Reply-To: <478667FD.7060402@soe.ucsc.edu>
References: <9D306E9B-AB4F-416C-BDBE-16EA7495E40E@rockefeller.edu>
	<478667FD.7060402@soe.ucsc.edu>
Message-ID:

Genome Graphs

If you convert your BED file with a script to the native Genome Graphs
custom file format, which is very simple, e.g.

chr pos value
chr12 120000 1

you can have up to millions of points, although it was designed
expecting more like hundreds of thousands.

If you only plot the peaks and they are all the same value, you won't
see them, but by adding a few points around your peak you can give it
definition. i.e.

chr12 120000 1
chr12 130000 1
chr12 140000 1
chr12 150000 1

Since it just connects the dots, defining only the peaks doesn't work.
It just joins them into what looks like a single line.
(There is also a user-configurable limit value that causes it to not
connect two dots further apart than the max #bp you specify.)

However, if you had features that are say 100 bp wide and
non-overlapping, you can define each peak with multiple points like
this:

chr12 120000 0
chr12 120001 1
chr12 120099 1
chr12 120100 0

And now you have a square peak 100bp wide.
If you prefer a triangular peak, you could do this:

chr12 120000 0
chr12 120050 1
chr12 120100 0

Note that the genome graphs display has a limited number of pixels so
that you are probably looking at roughly thousands of bp per pixel.
But when you click on it to go through to the genome-browser, you will
be able to zoom in to any level of detail.
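
The square-peak recipe described above could be scripted roughly as
follows. This is only a minimal sketch in Python and was not part of the
original reply. The function names are hypothetical, the input is assumed
to be BED lines whose fourth column is a numeric value (as in the test
files posted in this thread), and the single-space output simply mirrors
the "chr pos value" example above rather than any documented requirement.

```python
def bed_to_square_peaks(bed_lines):
    """Turn BED intervals into (chrom, pos, value) points that outline
    one square peak per interval, following the four-point pattern
    start/0, start+1/value, end-1/value, end/0.

    Assumption: column 4, if present, is a numeric value; otherwise 1 is used.
    """
    rows = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and custom-track header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        value = float(fields[3]) if len(fields) > 3 else 1.0
        if end - start < 3:
            # Too narrow to outline a square; emit a single point instead.
            rows.append((chrom, start, value))
            continue
        rows.extend([
            (chrom, start, 0.0),        # baseline just before the peak
            (chrom, start + 1, value),  # rising edge
            (chrom, end - 1, value),    # falling edge
            (chrom, end, 0.0),          # back to baseline
        ])
    return rows


def format_points(rows):
    """Render points as 'chr pos value' lines (separator is an assumption)."""
    return "\n".join(f"{chrom} {pos} {value:g}" for chrom, pos, value in rows)


if __name__ == "__main__":
    example = [
        "track name='test data' description='test data'",
        "chr12 120000 120100 1",
    ]
    print(format_points(bed_to_square_peaks(example)))
```

For the single 100 bp feature in the example, the script emits the same
four points as the square-peak listing above.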
-Galt On Thu, 10 Jan 2008, Hiram Clawson wrote: > Good Morning Robert: > > If your data can be calculated to a consistent window size, > say one data point for every 100bp, you can use the > wiggle format to view the data in the genome browser: > http://genome.ucsc.edu/goldenPath/help/wiggle.html > > The genome graphs are whole genome viewpoints and wouldn't > be able to show your data detail below a couple 100 thousand > bases. Unless you view them in the genome browser, where > you could instead view them as wiggle data plots. > > Wiggle data points are bar graphs at the specified > locations with the bars of size "span". (span must > be consistent) Whereas, the genome graph data > points are points on a line graph. They are different > types of graphs. The wiggle data can be fudged into > looking like a line graph by setting the graph > options to "points" (actually the lines of the tops > of the bars) and set smoothing on to smear those little > line segments into a continuous looking line. > > --Hiram > > Robert Darnell wrote: > > Hi: > > > > I'm a molecular biologist (so have mercy please) interested in trying > > to graph BED files uploaded into Genome Graphs, but to have the data > > plotted in a narrower window than 10,000bp; even 1,000 bp would be an > > improvement I think for our data, or, optimally, 100bp. Is this > > impossible because of computation time, or feasible? > > > > Thanks, > > > > Bob > > _________________________________ > > Robert B. 
Darnell > > The Rockefeller University > > > > > > _______________________________________________ > > Genome maillist - Genome at soe.ucsc.edu > > http://www.soe.ucsc.edu/mailman/listinfo/genome > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From darnelr at mail.rockefeller.edu Thu Jan 10 11:36:16 2008 From: darnelr at mail.rockefeller.edu (Robert Darnell) Date: Thu, 10 Jan 2008 14:36:16 -0500 Subject: [Genome] Fwd: Genome graph window size References: Message-ID: Hi Galt: Thanks for the help. I sent the email below back to Hiram, from which you can see I am in dire straits. Your solution looks interesting...and makes me wonder if I am inputting my data in the wrong format? When I try loading the file format the way you suggested I get errors loading through custom tracks (Error File 'test2.txt' - Error line 2 of custom track: chromStart after chromEnd (100000000 > 200)), and importing directly into Genome Graph it seems to load but is lost (to me)... Bob _________________________________ Robert B. Darnell The Rockefeller University Begin forwarded message: > From: Robert Darnell > Date: January 10, 2008 2:21:30 PM EST > To: Hiram Clawson > Subject: Re: [Genome] Genome graph window size > > Hi Hiram: > > Thanks very much for the help. I am going to try your wiggle idea. > In the meantime, can I ask an even more basic question about Genome > Graph. The BED files I am plotting look great, but I realize the Y > axis doesn't bear any relationship to the data that I can figure > out. 
So, I made a few test files, which only serve to convince me > further I have no idea what I am plotting on Y: > track name='test data' description='test data' > chr10 101168028 101168069 4.88 > chr10 101230425 101230508 7.23 > chr9 10000000010000010025 > > > and > > track name='test data' description='test data' > chr9 100000000 100000100 200 > > > I do understand that the Y axis is plotting 1/3 and 2/3 of > something, but it seems to be a fraction of the number 0.01, not of > my data... > > Thanks for the hand-holding! > > Bob > _________________________________ > Robert B. Darnell > The Rockefeller University > > > On Jan 10, 2008, at 1:46 PM, Hiram Clawson wrote: > >> Good Morning Robert: >> >> If your data can be calculated to a consistent window size, >> say one data point for every 100bp, you can use the >> wiggle format to view the data in the genome browser: >> http://genome.ucsc.edu/goldenPath/help/wiggle.html >> >> The genome graphs are whole genome viewpoints and wouldn't >> be able to show your data detail below a couple 100 thousand >> bases. Unless you view them in the genome browser, where >> you could instead view them as wiggle data plots. >> >> Wiggle data points are bar graphs at the specified >> locations with the bars of size "span". (span must >> be consistent) Whereas, the genome graph data >> points are points on a line graph. They are different >> types of graphs. The wiggle data can be fudged into >> looking like a line graph by setting the graph >> options to "points" (actually the lines of the tops >> of the bars) and set smoothing on to smear those little >> line segments into a continuous looking line. >> >> --Hiram >> >> Robert Darnell wrote: >>> Hi: >>> I'm a molecular biologist (so have mercy please) interested in >>> trying to graph BED files uploaded into Genome Graphs, but to >>> have the data plotted in a narrower window than 10,000bp; even >>> 1,000 bp would be an improvement I think for our data, or, >>> optimally, 100bp. 
Is this impossible because of computation >>> time, or feasible? >>> Thanks, >>> Bob >>> _________________________________ >>> Robert B. Darnell >>> The Rockefeller University >>> _______________________________________________ >>> Genome maillist - Genome at soe.ucsc.edu >>> http://www.soe.ucsc.edu/mailman/listinfo/genome >> > From galt at soe.ucsc.edu Thu Jan 10 12:14:53 2008 From: galt at soe.ucsc.edu (Galt Barber) Date: Thu, 10 Jan 2008 12:14:53 -0800 (PST) Subject: [Genome] Fwd: Genome graph window size In-Reply-To: References: Message-ID: Put your input data in a local file or make it accessible via URL. In Genome Graphs, after clicking upload and selecting your input, give it at least a name for the data set. Click submit. If you see no errors, You should be returned to the main Genome Graphs view. Look in the drop down list for the name you gave your dataset and select that. It should render your graph. This is considered a Genome-Graphs custom track, and it's not the same as a regular custom track. (Although a regular custom track could be imported into Genome Graphs, it would be done with 10,000 bp windowing.) When you click on your genome graph, it should jump to a genome-browser view of the same region, and from there you can continue to zoom etc. -Galt On Thu, 10 Jan 2008, Robert Darnell wrote: > Hi Galt: > > Thanks for the help. I sent the email below back to Hiram, from which > you can see I am in dire straits. > > Your solution looks interesting...and makes me wonder if I am > inputting my data in the wrong format? When I try loading the file > format the way you suggested I get errors loading through custom > tracks (Error File 'test2.txt' - Error line 2 of custom track: > chromStart after chromEnd (100000000 > 200)), and importing directly > into Genome Graph it seems to load but is lost (to me)... > > Bob > _________________________________ > Robert B. 
Darnell > The Rockefeller University > > > Begin forwarded message: > > > From: Robert Darnell > > Date: January 10, 2008 2:21:30 PM EST > > To: Hiram Clawson > > Subject: Re: [Genome] Genome graph window size > > > > Hi Hiram: > > > > Thanks very much for the help. I am going to try your wiggle idea. > > In the meantime, can I ask an even more basic question about Genome > > Graph. The BED files I am plotting look great, but I realize the Y > > axis doesn't bear any relationship to the data that I can figure > > out. So, I made a few test files, which only serve to convince me > > further I have no idea what I am plotting on Y: > > track name='test data' description='test data' > > chr10 101168028 101168069 4.88 > > chr10 101230425 101230508 7.23 > > chr9 10000000010000010025 > > > > > > and > > > > track name='test data' description='test data' > > chr9 100000000 100000100 200 > > > > > > I do understand that the Y axis is plotting 1/3 and 2/3 of > > something, but it seems to be a fraction of the number 0.01, not of > > my data... > > > > Thanks for the hand-holding! > > > > Bob > > _________________________________ > > Robert B. Darnell > > The Rockefeller University > > > > > > On Jan 10, 2008, at 1:46 PM, Hiram Clawson wrote: > > > >> Good Morning Robert: > >> > >> If your data can be calculated to a consistent window size, > >> say one data point for every 100bp, you can use the > >> wiggle format to view the data in the genome browser: > >> http://genome.ucsc.edu/goldenPath/help/wiggle.html > >> > >> The genome graphs are whole genome viewpoints and wouldn't > >> be able to show your data detail below a couple 100 thousand > >> bases. Unless you view them in the genome browser, where > >> you could instead view them as wiggle data plots. > >> > >> Wiggle data points are bar graphs at the specified > >> locations with the bars of size "span". (span must > >> be consistent) Whereas, the genome graph data > >> points are points on a line graph. 
They are different > >> types of graphs. The wiggle data can be fudged into > >> looking like a line graph by setting the graph > >> options to "points" (actually the lines of the tops > >> of the bars) and set smoothing on to smear those little > >> line segments into a continuous looking line. > >> > >> --Hiram > >> > >> Robert Darnell wrote: > >>> Hi: > >>> I'm a molecular biologist (so have mercy please) interested in > >>> trying to graph BED files uploaded into Genome Graphs, but to > >>> have the data plotted in a narrower window than 10,000bp; even > >>> 1,000 bp would be an improvement I think for our data, or, > >>> optimally, 100bp. Is this impossible because of computation > >>> time, or feasible? > >>> Thanks, > >>> Bob > >>> _________________________________ > >>> Robert B. Darnell > >>> The Rockefeller University > >>> _______________________________________________ > >>> Genome maillist - Genome at soe.ucsc.edu > >>> http://www.soe.ucsc.edu/mailman/listinfo/genome > >> > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From biomail at gmail.com Thu Jan 10 14:16:09 2008 From: biomail at gmail.com (qi wang) Date: Thu, 10 Jan 2008 16:16:09 -0600 Subject: [Genome] Is there a 3-way primates whole genome MAF (hg18, panTro2, rheMac2) in UCSC ftp? Message-ID: <98a149f00801101416y607f066dxcf8361d4fad5b39d@mail.gmail.com> Hello, I am now needing a set of primates specific multialignment data. From your ftp server, I can only find the 28-way or 17-way results. Can you tell me where can I got the Human-Chimp-Macaque multiz results? Thank you. Qi Wang From kayla at soe.ucsc.edu Thu Jan 10 14:33:12 2008 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Thu, 10 Jan 2008 14:33:12 -0800 Subject: [Genome] Batch extraction In-Reply-To: References: Message-ID: <47869D28.90107@cse.ucsc.edu> Hello Mitchell, The "define regions" option is working correctly. 
If you are retrieving refGenes, there is not information about which part of the defined regions each refGene is in. What you can do is to make a Custom Track of your regions then intersect that with refGene using the Table Browser. Here are instructions for making a Custom Track: http://genome.ucsc.edu/goldenPath/help/customTrack.html Finally, if that doesn't give you a detailed enough intersection, let me recommend that you use the Galaxy tool: http://main.g2.bx.psu.edu/ I hope this information is helpful to you. Please don't hesitate to contact us again if you require further assistance. Kayla Smith UCSC Genome Bioinformatics Group Mike Mitchell wrote: > Hello!, > > I have a list of 90+ regions for which I would would like to extract data > from the TableBrowser. > > If I upload these regions into the "define regions" interface the data I get > back doesn't (seem to) identify which region any particular RefGene is > within. > > I could c&p each region into the tablebrowser and do the 90+ searches in > turn, but that will be laborious (and no doubt prone to human error.) Is > there a way to do a "batch" TableBrowser query? Or is there a better tool I > could use? > > Thanks in advance for your time and help. > From sselvar1 at asu.edu Thu Jan 10 14:50:34 2008 From: sselvar1 at asu.edu (SiddarthGautham Selvaraj) Date: Thu, 10 Jan 2008 15:50:34 -0700 Subject: [Genome] Canonical and Non-Canonical table Message-ID: <5c826dfc0801101450k4dac79dcs17190e973b7aa6db@mail.gmail.com> Hi I am siddarth, graduate student in Arizona State University. I got the following query. Many HGNC genes have records in the Knowngene table but not in KnownCanonical table. I have attached a file that contains gene symbols that do not have an entry in KnownCanonical but has lot of entries in the Knowngene table. How to arise conclusions from it? How does UCSC define genes as canonical and non canonical. (i.e) When does a gene gains entry in to a KnownCanonical table from Knowngene table? 
Thanks siddarth On 09/01/2008, SiddarthGautham Selvaraj wrote: > > Ya I only mistook them as junk. Pardon me. > > Thanks > siddarth > > On 09/01/2008, Brooke Rhead < rhead at soe.ucsc.edu > wrote: > > > > Hello again Siddarth, > > > > A colleague has pointed out that the names like MARCH1 and C10orf10 are > > indeed approved gene symbols in HUGO/HGNC. > > > > If you look them up at http://www.genenames.org/ you will see entries > > for them: > > > > MARCH1 = membrane-associated ring finger (C3HC4) 1 > > C10orf10 = chromosome 10 open reading frame 10 > > > > -- > > Brooke Rhead > > UCSC Genome Bioinformatics Group > > > > > > Brooke Rhead wrote: > > > Hi Siddarth, > > > > > > Pardon me, I did not see the names like 'C10orf10' in my original > > search > > > of kgXref. I see them now. > > > > > > When there is a HGNC (HUGO) Gene Symbol available, it will appear in > > > this field of this table: kgXref.geneSymbol. Not all genes in the > > > knownGene table have a HGNC Gene Symbol assigned. > > > > > > There is another table in the 'proteome' database that contains HUGO > > > gene symbols and HGNC IDs: 'hgncXref'. > > > > > > One of our team members has developed a MySQL database query that will > > > > > retrieve the HGNC gene symbols mapped to the knownGene transcript > > name: > > > > > > SELECT geneSymbol,transcript > > > FROM knownCanonical,kgXref,proteome.hgncXref > > > WHERE kgXref.kgID=knownCanonical.transcript > > > AND kgXref.geneSymbol = proteome.hgncXref.symbol > > > > > > Also, here are some previously answered questions that pertain to > > > getting HUGO gene names from our tables: > > > > > > http://www.soe.ucsc.edu/pipermail/genome/2006-April/010350.html > > > https://www.soe.ucsc.edu/pipermail/genome/2006-September/011570.html > > > > > > I hope this information is helpful. > > > > > > > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: HGNC_non_canonical.txt Url: http://www.soe.ucsc.edu/pipermail/genome/attachments/20080110/4a700a21/attachment-0001.txt From kayla at soe.ucsc.edu Thu Jan 10 17:10:38 2008 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Thu, 10 Jan 2008 17:10:38 -0800 Subject: [Genome] Is there a 3-way primates whole genome MAF (hg18, panTro2, rheMac2) in UCSC ftp? In-Reply-To: <98a149f00801101416y607f066dxcf8361d4fad5b39d@mail.gmail.com> References: <98a149f00801101416y607f066dxcf8361d4fad5b39d@mail.gmail.com> Message-ID: <4786C20E.4060205@cse.ucsc.edu> Hello Qi Wang, Here are two previously answered mailing list questions similar to yours. http://www.soe.ucsc.edu/pipermail/genome/2006-March/009989.html http://www.soe.ucsc.edu/pipermail/genome/2007-July/014207.html I hope this information is helpful to you. Please don't hesitate to contact us again if you require further assistance. Kayla Smith UCSC Genome Bioinformatics Group q wang wrote: > Hello, > > I am now needing a set of primates specific multialignment data. From your > ftp server, I can only find the 28-way or 17-way results. Can you tell me > where can I got the Human-Chimp-Macaque multiz results? > > Thank you. > Qi Wang > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From kayla at soe.ucsc.edu Thu Jan 10 17:12:40 2008 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Thu, 10 Jan 2008 17:12:40 -0800 Subject: [Genome] go term level In-Reply-To: <5c2c43620801100743k317b00d8wf61d84f7cca739dc@mail.gmail.com> References: <5c2c43620801100743k317b00d8wf61d84f7cca739dc@mail.gmail.com> Message-ID: <4786C288.7090704@cse.ucsc.edu> Hello Pete, If I understand correctly, you're looking to retrieve data in the form of a gene and its highest level ontology. 
There are three main term_types in the term table, as shown in this
query:

mysql> SELECT DISTINCT term_type, COUNT(*) AS number FROM term GROUP BY term_type;
+-----------------------+--------+
| term_type             | number |
+-----------------------+--------+
| association_qualifier |      4 |
| biological_process    |  12905 |
| cellular_component    |   1977 |
| gene_ontology         |      1 |
| molecular_function    |   8050 |
| relationship          |      1 |
| sequence              |      5 |
| subset                |      5 |
| synonym_type          |      5 |
| universal             |      1 |
+-----------------------+--------+

Notice that the main term_types are: biological_process,
cellular_component and molecular_function.

Now, let's use your example:

mysql> select * from goaPart where dbObjectId like "O00139";
+------------+----------------+-------+------------+--------+
| dbObjectId | dbObjectSymbol | notId | goId       | aspect |
+------------+----------------+-------+------------+--------+
| O00139     | KIF2A_HUMAN    |       | GO:0000166 | F      |
| O00139     | KIF2A_HUMAN    |       | GO:0003774 | F      |
| O00139     | KIF2A_HUMAN    |       | GO:0003777 | F      |
| O00139     | KIF2A_HUMAN    |       | GO:0005524 | F      |
| O00139     | KIF2A_HUMAN    |       | GO:0005856 | C      |
| O00139     | KIF2A_HUMAN    |       | GO:0005874 | C      |
| O00139     | KIF2A_HUMAN    |       | GO:0005875 | C      |
| O00139     | KIF2A_HUMAN    |       | GO:0007018 | P      |
+------------+----------------+-------+------------+--------+
8 rows in set (0.01 sec)

The goaPart.aspect column is just symbols representing the 3 main
term_types above.

F = Molecular Function
C = Cellular Component
P = Biological Process

You can see that the gene you are interested in is part of all three of
the goaPart.aspect values. Note that in the data you pasted in your
email, you already have this information displayed.

I hope this information is helpful to you. If I misunderstood your
question, or you require further assistance, please don't hesitate to
contact us again.
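
The F/C/P aspect codes described above can be applied to rows exported
from goaPart to list a protein's GO IDs per ontology. The following is
only an illustrative sketch in Python and was not part of the original
reply. The function name and the in-memory row layout
(dbObjectId, goId, aspect) are assumptions, and the rows are copied from
the O00139 query output shown above.

```python
from collections import defaultdict

# One-letter goaPart.aspect codes and the ontologies they stand for.
ASPECT_NAMES = {
    "F": "Molecular Function",
    "C": "Cellular Component",
    "P": "Biological Process",
}


def group_go_ids_by_aspect(rows):
    """Group GO IDs by (dbObjectId, ontology name).

    rows: iterable of (dbObjectId, goId, aspect) tuples; this layout is an
    assumption made for the sketch, not a UCSC export format.
    """
    grouped = defaultdict(list)
    for db_object_id, go_id, aspect in rows:
        grouped[(db_object_id, ASPECT_NAMES[aspect])].append(go_id)
    return dict(grouped)


# Rows taken from the goaPart output for O00139 quoted in this message.
ROWS = [
    ("O00139", "GO:0000166", "F"),
    ("O00139", "GO:0003774", "F"),
    ("O00139", "GO:0003777", "F"),
    ("O00139", "GO:0005524", "F"),
    ("O00139", "GO:0005856", "C"),
    ("O00139", "GO:0005874", "C"),
    ("O00139", "GO:0005875", "C"),
    ("O00139", "GO:0007018", "P"),
]

if __name__ == "__main__":
    for (protein, ontology), go_ids in group_go_ids_by_aspect(ROWS).items():
        print(protein, ontology, ", ".join(go_ids))
```

Note that this only separates terms by ontology; it does not compute a
term's depth within an ontology, which the aspect column does not record.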
Kayla Smith UCSC Genome Bioinformatics Group Pete Shepard wrote: > Good Morning Browser Folks, > > I am using the Table Browser to extract GO terms for a number of genes from > go.goaPart table and joining this with the go.term field. I am hoping to get > the level of each term in the ontology. For example, if I enter a single > Trembleid=dbObjectId (O00139) I get the following output for the cellular > component ontology: > > #go.goaPart.dbObjectId go.goaPart.dbObjectSymbol go.goaPart.aspect go.term.name go.term.term_type go.term.acc > O00139 KIF2A_HUMAN C cytoskeleton cellular_component GO:0005856 > O00139 KIF2A_HUMAN C microtubule cellular_component GO:0005874 > O00139 KIF2A_HUMAN C microtubule associated > complex cellular_component GO:0005875 > > > > I am hoping to get the term at the highest level for each ontology. Is > there a way to accomplish this such as a number associated with each > term. > > Thank you. > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From kayla at soe.ucsc.edu Thu Jan 10 17:18:56 2008 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Thu, 10 Jan 2008 17:18:56 -0800 Subject: [Genome] Canonical and Non-Canonical table In-Reply-To: <5c826dfc0801101450k4dac79dcs17190e973b7aa6db@mail.gmail.com> References: <5c826dfc0801101450k4dac79dcs17190e973b7aa6db@mail.gmail.com> Message-ID: <4786C400.9010203@cse.ucsc.edu> Hello Siddarth, Here is a previously answered mailinglist question similar to yours: https://www.soe.ucsc.edu/pipermail/genome/2005-July/008123.html I hope this answers your question. Kayla Smith UCSC Genome Bioinformatics Group SiddarthGautham Selvaraj wrote: > Hi > > I am siddarth, graduate student in Arizona State University. I got the > following query. > > Many HGNC genes have records in the Knowngene table but not in > KnownCanonical table. 
I have attached a file that contains gene symbols that > do not have an entry in KnownCanonical but has lot of entries in the > Knowngene table. > > How to arise conclusions from it? How does UCSC define genes as canonical > and non canonical. (i.e) When does a gene gains entry in to a > KnownCanonical table from Knowngene table? > > Thanks > siddarth > > On 09/01/2008, SiddarthGautham Selvaraj wrote: >> Ya I only mistook them as junk. Pardon me. >> >> Thanks >> siddarth >> >> On 09/01/2008, Brooke Rhead < rhead at soe.ucsc.edu > wrote: >>> Hello again Siddarth, >>> >>> A colleague has pointed out that the names like MARCH1 and C10orf10 are >>> indeed approved gene symbols in HUGO/HGNC. >>> >>> If you look them up at http://www.genenames.org/ you will see entries >>> for them: >>> >>> MARCH1 = membrane-associated ring finger (C3HC4) 1 >>> C10orf10 = chromosome 10 open reading frame 10 >>> >>> -- >>> Brooke Rhead >>> UCSC Genome Bioinformatics Group >>> >>> >>> Brooke Rhead wrote: >>>> Hi Siddarth, >>>> >>>> Pardon me, I did not see the names like 'C10orf10' in my original >>> search >>>> of kgXref. I see them now. >>>> >>>> When there is a HGNC (HUGO) Gene Symbol available, it will appear in >>>> this field of this table: kgXref.geneSymbol. Not all genes in the >>>> knownGene table have a HGNC Gene Symbol assigned. >>>> >>>> There is another table in the 'proteome' database that contains HUGO >>>> gene symbols and HGNC IDs: 'hgncXref'. 
>>>> >>>> One of our team members has developed a MySQL database query that will >>>> retrieve the HGNC gene symbols mapped to the knownGene transcript >>> name: >>>> SELECT geneSymbol,transcript >>>> FROM knownCanonical,kgXref,proteome.hgncXref >>>> WHERE kgXref.kgID=knownCanonical.transcript >>>> AND kgXref.geneSymbol = proteome.hgncXref.symbol >>>> >>>> Also, here are some previously answered questions that pertain to >>>> getting HUGO gene names from our tables: >>>> >>>> http://www.soe.ucsc.edu/pipermail/genome/2006-April/010350.html >>>> https://www.soe.ucsc.edu/pipermail/genome/2006-September/011570.html >>>> >>>> I hope this information is helpful. From arhan at ucla.edu Thu Jan 10 16:37:54 2008 From: arhan at ucla.edu (Areum Han) Date: Thu, 10 Jan 2008 16:37:54 -0800 Subject: [Genome] download of ucsc known genes Message-ID: <002f01c853ea$34a0b860$8a2743a4@AreumHan> Hi. the ucsc known genes table is downloadable in the table browser? I downloaded knowngene table from table bowser but it does not have below exon, which is visible in genome browser. >exon_mm8 TTGACCCATTGATCGATGATTACTCTGGAAGTGGAGgcaaat Sincerely yours, Areum Han. From kayla at soe.ucsc.edu Thu Jan 10 17:29:41 2008 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Thu, 10 Jan 2008 17:29:41 -0800 Subject: [Genome] download of ucsc known genes In-Reply-To: <002f01c853ea$34a0b860$8a2743a4@AreumHan> References: <002f01c853ea$34a0b860$8a2743a4@AreumHan> Message-ID: <4786C685.3080407@cse.ucsc.edu> Hello Areum, Yes, the knownGene table is downloadable here: http://hgdownload.cse.ucsc.edu/goldenPath/mm8/database/knownGene.txt.gz The mm8.knownGene table does not contain sequence data, so I'm not sure how you are searching. If you want to find more information about the sequence you've pasted, you could try BLATing it. Click on "BLAT" on the blue bar on the top of the main page. I hope this helps. Kayla Smith UCSC Genome Bioinformatics Group Areum Han wrote: > Hi. 
the ucsc known genes table is downloadable in the table browser? > > I downloaded knowngene table from table bowser but it does not have below exon, which is visible in genome browser. > > > >> exon_mm8 > > TTGACCCATTGATCGATGATTACTCTGGAAGTGGAGgcaaat > > > > Sincerely yours, > > Areum Han. > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From J.Bowes at manchester.ac.uk Fri Jan 11 06:35:52 2008 From: J.Bowes at manchester.ac.uk (John Bowes) Date: Fri, 11 Jan 2008 14:35:52 +0000 Subject: [Genome] Gene Ontology Message-ID: <20080111143552531.00000003328@d6lt192j> Hello, I have spent a number of hours trying (unsuccessfully) to extract GO data from for particular gene ids. My aim is to supply a list of gene ids, e.g. PTPN22, and extract the three GO aspects (biological_process, cellular_component and molecular_function). I have tried a number of times, but do not get useful data (either no data, or a large volume of multiple entries). I am sure this due to me not fully understanding the table relationships. I have looked at related questions in the archives but could not implement them. One example would be PTPN22, this has two molecular_function and three biological_processes when GO is searched manually. But I can't reproduce this through the UCSC browser. Any help would be greatly appreciated. You are doing a great job with this browser. Thank you. John Bowes From fanhsu at soe.ucsc.edu Fri Jan 11 08:29:30 2008 From: fanhsu at soe.ucsc.edu (Fan Hsu) Date: Fri, 11 Jan 2008 08:29:30 -0800 Subject: [Genome] Gene Ontology In-Reply-To: <20080111143552531.00000003328@d6lt192j> References: <20080111143552531.00000003328@d6lt192j> Message-ID: <003201c8546f$257951f0$6b00a8c0@duo4t> Hi John, The GO annotation is dynamically retrieved and displayed on the UCSC Genes details page. Attached please find an abbreviated version of the C code doing this. 
If you are not familiar with C, just look at the SQL query, which should
give you a pretty good idea of how we did it using the goaPart table in our
go database.

Fan.

static void goPrint(struct section *section, struct sqlConnection *conn,
        char *geneId)
/* Print out GO annotations. */
{
struct sqlConnection *goConn = sqlConnect("go");
char query[512];
struct sqlResult *sr;
char **row;
static char *aspects[3] = {"F", "P", "C"};
static char *aspectNames[3] = {
    "Molecular Function",
    "Biological Process",
    "Cellular Component",
};
int aspectIx;

/* NOTE: acc is the UniProt accession for this gene; the code that sets it
 * is omitted from this abbreviated version. */
for (aspectIx = 0; aspectIx < ArraySize(aspects); ++aspectIx)
    {
    boolean hasFirst = FALSE;
    /* One query per GO aspect: join goaPart to term on the GO ID. */
    safef(query, sizeof(query),
          "select term.acc,term.name"
          " from goaPart,term"
          " where goaPart.dbObjectId = '%s'"
          " and goaPart.goId = term.acc"
          " and goaPart.aspect = '%s'"
          , acc, aspects[aspectIx]);
    sr = sqlGetResult(goConn, query);
    while ((row = sqlNextRow(sr)) != NULL)
        {
        char *goID = row[0];
        char *goTermName = row[1];
        if (!hasFirst)
            {
            /* Print the aspect heading once, before its first term. */
            hPrintf("%s:<BR>", aspectNames[aspectIx]);
            hasFirst = TRUE;
            }
        hPrintf("%s %s<BR>\n", goID, goTermName);
        }
    if (hasFirst)
        hPrintf("<BR>");
    sqlFreeResult(&sr);
    }
sqlDisconnect(&goConn);
}

-----Original Message-----
From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu] On Behalf Of John Bowes
Sent: Friday, 11 January, 2008 6:36 AM
To: genome at soe.ucsc.edu
Subject: [Genome] Gene Ontology

Hello,

I have spent a number of hours trying (unsuccessfully) to extract GO data
from for particular gene ids. My aim is to supply a list of gene ids, e.g.
PTPN22, and extract the three GO aspects (biological_process,
cellular_component and molecular_function). I have tried a number of times,
but do not get useful data (either no data, or a large volume of multiple
entries). I am sure this due to me not fully understanding the table
relationships. I have looked at related questions in the archives but could
not implement them.

One example would be PTPN22, this has two molecular_function and three
biological_processes when GO is searched manually. But I can't reproduce
this through the UCSC browser. Any help would be greatly appreciated.

You are doing a great job with this browser.

Thank you.

John Bowes
_______________________________________________
Genome maillist - Genome at soe.ucsc.edu
http://www.soe.ucsc.edu/mailman/listinfo/genome

From peter.shepard at gmail.com Fri Jan 11 08:36:12 2008
From: peter.shepard at gmail.com (Pete Shepard)
Date: Fri, 11 Jan 2008 08:36:12 -0800
Subject: [Genome] Gene Ontology
In-Reply-To: <20080111143552531.00000003328@d6lt192j>
References: <20080111143552531.00000003328@d6lt192j>
Message-ID: <5c2c43620801110836q66d46a6dxb3385d713eab0ea4@mail.gmail.com>

Hi John,

I can try to field this one while you wait for the experts, because I have
been working on this exact problem. You first need to convert your gene id
to what I think is a UniProt/TrEMBL ID. Then you can go to the Table
Browser, select "all tables" from group:, then from database: select "go"
and go.goaPart as the table:. Paste the UniProt/TrEMBL ID into the
identifiers (names/accessions): field.
Select "selected fields from primary and related tables" from the output
format: field and hit get output. Now scroll down to linked tables and
select "go", then scroll all the way down and hit "allow selection from
checked tables". Check "name", "term type", and "acc" from the go.term
fields and then hit get output. Below is what I get for your gene. One thing
that I cannot find is a way to get the hierarchical levels of these terms;
if you find this out, please let me know.

Peter

#go.goaPart.dbObjectId  go.goaPart.dbObjectSymbol  go.term.name                           go.term.term_type   go.term.acc
Q9Y2R2                  PTN22_HUMAN                phosphoprotein phosphatase activity    molecular_function  GO:0004721
Q9Y2R2                  PTN22_HUMAN                protein tyrosine phosphatase activity  molecular_function  GO:0004725
Q9Y2R2                  PTN22_HUMAN                protein amino acid dephosphorylation   biological_process  GO:0006470
Q9Y2R2                  PTN22_HUMAN                signal transduction                    biological_process  GO:0007165
Q9Y2R2                  PTN22_HUMAN                development                            biological_process  GO:0007275
Q9Y2R2                  PTN22_HUMAN                hydrolase activity                     molecular_function  GO:0016787

On Jan 11, 2008 6:35 AM, John Bowes wrote:

> Hello,
>
> I have spent a number of hours trying (unsuccessfully) to extract GO data
> from for particular gene ids. My aim is to supply a list of gene ids, e.g.
> PTPN22, and extract the three GO aspects (biological_process,
> cellular_component and molecular_function). I have tried a number of times,
> but do not get useful data (either no data, or a large volume of multiple
> entries). I am sure this due to me not fully understanding the table
> relationships. I have looked at related questions in the archives but could
> not implement them.
>
> One example would be PTPN22, this has two molecular_function and three
> biological_processes when GO is searched manually. But I can't reproduce
> this through the UCSC browser. Any help would be greatly appreciated.
>
> You are doing a great job with this browser.
>
> Thank you.
> > John Bowes > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From anton.kratz at googlemail.com Fri Jan 11 02:11:24 2008 From: anton.kratz at googlemail.com (Anton Kratz) Date: Fri, 11 Jan 2008 19:11:24 +0900 Subject: [Genome] knownToPfam, DNA-coordinates of domains Message-ID: <68c7994b0801110211p67e2befeq456b0ba990f2308f@mail.gmail.com> Dear UCSC team, Context of my question: I am trying to get protein domain coordinates in DNA space for the domains listed in the knownToPfam table. Basically what my program does is for each name-value-pair of the knownToPfam table, it looks up (in the knownGenes table) the protein encoded by that isoform, then it looks up (in Pfam-A.full, a flatfile with the entire Pfam database) that domain's sequence and aligns that sequence back onto the human genome, using BLAT (locally) in translated mode. My problem is that for 13,636 of the 35,789 name-value-pairs in the knownToPfam table, I do not find the protein coded for by the respective isoform when searching for the protein in Pfam-A.full. Example: according to knownToPfam, NM_015658 contains the domain PF03715. According to the knownGene table, NM_015658 encodes the protein YU20_HUMAN. So I am looking for YU20_HUMAN under the entry for PF03715 in Pfam-A.full. But it's not there and thus not part of the multiple alignment. I would be very thankful if you have an idea what's going wrong in this approach. regards, Anton P.S.: Everything I am using is for hg17 because I have other mappings only available for hg17 which are not available for hg18. 
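A minimal sketch of the lookup described in this message, written in Python
purely for illustration (it is not Anton's program). It assumes that
Pfam-A.full is in Stockholm format, with a "#=GF AC" line per family and
alignment rows named ID/start-end. The function names, the toy alignment
text, the version suffix on PF03715 and the EXAMPLE*_HUMAN identifiers are
all hypothetical; only the NM_015658 / PF03715 / YU20_HUMAN triple comes
from the message itself.

```python
def pfam_members(stockholm_text):
    """Map each Pfam accession (version suffix stripped) to the set of
    sequence IDs that appear in that family's alignment block."""
    members = {}
    current = None
    for line in stockholm_text.splitlines():
        line = line.strip()
        if line.startswith("#=GF AC"):
            # e.g. "#=GF AC   PF03715.5" -> "PF03715"
            current = line.split()[2].split(".")[0]
            members.setdefault(current, set())
        elif line == "//":
            # End of one family's alignment block.
            current = None
        elif line and not line.startswith("#") and current is not None:
            # Alignment row: "PROT_ID/start-end  ALIGNEDSEQ"
            members[current].add(line.split()[0].split("/")[0])
    return members


def missing_pairs(known_to_pfam, kg_to_protein, members):
    """Return (kgId, pfamAcc, proteinId) for every knownToPfam pair whose
    protein ID is not found under that Pfam accession."""
    missing = []
    for kg_id, pfam_acc in known_to_pfam:
        protein = kg_to_protein.get(kg_id)
        if protein is None or protein not in members.get(pfam_acc, set()):
            missing.append((kg_id, pfam_acc, protein))
    return missing


# Hypothetical toy alignment; real Pfam-A.full entries are far larger.
TOY_PFAM_A_FULL = """\
# STOCKHOLM 1.0
#=GF AC   PF03715.5
EXAMPLE1_HUMAN/10-20   MKVLAAGIVGL
EXAMPLE2_HUMAN/12-22   MRVLSAGIAGL
//
"""

if __name__ == "__main__":
    members = pfam_members(TOY_PFAM_A_FULL)
    pairs = [("NM_015658", "PF03715")]            # from knownToPfam
    proteins = {"NM_015658": "YU20_HUMAN"}        # from the gene tables
    for row in missing_pairs(pairs, proteins, members):
        print("not found in Pfam-A.full:", row)
```

Matching is done on the protein display ID only, so an ID that is absent
from the flatfile for any reason shows up as missing.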
From fanhsu at soe.ucsc.edu Fri Jan 11 09:06:16 2008
From: fanhsu at soe.ucsc.edu (Fan Hsu)
Date: Fri, 11 Jan 2008 09:06:16 -0800
Subject: [Genome] Gene Ontology
In-Reply-To: <5c2c43620801110836q66d46a6dxb3385d713eab0ea4@mail.gmail.com>
References: <20080111143552531.00000003328@d6lt192j> <5c2c43620801110836q66d46a6dxb3385d713eab0ea4@mail.gmail.com>
Message-ID: <003701c85474$475fcf60$6b00a8c0@duo4t>

Thanks to Pete for pointing out that acc is the UniProt accession, which you
can get from the hg18.kgXref table.

To figure out the hierarchy, you need to do a little bit of your own
programming/processing. The parent/child info, I believe, is contained in
the term2term table of the go database.

Fan.

-----Original Message-----
From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu] On Behalf Of Pete Shepard
Sent: Friday, 11 January, 2008 8:36 AM
To: J.Bowes at manchester.ac.uk
Cc: genome at soe.ucsc.edu
Subject: Re: [Genome] Gene Ontology

Hi John,

I can try to field this one while you wait for the experts, because I have
been working on this exact problem. You first need to convert your gene id
to what I think is a UniProt/TrEMBL ID. Then you can go to the Table
Browser, select "all tables" from group:, then from database: select "go"
and go.goaPart as the table:. Paste the UniProt/TrEMBL ID into the
identifiers (names/accessions): field. Select "selected fields from primary
and related tables" from the output format: field and hit get output. Now
scroll down to linked tables and select "go", then scroll all the way down
and hit "allow selection from checked tables". Check "name", "term type",
and "acc" from the go.term fields and then hit get output. Below is what I
get for your gene. One thing that I cannot find is a way to get the
hierarchical levels of these terms; if you find this out, please let me
know.
Peter #go.goaPart.dbObjectId go.goaPart.dbObjectSymbol go.term.name go.term.term_type go.term.acc Q9Y2R2 PTN22_HUMAN phosphoprotein phosphatase activity molecular_function GO:0004721 Q9Y2R2 PTN22_HUMAN protein tyrosine phosphatase activity molecular_function GO:0004725 Q9Y2R2 PTN22_HUMAN protein amino acid dephosphorylation biological_process GO:0006470 Q9Y2R2 PTN22_HUMAN signal transduction biological_process GO:0007165 Q9Y2R2 PTN22_HUMAN development biological_process GO:0007275 Q9Y2R2 PTN22_HUMAN hydrolase activity molecular_function GO:0016787 On Jan 11, 2008 6:35 AM, John Bowes wrote: > Hello, > > I have spent a number of hours trying (unsuccessfully) to extract GO data > from for particular gene ids. My aim is to supply a list of gene ids, e.g. > PTPN22, and extract the three GO aspects (biological_process, > cellular_component and molecular_function). I have tried a number of times, > but do not get useful data (either no data, or a large volume of multiple > entries). I am sure this due to me not fully understanding the table > relationships. I have looked at related questions in the archives but could > not implement them. > > One example would be PTPN22, this has two molecular_function and three > biological_processes when GO is searched manually. But I can't reproduce > this through the UCSC browser. Any help would be greatly appreciated. > > You are doing a great job with this browser. > > Thank you. 
> > John Bowes > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > _______________________________________________ Genome maillist - Genome at soe.ucsc.edu http://www.soe.ucsc.edu/mailman/listinfo/genome From bob at drive5.com Fri Jan 11 10:34:45 2008 From: bob at drive5.com (Robert Edgar) Date: Fri, 11 Jan 2008 10:34:45 -0800 Subject: [Genome] CpG islands feature is exon Message-ID: When I download the "CpG Islands" track in GTF format, the feature field in the GTF records is "exon". Is this the intent, or am I getting the wrong records somehow? Thanks for your help. Robert Edgar. From ann at soe.ucsc.edu Fri Jan 11 11:12:19 2008 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 11 Jan 2008 11:12:19 -0800 Subject: [Genome] CpG islands feature is exon In-Reply-To: References: Message-ID: <4787BF93.10901@cse.ucsc.edu> Hello Robert, It does seem a little surprising that the third column of the GTF file is "exon". However, the data in the file is correct. You can think of that field as filler (it had to place some feature name in there, so it chose exon). Read more about the GTF (and GFF) file formats here: http://genome.ucsc.edu/FAQ/FAQformat#format3 Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list. Robert Edgar wrote: > When I download the "CpG Islands" track in GTF format, the feature field in > the GTF records is "exon". Is this the intent, or am I getting the wrong > records somehow? > Thanks for your help. > Robert Edgar. 
>
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From li11 at niehs.nih.gov Fri Jan 11 11:31:55 2008
From: li11 at niehs.nih.gov (Li, Jianying (NIH/NIEHS) [C])
Date: Fri, 11 Jan 2008 14:31:55 -0500
Subject: [Genome] Difference in nucleotide counts
In-Reply-To:
References:
Message-ID: <7930EE6CD7CA354D93B444D0433C0611051549ED@NIHCESMLBX6.nih.gov>

Dear admin,

I am currently working on a project and need to get the total count of
nucleotides of human hg17. After I finished the counting, I compared my
counts with the statistics listed on the genome browser's page:
http://genome.ucsc.edu/goldenPath/stats.html#hg17

Chr    Assembled Size   Sequenced    I counted
1      247249719        224999719    245522847
2      242951149        237712649    243018229
3      199501827        194704827    199505740
4      191273063        187297063    191411218
5      180857866        177702766    180857866
6      170899992        167273992    170975699
7      158821424        154952424    158628139
8      146274826        142612826    146274826
9      140273252        120143252    138429268
10     135374737        131624737    135413628
11     134452384        131130853    134452384
12     132349534        130303534    132449811
13     114142980         95559980    114142980
14     106368585         88290585    106368585
15     100338915         81341915    100338915
16      88827254         78884754     88827254
17      78774742         77800220     78774742
18      76117153         74656155     76117153
19      63811651         55785651     63811651
20      62435964         59505253     62435964
21      46944323         34171998     46944323
22      49691432         34851332     49554710
X      154913754        151058754    154824264
Y       57772954         25652954     57701691
M          16571            16571        16571
Total 3080436051       2858034764   3076798458

Among these chromosomes, some of my counts match the assembled size and some
don't. Could you please help me understand why?

Any attention you can give this is appreciated.
Jianying Li From ann at soe.ucsc.edu Fri Jan 11 11:35:58 2008 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 11 Jan 2008 11:35:58 -0800 Subject: [Genome] Difference in nucleotide counts In-Reply-To: <7930EE6CD7CA354D93B444D0433C0611051549ED@NIHCESMLBX6.nih.gov> References: <7930EE6CD7CA354D93B444D0433C0611051549ED@NIHCESMLBX6.nih.gov> Message-ID: <4787C51E.70502@cse.ucsc.edu> Hello Jianying Li, Looks like some of the stats you are looking at are coming from the hg18 assembly, and some from the hg17 assembly. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Li, Jianying (NIH/NIEHS) [C] wrote: > Dear admin, > > I am currently working on a project and need to get the total count of > nucleotides of human hg-17. After I finished the counting, I compared > with the statistics listed on the genome browser's page: > http://genome.ucsc.edu/goldenPath/stats.html#hg17 > > Chr Assembled Size Sequenced I counted > 1 247249719 224999719 245522847 > 2 242951149 237712649 243018229 > 3 199501827 194704827 199505740 > 4 191273063 187297063 191411218 > 5 180857866 177702766 180857866 > 6 170899992 167273992 170975699 > 7 158821424 154952424 158628139 > 8 146274826 142612826 146274826 > 9 140273252 120143252 138429268 > 10 135374737 131624737 135413628 > 11 134452384 131130853 134452384 > 12 132349534 130303534 132449811 > 13 114142980 95559980 114142980 > 14 106368585 88290585 106368585 > 15 100338915 81341915 100338915 > 16 88827254 78884754 88827254 > 17 78774742 77800220 78774742 > 18 76117153 74656155 76117153 > 19 63811651 55785651 63811651 > 20 62435964 59505253 62435964 > 21 46944323 34171998 46944323 > 22 49691432 34851332 49554710 > X 154913754 151058754 154824264 > Y 57772954 25652954 57701691 > M 16571 16571 16571 > Total 3080436051 2858034764 3076798458 > > Among these chromosomes, some match but some don't. Could you please > help me explain why? > > Your any attention is appreciated. 
> > Jianying Li > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From ann at soe.ucsc.edu Fri Jan 11 12:01:57 2008 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 11 Jan 2008 12:01:57 -0800 Subject: [Genome] knownToPfam, DNA-coordinates of domains In-Reply-To: <68c7994b0801110211p67e2befeq456b0ba990f2308f@mail.gmail.com> References: <68c7994b0801110211p67e2befeq456b0ba990f2308f@mail.gmail.com> Message-ID: <4787CB35.6020104@cse.ucsc.edu> Hello Anton, In general your approach seems sound. My guess is that what's going on is that the databases are out-of-sync with the actual data at pfam. Note that the hg17.knownToPfam table was built on 2005-07-14. And that hg18 Known Gene track was based on UniProt around 2007-02-02. I realize that you are using hg17 because you have annotations and mappings in that coordinate system. You might consider lifting those over to hg18 (using our liftOver tool) and then using hg18 and its more recent association with pfam. As for your specific example, when you look up YU20_HUMAN in UniProt you see the following: query: YU20_HUMAN The ID: YU20_HUMAN is no longer valid. YU20_HUMAN (associated with primary accession number: Q9Y3T9 from release 39.10) was renamed to NOC2L_HUMAN in release 47.8. 
And the following queries of hg17 and hg18 show:

mysql> select kgId,spDisplayID from hg17.knownToPfam, hg17.kgXref where value='PF03715' and kgId=name;
+-----------+--------------+
| kgId      | spDisplayID  |
+-----------+--------------+
| AK022756  | Q9H9J5_HUMAN |
| AK092491  | Q8NAK7_HUMAN |
| NM_015658 | YU20_HUMAN   |
| BC003555  | Q9H9J5_HUMAN |
+-----------+--------------+

mysql> select kgId,spDisplayID from hg18.knownToPfam, hg18.kgXref where value='PF03715' and kgId=name;
+------------+--------------+
| kgId       | spDisplayID  |
+------------+--------------+
| uc001aby.1 | Q9H9J5_HUMAN |
+------------+--------------+

mysql> select * from hg18.knownToPfam where value='PF03715';
+------------+---------+
| name       | value   |
+------------+---------+
| uc001aby.1 | PF03715 |
+------------+---------+

I hope this is helpful to you.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Anton Kratz wrote:
> Dear UCSC team,
>
> Context of my question: I am trying to get protein domain coordinates in DNA
> space for the domains listed in the knownToPfam table.
> Basically what my program does is for each name-value-pair of the
> knownToPfam table, it looks up (in the knownGenes table) the protein encoded
> by that isoform, then it looks up (in Pfam-A.full, a flatfile with the
> entire Pfam database) that domain's sequence and aligns that sequence back
> onto the human genome, using BLAT (locally) in translated mode.
>
> My problem is that for 13,636 of the 35,789 name-value-pairs in the
> knownToPfam table, I do not find the protein coded for by the respective
> isoform when searching for the protein in Pfam-A.full.
>
> Example: according to knownToPfam, NM_015658 contains the domain PF03715.
> According to the knownGene table, NM_015658 encodes the protein YU20_HUMAN.
> So I am looking for YU20_HUMAN under the entry for PF03715 in Pfam-A.full.
> But it's not there and thus not part of the multiple alignment.
>
> I would be very thankful if you have an idea what's going wrong in this
> approach.
>
> regards,
> Anton
>
> P.S.: Everything I am using is for hg17 because I have other mappings only
> available for hg17 which are not available for hg18.
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From eherman at ucla.edu Fri Jan 11 15:22:16 2008
From: eherman at ucla.edu (Edward Herman)
Date: Fri, 11 Jan 2008 15:22:16 -0800
Subject: [Genome] GeneSorter References
Message-ID:

Hi,

We have recently used the Gene Sorter tool on our gene of interest, and
found a striking overlap between the list that this tool generated and our
own data. We would like to know more about what publications or results the
Gene Sorter is based on and how we can access them.

In particular, we are interested in the HPRD Protein-Protein interaction
database. We tried their website (www.hprd.org), searching on our gene and
clicking the "Interactions" tab, but only a subset (8 out of 71) appear
there with links to publications. Can you tell us more about what data was
used to generate the HPRD Protein-Protein interaction database on the Gene
Sorter tool?

Thanks,
Eddie
Geschwind Lab, UCLA

From fanhsu at soe.ucsc.edu Fri Jan 11 15:39:35 2008
From: fanhsu at soe.ucsc.edu (Fan Hsu)
Date: Fri, 11 Jan 2008 15:39:35 -0800
Subject: [Genome] GeneSorter References
In-Reply-To:
References:
Message-ID: <005301c854ab$39a6c270$6b00a8c0@duo4t>

Hi Eddie,

Attached below please find the detailed processing steps we used to build
our knownToHprd table.

Hope this helps.

Fan.
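As a rough illustration of the final step in the build notes attached to
this message, the sketch below reproduces the two joins against kgXref and
the de-duplicating merge, using Python's built-in sqlite3 module instead of
hgsql. The table and column names are taken from the queries in the notes;
the in-memory database, the two-column kgXref and every row value are
hypothetical toy data, and the function names are made up for this example.

```python
import sqlite3


def build_known_to_hprd(conn):
    """Return sorted, de-duplicated (kgId, hprdId) pairs.

    UNION (rather than UNION ALL) drops duplicate rows, which is the role
    'sort -u' plays in the shell version of this step.
    """
    query = """
        SELECT kgId, hprdId FROM hprdToCdna, kgXref WHERE cdnaId = kgId
        UNION
        SELECT kgId, hprdId FROM hprdToUniProt, kgXref WHERE uniProtId = spId
        ORDER BY 1, 2
    """
    return conn.execute(query).fetchall()


def make_toy_db():
    """Create an in-memory database with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE kgXref (kgId TEXT, spId TEXT);  -- real table has more columns
        CREATE TABLE hprdToCdna (hprdId TEXT, cdnaId TEXT);
        CREATE TABLE hprdToUniProt (hprdId TEXT, uniProtId TEXT);
        """
    )
    conn.executemany("INSERT INTO kgXref VALUES (?, ?)",
                     [("kgA", "spA"), ("kgB", "spB")])
    conn.executemany("INSERT INTO hprdToCdna VALUES (?, ?)",
                     [("00001", "kgA")])
    conn.executemany("INSERT INTO hprdToUniProt VALUES (?, ?)",
                     [("00001", "spA"), ("00002", "spB")])
    return conn


if __name__ == "__main__":
    for kg_id, hprd_id in build_known_to_hprd(make_toy_db()):
        print(kg_id, hprd_id, sep="\t")
```

In the toy data, kgA is reachable through both the cDNA and the UniProt
mapping, so it would appear twice without the de-duplication.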
BUILD HPRD DATA FOR KNOWN GENE DETAILS PAGE LINKS (DONE 9/11/06)

# Download HPRD_XML_060106.tar.gz from www.hprd.org

    gzip -d HPRD_XML_060106.tar.gz
    tar -xvf HPRD_XML_060106.tar

# This will create 18838 xxxx.xml files under HPRD_XML_060106

# Create hprdToCdna table

    echo 'grep -H entry_cdna HPRD_XML_060106/$1.xml' >do1Cdna

    ls HPRD_XML_060106 >j
    cat j |sed -e 's/.xml/\tdo1Cdna/g' >jj
    cut -f 1 jj >j.2
    cut -f 2 jj >j.1
    paste j.1 j.2 >doAllCdna

    chmod +x do*

    ./doAllCdna >j.cdna

    cat j.cdna| sed -e 's/\//\t/' | sed -e 's/.xml/\t/' |\
    sed -e 's/<entry_cdna>/\t/' | sed -e 's/<\//\t/'| sed -e 's/\./\t/'| cut -f 2,4|\
    grep -v None >hprdToCdna.tab

    hgsql hg18 -e 'drop table hprdToCdna'
    hgsql hg18 <~/src/hg/lib/hprdToCdna.sql
    hgsql hg18 -e 'load data local infile "hprdToCdna.tab" into table hprdToCdna'

# Create hprdToUniProt table

    echo 'fgrep -H Swiss HPRD_XML_060106/$1.xml' >do1

    ls HPRD_XML_060106 >j
    cat j |sed -e 's/.xml/\tdo1/g' >jj
    cut -f 1 jj >j.2
    cut -f 2 jj >j.1
    paste j.1 j.2 >doall

    chmod +x do*

    ./doall >j.out

    cat j.out|grep SwissProt | sed -e 's/\//\t/' | sed -e 's/.xml/\t/' | \
    sed -e 's/Prot>/\t/' | sed -e 's/<\//\t/'| cut -f 2,4|grep -v None >hprdToUniProt.tab

    hgsql hg18 -e 'drop table hprdToUniProt'
    hgsql hg18 <~/src/hg/lib/hprdToUniProt.sql
    hgsql hg18 -e 'load data local infile "hprdToUniProt.tab" into table hprdToUniProt'

# build knownToHprd table

    hgsql hg18 -N -e 'select kgId,hprdId from hprdToCdna, kgXref where cdnaId=kgId' >j.kg1
    hgsql hg18 -N -e 'select kgId,hprdId from hprdToUniProt, kgXref where uniProtId=spId' >j.kg2

    cat j.kg1 j.kg2 |sort -u >knownToHprd.tab
    wc knownToHprd.tab

    hgsql hg18 -e 'drop table knownToHprd'
    hgsql hg18 <~/src/hg/lib/knownToHprd.sql
    hgsql hg18 -e 'load data local infile "knownToHprd.tab" into table knownToHprd'

    hgsql hg18 -e 'select count(*) from knownToHprd'

# 19,646 records created.

# remove temporary files.
rm j* -----Original Message----- From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu] On Behalf Of Edward Herman Sent: Friday, 11 January, 2008 3:22 PM To: genome at soe.ucsc.edu Cc: Brett Abrahams Subject: [Genome] GeneSorter References Hi, We have recently used the Gene Sorter tool on our gene of interest, and found a striking overlap between the list that this tool generated and our own data. We would like to know more about w