From gtg894p at mail.gatech.edu Thu Mar 1 08:06:50 2007
From: gtg894p at mail.gatech.edu (Jittima Piriyapongsa)
Date: Thu, 01 Mar 2007 11:06:50 -0500
Subject: [Genome] CCDS protein sequences
Message-ID: <7.0.1.0.2.20070301105451.02292470@mail.gatech.edu>

Hi,

I have a question regarding CCDS protein sequences. I have a list of genomic coordinates, each corresponding to part or all of an exon of a CCDS gene. Is there any way to get the parts of the protein sequences corresponding to those regions?

Thank you.
Jittima

From weddington at bio.fsu.edu Thu Mar 1 07:50:05 2007
From: weddington at bio.fsu.edu (Nodin Weddington)
Date: Thu, 01 Mar 2007 10:50:05 -0500
Subject: [Genome] Obtaining images
Message-ID: <7.0.0.16.2.20070301103334.02855ff8@bio.fsu.edu>

I am working on a site that will have related data graphed on images, much like your site. I would like to have the corresponding sequence image from your site displayed next to the graphed images of our data. To do this I would need to be able to submit a sequence query to your site and retrieve only the graphed image. Is there any way of doing this, or something similar to this?

Thanks,
Nodin Weddington

Nodin Weddington
Computing, CON 221
644-1006 or 644-9302
weddington at bio.fsu.edu

From ann at soe.ucsc.edu Thu Mar 1 09:16:13 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Thu, 01 Mar 2007 09:16:13 -0800
Subject: [Genome] Obtaining images
In-Reply-To: <7.0.0.16.2.20070301103334.02855ff8@bio.fsu.edu>
References: <7.0.0.16.2.20070301103334.02855ff8@bio.fsu.edu>
Message-ID: <45E70A5D.3060907@cse.ucsc.edu>

Hello Nodin,

If I understand you correctly, you would like to get a snapshot of particular images from the genome browser. It is possible to do this.
Once you have the browser set up exactly the way you want it, press the "PDF/PS" link on the top blue navigation bar. From the next screen you will be able to save the image as either a PDF or PS file.

If this is not what you are trying to accomplish, please feel free to write back to the list with more details.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Nodin Weddington wrote:
> I am working on a site that will have related data graphed on images
> much like your site. I would like to have the corresponding sequence
> image from your site to be displayed next to the graphed images of
> our data. To do this I would need to be able to submit a sequence
> query to your site and retrieve only the graphed image. Is there any
> way of doing this or something similar to this.
>
> Thanks,
> Nodin Weddington
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From sridhar at prathikanti.com Thu Mar 1 09:05:18 2007
From: sridhar at prathikanti.com (Sridhar Prathikanti)
Date: Thu, 01 Mar 2007 12:05:18 -0500
Subject: [Genome] how to find a snp on zebra fish...
Message-ID: <1172768718.23112.1177219295@webmail.messagingengine.com>

Dear Sir,

I just left the NIH Bethesda campus in Dec. 2006 and am trying to wrap up some research.

I am trying to find out whether the human SNP 671871
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=671871
exists in the zebrafish. I noticed you pulled down the zebrafish and have it in the archives, but without BLAT.

May I please speak with you over the telephone for 5 minutes?
regards,
Sridhar Prathikanti, MD

From major at cbio.mskcc.org Thu Mar 1 09:34:49 2007
From: major at cbio.mskcc.org (John Major)
Date: Thu, 01 Mar 2007 12:34:49 -0500
Subject: [Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates
Message-ID: <45E70EB9.8010307@cbio.mskcc.org>

Hello-

I am trying to map the pfam protein domains to genomic coordinates and am having some problems.
I see that in the proteome tables, there are 2 obvious pfam tables: pfamDesc and pfamXref.
Neither of these tables appears to be linked to other tables... or at least the table description pages do not offer any information as to which tables these 2 link to.
Also, I do not seem to see a table which gives the start and end coordinates for the pfam domains (in protein, mrna, or genomic space).

What I would like to get is a simple table of domain information in genomic coordinate space, i.e.:

GenomeBuildID  Chrm  Start   End     ProteinDomainName  SourceDatabase
hg18           chr1  100000  100050  Protein-Kinase     pFam
hg18           chr2  200010  200090  X-binding-site     uniprot

I would like to get this info for both uniprot and pfam. The uniprot tables (uniprot.feature and uniprot.description) appear to be linked to kgXref via acc->spid. And I should be able to derive genomic coordinates for the uniprot features via these tables.

If you have any advice on an easier way to get this mapping of domains to genomic coordinates, I'd be thrilled to hear it. Otherwise, could you please advise me on the pfam tables.

Thanks!
John Major

From ahidalgo at inmegen.gob.mx Thu Mar 1 09:37:40 2007
From: ahidalgo at inmegen.gob.mx (Alfredo Hidalgo Miranda)
Date: Thu, 1 Mar 2007 11:37:40 -0600 (CST)
Subject: [Genome] RefLink table
Message-ID: <43749.201.147.87.66.1172770660.squirrel@www.inmegen.gob.mx>

Hi!!

I am trying to get the annotations from the RefLink table in the genome browser for a particular region; however, when I select the RefLink table in the table browser, the "position" option disappears.
Is this normal?
Thanks!

--
Dr. Alfredo Hidalgo Miranda
Instituto Nacional de Medicina Genomica
Periferico Sur 4124, Ex-Rancho de Anzaldo,
Alvaro Obregon, 01900, Mexico DF.
Tel: (52-55) 53501966

From ann at soe.ucsc.edu Thu Mar 1 11:38:55 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Thu, 01 Mar 2007 11:38:55 -0800
Subject: [Genome] CCDS protein sequences
In-Reply-To: <7.0.1.0.2.20070301105451.02292470@mail.gatech.edu>
References: <7.0.1.0.2.20070301105451.02292470@mail.gatech.edu>
Message-ID: <45E72BCF.4070004@cse.ucsc.edu>

Hello Jittima,

Because we generate the protein sequence on the fly, it is not trivial to get the protein sequence for only part of a CCDS gene. This is because, given only a part of the gene, it is not possible for us to tell which frame to start generating the protein from.

It is possible, however, to get the protein for the entire CCDS gene. You can do this from the gene details page: click on the CCDS gene in question, then click on the 'protein' hyperlink. From there you can calculate your position and extract the protein part you are interested in.

Alternatively, you can use the exonStarts and exonFrames fields in the ccdsGene table to figure out the frame. Then you could translate the genomic sequence yourself.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list.

Jittima Piriyapongsa wrote:
> Hi,
>
> I have a question regarding CCDS protein sequence. I have a list of
> genomic coordinates corresponding to a part or whole exon of CCDS
> gene. Is there any way to get the part of protein sequences
> corresponding to those regions?
>
> Thank you.
> Jittima

From rhead at soe.ucsc.edu Thu Mar 1 11:40:56 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Thu, 01 Mar 2007 11:40:56 -0800
Subject: [Genome] BLAT match, score, percentage - how to select best alignment
In-Reply-To: <68c7994b0702280501y5ae77a03l284d9f8347f7a138@mail.gmail.com>
References: <68c7994b0702280501y5ae77a03l284d9f8347f7a138@mail.gmail.com>
Message-ID: <45E72C48.6090707@soe.ucsc.edu>

Hello Anton,

The "best" BLAT alignment can be determined in different ways, depending on which features of the alignment you wish to weight most heavily (e.g., number of matches on both the target and the query side, the number of gaps on the target or query side, the size of the gaps, etc.).

Here are some discussions about blat score from previously answered mailing list questions:
http://www.cse.ucsc.edu/pipermail/genome/2004-January/003883.html
http://www.cse.ucsc.edu/pipermail/genome/2001-December/000789.html

Feel free to search the mailing list archives for even more: go to http://genome.ucsc.edu/FAQ/ and enter "blat score" in the "Search the Genome mailing list archives" box.

These discussions should help you to understand the meaning of score and percent identity calculated by web-based BLAT.

> - Can I just take the alignment that has the highest number in the
> "match"-column and take this entry as the "best" alignment?

This is a valid approach. However, some subtleties will be missed. For instance, a perfect match that has no gaps on either the target or the query side would be treated the same as a match where each base matched perfectly, but the matches were interrupted by non-matching sequence.

> - The order in which the alignments appear in a psl-file, has this
> something to do with the quality of the alignment?

No, the alignments in the psl file are not sorted.
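As a rough illustration of the point about match counts versus gaps, here is a minimal Python sketch (not part of the Kent tools) that keeps one hit per query from PSL lines using a simple score rather than the raw "match" column. It assumes tab-separated PSL with the standard 21 columns and nucleotide alignments. The score used is matches plus half of repMatches, minus mismatches, minus the number of query and target inserts; treat it as an approximation for illustration, not a reproduction of the web-based BLAT score. The function names and the two example hits are made up.

```python
from collections import namedtuple

# Only the PSL columns this sketch needs (standard 21-column layout assumed).
PslHit = namedtuple(
    "PslHit",
    "matches mismatches rep_matches q_num_insert t_num_insert "
    "q_name t_name t_start t_end",
)


def parse_psl_line(line):
    """Parse one tab-separated PSL data line into a PslHit."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 21:
        raise ValueError("expected 21 PSL columns, got %d" % len(f))
    return PslHit(
        matches=int(f[0]), mismatches=int(f[1]), rep_matches=int(f[2]),
        q_num_insert=int(f[4]), t_num_insert=int(f[6]),
        q_name=f[9], t_name=f[13], t_start=int(f[15]), t_end=int(f[16]),
    )


def dna_score(hit):
    """Illustrative score: reward matches, penalize mismatches and gaps."""
    return (hit.matches + hit.rep_matches // 2
            - hit.mismatches - hit.q_num_insert - hit.t_num_insert)


def best_hits(lines):
    """Return {query name: highest-scoring PslHit}; header lines are skipped."""
    best = {}
    for line in lines:
        if not line[:1].isdigit():  # PSL header lines do not start with a digit
            continue
        hit = parse_psl_line(line)
        current = best.get(hit.q_name)
        if current is None or dna_score(hit) > dna_score(current):
            best[hit.q_name] = hit
    return best


# Two hypothetical hits for the same query (values invented for this example).
EXAMPLE_LINES = [
    "\t".join(["98", "0", "0", "0", "0", "0", "0", "0", "+", "seq1", "110",
               "0", "98", "chr1", "1000000", "5000", "5098", "1",
               "98,", "0,", "5000,"]),
    "\t".join(["100", "2", "0", "0", "1", "4", "3", "900", "+", "seq1", "110",
               "0", "106", "chr7", "2000000", "7000", "8002", "5",
               "20,20,20,20,22,", "0,24,44,64,84,",
               "7000,7020,7340,7660,7980,"]),
]

if __name__ == "__main__":
    for name, hit in best_hits(EXAMPLE_LINES).items():
        print(name, hit.t_name, hit.t_start, hit.t_end, dna_score(hit))
```

In this toy input the chr7 hit has more matching bases (100 versus 98) but is not selected, because its mismatches and gaps lower its score.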
There are some tools in the Kent Source tree that are useful for sorting psl files based on different criteria. Info on downloading the source is here:
http://genome.ucsc.edu/FAQ/FAQlicense.html#license3

The utility "pslReps" is one of these programs. Here is the usage statement for pslReps (usage statements for any of the utilities in the source tree can be obtained by running the command with no arguments):

================
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
    pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
options:
    -nohead don't add PSL header
    -ignoreSize Will not weigh in favor of larger alignments so much
    -noIntrons Will not penalize for not having introns when calculating size factor
    -singleHit Takes single best hit, not splitting into parts
    -minCover=0.N minimum coverage to output. Default is 0.
    -ignoreNs Ignore 'N's when calculating minCover.
    -minAli=0.N minimum alignment ratio default is 0.93
    -nearTop=0.N how much can deviate from top and be taken default is 0.01
    -minNearTopSize=N Minimum size of alignment that is near top for alignment to be kept. Default 30.
    -coverQSizes=file Tab-separate file with effective query sizes. When used with -minCover, this allows polyAs to be excluded from the coverage calculation
==================

Here is part of the usage statement for a different utility, "pslCDnaFilter":

==========
Filter cDNA alignments in psl format. Filtering criteria are comparative, selecting near best in genome alignments for each given cDNA and non-comparative, based only on the quality of an individual alignment.

WARNING: comparive filters requires that the input is sorted by query name. The command: 'sort -k 10,10' will do the trick.
Each alignment is assigned a score that is based on identity and weighted towards longer alignments and those with introns. This can do either global or local best-in-genome selection. Local near best in genome keeps fragments of an mRNA that align in discontinuous locations from other fragments. It is useful for unfinished genomes. Global near best in genome keeps alignments based on overall score.
==========

A handy list of the Kent source utilities and brief descriptions for each one is available on our wiki site:
http://genomewiki.ucsc.edu/index.php/Kent_source_utilities

I hope this information is helpful. If you have questions in the future, please do not hesitate to contact us again.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Anton Kratz wrote:
> Hi,
>
> I am aligning some 10.000s of sequences against the human genome with BLAT
> gfServer/gfClient. One sequence = one psl outputfile. Often I get several
> alignments for each sequence. I would like to know how to best interpret
> (parse) the psl-output; I only want to select one single (the best)
> alignment. I am unsure how to determine which alignment is most probably the
> "real" position of my sequence.
>
>
>
> - Can I just take the alignment that has the highest number in the
> "match"-column and take this entry as the "best" alignment?
>
>
>
> - The order in which the alignments appear in a psl-file, has this
> something to do with the quality of the alignment?
>
>
>
> - Why does the UCSC BLAT-website produce a score and percentage value
> (unlike the standalone BLAT), and what is the idea behind these values? I
> tried to understand what score and percentage value mean from looking at the
> source-code in the FAQ-Entry (how to reproduce those values), but do not
> understand the actual meaning of the values.
>
>
> Thanks a lot,
> Anton

From ann at soe.ucsc.edu Thu Mar 1 11:49:16 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Thu, 01 Mar 2007 11:49:16 -0800
Subject: [Genome] RefLink table
In-Reply-To: <43749.201.147.87.66.1172770660.squirrel@www.inmegen.gob.mx>
References: <43749.201.147.87.66.1172770660.squirrel@www.inmegen.gob.mx>
Message-ID: <45E72E3C.8020906@cse.ucsc.edu>

Hello Alfredo,

The refLink table is not a positional table (that is, it does not contain information about chromosomal location). Non-positional tables do not display the "position" option in the Table Browser.

Depending on what you are trying to accomplish, you may want the refGene table instead. You can also do a join in the Table Browser where you link together the refLink and refGene tables to get the information you need. The two tables are joined like so:

refLink.mrnaAcc == refGene.name

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Alfredo Hidalgo Miranda wrote:
> Hi!!
>
> I am trying to get the annotations from the RefLink table in the genome
> browser from a particular region, however when I select the RefLink table
> in the table browser, the "position" option disappears.
>
> is this normal?
> thanks!
>
>

From ann at soe.ucsc.edu Thu Mar 1 12:07:26 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Thu, 01 Mar 2007 12:07:26 -0800
Subject: [Genome] how to find a snp on zebra fish...
In-Reply-To: <1172768718.23112.1177219295@webmail.messagingengine.com>
References: <1172768718.23112.1177219295@webmail.messagingengine.com>
Message-ID: <45E7327E.9080905@cse.ucsc.edu>

Hello Sridhar,

We archived the oldest version of the zebrafish (danRer1), but we still have the three more recent versions available in our active browser.

When we create a SNP track for an assembly, we get the data from dbSNP. dbSNP has about 2,000 SNPs for zebrafish right now. We would like to create a SNP track for danRer4, but it is not a high priority at this time.

Please see the NCBI SNP Summary page for more details:
http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Sridhar Prathikanti wrote:
> Dear Sir,
>
> I just left the NIH Bethesda campus Dec. 2006 and am trying to wrap some
> research.
>
> I am trying to find out whether the human SNP 671871
> http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=671871
>
> exists in the Zebrafish? I noticed you pulled down the Zebrafish and
> have it in archives but without BLAT.
>
> May I please speak with you over the telephone for 5 minutes.
>
> regards,
>
> Sridhar Prathikanti, MD

From eckerta at email.chop.edu Thu Mar 1 13:22:03 2007
From: eckerta at email.chop.edu (Andrew Eckert)
Date: Thu, 01 Mar 2007 16:22:03 -0500
Subject: [Genome] SQL query hg17
Message-ID:

Great Brooke, thanks for your help!
Like you said, even when I find my HUGO symbol in the list, there is sometimes more than one mRNA, but thanks to the data in the refFlat table, I can calculate the length of each mRNA for each gene, and just keep the row with the longest one. Here was my solution:

select geneName, name, chrom, txStart, txEnd,
       max(txEnd - txStart) as GeneLength,
       concat(chrom,":",txStart,"-",txEnd) as Coordinates
from hg17.refFlat
where geneName in ([my list])
group by geneName;

Since I don't care about the splice variants, I figured the longest known mRNA must represent the entire gene. Now that I have the coordinates, I can find the SNPs in the range of each gene. I submitted a list of 207 genes and got data back for every one, so this looks like it's going to work!

Thanks so much for the great tips
Andrew

>>> Brooke Rhead 2/28/2007 7:30 PM >>>
Hello Andrew,

None of our tables use the HUGO symbols exclusively to identify genes or specific mRNAs. However, some tables do include many HUGO symbols, and depending on your particular list of genes, one or more of them might suit your needs.

One option is to use the 'refFlat' table instead of the 'refLink' table. The refFlat.geneName field often corresponds to the HUGO gene name. I checked your 'CD4' example, and there is only one entry for it in the refFlat table, as opposed to 10 entries in the refLink table. The refFlat table also contains the corresponding gene coordinates.

Another option is to use the 'kgXref' table, which correlates several gene aliases for genes in the 'knownGene' table. If your gene names are present in the kgXref.geneSymbol field, you could go from kgXref.geneSymbol --> kgXref.kgId, and then with the kgXref.kgId (which is the same as knownGene.name), get a set of gene coordinates.
One of our developers came up with a MySQL query to produce a bed 4 file from the kgXref and knownGene tables, which you may find useful:

hgsql hg17 -N -e 'select k.chrom, k.txStart, k.txEnd, x.geneSymbol from kgXref x, knownGene k where x.kgId=k.name'

This query produces a list of coordinates and geneSymbols, like this:

mysql> select k.chrom, k.txStart, k.txEnd, x.geneSymbol from
    -> kgXref x, knownGene k where x.kgId=k.name limit 2;
+-------+---------+-------+------------+
| chrom | txStart | txEnd | geneSymbol |
+-------+---------+-------+------------+
| chr1  |    4268 |  7438 | BC073913   |
| chr1  |   24416 | 25944 | AF346307   |
+-------+---------+-------+------------+
2 rows in set (0.00 sec)

Note that with either of these options, each gene symbol may have multiple records. However, many of the names will only have one entry, and so may be quite useful to you.

I hope this information is helpful. If you have further questions, please do not hesitate to contact us again.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Andrew Eckert wrote:
> Hello,
> I have a couple of questions about querying the UCSC human genome data
> (May 2004). I'm working on a project where I have a list of about 400
> HUGO gene symbols and I want to use them to return the coordinates
> (boundaries) of each gene, as well as a list of all SNPs from dbSNP
> within those coordinates.
>
> So far the most straightforward method I've figured out is to use a SQL
> editor to query the hg17 database for gene name --> mRNA --> mRNA
> coordinates --> all SNPs within those coordinates.
>
> I am having trouble because there are multiple rows for some genes.
> For example, I queried hg17.refLink for "name like 'CD4'" and found
> three different names (CD4, Cd4, cd4), and multiple mRNAs for each name,
> each having different coordinates. Really I don't care about all the
> different splice variants or different mRNAs; I just couldn't find any
> better way to get the gene's coordinates. Do you know of a better way
> to accomplish this?
>
> I have been trying to find the tables that contain HUGO symbols,
> thinking that it might allow me to get to one single mRNA accession
> number that I could use. For example I looked up CD4 on the HGNC site
> and found one single mRNA. I'm having trouble here though because I
> can't find either of the tables that refer to HUGO symbols (hugo,
> knownMore). I am only looking in the hg17 database though; are they
> somewhere else?
>
> Thanks in advance for your advice!
>
>
>
> Andrew Eckert
> CHOP Center for Applied Genomics
> 3615 Civic Center Blvd.
> Room 1014H
> Philadelphia, PA 19104
>
> eckerta at email.chop.edu
> (267)426-0694

From ann at soe.ucsc.edu Thu Mar 1 16:00:10 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Thu, 01 Mar 2007 16:00:10 -0800
Subject: [Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates
In-Reply-To: <45E70EB9.8010307@cbio.mskcc.org>
References: <45E70EB9.8010307@cbio.mskcc.org>
Message-ID: <45E7690A.1010209@cse.ucsc.edu>

Hello John,

The missing link is the knownToPfam table in the hg18 (or whichever assembly you are working in) database. This table is the link between the knownGene table and the Pfam tables.

$db.knownGene.name == knownToPfam.name

knownToPfam:
name            value
NM_001005484    PF00001
BC024295        PF07647

The $db.knownGene table has fields for chromosomal positions.

You will find the domain type in the pfamDesc table: proteome.pfamDesc.description.

The pfamXref table includes several types of ID values.

pfamXref:
pfamAC     swissAC    swissDisplayID
PF00001    O00155     GPR25_HUMAN
PF00001    O00254     PAR3_HUMAN
PF00001    O00270     GPR31_HUMAN

You should be able to put these pieces together to mine for the exact information you are looking for.
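To make the linkage described above concrete, here is a minimal sketch in Python. It builds a small in-memory SQLite database with toy tables named after knownGene, knownToPfam and pfamDesc, then joins them. Several things here are assumptions for illustration: the coordinates and description strings are placeholders rather than real data, the pfamDesc table is assumed to be keyed by a pfamAC column, and the helper names are hypothetical. Against the real MySQL databases the pfamDesc table would need to be qualified as proteome.pfamDesc. Note that the join returns the coordinates of the gene that carries each domain, not the coordinates of the domain itself.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with toy versions of the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE knownGene (name TEXT, chrom TEXT,
                                txStart INTEGER, txEnd INTEGER);
        CREATE TABLE knownToPfam (name TEXT, value TEXT);
        -- Assumption: pfamDesc is keyed by a pfamAC column.
        CREATE TABLE pfamDesc (pfamAC TEXT, description TEXT);
    """)
    # Coordinates below are placeholders, not real hg18 positions.
    conn.executemany(
        "INSERT INTO knownGene VALUES (?, ?, ?, ?)",
        [("NM_001005484", "chr1", 1000, 2000),
         ("BC024295", "chr2", 5000, 9000)],
    )
    # These two rows are the knownToPfam examples quoted in the message.
    conn.executemany(
        "INSERT INTO knownToPfam VALUES (?, ?)",
        [("NM_001005484", "PF00001"), ("BC024295", "PF07647")],
    )
    # Descriptions are placeholders standing in for pfamDesc.description.
    conn.executemany(
        "INSERT INTO pfamDesc VALUES (?, ?)",
        [("PF00001", "placeholder description 1"),
         ("PF07647", "placeholder description 2")],
    )
    return conn


def domains_with_gene_coords(conn):
    """Join gene coordinates to Pfam descriptions through knownToPfam."""
    query = """
        SELECT g.chrom, g.txStart, g.txEnd, g.name, p.value, d.description
        FROM knownGene AS g
        JOIN knownToPfam AS p ON p.name = g.name
        JOIN pfamDesc AS d ON d.pfamAC = p.value
        ORDER BY g.chrom, g.txStart
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in domains_with_gene_coords(build_toy_db()):
        print("\t".join(str(field) for field in row))
```

Getting domain-level coordinates would additionally require the position of the domain within the protein, which these three tables alone do not provide.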
If this is not enough detail for you, please do not hesitate to write back to the list and ask for more information.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

John Major wrote:
> Hello-
>
> I am trying to map the pfam protein domains to genomic coordinates and
> am having some problems.
> I see that in the proteome tables, there are 2 obvious pfam tables:
> pfamDesc and pfamXref.
> Neither of these tables appear to be linked to other tables... or at
> least the table description pages do not offer any information as to
> which tables these 2 link to.
> Also, I do not seem to see a table which gives the start and end
> coordinates for the pfam domains (in protein, mrna, or genomic space).
>
> What I would like to get is a simple table of domain information in
> genomic coordinate space. Ie:
> GenomeBuildID Chrm Start End ProteinDomainName
> SourceDatabase
> hg18 chr1 100000 100050
> Protein-Kinase pFam
> hg18 chr2 200010 200090
> X-binding-site uniprot
>
>
> I would like to get this info for both uniprot and pfam. The uniprot
> tables (uniprot.feature and uniprot.description) appear to be linked to
> kgXref via acc->spid. And I should be able to derive genomic
> coordinates for the uniprot features via these tables.
>
>
> If you have any advice on an easier way to get this mapping of domains
> to genomic coordinates, I'd be thrilled to hear it. Otherwise, could
> you please advise me on the pfam tables.
>
> Thanks!
> John Major

From Gonzales.Patrick at mayo.edu Fri Mar 2 07:59:24 2007
From: Gonzales.Patrick at mayo.edu (Gonzales, Patrick R.)
Date: Fri, 2 Mar 2007 09:59:24 -0600
Subject: [Genome] Problems with Human May 2004 Browser
Message-ID: <572057D3BDD52A46BD05BC6DA50686115886A7@MSGEBE22.mfad.mfroot.org>

Greetings. Since early this AM, I have had trouble using the May 2004 Human browser. When I type in a search in this browser, it either takes me to Mouse 2006 (where my search query is not found) or the Mar 2006 Human browser. I scanned my local machine for viruses and spyware and it is clean. Is this a problem on your end? Many thanks!

Pat

Patrick R. Gonzales, MS

Clinical Development Technologist
Cytogenetics Array CGH
Mayo Clinic
Hilton 942
(507)284-8338
gonzales.patrick at mayo.edu

From mikhail.spivakov02 at csc.mrc.ac.uk Fri Mar 2 06:57:57 2007
From: mikhail.spivakov02 at csc.mrc.ac.uk (Spivakov, Mikhail)
Date: Fri, 2 Mar 2007 14:57:57 -0000
Subject: [Genome] (no subject)
Message-ID:

Hi,
For about an hour already, when I try to run in silico PCR with the mouse genome assemblies from March 2005 or May 2004 (but not the newer ones), I receive the following error messages:

May 2004:

Can't open /gbdb/mm5/nib/oryCun1.2bit to read: No such file or directory

March 2005:

Can't open /gbdb/mm6/droYak2.2bit to read: No such file or directory

Can you please try to fix this problem?

Many thanks,
Mikhail

--------------------------------------------
Mikhail Spivakov, Predoctoral Fellow
Lymphocyte Development Group
MRC Clinical Sciences Centre
Imperial College School of Medicine
Hammersmith Hospital Campus
Du Cane Road
London W12 0NN UK

Tel: +44(0)208 3832 140
Fax: +44(0)208 3838 338

From ann at soe.ucsc.edu Fri Mar 2 09:03:51 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 02 Mar 2007 09:03:51 -0800
Subject: [Genome] Problems with Human May 2004 Browser
In-Reply-To: <572057D3BDD52A46BD05BC6DA50686115886A7@MSGEBE22.mfad.mfroot.org>
References: <572057D3BDD52A46BD05BC6DA50686115886A7@MSGEBE22.mfad.mfroot.org>
Message-ID: <45E858F7.2010103@soe.ucsc.edu>

Hello Patrick,

We had a slight glitch on our end -- it looks like our file server was temporarily unavailable to the Genome Browser. Sorry for the inconvenience. It should be working fine now.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Gonzales, Patrick R. wrote:
> Greetings. Since early this AM, I have had trouble using the May 2004
> Human browser. When I type in a search in this browser, it either takes
> me to Mouse 2006 (where my search query is not found) or the Mar 2006
> Human browser. I scanned my local machine for viruses and spyware and
> it is clean. Is this a problem on your end? Many thanks!
>
> Pat
>
>
> Patrick R. Gonzales, MS
>
> Clinical Development Technologist
> Cytogenetics Array CGH
> Mayo Clinic
> Hilton 942
> (507)284-8338
> gonzales.patrick at mayo.edu

From ann at soe.ucsc.edu Fri Mar 2 09:03:47 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 02 Mar 2007 09:03:47 -0800
Subject: [Genome] (no subject)
In-Reply-To:
References:
Message-ID: <45E858F3.2040403@soe.ucsc.edu>

Hello Mikhail,

We had a slight glitch on our end -- it looks like our file server was temporarily unavailable to the Genome Browser. Sorry for the inconvenience. It should be working fine now.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Spivakov, Mikhail wrote:
> Hi,
> For about an hour already, when I try to run in silico PCR with the mouse genome assemblies from March 2005 or May 2004 (but not the newer ones), I receive the following error messages:
>
> May 2004:
>
> Can't open /gbdb/mm5/nib/oryCun1.2bit to read: No such file or directory
>
> March 2005:
>
> Can't open /gbdb/mm6/droYak2.2bit to read: No such file or directory
>
> Can you please try to fix this problem?
>
> Many thanks,
> Mikhail
>
> --------------------------------------------
> Mikhail Spivakov, Predoctoral Fellow
> Lymphocyte Development Group
> MRC Clinical Sciences Centre
> Imperial College School of Medicine
> Hammersmith Hospital Campus
> Du Cane Road
> London W12 0NN UK
>
> Tel: +44(0)208 3832 140
> Fax: +44(0)208 3838 338

From ross.lazarus at channing.harvard.edu Fri Mar 2 08:37:12 2007
From: ross.lazarus at channing.harvard.edu (Ross Lazarus)
Date: Fri, 02 Mar 2007 11:37:12 -0500
Subject: [Genome] genome browser local copy question
In-Reply-To: <45E36F79.8050106@soe.ucsc.edu>
References: <45E216B5.5090603@channing.harvard.edu> <45E36F79.8050106@soe.ucsc.edu>
Message-ID: <45E852B8.40003@channing.harvard.edu>

Ann,

I have a cut-down (hg18 only) local build of the genome browser installed and working here, and the genome graphs work fine locally.

The docs describe making custom tracks available via URLs, but I'm going to guess that adding our own custom tracks so they're available for *all* local users as part of the built-in track list on our local copy of the browser is nontrivial? I couldn't find any documentation on doing that...

I would like to make some genome graph data available more or less permanently to my investigators on our local copy. I can't find this anywhere in the docs so I'll ask you :) - where is the 48 hour nominal retention time for user-uploaded genome graph data configured? I'd like to change it to make these permanent.

Again, thanks in advance for your time in helping me get this figured out and thanks for making this excellent software available!

Ann Zweig wrote:
> Hello Ross,
>
> Glad you like the new Genome Graphs tool. Currently there is no way
> to pre-configure the track, nor is there a direct interface to the tool
> yet.
>
> You are more than welcome to run the code locally. The Genome
> Browser and Blat software are free for academic, nonprofit, and personal
> use. A license is required for commercial use.
>
> How to download the software:
> http://genome.cse.ucsc.edu/FAQ/FAQlicense#license3
>
> You can obtain the source tree either via CVS:
> http://genome.ucsc.edu/admin/cvs.html
> or a zip file:
> http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
>
> Please note the build instructions:
> http://genome.ucsc.edu/admin/jk-install.html
>
> All of the kent utilities output their usage message and command
> line options by running them with no arguments.
>
>
> Regards,
>
> ----------
> Ann Zweig
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu
>
>
> Ross Lazarus wrote:
>> Your new genome graphs utility looks great. I was just about to try to
>> write a custom track but am now saved a vast amount of work. We're
>> about to get a whole genome snp data set so this might be extremely
>> useful for our investigators.
>>
>> Is there any way for me to call it with most of the file reading
>> values filled in - or are any other more direct machine interface
>> (json/soap/whatever) available or planned? We'd be happy to take a
>> poke at running it locally if the code is available?
>>

--
Ross Lazarus MBBS MPH, Director of Bioinformatics
Channing Laboratory, 181 Longwood Ave., Boston MA 02115, USA.
Voice: +617 525 2730 Fax: +617 525 0958

From grobertson at bcgsc.ca Fri Mar 2 09:24:45 2007
From: grobertson at bcgsc.ca (Gordon Robertson)
Date: Fri, 02 Mar 2007 09:24:45 -0800
Subject: [Genome] Relationship between 'phastCons17way' and 'phastConsElements17way'?
Message-ID:

Hello

Could you clarify the relationship between (hg18) 'phastCons17way' (68MB gz) and 'phastConsElements17way' (21MB gz) download data? We need to assess conservation 'support' for large sets of ChIP peaks, so will be programmatically querying either a local DB table or a local flat file.
The 'elements' download appears to offer a more compact representation of the larger datafile, but I have found no documentation on your web site that describes how the full dataset is transformed into the compact one. I apologize if this documentation exists and I've missed it. Thank you for your help - G -- Gordon Robertson Gene Regulation Informatics Canada's Michael Smith Genome Sciences Centre 604-707-5900 x5416 www.bcgsc.ca From ann at soe.ucsc.edu Fri Mar 2 09:49:35 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 02 Mar 2007 09:49:35 -0800 Subject: [Genome] Relationship between 'phastCons17way' and 'phastConsElements17way'? In-Reply-To: References: Message-ID: <45E863AF.9070703@soe.ucsc.edu> Hello Gordon, The phastCons17way table underlies the wiggle part of the Conservation track while the phastConsElements17way underlies the Most Conserved track. The Most Conserved track shows predictions of conserved elements produced by the phastCons program. You can read details about the methods used to produce each of the tables by clicking on the track name hyper link in the track controls below the Genome Browser track image. I think that you will want to use the phastConsElements17way for your queries. Please let us know if you need more details or if this doesn't quite answer your question. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list. Gordon Robertson wrote: > Hello > > Could you clarify the relationship between (hg18) 'phastCons17way' (68MB gz) > and 'phastConsElements17way' (21MB gz) download data? We need to assess > conservation 'support' for large sets of ChIP peaks, so will be > programmatically querying either a local DB table or a local flat file. 
The > 'elements' download appears to offer a more compact representation of the > larger datafile, but I have found no documentation on your web site that > describes how the full dataset is transformed into the compact one. I > apologize if this documentation exists and I've missed it. > > Thank you for your help - > > G From grobertson at bcgsc.ca Fri Mar 2 09:50:47 2007 From: grobertson at bcgsc.ca (Gordon Robertson) Date: Fri, 02 Mar 2007 09:50:47 -0800 Subject: [Genome] Relationship between 'phastCons17way' and 'phastConsElements17way'? In-Reply-To: <45E863AF.9070703@soe.ucsc.edu> Message-ID: Thank you, Ann. G On 3/2/07 9:49 AM, "Ann Zweig" wrote: > Hello Gordon, > > The phastCons17way table underlies the wiggle part of the Conservation track > while the phastConsElements17way underlies the Most Conserved track. The Most > Conserved track shows predictions of conserved elements produced by the > phastCons program. You can read details about the methods used to produce > each > of the tables by clicking on the track name hyper link in the track controls > below the Genome Browser track image. > > I think that you will want to use the phastConsElements17way for your queries. > > Please let us know if you need more details or if this doesn't quite answer > your question. > > > Regards, > > ---------- > Ann Zweig > UCSC Genome Bioinformatics Group > http://genome.ucsc.edu > > > Please feel free to search the Genome mailing list archives by visiting > our home page, clicking on "Contact Us", then typing a word or phrase > into the search box. On that same page > (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome > mailing list. > > > > Gordon Robertson wrote: >> Hello >> >> Could you clarify the relationship between (hg18) 'phastCons17way' (68MB gz) >> and 'phastConsElements17way' (21MB gz) download data? 
We need to assess >> conservation 'support' for large sets of ChIP peaks, so will be >> programmatically querying either a local DB table or a local flat file. The >> 'elements' download appears to offer a more compact representation of the >> larger datafile, but I have found no documentation on your web site that >> describes how the full dataset is transformed into the compact one. I >> apologize if this documentation exists and I've missed it. >> >> Thank you for your help - >> >> G -- Gordon Robertson Gene Regulation Informatics Canada's Michael Smith Genome Sciences Centre 604-707-5900 x5416 www.bcgsc.ca From ann at soe.ucsc.edu Fri Mar 2 10:02:52 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 02 Mar 2007 10:02:52 -0800 Subject: [Genome] genome browser local copy question In-Reply-To: <45E852B8.40003@channing.harvard.edu> References: <45E216B5.5090603@channing.harvard.edu> <45E36F79.8050106@soe.ucsc.edu> <45E852B8.40003@channing.harvard.edu> Message-ID: <45E866CC.8030509@soe.ucsc.edu> Hello again Ross, I am glad you got your own copy of the browser up and running -- that was fast! After Custom Tracks are created, they get placed in a trash directory on our server. Every day we clean out of this directory files that have not been touched for the past 48 hours. So, it's likely that you do not have the 48-hour limit. Also, it's not out of the question to create your own tracks on your site -- plenty of mirror sites do just that. To do that, you will create a database table and a trackDb entry. Write to me off-list if you think you're interested in giving it a try. You may consider subscribing to the mail list that pertains to people who mirror all or part of the Genome Browser website. It's relatively low volume and is often full of tips about configuring your mirror. 
http://www.soe.ucsc.edu/mailman/listinfo/genome-mirror Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Ross Lazarus wrote: > Ann, I have a cut down (hg18 only) local build of the genome browser > installed and working here, and the genome graphs work fine locally. The > docs describe making custom tracks available via urls, but I'm going to > guess that adding our own custom tracks so they're available for *all* > local users as part of the built-in track list on our local copy of the > browser is non trivial? I couldn't find any documentation on doing that... > > I would like to make some genome graph data available more or less > permanently to my investigators on our local copy. I can't find this > anywhere in the docs so I'll ask you :) - where is the 48 hour nominal > retention time for user uploaded genome graph data configured - I'd like > to change it to make these permanent > > Again, thanks in advance for your time in helping me get this figured > out and thanks for making this excellent software available! > > Ann Zweig wrote: >> Hello Ross, >> >> Glad you like the new Genome Graphs tool. Currently there is no >> way to pre-configure the track, nor is there a direct interface to the >> tool yet. >> >> You are more than welcome to run the code locally. The Genome >> Browser and Blat software are free for academic, nonprofit, and >> personal use. A license is required for commercial use. >> >> How to download the software: >> http://genome.cse.ucsc.edu/FAQ/FAQlicense#license3 >> >> You can obtain the source tree either via CVS: >> http://genome.ucsc.edu/admin/cvs.html >> or a zip file: >> http://hgdownload.cse.ucsc.edu/admin/jksrc.zip >> >> Please note the build instructions: >> http://genome.ucsc.edu/admin/jk-install.html >> >> All of the kent utilities output their usage message and command >> line options by running them with no arguments. 
>> >> >> Regards, >> >> ---------- >> Ann Zweig >> UCSC Genome Bioinformatics Group >> http://genome.ucsc.edu >> >> >> Ross Lazarus wrote: >>> Your new genome graphs utility looks great. I was just about to try >>> to write a custom track but am now saved a vast amount of work. We're >>> about to get a whole genome snp data set so this might be extremely >>> useful for our investigators. >>> >>> Is there any way for me to call it with most of the file reading >>> values filled in - or are any other more direct machine interface >>> (json/soap/whatever) available or planned? We'd be happy to take a >>> poke at running it locally if the code is available? >>> > From major at cbio.mskcc.org Fri Mar 2 10:11:06 2007 From: major at cbio.mskcc.org (John Major) Date: Fri, 02 Mar 2007 13:11:06 -0500 Subject: [Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates In-Reply-To: <45E7690A.1010209@cse.ucsc.edu> References: <45E70EB9.8010307@cbio.mskcc.org> <45E7690A.1010209@cse.ucsc.edu> Message-ID: <45E868BA.4060408@cbio.mskcc.org> Hi Ann- Thank you for the prompt reply. I can get the genomic positions of known genes, but what I really need is the specific genomic coordinates of the domains *within* those genes. I think the information you provided will only allow me to determine if a specific gene contains a certain domain? What I need is more detail still. A hypothetical gene ABCD is one exon on chromosomeX from 100000 -> 101000. The gene has a pfam Tyrosine kinase domain in the middle of it, which maps to the genomic coordinates chrX 100250 -> 100550. I'd like to be able to extract a table that looks like: Chrm startPos endPos DomainNAME databaseNAME chrX 100250 100550 Tyrosine-Kinase pFam What I need to know is that the genomic position for the protein domain. And in reality, I'd like to get the genomic positions for all of the protein domains for interpro and pfam for hg17&18. 
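To make the request above concrete, here is a minimal sketch in Python of how a domain's amino-acid range could be projected onto genomic coordinates once a transcript's exon structure is known. Nothing in it comes from the UCSC code base: the function name, the 0-based half-open coordinate convention, and the example numbers are assumptions chosen for illustration, and the domain's protein positions would have to come from Pfam or InterPro themselves.

```python
def domain_to_genomic(exons, cds_start, cds_end, strand, aa_start, aa_end):
    """Project a protein-domain range onto genomic blocks (hypothetical helper).

    exons              -- list of (start, end) genomic intervals, 0-based,
                          half-open, sorted by ascending start
    cds_start, cds_end -- genomic bounds of the coding region (half-open)
    strand             -- '+' or '-'
    aa_start, aa_end   -- domain range in amino acids (0-based, half-open)
    Returns a list of genomic (start, end) blocks in ascending order.
    """
    # Keep only the coding part of each exon.
    coding = []
    for s, e in exons:
        s, e = max(s, cds_start), min(e, cds_end)
        if s < e:
            coding.append((s, e))

    # Walk the coding blocks in transcription order.
    if strand == '-':
        coding.reverse()

    nt_start, nt_end = aa_start * 3, aa_end * 3  # codon -> nucleotide offsets
    blocks = []
    offset = 0  # coding nucleotides consumed so far
    for s, e in coding:
        length = e - s
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + length)
        if lo < hi:
            if strand == '+':
                blocks.append((s + (lo - offset), s + (hi - offset)))
            else:
                # On the minus strand, offsets count down from the exon end.
                blocks.append((e - (hi - offset), e - (lo - offset)))
        offset += length
    return sorted(blocks)


# Single-exon gene on chrX, 100000-101000, coding from 100001 (assumed values).
print(domain_to_genomic([(100000, 101000)], 100001, 101000, '+', 83, 183))
# -> [(100250, 100550)]
```

For a multi-exon transcript the same call returns one block per exon that the domain touches, which is why the result is a list rather than a single start/end pair.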
Thanks,
John

Ann Zweig wrote:
> Hello John,
>
> The missing link is the knownToPfam table in the hg18 (or
> whichever assembly you are working in) database. This table is the
> link between the knownGene table and the Pfam tables.
>
> $db.knownGene.name == knownToPfam.name
>
> knownToPfam:
> name            value
> NM_001005484    PF00001
> BC024295        PF07647
>
> The $db.knownGene table has fields for chromosomal positions.
>
> You will find the domain type in the pfamDesc table:
> proteome.pfamDesc.description.
>
> The pfamXref table includes several types of ID values.
>
> pfamXref:
> pfamAC     swissAC    swissDisplayID
> PF00001    O00155     GPR25_HUMAN
> PF00001    O00254     PAR3_HUMAN
> PF00001    O00270     GPR31_HUMAN
>
> You should be able to put these pieces together to mine for the
> exact information you are looking for.
>
> If this is not enough detail for you, please do not hesitate to
> write back to the list and ask for more information.
>
> Regards,
>
> ----------
> Ann Zweig
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu
>
>
> Please feel free to search the Genome mailing list archives by
> visiting our home page, clicking on "Contact Us", then typing a word
> or phrase into the search box. On that same page
> (http://genome.ucsc.edu/contacts.html), you can subscribe to the
> Genome mailing list.
>
>
> John Major wrote:
>
>> Hello-
>>
>> I am trying to map the pfam protein domains to genomic coordinates
>> and am having some problems.
>> I see that in the proteome tables, there are 2 obvious pfam tables:
>> pfamDesc and pfamXref.
>> Neither of these tables appears to be linked to other tables... or at
>> least the table description pages do not offer any information as to
>> which tables these 2 link to.
>> Also, I do not seem to see a table which gives the start and end
>> coordinates for the pfam domains (in protein, mrna, or genomic space).
>>
>> What I would like to get is a simple table of domain information in
>> genomic coordinate space.
Ie: >> GenomeBuildID Chrm Start End >> ProteinDomainName SourceDatabase >> hg18 chr1 100000 100050 >> Protein-Kinase pFam >> hg18 chr2 200010 200090 >> X-binding-site uniprot >> >> >> I would like to get this info for both uniprot and pfam. The uniprot >> tables (uniprot.feature and uniprot.description) appear to be linked >> to kgXref via acc->spid. And I should be able to derive genomic >> coordinates for the uniprot features via these tables. >> >> >> If you have any advice on an easier way to get this mapping of >> domains to genomic coordinates, I'd be thrilled to hear it. >> Otherwise, could you please advise me on the pfam tables. >> >> Thanks! >> John Major >> _______________________________________________ >> Genome maillist - Genome at soe.ucsc.edu >> http://www.soe.ucsc.edu/mailman/listinfo/genome > > From fanhsu at soe.ucsc.edu Fri Mar 2 10:41:44 2007 From: fanhsu at soe.ucsc.edu (Fan Hsu) Date: Fri, 2 Mar 2007 10:41:44 -0800 Subject: [Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates In-Reply-To: <45E868BA.4060408@cbio.mskcc.org> Message-ID: Hi John, Your desire had been expressed by others before. One challenge is to get one representative protein sequence for a domain. I found often we have many candidates to represent a domain. Do you have any suggestion on this? In the mean time, you may want to consider using the Superfamily track to see if it is helpful. Fan. -----Original Message----- From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On Behalf Of John Major Sent: Friday, March 02, 2007 10:11 AM To: Ann Zweig Cc: genome at soe.ucsc.edu Subject: Re: [Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates Hi Ann- Thank you for the prompt reply. I can get the genomic positions of known genes, but what I really need is the specific genomic coordinates of the domains *within* those genes. 
I think the information you provided will only allow me to determine if a specific gene contains a certain domain? What I need is more detail still. A hypothetical gene ABCD is one exon on chromosomeX from 100000 -> 101000. The gene has a pfam Tyrosine kinase domain in the middle of it, which maps to the genomic coordinates chrX 100250 -> 100550. I'd like to be able to extract a table that looks like: Chrm startPos endPos DomainNAME databaseNAME chrX 100250 100550 Tyrosine-Kinase pFam What I need to know is that the genomic position for the protein domain. And in reality, I'd like to get the genomic positions for all of the protein domains for interpro and pfam for hg17&18. Thanks, John Ann Zweig wrote: > Hello John, > > The missing link is the knownToPfam table in the hg18 (or > whichever assembly you are working in) database. This table is the > link between the knownGene table and the Pfam tables. > > $db.knownGene.name == knownToPfam.name > > knownToPfam: > name value > NM_001005484 PF00001 > BC024295 PF07647 > > The $db.knownGene table has fields for chromosomal positions. > > You will find the domain type in the pfamDesc table: > proteome.pfamDesc.description. > > The pfamXref table includes several types of ID values. > > pfamXref: > pfamAC swissAC swissDisplayID > PF00001 O00155 GPR25_HUMAN > PF00001 O00254 PAR3_HUMAN > PF00001 O00270 GPR31_HUMAN > > You should be able to put these pieces together to mine for the > exact information you are looking for. > > If this is not enough detail for you, please do not hesitate to > write back to the list and ask for more information. > > Regards, > > ---------- > Ann Zweig > UCSC Genome Bioinformatics Group > http://genome.ucsc.edu > > > Please feel free to search the Genome mailing list archives by > visiting our home page, clicking on "Contact Us", then typing a word > or phrase into the search box. On that same page > (http://genome.ucsc.edu/contacts.html), you can subscribe to the > Genome mailing list. 
> > > John Major wrote: > >> Hello- >> >> I am trying to map the pfam protein domains to genomic coordinates >> and am having some problems. >> I see that in the proteome tables, there are 2 obvious pfam tables: >> pfamDesc and pfamXref. >> Neither of these tables appear to be linked to other tables... or at >> least the table description pages do not offer any information as to >> which tables these 2 link to. >> Also, I do not seem to see a table which gives the start and end >> coordinates for the pfam doamins (in protein, mrna, or genomic space). >> >> What I would like to get is a simple table of domain information in >> genomic coordinate space. Ie: >> GenomeBuildID Chrm Start End >> ProteinDomainName SourceDatabase >> hg18 chr1 100000 100050 >> Protein-Kinase pFam >> hg18 chr2 200010 200090 >> X-binding-site uniprot >> >> >> I would like to get this info for both uniprot and pfam. The uniprot >> tables (uniprot.feature and uniprot.description) appear to be linked >> to kgXref via acc->spid. And I should be able to derive genomic >> coordinates for the uniprot features via these tables. >> >> >> If you have any advice on an easier way to get this mapping of >> domains to genomic coordinates, I'd be thrilled to hear it. >> Otherwise, could you please advise me on the pfam tables. >> >> Thanks! 
>> John Major
>> _______________________________________________
>> Genome maillist - Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>
>
_______________________________________________
Genome maillist - Genome at soe.ucsc.edu
http://www.soe.ucsc.edu/mailman/listinfo/genome

From major at cbio.mskcc.org Fri Mar 2 10:56:21 2007
From: major at cbio.mskcc.org (John Major)
Date: Fri, 02 Mar 2007 13:56:21 -0500
Subject: [Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates
In-Reply-To:
References:
Message-ID: <45E87355.7040503@cbio.mskcc.org>

Hello Fan-

I'd suggest that you do not try to find a representative protein, but think about the problem purely in genome space. Let users tease apart which protein isoforms they are interested in. I'd think a data model similar to this might work:

Table:: genomicProteinDomainMap
    transcriptID (foreignKEY to knownGene.name)
    chromosome
    start
    end
    domainName*
    databaseName*

*These 2 columns could be abstracted to keys to database-specific tables... but for the sake of clarity I collapsed them.

In any case, this table would have a good deal of redundant data, but it would accurately represent the reality of genomically encoded protein domain information. I.e.: any given genomic position can be involved in 0 to several genes, and any given genomic position can be involved in numerous mRNA splice variants. It follows that any given genomic position can also be involved in 0 to several protein domains.

I'd rather you report the full depth of information and let the users filter what they want, rather than try to find a representative dataset.

Actually, following the simple table described above, I could join with kgCanonical to get a representative set... or I could just join with kgXref and knownGene to get the domains encoded in all possible gene transcripts at any given location (this latter bit is what I am really looking for).

I hope this was helpful?
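The table proposed above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 with an in-memory database rather than the MySQL tables UCSC distributes, knownGene is cut down to four columns, the start/end columns are renamed chromStart/chromEnd because END is an SQL keyword, and the transcript names txA/txB and every row are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down stand-in for knownGene (assumed columns only).
CREATE TABLE knownGene (
    name    TEXT PRIMARY KEY,
    chrom   TEXT,
    txStart INTEGER,
    txEnd   INTEGER
);
-- The proposed mapping table: one row per (transcript, domain block).
CREATE TABLE genomicProteinDomainMap (
    transcriptID TEXT REFERENCES knownGene(name),
    chromosome   TEXT,
    chromStart   INTEGER,   -- 'start' in the proposal
    chromEnd     INTEGER,   -- 'end' in the proposal
    domainName   TEXT,
    databaseName TEXT
);
""")

# Invented example rows: two transcripts of one gene sharing a domain.
conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?, ?)", [
    ("txA", "chrX", 100000, 101000),
    ("txB", "chrX", 100000, 101000),
])
conn.executemany("INSERT INTO genomicProteinDomainMap VALUES (?, ?, ?, ?, ?, ?)", [
    ("txA", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
    ("txB", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
    ("txB", "chrX", 100700, 100800, "SH2", "pFam"),
])


def domains_at(conn, chrom, pos):
    """Return every (transcript, domain, database) row covering one position."""
    return conn.execute(
        """SELECT d.transcriptID, d.domainName, d.databaseName
           FROM genomicProteinDomainMap AS d
           JOIN knownGene AS g ON g.name = d.transcriptID
           WHERE d.chromosome = ? AND d.chromStart <= ? AND ? < d.chromEnd
           ORDER BY d.transcriptID, d.domainName""",
        (chrom, pos, pos),
    ).fetchall()


print(domains_at(conn, "chrX", 100300))
# -> [('txA', 'Tyrosine-Kinase', 'pFam'), ('txB', 'Tyrosine-Kinase', 'pFam')]
```

The redundancy is deliberate: the same domain block appears once per transcript, so a position query returns every isoform that encodes it and the filtering is left to the user.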
As far as the superfamily track... it appears from the GenomeBrowser that the track does not specify regions of domains, but highlights the complete gene? Or am I not using the correct track? Cheers, John Fan Hsu wrote: >Hi John, > >Your desire had been expressed by others before. > >One challenge is to get one representative protein >sequence for a domain. I found often we have >many candidates to represent a domain. Do you >have any suggestion on this? > >In the mean time, you may want to consider >using the Superfamily track to see if >it is helpful. > >Fan. >-----Original Message----- >From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On >Behalf Of John Major >Sent: Friday, March 02, 2007 10:11 AM >To: Ann Zweig >Cc: genome at soe.ucsc.edu >Subject: Re: [Genome] Pfam foreign keys in data tables // mapping >protein domains to genomic coordinates > > >Hi Ann- > >Thank you for the prompt reply. > >I can get the genomic positions of known genes, but what I really need >is the specific genomic coordinates of the domains *within* those genes. >I think the information you provided will only allow me to determine if >a specific gene contains a certain domain? What I need is more detail >still. > >A hypothetical gene ABCD is one exon on chromosomeX from 100000 -> 101000. >The gene has a pfam Tyrosine kinase domain in the middle of it, which >maps to the genomic coordinates chrX 100250 -> 100550. >I'd like to be able to extract a table that looks like: >Chrm startPos endPos DomainNAME databaseNAME >chrX 100250 100550 Tyrosine-Kinase pFam > > >What I need to know is that the genomic position for the protein domain. >And in reality, I'd like to get the genomic positions for all of the >protein domains for interpro and pfam for hg17&18. > > >Thanks, >John > > > > >Ann Zweig wrote: > > > >>Hello John, >> >> The missing link is the knownToPfam table in the hg18 (or >>whichever assembly you are working in) database. 
This table is the >>link between the knownGene table and the Pfam tables. >> >>$db.knownGene.name == knownToPfam.name >> >>knownToPfam: >>name value >>NM_001005484 PF00001 >>BC024295 PF07647 >> >> The $db.knownGene table has fields for chromosomal positions. >> >> You will find the domain type in the pfamDesc table: >>proteome.pfamDesc.description. >> >> The pfamXref table includes several types of ID values. >> >>pfamXref: >>pfamAC swissAC swissDisplayID >>PF00001 O00155 GPR25_HUMAN >>PF00001 O00254 PAR3_HUMAN >>PF00001 O00270 GPR31_HUMAN >> >> You should be able to put these pieces together to mine for the >>exact information you are looking for. >> >> If this is not enough detail for you, please do not hesitate to >>write back to the list and ask for more information. >> >>Regards, >> >>---------- >>Ann Zweig >>UCSC Genome Bioinformatics Group >>http://genome.ucsc.edu >> >> >>Please feel free to search the Genome mailing list archives by >>visiting our home page, clicking on "Contact Us", then typing a word >>or phrase into the search box. On that same page >>(http://genome.ucsc.edu/contacts.html), you can subscribe to the >>Genome mailing list. >> >> >>John Major wrote: >> >> >> >>>Hello- >>> >>>I am trying to map the pfam protein domains to genomic coordinates >>>and am having some problems. >>>I see that in the proteome tables, there are 2 obvious pfam tables: >>>pfamDesc and pfamXref. >>>Neither of these tables appear to be linked to other tables... or at >>>least the table description pages do not offer any information as to >>>which tables these 2 link to. >>>Also, I do not seem to see a table which gives the start and end >>>coordinates for the pfam doamins (in protein, mrna, or genomic space). >>> >>>What I would like to get is a simple table of domain information in >>>genomic coordinate space. 
Ie: >>>GenomeBuildID Chrm Start End >>>ProteinDomainName SourceDatabase >>>hg18 chr1 100000 100050 >>>Protein-Kinase pFam >>>hg18 chr2 200010 200090 >>>X-binding-site uniprot >>> >>> >>>I would like to get this info for both uniprot and pfam. The uniprot >>>tables (uniprot.feature and uniprot.description) appear to be linked >>>to kgXref via acc->spid. And I should be able to derive genomic >>>coordinates for the uniprot features via these tables. >>> >>> >>>If you have any advice on an easier way to get this mapping of >>>domains to genomic coordinates, I'd be thrilled to hear it. >>>Otherwise, could you please advise me on the pfam tables. >>> >>>Thanks! >>>John Major >>>_______________________________________________ >>>Genome maillist - Genome at soe.ucsc.edu >>>http://www.soe.ucsc.edu/mailman/listinfo/genome >>> >>> >> >> > >_______________________________________________ >Genome maillist - Genome at soe.ucsc.edu >http://www.soe.ucsc.edu/mailman/listinfo/genome > > > > From ann at soe.ucsc.edu Fri Mar 2 12:26:30 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 02 Mar 2007 12:26:30 -0800 Subject: [Genome] Chicken Genome In-Reply-To: References: Message-ID: <45E88876.8000701@soe.ucsc.edu> Hello Mark, Unfortunately, the cytoband data are not available for galGal3. We do not have plans to build a Known Genes track for this assembly (this is the track that would create the kgXref table). Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu > ------------------------------------------------------------------------ > > Subject: > Chicken Genome > From: > "mark fife (IAH-C)" > Date: > Fri, 2 Mar 2007 16:04:35 -0000 > To: > > > To: > > CC: > "TechSupport-ILMN" , "Waterfall, Christy" > , "Cook, Karen" > > > > > Dear Sir/ Madam > > > > Myself and Illumina inc are having problems running the Illumina genome > viewer for the Chicken genome. 
We cannot run the chicken genome into our > Genome Browser without the cytoband.txt and kgXref.txt files for the > chicken genome. > > > > Can you help. Do the cytoband.txt and kgXref.txt files exist for the > chicken genome? If so what are they? > > > > If not can we add them to the annotation database? > > > > PLEASE SEE EMAILS BELOW BETWEEN MYSELF AND Illumina for more details. > > > > I look forward to hearing from you. > > > > Many thanks > > Mark Fife > > *Dr. Mark Fife (PhD) > Avian Genomics Group > Institute for Animal Health > Compton > Berkshire > RG20 7NN* > > (/ /*+44 (0)1635 577266*// > > / /*+44 (0)7941 018594*// > > *: /mark.fife at bbsrc.ac.uk/ > > The information contained in this message may be confidential or legally > privileged and is intended solely for the addressee. If you have > received this message in error please delete it & notify the originator > immediately. > Unauthorised use, disclosure, copying or alteration of this message is > forbidden & may be unlawful. > The contents of this e-mail are the views of the sender and do not > necessarily represent the views of the Institute. > This email and associated attachments has been checked locally for > viruses but we can accept no responsibility once it has left our systems. > Communications on Institute computers are monitored to secure the > effective operation of the systems and for other lawful purposes. > > Hi Mark, > > > > I would be happy to escalate this issue to our Marketing group so they > can lobby UCSC to create these files. Of course, I cannot guarantee > success. I do think it would be worth your time to send any email to > UCSC to get their feedback on support for the chicken genome (i.e. why > the cytoband.txt, etc. are not available). 
> > > > With regard to this issue, since Illumina cannot directly resolve this > issue and will need the assistance of a 3^rd party, I am going to close > this specific support case, but will certainly keep you posted on any > progress we make with UCSC. > > > > Best wishes, > > Jason > > __________________________ > *Jason W. Harger, Ph.D.* > *Scientist, Customer Solutions* > *Illumina, Inc.* > 9885 Towne Centre Drive > San Diego, CA 92121 > Tel: 800.809.4566 ext. 4834 > Fax: 858.202.4766 > E-mail: techsupport at illumina.com > Web: www.illumina.com > > ------------------------------------------------------------------------ > > *From:* mark fife (IAH-C) [mailto:mark.fife at bbsrc.ac.uk] > *Sent:* Thursday, March 01, 2007 12:25 AM > *To:* TechSupport-ILMN; Waterfall, Christy > *Subject:* RE: Your incident 204207 Illumina CSS#19551 Illumina Genome > Viewer > > > > Dear Jason > > > > Many thanks for your email regarding the missing chicken files. Would it > be possible for your developers to contact USCS? Not only will Illumina > have more weight in getting the files added but also you are aware of > what these files actually do to make the software work. > > > > Many thanks > > Mark > > > > ------------------------------------------------------------------------ > > *From:* TechSupport-ILMN [mailto:techsupport-ilmn at illumina.com] > *Sent:* Wed 28/02/2007 22:43 > *To:* mark fife (IAH-C) > *Cc:* Waterfall, Christy > *Subject:* RE: Your incident 204207 Illumina CSS#19551 Illumina Genome > Viewer > > Hi Mark, > > > > After speaking with our developers, it does not appear that we can load > the chicken genome into our Genome Browser without the cytoband.txt and > kgXref.txt files for the chicken genome. One way to address this would > be to produce these files yourself. An alternative would be to contact > UCSC Genome Bioinformatics Help > (http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html) to see if > they can produce the files for you. 
> > > > Based on our conversation, it sounds like you primarily want to look at > spacing of your SNPs along the chromosome and/or gene. Another, albeit > less sophisticated way to do this would be in Microsoft Excel by > plotting the coordinates for your SNPs along an axis with the genomic > coordinates for each chicken chromosome or gene you have targeted. > > > > Kind regards, > > Jason > > > > __________________________ > *Jason W. Harger, Ph.D.* > *Scientist, Customer Solutions* > *Illumina, Inc.* > 9885 Towne Centre Drive > San Diego, CA 92121 > Tel: 800.809.4566 ext. 4834 > Fax: 858.202.4766 > E-mail: techsupport at illumina.com > Web: www.illumina.com > > > From aindap at yahoo.com Fri Mar 2 13:28:16 2007 From: aindap at yahoo.com (Amit Indap) Date: Fri, 2 Mar 2007 13:28:16 -0800 (PST) Subject: [Genome] func column in snp126 table Message-ID: <876859.6656.qm@web32605.mail.mud.yahoo.com> Hi UCSC I was trying to look in the mailing list archives and dbSNP page on explanations for some of the functional classes of snps. I'm a bit confused about what unknown, locus, coding, and cds-reference mean. If it were coding, you would be able to classify the snp as coding-synonymous or coding-nonsynon. I'm guessing unknown means intergenic. If you can point me to some documentation that helps explain these classifications that would be great! Thanks, Amit Indap Cornell University ____________________________________________________________________________________ Never miss an email again! Yahoo! Toolbar alerts you the instant new Mail arrives. http://tools.search.yahoo.com/toolbar/features/mail/ From bushel at niehs.nih.gov Fri Mar 2 13:25:02 2007 From: bushel at niehs.nih.gov (Bushel, Pierre (NIH/NIEHS) [E]) Date: Fri, 2 Mar 2007 16:25:02 -0500 Subject: [Genome] Access of annotation for GenBank and RefSeq genes Message-ID: Greetings, I'd like to use the UCSC Genome Browser annotation files or MySQL database to obtain annotation for GenBank and RefSeq genes. 
In particular, I'd like to query the files or the database for the gene symbol(s), start and end locations in the genome, the chromosome number, the DNA strand the gene lies on, and the gene description. This would be for human genes for which I have a GenBank accession number or RefSeq ID. Could you possibly provide me with the database tables, relationships and SQL required to query this information? I'm trying to obtain this data as a tab-delimited file to use with a commercial software package.

Thanks,
Pierre Bushel

From ann at soe.ucsc.edu Fri Mar 2 13:34:05 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 02 Mar 2007 13:34:05 -0800
Subject: [Genome] func column in snp126 table
In-Reply-To: <876859.6656.qm@web32605.mail.mud.yahoo.com>
References: <876859.6656.qm@web32605.mail.mud.yahoo.com>
Message-ID: <45E8984D.8050502@soe.ucsc.edu>

Hello Amit,

You can find this on the SNP details page.

Function: Predicted functional role (each variant may have more than one functional role)

* Locus Region - variation within 2000 bases of gene, but not in transcript
* Coding - Synonymous - no change in peptide for allele with respect to reference assembly
* Coding - Non-Synonymous - change in peptide for allele with respect to reference assembly
* Untranslated - variation in transcript, but not in coding region interval
* Intron - variation in intron, but not in first two or last two bases of intron
* Splice Site - variation in first two or last two bases of intron
* Reference - allele observed in a coding region of the reference sequence
* Unknown - no known functional classification

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu


Amit Indap wrote:
> Hi UCSC
>
> I was trying to look in the mailing list archives and
> dbSNP page on explanations for some of the functional
> classes of snps. I'm a bit confused about what
>
> unknown, locus, coding, and cds-reference mean.
> If it were coding, you would be able to classify the > snp as coding-synonymous or coding-nonsynon. I'm > guessing unknown means intergenic. > > If you can point me to some documentation that helps > explain these classifications that would be great! > > Thanks, > > Amit Indap > Cornell University > > > > ____________________________________________________________________________________ > Never miss an email again! > Yahoo! Toolbar alerts you the instant new Mail arrives. > http://tools.search.yahoo.com/toolbar/features/mail/ > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From heather at soe.ucsc.edu Fri Mar 2 13:41:50 2007 From: heather at soe.ucsc.edu (Heather Trumbower) Date: Fri, 2 Mar 2007 13:41:50 -0800 (PST) Subject: [Genome] func column in snp126 table In-Reply-To: <45E8984D.8050502@soe.ucsc.edu> References: <876859.6656.qm@web32605.mail.mud.yahoo.com> <45E8984D.8050502@soe.ucsc.edu> Message-ID: Amit: A few additional comments: Coding is an older category that dbSNP appears to be moving away from using, in favor of coding-synon and coding-nonsynon. You will find very few examples of this in the data (hg18.snp126 for example). Unknown is in fact intergenic. Heather Trumbower UCSC Genome Bioinformatics Group On Fri, 2 Mar 2007, Ann Zweig wrote: > Hello Amit, > > You can find this on the SNP details page. 
> > Function: Predicted functional role (each variant may have more than one > functional role) > > * Locus Region - variation within 2000 bases of gene, but not in transcript > * Coding - Synonymous - no change in peptide for allele with respect to > reference assembly > * Coding - Non-Synonymous - change in peptide for allele with respect to > reference assembly > * Untranslated - variation in transcript, but not in coding region interval > * Intron - variation in intron, but not in first two or last two bases of > intron > * Splice Site - variation in first two or last two bases of intron > * Reference - allele observed in a coding region of the reference sequence > * Unknown - no known functional classification > > > Regards, > > ---------- > Ann Zweig > UCSC Genome Bioinformatics Group > http://genome.ucsc.edu > > > Amit Indap wrote: >> Hi UCSC >> >> I was trying to look in the mailing list archives and >> dbSNP page on explanations for some of the functional >> classes of snps. I'm a bit confused about what >> >> unknown, locus, coding, and cds-reference mean. >> If it were coding, you would be able to classify the >> snp as coding-synonymous or coding-nonsynon. I'm >> guessing unknown means intergenic. >> >> If you can point me to some documentation that helps >> explain these classifications that would be great! >> >> Thanks, >> >> Amit Indap >> Cornell University >> >> >> >> ____________________________________________________________________________________ >> Never miss an email again! >> Yahoo! Toolbar alerts you the instant new Mail arrives. 
>> http://tools.search.yahoo.com/toolbar/features/mail/
>> _______________________________________________
>> Genome maillist - Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

From ann at soe.ucsc.edu Fri Mar 2 15:50:14 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Fri, 02 Mar 2007 15:50:14 -0800
Subject: [Genome] Access of annotation for GenBank and RefSeq genes
In-Reply-To:
References:
Message-ID: <45E8B836.6060205@soe.ucsc.edu>

Hello Pierre,

You can do this using our on line Table Browser tool (press the 'Tables' link in the top blue navigation bar of the browser). Configure the Table Browser for the assembly you want, then set the rest of it up like so:

group: Genes and Gene Prediction Tracks
track: RefSeq Genes
table: refGene
region: genome
identifiers: [paste in your list of RefSeq IDs]
output format: selected fields from primary and related tables
output file: [name of output file to be saved on your machine]

Press the 'get output' button. From the next screen, you will join the refGene table with the kgXref table like so:

1. From the "Select Fields from hg18.refGene" section, check the following fields of the refGene table: name, chrom, strand, txStart, txEnd
2. From the "Linked Tables" section, check kgXref and press the 'Allow Selection From Checked Tables' button near the bottom of the screen.
3. From the "hg18.kgXref fields" section, check the description field.
4. Press the 'get output' button.

The tab-delimited file will be created and downloaded to your machine.
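For readers who would rather work in SQL, as the original question asked, the Table Browser steps above amount to a join between refGene and kgXref. What follows is only a minimal sketch, not the Table Browser's own query: it assumes that kgXref carries a refseq column matching refGene.name and a geneSymbol column, and it builds two tiny in-memory SQLite stand-in tables so that it can run without a MySQL server. The sample row is illustrative.

```python
import sqlite3

# In-memory stand-ins for the two UCSC tables; only the columns used here exist.
# ASSUMPTION: kgXref.refseq holds the RefSeq accession that matches refGene.name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE refGene (name TEXT, chrom TEXT, strand TEXT,
                          txStart INTEGER, txEnd INTEGER);
    CREATE TABLE kgXref (refseq TEXT, geneSymbol TEXT, description TEXT);
""")

# One illustrative row per table (values for RefSeq ID NM_000808).
conn.execute(
    "INSERT INTO refGene VALUES ('NM_000808', 'chrX', '-', 151087187, 151370486)"
)
conn.execute(
    "INSERT INTO kgXref VALUES "
    "('NM_000808', 'GABRA3', 'gamma-aminobutyric acid A receptor, alpha 3')"
)

# Join the gene coordinates to the cross-reference table by RefSeq accession.
query = """
    SELECT r.name, x.geneSymbol, r.chrom, r.strand,
           r.txStart, r.txEnd, x.description
    FROM refGene AS r
    JOIN kgXref AS x ON x.refseq = r.name
    WHERE r.name IN (?)
"""
rows = conn.execute(query, ("NM_000808",)).fetchall()

# Emit tab-delimited text, one gene per line.
for row in rows:
    print("\t".join(str(field) for field in row))
```

If the column assumption holds for your assembly, the same SELECT should carry over to a local MySQL copy of the hg18 tables, with the IN list extended to your own accessions.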
For example, when I follow these instructions on the latest human assembly (hg18) for the RefSeq Gene ID # NM_000808, I get the following tab-delimited output: #hg18.refGene.name hg18.refGene.chrom hg18.refGene.strand hg18.refGene.txStart hg18.refGene.txEnd hg18.kgXref.description NM_000808 chrX - 151087187 151370486 gamma-aminobutyric acid A receptor, alpha 3 Hope this is helpful to you. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list. Bushel, Pierre (NIH/NIEHS) [E] wrote: > Greetings, > > I'd like to use the UCSC Genome Browser annotation files or MySQL > database to obtain annotation for GenBank and RefSeq genes. In > particular, I'd like to query the files or the database for the gene > symbol(s), start and end locations in the genome, the chromosome number, > the DNA strand the gene lies on and gene description. This would be for > human genes which I have a GenBank accession number or RefSeq ID. Could > you possibly provide me with the database tables, relationships and SQL > required to query this information? I'm trying to obtain this data in a > tab-delimited format file to use with a commercial software. 
>
> Thanks,
>
> Pierre Bushel
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From Colin_Sieff at dfci.harvard.edu Sat Mar 3 14:20:54 2007
From: Colin_Sieff at dfci.harvard.edu (Colin Sieff)
Date: Sat, 3 Mar 2007 17:20:54 -0500
Subject: [Genome] problem with primers
Message-ID: <76EA165D-E010-438B-BE04-F8A71EA6F4D0@dfci.harvard.edu>

If one tries to do in silico PCR with primers that come from a gene that is mapped to the reverse strand of the DNA on the UCSC map, then the PCR won't work! Here is an example of a gene whose primers were selected using an application called Primer Express (ABI). The 5' primer will not map to the mouse genome, and only the reverse primer will flip. Of course when one does this one does not know which way the gene is encoded, so all this is not very satisfactory.

You probably know all this. Any comment?

Thanks,
Colin Sieff

ATCGCCGAAGCACAAAACAT
CAATGCGACAGGATAGGGAAC

From cooper303 at btinternet.com Sun Mar 4 11:44:14 2007
From: cooper303 at btinternet.com (JONATHAN COOPER)
Date: Sun, 4 Mar 2007 19:44:14 +0000 (GMT)
Subject: [Genome] (no subject)
Message-ID: <474753.49261.qm@web86410.mail.ird.yahoo.com>

Hi.

I am required for an MSc project to compare 1400 human RE1 (repressor element 1) sites (each 21 bp) against the chimp, rat, and mouse genomes for sequence homology. Although I could do this manually, it'd take me quite a while, and wouldn't be much fun. Could you please tell me - are there any scripts you know of that can do this for me? Failing that, can you suggest the most appropriate data to download to write a script for? At the moment I am considering using the "axtNet" genome alignments - but am not sure if it is the best format to use. All I really need is for a file to contain the most similar match to the sequences I give it (and possibly a BLAST score).

Thanks for your time,
best regards,
Jon.
From gatzmon at aecom.yu.edu Sun Mar 4 10:59:59 2007
From: gatzmon at aecom.yu.edu (gatzmon@aecom.yu.edu)
Date: Sun, 4 Mar 2007 13:59:59 -0500 (EST)
Subject: [Genome] Sequence of Hapmap-Encode project
Message-ID: <50006.68.161.32.239.1173034799.squirrel@netmail.aecom.yu.edu>

Dear Researchers
I'm trying to track the sequence data from the 48 individuals (of the
Hapmap-Encode project) that are stored in NCBI's Trace Archive using your
web browser without success.
Do you know a better way to do so? Is there a web site where I can
download those sequences?
Thanks in advance
Gil

Gil Atzmon, Ph.D.
Assistant Professor of Medicine
Director of the Genetic Core for LonGenety
Institute for Aging Research and the Diabetes Research Center
Albert Einstein College of Medicine
1300 Morris Park Ave; Forchheimer #523
Bronx, NY, 10461
Tel: 718-430-3628
Fax: 718-430-8557

From heather at soe.ucsc.edu Mon Mar 5 10:06:47 2007
From: heather at soe.ucsc.edu (Heather Trumbower)
Date: Mon, 5 Mar 2007 10:06:47 -0800 (PST)
Subject: [Genome] Sequence of Hapmap-Encode project
In-Reply-To: <50006.68.161.32.239.1173034799.squirrel@netmail.aecom.yu.edu>
References: <50006.68.161.32.239.1173034799.squirrel@netmail.aecom.yu.edu>
Message-ID:

Gil:

HapMap data is managed by the Data Coordination Center at http://www.hapmap.org.

We are developing a track that is based on the data available at
www.hapmap.org/downloads/genotypes/2007-01/rs_strand/non-redundant/genotypes_chr*_*_r21a_nr.txt.gz

You can obtain that data directly. This data is described at http://www.hapmap.org/genotypes/latest/00README.txt.

You may also wish to use the data at
www.hapmap.org/downloads/raw_data/affy500k/

We would be happy to try to answer further questions about HapMap data, but your best source of information would be help at hapmap.org.
Heather Trumbower UCSC Genome Bioinformatics Group On Sun, 4 Mar 2007, gatzmon at aecom.yu.edu wrote: > Dear Researchers > I'm trying to track the sequence data from the 48 individuals (of the > Hapmap-Encode project) that are stored in NCBI's Trace Archive using your > web browser without success. > Do you know a better way to do so? Is there a web site where I can > download those sequences? > Thanks in advance > Gil > > > > Gil Atzmon, Ph.D. > Assistant Professor of Medicine > Director of the Genetic Core for LonGenety > Institute for Aging Research and the Diabetes Research Center > Albert Einstein College of Medicine > 1300 Morris Park Ave; Forchheimer #523 > Bronx, NY, 10461 > Tel: 718-430-3628 > Fax: 718-430-8557 > > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From bushel at niehs.nih.gov Mon Mar 5 08:34:25 2007 From: bushel at niehs.nih.gov (Bushel, Pierre (NIH/NIEHS) [E]) Date: Mon, 5 Mar 2007 11:34:25 -0500 Subject: [Genome] Access of annotation for GenBank and RefSeq genes In-Reply-To: <45E8B836.6060205@soe.ucsc.edu> Message-ID: Yes it does. Thanks a lot. This is so helpful. You all are doing a magnificent job! Pierre -----Original Message----- From: Ann Zweig [mailto:ann at soe.ucsc.edu] Sent: Friday, March 02, 2007 6:50 PM To: Bushel, Pierre (NIH/NIEHS) [E] Cc: genome at soe.ucsc.edu Subject: Re: [Genome] Access of annotation for GenBank and RefSeq genes Hello Pierre, You can do this using our on line Table Browser tool (press the 'Tables' link in the top blue navigation bar of the browser). 
Configure the Table Browser for the assembly you want, then set the rest of it up like so:

group: Genes and Gene Prediction Tracks
track: RefSeq Genes
table: refGene
region: genome
identifiers: [paste in your list of RefSeq IDs]
output format: selected fields from primary and related tables
output file: [name of output file to be saved on your machine]

Press the 'get output' button. From the next screen, you will join the refGene table with the kgXref table like so:

1. From the "Select Fields from hg18.refGene" section, check the following fields of the refGene table: name, chrom, strand, txStart, txEnd
2. From the "Linked Tables" section, check kgXref and press the 'Allow Selection From Checked Tables' button near the bottom of the screen.
3. From the "hg18.kgXref fields" section, check the description field.
4. Press the 'get output' button.

The tab-delimited file will be created and downloaded to your machine.

For example, when I follow these instructions on the latest human assembly (hg18) for the RefSeq Gene ID # NM_000808, I get the following tab-delimited output:

#hg18.refGene.name	hg18.refGene.chrom	hg18.refGene.strand	hg18.refGene.txStart	hg18.refGene.txEnd	hg18.kgXref.description
NM_000808	chrX	-	151087187	151370486	gamma-aminobutyric acid A receptor, alpha 3

Hope this is helpful to you.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Please feel free to search the Genome mailing list archives by visiting our home page, clicking on "Contact Us", then typing a word or phrase into the search box. On that same page (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome mailing list.

Bushel, Pierre (NIH/NIEHS) [E] wrote:
> Greetings,
>
> I'd like to use the UCSC Genome Browser annotation files or MySQL
> database to obtain annotation for GenBank and RefSeq genes.
In > particular, I'd like to query the files or the database for the gene > symbol(s), start and end locations in the genome, the chromosome > number, the DNA strand the gene lies on and gene description. This > would be for human genes which I have a GenBank accession number or > RefSeq ID. Could you possibly provide me with the database tables, > relationships and SQL required to query this information? I'm trying > to obtain this data in a tab-delimited format file to use with a > commercial software. > > Thanks, > > Pierre Bushel > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From jroohi at gmail.com Mon Mar 5 11:24:26 2007 From: jroohi at gmail.com (Jasmin Roohi) Date: Mon, 5 Mar 2007 14:24:26 -0500 Subject: [Genome] SNP Function Message-ID: Dear UCSC Genome Browser: The UCSC Genome Browser includes information on SNPs. The page for each SNP includes the following categories: position, band, genomic size, strand, observed, reference allele, location type, class, validation, function, molecular type, and weight. The function of a SNP can be: locus region, intron, splice site, reference, coding-synonymous, coding-nonsynonymous, untranslated, or unknown. What does the category unknown mean? I realize it sounds self explanatory but I am confused. I have seen a couple of SNPs within exons called unknown (ie rs35687367 and rs35490546). These SNPs are translated. Why aren't they called coding-synonymous or coding-nonsynonymous. Thank you so much for your help. 
Best,
Jasmin

_______________________________
Jasmin Roohi, MSTP V
Stony Brook School of Medicine
and Department of Genetics

Stony Brook University Medical Center
Health Sciences Tower T8-053
Stony Brook, NY 11794-8088
(631) 444-3126
jroohi at gmail.com

From longli at buffalo.edu Mon Mar 5 09:57:14 2007
From: longli at buffalo.edu (Long Li)
Date: Mon, 05 Mar 2007 12:57:14 -0500
Subject: [Genome] How to get gene names in a batch given accession IDs of mRNA or EST?
Message-ID: <1173117434.3537.48.camel@bchwxp3040.med.buffalo.edu>

Hi,

I have a lot of IDs of mRNA (from all_mrna table) and EST (from all_est
table), my question is how to get their gene names in a batch.

For example, the DA561869 and AL602556 are from all_est table, the
AK074693 is from all_mrna table. But they are belong to gene CABC1. So
which table should I look up to find the gene name CABC1? Thanks.

Long

From Yannick.Wurm at unil.ch Mon Mar 5 04:42:24 2007
From: Yannick.Wurm at unil.ch (Yannick Wurm)
Date: Mon, 5 Mar 2007 13:42:24 +0100
Subject: [Genome] honey bee genome
Message-ID: <23D7FF26-1D0B-4E60-B500-24637103EB65@unil.ch>

Hi,
I work with ants - and the honey bee genome is a great resource.
Currently, beebase is a great resource.
http://racerx00.tamu.edu/cgi-bin/gbrowse/bee_genome2_chromo
But theirs is incompatible with tools such as galaxy, and not as
complete as your browser.
http://main.g2.bx.psu.edu/

Thus, I wonder whether you will update the honey bee data in your
genome browser.

Thanks and regards,

Yannick

--------------------------------------------
yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
http://www.unil.ch/dee/page28685_fr.html

From bushel at niehs.nih.gov Mon Mar 5 13:34:14 2007
From: bushel at niehs.nih.gov (Bushel, Pierre (NIH/NIEHS) [E])
Date: Mon, 5 Mar 2007 16:34:14 -0500
Subject: [Genome] MySQL database of the genome browser
Message-ID:

Is the MySQL database of the genome browser still available?
I would need the sql commands to create the database and the files as tables to insert into the database. Thanks, Pierre R. Bushel, Ph.D. Head of Microarray and Genome Informatics Biostatistics Branch National Institute of Environmental Health Sciences P.O. Box 12233 RTP, NC 27709 (W) 919-316-4564 (F) 919-316-4649 bushel at niehs.nih.gov From heather at soe.ucsc.edu Mon Mar 5 14:31:48 2007 From: heather at soe.ucsc.edu (Heather Trumbower) Date: Mon, 5 Mar 2007 14:31:48 -0800 (PST) Subject: [Genome] SNP Function In-Reply-To: References: Message-ID: Jasmin: The UCSC SNPs track is presenting data that is obtained from dbSNP. This includes the function assignments. Typically when a SNP has unknown function, it is outside of any gene annotation. However, in the cases you mention, it could be that dbSNP will update the functional assignment in a later build. I would recommend that you contact them directly at snp-admin at ncbi.nlm.nih.gov. Heather Trumbower UCSC Genome Bioinformatics Group On Mon, 5 Mar 2007, Jasmin Roohi wrote: > Dear UCSC Genome Browser: > > The UCSC Genome Browser includes information on SNPs. The page for each SNP > includes the following categories: position, band, genomic size, strand, > observed, reference allele, location type, class, validation, function, > molecular type, and weight. The function of a SNP can be: locus region, > intron, splice site, reference, coding-synonymous, coding-nonsynonymous, > untranslated, or unknown. What does the category unknown mean? I realize it > sounds self explanatory but I am confused. I have seen a couple of SNPs > within exons called unknown (ie rs35687367 and rs35490546). These SNPs are > translated. Why aren't they called coding-synonymous or > coding-nonsynonymous. Thank you so much for your help. 
>
> Best,
> Jasmin
>
>
> _______________________________
> Jasmin Roohi, MSTP V
> Stony Brook School of Medicine
> and Department of Genetics
>
> Stony Brook University Medical Center
> Health Sciences Tower T8-053
> Stony Brook, NY 11794-8088
> (631) 444-3126
> jroohi at gmail.com
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

From rhead at soe.ucsc.edu Mon Mar 5 15:22:25 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Mon, 05 Mar 2007 15:22:25 -0800
Subject: [Genome] honey bee genome
In-Reply-To: <23D7FF26-1D0B-4E60-B500-24637103EB65@unil.ch>
References: <23D7FF26-1D0B-4E60-B500-24637103EB65@unil.ch>
Message-ID: <45ECA631.1020909@soe.ucsc.edu>

Hello Yannick,

We do have plans to update the honey bee data, but it is low on our priority list (we are more mammal-focused at UCSC). I do not have an estimated completion date at this time. You are welcome to check back with us in a few months to see if we are any closer to releasing an update. We will also make an announcement on our "genome-announce" mailing list when it is released. If you would like to subscribe to that (very low volume) mailing list, please see this link:

http://www.soe.ucsc.edu/mailman/listinfo/genome-announce

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Yannick Wurm wrote:
> Hi,
> I work with ants - and the honey bee genome is a great resource.
> Currently, beebase is a great resource.
> http://racerx00.tamu.edu/cgi-bin/gbrowse/bee_genome2_chromo
> But theirs is incompatible with tools such as galaxy, and not as
> complete as your browser.
> http://main.g2.bx.psu.edu/
>
> Thus, I wonder whether you will update the honey bee data in your
> genome browser.
>
> Thanks and regards,
>
> Yannick
>
>
>
> --------------------------------------------
> yannick . wurm @ unil .
ch > Ant Genomics, Ecology & Evolution @ Lausanne > http://www.unil.ch/dee/page28685_fr.html > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From rhead at soe.ucsc.edu Mon Mar 5 16:15:35 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Mon, 05 Mar 2007 16:15:35 -0800 Subject: [Genome] MySQL database of the genome browser In-Reply-To: References: Message-ID: <45ECB2A7.704@soe.ucsc.edu> Hello Pierre, The MySQL database dumps are available from our downloads page: http://hgdownload.cse.ucsc.edu/downloads.html Go to your assembly of interest and then click on the "Annotation database" link. The tables are also available on our ftp server, here: ftp://hgdownload.cse.ucsc.edu/goldenPath/ After choosing the assembly of interest, click on the "database" directory. I hope this information helps. Please do not hesitate to contact us again if you have further questions. -- Brooke Rhead UCSC Genome Bioinformatics Group Bushel, Pierre (NIH/NIEHS) [E] wrote: > Is the MySQL database of the genome browser still available? I would > need the sql commands to create the database and the files as tables to > insert into the database. > > Thanks, > > Pierre R. Bushel, Ph.D. > Head of Microarray and Genome Informatics > Biostatistics Branch > National Institute of Environmental Health Sciences > P.O. Box 12233 > RTP, NC 27709 > (W) 919-316-4564 > (F) 919-316-4649 > bushel at niehs.nih.gov > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From rhead at soe.ucsc.edu Mon Mar 5 17:25:42 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Mon, 05 Mar 2007 17:25:42 -0800 Subject: [Genome] How to get gene names in a batch given accession IDs of mRNA or EST? 
In-Reply-To: <1173117434.3537.48.camel@bchwxp3040.med.buffalo.edu> References: <1173117434.3537.48.camel@bchwxp3040.med.buffalo.edu> Message-ID: <45ECC316.90804@soe.ucsc.edu> Hello Long, The association of mRNA and EST IDs with gene names does not exist in any of our tables. You could instead use the Table Browser to create custom tracks from the all_mrna and all_est IDs and then intersect the custom tracks with one of the gene tracks to get the corresponding gene names. However, the Table Browser does not have a "join" function that will give you the mRNA/EST name in the same table as the gene name (it will only return a list of the gene names that intersected with your custom tracks -- you will not be able to tell which mRNA/EST goes with which gene). There is another tool run by Penn State that works in conjunction with the UCSC Genome Browser that can do joins. It is called "Galaxy", and it is located here: http://main.g2.bx.psu.edu/ Galaxy has a tool to "Join the intervals of two queries side-by-side" that you can use to join the intervals of your specific mRNAs or ESTs with the intervals of a gene track, from which you can get a gene name, or at least a gene accession. (If you are working with hg18, I recommend using the "RefSeq Genes" track, which has a gene common name in the 'name2' field.) Some notes about using Galaxy: - You can access the UCSC Table Browser through the "UCSC Main table browser" link on the left-hand side under "Get Data". Note that Galaxy defaults to May 2004 (hg17), which is not our most recent human build. - The "join intervals" function is under the "Operate on genomic intervals" section on the left-hand side. - Galaxy has a wiki page available on interval operations: http://g2.trac.bx.psu.edu/wiki/GopsDesc I hope this information is helpful. If you have further questions for us, please do not hesitate to contact us again. If you have questions about Galaxy, please contact their help desk at galaxy-user at bx.psu.edu. 
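For anyone who prefers to script the side-by-side interval join described above rather than run it in Galaxy, the core idea can be sketched in a few lines of Python. This is only an illustration, not how Galaxy implements its join: the function name is hypothetical, the coordinates are made up, strand is ignored, and the inputs are assumed to be (chrom, start, end, name) tuples already exported from the mRNA/EST tables and from a gene track.

```python
from collections import defaultdict


def join_by_overlap(transcripts, genes):
    """Pair each mRNA/EST interval with every gene interval it overlaps.

    Both inputs are lists of (chrom, start, end, name) tuples in 0-based,
    half-open coordinates. Strand is ignored in this sketch.
    """
    # Group gene intervals by chromosome so each lookup scans one chromosome only.
    genes_by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        genes_by_chrom[chrom].append((start, end, name))

    pairs = []
    for chrom, start, end, accession in transcripts:
        for gene_start, gene_end, gene_name in genes_by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each starts before the other ends.
            if start < gene_end and gene_start < end:
                pairs.append((accession, gene_name))
    return pairs


# Made-up coordinates and names, for illustration only.
genes = [("chr1", 1000, 5000, "GENE_A"), ("chr1", 8000, 9000, "GENE_B")]
transcripts = [
    ("chr1", 1200, 1800, "EST_1"),
    ("chr1", 4900, 5200, "MRNA_1"),
    ("chr1", 6000, 7000, "EST_2"),  # overlaps no gene, so it yields no pair
]
pairs = join_by_overlap(transcripts, genes)
for accession, gene_name in pairs:
    print(accession, gene_name, sep="\t")
```

The nested loop is fine for a few thousand accessions; for whole-table joins an interval index or a sorted sweep would be a better choice.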
-- Brooke Rhead UCSC Genome Bioinformatics Group Long Li wrote: > Hi, > > I have a lot of IDs of mRNA (from all_mrna table) and EST (from all_est > table), my question is how to get their gene names in a batch. > > For example, the DA561869 and AL602556 are from all_est table, the > AK074693 is from all_mrna table. But they are belong to gene CABC1. So > which table should I look up to find the gene name CABC1? Thanks. > > Long > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From benoit at ebi.ac.uk Tue Mar 6 04:36:45 2007 From: benoit at ebi.ac.uk (Benoit Ballester) Date: Tue, 06 Mar 2007 12:36:45 +0000 Subject: [Genome] liftOver over.chain for rat Message-ID: <45ED605D.9040601@ebi.ac.uk> Hi, I would like to map coordinate from the rat assembly 3.1 to the rat assembly 3.4 using liftOver. But the only over.chain file I have found is rn4ToRn3.over.chain.gz available here http://hgdownload.cse.ucsc.edu/goldenPath/rn4/liftOver/ which, if I am right, contain data to convert rn4(v3.4) to rn3(v3.1). Do you have a over.chain file for the over way round rn3torn4 ?? Thanks, Benoit -- Benoit Ballester, PhD Ensembl Team European Bioinformatics Institute (EMBL-EBI) Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD, United Kingdom From wyu8 at jhmi.edu Tue Mar 6 06:57:23 2007 From: wyu8 at jhmi.edu (Wayne Yu) Date: Tue, 06 Mar 2007 09:57:23 -0500 Subject: [Genome] location query Message-ID: <45ED3B03020000C00000CE15@cis27.hosts.jhmi.edu> I have about 3000 genes and want to know their locations in hg 17, i.e. chr7:127,471,196-127,495,720. Is there a way to do query with a formed gene list instead doing it one by one with individual genes? 

Thanks
Wayne Yu
Johns Hopkins University

From yzhou at saturn.med.nyu.edu Tue Mar 6 06:07:27 2007
From: yzhou at saturn.med.nyu.edu (Yiming Zhou)
Date: Tue, 06 Mar 2007 09:07:27 -0500
Subject: [Genome] smaller size of mm8 vs hg18 all chain data from genome browser
Message-ID: <1173190047.45ed759f2f41e@saturn.med.nyu.edu>

Dear Colleagues,

I downloaded all chain data for mm8 vs hg18 from http://hgdownload.cse.ucsc.edu/goldenPath/mm8/vsHg18/ (ftp_data) and from http://genome.ucsc.edu/cgi-bin/hgTables?command=start (web_data) with the following parameters:

1. clade: Vertebrate
2. genome: Mouse
3. assembly: Feb 2006
4. group: Comparative Genomics
5. track: Human Chain
6. table: chainHg18
7. region: genome
8. output format: all fields from selected table
9. all other parameters were default ones.

I think these two data sets represent the same thing: all chain data from the mouse/human whole genome pairwise alignment. The question is that the sizes of the two datasets are different. I know the formats of the two datasets are different. Since ftp_data used a more compact format, I expected ftp_data to be smaller. But I found ftp_data was 684M while web_data was 259M. I am wondering what makes ftp_data bigger than web_data. I am going to use the data to define conserved transcription factor binding sites. Could you suggest which one is better for this purpose? Thank you very much.

Best Regards,
Yiming

From barreiro at pasteur.fr Tue Mar 6 10:40:28 2007
From: barreiro at pasteur.fr (Luis Barreiro)
Date: Tue, 06 Mar 2007 19:40:28 +0100
Subject: [Genome] Question?
Message-ID: <45EDB59C.6090702@pasteur.fr>

Hi,

I have a simple question concerning the use of the Table Browser. Indeed, I would need to have the gene transcript location in hg18 for about 500 genes.
I know how to get this information giving as gene identifier in the format NM_XXX (for example NM_003263) but since I don't have this code for all my genes, I would like to obtain this information using as gene identifier the HUGO nomenclature (e.g. TLR1 for the former example). Is it possible, and if so, how should I proceed? Thanks a lot in advance Luis From bushel at niehs.nih.gov Tue Mar 6 10:48:53 2007 From: bushel at niehs.nih.gov (Bushel, Pierre (NIH/NIEHS) [E]) Date: Tue, 6 Mar 2007 13:48:53 -0500 Subject: [Genome] MySQL database of the genome browser In-Reply-To: <45ECB2A7.704@soe.ucsc.edu> Message-ID: Yes thanks a lot. Pierre -----Original Message----- From: Brooke Rhead [mailto:rhead at soe.ucsc.edu] Sent: Monday, March 05, 2007 7:16 PM To: Bushel, Pierre (NIH/NIEHS) [E] Cc: genome at soe.ucsc.edu Subject: Re: [Genome] MySQL database of the genome browser Hello Pierre, The MySQL database dumps are available from our downloads page: http://hgdownload.cse.ucsc.edu/downloads.html Go to your assembly of interest and then click on the "Annotation database" link. The tables are also available on our ftp server, here: ftp://hgdownload.cse.ucsc.edu/goldenPath/ After choosing the assembly of interest, click on the "database" directory. I hope this information helps. Please do not hesitate to contact us again if you have further questions. -- Brooke Rhead UCSC Genome Bioinformatics Group Bushel, Pierre (NIH/NIEHS) [E] wrote: > Is the MySQL database of the genome browser still available? I would > need the sql commands to create the database and the files as tables > to insert into the database. > > Thanks, > > Pierre R. Bushel, Ph.D. > Head of Microarray and Genome Informatics > Biostatistics Branch > National Institute of Environmental Health Sciences > P.O. 
Box 12233 > RTP, NC 27709 > (W) 919-316-4564 > (F) 919-316-4649 > bushel at niehs.nih.gov > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From hiram at soe.ucsc.edu Tue Mar 6 11:05:58 2007 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Tue, 06 Mar 2007 11:05:58 -0800 Subject: [Genome] MySQL database of the genome browser In-Reply-To: References: Message-ID: <45EDBB96.3080308@soe.ucsc.edu> Good Morning Pierre: You can also download directly the actual MySQL tables in their binary MySQL format directly into your database without any loading: ftp://hgdownload.cse.ucsc.edu/mysql/ These files would be copied directly into your /var/mysql/ directory, or wherever it is that your MySQL keeps its data files. This saves you the time of loading the files from their ASCII SQL representations. Note, these files are MySQL version 4.0.27 They will function in newer MySQL versions. --Hiram Bushel, Pierre (NIH/NIEHS) [E] wrote: > Yes thanks a lot. > > Pierre > > -----Original Message----- > From: Brooke Rhead [mailto:rhead at soe.ucsc.edu] > Sent: Monday, March 05, 2007 7:16 PM > To: Bushel, Pierre (NIH/NIEHS) [E] > Cc: genome at soe.ucsc.edu > Subject: Re: [Genome] MySQL database of the genome browser > > > Hello Pierre, > > The MySQL database dumps are available from our downloads page: > > http://hgdownload.cse.ucsc.edu/downloads.html > > Go to your assembly of interest and then click on the "Annotation > database" link. > > The tables are also available on our ftp server, here: > > ftp://hgdownload.cse.ucsc.edu/goldenPath/ > > After choosing the assembly of interest, click on the "database" > directory. > > I hope this information helps. Please do not hesitate to contact us > again if you have further questions. > > -- > Brooke Rhead > UCSC Genome Bioinformatics Group > > > > Bushel, Pierre (NIH/NIEHS) [E] wrote: >> Is the MySQL database of the genome browser still available? 
I would >> need the sql commands to create the database and the files as tables >> to insert into the database. >> >> Thanks, >> >> Pierre R. Bushel, Ph.D. >> Head of Microarray and Genome Informatics >> Biostatistics Branch >> National Institute of Environmental Health Sciences >> P.O. Box 12233 >> RTP, NC 27709 >> (W) 919-316-4564 >> (F) 919-316-4649 >> bushel at niehs.nih.gov From kayla at soe.ucsc.edu Tue Mar 6 11:13:41 2007 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Tue, 06 Mar 2007 11:13:41 -0800 Subject: [Genome] liftOver over.chain for rat In-Reply-To: <45ED605D.9040601@ebi.ac.uk> References: <45ED605D.9040601@ebi.ac.uk> Message-ID: <45EDBD65.5010909@cse.ucsc.edu> Benoit, http://hgdownload.cse.ucsc.edu/goldenPath/rn3/liftOver/rn3ToRn4.over.chain.gz Kayla Smith UCSC Genome Bioinformatics Group Benoit Ballester wrote: > Hi, > > I would like to map coordinate from the rat assembly 3.1 to the rat > assembly 3.4 using liftOver. But the only over.chain file I have found > is rn4ToRn3.over.chain.gz available here > http://hgdownload.cse.ucsc.edu/goldenPath/rn4/liftOver/ > which, if I am right, contain data to convert rn4(v3.4) to rn3(v3.1). > > Do you have a over.chain file for the over way round rn3torn4 ?? > > Thanks, > > Benoit > > From kayla at soe.ucsc.edu Tue Mar 6 11:19:35 2007 From: kayla at soe.ucsc.edu (Kayla Smith) Date: Tue, 06 Mar 2007 11:19:35 -0800 Subject: [Genome] location query In-Reply-To: <45ED3B03020000C00000CE15@cis27.hosts.jhmi.edu> References: <45ED3B03020000C00000CE15@cis27.hosts.jhmi.edu> Message-ID: <45EDBEC7.9090508@cse.ucsc.edu> Wayne, Please see this previously answered mailinglist question: http://www.cse.ucsc.edu/pipermail/genome/2006-July/011161.html I hope this helps you. Please don't hesitate to contact again if you require more assistance. Kayla Smith UCSC Genome Bioinformatics Group Wayne Yu wrote: > > I have about 3000 genes and want to know their locations in hg 17, i.e. chr7:127,471,196-127,495,720. 
Is there a way to query with a prepared gene list instead of doing it one by one with individual genes?
>
> Thanks
> Wayne Yu
> Johns Hopkins University
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From rhead at soe.ucsc.edu Tue Mar 6 11:29:43 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Tue, 06 Mar 2007 11:29:43 -0800
Subject: [Genome] help interpret PCR results
Message-ID: <45EDC127.8060502@soe.ucsc.edu>

Hello Nallasivam,

This mailing list is generally intended for questions more directly
related to the Genome Browser. However, sometimes the community of
users who subscribe to the mailing list are able to help one another.

If anyone reading the mailing list has any ideas or insight for this
user, your response is welcome.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

-------- Original Message --------
Date: Mon, 5 Mar 2007 09:12:27 +0800
From: Nallasivam PALANISAMY
To:
CC: ,

I need help to resolve the following seq, which is present only in the
mRNA of the RCC2 gene and in the 5' UTR of the FSCN2 gene but not in the
genomic seq of both genes. We identified this seq as part of a chimeric
gene identified by 5' RACE. By FISH analysis, we did not find the
colocalization of signal with the partner gene. Please help to interpret
our PCR results.

Thank you

=================================================================

Our seq: CGCGGATCCAGACGCTGCGTTTGCTGGCTTTGATGAA

>NM_018715 (RCC2) length=4114
ttgatgaaacgcggatccagacgctgcgtttgctggctttgatgaaacgcacatgtgttt
ctgttttgtgttgtagcatttgttctggaagctcgtatttacattttaagtgtatctggt
gagtgggctggagccctcgtctgggccggaaaaaaaaaagccctccgatccgtcttttag
ttgcttctcttccttttttctctccggtttctcatcactccaaccagccgcgaccatgcc
caggaagaaggcggcggcggcggcctgggaggagccgagctcgggcaacggcactgcccg
cgccgggcccaggaaacgcggcggcccggcgggcaggaagcgcgagcggcccgagcgctg

Nallasivam Palanisamy, Ph.D.
Group Leader, Cancer Biology
Genome Institute of Singapore
60, Biopolis Street, #02-01 Genome
Singapore, 138672
Tel: (65) 6478 8173
Fax: (65) 6478 9004
email:palanisamyn at gis.a-star.edu.sg
http://www.gis.a-star.edu.sg/internet/site/investigators.php?f=cv&user_id=15

This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose or disclose its contents to any
other person. Thank you.

From MEC at stowers-institute.org Tue Mar 6 11:36:25 2007
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Tue, 6 Mar 2007 13:36:25 -0600
Subject: [Genome] Querying the repeat tables in dm2 mysql database
In-Reply-To: <45EDBB96.3080308@soe.ucsc.edu>
Message-ID:

In dm2, I understand that to directly query the mysql database for
repeats overlapping a given region, I need to direct the query at the
table named after the chromosome, i.e.

    SELECT repName, repClass
      FROM chr2R_rmsk
     WHERE genoStart <= ?
       AND genoEnd >= ?

Am I correct that there is no way to query a single table, passing the
chromosome as a parameter, to achieve the same effect?

Thanks!

Malcolm Cook
Stowers Institute for Medical Research - Kansas City, Missouri

From kayla at soe.ucsc.edu Tue Mar 6 12:03:29 2007
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Tue, 06 Mar 2007 12:03:29 -0800
Subject: [Genome] Querying the repeat tables in dm2 mysql database
In-Reply-To:
References:
Message-ID: <45EDC911.4080508@cse.ucsc.edu>

Malcolm,

The rmsk table is split by chromosome, so MySQL queries must include the
chromosome prefix (e.g. chr2R_) in the table name. However, the Table
Browser ("Tables" on the blue bar at the top of the main page) could be
used, as it will query all the split tables together. You would have to
set up a filter to get the results that you want.

I hope this is helpful to you. Please don't hesitate to contact us again
if you require further assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

Cook, Malcolm wrote:
> In dm2, I understand that to directly query the mysql database for
> repeats overlapping a given region, I need to direct the query at the
> table named after the chromosome, i.e.
>
>     SELECT repName, repClass
>       FROM chr2R_rmsk
>      WHERE genoStart <= ?
>        AND genoEnd >= ?
>
> Am I correct that there is no way to query a single table, passing the
> chromosome as a parameter, to achieve the same effect?
>
> Thanks!
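
[Editor's note] Because the repeat tables are split per chromosome, a script can treat the chromosome as a parameter by building the table name itself. The following is only a minimal Python sketch of that idea: the helper name `rmsk_overlap_query` is hypothetical, the `%s` placeholders assume a DB-API driver that uses the "format" paramstyle, and the overlap test assumes 0-based, half-open coordinates (hence the strict inequalities).

```python
import re

# The chromosome name is interpolated into the table name, so restrict it
# to a conservative pattern first (an assumption; adjust for your assembly).
_CHROM_RE = re.compile(r"chr[0-9A-Za-z_]+")


def rmsk_overlap_query(chrom, start, end):
    """Return (sql, params) selecting repeats that overlap chrom:start-end.

    Hypothetical helper: assumes the split <chrom>_rmsk table layout
    discussed in this thread and 0-based, half-open coordinates.
    """
    if not _CHROM_RE.fullmatch(chrom):
        raise ValueError("unexpected chromosome name: %r" % chrom)
    if start >= end:
        raise ValueError("start must be less than end")
    sql = (
        "SELECT repName, repClass FROM {table} "
        "WHERE genoStart < %s AND genoEnd > %s"
    ).format(table=chrom + "_rmsk")
    # A repeat overlaps the region when it starts before the region ends
    # and ends after the region starts.
    return sql, (end, start)


sql, params = rmsk_overlap_query("chr2R", 1000, 2000)
print(sql)
print(params)
```

The returned pair would be passed to a cursor's `execute(sql, params)`; only the table name is interpolated, and the coordinates stay as bound parameters.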
>
> Malcolm Cook
> Stowers Institute for Medical Research - Kansas City, Missouri
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From bushel at niehs.nih.gov Tue Mar 6 11:08:56 2007
From: bushel at niehs.nih.gov (Bushel, Pierre (NIH/NIEHS) [E])
Date: Tue, 6 Mar 2007 14:08:56 -0500
Subject: [Genome] MySQL database of the genome browser
In-Reply-To: <45EDBB96.3080308@soe.ucsc.edu>
Message-ID:

That's even better. Thanks for the follow-up.

Pierre

-----Original Message-----
From: Hiram Clawson [mailto:hiram at soe.ucsc.edu]
Sent: Tuesday, March 06, 2007 2:06 PM
To: Bushel, Pierre (NIH/NIEHS) [E]
Cc: genome at soe.ucsc.edu
Subject: Re: [Genome] MySQL database of the genome browser

Good Morning Pierre:

You can also download directly the actual MySQL tables in their
binary MySQL format directly into your database without any
loading:
    ftp://hgdownload.cse.ucsc.edu/mysql/
These files would be copied directly into your /var/mysql/ directory,
or wherever it is that your MySQL keeps its data files.

This saves you the time of loading the files from their ASCII SQL
representations. Note, these files are MySQL version 4.0.27
They will function in newer MySQL versions.
--Hiram

Bushel, Pierre (NIH/NIEHS) [E] wrote:
> Yes thanks a lot.
>
> Pierre
>
> -----Original Message-----
> From: Brooke Rhead [mailto:rhead at soe.ucsc.edu]
> Sent: Monday, March 05, 2007 7:16 PM
> To: Bushel, Pierre (NIH/NIEHS) [E]
> Cc: genome at soe.ucsc.edu
> Subject: Re: [Genome] MySQL database of the genome browser
>
>
> Hello Pierre,
>
> The MySQL database dumps are available from our downloads page:
>
> http://hgdownload.cse.ucsc.edu/downloads.html
>
> Go to your assembly of interest and then click on the "Annotation
> database" link.
>
> The tables are also available on our ftp server, here:
>
> ftp://hgdownload.cse.ucsc.edu/goldenPath/
>
> After choosing the assembly of interest, click on the "database"
> directory.
>
> I hope this information helps. Please do not hesitate to contact us
> again if you have further questions.
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
> Bushel, Pierre (NIH/NIEHS) [E] wrote:
>> Is the MySQL database of the genome browser still available? I would
>> need the sql commands to create the database and the files as tables
>> to insert into the database.
>>
>> Thanks,
>>
>> Pierre R. Bushel, Ph.D.
>> Head of Microarray and Genome Informatics
>> Biostatistics Branch
>> National Institute of Environmental Health Sciences
>> P.O. Box 12233
>> RTP, NC 27709
>> (W) 919-316-4564
>> (F) 919-316-4649
>> bushel at niehs.nih.gov

From kayla at soe.ucsc.edu Tue Mar 6 12:35:25 2007
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Tue, 06 Mar 2007 12:35:25 -0800
Subject: [Genome] Question?
In-Reply-To: <45EDB59C.6090702@pasteur.fr>
References: <45EDB59C.6090702@pasteur.fr>
Message-ID: <45EDD08D.1040103@cse.ucsc.edu>

Luis,

For genes where you don't have the NM_ identifiers, you can use the
kgXref table. Select the following options in the Table Browser:

clade: Vertebrate
genome: Human
assembly: Mar. 2006
table: knownGene
region: genome
filter: create

On the filter page scroll down and select the box next to kgXref.
Then click on "Allow Filtering Using Fields in Checked Tables".

Under the header "hg18.kgXref based filters", set:
geneSymbol "does match" TLR1
(you could try pasting in all your non-NM_ names here)
click "submit"

Back in the Table Browser, toggle "output format" to "selected fields
from primary and related tables"
click "get output"

Select the boxes: name, chrom, txStart, txEnd from hg18.knownGene
and also the geneSymbol box from the hg18.kgXref fields.
click "get output" one last time for your results.

I hope this is helpful to you. Please don't hesitate to contact us again
if you require further assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

Luis Barreiro wrote:
> Hi,
>
> I have a simple question concerning the use of the Table Browser.
> I need the gene transcript location in hg18 for
> about 500 genes. I know how to get this information using a gene
> identifier in the format NM_XXX (for example NM_003263) but since I
> don't have this code for all my genes, I would like to obtain this
> information using the HUGO nomenclature as gene identifier (e.g. TLR1
> for the former example). Is it possible, and if so, how should I proceed?
>
> Thanks a lot in advance
> Luis
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From rhead at soe.ucsc.edu Tue Mar 6 15:17:54 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Tue, 06 Mar 2007 15:17:54 -0800
Subject: [Genome] (no subject)
In-Reply-To: <474753.49261.qm@web86410.mail.ird.yahoo.com>
References: <474753.49261.qm@web86410.mail.ird.yahoo.com>
Message-ID: <45EDF6A2.5060203@soe.ucsc.edu>

Hello Jon,

You might consider using the Conservation track multiple alignment format (MAF) file instead of the individual net alignments.
MAF is described here:

http://genome.ucsc.edu/FAQ/FAQformat#format5

You can extract the regions of interest from a MAF file using the Table
Browser by first making a custom track of the regions and then
intersecting the Conservation track (the multiz17way table if using
hg18) with your custom track. You would then need to grep out your
species of interest from the resulting MAF file.

Another option is to use Galaxy, a set of tools created and maintained
by Penn State that work in conjunction with the Genome Browser, located
here:

http://main.g2.bx.psu.edu/

They have several tools for dealing with MAFs. In the "Fetch Sequences
and Alignments" section on the left-hand side of the site, there are two
tools that would be useful for your situation: "Extract MAF blocks --
given a set of genomic intervals" and "Maf Limit to Species -- Remove
undesired species from a MAF file".

I hope this information is helpful. Please let us know if you have any
further questions.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

JONATHAN COOPER wrote:
> Hi.
>
> I am required for an MSc project to compare 1400 human RE1 (repressor
> element 1) sites (each 21bp) against chimp, rat, & mouse genomes for
> sequence homology. Although I could do this manually, it'd take me
> quite a while, & wouldn't be much fun. Could you please tell me - are
> there any scripts you know of that can do this for me? Failing that,
> can you suggest the most appropriate data to download to write a
> script for? At the moment I am considering using the "axtNet" genome
> alignments - but am not sure if it is the best format to use. All I
> really need is for a file to contain the most similar match to the
> sequences I give it (and possibly a blast score)
>
> Thanks for your time, best regards, Jon.
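
[Editor's note] The "grep out your species of interest" step from the reply above could be scripted roughly as follows. This is a minimal Python sketch, not a UCSC or Galaxy tool: the function name `filter_maf_species` is hypothetical, it assumes the usual MAF convention that rows of type s, i, e and q carry a "database.chromosome" source name in their second field, and the sample block uses made-up coordinates and sequence.

```python
def filter_maf_species(lines, keep):
    """Yield MAF lines, dropping rows that belong to species not in `keep`.

    Rows of type s, i, e and q name their source as "db.chrom" in the
    second field; header lines, "a" lines and blank lines pass through.
    """
    keep = set(keep)
    for line in lines:
        fields = line.split()
        if len(fields) > 1 and fields[0] in ("s", "i", "e", "q"):
            species = fields[1].split(".", 1)[0]
            if species not in keep:
                continue
        yield line


# Illustrative block only: coordinates, sizes and bases are invented.
sample = [
    "##maf version=1\n",
    "a score=100.0\n",
    "s hg18.chr1 100 5 + 1000 ACGTA\n",
    "s mm8.chr4 200 5 + 1000 ACGTA\n",
    "s canFam2.chr2 300 5 - 1000 ACGTT\n",
    "\n",
]

for kept_line in filter_maf_species(sample, {"hg18", "mm8", "rn4"}):
    print(kept_line, end="")
```

The database names to keep (for example the chimp and rat assemblies present in the alignment) would need to match the names actually used in the MAF file.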
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From rhead at soe.ucsc.edu Tue Mar 6 16:09:21 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Tue, 06 Mar 2007 16:09:21 -0800
Subject: [Genome] potential discrepancy between genome browser and hg18.mm8.all.chain
In-Reply-To: <45E38A58.90904@saturn.med.nyu.edu>
References: <45E38A58.90904@saturn.med.nyu.edu>
Message-ID: <45EE02B1.8050300@soe.ucsc.edu>

Hello Yiming,

The results from the hg18.mm8.all.chain file and the results from the
Genome Browser are actually showing the same thing; they are just
formatted differently.

When the strand value is "-" in chain format, position coordinates are
listed in terms of the reverse-complemented sequence. This means that
to convert chain file coordinates to Genome Browser coordinates for the
"-" strand, you need to subtract the start and end coordinates from the
total length of the chromosome. More information about the chain format
is here:

http://genome.ucsc.edu/goldenPath/help/chain.html

In your example:

qStrand: -
qSize: 98252459
qStart: 65522021
qEnd: 66995791

So, to get to Genome Browser coordinates,

Browser start = qSize - qEnd = 31256668
Browser end = qSize - qStart = 32730438

Additionally, like all of our data files/tables, the start position is
0-based, not 1-based as it is displayed in the Genome Browser, so you
need to add 1 to the start to get to Browser coordinates.

Once converted, the data from hg18.mm8.all.chain exactly matches the
position from Genome Browser: chr16:31256669-32730438.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Yiming Zhou wrote:
> Dear colleagues:
>
> I recently used hg18.mm8.all.chain to get the conserved region between
> human and mouse. One of the regions of interest is within human
> chr3:196913410-198819991.
>
> hg18.mm8.all.chain gives:
> chain 10445662 chr3 199501827 + 196913409 198819991 chr16 98252459 -
> 65522021 66995791 201
>
> But the Genome Browser gives:
>
> Human position: chr3:196913410-198819991 size: 1906582
> Strand: -
> Mouse position: chr16:31256669-32730438 size: 1473770
> Chain ID: 201
> Score: 10445662 Approximate Score within browser window: 3805
>
>
> Which one is correct?
>
> Many thanks.
> Best,
> Yiming
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From ps at c2b2.columbia.edu Wed Mar 7 09:03:01 2007
From: ps at c2b2.columbia.edu (Pavel Sumazin)
Date: Wed, 07 Mar 2007 12:03:01 -0500
Subject: [Genome] differences between MAF and 2bit
Message-ID: <45EEF045.7090706@c2b2.columbia.edu>

I obtained a few coordinates from the mouse genome from the 17
vertebrate MAF files (mm8 is in the alignment) and attempted to retrieve
flanking sequence from mm8.2bit. While the sequence information exists
in the MAF files, it appears to be missing from the 2bit file. Is this a
known phenomenon? Is there a reason for this?

I am including the problematic regions below.
Thanks, Pavel ---------------------------------- chr10:1580222-1595222 chr10:1617696-1632696 chr10:1622371-1637371 chr10:1628386-1643386 chr10:1719750-1734750 chr10:1720947-1735947 chr10:1723502-1738502 chr10:1836299-1851299 chr10:1845002-1860002 chr10:1867826-1882826 chr10:1914649-1929649 chr10:1975021-1990021 chr10:1975369-1990369 chr10:1993795-2008795 chr10:2015417-2030417 chr10:2042729-2057729 chr10:2043929-2058929 chr10:2047326-2062326 chr10:2047391-2062391 chr10:2051919-2066919 chr10:2101210-2116210 chr10:2131846-2146846 chr10:2141109-2156109 chr10:2178724-2193724 chr10:2214138-2229138 chr10:2220258-2235258 chr10:2243283-2258283 chr10:2248703-2263703 chr10:2288675-2303675 chr10:2309814-2324814 chr10:2329785-2344785 chr10:2331482-2346482 chr10:2450934-2465934 chr10:2492728-2507728 chr10:2511309-2526309 chr10:260893-275893 chr10:2680359-2695359 chr10:2794156-2809156 chr10:2809478-2824478 chr10:2874457-2889457 chr10:2874934-2889934 chr10:2880224-2895224 chr10:2920094-2935094 chr10:377837-392837 chr10:932728-947728 chr1:177616-192616 chr1:1984125-1999125 chr1:2010530-2025530 chr1:2081239-2096239 chr1:2111890-2126890 chr12:1152309-1167309 chr1:2184304-2199304 chr1:2498693-2513693 chr1:2848074-2863074 chr13:0-9435 chr13:1092860-1107860 chr13:11089088-11104088 chr13:1501341-1516341 chr13:1503859-1518859 chr13:2206584-2221584 chr13:22310-37310 chr13:2454751-2469751 chr13:691196-706196 chr1:393253-408253 chr1:537375-552375 chr1:541851-556851 chr17:2039786-2054786 chr18:1317598-1332598 chr18:1456739-1471739 chr18:1693472-1708472 chr18:44383-59383 chr2:176101878-176116878 chr2:176330068-176345068 chr3:0-1684 chr3:1568344-1583344 chr3:15714271-15729271 chr3:1854827-1869827 chr3:1885230-1900230 chr3:1915118-1930118 chr4:1025555-1040555 chr4:1084758-1099758 chr4:1097824-1112824 chr4:1208593-1223593 chr4:121881029-121896029 chr4:1272011-1287011 chr4:1503025-1518025 chr4:1534100-1549100 chr4:154290-169290 chr4:160073-175073 chr4:160278-175278 chr4:178279-193279 
chr4:1879861-1894861 chr4:212494-227494 chr4:2162246-2177246 chr4:2163680-2178680 chr4:2195773-2210773 chr4:2209048-2224048 chr4:2688384-2703384 chr4:279208-294208 chr4:282080-297080 chr4:301592-316592 chr4:311723-326723 chr4:31762137-31777137 chr4:326964-341964 chr4:396249-411249 chr4:436992-451992 chr4:517689-532689 chr4:521117-536117 chr4:611893-626893 chr4:809490-824490 chr4:825296-840296 chr5:68503-83503 chr7:0-2353 chr7:1461315-1476315 chr7:1486950-1501950 chr7:25797-40797 chr7:369833-384833 chr7:408919-423919 chr7:466426-481426 chr7:727175-742175 chr7:883844-898844 chr7:901622-916622 chr8:1573442-1588442 chr8:20223847-20238847 chr8:21592558-21607558 chr8:457261-472261 chr8:58243751-58258751 chr8:58248362-58263362 chr8:58277430-58292430 chr8:58301692-58316692 chrX:1010247-1025247 chrX:1523464-1538464 chrX:1524856-1539856 chrX:2814671-2829671 chrX:2866971-2881971 chrX:2933212-2948212 chrX:3611925-3626925 chrX:3808426-3823426 chrX:3811419-3826419 chrX:877279-892279 chrX:893084-908084 chrX:896646-911646

From cm138621 at bcm.tmc.edu Wed Mar 7 08:32:10 2007
From: cm138621 at bcm.tmc.edu (Montemayor, Celina)
Date: Wed, 7 Mar 2007 10:32:10 -0600
Subject: [Genome] Download first intron sequences
Message-ID:

Hi,

I was wondering if I could get some advice on how to download the
coordinates of the first introns of my list of genes (I need a BED
file). I used the Table Browser to paste my gene IDs and download intron
sequences; the output I obtained looks like this:

chr1 136014280 136014811 NM_016749_intron_5_0_chr1_136014281_f

Is there a way for me to tell which coordinates correspond to the first introns?
Thanks for your help,

Celina Montemayor, MD
Graduate Student, Pereira lab
Baylor College of Medicine

From rhead at soe.ucsc.edu Tue Mar 6 17:39:14 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Tue, 06 Mar 2007 17:39:14 -0800
Subject: [Genome] smaller size of mm8 vs hg18 all chain data from genome browser
In-Reply-To: <1173190047.45ed759f2f41e@saturn.med.nyu.edu>
References: <1173190047.45ed759f2f41e@saturn.med.nyu.edu>
Message-ID: <45EE17C2.1030801@soe.ucsc.edu>

Hi Yiming,

You are correct that data retrieved via ftp (mm8.hg18.all.chain) is in a
more compact format. The reason it is larger than the data retrieved
via the Table Browser is that there is more information contained in the
chain file. Here is part of the first entry in mm8.hg18.all.chain:

chain 647145090 chr12 120463159 + 45508375 113878656 chr14 106368585 + 24227835 105403220 1
19 1 0
64 1 0
32 0 17
13 0 5
26 1 0
10 0 13
.
.
.

and so on. The chain file contains one line for each alignment chain,
followed by several lines describing the alignment gaps within that
chain. (See the bottom of this page:
http://genome.ucsc.edu/goldenPath/help/chain.html for more information
on the alignment lines.)

Note that the information retrieved via the Table Browser is the same as
information in the Annotation database, here:

http://hgdownload.cse.ucsc.edu/goldenPath/mm8/database/

(The files are split up into chromosomes, called chr*_chainHg18.txt.gz.)
These files contain only one line per chain, with no extra lines
describing alignment gaps. Here is the entry for chain ID 647145090 (the
same as in the example above) from the chr12_chainHg18.txt.gz:

0 647145090 chr12 120463159 45508375 113878656 chr14 106368585 + 24227835 105403220 1

Note that the extra alignment information is missing. This entry is
also present in the Table Browser output.

I hope this information is helpful. Please let us know if we can
clarify anything, or if you have any further questions about the Genome
Browser.
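
[Editor's note] For anyone scripting against the chain file, the header line shown above can be read with a few lines of Python. This is a minimal sketch under the field order given in the chain format documentation linked above (score, target name/size/strand/start/end, query name/size/strand/start/end, id); the names `ChainHeader`, `parse_chain_header` and `browser_position` are hypothetical, and the score is assumed to be an integer. The position helper also flips "-" strand coordinates and converts the 0-based start to the 1-based form shown in the Genome Browser.

```python
from collections import namedtuple

ChainHeader = namedtuple(
    "ChainHeader",
    "score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id",
)


def parse_chain_header(line):
    """Parse one 'chain ...' header line into a ChainHeader tuple."""
    fields = line.split()
    if len(fields) != 13 or fields[0] != "chain":
        raise ValueError("not a chain header line: %r" % line)
    f = fields[1:]
    return ChainHeader(
        int(f[0]), f[1], int(f[2]), f[3], int(f[4]), int(f[5]),
        f[6], int(f[7]), f[8], int(f[9]), int(f[10]), f[11],
    )


def browser_position(name, size, strand, start, end):
    """Return a 1-based, forward-strand position string.

    Chain coordinates are 0-based and half-open; on the "-" strand they
    count from the end of the sequence, so they are flipped using `size`.
    """
    if strand == "-":
        start, end = size - end, size - start
    return "%s:%d-%d" % (name, start + 1, end)


header = parse_chain_header(
    "chain 647145090 chr12 120463159 + 45508375 113878656 "
    "chr14 106368585 + 24227835 105403220 1"
)
print(browser_position(header.tName, header.tSize, header.tStrand,
                       header.tStart, header.tEnd))
print(browser_position(header.qName, header.qSize, header.qStrand,
                       header.qStart, header.qEnd))
```

The gap lines that follow each header (three numbers per line, a single number on the last line of the chain) are not handled here; a fuller reader would collect them until the next "chain" line.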
--
Brooke Rhead
UCSC Genome Bioinformatics Group

Yiming Zhou wrote:
> Dear Colleagues,
>
> I downloaded all chain data for mm8 vs hg18 from
> http://hgdownload.cse.ucsc.edu/goldenPath/mm8/vsHg18/ (ftp_data)
> and from
> http://genome.ucsc.edu/cgi-bin/hgTables?command=start (web_data)
> with the following parameters:
> 1. clade: Vertebrate
> 2. genome: Mouse
> 3. assembly: Feb 2006
> 4. group: Comparative Genomics
> 5. track: Human Chain
> 6. table: chainHg18
> 7. region: genome
> 8. output format: all fields from selected table
> 9. all other parameters were default ones.
>
> I think these two data sets represent the same thing: all chain data from
> the mouse/human whole genome pairwise alignment.
>
> The question is that the sizes of the two datasets are different. I know the formats
> of the two datasets are different. Since ftp_data uses a more compact format, I
> expected ftp_data to be smaller. But I found ftp_data was 684M while web_data was
> 259M.
>
> I am wondering what makes ftp_data bigger than web_data. I am going to use the
> data to define conserved transcription factor binding sites. Could you
> suggest which one is better for this purpose? Thank you very much.
>
> Best Regards,
> Yiming
>
> -----------------------------------------------