From sandmann at embl.de Tue Jan 2 08:44:28 2007 From: sandmann at embl.de (sandmann@embl.de) Date: Tue, 2 Jan 2007 17:44:28 +0100 Subject: [Genome] Wiggle display of custom annotations Message-ID: <20070102174428.tde5h8132k0sk4ok@webmail.embl.de> Dear UCSC genome browser team, I have tried to upload a medium sized custom annotation file (ca. 42 Mb uncompressed) via ftp (as .gz file) to the Genome browser for display with the Drosophila genome. The file contains several .wiggle tracks, together ca. 780.000 lines. The last track in the file is not displayed properly, most of the chromosomes are reported to contain "no data". This is not due to the file format (upload to a local version of the browser is fine), so I was wondering if there is a size/line limit on the custom annotation files or whether this might be another problem you could point out to me. Thank you ! Thomas From kate at soe.ucsc.edu Tue Jan 2 10:10:40 2007 From: kate at soe.ucsc.edu (Kate Rosenbloom) Date: Tue, 02 Jan 2007 10:10:40 -0800 Subject: [Genome] Wiggle display of custom annotations In-Reply-To: <20070102174428.tde5h8132k0sk4ok@webmail.embl.de> References: <20070102174428.tde5h8132k0sk4ok@webmail.embl.de> Message-ID: <459AA020.1000302@cse.ucsc.edu> Hello Thomas, There is no defined limit for the size of a custom track file. However in practice, the file upload can be limited by the speed of your internet connection. I can check for other problems if you can provide me (email me directly) with the URL for your custom track file. Cheers, Kate --- Kate Rosenbloom Software Developer UCSC Genome Bioinformatics sandmann at embl.de wrote: > Dear UCSC genome browser team, > I have tried to upload a medium sized custom annotation file (ca. 42 Mb > uncompressed) via ftp (as .gz file) to the Genome browser for display with the > Drosophila genome. The file contains several .wiggle tracks, together ca. > 780.000 lines. 
The last track in the file is not displayed properly, most of > the chromosomes are reported to contain "no data". This is not due to the file > format (upload to a local version of the browser is fine), so I was wondering > if there is a size/line limit on the custom annotation files or whether this > might be another problem you could point out to me. > > Thank you ! > Thomas > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From hartera at soe.ucsc.edu Tue Jan 2 10:22:25 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Tue, 2 Jan 2007 10:22:25 -0800 (PST) Subject: [Genome] [Fwd: GoldenPath] In-Reply-To: <459A9295.90406@ucsc.edu> References: <459A9295.90406@ucsc.edu> Message-ID: Loren, The "Golden Path" is the assembled genome sequence. This term was originally applied to the human genome assemblies at UCSC but it is also now used for other genome assemblies. If you go to the human Genome Browser and click on the "Assembly" link above the Assembly track control in the "Mapping and Sequencing Tracks" group then you will see the description for this track. Here it is: "This track shows the draft assembly of the human genome. This assembly merges contigs from overlapping drafts and finished clones into longer sequence contigs. The sequence contigs are ordered and oriented when possible by mRNA, EST, paired plasmid reads (from the SNP Consortium) and BAC end sequence pairs. In dense mode, this track depicts the path through the draft and finished clones (aka the golden path) used to create the assembled sequence. Clone boundaries are distinguished by the use of alternating gold and brown coloration. Where gaps exist in the path, spaces are shown between the gold and brown blocks. If the relative order and orientation of the contigs between the two blocks is known, a line is drawn to bridge the blocks." I hope that this answers your question. 
In the future, please direct your questions to the genome mailing list at genome at soe.ucsc.edu - our moderated forum for user questions and discussion. You should then get a faster response to your question. Thanks. Rachel > -------- Original Message -------- > Subject: GoldenPath > Date: Mon, 01 Jan 2007 20:27:13 -0800 > From: Loren Engrav > To: > > > > What does goldenPath mean? What does it refer to? > > Thanks > > > ---from > > ...all creatures great and small... > > Loren H. Engrav, MD > Professor, Division of Plastic Surgery > Harborview Med Ctr, 359796 > 325 Ninth Avenue > Seattle, WA 98104 > office 206 731 3209 > fax 206 731 3656 > email engrav at u.washington.edu > > > > > > -- > Branwyn Stewart Wagman > Communications & Human Resources > Center for Biomolecular Science and Engineering (CBSE) > Institute for Quantitative Biomedical Research (QB3) > 501C Engineering 2 Building > UC Santa Cruz > 1156 High Street, MS: CBSE/ITI > Santa Cruz CA 95064 > Tel: (831) 459-3077 > Fax: (831) 459-1809 > bwagman at soe.ucsc.edu > http://www.cbse.ucsc.edu > From hiram at soe.ucsc.edu Tue Jan 2 10:34:11 2007 From: hiram at soe.ucsc.edu (Hiram Clawson) Date: Tue, 02 Jan 2007 10:34:11 -0800 Subject: [Genome] Wiggle display of custom annotations In-Reply-To: <20070102174428.tde5h8132k0sk4ok@webmail.embl.de> References: <20070102174428.tde5h8132k0sk4ok@webmail.embl.de> Message-ID: <459AA5A3.6020101@soe.ucsc.edu> Good Morning Thomas: This sounds like something may be incorrect in the declarations on your track definition line. Or perhaps a problem with the chrom names you are using, maybe an upper vs. lower case issue. A 42Mb file with 780,000 lines is not large and if there had been difficulty loading it you would have received some kind of error message. --Hiram sandmann at embl.de wrote: > Dear UCSC genome browser team, > I have tried to upload a medium sized custom annotation file (ca. 
42 Mb > uncompressed) via ftp (as .gz file) to the Genome browser for display with the > Drosophila genome. The file contains several .wiggle tracks, together ca. > 780.000 lines. The last track in the file is not displayed properly, most of > the chromosomes are reported to contain "no data". This is not due to the file > format (upload to a local version of the browser is fine), so I was wondering > if there is a size/line limit on the custom annotation files or whether this > might be another problem you could point out to me. > > Thank you ! > Thomas From hartera at soe.ucsc.edu Tue Jan 2 11:21:07 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Tue, 2 Jan 2007 11:21:07 -0800 (PST) Subject: [Genome] how to use peak height to show score in custom track In-Reply-To: <9391F5F6E849B844BD3A7468023691BE3B4A96@ADM-EXCHVS01.bsdad.uchicago.edu> References: <9391F5F6E849B844BD3A7468023691BE3B4A96@ADM-EXCHVS01.bsdad.uchicago.edu> Message-ID: Hello Chunyu, The Help page for the custom tracks may be found here: http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#CustomTracks Specifically, here is a help page for the wiggle tracks which should explain how to use the wiggles (peaks) to show score in a custom track: http://genome.ucsc.edu/goldenPath/help/wiggle.html I hope that this answers your question. Please let us know if you need further help. Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Wed, 27 Dec 2006, Liu, Chun-Yu [BSD] - KNP wrote: > Hi, > Is there any format to use peak height in the custom track? > I know you can use color code for score. But it is not as good as peak > height sometimes. > Thanks! > Chunyu > > > Chunyu Liu, Ph.D. > > Department of Psychiatry > > University of Chicago > > Tel: 773-834-3604 > > Fax: 773-834-2970 > > > This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. 
> If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

From hartera at soe.ucsc.edu Tue Jan 2 12:16:55 2007
From: hartera at soe.ucsc.edu (Rachel Harte)
Date: Tue, 2 Jan 2007 12:16:55 -0800 (PST)
Subject: [Genome] ABL2 inconsistency
In-Reply-To: <7126FD4CDEA5264386295C3BA1BE11110321DCA6@cell.sequenom.com>
References: <7126FD4CDEA5264386295C3BA1BE11110321DCA6@cell.sequenom.com>
Message-ID:

Dear Mathias,

I have taken a look at this problem. When I try your MySQL command for hg18, I get the following result:

name:       NM_005158
chrom:      chr1
strand:     -
txStart:    177343379
cdsStart:   177343475
cdsEnd:     177378802
exonStarts: 177343379,177346039,177348066,177350635,177353089,177354344,177355947,177357352,177362134,177367068,177378690,
exonEnds:   177345199,177346213,177348156,177350788,177353274,177354522,177356032,177357625,177362430,177367239,177378808,

The position that you see on the Known Genes details page for ABL2 is consistent with this result. txStart is 177343379 and txEnd (not selected by your query, but equal to the last exonEnds value) is 177378808. Start coordinates in the table are 0-based, so the first base of a chromosome is represented by 0. In the 1-based coordinates used on the web page, the start is therefore 177343380, which gives a position of chr1:177343380-177378808 for ABL2 (NM_005158). This is consistent with the position given on the Known Genes details page for the ABL2 transcript for hg18 (March 2006 assembly).
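A minimal Python sketch of the coordinate arithmetic described above. It assumes, as the reply states, that table start coordinates are 0-based while end coordinates need no adjustment; the function names `to_browser_position` and `genomic_size` are hypothetical helpers for illustration, not part of any UCSC tool.

```python
def to_browser_position(chrom, tx_start, tx_end):
    """Convert a table interval (0-based start, end used as-is) into the
    1-based position string displayed on a gene details page."""
    return f"{chrom}:{tx_start + 1}-{tx_end}"


def genomic_size(tx_start, tx_end):
    """Number of bases covered by the interval."""
    return tx_end - tx_start


# Values taken from the hg18 knownGene row for NM_005158 quoted in this thread.
position = to_browser_position("chr1", 177343379, 177378808)
size = genomic_size(177343379, 177378808)
print(position)  # chr1:177343380-177378808
print(size)      # 35429
```

Both values agree with the "Position" and "Genomic Size" fields that the original poster copied from the web page.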
If I then run the same MySQL query against the hg17 Known Genes table, I get the following result:

name:       NM_005158
chrom:      chr1
strand:     -
txStart:    175808413
cdsStart:   175808509
cdsEnd:     175843836
exonStarts: 175808413,175811073,175813100,175815669,175818123,175819378,175820981,175822386,175827168,175832102,175843724,
exonEnds:   175810233,175811247,175813190,175815822,175818308,175819556,175821066,175822659,175827464,175832273,175843842,

This looks to be the same as the result from your MySQL query, which suggests that your query was run against hg17 rather than hg18. Please check again for hg18 and let us know if you still find an inconsistency.

Please let us know if you have further questions.

Rachel

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

On Thu, 28 Dec 2006, Mathias Ehrich wrote:

> Dear Genome Browser Team,
>
> I am trying to retrieve the transcription start sites for a number of
> genes. I have managed to implement the database query using MySQL.
> Everything appeared to be working fine, but when I started checking the
> results I found a small inconsistency between the website results and
> the results obtained by the SQL query. The location of the ABL2 gene
> (NM_005158) differs by approximately 2M bp. I'm not sure that I used the
> correct versions, but I used 'hg18' for the SQL query and the assembly
> 'March 2006' for the website. I have attached both results for your
> review.
>
> Do you know where this inconsistency might come from?
>
> Thanks for your help.
>
> Best regards
>
> Mathias Ehrich
>
> Results from MySQL Query
>
> Schema: hg18
>
> Query:
>
> SELECT k.name, k.chrom, k.strand, k.txStart, k.cdsStart, k.cdsEnd,
> k.exonStarts, k.exonEnds
> FROM knownGene k
> WHERE k.chrom = 'chr1' AND k.strand = '-' AND k.name = 'NM_005158'
>
> Results:
>
> Name, chrom, strand, txStart, cdsStart, cdsEnd
> 'NM_005158', 'chr1', '-', 175808413, 175808509, 175843836, '', ''
>
> Results from GenomeBrowser Website
>
> Assembly: March 2006
>
> Description: v-abl Abelson murine leukemia viral oncogene
> Alternate Gene Symbols: RP11-177A2.3-002
> Representative Refseq: NM_005158
> Protein: Q5W0C5
>
> RefSeq Summary: ABL2 is a cytoplasmic tyrosine kinase which is closely
> related to but distinct from ABL1. The similarity of the proteins
> includes the tyrosine kinase domains and extends amino-terminal to
> include the SH2 and SH3 domains. ABL2 is expressed in both normal and
> tumor cells. The ABL2 gene product is expressed as two variants bearing
> different amino termini, both approximately 12-kb in length.
> Position: chr1:177343380-177378808
> Strand: -
> Genomic Size: 35429
> Exon Count: 11

From gongs at mail.rockefeller.edu Tue Jan 2 11:13:19 2007
From: gongs at mail.rockefeller.edu (Shiaoching Gong)
Date: Tue, 02 Jan 2007 14:13:19 -0500
Subject: [Genome] (no subject)
Message-ID: <459AAECF.8050302@rockefeller.edu>

Dear Sir/Madam,

Happy New Year!

May you please let me know where I can order the human BAC clone, CTA-799F10.

Thank you very much.
Shiaoching From hartera at soe.ucsc.edu Tue Jan 2 12:44:58 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Tue, 2 Jan 2007 12:44:58 -0800 (PST) Subject: [Genome] (no subject) In-Reply-To: <459AAECF.8050302@rockefeller.edu> References: <459AAECF.8050302@rockefeller.edu> Message-ID: Shiaoching, To find the information about where to order this BAC clone, go to the Human Genome Browser and search for the BAC clone by typing CTA-799F10 into the position/search box, then click on the BAC End Pairs result. Then click on the alignment for CTA-799F10 in the BAC End Pairs track and click on the CTA-799F10 link near the top of the page. This will take you to the NCBI Clone Registry. In the section called "Distributor Information", you will find contact details for ordering the clone. You can also go directly to the Clone Registry and search for the BAC clone: http://www.ncbi.nlm.nih.gov/genome/clone/ I hope that this helps you. Please let us know if you have further questions. Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Tue, 2 Jan 2007, Shiaoching Gong wrote: > Dear Sir/Madam, > Happy New Year! > May you please let me know where I can order the human BAC clone, > CTA-799F10. > Thank you very much. > Shiaoching > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From kwlaschi at cems.umn.edu Tue Jan 2 12:26:55 2007 From: kwlaschi at cems.umn.edu (Katie F Wlaschin) Date: Tue, 02 Jan 2007 14:26:55 -0600 Subject: [Genome] Converting Genome Coordinates Message-ID: <459AC00F.4030908@cems.umn.edu> Hello - I want to convert the coordinates from the human/mouse/rat alignment from the NCBI human 34 build / mouse 30 and rat 3 to the most updated coordinates from the most recent mouse assembly. I've tried your lift-over tool and most of the start-stop coordinates are "split" so its a little difficult. 
Is there a similar table that defines the start-stop coordinates for the most updated assemblies Thank you - Katie From kate at soe.ucsc.edu Tue Jan 2 14:21:46 2007 From: kate at soe.ucsc.edu (Kate Rosenbloom) Date: Tue, 2 Jan 2007 14:21:46 -0800 Subject: [Genome] Converting Genome Coordinates In-Reply-To: <459AC00F.4030908@cems.umn.edu> References: <459AC00F.4030908@cems.umn.edu> Message-ID: Hi Katie, Can you clarify what exactly you are trying to convert? Do you have a set of regions derived from our human build 34 (July 2003, hg16) multiple alignment for which you would like to convert the mouse NCBI build 30 (Feb. 2003, mm3) coordinates to our current mouse assembly (NCBI build 36, Feb. 2006, mm8)? If this is the case, you will indeed have difficulty with the mapping, as the mouse genome assemblies have changed significantly over the past 3 years -- from early draft to "essentially complete". You could use the '-multiple' option of liftOver to capture the split regions, but a better choice would be to use one of our more recent multiple alignments. The conservation track on our human build 36 (Mar. 2006, hg18) browser includes the latest mouse assembly. If this doesn't address your problem adequately, please contact me directly. Cheers, Kate ---- Kate Rosenbloom UCSC Genome Bioinformatics On Jan 2, 2007, at 12:26 PM, Katie F Wlaschin wrote: > Hello - I want to convert the coordinates from the human/mouse/rat > alignment from the NCBI human 34 build / mouse 30 and rat 3 to the > most > updated coordinates from the most recent mouse assembly. I've tried > your lift-over tool and most of the start-stop coordinates are "split" > so its a little difficult. 
> Is there a similar table that defines the
> start-stop coordinates for the most updated assemblies
>
> Thank you - Katie

From hartera at soe.ucsc.edu Tue Jan 2 16:21:34 2007
From: hartera at soe.ucsc.edu (Rachel Harte)
Date: Tue, 2 Jan 2007 16:21:34 -0800 (PST)
Subject: [Genome] Link between and the gene ontology database
Message-ID:

Marcel,

There is no direct link between these tables. However, you can intersect regions of the genome using the Table Browser. For instance, if you would like to find the regions upstream of Known Genes that contain repeats, you can do this in two steps. First, create a custom track of the upstream regions; then intersect it with the repeats track.

Here is information about the Table Browser:
http://genome.ucsc.edu/cgi-bin/hgTables?command=start#Help

To start using it, click on the Tables link on the top blue menu bar. Then follow these steps to create a custom track of upstream regions:

1) Select Genome: Human and Assembly: May 2004 (May 2004 is hg17).
2) Select "Genes and Gene Prediction Tracks" as the group, "Known Genes" as the track and knownGene as the table.
3) Select "custom track" as the output format and press "get output".
4) Under "Create one BED record per:", select "Upstream by" and fill in the number of bases upstream of Known Genes.
5) Press "get custom track in table browser".

Now you can intersect it with the repeats track:

1) Select "Custom Tracks" as the group and then select the track that you made.
2) Press the "create" button for intersection.
3) Select "Variation and Repeats" as the group and "RepeatMasker" as the track.
4) Once you have submitted this, select BED as the output format and get the output.

This will give you a list of the Known Genes upstream regions that contain repeats. This is just an example of what you can do.
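The two Table Browser steps above (build upstream regions, then intersect them with repeats) can also be sketched offline. The Python below is only an illustration under stated assumptions: intervals are 0-based and half-open as in BED, the names `Interval`, `upstream_region`, `overlaps` and `regions_with_repeats` are hypothetical, and the gene and repeat coordinates are made up. It is not how the Table Browser itself is implemented.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int          # 0-based, inclusive (BED convention)
    end: int            # exclusive
    name: str
    strand: str = "+"


def upstream_region(gene: Interval, n_bases: int) -> Interval:
    """Region of n_bases immediately upstream of the transcription start.
    For '-' strand genes, upstream lies after the gene's end coordinate.
    (Clipping to the chromosome length is omitted in this sketch.)"""
    if gene.strand == "+":
        start, end = max(0, gene.start - n_bases), gene.start
    else:
        start, end = gene.end, gene.end + n_bases
    return Interval(gene.chrom, start, end, f"{gene.name}_up_{n_bases}", gene.strand)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def regions_with_repeats(regions: List[Interval], repeats: List[Interval]) -> List[Interval]:
    """Keep the regions that overlap any repeat; names come from the regions."""
    return [r for r in regions if any(overlaps(r, rep) for rep in repeats)]


# Made-up example data, for illustration only.
genes = [
    Interval("chr1", 5000, 9000, "geneA", "+"),
    Interval("chr1", 20000, 24000, "geneB", "-"),
]
repeats = [
    Interval("chr1", 4500, 4800, "AluY"),
    Interval("chr1", 30000, 30300, "L1"),
]

upstream = [upstream_region(g, 1000) for g in genes]
hits = regions_with_repeats(upstream, repeats)
print([h.name for h in hits])  # ['geneA_up_1000']
```

As with the Table Browser intersection, the output keeps only the identifiers of the first set (the upstream regions), not those of the repeats.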
You can probably do something similar to look for genes close to some DNA elements.

You can also find tables for the go database in the Table Browser. If you select "All Tables" for the group, then "go" will appear in the database list. You can select a go table in the table list and then press "describe table schema" to get the description of the table and also other tables that are linked to it. That may help with figuring out which tables may be linked when making MySQL queries.

I hope that this helps you. Please let us know if you need further assistance.

Rachel

Marcel Leonce GUIBOTSY M'BOULAS guibot04 at hotmail.com
Fri Dec 29 13:42:56 PST 2006

Dear all,

I downloaded the knownGene table of hg17, the RepeatMasker and the gene ontology MySQL tables. I want to get the names of genes close to some DNA elements. MySQL queries always return an empty set. There seems to be no link between these tables. What should I do to solve this issue? I'm new to MySQL.

Happy holidays
Thank you.

Marcel Leonce Guibotsy
mguibotsy at uta.edu
Research Assistant
University of Texas at Arlington

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

From rhead at soe.ucsc.edu Wed Jan 3 14:08:04 2007
From: rhead at soe.ucsc.edu (Brooke Rhead)
Date: Wed, 03 Jan 2007 14:08:04 -0800
Subject: [Genome] Please advice with retrieving data
In-Reply-To: <002801c72ac1$564f22b0$b8a6b250@ad.biu.ac.il>
References: <002801c72ac1$564f22b0$b8a6b250@ad.biu.ac.il>
Message-ID: <459C2944.7090305@soe.ucsc.edu>

Hello Mali,

Since you are using accession numbers from different sources, it will probably be simplest to retrieve the conserved transcription factor binding sites separately for each source. The Known Genes table contains only genes that meet certain criteria (which you can read about on the Known Genes description page -- click the blue "Known Genes" link right above the track control), so it may not contain all of the genes in which you are interested.
Also, the gene names in the Known Genes table often correspond to RefSeq accession numbers, but not always. Accession numbers from the different sources you mentioned would not necessarily be expected to appear in the Known Genes table. So, generally, you can follow the three steps that you outlined, but repeat the steps for each accession source. First create a custom track that contains the regions 1000 base pairs upstream of each transcript (do this for just one accession source at first, such as just the Genbank accession numbers). For each source, use the track/table that corresponds to that source (Human mRNAs -- assuming you are looking at the human genome -- for Genbank, RefSeq Genes for RefSeq, etc.). Then create a second custom track that contains the specific binding site of interest, such as E2F. Were you using the TFBS Conserved track for this step? If not, you can do so by using the tfbsConsSites table and filtering on *E2F* in the name field. Third, intersect the two custom tracks. In order to keep the accession numbers, you will need to select the custom track that contains the accession numbers first, and then intersect with the binding sites custom track. When you do the intersection, be sure to select one of the intersection options listed under the heading "These combinations will maintain the gene/alignment structure (if any) of [your custom track]". The output should include all of the items from the first custom track that overlap with the second, maintaining the names from the first custom track. Now you can repeat these steps for the other accession sources. For your second question: > Also how can I retrieve sequences from a list that contain acc. > numbers that fit to the known genes name, but also other retrieve the > acc. numbers of those genes that lies 1000bp downstream to the genomic > location I retrieved? 
To get accession numbers for regions 1000bp downstream of Known Genes, first create a custom track of the downstream regions. Then go to the Table Browser and select the track whose accession numbers you wish to find in the downstream regions of Known Genes (it is important that you choose this track first when doing the intersection). Intersect this track with the custom track of the downstream regions. Your results should be all of the items in the track you selected that overlap with the downstream regions, including their accession numbers.

I hope this information is helpful to you. Please let us know if we can clarify any of the steps above, or if we did not fully answer your questions.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Mali Salmon wrote:
> Hello,
> I am new with using the table browser, and I need some help.
> I have a list of accession numbers from several sources (genebank, refseq, mgc, est, etc.) and I want to find those that have a conserved binding site to a specific transcription factor (E2F for example) in their promoters.
> I tried to do it using three steps as follows:
> 1. First I created a custom track that contains 1000bp upstream to my transcripts. Here I had a problem since I used known genes table, and I received only ~580 sequences out of 2730.
> 2. Then I created a second custom track that contains all the genes that have E2F binding sites (I used the filter to match *E2F*)
> 3. Then I did intersection between both custom tables and I received those genes that have the TF binding site. However, I received their genomic location but not the acc. numbers.
> Please advise if there is a better way of doing it. Also how can I retrieve sequences from a list that contain acc. numbers that fit to the known genes name, but also other retrieve the acc.
> numbers of those genes that lie 1000bp downstream to the genomic location I retrieved?
> Looking forward to your answer
> Many thanks
> Mali

From kosi at burnham.org Wed Jan 3 09:11:34 2007
From: kosi at burnham.org (Kosi Gramatikoff)
Date: Wed, 03 Jan 2007 09:11:34 -0800
Subject: [Genome] Promoters? Match to RefSeq (insignificant???)
Message-ID:

In one of your README files at:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

I read this:

"Sequences 1000 bases upstream of annotated transcription start of RefSeq genes. This includes only the cases where the transcription start (TSS) is annotated separately from the coding region start. Note that upstream files are generated only when an assembly is released. Therefore, the data may be slightly out of synch with the RefSeq data in assemblies that are incrementally updated nightly."

1. I would like to know how the TSSs were annotated - what is the principle of this annotation - identification of the TSS???

2. Do you have references describing the presumed principle of TSS identification? RefSeq does not contain that...

3. What do you mean by: "slightly out of synch with the RefSeq"? If TSSs are annotated separately from the start of the coding region - how could they ever be in sync? The TSS is upstream of the ATG codon (see next sentence as evidence).

The above questions are driven by a specific observation. I compared the entire RefSeq (~26,000 5'UTR) with your upstream2000 set (~23,000) and found only 244 RefSeq 5'UTRs to be included entirely. If 244 upstream sequences (promoters) contain sequences down to the ATG, why do the rest of the promoters not contain that? How exactly was a particular promoter (upstream region) matched to a unique RefSeq entry?
Is there information for how many nucleotides are missing from the ATG (or in between the ATG and your upstream sequence)? Please let me know, Kosi Gramatikoff Burnham Institute for Medical Research La Jolla, CA PS. If there is a person (contact info, phone) please let me know I would like to discuss the above issues with a live feedback From hartera at soe.ucsc.edu Wed Jan 3 16:54:49 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Wed, 3 Jan 2007 16:54:49 -0800 (PST) Subject: [Genome] Drosophila phylogenetic analysis In-Reply-To: <26B871D42497A946A1F25421794CEE8A040DB987@MAIL1.AD.Brown.Edu> References: <26B871D42497A946A1F25421794CEE8A040DB987@MAIL1.AD.Brown.Edu> Message-ID: Nicola, If you would like to find homologs, there are several ways that you could do this. In some of the fly Genome Browsers, we have a D. mel. Proteins track which aligns the Drosophila melanogaster proteins to the genome. However, not all the fly Browsers have this track. For 7 of the fly species, we do not have complete browsers yet, only the alignments to D. melanogaster. Therefore, using the net alignment track would also not work for all species. The net track shows the best in genome alignment of the fly species to the genome in question. An intersection of fly genes with the net track for each species would help you to find homologs but the net track is not available for all Drosophila species. The D. melanogaster conservation track (multiz15way) has multiple alignments for all of the Drosophila species. So one possibility for finding homologs is to do an intersection of the CDS regions of your genes of interest with the multiz15way conservation track. You can do this through the UCSC Table Browser but when you get the results of an intersection it only shows identifiers for the first table and not the second. I can show you how to do this with the Table Browser. I think that Galaxy 2 could be used to do this intersection and keep identifiers from both tables. 
Galaxy 2 adds data manipulation tools on top of the UCSC Table Browser and is produced by a group at Penn State University. Go to:
http://www.bx.psu.edu/cgi-bin/trac.cgi
then click on the Galaxy 2 link.

To use the UCSC Table Browser, click on the Tables link on the top blue bar. First you will need to get the coding exons for the genes that you are interested in:

1) Select clade: Insect, genome: D. melanogaster and the assembly of interest.
2) Select Genes and Gene Prediction Tracks as the group and FlyBase Genes as the track.
3) Make sure that you select genome as the region.
4) Then you can click on "paste list" or "upload list" to add a list of gene names.
5) Select "custom track" as the output format and press "get output", then select "Coding Exons" and press the "get custom track in table browser" option. You could just use Whole Gene too, but coding regions will be more conserved.

If you are unfamiliar with the BED format, here is a link to explain it:
http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#BED

Then you will need to intersect this with the conservation track data:

1) Select clade: Insect, genome: D. melanogaster and the assembly of interest.
2) Select Comparative Genomics as the group, Conservation as the track and multiz15way as the table.
3) Press the "create" button for intersection, select "Custom Tracks" as the group and select your custom track. Choose the base-pairwise AND for the intersection, or you can select for regions of the multiple alignment that have a certain % overlap with the FlyBase genes.
4) Select MAF - multiple alignment format as the output format and press the "get output" button.

This will give you just the regions of the multiple alignment that intersect with FlyBase coding exons. You can extract the coordinates for each region for each species. It will not tell you which genes intersected with each region. As I mentioned, Galaxy 2 may be able to give you both the multiple alignment region and the D.
melanogaster FlyBase gene name. There are tutorials on using Galaxy 2 on the home page. If you have any problems with it or further questions, please contact the group that created Galaxy 2. They have mailing lists and contact details on the bottom of the Galaxy home page. I hope that this helps answers your question. Please let us know if you have further questions. Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Sat, 30 Dec 2006, Neretti, Nicola wrote: > To Whom It May Concern: > > > > I am running a response element analysis on sets of genes from > Drosophila melanogaster. I would like to incorporate a phylogenetic > component by including sequences from all the other Drosophila species > available in the genome browser. However I noticed that searching > through a gene ID only returns hits from melanogaster. > > > > How can I extract homologs from all the other species? > > > > Also, is there a way to do it in batch mode? > > > > Thanks a lot for your help. > > > > > > > > Best Regards, > > > > -Nicola Neretti > > > > > > > > > > Nicola Neretti > > Institute for Brain and Neural Systems > > Brown University > > Providence, RI 02912 > > T. (401) 863-2187 > > F. (401) 863-3494 > > e-mail: nicola_neretti at brown.edu > > > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From mnefedov at chori.org Wed Jan 3 17:05:27 2007 From: mnefedov at chori.org (Michael Nefedov) Date: Wed, 03 Jan 2007 17:05:27 -0800 Subject: [Genome] question Message-ID: <6ac1ff214c588e4e89ee1b72f8a6e649@mail.chori.org> How to find RFLPs in mouse mm8, chr12:57,612,909-57,630,019 ? Dr. Mikhail Nefedov, BACPAC Resources, Children's Hospital Oakland Research Institute (CHORI), Bruce Lyon Research Building, Room 203. 
747 - 52nd street, Oakland, CA94609
Phone: 510-428-3885 ext: 5799
Fax: 510-450-7951

From ann at soe.ucsc.edu Wed Jan 3 17:27:43 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Wed, 03 Jan 2007 17:27:43 -0800
Subject: [Genome] Promoters? Match to RefSeq (insignificant???)
In-Reply-To:
References:
Message-ID: <459C580F.4050903@cse.ucsc.edu>

Hello Kosi,

The upstream*.zip files are created in a purely informatic way; there is no biology or hand-curation involved. We simply take the values in the refGene table (this table underlies the RefSeq Genes track), determine the {1000, 2000, 5000} upstream bases and put them in the file. To do this, we use a local tool called featureBits. The featureBits tool is able to, among other things, capture the region of N bases before a region of interest. I have appended the usage statement for the featureBits tool to the end of this email.

Now let me address your question regarding the upstream files being "slightly out of sync with the RefSeq". By this we mean that we create the upstream files *once* at the initial release of an assembly. Thus the upstream files are based on whatever is in the refGene table at the assembly release date. The refGene table is updated each night; new genes are added and some are even dropped. We do *not* update the upstream* files again to reflect these changes to the refGene table.

This should be enough background to help you understand the files. If it is not, please do not hesitate to reply to the genome list.

In a similar message that you sent only to Donna, you asked another question that I would like to address here. There is not a pre-computed set of all human upstream sequences beginning immediately before the ATG (coding region); however, you can create one using the Table Browser. Take, for example, the knownGene named NM_011041 (located here: chr12:57,613,651-57,629,241).
If you wanted to produce the 5000 bases which are immediately upstream from the start codon, you could do this in the Table Browser. Choose the knownGene table and enter the gene name in the identifiers list. Choose 'sequence' as the output format. Choose 'genomic', then choose: Promoter/Upstream by _5000_ bases 5' UTR Exons The output will be the 5000 bases upstream of the cdsStart (which starts with the start codon). The 5000 bases will be in small letters, and the UTR Exon (if any) will be in capital letters. In this example the output looks like this: >mm8_knownGene_NM_011041 range=chr12:57608651-57613872 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none agcacaacaaacgcccatggggaagctcacaaaccactgtactaaagggc cccgcagttttgcttcccctctccccacgtcccgccccccagctgcagct ctcaaagtcgagtggaagcgtaaggtcattcattgaaagaccaaggccag ... agccagaaacagacggcggaaggagtttcccggactgagctgtcactcac cggcctgcaccaattacaacgcagattgctcgcgggcccacctcttttgg ggtgtgtcacaagtgagtgatagactgagccgcccggccctgctcagccc AGCCCACGTTGCTGCTTAGATTGAAATGCAGAACTCAAGCCTCTTTCAGC CCGGCACAGACTTCCTTTTACTCTTTCCTTTGGCACTCTTGTCGCCTCCT CCCGGGAAGAAGCCAAGGCACCCTCGGCTTGGAGCAGCGACAGGCCGGCT CAGTGAGAACAAGAAAAAAGTTTCTTTCTGGGAGTGCGGAACTGGGGCCG GGTTGGTGTACTGCTCAGAGCA Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu [hgwdev:~/make> featureBits featureBits - Correlate tables via bitmap projections. usage: featureBits database table(s) This will return the number of bits in all the tables anded together Options: -bed=output.bed Put intersection into bed format -fa=output.fa Put sequence in intersection into .fa file -faMerge For fa output merge overlapping features. -minSize=N Minimum size to output (default 1) -chrom=chrN Restrict to one chromosome -chromSize=sizefile read chrom sizes from file instead of database. -or Or tables together instead of anding them -not Output negation of resulting bit set. 
-countGaps Count gaps in denominator -noRandom Don't include _random (or Un) chromosomes -noHap Don't include _hap chromosomes -dots=N Output dot every N chroms (scaffolds) processed -minFeatureSize=n Don't include bits of the track that are smaller than minFeatureSize, useful for differentiating between alignment gaps and introns. -bin=output.bin Put bin counts in output file -binSize=N Bin size for generating counts in bin file (default 500000) -binOverlap=N Bin overlap for generating counts in bin file (default 250000) -bedRegionIn=input.bed Read in a bed file for bin counts in specific regions and write to bedRegionsOut -bedRegionOut=output.bed Write a bed file of bin counts in specific regions from bedRegionIn -enrichment Calculates coverage and enrichment assuming first table is reference gene track and second track something else '-where=some sql pattern' restrict to features matching some sql pattern You can include a '!' before a table name to negate it. Some table names can be followed by modifiers such as: :exon:N Break into exons and add N to each end of each exon :cds Break into coding exons :intron:N Break into introns, remove N from each end :utr5, :utr3 Break into 5' or 3' UTRs :upstream:N Consider the region of N bases before region :end:N Consider the region of N bases after region :score:N Consider records with score >= N :upstreamAll:N Like upstream, but doesn't filter out genes that have txStart==cdsStart or txEnd==cdsEnd :endAll:N Like end, but doesn't filter out genes that have txStart==cdsStart or txEnd==cdsEnd The tables can be bed, psl, or chain files, or a directory full of such files as well as actual database tables. To count the bits used in dir/chrN_something*.bed you'd do: featureBits database dir/_something.bed Kosi Gramatikoff wrote: > In one of your REAADME files at: > ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ > > I read this: > "Sequences 1000 bases upstream of annotated > transcription start of RefSeq genes. 
This includes only the > cases where the transcription start (TSS) is annotated separately > from the coding region start. Note that upstream files are > generated only when an assembly is released. Therefore, the > data may be slightly out of synch with the RefSeq data in > assemblies that are incrementally updated nightly." > > > 1. I would like to know how were the TSS annotated - what is the principle > of this annotation - identification of the TSS??? > > 2. Do you have references describing the presumed principle of TSS > identification? RefSeq does not contain that... > > 3. What do you mean by: "slightly out of synch with the RefSeq"? > If TSSs are annotated separately from the start of the coding region - how > could they be ever in sync. TSS is upstream of the ATG codon (see next > sentence as evidence). > > The above questions are driven by a specific observation. > I compared the entire RefSeq (~26,000 5'UTR) with your upstream2000 set > (~23,000) and found only 244 RefSeq 5'UTRs to be included entirely. If 244 > upstream sequences (promoters) contain sequences down to the ATG why the > rest of the promoters do not contain that? > > How exactly was the association of a particular promoter (upstream region) > matched to unique RefSeq entry? > > Is there information for how many nucleotides are missing from the ATG (or > in between the ATG and your upstream sequence)? > > > Please let me know, > Kosi Gramatikoff > Burnham Institute for Medical Research > La Jolla, CA > > PS. 
If there is a person (contact info, phone) please let me know I would > like to discuss the above issues with a live feedback > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From dexter at mail.med.upenn.edu Wed Jan 3 21:52:51 2007 From: dexter at mail.med.upenn.edu (Dexter Hadley) Date: Thu, 4 Jan 2007 00:52:51 -0500 Subject: [Genome] File offsets in hg18.multiz17way table Message-ID: <000601c72fc4$92948190$650fa8c0@PAPPY> Hi there, In hg17 I was able to use the file offsets from the multiz8way table to query alignments in the respective maf files. However, now that I have migrated to hg18, my scripts no longer work. I think that the problem is that the file offsets in hg18.multiz17way are incorrect. For instance, please consider these two queries of the first multiz alignment from hg17.multiz8way and hg18.multiz17way: mysql> select * from hg17.multiz8way where chrom = 'chrY' order by chromStart limit 1; +-----+-------+------------+----------+---------+--------+----------+ | bin | chrom | chromStart | chromEnd | extFile | offset | score | +-----+-------+------------+----------+---------+--------+----------+ | 585 | chrY | 861 | 2250 | 2468793 | 909 | 0.940234 | +-----+-------+------------+----------+---------+--------+----------+ mysql> select * from hg18.multiz17way where chrom = 'chrY' order by chromStart limit 1; +-----+-------+------------+----------+---------+--------+----------+ | bin | chrom | chromStart | chromEnd | extFile | offset | score | +-----+-------+------------+----------+---------+--------+----------+ | 585 | chrY | 667 | 861 | 2389811 | 34 | 0.596211 | +-----+-------+------------+----------+---------+--------+----------+ The offset in hg17.multiz8way is greater than that in hg18.multiz17way even though there are many more comment lines in the beginning of hg18/multiz17way/chrY.maf relative to hg17/multiz8way/chrY.maf. Something is wrong. 
Please help. Thanks, Dexter Hadley From y.itan at ucl.ac.uk Thu Jan 4 04:46:11 2007 From: y.itan at ucl.ac.uk (Yuval Itan) Date: Thu, 4 Jan 2007 12:46:11 +0000 Subject: [Genome] Automating web BLAT Message-ID: <5cf05da5091d96a623c31d9b50796ebf@ucl.ac.uk> Hello, I have a few hundred cDNA sequences that I need to BLAT against the human, chimp and macaque genome. I can't do it locally because these tasks require huge memory (2GB of my computer is not enough). Doing it manually through your web interface will require many hours of very tedious work, due to the 25 sequences limitation. I was wondering if you know of a way to automatically perform this task using your (or other) web interface. Thanks, Yuval From peter.scacheri at case.edu Thu Jan 4 06:56:42 2007 From: peter.scacheri at case.edu (Peter Scacheri) Date: Thu, 4 Jan 2007 09:56:42 -0500 Subject: [Genome] random DNA sequence generator Message-ID: Please help!! I'm trying download a list of 500 or so random sequences of a specified length (~300 bp) from the ENCODE regions. Is this possible to do this using the Table Browser, or somewhere else on the UCSC website? Sincerely, Peter -- ******************************* Peter C. Scacheri, Ph.D. Assistant Professor Department of Genetics Case Western Reserve University 10900 Euclid Ave; BRB 627 Cleveland, OH 44106 tel: 216-368-3458 fax: 216-368-3432 Email: peter.scacheri at case.edu From hartera at soe.ucsc.edu Thu Jan 4 08:45:29 2007 From: hartera at soe.ucsc.edu (Rachel Harte) Date: Thu, 4 Jan 2007 08:45:29 -0800 (PST) Subject: [Genome] Automating web BLAT In-Reply-To: <5cf05da5091d96a623c31d9b50796ebf@ucl.ac.uk> References: <5cf05da5091d96a623c31d9b50796ebf@ucl.ac.uk> Message-ID: Hello Yuval, I have a perl script that allows you to submit a file of sequences to our BLAT server. It obeys our site's rules for Blat, submitting only 25 sequences and pausing for 2 minutes between submissions. 
The script (BlatBot.pl) may be downloaded from here: http://genomewiki.cse.ucsc.edu/index.php/Blat_Scripts The output is HTML, so the parseBlatOutput.pl script may be used to remove the HTML. Please let me know if you have any questions or find any problems. Rachel Rachel Harte UCSC Genome Bioinformatics Group http://genome.ucsc.edu On Thu, 4 Jan 2007, Yuval Itan wrote: > Hello, > > I have a few hundred cDNA sequences that I need to BLAT against the > human, chimp and macaque genome. I can't do it locally because these > tasks require huge memory (2GB of my computer is not enough). > Doing it manually through your web interface will require many hours of > very tedious work, due to the 25 sequences limitation. > I was wondering if you know of a way to automatically perform this task > using your (or other) web interface. > > Thanks, > > Yuval > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From shepardp at uci.edu Thu Jan 4 09:16:23 2007 From: shepardp at uci.edu (shepardp@uci.edu) Date: Thu, 4 Jan 2007 09:16:23 -0800 (PST) Subject: [Genome] altsplicing conserved graphs Message-ID: <4523.128.195.176.149.1167930983.squirrel@webmail.uci.edu> Hello UCSC folks, I am trying to download in bulk from your table browser the following data; Human splicing graphs conserved in the mouse transcriptome. I can get the altGraphX (location of splicing events in the human genome) but I do not know how to proceed after this? Thanks. P From kuhn at soe.ucsc.edu Thu Jan 4 11:42:23 2007 From: kuhn at soe.ucsc.edu (Robert Kuhn) Date: Thu, 4 Jan 2007 11:42:23 -0800 Subject: [Genome] question Message-ID: <200701041942.LAA14407@moondance.cse.ucsc.edu> Dr. Nefedov, We do not have RFLP information on the browser. The best approach I could suggest using our software would be to guide an experimental approach.
By turning on the tracks in the Variation and Repeats section, specifically Repeat Masker, Simple Repeats and Microsatellites, you can see where candidate RFLP regions lie. Using the Restriction Enzymes track, you can see where restriction enzymes map to the reference assembly and could design experiments to determine whether the repeat regions are useful to you as RFLPs. Note that you can limit the Restriction Enzyme track to enzymes of your choice using the configuration page. Click the track name link or the minibutton to the left of the track in the browser display. Note that the UCSC mm8 database contains only the reference strain C57BL/6J. best wishes, --b0b kuhn ucsc genome bioinformatics group > From genome-bounces at soe.ucsc.edu Wed Jan 3 17:19:16 2007 > To: "genome at soe.ucsc.edu." > Subject: [Genome] question > > How to find RFLPs in mouse mm8, chr12:57,612,909-57,630,019 ? > > > > > > Dr. Mikhail Nefedov, > BACPAC Resources, > Children's Hospital Oakland > Research Institute (CHORI), > Bruce Lyon Research Building, Room 203. > 747 - 52nd street, Oakland, CA94609 > Phone: 510-428-3885 ext: 5799 > Fax: 510-450-7951 > > > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From rhead at soe.ucsc.edu Thu Jan 4 11:52:23 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Thu, 04 Jan 2007 11:52:23 -0800 Subject: [Genome] File offsets in hg18.multiz17way table In-Reply-To: <000601c72fc4$92948190$650fa8c0@PAPPY> References: <000601c72fc4$92948190$650fa8c0@PAPPY> Message-ID: <459D5AF7.6090507@soe.ucsc.edu> Hello Dexter, Since the hg17 8-way alignment, we have upgraded the display of maf files in the Genome Browser so that more information about gaps and genomic breaks is displayed.
Now we actually have two versions of maf files: the original version, which we make available for download, and an annotated version, which we use to display the extra information in the Browser. The "offset" field in the multiz17way table refers to the annotated version. We have not made the annotated maf files available for download because they are much larger than the non-annotated versions. Can you let us know what your scripts do? Perhaps we can help you change them or suggest an alternative method so that you can get the same results. -- Brooke Rhead UCSC Genome Bioinformatics Group Dexter Hadley wrote: > Hi there, > > In hg17 I was able to use the file offsets from the multiz8way table to > query alignments in the respective maf files. However, now that I have > migrated to hg18, my scripts no longer work. I think that the problem is > that the file offsets in hg18.multiz17way are incorrect. For instance, > please consider these two queries of the first multiz alignment from > hg17.multiz8way and hg18.multiz17way: > > mysql> select * from hg17.multiz8way where chrom = 'chrY' order by > chromStart limit 1; > +-----+-------+------------+----------+---------+--------+----------+ > | bin | chrom | chromStart | chromEnd | extFile | offset | score | > +-----+-------+------------+----------+---------+--------+----------+ > | 585 | chrY | 861 | 2250 | 2468793 | 909 | 0.940234 | > +-----+-------+------------+----------+---------+--------+----------+ > > > mysql> select * from hg18.multiz17way where chrom = 'chrY' order by > chromStart limit 1; > +-----+-------+------------+----------+---------+--------+----------+ > | bin | chrom | chromStart | chromEnd | extFile | offset | score | > +-----+-------+------------+----------+---------+--------+----------+ > | 585 | chrY | 667 | 861 | 2389811 | 34 | 0.596211 | > +-----+-------+------------+----------+---------+--------+----------+ > > The offset in hg17.multiz8way is greater than that in hg18.multiz17way even > though 
there are many more comment lines in the beginning of > hg18/multiz17way/chrY.maf relative to hg17/multiz8way/chrY.maf. Something > is wrong. Please help. Thanks, > > Dexter Hadley > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From rhead at soe.ucsc.edu Thu Jan 4 14:23:40 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Thu, 04 Jan 2007 14:23:40 -0800 Subject: [Genome] altsplicing conserved graphs In-Reply-To: <4523.128.195.176.149.1167930983.squirrel@webmail.uci.edu> References: <4523.128.195.176.149.1167930983.squirrel@webmail.uci.edu> Message-ID: <459D7E6C.2000504@soe.ucsc.edu> Hello P, We do not store the alt-splicing images; they are drawn as needed by the Genome Browser from the data in the table. If you would like to recreate the alt-splicing images yourself, our programs are available freely for academic, nonprofit, and personal use: http://genome.ucsc.edu/FAQ/FAQdownloads#download27 If you are interested in doing this, one of our developers suggests looking at doAltGraphXDetails() in kent/src/hg/hgc/hgc.c to get started. I hope this information is helpful. Please contact us again if you have further questions. -- Brooke Rhead UCSC Genome Bioinformatics Group shepardp at uci.edu wrote: > Hello UCSC folks, > > I am trying to download in bulk from your table browser the following data; > > Human splicing graphs conserved in the mouse transcriptome. > > I can get the altGraphX (location of splicing events in the human genome) > but I do not know how to proceed after this? > > Thanks. 
> > P > > > > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From rhead at soe.ucsc.edu Thu Jan 4 15:27:07 2007 From: rhead at soe.ucsc.edu (Brooke Rhead) Date: Thu, 04 Jan 2007 15:27:07 -0800 Subject: [Genome] random DNA sequence generator In-Reply-To: References: Message-ID: <459D8D4B.8000009@soe.ucsc.edu> Hello Peter, We do not have a way of generating random regions using the Genome Browser. However, the 'encodeRegions' table might be helpful to you. You can get to it in the Table Browser by selecting "group: ENCODE Regions and Genes", "track: ENCODE Regions" and "table: encodeRegions". It will at least give you a starting point for generating random regions. Once you are able to get a suitable list of regions some other way, you could upload the regions as a custom track. (Information on custom tracks is here: http://genome.ucsc.edu/goldenPath/help/customTrack.html#ADD_CT ). Then you could download the sequences of these regions from the Table Browser by selecting your custom track and choosing "output format: sequence". Let us know if you need any further help using the Table Browser. -- Brooke Rhead UCSC Genome Bioinformatics Group Peter Scacheri wrote: > Please help!! I'm trying download a list of 500 or so random > sequences of a specified length (~300 bp) from the ENCODE regions. > Is this possible to do this using the Table Browser, or somewhere > else on the UCSC website? > > Sincerely, > Peter From sdavis2 at mail.nih.gov Thu Jan 4 18:01:16 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 04 Jan 2007 21:01:16 -0500 Subject: [Genome] random DNA sequence generator In-Reply-To: <459D8D4B.8000009@soe.ucsc.edu> References: <459D8D4B.8000009@soe.ucsc.edu> Message-ID: <459DB16C.6070904@mail.nih.gov> Brooke Rhead wrote: > Hello Peter, > > We do not have a way of generating random regions using the Genome > Browser. 
However, the 'encodeRegions' table might be helpful to you. > You can get to it in the Table Browser by selecting "group: ENCODE > Regions and Genes", "track: ENCODE Regions" and "table: encodeRegions". > It will at least give you a starting point for generating random regions. > > Once you are able to get a suitable list of regions some other way, you > could upload the regions as a custom track. (Information on custom > tracks is here: > http://genome.ucsc.edu/goldenPath/help/customTrack.html#ADD_CT ). Then > you could download the sequences of these regions from the Table Browser > by selecting your custom track and choosing "output format: sequence". > Hi, Pete. I saw your question and thought I would point out a tool that you and folks here might find useful as a supplemental tool for the Table Browser. It is based heavily on the data in the UCSC genome browser and is being developed and maintained at Penn State. The Galaxy server there allows you to upload essentially a custom track and then it will generate random regions to "match" the regions in your custom track. You can also choose to mask out intervals; in your case, you would want to mask all the non-encode regions. You can download the encode intervals, perform a complement on those regions (a function also available via the Galaxy server), and use the result for your mask. Finally, they allow you to grab the sequences of your random regions if you need them. I haven't actually done this, so check on the details, but I think what I described should work. 
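For anyone who would rather generate the random regions locally before uploading them as a custom track, the idea can be sketched in a few lines of Python. This is only a rough illustration, not how Galaxy or the Table Browser does it: the function names are made up for this example, the intervals are assumed to be 0-based half-open (as in BED), regions are allowed to overlap one another, and the coordinates in the example call are placeholders rather than real ENCODE regions.

```python
import random


def random_regions(intervals, length, count, seed=None):
    """Draw `count` random regions of `length` bases from inside `intervals`.

    intervals: list of (chrom, start, end) tuples, 0-based half-open.
    Each interval is chosen with probability proportional to the number of
    valid start positions it offers, so every placement is equally likely.
    """
    rng = random.Random(seed)
    # Intervals shorter than the requested length cannot hold a region.
    usable = [(c, s, e) for c, s, e in intervals if e - s >= length]
    if not usable:
        raise ValueError("no interval is long enough for the requested length")
    weights = [e - s - length + 1 for _, s, e in usable]
    regions = []
    for _ in range(count):
        chrom, start, end = rng.choices(usable, weights=weights, k=1)[0]
        pos = rng.randint(start, end - length)  # inclusive on both ends
        regions.append((chrom, pos, pos + length))
    return regions


def to_bed(regions):
    """Format regions as BED lines suitable for pasting into a custom track."""
    return "\n".join(
        f"{c}\t{s}\t{e}\trandom_{i}" for i, (c, s, e) in enumerate(regions, 1)
    )


# Placeholder intervals for illustration only (not real ENCODE coordinates).
example_intervals = [("chr1", 100, 1000), ("chr2", 0, 200)]
print(to_bed(random_regions(example_intervals, length=300, count=5, seed=1)))
```

The real interval list would come from the encodeRegions table mentioned above, and the BED output could then be uploaded as a custom track to fetch the sequences.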
Sean -- Genetics Branch NCI/CCR/NIH Bethesda, MD -- From sdavis2 at mail.nih.gov Thu Jan 4 18:06:13 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 04 Jan 2007 21:06:13 -0500 Subject: [Genome] random DNA sequence generator In-Reply-To: <459DB16C.6070904@mail.nih.gov> References: <459D8D4B.8000009@soe.ucsc.edu> <459DB16C.6070904@mail.nih.gov> Message-ID: <459DB295.7070009@mail.nih.gov> Sean Davis wrote: > Brooke Rhead wrote: > >> Hello Peter, >> >> We do not have a way of generating random regions using the Genome >> Browser. However, the 'encodeRegions' table might be helpful to you. >> You can get to it in the Table Browser by selecting "group: ENCODE >> Regions and Genes", "track: ENCODE Regions" and "table: encodeRegions". >> It will at least give you a starting point for generating random regions. >> >> Once you are able to get a suitable list of regions some other way, you >> could upload the regions as a custom track. (Information on custom >> tracks is here: >> http://genome.ucsc.edu/goldenPath/help/customTrack.html#ADD_CT ). Then >> you could download the sequences of these regions from the Table Browser >> by selecting your custom track and choosing "output format: sequence". >> >> > Hi, Pete. I saw your question and thought I would point out a tool that > you and folks here might find useful as a supplemental tool for the > Table Browser. It is based heavily on the data in the UCSC genome > browser and is being developed and maintained at Penn State. The Galaxy > server there allows you to upload essentially a custom track and then it > will generate random regions to "match" the regions in your custom > track. You can also choose to mask out intervals; in your case, you > would want to mask all the non-encode regions. You can download the > encode intervals, perform a complement on those regions (a function also > available via the Galaxy server), and use the result for your mask. 
> Finally, they allow you to grab the sequences of your random regions if > you need them. I haven't actually done this, so check on the details, > but I think what I described should work. > And the URL would be useful: http://main.g2.bx.psu.edu/ Sorry for the noise.... Sean From anton.kratz at googlemail.com Fri Jan 5 00:23:02 2007 From: anton.kratz at googlemail.com (Anton Kratz) Date: Fri, 5 Jan 2007 17:23:02 +0900 Subject: [Genome] Table Browser / Why are there LocusLink ID denoted genes on different strands/chromosomes? Message-ID: <68c7994b0701050023md659d40yc5dfdd0ebc0c4aab@mail.gmail.com> Hi, for a bioinformatics project I am working on I want to make a list of all genes (*) in the human genome (hg17, May 2004, NCBI Build 35) with these entries: 1. some unique number identifying the gene 2. strand 3. chromosome 4. start 5. end As a first step I got a list through the Table Browser and it has these entries: LocusLink ID, Known Gene ID, chromosome, strand, start, end (LocusLink is included b/c later I want to access via LocusLink not Known Gene). This list has around 34000 entries (lines) in total and 17000 unique LocusLink IDs, b/c many LocusLink IDs occur multiple times, and it looks like this: #hg17.knownToLocusLink.value hg17.knownGene.name hg17.knownGene.chrom hg17.knownGene.strand hg17.knownGene.txStart hg17.knownGene.txEnd [...] 83259 NM_032971 chrY + 4911627 5016846 83259 NM_032972 chrY + 4967491 5016846 83259 NM_032973 chrY + 4967491 5653623 [...] This is almost what I want, b/c I can merge multiple entries such as the example above into a new entry like this: 83259 chrY + 4911627 5653623 And this would be my "gene". 4911627 b/c it is the minimum in this example and 5653623 b/c it is the maximum. But I have difficulties understanding many of the entries in the UCSC Known Genes list I got through the Table Browser. For example: 1. Gene of different chromosomes: 55344 NM_018390,NM_018390, chrX,chrY, +,+, 132991,132991, 160020,160020, 2.
Gene on different strands: 9084 NM_181880,NM_181880, chrY,chrY, -,+, 14535783,14606232, 14536519,14606968, Why are there LocusLink ID denoted genes on different strands/chromosomes? Best, Anton (*) when I use the term "gene", I do not mean gene in a "true" biological sense; I do not want to distinguish between alternative splice variants etc. For my program, a gene needs to be something that has exactly one start position and one end position, and occurs only once in the genome. Yes, I admit that's an oversimplification. From farin.henner at mh-hannover.de Fri Jan 5 02:10:35 2007 From: farin.henner at mh-hannover.de (Henner Farin) Date: Fri, 5 Jan 2007 11:10:35 +0100 Subject: [Genome] "Motifs to highlight" in "Base Position" track doesn't work Message-ID: Hello, I'd like to inform you about the fact that the option "Motifs to highlight:" in the track "Base Position" does not work anymore. It would be really great if you could make this tool working again. Thanks (and no need to answer), Henner ---------------------------------------------------------------------- Institut für Molekularbiologie Medizinische Hochschule Hannover Raum 2580 Carl-Neuberg-Str. 1 D-30625 Hannover Germany Tel.: (49) 511 532 5961 farin.henner at mh-hannover.de From GHansen at lexgen.com Fri Jan 5 07:52:16 2007 From: GHansen at lexgen.com (Hansen, Gwenn) Date: Fri, 5 Jan 2007 09:52:16 -0600 Subject: [Genome] Table browser query Message-ID: <1EC36F3E1AAA664681D226582FF6F0810149B494@wdexchmb01.lexicon.lexgen.com> Is there a way that I can use the Table Browser to obtain a dataset showing all mouse known genes/Refseq/mRNAs that intersect with gene trap sequence alignments? (the IGTC track) Ideally I would like to output a table with these fields: gene name, gene accession, entrez id, chr, IGTC gene trap sequence accession, IGTC alignment position start/end.
Thanks, Gwenn The contents of this communication, including any attachments, may be confidential, privileged or otherwise protected from disclosure. They are intended solely for the use of the individual or entity to whom they are addressed. If you are not the intended recipient, please do not read, copy, use or disclose the contents of this communication. Please notify the sender immediately and delete the communication in its entirety. From ann at soe.ucsc.edu Fri Jan 5 08:57:40 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 05 Jan 2007 08:57:40 -0800 Subject: [Genome] "Motifs to highlight" in "Base Position" track doesn't work In-Reply-To: References: Message-ID: <459E8384.6040504@soe.ucsc.edu> Hello Henner, I have not been able to reproduce a problem with the "Motifs to highlight" feature of the Base Position track. When I choose a motif, say ATG, and view the track at the base level, I see every instance of that motif highlighted (black background, white letter). Could you please describe to me the steps that you are taking, as well as the browser and o/s you are using? I have tested using Internet Explorer 6 and Firefox 1.5. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Henner Farin wrote: > Hello, > > I'd like to inform you about the fact that the option > "Motifs to highlight:" in the track "Base Position" does not work anymore. > It would be really great if you could make this tool working again. > > Thanks (and no need to answer), > Henner > > > > ---------------------------------------------------------------------- > Institut für Molekularbiologie > Medizinische Hochschule Hannover > Raum 2580 > Carl-Neuberg-Str.
1 > D-30625 Hannover > Germany > > Tel.: (49) 511 532 5961 > farin.henner at mh-hannover.de > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From rmetlap at chg.duhs.duke.edu Fri Jan 5 10:03:37 2007 From: rmetlap at chg.duhs.duke.edu (Ravikanth Metlapally) Date: Fri, 5 Jan 2007 13:03:37 -0500 Subject: [Genome] Annotation Display Problems Message-ID: <003101c730f3$d4545030$c5fd1098@DNA344> Hi there, I have been trying to upload a GFF file to the Genome Browser and I keep getting this error message: 'Expecting number field 5 line 2 of ../trash/ct/ct_www_1f56_e92050.bed, got .' It will be great if anyone can help me out with this. Best Ravi From renaudg at mail.nih.gov Fri Jan 5 11:42:45 2007 From: renaudg at mail.nih.gov (Gabriel Renaud) Date: Fri, 05 Jan 2007 14:42:45 -0500 Subject: [Genome] buffer overflow Message-ID: <459EAA35.7070103@mail.nih.gov> Hi, I saw your reply (http://www.cse.ucsc.edu/pipermail/genome/2006-September/011584.html) regarding the error I am getting: */Error/ buffer overflow, size 32, format: %s:%d-%d* Were you able to fix it? The track I am using is for Zebrafish 2006, with: ----------------------------------------------------- browser position chrNA_random browser hide all browser pack refGene browser dense ruler browser dense gap track name=single_0mm description="track" color=0,0,0 chrNA_random 141876165 141876182 >seq1 chrNA_random 38278694 38278711 >seq2 chrNA_random 92441292 92441309 >seq3 chrNA_random 87518410 87518427 >seq4 ----------------------------------------------------- Thank you!
-- Gabriel Renaud renaudg at mail.nih.gov 301-451-6746 From ann at soe.ucsc.edu Fri Jan 5 14:55:04 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 05 Jan 2007 14:55:04 -0800 Subject: [Genome] Annotation Display Problems In-Reply-To: References: Message-ID: <459ED748.5070709@soe.ucsc.edu> Hello Ravi, Sorry you are having problems with the GFF Custom Tracks. If you send me (no need to cc the entire list) a sample of what you're trying to upload, I can try to help you troubleshoot. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu > Subject: > Annotation Display Problems > From: > "Ravikanth Metlapally" > Date: > Fri, 5 Jan 2007 13:03:37 -0500 > To: > > > To: > > > > Hi there, > > I have been trying to upload a GFF file to the Genome Browser and I keep > getting this error message: > > 'Expecting number field 5 line 2 of ../trash/ct/ct_www_1f56_e92050.bed, > got .' > > It will be great if anyone can help me out with this. > > Best > Ravi From ann at soe.ucsc.edu Fri Jan 5 15:57:23 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 05 Jan 2007 15:57:23 -0800 Subject: [Genome] Table browser query In-Reply-To: <1EC36F3E1AAA664681D226582FF6F0810149B494@wdexchmb01.lexicon.lexgen.com> References: <1EC36F3E1AAA664681D226582FF6F0810149B494@wdexchmb01.lexicon.lexgen.com> Message-ID: <459EE5E3.4090500@soe.ucsc.edu> Hello Gwenn, It is not possible to use the UCSC Table Browser to do this intersection. However, there is a Hansen, Gwenn wrote: > Is there a way that I can use the Table Browser to obtain a dataset > showing all mouse known genes/Refseq/mRNAs that intersect with gene trap > sequence alignments? (the IGTC track) > > Ideally I would like to output a table with these fields: gene name, > gene accession, entrez id, chr, IGTC gene trap sequence accession, IGTC > alignment position start/end. 
> > Thanks, > > Gwenn > > > > The contents of this communication, including any attachments, may be confidential, privileged or otherwise protected from disclosure. They are intended solely for the use of the individual or entity to whom they are addressed. If you are not the intended recipient, please do not read, copy, use or disclose the contents of this communication. Please notify the sender immediately and delete the communication in its entirety. > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From ann at soe.ucsc.edu Fri Jan 5 16:02:39 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 05 Jan 2007 16:02:39 -0800 Subject: [Genome] Table browser query In-Reply-To: <459EE5E3.4090500@soe.ucsc.edu> References: <1EC36F3E1AAA664681D226582FF6F0810149B494@wdexchmb01.lexicon.lexgen.com> <459EE5E3.4090500@soe.ucsc.edu> Message-ID: <459EE71F.1010501@soe.ucsc.edu> Hello again, Gwenn, Sorry, my fingers got away from me there! I was about to tell you about a supplemental tool that is being developed by the folks at Penn State. The Galaxy server there provides access to all of the UCSC data. It is capable of more complicated joins and intersections. http://main.g2.bx.psu.edu/ Start with the "get data" link, then choose "UCSC Main Table Browser". You should be able to complete your query using this tool. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Ann Zweig wrote: > Hello Gwenn, > > It is not possible to use the UCSC Table Browser to do this > intersection. However, there is a > > > > > Hansen, Gwenn wrote: >> Is there a way that I can use the Table Browser to obtain a dataset >> showing all mouse known genes/Refseq/mRNAs that intersect with gene trap >> sequence alignments? 
(the IGTC track) >> >> Ideally I would like to output a table with these fields: gene name, >> gene accession, entrez id, chr, IGTC gene trap sequence accession, IGTC >> alignment position start/end. >> >> Thanks, >> >> Gwenn >> >> >> >> The contents of this communication, including any attachments, may be >> confidential, privileged or otherwise protected from disclosure. They >> are intended solely for the use of the individual or entity to whom >> they are addressed. If you are not the intended recipient, please do >> not read, copy, use or disclose the contents of this communication. >> Please notify the sender immediately and delete the communication in >> its entirety. >> _______________________________________________ >> Genome maillist - Genome at soe.ucsc.edu >> http://www.soe.ucsc.edu/mailman/listinfo/genome > From ann at soe.ucsc.edu Fri Jan 5 16:42:49 2007 From: ann at soe.ucsc.edu (Ann Zweig) Date: Fri, 05 Jan 2007 16:42:49 -0800 Subject: [Genome] Table Browser / Why are there LocusLink ID denoted genes on different strands/chromosomes? In-Reply-To: <68c7994b0701050023md659d40yc5dfdd0ebc0c4aab@mail.gmail.com> References: <68c7994b0701050023md659d40yc5dfdd0ebc0c4aab@mail.gmail.com> Message-ID: <459EF089.40103@soe.ucsc.edu> Hello Anton, You appear to be most of the way there, you just need a little help understanding the data. I will address each of your two questions. 1. Gene of different chromosomes: 55344 NM_018390,NM_018390, chrX,chrY, +,+, 132991,132991, 160020,160020, In this case, it looks like the gene appears on both the X and Y chromosomes; probably in the shared section. The Y chromosome in this assembly contains two pseudoautosomal regions (PARs). One of them is located at chrY:1-2692881. 2. Gene on different strands: 9084 NM_181880,NM_181880, chrY,chrY, -,+, 14535783,14606232, 14536519,14606968, It is not uncommon to see a copy of the gene on the opposite strand of the same chromosome. This gene happens to be in a region of inverted repeat. 
You can see that by viewing the Segmental Duplication track in this location. I hope this helps you understand the underlying data. Regards, ---------- Ann Zweig UCSC Genome Bioinformatics Group http://genome.ucsc.edu Anton Kratz wrote: > Hi, > > for a bioinformatics project I am working on I want to make a list of all > genes (*) in the human genome (hg17, May 2004, NCBI Build 35) with these > entries: > 1. some unique number identifying the gene > 2. strand > 3. chromosome > 4. start > 5. end > > As a first step I got a list through the Table Browser and it has these > entries: > > LocusLink ID, Known Gene ID, chromosome, strand, start, end > > (LocusLink is included b/c later I want to access via LocusLink not Known > Gene). > > This list has around 34000 entries (lines) in total and 17000 unique > LocusLink IDs, b/c many LocusLink IDs occur multiple times, and it looks > likes this: > > #hg17.knownToLocusLink.value hg17.knownGene.name hg17.knownGene.chrom > hg17.knownGene.strand hg17.knownGene.txStart hg17.knownGene.txEnd > > [...] > > 83259 NM_032971 chrY + 4911627 5016846 > 83259 NM_032972 chrY + 4967491 5016846 > 83259 NM_032973 chrY + 4967491 5653623 > > [...] > > This is almost what I want, b/c I can concatenate such multiple entries as > the example above to a new entry like this: > 83259 chrY + 4911627 5653623 > > And this would be my "gene". 4911627 b/c it is the minimum in this example > and 5653623 b/c it is the maximum. > > But I have difficulties understanding many of the entries the UCSC Known > Genes list I got through the Table Browser. F.e.: > > 1. Gene of different chromosomes: > 55344 NM_018390,NM_018390, chrX,chrY, +,+, 132991,132991, > 160020,160020, > > 2. Gene on different strands: > 9084 NM_181880,NM_181880, chrY,chrY, -,+, 14535783,14606232, > 14536519,14606968, > > Why are there LocusLink ID denoted genes on different strands/chromosomes? 
> > Best, > Anton > > > (*) when I use the term "gene", I do not mean gene in an "true" biological > sense, I do not want to distinguish between alternative splice variants etc. > For my program, a gene needs to be s/th which has a exactly one > startpostion, one enposition, and does only occur once in the genome. Yes I > admit that's an oversimplification. > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome From sdavis2 at mail.nih.gov Sat Jan 6 08:30:34 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sat, 06 Jan 2007 11:30:34 -0500 Subject: [Genome] Best sync time Message-ID: <459FCEAA.3070804@mail.nih.gov> Just a quick question--what is the best time (from your point of view) to do syncs (via rsync) of data from your server? Thanks, Sean From calhoujd at notes.udayton.edu Fri Jan 5 20:34:27 2007 From: calhoujd at notes.udayton.edu (calhoujd@notes.udayton.edu) Date: Fri, 5 Jan 2007 23:34:27 -0500 Subject: [Genome] Help with upstream .maf file Message-ID: To Whom it May Concern, I am an undergraduate Biology major from the University of Dayton trying to utilize your upstream alignment data for a senior thesis (Multiple alignments of 16 vertebrate genomes with Human), but I cannot open the .maf file in Microsoft Access. Enclosed is an image of the error message I keep getting. Any help on this matter would be greatly appreciated. Let me also say that the UCSC Genome Browser and affiliated data is an excellent set of public bioinformatics tool and I thank you for making it available for use. Take care, Jeff From kuhn at soe.ucsc.edu Sat Jan 6 17:17:14 2007 From: kuhn at soe.ucsc.edu (Robert Kuhn) Date: Sat, 6 Jan 2007 17:17:14 -0800 Subject: [Genome] Best sync time Message-ID: <200701070117.RAA15404@moondance.cse.ucsc.edu> Sean, Our peak usage hours are 9 AM EST to 5 PM PST. Most of our mirrors do their syncs during the night. weekends are light, too. 
best wishes, --b0b kuhn ucsc genome bioinformatics group > From genome-bounces at soe.ucsc.edu Sat Jan 6 08:31:55 2007 > To: "'genome at soe.ucsc.edu'" > Subject: [Genome] Best sync time > > Just a quick question--what is the best time (from your point of view) > to do syncs (via rsync) of data from your server? > > Thanks, > Sean > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From calhoujd at notes.udayton.edu Sat Jan 6 11:39:43 2007 From: calhoujd at notes.udayton.edu (calhoujd@notes.udayton.edu) Date: Sat, 6 Jan 2007 14:39:43 -0500 Subject: [Genome] More help with upstream .maf file Message-ID: To Whom it May Concern, Feel free to disregard my previous email seeking help. I figured out my previous issue, but have instead run into a new problem trying to use the upstream2000.maf file (Multiple alignments of 16 vertebrate genomes with Human). I have tried using at least five text editors including notepad, notepad++, wordpad, and the MEGA3.1 text editor to open the file with little success. I am assuming this is partially due to the size of the file (>600 MB). However, two of the text editors were able to begin to open the file, but stopped way short of the entire file. Enclosed is a screen shot of how far the text editor was able to get (not quite through the first alignment). Any help as to how I can open this file in a text editor would be greatly appreciated. Sincerely, Jeffrey Calhoun From triley at ias.edu Sat Jan 6 15:02:52 2007 From: triley at ias.edu (Todd Riley) Date: Sat, 06 Jan 2007 18:02:52 -0500 Subject: [Genome] Questions about indexing in the .fa files Message-ID: <45A02A9C.5090900@ias.edu> Hello, I have some questions about the indexing in the *.fa files, which unfortunately I could not find in the docs. First let me thank you kindly for your answers... 1. Is the first nucleotide in a *.fa file index 0 or 1? 2. 
Next I am a bit confused about indexes found in the known genes table compared to what queries return: Let's take a positive strand gene like NM_018234 in hg17, the known gene table gives this info: #name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID NM_018234 chr2 + 119697613 119739455 119719302 119737144 6 119697613,119713168,119719294,119721484,119728489,119736862, 119697694,119713236,119719794,119722012,119728654,119739455, Q86SF6_HUMAN R18909 However, when I query for exon 1 I get that exon 1 starts at index 119697614 not 119697613: >hg17_refGene_NM_018234_0 range=chr2:119697614-119697694 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none GAGGAGGAGCCTCGGGCCGAGCCACCGCCTTCGCCGCGGACCTTCAGCTG CCGCGGTCGCTCCGAGCGGCGGGCCGCAGAG So my next question is, are the exonStarts and exonEnds supposed to be inclusive or exclusive? It looks like the indexing scheme is exonRegion = (exonStart, exonEnd], is this correct? Also, is it different if the gene is on the negative strand? Thanks, Todd From yaelal at md.huji.ac.il Sun Jan 7 00:20:14 2007 From: yaelal at md.huji.ac.il (Yael Altuvia) Date: Sun, 07 Jan 2007 10:20:14 +0200 Subject: [Genome] The GNF tables Message-ID: <45A0AD3E.8080205@md.huji.ac.il> Hi, I have a few questions regarding the GNF tables. A. I tried to reconstruct the normalization procedure leading to the values of the gnfAtlas2 table from the raw data in gnfHumanAtlas2All (I used the hg17 and hg17 fixed) I conducted the following: 1. conversion between the indexing of the 158 two sample entries to the 79 entries For that I used the gnfHumanAtlas2AllExps and the gnfHumanAtlas2MedianExps I compared them by their names. Most were identical but two differed 1. pituitary v. pituitary gland 2. DRG v. dorsal root gangelion I assumed they are the same sample I also used this files to assign the group to each tissue (e.g. brain immune etc) 2. 
For each sample of the 158 samples in "all" I calculated the median of each group 3. I run into some problems trying to calculate the log ratio of each entry. According to the previously asked questions relating to this topic I calculated the log ratio of each entry by dividing it by the median of its group. but the values I got differed from the gnfAtlas2. I also tried to use instead the median of the group medians for the normalization? (this gives more similar values although they still seem to differ) (in both cases I calculated the median of the two samples after applying the procedure). What am I doing wrong? B. Regarding the gnfAtlas2Distance I would like to know a. what is the function used for the vector comparisons? b. what "_at" are related to each gene (all the exonic _at's?, a representative _at?) Thanks for your help Yael Altuvia 5. the distance what function between them and according to what assignment (the best the exons etc..) From taysk at novasprint.com Sun Jan 7 01:26:02 2007 From: taysk at novasprint.com (Tay Sen Kwan) Date: Sun, 7 Jan 2007 17:26:02 +0800 Subject: [Genome] BED formatted files for liftOver Message-ID: <000001c7323d$d9a39a00$4001a8c0@taysklaptop> Hello, I am trying to use the liftOver tool in batch mode using input files in BED format. This format has 3 mandatory fields and 9 optional fields. For the optional fields, I would like to know if the field is not used, what is the procedure ? i.e. I understand the fields need to be separated by 'space', do I have to put in a dummy values to maintain the total number of fields or I only have to put in values for fields that are actually used. For the latter, how does liftOver knows how to assign the values to the correct fields ? e.g. for the optional fields, I only need the name, strand, blockCount, blockSizes and BlockStarts - should a typical line be: chr1 1000 2000 hit1 + 0 2 30,200 1000,1100 I understand that the '0' before blockCount is required as the 'reserved' value. 
Or do I have to use: chr1 1000 2000 hit1 900 + 1000 2000 0 2 30,200 1000,1100 ie 900, 1000, 2000 are just dummy fillers. Regards, Sen Kwan From todd22 at caltech.edu Sun Jan 7 14:37:19 2007 From: todd22 at caltech.edu (Todd Anthony) Date: Sun, 7 Jan 2007 14:37:19 -0800 Subject: [Genome] question about BACs shown in genome browser Message-ID: <002401c732ac$63efeba0$37fad783@D9SMHB51> Hello, I have two questions that I would be grateful if you could answer: 1) I noticed that the BAC clones shown in the genome browser differ in the thickness of the lines at the ends; some have thin lines and others thick lines. Does this reflect some difference in the end sequences (e.g. repetitive seqs)? 2) The BAC clones also differ in the direction of the arrows, some pointing left and others right, even for overlapping clones. What does that mean? Thank you very much for your time and consideration, Yours Sincerely, Todd Anthony ___________________________ Todd E. Anthony, Ph.D. California Institute of Technology D. Anderson Laboratory 1200 E. 
California Boulevard Division of Biology 216-76 Pasadena, CA 91125-7600 Phone: (626) 395-6822 FAX: (626) 564-8243 From hinkle.kelly at gmail.com Sun Jan 7 15:11:09 2007 From: hinkle.kelly at gmail.com (Kelly Hinkle) Date: Sun, 7 Jan 2007 18:11:09 -0500 Subject: [Genome] wrong genomic sequence for mse FBXL2 Message-ID: <3a54f7190701071511q114f27aci7497290259f3484f@mail.gmail.com> Hello, I was using your sequence to find exon boundaries when I discovered that your mouse FBXL2 sequence( NM_178624) is actually the human sequence( NM_012157) If you could let me know if/when this gets resolved that would be great Thanks, Kelly Hinkle hinkle.kelly at gmail.com From hinkle.kelly at gmail.com Sun Jan 7 16:05:09 2007 From: hinkle.kelly at gmail.com (Kelly Hinkle) Date: Sun, 7 Jan 2007 19:05:09 -0500 Subject: [Genome] wrong genomic sequence for mse SRPK1 Message-ID: <3a54f7190701071605k72c4991fq9bde80dfc79c4202@mail.gmail.com> Hello, I was using your sequence to find exon boundaries when I discovered that your mouse SRPK1 sequence( NM_016795) is actually the human sequence( NM_003137)...I wrote to you before about another gene FBXL2 also being incorrect. Is this something I am doing incorrectly, because I am pretty sure i am doing it correctly....if you see the link which cleary says mouse and the number, then scroll down to genomic sequence, the seq is human...here is the link to the page i am using: http://genome.ucsc.edu/cgi-bin/hgc?hgsid=83939078&o=35910569&t=35996851& g=xenoRefGene&i=NM_016795&c=chr6&l=35 910569&r=35996851&db=hg18&pix=620 If you could let me know if/when this gets resolved that would be great Thanks, Kelly Hinkle hinkle.kelly at gmail.com From zhouq03 at mail.kiz.ac.cn Sun Jan 7 17:05:37 2007 From: zhouq03 at mail.kiz.ac.cn (ZhouQi) Date: Mon, 8 Jan 2007 09:05:37 +0800 Subject: [Genome] Question Message-ID: <20070108013940.3D5AC1410E@mail.kiz.ac.cn> Dear Sir: I used ucsc genome browser frequently these days, and I have two questions: 1. 
today I found I can do blat search, but I cannot open the "browser" link, it keep on saying "Unknown visibility de". And this only happen on my computer! 2. Sometimes when I change the position in genome browser, for example, Drosophila melanogaster, it will show me windows of Drosophila miranda. Plz help me to figure it out. Thanks a lot!          Best Wishes! Yours truely      ZhouQi Max-Planck Research Group I Kunming Institute of Zoology          zhouq03 at mail.kiz.ac.cn            2007-01-08 From Sandmann at embl.de Mon Jan 8 01:23:47 2007 From: Sandmann at embl.de (Thomas Sandmann) Date: Mon, 8 Jan 2007 10:23:47 +0100 Subject: [Genome] Branch length of phylogenetic tree Message-ID: <0B6BC37D-C50A-43C2-880B-7D705C21873A@embl.de> Dear UCSC browser team, I would like to ask a question about the "conservation track", more specifically about the one calculated and displayed for insects on http://genome.ucsc.edu/cgi-bin/hgTrackUi? hgsid=83959973&c=chr2L&g=multiz15way The phylogenetic relationships between the species are displayed as a tree image as well as the "formal hierarchy" at the bottom of the page. Do you by any chance have information about the branch length of this phylogenetic tree, too (e.g. as a result of phastcons evaluation of the multiple alignment) ? That would help me a lot ! 
Thanks, Thomas From archanat at soe.ucsc.edu Mon Jan 8 09:25:45 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Mon, 08 Jan 2007 09:25:45 -0800 Subject: [Genome] Questions about indexing in the .fa files In-Reply-To: <45A02A9C.5090900@ias.edu> References: <45A02A9C.5090900@ias.edu> Message-ID: <45A27E99.3040907@soe.ucsc.edu> Hello Todd, The reason you are seeing this difference in the start coordinate for the first exon is that the coordinates in a downloaded table use a zero-based start and a one-based end, which means you have to add 1 to the start coordinate to get the "real" starting coordinate as displayed in the graphical browser. The end coordinate is included in the range. To answer your questions: 1. If you use our programs to extract the sequences from the *.fa files, then the first nucleotide is indexed as 0. But the sequence itself is not indexed in the FASTA file. Also, the first position in a FASTA file of a chromosome sequence is 0 when it is referred to in a database table, but it is 1 in the Genome Browser display. 2. For every exon, the exon region in one-based browser coordinates = exonStart + 1 to exonEnd, both inclusive. In other words, the table values describe the half-open interval [exonStart, exonEnd) in zero-based coordinates. The same is applicable for genes on the negative strand. See our FAQ for a discussion about database/browser genomic coordinates. http://genome.ucsc.edu/FAQ/FAQtracks#tracks1 I hope that this is helpful to you. Please let us know if you have further questions. Regards, Archana UCSC Genome Bioinformatics Group Todd Riley wrote: > Hello, > > I have some questions about the indexing in the *.fa files, which > unfortunately I could not find in the docs. First let me thank you > kindly for your answers... > > 1. Is the first nucleotide in a *.fa file index 0 or 1? > > 2.
Next I am a bit confused about indexes found in the known genes table > compared to what queries return: > > Let's take a positive strand gene like NM_018234 in hg17, the known gene > table gives this info: > > #name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID > > NM_018234 chr2 + 119697613 119739455 119719302 119737144 6 119697613,119713168,119719294,119721484,119728489,119736862, 119697694,119713236,119719794,119722012,119728654,119739455, Q86SF6_HUMAN R18909 > > > However, when I query for exon 1 I get that exon 1 starts at index > 119697614 not 119697613: > > >> hg17_refGene_NM_018234_0 range=chr2:119697614-119697694 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none >> > GAGGAGGAGCCTCGGGCCGAGCCACCGCCTTCGCCGCGGACCTTCAGCTG > CCGCGGTCGCTCCGAGCGGCGGGCCGCAGAG > > > So my next question is, are the exonStarts and exonEnds supposed to be > inclusive or exclusive? It looks like the indexing scheme is exonRegion > = (exonStart, exonEnd], is this correct? Also, is it different if the > gene is on the negative strand? > > Thanks, > Todd > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From heather at soe.ucsc.edu Mon Jan 8 09:37:12 2007 From: heather at soe.ucsc.edu (Heather Trumbower) Date: Mon, 8 Jan 2007 09:37:12 -0800 (PST) Subject: [Genome] Question In-Reply-To: <20070108013940.3D5AC1410E@mail.kiz.ac.cn> References: <20070108013940.3D5AC1410E@mail.kiz.ac.cn> Message-ID: ZhouQi: It is possible that your preferences for our site contain a setting that isn't working. Would you be willing to reset your preferences by going to http://genome.ucsc.edu/cgi-bin/cartReset? You may find that this solves the problems you report. Heather Trumbower UCSC Genome Bioinformatics Group On Mon, 8 Jan 2007, ZhouQi wrote: > Dear Sir: > > I used ucsc genome browser frequently these days, and I have two questions: > > 1. 
today I found I can do blat search, but I cannot open the "browser" link, it keep on saying "Unknown visibility de". And this only happen on my computer! > > 2. Sometimes when I change the position in genome browser, for example, Drosophila melanogaster, it will show me windows of Drosophila miranda. > > Plz help me to figure it out. Thanks a lot! > > Best Wishes! > > Yours truely > ZhouQi > > Max-Planck Research Group I > Kunming Institute of Zoology > zhouq03 at mail.kiz.ac.cn > 2007-01-08 > > From kuhn at soe.ucsc.edu Mon Jan 8 09:40:05 2007 From: kuhn at soe.ucsc.edu (Robert Kuhn) Date: Mon, 8 Jan 2007 09:40:05 -0800 Subject: [Genome] Branch length of phylogenetic tree Message-ID: <200701081740.JAA07877@moondance.cse.ucsc.edu> dear Thomas, This is from our makedoc for the 15way Conservation: #TREE: (((((((((dm2:0.032478,(droSim1:0.017650,droSec1:0.015740):0.017736):0.026088,(droYak2:0.058116,droEre2:0.055952):0.031922):0.084953,droAna3:0.218988):0.051563,(dp4:0.013624,droPer1:0.015374):0.210705):0.046101,droWil1:0.292357):0.019351,((droVir3:0.109131,droMoj3:0.142623):0.047595,droGri2:0.154583):0.189999):0.113599,anoGam1:0.357817):0.092848,apiMel2:0.386309):0.169225,triCas2:0.169225); best wishes, --b0b kuhn ucsc genome bioinformatics group > From genome-bounces at soe.ucsc.edu Mon Jan 8 01:58:06 2007 > To: genome at soe.ucsc.edu > Subject: [Genome] Branch length of phylogenetic tree > > Dear UCSC browser team, > > I would like to ask a question about the "conservation track", more > specifically about the one calculated and displayed for insects on > > http://genome.ucsc.edu/cgi-bin/hgTrackUi? > hgsid=83959973&c=chr2L&g=multiz15way > > The phylogenetic relationships between the species are displayed as a > tree image as well as the "formal hierarchy" at the bottom of the page. > Do you by any chance have information about the branch length of this > phylogenetic tree, too (e.g.
as a result of phastcons evaluation of > the multiple alignment) ? > > That would help me a lot ! > Thanks, Thomas > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From heather at soe.ucsc.edu Mon Jan 8 09:41:05 2007 From: heather at soe.ucsc.edu (Heather Trumbower) Date: Mon, 8 Jan 2007 09:41:05 -0800 (PST) Subject: [Genome] BED formatted files for liftOver In-Reply-To: <000001c7323d$d9a39a00$4001a8c0@taysklaptop> References: <000001c7323d$d9a39a00$4001a8c0@taysklaptop> Message-ID: Sen: You need to supply some sort of default value for each column. Heather Trumbower UCSC Genome Bioinformatics Group On Sun, 7 Jan 2007, Tay Sen Kwan wrote: > Hello, > > I am trying to use the liftOver tool in batch mode using input files in BED > format. This format has 3 mandatory fields and 9 optional fields. For the > optional fields, I would like to know if the field is not used, what is the > procedure ? i.e. I understand the fields need to be separated by 'space', do > I have to put in a dummy values to maintain the total number of fields or I > only have to put in values for fields that are actually used. For the > latter, how does liftOver knows how to assign the values to the correct > fields ? > > e.g. for the optional fields, I only need the name, strand, blockCount, > blockSizes and BlockStarts - should a typical line be: > > chr1 1000 2000 hit1 + 0 2 > 30,200 1000,1100 > > I understand that the '0' before blockCount is required as the 'reserved' > value. Or do I have to use: > > chr1 1000 2000 hit1 900 + 1000 2000 > 0 2 30,200 1000,1100 > > ie 900, 1000, 2000 are just dummy fillers. 
> > > Regards, > > Sen Kwan > > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From heather at soe.ucsc.edu Mon Jan 8 09:47:48 2007 From: heather at soe.ucsc.edu (Heather Trumbower) Date: Mon, 8 Jan 2007 09:47:48 -0800 (PST) Subject: [Genome] wrong genomic sequence for mse SRPK1 In-Reply-To: <3a54f7190701071605k72c4991fq9bde80dfc79c4202@mail.gmail.com> References: <3a54f7190701071605k72c4991fq9bde80dfc79c4202@mail.gmail.com> Message-ID: Kelly: If you are using the genome browser for the human assembly with the "Other RefGene" track displayed, the genome sequence will in fact be the human sequence. You can find mouse SRPK1 at chr17:28,317,181-28,350,043 in the browser for the latest mouse assembly. I hope this helps, let us know if you have further questions. Heather Trumbower UCSC Genome Bioinformatics Group On Sun, 7 Jan 2007, Kelly Hinkle wrote: > Hello, > > I was using your sequence to find exon boundaries when I discovered that > your mouse SRPK1 sequence( NM_016795) is actually the human sequence( > NM_003137)...I wrote to you before about another gene FBXL2 also being > incorrect. 
Is this something I am doing incorrectly, because I am pretty > sure i am doing it correctly....if you see the link which cleary says mouse > and the number, then scroll down to genomic sequence, the seq is > human...here is the link to the page i am using: > > http://genome.ucsc.edu/cgi-bin/hgc?hgsid=83939078&o=35910569&t=35996851& > g=xenoRefGene&i=NM_016795&c=chr6&l=35 910569&r=35996851&db=hg18&pix=620 > > If you could let me know if/when this gets resolved that would be great > > Thanks, > Kelly Hinkle > hinkle.kelly at gmail.com > _______________________________________________ > Genome maillist - Genome at soe.ucsc.edu > http://www.soe.ucsc.edu/mailman/listinfo/genome > From yueming.ding at jax.org Mon Jan 8 09:03:47 2007 From: yueming.ding at jax.org (Yueming Ding) Date: Mon, 8 Jan 2007 12:03:47 -0500 Subject: [Genome] blat didn't work on 33mers Message-ID: <20070108120347099.00000002944@sable> Hi Jim, I tried to use Blat to align 33mers to the mouse genome, but it didn't work. I blatted 725000 33mers. After I ran pslReps, I only got 1031 lines. About half of the 1031 lines are mapped to the wrong chromosomes. Could you please tell me if 33mers are too short to run blat? What is the minimum sequence length for blat? Thanks. Yueming Ding The Jackson Laboratory 600 Main Street Bar Harbor, ME 04609 From kuhn at soe.ucsc.edu Mon Jan 8 10:39:44 2007 From: kuhn at soe.ucsc.edu (Robert Kuhn) Date: Mon, 8 Jan 2007 10:39:44 -0800 Subject: [Genome] Branch length of phylogenetic tree Message-ID: <200701081839.KAA08713@moondance.cse.ucsc.edu> PS, Thomas, You may not have found the phyloGif tree-drawing utility on our site. It will give you a graphical view of the tree with the branch lengths drawn to scale, with the lengths written into the image, if desired. See: http://genome.ucsc.edu/cgi-bin/phyloGif accessible via the "Utilities" link on the main page.
best, --b0b kuhn > Dear Thomas, > > This is from our makedoc for the 15way Conservation: > > #TREE: > (((((((((dm2:0.032478,(droSim1:0.017650,droSec1:0.015740):0.017736): > 0.026088,(droYak2:0.058116,droEre2:0.055952):0.031922):0.084953,droAna3: > 0.218988):0.051563,(dp4:0.013624,droPer1:0.015374):0.210705):0.046101,droWil1: > 0.292357):0.019351,((droVir3:0.109131,droMoj3:0.142623):0.047595,droGri2: > 0.154583):0.189999):0.113599,anoGam1:0.357817):0.092848,apiMel2:0.386309): > 0.169225,triCas2:0.169225); > > best wishes, > > --b0b kuhn > ucsc genome bioinformatics group > > > From genome-bounces at soe.ucsc.edu Mon Jan 8 01:58:06 2007 > > To: genome at soe.ucsc.edu > > Subject: [Genome] Branch length of phylogenetic tree > > > > Dear UCSC browser team, > > > > I would like to ask a question about the "conservation track", more > > specifically about the one calculated and displayed for insects on > > > > http://genome.ucsc.edu/cgi-bin/hgTrackUi? > > hgsid=83959973&c=chr2L&g=multiz15way > > > > The phylogenetic relationships between the species are displayed as a > > tree image as well as the "formal hierarchy" at the bottom of the page. > > Do you by any chance have information about the branch length of this > > phylogenetic tree, too (e.g. as a result of phastcons evaluation of > > the multiple alignment) ? > > > > That would help me a lot !
> > Thanks, Thomas > > > > _______________________________________________ > > Genome maillist - Genome at soe.ucsc.edu > > http://www.soe.ucsc.edu/mailman/listinfo/genome From bli1 at bcm.tmc.edu Mon Jan 8 11:58:09 2007 From: bli1 at bcm.tmc.edu (Bingshan Li) Date: Mon, 8 Jan 2007 13:58:09 -0600 Subject: [Genome] HUGO gene names to Build35 coordinates Message-ID: <20DADBBD-6D22-4AE1-8D08-7FD1099D9D99@bcm.tmc.edu> Hi, I am looking a table to map HUGO gene names to their coordinates on Build35. Is there such a table? If not, can I have multiple tables with sufficient information so that I can get the coordinates of HUGO gene names? Thanks a lot! From ekamberov at rubicongenomics.com Mon Jan 8 12:55:04 2007 From: ekamberov at rubicongenomics.com (Emmanuel Kamberov) Date: Mon, 8 Jan 2007 15:55:04 -0500 Subject: [Genome] BLAT Search Question Message-ID: <8672C6F68CCF1343A126510E4337CF6A019374@blitz64.rubigen.rubicongenomics.com> Hi, we are trying to create a custom annotation track from BLAT search output. We have a few hundred short genomic DNA sequences (80-150 bp long). How many sequences can be processed in one batch ? Is there a way to filter only the highest scorring match for each sequence when creating a PSL output file ? Thanks, Emmanuel Kamberov _____________________________________ Emmanuel Kamberov, Ph.D. Senior Research Scientist Rubicon Genomics, Inc. 4370 Varsity Drive, Ann Arbor, MI 48103 Office: 734-677-2894 Mobile: 734-274-1102 kamberov at rubicongenomics.com ______________________________________ From Anil.Jegga at cchmc.org Mon Jan 8 13:59:57 2007 From: Anil.Jegga at cchmc.org (Anil Jegga) Date: Mon, 08 Jan 2007 16:59:57 -0500 Subject: [Genome] download of multialignment regions Message-ID: Hi Is there a way by which I can download the multialignments (multiz-17species) for a given set of coordinates? I used the custom track option and then the table browser but couldn't figure out how to download the corresponding multialignment files. 
human coordinates: chr12 97562605 97562637 chr19 54149513 54149533 chr19 54149524 54149543 chr19 52426417 52426436 chr22 16619483 16619502 chr11 104411147 104411166 chr4 110838081 110838104 I also tried with Galaxy (http://main.g2.bx.psu.edu/), it does give me the alignments but the coordinates (for e.g. aligned mouse sequence coordinates) seem to be wrong. I made sure that I select the proper assembly to avoid the mismatching coordinates. Thanks Anil Anil Jegga Assistant Professor Department of Pediatrics and Division of Biomedical Informatics Cincinnati Children's Hospital Medical Center and University of Cincinnati Tel: (513)-636-0261 Fax: (513)-636-2056 http://anil.cchmc.org From fanhsu at soe.ucsc.edu Mon Jan 8 15:07:26 2007 From: fanhsu at soe.ucsc.edu (Fan Hsu) Date: Mon, 8 Jan 2007 15:07:26 -0800 Subject: [Genome] HUGO gene names to Build35 coordinates In-Reply-To: <20DADBBD-6D22-4AE1-8D08-7FD1099D9D99@bcm.tmc.edu> Message-ID: Hi Bingshan, I generated a table for someone not long ago, which you might be able to use. It is at: http://www.soe.ucsc.edu/~fanhsu/bigKgGo.tab This table is generated by the following query: 'select kgId, spId, geneSymbol, mm.target, term.acc, goaPart.aspect, term.name, k.description, kg.* from goaPart,term, hg18.kgXref k, hg18.knownGene kg, hg18.mmBlastTab mm where goaPart.dbObjectId = k.spId and goaPart.goId = term.acc and kg.name=k.kgId and kgId=mm.query' You could go to our public MySQL server and tailer the above query to get data to suit your need. Fan. -----Original Message----- From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On Behalf Of Bingshan Li Sent: Monday, January 08, 2007 11:58 AM To: genome at soe.ucsc.edu Subject: [Genome] HUGO gene names to Build35 coordinates Hi, I am looking a table to map HUGO gene names to their coordinates on Build35. Is there such a table? If not, can I have multiple tables with sufficient information so that I can get the coordinates of HUGO gene names? 
Thanks a lot!
_______________________________________________
Genome maillist - Genome at soe.ucsc.edu
http://www.soe.ucsc.edu/mailman/listinfo/genome

From archanat at soe.ucsc.edu Mon Jan 8 15:18:35 2007
From: archanat at soe.ucsc.edu (Archana Thakkapallayil)
Date: Mon, 08 Jan 2007 15:18:35 -0800
Subject: [Genome] question about BACs shown in genome browser
In-Reply-To: <002401c732ac$63efeba0$37fad783@D9SMHB51>
References: <002401c732ac$63efeba0$37fad783@D9SMHB51>
Message-ID: <45A2D14B.90202@soe.ucsc.edu>

Hello Todd,

1) The difference in thickness of the lines at the ends of the BAC clones reflects the size of the BAC end pairs mapped to the genome.

2) The direction of the arrows shows the strand for the BAC clone. The BAC clone is the sequence between (and including) the ends. If the forward BAC end maps to the + strand and the reverse end maps to the - strand, then the BAC clone is from the + strand. In this case the arrows look like this: >>>>. If the reverse end maps to the + strand and the forward end maps to the - strand, then the BAC clone maps to the - strand. In this case the arrows look like this: <<<<. In each case, one end maps to the + strand and the other to the - strand.

I hope this information is helpful to you. If you have further questions, please do not hesitate to contact us again.

Regards,
Archana
UCSC Genome Bioinformatics Group

Todd Anthony wrote:
> Hello,
>
> I have two questions that I would be grateful if you could answer:
>
> 1) I noticed that the BAC clones shown in the genome browser differ in the thickness of the lines at the ends; some have thin lines and others thick lines. Does this reflect some difference in the end sequences (e.g. repetitive seqs)?
>
> 2) The BAC clones also differ in the direction of the arrows, some pointing left and others right, even for overlapping clones. What does that mean?
>
> Thank you very much for your time and consideration,
> Yours Sincerely,
> Todd Anthony
>
> ___________________________
> Todd E. Anthony, Ph.D.
> California Institute of Technology
> D. Anderson Laboratory
> 1200 E. California Boulevard
> Division of Biology 216-76
> Pasadena, CA 91125-7600
> Phone: (626) 395-6822
> FAX: (626) 564-8243
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

From ann at soe.ucsc.edu Mon Jan 8 15:52:30 2007
From: ann at soe.ucsc.edu (Ann Zweig)
Date: Mon, 08 Jan 2007 15:52:30 -0800
Subject: [Genome] how the blocks in the chain are organized?
In-Reply-To:
References:
Message-ID: <45A2D93E.80100@cse.ucsc.edu>

Hello Xianjun,

The layout of the chain and chainLink tables is not intuitive. I will give you some details about their structure, and then you can let us know if this does not answer your question.

These specific chains were first created for the zebrafish assembly (danRer4) and swapped to the human assembly (hg18). As you noted, there are 148 rows in each of the two tables for values with chainId of 2107:

danRer4.chr20_chainHg18Link
hg18.chr14_chainDanRer4Link

Although you reported seeing only about 37 solid blocks, there are actually many more (84 to be exact). You can see them when you make your display quite wide and zoom in to a very detailed level. However, this still does not explain why there are 148 rows in the tables.

As you noted, there are several rows that are "neighbored in coordinates". Yes, you can combine those rows to make one block. In fact, when you do that on the hg18.chr14_chainDanRer4Link table, the new table has 84 rows (equal to the number of solid blocks you see in the browser).
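
The combining of neighboring rows described above can be sketched in Python. This is only an illustration under stated assumptions: `merge_adjacent_blocks` is a hypothetical helper name, the rows are assumed to be sorted by tStart for a single chainId, and only the target coordinates (tStart, tEnd) are considered, so qStart is ignored. The sample coordinates are the first six rows for chainId 2107 quoted in this message.

```python
def merge_adjacent_blocks(blocks):
    """Merge target-side blocks that abut (previous tEnd == next tStart).

    `blocks` is a list of (tStart, tEnd) tuples, assumed to be sorted by
    tStart and to belong to a single chainId.
    """
    merged = []
    for t_start, t_end in blocks:
        if merged and merged[-1][1] == t_start:
            # No gap on the target: extend the previous block.
            merged[-1] = (merged[-1][0], t_end)
        else:
            # A gap on the target starts a new solid block.
            merged.append((t_start, t_end))
    return merged


# (tStart, tEnd) of the first six chainLink rows for chainId 2107.
links = [
    (66595113, 66595258),
    (66625443, 66625547),
    (66625547, 66625588),
    (66625588, 66625601),
    (66637317, 66637387),
    (66646598, 66646732),
]

solid_blocks = merge_adjacent_blocks(links)
for t_start, t_end in solid_blocks:
    print(t_start, t_end)
```

Run on all rows of a chain, the length of the returned list would be the number of solid blocks on the target side.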
For example, from the original table:

mysql> select * from chr14_chainDanRer4Link where ChainId=2107 limit 6;
+------+-------+----------+----------+----------+---------+
| bin  | tName | tStart   | tEnd     | qStart   | chainId |
+------+-------+----------+----------+----------+---------+
| 1093 | chr14 | 66595113 | 66595258 | 62269888 |    2107 |
| 1093 | chr14 | 66625443 | 66625547 | 62270186 |    2107 |
| 1093 | chr14 | 66625547 | 66625588 | 62270303 |    2107 |
| 1093 | chr14 | 66625588 | 66625601 | 62270351 |    2107 |
| 1093 | chr14 | 66637317 | 66637387 | 62273299 |    2107 |
| 1093 | chr14 | 66646598 | 66646732 | 62273437 |    2107 |

After you combine the neighboring rows, the original 6 rows result in 4 rows:

+------+-------+----------+----------+----------+---------+
| bin  | tName | tStart   | tEnd     | qStart   | chainId |
+------+-------+----------+----------+----------+---------+
| 1093 | chr14 | 66595113 | 66595258 | 62269888 |    2107 |
| 1093 | chr14 | 66625443 | 66625601 | 62270351 |    2107 |
| 1093 | chr14 | 66637317 | 66637387 | 62273299 |    2107 |
| 1093 | chr14 | 66646598 | 66646732 | 62273437 |    2107 |

A chain is, by definition, a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain. Within a chain, target and query coords are monotonically non-decreasing. If there is any break (even of one base) in *either* assembly, there will be a new row in each of the two tables.

Be sure to let us know if this is not enough information for you to get started with these data.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

> Hi,
>
> I am using the chain and chainLink table to see how well one exon in
> Human is aligned with the sequence in zebrafish. For each chain in the
> chain table, there are many small blocks in the chainLink table. My
> question is: how to decide the split of the chain into so many small
> blocks?
In the chain track of browser, we could only see several blocks
> in each chain.
>
> For example, for human gene ENSG00000072415, I could see it's located in
> the chain of zebrafish, chr20:62269888-62413893, with chainID=2107 in
> chr14_chainDanRer4 table. In the full view of Zebrafish Chain Alignment
> for Human browser, I only could see about 37 solid blocks in the chain
> (see the chr14:66,569,841-66,995,158 region in Human browser). However,
> when I query the chr14_chainDanRer4Link table with chainID=2107, I could
> see 148 results for this chain. So, why not 37 results as I expected?
> How is this 148 blocks created? How can I get start:end for the expected
> 37 blocks? I could see that most of the 148 blocks are neighbored in
> coordinates, so, it will be 37 blocks if I merge the neighbored ones
> together, right?
>
> Thanks for answering.
>
> Xianjun

From hiram at soe.ucsc.edu Mon Jan 8 15:59:41 2007
From: hiram at soe.ucsc.edu (Hiram Clawson)
Date: Mon, 08 Jan 2007 15:59:41 -0800
Subject: [Genome] More help with upstream .maf file
In-Reply-To:
References:
Message-ID: <45A2DAED.8030600@soe.ucsc.edu>

Good Afternoon Jeffrey:

It is going to be tough to get this file into an editor. It is over 600 Mb and almost 450,000 lines. It will load into the vi editor on a Linux system, but that running vi process occupies over 1 Gb of memory as it works.

It would probably be more useful if you could think about working with this file programmatically rather than reading the file with an editor. If you can locate Windows command equivalents to typical Linux commands such as grep, awk, sed, and perl scripts, you would be able to parse out the bits of interest from this file for your use.

The kent source tree has a variety of useful commands that can work with maf files to transform them in various ways, although that would require something such as the Cygwin system to give you a Linux-like environment on top of Windows.
That is a lot of work though to get all of that up and running. You could probably make do with some Perl scripts to parse the file if you are into Perl programming.

--Hiram

calhoujd at notes.udayton.edu wrote:
> To Whom it May Concern,
>
> Feel free to disregard my previous email seeking help. I figured out my
> previous issue, but have instead run into a new problem trying to use the
> upstream2000.maf file (Multiple alignments of 16 vertebrate genomes with
> Human). I have tried using at least five text editors including notepad,
> notepad++, wordpad, and the MEGA3.1 text editor to open the file with
> little success. I am assuming this is partially due to the size of the
> file (>600 MB). However, two of the text editors were able to begin to
> open the file, but stopped way short of the entire file. Enclosed is a
> screen shot of how far the text editor was able to get (not quite through
> the first alignment). Any help as to how I can open this file in a text
> editor would be greatly appreciated.
>
> Sincerely,
>
> Jeffrey Calhoun

From zhangge.uc at gmail.com Mon Jan 8 14:54:43 2007
From: zhangge.uc at gmail.com (Ge Zhang)
Date: Mon, 8 Jan 2007 17:54:43 -0500
Subject: [Genome] question about custom tracks
Message-ID:

Dear Sir or Madam,

First, I would like to thank you for your efforts in creating and maintaining this excellent free resource for bioscientists. The custom track utility provides tremendous help to our research. I have several questions about how to use it:

1. How to create an item with different colors? For example, if I want to create an item (GFF group) to show a block with several SNP markers, the ideal way is to use a rectangle (in light gray) for the whole block and several vertical lines (in black, or in different colors showing their functional significance) for each SNP.
I tried to use GFF format to describe such a feature in the following way, but it seems to me that the Genome Browser will only display each feature with the color for the HIGHEST score (in our case, 1000 for each SNP marker).

browser position chr1:213742472-213941141
browser hide all
track name=Block description="SNP Blocks" visibility=1 color=0,128,0 useScore=1
chr1 BlockFinder Block 213753472 213759167 400 . . Block_1
chr1 BlockFinder SNP 213743472 213743472 1000 . . Block_1
chr1 BlockFinder SNP 213747205 213747206 1000 . . Block_1
chr1 BlockFinder SNP 213749167 213749168 1000 . . Block_1
chr1 BlockFinder Block 213762463 213780713 400 . . Block_2
chr1 BlockFinder SNP 213762463 213762464 1000 . . Block_2
chr1 BlockFinder SNP 213764076 213764077 1000 . . Block_2
chr1 BlockFinder SNP 213769724 213769725 1000 . . Block_2
chr1 BlockFinder SNP 213780713 213780714 1000 . . Block_2

2. Is there any way to override the maximum image width limit (5000 pixels)? 5000 pixels is not enough to display fine features across a long genomic region.

3. How to create a custom track showing pairwise relationships among genomic elements? For example, the pairwise LD similar to the HapMap LD track included in the hg17 database (http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=hapmapLd).

4. In order to lower the computational load on your site, we are thinking of building a stand-alone application that uses the power of the Genome Browser to generate high-quality graphic representations of genomic annotations. Therefore, the question is: can we use the specific code for drawing custom tracks from the Genome Browser and build a lightweight application instead of mirroring the whole Genome Browser on our server? If it is possible, what is the license requirement (for academic use only) and how do we locate that code?

Thank you for your help!
Best regards,

Ge Zhang
Center for Genome Information
University of Cincinnati Medical Center

From kuhn at soe.ucsc.edu Mon Jan 8 16:13:04 2007
From: kuhn at soe.ucsc.edu (Robert Kuhn)
Date: Mon, 8 Jan 2007 16:13:04 -0800
Subject: [Genome] Branch length of phylogenetic tree
Message-ID: <200701090013.QAA13376@moondance.cse.ucsc.edu>

Dear Thomas,

One of our engineers has provided me with a little more information about the insect 15way Conservation track. The tree I sent you earlier is the starting-tree.mod. After the estimation step, the trees are:

conserved:
(((((((((dm2:0.017975,(droSim1:0.012862,droSec1:0.012586):0.011123):0.012087,(droYak2:0.031478,droEre2:0.032714):0.015973):0.033568,droAna3:0.107755):0.019847,(dp4:0.008814,droPer1:0.009578):0.104435):0.016369,droWil1:0.150471):0.005457,((droVir3:0.056362,droMoj3:0.072079):0.020331,droGri2:0.083731):0.092042):0.049021,anoGam1:0.218175):0.046658,apiMel2:0.187335):0.106621,triCas2:0.106621);

nonconserved:
(((((((((dm2:0.059235,(droSim1:0.039799,droSec1:0.039025):0.035598):0.041349,(droYak2:0.104304,droEre2:0.106715):0.054017):0.119983,droAna3:0.377089):0.071962,(dp4:0.028079,droPer1:0.031175):0.369228):0.060569,droWil1:0.536488):0.019784,((droVir3:0.195542,droMoj3:0.254453):0.073077,droGri2:0.290773):0.336972):0.180643,anoGam1:0.783559):0.174661,apiMel2:0.670401):0.382291,triCas2:0.382291);

The multi-fly analysis wiki has links to other groups' estimates:

http://rana.lbl.gov/drosophila/wiki/index.php/Phylogeny

I'm sorry if I've caused you any confusion.
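
For readers who want to pull the branch lengths out of such a tree string, here is a minimal Python sketch. It is not part of the phastCons pipeline; the function names are hypothetical, and the regular expressions assume the simple Newick form used above (leaf names followed by `:length`, unnamed internal nodes). Whitespace is stripped first, so a tree that was line-wrapped in an email can be pasted as-is.

```python
import re


def strip_whitespace(newick):
    """Remove spaces and line breaks introduced by email wrapping."""
    return "".join(newick.split())


def leaf_branch_lengths(newick):
    """Map each leaf (species) name to the length of its terminal branch."""
    tree = strip_whitespace(newick)
    pairs = re.findall(r"([A-Za-z]\w*):([0-9.]+)", tree)
    return {name: float(length) for name, length in pairs}


def total_tree_length(newick):
    """Sum every branch length in the tree, internal branches included."""
    tree = strip_whitespace(newick)
    return sum(float(length) for length in re.findall(r":([0-9.]+)", tree))


# The "conserved" tree from this message, wrapped across lines.
conserved = """
(((((((((dm2:0.017975,(droSim1:0.012862,droSec1:0.012586):0.011123):
0.012087,(droYak2:0.031478,droEre2:0.032714):0.015973):0.033568,droAna3:
0.107755):0.019847,(dp4:0.008814,droPer1:0.009578):0.104435):0.016369,droWil1:
0.150471):0.005457,((droVir3:0.056362,droMoj3:0.072079):0.020331,droGri2:
0.083731):0.092042):0.049021,anoGam1:0.218175):0.046658,apiMel2:0.187335):
0.106621,triCas2:0.106621);
"""

leaves = leaf_branch_lengths(conserved)
for species, length in sorted(leaves.items()):
    print(species, length)
print("total tree length:", total_tree_length(conserved))
```

The same two functions can be applied to the nonconserved tree to compare the two models branch by branch.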
best wishes, --b0b kuhn > Dear Thomas, > > This is from our makedoc for the 15way Conservation: > > #TREE: > (((((((((dm2:0.032478,(droSim1:0.017650,droSec1:0.015740):0.017736):0.026088,(droYak2:0.058116,droEre2:0.055952):0.031922):0.084953,droAna3:0.218988):0.051563,(dp4:0.013624,droPer1:0.015374):0.210705):0.046101,droWil1:0.292357):0.019351,((droVir3:0.109131,droMoj3:0.142623):0.047595,droGri2:0.154583):0.189999):0.113599,anoGam1:0.357817):0.092848,apiMel2:0.386309):0.169225,triCas2:0.169225); > > best wishes, > > --b0b kuhn > ucsc genome bioinformatics group > > > From genome-bounces at soe.ucsc.edu Mon Jan 8 01:58:06 2007 > > To: genome at soe.ucsc.edu > > Subject: [Genome] Branch length of phylogenetic tree > > > > Dear UCSC browser team, > > > > I would like to ask a question about the "conservation track", more > > specifically about the one calculated and displayed for insects on > > > > http://genome.ucsc.edu/cgi-bin/hgTrackUi? > > hgsid=83959973&c=chr2L&g=multiz15way > > > > The phylogenetic relationships between the species are displayed as a > > tree image as well as the "formal hierarchy" at the bottom of the page. > > Do you by any chance have information about the branch length of this > > phylogenetic tree, too (e.g. as a result of phastcons evaluation of > > the multiple alignment) ? > > > > That would help me a lot ! > > Thanks, Thomas > > > > > > > > > > That's starting-tree.mod -- the trees I send are from further down, > after the estimation step (search for the next instance of "FIRST > ITERATION"). > > > On Mon, 8 Jan 2007, Robert Kuhn wrote: > > > > > > > angie, > > > > I just looked more closely at you values here, and I see they do not > > match what I got out of the dm2 makedoc. 
> > > > > #TREE: > > > (((((((((dm2:0.032478,(droSim1:0.017650,droSec1:0.015740):0.017736): > > > 0.026088,(droYak2:0.058116,droEre2:0.055952):0.031922):0.084953,droAna3: > > > 0.218988):0.051563,(dp4:0.013624,droPer1:0.015374):0.210705):0.046101,droWil1: > > > 0.292357):0.019351,((droVir3:0.109131,droMoj3:0.142623):0.047595,droGri2: > > > 0.154583):0.189999):0.113599,anoGam1:0.357817):0.092848,apiMel2:0.386309): > > > 0.169225,triCas2:0.169225); > > > > > From daemon at services.cse.ucsc.edu Mon Jan 8 10:31:31 2007 > > > To: browser-qa at soe.ucsc.edu > > > Subject: Fwd: [Genome] Branch length of phylogenetic tree > > > > > > These are the branch lengths (I think in units of substitutions per > > > site) estimated by phastCons for conserved and nonconserved sequence: > > > > > > conserved: > > > (((((((((dm2:0.017975,(droSim1:0.012862,droSec1:0.012586):0.011123):0.012087,(droYak2:0.031478,droEre2:0.032714):0.015973):0.033568,droAna3:0.107755):0.019847,(dp4:0.008814,droPer1:0.009578):0.104435):0.016369,droWil1:0.150471):0.005457,((droVir3:0.056362,droMoj3:0.072079):0.020331,droGri2:0.083731):0.092042):0.049021,anoGam1:0.218175):0.046658,apiMel2:0.187335):0.106621,triCas2:0.106621); > > > > > > nonconserved: > > > (((((((((dm2:0.059235,(droSim1:0.039799,droSec1:0.039025):0.035598):0.041349,(droYak2:0.104304,droEre2:0.106715):0.054017):0.119983,droAna3:0.377089):0.071962,(dp4:0.028079,droPer1:0.031175):0.369228):0.060569,droWil1:0.536488):0.019784,((droVir3:0.195542,droMoj3:0.254453):0.073077,droGri2:0.290773):0.336972):0.180643,anoGam1:0.783559):0.174661,apiMel2:0.670401):0.382291,triCas2:0.382291); > > > > > > The multi-fly analysis wiki has links to other group's estimates: > > > > > > http://rana.lbl.gov/drosophila/wiki/index.php/Phylogeny > > > > > > > > > ---------- Forwarded message ---------- > > > Date: Mon, 8 Jan 2007 10:23:47 +0100 > > > From: Thomas Sandmann > > > To: genome at soe.ucsc.edu > > > Subject: [Genome] Branch length of phylogenetic tree > > > 
> > > Dear UCSC browser team, > > > > > > I would like to ask a question about the "conservation track", more > > > specifically about the one calculated and displayed for insects on > > > > > > http://genome.ucsc.edu/cgi-bin/hgTrackUi? > > > hgsid=83959973&c=chr2L&g=multiz15way > > > > > > The phylogenetic relationships between the species are displayed as a > > > tree image as well as the "formal hierarchy" at the bottom of the page. > > > Do you by any chance have information about the branch length of this > > > phylogenetic tree, too (e.g. as a result of phastcons evaluation of > > > the multiple alignment) ? > > > > > > That would help me a lot ! > > > Thanks, Thomas > > > > > > _______________________________________________ > > > Genome maillist - Genome at soe.ucsc.edu > > > http://www.soe.ucsc.edu/mailman/listinfo/genome > > > > > > > -- > angie at soe.ucsc.edu > Software Developer, UCSC CBSE / Genome Bioinformatics Group > From archanat at soe.ucsc.edu Mon Jan 8 16:47:12 2007 From: archanat at soe.ucsc.edu (Archana Thakkapallayil) Date: Mon, 08 Jan 2007 16:47:12 -0800 Subject: [Genome] BLAT Search Question In-Reply-To: <8672C6F68CCF1343A126510E4337CF6A019374@blitz64.rubigen.rubicongenomics.com> References: <8672C6F68CCF1343A126510E4337CF6A019374@blitz64.rubigen.rubicongenomics.com> Message-ID: <45A2E610.2040906@soe.ucsc.edu> Hello Emmanuel, Our web-based BLAT could process up to 25 sequences in one batch. If you would like to process more sequences, you should download the stand-alone BLAT. See this answer to a previous mail list question that gives more information on running batch query: http://www.soe.ucsc.edu/pipermail/genome/2006-December/012454.html Also, it is not possible to filter out the best hit when using the online BLAT. 
Please see this previously answered mail list question that gives information on filtering out the best hit from the output using the stand-alone BLAT:

http://www.cse.ucsc.edu/pipermail/genome/2006-March/009918.html

I hope that this helps you. Please let us know if you have further questions.

Regards,
Archana
UCSC Genome Bioinformatics Group

Emmanuel Kamberov wrote:
> Hi, we are trying to create a custom annotation track from BLAT search
> output. We have a few hundred short genomic DNA sequences (80-150 bp
> long). How many sequences can be processed in one batch ? Is there a
> way to filter only the highest scoring match for each sequence when
> creating a PSL output file ?
>
> Thanks,
>
> Emmanuel Kamberov
>
> _____________________________________
>
> Emmanuel Kamberov, Ph.D.
> Senior Research Scientist
> Rubicon Genomics, Inc.
> 4370 Varsity Drive,
> Ann Arbor, MI 48103
> Office: 734-677-2894
> Mobile: 734-274-1102
> kamberov at rubicongenomics.com
> ______________________________________
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

From archanat at soe.ucsc.edu Tue Jan 9 10:44:17 2007
From: archanat at soe.ucsc.edu (Archana Thakkapallayil)
Date: Tue, 09 Jan 2007 10:44:17 -0800
Subject: [Genome] download of multialignment regions
In-Reply-To:
References:
Message-ID: <45A3E281.90405@soe.ucsc.edu>

Hello Anil,

This task can be accomplished using the Table Browser. It will involve first creating a custom track with your coordinates. If you follow these steps, you should be able to get what you need:

1. Create a BED3 custom track with your set of coordinates.

2. Then make the following selections in the Table Browser:

Set the clade, genome and assembly to the organism you are interested in
group: Comparative Genomics
track: Conservation
table: multiz17way
region: genome

Then you can create an intersection of this track with your custom track.
To do this, press the "intersection: create" button and then select "Custom Tracks" as group and hit "submit" button.

Back on the Table Browser, choose "output format: MAF - multiple alignment format" and hit "get output". (This takes a long time, probably because the multiz table is so big.)

I hope that this helps you. Please let us know if you have further questions.

Regards,
Archana
UCSC Genome Bioinformatics Group

Anil Jegga wrote:
> Hi
>
> Is there a way by which I can download the multialignments
> (multiz-17species) for a given set of coordinates? I used the custom
> track option and then the table browser but couldn't figure out how to
> download the corresponding multialignment files.
>
> human coordinates:
> chr12 97562605 97562637
> chr19 54149513 54149533
> chr19 54149524 54149543
> chr19 52426417 52426436
> chr22 16619483 16619502
> chr11 104411147 104411166
> chr4 110838081 110838104
>
> I also tried with Galaxy (http://main.g2.bx.psu.edu/), it does give me
> the alignments but the coordinates (for e.g. aligned mouse sequence
> coordinates) seem to be wrong. I made sure that I select the proper
> assembly to avoid the mismatching coordinates.
>
> Thanks
> Anil
>
> Anil Jegga
> Assistant Professor
> Department of Pediatrics and Division of Biomedical Informatics
> Cincinnati Children's Hospital Medical Center and University of
> Cincinnati
> Tel: (513)-636-0261
> Fax: (513)-636-2056
> http://anil.cchmc.org
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

From kent at soe.ucsc.edu Tue Jan 9 12:27:22 2007
From: kent at soe.ucsc.edu (Jim Kent)
Date: Tue, 9 Jan 2007 12:27:22 -0800
Subject: [Genome] blat didn't work on 33mers
In-Reply-To: <20070108120347099.00000002944@sable>
References: <20070108120347099.00000002944@sable>
Message-ID: <5BAC4538-D389-4D74-B7E0-1A76843125BD@soe.ucsc.edu>

Hi - you probably want to use these blat parameters:

-stepSize=5 -minScore=0

33mers are on the short side for accurate mapping though. Are these mouse sequences? From the same strain as the genome?

On Jan 8, 2007, at 9:03 AM, Yueming Ding wrote:

> HI Jim,
>
> I tried to use Blat to align 33mers to mouse genome. But it didn't
> work. I blatted 725000 33mers. After I ran pslReps, I only got 1031
> lines. About half of the 1031 lines are mapped to wrong
> chromosomes. Could you please tell me if 33mers is too short to run
> blat? What is the minimum sequence length for blat? Thanks.
>
> Yueming Ding
>
> The Jackson Laboratory
>
> 600 Main Street
>
> Bar Harbor, ME 04609
>

From phoenix.kwan at thermofisher.com Tue Jan 9 13:53:08 2007
From: phoenix.kwan at thermofisher.com (Phoenix Kwan)
Date: Tue, 9 Jan 2007 14:53:08 -0700
Subject: [Genome] Accession in RefGene table
Message-ID:

Dear sir or madam,

The accessions of each mRNA in the RefGene table don't have the accession versions with them, such as NM_015658.2. Is there any way that we can find out the versioning of each mRNA?
Thanks for your time,

Phoenix Kwan

From kayla at soe.ucsc.edu Tue Jan 9 14:38:11 2007
From: kayla at soe.ucsc.edu (Kayla Smith)
Date: Tue, 09 Jan 2007 14:38:11 -0800
Subject: [Genome] Accession in RefGene table
In-Reply-To:
References:
Message-ID: <45A41953.3030806@cse.ucsc.edu>

Phoenix,

The version of each refSeq is found in the version field of the gbCdnaInfo table. You could generate this information from the Table Browser ("Tables" on the blue bar on the top of the main page) as follows for hg18:

Set the following options:

clade: Vertebrate
genome: Human
assembly: Mar. 2006
group: Genes and Gene Prediction Tracks
track: RefSeq Genes
table: gbCdnaInfo
output format: selected fields from primary and related tables

click "get output"
on the next page check "acc" and "version"
click on "get output" again.

I hope this is helpful to you. Please don't hesitate to contact us again if you require further assistance.

Kayla Smith
UCSC Genome Bioinformatics Group

Phoenix Kwan wrote:
> Dear sir or madam,
>
> The accessions of each mRNA in the RefGene table don't have the
> accession versions with them, such as NM_015658.2. Is there any way that
> we can find out the versioning of each mRNA?
>
> Thanks for your time,
>
> Phoenix Kwan
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

From triley at ias.edu Tue Jan 9 14:50:55 2007
From: triley at ias.edu (Todd Riley)
Date: Tue, 09 Jan 2007 17:50:55 -0500
Subject: [Genome] Question about tracks and duplicate entries
Message-ID: <45A41C4F.2060406@ias.edu>

Hello again,

Thank you for your answers, they were very helpful. If you don't mind, I now have a new question for you.

How are the tracks "Known Genes", "RefSeq Genes", and "Other RefSeq" related to each other? I noticed that there are many RefSeq Genes in the KnownGenes table (which makes sense).
However, I also noticed that the RefSeq entries in the KnownGene track do not always correspond to the entries in the RefSeq table!! Here is an example from hg17: Known Genes track: NM_004869 chr18 - 59207406 59240673 59209187