[Genome] Question on gene length information retrieval

Brooke Rhead rhead at soe.ucsc.edu
Fri Aug 3 17:37:42 PDT 2007


Hi Kangyu,

Thanks for clarifying.  By "corresponding track" I just mean an
annotation track that contains the gene set you want to get the
information from.  Let me see if I can be more clear.  When you look in
the Genome Browser, under the section called "Genes and Gene Prediction
Tracks", there are several different gene annotation tracks, or gene
sets, available.  Some of them are "UCSC Genes", "CCDS", "RefSeq Genes",
"MGC Genes", "Ensembl Genes", etc.  They are all slightly different, and 
may give you different results for each gene.

It sounds like the RefSeq Genes set might be good for your purposes. 
The advantage of using that set is that we have it linked to the Entrez 
gene ID in our tables.  However, if you want more detailed information 
about proteins, or if you want to include non-coding genes in this set, 
you should consider using the UCSC Genes set instead.

I will describe how to get the information you seek for the RefSeq set, 
and if you decide to use a different set, you can do a similar query 
using it instead.

Also, once you get the data from the tables, you will need to do some 
calculations to get the actual lengths.  You can either do this using 
your own tools, or you may be interested in another website called 
Galaxy: http://main.g2.bx.psu.edu/

Galaxy is a site run by Penn State that works in conjunction with the 
UCSC Genome Browser.  They have tools that will perform calculations on 
columns in a table.  If you need help with Galaxy, their helpdesk email 
is galaxy-user at bx.psu.edu.


Most of the information you are looking for can be gleaned from two 
related tables: refGene and refLink.  To get to the tables, go to the 
Table Browser (the blue "tables" link at the top of the page), and make 
the following selections:

clade: vertebrate
genome: human
assembly: Mar. 2006
group: Genes and gene prediction tracks
track: RefSeq Genes
table: refGene
region: genome
output format: selected fields from primary and related tables

After hitting "get output", select the hg18.refLink table from the list 
on the bottom half of the page, so that you can get information from it 
in addition to the refGene table.  Scroll down to the bottom of the page 
and hit the "allow selection from checked tables" button.

Now select the fields in which you are interested from both of these 
tables.  From the refLink table you will need to select the 
'locusLinkId' field.  This is the Entrez ID (Entrez used to be called 
Locus Link).  You can also get the RefSeq protein name from this table 
(field 'protAcc').  You will need most of the information from the 
refGene table.
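
If you do the calculations with your own tools, the Table Browser output first has to be read into something you can work with. Here is a minimal sketch in Python. It assumes the output is tab-separated text whose first line is a header beginning with "#" and whose column names carry a database.table prefix (for example hg18.refGene.txStart). Please check this against your actual download. The function name read_table and the sample rows are made up for illustration.

```python
import csv

def read_table(text):
    """Parse tab-separated Table Browser output into a list of dicts.

    Assumption: the first line is a header that starts with '#', and each
    column name may be prefixed like 'hg18.refGene.'; only the final
    component (e.g. 'txStart') is kept as the dictionary key.
    """
    lines = text.splitlines()
    header = lines[0].lstrip("#").split("\t")
    names = [h.split(".")[-1] for h in header]
    reader = csv.reader(lines[1:], delimiter="\t")
    return [dict(zip(names, row)) for row in reader if row]

# Hypothetical sample, not real data from the tables.
sample = (
    "#hg18.refGene.name\thg18.refGene.txStart\thg18.refGene.txEnd\t"
    "hg18.refGene.exonCount\thg18.refLink.locusLinkId\n"
    "NM_EXAMPLE1\t1000\t5000\t3\t324\n"
    "NM_EXAMPLE2\t7000\t9500\t2\t10297\n"
)

rows = read_table(sample)
for row in rows:
    print(row["name"], row["locusLinkId"], row["exonCount"])
```

All values come back as strings, so convert the coordinate fields with int() before doing arithmetic on them.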

Now you can use the values retrieved from the tables to get the lengths 
you need.  Note that items in our tables have a 0-based start and a 
1-based end (see explanation here: 
http://genome.ucsc.edu/FAQ/FAQtracks#tracks1).

The length of the gene on the genome (including the introns) will be: 
txEnd - txStart.
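
As a small illustration of that subtraction (a sketch only; the function name and the coordinates are made up): because the start is 0-based and the end is 1-based, the difference is already the length, with no "+ 1" needed.

```python
def genomic_length(tx_start, tx_end):
    """Length of the transcribed region on the genome, introns included.

    The tables use a 0-based start and a 1-based end, so end - start is
    the number of bases covered.
    """
    if tx_end < tx_start:
        raise ValueError("txEnd must not be smaller than txStart")
    return tx_end - tx_start

# Hypothetical coordinates, not taken from a real gene.
print(genomic_length(1000, 5000))
```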

The number of exons is listed in the field 'exonCount'.

The total length of all the exons in a gene (corresponding to the mRNA 
length) will be the sum of each of the exonEnds - exonStarts.  Note that 
this includes both coding and non-coding exons.
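
In the refGene table, exonStarts and exonEnds are stored as comma-separated lists with a trailing comma, so they need to be split before summing. A minimal sketch in Python follows; the function names and the example values are made up for illustration.

```python
def parse_coords(field):
    """Turn a comma-separated field such as '100,300,600,' into a list of ints.

    Empty pieces (from the trailing comma) are skipped.
    """
    return [int(piece) for piece in field.split(",") if piece]

def total_exon_length(exon_starts, exon_ends):
    """Sum of (exonEnd - exonStart) over all exons of one transcript."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds have different lengths")
    return sum(end - start for start, end in zip(starts, ends))

# Hypothetical three-exon transcript: exons of 100, 100 and 150 bases.
print(total_exon_length("100,300,600,", "200,400,750,"))
```

The number of parsed starts should also match the exonCount field, which makes a handy sanity check.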

PROTEINS:
One way to get the approximate protein length from the data retrieved 
above would be to take the total exon length, subtract the non-coding 
regions from it, and divide the remaining coding length by 3 (three 
bases per amino acid).  The cdsStart and cdsEnd values indicate where 
the coding region begins and ends.  These values usually fall within an 
exon.
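
A rough sketch of this estimate in Python: clip each exon to the range from cdsStart to cdsEnd, add up what is left, and convert bases to amino acids. The function names and coordinates are made up. The sketch also assumes that the coding region includes the stop codon (hence the "- 1"), which you should verify for the gene set you use, so treat the result as approximate.

```python
def coding_length(exon_starts, exon_ends, cds_start, cds_end):
    """Number of exonic bases that fall between cdsStart and cdsEnd.

    exon_starts and exon_ends are lists of ints (0-based start, 1-based end).
    """
    total = 0
    for start, end in zip(exon_starts, exon_ends):
        overlap = min(end, cds_end) - max(start, cds_start)
        if overlap > 0:
            total += overlap
    return total

def approx_protein_length(coding_bases):
    """Approximate amino acid count: three bases per codon, minus one
    codon on the assumption that the stop codon is part of the CDS."""
    return max(coding_bases // 3 - 1, 0)

# Hypothetical transcript with three exons; coding region from 150 to 651.
starts = [100, 300, 600]
ends = [200, 400, 750]
bases = coding_length(starts, ends, 150, 651)
print(bases, approx_protein_length(bases))
```

A transcript whose cdsStart equals its cdsEnd has no coding overlap, so both functions return 0 for it.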

If you would rather get more precise protein information, the method 
will depend on which gene set you chose to use.

For the RefSeq Genes set:  we do not store more detailed RefSeq protein 
information in our tables, just the accession numbers (like the "NP_" 
numbers from hg18.refLink.protAcc).  So, if you wish to get more data 
for these accessions, you will need to go to NCBI for it.

For the UCSC Genes set:  the proteins associated with UCSC Genes are 
from UniProt.  The protein sequences are stored in one of our tables: 
uniprot.protein.val (database.table.field).  You can link these to 
the knownGene table via the hg18.kgXref table.

Sorry to be so long-winded!  I hope this information helps.

--
Brooke Rhead
UCSC Genome Bioinformatics Group



Kangyu Zhang wrote:
> Hi, Brooke, 
> 
>   Sorry for my unclearness. I need to retrieve the entire genome. 
> 
>   The NCBI gene ID I referred to is entrez ID, for example, 324 corresponds
> to genename APC, and 10297 is APC2; however, other IDs such as RefSeq genes
> are fine too. 
> 
>   Sorry I don't understand what is the meaning of "corresponding track"? But
> I guess the RefSeq Genes shall serve well as the identifier, though the
> Entrez ID is better. 
> 
> 
> Kangyu
> 
>   
> 
> -----Original Message-----
> From: Brooke Rhead [mailto:rhead at soe.ucsc.edu] 
> Sent: Thursday, August 02, 2007 2:44 PM
> To: Kangyu Zhang
> Cc: genome at soe.ucsc.edu
> Subject: Re: [Genome] Question on gene length information retrieval
> 
> Hi Kangyu,
> 
> Can you clarify a couple of things?  Do you have only a single gene to 
> look at, or a list of several genes?  Also, I'm not sure which NCBI gene 
> ID you are referring to.  Can you give an example?  Also, do you already 
> have a corresponding track in mind from which to retrieve the 
> information (like UCSC Genes, RefSeq Genes, etc.)?
> 
> Thanks. This will help us answer your questions.
> 
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
> 
> 
> Kangyu Zhang wrote:
>> Hello, 
>>
>>  
>>
>>   We are interested in retrieving the human gene length information from the
>> UCSC genome browser. In particular, for an NCBI gene id identifier (refseq
>> identifier NP_ or NM_ are good too, but that's not one to one mapping
>> because of alternative splicing), retrieve its protein sequence length, mRNA
>> length (total exon length), number of exon, and the total length of this
>> gene on the genome (including the intron length). 
>>
>>  
>>
>>   Could you help figure out the way to retrieve the above information via
>> the UCSC genome browser? 
>>
>>  
>>
>> Thanks, 
>>
>> Kangyu Zhang
>>
>>  
>>
>> Department of Bioinformatics
>>
>> Genentech, Inc.
>>
>> 1 DNA Way, MS 93
>>
>> South San Francisco, CA 94080
>>
>>  
>>
>> _______________________________________________
>> Genome maillist  -  Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome


