[Genome] Entrez gene ID to refseq ID mapping

Rachel Harte hartera at soe.ucsc.edu
Thu May 24 09:22:12 PDT 2007


Guoliang,

The refFlat table contains all the alignments for the RefSeq mRNAs
(excluding the predicted ones whose IDs begin with "XM_"). An mRNA may
align to the genome more than once and still pass the criteria used to
filter Blat alignments for RefSeqs:
"RefSeq mRNAs were aligned against the human genome using blat; those with
an alignment of less than 15% were discarded. When a single mRNA aligned
in multiple places, the alignment having the highest base identity was
identified. Only alignments having a base identity level within 0.1% of
the best and at least 96% base identity with the genomic sequence were
kept."
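
The filter quoted above can be sketched roughly as follows. This is only an
illustration, not the actual pipeline code: the Alignment record, its field
names, and the reading of "an alignment of less than 15%" as the percentage
of the mRNA that aligned are all assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    acc: str         # RefSeq mRNA accession
    chrom: str       # where the alignment landed
    coverage: float  # assumed meaning: percent of the mRNA that aligned
    identity: float  # percent base identity with the genomic sequence


def filter_alignments(alignments, min_coverage=15.0,
                      near_best=0.1, min_identity=96.0):
    """Keep alignments that pass the three criteria quoted above."""
    by_acc = defaultdict(list)
    for aln in alignments:
        # Discard alignments below the 15% threshold.
        if aln.coverage >= min_coverage:
            by_acc[aln.acc].append(aln)

    kept = []
    for alns in by_acc.values():
        # Highest base identity among the remaining alignments of this mRNA.
        best = max(a.identity for a in alns)
        for a in alns:
            # Within 0.1% of the best and at least 96% identity.
            if a.identity >= best - near_best and a.identity >= min_identity:
                kept.append(a)
    return kept
```

Because every alignment within 0.1% of the best is kept, a single accession
can end up with more than one row, which is why the row count and the
unique-accession count can differ.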

There are 25271 rows now in refFlat for hg18 but if you look at unique
names (accessions) then there are 24837. There are 215202 rows in refLink
because it contains the data for all RefSeqs from human and also from
other organisms if the assembly has an "Other RefSeq" track (xenoRefFlat,
xenoRefGene, xenoRefSeqAli tables). The table that you downloaded from NCBI
has many more rows for human RefSeqs than there are RefSeqs in our
refFlat table because it includes the predicted RefSeqs (XM_ accessions),
which we do not consider reliable and therefore do not align for the
Genome Browser.
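
As a rough sketch of how the two tables fit together, the counts above and
the Entrez Gene ID mapping could be reproduced along these lines once the
tables are downloaded. The function names are hypothetical, and the input
shapes (a list of refFlat accessions, and (accession, Entrez Gene ID) pairs
taken from refLink) are assumptions about how you load the files, not a
description of the table layout.

```python
from collections import defaultdict


def count_rows_and_unique(accessions):
    """Return (number of rows, number of unique accessions)."""
    accessions = list(accessions)
    return len(accessions), len(set(accessions))


def entrez_to_refseq(reflink_pairs, refflat_accessions):
    """Map Entrez Gene ID -> sorted RefSeq accessions present in refFlat.

    reflink_pairs: iterable of (accession, entrez_gene_id) pairs, assumed
    to have been extracted from refLink by the caller.
    """
    aligned = set(refflat_accessions)
    mapping = defaultdict(set)
    for acc, gene_id in reflink_pairs:
        # refLink also covers RefSeqs that are not aligned in refFlat
        # (other organisms, for example), so keep only the aligned ones.
        if acc in aligned:
            mapping[gene_id].add(acc)
    return {gene_id: sorted(accs) for gene_id, accs in mapping.items()}
```

Restricting refLink to the accessions found in refFlat is what brings its
much larger row count down to the set of RefSeqs aligned on the assembly.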

I hope that this helps you.

Rachel

We invite you to give us your feedback on the UCSC Genome Browser through
May 31, 2007: http://www.surveymonkey.com/s.asp?U=881163743177

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu


On Wed, 23 May 2007, Guoliang Xing wrote:

> Hi Rachel,
>
>   Thank you for your reply. This is what I want and it's very helpful.
>
>   I am a little bit surprised to find that, for the same human Mar 2006
> assembly, Genes and Gene Prediction Tracks group -> RefSeq Genes ->
> refFlat table returns 25407 rows, but the refLink option returns 203390
> rows, almost 10 times more.
>
>   On the other hand, the gene2refseq.gz file I downloaded from
> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ , after filtering to keep only human
> genes (Taxonomy ID 9606), produces 145360 rows (many of the gene
> IDs are duplicates here).
>
>   I guess the genes downloaded from UCSC table browser using the refFlat
> option is a more reasonable snapshot of all coding genes on HG18.
>
>   My key task is to figure out the genome coordinates of each gene and its
> exons, then link this geneID to existing pathways, and/or SNPs associated
> with it.
>
>   If my guess is not good, please let me know. I understand that
> gene annotation is not simple and there will be some inconsistencies.
>
>   Thanks,
>
>   Guoliang
>
> On Wed, 23 May 2007, Rachel Harte wrote:
>
> > Hello Guoliang,
> >
> > There is a table called refLink. You can download through the Table
> > Browser (click on the "Tables" link on the top blue menu bar). Select the
> > assembly of interest and the "Genes and Gene Predictions" group and the
> > "RefSeq Genes" track. Then you can select the refLink table. The Entrez
> > Gene ID is in the locuslinkId column.
> > Alternatively, go to the Downloads server:
> > http://hgdownload.cse.ucsc.edu/downloads.html
> >
> > Once you have found the organism and assembly of interest, then click on
> > the "Annotation database" link and there you can download the contents of
> > any table in the database for that assembly.
> >
> > I hope that this helps you. For a rapid response in the future, please
> > direct questions to our mailing list at: genome at soe.ucsc.edu
> >
> > Thanks.
> >
> > Rachel
> >
> >
> > Rachel Harte
> > UCSC Genome Bioinformatics Group
> > http://genome.ucsc.edu
> >
> >
> > On Tue, 22 May 2007, Guoliang Xing wrote:
> >
> > > Hi Rachel,
> > >
> > >    Is there a way to download a mapping table from UCSC between Entrez
> > > GeneID and RefSeq ID?
> > >
> > >    I used the UCSC Table Browser, RefSeq Genes track, knownToRefSeq
> > > table, and downloaded the data. It has a name field, which is the
> > > transcript accession name, and a gene name field, but no GeneID.
> > >
> > >    I like all the info this table provides, but I also need the Entrez
> > > GeneID (for Human).
> > >
> > >    On the other hand, I parsed NCBI's gene2refseq table for human, but it
> > > has too many redundant lines for the same GeneID.
> > >
> > >
> > >
> > >    Your help will be appreciated.
> > >
> > >    Guoliang
> > >
> >
>

