[Genome] Entrez gene ID to refseq ID mapping

Rachel Harte hartera at soe.ucsc.edu
Sat Jun 2 15:45:02 PDT 2007


Hello Guoliang,

I'm glad that was useful to you and that you are able to get your program
working now.

The reason that you are finding multiple UCSC Gene IDs for some of the
Entrez Gene IDs is that each Entrez Gene ID represents a gene and each
UCSC Gene ID represents a transcript. There can be several transcripts,
representing alternative splice forms, associated with each gene. You can
see this by searching for one of the UCSC Gene IDs in the Genome Browser.
If you click on the blue/gray button at the left side of the track, you can
select for the UCSC Gene IDs to be displayed and then you can see the
alternate splice form transcripts.
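
As an illustration only, the one-to-many relationship can be reproduced
with the ten rows from the query output quoted below. This sketch uses
Python's sqlite3 module with in-memory stand-ins for the knowntou133plus2
and knowntolocuslink tables (an assumption for portability; the original
setup is MySQL), and it omits the knowngene join because that table
contributes no columns to the result. The helper names are hypothetical.

```python
import sqlite3

# Sample rows copied from the query output quoted later in this thread:
# (UCSC Gene ID, probe set ID, Entrez Gene ID)
ROWS = [
    ("uc001aab.1", "225035_x_at", "375690"),
    ("uc001aac.1", "225035_x_at", "375690"),
    ("uc001aad.1", "225035_x_at", "375690"),
    ("uc001aae.1", "225035_x_at", "375690"),
    ("uc001aaf.1", "225035_x_at", "375690"),
    ("uc001aag.1", "225035_x_at", "375690"),
    ("uc001aah.1", "225035_x_at", "375690"),
    ("uc001aai.1", "226340_x_at", "375690"),
    ("uc001aaw.1", "234369_at", "729759"),
    ("uc001abd.1", "234369_at", "729759"),
]


def build_db(rows):
    """Load the sample rows into two in-memory tables keyed by UCSC Gene ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE knowntou133plus2 (name TEXT, value TEXT)")
    conn.execute("CREATE TABLE knowntolocuslink (name TEXT, value TEXT)")
    conn.executemany("INSERT INTO knowntou133plus2 VALUES (?, ?)",
                     [(ucsc, probe) for ucsc, probe, _ in rows])
    conn.executemany("INSERT INTO knowntolocuslink VALUES (?, ?)",
                     [(ucsc, entrez) for ucsc, _, entrez in rows])
    return conn


def transcripts_per_gene(conn):
    """Count UCSC transcripts (UCSC Gene IDs) for each Entrez Gene ID."""
    return conn.execute(
        "SELECT value, COUNT(DISTINCT name) FROM knowntolocuslink "
        "GROUP BY value ORDER BY value").fetchall()


def distinct_probe_gene_pairs(conn):
    """Collapse the transcript level: unique (probe set, Entrez Gene) pairs."""
    return conn.execute(
        "SELECT DISTINCT k.value, q.value "
        "FROM knowntou133plus2 k JOIN knowntolocuslink q ON k.name = q.name "
        "ORDER BY k.value, q.value").fetchall()


if __name__ == "__main__":
    conn = build_db(ROWS)
    print(transcripts_per_gene(conn))
    print(distinct_probe_gene_pairs(conn))
```

On these sample rows, Entrez Gene 375690 has eight transcripts and 729759
has two, while dropping the UCSC Gene ID column and using DISTINCT leaves
three probe set/gene pairs.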

I hope that this helps you.

Rachel

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu


On Thu, 31 May 2007, Guoliang Xing wrote:

> Hi Rachel,
>
>   The information you provided proved to be very useful. I actually
> downloaded the UCSC annotation tables and imported them to my local MySQL
> database, the mapping now is done by joining common UCSC gene IDs. My
> programs now work beautifully.
>
>
>   I do have a new question. I am mapping Affy U133plus2 array
> probesetID to EntrezGeneID via the UCSC Known Gene ID, through the tables I
> downloaded.
>
>   Here is my query and result, I am joining 3 tables:
>
> SELECT k.`name` as UCSCGeneID, k.`value` as ProbeSetID,  q.`value` as
> EntrezGeneID
> FROM knowntou133plus2 k, knowngene p, knowntolocuslink q
> WHERE k.`name`= p.`name` AND k.`name` =q.`name`;
>
> UCSCGeneID, ProbeSetID, EntrezGeneID
> 'uc001aab.1', '225035_x_at', '375690'
> 'uc001aac.1', '225035_x_at', '375690'
> 'uc001aad.1', '225035_x_at', '375690'
> 'uc001aae.1', '225035_x_at', '375690'
> 'uc001aaf.1', '225035_x_at', '375690'
> 'uc001aag.1', '225035_x_at', '375690'
> 'uc001aah.1', '225035_x_at', '375690'
> 'uc001aai.1', '226340_x_at', '375690'
> 'uc001aaw.1', '234369_at', '729759'
> 'uc001abd.1', '234369_at', '729759'
>
> total 45366 rows
>
> What's interesting is that the same Entrez Gene ID matches multiple UCSC
> known genes. I am curious why one Entrez gene may match multiple UCSC
> known genes.
>
> This is not a problem for me, though, since once I added the DISTINCT
> keyword and dropped the UCSC Known Gene field, unique rows were returned:
>
> SELECT distinct k.`value` as ProbeSetID,  q.`value` as EntrezGeneID
> FROM knowntou133plus2 k, knowngene p, knowntolocuslink q
> WHERE k.`name`= p.`name` AND k.`name` =q.`name`;
>
> returns 20586 rows
>
>
> It's a great experience utilizing UCSC annotation tables.
>
> Thanks,
>
> Guoliang
>
>
> On Thu, 24 May 2007, Rachel Harte wrote:
>
> > Guoliang,
> >
> > The refFlat table contains all the alignments for the RefSeq mRNAs
> > (excluding the predicted ones whose IDs begin with "XM_"). Each mRNA may
> > align to the genome more than once with the criteria used to filter Blat
> > alignments for RefSeqs:
> > "RefSeq mRNAs were aligned against the human genome using blat; those with
> > an alignment of less than 15% were discarded. When a single mRNA aligned
> > in multiple places, the alignment having the highest base identity was
> > identified. Only alignments having a base identity level within 0.1% of
> > the best and at least 96% base identity with the genomic sequence were
> > kept."
> >
> > There are 25271 rows now in refFlat for hg18 but if you look at unique
> > names (accessions) then there are 24837. There are 215202 rows in refLink
> > because it contains the data for all RefSeqs from human and also from
> > other organisms if the assembly has an "Other RefSeq" track (xenoRefFlat,
> > xenoRefGene, xenoRefSeqAli tables). The table that you downloaded from NCBI
> > has many more rows for human RefSeqs than there are RefSeqs found in our
> > refFlat table because it includes the predicted RefSeqs (XM_ accessions) that
> > we do not think are reliable and that are therefore not aligned for the
> > Genome Browser.
> >
> > I hope that this helps you.
> >
> > Rachel
> >
> > We invite you to give us your feedback on the UCSC Genome Browser through
> > May 31, 2007: http://www.surveymonkey.com/s.asp?U=881163743177
> >
> > Rachel Harte
> > UCSC Genome Bioinformatics Group
> > http://genome.ucsc.edu
> >
> >
> > On Wed, 23 May 2007, Guoliang Xing wrote:
> >
> > > Hi Rachel,
> > >
> > >   Thank you for your reply. This is what I want and it's very helpful.
> > >
> > >   I am a little bit surprised to find that, for the same human Mar 2006
> > > assembly (Genes and Gene Prediction Tracks group -> RefSeq Genes), the
> > > refFlat table returns 25407 rows but the refLink table returns 203390
> > > rows, almost 10 times more.
> > >
> > >   On the other hand, the gene2refseq.gz file I downloaded from
> > > ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ , after filtering to keep human only
> > > genes with Taxonomy ID of 9606, it produces 145360 rows (many of the gene
> > > IDs are duplicates here).
> > >
> > >   I guess the genes downloaded from UCSC table browser using the refFlat
> > > option is a more reasonable snapshot of all coding genes on HG18.
> > >
> > >   My key task is to figure out the genome coordinates of each gene and its
> > > exons, then link this geneID to existing pathways, and/or SNPs associated
> > > with it.
> > >
> > >   If my guess is not good, please let me know. I understand that
> > > gene annotation is not simple, so there will be some inconsistencies.
> > >
> > >   Thanks,
> > >
> > >   Guoliang
> > >
> > > On Wed, 23 May 2007, Rachel Harte wrote:
> > >
> > > > Hello Guoliang,
> > > >
> > > > There is a table called refLink. You can download it through the Table
> > > > Browser (click on the "Tables" link on the top blue menu bar). Select the
> > > > assembly of interest and the "Genes and Gene Predictions" group and the
> > > > "RefSeq Genes" track. Then you can select the refLink table. The Entrez
> > > > Gene ID is in the locuslinkId column.
> > > > Alternatively, go to the Downloads server:
> > > > http://hgdownload.cse.ucsc.edu/downloads.html
> > > >
> > > > Once you have found the organism and assembly of interest, then click on
> > > > the "Annotation database" link and there you can download the contents of
> > > > any table in the database for that assembly.
> > > >
> > > > I hope that this helps you. For a rapid response in the future, please
> > > > direct questions to our mailing list at: genome at soe.ucsc.edu
> > > >
> > > > Thanks.
> > > >
> > > > Rachel
> > > >
> > > >
> > > > Rachel Harte
> > > > UCSC Genome Bioinformatics Group
> > > > http://genome.ucsc.edu
> > > >
> > > >
> > > > On Tue, 22 May 2007, Guoliang Xing wrote:
> > > >
> > > > > Hi Rachel,
> > > > >
> > > > >    Is there a way to download a mapping table from UCSC between Entrez
> > > > > GeneID and RefSeq ID?
> > > > >
> > > > >    I used UCSC table browser, RefSeq Genes track, knownToRefSeq table, and
> > > > > downloaded the data, it has a name field which is the transcript accession
> > > > > name, and a gene name field. But not GeneID.
> > > > >
> > > > >    I like all the info this table provides, but I also need the Entrez
> > > > > GeneID (for Human).
> > > > >
> > > > >    On the other hand, I parsed NCBI's gene2refseq table for human, but it
> > > > > has too many redundant lines for the same geneID.
> > > > >
> > > > >
> > > > >
> > > > >    Your help will be appreciated.
> > > > >
> > > > >    Guoliang
> > > > >
> > > >
> > >
> >
>
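
For reference, the RefSeq alignment filter described in the 24 May message
quoted above can be sketched as follows. This is an illustration only, not
the UCSC pipeline code. It assumes that "an alignment of less than 15%"
means the fraction of the mRNA covered by the alignment, and that "within
0.1% of the best" is an absolute difference in base identity; the function
name and the tuple layout are hypothetical.

```python
def filter_alignments(alignments, min_coverage=0.15,
                      min_identity=0.96, near_best=0.001):
    """Filter the Blat alignments of a single mRNA.

    alignments: list of (location, coverage, identity) tuples, with
    coverage and identity given as fractions between 0 and 1.
    """
    # Step 1: discard alignments covering less than 15% of the mRNA.
    covered = [a for a in alignments if a[1] >= min_coverage]
    if not covered:
        return []
    # Step 2: find the highest base identity among the remaining alignments.
    best = max(a[2] for a in covered)
    # Step 3: keep alignments within 0.1% of the best that also have
    # at least 96% base identity with the genomic sequence.
    return [a for a in covered
            if a[2] >= best - near_best and a[2] >= min_identity]


if __name__ == "__main__":
    # Made-up example values, not real alignment data.
    example = [
        ("chr1:1000", 0.98, 0.9990),
        ("chr5:2000", 0.97, 0.9985),
        ("chr9:3000", 0.95, 0.9700),
        ("chrX:4000", 0.10, 1.0000),
    ]
    for location, coverage, identity in filter_alignments(example):
        print(location, coverage, identity)
```

With these made-up values, the chrX alignment is dropped for low coverage,
the chr9 alignment for falling more than 0.1% below the best identity, and
the chr1 and chr5 alignments are both kept. This is how a single accession
can appear in more than one refFlat row.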