[Genome] Question about tracks and duplicate entries

Kayla Smith kayla at soe.ucsc.edu
Tue Jan 9 16:33:37 PST 2007


Todd,

Let me add to what Fan has said and point out that you can find out more 
about a track by looking at its details page.  For example, here is the 
Known Genes details page, which gives an idea about how this track is 
made:  http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=knownGene

RefSeq is the gene set built by NCBI; we align its sequences to
the base genome using BLAT.

UCSC Known Genes (KG) is the gene set built by UCSC using a different
process, in which pairs of protein-mRNA as gene candidates are processed
with gene-check and various scoring/ranking steps.

KG overlaps substantially with RefSeq, but these two sets also
differ in various ways.

A substantial number of RefSeq entries were filtered out of KG
because they did not pass our gene-check logic (similar
to what is used to construct the CCDS (Consensus CDS)
gene set).  But KG also includes valid GenBank mRNA-UniProt
protein pairs, such that the overall genomic coverage is
higher than RefSeq.  All UniProt isoforms are included in
our process as part of the initial gene candidates, thus the
final KG set offers more alternative splicing isoforms
than RefSeq.  Comprehensive annotations associated with
KG are also provided.

In summary, UCSC Known Genes has better-quality gene models,
higher genomic coverage, and comprehensive annotations.

Finally, the Other RefSeq track that you asked about actually shows xeno 
(that is, "non-human") RefSeq alignments.  The details page will tell you that 
"This track shows known protein-coding genes from organisms other than 
human, taken from the NCBI mRNA reference sequences collection (RefSeq)."
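
To pin down exactly where the two NM_004869 rows in your example differ, a
field-by-field comparison helps.  Below is a minimal Python sketch, not an
official UCSC tool.  It assumes the first ten columns of both rows are in the
order shown in your message; the field labels and the helper names
(parse_row, diff_rows) are chosen for this illustration only.

```python
# Labels for the first ten columns, in the order they appear in the pasted
# rows. These names are assumptions made for this sketch.
FIELDS = ["name", "chrom", "strand", "txStart", "txEnd",
          "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds"]
LIST_FIELDS = {"exonStarts", "exonEnds"}  # comma-separated coordinate lists

# The two rows from the question, with the mail line-wrapping undone.
known_gene = (
    "NM_004869 chr18 - 59207406 59240673 59209187 59240472 11 "
    "59207406,59211621,59215246,59217454,59218260,59218759,59221919,"
    "59225644,59228502,59229679,59240445, "
    "59209289,59211762,59215466,59217536,59218409,59218916,59222039,"
    "59225712,59228659,59229791,59240673, "
    "VPS4B_HUMAN R8768"
)
ref_gene = (
    "NM_004869 chr18 - 59207406 59240732 59209187 59240472 11 "
    "59207406,59211621,59215246,59217454,59218260,59218759,59221919,"
    "59225644,59228502,59229679,59240445, "
    "59209289,59211762,59215466,59217536,59218409,59218916,59222039,"
    "59225712,59228659,59229791,59240732,"
)


def parse_row(row):
    """Split a whitespace-separated row and keep only the first ten columns."""
    values = row.split()
    return dict(zip(FIELDS, values[:len(FIELDS)]))


def diff_rows(a, b):
    """Return the fields that differ between two rows.

    Scalar fields map to (value_a, value_b). List fields map to a list of
    (exon_index, value_a, value_b) for positions that differ; if the lists
    have different lengths, only the common prefix is compared.
    """
    ra, rb = parse_row(a), parse_row(b)
    diffs = {}
    for field in FIELDS:
        if ra[field] == rb[field]:
            continue
        if field in LIST_FIELDS:
            xa = ra[field].rstrip(",").split(",")
            xb = rb[field].rstrip(",").split(",")
            diffs[field] = [(i, x, y)
                            for i, (x, y) in enumerate(zip(xa, xb)) if x != y]
        else:
            diffs[field] = (ra[field], rb[field])
    return diffs


for field, change in diff_rows(known_gene, ref_gene).items():
    print(field, change)
```

For these two rows the only differences are txEnd (59240673 in Known Genes,
59240732 in RefSeq Genes) and the matching end coordinate of the last listed
exon; all other compared columns are identical.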

I hope this helps to answer your questions.  Please don't hesitate to 
contact us again if you require more assistance.

Kayla Smith
UCSC Genome Bioinformatics Group




Fan Hsu wrote:
> Hi Todd,
> 
> The RefSeq track data (refGene table) is updated nightly.
> The Known Genes data is not updated until next major release
> (or rebuild).
> 
> I noticed that the version number for NM_004869 is 3 now.
> 
> Fan.
> -----Original Message-----
> From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On
> Behalf Of Todd Riley
> Sent: Tuesday, January 09, 2007 2:51 PM
> To: genome at soe.ucsc.edu
> Subject: [Genome] Question about tracks and duplicate entries
> 
> 
> Hello again,
> 
> Thank you for your answers, they were very helpful.  If you don't mind,
> I now have a new question for you.
> 
> How are the tracks "Known Genes", "RefSeq Genes", and "Other RefSeq"
> related to each other?  I noticed that there are many RefSeq Genes in
> the KnownGenes table (which makes sense).  However, I also noticed that
> the RefSeq entries in the KnownGene track do not always correspond to
> the entries in the RefSeq table!! Here is an example from hg17:
> 
> Known Genes track:
> 
> NM_004869    chr18    -    59207406    59240673    59209187    59240472    11
> 59207406,59211621,59215246,59217454,59218260,59218759,59221919,59225644,59228502,59229679,59240445,
> 59209289,59211762,59215466,59217536,59218409,59218916,59222039,59225712,59228659,59229791,59240673,
> VPS4B_HUMAN    R8768
> 
> RefSeq Genes track:
> 
> NM_004869    chr18    -    59207406    59240732    59209187    59240472    11
> 59207406,59211621,59215246,59217454,59218260,59218759,59221919,59225644,59228502,59229679,59240445,
> 59209289,59211762,59215466,59217536,59218409,59218916,59222039,59225712,59228659,59229791,59240732,
> 
> 
> Notice that they are not completely the same!!!  I would expect them to
> be the same except for the "VPS4B_HUMAN    R8768" appended to the end of
> the known gene.
> 
> Thanks,
> Todd
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome


