[Genome] Question about tracks and duplicate entries
Kayla Smith
kayla at soe.ucsc.edu
Tue Jan 9 16:33:37 PST 2007
Todd,
Let me add to what Fan has said and point out that you can find out more
about a track by looking at its details page. For example, here is the
Known Genes details page, which gives an idea about how this track is
made: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=knownGene
RefSeq is the gene set built by NCBI; we align its sequences to
the base genome using BLAT.
UCSC Known Genes (KG) is the gene set built by UCSC using a different
process, in which protein-mRNA pairs are taken as gene candidates and
processed with gene-check and various scoring/ranking steps.
KG overlaps substantially with RefSeq, but these two sets also
differ in various ways.
A substantial number of RefSeq entries were filtered out of KG
because they did not pass our gene-check logic (similar
to what is used to construct the CCDS (Consensus CDS)
gene set). But KG also includes valid GenBank mRNA-UniProt
protein pairs, such that the overall genomic coverage is
higher than RefSeq. All UniProt isoforms are included in
our process as part of the initial gene candidates, thus the
final KG set offers more alternative splicing isoforms
than RefSeq. Comprehensive annotations associated with
KG are also provided.
In summary, UCSC Known Genes has better-quality gene models,
higher genomic coverage, and comprehensive annotations.
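For the specific NM_004869 records you pasted, a field-by-field comparison
shows exactly where the two tables disagree. The script below is only a
sketch: the column names are assumed from the genePred-style order of your
rows (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount,
exonStarts, exonEnds), and parse_row and diff_rows are hypothetical helpers,
not part of any UCSC tool.

```python
# Column order assumed from the pasted rows (genePred-style).
# parse_row and diff_rows are illustrative helpers, not UCSC utilities.
FIELDS = ["name", "chrom", "strand", "txStart", "txEnd",
          "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds"]


def parse_row(row):
    """Map the first ten whitespace-separated values to the field names."""
    values = row.split()
    return dict(zip(FIELDS, values[:len(FIELDS)]))


def diff_rows(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {f: (a[f], b[f]) for f in FIELDS if a[f] != b[f]}


# hg17 rows copied from the original question (line wraps rejoined).
known_gene = parse_row(
    "NM_004869 chr18 - 59207406 59240673 59209187 59240472 11 "
    "59207406,59211621,59215246,59217454,59218260,59218759,"
    "59221919,59225644,59228502,59229679,59240445, "
    "59209289,59211762,59215466,59217536,59218409,59218916,"
    "59222039,59225712,59228659,59229791,59240673, "
    "VPS4B_HUMAN R8768"
)
ref_gene = parse_row(
    "NM_004869 chr18 - 59207406 59240732 59209187 59240472 11 "
    "59207406,59211621,59215246,59217454,59218260,59218759,"
    "59221919,59225644,59228502,59229679,59240445, "
    "59209289,59211762,59215466,59217536,59218409,59218916,"
    "59222039,59225712,59228659,59229791,59240732,"
)

differences = diff_rows(known_gene, ref_gene)
for field, (kg_value, rs_value) in differences.items():
    print(field)
    print("  Known Genes:", kg_value)
    print("  RefSeq     :", rs_value)

# Size of the txEnd discrepancy in bases.
tx_end_gap = int(ref_gene["txEnd"]) - int(known_gene["txEnd"])
print("txEnd gap:", tx_end_gap)
```

With these two rows, only txEnd and exonEnds differ: the RefSeq alignment
ends 59 bases further along the chromosome, and that shows up as a longer
last exon in the list. The coding region (cdsStart, cdsEnd) is identical in
both records.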
Finally, the Other RefSeq track that you've asked about actually shows
xeno, or "non-human", RefSeq alignments. The details page will tell you that
"This track shows known protein-coding genes from organisms other than
human, taken from the NCBI mRNA reference sequences collection (RefSeq)."
I hope this helps to answer your questions. Please don't hesitate to
contact us again if you require more assistance.
Kayla Smith
UCSC Genome Bioinformatics Group
Fan Hsu wrote:
> Hi Todd,
>
> The RefSeq track data (refGene table) is updated nightly.
> The Known Genes data is not updated until the next major release
> (or rebuild).
>
> I noticed that the version number for NM_004869 is 3 now.
>
> Fan.
> -----Original Message-----
> From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On
> Behalf Of Todd Riley
> Sent: Tuesday, January 09, 2007 2:51 PM
> To: genome at soe.ucsc.edu
> Subject: [Genome] Question about tracks and duplicate entries
>
>
> Hello again,
>
> Thank you for your answers, they were very helpful. If you don't mind,
> I now have a new question for you.
>
> How are the tracks "Known Genes", "RefSeq Genes", and "Other RefSeq"
> related to each other? I noticed that there are many RefSeq Genes in
> the KnownGenes table (which makes sense). However, I also noticed that
> the RefSeq entries in the KnownGene track do not always correspond to
> the entries in the RefSeq table!! Here is an example from hg17:
>
> Known Genes track:
>
> NM_004869 chr18 - 59207406 59240673 59209187 59240472 11
> 59207406,59211621,59215246,59217454,59218260,59218759,59221919,59225644,59228502,59229679,59240445,
> 59209289,59211762,59215466,59217536,59218409,59218916,59222039,59225712,59228659,59229791,59240673,
> VPS4B_HUMAN R8768
>
> RefSeq Genes track:
>
> NM_004869 chr18 - 59207406 59240732 59209187 59240472 11
> 59207406,59211621,59215246,59217454,59218260,59218759,59221919,59225644,59228502,59229679,59240445,
> 59209289,59211762,59215466,59217536,59218409,59218916,59222039,59225712,59228659,59229791,59240732,
>
>
> Notice that they are not completely the same!!! I would expect them to
> be the same except for the "VPS4B_HUMAN R8768" appended to the end of
> the known gene.
>
> Thanks,
> Todd
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome