[Genome] knownGene.txt question
Archana Thakkapallayil
archanat at soe.ucsc.edu
Tue Nov 27 09:48:21 PST 2007
Hello Dmitriy,
Here is the response from Jim Kent to your question:
-----------
uc001aad is in a duplicon that looks like it appears in several places
in the genome. Since this is based on a RefSeq gene, we take the RefSeq
mRNA (NM_182905) in preference to the genomic RNA as cut into exons. In
general I'm hoping that all of the cases you find where the total exon
length and the mRNA length disagree are RefSeqs, and there should
only be a few hundred of them, I think.
-----------
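The comparison described in the quoted message below can be reproduced with a short script. This is only a minimal sketch: it assumes the knownGene.txt columns are in the order seen in the quoted example row (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds), and the function name exon_lengths is made up for illustration. The mRNA length of 795 is taken from the message as reported, not recomputed.

```python
def exon_lengths(known_gene_row):
    """Return the length of each exon in one knownGene.txt row.

    Assumes whitespace-separated fields in the order shown in the
    example row: exonCount is field 8, followed by comma-terminated
    lists of exon starts and exon ends.
    """
    fields = known_gene_row.split()
    exon_count = int(fields[7])
    # The lists end with a trailing comma, so drop empty pieces.
    starts = [int(x) for x in fields[8].split(",") if x]
    ends = [int(x) for x in fields[9].split(",") if x]
    if not (len(starts) == len(ends) == exon_count):
        raise ValueError("exon count does not match start/end lists")
    return [end - start for start, end in zip(starts, ends)]


# Example row for uc001aad.1, copied from the quoted message.
row = ("uc001aad.1 chr1 - 4558 7231 4558 7173 8 "
       "4558,4832,5658,5769,6469,6720,6723,7095, "
       "4692,4901,5767,5810,6628,6721,6918,7231,")

lengths = exon_lengths(row)
total = sum(lengths)
mrna_length = 795  # length of the knownGeneMrna.txt sequence, as reported

print(lengths)              # [134, 69, 109, 41, 159, 1, 195, 136]
print(total)                # 844
print(total - mrna_length)  # 49
```

Running the same check over every row and keeping only the names where the two lengths differ would give the list of discrepant IDs; whether those are all RefSeq-based, as expected above, would need to be checked separately.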
I hope this is helpful.
Regards,
Archana
UCSC Genome Bioinformatics Group
Dmitriy Skvortsov wrote:
> Dear colleagues,
> While working with the knownGene.txt file, which I downloaded from
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
> I found that in 17428 cases out of 56722 the total length of the exons
> does not match the length of the sequence in knownGeneMrna.txt, which
> I downloaded from the same source.
>
>
> Example
> uc001aad.1
>
> line 4 from knownGene.txt
> uc001aad.1 chr1 - 4558 7231 4558 7173 8 4558,4832,5658,5769,6469,6720,6723,7095, 4692,4901,5767,5810,6628,6721,6918,7231,
>
>
> Exon   Start   End    Length
> 1      4558    4692   134
> 2      4832    4901   69
> 3      5658    5767   109
> 4      5769    5810   41
> 5      6469    6628   159
> 6      6720    6721   1
> 7      6723    6918   195
> 8      7095    7231   136
> Total                 844
>
> mrna sequence from file knownGeneMrna.txt
>
> atgctgggggcagagacagaggagaagctgtttgatgcccccttgtccatcagcaagagagagcagctggaacagcaggtcccagagaactacttctatgtgccagacctgggccaggtgcc
> tgagattgatgttccatcctacctgcctgacctgcccggcattgccaacgacctcatgtacattgccgacctgggccccggcattgccccctctgcccctggcaccattccagaactgcccacctt
> ccacactgaggtagccgagcctctcaaggcagacctacaagatggggtactaacaccacccccaccgcccccaccaccacccccagctcctgaggtgctggccagtgcacccccactccc
> accctcaaccgcggcccctgtaggccaaggcgccaggcaggacgacagcagcagcagcacgtctccttcagtccagggagctcccagggaagtggtcgacccctccggtggctgggccact
> ctgctagagtccatccgccaagctgggggcatcggcaaggccaagctgcgcagcatgaaggagcgaaagctggagaagaagaagcagaaggagcaggagcaagtgagagccacgagcca
> aggtgggcacttgatgtcggatctcttcaacaagctggtcatgaggcgcaagggcatctctgggaaaggacctggggctggtgaggggcccggaggagcctttgcccgcgtgtcagactccatccc
> tcctctgccgccaccgcagcagccacaggcagaggaggacgaggacgactgggaatcctag
>
> length 795 nt
>
> 795 != 844
> I am attaching a list of all 17000 UCIDs where the annotation does not
> match the sequence.
> We recently reported a similar discrepancy in refSeq; I guess the errors
> in knownGene.txt and knownGeneMrna.txt have the same origin.
> I hope this helps.