[Genome] "_dupX" names for transcript ID duplicates on same chromosomes

Angie Hinrichs angie at soe.ucsc.edu
Thu Jun 14 10:02:13 PDT 2007


Hi Micha,

If you need qStarts and CDS alignment info, then you definitely need 
to combine the all_mrna, gbCdnaInfo and cds tables; our very limited 
GTF output (target coords only, names uniquified only to accommodate 
GTF's multiple-lines-per-feature) won't be very helpful for you.  
Instead of recreating the order, it might work to concatenate the name 
(no _dup) with the target chrom, start and stop for joining 
all_mrna-derived GTF with all_mrna rows.  However, if your ultimate 
requirement is GTF with CDS lines, then yes, it might be the best 
solution for you to create your own GTF directly from all_mrna, 
gbCdnaInfo and cds.
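
A rough sketch of that join key, in Python (not a program from the UCSC source tree). It strips a trailing _dupN suffix and builds a (name, chrom, start, end) key per GTF transcript that can be compared with a key built from an all_mrna row. The function names are made up for illustration, and the sketch assumes that GTF exon starts are 1-based while all_mrna.tStart is 0-based; check that against your own export before relying on it.

```python
import re

# Matches the "_dup<N>" suffix added to make transcript_id values unique.
DUP_SUFFIX = re.compile(r"_dup\d+$")


def strip_dup(transcript_id):
    """Return the accession without a trailing _dupN suffix."""
    return DUP_SUFFIX.sub("", transcript_id)


def make_key(name, chrom, start0, end):
    """Join key: accession, target chrom, 0-based start, end."""
    return (strip_dup(name), chrom, start0, end)


def gtf_transcript_keys(exons):
    """Build one join key per transcript_id from GTF exon records.

    exons: iterable of (chrom, start1, end, transcript_id), where start1
    is the 1-based GTF start (an assumption about the export).
    """
    spans = {}
    for chrom, start1, end, tid in exons:
        if tid in spans:
            _, lo, hi = spans[tid]
            spans[tid] = (chrom, min(lo, start1), max(hi, end))
        else:
            spans[tid] = (chrom, start1, end)
    # Convert the 1-based GTF start to a 0-based start for the key.
    return {
        tid: make_key(tid, chrom, lo - 1, hi)
        for tid, (chrom, lo, hi) in spans.items()
    }
```

Two alignments of the same accession with identical chrom, start and end would still share a key; in that case the block coordinates would have to be compared as well.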

The good news is that our source tree has a couple programs to help 
with that task.  The source is free for academic/non-commercial use, 
and can be licensed for commercial use (see 
http://genome.ucsc.edu/license/ if applicable; if you are academic/
non-profit, http://genome.ucsc.edu/admin/cvs.html).  

Advice from our local GenBank expert, Mark Diekhans:
--------------------------------------------------------------------
The mrnaToGene program takes a psl and a tab separated file of
name and genbank-format CDS specification and produces a
genePred.  It has lots of options for handling gaps in alignments.
It can also create the exon frames, which might be useful for
looking for frame shifts.  This will all be nicer than dealing
with GTF.

Well, they will have to worry about mis-annotated CDSs,
of which there are a lot in GenBank.  
--------------------------------------------------------------------
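
As an illustration of the CDS input described above (one line per sequence, name and GenBank-format CDS specification separated by a tab), here is a minimal Python sketch that formats such a file from (name, cds) pairs, for example the result of joining gbCdnaInfo with cds. The function name is hypothetical, the accessions and CDS values in the example are made up, and skipping "n/a" as a no-CDS placeholder is an assumption about how missing values appear in your data.

```python
def format_cds_table(records, placeholders=("", "n/a")):
    """Format (name, cds) pairs as tab-separated lines.

    records: iterable of (accession, GenBank-format CDS string).
    placeholders: values treated as "no CDS annotated" and skipped
    (an assumption; adjust to whatever your export actually contains).
    """
    lines = []
    for name, cds in records:
        cds = cds.strip()
        if cds in placeholders:
            continue
        lines.append("%s\t%s\n" % (name, cds))
    return "".join(lines)


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    rows = [("AK000001", "72..1253"), ("AK000002", "n/a")]
    print(format_cds_table(rows), end="")
```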

mrnaToGene outputs the genePred format which is what our gene 
prediction tables use.  We also have a program, genePredToGtf, that 
translates the genePred format to GTF.  

Both of those programs have options to use either flatfiles, or a 
local mysql database with our tables.  To see the usage instructions 
for any program in our source tree, run it with no arguments.  
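
If you want to collect the usage text from a script, a small Python wrapper along these lines might do. It only relies on the behaviour described above (running the program with no arguments prints its usage); the helper name is made up, and it assumes the program is on your PATH.

```python
import shutil
import subprocess


def usage_text(program):
    """Run a program with no arguments and return whatever it prints.

    Returns None if the program is not found on PATH. stdout and stderr
    are combined because the usage message may go to either stream.
    """
    path = shutil.which(program)
    if path is None:
        return None
    result = subprocess.run(
        [path],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=30,
    )
    return result.stdout


if __name__ == "__main__":
    for prog in ("mrnaToGene", "genePredToGtf"):
        text = usage_text(prog)
        print(text if text is not None else "%s: not found on PATH" % prog)
```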

Angie


On Wed, 13 Jun 2007, Micha Sammeth wrote:

> Hi,
> 
> thank you Angie, but chopping off the _dup suffixes won't solve my 
> problem. I need also the alignment information from the all_mRNA 
> (qStarts, blockSizes, ...) to detect misaligned CDSs. Is it possible to 
> reconstruct the "order of retrieval from the database query" when 
> generating the GTF files? If there is no possibility to assign the 
> dup-transcript_ids again uniquely to their entry in all_mRNA, I guess 
> the only thing that helps is to download all data necessary to produce 
> my own GTF.
> 
> Thank you, micha.
> 
> Angie Hinrichs wrote:
> > Hi Micha,
> >
> > There is nothing special about the assignment of the _dup suffixes -- 
> > the Table Browser simply assigns them in the order of retrieval from a 
> > database query.  If it has already written GTF for a given ID, and 
> > then gets another element with the same ID, it adds the suffix for 
> > uniquification as b0b said.  There is no consideration of the quality 
> > of the items, and it is not stable.  
> >
> > In case this part of your original question hasn't been addressed 
> > already -- in order to join with gbCdnaInfo to get the CDS, simply 
> > chop off the _dup suffix.  The CDS info is a property of the mRNA 
> > sequence from GenBank, and applies to all genomic alignments of that 
> > sequence.
> >
> > The UCSC Genes track does use stable accessions (uc00*.version)... 
> > no guarantees that it will cover all of the GenBank sequences you are 
> > working with, but it does have the advantage of stability.  The 
> > GenBank tracks are particularly un-"stable" because they are 
> > regenerated regularly (mRNAs nightly, ESTs weekly) so that we can keep 
> > up with additions and changes in GenBank.  
> >
> > Angie
> >
> >
> > On Tue, 12 Jun 2007, Micha Sammeth wrote:
> >
> >   
> >> Hi Robert,
> >>
> >> is there something known about the order that the TB assigns 
> >> "transcript_id", "transcript_id_dup1", "transcript_id_dup2", ... to 
> >> transcripts that align multiple times? Is it according to alignment 
> >> quality (match% or query mismatches or ...?) or according to the genomic 
> >> order of the transcripts (which one?)? How are ties broken? Is the 
> >> process stable, i.e., are the dupX suffixes the same across different 
> >> exports from the TB, different genome versions, future Genbank imports?
> >>
> >> Thank you!! micha.
> >>
> >> Robert Kuhn wrote:
> >>     
> >>> Hello, again, Micha,
> >>>
> >>> As you suspect, the transcript_id field is generated dynamically by 
> >>> the Table Browser as a result of your choice of GTF for output.  On 
> >>> the following page in our FAQ:
> >>>
> >>>   http://genome.ucsc.edu/FAQ/FAQformat#format4
> >>>
> >>> that field is defined as
> >>>
> >>> "transcript_id value - A globally unique identifier for the predicted 
> >>> transcript"
> >>>
> >>> so it is necessary for the TB to distinguish the two AK130020 transcripts.
> >>>
> >>> If this does not fully resolve the issue, please let us know.
> >>>
> >>> thanks,
> >>> 			--b0b kuhn
> >>>
> >>>
> >>>
> >>>   
> >>>       
> >>>> From gmicha at gmail.com  Tue Jun 12 02:45:30 2007
> >>>> To: Robert Kuhn <kuhn at soe.ucsc.edu>
> >>>> CC: genome at soe.ucsc.edu
> >>>> Subject: Re: [Genome] "_dupX" names for transcript ID duplicates on same chromosomes
> >>>>
> >>>> Hi Robert,
> >>>>
> >>>> the following is a grep on a file I downloaded from the table browser 
> >>>> (mRNA track, hg17, GTF export). So, does the table browser somehow 
> >>>> dynamically generate these _dup names for multiple alignments on the 
> >>>> same chromosome?
> >>>>
> >>>> Best, micha.
> >>>>
> >>>> $ grep AK130020 human_hg17_mRNA_fromUCSC.gtf
> >>>> chr1    hg17_all_mrna   exon    16385779        16385842        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020";
> >>>> chr1    hg17_all_mrna   exon    16452373        16452436        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020";
> >>>> chr1    hg17_all_mrna   exon    16452437        16452447        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020";
> >>>> chr1    hg17_all_mrna   exon    16452451        16452453        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020";
> >>>> chr1    hg17_all_mrna   exon    16452454        16452559        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020";
> >>>> chr1    hg17_all_mrna   exon    17065218        17066019        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020";
> >>>> chr1    hg17_all_mrna   exon    17067288        17067965        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020_dup1";
> >>>> chr1    hg17_all_mrna   exon    17068055        17068101        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020_dup1";
> >>>> chr1    hg17_all_mrna   exon    17068187        17068297        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020_dup1";
> >>>> chr1    hg17_all_mrna   exon    17068821        17068976        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020_dup1";
> >>>> chr1    hg17_all_mrna   exon    17071812        17071943        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020_dup1";
> >>>> chr1    hg17_all_mrna   exon    17072047        17072113        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020_dup1";
> >>>> chr1    hg17_all_mrna   exon    17072201        17072648        0.000000        -       .       gene_id "AK130020"; transcript_id "AK130020_dup1";
> >>>>
> >>>>
> >>>> Robert Kuhn wrote:
> >>>>     
> >>>>         
> >>>>> hello, micha,
> >>>>>
> >>>>> I'm not sure I understand where you are seeing the "_dup" part of
> >>>>> transcript_ids.  The chr1_mrna table does have two entries for the
> >>>>> accession you mention.  
> >>>>>
> >>>>> One of these is definitely a better alignment than the other.  Type 
> >>>>> AK130020 into the position box of the browser and then click on one of 
> >>>>> them.  Then expand your position window to chr1:16513000-17200000.  
> >>>>> Both transcripts should be highlighted in the Human Mrna track.  I 
> >>>>> know that does not address your bulk processing issues, but I thought 
> >>>>> I'd mention it.
> >>>>>
> >>>>> Please let me know where exactly you are seeing the "_dupX"
> >>>>> notation and I will try to track it down.  By the way, it is
> >>>>> not uncommon for mrnas to align in more than one location.  The
> >>>>> existence of segmental duplications virtually guarantees it.
> >>>>>
> >>>>> best wishes,
> >>>>>
> >>>>> 			--b0b kuhn
> >>>>>
> >>>>>   
> >>>>>       
> >>>>>           
> >>>>>> From genome-bounces at soe.ucsc.edu  Mon Jun 11 15:48:27 2007
> >>>>>> Cc: genome at soe.ucsc.edu, Micha Sammeth <micha at sammeth.net>
> >>>>>> Subject: [Genome] "_dupX" names for transcript ID duplicates on same
> >>>>>> 	chromosomes
> >>>>>>
> >>>>>> Hello again,
> >>>>>>
> >>>>>> thank you, Robert, for the explanation on the ESTs; it is clear to me now.
> >>>>>>
> >>>>>> I have another little problem, hopefully the last one for this project: 
> >>>>>> when I download the (hg17) mRNA track, I get conveniently unique 
> >>>>>> transcript_id tags - at least on a chromosome level, which is enough 
> >>>>>> (e.g., "AK130020" and "AK130020_dup1" on chr1). From which field does 
> >>>>>> the table browser retrieve them? It does not seem to be all_mrna.qName, 
> >>>>>> nor did I find a field in gbCdnaInfo or other plausible-sounding tables that are up 
> >>>>>> to 1-2 joins away from all_mrna. I would need these "_dupX" identifiers 
> >>>>>> to map back the CDS info I get from hg17.cds to genomic coordinates.
> >>>>>>
> >>>>>> Best thanks, micha.
> >>>>>> _______________________________________________
> >>>>>> Genome maillist  -  Genome at soe.ucsc.edu
> >>>>>> http://www.soe.ucsc.edu/mailman/listinfo/genome
> >>>>>>

-- 
angie at soe.ucsc.edu
Software Developer, UCSC CBSE / Genome Bioinformatics Group

