[Genome] clusterGenes
Brooke Rhead
rhead at soe.ucsc.edu
Thu Jul 12 15:46:44 PDT 2007
Hi Archie and Amit,
I have a bit more information for you about clusterGenes. First, the
usage statement of clusterGenes is:
-------------
clusterGenes - Cluster genes from genePred tracks
usage:
clusterGenes [options] outputFile database table1 ... tableN
clusterGenes [options] -trackNames outputFile database track1 table1 ... trackN tableN
Where outputFile is a tab-separated file describing the clustering,
database is a genome database such as mm4 or hg16,
and the table parameters are either tables in genePred format in that
database or genePred tab-separated files. If the input is all from
files, the database argument can be `no'.
options:
-verbose=N - Print copious debugging info. 0 for none, 3 for loads
-chrom=chrN - Just work this chromosome, may be repeated.
-cds - cluster only on CDS exons
-trackNames - If specified, input are pairs of track names and files.
This is useful when the file names don't reflect the desired track
names.
-ignoreStrand - cluster positive and negative strands together
-clusterBed=bed - output BED file for each cluster
-flatBed=bed - output BED file that contains the exons of all genes
flattened into a single record.
-joinContained - join genes that are contained within a larger locus
into that locus. Intended as a way to handle fragments and exon-level
predictions, as genes-in-introns on the same strand are very rare.
-conflicted - detect conflicted loci. Conflicted loci are loci that
contain genes that share no sequence. This option greatly increases
the size of the output file.
The cdsConflicts and exonConflicts columns contain `y' if the cluster
has conflicts. A conflict is a cluster where not all of the genes share
exons. Conflicts may be either internal to a table or between tables.
-------------
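To make the second form of the usage line concrete, here is a minimal
sketch that only assembles an argument list in the order the usage
statement gives (options, -trackNames, outputFile, database, then
track/file pairs). The helper name cluster_genes_cmd and the file and
track names are made up for illustration; the flags are the ones listed
above.

```python
def cluster_genes_cmd(output_file, inputs, database="no", cds=False,
                      ignore_strand=False, chroms=()):
    """Build an argv list for the -trackNames form of clusterGenes.

    inputs is a list of (track_name, genePred_file) pairs. database
    defaults to "no", which the usage statement allows when all input
    comes from files.
    """
    cmd = ["clusterGenes"]
    if cds:
        cmd.append("-cds")
    if ignore_strand:
        cmd.append("-ignoreStrand")
    for chrom in chroms:  # -chrom may be repeated
        cmd.append(f"-chrom={chrom}")
    cmd.append("-trackNames")
    cmd += [output_file, database]
    for track, path in inputs:
        cmd += [track, path]
    return cmd


# Hypothetical file and track names, for illustration only.
cmd = cluster_genes_cmd(
    "clusters.tab",
    [("refGene", "refGene.gp"), ("ensGene", "ensGene.gp")],
    cds=True,
)
print(" ".join(cmd))
```

If the clusterGenes binary is on your PATH, the list could be handed to
subprocess.run(cmd, check=True) to actually run it.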
And here is some extra information from one of our engineers:
1. The clusterGenes program requires that transcripts actually overlap
to be placed in the same cluster. It clusters transcripts based on at
least one exon base (introns don't count) of genomic overlap on the same
strand.
2. Two transcripts that don't overlap each other might still end up in
the same cluster because each shares exon bases with a third transcript
(clustering is transitive).
For example, these three transcripts would form one cluster:
||||||-----||||||||--------|||||||
-------||||||
|||||------------|||||||||
3. Note that if the -cds option is used, clustering is only done on
genomic overlap of CDS. This is generally preferred, as some gene loci
have overlapping UTR.
4. Also, transcripts within transcripts are put into different clusters.
That is, a transcript that lies completely within an intron of another
transcript, with no exon overlap (you could say they are interleaved),
is put into a different cluster, even if both are on the same strand.
(Most transcripts inside of other transcripts are on the opposite
strand, though.)
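The four rules above can be sketched in a few lines of Python. This is
not the clusterGenes source (which, per Galt's note quoted below, tracks
growing cluster ranges with binKeeper); it is a simple all-pairs
union-find illustration that assumes half-open (start, end) exon
coordinates. All function and transcript names are hypothetical.

```python
def exons_overlap(a, b):
    """True if any exon in a shares at least one base with any exon in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def clip_to_cds(exons, cds_start, cds_end):
    """Keep only the coding part of each exon (what -cds would compare)."""
    clipped = [(max(s, cds_start), min(e, cds_end)) for s, e in exons]
    return [(s, e) for s, e in clipped if s < e]


def cluster_transcripts(transcripts, ignore_strand=False):
    """transcripts: list of (name, strand, exons). Returns a list of name sets."""
    parent = list(range(len(transcripts)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (_, strand_i, exons_i) in enumerate(transcripts):
        for j in range(i):
            _, strand_j, exons_j = transcripts[j]
            same_strand = ignore_strand or strand_i == strand_j
            if same_strand and exons_overlap(exons_i, exons_j):
                parent[find(i)] = find(j)  # merging makes clusters transitive

    clusters = {}
    for i, (name, _, _) in enumerate(transcripts):
        clusters.setdefault(find(i), set()).add(name)
    return list(clusters.values())


transcripts = [
    ("tA", "+", [(0, 6), (11, 19), (27, 34)]),
    ("tB", "+", [(30, 40), (60, 70)]),  # shares exon bases with tA
    ("tC", "+", [(65, 75)]),            # shares exon bases with tB only
    ("tD", "+", [(20, 25)]),            # sits inside tA's second intron
    ("tE", "-", [(0, 6)]),              # overlaps tA, but on the other strand
]
clusters = cluster_transcripts(transcripts)
```

Here tA, tB and tC end up in one cluster (tC joins through tB), while tD
and tE each get their own. Passing ignore_strand=True merges tE into the
first cluster, and running each exon list through clip_to_cds first
approximates clustering on CDS overlap only.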
I hope this information is helpful.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
Russell, Archie wrote:
>
> This program could be very useful, is there a longer description of what
> it does and how?
>
> Thanks,
> Archie
>
> -----Original Message-----
> From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]
> On Behalf Of Galt Barber
> Sent: Thursday, July 12, 2007 10:53 AM
> To: Amit Indap
> Cc: UCSC
> Subject: Re: [Genome] clusterGenes
>
>
> Yes, that appears to be the case.
> It is using binKeeper to keep track of
> growing cluster ranges and new elements
> that overlap the cluster (by position)
> are added to it.
>
> -Galt
>
>
> On Thu, 12 Jul 2007, Amit Indap wrote:
>
>> Hi UCSC
>>
>> For the program clusterGenes from kent src, is it
>> clustering genepred records based on how close they
>> are to each other (in terms of assembly coordinates)?
>>
>> Thanks,
>>
>> Amit Indap
>> Cornell University
>>
>>
>>
>>
>> _______________________________________________
>> Genome maillist - Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>>