[Genome] The count of all annotated exons for a UCSC known gene ......

Ann Zweig ann at soe.ucsc.edu
Fri Jun 1 15:22:05 PDT 2007


Hello again, Rileen,

	Sorry for misunderstanding your question the first time. 
Unfortunately, there is no quick way to do this, but I can help you 
gather all the information you need.  You'll then need to do one final 
step on your own.

	This pertains only to the UCSC Gene set.  The RefSeq Gene tables are 
set up differently.

	Each gene cluster is represented by a canonical gene noted in the 
knownCanonical table.  Each cluster has a clusterId which also appears 
in the knownIsoforms table.  Using your example of the EDA gene, the 
canonical gene is:

mysql>  SELECT * FROM knownCanonical WHERE transcript = 'uc004dxs.1'\G
*************************** 1. row ***************************
      chrom: chrX
chromStart: 68752635
   chromEnd: 69176046
  clusterId: 25603
transcript: uc004dxs.1
    protein: uc004dxs.1


	So, the clusterId in this case is 25603.  Now use this information to 
search the knownIsoforms table for all of the genes in this cluster:

mysql>  SELECT * FROM knownIsoforms WHERE clusterId = 25603;
+-----------+------------+
| clusterId | transcript |
+-----------+------------+
|     25603 | uc004dxq.1 |
|     25603 | uc004dxn.1 |
|     25603 | uc004dxm.1 |
|     25603 | uc004dxl.1 |
|     25603 | uc004dxp.1 |
|     25603 | uc004dxo.1 |
|     25603 | uc004dxs.1 |
|     25603 | uc004dxr.1 |
+-----------+------------+


	Here are the 8 genes that you see displayed in the UCSC Gene track for 
"EDA".  Now you can search the knownGene table to get a list of the 
exonStarts (and exonEnds if needed).  For this, I would suggest using 
our public mysql server.  You can access the server using this command 
from a prompt:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
use hg18

	To get a list of exonStarts for the set of genes in a cluster, you can 
issue this command:

mysql> SELECT clusterId, exonStarts FROM knownGene g, knownIsoforms i 
WHERE g.name = i.transcript and i.clusterId = 25603;
+-----------+--------------------------------------------------------------------------+
| clusterId | exonStarts                                                               |
+-----------+--------------------------------------------------------------------------+
|     25603 | 68752635,69093601,                                                       |
|     25603 | 68752635,68997412,                                                       |
|     25603 | 68752635,68806778,68997412,                                              |
|     25603 | 68752635,                                                                |
|     25603 | 68752635,68757464,69093601,                                              |
|     25603 | 68752635,68865711,68997412,69093601,                                     |
|     25603 | 68752635,69093601,69159792,69164431,69166078,69167043,69169972,69171932, |
|     25603 | 68752635,69093601,69159792,69164431,69166078,69167043,69169972,69171932, |
+-----------+--------------------------------------------------------------------------+
8 rows in set (0.04 sec)


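	This gives you your list of exonStarts for your cluster.  To get the 
exonStarts for *all* gene clusters, drop only the "and i.clusterId = 25603" 
condition.  Keep the "g.name = i.transcript" join condition; without it, 
every row of knownGene would be paired with every row of knownIsoforms.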

	Once you have this list, you will need to write a Perl script (or the 
like) to count the unique exons in each cluster of genes.
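
	As a rough illustration of that final step, here is a minimal Python 
sketch (Python rather than Perl, purely as an example).  It assumes the 
query results have already been read into (clusterId, exonStarts) pairs. 
The function names are made up for this example, and the second function 
assumes you have also added exonEnds to the SELECT list.  The "toy" rows 
are invented data, not real annotations.

```python
from collections import defaultdict


def parse_coords(field):
    """Turn a comma-terminated list such as '10,20,' into [10, 20]."""
    return [int(x) for x in field.strip().split(",") if x.strip()]


def unique_exon_starts(rows):
    """Count distinct exon start positions per cluster.

    rows: iterable of (clusterId, exonStarts) pairs.
    """
    starts = defaultdict(set)
    for cluster_id, exon_starts in rows:
        starts[cluster_id].update(parse_coords(exon_starts))
    return {cluster_id: len(s) for cluster_id, s in starts.items()}


def merged_exon_count(rows):
    """Count non-overlapping exon regions per cluster.

    rows: iterable of (clusterId, exonStarts, exonEnds) triples.
    Exons that overlap each other are merged into a single region.
    """
    intervals = defaultdict(list)
    for cluster_id, exon_starts, exon_ends in rows:
        intervals[cluster_id].extend(
            zip(parse_coords(exon_starts), parse_coords(exon_ends)))
    counts = {}
    for cluster_id, spans in intervals.items():
        spans.sort()
        count = 0
        current_end = None
        for start, end in spans:
            # Coordinates are treated as half-open, so an exon starting
            # exactly at current_end is adjacent, not overlapping.
            if current_end is None or start >= current_end:
                count += 1
                current_end = end
            else:
                current_end = max(current_end, end)
        counts[cluster_id] = count
    return counts


# The eight EDA rows from the query output above.
eda_rows = [
    (25603, "68752635,69093601,"),
    (25603, "68752635,68997412,"),
    (25603, "68752635,68806778,68997412,"),
    (25603, "68752635,"),
    (25603, "68752635,68757464,69093601,"),
    (25603, "68752635,68865711,68997412,69093601,"),
    (25603, "68752635,69093601,69159792,69164431,"
            "69166078,69167043,69169972,69171932,"),
    (25603, "68752635,69093601,69159792,69164431,"
            "69166078,69167043,69169972,69171932,"),
]
print(unique_exon_starts(eda_rows))  # prints {25603: 12}

# Invented rows (hypothetical cluster "toy") to show the merging variant.
toy_rows = [
    ("toy", "0,100,", "50,150,"),
    ("toy", "40,100,", "60,150,"),
]
print(merged_exon_count(toy_rows))  # prints {'toy': 2}
```

	Counting by exonStarts alone treats two exons that share a start but 
have different ends as one exon, and two overlapping exons with different 
starts as two.  If you want distinct, non-overlapping exons, the merging 
variant is probably closer to what you are after.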

	Please be sure to write back if this is unclear.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu




Rileen wrote:
> Hi Ann,
>             Thanks. I'm sorry, my question was ambiguous,  I already knew
> how to do what you said. I wanted the number of exons per
> loci/"canonical" gene, which is why I said
> 
> "By that I mean all exons that appear in at least one transcript - a
> sort of "logical OR" of all exons in all transcripts."
> 
> which wasn't very clear, I admit.
> 
> To clarify - let's take an example, say the gene EDA. There are 8 Refseqs,
> with varying numbers of exons - between 1 and 3. However, when you look
> at them together, there are 5 exons in this locus, each of which appears
> in at least one Refgene. These are not exons which differ from each other
> by one splice site etc., these are distinct, non-overlapping exons.
> 
> It is this number - 5 in this example - that I'm after.
> 
> Take care,
>                   Yours,
>                                  Rileen
> 
> 
> On 01/06/07, Ann Zweig <ann at soe.ucsc.edu> wrote:
>> Hello Rileen,
>>
>>         You can get the information you want by using the Table Browser
>> tool ('Tables' in the blue navigation bar).  Configure it like so:
>>
>> clade:  Vertebrate
>> genome: Human
>> assembly: Mar 2006
>> group:  Genes & Gene Prediction tracks
>> track:  UCSC Genes
>> table:  knownGene
>> region: genome
>> output format: selected fields from primary and related tables
>>
>>         Press "get output" button.
>>
>> Then, on the next page choose:
>> name
>> exonCount
>> (and any other fields you would like)
>>
>>         Press "get output" button to see your list.
>>
>>         Do the same for the refGene table to get the exon counts for all
>> of the Ref Seq Genes.
>>
>> Regards,
>>
>> ----------
>> Ann Zweig
>> UCSC Genome Bioinformatics Group
>> http://genome.ucsc.edu
>>
>> Please feel free to search the Genome mailing list archives by visiting
>> our home page, clicking on "Contact Us", then typing a word or phrase
>> into the search box.  On that same page
>> (http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome
>> mailing list.
>>
>>
>>
>>
>> Rileen Sinha wrote:
>> > Hi,
>> >        Is there some quick/convenient way of obtaining a list of the
>> > number of annotated exons in each given gene in the UCSC known gene
>> > collection?
>> >
>> > By that I mean all exons that appear in at least one transcript - a
>> > sort of "logical OR" of all exons in all transcripts.
>> >
>> > Is there also a corresponding solution for RefSeq genes?
>> >
>> > Thanks!
>> >
>> > Take care,
>> >                    Yours,
>> >                                   Rileen
>> >
>>

