[Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates

John Major major at cbio.mskcc.org
Fri Mar 2 10:56:21 PST 2007


Hello Fan-

I'd suggest that you do not try to find a representative protein, but 
think about the problem purely in the genome space.
Let users tease apart which protein isoforms they are interested in.

I'd think a data model similar to this might work:

Table: genomicProteinDomainMap
    transcriptID (foreign key to knownGene.name)
    chromosome
    start
    end
    domainName*
    databaseName*
   *These two columns could be abstracted into keys to database-specific
    tables, but for the sake of clarity I collapsed them.
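
For concreteness, the table above could be sketched as follows. This is only an illustration using Python's built-in sqlite3 module rather than the real database server. The column types, the cut-down knownGene stand-in, and the names chromStart/chromEnd (used in place of start/end, since END is an SQL keyword) are assumptions of mine. The single row is the hypothetical ABCD gene from my earlier message (quoted below).

```python
import sqlite3

# In-memory SQLite stands in for the real database; illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled

conn.executescript("""
-- Cut-down stand-in for knownGene (the real table has more columns).
CREATE TABLE knownGene (
    name    TEXT PRIMARY KEY,
    chrom   TEXT NOT NULL,
    txStart INTEGER NOT NULL,
    txEnd   INTEGER NOT NULL
);

-- One row per (transcript, domain) pair, in genomic coordinates.
CREATE TABLE genomicProteinDomainMap (
    transcriptID TEXT NOT NULL REFERENCES knownGene(name),
    chromosome   TEXT NOT NULL,
    chromStart   INTEGER NOT NULL,  -- "start" in the outline above
    chromEnd     INTEGER NOT NULL,  -- "end" in the outline above
    domainName   TEXT NOT NULL,
    databaseName TEXT NOT NULL
);
""")

# Hypothetical gene ABCD and its domain, not real data.
conn.execute("INSERT INTO knownGene VALUES ('ABCD', 'chrX', 100000, 101000)")
conn.execute(
    "INSERT INTO genomicProteinDomainMap VALUES (?, ?, ?, ?, ?, ?)",
    ("ABCD", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
)

rows = conn.execute("SELECT * FROM genomicProteinDomainMap").fetchall()
print(rows)
```

With the foreign key enabled, a domain row whose transcriptID is not in knownGene is rejected, which keeps the map consistent with the gene table.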

In any case, this table would have a good deal of redundant data, but it 
would accurately represent the reality of genomically encoded protein 
domain information.
That is, any given genomic position can be involved in zero to several 
genes, and in numerous mRNA splice variants.  It follows that any given 
genomic position can also be involved in zero to several protein 
domains.  I'd rather you report the full depth of information and let 
the users filter what they want, rather than try to find a 
representative dataset.
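
To illustrate the "zero to several" point: a lookup against such a table for a single position can return no rows, one row, or many. A minimal sketch, again with Python's sqlite3 and made-up rows. The transcript names, the second domain, the chromStart/chromEnd column names, and the 0-based half-open interval convention are all my assumptions, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE genomicProteinDomainMap (
    transcriptID TEXT, chromosome TEXT,
    chromStart INTEGER, chromEnd INTEGER,  -- "start"/"end" in the outline
    domainName TEXT, databaseName TEXT
)""")

# Made-up rows: two splice variants of one gene carry the same kinase
# domain, and one of them also carries an overlapping second domain.
conn.executemany(
    "INSERT INTO genomicProteinDomainMap VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("ABCD.1", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
        ("ABCD.2", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
        ("ABCD.2", "chrX", 100500, 100700, "X-binding-site", "uniprot"),
    ],
)

def domains_at(conn, chrom, pos):
    """Return every (transcriptID, domainName, databaseName) covering pos.

    Assumes 0-based, half-open intervals: chromStart <= pos < chromEnd.
    """
    return conn.execute(
        """SELECT transcriptID, domainName, databaseName
           FROM genomicProteinDomainMap
           WHERE chromosome = ? AND chromStart <= ? AND ? < chromEnd
           ORDER BY transcriptID, domainName""",
        (chrom, pos, pos),
    ).fetchall()

print(domains_at(conn, "chrX", 100000))  # outside every domain
print(domains_at(conn, "chrX", 100300))  # one domain, seen on two transcripts
print(domains_at(conn, "chrX", 100520))  # two different domains overlap here
```

The redundancy is deliberate: the same kinase domain shows up once per transcript, and it is left to the user to collapse or filter those rows.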

Actually, with the simple table described above, I could join with 
kgCanonical to get a representative set, or I could just join with 
kgXref and knownGene to get the domains encoded in all possible gene 
transcripts at any given location (the latter is what I am really 
looking for).
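
The two joins might look roughly like this. Again a sketch with Python's sqlite3 and made-up rows, not a tested query against the real tables: I am assuming kgCanonical has a `transcript` column naming the representative transcript, I use only the kgID and geneSymbol columns of kgXref, and I leave out the further join to knownGene for brevity. The domain table uses chromStart/chromEnd for its start/end columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genomicProteinDomainMap (
    transcriptID TEXT, chromosome TEXT,
    chromStart INTEGER, chromEnd INTEGER,
    domainName TEXT, databaseName TEXT
);
CREATE TABLE kgCanonical (transcript TEXT);        -- assumed layout
CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);  -- subset of columns

-- Made-up rows: gene ABCD has two transcripts, ABCD.1 is canonical.
INSERT INTO genomicProteinDomainMap VALUES
    ('ABCD.1', 'chrX', 100250, 100550, 'Tyrosine-Kinase', 'pFam'),
    ('ABCD.2', 'chrX', 100250, 100550, 'Tyrosine-Kinase', 'pFam'),
    ('ABCD.2', 'chrX', 100600, 100700, 'X-binding-site', 'uniprot');
INSERT INTO kgCanonical VALUES ('ABCD.1');
INSERT INTO kgXref VALUES ('ABCD.1', 'ABCD'), ('ABCD.2', 'ABCD');
""")

# Representative set: keep only domains on canonical transcripts.
representative = conn.execute("""
    SELECT d.transcriptID, d.domainName
    FROM genomicProteinDomainMap AS d
    JOIN kgCanonical AS c ON c.transcript = d.transcriptID
    ORDER BY d.transcriptID, d.chromStart
""").fetchall()

# Full depth: every domain on every transcript, labelled with its gene.
everything = conn.execute("""
    SELECT x.geneSymbol, d.transcriptID, d.domainName
    FROM genomicProteinDomainMap AS d
    JOIN kgXref AS x ON x.kgID = d.transcriptID
    ORDER BY d.transcriptID, d.chromStart
""").fetchall()

print(representative)
print(everything)
```

Either view comes from the same underlying table, which is why I would store the full per-transcript data and derive the representative set by a join.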

I hope this is helpful.

As for the Superfamily track: it appears from the Genome Browser that 
the track does not mark the regions of the domains, but highlights the 
complete gene.  Or am I not using the correct track?


Cheers,
John



Fan Hsu wrote:

>Hi John,
>
>Others have expressed the same wish before.
>
>One challenge is to get one representative protein
>sequence for a domain.  I often find that we have 
>many candidates to represent a domain.  Do you 
>have any suggestions on this?
>
>In the meantime, you may want to consider 
>using the Superfamily track to see if 
>it is helpful. 
>
>Fan.
>-----Original Message-----
>From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On
>Behalf Of John Major
>Sent: Friday, March 02, 2007 10:11 AM
>To: Ann Zweig
>Cc: genome at soe.ucsc.edu
>Subject: Re: [Genome] Pfam foreign keys in data tables // mapping
>protein domains to genomic coordinates
>
>
>Hi Ann-
>
>Thank you for the prompt reply.
>
>I can get the genomic positions of known genes, but what I really need 
>is the specific genomic coordinates of the domains *within* those genes. 
>I think the information you provided will only allow me to determine if 
>a specific gene contains a certain domain?  What I need is more detail 
>still.
>
>A hypothetical gene ABCD is one exon on chromosome X from 100000 -> 101000. 
>The gene has a Pfam tyrosine kinase domain in the middle of it, which 
>maps to the genomic coordinates chrX 100250 -> 100550.
>I'd like to be able to extract a table that looks like:
>Chrm  startPos  endPos  DomainNAME       databaseNAME
>chrX  100250    100550  Tyrosine-Kinase  pFam
>
>
>What I need to know is the genomic position of the protein domain. 
>And in reality, I'd like to get the genomic positions for all of the 
>protein domains for InterPro and Pfam for hg17 and hg18.
>
>
>Thanks,
>John
>
>
>
>
>Ann Zweig wrote:
>
>  
>
>>Hello John,
>>
>>    The missing link is the knownToPfam table in the hg18 (or 
>>whichever assembly you are working in) database.  This table is the 
>>link  between the knownGene table and the Pfam tables.
>>
>>$db.knownGene.name == knownToPfam.name
>>
>>knownToPfam:
>>name        value
>>NM_001005484    PF00001
>>BC024295    PF07647
>>
>>    The $db.knownGene table has fields for chromosomal positions.
>>
>>    You will find the domain type in the pfamDesc table: 
>>proteome.pfamDesc.description.
>>
>>    The pfamXref table includes several types of ID values.
>>
>>pfamXref:
>>pfamAC    swissAC    swissDisplayID
>>PF00001    O00155    GPR25_HUMAN
>>PF00001    O00254    PAR3_HUMAN
>>PF00001    O00270    GPR31_HUMAN
>>
>>    You should be able to put these pieces together to mine for the 
>>exact information you are looking for.
>>
>>    If this is not enough detail for you, please do not hesitate to 
>>write back to the list and ask for more information.
>>
>>Regards,
>>
>>----------
>>Ann Zweig
>>UCSC Genome Bioinformatics Group
>>http://genome.ucsc.edu
>>
>>
>>Please feel free to search the Genome mailing list archives by 
>>visiting our home page, clicking on "Contact Us", then typing a word 
>>or phrase into the search box.  On that same page 
>>(http://genome.ucsc.edu/contacts.html), you can subscribe to the 
>>Genome mailing list.
>>
>>
>>John Major wrote:
>>
>>    
>>
>>>Hello-
>>>
>>>I am trying to map the Pfam protein domains to genomic coordinates 
>>>and am having some problems.
>>>I see that in the proteome tables, there are 2 obvious Pfam tables: 
>>>pfamDesc and pfamXref.
>>>Neither of these tables appears to be linked to other tables... or at 
>>>least the table description pages do not offer any information as to 
>>>which tables these 2 link to.
>>>Also, I do not see a table which gives the start and end 
>>>coordinates for the Pfam domains (in protein, mRNA, or genomic space).
>>>
>>>What I would like to get is a simple table of domain information in 
>>>genomic coordinate space. Ie:
>>>GenomeBuildID  Chrm  Start   End     ProteinDomainName  SourceDatabase
>>>hg18           chr1  100000  100050  Protein-Kinase     pFam
>>>hg18           chr2  200010  200090  X-binding-site     uniprot
>>>
>>>
>>>I would like to get this info for both uniprot and pfam.  The uniprot 
>>>tables (uniprot.feature and uniprot.description) appear to be linked 
>>>to kgXref via acc->spid.  And I should be able to derive genomic 
>>>coordinates for the uniprot features via these tables.
>>>
>>>
>>>If you have any advice on an easier way to get this mapping of 
>>>domains to genomic coordinates, I'd be thrilled to hear it.  
>>>Otherwise, could you please advise me on the pfam tables.
>>>
>>>Thanks!
>>>John Major
>>>_______________________________________________
>>>Genome maillist  -  Genome at soe.ucsc.edu
>>>http://www.soe.ucsc.edu/mailman/listinfo/genome


