[Genome] Pfam foreign keys in data tables // mapping protein domains to genomic coordinates
John Major
major at cbio.mskcc.org
Fri Mar 2 10:56:21 PST 2007
Hello Fan-
I'd suggest that you do not try to find a representative protein, but
think about the problem purely in the genome space.
Let users tease apart which protein isoforms they are interested in.
I'd think a data model similar to this might work:
Table: genomicProteinDomainMap
    transcriptID   (foreign key to knownGene.name)
    chromosome
    start
    end
    domainName*
    databaseName*
*These two columns could be abstracted into keys to database-specific
tables, but for the sake of clarity I collapsed them.
In any case, this table would hold a good deal of redundant data, but it
would accurately represent the reality of genome-encoded protein domain
information.
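
To make the proposed layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration only: the column types, the choice of SQLite, and the transcript IDs ABCD.1 and ABCD.2 are assumptions and not part of any UCSC schema. The coordinates reuse the hypothetical chrX example quoted further down in this thread.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")

# "end" is a reserved word in SQLite, so it is quoted.
conn.execute("""
    CREATE TABLE genomicProteinDomainMap (
        transcriptID TEXT NOT NULL,     -- would point at knownGene.name
        chromosome   TEXT NOT NULL,
        start        INTEGER NOT NULL,
        "end"        INTEGER NOT NULL,
        domainName   TEXT NOT NULL,
        databaseName TEXT NOT NULL
    )
""")

# Hypothetical rows: two made-up transcripts of one gene encode the same
# domain, so the domain is stored once per transcript. This is the
# redundancy the table accepts on purpose.
rows = [
    ("ABCD.1", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
    ("ABCD.2", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
]
conn.executemany(
    "INSERT INTO genomicProteinDomainMap VALUES (?, ?, ?, ?, ?, ?)", rows
)

stored = conn.execute(
    'SELECT transcriptID, chromosome, start, "end", domainName, databaseName '
    "FROM genomicProteinDomainMap ORDER BY transcriptID"
).fetchall()
for row in stored:
    print(row)
```

Splitting domainName and databaseName out into their own keyed tables, as the footnote suggests, would remove some of the repetition at the cost of an extra join.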
That is, any given genomic position can be involved in zero to several
genes, and in numerous mRNA splice variants. It follows that any given
genomic position can also be involved in zero to several protein
domains. I'd rather you report the full depth of information and let
users filter what they want than try to find a representative dataset.
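
The one-position, many-domains point can be sketched in plain Python. This assumes rows shaped like the table above and half-open intervals (start inclusive, end exclusive). The DomainRow type, the domains_at function, and the sample rows (including the SH2 row and its coordinates) are all hypothetical.

```python
from typing import List, NamedTuple


class DomainRow(NamedTuple):
    transcript_id: str
    chromosome: str
    start: int   # inclusive (assumed convention)
    end: int     # exclusive (assumed convention)
    domain_name: str
    database_name: str


def domains_at(rows: List[DomainRow], chromosome: str, position: int) -> List[DomainRow]:
    """Return every row whose interval covers the given position."""
    return [
        r for r in rows
        if r.chromosome == chromosome and r.start <= position < r.end
    ]


# Made-up data: one domain shared by two transcripts, plus a second
# domain that overlaps it on one of those transcripts.
rows = [
    DomainRow("ABCD.1", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
    DomainRow("ABCD.2", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
    DomainRow("ABCD.2", "chrX", 100500, 100700, "SH2", "pFam"),
]

hits = domains_at(rows, "chrX", 100520)     # covered by all three rows
no_hits = domains_at(rows, "chrX", 100900)  # covered by none
```

Nothing is collapsed here: the caller sees every transcript and domain at the position and can filter afterwards.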
In fact, with the simple table described above, I could join with
kgCanonical to get a representative set, or I could just join with
kgXref and knownGene to get the domains encoded in all possible gene
transcripts at any given location (the latter is what I am really
looking for).
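
A rough sketch of those two joins, again with sqlite3 and toy data. The canonicalSet table is a hypothetical stand-in for the canonical-transcript table mentioned above, and the kgXref stand-in carries only two columns (kgID and geneSymbol, assumed here to match the real table). The knownGene join is left out because the toy domain table already holds coordinates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genomicProteinDomainMap (
        transcriptID TEXT, chromosome TEXT, start INTEGER, "end" INTEGER,
        domainName TEXT, databaseName TEXT
    );
    -- Toy stand-ins: only the columns needed for the joins.
    CREATE TABLE canonicalSet (transcript TEXT);
    CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
""")
conn.executemany(
    "INSERT INTO genomicProteinDomainMap VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("ABCD.1", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
        ("ABCD.2", "chrX", 100250, 100550, "Tyrosine-Kinase", "pFam"),
    ],
)
conn.execute("INSERT INTO canonicalSet VALUES ('ABCD.1')")
conn.executemany(
    "INSERT INTO kgXref VALUES (?, ?)",
    [("ABCD.1", "ABCD"), ("ABCD.2", "ABCD")],
)

# Representative set: keep only domains on the canonical transcript.
representative = conn.execute("""
    SELECT d.transcriptID, d.domainName
    FROM genomicProteinDomainMap AS d
    JOIN canonicalSet AS c ON c.transcript = d.transcriptID
""").fetchall()

# Full depth: every transcript with a domain covering a given position
# (half-open intervals assumed).
position = 100300
all_transcripts = conn.execute("""
    SELECT x.geneSymbol, d.transcriptID, d.domainName
    FROM genomicProteinDomainMap AS d
    JOIN kgXref AS x ON x.kgID = d.transcriptID
    WHERE d.chromosome = ? AND d.start <= ? AND ? < d."end"
    ORDER BY d.transcriptID
""", ("chrX", position, position)).fetchall()

print(representative)
print(all_transcripts)
```

The same table serves both uses: the first query narrows to one transcript per gene, the second reports every transcript at the location.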
I hope this is helpful.
As for the Superfamily track: it appears from the Genome Browser
that the track does not specify the regions of domains, but highlights
the complete gene? Or am I not using the correct track?
Cheers,
John
Fan Hsu wrote:
>Hi John,
>
>Others have expressed the same wish before.
>
>One challenge is to get one representative protein
>sequence for a domain. I often find that we have
>many candidates to represent a domain. Do you
>have any suggestions on this?
>
>In the meantime, you may want to try
>the Superfamily track to see if
>it is helpful.
>
>Fan.
>-----Original Message-----
>From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On
>Behalf Of John Major
>Sent: Friday, March 02, 2007 10:11 AM
>To: Ann Zweig
>Cc: genome at soe.ucsc.edu
>Subject: Re: [Genome] Pfam foreign keys in data tables // mapping
>protein domains to genomic coordinates
>
>
>Hi Ann-
>
>Thank you for the prompt reply.
>
>I can get the genomic positions of known genes, but what I really need
>is the specific genomic coordinates of the domains *within* those genes.
>I think the information you provided will only allow me to determine if
>a specific gene contains a certain domain? What I need is more detail
>still.
>
>A hypothetical gene ABCD is one exon on chromosomeX from 100000 -> 101000.
>The gene has a pfam Tyrosine kinase domain in the middle of it, which
>maps to the genomic coordinates chrX 100250 -> 100550.
>I'd like to be able to extract a table that looks like:
>Chrm  startPos  endPos  DomainNAME       databaseNAME
>chrX  100250    100550  Tyrosine-Kinase  pFam
>
>
>What I need to know is the genomic position of the protein domain.
>In reality, I'd like to get the genomic positions of all of the
>InterPro and Pfam protein domains for hg17 and hg18.
>
>
>Thanks,
>John
>
>
>
>
>Ann Zweig wrote:
>
>
>
>>Hello John,
>>
>> The missing link is the knownToPfam table in the hg18 (or
>>whichever assembly you are working in) database. This table is the
>>link between the knownGene table and the Pfam tables.
>>
>>$db.knownGene.name == knownToPfam.name
>>
>>knownToPfam:
>>name          value
>>NM_001005484  PF00001
>>BC024295      PF07647
>>
>> The $db.knownGene table has fields for chromosomal positions.
>>
>> You will find the domain type in the pfamDesc table:
>>proteome.pfamDesc.description.
>>
>> The pfamXref table includes several types of ID values.
>>
>>pfamXref:
>>pfamAC   swissAC  swissDisplayID
>>PF00001  O00155   GPR25_HUMAN
>>PF00001  O00254   PAR3_HUMAN
>>PF00001  O00270   GPR31_HUMAN
>>
>> You should be able to put these pieces together to mine for the
>>exact information you are looking for.
>>
>> If this is not enough detail for you, please do not hesitate to
>>write back to the list and ask for more information.
>>
>>Regards,
>>
>>----------
>>Ann Zweig
>>UCSC Genome Bioinformatics Group
>>http://genome.ucsc.edu
>>
>>
>>Please feel free to search the Genome mailing list archives by
>>visiting our home page, clicking on "Contact Us", then typing a word
>>or phrase into the search box. On that same page
>>(http://genome.ucsc.edu/contacts.html), you can subscribe to the
>>Genome mailing list.
>>
>>
>>John Major wrote:
>>
>>
>>
>>>Hello-
>>>
>>>I am trying to map the pfam protein domains to genomic coordinates
>>>and am having some problems.
>>>I see that in the proteome tables, there are 2 obvious pfam tables:
>>>pfamDesc and pfamXref.
>>>Neither of these tables appears to be linked to other tables, or at
>>>least the table description pages do not offer any information as to
>>>which tables these two link to.
>>>Also, I do not see a table that gives the start and end
>>>coordinates for the Pfam domains (in protein, mRNA, or genomic space).
>>>
>>>What I would like to get is a simple table of domain information in
>>>genomic coordinate space. Ie:
>>>GenomeBuildID  Chrm  Start   End     ProteinDomainName  SourceDatabase
>>>hg18           chr1  100000  100050  Protein-Kinase     pFam
>>>hg18           chr2  200010  200090  X-binding-site     uniprot
>>>
>>>
>>>I would like to get this info for both uniprot and pfam. The uniprot
>>>tables (uniprot.feature and uniprot.description) appear to be linked
>>>to kgXref via acc->spid. And I should be able to derive genomic
>>>coordinates for the uniprot features via these tables.
>>>
>>>
>>>If you have any advice on an easier way to get this mapping of
>>>domains to genomic coordinates, I'd be thrilled to hear it.
>>>Otherwise, could you please advise me on the pfam tables.
>>>
>>>Thanks!
>>>John Major
>>>_______________________________________________
>>>Genome maillist - Genome at soe.ucsc.edu
>>>http://www.soe.ucsc.edu/mailman/listinfo/genome
>>>
>>>
>>
>>
>