[Genome] knownToPfam, DNA-coordinates of domains
Ann Zweig
ann at soe.ucsc.edu
Fri Jan 11 12:01:57 PST 2008
Hello Anton,
In general your approach seems sound. My guess is that our database
tables are out of sync with the current data at Pfam.
Note that the hg17.knownToPfam table was built on 2005-07-14, whereas the
hg18 Known Genes track was based on UniProt from around 2007-02-02.
I realize that you are using hg17 because you have annotations and
mappings in that coordinate system. You might consider lifting those
over to hg18 (using our liftOver tool) and then working with hg18 and its
more recent association with Pfam.
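As a rough sketch of that liftOver step, assuming your annotations are in
BED format and that you have downloaded the hg17ToHg18.over.chain file, a
small Python wrapper could look like the one below. The file names are
placeholders and the helper names (liftover_command, run_liftover) are made
up for illustration; only the argument order (oldFile map.chain newFile
unMapped) comes from the tool's usage message.

```python
import subprocess


def liftover_command(bed_in, chain, bed_out, unmapped, binary="liftOver"):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return [binary, bed_in, chain, bed_out, unmapped]


def run_liftover(bed_in, chain, bed_out, unmapped):
    """Run liftOver; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(liftover_command(bed_in, chain, bed_out, unmapped), check=True)


# Hypothetical usage (placeholder file names):
# run_liftover("annotations.hg17.bed", "hg17ToHg18.over.chain",
#              "annotations.hg18.bed", "annotations.unmapped.bed")
```

Records that cannot be converted end up in the unmapped file, so it is worth
checking how many of your annotations land there.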
As for your specific example, when you look up YU20_HUMAN in UniProt
you see the following:
query: YU20_HUMAN
The ID: YU20_HUMAN is no longer valid.
YU20_HUMAN (associated with primary accession number: Q9Y3T9 from
release 39.10) was renamed to NOC2L_HUMAN in release 47.8.
The following queries against hg17 and hg18 show the difference:
mysql> select kgId,spDisplayID from hg17.knownToPfam, hg17.kgXref where
value='PF03715' and kgId=name;
+-----------+--------------+
| kgId      | spDisplayID  |
+-----------+--------------+
| AK022756  | Q9H9J5_HUMAN |
| AK092491  | Q8NAK7_HUMAN |
| NM_015658 | YU20_HUMAN   |
| BC003555  | Q9H9J5_HUMAN |
+-----------+--------------+
mysql> select kgId,spDisplayID from hg18.knownToPfam, hg18.kgXref where
value='PF03715' and kgId=name;
+------------+--------------+
| kgId       | spDisplayID  |
+------------+--------------+
| uc001aby.1 | Q9H9J5_HUMAN |
+------------+--------------+
mysql> select * from hg18.knownToPfam where value='PF03715';
+------------+---------+
| name       | value   |
+------------+---------+
| uc001aby.1 | PF03715 |
+------------+---------+
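Since the display ID changed while the primary accession (Q9Y3T9) stayed the
same, matching on accessions instead of IDs may recover some of your missing
pairs. Below is a minimal Python sketch, assuming that Pfam-A.full is in
Stockholm format with one "#=GF AC" line per family and
"#=GS <name>/<start>-<end> AC <accession>" lines per sequence, and that you
take the accession from the spID column of kgXref. The function names are
hypothetical, and I have not run this against your Pfam release.

```python
from collections import defaultdict


def pfam_accessions(lines):
    """Map each Pfam family accession (version stripped) to the set of
    UniProt accessions listed in its '#=GS ... AC ...' lines."""
    members = defaultdict(set)
    family = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#=GF AC"):
            # e.g. "#=GF AC   PF03715.<version>" -> "PF03715"
            family = line.split()[2].split(".")[0]
        elif line.startswith("#=GS"):
            fields = line.split()
            # fields: ["#=GS", "<name>/<start>-<end>", "AC", "<acc>.<version>"]
            if family is not None and len(fields) >= 4 and fields[2] == "AC":
                members[family].add(fields[3].split(".")[0])
        elif line.startswith("//"):
            # "//" ends one family's alignment
            family = None
    return members


def missing_pairs(pairs, members):
    """Return the (kgId, pfamAcc, uniprotAcc) tuples whose UniProt accession
    is not listed under the given Pfam family."""
    return [p for p in pairs if p[2] not in members.get(p[1], set())]
```

You could call pfam_accessions(open("Pfam-A.full", encoding="latin-1")) and
pass in the tuples from a join of knownToPfam and kgXref. Pairs that are
still missing after this are more likely to reflect real differences between
the 2005 table and the current Pfam release.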
I hope this is helpful to you.
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
Anton Kratz wrote:
> Dear UCSC team,
>
> Context of my question: I am trying to get protein domain coordinates in DNA
> space for the domains listed in the knownToPfam table.
> Basically what my program does is for each name-value-pair of the
> knownToPfam table, it looks up (in the knownGene table) the protein encoded
> by that isoform, then it looks up (in Pfam-A.full, a flatfile with the
> entire Pfam database) that domain's sequence and aligns that sequence back
> onto the human genome, using BLAT (locally) in translated mode.
>
> My problem is that for 13,636 of the 35,789 name-value-pairs in the
> knownToPfam table, I do not find the protein coded for by the respective
> isoform when searching for the protein in Pfam-A.full.
>
> Example: according to knownToPfam, NM_015658 contains the domain PF03715.
> According to the knownGene table, NM_015658 encodes the protein YU20_HUMAN.
> So I am looking for YU20_HUMAN under the entry for PF03715 in Pfam-A.full.
> But it's not there and thus not part of the multiple alignment.
>
> I would be very thankful if you have an idea what's going wrong in this
> approach.
>
> regards,
> Anton
>
> P.S.: Everything I am using is for hg17 because I have other mappings only
> available for hg17 which are not available for hg18.
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome