[Genome] knownToPfam, DNA-coordinates of domains

Fan Hsu fanhsu at soe.ucsc.edu
Mon Jan 14 09:06:30 PST 2008


Hi Anton,

You can use the kgXref table.  Its spDisplayID field is the UniProt display ID
(formerly the SwissProt display ID).  For hg18, some representative proteins
of UCSC Genes (built by our new Known Gene III process) are RefSeq proteins
instead of UniProt proteins.

We do not have a precise mapping between RefSeq and UniProt proteins.
I did notice that the current UniProt database has a RefSeq field, so
you may be able to get a better mapping from data downloaded from
UniProt.
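
A minimal sketch of that idea, under stated assumptions: it supposes the
UniProt download is a three-column, tab-separated ID-mapping file
(accession, ID type, ID) in which the ID types "UniProtKB-ID" and "RefSeq"
appear.  The function name and the RefSeq ID NP_000000.1 are hypothetical
placeholders; only the Q9Y3T9 / NOC2L_HUMAN pair comes from this thread.

```python
import io


def build_refseq_to_display_id(lines):
    """Map RefSeq protein IDs (version stripped) to UniProt display IDs.

    `lines` is any iterable of tab-separated rows: accession, ID type, ID.
    The ID-type labels below are an assumption about the file format.
    """
    display_id = {}  # UniProt accession -> display ID
    refseq_ids = {}  # UniProt accession -> list of RefSeq protein IDs
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        accession, id_type, value = fields
        if id_type == "UniProtKB-ID":
            display_id[accession] = value
        elif id_type == "RefSeq":
            refseq_ids.setdefault(accession, []).append(value)

    mapping = {}
    for accession, ids in refseq_ids.items():
        if accession not in display_id:
            continue  # no display ID known for this accession
        for refseq in ids:
            # Drop the version suffix (".1") so lookups work on bare IDs.
            mapping[refseq.split(".")[0]] = display_id[accession]
    return mapping


# Illustrative rows only; NP_000000.1 is a placeholder, not a real ID.
SAMPLE = (
    "Q9Y3T9\tUniProtKB-ID\tNOC2L_HUMAN\n"
    "Q9Y3T9\tRefSeq\tNP_000000.1\n"
)
mapping = build_refseq_to_display_id(io.StringIO(SAMPLE))
print(mapping)
```

A RefSeq ID that UniProt lists under more than one accession keeps only the
last display ID seen here, so a real run may want to collect all candidates
instead.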

Fan.
-----Original Message-----
From: genome-bounces at soe.ucsc.edu [mailto:genome-bounces at soe.ucsc.edu]On
Behalf Of Anton Kratz
Sent: Sunday, January 13, 2008 4:38 AM
To: genome at soe.ucsc.edu
Subject: Re: [Genome] knownToPfam, DNA-coordinates of domains


Dear UCSC Team,

Thank you very much for your explanations.

I have a follow-up question: how can I map/convert UniProt accession IDs and
RefSeq protein IDs to SWISSPROT IDs?

I decided to repeat my project based on hg18. But with the new UCSC Genes,
the proteins are given in a mix of UniProt accession IDs and RefSeq protein
IDs (unlike UCSC KnownGenes where they are given as SWISSPROT IDs).

Pfam uses SWISSPROT IDs, so to look up proteins encoded by UCSC Genes I
somehow need to convert these IDs (UniProt accession IDs and RefSeq protein
IDs) to SWISSPROT IDs.

best regards,
Anton

2008/1/12, Ann Zweig <ann at soe.ucsc.edu >:
>
> Hello Anton,
>
>         In general your approach seems sound.  My guess is that the
> databases are out-of-sync with the actual data at Pfam.  Note that the
> hg17.knownToPfam table was built on 2005-07-14, and that the hg18 Known
> Gene track was based on UniProt around 2007-02-02.
>
>         I realize that you are using hg17 because you have annotations and
> mappings in that coordinate system.  You might consider lifting those
> over to hg18 (using our liftOver tool) and then using hg18 and its more
> recent association with pfam.
>
>         As for your specific example, when you look up YU20_HUMAN in
> UniProt you see the following:
>
>         query: YU20_HUMAN
>
>         The ID: YU20_HUMAN is no longer valid.
>
>         YU20_HUMAN (associated with primary accession number: Q9Y3T9 from
> release 39.10) was renamed to NOC2L_HUMAN in release 47.8.
>
>         And the following queries of hg17 and hg18 show:
>
> mysql> select kgId,spDisplayID from hg17.knownToPfam, hg17.kgXref where
> value='PF03715' and kgId=name;
> +-----------+--------------+
> | kgId      | spDisplayID  |
> +-----------+--------------+
> | AK022756  | Q9H9J5_HUMAN |
> | AK092491  | Q8NAK7_HUMAN |
> | NM_015658 | YU20_HUMAN   |
> | BC003555  | Q9H9J5_HUMAN |
> +-----------+--------------+
>
>
> mysql> select kgId,spDisplayID from hg18.knownToPfam, hg18.kgXref where
> value='PF03715' and kgId=name;
> +------------+--------------+
> | kgId       | spDisplayID  |
> +------------+--------------+
> | uc001aby.1 | Q9H9J5_HUMAN |
> +------------+--------------+
>
> mysql> select  * from hg18.knownToPfam where value='PF03715';
> +------------+---------+
> | name       | value   |
> +------------+---------+
> | uc001aby.1 | PF03715 |
> +------------+---------+
>
>
>         I hope this is helpful to you.
>
> Regards,
>
> ----------
> Ann Zweig
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu
>
>
>
> Anton Kratz wrote:
> > Dear UCSC team,
> >
> > Context of my question: I am trying to get protein domain coordinates in
> > DNA space for the domains listed in the knownToPfam table.
> > Basically what my program does is for each name-value-pair of the
> > knownToPfam table, it looks up (in the knownGene table) the protein
> > encoded by that isoform, then it looks up (in Pfam-A.full, a flatfile with
> > the entire Pfam database) that domain's sequence and aligns that sequence
> > back onto the human genome, using BLAT (locally) in translated mode.
> >
> > My problem is that for 13,636 of the 35,789 name-value-pairs in the
> > knownToPfam table, I do not find the protein coded for by the respective
> > isoform when searching for the protein in Pfam-A.full.
> >
> > Example: according to knownToPfam, NM_015658 contains the domain
> > PF03715.  According to the knownGene table, NM_015658 encodes the protein
> > YU20_HUMAN.  So I am looking for YU20_HUMAN under the entry for PF03715 in
> > Pfam-A.full.  But it's not there and thus not part of the multiple
> > alignment.
> >
> > I would be very thankful if you have an idea what's going wrong in this
> > approach.
> >
> > regards,
> > Anton
> >
> > P.S.: Everything I am using is for hg17 because I have other mappings
> > only available for hg17 which are not available for hg18.
> > _______________________________________________
> > Genome maillist  -  Genome at soe.ucsc.edu
> > http://www.soe.ucsc.edu/mailman/listinfo/genome
>



--
Dipl.-Inf. Anton Kratz

mobile phone:  +81 [0]80-309-489-78
email: anton.kratz at googlemail.com
Skype ID: anton.kratz


