CMP 243 Homework 2

Due: Tues. Jan. 27

Protein database searching and significance of a database hit, intro to HMMs

As a warmup in probability theory:

Q0: Answer the questions in exercises 1.1, 1.2 and 1.4 of the text.

Now here is the experimental part of the assignment. In the last assignment we looked at the most famous tumor suppressor protein, p53. Recently, a new human tumor suppressor protein, p73, was discovered whose sequence shows a relationship with p53 (Science, 12 Sept. 1997, vol. 277, pg. 1605). Details were published in the 22 August issue of Cell by Daniel Caput and friends. We want to see if, hypothetically, we could have found this new protein if it had just been deposited in one of the protein databases, by using bioinformatics tools to search these databases for sequences similar to p53.

We'll also pretend that we were participating in the CASP2 experiment in protein structure prediction. This was a kind of contest held in 1996 in which "target" protein sequences were proposed for which there was no known structure, and experts worldwide were asked to use their progams to predict the structure for these target sequences. Later, the structures were solved and everyone met to evaluate how well the methods did. The results are to appear in a special issue of Proteins. In the fold recognition category of CASP2, the participants were asked not to give the predicted 3d coordinates of the atoms in the target sequence, but only to search the PDB database of proteins sequences whose structure is known, and determine if there were any sequences in PDB that had structure similar to that of the target sequences.

The general task in both the p53 case and the CASP2 case is referred to as searching for homologous proteins, that is, searching for proteins that have similar structure and possibly similar function to a given protein. We will try three tools: search using a PROSITE pattern, BLAST search using the target sequence, and a hidden Markov model method being developed at UCSC. In this exercise we'll just try the HMM method, later we'll say more about what it is and how it works.

p53 is our first "target" sequence. We hope to use it to find the related protein p73. First let's get some information about p73.

  1. Go to Entrez
  2. Click on "Literature - PubMed"
  3. Change the "Search Field" to "Author Name". Enter "Caput D" and press search.
  4. Narrow search by adding "Journal Name" "Cell" to query.
  5. Retrieve the 2 documents.
  6. Click "Kaghad M, et al.", as this corresponds to the correct Cell issue.
  7. This gives the abstract to the Cell article. We're interested in the amino acid sequences, so click on the button "protein" at the top of the page.
  8. Click on "Display" to see the Genpept report for each relevant sequence.
  9. Check them out. These are the sequences that we want to try to find using computational methods, starting from the p53 family of tumor suppressors.

The database we will search is the nonredundant protein database "nr", maintained by NCBI. You can access this database remotely by going to NCBI's web page, or you can use the local copy we have here on the barnyard machines, which is updated every Sunday night. For the PROSITE pattern search, I suggest you just do the search locally. Log on to one of the barnyard machines as described in homework 1, make a directory to hold your results files, and in that directory execute the following command:


/projects/compbio/bin/scripts/findstringnrp MCNSSCMGGMNRR > p53.prosite.hits &

The Perl script "findstringnrp" (one of the utility scripts written by Richard Hughey) will search all the proteins in the "nr" protein database for any that contain a specified string, in this case the PROSITE pattern "MCNSSCMGGMNRR". For each protein it finds, it will produce as output the line in the nr database that describes that protein. This line includes the accension number(s) of the protein in the database(s) that it originally came from, and a short phrase describing the protein. The syntax "> p53.prosite.hits" at the end of the line is UNIX operating system notation for directing the output to the file named "p53.prosite.hits", which will be created in your directory. The ampersand at the end of the command puts the process into the background so that you can still use the command line. When the job completes, examine the output for the p73 sequences. Their accension numbers all contain the string "Y11416". (The full id is "HSY11416", where "HS" stands for "Human Sequence". In some instances the "HS" is dropped, so it is a safe bet to text search the file with just "Y11416".)

Q1: How many sequences does the search find? Does it find any of the the p73 sequences?

Q2: The nr database you just searched has 281,027 sequences and 85,284,362 total letters (residues). Assume that each sequence in the database has length at least 13, and that each residue in each sequence was generated independently at random according to the uniform distribution over the 20 amino acids. What is the expected number of occurrences of "MCNSSCMGGMNRR"? Note: count each occurrence separately, even if there is more than one occurrence in the same protein sequence. First give me the formula you are using to compute this number, then give me the number. That is, I want to know how you got your answer.

Q3: Repeat the above question, this time assuming that each residue in each sequence was generated independently at random according to the following probability distribution (approximately the background distribution of the amino acids in natural protein sequences):

 P(A) 0.08
 P(C) 0.01
 P(D) 0.05
 P(E) 0.06
 P(F) 0.04
 P(G) 0.07
 P(H) 0.02
 P(I) 0.06
 P(K) 0.06
 P(L) 0.10
 P(M) 0.03
 P(N) 0.04
 P(P) 0.05
 P(Q) 0.04
 P(R) 0.05
 P(S) 0.07
 P(T) 0.06
 P(V) 0.07
 P(W) 0.01
 P(Y) 0.03

Q4: Let us call the answer you got to question Q3 above the E-value, for expected number of occurences of the pattern in the (random) database. Let us define the P-value as the probability that there are one or more occurences of the pattern in the (random) database. Give me a good upper bound on the corresponding P-value, using the same assumptions as in question Q3.

Q5: In light of the above E-value or P-value, do you think that the matches you found to the pattern "MCNSSCMGGMNRR" in the nr database happened merely by chance, or do you think that they represent genuine hits to proteins in the p53 family, or in a similar family?

Now we will try BLAST. Retrieve the human p53 sequence and save it in a file or in a window. Go to the BLAST database search page, read as much of the BLAST help page as you can stand to read, and then click on Advanced BLAST. Choose the database to be nr. Select the blastp program, which searches a protein library with a protein sequence. Set "alignments" to 100. Now paste in the p53 sequence and click "Submit query." Look at the BLAST search results. You may want to save this file, say as the file "p53.blast.hits".

Search for the string "Y11416". There you find p73. You also find the two variants of p73 obtained by alternative splicing. So BLAST would have found p73. Under the column P(N), Blast gives the P-value for this hit, in this case N=3, indicating that this is really 3 separate hits that should be combined.

Q6: What is the P-value for this hit? In light of this P-value, do you worry that this hit may have happened merely by chance? Just to make sure, let's look at the alignment of p73 and p53. Search down to the lower part of the BLAST output to find this alignment. If you used the BLAST at NCBI, you find something like:

gnl|PID|e308621 (Y11416) P53-like transcription factor [Homo sapiens]
            Length = 636

 Score = 273 (123.9 bits), Expect = 1.9e-77, Sum P(3) = 1.9e-77
 Identities = 47/84 (55%), Positives = 64/84 (76%)

Query:    97 VPSYKTYQGDYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVNSTPPPGTR 156
             +PS   Y G + F + F  S TAKS T TYSP L KL+CQ+AKTCP+Q+ V++ PPPGT 
Sbjct:   115 IPSNTDYPGPHHFEVTFQQSSTAKSATWTYSPLLKKLYCQIAKTCPIQIKVSTPPPPGTA 174

Query:   157 VRAMAIYKKLQYMTEVVRRCPHHE 180
             +RAM +YKK +++T+VV+RCP+HE
Sbjct:   175 IRAMPVYKKAEHVTDVVKRCPNHE 198

 Score = 373 (169.3 bits), Expect = 1.9e-77, Sum P(3) = 1.9e-77
 Identities = 69/104 (66%), Positives = 85/104 (81%)

Query:   189 APPQHLIRVEGNLHAEYLDDKQTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNR 248
             AP  HLIRVEGN  ++Y+DD  T R SVVVPYEPP+VG++ TTI YN+MCNSSC+GGMNR
Sbjct:   209 APASHLIRVEGNNLSQYVDDPVTGRQSVVVPYEPPQVGTEFTTILYNFMCNSSCVGGMNR 268

Query:   249 RPILTIITLEDPSGNLLGRNSFEVRICACPGRDRRTEEKNFQKK 292
             RPIL IITLE   G +LGR SFE RICACPGRDR+ +E +++++
Sbjct:   269 RPILIIITLEMRDGQVLGRRSFEGRICACPGRDRKADEDHYREQ 312

 Score = 62 (28.1 bits), Expect = 1.9e-77, Sum P(3) = 1.9e-77
 Identities = 11/29 (37%), Positives = 20/29 (68%)

Query:   324 DGEYFTLKIRGHERFKMFQELNEALELKD 352
             D + + L++RG E F++  +L E+LEL +
Sbjct:   351 DEDTYYLQVRGRENFEILMKLKESLELME 379

p53 is the "query" sequence, and p73 is the "subject" sequence. Notice that where there are substitutions, they often are sensible substitutions (indicated by a "+"). Also note that BLAST gives the E-value here ("Expect = 1.9e-77") and it is the same as the P-value, as discussed in class.

Q7: Looking at this alignment, tell me why the string search of nr for the PROSITE pattern "MCNSSCMGGMNRR" failed to find p73. This shows the "brittleness" of the PROSITE patterns.

Now, let's try something harder: a CASP2 target sequence. Go to the CASP2 homepage. Find the link to the page describing the target sequences that you are supposed to predict structure for. Go there. Set the table display to show the top 31 targets. Click on target 31. That is the one we will try. You get a page giving information about t31. At the bottom, click on "Template Sequence file" and save this file as "t31.seq". This is a FASTA format file containing the amino acid sequence of t31. (This format is trivial: it is just a line beginning with ">" that has a free format description of the protein, followed by one or more lines that contain the sequence of amino acids in the protein. You can also concatenate such records together to get a FASTA format file holding information on several different proteins. We'll use this format later as well.) Now forget about PROSITE and go directly to BLAST. Do the same thing with t31 that you did with p53. However, this time set the database you search to "pdb" instead of "nr". (You can use basic BLAST this time, but feel free to try any of the more advanced versions of BLAST as well.) You should definitely find 1AGJ. It gets a great P-value. Look at the alignment and you see that this is t31 itself. Since the time of the contest, the structure of t31 has been solved an it has been deposited in the PDB database. You should get one or two other "hits". Look at their P-values, and their alignments.

Q8: List these other "hits" and their P-values, and for each one say if you think it is a real homology or a chance match.

Enough of BLAST. Let's try HMMs. Go to the temporary web page that Christian Barrett has set up for the HMM part of this assignment. Type in your email address and paste in the t31 sequence. Select NLL-NULL score cutoff: -20 and Number of alignments to include: 10 and click "Press Here to submit". As it says, igonre the "Document contains no data" message that may pop up after a while. It will take some time to get your answer back by email (say an hour or so, depending on which machine it runs on.) The script hmm-target98 takes the target sequence (given in a FASTA format file sepecified in the command line as above), finds some close relatives of it in the nr database using a variant of BLAST called WU-BLAST, then uses these to build and HMM, and scores sequences in the specified database (in this case PDB) with this HMM. For every hit that scores below a cutoff of -20, it lists the sequence name and the score for this hit in the output file, which is specified after the ">" in the command line. It also gives a multiple alignment for the top 10 hits at the bottom of the output file. This script was built by Kevin Karplus and Christian Barrett, using the SAM HMM system, primarily built by Richard Hughey. Look at the file you get back by email. The top two hits are t31 itself (1agjA and 1agjB). These are two identical chains from the same PDB file 1AGJ. The score for each hit is in the column labeled "NLL-NULL". For SAM, negative scores are good. We'll say more about these scores in a later assignment. Then there are several more hits with less impressive scores. One of these should be 1TRY. Since we don't yet know how to interpret these scores, let's look at the bottom of the file and check out the alignment. I did this myself with an old version of the program and got this alignment.


t0031; ......EVSAEEIKKHEEKWNKYYGVNAFNLPKELFSKVDEKDRQKYPYNTIGNVFVKGQTSA
1try   ivgg..-----------------------------------TSASAGDFPFIVSISRNGGPWC
1hylB  ii24qd---------------------------------------------------QRRVWC
1elt   vv27yy------------------------------------------------------HTC
2cp1   ii22kd--------------------------------------------------QQPEAIC
1bit   fv29ah----------------------------------------------QVSLNSGYHFC
1sgpE  is26ty---------------------------------------------------------
2cgaB  cg39gf------------------------------------------------------HFC
1cghA  ii18yl----------------------------------------------QIQSPAGQSRC


        60        70        80        90       100       110       120
         |         |         |         |         |         |         |
t0031; TGVLIGKNTVLTNRHIAKFANGDPSKVSFRPSINTDDNGNTETPYGEYEVKEILQEPFGAGVD
1try   GGSLLNANTVLTAAHCVSGYAQSGFQIRAGSLSRTSGGITSSLSSVRVHPSYSGN-----NND
1hylB  GGSLIDNKWILTAAHCVHDAVSVVVYLGSAVQYEGEA-----VVNSERIISHSMFNPDTYLND
1elt   GGSLIRQGWVMTAAHCVDSARTWRVVLGEHNLNTNEGKEQIMTVNSVFIHSGWNSDDVAGGYD
2cp1   GGFLIREDFVLTAAHCEGSIINVTLGAHNIKEQEKTQQVIPMVKCIPHP----DYNPKTFSND
1bit   GGSLVNENWVVSAAHCYKSRVEVRLGEHNIKVTEGSEQ----FISSSRVIRHPNYSSYNIDND
1sgpE  --------YFLTAGHCTDGATTWWANS---------------ARTTVLGTTSGSS---FPNND
2cgaB  GGSLINENWVVTAAHCGVTTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSL---TINND
1cghA  GGFLVREDFVLTAAHCWG--SNINVTLGAHNIQRRENTQQHITARRAIRHPQYNQRTI--QND


              130       140        150       160                 170
                |         |          |         |                   |
t0031; LALIRLKPDQNGVSLGDKISPAKI.GTSNDLKDGDKLELIGY....PFDHKV......NQMHR
1try   LAILKLSTS---IPSGGNIGYARLaASGSDPVAGSSATVAGW....GATSEGgs9nllKVTVP
1hylB  VALIKIPH----VEYTDNIQPIRL.PSGE-------------....------......-----
1elt   IALLRLNTQ---ASLNSAVQLAAL.PPSN-------------....------......-----
2cp1   IMLLKLKSKAKRTRAVRPLNLPRR.NVNV--KPGDVCYVAGWgrmaPMGKYS......NTLQE
1bit   IMLIKLSKP---ATLNTYVQPVAL.PTSC-APAGTMCTVSGW....GNTMSS......TADSN
1sgpE  YGIVRYTNTTIPKD--GTVGGQDI.TSAANATVGMAVTRRGS....TTGTHS......GSVTA
2cgaB  ITLLKLSTA---ASFSQTVSAVCLpSASDDFAAGTTCVTTGW....GLTRYT......NANTP
1cghA  IMLLQLSRR---VRRNRNVNPVALpRAQEGLRPGTLCTVAGW....G-----......-----

                  180       190       200          210       220
                    |         |         |            |         |
t0031; SEIELTT......LSRGLRYYGFTVPGNSGSGIF...NSNGELVGIHSSKVSHLDREHQINYG
1try   IVSRATCra14mfCAGVSSGGKDSCQGDSGGPIV...DSSNTLIGAVSW---------GNGCA
1hylB  -------......---------------------...--------------------------
1elt   -------......---------------------...--------------------------
2cp1   VELTVQKdr18qiCAGDPKTKRASFRGDSGGPLV...--------------------------
1bit   KLQCLNIpi21mfCAGYLEGGKDSCQGDSGGPVV...-CNGELQGVVSWGYG-----------
1sgpE  LNATVNYgggdv.VYGMIRTNVCAEPGDSGGPLY...-SGTRAIGLTSGGSG-----------
2cgaB  DRLQQASlp20daMICAGASGVSSCMGDSGGPLVckkNGAWTLVGIVSWGSS-----------
1cghA  -------......---------------------...--------------------------


        230       240
          |         |
t0031; VGIGNYVKRIINEKNE-.......
1try   RPNYSGVYASVGALRS-fidtya.
1hylB  -----------------e118kf.
1elt   -----------------q118im.
2cp1   -----------------ck39ss.
1bit   -----------------ca27sy.
1sgpE  -----------------nc25vy.
2cgaB  -----------------tc27an.
1cghA  -----------------rv96rs.

Here I have editied the alignment to remove redundant sequences. You will probably get different sequences and a different alignment. Some of these segments are hard to align, being so far diverged from the t31 sequence.

Q9: Referring to the alignment above (the old one I got): Are there parts of this alignment that look convincing, and if so, which parts? Do you think these sequences are remote homologs of t31, or are they just chance matches? To check into this further, go the FSSP database of protein structure classification. Type in 1AGJ and then click on 1agj-A. You get a selected set of PDB sequences that are structurally related to 1AGJ. For each PDB structure you see its 4 letter PDB identifier (and possibly another letter to identify a particular chain), a "Z" score indicating how similar its structure is to that of 1put (a high Z score, say Z > 7, indicates close structural relationship), the length of the part of both sequences that can be aligned ("LALI"), the total length of the second sequence ("LSEQ2"), the percentage of amino acids in one that are the same as the corresponding amino acids in the other sequence ("%IDE"), and the common protein name for the second sequence. 1TRY is not listed here, but it is closely related to 5ptp, which is listed. Click on 5ptp. Find 1try. Z score of 38.8 and the long length of the alignment (covering essentially the entire protein) indicates that the structures of 1try and 5ptp are very similar. Since the Z score for 1agj-A and 5ptp was 19.3 with a similarly long alignment, this indicates that there is probably a good structural match between 1agj-A and 1try. (In fact, it was shown at the CASP2 meeting that there is a good structural match between t31 and 1TRY. The DALI program, written by Liisa Holm, was used, along with other programs to compare the two structures and find this match. DALI is also used to find the matches listed in the FSSP web pages. Since 1AGJ is a relatively new structure, the manager of these pages has not yet used DALI to compare 1AGJ to all the PDB sequences, hence only comparisons to selected sequences are listed.)

Q10: Are any of the other 7 sequences in the above alignment also structurally related to 5ptp, or another close relative of 1AGJ?

(For extra credit: (1) The FSSP server also gives an alignment for each structurally related protein, showing parts of the protein that match parts of the target. See how closely these alignments agree with the HMM alignment. It is very hard to get these alignments right from just the sequence itself, not knowing the structure! You can also use DALI directly to get a structural alignment of 1try and 1agj-A. Look at the structures and see if this alignment makes sense. (2) Try the CATH and SCOP protein structre web pages (find them under the "more protein folding links" on the resources page for the class.) (3) If you work in a molecular biology lab, try hmm-target98 on a protein that is of interest to the research in your lab. This could be the start of a good class project. )


Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified Jan 15, 1998.

Back to the CMP 243 Class Page.