Exercises provided by Christian Barrett, with additions by D. Haussler
This assignment is intended to show you some of the basic features and capabilities of our HMM software, which is called SAM for Sequence Alignment and Modeling software. It will also give you a chance to use PHD, which is Burkhard Rost's neural net that predicts secondary structure of protein sequences. These will be demonstrated in the context of the Archae whole genome analysis project that was just started this quarter, using the M. jannaschii genome.
One aspect of this project focuses on functional annotation of the TIGR-predicted proteins from the M. jannaschii genome. This method begins with a sequence that the TIGR group suspects is a protein, based on analysis of the genome with the GENMARK genefinding program. We take this protein and build an HMM from it. Normally this is really not enough information from which to estimate all of the parameters of an HMM, but it's all we have at first. Hopefully the Bayesian method of estimating the parameters of the HMM using Dirichlet mixture priors will help, as mentioned in the lecture.
One thing we can do with this HMM is to search for related proteins within the M. jannaschii genome. It is often the case that a genome contains families of closely related genes. Thus, given one M. jannaschii protein, we will often find other similar proteins among the remaining M. jannaschi proteins. If we find these, then we can build an HMM for the entire family of related genes. Thus the next step is to search the M. jannaschii genome using the initial model built above form a single protein sequence. If we pull out any sequences that score well with the model, a new HMM is built using these sequences, along with the original sequence.
This has been done with putative protein sequence MJ0553 from M. jannaschii. Make a directory for yourself and copy all of the files from /projects/compbio/class/cmp243/hw5 into it.
One of these files is "MJ0553_2_long.dist". This is the file that contains the log-odds scores for the potentially interesting sequences in the genome. (In class we've been using using base 2 logarithms, as right thinking people ought. SAM uses base e logarithms, for "historical reasons"). (Those comments by C. Barrett. I take no responsibility! D.H.)
If you look in this "distance file", you will see some NLL-NULL scores, and above them some comments about significance. (Ignore the NLL-NULL/L scores.) First let's look at how the NLL-NULL scores are computed. For example, the 4th line gives the score for the protein sequence MJ1405. which is -123.356. Using "ln" to denote natural logarithm, this score is
-ln (P(MJ1405|HMM)/P(MJ1405|null)) + ln (number of residues in MJ1405),where P(MJ1405|HMM) is the probability of the protein sequence MJ1405 under the hidden Markov model (built from ML0553), and P(MJ1405|null) is the probability of MJ1405 under a simple null model in which all amino acids are assumed to be generated independently according to a fixed distribution. the same fixed distribution that is used in all the insert states in the HMM. (And in the FIMs, but we won't go into what these are now. They are like regular insert states in the HMM.) Thus the score is the negative log likelihood ratio plus an "adjustment" equal to + ln (number of residues in MJ1405). (The negative log likelihood ratio is also used for "historical reasons".)
Normally, the quantity P(MJ1405|HMM) would be calculated by the forward algorithm that we studied in class. However, in this case, the actual algorithm that was used was a fast version of the Viterbi algorithm that does a Smith-Waterman-type local alignment of the sequence to the HMM. What is meant here is that the dynamic programming algorithm is set up to try to match part of the sequence with part of the HMM, rather than always matching all of the sequence to all of the HMM. The output of this method is the probability of the sequence, jointly with the best local alignment of this type. So this is really at best an approximation to the quantity P(MJ1405|HMM) as defined in class, but it has other advantages, besides speed of calculation, which we will not go into.
To understand the adjustment term "+ ln (number of residues in MJ1405)", we need to review how significance is interpreted in terms of log posterior odds. By Bayes rule, the log posterior odds are given by
ln (P(HMM|MJ1405)/P(null|MJ1405)) = ln (P(MJ1405|HMM)/P(MJ1405|null)) + ln P(HMM)/P(null)where ln P(HMM)/P(null) is the log prior odds, which may be approximated by the log of the number of pairs consisting of a place in the MJ1405 sequence that could possible match a place in a sequence in the database. This is similar to what we did when we interpreted BLAST scores as log posterior odds. As in that analysis, we may approximate the log posterior odds by
- ln (number of residues in MJ1405 * number of residues in the database)
= - ln (number of residues in MJ1405) - ln (number of residues in the database)
The first term is the adjustment. This is subtracted by SAM from all the negative log likelihood scores, to get the NLL-NULL score. The second term is a constant that does not depend on what sequence is being analyzed. It is mentioned in the discussion of significance, before the scores are given in the SAM MJ0553_2_long.dist file, rather than being added to all the scores. In this search there are 63467 residues in the database, and this term is the constant -ln(63467)= -11.1. Putting this together, we see that the negative log posterior odds are
- ln (P(HMM|MJ1405)/P(null|MJ1405)) = -ln (P(MJ1405|HMM)/P(MJ1405|null)) + ln (number of residues in MJ1405) + 11.1 = NLL-NULL score + 11.1
For the match to be significant, this value should be negative. Specifically, if you want the posterior odds to be greater than 100 to 1, then the -ln posterior odds should be less than -ln(100) = -4.6. This calculation is shown in the first line of the SAM output listing of confidence cutoffs. In subsequent lines you see confidence cutoffs computed using different assumptions. We won't go into these calcs here.
Q1: Using the above calculations, how many sequences in the database have NLL-NULL scores that indicate that the posterior odds are at least 100 to 1 for the HMM model vs the null model? Repeat the above calculation assuming that the database has 100,000,000 residues in it. What is the cutoff for significance? How many of these sequences are still deemed significant at odds of 100 to 1?
Before leaving this topic, we must note that this approximation ignores the important fact that, unlike BLAST matching, we allow indels in HMM matches, which makes for a lot more possible "places" for the match to occur. Because of this, we do not have a rigorous mathematical theory of the significance of HMM scores; nobody has been able to solve the difficult mathematical problems involved in this. However, this log odds interpretation at least gives some heuristic guidance.
Since we don't want to answer the above questions for you, let us assume now that we use these scores as the simple heuristics that they are, and just look for a break in the score distribution. Upon examination of the "MJ0553_2_long.dist" file, we see immediately that there are three other sequences in M. jannaschi's protein database that score markedly better than the rest. Given this clean break in scores, we will take these three new sequences (MJ0809, MJ1331, and MJ1405) and reestimate the HMM for MJ0533 with them. (It is interesting to note that none of these sequences have any annotation in the M. jannaschi protein database. So bear in mind that if we can find a homology match to a protein with known function, or even better, known structure, then we will be the first people *ever* to know what these proteins do and look like.)
At this point we have as good a model as we can get, given the information available. The next step is to search SWISSPROT. Our aim here is to find sequences in this database that score well with our model. SWISSPROT is probably the most carefully constructed protein database, in that the sequences it contains are carefully annotated with descriptions, journal article links, other database links, and important residues (such as active sites, critical to structure or function, etc). Thus, if we can find a hit in SWISSPROT using the model built from our sequence of interest in the M. jannaschi genome, then we may be able to infer quite a bit about it. This was tantalizingly noted above.
So search SWISSPROT we do. The file "MJ0553_2_sprot33.dist" lists the top- scoring sequences from the 52,205 that are in the current release of SWISSPROT. Hmmm. Kinda on the low side of our significance range.
Q2: Repeat the same calculations as in Q1, but using the larger database size of 18531282 residues for SWISSPROT, and see how many sequences in this database search have even posterior odds (1 to 1) or greater in favor of the HMM vs the null model.
Since these scores are not convincing, we want to perform a visual inspection of the multiple alignment between these SWISSPROT sequences and the four homologous sequences from the M. jannaschi genome that we have from the search described above. For this you will use SAM's WWW interface at
http://www.cse.ucsc.edu/research/compbio/sam_form.html
What we're going to do is reestimate a model using our four M. jannaschi
sequences and the top SWISSPROT sequences, which have been extracted from
the database for you and placed in the file "topSP-cleaned.seqs". The
"cleaned" part of the name is because some sequences have been removed. These
would be some of the ones that are identical (same length and score almost
always implies identical sequences) and the two sequences whose names begin
with HYPF. The latter have been removed because it was found that they may
be just random hits. All but one of the identical
sequences have been removed so that the resulting model is not unfairly
skewed.
Before going on to the next step, lets see what these SWISSPROT hits are. Using your web browser go to the SWISSPROT search page at URL
http://expasy.hcuge.ch/cgi-bin/sprot-search-de
and enter the sequence ID for, say, "ACYO_HUMAN".
When this comes up, look at the entry. If you read down to the lines
that begin with DR, you'll see a line that says something about a PDB
entry. Hey! So there's actually a solved structure for this sequence. The PDB
id for the structure is 1aps. This
is a tremendous help. You might want to click on the PDB link and retrieve the PDB
file to view it with RASMOL. (This is optional, but may help you with the
last part of this homework assignment.) Also, for future reference,
click on the HSSP link for this protein, and save the HSSP file for 1aps.
This file describes, among many other things, the secondary structure for
1aps.
For each PDB entry (which contains the atomic coordinates for a protein), there is the raw amino acid sequence for the protein in the Brookhaven database. In case this particular sequence isn't in SWISSPROT yet, we'll get this raw sequence from Brookhaven. It is in the file "1aps.seq".
So, go to the SAM web page SAM web page and fill in accordingly:
Buildmodel - yes, data back - no
Send back results of align2model only
You have not provided an initial model
Fill in your email address
For training sequences, cut and paste in the sequences from the files
1aps.seq, topSP-cleaned.seqs and MJ0553_2_long.sel
For buildmodel parameters:
"minmodlen 98 maxmodlen 104 Nmodels 4"
Prettyalign - yes
For prettyalign parameters, "-m 6"
Now, press the submit button. The results should arrive by email within
5 or so minutes. When they do, save the email message that has the subject
heading "SAM: Align2model Results" into a file and print it out.
Examine the alignment. We'll return to it below.
In the next part of this assignment, we will try the PHD program. What we would like to do is run this program on the putative protein MJ0553, and see if the secondary structure it predicts shows any similarity to the known secondary structure for 1aps. If it does, then this is further evidence that MJ0553 may be homologous to 1aps. If it does not, then they might not really be homologous, or PHD may just be predicting the wrong secondary structure for MJ0553.
Go to the WWW resources page for cs243 and click on the PHD web server. Notice that there are a number of prediction programs. Click "help" for some introduction. After scanning this, go back to the main PHD page and click "submit" to submit the MJO553 sequence. Name your prediction "MJO553". Paste in the MJ0553 sequence and click "RUN PREDICTION". It may take several hours to get an email response. Just as a quality check on the PHD program, repeat the above steps with the 1aps sequence. In this case, we know the secondary structure, and we can check how well PHD performs.
Now while you are waiting for PHD predictions to be mailed, let's look at the HSSP file for 1aps, and see if we can figure out what its secondary structure is. At the top of the file are several lines that describe how the file was built, and what the notations in the file mean (lines beginning "NOTATION"). What is interesting about the HSSP file is that it also includes known homologs of the protein 1aps, and shows how these can be aligned to 1aps. In this case there are 14 homologs listed in the block after the "NOTATION" section.
Q3: How does this list of homologs differ from those in topSP-cleaned.seqs?
After this list of homologs, we get to the meat of the HSSP file (starting at "##ALIGNMENTS ..."). On the left side are the 14 homologs of 1aps aligned vertically. Reading from the right side to the left, there are several columns of information about 1aps. They are
Finally, at the bottom of the HSSP file is a raw profile for this protein family built from the multiple alignment, giving amino acid probabilities as percentages. No fancy statistical estimation is done here, so these are just the raw frequencies.
Take the file 1aps.seq containing the 1asp sequence in the usual horizontal form, put it all on one line with no spaces, and create a line below it showing the known secondary structure for this sequence, using the letters "H", "E", and "L" (for "other"). Now when you get the PHD results for 1aps, read the file to figure out what is going on, then add another line to this file indicating the PHD secondary structure prediction. Turn a copy of this in with your homework.
Q4: How well does PHD do? How many residues have correct secondary structure prediction? Be aware that it is very difficult to predict the residues at the ends of beta strands and alpha helices, indeed there is some disagreement on how to classify these occasionally, even given the 3d coordinates. Look also at scores, on a scale from 0-9, that PHD gives for its predictions.
Now let us look at the PHD prediction for MJ0553.
Q5: In the critical region in the SAM alignment of MJ0553 and 1aps, along with their homologs, (say from alignment positions 8 to 98), how often does the PHD predicted secondary for MJ0553 agree with the known secondary structure of 1aps?
The last part of this assignment is for you to decide if MJ0553 (and some/all of its M. jannaschi homologs) are homologous to this family of SWISSPROT sequences. This is Q6. Justify your decision.
Here are some thoughts and hints (from Christian):