This homework covers some of the material in chapters 9 and 10 of the text. I'll be lecturing on this material for the next 2 classes. The theoretical part of this assignment is contained in this postscript file.
The experimental portion of this assignment was prepared by Michael Brown (mpbrown@cse.ucsc.edu), edited by me, and is given below.
The small subunit (SSU) of the ribosomal RNA (rRNA) is one of the most important and most studied biomolecules. It is a key part of the ribosome, the machine for translating from messenger RNA to protein, and key aspects of this sequence are highly conserved throughout the tree of life. Based on these conserved parts, SSU rRNA sequences have been recovered from a huge number of different species. These are aligned and the variations in these sequences are used in the phylogenetic classification of these species. In alignment of RNA sequences that share a common structure, like SSU rRNA sequences, it is crucial that that this structure be taken into account. In particular, it is crucial that residues that form base pairs of a helix in one sequence are properly aligned with the corresponding residues for this helix in the second sequence. Thus it is best that the secondary structure of RNA sequences be predicted in conjunction with the alignment process, rather than considered as a separate step in the analysis.
In the first part of the experimental section, we will look at some tools developed by Michael Brown at UCSC for the alignment and secondary structure prediction for SSU rRNA sequences. Look at the UCSC web page for ribosomal secondary structure prediction . Using the "display previously computed results" option, display and print the postscript file for the predicted secondary structure for the SSU rRNA from Ag.tumefac, as well as an alignment of this sequence to the SSU rRNA from E. coli. The display program used here was written by Michael Zuker. Compare the predicted secondary structure for Ag.tumefac to that of E. coli and note any differences that you find (you should consult the alignment as well as the postscript files.)
(Optional) If you can find a bacterial SSU rRNA sequence that is not already on this list of previously computed results, please follow the instructions and submit this sequence for analysis by the system. Make sure your runname is not the same as that of a sequence that has already been computed.
Q1: E. coli SSU rRNA has length about 1500. The grammar modeling this molecule has about 8000 nonterminals. Assuming 4 bytes per table entry about how much memory would you need to parse E. coli SSU rRNA using a CYK parsing algorithm? Note: In the CYK algorithm I did in class I showed a "box" for each piece of the sequence being parsed. In this "box" I wrote all the nonterminals that could derive that piece. However, in this question, it says there are "4 bytes per table entry". In each box, there is one table entry for each nonterminal. You need to know this to answer the question.
The machines alpha and beta have a gigabyte of memory, and the program on this web page parses this sequence in about a minute. Do you think it could be using the CYK algorithm?
Now for the second part of the experimental section, let us try training our own stochastic context-free grammar and using it to parse new RNA sequences. In this exercise you will use the STOCCFG system to model an RNA structure for a toy set of sequences.
Dr. RNA has spent several years in his lab isolating a particular RNA structure. He has fifty variants of this structure and has created a multiple alignment of these varying sequences with a consensus secondary structure that identifies which columns of the multiple alignment are basepaired. Your job is to create a model of the structure and predict the secondary structure of three newly acquired RNA sequences. All that you need to do this is in the directory
/projects/compbio/usr/mpbrown/cmp243The files contained here are
Here is what you need to do to complete the experiment (UNIX commands are in []):
Q2: Based on your knowledge of RNA structures, do these predicted structures look reasonable? Why or why not?
While this example is trivial, it is interesting to note that an HMM CANNOT do this modeling. To see this we will build an HMM from the multiple alignment, align the sequences, and infer secondary structure given the multiple alignment and the given secondary structure of the sequence, seq0 which is the first sequence in the multiple alignment.
Q3: It should be clear from this experiment that the SCFG did well while the HMM did not. Why?
For those interested in (optional) further investigation of this system, here are the other files that are generated: