CMP 243 Homework 5

Due: Fri. March 13 at 5:00PM (put it in my box in the CS Dept.)

Stochastic Context-Free Grammars


This homework covers some of the material in chapters 9 and 10 of the text. I'll be lecturing on this material for the next 2 classes. The theoretical part of this assignment is contained in this postscript file.

The experimental portion of this assignment was prepared by Michael Brown (mpbrown@cse.ucsc.edu), edited by me, and is given below.

The small subunit (SSU) of the ribosomal RNA (rRNA) is one of the most important and most studied biomolecules. It is a key part of the ribosome, the machine for translating from messenger RNA to protein, and key aspects of this sequence are highly conserved throughout the tree of life. Based on these conserved parts, SSU rRNA sequences have been recovered from a huge number of different species. These are aligned and the variations in these sequences are used in the phylogenetic classification of these species. In alignment of RNA sequences that share a common structure, like SSU rRNA sequences, it is crucial that that this structure be taken into account. In particular, it is crucial that residues that form base pairs of a helix in one sequence are properly aligned with the corresponding residues for this helix in the second sequence. Thus it is best that the secondary structure of RNA sequences be predicted in conjunction with the alignment process, rather than considered as a separate step in the analysis.

In the first part of the experimental section, we will look at some tools developed by Michael Brown at UCSC for the alignment and secondary structure prediction for SSU rRNA sequences. Look at the UCSC web page for ribosomal secondary structure prediction . Using the "display previously computed results" option, display and print the postscript file for the predicted secondary structure for the SSU rRNA from Ag.tumefac, as well as an alignment of this sequence to the SSU rRNA from E. coli. The display program used here was written by Michael Zuker. Compare the predicted secondary structure for Ag.tumefac to that of E. coli and note any differences that you find (you should consult the alignment as well as the postscript files.)

(Optional) If you can find a bacterial SSU rRNA sequence that is not already on this list of previously computed results, please follow the instructions and submit this sequence for analysis by the system. Make sure your runname is not the same as that of a sequence that has already been computed.

Q1: E. coli SSU rRNA has length about 1500. The grammar modeling this molecule has about 8000 nonterminals. Assuming 4 bytes per table entry about how much memory would you need to parse E. coli SSU rRNA using a CYK parsing algorithm? Note: In the CYK algorithm I did in class I showed a "box" for each piece of the sequence being parsed. In this "box" I wrote all the nonterminals that could derive that piece. However, in this question, it says there are "4 bytes per table entry". In each box, there is one table entry for each nonterminal. You need to know this to answer the question.

The machines alpha and beta have a gigabyte of memory, and the program on this web page parses this sequence in about a minute. Do you think it could be using the CYK algorithm?

Now for the second part of the experimental section, let us try training our own stochastic context-free grammar and using it to parse new RNA sequences. In this exercise you will use the STOCCFG system to model an RNA structure for a toy set of sequences.

Dr. RNA has spent several years in his lab isolating a particular RNA structure. He has fifty variants of this structure and has created a multiple alignment of these varying sequences with a consensus secondary structure that identifies which columns of the multiple alignment are basepaired. Your job is to create a model of the structure and predict the secondary structure of three newly acquired RNA sequences. All that you need to do this is in the directory


 /projects/compbio/usr/mpbrown/cmp243

The files contained here are
  1. stem.ma : The multiple alignment of the structure. The first sequence is seq0.
  2. canstem.bp : A list of which residues are basedpaired for the sequence, seq0.
  3. canstem.seq : The primary sequence of the sequence, seq0.
  4. test.seq : The primary sequences of the three newly acquired sequences. The sequence, seq0, is added so that the new sequences can be compared against it.
  5. scfgfromalign : A shell script that uses stem.ma, canstem.bp, and canstem.seq to produce an estimated SCFG with the name fullstem.tdgrammar.
  6. scfgalign : A shell script that takes a grammar and sequences and produces a multiple alignment of the sequences.

Here is what you need to do to complete the experiment (UNIX commands are in []):

  1. Make a working directory: [ mkdir scfgwork; cd scfgwork ]
  2. Copy necessary files to directory: [ cp /projects/compbio/usr/mpbrown/cmp243/* . ]
  3. Estimate a grammar from data: [ scfgfromalign ] This program will print some things out (ignore the "Did not understand ..." lines) and in the end you will have a file, fullstem.tdgrammar, which is the grammar estimated from the data.
  4. Use the grammar the align the new sequences: [ scfgalign fullstem.tdgrammar test.seq ] This program will produce a file test.seq.align. This file is the alignment of the test sequences.
  5. Look at the alignment file: [ more test.seq.align ] This file has a structure sequence which gives the secondary structure interpretations of the columns of the multiple alignment. The characters ( and ) mean basepaired, [ or ] mean bulges, - means loop match position. Draw the predicted secondary structure for each of the three sequences.

Q2: Based on your knowledge of RNA structures, do these predicted structures look reasonable? Why or why not?

While this example is trivial, it is interesting to note that an HMM CANNOT do this modeling. To see this we will build an HMM from the multiple alignment, align the sequences, and infer secondary structure given the multiple alignment and the given secondary structure of the sequence, seq0 which is the first sequence in the multiple alignment.

  1. Build HMM from multiple alignment: [ /projects/compbio/bin/alpha/modelfromalign hmmstem -alignfile stem.ma ] This gives an HMM, hmmstem.mod.
  2. Align the sequences using the HMM: [ /projects/compbio/bin/alpha/align2model hmmtest -i hmmstem.mod -db test.seq ] This gives an alignment, hmmtest.a2m.
  3. Look at the multiple alignment and infer secondary structure for each sequence based on this alignment and the known secondary structure for seq0: [ more hmmtest.a2m ]. List any discrepancies between these secondary structures and those predicted by the stochastic context-free grammar.

Q3: It should be clear from this experiment that the SCFG did well while the HMM did not. Why?

For those interested in (optional) further investigation of this system, here are the other files that are generated:


Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified March 5, 1998.

Back to the CMP 243 Class Page.