CMP 243 Homework 3

Due: Tues. Feb. 10

Using the EMmix Program for Splice Site Recognition in Worm Data

D1, D2, etc. are things to do, and T1, T2, are items that you must turn in. Turn in your items by the due date.


In this assignment we'll gain some experience in using EMmix , a program that builds a profile or weight-matrix probability model for a given set of sequences. We'll try using EMmix for splice site recognition on recently acquired worm data .

D1: Place the following command in your .cshrc file


      alias EMmix /projects/compbio/bin/alpha/EMmix

or add /projects/compbio/bin/alpha to your UNIX path.

D2: Learn how to use the EMmix program by reading the short example below, and running EMmix on it. For more information browse through the online documentation which comes up when you type EMmix | more. Be forewarned that the online documentation of EMmix was not designed to serve as a tutorial. EMmix has a lot of advanced options that are outlined there. For example, EMmix builds models that include several components, each of which is a profile. It also builds first-order Markov profile models. These will be discussed briefly in class.

To use EMmix you first need to have a set of input sequences, all of the same length, stored in a file, say examples , in FASTA format. Use these sequences.

>SEQ1
CTAGGGTCAAG
>SEQ2
AAAAAGTCATA
>SEQ3
TTAAGGTCAGG
>SEQ4
GAAACGTCAAG
>SEQ5
ATAGGGTCACA
>SEQ6
CTAGGGTCAAA
>SEQ7
CAAGGGTCAAG
>SEQ8
CTAGGGTCAAA
>SEQ9
CAAGGGTCAAA
>SEQ10
CTAGGGTCAAG
EMmix takes these as aligned sequences, and it will make a profile for them with one probability distribution for each of the 11 positions (columns) in this alignment. This is called "training" a model. To build this profile, execute the following command (run EMmix only on an ALPHA machine).
    EMmix -m modelname -seqlen 11 < examples
Here "modelname" is any name you choose, for example "splice". The "-seqlen 11" is a parameter that specifies that each training sequence has length 11. Other parameters you can use are described in the on-line documentation. After the name of the parameter, the default value for that parameter is given. EMmix reads the sequences from standard input and builds a profile probability model for them that it stores in the file modelname.md . You should inspect this file manually to examine the model EMmix built. There you will find the estimated probabilities for each of the 4 nucleic acids in each of the 11 positions of the model. They are given in 11 rows of 4 numbers each. These numbers are not exactly the same as the frequencies in the dataset given above becuase by default, for each position, EMmix adds one pseudocount in for each of the four residues. Thus these numbers are mean posterior Bayesian estimates, assuming a uniform prior over the set of possible parameter vectors.

You should also run EMmix a second time with another command-line argument that generates a most probable string from the model. For a simple profile like this, this sequence is essentially the same as what you would get by taking the most frequent letter to represent each column in the alignment. This is called the "consensus" sequence for the profile in molecular biology. Finally, you should run EMmix with the -score option on a file testseqs that contains sequences to be scored. In particular, run

    EMmix -m modelname -score -seqlen 11 < testseqs
where testseqs is a file containing the sequences
>S1
CTAGGGTCAAG
>S2
AGCAAACGGGA

When EMmix is run with the -score option, it scores the sequences in the input file using the model modelname.md. No changes are made to the model. As during training, the sequences that are scored must have exactly the same number of residues as there are positions in the model.

For a given sequence S and EMmix model M, the score for S is computed as

     score = (- log_2 (P(S|M)/P(S|NULL)))/|S|

Here |S| denotes the number of residues in S, P(S|M) denotes the probability of the sequence S under the model M, and P(S|NULL) denotes the probability of S under a null model. For a simple profile model M, each letter of S is assumed to be generated independently according to the probability distribution in M for the position corresponding to that letter. Hence P(S|M) is obtained by multiplying together the probabilites of each of the letters in S, using the probability distribution in position j of model M for the jth letter of S.

If there is no modelname.nmp file in the current directory, then the null model is one in which each residue occurs independently with equal probability, regardless of its position. Otherwise, the probabilities of the residues for the null model are taken from the modelname.nmp file (see further discussion of this below).

Note that the more negative the score, the better the fit of the sequence to the model.

T1: You should turn in the following:

D3: And now we use EMmix for splice site recognition on real data. The actual data is in

    /projects/compbio2/data/worm/splices.fasta
You are strongly encouraged to browse the web site from where this data was downloaded (see the README file in the same directory). The data that you need to run EMmix on is in
    /projects/compbio2/data/worm/hw3
Each sequence in this dataset has length 25, and includes the last 10 residues from the exon of a spliced worm gene, followed by the first 15 residues of the following intron. Thus each sequence represents a place where the worm spliceosome will bind and begin the process of splicing out the intron. (These splice sites are called 5' or "donor" splice sites. The spliceosome will also make a cut at the other (5') end of the intron, which may be possibly far along down the gene, but the 5' splice site data is not included in this exercise. It is available from the web page if you are interested.) The question is: what signal does the worm spliceosome recognize in these 5' splice site sequences that makes it bind and cut here and not somewhere else? To try to answer this question, here are the steps you will need to perform

D4: Repeat D3 but first, create a file modelname.nmp in your directory, where "modelname" is the name you use in your "EMmix -m modelname ..." command, with a single line

A 0.32 G 0.18 C 0.175
This file specifies the probabilities for the 4 nucleotides for the NULL model. (the probability of T will be calculated by the program.) This file is discussed briefly in the documentation for EMmix. If you don't have a .nmp file, EMmix uses uniform probabilities for the frequencies in the null model, which are not appropriate in this case, since the worm genome is very AT rich. Note how much better the scores are after using a decent null model.

D5: Repeat D4 but now, in the first step, train a 2-component model instead of a 1-component one. In both cases, the model type will be indep , which is the default.

T2:

  1. Discuss the distinctive features of worm 5' splice sites that you found doing this experiment.
  2. Turn in the plots from D3 to D5 , together with a brief explanation of what the plots indicate.
  3. As stated above, EMmix calculates the score for a sequence S and a model M as
    score = - log_2 (P(S|M)/P(S|NULL)). 
    
    Now suppose M is a good model for sequences of length 25 that are 5' worm splice sites, and NULL is a reasonable null model for everything else (i.e. worm sequences of length 25 that are not 5' splice sites). Now when scanning worm DNA, the frequency with which real worm 5' splice sites occur is less and 1/1000. Let is say that this frequency is exactly 1/1000, and let us suppose we select a worm DNA sequence S of length 25 at random and score it with our EMmix 5' worm splice site model M. Using Bayes rule, and the prior frequency of 1/1000 for real worm 5' splice sites, tell me what score would we need to get in order for the posterior odds of S being a real worm 5' splice site to be bigger than 1?
(For extra credit: Explore other types of models, and try to find one that is better able to distinguish real 5' splice sites from decoys. For example, try using EMmix to build first order Markov models. The spliceosome does a lot better job than any of the models you created following the steps above. How can it be so accurate?) Late addition: Arun has sent instructions for a quick-and-dirty way to visualize an EMmix profile. Try this if you have time. It is easier than looking at rows of numbers. (The instructions below are embodied in a single script
/projects/compbio/bin/scripts/mp.pl
If you want to do this thie easy way, just type this command to get usage info, and then type it again with appropriate arguments.)
	How to Display an EMmix Model File M.md

1. Put each column of M.md into a separate file A, G, C, T.
      This may be done as follows. First copy M.md into a file M
      and delete its first four lines (the header ones).
      Then run the following sequence of awk commands on M.

      awk '{print $1;}' M > A
      awk '{print $2;}' M > G
      awk '{print $3;}' M > C
      awk '{print $4;}' M > T

      To verify that this worked correctly do

      paste M A G C T | more

2. Now you type

oink 1> gnuplot
gnuplot> set output "plot.ps"
gnuplot> set terminal postscript
gnuplot> set data style linespoints
gnuplot> plot 'A','G','C','T'
gnuplot> exit

The file plot.ps now contains the plot. You can view it by typing the command

ghostview plot.ps

Remark:
From inside gnuplot you may type 'help' to find out about all
its features, including saving the plot to a file, and various
other style options. In particular 'help set' will be informative.

Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified Jan 31, 1998.

Back to the CMP 243 Class Page.