In this assignment we'll gain some experience in using EMmix , a program that builds a profile or weight-matrix probability model for a given set of sequences. We'll try using EMmix for splice site recognition on recently acquired worm data .
D1: Place the following command in your .cshrc file
alias EMmix /projects/compbio/bin/alpha/EMmix
or add /projects/compbio/bin/alpha to your UNIX path.
D2: Learn how to use the EMmix program by reading the short example below, and running EMmix on it. For more information browse through the online documentation which comes up when you type EMmix | more. Be forewarned that the online documentation of EMmix was not designed to serve as a tutorial. EMmix has a lot of advanced options that are outlined there. For example, EMmix builds models that include several components, each of which is a profile. It also builds first-order Markov profile models. These will be discussed briefly in class.
To use EMmix you first need to have a set of input sequences, all of the same length, stored in a file, say examples , in FASTA format. Use these sequences.
>SEQ1 CTAGGGTCAAG >SEQ2 AAAAAGTCATA >SEQ3 TTAAGGTCAGG >SEQ4 GAAACGTCAAG >SEQ5 ATAGGGTCACA >SEQ6 CTAGGGTCAAA >SEQ7 CAAGGGTCAAG >SEQ8 CTAGGGTCAAA >SEQ9 CAAGGGTCAAA >SEQ10 CTAGGGTCAAGEMmix takes these as aligned sequences, and it will make a profile for them with one probability distribution for each of the 11 positions (columns) in this alignment. This is called "training" a model. To build this profile, execute the following command (run EMmix only on an ALPHA machine).
EMmix -m modelname -seqlen 11 < examples
Here "modelname" is any name you choose, for example "splice".
The "-seqlen 11" is a parameter that specifies that each training
sequence has length 11.
Other parameters you can use are described in the on-line documentation.
After the name of the parameter, the default value for that parameter
is given.
EMmix reads the sequences from standard input
and builds a profile probability model for them
that it stores in the file modelname.md .
You should inspect this file manually to examine the model EMmix built.
There you will find the estimated probabilities for
each of the 4 nucleic acids in each of the 11 positions
of the model. They are given in 11 rows of 4 numbers each.
These numbers are not exactly the same as the frequencies in
the dataset given above becuase by default, for each position, EMmix adds
one pseudocount in for each of the four residues. Thus these
numbers are mean posterior Bayesian estimates, assuming a uniform
prior over the set of possible parameter vectors.
You should also run EMmix a second time with another command-line argument that generates a most probable string from the model. For a simple profile like this, this sequence is essentially the same as what you would get by taking the most frequent letter to represent each column in the alignment. This is called the "consensus" sequence for the profile in molecular biology. Finally, you should run EMmix with the -score option on a file testseqs that contains sequences to be scored. In particular, run
EMmix -m modelname -score -seqlen 11 < testseqs
where testseqs is a file containing the sequences
>S1 CTAGGGTCAAG >S2 AGCAAACGGGA
When EMmix is run with the -score option, it scores the sequences in the input file using the model modelname.md. No changes are made to the model. As during training, the sequences that are scored must have exactly the same number of residues as there are positions in the model.
For a given sequence S and EMmix model M, the score for S is computed as
score = (- log_2 (P(S|M)/P(S|NULL)))/|S|
Here |S| denotes the number of residues in S, P(S|M) denotes the probability of the sequence S under the model M, and P(S|NULL) denotes the probability of S under a null model. For a simple profile model M, each letter of S is assumed to be generated independently according to the probability distribution in M for the position corresponding to that letter. Hence P(S|M) is obtained by multiplying together the probabilites of each of the letters in S, using the probability distribution in position j of model M for the jth letter of S.
If there is no modelname.nmp file in the current directory, then the null model is one in which each residue occurs independently with equal probability, regardless of its position. Otherwise, the probabilities of the residues for the null model are taken from the modelname.nmp file (see further discussion of this below).
Note that the more negative the score, the better the fit of the sequence to the model.
T1: You should turn in the following:
D3: And now we use EMmix for splice site recognition on real data. The actual data is in
/projects/compbio2/data/worm/splices.fasta
You are strongly encouraged to browse the web site from where this
data was downloaded (see the README file in the same directory).
The data that you need to run EMmix on is in
/projects/compbio2/data/worm/hw3
Each sequence in this dataset has length 25, and includes the last 10
residues from the exon of a spliced worm gene, followed by the
first 15 residues of the following intron. Thus each sequence
represents a place where the worm spliceosome will bind and begin
the process of splicing out the intron. (These splice sites
are called 5' or "donor" splice sites. The spliceosome will also make a cut
at the other (5') end of the intron, which may be possibly far
along down the gene, but the 5' splice site data is not included in this
exercise. It is available from the web page if you are interested.)
The question is: what signal does the worm spliceosome recognize in these
5' splice site sequences
that makes it bind and cut here and not somewhere else?
To try to answer this question,
here are the steps you will need to perform
/projects/compbio/bin/scripts/fpfn.pl
to plot the results from these two scores. Be sure to read the top of
the file fpfn.pl to understand how to set up the data for
fpfn.pl to work correctly. Here is a brief description
of what to do
Let us say that you stored the two sets of scores in the files scores.test and scores.decoys respectively. To convert these two files into the format to use the fpfn.pl plot script, do the following: grep -E '^[^>]' scores.test > test_scores_real grep -E '^[^>]' scores.decoys > test_scores_decoys then run the fpfn.pl script as follows: /projects/compbio/bin/scripts/fpfn.pl test plot-title where "plot-title" is any string you want to appear as a title for your plot. The result will be stored in the postscript file "test.fpfn.eps" View it with the program ghostview, i.e. "ghostview test.fpfn.eps" before printing it with lpr (e.g. "lpr test.fpfn.eps").
D4: Repeat D3 but first, create a file modelname.nmp in your directory, where "modelname" is the name you use in your "EMmix -m modelname ..." command, with a single line
A 0.32 G 0.18 C 0.175This file specifies the probabilities for the 4 nucleotides for the NULL model. (the probability of T will be calculated by the program.) This file is discussed briefly in the documentation for EMmix. If you don't have a .nmp file, EMmix uses uniform probabilities for the frequencies in the null model, which are not appropriate in this case, since the worm genome is very AT rich. Note how much better the scores are after using a decent null model.
D5: Repeat D4 but now, in the first step, train a 2-component model instead of a 1-component one. In both cases, the model type will be indep , which is the default.
T2:
score = - log_2 (P(S|M)/P(S|NULL)).Now suppose M is a good model for sequences of length 25 that are 5' worm splice sites, and NULL is a reasonable null model for everything else (i.e. worm sequences of length 25 that are not 5' splice sites). Now when scanning worm DNA, the frequency with which real worm 5' splice sites occur is less and 1/1000. Let is say that this frequency is exactly 1/1000, and let us suppose we select a worm DNA sequence S of length 25 at random and score it with our EMmix 5' worm splice site model M. Using Bayes rule, and the prior frequency of 1/1000 for real worm 5' splice sites, tell me what score would we need to get in order for the posterior odds of S being a real worm 5' splice site to be bigger than 1?
/projects/compbio/bin/scripts/mp.plIf you want to do this thie easy way, just type this command to get usage info, and then type it again with appropriate arguments.)
How to Display an EMmix Model File M.md
1. Put each column of M.md into a separate file A, G, C, T.
This may be done as follows. First copy M.md into a file M
and delete its first four lines (the header ones).
Then run the following sequence of awk commands on M.
awk '{print $1;}' M > A
awk '{print $2;}' M > G
awk '{print $3;}' M > C
awk '{print $4;}' M > T
To verify that this worked correctly do
paste M A G C T | more
2. Now you type
oink 1> gnuplot
gnuplot> set output "plot.ps"
gnuplot> set terminal postscript
gnuplot> set data style linespoints
gnuplot> plot 'A','G','C','T'
gnuplot> exit
The file plot.ps now contains the plot. You can view it by typing the command
ghostview plot.ps
Remark:
From inside gnuplot you may type 'help' to find out about all
its features, including saving the plot to a file, and various
other style options. In particular 'help set' will be informative.