This assignment is intended to deepen your understanding of first-order Markov chains and of relative entropy. I will provide you with a data set consisting of many protein sequences in a FASTA file (see Perl assignment 1). You will build stochastic models of these sequences as zero-order and first-order Markov chains, and measure the information gain of the first-order model over the zero-order model.
I have provided two FASTA files: perl2_f07_1.x-seqs and perl2_f07_2.x-seqs for training and testing the models. You will do 2-fold cross-validation (training on each of the sets and testing on the other). These can be accessed on SoE machines directly from the directory /cse/classes/bme205/Fall07/ so that you don't need to make your own copies.
The combined set was derived from Dunbrack's culledpdb
list of X-ray structures with resolution 1.8 Angstroms or better
and R-factor <= 0.25, reduced so that no two sequences had more than
30% identity. I used only those sequences that were in our t06
template library, which primarily means that short sequences were
removed. I randomly split the list of chain ids into two partitions.
I also x-ed out the lower-case letters of the sequences, so that amino
acids not present in the 3D structure file are not listed.
For full details on how the set was created, see the Makefile.
Note: this is a different train and test set than the ones used in previous
years---make sure you get the right set.
For simplicity, we will not be modeling the length distribution of the
strings---instead we will build length-dependent stochastic models.
That is, our models will be probability functions that sum to one for
strings of a given length, rather than for all strings.
The simplest model we discussed in class is the i.i.d. (independent,
identically distributed) model, or zero-order Markov model. It
assumes that each character of the string comes from the same
background distribution P_0, and the probability of a string is
just the product of the probabilities of the letters of the string.
Step 1: Write a program that reads a FASTA file from STDIN and outputs the
counts of amino acids in the sequences to STDOUT.
This output should be both human- and machine-readable, as you will be
using it later. Document the output format!
Run this program on each of the sequence sets to get 0-order models.
Step 2: Write a program that reads in a table of background
counts (in the same format as you output them) from a file and
FASTA-formatted sequences from STDIN, and computes the encoding cost
in bits for each sequence in the file.
The encoding cost of a character x is -log_2 P_0(x), and the encoding
cost of a sequence is the sum of the costs for the characters (note:
this is equivalent to taking -log_2 of the product of the
probabilities).
Note that you need to convert the counts into probabilities. The
simplest approach is to normalize the counts to sum to 1. Because the
FASTA training file will be large, you will not need to worry about
pseudocounts---this simple maximum-likelihood estimate will be fine.
Produce a table that reports for each sequence the ID, the encoding cost
for that sequence in bits,
and the average encoding cost per character (encoding cost/length).
Also report for the whole data set the total encoding cost, the
average encoding cost per sequence, and the average encoding cost per
character.
Run this program on both sets of sequences using both models.
This should give you 2 files of results on training sets and 2 files
of results on test sets. Name the files sensibly, so that you can
tell which result is which!
Step 3: Write a program that reads a FASTA file from STDIN and
builds a first-order Markov model of the sequences. Since we are not
modeling the length distribution, there will be no stop state, but we
will have a start state, so a first-order Markov model consists of 21
probability distributions over the 20 amino acids. For this
assignment, output counts for each of the 21 conditions, rather than probabilities.
Document the output format!
Run this program on both sets to get two first-order models.
Step 4: Write a program that reads a first-order Markov model from
a file (in the same format as you output for Step 3) and
FASTA-formatted sequences from STDIN, and computes exactly the same
table as for the zero-order Markov model.
Note: it is possible to have a pair of characters occur in the test
set that had zero counts in the training set.
Your program should not provide infinite results in this case.
Document what you do to avoid infinity.
Run this program on the training and test sets and compare the average encoding
cost of the first-order model with the zero-order model.
Turn in your summary page with the program stapled after it.
(Avoid paper clips---they have a tendency to either come off or to
pick up other people's papers.)
You need to think about about upper and lower case (they should be
treated as equivalent) and about letters that are not the 20 standard
amino acids (wildcards B,Z, and X, and spaces "." and "-").
There are several ways to handle wildcards and non-standard amino acid letters:
Think about the FASTA-reading requirements of the first-order
Markov model first---it needs to process each sequence separately, so
you might want to write a routine that reads one sequence from the
file, and returns it as a single string, returning "undef" if there
are no more sequences to read. (Don't use the empty string to mark
the end of the file, since an empty string is a legal sequence, and
actually occurs fairly frequently in real data sets.)
To turn in
I want copies of your programs on paper, but I don't want all the output from
every run of the programs!
Instead, write a short summary describing the results of the tests,
giving the average encoding costs for each set of sequences for each model.
Try to write your summary as if you were describing the method
and the results in a journal paper that has a page limit: you want to
describe the data sets and methods completely but concisely.
If you do it well, the presentation should take less than a page.
Bonus points
There are several ways to improve on or extend this assignment:
Hints
The B and Z codes occur mainly in protein sequences that have been
determined by old-fashioned chemical methods, not from translating DNA
sequences.
There are undoubtedly other methods also.
Method 3 with no extra start transition is probably best for this
program, but any of the methods is acceptable. You must document
how you handle B and Z.
foreach $letter (sort(keys(%aa_prob)))
{ print "$letter\t$aa_prob{$letter}\n";
}
@chars = split(//, $seq);
@rev = reverse(@chars);
$rev_seq = reverse($seq);
Things learned for next year
|
|
| BME 205 home page | Karplus's lab page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building