UCSC BME 205 Fall 2007

Intro to Bioinformatics
PERL Assignment 2

(Last Update: 14:49 PST 5 November 2007 )

Simple Markov chains
Due Friday 2 Nov 2007 (11 a.m.).

This assignment is intended to deepen your understanding of first-order Markov chains and of relative entropy. I will provide you with a data set consisting of many protein sequences in a FASTA file (see Perl assignment 1). You will build stochastic models of these sequences as zero-order and first-order Markov chains, and measure the information gain of the first-order model over the zero-order model.

I have provided two FASTA files: perl2_f07_1.x-seqs and perl2_f07_2.x-seqs for training and testing the models. You will do 2-fold cross-validation (training on each of the sets and testing on the other). These can be accessed on SoE machines directly from the directory /cse/classes/bme205/Fall07/ so that you don't need to make your own copies.

The combined set was derived from Dunbrack's culledpdb list of X-ray structures with resolution 1.8 Angstroms or better and R-factor <= 0.25, reduced so that no two sequences had more than 30% identity. I used only those sequences that were in our t06 template library, which primarily means that short sequences were removed. I randomly split the list of chain ids into two partitions. I also x-ed out the lower-case letters of the sequences, so that amino acids not present in the 3D structure file are not listed. For full details on how the set was created, see the Makefile.

Note: this is a different train and test set than the ones used in previous years---make sure you get the right set.

For simplicity, we will not be modeling the length distribution of the strings---instead we will build length-dependent stochastic models. That is, our models will be probability functions that sum to one for strings of a given length, rather than for all strings.

The simplest model we discussed in class is the i.i.d. (independent, identically distributed) model, or zero-order Markov model. It assumes that each character of the string comes from the same background distribution P_0, and the probability of a string is just the product of the probabilities of the letters of the string.

Step 1: Write a program that reads a FASTA file from STDIN and outputs the counts of amino acids in the sequences to STDOUT. This output should be both human- and machine-readable, as you will be using it later. Document the output format!

Run this program on each of the sequence sets to get 0-order models.

Step 2: Write a program that reads in a table of background counts (in the same format as you output them) from a file and FASTA-formatted sequences from STDIN, and computes the encoding cost in bits for each sequence in the file. The encoding cost of a character x is -log_2 P_0(x), and the encoding cost of a sequence is the sum of the costs for the characters (note: this is equivalent to taking -log_2 of the product of the probabilities).

Note that you need to convert the counts into probabilities. The simplest approach is to normalize the counts to sum to 1. Because the FASTA training file will be large, you will not need to worry about pseudocounts---this simple maximum-likelihood estimate will be fine.

Produce a table that reports for each sequence the ID, the encoding cost for that sequence in bits, and the average encoding cost per character (encoding cost/length). Also report for the whole data set the total encoding cost, the average encoding cost per sequence, and the average encoding cost per character.

Run this program on both sets of sequences using both models. This should give you 2 files of results on training sets and 2 files of results on test sets. Name the files sensibly, so that you can tell which result is which!

Step 3: Write a program that reads a FASTA file from STDIN and builds a first-order Markov model of the sequences. Since we are not modeling the length distribution, there will be no stop state, but we will have a start state, so a first-order Markov model consists of 21 probability distributions over the 20 amino acids. For this assignment, output counts for each of the 21 conditions, rather than probabilities. Document the output format!

Run this program on both sets to get two first-order models.

Step 4: Write a program that reads a first-order Markov model from a file (in the same format as you output for Step 3) and FASTA-formatted sequences from STDIN, and computes exactly the same table as for the zero-order Markov model. Note: it is possible to have a pair of characters occur in the test set that had zero counts in the training set. Your program should not provide infinite results in this case. Document what you do to avoid infinity.

Run this program on the training and test sets and compare the average encoding cost of the first-order model with the zero-order model.

To turn in

I want copies of your programs on paper, but I don't want all the output from every run of the programs! Instead, write a short summary describing the results of the tests, giving the average encoding costs for each set of sequences for each model. Try to write your summary as if you were describing the method and the results in a journal paper that has a page limit: you want to describe the data sets and methods completely but concisely. If you do it well, the presentation should take less than a page.

Turn in your summary page with the program stapled after it. (Avoid paper clips---they have a tendency to either come off or to pick up other people's papers.)

Bonus points

There are several ways to improve on or extend this assignment:

Hints


Things learned for next year



slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
BS, MS, and PhD programs
BME 205 home page Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building