Dynamic Programming Exercises. BME 205 perl Homework #4

Fall 2008
Due Friday 21 Nov 2008
(Last Update: 20:59 PST 15 December 2008 )

There is a nice tutorial on dynamic programming at http://www.sbc.su.se/~per/molbioinfo2001/seqali-dyn.html with pointers to an example at http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html. It uses the more expensive alignment algorithm that allows arbitrary gap costs for different length gaps, though only an affine gap cost is used. If you want a textbook presentation, one of the best I know is in Chapter 2 of Durbin, Eddy, Krogh, and Mitchison's book Biological Sequence Analysis. The new book An Introduction to Bioinformatics Algorithms by Neil C. Jones and Pavel A. Pevzner may be more tutorial. Chapter 6 is the relevant chapter there.

The version of dynamic programming with affine gap costs that we presented in class used only nearest neighbors, but needed 3 matrices:

Mi, j = max { Iyi - 1, j - 1 + subst(Ai, Bj)
Mi - 1, j - 1 + subst(Ai, Bj)
Ixi - 1, j - 1 + subst(Ai, Bj)

Iyi, j = max { Iyi , j - 1 - extend
Mi , j - 1 - open
Ixi , j - 1 - double_gap

Ixi, j = max { Mi-1 , j - open
Ixi-1 , j - extend
Iyi-1 , j - double_gap

Be careful of the boundary conditions. For global alignment, you should have "start"-"start" aligned with 0 cost in the MATCH state and -infinity in Ix and Iy, so that a gap opening penalty is incurred if you start with an insertion or deletion. Before start, all rows and columns are -infinity, and the rest of the start row and column can be inferred from the recurrence relation. Make sure that your initial conditions are consistent with the recurrence relations---people sometimes get the indices of their matrices reversed, or confuse the Ix and Iy matrices.

For local alignment, the recurrence relations and the boundary conditions are different:
Mi, j = subst(Ai, Bj) + max { 0
Iyi - 1, j - 1
Mi - 1, j - 1
Ixi - 1, j - 1

Iyi, j = max { Iyi , j - 1 - extend
Mi , j - 1 - open
Ixi , j - 1 - double_gap

Ixi, j = max { Mi-1 , j - open
Ixi-1 , j - extend
Iyi-1 , j - double_gap

Since local alignment is allowed to start at any match state (with the 0 choice for previous alignment cost), all rows and columns before the first letters of the sequences can be minus infinity.

You'll need a substitution matrix for this assignment. I've provided BLOSUM62 in the format provided by matblas and used by BLAST.

Q1 (hand computation):
Compute the score for the following global alignment using the BLOSUM62 scoring matrix with a gap-opening cost of 12 and a gap-extension cost of 1 and a double-open cost of 2 (all are in 1/2 bit units to be compatible with the BLOSUM62 matrix). Show what you added up to get the score, since the final number alone could easily be incorrect due to a transcription or addition error.
    ---CTNIDDSADKNR---
    VLDAME-EIGEGDDIKMV
    
Note: this is probably not the optimal alignment.

Q2 (programming):
Write a short PERL program that takes two protein sequences (from a single FASTA file) and aligns them with the both the global and the local algorithm (using affine gap costs). The local alignment should be requested by the user with a "-l" or "-local" command-line option, and global with a "-g" or "-global" option.

You should read the BLOSUM62 matrix in from a file, not hardwire it into your code, though the file name to read it from may be in the code. The gap penalties should (by default) be open=12, extend=1, double_gap=2. (The open and extend penalties are reasonable, but I'm not so sure about the double_gap penalty.)

Remember to structure your program cleanly, using well-documented subroutines for the different tasks. I don't want to see a lot of long straight-line code and excessive use of global variables. (In Perl, "use strict" will help you catch the variables that are being used without a "my" declaration.)

The fasta files may have "alignment" characters in them, like spaces, periods, and hyphens. These are not part of the sequences and should be removed or ignored.

Your program should have two outputs: the score for the alignment and an a2m file that contains the alignment obtained by doing a traceback. The score output should contain the ids of the two sequences, the method used (global/local), and the score. Reporting other parameters (such as which substitution matrix was used and the gap-penalty parameters) is a good idea.

Note: PERL is not the language of choice for computation-intensive applications like sequence alignment. It is required for this assignment only for pedagogic purposes.

The correct A2M format for a local alignment could be

>test1
----------CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC-----------------
>test2
EEEEEEEEEE---------------ADEFG----------------EEEEEEEEEEEEEEEEE
or
>test1
cccccccccccccccADEFGcccccccccccccccc.
>test2
.....eeeeeeeeeeADEFGeeeeeeeeeeeeeeeee
The second format (with lower-case and periods for unaligned characters) is greatly preferred, as the first format really expresses a global alignment that has large end gaps.

See http://compbio.soe.ucsc.edu/a2m-desc.html for more details of a2m format.

Name your program "align", so that we can test everyone's program with a script that does "align -l < test.fasta" and "align -g < test.fasta". Turn in both your program and the output for running the program on the sequence pairs in the the following fasta files:

Bonus points:
Provide options for changing the gap costs.
Provide options to change to other substitution matrices, such as BLOSUM45, BLOSUM80, PAM30, and PAM70.
Implement faster scoring without alignment.
Implement linear gap costs and compare running times for affine and linear algorithms. You may need to use longer test sequences to get repeatable results, as the expected 3-fold difference in the inner loop may be buried in overhead.

Things learned for next year



slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
Biomolecular Engineering Dept.
BME 205 home page Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building