There is a nice tutorial on dynamic programming at http://www.sbc.su.se/~per/molbioinfo2001/seqali-dyn.html with pointers to an example at http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html. It uses the more expensive alignment algorithm that allows arbitrary gap costs for different length gaps, though only an affine gap cost is used. If you want a textbook presentation, one of the best I know is in Chapter 2 of Durbin, Eddy, Krogh, and Mitchison's book Biological Sequence Analysis. The new book An Introduction to Bioinformatics Algorithms by Neil C. Jones and Pavel A. Pevzner may be more tutorial. Chapter 6 is the relevant chapter there.
The version of dynamic programming with affine gap costs that we presented in class used only nearest neighbors, but needed 3 matrices:
| Mi, j = max { | Iyi - 1, j - 1 + subst(Ai, Bj) |
| Mi - 1, j - 1 + subst(Ai, Bj) | |
| Ixi - 1, j - 1 + subst(Ai, Bj) |
| Iyi, j = max { | Iyi , j - 1 - extend |
| Mi , j - 1 - open | |
| Ixi , j - 1 - double_gap |
| Ixi, j = max { | Mi-1 , j - open |
| Ixi-1 , j - extend | |
| Iyi-1 , j - double_gap |
Be careful of the boundary conditions. For global alignment, you should have "start"-"start" aligned with 0 cost in the MATCH state and -infinity in Ix and Iy, so that a gap opening penalty is incurred if you start with an insertion or deletion. Before start, all rows and columns are -infinity, and the rest of the start row and column can be inferred from the recurrence relation. Make sure that your initial conditions are consistent with the recurrence relations---people sometimes get the indices of their matrices reversed, or confuse the Ix and Iy matrices.
For local alignment, the recurrence relations and the boundary conditions are different:
| Mi, j = subst(Ai, Bj) + max { | 0 |
| Iyi - 1, j - 1 | |
| Mi - 1, j - 1 | |
| Ixi - 1, j - 1 |
| Iyi, j = max { | Iyi , j - 1 - extend |
| Mi , j - 1 - open | |
| Ixi , j - 1 - double_gap |
| Ixi, j = max { | Mi-1 , j - open |
| Ixi-1 , j - extend | |
| Iyi-1 , j - double_gap |
Since local alignment is allowed to start at any match state (with the 0 choice for previous alignment cost), all rows and columns before the first letters of the sequences can be minus infinity.
You'll need a substitution matrix for this assignment. I've provided BLOSUM62 in the format provided by matblas and used by BLAST.
---CTNIDDSADKNR---
VLDAME-EIGEGDDIKMV
Note: this is probably not the optimal alignment.
You should read the BLOSUM62 matrix in from a file, not hardwire it into your code, though the file name to read it from may be in the code. The gap penalties should (by default) be open=12, extend=1, double_gap=2. (The open and extend penalties are reasonable, but I'm not so sure about the double_gap penalty.)
Remember to structure your program cleanly, using well-documented subroutines for the different tasks. I don't want to see a lot of long straight-line code and excessive use of global variables. (In Perl, "use strict" will help you catch the variables that are being used without a "my" declaration.)
The fasta files may have "alignment" characters in them, like spaces, periods, and hyphens. These are not part of the sequences and should be removed or ignored.
Your program should have two outputs: the score for the alignment and an a2m file that contains the alignment obtained by doing a traceback. The score output should contain the ids of the two sequences, the method used (global/local), and the score. Reporting other parameters (such as which substitution matrix was used and the gap-penalty parameters) is a good idea.
Note: PERL is not the language of choice for computation-intensive applications like sequence alignment. It is required for this assignment only for pedagogic purposes.
The correct A2M format for a local alignment could be
>test1 ----------CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC----------------- >test2 EEEEEEEEEE---------------ADEFG----------------EEEEEEEEEEEEEEEEEor
>test1 cccccccccccccccADEFGcccccccccccccccc. >test2 .....eeeeeeeeeeADEFGeeeeeeeeeeeeeeeeeThe second format (with lower-case and periods for unaligned characters) is greatly preferred, as the first format really expresses a global alignment that has large end gaps.
See http://compbio.soe.ucsc.edu/a2m-desc.html for more details of a2m format.
Name your program "align", so that we can test everyone's program with a script that does "align -l < test.fasta" and "align -g < test.fasta". Turn in both your program and the output for running the program on the sequence pairs in the the following fasta files:
|
|
| BME 205 home page | Karplus's lab page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building