next up previous
Next: 3 System and methods Up: Reduced space hidden Markov Previous: 1 Abstract

2 Introduction

Dynamic programming forms the core of many sequence analysis methods, including classic methods for sequence-sequence comparison and alignment [Needleman & Wunsch, 1970,Smith & Waterman, 1981], as well as more recent methods such as profiles and linear hidden Markov models (HMMs) [Gribskov et al., 1990,Krogh et al., 1994].

The typical process of training an HMM consists of performing many complete dynamic programming calculations on the model and each of the training set sequences. This is repeated many times until the model converges to a statistical representation of the family of sequences. The model may then be used to create a multiple alignment or to find other related sequences.

Perhaps the simplest form of the dynamic programming equation is that of edit distance [Wagner & Fischer, 1974,Sellers, 1974]:  

 \begin{displaymath}
\begin{array}
{rcll}
 {c_{i,j}}&=&\min \left\{ \begin{array}...
 ...limits(\phi,b_j)&\hbox{delete},
 \end{array} \right.\end{array}\end{displaymath}

The calculation of this recurrence can be arranged in a table of costs for matching each prefix of a to each prefix of b, and the selection information from the minimizations can be used to trace back the optimal alignment (Figure 1).
  
Figure: Dynamic programming example to find least cost edit of ``BACKTRACK'' into ``TRACEBACK'' using Sellers' evolutionary distance metric [Sellers, 1974]. Below the dynamic programming table are two possible alignments and an illustration of the data dependencies.
\begin{figure}
{\small \begin{minipage}
{21em}\noindent

\setlength {\unitlength...
 ...,3){\makebox(1,1){\hss$c_{i,j-1}$\hss}}%\end{picture}\end{minipage}}\end{figure}


  
Figure 2: An example of an HMM with two sequences whose characters are generated by the HMM, and the corresponding alignment. Positions modeled by the HMM's match states are indicated with uppercase letters, while those modeled by unaligned insertion states are indicated with lowercase letters.
\begin{figure}
\pscolumn{hmmexample.eps}\end{figure}

The addition of affine gap costs g to the recurrence [Gotoh, 1982] produces three interleaved recurrences of similar form. Here, the comparison or alignment can be thought of as being in one of three states: in the midst of a sequence of matches, deletions, or insertions, where cost for changing states provides the constant term in the affine gap cost. To emulate the simplest affine gap cost model, many of the g values are zero, indicating for example that there is no extra cost in matching one pair of characters and then matching the next pair of characters. In the most general forms of sequence comparison, such as linear hidden Markov models and generalized profiles [Krogh et al., 1994,Bucher & Bairoch, 1994], all transition or gap costs (g) between the three states (match, insert, or delete) and character distance costs are position-dependent:

\begin{displaymath}
\begin{array}
{*{9}{l@{\,}}}
{f^{\ss M}_{\ss i,j}}&=
&&\left...
 ...t)&
\multicolumn{3}{l}{\cdot P_j^{\ss D}(\phi,b_j).}\end{array}\end{displaymath}

The typical linear hidden Markov model (Figure 2) is a chain of match (square), insert (diamond) and delete (circle) nodes, with all transitions between nodes and all character costs in the insert and match nodes trained to specific probabilities. When performing dynamic programming between a sequence and an HMM, the HMM will index one dimension, such as the columns, of the matrix. The single best path through an HMM corresponds to a path from the Start state to the End state in which each character of the sequence is related to a successive match or insertion state along that path (delete states indicate that the sequence has no character corresponding to that position in the HMM). If two sequences are aligned to the model, a multiple alignment between those sequences can be inferred from their alignments to the model, though it must be remembered with HMMs that characters modeled by insert states are not aligned between sequences. This HMM multiple alignment algorithm requires time proportional to the number of sequences, or alternatively the product of the total number of residues and the model length.

The forward-backward method of hidden Markov model training considers probabilities rather than scores over all possible states, replacing the minimizations with the addition of probability, and the additions with multiplication of probability (the probabilities are stored as log-probabilities to avoid underflow). Thus, the forward pass becomes:

\begin{displaymath}
\begin{array}
{*{9}{l@{\,}}}
{f^{\ss M}_{\ss i,j}}&=
&&\left...
 ...ht)&
\multicolumn{3}{l}{\cdot P_j^{\ss D}(\phi,b_j)}\end{array}\end{displaymath}

and, considering the various parameters as probabilities, computes

\begin{displaymath}
f_{i,j}=P\left(a_1\ldots a_i, b_1\ldots b_j\vert
\mbox{HMM}\right),\end{displaymath}

where in this simplified version of the HMM notation, the ranges on a and b indicate that the first i characters of a were generated by a chain of j states of the HMM ending at state bj. A similar recurrence, starting at the end of the sequence and the end of the model, is used to compute the backward values

\begin{displaymath}
b_{i,j}=P(a_n\ldots a_i,b_m\ldots b_j \vert
\mbox{HMM}),\end{displaymath}

indicating that the last n-i+1 characters of a can be generated by a chain of m-j+1 states of the model. The forward and the backward values are combined to yield

\begin{displaymath}
P(a_i \mbox{ is generated
by state } b_j \vert\mbox{HMM}) = f_{i,j}\cdot b_{i,j}. \end{displaymath}

This value, the probability that a certain character was generated by a certain state of the HMM, is then used to update the probabilities in the HMM, in association with the values for other sequences in the training set and a regularizer or Dirichlet mixture prior [Sjölander et al., 1996]. The notation has been simplified; the reader is referred to the literature for a more detailed treatment [Rabiner, 1989,Krogh et al., 1994] and an HMM review [Eddy, 1996].

The simplest approach to computing these O(nm) dynamic programs is to create a large, $n\times m$ table in memory to store values. Unfortunately, this table will not fit entirely in a workstation's memory for large model and sequence lengths. For example, if a family of long molecules is to be modeled, say 2000 nucleotides and 2000 model nodes, the dynamic programming matrix will include 12$\times10^6$ entries, one for each state at each index point. To provide sufficient precision, in SAM each of those entries requires four bytes, so 48$\times10^6$ bytes are required just for the dynamic programming table. When added to the memory requirements for the rest of the application, not to mention the operating system and other users, this will consume most all the memory of a typical workstation with 64Mbytes of main memory. If the sequence or model length is larger, or memory available for the program is lower, virtual memory will be used extensively. That is, blocks of memory (or pages -- typically 2 to 8 kilobytes of data each) will be temporarily stored on disk, with the computer's real memory and cache storing only currently active pages. In a current machine, a cache may require one clock cycle to access, main memory 10 to 20 clock cycles, and disk over one million cycles [Hennessy & Patterson, 1996]. Thus, the cost of paging into virtual memory is high, and can make runs of a given size effectively uncomputable.

The solution to this is to use a sequence alignment method that requires less space. In the case of finding the single best path, there is an elegant divide-and-conquer algorithm that requires only O(n+m) space, where n is the sequence length and m is the model length [Hirschberg, 1975]. The approach of this algorithm is to find a midpoint of the best path without saving all O(nm) dynamic programming entries, and then to solve two smaller problems, each of approximate size nm/4 using the same algorithm. This algorithm is well known in the computational biology community [Myers & Miller, 1988], and has, for example, been implemented in the HMMer package for sequence alignment to a trained HMM [Eddy et al., 1995].

The divide-and-conquer algorithm does not work with forward-backward training. The efficiency of the divide-and-conquer algorithm is in its partitioning into subproblems so that, while the entire dynamic programming matrix must be evaluated once, recursive calls consider smaller and smaller segments of the matrix. With forward-backward training, all paths through the matrix are an important part of training the HMM, and thus this gain cannot be used.

One answer is a recently introduced checkpointing algorithm [Grice et al., 1997]. In the simplest case, diagonals, rows, or columns of the dynamic programming matrix are stored at regular intervals to reduce space use to $O(m\sqrt{n})$ while increasing runtime by a small constant factor. In the experiments discussed below, the new checkpoint-based HMM training requires approximately 10% more time than standard dynamic programming up to the point of virtual memory paging, at which point the checkpoint algorithm is considerably faster. The reduced space requirements are particularly valuable in multiple user environments.


next up previous
Next: 3 System and methods Up: Reduced space hidden Markov Previous: 1 Abstract
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group