Linear hidden Markov models (HMMs) [Haussler et al., 1993, Krogh et al., 1994] and the related generalized profiles [Bucher & Bairoch, 1994] have been quite successful in detecting conserved patterns in multiple sequences [Hughey & Krogh, 1996, Baldi et al., 1994, Eddy et al., 1995, Eddy, 1995, Bucher et al., 1996, McClure et al., 1996]. Given unaligned sequences that contain instances of a family, the training procedure produces an HMM that identifies a set of positions describing, more or less, the conserved first-order structure of the family.
The resulting model is useful for three purposes. First, using a dynamic programming method, one can generate a multiple alignment of the unaligned sequences from which the model was built, and thus inspect the regions in these training sequences that the process found to be "homologous". Second, by studying the model itself, one can gain further insight into the common structure underlying the sequences in the family. Finally, the model can be used to discriminate between family and non-family sequences when employed for database searching.
It is this last issue that we address here using the Sequence Analysis and Modeling software suite (SAM) [Hughey & Krogh, 1996]. While previous work with SAM has shown notable success in sequence family discrimination, the methods used to determine sequence homology were admittedly crude [Krogh et al., 1994]. The basic scores reported by SAM are highly length-dependent, so a heuristic was used to Z-score each sequence based on the distribution of scores for sequences of similar length. This paper studies techniques to overcome problems with the Z-score heuristic. The problem of length dependence in the score is addressed using log-odds scoring, discussed by Altschul and used by HMMer [Altschul, 1991, Eddy et al., 1995]. This method evaluates how much better a sequence fits the model than it fits some underlying background distribution, or null model. The significance of these scores is evaluated with the algorithmic significance method [Milosavljevic & Jurka, 1993]. The primary result of this paper is an experimental evaluation of null model alternatives for log-odds scoring.
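
To make the idea of log-odds scoring concrete, the following is a minimal sketch and not SAM's or HMMer's implementation. It assumes a toy ungapped model (one residue distribution per position, standing in for the match states of an HMM), a uniform null model over a four-letter alphabet, and scores measured in bits. The names log_odds_bits, significance_bound, toy_model, and uniform_null are hypothetical and chosen only for illustration. The bound in significance_bound is the standard algorithmic significance bound: under the null model, the probability that a sequence scores at least d bits is at most 2^-d.

```python
import math

def log_odds_bits(seq, model_probs, null_probs):
    """Return log2 P(seq | model) - log2 P(seq | null) for a toy ungapped model.

    model_probs: list of dicts, one residue distribution per position (assumed).
    null_probs:  dict giving the background probability of each residue (assumed).
    """
    if len(seq) != len(model_probs):
        raise ValueError("toy model is ungapped: sequence and model lengths must match")
    score = 0.0
    for position, residue in enumerate(seq):
        # Each position contributes the log ratio of model to null probability.
        score += math.log2(model_probs[position][residue]) - math.log2(null_probs[residue])
    return score

def significance_bound(score_bits):
    """Upper bound on the null-model probability of scoring at least score_bits."""
    return min(1.0, 2.0 ** -score_bits)

# Hypothetical three-position model that favors the pattern A, C, G.
toy_model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform_null = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

family_like = log_odds_bits("ACG", toy_model, uniform_null)  # positive: fits the model better
unrelated = log_odds_bits("TTT", toy_model, uniform_null)    # negative: fits the null better
```

Because each residue's null-model term is subtracted from its model term, a sequence is rewarded only for fitting the model better than the background, which is the property the log-odds approach relies on to reduce length dependence. A real HMM score would also sum over alignments with insertions and deletions, which this sketch omits.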