next up previous
Next: 6 Discussion Up: Scoring Hidden Markov Models Previous: 4 Algorithm

Subsections

5 Results

To evaluate scoring methods, we performed discrimination experiments on three families of sequences, globins, EF-hands, and ferredoxins, and searched for remote homologues of 1hurA.

To build models, we created a model based on the unaligned sequences in the training set using SAM's default settings and a Dirichlet mixture prior [Sjölander et al., 1996]. From this model, we generated a multiple alignment of the training sequences and calculated position-based sequence weights [Henikoff & Henikoff, 1994], using an unpublished method to determine total sequence weight. With these weights, the unaligned sequences, and the mixture prior, we built the final HMM. After SAM's surgery heuristic, the globin model had 148 positions, the EF-hand model 27, the ferredoxin model 95, and the 1hurA model 180. For the HMMer models, we used the default settings, which include an internal weighting scheme. Because we did not adjust any parameters, our models may not be as good as those of an experienced HMMer user.

In scoring the models, we chose to use semi-local scoring, which allows a fragment to match only part of the model. We decided on this after discovering that many of our globin false negatives were short fragments. The switch to semi-local scoring enables us to move several, but not all, of these fragments to the true and possible positive scores, rather than being buried in the database.

For evaluation of false positives and false negatives, we use the Prosite classification in the first three experiments and the FSSP tree in the last. Because there may be unidentified EF-hands, globins, or ferredoxins in the databases, some of the sequences identified as false positives in the experiments may actually be true positives.

We investigated the following simple null models:

flat
Each amino acid is equally likely
background
The background distribution over all proteins
train counts
The amino acid frequencies of the training set
match counts
The average amino acid frequencies found in the model's match states for the training set
geometric mean
The normalized geometric mean of the match state probabilities
scored counts
The amino acid frequencies of the scored sequence.
We also analyzed the complex null model based on the trained model's transitions and the geometric mean of its match states, SAM's Z-scoring heuristic, and HMMer's hmmsw.

5.1 Globins

For the globins, we used Prosite [Bairoch & Bucher, 1994] to extract 685 globin sequences from SwissProt 32 [Bairoch & Boeckmann, 1994] of lengths 19 to 403. From these 685 sequences, we used the same 333 training sequences of lengths 40 to 403, as in a previous study [Bucher et al., 1996]. Because the globin domain is quite conserved, it is relatively easy to model.


  
Figure 2: Globin false positives versus false negatives for different thresholds using the different null models (a). For the cluster of roughly equivalent methods, the theoretical significance level occurs at 6 false negatives and no false positives, although on typical histogram (b, for the complex null model, with a vertical line at $\sigma = 1$) slightly lower settings provide 5 and 0 or 4 and 1 false negatives and false positives. Scoring methods are listed in order of intersection with the top of the plot.
\begin{figure*}
\begin{center}
 \begin{tabular}
{c@{\hspace{0.25in}}c}
{
\psfig ...
 ...,width=0.45\textwidth}
}\\  (a) & (b) \\  \end{tabular}\end{center}\end{figure*}

Figure 2a plots false negatives versus false positives, each of which are a function of the setting of threshold score or $\sigma$. At each integral point on, for example, the false positive axis, the value of the false negative coordinate is a linear interpolation between the two nearest integral false negative values based on the scores of the two false negative points and the false positive point in question. This means, for example, that if two negative examples score 10 and 20, and two positive examples score 19 and 21, the point (1,1.1) will be plotted for the score of 19, indicating that that if the threshold is set at 19, there will be one false negative (the sequence that scores 21) and a little over one false positive (the sequence scoring 19 is 10% of the way between the two negative examples).

In the case of the globins, the model achieved excellent, and comparable, discrimination for all null models except the flat distribution and the scored letter counts distribution. The Z-scoring method does poorly, in part because of problems with false positives with more than 20% wildcard characters. If these are trimmed from the database, Z-scoring performance is similar to hmmsw. The hmmsw performance is more indicative of the model building (and our unfamiliarity with it) than the scoring: hmmsw uses log-odds scoring with either a background amino acid distribution (default) or a user-specified null model.

Because of the ease of discrimination, the good null models show little variation between themselves. An ideal graph would show complete separation between the false positives and false negatives, an `L' shape that includes the origin.

Looking at the histogram for a typical example (Figure 2b), the theoretical significance setting does not exactly provide the number of false positives, as the threshold must be lowered (to approximately $\sigma=10$) before a false positive is found.

For the well-performing null models, the low-scoring positives include, in order of lowest to highest score, three fragments with 29-41 residues (GLB4_LAMSP, HMPA_SALTY, and GLB1_LAMSP), and GLBH_CAEEL (159 residues). Under the best null model, the complex, the -7.9 score of GLBH_CAEEL is higher than any false positve, and would certainly be found by inspection. The other three scores are 2.6, 3.2, and 4.4, corresponding to 140, 75, and 25 false positives. The problem is not strictly in detecting fragments, as the smallest globin fragment, HBB2_UROHA at 19 residues, always scores as an identifiable globin. Of these five sequences, only GLB1_LAMSP was in the training set. The two highest-scoring non-globins are YF06_HAEIN (9.3) and WN8B_XENLA (6.8).

The primary result of this examination of globin scoring is that when the model is exceptionally good, the choice of null model makes little difference, apart from avoiding the flat and the scored letter counts null models.

5.2 EF-Hands

The EF-hand motif is a 29-residue structure present in cytosolic calcium-modulated proteins [Nakayama et al., 1992,Persechini et al., 1989,Moncrief et al., 1990]. These proteins bind the second messenger calcium and in their active form function as enzymes or regulate other enzymes and structural proteins. Although a number of proteins possess the EF-hand motif, some of these regions have lost their calcium binding property. For a training set, we used the same set of EF-hand sequences as in earlier work [Krogh et al., 1994]. This set is based on the June 1992 set of EF-hand sequences maintained by Kretsinger and co-workers [Nakayama et al., 1992]. The EF-hand structures were extracted from each of the 242 sequences in the EF-hand database to obtain 885 EF-hand motifs having an average length of 29. We used Prosite to classify 324 full and partial sequences of lengths 25 to 2477 containing the EF-hand motif as true positives. We placed Prosite's unknown sequences in the group of non-EF-hands. Because of the motif's shortness and its diversity, it is a difficult family to model.


  
Figure 3: EF-hand false negatives versus false positives the different null models and HMMer. Scoring methods are listed in order of intersection with the top of the plot. Plot (b) is a subregion of (a) at higher magnification.
\begin{figure*}
\begin{center}
 \begin{tabular}
{c@{\hspace{0.25in}}c}
{
\psfig ...
 ...,width=0.45\textwidth}
}\\  (a) & (b) \\  \end{tabular}\end{center}\end{figure*}

In the case of the EF-hand experiments (Figure 3), discrimination was not nearly so good, but there was differentiation between null models. Here, the minimal number of false negatives is approximately 6 for the complex, background distribution, and geometric average null models. The match counts, letter counts, flat, scored sequence, and hmmsw curves become vertical around 10-15 false negatives, indicating more sequences that could not be found regardless of the number of false positives. Scored letter counts achieves zero false positives at 330 false negatives, match counts and letter counts at 240 false negatives, hmmsw at 220 false negatives, and the remaining null models at 200-210 false negatives.


  
Figure 4: EF-hand false positives and false negatives as a function of threshold setting before (a) and after (b, c) moving 43 sequences from the negative to the positive category. The vertical bar corresponds to $\sigma=1.0$.
\begin{figure*}
\begin{center}
\begin{tabular}
{c*{2}{@{\hspace{0.025\textwidth}...
 ...tric Average &
 (c) Adjusted HMMer hmmsw
 \end{tabular}\end{center}\end{figure*}

During the experiments, we found that many of the false positives were sequences containing variant EF-motifs that no longer bind calcium, and thus are not labeled as EF-hands by Prosite. To account for this, we changed the classifications of 43 sequences to being an EF-hand based on their SwissProt annotations. Specifically, we moved 36 sequences that are similar to other EF-hand calcium binding proteins (with scores between 25.4 and 16.6), but do not bind calcium, 4 sequences that contain potential calcium binding sites (AAC2_HUMAN (20.8), AAC3_HUMAN (20.0), AACT_CHICK (19.3), and YNZ1_CAEEL (16.6)), and three sequences (SPCA_HUMAN (18.2), CC23_ORCLI (17.9), CAGC_PIG (17.5)) that bind calcium but are not noted as such in Prosite. Two sequences that scored quite well (SM21_SCHMA (22.2) and YKT5_CAEEL (18.2)) do not have any calcium-related annotations and were kept in the negative category, as were 6 additional proteins that are involved in calcium channel function. The results of this switch, which changes the curve offsets but not the relative ordering, are shown in Figure 4. Additionally, these diagrams show that the theoretical $\sigma=1.0$ falls close to where it should, though hand inspection of sequences and score histograms is still required for a neighborhood around the significance setting. This observation was common, to varying degrees, in all four experiments.

The EF-hand experiments favor the use of complex, geometric, or background distribution null models.

5.3 Ferredoxins

The 2Fe-2S ferredoxins are a subgroup of the ferredoxin family, which is composed of iron-sulfur proteins that mediate electron transfer in a diverse range of metabolic reactions in a broad spectrum of organisms, including plants, algae and archaebacteria. As such, the family has witnessed numerous sequence and structural evolutionary changes. This, plus the sequence and structural conservation needed to retain functional properties, make the ferredoxin family as a whole difficult to model. Thus we chose the 2Fe-2S subgroup.

The 2Fe-2S ferredoxins are proteins or domains of approximately 100 amino acid residues that bind a single 2Fe-2S iron-sulfur cluster [Sander & Schneider, 1991]. Using the Prosite classification of true 2Fe-2S ferredoxins, we extracted 100 sequences from SwissProt 32 of lengths 32 to 1353. Two of these were fragments. Model training involved 49 sequences, with fragments excluded.


  
Figure 5: Ferredoxin false negatives versus false positives. Scoring methods are listed in order of intersection with the top of the plot.
\begin{figure}
\centerline{
\psfig {figure=plots/ferre/multi.ps,width=0.9\columnwidth}
}\end{figure}

Discrimination ability for the 2Fe-2S ferredoxins fell between that of the globins and EF-hands (Figure 5). The best tradeoff for false negatives and false positives is for the complex and the flat null models, closely followed by the training set letter counts and geometric mean null models. For a low number of false positives (3-7), there are 12-15 false negatives. The flat null model performs particularly well, though we are not entirely sure why.

Scored counts, background, and match counts all perform poorly. In the globin and EF-hand experiments, the train counts and match counts performed similarly. The difference here is that the ferredoxin model was trained on unclipped sequences that contained the ferredoxin domain plus extraneous residues. Because the amino acid distributions of the motif itself and the complete sequences that contain the motif are different, train counts and match counts produce different results.

For the 2Fe-2S ferredoxins, the flat, geometric average, complex, and train counts null models provided the best discrimination.

5.4 Remote Homologue Detection

In order to test the effect of null model choice on searches for remote homologs, we selected an HMM trained on 1hurA (human ADP-ribosylation factor 1) and its close homologs, as given in the HSSP database [Sander & Schneider, 1991]. We used this HMM to search the FSSP database of representative proteins [Holm & Sander, 1996]. Because no two sequences in the FSSP database have greater than 25% residue identity, we are looking for remote homologs, not close ones.

Deciding how distant a homolog should be counted as a correct one is a little arbitrary -- we defined the following 6 proteins as desired hits: 5p21, 1tag, 1hurA, 1efm, 1efgA, and 1eft. These proteins form a subtree of the FSSP classification tree and all have a Z-score higher than 9 in the 1hurA.fssp file.

All of the methods score 1hurA and 5p21 better than any of the incorrect ones, and all but the flat null model score 1tag better than any of the incorrect ones. Only hmmsw and the flat null model score 1efgA worse than an incorrect protein. The two most difficult hits are 1eft and 1efm, both of which score worse than the best of the incorrect proteins for all models.


  
Figure 6: False negatives versus false positives when searching for remote homologues of 1hurA. Scoring methods are listed in order of intersection with the vertical axis.
\begin{figure}
\centerline{
\psfig {figure=plots/hssp/multi.ps,width=0.9\columnwidth}
}\end{figure}

Because the behavior on false negatives is so similar across all null models (Figure 6), we get the most sensitive test of the different null model choices by considering how many false positives are found when there are no false negatives. In increasing order these are training sequence count and geometric mean (4 false positives), match state count (6 false positives), complex null model (10 false positives), generic background (11 false positives), hmmsw (46 false positives), and flat background (73 false positives). Thus, the remote homologue experiments favor the training sequence counts or geometric mean null models.


next up previous
Next: 6 Discussion Up: Scoring Hidden Markov Models Previous: 4 Algorithm
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group