To evaluate scoring methods, we performed discrimination experiments on three families of sequences, globins, EF-hands, and ferredoxins, and searched for remote homologues of 1hurA.
To build models, we created a model based on the unaligned sequences in the training set using SAM's default settings and a Dirichlet mixture prior [Sjölander et al., 1996]. From this model, we generated a multiple alignment of the training sequences and calculated position-based sequence weights [Henikoff & Henikoff, 1994], using an unpublished method to determine total sequence weight. With these weights, the unaligned sequences, and the mixture prior, we built the final HMM. After SAM's surgery heuristic, the globin model had 148 positions, the EF-hand model 27, the ferredoxin model 95, and the 1hurA model 180. For the HMMer models, we used the default settings, which include an internal weighting scheme. Because we did not adjust any parameters, our models may not be as good as those of an experienced HMMer user.
In scoring the models, we chose to use semi-local scoring, which allows a fragment to match only part of the model. We decided on this after discovering that many of our globin false negatives were short fragments. The switch to semi-local scoring enables us to move several, but not all, of these fragments to the true and possible positive scores, rather than being buried in the database.
For evaluation of false positives and false negatives, we use the Prosite classification in the first three experiments and the FSSP tree in the last. Because there may be unidentified EF-hands, globins, or ferredoxins in the databases, some of the sequences identified as false positives in the experiments may actually be true positives.
We investigated the following simple null models:
For the globins, we used Prosite [Bairoch & Bucher, 1994] to extract 685 globin sequences from SwissProt 32 [Bairoch & Boeckmann, 1994] of lengths 19 to 403. From these 685 sequences, we used the same 333 training sequences of lengths 40 to 403, as in a previous study [Bucher et al., 1996]. Because the globin domain is quite conserved, it is relatively easy to model.
![]() |
Figure 2a plots false negatives versus false positives,
each of which are a function of the setting of threshold score or
. At each integral point on, for example, the false positive
axis, the value of the false negative coordinate is a linear
interpolation between the two nearest integral false negative values
based on the scores of the two false negative points and the false
positive point in question. This means, for example, that if two
negative examples score 10 and 20, and two positive examples score 19
and 21, the point (1,1.1) will be plotted for the score of 19,
indicating that that if the threshold is set at 19, there will be one
false negative (the sequence that scores 21) and a little over one
false positive (the sequence scoring 19 is 10% of the way between the
two negative examples).
In the case of the globins, the model achieved excellent, and comparable, discrimination for all null models except the flat distribution and the scored letter counts distribution. The Z-scoring method does poorly, in part because of problems with false positives with more than 20% wildcard characters. If these are trimmed from the database, Z-scoring performance is similar to hmmsw. The hmmsw performance is more indicative of the model building (and our unfamiliarity with it) than the scoring: hmmsw uses log-odds scoring with either a background amino acid distribution (default) or a user-specified null model.
Because of the ease of discrimination, the good null models show little variation between themselves. An ideal graph would show complete separation between the false positives and false negatives, an `L' shape that includes the origin.
Looking at the histogram for a typical example (Figure 2b), the theoretical significance setting does not exactly
provide the number of false positives, as the threshold must be
lowered (to approximately
) before a false positive is
found.
For the well-performing null models, the low-scoring positives include, in order of lowest to highest score, three fragments with 29-41 residues (GLB4_LAMSP, HMPA_SALTY, and GLB1_LAMSP), and GLBH_CAEEL (159 residues). Under the best null model, the complex, the -7.9 score of GLBH_CAEEL is higher than any false positve, and would certainly be found by inspection. The other three scores are 2.6, 3.2, and 4.4, corresponding to 140, 75, and 25 false positives. The problem is not strictly in detecting fragments, as the smallest globin fragment, HBB2_UROHA at 19 residues, always scores as an identifiable globin. Of these five sequences, only GLB1_LAMSP was in the training set. The two highest-scoring non-globins are YF06_HAEIN (9.3) and WN8B_XENLA (6.8).
The primary result of this examination of globin scoring is that when the model is exceptionally good, the choice of null model makes little difference, apart from avoiding the flat and the scored letter counts null models.
The EF-hand motif is a 29-residue structure present in cytosolic calcium-modulated proteins [Nakayama et al., 1992,Persechini et al., 1989,Moncrief et al., 1990]. These proteins bind the second messenger calcium and in their active form function as enzymes or regulate other enzymes and structural proteins. Although a number of proteins possess the EF-hand motif, some of these regions have lost their calcium binding property. For a training set, we used the same set of EF-hand sequences as in earlier work [Krogh et al., 1994]. This set is based on the June 1992 set of EF-hand sequences maintained by Kretsinger and co-workers [Nakayama et al., 1992]. The EF-hand structures were extracted from each of the 242 sequences in the EF-hand database to obtain 885 EF-hand motifs having an average length of 29. We used Prosite to classify 324 full and partial sequences of lengths 25 to 2477 containing the EF-hand motif as true positives. We placed Prosite's unknown sequences in the group of non-EF-hands. Because of the motif's shortness and its diversity, it is a difficult family to model.
![]() |
In the case of the EF-hand experiments (Figure 3), discrimination was not nearly so good, but there was differentiation between null models. Here, the minimal number of false negatives is approximately 6 for the complex, background distribution, and geometric average null models. The match counts, letter counts, flat, scored sequence, and hmmsw curves become vertical around 10-15 false negatives, indicating more sequences that could not be found regardless of the number of false positives. Scored letter counts achieves zero false positives at 330 false negatives, match counts and letter counts at 240 false negatives, hmmsw at 220 false negatives, and the remaining null models at 200-210 false negatives.
![]() |
During the experiments, we found that many of the false positives were
sequences containing variant EF-motifs that no longer bind calcium,
and thus are not labeled as EF-hands by Prosite. To account for this,
we changed the classifications of 43 sequences to being an EF-hand
based on their SwissProt annotations. Specifically, we moved 36
sequences that are similar to other EF-hand calcium binding proteins
(with scores between 25.4 and 16.6), but do not bind calcium, 4
sequences that contain potential calcium binding sites (AAC2_HUMAN
(20.8), AAC3_HUMAN (20.0), AACT_CHICK (19.3), and YNZ1_CAEEL
(16.6)), and three sequences (SPCA_HUMAN (18.2), CC23_ORCLI (17.9),
CAGC_PIG (17.5)) that bind calcium but are not noted as such in
Prosite. Two sequences that scored quite well (SM21_SCHMA (22.2) and
YKT5_CAEEL (18.2)) do not have any calcium-related annotations and
were kept in the negative category, as were 6 additional proteins that
are involved in calcium channel function. The results of this switch,
which changes the curve offsets but not the relative ordering, are
shown in Figure 4. Additionally, these diagrams show that
the theoretical
falls close to where it should, though
hand inspection of sequences and score histograms is still required for
a neighborhood around the significance setting. This observation was
common, to varying degrees, in all four experiments.
The EF-hand experiments favor the use of complex, geometric, or background distribution null models.
The 2Fe-2S ferredoxins are a subgroup of the ferredoxin family, which is composed of iron-sulfur proteins that mediate electron transfer in a diverse range of metabolic reactions in a broad spectrum of organisms, including plants, algae and archaebacteria. As such, the family has witnessed numerous sequence and structural evolutionary changes. This, plus the sequence and structural conservation needed to retain functional properties, make the ferredoxin family as a whole difficult to model. Thus we chose the 2Fe-2S subgroup.
The 2Fe-2S ferredoxins are proteins or domains of approximately 100 amino acid residues that bind a single 2Fe-2S iron-sulfur cluster [Sander & Schneider, 1991]. Using the Prosite classification of true 2Fe-2S ferredoxins, we extracted 100 sequences from SwissProt 32 of lengths 32 to 1353. Two of these were fragments. Model training involved 49 sequences, with fragments excluded.
![]() |
Discrimination ability for the 2Fe-2S ferredoxins fell between that of the globins and EF-hands (Figure 5). The best tradeoff for false negatives and false positives is for the complex and the flat null models, closely followed by the training set letter counts and geometric mean null models. For a low number of false positives (3-7), there are 12-15 false negatives. The flat null model performs particularly well, though we are not entirely sure why.
Scored counts, background, and match counts all perform poorly. In the globin and EF-hand experiments, the train counts and match counts performed similarly. The difference here is that the ferredoxin model was trained on unclipped sequences that contained the ferredoxin domain plus extraneous residues. Because the amino acid distributions of the motif itself and the complete sequences that contain the motif are different, train counts and match counts produce different results.
For the 2Fe-2S ferredoxins, the flat, geometric average, complex, and train counts null models provided the best discrimination.
In order to test the effect of null model choice on searches for remote homologs, we selected an HMM trained on 1hurA (human ADP-ribosylation factor 1) and its close homologs, as given in the HSSP database [Sander & Schneider, 1991]. We used this HMM to search the FSSP database of representative proteins [Holm & Sander, 1996]. Because no two sequences in the FSSP database have greater than 25% residue identity, we are looking for remote homologs, not close ones.
Deciding how distant a homolog should be counted as a correct one is a little arbitrary -- we defined the following 6 proteins as desired hits: 5p21, 1tag, 1hurA, 1efm, 1efgA, and 1eft. These proteins form a subtree of the FSSP classification tree and all have a Z-score higher than 9 in the 1hurA.fssp file.
All of the methods score 1hurA and 5p21 better than any of the incorrect ones, and all but the flat null model score 1tag better than any of the incorrect ones. Only hmmsw and the flat null model score 1efgA worse than an incorrect protein. The two most difficult hits are 1eft and 1efm, both of which score worse than the best of the incorrect proteins for all models.
![]() |
Because the behavior on false negatives is so similar across all null models (Figure 6), we get the most sensitive test of the different null model choices by considering how many false positives are found when there are no false negatives. In increasing order these are training sequence count and geometric mean (4 false positives), match state count (6 false positives), complex null model (10 false positives), generic background (11 false positives), hmmsw (46 false positives), and flat background (73 false positives). Thus, the remote homologue experiments favor the training sequence counts or geometric mean null models.