*Author to whom correspondence should be directed
In this study we apply simple neural networks to the problem of predicting RNA secondary structure in the small subunit (SSU) and large subunit (LSU) of ribosomal RNA (rRNA) from prokaryotic and eukaryotic data. Our method is a form of comparative sequence analysis, which is regarded as being critical to understanding RNA structure and function (Gutell et al. 1990, James et al. 1989, Turner et al. 1988, Fox and Woese 1975). In contrast to some other phylogenetic comparison methods, the predictive neural network requires little interaction and the end product is a completely automatic machine that can be employed in more complex protocols which model nucleic acid structure. Although the network is trained on two-dimensional data, it is possible that the same statistical evaluation of evolutionary conservation and change can be applied to identifying higher order interactions.
Pairs of sequence numbers corresponding to the base-paired positions in the known secondary structure of E. coli were renumbered to the multiple alignments of the RDP files. Each pair of numbers, representing two columns in the alignment, was used to generate counts of the 36 possible pairs over the alphabet {A,C,G,U,N,-} (see Figure 1 for an example alignment).
1 2 3 4 5 6 7 8 - - A G C U G - - U A C G G - - - - G C G U A N
The character 'N' designates any ambiguous or modified base {D, N, R, Y, etc.}; '-' represents all gaps or "no information" in the alignment sequence; and the other four symbols correspond to the nucleotides adenine, cytosine, guanine, and uracil. Although a larger alphabet could have been utilized to incorporate knowledge of the ambiguous or modified bases, we wished to limit the free parameters of the neural net for speed, accuracy and simplicity.
Initial training on 16-count data (generated from pairs between only the four bases {A,C,G,U}) indicated that the omission of special characters and gaps produced inferior results and prolonged the required training time. Further 16-count experimentation was abandoned, however results are given for comparison to the 36-count networks.
The RDP secondary structure of E. coli contained 472 base-pairings in SSU rRNA, which was used to derive a data set of 472 vectors of 36 counts each. A complementary set of 472 non-base-pairing examples was randomly generated to form a full data set of base-paired (BP) and non-base-paired (NBP) data. A similar data set of 794 BP and NBP examples was composed for the LSU rRNA. The equal distribution of BP and NBP examples enabled us to assume an a priori probability of 0.5 for base-pairing. Although this does not represent the actual frequency of base-pairing in the macromolecules, it facilitated equal evaluation of the net's predictive ability on BP and NBP examples. In other experiments we will evaluate the performance of the network in a situation where the number of NBP examples greatly outnumbers the number of BP examples.
Mutual Information
Mutual information has been used to efficiently predict secondary structure (Lapedes 1994, Gutell et al. 1992, Chiu et al. 1991) by providing an estimate of covariance between columns. Positions in the alignment that base-pair will covary throughout evolution while others should not. The definition of mutual information is given in equation 1 below. This definition differs from the more standard definition by a constant factor which helps in comparison with direct input counts. Note that in the calculation of mutual information, the 36-count vector is interpreted as a 6 by 6 matrix of bases, such that C_ij refers to the observed count of pairs (base_i, base_j) where i,j is an element of {A,C,G,U,N,-}.




Neural Networks
Artificial neural networks were used to compute a weighted sum for each input vector, such that the output could be interpreted as a Bayesian a posteriori probability of base-pairing (Richard and Lippmann 1991). All development and training were executed with the Matlab software package version 4.2a on a Sun SparcStation running UNIX. The histograms and plots displayed in this paper were also generated by Matlab.
We selected a network architecture during preliminary testing on prokaryotic data. Hidden units were not utilized. Sum-squared error proved efficient and consistent with the Bayesian interpretation, using a feed-forward net with classical back propagation in batch training mode (Haykin 1994 provides a thorough description of neural networks). Initial weights and bias were set to zero. An adaptive learning rate was implemented to accelerate training, as well as reduce the amount of guesswork involved in defining parameters. To ensure smooth error curves, a maximum learning rate was maintained at a low level (generally one-thousandth to one-millionth).
Three types of networks were compared: "Counts only" (using the 36 input counts of observed base pairs), "MI only" (using solely the measure M of mutual information), and "Composite" (the two combined in a 37-input net). The "Composite" architecture is pictured in Figure 2.

The following formula calculates a BP score for each input vector to the "Composite" net:


NETWORK ARCHITECTURES Counts only MI only Composite DATA SETS BP NBP Both BP NBP Both BP NBP Both Prokaryotic SSU 92.80% 90.47% 91.63% 93.22% 95.55% 94.39% 96.61% 95.55% 96.08% 60rep Prok SSU 93.43% 93.86% 93.64% 71.82% 93.64% 82.73% 96.19% 95.55% 95.87% Eukaryotic SSU 76.90% 85.38% 81.14% 89.62% 59.96% 74.79% 77.75% 86.44% 82.10% Combined SSU 82.73% 90.47% 86.60% 92.80% 56.36% 74.58% 82.63% 92.58% 87.61% Combined LSU 67.42% 55.81% 61.62% 57.45% 77.40% 67.42% 54.55% 78.28% 66.41%
The net performed very well on prokaryotic SSU data; however, classification percentages of eukaryotic SSU were 10-14% worse than those for prokaryotic. The "MI only" network exhibited particularly poor performance with SSU eukaryotic data. Learning with combined SSU generally produced results between that of solely prokaryotic and solely eukaryotic, with the exception of those from the "MI only" network, which were worse than those generated from eukaryotic SSU only. Experiments with combined LSU data reported the lowest percentages overall.

Figure 3 illustrates the net's ability to discriminate between BP and NBP prokaryotic SSU pairs of bases. Although the neural net is trained using a log-sigmoid transfer function at the output, a simple sum (the value v from Eq. 5) is computed to display a wider range of values in the histograms. This gives a clearer picture of how well the two groups are separated by the neural net. Typical weights for the "Composite" net trained to predict prokaryotic SSU data are displayed in Table 2, and sample error curves are given in Figure 4. Only seventeen of the 472 base-paired examples were misclassified in one or more training runs. These false negatives are marked on the SSU secondary structure of E. coli, one of the representative prokaryotes (Figure 5). Four of these consistently scored low, even when they were part of the training set: 1405/1496, 665/741, 945/1236, and 950/1231 (listed lowest first). Note that the entire helix formed by 1404/1947 and 1405/1496 is rejected by the neural net.
A C G U N - A -0.0119 -0.0316 -0.0146 0.0019 0.0002 -0.0067 C -0.0669 -0.0335 0.0036 -0.0384 0.0018 -0.0068 G -0.0097 0.0019 -0.0479 0.0043 0.0004 -0.0402 U -0.0048 -0.0581 -0.0094 -0.0404 0.0117 -0.0371 N -0.0231 -0.0106 0.0094 -0.0013 0.0006 0.0136 - -0.0241 -0.0143 -0.0140 -0.0103 -0.0021 -0.0033 Mutual information weight = 0.0581 Bias = -3.7111e-04


Experiments in generalization
We also conducted a series of experiments to determine if a neural net trained to discriminate BP from NBP column pairs in the multiple alignment of one family of RNA sequences could be used to perform the same discrimination task in another family of RNA sequences. These experiments in generalization were run on the "Composite" net exclusively, since it had outperformed the other architectures almost without exception. Since the LSU, prokaryotic SSU, and eukaryotic SSU data sets were generated from multiple alignments of three different sizes, each LSU and eukaryotic SSU input was scaled such that its sum matched that of a prokaryotic SSU input (from 74 and 236 to 1379). For a direct comparison with the previous experiments, only 3/4 of the examples in the selected training set were used in training. However, all examples in the selected test set were used for testing. (Previous experiments had used 3/4 of the given data set for training and 1/4 for testing). The results are given in Table 3.
TRAIN->TEST BP NBP BOTH Prok->Euk SSU 79.66% 84.53% 82.10% Euk->Prok SSU 98.73% 81.36% 90.04% SSU->LSU 44.32% 85.61% 64.96% LSU->SSU 91.10% 75.11% 83.10%
Training solely with the prokaryotic SSU data and testing on the eukaryotic SSU data clearly demonstrated the net's ability to generalize across domains: training with the prokaryotic data proved as effective as training with the eukaryotic when the test set was eukaryotic. However, when generalizing from one macromolecule to another, care must be taken not to overfit the training data. Figure 6 shows the sum-squared error curves for the prokaryotic training set and separately for the eukaryotic test set, as a function of the number of epochs of neural net gradient descent estimation. Note that after about 100 epochs the error of the test set starts to rise, indicating overfitting. For the performance quoted in Table 3, the network trained for only 100 epochs was used.

Good generalization was also observed when training on the SSU data and testing on the LSU data, compared to training and testing both with the LSU data. However, considering the poor performance of the net in classifying LSU data, a generalization scheme of training on SSU data and testing on LSU data had little to achieve. Although the net overfit the test data almost immediately, continuation of training did not lessen the total classification percentage for LSU data. The amount of false negatives rose proportionately with a better classification of true negative examples.
Table 3 also lists results from reversed generalization experiments: training with eukaryotic SSU while testing on prokaryotic SSU and training with combined LSU while testing on combined SSU. Evidently the eukaryotic SSU data does not generalize as well to prokaryotic SSU, and neither does combined LSU data to combined SSU. However, some ability to generalize is still evident.
Using a large NBP set
If one wanted to search the entire multiple alignment of SSU or LSU sequences for pairs of columns that might base-pair, the NBP examples would greatly outnumber the BP examples. Specifically, with 2688 positions in the multiple alignment of prokaryotic SSU rRNA, there are over 3.6 million possible base pairs. The prokaryotic SSU classifier, with a 4.45% rate of false positives, would identify more than 160,000 false positive base pairs along with the 472 real base pairs. To reduce the number of false positives, one needs to raise the threshold used by the classifier to discriminate BP from NBP examples. We implemented and tested this method on a larger sample of the possible NBP examples, to determine if the current classifier could be used to identify some portion of the structure with a low rate of false positives.
A set of 23128 NBP examples was randomly generated from the multiple alignment of prokaryotic SSU rRNA, using the same procedure which generated the small set of 472 NBP examples. Adding the 472 prokaryotic SSU BP examples produced a data set of 23600 inputs, 2% of which were base-paired. We used the weights learned from the "Composite" net trained on prokaryotic SSU data to classify this large set of examples. Over 1000 (about 5%) of the NBP examples scored above zero. By setting a higher threshold and rejecting any input scoring less than this threshold, we significantly lowered the percentage of false positives. Table 4 lists the thresholds, the percentage of false positives they allowed, and the corresponding percentage of BP examples still classified correctly.
FALSE POSITIVES ALLOWED THRESHOLD BP CLASSIFIED 5.150% 0.00 98.20% 0.993% 10.50 84.32% 0.098% 25.23 60.59% 0.009% 34.00 50.64% 0.000% 41.16 44.28% 0.000% 45.00 41.74% 0.000% 50.00 37.71%
The highest scoring false positive could be rejected with a threshold of 41.1622, above which 209 of the 472 true base pairs were still classified correctly. However, since the large NBP set represented only 0.65% of the total NBP examples, a value of 45 or 50 may be a safer threshold at which almost all possible NBP examples would be rejected. Figure 7 displays the 197 base pairs scoring above a threshold of 45 on the secondary structure of E. coli. The marked base pairs are distributed throughout the graph and could be used to identify most of the structure. Some large helices missed are: 921-933/1384-1396, 938-943/1340-1345, 945-955/1225-1236, 1068-1073/1102-1107, and 1506-1515/1520-1529. Note that many of the missed helices are in physical proximity, which would make that portion of the macromolecule vaguely defined.

Looking at the input counts corresponding to the base pairs in these helices reveals that most of them are very conserved, i.e. usually one count (typically A/U or G/C) in each input ranges from 900-1200, and the remainder of the counts are trivially small. Such conservation leads to low mutual information scores, thus these inputs are rejected when mutual information is weighted highly. Many of these conserved inputs also indicate a significant amount of insertion/ deletion. For example, if the conserved pair is a G/C, there may also be a -/C in the range of 100-200. The three large helices missed which are in proximity (921-933/1384-1396, 938-943/1340-1345, 945-955/1225-1236) are all conserved with a large amount of insertion/ deletion. The nearby helix 1506-1515/1520-1529 is not conserved at all, but the counts for -/- (gap to gap) in each input are all over 1100, indicating this helix exists in nearly less than half of the prokaryotes represented in the alignment.
The neural net classifier accurately predicted 96.08% of the prokaryotic and 82.10% of the eukaryotic SSU data. Best results for the LSU data were achieved the "MI only" network, predicting 67.42% correctly. Overall, experiments with LSU sequences produced poor results. Switching from the 16-count inputs to 36-count gained roughly 2% in the prokaryotic SSU classifier, while a much larger gain of 10% was observed in predicting the eukaryotic SSU (data not shown). Due to its effectiveness, the "Composite" network is preferred. Using only mutual information as a classifier worked well with SSU prokaryotes, but comparatively poorly with SSU eukaryotes. Perhaps this can be attributed to the smaller sequence ensemble of eukaryotes (roughly one-sixth the size of the prokaryotic). Training with the 60 representative prokaryotic species confirms this suspicion, as the "MI only" net's scores were also comparatively low in this reduced ensemble. A smaller range of counts and highly conserved sequences (those with little phylogenetic diversity) impair the ability to determine dependency using mutual information.
The false negatives in Figure 5 seem to indicate the neural net classifier has some difficulty with endcaps, noncanonical pairs, and pairs adjacent to a bulge. However, these mistakes do not necessarily point to a weakness in the system, since many base pairs of these types were correctly classified in other locations. Although the G/A in 945/1236 was misclassified, 321/332 was not. The U/U in 1307/1330 was not classified as base-pairing whereas 245/283 was. Even the A/G next to a bulge in 665/741 has a counterpart in 68/101. Likewise many endcaps and U/G pairs were correctly classified as base-pairing. Some of the BP examples missed by the network may even be dubiously classified as base-pairing.
DATA SETS W-C+ MI prokaryotic SSU 16 count 25% 75% 36 count 11% 51% eukaryotic SSU 31% 26% combined SSU 43% 45% prok->euk SSU* 56% 43% combined LSU 00% 85%
Analysis of the learned weights, listed in Table 5, provides some understanding of the neural net's predictive ability. Positive weights are interpreted as positive evidence for base-pairing, while negative weights are assumed to be negative evidence. Mutual information, canonical base pairs, and pairs with the 'N' character generally provided the bulk of positive evidence for base-pairing when training with SSU data. Many noncanonical base pairs also earned positive weights, particularly the wobble pairs G/U and U/G. Pairs with the gap character and some noncanonical base pairs (e.g.: C/A, C/C) corresponded to strongly negative weights. In contrast, training with LSU data put all evidence on mutual information, ignoring the Watson-Crick pairs.
W-C+ MI GAP SSU prokaryotic BP 48.25% 35.05% 08.61% NBP 07.25% 03.04% 72.53% SSU eukaryotic BP 51.95% 13.99% 16.79% NBP 07.93% 02.74% 77.16% LSU combined BP 08.53% 16.17% 59.62% NBP 06.51% 07.64% 76.37%
A study of the composition of column pairs found in the multiple alignments partially explains the distribution of weights produced by the networks. In Table 6, three categories of counts describe the data files: "W-C+" groups the counts of Watson-Crick pairs (plus G/U and U/G), "MI" represents the fraction of total input to the net contributed by mutual information, and "Gap" corresponds to the portion contributed by all pairs with the '-' character. We assume that for base-paired columns, the "Gap" percentage should be low, reflecting high conservation throughout evolution. Parallel to this, "W-C+" and "MI" should be high. All this is observed in the SSU data, though the eukaryotic seems less conserved compared to the prokaryotic. Very low "W-C+" frequency and high "Gap" are observed in the LSU data, which may be key to the net's poor performance in predicting LSU base-pairing. The small distinction between BP and NBP composition may be attributed to the low number of sequences in the LSU ensemble (74 species), or perhaps a lack of phylogenetic diversity. It may also indicate that the multiple alignment is of lower quality than the SSU multiple alignments.
Experiments in generalization may also be used to evaluate the quality of an alignment. Since prokaryotic SSU generalized to eukaryotic SSU better than eukaryotic to prokaryotic (Table 3), we may posit that the prokaryotic SSU alignment is more consistent. Likewise for the combined SSU data over the LSU data. The accuracy of the neural network in predicting base-pairing is dependent on the quality of the multiple alignment and its sequence content. As multiple alignments improve and more sequences become available, we expect to observe better prediction of eukaryotic SSU and combined LSU data by the neural net.
High weights given to pairs with the 'N' character in prokaryotic SSU experiments may be caused by the comparatively low counts for pairs with the 'N' character. If these count inputs are consistently small but provide useful evidence, they should be weighted strongly such that the product w_ij multiplied by c_ij is comparable to other count-weight products. This argument may also be extended to why canonical pairs were not always weighted highest: since the input signal was already strong, not much amplification by the corresponding weight was necessary. After the first few training sessions, mutual information and canonical base pairs provided most of the evidence in the best networks. However, extended training sought subtle distinctions between inputs to improve classification accuracy. In many cases this resulted in noncanonical base pairs being weighted as positive evidence. Experimental evidence verifies that noncanonical base-pairing does influence higher order structure (Gutell et al. 1990, Holbrook et al. 1991). Occasionally even canonical base pairs were weighted negatively, particularly U/A.
It is conceivable that some false-positives in the experiments could be representative of tertiary structure. Implementing this neural net method to predict the known two- and three-dimensional structure of tRNA could verify its ability to predict tertiary structure in other macromolecules. Gutell, Noller and Woese (1986) have stated that considerable evidence must support claims of tertiary interactions. Although a 96% accurate classifier is not sufficient to filter through the millions of NBP positions in macromolecules the size of SSU rRNA, it might be used to target likely base pairs crucial to tertiary structure.
Perhaps the most effective use of the neural network method is to import its learned weights into other algorithms for determining structure. The simple calculation using input counts contributes a very low computational overhead and has proven to be reliable and robust. In other work these learned weights were incorporated into a program that identifies secondary structure in SSU and LSU rRNA with excellent results (Grate, 1995).
Acknowledgments
Many thanks are accorded to Leslie Grate for providing software to generate count files, as well as helping with other essential tasks. Bryn Weiser's program xRNA generated the E. coli secondary structure graph in Figures 5 and 7. Brad Gulko, Mark Herbster, Rodrigo Garces, and David Konerding also provided assistance at various stages in the project.
1 The random set of pairs excluded those known to base-pair. All data, both base-paired and non-base-paired, are taken in sequence order (5' to 3').