Prediction of ribosomal RNA base-pairing by neural networks

Jesse Reklaw, Rebecca Hanna, David Haussler*
Department of Computer and Information Sciences
University of California, Santa Cruz, CA 95064 USA
(408) 459 2105
Email: haussler@cse.ucsc.edu

*Author to whom correspondence should be directed


keywords: comparative sequence analysis, ribosomal RNA, artificial neural networks, mutual information

Abstract

With the large number of rRNA sequences available in multiple alignments, comparative sequence analysis has enabled the reliable inference of higher order structure. This paper presents a neural net method for predicting base-pairing in the SSU and LSU of rRNA. Input is coded as the observed counts between pairs of bases and an interdependence measure known as mutual information. Experiments conducted on several data sets evaluate the network's overall performance and its ability to generalize across domains and macromolecules. With a large multiple alignment representative of the phylogenetic diversity, our best method characterizes over 96% of the test data as base-pairing or non-base-pairing. This accuracy may be sufficient to predict some tertiary interactions, although the rate of false positives still limits the applicability of the method. The issue of generalization to other macromolecules also remains unresolved.

Introduction

Current studies indicate RNA is active in a variety of cellular processes, and suggest it plays a central role in the origins of life. A surge of interest in this molecule has encouraged the use of many techniques, both biological and computational, to determine higher-order structure. The modeling of RNA structure is an important problem in biology, due to the close relationship between molecular structure and function; yet the only RNAs for which the detailed structure is known are the tRNAs, as determined by x-ray crystallography. Due to the difficulty in crystallizing most types of RNA, other techniques for inferring structure need to be explored and implemented.

In this study we apply simple neural networks to the problem of predicting RNA secondary structure in the small subunit (SSU) and large subunit (LSU) of ribosomal RNA (rRNA) from prokaryotic and eukaryotic data. Our method is a form of comparative sequence analysis, which is regarded as being critical to understanding RNA structure and function (Gutell et al. 1990, James et al. 1989, Turner et al. 1988, Fox and Woese 1975). In contrast to some other phylogenetic comparison methods, the predictive neural network requires little interaction and the end product is a completely automatic machine that can be employed in more complex protocols which model nucleic acid structure. Although the network is trained on two-dimensional data, it is possible that the same statistical evaluation of evolutionary conservation and change can be applied to identifying higher order interactions.

Materials and Methods

Multiple alignments and sequence data were acquired from The Ribosomal Database Project (RDP), Revision 3.0. Three distinct data sets were used: prokaryotic SSU, eukaryotic SSU, and combined LSU rRNA. The sequence ensembles for these sets contained 1379, 236 and 74 species, respectively. Experimentation was also performed on a data set of 60 representative prokaryotic SSU rRNA. These sequences constitute a subset of the full prokaryote ensemble and encompass the phylogenetic diversity, as determined by the RDP. Representative data sets were not available for eukaryotic SSU, nor LSU data. Experiments focused on prokaryotic SSU rRNA and SSU data sets in general since these ensembles were larger and more complete.

Pairs of sequence numbers corresponding to the base-paired positions in the known secondary structure of E. coli were renumbered to the multiple alignments of the RDP files. Each pair of numbers, representing two columns in the alignment, was used to generate counts of the 36 possible pairs over the alphabet {A,C,G,U,N,-} (see Figure 1 for an example alignment).

          1 2 3 4 5 6 7 8
          - - A G C U G -
          - U A C G G - -
          - - G C G U A N

Fig. 1. Example multiple alignment. Rows are nucleotide sequences and the columns are numbered as corresponding postions in each sequence. If coulmns 4 and 5 were compared, the counts for C/G and G/C would number 2 and 1, respectively. All other counts would be zero.

The character 'N' designates any ambiguous or modified base {D, N, R, Y, etc.}; '-' represents all gaps or "no information" in the alignment sequence; and the other four symbols correspond to the nucleotides adenine, cytosine, guanine, and uracil. Although a larger alphabet could have been utilized to incorporate knowledge of the ambiguous or modified bases, we wished to limit the free parameters of the neural net for speed, accuracy and simplicity.

Initial training on 16-count data (generated from pairs between only the four bases {A,C,G,U}) indicated that the omission of special characters and gaps produced inferior results and prolonged the required training time. Further 16-count experimentation was abandoned, however results are given for comparison to the 36-count networks.

The RDP secondary structure of E. coli contained 472 base-pairings in SSU rRNA, which was used to derive a data set of 472 vectors of 36 counts each. A complementary set of 472 non-base-pairing examples was randomly generated to form a full data set of base-paired (BP) and non-base-paired (NBP) data. A similar data set of 794 BP and NBP examples was composed for the LSU rRNA. The equal distribution of BP and NBP examples enabled us to assume an a priori probability of 0.5 for base-pairing. Although this does not represent the actual frequency of base-pairing in the macromolecules, it facilitated equal evaluation of the net's predictive ability on BP and NBP examples. In other experiments we will evaluate the performance of the network in a situation where the number of NBP examples greatly outnumbers the number of BP examples.

Mutual Information

Mutual information has been used to efficiently predict secondary structure (Lapedes 1994, Gutell et al. 1992, Chiu et al. 1991) by providing an estimate of covariance between columns. Positions in the alignment that base-pair will covary throughout evolution while others should not. The definition of mutual information is given in equation 1 below. This definition differs from the more standard definition by a constant factor which helps in comparison with direct input counts. Note that in the calculation of mutual information, the 36-count vector is interpreted as a 6 by 6 matrix of bases, such that C_ij refers to the observed count of pairs (base_i, base_j) where i,j is an element of {A,C,G,U,N,-}.


where



The quantity M is always positive and reflects the amount of statistical dependency between columns. (In the case of C_ij = 0, the corresponding term in M was also set to zero, as is standardly done.) If the two positions are observed to vary independently then P_ij = P_i* P_*j and M = 0; likewise a high measure of dependency should be evinced by comparatively large M.

Neural Networks

Artificial neural networks were used to compute a weighted sum for each input vector, such that the output could be interpreted as a Bayesian a posteriori probability of base-pairing (Richard and Lippmann 1991). All development and training were executed with the Matlab software package version 4.2a on a Sun SparcStation running UNIX. The histograms and plots displayed in this paper were also generated by Matlab.

We selected a network architecture during preliminary testing on prokaryotic data. Hidden units were not utilized. Sum-squared error proved efficient and consistent with the Bayesian interpretation, using a feed-forward net with classical back propagation in batch training mode (Haykin 1994 provides a thorough description of neural networks). Initial weights and bias were set to zero. An adaptive learning rate was implemented to accelerate training, as well as reduce the amount of guesswork involved in defining parameters. To ensure smooth error curves, a maximum learning rate was maintained at a low level (generally one-thousandth to one-millionth).

Three types of networks were compared: "Counts only" (using the 36 input counts of observed base pairs), "MI only" (using solely the measure M of mutual information), and "Composite" (the two combined in a 37-input net). The "Composite" architecture is pictured in Figure 2.


Fig. 2. Neural network diagram for "Composite " net.

The following formula calculates a BP score for each input vector to the "Composite" net:


Where w_ij represents the 36 learned count weights corresponding to the 36 input counts, w_M is the mutual information weight and b is the learned bias. This linear sum was then input to the logistic sigmoid transfer function with the following formula:

The output, y, can be interpreted as the a posteriori probability of base-pairing (see Richard and Lippmann 1991 for complete discussion).

Results

In Table 1 all three net architectures are compared for performance on prokaryotic SSU, eukaryotic SSU, combined SSU, and combined LSU rRNA. Numbers reported in Table 1 are test set scores averaged over four training runs. Each data set was randomly partitioned into fourths for complete cross-validation in a supervised learning procedure wherein one-fourth was used as a test set and the remainder provided training examples. After four training sessions, every example had been used in a test set once.

Table 1. Correct classification of test set using predictive neural networks

                                      NETWORK ARCHITECTURES
                     Counts only             MI only              Composite
   DATA SETS      BP    NBP    Both     BP    NBP    Both     BP    NBP    Both
Prokaryotic SSU 92.80% 90.47% 91.63%  93.22% 95.55% 94.39%  96.61% 95.55% 96.08%	
60rep Prok SSU  93.43% 93.86% 93.64%  71.82% 93.64% 82.73%  96.19% 95.55% 95.87%
Eukaryotic SSU  76.90% 85.38% 81.14%  89.62% 59.96% 74.79%  77.75% 86.44% 82.10%
Combined SSU    82.73% 90.47% 86.60%  92.80% 56.36% 74.58%  82.63% 92.58% 87.61%
Combined LSU    67.42% 55.81% 61.62%  57.45% 77.40% 67.42%  54.55% 78.28% 66.41%

Three percentages are given for each network architecture: correct classification of base-paired examples, correct classification of non-base-paired examples, and an average between the two which accounts for an overall predictive ability.

The net performed very well on prokaryotic SSU data; however, classification percentages of eukaryotic SSU were 10-14% worse than those for prokaryotic. The "MI only" network exhibited particularly poor performance with SSU eukaryotic data. Learning with combined SSU generally produced results between that of solely prokaryotic and solely eukaryotic, with the exception of those from the "MI only" network, which were worse than those generated from eukaryotic SSU only. Experiments with combined LSU data reported the lowest percentages overall.


Fig. 3. Histogram of test set scores for base-paired and non-base-paired prokaryotic SSU data, after training for 1000 epochs on "Composite" net, using complete cross-validation.

Figure 3 illustrates the net's ability to discriminate between BP and NBP prokaryotic SSU pairs of bases. Although the neural net is trained using a log-sigmoid transfer function at the output, a simple sum (the value v from Eq. 5) is computed to display a wider range of values in the histograms. This gives a clearer picture of how well the two groups are separated by the neural net. Typical weights for the "Composite" net trained to predict prokaryotic SSU data are displayed in Table 2, and sample error curves are given in Figure 4. Only seventeen of the 472 base-paired examples were misclassified in one or more training runs. These false negatives are marked on the SSU secondary structure of E. coli, one of the representative prokaryotes (Figure 5). Four of these consistently scored low, even when they were part of the training set: 1405/1496, 665/741, 945/1236, and 950/1231 (listed lowest first). Note that the entire helix formed by 1404/1947 and 1405/1496 is rejected by the neural net.

Table 2. Typical weights for "Composite" net on prokaryotic SSU data.

        A        C        G        U        N        -
A   -0.0119  -0.0316  -0.0146   0.0019   0.0002  -0.0067
C   -0.0669  -0.0335   0.0036  -0.0384   0.0018  -0.0068
G   -0.0097   0.0019  -0.0479   0.0043   0.0004  -0.0402
U   -0.0048  -0.0581  -0.0094  -0.0404   0.0117  -0.0371
N   -0.0231  -0.0106   0.0094  -0.0013   0.0006   0.0136
-   -0.0241  -0.0143  -0.0140  -0.0103  -0.0021  -0.0033
Mutual information weight = 0.0581
Bias = -3.7111e-04



Fig. 4. Sample sum-squared error learning curves for prokaryotic SSU data, after training for 1000 epochs on "Composite" net. An epoch is one complete gradient descent pass through the training set.


Experiments in generalization

We also conducted a series of experiments to determine if a neural net trained to discriminate BP from NBP column pairs in the multiple alignment of one family of RNA sequences could be used to perform the same discrimination task in another family of RNA sequences. These experiments in generalization were run on the "Composite" net exclusively, since it had outperformed the other architectures almost without exception. Since the LSU, prokaryotic SSU, and eukaryotic SSU data sets were generated from multiple alignments of three different sizes, each LSU and eukaryotic SSU input was scaled such that its sum matched that of a prokaryotic SSU input (from 74 and 236 to 1379). For a direct comparison with the previous experiments, only 3/4 of the examples in the selected training set were used in training. However, all examples in the selected test set were used for testing. (Previous experiments had used 3/4 of the given data set for training and 1/4 for testing). The results are given in Table 3.

Table 3. Generalization results using "Composite" network.

TRAIN->TEST               BP     NBP     BOTH
Prok->Euk SSU            79.66%  84.53%  82.10%
Euk->Prok SSU            98.73%  81.36%  90.04%
SSU->LSU                 44.32%  85.61%  64.96%
LSU->SSU                 91.10%  75.11%  83.10%

Three percentages are given for each network architecture: correct classification of base-paired examples, non-base-paired examples, and an average between the two which accounts for an overall predictive ability.

Training solely with the prokaryotic SSU data and testing on the eukaryotic SSU data clearly demonstrated the net's ability to generalize across domains: training with the prokaryotic data proved as effective as training with the eukaryotic when the test set was eukaryotic. However, when generalizing from one macromolecule to another, care must be taken not to overfit the training data. Figure 6 shows the sum-squared error curves for the prokaryotic training set and separately for the eukaryotic test set, as a function of the number of epochs of neural net gradient descent estimation. Note that after about 100 epochs the error of the test set starts to rise, indicating overfitting. For the performance quoted in Table 3, the network trained for only 100 epochs was used.


Fig. 6. Sample sum-squared error learning curves for prokaryotic (training) and eukaryotic (test) SSU data, after training for 1000 epochs on "Composite" net. An epoch is one complete gradient descent pass through the training set.

Good generalization was also observed when training on the SSU data and testing on the LSU data, compared to training and testing both with the LSU data. However, considering the poor performance of the net in classifying LSU data, a generalization scheme of training on SSU data and testing on LSU data had little to achieve. Although the net overfit the test data almost immediately, continuation of training did not lessen the total classification percentage for LSU data. The amount of false negatives rose proportionately with a better classification of true negative examples.

Table 3 also lists results from reversed generalization experiments: training with eukaryotic SSU while testing on prokaryotic SSU and training with combined LSU while testing on combined SSU. Evidently the eukaryotic SSU data does not generalize as well to prokaryotic SSU, and neither does combined LSU data to combined SSU. However, some ability to generalize is still evident.

Using a large NBP set

If one wanted to search the entire multiple alignment of SSU or LSU sequences for pairs of columns that might base-pair, the NBP examples would greatly outnumber the BP examples. Specifically, with 2688 positions in the multiple alignment of prokaryotic SSU rRNA, there are over 3.6 million possible base pairs. The prokaryotic SSU classifier, with a 4.45% rate of false positives, would identify more than 160,000 false positive base pairs along with the 472 real base pairs. To reduce the number of false positives, one needs to raise the threshold used by the classifier to discriminate BP from NBP examples. We implemented and tested this method on a larger sample of the possible NBP examples, to determine if the current classifier could be used to identify some portion of the structure with a low rate of false positives.

A set of 23128 NBP examples was randomly generated from the multiple alignment of prokaryotic SSU rRNA, using the same procedure which generated the small set of 472 NBP examples. Adding the 472 prokaryotic SSU BP examples produced a data set of 23600 inputs, 2% of which were base-paired. We used the weights learned from the "Composite" net trained on prokaryotic SSU data to classify this large set of examples. Over 1000 (about 5%) of the NBP examples scored above zero. By setting a higher threshold and rejecting any input scoring less than this threshold, we significantly lowered the percentage of false positives. Table 4 lists the thresholds, the percentage of false positives they allowed, and the corresponding percentage of BP examples still classified correctly.

Table 4. Reducing percentage of false positives.

FALSE POSITIVES ALLOWED    THRESHOLD      BP CLASSIFIED
5.150%                       0.00            98.20%
0.993%                      10.50            84.32%
0.098%                      25.23            60.59%
0.009%                      34.00            50.64%
0.000%                      41.16            44.28%
0.000%                      45.00            41.74%
0.000%                      50.00            37.71%

`False positives allowed' is the percentage of NBP examples which scored above the given threshold, using the "Composite" classifier for prokaryotic SSU data. `BP classified' is the percentage of BP examples scoring above the given threshold.

The highest scoring false positive could be rejected with a threshold of 41.1622, above which 209 of the 472 true base pairs were still classified correctly. However, since the large NBP set represented only 0.65% of the total NBP examples, a value of 45 or 50 may be a safer threshold at which almost all possible NBP examples would be rejected. Figure 7 displays the 197 base pairs scoring above a threshold of 45 on the secondary structure of E. coli. The marked base pairs are distributed throughout the graph and could be used to identify most of the structure. Some large helices missed are: 921-933/1384-1396, 938-943/1340-1345, 945-955/1225-1236, 1068-1073/1102-1107, and 1506-1515/1520-1529. Note that many of the missed helices are in physical proximity, which would make that portion of the macromolecule vaguely defined.

Looking at the input counts corresponding to the base pairs in these helices reveals that most of them are very conserved, i.e. usually one count (typically A/U or G/C) in each input ranges from 900-1200, and the remainder of the counts are trivially small. Such conservation leads to low mutual information scores, thus these inputs are rejected when mutual information is weighted highly. Many of these conserved inputs also indicate a significant amount of insertion/ deletion. For example, if the conserved pair is a G/C, there may also be a -/C in the range of 100-200. The three large helices missed which are in proximity (921-933/1384-1396, 938-943/1340-1345, 945-955/1225-1236) are all conserved with a large amount of insertion/ deletion. The nearby helix 1506-1515/1520-1529 is not conserved at all, but the counts for -/- (gap to gap) in each input are all over 1100, indicating this helix exists in nearly less than half of the prokaryotes represented in the alignment.

Discussion

The neural net classifier accurately predicted 96.08% of the prokaryotic and 82.10% of the eukaryotic SSU data. Best results for the LSU data were achieved the "MI only" network, predicting 67.42% correctly. Overall, experiments with LSU sequences produced poor results. Switching from the 16-count inputs to 36-count gained roughly 2% in the prokaryotic SSU classifier, while a much larger gain of 10% was observed in predicting the eukaryotic SSU (data not shown). Due to its effectiveness, the "Composite" network is preferred. Using only mutual information as a classifier worked well with SSU prokaryotes, but comparatively poorly with SSU eukaryotes. Perhaps this can be attributed to the smaller sequence ensemble of eukaryotes (roughly one-sixth the size of the prokaryotic). Training with the 60 representative prokaryotic species confirms this suspicion, as the "MI only" net's scores were also comparatively low in this reduced ensemble. A smaller range of counts and highly conserved sequences (those with little phylogenetic diversity) impair the ability to determine dependency using mutual information.

The false negatives in Figure 5 seem to indicate the neural net classifier has some difficulty with endcaps, noncanonical pairs, and pairs adjacent to a bulge. However, these mistakes do not necessarily point to a weakness in the system, since many base pairs of these types were correctly classified in other locations. Although the G/A in 945/1236 was misclassified, 321/332 was not. The U/U in 1307/1330 was not classified as base-pairing whereas 245/283 was. Even the A/G next to a bulge in 665/741 has a counterpart in 68/101. Likewise many endcaps and U/G pairs were correctly classified as base-pairing. Some of the BP examples missed by the network may even be dubiously classified as base-pairing.

Table 5. Comparison of positive weights for different data sets.

DATA SETS                      W-C+      MI
prokaryotic SSU  16 count      25%       75%
                 36 count      11%       51%
eukaryotic SSU                 31%       26%
combined SSU                   43%       45%	
prok->euk SSU*                 56%       43%
combined LSU                   00%       85%

Each of these figures is an average of 4-6 learned weights from distinct training sessions on the "Composite" net architecture. All data sets are 36-count unless otherwise noted.
"W-C+" represents the total percentage of positive weights that correspond to canonical base-pairs or the wobble pairs G/U and U/G.
"MI" is the percentage of positive weights given to mutual information.
*Trained on prokaryotic, tested on eukaryotic SSU rRNA.

Analysis of the learned weights, listed in Table 5, provides some understanding of the neural net's predictive ability. Positive weights are interpreted as positive evidence for base-pairing, while negative weights are assumed to be negative evidence. Mutual information, canonical base pairs, and pairs with the 'N' character generally provided the bulk of positive evidence for base-pairing when training with SSU data. Many noncanonical base pairs also earned positive weights, particularly the wobble pairs G/U and U/G. Pairs with the gap character and some noncanonical base pairs (e.g.: C/A, C/C) corresponded to strongly negative weights. In contrast, training with LSU data put all evidence on mutual information, ignoring the Watson-Crick pairs.

Table 6. Composition of data files.

                       W-C+      MI       GAP
SSU prokaryotic  BP   48.25%   35.05%   08.61%
                 NBP  07.25%   03.04%   72.53%
SSU eukaryotic   BP   51.95%   13.99%   16.79%
                 NBP  07.93%   02.74%   77.16%
LSU combined     BP   08.53%   16.17%   59.62%
                 NBP  06.51%   07.64%   76.37%

"BP" represents the composition of the sum of all base-paired input counts taken from column pairs in the multiple aligment for each of the given families of rRNA. Similarly "NBP" gives the composition of non-base-paired inputs.
"W-C+" indicates what percentage of counts are represented by the canonical base pairs plus G/U and U/G.
"MI" gives the percentage of total input represented by mutual information.
"GAP" represents the percent of total counts contributed by any pair containing the `-' character.

A study of the composition of column pairs found in the multiple alignments partially explains the distribution of weights produced by the networks. In Table 6, three categories of counts describe the data files: "W-C+" groups the counts of Watson-Crick pairs (plus G/U and U/G), "MI" represents the fraction of total input to the net contributed by mutual information, and "Gap" corresponds to the portion contributed by all pairs with the '-' character. We assume that for base-paired columns, the "Gap" percentage should be low, reflecting high conservation throughout evolution. Parallel to this, "W-C+" and "MI" should be high. All this is observed in the SSU data, though the eukaryotic seems less conserved compared to the prokaryotic. Very low "W-C+" frequency and high "Gap" are observed in the LSU data, which may be key to the net's poor performance in predicting LSU base-pairing. The small distinction between BP and NBP composition may be attributed to the low number of sequences in the LSU ensemble (74 species), or perhaps a lack of phylogenetic diversity. It may also indicate that the multiple alignment is of lower quality than the SSU multiple alignments.

Experiments in generalization may also be used to evaluate the quality of an alignment. Since prokaryotic SSU generalized to eukaryotic SSU better than eukaryotic to prokaryotic (Table 3), we may posit that the prokaryotic SSU alignment is more consistent. Likewise for the combined SSU data over the LSU data. The accuracy of the neural network in predicting base-pairing is dependent on the quality of the multiple alignment and its sequence content. As multiple alignments improve and more sequences become available, we expect to observe better prediction of eukaryotic SSU and combined LSU data by the neural net.

High weights given to pairs with the 'N' character in prokaryotic SSU experiments may be caused by the comparatively low counts for pairs with the 'N' character. If these count inputs are consistently small but provide useful evidence, they should be weighted strongly such that the product w_ij multiplied by c_ij is comparable to other count-weight products. This argument may also be extended to why canonical pairs were not always weighted highest: since the input signal was already strong, not much amplification by the corresponding weight was necessary. After the first few training sessions, mutual information and canonical base pairs provided most of the evidence in the best networks. However, extended training sought subtle distinctions between inputs to improve classification accuracy. In many cases this resulted in noncanonical base pairs being weighted as positive evidence. Experimental evidence verifies that noncanonical base-pairing does influence higher order structure (Gutell et al. 1990, Holbrook et al. 1991). Occasionally even canonical base pairs were weighted negatively, particularly U/A.

It is conceivable that some false-positives in the experiments could be representative of tertiary structure. Implementing this neural net method to predict the known two- and three-dimensional structure of tRNA could verify its ability to predict tertiary structure in other macromolecules. Gutell, Noller and Woese (1986) have stated that considerable evidence must support claims of tertiary interactions. Although a 96% accurate classifier is not sufficient to filter through the millions of NBP positions in macromolecules the size of SSU rRNA, it might be used to target likely base pairs crucial to tertiary structure.

Perhaps the most effective use of the neural network method is to import its learned weights into other algorithms for determining structure. The simple calculation using input counts contributes a very low computational overhead and has proven to be reliable and robust. In other work these learned weights were incorporated into a program that identifies secondary structure in SSU and LSU rRNA with excellent results (Grate, 1995).

Acknowledgments

Many thanks are accorded to Leslie Grate for providing software to generate count files, as well as helping with other essential tasks. Bryn Weiser's program xRNA generated the E. coli secondary structure graph in Figures 5 and 7. Brad Gulko, Mark Herbster, Rodrigo Garces, and David Konerding also provided assistance at various stages in the project.

Bibliography

  • Chiu, D. K., and Kolodziejczak, T. 1991. Inferring consensus structure from nucleic acid sequences. CABIOS, 7, 347-352.
  • Fox, G. E., and Woese, C. R. 1975. 5S RNA secondary structure. Nature, 256, 505-507.
  • Grate, Leslie. 1995. Automatic RNA Secondary Structure Determination with Stochastic Context-Free Grammars. Cambridge conference, to appear in ISMB95. AAAI/MIT Press. Menlo Park, CA.
  • Gutell, R. R., and Woese, C. R. 1990. Higher order structural elements in ribosomal RNAs: Pseudo-knots and the use of noncanonical pairs. Proc. Natl. Acad. Sci. USA, 87, 663-667.
  • Gutell, R. R., Noller, H. F., and Woese, C. R. 1986. Higher order structure in ribosomal RNA. EMBO Journal, 5, 1111-1113.
  • Gutell, R. R., Power, A., Hertz, G. Z., Putz, E. J., and Stormo, G. D. 1992. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Research., 20, 5785-5795.
  • Haykin, S. 1994. Neural Networks. Macmillan, New York.
  • Holbrook, S. R., Cheong, C., Tinoco, I. Jr., and Kim, S. H. 1991. Crystal structure of an RNA double helix incorporating a track of non-Watson-Crick base pairs. Nature, 353, 579-581.
  • James, B. D., Olsen, G. J., and Pace, N. R. 1989. Phylogenetic comparative analysis of RNA secondary structure. Methods in Enzymology, 180, 227-239.
  • Lapedes, Alan. 1994. A Complex Systems Approach to Computational Molecular Biology. Proceedings of the Santa Fe Institute Workshop On Integrative Themes, 21. Addison Wesley.
  • Matlab version 4.2a. 1994. The MathWorks, Inc., Natick, MA.
  • Ribosomal Data Project, University of Illinois at Urbana-Champaign. Revision 3.0 of the database. Retrieved from rdp.life.uiuc.edu in /pub/RDP. Multiple alignment data is drawn from /rdp/SSU_Prok.gb and /rdp/SSU_Euk.gb. This source also requests citation of Niels Larsen, Gary J. Olsen, Bonnie L. Maidak, Michael J. McCaughey, Ross Overbeek, Thomas J. Macke, Terry L. Marsh, and Carl R. Woese. 1993. The Ribosomal Database Project. Nucleic Acids Research, 21, 3021-3023.
  • Richard, M.D., Lippmann, R.P. 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation., 3 (4), 461-83.
  • Turner, D.H., Sugimoto, N., and Freir, S. M. 1988. RNA structure prediction. Annual Review of Biophysics and Biophysical Chemistry, 17, 167-192.
  • Woese C. R., Gutell, R. R., Gupta, R., and Noller, H. F. 1983. Detailed analysis of the higher-order structure of 16S-like ribosomal ribonucleic acids. Microbiology Reviews, 47(4), 624-669.
  • 1 The random set of pairs excluded those known to base-pair. All data, both base-paired and non-base-paired, are taken in sequence order (5' to 3').


    Last updated: 8-18-95
    reklaw@netcom.com