Dirichlet Mixtures

In recent years, a large amount of work in biosequence analysis has focused on methods for finding homologous proteins in the protein databases. Statistical models for proteins, such as profiles, position-specific scoring matrices, and hidden Markov models, have been developed especially to find remote homologs: proteins whose primary sequence identity is less than 25% but whose overall structures are homologous.

Here at UCSC, in our work on estimating hidden Markov models for proteins, we have found that the number of sequences needed to accurately estimate the expected amino acids at each position in a protein is larger than the number available for most protein families. Moreover, the available data is often skewed toward one subfamily of the protein being modeled, so that a large fraction of the available sequences are minor variants of each other. Both problems make it difficult to compute the expected amino acids at each position.

Because of this, we have developed a Bayesian method to incorporate prior information into the process of estimating these expected distributions. When little data is available, the estimate relies heavily on the prior information. But as the number of amino acids observed in a column increases, the observed frequencies of the amino acids have the greater impact on the expected amino acids for that position.
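The shift from prior to data described above can be illustrated with the simplest case, a single Dirichlet prior rather than a mixture. The sketch below is a minimal illustration under stated assumptions, not our actual implementation: the three-letter alphabet, the function name posterior_mean, and the prior parameters are all hypothetical and chosen only for brevity. Under a Dirichlet prior with parameters alpha, the posterior mean estimate for letter i is (n_i + alpha_i) / (n + |alpha|), where n_i is the observed count, n the total count, and |alpha| the sum of the prior parameters.

```python
def posterior_mean(counts, alpha):
    """Posterior mean of the letter probabilities under one Dirichlet prior.

    counts: observed count of each letter in a column
    alpha:  Dirichlet parameters (act as pseudocounts)
    """
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]


# Hypothetical 3-letter alphabet; a real protein model would use 20 amino acids.
alpha = [1.0, 1.0, 2.0]  # illustrative values, not fitted parameters

few = posterior_mean([1, 0, 0], alpha)     # one observation: prior dominates
many = posterior_mean([100, 0, 0], alpha)  # many observations: data dominates
```

With a single observation the estimate stays close to the prior mean; with a hundred observations of the same letter it is close to the observed frequency.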

Our approach does not begin with secondary structure. Instead, we take unlabeled training data, that is, columns from multiple alignments with no secondary structure information attached, and attempt to discover the classes of amino acid distributions that are intrinsic to the data. The statistical method we employ directly estimates the most likely Dirichlet mixture density by clustering observed counts of amino acids. In most cases, the common amino acid distributions we find are easily identified (e.g., large non-polar residues), but we do not set out to find distributions representing known structural environments. In our work with hidden Markov models, these distributions have enabled us to estimate models for proteins for which few sequences were available. When we use hidden Markov models to search databases for homologous sequences, these mixtures produce fewer false negatives and fewer false positives. Other researchers, using these mixtures with different statistical models, report similar results.
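To make concrete how an already estimated Dirichlet mixture could be applied to one alignment column, here is a minimal sketch. It does not cover estimating the mixture itself (the clustering step), and it is not presented as our actual code: the function names, the three-letter alphabet, and the two components are hypothetical. It uses the standard result that the posterior estimate is a weighted average of the per-component estimates, where each component's weight is its mixture coefficient times the probability it assigns to the observed counts, renormalized.

```python
from math import exp, lgamma, log


def log_marginal(counts, alpha):
    """Log probability of the counts under one Dirichlet component.

    The multinomial coefficient is omitted because it is the same for
    every component and cancels when the weights are normalized.
    """
    n, a = sum(counts), sum(alpha)
    out = lgamma(a) - lgamma(n + a)
    for n_i, a_i in zip(counts, alpha):
        out += lgamma(n_i + a_i) - lgamma(a_i)
    return out


def mixture_posterior_mean(counts, weights, alphas):
    """Return (estimated probabilities, posterior component weights)."""
    logs = [log(q) + log_marginal(counts, a) for q, a in zip(weights, alphas)]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [exp(v - top) for v in logs]
    z = sum(unnorm)
    post = [u / z for u in unnorm]

    n = sum(counts)
    est = [0.0] * len(counts)
    for w, a in zip(post, alphas):
        denom = n + sum(a)
        for i, (n_i, a_i) in enumerate(zip(counts, a)):
            est[i] += w * (n_i + a_i) / denom
    return est, post


# Hypothetical two-component mixture over a 3-letter alphabet.
weights = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]

est, post = mixture_posterior_mean([3, 0, 0], weights, alphas)
prior_est, prior_post = mixture_posterior_mean([0, 0, 0], weights, alphas)
```

With no observations the component weights equal the mixture coefficients and the estimate is the prior mean. After three observations of the first letter, the component that favors that letter receives most of the weight.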


kimmen@cse.ucsc.edu
Baskin Center for Computer Engineering and Information Sciences
University of California, Santa Cruz