Second Biennial
UCSC-QB3 Symposium on Bioinformatics:
Predicting the structure and function of proteins

(Informal post-CASP Workshop)
Earth & Marine B206
University of California, Santa Cruz
7-8 Dec 2002

Speakers, Titles, and Abstracts

Kevin Karplus (University of California, Santa Cruz)
Unifying secondary-structure, fold-recognition, and new-fold methods for protein structure prediction

This talk will attempt to be a sweeping tutorial on protein structure prediction in 25 minutes, using the UCSC programs as examples---the full version of the talk takes about 14 hours, so it will necessarily be a bit abbreviated.

We have recently implemented a fragment-packing program that allows us to combine information from several different techniques.

  1. We use iterated search with hidden Markov models (HMMs) to make a multiple alignment of probable homologs of the target sequence.
  2. We use the multiple alignment as input to a neural net to make predictions of secondary structure (or other local structural properties). The predictions are in the form of probability vectors over the local structure alphabet for each position of the target.
  3. We create multi-track HMMs that have emission tables for amino acids (from the multiple alignments) and secondary structure (from the neural net). The multi-track HMMs are used to score every template in the template library (currently over 7000 structures). The local structure track substantially improves performance over amino-acid-only HMMs.
  4. We combine scores for multi-track target HMMs using different local structure alphabets with amino-acid-only template HMMs scoring the target sequence. The combined scores are used to select templates for alignment.
  5. We generate target-template alignments for the top hits using several different alignment parameter settings.
  6. Using a new program in the SAM tool suite, fragfinder, we search the template library for the top 10 or so short gapless alignments (fragments) at each position in the target sequence.
  7. Undertaker, which tries to optimize burial, takes in all the fold-recognition alignments, the fragfinder fragments, and a large generic library of very short fragments, then uses a genetic algorithm to generate conformations of the target sequence. The method is similar to Baker's Rosetta program in concept, but uses an all-heavy-atom representation of the conformation, and allows inserting full alignments, and not just contiguous fragments.

    Our score function for this generate-and-test method is not yet tuned or tested, but we used it for CASP5 this summer anyway. The results were interesting, but we do not yet know whether they were right.
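
The fragment search in step 6 can be illustrated with a minimal sketch. It is not fragfinder itself: the function name `top_fragments`, the window width, and the identity-count score are assumptions made for illustration, whereas the real program is part of the SAM suite and its scoring is not described in this abstract.

```python
import heapq


def top_fragments(target, templates, width=9, k=10):
    """Return the k best gapless template windows for each target window.

    target    -- target amino-acid sequence (string)
    templates -- dict mapping a template name to its sequence
    width     -- fragment length (hypothetical default)
    k         -- number of fragments kept per position

    The score is the number of identical residues, a simple stand-in
    for whatever score the real fragment finder uses.
    """
    results = {}
    for i in range(len(target) - width + 1):
        window = target[i:i + width]
        candidates = []
        for name, seq in templates.items():
            for j in range(len(seq) - width + 1):
                score = sum(a == b for a, b in zip(window, seq[j:j + width]))
                candidates.append((score, name, j))
        # Keep only the k highest-scoring (score, template, offset) tuples.
        results[i] = heapq.nlargest(k, candidates)
    return results
```

Each target position ends up with a short ranked list of (score, template, offset) tuples, which is the kind of input a fragment-packing program could draw from.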


Arne Elofsson (Stockholm Bioinformatics Center, Stockholm University)
A study of automated measures for benchmarks of fold recognition methods

Automated methods are necessary.

Prediction of protein structures is one of the fundamental challenges in biology today. To fully understand how well different prediction methods perform, it is necessary to use measures that evaluate their performance. Every two years, starting in 1994, the CASP (Critical Assessment of protein Structure Prediction) process has been organized to evaluate the ability of different predictors to blindly predict the structure of proteins. To capture different features of the models, several measures have been developed during the CASP processes. However, these measures have not been examined in detail before. This study is an attempt to develop fully automatic measures that can be used in CASP.

Here we compare automatic measures that study the similarity between a model and the native structure with the "fold recognition" ability, as measured by the selection of the correct fold, for a large set of fold recognition methods. We show that the correlation between automatic and manual classification is similar to the correlation between the classifications in CATH and SCOP. Further, we show that one of the best measures is a method that finds the largest possible fragment between the model and the native structure.


Bjørn Wallner (Stockholm Bioinformatics Center, Stockholm University)
Can correct protein models be identified?

The ability to separate correct models of protein structures from less correct models is of the greatest importance for protein structure prediction methods. Several studies have examined the ability of different types of energy functions to detect the native, or native-like, protein structure from a large set of decoys. In contrast to earlier studies, we examine here the ability to detect models that only show some structural similarity to the native structure. These correct models are defined by the existence of a fragment that shows significant similarity between the model and the native structure. Further, it has been shown that the existence of such fragments is useful for comparing the performance of different fold recognition methods and that this performance correlates well with performance in fold recognition.

We developed a neural-network-based method to predict the quality of a protein model (ProQ). ProQ extracts structural features, such as the frequency of atom-atom contacts, and predicts the quality of a model, as measured either by LGscore or MaxSub. We show that ProQ performs at least as well as other measures when identifying the native structure and better at the detection of correct models. This performance is maintained over several different test sets.

ProQ can also be combined with the Pcons fold recognition predictor to increase its performance. However, the improvement is quite marginal, with the main advantage being the elimination of a few high-scoring false positive models.

ProQ is freely available as a standalone web server at http://www.sbc.su.se/~bjorn/ProQ/, and is incorporated into the Pcons consensus server, available at http://www.sbc.su.se/~arne/pcons.


Carol Rohl (University of California, Santa Cruz)
Loop Modeling Using Rosetta

A major limitation of current comparative modeling methods is the accuracy with which loop regions that are not homologous to a parent of known structure can be modeled. Because structurally variable regions between homologous proteins are responsible for variations in protein function and specificity, the ability to model the structural differences between similar structures has important functional consequences. While existing loop modeling methods can provide reasonably accurate models of short loop regions, modeling longer structurally variable regions is an unsolved problem.

Here we describe a method based on the de novo folding algorithm, Rosetta, for predicting conformations of variable regions in comparative models. Initial conformations for very short variable regions are selected from the protein structure database while longer loop regions are built up using three and nine residue fragments drawn from the database and combined using the Rosetta algorithm. A gap closure term in the potential in combination with conjugate gradient minimization is used to ensure continuity of the peptide backbone. Loop conformations are refined in the context of a fixed template structure using random angle perturbations. Direct comparison to several other loop modeling strategies indicates that the Rosetta method performs as well as existing methods on short to medium loops, but can additionally provide estimates of longer structurally variable regions.
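
The idea of a gap closure term can be sketched as a penalty that grows when the end of a rebuilt loop fails to reach the residue it must join. The quadratic form, the weight, and the function name below are assumptions for illustration only; the abstract does not give the functional form used in Rosetta, and the minimization step is not shown.

```python
import math

# Approximate length of a peptide C-N bond in angstroms.
IDEAL_CN_BOND = 1.33


def gap_closure_penalty(c_atom, n_atom, weight=1.0, ideal=IDEAL_CN_BOND):
    """Quadratic penalty on a chain break (illustrative form only).

    c_atom -- (x, y, z) of the carbonyl carbon that ends the loop
    n_atom -- (x, y, z) of the backbone nitrogen of the next residue

    The penalty is zero when the two atoms are within bonding distance
    and grows with the square of the excess separation.
    """
    distance = math.dist(c_atom, n_atom)
    excess = max(0.0, distance - ideal)
    return weight * excess ** 2
```

A term of this kind is smooth for any open gap, which is what makes it usable alongside a gradient-based minimizer.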

The Rosetta loop modeling method, in combination with a sequence alignment algorithm, was used to make both automated and manual predictions for twenty-eight protein domains in the fifth Critical Assessment of Protein Structure Prediction (CASP5), providing a double-blind assessment of the efficacy of the method.


Charlie Strauss (Los Alamos National Lab)
MAMMOTH structure-structure comparison

Given a gene sequence, it is often possible to estimate the corresponding protein's structure. In particular, ab initio methods, such as our ROSETTA algorithm, can do so even when there is no detectable sequence similarity to a protein of known structure. It is possible to use structures, predicted and measured, to relate two proteins by their degree of non-random similarity, as is done, analogously, by traditional sequence comparison. By extension, one may choose to make inferences about functional and evolutionary similarity. We have applied this to a large collection of proteins from the Pfam database as well as a test set for the discovery of protein structural similarity.

In this talk I highlight a key piece of the machinery for doing this: MAMMOTH, a new model comparison algorithm suited for quantitative structural comparison on a genomic scale.

Advances in structural genomics and protein structure prediction require the design of automatic, fast, objective, and well benchmarked methods capable of comparing and assessing the similarity of low-resolution three-dimensional structures, via experimental or theoretical approaches. Here, a new method for sequence-independent structural alignment is presented that allows comparison of an experimental protein structure with an arbitrary low-resolution protein tertiary model. The heuristic algorithm is given and then used to show that it can describe random structural alignments of proteins with different folds with good accuracy by an extreme value distribution. From this observation, a structural similarity score between two proteins or two different conformations of the same protein is derived from the likelihood of obtaining a given structural alignment by chance. The performance of the derived score is then compared with well established, consensus manual-based scores and data sets. We found that the new approach correlates better than other tools with the gold standard provided by a human evaluator. Timings indicate that the algorithm is fast enough for routine use with large databases of protein models. Overall, our results indicate that the new program (MAMMOTH) will be a good tool for protein structure comparisons in structural genomics applications. MAMMOTH is available from our web site at http://physbio.mssm.edu/
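
The step from an extreme value distribution to a similarity score can be sketched as follows. This uses the standard Gumbel (type I extreme value) form; the parameter names `mu` and `beta`, and the choice of the negative log of the chance probability as the final score, are assumptions for illustration and are not taken from the abstract.

```python
import math


def evd_p_value(score, mu, beta):
    """Probability of a raw alignment score >= `score` arising by chance.

    Assumes random scores follow a Gumbel distribution with location
    `mu` and scale `beta` (hypothetical, fitted elsewhere).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Clamp very negative z so that math.exp does not overflow.
    z = max((score - mu) / beta, -700.0)
    # 1 - exp(-exp(-z)), computed with expm1 for accuracy near zero.
    return -math.expm1(-math.exp(-z))


def similarity_score(score, mu, beta):
    """Negative log of the chance probability; larger means less random."""
    p = evd_p_value(score, mu, beta)
    return math.inf if p == 0.0 else -math.log(p)
```

In practice `mu` and `beta` would have to be estimated from alignments of unrelated folds, and might depend on protein length; that fitting step is outside this sketch.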


Gabriel del Rio
Assessing the structure/function relationship of biological networks:
From DNA microarray data analysis to protein structure prediction

Biological systems can be represented by their elements and their interactions in a graph or network. Graph theory analyzes systems represented by vertices (i.e. elements) and edges (i.e. interactions). From these graphs, central vertices or edges (nexuses) can be detected based on diverse criteria, including connectivity. Nexuses defined in terms of connectivity are those vertices on which the connectivity of the graph relies. Since graphs are models of biological systems, nexuses represent essential elements of those systems. We have developed a method to detect nexuses in biological networks (e.g. cellular molecular mechanisms and protein structure) that improves on every previously reported approach we have tested in terms of identifying essential elements (e.g. genes, amino acid residues) from a network model.
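
One simple reading of a connectivity-defined nexus is a vertex whose removal disconnects the graph (an articulation point). The sketch below implements only that textbook criterion, assuming a connected, undirected input graph; it is not the method described in the abstract, and the function names are hypothetical.

```python
from collections import deque


def is_connected(adj, removed=None):
    """Breadth-first check that the graph stays connected without `removed`.

    adj -- dict mapping each vertex to a set of neighbouring vertices
    """
    nodes = [v for v in adj if v != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w != removed and w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(nodes)


def connectivity_nexuses(adj):
    """Vertices whose removal disconnects an otherwise connected graph."""
    return sorted(v for v in adj if not is_connected(adj, removed=v))
```

For a chain a-b-c-d the two interior vertices are reported, while a triangle has none, since every vertex there has a redundant path around it.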


Robert Baertsch (University of California, Santa Cruz)
Improving gene prediction with pseudogenes
A novel use for mouse-human synteny to finish the genome

(Abstract not received yet)


Matt Jacobson (Pharmaceutical Chemistry, UCSF)
PLOP (Protein Local Optimization Program): A New Software Platform for Predictive Protein Modeling Using All-Atom Force Fields and Implicit Solvation

Molecular dynamics simulations, using force fields and explicit or implicit solvent models, have been used extensively to help elucidate protein function. However, the all-atom physical chemical energy models used in such simulations currently play a rather minor role in predictive protein modeling, i.e., when the protein in question lacks an experimentally determined structure.

I have developed a new software platform, PLOP, which is designed to complement knowledge-based approaches to protein structure prediction by permitting efficient sampling of critical protein conformational degrees of freedom -- side chains, loops, and helix positions/orientations -- using all-atom physical chemical energy models. These new sampling algorithms do not employ molecular dynamics and rely heavily on multi-scale and hierarchical strategies to reduce computational expense. The primary goal of this project has been the development of new computational tools for homology model refinement. However, this physics-based approach also permits new questions to be posed, beyond predicting the overall fold of the protein:

  • What regions are likely to be disordered?
  • What would be the effect of post-translational modification on the structure and dynamics of the protein?
  • What hydrogen bond networks are likely to form in an active site?

I'll also discuss efforts to systematically validate and improve the energy function, using new methodologies based on the accuracy of side chain and loop prediction on proteins of known structure, as well as recent efforts to facilitate model building from experimental data.

Andrei Lomize (College of Pharmacy, University of Michigan)
New energy functions for protein modeling derived from thermodynamic data

Efficient methods for protein structure prediction, de novo design and ligand docking require energy optimization. An especially important goal here is the correct evaluation of free energy differences, rather than the enthalpy in vacuum that is usually calculated with molecular mechanics potentials. The required energy functions must take into account conformational entropy, solvation free energy, and the dependence of interatomic interactions on the environment. They must also be tested against the experimental thermodynamic stabilities of proteins or protein-ligand complexes.

Recently, we have determined van der Waals (vdW) interaction energies between different atom types, energies of hydrogen bonds, and atomic solvation parameters from the published free-energy differences for 106 mutants with replacements of buried uncharged residues and available crystal structures [1]. The resulting energies of interatomic interactions differ from those in molecular mechanics in three important respects:

  1. they describe interactions in the protein interior rather than in vacuum;
  2. they are generally weaker and follow the "like dissolves like" rule;
  3. they are related to enthalpy of melting, rather than to enthalpy of sublimation.

The developed energy functions are especially useful for the design of a full-atom "threading" procedure, which would allow comparative modeling of proteins that are only remotely related to their experimental templates.

[1] A.L. Lomize, M.Y. Reibarkh, and I.D. Pogozheva. Interatomic potentials and solvation parameters from protein engineering data for buried residues. 2002. Protein Science, v. 11, p. 1984-2000.


Kimmen Sjölander (University of California, Berkeley)
Phylogenetic trees, Subfamily HMMs, Modeller expertise, and sleep deprivation for 4 months straight: The UC Berkeley approach to CASP5

Our approach used a combination of tools, including

  1. SAM-T99 and fw0.5 for HMM library construction given the Astral PDB40 dataset version 1.57 as seeds for model construction;
  2. Bayesian Evolutionary Tree Estimation (BETE) to identify subfamilies in each structural class, based on the SAM-T99 alignment;
  3. Subfamily HMM construction to build HMMs for subfamilies found by BETE;
  4. FlowerPower, a novel clustering and alignment tool, using subfamily HMMs and BETE, to gather and align homologs for target proteins, and to extend DALI structural alignments to include sequence homologs;
  5. Simultaneous Alignment and Tree Construction using Hidden Markov mOdels (SATCHMO) to align the target (and homologs) and candidate structures (and homologs);
  6. Modeller and PROSA to build all-atom models (core expertise in this area provided by Andras Fiser at Rockefeller University);
  7. Alignment refinement using secondary structure information.

Chris Bystroff (Dept. of Biology, Rensselaer Polytechnic Institute)
Contact map threading using HMMSTR

HMMSTR is a hidden Markov model for local structure motifs in proteins. Each Markov state represents a single position in a structural motif. Transitions between Markov states represent the ways that motifs can be adjacent to each other in the sequence. A knowledge-based energy for all motif-motif contacts was calculated by expressing each sequence in the database of known proteins as an a posteriori probability distribution over Markov states. A log-likelihood ratio, summed over all of these probability distributions, gives us the probability of any two states (i.e. any two local structure motifs) being in contact (less than 8 Å apart). This potential function was used to assign an energy to each residue-residue pair in the target sequence. To select the best subset of contacts, we aligned this matrix to every protein in the database and chose as the template the protein that gave the lowest contact free energy. If no good template was found, we used a rule-based method to make an ab initio contact map prediction. Results will be shown for targets 157 and 130.
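
Assigning an energy to each residue-residue pair from a state-state potential can be sketched as an expectation over the two residues' posterior state distributions. That expectation form, and the names `contact_energy` and `energy_matrix`, are assumptions made for illustration; the abstract does not spell out the exact formula.

```python
def contact_energy(post_i, post_j, state_pair_energy):
    """Expected contact energy for one residue pair.

    post_i, post_j    -- posterior probabilities over Markov states
                         for residues i and j (lists of equal length)
    state_pair_energy -- square table of state-state contact energies
    """
    return sum(
        p * q * state_pair_energy[s][t]
        for s, p in enumerate(post_i)
        for t, q in enumerate(post_j)
    )


def energy_matrix(posteriors, state_pair_energy):
    """Energy for every residue pair of a target sequence."""
    n = len(posteriors)
    return [
        [contact_energy(posteriors[i], posteriors[j], state_pair_energy)
         for j in range(n)]
        for i in range(n)
    ]
```

The resulting matrix is the kind of object that could then be aligned against the contact maps of database proteins.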


Peter Schattner (University of California, Santa Cruz)
Tracking Down Non-coding RNAs: The Other Guide-SnoRNA Genefinder

The role of box C/D and H/ACA snoRNAs in guiding RNA methylation and RNA pseudouridylation has been well established. However, biochemical identification of RNA modification sites and cognate guide snoRNAs is still quite challenging. For the methylation guide snoRNAs, the presence of relatively well-conserved box motifs and complementary guide sequences enabled the development of a successful computational screen. In contrast, the H/ACA pseudouridylation guide snoRNAs have shorter and less well-conserved primary sequence motifs. Consequently, computational screens for them have not yet been developed.

We have developed a probabilistic program to screen full genomes for candidate H/ACA snoRNAs. The program has been designed with the flexibility to specify models with different training sets, including snoRNAs from various organisms and potentially novel classes of RNAs.

Initial program training and testing is being carried out on the Saccharomyces cerevisiae genome and initial results from these investigations will be presented.


Igor Grigoriev (Bioinformatics Scientist, SUGEN, Inc.)
Remote homology detection by CRISSP: Application to protein kinases

We propose a new approach to detect remote homologues of proteins of known three-dimensional structure. We describe a protein family by Conserved amino acid Residues embedded Into Secondary Structure Pattern (CRISSP). The secondary structure pattern can describe a protein fold in general and, in particular, the structural context of a few conserved and functionally important residues that ensure a protein's function.

We have tested our approach on protein kinases---key players in signaling pathways. The protein kinase CRISSP was built using structural alignments of six remote kinase homologues. We found that all the conserved residues are located in the ATP-binding pocket of protein kinases. Tested on the data sets of SCOP protein structural domains and Swiss-Prot protein sequences, the method identified 70-80% of known protein kinases with a 10-20% error rate in both cases. About 90 novel protein kinases have been predicted in the human genome (ORFs with predicted secondary structures).

Automated analysis of structural alignment of proteins of known structure suggests that CRISSP can be generated for approximately 40 different protein families.


Bob Edgar (Independent Scientist)
Improved alignments through profile-profile HMM methods

Alignment of two or more profiles (multiple sequence alignments, or compressed parameterizations of such alignments) has important applications in computational biology. It is a central step in progressive multiple sequence alignment and is used to detect homologies between related groups of sequences. We show that it can also be used to achieve improved pair-wise alignments of homologs with low sequence identity.

A multiple sequence alignment contains information that indicates which positions are conserved by evolution and hence are structurally or functionally important, and can therefore lead to more accurate alignments and more discriminative scoring than a single sequence. Current methods for aligning two alignments or two profiles cannot apply standard gap penalties to gaps found within (as opposed to between) alignment columns.

Here we present a new approach using hidden Markov models (HMMs) that fully incorporates transition scores, and hence gap penalties, for each sequence in the profile. Given an alignment of N sequences of length L to be aligned to k profiles, our algorithms have space complexity O(L^2) and time complexity O(kL^2 + NL), making high-throughput processing tractable. We further show how this method can be applied to aligning two profile HMMs and to constructing an HMM similarity measure. We call our method "COACH", for COmbining Alignments by Constructing HMMs.

In tests on protein sequence alignments of known structure and low pair-wise identity, we found Coach to give more accurate results than a) aligning a sequence to a profile HMM and b) aligning two sequences and/or profiles using ClustalW's profile-profile method. These tests found that on average Coach correctly aligned an order of magnitude more positions than ClustalW and approximately 20% more than aligning a sequence to a profile HMM.


Jahnavi C. Prasad (Bioinformatics Graduate Program, Boston University)
Automated Consensus Method of Alignment for Confident Comparative Modeling

We have developed an algorithm that consistently gives a high quality alignment for comparative modeling, and identifies the regions of this alignment that are reliable and structurally similar between the template and target. In order to identify a consistent way to get an accurate alignment, ten popular alignment methods were tested against a set of 79 pairs of homologous proteins for alignment accuracy in the context of comparative modeling. The top five performing methods were selected, and a method for generating a consensus by combining the alignments from these five methods was subsequently developed. By building on the strength of the consensus alignment, we have identified a set of criteria that remove alignment zones corresponding to structurally dissimilar regions and poor alignment reliability. When applied to an independent set of 49 homologous protein structure pairs, the average RMS deviation of the structures obtained with this consensus-based alignment is on the order of 2.5 Å, while the length of the alignment is about 80% of that found by standard structural superposition methods. While for the selected top five methods 20-40% of the alignments would yield predicted structures with RMS deviations of 6 Å or more from the native structure, there were no such cases at all from our method. In our tests, the method performs consistently over a range of target-template sequence identity spanning 5-30%. The algorithm is currently available as a server at http://structure.bu.edu/cgi-bin/consensus.cgi

1. Prasad,J.C., Comeau,S.R., Vajda,S., Camacho,C.J. Confident Homology Modeling Based On Consensus Alignment. Submitted for publication.
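
The idea of building a consensus from several alignments can be sketched as a majority vote over aligned residue pairs. This is only one plausible combination rule, chosen for illustration; the function name and the strict-majority threshold are assumptions, and the criteria actually used by the method are described in the cited manuscript, not here.

```python
from collections import Counter


def consensus_pairs(alignments, min_votes=None):
    """Keep the aligned pairs supported by at least `min_votes` methods.

    alignments -- one collection per method of (target_index,
                  template_index) pairs
    min_votes  -- threshold; defaults to a strict majority
    """
    if min_votes is None:
        min_votes = len(alignments) // 2 + 1
    votes = Counter()
    for aln in alignments:
        # A method votes at most once for any given pair.
        votes.update(set(aln))
    return sorted(pair for pair, n in votes.items() if n >= min_votes)
```

Positions that no majority agrees on simply drop out, which mirrors the trade-off reported above: a shorter alignment in exchange for higher reliability.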


Mark Akeson (University of California, Santa Cruz)
Nanopores

(Abstract not received yet)


Martin Madera (MRC Laboratory of Molecular Biology)
Have sequence comparison methods reached their full potential?

The current generation of sequence comparison methods (PSI-BLAST, HMMER, SAM) are profile methods: they compare a profile of a protein family to a query sequence. Because of the extra information contained in the profile as opposed to a single sequence, they detect three to five times as many remote homologs as the previous generation (WU-BLAST, FASTA, MPSEARCH). Using this information on both sides in a profile-profile comparison should improve performance even further.

During the talk I will present a beta version of PRC, my program for comparison of two profile hidden Markov models (HMMs). Using the curated Superfamily models (available from http://supfam.org), it currently detects 50% more remote homologs than the profile HMM program SAM, and 8.5x as many as WU-BLAST. I will describe the fully-probabilistic algorithm behind PRC, the twin-null and reverse null models used by the program, and the way it calculates its E-values.

On a more speculative note, I will also discuss the likely improvements from using profile-profile programs within the next-generation iterative procedures and the potential performance advantages to be gained from clustering of protein space as opposed to simple iteration.


David Juan (CNB-CSIC, Madrid)
Protein-Protein Interaction predictions in E. coli using information from literature

Protein-protein interaction data can be very helpful in protein function deduction and pharmaceutical target identification, providing a deeper understanding of the behaviour, dynamics and evolution of biological systems. However, experimentally and theoretically deduced protein interactions must be tested and improved.

For this work, E. coli was chosen as a reference organism because it is the simplest and best-known system, which is important for learning how the methods work. In this study, protein-protein interactions for E. coli are deduced using 5 different in silico methods and an experimental approach (Hybrigenics two-hybrid data for H. pylori). The in silico methods used were Phylogenetic Profiles, Gene Context, Gene Fusion (data from C. Ouzounis set), Mirror Tree and In silico two-hybrid. All these approaches were compared and combined using information automatically extracted from the literature (Medline abstracts). The assessment of these protein-protein interaction data, although necessarily incomplete, shows that every method provides useful information, with similar performance for experimental and theoretical data. Furthermore, very accurate predictions can be obtained from the small overlap among methods.
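
Finding the overlap among methods can be sketched as counting how many methods predict each protein pair. The function names, the threshold, and the idea of treating pairs as unordered are assumptions for illustration; the abstract does not describe how the predictions were actually combined.

```python
from collections import Counter


def support_counts(predictions):
    """Count how many methods predict each unordered protein pair.

    predictions -- dict mapping a method name to a collection of
                   (protein_a, protein_b) pairs
    """
    counts = Counter()
    for pairs in predictions.values():
        # Sort each pair so (a, b) and (b, a) count as the same
        # interaction, and let each method vote only once per pair.
        counts.update({tuple(sorted(pair)) for pair in pairs})
    return counts


def overlapping_predictions(predictions, min_methods=2):
    """Pairs predicted by at least `min_methods` different methods."""
    counts = support_counts(predictions)
    return sorted(pair for pair, n in counts.items() if n >= min_methods)
```

Raising `min_methods` shrinks the set of predictions, which corresponds to the small but accurate overlap described above.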