This is a list of tasks that need to be done on various projects for protein structure prediction. I've broken the list up into different categories, based on the different stages we go through in making a prediction.
I have the SAM-T04 modules that are almost finished (enough for us to do predictions at CASP in Summer 2004), but there are several functions that still need to be added.
This program can be used to produce input for a neural net to predict contact pairs. [George Shackelford is working on this.]
Other researchers have found that they can get far more homologs for proteins if they search the unfinished genomes, particularly using tblastn to search genomes that have not had adequate gene finding done. This might be worth trying, but extracting the "full-length" protein sequences from tblastn hits can be difficult---we want to have protein sequences, not full contigs of nucleotides, but we don't want just the narrow region where the tblastn hit is significant.
The thresholds used in SAM-T2K and SAM-T04 for determining what sequences to include at each level are somewhat arbitrary. It may be worthwhile to play with them a bit to see if we can include more sequences without getting contamination.
[Aleksey Kleyman did some work on using key residues to improve SAM multiple alignments, but the results were not encouraging.]
[Marica Soriano has started working on this.]
We already have one combining program called RDBCombine that is used in the SAM-T02 web server to merge neural-net predictions to get the dssp_ehl2 prediction. It can combine RDB-format prediction files for different alphabets. It translates the alphabet using the contingency tables output by compare-real, and weights them by the mutual information between the predicted alphabet and the desired alphabet. This is almost certainly not an optimal combining method, but it was put together in haste for CASP5, and has not really been evaluated.
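The combining idea can be sketched in code. This is an illustrative reconstruction, not RDBCombine's actual implementation: each alphabet's prediction is translated into the target alphabet via conditional probabilities from its contingency table, then the translated predictions are averaged with weights proportional to mutual information. The function names are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (bits) of a joint contingency table of counts."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def combine_predictions(preds, joints):
    """preds: per-position probability vectors, one per source alphabet.
    joints: contingency tables joint[x, y] relating each source alphabet x
    to the target alphabet y.  Returns a target-alphabet probability vector."""
    total, wsum = None, 0.0
    for p, joint in zip(preds, joints):
        joint = joint / joint.sum()
        cond = joint / joint.sum(axis=1, keepdims=True)  # P(y | x)
        translated = p @ cond                            # P(y) under this predictor
        w = mutual_information(joint)                    # MI-based weight
        total = translated * w if total is None else total + translated * w
        wsum += w
    return total / wsum
```

As the text notes, MI weighting is almost certainly not optimal; it is simply a cheap way to down-weight alphabets that carry little information about the target alphabet.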
One improvement that I think would be valuable for some local structure alphabets is to replace the current profile input with a double input: a profile and a 0-1 vector indicating the actual sequence. The profile could be set to generalize a bit further (say 0.8 bits_saved/column instead of the current scheme of using Henikoff weighting with the number of sequences as the total weight), as the guide sequence would record the actual residues. Separating the guide and the profile would be particularly useful in those cases where the multiple alignment from which the profile is generated has suffered from model drift, and no longer represents well the particular sequence. It might also be useful for improving the predictions in the regions where many of the aligned sequences have deletions, as the profile gets close to the background there, because of the low weights.
This modification would require some fairly minor changes to predict-2nd to allow the InterfaceDescription to have multiple inputs from different files, but the changes may be useful for other purposes also.
[Sol Katzman did this work, but still needs to write it up.]
This could be done by adding a combining layer to the outputs of a normal neural net. For each output of the neural net we associate two parameters (center and spread). The predicted mean would be the neural-net weighted sum of the center values, and the predicted standard deviation would be the standard deviation of a mixture distribution, with mixture coefficients from the neural net.
The center and spread parameters could initially be set by partitioning the outputs in the training data into roughly equal weight discrete classes (as is now done) and computing a mean and standard deviation for each class. They could be tweaked a bit by a few passes of EM to get the output described as a mixture. After a few epochs of training the neural net, the center and spread parameters could be trained by back propagation also.
We would need to compare the encoding cost of the actual burial value using this continuous method (which requires some care in the discretizing, since our underlying data is usually integer-valued) versus the encoding cost of the discrete alphabet method, which assumes a flat distribution within each of the discrete bins.
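The combining layer described above amounts to reading the neural-net outputs as mixture coefficients over (center, spread) components. A minimal sketch of the resulting predicted mean and standard deviation (function name hypothetical):

```python
import numpy as np

def mixture_mean_std(probs, centers, spreads):
    """Mean and standard deviation of a mixture whose coefficients are the
    neural-net outputs and whose components have the given centers/spreads."""
    probs = np.asarray(probs, dtype=float)
    centers = np.asarray(centers, dtype=float)
    spreads = np.asarray(spreads, dtype=float)
    mean = float(np.dot(probs, centers))
    # Mixture variance = expected component variance + variance of the centers.
    var = float(np.dot(probs, spreads ** 2) + np.dot(probs, (centers - mean) ** 2))
    return mean, var ** 0.5
```

Note that the mixture standard deviation is larger than the weighted average of the spreads whenever the centers disagree, which is the behavior we want from an uncertainty estimate.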
The resulting vectors can be clustered (with the square of the dot product as the appropriate similarity measure) to determine the most common directions. We can attempt to identify the properties these directions represent by looking at the correlation with the indices in the AAIndex database.
We can also compute the dot products of the profile with the centers of the most populated clusters and use them as inputs to the neural net, in addition to or in place of the profile input.
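Clustering with the square of the dot product treats a direction vector and its negation as equivalent. A minimal k-means-style sketch under that similarity measure (assumptions: unit-normalized vectors, farthest-point initialization, and centers updated to the principal eigenvector of each cluster's scatter matrix; all names hypothetical):

```python
import numpy as np

def axial_kmeans(vectors, k, iters=50):
    """Cluster direction vectors where v and -v are equivalent, using the
    squared dot product as the similarity measure."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Farthest-point initialization (deterministic).
    centers = [X[0]]
    for _ in range(1, k):
        sim = np.max((X @ np.array(centers).T) ** 2, axis=1)
        centers.append(X[sim.argmin()])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X @ centers.T) ** 2).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # Principal direction = top eigenvector of the scatter matrix.
                _, vecs = np.linalg.eigh(members.T @ members)
                centers[j] = vecs[:, -1]
    return centers, labels
```

The resulting cluster centers are the "most common directions" that could then be correlated with AAIndex properties or used as extra neural-net inputs.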
The above residue-at-a-time analysis does not get at the
neighborhood patterns that the neural nets learn. People have
tried to determine some of the patterns by looking at Fourier
transforms and wavelet transforms of numeric properties (mainly
hydrophobicity). We could look at each hidden unit on the first
neural layer and see if its weight matrix can be roughly matched by taking a
single direction vector and a neighbor weight function.
That is, decompose
    sum_{a, -2<=j<=2} w_{a,j} Prob(amino acid a in position i+j)
into
    sum_{-2<=j<=2} f(j) sum_a dir(a) Prob(amino acid a in position i+j),
that is, w_{a,j} = f(j) dir(a).
If there are hidden units that can be approximately described this
way, it may be worth providing input features that correspond to
this neighborhood property also.
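The decomposition w_{a,j} = f(j) dir(a) is exactly a rank-1 approximation of the hidden unit's weight matrix, so the best choice of f and dir, and how well they describe the unit, can be read off the SVD. A minimal sketch (function name hypothetical):

```python
import numpy as np

def rank1_decomposition(W):
    """Best rank-1 approximation of a weight matrix W[a, j] as
    dir(a) * f(j), plus the fraction of squared norm it explains."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    direction = U[:, 0] * s[0]   # dir(a), absorbing the singular value
    window = Vt[0]               # f(j), the neighbor weight function
    explained = s[0] ** 2 / (s ** 2).sum()
    return direction, window, explained
```

Hidden units with a large explained fraction are the candidates for replacement by a single neighborhood-property input feature.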
We really want to get maximum information gain from a multi-class soft classifier, which is much easier to design into a neural net than into SVMs. I'm not sure how to get a good probability vector out of a multi-class SVM.
It might be worth splitting the G class into "G" and "T" based on the other possible backbone hbond for the residue: "G" if the other is at separation 3 or 4, and "T" if it is a backbone hbond at some other separation. It is not clear what to do if this is the only backbone hbond for the residue---probably stick with G. Perhaps we could separate on the NOtor angle instead, with T peaking around 23 (same as B) and G peaking around -14. The current definition of an hbond in undertaker seems to cut off some of the T peak, since it limits NOtor to [-60,35]. The G set is 10-12 times more common, so it may not be worth trying to split off the T peak.
It may be more interesting to look at the sep=5 anti-parallel Hbonds, which seem to have a NOtor angle around 86 (not 23 or -83). The A/B split does seem to depend on the separation quite strongly in other ways also, so there may be just a few standard hairpin patterns.
[Pinal Kanabar is working on this one.]
Mark Diekhans implemented this in SAM, but I am not sure of the details of his implementation. In his tests, the method improved Q3 scores slightly, but made the information gain worse than the neural nets used to create the HMM tracks. Several things could be done to improve this method:
If we create a new test set based on SCOP, we should pay
attention to Julian Gough's rules about what to include and
exclude from the test set, based on his understanding of SCOP classification:
Profile-profile scoring
Many of the best fold-recognition servers use some version of
profile-profile scoring. If we can pick the right multitrack HMM
to use with profile-profile scoring, we should be able to make
substantial improvements over our current scheme.
There have been various proposals for handling insertions and
deletions in profile-profile scoring. Bob Edgar's seems the most
complete, but some of the simpler ones are probably worth
implementing also.
One nice thing about profile-profile scoring is that we can
use the existing HMM formats for representing the profiles, not
requiring any new I/O code in SAM. (Though it would be good to
read the RDB format output by the neural nets and output the HMM,
so that we could eliminate the TCL script that currently does this.)
There may be different profile-profile alignment methods or
parameter settings that are optimal for progressive alignment,
pairwise alignment of remote homologs, and fold recognition.
[Martina Koeva has done some work on profile-profile alignment.]
[Bret Barnes has been working on this, using improved scripts
written by Sol Katzman.]
One possible project is to incorporate the
"product-of-pvalues" method into our automatic predictions.
This requires designing a way to update the
calibration parameters for the combining method automatically as
new templates are added to the library, and converting the
fold results back into selection of good templates.
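If the per-template scores are well-calibrated p-values, the product-of-pvalues combination can be calibrated analytically: for k independent uniform p-values, -2 sum(log p_i) is chi-square with 2k degrees of freedom (Fisher's method). A minimal sketch, which is a plausible calibration for the method named above but not necessarily the one actually used:

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values by their product.  Under the null,
    -2 * sum(log p_i) ~ chi-square with 2k df; the survival function
    has a closed form for even degrees of freedom."""
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    # P(chi2_{2k} > stat) = exp(-stat/2) * sum_{i<k} (stat/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

The independence assumption is the weak point here: templates in the same fold class give highly correlated scores, which is presumably why the calibration parameters mentioned above need to be fit empirically.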
Another possible project is to devise a different combining
method. The "product-of-pvalues" method treats all templates in a
fold class as equally informative and does not use information
from templates in competing fold classes. One can envision
combining methods such as logistic regression that could use the
extra information to get better predictions. One danger is that
multi-domain proteins genuinely have multiple correct folds, so
simple competition between fold classes is not quite the right model.
In CASP6, it was obvious that we had managed to put together
some multi-domain predictions without explicit domain prediction.
It is not clear to me that explicit domain prediction is as
valuable as I thought in 2002.
[Nikhil did some work on this, but the problem is open again.]
The maximum-likelihood estimate
turns out to be fairly simple to compute---not much harder than
the moment-matching we currently do, so we should put that in SAM
and try it out. I've not yet looked at the Bailey and Gribskov paper.
We should try to get this done over the summer, so we can revise
and resubmit the paper.
[George has implemented the ML estimate and tested it.
It seems to do a poorer job than the moment-matching method,
fitting the center better but the tail worse.
The paper is being rewritten to incorporate the new results.]
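The passage does not say which distribution family is being fitted; a common case in this setting is calibrating score distributions with a Gumbel (extreme-value) distribution, as in the Bailey and Gribskov line of work. Assuming that family, a minimal sketch of both estimators, with the ML fit done by the standard fixed-point iteration:

```python
import math

EULER_GAMMA = 0.5772156649015329

def gumbel_moment_fit(x):
    """Moment-matching fit of Gumbel(mu, beta): mean = mu + gamma*beta,
    variance = (pi * beta)^2 / 6."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    beta = math.sqrt(6.0 * var) / math.pi
    return mean - EULER_GAMMA * beta, beta

def gumbel_ml_fit(x, iters=200):
    """Maximum-likelihood fit via the fixed-point iteration
    beta <- mean(x) - sum(x_i e^{-x_i/beta}) / sum(e^{-x_i/beta}),
    started from the moment-matching estimate."""
    n = len(x)
    mean = sum(x) / n
    beta = gumbel_moment_fit(x)[1]
    for _ in range(iters):
        w = [math.exp(-v / beta) for v in x]
        beta = mean - sum(v * wi for v, wi in zip(x, w)) / sum(w)
    w = [math.exp(-v / beta) for v in x]
    return -beta * math.log(sum(w) / n), beta
```

The two estimators weight the data differently: moment matching is dominated by the variance (hence the tail), while ML weights every point by its likelihood, which is consistent with the observation that the ML fit matched the center better but the tail worse.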
Model drift often results in the HMM being better at
recognizing some subfamily different from the one containing the
target protein. For example, if the SH3 domain for 2abl is given
as a seed for the SAM-T02 web site, the self-hit is about 80 down
from the top (partly because 2abl is not in the template library).
Blast finds several sequences containing exact matches to the
55-residue seed: 1oplA, 1opkA, 2abl, 1ju5C, 1abq, 1abo[AB], 1awo,
and 1bbz[ACEG]. Four of these are in the template library (1aboB,
1awo, 1bbz[AE]), but the best scoring match using the template
library is 1ng2A, which has the SH3 domain twice. If we just use
the w0.5 model, we can see that it has drifted to center on
1udlA, 1gfc, and 1gcqA, rather than the sequence we started with.
Various fixes are possible.
[Bret is working on this.]
[Bret is working on this.]
[Bret Barnes is working on this.]
[Jenny Draper did some work on this---it doesn't look like alignment trimming helps significantly with the newest alignment methods.]
Anyone interested in the profile-profile alignment should talk to Richard Hughey, Jenny Draper, and Martina Koeva, all of whom have done some work on the project.
The fragfinder program has not had extensive testing---we selected an algorithm and HMM based on intuition and minuscule tests on one target sequence. The undertaker program can be used to evaluate the fragments as fragments (looking at the histogram of RMSD, or RMSD normalized by length, for example), and we can come up with various figures of merit for fragment libraries, allowing us to compare different fragment-finding methods.
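One such figure of merit can be sketched directly: superpose each fragment onto the corresponding native coordinates with the Kabsch algorithm, and report the fraction of fragments whose length-normalized RMSD falls below a cutoff. This is an illustrative sketch, not undertaker's evaluation code; the cutoff and both function names are hypothetical.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two Nx3 coordinate sets after optimal rigid
    superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, s, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))      # avoid improper rotations
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def library_figure_of_merit(rmsds, lengths, cutoff=0.1):
    """Fraction of fragments whose length-normalized RMSD is below a cutoff."""
    return sum(1 for r, n in zip(rmsds, lengths) if r / n < cutoff) / len(rmsds)
```

Other summaries of the same histogram (mean, median, tail fraction) are equally easy once the per-fragment RMSDs are computed.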
The fragfinder output is as an A2M file, which is fairly compact and easily produced by the SAM software, but which does not include 3D information. Undertaker can read the a2m file and the necessary PDB template files and generate fragments by sidechain replacement, but does not currently have a way of storing these fragments in a file. Existing undertaker file formats could easily be adapted to save the fragments, speeding up multiple undertaker runs and making it easier to run undertaker on the kilocluster (by eliminating the need for access to the PDB library).
In addition to providing a format for native undertaker fragment libraries, it would be convenient to have I/O handling in undertaker for Rosetta-format fragment libraries. These do not contain as much information as undertaker fragment libraries (no sidechains), so outputting Rosetta libraries is easier than inputting them. Being able to output a Rosetta library would allow us to convert fragfinder results to a Rosetta library, which could then be compared with existing fragment-finding methods.
[Sol produced Rosetta fragment libraries and did a lot of work on filtering them in undertaker. We have not done substantial work on getting fragfinder to produce better libraries.]
Here is a list of a few of the many things that came up over the summer:
A somewhat more ambitious project would be to make histograms for several different types of distance measurements that could later be used for generating cost functions. For example, it would be useful to have histograms of the distances between corresponding CB atoms on adjacent beta strands. Is it different for parallel and anti-parallel strands? What about for CA atoms? These statistics would be useful for creating constraints for pairing up beta strands when we don't know the parity of the hydrogen bonding. It would also be good to do the histograms conditioned on separation along the backbone.
[STATISTICS FOR BETA-BONDING PARTNERS DONE 6-Dec-2003:
I added a routine to undertaker to gather
some statistics on beta bonding partners for the dunbrack-1332
training set. I still have to digest the results a bit, but here are
some preliminary figures:
CA-distances:
type | mean | std dev | approx.peak |
---|---|---|---|
all beta partners | 4.97 | 0.39 | 5.09 |
parallel | 4.84 | 0.26 | 4.83 |
anti-parallel | 5.03 | 0.43 | 5.13 |
anti-parallel unbonded | 4.43 | 0.24 | 4.39 |
anti-parallel bonded | 5.25 | 0.21 | 5.13 |
CB-distances:
type | mean | std dev | approx.peak |
---|---|---|---|
all beta partners | 5.16 | 0.71 | 5.19 |
parallel | 5.13 | 0.65 | 4.85 |
anti-parallel | 5.17 | 0.73 | 5.19 |
anti-parallel unbonded | 5.58 | 0.65 | 5.79 |
anti-parallel bonded | 5.02 | 0.70 | 4.43 |
Essentially all the 24135 beta-pair residues have CB-CB distances less than 8 and 99% have CB-CB distances less than 7. The common use of 8 Angstrom CB-CB for "contacts" may be a bit loose for identifying neighbors in beta sheets, but perhaps it is useful in other contexts (distance between sheets in a sandwich?).
I used a very crude way of finding beta-bonding partners (just looking for a pair of H-bonds), which probably increases the noise of these estimates.
The CA distances have much less variance than the CB distances, as you would expect for a structure defined by hbonds on the N and O atoms. The CA spacing depends on whether the residues are parallel or anti-parallel, and on whether or not the residues are bonded. This is also expected (build some models with the Darling model kits if this is not immediately obvious to you).
The CB spacing is less dependent on the relationship between the residues, but still more dependent than I had expected, with the bonded/unbonded distinction still making a 0.5 Angstrom difference.
I'm a bit worried about the distributions where the peak is so far from the mean (CB parallel and CB antiparallel bonded). I'll have to check to see whether these are artifacts of my histograms, inclusion of non-beta-sheet bonding pairs, bad H-bond definition, or some other phenomenon that I should be conditioning on.
I also don't know what distribution the data best fits. Probably not a normal distribution (distances are positive), but possibly a gamma or log-normal.]
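The gamma-vs-log-normal question can be checked numerically by comparing average log-likelihoods of the two fits on the distance data. A minimal sketch: the log-normal fit is the exact ML fit (a normal fit to the logs), while the gamma fit here is moment-matched rather than ML, to avoid needing a digamma solver; function names are hypothetical.

```python
import math

def lognormal_loglik(x):
    """Average log-likelihood of x under the ML log-normal fit."""
    logs = [math.log(v) for v in x]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((l - mu) ** 2 for l in logs) / n
    return sum(-0.5 * math.log(2 * math.pi * var) - (l - mu) ** 2 / (2 * var) - l
               for l in logs) / n

def gamma_loglik(x):
    """Average log-likelihood of x under a moment-matched gamma fit
    (shape k = mean^2/var, scale theta = var/mean; not the ML fit)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    k, theta = mean * mean / var, var / mean
    return sum((k - 1) * math.log(v) - v / theta - math.lgamma(k)
               - k * math.log(theta) for v in x) / n
```

Whichever family gives the higher average log-likelihood (ideally on held-out data) is the better candidate for the distance cost functions.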
I did some preliminary work on this, and (as expected) linear regression works terribly when the cost functions are highly correlated. We'll probably need to do some dimension reduction first (perhaps with principal components analysis?) before linear regression stands a chance of working.
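The PCA-then-regression idea is principal-components regression: project the correlated cost-function components onto their leading principal components, then do ordinary least squares in that reduced space. A minimal sketch (function names hypothetical):

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal-components regression: project the (correlated) feature
    columns onto their top principal components, then least-squares there."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T
    coef, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    return mean, float(y.mean()), components, coef

def pcr_predict(model, X):
    mean, ymean, components, coef = model
    return (X - mean) @ components.T @ coef + ymean
```

Unlike plain linear regression, this stays well-conditioned even when two cost-function components are exactly collinear, since the near-zero principal directions are simply dropped.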
[Oscar Hur has started writing scripts to convert NMR restraint files into undertaker constraints. There is some loss of information, so we need to test to see how much is lost.]
What we currently do is to convert the 11-dimensional neural network prediction vector into a nearly continuous function of the alpha angle (360 1-degree bins). This involves combining the neural network prediction with the histograms for individual amino acids. The current algorithm starts with the histograms for alpha conditioned on the residue and conditioned on the next residue, and takes their product to get an initial continuous curve; each of the 11 sections is then scaled to get the appropriate probability according to the neural net prediction, and the resulting curve is smoothed to remove the discontinuities at the edges of the sections. The scaling and smoothing are repeated several times.
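The scale-and-smooth loop can be sketched as follows. This is an illustrative reconstruction, not the actual code: the smoothing kernel (a circular moving average), the number of rounds, and the function name are all assumptions.

```python
import numpy as np

def scale_and_smooth(prior, section_of_bin, target_probs, rounds=5, window=9):
    """prior: 360-bin nonnegative curve (product of conditioned histograms);
    section_of_bin: length-360 map from each 1-degree bin to one of the
    11 sections; target_probs: the 11 neural-net probabilities.
    Repeatedly rescale each section to match its target probability, then
    smooth circularly to remove discontinuities at section edges."""
    curve = np.asarray(prior, dtype=float).copy()
    section_of_bin = np.asarray(section_of_bin)
    kernel = np.ones(window) / window
    pad = window // 2
    for _ in range(rounds):
        for s, p in enumerate(target_probs):
            mask = section_of_bin == s
            total = curve[mask].sum()
            if total > 0:
                curve[mask] *= p / total
        # Circular smoothing: wrap the ends around before convolving.
        padded = np.concatenate([curve[-pad:], curve, curve[:pad]])
        curve = np.convolve(padded, kernel, mode="valid")
    return curve / curve.sum()
```

The window width and the number of scale/smooth rounds are exactly the kind of parameters worth playing with, since they trade off matching the neural-net probabilities against the smoothness of the final density.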
Some parameters to play with include
[Rocky Choi did some work on this.]
A possible approach is to read a pdb file, randomize the rotamers, optimize with undertaker, reoptimize with scwrl, then reoptimize with undertaker, reporting the three costs and all-atom rmsd for each of the conformations. One could also gather all-atom rmsd for a huge number of sidechain conformations and try to do linear regression to set weights for the cost function, but this is likely to have problems due to correlation between the components.
Some of the search capabilities that should be added include:
Perhaps a more important task is to figure out a way to handle the large, multidomain proteins better, as finding a template for just one domain doesn't really solve the problem. Currently, the web pages don't even report which sections of the protein we have made predictions for!
The fairly small yeast genome took weeks to finish on the small cluster---to do a bigger genome, such as the human genome, we'd have to "clusterize" the method to run on the kilocluster without hammering the file servers. Fan Hsu has started work on making the fold recognition method work on the kilocluster. I don't think there is any point to trying to apply undertaker genome-wide right now---it requires far too much hand tweaking to get even a crude new-fold prediction out of undertaker.
It may be worthwhile to do some other small genomes, such as the archaeal genomes that Todd Lowe is building DNA chips for.
For example, I've done some predictions of chemotaxis proteins for Helicobacter pylori for Karen Ottemann.
Another approach is to take several of the local-structure-predicting neural nets, and fix their output at the desired local structures, then use backpropagation to optimize the inputs. One could start with an initial random sequence, optimize using the neural nets to get an input profile, then sample from that profile and reoptimize. This could be repeated to get a large sample of possible sequences that should have good local properties, which could then be fed into the Rosetta optimization program to try to find sequences with good tertiary properties. To do this right, one should probably do the back-propagation simultaneously in neural nets predicting several different local properties. This would require many changes to predict-2nd, but probably not as massive as changes to undertaker for protein design.
More random thoughts on using backpropagation. If I changed my neural nets over to using guide-sequence+profile input, then I could use backpropagation to change the profile, and sample from the profile to get a sequence. Iterate that many times from a random initial profile (generated from a Dirichlet mixture prior) and take the resulting sequence as a possible prediction. There are lots of variants that could be tried. For example, one could resample only the sequence positions where the backpropagated derivative for the current residue is negative, rather than resampling all positions. Instead of sampling from the profile one could take the most probable residue of the profile. After generating a few hundred sequences in this way from random starting points, one could make a profile from the sequences, and score each sequence with that profile. Further tweaking could take just the best 100 or so sequences and the profile generated from them. Each could be used with that profile as a starting point for the backpropagation and resample process. A tertiary scoring system (such as Rosetta's) could be used to pick out the best sequences from the list, and use the resulting sequences and profile to do more sampling.
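The backprop-to-inputs idea can be illustrated with a toy model. This is not predict-2nd: a fixed linear+sigmoid layer stands in for a trained local-structure net, the input profile is parameterized as a softmax of logits so it stays a valid distribution, and gradient descent pushes the predictor's outputs toward the desired local structure before a sequence is sampled. All names and sizes are hypothetical.

```python
import numpy as np

def optimize_profile(W, target, steps=2000, lr=1.0, seed=0):
    """Toy backprop-to-inputs for sequence design.
    W: (alphabet, n_outputs) fixed predictor weights.
    target: (length, n_outputs) desired outputs, one row per position.
    Returns the optimized profile and a sequence sampled from it."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    alphabet = W.shape[0]
    logits = rng.normal(size=(len(target), alphabet))
    for _ in range(steps):
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        profile = e / e.sum(axis=1, keepdims=True)        # softmax rows
        out = 1.0 / (1.0 + np.exp(-(profile @ W)))        # predictor outputs
        grad_out = (out - target) * out * (1.0 - out)     # through the sigmoid
        grad_profile = grad_out @ W.T                     # through the linear layer
        grad_logits = profile * (grad_profile             # through the softmax
                      - (grad_profile * profile).sum(axis=1, keepdims=True))
        logits -= lr * grad_logits
    sequence = [int(rng.choice(alphabet, p=row)) for row in profile]
    return profile, sequence
```

In the real setting the fixed layer would be replaced by the trained nets for several local properties, with their input gradients summed before the softmax step.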
Another approach to using neural nets would be to build a neural net whose input was many structural properties and whose output was a sequence. Such a neural net could be trained in the same manner as current nets for prediction (with inputs and outputs swapped), and would provide a probability distribution for amino acids at each position. This would require fewer changes to the neural net code (mainly providing an input layer that accepts several alphabets, instead of just one alphabet), and would more directly provide probability distributions to sample sequences from. This method is less likely to provide a variety of very different solutions to a protein design problem.
A more promising use for redesign of a known structure is to predict which residues are conserved for functional reasons rather than structural ones. David Bernick is investigating this with Carol Rohl, and it looks like there is a reasonable application for my mutual-information code here.
UCSC Bioinformatics research
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building