This is a list of tasks that need to be done on various projects for protein structure prediction. I've broken the list up into different categories, based on the different stages we go through in making a prediction.
I have the SAM-T04 modules that are almost finished (enough for us to do predictions at CASP in Summer 2004), but there are several functions that still need to be added.
This program can be used to produce input for a neural net to predict contact pairs. [George Shackelford is working on this.]
Other researchers have found that they can get far more homologs for proteins if they search the unfinished genomes, particularly using tblastn to search genomes that have not had adequate gene finding done. This might be worth trying, but extracting the "full-length" protein sequences from tblastn hits can be difficult---we want to have protein sequences, not full contigs of nucleotides, but we don't want just the narrow region where the tblastn hit is significant.
The thresholds used in SAM-T2K and SAM-T04 for determining what sequences to include at each level are somewhat arbitrary. It may be worthwhile to play with them a bit to see if we can include more sequences without getting contamination.
[Aleksey Kleyman did some work on using key residues to improve SAM multiple alignments, but the results were not encouraging.]
[Marica Soriano has started working on this.]
We already have one combining program called RDBCombine that is used in the SAM-T02 web server to merge neural-net predictions to get the dssp_ehl2 prediction. It can combine RDB-format prediction files for different alphabets. It translates the alphabet using the contingency tables output by compare-real, and weights them by the mutual information between the predicted alphabet and the desired alphabet. This is almost certainly not an optimal combining method, but it was put together in haste for CASP5, and has not really been evaluated.
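The combining idea can be sketched in code. This is an illustrative reconstruction, not RDBCombine's actual implementation: each alphabet's prediction is translated into the target alphabet via conditional probabilities from its contingency table, then the translated predictions are averaged with weights proportional to mutual information. The function names are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (bits) of a joint contingency table of counts."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def combine_predictions(preds, joints):
    """preds: per-position probability vectors, one per source alphabet.
    joints: contingency tables joint[x, y] relating each source alphabet x
    to the target alphabet y.  Returns a target-alphabet probability vector."""
    total, wsum = None, 0.0
    for p, joint in zip(preds, joints):
        joint = joint / joint.sum()
        cond = joint / joint.sum(axis=1, keepdims=True)  # P(y | x)
        translated = p @ cond                            # P(y) under this predictor
        w = mutual_information(joint)                    # MI-based weight
        total = translated * w if total is None else total + translated * w
        wsum += w
    return total / wsum
```

As the text notes, MI weighting is almost certainly not optimal; it is simply a cheap way to down-weight alphabets that carry little information about the target alphabet.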
One improvement that I think would be valuable for some local structure alphabets is to replace the current profile input with a double input: a profile and a 0-1 vector indicating the actual sequence. The profile could be set to generalize a bit further (say 0.8 bits_saved/column instead of the current scheme of using Henikoff weighting with the number of sequences as the total weight), as the guide sequence would record the actual residues. Separating the guide and the profile would be particularly useful in those cases where the multiple alignment from which the profile is generated has suffered from model drift, and no longer represents well the particular sequence. It might also be useful for improving the predictions in the regions where many of the aligned sequences have deletions, as the profile gets close to the background there, because of the low weights.
This modification would require some fairly minor changes to predict-2nd to allow the InterfaceDescription to have multiple inputs from different files, but the changes may be useful for other purposes also.
[Sol Katzman did this work, but still needs to write it up.]
This could be done by adding a combining layer to the outputs of a normal neural net. For each output of the neural net we associate two parameters (center and spread). The predicted mean would be the neural-net weighted sum of the center values, and the predicted standard deviation would be the standard deviation of a mixture distribution, with mixture coefficients from the neural net.
The center and spread parameters could initially be set by partitioning the outputs in the training data into roughly equal weight discrete classes (as is now done) and computing a mean and standard deviation for each class. They could be tweaked a bit by a few passes of EM to get the output described as a mixture. After a few epochs of training the neural net, the center and spread parameters could be trained by back propagation also.
We would need to compare the encoding cost of the actual burial value using this continuous method (which requires some care in the discretizing, since our underlying data is usually integer-valued) versus the encoding cost of the discrete alphabet method, which assumes a flat distribution within each of the discrete bins.
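The combining layer described above amounts to reading the neural-net outputs as mixture coefficients over (center, spread) components. A minimal sketch of the resulting predicted mean and standard deviation (function name hypothetical):

```python
import numpy as np

def mixture_mean_std(probs, centers, spreads):
    """Mean and standard deviation of a mixture whose coefficients are the
    neural-net outputs and whose components have the given centers/spreads."""
    probs = np.asarray(probs, dtype=float)
    centers = np.asarray(centers, dtype=float)
    spreads = np.asarray(spreads, dtype=float)
    mean = float(np.dot(probs, centers))
    # Mixture variance = expected component variance + variance of the centers.
    var = float(np.dot(probs, spreads ** 2) + np.dot(probs, (centers - mean) ** 2))
    return mean, var ** 0.5
```

Note that the mixture standard deviation is larger than the weighted average of the spreads whenever the centers disagree, which is the behavior we want from an uncertainty estimate.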
The resulting vectors can be clustered (with the square of the dot product as the appropriate similarity measure) to determine the most common directions. We can attempt to identify the properties these directions represent by looking at the correlation with the indices in the AAIndex database.
We can also compute the dot products of the profile with the centers of the most populated clusters and use them as inputs to the neural net, in addition to or in place of the profile input.
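Clustering with the square of the dot product treats a direction vector and its negation as equivalent. A minimal k-means-style sketch under that similarity measure (assumptions: unit-normalized vectors, farthest-point initialization, and centers updated to the principal eigenvector of each cluster's scatter matrix; all names hypothetical):

```python
import numpy as np

def axial_kmeans(vectors, k, iters=50):
    """Cluster direction vectors where v and -v are equivalent, using the
    squared dot product as the similarity measure."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Farthest-point initialization (deterministic).
    centers = [X[0]]
    for _ in range(1, k):
        sim = np.max((X @ np.array(centers).T) ** 2, axis=1)
        centers.append(X[sim.argmin()])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X @ centers.T) ** 2).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # Principal direction = top eigenvector of the scatter matrix.
                _, vecs = np.linalg.eigh(members.T @ members)
                centers[j] = vecs[:, -1]
    return centers, labels
```

The resulting cluster centers are the "most common directions" that could then be correlated with AAIndex properties or used as extra neural-net inputs.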
The above residue-at-a-time analysis does not get at the
neighborhood patterns that the neural nets learn. People have
tried to determine some of the patterns by looking at Fourier
transforms and wavelet transforms of numeric properties (mainly
hydrophobicity). We could look at each hidden unit on the first
neural layer and see if its weight matrix can be roughly matched by taking a
single direction vector and a neighbor weight function.
That is, decompose
    sum_{a, -2<=j<=2} w_{a,j} Prob(amino acid a in position i+j)
into
    sum_{-2<=j<=2} f(j) sum_a dir(a) Prob(amino acid a in position i+j),
that is, w_{a,j} = f(j) dir(a).
If there are hidden units that can be approximately described this
way, it may be worth providing input features that correspond to
this neighborhood property also.
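The decomposition w_{a,j} = f(j) dir(a) is exactly a rank-1 approximation of the hidden unit's weight matrix, so the best choice of f and dir, and how well they describe the unit, can be read off the SVD. A minimal sketch (function name hypothetical):

```python
import numpy as np

def rank1_decomposition(W):
    """Best rank-1 approximation of a weight matrix W[a, j] as
    dir(a) * f(j), plus the fraction of squared norm it explains."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    direction = U[:, 0] * s[0]   # dir(a), absorbing the singular value
    window = Vt[0]               # f(j), the neighbor weight function
    explained = s[0] ** 2 / (s ** 2).sum()
    return direction, window, explained
```

Hidden units with a large explained fraction are the candidates for replacement by a single neighborhood-property input feature.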
We really want to get maximum information gain from a multi-class soft classifier, which is much easier to design into a neural net than into SVMs. I'm not sure how to get a good probability vector out of a multi-class SVM.
It might be worth splitting the G class into "G" and "T" based on the other possible backbone hbond for the residue: "G" if the other is at separation 3 or 4, and "T" if it is a backbone hbond at some other separation. It is not clear what to do if this is the only backbone hbond for the residue---probably stick with G. Perhaps we could separate on the NOtor angle instead, with T peaking around 23 (same as B) and G peaking around -14. The current definition of an hbond in undertaker seems to cut off some of the T peak, since it limits NOtor to [-60,35]. The G set is 10-12 times more common, so it may not be worth trying to split off the T peak.
It may be more interesting to look at the sep=5 anti-parallel Hbonds, which seem to have a NOtor angle around 86 (not 23 or -83). The A/B split does seem to depend on the separation quite strongly in other ways also, so there may be just a few standard hairpin patterns.
[Pinal Kanabar is working on this one.]
Mark Diekhans implemented this in SAM, but I am not sure of the details of his implementation. In his tests, the method improved Q3 scores slightly, but made the information gain worse than the neural nets used to create the HMM tracks. Several things could be done to improve this method:
If we create a new test set based on SCOP, we should pay
attention to Julian Gough's rules about what to include and
exclude from the test set, based on his understanding of SCOP classification:
Profile-profile scoring
Many of the best fold-recognition servers use some version of
profile-profile scoring. If we can pick the right multitrack HMM
to use with profile-profile scoring, we should be able to make
substantial improvements over our current scheme.
There have been various proposals for handling insertions and
deletions in profile-profile scoring. Bob Edgar's seems the most
complete, but some of the simpler ones are probably worth
implementing also.
One nice thing about profile-profile scoring is that we can
use the existing HMM formats for representing the profiles, not
requiring any new I/O code in SAM. (Though it would be good to
read the RDB format output by the neural nets and output the HMM,
so that we could eliminate the TCL script that currently does this.)
There may be different profile-profile alignment methods or
parameter settings that are optimal for progressive alignment,
pairwise alignment of remote homologs, and fold recognition.
[Martina Koeva has done some work on profile-profile alignment.]
[Bret Barnes has been working on this, using improved scripts
written by Sol Katzman.]
One possible project is to incorporate the
"product-of-pvalues" method into our automatic predictions.
This requires designing a way to update the
calibration parameters for the combining method automatically as
new templates are added to the library, and converting the
fold results back into selection of good templates.
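If the per-template scores are well-calibrated p-values, the product-of-pvalues combination can be calibrated analytically: for k independent uniform p-values, -2 sum(log p_i) is chi-square with 2k degrees of freedom (Fisher's method). A minimal sketch, which is a plausible calibration for the method named above but not necessarily the one actually used:

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values by their product.  Under the null,
    -2 * sum(log p_i) ~ chi-square with 2k df; the survival function
    has a closed form for even degrees of freedom."""
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    # P(chi2_{2k} > stat) = exp(-stat/2) * sum_{i<k} (stat/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

The independence assumption is the weak point here: templates in the same fold class give highly correlated scores, which is presumably why the calibration parameters mentioned above need to be fit empirically.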
Another possible project is to devise a different combining
method. The "product-of-pvalues" method treats all templates in a
fold class as equally informative and does not use information
from templates in competing fold classes. One can envision
combining methods such as logistic regression that could use the
extra information to get better predictions. One danger is that
multi-domain proteins genuinely have multiple correct folds, so
simple competition between fold classes is not quite the right model.
In CASP6, it was obvious that we had managed to put together
some multi-domain predictions without explicit domain prediction.
It is not clear to me that explicit domain prediction is as
valuable as I thought in 2002.
[Nikhil did some work on this, but the problem is open again.]
The maximum-likelihood estimate
turns out to be fairly simple to compute---not much harder than
the moment-matching we currently do, so we should put that in SAM
and try it out. I've not yet looked at the Bailey and Gribskov paper.
We should try to get this done over the summer, so we can revise
and resubmit the paper.
[George has implemented the ML estimate and tested it.
It seems to do a poorer job than the moment-matching method,
fitting the center better but the tail worse.
The paper is being rewritten to incorporate the new results.]
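The passage does not say which distribution family is being fitted; a common case in this setting is calibrating score distributions with a Gumbel (extreme-value) distribution, as in the Bailey and Gribskov line of work. Assuming that family, a minimal sketch of both estimators, with the ML fit done by the standard fixed-point iteration:

```python
import math

EULER_GAMMA = 0.5772156649015329

def gumbel_moment_fit(x):
    """Moment-matching fit of Gumbel(mu, beta): mean = mu + gamma*beta,
    variance = (pi * beta)^2 / 6."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    beta = math.sqrt(6.0 * var) / math.pi
    return mean - EULER_GAMMA * beta, beta

def gumbel_ml_fit(x, iters=200):
    """Maximum-likelihood fit via the fixed-point iteration
    beta <- mean(x) - sum(x_i e^{-x_i/beta}) / sum(e^{-x_i/beta}),
    started from the moment-matching estimate."""
    n = len(x)
    mean = sum(x) / n
    beta = gumbel_moment_fit(x)[1]
    for _ in range(iters):
        w = [math.exp(-v / beta) for v in x]
        beta = mean - sum(v * wi for v, wi in zip(x, w)) / sum(w)
    w = [math.exp(-v / beta) for v in x]
    return -beta * math.log(sum(w) / n), beta
```

The two estimators weight the data differently: moment matching is dominated by the variance (hence the tail), while ML weights every point by its likelihood, which is consistent with the observation that the ML fit matched the center better but the tail worse.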
Model drift often results in the HMM being better at
recognizing some subfamily different from the one containing the
target protein. For example, if the SH3 domain for 2abl is given
as a seed for the SAM-T02 web site, the self-hit is about 80 down
from the top (partly because 2abl is not in the template library).
Blast finds several sequences containing exact matches to the
55-residue seed: 1oplA, 1opkA, 2abl, 1ju5C, 1abq, 1abo[AB], 1awo,
and 1bbz[ACEG]. Four of these are in the template library (1aboB,
1awo, 1bbz[AE]), but the best scoring match using the template
library is 1ng2A, which has the SH3 domain twice. If we just use
the w0.5 model, we can see that it has drifted to center on
1udlA, 1gfc, and 1gcqA, rather than the sequence we started with.
Various fixes are possible.
[Bret is working on this.]
[Bret is working on this.]
[Bret Barnes is working on this.]
[Jenny Draper did some work on this---it doesn't look like alignment trimming helps significantly with the newest alignment methods.]
Anyone interested in the profile-profile alignment should talk to Richard Hughey, Jenny Draper, and Martina Koeva, all of whom have done some work on the project.
The fragfinder program has not had extensive testing---we selected an algorithm and HMM based on intuition and minuscule tests on one target sequence. The undertaker program can be used to evaluate the fragments as fragments (looking at the histogram of RMSD, or RMSD normalized by length, for example), and we can come up with various figures of merit for fragment libraries, allowing us to compare different fragment-finding methods.
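One such figure of merit can be sketched directly: superpose each fragment onto the corresponding native coordinates with the Kabsch algorithm, and report the fraction of fragments whose length-normalized RMSD falls below a cutoff. This is an illustrative sketch, not undertaker's evaluation code; the cutoff and both function names are hypothetical.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two Nx3 coordinate sets after optimal rigid
    superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, s, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))      # avoid improper rotations
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def library_figure_of_merit(rmsds, lengths, cutoff=0.1):
    """Fraction of fragments whose length-normalized RMSD is below a cutoff."""
    return sum(1 for r, n in zip(rmsds, lengths) if r / n < cutoff) / len(rmsds)
```

Other summaries of the same histogram (mean, median, tail fraction) are equally easy once the per-fragment RMSDs are computed.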
The fragfinder output is as an A2M file, which is fairly compact and easily produced by the SAM software, but which does not include 3D information. Undertaker can read the a2m file and the necessary PDB template files and generate fragments by sidechain replacement, but does not currently have a way of storing these fragments in a file. Existing undertaker file formats could easily be adapted to save the fragments, speeding up multiple undertaker runs and making it easier to run undertaker on the kilocluster (by eliminating the need for access to the PDB library).
In addition to providing a format for native undertaker fragment libraries, it would be convenient to have I/O handling in undertaker for Rosetta-format fragment libraries. These do not contain as much information as undertaker fragment libraries (no sidechains), so outputting Rosetta libraries is easier than inputting them. Being able to output a Rosetta library would allow us to convert fragfinder results to a Rosetta library, which could then be compared with existing fragment-finding methods.
[Sol produced Rosetta fragment libraries and did a lot of work on filtering them in undertaker. We have not done substantial work on getting fragfinder to produce better libraries.]
Here is a list of a few of the many things that came up over the summer:
A somewhat more ambitious project would be to make histograms for several different types of distance measurements that could later be used for generating cost functions. For example, it would be useful to have histograms of the distances between corresponding CB atoms on adjacent beta strands. Is it different for parallel and anti-parallel strands? What about for CA atoms? These statistics would be useful for creating constraints for pairing up beta strands when we don't know the parity of the hydrogen bonding. It would also be good to do the histograms conditioned on separation along the backbone.
[STATISTICS FOR BETA-BONDING PARTNERS DONE 6-Dec-2003:
I added a routine to undertaker to gather
some statistics on beta bonding partners for the dunbrack-1332
training set. I still have to digest the results a bit, but here are
some preliminary figures:
CA-distances:
type | mean | std dev | approx.peak |
---|---|---|---|
all beta partners | 4.97 | 0.39 | 5.09 |
parallel | 4.84 | 0.26 | 4.83 |
anti-parallel | 5.03 | 0.43 | 5.13 |
anti-parallel unbonded | 4.43 | 0.24 | 4.39 |
anti-parallel bonded | 5.25 | 0.21 | 5.13 |
CB-distances:
type | mean | std dev | approx.peak |
---|---|---|---|
all beta partners | 5.16 | 0.71 | 5.19 |
parallel | 5.13 | 0.65 | 4.85 |
anti-parallel | 5.17 | 0.73 | 5.19 |
anti-parallel unbonded | 5.58 | 0.65 | 5.79 |
anti-parallel bonded | 5.02 | 0.70 | 4.43 |
Essentially all the 24135 beta-pair residues have CB-CB distances less than 8 and 99% have CB-CB distances less than 7. The common use of 8 Angstrom CB-CB for "contacts" may be a bit loose for identifying neighbors in beta sheets, but perhaps it is useful in other contexts (distance between sheets in a sandwich?).
I used a very crude way of finding beta-bonding partners (just looking for a pair of H-bonds), which probably increases the noise of these estimates.
The CA distances have much less variance than the CB distances, as you would expect for a structure defined by hbonds on the N and O atoms. The CA spacing depends on whether the residues are parallel or anti-parallel, and on whether or not the residues are bonded. This is also expected (build some models with the Darling model kits if this is not immediately obvious to you).
The CB spacing is less dependent on the relationship between the residues, but still more dependent than I had expected, with the bonded/unbonded distinction still making a 0.5 Angstrom difference.
I'm a bit worried about the distributions where the peak is so far from the mean (CB parallel and CB antiparallel bonded). I'll have to check to see whether these are artifacts of my histograms, inclusion of non-beta-sheet bonding pairs, bad H-bond definition, or some other phenomenon that I should be conditioning on.
I also don't know what distribution the data best fits. Probably not a normal distribution (distances are positive), but possibly a gamma or log-normal.]
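The gamma-vs-log-normal question can be checked numerically by comparing average log-likelihoods of the two fits on the distance data. A minimal sketch: the log-normal fit is the exact ML fit (a normal fit to the logs), while the gamma fit here is moment-matched rather than ML, to avoid needing a digamma solver; function names are hypothetical.

```python
import math

def lognormal_loglik(x):
    """Average log-likelihood of x under the ML log-normal fit."""
    logs = [math.log(v) for v in x]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((l - mu) ** 2 for l in logs) / n
    return sum(-0.5 * math.log(2 * math.pi * var) - (l - mu) ** 2 / (2 * var) - l
               for l in logs) / n

def gamma_loglik(x):
    """Average log-likelihood of x under a moment-matched gamma fit
    (shape k = mean^2/var, scale theta = var/mean; not the ML fit)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    k, theta = mean * mean / var, var / mean
    return sum((k - 1) * math.log(v) - v / theta - math.lgamma(k)
               - k * math.log(theta) for v in x) / n
```

Whichever family gives the higher average log-likelihood (ideally on held-out data) is the better candidate for the distance cost functions.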
I did some preliminary work on this, and (as expected) linear regression works terribly when the cost functions are highly correlated. We'll probably need to do some dimension reduction first (perhaps with principal components analysis?) before linear regression stands a chance of working.
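The PCA-then-regression idea is principal-components regression: project the correlated cost-function components onto their leading principal components, then do ordinary least squares in that reduced space. A minimal sketch (function names hypothetical):

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal-components regression: project the (correlated) feature
    columns onto their top principal components, then least-squares there."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T
    coef, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    return mean, float(y.mean()), components, coef

def pcr_predict(model, X):
    mean, ymean, components, coef = model
    return (X - mean) @ components.T @ coef + ymean
```

Unlike plain linear regression, this stays well-conditioned even when two cost-function components are exactly collinear, since the near-zero principal directions are simply dropped.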
[Oscar Hur has started writing scripts to convert NMR restraint files into undertaker constraints. There is some loss of information, so we need to test to see how much is lost.]
What we currently do is to convert the 11-dimensional neural network prediction vector into a nearly continuous function of the alpha angle (360 1-degree bins). This involves combining the neural network prediction with the histograms for individual amino acids. The current algorithm starts with the histograms for alpha conditioned on the residue and conditioned on the next residue, and takes their product to get an initial continuous curve; each of the 11 sections is then scaled to get the appropriate probability according to the neural net prediction, and the resulting curve is smoothed to remove the discontinuities at the edges of the sections. The scaling and smoothing are repeated several times.
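The scale-and-smooth loop can be sketched as follows. This is an illustrative reconstruction, not the actual code: the smoothing kernel (a circular moving average), the number of rounds, and the function name are all assumptions.

```python
import numpy as np

def scale_and_smooth(prior, section_of_bin, target_probs, rounds=5, window=9):
    """prior: 360-bin nonnegative curve (product of conditioned histograms);
    section_of_bin: length-360 map from each 1-degree bin to one of the
    11 sections; target_probs: the 11 neural-net probabilities.
    Repeatedly rescale each section to match its target probability, then
    smooth circularly to remove discontinuities at section edges."""
    curve = np.asarray(prior, dtype=float).copy()
    section_of_bin = np.asarray(section_of_bin)
    kernel = np.ones(window) / window
    pad = window // 2
    for _ in range(rounds):
        for s, p in enumerate(target_probs):
            mask = section_of_bin == s
            total = curve[mask].sum()
            if total > 0:
                curve[mask] *= p / total
        # Circular smoothing: wrap the ends around before convolving.
        padded = np.concatenate([curve[-pad:], curve, curve[:pad]])
        curve = np.convolve(padded, kernel, mode="valid")
    return curve / curve.sum()
```

The window width and the number of scale/smooth rounds are exactly the kind of parameters worth playing with, since they trade off matching the neural-net probabilities against the smoothness of the final density.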
Some parameters to play with include
[Rocky Choi did some work on this.]
A possible approach is to read a pdb file, randomize the rotamers, optimize with undertaker, reoptimize with scwrl, then reoptimize with undertaker, reporting the three costs and all-atom rmsd for each of the conformations. One could also gather all-atom rmsd for a huge number of sidechain conformations and try to do linear regression to set weights for the cost function, but this is likely to have problems due to correlation between the components.
Some of the search capabilities that should be added include:
Perhaps a more important task is to figure out a way to handle the large, multidomain proteins better, as finding a template for just one domain doesn't really solve the problem. Currently, the web pages don't even report which sections of the protein we have made predictions for!
The fairly small yeast genome took weeks to finish on the small cluster---to do a bigger genome, such as the human genome, we'd have to "clusterize" the method to run on the kilocluster without hammering the file servers. Fan Hsu has started work on making the fold recognition method work on the kilocluster. I don't think there is any point to trying to apply undertaker genome-wide right now---it requires far too much hand tweaking to get even a crude new-fold prediction out of undertaker.
It may be worthwhile to do some other small genomes, such as the archaeal genomes that Todd Lowe is building DNA chips for.
For example, I've done some predictions of chemotaxis proteins for Helicobacter pylori for Karen Ottemann.
Another approach is to take several of the local-structure-predicting neural nets, and fix their output at the desired local structures, then use backpropagation to optimize the inputs. One could start with an initial random sequence, optimize using the neural nets to get an input profile, then sample from that profile and reoptimize. This could be repeated to get a large sample of possible sequences that should have good local properties, which could then be fed into the Rosetta optimization program to try to find sequences with good tertiary properties. To do this right, one should probably do the back-propagation simultaneously in neural nets predicting several different local properties. This would require many changes to predict-2nd, but probably not as massive as changes to undertaker for protein design.
More random thoughts on using backpropagation. If I changed my neural nets over to using guide-sequence+profile input, then I could use backpropagation to change the profile, and sample from the profile to get a sequence. Iterate that many times from a random initial profile (generated from a Dirichlet mixture prior) and take the resulting sequence as a possible prediction. There are lots of variants that could be tried. For example, one could resample only the sequence positions where the backpropagated derivative for the current residue is negative, rather than resampling all positions. Instead of sampling from the profile one could take the most probable residue of the profile. After generating a few hundred sequences in this way from random starting points, one could make a profile from the sequences, and score each sequence with that profile. Further tweaking could take just the best 100 or so sequences and the profile generated from them. Each could be used with that profile as a starting point for the backpropagation and resample process. A tertiary scoring system (such as Rosetta's) could be used to pick out the best sequences from the list, and use the resulting sequences and profile to do more sampling.
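The backprop-to-inputs idea can be illustrated with a toy model. This is not predict-2nd: a fixed linear+sigmoid layer stands in for a trained local-structure net, the input profile is parameterized as a softmax of logits so it stays a valid distribution, and gradient descent pushes the predictor's outputs toward the desired local structure before a sequence is sampled. All names and sizes are hypothetical.

```python
import numpy as np

def optimize_profile(W, target, steps=2000, lr=1.0, seed=0):
    """Toy backprop-to-inputs for sequence design.
    W: (alphabet, n_outputs) fixed predictor weights.
    target: (length, n_outputs) desired outputs, one row per position.
    Returns the optimized profile and a sequence sampled from it."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    alphabet = W.shape[0]
    logits = rng.normal(size=(len(target), alphabet))
    for _ in range(steps):
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        profile = e / e.sum(axis=1, keepdims=True)        # softmax rows
        out = 1.0 / (1.0 + np.exp(-(profile @ W)))        # predictor outputs
        grad_out = (out - target) * out * (1.0 - out)     # through the sigmoid
        grad_profile = grad_out @ W.T                     # through the linear layer
        grad_logits = profile * (grad_profile             # through the softmax
                      - (grad_profile * profile).sum(axis=1, keepdims=True))
        logits -= lr * grad_logits
    sequence = [int(rng.choice(alphabet, p=row)) for row in profile]
    return profile, sequence
```

In the real setting the fixed layer would be replaced by the trained nets for several local properties, with their input gradients summed before the softmax step.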
Another approach to using neural nets would be to build a neural net whose input was many structural properties and whose output was a sequence. Such a neural net could be trained in the same manner as current nets for prediction (with inputs and outputs swapped), and would provide a probability distribution for amino acids at each position. This would require fewer changes to the neural net code (mainly providing an input layer that accepts several alphabets, instead of just one alphabet), and would more directly provide probability distributions to sample sequences from. This method is less likely to provide a variety of very different solutions to a protein design problem.
A more promising use for redesign of a known structure is to predict which residues are conserved for functional reasons rather than structural ones. David Bernick is investigating this with Carol Rohl, and it looks like there is a reasonable application for my mutual-information code here.
UCSC Bioinformatics research
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building