Projects list: protein structure prediction

(Last Update: 05:17 PST 6 February 2005 )

This is a list of tasks that need to be done on various projects for protein structure prediction. I've broken the list up into different categories, based on the different stages we go through in making a prediction.

Perl modules
I'm working on perl modules to make the SAM-T03 prediction engine more modular than the previous ones, so that pieces of it can be reused by other researchers. I plan to do most of the work on this myself (incorporating results from student projects where appropriate). [Chris Dragon finished the a2m2html project, but I did not manage to get the SAM-T03 modules and scripts finished.]

The SAM-T04 modules are almost finished (enough for us to do predictions at CASP in Summer 2004), but there are several functions that still need to be added.

Correlated columns of multiple alignments
I threw together a program to measure the mutual information between columns of a multiple alignment (any alphabet).

This program can be used to produce input for a neural net to predict contact pairs. [George Shackelford is working on this.]
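For illustration (this is not the actual program), the core of the column-pair mutual-information computation is small; sequence weighting and small-sample corrections, which matter in practice, are omitted from this sketch:

    import math
    from collections import Counter

    def mutual_information(col_a, col_b, exclude="-."):
        """Mutual information (in bits) between two alignment columns,
        given as equal-length strings over any alphabet.
        Positions where either column has a gap character are skipped."""
        pairs = [(a, b) for a, b in zip(col_a, col_b)
                 if a not in exclude and b not in exclude]
        n = len(pairs)
        if n == 0:
            return 0.0
        joint = Counter(pairs)
        count_a = Counter(a for a, b in pairs)
        count_b = Counter(b for a, b in pairs)
        mi = 0.0
        for (a, b), n_ab in joint.items():
            # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), with counts cancelled against n
            mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
        return mi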

Finding similar sequences (potential homologs)

Other researchers have found that they can get far more homologs for proteins if they search the unfinished genomes, particularly using tblastn to search genomes that have not had adequate gene finding done. This might be worth trying, but extracting "full-length" protein sequences from tblastn hits can be difficult---we want protein sequences, not full contigs of nucleotides, but we also don't want just the narrow region where the tblastn hit is significant.
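A rough sketch of the kind of extraction that would be needed, assuming the HSP coordinates have already been parsed from the tblastn report and remapped to the plus strand; the padding and the stop-codon trimming rule are guesses that would need tuning:

    from Bio.Seq import Seq   # Biopython, used here only for translation

    def extract_protein_around_hit(contig_seq, hit_start, hit_end, pad_aa=100):
        """Rough extraction of a "full-length" protein region around a tblastn HSP.
        contig_seq : plus-strand contig nucleotides (minus-strand hits would first
            be reverse-complemented and have their coordinates remapped).
        hit_start, hit_end : 0-based contig coordinates of the HSP, with hit_start
            assumed to fall on a codon boundary of the hit's reading frame.
        Extends pad_aa codons on each side (staying in frame), translates, and
        trims back to the stop-free stretch containing the hit."""
        pad_nt = 3 * pad_aa
        start = max(hit_start % 3, hit_start - pad_nt)   # stay in the hit's frame
        end = min(len(contig_seq), hit_end + pad_nt)
        protein = str(Seq(contig_seq[start:end]).translate())
        mid = ((hit_start + hit_end) // 2 - start) // 3  # a residue inside the HSP
        left = protein.rfind("*", 0, mid) + 1            # nearest upstream stop (or 0)
        right = protein.find("*", mid)                   # nearest downstream stop
        return protein[left:] if right == -1 else protein[left:right]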

The thresholds used in SAM-T2K and SAM-T04 for determining what sequences to include at each level are somewhat arbitrary. It may be worthwhile to play with them a bit to see if we can include more sequences without getting contamination.

Multiple alignment of similar sequences
The biggest improvement that SAM-Txx makes over PSIBLAST is in the quality of the multiple alignment, but there is still a lot more that could be done to improve it. Here are some ideas (not necessarily compatible):
Local structure prediction
We are still working on finding good local structure alphabets to predict, and our neural network framework is fairly reliable. Minor tweaks will probably not make much difference, but I have some ideas that may be worth pursuing.
Fold recognition
Our current best fold-recognition method relies on having a two-track target model and a library of amino-acid-only template models, and combining the scores. There are several areas where this could be improved, but experimentation will be needed with the fold-recognition test set to see which make a big difference.
Updating our test set
We've been using the same test set for a long time, which raises two questions: (1) Is the underlying "gold standard" still valid? There have been reclassifications and improvements in SCOP since we created our test set, which may change the results, particularly in cases where methods are fairly close in performance. (2) Have we started training to the benchmark? If we keep selecting methods based on how well they do on a particular benchmark, we can easily get overtraining. Right now this is not a big concern, as we have done relatively few tests on the full fold-recognition benchmark, but it would be good to have a test that is at least somewhat new.

If we create a new test set based on SCOP, we should pay attention to Julian Gough's rules about what to include and exclude from the test set, based on his understanding of SCOP classification:

Profile-profile scoring

A somewhat more general version of the two-track template model would allow full probability vectors for both the states and the sequence---with scores computed in any of several ways. The simplest is to take the log of the dot product of the two probability vectors, but other schemes have been proposed in the literature. Note that this scheme generalizes the current profile-sequence scoring and the two-track template scoring proposed above, with the simple labels handled as elementary vectors (all zero except for a 1 in the labeled position).
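For concreteness, the dot-product variant of a single column-pair score might look like the sketch below; the other scoring schemes from the literature and the surrounding dynamic programming are not shown, and the one-hot special case matches the "elementary vectors" remark above:

    import numpy as np

    def dot_product_column_score(p, q, eps=1e-10):
        """log of the dot product of two profile columns p and q
        (probability vectors over the same alphabet).  If q is a one-hot
        vector for a single letter, this reduces to log p[letter], i.e.
        ordinary profile-to-sequence scoring."""
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        return float(np.log(np.dot(p, q) + eps))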

Many of the best fold-recognition servers use some version of profile-profile scoring. If we can pick the right multitrack HMM to use with profile-profile scoring, we should be able to make substantial improvements over our current scheme.

There have been various proposals for handling insertions and deletions in profile-profile scoring. Bob Edgar's seems the most complete, but some of the simpler ones are probably worth implementing also.

One nice thing about profile-profile scoring is that we can use the existing HMM formats for representing the profiles, not requiring any new I/O code in SAM. (Though it would be good to read the RDB format output by the neural nets and output the HMM, so that we could eliminate the TCL script that currently does this.)

There may be different profile-profile alignment methods or parameter settings that are optimal for progressive alignment, pairwise alignment of remote homologs, and fold recognition.

[Martina Koeva has done some work on profile-profile alignment.]

Parameter adjustment for local structure alphabets
Rachel Karchin tested local structure alphabets with a constant weight of 1 on the amino-acid track and 0.3 on the local structure track. These weights have not been optimized, and we may be able to get better performance by adjusting them (and adjusting them differently for different alphabets).

[Bret Barnes has been working on this, using improved scripts written by Sol Katzman.]

Combining templates
Our fold-recognition methods treat every template in the library as independent of the rest, but we actually have many templates which contain similar folds. Various ways have been proposed for combining template scores to get a "fold" score, and a couple of students have done preliminary tests of the "product-of-pvalues" method and shown that it might help, but their projects were never taken to the point where the method could be incorporated into an automatic prediction method.

One possible project is to incorporate the "product-of-pvalues" method into our automatic predictions. This requires designing a way to update the calibration parameters for the combining method automatically as new templates are added to the library, and converting the fold results back into selection of good templates.

Another possible project is to devise a different combining method. The "product-of-pvalues" method treats all templates in a fold class as equally informative and does not use information from templates in competing fold classes. One can envision combining methods such as logistic regression that could use the extra information to get better predictions. One danger is that multi-domain proteins genuinely have multiple correct folds, so simple competition between fold classes is not quite the right model.
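For concreteness, one standard way to calibrate a product of per-template p-values is Fisher's method; the sketch below is a stand-in for whatever calibration the student projects actually used, and it assumes the per-template p-values within a fold class are independent, which they are not:

    from math import log
    from scipy.stats import chi2

    def fold_pvalue(template_pvalues):
        """Combine per-template p-values for one fold class into a single
        fold-level p-value using Fisher's method: -2 * sum(log p) follows a
        chi-squared distribution with 2k degrees of freedom under the null."""
        k = len(template_pvalues)
        statistic = -2.0 * sum(log(max(p, 1e-300)) for p in template_pvalues)
        return chi2.sf(statistic, df=2 * k)

A ranked list of fold classes would then come from applying this to each class's template p-values; the automatic recalibration as templates are added, and the conversion back to selection of good templates, are the harder parts.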

Handling multi-domain proteins
Our current methods look for target-template matches where there is a common subdomain, but do not take the next step of looking separately at the part of the target that does not match the template and doing a separate prediction for it. We could probably improve our fold-recognition capability if we did some domain splitting, either after having recognized a domain using the whole chain, or by trying domain splits based on the multiple alignment. In the recent run of all yeast ORFs, one of the predictions took 6 days, because the protein was 4910 residues long---this clearly would have been helped by breaking the protein up into smaller pieces.

In CASP6, it was obvious that we had managed to put together some multi-domain predictions without explicit domain prediction. It is not clear to me that explicit domain prediction is as valuable as I thought in 2002.

Better calibration of HMMs, particularly multi-track HMMs
Our current method for calibrating the E-values of HMMs is described in the paper Calibrating E-values for Hidden Markov Models with Reverse-sequence Null Models, where a few problems are described.
  • One of the problems is that simple i.i.d. models of "random" sequences work ok for amino-acid sequences, but not for secondary structure sequences, which often have long runs of the same letter. We have a sophisticated sequence generator for amino-acid sequences released as open-source code: gen_sequence, but no good random sequence generator for secondary structure sequences. There are several ways to try to generate random secondary structure sequences: i.i.d. models, Dirichlet mixture for composition with i.i.d. from the composition (used in gen_sequence), first-order Markov chains, higher-order Markov chains, variable-order Markov chains, mixtures of Markov chains, hidden Markov models, Markov chains of segments with length models for each segment, and so forth. Implementing one or more of these methods and testing how well it can encode secondary structure sequences (over various alphabets) in a train/test experiment would be very useful. I'm particularly intrigued by the idea of generating sequences of segments, with a separate length model for each segment type, and have done a little preliminary work on the idea.

    [Nikhil did some work on this, but the problem is open again.]

  • Even if we get a good random sequence generator for the various local structure alphabets, we still have some trouble with the failure of the "reversibility assumption" for some alphabets that otherwise look quite useful, such as de Brevern's protein blocks. For multi-track HMMs using these alphabets, we need a different way of calibrating the E-values that does not use reverse-sequence null models. One possibility would be to fit a Gumbel extreme-value distribution to scores on randomly generated sequences (using a good generator for the local structure alphabet). Other parameterized distributions could also be explored, as there is some evidence that the Gumbel distribution does not have a fat enough tail.
  • More immediately, the referees for the paper pointed out that we should try a maximum-likelihood estimate (not just a moment-matching estimate) and we might also want to examine "a recent paper by Bailey and Gribskov (J Comput Biol, Vol 9 (3), 2002) which investigates in some detail the maximum likelihood method for fitting the Gumbel distribution to datasets that may contain true as well as false positives."

    The maximum-likelihood estimate turns out to be fairly simple to compute---not much harder than the moment-matching we currently do, so we should put that in SAM and try it out. I've not yet looked at the Bailey and Gribskov paper. We should try to get this done over the summer, so we can revise and resubmit the paper. [George has implemented the ML estimate and tested it. It seems to do a poorer job than the moment-matching method, fitting the center better but the tail worse. The paper is being rewritten to incorporate the new results.]
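Both estimates are cheap to compute, so comparing them on scores of randomly generated sequences is easy; a sketch using SciPy (the real calibration would of course live inside SAM):

    import numpy as np
    from scipy.stats import gumbel_r

    def fit_gumbel(scores):
        """Fit a Gumbel (extreme-value) distribution to a set of scores,
        both by moment matching and by maximum likelihood.
        Returns ((mu_mm, beta_mm), (mu_ml, beta_ml))."""
        scores = np.asarray(scores, dtype=float)
        euler_gamma = 0.5772156649
        # moment matching: var = (pi*beta)^2 / 6,  mean = mu + gamma*beta
        beta_mm = np.std(scores) * np.sqrt(6.0) / np.pi
        mu_mm = np.mean(scores) - euler_gamma * beta_mm
        # maximum likelihood (scipy's default fitting method for gumbel_r)
        mu_ml, beta_ml = gumbel_r.fit(scores)
        return (mu_mm, beta_mm), (mu_ml, beta_ml)

Comparing the two fits in the tail (say, the top 1% of scores) rather than in the center is probably the right way to judge them for E-value calibration.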

Better template selection for close homologs
When we have many close homologs in PDB, the current fold-recognition method finds all of them, but does not do a good job of picking the best template.

Model drift often results in the HMM being better at recognizing some subfamily different from the one containing the target protein. For example, if the SH3 domain for 2abl is given as a seed for the SAM-T02 web site, the self-hit is about 80 down from the top (partly because 2abl is not in the template library). Blast finds several sequences containing exact matches to the 55-residue seed: 1oplA, 1opkA, 2abl, 1ju5C, 1abq, 1abo[AB], 1awo, and 1bbz[ACEG]. Four of these are in the template library (1aboB, 1awo, 1bbz[AE]), but the best scoring match using the template library is 1ng2A, which has the SH3 domain twice. If we just use the w0.5 model, we can see that it has drifted to center on 1udlA, 1gfc, and 1gcqA, rather than the sequence we started with.

Various fixes are possible.

  • One is to look at the simple pairwise scores (ignoring the multiple alignment) when choosing close homologs---for this example, the best alignment scores came from the alignments that had identical sequences, which would have worked fine in this case. This reduces to using rather crude tools, though, so it might not work as well when sequences have diverged somewhat.
  • Another possibility is to look for pdb files in the t2k.a2m alignment (there are 78 of them in this example), and make a phylogenetic tree based on the multiple alignment. The closest templates in the tree are the ones most likely to be worth using. This would work well here, since there are many PDB files in the alignment, and the right ones would cluster no matter what tree-building method was used. (A simpler identity-based stand-in is sketched after this list.)
  • A slightly more powerful method would be to align all the top hits from PDB to the w0.5 model, then build a tree from that alignment. This may pick up a few more distant sequences than the t2k alignment building does (for example, 95 SH3 domains are found, not just the 78 in the 2abl example). The method could be applied to fold-recognition targets as well as to targets with really close homologs in the template library.
  • We could also look at the ordering of the predictions with different weights on the second (or third) track. It seems that the predicted local structure tracks become more valuable for more distant templates. If we find close templates, we may want to use parameters that put more weight on the AA track.
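A much simpler stand-in for the tree-based ideas in the second and third bullets is to rank the PDB sequences in the alignment by fraction identity to the seed over match columns; a sketch (reading the a2m file, reducing rows to match columns only, and recognizing which sequence IDs are PDB chains are all left out):

    def fraction_identity(seed_row, other_row, gap_chars="-."):
        """Fraction of match columns where the two aligned rows agree,
        counted over the columns where the seed is not gapped."""
        matches = aligned = 0
        for a, b in zip(seed_row.upper(), other_row.upper()):
            if a in gap_chars:
                continue
            aligned += 1
            if a == b:
                matches += 1
        return matches / aligned if aligned else 0.0

    def rank_templates(seed_row, pdb_rows):
        """pdb_rows: dict mapping PDB chain id -> aligned row (match columns only).
        Returns chain ids sorted from most to least identical to the seed."""
        return sorted(pdb_rows,
                      key=lambda cid: fraction_identity(seed_row, pdb_rows[cid]),
                      reverse=True)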
Alignment of target and template
Currently, our best alignments are made from target 2-track HMMs (using the STR alphabet as the second track) doing local alignments to the templates. We have a couple of pairwise alignment test sets that can be used for evaluating different alignment methods, though we have to be a bit careful about tuning an alignment method to the test sets. There are a lot of things that could be done to improve pairwise alignments:
Exploring other local structure alphabets
Rachel Karchin is exploring other local structure alphabets for her Ph.D. thesis. After she finishes, her methods should be available for other students who want to try new structure alphabets. One that did not get covered in her thesis is the STR3 alphabet (a minor extension of STR and STR2). Since STR2 is currently our best single alphabet, STR3 would be worth exploring.

[Bret is working on this.]

Tweaking parameters
We have not tried adjusting transition-cost parameters nor the weight of the second track. Other alignment parameters (such as fimstrength) could also be explored. This project requires no major coding, just running lots of alignment experiments and keeping careful records of the results.

[Bret is working on this.]

Three-track HMMs
The two-track HMMs currently give the best results, but Rachel Karchin has proposed some three-track HMMs that may do better (there are some problems at the moment with hmmscore, that Richard Hughey and Rachel Karchin are looking into). Exploring the possible combinations of alphabets could result in significantly better alignments.

[Bret Barnes is working on this.]

Alignment trimming
In earlier work with one-track HMMs, Melissa Cline found that she could get better alignments by trimming out the unreliable parts of global alignments than by using strictly local alignments. Her method is implemented in SAM, but has not been tested or calibrated with multi-track HMMs.

[Jenny Draper did some work on this---it doesn't look like alignment trimming helps significantly with the newest alignment methods.]

Profile-profile alignment
All our alignment methods currently use homologs for only one of the two sequences being aligned (usually the target sequence, though sometimes the template sequence). We should be able to do better by using homolog information for both sequences. For the templates, we can use either sequence-based SAM-T2K multiple alignments or structure-based alignments.

Anyone interested in the profile-profile alignment should talk to Richard Hughey, Jenny Draper, and Martina Koeva, all of whom have done some work on the project.

Generating fragments (fragfinder)
The fragfinder program produces short gapless alignments of the target sequence to template sequences, trying to select the best k (say 6) fragments for each position in the target. There are many variants possible on the selection algorithm, and we could try several different multi-track HMMs to search the template library for fragments.

The fragfinder program has not had extensive testing---we selected an algorithm and HMM based on intuition and minuscule tests on one target sequence. The undertaker program can be used to evaluate the fragments as fragments (looking at the histogram of RMSD, or RMSD normalized by length, for example), and we can come up with various figures of merit for fragment libraries, allowing us to compare different fragment-finding methods.
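Computing an RMSD-based figure of merit for a fragment is straightforward once coordinates are in hand; a minimal Kabsch-superposition sketch in NumPy (extracting the fragment's CA coordinates and the corresponding native coordinates is not shown):

    import numpy as np

    def superimposed_rmsd(P, Q):
        """RMSD between two (N,3) coordinate arrays after optimal rigid-body
        superposition (Kabsch algorithm)."""
        P = np.asarray(P, float) - np.mean(P, axis=0)   # center both point sets
        Q = np.asarray(Q, float) - np.mean(Q, axis=0)
        U, S, Vt = np.linalg.svd(P.T @ Q)               # covariance matrix SVD
        d = np.sign(np.linalg.det(U @ Vt))              # guard against reflections
        R = (U * np.array([1.0, 1.0, d])) @ Vt          # rotation taking P onto Q
        return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))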

The fragfinder output is an A2M file, which is fairly compact and easily produced by the SAM software, but which does not include 3D information. Undertaker can read the a2m file and the necessary PDB template files and generate fragments by sidechain replacement, but does not currently have a way of storing these fragments in a file. Existing undertaker file formats could easily be adapted to save the fragments, speeding up multiple undertaker runs and making it easier to run undertaker on the kilocluster (by eliminating the need for access to the PDB library).

In addition to providing a format for native undertaker fragment libraries, it would be convenient to have I/O handling in undertaker for Rosetta-format fragment libraries. These do not contain as much information as undertaker fragment libraries (no sidechains), so outputting Rosetta libraries is easier than inputting them. Being able to output a Rosetta library would allow us to convert fragfinder results to a Rosetta library, which could then be compared with existing fragment-finding methods.

[Sol produced Rosetta fragment libraries and did a lot of work on filtering them in undertaker. We have not done substantial work on getting fragfinder to produce better libraries.]

Packing fragments (undertaker)
The undertaker program got its first real tryout in summer 2002 in the CASP5 experiment. We applied it uniformly to all targets, with mixed results. On the comparative modeling targets, the resulting predictions were worse than the automatic predictions by the server about 3/4 of the time. On the fold-recognition targets, undertaker helped about 2/3 of the time, and on the very difficult fold recognition and new fold targets (where, traditionally, our methods have made no headway), we did well enough to be invited to speak and to submit a paper. Still, there is an awful lot of work that still needs to be done on undertaker.

Here is a list of a few of the many things that came up over the summer:

More modern C++
The C++ used in undertaker is a subset of the language that avoids many of the newer features (since they were either not available or not reliable when I started creating the class library that undertaker uses). One new feature that would be nice to use (if it works robustly) is dynamic casts, as there are several places in the code where I do run-time type checking and casting with my own mechanism.
Print out and analyze distances
It would be a very small project to add a new command to accept a list of atoms and print out the distances between them. This command could be useful for extracting distance constraints from a model to be added to the score function, either to keep undertaker from pulling apart some good feature of a template, or to transfer information from a model that is not going to be used as a template. [SIMPLE VERSION OF THIS DONE]

A somewhat more ambitious project would be to make histograms for several different types of distance measurements, that could later be used for generating cost functions. For example, it would be useful to have histograms of the distances between corresponding CB atoms on adjacent beta strands. Is it different for parallel and anti-parallel strands? What about for CA atoms? These statistics would be useful for creating constraints for pairing up beta strands when we don't know the parity of the hydrogen bonding. It would also be good to do the histograms conditioned on separation along the backbone.

[STATISTICS FOR BETA-BONDING PARTNERS DONE 6-Dec-2003:
I added a routine to undertaker to gather some statistics on beta bonding partners for the dunbrack-1332 training set. I still have to digest the results a bit, but here are some preliminary figures:

CA-CA distances (Angstroms):

    type                       mean   std dev   approx. peak
    all beta partners          4.97   0.39      5.09
    parallel                   4.84   0.26      4.83
    anti-parallel              5.03   0.43      5.13
    anti-parallel unbonded     4.43   0.24      4.39
    anti-parallel bonded       5.25   0.21      5.13

CB-CB distances (Angstroms):

    type                       mean   std dev   approx. peak
    all beta partners          5.16   0.71      5.19
    parallel                   5.13   0.65      4.85
    anti-parallel              5.17   0.73      5.19
    anti-parallel unbonded     5.58   0.65      5.79
    anti-parallel bonded       5.02   0.70      4.43

Essentially all the 24135 beta-pair residues have CB-CB distances less than 8 and 99% have CB-CB distances less than 7. The common use of 8 Angstrom CB-CB for "contacts" may be a bit loose for identifying neighbors in beta sheets, but perhaps it is useful in other contexts (distance between sheets in a sandwich?).

I used a very crude way of finding beta-bonding partners (just looking for a pair of H-bonds), which probably increases the noise of these estimates.

The CA distances have much less variance than the CB distances, as you would expect for a structure defined by H-bonds on the N and O atoms. The CA spacing depends on whether the residues are parallel or anti-parallel, and whether or not the residues are bonded. This is also expected (build some models with the Darling model kits if this is not immediately obvious to you).

The CB spacing is less dependent on the relationship between the residues, but still more dependent than I had expected, with the bonded/unbonded distinction still making a 0.5 Angstrom difference.

I'm a bit worried about the distributions where the peak is so far from the mean (CB parallel and CB antiparallel bonded). I'll have to check to see whether these are artifacts of my histograms, inclusion of non-beta-sheet bonding pairs, bad H-bond definition, or some other phenomenon that I should be conditioning on.

I also don't know what distribution the data best fits. Probably not a normal distribution (distances are positive), but possibly a gamma or log-normal.]
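For comparison, here is a rough sketch of an even cruder way to gather CB-CB statistics for strand residues, using DSSP strand labels and a distance cutoff instead of H-bond partner detection (Biopython; it needs the dssp executable, skips glycines since they have no CB, and would be looped over the dunbrack-1332 set):

    import numpy as np
    from Bio.PDB import PDBParser
    from Bio.PDB.DSSP import DSSP

    def strand_cb_distances(pdb_file, cutoff=8.0, min_sep=3):
        """CB-CB distances between strand residues ('E' in DSSP) that are
        within cutoff Angstroms and at least min_sep apart along the chain.
        Much cruder than finding actual H-bonded partners, so noisier."""
        model = PDBParser(QUIET=True).get_structure("x", pdb_file)[0]
        dssp = DSSP(model, pdb_file)          # requires the dssp executable
        cb = []                               # (chain id, resseq, CB coordinates)
        for key in dssp.keys():
            chain_id, res_id = key
            if dssp[key][2] != "E":           # keep strand residues only
                continue
            residue = model[chain_id][res_id]
            if "CB" in residue:               # glycine has no CB
                cb.append((chain_id, res_id[1], residue["CB"].coord))
        dists = []
        for i in range(len(cb)):
            for j in range(i + 1, len(cb)):
                if cb[i][0] == cb[j][0] and abs(cb[i][1] - cb[j][1]) < min_sep:
                    continue                  # skip near-neighbors along the chain
                d = np.linalg.norm(cb[i][2] - cb[j][2])
                if d <= cutoff:
                    dists.append(d)
        return np.array(dists)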

Manual moves
We did a lot of chopping up proteins and trying to force them into weird conformations this summer. It might be nice to replace all the crude perl scripts we wrote with a simple "move" command that would take a range of residues and some spec for where they should be moved to. It would also be nice to get a copy of "protoshop" from UC Davis, since it seems to have fairly nice manual-move capabilities (though the inverse kinematics it uses to do the moves results in very unlikely conformations---I might prefer gaps to such unlikely conformations). http://graphics.cs.ucdavis.edu/~okreylos/ResDev/ProtoShop has more info on ProtoShop. [I believe that David Bernick has installed ProtoShop and is trying to get it to work consistently.]
Interactive hand tweaking
There are lots of times I would like to be able to apply a specific operator (say, break reduction for a particular gap with fragments inserted in a particular location), see the results using Rasmol, apply another operator, and so on.
Good quality measures
We'd like a robust measure of how bad a prediction is. RMSD alone is not a good measure, since the "dot model" having all atoms at (0,0,0) often does fairly well by that measure. Measures like "GDT" which count how many (or what fraction) of the residues are within a given error distance when superimposed on the correct structure can also suffer from overly rewarding the dot model. It might be worth looking at some "contact" measures that reward conformations for bringing the right residues close together and keeping apart residues that are not close in the real structure. It would be good to come up with a cost measure that is not too sensitive to the length of the protein or to one badly folded piece.
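One simple contact-based measure along these lines is the fraction of native contacts recovered, with a penalty for spurious contacts so that collapsed models gain little; a sketch (the 8 Angstrom CB-CB cutoff and the sequence-separation filter are conventional choices here, not tested ones):

    import numpy as np

    def contact_set(coords, cutoff=8.0, min_sep=3):
        """Set of residue pairs (i,j) whose CB (or CA) atoms are within cutoff
        Angstroms, ignoring pairs closer than min_sep along the chain.
        coords: (N,3) array of coordinates."""
        coords = np.asarray(coords, float)
        contacts = set()
        for i in range(len(coords)):
            for j in range(i + min_sep, len(coords)):
                if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                    contacts.add((i, j))
        return contacts

    def contact_score(native_coords, model_coords, cutoff=8.0):
        """Fraction of native contacts present in the model, minus the fraction
        of model contacts absent from the native structure.  The penalty term
        keeps collapsed models (like the dot model) from scoring well."""
        native = contact_set(native_coords, cutoff)
        model = contact_set(model_coords, cutoff)
        if not native:
            return 0.0
        recovered = len(native & model) / len(native)
        spurious = len(model - native) / max(len(model), 1)
        return recovered - spurious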
Clean up cost functions to handle incomplete conformations
When reading in true, but incomplete, conformations (the usual case for PDB structures) the cost functions often report the conformations as terrible, since the missing atoms are all at location (0,0,0). The cost functions need to be modified to be aware of the new "contains_atom()" member of the Conformation class.
Cost function calibration
We have a lot of decoys generated for CASP5 and CASP6, and now have correct structures for most of them. We can use this set, or a test set generated by the Baker group, to try to adjust the weights of the various components of the cost function. Probably we should do a linear regression to fit the cost to whatever badness measure we choose, though we may have to figure out a length scaling to be able to do the fitting across several different targets.

I did some preliminary work on this, and (as expected) linear regression works terribly when the cost functions are highly correlated. We'll probably need to do some dimension reduction first (perhaps with principal components analysis?) before linear regression stands a chance of working.
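A sketch of the dimension-reduction-then-regression idea (scikit-learn here for brevity; ridge regression would be another way to cope with the correlated components, and the choice of badness measure and length scaling is the real open question):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    def fit_cost_weights(component_costs, badness, n_components=5):
        """component_costs: (n_decoys, n_cost_terms) matrix of the individual
        cost-function components evaluated on each decoy.
        badness: (n_decoys,) vector of the chosen quality measure.
        Fits badness from the costs in a reduced space, then maps the fitted
        weights back to the original cost components."""
        X = np.asarray(component_costs, float)
        y = np.asarray(badness, float)
        pca = PCA(n_components=n_components)
        Z = pca.fit_transform(X)                 # decorrelated combinations
        reg = LinearRegression().fit(Z, y)
        # express the fit as weights on the original cost components
        # (the intercept is relative to mean-centered costs)
        weights = pca.components_.T @ reg.coef_
        return weights, reg.intercept_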

Use NMR information
We already have distance constraints implemented---it would be good to provide an input format that accepts the standard NMR constraint/restraint files (many of which are available from BMRB). We should also try to interpret other information, such as residual dipolar couplings and chemical shift information, as this has been fairly successful in Rosetta.

[Oscar Hur has started writing scripts to convert NMR restraint files into undertaker constraints. There is some loss of information, so we need to test to see how much is lost.]

Save and restore the SpecificFragment Library and the AlignmentLibrary
A lot of time is wasted reading template PDB files and recomputing sidechain replacement for fragment libraries and alignments on each run. Saving and restoring them would be useful, particularly for reducing the number of files needed for an undertaker run---we'd like to be able to capture all the files needed and distribute them to the kilocluster for doing many parallel optimizations. Saving the specific fragment library would also make it feasible to put some more effort into creating it---doing sidechain optimization to remove steric conflicts, for example.
Predicted burial
Rachel Karchin has tested neural nets to predict several of the burial functions. We could incorporate predicted burial into the cost functions, instead of just the generic burial histogram by residue type. An example of this sort of cost function has already been implemented for the "alpha" angle predictions.
Predicted "alpha"
Although we have an implementation for using neural-net-predicted alpha values in the cost function for undertaker, there are several "tuning" adjustments that could be made---we have not done any testing to see what variant of the algorithm provides the best cost function.

What we currently do is convert the 11-dimensional neural-network prediction vector into a nearly continuous function of the alpha angle (360 1-degree bins). This involves combining the neural-network prediction with the histograms for individual amino acids. The current algorithm starts with the histograms for alpha conditioned on the residue and conditioned on the next residue, and takes their product to get an initial continuous curve; each of the 11 sections is then scaled to get the appropriate probability according to the neural-net prediction, and the resulting curve is smoothed to remove the discontinuities at the edges of the sections. The scaling and smoothing are repeated several times.
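A sketch of the rescale-and-smooth loop (not the actual predict-2nd/undertaker code), mostly to make the knobs explicit; the section boundaries and the construction of the initial curve from residue-conditioned histograms are assumed to be given, and the adjustable parameters here are exactly the ones listed below:

    import numpy as np

    def alpha_density(nn_probs, section_edges, initial_curve,
                      smooth_width=5, iterations=3):
        """Turn an 11-bin neural-net prediction into a 360-bin density over alpha.
        nn_probs: length-11 vector of section probabilities from the neural net.
        section_edges: length-12 list of bin indices (0..360) delimiting the sections.
        initial_curve: length-360 nonnegative starting curve (e.g. product of
            residue-conditioned histograms)."""
        curve = np.asarray(initial_curve, float).copy()
        kernel = np.ones(smooth_width) / smooth_width
        for _ in range(iterations):
            # rescale each section so it carries the neural net's probability mass
            for k in range(11):
                lo, hi = section_edges[k], section_edges[k + 1]
                mass = curve[lo:hi].sum()
                if mass > 0:
                    curve[lo:hi] *= nn_probs[k] / mass
            # smooth circularly to soften the discontinuities at section edges
            padded = np.concatenate([curve[-smooth_width:], curve, curve[:smooth_width]])
            curve = np.convolve(padded, kernel, mode="same")[smooth_width:-smooth_width]
        return curve / curve.sum()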

Some parameters to play with include

  • What initial curve to start with (flat, histogram based on all amino acids, histogram conditioned on current amino acid, histogram conditioned on next amino acid, product of histograms, average of histograms, ...).
  • How much smoothing to do at each step. This probably depends on the initial curve chosen, since smoothing tends to spread out peaks. If the initial curve has peaks that are too sharp, then more smoothing would improve it, but if it has broad peaks, smoothing may make things worse.
  • How many iterations of rescaling and smoothing to do. Again, more smoothing flattens peaks more, and lots of iterations of scaling and smoothing will tend to move the peaks to the centers of the sections.
The way to tune the algorithm is to use a large training set with neural-net predictions and try to minimize the total cost (negative log probability) of the actual alpha values.

[Rocky Choi did some work on this.]

Apply secondary structure
It might be good to include secondary structure in guiding the conformation search, not just the score functions. For example, we could
  • Look for fragments that match peaks in predicted alpha values.
  • Label fragments with DSSP (or other) labels and look for matches to strongly predicted regions.
  • Add length constraints for helical or strand regions (say, one or two constraints for every two turns of helix, and 4-residue CB constraints for strands?); see the sketch after this list.
  • When breaks are inserted (say, by jiggle or opt subtree), make them more likely to be inserted in regions which have a lower probability of being in a helix or strand. Possibly deliberately insert a break in each such region to act as a rapid hinge.
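As a sketch of the length-constraint idea (third bullet), here is one way to turn a predicted secondary-structure string into rough CA-CA distance constraints; the ideal distances and tolerance are guesses that would need checking against real structures, and the CB constraints for strands suggested above would be handled analogously:

    def ss_distance_constraints(ss_string, helix_sep=7, helix_dist=10.5,
                                strand_sep=2, strand_dist=6.7, tol=1.0):
        """Turn a predicted secondary-structure string (one of H/E/C per residue)
        into (i, j, min_dist, max_dist) CA-CA constraints: about one constraint
        per two turns of helix (i to i+7, ~10.5 A along the axis) and one per
        strand residue pair (i to i+2, ~6.7 A for an extended chain)."""
        constraints = []
        i = 0
        while i < len(ss_string):
            if ss_string[i:i + helix_sep + 1] == "H" * (helix_sep + 1):
                constraints.append((i, i + helix_sep,
                                    helix_dist - tol, helix_dist + tol))
                i += helix_sep      # skip ahead: ~one constraint per two turns
            elif ss_string[i:i + strand_sep + 1] == "E" * (strand_sep + 1):
                constraints.append((i, i + strand_sep,
                                    strand_dist - tol, strand_dist + tol))
                i += strand_sep
            else:
                i += 1
        return constraints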
New conformation-change operators
We probably want to add some more conformation-change operators to the suite used by the genetic algorithm---for example, it might be good to have a crossover function that allows two breaks, and puts together the results from conformations A and B as an ABA chimera. One possible implementation would be to make a copy of the central section of Conformation B as a new AlignedFragments object, then apply this to Conformation A.
More focus on loop regions
It would be good to have a way to get undertaker to focus on regions that are having trouble getting a good conformation. Possibilities include freezing a core region, so that no operators are applied there, increasing the probability of applying fragments near breaks, applying TwoFragment to residues that are spatially near each other, but not close along the chain (possible hinges), ... Instead of a complete freeze on the core, we might want to keep confidence values with alignments and specific fragments, so that we frequently re-apply alignments or fragments that have high confidence, keeping the core from drifting away from a good initial conformation.
Jiggling without breaking
We often want to make small moves to improve packing on an almost finished model, without breaking the chain and incurring a break penalty. One way to do this would be to use the "crankshaft" moves developed by Baker's group. A crankshaft move takes a pair of backbone bonds that are nearly colinear and rotates the region between them. If the bonds are parallel but not colinear, then a shift in position occurs. The simplest example is rotating a single peptide plane without moving the ones much on either side. By pairing two peptide planes that are almost colinear, one can shift the atoms between the two planes without rotating them. To do a "crankshaft" move between residues A and B, add theta to psi(A-1), -theta to phi(A), -theta to psi(B-1), theta to phi(B).
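The bookkeeping in that last sentence is easy to express on arrays of backbone dihedrals; a minimal sketch (angles in degrees), leaving aside how to pick residues A and B with nearly colinear bonds:

    def crankshaft_move(phi, psi, a, b, theta):
        """Apply a crankshaft rotation by theta degrees between residues a and b,
        rotating the segment between the two (nearly colinear) axes while leaving
        the rest of the chain fixed:
            psi[a-1] += theta;  phi[a] -= theta;  psi[b-1] -= theta;  phi[b] += theta
        phi, psi: lists of backbone dihedrals indexed by residue."""
        psi[a - 1] += theta
        phi[a] -= theta
        psi[b - 1] -= theta
        phi[b] += theta
        return phi, psi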
Rewrite ApplyFragments command
The old ApplyFragments command takes the old slicer format for specific fragments, which is no longer used. Change to use the name of an alignment---this may require a NameToPtr table for the AlignmentLibrary.
MERGE report_cost_to_log_and_rdb (OptCommands.cc) and report_score_to_log (ConformationPool.h)
Rotamer optimization
The current version of undertaker has 3 conformation-change operators that just change sidechains (OneRotamer, ClashingRotamer, and ClusteredRotamer). It would be interesting to see how well undertaker optimizes sidechains compared to scwrl, for different cost functions [particularly whether the hbond_geom cost function helps]. This can be done with no modifications to undertaker, just running undertaker scripts and gathering statistics.

A possible approach is to read a pdb file, randomize the rotamers, optimize with undertaker, reoptimize with scwrl, then reoptimize with undertaker, reporting the three costs and all-atom rmsd for each of the conformations. One could also gather all-atom rmsd for a huge number of sidechain conformations and try to do linear regression to set weights for the cost function, but this is likely to have problems due to correlation between the components.

Other undertaker tasks
This list of things to do to undertaker is far from complete. There is also a ToDo list in /projects/compbio/experiments/undertaker, and there are various "BUG" notations scattered through the code in /projects/compbio/programs/undertaker
Genome-wide prediction
I have done a genome-wide fold-recognition prediction of S. cerevisiae proteins http://www.soe.ucsc.edu/research/compbio/yeast-protein-predictions/, but there are a lot of little things to do to clean up the presentation (like providing indexing of the results by something other than just the accession code). Fan Hsu will be working on this, in preparation for similar tasks on mammalian genomes.

Some of the search capabilities that should be added include:

Perhaps a more important task is to figure out a way to handle the large, multidomain proteins better, as finding a template for just one domain doesn't really solve the problem. Currently, the web pages don't even report which sections of the protein we have made predictions for!

The fairly small yeast genome took weeks to finish on the small cluster---to do a bigger genome, such as the human genome, we'd have to "clusterize" the method to run on the kilocluster without hammering the file servers. Fan Hsu has started work on making the fold recognition method work on the kilocluster. I don't think there is any point to trying to apply undertaker genome-wide right now---it requires far too much hand tweaking to get even a crude new-fold prediction out of undertaker.

It may be worthwhile to do some other small genomes, such as the archaeal genomes that Todd Lowe is building DNA chips for.

Individual proteins
There are some proteins being studied at UCSC for which it would be useful to get structure predictions. I started work on a few, but have not made much progress, because CASP5 interfered. I suspect that these are difficult-to-predict proteins, and that new-fold methods like undertaker and Rosetta will need to be applied.

For example, I've done some predictions of chemotaxis proteins from Helicobacter pylori for Karen Ottemann.

Protein design
I'd love to get involved in protein design, but the current tools I have (neural nets, HMMs, undertaker, ...) do not seem well suited for the task. I've been thinking about ways that I could try to use them, the way Rosetta gets used for sequence optimization. Unfortunately, I don't think I can use the same tricks that Rosetta does, because undertaker is structured around prediction for a fixed sequence---it would require some rather large changes to the code to fix the conformation and let the sequence vary.

Another approach is to take several of the local-structure-predicting neural nets, fix their output at the desired local structures, and then use backpropagation to optimize the inputs. One could start with an initial random sequence, optimize using the neural nets to get an input profile, then sample from that profile and reoptimize. This could be repeated to get a large sample of possible sequences that should have good local properties, which could then be fed into the Rosetta optimization program to try to find sequences with good tertiary properties. To do this right, one should probably do the back-propagation simultaneously in neural nets predicting several different local properties. This would require many changes to predict-2nd, but probably not as massive as the changes to undertaker needed for protein design.

More random thoughts on using backpropagation. If I changed my neural nets over to using guide-sequence+profile input, then I could use backpropagation to change the profile, and sample from the profile to get a sequence. Iterate that many times from a random initial profile (generated from a Dirichlet mixture prior) and take the resulting sequence as a possible prediction. There are lots of variants that could be tried. For example, one could resample only the sequence positions where the backpropagated derivative for the current residue is negative, rather than resampling all positions. Instead of sampling from the profile, one could take the most probable residue of the profile. After generating a few hundred sequences in this way from random starting points, one could make a profile from the sequences and score each sequence with that profile. Further tweaking could take just the best 100 or so sequences and the profile generated from them; each could be used with that profile as a starting point for the backpropagation-and-resample process. A tertiary scoring system (such as Rosetta's) could be used to pick out the best sequences from the list, and the resulting sequences and profile used to do more sampling.

Another approach to using neural nets would be to build a neural net whose input was many structural properties and whose output was a sequence. Such a neural net could be trained in the same manner as current nets for prediction (with inputs and outputs swapped), and would provide a probability distribution for amino acids at each position. This would require fewer changes to the neural net code (mainly providing an input layer that accepts several alphabets, instead of just one alphabet), and would more directly provide probability distributions to sample sequences from. This method is less likely to provide a variety of very different solutions to a protein design problem.

Using protein redesign to improve fold-recognition
Some researchers have tried using protein design with a known backbone to try to get a large sample of sequences that could reasonably adopt the backbone conformation, in order to create profiles for searching in sequence space for homologs. This method could be used to improve the template models in our fold-recognition process, for those templates that have few homologs found by iterative search. It is unclear whether the design methods do a good enough job to actually improve the profiles. Also, there are small changes in structure between homologs (especially insertions and deletions), so the design method may have difficulty generating appropriate sequences to generalize in the right direction.

A more promising use for redesign of a known structure is to predict which residues are conserved for functional reasons rather than structural ones. David Bernick is investigating this with Carol Rohl, and it looks like there is a reasonable application for my mutual-information code here.


