Projects list: protein structure prediction

(Last Update: 16:52 PST 11 January 2006 )

This is a list of projects that need to be done on various projects for protein structure prediction. Note: there is no requirement that the project you do come from this list—it is merely a set of projects that I am interested in. If you can come up with your own project and convince a faculty member to supervise it, that can be even better. New collaborations with biologists and biochemists are particularly welcome.

Although this list is now "complete", I may add to it from time to time as new thoughts occur to me. There are also always lots of code maintenance and code development projects that don't make it onto this list. (For example, undertaker needs a lot of modernizing of its c++ code, and predict-2nd needs to be cleaned up so that it can be released as open-source code.)

I've broken the list up into different categories:

Using existing tools

Modifying or creating new tools


Here is an alternative classification:

Web server infrastructure

Local Structure prediction

Multiple Alignment and Target-template alignment

Fold recognition and template selection

Tertiary structure prediction

Structure analysis or function prediction

Designing proteins


Using Existing Tools

Projects in this section are intended for students whose programming skills are somewhat limited, though some script writing will probably be needed. They fall into two categories: doing structure (and maybe function) predictions for a number of proteins or testing existing methods (tweaking parameters, for example).

The prediction tasks generally involve going through the prediction process for a number of proteins, looking for useful information to guide wet-lab research. Some will require writing short scripts, and all will require looking at and interpreting results from a number of different tools. Note: all students will have a homework assignment that requires doing a prediction for one or two proteins. Projects in this category will require more substantial work than that homework, either by using experimental evidence to guide the prediction or by doing predictions of many proteins.

Predicting specific proteins

Chemotaxis proteins

Two of our environmental toxicology faculty are interested in chemotaxis proteins: Karen Ottemann in those for Helicobacter pylori and Fitnat Yldiz in those for Vibrio cholerae. The main interest is in the proteins that include the methyl-accepting domain (MAD) that is a highly conserved signaling domain. These are the proteins that act as the chemosensors in the first stage of the signaling pathway.

Since many (but not quite all) the chemosensors are transmembrane proteins, which we cannot model in their entirety, and we are not not interested in modeling the MAD domain, the first task is to select the domain of the protein (generally the extracellular domain) for which we want to do structure prediction. Generally this means finding transmembrane helices (with TMHMM, for example) and the MAD domain, and doing prediction on just the interesting subdomain.

Structure predictions were done for the few MAD-containing proteins of H. pylori a couple of years ago, but could be redone, to see if any of the newer files in PDB have provided better templates to work from. This is not likely to lead to any big improvements, unless there has been a homolog solved.

V. cholerae has a lot of MAD-containing proteins, and I am considering parceling them out to the class as a homework assignment. It might be good to have one person collect all the work that is done, clean it up, improve on the poorly done work, generate hypotheses about possible ligands, and prepare a detailed report for Fitnat Yldiz. Here is Fitnat's description of the project from spring 2005:

Project 1: Identification of the ligand binding domain(s) of the methyl accepting chemotaxis proteins in Vibrio cholerae (Vc)
Flagellar motility in Vc is controlled by the chemotaxis system, allowing the microbe to move in response to environmental cues. Chemotaxis is regulated by chemoreceptors that sense environmental cues, and transduce this ligand-binding information to regulate a signal transduction cascade that affects flagellar rotation. The core signal transduction proteins consist of CheW, the receptor-kinase coupling protein, CheA, the kinase and CheY, the response regulator that interacts with the flagellar motor.

This proposal seeks to understand the ligand(s) bound by methyl-accepting chemotaxis proteins in Vc. The list of these proteins and their VC gene numbers are shown below. Detailed information on each gene including DNA and amino acid sequences can be obtained from: http://www.tigr.org/tigr-scripts/CMR2/name_search_test.spl. Vibrio cholerae El Tor N16961
Locus Gene Symbol Common Name
VC0098 methyl-accepting chemotaxis protein
VC0216 methyl-accepting chemotaxis protein
VC0282 methyl-accepting chemotaxis protein
VC0449 methyl-accepting chemotaxis protein
VC0512 methyl-accepting chemotaxis protein
VC0514 methyl-accepting chemotaxis protein
VC1248 methyl-accepting chemotaxis protein
VC1289 methyl-accepting chemotaxis protein
VC1298 methyl-accepting chemotaxis protein
VC1313 methyl-accepting chemotaxis protein
VC1394 methyl-accepting chemotaxis protein
VC1403 methyl-accepting chemotaxis protein
VC1405 methyl-accepting chemotaxis protein
VC1406 methyl-accepting chemotaxis protein
VC1413 methyl-accepting chemotaxis protein
VC1535 methyl-accepting chemotaxis protein
VC1643 methyl-accepting chemotaxis protein
VC1859 methyl-accepting chemotaxis protein
VC1868 methyl-accepting chemotaxis protein
VC1898 methyl-accepting chemotaxis protein
VC1967 methyl-accepting chemotaxis protein
VC2161 methyl-accepting chemotaxis protein
VC2439 methyl-accepting chemotaxis protein
VCA0008 methyl-accepting chemotaxis protein
VCA0031 methyl-accepting chemotaxis protein
VCA0068 methyl-accepting chemotaxis protein
VCA0176 methyl-accepting chemotaxis protein
VCA0268 methyl-accepting chemotaxis protein
VCA0658 methyl-accepting chemotaxis protein
VCA0663 methyl-accepting chemotaxis protein
VCA0773 methyl-accepting chemotaxis protein
VCA0864 methyl-accepting chemotaxis protein
VCA0906 methyl-accepting chemotaxis protein
VCA0923 methyl-accepting chemotaxis protein
VCA0974 methyl-accepting chemotaxis protein
VCA0979 methyl-accepting chemotaxis protein
VCA0988 methyl-accepting chemotaxis protein
VCA1031 methyl-accepting chemotaxis protein, authentic frameshift
VCA1034 methyl-accepting chemotaxis protein
VCA1056 methyl-accepting chemotaxis protein
VCA1069 methyl-accepting chemotaxis protein
VCA1088 methyl-accepting chemotaxis protein
VCA1092 methyl-accepting chemotaxis protein

Update yeast proteins

Since October 2002, we have made predictions for all the proteins of the yeast genome, with updates to some of the predictions every few months. In May 2005, a student taking BME 220 identified 215 proteins (most short and annotated as "dubious") in SGD that had not had predictions made for them and another 77 whose sequences in SGD differed from the ones that we had predicted. These sequences need to be checked to make sure that they are not just renamings of existing sequences in our set.

The 215 proteins need to be added to the set and predictions made (this is fairly trivial). The 77 proteins need to be checked to see whether the sequence change is an improvement. (In at least one case, we replaced a protein that was non-functional in the reference strain with an undamaged gene from a different yeast strain.) Where the sequence is an improvement over the one we predicted for, the README file should be updated to reflect this, the sequence replaced, and the prediction redone.

A new comparison should be done between the sequences we have predictions for and the sequences in SGD to find other discrepancies—if possible this should be automated, so that it can be checked periodically without much effort.

Modifying nicotinic acetylcholine receptor

One design project that I was working on fairly intensely for a while, but have put aside is a redesign of nicotinic acetylcholine receptors. The receptors form a pentamer, but no one has yet managed to crystallize the extracellular domain, as the pentamer is not formed when the transmembrane portion is removed. The idea is to use undertaker to build a homology model, then use Rosetta to redesign some of the residues to increase stability of the pentamer. There is a soluble protein that is homologous to the extracellular domain that *does* naturally form the pentamer (see PDB file 1i9bA). I have already done a lot of modeling work on the protein, and may have a model with a good enough backbone to do sidechain replacement. (See http://www.soe.ucsc.edu/~karplus/ach/ach7_short_1based/three-template/summary.html paying particular attention to the README file.)

One attractive aspect of this protein design is that we do have an experimentalist who would be interested in expressing and characterizing the protein, if we can come up with a fairly convincing redesign that does not require too many mutations. (Note: the cost of mutations is not the number of residues changed, but the number of reactions needed to replace parts of the gene. Changing five adjacent residues has about the same cost as replacing one residue, and is much cheaper than replacing three widely scattered residues.)

Analysis of HSP-60 proteins

Jonathan Trent has spent several years studying the HSP60 heat-shock proteins (chaperonins) in Sulfolobus shibatae. Structures are known for the protein, but there are still a lot of questions about the function and about details of how different variants of the protein achieve different quaternary structures. Some preliminary bioinformatic analysis of HSP60 proteins has been done in /projects/compbio/experiments/protein-predict/hsp60/

Here are some questions Jonathan is interested in:

Structure of intrinsically unfolded syncleins

Tony Fink suggested the following project:

The synucleins are intrinsically unfolded proteins but at least a-synuclein adopts a helical structure when bound to membranes. There is no information available on b-synuclein or g-synuclein. I think it would it be worth predicting some potential folded and partially folded structures—what do you think? The maximum length is 140 amino acids.

The PFAM synuclein HMM (pfam01387) is documented: Synuclein. There are three types of synucleins in humans, these are called alpha, beta and gamma. Alpha synuclein has been found mutated in families with autosomal dominant Parkinson's disease. A peptide of alpha synuclein has also been found in amyloid plaques in Alzheimer's patients. Alpha synuclein in humans (SYUA_HUMAN) apparently has two isoforms:

MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGK----------------------------EGYQDYEPEA
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
Beta synuclein (SYUB_HUMAN) is similar to alpha synuclein:
MDVFMKGLSM AKEGVVAAAE KTKQGVTEAA EKTKEGVLYV GSKTREGVVQ GVASVAEKTK
EQASHLGGAV FSGAGNIAAA TGLVKREEFP TDLKPEEVAQ EAAEEPLIEP LMEPEGESYE
DPPQEEYQEY EPEA
So is gamma synuclein (SYUG_HUMAN):
MDVFKKGFSI AKEGVVGAVE KTKQGVTEAA EKTKEGVMYV GAKTKENVVQ SVTSVAEKTK
EQANAVSEAV VSSVNTVATK TVEEAENIAV TSGVVRKEDL RPSAPQQEGV ASKEKEEVAE
EAQSGGD

Hyperphosphorylated proteins

Doug Kellog's lab has been working on two highly conserved proteins called Swe1 (YJL187C) and Mih1 (YMR036C). These are the budding yeast homologs of proteins called Wee1 and Cdc25 in most organisms, and they play key roles in the mechanisms that regulate entry into mitosis. Both of these proteins undergo dramatic hyperphosphorylation during the cell cycle. Swe1 is phosphorylated on over 30 sites at the N terminus by a bound kinase, and Dr. Kellog's hypothesis is that the N terminus is unstructured until it is phosphorylated, at which point it clamps down on the bound kinase (see Harvey et al, Cell 122, p. 407 for their recent work on Swe1).

They think something similar may be going on with Mih1, but this work is not yet published. Swe1 has a kinase domain and Mih1 has a phosphatase domain. They are curious what you would find if we put these proteins through our structure prediction programs.

Automatic predictions alread exist for these proteins in http://www.soe.ucsc.edu/~karplus/yeast/YJL1/YJL187C/summary.html and http://www.soe.ucsc.edu/~karplus/yeast/YMR0/YMR036C/summary.html , but these predictions do not use the latest methods and are not producing full-length models. The parts that are being modeled are the "easy" parts: the kinase domain and the phosphatase domain.

Splitting the proteins up into domains and and doing predictions for each part would be fairly straightforward, but getting good results for the N-terminal domains, for which there appear to be no good templates, could be quite difficult. This would be especially true if Kellog's conjecture is true and they are ntively unstructured until highly phosphorylated. I don't believe that any of the tools we use can handle phosphorylated proteins.

For such large ab-initio predictions, we would probably want to use both the SAM-T04/undertaker and the Rosetta protocols.

protein from C. elegans

Andrew Chisholm (biology professor) juset sent me the following message (11 Jan 2006):

We have just a new protein (from C elegans) that is highly unfamiliar. We are desperate for any bioinformatics insight as to what it could be. (It has come out of a suppressor screen). MAybe it would be an interesting project for an undergraduate.

Testing and tweaking parameters

Using unfinished genomes

For many targets, we get a lot of potential homologs just by searching NR and adding more sequences to the multiple alignments would produce only marginal improvements. Other sequences, though, appear to be ORFans, with no similar proteins in other species. In such cases, it may be worthwhile to look through unfinished genomes and even the trace archives at NCBI, to try to find evidence for related proteins that have not been annotated as proteins yet. Using tblastn may find appropriate hits, but we do not have enough experience using the noisy data in the trace archives to know whether it is worth including automatically or even just as a last-resort measure.

Combining local structure predictions

For the past few years, we have been using a script called RDBCombine to merge neural net predictions (possibly over different, but related alphabets) to get a consensus prediction. We have never tested this method to see if it does better than simply taking a neural net optimized for the output alphabet, nor have we tried tuning any parameters related to it.

RDBCombine can combine RDB-format prediction files for different alphabets. It translates the alphabet using the contingency tables output by compare-real, and weights them by the mutual information between the predicted alphabet and the desired alphabet, and does a weighted average of the translated predictions. It is not clear what the best way to do the combining is—simply averaging the probability vectors is probably not optimal, though it is easy. We may want to look at combining the log-odds scores (log (P_predicted/P_background)), for example.

We have a script to evaluate RDB files for Qn and bits saved (and we could add the SOV measure to the evaluations), so different methods for combining predictions can be easily tested with only fairly minor script writing.

New training methods for neural nets

Training neural nets is currently quite slow, and a student interested in machine learning might want to work on modifying predict-2nd to use a faster learning method than the current back propagation. This is not a high priority (we rarely retrain the networks), but might be a good project for someone more interested in machine learning than in bioinformatics.

A more serious problem for us is that training does not always converge to equally good results. Our current protocol for comparing the predictability of different alphabets is based on the test results for a single training run on each part of a three-way cross validation test. We need to develop a new training regime that more consistently produces good results—perhaps by starting from several random neural nets, training for a limited time, then taking the best of the set on the training data for further training. This training method may be possible with just new scripts in predict-2nd, or it may require slight modifications to the predict-2nd code.

Local structure prediction using HMMs

One approach to improving secondary structure that we have tried, and that may be worth improving on, is to use the fold-recognition target HMM to align the template set and gather statistics about what secondary structure codes align to each position. Since the target HMM can be a multi-track HMM using neural-net predictions, this method should be able to do at least as well as the neural-net methods. The method would be particularly valuable when there is a close homolog in the template library, as it would be able to "look up" the right answers. This makes testing more difficult, since it is easy to fool oneself about how well the method is working, but would be very useful in a tool used by bench biologists, as the easy predictions would be right (not necessarily the case for neural nets). It is used in the SAM-T99 server, I believe, but not the SAM-T02 server. We should test it properly and use it in the SAM-T05 server.

Mark Diekhans implemented this several years ago in SAM, but I am not sure of the details of his implementation. In his tests, the method improved Q3 scores slightly, but made the information gain worse than the neural nets used to create the HMM tracks. Josue Samayoa did more careful tests in Fall 05, and found that the current implementation does not make the results of the neural net prediction substantially better or worse. Somewhat surprisingly, including the true sequence in the library of structures that were scored by the HMM did not change the result much.

Several things could be done to improve this method:

I believe that Josue may wish to work on this project for another quarter, and I'm not sure whether there is enough work left for two people. If you are interested in this project, please consult with Josue.

Analyzing alignment results

We have collected a mound of data about the quality of alignments for different track weightings using multi-track HMMs to align structurally similar proteins. We have used the data to select track weights that are generally good, but I suspect that we could do much more with the data. For example, I believe that the best track weighting is dependent on how similar the two sequences are. Very distantly related sequences may need more weight put on the local structure tracks, while closely related sequences should have more weight on the amino-acid track. It would be interesting to try to come up with a method for measuring sequence similarity and using it to modify track weights accordingly.

Testing profile-profile alignment

For the past few CASPs, some of the best fold recognition and alignment has come from profile-profile alignment methods. Richard Hughey has added profile-profile alignment to SAM, but the code has not been thoroughly tested. We have two test suites: one for alignment and one for fold recognition. It would be good to test SAM's profile-profile methods using these suites: comparing the profile-profile method with the more traditional sequence-profile methods. We do have the ability to use multi-track HMMs for the profiles, so there is a lot to explore: which tracks to use, track weights, which profile-profile scoring algorithm to use, ... .

Mixed sequence-profile alignments in template library

We routinely use multi-track HMMs built from a target sequence to search a template database of sequences, but we currently use only single-track (amino-acid-only) template HMMs to score the target sequence. Although we plan to explore full profile-profile scoring, it might also be worthwhile to do more modest multitrack sequence-HMM scoring schemes with template-based HMMs.

There are two approaches that I see:

Testing fragfinder

The fragfinder program produces short gapless alignments of the target sequence to template sequences, trying to select the best k (say 20) fragments for each position in the target.

The fragfinder program has not had extensive testing—we selected an algorithm and HMM based on intuition and minuscule tests on one target sequence. The undertaker program can be used to evaluate the fragments as fragments (looking at the histogram of RMSD, or RMSD normalized by length, for example), and we can come up with various figures of merit for fragment libraries, allowing us to compare different fragment finding methods.

Sol Katzman wrote routines for undertaker to filter fragfinder output and produce Rosetta fragment libraries. Some of his work provides more measures for fragment-library quality. It also opens up the possibility of comparing fragfinder-generated libraries with the libraries that Rosetta normally uses.

Evaluating quality measures

Undertaker includes several "cheating cost functions" that compare the current structure with a known correct structure. These cheating functions have two purposes: one is to test the conformation-change operators and stochastic search procedure, to see if optimization with a really good cost function converges to a correct structure; the other is to evaluate the results of different optimization methods, to see how well they work. The current set of cheating functions includes all-atom RMSD, RMSD_CA, GDT (the measure popular at CASP), smooth_GDT (a very similar measure with slightly less quantization noise), and a contact-based measure called "clens" developed by Tim Dreszer in Spring 2005.

Some of these measures are known to have problems, particularly for structures that are not very similar to the real structure. For example, RMSD and RMSD_CA are both very sensitive to the worst-done part of the structure, and interpretation of the result is very dependent on the length of the chain. GDT (and smooth_GDT) evaluates only how much of the chain is correct, giving no penalty for grossly wrong parts. All of these superposition-based measures tend to over-reward structures that have been compacted too much—they have no penalty for unreasonably dense proteins. In some cases, the point model (all atoms at (0,0,0)) scores reasonably well.

The clens function was devised to try to handle the problems of the other measures, be rewarding for correct contacts and penalizing for incorrect ones. In preliminary testing using models generated by undertaker, clens and GDT were very highly correlated with each other, indicating that they may be nearly interchangeable in their ability to recognize decent models. We'd like to do more extensive testing of "clens", using all the models available from previous runs of CASP. In particular, we'd like to see if we can find any outliers, where GDT and clens evaluate models quite differently. We could then examine such outliers, and see which metric does a better job of capturing the quality of the model.

It is also possible to modify the clens definition in undertaker, to parameterize the definition of contact or change the scoring function for missing and extra contacts.

Rotamer optimization

The current version of undertaker has 3 conformation-change operators that just change sidechains (OneRotamer, ClashingRotamer, and ClusteredRotamer). It would be interesting to see how well undertaker optimizes sidechains compared to scwrl (and Rosetta), for different cost functions [particularly whether the hbond_geom cost function helps]. This can be done with no modifications to undertaker, just running undertaker scripts and gathering statistics.

A possible approach is to read a pdb file, randomize the rotamers, optimize with undertaker with only the rotamer-changing operators turned on, reoptimize with scwrl (callable from inside undertaker), then reoptimize with undertaker, reporting the three costs and all-atom rmsd for each of the conformations. One could also gather all-atom rmsd for a huge number of sidechain conformations and try to do linear regression to set weights for the cost function, but this is likely to have serious problems due to correlation between the components of the cost function.

Tweaking undertaker's cost function

We have a lot of decoys generated for CASP5 and CASP6, and now have correct structures for most of them. We can use this set, or a test set generated by the Baker group, to try to adjust the weights of the various components of the cost function in undertaker. Possibly we should do a linear regression to fit the cost to whatever badness measure we choose, though we may have to figure out a length scaling to be able to do the fitting across several different targets.

I did some preliminary work on this, and (as expected) linear regression works terribly when the cost functions are highly correlated. We'll probably need to do some dimension reduction first (perhaps with principal components analysis?) before linear regression stands a chance of working.

A minor programming project that may need to be done before we can try tweaking the cost functions: we need to clean up cost functions to handle incomplete conformations. When reading in true, but incomplete, conformations (the usual case for PDB structures) the cost functions often report the conformations as terrible, since the missing atoms are all at location (0,0,0). The cost functions need to be modified to be aware of the "contains_atom()" member of the Conformation class. Some cost function components have already been fixed, but not all.

Protein design by back propagation

In spring 2005 Jes Frellsen and I made some modifications to predict-2nd so that it could be used for protein design. (I made further changes in summer 2005.) The idea is to take several of the local-structure-predicting neural nets, and fix their output at the desired local structures, then use back-propagation to optimize the inputs. One starts with an initial random sequence, optimize using the neural nets to get an input profile, then sample from that profile and reoptimize. This can be repeated to get a large sample of possible sequences that should have good local properties, which could then be fed into the Rosetta optimization program to try to find sequences with good tertiary properties.

More random thoughts on using back-propagation. There are lots of variants that could be tried in the sampling of sequences from the back-propagation. For example, one could resample the sequence points where the back-propagated derivative for the current residue is negative rather than resampling all points. Instead of sampling from the profile one could take the most probable residue of the profile. After generating a few hundred sequences in this way from random starting points, one could make a profile from the sequences, and score each sequence with that profile. Further tweaking could take just the best 100 or so sequences and the profile generated from them. Each could be used with that profile as a starting point for the back-propagation and resample process. A tertiary scoring system (such as Rosetta's) could be used to pick out the best sequences from the list, and use the resulting sequences and profile to do more sampling.

Another approach to using neural nets would be to build a neural net whose input was many structural properties and whose output was a sequence. Such a neural net could be trained in the same manner as current nets for prediction (with inputs and outputs swapped), and would provide a probability distribution for amino acids at each position. This would require a few changes to the neural net code (mainly providing an input layer that accepts several alphabets, instead of just one alphabet), and would more directly provide probability distributions to sample sequences from. This method is less likely to provide a variety of very different solutions to a protein design problem, though, so I have not pursued it.

One protein that it may be interesting to apply the back-propagation plus Rosetta approach to is the "miniAGRP" peptide, a small cysteine knot that Glen Millhauser's group is interested in. They would like to remove the disulfide bridges that currently hold the structure together, without greatly disrupting the structure. Certain of the residues are believed to be functionally important, and should not be touched in the redesign. Millhauser's group may be interested in synthesizing a design, if a better one can be found than the design that Craig Lowe did (using just Rosetta) last year. For more information, see miniAGRP paying particular attention to the README file and the design/ subdirectory.

In addition to that particular design, it would be good to experiment with various networks and local structure alphabets, so see which combinations result in the best designs. Since we do not have the time nor the facilities for fabricating and testing large numbers of proteins, we will have to rely mainly on computational methods. One simple test is to look at the percent sequence recovery: given just the backbone for a real protein, try to design amino acids sequences that would fold to that conformation. The closer the designs are to the real protein, the more convincing the design method. Such a study should be done on hundreds of different backbones from radically different structures.


Modifying or creating new tools

Projects in this section generally require some programming, database, or web-design skills.

SAM-T05 Web site

The SAM-T05 web site was created last spring and summer, but it is still in beta release form, and we'd like to make it public before CASP starts. Some things that need doing include the following:

Improve monthly update of yeast predictions

We have a set of predictions for all the proteins of yeast. I had planned to do monthly updates to the yeast predictions, but they have been happening a bit less often, because of how long the updates take to do. I wrote some simple criteria in scripts and a Makefile to decide which proteins were worth re-predicting, but the list ends up being too long, with several well-predicted proteins getting predictions triggered just because a very distantly related protein has had its structure recently solved. It would be good to improve the selection method, so that only those proteins whose predictions are likely to improve are targeted for re-prediction. This might have to be done rather carefully, as it is possible for a new structure to provide information about a domain that was not previously solved, even when other domains previously had excellent solutions—we would not want to miss the new domain in that case!

The yeast predictions use the SAM_T02 prediction machinery, and it might be a good idea to update the Make.main file to use the SAM_T04 or SAM_T05 predictions, though redoing all the predictions would take a lot of cpu time.

Perhaps a more important task is to figure out a way to handle the large, multidomain proteins better, as finding a template for just one domain doesn't really solve the problem. Currently, the web pages don't even report which sections of the protein we have made predictions for!

The fairly small yeast genome took weeks to finish on the small cluster—to do a bigger genome, such as the human genome, we'd have to "clusterize" the method to run on the kilocluster without hammering the file servers. Fan Hsu has started work on making the fold recognition method work on the kilocluster. I don't think there is any point to trying to apply undertaker genome-wide right now—it requires far too much hand tweaking to get even a crude new-fold prediction out of undertaker.

It may be worthwhile to do some other small genomes, such as the archeal genomes that Todd Lowe is building DNA chips for, or the bacterial genomes that Fitnat Yldiz and Karen Ottemann are interested in.

Index the yeast predictions

The yeast predictions are not used much, in part because there is no convenient web interface for accessing particular predictions. Ideally, we would like to be able to search the set of predictions in several ways: by name (and each sequence may have several names), by sequence, by predicted fold (SCOP) class, by confidence of prediction or fraction of protein confidently predicted, ... .

This would require setting up a database (either a real one or a crude one using perl scripts) of names and aliases, sequence data, predictions, ... . It would be good to be able to populate some parts of the database (such as names and aliases) from SGD, or to have the web script query SGD for translations.

Some of the search capabilities that should be added include:

Continuous burial prediction

Our burial and exposure alphabets are based on an underlying continuous (or, at least, finer-grain) value. It would be nice to have the neural net output parameters for a parameterized distribution (such as log-normal or gamma), rather than the discrete 7 or 11 states we currently use. The two values can be interpreted as the mean and standard deviation of the output distribution (though other parameterizations may be more convenient for some distributions). A weak prediction would have a large standard deviation.

This could be done by adding a combining layer to the outputs of a normal neural net. For each output of the neural net we associate two parameters (center and spread). The predicted mean would be the neural-net weighted sum of the center values, and the predicted standard deviation would be the standard deviation of a mixture distribution, with mixture coefficients from the neural net.

The center and spread parameters could initially be set by partitioning the outputs in the training data into roughly equal weight discrete classes (as is now done) and computing a mean and standard deviation for each class. They could be tweaked a bit by a few passes of EM to get the output described as a mixture. After a few epochs of training the neural net, the center and spread parameters could be trained by back propagation also.

We would need to compare the encoding cost of the actual burial value using this continuous method (which requires some care in the discretizing, since our underlying data is usually integer-valued) versus the encoding cost of the discrete alphabet method, which assumes a flat distribution within each of the discrete bins.

Recoding neural net inputs

Neural nets suffer from being basically uninterpretable. There are some ways we could attempt to glean some information from the neural nets, and perhaps use this information to make neural nets that train faster or work better. The basic idea is to look for recodings of the amino acids that the neural nets are learning, and then hardwire the more important recodings. The simplest notion of a recoding is the dot product of an amino-acid probability vector with a direction vector (a 20-dimensional vector with magnitude one). The weights of the first layer of each neural net we have trained can be normalized to produce a number of such direction vectors (78 vectors for each network in our most common network architecture).

The resulting vectors can be clustered (with the square of the dot product as the appropriate similarity measure) to determine the most common directions. Alternatively, Principal Components Analysis (PCA) could be done to reduce the dimensionality. We can attempt to identify the properties these directions represent by looking at the correlation with the indices in the AAIndex database.

We can also compute the dot products of the profile with the centers of the most populated clusters and use them as inputs to the neural net, in addition to or in place of the profile input. (Similarly, the factors that come from PCA can be used for recoding.)

The above residue-at-a-time analysis does not get at the neighborhood patterns that the neural nets learn. People have tried to determine some of the patterns by looking at Fourier transforms and wavelet transforms of numeric properties (mainly hydrophobicity). We could look at each hidden unit on the first neural layer and see if its weight matrix can be roughly matched by taking a single direction vector and a neighbor weight function. That is, decompose
sum_{a,-2<=j<2} w_{a,j} Prob(amino acid a in position i+j)
into
sum_{-2<=j<=2} f(j) sum_a dir(a) Prob(amino acid a in position i+j)
or w_{a,j} = f(j) dir(a).
If there are hidden units that can be approximately described this way, it may be worth providing input features that correspond to this neighborhood property also.

The neighborhood analysis may be a bit limited in our current networks, since we have gone to using rather short windows (only 3 wide) in the first layer, but the residue-at-a-time recoding analysis may be more revealing than in networks with wider input windows.

Another possible residue-at-a-time recoding scheme would be to replace the 1-hot encoding of 20 amino acids for the guide sequence with a binary encoding of amino acids. There are some simple greedy algorithms and random search optimization algorithms that could be applied to find 5 or 6 different partitions of the amino acids so that each of the weight vectors could be well approximated by adding just 5 or 6 weights to the appropriate subsets of the amino acids.

Entropy profiles

There was a poster at ISMB 04 (I-84 FASE: a new fold-recognition method through entropy profiles by Alejandro Sanchez-Flores and Lorenzo Segovia, both at UNAM in Mexico) that claimed to get excellent results using just the entropy information from the profile to do fold recognition—all other signals were thrown out and just a 4-level entropy alphabet used. In close questioning, I did not see an obvious flaw in their testing procedure, so we should try to implement this and see if their work is duplicable.

To avoid artifacts, we should use our own testing methods and our own computation of entropy. Note that entropy is not a property of the protein structure (unlike most of our alphabets), but a property of a multiple alignment. This means that we do not have all our usual tools for evaluating an alphabet (in particular, conservation and predictability are not meaningful concepts), so we are limited to alignment tests and fold-recognition tests.

Even if entropy profiles alone are not very useful, they could make an interesting extra track in profile-profile HMMs.

Converting RDB files to HMMs

We currently use a TCL script (2nd-rdb-to-sam-model) for converting the RDB-formatted output of our local structure predictions to SAM HMM format, for inclusion in multi-track HMMs. It would be better if the SAM code understood the RDB format directly, so that no translation script was needed. Indeed, if the SAM code could read the RDB format directly, it would be trivial to write the translation program in C, since SAM already has routines for outputting the HMM format. The 2nd-rdb-to-sam-model script is the only TCL script in the SAM package, and it would be good to eliminate it.

Although there is not much code needed for adding this functionality to the SAM package, a lot of code would need to be read in order to understand the internal data structures and parameter passing mechanisms of SAM.

Combining fold recognition results

Our fold-recognition methods treat every template in the library as independent of the rest, but we actually have many templates which contain similar folds. Various ways have been proposed for combining template scores to get a "fold" score, and a couple students have done preliminary tests of the "product-of-pvalues" method, and shown that it might help, but their projects were never taken to the point where the method could be incorporated into an automatic prediction method.
[ Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998;14(1):48-54.]
[ Bailey, Timothy L. and Grundy, William N. Classifying proteins by family using the product of correlated p-values. Proceedings of the Third international conference on computational molecular biology (RECOMB99), April 11-14, 1999. pp. 10-14. ]

One possible project is to incorporate the "product-of-pvalues" method into our automatic predictions. This requires designing a way to update the calibration parameters for the combining method automatically as new templates are added to the library, and converting the fold results back into selection of good templates. (This needn't be a weekly update, as the SCOP classification is only updated once or twice a year.)

Another possible project is to devise a different combining method. The "product-of-pvalues" method treats all templates in a fold class as equally informative and does not use information from templates in competing fold classes. One can envision combining methods such as logistic regression that could use the extra information to get better predictions. One danger is that multi-domain proteins correctly have multiple correct folds, so simple competition between fold classes is not quite the right model.

Handling multi-domain proteins

Our current methods look for target-template matches where there is a common subdomain, but do not take the next step of looking separately at the part of the target that does not match the template and doing a separate prediction for it. We could probably improve our fold-recognition capability if we did some domain splitting, either after having recognized a domain using the whole chain, or by trying domain splits based on the multiple alignment. In one run of all yeast ORFs, one of the predictions took 6 days, because the protein was 4910 residues long—this clearly would have been helped by breaking up into smaller pieces.

We have a script available for breaking up long proteins and doing separate predictions on each part (split-into-domains). The script can be used either manually to pull out a predicted subdomain for separate prediction or automatically to predict overlapping windows of some fixed length.

In CASP6, it was obvious that we had managed to put together some multi-domain predictions without explicit domain prediction. It is not clear to me that explicit domain prediction is as valuable as I thought in 2002. It is clear though, that a subdomain with many homologs can result in masking of other domains in the sequence, due to sequence-weighting artifacts (among other problems).

Projects involving multiple-domain proteins include

Calibrating HMMs

Our current method for calibrating the E-values of HMMs is described in the paper Calibrating E-values for Hidden Markov Models with Reverse-sequence Null Models, where a few problems are described.

Better template selection for close homologs

When we have many close homologs in PDB, the current fold-recognition method finds all of them, but does not do a good job of picking the best template.

Model drift often results in the HMM being better at recognizing some subfamily different from the one containing the target protein. For example, in one test, the SH3 domain for 2abl was given as a seed for the SAM-T02 web site, and the self-hit was about 80 down from the top (partly because 2abl is not in the template library). Blast finds several sequences containing exact matches to the 55-residue seed: 1oplA, 1opkA, 2abl, 1ju5C, 1abq, 1abo[AB], 1awo, and 1bbz[ACEG]. Four of these were in the template library (1aboB, 1awo, 1bbz[AE]), but the best scoring match using the template library was 1ng2A, which has the SH3 domain twice. If we just use the w0.5 model, we can see that it has drifted to center on 1udlA, 1gfc, and 1gcqA, rather than the sequence we started with.

Various fixes are possible.

Residue-residue contact prediction

For his Ph.D. thesis, George Shackelford is working on various ways to predict residue-residue contacts in proteins. He started with an idea of mine to use the statistical significance of mutual information between columns of a multiple alignment as a predictor, and has since expanded to many other measures of "correlated mutation" and conservation. There is more than a PhD thesis worth of material in this project, and a student selecting the project would work closely with George and me in identifying a 10-week project that would aid George's work and not interfere with progress on his PhD.

Some possible projects include

Hand-interaction with prediction

When we do predictions for CASP, we spend a lot of our time looking a predictions made by our tools, trying to figure out what the tools did wrong, then forcing them to do what we want. In 2002 for CASP5, we spent a lot of time writing little one-use scripts to move parts of proteins around for further optimizations. In 2004 for CASP6, we spent a lot of time trying to create constraints to give to undertaker to force beta strands together or to move parts of the structure. In many cases, it would have been much nicer to have an interactive tool that let us move things more explicitly and easily.

There was a tool almost usable in 2004: ProtoShop or ProteinShop. Information is available at http://proteinshop.lbl.gov/ http://graphics.cs.ucdavis.edu/~okreylos/ResDev/ProtoShop

ProteinShop seems to have some fairly nice manual-move capabilities, but was not usable last time I tried it (in 2004). I understand that David Bernick did some work installing the program and debugging it, and that the publicly available open-source version may now be usable.

The class project would be to install (and possibly debug) ProteinShop, then try to interface it to undertaker, so that one could look at partially optimized structures generated by undertaker, manipulate them with ProteinShop, and hand them back to undertaker for further optimization. The two programs could probably interact by passing PDB files of the conformation back and forth.

A more advanced GUI would not only allow manual manipulation of the conformation, but would allow applying specific undertaker conformation-change operators in specific regions. Visualization of the "ternary edges" that hold together structures that lack a contiguous backbone would also be useful, though these are not currently represented in the PDB output by undertaker.

One operator that would be nice to have is the ability to insert a particular alignment, or a subdomain from a particular PDB file. There *was* an operator in undertaker for this (ApplyFragmentsToConform), but this command is now obsolete, since it used an old "slicer" format for specific fragments, which is no longer used by undertaker. If we added a NameToPtr hash table to the AlignmentLibrary, we could refer to alignments by name and create a new command "ApplyAlignment".

Save and restore AlignedFragments library

A lot of time is wasted reading template PDB files and recomputing sidechain replacement for specific fragment libraries and alignments on each run. Saving and restoring them would be useful, particularly for reducing the number of files needed for an undertaker run—we'd like to be able to capture all the files needed and distribute them to the kilocluster for doing many parallel optimizations. Saving the specific fragment library would also make it feasible to put some more effort into creating it—doing sidechain optimization to remove steric conflicts, for example. (Currently, we often run SCWRL on the alignments, but not on the fragments from fragfinder.)

The fragfinder output is as an A2M file, which is fairly compact and easily produced by the SAM software, but which does not include 3D information. Undertaker can read the a2m file and the necessary PDB template files and generate fragments by sidechain replacement, but does not currently have a way of storing these fragments in a file. Existing undertaker file formats could easily be adapted to save the fragments, speeding up multiple undertaker runs and making it easier to run undertaker on the kilocluster (by eliminating the need for access to the PDB library).

Although we can output fragments in Rosetta format, the Rosetta format does not contain enough information to reconstruct undertaker fragments (which include sidechain information, as well as backbone information).

Undertaker cost functions from local structure predictions

We have neural nets that predict several local structure properties, but currently only one of these predictions (the alpha angle) is passed to undertaker for use in its cost functions. Many other properties are measured, but their only current use in the cost function is in a rather generic "residue propensity" way. The neural nets do a much better job of predicting the properties than simple residue propensity, and we should be taking advantage of them.

What we currently do for alpha is to convert the 11-dimensional neural network prediction vector into a nearly continuous function function of the alpha angle (360 1-degree bins). This involves combining the neural network prediction with the histograms for individual amino acids. The current algorithm starts with the histograms for alpha conditioned on the residue and conditioned on the next residue, and takes their sum to get an initial continuous curve, each of the 11 sections is scaled to get the appropriate probability according to the neural net prediction, and the resulting curve is smoothed to remove the discontinuities at the edges of the sections. The scaling and smoothing is repeated several times.

We could do a similar thing for burial, but the burial curves are simpler than the alpha curve, so it might be better to use a simpler representation, such as just a mean and standard deviation parameter for a family of curves such as the gamma distribution.

In addition to the various burial alphabets, we have some backbone alphabets. These are discrete alphabets and cost functions could be implemented just by having a table lookup of -log(Prob(letter)) for each position—simpler that what we do for alpha.

Currently the "Bystroff" alphabet is understood by undertaker, and it would not be hard to add the de Brevern protein-block alphabet. The new hbond alphabets that we have developed locally are also understood by undertaker, though there is no scoring scheme for them implemented yet. Some of the more conventional alphabets (dssp, stride, and our dssp extension str2) would be more difficult to implement, as we do not want to duplicate the innards of DSSP or STRIDE.

Other ways of using predicted local structure in undertaker

Besides using the neural net predictions directly in undertaker cost functions, we have other ways that we could used the local structure predictions. One that we currently use is to convert the RDB files to HelixConstraint and StrandConstraint commands, and use the constraints as part of the cost function. Specifying constraints not only affects the cost function, but also some of the conformation-change operators (such are OptSubtree and OptSegment). The scripts that do the conversion to constraints could use a little improving (for example, by including a scaling parameter on the command line to adjust the weights of the constraints). Another possible modification is to make the weight of the constraint dependent on the length of the constrained segment, since the weight is divided up into several individual constraints, with more individual constraints for longer segments.

Predictions of the new n_sep and o_sep alphabets could be used to create specific hydrogen bond constraints, to supplement the helix and strand constraints from other local structure alphabets.

The use of Hbond constraints (either from helix and sheet constraints or directly provided) could be improved by changing the TargetA and TargetB functions to provide better estimates of where the donor or acceptor should be. Currently the Hbonds use the same somewhat random target location that other constraints use.

Besides using constraints, local structure information could also be used to affect where fragments get inserted, with higher probabilities for loop regions and for regions where the current secondary structure does not match the predicted secondary structure. With a little more programming effort, we could even make the probabilities of different fragments depend on how well they match predicted structure. For example, we could

Conformation-change operators for undertaker

Undertaker currently has 35 conformation-change operators, some of which make major changes, some of which make tiny changes. The adaptive algorithm that chooses the probability for each operator seems to work fairly well (though it occasionally gets tricked into assigning too high a value to an operator), so we can fairly safely add more operators.

One operator that might be nice to have is an ABA crossover operator. Like the current (AB) crossover operator, it would start from two parent conformations. But instead of having one crossover point, it would have two, using two peptide splices to join a chunk from conformation A, a chunk from conformation B, and the rest from conformation A. This could be done either as two peptide-splice operations (based on how the current crossover operator works), or by extracting a chunk of conformation B as an incomplete AlignedFragments conformation, and applying it to conformation A. This would do a better job of keeping the two parts of A in their current relationship to each other.

Another useful operator (or set of operators) would be ones that jiggle the residues without breaking the backbone. One fairly sophisticated way to do this would be to tweak one torsion angle, then use inverse kinematics to compute torsion angles needed to keep a later peptide plane in position. Somewhat simpler and cruder would be to find pairs of bonds that are almost parallel and to make "crankshaft" moves that tweak the two torsion angles in opposite directions. This is done in a simple way by the TweakPsiPhiSubtree and TweakPsiPhiSegment operators. A variant of this would be to do a "crankshaft" move between residues A and B, add theta to psi(A-1), -theta to phi(A), -theta to psi(B-1), theta to phi(B).

There is plenty of room for thinking up new conformation-change operators!

Improved handling of clashes and bond angles in undertaker

The cost function in undertaker for clashes (steric conflicts) was last updated in spring 2004. It uses a separate distance threshold for each pair of atoms in the same residue, in adjacent residues, and in residues with a chain separation of 2 or more. Seting these distance thresholds is rather tricky, since there is not enough data for really robust setting of so many parameters. The method has the advantage/disadvantage of treating steric conflicts the same as bond lengths.

One big disadvantage of the current scheme is that there is no attractive van der Waals term in our cost function. Atoms that don't come into contact score the same as ones that do contact without clashing. While a traditional Lennard-Jones potential gives too high a penalty to clashes for our purposes (making the landscape too rough), the attractive well of the potential is important in determining whether a protein is properly packed. The Lennard-Jones potential is easily parameterized, having only one parameter for each pair of atoms. We need to come up with a similarly simple cost function that is easily calibrated from available data, with a good fit to the observed distance distributions, but with a softer clash penalty.

It is quite likely that bond lengths and H-bonds will not fit the same shape length distributions as van der Waals contacts, so we may want to provide completely separate mechanisms for bonded atoms (we already have a separate H-bond cost function). Atoms that are 2 bonds apart are also not governed by van der Waals contacts; their distances are basically measures of bond angles. The current clash detection scheme has different thresholds for bonded atoms and for atoms 2 or 3 bonds apart (because of the different thresholds for same residue, adjacent residue, and arbitrary chain separation), but does not have different shapes to the cost function. Page Not Found | Jack Baskin School of Engineering | UC Santa Cruz