Page Not Found
Our web site is constantly being updated. If you clicked a link or a bookmark to get here, the link or bookmark probably needs to be updated.
Thanks for your patience and understanding!
This is a list of projects that need to be done on various projects for protein structure prediction. Note: there is no requirement that the project you do come from this list—it is merely a set of projects that I am interested in. If you can come up with your own project and convince a faculty member to supervise it, that can be even better. New collaborations with biologists and biochemists are particularly welcome.
Although this list is now "complete", I may add to it from time to time as new thoughts occur to me. There are also always lots of code maintenance and code development projects that don't make it onto this list. (For example, undertaker needs a lot of modernizing of its c++ code, and predict-2nd needs to be cleaned up so that it can be released as open-source code.)
I've broken the list up into different categories:
Projects in this section are intended for students whose programming skills are somewhat limited, though some script writing will probably be needed. They fall into two categories: doing structure (and maybe function) predictions for a number of proteins or testing existing methods (tweaking parameters, for example).
The prediction tasks generally involve going through the
prediction process for a number of proteins, looking for useful
information to guide wet-lab research. Some will require writing
short scripts, and all will require looking at and interpreting
results from a number of different tools.
Note: all students will have a homework assignment that requires doing
a prediction for one or two proteins. Projects in this category will
require more substantial work than that homework, either by using
experimental evidence to guide the prediction or by doing predictions
of many proteins.
Since many (but not quite all) the chemosensors are transmembrane
proteins, which we cannot model in their entirety, and we are not not
interested in modeling the MAD domain, the first task is to select the
domain of the protein (generally the extracellular domain) for which
we want to do structure prediction. Generally this means finding
transmembrane helices (with TMHMM, for example) and the MAD domain,
and doing prediction on just the interesting subdomain.
Structure predictions were done for the few MAD-containing
proteins of H. pylori a couple of years ago, but could be redone, to
see if any of the newer files in PDB have provided better templates to
work from. This is not likely to lead to any big improvements, unless
there has been a homolog solved.
V. cholerae has a lot of MAD-containing proteins, and I am
considering parceling them out to the class as a homework assignment.
It might be good to have one person collect all the work that is done,
clean it up, improve on the poorly done work, generate hypotheses
about possible ligands, and prepare a detailed report for Fitnat
Yldiz.
Here is Fitnat's description of the project from spring 2005:
This proposal seeks to understand the ligand(s) bound by
methyl-accepting chemotaxis proteins in Vc. The list of these
proteins and their VC gene numbers are shown below. Detailed
information on each gene including DNA and amino acid sequences can be
obtained from: http://www.tigr.org/tigr-scripts/CMR2/name_search_test.spl.
Predicting specific proteins
Chemotaxis proteins
Two of our environmental toxicology faculty are interested in
chemotaxis proteins: Karen Ottemann in those for Helicobacter pylori
and Fitnat Yldiz in those for Vibrio cholerae. The main interest is
in the proteins that include the methyl-accepting domain (MAD) that is
a highly conserved signaling domain. These are the proteins that act
as the chemosensors in the first stage of the signaling pathway.
Project 1: Identification of the ligand binding domain(s) of the methyl
accepting chemotaxis proteins in Vibrio cholerae (Vc)
Flagellar motility in Vc is controlled by the chemotaxis system,
allowing the microbe to move in response to environmental cues.
Chemotaxis is regulated by chemoreceptors that sense environmental
cues, and transduce this ligand-binding information to regulate a
signal transduction cascade that affects flagellar rotation. The core
signal transduction proteins consist of CheW, the receptor-kinase
coupling protein, CheA, the kinase and CheY, the response regulator
that interacts with the flagellar motor.
| Locus | Gene Symbol | Common Name |
| VC0098 | methyl-accepting chemotaxis protein | |
| VC0216 | methyl-accepting chemotaxis protein | |
| VC0282 | methyl-accepting chemotaxis protein | |
| VC0449 | methyl-accepting chemotaxis protein | |
| VC0512 | methyl-accepting chemotaxis protein | |
| VC0514 | methyl-accepting chemotaxis protein | |
| VC1248 | methyl-accepting chemotaxis protein | |
| VC1289 | methyl-accepting chemotaxis protein | |
| VC1298 | methyl-accepting chemotaxis protein | |
| VC1313 | methyl-accepting chemotaxis protein | |
| VC1394 | methyl-accepting chemotaxis protein | |
| VC1403 | methyl-accepting chemotaxis protein | |
| VC1405 | methyl-accepting chemotaxis protein | |
| VC1406 | methyl-accepting chemotaxis protein | |
| VC1413 | methyl-accepting chemotaxis protein | |
| VC1535 | methyl-accepting chemotaxis protein | |
| VC1643 | methyl-accepting chemotaxis protein | |
| VC1859 | methyl-accepting chemotaxis protein | |
| VC1868 | methyl-accepting chemotaxis protein | |
| VC1898 | methyl-accepting chemotaxis protein | |
| VC1967 | methyl-accepting chemotaxis protein | |
| VC2161 | methyl-accepting chemotaxis protein | |
| VC2439 | methyl-accepting chemotaxis protein | |
| VCA0008 | methyl-accepting chemotaxis protein | |
| VCA0031 | methyl-accepting chemotaxis protein | |
| VCA0068 | methyl-accepting chemotaxis protein | |
| VCA0176 | methyl-accepting chemotaxis protein | |
| VCA0268 | methyl-accepting chemotaxis protein | |
| VCA0658 | methyl-accepting chemotaxis protein | |
| VCA0663 | methyl-accepting chemotaxis protein | |
| VCA0773 | methyl-accepting chemotaxis protein | |
| VCA0864 | methyl-accepting chemotaxis protein | |
| VCA0906 | methyl-accepting chemotaxis protein | |
| VCA0923 | methyl-accepting chemotaxis protein | |
| VCA0974 | methyl-accepting chemotaxis protein | |
| VCA0979 | methyl-accepting chemotaxis protein | |
| VCA0988 | methyl-accepting chemotaxis protein | |
| VCA1031 | methyl-accepting chemotaxis protein, authentic frameshift | |
| VCA1034 | methyl-accepting chemotaxis protein | |
| VCA1056 | methyl-accepting chemotaxis protein | |
| VCA1069 | methyl-accepting chemotaxis protein | |
| VCA1088 | methyl-accepting chemotaxis protein | |
| VCA1092 | methyl-accepting chemotaxis protein |
The 215 proteins need to be added to the set and predictions made (this is fairly trivial). The 77 proteins need to be checked to see whether the sequence change is an improvement. (In at least one case, we replaced a protein that was non-functional in the reference strain with an undamaged gene from a different yeast strain.) Where the sequence is an improvement over the one we predicted for, the README file should be updated to reflect this, the sequence replaced, and the prediction redone.
A new comparison should be done between the sequences we have
predictions for and the sequences in SGD to find other
discrepancies—if possible this should be automated, so that it can
be checked periodically without much effort.
Modifying nicotinic acetylcholine receptor
One design project that I was working on fairly intensely for a while,
but have put aside is a redesign of nicotinic acetylcholine receptors.
The receptors form a pentamer, but no one has yet managed to
crystallize the extracellular domain, as the pentamer is not formed
when the transmembrane portion is removed.
The idea is to use undertaker to build a homology model, then use
Rosetta to redesign some of the residues to increase stability of the
pentamer.
There is a soluble protein that is homologous to the extracellular
domain that *does* naturally form the pentamer (see PDB file 1i9bA). I
have already done a lot of modeling work on the protein, and may have
a model with a good enough backbone to do sidechain replacement.
(See
http://www.soe.ucsc.edu/~karplus/ach/ach7_short_1based/three-template/summary.html
paying particular attention to the README file.)
One attractive aspect of this protein design is that we do have an
experimentalist who would be interested in expressing and
characterizing the protein, if we can come up with a fairly convincing
redesign that does not require too many mutations. (Note: the cost of
mutations is not the number of residues changed, but the number of
reactions needed to replace parts of the gene. Changing five adjacent
residues has about the same cost as replacing one residue, and is much
cheaper than replacing three widely scattered residues.)
Analysis of HSP-60 proteins
Jonathan Trent has spent several years studying the HSP60 heat-shock
proteins (chaperonins) in Sulfolobus shibatae. Structures are known for the
protein, but there are still a lot of questions about the function and
about details of how different variants of the protein achieve
different quaternary structures. Some preliminary bioinformatic analysis of HSP60
proteins has been done in /projects/compbio/experiments/protein-predict/hsp60/
Here are some questions Jonathan is interested in:
The synucleins are intrinsically unfolded proteins but at least a-synuclein adopts a helical structure when bound to membranes. There is no information available on b-synuclein or g-synuclein. I think it would it be worth predicting some potential folded and partially folded structures—what do you think? The maximum length is 140 amino acids.
The PFAM synuclein HMM (pfam01387) is documented:
Synuclein. There are three types of synucleins in humans, these are
called alpha, beta and gamma. Alpha synuclein has been found mutated
in families with autosomal dominant Parkinson's disease. A peptide of
alpha synuclein has also been found in amyloid plaques in Alzheimer's patients.
Alpha synuclein in humans (SYUA_HUMAN) apparently has two isoforms:
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGK----------------------------EGYQDYEPEA MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEABeta synuclein (SYUB_HUMAN) is similar to alpha synuclein:
MDVFMKGLSM AKEGVVAAAE KTKQGVTEAA EKTKEGVLYV GSKTREGVVQ GVASVAEKTK EQASHLGGAV FSGAGNIAAA TGLVKREEFP TDLKPEEVAQ EAAEEPLIEP LMEPEGESYE DPPQEEYQEY EPEASo is gamma synuclein (SYUG_HUMAN):
MDVFKKGFSI AKEGVVGAVE KTKQGVTEAA EKTKEGVMYV GAKTKENVVQ SVTSVAEKTK EQANAVSEAV VSSVNTVATK TVEEAENIAV TSGVVRKEDL RPSAPQQEGV ASKEKEEVAE EAQSGGD
They think something similar may be going on with Mih1, but this work is not yet published. Swe1 has a kinase domain and Mih1 has a phosphatase domain. They are curious what you would find if we put these proteins through our structure prediction programs.
Automatic predictions alread exist for these proteins in http://www.soe.ucsc.edu/~karplus/yeast/YJL1/YJL187C/summary.html and http://www.soe.ucsc.edu/~karplus/yeast/YMR0/YMR036C/summary.html , but these predictions do not use the latest methods and are not producing full-length models. The parts that are being modeled are the "easy" parts: the kinase domain and the phosphatase domain.
Splitting the proteins up into domains and and doing predictions for each part would be fairly straightforward, but getting good results for the N-terminal domains, for which there appear to be no good templates, could be quite difficult. This would be especially true if Kellog's conjecture is true and they are ntively unstructured until highly phosphorylated. I don't believe that any of the tools we use can handle phosphorylated proteins.
For such large ab-initio predictions, we would probably want to use
both the SAM-T04/undertaker and the Rosetta protocols.
protein from C. elegans
Andrew Chisholm (biology professor) juset sent me the following message
(11 Jan 2006):
We have just a new protein (from C elegans) that is highly unfamiliar.
We are desperate for any bioinformatics insight as to what it could be.
(It has come out of a suppressor screen). MAybe it would be an
interesting project for an undergraduate.
Testing and tweaking parameters
Using unfinished genomes
For many targets, we get a lot of potential homologs just by searching
NR and adding more sequences to the multiple alignments would produce
only marginal improvements. Other sequences, though, appear to be
ORFans, with no similar proteins in other species. In such cases, it
may be worthwhile to look through unfinished genomes and even the
trace archives at NCBI, to try to find evidence for related proteins
that have not been annotated as proteins yet. Using tblastn may find
appropriate hits, but we do not have enough experience using the noisy
data in the trace archives to know whether it is worth including
automatically or even just as a last-resort measure.
Combining local structure predictions
For the past few years, we have been using a script called RDBCombine
to merge neural net predictions (possibly over different, but related
alphabets) to get a consensus prediction. We have never tested this
method to see if it does better than simply taking a neural net
optimized for the output alphabet, nor have we tried tuning any
parameters related to it.
RDBCombine can combine RDB-format prediction files for different alphabets. It translates the alphabet using the contingency tables output by compare-real, and weights them by the mutual information between the predicted alphabet and the desired alphabet, and does a weighted average of the translated predictions. It is not clear what the best way to do the combining is—simply averaging the probability vectors is probably not optimal, though it is easy. We may want to look at combining the log-odds scores (log (P_predicted/P_background)), for example.
We have a script to evaluate RDB files for Qn and bits
saved (and we could add the SOV measure to the evaluations), so
different methods for combining predictions can be easily tested with
only fairly minor script writing.
New training methods for neural nets
Training neural nets is currently quite slow, and a student interested in
machine learning might want to work on modifying predict-2nd to use a
faster learning method than the current back propagation. This is
not a high priority (we rarely retrain the networks), but might be
a good project for someone more interested in machine learning
than in bioinformatics.
A more serious problem for us is that training does not always
converge to equally good results. Our current protocol for comparing
the predictability of different alphabets is based on the test
results for a single training run on each part of a three-way cross
validation test. We need to develop a new training regime that more
consistently produces good results—perhaps by starting from several
random neural nets, training for a limited time, then taking the best
of the set on the training data for further training.
This training method may be possible with just new scripts in
predict-2nd, or it may require slight modifications to the predict-2nd code.
Local structure prediction using HMMs
One approach to improving secondary structure that we have tried, and
that may be worth improving on, is to use the fold-recognition target
HMM to align the template set and gather statistics about what
secondary structure codes align to each position.
Since the target HMM can be a multi-track HMM using neural-net
predictions, this method should be able to do at least as well as the
neural-net methods.
The method would be particularly valuable when there is a close
homolog in the template library, as it would be able to "look up" the
right answers.
This makes testing more difficult, since it is easy to fool oneself
about how well the method is working, but would be very useful in a
tool used by bench biologists, as the easy predictions would be right
(not necessarily the case for neural nets).
It is used in the SAM-T99 server, I believe, but not the SAM-T02
server.
We should test it properly and use it in the SAM-T05 server.
Mark Diekhans implemented this several years ago in SAM, but I am not sure of the details of his implementation. In his tests, the method improved Q3 scores slightly, but made the information gain worse than the neural nets used to create the HMM tracks. Josue Samayoa did more careful tests in Fall 05, and found that the current implementation does not make the results of the neural net prediction substantially better or worse. Somewhat surprisingly, including the true sequence in the library of structures that were scored by the HMM did not change the result much.
Several things could be done to improve this method:
One such weighting would be to use P(sequence), essentially
decoding with P(state, sequence | position).
Another method would be to try explicit weighting of
sequences.
I think that a method using exp(-E_value) would be worth
trying.
Here is the reasoning: We want to give more weight to
sequences that match the HMM than to sequences that are very
remote, but using P(seq|HMM) as the weight is likely to put
too much weight on the closest match.
If we use instead the probability of not getting that good a
sequence by chance in the N sequences (1-p_0)^N =
(1-E_value/N)^N, we could have numerical problems, but for
E_value<
I believe that Josue may wish to work on this project for another
quarter, and I'm not sure whether there is enough work left for two people.
If you are interested in this project, please consult with Josue.
Analyzing alignment results
We have collected a mound of data about the quality of alignments for
different track weightings using multi-track HMMs to align
structurally similar proteins. We have used the data to select track
weights that are generally good, but I suspect that we could do much
more with the data. For example, I believe that the best track
weighting is dependent on how similar the two sequences are. Very
distantly related sequences may need more weight put on the local
structure tracks, while closely related sequences should have more
weight on the amino-acid track. It would be interesting to try to
come up with a method for measuring sequence similarity and using it
to modify track weights accordingly.
Testing profile-profile alignment
For the past few CASPs, some of the best fold recognition and
alignment has come from profile-profile alignment methods. Richard
Hughey has added profile-profile alignment to SAM, but the code has
not been thoroughly tested. We have two test suites: one for
alignment and one for fold recognition. It would be good to test
SAM's profile-profile methods using these suites: comparing the
profile-profile method with the more traditional sequence-profile methods.
We do have the ability to use multi-track HMMs for the profiles, so
there is a lot to explore: which tracks to use, track weights, which
profile-profile scoring algorithm to use, ... .
Mixed sequence-profile alignments in template library
We routinely use multi-track HMMs built from a target sequence to
search a template database of sequences, but we currently use only
single-track (amino-acid-only) template HMMs to score the target sequence.
Although we plan to explore full profile-profile scoring, it might
also be worthwhile to do more modest multitrack sequence-HMM scoring
schemes with template-based HMMs.
There are two approaches that I see:
The predicted secondary structure for the target sequence could be used with a template model by modifying SAM to use log P_sequence_pos(template_state) as the second track scoring. This requires having SAM read in probability vectors for the "database" and have labels for the states of the HMM.
The fragfinder program has not had extensive testing—we selected an algorithm and HMM based on intuition and minuscule tests on one target sequence. The undertaker program can be used to evaluate the fragments as fragments (looking at the histogram of RMSD, or RMSD normalized by length, for example), and we can come up with various figures of merit for fragment libraries, allowing us to compare different fragment finding methods.
Sol Katzman wrote routines for undertaker to filter fragfinder
output and produce Rosetta fragment libraries. Some of his work
provides more measures for fragment-library quality. It also opens
up the possibility of comparing fragfinder-generated libraries with
the libraries that Rosetta normally uses.
Evaluating quality measures
Undertaker includes several "cheating cost functions" that compare the
current structure with a known correct structure. These cheating
functions have two purposes: one is to test the conformation-change
operators and stochastic search procedure, to see if optimization with
a really good cost function converges to a correct structure; the
other is to evaluate the results of different optimization methods, to
see how well they work. The current set of cheating functions
includes all-atom RMSD, RMSD_CA, GDT (the measure popular at CASP),
smooth_GDT (a very similar measure with slightly less quantization
noise), and a contact-based measure called "clens" developed by Tim
Dreszer in Spring 2005.
Some of these measures are known to have problems, particularly for structures that are not very similar to the real structure. For example, RMSD and RMSD_CA are both very sensitive to the worst-done part of the structure, and interpretation of the result is very dependent on the length of the chain. GDT (and smooth_GDT) evaluates only how much of the chain is correct, giving no penalty for grossly wrong parts. All of these superposition-based measures tend to over-reward structures that have been compacted too much—they have no penalty for unreasonably dense proteins. In some cases, the point model (all atoms at (0,0,0)) scores reasonably well.
The clens function was devised to try to handle the problems of the other measures, be rewarding for correct contacts and penalizing for incorrect ones. In preliminary testing using models generated by undertaker, clens and GDT were very highly correlated with each other, indicating that they may be nearly interchangeable in their ability to recognize decent models. We'd like to do more extensive testing of "clens", using all the models available from previous runs of CASP. In particular, we'd like to see if we can find any outliers, where GDT and clens evaluate models quite differently. We could then examine such outliers, and see which metric does a better job of capturing the quality of the model.
It is also possible to modify the clens definition in undertaker,
to parameterize the definition of contact or change the scoring
function for missing and extra contacts.
Rotamer optimization
The current version of undertaker has 3 conformation-change
operators that just change sidechains (OneRotamer,
ClashingRotamer, and ClusteredRotamer). It would be interesting
to see how well undertaker optimizes sidechains compared to scwrl
(and Rosetta), for different cost functions [particularly whether the hbond_geom
cost function helps]. This can be done with no modifications to
undertaker, just running undertaker scripts and gathering statistics.
A possible approach is to read a pdb file, randomize the
rotamers, optimize with undertaker with only the rotamer-changing
operators turned on, reoptimize with scwrl (callable from inside undertaker), then
reoptimize with undertaker, reporting the three costs and all-atom
rmsd for each of the conformations.
One could also gather all-atom rmsd for a huge number of sidechain
conformations and try to do linear regression to set weights for
the cost function, but this is likely to have serious problems due to
correlation between the components of the cost function.
Tweaking undertaker's cost function
We have a lot of decoys generated for CASP5 and CASP6, and now have
correct structures for most of them.
We can use this set, or a test set generated by the Baker group, to
try to adjust the weights of the various components of the cost
function in undertaker.
Possibly we should do a linear regression to fit the cost to whatever
badness measure we choose, though we may have to figure out a length
scaling to be able to do the fitting across several different targets.
I did some preliminary work on this, and (as expected) linear regression works terribly when the cost functions are highly correlated. We'll probably need to do some dimension reduction first (perhaps with principal components analysis?) before linear regression stands a chance of working.
A minor programming project that may need to be done before we can
try tweaking the cost functions: we need to clean up cost functions to
handle incomplete conformations.
When reading in true, but incomplete, conformations (the usual case
for PDB structures) the cost functions often report the conformations
as terrible, since the missing atoms are all at location (0,0,0).
The cost functions need to be modified to be aware of the
"contains_atom()" member of the Conformation class.
Some cost function components have already been fixed, but not all.
Protein design by back propagation
In spring 2005 Jes Frellsen and I made some modifications to
predict-2nd so that it could be used for protein design. (I made
further changes in summer 2005.) The idea is to
take several of the
local-structure-predicting neural nets, and fix their output at the
desired local structures, then use back-propagation to optimize the
inputs. One starts with an initial random sequence, optimize
using the neural nets to get an input profile, then sample from that
profile and reoptimize. This can be repeated to get a large sample
of possible sequences that should have good local properties, which
could then be fed into the Rosetta optimization program to try to find
sequences with good tertiary properties.
More random thoughts on using back-propagation. There are lots of variants that could be tried in the sampling of sequences from the back-propagation. For example, one could resample the sequence points where the back-propagated derivative for the current residue is negative rather than resampling all points. Instead of sampling from the profile one could take the most probable residue of the profile. After generating a few hundred sequences in this way from random starting points, one could make a profile from the sequences, and score each sequence with that profile. Further tweaking could take just the best 100 or so sequences and the profile generated from them. Each could be used with that profile as a starting point for the back-propagation and resample process. A tertiary scoring system (such as Rosetta's) could be used to pick out the best sequences from the list, and use the resulting sequences and profile to do more sampling.
Another approach to using neural nets would be to build a neural net whose input was many structural properties and whose output was a sequence. Such a neural net could be trained in the same manner as current nets for prediction (with inputs and outputs swapped), and would provide a probability distribution for amino acids at each position. This would require a few changes to the neural net code (mainly providing an input layer that accepts several alphabets, instead of just one alphabet), and would more directly provide probability distributions to sample sequences from. This method is less likely to provide a variety of very different solutions to a protein design problem, though, so I have not pursued it.
One protein that it may be interesting to apply the back-propagation plus Rosetta approach to is the "miniAGRP" peptide, a small cysteine knot that Glen Millhauser's group is interested in. They would like to remove the disulfide bridges that currently hold the structure together, without greatly disrupting the structure. Certain of the residues are believed to be functionally important, and should not be touched in the redesign. Millhauser's group may be interested in synthesizing a design, if a better one can be found than the design that Craig Lowe did (using just Rosetta) last year. For more information, see miniAGRP paying particular attention to the README file and the design/ subdirectory.
In addition to that particular design, it would be good to experiment with various networks and local structure alphabets, so see which combinations result in the best designs. Since we do not have the time nor the facilities for fabricating and testing large numbers of proteins, we will have to rely mainly on computational methods. One simple test is to look at the percent sequence recovery: given just the backbone for a real protein, try to design amino acids sequences that would fold to that conformation. The closer the designs are to the real protein, the more convincing the design method. Such a study should be done on hundreds of different backbones from radically different structures.
Projects in this section generally require some programming,
database, or web-design skills.
SAM-T05 Web site
The SAM-T05 web site was created last spring and summer, but it is still in beta release form, and we'd like to make it public before CASP starts. Some things that need doing include the following:
The yeast predictions use the SAM_T02 prediction machinery, and it might be a good idea to update the Make.main file to use the SAM_T04 or SAM_T05 predictions, though redoing all the predictions would take a lot of cpu time.
Perhaps a more important task is to figure out a way to handle the large, multidomain proteins better, as finding a template for just one domain doesn't really solve the problem. Currently, the web pages don't even report which sections of the protein we have made predictions for!
The fairly small yeast genome took weeks to finish on the small cluster—to do a bigger genome, such as the human genome, we'd have to "clusterize" the method to run on the kilocluster without hammering the file servers. Fan Hsu has started work on making the fold recognition method work on the kilocluster. I don't think there is any point to trying to apply undertaker genome-wide right now—it requires far too much hand tweaking to get even a crude new-fold prediction out of undertaker.
It may be worthwhile to do some other small genomes, such as the
archeal genomes that Todd Lowe is building DNA chips for, or the
bacterial genomes that Fitnat Yldiz and Karen Ottemann are interested in.
The yeast predictions are not used much, in part because there is
no convenient web interface for accessing particular predictions.
Ideally, we would like to be able to search the set of predictions in
several ways: by name (and each sequence may have several names), by
sequence, by predicted fold (SCOP) class, by confidence of prediction
or fraction of protein confidently predicted, ... .
This would require setting up a database (either a real one or a
crude one using perl scripts) of names and aliases, sequence data,
predictions, ... . It would be good to be able to populate some parts
of the database (such as names and aliases) from SGD, or to have the
web script query SGD for translations.
Some of the search capabilities that should be added include:
Index the yeast predictions
Continuous burial prediction
Our burial and exposure alphabets are based on an underlying
continuous (or, at least, finer-grain) value. It would be nice to
have the neural net output parameters for a parameterized
distribution (such as log-normal or gamma), rather than the
discrete 7 or 11 states we currently use. The two values can be
interpreted as the mean and standard deviation of the output
distribution (though other parameterizations may be more
convenient for some distributions). A weak prediction would have
a large standard deviation.
This could be done by adding a combining layer to the outputs of a normal neural net. For each output of the neural net we associate two parameters (center and spread). The predicted mean would be the neural-net weighted sum of the center values, and the predicted standard deviation would be the standard deviation of a mixture distribution, with mixture coefficients from the neural net.
The center and spread parameters could initially be set by partitioning the outputs in the training data into roughly equal weight discrete classes (as is now done) and computing a mean and standard deviation for each class. They could be tweaked a bit by a few passes of EM to get the output described as a mixture. After a few epochs of training the neural net, the center and spread parameters could be trained by back propagation also.
We would need to compare the encoding cost of the actual
burial value using this continuous method (which requires some
care in the discretizing, since our underlying data is usually
integer-valued) versus the encoding cost of the discrete alphabet
method, which assumes a flat distribution within each of the
discrete bins.
Recoding neural net inputs
Neural nets suffer from being basically uninterpretable.
There are some ways we could attempt to glean some information
from the neural nets, and perhaps use this information to make
neural nets that train faster or work better. The basic idea is
to look for recodings of the amino acids that the neural nets are
learning, and then hardwire the more important recodings. The
simplest notion of a recoding is the dot product of an amino-acid
probability vector with a direction vector (a 20-dimensional
vector with magnitude one). The weights of the first layer of
each neural net we have trained can be normalized to produce a
number of such direction vectors (78 vectors for each network in
our most common network architecture).
The resulting vectors can be clustered (with the square of the dot product as the appropriate similarity measure) to determine the most common directions. Alternatively, Principal Components Analysis (PCA) could be done to reduce the dimensionality. We can attempt to identify the properties these directions represent by looking at the correlation with the indices in the AAIndex database.
We can also compute the dot products of the profile with the centers of the most populated clusters and use them as inputs to the neural net, in addition to or in place of the profile input. (Similarly, the factors that come from PCA can be used for recoding.)
The above residue-at-a-time analysis does not get at the
neighborhood patterns that the neural nets learn. People have
tried to determine some of the patterns by looking at Fourier
transforms and wavelet transforms of numeric properties (mainly
hydrophobicity). We could look at each hidden unit on the first
neural layer and see if its weight matrix can be roughly matched by taking a
single direction vector and a neighbor weight function.
That is, decompose
sum_{a,-2<=j<2} w_{a,j} Prob(amino acid a in position i+j)
into
sum_{-2<=j<=2} f(j) sum_a dir(a) Prob(amino acid a in position i+j)
or w_{a,j} = f(j) dir(a).
If there are hidden units that can be approximately described this
way, it may be worth providing input features that correspond to
this neighborhood property also.
The neighborhood analysis may be a bit limited in our current networks, since we have gone to using rather short windows (only 3 wide) in the first layer, but the residue-at-a-time recoding analysis may be more revealing than in networks with wider input windows.
Another possible residue-at-a-time recoding scheme would be to
replace the 1-hot encoding of 20 amino acids for the guide
sequence with a binary encoding of amino acids. There are some
simple greedy algorithms and random search optimization algorithms
that could be applied to find 5 or 6 different partitions of the
amino acids so that each of the weight vectors could be well
approximated by adding just 5 or 6 weights to the appropriate
subsets of the amino acids.
Entropy profiles
There was a poster at ISMB 04 (I-84 FASE: a new fold-recognition
method through entropy profiles by Alejandro Sanchez-Flores and
Lorenzo Segovia, both at UNAM in Mexico) that claimed to get excellent
results using just the entropy information from the profile to do fold
recognition—all other signals were thrown out and just a 4-level
entropy alphabet used.
In close questioning, I did not see an obvious flaw in their testing
procedure, so we should try to implement this and see if their work is
duplicable.
To avoid artifacts, we should use our own testing methods and our own computation of entropy. Note that entropy is not a property of the protein structure (unlike most of our alphabets), but a property of a multiple alignment. This means that we do not have all our usual tools for evaluating an alphabet (in particular, conservation and predictability are not meaningful concepts), so we are limited to alignment tests and fold-recognition tests.
Even if entropy profiles alone are not very useful, they could make
an interesting extra track in profile-profile HMMs.
Converting RDB files to HMMs
We currently use a TCL script (2nd-rdb-to-sam-model) for converting
the RDB-formatted output of our local structure predictions to SAM HMM
format, for inclusion in multi-track HMMs. It would be better if the
SAM code understood the RDB format directly, so that no translation
script was needed. Indeed, if the SAM code could read the RDB format
directly, it would be trivial to write the translation program in C,
since SAM already has routines for outputting the HMM format. The
2nd-rdb-to-sam-model script is the only TCL script in the SAM package,
and it would be good to eliminate it.
Although there is not much code needed for adding this
functionality to the SAM package, a lot of code would need to be read
in order to understand the internal data structures and parameter
passing mechanisms of SAM.
Combining fold recognition results
Our fold-recognition methods treat every template in the library as
independent of the rest, but we actually have many templates which
contain similar folds. Various ways have been proposed for combining
template scores to get a "fold" score, and a couple students have done
preliminary tests of the "product-of-pvalues" method, and shown that
it might help, but their projects were never taken to the point where
the method could be incorporated into an automatic prediction method.
[
Bailey TL, Gribskov M.
Combining evidence using p-values: application to sequence homology searches.
Bioinformatics. 1998;14(1):48-54.]
[
Bailey, Timothy L. and Grundy, William N.
Classifying proteins by family using the product of
correlated p-values.
Proceedings of the Third international conference on computational
molecular biology (RECOMB99), April 11-14, 1999. pp. 10-14.
]
One possible project is to incorporate the "product-of-pvalues" method into our automatic predictions. This requires designing a way to update the calibration parameters for the combining method automatically as new templates are added to the library, and converting the fold results back into selection of good templates. (This needn't be a weekly update, as the SCOP classification is only updated once or twice a year.)
Another possible project is to devise a different combining
method. The "product-of-pvalues" method treats all templates in a
fold class as equally informative and does not use information
from templates in competing fold classes. One can envision
combining methods such as logistic regression that could use the
extra information to get better predictions. One danger is that
multi-domain proteins correctly have multiple correct folds, so
simple competition between fold classes is not quite the right model.
Handling multi-domain proteins
Our current methods look for target-template matches where there is a
common subdomain, but do not take the next step of looking separately
at the part of the target that does not match the template and doing a
separate prediction for it.
We could probably improve our fold-recognition capability if we did
some domain splitting, either after having recognized a domain using
the whole chain, or by trying domain splits based on the multiple
alignment.
In one run of all yeast ORFs, one of the predictions took 6
days, because the protein was 4910 residues long—this clearly would
have been helped by breaking up into smaller pieces.
We have a script available for breaking up long proteins and doing separate predictions on each part (split-into-domains). The script can be used either manually to pull out a predicted subdomain for separate prediction or automatically to predict overlapping windows of some fixed length.
In CASP6, it was obvious that we had managed to put together some multi-domain predictions without explicit domain prediction. It is not clear to me that explicit domain prediction is as valuable as I thought in 2002. It is clear though, that a subdomain with many homologs can result in masking of other domains in the sequence, due to sequence-weighting artifacts (among other problems).
Projects involving multiple-domain proteins include
Model drift often results in the HMM being better at recognizing some subfamily different from the one containing the target protein. For example, in one test, the SH3 domain for 2abl was given as a seed for the SAM-T02 web site, and the self-hit was about 80 down from the top (partly because 2abl is not in the template library). Blast finds several sequences containing exact matches to the 55-residue seed: 1oplA, 1opkA, 2abl, 1ju5C, 1abq, 1abo[AB], 1awo, and 1bbz[ACEG]. Four of these were in the template library (1aboB, 1awo, 1bbz[AE]), but the best scoring match using the template library was 1ng2A, which has the SH3 domain twice. If we just use the w0.5 model, we can see that it has drifted to center on 1udlA, 1gfc, and 1gcqA, rather than the sequence we started with.
Various fixes are possible.
Some possible projects include
There was a tool almost usable in 2004: ProtoShop or ProteinShop. Information is available at http://proteinshop.lbl.gov/ http://graphics.cs.ucdavis.edu/~okreylos/ResDev/ProtoShop
ProteinShop seems to have some fairly nice manual-move capabilities, but was not usable last time I tried it (in 2004). I understand that David Bernick did some work installing the program and debugging it, and that the publicly available open-source version may now be usable.
The class project would be to install (and possibly debug) ProteinShop, then try to interface it to undertaker, so that one could look at partially optimized structures generated by undertaker, manipulate them with ProteinShop, and hand them back to undertaker for further optimization. The two programs could probably interact by passing PDB files of the conformation back and forth.
A more advanced GUI would not only allow manual manipulation of the conformation, but would allow applying specific undertaker conformation-change operators in specific regions. Visualization of the "ternary edges" that hold together structures that lack a contiguous backbone would also be useful, though these are not currently represented in the PDB output by undertaker.
One operator that would be nice to have is the ability to insert a
particular alignment, or a subdomain from a particular PDB file.
There *was* an operator in undertaker for this
(ApplyFragmentsToConform), but this command is now obsolete, since it
used an old "slicer" format for specific fragments, which is no longer
used by undertaker. If we added a NameToPtr hash table to the
AlignmentLibrary, we could refer to alignments by name and create a
new command "ApplyAlignment".
Save and restore AlignedFragments library
A lot of time is wasted reading template PDB files and
recomputing sidechain replacement for
specific fragment libraries and alignments on each run.
Saving and restoring them would be useful, particularly for
reducing the number of files needed for an undertaker run—we'd
like to be able to capture all the files needed and distribute
them to the kilocluster for doing many parallel optimizations.
Saving the specific fragment library would also make it feasible
to put some more effort into creating it—doing sidechain
optimization to remove steric conflicts, for example.
(Currently, we often run SCWRL on the alignments, but not on the fragments
from fragfinder.)
The fragfinder output is as an A2M file, which is fairly compact and easily produced by the SAM software, but which does not include 3D information. Undertaker can read the a2m file and the necessary PDB template files and generate fragments by sidechain replacement, but does not currently have a way of storing these fragments in a file. Existing undertaker file formats could easily be adapted to save the fragments, speeding up multiple undertaker runs and making it easier to run undertaker on the kilocluster (by eliminating the need for access to the PDB library).
Although we can output fragments in Rosetta format, the Rosetta
format does not contain enough information to reconstruct undertaker
fragments (which include sidechain information, as well as backbone
information).
Undertaker cost functions from local structure predictions
We have neural nets that predict several local structure properties,
but currently only one of these predictions (the alpha angle) is
passed to undertaker for use in its cost functions. Many other
properties are measured, but their only current use in the cost
function is in a rather generic "residue propensity" way. The neural
nets do a much better job of predicting the properties than simple
residue propensity, and we should be taking advantage of them.
What we currently do for alpha is to convert the 11-dimensional neural network prediction vector into a nearly continuous function function of the alpha angle (360 1-degree bins). This involves combining the neural network prediction with the histograms for individual amino acids. The current algorithm starts with the histograms for alpha conditioned on the residue and conditioned on the next residue, and takes their sum to get an initial continuous curve, each of the 11 sections is scaled to get the appropriate probability according to the neural net prediction, and the resulting curve is smoothed to remove the discontinuities at the edges of the sections. The scaling and smoothing is repeated several times.
We could do a similar thing for burial, but the burial curves are simpler than the alpha curve, so it might be better to use a simpler representation, such as just a mean and standard deviation parameter for a family of curves such as the gamma distribution.
In addition to the various burial alphabets, we have some backbone alphabets. These are discrete alphabets and cost functions could be implemented just by having a table lookup of -log(Prob(letter)) for each position—simpler that what we do for alpha.
Currently the "Bystroff" alphabet is understood by
undertaker, and it would not be hard to add the de Brevern
protein-block alphabet. The new hbond alphabets that we have
developed locally are also understood by undertaker, though there is
no scoring scheme for them implemented yet. Some of the more
conventional alphabets (dssp, stride, and our dssp extension str2)
would be more difficult to implement, as we do not want to duplicate
the innards of DSSP or STRIDE.
Other ways of using predicted local structure in undertaker
Besides using the neural net predictions directly in undertaker cost
functions, we have other ways that we could used the local structure predictions.
One that we currently use is to convert the RDB files to
HelixConstraint and StrandConstraint commands, and use the constraints
as part of the cost function. Specifying constraints not only affects
the cost function, but also some of the conformation-change operators
(such are OptSubtree and OptSegment). The scripts that do the
conversion to constraints could use a little improving (for example,
by including a scaling parameter on the command line to adjust the
weights of the constraints). Another possible modification is to make
the weight of the constraint dependent on the length of the
constrained segment, since the weight is divided up into several
individual constraints, with more individual constraints for longer segments.
Predictions of the new n_sep and o_sep alphabets could be used to create specific hydrogen bond constraints, to supplement the helix and strand constraints from other local structure alphabets.
The use of Hbond constraints (either from helix and sheet constraints or directly provided) could be improved by changing the TargetA and TargetB functions to provide better estimates of where the donor or acceptor should be. Currently the Hbonds use the same somewhat random target location that other constraints use.
Besides using constraints, local structure information could also be used to affect where fragments get inserted, with higher probabilities for loop regions and for regions where the current secondary structure does not match the predicted secondary structure. With a little more programming effort, we could even make the probabilities of different fragments depend on how well they match predicted structure. For example, we could
One operator that might be nice to have is an ABA crossover operator. Like the current (AB) crossover operator, it would start from two parent conformations. But instead of having one crossover point, it would have two, using two peptide splices to join a chunk from conformation A, a chunk from conformation B, and the rest from conformation A. This could be done either as two peptide-splice operations (based on how the current crossover operator works), or by extracting a chunk of conformation B as an incomplete AlignedFragments conformation, and applying it to conformation A. This would do a better job of keeping the two parts of A in their current relationship to each other.
Another useful operator (or set of operators) would be ones that jiggle the residues without breaking the backbone. One fairly sophisticated way to do this would be to tweak one torsion angle, then use inverse kinematics to compute torsion angles needed to keep a later peptide plane in position. Somewhat simpler and cruder would be to find pairs of bonds that are almost parallel and to make "crankshaft" moves that tweak the two torsion angles in opposite directions. This is done in a simple way by the TweakPsiPhiSubtree and TweakPsiPhiSegment operators. A variant of this would be to do a "crankshaft" move between residues A and B, add theta to psi(A-1), -theta to phi(A), -theta to psi(B-1), theta to phi(B).
There is plenty of room for thinking up new conformation-change operators!
Improved handling of clashes and bond angles in undertaker
The cost function in undertaker for clashes (steric conflicts) was
last updated in spring 2004. It uses a separate distance threshold for each
pair of atoms in the same residue, in adjacent residues, and in
residues with a chain separation of 2 or more. Seting these distance
thresholds is rather tricky, since there is not enough data for really
robust setting of so many parameters. The method has the
advantage/disadvantage of treating steric conflicts the same as bond lengths.
One big disadvantage of the current scheme is that there is no attractive van der Waals term in our cost function. Atoms that don't come into contact score the same as ones that do contact without clashing. While a traditional Lennard-Jones potential gives too high a penalty to clashes for our purposes (making the landscape too rough), the attractive well of the potential is important in determining whether a protein is properly packed. The Lennard-Jones potential is easily parameterized, having only one parameter for each pair of atoms. We need to come up with a similarly simple cost function that is easily calibrated from available data, with a good fit to the observed distance distributions, but with a softer clash penalty.
It is quite likely that bond lengths and H-bonds will not fit the same shape length distributions as van der Waals contacts, so we may want to provide completely separate mechanisms for bonded atoms (we already have a separate H-bond cost function). Atoms that are 2 bonds apart are also not governed by van der Waals contacts; their distances are basically measures of bond angles. The current clash detection scheme has different thresholds for bonded atoms and for atoms 2 or 3 bonds apart (because of the different thresholds for same residue, adjacent residue, and arbitrary chain separation), but does not have different shapes to the cost function.
Quick Links