This is a list of projects that need to be done for various efforts in protein structure prediction. Note: there is no requirement that the project you do come from this list; it is merely a set of projects that we are interested in. If you can come up with your own project and convince a faculty member to supervise it, that can be even better. New collaborations with biologists and biochemists are particularly welcome.
There are still a number of viable projects on last year's list that have not been moved forward to this year's list. If there is one on that list that interests you, check with Kevin Karplus to see if it is still a viable project.
I've broken the list up into different categories---but some projects straddle these divisions, so look through all of them.
This project involves several components:
The user interface of Pymol should be relatively easy to modify to make it more rasmol-like. It would be nice if rasmol selection and highlighting scripts that we produce could be used by the modified interface, as well as using the same mouse conventions as rasmol.
The project requires learning the user interfaces for both
rasmol and pymol, and modifying the Python code of pymol.
Web server infrastructure
Model Quality Assessment Server
In 2006, for CASP7, a new category was created for CASP: model
quality assessment programs (MQAP). In this category, servers are given a
bunch of models generated by other servers and asked to
evaluate them.
We did not participate in this category in CASP7, but would
like to for CASP8.
http://predictioncenter.org/casp7/doc/casp7-format.html#QA explains the two types of evaluation asked for and the two formats. Most participants submitted just a global score for the entire model, not per-residue error estimates.
We now have the ability to evaluate models using undertaker cost functions. In fact, informal observation on the CASP7 targets indicates that our cost functions would be a fairly good model quality assessment measure.
This project is really two parts:
Note that we are not optimizing conformations using these cost functions, but only evaluating server-predicted models, so some terms that are important in optimization may be irrelevant in selecting good models from servers (if, for example, the servers all do a good job of avoiding clashes, then clash detection may be irrelevant in choosing among models). We are particularly interested in identifying the best models, not so much in ranking the low-quality ones.
Our current undertaker cost functions are on a scale from -infinity to +infinity, so we would have to rescale them so that -infinity->1.0 and +infinity->0.0, to match the range expected in the "QA" format.
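One way to do this rescaling is with a logistic squashing function. The sketch below is only an illustration: the function name cost_to_quality and the scale parameter are assumptions, not part of undertaker, and the scale would have to be calibrated against real costs on CASP7 models.

```python
import math

def cost_to_quality(cost, scale=1.0):
    """Map a cost in (-inf, +inf) to a quality score in [0, 1].

    Low (good) costs map toward 1.0 and high (bad) costs toward 0.0.
    `scale` (an assumed tuning parameter, must be > 0) sets how quickly
    the score saturates.
    """
    x = cost / scale
    # quality = 1 / (1 + exp(x)), written in a form that cannot overflow.
    if x >= 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))
```

Any strictly decreasing map onto (0, 1) preserves the ranking of models, so the choice of function only matters if the QA scores are also judged on how well they are calibrated.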
The CASP organizers turned alignment-only predictions into backbone-only predictions, but many of our cost functions really work much better if there are sidechains present. We can automatically run the alignment backbones through scwrl, and we should evaluate models both before and after scwrling, perhaps taking the better score as the assessment.
The UCSC genome browser is one of the most popular resources/tools for biologists interested in exploring the human genome, or other genomes. In addition, its emphasis on comparative genomic analyses makes it suitable for bioinformatic analyses. Various linked resources exist; one of them is the UCSC Proteome Browser (Hsu et al., Nucl. Acids Res. 33, D454-D458, 2005), as it was conceived in 2004.
The project is to first investigate
the currently implemented features and abilities of the UCSC
Proteome Browser, and to explore what possible updates and
additions would increase the resource's usefulness in a variety
of (computational) case studies. The next step is then to
select one of the improvements that you have recommended
and implement it, either directly with the Proteome Browser
team, or as a separate add-on.
Local Structure prediction
Combining local structure predictions
For the past few years, we have been using a script called RDBCombine
to merge neural net predictions (possibly over different, but related
alphabets) to get a consensus prediction. We have never tested this
method to see if it does better than simply taking a neural net
optimized for the output alphabet, nor have we tried tuning any
parameters related to it.
RDBCombine can combine RDB-format prediction files for different alphabets. It translates each prediction into the desired alphabet using the contingency tables output by compare-real, weights each translated prediction by the mutual information between the predicted alphabet and the desired alphabet, and takes a weighted average of the translated predictions. It is not clear what the best way to do the combining is: simply averaging the probability vectors is probably not optimal, though it is easy. We may want to look at combining the log-odds scores (log (P_predicted/P_background)), for example.
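The two combining rules mentioned here can be compared side by side. The following is a minimal sketch, not the RDBCombine code: it assumes the predictions for one residue have already been translated into the target alphabet, that the weights are the mutual-information values, and that a background distribution over the target alphabet is available. All function and parameter names are made up for illustration.

```python
import math

def weighted_average(predictions, weights):
    """Weighted arithmetic mean of probability vectors (one per predictor)."""
    total = sum(weights)
    n = len(predictions[0])
    return [sum(w * p[k] for p, w in zip(predictions, weights)) / total
            for k in range(n)]

def weighted_log_odds(predictions, weights, background, floor=1e-6):
    """Average log(P_predicted / P_background), then convert back.

    `floor` (an assumed safeguard) keeps log() away from zero probabilities.
    The result is renormalized so that it sums to 1.
    """
    total = sum(weights)
    combined = []
    for k, bg in enumerate(background):
        avg = sum(w * math.log(max(p[k], floor) / bg)
                  for p, w in zip(predictions, weights)) / total
        combined.append(bg * math.exp(avg))
    z = sum(combined)
    return [c / z for c in combined]
```

Averaging log-odds amounts to a weighted geometric mean, so one predictor that assigns a very low probability to a letter pulls the consensus down much harder than it does in the arithmetic average.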
We have a script (somewhere!) to evaluate RDB files for Qn and bits
saved (and we could add the SOV measure to the evaluations), so
different methods for combining predictions can be easily tested with
only fairly minor script writing. Writing a new script may be
almost as easy as finding the old one, though!
Local structure prediction using HMMs
One approach to improving secondary structure that we have tried, and
that may be worth improving on, is to use the fold-recognition target
HMM to align the template set and gather statistics about what
secondary structure codes align to each position.
Since the target HMM can be a multi-track HMM using neural-net
predictions, this method should be able to do at least as well as the
neural-net methods.
The method would be particularly valuable when there is a close
homolog in the template library, as it would be able to "look up" the
right answers.
This makes testing more difficult, since it is easy to fool oneself
about how well the method is working, but would be very useful in a
tool used by bench biologists, as the easy predictions would be right
(not necessarily the case for neural nets).
It is used in the SAM-T99 server, I believe, but not the SAM-T02 and SAM-T06
servers.
We should test it properly and use it in the SAM-T07/8 server.
Mark Diekhans implemented this several years ago in SAM, but I am not sure of the details of his implementation. In his tests, the method improved Q3 scores slightly, but made the information gain worse than the neural nets used to create the HMM tracks. Josue Samayoa did more careful tests in Fall 05, and found that the current implementation does not make the results of the neural net prediction substantially better or worse. Somewhat surprisingly, including the true sequence in the library of structures that were scored by the HMM did not change the result much. This implies to me that the probability of the sequence is not being used adequately to weight the predictions---having a perfect fit in the sequences being scanned should result in near-perfect predictions.
Several things could be done to improve this method:
One such weighting would be to use P(sequence), essentially
decoding with P(state, sequence | position).
Another method would be to try explicit weighting of
sequences.
I think that a method using exp(-E_value) would be worth
trying.
Here is the reasoning: We want to give more weight to
sequences that match the HMM than to sequences that are very
remote, but using P(seq|HMM) as the weight is likely to put
too much weight on the closest match.
If we use instead the probability of not getting that good a
sequence by chance in the N sequences (1-p_0)^N =
(1-E_value/N)^N, we could have numerical problems, but for
E_value << N this is approximately exp(-E_value).
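The relationship between the two weights discussed above can be checked numerically: exp(-E_value) is the large-N limit of (1-E_value/N)^N. The sketch below is only a sanity check with made-up function names; it uses log1p so that the exact form does not lose precision when E_value/N is tiny.

```python
import math

def weight_exact(e_value, n):
    """(1 - E/N)^N, computed via log1p to limit round-off for small E/N.

    Requires e_value < n.
    """
    return math.exp(n * math.log1p(-e_value / n))

def weight_approx(e_value):
    """Large-N approximation of the same weight: exp(-E)."""
    return math.exp(-e_value)

if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 5.0):
        print(e, weight_exact(e, 10**6), weight_approx(e))
```

Sequences with E_value near 0 get a weight near 1, and the weight falls off smoothly for more remote sequences, which is the behaviour argued for above.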
A set of about 300 protein pairs at different levels of difficulty has been selected to cover the full range of alignment problems.
There are some details of the test protocol that still need to be tweaked (like exactly what cost function to use in undertaker to evaluate the model), but mostly this project consists of parameter tweaking on different alignment methods (mainly HMM-based methods) and measuring the quality of the resulting alignments.
Perhaps the most important alignment testing is for
profile-profile alignments. The best methods in CASP for the
past few years have used profile-profile methods, but we have
not yet tested and incorporated profile-profile methods in the
SAM protocols. Richard Hughey has implemented a form of
HMM-HMM alignment, but it has *still* not been tested.
Fold recognition and template selection
This project is to explore the possible benefit of using domain content information for the orthology prediction problem. That is, to use tools for identifying the domain structure of protein sequences to help find orthologous proteins between different species.
Predicting orthologs in eukaryotes is a challenging problem. Protein families can undergo several duplication events in a single lineage, making it difficult to identify functional orthologs versus divergent paralogs. The popular approach is to use sequence-similarity based approaches, such as reciprocal best BLAST comparisons, to predict orthology. See for example the recently published Multiparanoid method [Alexeyenko A, Tamas I, Liu G, Sonnhammer EL. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics. 2006 Jul 15;22(14):e9-15. doi:10.1093/bioinformatics/btl213 ].
These approaches are limited because the similarity scores ignore the importance of different regions in the compared protein sequences. Intuitively, differences observed in positions that are highly variable should detract less from an estimate of similarity compared to differences observed in positions that are highly conserved.
The question this project seeks to address is whether using conserved functional domains provides any independent information that can help resolve orthology predictions made by sequence similarity approaches. One could devise a score that measures the similarity between two protein sequences based on the set of conserved PFAM domains each contains. To evaluate different score choices or to compare to sequence similarity prediction alone, one could compare the predictions with the agreement of coexpression neighborhoods, conserved synteny, or curated sets of orthologs.
This project may not require very deep programming, but
will require running a lot of HMMer (Pfam) jobs from scripts
and coming up with intelligent ways to combine the results.
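One very simple candidate for the domain-content score described above is the Jaccard index of the two domain sets. This is just a starting point, not a recommended score: it ignores domain order, copy number, and the HMMer scores, and the domain labels in the usage example are placeholders rather than real Pfam accessions.

```python
def domain_jaccard(domains_a, domains_b):
    """Jaccard similarity between the sets of domains found in two proteins.

    Each argument is an iterable of domain identifiers (e.g. the Pfam
    families reported by HMMer for one protein). Returns a value in [0, 1];
    two proteins with no detected domains score 0.0 by convention here.
    """
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    # Placeholder labels standing in for Pfam hits of two candidate orthologs.
    print(domain_jaccard(["domA", "domB"], ["domA", "domC"]))
```

A score like this could be used to break ties among candidates with similar BLAST scores, and then compared against the reference sets mentioned above.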
Converting RDB files to HMMs
We currently use a TCL script (2nd-rdb-to-sam-model) for converting
the RDB-formatted output of our local structure predictions to SAM HMM
format, for inclusion in multi-track HMMs. It would be better if the
SAM code understood the RDB format directly, so that no translation
script was needed. Indeed, if the SAM code could read the RDB format
directly, it would be trivial to write the translation program in C,
since SAM already has routines for outputting the HMM format. The
2nd-rdb-to-sam-model script is the only TCL script in the SAM package,
and it would be good to eliminate it.
Although there is not much code needed for adding this
functionality to the SAM package, a lot of code would need to be read
in order to understand the internal data structures and parameter
passing mechanisms of SAM.
Combining fold recognition results
Our fold-recognition methods treat every template in the library as
independent of the rest, but we actually have many templates which
contain similar folds. Various ways have been proposed for combining
template scores to get a "fold" score, and a couple of students have done
preliminary tests of the "product-of-pvalues" method, and shown that
it might help, but their projects were never taken to the point where
the method could be incorporated into an automatic prediction method.
[Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998;14(1):48-54.]
[Bailey TL, Grundy WN. Classifying proteins by family using the product of correlated p-values. Proceedings of the Third International Conference on Computational Molecular Biology (RECOMB99), April 11-14, 1999, pp. 10-14.]
One possible project is to incorporate the "product-of-pvalues" method into our automatic predictions. This requires designing a way to update the calibration parameters for the combining method automatically as new templates are added to the library, and converting the fold results back into selection of good templates. (This needn't be a weekly update, as the SCOP classification is only updated once or twice a year.)
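For reference, the core calculation behind a product-of-p-values score can be sketched as follows. It uses the standard result that the product of n independent uniform p-values has distribution P(product <= x) = x * sum_{k<n} (-ln x)^k / k!. Independence is an assumption that template scores within a fold class will not satisfy well (correlated p-values are the subject of the second paper cited above), and the function name is made up for illustration.

```python
import math

def product_pvalue(pvalues):
    """Combined p-value for the product of n p-values, assuming independence.

    Works in log space so that products of many small p-values do not
    underflow before the correction sum is applied.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    log_prod = sum(math.log(p) for p in pvalues)  # ln of the product, <= 0
    t = -log_prod
    term = 1.0    # t^k / k! for k = 0
    total = 1.0
    for k in range(1, n):
        term *= t / k
        total += term
    return math.exp(log_prod) * total
```

With a single template the function returns the template's own p-value, so a fold score built this way reduces to the current per-template score for folds with only one representative.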
Another possible project is to devise a different combining
method. The "product-of-pvalues" method treats all templates in a
fold class as equally informative and does not use information
from templates in competing fold classes. One can envision
combining methods such as logistic regression that could use the
extra information to get better predictions. One danger is that
multi-domain proteins correctly have multiple correct folds, so
simple competition between fold classes is not quite the right model.
Tertiary structure prediction
Undertaker cost functions for local structure alphabets
We have many neural networks for predicting various local
structure properties of individual residues of a protein.
These neural networks do a much better job of predicting the
properties than a simple residue-propensity scale, but our
tertiary prediction program (undertaker) only uses
propensity-based cost functions for evaluating most of these properties.
We have code in undertaker for accepting neural net predictions and creating cost functions from them for some local structure alphabets (the alpha angle alphabets and the burial alphabets), but several of our best alphabets (str2, Bystroff, de Brevern, N-notor, O-notor, N-sep, O-sep) have neural-net predictors but do not yet have neural-net-based cost functions.
This project requires learning how the undertaker cost
functions are implemented and adapting a fairly small amount
of c++ code to handle other local structure properties.
Distance constraints from alignments
In CASP7, we noticed that the sheet constraints that we
generated from alignments were doing an excellent job of
hanging onto sheets that we found in fold-recognition targets,
but that all-helical proteins were often being disrupted by
the attempts to close gaps and construct loops.
We would like to generate constraints for contacts between residues that are widely separated along the chain based on contacts seen in alignments, and test the value of different ways of choosing and weighting the contacts in selecting good models for the target protein.
This project requires learning how undertaker represents 3D
structures and detects atoms that are close, plus writing some
c++ code to output constraints. The code should be fairly
easy, as it can be modeled on the existing code for generating
sheet constraints and helix constraints.
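The contact-finding step could look roughly like the sketch below, shown in Python for brevity even though the real code would be C++ inside undertaker. It assumes one representative coordinate per residue (for example CB) taken from a template; the separation and distance thresholds are placeholder values, and mapping template residues onto target residues through the alignment is not shown.

```python
import math

def long_range_contacts(coords, min_separation=6, max_distance=8.0):
    """List residue pairs far apart in sequence but close in space.

    coords: one (x, y, z) tuple per residue, in chain order.
    Returns (i, j, distance) for every pair with j - i >= min_separation
    and distance <= max_distance (Angstroms).
    """
    contacts = []
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            d = math.dist(coords[i], coords[j])
            if d <= max_distance:
                contacts.append((i, j, d))
    return contacts
```

Running this over several aligned templates and counting how often each target pair appears would give one simple way to weight the resulting constraints.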
Continue developing an automated method for predicting disulfide bond connectivity
We have recently begun to devise a "new/old" method for predicting disulfide bridge connectivity based on multiple sequence alignments. The strategy employed is probably the oldest (but still trusted, and widely applied) approach used in manual sequence analysis. It is based on connecting pairs of alignment positions that contain partially conserved cysteine residues whose patterns of conservation and variation resemble each other most closely. At this point, we have most pieces in hand (predominantly in the form of Perl scripts) that are necessary for a first automated implementation of this strategy, and for its evaluation on known examples (Zoe Woodruffe, MRes thesis + code, available).
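To make the strategy concrete, here is a small sketch of the pairing idea. It is not the existing Perl implementation: the similarity measure (Jaccard overlap of which sequences carry a cysteine), the conservation threshold, and the greedy pairing are all assumptions chosen for simplicity.

```python
def cys_pattern(alignment, col):
    """Which sequences have a cysteine in the given alignment column."""
    return [seq[col] == "C" for seq in alignment]

def pattern_similarity(p, q):
    """Jaccard overlap of two presence/absence patterns."""
    both = sum(a and b for a, b in zip(p, q))
    either = sum(a or b for a, b in zip(p, q))
    return both / either if either else 0.0

def predict_pairs(alignment, min_fraction=0.5):
    """Greedily pair cysteine columns with the most similar patterns.

    alignment: list of equal-length aligned sequences (strings).
    Only columns where at least `min_fraction` of sequences have a
    cysteine are considered. Returns (col_a, col_b, similarity) tuples.
    """
    nseq = len(alignment)
    cols = [c for c in range(len(alignment[0]))
            if sum(s[c] == "C" for s in alignment) / nseq >= min_fraction]
    patterns = {c: cys_pattern(alignment, c) for c in cols}
    scored = sorted(
        ((pattern_similarity(patterns[a], patterns[b]), a, b)
         for i, a in enumerate(cols) for b in cols[i + 1:]),
        key=lambda t: (-t[0], t[1], t[2]))
    used, pairs = set(), []
    for score, a, b in scored:
        if a not in used and b not in used:
            pairs.append((a, b, score))
            used.update((a, b))
    return pairs
```

Greedy pairing is the simplest choice; a maximum-weight matching over the same similarity scores would be a natural alternative to evaluate on the known examples.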
In the longer term, there are two possible applications for this automated method - (i) for automated flagging of likely mis-threading between target and template sequences underlying automated comparative models (e.g. as provided by ModBase); or (ii) as a more widely applicable prediction method for disulfide bond connectivity.
The goals of this project are
One big disadvantage of the current scheme is that there is no attractive van der Waals term in our cost function. Atoms that don't come into contact score the same as ones that do contact without clashing. While a traditional Lennard-Jones potential gives too high a penalty to clashes for our purposes (making the landscape too rough), the attractive well of the potential is important in determining whether a protein is properly packed. The Lennard-Jones potential is easily parameterized, having only one parameter for each pair of atoms. We need to come up with a similarly simple cost function that is easily calibrated from available data, with a good fit to the observed distance distributions, but with a softer clash penalty.
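As an illustration of the kind of function we have in mind, the sketch below keeps the Lennard-Jones attractive well beyond the optimal distance r0 but replaces the steep repulsive wall with a bounded quadratic penalty. The functional form, the names, and the clash_scale value are assumptions made for this example, not a form fitted to observed distance distributions.

```python
def lennard_jones(r, r0, depth=1.0):
    """Standard 12-6 potential with its minimum of -depth at r = r0."""
    s6 = (r0 / r) ** 6
    return depth * (s6 * s6 - 2.0 * s6)

def soft_contact_cost(r, r0, depth=1.0, clash_scale=10.0):
    """Lennard-Jones well for r >= r0, soft quadratic penalty for r < r0.

    The two pieces agree in value (-depth) and slope (zero) at r = r0,
    so the cost is smooth there. The clash penalty is bounded: at r = 0
    it reaches depth * (clash_scale - 1) instead of infinity.
    """
    if r >= r0:
        return lennard_jones(r, r0, depth)
    x = 1.0 - r / r0
    return depth * (-1.0 + clash_scale * x * x)

if __name__ == "__main__":
    for r in (2.0, 3.0, 4.0, 5.0, 8.0):
        print(r, lennard_jones(r, 4.0), soft_contact_cost(r, 4.0))
```

In this form r0 is the only per-atom-pair parameter, while depth and clash_scale are shared, which keeps the calibration problem about as small as for the traditional potential.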
It is quite likely that bond lengths and H-bonds will not fit the
same shape of distance distribution as van der Waals contacts, so we may
want to provide completely separate mechanisms for bonded atoms (we
already have a separate H-bond cost function). Atoms that are 2 bonds
apart are also not governed by van der Waals contacts; their distances
are basically measures of bond angles. The current clash detection
scheme has different thresholds for bonded atoms and for atoms 2 or 3
bonds apart (because of the different thresholds for same residue,
adjacent residue, and arbitrary chain separation), but does not have
different shapes to the cost function.
Improving Undertaker knot detector
Undertaker has an implementation of Firas Khatib's knot finder, but there are some minor problems with it---mainly that it uses the first version of the algorithm, where points to be considered for deletion are visited in a simple end-to-end sweep of the protein. This should be changed to allow visiting the points in arbitrary orders: most importantly the two smallest-first orders that Firas uses in his Rosetta implementation, but also a random order.
This project requires modifying the existing c++ code for the knot detector and implementing a new data structure (most likely a priority queue) for ordering the points.
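A priority-queue visiting order might be organized as in the sketch below (Python for brevity; the real change would be in C++). Two things here are assumptions: that "smallest first" means smallest area of the triangle formed by a point and its current chain neighbours, and the caller-supplied is_blocked function, which stands in for the existing test of whether some other part of the chain crosses that triangle.

```python
import heapq

def triangle_area(p, q, r):
    """Area of triangle p-q-r in 3D (half the cross-product norm)."""
    u = [q[k] - p[k] for k in range(3)]
    v = [r[k] - p[k] for k in range(3)]
    cx = u[1] * v[2] - u[2] * v[1]
    cy = u[2] * v[0] - u[0] * v[2]
    cz = u[0] * v[1] - u[1] * v[0]
    return 0.5 * (cx * cx + cy * cy + cz * cz) ** 0.5

def smooth_smallest_first(points, is_blocked):
    """Delete chain points in smallest-triangle-first order.

    is_blocked(a, i, c) must return True if point i (with current
    neighbours a and c) may not be deleted. Returns surviving indices.
    """
    n = len(points)
    prev = list(range(-1, n - 1))
    nxt = list(range(1, n + 1))
    alive = [True] * n
    version = [0] * n      # bumped whenever a point's neighbours change
    heap = []

    def push(i):
        if 0 < i < n - 1 and alive[i]:
            area = triangle_area(points[prev[i]], points[i], points[nxt[i]])
            heapq.heappush(heap, (area, i, version[i]))

    for i in range(1, n - 1):
        push(i)
    blocked = set()
    while heap:
        _, i, ver = heapq.heappop(heap)
        if not alive[i] or ver != version[i]:
            continue       # stale entry: point deleted or triangle changed
        a, c = prev[i], nxt[i]
        if is_blocked(a, i, c):
            blocked.add(i)
            continue
        alive[i] = False
        nxt[a], prev[c] = c, a
        for j in (a, c):   # neighbours now have new triangles
            version[j] += 1
            push(j)
        for j in blocked:  # a deletion may have unblocked earlier points
            push(j)
        blocked.clear()
    return [i for i in range(n) if alive[i]]
```

Stale heap entries are skipped rather than removed, which avoids needing a decrease-key operation; a random visiting order could reuse the same loop with random numbers in place of the areas.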
This project has fairly low importance (the existing
implementation is good enough most of the time---it has a
small false positive rate that could be reduced by the
changes), but is also a small project suitable for a student
who already knows c++ and who needs a mini-project, not a
full-size project.
Applications of tertiary structure prediction
Disulfides in intracellular proteins of extremophiles
In doing structure
predictions for a VapC protein, Kevin noticed that there
were some disulfide bridges predicted, even though the protein
is an intracellular nucleic-acid-binding protein.
Following up on this, he contacted Martin Lawrence, a structural biologist at Montana State University, who confirmed that intracellular disulfides do occur in thermophiles, and pointed to a paper by Todd Yeates at the California NanoSystems Institute:
Beeby M, O'Connor BD, Ryttersgaard C, Boutz DR, Perry LJ, Yeates TO
The Genomics of Disulfide Bonding and Protein Stabilization
in Thermophiles.
PLoS Biology Vol. 3, No. 9, e309
doi:10.1371/journal.pbio.0030309
This paper presents evidence that disulfides are common in thermophiles---at least in those that have a protein known as "protein disulfide oxidoreductase".
It would be interesting to do predictions of disulfides for the archaeal species that the Lowe lab works with. There are two parts to the prediction: selecting proteins that have multiple cysteines and are not predicted to be extracellular or membrane bound, and doing structure predictions for them to see if the cysteines are predicted to be close enough to form disulfides.
Another interesting approach would be to look at the homologs of the protein Kevin started with---perhaps starting from one of the PDB chains that it is most similar to: 2h1oA, 2h1cA, 2bsq, or 1y82A. (1y82A might be a particularly good choice, since even BLAST finds it.) Getting a deep multiple sequence alignment to it is fairly easy with the SAM-T06 protocol.
The distances between CB atoms in the known structure can be computed and those that are close enough to be potential disulfide contacts (say closer than 6 Angstroms) can be recorded. Undertaker can print the CB pairs of interest with
    InFilePrefix /projects/compbio/experiments/undertaker/spots/
    PrintTrainingDistance Cbeta.spot 6.0 pairs.out
The use of Cbeta.spot, rather than CB, means that glycines will have a projected CB atom (as if they were alanines) rather than using the CA location.
The sequences of the multiple alignment can be quickly
scanned to find which cysteines look likely to form disulfides
and which are unlikely to. It would be interesting to see
how the fraction of cysteines that seem to form disulfides
varies depending on the taxonomy and living environment of the
species.
Detection, and structure modelling, of new members of the SRS/PGSH superfamily in Apicomplexa
In previous work we were able to produce models for the double domains of the malaria gamete surface protein Pfs230, through fold recognition and template-based modelling (Structural models for the protein family characterised by the Pfs230 gamete surface protein of Plasmodium falciparum. Gerloff et al., PNAS 102, 13598-603, 2005). One of the interesting findings was that the core beta-sandwich domain structure of these proteins resembled that found, so far, only in one other family of Apicomplexan surface proteins, namely the SRS-proteins in Toxoplasma gondii (SRS stands for "SAG1 (Surface Antigen 1) -related sequences").
Subsequently, we started screening the inferred protein sequences from other Apicomplexan species in GenBank for further distant homologues, i.e. members of the same superfamily (H Su, MSc thesis, available), using Hidden Markov Models generated using HMMer. The student working on this project managed to generate initial results but had time neither to explore his results nor to produce models of the newly found members.
The aim of this project is to update and extend the work
(since new sequences will be available in GenBank), using both
HMMer and SAM Hidden Markov Models for the screening, and to
model the newly found members of this superfamily.
Large-scale comparative modelling of Fibronectin 3 (Fn3) domains
We have recently demonstrated that comparing the surface properties of modelled, homologous protein domains can be of interest in the context of characterizing their interactions with other proteins (Large-scale modelling as a route to multiple surface comparisons of the CCP module family. D. C. Soares et al., Prot. Eng. Des. Sel. 18, 379-88 2005). Since this study, we have developed a new method for comparisons of this kind, and the results are much more promising; in fact, much of our current research focuses on exploiting such comparisons (Ali & Gerloff, manuscript in prep).
The protein that triggered our interest in this area, due to its complicated interaction properties with multiple protein partners, is human CR1 (complement receptor 1). Human CR1 has a "homo-multimodular" architecture; it consists of 30 paralogous domains, each approximately 60 amino acids in length and sufficiently similar in sequence that we are certain that these domains will have the same fold (although it is clear that their functional properties differ). Casting our initial focus on CR1 has proved valuable in method development. However, there are other proteins to which our new methods could (and should) be applied now. In the first instance, we require modelled structures for all domains of potential new target proteins. Just like CR1, our favorite proteins are large, have multi-modular architectures, and have the potential to yield interesting discoveries when the predictions are ultimately followed up through experiment.
The project is to focus on one such target protein. A protein domain named "Fibronectin3-domain" (Fn3 in SMART) appears to be "similar" in some respects to our paradigm CCP modules but has a different structure and function. The primary aim of the project is to produce comparative models for Fn3 modules in human/mammalian proteins. This will involve
ModBase is a database of automatically produced comparative models using the program MODELLER by the Sali group (http://salilab.org/modbase). In previous work, we have downloaded all models of proteins from the malarial parasite Plasmodium falciparum, and subjected them to quality and redundancy (overlap) filtering.
Since the models were downloaded (early in 2006), a considerable number of P. falciparum structures have been solved by crystallography. This allows us to evaluate the accuracy of the ModBase models on this subset, in a way reminiscent of how this is done in connection with the CASP structure prediction "experiments". Kevin Karplus has recently implemented the main CASP evaluation method "GDT" (Global Distance Test) locally. This means we have a suite of programs for computing standard assessment criteria (e.g. r.m.s. deviation) and more informative measures (e.g. correctly predicted hydrogen bonds).
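For orientation, the summary score usually reported from GDT (GDT_TS) averages the percentage of residues that lie within 1, 2, 4, and 8 Angstroms of their true positions. The sketch below is a simplification and not the local implementation: it takes per-residue distances from a single, already chosen superposition, whereas the full method searches over many superpositions to maximize the count at each cutoff.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS from per-residue model-to-true distances.

    distances: one distance (Angstroms) per aligned residue, e.g. between
    corresponding CA atoms after superposition.
    Returns a score from 0 to 100.
    """
    n = len(distances)
    if n == 0:
        raise ValueError("need at least one residue")
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

if __name__ == "__main__":
    print(gdt_ts([0.5, 1.5, 3.0, 9.0]))
```

Unlike r.m.s. deviation, this kind of score is not dominated by a few badly placed residues, which is why it is more forgiving of models that are correct in the core but wrong in a loop.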
The project is to use these methods, combined with visual analyses,
to compare ModBase models of P. falciparum proteins with their
experimentally solved "true structures". From this comparison you
will derive an assessment of the current MODELLER parameter settings
used in ModBase, and make suggestions on how to improve them.
Protein complexes
(Re)modelling and validation of predicted CDK-cyclin complexes from Arabidopsis thaliana
In recent work, we have used protein-protein docking combined with electrostatic complementarity calculations to predict the 19 most plausible pairings of CDK (cell-cycle dependent protein kinase) and cyclin homologues in A. thaliana (Quan & Gerloff, manuscript available). Due to the large scale at which this analysis was undertaken (1188 possible CDK-cyclin combinations were considered), a rigid docking program was applied, as opposed to the more accurate (and biophysically realistic) flexible docking methods. Accordingly, we currently only have approximate, "unrefined" models of the 19 complexes - some of them come with unrealistic features like side-chain clashes, etc.
The project is to apply the flexible protein-protein docking program
RosettaDock (Protein-Protein Docking with Simultaneous Optimization
of Rigid Body Displacement and Side Chain Conformations, J.J. Gray,
et al., J. Mol. Biol. 331(1), 281-299, 2003) to the predicted
CDK-cyclin pairs as well as control complexes, in order to validate
the predictions and remodel the corresponding heterodimeric complex
structures.
Impact of side-chain modelling on electrostatic complementarity scoring of protein complexes
An alternative to assembling hypothetical protein-protein complexes through protein-protein docking is to generate them through simultaneous comparative modelling of all subunits, using a known structure of a homologous protein complex as a template.
We have recently noticed that the electrostatic complementarity at the interface of protein complexes modelled in this way using the program MODELLER (http://salilab.org/modeller) seems exaggerated (overfitted) compared with complexes that were solved crystallographically (Rupert Koenig, diploma thesis available). In other words, the correlation of positive and negative surface potential within contacting patches on the two subunits is even higher in MODELLER-generated protein complexes than it is in crystal structures of protein complexes. Most likely, this observation reflects an overfitting at the stage of side-chain placement.
The aim of the project is to verify this conjecture, and hopefully
to remedy this effect, by remodelling the side-chains of a set of
approximately 1200 MODELLER-generated hypothetical CDK-cyclin
complexes using Rosetta (Improved side-chain modeling for
protein-protein docking, C. Wang, et al., Prot. Sci. 14, 1328-1339,
2005; Progress in Modeling of Protein Structures and Interactions,
O. Schueler-Furman, et al., Science 310 (5748), 638-642, 2005), and
recalculating electrostatic complementarity using a local copy of
MOLSURFER that was modified to allow large-scale application.
Transient versus permanent protein interactions - annotate a database of heterodimeric binary (sub)complex structures
In recent work, a graduate student in the Gerloff group has derived a database of heterodimeric (sub)complexes from the Protein Data Bank (PDB) (Thomas Juettemann, unpublished results). In the context of developing prediction methods using this data set as reference, it is important for us to be able to annotate which of these protein-protein interactions are of permanent nature (as for example between the alpha and beta subunits of hemoglobin, i.e. part of a complex that persists throughout the proteins' existence), or of transient nature (as for example the regulatory interaction between a CDK (cell cycle dependent protein kinase) and a cyclin, i.e. temporary).
Since the database contains several hundred complexes, the more automation can be applied in this exercise, the better. However, while hints can be derived from protein sequence/structure-based properties (such as amino acid properties at the interface), the actual annotation must not be based on any of these, but rather inferred from existing annotation of biological function in SWISS-PROT, GO, etc. (since the purpose of the annotation is to serve as reference, i.e. "truth", in developing a prediction method).
The project is to do this partly by following the rules used in a
previous publication in this field (Analysing six types of protein
interfaces. Y. Ofran & B. Rost, J. Mol. Biol. 325, 377-387, 2003),
and partly by developing new strategies.
Determine the intersection between the functional genomics protein interaction database IntAct and "modellable space" (+ produce comparative models)
A variety of experimental techniques exist to screen for interactions between proteins. The results of such screens (using predominantly yeast-two-hybrid (Y2H), tandem affinity purification (TAP), and co-immunoprecipitation (CoIP) strategies) are stored in protein-protein interaction databases - one of the best annotated and organised ones is IntAct by the European Bioinformatics Institute (EBI).
A small proportion of the interactions reported in IntAct are based on crystal structures of the complexes but for the majority no atomic coordinates for the complex are currently available through the Protein Data Bank (PDB). This raises the interesting question - how many of the complexes could be investigated by template-directed structure modelling, ideally comparative modelling (where homology between the template and target protein sequences is easily detected and the two sequences can be aligned confidently)?
The goals of the project are to determine the set of protein pairs
in IntAct for which structures of (i) the complex, or (ii) both
individual subunits could be obtained, and to investigate the
feasibility of modelling them. If time allows, the complexes can be
modelled with the help of scripts for automated comparative
modelling at large scale.
Implementation of an algorithm for ordering heterodimeric protein interface residues in accordance with the contacts between them
We have recently developed a novel, one-dimensional (profile-style) representation of electrostatic potential surfaces. Such "1-D Electrostatic surface profiles" are useful for fast and simplified comparison/screening applications (Ali & Gerloff, unpublished results). In these profiles, a summed electrostatic potential value (derived from a 3-D structure/model) is assigned to each amino acid residue that is considered in the analysis.
Up to now, we have used 1-D surface profiles to compare the entire surfaces (i.e. all residues in the proteins) of modelled, homologous domains - to screen for the most "interesting" domains in the set, i.e. to predict which of them are likely to have different protein-protein interaction properties (binding site; binding partner). In order to do this in a structurally meaningful way, we (re-)order the amino acid residues multiple times and carry out multiple comparisons; the reordering algorithm seeks to enrich instances where neighboring surface positions (in 3-D) are next to each other in the profile. (Since we do not know where the binding site is we try out many different orders, typically 1000).
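The flavour of such a reordering can be conveyed with a randomized nearest-neighbour walk. This is a guess at the general idea, not the algorithm actually used: the choice among the k nearest unvisited residues, the parameter names, and the use of one representative coordinate per residue are all assumptions.

```python
import math
import random

def neighbour_walk_order(coords, rng, k=3):
    """Return one randomized ordering of residues that favours 3-D neighbours.

    coords: one (x, y, z) tuple per surface residue.
    rng: a random.Random instance (pass a seeded one for reproducibility).
    At each step the walk moves to one of the k nearest unvisited residues,
    so residues adjacent in the ordering tend to be adjacent in space.
    """
    n = len(coords)
    remaining = set(range(n))
    current = rng.randrange(n)
    order = [current]
    remaining.remove(current)
    while remaining:
        nearest = sorted(
            remaining,
            key=lambda j: math.dist(coords[current], coords[j]))[:k]
        current = rng.choice(nearest)
        order.append(current)
        remaining.remove(current)
    return order

if __name__ == "__main__":
    surface = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (1, 1, 0), (0, 1, 0)]
    rng = random.Random(0)
    orders = [neighbour_walk_order(surface, rng) for _ in range(5)]
    for o in orders:
        print(o)
```

Calling the function about 1000 times with the same random generator gives a set of different orderings comparable in number to what is described above.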
Because this type of screening involves pair-wise comparisons only between 1-D profiles derived from homologous proteins, aligning the two profiles has been straightforward in this application - positions that are aligned in a sequence-based multiple sequence alignment for the protein family define the pair-wise alignment of the 1-D profiles.
Now we are taking the first steps towards using the same idea for a different application: heterodimeric partner prediction between paralogous sets (e.g. to help predict which CDK (cyclin-dependent kinase) homolog might interact specifically with which cyclin homolog). In other words, we would like to rank hypothetical protein complexes, whose subunits were forced together using modelling techniques or docking, based on the complementarity of their 1-D electrostatic surface profiles.
In this application, we have to worry not only about providing complete orderings of the residues that are considered (that enrich 1-D neighbors that are also 3-D neighbors on the surface, as above) but we also have to align the two profiles that now stem from non-homologous proteins (the two interacting proteins).
We have thought up an algorithm for accomplishing this, i.e. to generate aligned complete orderings of the interface residues on each subunit so that they have the best chance to contribute to the "complementarity signal" (Rupert Koenig, Diploma project, available). This algorithm involves a partially guided, partially randomized, ordering process reminiscent of a "walk across the interface" of the hypothetical complex. The project is to implement this interesting new idea and to evaluate its usefulness for the problem at hand.
Function prediction
Disorder and domain prediction in proteins
Michael Rexach (http://bio.research.ucsc.edu/people/rexach/)
has "recently completed a small project that used amino acids
substitution 'rates' in protein domains (among orthologs in
four Saccharomyces species) combined amino acids composition
(% order-conferring aa/ % disorder conferring aa) to predict
boundaries for domain structure and function. This was
particularly insightful for protein that have a significant
content of disorder-conferring amino acids, such as the
nucleoporins we study. The analysis can be computerized and
done for the entire yeast proteome. The analysis predicts four
types of domains: 1) folded domains with no functional
significance other that to provide a needed structure; 2)
folded domains with function (enzymatic, protein interactions,
etc); 3) unfolded domain with no function (a linker) or 4)
unfolded domains with important functions."
This project combines comparative genomics (looking at differences among the 4 Saccharomyces species) with protein informatics. There are two versions of the project: a straightforward implementation of Rexach's method and its application to the yeast proteome, or an improvement of the method to include other data sources (such as homologous proteins from more distant species) and other statistical measures for predicting disorder and function.
Rexach has provided an unpublished manuscript which should
be readable on campus or with the class password: rexach.pdf. He says "The most
relevant figures are fig. 3, table 2, and supplemental figure
6. The second part of our manuscript (which was not published
because it lacked a proteome-wide analysis as I am suggesting
above) was the beginning of a similar analysis at a proteome
wide level, the results were interesting, but required
bioinformatics tools that are beyond my expertise."
Validating suspected sites of adaptive evolution on Complement Receptor 1 (CR1) through Ka/Ks calculations
We have recently produced individual models of all 30 so-called CCP (or sushi) modules of CR1 (available here). Each module of this surface protein, which is involved in the human immune response, is approximately 60 amino acids long and adopts the typical CCP-module fold (also known as the complement control module, or SCR, fold). Two conserved disulfide bridges and one buried, conserved tryptophan residue maintain the fold, while surface residues vary greatly between modules.
Different modules within CR1 may interact with different partner proteins (many of which remain unknown to date), and the locations of their interaction sites are not necessarily conserved with respect to the common structural scaffold defining the fold. Using a variety of new and existing methods, we have generated predictions regarding which modules may have undergone adaptive evolution (i.e. evolved under positive selection).
These predictions are based on protein sequence and structure-based considerations only. By contrast, the project is to investigate whether positive selection can be detected in the corresponding gene (DNA) sequences of closely related homologues, through computation of Ka/Ks (non-synonymous versus synonymous mutation) rates, and to pinpoint possible interaction sites in this way.
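To make the Ka/Ks idea concrete, here is a minimal sketch of a simple counting approach for two aligned coding sequences. It is a crude illustration under stated assumptions, not a method prescribed by this project: it uses the standard genetic code, skips codons that contain gaps or that differ at more than one position, counts mutations to stop codons as non-synonymous, and applies no multiple-hit correction. The function names `syn_sites` and `crude_pn_ps` are hypothetical. A real analysis should use an established Ka/Ks method.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def syn_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        for b in BASES:
            if b != codon[pos]:
                mutant = codon[:pos] + b + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    s += 1.0 / 3.0   # one of three possible changes here
    return s

def crude_pn_ps(seq1, seq2):
    """Return (pN, pS, pN/pS) for two aligned, in-frame DNA sequences.

    pN and pS are uncorrected proportions of non-synonymous and
    synonymous differences per site; the ratio is None if pS is zero.
    """
    seq1, seq2 = seq1.upper(), seq2.upper()
    syn_diff = nonsyn_diff = n_codons = 0
    s_sites = 0.0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 not in CODE or c2 not in CODE:
            continue                      # gap or ambiguous base
        ndiff = sum(a != b for a, b in zip(c1, c2))
        if ndiff > 1:
            continue                      # would need pathway averaging
        n_codons += 1
        s_sites += (syn_sites(c1) + syn_sites(c2)) / 2.0
        if ndiff == 1:
            if CODE[c1] == CODE[c2]:
                syn_diff += 1
            else:
                nonsyn_diff += 1
    n_sites = 3 * n_codons - s_sites
    p_s = syn_diff / s_sites if s_sites > 0 else 0.0
    p_n = nonsyn_diff / n_sites if n_sites > 0 else 0.0
    ratio = p_n / p_s if p_s > 0 else None
    return p_n, p_s, ratio
```

A ratio well above 1 at a site or region would hint at positive selection; for module-level or site-level analysis the same counts would be gathered over windows or over many sequence pairs.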
Structure analysis
Finding crevices that might bind nanotubes
We are planning to try to design proteins to selectively bind
carbon nanotubes of specific chiralities. One approach we are
pursuing is to redesign crevices in existing proteins. For
this to work, we need to find crevices of approximately the
right shape, then run a design program (such as RosettaDesign)
to select different residues to line the crevice with.
This project involves scanning a subset of PDB (1,000 to 3,000 structures) looking for holes or grooves appropriate for binding a cylinder. Because of the symmetries of a cylinder, there are only 4 degrees of freedom to sample, making this a simpler problem than the usual 6-degree-of-freedom rigid docking problem. We need to specify the axis of the cylinder, then count atoms in two cylindrical shells around the axis. For example, a crevice for binding a nanotube with a 5 Å radius should minimize the number of backbone atoms within 5 Å of the axis, but maximize the number of atoms within 7 Å of the axis.
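The counting step can be sketched in a few lines. This is a minimal Python illustration, not undertaker code: the function name `count_within_radii` is hypothetical, the cylinder is assumed to be infinitely long, and its axis is assumed to be parallel to the z-axis (the protein having been rotated beforehand). Radii are in the same units as the coordinates (Angstroms in PDB files).

```python
import math

def count_within_radii(atoms, axis_xy, radii):
    """Count atoms within each radius of a z-parallel cylinder axis.

    atoms   -- iterable of (x, y, z) coordinates
    axis_xy -- (x0, y0), where the axis crosses the xy-plane
    radii   -- list of radii; one cumulative count is returned per radius
    """
    x0, y0 = axis_xy
    counts = [0] * len(radii)
    for x, y, _z in atoms:
        d = math.hypot(x - x0, y - y0)   # distance to the axis ignores z
        for i, r in enumerate(radii):
            if d <= r:
                counts[i] += 1
    return counts

# Three atoms at distances 1, 6, and 8 from the z-axis.
atoms = [(1.0, 0.0, 0.0), (0.0, 6.0, 10.0), (8.0, 0.0, -3.0)]
inner, outer = count_within_radii(atoms, (0.0, 0.0), [5.0, 7.0])
```

Passing only backbone atoms for the inner radius and all atoms for the outer radius gives the two numbers that a crevice score would trade off; the count in a shell is the difference between two cumulative counts.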
To sample cylindrical axes, we can split the problem into a 2D orientation and a 2D translation. For the orientation, we can sample points on the unit sphere, and rotate the protein about the origin so that the sampled point is at (0,0,1). Our cylindrical axis will now be parallel to the z-axis, and we can sample (x,y) points to get the different cylindrical axes.
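The split into orientation and translation can be sketched as follows. This is a minimal Python illustration with hypothetical function names (`random_unit_vector`, `rotation_to_z`, `rotate`, `grid_points`); it assumes random sampling of the sphere and a regular grid for the translation, which is only one of the possible sampling schemes.

```python
import math
import random

def random_unit_vector(rng):
    """Sample a point uniformly on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)

def rotation_to_z(u):
    """3x3 rotation matrix that takes the unit vector u to (0, 0, 1)."""
    ux, uy, uz = u
    if uz < -1.0 + 1e-12:
        # u points along -z: rotate 180 degrees about the x-axis.
        return [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
    k = 1.0 / (1.0 + uz)
    return [
        [1.0 - k * ux * ux, -k * ux * uy, -ux],
        [-k * ux * uy, 1.0 - k * uy * uy, -uy],
        [ux, uy, uz],
    ]

def rotate(matrix, p):
    """Apply a 3x3 matrix to a point."""
    return tuple(sum(matrix[i][j] * p[j] for j in range(3)) for i in range(3))

def grid_points(xmin, xmax, ymin, ymax, step):
    """Regular grid of (x, y) positions for the z-parallel axis."""
    nx = int(math.floor((xmax - xmin) / step + 1e-9)) + 1
    ny = int(math.floor((ymax - ymin) / step + 1e-9)) + 1
    return [(xmin + i * step, ymin + j * step)
            for i in range(nx) for j in range(ny)]

# One sampled orientation: rotate every atom, then scan the (x, y) grid.
rng = random.Random(0)
direction = random_unit_vector(rng)
matrix = rotation_to_z(direction)
atoms = [(1.0, 2.0, 3.0), (-4.0, 0.5, 2.0)]
rotated = [rotate(matrix, a) for a in atoms]
axes = grid_points(-10.0, 10.0, -10.0, 10.0, 2.0)
```

Because a cylinder looks the same from either end, sampling only one hemisphere of directions would be enough; the sketch samples the full sphere for simplicity.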
The sampling can be deterministic (grids) or random, or a combination of the two. When somewhat promising cylinders are found, we can do finer sampling in the neighborhood of the good cylinders.
Since we are interested in several different nanotube diameters from about 6.75 Å to about 16.2 Å, we should probably collect the number of atoms (backbone atoms and all atoms) in several different cylindrical shells simultaneously.
This project is probably best done by adding a few new
commands to the undertaker program. The program already has
code for reading structures, for doing rigid transformations,
and for rapidly determining how many atoms are in a sphere.
Extending the XYZbins class to count points in cylindrical
shells should be easy, and the sampling of cylindrical
axes should be fairly straightforward for anyone familiar with
C++.
Analyzing ring-stacking interactions in proteins
For our project on designing proteins to bind nanotubes, we
expect to need terms in our cost or energy functions that
model the stacking of aromatic rings (from phenylalanine,
tyrosine, and tryptophan) against the graphite lattice of the
nanotube. Since we don't have much structural experimental data for
interactions between aromatic amino acids and graphite
lattices, we will have to generalize from the ring-stacking
interactions that we have seen.
This project involves small modifications to undertaker to gather statistics about interactions between aromatic amino acids observed in PDB. The code may be similar to existing code that was used for gathering statistics about hydrogen-bond geometry. Which statistics we gather and how we make a cost (energy) function out of them will depend on what interactions we observe most frequently.
One approach would be to generate different statistics for each of the nine ordered pairs of aromatic amino acids (16 if we decide to include histidine). We could transform one sidechain into a standard frame of reference (say based on CB, CG, and CD1) and gather statistics about any other aromatic rings (location and orientation) that come within 4 Angstroms of atoms in the selected sidechain. We are particularly interested in rings that are nearly parallel, since that is the mode of interaction we are expecting to see with nanotubes.
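Two of the quantities one might gather are the angle between the ring planes and the distance between ring centroids. The sketch below is a minimal Python illustration, not undertaker code; the function names are hypothetical, and it assumes each ring is planar enough that three of its atoms define the plane.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ring_normal(p0, p1, p2):
    """Unit normal of the plane through three ring atoms."""
    n = _cross(_sub(p1, p0), _sub(p2, p0))
    length = math.sqrt(_dot(n, n))
    return (n[0] / length, n[1] / length, n[2] / length)

def centroid(points):
    """Average position of a list of atoms."""
    count = len(points)
    return tuple(sum(p[i] for p in points) / count for i in range(3))

def interplanar_angle(n1, n2):
    """Angle between two ring planes in degrees, from 0 (parallel) to 90."""
    c = min(1.0, abs(_dot(n1, n2)))   # the sign of a normal is arbitrary
    return math.degrees(math.acos(c))

# Two stacked rings in parallel planes, 3.5 apart, and one edge-on ring.
ring_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
ring_b = [(x, y, z + 3.5) for x, y, z in ring_a]
ring_c = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0)]
stacked = interplanar_angle(ring_normal(*ring_a), ring_normal(*ring_b))
edge_on = interplanar_angle(ring_normal(*ring_a), ring_normal(*ring_c))
```

Histogramming these angles and distances separately for each ordered pair of residue types would give the raw statistics described above.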
If we find some good regularities, we can create a new cost function for undertaker that provides a roughly -log probability cost for different orientations of nearby rings. The hardest part may be figuring out how much bonus there should be for the rings being close in the first place.
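Turning binned counts into a -log probability cost can be sketched as follows. This is a minimal Python illustration, not undertaker code: the function name `neg_log_prob_costs`, the pseudocount, and the example bins are assumptions, and the sketch does not address the question of how much bonus to give for proximity.

```python
import math

def neg_log_prob_costs(counts, pseudocount=1.0):
    """Convert histogram counts into one -log(probability) cost per bin.

    The pseudocount (an assumed value) keeps empty bins from getting an
    infinite cost.
    """
    total = sum(counts) + pseudocount * len(counts)
    return [-math.log((c + pseudocount) / total) for c in counts]

# Hypothetical counts for three interplanar-angle bins (0-30, 30-60, 60-90).
costs = neg_log_prob_costs([8, 1, 1])
```

Frequently observed orientations get a low cost and rare ones a high cost. In practice the counts would probably also need to be compared against what random orientations would give, so that the cost reflects a preference rather than just bin size.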
One possible problem with this whole approach is that the aromatic rings of the sidechains are not quite the same as the graphite lattice. The slight difference in electronegativity between hydrogen and carbon means that the center of the phenylalanine ring is a very weak hydrogen-bond acceptor and the rings may have a favorable interaction at right angles. The nitrogens in tryptophan and histidine may cause this effect to be even stronger. We do not expect the carbon nanotubes to form these weak hydrogen bonds, so our cost function may need to be different for interactions between amino acids and between amino acids and the graphite lattice.
In addition to creating new cost functions for undertaker, we would like to add a ring-stacking term to the Rosetta cost function, as we expect to do large portions of our design work with RosettaDesign.
Note: this project is an open-ended project that is much
too large for just a BME 220 project. It could well lead to a
Master's thesis or even a PhD thesis, but there is an initial
component (gathering statistics) that should fit comfortably
in the 10-week time frame.
A survey of disulfide bond "properties" in surface proteins
One of the noticeable characteristics of proteins that protrude from cell surfaces is the abundance of disulfide bridges. Accordingly, many well-known surface proteins are recognized by sequence motifs/signatures based partly on the highly conserved cysteines involved.
Since disulfide bonds are geometrically (3-D structurally) more restrained than a simple distance constraint would suggest, not all combinations of cysteines are favorable; for example, disulfide bridges between sequence-neighboring cysteines are rarely observed. While there have been previous surveys of disulfide bonds found in 3-D structures stored in the Protein Data Bank (PDB), or rather inferred from the atomic coordinates, there is now a large enough number of experimentally solved surface protein/domain structures to inform a specialized study surveying their 3-D structural "properties" and their sequence "properties" (such as conservation patterns).
The project is to
A more promising approach is to design a helical multimer that self-assembles on the surface of the nanotube. If the symmetry operation from one monomer to the next matches a symmetry of the nanotube, then the cooperative binding of many monomers should be able to distinguish even subtle differences between nanotubes.
Actually, we probably want a helical arrangement of dimers, rather than of monomers, since the nanotubes are symmetric under a 180-degree rotation about an axis perpendicular to the tube axis. If we used an asymmetric unit for forming the helical multimer, there would be two different orientations that would work, and we would get discontinuities (like crystal grain boundaries) between regions of each orientation. If the unit for the helical multimer has an appropriate 180-degree rotational symmetry, then there may be only one orientation for the helical multimer and no grain boundaries.
As a BME 220 project, there are several aspects of this project that can be worked on. One is to try packing existing proteins around nanotubes and using RosettaDesign to repack the interfaces. Another is to try designing a monomer or dimer from scratch that has an appropriate shape for making the helical multimers (again using RosettaDesign, this time to design the entire backbone).
For those looking for programming projects, undertaker could use some modifications to handle helical multimers. Currently it has only a crude mechanism for handling cyclic multimers (using a single rigid transform as a generator), but the multimers that we are interested in involve 2 transforms: one for taking the monomer to its dimeric partner, the other for taking one dimer to the next in the helix. The helix is also an infinite object, so limitations on the size of the created complex are needed.
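The two-generator construction can be sketched as follows. This is a minimal Python illustration, not undertaker code: the function names are hypothetical, the helix axis is assumed to be the z-axis, and the dimer's two-fold axis is assumed to be the x-axis. The range of `k` values is the limit on the size of the otherwise infinite helix.

```python
import math

def screw(point, angle_deg, rise, k):
    """Apply the helical transform k times: rotate about z, shift along z."""
    a = math.radians(angle_deg * k)
    x, y, z = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z + rise * k)

def dyad(point):
    """Two-fold rotation about the x-axis: monomer to its dimeric partner."""
    x, y, z = point
    return (x, -y, -z)

def helical_copies(monomer, angle_deg, rise, k_min, k_max):
    """Return 2 * (k_max - k_min + 1) copies of the monomer.

    monomer -- list of (x, y, z) atom coordinates
    """
    partner = [dyad(p) for p in monomer]
    copies = []
    for k in range(k_min, k_max + 1):
        for unit in (monomer, partner):
            copies.append([screw(p, angle_deg, rise, k) for p in unit])
    return copies

# A one-atom "monomer" 10 units from the axis; 3 dimers along the helix.
copies = helical_copies([(10.0, 0.0, 1.0)], angle_deg=40.0, rise=5.0,
                        k_min=-1, k_max=1)
```

Every copy is generated from just the monomer, the dyad, and the screw transform, so a conformation search only needs to vary the monomer and the two transforms.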
Undertaker also needs modification for handling nanotubes.
Currently, undertaker has no representation for ligand atoms.
Carbon nanotubes have only one sort of ligand atom, so the
extension of the atom alphabet is easy, but setting parameters
for clashes and other cost functions may be difficult, as
there is not much data about how nanotubes interact with amino
acids.
Testing back-propagation as a protein design method
In spring 2005 Jes Frellsen and I made some modifications to
predict-2nd so that it could be used for protein design.
(I made further changes in summer 2005.)
The idea is to take several of the local-structure-predicting neural
nets, and fix their output at the desired local structures, then use
back-propagation to optimize the inputs.
One starts with an initial random sequence, optimizes using the neural
nets to get an input profile, then samples from that profile and
reoptimizes.
This can be repeated to get a large sample of possible sequences that
should have good local properties, which could then be fed into
undertaker, rosetta, or gromacs to try to find sequences with good
tertiary properties.
We still need more experimentation with various networks and local structure alphabets, to see which combinations result in the best designs. Since we have neither the time nor the facilities for fabricating and testing large numbers of proteins, we will have to rely mainly on computational methods. One simple test is to look at the percent sequence recovery: given just the backbone for a real protein, try to design amino acid sequences that would fold to that conformation. The closer the designs are to the real protein, the more convincing the design method. Such a study should be done on hundreds of different backbones from radically different structures. Some work was done on this in Winter 2006, and the sequence recovery was somewhat less than we wanted. Only two local structure alphabets were used, and the neural nets were not as good at prediction as the ones we now use, so we may be able to do better now.
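Percent sequence recovery is simple to compute once the designed and native sequences are in hand. The sketch below is a minimal Python illustration with a hypothetical function name; it assumes the two sequences have the same length and are compared position by position, with no alignment.

```python
def percent_recovery(native, designed):
    """Percentage of positions where the design matches the native residue."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return 100.0 * matches / len(native)

# Three of the four positions agree.
recovery = percent_recovery("ACDE", "ACDF")
```

Averaging this number over many designs per backbone, and over many backbones, would give the kind of comparison between networks and alphabets described above.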
There are lots of variants that could be tried in the sampling of
sequences from the back-propagation.
For example, one could resample the
sequence points where the back-propagated derivative for the current
residue is negative rather than resampling all points.
Instead of sampling from the profile one could take the most probable
residue of the profile.
After generating a few hundred sequences in this way from random
starting points, one could make a profile from the sequences, and
score each sequence with that profile. Further tweaking could take
just the best 100 or so sequences and the profile generated from them.
Each could be used with that profile as a starting point for the
back-propagation and resample process.
A tertiary scoring system (such as Rosetta's) could be used to pick
out the best sequences from the list, and use the resulting sequences
and profile to do more sampling.
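Two of the variants mentioned above, sampling from the profile and taking its most probable residue, can be sketched as follows. This is a minimal Python illustration, not predict-2nd code: the function names are hypothetical, and the profile is assumed to be a list with one dictionary per position that maps residue letters to probabilities.

```python
import random

def sample_sequence(profile, rng):
    """Draw one residue per position according to the profile."""
    residues = []
    for column in profile:
        letters = list(column)
        weights = [column[letter] for letter in letters]
        residues.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(residues)

def most_probable_sequence(profile):
    """Take the highest-probability residue at each position."""
    return "".join(max(column, key=column.get) for column in profile)

# A hypothetical three-position profile.
profile = [{"A": 0.9, "G": 0.1}, {"L": 1.0}, {"K": 0.3, "E": 0.7}]
rng = random.Random(0)
samples = [sample_sequence(profile, rng) for _ in range(5)]
consensus = most_probable_sequence(profile)
```

Sampling gives a diverse set of starting points for reoptimization, while the most probable sequence gives a single deterministic candidate.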
Training neural nets for protein design
Another approach to using neural nets would be to build a neural net whose input was many structural properties and whose output was a sequence. Such a neural net could be trained in the same manner as current nets for prediction (with inputs and outputs swapped), and would provide a probability distribution for amino acids at each position. This would require a few changes to the neural net code (mainly providing an input layer that accepts several alphabets, instead of just one alphabet), and would more directly provide probability distributions to sample sequences from.
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250