Projects list: protein structure prediction

(Last Update: 22:45 PDT 5 April 2007 )

This is a list of projects that need to be done for protein structure prediction. Note: there is no requirement that the project you do come from this list; it is merely a set of projects that we are interested in. If you can come up with your own project and convince a faculty member to supervise it, that can be even better. New collaborations with biologists and biochemists are particularly welcome.

There are still a number of viable projects on last year's list that have not been moved forward to this year's list. If there is one on that list that interests you, check with Kevin Karplus to see if it is still a viable project.

I've broken the list up into different categories---but some projects straddle these divisions, so look through all of them.

Visualization and manipulation of structures

Web server infrastructure

Local structure prediction

Multiple Alignment and Target-template alignment

Fold recognition and template selection

Tertiary structure prediction

Application of tertiary structure prediction

Assessment of predicted tertiary structures

Protein complexes

Function prediction

Structure analysis

Designing proteins


Here is an alternative classification:

Using existing tools

Modifying or creating new tools


Visualization and manipulation of structures

Manipulating proteins with a Wii remote

Ian Rickard, a CMPS grad student, has been modifying Wii Remote controllers ("Wiimotes") to be able to do full 6-degree-of-freedom motion capture using one hand-held and two stationary remotes. We would like to hook up this controller to a protein visualization program (most likely Pymol, as it has the highest quality rendering, including important features like transparent surfaces) so that we can manipulate selected portions of a protein.

This project involves several components:

Improving the user interface for Pymol

We have been primarily a rasmol shop, as Rasmol has an easy-to-use, easy-to-learn interface, and runs quite quickly even on old machines. We are leaning toward moving to Pymol, because it can produce better quality images, particularly for journal publication. The drawing time is often too slow on old machines, but using modern machines with hardware-supported OpenGL libraries has made this less of a problem. We still have problems with the highly non-intuitive user interface and awkward mouse control.

The user interface of Pymol should be relatively easy to modify to make it more rasmol-like. It would be nice if rasmol selection and highlighting scripts that we produce could be used by the modified interface, as well as using the same mouse conventions as rasmol.

The project requires learning the user interfaces for both rasmol and pymol, and modifying the Python code of pymol.

Web server infrastructure

Model Quality Assessment Server

In 2006, for CASP7, a new category was created for CASP: model quality assessment programs (MQAP). In this category, servers are given a bunch of models generated by other servers and asked to evaluate them. We did not participate in this category in CASP7, but would like to for CASP8.

http://predictioncenter.org/casp7/doc/casp7-format.html#QA explains the two types of evaluation asked for and the two formats. Most participants submitted just a global score for the entire model, not per-residue error estimates.

We now have the ability to evaluate models using undertaker cost functions. In fact, informal observation on the CASP7 targets indicates that our cost functions would be a fairly good model quality assessment measure.

This project is really two parts:

  1. Testing various variants of the undertaker cost functions as a model-quality assessment measure on the CASP7 examples. If we use only cost functions that do not require local structure prediction, the assessment can be quite fast---how good is that compared with using cost functions that include predicted burial, helix constraints, strand constraints, residue-residue contact prediction, ... ?

    Note that we are not optimizing conformations using these cost functions, but only evaluating server-predicted models, so some terms that are important in optimization may be irrelevant in selecting good models from servers (if, for example, the servers all do a good job of avoiding clashes, then clash detection may be irrelevant in choosing among models). We are particularly interested in identifying the best models, not so much in ranking the low-quality ones.

  2. Setting up a script, server, or makefile target to do MQAP for CASP8. Since we did not participate in this category in CASP7, I'm not sure how the MQAP servers were queried. The server models were available as a tarball, but I don't know if this was pushed at servers or had to be fetched by the predictors. Some communication with the CASP organizers about how this will be done in CASP8 may be needed.

    Our current undertaker cost functions are on a scale from -infinity to +infinity, so we would have to rescale them so that -infinity->1.0 and +infinity->0.0, to match the range expected in the "QA" format.

    The CASP organizers turned alignment-only predictions into backbone-only predictions, but many of our cost functions really work much better if there are sidechains present. We can automatically run the alignment backbones through scwrl, and we should evaluate models both before and after scwrling, perhaps taking the better score as the assessment.
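The rescaling of cost values into the [0, 1] range mentioned above could be sketched as follows. This is only one possible choice (a logistic function), not an existing part of undertaker; the `midpoint` and `scale` parameters are hypothetical and would have to be calibrated on the CASP7 models.

```python
import math


def cost_to_quality(cost, midpoint=0.0, scale=1.0):
    """Map a cost in (-inf, +inf) to a quality score in [0, 1].

    Low (good) costs map toward 1.0 and high (bad) costs toward 0.0.
    `midpoint` and `scale` are hypothetical calibration parameters:
    `midpoint` is the cost that maps to 0.5, and `scale` sets how
    quickly the score changes around it.
    """
    z = (cost - midpoint) / scale
    # Numerically stable evaluation of 1 / (1 + exp(z)).
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

Any strictly decreasing function with these limits would satisfy the format; the logistic is simply easy to calibrate with two numbers.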

User requirement analysis and case studies for a UCSC Proteome Browser Update

The UCSC genome browser is one of the most popular resources/tools for biologists interested in exploring the human genome, or other genomes. In addition, its emphasis on comparative genomic analyses makes it suitable for bioinformatic analyses. Various linked resources exist; one of them is the UCSC Proteome Browser (Hsu et al., Nucl. Acids Res. 33, D454-D458 2005) as it was conceived in 2004.

The project is to first investigate the currently implemented features and abilities of the UCSC Proteome Browser, and to explore what possible updates and additions would increase the resource's usefulness in a variety of (computational) case studies. The next step is then to select one of the improvements that you have recommended and implement it, either directly with the Proteome Browser team, or as a separate add-on.

Local Structure prediction

Combining local structure predictions

For the past few years, we have been using a script called RDBCombine to merge neural net predictions (possibly over different, but related alphabets) to get a consensus prediction. We have never tested this method to see if it does better than simply taking a neural net optimized for the output alphabet, nor have we tried tuning any parameters related to it.

RDBCombine can combine RDB-format prediction files for different alphabets. It translates the alphabet using the contingency tables output by compare-real, and weights them by the mutual information between the predicted alphabet and the desired alphabet, and does a weighted average of the translated predictions. It is not clear what the best way to do the combining is—simply averaging the probability vectors is probably not optimal, though it is easy. We may want to look at combining the log-odds scores (log (P_predicted/P_background)), for example.
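To make the two combining options concrete, here is a minimal sketch comparing a weighted average of probability vectors with a weighted average of log-odds scores. It is not the RDBCombine code; the function names are made up for illustration, the predictions are assumed to be already translated to the desired alphabet, and all probabilities are assumed to be strictly positive.

```python
import math


def average_probs(predictions, weights):
    """Weighted arithmetic mean of probability vectors (one per predictor)."""
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * p[k] for p, w in zip(predictions, weights)) / total
        for k in range(n)
    ]


def combine_log_odds(predictions, weights, background):
    """Weighted mean of log(P_predicted / P_background), mapped back to
    a probability vector by multiplying by the background and renormalizing.

    Assumes every probability is > 0 (add a pseudocount first if not).
    """
    total = sum(weights)
    n = len(background)
    unnormalized = []
    for k in range(n):
        score = sum(
            w * math.log(p[k] / background[k])
            for p, w in zip(predictions, weights)
        ) / total
        unnormalized.append(background[k] * math.exp(score))
    z = sum(unnormalized)
    return [u / z for u in unnormalized]
```

With a single predictor both functions return the input vector unchanged; they differ only when predictors disagree, where the log-odds version behaves like a weighted geometric mean.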

We have a script (somewhere!) to evaluate RDB files for Qn and bits saved (and we could add the SOV measure to the evaluations), so different methods for combining predictions can be easily tested with only fairly minor script writing. Writing a new script may be almost as easy as finding the old one, though!

Local structure prediction using HMMs

One approach to improving secondary structure prediction that we have tried, and that may be worth improving on, is to use the fold-recognition target HMM to align the template set and gather statistics about what secondary structure codes align to each position. Since the target HMM can be a multi-track HMM using neural-net predictions, this method should be able to do at least as well as the neural-net methods. The method would be particularly valuable when there is a close homolog in the template library, as it would be able to "look up" the right answers. This makes testing more difficult, since it is easy to fool oneself about how well the method is working, but would be very useful in a tool used by bench biologists, as the easy predictions would be right (not necessarily the case for neural nets). It is used in the SAM-T99 server, I believe, but not the SAM-T02 and SAM-T06 servers. We should test it properly and use it in the SAM-T07/8 server.

Mark Diekhans implemented this several years ago in SAM, but I am not sure of the details of his implementation. In his tests, the method improved Q3 scores slightly, but made the information gain worse than the neural nets used to create the HMM tracks. Josue Samayoa did more careful tests in Fall 05, and found that the current implementation does not make the results of the neural net prediction substantially better or worse. Somewhat surprisingly, including the true sequence in the library of structures that were scored by the HMM did not change the result much. This implies to me that the probability of the sequence is not being used adequately to weight the predictions---having a perfect fit in the sequences being scanned should result in near-perfect predictions.

Several things could be done to improve this method:

Multiple Alignment and Target-template alignment

Testing pairwise alignments

Our lab has recently been developing a new pairwise-alignment test that more closely measures the properties we want from a good pairwise alignment. The new test uses the alignment to create an incomplete model of the protein by copying the backbone and adding the appropriate sidechains, then evaluates the quality of the resulting 3D model.

A set of about 300 protein pairs at different levels of difficulty has been selected to cover the full range of alignment problems.

There are some details of the test protocol that still need to be tweaked (like exactly what cost function to use in undertaker to evaluate the model), but mostly this project consists of parameter tweaking on different alignment methods (mainly HMM-based methods) and measuring the quality of the resulting alignments.

Perhaps the most important alignment testing is for profile-profile alignments. The best methods in CASP for the past few years have used profile-profile methods, but we have not yet tested and incorporated profile-profile methods in the SAM protocols. Richard Hughey has implemented a form of HMM-HMM alignment, but it has *still* not been tested.

Fold recognition and template selection

  • Orthology prediction

    (Josh Stuart)

    This project is to explore the possible benefit of using domain content information for the orthology prediction problem. That is, to use tools for identifying the domain structure of protein sequences to help find orthologous proteins between different species.

    Predicting orthologs in eukaryotes is a challenging problem. Protein families can undergo several duplication events in a single lineage, making it difficult to identify functional orthologs versus divergent paralogs. The popular approach is to use sequence-similarity based approaches, such as reciprocal best BLAST comparisons, to predict orthology. See for example the recently published Multiparanoid method [Alexeyenko A, Tamas I, Liu G, Sonnhammer EL. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics. 2006 Jul 15;22(14):e9-15. doi:10.1093/bioinformatics/btl213 ].

    These approaches are limited because the similarity scores ignore the importance of different regions in the compared protein sequences. Intuitively, differences observed in positions that are highly variable should detract less from an estimate of similarity compared to differences observed in positions that are highly conserved.

    The question this project seeks to address is whether using conserved functional domains provides any independent information that can help resolve orthology predictions made by sequence similarity approaches. One could devise a score that measures the similarity between two protein sequences based on the set of conserved PFAM domains each contains. To evaluate different score choices or to compare to sequence similarity prediction alone, one could compare the predictions with the agreement of coexpression neighborhoods, conserved synteny, or curated sets of orthologs.

    This project may not require very deep programming, but will require running a lot of HMMer (Pfam) jobs from scripts and coming up with intelligent ways to combine the results.
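As one very simple example of a domain-content score, the sketch below computes the Jaccard similarity between the sets of domains found in two proteins. This is only an illustrative starting point, not a proposed final score: it ignores domain order, copy number, and hit significance, and the domain labels in the usage note are placeholders rather than real Pfam accessions.

```python
def domain_jaccard(domains_a, domains_b):
    """Jaccard similarity between two collections of domain identifiers.

    Returns |A & B| / |A | B|, or 0.0 when neither protein has any
    annotated domain (an arbitrary convention chosen for this sketch).
    """
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, `domain_jaccard(["domA", "domB"], ["domA"])` gives 0.5, since one of the two distinct domains is shared.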

    Converting RDB files to HMMs

    We currently use a TCL script (2nd-rdb-to-sam-model) for converting the RDB-formatted output of our local structure predictions to SAM HMM format, for inclusion in multi-track HMMs. It would be better if the SAM code understood the RDB format directly, so that no translation script was needed. Indeed, if the SAM code could read the RDB format directly, it would be trivial to write the translation program in C, since SAM already has routines for outputting the HMM format. The 2nd-rdb-to-sam-model script is the only TCL script in the SAM package, and it would be good to eliminate it.

    Although there is not much code needed for adding this functionality to the SAM package, a lot of code would need to be read in order to understand the internal data structures and parameter passing mechanisms of SAM.

    Combining fold recognition results

    Our fold-recognition methods treat every template in the library as independent of the rest, but we actually have many templates which contain similar folds. Various ways have been proposed for combining template scores to get a "fold" score, and a couple of students have done preliminary tests of the "product-of-pvalues" method, and shown that it might help, but their projects were never taken to the point where the method could be incorporated into an automatic prediction method.
    [ Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998;14(1):48-54.]
    [ Bailey, Timothy L. and Grundy, William N. Classifying proteins by family using the product of correlated p-values. Proceedings of the Third international conference on computational molecular biology (RECOMB99), April 11-14, 1999. pp. 10-14. ]

    One possible project is to incorporate the "product-of-pvalues" method into our automatic predictions. This requires designing a way to update the calibration parameters for the combining method automatically as new templates are added to the library, and converting the fold results back into selection of good templates. (This needn't be a weekly update, as the SCOP classification is only updated once or twice a year.)

    Another possible project is to devise a different combining method. The "product-of-pvalues" method treats all templates in a fold class as equally informative and does not use information from templates in competing fold classes. One can envision combining methods such as logistic regression that could use the extra information to get better predictions. One danger is that multi-domain proteins correctly have multiple correct folds, so simple competition between fold classes is not quite the right model.
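For reference, the basic calculation behind a product of p-values can be sketched as follows. The formula is the standard one for the product of n independent uniform p-values, P = p * sum_{i=0}^{n-1} (-ln p)^i / i!, where p is the product. Templates within a fold class are unlikely to be independent, so this is only the uncorrected starting point, and the function name is made up for illustration.

```python
import math


def product_pvalue(pvalues):
    """P-value of observing a product of p-values this small or smaller,
    assuming the n individual p-values are independent and uniform under
    the null hypothesis.
    """
    n = len(pvalues)
    prod = math.prod(pvalues)
    if prod == 0.0:
        return 0.0
    neg_log = -math.log(prod)
    term = 1.0   # (-ln p)^i / i! for i = 0
    total = 1.0
    for i in range(1, n):
        term *= neg_log / i
        total += term
    return prod * total
```

With a single p-value the function returns it unchanged; with several, a set of moderately small p-values can combine into a much more significant fold-level score.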

    Tertiary structure prediction

    Undertaker cost functions for local structure alphabets

    We have many neural networks for predicting various local structure properties of individual residues of a protein. These neural networks do a much better job of predicting the properties than a simple residue-propensity scale, but our tertiary prediction program (undertaker) only uses propensity-based cost functions for evaluating most of these properties.

    We have code in undertaker for accepting neural net predictions and creating cost functions from them for some local structure alphabets (the alpha angle alphabets and the burial alphabets), but several of our best alphabets (str2, Bystroff, de Brevern, N-notor, O-notor, N-sep, O-sep) have neural-net predictors but do not yet have neural-net-based cost functions.

    This project requires learning how the undertaker cost functions are implemented and adapting a fairly small amount of C++ code to handle other local structure properties.

    Distance constraints from alignments

    In CASP7, we noticed that the sheet constraints that we generated from alignments were doing an excellent job of hanging onto sheets that we found in fold-recognition targets, but that all-helical proteins were often being disrupted by the attempts to close gaps and construct loops.

    We would like to generate constraints for contacts between residues that are widely separated along the chain based on contacts seen in alignments, and test the value of different ways of choosing and weighting the contacts in selecting good models for the target protein.

    This project requires learning how undertaker represents 3D structures and detects atoms that are close, plus writing some C++ code to output constraints. The code should be fairly easy, as it can be modeled on the existing code for generating sheet constraints and helix constraints.
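The kind of contact extraction described above can be prototyped outside undertaker before writing any C++. The sketch below is only an illustration, assuming one representative coordinate per residue (for example CB) in chain order; the 8 Angstrom distance threshold and the minimum chain separation of 10 are hypothetical defaults, and the output tuples are not in undertaker's constraint format.

```python
import math


def long_range_contacts(coords, max_dist=8.0, min_separation=10):
    """Find residue pairs that are close in space but far apart in sequence.

    coords: list of (x, y, z) tuples, one per residue, in chain order.
    Returns a list of (i, j, distance) with j - i >= min_separation
    and distance <= max_dist, using 0-based residue indices.
    """
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d <= max_dist:
                contacts.append((i, j, d))
    return contacts
```

Contacts found this way in several aligned templates could then be counted or weighted before being turned into constraints for the target.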

    Continue developing an automated method for predicting disulfide bond connectivity

    We have recently begun to devise a "new/old" method for predicting disulfide bridge connectivity based on multiple sequence alignments. The strategy employed is probably the oldest (but still trusted, and widely applied) approach used in manual sequence analysis. It is based on connecting pairs of alignment positions that contain partially conserved cysteine residues whose patterns of conservation and variation resemble each other most closely. At this point, we have most pieces in hand (predominantly in the form of Perl scripts) that are necessary for a first automated implementation of this strategy, and for its evaluation on known examples (Zoe Woodruffe, MRes thesis + code, available).

    In the longer term, there are two possible applications for this automated method - (i) for automated flagging of likely mis-threading between target and template sequences underlying automated comparative models (e.g. as provided by ModBase); or (ii) as a more widely applicable prediction method for disulfide bond connectivity.

    The goals of this project are

    1. to refine the current, first, implementation of this method, primarily by modifying/optimizing the current scoring scheme for assigning disulfide bridges based on known cases
    2. to investigate for which of the two applications it is more suitable.

    Improved handling of clashes and bond angles in undertaker

    The cost function in undertaker for clashes (steric conflicts) was last updated in spring 2004. It uses a separate distance threshold for each pair of atoms in the same residue, in adjacent residues, and in residues with a chain separation of 2 or more. Setting these distance thresholds is rather tricky, since there is not enough data for really robust setting of so many parameters. The method has the advantage/disadvantage of treating steric conflicts the same as bond lengths.

    One big disadvantage of the current scheme is that there is no attractive van der Waals term in our cost function. Atoms that don't come into contact score the same as ones that do contact without clashing. While a traditional Lennard-Jones potential gives too high a penalty to clashes for our purposes (making the landscape too rough), the attractive well of the potential is important in determining whether a protein is properly packed. The Lennard-Jones potential is easily parameterized, having only one parameter for each pair of atoms. We need to come up with a similarly simple cost function that is easily calibrated from available data, with a good fit to the observed distance distributions, but with a softer clash penalty.
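To illustrate what "an attractive well with a softer clash penalty" could look like, here is a minimal sketch. It uses the standard 12-6 Lennard-Jones form, written here in terms of a well depth and the distance of the minimum, and replaces the steep repulsive wall by a straight line below a switching distance. The linear extension and the `switch_fraction` parameter are assumptions made for illustration, not a description of anything in undertaker, and the shape would still need to be fitted to observed distance distributions.

```python
def lennard_jones(r, epsilon, r_min):
    """12-6 potential with minimum value -epsilon at distance r_min."""
    x = (r_min / r) ** 6
    return epsilon * (x * x - 2.0 * x)


def soft_contact_cost(r, epsilon, r_min, switch_fraction=0.8):
    """Lennard-Jones well with a linear (softer) repulsive wall.

    For r >= switch_fraction * r_min this is the ordinary 12-6 potential.
    Below that distance the cost continues along the tangent line at the
    switching point, so value and slope are continuous but the penalty
    grows linearly instead of as r**-12.
    """
    r_switch = switch_fraction * r_min
    if r >= r_switch:
        return lennard_jones(r, epsilon, r_min)
    x = (r_min / r_switch) ** 6
    slope = 12.0 * epsilon / r_switch * (x - x * x)  # dV/dr at r_switch
    return lennard_jones(r_switch, epsilon, r_min) + slope * (r - r_switch)
```

Atoms in good contact score about -epsilon, atoms far apart score near zero, and clashing atoms receive a finite, smoothly increasing penalty.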

    It is quite likely that bond lengths and H-bonds will not fit the same shape length distributions as van der Waals contacts, so we may want to provide completely separate mechanisms for bonded atoms (we already have a separate H-bond cost function). Atoms that are 2 bonds apart are also not governed by van der Waals contacts; their distances are basically measures of bond angles. The current clash detection scheme has different thresholds for bonded atoms and for atoms 2 or 3 bonds apart (because of the different thresholds for same residue, adjacent residue, and arbitrary chain separation), but does not have different shapes to the cost function.

    Improving Undertaker knot detector

    Undertaker has an implementation of Firas Khatib's knot finder, but there are some minor problems with it---mainly that it uses the first version of the algorithm, where points to be considered for deletion are visited in a simple end-to-end sweep of the protein. This should be changed to allow visiting the points in arbitrary orders: most importantly the two smallest-first orders that Firas uses in his Rosetta implementation, but also a random order.

    This project requires modifying the existing C++ code for the knot detector and implementing a new data structure (most likely a priority queue) for ordering the points.

    This project has fairly low importance (the existing implementation is good enough most of the time---it has a small false positive rate that could be reduced by the changes), but is also a small project suitable for a student who already knows C++ and who needs a mini-project, not a full-size project.

    Applications of tertiary structure prediction

    Disulfides in intracellular proteins of extremophiles

    In doing structure predictions for a VapC protein, Kevin noticed that there were some disulfide bridges predicted, even though the protein is an intracellular nucleic-acid-binding protein.

    Following up on this, he contacted Martin Lawrence, a structural biologist at Montana State University, who confirmed that intracellular disulfides do occur in thermophiles, and pointed to a paper by Todd Yeates at the California NanoSystems Institute:

    Beeby M, O'Connor BD, Ryttersgaard C, Boutz DR, Perry LJ, Yeates TO
    The Genomics of Disulfide Bonding and Protein Stabilization in Thermophiles.
    PLoS Biology Vol. 3, No. 9, e309
    doi:10.1371/journal.pbio.0030309

    This paper presents evidence that disulfides are common in thermophiles---at least in those that have a protein known as "protein disulfide oxidoreductase".

    It would be interesting to do predictions of disulfides for the archaeal species that the Lowe lab works with. There are two parts to the prediction: selecting proteins that have multiple cysteines and are not predicted to be extracellular or membrane bound, and doing structure predictions for them to see if the cysteines are predicted to be close enough to form disulfides.

    Another interesting approach would be to look at the homologs of the protein Kevin started with---perhaps starting from one of the PDB chains that it is most similar to: 2h1oA, 2h1cA, 2bsq, or 1y82A. (1y82A might be a particularly good choice, since even BLAST finds it.) Getting a deep multiple sequence alignment to it is fairly easy with the SAM-T06 protocol.

    The distances between CB atoms in the known structure can be computed and those that are close enough to be potential disulfide contacts (say closer than 6 Angstroms) can be recorded. Undertaker can print the CB pairs of interest with

    	InFilePrefix /projects/compbio/experiments/undertaker/spots/
    	PrintTrainingDistance Cbeta.spot 6.0 pairs.out
    
    The use of Cbeta.spot, rather than CB, means that glycines will have a projected CB atom (as if they were alanines) rather than using the CA location.

    The sequences of the multiple alignment can be quickly scanned to find which cysteines look likely to form disulfides and which are unlikely to. It would be interesting to see how the fraction of cysteines that seem to form disulfides varies depending on the taxonomy and living environment of the species.
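The alignment scan could be prototyped along the following lines. This is a minimal sketch with a made-up function name: it assumes the alignment is given as equal-length strings with "-" for gaps and that the close column pairs have already been mapped from structure positions to alignment columns. Treating "cysteine in both columns" as evidence for a disulfide is itself an assumption that would need checking.

```python
def cysteine_pair_counts(alignment, column_pairs):
    """Count cysteine occupancy for pairs of alignment columns.

    alignment: list of aligned sequences of equal length.
    column_pairs: list of (i, j) 0-based column indices that are close
        in the known structure.
    Returns {(i, j): (n_both, n_one, n_neither)}, the number of sequences
    with Cys in both columns, in exactly one, and in neither.
    """
    counts = {}
    for i, j in column_pairs:
        both = one = neither = 0
        for seq in alignment:
            n_cys = (seq[i] == "C") + (seq[j] == "C")
            if n_cys == 2:
                both += 1
            elif n_cys == 1:
                one += 1
            else:
                neither += 1
        counts[(i, j)] = (both, one, neither)
    return counts
```

Sequences could then be grouped by taxonomy or growth environment before counting, to compare the fraction of paired cysteines between groups.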

    Detection, and structure modelling, of new members of the SRS/PGSH superfamily in Apicomplexa

    In previous work we were able to produce models for the double domains of the malaria gamete surface protein Pfs230, through fold recognition and template-based modelling (Structural models for the protein family characterised by the Pfs230 gamete surface protein of Plasmodium falciparum. Gerloff et al., PNAS 102, 13598-603 2005). One of the interesting findings was that the core beta-sandwich domain structure of these proteins resembled that found, so far, only in one other family of Apicomplexan surface proteins, namely the SRS-proteins in Toxoplasma gondii (SRS stands for "SAG1 (Surface Antigen 1)-related sequences").

    Subsequently, we started screening the inferred protein sequences from other Apicomplexan species in GenBank for further distant homologues, i.e. members of the same superfamily (H Su, MSc thesis, available), using Hidden Markov Models generated using HMMer. The student working on this project managed to generate initial results but had time neither to explore his results nor to produce models of the newly found members.

    The aim of this project is to update and extend the work (since new sequences will be available in GenBank), using both HMMer and SAM Hidden Markov Models for the screening, and to model the newly found members of this superfamily.

    Large-scale comparative modelling of Fibronectin 3 (Fn3) domains

    We have recently demonstrated that comparing the surface properties of modelled, homologous protein domains can be of interest in the context of characterizing their interactions with other proteins (Large-scale modelling as a route to multiple surface comparisons of the CCP module family. D. C. Soares et al., Prot. Eng. Des. Sel. 18, 379-88 2005). Since this study, we have developed a new method for comparisons of this kind, and the results are much more promising; in fact, much of our current research focuses on exploiting such comparisons (Ali & Gerloff, manuscript in prep).

    The protein that triggered our interest in this area, due to its complicated interaction properties with multiple protein partners, is human CR1 (complement receptor 1). Human CR1 has a "homo-multimodular" architecture; it consists of 30 paralogous domains, each approximately 60 amino acids in length and sufficiently similar in sequence that we are certain that these domains will have the same fold (although it is clear that their functional properties differ). Casting our initial focus on CR1 has proved valuable in method development. However, there are other proteins to which our new methods could (and should) be applied now. In the first instance, we require modelled structures for all domains of potential new target proteins. Just like CR1, our favorite proteins are large, have multi-modular architectures, and potential for yielding interesting discoveries when the predictions are ultimately followed up through experiment.

    The project is to focus on one such target protein. A protein domain named "Fibronectin3-domain" (Fn3 in SMART) appears to be "similar" in some respects to our paradigm CCP modules but has a different structure and function. The primary aim of the project is to produce comparative models for Fn3 modules in human/mammalian proteins. This will involve

    Ultimately, the aim is to undertake comparisons of the electrostatic surface properties of the resulting models using existing methods.

    Assessment of predicted tertiary structures

    Evaluation of ModBase predictions for malaria proteins

    ModBase is a database of automatically produced comparative models using the program MODELLER by the Sali group (http://salilab.org/modbase). In previous work, we have downloaded all models of proteins from the malarial parasite Plasmodium falciparum, and subjected them to quality and redundancy (overlap) filtering.

    Since the models were downloaded (early in 2006) a considerable number of P.falciparum structures have been resolved by crystallography. This allows us to evaluate the accuracy of the ModBase models on this subset, in a way reminiscent of how this is done in connection with the CASP structure prediction "experiments". Kevin Karplus has recently implemented the main CASP evaluation method "GDT" (Global Distance Test) locally. This means we have a suite of programs for computing standard assessment criteria (e.g. r.m.s. deviation) and more informative measures (e.g. correctly predicted hydrogen bonds).
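As a reminder of what the GDT measure counts, the sketch below computes a GDT_TS-style score from the usual 1, 2, 4, and 8 Angstrom cutoffs. It is a simplification and not the local implementation mentioned above: the real measure searches over many superpositions and takes the best fraction for each cutoff, whereas this sketch assumes the model has already been superimposed on the experimental structure once, so it can only underestimate the true score.

```python
import math


def gdt_ts_fixed_superposition(model_ca, native_ca,
                               cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-style score (0-100) for one given superposition.

    model_ca, native_ca: lists of (x, y, z) CA coordinates for the same
    residues in the same order, already superimposed.
    """
    if len(model_ca) != len(native_ca) or not native_ca:
        raise ValueError("coordinate lists must be non-empty and equal length")
    dists = [math.dist(m, n) for m, n in zip(model_ca, native_ca)]
    # Fraction of residues within each distance cutoff.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)
```

A perfect model scores 100; a model with only some residues placed within a few Angstroms scores proportionally lower.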

    The project is to use these methods, combined with visual analyses, to compare ModBase models of P.falciparum proteins. By comparing them to their experimentally solved "true structures", you will derive an assessment of the current MODELLER-parameter settings used in ModBase, and suggest how to improve them.

    Protein complexes

    (Re)modelling and validation of predicted CDK-cyclin complexes from Arabidopsis thaliana

    In recent work, we have used protein-protein docking combined with electrostatic complementarity calculations, to predict the 19 most plausible pairings of CDK (cell-cycle dependent protein kinase) and cyclin homologues in A.thaliana (Quan & Gerloff, manuscript available). Due to the large scale at which this analysis was undertaken (1188 possible CDK-cyclin combinations were considered) a rigid docking program was applied, as opposed to the more accurate (and biophysically realistic) flexible docking methods. Accordingly, we currently only have approximate, "unrefined" models of the 19 complexes - some of them come with unrealistic features like side-chain clashes, etc.

    The project is to apply the flexible protein-protein docking program RosettaDock (Protein-Protein Docking with Simultaneous Optimization of Rigid Body Displacement and Side Chain Conformations, J.J. Gray, et al., J. Mol. Biol. 331(1), 281-299 2003) to the predicted CDK-cyclin pairs as well as control complexes, in order to validate the predictions and remodel the corresponding heterodimeric complex structures.

    Impact of side-chain modelling on electrostatic complementarity scoring of protein complexes

    An alternative to assembling hypothetical protein-protein complexes through protein-protein docking is to generate them through simultaneous comparative modelling of all subunits, using a known structure of a homologous protein complex as a template.

    We have recently noticed that the electrostatic complementarity at the interface of protein complexes modelled in this way using the program MODELLER (http://salilab.org/modeller) seems exaggerated (overfitted) compared with complexes that were solved crystallographically (Rupert Koenig, diploma thesis available). In other words, the correlation of positive and negative surface potential within contacting patches on the two subunits is even higher in MODELLER-generated protein complexes than it is in crystal structures of protein complexes. Most likely, this observation reflects an overfitting at the stage of side-chain placement.

    The aim of the project is to verify this conjecture, and hopefully to remedy this effect, by remodelling the side-chains of a set of approximately 1200 MODELLER-generated hypothetical CDK-cyclin complexes using Rosetta (Improved side-chain modeling for protein-protein docking, C. Wang, et al., Prot. Sci. 14, 1328-1339 2005; Progress in Modeling of Protein Structures and Interactions, O. Schueler-Furman, et al., Science 310 (5748), 638-642 2005), and recalculating electrostatic complementarity using a local copy of MOLSURFER that was modified to allow large-scale application.

    Transient versus permanent protein interactions - annotate a database of heterodimeric binary (sub)complex structures

    In recent work, a graduate student in the Gerloff group has derived a database of heterodimeric (sub)complexes from the Protein Data Bank (PDB) (Thomas Juettemann, unpublished results). In the context of developing prediction methods using this data set as reference, it is important for us to be able to annotate which of these protein-protein interactions are permanent (for example, between the alpha and beta subunits of hemoglobin, i.e. part of a complex that persists throughout the proteins' existence) and which are transient (for example, the regulatory interaction between a CDK (cyclin-dependent kinase) and a cyclin, i.e. temporary).

    Since the database contains several hundred complexes, the more automation can be applied in this exercise, the better. However, while hints can be derived from protein sequence- or structure-based properties (such as amino acid properties at the interface), the actual annotation must not be based on any of these, but rather inferred from existing annotation of biological function in SWISS-PROT, GO, etc. (since the purpose of the annotation is to serve as reference, i.e. "truth", in developing a prediction method).

    The project is to produce this annotation partly by following the rules used in a previous publication in this field (Analysing six types of protein interfaces. Y. Ofran & B. Rost, J. Mol. Biol. 325, 377-387 2003 (link)), and partly by developing new strategies.

    Determine the intersection between the functional genomics protein interaction database IntAct and "modellable space" (+ produce comparative models)

    A variety of experimental techniques exist to screen for interactions between proteins. The results of such screens (using predominantly yeast-two-hybrid (Y2H), tandem affinity purification (TAP), and co-immunoprecipitation (CoIP) strategies) are stored in protein-protein interaction databases - one of the best annotated and organised is IntAct at the European Bioinformatics Institute (EBI).

    A small proportion of the interactions reported in IntAct are based on crystal structures of the complexes but for the majority no atomic coordinates for the complex are currently available through the Protein Data Bank (PDB). This raises the interesting question - how many of the complexes could be investigated by template-directed structure modelling, ideally comparative modelling (where homology between the template and target protein sequences is easily detected and the two sequences can be aligned confidently)?

    The goal of the project is to determine the set of protein pairs in IntAct for which structures of (i) the complex or (ii) both individual subunits could be obtained, and to assess how feasible it is to model them. If time allows, the complexes can be modelled with the help of scripts for automated comparative modelling at large scale.

    Implementation of an algorithm for ordering heterodimeric protein interface residues in accordance with the contacts between them

    We have recently developed a novel, one-dimensional (profile-style) representation of electrostatic potential surfaces. Such "1-D Electrostatic surface profiles" are useful for fast and simplified comparison/screening applications (Ali & Gerloff, unpublished results). In these profiles, a summed electrostatic potential value (derived from a 3-D structure/model) is assigned to each amino acid residue that is considered in the analysis.

    Up to now, we have used 1-D surface profiles to compare the entire surfaces (i.e. all residues in the proteins) of modelled, homologous domains - to screen for the most "interesting" domains in the set, i.e. to predict which of them are likely to have different protein-protein interaction properties (binding site; binding partner). In order to do this in a structurally meaningful way, we (re-)order the amino acid residues multiple times and carry out multiple comparisons; the reordering algorithm seeks to enrich instances where neighboring surface positions (in 3-D) are next to each other in the profile. (Since we do not know where the binding site is, we try out many different orders, typically 1000.)

    Because this type of screening involves pair-wise comparisons only between 1-D profiles derived from homologous proteins, aligning the two profiles has been straightforward in this application - positions that are aligned in a sequence-based multiple sequence alignment for the protein family define the pair-wise alignment of the 1-D profiles.

    Now we are taking the first steps towards using the same idea for a different application: heterodimeric partner prediction between paralogous sets (e.g. to help predict which CDK (cyclin-dependent kinase) homolog might interact specifically with which cyclin homolog). In other words, we would like to rank hypothetical protein complexes of protein subunits that were forced together using modelling techniques, or docking, based on the complementarity of their 1-D electrostatic surface profiles.

    In this application, we have to worry not only about providing complete orderings of the residues that are considered (that enrich 1-D neighbors that are also 3-D neighbors on the surface, as above) but we also have to align the two profiles that now stem from non-homologous proteins (the two interacting proteins).

    We have thought up an algorithm for accomplishing this, i.e. to generate aligned complete orderings of the interface residues on each subunit so that they have the best chance to contribute to the "complementarity signal" (Rupert Koenig, Diploma project, available). This algorithm involves a partially guided, partially randomized, ordering process reminiscent of a "walk across the interface" of the hypothetical complex. The project is to implement this interesting new idea and to evaluate its usefulness for the problem at hand.

    Function prediction

    Disorder and domain prediction in proteins

    Michael Rexach (http://bio.research.ucsc.edu/people/rexach/) has "recently completed a small project that used amino acids substitution 'rates' in protein domains (among orthologs in four Saccharomyces species) combined amino acids composition (% order-conferring aa/ % disorder conferring aa) to predict boundaries for domain structure and function. This was particularly insightful for protein that have a significant content of disorder-conferring amino acids, such as the nucleoporins we study. The analysis can be computerized and done for the entire yeast proteome. The analysis predicts four types of domains: 1) folded domains with no functional significance other that to provide a needed structure; 2) folded domains with function (enzymatic, protein interactions, etc); 3) unfolded domain with no function (a linker) or 4) unfolded domains with important functions."

    This project combines comparative genomics (looking at differences among the 4 Saccharomyces species) with protein informatics. There are two versions of the project: a straightforward implementation of Rexach's method and its application to the yeast proteome, or an improvement of the method to include other data sources (such as homologous proteins from more distant species) and other statistical measures for predicting disorder and function.

    Rexach has provided an unpublished manuscript which should be readable on campus or with the class password: rexach.pdf. He says "The most relevant figures are fig. 3, table 2, and supplemental figure 6. The second part of our manuscript (which was not published because it lacked a proteome-wide analysis as I am suggesting above) was the beginning of a similar analysis at a proteome wide level, the results were interesting, but required bioinformatics tools that are beyond my expertise."

    Validating suspected sites of adaptive evolution on Complement Receptor 1 (CR1) through Ka/Ks calculations

    We have recently produced models of all 30 so-called CCP (or sushi) modules of CR1 individually (available here). Each module of this surface protein, which is involved in the human immune response, spans approximately 60 amino acids and adopts the typical CCP-module fold (aka complement control module, or SCR, fold). Two conserved disulfide bridges and one buried, conserved tryptophan residue maintain the fold, while surface residues vary greatly between modules.

    Different modules within CR1 may interact with different partner proteins (many of which remain unknown to date), and the locations of their interaction sites are not necessarily conserved with respect to the common structural scaffold defining the fold. Using a variety of new and existing methods, we have generated predictions regarding which modules may have undergone adaptive evolution (i.e. evolved under positive selection).

    These predictions are based on protein sequence and structure-based considerations only. By contrast, the project is to investigate whether positive selection can be detected in the corresponding gene (DNA) sequences of closely related homologues, through computation of Ka/Ks (non-synonymous versus synonymous mutation) rates, and to pin-point possible interaction sites in this way.
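
    To make the Ka/Ks idea concrete, here is a minimal sketch of a simplified Nei-Gojobori-style count for two aligned coding sequences. It is an illustration under stated assumptions, not a recommended pipeline: the sequences are assumed to be uppercase, gap-free, and in frame; codon pairs that differ at more than one position (or contain a stop codon) are simply skipped rather than averaged over mutational pathways; and changes to stop codons are counted as non-synonymous sites. All function names are made up for this example. An established package (for example PAML) would be a better choice for the real analysis, particularly for site-specific tests.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def syn_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    s = 0.0
    for i in range(3):
        for b in BASES:
            if b != codon[i] and CODE[codon[:i] + b + codon[i + 1:]] == CODE[codon]:
                s += 1.0 / 3.0
    return s

def jukes_cantor(p):
    """Correct a proportion of differences for multiple hits; None if saturated."""
    return None if p >= 0.75 else -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def ka_ks(seq1, seq2):
    """Simplified Ka/Ks for two aligned, gap-free, in-frame coding sequences."""
    S = Sd = Nd = 0.0
    used = skipped = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        diffs = sum(1 for a, b in zip(c1, c2) if a != b)
        if diffs > 1 or "*" in (CODE[c1], CODE[c2]):
            skipped += 1  # assumption: ignore multi-hit codons and stop codons
            continue
        used += 1
        S += (syn_sites(c1) + syn_sites(c2)) / 2.0
        if diffs == 1:
            if CODE[c1] == CODE[c2]:
                Sd += 1   # synonymous difference
            else:
                Nd += 1   # non-synonymous difference
    N = 3 * used - S
    ks = jukes_cantor(Sd / S) if S > 0 else None
    ka = jukes_cantor(Nd / N) if N > 0 else None
    ratio = ka / ks if ka is not None and ks else None
    return {"S": S, "N": N, "Sd": Sd, "Nd": Nd,
            "Ka": ka, "Ks": ks, "ratio": ratio, "skipped": skipped}

result = ka_ks("TTTAAAGGG", "TTCAGAGGG")
```

    A ratio well above 1 over a region would hint at positive selection, but with so few sites per CCP module the estimates will be noisy, so some form of significance testing would be needed.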

    Structure analysis

    Finding crevices that might bind nanotubes

    We are planning to try to design proteins to selectively bind carbon nanotubes of specific chiralities. One approach we are pursuing is to redesign crevices in existing proteins. For this to work, we need to find crevices of approximately the right shape, then run a design program (such as RosettaDesign) to select different residues to line the crevice with.

    This project involves scanning a subset of PDB (1,000 to 3,000 structures) looking for holes or grooves appropriate for binding a cylinder. Because of the symmetries of a cylinder, there are only 4 degrees of freedom to sample, making this a simpler problem than the usual 6-degree-of-freedom rigid docking problem. We need to specify the axis of the cylinder, then count atoms in two cylindrical shells around the axis. For example, a crevice for binding a nanotube with a 5 Angstrom radius should minimize the number of backbone atoms within 5 Angstroms of the axis, but maximize the number of atoms within 7 Angstroms of the axis.
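
    As a rough illustration of the counting step, the sketch below measures the distance from each atom to an arbitrary axis and combines the two counts into a single number. The function names and the scoring rule (atoms within the outer radius minus a penalty for backbone atoms within the inner radius) are assumptions made for this example, not part of undertaker; coordinates and radii just need to be in the same units.

```python
import math

def distance_to_axis(atom, point, direction):
    """Distance from an atom to the line through `point` along unit vector `direction`."""
    v = [a - p for a, p in zip(atom, point)]
    along = sum(vi * di for vi, di in zip(v, direction))
    perp = [vi - along * di for vi, di in zip(v, direction)]
    return math.sqrt(sum(c * c for c in perp))

def crevice_score(all_atoms, backbone_atoms, point, direction,
                  inner, outer, penalty=1.0):
    """Hypothetical score: reward atoms near the axis, penalize backbone inside the tube."""
    near = sum(1 for a in all_atoms
               if distance_to_axis(a, point, direction) <= outer)
    clash = sum(1 for a in backbone_atoms
                if distance_to_axis(a, point, direction) <= inner)
    return near - penalty * clash

atoms = [(6.0, 0.0, 0.0), (0.0, 6.0, 5.0), (1.0, 0.0, 2.0), (10.0, 0.0, 0.0)]
backbone = [(1.0, 0.0, 2.0)]
score = crevice_score(atoms, backbone, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 5.0, 7.0)
```

    How heavily to weight the penalty is an open design choice; a large penalty effectively forbids any backbone atom inside the tube.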

    To sample cylindrical axes, we can split the problem into a 2D orientation and a 2D translation. For the orientation, we can sample points on the unit sphere, and rotate the protein about the origin so that the sampled point is at (0,0,1). Our cylindrical axis will now be parallel to the z-axis, and we can sample (x,y) points to get the different cylindrical axes.
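
    A minimal sketch of this two-stage sampling follows. The choice of a Fibonacci lattice for the orientations, the restriction to the upper hemisphere (a direction and its negative define the same axis), and all function names are assumptions for illustration; any reasonably uniform sampling of the sphere would do.

```python
import math

def fibonacci_hemisphere(n):
    """Roughly uniform unit vectors with z > 0 (d and -d give the same axis)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = (i + 0.5) / n                  # evenly spaced in z = even in area
        r = math.sqrt(1.0 - z * z)
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def rotate_to_z(v, d):
    """Apply to point v the rotation that carries unit vector d onto (0, 0, 1)."""
    dx, dy, dz = d
    s = math.hypot(dx, dy)                 # sine of the rotation angle
    if s < 1e-12:                          # d is already +z or -z
        return v if dz > 0 else (v[0], -v[1], -v[2])
    kx, ky = dy / s, -dx / s               # unit rotation axis k = d x z (kz = 0)
    x, y, z = v
    k_dot_v = kx * x + ky * y
    cx, cy, cz = ky * z, -kx * z, kx * y - ky * x   # k x v
    c = dz                                 # cosine of the rotation angle
    # Rodrigues' rotation formula
    return (x * c + cx * s + kx * k_dot_v * (1.0 - c),
            y * c + cy * s + ky * k_dot_v * (1.0 - c),
            z * c + cz * s)

def grid_axes(rotated_atoms, step):
    """Yield (x, y) axis positions on a grid over the atoms' bounding box."""
    xs = [a[0] for a in rotated_atoms]
    ys = [a[1] for a in rotated_atoms]
    x = min(xs)
    while x <= max(xs):
        y = min(ys)
        while y <= max(ys):
            yield (x, y)
            y += step
        x += step

def count_within(rotated_atoms, xy, radius):
    """Count atoms within `radius` of the axis parallel to z through xy."""
    return sum(1 for a in rotated_atoms
               if math.hypot(a[0] - xy[0], a[1] - xy[1]) <= radius)
```

    After the rotation, the distance from an atom to a candidate axis depends only on its x and y coordinates, which is what makes the inner loop cheap.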

    The sampling can be deterministic (grids) or random, or a combination of the two. When somewhat promising cylinders are found, we can do finer sampling in the neighborhood of the good cylinders.

    Since we are interested in several different nanotube diameters from about 6.75 Angstroms to about 16.2 Angstroms, we should probably collect the number of atoms (backbone atoms and all atoms) in several different cylindrical shells simultaneously.
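
    One simple way to collect several shell counts at once, sketched here with a hypothetical helper, is to sort the atom-to-axis distances once and then look up each radius by binary search. The same function would be called separately on the backbone-atom distances and the all-atom distances.

```python
from bisect import bisect_right

def shell_counts(distances, radii):
    """For each radius, count how many atom-to-axis distances are <= that radius."""
    ordered = sorted(distances)
    return [bisect_right(ordered, r) for r in radii]

counts = shell_counts([1.0, 3.5, 4.0, 9.0, 12.0], [3.4, 8.1, 20.0])
```

    The number of atoms in the shell between two radii is then just the difference of the two cumulative counts.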

    This project is probably best done by adding a few new commands to the undertaker program. The program already has code for reading structures, for doing rigid transformations, and for rapidly determining how many atoms are in a sphere. Extending the XYZbins class to count points in cylindrical shells should be easy, and the sampling of cylindrical axes should be fairly straightforward for anyone familiar with C++ as a programming language.

    Analyzing ring-stacking interactions in proteins

    For our project on designing proteins to bind nanotubes, we expect to need terms in our cost or energy functions that model the stacking of aromatic rings (from phenylalanine, tyrosine, and tryptophan) against the graphite lattice of the nanotube. Since we don't have much structural experimental data for interactions between aromatic amino acids and graphite lattices, we will have to generalize from the ring-stacking interactions that we have seen.

    This project involves small modifications to undertaker to gather statistics about interactions between aromatic amino acids observed in PDB. The code may be similar to existing code that was used for gathering statistics about hydrogen-bond geometry. Which statistics we gather and how we make a cost (energy) function out of them will depend on what interactions we observe most frequently.

    One approach would be to generate different statistics for each of the nine ordered pairs of aromatic amino acids (16 if we decide to include histidine). We could transform one sidechain into a standard frame of reference (say based on CB, CG, and CD1) and gather statistics about any other aromatic rings (location and orientation) that come within 4 Angstroms of atoms in the selected sidechain. We are particularly interested in rings that are nearly parallel, since that is the mode of interaction we are expecting to see with nanotubes.
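
    As a starting point for the "nearly parallel" question, the sketch below computes two frame-independent quantities for a pair of rings: the distance between ring centroids and the angle between the ring planes. This is a simpler description than the full standard-frame statistics discussed above, and the function names are hypothetical; ring atoms are assumed to be given in order around the ring.

```python
import math

def centroid(points):
    """Average position of the ring atoms."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def ring_normal(points):
    """Unit normal of a roughly planar ring (Newell's method), atoms in ring order."""
    nx = ny = nz = 0.0
    pts = list(points)
    for (x1, y1, z1), (x2, y2, z2) in zip(pts, pts[1:] + pts[:1]):
        nx += (y1 - y2) * (z1 + z2)
        ny += (z1 - z2) * (x1 + x2)
        nz += (x1 - x2) * (y1 + y2)
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)

def ring_pair_geometry(ring_a, ring_b):
    """Return (centroid distance, angle between ring planes in degrees, 0 to 90)."""
    dist = math.dist(centroid(ring_a), centroid(ring_b))
    na, nb = ring_normal(ring_a), ring_normal(ring_b)
    cos_angle = min(1.0, abs(sum(a * b for a, b in zip(na, nb))))
    return dist, math.degrees(math.acos(cos_angle))

def hexagon(z):
    """Test ring: regular hexagon of radius 1.4 in the plane at height z."""
    return [(1.4 * math.cos(k * math.pi / 3), 1.4 * math.sin(k * math.pi / 3), z)
            for k in range(6)]

stacked = ring_pair_geometry(hexagon(0.0), hexagon(3.5))
```

    An angle near 0 means the rings are parallel (stacked or offset-stacked), and an angle near 90 means an edge-to-face arrangement; the in-plane offset between the rings would need to be recorded as well to separate the first two cases.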

    If we get some good regularities, we can create a new cost function for undertaker that provides a roughly -log probability cost for different orientations of nearby rings. The hardest part may be figuring out how much bonus there should be for the rings being close in the first place.
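
    The conversion from observed counts to a -log probability cost could look roughly like the following sketch. The bin labels, the counts, and the use of a pseudocount to avoid infinite costs for unobserved bins are all assumptions made for illustration, and the numbers are invented rather than measured.

```python
import math

def neg_log_prob_costs(counts, pseudocount=1.0):
    """Turn histogram counts per bin into -log(probability) costs."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {b: -math.log((c + pseudocount) / total) for b, c in counts.items()}

# Invented counts of interplanar angles (degrees) for one pair of residue types.
angle_counts = {"0-30": 70, "30-60": 20, "60-90": 10}
costs = neg_log_prob_costs(angle_counts, pseudocount=0.0)
```

    With these made-up counts the most frequently observed bin gets the lowest cost. This only ranks orientations relative to each other; it says nothing about the bonus for the rings being close at all, which would need a separate reference state.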

    One possible problem with this whole approach is that the aromatic rings of the sidechains are not quite the same as the graphite lattice. The slight difference in electronegativity between hydrogen and carbon means that the center of the phenylalanine ring is a very weak hydrogen-bond acceptor and the rings may have a favorable interaction at right angles. The nitrogens in tryptophan and histidine may cause this effect to be even stronger. We do not expect the carbon nanotubes to form these weak hydrogen bonds, so our cost function may need to be different for interactions between amino acids and between amino acids and the graphite lattice.

    In addition to creating new cost functions for undertaker, we would like to add a ring-stacking term to the Rosetta cost function, as we expect to do large portions of our design work with RosettaDesign.

    Note: this project is an open-ended project that is much too large for just a BME 220 project. It could well lead to a Master's thesis or even a PhD thesis, but there is an initial component (gathering statistics) that should fit comfortably in the 10-week time frame.

    A survey of disulfide bond "properties" in surface proteins

    One of the noticeable characteristics of proteins that protrude from cell surfaces is the abundance of disulfide bridges. Accordingly, many well-known surface proteins are recognized by sequence motifs/signatures based partly on the highly conserved cysteines involved.

    Since disulfide bonds are geometrically (3-D structurally) more constrained than a simple distance constraint would imply, not all combinations of cysteines are favorable - for example, disulfide bridges between sequence-neighboring cysteines are rarely observed. While there have been previous surveys of disulfide bonds found in 3-D structures stored in the Protein Data Bank (PDB) - or rather inferred based on the atomic coordinates - there is now a large enough number of experimentally solved surface protein/domain structures to inform a specialized study surveying their 3-D structural "properties" and their sequence "properties" (such as conservation patterns).

    The project is to

    Designing proteins

    Designing multimers for binding nanotubes

    The big project that the Karplus lab is about to embark on, under the leadership of postdoc Martin Madera, is the design of proteins to specifically bind carbon nanotubes of specific chirality. There are several approaches to the design. One is to try to redesign existing protein folds (see Finding crevices that might bind nanotubes), but we don't expect this approach to be sufficiently selective. The problem is that nanotubes are locally all identical---they are graphite lattices that have been curved into cylinders. The differences between say an (8,8) nanotube and a (9,7) nanotube are slight differences in the curvature, and it would be difficult to make a protein rigid enough to have much difference in affinity for the different curvatures.

    A more promising approach is to design a helical multimer that self-assembles on the surface of the nanotube. If the symmetry operation from one monomer to the next matches a symmetry of the nanotube, then the cooperative binding of many monomers should be able to distinguish even subtle differences between nanotubes.

    Actually, we probably want a helical arrangement of dimers, rather than of monomers, since the nanotubes are symmetric if you rotate them 180 degrees about an axis perpendicular to the tube axis. If we used an asymmetric unit for forming the helical multimer, there would be two different orientations that would work, and we would get discontinuities (like crystal grain boundaries) between regions of each orientation. If the unit for the helical multimer has an appropriate 180-degree rotational symmetry, then there may be only one orientation for the helical multimer and no grain boundaries.

    As a BME 220 project, there are several aspects of this project that can be worked on. One is to try packing existing proteins around nanotubes and using RosettaDesign to repack the interfaces. Another is to try designing a monomer or dimer from scratch that has an appropriate shape for making the helical multimers (again using RosettaDesign, this time to design the entire backbone).

    For those looking for programming projects, undertaker could use some modifications to handle helical multimers. Currently it has only a crude mechanism for handling cyclic multimers (using a single rigid transform as a generator), but the multimers that we are interested in involve 2 transforms: one for taking the monomer to its dimeric partner, the other for taking one dimer to the next in the helix. The helix is also an infinite object, so limitations on the size of the created complex are needed.
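
    To illustrate the two-transform idea (this is not how undertaker represents multimers, and the function names are made up), the sketch below places the helix axis along z, takes the two-fold axis of the dimer to be the x axis, and builds a finite stretch of the helix by applying the screw transform a limited number of times in each direction.

```python
import math

def screw(point, angle_deg, rise, k):
    """Apply the helical step k times: rotate about z by k*angle, shift by k*rise."""
    a = math.radians(angle_deg * k)
    x, y, z = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z + rise * k)

def twofold_about_x(point):
    """180-degree rotation about the x axis, perpendicular to the helix (z) axis."""
    x, y, z = point
    return (x, -y, -z)

def helical_multimer(monomer, angle_deg, rise, n_before, n_after):
    """Return a list of chains: each dimer (monomer + partner) copied along the helix."""
    dimer = [list(monomer), [twofold_about_x(p) for p in monomer]]
    chains = []
    for k in range(-n_before, n_after + 1):   # finite window of the infinite helix
        for chain in dimer:
            chains.append([screw(p, angle_deg, rise, k) for p in chain])
    return chains

chains = helical_multimer([(10.0, 1.0, 2.0)], 90.0, 5.0, 1, 1)
```

    For cost functions it is probably enough to generate the copies that can touch the central dimer, so the window size can be chosen from the rise per step and the size of the dimer.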

    Undertaker also needs modification for handling nanotubes. Currently, undertaker has no representation for ligand atoms. Carbon nanotubes have only one sort of ligand atom, so the extension of the atom alphabet is easy, but setting parameters for clashes and other cost functions may be difficult, as there is not much data about how nanotubes interact with amino acids.

    Testing back-propagation as a protein design method

    In spring 2005 Jes Frellsen and I made some modifications to predict-2nd so that it could be used for protein design. (I made further changes in summer 2005.) The idea is to take several of the local-structure-predicting neural nets, and fix their output at the desired local structures, then use back-propagation to optimize the inputs. One starts with an initial random sequence, optimizes using the neural nets to get an input profile, then samples from that profile and reoptimizes. This can be repeated to get a large sample of possible sequences that should have good local properties, which could then be fed into undertaker, rosetta, or gromacs to try to find sequences with good tertiary properties.

    We still need more experimentation with various networks and local structure alphabets, to see which combinations result in the best designs. Since we have neither the time nor the facilities for fabricating and testing large numbers of proteins, we will have to rely mainly on computational methods. One simple test is to look at the percent sequence recovery: given just the backbone for a real protein, try to design amino acid sequences that would fold to that conformation. The closer the designs are to the real protein, the more convincing the design method. Such a study should be done on hundreds of different backbones from radically different structures. Some work was done on this in Winter 2006, and the sequence recovery was somewhat less than we wanted. Only two local structure alphabets were used, and the neural nets were not as good at prediction as the ones we now use, so we may be able to do better now.
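
    Percent sequence recovery itself is simple to compute; a minimal sketch (with a hypothetical function name, assuming the designed sequence has the same length as the native one and no gaps) is:

```python
def sequence_recovery(native, designed):
    """Percentage of positions where the designed residue matches the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for a, b in zip(native, designed) if a == b)
    return 100.0 * matches / len(native)

recovery = sequence_recovery("ACDEFGHIKL", "ACDEYGHIKV")
```

    In a real study one would average this over many designs per backbone and over many backbones, and compare against a baseline such as sequences sampled from background amino acid frequencies.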

    There are lots of variants that could be tried in the sampling of sequences from the back-propagation. For example, one could resample the sequence points where the back-propagated derivative for the current residue is negative rather than resampling all points. Instead of sampling from the profile one could take the most probable residue of the profile. After generating a few hundred sequences in this way from random starting points, one could make a profile from the sequences, and score each sequence with that profile. Further tweaking could take just the best 100 or so sequences and the profile generated from them. Each could be used with that profile as a starting point for the back-propagation and resample process. A tertiary scoring system (such as Rosetta's) could be used to pick out the best sequences from the list, and use the resulting sequences and profile to do more sampling.
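
    Two of these variants, sampling from the profile versus taking the most probable residue, can be sketched as follows. The profile representation (a list with one dictionary per position, mapping residue letters to probabilities) is an assumption made for this example and is not the format used by predict-2nd.

```python
import random

def sample_sequence(profile, rng=random):
    """Draw one residue per position according to the profile's probabilities."""
    return "".join(
        rng.choices(list(column), weights=list(column.values()))[0]
        for column in profile
    )

def most_probable_sequence(profile):
    """Take the highest-probability residue at each position."""
    return "".join(max(column, key=column.get) for column in profile)

profile = [{"A": 0.9, "G": 0.1}, {"L": 0.6, "V": 0.4}, {"K": 1.0}]
consensus = most_probable_sequence(profile)
sampled = sample_sequence(profile, random.Random(0))
```

    Taking the most probable residue gives a single deterministic sequence per profile, so diversity then has to come entirely from the random starting points.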

    Training neural nets for protein design

    Another approach to using neural nets would be to build a neural net whose input was many structural properties and whose output was a sequence. Such a neural net could be trained in the same manner as current nets for prediction (with inputs and outputs swapped), and would provide a probability distribution for amino acids at each position. This would require a few changes to the neural net code (mainly providing an input layer that accepts several alphabets, instead of just one alphabet), and would more directly provide probability distributions to sample sequences from.



    Questions about page content should be directed to

    Kevin Karplus
    Biomolecular Engineering
    University of California, Santa Cruz
    Santa Cruz, CA 95064
    USA
    karplus@soe.ucsc.edu
    1-831-459-4250