This is a list of tasks that need to be done on various projects
for protein structure prediction. I've broken the list up into
different categories, based on the different stages we go through in
making a prediction.
- Perl modules
- I'm working on perl modules to make the SAM-T04 prediction engine
more modular than the previous ones, so that pieces of it can be
reused by other researchers.
I have SAM-T04 modules that are almost finished (enough for us to
do predictions at CASP in Summer 2004), but there are several
functions that still need to be added. It would also be good to make
them easier to install and to keep frozen copies of them for a web service.
- SAM-T05 web server
- We added a bunch of new steps to the prediction process in 2004,
but have not yet put up a web server to provide these predictions to
the public. The web server could be modeled after the existing
SAM-T02 web server which does some checking and cleaning of the input,
then calls "make" to create the output. We have Make.main files that
can be included in a Makefile for both casp6
and yeast predictions. These can be
merged appropriately and updated to make the SAM-T05 server.
For example, the yeast web pages include pictures of 3-D structures
based on the top alignments, and the casp6 predictions include a
full-length model built automatically from the alignments and
fragments (but no pictures). The SAM-T05 website should generate both
sets of models (from alignments and full-length) and include pictures
of both.
Neither the yeast nor the CASP6 Make.main files do a search of the
SAM-T04 template library, largely because the library was not
calibrated until 23 March 2005. This would be a fairly simple
addition to the yeast web pages and the SAM-T05 website, and would
make them significantly more useful.
- Finding similar sequences (potential homologs)
Other researchers have found that they can get far more homologs
for proteins if they search the unfinished genomes, particularly
using tblastn to search genomes that have not had adequate gene
finding done. This might be worth trying, but extracting the
"full-length" protein sequences can be difficult from tblastn
hits---we want to have protein sequences, not full contigs of
nucleotides, but we don't want just the narrow region where the
tblastn hit is significant.
The thresholds used in SAM-T2K and SAM-T04 for determining what sequences to
include at each level are somewhat arbitrary. It may be worthwhile to
play with them a bit to see if we can include more sequences without
getting contamination.
- Multiple alignment of similar sequences
- The biggest improvement that SAM-Txx makes over PSIBLAST is in the
quality of the multiple alignment, but there is still a lot more that
could be done to improve it. Here are some ideas (not necessarily compatible):
- T-coffee has shown itself to be a good, but slow, multiple aligner.
We could use T-coffee on a subset of the sequences to get a good
multiple alignment of them, then use that as a seed for an HMM to
retrain on all the sequences. We would have to write a converter
that changed the T-coffee output format to A2M format, unaligning
all columns that did not contain the target sequence.
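The conversion step described above could be sketched as follows. This is only a minimal sketch under stated assumptions: it assumes the T-coffee alignment has already been read into a dictionary of equal-length gapped strings (with '-' for gaps), and the function name to_a2m is hypothetical, not part of SAM or T-coffee. Columns where the target has a residue become match columns (uppercase, '-' for deletions); all other columns are "unaligned" into lowercase insertions, with their gaps dropped.

```python
def to_a2m(aligned, target_id):
    """Convert a gapped alignment to A2M-style rows relative to a target.

    aligned: dict mapping sequence id -> gapped string ('-' for gaps),
             all of the same length (assumed already parsed).
    target_id: id of the sequence whose residues define match columns.
    """
    target = aligned[target_id]
    # A column is a match column only if the target has a residue there.
    is_match = [c != "-" for c in target]
    result = {}
    for seq_id, row in aligned.items():
        if len(row) != len(target):
            raise ValueError("row %s has a different length" % seq_id)
        chars = []
        for c, match in zip(row, is_match):
            if match:
                chars.append(c.upper())   # residue or '-' (deletion)
            elif c != "-":
                chars.append(c.lower())   # unaligned residue -> insertion
            # gaps in non-target columns are simply dropped
        result[seq_id] = "".join(chars)
    return result


if __name__ == "__main__":
    example = {"t": "AC-DE", "x": "A-GD-", "y": "ACGDE"}
    for name, row in to_a2m(example, "t").items():
        print(">" + name)
        print(row)
```

Dropping the gaps in insert columns (rather than padding with '.') keeps the rows compact; padding could be added if a downstream tool needs equal-length rows.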
- We may want to implement our own version of T-coffee that does
not have the O(s^2 n^2 + s^3 n) running time (where s is the number of
sequences and n is the average length). It may be worth looking
for graph algorithms that allow us to propagate consistency
information around an s-partite graph that has residues as nodes
and weighted alignment pairs as edges. It would be particularly
good to be able to work with sparse graphs that did not attempt
to do all pairwise alignments.
- We may also want to look at the Muscle multiple aligner at
http://www.drive5.com/muscle
which is supposedly better than T-coffee and much faster (faster
than Clustal W even). There is an NAR paper on the performance of
MUSCLE (PubMed abstract 15034147).
- Progressive aligners (such as Clustal W) are about as good at
multiple alignment as SAM-T99, but are faster for small numbers of
sequences (slower for large numbers). The slow step is in taking
all pairwise distances to determine the neighbor-joining tree. We
could make a fast, sloppy, progressive aligner by using the initial HMM to
make a crude multiple alignment and extracting the distance matrix
from that multiple alignment (time O(s n^2 + s^2 n) instead of
O(s^2 n^2)). This is still too slow for 10,000 sequences, but we could
thin the alignment, as we do in SAM-T2K for the buildmodel step.
- The results of the progressive alignment could be further refined
by doing a few iterations of Baum-Welch in buildmodel, though that
would have to use a thinned alignment, as in SAM-T2K, to avoid
excessively long time in buildmodel.
- Progressive alignment relies on profile-profile alignment, which
we need to add to the SAM tool suite anyway. If we have profile-profile
alignment on multi-track HMMs, we can use predicted local
structure properties in the progressive alignment. A prototype
progressive aligner could be written in Perl, calling predict-2nd
to get the local structure predictions and SAM's profile-profile
aligner to do the alignments. If the method works, we may need to
reimplement in C or C++ to avoid all the overhead of calling
separate programs.
- A much smaller change, though messy to code, would be to
implement the posterior decoding alignment in a memory-efficient
way---currently about 8% of our target2k builds fail because of
memory problems in the final posterior-decoding alignment.
- Both the current buildmodel alignment improvement and
progressive alignment rely on using a thinned set of sequences to
be adequately fast for aligning large sets of sequences. The
current thinning process allows us to specify a percent-identity
threshold (based on an initial multiple alignment), but does not
allow us to specify how large a set of sequences we're willing to tolerate.
For methods like T-coffee that are cubic in the number of
sequences, we might want a tighter control over the number of sequences.
This would require modifications to uniqueseq, possibly requiring
re-thinning an already thinned alignment.
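One way to get a size-bounded thinning is to re-thin repeatedly with a progressively lower percent-identity threshold until the set is small enough. The sketch below illustrates that idea only; it is not how uniqueseq works, and the function names, the greedy keep-first rule, and the step size are all assumptions made for illustration. Rows are assumed to be equal-length aligned strings with the target first, so the target is always kept.

```python
def percent_identity(a, b):
    """Percent identity over columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def thin(rows, threshold):
    """Greedily keep rows that are < threshold % identical to all kept rows."""
    kept = []
    for row in rows:
        if all(percent_identity(row, k) < threshold for k in kept):
            kept.append(row)
    return kept


def thin_to_size(rows, max_seqs, start=100.0, step=5.0):
    """Lower the threshold until at most max_seqs rows remain.

    Returns (kept_rows, final_threshold).  The size bound is not
    guaranteed if the threshold reaches `step` before the set is small.
    """
    threshold = start
    kept = thin(rows, threshold)
    while len(kept) > max_seqs and threshold > step:
        threshold -= step
        kept = thin(kept, threshold)  # re-thin the already thinned set
    return kept, threshold


if __name__ == "__main__":
    rows = ["ACDEF", "ACDEF", "ACDEG", "ACDKL", "WYWYW"]
    print(thin_to_size(rows, 3))
```

A bisection on the threshold would need fewer passes than this fixed-step loop, at the cost of slightly more bookkeeping.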
- I made improvements to GPCR multiple alignments by selecting
key conserved residues, restricting the multiple alignment to
sequences that contained reasonable values for those key residues
(thus eliminating misaligned and some false-positive sequences),
then realigning all the sequences to an HMM built from the subset
alignment. This method is mostly automatic (just some thresholds
were picked manually), so might be worth incorporating into the
next SAM-Txx method. The current scripts are pick-key-residues
and select-by-key-residues.
- Local structure prediction
- We are still working on finding good local structure alphabets to
predict, and our neural network framework is fairly reliable.
Minor tweaks will probably not make much difference, but I have some
ideas that may be worth pursuing.
- We need to create a new framework for testing methods since
we currently rely on predict-2nd to monitor its own
performance. The information gain, Qn, and SOV
computations should be encapsulated in a separate evaluation
program, so that we can test combining methods and new
machine-learning methods.
- We could probably get slightly better local structure
prediction by combining the results of several independently
trained neural networks.
It is not clear what the best way to do the combining is---simply
averaging the probability vectors is probably not optimal, though
it is easy.
We may want to look at combining the log-odds scores (log
(P_predicted/P_background)), for example.
We already have one combining program called RDBCombine that
is used in the SAM-T02 web server to merge neural-net predictions
to get the dssp_ehl2 prediction. It can combine RDB-format
prediction files for different alphabets. It translates the
alphabet using the contingency tables output by compare-real, and
weights them by the mutual information between the predicted
alphabet and the desired alphabet. This is almost certainly not
an optimal combining method, but it was put together in haste for
CASP5, and has not really been evaluated.
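To make the two combining options concrete, here is a small sketch comparing simple averaging with a log-odds combination. It is not what RDBCombine does; the function names and the weight parameter are assumptions for illustration, and all probabilities are assumed strictly positive. With weight 1/n the log-odds version reduces to a normalized geometric mean (the background cancels); with weight 1 it treats the networks as independent evidence, which is almost certainly too optimistic for networks trained on the same data.

```python
import math


def average_combine(predictions):
    """Average the probability vectors position by position."""
    n = len(predictions)
    size = len(predictions[0])
    return [sum(p[i] for p in predictions) / n for i in range(size)]


def log_odds_combine(predictions, background, weight=1.0):
    """Combine by summing weighted log-odds log(P_predicted/P_background).

    The combined score for class i is
        background[i] * exp(weight * sum_k log(p_k[i] / background[i])),
    renormalized to sum to one.
    """
    scores = []
    for i, b in enumerate(background):
        total_log_odds = sum(math.log(p[i] / b) for p in predictions)
        scores.append(b * math.exp(weight * total_log_odds))
    norm = sum(scores)
    return [s / norm for s in scores]


if __name__ == "__main__":
    nets = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
    background = [1.0 / 3] * 3
    print(average_combine(nets))
    print(log_odds_combine(nets, background))
```

Comparing the information gain of the two outputs on held-out data would be the natural way to choose between them, or to tune the weight.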
- I doubt that we'd get much improvement in the results by
changing the topology of the neural nets, though we may want to
retrain the networks on a larger or cleaner dataset before the
next CASP season. We could probably enlarge the networks slightly
(either increasing the number of hidden units or increasing the
window size on the first layer) without too much risk of
overtraining.
One improvement that I think would be valuable for
some local structure alphabets is to replace the current profile
input with a double input: a profile and a 0-1 vector indicating
the actual sequence. The profile could be set to generalize a bit
further (say 0.8 bits_saved/column instead of the current scheme
of using Henikoff weighting with the number of sequences as the
total weight), as the guide sequence would record the actual
residues. Separating the guide and the profile would be
particularly useful in those cases where the multiple alignment
from which the profile is generated has suffered from model drift,
and no longer represents well the particular sequence. It might
also be useful for improving the predictions in the regions where
many of the aligned sequences have deletions, as the profile gets
close to the background there, because of the low weights.
This modification would require some fairly minor changes to
predict-2nd to allow the InterfaceDescription to have multiple
inputs from different files, but the changes may be useful for
other purposes also.
[Sol Katzman did this work, but still needs to write it up.]
- One useful thing to do would be to train neural nets for predicting
local structure from single sequences. The resulting predictions
would not be as good as the predictions using multiple alignments and
the current networks, but would be useful for a progressive aligner
and for making predictions when we do not find any homologs in the
iterative search. Note: the guide+profile modification suggested
above may eliminate the need for separate neural nets for
single-sequence prediction, as the network would already be using
the guide sequence as best it can.
- Our burial and exposure alphabets are based on an underlying
continuous (or, at least, finer-grain) value. It would be nice to
have the neural net output parameters for a parameterized
distribution (such as log-normal or gamma), rather than the
discrete 7 or 11 states we currently use. The two values can be
interpreted as the mean and standard deviation of the output
distribution (though other parameterizations may be more
convenient for some distributions). A weak prediction would have
a large standard deviation.
This could be done by adding a combining layer to the outputs
of a normal neural net. For each output of the neural net we
associate two parameters (center and spread). The predicted mean
would be the neural-net weighted sum of the center values, and the
predicted standard deviation would be the standard deviation of a
mixture distribution, with mixture coefficients from the neural net.
The center and spread parameters could initially be set by
partitioning the outputs in the training data into roughly equal weight discrete
classes (as is now done) and computing a mean and standard
deviation for each class. They could be tweaked a bit by a few
passes of EM to get the output described as a mixture.
After a few epochs of training the
neural net, the center and spread parameters could be trained by
back propagation also.
We would need to compare the encoding cost of the actual
burial value using this continuous method (which requires some
care in the discretizing, since our underlying data is usually
integer-valued) versus the encoding cost of the discrete alphabet
method, which assumes a flat distribution within each of the
discrete bins.
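The combining layer described above can be written down directly, using the standard formulas for the mean and variance of a mixture. This is a sketch with hypothetical names (mixture_mean_std, centers, spreads); it shows only the forward computation, not the back-propagation of the center and spread parameters.

```python
import math


def mixture_mean_std(weights, centers, spreads):
    """Mean and standard deviation of a mixture distribution.

    weights: neural-net output probabilities (assumed to sum to 1).
    centers: per-output center (mean) parameters.
    spreads: per-output spread (standard deviation) parameters.
    """
    mean = sum(w * c for w, c in zip(weights, centers))
    # E[X^2] of the mixture is the weighted sum of component second moments.
    second_moment = sum(
        w * (s * s + c * c) for w, c, s in zip(weights, centers, spreads)
    )
    variance = max(second_moment - mean * mean, 0.0)  # guard rounding error
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    # A confident prediction: nearly all weight on one class.
    print(mixture_mean_std([0.98, 0.01, 0.01], [1.0, 5.0, 12.0], [0.5, 1.0, 3.0]))
    # A weak prediction: weight spread over classes, so a large std.
    print(mixture_mean_std([0.34, 0.33, 0.33], [1.0, 5.0, 12.0], [0.5, 1.0, 3.0]))
```

The second call illustrates the point made above: a weak prediction shows up as a large standard deviation, because the spread between class centers contributes to the mixture variance.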
- Training is currently quite slow, and a student interested in
machine learning might want to work on modifying predict-2nd to use a
faster learning method than the current back propagation. This is
not a high priority (we rarely retrain the networks), but might be
a good project for someone more interested in machine learning
than in bioinformatics.
- Neural nets suffer from being basically uninterpretable.
There are some ways we could attempt to glean some information
from the neural nets, and perhaps use this information to make
neural nets that train faster or work better. The basic idea is
to look for recodings of the amino acids that the neural nets are
learning, and then hardwire the more important recodings. The
simplest notion of a recoding is the dot product of an amino-acid
probability vector with a direction vector (a 20-dimensional
vector with magnitude one). The weights of the first layer of
each neural net we have trained can be normalized to produce a
number of such direction vectors (75 vectors for each network in
our most common network architecture).
The resulting vectors can
be clustered (with the square of the dot product as the
appropriate similarity measure) to determine the most common directions.
We can attempt to identify the properties these directions
represent by looking at the correlation with the indices in the
AAIndex database.
We can also compute the dot products of the profile with the
centers of the most
populated clusters and use them as inputs to the
neural net, in addition to or in place of the profile input.
The above residue-at-a-time analysis does not get at the
neighborhood patterns that the neural nets learn. People have
tried to determine some of the patterns by looking at Fourier
transforms and wavelet transforms of numeric properties (mainly
hydrophobicity). We could look at each hidden unit on the first
neural layer and see if its weight matrix can be roughly matched by taking a
single direction vector and a neighbor weight function.
That is, decompose
sum_{a,-2<=j<=2} w_{a,j} Prob(amino acid a in position i+j)
into
sum_{-2<=j<=2} f(j) sum_a dir(a) Prob(amino acid a in position i+j)
or w_{a,j} = f(j) dir(a).
If there are hidden units that can be approximately described this
way, it may be worth providing input features that correspond to
this neighborhood property also.
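Checking whether w_{a,j} is roughly f(j) dir(a) amounts to asking how well the weight matrix of a hidden unit is approximated by a rank-1 matrix. A sketch using plain power iteration (standard library only) is given below; the function name rank1_fit and the "explained" fraction are illustrative choices, not an existing predict-2nd feature. The returned fraction is the share of the squared weights captured by the best rank-1 fit, so values near 1 indicate a hidden unit that fits the direction-times-neighbor-weight description.

```python
import math


def rank1_fit(w, iterations=50):
    """Best rank-1 approximation w[a][j] ~ direction[a] * f[j].

    w: list of rows, one per amino acid a, each with one weight per
       window offset j.
    Returns (direction, f, explained) where direction has unit length and
    explained is the fraction of sum(w^2) captured by the rank-1 fit.
    """
    n_a, n_j = len(w), len(w[0])
    total = sum(x * x for row in w for x in row)
    if total == 0.0:
        raise ValueError("weight matrix is all zeros")

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def project(direction):
        # f[j] = sum_a direction[a] * w[a][j]
        return [sum(direction[a] * w[a][j] for a in range(n_a))
                for j in range(n_j)]

    # Start from the offset column with the largest norm.
    best_j = max(range(n_j), key=lambda j: sum(w[a][j] ** 2 for a in range(n_a)))
    direction = normalize([w[a][best_j] for a in range(n_a)])
    for _ in range(iterations):
        f = project(direction)
        direction = normalize(
            [sum(w[a][j] * f[j] for j in range(n_j)) for a in range(n_a)]
        )
    f = project(direction)
    explained = sum(x * x for x in f) / total
    return direction, f, explained


if __name__ == "__main__":
    true_dir = [0.6, 0.8]
    true_f = [1.0, 2.0, 3.0, 2.0, 1.0]
    w = [[d * x for x in true_f] for d in true_dir]
    print(rank1_fit(w))
```

With a numerical library available, the same quantities come from the leading singular vectors of the weight matrix.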
- Machine-learning students have tried using other machine-learning
techniques besides neural networks to predict local structure, so far
without coming close to the performance of the neural nets. I believe
the difficulty is that they have been using kernel-based methods, but
no one has found a decent kernel for local structure prediction.
Neural nets are good at determining which features are relevant, and
the multi-level windowing I do in predict-2nd helps the neural nets
focus on the center of the window they are looking at.
Perhaps a kernel could be created that uses the first layer of a
neural net to recode the window. This is a long-shot idea,
suitable for a machine-learning student really interested in
kernel-based methods, but not likely to lead to improved local
structure prediction.
We really want to get maximum information gain from a
multi-class soft classifier, which is much easier to design into a
neural net than into SVMs. I'm not sure how to get a good probability
vector out of a multi-class SVM.
- I have an idea for two new structure alphabets, based on what I
learned about hydrogen bonds. We could label residues according
to the type of hydrogen bond made with the amide nitrogen as a
donor, or by the type of hydrogen bond made by the carbonyl oxygen
as acceptor. Here are some possible Hbond types:
- G backbone separation=3 (notor approx -14)
- H backbone separation=4 (notor approx 101)
- P backbone separation!=3,4 "notor" torsion angle approx 170
- A backbone separation!=3,4 "notor" torsion angle approx -83
- B backbone separation!=3,4 "notor" torsion angle approx 23
- S sidechain
- M multiple Hbonds
- N no Hbonds
The breakpoints for the notor torsion angle can be set around
-133, -17, and 76 degrees.
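As an illustration of how the breakpoints would be applied, here is a sketch that labels a single backbone hydrogen bond from its separation and notor angle. The function name and the treatment of angles exactly at a breakpoint (half-open intervals) are assumptions; the S, M, and N classes are omitted because they depend on all the hydrogen bonds of the residue, not on one bond.

```python
def backbone_hbond_label(separation, notor_degrees):
    """Label one backbone hydrogen bond as G, H, P, A, or B.

    separation: backbone separation of the bonded residues.
    notor_degrees: the "notor" torsion angle in degrees (any range).
    """
    if separation == 3:
        return "G"
    if separation == 4:
        return "H"
    # Normalize the angle to [-180, 180) so the P class can wrap around.
    angle = (notor_degrees + 180.0) % 360.0 - 180.0
    if -133.0 <= angle < -17.0:
        return "A"   # peak near -83
    if -17.0 <= angle < 76.0:
        return "B"   # peak near 23
    return "P"       # peak near 170, wrapping through +/-180


if __name__ == "__main__":
    for sep, angle in [(3, -14), (4, 101), (5, 170), (6, -83), (7, 23), (5, -170)]:
        print(sep, angle, backbone_hbond_label(sep, angle))
```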
It might be worth splitting the G class into "G" and "T"
based on the other possible backbone hbond for the residue,
with "G" if the other is separation 3 or 4, and T if it is a
backbone hbond with other separation. Not clear what to do if
this is the only backbone hbond for the residue---probably
stick with G. Perhaps we could separate on notor angle
instead, with T peaking around 23 (same as B) and G peaking
around -14. Current definition of Hbond in undertaker seems
to cut off some of the T peak, since it limits NOtor to [-60,35].
The G set is 10-12 times more common, so it may not be worth
trying to split off the T peak.
It may be more interesting to look at the sep=5 anti-parallel
Hbonds, which seem to have a NOtor angle around 86 (not 23 or -83).
The A/B split does seem to depend on the separation quite strongly
in other ways also, so there may be just a few standard hairpin patterns.
[Pinal Kanabar is working on this one.]
Another pair of alphabets that may be worth exploring provide
more detailed predictions of hairpins (one alphabet for the
carbonyl oxygen, another for the amide nitrogen):
- G backbone separation=3
- H backbone separation=4
- I backbone separation=5
- J backbone separation=6
- K backbone separation=7
- L backbone separation=8
- M backbone separation=9
- N backbone separation=10
- P backbone separation=-3
- Q backbone separation=-4
- R backbone separation=-5
- S backbone separation=-6
- T backbone separation=-7
- U backbone separation=-8
- V backbone separation=-9
- W backbone separation=-10
- A N-O hbond with other separation
- B multiple N-O hbonds
- C hbond to sidechain, no N-O hbond
- D no hbond
- One approach to improving secondary structure that we have tried,
and that may be worth improving on, is to use the fold-recognition
target HMM to align the template set and gather statistics about what
secondary structure codes align to each position. Since the
target HMM can be a multi-track HMM using neural-net predictions,
this method should be able to do at least as well as the
neural-net methods. The method would be particularly valuable
when there is a close homolog in the template library, as it would
be able to "look up" the right answers. This makes testing more
difficult, since it is easy to fool oneself about how well the
method is working, but would be very useful in a tool used by
bench biologists, as the easy predictions would be right (not
necessarily the case for neural nets). It is used in the SAM-T99
server, I believe, but not the SAM-T02 server. We should test it
properly and use it in the SAM-T04 server.
Mark Diekhans implemented this in SAM, but I am not sure of
the details of his implementation. In his tests, the method
improved Q3 scores slightly, but made the information gain worse
than the neural nets used to create the HMM tracks. Several
things could be done to improve this method:
- The details of the method could be documented, and testing
of the method could be properly reported in a journal article.
(Mark has offered to document what he's done, if someone
wants to pick up the project.)
- The HMM could use multiple tracks from different local
structure predictions to get better alignments, hence getting better
statistics. Alignment testing has given us some weights for
multi-track HMMs that should work better than what Mark has tested.
- I'm not sure precisely what statistics Mark gathers
for each HMM state, nor whether he uses Viterbi training or
full forward-backward (I suspect it is forward-backward).
One could try different minor variations of the algorithm.
- The weighting scheme for individual sequences that is
used for weighting the gathered statistics could be changed.
I don't know what is currently used. I think that a method
using exp(-E_value) would be worth trying. Here is the
reasoning: We want to give more weight to sequences that match
the HMM than to sequences that are very remote, but a simple
cutoff is likely to have problems when there are no very
similar sequences. Using P(seq|HMM) as the weight is likely
to put too much weight on the closest match. If we use
instead the probability of not getting that good a sequence by
chance in the N sequences (1-p_0)^N = (1-E_value/N)^N, we
could have numerical problems, but for E_value << N this is
approximately exp(-E_value).
- It is unknown whether over-representation of some
families in the template set helps or hurts the method, and
whether lower-quality models just add noise or increase the
range over which some matches can be made.
- The method could also be compared with a conceptually
simpler, but more expensive approach: building a fragfinder
library and collecting statistics from the fragments.
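For the sequence-weighting idea in the list above, the quantity (1-E_value/N)^N can be evaluated without numerical trouble by working in log space, and it can be compared directly with exp(-E_value). The sketch below uses hypothetical function names and is only meant to show that the two agree when E_value is much smaller than N.

```python
import math


def chance_weight(e_value, n):
    """(1 - E/N)^N, the probability of not seeing a score this good by
    chance in N sequences, computed via log1p for numerical stability."""
    if e_value >= n:
        return 0.0
    return math.exp(n * math.log1p(-e_value / n))


def approx_weight(e_value):
    """The simpler weight exp(-E_value)."""
    return math.exp(-e_value)


if __name__ == "__main__":
    n = 1000000
    for e in (1e-20, 0.01, 0.5, 3.0, 50.0):
        print(e, chance_weight(e, n), approx_weight(e))
```

Either form gives weights near 1 for strong hits and weights that decay smoothly toward 0 for remote ones, with no hard cutoff.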
- Fold recognition
- Our current best fold-recognition method relies on having a
two-track target model and a library of amino-acid-only template
models, and combining the scores.
There are several areas where this
could be improved, but experimentation will be needed with the
fold-recognition test set to see which make a big difference.
- Updating our test set.
- We've been using the same test set for a long time, which
raises two questions: (1) Is the underlying "gold standard" still
valid? There have been reclassifications and improvements in SCOP
since we created our test set, which may change the results,
particularly in cases where methods are fairly close in performance.
(2) Have we started training to the benchmark? If we keep
selecting methods based on how well they do on a particular
benchmark, we can easily get overtraining. Right now this is not
a big concern, as we have done relatively few tests on the full
fold-recognition benchmark, but it would be good to have a test
that is at least somewhat new.
If we create a new test set based on SCOP, we should pay
attention to Julian Gough's rules about what to include and
exclude from the test set, based on his understanding of SCOP classification:
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ruleset_1.65.html.
Even more important, the scripts that we used for running the
tests are no longer usable, as we have changed the underlying
scheduler for running programs on the mini-cluster. We no longer
use Condor (which never worked well for our applications), but use parasol.
We have a generic submission script (para-trickle-make) that can
be used for a variety of tasks, and it would be good to set up a
new fold-recognition test that uses para-trickle-make to submit
jobs to the farm cluster.
- Entropy profiles
- There was a poster at ISMB 04 (I-84 FASE: a new
fold-recognition method through entropy profiles by Alejandro
Sanchez-Flores and Lorenzo Segovia, both at UNAM in Mexico) that
claimed to get excellent results using just the entropy
information from the profile to do fold recognition---all other
signals were thrown out and just a 4-level entropy alphabet used.
In close questioning, I did not see an obvious flaw in their
testing procedure, so we should try to implement this and see if
their work is reproducible. To avoid artifacts, we should use our
own testing methods and our own computation of entropy.
- Two-track template models
- The predicted secondary structure for the target sequence
could be used with a template model by modifying SAM to use
log P_sequence_pos(template_state) as the second track scoring.
This requires having SAM read in probability vectors for the
"database" and have labels for the states of the HMM.
- Profile-profile scoring
- A somewhat more general version of the two-track template model
would allow full probability vectors for both the states and the
sequence---with scores computed in any of several ways. The
simplest is to take the log of the dot product of the two
probability vectors, but other schemes have been proposed in the
literature. Note that this scheme generalizes the current
profile-sequence scoring and the two-track template scoring
proposed above, with the simple labels handled as elementary
vectors (all zero except for a 1 in the labeled position).
Many of the best fold-recognition servers use some version of
profile-profile scoring. If we can pick the right multitrack HMM
to use with profile-profile scoring, we should be able to make
substantial improvements over our current scheme.
There have been various proposals for handling insertions and
deletions in profile-profile scoring. Bob Edgar's seems the most
complete, but some of the simpler ones are probably worth
implementing also.
One nice thing about profile-profile scoring is that we can
use the existing HMM formats for representing the profiles, not
requiring any new I/O code in SAM. (Though it would be good to
read the RDB format output by the neural nets and output the HMM,
so that we could eliminate the TCL script that currently does this.)
There may be different profile-profile alignment methods or
parameter settings that are optimal for progressive alignment,
pairwise alignment of remote homologs, and fold recognition.
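The simplest scoring scheme mentioned above, the log of the dot product of two probability vectors, is sketched below, together with the observation that a labeled position can be treated as an elementary (one-hot) vector. The function names are hypothetical and this is not SAM code; it only illustrates how profile-sequence and two-track template scoring fall out as special cases.

```python
import math


def dot_product_score(state_probs, position_probs):
    """Log of the dot product of two probability vectors."""
    dot = sum(s * q for s, q in zip(state_probs, position_probs))
    return math.log(dot) if dot > 0.0 else float("-inf")


def elementary_vector(index, size):
    """All zeros except for a 1 in the labeled position."""
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    state = [0.7, 0.2, 0.1]          # e.g. emission probabilities of a state
    predicted = [0.6, 0.3, 0.1]      # e.g. a predicted probability vector
    print(dot_product_score(state, predicted))
    # With a one-hot vector the score reduces to log P(label | state).
    print(dot_product_score(state, elementary_vector(0, 3)), math.log(0.7))
```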
- Parameter adjustment for local structure alphabets
- Rachel Karchin tested local structure alphabets with a constant
weight of 1 on the amino-acid track and 0.3 on the local structure track.
These weights have not been optimized, and we may be able to get
better performance by adjusting them (and adjusting them
differently for different alphabets).
Sol Katzman rewrote the alignment-test scripts, and Bret Barnes
has run many tests with different alphabets and different
parameter settings. Other alphabets could be tried, and the
scripts could be rewritten to have less I/O impact on the file
server when run from the cluster.
Perhaps a more interesting project would be to examine
that mound of data and try to come up with simple rules for
parameter setting to get optimum performance given a pair of
sequences. We have noticed the best setting for pairs that are
closely related is quite different from the best setting for pairs
that are remotely related, but have not yet come up with a simple
formula for choosing good parameter settings that are dependent on
the pair. One approach might be to make a pairwise alignment with
a default setting that works fairly well, then measure how
different the two sequences are in this alignment. We need then
to estimate the best track weights based on this difference measure.
If the track-weight optimization is based on the current
alignment test sets, we would need to come up with a new set of
pairwise alignments to test the procedure. Since our most recent
test set was generated in 2002, there should be plenty of new data
to use for testing.
- Combining templates
- Our fold-recognition methods treat every template in the
library as independent of the rest, but we actually have many
templates which contain similar folds. Various ways have been
proposed for combining template scores to get a "fold" score, and
a couple of students have done preliminary tests of the
"product-of-pvalues" method, and shown that it might help, but
their projects were never taken to the point where the method
could be incorporated into an automatic prediction method.
One possible project is to incorporate the
"product-of-pvalues" method into our automatic predictions.
This requires designing a way to update the
calibration parameters for the combining method automatically as
new templates are added to the library, and converting the
fold results back into selection of good templates.
Another possible project is to devise a different combining
method. The "product-of-pvalues" method treats all templates in a
fold class as equally informative and does not use information
from templates in competing fold classes. One can envision
combining methods such as logistic regression that could use the
extra information to get better predictions. One danger is that
multi-domain proteins legitimately have multiple correct folds, so
simple competition between fold classes is not quite the right model.
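For reference, one standard way to turn a product of p-values into a combined p-value is sketched below. It assumes the k p-values are independent and uniform under the null, which is certainly not true for templates sharing a fold, so some calibration would still be needed; the details of the students' "product-of-pvalues" tests are not recorded here, and the function name is hypothetical. For a product x of k such p-values, P(product <= x) = x * sum_{i<k} (-ln x)^i / i!.

```python
import math


def product_pvalue(pvalues):
    """Combined p-value for the product of k independent uniform p-values.

    All p-values are assumed to be strictly positive.
    """
    k = len(pvalues)
    log_product = sum(math.log(p) for p in pvalues)  # <= 0
    t = -log_product
    term = 1.0     # t^i / i!, starting at i = 0
    total = 0.0
    for i in range(k):
        total += term
        term *= t / (i + 1)
    return math.exp(log_product) * total


if __name__ == "__main__":
    print(product_pvalue([0.05]))              # a single p-value is unchanged
    print(product_pvalue([0.5, 0.5]))          # two uninformative templates
    print(product_pvalue([0.01, 0.02, 0.05]))  # three consistent hits
```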
- Handling multi-domain proteins
- Our current methods look for target-template matches where
there is a common subdomain, but do not take the next step of
looking separately at the part of the target that does not match
the template and doing a separate prediction for it. We could
probably improve our fold-recognition capability if we did some
domain splitting, either after having recognized a domain using
the whole chain, or by trying domain splits based on the multiple
alignment. In the recent run of all yeast ORFs, one of the
predictions took 6 days, because the protein was 4910 residues
long---this clearly would have been helped by breaking up into
smaller pieces.
In CASP6, it was obvious that we had managed to put together
some multi-domain predictions without explicit domain prediction.
It is not clear to me that explicit domain prediction is as
valuable as I thought in 2002.
- Better calibration of HMMs, particularly multi-track HMMs
- Our current method for calibrating the E-values of HMMs is
described in the paper "Calibrating
E-values for Hidden Markov Models with Reverse-sequence Null
Models", where a few problems are described.
- One of the
problems is that simple i.i.d. models of "random" sequences work
ok for amino-acid sequences, but not for secondary structure
sequences, which often have long runs of the same letter. We have
a sophisticated sequence generator for amino-acid sequences
released as open-source code: gen_sequence,
but no good random sequence generator for secondary structure sequences.
There are several ways to try to generate random secondary
structure sequences: i.i.d. models, Dirichlet mixture for
composition with i.i.d. from the composition (used in gen_sequence),
first-order Markov chains, higher-order
Markov chains, variable-order Markov chains, mixtures of Markov
chains, hidden Markov models, Markov chains of segments with
length models for each segment, and so forth. Implementing one or
more of these methods and testing how well it can encode secondary
structure sequences (over various alphabets) in a train/test
experiment would be very useful. I'm particularly intrigued by
the idea of generating sequences of segments, with a separate
length model for each segment type, and have done a little
preliminary work on the idea.
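The segment-based generator could look roughly like the sketch below: a Markov chain over segment types, with a separate empirical length model for each type. Everything concrete here is invented for illustration (the three-letter E/H/C alphabet, the transition table, the length lists, and the function name); real parameters would be estimated from training data, and this is not the preliminary work mentioned above.

```python
import random

# Toy parameters, invented for illustration only.
TRANSITIONS = {
    "C": {"H": 0.5, "E": 0.5},
    "H": {"C": 1.0},
    "E": {"C": 1.0},
}
LENGTHS = {              # empirical segment lengths to sample from
    "C": [1, 2, 3, 4, 5],
    "H": [4, 6, 8, 10, 12],
    "E": [2, 3, 4, 5, 6],
}


def generate_segments(length, transitions, length_models, rng, start="C"):
    """Generate a sequence as a chain of segments with per-type lengths.

    The final segment is truncated so the result has exactly `length`
    letters.
    """
    letters = []
    state = start
    while len(letters) < length:
        segment_length = rng.choice(length_models[state])
        letters.extend(state * segment_length)
        next_states, weights = zip(*transitions[state].items())
        state = rng.choices(next_states, weights=weights)[0]
    return "".join(letters[:length])


if __name__ == "__main__":
    print(generate_segments(60, TRANSITIONS, LENGTHS, random.Random(0)))
```

Because there are no self-transitions in the table, run lengths are controlled entirely by the length models, which is the property that i.i.d. and low-order Markov generators lack.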
- Even if we get a good random sequence generator for the
various local structure alphabets, we still have some trouble with
the failure of the "reversibility assumption" for some alphabets
that otherwise look quite useful, such as de Brevern's protein blocks.
For multi-track HMMs using these alphabets, we need a different
way of calibrating the E-values that does not use reverse-sequence
null models. One possibility would be to fit a Gumbel
extreme-value distribution to scores on randomly generated
sequences (using a good generator for the local structure alphabet).
Other parameterized distributions could also be explored, as there
is some evidence that the Gumbel distribution does not have a fat
enough tail.
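As a starting point for the Gumbel idea, the sketch below fits the location and scale parameters by the method of moments (a simpler stand-in for maximum-likelihood fitting) and converts a score into a tail probability. It assumes scores where larger is better, so signs would need flipping for cost-style scores; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments fit of a Gumbel (maximum) distribution.

    Uses mean = mu + gamma * beta and variance = (pi * beta)^2 / 6.
    Returns (mu, beta).
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * variance) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_pvalue(score, mu, beta):
    """P(S >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    return -math.expm1(-math.exp(-(score - mu) / beta))


if __name__ == "__main__":
    random_scores = [1.0, 2.0, 3.0, 2.5, 1.5, 4.0, 2.2]
    mu, beta = fit_gumbel_moments(random_scores)
    print(mu, beta, gumbel_pvalue(6.0, mu, beta))
```

Multiplying the p-value by the database size gives an E-value; a heavier-tailed family could be swapped in by replacing these two functions.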
- Better template selection for close homologs
- When we have many close homologs in PDB, the current
fold-recognition method finds all of them, but does not do a good
job of picking the best template.
Model drift often results in the HMM being better at
recognizing some subfamily different from the one containing the
target protein. For example, if the SH3 domain for 2abl is given
as a seed for the SAM-T02 web site, the self-hit is about 80 down
from the top (partly because 2abl is not in the template library).
Blast finds several sequences containing exact matches to the
55-residue seed: 1oplA, 1opkA, 2abl, 1ju5C, 1abq, 1abo[AB], 1awo,
and 1bbz[ACEG]. Four of these are in the template library (1aboB,
1awo, 1bbz[AE]), but the best scoring match using the template
library is 1ng2A, which has the SH3 domain twice. If we just use
the w0.5 model, we can see that it has drifted to center on
1udlA, 1gfc, and 1gcqA, rather than the sequence we started with.
Various fixes are possible.
- One is to look at the simple
pairwise scores (ignoring the multiple alignment) when choosing
close homologs---for this example, the best alignment scores came
from the alignments that had identical sequences, which would have
worked fine in this case. This reduces to using rather crude
tools though, so might not work as well when sequences are
somewhat diverged.
- Another possibility is to look for pdb files in the t2k.a2m
alignment (there are 78 of them in this example), and make a
phylogenetic tree based on the multiple alignment. The closest
templates in the tree are the most likely ones to be worth using.
This would work well here, since there are many PDB files in the
alignment, and the right ones would cluster no matter what
tree-building method was used.
- A slightly more powerful method would be to align all the top
hits from PDB to the w0.5 model, then build a tree from that
alignment. This may get a few more distant sequences than the t2k
alignment building (for example, there are 95 SH3 domains found,
not just 78, in the 2abl example). The method could be applied to
fold-recognition targets as well as the targets with really close
homologs in the templates.
- We could also look at the ordering of the predictions with
different weights on the second (or third) track. It seems that
the predicted local structure tracks become more valuable for more
distant templates. If we find close templates, we may want to use
parameters that put more weight on the AA track.
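As a rough illustration of the first fix listed above (ranking close homologs by a simple pairwise score), here is a sketch that uses percent identity as the pairwise score. It assumes the target and template rows have already been padded to equal length from one alignment; the function names and the toy sequences are made up for illustration:

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-."):
    """Percent identity over columns where both rows have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("rows must have equal length")
    both = same = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in gap_chars or b in gap_chars:
            continue
        both += 1
        if a.upper() == b.upper():
            same += 1
    return 100.0 * same / both if both else 0.0

def rank_templates(target_row, template_rows):
    """Return (percent identity, name) pairs, best template first."""
    scored = [(percent_identity(target_row, row), name)
              for name, row in template_rows.items()]
    return sorted(scored, key=lambda item: (-item[0], item[1]))

# Toy rows with hypothetical template names.
target = "MKV-LA"
templates = {"tmplA": "MKVGLA", "tmplB": "MRV-IA", "tmplC": "------"}
for identity, name in rank_templates(target, templates):
    print(name, identity)
```

As noted above, such a crude score is only likely to help when the homologs are very close.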
- Alignment of target and template
- Currently, our best alignments are made from target 2-track HMMs
(using the STR alphabet as the second track) doing local alignments to
the templates. We have a couple of pairwise alignment test sets that
can be used for evaluating different alignment methods, though we have
to be a bit careful about tuning an alignment method to the test sets.
There are a lot of things that could be done to
improve pairwise alignments:
- Exploring other local structure alphabets
- Rachel Karchin is exploring other local structure
alphabets for her Ph.D. thesis. After she finishes, her
methods should be available for other students who want to try
new structure alphabets. One that did not get covered in her
thesis is the STR3 alphabet (a minor extension of STR and STR2).
Since STR2 is currently our best single alphabet, STR3 would
be worth exploring.
- Tweaking parameters
- We have not tried adjusting transition-cost parameters
nor the weight of the second track. Other alignment
parameters (such as fimstrength) could also be explored.
This project requires no major coding, just running lots of
alignment experiments and keeping careful records of the results.
- Three-track HMMs
- The two-track HMMs currently give the best results, but
Rachel Karchin has proposed some three-track HMMs that may do
better (there are some problems at the moment with hmmscore,
that Richard Hughey and Rachel Karchin are looking into).
Exploring the possible combinations of alphabets could result
in significantly better alignments.
- Alignment trimming
- In earlier work with one-track HMMs, Melissa Cline found
that she could get better alignments by trimming out the
unreliable parts of global alignments than by using strictly
local alignments. Her method is implemented in SAM, but has
not been tested or calibrated with multi-track HMMs.
Jenny Draper did some work on this---it doesn't look
like alignment trimming helps significantly with the newest
alignment methods.
- Profile-profile alignment
- All our alignment methods currently use homologs only for
one of the two sequences being aligned (usually the target
sequence, though sometimes the template sequence). We should
be able to do better by using homolog information for both sequences.
For the templates, we can use either sequence-based SAM-T2K
multiple alignments or structure-based alignments.
Anyone interested in the profile-profile alignment should
talk to Richard Hughey, Jenny Draper, and Martina Koeve, all
of whom have done some work on the project.
- Correlated columns of multiple alignments
- I threw together a program to measure the mutual information
between columns of a multiple alignment (any alphabet).
This program can be used to produce input for a neural net to
predict contact pairs.
[George Shackelford is working on this.]
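For readers unfamiliar with the measure, a minimal sketch of mutual information between two alignment columns follows. This is not the program mentioned above, just the textbook formula with no small-sample correction; the function names and the gap characters are assumptions:

```python
import math
from collections import Counter

def column_mutual_information(col_i, col_j, gap_chars="-."):
    """Mutual information (in bits) between two alignment columns.

    Rows with a gap in either column are skipped.
    """
    pairs = [(a, b) for a, b in zip(col_i, col_j)
             if a not in gap_chars and b not in gap_chars]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    count_i = Counter(a for a, _ in pairs)
    count_j = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) / (p(a) p(b)) simplifies to c*n / (count_a * count_b)
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi

def all_column_pairs(rows):
    """Mutual information for every pair of columns of equal-length rows."""
    columns = list(zip(*rows))
    return {(i, j): column_mutual_information(columns[i], columns[j])
            for i in range(len(columns))
            for j in range(i + 1, len(columns))}

print(all_column_pairs(["AGA", "AGC", "CTA", "CTC"]))
```

The resulting table of pairwise values is the sort of thing that could be fed to a contact-predicting neural net, though the real input encoding is not specified here.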
- Generating fragments (fragfinder)
- The fragfinder program produces short gapless alignments of the
target sequence to template sequences, trying to select the best k
(say 6) fragments for each position in the target. There are many
variants possible on the selection algorithm, and we could try several
different multi-track HMMs to search the template library for fragments.
The fragfinder program has not had extensive testing---we selected an
algorithm and HMM based on intuition and minuscule tests on one target
sequence. The undertaker program can be used to evaluate the fragments
as fragments (looking at the histogram of RMSD, or RMSD normalized by
length, for example), and we can come up with various figures of merit
for fragment libraries, allowing us to compare different fragment
finding methods.
The fragfinder output is an A2M file, which is fairly compact
and easily produced by the SAM software, but which does not include 3D
information. Undertaker can read the a2m file and the necessary PDB
template files and generate fragments by sidechain replacement, but
does not currently have a way of storing these fragments in a file.
Existing undertaker file formats could easily be adapted to save the
fragments, speeding up multiple undertaker runs and making it easier
to run undertaker on the kilocluster (by eliminating the need for
access to the PDB library).
In addition to providing a format for
native undertaker fragment libraries, it would be convenient to have
I/O handling in undertaker for Rosetta-format fragment libraries.
These do not contain as much information as undertaker fragment
libraries (no sidechains), so outputting Rosetta libraries is easier than
inputting them. Being able to output a Rosetta library would allow us
to convert fragfinder results to a Rosetta library, which could then
be compared with existing fragment-finding methods.
[Sol produced Rosetta fragment libraries and did a lot of work on
filtering them in undertaker. We have not done substantial work on
getting fragfinder to produce better libraries.]
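The selection step described above (keeping the best k fragments for each target position) could be sketched as follows. This is not fragfinder's algorithm; the Fragment record, the convention that a lower score is better, and the choice to index fragments by their start position are all assumptions made for illustration:

```python
import heapq
from collections import defaultdict, namedtuple

# Hypothetical record for one gapless fragment alignment.
Fragment = namedtuple(
    "Fragment", "target_start score template template_start length")

def best_fragments_per_position(candidates, k=6):
    """Keep the k lowest-scoring fragments starting at each target position."""
    by_position = defaultdict(list)
    for fragment in candidates:
        by_position[fragment.target_start].append(fragment)
    return {position: heapq.nsmallest(k, fragments, key=lambda f: f.score)
            for position, fragments in by_position.items()}

candidates = [
    Fragment(0, 5.0, "t1", 10, 9),
    Fragment(0, 2.0, "t2", 4, 9),
    Fragment(0, 3.0, "t3", 7, 9),
    Fragment(1, 1.0, "t1", 11, 9),
]
for position, fragments in sorted(best_fragments_per_position(candidates, k=2).items()):
    print(position, [f.template for f in fragments])
```

Variants of the selection algorithm would mostly change the key function or add a diversity filter before the nsmallest call.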
- Packing fragments (undertaker)
- The undertaker program got its first real tryout in summer 2002 in
the CASP5 experiment. We applied it uniformly to all targets, with
mixed results. On the comparative modeling targets, the resulting
predictions were worse than the automatic predictions by the server
about 3/4 of the time. On the fold-recognition targets, undertaker
helped about 2/3 of the time, and on the very difficult fold
recognition and new fold targets (where, traditionally, our methods
have made no headway), we did well enough to be invited to speak and
to submit a paper. Still, there is an awful lot of work that
needs to be done on undertaker.
Here is a list of a few of the many things that came up over the summer:
- More modern C++
- The C++ used in undertaker is a subset of the language that
avoids many of the newer features (since they were either not
available or not reliable when I started creating the class
library that undertaker uses). One new feature that would be nice
to use (if it works robustly) is dynamic casts, as there are
several places in the code where I do run-time type checking and
casting with my own mechanism.
- Print out and analyze distances
- It would be a very small project to add a new command to
accept a list of atoms and print out the distances between them.
This command could be useful for extracting distance constraints
from a model to be added to the score function, either to keep
undertaker from pulling apart some good feature of a template, or
to transfer information from a model that is not going to be used
as a template.
[SIMPLE VERSION OF THIS DONE]
A somewhat more ambitious project would be to make histograms
for several different types of distance measurements, that could
later be used for generating cost functions. For example, it
would be useful to have histograms of the distances between
corresponding CB atoms on adjacent beta strands. Is it different
for parallel and anti-parallel strands? What about for CA atoms?
These statistics would be useful for creating constraints for
pairing up beta strands when we don't know the parity of the
hydrogen bonding. It would also be good to do the histograms
conditioned on separation along the backbone.
[STATISTICS FOR BETA-BONDING PARTNERS DONE 6-Dec-2003:
I added a routine to undertaker to gather
some statistics on beta bonding partners for the dunbrack-1332
training set. I still have to digest the results a bit, but here are
some preliminary figures:
CA-distances:
| type | mean | std dev | approx.peak |
| all beta partners | 4.97 | 0.39 | 5.09 |
| parallel | 4.84 | 0.26 | 4.83 |
| anti-parallel | 5.03 | 0.43 | 5.13 |
| anti-parallel unbonded | 4.43 | 0.24 | 4.39 |
| anti-parallel bonded | 5.25 | 0.21 | 5.13 |
CB-distances:
| type | mean | std dev | approx.peak |
| all beta partners | 5.16 | 0.71 | 5.19 |
| parallel | 5.13 | 0.65 | 4.85 |
| anti-parallel | 5.17 | 0.73 | 5.19 |
| anti-parallel unbonded | 5.58 | 0.65 | 5.79 |
| anti-parallel bonded | 5.02 | 0.70 | 4.43 |
Essentially all the 24135 beta-pair residues have CB-CB
distances less than 8 and 99% have CB-CB distances less than
7. The common use of 8 Angstrom CB-CB for "contacts" may be a
bit loose for identifying neighbors in beta sheets, but
perhaps it is useful in other contexts (distance between
sheets in a sandwich?).
I used a very crude way of finding beta-bonding partners (just looking
for a pair of H-bonds), which probably increases the noise of these
estimates.
The CA distances have much less variance than the CB distances, as you
would expect for a structure defined by hbonds on the N and O atoms.
The CA spacing is dependent on whether the residues are parallel
or anti-parallel, and whether or not the residues are bonded. This is
also expected (build some models with the Darling model kits if this
is not immediately obvious to you).
The CB spacing is less dependent on the relationship between the
residues, but still more dependent than I had expected, with the
bonded/unbonded distinction still making a 0.5 Angstrom difference.
I'm a bit worried about the distributions where the peak is so far
from the mean (CB parallel and CB antiparallel bonded). I'll have to
check to see whether these are artifacts of my histograms, inclusion
of non-beta-sheet bonding pairs, bad H-bond definition, or some other
phenomenon that I should be conditioning on.
I also don't know what distribution the data best fits. Probably not
a normal distribution (distances are positive), but possibly a gamma
or log-normal.
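One way to decide which distribution fits best would be to compare log-likelihoods of the fitted candidates on the raw distances. A minimal sketch, assuming the distances are available as a list of positive numbers; the normal and log-normal fits are maximum likelihood, while the gamma fit uses the method of moments as a shortcut. The data below is synthetic stand-in data, not the dunbrack-1332 statistics:

```python
import math
import random

def mean_var(xs):
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / n

def loglik_normal(xs):
    m, v = mean_var(xs)
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
               for x in xs)

def loglik_lognormal(xs):
    logs = [math.log(x) for x in xs]
    m, v = mean_var(logs)
    return sum(-lx - 0.5 * math.log(2 * math.pi * v) - (lx - m) ** 2 / (2 * v)
               for lx in logs)

def loglik_gamma(xs):
    m, v = mean_var(xs)
    shape, scale = m * m / v, v / m   # method-of-moments estimates
    return sum((shape - 1) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale)
               for x in xs)

# Synthetic stand-in for a list of measured distances.
rng = random.Random(1)
distances = [rng.lognormvariate(1.6, 0.5) for _ in range(2000)]
for name, fn in [("normal", loglik_normal), ("log-normal", loglik_lognormal),
                 ("gamma", loglik_gamma)]:
    print(name, fn(distances))
```

With standard deviations as small relative to the means as in the tables above, the three fits may be hard to tell apart, so the tails probably deserve the most attention.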
- Manual moves
- We did a lot of chopping up proteins and trying to force them
into weird conformations this summer. It might be nice to replace
all the crude perl scripts we wrote with a simple "move" command
that would take a range of residues and some spec for where they
should be moved to. It would also be nice to get a copy of
"protoshop" from UC Davis, since it seems to have fairly nice
manual-move capabilities (though the inverse kinematics it uses to do the
moves results in very unlikely conformations---I might prefer gaps
to such unlikely conformations).
http://graphics.cs.ucdavis.edu/~okreylos/ResDev/ProtoShop has
more info on ProtoShop.
[I believe that David Bernick has installed protoshop, and is
trying to get it to work consistently.]
- Interactive hand tweaking
- There are lots of times I would like to be able to apply a
specific operator (say break reduction for a particular gap
with fragments inserted in a particular location), see the
results using Rasmol, apply another operator, and so on.
- Good quality measures
- We'd like a robust measure of how bad a prediction is. RMSD
alone is not a good measure, since the "dot model" having all
atoms at (0,0,0) often does fairly well by that measure.
Measures like "GDT" which count how many (or what fraction) of
the residues are within a given error distance when superimposed
on the correct structure can also suffer from overly rewarding the
dot model.
It might be worth looking at some "contact" measures that reward
conformations for bringing the right residues close together and
keeping apart residues that are not close in the real structure.
It would be good to come up with a cost measure that is not too
sensitive to the length of the protein or to one badly folded piece.
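One possible contact measure, sketched under my own assumptions: represent each residue by one point (say its CB atom), call two residues in contact if they are closer than a cutoff and at least a few residues apart along the chain, and score a model by the F1 value of its contact set against the native one. Because false contacts lower the precision, the dot model is penalized. The cutoff, the separation, and the function names are all illustrative choices:

```python
import math

def contact_set(coords, cutoff=8.0, min_separation=3):
    """Pairs (i, j), j - i >= min_separation, closer than cutoff."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts

def contact_f1(model_coords, native_coords, cutoff=8.0, min_separation=3):
    """F1 score of model contacts against native contacts (1.0 is perfect)."""
    model = contact_set(model_coords, cutoff, min_separation)
    native = contact_set(native_coords, cutoff, min_separation)
    if not model and not native:
        return 1.0
    true_positives = len(model & native)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(model)
    recall = true_positives / len(native)
    return 2 * precision * recall / (precision + recall)

# Toy "native": a flat 20-residue hairpin; the "dot model" has all atoms at the origin.
native = ([(3.8 * i, 0.0, 0.0) for i in range(10)]
          + [(3.8 * (19 - i), 5.0, 0.0) for i in range(10, 20)])
dot_model = [(0.0, 0.0, 0.0)] * 20
print(contact_f1(native, native), contact_f1(dot_model, native))
```

This sketch does nothing about length sensitivity or about one badly folded piece dominating the score; those would need additional normalization.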
- Clean up cost functions to handle incomplete conformations
- When reading in true, but incomplete, conformations (the usual
case for PDB structures) the cost functions often report the
conformations as terrible, since the missing atoms are all at
location (0,0,0). The cost functions need to be modified to be
aware of the new "contains_atom()" member of the Conformation class.
- Cost function calibration
- We have a lot of decoys generated for CASP5 and CASP6, and now
have correct structures for most of them. We can use this set, or
a test set generated by the Baker group, to try to adjust the
weights of the various components of the cost function. Probably
we should do a linear regression to fit the cost to whatever
badness measure we choose, though we may have to figure out a
length scaling to be able to do the fitting across several
different targets.
I did some preliminary work on this, and (as expected) linear
regression works terribly when the cost functions are highly correlated.
We'll probably need to do some dimension reduction first (perhaps
with principal components analysis?) before linear regression
stands a chance of working.
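Before attempting the dimension reduction, it may help to see which cost components are the correlated ones. A small diagnostic sketch, assuming each component has been evaluated on the same list of decoys; the component names and values are hypothetical, and this only flags correlated pairs rather than doing the principal components analysis itself:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(components, threshold=0.9):
    """List (name_a, name_b, r) for component pairs with |r| >= threshold."""
    names = sorted(components)
    flagged = []
    for index, a in enumerate(names):
        for b in names[index + 1:]:
            r = pearson(components[a], components[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

# Hypothetical component values on five decoys.
components = {
    "burial_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "burial_b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "hbond": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(correlated_pairs(components))
```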
- Use NMR information
- We already have distance constraints implemented---it would be
good to provide an input format that accepts the standard NMR
constraint/restraint files (many of which are available from
BMRB).
We should
also try to interpret other information, such as residual dipolar
couplings and chemical shift information, as this has been fairly
successful in Rosetta.
Oscar Hur has written scripts to convert NMR restraint
files into undertaker constraints and has done some testing of the method.
- Save and restore the SpecificFragment Library and the AlignmentLibrary
- A lot of time is wasted reading template PDB files and
recomputing sidechain replacement for
fragment libraries and alignments on each run.
Saving and restoring them would be useful, particularly for
reducing the number of files needed for an undertaker run---we'd
like to be able to capture all the files needed and distribute
them to the kilocluster for doing many parallel optimizations.
Saving the specific fragment library would also make it feasible
to put some more effort into creating it---doing sidechain
optimization to remove steric conflicts, for example.
- Predicted burial
- Rachel Karchin has tested neural nets to predict various of
the burial functions. We could incorporate predicted burial into
the cost functions, instead of just the generic burial histogram
by residue type. An example of this sort of cost function has
already been implemented for the "alpha" angle predictions.
What we currently do for alpha
is to convert the 11-dimensional neural network prediction vector
into a nearly continuous function of the alpha angle
(360 1-degree bins). This involves combining the neural network
prediction with the histograms for individual amino acids. The
current algorithm starts with the histograms for alpha
conditioned on the residue and conditioned on the next residue,
and takes their product to get an initial continuous curve.
Each of the 11 sections is then scaled to get the appropriate probability
according to the neural net prediction, and the resulting curve
is smoothed to remove the discontinuities at the edges of the sections.
The scaling and smoothing are repeated several times.
Some parameters to play with include
- What initial curve to start with (flat, histogram based on
all amino acids, histogram conditioned on current amino acid, histogram
conditioned on next amino acid, product of histograms, average of
histograms, ...).
- How much smoothing to do at each step. This probably depends
on the initial curve chosen, since smoothing tends to spread out peaks.
If the initial curve has peaks that are too sharp, then more smoothing
would improve it, but if it has broad peaks, smoothing may make
things worse.
- How many iterations of rescaling and smoothing to do. Again,
more smoothing flattens peaks more, and lots of iterations of
scaling and smoothing will tend to move the peaks to the centers
of the sections.
The way to tune the algorithm is to use a large training set with
neural-net predictions and try to minimize the total cost
(negative log probability) of the actual alpha values.
Rocky Choi did some tuning work on the predictor for alpha values.
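The scale-and-smooth loop described above might look roughly like this. It is a sketch under several assumptions, not the code in undertaker: the sections are taken to be contiguous runs of the 360 bins given by their start bins, the smoothing is a plain circular moving average, and all names are made up. The tunable parameters listed above correspond to the initial curve, half_width, and iterations:

```python
import math

def normalize(curve):
    total = sum(curve)
    return [c / total for c in curve]

def scale_sections(curve, boundaries, section_probs):
    """Rescale each section so its total mass matches the predicted probability."""
    ends = list(boundaries[1:]) + [len(curve)]
    scaled = list(curve)
    for start, end, prob in zip(boundaries, ends, section_probs):
        mass = sum(curve[start:end])
        factor = prob / mass if mass > 0 else 0.0
        for i in range(start, end):
            scaled[i] = curve[i] * factor
    return scaled

def smooth_circular(curve, half_width):
    """Moving average over 2*half_width + 1 bins, wrapping around 360 degrees."""
    if half_width <= 0:
        return list(curve)
    n, width = len(curve), 2 * half_width + 1
    return [sum(curve[(i + d) % n] for d in range(-half_width, half_width + 1)) / width
            for i in range(n)]

def alpha_curve(initial, boundaries, section_probs, iterations=3, half_width=2):
    curve = normalize(initial)
    for _ in range(iterations):
        curve = scale_sections(curve, boundaries, section_probs)
        curve = smooth_circular(curve, half_width)
    return normalize(curve)

def alpha_cost(curve, alpha_degrees):
    """Negative log probability of the bin containing the observed alpha angle."""
    return -math.log(curve[math.floor(alpha_degrees) % len(curve)])

# Hypothetical prediction: 11 sections, one strongly favored, flat initial curve.
boundaries = list(range(0, 360, 33))
probs = [0.05] * 11
probs[4] = 0.5
curve = alpha_curve([1.0] * 360, boundaries, probs)
print(alpha_cost(curve, 140.0), alpha_cost(curve, 10.0))
```

Tuning would then amount to summing alpha_cost over a training set for each parameter setting and keeping the setting with the lowest total.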
- Apply secondary structure
- It might be good to include secondary structure in guiding the
conformation search, not just the score functions. For
example, we could
- Look for fragments that match peaks in predicted alpha values.
- Label fragments with DSSP (or other) labels and look for
matches to strongly predicted regions.
- Add length constraints for helical or strand regions.
(say one or two constraints for every 2-turns of helix, and
4-residue CB constraints for strands?)
- When breaks are inserted (say by jiggle or opt subtree),
make them more likely to be inserted in regions which have
lower probability of being in a helix or strand.
Possibly deliberately insert a break in each such region
to act as a rapid hinge.
- New conformation-change operators
- We probably want to add some more conformation-change
operators to the suite used by the genetic algorithm---for
example, it might be good to have a crossover function that
allows two breaks, and puts together the results from
conformations A and B as an ABA chimera.
One possible implementation would be to make a copy of the
central section of Conformation B as a new AlignedFragments
object, then apply this to Conformation A.
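The two-break crossover itself is simple if a conformation is reduced to a list of per-residue backbone values. The sketch below makes that simplification and does not model undertaker's Conformation or AlignedFragments classes; the function name and break convention (breaks are slice indices) are assumptions:

```python
def aba_crossover(conf_a, conf_b, first_break, second_break):
    """Build an ABA chimera: A before first_break, B in between, A after second_break."""
    if len(conf_a) != len(conf_b):
        raise ValueError("conformations must cover the same residues")
    if not 0 <= first_break <= second_break <= len(conf_a):
        raise ValueError("breaks must satisfy 0 <= first <= second <= length")
    return (list(conf_a[:first_break])
            + list(conf_b[first_break:second_break])
            + list(conf_a[second_break:]))

# Toy conformations: (source label, residue index) stands in for per-residue torsions.
parent_a = [("A", i) for i in range(6)]
parent_b = [("B", i) for i in range(6)]
print(aba_crossover(parent_a, parent_b, 2, 4))
```

In torsion space this leaves the two A pieces internally intact but changes their relative placement, which is presumably the point of the operator.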
- More focus on loop regions
- It would be good to have a way to get undertaker to focus on
regions that are having trouble getting a good conformation.
Possibilities include freezing a core region, so that no
operators are applied there, increasing the probability of
applying fragments near breaks, applying TwoFragment to
residues that are spatially near each other, but not close
along the chain (possible hinges), ...
Instead of a complete freeze on the core, we might want to
keep confidence values with alignments and specific fragments, so
that we frequently re-apply alignments or fragments that have high
confidence, keeping the core from drifting away from a good
initial conformation.
- Jiggling without breaking
- We often want to make small moves to improve packing on an
almost finished model, without breaking the chain and
incurring a break penalty. One way to do this would be to use
the "crankshaft" moves developed by Baker's group. A
crankshaft move takes a pair of backbone bonds that are nearly
colinear and rotates the region between them.
If the bonds are parallel but not colinear, then a shift in
position occurs.
The simplest example is rotating a single peptide plane
without moving the ones on either side much. By pairing
two peptide planes that are almost colinear, one can shift the
atoms between the two planes without rotating them.
To do a "crankshaft" move between residues A and B,
add theta to psi(A-1), -theta to phi(A), -theta to psi(B-1),
theta to phi(B).
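The angle bookkeeping for the move just described can be written down directly. This sketch only updates lists of phi and psi angles (in degrees, indexed by residue) and says nothing about whether the chosen bonds are actually close to colinear; the function names are hypothetical:

```python
def wrap_degrees(angle):
    """Wrap an angle into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def crankshaft(phi, psi, a, b, theta):
    """Return new (phi, psi) lists after a crankshaft move between residues a and b.

    Adds theta to psi(a-1), -theta to phi(a), -theta to psi(b-1), theta to phi(b).
    """
    if not 1 <= a < b < len(phi) or len(phi) != len(psi):
        raise ValueError("need 1 <= a < b < number of residues")
    new_phi, new_psi = list(phi), list(psi)
    new_psi[a - 1] = wrap_degrees(psi[a - 1] + theta)
    new_phi[a] = wrap_degrees(phi[a] - theta)
    new_psi[b - 1] = wrap_degrees(psi[b - 1] - theta)
    new_phi[b] = wrap_degrees(phi[b] + theta)
    return new_phi, new_psi

phi = [-60.0] * 6
psi = [-45.0] * 6
print(crankshaft(phi, psi, a=2, b=4, theta=10.0))
```

The four changes cancel in pairs, which is why the chain outside the region between the two peptide planes stays approximately in place.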
- Rewrite ApplyFragments command
- The old ApplyFragments command takes the old slicer format for
specific fragments, which is no longer used.
Change to use the name of an
alignment---this may require a NameToPtr table for the
AlignmentLibrary.
- MERGE report_cost_to_log_and_rdb (OptCommands.cc) and
report_score_to_log (ConformationPool.h)
-
- Rotamer optimization
- The current version of undertaker has 3 conformation-change
operators that just change sidechains (OneRotamer,
ClashingRotamer, and ClusteredRotamer). It would be interesting
to see how well undertaker optimizes sidechains compared to scwrl,
for different cost functions [particularly whether the hbond_geom
cost function helps]. This can be done with no modifications to
undertaker, just running undertaker scripts and gathering statistics.
A possible approach is to read a pdb file, randomize the
rotamers, optimize with undertaker, reoptimize with scwrl, then
reoptimize with undertaker, reporting the three costs and all-atom
rmsd for each of the conformations.
One could also gather all-atom rmsd for a huge number of sidechain
conformations and try to do linear regression to set weights for
the cost function, but this is likely to have problems due to
correlation between the components.
- Other undertaker tasks
- This list of things to do to undertaker is far from complete.
There is also a ToDo list in
/projects/compbio/experiments/undertaker, and there are various
"BUG" notations scattered through the code in
/projects/compbio/programs/undertaker
- Genome-wide prediction
- I have done a genome-wide fold-recognition prediction of
S. cerevisiae proteins
http://www.soe.ucsc.edu/research/compbio/yeast-protein-predictions/,
but there are a lot of little things to do to clean up the presentation
(like providing indexing of the results by something other than just
the accession code). Fan Hsu will be working on this, in preparation
for similar tasks on mammalian genomes.
Some of the search capabilities that should be added include:
- Search for particular ORF (using any of the legal gene names).
- Search for ORFs with significant hits (user specified E-value).
- Search for ORFs with significant hits to particular SCOP families
(problem---may have hit to different domain of multi-domain template)
- Search for ORFs with significant hit but low sequence id in
pairwise alignment.
- Two-step search---from an ORF, find all other ORFs with similar
predicted structure.
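The two-step search in the list above could be prototyped with a simple index from templates to ORFs. A minimal sketch, assuming the fold-recognition results are available as (ORF, template, E-value) triples; the function name and the toy identifiers are hypothetical, and "similar predicted structure" is approximated here as "shares a significant template hit":

```python
from collections import defaultdict

def similar_orfs(hits, query_orf, max_e_value=0.01):
    """Find ORFs sharing at least one significant template hit with query_orf.

    Returns (orf, shared templates) pairs, most shared templates first.
    """
    orfs_by_template = defaultdict(set)
    templates_by_orf = defaultdict(set)
    for orf, template, e_value in hits:
        if e_value <= max_e_value:
            orfs_by_template[template].add(orf)
            templates_by_orf[orf].add(template)
    shared = defaultdict(set)
    for template in templates_by_orf.get(query_orf, ()):
        for orf in orfs_by_template[template]:
            if orf != query_orf:
                shared[orf].add(template)
    return sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))

# Toy results with made-up identifiers.
hits = [
    ("ORF1", "tmplA", 1e-10), ("ORF1", "tmplB", 1e-4),
    ("ORF2", "tmplA", 1e-6), ("ORF2", "tmplB", 1e-3),
    ("ORF3", "tmplB", 1e-5), ("ORF4", "tmplA", 0.5),
]
print(similar_orfs(hits, "ORF1"))
```

The multi-domain problem noted above applies here too: two ORFs can share a template while matching different domains of it, so a real version would need to compare the aligned regions.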
Perhaps a more important task is to figure out a way to handle the
large, multidomain proteins better, as finding a template for just one
domain doesn't really solve the problem. Currently, the web pages
don't even report which sections of the protein we have made
predictions for!
The fairly small yeast genome took weeks to finish on
the small cluster---to do a bigger genome, such as the human genome,
we'd have to "clusterize" the method to run on the kilocluster without
hammering the file servers. Fan Hsu has started work on
making the fold recognition method work on the kilocluster.
I don't think there is any point to trying to apply undertaker
genome-wide right now---it requires far too much hand tweaking to get
even a crude new-fold prediction out of undertaker.
It may be worthwhile to do some other small genomes, such as the
archaeal genomes that Todd Lowe is building DNA chips for.
- Individual proteins
- There are some proteins being studied at UCSC for which it would
be useful to get structure predictions. I started work on a few, but
have not made much progress, because CASP5 interfered. I suspect that
these are difficult-to-predict proteins, and that new-fold methods
like undertaker and Rosetta will need to be applied.
For example, I've done some predictions of chemotaxis proteins for Helicobacter
pylori for Karen Ottemann.
Fitnat Yildiz has a lot of proteins of Vibrio cholerae that she
is interested in knowing more about.
Here is a description from Fitnat of the projects she is interested in
(PDF version of her projects):
- Project 1: Identification of the ligand binding domain(s) of the methyl
accepting chemotaxis proteins in Vibrio cholerae (Vc)
- Flagellar motility in Vc is controlled by the chemotaxis system,
allowing the microbe to move in response to environmental cues.
Chemotaxis is regulated by chemoreceptors that sense environmental
cues, and transduce this ligand-binding information to regulate a
signal transduction cascade that affects flagellar rotation. The core
signal transduction proteins consist of CheW, the receptor-kinase
coupling protein, CheA, the kinase and CheY, the response regulator
that interacts with the flagellar motor.
This proposal seeks to understand the ligand(s) bound by
methyl-accepting chemotaxis proteins in Vc. The list of these
proteins and their VC gene numbers are shown below. Detailed
information on each gene including DNA and amino acid sequences can be
obtained from: http://www.tigr.org/tigr-scripts/CMR2/name_search_test.spl.
Vibrio cholerae El Tor N16961
| Locus | Gene Symbol | Common Name |
| VC0098 | | methyl-accepting chemotaxis protein |
| VC0216 | | methyl-accepting chemotaxis protein |
| VC0282 | | methyl-accepting chemotaxis protein |
| VC0449 | | methyl-accepting chemotaxis protein |
| VC0512 | | methyl-accepting chemotaxis protein |
| VC0514 | | methyl-accepting chemotaxis protein |
| VC1248 | | methyl-accepting chemotaxis protein |
| VC1289 | | methyl-accepting chemotaxis protein |
| VC1298 | | methyl-accepting chemotaxis protein |
| VC1313 | | methyl-accepting chemotaxis protein |
| VC1394 | | methyl-accepting chemotaxis protein |
| VC1403 | | methyl-accepting chemotaxis protein |
| VC1405 | | methyl-accepting chemotaxis protein |
| VC1406 | | methyl-accepting chemotaxis protein |
| VC1413 | | methyl-accepting chemotaxis protein |
| VC1535 | | methyl-accepting chemotaxis protein |
| VC1643 | | methyl-accepting chemotaxis protein |
| VC1859 | | methyl-accepting chemotaxis protein |
| VC1868 | | methyl-accepting chemotaxis protein |
| VC1898 | | methyl-accepting chemotaxis protein |
| VC1967 | | methyl-accepting chemotaxis protein |
| VC2161 | | methyl-accepting chemotaxis protein |
| VC2439 | | methyl-accepting chemotaxis protein |
| VCA0008 | | methyl-accepting chemotaxis protein |
| VCA0031 | | methyl-accepting chemotaxis protein |
| VCA0068 | | methyl-accepting chemotaxis protein |
| VCA0176 | | methyl-accepting chemotaxis protein |
| VCA0268 | | methyl-accepting chemotaxis protein |
| VCA0658 | | methyl-accepting chemotaxis protein |
| VCA0663 | | methyl-accepting chemotaxis protein |
| VCA0773 | | methyl-accepting chemotaxis protein |
| VCA0864 | | methyl-accepting chemotaxis protein |
| VCA0906 | | methyl-accepting chemotaxis protein |
| VCA0923 | | methyl-accepting chemotaxis protein |
| VCA0974 | | methyl-accepting chemotaxis protein |
| VCA0979 | | methyl-accepting chemotaxis protein |
| VCA0988 | | methyl-accepting chemotaxis protein |
| VCA1031 | | methyl-accepting chemotaxis protein, authentic frameshift |
| VCA1034 | | methyl-accepting chemotaxis protein |
| VCA1056 | | methyl-accepting chemotaxis protein |
| VCA1069 | | methyl-accepting chemotaxis protein |
| VCA1088 | | methyl-accepting chemotaxis protein |
| VCA1092 | | methyl-accepting chemotaxis protein |
- Project 2: Identification of cyclic nucleotide (cAMP, cGMP, c-di-GMP)
binding proteins in V. cholerae.
- Sequence analysis of many bacterial genomes showed that proteins with
GGDEF and/or EAL domains are highly prevalent. Recent studies showed
that the GGDEF and EAL domains control the cellular levels of
cyclic-di-GMP through the opposing activities of di-guanylate cyclases
(DGCs) and phosphodiesterases (PDEs), respectively. It has been shown
recently that a cyclic guanosine signaling compound called cyclic
di-guanosine-monophosphate (c-di-GMP) is an essential modulator of
cell surface structures and in turn biofilm formation in medically
important microorganisms. Other cyclic nucleotides cAMP and cGMP are
required for many cellular functions as well.
The overall objective is to determine the proteins that are capable of
binding to these cyclic nucleotides.
Below is some information on cNMP binding proteins and related references.
She provided a picture that I could not include---it is just a
cartoon showing that the protein contains three domains: cNMP
binding, His kinA N, ATPbind ATPase.
Proteins that bind cyclic nucleotides (cAMP or cGMP) share a
structural domain of about 120 residues [1, 2, 3]. The best studied of
these proteins is the prokaryotic catabolite gene activator (also
known as the cAMP receptor protein) (gene crp) where such a domain is
known to be composed of three alpha-helices and a distinctive
eight-stranded, antiparallel beta-barrel structure. There are six
invariant amino acids in this domain, three of which are glycine
residues that are thought to be essential for maintenance of the
structural integrity of the beta-barrel. cAMP- and cGMP-dependent
protein kinases (cAPK and cGPK) contain two tandem copies of the
cyclic nucleotide-binding domain. The cAPK's are composed of two
different subunits, a catalytic chain and a regulatory chain, which
contains both copies of the domain. The cGPK's are single chain
enzymes that include the two copies of the domain in their N-terminal
section. Vertebrate cyclic nucleotide-gated ion-channels also contain
this domain. Two such cation channels have been fully characterized;
one is found in rod cells where it plays a role in visual signal
transduction.
- Körner H., Sofia H.J., Zumft W.G.
Phylogeny of the bacterial superfamily of Crp-Fnr transcription
regulators: exploiting the metabolic spectrum by controlling
alternative gene programs.
FEMS Microbiol. Rev. 27: 559-592 (2003) [PubMed: 14638413]
- Busby S., Ebright R.H.
Transcription activation by catabolite activator protein (CAP).
J. Mol. Biol. 293: 199-213 (1999) [PubMed: 10550204]
- Kaupp U.B.
The cyclic nucleotide-gated channels of vertebrate photoreceptors
and olfactory epithelium.
Trends Neurosci. 14: 150-157 (1991) [PubMed: 1710853]
This web site http://www.tigr.org/tigr-scripts/CMR2/GenomePage3.spl?database=gvc
can be used to retrieve gene
sequences.
- Project 3: Characterization of the protein encoded by VCA0849.
- VCA0849 is predicted to encode a protein 3263 amino acids in length.
The goal of the project is to characterize this protein. Below the
coordinates for the VCA0849 on the genome are given. The sequence could
also be obtained from:
http://us.expasy.org/cgi-bin/niceprot.pl?Q9KL97.
| TIGR Locus Name: | VCA0849 |
| Primary Locus Name: | None |
| SWISS-PROT/TrEMBL AC: | Q9KL97 |
| Putative identification: | hypothetical protein |
| Coordinates: | 800820 to 791029 |
| DNA Molecule Name: | Chromosome 2 V.cholerae El Tor N16961 |
| Gene length: | 9792 |
| Protein length: | 3263 |
| Molecular Weight: | 341491.81 |
| pI: | 3.7102 |
| Percent GC: | 50.14% |
| Kingdom: | Bacteria |
| Family: | Proteobacteria |
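As a quick consistency check on the table, the gene and protein lengths follow from the coordinates if one assumes inclusive coordinates (the gene is on the reverse strand, so the start is the larger number) and a gene length that includes the stop codon. Those two assumptions are mine, but the arithmetic agrees with the table:

```python
def gene_length(start, end):
    """Length in bases of a gene given inclusive coordinates on either strand."""
    return abs(start - end) + 1

def protein_length(gene_len, includes_stop=True):
    """Number of amino acids encoded, optionally dropping the stop codon."""
    if gene_len % 3 != 0:
        raise ValueError("gene length is not a multiple of 3")
    codons = gene_len // 3
    return codons - 1 if includes_stop else codons

length = gene_length(800820, 791029)
print(length, protein_length(length))
```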
- Protein design
- I'd love to get involved in protein design, but the current tools
I have (neural nets, HMMs, undertaker, ...) do not seem well suited
for the task. I've been thinking about ways that I could try to use
them, the way Rosetta gets used for sequence optimization.
Unfortunately, I don't think I can use the same tricks that Rosetta
does, because undertaker is structured around prediction for a fixed
sequence---it would require some rather large changes to the code to
fix the conformation and let the sequence vary.
One approach I am looking at is to use undertaker to build a
homology model, then use Rosetta to redesign some of the residues to
increase stability. One protein I've been applying this to is
nicotinic acetylcholine receptors, which need to form a pentamer to
function, but which don't do so when the extracellular domain is
severed from the transmembrane domain. There is a soluble protein
that is homologous to the extracellular domain that *does* naturally
form the pentamer (see PDB file 1i9bA). I have already done a lot of
modeling work on the protein, and may have a model with a good enough
backbone to do sidechain replacement. (See
http://www.soe.ucsc.edu/~karplus/ach/ach7_short_1based/three-template/summary.html
paying particular attention to the README file.)
Another approach is to take several of the
local-structure-predicting neural nets, and fix their output at the
desired local structures, then use backpropagation to optimize the
inputs. One could start with an initial random sequence, optimize
using the neural nets to get an input profile, then sample from that
profile and reoptimize. This could be repeated to get a large sample
of possible sequences that should have good local properties, which
could then be fed into the Rosetta optimization program to try to find
sequences with good tertiary properties. To do this right, one should
probably do the back-propagation simultaneously in neural nets
predicting several different local properties. This would require
many changes to predict-2nd, but probably not as massive as changes to
undertaker for protein design.
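The core step (hold the desired outputs fixed and push gradients back to the inputs) can be sketched on a toy problem. Everything here is an assumption made for illustration: the "net" is a single softmax layer with random weights standing in for a trained local-structure predictor, it looks at one profile column at a time rather than a window, and the three-letter state alphabet is made up. It is not how predict-2nd is organized.

```python
import math
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEC"  # toy helix / strand / coil alphabet (hypothetical)


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, column):
    """Toy one-layer net: state probabilities for one profile column."""
    return softmax([sum(w * p for w, p in zip(row, column)) for row in weights])


def optimize_column(weights, target, steps=200, rate=1.0, rng=random):
    """Hold the output target fixed and gradient-ascend log P(target) over the input."""
    logits = [rng.gauss(0.0, 0.1) for _ in AMINO]  # the column is softmax(logits)
    for _ in range(steps):
        column = softmax(logits)
        out = predict(weights, column)
        # d log P(target) / d(pre-softmax output) = onehot(target) - out
        delta = [(1.0 if k == target else 0.0) - o for k, o in enumerate(out)]
        # Backpropagate through the weights to the profile column.
        grad_col = [sum(delta[k] * weights[k][a] for k in range(len(out)))
                    for a in range(len(AMINO))]
        # Backpropagate through the softmax that keeps the column on the simplex.
        mean = sum(p * g for p, g in zip(column, grad_col))
        logits = [u + rate * p * (g - mean)
                  for u, p, g in zip(logits, column, grad_col)]
    return softmax(logits)


rng = random.Random(0)
# Random weights stand in for a trained net.
weights = [[rng.gauss(0.0, 1.0) for _ in AMINO] for _ in STATES]
desired = "HHEC"
profile = [optimize_column(weights, STATES.index(s), rng=rng) for s in desired]
```

Parameterizing each column as a softmax of free logits is one simple way to keep the optimized input a valid probability vector; doing the same thing for several nets at once would just sum their gradients at the shared input.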
More random thoughts on using backpropagation: if I changed my
neural nets over to using guide-sequence+profile input, then I could
use backpropagation to change the profile, and sample from the profile
to get a sequence. Iterate that many times from a random initial
profile (generated from a Dirichlet mixture prior) and take the
resulting sequence as a possible prediction.
There are lots of variants that could be tried.
For example, one could resample only the
sequence positions where the backpropagated derivative for the current
residue is negative, rather than resampling all positions.
Instead of sampling from the profile one could take the most probable
residue of the profile.
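The sampling pieces of this loop are easy to sketch on their own. The code below is illustrative only: a single Dirichlet stands in for the Dirichlet mixture prior, and `grads` is a hypothetical per-position list of backpropagated derivatives (one value per amino acid) that the neural-net code would have to supply.

```python
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def random_column(alpha, rng):
    """Draw one profile column from a single Dirichlet (stand-in for a mixture prior)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_residue(column, rng):
    """Sample one amino acid according to the column's probabilities."""
    return rng.choices(AMINO, weights=column, k=1)[0]


def sample_sequence(profile, rng):
    """Resample every position from the profile."""
    return "".join(sample_residue(col, rng) for col in profile)


def most_probable(profile):
    """Variant: take the most probable residue of each column instead of sampling."""
    return "".join(AMINO[max(range(len(AMINO)), key=col.__getitem__)]
                   for col in profile)


def resample_negative(sequence, profile, grads, rng):
    """Variant: resample only positions whose current residue has a negative derivative."""
    out = []
    for res, col, grad in zip(sequence, profile, grads):
        if grad[AMINO.index(res)] < 0:
            res = sample_residue(col, rng)
        out.append(res)
    return "".join(out)


rng = random.Random(1)
profile = [random_column([1.0] * len(AMINO), rng) for _ in range(5)]
sequence = sample_sequence(profile, rng)
```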
After generating a few hundred sequences in this way from random
starting points, one could make a profile from the sequences, and
score each sequence with that profile. Further tweaking could take
just the best 100 or so sequences and the profile generated from them.
Each could be used with that profile as a starting point for the
backpropagation and resample process.
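A minimal sketch of the profile-and-rescore step, assuming equal-length, ungapped sequences and a simple add-one pseudocount in place of a proper Dirichlet-mixture regularizer. The function names and the log-probability score are illustrative choices, not an existing tool.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(sequences, pseudocount=1.0):
    """Column-wise residue frequencies with a flat pseudocount."""
    total = len(sequences) + pseudocount * len(AMINO)
    profile = []
    for i in range(len(sequences[0])):
        counts = Counter(seq[i] for seq in sequences)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO})
    return profile


def log_score(sequence, profile):
    """Log-probability of the sequence under the profile."""
    return sum(math.log(col[res]) for res, col in zip(sequence, profile))


def best_sequences(sequences, keep=100):
    """Keep the top-scoring sequences and rebuild the profile from them."""
    profile = build_profile(sequences)
    ranked = sorted(sequences, key=lambda s: log_score(s, profile), reverse=True)
    kept = ranked[:keep]
    return kept, build_profile(kept)


kept, refined = best_sequences(["ACD", "ACD", "ACE", "WYW"], keep=2)
```

Each kept sequence, together with `refined`, would then seed another round of backpropagation and resampling.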
A tertiary scoring system (such as Rosetta's) could be used to pick
out the best sequences from the list, and the resulting sequences
and profile could be used to do more sampling.
Another approach to using neural nets would be to build a neural
net whose input was many structural properties and whose output was a
sequence. Such a neural net could be trained in the same
manner as current nets for prediction (with inputs and outputs
swapped), and would provide a probability distribution for amino acids
at each position. This would require fewer changes to the neural net
code (mainly providing an input layer that accepts several alphabets,
instead of just one alphabet), and would more directly provide
probability distributions to sample sequences from. This method is
less likely to provide a variety of very different solutions to a
protein design problem.
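The multi-alphabet input layer could be as simple as concatenating one one-hot block per alphabet. The sketch below assumes two made-up structural alphabets, a single untrained softmax layer, and no window over neighboring positions; it only shows the shape of the input and of the amino-acid distribution that would be sampled from.

```python
import math
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical structural alphabets; the real nets use their own.
ALPHABETS = {"secondary": "HEC", "burial": "BIE"}


def encode(labels):
    """Concatenate one one-hot block per alphabet into a single input vector."""
    vector = []
    for name, letters in ALPHABETS.items():
        vector.extend(1.0 if letter == labels[name] else 0.0 for letter in letters)
    return vector


def residue_distribution(weights, labels):
    """Softmax over amino acids for one position (single layer, no window)."""
    x = encode(labels)
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return dict(zip(AMINO, (e / total for e in exps)))


rng = random.Random(0)
n_inputs = sum(len(letters) for letters in ALPHABETS.values())
# Random weights stand in for a net trained with inputs and outputs swapped.
weights = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in AMINO]
dist = residue_distribution(weights, {"secondary": "H", "burial": "B"})
residue = rng.choices(AMINO, weights=[dist[a] for a in AMINO], k=1)[0]
```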
- Using protein redesign to improve fold-recognition
- Some researchers have tried using protein design with a known
backbone to try to get a large sample of sequences that could
reasonably adopt the backbone conformation, in order to create
profiles for searching in sequence space for homologs. This method
could be used to improve the template models in our fold-recognition
process, for those templates that have few homologs found by iterative
search. It is unclear whether the design methods do a good enough
job to actually improve the profiles. Also, there are small changes
in structure between homologs (especially insertions and deletions),
so the design method may have difficulty generating appropriate
sequences to generalize in the right direction.
A more promising use for redesign of a known structure is to
predict which residues are conserved for functional reasons rather
than structural ones. David Bernick is investigating this with Carol
Rohl, and it looks like there is a reasonable application for my
mutual-information code here.