Everyone in CMP243 must do a class project; this is 50% of your grade. I would like to have a one paragraph project proposal by Thurs. Jan. 22. It is OK to change your mind after this, but I want to make sure you get started early on something. Projects are due the last day of class (Tues. March 17). You should turn in a short written report (10 pages text + lots of figures, charts, tables, etc.) describing the problem, how you approached it, and what the experimental results were.
It is fine to work in teams. I encourage you to work with one of the students in our computational biology group if you can, or with Kevin Karplus, Richard Hughey or Lydia Gregoret. All of these people have computational sequence analysis projects they are working on that you could participate in, and these would make good class projects. Also, if you are in biology or chemistry, you probably have a project with one of the professors there that already has or could have an informatics component that would be suitable for a class project. I know Manny Ares has some computational projects involving introns in the yeast genome, Alan Zahler is working on introns and splicing mechanisms in higher organisms, Tony Fink has some protein families he would like to apply some computational analysis to, Harry Noller and Chuck Wilson have RNA data you could work with, and Todd Wipke has data sets and ongoing projects on protein docking.
I encourage students to be creative in designing their projects, but the most important thing is that you gain experience by trying to get some actual useful results. So be sure to run any wild ideas you may have by me before you commit a serious piece of time to them. To get an idea of the kind of thing I have in mind, first here are two "generic" applications-based projects:
1. Take a particular family of sites in DNA where a certain protein or protein-RNA complex binds and build a computational mechanism to recognize members of this family of sites when they appear in newly sequenced DNA. One example would be transcription factor binding sites, which are places in DNA where certain proteins bind that are involved in the regulation of transcription. Another would be branch points, which are involved in the mRNA splicing process. A third example would be splice sites themselves. The first step is to collect the data. If you do transcription factor binding sites, there is already a large publicly available database, the transcription factor database. See also the searchable TRANSFAC database linked into EMBL's SRS server. (See also
@article{Wingender97,
author="Wingender, E. and others",
title="{TRANSFAC, TRRD and COMPEL: Towards a federated database system on
transcriptional regulation}",
journal="Nucleic Acids Research",
volume=25, month=jan, number=1, year=1997,
pages="265--268", }
)
Branch points are less well
studied. You can see Manny Ares about yeast branch points, but these are fairly easy
and well-characterized. You might start with yeast and then try to generalize to
a broader class of organisms. Characterizing human branch points seems
to be an entirely open problem at present. Ask Al Zahler.
I'll make some datasets available, including datasets for splice
sites in yeast, worm, fly and human.
For now see the splice datasets at
LBL (fly
and human)
and
Sanger (worm).
There are many other types of sites in DNA one could
try to find as well.
2. Analyze a protein family. Use the SAM software at UCSC to make multiple alignments of proteins in this family, and to build a statistical model of the family that can be used to search the protein database for remote homologs of members of the family. Group the members of the family into subfamilies and characterize their differences. I know Tony Fink, Andrew Chisholm, Yishi Jin, John Tamkun and many other Profs in Biology and Biochemistry have proteins they are working with, and we need students to apply our methods to these proteins.
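For the first generic project above (recognizing a family of sites in DNA), one very simple recognizer is a position weight matrix built from aligned example sites. The following is only a minimal sketch of that idea, not a prescribed method: the function names are hypothetical, the four example sites are toy data built around the yeast branch point consensus TACTAAC, and the pseudocount of 1 and the uniform background of 0.25 per base are illustrative assumptions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Build a log-odds (base 2) weight matrix from equal-length aligned sites.

    Assumes a uniform background of 0.25 per base (an illustrative choice).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for one window of the same length as the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (position, score) of the highest-scoring window in seq."""
    w = len(pwm)
    hits = [(score(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1)]
    best_score, pos = max(hits)
    return pos, best_score

# Toy training sites (made up for illustration, not real data).
toy_sites = ["TACTAAC", "TACTAAC", "TACTAAT", "TGCTAAC"]
toy_pwm = build_pwm(toy_sites)
print(best_hit(toy_pwm, "GGGTACTAACGGG"))
```

A real project would estimate the background from the organism's own DNA, choose a score threshold on held-out data, and compare against richer models such as the hidden Markov models discussed in class. The sketch assumes sequences contain only A, C, G and T.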
Another type of project, different than an applications-based project, would be to take a single computational or statistical method, and try it on a wide variety of sequences or sequence analysis problems. I would call this a "methods-based" project. Here is one example of a "methods-based" project.
1. To compete in the CASP3 protein structure prediction contest, Kevin Karplus and UCSC grads have built a system that reads in a target protein sequence and (using hidden Markov models, which we will discuss at length) finds another sequence in a database of protein sequences of known structure that it thinks has the same structure as the target sequence. It produces an alignment of the target sequence to the other sequence so the user can judge how similar the sequences look. There are a lot of variations on the method, and parameters that can be varied. We have not had enough time to do careful testing of the method. However, we have available a mechanism for testing the method(s). We do this by giving it a protein sequence of known structure that is not a close homolog of any of the sequences of known structure in its database, but is a distant homolog of one or more of these database sequences, and has the same structure as these distant homologs. We can then check if the method can find these distant homologs, and if it gives a structurally correct alignment to each of them. I would like to see some students in this class choose for their project to work with Kevin Karplus, Christian Barrett (cbarrett@cse), Melissa Cline (cline@cse) or Mark Diekhans (markd@cse), and I'd like to have the project be to test these methods we developed, and find the variations/parameter settings that work best, or develop new extensions/variations of the method that work even better.
I have asked people at UCSC who work in this area to contribute ideas for projects that students could work on with them, and I'll add them here as they come in. I'll also add here specific project ideas I may decide to suggest.
p53 is one of the most heavily scrutinized genes in history (8000 references in Medline), mainly because it is thought to contribute to the development of 50% of all cancers. Recently a new gene, p73, was found to bear a strong resemblance to p53. In particular, it resembles p53 in three regions:
transcription activation domain -- 29% IDE
DNA binding domain -- 63% IDE
self binding domain -- 38% IDE
(It also seems to share p53's tumor suppression capability.) These shared regions are sufficiently diverse that they would play well into the generalization schemes our HMM techniques are proving to excel at. Detection of an uncharacterized gene found by our methods that passes the ol' critical residue conservation test would almost certainly be of interest. If the evidence were strong enough, some publication might even take the evidence without wet lab work. Who knows? A CMP 243 project could entail building models from the p53 and p73 sequences, searching for homologs, and analyzing results. It would also be good to try to use methods currently under construction at UCSC (contact Kevin Karplus (karplus@cse), David Lin (dave@cse), Richard Hughey (rph@cse) or David Kulp (dkulp@cse)) that allow you to search an EST database with an HMM built from a protein family.
Development of a computational tool for modeling RNA secondary structure
The secondary structures of large RNAs have been determined almost universally by phylogenetic covariational analysis. This method uses a set of aligned RNA sequences from a range of species as a starting point for analysis. Secondary structure is inferred by the consistent conservation of base pairing (A:U or G:C pairs) even in the absence of primary sequence conservation. While this type of analysis has proven extremely powerful (ultimately yielding the structures of RNAs including the SRP RNA, 16S rRNA, 23S rRNA, the group I intron, and RNase P), its successful application typically requires a large number of sequences and an accurate alignment (generally defined by the structure itself). Energy minimization approaches (e.g. MFOLD by Zuker) are typically able to identify long, highly stable helices within a single sequence but are rarely able to correctly predict the complete global structure.
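To make the covariation idea concrete, one common way to quantify it is the mutual information between two columns of an RNA alignment: columns that change together score high even when neither is conserved. The sketch below is only an illustration under stated assumptions. The function names are hypothetical and the four aligned sequences are invented toy data, not real RNAs.

```python
import math
from collections import Counter

def columns(alignment):
    """Turn a list of equal-length aligned sequences into a list of column strings."""
    return ["".join(seq[k] for seq in alignment) for k in range(len(alignment[0]))]

def mutual_information(col_i, col_j):
    """Mutual information in bits between two alignment columns of equal length."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi

# Toy alignment (invented): the first and last columns always form a
# Watson-Crick pair, while the middle columns are constant.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
cols = columns(toy_alignment)
print(mutual_information(cols[0], cols[4]))  # covarying pair of columns
print(mutual_information(cols[0], cols[1]))  # varying column vs. constant column
```

Note that mutual information only measures that two columns vary together; it does not check that the covarying residues are actually complementary, and it needs many sequences to be reliable. A tool of the kind described here would have to combine such a signal with base-pairing rules and energy terms.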
Your mission (should you choose to accept it) is to create a computational tool that combines the strengths of covariational analysis and energy minimization to yield an (accurate) predicted secondary structure for a set of RNA sequences. The ideal tool would allow the user to interactively edit the secondary structure (e.g. starting from an unfolded sequence, progressively build up the structure by defining regions of base pairing). In doing so, it should be possible to simultaneously view the 'reasonableness' of the prediction with respect to conservation of structure between related sequences and local energy minimization. Using PCR methods, we have recently determined the sequences of the H19 RNA (a putative tumor suppressor) from several different mammalian species. Your tool can be tested by its application to both known structures (e.g. 16S rRNA) and to an unknown test (e.g. the H19 RNA). I would be happy to assist anyone interested in trying this in the design aspects of the project (defining what general structure would be most helpful for biologist users). Some suggestions:
Good luck....
The theme of this project is to try to discover the basic functional features of introns in human DNA, primarily places where proteins involved in splicing and gene regulation bind, including regulation of alternative splicing. The idea is to use human-mouse gene comparisons to help find these features.
You can use the data set of human introns
collected by Martin Reese (ftp://www-hgc.lbl.gov/pub/genesets/1754introns.gz)
and you can use NCBI's human-mouse gene web page
to collect a database of pairs of homologous genes,
one from human and one from mouse. It may also be possible to
work with Niclas Jareborg.
1. The key problem that needs to be solved first is how to
identify the candidate
functional sites. A simple sequence matching program like BLAST could be used
to scan along the corresponding mouse and human genes and mark specific regions
where the homology between the mouse and human gene was particularly high,
as measured say by the BLAST p-value. This nearly always happens in exons,
but it only happens at specific places in introns. The first thing would be
to collect a database of all these regions in introns that are conserved between
human and mouse genes.
2. Given a database of conserved regions from introns, the next step would
seem to be to somehow cluster these into groups of related regions.
Our EMmix program could be used here. Arun Jagota could work with you to
get this working.
One problem is defining the boundaries and subregions
of the regions.
For example, you might get a region for one intron of the form ABC, another
region from a different intron of the form A'B' and yet a third region
of the form B"B"C", where A, A' and A" are similar, as are B, B', B", and
C, C', C", etc. This means that parts of regions must be clustered with
parts of other regions. I would expect
this to be a significant problem.
Regulatory binding sites tend to occur in complex groupings
that vary quite a bit from gene to gene, and have different order and
repetition number for their subunits.
You should find a way to configure EMmix to handle this.
3. There are databases of known DNA binding sites.
One is TFSEARCH (http://www.genome.ad.jp/SIT/TFSEARCH.html). These can
be used to "seed" your clusters. They also have useful info about
what the binding sites actually do, a big plus.
(see also @article{Wingender97,
author="Wingender, E. and others",
title="{TRANSFAC, TRRD and COMPEL: Towards a federated database system on
transcriptional regulation}",
journal="Nucleic Acids Research",
volume=25, month=jan, number=1, year=1997,
pages="265--268", })
However, these databases are very incomplete.
Most of what they contain occurs in conjunction with the promoter,
i.e. in the 5' upstream region from the gene; there is less info
about signals in the introns.
In a sense, it would be crazy not to look at the upstream region as well,
since the same methods will apply, and there are even more signals there
than in the introns (as far as we know).
4. Al Zahler has said that if we do this project, he would like
to be involved on the wetlab side. Here is what he told me when
I suggested this a few months ago:
"Comparison of the introns of known alternatively spliced
genes might yield conserved sequences important for the regulation of
alternative splicing. Does this database identify any known alternatively
spliced exons present in both humans and mice? If so, then we can devise
wet experiments to test the importance of any conserved intron sequences
near alternatively spliced exons for the ability to regulate splicing. "
This is a great opportunity, and you should definitely work
with Al if possible.
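Step 1 above (marking regions of high human-mouse conservation) can be prototyped very crudely before bringing in BLAST. The sketch below is not BLAST and does not compute a p-value. It swaps in a plain sliding-window percent identity over an existing gapped pairwise alignment of the two genes. The function name, the window size, and the identity cutoff are all hypothetical choices for illustration, and the two sequences in the example are invented.

```python
def conserved_windows(aln_a, aln_b, window=20, min_identity=0.8):
    """Find conserved stretches in a gapped pairwise alignment.

    aln_a and aln_b are the two rows of the alignment ('-' marks a gap).
    Returns merged half-open (start, end) column intervals in which at least
    one window of the given size reaches min_identity.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignment rows must have the same length")
    n = len(aln_a)
    # 1 where the two rows carry the same residue, 0 for mismatches and gaps.
    match = [1 if a == b and a != "-" else 0 for a, b in zip(aln_a, aln_b)]
    regions = []
    running = sum(match[:window])
    for start in range(0, n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            running += match[start + window - 1] - match[start - 1]
        if running / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions

# Invented example: first ten columns identical, last ten completely different.
human = "ACGTACGTAC" + "TTTTTTTTTT"
mouse = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(human, mouse, window=5, min_identity=1.0))
```

The intervals returned here are alignment columns, so a real pipeline would still need to map them back to intron coordinates in each gene and discard the regions that fall in exons.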
In homework 2, you created an alignment of the CASP2 target t31 to 1try,
and perhaps compared your alignment to the "correct" one created
by DALI. If you compared the two alignments, you probably saw regions
where your predicted alignment was just great, with everything aligned
within a few residues of where it should be, and regions where
everything was shifted by about ten residues from its correct spot.
What we want to do is predict when we're looking at a bad alignment
region, so that we can remove it from the alignment. This will yield
subalignments that still have the good regions but not the suspect ones,
hopefully. We have a program that finds the optimal subalignment of two
sequences of known structure, and scores both the existing alignment and
the optimal subalignment. So what we want to do is take an alignment,
score it, get the score of the optimal subalignment, and then trim the
alignment. If our trimmed subalignment has a score that's closer to the
optimal one, then we're making progress!
So what characterizes these suspect alignment regions? In these
regions,
the aligned sequences don't have much similarity to each other. They're
often close to indels, meaning that the alignment algorithm found
something in one sequence that it didn't want to align to anything in
the other sequence(s). Often, they're in loop regions where mutations
occur at a greater rate; loop regions are so hard to model that a lot of
structure predictors don't even try.
To predict the suspect alignment regions, there are two things we'll
try. First, we'll look at the template sequences, the ones of known
structure, and check which residues are in loop regions; if any of them
are aligned to target sequence residues, we'll cut those columns out of
the alignment.
Second, we'll build a multiple alignment of the target sequence, the
template sequence, and their homologs, and run it through a secondary
structure predictor that estimates the probability of each class of
secondary structure. If there are columns with at least a threshold
probability of being in a loop, we will remove them.
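The second trimming rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is hypothetical, the alignment is reduced to two rows (target and template), the per-column loop probabilities are assumed to come from whatever secondary structure predictor is used, and the example values and the threshold of 0.5 are invented.

```python
def trim_loop_columns(target_row, template_row, loop_prob, threshold=0.5):
    """Drop alignment columns whose predicted loop probability reaches threshold.

    target_row and template_row are the two rows of the alignment, and
    loop_prob[k] is the predicted probability that column k lies in a loop.
    Returns the trimmed rows and the indices of the columns that were kept.
    """
    if not (len(target_row) == len(template_row) == len(loop_prob)):
        raise ValueError("rows and loop_prob must have the same length")
    keep = [k for k, p in enumerate(loop_prob) if p < threshold]
    trimmed_target = "".join(target_row[k] for k in keep)
    trimmed_template = "".join(template_row[k] for k in keep)
    return trimmed_target, trimmed_template, keep

# Invented example: the last two columns are predicted to be in a loop.
target = "MKVLAG"
template = "MRVL-G"
probs = [0.1, 0.2, 0.1, 0.3, 0.9, 0.7]
print(trim_loop_columns(target, template, probs, threshold=0.5))
```

The returned column indices make it possible to score the trimmed subalignment with the existing scoring program and compare it against the score of the optimal subalignment, which is the measure of progress described above. The threshold itself is one of the things the experiments would need to tune.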
This work is very hot, timely, and relevant to UCSC's efforts in the
CASP3 contest. If this investigation goes well, there might be a
conference paper in it.
These projects will involve experimental evaluation of
neural networks, nearest neighbor classifiers, support vector machines,
statistical methods, EMmix, and other methods on splice site recognition.
Implementations of these methods are already available. The emphasis will
therefore be more on experimentation and less on programming.
(A slight amount of Perl script-writing, for utility tasks, is expected.)
We hope that
each method can be taken up by a team of (one or two) students.
All teams may work independently but on a common core
dataset, to permit their experimental results to be comparable.
(The teams may share ideas, and Perl scripts.)
To ensure regular progress, each project will be divided into a few subtasks
(with a few deadlines). Teams are expected to meet their (few) deadlines, and
this will be taken into account in the grading.
Each project has been designed so that it can be easily finished within the
allotted time.
In case a project turns out simpler than expected or if a team makes progress
quicker than expected, then a Phase II is planned. In the former case, Phase
II will be initiated at the discretion of Arun Jagota; in the latter case
at the discretion of the team.
Phase II involves implementing a meta-method, called boosting, on top
of your method, and evaluating its performance. This meta-method can be
applied to any project.
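To give a feel for what Phase II involves, here is a minimal sketch of AdaBoost, one well-known boosting algorithm. This page does not say which boosting variant is intended, so treat the choice as an assumption. The weak learner here is a simple threshold rule on a single numeric score, standing in for whichever method a team is evaluating. All function names are hypothetical and the six data points are invented.

```python
import math

def stump_predict(x, threshold, sign):
    """Weak rule: sign * (+1 if x > threshold else -1)."""
    return sign * (1 if x > threshold else -1)

def train_stump(xs, ys, weights):
    """Pick the threshold rule with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost(xs, ys, rounds=3):
    """Train an ensemble of weighted threshold rules; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, sign))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak rules."""
    vote = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if vote > 0 else -1

# Invented data: no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

The point of the example is that the combined vote can fit a pattern that no single threshold rule can. In a project, the weak learner would be replaced by the team's own method, trained on reweighted or resampled splice site data.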
Finally, there are two other projects in this group of projects but of
a slightly different nature. The first one involves writing a few
general-purpose utilities in Perl for the GFF format recently adopted
at Sanger for annotating DNA sequences.
The second one involves using the NetGene2 C. elegans splice site
prediction system over the world-wide web to evaluate the same data that
we will evaluate here.
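For the GFF utilities project, the core task is reading tab-separated feature lines. The project itself calls for Perl. The sketch below uses Python only to illustrate the shape of such a utility. It assumes the usual GFF layout of eight mandatory tab-separated fields (sequence name, source, feature, start, end, score, strand, frame) followed by an optional group field, with "." standing for a missing score. The function and type names are hypothetical and the example line is invented.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame group",
)

def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and '#' comment lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, group,
    )

# Invented example line.
record = parse_gff_line("chrI\tcurated\texon\t100\t250\t.\t+\t0\tgene1")
print(record.feature, record.end - record.start + 1)
```

Utilities such as filtering by feature type, computing feature lengths, or converting predicted splice sites into GFF for comparison across teams could all be built on a parser of this kind.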
Questions about page content should be directed to
haussler@cse.ucsc.edu.
Last modified October 20, 1996.