Projects for CMP243

Everyone in CMP243 must do a class project; this is 50% of your grade. I would like to have a one-paragraph project proposal by Tues., Oct. 15. It is OK to change your mind after this, but I want to make sure you are thinking about a project. Projects are due the last day of class. You should turn in a short written report (10 pages of text plus lots of figures, charts, tables, etc.) describing the problem, how you approached it, and what the experimental results were.

It is fine to work in teams. I encourage you to work with one of the students in our computational biology group if you can, or with Kevin Karplus, Richard Hughey or Lydia Gregoret. All of these people have computational sequence analysis projects they are working on that you could participate in, and that would make good class projects. Also, if you are in biology or chemistry, you probably have a project with one of the professors there that already has or could have an informatics component that would be suitable for a class project. I know Manny Ares has some computational projects involving introns in the yeast genome, Alan Zahler is working on introns and splicing mechanisms in higher organisms, Tony Fink has some protein families he would like to apply some computational analysis to, Harry Noller and Chuck Wilson have RNA data you could work with, and Todd Wipke I'm sure has lots of great problems to work on.

I encourage students to be creative in designing their projects, but the most important thing is that you gain experience by trying to get some actual useful results. So be sure to run any wild ideas you may have by me before you commit a serious piece of time to them. To get an idea of the kind of thing I have in mind, first here are two "generic" applications-based projects:

1. Take a particular family of sites in DNA where a certain protein or protein-RNA complex binds and build a computational mechanism to recognize members of this family of sites when they appear in newly sequenced DNA. One example would be transcription factor binding sites, which are places in DNA where proteins involved in the regulation of transcription bind. Another would be branch points, which are involved in the mRNA splicing process. A third example would be splice sites themselves. The first step is to collect the data. If you do transcription factor binding sites, there is already a large publicly available database, the transcription factor database. See also the searchable TRANSFAC database linked into EMBL's SRS server. Branch points are less well studied. You can see Manny Ares about yeast branch points, but these are fairly easy and well-characterized. You might start with yeast and then try to generalize to a broader class of organisms. Characterizing human branch points seems to be an entirely open problem at present. Ask Martin Reese, martinr@genome.lbl.gov, at LBL if he has any data for this. Martin has a database of human splice sites too. Note: a related project would be to build, or improve on, a computational method to recognize another type of DNA feature, such as Alu repeat sequences. There are large databases of these too.

2. Analyze a protein family. Use the SAM software at UCSC to make multiple alignments of proteins in this family, and to build a statistical model of the family that can be used to search the protein database for remote homologs of members of the family. Group the members of the family into subfamilies and characterize their differences. Below is a letter I got last week from Dr. Patsy Babbitt at UCSF, who would like our help with a project of this type. I'm looking for a student to participate in this project. Tony Fink here at UCSC has some similar projects. Lydia may as well.

Here is Dr. Babbitt's letter:

From: babbitt@cgl.ucsf.EDU
To: haussler@cse.ucsc.edu

Hi David,

If you recall, I talked with you when I was at UC Santa Cruz
back in February of last year. I am working on understanding distant
structural and functional relationships in proteins in several different
superfamilies. In particular, we talked about doing some work together to
evaluate my set of sequences using your approaches. 

I have finally generated what I believe to be reasonable alignments corrected
by structural superpositions of three of the most distantly related
superfamily members. I have a paper that has been accepted to Biochemistry
detailing this work and I would like to send you a preprint. Would you
still be interested in talking further with me about my approaches and 
ways in which we might work together on developing our respective interests
further?  I would very much like to drive down and talk with you and your
colleagues further.

I have a second issue I would like to discuss with you as well.  I have
been working with colleagues in the Dept. of Biopharmaceutical Sciences here
at UCSF on a separate project aimed at deducing very distant relationships
among the transmembrane domains of G-protein coupled receptors (including
such distant families as the fungal pheromone receptors) and the
bacteriorhodopsins. This is work initiated and largely carried out by
Wolfgang Sadee and his graduate student, Richard Graul.

We have analyzed all 7 TMDs of over 200 receptors and attempted to use 
information from sub-optimal alignment positions of these TMDs to help 
distinguish very distant relationships. This has been strengthened by
performing multiple pairwise alignments of all of the TMDs in order to
improve our ability to distinguish significant relationships among short
sequences whose simple pairwise alignment scores cannot be regarded as
statistically significant.

The problem we now face is that we have generated a large amount of data
using essentially new approaches. I do not have the expertise to evaluate
the validity of our approach and have realized that we badly need someone
of your expertise to discuss it with. I am hopeful that this is, in fact,
a useful approach that would be of interest to the field and particularly
to the problem of distinguishing very distant relationships.

...
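
Returning to the first generic project above (recognizing a family of sites in DNA): one common baseline is a log-odds position weight matrix built from aligned example sites and slid along new sequence. The following is only a minimal sketch under stated assumptions. The toy sites, the uniform background, the pseudocount of 1.0, and the function names build_pwm and scan are all made up for illustration; a real project would use sites collected from one of the databases mentioned above and would probably need a richer model.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Return one dict of log2-odds scores per position of the aligned sites."""
    if background is None:
        background = {b: 0.25 for b in ALPHABET}  # assumed uniform background
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        # Start every count at the pseudocount so no base gets probability zero.
        counts = {b: pseudocount for b in ALPHABET}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background[b])
                    for b in ALPHABET})
    return pwm

def scan(pwm, dna):
    """Score every window of dna; return a list of (start, score) pairs."""
    width = len(pwm)
    return [(i, sum(col[dna[i + j]] for j, col in enumerate(pwm)))
            for i in range(len(dna) - width + 1)]

# Hypothetical toy training sites, invented for this example.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
hits = scan(pwm, "GGCGTATAATGCCG")
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 2))
```

A threshold on the window score then turns the scan into a recognizer; choosing that threshold, and measuring false positives on held-out data, is where most of the experimental work would lie.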

Another type of project, different from an applications-based project, would be to take a single computational or statistical method, and try it on a wide variety of sequences or sequence analysis problems. I would call this a "methods-based" project. Several students have indicated to me that they feel the methods we have been discussing so far seem rather primitive, and somewhat naive when one considers the true complexity of molecular biology. I agree! We'll see some slightly more sophisticated stuff as the quarter progresses, but basically this is a relatively new and wide open field, and there is plenty of room for improvements to currently used methods. So if you find yourself saying "I could do that better", then just do it!

Here are some examples of such "methods-based" projects.

1. Henikoff and Henikoff have looked at a few ways to convert each of their 3000 protein "blocks" into a scoring method that can be used to determine if a query sequence that is submitted to the database contains the motif defined by the block. They used one of these scoring schemes to determine the score of the actin-binding site block for the t37 query sequence you submitted to the BLOCKS searcher in HW2, and this scoring method was good enough to reveal the presence of the actin binding site in t37. However, a more remote homology might not have been detected. Make a better scoring scheme for the blocks, and test it. We have a copy of the whole blocks database available here at UCSC (ask David Kulp, dkulp@cse). I have some specific ideas along these lines; see me if you are interested. We'll see a bit more on profiles and profile searching in HW3.

2. To compete in the CASP2 protein structure prediction contest, we have built a system that reads in a target protein sequence and, with a bit of human intervention, eventually produces a predicted structure for that target sequence, along with an alignment of the target sequence to a sequence that is known to have this predicted structure. There are a lot of variations on the method, and parameters that can be varied. In our rush, we have not had time to do any careful testing of the method. However, we have available a mechanism for testing the method(s). We do this by giving it a protein sequence of known structure that is not a close homolog of any of the sequences of known structure in its database, but is a distant homolog of one or more of these database sequences, and has the same structure as these distant homologs. We can then check if the method can find these distant homologs, and if it gives a structurally correct alignment to each of them. I would like to see some students in this class choose for their project to work with one of the students who participated in the CASP2 contest, and I'd like to have the project be to test these methods we developed, and find the variations/parameter settings that work best, or develop new extensions/variations of the method that work even better. Get a team together to do this. The following are students who are in this class and participated in the CASP contest: Christian Barrett (cbarrett@cse), Melissa Cline (cline@cse), Marc Hansen (mhansen@cse) and Chris Tarnas (cft@cse).

At some point soon (I hope) I will add some more specific project ideas to this page. Other faculty involved in the course are also welcome to submit project ideas. I got these from Richard Hughey; see him for further details:

Richard Hughey's suggestions:

Here are some things on my list.  Some require that the person already
know things about SAM, others require willing experimentation.

1.  Motif finding.  Analyze and improve SAM's motif finding ability in
    comparison to, for example, MEME (which I'm sure we can beat :-).
    This could include coming up with techniques for determining the
    number of highly conserved motifs in a family of sequences,
    ranking motifs, deciding where they begin and end, and the like.

2.  Protein to DNA model converter, making use of codon usage tables
    etc. A research component here would be coming up with an
    appropriate transition regularizer for this case.

3.  Codon alphabets, mixtures, and transition regularizer.  Getting
    the alphabets into the code is mostly a matter of I/O -- the
    actual code won't need much change.  Possibly a 20-component
    mixture on the codon distribution tables is appropriate.  We may
    also want a library of codon priors for different types of
    organisms. 

4. Probabilistic sequences.  This would take major code work, with
   each character being replaced by a vector of probabilities (say,
   from reading a DNA sequence).

5. WWW Server work, updating and improving these simplistic interfaces
   to the code.

    -- weighting server for estimate-dist
    -- database search of pdb using SAM
    -- possibly of other databases once we get the cluster & 
       Christian's MPI interface.  This would need some checks
       to make sure it's not abused.
    -- Analyze length dependencies in HMM models (something Kevin and I
       were looking at some), especially for models with multiple FIMs.


Items 2-4 are motivated by email with Marcie McClure.
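
To make item 2 a little more concrete, one small piece of a protein-to-DNA model converter might be expanding an amino acid emission distribution into a distribution over codons, using P(codon) = P(amino acid) * P(codon | amino acid). The sketch below is not SAM code and makes several assumptions: the standard genetic code, a placeholder usage table in which synonymous codons are equally likely (a real table would come from the organism of interest), a toy emission distribution, and hypothetical function names. It says nothing about the transition regularizer, which is the research part.

```python
from collections import defaultdict

BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = dict(zip(CODONS, AMINO))  # standard genetic code, '*' = stop

def uniform_codon_usage():
    """Placeholder P(codon | amino acid): synonymous codons equally likely."""
    synonyms = defaultdict(list)
    for codon, aa in GENETIC_CODE.items():
        if aa != "*":
            synonyms[aa].append(codon)
    return {codon: 1.0 / len(group)
            for group in synonyms.values() for codon in group}

def protein_to_codon_distribution(aa_probs, usage):
    """Return P(codon) = P(amino acid) * P(codon | amino acid) for sense codons."""
    return {codon: aa_probs.get(GENETIC_CODE[codon], 0.0) * p
            for codon, p in usage.items()}

# Toy emission distribution for one model position (invented for illustration).
aa_probs = {"M": 0.5, "L": 0.3, "K": 0.2}
codon_probs = protein_to_codon_distribution(aa_probs, uniform_codon_usage())
print(codon_probs["ATG"], codon_probs["CTG"], codon_probs["AAA"])
```

Swapping uniform_codon_usage for an organism-specific table changes only the second argument, which is one reason to keep the usage table separate from the protein model.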

Here are some late additions to the project list:

Date: Fri, 11 Oct 1996 20:55:04 -0700
From: David Haussler 
Message-Id: <199610120355.UAA04551@arapaho.cse.ucsc.edu>
To: cs243@cse.ucsc.edu
Subject: project ideas NEWSFLASH

Two members of our compbio group, Christian Barrett and Kimmen
Sjolander, have volunteered to work
with interested students from cs243 on a very exciting project this
quarter. They are planning to look at all the unannotated genes in the 
genome from the Science paper Lydia handed out the first day of class
(M. jannaschii, a rare Archae organism, from the mysterious "third branch" 
in the tree of life).
There are hundreds of these unannotated genes, for which no related
genes are known, and no function has been predicted. 
Christian has already examined one such gene and has made a tentative
prediction.  Working with them, you will use and further develop methods of
predicting remote homologs of protein sequences that we have developed
for the CASP2 contest.  This sounds like a lot of fun, and I highly
recommend it. 

As currently envisioned, the project would start
with the protein sequences that are obtained by translating from DNA
to protein the regions in the genome that are marked as 
genes for which no function or homolog is known. Each of these will be
analysed.  However, how do we know that the regions currently marked in the 
genome as genes really are the genes? In some cases this annotation may be 
wrong. It was produced by the GENMARK program, which, like any genefinder,
is far from perfect.
If other students are interested, we could also have another group
of students look at the predicted locations of the genes in this genome
and see if better predictions can be made.  I haven't asked him yet,
but perhaps David Kulp, who wrote the Genie genefinder, would be willing
to advise/work with a separate group of students on this project.

If you are interested in this project, please let me know by Tues.,
and cc cbarrett@cse, kimmen@cse and dkulp@cse. I need to gauge
the level of interest to see how to handle this.    -David

Date: Fri, 11 Oct 1996 23:08:35 -0700
From: Kevin Karplus 
Message-Id: <199610120608.XAA08979@purr.cse.ucsc.edu>
To: haussler@cse.ucsc.edu
CC: cs243@cse.ucsc.edu
In-reply-to: <199610120355.UAA04551@arapaho.cse.ucsc.edu> (message from David
Haussler on Fri, 11 Oct 1996 20:55:04 -0700)
Subject: Re: project ideas NEWSFLASH

Still a third line of inquiry on the archae genome---Christian
suggested searching the entire genome (in all three reading frames)
looking for proteins that would be expected from metabolic pathways
but which have not yet been found.  This may find some genes that the
gene-finding programs miss, perhaps because of sequencing errors (the
gene finding programs almost all stop immediately at a stop codon,
even though there is a chance of a sequencing error and the stop codon
may not really be there).

Kevin
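
A starting point for the search Kevin describes could be a translation routine that does not stop at stop codons. The sketch below is an illustration only, under these assumptions: the standard genetic code applies, both strands are searched (three frames on each strand, six in total), stop codons are written as '*' and translation continues past them, and codons containing unknown bases are written as 'X'. The function names are hypothetical.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = dict(zip(CODONS, AMINO))  # standard genetic code, '*' = stop
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate_through_stops(dna, offset):
    """Translate one frame, emitting '*' for stops instead of terminating."""
    protein = []
    for i in range(offset, len(dna) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are not in the table: use 'X'.
        protein.append(GENETIC_CODE.get(dna[i:i + 3], "X"))
    return "".join(protein)

def six_frame_translation(dna):
    """Return a dict mapping frame names '+1'..'+3', '-1'..'-3' to proteins."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_through_stops(dna, offset)
        frames[f"-{offset + 1}"] = translate_through_stops(rc, offset)
    return frames

frames = six_frame_translation("ATGGCCTAAGGG")
for name in sorted(frames):
    print(name, frames[name])
```

Each translated frame could then be scored against a model of the expected protein; a strong match that runs across a '*' would be a candidate for a gene interrupted by a sequencing error.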
 

Date: Sat, 12 Oct 1996 16:54:34 -0800
To: cs243@cse.ucsc.edu
From: ares@biology.ucsc.edu (Manny Ares)
Subject: Re: project ideas NEWSFLASH

Here's a specific suggestion for an enzyme to search for in terms of the
project idea that Kevin mentioned: Ribonuclease III.  This enzyme is a
double strand-specific endoribonuclease that was first identified in
bacteria, but has only recently been recognized in "lower" eukaryotes.  So
far, no archaeal homologs have been identified, and among eukaryotes, no
vertebrate homologs are known (but be careful--the database contains a
homologous sequence labeled as human which is almost certainly not).  It
seems like all cells should have it--even a genome as small as that of
mycoplasma has an RNase III gene-- but none is annotated in the archaeal
genomic sequence.  It might be fun to take on as a project.  Perhaps
there's even a vertebrate sequence out there.  To find out more background
on the enzyme see Abouelela et al. (1996)  Cell 85:115.

Manny

More ideas:

As far as I know, the Henikoffs never fully experimented with varying the total number of counts when they designed their tests of different regularizers for the blocks. We know that when applying a Dirichlet regularizer to a particular block, the total weight of all the training sequences should be scaled such that the total information (i.e. relative entropy, or equivalently, average score) of the probability distributions that define that particular block is approximately equal to the logarithm of the size of the database you want to search, plus the logarithm of the total number of blocks, plus a small constant. For example, if you want to use the blocks database to search for protein motifs in the yeast genome, you should create a blocks database in which, wherever possible, the total information of an individual block is about

   log(100,000,000) + log(3000) + c, where, say, c = 5

(since there are about 3000 blocks). This makes the blocks as general as possible, yet still capable of picking out a signal reliably. I think this simple experiment, searching yeast or M. jannaschii predicted proteins with an "information-tuned" blocks database, would turn up interesting things.
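
The tuning described above can be sketched numerically. Everything beyond the formula itself is an assumption made for illustration: logarithms are taken base 2 (so information is in bits), the background distribution is uniform over the 20 amino acids, a flat pseudocount stands in for a real Dirichlet regularizer, the toy block is invented, and the function names are hypothetical. The idea is to rescale the observed counts until the block's total relative entropy matches the target.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def target_information(db_size, n_blocks, c=5.0):
    """Target total information in bits (base-2 logs are an assumption)."""
    return math.log2(db_size) + math.log2(n_blocks) + c

def block_information(columns, scale, prior_weight=1.0):
    """Total relative entropy (bits) of the block to a uniform background.

    columns: list of dicts mapping residue -> weighted count.
    scale: multiplier applied to all observed counts.
    prior_weight: total pseudocount mass per column, spread uniformly.
    """
    q = 1.0 / len(AMINO)
    total_bits = 0.0
    for counts in columns:
        n = scale * sum(counts.values())
        for aa in AMINO:
            p = (scale * counts.get(aa, 0.0) + prior_weight * q) / (n + prior_weight)
            total_bits += p * math.log2(p / q)
    return total_bits

def tune_scale(columns, target_bits, lo=1e-6, hi=1e6, steps=200):
    """Bisect on the count scale; return None if the target is unreachable."""
    if block_information(columns, hi) < target_bits:
        return None
    for _ in range(steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint: scale spans many decades
        if block_information(columns, mid) < target_bits:
            lo = mid
        else:
            hi = mid
    return hi

target = target_information(100_000_000, 3000)
toy_block = [{"W": 10.0}] * 12  # 12 fully conserved columns, invented data
scale = tune_scale(toy_block, target)
print(round(target, 2), scale)
```

Bisection is safe here because, with a fixed prior, the relative entropy to the background grows monotonically as the observed counts are given more weight. Short or weakly conserved blocks cannot reach the target at any scale, which matches the "wherever possible" qualification above; tune_scale returns None in that case.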
Questions regarding page content should be directed to haussler@cse.ucsc.edu.
Last modified October 20, 1996.
