Projects for CMP243

Everyone in CMP243 must do a class project, which counts for 50% of your grade. I would like to have a one-paragraph project proposal by Thurs. Jan. 22. It is OK to change your mind after this, but I want to make sure you get started early on something. Projects are due the last day of class (Tues. March 17). You should turn in a short written report (10 pages of text plus lots of figures, charts, tables, etc.) describing the problem, how you approached it, and what experimental results you obtained.

It is fine to work in teams. I encourage you to work with one of the students in our computational biology group if you can, or with Kevin Karplus, Richard Hughey or Lydia Gregoret. All of these people are working on computational sequence analysis projects that you could participate in and that would make good class projects. Also, if you are in biology or chemistry, you probably have a project with one of the professors there that already has, or could have, an informatics component suitable for a class project. I know Manny Ares has some computational projects involving introns in the yeast genome, Alan Zahler is working on introns and splicing mechanisms in higher organisms, Tony Fink has some protein families he would like to apply some computational analysis to, Harry Noller and Chuck Wilson have RNA data you could work with, and Todd Wipke has data sets and ongoing projects on protein docking.

I encourage students to be creative in designing their projects, but the most important thing is that you gain experience by trying to get some actually useful results. So be sure to run any wild ideas you may have by me before you commit a serious amount of time to them. To give you an idea of the kind of thing I have in mind, here are two "generic" applications-based projects:

1. Take a particular family of sites in DNA where a certain protein or protein-RNA complex binds, and build a computational mechanism to recognize members of this family of sites when they appear in newly sequenced DNA. One example would be transcription factor binding sites, which are places in DNA where certain proteins involved in the regulation of transcription bind. Another would be branch points, which are involved in the mRNA splicing process. A third example would be splice sites themselves. The first step is to collect the data. If you do transcription factor binding sites, there is already a large publicly available database, the transcription factor database. The searchable TRANSFAC database linked into EMBL's SRS server is another source. (See also

@article{Wingender97,
author="Wingender, E. and others",
title="{TRANSFAC, TRRD and COMPEL: Towards a federated database system on
   transcriptional regulation}",
journal="Nucleic Acids Research",
volume=25, month=jan, number=1, year=1997,
pages="265--268", }
) Branch points are less well studied. You can see Manny Ares about yeast branch points, but these are fairly easy and well characterized. You might start with yeast and then try to generalize to a broader class of organisms. Characterizing human branch points seems to be an entirely open problem at present. Ask Al Zahler. I'll make some datasets available, including datasets for splice sites in yeast, worm, fly and human. For now, see the splice datasets at LBL (fly and human) and Sanger (worm). There are many other types of sites in DNA one could try to find as well.

2. Analyze a protein family. Use the SAM software at UCSC to make multiple alignments of proteins in this family, and to build a statistical model of the family that can be used to search the protein database for remote homologs of members of the family. Group the members of the family into subfamilies and characterize their differences. I know Tony Fink, Andrew Chisholm, Yishi Jin, John Tamkun and many other professors in Biology and Biochemistry have proteins they are working with, and we need students to apply our methods to these proteins.
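To make the first generic project a little more concrete, here is a minimal sketch of one simple way to "recognize members of a family of sites": a log-odds position weight matrix built from aligned example sites and slid along a new sequence. This is only an illustration, not a method prescribed for the class. The function names, the pseudocount, the uniform background, the score threshold and the toy sequences are all assumptions made up for the example; a real project would use sites collected from one of the databases mentioned above and would choose the threshold by testing on held-out data.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log-odds weight matrix from aligned sites of equal length."""
    if background is None:
        # Assumption: uniform background base frequencies.
        background = {b: 0.25 for b in ALPHABET}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        column = {}
        for b in ALPHABET:
            freq = (counts[b] + pseudocount) / total
            column[b] = math.log2(freq / background[b])
        pwm.append(column)
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(col[b] for col, b in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in ALPHABET for b in window):
            continue  # skip windows containing ambiguous bases such as N
        s = score_window(pwm, window)
        if s >= threshold:
            hits.append((start, window, s))
    return hits

# Hypothetical toy data, invented for this example only.
toy_sites = ["TACTAAC", "TACTAAC", "TACTAAT", "GACTAAC", "TACTGAC"]
pwm = build_pwm(toy_sites)
hits = scan(pwm, "GGCGTACTAACAGGCC", threshold=5.0)
for start, window, s in hits:
    print(start, window, round(s, 2))
```

A weight matrix treats positions as independent, which is often too crude; part of a project could be comparing it against richer models on the same data.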

Another type of project, different from an applications-based project, would be to take a single computational or statistical method and try it on a wide variety of sequences or sequence analysis problems. I would call this a "methods-based" project. Here is one example:

1. To compete in the CASP3 protein structure prediction contest, Kevin Karplus and UCSC grad students have built a system that reads in a target protein sequence and, using hidden Markov models (which we will discuss at length), finds a sequence in a database of protein sequences of known structure that it thinks has the same structure as the target sequence. It also produces an alignment of the target sequence to that sequence so the user can judge how similar the two look. The method has many variations and many parameters that can be varied, and we have not had enough time to test it carefully. We do, however, have a mechanism for testing it: we give the system a protein sequence of known structure that is not a close homolog of any of the sequences of known structure in its database, but is a distant homolog of one or more of them and has the same structure as those distant homologs. We can then check whether the method finds these distant homologs and whether it gives a structurally correct alignment to each of them. I would like to see some students in this class work with Kevin Karplus, Christian Barrett (cbarrett@cse), Melissa Cline (cline@cse) or Mark Diekhans (markd@cse) on a project that tests the methods we developed and finds the variations and parameter settings that work best, or that develops new extensions or variations of the method that work even better.
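The two checks described in this project (does the method find the distant homologs, and is the alignment structurally correct) can be turned into simple numbers. The sketch below is one hypothetical way to do that and is not the evaluation code used by the UCSC group. It assumes each method variant outputs a ranked list of database sequence identifiers per target, and that alignments are represented as sets of (target position, database position) pairs; all identifiers and data in it are made up.

```python
def top_k_recall(ranked_hits, true_homologs, k=1):
    """Fraction of targets with at least one true homolog among the top k hits."""
    if not ranked_hits:
        return 0.0
    found = 0
    for target, hits in ranked_hits.items():
        if set(hits[:k]) & true_homologs.get(target, set()):
            found += 1
    return found / len(ranked_hits)

def alignment_accuracy(predicted_pairs, reference_pairs):
    """Fraction of reference (structure-based) residue pairs that the
    predicted alignment reproduces exactly."""
    reference = set(reference_pairs)
    if not reference:
        return 0.0
    return len(reference & set(predicted_pairs)) / len(reference)

# Hypothetical toy data: three targets, ranked database hits for each,
# and the database sequences known to share each target's structure.
ranked_hits = {
    "t1": ["d3", "d1", "d7"],
    "t2": ["d2", "d5"],
    "t3": ["d9", "d4"],
}
true_homologs = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d8"}}

print(top_k_recall(ranked_hits, true_homologs, k=1))
print(top_k_recall(ranked_hits, true_homologs, k=2))

# Hypothetical alignments as (target position, database position) pairs.
predicted = [(1, 10), (2, 11), (3, 13), (4, 14)]
reference = [(1, 10), (2, 11), (3, 12), (4, 14)]
print(alignment_accuracy(predicted, reference))
```

Computing numbers like these for each variation or parameter setting on the same set of test targets gives a direct basis for deciding which settings work best.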

I have asked people at UCSC who work in this area to contribute ideas for projects that students could work on with them, and I'll add them here as they come in. I'll also add here specific project ideas I may decide to suggest.

Questions about page content should be directed to haussler@cse.ucsc.edu.
Last modified October 20, 1996.
