Note: Completed projects are now on-line.
Everyone in CMP243 must do a class project; this is 50% of your grade. I would like to have a one paragraph project proposal by October 11. It is OK to change your mind after this, but I want to make sure you get started early on something. Projects are due the last day of classes (December 3). You must produce a short written report (10 pages of text + lots of figures, charts, tables, etc. with outstanding legends) describing the problem, how you approached it, and what the experimental results were. It is required that you put your project on the web. Your project page must include your project writeup, which can be in HTML, pdf, postscript, etc., and links to any data and/or bioinformatics programs that you created or modified. For help, see Jim Kent's tutorial on creating a program that runs on the web at UCSC. At the end of the quarter I'd like to make a web page with links to the class projects.
The first complete genomic sequence for an animal, the nematode ("worm") C. elegans, was finished in December 1998. A master copy of the Science paper describing this work is in the class handout box for you to copy. Al Zahler and Jim Kent have built a nice tool called the intronerator to study gene structure and splicing mechanisms in C. elegans. This tool provides access to the entire genome, including gene predictions and other kinds of evidence for gene structure. We will use data from the intronerator in some of our exercises. If you go all out, one of these exercises could evolve into a class project for you.
It is fine to work in teams. I encourage you to work with one of the students in our computational biology group if you can, or with one of the other bioinformatics faculty: Kevin Karplus, Richard Hughey, Hongyun Wang, and Alex Pang. (Manfred Warmuth may also have a project in mind.) The faculty and senior students in our group have computational sequence analysis projects they are working on that you could participate in, and that would make good class projects. Also, if you are in biology or chemistry, you probably have a project with one of the professors there that already has or could have an informatics component that would be suitable for a class project. I know Manny Ares has some computational projects involving the new DNA microarray ("gene chip") lab he is running, and on introns in the yeast genome; Alan Zahler is working on splicing mechanisms in C. elegans; Yishi Jin and Andrew Chisholm have some exciting C. elegans genes involved in differentiation and nervous system development and function that they are working on; Grant Hartzog is working on regulation of transcription; Tony Fink and Lydia Gregoret have some protein families they would like to apply some computational analysis to; Harry Noller, Chuck Wilson and Bill Scott have RNA data you could work with; Todd Wipke has data sets and ongoing projects on protein docking; and Dave Deamer has a very exciting project involving a new method to sequence DNA very rapidly. Also, I strongly encourage engineering students to link up with biology and biochemistry faculty, and vice versa. This makes a great learning experience.
I encourage students to be creative in designing their projects, but the most important thing is that you gain experience by trying to get some actual useful results. So be sure to run any wild ideas you may have by me before you commit a serious piece of time to them. To get an idea of the kind of thing I have in mind, first here are two "generic" applications-based projects:
1. Take a particular family of sites in DNA where a certain protein or protein-RNA complex binds and build a computational mechanism to recognize members of this family of sites when they appear in newly sequenced DNA. One example would be to try to recognize promoter regions of a certain type. Promoters are places in DNA where certain proteins bind that are involved in the regulation of transcription. Terry Furey (booch@cse) has a data set of promoters for yeast genes that make proteins that are involved in the ribosome. They all have a common promoter structure that is recognized by the yeast cell. Can you build a computational mechanism to recognize this promoter structure, and distinguish it from the promoters of other yeast genes?
Other examples like the project above would be to try to build a computational mechanism to recognize branch points, which are involved in the mRNA splicing process, binding sites for factors that regulate splicing, or to recognize splice sites themselves. Manny Ares can provide yeast introns and branch points, and Jim Kent (kent@biology) will be providing some datasets to work with concerning C. elegans splice mechanisms via the intronerator.
2. Analyze a protein family. Use the SAM-T99 software at UCSC (see below) to make multiple alignments of proteins in this family, and to build a statistical model of the family that can be used to search the protein database for remote homologs of members of the family. Group the members of the family into subfamilies and characterize their differences. Many professors in biology and biochemistry have proteins they are working with, and we need students to apply our methods to these proteins.
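For the first generic project above (recognizing a family of binding sites), one common starting point is a position weight matrix built from aligned example sites. The following is only a minimal sketch: the toy sites, the pseudocount, the uniform background frequency, and the score threshold are all assumptions made up for illustration, not data or methods from the class.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window (A, C, G, T only)."""
    return sum(col[base] for col, base in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score_window(pwm, sequence[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits

# Hypothetical toy sites, invented for this example only.
toy_sites = ["TATAAA", "TATAAT", "TATATA", "TAAAAA"]
pwm = build_pwm(toy_sites)
hits = scan(pwm, "GGCGTATAAAGGC", threshold=4.0)
```

A real project would estimate the background from the genome in question and choose the threshold by measuring false positives on promoters of other genes; a hidden Markov model or another statistical model could replace the matrix.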
Another type of project, different from an applications-based project, would be to take a single computational or statistical method, and try it on a wide variety of sequences or sequence analysis problems. I would call this a "methods-based" project. Here are two examples of "methods-based" projects.
1. To compete in the CASP3 protein structure prediction contest, Kevin Karplus, Richard Hughey and UCSC grads built the above-mentioned system called SAM-T98 (and recently updated to SAM-T99) that reads in a target protein sequence and (using hidden Markov models, which we will discuss at length) finds another sequence in a database of protein sequences of known structure that it thinks has the same structure as the target sequence. It produces an alignment of the target sequence to the other sequence so the user can judge how similar the sequences look. There are a lot of variations on the method, and parameters that can be varied. We have not had enough time to do careful testing of the method. However, we have available a mechanism for testing the method(s). We do this by giving it a protein sequence of known structure that is not a close homolog of any of the sequences of known structure in its database, but is a distant homolog of one or more of these database sequences, and has the same structure as these distant homologs. We can then check if the method can find these distant homologs, and if it gives a structurally correct alignment to each of them. I would like to see some students in this class choose for their project to work with Kevin Karplus, Christian Barrett (cbarrett@cse), Melissa Cline (cline@cse) or Mark Diekhans (markd@cse), and I'd like to have the project be to test these and related methods we developed, and find the variations/parameter settings that work best, or develop new extensions/variations of the method that work even better.
2. A central problem in bioinformatics is to find the locations of genes in DNA sequences (see intro/review of genefinding.) This is hard in the genomes of higher organisms because their genes are interrupted by introns. David Kulp (dkulp@cse) has developed a genefinder called Genie to find genes in human genomic DNA, and is working with Jim Kent (kent@biology) to customize it and use it to find genes in C. elegans to be displayed on the intronerator. One project would be to do experiments with different settings of the Genie parameters and Genie configuration to see if you can get improved performance in finding C. elegans genes, or to explore special genefinding problems, like how to handle alternative splicing, or how to use information from the related genome, C. briggsae, to find genes in C. elegans. Another option is to use or develop other programs, like the MEME program, to discover new features in genomic DNA that have biological importance, or to develop new means to recognize features other than the gene structure that Genie finds. A new program of this type, called Improbizer, has been developed by Jim Kent. I strongly suggest exploring its more powerful feature set.
Here is a brief summary of some concrete projects that I think would be good ones. I'll be adding to the end of this list as I get recommendations from other faculty.
Another project associated with the microarrayer is to implement new software to read out the images taken from the chips after they have been hybridized with the sample. The challenge would be to improve commercial software currently being used for this, writing code that can do customized image processing operations. If your specialty is image analysis, please consider this project!
One other important piece of information you might use to predict the start of transcription is the location of the start of translation, which is usually a certain characteristic distance downstream from the start of transcription. In some organisms, people have used "Kozak's rule", which basically says that, with some exceptions, translation starts at the first ATG downstream from the transcription start. What kind of (stochastic) rule can you discover for yeast genes? One problem you will encounter is that our information about the start of translation for yeast genes is not 100% reliable. In many cases the start of translation listed in the yeast genome databases (see links from the Ares Lab yeast intron site) is calculated by finding the longest possible open reading frame (ORF), which is a stretch of genomic DNA that does not contain any stop codons when read in one of the three possible reading frames. This means that in some cases the actual start of translation may be downstream from the start codon specified in the database. Also, genes with introns near their 5' ends present special problems, but we have extremely good local expertise to address this issue. (Actually, a project to find more reliable starts of translation for yeast genes would also be an excellent one.)
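The two heuristics just described (the first ATG downstream from a transcription start, and the longest open reading frame) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it looks only at the forward strand, it treats an ORF as running from an ATG to the next in-frame stop codon, and the toy sequence and function names are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_atg_downstream(seq, tss):
    """Index of the first ATG at or after position tss, or -1 if there is none."""
    return seq.find("ATG", tss)

def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF on the forward strand.

    end is the index just past the stop codon; returns None if no ORF exists.
    """
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best

def in_frame_starts(seq, orf):
    """List every in-frame ATG inside an ORF; each is a candidate true start."""
    start, end = orf
    return [i for i in range(start, end - 3, 3) if seq[i:i + 3] == "ATG"]

# Hypothetical toy sequence: one ORF containing two in-frame ATG codons.
toy = "GGATGAAAATGCCCTGAGG"
orf = longest_orf(toy)
candidates = in_frame_starts(toy, orf)
```

On the toy sequence the longest-ORF rule picks the ATG at index 2, but the second in-frame ATG at index 8 is an equally legal start, which is exactly the ambiguity a stochastic rule for yeast would have to resolve.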
This is a really exciting project on this campus now. This group of scientists has developed a technique that might be used to determine the sequence of a stretch of DNA or RNA by pulling it through a nanoscale pore. They have also developed a neat new idea for forming a kind of "molecular bar code". CS, CE and EE students, especially those with signal processing experience, are strongly encouraged to consider working on this project. Contact Mark Akeson for details. Here is a description of the project he sent me.
In a recent seminar at Agilent, our collaborator Daniel Branton showed that individual polynucleotide strands can be detected as they pass through a single nanoscale pore under an applied voltage. More recent results demonstrate that this prototype nanopore detector can be used to derive a direct, high-speed read-out of the linear composition of RNA block copolymers. For example, within an individual RNA strand, the transition from a 30-nucleotide poly A segment to a 70-nucleotide poly C segment can be read as an abrupt, 10 pA current change. A long-term goal of our collaboration is to improve this resolution approximately fifty-fold so that individual DNA molecules may be sequenced at very high speed. We are seeking a student who has experience in signal processing and/or statistical analysis of time series data to help design software to analyze the data produced by this method and give a more accurate reading of the sequence. A near-term objective pursued independently in Santa Cruz is to develop a library of encoded linear polymers (molecular bar codes) that can be read by the nanopore and that can be used to tag antibodies, agonists and gene-specific sequences. This technology shows extraordinary speed and sensitivity. We have made progress using molecular bar codes synthesized using conventional phosphoramidite chemistry, and using nitroxide-mediated radical polymerization. The latter technique (developed by Rebecca Braslau at UCSC and Craig Hawker at IBM) is revolutionary in that it effects polymerizations with extremely low polydispersities for a number of monomer families, including styrenes, acrylates, and acrylic acids. In this context, it will permit us to readily synthesize molecular bar codes with a much larger selection of interchangeable polymer blocks than could be achieved using available phosphoramidites. 
Molecular bar code technology is a necessary prelude to nanopore DNA sequencing that may have wide utility as a direct, sensitive, high speed detector of biological macromolecules that are important in medical diagnostics and food safety. We are looking for students to assist in the bioinformatic aspects of this project as well.
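As a rough idea of the signal processing involved, the abrupt current change between two polymer blocks described above could be located with a least-squares search for a single step in the trace. The sketch below is an assumption-laden toy, not the group's analysis method: the trace is synthetic, the baseline levels, noise level, and sample counts are invented, and only the 10 pA step size echoes the description.

```python
import random

def best_single_step(trace, min_segment=5):
    """Return (index, step) for the split that best fits a two-level signal.

    The split index minimizes the summed squared deviation of each side
    from its own mean; step is the right mean minus the left mean.
    """
    n = len(trace)
    if n < 2 * min_segment:
        raise ValueError("trace too short for the requested segment size")
    # Prefix sums let us evaluate every candidate split in constant time.
    prefix = [0.0]
    prefix_sq = [0.0]
    for x in trace:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def sse(a, b):
        s = prefix[b] - prefix[a]
        return (prefix_sq[b] - prefix_sq[a]) - s * s / (b - a)

    best_k = min(range(min_segment, n - min_segment + 1),
                 key=lambda k: sse(0, k) + sse(k, n))
    left_mean = prefix[best_k] / best_k
    right_mean = (prefix[n] - prefix[best_k]) / (n - best_k)
    return best_k, right_mean - left_mean

# Synthetic trace (hypothetical values): 300 samples at one level,
# then 700 samples at a level 10 pA higher, with Gaussian noise.
rng = random.Random(0)
trace = [rng.gauss(20.0, 1.0) for _ in range(300)]
trace += [rng.gauss(30.0, 1.0) for _ in range(700)]
split, step = best_single_step(trace)
```

Real traces would contain many transitions, much smaller steps, and correlated noise, so a project would likely need multi-change-point methods or a hidden Markov model rather than this single-step search.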
As mentioned above, Kevin Karplus and Richard Hughey have developed the SAM-T98 system for finding remote protein homologies, and producing multiple alignments of protein families. I gave you a brief introduction to SAM-T98 in homework 2. A new version, SAM-T99, is being tested now. A good project would be to work with Kevin Karplus on the final testing and refinement of this method. The exciting part about this project is that this is the program that was so successful in the last two CASP contests.
Other projects along these lines suggested by Kevin involve building some exciting new protein analysis tools. Here are some messages he has sent:
One project I have is a part of the CASP protein-structure prediction project. It involves mainly existing programs, but probably requires some script writing or new programming. What I want is a program/script that takes a sequence (or multiple alignment generated by target99) and searches a subset of PDB for matching pieces. The local alignments need to be converted to 3-d structures by extracting the corresponding atoms from the PDB files (that may be new code, though existing PDB parsers should be used), and running SCWRL to replace any sidechains that differ. The resulting partial models need to be converted to a format not yet determined, in order to be useful as input to the structure-prediction program. The format will probably be close to an existing one that converts whole-chain PDB files, but with extra annotation to explain which parts of the chain are present.
A second project idea from Kevin Karplus is the following. The details of it will become clear when you start working with Spencer Tu (stu@cse), the main author of this system.
If someone is mainly interested in visualization work, we could use an HMM editor for Spencer Tu's general HMMs. The basic idea is to start with a graph editor, and modify or add operations specific to HMMs:
add edge (needs count or probability parameter filled in)
delete edge
add state (needs emission counts or probabilities filled in)
detach edge from null state (replace edge x->y with edges from x to all out-neighbors of y, updating counts as well as possible)
route edge through hub (replace x->y with x->h and h->y, where h is an existing null state)
retrain HMM on data set
display emission probabilities with variant of Sequence Logo format
edit emission or transition probabilities or counts (edge counts can be well displayed by thickness proportional to sqrt(count))
Allow fat-state/fat-edge or detailed picture of HMM.
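To make the "detach edge" and "route edge through hub" operations concrete, here is a minimal sketch on a toy graph. The dict-of-dicts representation and all function names are hypothetical and are not taken from Spencer Tu's system; in particular, splitting the detached count in proportion to the null state's outgoing counts is just one assumed reading of "updating counts as well as possible".

```python
import math

def detach_edge(edges, x, y):
    """Replace x->y with edges from x to every out-neighbor of y.

    Assumption: the count on x->y is divided among y's out-neighbors
    in proportion to y's own outgoing edge counts.
    """
    out = dict(edges.get(y, {}))  # copy, so a self-loop cannot disturb iteration
    total = sum(out.values())
    if total == 0:
        raise ValueError("state has no outgoing counts to distribute over")
    count = edges[x].pop(y)
    for z, c in out.items():
        edges[x][z] = edges[x].get(z, 0.0) + count * c / total

def route_through_hub(edges, x, y, h):
    """Replace x->y with x->h and h->y, where h is an existing null state."""
    count = edges[x].pop(y)
    edges[x][h] = edges[x].get(h, 0.0) + count
    edges.setdefault(h, {})
    edges[h][y] = edges[h].get(y, 0.0) + count

def edge_thickness(count, scale=1.0):
    """Display width proportional to sqrt(count), as suggested above."""
    return scale * math.sqrt(count)

# Toy graph: state "a" feeds null state "n", which feeds "b" and "c".
edges = {"a": {"n": 8.0}, "n": {"b": 3.0, "c": 1.0}, "b": {}, "c": {}}
detach_edge(edges, "a", "n")
```

After the call, "a" points directly at "b" and "c" with counts 6 and 2; a real editor would also have to decide whether to reduce the counts leaving the null state.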
Here is an excellent project you can do applying the SAM-T98/99 protein homology tools developed by Kevin Karplus and Richard Hughey, discussed above.
The V-type and F-type ATPases are proton pumps which either make ATP or use ATP to generate proton gradients. The two enzymes are distantly related and have some subunits that are obvious homologs. However, both enzymes are very complex, being composed of at least 13 different kinds of subunits. Sequence alignments show significant similarity for only 4 of the subunits (e.g. the A subunit of the V-ATPase is homologous to the beta subunit of the F-ATPase). Several other subunits are likely to be true homologs but the sequences have diverged greatly. I would predict that the secondary structure is probably still very similar. Specific questions that need to be addressed: Which subunit in the V-ATPase is equivalent to the gamma subunit of the F-ATPase? This is important because the gamma is the "rotor" in the ATPase motor. Can the A and B subunits of the V-ATPase be fit to the 3D structure for the alpha and beta subunits of the F-ATPase? Alignments can be made, and regions in which the subunits differ could be identified. We have identified a gene unique to filamentous fungi, named htl. Can we get any information about the structure of the HTL protein or identify the family of proteins to which it might belong?
Questions regarding page content should be directed to
haussler@cse.ucsc.edu.
Last modified Dec.11, 1999.