Projects for CMP243

Note: Completed projects are now on-line.

Everyone in CMP243 must do a class project; this is 50% of your grade. I would like to have a one-paragraph project proposal by October 11. It is OK to change your mind after this, but I want to make sure you get started early on something. Projects are due the last day of classes (December 3). You must produce a short written report (10 pages of text plus lots of figures, charts, tables, etc. with outstanding legends) describing the problem, how you approached it, and what the experimental results were. You are required to put your project on the web. Your project page must include your project writeup, which can be in HTML, PDF, PostScript, etc., and links to any data and/or bioinformatics programs that you created or modified. For help, see Jim Kent's tutorial on creating a program that runs on the web at UCSC. At the end of the quarter I'd like to make a web page with links to the class projects.

The first complete genomic sequence for an animal, the nematode ("worm") C. elegans, was finished in December 1998. A master copy of the Science paper describing this work is in the class handout box for you to copy. Al Zahler and Jim Kent have built a nice tool called the intronerator to study gene structure and splicing mechanisms in C. elegans. This tool provides access to the entire genome, including gene predictions and other kinds of evidence for gene structure. We will use data from the intronerator in some of our exercises. If you go all out, one of these exercises could evolve into a class project for you.

It is fine to work in teams. I encourage you to work with one of the students in our computational biology group if you can, or with one of the other bioinformatics faculty: Kevin Karplus, Richard Hughey, Hongyun Wang, and Alex Pang. (Manfred Warmuth may also have a project in mind.) The faculty and senior students in our group have computational sequence analysis projects they are working on that you could participate in, and these would make good class projects. Also, if you are in biology or chemistry, you probably have a project with one of the professors there that already has or could have an informatics component that would be suitable for a class project. I know Manny Ares has some computational projects involving the new DNA microarray ("gene chip") lab he is running, and on introns in the yeast genome. Alan Zahler is working on splicing mechanisms in C. elegans. Yishi Jin and Andrew Chisholm are working on some exciting C. elegans genes involved in differentiation and nervous system development and function. Grant Hartzog is working on regulation of transcription. Tony Fink and Lydia Gregoret have some protein families they would like to apply some computational analysis to. Harry Noller, Chuck Wilson and Bill Scott have RNA data you could work with. Todd Wipke has data sets and ongoing projects on protein docking, and Dave Deamer has a very exciting project involving a new method to sequence DNA very rapidly. Also, I strongly encourage engineering students to link up with biology and biochemistry faculty, and vice versa. This makes a great learning experience.

I encourage students to be creative in designing their projects, but the most important thing is that you gain experience by trying to get some actual useful results. So be sure to run any wild ideas you may have by me before you commit a serious piece of time to them. To get an idea of the kind of thing I have in mind, first here are two "generic" applications-based projects:

1. Take a particular family of sites in DNA where a certain protein or protein-RNA complex binds and build a computational mechanism to recognize members of this family of sites when they appear in newly sequenced DNA. One example would be to try to recognize promoter regions of a certain type. Promoters are places in DNA where certain proteins bind that are involved in the regulation of transcription. Terry Furey (booch@cse) has a data set of promoters for yeast genes that make proteins that are involved in the ribosome. They all have a common promoter structure that is recognized by the yeast cell. Can you build a computational mechanism to recognize this promoter structure, and distinguish it from the promoters of other yeast genes?

Other examples like the project above would be to try to build a computational mechanism to recognize branch points, which are involved in the mRNA splicing process, binding sites for factors that regulate splicing, or to recognize splice sites themselves. Manny Ares can provide yeast introns and branch points, and Jim Kent (kent@biology) will be providing some datasets to work with concerning C. elegans splice mechanisms via the intronerator.

2. Analyze a protein family. Use the SAM-T99 software at UCSC (see below) to make multiple alignments of proteins in this family, and to build a statistical model of the family that can be used to search the protein database for remote homologs of members of the family. Group the members of the family into subfamilies and characterize their differences. Many professors in biology and biochemistry have proteins they are working with, and we need students to apply our methods to these proteins.
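
To make the first generic project a bit more concrete, here is a minimal Python sketch of one very simple "computational mechanism" for recognizing a family of DNA sites: a log-odds position weight matrix built from aligned example sites and slid along a new sequence. This is only an illustration under stated assumptions, not the method used by any of the tools named on this page. The function names, the uniform background, the pseudocount, the threshold, and the toy sites are all made up for the example; real promoter data would come from the people mentioned above.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    if background is None:
        # Assumption: uniform background base frequencies.
        background = {b: 0.25 for b in ALPHABET}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(length):
        column = [s[i] for s in sites]
        scores = {}
        for b in ALPHABET:
            freq = (column.count(b) + pseudocount) / total
            scores[b] = math.log2(freq / background[b])
        pwm.append(scores)
    return pwm

def scan(sequence, pwm, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in ALPHABET for b in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy example: hypothetical sites, not real promoter data.
toy_sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
toy_pwm = build_pwm(toy_sites)
toy_hits = scan("GGCGTATAAAGGC", toy_pwm, threshold=5.0)
print(toy_hits)
```

A weight matrix treats positions as independent, which is the main thing a hidden Markov model or other richer model would improve on; it is a reasonable baseline to compare a project's results against.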

Another type of project, different from an applications-based project, would be to take a single computational or statistical method and try it on a wide variety of sequences or sequence analysis problems. I would call this a "methods-based" project. Here are two examples of "methods-based" projects.

1. To compete in the CASP3 protein structure prediction contest, Kevin Karplus, Richard Hughey and UCSC grads built the above-mentioned system called SAM-T98 (and recently updated to SAM-T99) that reads in a target protein sequence and (using hidden Markov models, which we will discuss at length) finds another sequence in a database of protein sequences of known structure that it thinks has the same structure as the target sequence. It produces an alignment of the target sequence to the other sequence so the user can judge how similar the sequences look. There are a lot of variations on the method, and parameters that can be varied. We have not had enough time to do careful testing of the method. However, we have available a mechanism for testing the method(s). We do this by giving it a protein sequence of known structure that is not a close homolog of any of the sequences of known structure in its database, but is a distant homolog of one or more of these database sequences, and has the same structure as these distant homologs. We can then check if the method can find these distant homologs, and if it gives a structurally correct alignment to each of them. I would like to see some students in this class choose for their project to work with Kevin Karplus, Christian Barrett (cbarrett@cse), Melissa Cline (cline@cse) or Mark Diekhans (markd@cse), and I'd like to have the project be to test these and related methods we developed, and find the variations/parameter settings that work best, or develop new extensions/variations of the method that work even better.

2. A central problem in bioinformatics is to find the locations of genes in DNA sequences (see intro/review of genefinding). This is hard in the genomes of higher organisms because their genes are interrupted by introns. David Kulp (dkulp@cse) has developed a genefinder called Genie to find genes in human genomic DNA, and is working with Jim Kent (kent@biology) to customize it and use it to find genes in C. elegans to be displayed on the intronerator. One project would be to do experiments with different settings of the Genie parameters and Genie configuration to see if you can get improved performance in finding C. elegans genes, or to explore special genefinding problems, like how to handle alternative splicing, or how to use information from the related genome, C. briggsae, to find genes in C. elegans. Another option is to use or develop other programs, like the MEME program, to discover new features in genomic DNA that have biological importance, or to develop new means to recognize features other than the gene structure that Genie finds. A new program of this type, called Improbizer, has been developed by Jim Kent. I strongly suggest exploring its more powerful feature set.
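
Any experiment with genefinder settings needs a way to say whether one setting is "better" than another. As a hedged illustration, the Python sketch below compares predicted coding exons against an annotated set at the nucleotide level and reports sensitivity, TP/(TP+FN), and precision, TP/(TP+FP). The function names and the half-open (start, end) interval convention are assumptions made for this example; this is not Genie's own evaluation code, and the coordinates in the usage note are made up.

```python
def coding_positions(exons):
    """Expand a list of half-open (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions

def nucleotide_accuracy(predicted_exons, annotated_exons):
    """Return (sensitivity, precision) of predicted coding bases.

    sensitivity = TP / (TP + FN): fraction of annotated coding bases predicted
    precision   = TP / (TP + FP): fraction of predicted coding bases annotated
    """
    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(annotated_exons)
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision
```

For example, with annotated exons `[(100, 200), (300, 400)]` and predicted exons `[(100, 180), (290, 400)]`, 180 of the 200 annotated bases are recovered and 10 predicted bases are spurious. Expanding intervals into sets is simple but memory-hungry; for whole chromosomes an interval-based computation would be the better design.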

Here is a brief summary of some concrete projects that I think would be good ones. I'll be adding to the end of this list as I get recommendations from other faculty.


  1. The new UCSC DNA microarray project is certainly one of the hottest scientific projects on campus. They desperately need a database for the results obtained from the gene chips. Work with Manny Ares, Chuck Sugnet, Chi-Yip Ho, and Tyson Clark to specify the schema for this database to maximize the usefulness of this data for a variety of research tasks. (The latter three people are in this class.) Actually implementing the database I would consider to be a separate project, perhaps as an independent study/research project in the following quarter. If you have database experience, please consider this project! A good database design is worth its weight in gold. There is also huge demand in industry for people who can design genomics-oriented databases. A CS or CE student who did this project would really learn a lot about genomics data analysis and be in a great position to do further work in this area.

    Another project associated with the microarrayer is to implement new software to read out the images taken from the chips after they have been hybridized with the sample. The challenge would be to improve commercial software currently being used for this, writing code that can do customized image processing operations. If your specialty is image analysis, please consider this project!


  2. Build a better null model for BLAST. There are a lot of aspects of the statistical structure of genomic DNA that are not taken into account by the BLAST null model, which is designed primarily for efficiency. After running BLAST to get a few thousand hits, these hits could be rescored using a more sophisticated null model, and perhaps borderline hits of real biological significance will move closer to the top of the list. Currently many of these borderline hits are lost in the noise, and this borderline area is where many of the more interesting discoveries are to be made. We also need a good null model to score the hits found by the motifs that Improbizer finds. This is an excellent opportunity to extend one of the most important new locally-built bioinformatics tools. Jim can let you know what would be required to use your new null model in evaluating the significance of hits found by an Improbizer motif. I have some specific ideas on new null models that I hope to write up soon. These could form the basis of your implementation. They would require that you read ahead to the section of the text on hidden Markov models.
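
As one hedged illustration of what "rescoring hits with a richer null model" could look like for project 2, the Python sketch below trains a fixed-order Markov chain on background DNA and uses it as the null in a log-odds score. This is a generic textbook-style construction, not BLAST's null model and not the specific ideas promised above; the function names, the order, and the pseudocount are assumptions made for the example, and sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import defaultdict

def train_markov_null(sequences, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from background DNA.

    Returns a function prob(context, base). Assumes an A/C/G/T alphabet.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def null_log2_prob(seq, prob, order=1):
    """log2 probability of seq under the null, conditioned on its first `order` bases."""
    return sum(math.log2(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def rescore(hit_seq, model_log2_prob, prob, order=1):
    """Log-odds score: log2 P(hit | model) minus log2 P(hit | null)."""
    return model_log2_prob - null_log2_prob(hit_seq, prob, order)
```

Here `model_log2_prob` stands for whatever score the motif or alignment model assigns to the hit; how to obtain it from Improbizer or BLAST output is left open. A hit that lies in a region the null already finds likely (for example, low-complexity sequence) gets a smaller log-odds score than it would under an independent-bases null.
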
  3. There are two related nematode genomes, C. elegans and C. briggsae. A lot can be learned from comparing two related genomes. For example, what features of the corresponding genes between these two organisms are conserved? Of special interest would be intron features, and features of the promoter region. Transcription factor binding sites in promoter regions can often only be discovered by bioinformatic methods that use direct comparison of promoter regions from the corresponding genes from related organisms. Jim Kent has developed a new tool for use in aligning two genomes that is likely to pull out some common features between the genes of these two organisms that have not been seen before. Work with Jim, Al Zahler and others to develop scripts and additional analysis tools needed to do the genome comparison between these two organisms. Then do this comparison, and some preliminary exploration of the new discoveries that pop up.
  4. We have a set of 400 S. cerevisiae genes in which the location of the transcription start is exactly known. Are there common features of these genes that occur at particular places in yeast genes relative to the transcription start? Some such features are known, such as the famous "TATA" box, which is involved in the initiation of transcription. Use Improbizer and/or MEME to try to discover new features of these yeast promoter regions. (Also try the Gibbs Sampler. I'll get a URL for this.) Make a hidden Markov model or other computational tool to use this information, and other information, to predict the transcription start site for other S. cerevisiae genes.

    One other important piece of information you might use to predict the start of transcription is the location of the start of translation, which is usually a certain characteristic distance downstream from the start of transcription. In some organisms, people have used "Kozak's rule", which basically says that, with some exceptions, translation starts at the first ATG downstream from the transcription start. What kind of (stochastic) rule can you discover for yeast genes? One problem you will encounter is that our information about the start of translation for yeast genes is not 100% reliable. In many cases the start of translation listed in the yeast genome databases (see links from the Ares Lab yeast intron site) is calculated by finding the longest possible open reading frame (ORF), which is a stretch of genomic DNA that does not contain any stop codons when read in one of the three possible reading frames. This means that in some cases the actual start of translation may be downstream from the start codon specified in the database. Also, genes with introns near their 5' ends present special problems, but we have extremely good local expertise to address this issue. (Actually, a project to find more reliable starts of translation for yeast genes would also be an excellent one.)
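
The two rules discussed above (first ATG downstream of the transcription start, and longest ORF) are easy to state in code, which also makes it easy to see where they can disagree. The Python sketch below is a minimal illustration on the forward strand only, with no introns; the function names are made up for this example, and the ORF is taken to run from the first in-frame ATG after the previous stop codon through the next in-frame stop.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_atg(seq, transcription_start):
    """Index of the first ATG at or downstream of transcription_start, or -1."""
    return seq.find("ATG", transcription_start)

def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF in the three forward frames.

    `end` is exclusive and includes the stop codon. Returns None if no ORF exists.
    """
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the last in-frame stop
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best
```

On the toy sequence `"CCATGAAATGCCCTAGGG"`, `first_atg(seq, 0)` returns 2, while `longest_orf(seq)` returns `(7, 16)`, the `ATGCCCTAG` stretch: the first ATG is in a frame with no stop codon before the sequence ends, so the two rules pick different starts. A stochastic rule for yeast would presumably weigh evidence like this rather than always taking one rule's answer.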


  5. Related to the previous project, train the Genie genefinder to run on the yeast genome and find a more accurate set of coding exons in this genome. Remarkably, no really sophisticated genefinding program has ever been run on the yeast genome, at least to my knowledge. People assumed that longest ORFs were good enough to start with, and as more experimentation was done, these gene predictions would be refined. But it takes a long time to experimentally confirm 6200 gene predictions, and there is evidence that there may be more introns than previously thought, which confound the longest ORF method of predicting coding exons. David Kulp (dkulp@cse), Leslie Grate (grate@cse) and Chuck Sugnet (sugnet@cse) would be good consultants on this. David is the Genie author, and has put together a directory containing the Genie program and tools for retraining it for other organisms. Because he is VP of a start-up bioinformatics company, David has very limited time these days, so students doing this project would have to be quite resourceful in using sophisticated code without much hand-holding from the author of the code.
  6. Also related to the previous project, as mentioned above, Terry Furey is working on the problem of what characterizes the promoters of genes in yeast that make proteins for the ribosome, as opposed to other genes. We suspect that ribosomal genes are strongly co-regulated because DNA microarray experiments show that they are turned on and off (i.e. their mRNA expression levels go up and down) in the same way under different types of experimental conditions. We used this expression information to predict which genes in the yeast genome are involved in the ribosome. We have a technical report on this work linked into the UCSC DNA microarray project page. There are two projects you might do to follow up on this work.
    1. Do further experiments of the type reported in this paper on other classes of yeast genes. Manny Ares and Jim Kent have some suggested classes to look at.
    2. Do what Terry is doing, and try to determine what is distinctive about the promoter regions of one of these classes. Use Improbizer to discover new transcription factor binding sites that are prevalent in the promoter regions of this particular class of genes.

  7. For a more systems-oriented student, help us set up a farm of PCs running Linux that we can use to run some of the more compute-intensive programs that Jim Kent has written, like Improbizer, Intronerator, and dynAli, and others that members of the bioinformatics group have written or will write. Our barnyard machines are getting very heavy use, and we need some cheap, alternative CPU cycles. Jim is running some of these on his home PCs now, but there is a limit to what he can do! If I can get approval for this, I can contribute some cash to this project.
  8. Here's a hot new C. elegans project. It appears that there may be some special subclasses of introns in the C. elegans genome that were introduced at different times in the evolution of this genome, and/or play different functional roles, e.g. related to alternative splicing. Work with Al Zahler and Jim to characterize these special classes of introns. Talk to Al for details.
    Mark Diekhans has developed code to test a new method for predicting remote protein homologies, originally due to Tommi Jaakkola, who is now at MIT. The paper on this topic, entitled "A discriminative framework for detecting remote protein homologies", won the best paper award at the most recent Bioinformatics conference (the ISMB conference in Heidelberg.) Work with Mark to apply this method to further protein families and related problems in bioinformatics.
  9. Mark Akeson and David Deamer, Chemistry and Biochemistry (makeson@hydrogen.ucsc.edu, deamer@hydrogen.ucsc.edu), with Hugh Olsen and Wenonah Vercoutere

    This is a really exciting project on this campus now. This group of scientists has developed a technique that might be used to determine the sequence of a stretch of DNA or RNA by pulling it through a nanoscale pore. They have also developed a neat new idea for forming a kind of "molecular bar code". CS, CE and EE students, especially those with signal processing experience, are strongly encouraged to consider working on this project. Contact Mark Akeson for details. Here is a description of the project he sent me.

    In a recent seminar at Agilent, our collaborator Daniel Branton showed
    that individual polynucleotide strands can be detected as they pass
    through a single nanoscale pore under an applied voltage.  More recent
    results demonstrate that this prototype nanopore detector can be used to
    derive a direct, high-speed read-out of the linear composition of RNA
    block copolymers.  For example, within an individual RNA strand, the
    transition from a 30-nucleotide poly A segment to a 70-nucleotide poly C
    segment can be read as an abrupt, 10 pA current change.  A long-term
    goal of our collaboration is to improve this resolution approximately
    fifty-fold so that individual DNA molecules may be sequenced at very
    high speed. We are seeking a student who has experience in signal
    processing and/or statistical analysis of time series data
    to help design software to analyze the data produced by this method
    and give a more accurate reading of the sequence.
    
    A near-term objective pursued independently in Santa Cruz is to develop
    a library of encoded linear polymers (molecular bar codes) that can be
    read by the nanopore and that can be used to tag antibodies, agonists
    and gene-specific sequences. This technology shows extraordinary speed
    and sensitivity. We have made progress using molecular bar codes
    synthesized using conventional phosphoramidite chemistry, and using
    nitroxide-mediated radical polymerization.  The latter technique
    (developed by Rebecca Braslau at UCSC and Craig Hawker at IBM) is
    revolutionary in that it effects polymerizations with extremely low
    polydispersities for a number of monomer families, including styrenes,
    acrylates, and acrylic acids. In this context, it will permit us to
    readily synthesize molecular bar codes with a much larger selection of
    interchangeable polymer blocks than could be achieved using available
    phosphoramidites.

    Molecular bar code technology is a necessary prelude to nanopore DNA
    sequencing that may have wide utility as a direct, sensitive, high speed
    detector of biological macromolecules that are important in medical
    diagnostics and food safety. We are looking for students
    to assist in the bioinformatic aspects of this project as well.
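
For students wondering what the signal-processing side of this project might involve, here is a minimal Python sketch of locating one abrupt level change in a current trace, such as the transition between two polymer blocks described above. It is only an illustration, not the group's analysis software: the function name, the window size, and the synthetic trace (made-up current levels with small deterministic "noise") are all assumptions for the example, and real data would need proper noise modeling.

```python
from statistics import mean

def find_step(trace, window=20):
    """Locate the largest abrupt level change in a 1-D current trace.

    Returns (index, jump), where jump is the mean of the `window` samples
    starting at index minus the mean of the `window` samples before it.
    """
    if len(trace) < 2 * window:
        raise ValueError("trace is shorter than two windows")
    best_index, best_jump = None, 0.0
    for i in range(window, len(trace) - window + 1):
        jump = mean(trace[i:i + window]) - mean(trace[i - window:i])
        if best_index is None or abs(jump) > abs(best_jump):
            best_index, best_jump = i, jump
    return best_index, best_jump

# Synthetic trace (hypothetical values in pA): a 10 pA drop at sample 300,
# plus a small deterministic ripple standing in for noise.
toy_trace = [(120.0 if i < 300 else 110.0) + 0.2 * ((i * 37) % 11 - 5)
             for i in range(600)]
step_index, step_jump = find_step(toy_trace, window=20)
print(step_index, round(step_jump, 2))
```

Averaging over windows trades time resolution for noise suppression, which is exactly the tension in reading shorter and shorter polymer blocks. Statistical change-point methods or hidden Markov models over current levels would be natural next steps.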
    

  10. Kevin Karplus (karplus@cse.ucsc.edu)

    As mentioned above, Kevin Karplus and Richard Hughey have developed the SAM-T98 system for finding remote protein homologies, and producing multiple alignments of protein families. I gave you a brief introduction to SAM-T98 in homework 2. A new version, SAM-T99, is being tested now. A good project would be to work with Kevin Karplus on the final testing and refinement of this method. The exciting part about this project is that this is the program that was so successful in the last two CASP contests.

    Other projects along these lines suggested by Kevin involve building some exciting new protein analysis tools. Here are some messages he has sent:

    One project I have is a part of the CASP protein-structure prediction project.
    It involves mainly existing programs, but probably requires some
    script writing or new programming.
    
    What I want is a program/script that takes a sequence (or multiple
    alignment generated by target99) and searches a subset of PDB for
    matching pieces.  The local alignments need to be converted to 3-d
    structures by extracting the corresponding atoms from the PDB files
    (that may be new code, though existing PDB parsers should be used),
    and running SCWRL to replace any sidechains that differ.
    
    The resulting partial models need to be converted to a format not yet
    determined, in order to be useful as input to the structure-prediction
    program.   The format will probably be close to an existing one that
    converts whole-chain PDB files, but with extra annotation to explain
    which parts of the chain are present.
    

    A second project idea from Kevin Karplus is the following. The details of it will become clear when you start working with Spencer Tu (stu@cse), the main author of this system.

    If someone is mainly interested in visualization work, we could use an
    HMM editor for Spencer Tu's general HMMs.
    
    The basic idea is to start with a graph editor, and modify or add
    operations specific to HMMs:
    	add edge (needs count or probability parameter filled in)
    	delete edge
    	add state (needs emission counts or probabilities filled in)
    	detach edge from null state
    		(replace edge x->y with edges from x to all
    		out-neighbors of y, updating counts as well as possible)
    	route edge through hub
    		(replace x->y with x->h and h->y, where h is an
    		existing null state)
    	retrain HMM on data set
    	
    	display emission probabilities with variant of Sequence Logo format
    	
    	edit emission or transition probabilities or counts
    	(edge counts can be well displayed by thickness proportional
    		to sqrt(count))
    	
    	Allow fat-state/fat-edge or detailed picture of HMM.
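
To give a feel for the data structure behind such an editor, here is a minimal Python sketch of a count-weighted HMM topology with a few of the operations listed in the message above. It is a toy under stated assumptions, not Spencer Tu's system: the class and method names are made up, null states are represented simply as states with no emission counts, and "updating counts as well as possible" is interpreted here as splitting the removed edge's count in proportion to the null state's outgoing counts.

```python
import math
from collections import defaultdict

class HMMGraph:
    """Toy HMM topology: states plus count-weighted directed edges."""

    def __init__(self):
        self.emissions = {}             # state -> emission counts (None for a null state)
        self.edges = defaultdict(dict)  # source -> {target: count}

    def add_state(self, name, emission_counts=None):
        self.emissions[name] = emission_counts

    def add_edge(self, x, y, count):
        if x not in self.emissions or y not in self.emissions:
            raise KeyError("both endpoints must be existing states")
        self.edges[x][y] = self.edges[x].get(y, 0.0) + count

    def delete_edge(self, x, y):
        """Remove x->y and return its count."""
        return self.edges[x].pop(y)

    def detach_from_null(self, x, y):
        """Replace x->y with edges from x to every out-neighbor of null state y.

        Assumption: the count on x->y is split in proportion to y's outgoing counts.
        """
        if self.emissions[y] is not None:
            raise ValueError("y must be a null state")
        out = dict(self.edges[y])
        total = sum(out.values())
        if total <= 0:
            raise ValueError("null state has no outgoing counts")
        count = self.delete_edge(x, y)
        for z, c in out.items():
            self.add_edge(x, z, count * c / total)

    def route_through_hub(self, x, y, h):
        """Replace x->y with x->h and h->y, where h is an existing null state."""
        if self.emissions[h] is not None:
            raise ValueError("h must be a null state")
        count = self.delete_edge(x, y)
        self.add_edge(x, h, count)
        self.add_edge(h, y, count)

    def edge_thickness(self, x, y, scale=1.0):
        """Display thickness proportional to sqrt(count)."""
        return scale * math.sqrt(self.edges[x][y])
```

A real editor would add the graphical layer, retraining, and Sequence Logo style emission displays on top of a structure like this; keeping the graph operations separate from the drawing code makes them easy to test on their own.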
    

  11. One project I would really like to see done is to link our protein analysis capabilities into intronerator. This would involve working with Jim Kent and Al Zahler to decide on what capabilities from SAM-T99 and related tools they feel should be available in conjunction with intronerator, and providing appropriate scripts that run these tools, which Jim could have intronerator call. You could either do some simple scripts, and then use them to explore some C. elegans proteins, or you could spend the whole quarter doing some more complex scripts.
  12. Barry Bowman (bowman@biology.ucsc.edu)

    Here is an excellent project you can do applying the SAM-T98/99 protein homology tools developed by Kevin Karplus and Richard Hughey, discussed above.

    The V-type and F-type ATPases are proton pumps which either make ATP or use
    ATP to generate proton gradients.  The two enzymes are distantly related
    and have some subunits that are obvious homologs.  However, both enzymes
    are very complex, being composed of at least 13 different kinds of
    subunits.  Sequence alignments show significant similarity for only 4 of
    the subunits (e.g. the A subunit of the V-ATPase is homologous to the beta
    subunit of the F-ATPase).  Several other subunits are likely to be true
    homologs but the sequences have diverged greatly.  I would predict that the
    secondary structure is probably still very similar.
    
    Specific questions that need to be addressed:  Which subunit in the
    V-ATPase is equivalent to the gamma subunit of the F-ATPase?  This is
    important because the gamma is the "rotor" in the ATPase motor.  Can the A
    and B subunits of the V-ATPase be fit to the 3D structure for the alpha and
    beta subunits of the F-ATPase?  Alignments can be made, and regions in
    which the subunits differ could be identified.
    
    We have identified a gene unique to filamentous fungi, named htl.  Can we
    get any information about the structure of the HTL protein or identify the
    family of proteins to which it might belong?
    

Questions regarding page content should be directed to haussler@cse.ucsc.edu.
Last modified Dec.11, 1999.
