UCSC BME 100 Fall 2003

Intro to Bioinformatics

(Last Update: 06:10 PST 15 December 2003 )
This is a required course for bioinformatics students---both undergraduate and graduate students (pre-requisite to BME 220 and BME 230).

For catalog copy and pre-requisites, see the main page for BME100.

Who, When, and Where:

Instructor: Kevin Karplus ( karplus@soe.ucsc.edu) http://www.soe.ucsc.edu/~karplus
Office hours: Mondays 11-1 315B Baskin Engineering (unless BME 200 and 280B move into that timeslot).
TA: George Shackelford ( ggshack@soe.ucsc.edu)
Office hours: meet him in the lab section, in Jack's Lounge on Friday's from 3:30pm to 4:30pm, or arrange an appointment.
Don't forget use the news group ucsc.class.bme100 to ask questions!

Lectures: MW 5-6:45 p.m. Eight Acad 250

One lab section a week is required:
W 9:30-11 a.m. Baskin Engineering 105

You must register for the lab and the course together---neither can be taken without the other.

Although you must register for the lab, attendance at lab sections is optional---it will be a time when the TA will be in the lab to help out with Perl questions, with bioinformatics tools on the web, with debugging, and with general help with the homework assignments.

Texts

There will be two required texts, plus additional readings that will be distributed either on paper or via the Web:
Programming Perl
Larry Wall, Tom Christiansen & Jon Orwant
latest edition
O'Reilly and Associates
Considered the best single book on PERL---this is the main reference work on the language, and every PERL programmer should have a copy of it handy. You may use other PERL tutorials or references, but I expect you to have easy access to this one. We will be covering just the basics of PERL, not open-source packages like BioPerl, which you may wish to learn on your own.

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids from Cambridge University Press by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.
This book is a tutorial introduction to the use of hidden Markov models and other probabilistic models for sequence analysis problems in computational molecular biology, but is aimed mainly at a gradauate-student audience. We've been using it for years in the the graduate courses, and used it successfully last year in BME 100. This is a text and reference book that every bioinformatics programmer should have.

(Be sure to look at the errata page.)

Darling models
I added some assignments last year to build physical models of peptides (and DNA base pairs) using the Darling model kits. These kits are available over the web at http://www.darlingmodels.com/ I recommend getting the "Biochemistry" kit, though the cheaper "protein alpha helix--pleated sheet" kit may suffice. I have found these kits to give me a much better insight into protein flexibility and rigidity than the standard ball-and-stick models used in organic chemistry classes, and they are fun to play with. To reduce costs, it is quite reasonable for students to share a kit.

Some initial instructions for building a protein backbone with this model kit are now available.

Evaluation

There will be two types of assignments for the course and two types for the lab. The course will have reading assignments and pencil-and-paper exercises; the lab will have programming exercises to learn PERL and bioinformatics exercises using real data. The same grade (and evaluation) will be given for both the course and the lab.

Based on the first running of the course in Fall 2001, there will be no exams. It turns out to be very difficult to make up small enough problems for examination---almost all the homework exercises are much larger problems than could reasonably be given on a timed exam.

The assignments will be distributed on the web (see http://www.soe.ucsc.edu/~karplus/bme100/f03/homework.html).

The relative weights of the different types of assignment in the evaluation has not been determined yet---it should be roughly proportional to how much time the different assignments take to do well. We will try to assign points to each assignment as it is given, but the total number of points won't be known until we've created all the assignments.

Academic Integrity

Anyone caught cheating in the class will be reported to their college provost (see UCSC policy on academic integrity) and may fail the class. Cheating includes any attempt to claim someone else's work as your own. Plagiarism in any form (including close paraphrasing) will be considered cheating. Use of any source without proper citation will be considered cheating.

Collaboration without explicit written acknowledgement will be considered cheating. Collaboration on lab assignments with explicit written acknowledgement is encouraged---guidelines for the extent of reasonable collaboration will be given in class.

Rough list of topics we'll probably cover (not necessarily in order)

Note: list is being updated throughout the quarter to reflect what really happens.
  1. Quick review of the fundamental dogma of biology: DNA->RNA->protein, bases, codons, amino acids
    (3-4 hours)
  2. Stochastic models, Bayes Rule, 0-order Markov chain, first-order Markov chain, length model versus stop character for finite strings, use of log-probability for computations, adding probabilities in log-prob representation (efficient computation of log(exp(x)+exp(y)) ). (1.5 hour)
  3. Constructing a model from data. Training, cross-training, and testing. Maximum-likelihood estimate. Pseudocounts to get mean posterior estimate. (1.5 hours)
  4. Converting abitrary scores to stochastic models: P-value and E-value. Brief discussion of Z-scores (Gaussian dist.) and fat tails of extreme-value (Gumbel dist.) (1.5 hour)
  5. Entropy, relative entropy, Mutual information, sequence logos. (1.5 hour)
  6. What fellowship reviewers look for. Relationship between relative entropy and difference in encoding cost in a train/test framework (clarification for homework exercise). Interpreting classification results: true/false positives, specificity, sensitivity, ROC curves, ROC_n numbers What is a substitution matrix? (1.5 hour)
  7. Substitution matrices and sequence alignment scores. Aligning sequences to sequences, dynamic programming We'll do the the simple, but inefficient algorithm (for aribtrary gap costs) first. (1 hour: Blosum substitution matrices and gapless scoring) (1 hour: the alignment problem and global dynamic programming with arbitrary gap costs) (1 hour: global dynamic programming with linear gap costs, traceback) (1 hour: affine gap costs. Global and local dynamic programming)
  8. Introduction to Hidden Markov models (1.5 hour on HMMs and profiles) (1.5 hours on profile HMMs giving Viterbi algorithm and forward-backward)
    (Could have been based on Rachel Karchin's lecture (powerpoint slides), but wasn't.)
  9. Dirichlet Mixtures (1.5 hours) See http://www.soe.ucsc.edu/research/compbio/dirichlets/dirichlet-papers.html for papers and http://www.soe.ucsc.edu/research/compbio/dirichlets/ for general information about Dirichlet mixtures.
  10. Guest Lecture Wed Nov 12 in the Science Library. Christy Hightower and Katherine Soehner will give a presentation on bioinformatics resources available through the library, as well as talking about some of the challenges that face the UCSC library in building an adequate collection in new fields like bioinformatics.
  11. Protein secondary structure (DSSP and STRIDE), in order to explain second track of 2-track HMM. Discuss secondary structure prediction using neural nets. (1.5 hours)
  12. Sequence weighting (Henikoff's technique for relative weighting and target bit savings for total weight) (1 hour) Multiple alignment techniques Overview and progressive alignment (0.5 hour)
  13. Multiple alignment techniques T-Coffee and MAFFT (1.5 hour)
    paper on T-coffee:
    T-Coffee: A novel method for fast and accurate multiple sequence alignment.
    Notredame C, Higgins DG, Heringa J.
    J Mol Biol 2000 Sep 8;302(1):205-17

    paper on MAFFT:
    Kazutaka Katoh, Kazuharu Misawa1, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30(14):3059-3066, 2002.

  14. Phylogeny: brief mention of maximum-likelihood and parsimony. Additivity assumption. UPGMA algorithm presented, ultrametric assumption and molecular clocks, intro to neighbor-joining (no proofs) (1.5 hour)
  15. RNA structure and Stochastic Context-Free Grammars (1.5 hour)
  16. A protocol for evaluating local structure alphabets. This talk ( http://www.soe.ucsc.edu/~karplus/papers/local-structure-germany02.pdf) presented some of the main results from Rachel Karchin's PhD thesis.

Rough list of topics we didn't have enough time to do more than briefly mention last year:

Other resources on the web

Handouts for Rune Lygsoe's summer 2002 course on bioinformatics
User's Guide to the Human Genome (in nature genetics).



slug icon to go to School of Engineering home page SoE home     UCSC Bioinformatics Home Page     BME 100 home page    

Questions about page content should be directed to

Kevin Karplus
Computer Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250