UCSC BME 205 Fall 2008

Bioinformatics: models and algorithms

(Last Update: 17:11 PST 17 November 2008 )
This is a required course for bioinformatics students---both undergraduate and graduate students (pre-requisite to BME 220 and BME 230). Those who need permission codes to register can get them on the first day of class.

For catalog copy and pre-requisites, see the main page for BME205.

Who, When, and Where:

Instructor: Kevin Karplus ( karplus@soe.ucsc.edu) http://www.soe.ucsc.edu/~karplus
Office hours: PSB 318, W 10:30-11:30 (phone-only: W 10-10:30)
+1-831-459-4250
TA: None this year

Lectures: MWF 2-3:10 Engineering 2, room 506

The E2 506 room is not a normal classroom and is not particularly convenient for me or for the grad students, but is being used so that the class can be telecast to students in one of the national labs. This is an experiment this year, to see how well the technology works and whether the ability to offer the course remotely provides enough benefit to offset the greater inconvenience for those at UCSC.

Instructions for remote viewing are not complete yet, but this link provides what information I have for accessing the lectures remotely. Although the recorded lectures will be available to local students also, I would greatly prefer that all students able to do so attend in person. I will not delay the class just because the remote-viewing hardware or software is not working.

Texts

There will be two required texts, plus additional readings that will be distributed either on paper or via the Web:
Programming Perl
Larry Wall, Tom Christiansen, and Jon Orwant
latest edition
O'Reilly and Associates
Considered the best single book on PERL—this is the main reference work on the language, and every PERL programmer should have a copy of it handy. You may use other PERL tutorials or references, but I expect you to have easy access to this one. We will be covering just the basics of PERL, not open-source packages like BioPerl, which you may wish to learn on your own.

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids from Cambridge University Press by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.
This book is a tutorial introduction to the use of hidden Markov models and other probabilistic models for sequence analysis problems in computational molecular biology, but is aimed mainly at a graduate-student audience. We've been using it for years in this class, and have not yet found as detailed a text.

This is a text and reference book that every bioinformatics programmer should have. I don't follow the book very closely, so you will have to figure out for yourself when it is appropriate to read various sections.

Get the second edition, if you can, which has made corrections as indicated on the errata page.

Darling models
I added some assignments in 2003 to build physical models of peptides (and DNA base pairs) using the Darling model kits. These kits are available over the web at http://www.darlingmodels.com/ I recommend getting the "Biochemistry" kit, though the cheaper "protein alpha helix--pleated sheet" kit may suffice. I have found these kits to give me a much better insight into protein flexibility and rigidity than the standard ball-and-stick models used in organic chemistry classes, and they are fun to play with. To reduce costs, it is quite reasonable for students to share a kit.

Note: the science library now has Darling model kits that you can check out! Also, several of the more advanced graduate students may be willing to lend their model kits.

Some initial instructions for building a protein backbone with this model kit are available.

An Introduction to Bioinformatics Algorithms
Neil Jones and Pavel Pevzner
MIT Press
This is book came out in summer 2004. It looks like it may be a valuable supplementary text, as it seems to be easier to read and at a slightly less advanced level than the Durbin et al. book.

Reading assignments

Reading assignments in Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.

Date Have read these sections
1 Oct 2008 1.1–1.4
8 Oct 2008 11.1–11.6
15 Oct 2008 2.1–2.9
22 Oct 2008 3.1–3.6
29 Oct 2008 5.1–5.8
5 Nov 2008 6.1–6.5
12 Nov 2008 7.1–7.6

Homework

Date to be releasedAssignmentDate Due
25 Sept 2008prereq survey plain-text version for those who need to submit by e-mail 26 Sept 2008
25 Sept 2008 perl1 (FASTA parsing) 10 Oct 2008
28 Sept 2008 fellowship application 17 Oct 2008
6 Oct 2008 perl2 (Markov chains) 24 Oct 2008
15 Oct 2008 Darling models 31 Oct 2008
16 Oct 2008 perl3 (New assignment--cassette mutagenesis) 7 Nov 2008
30 Oct 2008 web and literature search 14 Nov 2008
10 Nov 2008 perl4 (affine-gap alignment) 21 Nov 2008
17 Nov 2008 perl5 (finding palindromes)1 Dec 2008
20 Nov 2008 perl6 (null models) 7 Dec 2008

Every student in the class will need a School of Engineering computer account. I will want assignments turned in by providing me with a publically readable file (PDF for written assignments) or directory (for multi-file assignments) containing the assignment on the SoE machines. All perl programs must execute correctly on the SoE machines, without needing to install additional Perl modules. I would prefer to get paper copies of assignments in addition to the electronic ones (to save me the time of printing them), but I will accept electronic-only submissions from those who are taking the class remotely or are too ill to attend class.

I will not accept attachments to e-mail.

To get an SoE computer account see http://www.soe.ucsc.edu/administration/computing-support#NewUserAccounts. The LANL students will need to find out from the SoE Tech Support how to get their forms submitted, since the "drop box" is obviously not convenient for them.

Evaluation

There will be four types of assignments for the class:

As has been my practice since Fall 2001, there will be no exams, and we will not meet during the final exam period (Thurs 11 Dec 2008, 8-11 a.m.) It turns out to be very difficult to make up small enough problems for examination—almost all the homework exercises are much larger problems than could reasonably be given on a timed exam.

The assignments will be distributed on the web.

The relative weights of the different types of assignment in the evaluation has not been determined yet—it should be roughly proportional to how much time the different assignments take to do well. I will try to assign points to each assignment as it is given, but the total number of points won't be known until I've tweaked all the assignments. I expect that most of the assignments will be similar to ones given in previous years, with a few parts tweaked to update them, but I may replace one or more assignments with new ones, if I can think of new problems at the appropriate level of difficulty.

Academic Integrity

Anyone caught cheating in the class will be reported to their college provost (see UCSC policy on academic integrity) and may fail the class. Cheating includes any attempt to claim someone else's work as your own. Plagiarism in any form (including close paraphrasing) will be considered cheating. Use of any source without proper citation will be considered cheating. If you are not certain about citation standards, please ask, as I hate having to fail students because they were improperly taught how to cite sources.

Collaboration without explicit written acknowledgment will be considered cheating. Collaboration on lab assignments with explicit written acknowledgment is encouraged—guidelines for the extent of reasonable collaboration will be given in class.

Classroom Accommodations for Disabilities

If you qualify for classroom accommodations because of a disability, please submit your Accommodation Authorization from the Disability Resource Center (DRC) to me during my office hours in a timely manner, preferably within the first two weeks of the quarter. Contact DRC at 459-2089 (voice), 459-4806 (TTY).

Rough list of topics we'll probably cover (not necessarily in order)

Note: The schedule will be updated throughout the quarter to reflect what really happens.
  1. Quick review of the fundamental dogma of biology: DNA->RNA->protein, bases, codons, amino acids
    (3-4 hours)
  2. Stochastic models, Bayes Rule, 0-order Markov chain, first-order Markov chain, length model versus stop character for finite strings, use of log-probability for computations, adding probabilities in log-prob representation (efficient computation of log(exp(x)+exp(y)) ). (1.5 hour)
  3. Constructing a model from data. Training, cross-training, and testing. Maximum-likelihood estimate. Pseudocounts to get mean posterior estimate. (1.5 hours)
  4. Converting arbitrary scores to stochastic models: P-value and E-value. Brief discussion of Z-scores (Gaussian dist.) and fat tails of extreme-value (Gumbel dist.) (1.5 hour)
  5. Entropy, relative entropy, Mutual information, sequence logos. (1.5 hour)
  6. What fellowship reviewers look for. Relationship between relative entropy and difference in encoding cost in a train/test framework (clarification for homework exercise). Interpreting classification results: true/false positives, specificity, sensitivity, ROC curves, ROC_n numbers What is a substitution matrix? (1.5 hour)
  7. Substitution matrices and sequence alignment scores. Aligning sequences to sequences, dynamic programming We'll do the the simple, but inefficient algorithm (for arbitrary gap costs) first. (1 hour: Blosum substitution matrices and gapless scoring) (1 hour: the alignment problem and global dynamic programming with arbitrary gap costs) (1 hour: global dynamic programming with linear gap costs, traceback) (1 hour: affine gap costs. Global and local dynamic programming)
  8. Introduction to Hidden Markov models (1.5 hour on HMMs and profiles) (1.5 hours on profile HMMs giving Viterbi algorithm and forward-backward)
    See powerpoint slides by Rachel Karchin (not used in class this year).
  9. Dirichlet Mixtures (1.5 hours) See http://www.soe.ucsc.edu/research/compbio/dirichlets/dirichlet-papers.html for papers and http://www.soe.ucsc.edu/research/compbio/dirichlets/ for general information about Dirichlet mixtures.
  10. Halloween (31 Oct 2008) Guest Lecture. Science librarians will give a presentation on bioinformatics resources available through the library, as well as talking about some of the challenges that face the UCSC library in building an adequate collection in new fields like bioinformatics.

    Bring a laptop to class 31 Oct.

  11. Protein secondary structure (DSSP and STRIDE), in order to explain second track of 2-track HMM. Discuss secondary structure prediction using neural nets. (1.5 hours)
  12. Sequence weighting (Henikoff's technique for relative weighting and target bit savings for total weight) (1 hour) Multiple alignment techniques Overview and progressive alignment (0.5 hour)
  13. Multiple alignment techniques: Muscle, Probcons, AMAP

    documentation on MUSCLE:
    http://www.drive5.com/muscle/docs.htm Refereed paper: Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), 1792-97.

    PROBCONS web site (including overview of algorithm): http://probcons.stanford.edu

    AMAP http://bio.math.berkeley.edu/amap
    Ariel S. Schwartz and Lior Pachter
    Multiple alignment by sequence annealing
    Bioinformatics 2007 23(2):e24-e29; doi:10.1093/bioinformatics/btl311

    Oher multiple alignment programs:
    paper on T-coffee:
    T-Coffee: A novel method for fast and accurate multiple sequence alignment.
    Notredame C, Higgins DG, Heringa J.
    J Mol Biol 2000 Sep 8;302(1):205-17
    doi:10.1006/jmbi.2000.4042

    paper on MAFFT:
    Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30(14):3059-3066, 2002.

  14. Phylogeny: brief mention of maximum-likelihood and parsimony. Additivity assumption. UPGMA algorithm presented, ultrametric assumption and molecular clocks, intro to neighbor-joining (no proofs) (1.5 hour)
  15. RNA structure and Stochastic Context-Free Grammars (1.5 hour)
  16. A protocol for evaluating local structure alphabets. This talk ( http://www.soe.ucsc.edu/~karplus/papers/local-structure-germany02.pdf) presented some of the main results from Rachel Karchin's PhD thesis. The following two papers are the guts of the method:

    Rachel Karchin, Melissa Cline, Yael Mandel-Gutfreund, and Kevin Karplus. Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins: Structure, Function, and Genetics, 51(4):504--514, June 2003. doi:10.1002/prot.10369 reprint

    Rachel Karchin, Melissa Cline, and Kevin Karplus. Evaluation of local structure alphabets based on residue burial. Proteins: Structure, Function, and Genetics, 55(3):508--518, 5 March 2004. doi:10.1002/prot.20008 reprint

Tpoics for last week requested by students

The following requests are quotes from student email:

Rough list of topics we didn't have enough time to do more than briefly mention last year:

Other resources on the web

http://genome.ucsc.edu/
UCSC Genome Browser - gateway to over 27 complete genome sequences
http://archaea.ucsc.edu/
UCSC Archaea Browser - gateway to all archeal and many prokaryotic genome sequences
http://genome-test.cse.ucsc.edu/eng/
Getting Started on the UCSC Genome Project Team
User's Guide to the Human Genome (in Nature Genetics).


slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
Biomolecular Engineering Dept.
BME 205 home page Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building