Homework
Evaluation
There will be four types of assignments for the class:
- reading assignments,
- pencil-and-paper exercises,
- programming exercises to learn PERL, and
- bioinformatics exercises using real data.
Based on the first running of the course in Fall 2001, there will
be no exams, and we will not meet during the final exam period (Wed 12
Dec 2007, 4-7 p.m.). It turns out to be very difficult to make up
problems small enough for an examination: almost all the homework
exercises are much larger problems than could reasonably be given on a
timed exam.
The assignments will be distributed on the web.
The relative weights of the different types of
assignment in the evaluation have not been determined yet; they should
be roughly proportional to how much time the different assignments
take to do well. I will try to assign points to each assignment as
it is given, but the total number of points won't be known until I've
created all the assignments.
Academic Integrity
Anyone caught cheating in the class will be reported to their college
provost (see UCSC
policy on academic integrity) and may fail the class.
Cheating includes any attempt to claim someone else's work as your
own.
Plagiarism in any form (including close paraphrasing) will be
considered cheating.
Use of any source without proper citation will be considered cheating.
If you are not certain about citation standards, please ask, as I hate
having to fail students because they were improperly taught how to
cite sources.
Collaboration without explicit written acknowledgement will be
considered cheating.
Collaboration on lab assignments with explicit written acknowledgement
is encouraged—guidelines for the extent of reasonable collaboration
will be given in class.
Classroom Accommodations for Disabilities
If you qualify for classroom accommodations because of a disability,
please submit your Accommodation Authorization from the Disability
Resource Center (DRC) to me during my office hours in a timely manner,
preferably within the first two weeks of the quarter. Contact DRC at
459-2089 (voice), 459-4806 (TTY).
Rough list of topics we'll probably cover (not necessarily in order)
Note: The schedule will be updated throughout the quarter to reflect what
really happens.
- MONDAY 8 Oct 07
- Not an official class meeting, but there is an NSF
fellowship proposal-writing workshop at
4:00-6:00pm in the Muwekma Ohlone Conference Room C, Bay Tree Conference Center.
Since you will have to write a fellowship or grant proposal of some
sort this quarter, the workshop may well be worth your time.
- Quick review of the central dogma of molecular biology:
DNA->RNA->protein, bases, codons, amino acids
(3-4 hours)
- Stochastic models, Bayes Rule, 0-order Markov chain,
first-order Markov chain, length model versus stop character for
finite strings, use of log-probability for computations,
adding probabilities in log-prob representation (efficient
computation of log(exp(x)+exp(y)) ).
(1.5 hours)
- Constructing a model from data. Training, cross-training, and testing.
Maximum-likelihood estimate. Pseudocounts to get mean posterior estimate.
(1.5 hours)
- Converting arbitrary scores to stochastic models: P-value and E-value.
Brief discussion of Z-scores (Gaussian dist.) and fat tails of
extreme-value (Gumbel dist.)
(1.5 hours)
- Entropy, relative entropy, mutual information, sequence logos.
(1.5 hours)
- What fellowship reviewers look for.
Relationship between relative entropy and difference in encoding
cost in a train/test framework (clarification for homework exercise).
Interpreting classification results: true/false positives,
specificity, sensitivity, ROC curves, ROC_n numbers.
What is a substitution matrix?
(1.5 hours)
- Substitution matrices and sequence alignment scores.
Aligning sequences to sequences, dynamic programming.
We'll do the simple but inefficient algorithm (for
arbitrary gap costs) first.
(1 hour: Blosum substitution matrices and gapless scoring)
(1 hour: the alignment problem and global dynamic programming with
arbitrary gap costs)
(1 hour: global dynamic programming with linear gap costs,
traceback)
(1 hour: affine gap costs. Global and local dynamic programming)
- Introduction to Hidden Markov models
(1.5 hours on HMMs and profiles)
(1.5 hours on profile HMMs giving Viterbi algorithm and
forward-backward)
See powerpoint slides by Rachel Karchin (not used in class
this year).
- Dirichlet Mixtures (1.5 hours)
See
http://www.soe.ucsc.edu/research/compbio/dirichlets/dirichlet-papers.html
for papers and http://www.soe.ucsc.edu/research/compbio/dirichlets/
for general information about Dirichlet mixtures.
- Halloween, Oct 31. Guest Lecture in the Science Library. Science librarians
will give a presentation on
bioinformatics resources available through the library, as well as
talking about some of the challenges that face the UCSC library in
building an adequate collection in new fields like bioinformatics.
- Protein secondary structure (DSSP and STRIDE), in order to
explain the second track of a 2-track HMM.
Discuss secondary structure prediction using neural nets.
(1.5 hours)
- Sequence weighting (Henikoff's technique for relative
weighting and target bit savings for total weight)
(1 hour)
Multiple alignment techniques
Overview and progressive alignment (0.5 hour)
- Multiple alignment techniques
Muscle and Probcons (I may choose other modern multiple aligners
this year)
documentation on MUSCLE:
http://www.drive5.com/muscle/docs.htm
Refereed paper:
Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with
high accuracy and high throughput, Nucleic Acids Research 32(5),
1792-97.
PROBCONS web site (including overview of algorithm):
http://probcons.stanford.edu
Other multiple alignment programs:
paper on T-coffee:
T-Coffee: A novel method for fast and accurate multiple sequence alignment.
Notredame C, Higgins DG, Heringa J.
J Mol Biol 2000 Sep 8;302(1):205-17
paper on MAFFT:
Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata.
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
Nucleic Acids Research 30(14):3059-3066, 2002.
- Phylogeny: brief mention of maximum-likelihood and parsimony.
Additivity assumption.
UPGMA algorithm presented, ultrametric assumption and molecular
clocks, intro to neighbor-joining (no proofs)
(1.5 hours)
- RNA structure and Stochastic Context-Free Grammars
(1.5 hours)
- A protocol for evaluating local structure alphabets.
This talk
(http://www.soe.ucsc.edu/~karplus/papers/local-structure-germany02.pdf)
presented some of the main results from Rachel Karchin's PhD thesis.
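The stochastic-models topic in the list above mentions adding probabilities in log-prob representation, that is, computing log(exp(x)+exp(y)) efficiently. A minimal sketch of one common way to do this follows. Python is used only for illustration (the course exercises use Perl), and the function name `log_add` is an assumption made for this example, not something taken from the course materials. The idea is to factor out the larger argument so that the exponential never overflows.

```python
import math


def log_add(x: float, y: float) -> float:
    """Return log(exp(x) + exp(y)) for natural-log probabilities x and y.

    Uses the identity log(e^x + e^y) = hi + log(1 + e^(lo - hi)),
    where hi is the larger of the two arguments, so exp() is only
    ever called on a non-positive number.
    """
    # A log-probability of -infinity represents probability zero.
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    hi, lo = (x, y) if x >= y else (y, x)
    # log1p(z) computes log(1 + z) accurately when z is tiny.
    return hi + math.log1p(math.exp(lo - hi))


if __name__ == "__main__":
    # 0.25 + 0.5 = 0.75, computed entirely in log space.
    print(math.exp(log_add(math.log(0.25), math.log(0.5))))
    # A direct exp(-1000.0) would underflow to zero; this does not.
    print(log_add(-1000.0, -1001.0))
```

When the two arguments differ greatly, the correction term is close to zero, which is why some implementations skip it (or use a lookup table) once the difference exceeds a threshold.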
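The model-construction topic in the list above contrasts the maximum-likelihood estimate with pseudocounts that give a mean posterior estimate. The sketch below illustrates that contrast for a 0-order model over a fixed alphabet. It assumes the same pseudocount for every symbol (a symmetric Dirichlet prior); the function name `estimate_probs` and its parameters are hypothetical choices for this example, and Python stands in for the Perl used in the course exercises.

```python
from collections import Counter


def estimate_probs(observed, alphabet, pseudocount=1.0):
    """Estimate symbol probabilities from observed symbols.

    With pseudocount=0 this is the maximum-likelihood estimate
    (count / total). With pseudocount=z > 0 it is the posterior
    mean under a symmetric Dirichlet prior with parameter z:
    (count + z) / (total + z * alphabet_size).
    """
    counts = Counter(observed)
    # Only symbols in the alphabet contribute to the total.
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


if __name__ == "__main__":
    sample = "AACG"
    # Maximum likelihood assigns probability 0 to the unseen base T.
    print(estimate_probs(sample, "ACGT", pseudocount=0.0))
    # Adding one pseudocount per base keeps every probability positive.
    print(estimate_probs(sample, "ACGT", pseudocount=1.0))
```

The practical difference shows up on test data: a zero probability for an unseen symbol gives an infinite encoding cost, while the pseudocount estimate does not.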
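The sequence-alignment topic in the list above includes global dynamic programming with linear gap costs and traceback. As a rough illustration, here is a small Python sketch of that kind of algorithm (commonly called Needleman-Wunsch). The scoring values (match +1, mismatch -1, gap -2) and the function name `global_align` are assumptions chosen for the example; in class the substitution scores would come from a matrix such as BLOSUM, and the exercises use Perl.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap cost.

    Returns (score, aligned_a, aligned_b), using '-' for gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned entirely to gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
    # Traceback: walk from the bottom-right corner back to the origin,
    # at each cell choosing a move that could have produced its score.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, top, bottom = global_align("ACGT", "ACT")
    print(s)
    print(top)
    print(bottom)
```

This version takes time and memory proportional to the product of the two sequence lengths. Affine gap costs, also on the topic list, need extra state per cell and are not covered by this sketch.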
Rough list of topics we didn't have enough time to do more than
briefly mention last year:
- Contact order and folding rate.
In 2001, I handed out a paper on contact order:
Contact order, transition state placement and the refolding rates of single domain proteins.
Plaxco KW, Simons KT, Baker D.
J Mol Biol 1998 Apr 10;277(4):985-94
- Phylogenetic analysis
- DNA microarrays and expression data
- Gene finding
- Proteomics
- RNA structure
- DNA assembly
- Fast methods for searching (BLAST and BLAT).
(In 2001, Jim Kent gave an excellent lecture on these.)
- RNA genes, DNA microarrays, computational and
functional genomics
Look at the BME 210 course, the BME 230 course, and the lab pages for
Todd Lowe and Josh Stuart.
- Combining secondary structure, fold-recognition, and
new-fold methods for protein structure prediction.
Using the transparencies given at Schloss Dagstuhl.
I could have handed out a book chapter on SAM-T2K, but didn't.
Other resources on the web
- http://genome.ucsc.edu/
- UCSC Genome Browser - gateway to over 27 complete
genome sequences
- http://archaea.ucsc.edu/
- UCSC Archaea Browser - gateway to all archaeal and many prokaryotic
genome sequences
- http://genome-test.cse.ucsc.edu/eng/
- Getting Started on the UCSC Genome Project Team
- User's Guide to the Human Genome (in Nature Genetics).
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building