CMP 243 Homework 3

Due: Wed. Oct. 23

Dirichlet and Dirichlet mixture priors, sequence weighting, database search with BLOCKS profiles.

First read sections 1,2, and 5 of the tutorial paper by Leslie Grate, Richard Hughey, Kevin Karplus and Kimmen Sjolander. This is a tuturial on hidden Markov models, which we will get to later in the course. However, these section discuss more general issues that apply to the current and past topics we have covered in this class.

Then you should make sure you have carefully read up to page 19 in chapter one of the text. Note that there is a typo in line 13 of page 19: the denominator of this fraction should be Z(alpha) M(n). The function M(n) is defined in equation 1.5.

This homework has 2 parts. The first part consists of a few numerical problems to solve. The second part is www/experimental.


PART 1.

  1. This is a variant of the example on page 8 of the text, where they calculate the probability of getting five sixes and five ones when you roll a fair die ten times. Suppose that all of the amino acids are equally likely (which is of course never true, but it makes the calculation easier). Suppose you observe 5 amino acids selected independently at random. Then what is the probability of seeing 2 leucines, 2 isoleucines and 1 valine? (Remember: as in the example in the book, the order does not matter, so you must use the multinomial distribution here.)
  2. As in section 1.6 of the text, suppose that you count the number of amino acids of each type in a particular column of a multiple alignment. For each i between 1 and 20, let n_i be the number of amino acids of type i. Suppose that in the order of the amino acids, valine is first leucine is second, and isoleucine is third. You see 10 valines, 1 leucine, 3 isoleucines and no other amino acids. So n_1 = 10, n_2 = 1, n_3 = 3, and n_i = 0 for all other i. For each i between 1 and 20, from n_i you want to estimate the probability theta_i of amino acid i for this column. Suppose you use a Dirichlet distribution with parameters alpha_1, ..., alpha_20 as a prior, and compute the posterior mean estimator for each theta_i. The formula is given is section 1.6 of the text. Consider two cases. In the first case, alpha_1 = alpha_2 = ... = alpha_20 = 1. In the second case alpha_1 = alpha_2 = ... = alpha_20 = 100. In each of these cases, what is the posterior mean estimator for each theta_i?

PART 2. (Thanks to Melissa Cline)

NOTE: SOME BUGS WERE FOUND IN THIS PART OF THE HOMEWORK. See fixes.

Introduction:

In the last assignment, you searched the BLOCKS database with Target 37. In this assignment we will investigate different techniques for building blocks, and how each technique affects the search performance. In this assignment, we have built four different BLOCKS databases for you using different Dirichlet regularizers and sequence weights. You will search these databases for a number of target sequences, and observe how sequence weights and the choice of regularizer affect the sensitivity of the search.

Instructions on using the BLOCKS profiles

Note: if you are not a CSE student, and have not gotten access to the barnyard machines, you will need to do so before performing these steps.
Make yourself a local working directory

Move to your working directory and execute the following commands:
	cp /projects/compbio/class/cmp243/configfile .
	cp /projects/compbio/class/cmp243/T*.seq .
	cp /projects/compbio/class/cmp243/select.pl .
	cp /projects/compbio/class/cmp243/run_search.csh .
	ln -s /projects/compbio/class/cmp243/blocks.whole
	
Now you are ready to begin searching the BLOCKS database.  There are
four sets of BLOCKS profiles assembled for you:	
	/projects/compbio/class/cmp243/blocks.1.1comp
	/projects/compbio/class/cmp243/blocks.1.9comp
	/projects/compbio/class/cmp243/blocks.num_seq.1comp
	/projects/compbio/class/cmp243/blocks.num_seq.9comp
and you will search each of these in turn.  

A profile is essentially a multiple alignment of a set of sequences converted into a sequence of probability distributions, one for each column or position in the alignment, as discussed in class. The naming convention indicates characteristics of how the profiles were built. 1comp indicates that the profiles were built by computing a posterior mean estimate, using a Dirichlet density as a prior; 9comp indicates that a nine component Dirichlet mixture was used as a prior. The other part of the name refers to the sequence weighting convention that was used. All blocks were built using the relative sequence weights given in the BLOCKS database. However, these weights are scaled to sum to 1 in one case, the ".1" case, and they are scaled so that they sum to the total number of sequences in the other case, the "num_seq" case.

By default, you will be analyzing three of the CASP2 target sequences: T0006, T0021, and T0037. If there are additional proteins you'd like to analyze, then create a file for each of your proteins using the files following the format of the files T*.seq.

If you're not already logged in on an alpha (such as moo, oink, or quack), at this point you should log into one and move to your working directory.

To specify the profile to search, edit the file configfile. In the line following "BLOCKS:", specify the first profile to search.

To run the search, execute the command:
	run_search.csh  >  &
where  specifies a file containing an input sequence and 
 specifies the name an output file.  A sample command line is:
	run_search.csh T0006.seq >blocks.1.1comp.out &

After approximately five minutes (give or take system load), the search will be done. Then, take a look in the output file to see the hits. For each hit, the output file lists the relevant target protein, accession number of the block, positions hit, encoding cost, and other information. Repeat with all four profiles and at least all three target sequences.

If any of hit captures your attention and you'd like to investigate further, 
mark down the accession number and run the following command:
	select.pl  blocks.whole
For example:
	select.pl BL00450D blocks.whole
This will print such information on the block as its family and the
subsequences belonging to that block.

Do this with at least 2 hits. Describe what you find.

Repeat the above with some of the first 10 putative proteins from Methanococcus janaschii genome. You can pick at random the ones from these ten you want to analyse, or you can do all 10. You might do a BLAST search on these as well. Describe what you find.

The whole list of putative genes from Methanococcus janaschii can be found in /projects/compbio/data/jannaschii/GMJ-pep.seqs. The first 10 of them are

>MJ0001 aspartate aminotransferase      SP:P14909
MISSRCKNIKPSAIREIFNLATSDCINLGIGEPDFDTPKHIIEAAKRALDEGKTHYSPNN
GIPELREEISNKLKDDYNLDVDKDNIIVTCGASEALMLSIMTLIDRGDEVLIPNPSFVSY
FSLTEFAEGKIKNIDLDENFNIDLEKVKESITKKTKLIIFNSPSNPTGKVYDKETIKGLA
EIAEDYNLIIVSDEVYDKIIYDKKHYSPMQFTDRCILINGFSKTYAMTGWRIGYLAVSDE
LNKELDLINNMIKIHQYSFACATTFAQYGALAALRGSQKCVEDMVREFKMRRDLIYNGLK
DIFKVNKPDGAFYIFPDVSEYGDGVEVAKKLIENKVLCVPGVAFGENGANYIRFSYATKY
EDIEKALGIIKEIFE
>MJ0002
MEIFMEVPIFVVISGSDLYGIPNPSDVDIRGAHILDRELFIKNCLYKSKEEEVINKMFGK
CDFVSFELGKFLRELLKPNANFIEIALSDKVLYSSKYHEDVKGIAYNCICKKLYHHWKGF
AKPLQKLCEKESYNNPKTLLYILRAYYQGILCLESGEFKSDFSSFRCLDCYDEDIVSYLF
ECKVNKKPVDESYKKKIKSYFYELGVLLDESYKNSNLIDEPSETAKIKAIELYKKLYFED
VRE
>MJ0003
MKGKRIAIVSHRILNQNSVVNGLERAEGAFNEVVEILLKNNYGIIQLPCPELIYLGIDRE
GKTKEEYDTKEYRELCKKLLEPIIKYLQEYKKDNYKFILIGIENSTTCDIFKNRGILMEE
FFKEVEKLNIIIKAIEYPKNEKDYNKFVKTLEKMIK
>MJ0004 activator of (R)-2-hydroxyglutaryl-CoA dehydratase      PIR:S36105
MILGIDVGSTTTKMVLMEDSKIIWYKIEDIGVVIEEDILLKMVKEIEQKYPIDKIVATGY
GRHKVSFADKIVPEVIALGKGANYFFNEADGVIDIGGQDTKVLKIDKNGKVVDFILSDKC
AAGTGKFLEKALDILKIDKNEINKYKSDNIAKISSMCAVFAESEIISLLSKKVPKEGILM
GVYESIINRVIPMTNRLKIQNIVFSGGVAKNKVLVEMFEKKLNKKLLIPKEPQIVCCVGA
ILV
>MJ0005 formate dehydrogenase, beta subunit     GB:J02581_2
MKYVLIQATDNGILRRAECGGAVTALFKYLLDKKLVDGVLALKRGEDVYDGIPTFITNSN
ELVETAGSLHCAPTNFGKLIAKYLADKKIAVPAKPCDAMAIRELAKLNQINLDNVYMIGL
NCGGTISPITAMKMIELFYEVNPLDVVKEEIDKGKFIIELKNGEHKAVKIEELEEKGFGR
RKNCQRCEIMIPRMADLACGNWGAEKGWTFVEICSERGRKLVEDAEKDGYIKIKQPSEKA
IQVREKIESIMIKLAKKFQKKHLEEEYPSLEKWKKYWNRCIKCYGCRDNCPLCFCVECSL
EKDYIEEKGKIPPNPLIFQGIRLSHISQSCINCGQCEDACPMDIPLAYIFHRMQLKIRDT
LGYIPGVDNSLPPLFNIER
>MJ0006 formate dehydrogenase, alpha subunit    SP:P06131
MKVVHTICPGCSVGCGIDLIVKDDKVVGTYPYKRHPINEGKNCSNGKNSYKIIYHEKRLK
KPLIKKNGKLVEATWDEALSFIAEKLKNYNADDITFIASGKCTNEDNYALKKLVDSLKAK
IGHCICNSPKVNYAEVSTTIDDIENAKNIIIIGDVFSEHALIGRKVIKAKEKGSKVTIFN
TEEKEILKLNADEFVKVDSYLGVDLSNVDKNTIIIINAPVNVDEIIKTAKENKAKVLPVA
KHCNTVGATLIGIPALNKDEYFELLKNSKFLYIMGENPALVDKDVLKNVEFLVVQDIIMT
ETAEMADVVLPSTCWAEKDGTFINTDKRIQKINKAVNPPGDAMDDWLIIKSLAEKLGSDL
GFNSLEDIQQDIHRNKLL
>MJ0007 2-hydroxyglutaryl-CoA dehydratase, subunit beta SP:P11570
MMKLKAIEKLMQKFASRKEQLYKQKEEGRKVFGMFCAYVPIEIILAANAIPVGLCGGKND
TIPIAEEDLPRNLCPLIKSSYGFKKAKTCPYFEASDIVIGETTCEGKKKMFELMERLVPM
HIMHLPHMKDEDSLKIWIKEVEKLKELVEKETGNKITEEKLKETVDKVNKVRELFYKLYE
LRKNKPAPIKGLDVLKLFQFAYLLDIDDTIGILEDLIEELEERVKKGEGYEGKRILITGC
PMVAGNNKIVEIIEEVGGVVVGEESCTGTRFFENFVEGYSVEDIAKRYFKIPCACRFKND
ERVENIKRLVKELDVDGVVYYTLQYCHTFNIEGAKVEEALKEEGIPIIRIETDYSESDRE
QLKTRLEAFIEMI
>MJ0008
MFCGSMIAICMRSKEGFLFNNKLMDWGLHYNPKIVKDNNIIGYHAPILDLDKKESIIILK
NIIENIKGRDYLTIHLHNGKYGKINKETLIENLSIVNEFAEKNGIKLCIENLRKGFSSNP
NNIIEIADEINCYITFDVGHIPYNRRLEFLEICSDRVYNSHVYEIEVDGKHLPPKNLNNL
KPILDRLLDIKCKMFLIELMDIKEVLRTERMLKDYLEMYR
>MJ0009
MIFNENTPNFIDFKESFKELPLSDETFKIIEENGIKLREIAIGEFSGRDSVAAIIKAIEE
GIDFVLPVVAFTGTDYGNINIFYKNWEIVNKRIKEIDKDKILLPLHFMFEPKLWNALNGR
WVVLSFKRYGYYRPCIGCHAYLRIIRIPLAKHLGGKIISGERLYHNGDFKIDQIEEVLNV
YSKICRDFDVELILPIRYIREGKKIKEIIGEEWEQGEKQFSCVFSGNYRDKDGKVIFDKE
GILKMLNEFIYPASVEILKEGYKGNFNYLNIVKKLI
>MJ0010 phosphonopyruvate decarboxylase GP:D37809_1
MRAILILLDGLGDRASEILNNKTPLQFAKTPNLDRLAENGMCGLMTTYKEGIPLGTEVAH
FLLWGYSLEEFPGRGVIEALGEDIEIEKNAIYLRASLGFVKKDEKGFLVIDRRTKDISRE
EIEKLVDSLPTCVDGYKFELFYSFDVHFILKIKERNGWISDKISDSDPFYKNRYVMKVKA
IRELCKSEVEYSKAKDTARALNKYLLNVYKILQNHKINRKRRKLEKMPANFLLTKWASRY
KRVESFKEKWGMNAVILAESSLFKGLAKFLGMDFIKIESFEEGIDLIPELDYDFIHLHTK
ETDEAAHTKNPLNKVKVIEKIDKLIGNLKLREDDLLIITADHSTPSVGNLIHSGESVPIL
FYGKNVRVDNVKEFNEISCSNGHLRIRGEELMHLILNYTDRALLYGLRSGDRLRYYIPKD
DEIDLLEG

Questions regarding about page content should be directed to haussler@cse.ucsc.edu.
Last modified Oct. 20, 1996.

Back to the CMP 243 Class Page.