First read sections 1, 2, and 5 of the tutorial paper by Leslie Grate, Richard Hughey, Kevin Karplus and Kimmen Sjolander. This is a tutorial on hidden Markov models, which we will get to later in the course. However, these sections discuss more general issues that apply to the current and past topics we have covered in this class.
Then you should make sure you have carefully read up to page 19 in chapter one of the text. Note that there is a typo in line 13 of page 19: the denominator of this fraction should be Z(alpha) M(n). The function M(n) is defined in equation 1.5.
NOTE: SOME BUGS WERE FOUND IN THIS PART OF THE HOMEWORK. See fixes.
Make yourself a local working directory. Move to your working directory and execute the following commands:

    cp /projects/compbio/class/cmp243/configfile .
    cp /projects/compbio/class/cmp243/T*.seq .
    cp /projects/compbio/class/cmp243/select.pl .
    cp /projects/compbio/class/cmp243/run_search.csh .
    ln -s /projects/compbio/class/cmp243/blocks.whole

Now you are ready to begin searching the BLOCKS database. There are four sets of BLOCKS profiles assembled for you:

    /projects/compbio/class/cmp243/blocks.1.1comp
    /projects/compbio/class/cmp243/blocks.1.9comp
    /projects/compbio/class/cmp243/blocks.num_seq.1comp
    /projects/compbio/class/cmp243/blocks.num_seq.9comp

and you will search each of these in turn.
A profile is essentially a multiple alignment of a set of sequences converted into a sequence of probability distributions, one for each column or position in the alignment, as discussed in class. The naming convention indicates how the profiles were built. "1comp" indicates that the profiles were built by computing a posterior mean estimate, using a single Dirichlet density as a prior; "9comp" indicates that a nine-component Dirichlet mixture was used as a prior. The other part of the name refers to the sequence weighting convention that was used. All blocks were built using the relative sequence weights given in the BLOCKS database. However, these weights are scaled to sum to 1 in one case, the ".1" case, and they are scaled to sum to the total number of sequences in the other case, the "num_seq" case.
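The effect of the two weighting conventions on a single profile column can be illustrated with a minimal sketch. It assumes a toy four-letter alphabet, made-up relative weights, and a made-up single-component prior with all parameters equal to 1; none of these values come from the BLOCKS database or from the actual profile-building code. The nine-component mixture case is not shown.

```python
# Toy illustration only: alphabet, weights, and prior are hypothetical.
ALPHABET = "ACDE"

def scale_weights(weights, target_sum):
    """Rescale relative sequence weights so that they sum to target_sum."""
    total = sum(weights)
    return [w * target_sum / total for w in weights]

def weighted_counts(column, weights, alphabet=ALPHABET):
    """Sum the sequence weights of each residue observed in one column."""
    counts = [0.0] * len(alphabet)
    for residue, w in zip(column, weights):
        counts[alphabet.index(residue)] += w
    return counts

def posterior_mean(counts, alpha):
    """Posterior mean estimate under a single Dirichlet prior:
    p_i = (n_i + alpha_i) / (sum(n) + sum(alpha))."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]

column = "AAAC"                      # one alignment column, four sequences
relative = [0.4, 0.3, 0.2, 0.1]      # hypothetical relative weights
alpha = [1.0, 1.0, 1.0, 1.0]         # hypothetical Dirichlet parameters

# ".1" convention: weights sum to 1, so the prior dominates.
p_one = posterior_mean(
    weighted_counts(column, scale_weights(relative, 1.0)), alpha)

# "num_seq" convention: weights sum to the number of sequences.
p_num = posterior_mean(
    weighted_counts(column, scale_weights(relative, len(column))), alpha)

print(p_one)
print(p_num)
```

With weights summing to 1, the observed residues shift the estimate only slightly away from the prior; with weights summing to the number of sequences, the same column pulls the estimate much closer to the observed frequencies.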
By default, you will be analyzing three of the CASP2 target sequences: T0006, T0021, and T0037. If there are additional proteins you'd like to analyze, create a file for each of them, following the format of the T*.seq files.
If you're not already logged in on an alpha (such as moo, oink, or quack), at this point you should log into one and move to your working directory.
To specify the profile to search, edit the file configfile. In the line following "BLOCKS:", specify the first profile to search.
To run the search, execute the command:

    run_search.csh <sequence_file> > <output_file> &

where <sequence_file> specifies a file containing an input sequence and <output_file> specifies the name of an output file. A sample command line is:

    run_search.csh T0006.seq >blocks.1.1comp.out &
After approximately five minutes (give or take system load), the search will be done. Then, take a look in the output file to see the hits. For each hit, the output file lists the relevant target protein, accession number of the block, positions hit, encoding cost, and other information. Repeat with all four profiles and at least all three target sequences.
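Keeping track of the twelve runs (four profiles times three targets) is easier with a small planning script. The sketch below only rewrites configuration text and lists the commands; it does not launch anything. It assumes that "the line following BLOCKS:" means the next line of configfile, that the target files are named T0006.seq, T0021.seq, and T0037.seq, and that output files may be named as target.profile.out; the function names and the output naming scheme are hypothetical.

```python
from itertools import product

PROFILE_DIR = "/projects/compbio/class/cmp243"
PROFILES = ["blocks.1.1comp", "blocks.1.9comp",
            "blocks.num_seq.1comp", "blocks.num_seq.9comp"]
TARGETS = ["T0006", "T0021", "T0037"]

def set_blocks_profile(config_text, profile_path):
    """Return config_text with the line after the 'BLOCKS:' line replaced.

    Assumption: the profile path sits on the line immediately after a
    line that starts with 'BLOCKS:'.
    """
    lines = config_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.strip().startswith("BLOCKS:"):
            lines[i + 1] = profile_path
            return "\n".join(lines) + "\n"
    raise ValueError("no 'BLOCKS:' line followed by a profile line")

def planned_runs():
    """List (profile_path, command) pairs for every profile/target pair."""
    runs = []
    for profile, target in product(PROFILES, TARGETS):
        output = f"{target}.{profile}.out"   # hypothetical naming scheme
        command = f"run_search.csh {target}.seq > {output}"
        runs.append((f"{PROFILE_DIR}/{profile}", command))
    return runs

for profile_path, command in planned_runs():
    print(profile_path, "|", command)
```

Because every search reads the same configfile, it is presumably safest to let one search finish before editing the file for the next profile.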
If any hit captures your attention and you'd like to investigate further, note the accession number and run the following command:

    select.pl <accession_number> blocks.whole

For example:

    select.pl BL00450D blocks.whole

This will print information about the block, such as its family and the subsequences belonging to that block.
Do this with at least 2 hits. Describe what you find.
Repeat the above with some of the first 10 putative proteins from the Methanococcus jannaschii genome. You can pick some of these ten at random to analyse, or you can do all 10. You might do a BLAST search on these as well. Describe what you find.
The whole list of putative genes from Methanococcus jannaschii can be found in /projects/compbio/data/jannaschii/GMJ-pep.seqs. The first 10 of them are:
>MJ0001 aspartate aminotransferase SP:P14909
MISSRCKNIKPSAIREIFNLATSDCINLGIGEPDFDTPKHIIEAAKRALDEGKTHYSPNN
GIPELREEISNKLKDDYNLDVDKDNIIVTCGASEALMLSIMTLIDRGDEVLIPNPSFVSY
FSLTEFAEGKIKNIDLDENFNIDLEKVKESITKKTKLIIFNSPSNPTGKVYDKETIKGLA
EIAEDYNLIIVSDEVYDKIIYDKKHYSPMQFTDRCILINGFSKTYAMTGWRIGYLAVSDE
LNKELDLINNMIKIHQYSFACATTFAQYGALAALRGSQKCVEDMVREFKMRRDLIYNGLK
DIFKVNKPDGAFYIFPDVSEYGDGVEVAKKLIENKVLCVPGVAFGENGANYIRFSYATKY
EDIEKALGIIKEIFE
>MJ0002
MEIFMEVPIFVVISGSDLYGIPNPSDVDIRGAHILDRELFIKNCLYKSKEEEVINKMFGK
CDFVSFELGKFLRELLKPNANFIEIALSDKVLYSSKYHEDVKGIAYNCICKKLYHHWKGF
AKPLQKLCEKESYNNPKTLLYILRAYYQGILCLESGEFKSDFSSFRCLDCYDEDIVSYLF
ECKVNKKPVDESYKKKIKSYFYELGVLLDESYKNSNLIDEPSETAKIKAIELYKKLYFED
VRE
>MJ0003
MKGKRIAIVSHRILNQNSVVNGLERAEGAFNEVVEILLKNNYGIIQLPCPELIYLGIDRE
GKTKEEYDTKEYRELCKKLLEPIIKYLQEYKKDNYKFILIGIENSTTCDIFKNRGILMEE
FFKEVEKLNIIIKAIEYPKNEKDYNKFVKTLEKMIK
>MJ0004 activator of (R)-2-hydroxyglutaryl-CoA dehydratase PIR:S36105
MILGIDVGSTTTKMVLMEDSKIIWYKIEDIGVVIEEDILLKMVKEIEQKYPIDKIVATGY
GRHKVSFADKIVPEVIALGKGANYFFNEADGVIDIGGQDTKVLKIDKNGKVVDFILSDKC
AAGTGKFLEKALDILKIDKNEINKYKSDNIAKISSMCAVFAESEIISLLSKKVPKEGILM
GVYESIINRVIPMTNRLKIQNIVFSGGVAKNKVLVEMFEKKLNKKLLIPKEPQIVCCVGA
ILV
>MJ0005 formate dehydrogenase, beta subunit GB:J02581_2
MKYVLIQATDNGILRRAECGGAVTALFKYLLDKKLVDGVLALKRGEDVYDGIPTFITNSN
ELVETAGSLHCAPTNFGKLIAKYLADKKIAVPAKPCDAMAIRELAKLNQINLDNVYMIGL
NCGGTISPITAMKMIELFYEVNPLDVVKEEIDKGKFIIELKNGEHKAVKIEELEEKGFGR
RKNCQRCEIMIPRMADLACGNWGAEKGWTFVEICSERGRKLVEDAEKDGYIKIKQPSEKA
IQVREKIESIMIKLAKKFQKKHLEEEYPSLEKWKKYWNRCIKCYGCRDNCPLCFCVECSL
EKDYIEEKGKIPPNPLIFQGIRLSHISQSCINCGQCEDACPMDIPLAYIFHRMQLKIRDT
LGYIPGVDNSLPPLFNIER
>MJ0006 formate dehydrogenase, alpha subunit SP:P06131
MKVVHTICPGCSVGCGIDLIVKDDKVVGTYPYKRHPINEGKNCSNGKNSYKIIYHEKRLK
KPLIKKNGKLVEATWDEALSFIAEKLKNYNADDITFIASGKCTNEDNYALKKLVDSLKAK
IGHCICNSPKVNYAEVSTTIDDIENAKNIIIIGDVFSEHALIGRKVIKAKEKGSKVTIFN
TEEKEILKLNADEFVKVDSYLGVDLSNVDKNTIIIINAPVNVDEIIKTAKENKAKVLPVA
KHCNTVGATLIGIPALNKDEYFELLKNSKFLYIMGENPALVDKDVLKNVEFLVVQDIIMT
ETAEMADVVLPSTCWAEKDGTFINTDKRIQKINKAVNPPGDAMDDWLIIKSLAEKLGSDL
GFNSLEDIQQDIHRNKLL
>MJ0007 2-hydroxyglutaryl-CoA dehydratase, subunit beta SP:P11570
MMKLKAIEKLMQKFASRKEQLYKQKEEGRKVFGMFCAYVPIEIILAANAIPVGLCGGKND
TIPIAEEDLPRNLCPLIKSSYGFKKAKTCPYFEASDIVIGETTCEGKKKMFELMERLVPM
HIMHLPHMKDEDSLKIWIKEVEKLKELVEKETGNKITEEKLKETVDKVNKVRELFYKLYE
LRKNKPAPIKGLDVLKLFQFAYLLDIDDTIGILEDLIEELEERVKKGEGYEGKRILITGC
PMVAGNNKIVEIIEEVGGVVVGEESCTGTRFFENFVEGYSVEDIAKRYFKIPCACRFKND
ERVENIKRLVKELDVDGVVYYTLQYCHTFNIEGAKVEEALKEEGIPIIRIETDYSESDRE
QLKTRLEAFIEMI
>MJ0008
MFCGSMIAICMRSKEGFLFNNKLMDWGLHYNPKIVKDNNIIGYHAPILDLDKKESIIILK
NIIENIKGRDYLTIHLHNGKYGKINKETLIENLSIVNEFAEKNGIKLCIENLRKGFSSNP
NNIIEIADEINCYITFDVGHIPYNRRLEFLEICSDRVYNSHVYEIEVDGKHLPPKNLNNL
KPILDRLLDIKCKMFLIELMDIKEVLRTERMLKDYLEMYR
>MJ0009
MIFNENTPNFIDFKESFKELPLSDETFKIIEENGIKLREIAIGEFSGRDSVAAIIKAIEE
GIDFVLPVVAFTGTDYGNINIFYKNWEIVNKRIKEIDKDKILLPLHFMFEPKLWNALNGR
WVVLSFKRYGYYRPCIGCHAYLRIIRIPLAKHLGGKIISGERLYHNGDFKIDQIEEVLNV
YSKICRDFDVELILPIRYIREGKKIKEIIGEEWEQGEKQFSCVFSGNYRDKDGKVIFDKE
GILKMLNEFIYPASVEILKEGYKGNFNYLNIVKKLI
>MJ0010 phosphonopyruvate decarboxylase GP:D37809_1
MRAILILLDGLGDRASEILNNKTPLQFAKTPNLDRLAENGMCGLMTTYKEGIPLGTEVAH
FLLWGYSLEEFPGRGVIEALGEDIEIEKNAIYLRASLGFVKKDEKGFLVIDRRTKDISRE
EIEKLVDSLPTCVDGYKFELFYSFDVHFILKIKERNGWISDKISDSDPFYKNRYVMKVKA
IRELCKSEVEYSKAKDTARALNKYLLNVYKILQNHKINRKRRKLEKMPANFLLTKWASRY
KRVESFKEKWGMNAVILAESSLFKGLAKFLGMDFIKIESFEEGIDLIPELDYDFIHLHTK
ETDEAAHTKNPLNKVKVIEKIDKLIGNLKLREDDLLIITADHSTPSVGNLIHSGESVPIL
FYGKNVRVDNVKEFNEISCSNGHLRIRGEELMHLILNYTDRALLYGLRSGDRLRYYIPKD
DEIDLLEG
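To pull individual proteins out of the full gene list, a small FASTA reader may be convenient. The sketch below assumes the file uses the usual FASTA layout (a header line starting with ">" followed by one or more sequence lines); the function names are hypothetical, and writing the records out in the exact format expected by run_search.csh is left to you, since that format is defined by the T*.seq files.

```python
from itertools import islice

def _record(header, chunks):
    """Split a header into identifier and description; join the sequence."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)

def read_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        yield _record(header, chunks)

def first_records(lines, n=10):
    """Return the first n records as a list."""
    return list(islice(read_fasta(lines), n))

# Small in-memory example taken from the start of two records above.
sample = [
    ">MJ0003",
    "MKGKRIAIVSHRILNQNSVV",
    "NGLERAEGAFNEVVEILLKN",
    ">MJ0010 phosphonopyruvate decarboxylase GP:D37809_1",
    "MRAILILLDGLGDRASEILN",
]
for identifier, description, sequence in first_records(sample):
    print(identifier, len(sequence), description)
```

On the class machines, the same functions could be applied to an open file handle for /projects/compbio/data/jannaschii/GMJ-pep.seqs in place of the sample list.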