After doing homework 1, you are ready to tackle target t0037 of the CASP contest. Well ... at least we can play with t37 a bit. If it were easy it wouldn't be in the F/A category of the contest.
Follow the instructions of homework 1 to get the t0037.doc file starting from the home page for the CASP contest. You see that t37 is a beta spectrin. You might want to look this sequence up in Entrez as we did with target 25 to find out more about it. Then continue as in that assignment, extracting the t0037 sequence from this file and pasting it into the window on NCBI's BLAST biosequence database search tool. Use blastp with default settings to search the PDB database for sequences that have high scoring segment pairs with target 37.
Q1: What is the PDB sequence that gets the highest scoring locally maximal segment pair (MSP)? In the more detailed BLAST report for this MSP, it tells how many MSPs with a score at least this high would be expected in a search of a random database the size of the PDB database. I called this the E-value in my lecture. The E-value is calculated in a way similar to the calculation you did in problem 1 in the first part of this assignment. How many would be expected, i.e. what is the E-value for this MSP? In the report for this MSP, BLAST also tells the estimated probability of finding one or more MSPs with this score or higher. I called this the P-value in my lecture. Here the calculation is done assuming a Poisson distribution with mean equal to the E-value. For a Poisson distribution with mean E, what is the probability of 0 occurrences? If we subtract this number from 1, we get the probability of one or more occurrences, which is the P-value, right? Do this calculation. What number do you get for the P-value? How does this compare to the P-value reported by BLAST for this MSP? Do you think this MSP represents a statistically significant match between the query sequence and this PDB sequence?
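As a way to check your arithmetic for the Poisson step in Q1, here is a minimal Python sketch of the calculation described above: the probability of 0 occurrences for a Poisson distribution with mean E is exp(-E), so the P-value is 1 - exp(-E). The function name and the E-values in the loop are made up for illustration; substitute the E-value from your own BLAST report.

```python
import math

def p_value_from_e_value(e_value: float) -> float:
    """Probability of one or more MSPs, assuming a Poisson count with mean e_value."""
    # P(0 occurrences) = exp(-E), so P(one or more) = 1 - exp(-E).
    # -expm1(-E) is the same quantity, computed without round-off loss for tiny E.
    return -math.expm1(-e_value)

# Illustrative E-values only (not taken from any BLAST report).
for e in (1e-30, 0.01, 1.0, 5.0):
    print(f"E = {e:g}  ->  P = {p_value_from_e_value(e):.6g}")
```

Running it over a range of E-values, from very small to larger than 1, is a quick way to see how the two numbers relate before you compare against what BLAST reports.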
Now change databases and search the nonredundant protein database (nr), using blastp with default settings.
Q2: How many total residues are in the nr database? Note that it is more than 100 times larger than PDB. The top scoring MSP is not surprising; it is an alignment of the query sequence t37 with itself. Quite an impressive P-value. Not likely to happen by chance in a random database even of this size!
Most of the other high scoring MSPs are to other beta spectrin proteins, until we get down to about the 15th hit, where we start seeing strong hits to alpha-actinins. It appears these must be related to beta spectrins. (How?) Find the hit to
sp|Q08043|AAC3_HUMAN ALPHA-ACTININ 3, SKELETAL MUSCLE ISOFORM (F-ACTIN CROSS LINKING PROTEIN). pir||FAHUA3 alpha-actinin 3 - human
and scan down and find BLAST's detailed report for this hit. There are two MSPs reported; let's just look at the first, higher scoring one.
Q3: What are the E-value and P-value for this MSP? Why are they the same? This record also says that the BLAST score S for this MSP is 127.6 bits. According to equation (4) of the paper "Amino Acid Substitution Matrices..." by Altschul et al.,
S = -log P-value + log N + log K
where the logarithm is base 2, N is n*m, where n is the number of residues in the database being searched and m is the number of residues in the query sequence, and log K is a relatively small number (think of it as a correction factor). Check how closely this equation holds, i.e. what must log K be to make this true?
The number N is supposed to represent the number of different possible segment pairs that can be made between a piece of the query sequence and a piece of a sequence in the database of the size of this particular MSP. That is why it is suggested that we take N = m*n, where m is the length of the query sequence and n is the total number of residues in the database. The way this works is that m is approximately the number of places the segment pair can start in the query sequence, and n is approximately the number of places it can start in the database, so their product is the total number of possible segment pairs. However, since this MSP is 103 residues long, and the query sequence is 109 residues, there are really only 7 possible starting positions for a segment pair of this length in the query sequence. Similarly, the last 102 positions in each database sequence are not possible starting places for a segment pair of this length. (However, it is not easy to calculate how many possible starting places this leaves in the database.) The upshot of this is that n*m is an overestimate for the number of possible segment pairs of length 103 that can be made between the query sequence and the sequences in the database. Try the above formula for S again, using m = 7 instead of m = 109 and notice how the correction factor is now smaller.
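If you would rather not do the base-2 logarithms by hand, the check in Q3 can be sketched in a few lines of Python. The sketch simply rearranges the formula above to log K = S + log P-value - log N, with N = n*m. The score 127.6 and the two values of m (109 and 7) come from the text; the P-value and database size n used below are placeholders, not real values, so replace them with the numbers from your own BLAST report. The function and variable names are also just for illustration.

```python
import math

def implied_log2_k(score_bits: float, p_value: float, n_db: float, m_query: float) -> float:
    """Solve S = -log2(P) + log2(N) + log2(K) for log2(K), with N = n_db * m_query."""
    return score_bits + math.log2(p_value) - math.log2(n_db * m_query)

SCORE_BITS = 127.6          # from the BLAST record discussed above
P_PLACEHOLDER = 1e-30       # PLACEHOLDER: use the P-value from your report
N_DB_PLACEHOLDER = 1e8      # PLACEHOLDER: use the nr residue count from Q2

k_full = implied_log2_k(SCORE_BITS, P_PLACEHOLDER, N_DB_PLACEHOLDER, 109)
k_short = implied_log2_k(SCORE_BITS, P_PLACEHOLDER, N_DB_PLACEHOLDER, 7)

print(f"log2 K with m = 109: {k_full:.2f}")
print(f"log2 K with m = 7:   {k_short:.2f}")
print(f"difference:          {k_short - k_full:.2f} bits")
```

Whatever P-value and n you plug in, the two results differ by log2(109/7), which is about 3.96 bits, because changing m only changes the log N term.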
From these BLAST results, there appear to be quite a few sequences in the nonredundant protein database that are homologous to Target 37. In such a case there are often fairly conserved residue patterns that are common to many of these sequences, which are related to the common function of the proteins in the family. These are called protein motifs. So far we have only explored pairwise relationships between sequences; now let's look at motifs common to larger sets of sequences. Two ways to approach this are via the PROSITE database and via the BLOCKS database. Each contains a large set of known protein motifs, but the representation of the motifs is different. (BLOCKS was derived from Prosite.) We'll go via the blocks database; it has links back to Prosite. We'll talk more about protein motifs and how they are represented in class. They are discussed in chapter 3 of the text.
Go to Pedro's page (under class bioinformatics resources) and locate the BLOCKS server. Click on the first entry for BLOCKS. You might read under the "about blocks" button. Then click on "blocks searcher". You can type in your email address or leave it blank and see the results on the screen. Paste in the t37 sequence as in the BLAST search and do a search of the blocks database. The results give some description of the method and how to interpret the scores, followed by a list of blocks motifs that were found to have residue patterns that are similar to pieces of the query sequence, t37. Each blocks motif has a number. The highest scoring motif is BL00019C. An alignment of a piece of t37 with a representative sequence for this block, SPCB_HUMAN, is displayed in roughly BLAST-style format. There are also other motifs from this same block, namely BL00019B and BL00019A. A hit to BL00019B is mentioned in the summary, but the score is not high enough for it to have a detailed alignment shown. You might save this file, and the files you get in the next step.
Let's look at the whole file for block BL00019. Go back to the main blocks server page and click on "Get blocks by number". Type "BL00019" into the search box and hit return. This brings up two windows for block BL00019: the bottom window is the Prosite documentation for this block. This tells you that the conserved sequence motifs in this block are thought to be involved in binding actin. Different actin-binding proteins are discussed briefly, including beta spectrin, which is what t37 is. This helps answer a question posed earlier. Then a description of the amino acid pattern that is common to these motifs is given (we'll discuss this a bit in class), followed by some citations to the literature.
The top window shows the three blocks motifs that make up block BL00019. Each is specified here as an ungapped multiple alignment of pieces from several related proteins, all of which have a similar residue pattern in this region. Looking at this multiple alignment, we can see that some positions within the motifs are more conserved than others. For example, in the first motif, BL00019A, the 4th residue is T (threonine) in all but one instance (it is valine in the yeast sequence), and the 5th residue is a completely conserved F (phenylalanine). These positions may be very important to the function or structure of the proteins in this family. Other more variable positions may be less crucial. When searching the blocks database, the score table for a match is different in each position of the blocks profile, and is computed based on the frequencies of the amino acids that are actually observed in that position in the multiple alignment of members of the family that you see here. (More on this later.) By using the combined information from many related sequences, blocks searching can sometimes be more sensitive than BLAST searching, which uses information only from one sequence at a time.
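To make the idea of "observed frequencies in each position" concrete, here is a small Python sketch that tallies, for each column of an ungapped block, the most common residue and the fraction of sequences that carry it. The four segments below are invented for illustration (they are not the real BL00019A entries), and this simple count is not the actual BLOCKS score table, which is computed in a more elaborate way that we will discuss later. It only shows how conserved and variable positions can be read off a multiple alignment.

```python
from collections import Counter

def column_profile(block):
    """For each column of an ungapped block, return (most common residue, its frequency)."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in an ungapped block must have the same length")
    profile = []
    for column in zip(*block):  # zip(*block) walks the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        profile.append((residue, count / len(block)))
    return profile

# HYPOTHETICAL toy block: made-up segments, not taken from the BLOCKS database.
toy_block = ["KGLTFQ", "RALTFE", "KGMTFD", "KSLVFQ"]

profile = column_profile(toy_block)
for position, (residue, freq) in enumerate(profile, start=1):
    print(f"position {position}: {residue} in {freq:.0%} of the segments")
```

In this toy block the 5th position is F in every segment and the 4th is T in all but one, mirroring the kind of pattern described above. For Q4 you could paste in the segments of BL00019C and look for the columns with the highest frequencies.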
Q4: Which are the most strongly conserved positions in BL00019C? Does the target have the same conserved (or nearly conserved) amino acid in these positions?
We'll return to this motif in a future assignment.
FOR EXTRA CREDIT. Solve the 3-dimensional structure for t37 (and tell me the answer by Monday so I can enter it in the CASP2 contest). Seriously, we do need to make a prediction on this by October 14. We don't need full 3D coordinates of every atom. All we need is an alignment (possibly with gaps) to a protein in PDB that has similar structure. Maybe by further exploration you could get an idea of what positions lie in the active site of t37 that binds actin. This might help in deciding if an alignment of t37 to a proposed structure is good or not. But there is a lot more to do in finding the proposed structure and alignment in the first place ...