Feature prediction within genomic data is a powerful method for us to learn novel relationships armed with little more than a biological model and a way to express that model in terms of a score. You have seen already that when we can score natures choices for amino acid substitution, we can use that score as a means to align protein sequences. Similar models can be created to infer phylogeny, ancestral sequences or in searches for sequences that code for RNA structures (e.g., tRNA or snoRNA).
Proteins that bind specific regions of DNA, either as a regulation mechanism (transcription factors), or to recognize foreign DNA (phage DNA) need to make use of short sequences. These sequences need to be short in order to minimize the complexity of the binding protein, and they need to be specific in order to produce a specific regulatory effect. Consider, for example, what would happen if a regulatory protein that targets destruction of viral DNA happened upon that same sequence within the host genome. The benefits to the host of reserving these “special words” can now be seen. Since the machinery that makes use of these “special words” is also quite complex, we would tend to see them preserved over evolutionary time.
One class of special words appear to be palindromes. These palindromes are of a special type. For example, “TATA” is one of the more important palindromic sequences, used to signal start of transcription ( http://www.pdb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb67_1.html. This sequence is actually a reverse-complement palindrome. It can be written as: T A A' T', where A' means the reverse complement of A (T), and T' means the reverse complement of T (A). Palindromes are common recognition sites, because many of the proteins that bind to DNA bind as dimers or with two copies of the same protein domain, with the two chains having the DNA binding sites facing each other and recognizing corresponding sections on the forward and reverse strands, as can be seen in this cartoon picture of the PDB file 1LMB:
Rocha et al. (http://www.genome.org/cgi/content/full/11/6/946), noticed that palindromes of size 4 and 6 seem to be underrepresented in some prokaryotic genomes. They suggest that this may be due to particular structures (restriction or modification systems) that exist in some prokaryotes and some phage (lambda for example). This may be a mechanism for establishing exclusive use or restricting use of these words within a genome.
So, armed with the possibility that there exist special words, and that these words are sometimes palindromes, can we now see if we can find them in our genome of choice?
For us to predict when a word is missing, we need to have a model of the frequency with which we would expect a particular word—this will be our null model. We want to find the palindromes that occur significantly less frequently than would be expected.
There are many null models that are possible, and for this assignment we will make use of a Markov model that models extending the central palindrome in both directions. Note: nothing in this null model is restricted to palindromes—we could use this model to look for any under-represented n-mer. The symmetry of extending in both directions from the central (n–2)-mer is particularly elegant for palindromes, but a very similar method can be made by using an order (n–2) Markov model as the null model.
To compute the probability that an arbitrary n-mer is the word (W)
that we are interested in, we will use 3 components:
the probability that the core (n-2) bases match, and the probabilities
of the first and last bases given that the core matches.
P(W) = P(W1 | W2...Wn-1) *
P(Wn | W2...Wn-1) *
P(W2...Wn-1) .
If we check for the palindrome N times (roughly the length of the genome),
we would expect to see N*P(W) occurrences of W. We can use counts and
maximum-likelihood estimates of the probabilities in our null model to
get an estimation formula:
E(C(W)) = C(W1...Wn-1*)
* C(*W2...Wn)
/ C(*W2...Wn-1*) ,
where E(C(W)) is the expected value for the count of the number of
times W occurs in the genome, and C(Wi...Wj)
is the actual count of the number of times the word
Wi...Wj occurs. The '*' character on the subwords
is intended to represent "any character". We are counting the number
of words of length n that agree on the first n-1, the second n-1, or
the middle n-2 positions. If n=2, we are not counting empty strings
on the bottom, but the number of 2-letter windows in our training data.
Because our model for the count is the sum of N almost independent
observations, each with probability P(W), it can be well modeled as a
binomial distribution, with variance
Var(C(W)) = N* P(W) * (1-P(W))
= E(C(W)) * (1 - E(C(W))/N) ,
and the standard deviation is
sigma(W) = sqrt(E(C(W)) * (1 - E(C(W))/N)) .
Note: because E(C(W)) is much smaller than N, the variance is
approximately the same as the mean.
We can then compute a Z-score for each of our words:
Z(W) = (C(W) – E(C(W))) / sigma(W) .
Using a Z-score is quite reasonable here, because we are expecting a
distribution very close to a normal distribution.
How big a palindrome can we use this method on? As the palindromes get bigger, the expected counts get smaller, and it is harder to tell whether the number is really smaller than expected.
Say we want to have 0.01 probability of a false positive. If we check 4^4 (256) palindromes (the number of palindromes of length 8), we would need a probability of 3.9E-05 for a false reading for each one. According to a normal probability table (http://www.math.unb.ca/~knight/utility/NormTble.htm), we would need a Z-score < -4 for this level of significance. For a zero count to be at least 4 standard deviations below the expected value, we would need for the expected value to be at least 16. For a megabase genome, the expected number of counts for the 4^8 (64k) different 8-mers should be around 16, so 8-long palindromes are about as big as we can go.
Note that if we wanted to check all 64k 8-mers for under-representation and not just the palindromes, we'd need a probability of 1.5E-07 of a false positive, or a Z-value around -5.1. This would require expected counts of at least 26, or a 1.7 megabase genome.
I will provide a few genomes for various prokaryotes as FASTA
files (actually, as gzipped fasta files—you will need to use
"gunzip -c" to read them as fasta):
A.fulgidus.fa.gz
A.pernix.fa.gz
H.influenzae.fa.gz
H.pylori.fa.gz
M.jannaschii.fa.gz
M.thermoautotrophicus.fa.gz
N.meningitidus.fa.gz
P.abyssi.fa.gz
P.furiosus.fa.gz
P.horikoshii.fa.gz
S.acidocaldarius.fa.gz
S.solfataricus.fa.gz
S.tokadaii.fa.gz
V.cholerae.I.fa.gz
V.cholerae.II.fa.gz
V.fischeri.I.fa.gz
V.fischeri.II.fa.gz
V.parahaemolyticus.I.fa.gz
V.parahaemolyticus.II.fa.gz
V.vulnificus.I.fa.gz
V.vulnificus.II.fa.gz
These genomes are also available directly from /cse/classes/bme205/Fall05/ on the School of Engineering computers, and it is fairly simple to find other full prokaryotic genomes on the web. There will be one sequence per file, but your program should be able to handle multiple sequences in one file, so that it can be applied to multi-chromosome genomes or genomes that are not fully assembled. You could also apply the program to a set of genomes from the same genus, as the extra data would allow you to search for slightly longer palindromes. Note: the files take up about 45 Mbytes unzipped, so you probably want to leave them in the compressed format.
Write a program, named search_nmer that has 3 parameters:
--genome which specifies the fasta file to use
--Z which specifies the maximum Z score to include in your report.
These values will be negative, because we are looking for cases
where an observed value is below the predicted mean. (For genomes of
the size we will be using, the maximum Z-score will be about -5.)
--maxSize which specifies the maximum size word to look for. We will use 8 for testing.
Your program would then be executed as follows (for example):
search_nmer -genome H.influenzae.fa.gz -Z -5.5 -maxSize 8
Note: you can open a gzipped or non-gzipped file in perl fairly easily:
open (FILE, ($filename =~ /[.]gz$/) ?
"gunzip -c $filename |"
: "< $filename");
The program should use hashes or arrays to count how many occurrences there are of each n-mer up to length maxSize. Because we are keeping maxSize fairly small, there are no memory problems in storing this information. For each palindrome of length 2, 3, 4, ... up to length maxSize, compute the expected number of that palindrome and determine whether the observed count is significantly different from the expected count.
Your output report should look something like this:
# Analyzing H.influenzae.fa.gz for underrepresented palindromes of length 6 through 6. # Reporting those with Z-scores smaller than -5.5 # # There are 1829276 positions in which a palindrome may be looked for. # # Sequence Observed Expected std_dev Z_score TTTAAA 2284 2585.11 50.81 -5.93 ATATAT 425 565.58 23.78 -5.91 GAATTC 302 668.74 25.86 -14.18 CATATG 173 336.88 18.35 -8.93 AATATT 2034 2408.28 49.04 -7.63 AAGCTT 359 651.07 25.51 -11.45 AAATTT 2904 3379.48 58.08 -8.19
If you sort your output by length of word and alphabetically within that, you may see helpful patterns, such as palindromic sequences that have a don't-care region in the center.
Once a table of the the n-mer counts is made, the palindromes can be enumerated and the appropriate counts looked up. If students want to do palindrome detection on the keys of the nmer hash rather than enumeration, that can be done also, but should probably use reverse() and tr/ACGT/TGCA/ rather than slow and complicated loops.
Many people use m/$word/g to count n-mers, but this gets incorrect
results for overlapping words. For example, looking for ATA in ATATA
should count 2, but the regular expression method only sees the first
one.
|
|
| BME 205 home page | Karplus's lab page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building