UCSC BME 205 Fall 2008

Bioinformatics: Models and Algorithms
PERL Assignment 5

Palindromes in Genomic sequences

Due 1 Dec 2008 (beginning of class)

(Last Update: 22:17 PST 15 December 2008 )
Note: this assignment was originally created by David Bernick, then edited by Kevin Karplus. Both are about equally responsible for the final shape of the assignment.

Feature prediction within genomic data is a powerful method for us to learn novel relationships armed with little more than a biological model and a way to express that model in terms of a score. You have seen already that when we can score natures choices for amino acid substitution, we can use that score as a means to align protein sequences. Similar models can be created to infer phylogeny, ancestral sequences or in searches for sequences that code for RNA structures (e.g., tRNA or snoRNA).

Proteins that bind specific regions of DNA, either as a regulation mechanism (transcription factors), or to recognize foreign DNA (phage DNA) need to make use of short sequences. These sequences need to be short in order to minimize the complexity of the binding protein, and they need to be specific in order to produce a specific regulatory effect. Consider, for example, what would happen if a regulatory protein that targets destruction of viral DNA happened upon that same sequence within the host genome. The benefits to the host of reserving these “special words” can now be seen. Since the machinery that makes use of these “special words” is also quite complex, we would tend to see them preserved over evolutionary time.

One class of special words appear to be palindromes. These palindromes are of a special type. For example, “TATA” is one of the more important palindromic sequences, used to signal start of transcription ( http://www.pdb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb67_1.html. This sequence is actually a reverse-complement palindrome. It can be written as: T A A' T', where A' means the reverse complement of A (T), and T' means the reverse complement of T (A). Palindromes are common recognition sites, because many of the proteins that bind to DNA bind as dimers or with two copies of the same protein domain, with the two chains having the DNA binding sites facing each other and recognizing corresponding sections on the forward and reverse strands, as can be seen in this cartoon picture of the PDB file 1LMB:

image of DNA with two protein monomers bound

Rocha et al. (http://www.genome.org/cgi/content/full/11/6/946), noticed that palindromes of size 4 and 6 seem to be underrepresented in some prokaryotic genomes. They suggest that this may be due to particular structures (restriction or modification systems) that exist in some prokaryotes and some phage (lambda for example). This may be a mechanism for establishing exclusive use or restricting use of these words within a genome.

So, armed with the possibility that there exist special words, and that these words are sometimes palindromes, can we now see if we can find them in our genome of choice?

For us to predict when a word is missing, we need to have a model of the frequency with which we would expect a particular word—this will be our null model. We want to find the palindromes that occur significantly less frequently than would be expected.

There are many null models that are possible, and for this assignment we will make use of a Markov model that models extending the central palindrome in both directions. Note: nothing in this null model is restricted to palindromes—we could use this model to look for any under-represented n-mer. The symmetry of extending in both directions from the central (n–2)-mer is particularly elegant for palindromes, but a very similar method can be made by using an order (n–2) Markov model as the null model.

To compute the probability that an arbitrary n-mer is the word (W) that we are interested in, we will use 3 components: the probability that the core (n-2) bases match, and the probabilities of the first and last bases given that the core matches.
P(W) = P(W1 | W2...Wn-1) * P(Wn | W2...Wn-1) * P(W2...Wn-1) .

If we check for the palindrome N times (roughly the length of the genome), we would expect to see N*P(W) occurrences of W. We can use counts and maximum-likelihood estimates of the probabilities in our null model to get an estimation formula:
E(C(W)) = C(W1...Wn-1*) * C(*W2...Wn) / C(*W2...Wn-1*) ,
where E(C(W)) is the expected value for the count of the number of times W occurs in the genome, and C(Wi...Wj) is the actual count of the number of times the word Wi...Wj occurs. The '*' character on the subwords is intended to represent "any character". We are counting the number of words of length n that agree on the first n-1, the second n-1, or the middle n-2 positions. If n=2, we are not counting empty strings on the bottom, but the number of 2-letter windows in our training data.

Because our model for the count is the sum of N almost independent observations, each with probability P(W), it can be well modeled as a binomial distribution, with variance
Var(C(W)) = N* P(W) * (1-P(W))
= E(C(W)) * (1 - E(C(W))/N) ,
and the standard deviation is
sigma(W) = sqrt(E(C(W)) * (1 - E(C(W))/N)) . Note: because E(C(W)) is much smaller than N, the variance is approximately the same as the mean.

We can then compute a Z-score for each of our words:
Z(W) = (C(W) – E(C(W))) / sigma(W) . Using a Z-score is quite reasonable here, because we are expecting a distribution very close to a normal distribution.

How big a palindrome can we use this method on? As the palindromes get bigger, the expected counts get smaller, and it is harder to tell whether the number is really smaller than expected.

Say we want to have 0.01 probability of a false positive. If we check 4^4 (256) palindromes (the number of palindromes of length 8), we would need a probability of 3.9E-05 for a false reading for each one. According to a normal probability table (http://www.math.unb.ca/~knight/utility/NormTble.htm), we would need a Z-score < -4 for this level of significance. For a zero count to be at least 4 standard deviations below the expected value, we would need for the expected value to be at least 16. For a megabase genome, the expected number of counts for the 4^8 (64k) different 8-mers should be around 16, so 8-long palindromes are about as big as we can go.

Note that if we wanted to check all 64k 8-mers for under-representation and not just the palindromes, we'd need a probability of 1.5E-07 of a false positive, or a Z-value around -5.1. This would require expected counts of at least 26, or a 1.7 megabase genome.

Assignment

I will provide a few genomes for various prokaryotes as FASTA files (actually, as gzipped fasta files—you will need to use "gunzip -c" to read them as fasta):
A.fulgidus.fa.gz
A.pernix.fa.gz
H.influenzae.fa.gz
H.pylori.fa.gz
M.jannaschii.fa.gz
M.thermoautotrophicus.fa.gz
N.meningitidus.fa.gz
P.abyssi.fa.gz
P.furiosus.fa.gz
P.horikoshii.fa.gz
S.acidocaldarius.fa.gz
S.solfataricus.fa.gz
S.tokadaii.fa.gz
V.cholerae.I.fa.gz
V.cholerae.II.fa.gz
V.fischeri.I.fa.gz
V.fischeri.II.fa.gz
V.parahaemolyticus.I.fa.gz
V.parahaemolyticus.II.fa.gz
V.vulnificus.I.fa.gz
V.vulnificus.II.fa.gz

These genomes are also available directly from /cse/classes/bme205/Fall05/ on the School of Engineering computers, and it is fairly simple to find other full prokaryotic genomes on the web. There will be one sequence per file, but your program should be able to handle multiple sequences in one file, so that it can be applied to multi-chromosome genomes or genomes that are not fully assembled. You could also apply the program to a set of genomes from the same genus, as the extra data would allow you to search for slightly longer palindromes. Note: the files take up about 45 Mbytes unzipped, so you probably want to leave them in the compressed format.

Write a program, named search_nmer that has 3 parameters:

--genome which specifies the fasta file to use 

--Z which specifies the maximum Z score to include in your report.
These values will be negative, because we are looking for cases where an observed value is below the predicted mean. (For genomes of the size we will be using, the maximum Z-score will be about -5.)

--maxSize which specifies the maximum size word to look for. We will use 8 for testing.

Your program would then be executed as follows (for example):

search_nmer -genome H.influenzae.fa.gz -Z -5.5 -maxSize 8

Note: you can open a gzipped or non-gzipped file in perl fairly easily:

open (FILE, ($filename =~ /[.]gz$/) ?  
                      "gunzip -c $filename |"
		    : "< $filename");

The program should use hashes or arrays to count how many occurrences there are of each n-mer up to length maxSize. Because we are keeping maxSize fairly small, there are no memory problems in storing this information. For each palindrome of length 2, 3, 4, ... up to length maxSize, compute the expected number of that palindrome and determine whether the observed count is significantly different from the expected count.

Your output report should look something like this:

# Analyzing H.influenzae.fa.gz for underrepresented palindromes of length 6 through 6.
# Reporting those with Z-scores smaller than -5.5 
#
# There are 1829276 positions in which a palindrome may be looked for.
#
# Sequence Observed      Expected      std_dev     Z_score 
TTTAAA        2284       2585.11       50.81        -5.93
ATATAT        425        565.58        23.78        -5.91
GAATTC        302        668.74        25.86        -14.18
CATATG        173        336.88        18.35        -8.93
AATATT        2034       2408.28       49.04        -7.63
AAGCTT        359        651.07        25.51        -11.45
AAATTT        2904       3379.48       58.08        -8.19

If you sort your output by length of word and alphabetically within that, you may see helpful patterns, such as palindromic sequences that have a don't-care region in the center.

  1. Run your program for 3 of the prokaryotic genomes. You might consider using the Rocha paper in order to choose species that may have interesting results.
  2. Write up a short discussion -- that includes your data, along with an interpretation of their meaning. Is there anything special about these sequences? (Try Google, or the following link: http://www.neb.com/nebecomm/tech_reference/restriction_enzymes/cross_index_recognition_sequences.asp

    )
  3. How did you deal with unknown nucleotides? How did you calculate N?
  4. How did you decide an appropriate significance level (Z-score) to use?
  5. Does this data suggest anything about the species you have selected? The web site http://rebase.neb.com/rebase/rebase.html has a search site for finding known restriction enzymes by organism or sequence (among other methods). Do your finds match up with known restriction enzymes?

Extra credit:

  1. Allow the --genome option to be repeated, so that multiple files can be read in. This would allow you to analyze the full set of Vibrio genomes simultaneously, for example, rather than one chromosome at a time.
  2. What we would really like to use is an E-value, rather than a Z-score. It would be nice if we could specify E as a parameter on the command line. To make this work, we need a way to convert from Z-scores to E-values or vice-versa. You can find PERL functions for computing the necessary transformations on the web—for example, look at http://home.online.no/~pjacklam/notes/invnorm/ Note that you have to consider how many tests you are really doing to convert the p-value that you get from the inverse of the normal cumulative distribution into an E-value. This would be the number of different palindromes or words that you are evaluating your z-score on.
  3. It might be useful to have an option for looking at all n-mers, and not just palindromic ones. This changes slightly what the significant Z-score should be (for the same E-value).
  4. It might also be useful to look for over-represented palindromes (or n-mers) as well as under-represented ones.
  5. Several of the odd-length palindromes have a don't-care base in the center, but our current algorithm only checks for the exactly matching base. You could merge the counts for abcdxd'c'b'a' for all values of x to do the same sort of analysis for palindromes with a don't-care base in the center. Because you are merging counts, you could go up to one longer palindromes this way.
  6. You could go even further, looking for palindromes with 2, 3, or more don't-care bases in the center. Since the DNA-binding mechanism does not require that the two protein domains bind adjacent to each other, it is quite common to have a pair of sites that are separated by a few bases.
  7. If you are feeling really ambitious, try coming up with a different way to estimate the probability of a given palindrome. Does it produce better results? If so, you may have found something worth publishing.

Things learned after assignment given

Students came up with some remarkably inefficient ways to do this assignment (like checking every n-mer in the input to see if it is a palindrome). It is much simpler and cleaner to make a hash containing all observed n-mers (no need to index by n, and no need to pre-fill hash with potential n-mers).

Once a table of the the n-mer counts is made, the palindromes can be enumerated and the appropriate counts looked up. If students want to do palindrome detection on the keys of the nmer hash rather than enumeration, that can be done also, but should probably use reverse() and tr/ACGT/TGCA/ rather than slow and complicated loops.

Many people use m/$word/g to count n-mers, but this gets incorrect results for overlapping words. For example, looking for ATA in ATATA should count 2, but the regular expression method only sees the first one.


slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
Biomolecular Engineering Dept.
BME 205 home page Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building