This exercise is mainly one in exploring appropriate null models for deciding whether to accept or reject a hypothesis,
Jonathan Trent had found an open reading frame (ORF) that was 388 codons long on the reverse strand of the gene of a protein that he studies (thermosome subunit alpha, THSA_SULSH), and he was wondering whether or not it coded for a protein as well. The putative gene product of the ORF is tentatively named "oops". [Kagawa HK, Osipiuk J, Maltsev N, Overbeek R, Quaite-Randall E, Joachimiak A, Trent JD. doi:10.1006/jmbi.1995.0585 The 60 kDa heat shock proteins in the hyperthermophilic archaeon Sulfolobus shibatae. Journal of Molecular Biology 1995 Nov 10;253(5):712-25.]
There are several approaches possible for gathering evidence for or against the hypothesis that oops is a protein. These include such diverse methods as looking for evidence that homologous proteins exist in other species (particularly as stand-alone genes---not just reverse strands of other thermosomes), looking for codon bias in the ORF, looking for promoters in the appropriate places, doing structure predictions for the putative protein, and so forth. For this assignment we will look at only one question: how unexpected is the 388-codon-long ORF on the reverse strand of a 560-codon-long gene?
Since this is a perl-programming exercise, we will write a short program that generates DNA sequences using various null hypotheses, and estimate the p-value of finding that long an ORF "by chance" on the opposite strand of the thermosome gene.
We will need species-specific training data for our various null models.
Unfortunately, Sulfolobus shibatae has not been fully sequenced, but
the very closely related species Sulfolobus solfataricus P2 has been sequenced.
Furthermore, using tblastn to search for homologs of oops reveals that
S. solfataricus has a 95% identical ORF (with no stop codons) that is
at least as long opposite
one of its thermosome genes, so we can do the analysis in
S. solfataricus, whose
codon bias table is easily available.
(You can google
A big part of the point of
this exercise is to point out the importance of choosing the right
null model. The hypothesis we are testing is that the ORF is a
protein-coding gene---that the absence of stop codons has been
selected for through evolution by the need to make the full-length
protein product. Our null hypothesis is that no selection is being
done on the ORF.
Here are three null models corresponding to different versions of the
null hypothesis:
The 560*3 bases of the sequence are chosen at random (using a 0-order
Markov chain) from the distribution of bases in the genome of the species.
This model is the most commonly used null model for determining how
unlikely an ORF is. Since the frequency of G is the same as C, and
the frequency of A is the same as T, this model has only one
parameter---the G+C frequency in the genome, which is 36.3% for
S. solfataricus, according to http://www.fut.es/~debb/HGT/ssol.d/welcome.html.
Note: since we are only looking at one strand of DNA, one might
think that G=C and A=T constraints on double-stranded DNA do not apply.
Indeed, for short stretches it is possible to have an imbalance, but
since there is not a "preferred" strand for the genome as a whole, the
G/C and A/T ratios are very close to one for long enough stretches of DNA.
For our purposes, having a single G+C frequency for this model is
sufficiently detailed.
The string of 1680 bases is constructed by taking the reverse complement of
a string of 560 codons (not including the STOP codons) drawn
from the distribution of codons for the species. The codon preference
table is easily found on the web---for a newly sequenced genome, you
would first have to identify a number of genes (probably by homology
to other species) and count the codons used in those genes.
The string of bases is is constructed by taking the reverse complement
of a series of codons that codes for THSA_SULSH, using the codon
biases of S. solfataricus. That is, where the sequence of THSA_SULSH
has an L, we randomly select a leucine-encoding codon, using the
frequencies of the codons from the codon bias table. You should be
able to find the sequence for THSA_SULSH in Entrez or Swissprot with
no difficulty.
For each sequence generated, scan the three possible reading frames
and determine the longest open reading frame (sequence of codons that
does not contain a stop codon) in the sequence.
Output a histogram indicating how often each length occurred as the
longest ORF. The output should have three tab-delimited columns:
the length, the number of occurrences of that length as the longest
ORF, and the estimated probability of seeing that long an ORF or
longer (P-value).
Provide plots (using gnuplot or similar tool) of the histograms for
the three different null models, using at least 10,000 samples for
each null model. Interpret the results: is the 388-long ORF unlikely
to have arisen by chance? Is the evidence for "oops" strong enough to
justify doing lab work to look for evidence of transcription and
translation of the sequence?
You should turn in a PDF file which includes your plots and your
interpretations of them. Personally, I prefer to generate such PDF
files using LaTeX and the epsfig package, with gnuplot to generate the
plots as eps files, but you are free to use other methods for
generating the PDF files.
Which frame the longest ORF appears in may not be uniformly
distributed among the three frames. Modify your program to get
separate histograms for the longest ORF in all three frames and the
longest ORF in each of the three frames. Which frame produces the
longest ORFs in models that assume you are opposite a protein-coding
gene? Why?
Come up with closed-form computation for the probabilities using
one or more of the null models.
This assignment looked only at the probability of this long an ORF
opposite this particular gene. A more "bioinformatic" approach would
look at all the genes of the genome and figure out the E-value
(expected number of ORFs this long on reverse strands of genes) for
the whole genome. Does the distribution of observed ORFs deviate
significantly from the expected distribution using the null model(s)?
Is the ORF that Jonathan Trent found an outlier, or is it pretty
typical of ORFs on reverse strands of genes?
Design other computational tests for whether the oops sequence is a gene.
You may need to get the DNA sequence for the ORF, which can be found
in Entrez (I found it searching the protein data base for "trent sulfolobus").
The S. solfataricus genome is also available on the http://archaea.ucsc.edu genome browser.
Questions about page content should be directed to
Kevin Karplus
Each of the three null models is a generative model that constructs a
string of 560*3=1680 bases. The open reading frame could appear in
any of the 3 reading frames of this string.
The program
Write a program named reverseORF that accepts (at least) the following
options:
Bonus Points
Learned for next year's assignment
SoE home
Kevin Karplus's home page
BS, MS, and PhD programs
BME 205 home page
Karplus's lab page
UCSC Bioinformatics research
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building