UCSC BME 205 Fall 2007

Bioinformatics: Models and Algorithms
PERL Assignment 5

(Last Update: 13:55 PST 26 November 2007 )

Does the ORF code for a protein?
Due Wed 21 Nov 2007 (11 a.m.)

This exercise is mainly one in exploring appropriate null models for deciding whether to accept or reject a hypothesis,

Jonathan Trent had found an open reading frame (ORF) that was 388 codons long on the reverse strand of the gene of a protein that he studies (thermosome subunit alpha, THSA_SULSH), and he was wondering whether or not it coded for a protein as well. The putative gene product of the ORF is tentatively named "oops". [Kagawa HK, Osipiuk J, Maltsev N, Overbeek R, Quaite-Randall E, Joachimiak A, Trent JD. doi:10.1006/jmbi.1995.0585 The 60 kDa heat shock proteins in the hyperthermophilic archaeon Sulfolobus shibatae. Journal of Molecular Biology 1995 Nov 10;253(5):712-25.]

There are several approaches possible for gathering evidence for or against the hypothesis that oops is a protein. These include such diverse methods as looking for evidence that homologous proteins exist in other species (particularly as stand-alone genes---not just reverse strands of other thermosomes), looking for codon bias in the ORF, looking for promoters in the appropriate places, doing structure predictions for the putative protein, and so forth. For this assignment we will look at only one question: how unexpected is the 388-codon-long ORF on the reverse strand of a 560-codon-long gene?

Since this is a perl-programming exercise, we will write a short program that generates DNA sequences using various null hypotheses, and estimate the p-value of finding that long an ORF "by chance" on the opposite strand of the thermosome gene.

We will need species-specific training data for our various null models. Unfortunately, Sulfolobus shibatae has not been fully sequenced, but the very closely related species Sulfolobus solfataricus P2 has been sequenced. Furthermore, using tblastn to search for homologs of oops reveals that S. solfataricus has a 95% identical ORF (with no stop codons) that is at least as long opposite one of its thermosome genes, so we can do the analysis in S. solfataricus, whose codon bias table is easily available. (You can google Sulfolobus solfataricus codon to find this and other codon-usage tables.)

A big part of the point of this exercise is to point out the importance of choosing the right null model. The hypothesis we are testing is that the ORF is a protein-coding gene---that the absence of stop codons has been selected for through evolution by the need to make the full-length protein product. Our null hypothesis is that no selection is being done on the ORF.

Here are three null models corresponding to different versions of the null hypothesis:

  1. The DNA is undergoing no selection at all.

    The 560*3 bases of the sequence are chosen at random (using a 0-order Markov chain) from the distribution of bases in the genome of the species. This model is the most commonly used null model for determining how unlikely an ORF is. Since the frequency of G is the same as C, and the frequency of A is the same as T, this model has only one parameter---the G+C frequency in the genome, which is 36.3% for S. solfataricus, according to http://www.fut.es/~debb/HGT/ssol.d/welcome.html.

    Note: since we are only looking at one strand of DNA, one might think that G=C and A=T constraints on double-stranded DNA do not apply. Indeed, for short stretches it is possible to have an imbalance, but since there is not a "preferred" strand for the genome as a whole, the G/C and A/T ratios are very close to one for long enough stretches of DNA. For our purposes, having a single G+C frequency for this model is sufficiently detailed.

  2. The DNA is on the opposite strand of a 560-codon-long protein-coding gene for S. solfataricus.

    The string of 1680 bases is constructed by taking the reverse complement of a string of 560 codons (not including the STOP codons) drawn from the distribution of codons for the species. The codon preference table is easily found on the web---for a newly sequenced genome, you would first have to identify a number of genes (probably by homology to other species) and count the codons used in those genes.

  3. The DNA is on the opposite strand of THSA_SULSH.

    The string of bases is is constructed by taking the reverse complement of a series of codons that codes for THSA_SULSH, using the codon biases of S. solfataricus. That is, where the sequence of THSA_SULSH has an L, we randomly select a leucine-encoding codon, using the frequencies of the codons from the codon bias table. You should be able to find the sequence for THSA_SULSH in Entrez or Swissprot with no difficulty.

Each of the three null models is a generative model that constructs a string of 560*3=1680 bases. The open reading frame could appear in any of the 3 reading frames of this string.

The program

Write a program named reverseORF that accepts (at least) the following options:
--gcNull
Use the 0-order model for bases, with the specified fraction of the genome being G or C. (Note: Getopt::Long distinguishes between integers "n" and floating-point "f" parameters.) The default if no option is specified should be --gcNull 0.5
--protein
Read the protein sequence for the forward strand (in FASTA format) from the specified file.
--codon
Read the codon bias table from the specified file. (Use a standard format so that you can download tables from the Web.) If a codon table is specified, but no protein is specified, the program should use the random-codon model.
--num_sequences
The default number of sequences generated in the sample should be 10 000.

For each sequence generated, scan the three possible reading frames and determine the longest open reading frame (sequence of codons that does not contain a stop codon) in the sequence.

Output a histogram indicating how often each length occurred as the longest ORF. The output should have three tab-delimited columns: the length, the number of occurrences of that length as the longest ORF, and the estimated probability of seeing that long an ORF or longer (P-value).

Provide plots (using gnuplot or similar tool) of the histograms for the three different null models, using at least 10,000 samples for each null model. Interpret the results: is the 388-long ORF unlikely to have arisen by chance? Is the evidence for "oops" strong enough to justify doing lab work to look for evidence of transcription and translation of the sequence?

You should turn in a PDF file which includes your plots and your interpretations of them. Personally, I prefer to generate such PDF files using LaTeX and the epsfig package, with gnuplot to generate the plots as eps files, but you are free to use other methods for generating the PDF files.

Bonus Points

Which frame the longest ORF appears in may not be uniformly distributed among the three frames. Modify your program to get separate histograms for the longest ORF in all three frames and the longest ORF in each of the three frames. Which frame produces the longest ORFs in models that assume you are opposite a protein-coding gene? Why?

Come up with closed-form computation for the probabilities using one or more of the null models.

This assignment looked only at the probability of this long an ORF opposite this particular gene. A more "bioinformatic" approach would look at all the genes of the genome and figure out the E-value (expected number of ORFs this long on reverse strands of genes) for the whole genome. Does the distribution of observed ORFs deviate significantly from the expected distribution using the null model(s)? Is the ORF that Jonathan Trent found an outlier, or is it pretty typical of ORFs on reverse strands of genes?

Design other computational tests for whether the oops sequence is a gene. You may need to get the DNA sequence for the ORF, which can be found in Entrez (I found it searching the protein data base for "trent sulfolobus"). The S. solfataricus genome is also available on the http://archaea.ucsc.edu genome browser.


Learned for next year's assignment



slug icon to go to Scool of Engineering home page
SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
BME-slug-icon
BS, MS, and PhD programs
BME 205 home page Karplus's lab page UCSC Bioinformatics research

Questions about page content should be directed to

Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building