Building upon the work of those two groups, we have built a contig parser program using Optimal Parse, a dynamic programming technique described by Haussler and Stormo [17], to find gene-encoding regions in E.Coli DNA.
In a manner similar to HMMs, it can be instructive to picture the creation of these contigs by a ``generator''. The generator is a machine which emits nonrandom sequences of DNA in a probabilistic manner. The sequence emitted by the generator is a representation of the path the generator made between the two opposing models, gene coding and nongene regions. There are probabilities associated with the transitions between models, as well as probabilities relating to the models themselves.
The constrains are:
For a gene region, the generator can choose between many different
lengths for the gene, then once it picks the length K in codons, it
chooses between the three classic start codons ``ATG, GTG, TTG''. By
far ATG was the most common start codon, seen in 92% of the genes in
the training set (See section
). The generator next steps k
times through the Gene Codon model, emitting codons probabilistically
according to the frequency of codons seen in the training set. It
finishes by stepping once through the stop codon model, generating a
stop codon from the three classic stop codons ``TAA, TGA, TAG''. 66%
of the genes in the training set were TAA.
Nongene regions are chosen in a similar way, but there are no start codon or stop codon models, and the codon model used represents the frequencies of noncoding regions instead of coding regions.
Our program attempts to determine the specific set of steps by which
the imaginary generator built the sequence; the program's
determination is called a ``parse''. A parse of a sequence S is a
partition of S into consecutive nonempty regions which are
alternately labeled as a gene-encoding or nongene region. A parse may
start and end with either type of region, so there are
possible
parses of S which is length l. We are able to reduce the number
of possible parses to be considered to be a tiny fraction of that
number by eliminating parses which do not satisfy our constraints
(such as proper start and stop codons). Without this refinement, the
parser's demand for CPU and memory would make contig parsing
intractable.
To find the best (most likely) parse of a contig, we first score each possible subsequence of the contig according to its similarity to gene encoding or nongene regions. We have developed functions (``sensors'') which return probabilities (more accurately, logarithms of probabilities, which are more convenient to handle) for each subsequence. Sensors analyze regions for their gene encoding or nongene nature by using statistics such as gene-codon measure, start codon and stop codon content, and a length histogram.
The scores are used by the optimal parse algorithm to find the highest scoring (``best'') parse using dynamic programming. The best parse represents the most likely partition according to our sensors.
Using machine learning (gradient descent) the parser can change the weights of the sensors with respect to each other with the goal of finding weights which maximize the probability of the correct parse. For a detailed derivation of the learning algorithm, see [17].