next up previous contents
Next: Limitations of the Up: Introduction Previous: HMMs and Neural

Parsers

Building upon the work of those two groups, we have built a contig parser program using Optimal Parse, a dynamic programming technique described by Haussler and Stormo [17], to find gene-encoding regions in E.Coli DNA.

In a manner similar to HMMs, it can be instructive to picture the creation of these contigs by a ``generator''. The generator is a machine which emits nonrandom sequences of DNA in a probabilistic manner. The sequence emitted by the generator is a representation of the path the generator made between the two opposing models, gene coding and nongene regions. There are probabilities associated with the transitions between models, as well as probabilities relating to the models themselves.

The constrains are:

Edges
The generator may begin with either a nongene or gene encoding region.
Alternation
The generator alternates between nongene and gene encoding regions.
Length
The generator determines the length of a nongene or gene encoding region and then generates the sequence of the region

For a gene region, the generator can choose between many different lengths for the gene, then once it picks the length K in codons, it chooses between the three classic start codons ``ATG, GTG, TTG''. By far ATG was the most common start codon, seen in 92% of the genes in the training set (See section gif). The generator next steps k times through the Gene Codon model, emitting codons probabilistically according to the frequency of codons seen in the training set. It finishes by stepping once through the stop codon model, generating a stop codon from the three classic stop codons ``TAA, TGA, TAG''. 66% of the genes in the training set were TAA.

Nongene regions are chosen in a similar way, but there are no start codon or stop codon models, and the codon model used represents the frequencies of noncoding regions instead of coding regions.

Our program attempts to determine the specific set of steps by which the imaginary generator built the sequence; the program's determination is called a ``parse''. A parse of a sequence S is a partition of S into consecutive nonempty regions which are alternately labeled as a gene-encoding or nongene region. A parse may start and end with either type of region, so there are possible parses of S which is length l. We are able to reduce the number of possible parses to be considered to be a tiny fraction of that number by eliminating parses which do not satisfy our constraints (such as proper start and stop codons). Without this refinement, the parser's demand for CPU and memory would make contig parsing intractable.

To find the best (most likely) parse of a contig, we first score each possible subsequence of the contig according to its similarity to gene encoding or nongene regions. We have developed functions (``sensors'') which return probabilities (more accurately, logarithms of probabilities, which are more convenient to handle) for each subsequence. Sensors analyze regions for their gene encoding or nongene nature by using statistics such as gene-codon measure, start codon and stop codon content, and a length histogram.

The scores are used by the optimal parse algorithm to find the highest scoring (``best'') parse using dynamic programming. The best parse represents the most likely partition according to our sensors.

Using machine learning (gradient descent) the parser can change the weights of the sensors with respect to each other with the goal of finding weights which maximize the probability of the correct parse. For a detailed derivation of the learning algorithm, see [17].



next up previous contents
Next: Limitations of the Up: Introduction Previous: HMMs and Neural



David Konerding
Sun May 21 12:19:38 PDT 1995