Abstract:
We implement a variant of the Optimal Parse method described by
Stormo and Haussler [17] as a C++ program that recognizes
gene-encoding regions in anonymous E. coli DNA sequences (``contigs'').
The ``parse'' produced by the program is a set of nonoverlapping,
alternating gene and nongene regions on a given DNA sequence. Gene
and nongene regions are recognized using scoring functions that
assign each possible subsequence of a contig a probability
representing its similarity to a gene or nongene region. By using
probability ``sensors'' that model the statistical content of genes
and nongenes, we can build scoring functions that yield the
``correct'' parse: the division of the sequence into alternating
gene and nongene regions that has the highest probability. The
Optimal Parse method allows us to combine multiple types of
evidence, such as codon usage, translation initiation sites, and
start and stop codon usage. Because the weight given to each type of
evidence can be adjusted by machine learning, we can train the
program on contigs for which the correct parse is known in order to
maximize its performance.
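
The search for the highest-probability parse can be illustrated with a
small dynamic program. The following is only a minimal sketch under
stated assumptions, not the program described here: the names
optimalParse, baseLogProb, and Region are hypothetical, the ``sensor''
is a toy per-base composition model (G/C favored in genes, A/T in
nongenes) standing in for real evidence such as codon usage, and
segmentPenalty is an assumed fixed log-probability cost per region.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

enum RegionType { NONGENE = 0, GENE = 1 };

struct Region {
    RegionType type;
    std::size_t begin;  // inclusive
    std::size_t end;    // exclusive
};

// Toy sensor (an assumption for illustration only): genes favor G/C,
// nongenes favor A/T. Returns the log probability of one base.
double baseLogProb(RegionType type, char base) {
    const bool gc = (base == 'G' || base == 'C');
    const bool favored = (type == GENE) ? gc : !gc;
    return std::log(favored ? 0.4 : 0.1);
}

// Returns the highest-scoring division of seq into alternating
// gene/nongene regions. segmentPenalty is a log-probability cost
// (normally negative) added once per region. Runs in O(n^2) time.
std::vector<Region> optimalParse(const std::string& seq,
                                 double segmentPenalty) {
    const std::size_t n = seq.size();
    std::vector<Region> regions;
    if (n == 0) return regions;

    const double NEG_INF = -std::numeric_limits<double>::infinity();

    // prefix[t][i] = summed log probability of seq[0..i) under model t,
    // so any region score is a difference of two prefix sums.
    std::vector<std::vector<double>> prefix(2, std::vector<double>(n + 1, 0.0));
    for (int t = 0; t < 2; ++t)
        for (std::size_t i = 0; i < n; ++i)
            prefix[t][i + 1] =
                prefix[t][i] + baseLogProb(static_cast<RegionType>(t), seq[i]);

    // best[t][j] = best score of a parse of seq[0..j) whose last region
    // has type t; back[t][j] = start position of that last region.
    std::vector<std::vector<double>> best(2, std::vector<double>(n + 1, NEG_INF));
    std::vector<std::vector<std::size_t>> back(2, std::vector<std::size_t>(n + 1, 0));

    for (std::size_t j = 1; j <= n; ++j) {
        for (int t = 0; t < 2; ++t) {
            for (std::size_t i = 0; i < j; ++i) {
                // The region before [i, j) must have the opposite type;
                // a region starting at 0 has no predecessor.
                const double prev = (i == 0) ? 0.0 : best[1 - t][i];
                const double cand =
                    prev + (prefix[t][j] - prefix[t][i]) + segmentPenalty;
                if (cand > best[t][j]) {
                    best[t][j] = cand;
                    back[t][j] = i;
                }
            }
        }
    }

    // Trace back from the end of the sequence, flipping type each step.
    int t = (best[GENE][n] > best[NONGENE][n]) ? GENE : NONGENE;
    std::size_t j = n;
    while (j > 0) {
        const std::size_t i = back[t][j];
        regions.push_back({static_cast<RegionType>(t), i, j});
        j = i;
        t = 1 - t;
    }
    std::reverse(regions.begin(), regions.end());
    return regions;
}
```

With this toy sensor, calling optimalParse("ATATGCGCGCATTA",
std::log(0.1)) yields three regions: nongene [0,4), gene [4,10), and
nongene [10,14). In a fuller version, the per-region score would
presumably be a weighted sum of several sensors, with the weights as
the parameters that training adjusts.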
David Konerding
Sun May 21 12:19:38 PDT 1995