Abstract:

We implement a variant of the Optimal Parse method described by Stormo and Haussler [17] as a C++ program that recognizes gene-encoding regions in anonymous E. coli DNA sequences (called "contigs"). The "parse" produced by the program is a set of nonoverlapping, alternating gene and nongene regions on a given DNA sequence. Gene and nongene regions are recognized using scoring functions that assign each possible subsequence of a contig a probability representing its similarity to a gene or nongene region. By using probability "sensors" that model the statistical content of genes and nongenes, we can build functions that yield the "correct" parse, defined as the path through the sequence of alternating gene/nongene regions that has the highest probability. The Optimal Parse method allows us to combine multiple types of evidence, such as codon usage, translation initiation sites, and start and stop codon usage. Using machine learning to adjust the weights given to each type of evidence, we train the program on contigs for which the correct parse is known in order to maximize its performance.
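The search for the highest-probability parse can be illustrated with a generic dynamic program over alternating segmentations. The following is only a minimal sketch under stated assumptions, not the program described here: the names `RegionType`, `Region`, `Scorer`, `optimalParse`, and `makeToyScorer` are hypothetical, scores are treated as log-probabilities that add across regions, and the toy scorer (G/C bases counted as gene-like, with a fixed per-region penalty) merely stands in for the real sensors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Region types in a parse; consecutive regions must alternate.
enum RegionType { NONGENE = 0, GENE = 1 };

struct Region {
    RegionType type;
    std::size_t begin;  // inclusive
    std::size_t end;    // exclusive
};

// Hypothetical scoring interface: log-score that positions [begin, end)
// form a region of the given type.
using Scorer = std::function<double(RegionType, std::size_t, std::size_t)>;

// Highest-scoring parse of a sequence of length n into alternating regions.
std::vector<Region> optimalParse(std::size_t n, const Scorer& score) {
    assert(n > 0);
    const double NEG_INF = -std::numeric_limits<double>::infinity();
    // best[t][j]: best score of a parse of [0, j) whose last region has type t.
    std::vector<std::vector<double>> best(2, std::vector<double>(n + 1, NEG_INF));
    // from[t][j]: start of that last region, kept for the traceback.
    std::vector<std::vector<std::size_t>> from(2, std::vector<std::size_t>(n + 1, 0));

    for (std::size_t j = 1; j <= n; ++j) {
        for (int t = 0; t < 2; ++t) {
            for (std::size_t i = 0; i < j; ++i) {
                // The first region has no predecessor; every later region
                // must follow a region of the other type ending at i.
                double prev = (i == 0) ? 0.0 : best[1 - t][i];
                double s = prev + score(static_cast<RegionType>(t), i, j);
                if (s > best[t][j]) {
                    best[t][j] = s;
                    from[t][j] = i;
                }
            }
        }
    }

    // Trace back from the better of the two final states.
    int t = (best[GENE][n] > best[NONGENE][n]) ? GENE : NONGENE;
    std::vector<Region> parse;
    for (std::size_t j = n; j > 0; t = 1 - t) {
        std::size_t i = from[t][j];
        parse.push_back({static_cast<RegionType>(t), i, j});
        j = i;
    }
    std::reverse(parse.begin(), parse.end());
    return parse;
}

// Toy stand-in for the sensors (an assumption for illustration only):
// +1 for each base that matches the region type (G/C for GENE, A/T for
// NONGENE), -1 otherwise, and a fixed penalty of 3 per region.
Scorer makeToyScorer(const std::string& seq) {
    return [seq](RegionType type, std::size_t begin, std::size_t end) {
        double s = -3.0;
        for (std::size_t k = begin; k < end; ++k) {
            bool gc = (seq[k] == 'G' || seq[k] == 'C');
            s += (gc == (type == GENE)) ? 1.0 : -1.0;
        }
        return s;
    };
}
```

Calling `optimalParse(seq.size(), makeToyScorer(seq))` returns the regions in left-to-right order. As written the sketch evaluates the scorer O(n^2) times, so a practical implementation would need incremental scores or a restricted set of candidate boundaries, for example start and stop codons.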



David Konerding
Sun May 21 12:19:38 PDT 1995