We measured the performance of the parser on a subset of the contigs used in [7], provided by Anders Krogh. Krogh used the HMM EColi gene finder both with and without an augmented intergenic model (similar to our promoter sensor), and reports that 78.7% of the genes in the training set and 85.2% of the genes in the test set were found. Our results are lower. We note that the more favorable performance of the HMM parser is due to its ability to map overlapping genes and tandem genes, and to its more finely tuned translational initiation site sensor.
Note that the number of false positives (predicted genes that were not in the database) cannot be compared between our results and Krogh's, because Krogh's figures represent post-processed predictions. Krogh removed predicted genes of length l < 20, as well as predicted genes of length l < 100 that touched either end of a contig. We did not apply this post-processing to remove false positives.
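The length-based filtering described above can be sketched as follows. This is only an illustration under stated assumptions, not Krogh's implementation: the `Prediction` record, the function name, the half-open coordinate convention, and the reading of "touching" a contig end as a prediction that starts at position 0 or ends at the contig length are all hypothetical. The text does not state the unit of the length l, so the thresholds are simply taken in the same unit as the coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """A predicted gene on a contig (hypothetical record for illustration)."""
    start: int  # inclusive start coordinate (assumed convention)
    end: int    # exclusive end coordinate (assumed convention)

    @property
    def length(self) -> int:
        return self.end - self.start


def filter_short_predictions(predictions, contig_length,
                             min_len=20, min_edge_len=100):
    """Drop predictions with length < min_len, and predictions with
    length < min_edge_len that touch either end of the contig."""
    kept = []
    for p in predictions:
        # Assumption: "touching an end" means reaching coordinate 0
        # or the last position of the contig.
        touches_end = p.start <= 0 or p.end >= contig_length
        if p.length < min_len:
            continue
        if touches_end and p.length < min_edge_len:
            continue
        kept.append(p)
    return kept


if __name__ == "__main__":
    example = [Prediction(0, 50), Prediction(10, 25), Prediction(200, 260)]
    print(filter_short_predictions(example, contig_length=1000))
```

In this sketch only the interior prediction of length 60 survives: the first prediction is shorter than 100 and touches the start of the contig, and the second is shorter than 20.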
In addition, more false positives could be removed by another of Krogh's post-processing steps. Krogh removed short (incorrect) gene regions predicted on the strand complementary to one containing a real gene. These false genes appear because regions of DNA that are reverse complements of genes will have start, stop, and gene codon usage that looks similar to that of real genes.
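A minimal sketch of this kind of complementary-strand filtering is given below. It is not Krogh's procedure: the text does not say how "short" is defined or how the real gene is identified, so the `max_short_len` threshold, the use of a longer overlapping prediction on the opposite strand as a stand-in for the real gene, and all names here are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandedPrediction:
    """A predicted gene with strand information (hypothetical record)."""
    start: int   # inclusive start coordinate (assumed convention)
    end: int     # exclusive end coordinate (assumed convention)
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        return self.end - self.start


def overlaps(a, b):
    """True if the two half-open intervals share at least one position."""
    return a.start < b.end and b.start < a.end


def drop_opposite_strand_overlaps(predictions, max_short_len):
    """Drop short predictions that overlap a longer prediction on the
    opposite strand. max_short_len is an assumed, user-chosen threshold."""
    kept = []
    for p in predictions:
        overlapped_by_longer = any(
            q.strand != p.strand and q.length > p.length and overlaps(p, q)
            for q in predictions
        )
        if p.length <= max_short_len and overlapped_by_longer:
            continue  # treated as a likely reverse-complement artifact
        kept.append(p)
    return kept


if __name__ == "__main__":
    example = [
        StrandedPrediction(100, 1000, "+"),
        StrandedPrediction(300, 420, "-"),
        StrandedPrediction(1200, 1300, "-"),
    ]
    print(drop_opposite_strand_overlaps(example, max_short_len=150))
```

Here the short prediction on the "-" strand at 300 to 420 is dropped because it lies under the longer "+" strand prediction, while the short prediction at 1200 to 1300 is kept because nothing on the opposite strand overlaps it.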
To interpret the chart below:
Table: Performance of the parsers on the test and training sets in terms of prediction of genes.