
Results

 

We measured the performance of the parser on a subset of the contigs provided by Anders Krogh and used in [7]. Krogh ran the HMM EColi gene finder both with and without an augmented intergenic model (similar to our promoter sensor), and reports finding 78.7% of the genes in the training set and 85.2% of the genes in the test set. Our results are lower. We note that the favorable performance of the HMM parser is due to its ability to map overlapping genes and tandem genes, and to its more finely tuned translational initiation site sensor.

Note that the number of false positives (genes found which were not in the database) cannot be compared between our results and Krogh's, because Krogh's numbers represent post-processed predictions. Krogh removed all predicted genes of length l < 20, as well as predicted genes of length l < 100 which touched either end of a contig. We did not apply this post-processing to remove false positives.

In addition, more false positives could be removed by another of Krogh's post-processing steps. Krogh removed short (incorrect) gene regions predicted on the strand complementary to one containing a real gene. These false genes appear because regions of DNA which are reverse complements of genes have start codon, stop codon, and gene codon usage that look similar to those of real genes.
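A minimal sketch of such a complementary-strand filter is given below. The source does not say how a "real" gene is recognized or what counts as "short", so this sketch makes two loudly labeled assumptions: a prediction is treated as a false "shadow" when it overlaps a longer prediction on the opposite strand, and the length cutoff `max_shadow_length` is a hypothetical parameter supplied by the caller. All names are illustrative.

```python
from typing import List, NamedTuple


class StrandedPrediction(NamedTuple):
    start: int   # half-open interval [start, end), forward-strand coordinates
    end: int
    strand: str  # "+" or "-"


def overlaps(a: StrandedPrediction, b: StrandedPrediction) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a.start < b.end and b.start < a.end


def drop_shadow_predictions(
    predictions: List[StrandedPrediction],
    max_shadow_length: int,  # hypothetical cutoff; not given in the source
) -> List[StrandedPrediction]:
    """Remove short predictions lying opposite a longer prediction."""
    kept = []
    for p in predictions:
        length = p.end - p.start
        # Assumption: the longer overlapping prediction on the other
        # strand stands in for the "real" gene.
        is_shadow = length <= max_shadow_length and any(
            q.strand != p.strand
            and overlaps(p, q)
            and (q.end - q.start) > length
            for q in predictions
        )
        if not is_shadow:
            kept.append(p)
    return kept


preds = [
    StrandedPrediction(100, 1000, "+"),
    StrandedPrediction(400, 520, "-"),
    StrandedPrediction(1200, 1300, "-"),
]
print(drop_shadow_predictions(preds, max_shadow_length=150))
```

Only the second prediction is removed here: it is short and lies opposite the longer forward-strand prediction, while the third is short but overlaps nothing.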

To interpret the table below:

Total Contigs
: Number of contigs given in the database (including the reverse complements of the contigs).
Total Genes
: Number of genes given in the database.

Mappable
: Genes which fit the constraint of beginning with a start codon and ending with a stop codon, as well as being in the database. Mappable genes are the genes which are members of the compressed feature matrices we built; we cannot predict genes which do not start with a start codon and end with a stop codon.

Found
: Number of genes correctly found from parsing the contigs.

Unmappable
: Genes which were predicted but do not fit the mappable constraints. It is not clear why the parser predicted such genes.

Unknown
: These are false positives: genes which were predicted but do not exist in the database. Some of these are genes which the parser did not know actually existed in the database, while others are genes that are not in the database but were correctly predicted by the parser. These warrant closer inspection to determine whether they are novel genes.
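The categories above can be tallied roughly as in the sketch below. This is not the code used to produce the table; it is an illustration under several assumptions: genes are half-open `(start, end)` intervals on a single strand, a gene counts as found only on an exact coordinate match, the start codon set is ATG/GTG/TTG (the source does not list the codons used), and any predicted gene failing the codon constraints is counted as Unmappable before Unknown is considered. All names are hypothetical.

```python
from typing import Dict, Set, Tuple

Gene = Tuple[int, int]  # assumed: half-open (start, end) on one strand

START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: not listed in the source
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


def is_mappable(contig: str, gene: Gene) -> bool:
    """True if the gene begins with a start codon and ends with a stop codon."""
    start, end = gene
    seq = contig[start:end]
    return len(seq) >= 6 and seq[:3] in START_CODONS and seq[-3:] in STOP_CODONS


def tally(contig: str, database: Set[Gene], predicted: Set[Gene]) -> Dict[str, int]:
    """Count genes per category for one contig."""
    mappable = {g for g in database if is_mappable(contig, g)}
    found = predicted & mappable  # exact coordinate match (assumption)
    unmappable = {g for g in predicted - found if not is_mappable(contig, g)}
    unknown = predicted - found - unmappable  # fits constraints, not in database
    return {
        "Total Genes": len(database),
        "Mappable": len(mappable),
        "Found": len(found),
        "Unmappable": len(unmappable),
        "Unknown": len(unknown),
    }


# Toy contig: two well-formed genes and one region lacking a start codon.
contig = "CC" + "ATGAAATAA" + "GG" + "ATGCCCTGA" + "TT" + "CCCAAATAA"
database = {(2, 11), (24, 33)}
predicted = {(2, 11), (13, 22), (24, 33)}
counts = tally(contig, database, predicted)
print(counts)
```

A found rate comparable to the percentages quoted earlier would then be `counts["Found"] / counts["Mappable"]` or `counts["Found"] / counts["Total Genes"]`, depending on which denominator is intended.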

  
Table: Performance of the parsers on the test and training sets in terms of prediction of genes.






David Konerding
Sun May 21 12:19:38 PDT 1995