Data

  For the sequence library, we use the EcoSeq6 database [11]. The EcoSeq6 database provides a wealth of information for the computational biologist. It contains 458 distinct "contigs", regions of the E. coli chromosome that have already been sequenced. The database is compact enough to be stored on a modest workstation. The contigs contain genes previously identified by biochemical methods, but the gene record is incomplete and partially incorrect.

The datasets we used for training and testing the parser are a subset of those used by Krogh in [7]. We used 235 training contigs (478 when reverse complements were counted) and 100 test contigs (200 with reverse complements). Krogh's sets partition a subset of EcoSeq6 into a training set and a test set. We removed a number of contigs because they are too large to be parsed within current memory limitations. Since the parser generates at least two matrices whose size grows with the mapped size of the contig, a contig whose mapped size is 1000 will use more than a megabyte of memory; when we parse 100 or more contigs at a time, this causes the program to fail on a DEC Alpha with about 90 MB of RAM.
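
The memory argument can be made concrete with a rough back-of-the-envelope sketch. It rests on two assumptions that the text does not state: that each matrix is square in the mapped size (N by N cells for a contig of mapped size N), and that a cell occupies one byte, which gives a lower bound. The function names are hypothetical and are not part of the parser.

```python
def parse_memory_bytes(mapped_size, n_matrices=2, bytes_per_cell=1):
    """Lower-bound estimate of matrix memory for one contig.

    Assumption (not from the text): each matrix is mapped_size x mapped_size
    and each cell takes bytes_per_cell bytes.
    """
    return n_matrices * mapped_size * mapped_size * bytes_per_cell


def batch_fits(mapped_sizes, ram_bytes, **kwargs):
    """Return (fits, total_bytes) for parsing all contigs at once."""
    total = sum(parse_memory_bytes(n, **kwargs) for n in mapped_sizes)
    return total <= ram_bytes, total


MB = 1_000_000
one_contig = parse_memory_bytes(1000)                  # 2,000,000 bytes
fits, total = batch_fits([1000] * 100, 90 * MB)        # 100 contigs, 90 MB
print(one_contig, total, fits)                         # 2000000 200000000 False
```

Under these assumptions a single contig of mapped size 1000 needs about 2 MB, and 100 such contigs need about 200 MB, well beyond 90 MB. A larger cell size (for example `bytes_per_cell=4`) only widens the gap.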

  
Table: Programs Used in Parsing the E. coli Genome



David Konerding
Sun May 21 12:19:38 PDT 1995