For the sequence library, we use the EcoSeq6 database [11], which provides a wealth of information for the computational biologist. It contains 458 distinct ``contigs'', regions of the E. coli chromosome that have already been sequenced, and it is compact enough to be stored on a modest workstation. The contigs contain genes previously identified by biochemical methods, but the gene record is incomplete and partially incorrect.
The datasets we used for training and testing the parser are a
subset of those used by Krogh in [7]. We used 235
training contigs (478 when reverse complements were counted) and 100
test contigs (200 with reverse complements). Krogh's sets partition a subset
of EcoSeq6 into a training set and a test set. We have removed a number of the
contigs because they are too large to be parsed with current memory
limitations. Since the parser generates at least two matrices of size
$n \times n$ for a contig of mapped size $n$, a contig whose mapped size is 1000 will use more than a
megabyte of memory; when we parse 100 or more contigs at a time,
this causes the program to fail on a DEC Alpha with about 90 MB of
RAM.
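The memory argument above can be checked with a rough back-of-the-envelope sketch. It assumes that each contig needs two square matrices whose side equals the mapped size and that each entry occupies one byte; both the matrix shape and the entry width are illustrative assumptions, not figures taken from the parser, and the function names are hypothetical.

```python
def parse_memory_bytes(mapped_size, matrices=2, bytes_per_entry=1):
    """Lower-bound memory for one contig.

    Assumes `matrices` square tables of side `mapped_size`, each entry
    taking `bytes_per_entry` bytes (both are assumptions for illustration).
    """
    return matrices * mapped_size * mapped_size * bytes_per_entry


def batch_fits(mapped_sizes, ram_bytes):
    """Return True if the tables for all contigs fit in RAM at once."""
    total = sum(parse_memory_bytes(n) for n in mapped_sizes)
    return total <= ram_bytes


RAM_BYTES = 90 * 2**20  # roughly the 90 MB machine mentioned in the text

# One contig of mapped size 1000: two 1000 x 1000 tables.
print(parse_memory_bytes(1000))

# One hundred such contigs held in memory together.
print(batch_fits([1000] * 100, RAM_BYTES))
```

Under these assumptions a single contig of mapped size 1000 already needs about two megabytes, so a batch of 100 such contigs exceeds 90 MB, whereas a batch of ten would fit comfortably. Wider matrix entries would only tighten the limit.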
Table: Programs Used in Parsing the E. coli Genome