Fix a DNA sequence S of letters from the deoxyribonucleic alphabet A, C, G, T.
The letters of S are numbered from 1 to l, where l is the
length of S. Since we are looking for genes, it will be convenient
to use a notation of codons. The sequence S of length l can also be
seen as a sequence of codons, $c_1, c_2, \ldots, c_n$, where n is the
number of codons. We use $c_i$ to represent the start codon, $c_j$ to
represent the stop codon, and $c_{i+1}, \ldots, c_{j-1}$ to represent the
codons between the start and the stop.
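The codon view of S can be illustrated with a minimal sketch. The helper names `split_codons` and `interior_codons` are hypothetical, the indices are 0-based (unlike the numbering in the text), and the start codon ATG and stop codons TAA, TAG, TGA are those of the standard genetic code, which the text does not spell out.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(seq):
    """Split seq into consecutive 3-letter codons, dropping a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[k:k + 3] for k in range(0, usable, 3)]


def interior_codons(codons, i, j):
    """Return the codons strictly between a start codon at i and a stop codon at j.

    Indices are 0-based positions in the codon list (an assumption of this sketch).
    """
    if not 0 <= i < j < len(codons):
        raise ValueError("need 0 <= i < j < number of codons")
    if codons[i] != START_CODON or codons[j] not in STOP_CODONS:
        raise ValueError("region must run from a start codon to a stop codon")
    return codons[i + 1:j]


codons = split_codons("ATGGCTAAATAACC")  # trailing "CC" is dropped
between = interior_codons(codons, 0, 3)
```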
For the sequence S, several upper-triangular matrices T are created,
one for each statistic G in a collection of ``gene statistics'' and
one for each statistic NG in a collection of ``nongene statistics''.
For each i and j, where $i < j$, $T_G(i,j)$ is a
negative real number (the logarithm of a probability) that we call the
``score'' for the gene statistic G over the region from i to j of
the sequence S. If G is a good gene statistic, then the more the
region of S from i to j looks like a gene, the larger $T_G(i,j)$
will be. Statistics that measure similarity to nongenes
are defined in an analogous way, and these are stored in matrices
$T_{NG}$. Note that these, and other matrices, need only be
stored as half-matrices for which j>i.
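As an illustration of how such a half-matrix might be filled, the sketch below uses a deliberately simple stand-in statistic: the log-probability of the region under an independent-base model. This statistic, the function name `score_matrix`, and the 0-based inclusive indexing are assumptions made for the example; they are not the gene statistics used in this work.

```python
import math

import numpy as np


def score_matrix(seq, base_probs):
    """Fill T[i, j] (j > i) with the log-probability of seq[i..j], inclusive.

    Entries with j <= i are left as NaN, since only the half-matrix is needed.
    """
    n = len(seq)
    T = np.full((n, n), np.nan)
    log_p = [math.log(base_probs[b]) for b in seq]
    # prefix[k] holds the sum of log_p[0:k], so any region sum is one subtraction.
    prefix = np.concatenate(([0.0], np.cumsum(log_p)))
    for i in range(n):
        for j in range(i + 1, n):
            T[i, j] = prefix[j + 1] - prefix[i]
    return T


uniform = {base: 0.25 for base in "ACGT"}
T = score_matrix("ACGT", uniform)
```

Every filled entry is the logarithm of a probability and is therefore negative, matching the sign convention described above.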
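The set of T matrices is combined into two other half-matrices, $L_{\mathrm{gene}}$ and $L_{\mathrm{nongene}}$. The L matrices can be formed in many different ways; we have experimented with several methods.
The most straightforward is to use a weighted linear combination of sensors:
\[ L_{\mathrm{gene}}(i,j) = \sum_{G} w_G \, T_G(i,j), \]
where $w_G$ is the weight assigned to the gene statistic G; $L_{\mathrm{nongene}}$ is formed in the same way from the nongene statistics.
However, this method does not guarantee that the L scores are normalized probabilities. If this condition is necessary, one may use a combination called ``softmax''.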
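The weighted linear combination can be sketched with NumPy as follows. The function name `combine_linear`, the toy matrices, and the weights are hypothetical; only the linear form is shown, and how the weights are chosen is left open.

```python
import numpy as np


def combine_linear(t_matrices, weights):
    """Return L with L[i, j] = sum_k weights[k] * T_k[i, j]."""
    stacked = np.stack(t_matrices)            # shape (K, n, n)
    w = np.asarray(weights, dtype=float)      # shape (K,)
    if w.shape[0] != stacked.shape[0]:
        raise ValueError("need exactly one weight per T matrix")
    return np.tensordot(w, stacked, axes=1)   # shape (n, n)


# Two toy half-matrices: only entries with j > i carry scores.
T1 = np.triu(np.full((3, 3), -1.0), k=1)
T2 = np.triu(np.full((3, 3), -2.0), k=1)
L = combine_linear([T1, T2], [0.5, 0.25])
```

Because the combination is applied entry by entry, the result keeps the half-matrix layout of its inputs.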
The score of a parse $P$, denoted $\mathrm{score}(P)$,
is defined to be the sum of the $L_{\mathrm{gene}}$
and $L_{\mathrm{nongene}}$ values for
the regions of S that make up the parse. An example is provided in
section
;
the parse has two nongene regions and one
gene-encoding region.
The score is the sum of the entries of the gene and nongene L matrices
corresponding to those regions.
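A parse score of this kind can be computed with a few lines of code. In the sketch below the region boundaries, the score values, the dictionary representation of the two L half-matrices, and the name `parse_score` are all hypothetical; whether adjacent regions share a boundary index is likewise an assumption of the example.

```python
def parse_score(parse, L_gene, L_nongene):
    """Sum the L entries for the regions of a parse.

    parse is a list of (kind, i, j) tuples with kind "gene" or "nongene"
    and i < j; each L argument maps (i, j) to a score.
    """
    total = 0.0
    for kind, i, j in parse:
        if kind not in ("gene", "nongene"):
            raise ValueError(f"unknown region kind: {kind!r}")
        if i >= j:
            raise ValueError("each region needs i < j")
        matrix = L_gene if kind == "gene" else L_nongene
        total += matrix[(i, j)]
    return total


# Made-up scores for one nongene / gene / nongene parse.
L_gene = {(3, 8): -2.0}
L_nongene = {(0, 3): -1.5, (8, 11): -0.5}
parse = [("nongene", 0, 3), ("gene", 3, 8), ("nongene", 8, 11)]
score = parse_score(parse, L_gene, L_nongene)
```

Here the score is the nongene entry for the first region, plus the gene entry for the middle region, plus the nongene entry for the last region.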