For the following sensors, assume that the sequence S is a string of
length l = j - i + 1 over the nucleotide alphabet {A, C, G, T}. As
codons, the sequence is described as:
$C_1 = S_i S_{i+1} S_{i+2}$, ..., $C_n = S_{j-2} S_{j-1} S_j$, where $n = l/3$.
In log notation, this becomes:
The gene-length histogram was built from the distribution of gene
lengths in the training set. The probability of a gene given its length
is the number of training-set genes whose lengths fall in the same bin
as the test gene, divided by the total number of genes in the training
set. The histogram was smoothed by averaging each bin over a window of
50, and genes that were not believed to be protein-encoding were
removed. The average gene was slightly less than 1.0 kb long.
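The histogram-based length probability described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `length_probabilities`, the `bin_width` parameter, and the reading of "a window of 50" as a centered moving average over neighboring bins are all assumptions.

```python
def length_probabilities(gene_lengths, bin_width=1, window=50):
    """Return a function mapping a gene length to a smoothed probability.

    Assumptions (not from the source): bin_width is a free parameter, and
    smoothing is a centered moving average over `window` bins, truncated
    at the ends of the histogram. Values are not renormalized afterwards.
    """
    n_bins = max(gene_lengths) // bin_width + 1
    counts = [0] * n_bins
    for length in gene_lengths:
        counts[length // bin_width] += 1

    total = len(gene_lengths)
    half = window // 2
    smoothed = []
    for b in range(n_bins):
        lo, hi = max(0, b - half), min(n_bins, b + half + 1)
        # Mean count over the window, divided by the number of training genes.
        smoothed.append(sum(counts[lo:hi]) / (hi - lo) / total)

    def prob(length):
        b = length // bin_width
        return smoothed[b] if 0 <= b < n_bins else 0.0

    return prob
```

With `window=1` the function reduces to the unsmoothed ratio of genes in the bin to genes in the training set.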
For the start codon, only three codons have nonzero probabilities in
practice: ATG, GTG and TTG.
For each codon, the probability is the number of occurrences of that
codon within coding regions of the training set divided by the total
number of codons in the coding regions of the training set.
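The codon frequency estimate described above amounts to a relative count. A minimal sketch, assuming the training coding regions are available as in-frame strings (the names `split_codons` and `codon_frequencies` are hypothetical, and whether start and stop codons are included depends on what the caller passes in):

```python
from collections import Counter


def split_codons(seq):
    """Split a sequence into consecutive, non-overlapping triplets.

    Trailing bases that do not fill a complete codon are ignored.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[k:k + 3] for k in range(0, usable, 3)]


def codon_frequencies(coding_regions):
    """Relative frequency of each codon over all given coding regions."""
    counts = Counter()
    for region in coding_regions:
        counts.update(split_codons(region.upper()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}
```

Codons that never occur in the training regions are simply absent from the returned dictionary, which corresponds to a probability of zero.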
See section
for the scores given by the log
probabilities for the sensors that were derived from the training set.
For the stop codon, only three codons have nonzero probabilities in
practice: TAA, TGA, and TAG.
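One way the sensors in this section might be combined into a single log score is sketched below. It assumes that the length, start codon, codon usage and stop codon probabilities are treated as independent factors, so that the log score is a sum of log probabilities, and that the codon usage sensor is applied only to the codons between the start and the stop. The function `log_score` and its parameters are hypothetical names introduced for illustration.

```python
import math


def log_score(seq, length_prob, start_prob, codon_prob, stop_prob):
    """Sum of log sensor probabilities for a candidate gene.

    Assumptions (not from the source): the sensors are independent
    factors, and codon usage covers interior codons only.
    length_prob is a function of sequence length; the other three
    arguments are dictionaries mapping codon to probability.
    Returns -inf if any factor is zero or the sequence is not in frame.
    """
    if len(seq) < 6 or len(seq) % 3 != 0:
        return -math.inf
    codons = [seq[k:k + 3] for k in range(0, len(seq), 3)]
    factors = [
        length_prob(len(seq)),
        start_prob.get(codons[0], 0.0),
        stop_prob.get(codons[-1], 0.0),
    ]
    factors += [codon_prob.get(c, 0.0) for c in codons[1:-1]]
    if any(p <= 0.0 for p in factors):
        return -math.inf
    return sum(math.log(p) for p in factors)
```

Returning negative infinity for a zero factor mirrors the fact that only ATG, GTG and TTG can start a gene and only TAA, TGA and TAG can end one.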