How the cDNA Alignments Were Generated

The cDNA alignments are the foundation of this database. They were generated in a two pass manner. In the first pass an algorithm similar to BLAST collected "hits" where 16 nucleotide fragments of the cDNA aligned perfectly with the genome. Areas of the genome with promising hits were passed to the fuzzyFinder program to finish the alignment. For the genomic data we used the Sanger November 1998 release. For the cDNA data we used all of the ESTs in the Sanger database, and supplemented this with cDNAs and more recent ESTs downloaded from GenBank on July 10, 1999.

The first pass algorithm starts out much like Blast. It packs the genomic sequence into two bits per base. Then it looks for 16-base "tiles" in the cDNA sequence that are suitable for searching. This is basically every 16-base region that contains no N's, and also contains no internal repeats. (This rules out tttttttttttttttt, atatatatatatatat, and so forth). Finally it loads up 32 bits (16 bases) of the genome at a time and compares it to the tiles.The inner loop is thus just 32-bit-word compare, which us unrolled 32x. The inner loop generates "hits." Since the comparison advances over the genome 16 bases at a time rather than one base at a time it can overlook perfect matches up to 30 bases long, and find them as short as 16 bases. It may be worthwhile to use a smaller tile size when the large tile size can find no hits. However, for cDNAs it's really not unfair to ask for one 30 base perfect match somewhere in the sequence. Next the algorithm groups tiles that are adjacent, or nearly so in both cDNA and genome into "crude exons". Next it groups crude exons that are adjacent or nearly so in cDNA, and within 25,000 bases in the genome into "crude genes". The crude genes are score by summing the squares of the number of tiles in each crude exon, and adding the square of he number of exons. The crude genes scoring less than 10% of the best scoring crude gene are thrown out. The remainder are used to define a region of the genome for the fuzzyFinder algorithm to align with the cDNA.

The fuzzyFinder algorithm is a work in progress. While it works very well something like 99% of the time, it can still be confounded upon occasion by sequences with just the right mix of repeating patterns and noise. In many ways this algorithm parallels the first pass algorithm, but it takes a somewhat finer grained approach in searching for hits, and ultimately must extend the hits through sequencing noise. Since the code is rather complex as a cross check I ran the cDNA sequences which it wouldn't align well (defined as less than 50% of the cDNA bases part of aligning blocks) through NCBI Blast.

Upon starting up fuzzyFinder counts up how frequently each base is used in the genomic sub-sequence, and uses this to assign a probability to each base. Armed with this information, the program creates tiles by scanning the cDNA. A tile is elongated until the probability product of all of its bases is less than 1/(stringency*numBasesInGenomicSubsequence) where stringency is an adjustable parameter that typically is set to 100 or 1000. The result of this is in a typical run tiles of 8-12 bases long with the shorter ones being more G/C rich (since the C. elegans genome is nearly 70% A/T). Tiles containing N's or internal repeats are discarded. The next tile is started by scanning the cDNA where the last tile left off. Once the tiles are assembled they are used to scan the genomic subsequence for hits. The hits are then grouped as before into crude exons, and crude genes, scored, and the best scoring crude gene chosen for further extension.

The extension first just merges adjacent hits and expands their ends as far as the cDNA and genomic DNA match perfectly. Then it expands the ends further, allowing N's in the cDNA to match any single base. At this point the alignment looks something like:

cDNA   cacatagt     xxxxxxxx        gattatcgnt       xxxxxxxx                anacctgan
       ||||||||                     |||||||| |                               | ||||||
geno  ycacatagtyyyyyyyyyyyyyyyyyyyyygattatcgctyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyagacctgacyyy

where the x's and y's represent unaligned areas of the cDNA and genome respectively. The program then recurses, making up tiles and trying to match in the unaligned areas. Since the genomic area to search has been reduced, the tile size is smaller in the second run. The recursion runs until either no tiles are found (in which case it is run again at somewhat reduced stringency), or until the gap aligned blocks in the genome or cDNA becomes less than 6. Then the ends of alignments are expanded allowing gaps and inserts of one, and then again allowing gaps and inserts of up to 12. At this point the alignment looks something like:

cDNA   cacatagt......accg...anaa.....gattatcgnt.......tac..acnac.............anacctgan
       ||||||||      ||||   | ||     |||||||| |       |||  || ||             | |||||| 
geno  ccacatagttaaagcaccgataataattatggattatcgctttcttcatacaaac.acacgtttaattttaagacctgacttt

At this point some of the gaps in the alignment resemble introns, such as the one shown in blue above. Where possible the intron is allowed to slide so as to maximize the how closely it matches to the consensus gt ... ag. This results in the final alignment depicted below:

cDNA   cacatagt......accg...anaa.....gattatcgnt.......tac..acnacan.............acctgan
       ||||||||      ||||   | ||     |||||||| |       |||  || |||              |||||| 
geno  ccacatagttaaagcaccgataataattatggattatcgctttcttcatacaaac.acacgtttaattttaagacctgacttt