[Genome] Chained blastz alignments
Angie Hinrichs
angie at soe.ucsc.edu
Wed Mar 28 12:54:21 PDT 2007
Hi Eric,
Good question -- the short answer is, axtChain can do a better job.
And chainNet takes it a step further by ranking and filtering the
chains, producing a single-coverage (on the reference sequence)
alignment and highlighting putative rearrangements.
When aligning whole genomes, we chop up the sequences into chunks of
10M - 30M bases overlapping by 10k, and align the chunks in separate
blastz jobs on a compute cluster. When working on chunked sequence,
blastz has a limited scope in which it could chain local alignments
together.
AxtChain works on the entire set of blastz alignments for each
reference chromosome (or scaffold etc), so it can create chains longer
than the alignment chunk size. Also, I believe blastz can output
alignments that have a gap only in one sequence or the other, not both
simultaneously, while axtChain does support double-sided gaps which is
the key to the creation of very long chains.
And although I don't know the details of blastz's chaining algorithm
(Webb Miller's group at PSU knows best), I don't think it's as
sophisticated and parameterized as axtChain's. AxtChain uses the same
nucleotide scoring matrix as blastz, but also uses a stepwise linear
gap penalty matrix that allows for less harsh penalties on longer
gaps, and allows different penalties on a gap in both sequences
relative to a gap in one or the other. AxtChain begins with all local
alignments for a given pair of sequences and strand; it breaks the
local alignments down into gapless blocks, rescores them, orders them
by position into a k-d tree (as suggested by Webb Miller), and then
for each block recurses down the tree to find the best predecessor for
that block so we end up with high-scoring chains. The
best-predecessor decision uses the fancy gap penalties and also looks
for optimal crossover point if two blocks overlap so it can merge
them into a higher-scoring block. The upshot of all this is that we
can find some very long chains that we believe represent shared common
ancestry of sequences and can shed some light on rearrangement
history.
chainNet takes chains as input, sorts them by score, and then starting
with the highest-scoring chain, it ranks the chains, filling in gaps
in the higher-scoring chains with lower-scoring chains where possible.
The resulting filtered, hierarchical view of the chains is where
putative rearrangements (inversions, insertions, deletions,
duplications in the reference sequence) really stand out.
Hope that helps,
Angie
On Wed, 28 Mar 2007, Eric Just wrote:
> Hello,
>
> I have read some papers and some information on your site about generating
> whole genome alignments. I am using the technique of running BLASTZ and
> then post-processing the data using axtChain then chainNet to generate the
> non-overlapping gapped global alignements. My question is about the C
> argument to BLASTZ. The BLASTZ output says:
>
> C(0) 0: no chaining; 1: just output chain; 2: chain and extend; 3:
> just output HSPs
>
> So can the use of the C argument obviate the need for chaining using
> axtChain? If not, what is the difference between chaining done by BLASTZ vs
> chaining doen by chainNet?
>
> Thanks for your time!
>
> Eric
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>
--
angie at soe.ucsc.edu
Software Developer, UCSC CBSE / Genome Bioinformatics Group
More information about the Genome
mailing list