[Genome] Chained blastz alignments
Angie Hinrichs
angie at soe.ucsc.edu
Mon Apr 2 13:15:55 PDT 2007
Hi Eric,
> So if you first shred the sequences into chunks, then use axtChain to create
> chains that can span across these smaller chunks, how and where do you store
> the positions of the chunks on the larger assembly? How do you use this
> information to later piece back together the chromosome/scaffold that you
> have previously broken into chunks.
First we invoke blastz with a [start,end] range appended to the
sequence filenames. blastz's lav output includes the [start,end]
range. Then we run a postprocessing script (written by Scott
Schwartz, formerly of PSU, now at Google I think) that parses the
range and adjusts the coordinates in the lav blocks.
If you have downloaded the kent source tree, the postprocessing script
is kent/src/hg/utils/automation/blastz-normalizeLav . It is invoked
by blastz-run-ucsc in the same dir, and that is invoked by a job
script created by doBlastzChainNet.pl (which in turn depends on
several *.pm files in that dir).
Scott Schwartz developed a pretty sophisticated system for running
blastz on our compute cluster with various optional tweaks (for
example, snipping of lineage-specific repeats prior to running blastz,
and then adjusting the lav coords to un-snip!). Then Jim added the
chaining and netting flow. We staff developers used to run a sequence
of their scripts and programs manually from a template, but then
automated the whole flow with yet another layer of scripts... so the
whole system is pretty challenging to reverse engineer at this point,
but the scripts are there to peruse. You can run doBlastzChainNet.pl
-help to get a lengthy description of the config file input that it
expects, and then you can run "doBlastzChainNet.pl -debug DEF" where
DEF is a properly constructed config file, to see a sequence of
commands that would be run if not for -debug -- also, it will dump out
slave .csh scripts. A lot of the commands in there are specific to
our computing environment (e.g. parasol commands for cluster jobs),
but they do show exactly how we invoke the many programs in the
pipeline.
That said, you probably don't need to exactly duplicate our pipeline.
For example, if your sequences are small enough (or computers big
enough :) so that you don't need to chunk them, then you don't need to
bother with the blastz-normalizeLav stuff. Still, axtChain can find
much longer chains than blastz because of its handling of double-sided
gaps, so I still think it's worth your while -- there are benefits
beyond stitching across those chunks.
Angie
More information about the Genome
mailing list