[Genome] Chained blastz alignments

Angie Hinrichs angie at soe.ucsc.edu
Mon Apr 2 13:15:55 PDT 2007


Hi Eric,

> So if you first shred the sequences into chunks, then use axtChain to create
> chains that can span across these smaller chunks, how and where do you store
> the positions of the chunks on the larger assembly?   How do you use this
> information to later piece back together the chromosome/scaffold that you
> have previously broken into chunks.

First we invoke blastz with a [start,end] range appended to the 
sequence filenames.  blastz's lav output includes the [start,end] 
range.  Then we run a postprocessing script (written by Scott 
Schwartz, formerly of PSU, now at Google I think) that parses the 
range and adjusts the coordinates in the lav blocks.

If you have downloaded the kent source tree, the postprocessing script 
is kent/src/hg/utils/automation/blastz-normalizeLav .  It is invoked 
by blastz-run-ucsc in the same dir, and that is invoked by a job 
script created by doBlastzChainNet.pl (which in turn depends on 
several *.pm files in that dir).

Scott Schwartz developed a pretty sophisticated system for running 
blastz on our compute cluster with various optional tweaks (for 
example, snipping of lineage-specific repeats prior to running blastz, 
and then adjusting the lav coords to un-snip!).  Then Jim added the 
chaining and netting flow.  We staff developers used to run a sequence 
of their scripts and programs manually from a template, but then 
automated the whole flow with yet another layer of scripts... so the 
whole system is pretty challenging to reverse engineer at this point, 
but the scripts are there to peruse.  You can run doBlastzChainNet.pl 
-help to get a lengthy description of the config file input that it 
expects, and then you can run "doBlastzChainNet.pl -debug DEF" where 
DEF is a properly constructed config file, to see a sequence of 
commands that would be run if not for -debug -- also, it will dump out 
slave .csh scripts.  A lot of the commands in there are specific to 
our computing environment (e.g. parasol commands for cluster jobs), 
but they do show exactly how we invoke the many programs in the 
pipeline.  

That said, you probably don't need to exactly duplicate our pipeline.  
For example, if your sequences are small enough (or computers big 
enough :) so that you don't need to chunk them, then you don't need to 
bother with the blastz-normalizeLav stuff.  Still, axtChain can find 
much longer chains than blastz because of its handling of double-sided 
gaps, so I still think it's worth your while -- there are benefits 
beyond stitching across those chunks.

Angie




More information about the Genome mailing list