[Genome] Chained blastz alignments
Eric Just
e-just at northwestern.edu
Mon Apr 2 14:17:54 PDT 2007
Thanks again, Angie, that really clarifies stuff for me. You have been very
helpful and the pipeline is fascinating.
Eric
On 4/2/07, Angie Hinrichs <angie at soe.ucsc.edu> wrote:
>
> Hi Eric,
>
> > So if you first shred the sequences into chunks, then use axtChain to
> create
> > chains that can span across these smaller chunks, how and where do you
> store
> > the positions of the chunks on the larger assembly? How do you use
> this
> > information to later piece back together the chromosome/scaffold that
> you
> > have previously broken into chunks.
>
> First we invoke blastz with a [start,end] range appended to the
> sequence filenames. blastz's lav output includes the [start,end]
> range. Then we run a postprocessing script (written by Scott
> Schwartz, formerly of PSU, now at Google I think) that parses the
> range and adjusts the coordinates in the lav blocks.
>
> If you have downloaded the kent source tree, the postprocessing script
> is kent/src/hg/utils/automation/blastz-normalizeLav . It is invoked
> by blastz-run-ucsc in the same dir, and that is invoked by a job
> script created by doBlastzChainNet.pl (which in turn depends on
> several *.pm files in that dir).
>
> Scott Schwartz developed a pretty sophisticated system for running
> blastz on our compute cluster with various optional tweaks (for
> example, snipping of lineage-specific repeats prior to running blastz,
> and then adjusting the lav coords to un-snip!). Then Jim added the
> chaining and netting flow. We staff developers used to run a sequence
> of their scripts and programs manually from a template, but then
> automated the whole flow with yet another layer of scripts... so the
> whole system is pretty challenging to reverse engineer at this point,
> but the scripts are there to peruse. You can run doBlastzChainNet.pl
> -help to get a lengthy description of the config file input that it
> expects, and then you can run "doBlastzChainNet.pl -debug DEF" where
> DEF is a properly constructed config file, to see a sequence of
> commands that would be run if not for -debug -- also, it will dump out
> slave .csh scripts. A lot of the commands in there are specific to
> our computing environment (e.g. parasol commands for cluster jobs),
> but they do show exactly how we invoke the many programs in the
> pipeline.
>
> That said, you probably don't need to exactly duplicate our pipeline.
> For example, if your sequences are small enough (or computers big
> enough :) so that you don't need to chunk them, then you don't need to
> bother with the blastz-normalizeLav stuff. Still, axtChain can find
> much longer chains than blastz because of its handling of double-sided
> gaps, so I still think it's worth your while -- there are benefits
> beyond stitching across those chunks.
>
> Angie
>
>
>
More information about the Genome
mailing list