[Genome] Inconsistent .over.chain's ?

Hiram Clawson hiram at soe.ucsc.edu
Sun Dec 23 11:34:15 PST 2007


Good Afternoon Xueya:

Attempting to reproduce your scenario.

Pick up the all.chain file from hgdownload:
-rw-rw-r--  1 70228690 Dec 23 11:12 hg18.panTro2.all.chain.gz

Run the liftOver procedure:
chainPreNet hg18.panTro2.all.chain.gz hg18.chrom.sizes \
         panTro2.chrom.sizes stdout \
| chainNet stdin -minSpace=1 hg18.chrom.sizes \
         panTro2.chrom.sizes stdout /dev/null \
| netSyntenic stdin noClass.net
Got 49 chroms in hg18.chrom.sizes, 52 in panTro2.chrom.sizes
Finishing nets
writing stdout
writing /dev/null
memory usage 390295552, utime 523 s/100, stime 67

Results in the noClass.net file:
$ wc -l noClass.net
2423883 noClass.net
$ ls -og noClass.net
-rw-rw-r--  1 92169517 Dec 23 11:16 noClass.net

And creating the liftOver file:
netChainSubset -verbose=0 noClass.net hg18.panTro2.all.chain.gz stdout \
| chainSort stdin stdout | gzip -c > hg18.panTro2.over.chain.gz

Results in the file:
-rw-rw-r--  1 14055998 Dec 23 11:18 hg18.panTro2.over.chain.gz
With a chain count and checksum:
$ zcat hg18.panTro2.over.chain.gz | grep "^chain" | wc -l
48902
$ zcat hg18.panTro2.over.chain.gz | sum
55616 40486


This is identical to the liftOver file obtained from hgdownload:

-rw-rw-r--  1 14055998 Dec 23 11:21 hg18ToPanTro2.over.chain.gz

$ zcat hg18ToPanTro2.over.chain.gz | grep "^chain" | wc -l
48902
$ zcat hg18ToPanTro2.over.chain.gz | sum
55616 40486
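
If you want to repeat this comparison on your own pair of files, a minimal sketch along the following lines may help. The function name compare_chains is made up for illustration and is not part of the kent tools. It uses the POSIX cksum command instead of sum, and it compares the decompressed content rather than the .gz files themselves.

```shell
# Hypothetical helper, not part of the kent source tree.
# Compares two gzipped chain files by chain count and by the checksum
# of their decompressed content.
compare_chains() {
    a="$1"
    b="$2"

    # Count the "chain" header lines in each file.
    count_a=$(gzip -dc "$a" | grep -c '^chain')
    count_b=$(gzip -dc "$b" | grep -c '^chain')
    echo "$a: $count_a chains"
    echo "$b: $count_b chains"

    # Checksum the decompressed bytes. The gzip header can carry a
    # timestamp, so the compressed files may differ even when the
    # chain data inside them is the same.
    sum_a=$(gzip -dc "$a" | cksum)
    sum_b=$(gzip -dc "$b" | cksum)

    if [ "$sum_a" = "$sum_b" ]; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}
```

For example: compare_chains hg18.panTro2.over.chain.gz hg18ToPanTro2.over.chain.gz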


The creation procedure for the download files produced the same
results in July 2006.  Is your kent source tree software older
than July 2006?

Happy New Year.

--Hiram


Xueya Zhou wrote:
> Dear UCSC Genome Browser Team,
> 
> I have a technical question about generating a .over.chain from a
> .all.chain.
> 
> I downloaded the .all.chain from your public web site (e.g.
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/), then followed
> the procedure detailed in the doBlastzChainNet.pl script in the Kent source
> tree to extract a single-coverage .over.chain as follows:
> 
> chainPreNet $tDb.$qDb.all.chain.gz $tDb.chrom.sizes $qDb.chrom.sizes stdout \
> | chainNet stdin -minSpace=1 $tDb.chrom.sizes $qDb.chrom.sizes stdout \
>     /dev/null | netSyntenic stdin noClass.net
> 
> netChainSubset -verbose=0 noClass.net $all_chain stdout \
> | chainStitchId stdin stdout | gzip -c > $tDb.$qDb.over.chain.gz
> 
> To my surprise, I found that the generated .over.chain is somewhat
> different from the liftOver file I downloaded. I compared the
> hg18.panTro2.over.chain I generated with the
> hg18ToPanTro2.over.chain from downloads. The former has about ten thousand
> fewer chains than the latter (counted with: grep '^chain' *.over.chain | wc
> -l), and a considerable portion of the chains in the two files are not
> identical. I don't understand this inconsistency, given that we use
> the same input data (.all.chain), the same programs, and the same
> procedure. I have not looked closely at how the two over.chain files
> differ. Is it possible that the aligned blocks are the same but
> chained differently? I think that is unlikely if the algorithm is
> deterministic. Or could it be caused by the order in which the chains are
> fed into the program? I would also like to know how this discrepancy
> affects my downstream analysis.
> 
> I am particularly concerned about this because I used the same set of
> tools to generate human-chimp reciprocal-best alignment chains and nets,
> which are not available on your public site. So I would appreciate any
> expert suggestions on this issue.
> 
> Thank you very much and Merry Christmas!
> 
> Xueya

