[Genome] Inconsistent .over.chain's ?

Xueya Zhou xueyazhou at gmail.com
Sat Dec 29 09:45:16 PST 2007


Thanks Hiram for clarifying it out for me. But I came up with even more
problems related to the over.chains.

1. Comparing two chains

Since the chain format allows gaps in both query and target sequences, a
pairwise alignment can assume many forms in terms of chains. Then there
should be a way for one to determine whether two .over.chain's are
equivalent or not. In order to make comparison, I first convert two
.over.chain files into .axt format, then compare the sorted .axt files. But
the method is not perfect. I did come to the situations when I need to
compare two chains below, and it also relates to one of my questions. Here I
just want to know if there is an easier way to do so.

2. About the hapmapAlleleChimp/hapmapAlleleMacaque table

As stated in the track descriptions, the outgroup alleles of HapMap SNPs are
determined using liftOver. When I was using these data, I discern a
considerable portion of the data in hg17.hapmapAlleleChimp maybe aligned to
the wrong place of the outgroup genome sequence. So I wrote a script
(attached)  to check the consistency of the data in all four tables with the
liftOver chains (hg17.hapmapAlleleChimp/hapmapAlleleMacaque and
hg18.hapmapAlleleChimp/hapmapAlleleMacaque downloaded from annotation
database section of goldenPath page).

I found hg17.hapmapAlleleChimp table have many error alignments (
hg18.hapmapAlleleChimp is all right) compared to the liftOver results. Most
of them are on human Y chromosome. Careful look at those inconsistent
records, it turns out that those erroneous alignments can be replicated by
using hg18ToPanTro2.over.chain to liftover hg17 SNP cooridnates! I think it
is a pitfall when lifting over each part of the partitioned data. And hope
those who generated the table can have a check on his/her procedures. Since
the hg18.snp126 outgroup allele table is generated by the same method, it
may also need a check.

3. About hg18ToRheMac2.over.chain

When I use the same method to check the macaque allele table, I was flooded
with inconsistent until I found my liftOver chain may be wrong. I noticed
that my hg18ToRheMac2.over.chain was downloaded from you goldenPath page (
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/liftOver/). And there is
another hg18ToRheMac2.over.chain on your test website (
http://hgwdev.cse.ucsc.edu/goldenPath/hg18/liftOver/). The generation times
of these two files are different, and so are the content:

from hgdownload page: hg18ToRheMac2.over.chain.gz 14-Apr-2006 09:45   39M
from hgwdev page: hg18ToRheMac2.over.chain.gz 06-Jun-2006 09:31  34.4M

When I switch to the newer version of liftOver chain, the records in the
hapmapAlleleMacaque tables become consistent. So these two over.chain's I
used are not equivalent. And I confirmed this by comparing converted axt
files.

In order to determine which over.chain should be the "right one". I looked
at the hg18/vsRheMac2/ directory at both sites alluded above and found their
.all.chain's are the same. So I tried to generate .over.chain from
.all.chain. But this time .all.chain of hg18-rheMac2 alignment are too large
to process at once (program ran short of memories), so I end up with first
chainSplit, process one at once, and then chainMergeSort to combine them
together. Then I compare the the resulting .over.chain with the two
hg18ToRheMac2.over.chain.gz chains downloaded from the two sites. The older
chain matches the .over.chain better. Then I searched the mailing list
archive, and did not find any announcement regarding to this new
.over.chain. So I want to know where did it come out? What is the prototype
.all.chain? Why it is used to generate the hapmapAlleleMacaque tables? Why
did the two .over.chain can yield so many discrepancies when used to map
human SNPs? Which one should I use?

It is also worth noting that when using the newer chain, the liftOver runs
much faster than before. It used to take more than 30 computer hours to
liftOver all HapMap SNPs to rheMac2 coordinates on our cluster. But now less
than an hour on my laptop. I am curious about why.

4. axt file format

When I compare two .over.chains (one generated by myself, one from your
website) by comparing the derived axt format (using diff), I actually
encountered numerous imperfect matches as shown below. One of the record in
axt file have one base pair ahead.

< 19926 chr15 87334482 87334608 chr3 42226024 42226150 - 9288
<
TCTCTTCATCAGACAGCTCCTCCTCCAGACCCGCCATCTTCCTTCTAGCAACAAACTGTGGCGGGGGCGGGGGCAGTGGCCACGTGACTTACCCCTTTCCGTGCGGTGCTCGGCGCCGCGCACCACC
<
TCTCTTCATCAGACAACTGCTCCTCCAGATCCGCCATCTTCCTTCTGGCGACAAACCgcggcggcggcggcggcagcggcCACCTGACTTCCCCGCTTCCGTGCGGTCCTCAGCGCCGCGCGCCCCC
---
> 19926 chr15 87334481 87334608 chr3 42226023 42226150 - 9379
>
TTCTCTTCATCAGACAGCTCCTCCTCCAGACCCGCCATCTTCCTTCTAGCAACAAACTGTGGCGGGGGCGGGGGCAGTGGCCACGTGACTTACCCCTTTCCGTGCGGTGCTCGGCGCCGCGCACCACC
>
TTCTCTTCATCAGACAACTGCTCCTCCAGATCCGCCATCTTCCTTCTGGCGACAAACCgcggcggcggcggcggcagcggcCACCTGACTTCCCCGCTTCCGTGCGGTCCTCAGCGCCGCGCGCCCCC

I deem the above two records as matched one the other. But it gives rise to
the question about what causes this subtle difference. I have used the
axtNet files in pairwise alignment directory of hg18/vsPanTro2
hg17/vsPanTro2 before. It seemed that all the coordinates of each record in
axtNet files are not "half-open zero-based" as stated in the description
page of axt format. I found it actually "minus-one-based".  I find this by
comparing the sequence of each record with those generated by twoBitToFa
program, which accept zero-based half-open coordinates.  And I guess it can
be related to what I observed above. Some program may have small bugs on
converting coordinates.

----
When I began to use those data in my research, my basic assumption was that
the data should be the golden truth. But now as I am half way back to check
the data again, I found myself deluged with that many problems. So I really
hope some expert here in UCSC group can help me out this time.

Thank you very much and Happy New Year!

Xueya

On Dec 27, 2007 2:51 AM, Hiram Clawson <hiram at soe.ucsc.edu> wrote:

> Good Morning Xueya:
>
> The use of chainStitchId in place of the chainSort
> in the doBlastzChainNet.pl script was introduced in July 2007.
> The older hg18 liftOver file was made with the chainSort.
>
> The resulting liftOver files are equivalent.
>
> --Hiram
>
> Xueya Zhou wrote:
> > Hi Hiram! Thank you for not having my question delayed over your
> holiday.
> >
> > I use the latest source tree. But I see the subtle difference between my
>
> > commands and yours: you used 'chainSort' in the second command, but I
> > use the 'chainStitchId'. After I changed to chainSort, I got the same
> > result.
> >
> > I think the line 870-877 in the doBlastzChainNet.pl should be
> >
> >    871    $bossScript->add(<<_EOF_
> >    872  # Make nets ("noClass", i.e. without rmsk/class stats which are
> > added later):
> >    873  chainPreNet $chain $defVars{SEQ1_LEN} $defVars{SEQ2_LEN} stdout
> \\
> >    874  | chainNet stdin -minSpace=1 $defVars{SEQ1_LEN}
> > $defVars{SEQ2_LEN} stdout /dev/null \\
> >    875  | netSyntenic stdin noClass.net
> >    876
> >    877  # Make liftOver chains:
> >    878  netChainSubset -verbose=0 noClass.net $chain stdout \\
> >    879  | chainStitchId stdin stdout | gzip -c > $tDb.$qDb.over.chain.gz
> >
> > And that is what I followd.
> >
> > Since the chainStitchId have the effect of  joining the fragments, that
> > explains why I see much fewer number of chains last time. So I think
> > these two different .over.chain's I saw last time should be equivalent.
> > And I should have the same results if I use either of them to do the
> > liftOver job. I tried on some of my data sets. It seems to be. Still I
> > want to be confirmed by some experts. Thank you very much!
> >
> >
> > Xueya
> >
> > On Dec 24, 2007 3:34 AM, Hiram Clawson < hiram at soe.ucsc.edu
> > <mailto:hiram at soe.ucsc.edu>> wrote:
> >
> >     Good Afternoon Xueya:
> >
> >     Attempting to reproduce your scenario.
> >
> >     Pick up the all.chain file from hgdownload:
> >     -rw-rw-r--  1 70228690 Dec 23 11:12 hg18.panTro2.all.chain.gz
> >
> >     Run the liftOver procedure:
> >     chainPreNet hg18.panTro2.all.chain.gz hg18.chrom.sizes \
> >             panTro2.chrom.sizes stdout \
> >     | chainNet stdin -minSpace=1 hg18.chrom.sizes \
> >             panTro2.chrom.sizes stdout /dev/null \
> >     | netSyntenic stdin noClass.net
> >     Got 49 chroms in hg18.chrom.sizes, 52 in panTro2.chrom.sizes
> >     Finishing nets
> >     writing stdout
> >     writing /dev/null
> >     memory usage 390295552, utime 523 s/100, stime 67
> >
> >     Results in the noClass.net file:
> >     $ wc -l noClass.net
> >     2423883 noClass.net
> >     $ ls -og noClass.net
> >     -rw-rw-r--  1 92169517 Dec 23 11:16 noClass.net
> >
> >     And creating the liftOver file:
> >     netChainSubset -verbose=0 noClass.net hg18.panTro2.all.chain.gzstdout \
> >     | chainSort stdin stdout | gzip -c > hg18.panTro2.over.chain.gz
> >
> >     Results in the file:
> >     -rw-rw-r--  1 14055998 Dec 23 11:18 hg18.panTro2.over.chain.gz
> >     With a chain count and checksum:
> >     $ zcat hg18.panTro2.over.chain.gz | grep "^chain" | wc -l
> >     48902
> >     $ zcat hg18.panTro2.over.chain.gz | sum
> >     55616 40486
> >
> >
> >     Which is identical to the lift over file obtained from hgdownload:
> >
> >     -rw-rw-r--  1 14055998 Dec 23 11:21 hg18ToPanTro2.over.chain.gz
> >
> >     $ zcat hg18ToPanTro2.over.chain.gz | grep "^chain" | wc -l
> >     48902
> >     $ zcat hg18ToPanTro2.over.chain.gz | sum
> >     55616 40486
> >
> >
> >     The creation procedure for the download files produced the same
> >     results in July 2006.  Is your kent source tree software older
> >     than July 2006 ?
> >
> >     Happy New Year.
> >
> >     --Hiram
> >
> >
> >     Xueya Zhou wrote:
> >      > Dear UCSC Genome Browser Team,
> >      >
> >      > I want to ask a technical question on generating a .over.chain
> >      from a
> >      > .all.chain.
> >      >
> >      > I download the .all.chain from you public web site (e.g.
> >      > http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/), then
> >     followed
> >      > the procedures detailed in doBlastzChainNet.pl scripts in the
> >     Kent source
> >      > tree to extract an single coverage .over.chain as the following:
> >      >
> >      > chainPreNet $tDb.$qDb.all.chain.gz $tDb.chrom.sizes
> >     $qDb.chrom.sizes stdout
> >      > | chainNet stdin -minSpace=1 $tDb.chrom.sizes $qDb.chrom.sizes
> >     stdout
> >      > /dev/null | netSyntenic stdin noClass.net
> >      >
> >      > netChainSubset -verbose=0 noClass.net $all_chain stdout |
> >     chainStitchId
> >      > stdin stdout | gzip -c > $tDb.$qDb.over.chain.gz
> >      >
> >      > To my surprise, that I found the generated .over.chain is some
> what
> >      > different from my downloaded liftOver files. I compared the
> >      > hg18.panTro2.over.chain generated by myself with that of
> >      > hg18ToPanTro2.over.chain from downloads. The former have about
> >     ten thousand
> >      > less chains than the latter (compared by: grep '^chain'
> >     *.over.chain | wc
> >      > -l). And a considerable portion of these two set of over.chain's
> >     are not
> >      > identical to each other.  I don't understand this inconsistency
> >     if we use
> >      > the same input data (.all.chain), the same programs and follow
> >     the same
> >      > procedures. I did not look deep into how different these two
> >     over.chain 's
> >      > are. Would it be possible if the aligned blocks are the same but
> >     the way
> >      > they are chained differ? I think it is unlikely if the algorithm
> is
> >      > deterministic. Or can it be caused by the orders of the chains
> >     that feed
> >      > into the program? Then I want to know the effect of this
> >     discrepancy in my
> >      > downstream analysis.
> >      >
> >      > I am particularly concerned about this, because I also used the
> >     same set of
> >      > tools to generate human-chimp reciprocal best alignment chains
> >     and nets,
> >      > which are not available in your public sites. So I would like to
> >     hear some
> >      > expert's suggestions on this issue.
> >      >
> >      > Thank you very much and Merry Christmas!
> >      >
> >      > Xueya
> >
> >
>
>


More information about the Genome mailing list