[Genome] N-gap missing from liftAll.lft for chr8 in hg18

Herve Pages hpages at fhcrc.org
Wed May 14 10:31:04 PDT 2008


Hi,

I'm working on tools that implement fast pattern matching algorithms
on entire chromosome sequences for hg18 and other genomes. For hg18
we use the sequences provided here:

   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

In order to speed up our algorithms, we'd like to be able to skip the
N-gaps and other regions. We also want our users to keep some control
over exactly what is skipped, so our plan is to provide "predefined
masks" for each chromosome that the user can apply to the sequence or not.

So far I've successfully extracted the masks of N-gaps for hg18 from
the UCSC liftAll.lft file (offsets/lengths of contigs within chromosomes),
except for chr8, where I ran into the following problem.

For this chromosome, liftAll.lft contains the following lines:

       offset      acc_num    length
            0  8/NT_023736   7462059
      7562059  8/NT_077531   4537293
     12199352  8/NT_030737   9464880
     21681632  8/NT_023666   8051036
     29798768  8/NT_007995  14159284
     46958052  8/NT_023678   1291149
     48309201  8/NT_008183  38454502
     86851003  8/NT_008046  57155273
    144106276  8/NT_023684   1290020
    145403396  8/NT_037704    871430

But the 7th contig in the above list (offset=48309201, acc_num=8/NT_008183,
length=38454502) contains a block of 124100 consecutive Ns.
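
For reference, the gaps implied by liftAll.lft are simply the spaces
between consecutive contigs. A minimal Python sketch of that derivation,
assuming the offsets are 0-based and the gap coordinates are reported
1-based inclusive as in the AGP files (the function and variable names
are made up for illustration):

```python
# (offset, length) pairs for chr8, copied from the liftAll.lft lines above.
# Assumption: offsets are 0-based; gaps are reported 1-based inclusive.
CHR8_CONTIGS = [
    (0, 7462059),
    (7562059, 4537293),
    (12199352, 9464880),
    (21681632, 8051036),
    (29798768, 14159284),
    (46958052, 1291149),
    (48309201, 38454502),
    (86851003, 57155273),
    (144106276, 1290020),
    (145403396, 871430),
]


def gaps_between_contigs(contigs):
    """Return (start, stop, length) for each gap between consecutive contigs.

    Only gaps *between* contigs are found; a gap before the first contig
    or after the last one would also need the chromosome length.
    """
    gaps = []
    for (offset, length), (next_offset, _) in zip(contigs, contigs[1:]):
        contig_end = offset + length  # 1-based position of the contig's last base
        gap_len = next_offset - contig_end
        if gap_len > 0:
            gaps.append((contig_end + 1, next_offset, gap_len))
    return gaps


lift_gaps = gaps_between_contigs(CHR8_CONTIGS)
for start, stop, length in lift_gaps:
    print(f"{start:>10} {stop:>10} {length:>8}")
```

This yields 9 gaps for chr8, none of which is the block of 124100 Ns
inside 8/NT_008183.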

Now if I use the hs_ref_chr*.agp files from NCBI for the current
reference assembly (36.3):

   ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/

in order to extract the coordinates of the N-gaps, I find exactly the
same thing as with the UCSC liftAll.lft file for all chromosomes, except
for chr8. Here are the N-gaps reported in hs_ref_chr8.agp:

   chr_start   chr_stop  gap_len   gap_type
     7462060    7562059   100000      clone
    12099353   12199352   100000      clone
    21664233   21681632    17400      clone
    29732669   29798768    66100      clone
    43958053   46958052  3000000 centromere
    48249202   48309201    60000      clone
    86069242   86193341   124100   fragment
    86763704   86851003    87300      clone
   144006277  144106276   100000      clone
   145396297  145403396     7100      clone

The 7th N-gap (of type "fragment") is indeed a region containing
124100 consecutive Ns, and it is the N-gap that is missing from the UCSC
liftAll.lft file.

Note that this is the only N-gap of type "fragment" reported by NCBI
for the entire 36.3 build (302 N-gaps in total), and it is also the only
N-gap inconsistency between the UCSC liftAll.lft file and the NCBI
hs_ref_chr*.agp files.
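
The relationship between the two files can be checked mechanically. The
sketch below, using the chr8 values quoted in this message, asks for each
AGP gap whether it lies entirely inside one of the liftAll.lft contigs.
It assumes 0-based offsets in liftAll.lft and 1-based inclusive
coordinates in the AGP file; enclosing_contig is a hypothetical helper
name, not part of any UCSC or NCBI tool:

```python
# (offset, acc_num, length) for chr8, from liftAll.lft as quoted above.
CONTIGS = [
    (0, "8/NT_023736", 7462059),
    (7562059, "8/NT_077531", 4537293),
    (12199352, "8/NT_030737", 9464880),
    (21681632, "8/NT_023666", 8051036),
    (29798768, "8/NT_007995", 14159284),
    (46958052, "8/NT_023678", 1291149),
    (48309201, "8/NT_008183", 38454502),
    (86851003, "8/NT_008046", 57155273),
    (144106276, "8/NT_023684", 1290020),
    (145403396, "8/NT_037704", 871430),
]

# (chr_start, chr_stop, gap_type) for chr8, from hs_ref_chr8.agp as quoted above.
AGP_GAPS = [
    (7462060, 7562059, "clone"),
    (12099353, 12199352, "clone"),
    (21664233, 21681632, "clone"),
    (29732669, 29798768, "clone"),
    (43958053, 46958052, "centromere"),
    (48249202, 48309201, "clone"),
    (86069242, 86193341, "fragment"),
    (86763704, 86851003, "clone"),
    (144006277, 144106276, "clone"),
    (145396297, 145403396, "clone"),
]


def enclosing_contig(gap_start, gap_stop, contigs):
    """Return the accession of the contig fully containing the gap, or None."""
    for offset, acc, length in contigs:
        # A contig covers 1-based positions offset+1 .. offset+length.
        if offset + 1 <= gap_start and gap_stop <= offset + length:
            return acc
    return None


gaps_inside_contigs = []
for start, stop, gap_type in AGP_GAPS:
    acc = enclosing_contig(start, stop, CONTIGS)
    if acc is not None:
        gaps_inside_contigs.append((start, stop, gap_type, acc))

for start, stop, gap_type, acc in gaps_inside_contigs:
    print(start, stop, stop - start + 1, gap_type, "inside", acc)
```

With the values above, the only gap that falls inside a contig is the
"fragment" gap; every other AGP gap sits exactly between two contigs.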

Is there any reason why this N-gap is ignored in the UCSC liftAll.lft file?
I have no idea what "fragment" means, but maybe a "fragment" is considered
as being PART OF A CONTIG and not a SEPARATOR BETWEEN CONTIGS?
If so, that would mean that N-gaps can occur _inside_ contigs and
that the liftAll.lft file is not the right place to extract the
N-gap coordinates from?
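
If N-gaps can indeed occur inside contigs, one fallback that depends on
neither file is to scan the chromosome sequence itself for runs of N.
A minimal sketch, assuming the sequence is already loaded as a Python
string (n_runs and the toy sequence are made up for illustration;
lowercase n is matched as well just to be safe):

```python
import re


def n_runs(seq, min_len=1):
    """Yield (start, stop, length), 1-based inclusive, for each run of N or n."""
    for match in re.finditer(r"[Nn]+", seq):
        length = match.end() - match.start()
        if length >= min_len:
            yield (match.start() + 1, match.end(), length)


# Toy sequence, not real genome data.
toy = "ACGTNNNNACGTnnACGTN"
all_runs = list(n_runs(toy))
long_runs = list(n_runs(toy, min_len=2))
print(all_runs)
print(long_runs)
```

The min_len parameter lets the caller ignore isolated Ns and keep only
the longer blocks as masks.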

Any help/recommendation on this would be highly appreciated.
Thanks in advance,
H.