[Genome] N-gap missing from liftAll.lft for chr8 in hg18
Herve Pages
hpages at fhcrc.org
Wed May 14 10:31:04 PDT 2008
Hi,
I'm working on tools where we are trying to implement fast pattern
matching algorithms on the entire chromosome sequences for hg18 and
other genomes. For hg18 we use the sequences provided here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
To speed up our algorithms, we'd like to be able to skip the N-gaps
and other regions. We also want our users to keep control over exactly
what is skipped, so our plan is to provide "predefined masks" for each
chromosome that the user can choose to put on the sequence or not.
So far I've successfully extracted the masks of N-gaps for hg18 from
the UCSC liftAll.lft file (offsets/lengths of contigs within chromosomes),
except for chr8, where I ran into the following problem.
For this chromosome, liftAll.lft contains the following lines:
offset     acc_num      length
0          8/NT_023736  7462059
7562059    8/NT_077531  4537293
12199352   8/NT_030737  9464880
21681632   8/NT_023666  8051036
29798768   8/NT_007995  14159284
46958052   8/NT_023678  1291149
48309201   8/NT_008183  38454502
86851003   8/NT_008046  57155273
144106276  8/NT_023684  1290020
145403396  8/NT_037704  871430
But the 7th contig in the above list (offset=48309201, acc_num=8/NT_008183,
length=38454502) contains a block of 124100 consecutive Ns.
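For reference, here is a minimal sketch of how the gaps between contigs can be derived from the lines above. It assumes only the three columns shown in this message (offset, acc_num, length); the name gaps_between_contigs is made up for illustration, and any gap after the last contig is ignored because the chromosome length is not part of the listing.

```python
# (offset, contig name, contig length) rows for chr8, copied from the listing above.
LIFT_CHR8 = [
    (0, "8/NT_023736", 7462059),
    (7562059, "8/NT_077531", 4537293),
    (12199352, "8/NT_030737", 9464880),
    (21681632, "8/NT_023666", 8051036),
    (29798768, "8/NT_007995", 14159284),
    (46958052, "8/NT_023678", 1291149),
    (48309201, "8/NT_008183", 38454502),
    (86851003, "8/NT_008046", 57155273),
    (144106276, "8/NT_023684", 1290020),
    (145403396, "8/NT_037704", 871430),
]


def gaps_between_contigs(rows):
    """Return (start, stop, length) for each gap between consecutive contigs.

    Coordinates are 1-based and inclusive, to match the AGP convention.
    """
    gaps = []
    for (offset, _, length), (next_offset, _, _) in zip(rows, rows[1:]):
        contig_end = offset + length  # 1-based position of the contig's last base
        if next_offset > contig_end:
            gaps.append((contig_end + 1, next_offset, next_offset - contig_end))
    return gaps


if __name__ == "__main__":
    for start, stop, length in gaps_between_contigs(LIFT_CHR8):
        print(start, stop, length)
```

With the ten contigs above this yields nine gaps, none of which falls inside 8/NT_008183.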
Now if I use the hs_ref_chr*.agp files from NCBI for the current
reference assembly (36.3):
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
to extract the coordinates of the N-gaps, I find exactly the same
gaps as with the UCSC liftAll.lft file for all chromosomes except
chr8. Here are the N-gaps reported in hs_ref_chr8.agp:
chr_start  chr_stop   gap_len  gap_type
7462060    7562059    100000   clone
12099353   12199352   100000   clone
21664233   21681632   17400    clone
29732669   29798768   66100    clone
43958053   46958052   3000000  centromere
48249202   48309201   60000    clone
86069242   86193341   124100   fragment
86763704   86851003   87300    clone
144006277  144106276  100000   clone
145396297  145403396  7100     clone
The 7th N-gap (of type "fragment") is indeed a region containing
124100 consecutive Ns, and it is the N-gap that is missing from the UCSC
liftAll.lft file.
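The inconsistency can be checked mechanically by asking which AGP gaps lie entirely inside a contig span from liftAll.lft. This is only a sketch: the contig and gap values are the chr8 numbers quoted in this message, the function name gaps_inside_contigs is hypothetical, and contigs are assumed to cover 1-based positions offset+1 through offset+length.

```python
# (offset, length) of the chr8 contigs as listed in liftAll.lft.
CONTIGS = [
    (0, 7462059), (7562059, 4537293), (12199352, 9464880),
    (21681632, 8051036), (29798768, 14159284), (46958052, 1291149),
    (48309201, 38454502), (86851003, 57155273), (144106276, 1290020),
    (145403396, 871430),
]

# (chr_start, chr_stop, gap_type) of the chr8 gaps as listed in hs_ref_chr8.agp.
AGP_GAPS = [
    (7462060, 7562059, "clone"), (12099353, 12199352, "clone"),
    (21664233, 21681632, "clone"), (29732669, 29798768, "clone"),
    (43958053, 46958052, "centromere"), (48249202, 48309201, "clone"),
    (86069242, 86193341, "fragment"), (86763704, 86851003, "clone"),
    (144006277, 144106276, "clone"), (145396297, 145403396, "clone"),
]


def gaps_inside_contigs(contigs, gaps):
    """Return the gaps that are fully contained in a single contig span."""
    hits = []
    for start, stop, gap_type in gaps:
        for offset, length in contigs:
            first_base, last_base = offset + 1, offset + length
            if first_base <= start and stop <= last_base:
                hits.append((start, stop, gap_type, offset))
    return hits


if __name__ == "__main__":
    print(gaps_inside_contigs(CONTIGS, AGP_GAPS))
```

On these numbers the only hit is the "fragment" gap, contained in the contig at offset 48309201.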
Note that this is the only N-gap of type "fragment" reported by NCBI
for the entire 36.3 build (302 N-gaps in total) and also the only "N-gap
inconsistency" between UCSC liftAll.lft and the NCBI hs_ref_chr*.agp files.
Is there any reason why this N-gap is ignored in the UCSC liftAll.lft file?
I have no idea what "fragment" means, but maybe a "fragment" is considered
as being PART OF A CONTIG and not a SEPARATOR BETWEEN CONTIGS?
If so, that would mean that N-gaps can occur _inside_ contigs and
that the liftAll.lft file is not the right place to extract the
N-gap coordinates from?
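If N-gaps can indeed occur inside contigs, one fallback that depends on neither file would be to scan the chromosome sequence itself for runs of Ns. A rough sketch, assuming the sequence is already loaded as a plain string; the name n_runs and the min_len parameter are made up for illustration:

```python
import re


def n_runs(seq, min_len=1):
    """Return (start, stop, length) for each run of Ns in seq.

    Coordinates are 1-based and inclusive. Runs shorter than min_len are dropped.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", seq):
        length = match.end() - match.start()
        if length >= min_len:
            # match.start() is 0-based; match.end() is exclusive, i.e. the
            # 1-based position of the last N in the run.
            runs.append((match.start() + 1, match.end(), length))
    return runs


if __name__ == "__main__":
    toy = "ACGTNNNNACGNNT"
    print(n_runs(toy))             # all runs of Ns
    print(n_runs(toy, min_len=3))  # only runs of at least 3 Ns
```

This finds every block of Ns regardless of how the assembly files classify it, at the cost of reading the whole sequence once.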
Any help/recommendation on this would be highly appreciated.
Thanks in advance,
H.