[Genome] N-gap missing from liftAll.lft for chr8 in hg18
Robert Kuhn
kuhn at soe.ucsc.edu
Wed May 14 13:54:18 PDT 2008
Herve,
We have a track/table on the hg18 assembly that annotates the gaps,
including the fragment gap in chr8. You are correct that a fragment
gap =is= within a clone, though it is defined as lying between the
contigs displayed in the assembly track. This is described on the
details page for the gap track, accessible via the minibutton to the
left of the gap track in the genome browser display, or by clicking
the gap track control below the image:
Fragment - gaps between the contigs of a draft clone. (In this context,
a contig is a set of overlapping sequence reads.) These may be bridged
or not.
There are also fragment gaps on several of the _random chroms:
chr8_random
chr9_random
chr10_random
chr15_random
chr17_random
The gap on chr8 can be seen here:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg18_mlq20080514
As you can see, the gap (and all other gaps) is excluded from the
assembly track (table chrN_gold, where each chrom is substituted
for N). You should be able to use the assembly track to get the
coordinates for all non-N sequence in the hg18 assembly.
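As a rough illustration of that last point, here is a minimal Python sketch that turns assembly-track intervals into gap intervals by taking their complement. It assumes the rows have already been fetched from a chrN_gold table as (chromStart, chromEnd) pairs in the usual 0-based, half-open convention. The function name and the toy numbers are made up for the example and are not real hg18 coordinates.

```python
def gaps_from_gold(gold_rows, chrom_size):
    """Return the (start, end) intervals not covered by any gold row.

    gold_rows  -- iterable of (chromStart, chromEnd), 0-based half-open
                  (assumed to come from a chrN_gold table)
    chrom_size -- total length of the chromosome
    """
    gaps = []
    pos = 0  # end of the sequence covered so far
    for start, end in sorted(gold_rows):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_size:
        gaps.append((pos, chrom_size))
    return gaps


# Toy intervals for illustration only (hypothetical, not hg18 data).
toy_gold = [(0, 100), (100, 250), (300, 420)]
toy_gaps = gaps_from_gold(toy_gold, 500)
print(toy_gaps)  # [(250, 300), (420, 500)]
```

The same complement could of course be read directly from the gap table. The sketch only shows that the two tables carry equivalent information.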
best wishes,
--b0b kuhn
On Wed, 14 May 2008 10:31:04 -0700
Herve Pages <hpages at fhcrc.org> wrote:
> Hi,
>
> I'm working on tools where we are trying to implement fast pattern
> matching algorithms on the entire chromosome sequences for hg18 and
> other genomes. For hg18 we use the sequences provided here:
>
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>
> In order to speed up our algorithms, we'd like to be able to skip the
> N-gaps and other regions. Also we want our users to keep some control
> on what to skip exactly so our plan is to provide "predefined masks"
> for each chromosome that the user can put on the sequence or not.
>
> So far I've successfully extracted the masks of N-gaps for hg18 from
> UCSC liftAll.lft file (offsets/lengths of contigs within chromosomes)
> except for chr8 where I run into the following problem.
>
> For this chromosome, liftAll.lft contains the following lines:
>
> offset     acc_num      length
> 0          8/NT_023736  7462059
> 7562059    8/NT_077531  4537293
> 12199352   8/NT_030737  9464880
> 21681632   8/NT_023666  8051036
> 29798768   8/NT_007995  14159284
> 46958052   8/NT_023678  1291149
> 48309201   8/NT_008183  38454502
> 86851003   8/NT_008046  57155273
> 144106276  8/NT_023684  1290020
> 145403396  8/NT_037704  871430
>
> But the 7th contig in the above list (offset=48309201,
> acc_num=8/NT_008183, length=38454502) contains a block of 124100
> consecutive Ns.
>
> Now if I use the hs_ref_chr*.agp files from NCBI for the current
> reference assembly (36.3):
>
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
>
> in order to extract the coordinates of the N-gaps, I find exactly the
> same thing as with UCSC liftAll.lft file for all chromosomes, except
> for chr8. Here are the N-gaps reported in hs_ref_chr8.agp:
>
> chr_start  chr_stop   gap_len  gap_type
> 7462060    7562059    100000   clone
> 12099353   12199352   100000   clone
> 21664233   21681632   17400    clone
> 29732669   29798768   66100    clone
> 43958053   46958052   3000000  centromere
> 48249202   48309201   60000    clone
> 86069242   86193341   124100   fragment
> 86763704   86851003   87300    clone
> 144006277  144106276  100000   clone
> 145396297  145403396  7100     clone
>
> The 7th N-gap (of type "fragment") is indeed a region containing
> 124100 consecutive Ns and it is the N-gap that is missing from UCSC
> liftAll.lft file.
> Note that this is the only N-gap of type "fragment" reported by NCBI
> for the entire 36.3 build (302 N-gaps in total) and also the only
> "N-gap inconsistency" between UCSC liftAll.lft and the NCBI
> hs_ref_chr*.agp files.
>
> Is there any reason why this N-gap is ignored in UCSC liftAll.lft file?
> I have no idea what "fragment" means, but maybe a "fragment" is
> considered as being PART OF A CONTIG and not a SEPARATOR BETWEEN
> CONTIGS?
> In this case that would mean that N-gaps can occur _inside_ contigs and
> that the liftAll.lft file is not the right place to extract the
> N-gaps coordinates from?
>
> Any help/recommendation on this would be highly appreciated.
> Thanks in advance,
> H.
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
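The discrepancy described in the quoted message can be reproduced with a few lines of Python. The sketch below copies the chr8 numbers from the two excerpts above, derives the gaps that lie between consecutive contigs in the lift data, and lists the AGP gaps that are not among them. The variable and function names are invented for the example. It also assumes that lift offsets are 0-based and that the AGP coordinates are 1-based and inclusive, which is consistent with the quoted numbers.

```python
# (offset, length) rows for chr8, copied from the liftAll.lft excerpt above.
lift_chr8 = [
    (0, 7462059), (7562059, 4537293), (12199352, 9464880),
    (21681632, 8051036), (29798768, 14159284), (46958052, 1291149),
    (48309201, 38454502), (86851003, 57155273), (144106276, 1290020),
    (145403396, 871430),
]

# (chr_start, chr_stop) rows copied from the hs_ref_chr8.agp excerpt above.
agp_gaps = [
    (7462060, 7562059), (12099353, 12199352), (21664233, 21681632),
    (29732669, 29798768), (43958053, 46958052), (48249202, 48309201),
    (86069242, 86193341), (86763704, 86851003), (144006277, 144106276),
    (145396297, 145403396),
]


def gaps_between_contigs(lift_rows):
    """Return 1-based inclusive (start, stop) gaps between adjacent contigs."""
    gaps = []
    for (offset, length), (next_offset, _) in zip(lift_rows, lift_rows[1:]):
        contig_end = offset + length  # 1-based position of the contig's last base
        if next_offset > contig_end:
            gaps.append((contig_end + 1, next_offset))
    return gaps


lift_gaps = gaps_between_contigs(lift_chr8)
missing = [gap for gap in agp_gaps if gap not in set(lift_gaps)]
print(missing)  # [(86069242, 86193341)]
```

The only AGP gap without a counterpart is the 124100-base fragment gap, which falls inside contig 8/NT_008183 and so can never appear as a space between contigs in the lift file.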