[Genome] N-gap missing from liftAll.lft for chr8 in hg18

Robert Kuhn kuhn at soe.ucsc.edu
Wed May 14 13:54:18 PDT 2008


Herve,

We have a track/table on the hg18 assembly that annotates the gaps,
including the fragment gap in chr8.  You are correct that a fragment
gap =is= within a clone, though it is defined as between the contigs
displayed in the assembly track.  This is described on the details page
for the gap track, accessible via the minibutton to the left of
the gap track in the genome browser display, or by clicking the
gap track control below the image:

Fragment - gaps between the contigs of a draft clone. (In this context,
a contig is a set of overlapping sequence reads.) These may be bridged
or not.

There are also fragment gaps on several of the _random chroms:
  chr8_random
  chr9_random
  chr10_random
  chr15_random
  chr17_random
  
The gap on chr8 can be seen here:

http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg18_mlq20080514

As you can see, the gap (and all other gaps) is excluded from the
assembly track (table chrN_gold, where each chrom is substituted
for N).  You should be able to use the assembly track to get the
coordinates for all non-N sequence in the hg18 assembly.
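
As a rough sketch of that last step (not part of any UCSC tool): if you
pull the chromStart and chromEnd columns of chrN_gold for one chromosome,
the N-gaps are whatever those blocks do not cover. The function name and
the toy coordinates below are made up for illustration. The code assumes
the usual UCSC table convention of 0-based, half-open coordinates, and
that you supply the chromosome length yourself.

```python
def gaps_from_blocks(blocks, chrom_size):
    """Return the (start, end) intervals NOT covered by any block.

    blocks     -- iterable of (chromStart, chromEnd) pairs, assumed to be
                  0-based and half-open, as in UCSC tables
    chrom_size -- total length of the chromosome
    """
    gaps = []
    pos = 0  # end of the sequence covered so far
    for start, end in sorted(blocks):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_size:
        gaps.append((pos, chrom_size))
    return gaps


# Toy coordinates, not real hg18 data.
print(gaps_from_blocks([(0, 100), (150, 300)], 320))
# prints [(100, 150), (300, 320)]
```

Sorting first means the rows can arrive in any order, and overlapping
blocks do not produce spurious gaps.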

best wishes,

                         --b0b kuhn

On Wed, 14 May 2008 10:31:04 -0700
  Herve Pages <hpages at fhcrc.org> wrote:
> Hi,
> 
> I'm working on tools where we are trying to implement fast pattern
> matching algorithms on the entire chromosome sequences for hg18 and
> other genomes. For hg18 we use the sequences provided here:
> 
>   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
> 
> In order to speed up our algorithms, we'd like to be able to skip the
> N-gaps and other regions. Also we want our users to keep some control
> on what to skip exactly so our plan is to provide "predefined masks"
> for each chromosome that the user can put on the sequence or not.
> 
> So far I've successfully extracted the masks of N-gaps for hg18 from
> UCSC liftAll.lft file (offsets/lengths of contigs within chromosomes)
> except for chr8 where I run into the following problem.
> 
> For this chromosome, liftAll.lft contains the following lines:
> 
>       offset      acc_num    length
>            0  8/NT_023736   7462059
>      7562059  8/NT_077531   4537293
>     12199352  8/NT_030737   9464880
>     21681632  8/NT_023666   8051036
>     29798768  8/NT_007995  14159284
>     46958052  8/NT_023678   1291149
>     48309201  8/NT_008183  38454502
>     86851003  8/NT_008046  57155273
>    144106276  8/NT_023684   1290020
>    145403396  8/NT_037704    871430
> 
> But the 7th contig in the above list (offset=48309201,
> acc_num=8/NT_008183, length=38454502) contains a block of 124100
> consecutive Ns.
> 
> Now if I use the hs_ref_chr*.agp files from NCBI for the current
> reference assembly (36.3):
> 
>   ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
> 
> in order to extract the coordinates of the N-gaps, I find exactly the
> same thing as with UCSC liftAll.lft file for all chromosomes, except
> for chr8. Here are the N-gaps reported in hs_ref_chr8.agp:
> 
>   chr_start   chr_stop  gap_len   gap_type
>     7462060    7562059   100000      clone
>    12099353   12199352   100000      clone
>    21664233   21681632    17400      clone
>    29732669   29798768    66100      clone
>    43958053   46958052  3000000 centromere
>    48249202   48309201    60000      clone
>    86069242   86193341   124100   fragment
>    86763704   86851003    87300      clone
>   144006277  144106276   100000      clone
>   145396297  145403396     7100      clone
> 
> The 7th N-gap (of type "fragment") is indeed a region containing
> 124100 consecutive Ns and it is the N-gap that is missing from UCSC
> liftAll.lft file.
> Note that this is the only N-gap of type "fragment" reported by NCBI
> for the entire 36.3 build (302 N-gaps in total) and also the only
> "N-gap inconsistency" between UCSC liftAll.lft and the NCBI
> hs_ref_chr*.agp files.
> 
> Is there any reason why this N-gap is ignored in UCSC liftAll.lft file?
> I have no idea what "fragment" means, but maybe a "fragment" is
> considered as being PART OF A CONTIG and not a SEPARATOR BETWEEN
> CONTIGS?
> In this case that would mean that N-gaps can occur _inside_ contigs and
> that the liftAll.lft file is not the right place to extract the
> N-gaps coordinates from?
> 
> Any help/recommendation on this would be highly appreciated.
> Thanks in advance,
> H.
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
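
The comparison described in the quoted message can be reproduced with a
short sketch. The two lists below are copied from the chr8 liftAll.lft
lines and the hs_ref_chr8.agp gaps quoted above; the function and variable
names are hypothetical. It assumes that liftAll.lft offsets are 0-based
and that the AGP gap coordinates are 1-based and inclusive, which is
consistent with the quoted numbers.

```python
# (offset, length) for each chr8 contig, as quoted from liftAll.lft.
CHR8_LIFT = [
    (0, 7462059), (7562059, 4537293), (12199352, 9464880),
    (21681632, 8051036), (29798768, 14159284), (46958052, 1291149),
    (48309201, 38454502), (86851003, 57155273), (144106276, 1290020),
    (145403396, 871430),
]

# (chr_start, chr_stop, gap_len) as quoted from hs_ref_chr8.agp.
CHR8_AGP_GAPS = [
    (7462060, 7562059, 100000), (12099353, 12199352, 100000),
    (21664233, 21681632, 17400), (29732669, 29798768, 66100),
    (43958053, 46958052, 3000000), (48249202, 48309201, 60000),
    (86069242, 86193341, 124100), (86763704, 86851003, 87300),
    (144006277, 144106276, 100000), (145396297, 145403396, 7100),
]


def gaps_between_contigs(lift_rows):
    """Derive (chr_start, chr_stop, gap_len) from consecutive contigs.

    Assumes 0-based offsets in; emits 1-based inclusive coordinates so
    the result can be compared directly with the AGP gap lines.
    """
    gaps = []
    for (offset, length), (next_offset, _) in zip(lift_rows, lift_rows[1:]):
        contig_end = offset + length  # 0-based, exclusive
        if next_offset > contig_end:
            gaps.append((contig_end + 1, next_offset, next_offset - contig_end))
    return gaps


lift_gaps = gaps_between_contigs(CHR8_LIFT)
missing = [gap for gap in CHR8_AGP_GAPS if gap not in lift_gaps]
print(missing)
```

With the quoted figures, the only AGP gap that the lift file does not
reproduce is the 124100-base fragment gap, since it lies inside contig
NT_008183 rather than between two contigs.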


