[Genome] N-gap missing from liftAll.lft for chr8 in hg18

Herve Pages hpages at fhcrc.org
Wed May 14 19:09:29 PDT 2008


Hi Kuhn,

Thank you very much for your detailed answer! That is a lot of very
useful information that I need to digest now (I've almost never used the
genome browser before, but I learned a lot by reading your email and I feel
much more comfortable with it now).
Maybe I'll use the chr*_gap.txt.gz files in
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
for extracting the masks of "assembly gaps". If I understand correctly
the gaps here should be exactly the same as the gaps reported in the NCBI
hs_ref_chr*.agp files.
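
A minimal sketch of that idea in Python. It assumes each chr*_gap.txt.gz
file is a gzipped, tab-separated table dump with the column order
bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge and with
0-based, half-open coordinates. That layout is an assumption to verify
against the .sql schema file shipped next to each table; it is not stated
in this thread. The names Gap, parse_gap_lines and read_gap_file are
hypothetical, and the bin, ix and bridge values in the sample row are
placeholders (only the coordinates and type come from the chr8 fragment
gap discussed below).

```python
import gzip
from collections import namedtuple

# One gap interval; start is 0-based, end is exclusive (assumed convention).
Gap = namedtuple("Gap", "chrom start end gap_type")


def parse_gap_lines(lines):
    """Yield Gap records from tab-separated gap-table rows.

    Assumed column order:
    bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip blank or malformed rows
        yield Gap(fields[1], int(fields[2]), int(fields[3]), fields[7])


def read_gap_file(path):
    """Read all gaps from one gzipped table dump (path is supplied by the caller)."""
    with gzip.open(path, "rt") as handle:
        return list(parse_gap_lines(handle))


# Sample row: coordinates are the chr8 fragment gap converted to a 0-based
# start; the bin ("0"), ix ("0") and bridge ("no") values are placeholders.
sample_row = "0\tchr8\t86069241\t86193341\t0\tN\t124100\tfragment\tno\n"
sample_gaps = list(parse_gap_lines([sample_row]))
for gap in sample_gaps:
    print(gap.chrom, gap.start, gap.end, gap.end - gap.start, gap.gap_type)
```

Each Gap can then be turned directly into a mask interval for the
corresponding chromosome sequence.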

Thanks again!

Cheers,
H.


Robert Kuhn wrote:
> 
> Herve,
> 
> We have a track/table on the hg18 assembly that annotates the gaps,
> including the fragment gap in chr8.  You are correct that a fragment
> gap =is= within a clone, though it is defined as between the contigs
> displayed in the assembly track.  This is described on the details page
> for the gap track, accessible via the minibutton to the left of
> the gap track in the genome browser display, or by clicking the
> gap track control below the image:
> 
> Fragment - gaps between the contigs of a draft clone. (In this context,
> a contig is a set of overlapping sequence reads.) These may be bridged
> or not.
> 
> There are also fragment gaps on several of the _random chroms:
>  chr8_random
>  chr9_random
>  chr10_random
>  chr15_random
>  chr17_random
>  
> The gap on chr8 can be seen here:
> 
> http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg18_mlq20080514
> 
> As you can see, the gap (and all other gaps) is excluded from the
> assembly track (table chrN_gold, where each chrom is substituted
> for N).  You should be able to use the assembly track to get the
> coordinates for all non-N sequence in the hg18 assembly.
> 
> best wishes,
> 
>                         --b0b kuhn
> 
> On Wed, 14 May 2008 10:31:04 -0700
>  Herve Pages <hpages at fhcrc.org> wrote:
>> Hi,
>>
>> I'm working on tools where we are trying to implement fast pattern
>> matching algorithms on the entire chromosome sequences for hg18 and
>> other genomes. For hg18 we use the sequences provided here:
>>
>>   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>>
>> In order to speed up our algorithms, we'd like to be able to skip the
>> N-gaps and other regions. Also we want our users to keep some control
>> on what to skip exactly so our plan is to provide "predefined masks"
>> for each chromosome that the user can put on the sequence or not.
>>
>> So far I've successfully extracted the masks of N-gaps for hg18 from
>> UCSC liftAll.lft file (offsets/lengths of contigs within chromosomes)
>> except for chr8 where I run into the following problem.
>>
>> For this chromosome, liftAll.lft contains the following lines:
>>
>>       offset      acc_num    length
>>            0  8/NT_023736   7462059
>>      7562059  8/NT_077531   4537293
>>     12199352  8/NT_030737   9464880
>>     21681632  8/NT_023666   8051036
>>     29798768  8/NT_007995  14159284
>>     46958052  8/NT_023678   1291149
>>     48309201  8/NT_008183  38454502
>>     86851003  8/NT_008046  57155273
>>    144106276  8/NT_023684   1290020
>>    145403396  8/NT_037704    871430
>>
>> But the 7th contig in the above list (offset=48309201,
>> acc_num=8/NT_008183, length=38454502) contains a block of 124100
>> consecutive Ns.
>>
>> Now if I use the hs_ref_chr*.agp files from NCBI for the current
>> reference assembly (36.3):
>>
>>   ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
>>
>> in order to extract the coordinates of the N-gaps, I find exactly the
>> same thing as with UCSC liftAll.lft file for all chromosomes, except
>> for chr8. Here are the N-gaps reported in hs_ref_chr8.agp:
>>
>>   chr_start   chr_stop  gap_len   gap_type
>>     7462060    7562059   100000      clone
>>    12099353   12199352   100000      clone
>>    21664233   21681632    17400      clone
>>    29732669   29798768    66100      clone
>>    43958053   46958052  3000000 centromere
>>    48249202   48309201    60000      clone
>>    86069242   86193341   124100   fragment
>>    86763704   86851003    87300      clone
>>   144006277  144106276   100000      clone
>>   145396297  145403396     7100      clone
>>
>> The 7th N-gap (of type "fragment") is indeed a region containing
>> 124100 consecutive Ns and it is the N-gap that is missing from UCSC
>> liftAll.lft file.
>> Note that this is the only N-gap of type "fragment" reported by NCBI
>> for the entire 36.3 build (302 N-gaps in total) and also the only
>> "N-gap inconsistency" between UCSC liftAll.lft and the NCBI
>> hs_ref_chr*.agp files.
>>
>> Is there any reason why this N-gap is ignored in UCSC liftAll.lft file?
>> I have no idea what "fragment" means, but maybe a "fragment" is
>> considered as being PART OF A CONTIG and not a SEPARATOR BETWEEN CONTIGS?
>> In this case that would mean that N-gaps can occur _inside_ contigs and
>> that the liftAll.lft file is not the right place to extract the
>> N-gaps coordinates from?
>>
>> Any help/recommendation on this would be highly appreciated.
>> Thanks in advance,
>> H.
>> _______________________________________________
>> Genome maillist  -  Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
> 
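
The comparison described in the quoted message can be reproduced from the
numbers given there. The sketch below is only an illustration, not the
code actually used: it takes the chr8 offsets and lengths quoted from
liftAll.lft, derives the gaps between consecutive contigs, and checks
them against the gaps quoted from hs_ref_chr8.agp. The names LIFT_CHR8,
AGP_GAPS_CHR8 and gaps_between_contigs are hypothetical. Offsets are
treated as 0-based and the AGP coordinates as 1-based inclusive, which is
the reading under which the two quoted tables line up.

```python
# (offset, length) pairs copied from the liftAll.lft lines quoted above.
LIFT_CHR8 = [
    (0, 7462059),
    (7562059, 4537293),
    (12199352, 9464880),
    (21681632, 8051036),
    (29798768, 14159284),
    (46958052, 1291149),
    (48309201, 38454502),
    (86851003, 57155273),
    (144106276, 1290020),
    (145403396, 871430),
]

# (chr_start, chr_stop, gap_type) copied from the hs_ref_chr8.agp gaps
# quoted above; coordinates are 1-based and inclusive.
AGP_GAPS_CHR8 = [
    (7462060, 7562059, "clone"),
    (12099353, 12199352, "clone"),
    (21664233, 21681632, "clone"),
    (29732669, 29798768, "clone"),
    (43958053, 46958052, "centromere"),
    (48249202, 48309201, "clone"),
    (86069242, 86193341, "fragment"),
    (86763704, 86851003, "clone"),
    (144006277, 144106276, "clone"),
    (145396297, 145403396, "clone"),
]


def gaps_between_contigs(placements):
    """Return 1-based inclusive (start, stop) gaps between consecutive contigs."""
    ordered = sorted(placements)
    gaps = []
    for (offset, length), (next_offset, _) in zip(ordered, ordered[1:]):
        contig_end = offset + length  # 0-based exclusive end == 1-based last base
        if next_offset > contig_end:
            gaps.append((contig_end + 1, next_offset))
    return gaps


lift_gaps = gaps_between_contigs(LIFT_CHR8)
lift_gap_set = set(lift_gaps)

# AGP gaps that cannot be recovered from the contig placements alone.
missing = [gap for gap in AGP_GAPS_CHR8 if (gap[0], gap[1]) not in lift_gap_set]
print(missing)  # [(86069242, 86193341, 'fragment')]
```

The only AGP gap that does not fall between two contigs is the fragment
gap, which lies inside the NT_008183 placement (48309202 to 86763703 in
1-based coordinates).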


