[Genome] N-gap missing from liftAll.lft for chr8 in hg18

Hiram Clawson hiram at soe.ucsc.edu
Wed May 14 20:25:33 PDT 2008


Good Evening Herve:

You could also simply scan the hg18 sequence and find all the N's
(and n's) and mark those locations as out of your search area.
There are cases where there are N's in the sequence, but they
are not marked as gaps in the agp files.
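
A minimal sketch of that scan, assuming the chromosome sequence has already been read into a Python string. The function name n_runs and the min_len parameter are illustrative only and are not part of any UCSC tool.

```python
import re
from typing import Iterator, Tuple


def n_runs(seq: str, min_len: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for each run of N or n in seq.

    Coordinates are 0-based with an exclusive end, so end - start
    is the run length. Runs shorter than min_len are skipped.
    """
    for match in re.finditer(r"[Nn]+", seq):
        if match.end() - match.start() >= min_len:
            yield match.start(), match.end()


if __name__ == "__main__":
    toy = "ACGTNNNNacgtnnACGTN"  # toy sequence, not real hg18 data
    print(list(n_runs(toy)))             # [(4, 8), (12, 14), (18, 19)]
    print(list(n_runs(toy, min_len=2)))  # [(4, 8), (12, 14)]
```

Because this works from the sequence itself, it also catches N's that no gap annotation mentions. For a whole chromosome it may be preferable to stream the FASTA file rather than hold it in one string.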

--Hiram

Herve Pages wrote:
> Hi Kuhn,
> 
> Thank you very much for your detailed answer! That is a lot of very
> useful information that I need to digest now (I've almost never used the
> genome browser before but I learned a lot by reading your email and I feel
> much more comfortable with it now).
> Maybe I'll use the chr*_gap.txt.gz files in
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
> for extracting the masks of "assembly gaps". If I understand correctly
> the gaps here should be exactly the same as the gaps reported in the NCBI
> hs_ref_chr*.agp files.
> 
> Thanks again!
> 
> Cheers,
> H.
> 
> 
> Robert Kuhn wrote:
>> Herve,
>>
>> We have a track/table on the hg18 assembly that annotates the gaps,
>> including the fragment gap in chr8.  You are correct that a fragment
>> gap =is= within a clone, though it is defined as between the contigs
>> displayed in the assembly track.  This is described on the details page
>> for the gap track, accessible via the minibutton to the left of
>> the gap track in the genome browser display, or by clicking the
>> gap track control below the image:
>>
>> Fragment - gaps between the contigs of a draft clone. (In this context,
>> a contig is a set of overlapping sequence reads.) These may be bridged
>> or not.
>>
>> There are also fragment gaps on several of the _random chroms:
>>  chr8_random
>>  chr9_random
>>  chr10_random
>>  chr15_random
>>  chr17_random
>>  
>> The gap on chr8 can be seen here:
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg18_mlq20080514
>>
>> As you can see, the gap (and all other gaps) is excluded from the
>> assembly track (table chrN_gold, where each chrom is substituted
>> for N).  You should be able to use the assembly track to get the
>> coordinates for all non-N sequence in the hg18 assembly.
>>
>> best wishes,
>>
>>                         --b0b kuhn
>>
>> On Wed, 14 May 2008 10:31:04 -0700
>>  Herve Pages <hpages at fhcrc.org> wrote:
>>> Hi,
>>>
>>> I'm working on tools where we are trying to implement fast pattern
>>> matching algorithms on the entire chromosome sequences for hg18 and
>>> other genomes. For hg18 we use the sequences provided here:
>>>
>>>   http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>>>
>>> In order to speed up our algorithms, we'd like to be able to skip the
>>> N-gaps and other regions. Also we want our users to keep some control
>>> on what to skip exactly so our plan is to provide "predefined masks"
>>> for each chromosome that the user can put on the sequence or not.
>>>
>>> So far I've successfully extracted the masks of N-gaps for hg18 from
>>> the UCSC liftAll.lft file (offsets/lengths of contigs within
>>> chromosomes), except for chr8, where I ran into the following problem.
>>>
>>> For this chromosome, liftAll.lft contains the following lines:
>>>
>>>       offset      acc_num    length
>>>            0  8/NT_023736   7462059
>>>      7562059  8/NT_077531   4537293
>>>     12199352  8/NT_030737   9464880
>>>     21681632  8/NT_023666   8051036
>>>     29798768  8/NT_007995  14159284
>>>     46958052  8/NT_023678   1291149
>>>     48309201  8/NT_008183  38454502
>>>     86851003  8/NT_008046  57155273
>>>    144106276  8/NT_023684   1290020
>>>    145403396  8/NT_037704    871430
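
The extraction described above can be sketched as follows: treat each offset as the 0-based start of a contig and report any space between the end of one contig and the start of the next as a gap. The rows are copied from the chr8 lines quoted above; the function name gaps_between_contigs is hypothetical, and the 0-based reading of the offsets is an assumption that happens to reproduce the NCBI coordinates.

```python
from typing import List, Tuple

# (offset, acc_num, length) rows for chr8, copied from the quoted excerpt.
CHR8_LIFT = [
    (0, "8/NT_023736", 7462059),
    (7562059, "8/NT_077531", 4537293),
    (12199352, "8/NT_030737", 9464880),
    (21681632, "8/NT_023666", 8051036),
    (29798768, "8/NT_007995", 14159284),
    (46958052, "8/NT_023678", 1291149),
    (48309201, "8/NT_008183", 38454502),
    (86851003, "8/NT_008046", 57155273),
    (144106276, "8/NT_023684", 1290020),
    (145403396, "8/NT_037704", 871430),
]


def gaps_between_contigs(
    lift: List[Tuple[int, str, int]]
) -> List[Tuple[int, int, int]]:
    """Return (start, stop, length) for each gap between adjacent contigs.

    start and stop are 1-based and inclusive. Offsets are assumed to be
    0-based contig starts.
    """
    rows = sorted(lift)  # sort by offset
    gaps = []
    for (offset, _, length), (next_offset, _, _) in zip(rows, rows[1:]):
        contig_end = offset + length  # 0-based, exclusive
        if next_offset > contig_end:
            gaps.append((contig_end + 1, next_offset, next_offset - contig_end))
    return gaps


if __name__ == "__main__":
    for start, stop, length in gaps_between_contigs(CHR8_LIFT):
        print(f"{start:>10} {stop:>10} {length:>8}")
```

This yields nine gaps for chr8. A run of N's inside a single contig cannot appear in the result, since only the space between contigs is examined. Gaps before the first contig or after the last one would additionally need the chromosome length.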
>>>
>>> But the 7th contig in the above list (offset=48309201,
>>> acc_num=8/NT_008183, length=38454502) contains a block of 124100
>>> consecutive Ns.
>>>
>>> Now if I use the hs_ref_chr*.agp files from NCBI for the current
>>> reference assembly (36.3):
>>>
>>>   ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
>>>
>>> in order to extract the coordinates of the N-gaps, I find exactly the
>>> same thing as with UCSC liftAll.lft file for all chromosomes, except
>>> for chr8. Here are the N-gaps reported in hs_ref_chr8.agp:
>>>
>>>   chr_start   chr_stop  gap_len   gap_type
>>>     7462060    7562059   100000      clone
>>>    12099353   12199352   100000      clone
>>>    21664233   21681632    17400      clone
>>>    29732669   29798768    66100      clone
>>>    43958053   46958052  3000000 centromere
>>>    48249202   48309201    60000      clone
>>>    86069242   86193341   124100   fragment
>>>    86763704   86851003    87300      clone
>>>   144006277  144106276   100000      clone
>>>   145396297  145403396     7100      clone
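
One way to check the two listings against each other is to ask, for every AGP gap, whether it lies entirely inside one of the lift contigs. The sketch below uses only the chr8 numbers quoted in this message; the function name gaps_inside_contigs is hypothetical, and it assumes lift offsets are 0-based while AGP coordinates are 1-based and inclusive.

```python
from typing import List, Tuple

# (offset, length) of each chr8 contig, from the liftAll.lft lines quoted above.
CHR8_CONTIGS = [
    (0, 7462059), (7562059, 4537293), (12199352, 9464880),
    (21681632, 8051036), (29798768, 14159284), (46958052, 1291149),
    (48309201, 38454502), (86851003, 57155273), (144106276, 1290020),
    (145403396, 871430),
]

# (chr_start, chr_stop, gap_type) from the hs_ref_chr8.agp listing quoted above.
CHR8_AGP_GAPS = [
    (7462060, 7562059, "clone"), (12099353, 12199352, "clone"),
    (21664233, 21681632, "clone"), (29732669, 29798768, "clone"),
    (43958053, 46958052, "centromere"), (48249202, 48309201, "clone"),
    (86069242, 86193341, "fragment"), (86763704, 86851003, "clone"),
    (144006277, 144106276, "clone"), (145396297, 145403396, "clone"),
]


def gaps_inside_contigs(
    contigs: List[Tuple[int, int]],
    gaps: List[Tuple[int, int, str]],
) -> List[Tuple[int, int, str]]:
    """Return the gaps that fall entirely within a single contig."""
    inside = []
    for start, stop, gap_type in gaps:
        for offset, length in contigs:
            # A contig covers 1-based positions offset+1 .. offset+length.
            if offset + 1 <= start and stop <= offset + length:
                inside.append((start, stop, gap_type))
                break
    return inside


if __name__ == "__main__":
    print(gaps_inside_contigs(CHR8_CONTIGS, CHR8_AGP_GAPS))
    # [(86069242, 86193341, 'fragment')]
```

Only the "fragment" gap is reported, which matches the observation that it sits inside contig 8/NT_008183 rather than between two contigs.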
>>>
>>> The 7th N-gap (of type "fragment") is indeed a region containing
>>> 124100 consecutive Ns and it is the N-gap that is missing from UCSC
>>> liftAll.lft file.
>>> Note that this is the only N-gap of type "fragment" reported by NCBI
>>> for the entire 36.3 build (302 N-gaps in total) and also the only
>>> "N-gap inconsistency" between UCSC liftAll.lft and the NCBI
>>> hs_ref_chr*.agp files.
>>>
>>> Is there any reason why this N-gap is ignored in UCSC liftAll.lft file?
>>> I have no idea what "fragment" means, but maybe a "fragment" is
>>> considered as being PART OF A CONTIG and not a SEPARATOR BETWEEN
>>> CONTIGS?
>>> In this case that would mean that N-gaps can occur _inside_ contigs and
>>> that the liftAll.lft file is not the right place to extract the
>>> N-gaps coordinates from?
>>>
>>> Any help/recommendation on this would be highly appreciated.
>>> Thanks in advance,
>>> H.
>>> _______________________________________________
>>> Genome maillist  -  Genome at soe.ucsc.edu
>>> http://www.soe.ucsc.edu/mailman/listinfo/genome

