[Genome] N-gap missing from liftAll.lft for chr8 in hg18
Herve Pages
hpages at fhcrc.org
Wed May 14 19:09:29 PDT 2008
Hi Kuhn,
Thank you very much for your detailed answer! That is a lot of very
useful information that I need to digest now (I've almost never used the
genome browser before but I learned a lot by reading your email and I feel
much more comfortable with it now).
Maybe I'll use the chr*_gap.txt.gz files in
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
for extracting the masks of "assembly gaps". If I understand correctly
the gaps here should be exactly the same as the gaps reported in the NCBI
hs_ref_chr*.agp files.
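
For the record, here is a minimal sketch of how I expect to read those files.
It assumes the gap table columns are bin, chrom, chromStart, chromEnd, ix, n,
size, type, bridge (tab-separated, chromStart 0-based, chromEnd exclusive);
I still need to check this against the table schema. The names Gap,
parse_gap_rows and read_gap_file are mine, and the bin, ix and bridge values
in the sample row are made up for illustration.

```python
import gzip
from typing import Iterable, Iterator, List, NamedTuple


class Gap(NamedTuple):
    chrom: str
    start: int      # 1-based, inclusive (AGP-style)
    end: int        # 1-based, inclusive
    gap_type: str


def parse_gap_rows(lines: Iterable[str]) -> Iterator[Gap]:
    """Yield one Gap per tab-separated row of a chrN_gap table dump.

    Assumed column order: bin, chrom, chromStart, chromEnd, ix, n,
    size, type, bridge. Rows with fewer than 9 fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        chrom, chrom_start, chrom_end, gap_type = (
            fields[1], fields[2], fields[3], fields[7])
        # Convert the 0-based half-open interval to 1-based inclusive.
        yield Gap(chrom, int(chrom_start) + 1, int(chrom_end), gap_type)


def read_gap_file(path: str) -> List[Gap]:
    """Read a gzipped table dump such as chr8_gap.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return list(parse_gap_rows(handle))


# Sample row for the chr8 fragment gap; bin, ix and bridge are hypothetical.
sample = "0\tchr8\t86069241\t86193341\t0\tN\t124100\tfragment\tno\n"
sample_gaps = list(parse_gap_rows([sample]))
print(sample_gaps)
```

With the conversion to 1-based coordinates the sample row should match the
"fragment" line of hs_ref_chr8.agp (86069242 to 86193341).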
Thanks again!
Cheers,
H.
Robert Kuhn wrote:
>
> Herve,
>
> We have a track/table on the hg18 assembly that annotates the gaps,
> including the fragment gap in chr8. You are correct that a fragment
> gap =is= within a clone, though it is defined as between the contigs
> displayed in the assembly track. This is described on the details page
> for the gap track, accessible via the minibutton to the left of
> the gap track in the genome browser display, or by clicking the
> gap track control below the image:
>
> Fragment - gaps between the contigs of a draft clone. (In this context,
> a contig is a set of overlapping sequence reads.) These may be bridged
> or not.
>
> There are also fragment gaps on several of the _random chroms:
> chr8_random
> chr9_random
> chr10_random
> chr15_random
> chr17_random
>
> The gap on chr8 can be seen here:
>
> http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg18_mlq20080514
>
> As you can see, the gap (and all other gaps) is excluded from the
> assembly track (table chrN_gold, where each chrom is substituted
> for N). You should be able to use the assembly track to get the
> coordinates for all non-N sequence in the hg18 assembly.
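
A note to myself on the suggestion above: if the assembly track gives the
non-N intervals, the masks are just their complement within the chromosome.
A minimal sketch, assuming the intervals are 0-based half-open
(start, end) pairs; the function name complement_intervals and the toy
numbers are mine and are not taken from the chrN_gold table.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based start, exclusive end


def complement_intervals(covered: List[Interval], chrom_len: int) -> List[Interval]:
    """Return the parts of [0, chrom_len) not covered by any interval.

    Overlapping or unsorted input is handled by sorting and tracking
    the rightmost covered position seen so far.
    """
    gaps: List[Interval] = []
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_len:
        gaps.append((cursor, chrom_len))
    return gaps


# Toy example: two sequenced pieces on a 25-base "chromosome".
toy_gaps = complement_intervals([(0, 10), (15, 20)], 25)
print(toy_gaps)
```
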
>
> best wishes,
>
> --b0b kuhn
>
> On Wed, 14 May 2008 10:31:04 -0700
> Herve Pages <hpages at fhcrc.org> wrote:
>> Hi,
>>
>> I'm working on tools where we are trying to implement fast pattern
>> matching algorithms on the entire chromosome sequences for hg18 and
>> other genomes. For hg18 we use the sequences provided here:
>>
>> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>>
>> In order to speed up our algorithms, we'd like to be able to skip the
>> N-gaps and other regions. Also we want our users to keep some control
>> on what to skip exactly so our plan is to provide "predefined masks"
>> for each chromosome that the user can put on the sequence or not.
>>
>> So far I've successfully extracted the masks of N-gaps for hg18 from
>> UCSC liftAll.lft file (offsets/lengths of contigs within chromosomes)
>> except for chr8 where I run into the following problem.
>>
>> For this chromosome, liftAll.lft contains the following lines:
>>
>>   offset     acc_num      length
>>   0          8/NT_023736  7462059
>>   7562059    8/NT_077531  4537293
>>   12199352   8/NT_030737  9464880
>>   21681632   8/NT_023666  8051036
>>   29798768   8/NT_007995  14159284
>>   46958052   8/NT_023678  1291149
>>   48309201   8/NT_008183  38454502
>>   86851003   8/NT_008046  57155273
>>   144106276  8/NT_023684  1290020
>>   145403396  8/NT_037704  871430
>>
>> But the 7th contig in the above list (offset=48309201,
>> acc_num=8/NT_008183, length=38454502) contains a block of 124100
>> consecutive Ns.
>>
>> Now if I use the hs_ref_chr*.agp files from NCBI for the current
>> reference assembly (36.3):
>>
>> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
>>
>> in order to extract the coordinates of the N-gaps, I find exactly the
>> same thing as with UCSC liftAll.lft file for all chromosomes, except
>> for chr8. Here are the N-gaps reported in hs_ref_chr8.agp:
>>
>>   chr_start  chr_stop   gap_len  gap_type
>>   7462060    7562059    100000   clone
>>   12099353   12199352   100000   clone
>>   21664233   21681632   17400    clone
>>   29732669   29798768   66100    clone
>>   43958053   46958052   3000000  centromere
>>   48249202   48309201   60000    clone
>>   86069242   86193341   124100   fragment
>>   86763704   86851003   87300    clone
>>   144006277  144106276  100000   clone
>>   145396297  145403396  7100     clone
>>
>> The 7th N-gap (of type "fragment") is indeed a region containing
>> 124100 consecutive Ns and it is the N-gap that is missing from UCSC
>> liftAll.lft file.
>> Note that this is the only N-gap of type "fragment" reported by NCBI
>> for the entire 36.3 build (302 N-gaps in total) and also the only "N-gap
>> inconsistency" between UCSC liftAll.lft and the NCBI hs_ref_chr*.agp
>> files.
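
The comparison described above can be reproduced with a few lines of Python.
This is only a sketch: the numbers are copied from the two chr8 listings in
this message, and the function name gaps_between_contigs is mine. It treats
each liftAll.lft offset as 0-based and reports gaps as 1-based inclusive
(start, stop) pairs, like the AGP listing.

```python
from typing import List, Tuple

# (offset, length) for the chr8 contigs, from the liftAll.lft excerpt.
LIFT_CHR8: List[Tuple[int, int]] = [
    (0, 7462059), (7562059, 4537293), (12199352, 9464880),
    (21681632, 8051036), (29798768, 14159284), (46958052, 1291149),
    (48309201, 38454502), (86851003, 57155273), (144106276, 1290020),
    (145403396, 871430),
]

# (chr_start, chr_stop) for the chr8 N-gaps, from hs_ref_chr8.agp.
AGP_GAPS_CHR8: List[Tuple[int, int]] = [
    (7462060, 7562059), (12099353, 12199352), (21664233, 21681632),
    (29732669, 29798768), (43958053, 46958052), (48249202, 48309201),
    (86069242, 86193341), (86763704, 86851003), (144006277, 144106276),
    (145396297, 145403396),
]


def gaps_between_contigs(lift_rows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the 1-based inclusive gaps separating consecutive contigs."""
    gaps = []
    for (offset, length), (next_offset, _) in zip(lift_rows, lift_rows[1:]):
        contig_end = offset + length  # 1-based position of the last contig base
        if next_offset > contig_end:
            gaps.append((contig_end + 1, next_offset))
    return gaps


lift_gaps = gaps_between_contigs(LIFT_CHR8)
# Gaps listed in the AGP file that cannot be derived from the lift offsets.
missing = sorted(set(AGP_GAPS_CHR8) - set(lift_gaps))
print(len(lift_gaps), missing)  # prints: 9 [(86069242, 86193341)]
```

The nine gaps derived from the offsets match the AGP gaps of type "clone" and
"centromere" exactly; only the "fragment" gap is left over.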
>>
>> Is there any reason why this N-gap is ignored in UCSC liftAll.lft file?
>> I have no idea what "fragment" means, but maybe a "fragment" is
>> considered as being PART OF A CONTIG and not a SEPARATOR BETWEEN
>> CONTIGS?
>> In this case that would mean that N-gaps can occur _inside_ contigs and
>> that the liftAll.lft file is not the right place to extract the
>> N-gaps coordinates from?
>>
>> Any help/recommendation on this would be highly appreciated.
>> Thanks in advance,
>> H.
>> _______________________________________________
>> Genome maillist - Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>