[Genome] N-gap missing from liftAll.lft for chr8 in hg18
Hiram Clawson
hiram at soe.ucsc.edu
Wed May 14 20:25:33 PDT 2008
Good Evening Herve:
You could also simply scan the hg18 sequence and find all the N's
(and n's) and mark those locations as out of your search area.
There are cases where there are N's in the sequence, but they
are not marked as gaps in the agp files.
--Hiram
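
For illustration, a minimal Python sketch of the scan suggested above, assuming a chromosome has already been loaded into memory as a plain string. The function name find_n_runs and the min_len parameter are made up for this example and are not part of any UCSC tool.

```python
import re

def find_n_runs(seq, min_len=1):
    """Return 0-based, half-open (start, end) spans of consecutive N/n.

    min_len is a hypothetical knob for ignoring very short runs.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [m.span() for m in pattern.finditer(seq)]

# Toy sequence for demonstration only, not real hg18 data.
toy = "ACGTNNNNacgtnnA"
all_runs = find_n_runs(toy)              # every run of N or n
long_runs = find_n_runs(toy, min_len=3)  # only runs of 3 or more
```

Because this looks at the sequence itself, it also catches N's that are not annotated as gaps in the AGP files.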
Herve Pages wrote:
> Hi Kuhn,
>
> Thank you very much for your detailed answer! That is a lot of very
> useful information that I need to digest now (I've almost never used the
> genome browser before but I learned a lot by reading your email and I feel
> much more comfortable with it now).
> Maybe I'll use the chr*_gap.txt.gz files in
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
> for extracting the masks of "assembly gaps". If I understand correctly
> the gaps here should be exactly the same as the gaps reported in the NCBI
> hs_ref_chr*.agp files.
>
> Thanks again!
>
> Cheers,
> H.
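
A hedged sketch of reading one of the chr*_gap.txt.gz files mentioned above. It assumes the usual tab-separated gap table layout (bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge) with 0-based, half-open coordinates; check the accompanying .sql file before relying on it. The Gap tuple and the function names are hypothetical, and the sample row uses placeholder values for bin, ix and bridge.

```python
import gzip
from collections import namedtuple

# Hypothetical record type for this example.
Gap = namedtuple("Gap", "chrom start end size type bridge")

def parse_gap_lines(lines):
    """Parse tab-separated gap table rows into Gap records.

    Assumed columns: bin, chrom, chromStart, chromEnd, ix, n, size,
    type, bridge (0-based start, exclusive end).
    """
    gaps = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        f = line.split("\t")
        gaps.append(Gap(f[1], int(f[2]), int(f[3]), int(f[6]), f[7], f[8]))
    return gaps

def load_gap_file(path):
    """Read a gzipped gap table, e.g. a local copy of chr8_gap.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_gap_lines(handle)

# Illustrative row for the chr8 fragment gap; bin, ix and bridge are
# placeholders, and the start is the AGP start minus one.
row = "\t".join(["0", "chr8", "86069241", "86193341", "0", "N",
                 "124100", "fragment", "yes"])
gaps = parse_gap_lines([row])
```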
>
>
> Robert Kuhn wrote:
>> Herve,
>>
>> We have a track/table on the hg18 assembly that annotates the gaps,
>> including the fragment gap in chr8. You are correct that a fragment
>> gap =is= within a clone, though it is defined as between the contigs
>> displayed in the assembly track. This is described on the details page
>> for the gap track, accessible via the minibutton to the left of
>> the gap track in the genome browser display, or by clicking the
>> gap track control below the image:
>>
>> Fragment - gaps between the contigs of a draft clone. (In this context,
>> a contig is a set of overlapping sequence reads.) These may be bridged
>> or not.
>>
>> There are also fragment gaps on several of the _random chroms:
>> chr8_random
>> chr9_random
>> chr10_random
>> chr15_random
>> chr17_random
>>
>> The gap on chr8 can be seen here:
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Kuhn&hgS_otherUserSessionName=hg18_mlq20080514
>>
>> As you can see, the gap (and all other gaps) is excluded from the
>> assembly track (table chrN_gold, where each chrom is substituted
>> for N). You should be able to use the assembly track to get the
>> coordinates for all non-N sequence in the hg18 assembly.
>>
>> best wishes,
>>
>> --b0b kuhn
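
The relationship described above (the assembly blocks are what remains once the gaps are removed) can be sketched as an interval complement. This is only an illustration with toy numbers; non_gap_blocks is a hypothetical helper, and it assumes 1-based, inclusive gap coordinates as used in the AGP files.

```python
def non_gap_blocks(gaps, chrom_len):
    """Return the 1-based, inclusive blocks not covered by any gap.

    gaps: iterable of 1-based, inclusive (start, stop) pairs.
    chrom_len: total chromosome length.
    """
    blocks = []
    pos = 1  # next position not yet assigned to a gap or a block
    for start, stop in sorted(gaps):
        if start > pos:
            blocks.append((pos, start - 1))
        pos = max(pos, stop + 1)
    if pos <= chrom_len:
        blocks.append((pos, chrom_len))
    return blocks

# Toy chromosome of length 20 with one internal gap and one trailing gap.
toy_blocks = non_gap_blocks([(5, 8), (15, 20)], 20)
```

A pattern-matching tool could then iterate over the returned blocks and search only inside them.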
>>
>> On Wed, 14 May 2008 10:31:04 -0700
>> Herve Pages <hpages at fhcrc.org> wrote:
>>> Hi,
>>>
>>> I'm working on tools where we are trying to implement fast pattern
>>> matching algorithms on the entire chromosome sequences for hg18 and
>>> other genomes. For hg18 we use the sequences provided here:
>>>
>>> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>>>
>>> In order to speed up our algorithms, we'd like to be able to skip the
>>> N-gaps and other regions. Also we want our users to keep some control
>>> on what to skip exactly so our plan is to provide "predefined masks"
>>> for each chromosome that the user can put on the sequence or not.
>>>
>>> So far I've successfully extracted the masks of N-gaps for hg18 from
>>> the UCSC liftAll.lft file (offsets/lengths of contigs within
>>> chromosomes), except for chr8, where I ran into the following problem.
>>>
>>> For this chromosome, liftAll.lft contains the following lines:
>>>
>>>    offset  acc_num        length
>>>         0  8/NT_023736   7462059
>>>   7562059  8/NT_077531   4537293
>>>  12199352  8/NT_030737   9464880
>>>  21681632  8/NT_023666   8051036
>>>  29798768  8/NT_007995  14159284
>>>  46958052  8/NT_023678   1291149
>>>  48309201  8/NT_008183  38454502
>>>  86851003  8/NT_008046  57155273
>>> 144106276  8/NT_023684   1290020
>>> 145403396  8/NT_037704    871430
>>>
>>> But the 7th contig in the above list (offset=48309201,
>>> acc_num=8/NT_008183, length=38454502) contains a block of 124100
>>> consecutive Ns.
>>>
>>> Now if I use the hs_ref_chr*.agp files from NCBI for the current
>>> reference assembly (36.3):
>>>
>>> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
>>>
>>> in order to extract the coordinates of the N-gaps, I find exactly the
>>> same thing as with the UCSC liftAll.lft file for all chromosomes,
>>> except for chr8. Here are the N-gaps reported in hs_ref_chr8.agp:
>>>
>>> chr_start  chr_stop  gap_len  gap_type
>>>   7462060   7562059   100000  clone
>>>  12099353  12199352   100000  clone
>>>  21664233  21681632    17400  clone
>>>  29732669  29798768    66100  clone
>>>  43958053  46958052  3000000  centromere
>>>  48249202  48309201    60000  clone
>>>  86069242  86193341   124100  fragment
>>>  86763704  86851003    87300  clone
>>> 144006277 144106276   100000  clone
>>> 145396297 145403396     7100  clone
>>>
>>> The 7th N-gap (of type "fragment") is indeed a region containing
>>> 124100 consecutive Ns and it is the N-gap that is missing from the
>>> UCSC liftAll.lft file.
>>> Note that this is the only N-gap of type "fragment" reported by NCBI
>>> for the entire 36.3 build (302 N-gaps in total) and also the only "N-gap
>>> inconsistency" between UCSC liftAll.lft and the NCBI hs_ref_chr*.agp
>>> files.
>>>
>>> Is there any reason why this N-gap is ignored in the UCSC liftAll.lft
>>> file? I have no idea what "fragment" means, but maybe a "fragment" is
>>> considered as being PART OF A CONTIG and not a SEPARATOR BETWEEN
>>> CONTIGS? In this case that would mean that N-gaps can occur _inside_
>>> contigs and that the liftAll.lft file is not the right place to
>>> extract the N-gap coordinates from?
>>>
>>> Any help/recommendation on this would be highly appreciated.
>>> Thanks in advance,
>>> H.
>>> _______________________________________________
>>> Genome maillist - Genome at soe.ucsc.edu
>>> http://www.soe.ucsc.edu/mailman/listinfo/genome
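
The comparison in the original question can be reproduced with a short Python sketch using the chr8 numbers quoted in the thread. It assumes liftAll.lft offsets are 0-based and AGP coordinates are 1-based and inclusive, which is consistent with the figures above; gaps_between_contigs is a hypothetical helper name.

```python
# (offset, contig, length) for chr8, as quoted from liftAll.lft.
LIFT_CHR8 = [
    (0, "8/NT_023736", 7462059),
    (7562059, "8/NT_077531", 4537293),
    (12199352, "8/NT_030737", 9464880),
    (21681632, "8/NT_023666", 8051036),
    (29798768, "8/NT_007995", 14159284),
    (46958052, "8/NT_023678", 1291149),
    (48309201, "8/NT_008183", 38454502),
    (86851003, "8/NT_008046", 57155273),
    (144106276, "8/NT_023684", 1290020),
    (145403396, "8/NT_037704", 871430),
]

# (chr_start, chr_stop) -> gap_type, as quoted from hs_ref_chr8.agp.
AGP_GAPS_CHR8 = {
    (7462060, 7562059): "clone",
    (12099353, 12199352): "clone",
    (21664233, 21681632): "clone",
    (29732669, 29798768): "clone",
    (43958053, 46958052): "centromere",
    (48249202, 48309201): "clone",
    (86069242, 86193341): "fragment",
    (86763704, 86851003): "clone",
    (144006277, 144106276): "clone",
    (145396297, 145403396): "clone",
}

def gaps_between_contigs(lift):
    """Return 1-based, inclusive gaps between consecutive contigs."""
    gaps = []
    prev_end = 0  # 0-based exclusive end of the previous contig
    for offset, _name, length in lift:
        if offset > prev_end:
            gaps.append((prev_end + 1, offset))
        prev_end = offset + length
    return gaps

lift_gaps = gaps_between_contigs(LIFT_CHR8)
missing = sorted(set(AGP_GAPS_CHR8) - set(lift_gaps))
missing_types = [AGP_GAPS_CHR8[g] for g in missing]
# missing       -> [(86069242, 86193341)]
# missing_types -> ['fragment']
```

The nine gaps implied by the contig offsets match the AGP entries exactly; the only AGP gap not recovered is the fragment gap, which lies inside contig NT_008183.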