[Genome] N_blocks of the same length

Angie Hinrichs angie at soe.ucsc.edu
Fri Jan 25 11:25:38 PST 2008


Hi Petr,

All of those are assembly gaps of unknown size, mostly unbridged, which 
is why they show up as fixed-length runs of N in the sequence.  You can 
see the details of the assembly in AGP files available for download.  
Assuming you are using the latest human assembly, hg18:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/chromAgp.zip

The file chr1.agp shows the construction of chr1:

chr1    1       616     1       F       AP006221.1      36116   36731   -
chr1    617     167280  2       F       AL627309.15     241     166904  +
chr1    167281  217280  3       N       50000   clone   yes
chr1    217281  257582  4       F       AP006222.1      1       40302   +
chr1    257583  307582  5       N       50000   contig  no
chr1    307583  461231  6       F       AL732372.15     1       153649  +
chr1    461232  511231  7       N       50000   contig  no
chr1    511232  622780  8       F       AC114498.2      1       111549  +
...

If you extract the lines whose 5th column is N, you will see all 42 of 
the gaps.  On hg18 chr1 most gaps are 50000 bases, one is 60000, and the 
centromere and heterochromatin gaps are very large.  More info on the 
AGP format is available here: 
http://www.ncbi.nlm.nih.gov/genome/guide/Assembly/AGP_Specification.html
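
In case it is useful, here is a minimal Python sketch of that extraction. 
The function name agp_gaps and the tuple layout are my own choices for 
illustration, not part of any UCSC tool, and the sample text is just the 
chr1.agp excerpt shown above.  It assumes whitespace-separated columns 
and 1-based, inclusive coordinates, as in the excerpt.

```python
from collections import Counter

def agp_gaps(lines):
    """Yield (chrom, start, end, length, gap_type, linkage) for each gap line.

    A gap line is one whose 5th column is N. Coordinates are returned
    unchanged from the AGP file (assumed 1-based, inclusive).
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 8 or fields[4] != "N":
            continue  # not a gap line (e.g. F = finished clone)
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        yield chrom, start, end, int(fields[5]), fields[6], fields[7]

# Excerpt of chr1.agp copied from the message above.
sample = """\
chr1    1       616     1       F       AP006221.1      36116   36731   -
chr1    617     167280  2       F       AL627309.15     241     166904  +
chr1    167281  217280  3       N       50000   clone   yes
chr1    217281  257582  4       F       AP006222.1      1       40302   +
chr1    257583  307582  5       N       50000   contig  no
chr1    307583  461231  6       F       AL732372.15     1       153649  +
chr1    461232  511231  7       N       50000   contig  no
chr1    511232  622780  8       F       AC114498.2      1       111549  +
"""

gaps = list(agp_gaps(sample.splitlines()))
size_counts = Counter(length for _, _, _, length, _, _ in gaps)

for chrom, start, end, length, gap_type, linkage in gaps:
    print(chrom, start, end, length, gap_type, linkage)
```

To run it on the real file, pass an open file handle instead of the 
sample, e.g. agp_gaps(open("chr1.agp")).  If you slice the sequence from 
chr1.fa in a 0-based language, a gap at start..end corresponds to 
seq[start - 1:end].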

Hope that helps,
Angie


On Thu, 24 Jan 2008, Pancoska Petr wrote:

> For computational reasons, I need to split chromosome sequence(s) 
> into rationally selected blocks (max 150KB per file) while keeping 
> the position information correct. While going through the sequence of 
> Chr:1 downloaded from the ftp site (chr1.fa.gz), at the intermediate 
> stage (locating the N-blocks and listing them for review, since these 
> boundaries are rational places to split the chromosome), I found it 
> strange that these "missing" parts are systematically (and exactly) 
> segments of N's, each 50KB long. I thought that I had some error in 
> the routine, so I went through a few of these by hand in the 
> original file, and those few I checked were indeed 50KB each. Is this 
> something I overlooked in the info around, or is there a reason for 
> that?
> 
> Thanks for help, Petr Pancoska
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
> 

