[Genome] smaller size of mm8 vs hg18 all chain data from genome browser
Brooke Rhead
rhead at soe.ucsc.edu
Tue Mar 6 17:39:14 PST 2007
Hi Yiming,
You are correct that data retrieved via ftp (mm8.hg18.all.chain) is in a
more compact format. The reason it is larger than the data retrieved
via the Table Browser is that there is more information contained in the
chain file. Here is part of the first entry in mm8.hg18.all.chain:
chain 647145090 chr12 120463159 + 45508375 113878656 chr14 106368585 + 24227835 105403220 1
19 1 0
64 1 0
32 0 17
13 0 5
26 1 0
10 0 13
. . . and so on.
The chain file contains one header line for each alignment chain, followed by
several lines describing the ungapped blocks in that chain and the gaps
between them. (See the bottom of this page:
http://genome.ucsc.edu/goldenPath/help/chain.html
for more information on the alignment lines.)
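To make the layout concrete, here is a minimal Python sketch that parses the header line and the alignment lines shown above. The field names (score, t_name, and so on) follow the chain format help page linked above; the class and function names are only illustrative and are not part of any UCSC tool.

```python
from typing import NamedTuple


class ChainHeader(NamedTuple):
    # Field order follows the "chain" header line in the chain format.
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_header(line):
    """Parse one 'chain ...' header line into a ChainHeader."""
    f = line.split()
    if len(f) != 13 or f[0] != "chain":
        raise ValueError("not a chain header line: %r" % line)
    return ChainHeader(int(f[1]), f[2], int(f[3]), f[4], int(f[5]), int(f[6]),
                       f[7], int(f[8]), f[9], int(f[10]), int(f[11]), f[12])


def parse_block_line(line):
    """Parse an alignment line: 'size dt dq', or just 'size' for the last block.

    size is the length of the ungapped block; dt and dq are the gap sizes
    that follow it in the first and second genome, respectively.
    """
    nums = [int(x) for x in line.split()]
    if len(nums) == 3:
        return (nums[0], nums[1], nums[2])
    if len(nums) == 1:
        return (nums[0], 0, 0)
    raise ValueError("unexpected alignment line: %r" % line)


# The example entry from mm8.hg18.all.chain quoted above.
header = parse_chain_header(
    "chain 647145090 chr12 120463159 + 45508375 113878656 "
    "chr14 106368585 + 24227835 105403220 1"
)
blocks = [parse_block_line(s) for s in
          ["19 1 0", "64 1 0", "32 0 17", "13 0 5", "26 1 0", "10 0 13"]]
```

Note that in this header line 647145090 is the chain score and the final field, 1, is the chain ID.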
Note that the information retrieved via the Table Browser is the same as
the information in the annotation database, here:
http://hgdownload.cse.ucsc.edu/goldenPath/mm8/database/
(The files are split up by chromosome and are named chr*_chainHg18.txt.gz.)
These files contain only one line per chain, with no extra lines
describing alignment gaps. Here is the entry for the chain with score
647145090 and ID 1 (the same chain as in the example above) from
chr12_chainHg18.txt.gz:
0 647145090 chr12 120463159 45508375 113878656 chr14 106368585 + 24227835 105403220 1
Note that the extra alignment information is missing. This entry is
also present in the Table Browser output.
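As a rough check of how the two formats correspond, the Python sketch below lines up the chain header with the database row. The column names are an assumption inferred from comparing the two example lines (a leading bin column, no strand column for the first genome); the helper names are hypothetical.

```python
# Assumed column order of a chr*_chainHg18 row, inferred from the example.
CHAIN_TABLE_COLUMNS = ["bin", "score", "tName", "tSize", "tStart", "tEnd",
                       "qName", "qSize", "qStrand", "qStart", "qEnd", "id"]


def parse_table_row(line):
    """Split one whitespace-separated table row into a column -> value dict."""
    values = line.split()
    if len(values) != len(CHAIN_TABLE_COLUMNS):
        raise ValueError("expected %d fields, got %d"
                         % (len(CHAIN_TABLE_COLUMNS), len(values)))
    return dict(zip(CHAIN_TABLE_COLUMNS, values))


def header_to_table_fields(header_line):
    """Map a 'chain ...' header line onto the table columns (except bin).

    The leading 'chain' keyword and the first strand field (index 4) have
    no counterpart in the table row, so both are skipped.
    """
    f = header_line.split()
    if len(f) != 13 or f[0] != "chain":
        raise ValueError("not a chain header line: %r" % header_line)
    values = f[1:4] + f[5:]
    return dict(zip(CHAIN_TABLE_COLUMNS[1:], values))


header_line = ("chain 647145090 chr12 120463159 + 45508375 113878656 "
               "chr14 106368585 + 24227835 105403220 1")
row_line = ("0 647145090 chr12 120463159 45508375 113878656 "
            "chr14 106368585 + 24227835 105403220 1")

row = parse_table_row(row_line)
from_header = header_to_table_fields(header_line)

# Every field carried by the header matches the table row; only the
# per-block alignment lines are absent from the table.
same = all(row[name] == value for name, value in from_header.items())
```

If the assumed column order is right, `same` is true for this entry, which is consistent with the table row being the chain header minus the alignment lines.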
I hope this information is helpful. Please let us know if we can
clarify anything, or if you have any further questions about the Genome
Browser.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
Yiming Zhou wrote:
> Dear Colleagues,
>
> I downloaded all chain data for mm8 vs hg18 from
> http://hgdownload.cse.ucsc.edu/goldenPath/mm8/vsHg18/ (ftp_data)
> and from
> http://genome.ucsc.edu/cgi-bin/hgTables?command=start (web_data)
> with the following parameters:
> 1. clade: Vertebrate
> 2. genome: Mouse
> 3. assembly: Feb 2006
> 4. group: Comparative Genomics
> 5. track: Human Chain
> 6. table: chainHg18
> 7. region: genome
> 8. output format: all fields from selected table
> 9. all other parameters were default ones.
>
> I think these two data sets represent the same thing: all chain data from
> the mouse/human whole-genome pairwise alignment.
>
> My question is why the sizes of the two datasets differ. I know the formats
> of the two datasets are different. Since ftp_data uses a more compact format, I
> expected ftp_data to be smaller. But I found that ftp_data was 684M while
> web_data was 259M.
>
> I am wondering what makes ftp_data bigger than web_data. I am going to use the
> data to define conserved transcription factor binding sites. Could you
> suggest which one is better for this purpose? Thank you very much.
>
> Best Regards,
> Yiming