[Genome] spurious declaration lines in phastCons drosophila files?
Jan Szubert
jan.szubert at gmail.com
Sun Oct 8 16:30:17 PDT 2006
Hi
I have a question about how the Drosophila phastCons files correspond to
the multiz9way alignments.
I am looking at Drosophila, specifically chr2L.maf.gz (24-May-2005 10:01)
and chr2l.pp.gz (27-May-2005 18:15), both from the
http://hgdownload.cse.ucsc.edu/goldenPath/dm2/multiz9way/ and phastCons9way/
download directories. The page
http://genome.ucsc.edu/goldenPath/help/phastCons.html gives a description of
the phastCons file format:
File Format (assemblies released Nov. 2004 and later) When uncompressed,
the file contains a declaration line and one column of data in wiggle table
fixed-step format:
" fixedStep chrom=scaffold_1 start=3462 step=1
0.0978
0.1588
0.1919
0.1948
0.1684
1. Declaration line: The declaration line specifies the starting point of
the data in the assembly. It consists of the following fields:
fixedStep -- keyword indicating the wiggle track format used to write the
data. In fixed step format, the data is single-column with a fixed interval
between values.
chrom -- chromosome or scaffold on which first value is located.
start -- position of first value on chromosome or scaffold specified by
chrom. NOTE: Unlike most Genome Browser coordinates, these are one-based.
step -- size of the interval (in bases) between values.
A new declaration line is inserted in the file when the chrom value changes,
when a gap is encountered (requiring a new start value), or when the step
interval changes."
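
To make the quoted format concrete, here is a minimal Python sketch that reads a fixedStep stream into blocks, one per declaration line. The names Block and parse_fixed_step are made up for this illustration and are not part of any UCSC tool. It assumes every declaration carries chrom, start and step fields and that each data line holds a single score.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Block(NamedTuple):
    chrom: str
    start: int            # one-based position of the first value
    step: int             # interval in bases between values
    values: List[float]


def parse_fixed_step(lines: Iterable[str]) -> Iterator[Block]:
    """Yield one Block per fixedStep declaration line (hypothetical helper)."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # A new declaration closes the block collected so far.
            if current is not None:
                yield current
            fields = dict(tok.split("=", 1) for tok in line.split()[1:])
            current = Block(fields["chrom"], int(fields["start"]),
                            int(fields["step"]), [])
        elif current is not None:
            # Data line: a single score belonging to the current block.
            current.values.append(float(line))
    if current is not None:
        yield current
```

With a compressed file, the lines could come from gzip.open("chr2L.pp.gz", "rt") in the standard library; value i of a block then sits at one-based position start + i * step.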
My observation :
In nearly all cases the declaration lines correspond to the start of a new
alignment group after a gap, and apart from the odd exception of an extra
group covering a genome region already found in the MAF file, this works well.
However, I have found that at most 1-million-nucleotide boundaries an extra
declaration line is present in the phastCons file. It does not correspond
to any of the documented cases ("chrom value changes, when a gap is
encountered (requiring a new start value), or when the step interval
changes") and sits right inside a perfectly good contiguous alignment group.
In the case of chr2L this does not occur at the 1st million nt boundary, but
does at positions n million + 1 for n = 2,3,4,5,6,8,10,11.... I presume such
lines are present in the other fly phastCons files too, but so far I have
confirmed this only for chr3R.pp.
As these entries seem to be redundant according to the file specification,
wouldn't it be better to simply delete them, or do they serve some unstated
function, such as alignment in the browser display? They make it more
difficult to develop a simple algorithm to merge the phastCons and multi way
alignments for assessment of conservation results.
Alternatively, I would get my data through custom tracks and the Table
Browser, except that the 10 million line file limit means about three files
or more per chromosome, and I don't think you can get the phastCons scores in
the MAF alignment format for detailed checking.
Of course it is not too difficult to remove these offending lines myself in
my own copies, but I wondered why they were there and, more importantly,
whether there are other occasions when declaration lines are created in the
phastCons files that do not correspond to gaps or chrom or step changes, and
hence fall in the middle of the source data alignment groups.
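
For removing the lines from a local copy, a small Python filter along the following lines should be enough. It is only a sketch: the function name is invented, and it assumes that a declaration can be dropped safely whenever chrom and step are unchanged and start equals the next expected position.

```python
from typing import Iterable, Iterator


def drop_redundant_declarations(lines: Iterable[str]) -> Iterator[str]:
    """Pass lines through unchanged, skipping declarations that merely
    continue the preceding block."""
    prev_key = None   # (chrom, step) of the block being written
    next_pos = None   # position the next value would occupy
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("fixedStep"):
            fields = dict(tok.split("=", 1) for tok in stripped.split()[1:])
            key = (fields["chrom"], int(fields["step"]))
            start = int(fields["start"])
            if key == prev_key and start == next_pos:
                continue  # same block carries on; omit this declaration
            prev_key, next_pos = key, start
        elif stripped and prev_key is not None:
            next_pos += prev_key[1]
        yield line
```

The output keeps every score and every declaration that marks a real gap, chrom change, or step change, so each remaining block should line up with one contiguous stretch of scored positions.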
Thank you for your work in producing the genome browser and in providing
such a valuable resource to the research community.
kind regards
Jan Szubert