[Genome] spurious declaration lines in phastCons drosophila files?

Brooke Rhead rhead at soe.ucsc.edu
Mon Oct 9 12:35:27 PDT 2006


Hello Jan,

Thank you for the compliment on the Genome Browser.  I'm glad that you 
find it a valuable resource.  One of our developers has this to say 
about the extra declaration lines:

---------------------------------
This is an artifact of our processing -- the genome is split into 1Mbp
chunks so that we can distribute jobs on a cluster.  Wiggle coords are
1-based hence the "2,3,4,5,6,8,10,11....million +1".
---------------------------------

Sorry for any inconvenience this causes.  For now, at least, you will 
need to remove the extra lines yourself.  Perhaps at some point in the 
future we will improve our processing so that the genome chunking no 
longer produces these extra declaration lines.  I hope this at least 
answers your question about why the lines appear in the files.
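
In case it helps, here is a minimal Python sketch of one way to strip the
extra lines.  It is not a UCSC tool, and the function name is made up for
illustration.  It assumes a declaration line is redundant when it has the
same chrom and step as the previous one and its start is exactly where the
previous block of values left off; it also assumes one value per line and
ignores any other fields on the declaration line.

```python
import re

# Matches the leading fields of a fixedStep declaration line.
DECL = re.compile(r"^fixedStep\s+chrom=(\S+)\s+start=(\d+)\s+step=(\d+)")


def drop_redundant_declarations(lines):
    """Yield lines, skipping declarations that merely continue the previous block.

    Hypothetical helper: a declaration is treated as redundant when chrom and
    step are unchanged and start equals the 1-based position that the next
    value of the previous block would have had.
    """
    chrom = step = next_pos = None
    for line in lines:
        m = DECL.match(line)
        if m:
            new_chrom = m.group(1)
            new_start = int(m.group(2))
            new_step = int(m.group(3))
            if (new_chrom, new_step, new_start) == (chrom, step, next_pos):
                continue  # contiguous with the previous block: drop it
            chrom, step, next_pos = new_chrom, new_step, new_start
            yield line
        else:
            # A data line advances the expected position by one step.
            if line.strip() and next_pos is not None:
                next_pos += step
            yield line
```

To run it over a download, something like gzip.open("chr2L.pp.gz", "rt")
can be passed in as the lines argument, with the output written to a new
file.  Declarations that follow a real gap are kept, because their start
does not match the expected next position.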

--
Brooke Rhead
UCSC Genome Bioinformatics Group


Jan Szubert wrote:
> Hi
> 
> I have a question about the drosophila phastCons and multi9way alignment
> correspondence.
> 
>  I am looking at Drosophilae, specifically chr2L.maf.gz 24-May-2005 10:01
> and chr2l.pp.gz   27-May-2005 18:15 both from the
> http://hgdownload.cse.ucsc.edu/goldenPath/dm2/multiz9way/ and phastCons9way/
> download directories.  The link
> http://genome.ucsc.edu/goldenPath/help/phastCons.html gives a description of
> the PhastCons file format :
> File Format (assemblies released Nov. 2004 and later)  When uncompressed,
> the file contains a declaration line and one column of data in wiggle table
> fixed-step format:
> 
> "fixedStep chrom=scaffold_1 start=3462 step=1
> 0.0978
> 0.1588
> 0.1919
> 0.1948
> 0.1684
> 
> 1. Declaration line: The declaration line specifies the starting point of
> the data in the assembly. It consists of the following fields:
> 
> fixedStep -- keyword indicating the wiggle track format used to write the
> data. In fixed step format, the data is single-column with a fixed interval
> between values.
> chrom -- chromosome or scaffold on which first value is located.
> start -- position of first value on chromosome or scaffold specified by
> chrom. NOTE: Unlike most Genome Browser coordinates, these are one-based.
> step -- size of the interval (in bases) between values.
> A new declaration line is inserted in the file when the chrom value changes,
> when a gap is encountered (requiring a new start value), or when the step
> interval changes."
> 
> My observation :
> 
> In nearly all cases the declaration lines correspond to the start of a new
> alignment group after a gap and apart from the odd exception of an extra
> group covering a genome region already found in the MAF file, it works well.
> However I have found that on most 1 million nucleotide boundaries an extra
> declaration line is present in the phastCons file which does not correspond
> to a "chrom value changes, when a gap is encountered (requiring a new start
> value), or when the step interval changes." and is right inside a perfectly
> good contiguous alignment group. In the case of chr2L this does not occur on
> the 1st million nt boundary, but does on the n =
> 2,3,4,5,6,8,10,11....million +1 boundaries. I presume they are present in
> the other fly phastCons files, but I haven't confirmed this for any file
> other than chr3R.pp yet.
> As these entries seem to be redundant according to the file specification,
> wouldn't it be better to simply delete them or do they serve some unstated
> function such as for aligning in the browser display? They make it more
> difficult to develop a simple algorithm to merge the phastCons and multi way
> alignments for assessment of conservation results.
> 
> Alternatively I would get my data with the custom tracks and table browser
> except the 10 million line file limit means about three files or more per
> chromosome and I don't think you can get the phastCons score in the MAF
> alignment format for detailed checking.
> 
>  Of course it is not too difficult to remove these offending lines myself,
> in my own copies, but wondered why they were there and
> more importantly if there are other occasions when declaration lines would
> be created in the phastCons files that did not
> correspond to gaps or chrom or step changes, and hence in the middle of the
> source data alignment groups.
> 
> Thank you for your work in producing the genome browser and in providing
> such a valuable resource to the research community.
> 
> kind regards
> 
> Jan Szubert
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

