[Genome] hg18:Most Conserved (phastConsElements17way)
Ann Zweig
ann at soe.ucsc.edu
Fri Mar 23 10:23:09 PDT 2007
Hello Erika,
We calculate the Most Conserved Elements track and display all of the
results. Then we allow users to configure the track to display only
those items over a certain threshold. You can configure the display by
visiting the description page for the track and entering your minimum
value in the "Show only items with score at or above:" text box.
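If you have downloaded the items rather than viewing them in the browser, the same threshold can be applied offline. The following is only a minimal sketch, assuming the download is tab-separated BED with the transformed score in the fifth column; the function name and the example rows are made up for illustration and are not taken from the track.

```python
def filter_bed_by_score(lines, min_score):
    """Yield BED lines whose score (5th column) is at or above min_score."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and common header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # no score column on this line
        if int(fields[4]) >= min_score:
            yield line


# Hypothetical rows: chrom, start, end, name, score.
example = [
    "chr1\t1000\t1050\tlod=25\t300",
    "chr1\t2000\t2400\tlod=900\t650",
    "chr1\t3000\t3010\tlod=12\t180",
]
kept = list(filter_bed_by_score(example, 300))
for row in kept:
    print(row)
```

With a minimum of 300, the first two hypothetical rows are kept and the third is dropped.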
In case you're interested in the details about exactly how we calculate
the items in the track, you are welcome to read the make doc for this
track in the source tree. Here are some of the pertinent details from
that doc:
The tuning loop:
# Try for 5% overall cov, and 70% CDS cov (used elen=13, tcov=.007, rho=.27)
featureBits hg18 -enrichment refGene:cds phastConsElements17way
# refGene:cds 1.072%, phastConsElements17way 5.510%, both 0.759%, cover 70.83%, enrich 12.86x
featureBits hg17 -enrichment refGene:cds phastConsElements17way
# refGene:cds 1.064%, phastConsElements17way 5.104%, both 0.748%, cover 70.29%, enrich 13.77x
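The numbers in those featureBits comment lines appear to be related by simple arithmetic: "cover" matches the share of refGene:cds bases that also fall in phastConsElements17way, and "enrich" matches that share divided by the overall phastConsElements17way coverage. The sketch below is my reading of the reported figures, not a description of the featureBits source; the function name is hypothetical, and small differences from the reported values are expected because the inputs are already rounded percentages.

```python
def cover_and_enrich(pct_a, pct_b, pct_both):
    """Return (cover in percent, enrichment factor).

    pct_a    -- percent of the genome in track A (here refGene:cds)
    pct_b    -- percent of the genome in track B (here phastConsElements17way)
    pct_both -- percent of the genome in both tracks
    """
    cover = pct_both / pct_a          # fraction of A that is also in B
    enrich = cover / (pct_b / 100.0)  # relative to B's genome-wide fraction
    return cover * 100.0, enrich


# Percentages copied from the hg18 and hg17 comment lines.
hg18_cover, hg18_enrich = cover_and_enrich(1.072, 5.510, 0.759)
hg17_cover, hg17_enrich = cover_and_enrich(1.064, 5.104, 0.748)
print(f"hg18: cover {hg18_cover:.2f}%, enrich {hg18_enrich:.2f}x")
print(f"hg17: cover {hg17_cover:.2f}%, enrich {hg17_enrich:.2f}x")
```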
Then the phastCons command:
phastCons $f.ss elliotsEncode.mod \
--rho 0.28 --expected-length 14 --target-coverage 0.008 --quiet \
--not-informative panTro1,rheMac2 \
--seqname $c --idpref $c --viterbi $f.bed --score > $f.pp
Here $c is a chromosome name and $f is one of the pieces created by msa_split:
msa_split $c.maf -i MAF \
-M /cluster/bluearc/hg18/chrom/$c.fa \
-o SS -r $c/$c -w 10000000,0 -I 1000 -B 5000
The lodToBedScore calculation, which produces the scores for the mostConserved track, is:
# The transform is
# f(x) = a * log x + b, s.t. f(x_med) = 300 and f(x_max) = 1000.
# where x_max is the max input score and x_med is the median input score.
# Solving for a and b, you obtain
# b = (300 log x_max - 1000 log x_med) / (log x_max - log x_med)
# a = (1000 - b) / log x_max.
This script from our source does the calculation you may be interested
in: cluster/bin/scripts/lodToBedScore
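To make the transform concrete, here is a minimal Python sketch of the formulas quoted above. It is not the lodToBedScore script itself: the function names, the example values for x_med and x_max, the rounding, and the clamping to the 0-1000 range are all assumptions made for illustration.

```python
import math


def lod_transform_params(x_med, x_max, f_med=300.0, f_max=1000.0):
    """Solve f(x) = a * log(x) + b with f(x_med) = f_med and f(x_max) = f_max.

    Requires x_max > x_med > 0 and x_max != 1 (so that log(x_max) != 0).
    """
    log_med, log_max = math.log(x_med), math.log(x_max)
    b = (f_med * log_max - f_max * log_med) / (log_max - log_med)
    a = (f_max - b) / log_max
    return a, b


def lod_to_bed_score(x, a, b):
    """Map a positive log-odds score x to an integer score.

    Rounding and clamping to [0, 1000] are assumptions, not from the make doc.
    """
    score = a * math.log(x) + b
    return max(0, min(1000, round(score)))


# Hypothetical median and maximum log-odds scores.
a, b = lod_transform_params(x_med=20.0, x_max=5000.0)
print(lod_to_bed_score(20.0, a, b))    # the median maps to 300
print(lod_to_bed_score(5000.0, a, b))  # the maximum maps to 1000
```

By construction, half of the elements score at or above 300, so under these assumptions a low transformed score marks an element with a relatively low log-odds score rather than one that failed a cut-off.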
I hope this helps you understand the Most Conserved track.
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
Erika wrote:
> Dear colleagues at Santa Cruz Genome Browser,
>
> I have a question regarding the "most conserved" phastCons 17-way
> track for hg18. I have read the track description on the table
> browser and briefly the main text Siepel et al. (2005). However, I
> am still a little confused about how the track was generated. If you
> could help clarify this for me, a naive user, I would greatly
> appreciate it! From the descriptions below I am still not
> understanding how or if a cut-off/threshold value is assigned to the
> log-odds score to say that it is indeed an HCE. I understand that
> the scores range from 0-1000 so it seems to me that a score of 0
> would indicate low probability of an HCE. Is there such a cut-off
> used in generating the track and the data downloaded from the table
> browser?
>
> From track description:
> "The predicted conserved elements are segments of the alignment that
> are likely to have been "generated" by the conserved state of the
> phylo-HMM. Each element is assigned a log-odds score equal to its log
> probability under the conserved model minus its log probability under
> the non-conserved model. The "score" field associated with this track
> contains transformed log-odds scores, taking values between 0 and 1000."
>
> From Siepel et al. (2005) Methods:
> "Conserved elements were predicted using the Viterbi algorithm
> (Durbin et al. 1998). Conservation scores—posterior probabilities
> that each site was generated by the conserved state—were computed
> using the forward/backward algorithm (Durbin et al. 1998). Each
> predicted conserved element was assigned a log-odds score, indicating
> how much more likely it is under the conserved phylogenetic model
> than under the nonconserved model (see Supplemental material)."
>
> Thank you for your help with this,
> Erika
>
> **********************************************************
> E.M. Kvikstad
> Academic Computing Fellow
> IGDP Genetics
> Center for Comparative Genomics and Bioinformatics
> The Pennsylvania State University
> 208 Mueller Lab
> University Park, PA 16802
> (814) 863-2185
> kvik at bx.psu.edu
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome