[Genome] GenScan UCSC track for the ENCODE regions
Hiram Clawson
hiram at soe.ucsc.edu
Fri Oct 6 09:44:00 PDT 2006
Good Morning Axel:
Our genscan run uses:
1. hard-masked contigs
2. our own gsBig front end to genscan
(in the kent source tree: http://genome.ucsc.edu/admin/cvs.html
http://genome.ucsc.edu/admin/jk-install.html)
with arguments, for example:
/cluster/bin/x86_64/gsBig /cluster/data/hg17/1/NT_004321/NT_004321.fa.masked \
gtf/NT_004321.fa.gtf -trans=pep/NT_004321.fa.pep \
-subopt=subopt/NT_004321.fa.bed -exe=hg3rdParty/genscanlinux/genscan \
-par=hg3rdParty/genscanlinux/HumanIso.smat -tmp=/tmp -window=2400000
3. results lifted to chrom coordinates
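As a rough illustration of step 3, lifting a prediction from contig to chromosome coordinates amounts to renaming the sequence and adding the contig's start offset to each feature's start and end. The sketch below is not the UCSC tooling itself; the function name `lift_gtf_line`, the sample record, and the offset of 1000000 are hypothetical and only show the arithmetic.

```python
def lift_gtf_line(line, chrom, offset):
    """Shift one tab-separated GTF record from contig to chromosome coordinates.

    offset is the 0-based position of the contig's first base on the
    chromosome (a hypothetical value in the example below).
    """
    fields = line.rstrip("\n").split("\t")
    fields[0] = chrom                          # sequence name: contig -> chromosome
    fields[3] = str(int(fields[3]) + offset)   # feature start (1-based, inclusive)
    fields[4] = str(int(fields[4]) + offset)   # feature end (1-based, inclusive)
    return "\t".join(fields)


# Hypothetical record and offset, for illustration only.
record = ('NT_004321\tgenscan\tCDS\t101\t250\t.\t+\t0\t'
          'gene_id "g1"; transcript_id "g1.1";')
lifted = lift_gtf_line(record, "chr1", 1000000)
print(lifted)
```

The same shift would apply to the suboptimal-exon BED output, keeping in mind that BED starts are 0-based while GTF starts are 1-based.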
The usage message of gsBig is:
gsBig - Run Genscan on big input and produce GTF files and other parsed output
usage:
gsBig file.fa output.gtf
options:
-subopt=output.bed - Produce suboptimal exons.
-trans=output.fa - where to put translated proteins.
-prerun=input.genscan - Assume genscan run already with this output.
-window=size Set window to pass to genscan specific size (default 1200000)
You want ~400 bytes memory for each base in window.
-exe=/bin/genscan-linux/genscan - where genscan executable is.
-par=/bin/genscan-linux/HumanIso.smat - where parameter file is.
-tmp=/tmp - where temporary files go to.
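To put the memory note in the usage message into numbers: taking the stated figure of roughly 400 bytes per base at face value, the default window and the -window=2400000 setting from the example command work out as follows. This is only a back-of-the-envelope sketch; the helper name `window_memory_bytes` is made up for the illustration.

```python
BYTES_PER_BASE = 400  # approximate figure from the gsBig usage message


def window_memory_bytes(window):
    """Estimate memory needed for a genscan window of the given size in bases."""
    return window * BYTES_PER_BASE


# Default window, and the window used in the example command above.
for window in (1200000, 2400000):
    megabytes = window_memory_bytes(window) / 1e6
    print(f"-window={window}: about {megabytes:.0f} MB")
```

So the 2400000-base window in the example needs on the order of 1 GB per genscan process.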
You can read our entire processing sequence in the *.txt files in the source tree:
src/hg/makeDb/doc/*.txt
--Hiram
Axel E. Bernal wrote:
> Hi,
>
> I am trying to reproduce the predictions of the GenScan track for the
> ENCODE region.
> The numbers for the nucleotide level, as they appeared in the last EGASP
> '05 ENCODE publication: Genome Biology, Volume 7, Supplement 1, are as
> follows (they appear on page S2.10 under "USCS genscan track"):
> Sn:84.17%
> Sp:60.60%
>
> Whereas the ones that I obtained with a local copy of GenScan are around:
> Sn:48.92%
> Sp:59.40%
>
> I made sure no errors were made while reading the output of the program.
> I am also using masked sequences (I used RepeatMasker); the results with
> unmasked sequences are even worse.
>
> I'd appreciate it if you could help me figure out why I am obtaining
> these bad results on ENCODE; on all other test sets I don't have this
> problem. Did you use any special parameters for running the program, or
> any pre/post-processing of the sequences?
>
> Thanks a lot in advance,
>
> Sincerely,
>
> Axel E Bernal
> University of Pennsylvania
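
For readers comparing their own numbers with the Sn/Sp figures quoted above: nucleotide-level sensitivity and specificity are conventionally defined in gene-prediction evaluation as Sn = TP/(TP+FN) and Sp = TP/(TP+FP), counted over coding bases. The sketch below assumes that convention; the exact EGASP evaluation procedure is not described in this thread, and the function names and intervals are hypothetical.

```python
def positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of base positions."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def nucleotide_sn_sp(annotated, predicted):
    """Return (Sn, Sp) for two non-empty sets of coding base positions."""
    tp = len(annotated & predicted)   # coding bases predicted as coding
    fn = len(annotated - predicted)   # coding bases that were missed
    fp = len(predicted - annotated)   # predicted bases that are not annotated
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical toy intervals: 200 annotated bases, 100 predicted, 50 shared.
annotated = positions([(100, 199), (300, 399)])
predicted = positions([(150, 249)])
sn, sp = nucleotide_sn_sp(annotated, predicted)
print(f"Sn: {sn:.2%}  Sp: {sp:.2%}")
```

Both sets must be in the same coordinate system (contig or chromosome) before they are compared.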