[Genome] GenScan UCSC track for the ENCODE regions

Hiram Clawson hiram at soe.ucsc.edu
Fri Oct 6 09:44:00 PDT 2006


Good Morning Axel:

Our run of genscan is done as follows:

1. on hard masked contigs

2. with our own gsBig front end to genscan
	(in the kent source tree: http://genome.ucsc.edu/admin/cvs.html
	http://genome.ucsc.edu/admin/jk-install.html)
     run with arguments, for example:
	/cluster/bin/x86_64/gsBig /cluster/data/hg17/1/NT_004321/NT_004321.fa.masked \
	    gtf/NT_004321.fa.gtf -trans=pep/NT_004321.fa.pep \
	    -subopt=subopt/NT_004321.fa.bed -exe=hg3rdParty/genscanlinux/genscan \
	    -par=hg3rdParty/genscanlinux/HumanIso.smat -tmp=/tmp -window=2400000

3. results lifted to chrom coordinates
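
To illustrate step 3, here is a minimal sketch of lifting one GTF record from
contig to chromosome coordinates. It is not the UCSC lifting code. The function
name lift_gtf_line, the offset value and the example record are all made up for
illustration, and the sketch assumes the contig sits on the forward strand of
the chromosome.

```python
def lift_gtf_line(line, chrom, offset):
    """Shift one tab-separated GTF record by a contig's offset in its chromosome.

    Assumes the contig is placed on the forward strand, so only the sequence
    name and the start/end columns (fields 1, 4 and 5 of GTF) change.
    """
    fields = line.rstrip("\n").split("\t")
    fields[0] = chrom                          # contig name -> chromosome name
    fields[3] = str(int(fields[3]) + offset)   # start (1-based, inclusive)
    fields[4] = str(int(fields[4]) + offset)   # end (1-based, inclusive)
    return "\t".join(fields)


# Hypothetical record and offset, for illustration only.
example = "\t".join([
    "NT_004321", "genscan", "CDS", "101", "250", ".", "+", "0",
    'gene_id "g1"; transcript_id "g1.1";',
])
lifted = lift_gtf_line(example, "chr1", 1000000)
print(lifted)
```

If predictions are compared against an annotation given in chromosome
coordinates without this step, the overlap will be wrong even when the
predictions themselves are fine.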

The usage message of gsBig is:
gsBig - Run Genscan on big input and produce GTF files and other parsed output
usage:
    gsBig file.fa output.gtf
options:
    -subopt=output.bed - Produce suboptimal exons.
    -trans=output.fa - where to put translated proteins.
    -prerun=input.genscan - Assume genscan run already with this output.
    -window=size    Set window to pass to genscan specific size (default 1200000)
                    You want ~400 bytes memory for each base in window.
    -exe=/bin/genscan-linux/genscan - where genscan executable is.
    -par=/bin/genscan-linux/HumanIso.smat - where parameter file is.
    -tmp=/tmp - where temporary files go to.
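
As a rough check on the memory note above, the ~400 bytes per base figure can be
turned into an estimate for a given -window value. This is only arithmetic on
the number quoted in the usage message, not a measurement. The helper name
window_memory_bytes is made up.

```python
BYTES_PER_BASE = 400  # approximate figure quoted in the gsBig usage message


def window_memory_bytes(window):
    """Estimate the memory genscan needs for one window of the given size."""
    return window * BYTES_PER_BASE


# Default window versus the window used in the example command above.
for window in (1_200_000, 2_400_000):
    gigabytes = window_memory_bytes(window) / 1e9
    print(f"-window={window}: about {gigabytes:.2f} GB")
```

By this estimate the default window needs about 0.48 GB, and the -window=2400000
used in the example command needs about 0.96 GB.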

You can read our entire processing sequence in the *.txt files in the source tree:
	src/hg/makeDb/doc/*.txt

--Hiram

Axel E. Bernal wrote:
> Hi,
> 
> I am trying to reproduce the predictions of the GenScan track for the 
> ENCODE region.
> The numbers for the nucleotide level, as they appeared in the last EGASP 
> '05 ENCODE publication (Genome Biology, Volume 7, Supplement 1), are as 
> follows (they appear on page S2.10 under "USCS genscan track"):
> Sn:84.17%
> Sp:60.60%
> 
> Whereas the ones that I obtained with a local copy of GenScan are around:
> Sn:48.92%
> Sp:59.40%
> 
> I made sure no errors were made while reading the output of the program. 
> I am also using masked sequences (masked with RepeatMasker); the results 
> with unmasked sequences are even worse.
> 
> I'd appreciate it if you could help me work out why I am obtaining 
> these bad results on ENCODE; on all other test sets I don't have this 
> problem. Did you use any special parameters for running the program, 
> or any pre/post-processing of the sequences?
> 
> Thanks a lot in advance,
> 
> Sincerely,
> 
> Axel E Bernal
> University of Pennsylvania
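
For reference, the nucleotide-level Sn and Sp quoted above are conventionally
computed as Sn = TP/(TP+FN) and Sp = TP/(TP+FP) over coding bases. Below is a
minimal sketch of that calculation, assuming 1-based inclusive intervals on a
single sequence and ignoring strand. It is not the EGASP evaluation code, and
the function name and toy intervals are made up.

```python
def nucleotide_sn_sp(annotated, predicted):
    """Return (Sn, Sp) at the nucleotide level.

    annotated, predicted: iterables of (start, end) coding intervals,
    1-based and inclusive, all on the same sequence.
    """
    def bases(intervals):
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        return covered

    ann, pred = bases(annotated), bases(predicted)
    tp = len(ann & pred)                   # coding bases predicted as coding
    sn = tp / len(ann) if ann else 0.0     # TP / (TP + FN)
    sp = tp / len(pred) if pred else 0.0   # TP / (TP + FP)
    return sn, sp


# Toy example: a 100-base annotation half covered by a 100-base prediction.
sn, sp = nucleotide_sn_sp([(1, 100)], [(51, 150)])
print(f"Sn={sn:.2%} Sp={sp:.2%}")
```

A set of positions is fine for a sketch like this; for whole chromosomes,
merged sorted intervals would use far less memory.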

