[Genome] A problem with the soft-masking!

wang xiaosong dr.wang at hotmail.com
Sun Mar 11 15:41:08 PDT 2007


Dear BLAT experts,

I encountered a problem with a repetitive EST sequence when running BLAT 
against a soft-masked genome nib file. Taking sequence AI306750 as an 
example, the command lines are as follows:
----------------------------------
faToNib -softMask hg18.fa hg18.nib
gfServer start server 2345 *.nib -stepSize=5 -mask
gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa 
/data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
---------------------------------
I always get a result as follows, with no "rep match" bases declared.
---------------------------------
>chr22 
          Length = 49691432

 Score = 41 bits (107), Expect = 1e-03
 Identities = 22/22 (100%)
 Strand = Plus / Plus

Query: 191      ctcaaaaaaaaaaaaaaaaaaa 212
                ||||||||||||||||||||||
Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
----------------------------------------------
match	mis- 	rep. 	N's	Q gap	Q gap	T gap	T gap
     	match	match	   	count	bases	count	bases
22	0	0	0	0	0	0	0
----------------------------------------------
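
For reference, the eight numbers in the row above are the leading count columns of a PSL record, where "match" counts matching bases outside repeats and "rep. match" counts matching bases inside repeats. A minimal Python sketch for reading them follows; the function names are illustrative only and are not part of BLAT:

```python
from typing import Dict

# Names of the first eight PSL columns, in file order.
PSL_COUNT_FIELDS = (
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
)


def parse_psl_counts(line: str) -> Dict[str, int]:
    """Read the eight leading count columns from one tab-separated PSL row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(PSL_COUNT_FIELDS):
        raise ValueError("expected at least 8 tab-separated columns")
    return {name: int(value) for name, value in zip(PSL_COUNT_FIELDS, cols)}


def repeat_fraction(counts: Dict[str, int]) -> float:
    """Fraction of aligned bases that were reported as matches inside repeats."""
    aligned = counts["matches"] + counts["misMatches"] + counts["repMatches"]
    return counts["repMatches"] / aligned if aligned else 0.0


# The row reported above: 22 matches, none of them flagged as repeat matches.
counts = parse_psl_counts("22\t0\t0\t0\t0\t0\t0\t0")
print(counts["matches"], counts["repMatches"], repeat_fraction(counts))
```

If soft-masking had taken effect for this poly-A hit, one would expect the matching bases to show up under repMatches rather than matches.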

It seems that the command lines I used did not mask the poly-A repeat here 
and did not report any "rep. match" bases. Therefore, my questions are: 
(1) How can I correctly soft-mask the genome sequence file, and how can I 
get the "rep. match" bases reported for repeat hits such as the one above? 
(2) Does faToNib -softMask mask the .fa file itself, or does it take a 
soft-masked .fa file generated by another repeat-masking program? 
(3) Does UCSC provide the soft-masked genome sequences in fa or nib format?
(4) There is a -mask=type option in the blat command; does it exist in 
gfServer/gfClient? On my current machine, the memory only allows 
gfServer/gfClient to run with the whole genome sequence.
Many thanks for the help!
>AI306750
TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAGTCGTATCGA
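
Regarding question (2), one quick way to see whether a .fa file is already soft-masked is to count its lower-case bases, assuming the usual convention that soft-masked repeats are written in lower case. The following is a minimal Python sketch; the function name and the two sample records are made up for illustration:

```python
import io
from typing import Dict, TextIO


def lowercase_fraction(fasta: TextIO) -> Dict[str, float]:
    """Return, for each FASTA record, the fraction of bases in lower case."""
    totals: Dict[str, int] = {}
    lowers: Dict[str, int] = {}
    name = None
    for line in fasta:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            totals[name] = 0
            lowers[name] = 0
        elif name is not None:
            totals[name] += len(line)
            lowers[name] += sum(1 for base in line if base.islower())
    return {n: (lowers[n] / totals[n] if totals[n] else 0.0) for n in totals}


# Made-up example: the "masked" record has its poly-A tail in lower case.
sample = io.StringIO(
    ">masked\nGAACTCaaaaaaaaaa\n"
    ">unmasked\nGAACTCAAAAAAAAAA\n"
)
fractions = lowercase_fraction(sample)
print(fractions)
```

For a real file, the same function can be called on open("hg18.fa"). If every fraction is 0.0, the input contains no lower-case sequence, which would be consistent with the missing "rep. match" counts.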

>From: Galt Barber <galt at soe.ucsc.edu>
>To: wang xiaosong <dr.wang at hotmail.com>
>CC: genome at soe.ucsc.edu, xiaosong at med.umich.edu
>Subject: Re: [Genome] The problem in the results of BLAT linux v34
>Date: Mon, 26 Feb 2007 14:47:29 -0800 (PST)
>
>
>Looks like you are using the hard-masked version of the chromosomes.
>I recommend using the soft-masked versions.  There are many
>repeats around the exons in question and that could affect
>the alignments.
>
>For the question about the score that hgBlat generates,
>please see the blat FAQ:
>
>http://hgwdev.cse.ucsc.edu/FAQ/FAQblat#blat4
>
>Also, note that if you are doing batch queries
>it may be easier to just use stand-alone commandline
>"blat" instead of gfServer/gfClient.
>
>If memory is tight you can do one chrom at a time
>and then combine/filter psl results with pslReps
>and other tools like that.
>
>-Galt
>
>
>On Tue, 27 Feb 2007, wang xiaosong wrote:
>
> > Dear All,
> >
> > I'm Xiaosong Wang from Dr. Arul Chinnaiyan's lab at the University of
> > Michigan. We encountered a problem in the output of BLAT Linux version
> > 34. The Linux version of BLAT usually overlooks one exon at either end of
> > the input sequence. For example, the chromosome matched regions of the ERG
> > and TMPRSS2 sequences are 0-1128 and 55-1725 as mapped by BLAT Linux v34,
> > while the matched regions changed to 1-1514 and 1-1725 with the
> > web-based BLAT (see attached file for BLAT results, and test.txt for the
> > sequence). The Linux version of BLAT lost the last exon of ERG (1128-1514)
> > and the first exon of TMPRSS2 (0-55). The command line we use is as
> > follows:
> > -----------------------------------------------
> > gfServer start path-t1 7855 *.nib -minMatch=1
> > gfClient path-t1 7855 /data/chromnibmasked /data/test.fa /data/test.out
> > -t=dna -q=rna -minScore=0 -minIdentity=0
> > -----------------------------------------------
> > In addition, we find that the score in the web-based BLAT results is not
> > provided in the Linux version results. Therefore, we wonder whether anyone
> > knows the algorithm behind this score.
> >
> > Thank you very much indeed.
> >
> > Xiaosong
> >
> >
> > Xiaosong Wang
> > Department of Pathology, University of Michigan Medical School
> > 1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
> > Phone: 734-763-1224
> >



