[Genome] A problem with the soft-masking!
wang xiaosong
dr.wang at hotmail.com
Sun Mar 11 15:41:08 PDT 2007
Dear BLAT experts,
I encountered a problem with a repetitive EST sequence when running BLAT against a
soft-masked genome nib file. Taking sequence AI306750 as an example, the command
lines are as follows:
----------------------------------
faToNib -softMask hg18.fa hg18.nib
gfServer start server 2345 *.nib -stepSize=5 -mask
gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
/data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
---------------------------------
I always get a result as follows, with no "rep match" bases declared.
---------------------------------
>chr22
Length = 49691432
Score = 41 bits (107), Expect = 1e-03
Identities = 22/22 (100%)
Strand = Plus / Plus
Query: 191      ctcaaaaaaaaaaaaaaaaaaa 212
                ||||||||||||||||||||||
Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
----------------------------------------------
match  mis-   rep.   N's    Q gap  Q gap  T gap  T gap
       match  match         count  bases  count  bases
22     0      0      0      0      0      0      0
----------------------------------------------
It seems that the command lines I used did not mask the poly-A repeat here
or report any "rep. match" bases. Therefore, my questions are:
(1) How can I correctly soft-mask the genome sequence file, and how can I get
the "rep. match" bases reported in my results?
(2) Does faToNib -softMask mask the .fa file itself, or does it take a
soft-masked .fa file generated by another repeat-masking program?
(3) Does UCSC provide the soft-masked genome sequences in .fa or nib format?
(4) There is a -mask=type option in the blat command; does it exist in
gfServer/gfClient? On my current machine, the memory only allows
gfServer/gfClient to run with the whole genome sequence.
Many thanks for the help!
>AI306750
TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAGTCGTATCGA
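
Regarding question (2): as far as I understand it, faToNib -softMask only carries lowercase (soft-masked) bases from the input .fa into the nib file and does not detect repeats itself, so it may be worth checking whether the input .fa contains any lowercase bases at all. The sketch below is one minimal way to do that with standard Unix tools. The function name count_masked_bases and the path in the usage line are hypothetical and not part of the BLAT suite.

```shell
# Hypothetical helper (not part of the BLAT suite): count lowercase
# (soft-masked) and uppercase bases in a FASTA file.
# Header lines starting with ">" are skipped.
count_masked_bases() {
    fasta=$1
    # tr -cd keeps only the listed characters; wc -c then counts them.
    lower=$(grep -v '^>' "$fasta" | tr -cd 'acgtn' | wc -c | tr -d ' ')
    upper=$(grep -v '^>' "$fasta" | tr -cd 'ACGTN' | wc -c | tr -d ' ')
    echo "lowercase=$lower uppercase=$upper"
}
```

For example, `count_masked_bases /data/chr22.fa` (path assumed) prints both counts. A lowercase count of 0 would suggest the file is unmasked or hard-masked (repeats replaced by N), in which case there is no soft-masking for faToNib -softMask to preserve.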
>From: Galt Barber <galt at soe.ucsc.edu>
>To: wang xiaosong <dr.wang at hotmail.com>
>CC: genome at soe.ucsc.edu, xiaosong at med.umich.edu
>Subject: Re: [Genome] The problem in the results of BLAT linux v34
>Date: Mon, 26 Feb 2007 14:47:29 -0800 (PST)
>
>
>Looks like you are using the hard-masked version of the chromosomes.
>I recommend using the soft-masked versions. There are many
>repeats around the exons in question and that could affect
>the alignments.
>
>For the question about the score that hgBlat generates,
>please see the blat FAQ:
>
>http://hgwdev.cse.ucsc.edu/FAQ/FAQblat#blat4
>
>Also, note that if you are doing batch queries
>it may be easier to just use stand-alone commandline
>"blat" instead of gfServer/gfClient.
>
>If memory is tight you can do one chrom at a time
>and then combine/filter psl results with pslReps
>and other tools like that.
>
>-Galt
>
>
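
The suggestion above (stand-alone blat, one chromosome at a time, then pslReps) could be laid out roughly as follows. This is only a dry-run sketch: it prints the commands rather than running them, so they can be reviewed first. The function name plan_per_chrom_blat, the file names, and the output directory are hypothetical. It assumes per-chromosome soft-masked .fa files, and the flags shown (-mask=lower, -noHead, -nohead) should be checked against the usage messages of the installed blat and pslReps.

```shell
# Dry-run sketch: print (do not execute) one blat command per chromosome
# file, followed by the merge, sort, and pslReps steps.
# Arguments: query FASTA, output directory, then one or more chromosome files.
plan_per_chrom_blat() {
    query=$1
    outdir=$2
    shift 2
    for chrom_fa in "$@"; do
        name=$(basename "$chrom_fa" .fa)
        echo "blat $chrom_fa $query -t=dna -q=rna -mask=lower -noHead $outdir/$name.blat.psl"
    done
    # pslReps expects input sorted by query name (column 10 of headerless PSL).
    echo "cat $outdir/*.blat.psl | sort -k10,10 > $outdir/all.sorted.psl"
    echo "pslReps -nohead $outdir/all.sorted.psl $outdir/best.psl $outdir/best.psr"
}
```

Calling `plan_per_chrom_blat /data/AI306750.fa /data/psl chr21.fa chr22.fa` prints two blat lines plus the merge and pslReps lines. Because each blat run loads only one chromosome, peak memory stays well below that of a whole-genome gfServer.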
>On Tue, 27 Feb 2007, wang xiaosong wrote:
>
> > Dear All,
> >
> > I'm Xiaosong Wang from Dr. Arul Chinnaiyan's lab at the University of
> > Michigan. We encountered a problem in the output of BLAT Linux version
> > 34. The Linux version of BLAT usually overlooks one exon at either end of
> > the input sequence. For example, the matched regions of the ERG and
> > TMPRSS2 sequences are 0-1128 and 55-1725 as mapped by BLAT Linux v34,
> > while the matched regions change to 1-1514 and 1-1725 with the
> > web-based BLAT (see attached file for BLAT results, and test.txt for the
> > sequences). The Linux version of BLAT lost the last exon of ERG
> > (1128-1514) and the first exon of TMPRSS2 (0-55). The command lines we
> > use are as follows:
> > -----------------------------------------------
> > gfServer start path-t1 7855 *.nib -minMatch=1
> > gfClient path-t1 7855 /data/chromnibmasked /data/test.fa /data/test.out
> > -t=dna -q=rna -minScore=0 -minIdentity=0
> > -----------------------------------------------
> > In addition, we find that the score shown in the web-based BLAT results is
> > not provided in the Linux version results. Therefore, we wonder whether
> > anyone knows the algorithm behind this score.
> >
> > Thank you very much indeed.
> >
> > Xiaosong
> >
> >
> > Xiaosong Wang
> > Department of Pathology, University of Michigan Medical School
> > 1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
> > Phone: 734-763-1224
> >