[Genome] A problem with the soft-masking!

Xiaosong Wang xiaosong at med.umich.edu
Wed Mar 14 18:35:12 PDT 2007


Dear Galt,

I really appreciate your sincere help. Yes, I have tried the gfServer
command with and without "-mask", but I get the same result, with no
repmatch bases reported. I simply cannot figure out why standalone BLAT
does not report these repmatch bases!
Again, thank you very much for the detailed explanation of the psl
tools.

Best wishes

Xiaosong


Xiaosong Wang
Department of Pathology, University of Michigan Medical School
1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
Phone: 734-763-1224

>>> Galt Barber <galt at soe.ucsc.edu> 2007-3-14 4:30 >>>

Our web blat (hgBlat) cgi talks to our
gfServer, which is NOT started with -mask for dna.
You can, however, use -mask with your own gfServer
installation, and it should correctly report repmatches
in the output psl files.
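
As a minimal sketch (not part of BLAT), the repmatch count can be read
from the third column of each psl line. This assumes the standard
tab-separated 21-column psl layout; the function name psl_counts and the
example record are hypothetical, modelled on the chr22 hit discussed in
this thread.

```python
# Sketch: read the match and repMatch counts from one psl data line.
# Assumes the standard 21-column, tab-separated psl layout.

def psl_counts(line):
    """Return (qName, tName, matches, repMatches) for one psl data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 21:
        raise ValueError(f"expected 21 psl fields, got {len(fields)}")
    # Columns: 0 = matches, 2 = repMatches, 9 = qName, 13 = tName
    return fields[9], fields[13], int(fields[0]), int(fields[2])

# Hypothetical record for illustration only (psl starts are 0-based).
example = "\t".join([
    "22", "0", "0", "0", "0", "0", "0", "0", "+",
    "AI306750", "234", "190", "212",
    "chr22", "49691432", "39310677", "39310699",
    "1", "22,", "190,", "39310677,",
])

print(psl_counts(example))  # prints ('AI306750', 'chr22', 22, 0)
```

A repMatches value of 0 for a hit that lies entirely in lower-case
sequence, as in this record, is the symptom described in the thread.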

pslReps is not for re-scoring psl files with
dna-repeat information from masked sequence
as your email implies.
pslReps simply helps you filter the psls for a
given alignment so that you keep the best
alignment(s).  So, in that sense it is
removing "repeated alignments".
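
To illustrate the idea of keeping only the best alignment per query,
here is a rough sketch. It is NOT the pslReps algorithm: the score below
is a simplified assumption (matches, plus repeat matches at half weight,
minus mismatches), and the names simple_score and best_per_query are
made up for this example.

```python
# Sketch: keep the single best-scoring psl line for each query name.
# The scoring rule is a simplified assumption, not what pslReps does.

def simple_score(fields):
    """Score = matches + repMatches // 2 - misMatches (illustrative only)."""
    matches, mismatches, rep_matches = (int(x) for x in fields[:3])
    return matches + rep_matches // 2 - mismatches

def best_per_query(psl_lines):
    """Return {qName: best psl line}, skipping header and blank lines."""
    best = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # not a psl data line
        query = fields[9]
        score = simple_score(fields)
        if query not in best or score > best[query][0]:
            best[query] = (score, line)
    return {query: line for query, (score, line) in best.items()}
```

With a score like this, a short hit made up entirely of repeat matches
ranks below an equally long hit in unique sequence, but only if the
repMatches column is actually filled in.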

pslCDnaFilter is somewhat more complete
and has many useful options.

------
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
    pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
[...]
(options not shown here)
--------

pslCDnaFilter [options] inPsl outPsl

Filter cDNA alignments in psl format.  Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.

WARNING: comparative filters require that the input is sorted by
query name.  The command: 'sort -k 10,10' will do the trick.

[...]
(many, many options not shown here)
-----
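
On the sorting requirement in the pslCDnaFilter usage text above: column
10 of a psl line is the query name. The sketch below does the equivalent
grouping in Python and checks whether a file is already in order. It
assumes header lines have already been removed; the function names are
hypothetical, and Python's plain string ordering may differ from the
locale-dependent order that 'sort' produces.

```python
# Sketch: sort psl data lines by query name (column 10, index 9) and
# check whether they are already sorted. Header-free input is assumed.

def query_name(line):
    """Return the qName field of a tab-separated psl data line."""
    return line.split("\t")[9]

def sort_by_query(psl_lines):
    """Return the lines ordered by query name (stable sort)."""
    return sorted(psl_lines, key=query_name)

def is_sorted_by_query(psl_lines):
    """True if query names are in non-decreasing order."""
    names = [query_name(line) for line in psl_lines]
    return all(a <= b for a, b in zip(names, names[1:]))
```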

Good Luck!


-Galt


On Tue, 13 Mar 2007, Xiaosong Wang wrote:

> Dear Galt,
>
> Thank you so much for the detailed explanation. I'm using
> chromFa.zip, and the file is soft-masked. Following your suggestion, I
> also found that the position chr22:39310678-39310699 is already
> soft-masked in my file. I understand that web-based BLAT is not
> intended to give the best results. However, while the standalone
> version of BLAT produces this false positive, BLAT on the genome
> browser seems to have avoided the problem. Below are the BLAT results
> from the genome browser; "chr22:39310678-39310699" does not appear in
> them. The reason I am pursuing such small matches is that I'm trying
> to find fused sequences. So besides the best match, I'm trying to find
> small regions on each side of the best match that can be mapped to
> other genomic loci. In the output of standalone BLAT, there were
> hundreds of ESTs with poly-A tails mapped to other genomic loci with
> no penalty for the repmatch. This produces quite a few false positives
> in my analysis. Therefore, I would like to ask whether you have any
> idea how to bring up the information for repmatch bases, and whether
> pslReps will read the sequence again to get the repmatch bases?
> --command line--- (hg18.fa is only a placeholder; I'm using the
> split per-chromosome files)
> faToNib -softMask hg18.fa hg18.nib
> gfServer start server 2345  *.nib -stepSize=5 -mask
> gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
>
> Thank you very much!
>
> Best wishes
>
> Xiaosong
>
> ----------Web based BLAT Search Results--------------
> ACTIONS         QUERY    SCORE START  END QSIZE IDENTITY CHRO STRAND     START       END SPAN
> ---------------------------------------------------------------------------------------------
> browser details AI306750   172    21  192   234   100.0%    6      -  34613559  34613730  172
> browser details AI306750    45    53  200   234    96.0%   16      +  11329927  11330365  439
> browser details AI306750    43    61  218   234    95.7%    7      +  86690596  86690917  322
> browser details AI306750    26    31   65   234    71.5%   10      +  58592479  58592506   28
> browser details AI306750    25    65   92   234    96.3%   12      -   4499858   4499946   89
> browser details AI306750    25    30   77   234    61.6%    2      + 129287483 129287508   26
> browser details AI306750    24    74  101   234    96.2%   15      -  72145425  72145457   33
> browser details AI306750    24    74  101   234    96.2%   15      +  83595466  83595498   33
> browser details AI306750    23   191  217   234    92.6%    8      + 145519532 145519558   27
> browser details AI306750    23    22   50   234    89.7%   17      +  77861946  77861974   29
> browser details AI306750    22    85  111   234    95.9%    7      + 102498033 102498076   44
> browser details AI306750    22   172  197   234    78.3%   17      +   8625289   8625311   23
> browser details AI306750    22    70   93   234    87.0%   10      +  91185511  91185533   23
> browser details AI306750    22   175  197   234   100.0%   10      +  74307390  74307414   25
> browser details AI306750    20    79   98   234   100.0%    2      -  91242437  91242456   20
> browser details AI306750    20    64   84   234   100.0%   17      -  35181460  35181481   22
> browser details AI306750    20   137  156   234   100.0%    1      +  92189967  92189986   20
> browser details AI306750    19   180  200   234    85.0%    4      - 178914046 178914065   20
> browser details AI306750    19   168  188   234    85.0%    4      - 148746891 148746910   20
> browser details AI306750    19   101  120   234   100.0%    4      + 190241376 190241396   21
>
>
>
> Xiaosong Wang
> Department of Pathology, University of Michigan Medical School
> 1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
> Phone: 734-763-1224
>
> >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-13 15:10 >>>
>
> Is your original hg18.fa itself hard-masked (N),
> soft-masked (lowercase), or unmasked?  If it was unmasked,
> then that would mean you need to get a masked version
> of the fasta file.
>
> You can get hg18 sequence here:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ 
>
> either use hg18.2bit or chromFa.zip
>
> The browser confirms that the region you show
> on hg18, chr22:39310678-39310699, has been masked by RepeatMasker,
> and the DNA link in the top menu bar shows the masking
> when I get the sequence with repeats masked to lower case:
>
> >hg18_dna range=chr22:39310678-39310699 repeatMasking=lower
> ctcaaaaaaaaaaaaaaaaaaa
>
> So we should expect repmatch count not to be 0
> as you said.
>
> Even when masking is working properly,
> BLAT will accept alignments as long as the
> exon does not START in a masked region.
>
> BLAT also trims poly-A tails, but not if even
> one non-A base appears near the end.
> e.g. AAAAAAAAAGA might still not be trimmed
> on the query side.
>
> fyi A search for AI306750 in hg18 genome browser,
> finds an EST here: chr6:34,613,559-34,613,730
>
> In general, you may find that stand-alone command-line
> BLAT is good for batch jobs, and that the various
> utility programs such as pslReps and pslCDnaFilter
> are great for filtering BLAT psls to match your
> requirements.  Our hgBlat web cgi is intended for
> use by humans, and shows everything, even low-quality
> alignments.  For batch processing, configure
> reasonable thresholds of quality for the alignments.
>
> -Galt
>
>
> On Tue, 13 Mar 2007, wang xiaosong wrote:
>
> > Dear BLAT experts
> >
> > I encountered a problem with a repetitive EST sequence when running
> > BLAT against a soft-masked genome nib file. Take sequence AI306750
> > as an example; the command line is as follows:
> > ----------------------------------
> > faToNib -softMask hg18.fa hg18.nib
> > gfServer start server 2345  *.nib -stepSize=5 -mask
> > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> > ---------------------------------
> > I always get a result as follows, with no "rep match" bases declared.
> > ---------------------------------
> > >chr22
> >           Length = 49691432
> >
> >  Score = 41 bits (107), Expect = 1e-03
> >  Identities = 22/22 (100%)
> >  Strand = Plus / Plus
> >
> > Query: 191      ctcaaaaaaaaaaaaaaaaaaa 212
> >                 ||||||||||||||||||||||
> > Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
> > ----------------------------------------------
> > match	mis- 	rep. 	N's	Q gap	Q gap	T gap	T gap
> >      	match	match	   	count	bases	count	bases
> > 22	0	0	0	0	0	0	0
> > ----------------------------------------------
> >
> > It seems that the command line I used did not mask the poly-A
> > repeat here or report "rep match" bases. Therefore, my questions are:
> > (1) How can I correctly soft-mask the genome sequence file, and how
> > can I get "repeat match" bases reported for my current results?
> > (2) Does faToNib -softMask mask the .fa file itself, or does it
> > take a soft-masked .fa file generated by another repeat-masking
> > program? Should I run maskOutFa before using faToNib? Are there any
> > instructions for installing maskOutFa on Linux?
> > (3) Does UCSC provide soft-masked genome sequences in fa or nib
> > format?
> > (4) There is a -mask=type option in the blat command; does it exist
> > in gfServer/gfClient? On my current machine, the memory only allows
> > gfServer/gfClient to run with the whole genome sequence.
> > Many thanks for the help!
> > >AI306750
> > TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
> > TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
> > TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
> > AAAAAAAAAAAAAAGTCGTATCGA
> >
> >
>
>
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should
> not be used for urgent or sensitive issues.
>




