[Genome] A problem with the soft-masking!

Galt Barber galt at soe.ucsc.edu
Wed Mar 14 19:35:29 PDT 2007


I think the short answer to your question is
you need -repeats=lower when running standalone blat
in order for the psl output report to show repmatches.

stand-alone blat has -mask= and -qMask= and -repeats options.
Did you try them?
If you just type "blat" with nothing else on the commandline,
it gives you the blat help information.

Here is an excerpt from the blat help:
----

-mask=type Mask out repeats.  Alignments won't be started in masked
region but may extend through it in nucleotide searches.  Masked
areas are ignored entirely in protein or translated searches.
Types are
         lower - mask out lower cased sequence
         upper - mask out upper cased sequence
         out   - mask according to database.out RepeatMasker .out file
         file.out - mask database according to RepeatMasker file.out

-qMask=type Mask out repeats in query sequence.  Similar to -mask above
but for query rather than target sequence.

-repeats=type Type is same as mask types above.  Repeat bases will not
be masked in any way, but matches in repeat areas will be
reported separately from matches in other areas in the psl output.

----

For some purposes, stand-alone blat will run faster
if you make a .ooc file to mark the highly used tiles.
This is useful for many purposes.

blat hg18.2bit /dev/null /dev/null -tileSize=11 -makeOoc=11.ooc -repMatch=1024


If you ran blat without the option -repeats=lower then
they probably nothing gets reported as repmatches
in the .psl output.  Instead it all gets reported
as simple match.

Note that if you are low on ram, you can run stand-alone
blat once for each chromosome.  Then you can combine
the psl files by concatenation and sort them with
pslSort and filter them with the pslReps or pslCDnaFilter
mentioned earlier.  For concatenating psls easily,
it is probably better to add -noHead to suppress
the header in the output.

-Galt


On Wed, 14 Mar 2007, Xiaosong Wang wrote:

> Dear Galt,
>
> I really appreciate your sincere help. Yes, I have tried gfServer
> command with  or without "-mask", but I get the same result without
> repmatch report. I simply can not figure out why standalone BLAT didn't
> report this repmatch bases!
> Again, thank you very much for the detailed explanation to psl
> softwares.
>
> Best wishes
>
> Xiaosong
>
>
> Xiaosong Wang
> Department of Pathology, University of Michigan Medical School
> 1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
> Phone: 734-763-1224
>
> >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-14 4:30 >>>
>
> Our web blat (hgBlat) cgi talks to our
> gfServer which is NOT started with -mask for dna.
> You can, however, use -mask wth your own installation
> gfServer and it should correctly report repmatches in the
> output psl files.
>
> pslReps is not for re-scoring psl files with
> dna-repeat information from masked sequence
> as your email implies.
> pslReps simply helps you filter the psls for a
> given alignment so that you keep the best
> alignment(s).  So, in that sense it is
> removing "repeated alignments".
>
> pslCDnaFilter is somewhat more complete
> and has many useful options.
>
> ------
> pslReps - analyse repeats and generate genome wide best
> alignments from a sorted set of local alignments
> usage:
>     pslReps in.psl out.psl out.psr
> where in.psl is an alignment file generated by psLayout and
> sorted by pslSort, out.psl is the best alignment output
> and out.psr contains repeat info
> [...]
> (options not shown here)
> --------
>
> pslCDnaFilter [options] inPsl outPsl
>
> Filter cDNA alignments in psl format.  Filtering criteria are
> comparative, selecting near best in genome alignments for each
> given cDNA and non-comparative, based only on the quality of an
> individual alignment.
>
> WARNING: comparive filters requires that the input is sorted by
> query name.  The command: 'sort -k 10,10' will do the trick.
>
> [...]
> (many, many options not shown here)
> -----
>
> Good Luck!
>
>
> -Galt
>
>
> On Tue, 13 Mar 2007, Xiaosong Wang wrote:
>
> > Dear Galt,
> >
> > Thank you so much for the detailed explanation. I'm using the
> > chromFa.zip, and the file is soft masked. With your suggestion, I
> also
> > find that the position chr22:39310678-39310699 is already softmasked
> in
> > my file. I admit that web based blat have not been dedicated to get
> best
> > results. However, despite that the standalone version of BLAT
> produce
> > this false positive, the BLAT on genome browser seems has avoided
> this
> > problem. Following is the BLAT results from the genome browser. We
> can
> > see that "chr22:39310678-39310699" didn't appear in this result. The
> > reason why I would like to pursue this kind of small matching is
> that
> > I'm trying to find the fused sequences. So besides the best match,
> I'm
> > trying to find out the small regions on each side of the best match
> that
> > can be mapped to other genome locus. And in the output of standalone
> > BLAT, there was hundreds of ESTs with PolyA tail mapped to other
> genome
> > locus without penalty for the repmatch. This make up quite a bit
> false
> > positive results for my analysis. Therefore, I would like to ask
> whether
> > you have any idea about how to bring up the information for repmatch
> > bases and whether pslReps will read the sequence again to get the
> > repmatch bases?
> > --command line---(hg18.fa is only representative symbol, I'm using
> the
> > splited files for chromosomes)
> > faToNib -softMask hg18.fa hg18.nib
> > gfServer start server 2345  *.nib -stepSize=5 -mask
> > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> >
> > Thank you very much!
> >
> > Best wishes
> >
> > Xiaosong
> >
> > ----------Web based BLAT Search Results--------------
> >    ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO
> > STRAND  START    END      SPAN
> >
> ---------------------------------------------------------------------------------------------------
> > browser details AI306750         172    21
>   192   234 100.0%     6
> -
> >   34613559  34613730    172
> > browser details AI306750          45    53   200   234  96.0%    16
> +
> >   11329927  11330365    439
> > browser details AI306750          43    61   218   234  95.7%     7
> +
> >   86690596  86690917    322
> > browser details AI306750          26    31    65   234  71.5%    10
> +
> >   58592479  58592506     28
> > browser details AI306750          25    65    92   234  96.3%    12
> -
> >    4499858   4499946     89
> > browser details AI306750          25    30    77   234  61.6%     2
> +
> >  129287483 129287508     26
> > browser details AI306750          24    74   101   234  96.2%    15
> -
> >   72145425  72145457     33
> > browser details AI306750          24    74   101   234  96.2%    15
> +
> >   83595466  83595498     33
> > browser details AI306750          23   191   217   234  92.6%     8
> +
> >  145519532 145519558     27
> > browser details AI306750          23    22    50   234  89.7%    17
> +
> >   77861946  77861974     29
> > browser details AI306750          22    85   111   234  95.9%     7
> +
> >  102498033 102498076     44
> > browser details AI306750          22   172   197   234  78.3%    17
> +
> >    8625289   8625311     23
> > browser details AI306750          22    70    93   234  87.0%    10
> +
> >   91185511  91185533     23
> > browser details AI306750          22   175   197   234 100.0%    10
> +
> >   74307390  74307414     25
> > browser details AI306750          20    79    98   234 100.0%     2
> -
> >   91242437  91242456     20
> > browser details AI306750          20    64    84   234 100.0%    17
> -
> >   35181460  35181481     22
> > browser details AI306750          20   137   156   234 100.0%     1
> +
> >   92189967  92189986     20
> > browser details AI306750          19   180   200   234  85.0%     4
> -
> >  178914046 178914065     20
> > browser details AI306750          19   168   188   234  85.0%     4
> -
> >  148746891 148746910     20
> > browser details AI306750          19   101   120   234 100.0%     4
> +
> >  190241376 190241396     21
> >
> >
> >
> > Xiaosong Wang
> > Department of Pathology, University of Michigan Medical School
> > 1150 W.Medical Center Dr. Rm3232, Me
> > d Sci I, Ann Arbor, MI 48109
> > Phone: 734-763-1224
> >
> > >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-13 15:10 >>>
> >
> > Is your original hg18.fa itself hard-masked (N),
> > soft-masked(Lowercase)
> > or unmasked?  If it was unmasked, then that would mean
> > you need to get a masked version of the fasta file.
> >
> > You can get hg18 sequence here:
> > http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
> >
> > either use hg18.2bit or chromFa.zip
> >
> > The browser confirms that the region you show
> > on hg18, chr22:39310678-39310699 has been masked by repeatmasker
> > and DNA link at the top menu bar shows masking is there
> > when I do get with masking to lower case
> >
> > >hg18_dna range=chr22:39310678-39310699 repeatMasking=lower
> > ctcaaaaaaaaaaaaaaaaaaa
> >
> > So we should expect repmatch count not to be 0
> > as you said.
> >
> > Even when masking is working properly,
> > BLAT will accept alignments as along as the
> > exon does not START in a masked region.
> >
> > BLAT also trims poly-A tails, but not if even
> > one non-A base appears near the end.
> > e.g. AAAAAAAAAGA might still not be trimmed
> > on the query side.
> >
> > fyi A search for AI306750 in hg18 genome browser,
> > finds an EST here: chr6:34,613,559-34,613,730
> >
> > In general, you may find that stand-alone command-line
> > BLAT is good for batch jobs, and that the various
> > utility programs such as pslReps and pslCDnaFilter
> > are great for filtering BLAT psls to match your
> > requirements.  Our hgBlat web cgi is intended for
> > use by humans, and shows everything, even low-quality
> > alignments.  For batch processing, configure
> > reasonable thresholds of quality for the alignments.
> >
> > -Galt
> >
> >
> > On Tue, 13 Mar 2007, wang xiaosong wrote:
> >
> > > Dear BLAT experts
> > >
> > > I encountered a problem with repeative EST sequence when doing
> BLAT
> > with
> > > softmasked genome ni
> b file. Take sequence AI306750 for example,
> the
> > command
> > > line is as following
> > > ----------------------------------
> > > faToNib -softMask hg18.fa hg18.nib
> > > gfServer start server 2345  *.nib -stepSize=5 -mask
> > > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> > > ---------------------------------
> > > I always get a result as follows, with no "rep match" bases
> > declared.
> > > ---------------------------------
> > > >chr22
> > >           Length = 49691432
> > >
> > >  Score = 41 bits (107), Expect = 1e-03
> > >  Identities = 22/22 (100%)
> > >  Strand = Plus / Plus
> > >
> > > Query: 191      ctcaaaaaaaaaaaaaaaaaaa 212
> > >                 ||||||||||||||||||||||
> > > Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
> > > ----------------------------------------------
> > > match	mis- 	rep. 	N's	Q gap	Q gap	T gap	T gap
> > >      	match	match	   	count	bases	count	bases
> > > 22	0	0	0	0	0	0	0
> > > ----------------------------------------------
> > >
> > > It seems that the command line I used did not mask the repeat
> polyA
> > here
> > > and provide "rep match" bases. Therefore, my question is:
> > > (1)how can I correctly soft-mask the genome sequence file and how
> can
> > I get
> > > "repeat match" bases that occur in my current results.
> > > (2)does the faToNib -softmask program mask the .fa file itself or
> > take the
> > > softmasked .fa file generated from other repeatmasking program"?
> > should I
> > > do maskoutfa before using faToNib? Is there any direction for
> > installation
> > > of maskoutfa in linux?
> > > (3)Does ucsc provide the softmasked genome sequences in fa or nib
> > format?
> > > (4)there was a -mask=type option in blat command, does it exists
> in
> > > gfserver/gfclient? In my current machine, the memory only allow
> > > gfserver/gfclient to run with the whole genome sequence.
> > > Many thanks for the help!
> > > >AI306750
> > >
> >
> TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
> > >
> >
> TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
> > >
> >
> TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
> > > AAAAAAAAAAAAAAGTCGTATCGA
> > >
> > > _________________________________________________________________
> > > 享用世界上最大的*子*件系** MSN Hotmail。  http://www.hotmail.com
> > >
> >
> >
> > **********************************************************
> > Electronic Mail is not secure, may not be read every day, and should
> not be used for urgent or sensitive issues.
> >
>
>
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
>



More information about the Genome mailing list