[Genome] A problem with the soft-masking!
Galt Barber
galt at soe.ucsc.edu
Wed Mar 14 19:35:29 PDT 2007
I think the short answer to your question is
you need -repeats=lower when running standalone blat
in order for the psl output report to show repmatches.
stand-alone blat has -mask= and -qMask= and -repeats options.
Did you try them?
If you just type "blat" with nothing else on the commandline,
it gives you the blat help information.
Here is an excerpt from the blat help:
----
-mask=type Mask out repeats. Alignments won't be started in masked
region but may extend through it in nucleotide searches. Masked
areas are ignored entirely in protein or translated searches.
Types are
lower - mask out lower cased sequence
upper - mask out upper cased sequence
out - mask according to database.out RepeatMasker .out file
file.out - mask database according to RepeatMasker file.out
-qMask=type Mask out repeats in query sequence. Similar to -mask above
but for query rather than target sequence.
-repeats=type Type is same as mask types above. Repeat bases will not
be masked in any way, but matches in repeat areas will be
reported separately from matches in other areas in the psl output.
----
For some purposes, stand-alone blat will run faster
if you make a .ooc file to mark the highly used tiles.
This is useful for many purposes.
blat hg18.2bit /dev/null /dev/null -tileSize=11 -makeOoc=11.ooc -repMatch=1024
If you ran blat without the option -repeats=lower then
they probably nothing gets reported as repmatches
in the .psl output. Instead it all gets reported
as simple match.
Note that if you are low on ram, you can run stand-alone
blat once for each chromosome. Then you can combine
the psl files by concatenation and sort them with
pslSort and filter them with the pslReps or pslCDnaFilter
mentioned earlier. For concatenating psls easily,
it is probably better to add -noHead to suppress
the header in the output.
-Galt
On Wed, 14 Mar 2007, Xiaosong Wang wrote:
> Dear Galt,
>
> I really appreciate your sincere help. Yes, I have tried gfServer
> command with or without "-mask", but I get the same result without
> repmatch report. I simply can not figure out why standalone BLAT didn't
> report this repmatch bases!
> Again, thank you very much for the detailed explanation to psl
> softwares.
>
> Best wishes
>
> Xiaosong
>
>
> Xiaosong Wang
> Department of Pathology, University of Michigan Medical School
> 1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
> Phone: 734-763-1224
>
> >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-14 4:30 >>>
>
> Our web blat (hgBlat) cgi talks to our
> gfServer which is NOT started with -mask for dna.
> You can, however, use -mask wth your own installation
> gfServer and it should correctly report repmatches in the
> output psl files.
>
> pslReps is not for re-scoring psl files with
> dna-repeat information from masked sequence
> as your email implies.
> pslReps simply helps you filter the psls for a
> given alignment so that you keep the best
> alignment(s). So, in that sense it is
> removing "repeated alignments".
>
> pslCDnaFilter is somewhat more complete
> and has many useful options.
>
> ------
> pslReps - analyse repeats and generate genome wide best
> alignments from a sorted set of local alignments
> usage:
> pslReps in.psl out.psl out.psr
> where in.psl is an alignment file generated by psLayout and
> sorted by pslSort, out.psl is the best alignment output
> and out.psr contains repeat info
> [...]
> (options not shown here)
> --------
>
> pslCDnaFilter [options] inPsl outPsl
>
> Filter cDNA alignments in psl format. Filtering criteria are
> comparative, selecting near best in genome alignments for each
> given cDNA and non-comparative, based only on the quality of an
> individual alignment.
>
> WARNING: comparive filters requires that the input is sorted by
> query name. The command: 'sort -k 10,10' will do the trick.
>
> [...]
> (many, many options not shown here)
> -----
>
> Good Luck!
>
>
> -Galt
>
>
> On Tue, 13 Mar 2007, Xiaosong Wang wrote:
>
> > Dear Galt,
> >
> > Thank you so much for the detailed explanation. I'm using the
> > chromFa.zip, and the file is soft masked. With your suggestion, I
> also
> > find that the position chr22:39310678-39310699 is already softmasked
> in
> > my file. I admit that web based blat have not been dedicated to get
> best
> > results. However, despite that the standalone version of BLAT
> produce
> > this false positive, the BLAT on genome browser seems has avoided
> this
> > problem. Following is the BLAT results from the genome browser. We
> can
> > see that "chr22:39310678-39310699" didn't appear in this result. The
> > reason why I would like to pursue this kind of small matching is
> that
> > I'm trying to find the fused sequences. So besides the best match,
> I'm
> > trying to find out the small regions on each side of the best match
> that
> > can be mapped to other genome locus. And in the output of standalone
> > BLAT, there was hundreds of ESTs with PolyA tail mapped to other
> genome
> > locus without penalty for the repmatch. This make up quite a bit
> false
> > positive results for my analysis. Therefore, I would like to ask
> whether
> > you have any idea about how to bring up the information for repmatch
> > bases and whether pslReps will read the sequence again to get the
> > repmatch bases?
> > --command line---(hg18.fa is only representative symbol, I'm using
> the
> > splited files for chromosomes)
> > faToNib -softMask hg18.fa hg18.nib
> > gfServer start server 2345 *.nib -stepSize=5 -mask
> > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> >
> > Thank you very much!
> >
> > Best wishes
> >
> > Xiaosong
> >
> > ----------Web based BLAT Search Results--------------
> > ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO
> > STRAND START END SPAN
> >
> ---------------------------------------------------------------------------------------------------
> > browser details AI306750 172 21
> 192 234 100.0% 6
> -
> > 34613559 34613730 172
> > browser details AI306750 45 53 200 234 96.0% 16
> +
> > 11329927 11330365 439
> > browser details AI306750 43 61 218 234 95.7% 7
> +
> > 86690596 86690917 322
> > browser details AI306750 26 31 65 234 71.5% 10
> +
> > 58592479 58592506 28
> > browser details AI306750 25 65 92 234 96.3% 12
> -
> > 4499858 4499946 89
> > browser details AI306750 25 30 77 234 61.6% 2
> +
> > 129287483 129287508 26
> > browser details AI306750 24 74 101 234 96.2% 15
> -
> > 72145425 72145457 33
> > browser details AI306750 24 74 101 234 96.2% 15
> +
> > 83595466 83595498 33
> > browser details AI306750 23 191 217 234 92.6% 8
> +
> > 145519532 145519558 27
> > browser details AI306750 23 22 50 234 89.7% 17
> +
> > 77861946 77861974 29
> > browser details AI306750 22 85 111 234 95.9% 7
> +
> > 102498033 102498076 44
> > browser details AI306750 22 172 197 234 78.3% 17
> +
> > 8625289 8625311 23
> > browser details AI306750 22 70 93 234 87.0% 10
> +
> > 91185511 91185533 23
> > browser details AI306750 22 175 197 234 100.0% 10
> +
> > 74307390 74307414 25
> > browser details AI306750 20 79 98 234 100.0% 2
> -
> > 91242437 91242456 20
> > browser details AI306750 20 64 84 234 100.0% 17
> -
> > 35181460 35181481 22
> > browser details AI306750 20 137 156 234 100.0% 1
> +
> > 92189967 92189986 20
> > browser details AI306750 19 180 200 234 85.0% 4
> -
> > 178914046 178914065 20
> > browser details AI306750 19 168 188 234 85.0% 4
> -
> > 148746891 148746910 20
> > browser details AI306750 19 101 120 234 100.0% 4
> +
> > 190241376 190241396 21
> >
> >
> >
> > Xiaosong Wang
> > Department of Pathology, University of Michigan Medical School
> > 1150 W.Medical Center Dr. Rm3232, Me
> > d Sci I, Ann Arbor, MI 48109
> > Phone: 734-763-1224
> >
> > >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-13 15:10 >>>
> >
> > Is your original hg18.fa itself hard-masked (N),
> > soft-masked(Lowercase)
> > or unmasked? If it was unmasked, then that would mean
> > you need to get a masked version of the fasta file.
> >
> > You can get hg18 sequence here:
> > http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
> >
> > either use hg18.2bit or chromFa.zip
> >
> > The browser confirms that the region you show
> > on hg18, chr22:39310678-39310699 has been masked by repeatmasker
> > and DNA link at the top menu bar shows masking is there
> > when I do get with masking to lower case
> >
> > >hg18_dna range=chr22:39310678-39310699 repeatMasking=lower
> > ctcaaaaaaaaaaaaaaaaaaa
> >
> > So we should expect repmatch count not to be 0
> > as you said.
> >
> > Even when masking is working properly,
> > BLAT will accept alignments as along as the
> > exon does not START in a masked region.
> >
> > BLAT also trims poly-A tails, but not if even
> > one non-A base appears near the end.
> > e.g. AAAAAAAAAGA might still not be trimmed
> > on the query side.
> >
> > fyi A search for AI306750 in hg18 genome browser,
> > finds an EST here: chr6:34,613,559-34,613,730
> >
> > In general, you may find that stand-alone command-line
> > BLAT is good for batch jobs, and that the various
> > utility programs such as pslReps and pslCDnaFilter
> > are great for filtering BLAT psls to match your
> > requirements. Our hgBlat web cgi is intended for
> > use by humans, and shows everything, even low-quality
> > alignments. For batch processing, configure
> > reasonable thresholds of quality for the alignments.
> >
> > -Galt
> >
> >
> > On Tue, 13 Mar 2007, wang xiaosong wrote:
> >
> > > Dear BLAT experts
> > >
> > > I encountered a problem with repeative EST sequence when doing
> BLAT
> > with
> > > softmasked genome ni
> b file. Take sequence AI306750 for example,
> the
> > command
> > > line is as following
> > > ----------------------------------
> > > faToNib -softMask hg18.fa hg18.nib
> > > gfServer start server 2345 *.nib -stepSize=5 -mask
> > > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> > > ---------------------------------
> > > I always get a result as follows, with no "rep match" bases
> > declared.
> > > ---------------------------------
> > > >chr22
> > > Length = 49691432
> > >
> > > Score = 41 bits (107), Expect = 1e-03
> > > Identities = 22/22 (100%)
> > > Strand = Plus / Plus
> > >
> > > Query: 191 ctcaaaaaaaaaaaaaaaaaaa 212
> > > ||||||||||||||||||||||
> > > Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
> > > ----------------------------------------------
> > > match mis- rep. N's Q gap Q gap T gap T gap
> > > match match count bases count bases
> > > 22 0 0 0 0 0 0 0
> > > ----------------------------------------------
> > >
> > > It seems that the command line I used did not mask the repeat
> polyA
> > here
> > > and provide "rep match" bases. Therefore, my question is:
> > > (1)how can I correctly soft-mask the genome sequence file and how
> can
> > I get
> > > "repeat match" bases that occur in my current results.
> > > (2)does the faToNib -softmask program mask the .fa file itself or
> > take the
> > > softmasked .fa file generated from other repeatmasking program"?
> > should I
> > > do maskoutfa before using faToNib? Is there any direction for
> > installation
> > > of maskoutfa in linux?
> > > (3)Does ucsc provide the softmasked genome sequences in fa or nib
> > format?
> > > (4)there was a -mask=type option in blat command, does it exists
> in
> > > gfserver/gfclient? In my current machine, the memory only allow
> > > gfserver/gfclient to run with the whole genome sequence.
> > > Many thanks for the help!
> > > >AI306750
> > >
> >
> TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
> > >
> >
> TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
> > >
> >
> TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
> > > AAAAAAAAAAAAAAGTCGTATCGA
> > >
> > > _________________________________________________________________
> > > 享用世界上最大的*子*件系** MSN Hotmail。 http://www.hotmail.com
> > >
> >
> >
> > **********************************************************
> > Electronic Mail is not secure, may not be read every day, and should
> not be used for urgent or sensitive issues.
> >
>
>
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
>
More information about the Genome
mailing list