[Genome] A problem with the soft-masking!

Galt Barber galt at soe.ucsc.edu
Wed Mar 14 01:30:21 PDT 2007


Our web blat (hgBlat) cgi talks to our
gfServer which is NOT started with -mask for dna.
You can, however, use -mask wth your own installation
gfServer and it should correctly report repmatches in the
output psl files.

pslReps is not for re-scoring psl files with
dna-repeat information from masked sequence
as your email implies.
pslReps simply helps you filter the psls for a
given alignment so that you keep the best
alignment(s).  So, in that sense it is
removing "repeated alignments".

pslCDnaFilter is somewhat more complete
and has many useful options.

------
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
    pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
[...]
(options not shown here)
--------

pslCDnaFilter [options] inPsl outPsl

Filter cDNA alignments in psl format.  Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.

WARNING: comparive filters requires that the input is sorted by
query name.  The command: 'sort -k 10,10' will do the trick.

[...]
(many, many options not shown here)
-----

Good Luck!


-Galt


On Tue, 13 Mar 2007, Xiaosong Wang wrote:

> Dear Galt,
>
> Thank you so much for the detailed explanation. I'm using the
> chromFa.zip, and the file is soft masked. With your suggestion, I also
> find that the position chr22:39310678-39310699 is already softmasked in
> my file. I admit that web based blat have not been dedicated to get best
> results. However, despite that the standalone version of BLAT produce
> this false positive, the BLAT on genome browser seems has avoided this
> problem. Following is the BLAT results from the genome browser. We can
> see that "chr22:39310678-39310699" didn't appear in this result. The
> reason why I would like to pursue this kind of small matching is that
> I'm trying to find the fused sequences. So besides the best match, I'm
> trying to find out the small regions on each side of the best match that
> can be mapped to other genome locus. And in the output of standalone
> BLAT, there was hundreds of ESTs with PolyA tail mapped to other genome
> locus without penalty for the repmatch. This make up quite a bit false
> positive results for my analysis. Therefore, I would like to ask whether
> you have any idea about how to bring up the information for repmatch
> bases and whether pslReps will read the sequence again to get the
> repmatch bases?
> --command line---(hg18.fa is only representative symbol, I'm using the
> splited files for chromosomes)
> faToNib -softMask hg18.fa hg18.nib
> gfServer start server 2345  *.nib -stepSize=5 -mask
> gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
>
> Thank you very much!
>
> Best wishes
>
> Xiaosong
>
> ----------Web based BLAT Search Results--------------
>    ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO
> STRAND  START    END      SPAN
> ---------------------------------------------------------------------------------------------------
> browser details AI306750         172    21   192   234 100.0%     6   -
>   34613559  34613730    172
> browser details AI306750          45    53   200   234  96.0%    16   +
>   11329927  11330365    439
> browser details AI306750          43    61   218   234  95.7%     7   +
>   86690596  86690917    322
> browser details AI306750          26    31    65   234  71.5%    10   +
>   58592479  58592506     28
> browser details AI306750          25    65    92   234  96.3%    12   -
>    4499858   4499946     89
> browser details AI306750          25    30    77   234  61.6%     2   +
>  129287483 129287508     26
> browser details AI306750          24    74   101   234  96.2%    15   -
>   72145425  72145457     33
> browser details AI306750          24    74   101   234  96.2%    15   +
>   83595466  83595498     33
> browser details AI306750          23   191   217   234  92.6%     8   +
>  145519532 145519558     27
> browser details AI306750          23    22    50   234  89.7%    17   +
>   77861946  77861974     29
> browser details AI306750          22    85   111   234  95.9%     7   +
>  102498033 102498076     44
> browser details AI306750          22   172   197   234  78.3%    17   +
>    8625289   8625311     23
> browser details AI306750          22    70    93   234  87.0%    10   +
>   91185511  91185533     23
> browser details AI306750          22   175   197   234 100.0%    10   +
>   74307390  74307414     25
> browser details AI306750          20    79    98   234 100.0%     2   -
>   91242437  91242456     20
> browser details AI306750          20    64    84   234 100.0%    17   -
>   35181460  35181481     22
> browser details AI306750          20   137   156   234 100.0%     1   +
>   92189967  92189986     20
> browser details AI306750          19   180   200   234  85.0%     4   -
>  178914046 178914065     20
> browser details AI306750          19   168   188   234  85.0%     4   -
>  148746891 148746910     20
> browser details AI306750          19   101   120   234 100.0%     4   +
>  190241376 190241396     21
>
>
>
> Xiaosong Wang
> Department of Pathology, University of Michigan Medical School
> 1150 W.Medical Center Dr. Rm3232, Me
> d Sci I, Ann Arbor, MI 48109
> Phone: 734-763-1224
>
> >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-13 15:10 >>>
>
> Is your original hg18.fa itself hard-masked (N),
> soft-masked(Lowercase)
> or unmasked?  If it was unmasked, then that would mean
> you need to get a masked version of the fasta file.
>
> You can get hg18 sequence here:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>
> either use hg18.2bit or chromFa.zip
>
> The browser confirms that the region you show
> on hg18, chr22:39310678-39310699 has been masked by repeatmasker
> and DNA link at the top menu bar shows masking is there
> when I do get with masking to lower case
>
> >hg18_dna range=chr22:39310678-39310699 repeatMasking=lower
> ctcaaaaaaaaaaaaaaaaaaa
>
> So we should expect repmatch count not to be 0
> as you said.
>
> Even when masking is working properly,
> BLAT will accept alignments as along as the
> exon does not START in a masked region.
>
> BLAT also trims poly-A tails, but not if even
> one non-A base appears near the end.
> e.g. AAAAAAAAAGA might still not be trimmed
> on the query side.
>
> fyi A search for AI306750 in hg18 genome browser,
> finds an EST here: chr6:34,613,559-34,613,730
>
> In general, you may find that stand-alone command-line
> BLAT is good for batch jobs, and that the various
> utility programs such as pslReps and pslCDnaFilter
> are great for filtering BLAT psls to match your
> requirements.  Our hgBlat web cgi is intended for
> use by humans, and shows everything, even low-quality
> alignments.  For batch processing, configure
> reasonable thresholds of quality for the alignments.
>
> -Galt
>
>
> On Tue, 13 Mar 2007, wang xiaosong wrote:
>
> > Dear BLAT experts
> >
> > I encountered a problem with repeative EST sequence when doing BLAT
> with
> > softmasked genome nib file. Take sequence AI306750 for example, the
> command
> > line is as following
> > ----------------------------------
> > faToNib -softMask hg18.fa hg18.nib
> > gfServer start server 2345  *.nib -stepSize=5 -mask
> > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> > ---------------------------------
> > I always get a result as follows, with no "rep match" bases
> declared.
> > ---------------------------------
> > >chr22
> >           Length = 49691432
> >
> >  Score = 41 bits (107), Expect = 1e-03
> >  Identities = 22/22 (100%)
> >  Strand = Plus / Plus
> >
> > Query: 191      ctcaaaaaaaaaaaaaaaaaaa 212
> >                 ||||||||||||||||||||||
> > Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
> > ----------------------------------------------
> > match	mis- 	rep. 	N's	Q gap	Q gap	T gap	T gap
> >      	match	match	   	count	bases	count	bases
> > 22	0	0	0	0	0	0	0
> > ----------------------------------------------
> >
> > It seems that the command line I used did not mask the repeat polyA
> here
> > and provide "rep match" bases. Therefore, my question is:
> > (1)how can I correctly soft-mask the genome sequence file and how can
> I get
> > "repeat match" bases that occur in my current results.
> > (2)does the faToNib -softmask program mask the .fa file itself or
> take the
> > softmasked .fa file generated from other repeatmasking program"?
> should I
> > do maskoutfa before using faToNib? Is there any direction for
> installation
> > of maskoutfa in linux?
> > (3)Does ucsc provide the softmasked genome sequences in fa or nib
> format?
> > (4)there was a -mask=type option in blat command, does it exists in
> > gfserver/gfclient? In my current machine, the memory only allow
> > gfserver/gfclient to run with the whole genome sequence.
> > Many thanks for the help!
> > >AI306750
> >
> TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
> >
> TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
> >
> TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
> > AAAAAAAAAAAAAAGTCGTATCGA
> >
> > _________________________________________________________________
> > 享用世界上最大的子件系﹘ MSN Hotmail。  http://www.hotmail.com
> >
>
>
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
>



More information about the Genome mailing list