[Genome] A problem with the soft-masking!
Galt Barber
galt at soe.ucsc.edu
Wed Mar 14 01:30:21 PDT 2007
Our web blat (hgBlat) cgi talks to our
gfServer which is NOT started with -mask for dna.
You can, however, use -mask wth your own installation
gfServer and it should correctly report repmatches in the
output psl files.
pslReps is not for re-scoring psl files with
dna-repeat information from masked sequence
as your email implies.
pslReps simply helps you filter the psls for a
given alignment so that you keep the best
alignment(s). So, in that sense it is
removing "repeated alignments".
pslCDnaFilter is somewhat more complete
and has many useful options.
------
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
[...]
(options not shown here)
--------
pslCDnaFilter [options] inPsl outPsl
Filter cDNA alignments in psl format. Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.
WARNING: comparive filters requires that the input is sorted by
query name. The command: 'sort -k 10,10' will do the trick.
[...]
(many, many options not shown here)
-----
Good Luck!
-Galt
On Tue, 13 Mar 2007, Xiaosong Wang wrote:
> Dear Galt,
>
> Thank you so much for the detailed explanation. I'm using the
> chromFa.zip, and the file is soft masked. With your suggestion, I also
> find that the position chr22:39310678-39310699 is already softmasked in
> my file. I admit that web based blat have not been dedicated to get best
> results. However, despite that the standalone version of BLAT produce
> this false positive, the BLAT on genome browser seems has avoided this
> problem. Following is the BLAT results from the genome browser. We can
> see that "chr22:39310678-39310699" didn't appear in this result. The
> reason why I would like to pursue this kind of small matching is that
> I'm trying to find the fused sequences. So besides the best match, I'm
> trying to find out the small regions on each side of the best match that
> can be mapped to other genome locus. And in the output of standalone
> BLAT, there was hundreds of ESTs with PolyA tail mapped to other genome
> locus without penalty for the repmatch. This make up quite a bit false
> positive results for my analysis. Therefore, I would like to ask whether
> you have any idea about how to bring up the information for repmatch
> bases and whether pslReps will read the sequence again to get the
> repmatch bases?
> --command line---(hg18.fa is only representative symbol, I'm using the
> splited files for chromosomes)
> faToNib -softMask hg18.fa hg18.nib
> gfServer start server 2345 *.nib -stepSize=5 -mask
> gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
>
> Thank you very much!
>
> Best wishes
>
> Xiaosong
>
> ----------Web based BLAT Search Results--------------
> ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO
> STRAND START END SPAN
> ---------------------------------------------------------------------------------------------------
> browser details AI306750 172 21 192 234 100.0% 6 -
> 34613559 34613730 172
> browser details AI306750 45 53 200 234 96.0% 16 +
> 11329927 11330365 439
> browser details AI306750 43 61 218 234 95.7% 7 +
> 86690596 86690917 322
> browser details AI306750 26 31 65 234 71.5% 10 +
> 58592479 58592506 28
> browser details AI306750 25 65 92 234 96.3% 12 -
> 4499858 4499946 89
> browser details AI306750 25 30 77 234 61.6% 2 +
> 129287483 129287508 26
> browser details AI306750 24 74 101 234 96.2% 15 -
> 72145425 72145457 33
> browser details AI306750 24 74 101 234 96.2% 15 +
> 83595466 83595498 33
> browser details AI306750 23 191 217 234 92.6% 8 +
> 145519532 145519558 27
> browser details AI306750 23 22 50 234 89.7% 17 +
> 77861946 77861974 29
> browser details AI306750 22 85 111 234 95.9% 7 +
> 102498033 102498076 44
> browser details AI306750 22 172 197 234 78.3% 17 +
> 8625289 8625311 23
> browser details AI306750 22 70 93 234 87.0% 10 +
> 91185511 91185533 23
> browser details AI306750 22 175 197 234 100.0% 10 +
> 74307390 74307414 25
> browser details AI306750 20 79 98 234 100.0% 2 -
> 91242437 91242456 20
> browser details AI306750 20 64 84 234 100.0% 17 -
> 35181460 35181481 22
> browser details AI306750 20 137 156 234 100.0% 1 +
> 92189967 92189986 20
> browser details AI306750 19 180 200 234 85.0% 4 -
> 178914046 178914065 20
> browser details AI306750 19 168 188 234 85.0% 4 -
> 148746891 148746910 20
> browser details AI306750 19 101 120 234 100.0% 4 +
> 190241376 190241396 21
>
>
>
> Xiaosong Wang
> Department of Pathology, University of Michigan Medical School
> 1150 W.Medical Center Dr. Rm3232, Me
> d Sci I, Ann Arbor, MI 48109
> Phone: 734-763-1224
>
> >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-13 15:10 >>>
>
> Is your original hg18.fa itself hard-masked (N),
> soft-masked(Lowercase)
> or unmasked? If it was unmasked, then that would mean
> you need to get a masked version of the fasta file.
>
> You can get hg18 sequence here:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>
> either use hg18.2bit or chromFa.zip
>
> The browser confirms that the region you show
> on hg18, chr22:39310678-39310699 has been masked by repeatmasker
> and DNA link at the top menu bar shows masking is there
> when I do get with masking to lower case
>
> >hg18_dna range=chr22:39310678-39310699 repeatMasking=lower
> ctcaaaaaaaaaaaaaaaaaaa
>
> So we should expect repmatch count not to be 0
> as you said.
>
> Even when masking is working properly,
> BLAT will accept alignments as along as the
> exon does not START in a masked region.
>
> BLAT also trims poly-A tails, but not if even
> one non-A base appears near the end.
> e.g. AAAAAAAAAGA might still not be trimmed
> on the query side.
>
> fyi A search for AI306750 in hg18 genome browser,
> finds an EST here: chr6:34,613,559-34,613,730
>
> In general, you may find that stand-alone command-line
> BLAT is good for batch jobs, and that the various
> utility programs such as pslReps and pslCDnaFilter
> are great for filtering BLAT psls to match your
> requirements. Our hgBlat web cgi is intended for
> use by humans, and shows everything, even low-quality
> alignments. For batch processing, configure
> reasonable thresholds of quality for the alignments.
>
> -Galt
>
>
> On Tue, 13 Mar 2007, wang xiaosong wrote:
>
> > Dear BLAT experts
> >
> > I encountered a problem with repeative EST sequence when doing BLAT
> with
> > softmasked genome nib file. Take sequence AI306750 for example, the
> command
> > line is as following
> > ----------------------------------
> > faToNib -softMask hg18.fa hg18.nib
> > gfServer start server 2345 *.nib -stepSize=5 -mask
> > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> > ---------------------------------
> > I always get a result as follows, with no "rep match" bases
> declared.
> > ---------------------------------
> > >chr22
> > Length = 49691432
> >
> > Score = 41 bits (107), Expect = 1e-03
> > Identities = 22/22 (100%)
> > Strand = Plus / Plus
> >
> > Query: 191 ctcaaaaaaaaaaaaaaaaaaa 212
> > ||||||||||||||||||||||
> > Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
> > ----------------------------------------------
> > match mis- rep. N's Q gap Q gap T gap T gap
> > match match count bases count bases
> > 22 0 0 0 0 0 0 0
> > ----------------------------------------------
> >
> > It seems that the command line I used did not mask the repeat polyA
> here
> > and provide "rep match" bases. Therefore, my question is:
> > (1)how can I correctly soft-mask the genome sequence file and how can
> I get
> > "repeat match" bases that occur in my current results.
> > (2)does the faToNib -softmask program mask the .fa file itself or
> take the
> > softmasked .fa file generated from other repeatmasking program"?
> should I
> > do maskoutfa before using faToNib? Is there any direction for
> installation
> > of maskoutfa in linux?
> > (3)Does ucsc provide the softmasked genome sequences in fa or nib
> format?
> > (4)there was a -mask=type option in blat command, does it exists in
> > gfserver/gfclient? In my current machine, the memory only allow
> > gfserver/gfclient to run with the whole genome sequence.
> > Many thanks for the help!
> > >AI306750
> >
> TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
> >
> TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
> >
> TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
> > AAAAAAAAAAAAAAGTCGTATCGA
> >
> > _________________________________________________________________
> > 享用世界上最大的子件系﹘ MSN Hotmail。 http://www.hotmail.com
> >
>
>
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
>
More information about the Genome
mailing list