[Genome] A problem with the soft-masking!
Xiaosong Wang
xiaosong at med.umich.edu
Wed Mar 14 18:35:12 PDT 2007
Dear Galt,
I really appreciate your sincere help. Yes, I have tried gfServer
command with or without "-mask", but I get the same result without
repmatch report. I simply can not figure out why standalone BLAT didn't
report this repmatch bases!
Again, thank you very much for the detailed explanation to psl
softwares.
Best wishes
Xiaosong
Xiaosong Wang
Department of Pathology, University of Michigan Medical School
1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
Phone: 734-763-1224
>>> Galt Barber <galt at soe.ucsc.edu> 2007-3-14 4:30 >>>
Our web blat (hgBlat) cgi talks to our
gfServer which is NOT started with -mask for dna.
You can, however, use -mask wth your own installation
gfServer and it should correctly report repmatches in the
output psl files.
pslReps is not for re-scoring psl files with
dna-repeat information from masked sequence
as your email implies.
pslReps simply helps you filter the psls for a
given alignment so that you keep the best
alignment(s). So, in that sense it is
removing "repeated alignments".
pslCDnaFilter is somewhat more complete
and has many useful options.
------
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
[...]
(options not shown here)
--------
pslCDnaFilter [options] inPsl outPsl
Filter cDNA alignments in psl format. Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.
WARNING: comparive filters requires that the input is sorted by
query name. The command: 'sort -k 10,10' will do the trick.
[...]
(many, many options not shown here)
-----
Good Luck!
-Galt
On Tue, 13 Mar 2007, Xiaosong Wang wrote:
> Dear Galt,
>
> Thank you so much for the detailed explanation. I'm using the
> chromFa.zip, and the file is soft masked. With your suggestion, I
also
> find that the position chr22:39310678-39310699 is already softmasked
in
> my file. I admit that web based blat have not been dedicated to get
best
> results. However, despite that the standalone version of BLAT
produce
> this false positive, the BLAT on genome browser seems has avoided
this
> problem. Following is the BLAT results from the genome browser. We
can
> see that "chr22:39310678-39310699" didn't appear in this result. The
> reason why I would like to pursue this kind of small matching is
that
> I'm trying to find the fused sequences. So besides the best match,
I'm
> trying to find out the small regions on each side of the best match
that
> can be mapped to other genome locus. And in the output of standalone
> BLAT, there was hundreds of ESTs with PolyA tail mapped to other
genome
> locus without penalty for the repmatch. This make up quite a bit
false
> positive results for my analysis. Therefore, I would like to ask
whether
> you have any idea about how to bring up the information for repmatch
> bases and whether pslReps will read the sequence again to get the
> repmatch bases?
> --command line---(hg18.fa is only representative symbol, I'm using
the
> splited files for chromosomes)
> faToNib -softMask hg18.fa hg18.nib
> gfServer start server 2345 *.nib -stepSize=5 -mask
> gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
>
> Thank you very much!
>
> Best wishes
>
> Xiaosong
>
> ----------Web based BLAT Search Results--------------
> ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO
> STRAND START END SPAN
>
---------------------------------------------------------------------------------------------------
> browser details AI306750 172 21
192 234 100.0% 6
-
> 34613559 34613730 172
> browser details AI306750 45 53 200 234 96.0% 16
+
> 11329927 11330365 439
> browser details AI306750 43 61 218 234 95.7% 7
+
> 86690596 86690917 322
> browser details AI306750 26 31 65 234 71.5% 10
+
> 58592479 58592506 28
> browser details AI306750 25 65 92 234 96.3% 12
-
> 4499858 4499946 89
> browser details AI306750 25 30 77 234 61.6% 2
+
> 129287483 129287508 26
> browser details AI306750 24 74 101 234 96.2% 15
-
> 72145425 72145457 33
> browser details AI306750 24 74 101 234 96.2% 15
+
> 83595466 83595498 33
> browser details AI306750 23 191 217 234 92.6% 8
+
> 145519532 145519558 27
> browser details AI306750 23 22 50 234 89.7% 17
+
> 77861946 77861974 29
> browser details AI306750 22 85 111 234 95.9% 7
+
> 102498033 102498076 44
> browser details AI306750 22 172 197 234 78.3% 17
+
> 8625289 8625311 23
> browser details AI306750 22 70 93 234 87.0% 10
+
> 91185511 91185533 23
> browser details AI306750 22 175 197 234 100.0% 10
+
> 74307390 74307414 25
> browser details AI306750 20 79 98 234 100.0% 2
-
> 91242437 91242456 20
> browser details AI306750 20 64 84 234 100.0% 17
-
> 35181460 35181481 22
> browser details AI306750 20 137 156 234 100.0% 1
+
> 92189967 92189986 20
> browser details AI306750 19 180 200 234 85.0% 4
-
> 178914046 178914065 20
> browser details AI306750 19 168 188 234 85.0% 4
-
> 148746891 148746910 20
> browser details AI306750 19 101 120 234 100.0% 4
+
> 190241376 190241396 21
>
>
>
> Xiaosong Wang
> Department of Pathology, University of Michigan Medical School
> 1150 W.Medical Center Dr. Rm3232, Me
> d Sci I, Ann Arbor, MI 48109
> Phone: 734-763-1224
>
> >>> Galt Barber <galt at soe.ucsc.edu> 2007-3-13 15:10 >>>
>
> Is your original hg18.fa itself hard-masked (N),
> soft-masked(Lowercase)
> or unmasked? If it was unmasked, then that would mean
> you need to get a masked version of the fasta file.
>
> You can get hg18 sequence here:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
>
> either use hg18.2bit or chromFa.zip
>
> The browser confirms that the region you show
> on hg18, chr22:39310678-39310699 has been masked by repeatmasker
> and DNA link at the top menu bar shows masking is there
> when I do get with masking to lower case
>
> >hg18_dna range=chr22:39310678-39310699 repeatMasking=lower
> ctcaaaaaaaaaaaaaaaaaaa
>
> So we should expect repmatch count not to be 0
> as you said.
>
> Even when masking is working properly,
> BLAT will accept alignments as along as the
> exon does not START in a masked region.
>
> BLAT also trims poly-A tails, but not if even
> one non-A base appears near the end.
> e.g. AAAAAAAAAGA might still not be trimmed
> on the query side.
>
> fyi A search for AI306750 in hg18 genome browser,
> finds an EST here: chr6:34,613,559-34,613,730
>
> In general, you may find that stand-alone command-line
> BLAT is good for batch jobs, and that the various
> utility programs such as pslReps and pslCDnaFilter
> are great for filtering BLAT psls to match your
> requirements. Our hgBlat web cgi is intended for
> use by humans, and shows everything, even low-quality
> alignments. For batch processing, configure
> reasonable thresholds of quality for the alignments.
>
> -Galt
>
>
> On Tue, 13 Mar 2007, wang xiaosong wrote:
>
> > Dear BLAT experts
> >
> > I encountered a problem with repeative EST sequence when doing
BLAT
> with
> > softmasked genome ni
b file. Take sequence AI306750 for example,
the
> command
> > line is as following
> > ----------------------------------
> > faToNib -softMask hg18.fa hg18.nib
> > gfServer start server 2345 *.nib -stepSize=5 -mask
> > gfClient server 2345 /hg18/chromnibsoftmasked /data/AI306750.fa
> > /data/AI306750.out -t=dna -q=rna -minScore=0 -minIdentity=0
> > ---------------------------------
> > I always get a result as follows, with no "rep match" bases
> declared.
> > ---------------------------------
> > >chr22
> > Length = 49691432
> >
> > Score = 41 bits (107), Expect = 1e-03
> > Identities = 22/22 (100%)
> > Strand = Plus / Plus
> >
> > Query: 191 ctcaaaaaaaaaaaaaaaaaaa 212
> > ||||||||||||||||||||||
> > Sbjct: 39310678 ctcaaaaaaaaaaaaaaaaaaa 39310699
> > ----------------------------------------------
> > match mis- rep. N's Q gap Q gap T gap T gap
> > match match count bases count bases
> > 22 0 0 0 0 0 0 0
> > ----------------------------------------------
> >
> > It seems that the command line I used did not mask the repeat
polyA
> here
> > and provide "rep match" bases. Therefore, my question is:
> > (1)how can I correctly soft-mask the genome sequence file and how
can
> I get
> > "repeat match" bases that occur in my current results.
> > (2)does the faToNib -softmask program mask the .fa file itself or
> take the
> > softmasked .fa file generated from other repeatmasking program"?
> should I
> > do maskoutfa before using faToNib? Is there any direction for
> installation
> > of maskoutfa in linux?
> > (3)Does ucsc provide the softmasked genome sequences in fa or nib
> format?
> > (4)there was a -mask=type option in blat command, does it exists
in
> > gfserver/gfclient? In my current machine, the memory only allow
> > gfserver/gfclient to run with the whole genome sequence.
> > Many thanks for the help!
> > >AI306750
> >
>
TATACTGCTGCGAGAAGACGACAGAAGGGCAGTGACTCGACAAAGGCCACAGGCAGTCCAGGCCTCTCTC
> >
>
TGCTCCATCCCCCTGCCTCCCATTCTGCACCACACCTGGCATGGTGCAGGGAGACATCTGCACCCCTGAG
> >
>
TTGGGCAGCCAGGAGTGCCCCCGGGAATGGATAATAAAGATACTAGAGAACTCAAAAAAAAAAAAAAAAA
> > AAAAAAAAAAAAAAGTCGTATCGA
> >
> > _________________________________________________________________
> > 享用世界上最大的*子*件系** MSN Hotmail。 http://www.hotmail.com
> >
>
>
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should
not be used for urgent or sensitive issues.
>
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
More information about the Genome
mailing list