[Genome] The problem in the results of BLAT linux v34

Galt Barber galt at soe.ucsc.edu
Tue Feb 27 10:26:09 PST 2007


All the parameters and score equations
are fully documented on the BLAT FAQ given earlier,
because we get this question all the time.
Please read it carefully.

Basically, for DNA, the web server uses soft-masked 2bit or nib
files with tileSize=11 stepSize=5.
For gfClient you would choose whatever parameters
are most inclusive.
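
To see what these settings mean for short matches, the rule
quoted further down in this thread
(minGuaranteedHit = 2*stepSize + tileSize - 1) can be worked
through in a few lines of Python. This is only arithmetic on that
rule; the function name is made up, and the third case assumes
stepSize equal to tileSize, which as far as I know is what the
stand-alone tools use when stepSize is not given.

```python
def min_guaranteed_hit(tile_size, step_size):
    """Shortest perfect match guaranteed to be found, per the quoted rule."""
    return 2 * step_size + tile_size - 1

# (tileSize, stepSize) pairs discussed in this thread, plus the
# assumed stand-alone default of stepSize == tileSize.
for tile_size, step_size in [(11, 5), (10, 5), (11, 11)]:
    print(tile_size, step_size, min_guaranteed_hit(tile_size, step_size))
# 11 5 20
# 10 5 19
# 11 11 32
```

A 22-base match sits above the 20-base guarantee for
tileSize=11 stepSize=5, but below the 32 bases needed when
stepSize is left equal to tileSize.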

Note also that the hgBlat web program is
basically returning everything,
even poor alignments of questionable value.
This is because people are expected to
look at it and use their own judgment.

When one is doing batch alignments, it is
better to choose reasonable parameters
to limit alignments to good quality ones,
and often to use pslReps and pslFilter
and other tools like that after the blat run.
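
As a rough illustration of that kind of post-filtering, here is
a small Python sketch that keeps, for each query, only the
alignments scoring close to that query's best. This is not the
pslReps or pslFilter algorithm, just a simple stand-in; the
function names and the min_score and near_best thresholds are
made up for the example. It assumes the standard 21-column
tab-separated PSL layout.

```python
from collections import defaultdict

def parse_psl_line(line):
    """Pull the fields needed for scoring out of one PSL line."""
    f = line.rstrip("\n").split("\t")
    return {"match": int(f[0]), "misMatch": int(f[1]), "repMatch": int(f[2]),
            "qNumInsert": int(f[4]), "tNumInsert": int(f[6]),
            "qName": f[9], "line": line.rstrip("\n")}

def dna_score(rec):
    """Web-style score for a nucleotide alignment."""
    return (rec["match"] + rec["repMatch"] // 2 - rec["misMatch"]
            - rec["qNumInsert"] - rec["tNumInsert"])

def keep_near_best(lines, min_score=20, near_best=0.95):
    """Keep alignments with score >= min_score and >= near_best * best per query."""
    by_query = defaultdict(list)
    for line in lines:
        fields = line.split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # skip PSL header lines and anything malformed
        rec = parse_psl_line(line)
        by_query[rec["qName"]].append(rec)
    kept = []
    for recs in by_query.values():
        best = max(dna_score(r) for r in recs)
        for r in recs:
            s = dna_score(r)
            if s >= min_score and s >= near_best * best:
                kept.append(r["line"])
    return kept
```

Usage would be something like
keep_near_best(open("test.psl")) on the blat output, with the
thresholds tuned to the data.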

BLAT has fairly good sensitivity but is designed
for organisms with around 90% identity (dna).
Using it to align fish and mammal for instance
would be inappropriate.  It is optimized for
aligning mRNA to DNA within the same species,
but it can be stretched to do dna or protein
and works pretty well within mammals.

Because its sensitivity is high but not perfect,
it can occasionally miss things.  The shorter the query,
the more difficult the alignment.  Still, people
have happily used it with primers as small as around 14 bp.

On the other hand, BLAT has problems with very long queries,
i.e. 40K is pretty much the limit. Of course
much longer genes can be done because none
of the exons individually are really long.
But you can break very long queries into parts
and align them and chain the results together
afterwards.  Blat is used for intra-species
full genome liftOver because it is very fast and it works
well for that.
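
Breaking a long query into parts can be sketched as below in
Python. The chunk size and overlap are arbitrary illustrative
values, not recommendations, and the function name and the
piece-naming scheme are made up. Each piece carries its offset in
the original query so that alignment coordinates can be shifted
back before chaining; the chaining step itself is not shown.

```python
def split_query(name, seq, chunk_size=20000, overlap=1000):
    """Split seq into overlapping pieces of at most chunk_size bases.

    Returns a list of (piece_name, offset, piece_seq) tuples, where
    offset is the 0-based start of the piece in the original query.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not seq:
        return []
    step = chunk_size - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + chunk_size, len(seq))
        pieces.append((f"{name}:{start}-{end}", start, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return pieces

def to_original_coords(offset, piece_start, piece_end):
    """Shift piece-relative query coordinates back to the full query."""
    return offset + piece_start, offset + piece_end
```

The overlap is there so that an exon falling on a piece boundary
is still wholly contained in at least one piece, as long as it is
shorter than the overlap.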

-Galt


On Wed, 28 Feb 2007, wang xiaosong wrote:

> Dear Galt,
>
> Thank you very much for the input. I've changed to the soft-masked version
> of the chromosome, which solved the problem for the two sequences I provided
> previously. But I still cannot replicate the results that I get from the
> UCSC web-based Blat with other sequences, even though I have tried different
> values of tileSize (10, 11) and stepSize (5, 6).
> For example, for the sequence AW948986, web-based Blat identified two major
> mappings (1-22 and 19-629), which correlate well with our experimental
> results. But I can't replicate this result with the stand-alone BLAT linux
> v34. I tried lowering tileSize to 10 and stepSize to 5, following
> the rule minGuaranteedHit=2*stepSize+tileSize-1. This picks up the
> 1-22 match, but the 19-629 match changes to 19-595. It seems that the
> web-based blat server has optimized the parameters tileSize, stepSize,
> minMatch and repMatch. The parameters I used are given below. Do you have
> any idea how I can replicate the web-based parameters?
> Following your suggestion, I'm using the equations below to calculate the
> web-based score and identity. I was wondering whether qNumInsert and
> tNumInsert mean "gap count" rather than "gap bases". If so, the
> Blat score can be replicated with the following equation, but the identity
> cannot. Is the web-based identity different from the following equations? (I
> speculate that tNumInsert is equal to the number of introns, which may not
> influence identity).
>
> >>equation
> identity = 100.0 * (match + repMatch) / (match + repMatch + misMatch +
> qNumInsert + tNumInsert);
> Blat Score = match + repMatch/2 - misMatch - qNumInsert - tNumInsert;
>
>
> >>Parameters
> gfServer start path-t1 7855 *.nib -tileSize=10 -stepSize=5
> gfClient path-t1 7855 /data/chromnibmasked /data/test.fa /data/test.out
> -t=dna -q=rna -minScore=0 -minIdentity=0
>
> >>BLAT Search Results
> SCORE  START  END  IDENTITY  CHROM:START-END
> 601    19     629  99.6%     chr21:38696395  -739417
> 22     1      22   100.0%    chr21:41801746-41801767
>
> >AW948986
> GTGAGGAGCGCGGCGCGGCAGGAAGCCTTATCAGTTGTGAGTGAGGACCAGTCGTTGTTTGAGTGTGCCT
> ACGGAACGCCACACCTGGCTAAGACAGAGATGACCGCGTCCTCCTCCAGCGACTATGGACAGACTTCCAA
> GATGAGCCCACGCGTCCCTCAGCAGGATTGGCTGTCTCAACCCCCAGCCAGGGTCACCATCAAAATGGAA
> TGTAACCCTAGCCAGGTGAATGGCTCAAGGAACTCTCCTGATGAATGCAGTGTGGCCAAAGGCGGGAAGA
> TGGTGGGCAGCCCAGACACCGTTGGGATGAACTACGGCAGCTACATGGAGGAGAAGCACATGCCACCCCC
> AAACATGACCACGAACGAGCGCAGAGTTATCGTGCCAGCAGATCCTACGCTATGGAGTACAGACCATGTG
> CGGCAGTGGCTGGAGTGGGCGGTGAAAGAATATGGCCTTCCAGACGTCAACATCTTGTTATTCCAGAACA
> TCGATGGGAAGGAACTGTGCAAGATGACCAATGACGACTTCCAGAGGCTCACCCCCAGCTACAACGCCGA
> CATCCTTCTCTCACATCTCCACTAACTCAGAGAGACTCCTCTTCCACATTTGACTCAGATGATGGTGAT
>
>
> >From: Galt Barber <galt at soe.ucsc.edu>
> >To: wang xiaosong <dr.wang at hotmail.com>
> >CC: genome at soe.ucsc.edu, xiaosong at med.umich.edu
> >Subject: Re: [Genome] The problem in the results of BLAT linux v34
> >Date: Mon, 26 Feb 2007 14:47:29 -0800 (PST)
> >
> >
> >Looks like you are using the hard-masked version of the chromosomes.
> >I recommend using the soft-masked versions.  There are many
> >repeats around the exons in question and that could affect
> >the alignments.
> >
> >For the question about the score that hgBlat generates,
> >please see the blat FAQ:
> >
> >http://hgwdev.cse.ucsc.edu/FAQ/FAQblat#blat4
> >
> >Also, note that if you are doing batch queries
> >it may be easier to just use stand-alone commandline
> >"blat" instead of gfServer/gfClient.
> >
> >If memory is tight you can do one chrom at a time
> >and then combine/filter psl results with pslReps
> >and other tools like that.
> >
> >-Galt
> >
> >
> >On Tue, 27 Feb 2007, wang xiaosong wrote:
> >
> > > Dear All,
> > >
> > > I'm Xiaosong Wang from Dr. Arul Chinnaiyan's lab at the University of
> > > Michigan. We encountered a problem in the output of BLAT linux version
> > > 34. The linux version of BLAT usually overlooks one exon at either end of
> > > the input sequence. For example, the chromosome-matched regions of the ERG
> > > and TMPRSS2 sequences are 0-1128 and 55-1725 as mapped by BLAT linux v34,
> > > while the matched regions change to 1-1514 and 1-1725 with the
> > > web-based BLAT (see attached file for BLAT results, and test.txt for the
> > > sequence). The linux version of BLAT lost the last exon of ERG (1128-1514)
> > > and the first exon of TMPRSS2 (0-55). The command line we use is as
> > > follows:
> > > -----------------------------------------------
> > > gfServer start path-t1 7855 *.nib -minMatch=1
> > > gfClient path-t1 7855 /data/chromnibmasked /data/test.fa /data/test.out
> > > -t=dna -q=rna -minScore=0 -minIdentity=0
> > > -----------------------------------------------
> > > In addition, we find that the score shown in the web-based blat results is
> > > not provided in the linux version results. Therefore, we wonder whether
> > > anyone knows the algorithm behind this score.
> > >
> > > Thank you very much indeed.
> > >
> > > Xiaosong
> > >
> > >
> > > Xiaosong Wang
> > > Department of Pathology, University of Michigan Medical School
> > > 1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
> > > Phone: 734-763-1224
> > >


