[Genome] The problem in the results of BLAT linux v34

Galt Barber galt at soe.ucsc.edu
Tue Feb 27 12:31:39 PST 2007


FAQ PSL

http://genome.ucsc.edu/FAQ/FAQformat#format2

columns 5-8
# qNumInsert - Number of inserts in query
# qBaseInsert - Number of bases inserted in query
# tNumInsert - Number of inserts in target
# tBaseInsert - Number of bases inserted in target

NumInsert = gap count
BaseInsert = gap bases
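
The gap-count vs. gap-bases distinction matters when applying the score equation quoted further down in this thread. Below is a minimal Python sketch of that equation for a DNA query. The function names and the sample line are made up for illustration and are not part of BLAT, and the integer halving of repMatch is an assumption.

```python
# PSL columns 1-8, in file order (columns 5-8 are the ones listed above).
PSL_COUNT_FIELDS = [
    "match", "mis_match", "rep_match", "n_count",
    "q_num_insert", "q_base_insert", "t_num_insert", "t_base_insert",
]


def parse_psl_counts(line):
    """Return the first eight tab-separated PSL columns as a dict of ints."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(PSL_COUNT_FIELDS, (int(f) for f in fields[:8])))


def blat_score(counts):
    """Score per the equation quoted in this thread (DNA query assumed).

    Uses the gap COUNTS (qNumInsert, tNumInsert), not the gap BASES
    (qBaseInsert, tBaseInsert). Integer halving of repMatch is an assumption.
    """
    return (counts["match"] + counts["rep_match"] // 2
            - counts["mis_match"]
            - counts["q_num_insert"] - counts["t_num_insert"])


# Hypothetical PSL count columns, not taken from a real alignment.
example = "600\t2\t4\t0\t1\t7\t3\t5000"
score = blat_score(parse_psl_counts(example))
```

Note that a long intron adds only 1 to tNumInsert, however many bases it spans in tBaseInsert, so it costs a single point in this score.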

Good luck!

-Galt


On Wed, 28 Feb 2007, wang xiaosong wrote:

>
> Dear Galt,
>
> Thank you so much for the detailed explanation! The parameters
> "tileSize=11 stepSize=5" replicate the web-based results very well!
> In our experience, these parameters correlate very well with
> experimental results. Would you please take a look at the following problem
> that I'm encountering? Many thanks again for your help.
>
> The following are the equations that I found in your FAQ for calculating
> identity and score. I was wondering whether qNumInsert and tNumInsert
> mean the query and target "gap count" rather than "gap bases". If this is
> true, the following equation replicates the web-based score, but the
> web-based identity cannot be replicated. Is the following equation for the
> identity correct?
>
> >>equation
> identity = 100.0 * (match + repMatch) / (match + repMatch + misMatch +
> qNumInsert + tNumInsert);
> Blat Score = match + repMatch/2 - misMatch - qNumInsert - tNumInsert;
>
>
> >From: Galt Barber <galt at soe.ucsc.edu>
> >To: wang xiaosong <dr.wang at hotmail.com>
> >CC: galt at soe.ucsc.edu, genome at soe.ucsc.edu, xiaosong at med.umich.edu
> >Subject: Re: [Genome] The problem in the results of BLAT linux v34
> >Date: Tue, 27 Feb 2007 10:26:09 -0800 (PST)
> >
> >
> >All the parameters and score equations
> >are fully documented on the BLAT FAQ given earlier,
> >because we get this question all the time.
> >Please read it carefully.
> >
> >Basically, for dna, we use softmasked 2bit or nib
> >with tileSize=11 stepSize=5.
> >For gfClient you would choose whatever parameters
> >are most inclusive.
> >
> >Note also that the hgBlat web program is
> >basically returning everything,
> >even poor alignments of questionable value.
> >This is because people are expected to
> >look at it and use their own judgment.
> >
> >When one is doing batch alignments, it is
> >better to choose reasonable parameters
> >to limit alignments to good quality ones,
> >and often to use pslReps and pslFilter
> >and other tools like that after the blat run.
> >
> >BLAT has fairly good sensitivity but is designed
> >for organisms with around 90% identity (dna).
> >Using it to align fish and mammal for instance
> >would be inappropriate.  It is optimized for
> >aligning mRNA to DNA within the same species,
> >but it can be stretched to do dna or protein
> >and works pretty well within mammals.
> >
> >Because it has high but not perfect sensitivity,
> >it can occasionally miss things.  The shorter the query,
> >the more difficult the alignment.  Still, people
> >have happily used it with primers as small as around 14bp.
> >
> >On the other hand, BLAT has problems with very long queries,
> >i.e. 40K is pretty much the limit. Of course
> >much longer genes can be done because none
> >of the exons individually are really long.
> >But you can break very long queries into parts
> >and align them and chain the results together
> >afterwards.  Blat is used for intra-species
> >full genome liftOver because it is very fast and it works
> >well for that.
> >
> >-Galt
> >
> >
> >On Wed, 28 Feb 2007, wang xiaosong wrote:
> >
> > > Dear Galt,
> > >
> > > Thank you very much for the input. I've changed to the soft-masked
> > > version of the chromosomes, which solved the problem for the two
> > > sequences I provided previously. But I still cannot replicate the
> > > results of the UCSC web-based BLAT with other sequences, even though I
> > > have tried different values of tileSize (10, 11) and stepSize (5, 6).
> > > For example, for the sequence AW948986, the web-based BLAT identified
> > > two major mappings (1-22 and 19-629), which correlate well with our
> > > experimental results. But I can't replicate this result with the
> > > stand-alone BLAT linux v34. I tried lowering tileSize to 10 and
> > > stepSize to 5, following the rule
> > > minGuaranteedHit=2*stepSize+tileSize-1. This picks up the 1-22 match,
> > > but the 19-629 match changes to 19-595. It seems that the web-based
> > > BLAT server has optimized the parameters tileSize, stepSize, minMatch
> > > and repMatch. The parameters I used are given below. Do you have any
> > > idea how I can replicate the web-based parameters?
> > > Following your suggestion, I'm using the equations below to calculate
> > > the web-based score and identity. I was wondering whether qNumInsert
> > > and tNumInsert mean "gap count" rather than "gap bases". If this is
> > > true, the Blat Score can be replicated with the following equation,
> > > but the identity cannot. Is the web-based identity different from the
> > > following equation? (I speculate that tNumInsert is equal to the
> > > number of introns, which may not influence identity.)
> > >
> > > >>equation
> > > identity = 100.0 * (match + repMatch) / (match + repMatch + misMatch +
> > > qNumInsert + tNumInsert);
> > > Blat Score = match + repMatch/2 - misMatch - qNumInsert - tNumInsert;
> > >
> > >
> > > >>Parameters
> > > gfServer start path-t1 7855 *.nib -tileSize=10 -stepSize=5
> > > gfClient path-t1 7855 /data/chromnibmasked /data/test.fa /data/test.out
> > > -t=dna -q=rna -minScore=0 -minIdentity=0
> > >
> > > >>BLAT Search Results
> > > SCORE  START  END  IDENTITY  CHRO:START-END
> > > 601    19     629  99.6%     chr21:38696395-739417
> > > 22     1      22   100.0%    chr21:41801746-41801767
> > >
> > > >AW948986
> > > GTGAGGAGCGCGGCGCGGCAGGAAGCCTTATCAGTTGTGAGTGAGGACCAGTCGTTGTTTGAGTGTGCCT
> > > ACGGAACGCCACACCTGGCTAAGACAGAGATGACCGCGTCCTCCTCCAGCGACTATGGACAGACTTCCAA
> > > GATGAGCCCACGCGTCCCTCAGCAGGATTGGCTGTCTCAACCCCCAGCCAGGGTCACCATCAAAATGGAA
> > > TGTAACCCTAGCCAGGTGAATGGCTCAAGGAACTCTCCTGATGAATGCAGTGTGGCCAAAGGCGGGAAGA
> > > TGGTGGGCAGCCCAGACACCGTTGGGATGAACTACGGCAGCTACATGGAGGAGAAGCACATGCCACCCCC
> > > AAACATGACCACGAACGAGCGCAGAGTTATCGTGCCAGCAGATCCTACGCTATGGAGTACAGACCATGTG
> > > CGGCAGTGGCTGGAGTGGGCGGTGAAAGAATATGGCCTTCCAGACGTCAACATCTTGTTATTCCAGAACA
> > > TCGATGGGAAGGAACTGTGCAAGATGACCAATGACGACTTCCAGAGGCTCACCCCCAGCTACAACGCCGA
> > > CATCCTTCTCTCACATCTCCACTAACTCAGAGAGACTCCTCTTCCACATTTGACTCAGATGATGGTGAT
> > >
> > >
> > > >From: Galt Barber <galt at soe.ucsc.edu>
> > > >To: wang xiaosong <dr.wang at hotmail.com>
> > > >CC: genome at soe.ucsc.edu, xiaosong at med.umich.edu
> > > >Subject: Re: [Genome] The problem in the results of BLAT linux v34
> > > >Date: Mon, 26 Feb 2007 14:47:29 -0800 (PST)
> > > >
> > > >
> > > >Looks like you are using the hard-masked version of the chromosomes.
> > > >I recommend using the soft-masked versions.  There are many
> > > >repeats around the exons in question and that could affect
> > > >the alignments.
> > > >
> > > >For the question about the score that hgBlat generates,
> > > >please see the blat FAQ:
> > > >
> > > >http://hgwdev.cse.ucsc.edu/FAQ/FAQblat#blat4
> > > >
> > > >Also, note that if you are doing batch queries
> > > >it may be easier to just use stand-alone commandline
> > > >"blat" instead of gfServer/gfClient.
> > > >
> > > >If memory is tight you can do one chrom at a time
> > > >and then combine/filter psl results with pslReps
> > > >and other tools like that.
> > > >
> > > >-Galt
> > > >
> > > >
> > > >On Tue, 27 Feb 2007, wang xiaosong wrote:
> > > >
> > > > > Dear All,
> > > > >
> > > > > I'm Xiaosong Wang from Dr. Arul Chinnaiyan's lab at the University
> > > > > of Michigan. We encountered a problem in the output of BLAT linux
> > > > > version 34. The linux version of BLAT usually overlooks one exon at
> > > > > either end of the input sequence. For example, the matched regions
> > > > > of the ERG and TMPRSS2 sequences are 0-1128 and 55-1725 as mapped
> > > > > by BLAT linux v34, while the web-based BLAT reports 1-1514 and
> > > > > 1-1725 (see the attached file for the BLAT results, and test.txt
> > > > > for the sequences). The linux version of BLAT lost the last exon of
> > > > > ERG (1128-1514) and the first exon of TMPRSS2 (0-55). The command
> > > > > lines we use are as follows:
> > > > > -----------------------------------------------
> > > > > gfServer start path-t1 7855 *.nib -minMatch=1
> > > > > gfClient path-t1 7855 /data/chromnibmasked /data/test.fa /data/test.out
> > > > > -t=dna -q=rna -minScore=0 -minIdentity=0
> > > > > -----------------------------------------------
> > > > > In addition, we find that the score shown in the web-based BLAT
> > > > > results is not provided in the linux version results. Therefore, we
> > > > > wonder whether anyone knows the algorithm behind this score.
> > > > >
> > > > > Thank you very much indeed.
> > > > >
> > > > > Xiaosong
> > > > >
> > > > >
> > > > > Xiaosong Wang
> > > > > Department of Pathology, University of Michigan Medical School
> > > > > 1150 W.Medical Center Dr. Rm3232, Med Sci I, Ann Arbor, MI 48109
> > > > > Phone: 734-763-1224
> > > > >


