[Genome] BLAT short sequence queries ( 12mer to 30mer )

Galt Barber galt at soe.ucsc.edu
Wed Oct 11 18:00:35 PDT 2006


I have managed to get 28 hits of length 11 or 12 on your 12 bp example
query using this command:

blat dm2.2bit query.fa output.psl -tileSize=9 -stepSize=2 -minScore=0 \
    -minIdentity=100 -noTrimA
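
For reference, the tileSize/stepSize pair above follows the rule quoted
further down in this thread (min guaranteed hit = 2s + t - 1).  Below is a
rough Python sketch that lists the pairs reaching a given query length.  The
function name and the tile-size bounds are only illustrative assumptions,
not anything BLAT itself provides.

```python
def tile_step_pairs(query_len, min_tile=5, max_tile=12):
    """List (tileSize, stepSize) pairs with 2*s + t - 1 == query_len and t >= s.

    min_tile and max_tile are illustrative bounds, not limits taken from BLAT.
    """
    pairs = []
    for t in range(max_tile, min_tile - 1, -1):
        remainder = query_len + 1 - t      # this must equal 2*s
        if remainder <= 0 or remainder % 2:
            continue                       # no whole-number stepSize for this tile
        s = remainder // 2
        if t >= s:                         # the thread notes that t < s does not work
            pairs.append((t, s))
    return pairs


print(tile_step_pairs(12))   # [(11, 1), (9, 2), (7, 3), (5, 4)]
```

The (9, 2), (7, 3) and (5, 4) entries are the ones listed in the quoted
message; (11, 1) also satisfies the formula but would need far more memory.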

screech:dm2>cat query.fa
>my_query
tgtggtgaggaa

screech:dm2>wc -l *.psl
28 output.psl
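
Note that wc -l counts every line in the file, so if the PSL header was not
suppressed (for example with -noHead) the header lines are included in that
number.  To tally the alignment rows by length, something like the following
Python sketch could be used.  It assumes the standard 21-column tab-separated
PSL layout, where the first column is the number of matching bases; the
function name and the sample rows are made up for illustration.

```python
from collections import Counter


def tally_psl_matches(lines):
    """Count PSL alignment rows by their 'matches' column, skipping header lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Alignment rows have 21 columns and start with an integer;
        # header and blank lines fail this check and are ignored.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        counts[int(fields[0])] += 1
    return counts


def make_row(matches, q_start, t_start):
    """Build a synthetic PSL row (illustration only, not real BLAT output)."""
    cols = [matches, 0, 0, 0, 0, 0, 0, 0, "+", "my_query", 12,
            q_start, 12, "chrToy", 1000, t_start, t_start + matches,
            1, f"{matches},", f"{q_start},", f"{t_start},"]
    return "\t".join(str(c) for c in cols) + "\n"


sample = ["psLayout version 3\n", "\n", make_row(12, 0, 100), make_row(11, 1, 200)]
for length, n in sorted(tally_psl_matches(sample).items(), reverse=True):
    print(f"{n} hit(s) with {length} matching bases")
```

To run it on real output, pass an open file handle for output.psl instead of
the sample list.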

I found that -fastMap for some reason does not work with
very short sequences, so please DO NOT USE -fastMap.

This works on dm2.  For human, it might take a lot of RAM or CPU
time to do even one chromosome at a time with the reduced
tileSize and stepSize.  For dm2, with a genome size of just
132 Mbase, this works quite well even with just 1 GB of RAM.

I can also get this to give identical results
for the example query if I use -tileSize=7 -stepSize=3,
as long as I also require -minScore=12 so that they
are all full length.  This shows that it can find
all full-length exact-match hits.
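
As an independent cross-check of the full-length count, one could scan the
FASTA directly.  The quoted message below points out that grep misses
occurrences that straddle a line break, so this rough Python sketch joins the
sequence lines first and also looks for the reverse complement.  It is only
an illustration: the function names are made up, it handles exact matches
only, and the toy input is not real dm2 sequence.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (name, sequence) with line breaks removed so no match is split."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_exact(seq, query):
    """Count (possibly overlapping) occurrences of query in seq, ignoring case."""
    seq, query = seq.lower(), query.lower()
    count, pos = 0, seq.find(query)
    while pos != -1:
        count += 1
        pos = seq.find(query, pos + 1)
    return count


def count_both_strands(lines, query):
    rc = reverse_complement(query)
    total = 0
    for _name, seq in read_fasta(lines):
        total += count_exact(seq, query)
        if rc.lower() != query.lower():    # do not count a palindrome twice
            total += count_exact(seq, rc)
    return total


# Toy input: the forward match and the reverse-complement match are each
# split across two lines, which a plain grep would miss.
toy = [">chrToy\n", "aaaatgtggt\n", "gaggaacccc\n", "ttcctcacca\n", "cagg\n"]
print(count_both_strands(toy, "tgtggtgaggaa"))   # prints 2
```

The count from such a scan should be compared only against the full-length
(12-match) rows in the PSL, since BLAT also reported some hits of length 11.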

-Galt


On Wed, 11 Oct 2006, Ann Zweig wrote:

> Hello John-
>
> 	I'm sorry that BLAT still isn't finding your short sequences; it really isn't
> optimized to find such short matches.  You might try a few other tweaks and see
> if it makes any difference:
>
> -minScore=12
> -minIdentity=100
> -noTrimA
> -fastMap
>
> 	Yes, highly-repetitive tiles may still be dynamically masked out.
>
> 	One of our developers has downloaded the dm2 fasta files and made a 2bit file
> without repeat-masking.  He is testing blat to see if your example sequence can
> be found.  As an aside, grep finds 7 exact hits just on the fasta files, so
> there are at least 7 (grep misses any occurrence that happens to be split
> across two lines, since fasta sequence is wrapped into fixed-length lines).
> He has tried several blat settings to return hits of size 12.  So far no
> luck.
>
> 	Please give those parameters a try and let us know how it works for you.
>
> Regards,
>
> ----------
> Ann Zweig
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu
>
>
>
>
> John Major wrote:
> > Hi Ann-
> >
> > Thanks for the suggestions.  I tried those settings, and still am
> > returning no hits for my shorter sequences.
> > One example sequence for D.melanogaster (which does have several exact
> > alignments),  is the sequence 'tgtggtgaggaa'.  It does not return any
> > Blat hits when I use the suggested settings below.
> > Could this be because Blat is screening out repetitive looking sequences?
> >
> > John
> >
> >
> > Ann Zweig wrote:
> >
> >> Hello John,
> >>
> >>     I have consulted one of our Blat experts and he has the following
> >> to say regarding your question:
> >>
> >> s=stepsize, t=tilesize
> >>
> >> min guaranteed hit = 2s + t - 1 = 12
> >>
> >> To reach 12 bp, he needs
> >> 2s + t = 12+1 = 13
> >> if t=9, then s=2
> >> if t=7, then s=3
> >> if t=5, then s=4
> >> stop here because values of t < s do not work.
> >>
> >> This all looks like about 2-3x ram usage and 2x more runtime.  The ram
> >> usage-factor is 1/s.  You want this small for less ram usage.  So the
> >> bigger s is, the better for ram usage.
> >>
> >> But there is a sensitivity factor too.  By going to stepsize 5 instead
> >> of 11, the ram usage more than doubles, but then he can be guaranteed
> >> to pick up 2s+t-1 = 20.  One could also increase sensitivity by
> >> tweaking the query with variations and re-searching the hit-lists.
> >>
> >> And there is a performance factor too.  The index span is related to
> >> tilesize: the bigger the tile, the more patterns the genome is
> >> fragmented into, so you get more hit-lists of shorter lengths.  This
> >> helps performance, because the tiles in your query then hit a smaller
> >> number of shorter hit-lists as tilesize increases.
> >>
> >>     Keep in mind that there may be a point where you need to search
> >> chromosome-by-chromosome if you make your stepsize and tilesize too
> >> small.
> >>
> >>     This previously-answered mail list question may also be of some
> >> help to you:
> >> http://www.cse.ucsc.edu/pipermail/genome/2004-September/005612.html
> >>
> >>     I hope this is helpful to you.  Feel free to write back to the
> >> list if these suggestions don't work for you.
> >>
> >> Regards,
> >>
> >> ----------
> >> Ann Zweig
> >> UCSC Genome Bioinformatics Group
> >> http://genome.ucsc.edu
> >>
> >>
> >>
> >> John Major wrote:
> >>
> >>> Hello-
> >>>
> >>> I have a large number of 12-30bp sequences for which I'd like to find
> >>> exact (or 1 bp mismatch) alignments to the Drosophila genome.  I have
> >>> a local install of BLAT and have played with the parameters, but
> >>> can't seem to return any hits for the shorter sequences (in the
> >>> 12-15bp range mostly).
> >>> I have run the software without the 11.ooc file, and with tileSize=8,
> >>> minMatch=2, stepSize=5, but with no luck.
> >>> I see in a post from 2002 that BLAT can't handle sequence searches
> >>> below 21bp in length.  Is this still true?
> >>> Are there recommended settings to run BLAT on short sequences?
> >>>
> >>> Thanks,
> >>> John
> >>> _______________________________________________
> >>> Genome maillist  -  Genome at soe.ucsc.edu
> >>> http://www.soe.ucsc.edu/mailman/listinfo/genome
> >>

