[Genome] BLAT short sequence queries ( 12mer to 30mer )
Ann Zweig
ann at soe.ucsc.edu
Wed Oct 11 13:57:55 PDT 2006
Hello John-
I'm sorry that BLAT still isn't finding your short sequences; it really isn't
optimized to find such short matches. You might try a few other tweaks to see
whether they make any difference:
-minScore=12
-minIdentity=100
-noTrimA
-fastMap
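
For batch runs, the four options above can be assembled into one blat
invocation. This is only a sketch: the file names dm2.2bit, queries.fa and
out.psl are placeholders, and the helper name blat_command is made up for
illustration. It builds the argument list and prints it; nothing is executed.

```python
def blat_command(database, query, output, min_score=12, min_identity=100):
    """Return a blat argument list using the short-query options listed above.

    The file arguments are hypothetical placeholders; only the four options
    come from the list in this message.
    """
    return [
        "blat",
        database,                          # e.g. an unmasked 2bit file
        query,                             # fasta file of short queries
        f"-minScore={min_score}",
        f"-minIdentity={min_identity}",
        "-noTrimA",
        "-fastMap",
        output,                            # psl output file
    ]


cmd = blat_command("dm2.2bit", "queries.fa", "out.psl")
print(" ".join(cmd))
```

The list form can be handed to subprocess.run(cmd, check=True) once the
paths point at real files.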
Yes, highly-repetitive tiles may still be dynamically masked out.
One of our developers has downloaded the dm2 fasta files and made a 2bit file
without repeat-masking. He is testing blat to see if your example sequence can
be found. As an aside, grep finds 7 exact hits in the fasta files alone, so
there are at least 7. (grep will miss any occurrence that happens to be split
across two lines, because fasta sequence is broken into fixed-length lines.)
He has tried several blat settings to get it to return hits of size 12. So
far, no luck.
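
To get a count that is not affected by the fasta line breaks, the sequence
lines of each record can be joined before searching. The following is a
rough sketch, not what our developer ran: the function name and the tiny
demo record are invented for illustration, it holds one whole record in
memory, and it looks at the forward strand only (as grep does).

```python
import io


def count_exact_hits(fasta_lines, target):
    """Count occurrences of target in each fasta record, ignoring line breaks."""
    target = target.upper()
    counts = {}
    name, chunks = None, []

    def flush():
        # Join the record's lines, then count (possibly overlapping) matches.
        if name is not None:
            seq = "".join(chunks).upper()
            n, pos = 0, seq.find(target)
            while pos != -1:
                n += 1
                pos = seq.find(target, pos + 1)
            counts[name] = n

    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            flush()
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    flush()
    return counts


# Made-up demo record: the first occurrence spans lines 1 and 2, so a
# line-by-line search would only see the one on line 3.
demo = io.StringIO(">chrDemo\nacgtgtggtg\naggaattt\ntgtggtgaggaa\n")
hits = count_exact_hits(demo, "tgtggtgaggaa")
print(hits)
```

For real files, pass an open file handle instead of the StringIO object.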
Please give those parameters a try and let us know how it works for you.
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
John Major wrote:
> Hi Ann-
>
> Thanks for the suggestions. I tried those settings, and still am
> returning no hits for my shorter sequences.
> One example sequence for D.melanogaster (which does have several exact
> alignments), is the sequence 'tgtggtgaggaa'. It does not return any
> Blat hits when I use the suggested settings below.
> Could this be because Blat is screening out repetitive looking sequences?
>
> John
>
>
> Ann Zweig wrote:
>
>> Hello John,
>>
>> I have consulted one of our Blat experts and he has the following
>> to say regarding your question:
>>
>> s=stepsize, t=tilesize
>>
>> min guaranteed hit = 2s + t - 1 = 12
>>
>> To reach 12 bp, he needs
>> 2s + t = 12+1 = 13
>> if t=9, then s=2
>> if t=7, then s=3
>> if t=5, then s=4
>> stop here because values of t < s do not work.
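
The arithmetic quoted above can be checked with a few lines of Python. This
is a sketch of the formula as stated (min guaranteed hit = 2s + t - 1) with
the t >= s restriction; the function names and the max_tile cutoff are
assumptions for illustration, and it does not check which tile sizes blat
itself will accept.

```python
def min_guaranteed_hit(tile_size, step_size):
    """Shortest match length guaranteed to be found, per the quoted formula."""
    return 2 * step_size + tile_size - 1


def settings_for(query_len, max_tile=12):
    """List (tileSize, stepSize) pairs whose guaranteed hit equals query_len.

    Pairs with tileSize < stepSize are skipped, as in the quoted message.
    max_tile is an arbitrary cutoff chosen for this sketch.
    """
    pairs = []
    for t in range(max_tile, 0, -1):
        for s in range(1, t + 1):
            if min_guaranteed_hit(t, s) == query_len:
                pairs.append((t, s))
    return pairs


for t, s in settings_for(12):
    print(f"tileSize={t} stepSize={s}")
```

Besides the three combinations listed, this also reports tileSize=11 with
stepSize=1, which satisfies the same equation at a much higher ram cost.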
>>
>> This all looks like about 2-3x ram usage and 2x more runtime. The ram
>> usage-factor is 1/s. You want this small for less ram usage. So the
>> bigger s is, the better for ram usage.
>>
>> But there is a sensitivity factor too. By going to stepsize 5 instead
>> of 11, the ram usage more than doubles, but then he can be guaranteed
>> to pick up 2s+t = 21. One could also increase sensitivity by tweaking
>> the query with variations and re-searching the hit-lists.
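
One possible reading of "tweaking the query with variations" is to generate
every single-base substitution of a short query and search each variant as
an exact match, which would cover the 1 bp mismatch case from the original
question. This is an assumption about what was meant, not a description of
anything blat does internally; the function name is invented for this
sketch.

```python
def one_mismatch_variants(query, alphabet="acgt"):
    """Return all sequences that differ from query at exactly one position."""
    query = query.lower()
    variants = set()
    for i, base in enumerate(query):
        for alt in alphabet:
            if alt != base:
                # Substitute position i with each of the other bases.
                variants.add(query[:i] + alt + query[i + 1:])
    return sorted(variants)


variants = one_mismatch_variants("tgtggtgaggaa")
print(len(variants), "variants, for example", variants[0])
```

A 12-mer gives 3 substitutions at each of 12 positions, so the number of
extra queries stays small.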
>>
>> And there is a performance factor too: the index span is related to
>> tilesize. The bigger the tile, the more patterns the genome is
>> fragmented into, so you make more hit-lists of shorter lengths. That
>> helps performance, because the tiles in your query hit a smaller
>> number of shorter hit-lists when tilesize increases.
>>
>> Keep in mind that there may be a point where you need to search
>> chromosome-by-chromosome if you make your stepsize and tilesize too
>> small.
>>
>> This previously-answered mail list question may also be of some
>> help to you:
>> http://www.cse.ucsc.edu/pipermail/genome/2004-September/005612.html
>>
>> I hope this is helpful to you. Feel free to write back to the
>> list if these suggestions don't work for you.
>>
>> Regards,
>>
>> ----------
>> Ann Zweig
>> UCSC Genome Bioinformatics Group
>> http://genome.ucsc.edu
>>
>>
>>
>> John Major wrote:
>>
>>> Hello-
>>>
>>> I have a large amount of 12-30bp sequences which I'd like to find
>>> exact (or 1 bp mismatch) alignments to the Drosophila genome. I have
>>> a local install of BLAT and have played with the parameters, but
>>> can't seem to return any hits for the shorter sequences (in the
>>> 12-15bp range mostly).
>>> I have run the s/w w/out the 11.ooc file, and with tileSize=8,
>>> minMatch=2, stepSize=5, but with no luck.
>>> I see in a post from 2002 that BLAT can't handle sequence searches
>>> below 21bp in length. Is this still true?
>>> Are there recommended settings to run BLAT on short sequences?
>>>
>>> Thanks,
>>> John
>>> _______________________________________________
>>> Genome maillist - Genome at soe.ucsc.edu
>>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>>
>