[Genome] BLAT short sequence queries ( 12mer to 30mer )
John Major
major at cbio.mskcc.org
Mon Oct 16 10:15:14 PDT 2006
Hi Galt,
thanks! This works very well.
I appreciate all the speedy assistance from you and Ann.
John
Galt Barber wrote:
>I have managed to get 28 hits of length 12 and 11 on your 12 bp example
>query using this command:
>
>blat dm2.2bit query.fa output.psl -tileSize=9 -stepSize=2 -minScore=0 -minIdentity=100 -noTrimA
>
>screech:dm2>cat query.fa
>>my_query
>tgtggtgaggaa
>
>screech:dm2>wc -l *.psl
>28 output.psl
>
>I found that -fastMap for some reason does not work with
>very short sequences, so please DO NOT USE -fastMap.
>
>This works on dm2. If one wanted to do human,
>it might take a lot of ram or cpu time to do even
>one chrom at a time with the reduced tileSize and stepSize.
>For dm2 with a genome size of just 132Mbase, this
>works quite well even with just 1GB of ram.
>
>I can also get this to give identical results
>for the example query if I use -tileSize=7 -stepSize=3,
>as long as I also require -minScore=12 so that they
>are all full length. This shows that it can find
>all full-length exact-match hits.
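The full-length check described above can also be applied after the fact to a PSL file produced with -minScore=0. What follows is only a minimal sketch: it assumes the standard 21-column tab-separated PSL layout, and the function names and the sample lines in the demo are made up for illustration, not taken from this thread.

```python
def parse_psl_line(line):
    """Parse one PSL alignment line into a dict holding the fields used below."""
    f = line.rstrip("\n").split("\t")
    return {
        "matches": int(f[0]),
        "misMatches": int(f[1]),
        "repMatches": int(f[2]),
        "strand": f[8],
        "qName": f[9],
        "qSize": int(f[10]),
        "tName": f[13],
        "tStart": int(f[15]),
        "tEnd": int(f[16]),
    }


def full_length_exact(lines):
    """Keep only alignments that cover the whole query with no mismatches."""
    hits = []
    for line in lines:
        fields = line.split("\t")
        # Header lines and blank lines do not start with a numeric match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        rec = parse_psl_line(line)
        covered = rec["matches"] + rec["repMatches"]
        if rec["misMatches"] == 0 and covered == rec["qSize"]:
            hits.append(rec)
    return hits


if __name__ == "__main__":
    # Two synthetic PSL lines (hypothetical data): one 12/12 hit, one 11/12 hit.
    full = "\t".join(["12", "0", "0", "0", "0", "0", "0", "0", "+", "my_query",
                      "12", "0", "12", "chrToy", "1000", "100", "112",
                      "1", "12,", "0,", "100,"])
    partial = "\t".join(["11", "0", "0", "0", "0", "0", "0", "0", "+", "my_query",
                         "12", "0", "11", "chrToy", "1000", "200", "211",
                         "1", "11,", "0,", "200,"])
    for rec in full_length_exact([full, partial]):
        print(rec["tName"], rec["tStart"], rec["tEnd"], rec["strand"])
```

For a real run one would pass an open file instead of the list, for example full_length_exact(open("output.psl")).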
>
>-Galt
>
>
>On Wed, 11 Oct 2006, Ann Zweig wrote:
>
>
>
>>Hello John-
>>
>> I'm sorry that BLAT still isn't finding your short sequences; it really isn't
>>optimized to find such short matches. You might try a few other tweaks and see
>>if it makes any difference:
>>
>>-minScore=12
>>-minIdentity=100
>>-noTrimA
>>-fastMap
>>
>> Yes, highly-repetitive tiles may still be dynamically masked out.
>>
>> One of our developers has downloaded the dm2 fasta files and made a 2bit file
>>without repeat-masking. He is testing blat to see if your example sequence can
>>be found. As an aside, grep finds 7 exact hits in the fasta files alone, so
>>there are at least 7 (grep misses any occurrence that happens to be split
>>across two lines, because fasta sequence is wrapped into fixed-length lines).
>>He has tried several blat settings to get it to return hits of size 12. So
>>far no luck.
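The line-wrapping caveat with grep can be avoided by joining each fasta record before searching. The sketch below is one possible way to do that, not what was run at UCSC; the function names and the toy record are hypothetical, and the minus-strand search is an addition that a plain grep for the query would not do.

```python
import io

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(handle):
    """Yield (name, sequence) pairs with wrapped sequence lines joined."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_all(text, pattern):
    """Return every start offset of pattern in text, overlapping ones included."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def exact_hits(handle, query):
    """Return (name, 0-based start, strand) for each exact, case-insensitive match."""
    query = query.lower()
    rc = reverse_complement(query)
    results = []
    for name, seq in read_fasta(handle):
        seq = seq.lower()
        results += [(name, pos, "+") for pos in find_all(seq, query)]
        if rc != query:  # avoid double-counting palindromic queries
            results += [(name, pos, "-") for pos in find_all(seq, rc)]
    return results


if __name__ == "__main__":
    # Toy record (made up): the query straddles the line break, so grep would miss it.
    toy = io.StringIO(">chrToy\nacgttgtggt\ngaggaaccgg\n")
    print(exact_hits(toy, "tgtggtgaggaa"))
```

This holds one whole sequence in memory at a time, which should be workable for a genome the size of dm2 but is not tuned for speed.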
>>
>> Please give those parameters a try and let us know how it works for you.
>>
>>Regards,
>>
>>----------
>>Ann Zweig
>>UCSC Genome Bioinformatics Group
>>http://genome.ucsc.edu
>>
>>
>>
>>
>>John Major wrote:
>>
>>
>>>Hi Ann-
>>>
>>>Thanks for the suggestions. I tried those settings, and am still getting
>>>no hits for my shorter sequences.
>>>One example sequence for D.melanogaster (which does have several exact
>>>alignments), is the sequence 'tgtggtgaggaa'. It does not return any
>>>Blat hits when I use the suggested settings below.
>>>Could this be because Blat is screening out repetitive looking sequences?
>>>
>>>John
>>>
>>>
>>>Ann Zweig wrote:
>>>
>>>
>>>
>>>>Hello John,
>>>>
>>>> I have consulted one of our Blat experts and he has the following
>>>>to say regarding your question:
>>>>
>>>>s=stepsize, t=tilesize
>>>>
>>>>min guaranteed hit = 2s + t - 1 = 12
>>>>
>>>>To reach 12 bp, he needs
>>>>2s + t = 12+1 = 13
>>>>if t=9, then s=2
>>>>if t=7, then s=3
>>>>if t=5, then s=4
>>>>stop here because values of t < s do not work.
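The arithmetic in the list above can be enumerated for any target length. This is a small sketch of the 2s + t - 1 rule exactly as stated in the mail; the function names are made up, and the code checks only the arithmetic and the t >= s condition, not which values blat itself will accept.

```python
def min_guaranteed_hit(tile_size, step_size):
    """Shortest exact match guaranteed to be found, per the 2s + t - 1 rule."""
    return 2 * step_size + tile_size - 1


def settings_for(target_len):
    """Return (tileSize, stepSize) pairs with 2s + t - 1 == target_len and t >= s."""
    pairs = []
    for step in range(1, target_len):
        tile = target_len + 1 - 2 * step
        if tile >= 1 and tile >= step:
            pairs.append((tile, step))
    return pairs


if __name__ == "__main__":
    for tile, step in settings_for(12):
        print(f"-tileSize={tile} -stepSize={step}",
              "guarantees", min_guaranteed_hit(tile, step), "bp")
```

For a 12 bp target this yields the three combinations listed in the mail plus t=11, s=1, which satisfies the same equation at the cost of the largest index.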
>>>>
>>>>This all looks like about 2-3x ram usage and 2x more runtime. The ram
>>>>usage-factor is 1/s. You want this small for less ram usage. So the
>>>>bigger s is, the better for ram usage.
>>>>
>>>>But there is a sensitivity factor too. By going to stepsize 5 instead
>>>>of 11, the ram usage more than doubles, but then he can be guaranteed
>>>>to pick up 2s+t = 21 (with t=11; by the formula above the strict
>>>>minimum is 2s+t-1 = 20). One could also increase sensitivity by
>>>>tweaking the query with variations and re-searching the hit-lists.
>>>>
>>>>And there is a performance factor too: the index span is related to
>>>>tilesize. The bigger the tile, the more patterns the genome is
>>>>fragmented into, so you make more hit-lists of shorter lengths. That
>>>>helps performance, because the tiles in your query hit a smaller
>>>>number of shorter hit-lists when tilesize increases.
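The effect of tilesize and stepsize on the index can be seen on a toy model. The sketch below is a simplified stand-in, not BLAT's actual index code: it samples a made-up random sequence every step_size bases and maps each tile to the list of positions where it occurs. All names and the random test sequence are assumptions for illustration.

```python
import random
from collections import defaultdict


def build_tile_index(genome, tile_size, step_size):
    """Map each tile to its start positions, sampling every step_size bases."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - tile_size + 1, step_size):
        index[genome[pos:pos + tile_size]].append(pos)
    return dict(index)


def index_stats(index):
    """Return (number of hit-lists, total indexed positions, mean hit-list length)."""
    total = sum(len(positions) for positions in index.values())
    mean = total / len(index) if index else 0.0
    return len(index), total, mean


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    genome = "".join(rng.choices("acgt", k=100_000))
    for tile, step in [(5, 4), (7, 3), (9, 2), (11, 11)]:
        lists, total, mean = index_stats(build_tile_index(genome, tile, step))
        print(f"tile={tile:2d} step={step:2d} lists={lists:6d} "
              f"positions={total:6d} mean_list_len={mean:.2f}")
```

In this model the total number of indexed positions is roughly the sequence length divided by the stepsize, which mirrors the 1/s ram factor, while a larger tile spreads those positions over more, shorter lists.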
>>>>
>>>> Keep in mind that there may be a point where you need to search
>>>>chromosome-by-chromosome if you make your stepsize and tilesize too
>>>>small.
>>>>
>>>> This previously-answered mail list question may also be of some
>>>>help to you:
>>>>http://www.cse.ucsc.edu/pipermail/genome/2004-September/005612.html
>>>>
>>>> I hope this is helpful to you. Feel free to write back to the
>>>>list if these suggestions don't work for you.
>>>>
>>>>Regards,
>>>>
>>>>----------
>>>>Ann Zweig
>>>>UCSC Genome Bioinformatics Group
>>>>http://genome.ucsc.edu
>>>>
>>>>
>>>>
>>>>John Major wrote:
>>>>
>>>>
>>>>
>>>>>Hello-
>>>>>
>>>>>I have a large number of 12-30bp sequences for which I'd like to find
>>>>>exact (or 1 bp mismatch) alignments to the Drosophila genome. I have
>>>>>a local install of BLAT and have played with the parameters, but
>>>>>can't seem to return any hits for the shorter sequences (mostly in
>>>>>the 12-15bp range).
>>>>>I have run the software without the 11.ooc file, and with tileSize=8,
>>>>>minMatch=2, stepSize=5, but with no luck.
>>>>>I see in a post from 2002 that BLAT can't handle sequence searches
>>>>>below 21bp in length. Is this still true?
>>>>>Are there recommended settings to run BLAT on short sequences?
>>>>>
>>>>>Thanks,
>>>>>John
>>>>>_______________________________________________
>>>>>Genome maillist - Genome at soe.ucsc.edu
>>>>>http://www.soe.ucsc.edu/mailman/listinfo/genome
>>>>>
>>>>>