[Genome] local gfClient / gfServer vs. web BLAT on short sequences
Julien Lagarde
jlagarde at imim.es
Tue Nov 21 05:37:35 PST 2006
Hi Brooke,
Thanks a lot for your help.
I've been playing a little bit with the parameters myself.
I couldn't get gfServer to load hg17 with a stepSize<4 without crashing,
even on a 16GB RAM machine.
So for gfServer I've used
- a stepSize of 4
- a tileSize of 8
- an (overkill) repMatch > (genomeSize / stepSize), so that even in
the very worst case no tile will be considered overused, if I
understood it correctly.
This slows down the queries of course, but it's still reasonable.
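
The repMatch reasoning above can be checked with a little arithmetic: if tiles are indexed at every stepSize-th position, the index holds roughly genomeSize / stepSize tiles, so no single tile can occur more often than that. A minimal Python sketch, assuming one contiguous sequence and an illustrative genome size of about 3.1 billion bases (not the exact hg17 total); the function name is hypothetical and this is not gfServer's own code:

```python
def max_tile_occurrences(genome_size: int, tile_size: int, step_size: int) -> int:
    """Upper bound on how often any single tile can appear in the index.

    Assumes tiles start at positions 0, step_size, 2*step_size, ... of one
    contiguous sequence (a simplification of how a real index is built).
    """
    if genome_size < tile_size:
        return 0
    # Last valid tile start is genome_size - tile_size.
    return (genome_size - tile_size) // step_size + 1


# Illustrative value only; substitute the real size of your 2bit file.
genome_size = 3_100_000_000
bound = max_tile_occurrences(genome_size, tile_size=8, step_size=4)
print(f"indexed tiles (upper bound): {bound}")
print(f"genomeSize / stepSize:       {genome_size // 4}")
```

Under these assumptions the bound is always below genomeSize / stepSize, so a repMatch set above that ratio can never mark a tile as overused.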
I'm blatting about 500,000 seqs of 23 < length < 27 to hg17. These short
sequences are extracted directly from the chromosome sequences, so each
of them should get at least one perfect match in the genome. This is the
case with this set of parameters; it wasn't with -tileSize=10
-stepSize=5 and greater values.
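
One way to see why the smaller tile/step combination leaves more margin for such short queries is to count how many indexed tiles are guaranteed to fall entirely inside a perfect match. The Python sketch below is a brute-force illustration under the assumption that indexed tiles start at multiples of stepSize on the target; the function name is hypothetical and it does not model overused tiles or any other gfServer internals:

```python
def guaranteed_tile_hits(match_len: int, tile_size: int, step_size: int) -> int:
    """Minimum number of indexed tiles lying fully inside a perfect match.

    Tries every possible offset of the match relative to the step grid and
    returns the worst case.
    """
    worst = None
    for offset in range(step_size):
        # The match covers target positions offset .. offset + match_len - 1.
        hits = sum(
            1
            for start in range(0, offset + match_len, step_size)
            if start >= offset and start + tile_size <= offset + match_len
        )
        worst = hits if worst is None else min(worst, hits)
    return worst


# Query lengths 24..26 correspond to 23 < length < 27.
for tile, step in [(8, 4), (10, 5)]:
    counts = {n: guaranteed_tile_hits(n, tile, step) for n in (24, 25, 26)}
    print(f"tileSize={tile} stepSize={step}: {counts}")
```

With fewer guaranteed tiles per query, there is less slack left once some of them are discarded as overused, which is consistent with the explanation quoted below.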
As a more general comment, I think BLAT is great but it lacks
comprehensive documentation. Better documentation would probably keep
people from getting confused and asking the same kind of questions over
and over again on this list. For instance, how can one figure out that
repMatch's default depends on tileSize? (BTW, in my understanding it
should also depend on stepSize.)
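
The three repMatch values quoted from blatSpec.html further down in this thread follow a factor-of-4 pattern, which fits the intuition that a tile one base shorter is about four times as frequent in random sequence. The Python sketch below extends that pattern and adds the stepSize scaling suggested above. It is a hypothetical extrapolation for illustration only, not BLAT's actual default logic, and the function name is made up:

```python
from typing import Optional

# Values quoted from blatSpec.html in this thread.
DOCUMENTED_REP_MATCH = {12: 256, 11: 1024, 10: 4096}


def extrapolated_rep_match(tile_size: int, step_size: Optional[int] = None) -> int:
    """Hypothetical extrapolation, NOT BLAT's real default.

    Extends the documented factor-of-4 pattern to other tile sizes and
    scales by tile_size / step_size, since overlapping tiles are indexed
    more densely.
    """
    if step_size is None:
        step_size = tile_size  # assume non-overlapping tiles
    base = 1024 * 4 ** (11 - tile_size)
    return round(base * tile_size / step_size)


for tile, documented in DOCUMENTED_REP_MATCH.items():
    print(tile, documented, extrapolated_rep_match(tile))

# The settings used in this message: tileSize=8, stepSize=4.
print(extrapolated_rep_match(8, 4))
```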
thanks again!
julien
Brooke Rhead wrote:
> Hello again j.,
>
> I have a few things to add to my previous email. First, I discovered
> that if you don't specify a number for repMatch when using gfServer, it
> will automatically go from 1024 to 4096 when you change the tileSize
> from 11 to 10. So, in order to see a difference in your results when
> using tileSize 10, you will need to set repMatch to a number higher than
> 4096. One of our engineers suggested running gfServer with
> repMatch=1,000,000. This will slow down the process some, but it
> shouldn't be too bad for your small primer.
>
> Another recommendation I got for your situation is to try dropping
> stepSize to an even lower number, to 3 or possibly even 1. Unless
> stepSize=1 is set, there will be some cases where tileSize=11 will
> find a few things that tileSize=10 won't. If you can live with the
> slower speed of stepSize=1, this is an option you can try.
>
> I hope this information helps.
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
> Brooke Rhead wrote:
>> Hello j.,
>>
>> We tried replicating your setup with a tileSize of 10, and we did not
>> find the web-based chr7 hit, either. What we are likely seeing in
>> this situation is an issue of overused tiles being disregarded by
>> BLAT. When a certain tile is over-used (that is, when the sequence
>> appears over and over in the genome), the BLAT program disregards the
>> tile and does not use it to seed a match. Since your sequence is
>> small, this throwing-out of tiles can cause potential alignments to
>> be missed. Also, when the tileSize is changed from 11 to 10, an
>> entirely different set of tiles is created, and the tiles in the
>> over-used category change, which might explain why the chr7 alignment
>> appears in one set and not the other (even though a tileSize of 10 is
>> generally more sensitive than a tileSize of 11).
>>
>> The default number of times for a tile to appear before it is
>> disregarded is 1,024. This can be changed with the repMatch
>> parameter. From
>> http://genome-test.cse.ucsc.edu/goldenPath/help/blatSpec.html :
>>
>> -repMatch=N sets the number of repetitions of a tile allowed before
>> it is marked as overused. Typically this is 256 for tileSize 12,
>> 1024 for tile size 11, 4096 for tile size 10. Default is 1024.
>>
>> You could try setting the repMatch parameter to a higher number to
>> see if the chr7 hit gets picked up.
>>
>> I hope this clears up the confusion. Please write back to us if we
>> can be of further assistance.
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>> Julien Lagarde wrote:
>>> Hi again dear Genome,
>>>
>>> Using a tileSize of 11 (instead of 10) for gfServer has fixed the
>>> problem... which I don't understand AT ALL, since decreasing tileSize
>>> is supposed to increase sensitivity, if I understood correctly...
>>> What combo of parameters gives max sensitivity?
>>> Thanks
>>> j.
>>>
>>> Julien Lagarde wrote:
>>>> Hi Genome,
>>>>
>>>> I am a bit puzzled by the alignments I get with my local gfClient
>>>> / gfServer compared to those given by the online UCSC BLAT server.
>>>>
>>>> My input seq is:
>>>>
>>>> >chr22_primer
>>>> TTGCCTTCTCCCTCATCGAGGGTTA
>>>>
>>>> Online BLAT result (hg17) is:
>>>>
>>>>
>>>> match mis-  rep.  N's Q gap Q gap T gap T gap strand Q            Q    Q     Q   T     T         T         T         block blockSizes qStarts tStarts
>>>>       match match     count bases count bases        name         size start end name  size      start     end       count
>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> 25    0     0     0   0     0     0     0     -      chr22_primer 25   0     25  chr22 49554710  29006841  29006866  1     25,        0,      29006841,
>>>> 21    1     0     0   0     0     0     0     +      chr22_primer 25   3     25  chr7  158628139 73878910  73878932  1     22,        3,      73878910,
>>>>
>>>>
>>>> local gfClient/gfServer (v.32) output is:
>>>>
>>>>
>>>> 11    0     0     0   0     0     0     0     +      chr22_primer 25   0     11  chr1  245522847 7292025   7292036   1     11,        0,      7292025,
>>>> 11    0     0     0   0     0     0     0     +      chr22_primer 25   0     11  chr1  245522847 14727360  14727371  1     11,        0,      14727360,
>>>> 11    0     0     0   0     0     0     0     +      chr22_primer 25   0     11  chr1  245522847 17882525  17882536  1     11,        0,      17882525,
>>>> 11    0     0     0   0     0     0     0     +      chr22_primer 25   0     11  chr1  245522847 18867990  18868001  1     11,        0,      18867990,
>>>> 11    0     0     0   0     0     0     0     +      chr22_primer 25   0     11  chr1  245522847 19615590  19615601  1     11,        0,      19615590,
>>>> 11    0     0     0   0     0     0     0     +      chr22_primer 25   0     11  chr1  245522847 24596870  24596881  1     11,        0,      24596870,
>>>> 12    0     0     0   0     0     0     0     +      chr22_primer 25   0     12  chr1  245522847 26453344  26453356  1     12,        0,      26453344,
>>>> 15    0     0     0   0     0     0     0     +      chr22_primer 25   0     15  chr12 132449811 88185840  88185855  1     15,        0,      88185840,
>>>> 15    0     0     0   0     0     0     0     +      chr22_primer 25   0     15  chr12 132449811 131134400 131134415 1     15,        0,      131134400,
>>>> 14    0     0     0   0     0     0     0     -      chr22_primer 25   2     16  chr1  245522847 10657190  10657204  1     14,        9,      10657190,
>>>> 13    0     0     0   0     0     0     0     -      chr22_primer 25   2     15  chr1  245522847 19442182  19442195  1     13,        10,     19442182,
>>>> 12    0     0     0   0     0     0     0     -      chr22_primer 25   3     15  chr1  245522847 22864170  22864182  1     12,        10,     22864170,
>>>> 15    0     0     0   0     0     0     0     -      chr22_primer 25   0     15  chr11 134452384 73282830  73282845  1     15,        10,     73282830,
>>>> 15    0     0     0   0     0     0     0     -      chr22_primer 25   0     15  chr2  243018229 60887140  60887155  1     15,        10,     60887140,
>>>> 25    0     0     0   0     0     0     0     -      chr22_primer 25   0     25  chr22 49554710  29006841  29006866  1     25,        0,      29006841,
>>>> 15    0     0     0   0     0     0     0     -      chr22_primer 25   1     16  chr4  191411218 48118575  48118590  1     15,        9,      48118575,
>>>> 16    0     0     0   0     0     0     0     -      chr22_primer 25   1     17  chr7  158628139 4776264   4776280   1     16,        8,      4776264,
>>>>
>>>>
>>>> The parameters I use on my local installation are:
>>>>
>>>> # convert hg17 to 2bit, no mask:
>>>> $ faToTwoBit -noMask \
>>>>     /seq/genomes/H.sapiens/golden_path_200405/chromFa/*fa \
>>>>     complete_hg17_noMask.2bit
>>>>
>>>> # start gfServer:
>>>> $ gfServer -tileSize=10 -stepSize=5 -canStop start localhost 3500 \
>>>>     complete_hg17_noMask.2bit
>>>>
>>>> # query with gfClient:
>>>> $ gfClient -minScore=0 -nohead -minIdentity=0 localhost 3500 / \
>>>>     chr22_primer.fa chr22_primer.psl
>>>>
>>>> I'm trying to make my local BLAT as sensitive as possible for short
>>>> seqs, following the recommendations you have posted on this list.
>>>> This results in many spurious hits, as expected. No problem, I can
>>>> deal with this, but the thing that bothers me is that my local BLAT
>>>> skips an obvious, near-perfect match on chr7 (tStart=73878910) that
>>>> the online BLAT finds.
>>>>
>>>> Do you have any idea why?
>>>> Thanks in advance,
>>>> j.
>>>>
>> _______________________________________________
>> Genome maillist - Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>
--
-----------------------------------------------------
Julien Lagarde
Genome Bioinformatics Research Group
Centre de Regulacio Genomica
Grup de Recerca en Informatica Biomedica (IMIM)
Dr. Aiguader, 88 (+34) 93 3160166 ph
E-08003 Barcelona (+34) 93 3160099 fax
http://genome.imim.es