[Genome] local gfClient / gfServer vs. web BLAT on short sequences

Julien Lagarde jlagarde at imim.es
Tue Nov 21 05:37:35 PST 2006


Hi Brooke,

Thanks a lot for your help.
I've been playing a little bit with the parameters myself.
I couldn't get gfServer to load hg17 with a stepSize below 4 without it 
crashing, even on a machine with 16 GB of RAM.
So for gfServer I've used
- a stepSize of 4
- a tileSize of 8
- an (overkill) repMatch > (genomeSize / stepSize), to ensure that 
even in the very worst case no tile will be considered overused, if I 
understood it correctly.

This slows down the queries of course, but it's still reasonable.
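
A minimal sketch of the arithmetic behind that repMatch bound, assuming a plain stepped index where each sequence contributes one tile every stepSize bases. The function names are made up for illustration and are not part of BLAT; the example size is the chr22 length shown in the PSL output quoted further down.

```python
def max_indexed_tiles(seq_lengths, tile_size, step_size):
    """Count the tiles a stepped index would hold for the given sequences.

    Assumption: tile starts are 0, step, 2*step, ... within each sequence,
    and a tile must fit entirely inside the sequence.
    """
    total = 0
    for length in seq_lengths:
        if length >= tile_size:
            total += (length - tile_size) // step_size + 1
    return total


def overkill_rep_match(seq_lengths, step_size):
    """Smallest integer strictly greater than genomeSize / stepSize."""
    return sum(seq_lengths) // step_size + 1


if __name__ == "__main__":
    chr22 = [49554710]  # chr22 size as it appears in the PSL rows of this thread
    tiles = max_indexed_tiles(chr22, tile_size=8, step_size=4)
    bound = overkill_rep_match(chr22, step_size=4)
    # Even if every indexed tile were identical, the count stays below the bound.
    print(tiles, bound, tiles < bound)
```

As long as tileSize >= stepSize, no single tile can occur more often than the total number of indexed tiles, so a repMatch above genomeSize / stepSize can never be reached under this model.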
I'm blatting about 500,000 seqs of 23 < length < 27 to hg17. These short 
sequences are directly extracted from the chromosome sequences, so all 
of them should get at least one perfect match in the genome. This is the 
case with this set of parameters; it was not with -tileSize=10 
-stepSize=5 or larger values.
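
To see why a smaller stepSize helps with sequences this short, here is a rough count of how many indexed tiles can fall entirely inside a perfect match. It is a sketch of a generic stepped tile index (genome tiles taken every stepSize bases, query scanned at every offset), not code taken from BLAT, and the function names are made up for illustration.

```python
def indexed_tiles_in_match(match_len, tile_size, step_size):
    """Return (min, max) indexed tiles lying fully inside a perfect match.

    A tile fits if it starts at one of `window` consecutive genome positions;
    only positions that are multiples of step_size are in the index.
    """
    window = match_len - tile_size + 1
    if window <= 0:
        return (0, 0)
    fewest = window // step_size          # worst alignment of match vs. step grid
    most = -(-window // step_size)        # ceiling division, best alignment
    return (fewest, most)


def brute_force(match_len, tile_size, step_size):
    """Cross-check by trying every offset of the match against the step grid."""
    counts = []
    for start in range(step_size):
        last = start + match_len - tile_size
        counts.append(sum(1 for p in range(start, last + 1) if p % step_size == 0))
    return (min(counts), max(counts))


if __name__ == "__main__":
    for params in [(24, 8, 4), (24, 10, 5), (25, 11, 11)]:
        print(params, indexed_tiles_in_match(*params), brute_force(*params))
```

Under this model a 24-mer always contains at least four indexed tiles with tileSize=8 and stepSize=4, and at least three with tileSize=10 and stepSize=5. By this count alone the coarser setting still leaves seeds available, so misses there may instead come from tiles being dropped as overused.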

As a more general comment, I think BLAT is great but it lacks 
comprehensive documentation. Better documentation would probably keep 
people from getting confused and asking the same kind of questions over 
and over again on this list. For instance, how can one figure out that 
repMatch's default depends on tileSize? (BTW, in my understanding it 
should also depend on stepSize.)
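
A back-of-the-envelope sketch of why the two parameters would both matter, assuming a uniformly random genome (which real genomes are not) and a round, hypothetical genome size of 3e9 bases rather than the exact hg17 figure. The documented defaults are the ones quoted from blatSpec.html further down in this thread; everything else is illustrative.

```python
import math

# Defaults as quoted from the BLAT spec in this thread: tileSize -> repMatch.
DOCUMENTED_REP_MATCH = {12: 256, 11: 1024, 10: 4096}


def expected_tile_count(genome_size, tile_size, step_size):
    """Expected occurrences of one given tile in a stepped index.

    Assumption: bases are independent and uniform, so each of the
    4**tile_size possible tiles is equally likely at every indexed position.
    """
    indexed_positions = genome_size / step_size
    return indexed_positions / 4 ** tile_size


if __name__ == "__main__":
    genome = 3e9  # hypothetical round number, not the real hg17 size
    for tile in (12, 11, 10):
        for step in (tile, 5, 1):
            print(tile, step, round(expected_tile_count(genome, tile, step), 1))
    # Dropping one base from the tile multiplies the expectation by 4,
    # the same factor that separates the documented defaults.
    ratio = expected_tile_count(genome, 10, 5) / expected_tile_count(genome, 11, 5)
    print(math.isclose(ratio, 4.0))
```

In this model the expected count grows fourfold for every base removed from the tile, which matches the 256 / 1024 / 4096 progression, and it also grows in proportion to 1 / stepSize, which is the dependence suggested above.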

Thanks again!
julien




Brooke Rhead wrote:
> Hello again j.,
>
> I have a few things to add to my previous email.  First, I discovered
> that if you don't specify a number for repMatch when using gfServer, it
> will automatically go from 1024 to 4096 when you change the tileSize
> from 11 to 10.  So, in order to see a difference in your results when
> using tileSize 10, you will need to set repMatch to a number higher than
> 4096.  One of our engineers suggested running gfServer with
> repMatch=1,000,000.  This will slow down the process some, but it
> shouldn't be too bad for your small primer.
>
> Another recommendation I got for your situation is to try dropping
> stepSize to an even lower number, to 3 or possibly even 1.  Unless 
> stepSize=1 is set, there will be some cases where tileSize=11 will 
> find a few things that tileSize=10 won't.  If you can live with the 
> slower speed of stepSize=1, this is an option you can try.
>
> I hope this information helps.
>
> -- 
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
>
> Brooke Rhead wrote:
>> Hello j.,
>>
>> We tried replicating your setup with a tileSize of 10, and we did not 
>> find the web-based chr7 hit, either.  What we are likely seeing in 
>> this situation is an issue of overused tiles being disregarded by 
>> BLAT.  When a certain tile is over-used (that is, when the sequence 
>> appears over and over in the genome), the BLAT program disregards the 
>> tile and does not use it to seed a match.  Since your sequence is 
>> small, this throwing-out of tiles can cause potential alignments to 
>> be missed.  Also, when the tileSize is changed from 11 to 10, an 
>> entirely different set of tiles is created, and the tiles in the 
>> over-used category change, which might explain why the chr7 alignment 
>> appears in one set and not the other (even though a tileSize of 10 is 
>> generally more sensitive than a tileSize of 11).
>>
>> The default number of times for a tile to appear before it is 
>> disregarded is 1,024.  This can be changed with the repMatch 
>> parameter.   From 
>> http://genome-test.cse.ucsc.edu/goldenPath/help/blatSpec.html :
>>
>> -repMatch=N sets the number of repetitions of a tile allowed before 
>> it is marked as overused.  Typically this is 256 for tileSize 12, 
>> 1024 for tile size 11, 4096 for tile size 10.  Default is 1024.
>>
>> You could try setting the repMatch parameter to a higher number to 
>> see if the chr7 hit gets picked up.
>>
>> I hope this clears up the confusion.  Please write back to us if we 
>> can be of further assistance.
>>
>> -- 
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>> Julien Lagarde wrote:
>>> Hi again dear Genome,
>>>
>>> Using a tileSize of 11 (instead of 10) for gfServer has fixed the 
>>> problem... which I don't understand AT ALL, since decreasing tileSize 
>>> is supposed to increase sensitivity, if I understood correctly...
>>> What combination of parameters gives maximum sensitivity?
>>> Thanks
>>> j.
>>>
>>> Julien Lagarde wrote:
>>>> Hi Genome,
>>>>
>>>> I am a bit puzzled by the alignments I get with my local gfClient 
>>>> / gfServer compared to those given by the online UCSC BLAT server.
>>>>
>>>> My input sequence is:
>>>>
>>>> >chr22_primer
>>>> TTGCCTTCTCCCTCATCGAGGGTTA
>>>>
>>>> Online BLAT result (hg17) is:
>>>>
>>>>
>>>> match    mis-     rep.     N's    Q gap    Q gap    T gap    T gap    strand    Q            Q       Q        Q      T            T       T       T      block    blockSizes     qStarts     tStarts
>>>>          match    match           count    bases    count    bases              name         size    start    end    name         size    start    end    count
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> 25    0    0    0    0    0    0    0    -    chr22_primer    25    0    25    chr22    49554710    29006841    29006866    1    25,    0,    29006841,
>>>> 21    1    0    0    0    0    0    0    +    chr22_primer    25    3    25    chr7    158628139    73878910    73878932    1    22,    3,    73878910,
>>>>
>>>>
>>>> local gfClient/gfServer (v.32) output is:
>>>>
>>>>
>>>> 11    0    0    0    0    0    0    0    +    chr22_primer    25    0    11    chr1    245522847    7292025    7292036    1    11,    0,    7292025,
>>>> 11    0    0    0    0    0    0    0    +    chr22_primer    25    0    11    chr1    245522847    14727360    14727371    1    11,    0,    14727360,
>>>> 11    0    0    0    0    0    0    0    +    chr22_primer    25    0    11    chr1    245522847    17882525    17882536    1    11,    0,    17882525,
>>>> 11    0    0    0    0    0    0    0    +    chr22_primer    25    0    11    chr1    245522847    18867990    18868001    1    11,    0,    18867990,
>>>> 11    0    0    0    0    0    0    0    +    chr22_primer    25    0    11    chr1    245522847    19615590    19615601    1    11,    0,    19615590,
>>>> 11    0    0    0    0    0    0    0    +    chr22_primer    25    0    11    chr1    245522847    24596870    24596881    1    11,    0,    24596870,
>>>> 12    0    0    0    0    0    0    0    +    chr22_primer    25    0    12    chr1    245522847    26453344    26453356    1    12,    0,    26453344,
>>>> 15    0    0    0    0    0    0    0    +    chr22_primer    25    0    15    chr12    132449811    88185840    88185855    1    15,    0,    88185840,
>>>> 15    0    0    0    0    0    0    0    +    chr22_primer    25    0    15    chr12    132449811    131134400    131134415    1    15,    0,    131134400,
>>>> 14    0    0    0    0    0    0    0    -    chr22_primer    25    2    16    chr1    245522847    10657190    10657204    1    14,    9,    10657190,
>>>> 13    0    0    0    0    0    0    0    -    chr22_primer    25    2    15    chr1    245522847    19442182    19442195    1    13,    10,    19442182,
>>>> 12    0    0    0    0    0    0    0    -    chr22_primer    25    3    15    chr1    245522847    22864170    22864182    1    12,    10,    22864170,
>>>> 15    0    0    0    0    0    0    0    -    chr22_primer    25    0    15    chr11    134452384    73282830    73282845    1    15,    10,    73282830,
>>>> 15    0    0    0    0    0    0    0    -    chr22_primer    25    0    15    chr2    243018229    60887140    60887155    1    15,    10,    60887140,
>>>> 25    0    0    0    0    0    0    0    -    chr22_primer    25    0    25    chr22    49554710    29006841    29006866    1    25,    0,    29006841,
>>>> 15    0    0    0    0    0    0    0    -    chr22_primer    25    1    16    chr4    191411218    48118575    48118590    1    15,    9,    48118575,
>>>> 16    0    0    0    0    0    0    0    -    chr22_primer    25    1    17    chr7    158628139    4776264    4776280    1    16,    8,    4776264,
>>>>
>>>>
>>>> The parameters I use on my local installation are:
>>>>
>>>> # convert hg17 to 2bit, no mask:
>>>> $ faToTwoBit -noMask /seq/genomes/H.sapiens/golden_path_200405/chromFa/*fa complete_hg17_noMask.2bit
>>>>
>>>> # start gfServer:
>>>> $ gfServer -tileSize=10 -stepSize=5 -canStop start localhost 3500 complete_hg17_noMask.2bit
>>>>
>>>> # query with gfClient:
>>>> $ gfClient -minScore=0 -nohead -minIdentity=0 localhost 3500 / chr22_primer.fa chr22_primer.psl
>>>>
>>>> I'm trying to make my local BLAT as sensitive as possible for short 
>>>> seqs, following the recommendations posted by you guys on this list.
>>>> This results in many spurious hits, as expected. No problem, I can 
>>>> deal with this, but the thing that bothers me is that my BLAT skips 
>>>> an obvious, near-perfect match on chr7 (tStart=73878910) that the 
>>>> online BLAT finds.
>>>>
>>>> Do you have any idea why?
>>>> Thanks in advance,
>>>> j.
>>>>
>> _______________________________________________
>> Genome maillist  -  Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

-- 
-----------------------------------------------------
Julien Lagarde
Genome Bioinformatics Research Group
Centre de Regulacio Genomica 
Grup de Recerca en Informatica Biomedica (IMIM)
Dr. Aiguader, 88 				(+34) 93 3160166 ph
E-08003 Barcelona				(+34) 93 3160099 fax
http://genome.imim.es
--------------------------------


