[Genome] effect of using masked database

Mera Vigyan meravigyan at gmail.com
Wed May 14 02:08:09 PDT 2008


thanks,


On Tue, May 13, 2008 at 11:59 PM, Galt Barber <galt at soe.ucsc.edu> wrote:

>
> With your current settings, BLAT is ignoring masking
> (unless you have previously hard-masked the genome sequence with N's).


I have already hard masked the genome sequence with Ns.


>
> I think you will not want masking, as you can use
> various tools like pslCDnaFilter or other filters
> you write yourself to filter the final results.
>
> You should know that by choosing -out=blast8
> that you are getting only non-intronic non-chained
> alignments.  With -out=psl, it outputs alignments
> that have been chained together, which can be
> especially helpful with RNA.
>

I am working with DNA query sequences and a masked genome sequence as the
reference. I am using the blast8 format for easy parsing.
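
For reference, a minimal Python sketch of parsing one blast8 row. It assumes
the usual 12 tab-separated columns of the BLAST tabular (-m 8) layout; the
names Blast8Hit and parse_blast8_line are hypothetical, chosen for
illustration, and are not part of BLAT.

```python
from typing import NamedTuple


class Blast8Hit(NamedTuple):
    """One row of blast8 output (assumed 12-column BLAST tabular layout)."""
    query: str
    target: str
    identity: float   # percent identity
    length: int       # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    evalue: float
    bit_score: float


def parse_blast8_line(line):
    """Parse a single tab-separated blast8 row into a Blast8Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError("expected 12 columns, got %d" % len(fields))
    return Blast8Hit(
        fields[0], fields[1], float(fields[2]),
        int(fields[3]), int(fields[4]), int(fields[5]),
        int(fields[6]), int(fields[7]), int(fields[8]), int(fields[9]),
        float(fields[10]), float(fields[11]),
    )
```

Each row is a single ungapped-chain alignment block, so one query can
produce several rows.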


>
> Even when you perform no masking, blat still blocks
> out tiles that are very highly repetitively used.
> You can try to increase this threshold by increasing -repMatch=N
> but it also means you get more low-quality alignments
> in the output which you may end up filtering out
> later anyway.
>
> It is also very likely that some of your missing reads
> (50000 - 49966 = 34) are either hitting repetitive tiles
> or simply don't align to the reference genome.
>
> Also, if a sequence aligns to more than one place,
> it will create multiple output rows.  Therefore,
> your missing alignments count as calculated above could
> actually be higher.  There is not a one-to-one relationship
> in general between query and target unless it is guaranteed
> to be unique (which is rarely the case).  In addition you
> can have differences between your reference
> genome and the genome of the tissue sample you are working
> with.
>

I do not think this applies here, as my queries are a set of unique
sequences, and I actually counted the 49966 from the BLAT output file.
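
A minimal Python sketch of how such a count could be checked, separating
the number of output rows from the number of distinct queries and listing
the queries with no hit at all. It assumes the first tab-separated column
of each blast8 row is the query name, and that this name is the first word
of the FASTA header; the function names and the file names in the comment
are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Return query names: the first word after '>' on each header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.append(words[0])
    return ids


def queries_with_hits(blast8_lines):
    """Return the set of distinct query names (column 1) in blast8 output."""
    hits = set()
    for line in blast8_lines:
        if line.strip():
            hits.add(line.split("\t", 1)[0])
    return hits


def missing_queries(fasta_lines, blast8_lines):
    """Return query names from the FASTA input that have no output row."""
    hit = queries_with_hits(blast8_lines)
    return [q for q in fasta_ids(fasta_lines) if q not in hit]


# Example use with hypothetical file names:
#   with open("reads.fa") as fa, open("out.blast8") as out:
#       print(missing_queries(fa.readlines(), out.readlines()))
```

If the number of rows and the number of distinct query names differ, some
queries aligned to more than one place.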

thanks


>
> -Galt
>
>
>
> On Tue, 13 May 2008, Mera Vigyan wrote:
>
>  Good Morning
>>
>> I have been using BLAT to place read-pair sequences from a grape species
>> on the full grapevine genome sequence.  As a variation, I ran BLAT
>> leniently with a chunk of 50000 such sequences against the masked version
>> of the grapevine sequence.
>>
>> my command line :
>>
>> blat -stepSize=5 -minIdentity=0 -minScore=0 -out=blast8
>>
>> I have some questions:
>> 1. Does running BLAT against a masked version of the reference sequence
>> help to understand the nature of structural variations and the presence
>> of transposable elements more clearly, or does running BLAT with masked
>> query sequences against the unmasked reference sequence give a better
>> understanding?
>>
>> 2. I also notice that output was generated for only 49966 of the
>> 50000 sequences in the chunk.  Why is this so?
>> How can I get output for the full chunk in such cases?
>>
>> thank you
>> _______________________________________________
>> Genome maillist  -  Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>>
>>

