[Genome] effect of using masked database

Galt Barber galt at soe.ucsc.edu
Tue May 13 14:59:27 PDT 2008


With your current settings, BLAT is ignoring masking
(unless you have previously hard-masked the genome sequence with N's).

I think you will not want masking, as you can use
various tools like pslCDnaFilter or other filters
you write yourself to filter the final results.

You should know that by choosing -out=blast8
you are getting only non-intronic, non-chained
alignments.  With -out=psl, BLAT outputs alignments
that have been chained together, which can be
especially helpful with RNA.

Even when you perform no masking, BLAT still blocks
out tiles that occur very frequently in the genome.
You can raise this threshold by increasing -repMatch=N,
but that also means you get more low-quality alignments
in the output, which you may end up filtering out
later anyway.

It is also very likely that some of your
missing reads (50000 - 49966 = 34)
are either hitting repetitive tiles or
simply don't align to the reference genome.

Also, if a sequence aligns to more than one place,
it will create multiple output rows.  Therefore,
your missing alignments count as calculated above could
actually be higher.  There is not a one-to-one relationship
in general between query and target unless it is guaranteed
to be unique (which is rarely the case).  In addition you
can have differences between your reference
genome and the genome of the tissue sample you are working
with.
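
To check how many reads really have no alignment, you can count
distinct query names instead of output rows.  Below is a minimal
Python sketch, not a definitive tool.  It assumes the blast8 output
is tab-separated with the query name in the first column, and that
this name is the first word of the FASTA header line.  The function
names and the small inline data are made up for illustration.

```python
import io


def read_fasta_ids(handle):
    """Return the set of sequence names (first word after '>') in a FASTA stream."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def count_blast8_rows(handle):
    """Return a dict mapping query name -> number of blast8 rows for that query."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        query = line.split("\t", 1)[0]  # assumed: column 1 is the query name
        counts[query] = counts.get(query, 0) + 1
    return counts


def summarize(fasta_handle, blast8_handle):
    """Compare the query names in the FASTA input with those in the blast8 output."""
    queries = read_fasta_ids(fasta_handle)
    rows = count_blast8_rows(blast8_handle)
    return {
        "queries": len(queries),                 # reads submitted
        "rows": sum(rows.values()),              # output lines
        "aligned": len(rows),                    # distinct reads with >= 1 row
        "missing": sorted(queries - set(rows)),  # reads with no row at all
        "multi": sorted(q for q, n in rows.items() if n > 1),
    }


# Hypothetical toy data: r1 aligns twice, r2 not at all, r3 once.
demo_fasta = io.StringIO(">r1 first read\nACGT\n>r2\nACGT\n>r3\nACGT\n")
demo_blast8 = io.StringIO(
    "r1\tchr1\t100.00\t4\t0\t0\t1\t4\t10\t13\t1e-5\t8.0\n"
    "r1\tchr2\t100.00\t4\t0\t0\t1\t4\t50\t53\t1e-5\t8.0\n"
    "r3\tchr1\t100.00\t4\t0\t0\t1\t4\t90\t93\t1e-5\t8.0\n"
)
print(summarize(demo_fasta, demo_blast8))
```

With real data you would pass open file handles for your query FASTA
and your blast8 output instead of the toy strings.  The "rows" value
can equal or exceed "aligned", which is why the row count alone does
not tell you how many reads are missing.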


-Galt


On Tue, 13 May 2008, Mera Vigyan wrote:

> Good Morning
>
> I have been using BLAT for placing reads pair sequences of  a species of
> grapes on the
> full sequence of grapevine.  As a variation, i ran BLAT leniently with a
> chunk of 50000
> such sequences on the masked version of grapevine sequence.
>
> my command line :
>
> blat -stepSize=5 -minIdentity=0 -minScore=0 -out=blast8
>
> I have some questions here :
> 1.Does running BLAT against a masked version of reference sequence help to
> understand more
> clearly the nature of structural variations and presence of transposable
> elements
> or running BLAT with masked version of query sequences against the actual
> reference sequence
> enable to understand them in a better way.
>
> 2. I also notice, the output relative to the chunk of 50000 sequences was
> only generated for
> 49966 sequences. Why is this so ?
> How can i get the output for the full chunk in such cases ?
>
> thank you
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>

