[Genome] BLAT match, score, percentage - how to select best alignment
Brooke Rhead
rhead at soe.ucsc.edu
Thu Mar 1 11:40:56 PST 2007
Hello Anton,
The "best" BLAT alignment can be determined in different ways, depending
on which features of the alignment you wish to weight most heavily (e.g.,
the number of matches, the number of gaps on the target or query side,
the size of the gaps, etc.).
Here are some discussions about BLAT score from previously answered
mailing list questions:
http://www.cse.ucsc.edu/pipermail/genome/2004-January/003883.html
http://www.cse.ucsc.edu/pipermail/genome/2001-December/000789.html
Feel free to search the mailing list archives for even more: go to
http://genome.ucsc.edu/FAQ/ and enter "blat score" in the "Search the
Genome mailing list archives" box. These discussions should help you to
understand the meaning of score and percent identity calculated by
web-based BLAT.
> - Can I just take the alignment that has the highest number in the
> "match"-column and take this entry as the "best" alignment?
This is a valid approach. However, some subtleties will be missed. For
instance, a perfect match that has no gaps on either the target or the
query side would be treated the same as an alignment in which every
aligned base matched perfectly but the matching blocks were separated
by gaps (non-matching sequence).
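To make this concrete, here is a minimal Python sketch that compares the
raw "match" column with a gap-aware score. The formula follows the
pslScore calculation shown in the FAQ for nucleotide alignments (matches
plus half of the repeat matches, minus mismatches and the number of
inserts on each side). The function name psl_score and the two example
lines are made up for illustration:

```python
# PSL columns used here (0-based): 0 matches, 1 misMatches, 2 repMatches,
# 4 qNumInsert, 6 tNumInsert.

def psl_score(line):
    """Gap-aware score for one nucleotide PSL line (hypothetical helper)."""
    f = line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_num_insert, t_num_insert = int(f[4]), int(f[6])
    # Repeat matches count half; each mismatch and each gap costs one point.
    return matches + rep_matches // 2 - mismatches - q_num_insert - t_num_insert

# Two made-up alignments of the same 100-base query, both with match = 100.
ungapped = "\t".join([
    "100", "0", "0", "0", "0", "0", "0", "0", "+",
    "read1", "100", "0", "100", "chr1", "1000000", "5000", "5100",
    "1", "100,", "0,", "5000,"])
gapped = "\t".join([
    "100", "0", "0", "0", "0", "0", "2", "400", "+",
    "read1", "100", "0", "100", "chr2", "1000000", "7000", "7500",
    "3", "40,30,30,", "0,40,70,", "7000,7240,7470,"])

print(psl_score(ungapped))  # 100
print(psl_score(gapped))    # 98
```

Both lines tie on the "match" column, but the two gaps on the target
side lower the score of the second one.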
> - The order in which the alignments appear in a psl-file, has this
> something to do with the quality of the alignment?
No, the alignments in the psl file are not sorted.
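Because the order carries no meaning, picking one alignment per query
means looking at every line. A minimal Python sketch, assuming
tab-separated PSL with the standard 21 columns; the function name and
the default ranking (the "match" column) are illustrative choices, and
any other scoring function can be passed as key:

```python
def best_per_query(lines, key=lambda f: int(f[0])):
    """Return {qName: fields} holding the top-ranked alignment per query.

    `key` takes the list of PSL fields for one alignment and returns a
    number; the default ranks by the "match" column (field 0).
    """
    best = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        # Skip the optional PSL header and blank lines: data lines have
        # 21 fields and start with an integer match count.
        if len(f) < 21 or not f[0].isdigit():
            continue
        q_name = f[9]
        # Strict ">" means the first alignment seen wins a tie.
        if q_name not in best or key(f) > key(best[q_name]):
            best[q_name] = f
    return best
```

The function accepts any iterable of lines, so an open file handle for
one of your per-sequence psl files can be passed in directly.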
There are some tools in the Kent Source tree that are useful for sorting
psl files based on different criteria. Info on downloading the source
is here:
http://genome.ucsc.edu/FAQ/FAQlicense.html#license3
The utility "pslReps" is one of these programs. Here is the usage
statement for pslReps (usage statements for any of the utilities in the
source tree can be obtained by running the command with no arguments):
================
pslReps - analyse repeats and generate genome wide best
alignments from a sorted set of local alignments
usage:
    pslReps in.psl out.psl out.psr
where in.psl is an alignment file generated by psLayout and
sorted by pslSort, out.psl is the best alignment output
and out.psr contains repeat info
options:
    -nohead            don't add PSL header
    -ignoreSize        Will not weigh in favor of larger alignments so much
    -noIntrons         Will not penalize for not having introns when
                       calculating size factor
    -singleHit         Takes single best hit, not splitting into parts
    -minCover=0.N      minimum coverage to output. Default is 0.
    -ignoreNs          Ignore 'N's when calculating minCover.
    -minAli=0.N        minimum alignment ratio
                       default is 0.93
    -nearTop=0.N       how much can deviate from top and be taken
                       default is 0.01
    -minNearTopSize=N  Minimum size of alignment that is near top
                       for alignment to be kept. Default 30.
    -coverQSizes=file  Tab-separate file with effective query sizes.
                       When used with -minCover, this allows polyAs
                       to be excluded from the coverage calculation
==================
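The -nearTop option can be pictured as a tolerance below the best score.
The sketch below applies that idea to whole alignments of a single
query. It is a simplification for illustration only, not a
reimplementation of pslReps: the function name, the (score, label)
input format, and reading 0.01 as a fraction of the top score are all
assumptions.

```python
def keep_near_top(scored, near_top=0.01):
    """Keep the alignments of ONE query that score close to the best one.

    `scored` is a list of (score, label) pairs; `near_top` is the allowed
    fractional drop from the top score (an assumed interpretation).
    """
    if not scored:
        return []
    top = max(score for score, _ in scored)
    cutoff = top * (1.0 - near_top)
    return [(score, label) for score, label in scored if score >= cutoff]

# Made-up scores for three alignments of the same query.
hits = [(980, "chr1:5000"), (975, "chr7:1200"), (900, "chrX:300")]
print(keep_near_top(hits))                 # keeps the 980 and 975 hits
print(keep_near_top(hits, near_top=0.0))   # keeps only the 980 hit
```

A small tolerance keeps near-ties, so a query that aligns almost equally
well in two places is reported as ambiguous instead of being assigned
silently to one of them.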
Here is part of the usage statement for a different utility,
"pslCDnaFilter":
==========
Filter cDNA alignments in psl format. Filtering criteria are
comparative, selecting near best in genome alignments for each
given cDNA and non-comparative, based only on the quality of an
individual alignment.
WARNING: comparative filters require that the input is sorted by
query name. The command: 'sort -k 10,10' will do the trick.
Each alignment is assigned a score that is based on identity and
weighted towards longer alignments and those with introns. This
can do either global or local best-in-genome selection. Local
near best in genome keeps fragments of an mRNA that align in
discontinuous locations from other fragments. It is useful for
unfinished genomes. Global near best in genome keeps alignments
based on overall score.
==========
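A filter that reads the file in one pass can only compare alignments of
the same query if they sit next to each other, which is presumably why
the sort is required. Here is a small Python sketch of the same idea,
assuming tab-separated PSL where qName is column 10 (index 9 when
counting from zero); the function name is hypothetical:

```python
from itertools import groupby

def alignments_by_query(lines):
    """Yield (qName, [fields, ...]) for each query in a set of PSL lines."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # Keep data lines only (21 fields, integer match count first).
    rows = [r for r in rows if len(r) >= 21 and r[0].isdigit()]
    # Same ordering idea as `sort -k 10,10`: order by qName (index 9).
    rows.sort(key=lambda r: r[9])
    # groupby only merges adjacent items, hence the sort above.
    for q_name, group in groupby(rows, key=lambda r: r[9]):
        yield q_name, list(group)
```

Each yielded group holds all alignments for one query, ready to be
ranked with whatever criterion you choose.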
A handy list of the Kent source utilities and brief descriptions for
each one is available on our wiki site:
http://genomewiki.ucsc.edu/index.php/Kent_source_utilities
I hope this information is helpful. If you have questions in the
future, please do not hesitate to contact us again.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
Anton Kratz wrote:
> Hi,
>
> I am aligning some 10.000s of sequences against the human genome with BLAT
> gfServer/gfClient. One sequence = one psl outputfile. Often I get several
> alignments for each sequence. I would like to know how to best interpret
> (parse) the psl-output; I only want to select one single (the best)
> alignment. I am unsure how to determine which alignment is most probably the
> "real" position of my sequence.
>
>
>
> - Can I just take the alignment that has the highest number in the
> "match"-column and take this entry as the "best" alignment?
>
>
>
> - The order in which the alignments appear in a psl-file, has this
> something to do with the quality of the alignment?
>
>
>
> - Why does the UCSC BLAT-website produce a score and percentage value
> (unlike the standalone BLAT), and what is the idea behind these values? I
> tried to understand what score and percentage value mean from looking at the
> source-code in the FAQ-Entry (how to reproduce those values), but do not
> understand the actual meaning of the values.
>
>
> Thanks a lot,
> Anton