[Genome] BLAT client/server and parameters
Kayla Smith
kayla at soe.ucsc.edu
Mon Jul 30 14:05:04 PDT 2007
Hello Vikram,
Answers are within:
> 1) Is it possible to use fastMap, the 11.ooc, and masking options of a
> 2bit database file with gfClient, or are these parameters only
> restricted to command line BLAT?
-fastMap is only available in standalone blat. It does not work
for very short sequences (e.g. < 40 bp), so don't use it for those.
The 11.ooc file just helps standalone blat load up faster.
It is not used by gfServer/gfClient, since the server needs to load only once.
gfServer (used with gfClient) has masking options, too.
We ignore masking for most DNA BLAT uses.
Because of the over-used tile effect, alignments cannot
be seeded off of very short sequences that consist of over-used tiles.
However, that is not the same as the soft- or hard-masking
of the input files.
> Are there any speed advantages in using nib vs. 2bit databases?
.2bit will be a little bit faster to load up
when starting. It is the preferred format.
But .nib works too. Since each .nib can only hold one sequence,
if your assembly uses thousands of scaffolds, .2bit is the only
practical format.
> 2) If I use a 2bit database (~800 Mb) of the human genome, why does it
> take up >3.1Gb memory when sending a query of 1000 sequences, while with
> gfServer it only uses ~1Gb memory?
gfServer/gfClient were split in part to reduce the memory requirements.
gfServer does not store the actual sequence; it just indexes the
tiles. gfClient has to create finished alignments, starting from the
regions suggested by gfServer, and that takes plenty of code.
So gfClient has to load the actual sequences of the genome into
RAM. If you add up the memory used by gfServer and gfClient,
it will be about the same as is used by standalone blat.
Since gfServer and gfClient are often run on different machines,
neither machine needs quite so much RAM.
For both gfServer/gfClient and standalone blat, you can reduce memory
requirements by blatting just one chromosome at a time, and then
post-processing the psls to sort and filter them.
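
As a rough illustration of the post-processing step, here is a small
Python sketch that combines per-chromosome PSL output and sorts it by
query name and then by match count. It assumes one tab-separated PSL
file per chromosome; read_psl_rows and merge_psl are hypothetical
helper names, not part of the BLAT tools. The pslSort utility is the
more usual way to do this; the sketch only shows the idea.

```python
def read_psl_rows(lines):
    """Yield PSL data rows as lists of fields, skipping header lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 tab-separated columns and begin with an
        # integer (the match count); psLayout header lines do not.
        if len(fields) == 21 and fields[0].isdigit():
            yield fields


def merge_psl(sources):
    """Combine rows from several per-chromosome PSL inputs.

    Each source is any iterable of lines (an open file works).
    Rows are sorted by query name (column 10), then by the number of
    matching bases (column 1), highest first.
    """
    rows = [row for lines in sources for row in read_psl_rows(lines)]
    rows.sort(key=lambda f: (f[9], -int(f[0])))
    return rows
```

With open file handles for chr1.psl, chr2.psl and so on (file names
assumed), merge_psl([fh1, fh2]) returns all rows grouped by query with
the best-matching hit for each query first.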
> If I upload the database from
> command line, it reports that I have 33 sequences in the database, while
> in reality I generated the 2bit file using the faToTwoBit command with
> only my 24 chromosomes in fasta format.
~> blat /gbdb/hg18/nib/hg18.2bit test.fa test.psl
Loaded 3107677273 letters in 49 sequences
How did you make your 2bit? We see 49 sequences.
There are chrM and chrN_random and some chrN_hapN chromosomes.
Try this:
twoBitInfo hg18.2bit output.tab
And then check the output.tab to see what sequences are in the .2bit file.
Is it possible that the query.fa input file has 33 sequences?
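
If it helps, the check of output.tab could be done with a few lines of
Python. This is only a sketch: it assumes output.tab holds one
"name<TAB>size" line per sequence, and read_two_bit_info and
extra_sequences are hypothetical helper names, not UCSC utilities.

```python
def read_two_bit_info(lines):
    """Parse 'name<TAB>size' lines into a {name: size} dict."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        name, size = line.split("\t")[:2]
        sizes[name] = int(size)
    return sizes


def extra_sequences(sizes, expected):
    """Return the names that are in the file but were not expected."""
    return sorted(name for name in sizes if name not in expected)
```

For example, after sizes = read_two_bit_info(open("output.tab")),
len(sizes) gives the sequence count, and extra_sequences(sizes,
expected) with expected set to your 24 chromosome names lists anything
else that ended up in the .2bit file.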
> 3) Why are the results different when I output in blast8 vs. psl
> format? For most input sequences, I get many more hits with blast8.
> For blast8 format, why do I receive sequences with ~70% homology when
> the default for minIdentity is 90%?
Blast only aligns single exons. Blat can automatically chain parts
into a whole. It knows about splice junctions, too. When blat
filters based on identity and score, it is filtering on the entire
alignment, not on the individual exons or parts.
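
To see why a single part can look worse than the alignment it belongs
to, here is a toy calculation in Python. The block numbers are made up
for illustration, and the identity formula is deliberately simplified
(matches over matches plus mismatches, gaps ignored); it is not the
exact calculation blat uses internally.

```python
def percent_identity(matches, mismatches):
    """Simplified identity: matching bases over all aligned bases."""
    return 100.0 * matches / (matches + mismatches)


def whole_alignment_identity(blocks):
    """Identity over a chained alignment.

    blocks is a list of (matches, mismatches) pairs, one per part.
    """
    total_matches = sum(m for m, _ in blocks)
    total_mismatches = sum(x for _, x in blocks)
    return percent_identity(total_matches, total_mismatches)


# Illustrative numbers only: three parts of one chained alignment.
blocks = [(200, 4), (14, 6), (300, 6)]
per_part = [percent_identity(m, x) for m, x in blocks]
overall = whole_alignment_identity(blocks)
```

Here the middle part on its own is 70% identical, but the chained
alignment as a whole is well above 90%, so a filter applied to the
whole alignment keeps it.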
> If I change the minScore and minIdentity while outputting in blast8
> format, the output remains exactly the same. Why is this so?
Can you provide an example of this?
> I also get a lot of spurious results of reads
> matching <10 bases with 100% homology in the results. How do I block
> out results below a certain length?
We suggest using pslReps or pslCdnaFilter to filter psls. They are
flexible utilities with many options.
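
If you only need a simple length cutoff on psl output, something like
the following Python sketch would also do. It is not a replacement for
pslReps or pslCdnaFilter; filter_psl_by_length is a hypothetical helper
name, the cutoff of 20 bases is an arbitrary example value, and header
lines are dropped from the result.

```python
def aligned_bases(fields):
    """Bases inside aligned blocks: match + misMatch + repMatch + nCount."""
    return sum(int(fields[i]) for i in range(4))


def filter_psl_by_length(lines, min_aligned=20):
    """Keep PSL data lines whose aligned length is at least min_aligned."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip anything that is not a 21-column data row (e.g. the header).
        if len(fields) != 21 or not fields[0].isdigit():
            continue
        if aligned_bases(fields) >= min_aligned:
            kept.append(line)
    return kept
```

Calling filter_psl_by_length(open("test.psl"), min_aligned=20) returns
the lines to keep, which can then be written to a new file.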
I hope this information is helpful to you. Please don't hesitate to
contact us again if you require further assistance.
Kayla Smith
UCSC Genome Bioinformatics Group