[Genome] Trouble with BLAT to align SNPs

Galt Barber galt at soe.ucsc.edu
Mon Feb 4 10:36:05 PST 2008


1. Question about noise, i.e. low-quality alignments in output.

Use pslReps and pslCDnaFilter to apply filtering criteria
to the PSL output of BLAT.  This will let you get
rid of low-quality alignments.  It works much more reliably
and flexibly than the -minScore or -minIdentity BLAT command-line
options.

Here is our BLAT FAQ which has lots of handy info:
 http://genome.ucsc.edu/FAQ/FAQblat
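
To give a feel for what post-filtering PSL output means, here is a
minimal Python sketch that keeps only alignments spanning a chosen
fraction of the query.  It is not what pslReps or pslCDnaFilter do
internally; the function names and the 0.5 threshold are made up for
illustration, and only the standard 21-column PSL layout is assumed.

```python
# Illustrative stand-in for PSL post-filtering; NOT the pslReps algorithm.
# Assumes the standard 21-column, tab-separated PSL layout.

def psl_rows(lines):
    """Yield PSL data rows as field lists, skipping header lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 columns and begin with an integer match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        yield fields


def query_coverage(fields):
    """Return (qEnd - qStart) / qSize for one PSL row."""
    q_size, q_start, q_end = int(fields[10]), int(fields[11]), int(fields[12])
    return (q_end - q_start) / q_size if q_size else 0.0


def filter_psl(lines, min_coverage=0.5):
    """Keep rows whose aligned query span is at least min_coverage.

    min_coverage is a hypothetical threshold chosen for this example.
    """
    return ["\t".join(f) for f in psl_rows(lines)
            if query_coverage(f) >= min_coverage]
```

A short hit confined to a low-complexity stretch covers only a small
part of the query, so a coverage cutoff of this kind would drop it
while keeping the full-length alignment.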


2. Question about why the first FASTA record seems to be skipped.

It appears that you may have a leading space in the first line
before the ">" character, as in the example header you quoted.
Apparently BLAST tolerates it but BLAT does not.
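
One way to clean this up before running BLAT is to strip whitespace
in front of ">" on header lines.  A minimal Python sketch follows; the
function name is made up for illustration, and it assumes the file is
small enough to handle as a list of lines.

```python
def fix_fasta_headers(lines):
    """Strip whitespace before '>' on header lines; leave other lines as-is."""
    fixed = []
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Header line: drop anything that preceded the '>' character.
            fixed.append(stripped)
        else:
            fixed.append(line)
    return fixed
```

For example, a header given as " >25.m01838|hypothetical protein" comes
back with the ">" in the first column, and sequence lines are unchanged.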


-Galt


On Sun, 27 Jan 2008, Michael Reese wrote:

> Dear BLAT gurus,
>
>     I'm pretty new to BLAT, and am attempting to use it to align gene
> models from one strain of a parasite to the genomes of other strains
> that have  been sequenced.  The genome is relatively small (~60Mb), so
> it's no problem to hold it all in memory.  I used seg to mask low
> complexity regions in both my genome (nucleotide) and query sequences
> (protein) with lower case.  BLAT works great to align things when
> there's no low complexity.  But as soon as I have stretches of low
> complexity, I end up with a bunch of noise.  I've tried the various
> flags (mask=; qMask=; repeats= lower) for masking, with very little
> effect.  If I use the "mask" flag, I see that the output is represented
> as "X" in a blast output style alignment, but I don't see any
> improvement in the noise.  I see no change when I use the qMask or
> repeats flags.
>
>     I'm using standalone BLAT v33x5.  My database and input sequences
> are both in fasta format, as faToNib tells me it'll only work on single
> sequences, and I really need to do everything in batch.  I've assumed
> the file format isn't a problem, since the help file suggests that it
> deals with all file types equivalently.
>
>     An unrelated problem:  BLAT is ignoring the first entry in my input
> (multi-seq) fasta file.  (BLAST does not and finds the seq in the genome
> no problem, so I'm unsure as to what that problem is).
>
> An example of my input (lower-case masked):
>
>  >25.m01838|hypothetical protein
> MEGTQPTCYCTVLRGGIGLLRQYSTEKAMTAGREQLLHTEDAADRVLHPTSDSLLTLQLV
> LTKGTTTFRQAYILGTALPFLDLSYHNIFLDNATTRANLFLHAPAYIWTGGWDTGIplvp
> flllaplasllvYAFLWKSLWRPLKREEERTGRLALTRGEDGKKALHSPSPKPFSLSLVP
> LYASDLLSHLGPFLAKHVESRGVCKRALSELQDPLLAREAKGSALAGETDDETEGHRGKK
> TQDTGTELKQQRDKTPLEIVLSVFVGENLPEGIVKQSTEWLLELSCLCTILRHTPPEDRE
> VPYDLLLRLVNANASLWAHEPSFEELRGRILLKLLENTSNALTVYDREAPQLQARSVESD
> SGHQTEKDELSVPEKVLSSDHGSSADLAERASAKNesspehassssapsdasdVETQEEK
> KGTQDTPRLVLDLLLHPGPYTTPDSIFLQLWPVGQRIRAGEPEHALRRAVKLLRSFVEAR
> DEARISERHASDELRENsvslsssassfwsKLRGRESRKQAGEKDASGDYGAQDVSPLSA
> ALLLCDTSLpsprrsppWRMKlllreletdllcvlSVVFLGVFSRKSGGESLFGEVGKIA
> LTGIRALRGLAILEAAYHLETNFIHSPAYYEATDSDMIKQSLGLVTFNGLLAAAVFRTHK
> YALLPFFLLRIRDMFAMDFRI
>
> In addition to a hit that returns the entire sequence (in its two
> exons), I get a series of things that hit to the low complexity regions,
> e.g. (in blast output format):
>
> Query: 9      SSPFAFSSSHSASPSS 24
>               SS  + SSS SASPSS
> Sbjct: 685739 SSSSSASSSSSASPSS 685786
>
> Any hints I could get would be greatly appreciated.
>
> Regards,
> Michael
>
> --
>
> Michael Reese, Ph.D.
> Postdoctoral Scholar
> Stanford University
> Dept. Microbiology & Immunology
> Fairchild D305
> 300 Pasteur Dr. (299 Campus Dr. for Courier)
> Stanford CA, 94305-5124
> Phone: (650)-723-7884
> Fax: (650)-723-6853
>