[Genome] Trouble with BLAT to align SNPs
Galt Barber
galt at soe.ucsc.edu
Mon Feb 4 10:36:05 PST 2008
1. Question about noise, i.e. low-quality alignments in the output.
Use pslReps and pslCDnaFilter to apply filtering criteria
to the psl output of BLAT. This lets you remove the
low-quality alignments after the fact. It is more reliable
and more flexible than the -minScore or -minIdentity blat
command-line options.
Here is our BLAT FAQ which has lots of handy info:
http://genome.ucsc.edu/FAQ/FAQblat
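To illustrate the idea of post-filtering psl output from item 1, here is a minimal Python sketch. It is not what pslReps or pslCDnaFilter do internally. It only assumes the standard 21-column tab-separated psl layout and applies two simple, hypothetical thresholds (min_identity and min_coverage, names and default values chosen for this example) to drop short hits such as the low-complexity matches described in the quoted message.

```python
def parse_psl_line(line):
    """Parse one psl data line; return None for header or malformed lines."""
    fields = line.rstrip("\n").split("\t")
    # Data lines have 21 tab-separated columns and start with a match count.
    if len(fields) < 21 or not fields[0].isdigit():
        return None
    return {
        "matches": int(fields[0]),
        "mismatches": int(fields[1]),
        "rep_matches": int(fields[2]),
        "q_name": fields[9],
        "q_size": int(fields[10]),
        "q_start": int(fields[11]),
        "q_end": int(fields[12]),
        "t_name": fields[13],
        "line": line.rstrip("\n"),
    }


def passes(rec, min_identity=0.9, min_coverage=0.5):
    """Hypothetical filter: fraction of matching aligned positions and
    fraction of the query spanned by the alignment."""
    aligned = rec["matches"] + rec["mismatches"] + rec["rep_matches"]
    if aligned == 0 or rec["q_size"] == 0:
        return False
    identity = (rec["matches"] + rec["rep_matches"]) / aligned
    coverage = (rec["q_end"] - rec["q_start"]) / rec["q_size"]
    return identity >= min_identity and coverage >= min_coverage


def filter_psl(lines, min_identity=0.9, min_coverage=0.5):
    """Return the psl data lines that pass both thresholds."""
    kept = []
    for line in lines:
        rec = parse_psl_line(line)
        if rec is not None and passes(rec, min_identity, min_coverage):
            kept.append(rec["line"])
    return kept
```

For real work the UCSC tools remain the better choice, since they also compare competing alignments of the same query, which this sketch does not attempt.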
2. Question about why the first FASTA record seems to be skipped.
It appears that you may have a leading space in the first line
before the ">" character. BLAST apparently tolerates this, but
BLAT does not.
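A quick way to check for the leading-whitespace problem from item 2 is sketched below in Python. The function names are made up for this example. It reports the line numbers of FASTA headers where whitespace precedes the ">" character and can return a cleaned copy of the lines.

```python
def find_malformed_headers(lines):
    """Return 1-based line numbers where '>' is preceded by whitespace."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.lstrip()
        # A header is malformed if stripping leading whitespace changed it.
        if stripped.startswith(">") and stripped != line:
            bad.append(number)
    return bad


def fix_headers(lines):
    """Strip leading whitespace from header lines; leave other lines as-is."""
    return [
        line.lstrip() if line.lstrip().startswith(">") else line
        for line in lines
    ]
```

To use it on a file, read the lines with open(path).readlines(), pass them to find_malformed_headers, and write the result of fix_headers to a new file before running BLAT again.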
-Galt
On Sun, 27 Jan 2008, Michael Reese wrote:
> Dear BLAT gurus,
>
> I'm pretty new to BLAT, and am attempting to use it to align gene
> models from one strain of a parasite to the genomes of other strains
> that have been sequenced. The genome is relatively small (~60Mb), so
> it's no problem to hold it all in memory. I used seg to mask low
> complexity regions in both my genome (nucleotide) and query sequences
> (protein) with lower case. BLAT works great to align things when
> there's no low complexity. But as soon as I have stretches of low
> complexity, I end up with a bunch of noise. I've tried the various
> flags (mask=; qMask=; repeats= lower) for masking, with very little
> effect. If I use the "mask" flag, I see that the output is represented
> as "X" in a blast output style alignment, but I don't see any
> improvement in the noise. I see no change when I use the qMask or
> repeats flags.
>
> I'm using standalone BLAT v33x5. My database and input sequences
> are both in fasta format, as faToNib tells me it'll only work on single
> sequences, and I really need to do everything in batch. I've assumed
> the file format isn't a problem, since the help file suggests that it
> deals with all file types equivalently.
>
> An unrelated problem: BLAT is ignoring the first entry in my input
> (multi-seq) fasta file. (BLAST does not and finds the seq in the genome
> no problem, so I'm unsure as to what that problem is).
>
> An example of my input (lower-case masked):
>
> >25.m01838|hypothetical protein
> MEGTQPTCYCTVLRGGIGLLRQYSTEKAMTAGREQLLHTEDAADRVLHPTSDSLLTLQLV
> LTKGTTTFRQAYILGTALPFLDLSYHNIFLDNATTRANLFLHAPAYIWTGGWDTGIplvp
> flllaplasllvYAFLWKSLWRPLKREEERTGRLALTRGEDGKKALHSPSPKPFSLSLVP
> LYASDLLSHLGPFLAKHVESRGVCKRALSELQDPLLAREAKGSALAGETDDETEGHRGKK
> TQDTGTELKQQRDKTPLEIVLSVFVGENLPEGIVKQSTEWLLELSCLCTILRHTPPEDRE
> VPYDLLLRLVNANASLWAHEPSFEELRGRILLKLLENTSNALTVYDREAPQLQARSVESD
> SGHQTEKDELSVPEKVLSSDHGSSADLAERASAKNesspehassssapsdasdVETQEEK
> KGTQDTPRLVLDLLLHPGPYTTPDSIFLQLWPVGQRIRAGEPEHALRRAVKLLRSFVEAR
> DEARISERHASDELRENsvslsssassfwsKLRGRESRKQAGEKDASGDYGAQDVSPLSA
> ALLLCDTSLpsprrsppWRMKlllreletdllcvlSVVFLGVFSRKSGGESLFGEVGKIA
> LTGIRALRGLAILEAAYHLETNFIHSPAYYEATDSDMIKQSLGLVTFNGLLAAAVFRTHK
> YALLPFFLLRIRDMFAMDFRI
>
> In addition to a hit that returns the entire sequence (in its two
> exons), I get a series of things that hit to the low complexity regions,
> e.g. (in blast output format):
>
> Query: 9      SSPFAFSSSHSASPSS 24
>               SS  + SSS SASPSS
> Sbjct: 685739 SSSSSASSSSSASPSS 685786
>
> Any hints I could get would be greatly appreciated.
>
> Regards,
> Michael
>
> --
>
> Michael Reese, Ph.D.
> Postdoctoral Scholar
> Stanford University
> Dept. Microbiology & Immunology
> Fairchild D305
> 300 Pasteur Dr. (299 Campus Dr. for Courier)
> Stanford CA, 94305-5124
> Phone: (650)-723-7884
> Fax: (650)-723-6853
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>