[Genome] Question on missing alignment
Galt Barber
galt at soe.ucsc.edu
Fri Jul 20 17:02:21 PDT 2007
As a follow-up, I searched for repName 'L1PA3'
with the Table Browser tool (http://genome.ucsc.edu/cgi-bin/hgTables)
on hg18, chose group "Variation and Repeats",
chose track "RepeatMasker",
defaults to table "rmsk",
defaults to "genome" position,
clicked "filter" button and set repName to 'L1PA3',
clicked "summary/statistics" button.
Results show how many instances are identified by repeatmasker:
item count 10,418
If you want a bed file instead you could
set the output format to BED and click
the "get output" button.
In which case, gfServer/gfClient did a pretty
good job of finding these.
-Galt
On Fri, 20 Jul 2007, Galt Barber wrote:
>
> Hi, Jin!
> I have looked at your example sequence.
>
> cat majin.fa
> >YourSeq
> GCACATGTATACATATGTAACTAACCTGCACATTGTGCACGTGTACCCTAAAACTTAAAG
>
> Note the chr5 region missing is right in the middle of a LINE repeat
> Name: L1PA3
> Family: L1
> Class: LINE
> SW Score: 7722
> Divergence: 2.1%
> Deletions: 0.1%
> Insertions: 0.0%
> Begin in repeat: 5069
> End in repeat: 6150
> Left in repeat: 5
> Position: chr5:143868520-143869600
> Band: 5q32
> Genomic Size: 1081
> Strand: -
>
> gfServer/(gfClient or hgBlat) does not find all 32 perfect
> matches simply because the default -repMatch paramater
> is not high enough if you want to find repetitive sequences.
> When a tile is over-used and exceeds the repMatch,
> it stops taking in new positions. The first thing to
> do would be to increase -repMatch to a large value.
>
> The second thing required is to increase the number
> of alignments that gfServer will send back to the
> client, because normally this number is 100 or 200
> depending, so this is the -maxDnaHits parameter
> that must be increased.
>
> Note that because it is a LINE, although there are only 32
> perfect matches with no gaps, there are at least 13,600
> matches with 90% identity (only 6 or fewer mismatches
> out of 60).
>
> gfServer -tileSize=11 -stepSize=5 -repMatch=16384 -maxDnaHits=10000
> -canStop start localhost 3000 /gbdb/hg18/nib/hg18.2bit&
>
> gfClient -q=rna -t=dna localhost 3000 / majin.fa majin.psl
> got 13600 results
>
> I suppose that one might get even more by increasing
> the -repMatch and -maxDnaHits parameters further.
>
> Standalone blat did not return as many hits,
> but it did return all 32 perfect matches:
>
> blat -tileSize=11 -stepSize=5 -repMatch=16384 /gbdb/hg18/nib/hg18.2bit
> majin.fa majinb2.psl
> got 900 hits only.
>
> In this case, the gfServer/gfClient seems more capable
> of handling repetitive matches than blat, which is not
> what I expected. Both are using the same repMatches
> and the same default score and identity.
>
> So these alignments that do not appear in your output
> are not lacking because of some masking of the sequence.
> In fact, it is just tile-overuse.
>
> The system runs noticeably slower when revved up
> with these extra repetitive-tolerant parameters.
>
> I think the main point here is that BLAT index is designed
> to quickly find some unique gold out of all the genome.
> But you are trying to find dirt that is virtually everywhere.
>
> You are welcome to tune the parameters as you like, but
> BLAT is not optimized for finding repeats. There are
> other tools in the world for that.
>
> I think the same answer will apply also to your
> second and third questions below.
>
> -Galt
>
>
> On Thu, 19 Jul 2007, Ma, Jin wrote:
>
> > Dear browser colleagues,
> >
> > I need you help in explaining the following missing alignments. I will
> > show 3 examples with the first in detail
> >
> > 1) I used some program to align the following sequence to the hg18
> > assembly and got 32 hits but only 27 from BLAT (our internal ucsc
> > browser, while only 22 from the public ucsc browser). I picked 2 of the
> > missing alignment and found the sequence from the browser and it looks
> > like a perfect alignment. I was just wondering why BLAT didn't report
> > them.
> > # original sequence
> > GCACATGTATACATATGTAACTAACCTGCACATTGTGCACGTGTACCCTAAAACTTAAAG
> > # sequence at chr2:194,853,440-194,853,501 reading from 3' end to the 5'
> > end for the reverse strand
> > CGTGTACATATGTATACATTGATTGGACGTGTAACACGTGCACATGGGATTTTGAATTTC
> > # sequence at chr5:143,868,533-143,868,593 reading from 3' end to the 5'
> > end for the reverse strand
> > CGTGTACATATGTATACATTGATTGGACGTGTAACACGTGCACATGGGATTTTGAATTTC
> >
> > 2) I found 166 matches using my program for the following sequence but
> > only 43 from BLAT. I didn't check specific cases
> > TCGAGACCATCTTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAG
> >
> > 3) I found 2294 matches using my program for the following sequence but
> > only 112 from BLAT. I didn't check specific cases
> > TTGTTGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTGAATAGTGCCGCAATAAACA 0
> > chr10 96719 192278380 0.002294
> >
> > Is that caused by the repeats within the sequence that got masked by
> > BLAT? Why some got matched but not all of them? I would really
> > appreciate your help.
> >
> > Jin
> >
> >
> >
> > ------------------------------------------------------------------------------
> > Notice: This e-mail message, together with any attachments, contains
> > information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> > New Jersey, USA 08889), and/or its affiliates (which may be known
> > outside the United States as Merck Frosst, Merck Sharp & Dohme or MSD
> > and in Japan, as Banyu - direct contact information for affiliates is
> > available at http://www.merck.com/contact/contacts.html) that may be
> > confidential, proprietary copyrighted and/or legally privileged. It is
> > intended solely for the use of the individual or entity named on this
> > message. If you are not the intended recipient, and have received this
> > message in error, please notify us immediately by reply e-mail and then
> > delete it from your system.
> >
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Genome maillist - Genome at soe.ucsc.edu
> > http://www.soe.ucsc.edu/mailman/listinfo/genome
> >
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>
More information about the Genome
mailing list