[Genome] Query regarding EST Blat Alignment results
Ann Zweig
ann at soe.ucsc.edu
Fri Nov 2 16:16:45 PDT 2007
Hello Lalitha,
To generate the EST track, human ESTs from GenBank are aligned
against the genome using BLAT. Quite often, ESTs will align to more
than one location in the genome. When a single EST aligns in multiple
places, the alignment having the highest base identity is identified.
Other EST alignments having a base identity level within a certain
amount of the best alignment (call this "X") and at minimum a certain
percent base identity (call this "Y") with the genomic sequence are kept.
For ESTs in the latest human assembly (hg18), we apply the following
filtering to the BLATted results:
pslCDnaFilter -minId=0.95 -minCover=0.25 -globalNearBest=0.0025 \
    -minQSize=20 -minNonRepSize=16 -ignoreNs -bestOverlap \
    -polyASizes=ests.polya -usePolyTHead ests-in.psl ests-out.psl
Where:
ests.polya is created by faPolyASizes
globalNearBest=0.0025 == "X" in the previous paragraph
minId=0.95 == "Y" in the previous paragraph
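To make the roles of "X" and "Y" concrete, here is a minimal sketch of a near-best filter over the alignments of a single EST. It is not the pslCDnaFilter implementation: the function name nearBestFilter is hypothetical, it works directly on base identity expressed as a fraction, and it assumes that "within X of the best" means at least (1 - X) times the best value. The real tool compares a score computed in its own source, so results may differ.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the kent source tree.
 * ident[]  : base identity (0.0 to 1.0) of each alignment of ONE EST
 * minId    : "Y", the minimum identity an alignment must have
 * nearBest : "X", the allowed fractional distance from the best alignment
 *            (assumption: interpreted as ident >= best * (1 - X))
 * keep[]   : output flags, 1 = alignment kept, 0 = alignment dropped
 * Returns the number of alignments kept. */
size_t nearBestFilter(const double *ident, size_t n,
                      double minId, double nearBest, int *keep)
{
    double best = -1.0;
    size_t i, kept = 0;

    /* Find the best identity among alignments that pass minId. */
    for (i = 0; i < n; i++)
        if (ident[i] >= minId && ident[i] > best)
            best = ident[i];

    /* Keep alignments that pass minId and are near the best one. */
    for (i = 0; i < n; i++) {
        keep[i] = (best >= 0.0 &&
                   ident[i] >= minId &&
                   ident[i] >= best * (1.0 - nearBest));
        if (keep[i])
            kept++;
    }
    return kept;
}
```

With minId=0.95 and nearBest=0.0025, an EST whose alignments have identities 0.99, 0.988, 0.98 and 0.94 would keep only the first two under these assumptions.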
We keep and display all alignments that pass the filtering listed
above. The near-best filtering is used since it can be very difficult
to determine exactly which locus actually produced an EST. The
polymorphisms within the genome make it difficult to distinguish between
very similar loci. So keeping only one EST alignment might mean we are
keeping the *wrong* alignment.
We do not have a pruned data set. However, if you want to see only
one alignment for each EST, you could prune the data set yourself based
on whichever parameter you think is "best".
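One way such pruning could look is sketched below. The struct estAln and the function pruneToBest are hypothetical names used only for illustration; "quality" stands for whichever numeric parameter you choose, and ties are broken by keeping the first alignment seen. The quadratic loop is meant for clarity, not for a table with millions of rows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one alignment of an EST. */
struct estAln {
    const char *qName;   /* EST accession (query name) */
    double quality;      /* whichever value you consider "best" */
    int keep;            /* output: 1 if this alignment is retained */
};

/* Mark exactly one alignment per qName: the one with the highest
 * quality, or the earliest one in the array when qualities are equal. */
void pruneToBest(struct estAln *alns, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        alns[i].keep = 1;
        for (j = 0; j < n; j++) {
            if (j == i || strcmp(alns[i].qName, alns[j].qName) != 0)
                continue;
            /* Another alignment of the same EST wins if it is strictly
             * better, or equally good and earlier in the array. */
            if (alns[j].quality > alns[i].quality ||
                (alns[j].quality == alns[i].quality && j < i)) {
                alns[i].keep = 0;
                break;
            }
        }
    }
}
```

Because the hg18 EST alignments are split into one file per chromosome, all alignments of an EST would have to be collected from every file before a function like this is applied.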
We do not have a table that contains the BLAT score. But the EST
table you are looking at does contain all of the information needed to
compute the BLAT score.
The score used in filtering is computed in the following file in our
source tree:
kent/src/hg/pslCDnaFilter/cDnaAligns.c
A simple score function that is sometimes used is:
int pslScore(const struct psl *psl)
/* Return score for psl. */
{
int sizeMul = pslIsProtein(psl) ? 3 : 1;
return sizeMul * (psl->match + ( psl->repMatch>>1)) -
sizeMul * psl->misMatch - psl->qNumInsert - psl->tNumInsert;
}
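Since pslScore() takes a struct psl from the kent library, here is a self-contained sketch that applies the same formula directly to one row of the EST table. The function name estRowScore is hypothetical. It assumes a tab-separated row whose first column is bin, followed by the standard PSL columns (matches, misMatches, repMatches, nCount, qNumInsert, qBaseInsert, tNumInsert, tBaseInsert, ...). Please check the column order against the table schema before relying on it.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the kent source tree.
 * Parses the leading numeric columns of one EST table row and applies
 * the simple pslScore formula for nucleotide alignments (sizeMul = 1).
 * Returns 0 and stores the result in *score on success, -1 if the row
 * cannot be parsed. */
int estRowScore(const char *row, int *score)
{
    int match, misMatch, repMatch, nCount;
    int qNumInsert, qBaseInsert, tNumInsert, tBaseInsert;

    /* %*d skips the assumed leading "bin" column; whitespace in the
     * format matches the tab separators. */
    int got = sscanf(row, "%*d %d %d %d %d %d %d %d %d",
                     &match, &misMatch, &repMatch, &nCount,
                     &qNumInsert, &qBaseInsert,
                     &tNumInsert, &tBaseInsert);
    if (got != 8)
        return -1;

    *score = (match + (repMatch >> 1)) - misMatch
             - qNumInsert - tNumInsert;
    return 0;
}
```

For a made-up row beginning with bin=585, matches=300, misMatches=5, repMatches=20, nCount=0, qNumInsert=1, qBaseInsert=3, tNumInsert=2, tBaseInsert=150, this gives 300 + 10 - 5 - 1 - 2 = 302.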
The Genome Browser and Blat software are free for academic,
nonprofit, and personal use. A license is required for commercial use.
How to download the software:
http://genome.cse.ucsc.edu/FAQ/FAQlicense#license3
You can obtain the source tree either via CVS:
http://genome.ucsc.edu/admin/cvs.html
or a zip file:
http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
Please note the build instructions:
http://genome.ucsc.edu/admin/jk-install.html
All of the kent utilities print their usage message and command-line
options when run with no arguments.
I hope this information is helpful to you. Please don't hesitate
to contact the mail list again if you require further assistance.
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
Please feel free to search the Genome mailing list archives by visiting
our home page, clicking on "Contact Us", then typing a word or phrase
into the search box. On that same page
(http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome
mailing list.
Viswanath, Lalitha (NIH/NCI) [C] wrote:
> Hi
>
> NCI's caBIO (part of caBIG: http://cabig.nci.nih.gov ) exposes the
> chromosomal positions of ESTs as available in
>
> ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chr*_est.txt.gz
>
>
>
> This dataset does not provide the BLAT alignment scores and contains
> multiple mappings per EST, in some cases.
>
>
>
> Examples are AA001509, AA001126, T40080, Z99433, etc
>
>
>
> I would appreciate if you could advise whether
>
> a) a pruned data set containing only one result per EST is available or
>
> b) a dataset providing BLAT scores for the alignments is available
>
>
>
> Any input on the settings for BLAT or process to filter BLAT alignment
> results would be helpful in understanding the results.
>
>
>
> Thanks
>
> Lalitha
>
> Data Architect
>
> SAIC (NYSE: SAI)
>
> National Cancer Institute (NIH)
>
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome