[Genome] Query regarding EST Blat Alignment results

Ann Zweig ann at soe.ucsc.edu
Fri Nov 2 16:16:45 PDT 2007


Hello Lalitha,

     To generate the EST track, human ESTs from GenBank are aligned 
against the genome using BLAT.  Quite often, ESTs will align to more 
than one location in the genome.  When a single EST aligns in multiple 
places, the alignment having the highest base identity is identified. 
Other alignments of that EST are kept if their base identity is within a 
certain amount of the best alignment (call this "X") and is at least a 
certain percent base identity (call this "Y") with the genomic sequence.

     For ESTs in the latest human assembly (hg18), we apply the following 
filtering to the BLATted results:

pslCDnaFilter -minId=0.95 -minCover=0.25 -globalNearBest=0.0025 \
    -minQSize=20 -minNonRepSize=16 -ignoreNs -bestOverlap \
    -polyASizes=ests.polya -usePolyTHead ests-in.psl ests-out.psl

Where:
ests.polya is created by faPolyASizes
globalNearBest=0.0025 == "X" in the previous paragraph
minId=0.95 == "Y" in the previous paragraph
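
To make the "X" and "Y" rule concrete, here is a minimal sketch in C. It 
is not the pslCDnaFilter implementation (see the source file mentioned 
further down for that); it only illustrates the rule as described above, 
applied to the alignments of a single EST. The function name 
nearBestKeep and its arguments are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the kent source tree.
 * Marks which alignments of one EST would be kept under the rule
 * described above: identity >= minId ("Y") and identity within
 * nearBest ("X") of the best identity found for that EST.
 * ident[i] is the fractional base identity (0.0 to 1.0) of alignment i.
 * keep[i] is set to 1 if alignment i is kept, 0 otherwise.
 * Returns the number of alignments kept. */
size_t nearBestKeep(const double *ident, size_t n,
                    double minId, double nearBest, int *keep)
{
    double best = 0.0;
    size_t i, kept = 0;

    assert(n == 0 || (ident != NULL && keep != NULL));

    /* First pass: find the best identity among all alignments. */
    for (i = 0; i < n; i++)
        if (ident[i] > best)
            best = ident[i];

    /* Second pass: apply both thresholds. */
    for (i = 0; i < n; i++) {
        keep[i] = (ident[i] >= minId && ident[i] >= best - nearBest);
        kept += (size_t)keep[i];
    }
    return kept;
}
```

With minId=0.95 and nearBest=0.0025, an EST whose alignments have 
identities 0.990, 0.989, 0.980 and 0.940 would keep only the first two.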


     We keep and display all alignments that pass the filtering listed 
above.  The near-best filtering is used since it can be very difficult 
to determine exactly which locus actually produced an EST.  The 
polymorphisms within the genome make it difficult to distinguish between 
very similar loci.  So keeping only one EST alignment might mean we are 
keeping the *wrong* alignment.

     We do not have a pruned data set.  However, if you want to see only 
one alignment for each EST, you could prune the data set yourself based 
on whichever parameter you think is "best".
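
One way such pruning could look is sketched below in C. The struct 
estAln and the function pruneToBest are hypothetical names, the "score" 
field stands for whatever measure you choose, and the sketch assumes the 
records are already grouped by EST accession (for example, sorted on it).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one EST alignment reduced to the fields needed
 * for pruning. "score" is whatever measure the user picked as "best". */
struct estAln {
    const char *qName;   /* EST accession, e.g. "AA001509" */
    int score;
};

/* Keep only the highest-scoring alignment for each EST, in place.
 * Assumes alns[] is already grouped by qName. On ties, the first
 * alignment seen is kept. Returns the new number of records. */
size_t pruneToBest(struct estAln *alns, size_t n)
{
    size_t in, out = 0;

    assert(alns != NULL || n == 0);
    for (in = 0; in < n; in++) {
        if (out > 0 && strcmp(alns[out - 1].qName, alns[in].qName) == 0) {
            /* Same EST as the last kept record: replace only if better. */
            if (alns[in].score > alns[out - 1].score)
                alns[out - 1] = alns[in];
        } else {
            /* First record for a new EST. */
            alns[out++] = alns[in];
        }
    }
    return out;
}
```

Keep in mind the caveat above: the single alignment that survives is not 
guaranteed to be the locus that actually produced the EST.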

     We do not have a table that contains the BLAT score.  But the EST 
table you are looking at does contain all of the information needed to 
compute the BLAT score.

The score used in filtering is computed by code in the following file in 
our source tree:
     kent/src/hg/pslCDnaFilter/cDnaAligns.c


     A simple score function that is sometimes used is:

int pslScore(const struct psl *psl)
/* Return score for psl. */
{
int sizeMul = pslIsProtein(psl) ? 3 : 1;

return sizeMul * (psl->match + ( psl->repMatch>>1)) -
     sizeMul * psl->misMatch - psl->qNumInsert - psl->tNumInsert;
}
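
If you want to try this arithmetic without building the source tree, a 
standalone sketch might look like the following. The struct miniPsl and 
the function miniPslScore are hypothetical stand-ins that carry only the 
columns pslScore reads, and the sketch covers nucleotide alignments only 
(sizeMul fixed at 1).

```c
/* Hypothetical cut-down record holding only the PSL columns that the
 * score function above reads; the real struct psl has more fields. */
struct miniPsl {
    unsigned match;      /* matching bases that are not in repeats */
    unsigned misMatch;   /* bases that do not match */
    unsigned repMatch;   /* matching bases that are in repeats */
    unsigned qNumInsert; /* number of inserts in the query */
    unsigned tNumInsert; /* number of inserts in the target */
};

/* Same arithmetic as pslScore() for a nucleotide alignment, that is,
 * with sizeMul fixed at 1. Protein alignments are not handled here. */
int miniPslScore(const struct miniPsl *p)
{
    return (int)(p->match + (p->repMatch >> 1))
         - (int)p->misMatch - (int)p->qNumInsert - (int)p->tNumInsert;
}
```

For example, an alignment with match=480, repMatch=20, misMatch=5, 
qNumInsert=1 and tNumInsert=2 scores 480 + 10 - 5 - 1 - 2 = 482.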


     The Genome Browser and Blat software are free for academic, 
nonprofit, and personal use. A license is required for commercial use.

How to download the software: 
http://genome.cse.ucsc.edu/FAQ/FAQlicense#license3

You can obtain the source tree either via CVS:
     http://genome.ucsc.edu/admin/cvs.html
or a zip file:
     http://hgdownload.cse.ucsc.edu/admin/jksrc.zip

Please note the build instructions:
     http://genome.ucsc.edu/admin/jk-install.html

     All of the kent utilities print their usage message and command
line options when run with no arguments.


     I hope this information is helpful to you.  Please don't hesitate 
to contact the mail list again if you require further assistance.

Regards,

----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Please feel free to search the Genome mailing list archives by visiting 
our home page, clicking on "Contact Us", then typing a word or phrase 
into the search box.  On that same page 
(http://genome.ucsc.edu/contacts.html), you can subscribe to the Genome 
mailing list.


Viswanath, Lalitha (NIH/NCI) [C] wrote:
> Hi
> 
> NCI's caBIO (part of caBIG: http://cabig.nci.nih.gov ) exposes the
> chromosomal positions of ESTs as available in 
> 
> ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chr*_est.txt.gz
> 
>  
> 
> This dataset does not provide the BLAT alignment scores and contains
> multiple mappings per EST, in some cases.
> 
>  
> 
> Examples are  AA001509, AA001126, T40080, Z99433, etc
> 
>  
> 
> I would appreciate if you could advise whether
> 
> a) a pruned data set containing only one result per EST is available or
> 
> b) a dataset providing BLAT scores for the alignments is available
> 
>  
> 
> Any input on the settings for BLAT or process to filter BLAT alignment
> results would be helpful in understanding the results.
> 
>  
> 
> Thanks
> 
> Lalitha
> 
> Data Architect
> 
> SAIC (NYSE: SAI)
> 
> National Cancer Institute (NIH)
> 
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome

