[Genome] Closest Gene

Brooke Rhead rhead at soe.ucsc.edu
Fri Aug 3 14:52:18 PDT 2007


Hello again Denise,

Another engineer has pointed out that "upstream" and "downstream" can 
have at least three different definitions, depending on which strand or 
coordinate system is used as the reference:

  1. gene based -- 5' vs. 3'
  2. blat based -- dependent on whether alignment is to top or bottom strand
  3. raw genomic coordinates -- lower txStart is 'upstream' and higher
     txStart is 'downstream'

If you are not aware of the differences, mixing these definitions can 
make results very confusing!  This is just a reminder to keep the 
different meanings in mind when you work with these data.
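
As a rough illustration of how definitions 1 and 3 can disagree, here is a minimal Python sketch. The function names and the example coordinates are made up for illustration and are not part of any UCSC tool. It assumes a gene described by txStart, txEnd and strand, as in the knownGene table, and labels a query position relative to that gene:

```python
def side_by_coordinates(pos, tx_start, tx_end):
    """Definition 3: lower coordinates are 'upstream', higher are 'downstream'."""
    if pos < tx_start:
        return "upstream"
    if pos > tx_end:
        return "downstream"
    return "inside"


def side_by_gene(pos, tx_start, tx_end, strand):
    """Definition 1: 'upstream' is the gene's 5' side, 'downstream' its 3' side."""
    side = side_by_coordinates(pos, tx_start, tx_end)
    if strand == "-" and side != "inside":
        # On the minus strand the 5' end is at txEnd, so the labels flip.
        return "downstream" if side == "upstream" else "upstream"
    return side


# Hypothetical gene at 1000-2000 on the minus strand, query position 500.
print(side_by_coordinates(500, 1000, 2000))  # upstream
print(side_by_gene(500, 1000, 2000, "-"))    # downstream
```

The same position is "upstream" by raw coordinates but "downstream" relative to a minus-strand gene, which is exactly the kind of mismatch to watch for.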

--
Brooke Rhead
UCSC Genome Bioinformatics Group



Brooke Rhead wrote:
> Hello Denise,
> 
> I have two suggestions for you from our developers, one using Perl and 
> one using our public MySQL database.  I will describe them both.
> 
> --
> 
> First, the Perl solution:
> 
> Download the knownGene table, either from the downloads server, here: 
> http://hgdownload.cse.ucsc.edu/downloads.html (go to the appropriate 
> assembly, and then look under the "annotation database" link for the 
> knownGene.txt.gz file), or from the Table Browser.
> 
> Then, follow this logic:
> 
> Build a hash that maps each chromosome to a list of [gene position,
> gene name] pairs, sorted by position.
> 
> For each position from BLAT, do a binary search for that position in
> the list for its chromosome.  If the search ends without an exact
> match, compute the distance from the BLAT position to the position
> where the search ended and to the positions adjacent to it, and pick
> the closest one.
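
The logic above can be sketched as follows, in Python rather than Perl; the same structure maps onto a Perl hash of sorted arrays. This is only a sketch under assumptions: it uses txStart as the single "gene position" (the outline does not say which coordinate to use), and the function names and the tiny example gene list are hypothetical, not real knownGene data.

```python
import bisect
from collections import defaultdict


def build_index(genes):
    """Map each chromosome to (sorted positions, names in the same order).

    genes: iterable of (chrom, position, name) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, name in genes:
        by_chrom[chrom].append((pos, name))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [n for _, n in entries])
    return index


def closest_gene(index, chrom, pos):
    """Return (gene position, gene name, distance), or None for an unknown chromosome."""
    if chrom not in index:
        return None
    positions, names = index[chrom]
    # Binary search: i is the first index whose position is >= pos.
    i = bisect.bisect_left(positions, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(positions)]
    best = min(candidates, key=lambda j: abs(positions[j] - pos))
    return positions[best], names[best], abs(positions[best] - pos)


# Hypothetical example data, not taken from knownGene.
genes = [("chr1", 1000, "geneA"), ("chr1", 5000, "geneB"), ("chr2", 300, "geneC")]
index = build_index(genes)
print(closest_gene(index, "chr1", 4200))  # (5000, 'geneB', 800)
```

A fuller version would probably also consider txEnd and strand, since a position inside or just past a long gene can be far from its txStart.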
> 
> --
> 
> Now, the MySQL solution.  Information on connecting to our MySQL server 
> is here: http://genome.ucsc.edu/FAQ/FAQdownloads#download29.
> 
> Using the public MySQL server:
> 
> #!/bin/sh
> 
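> # Given the position chr1:710000-720000, list the three nearest genes
> # on each side.  "Upstream" here means lower genomic coordinates and
> # "downstream" means higher ones, regardless of strand.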
> C=chr1
> S=710000
> E=720000
> 
> echo "three upstream genes from ${C}:${S}-${E}"
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
> 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
>     knownGene e,
>     kgXref j
> WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
> ORDER BY e.txEnd DESC limit 3;' hg18
> 
> echo "three downstream genes from ${C}:${S}-${E}"
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
> 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
>     knownGene e,
>     kgXref j
> WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
> ORDER BY e.txStart ASC limit 3;' hg18
> 
> This produces the following results:
> three upstream genes from chr1:710000-720000
> +------+--------+--------+------------+----------+
> | chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
> | chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
> | chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
> +------+--------+--------+------------+----------+
> three downstream genes from chr1:710000-720000
> +------+--------+--------+------------+----------+
> | chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
> | chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
> | chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
> +------+--------+--------+------------+----------+
> 
> This approach is now also documented on our wiki site, here:
> 
> http://genomewiki.ucsc.edu/index.php?title=Finding_nearby_genes
> 
> --
> 
> I hope this information helps.  Please feel free to create a genomewiki 
> account and edit or add to the wiki page(s).  If you have further 
> questions, please do not hesitate to contact us again.
> 
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
> 
> 
> Denise Mauldin wrote:
>> Hello,
>>
>> I have a list of genome positions that I got from Blat.  What's the 
>> easiest programmatic way to get the closest gene to those positions?  I 
>> currently work in Perl, but could adapt to other suggestions.
>>
>> Thanks,
>> Denise
>> _______________________________________________
>> Genome maillist  -  Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome

