[Genome] Closest Gene
Brooke Rhead
rhead at soe.ucsc.edu
Fri Aug 3 14:52:18 PDT 2007
Hello again Denise,
Another engineer has pointed out that "upstream" and "downstream" may
have at least three different definitions, based on the strand:
1. gene based -- 5' vs. 3'
2. BLAT based -- depends on whether the alignment is to the top or bottom strand
3. raw genomic coordinates -- lower txStart is 'upstream' and higher
txStart is 'downstream'
If you are not aware of these differences, the results can be very
confusing. Please keep the different meanings in mind when you work
with these data.
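
To make the difference between definitions 1 and 3 concrete, here is a
minimal Python sketch. The function name side_of_gene and the
convention labels "genomic" and "gene" are hypothetical, chosen only for
this illustration. It assumes 0-based, half-open txStart/txEnd
coordinates, and it does not attempt the BLAT-based definition.

```python
def side_of_gene(pos, tx_start, tx_end, strand, convention="genomic"):
    """Say where a query position lies relative to a gene.

    convention="genomic": lower coordinates are 'upstream' (definition 3).
    convention="gene":    'upstream' means 5' of the gene (definition 1),
                          so the answer flips for genes on the '-' strand.
    """
    if convention not in ("genomic", "gene"):
        raise ValueError("convention must be 'genomic' or 'gene'")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if tx_start <= pos < tx_end:
        return "within"
    at_lower_coordinate = pos < tx_start
    # Only the gene-based view of a minus-strand gene reverses the answer.
    flip = convention == "gene" and strand == "-"
    if at_lower_coordinate != flip:
        return "upstream"
    return "downstream"


# The same position next to a minus-strand gene gets two different labels.
print(side_of_gene(100, 200, 300, "-", "genomic"))
print(side_of_gene(100, 200, 300, "-", "gene"))
```

For a gene on the "+" strand the two conventions agree, which is why the
difference is easy to overlook.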
--
Brooke Rhead
UCSC Genome Bioinformatics Group
Brooke Rhead wrote:
> Hello Denise,
>
> I have two suggestions for you from our developers, one using Perl and
> one using our public MySQL database. I will describe them both.
>
> --
>
> First, the Perl solution:
>
> Download the knownGene table, either from the downloads server, here:
> http://hgdownload.cse.ucsc.edu/downloads.html (go to the appropriate
> assembly, and then look under the "annotation database" link for the
> knownGene.txt.gz file), or from the Table Browser.
>
> Then, follow this logic:
>
> Build a hash that maps each chromosome to a list of [gene position,
> gene name] pairs, sorted by position.
>
> For each position from BLAT, do a binary search for that position in
> the list for its chromosome. If the search ends without an exact
> match, compute the distance from the BLAT position to the final
> position reached by the search and to its adjacent positions, and
> pick the closest one.
>
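The logic above was described for Perl. As a rough illustration only,
here is the same idea as a minimal Python sketch. The helper names
build_index and closest_gene are hypothetical, and the sketch assumes
each gene is represented by a single position (txStart here), so it
ignores how far the gene extends. The three sample genes are taken from
the MySQL output shown later in this message.

```python
import bisect
from collections import defaultdict


def build_index(genes):
    """Map chromosome -> (sorted positions, names in the same order).

    genes is an iterable of (chrom, position, name) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, name in genes:
        by_chrom[chrom].append((pos, name))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [n for _, n in entries])
    return index


def closest_gene(index, chrom, pos):
    """Return (name, gene position, distance), or None for an unknown chromosome."""
    if chrom not in index:
        return None
    positions, names = index[chrom]
    # Binary search: i is the first index whose position is >= pos.
    i = bisect.bisect_left(positions, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(positions)]
    best = min(candidates, key=lambda j: abs(positions[j] - pos))
    return names[best], positions[best], abs(positions[best] - pos)


genes = [
    ("chr1", 690107, "uc001abo.1"),
    ("chr1", 752926, "uc001abp.1"),
    ("chr1", 665195, "uc001abn.1"),
]
index = build_index(genes)
print(closest_gene(index, "chr1", 715000))
```

Because only one coordinate per gene is indexed, a position that falls
inside a long gene can still be reported as closer to a neighbouring
gene's start. Indexing both txStart and txEnd is one way around that.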
> --
>
> Now, the MySQL solution. Information on connecting to our MySQL server
> is here: http://genome.ucsc.edu/FAQ/FAQdownloads#download29.
>
> Using the public MySQL server:
>
> #!/bin/sh
>
> # given position chr1:710000-720000
> # find a few of the nearest genes upstream and downstream of it
> # (upstream/downstream here means lower/higher genomic coordinates)
> C=chr1
> S=710000
> E=720000
>
> echo "three upstream genes from ${C}:${S}-${E}"
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
> 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
> knownGene e,
> kgXref j
> WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
> ORDER BY e.txEnd DESC limit 3;' hg18
>
> echo "three downstream genes from ${C}:${S}-${E}"
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e \
> 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
> knownGene e,
> kgXref j
> WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
> ORDER BY e.txStart ASC limit 3;' hg18
>
> This produces the following results:
> three upstream genes from chr1:710000-720000
> +------+--------+--------+------------+----------+
> | chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
> | chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
> | chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
> +------+--------+--------+------------+----------+
> three downstream genes from chr1:710000-720000
> +------+--------+--------+------------+----------+
> | chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
> | chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
> | chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
> +------+--------+--------+------------+----------+
>
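The two queries return candidates on each side but do not say which
gene is the single closest one. A minimal Python sketch of that last
step follows. The helper names parse_rows, gap and closest_row are
hypothetical, the input is assumed to be the bordered table format
shown above, and distances are plain coordinate differences with no
adjustment for 0-based starts.

```python
def parse_rows(text):
    """Parse '| chr1 | 690107 | ... |' lines; border lines are skipped."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        chrom, tx_start, tx_end, align_id, symbol = [
            field.strip() for field in line.strip("|").split("|")
        ]
        rows.append((chrom, int(tx_start), int(tx_end), align_id, symbol))
    return rows


def gap(start, end, tx_start, tx_end):
    """Distance between the query interval and a gene; 0 if they overlap."""
    if tx_end <= start:
        return start - tx_end
    if tx_start >= end:
        return tx_start - end
    return 0


def closest_row(rows, start, end):
    """Return (row, distance) for the gene nearest to start-end."""
    best = min(rows, key=lambda r: gap(start, end, r[1], r[2]))
    return best, gap(start, end, best[1], best[2])


output = """
+------+--------+--------+------------+----------+
| chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
| chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
+------+--------+--------+------------+----------+
"""
print(closest_row(parse_rows(output), 710000, 720000))
```

With the nearest row from each query above, the upstream gene
uc001abo.1 is 6131 bases from the query and the downstream gene
uc001abp.1 is 32926 bases away, so the upstream gene is the closest.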
> This approach is now also documented on our wiki site, here:
>
> http://genomewiki.ucsc.edu/index.php?title=Finding_nearby_genes
>
> --
>
> I hope this information helps. Please feel free to create a genomewiki
> account and edit or add to the wiki page(s). If you have further
> questions, please do not hesitate to contact us again.
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
> Denise Mauldin wrote:
>> Hello,
>>
>> I have a list of genome positions that I got from Blat. What's the
>> easiest programmatic way to get the closest gene to those positions? I
>> currently work in Perl, but could adapt to other suggestions.
>>
>> Thanks,
>> Denise
>> _______________________________________________
>> Genome maillist - Genome at soe.ucsc.edu
>> http://www.soe.ucsc.edu/mailman/listinfo/genome