[Genome] SQL query hg17

Andrew Eckert eckerta at email.chop.edu
Thu Mar 1 13:22:03 PST 2007


Great Brooke, thanks for your help!  Like you said, even when I find my
HUGO symbol in the list, there are sometimes more than one mRNA, but
thanks to the data in the refFlat table, I can calculate the length of
each mRNA for each gene, and just keep the row with the longest one. 
Here was my solution:
 
select geneName, name, chrom, txStart, txEnd, max(txEnd - txStart) as
GeneLength, concat(chrom,":",txStart,"-",txEnd) as Coordinates 
from hg17.refFlat 
where geneName in ([my list])
group by geneName;
 
Since I don't care about the splice variants I figured the longest
known mRNA must represent the entire gene.  Now since I have the
coordinates I can find the SNPs in the range of each gene.  I submitted
a list of 207 genes and got data back for every one, so this looks like
it's going to work!
 
Thanks so much for the great tips
Andrew


>>> Brooke Rhead <rhead at soe.ucsc.edu> 2/28/2007 7:30 PM >>>

Hello Andrew,

None of our tables use the HUGO symbols exclusively to identify genes
or 
specific mRNAs.  However, some tables do include many HUGO symbols, and

depending on your particular list of genes, one or more of them might 
suit your needs.

One option is to use the 'refFlat' table instead of the 'refLink'
table. 
  The refFlat.geneName field often corresponds to the HUGO gene name. 
I 
checked your 'CD4' example, and there is only one entry for it in the 
refFlat table, as opposed to 10 entries in the refLink table.  The 
refFlat table also contains the corresponding gene coordinates.

Another option is to use the 'kgXref' table, which correlates several 
gene aliases for genes in the 'knownGene' table.  If your gene names
are 
present in the kgXref.geneSymbol field, you could go from 
kgXref.geneSymbol --> kgXref.kgId, and then with the kgXref.kgId (which

is the same as knownGene.name), get a set of gene coordinates.  One of

our developers came up with a mySQL query to produce a bed 4 file from

the kgXref and knownGene tables, which you may find useful:

hgsql hg17 -N -e 'select k.chrom, k.txStart, k.txEnd, x.geneSymbol
from
kgXref x, knownGene k where x.kgId=k.name'

This query produces a list of coordinates and geneSymbols, like this:

  mysql> select k.chrom, k.txStart, k.txEnd, x.geneSymbol from
     -> kgXref x, knownGene k where x.kgId=k.name limit 2;
+-------+---------+-------+------------+
| chrom | txStart | txEnd | geneSymbol |
+-------+---------+-------+------------+
| chr1  |    4268 |  7438 | BC073913   |
| chr1  |   24416 | 25944 | AF346307   |
+-------+---------+-------+------------+
2 rows in set (0.00 sec)


Note that with either of these options, each gene symbol may have 
multiple records.  However, many of the names will only have one entry,

and so may be quite useful to you.

I hope this information is helpful.  If you have further questions, 
please do not hesitate to contact us again.

--
Brooke Rhead
UCSC Genome Bioinformatics Group


Andrew Eckert wrote:
> Hello,
> I have a couple of questions about querying the UCSC human genome
data
> (May 2004).  I'm working on a project where I have a list of about
400
> HUGO gene symbols and I want to use them to return the coordinates
> (boundaries) of each gene, as well as a list of all SNPs from dbSNP
> within those coordinates.
>  
> So far the most straightforward method I've figured out is to use a
SQL
> editor to query the hg17 database for gene name --> mRNA --> mRNA
> coordinates --> all SNPs within those coordinates.
>  
> I am having trouble because there are multiple rows for some genes. 
> For example, I queried hg17.refLink for "name like 'CD4'" and found
> three different names (CD4, Cd4, cd4), and multiple mRNAs for each
name,
> each having different coordinates.  Really I don't care about all
the
> different splice variants or different mRNAs; I just couldn't find
any
> better way to get the gene's coordinates.  Do you know of a better
way
> to accomplish this?  
>  
> I have ben trying to find the tables that contain HUGO symbols,
> thinking that it might allow me to get to one single mRNA accession
> number that I could use.  For example I looked up CD4 on the HGNC
site
> and found one single mRNA.  I'm having trouble here though because I
> can't find either of the tables that refer to HUGO symbols (hugo,
> knownMore).  I am only looking in the hg17 database though; are they
> somewhere else?
>  
> Thanks in advance for your advice!
>  
>  
>  
> Andrew Eckert
> CHOP Center for Applied Genomics
> 3615 Civic Center Blvd.
> Room 1014H
> Philadelphia, PA 19104
>  
> eckerta at email.chop.edu
> (267)426-0694
> 
> _______________________________________________
> Genome maillist  -  Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome



More information about the Genome mailing list