[Genome] Unique canonical sequence for each HGNC gene symbol
Amir Karger
akarger at CGR.Harvard.edu
Thu Jun 21 13:06:35 PDT 2007
I'm working on a high-throughput project, and we want to get "the"
exon sequence for every "gene". We've run into the old problem of
defining a gene, though. We hoped that we could get a single sequence
for every (or at least most) HGNC gene symbol.
I was excited to find the knownCanonical table, which seems to do just
that. Unfortunately, it turns out that a few hundred gene names in the
knownCanonical table map to multiple knownGene IDs.
I was able to remove some of these duplicates with a few filters:
1. Delete any gene not in chr1-22/X/Y/M (e.g., "6_cox_hap1")
2. Delete genes which are completely non-coding
3. Sometimes there's a gene at the same position on both the X and Y
chromosomes (pseudoautosomal regions). Remove one of them. (e.g., SHOX)
4. Compare UCSC's chromosome designation with HGNC's. There are a few
genes where these values are different. (E.g., DUX4 is on chr4 according
to HGNC, but UCSC has uc003izw.1 on chr4 and uc001lns.1 on chr10.)
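
The four filters above could be scripted roughly as follows. This is only
a sketch in Python, not the code I actually ran. The record layout (dicts
with symbol, kg_id, chrom, cds_start, cds_end), the function name, and the
hgnc_chrom lookup table are all hypothetical. It also assumes that a
non-coding entry can be recognized by cds_start == cds_end, that "same
position on X and Y" can be approximated by a symbol simply appearing on
both chrX and chrY, and that keeping the chrX copy is an acceptable
(arbitrary) choice. Filters 1, 2 and 4 are applied per record first, then
filter 3 per symbol.

```python
from collections import defaultdict

# Filter 1: only the assembled chromosomes are kept (no haplotype/random contigs).
STANDARD_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def filter_candidates(records, hgnc_chrom):
    """Group surviving records by gene symbol.

    records    -- iterable of dicts with keys symbol, kg_id, chrom,
                  cds_start, cds_end (hypothetical layout)
    hgnc_chrom -- dict mapping symbol -> chromosome name such as "chr4"
                  (hypothetical lookup built from HGNC data)
    """
    by_symbol = defaultdict(list)
    for rec in records:
        # Filter 1: drop anything not on chr1-22/X/Y/M.
        if rec["chrom"] not in STANDARD_CHROMS:
            continue
        # Filter 2: drop non-coding entries (assumed: empty CDS interval).
        if rec["cds_start"] == rec["cds_end"]:
            continue
        # Filter 4: drop entries whose chromosome disagrees with HGNC.
        expected = hgnc_chrom.get(rec["symbol"])
        if expected is not None and rec["chrom"] != expected:
            continue
        by_symbol[rec["symbol"]].append(rec)

    # Filter 3: for symbols present on both chrX and chrY, keep only chrX.
    result = {}
    for symbol, recs in by_symbol.items():
        chroms = {r["chrom"] for r in recs}
        if {"chrX", "chrY"} <= chroms:
            recs = [r for r in recs if r["chrom"] != "chrY"]
        result[symbol] = recs
    return result
```

Any symbol whose list still has more than one entry after this is one of
the leftover cases described below.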
I tried a few other things, but I was still stuck with about 200 genes
with multiple transcripts. EPHA10, for example, has two transcripts very
close to each other. I guess they don't overlap, so the knownCanonical
clusterer doesn't cluster them.
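
To make the "close but not overlapping" case concrete, here is a small
Python sketch that reports pairs of transcripts for one symbol that sit on
the same chromosome, do not overlap, and lie within some distance of each
other. The function names, the tuple layout (kg_id, chrom, tx_start,
tx_end), and the 100 kb default for max_gap are my own assumptions for
illustration. Coordinates are treated as half-open intervals, and
transcripts that merely touch are counted as overlapping.

```python
def interval_gap(a_start, a_end, b_start, b_end):
    """Bases between two half-open intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def nearby_nonoverlapping(transcripts, max_gap=100_000):
    """Return (id_a, id_b, gap) for same-chromosome pairs that do not
    overlap but are separated by at most max_gap bases.

    transcripts -- list of (kg_id, chrom, tx_start, tx_end) tuples for a
                   single gene symbol (hypothetical layout)
    """
    pairs = []
    for i, (id_a, chrom_a, start_a, end_a) in enumerate(transcripts):
        for id_b, chrom_b, start_b, end_b in transcripts[i + 1:]:
            if chrom_a != chrom_b:
                continue
            gap = interval_gap(start_a, end_a, start_b, end_b)
            if 0 < gap <= max_gap:
                pairs.append((id_a, id_b, gap))
    return pairs
```

A list like this would at least separate the "two adjacent pieces of
probably one gene" cases from symbols whose transcripts are far apart.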
Do you know of any way to pick a single transcript for each of these
gene symbols? Web, table browser, MySQL, Perl script, or even (gasp!)
non-UCSC? I suspect I'll have to go through the last 200 genes by hand,
but I was wondering if there might be a standard way to automate the
choices.
Thanks,
- Amir Karger
Research Computing
Life Sciences Division
Harvard University