[Genome] Table Browser / Why are there LocusLink ID denoted genes on different strands/chromosomes?

Anton Kratz anton.kratz at googlemail.com
Fri Jan 5 00:23:02 PST 2007


Hi,

for a bioinformatics project I am working on I want to make a list of all
genes (*) in the human genome (hg17, May 2004, NCBI Build 35) with these
entries:
1. some unique number identifying the gene
2. strand
3. chromosome
4. start
5. end

As a first step I got a list through the Table Browser and it has these
entries:

LocusLink ID, Known Gene ID, chromosome, strand, start, end

(LocusLink is included b/c later I want to access via LocusLink not Known
Gene).

This list has around 34000 entries (lines) in total and 17000 unique
LocusLink IDs, b/c many LocusLink IDs occur multiple times, and it looks
like this:

#hg17.knownToLocusLink.value    hg17.knownGene.name    hg17.knownGene.chrom    hg17.knownGene.strand    hg17.knownGene.txStart    hg17.knownGene.txEnd

[...]

83259    NM_032971    chrY    +    4911627    5016846
83259    NM_032972    chrY    +    4967491    5016846
83259    NM_032973    chrY    +    4967491    5653623

[...]

This is almost what I want, b/c I can merge multiple entries like those in
the example above into a single new entry like this:
83259    chrY    +    4911627    5653623

And this would be my "gene". The start is 4911627 b/c it is the minimum
txStart in this example, and the end is 5653623 b/c it is the maximum txEnd.
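
The merge described above could be sketched in Python roughly as follows.
This is only a sketch under stated assumptions: the function name
merge_transcripts and the tuple layout are hypothetical, the rows are
assumed to be already parsed into one tuple per line of the table, and
every row of a LocusLink ID is assumed to share the same chromosome and
strand (which does not hold for all entries in the list).

```python
def merge_transcripts(rows):
    """Collapse all rows of one LocusLink ID into a single interval.

    rows: iterable of (locuslink_id, known_gene_id, chrom, strand,
    tx_start, tx_end). Hypothetical layout, mirroring the table columns.
    Assumes all rows of an ID lie on the same chromosome and strand;
    the chromosome and strand of the first row seen are kept.
    """
    merged = {}
    for locus_id, _known_gene_id, chrom, strand, tx_start, tx_end in rows:
        if locus_id not in merged:
            merged[locus_id] = (chrom, strand, tx_start, tx_end)
        else:
            first_chrom, first_strand, lo, hi = merged[locus_id]
            # Keep the smallest start and the largest end seen so far.
            merged[locus_id] = (first_chrom, first_strand,
                                min(lo, tx_start), max(hi, tx_end))
    return merged


rows = [
    ("83259", "NM_032971", "chrY", "+", 4911627, 5016846),
    ("83259", "NM_032972", "chrY", "+", 4967491, 5016846),
    ("83259", "NM_032973", "chrY", "+", 4967491, 5653623),
]

for locus_id, (chrom, strand, start, end) in merge_transcripts(rows).items():
    print(locus_id, chrom, strand, start, end, sep="    ")
```

For the three rows of 83259 this gives chrY, +, 4911627, 5653623, i.e. the
merged entry shown above.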

But I have difficulties understanding many of the entries in the UCSC Known
Genes list I got through the Table Browser. For example:

1. Gene on different chromosomes:
55344    NM_018390,NM_018390,    chrX,chrY,    +,+,    132991,132991,    160020,160020,

2. Gene on different strands:
9084    NM_181880,NM_181880,    chrY,chrY,    -,+,    14535783,14606232,    14536519,14606968,
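
Entries like these two could be found systematically with a check along the
following lines. It is a sketch only: find_ambiguous_ids is a hypothetical
name, and the rows are assumed to be flattened to one placement per tuple
(locuslink_id, known_gene_id, chrom, strand, tx_start, tx_end), with the
comma-separated values of the two examples split up by hand.

```python
from collections import defaultdict


def find_ambiguous_ids(rows):
    """Return the LocusLink IDs that occur on more than one
    (chromosome, strand) combination, with those combinations sorted."""
    placements = defaultdict(set)
    for locus_id, _known_gene_id, chrom, strand, _tx_start, _tx_end in rows:
        placements[locus_id].add((chrom, strand))
    return {
        locus_id: sorted(places)
        for locus_id, places in placements.items()
        if len(places) > 1
    }


rows = [
    ("55344", "NM_018390", "chrX", "+", 132991, 160020),
    ("55344", "NM_018390", "chrY", "+", 132991, 160020),
    ("9084", "NM_181880", "chrY", "-", 14535783, 14536519),
    ("9084", "NM_181880", "chrY", "+", 14606232, 14606968),
    ("83259", "NM_032971", "chrY", "+", 4911627, 5016846),
]

for locus_id, places in find_ambiguous_ids(rows).items():
    print(locus_id, places)
```

IDs reported by such a check cannot be merged into one start/end pair by
taking the minimum and maximum, so they would need to be handled separately.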

Why are there LocusLink ID denoted genes on different strands/chromosomes?

Best,
Anton


(*) When I use the term "gene", I do not mean gene in a "true" biological
sense; I do not want to distinguish between alternative splice variants etc.
For my program, a gene needs to be s/th which has exactly one start
position, one end position, and occurs only once in the genome. Yes, I
admit that's an oversimplification.

