[Genome] Table Browser / Why are there LocusLink ID denoted genes on different strands/chromosomes?
Ann Zweig
ann at soe.ucsc.edu
Fri Jan 5 16:42:49 PST 2007
Hello Anton,
You are most of the way there; you just need a little help understanding
the data. I will address each of your two questions.
1. Gene on different chromosomes:
55344 NM_018390,NM_018390, chrX,chrY, +,+, 132991,132991, 160020,160020,
In this case, the gene appears on both the X and Y chromosomes, most likely
because it lies in a region the two chromosomes share. The Y chromosome in this
assembly contains two pseudoautosomal regions (PARs). One of them is located at
chrY:1-2692881, and the coordinates listed for this gene (132991-160020) fall
inside it.
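If you want to flag such entries automatically, here is a minimal Python sketch
of the coordinate check. It assumes only the one PAR range quoted above; the
second PAR and the matching chrX ranges are not given here, so they are left
out. The function name in_chrY_par is hypothetical, and the sketch relies on
the UCSC table convention that txStart is 0-based while txEnd is already the
1-based end.

```python
# One PAR on chrY in hg17, as quoted above (1-based, inclusive).
# Assumption: the other PAR and the chrX coordinates are not listed here.
PAR_CHROM = "chrY"
PAR_START = 1
PAR_END = 2692881


def in_chrY_par(chrom, tx_start, tx_end):
    """Return True if a knownGene-style interval lies entirely in the PAR.

    tx_start is 0-based and tx_end is end-exclusive (UCSC table convention),
    so the 1-based inclusive interval is [tx_start + 1, tx_end].
    """
    if chrom != PAR_CHROM:
        return False
    return tx_start + 1 >= PAR_START and tx_end <= PAR_END


# The chrY copy of LocusLink 55344 from the example above.
print(in_chrY_par("chrY", 132991, 160020))
# The minus-strand copy of LocusLink 9084, far outside the PAR.
print(in_chrY_par("chrY", 14535783, 14536519))
```

Entries that pass this check are expected to have a matching chrX copy, so
you may want to keep only one of the two in your final list.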
2. Gene on different strands:
9084 NM_181880,NM_181880, chrY,chrY, -,+, 14535783,14606232, 14536519,14606968,
It is not uncommon to see a copy of a gene on the opposite strand of the same
chromosome. This gene happens to lie in an inverted repeat region. You can see
that by viewing the Segmental Duplication track at this location.
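Given both cases, one way to build your list is to merge transcripts only when
they share the same LocusLink ID, chromosome, and strand, and to flag IDs that
still end up in more than one place. The Python sketch below illustrates this
under a few assumptions: the rows are already parsed into tuples, the function
names are hypothetical, and copies that sit on the same chromosome and strand
would still be merged into one span, so it is not a complete solution.

```python
from collections import defaultdict


def collapse_transcripts(rows):
    """Merge transcripts per (LocusLink ID, chromosome, strand).

    rows: iterable of (locus_id, chrom, strand, tx_start, tx_end).
    Returns a dict mapping (locus_id, chrom, strand) to (min start, max end).
    """
    spans = {}
    for locus_id, chrom, strand, start, end in rows:
        key = (locus_id, chrom, strand)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


def multi_location_ids(spans):
    """Return the IDs that map to more than one chromosome/strand pair."""
    counts = defaultdict(int)
    for locus_id, _chrom, _strand in spans:
        counts[locus_id] += 1
    return sorted(i for i, n in counts.items() if n > 1)


# Rows taken from the examples in this thread.
rows = [
    ("83259", "chrY", "+", 4911627, 5016846),
    ("83259", "chrY", "+", 4967491, 5016846),
    ("83259", "chrY", "+", 4967491, 5653623),
    ("55344", "chrX", "+", 132991, 160020),
    ("55344", "chrY", "+", 132991, 160020),
    ("9084", "chrY", "-", 14535783, 14536519),
    ("9084", "chrY", "+", 14606232, 14606968),
]

spans = collapse_transcripts(rows)
ambiguous = multi_location_ids(spans)
print(spans[("83259", "chrY", "+")])
print(ambiguous)
```

For the flagged IDs you then have to decide which copy to keep, since your
program expects each gene to occur only once in the genome.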
I hope this helps you understand the underlying data.
Regards,
----------
Ann Zweig
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
Anton Kratz wrote:
> Hi,
>
> for a bioinformatics project I am working on I want to make a list of all
> genes (*) in the human genome (hg17, May 2004, NCBI Build 35) with these
> entries:
> 1. some unique number identifying the gene
> 2. strand
> 3. chromosome
> 4. start
> 5. end
>
> As a first step I got a list through the Table Browser and it has these
> entries:
>
> LocusLink ID, Known Gene ID, chromosome, strand, start, end
>
> (LocusLink is included b/c later I want to access via LocusLink not Known
> Gene).
>
> This list has around 34000 entries (lines) in total and 17000 unique
> LocusLink IDs, b/c many LocusLink IDs occur multiple times, and it looks
> like this:
>
> #hg17.knownToLocusLink.value hg17.knownGene.name hg17.knownGene.chrom
> hg17.knownGene.strand hg17.knownGene.txStart hg17.knownGene.txEnd
>
> [...]
>
> 83259 NM_032971 chrY + 4911627 5016846
> 83259 NM_032972 chrY + 4967491 5016846
> 83259 NM_032973 chrY + 4967491 5653623
>
> [...]
>
> This is almost what I want, b/c I can concatenate such multiple entries as
> the example above to a new entry like this:
> 83259 chrY + 4911627 5653623
>
> And this would be my "gene". 4911627 b/c it is the minimum in this example
> and 5653623 b/c it is the maximum.
>
> But I have difficulties understanding many of the entries in the UCSC Known
> Genes list I got through the Table Browser. For example:
>
> 1. Gene of different chromosomes:
> 55344 NM_018390,NM_018390, chrX,chrY, +,+, 132991,132991,
> 160020,160020,
>
> 2. Gene on different strands:
> 9084 NM_181880,NM_181880, chrY,chrY, -,+, 14535783,14606232,
> 14536519,14606968,
>
> Why are there LocusLink ID denoted genes on different strands/chromosomes?
>
> Best,
> Anton
>
>
> (*) When I use the term "gene", I do not mean gene in a "true" biological
> sense; I do not want to distinguish between alternative splice variants etc.
> For my program, a gene needs to be something that has exactly one
> start position, one end position, and occurs only once in the genome. Yes, I
> admit that's an oversimplification.
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome