[Genome] gene walking & Unigene cluster mapping
Guoliang Xing
gxing at soe.ucsc.edu
Mon Apr 9 09:39:09 PDT 2007
Hi Rachel,
This is cool. Your info plus the MySQL table schema page make the SQL
join query very clear.
Now it's time for me to download the related tables and get hands-on.
Thanks a lot!
Guoliang
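
For reference, a minimal sketch of the kind of join discussed in this
thread: it links a gene-alignment table to kgXref through the kgID.
Python's built-in sqlite3 module stands in for a local MySQL server here.
The table name knownGene and the column subset shown are assumptions to
check against the table schema page, and the rows are invented
placeholders, not real annotations.

```python
import sqlite3


def genes_on_chrom(conn, chrom):
    """Return (kgID, geneSymbol, txStart, txEnd) for one chromosome,
    ordered by start position."""
    query = """
        SELECT g.name, x.geneSymbol, g.txStart, g.txEnd
        FROM knownGene AS g
        JOIN kgXref AS x ON x.kgID = g.name
        WHERE g.chrom = ?
        ORDER BY g.txStart
    """
    return conn.execute(query, (chrom,)).fetchall()


# In-memory stand-in for the downloaded tables (assumed column subset).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT,
                            txStart INTEGER, txEnd INTEGER);
    CREATE TABLE kgXref (kgID TEXT, mRNA TEXT, geneSymbol TEXT);
    -- Placeholder rows for illustration only.
    INSERT INTO knownGene VALUES ('uc001aab.1', 'chr1', '+', 5000, 9000);
    INSERT INTO knownGene VALUES ('uc001aaa.1', 'chr1', '+', 1000, 2000);
    INSERT INTO knownGene VALUES ('uc002aaa.1', 'chr2', '-', 300, 800);
    INSERT INTO kgXref VALUES ('uc001aaa.1', 'PLACEHOLDER_A', 'geneA');
    INSERT INTO kgXref VALUES ('uc001aab.1', 'PLACEHOLDER_B', 'geneB');
    INSERT INTO kgXref VALUES ('uc002aaa.1', 'PLACEHOLDER_C', 'geneC');
""")

rows = genes_on_chrom(conn, "chr1")
for kg_id, symbol, tx_start, tx_end in rows:
    print(kg_id, symbol, tx_start, tx_end)
```

Against a real local copy, the same SELECT should work in MySQL once the
table and column names are confirmed.
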
On Mon, 9 Apr 2007, Rachel Harte wrote:
> Hello Guoliang,
>
> The chrN_gold tables show the positions of the clones that were sequenced
> to create the genome assembly. They do not have an ID that relates
> directly to Known Genes. The kgID is in the kgXref table and it is
> typically the GenBank accession. The paper that you read, "The UCSC Known
> Genes" (Hsu et al., 2006), describes the old Known Genes process. This
> track is now called "Old Known Genes" on the human hg18 Genome Browser and
> the table for the gene alignments is knownGeneOld2. The kgXref
> table for Known Genes has been renamed as kgXrefOld2. This table has
> identifiers relating to other databases, e.g. RefSeq and SWISS-PROT.
>
> The new gene set, UCSC Genes, that was released last week was produced by
> a new process and it is a high quality gene set that is more up to date
> than the Known Genes track. The old Known Genes track is a year old.
> The method used to create UCSC Genes is described on the track description
> page. You can find this either by clicking on the blue/gray button at the
> left side of the track in the Browser display or by clicking on the "UCSC
> Genes" link above the track control in the "Genes and Gene Prediction
> Tracks" group below the Browser image. This new UCSC Genes track also has
> a kgXref table but the kgID has a new format. This is so that we can
> produce IDs that can be easily tracked from one release of this gene set
> to the next. The format is ucNNNXXX.1, where each N is a digit, each X is a
> lowercase letter, and the .1 is the version number. It starts with uc001aaa.1,
> then uc001aab.1, uc001aac.1 ... uc001aba.1, uc001abb.1 etc.
>
> I hope that this helps you.
>
> Rachel
>
> Rachel Harte
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu
>
>
> On Sun, 8 Apr 2007, Guoliang Xing wrote:
>
> > Hi Rachel,
> >
> > Thanks for the info. I'll use 0-based coordinates for my calculations.
> >
> > Now I am looking at the MySQL database dump at
> >
> > http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
> >
> > I've also read Fan Hsu's Bioinformatics paper "The UCSC Known Genes", so
> > it's time to get to the table details.
> >
> > My question is, which master table should I query to get the UCSC
> > Known Genes IDs (kgID) chromosome by chromosome?
> >
> > Is the chr1_gold table the correct one to start with for chr1? Is the bin
> > field equal to the kgID described in Fan's paper?
> >
> > I've copied its table structure below:
> >
> > -- Table structure for table `chr1_gold`
> > --
> >
> > CREATE TABLE chr1_gold (
> >   bin smallint(6) NOT NULL default '0',
> >   chrom varchar(255) NOT NULL default '',
> >   chromStart int(10) unsigned NOT NULL default '0',
> >   chromEnd int(10) unsigned NOT NULL default '0',
> >   ix int(11) NOT NULL default '0',
> >   type char(1) NOT NULL default '',
> >   frag varchar(255) NOT NULL default '',
> >   fragStart int(10) unsigned NOT NULL default '0',
> >   fragEnd int(10) unsigned NOT NULL default '0',
> >   strand char(1) NOT NULL default '',
> >   UNIQUE KEY chromStart (chromStart),
> >   UNIQUE KEY chromEnd (chromEnd),
> >   KEY bin (bin),
> >   KEY frag (frag(14))
> > ) TYPE=MyISAM;
> >
> > Thanks
> >
> > Guoliang
> >
> >
> > On Fri, 6 Apr 2007, Rachel Harte wrote:
> >
> > > Guoliang,
> > >
> > > The coordinates in both the BLAT output and the database tables have
> > > 0-based starts so the first base (5' most) on a chromosome or scaffold is
> > > position 0. For the ends, the positions are 1-based. You are correct in
> > > saying that the BLAT output and the database table coordinates are
> > > directly compatible in this way.
> > >
> > > Rachel
> > >
> > > Rachel Harte
> > > UCSC Genome Bioinformatics Group
> > > http://genome.ucsc.edu
> > >
> > >
> > > On Fri, 6 Apr 2007, Guoliang Xing wrote:
> > >
> > > > Hi Rachel,
> > > >
> > > > Thank you for the info. It is what I need. The next step for me is to
> > > > figure out which tables to query once you release the new annotation.
> > > >
> > > > What I have now is the BLAT result for all the probes on the Affy arrays
> > > > against human genome build 18 (hg18).
> > > >
> > > > I guess that in your annotation, the gene (or exon) boundary coordinates
> > > > are genome positions on hg18, counted from the 5'-most base as position 1.
> > > > If this is the case, I can combine my BLAT results with your gene
> > > > annotation info directly.
> > > >
> > > > Then I can query my data gene-by-gene based on unique gene IDs, or do a
> > > > "chromosome walk" from 5' to 3' and apply my statistical analysis in a
> > > > flexible way.
> > > >
> > > > Please confirm if my guess is right, and if you have additional info
> > > > please let me know.
> > > >
> > > > Thanks
> > > >
> > > > Guoliang
> > > >
> > > > On Fri, 6 Apr 2007, Rachel Harte wrote:
> > > >
> > > > > Guoliang,
> > > > >
> > > > > Currently, we have the "Known Genes" set of genes which has not been
> > > > > updated recently. However, within the next few days we are releasing a new
> > > > > gene set (UCSC Genes) that will be our new gold standard for gene
> > > > > annotation - the method for producing this gene set has changed resulting
> > > > > in an even higher quality set of annotations which will include more
> > > > > splice variants and also non-protein-coding genes.
> > > > >
> > > > > Watch out for an announcement of the release of this gene set on our home
> > > > > page. The track control will appear in the "Genes and Gene Prediction
> > > > > Tracks" group under the Genome Browser image. By clicking on the link above
> > > > > this track control, you will see a description of the track and a "View
> > > > > Table Schema" link shows the main table for this track which contains the
> > > > > alignment information for the genes. There are other connected tables that
> > > > > are also listed on this page.
> > > > >
> > > > > I hope that this helps you. Please let us know if you have further
> > > > > questions.
> > > > >
> > > > > Rachel
> > > > >
> > > > > Rachel Harte
> > > > > UCSC Genome Bioinformatics Group
> > > > > http://genome.ucsc.edu
> > > > >
> > > > >
> > > > > On Thu, 5 Apr 2007, Guoliang Xing wrote:
> > > > >
> > > > > > Dear Genome help team,
> > > > > >
> > > > > > > I want to write some Perl programs to compare different microarray
> > > > > > > signals on the same genes using UCSC Genome Browser annotation data. I'd
> > > > > > > like to download the relevant gene track info into a MySQL database and
> > > > > > > run it on a local machine.
> > > > > > >
> > > > > > > Take the Affy expression arrays U133 plus2 and U95: they have
> > > > > > > different probe coverage, but some of their probes map to the same gene.
> > > > > > > My idea is to map the probes on the two arrays to a
> > > > > > > common Unigene cluster ID, and then run my statistical comparison based on
> > > > > > > the common cluster ID, gene by gene.
> > > > > > >
> > > > > > > I know the UCSC browser has the Affy array mapping info.
> > > > > > >
> > > > > > > But I don't know which track or UCSC MySQL table is the current gold
> > > > > > > standard for gene annotation.
> > > > > >
> > > > > > Please advise.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Guoliang
> > > > > > _______________________________________________
> > > > > > Genome maillist - Genome at soe.ucsc.edu
> > > > > > http://www.soe.ucsc.edu/mailman/listinfo/genome
> > > > > >
> > > > >
> > > >
> > >
> >
>
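
A minimal sketch of the coordinate convention described in the thread
(0-based starts, 1-based ends), which amounts to treating each feature as
a half-open interval [start, end). The function names and the example
coordinates are hypothetical and only illustrate the arithmetic for
matching probe alignments to gene boundaries.

```python
def interval_length(start, end):
    """Length in bases of a feature with a 0-based start and 1-based end."""
    return end - start


def to_one_based(start, end):
    """Convert (0-based start, 1-based end) to fully 1-based, inclusive
    coordinates, as used when reading positions off a display."""
    return start + 1, end


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


# Hypothetical gene and probe alignments on the same chromosome.
gene = (1000, 2000)
probe_inside = (1990, 2015)    # shares bases 1990..1999 with the gene
probe_adjacent = (2000, 2025)  # starts exactly where the gene ends

print(interval_length(*gene))
print(to_one_based(*gene))
print(overlaps(*gene, *probe_inside))
print(overlaps(*gene, *probe_adjacent))
```

Because BLAT output and the database tables use the same convention, no
offset should be needed when comparing the two; the conversion is only
needed for fully 1-based positions.
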