[Genome] gnfAtlas2

Rachel Harte hartera at soe.ucsc.edu
Fri Jan 19 08:56:09 PST 2007


Yael,

We have been looking into the calculation performed by hgRatioMicroarray,
and I am sorry to say that it was not well documented and that the
description you found of how the log ratios are calculated was incorrect.
I am very sorry for the inconvenience and for our delay in replying to
your questions.

> A. I tried to reconstruct the normalization procedure leading to the
> values of the gnfAtlas2 table from the raw data
>in  gnfHumanAtlas2All  (I used the hg17 and hg17 fixed)
> 2. For each sample of the 158 samples in "all" I calculated the median
> of each group
>
> 3.  I run into some problems trying to calculate the log ratio of each
> entry.

There is a clump.ra file which specifies groups and the indices of the
replicates for each tissue type. Here are the first few lines of the file
that was used for gnfHumanAtlas2 data:
'fetal brain'   brain 80 81
'whole brain'   brain 56 57
     .
     .
'tonsil'        immune 36 37
All the tissues are represented in the file except those from cancerous
cell types. For each probeset p, the ratio is calculated by first
calculating the median of the replicates for each tissue as defined in the
clump.ra file, e.g. median of fetal brain replicates, median of whole brain
replicates, median of tonsil replicates, etc. Then the median of all these
per-tissue medians is calculated.
Then for each experiment x (each replicate), the log2 ratio is calculated
in the following way:
log2 (expression value of p in x / median of medians for p)

There are some other criteria used by hgRatioMicroarray e.g.
-minAbsVal=N Minimum absolute value considered meaningful
                Anything less will be considered to be minAbsVal/2
                Default 20.000000
-minMaxVal   Minimum max value before throw away gene data
                Must be expressed at least this much in one tissue
                Default 50
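
One possible reading of these two options, as a rough C sketch. The
function names are hypothetical, and the help text above does not say
whether the comparisons are strict or whether the program tests raw
replicate values or tissue medians, so those details are assumptions
here.

```c
#include <assert.h>
#include <stdbool.h>

double applyMinAbsVal(double val, double minAbsVal)
/* Anything less than minAbsVal is considered to be minAbsVal/2. */
{
return (val < minAbsVal) ? minAbsVal / 2.0 : val;
}

bool passesMinMaxVal(const double *vals, int count, double minMaxVal)
/* Return true if at least one value reaches minMaxVal (assumed to be
 * a >= comparison); otherwise the gene data would be thrown away. */
{
int i;
assert(count > 0);
for (i = 0; i < count; ++i)
    if (vals[i] >= minMaxVal)
        return true;
return false;
}
```

With the defaults, a value of 5 would be treated as 10, and a probeset
whose largest value is 49.9 would be dropped.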

You may obtain the clumps file and the programs used for microarray
processing by downloading the Genome Browser source code. It is free
for academic, non-profit and personal use. For source code download, see:
http://genome.ucsc.edu/FAQ/FAQlicense#license1

hgRatioMicroarray and gnfHumanAtlas2Clumps.ra are in the directory:
src/hg/makeDb/hgRatioMicroarray
and the file that documents how the array data was processed is:
src/hg/makeDb/doc/hgFixed.txt

> A.
> 1. conversion between the indexing of the 158 two sample entries to the
> 79 entries
>
>    For that I used the gnfHumanAtlas2AllExps and the
> gnfHumanAtlas2MedianExps
>
>    I compared them by their names. Most were identical but two differed
>
>    1. pituitary v. pituitary gland
>    2. DRG v. dorsal root gangelion
>    I assumed they are the same sample
You are correct - these are the same sample. The names for the
gnfHumanAtlas2AllExps table come from the original expression data file used
to load the gnfHumanAtlas2All table - it uses "Pituitary" and "DRG". The
names in gnfHumanAtlas2MedianExps come from the clumps file used by
the hgMedianMicroarray program that creates the gnfHumanAtlas2MedianExps,
gnfHumanAtlas2MedianRatio and gnfHumanAtlas2Median tables. This clumps
file has "Pituitary gland" and "dorsal root ganglion".

> B. Regarding the gnfAtlas2Distance I would like to know
>    a. what is the function used for the vector comparisons?
>    b. what "_at" are related to each gene (all the exonic _at's?,  a
> representative _at?, other?)

Regarding the gnfAtlas2Distance table, the code used to create it assigns
a Known Genes ID to each probeset in gnfHumanAtlas2 based on
the knownTo table, which is knownToGnfAtlas2 for this human data. Each
Known Gene is then compared to every other gene in the list of Known Genes
assigned to a probeset.
For each pair of genes, the expression values (expScores) are compared
index by index, so if the values in expScores are indexed
starting at 0 then expScores[i] for gene a is compared to expScores[i] for
gene b. If the value for a is larger than that for b then b is subtracted
from a, else a is subtracted from b, which gives the absolute difference.
This distance is multiplied by the weight for experiment i and added to a
running total across all experiments. By default, the
weights are all 1. The total weight is found by adding the weights for
each experiment i. The resulting normalized distance is the
total distance/total weight.

The program that is used to create this data table is
src/hg/near/hgExpDistance/hgExpDistance.c and here is the function that
calculates the distance described above:

double expDistance(struct microData *a,
struct microData *b, float *weights)
/* Return normalized distance between a and b. */
{
double totalWeight = 0.0, totalDistance = 0.0, distance, weight;
double aVal, bVal;
int i, count = a->expCount;
assert(a->expCount == b->expCount);
for (i=0; i<count; ++i)
    {
    aVal = a->expScores[i];
    bVal = b->expScores[i];
    if (aVal >= -9999 && bVal >= -9999)
        {
        if (aVal > bVal)
            distance = aVal - bVal;
        else
            distance = bVal - aVal;
        weight = weights[i];
        totalWeight += weight;
        totalDistance += distance*weight;
        }
    }
assert(totalDistance >= 0);
if (totalWeight <= 0.0)
    return count;       /* No data at all - everything is far apart. */
else
    return totalDistance/totalWeight;
}
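
To make the arithmetic concrete, here is a small stand-alone variant
that can be compiled without the rest of the source tree. It is only a
sketch: toyExpDistance is a hypothetical name, it takes plain arrays
instead of struct microData, and the value -10000 used below for a
missing score is just an example of something that fails the >= -9999
test.

```c
#include <assert.h>

double toyExpDistance(const float *aScores, const float *bScores,
    const float *weights, int count)
/* Weighted mean absolute difference over the experiments where both
 * scores pass the same >= -9999 test as in expDistance. */
{
double totalWeight = 0.0, totalDistance = 0.0;
int i;
assert(count >= 0);
for (i = 0; i < count; ++i)
    {
    double aVal = aScores[i], bVal = bScores[i];
    if (aVal >= -9999 && bVal >= -9999)
        {
        double distance = (aVal > bVal) ? aVal - bVal : bVal - aVal;
        totalWeight += weights[i];
        totalDistance += distance * weights[i];
        }
    }
if (totalWeight <= 0.0)
    return count;   /* No usable experiments - treat as far apart. */
return totalDistance / totalWeight;
}
```

For scores {1, 2, 3} and {2, 0, 3} with all weights 1, the differences
are 1, 2 and 0, so the distance is 3/3 = 1. If the second score of the
first gene is replaced by -10000, that experiment is skipped and the
distance becomes (1 + 0)/2 = 0.5.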

I hope that this helps you. Please let us know if you have further
questions.

Rachel

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu


On Thu, 11 Jan 2007, Yael Altuvia wrote:

> Hi,
>
>
> I have a few questions regarding the GNF tables.
>
>
> A. I tried to reconstruct the normalization procedure leading to the
> values of the gnfAtlas2 table from the raw data
>
> in  gnfHumanAtlas2All  (I used the hg17 and hg17 fixed)
>
>
> I conducted the following:
>
>
> 1. conversion between the indexing of the 158 two sample entries to the
> 79 entries
>
>    For that I used the gnfHumanAtlas2AllExps and the
> gnfHumanAtlas2MedianExps
>
>    I compared them by their names. Most were identical but two differed
>
>    1. pituitary v. pituitary gland
>    2. DRG v. dorsal root gangelion
>    I assumed they are the same sample
>
>
>    I also used this files to assign the group to each tissue (e.g. brain
> immune etc)
>
>
> 2. For each sample of the 158 samples in "all" I calculated the median
> of each group
>
>
> 3.  I run into some problems trying to calculate the log ratio of each
> entry. According to the previously asked questions relating to this
> topic I calculated the log ratio of each entry by dividing it by the
> median of its group. but the values I got differed from the gnfAtlas2.
> I also tried to use instead the median of the group medians for the
> normalization? (this gives more similar values although they still seem
> to differ)  (in both cases I calculated the median of the two samples
> after applying the procedure). What am I doing wrong?
>
>
>
> B. Regarding the gnfAtlas2Distance I would like to know
>    a. what is the function used for the vector comparisons?
>    b. what "_at" are related to each gene (all the exonic _at's?,  a
> representative _at?, other?)
>
>
> Thanks for your help
> Yael Altuvia
>
>
>

