|
|
Application and Analysis of
Microarrays Lab 3: Finding Archae genes
responsive to stress |
|
Objective |
|
I. Get the data |
Download the 5 Pae
cold-shock hybridization results (Pae-A007.gpr, Pae-A026.gpr, Pae-A027.gpr, Pae-A028.gpr, Pae-A029.gpr) and the two reference versus reference
hybridization results (Pae-A012.gpr and Pae-A013.gpr). Download
all 7 of the files into a new directory called “cold”. Put the Pae-printA.gal and Pae-arrays.txt files into the cold directory as well. Finally, make sure all of the GenePix result
files end with a *.gpr extension. If any
file ends in something else (e.g. sometimes they’re saved with a *.gpr.txt
extension by Internet Explorer), change the file names to end with .gpr. Note that when you download the following
scripts they should end in .R (not .R.txt for example).
Within R, change your working directory to the cold
directory. Run the load-pae.R script to load all of the GenePix result files in
the cold directory into a single microarray object called pae.raw. To run the script
type source(“load-pae.R”) in R. Check
that you now have the object pae.raw
loaded into your workspace.
|
II. Visualize and normalize |
As a refresher, let’s make the standard log-ratio
versus intensity scatterplots (“MA-plots”) for each of the 7 hybridizations.
a. Make a
MA-plot of each hybridization result.
Color the genes with log-ratios in the top and bottom 5% a different
color so you can see where they are on the MA-plot. Try it on your own first and then, if you
need help, download and run the maplot-pae.R script.
b. See if you
can identify common trends in the data, indicating the need for
normalization. Normalize each array
using print-tip loess. Again, try to do
this yourself first and then, if you need help, download and walk through the norm-pae.R script. Make
the same MA-plots you made in part a,
but this time use the normalized data.
c. Again, make a
MA-plot of the raw (non-normalized) log-ratios as in part a but this time color those genes whose normalized log-ratios are
in the top or bottom 5%. Explain what
you see.
d. Make a
MA-plot of the normalized log-ratios and color genes whose raw (non-normalized)
log-ratios are in the top or bottom 5%.
Explain what you see.
e. Make a heat
map of the entire cold-shock expression matrix (Hint: type ?image). Missing data should appear as blanks in the
matrix.
f. To make
subsequent analysis easier, remove genes containing any missing data in any of
the 7 hybridizations. How many genes
did you eliminate with this filter? Make
a heat map of the resulting data to verify that you eliminated the right genes
(all of the blank entries should be gone now).
Here’s Josh’s script for making a heat-map,
heatmap-pae.R, in case you need it.
|
III. Identify cold-shock responsive genes |
It is now the moment of truth! We will next identify genes that are
candidates for involvement in the Archeal cold-shock response. We want to find which genes have log-ratios
consistently higher in the cold-shock hybridizations relative to the reference
hybridizations. For each gene, we’ve
observed 5 log-ratios under cold-shock and 2 log-ratios under the control
conditions. We want to test whether the
mean level in the 5 cold-shock conditions is different than the mean level in
the 2 control conditions.
a. Use the t-test. Follow along with Josh to perform
a t-test that will identify genes up- and down-regulated with respect to the
cold-shock condition. We will be using
the R function t.test() to do the computation.
Write each step of the procedure into a text file called ttest-cold.R using your favorite
editor. You will need this file later so
make sure you save it. Print out a
subset of the normalized expression matrix that shows the normalized log-ratios
of the top 10 and bottom 10 candidates across the 7 conditions. Print it out in such a way so that the
reference conditions are on the left and the cold-shock conditions are on the
right.
If you need a working t-test routine,
download Josh’s ttest-cold.R script.
My top 10 genes were: PAE1557 PAE3383 PAE.i277 EMPTY
YBR244W PAE1279 PAE2028 PAE0604 PAE0991 PAE1199
My bottom 10 genes were: PAE.i317 PAE2282 EMPTY EMPTY
PAE1695 PAE0524 PAE3003 PAE1293 PAE.i280 PAE2549
You plot of the expression profiles of these genes
should look like my plot.
b. Use a more conservative
(non-parametric) test. The t-test assumes
the log-ratios observed within each group are normally distributed. This is often not the case with microarray
data. However, we still would like to
know which genes have significantly higher (or lower) log-ratios under
cold-shock relative to the control conditions even if the log-ratios are not
normally distributed. Non-parametric
tests allow us to compute significance levels and confidence intervals without
assuming a particular distribution. The
Wilcoxon Rank Sum test is the non-parametric version of the t-test. It will give us a more conservative estimate
of whether the mean log-ratio of the cold-shock measurements for a particular
gene are more extreme relative to the gene’s values in the reference conditions.
Perform a Wilcoxin Rank Sum test (Hint: type
?wilcox.test for help). Copy your ttest-cold.R file to another file called
wilcox-cold.R. Modify the wilcox-cold.R file to perform the Wilcoxin Rank-Sum test on the
cold-shock dataset (Hint: you should have to change very little!). Print out the top 10 scoring genes (and the
bottom 10 scoring genes) along with their expression across the 7
hybridizations and make a heat-map as we did in part a. Email Josh
(jstuart@soe.ucsc.edu) print outs of each.
Make a scatterplot in which t-test P-values are
plotted against the Wilcoxin P-values (Hint: type ?plot). What’s reassuring about the plot? What seems to be funny about the plot? As a follow-up, plot the actual P-values of
the top candidates. What do you think is
going on? Given this finding, do you
think it’s appropriate to perform a Wilcoxin test in this case?
|
You’re done! |
Comments to Josh Stuart