|
|
Application and Analysis of
Microarrays Lab 6: Cluster evaluation: finding
Archaeal stress-response modules |
|
Objective |
|
Due date |
Answers to the questions (either in hardcopy form or
emailed to Josh) are due in class on Thursday,
March 11. In addition to answering
the questions, please hand in all of the requested plots and the code you
wrote.
As always, you need to turn in your own work, but you
are encouraged to work in groups (for example, you may elect to work with the
same people as those in your project team).
|
I. Setup |
Download the saved R binary file lab6.RData and save it to your R working directory. Once you’ve done that, start R and then issue
the command: load(“lab6.RData”). This
should give you the following objects in your R workspace:
|
II. Compare clusterings |
In order to judge the robustness and accuracy of
clustering methods, we need some way of comparing the resulting solutions from two
clustering algorithms. We need some way
of measuring the similarity of two different clustering methods even when the
clusters obtained from each are wildly different. To avoid having to determine which clusters
from one solution correspond to those in another solution, one approach is to
measure the degree to which the two solutions group the same pairs of
genes. In this section, you will
implement an approach based on comparing the degree to which gene pairs are
co-clustered. This is similar to the
a. Write code to detect clustering
similarity based on pair-wise co-clustering. Write a function called cocluster.index()
that calculates the fraction of gene pairs that are co-clustered in common
between two clustering solutions. The
function takes two assignment vectors as input.
Each assignment vector contains a number for each gene. The number specifies which cluster the gene
was assigned after clustering. For
example, if we had 3 clusters over 6 genes then the clustering vector:
a <-c (1, 3, 2, 1, 2, 2)
specifies that clustering placed genes 1 and 4 in the
first cluster, genes 3, 5, and 6 in the second cluster, and gene 2 alone in the
third cluster. Comparing this assignment
vector with another assignment vector b <- c(2, 1, 1, 2, 3, 3) derived from a different clustering solution on
the same genes, cocluster.index(a,b)
should return 0.4 since 2 out of the 5 co-clustered pairs are clustered in both
solutions (i.e. gene pairs: (1,4) and
(5,6) are clustered in both
clusterings while gene pairs (3,5), (3,6), and (2,3) are only clustered together in one of them).
*** In class I gave the solution to this
problem. You can download the cocluster.R file to get the cocluster.index() function.
b. Measure the consistency of k-means clustering. Using k=15,
run k-means clustering on the stress.submatrix 10 times obtaining 10
different k-means clustering
solutions. Compare every one of the 45
unique pairs of solutions to each other using your cocluster.index()
function. Make a histogram of these 45
co-clustering indices.
c. Control on random data. Do the same as in part b, but now compare a k-means solution to a randomly generated
solution. Make a random clustering by
permuting the assignment vector from a real clustering solution. To permute the assignment vector a from above, you could do the following
in R:
a.permuted
<- a[order(runif(length(a)))]
which would create a different permutation of a every time you executed the above
line. The benefits of creating random
clustering solutions this way is that it preserves some of the properties of
the original clustering, namely the number of clusters and their sizes. Preserving these properties in the random
solution is important for comparison purposes.
Choose one of the 10 k-means solutions derived from part b. Create 45 randomly permuted cluster
assignment vectors from this solution and compare each random assignment vector
with the original non-permuted assignment vector using your cocluster.index()
function. Make a histogram of these 45
comparisons. Interpret the results you
got in part b in light of these randomly derived results.
d. Thought experiment. Without writing any code, describe a
procedure you could use to select the best k
using your cocluster.index() function.
|
III. Choose the best clustering method based on prediction
strength |
Before we search for gene modules related to stress,
we need to identify which clustering method is best for our dataset. How can we
decide which clustering method to use? One
idea proposed by Tibshirani et al.
(Tibshirani, R et al. 2001 [1]) is to
measure how well a clustering method derives the same solution when applied to
different subsets of the data. The idea
is that a good method should produce robust clusters that generalize in some
sense and that are impervious to small changes in the data.
a. Write helper functions. Write a
function called compute.centroids()
that takes in a matrix and an assignment vector, and returns a matrix
containing the center of each cluster.
Run kmeans() on the
stress.submatrix and compare the results of your compute.centroids() function
with the centers assigned by the call to kmeans(). Make sure you’re getting the same
answer. Write another function called assign.clusters() that takes in a matrix
of expression values and a matrix of centroids (e.g. like what is produced by
your compute.centroids() function)
and assigns each gene in the expression matrix to its closest centroid.
*** Try writing the above functions
yourself first and then, if you have trouble, download and source the files centroids.R and assign.R to get working copies of the above
functions.
b. Evaluate the prediction strength of k-means. Randomly divide the stress.submatrix in half by assigning a random half of the genes to
a new matrix called stress.train and
the other half of the genes to another new matrix called stress.test. Run k-means clustering (k=15) on the stress.train
matrix to obtain a cluster assignment vector called train.clusters. Run k-means clustering (with k=15) on the stress.test matrix to
obtain a cluster assignment vector called test.clusters. Transfer the clusters obtained from the training
set to the genes contained in the test set using the functions you wrote in
part a. Call the resulting assignment vector predicted.clusters. Using your cocluster.index() function, calculate the fraction of gene pairs
that are co-clustered between the training and test sets by passing the
function the test.clusters and predicted.clusters assignment
vectors. Repeat the entire procedure 100
times (i.e. randomly split the data into training and test sets, run k-means clustering, and calculate the co-clustering
degree). Make a histogram of the 100
co-clustering values you obtain from random splitting.
c. Evaluate the prediction strength of
hierarchical clustering. Repeat the
experiment you performed in part b using hierarchical clustering instead of k-means.
Note that you should cut the hierarchical clustering tree to obtain 15
clusters. Based on your results from k-means and hierarchical clustering,
which method do you prefer for the stress data?
|
IV. Find stress modules using the Silhouette method |
Now that you’ve chosen a clustering method, you’re
ready to find stress modules. In this
section, you’ll identify modules as those clusters that appear to be of good
quality based on the Silhouette measure (Rousseeuw, PJ 1987 [2]).
For each gene i, the silhouette width s(i) is defined as follows:
· Let a(i) be the average dissimilarity between gene i and all other genes in its cluster.
· For all other clusters C to which gene i is not assigned, let d(i,C) be the average dissimilarity of gene i to all of the genes in C. The smallest of these d(i,C) is b(i). I.e. b(i) = min_C d(i,C). In words, b(i) as the dissimilarity between gene i and the nearest cluster to which it does not belong.
· Finally, let s(i) equal (b(i) - a(i)) / max(a(i), b(i)).
Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.
a. Write a function to compute the Silhouette. Write a function called my.silhouette() that takes in an expression matrix and a cluster
assignment vector and computes the Silhouette value for each gene. You can either write your own Silhouette
function from scratch (this is worth extra
credit), or you can call the silhouette()
function that is available in the cluster library (type: “library(cluster)”
inside your R session to import it).
b. Identify stress modules. Using the
clustering method you selected in section III, obtain 15 clusters on the
stress.submatrix. Use you’re my.silhouette() function to compute the
silhouette for each gene. Order the
genes so that the cluster with the largest average Silhouette is first and the
cluster with the smallest is last; within each cluster order the genes so that
the gene with the largest Silhouette value is first and the gene with the
smallest is last. Make a plot showing
the silhouette value of each gene in the order just described. What are the top 10 genes in the cluster
with the highest average Silhouette?
c. Thought experiment. Without
writing any code, outline an approach you could use to evaluate clustering
methods based on the Silhouette measure.
For example, how could you use it to find the best k for k-means? How could you use it to decide between using
hierarchical or k-means clustering?
|
V. Use BioProspector to identify stress-related cis-regulatory elements |
Now that you’ve identified clusters that appear to be
of good quality, you’re next goal is to identify whether the genes in these
clusters contain shared cis-regulatory
elements.
a. Select 5 of
the best-scoring clusters from the Silhouette analysis. For each selected cluster, extract the
upstream sequence for each gene in the cluster and write the sequences out to a
separate file. You can use the write.seqs() function to do this. For example write.seqs(“first10-seqs.txt”,
b. Run BioProspector to find sequence motifs. BioProspector is a Gibbs-sampling method that
searches for common motifs in a set of unaligned sequences. You can either download BioProspector from
the course website (unzip the BioProspector.zip file and run on linux, cygwin, or unix), or you can
log into an SOE machine and run BioProspector from the BME210 course directory:
/cse/classes/bme210/Winter04/labs/lab6/BioProspector
To help BioProspector find regulatory motifs that
discriminate our stress-response genes from genes unrelated to stress, supply
the program with a set of negative examples.
Run BioProspector on each of the five clusters, using the file up-back.fa as the negative background example sequences (pass
this to BioProspector’s –b option).
Report the top three motifs you identify for each cluster.
c. Thought experiment. Outline an experiment that would allow you to
gauge the significance of the motifs you obtained in part b.
d. Extra credit. Perform the experiment you outlined in part c and report your findings.
|
Congratulations on a
job well done! |
Comments to Josh Stuart
|
References |
1.
Tibshirani, Walther,
Botstein, Brown. (2001). Cluster
validation by prediction strength. Stanford Tech Report.
2. Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the
interpretation and validation of cluster analysis. J. Comput. Appl. Math.,
20, 53–65.