Application and Analysis of Microarrays
BME 210, Winter 2004

Lab 6: Cluster evaluation: finding Archaeal stress-response modules


Objective

 

In this lab, we want to identify biologically meaningful expression patterns in the Archeal stress dataset.  In the last lab, we used ANOVA to identify genes that vary significantly across the stress dataset.  Finding a set of genes that respond similarly across the various hybridizations may help us identify modules of genes involved in various stress response pathways.

 

Due date

 

Answers to the questions (either in hardcopy form or emailed to Josh) are due in class on Thursday, March 11.  In addition to answering the questions, please hand in all of the requested plots and the code you wrote.

 

As always, you need to turn in your own work, but you are encouraged to work in groups (for example, you may elect to work with the same people as those in your project team).

 

 

I. Setup

 

Download the saved R binary file lab6.RData and save it to your R working directory.  Once you’ve done that, start R and then issue the command: load(“lab6.RData”).  This should give you the following objects in your R workspace:

 

  • stress.submatrix – An expression matrix, containing only genes with significant expression profiles as judged by the ANOVA analysis from lab 4.
  • stress.factor – A factor that lists which type of shock was done in each hybridization (cold-shock, pH-shock, etc.).  Note that this contains a different ordering of conditions than the last lab.
  • up – a data frame containing 200bp upstream sequence for each gene contained in the stress.submatrix.  The sequences are stored in the same order as the rows in the stress.submatrix.
  • write.seqs() – A function that takes a file name and a set of rows as its arguments and stores the 200 base-pairs upstream of the start codon for each gene.  If a gene does not have upstream sequence, it writes an “X” by its name.

 

 

II. Compare clusterings

 

In order to judge the robustness and accuracy of clustering methods, we need some way of comparing the resulting solutions from two clustering algorithms.  We need some way of measuring the similarity of two different clustering methods even when the clusters obtained from each are wildly different.  To avoid having to determine which clusters from one solution correspond to those in another solution, one approach is to measure the degree to which the two solutions group the same pairs of genes.  In this section, you will implement an approach based on comparing the degree to which gene pairs are co-clustered.   This is similar to the Rand index discussed in class except that it does not count pairs where both solutions put the genes in separate clusters.

 

a. Write code to detect clustering similarity based on pair-wise co-clustering.  Write a function called cocluster.index() that calculates the fraction of gene pairs that are co-clustered in common between two clustering solutions.  The function takes two assignment vectors as input.  Each assignment vector contains a number for each gene.  The number specifies which cluster the gene was assigned after clustering.  For example, if we had 3 clusters over 6 genes then the clustering vector:

 

a <-c (1, 3, 2, 1, 2, 2)

 

specifies that clustering placed genes 1 and 4 in the first cluster, genes 3, 5, and 6 in the second cluster, and gene 2 alone in the third cluster.  Comparing this assignment vector with another assignment vector b <- c(2, 1, 1, 2, 3, 3) derived from a different clustering solution on the same genes, cocluster.index(a,b) should return 0.4 since 2 out of the 5 co-clustered pairs are clustered in both solutions (i.e. gene pairs: (1,4) and (5,6) are clustered in both clusterings while gene pairs (3,5), (3,6), and (2,3) are only clustered together in one of them).

 

*** In class I gave the solution to this problem.  You can download the cocluster.R file to get the cocluster.index() function.

 

b. Measure the consistency of k-means clustering.  Using k=15, run k-means clustering on the stress.submatrix 10 times obtaining 10 different k-means clustering solutions.  Compare every one of the 45 unique pairs of solutions to each other using your cocluster.index() function.  Make a histogram of these 45 co-clustering indices.

 

c. Control on random data.  Do the same as in part b, but now compare a k-means solution to a randomly generated solution.  Make a random clustering by permuting the assignment vector from a real clustering solution.  To permute the assignment vector a from above, you could do the following in R:

 

a.permuted <- a[order(runif(length(a)))]

 

which would create a different permutation of a every time you executed the above line.  The benefits of creating random clustering solutions this way is that it preserves some of the properties of the original clustering, namely the number of clusters and their sizes.  Preserving these properties in the random solution is important for comparison purposes.

 

Choose one of the 10 k-means solutions derived from part b.  Create 45 randomly permuted cluster assignment vectors from this solution and compare each random assignment vector with the original non-permuted assignment vector using your cocluster.index() function.  Make a histogram of these 45 comparisons.  Interpret the results you got in part b in light of these randomly derived results.

 

d. Thought experiment.  Without writing any code, describe a procedure you could use to select the best k using your cocluster.index() function.

 

III. Choose the best clustering method based on prediction strength

 

Before we search for gene modules related to stress, we need to identify which clustering method is best for our dataset. How can we decide which clustering method to use?  One idea proposed by Tibshirani et al. (Tibshirani, R et al. 2001 [1]) is to measure how well a clustering method derives the same solution when applied to different subsets of the data.  The idea is that a good method should produce robust clusters that generalize in some sense and that are impervious to small changes in the data. 

 

a. Write helper functions. Write a function called compute.centroids() that takes in a matrix and an assignment vector, and returns a matrix containing the center of each cluster.  Run kmeans() on the stress.submatrix and compare the results of your compute.centroids() function with the centers assigned by the call to kmeans().  Make sure you’re getting the same answer.  Write another function called assign.clusters() that takes in a matrix of expression values and a matrix of centroids (e.g. like what is produced by your compute.centroids() function) and assigns each gene in the expression matrix to its closest centroid.

 

*** Try writing the above functions yourself first and then, if you have trouble, download and source the files centroids.R and assign.R to get working copies of the above functions.

 

b. Evaluate the prediction strength of k-means.  Randomly divide the stress.submatrix in half by assigning a random half of the genes to a new matrix called stress.train and the other half of the genes to another new matrix called stress.test.  Run k-means clustering (k=15) on the stress.train matrix to obtain a cluster assignment vector called train.clusters.  Run k-means clustering (with k=15) on the stress.test matrix to obtain a cluster assignment vector called test.clusters.  Transfer the clusters obtained from the training set to the genes contained in the test set using the functions you wrote in part a.  Call the resulting assignment vector predicted.clusters.  Using your cocluster.index() function, calculate the fraction of gene pairs that are co-clustered between the training and test sets by passing the function the test.clusters and predicted.clusters assignment vectors.  Repeat the entire procedure 100 times (i.e. randomly split the data into training and test sets, run k-means clustering, and calculate the co-clustering degree).  Make a histogram of the 100 co-clustering values you obtain from random splitting.

 

c. Evaluate the prediction strength of hierarchical clustering.  Repeat the experiment you performed in part b using hierarchical clustering instead of k-means.  Note that you should cut the hierarchical clustering tree to obtain 15 clusters.  Based on your results from k-means and hierarchical clustering, which method do you prefer for the stress data?

 

IV. Find stress modules using the Silhouette method

 

Now that you’ve chosen a clustering method, you’re ready to find stress modules.  In this section, you’ll identify modules as those clusters that appear to be of good quality based on the Silhouette measure (Rousseeuw, PJ  1987 [2]).

 

For each gene i, the silhouette width s(i) is defined as follows:

 

·        Let a(i) be the average dissimilarity between gene i and all other genes in its cluster.

·        For all other clusters C to which gene i is not assigned, let d(i,C) be the average dissimilarity of gene i to all of the genes in C. The smallest of these d(i,C) is b(i).  I.e. b(i) = min_C d(i,C).  In words, b(i) as the dissimilarity between gene i and the nearest cluster to which it does not belong.

·        Finally, let s(i) equal (b(i) - a(i)) / max(a(i), b(i)).

 

Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.

 

a. Write a function to compute the Silhouette.  Write a function called my.silhouette() that takes in an expression matrix and a cluster assignment vector and computes the Silhouette value for each gene.  You can either write your own Silhouette function from scratch (this is worth extra credit), or you can call the silhouette() function that is available in the cluster library (type: “library(cluster)” inside your R session to import it).

 

b. Identify stress modules. Using the clustering method you selected in section III, obtain 15 clusters on the stress.submatrix.  Use you’re my.silhouette() function to compute the silhouette for each gene.  Order the genes so that the cluster with the largest average Silhouette is first and the cluster with the smallest is last; within each cluster order the genes so that the gene with the largest Silhouette value is first and the gene with the smallest is last.  Make a plot showing the silhouette value of each gene in the order just described.   What are the top 10 genes in the cluster with the highest average Silhouette?

 

c. Thought experiment. Without writing any code, outline an approach you could use to evaluate clustering methods based on the Silhouette measure.  For example, how could you use it to find the best k for k-means?  How could you use it to decide between using hierarchical or k-means clustering?

 

V. Use BioProspector to identify stress-related cis-regulatory elements

 

Now that you’ve identified clusters that appear to be of good quality, you’re next goal is to identify whether the genes in these clusters contain shared cis-regulatory elements.

 

a. Select 5 of the best-scoring clusters from the Silhouette analysis.  For each selected cluster, extract the upstream sequence for each gene in the cluster and write the sequences out to a separate file.  You can use the write.seqs() function to do this.  For example write.seqs(“first10-seqs.txt”,1:10) would write the upstream sequences of the first 10 genes in the stress.submatrix to the file first10-seqs.txt.  Convert this tab-delimited text file into a FASTA file using Josh’s stab2fasta.pl script.  To use the script on UNIX type “stab2fasta.pl < first10-seqs.txt > first10-seqs.fa” to create a FASTA file called first10-seqs.fa.

 

b. Run BioProspector to find sequence motifs.  BioProspector is a Gibbs-sampling method that searches for common motifs in a set of unaligned sequences.  You can either download BioProspector from the course website (unzip the BioProspector.zip file and run on linux, cygwin, or unix), or you can log into an SOE machine and run BioProspector from the BME210 course directory:

 

/cse/classes/bme210/Winter04/labs/lab6/BioProspector

 

To help BioProspector find regulatory motifs that discriminate our stress-response genes from genes unrelated to stress, supply the program with a set of negative examples.  Run BioProspector on each of the five clusters, using the file up-back.fa as the negative background example sequences (pass this to BioProspector’s –b option).  Report the top three motifs you identify for each cluster.

 

c. Thought experiment.  Outline an experiment that would allow you to gauge the significance of the motifs you obtained in part b.

 

d. Extra credit.  Perform the experiment you outlined in part c and report your findings.

 

Congratulations on a job well done!

 

Comments to Josh Stuart

 

References

 

1.      Tibshirani, Walther, Botstein, Brown.  (2001). Cluster validation by prediction strength.  Stanford Tech Report.

 

2.      Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53–65.