-----------------------------------------------------------------------
Thanasis Kottas
Applied Math & Statistics
University of California, Santa Cruz
-----------------------------------------------------------------------

************************************************************************
R program to fit the location normal Dirichlet process (DP) mixture
model:

   y_i | theta_i,phi  ind.  Normal(y_i|theta_i,phi),  i=1,...,n
   theta_i | G        ind.  G,                        i=1,...,n
   G | alpha,mu,tau2   ~    DP(alpha,G_0),  G_0 = Normal(mu,tau2)
   phi                 ~    IGamma(a.phi,b.phi)

hyperpriors:

   alpha ~ Gamma(a.alpha,b.alpha)
   mu    ~ Normal(a.mu,b.mu)
   tau2  ~ IGamma(a.tau2,b.tau2)
************************************************************************

The file "mcmc.r" includes the R code. After reading it into R, with
source("mcmc.r"), the Gibbs sampler can be run by calling the function

   mcmc(IB,B,y,a.alpha,b.alpha,a.mu,b.mu,a.tau2,b.tau2,a.phi,b.phi)

where:

--> IB is the burn-in period for the Gibbs sampler (say, IB=5000)

--> B is the posterior sample size after burn-in (say, B=10000)
    (note that, for this model, a burn-in period as small as, say,
    1000 iterations should be sufficient for convergence; also, B
    between 2000 and 5000 yields posterior inference very similar to
    that obtained with larger posterior sample sizes)

--> y is the data vector (say, y <- scan("sim_data.txt")) for the data
    set provided in the file "sim_data.txt", which includes simulated
    data (250 observations) from a mixture of three normals,
    0.2 N(-5,1) + 0.5 N(0,1) + 0.3 N(3.5,1)

--> a.alpha,b.alpha,a.mu,b.mu,a.tau2,b.tau2 are the parameters of the
    hyperpriors, and a.phi,b.phi are the prior parameters for phi

For example, the following prior specification can be used for the
simulated data:

   a.alpha = 2.0, b.alpha = 1.8  (which yields, approximately,
                                  E(n^{*}) = 6)
   a.mu   = 0.0,  b.mu   = 4.0
   a.tau2 = 2.0,  b.tau2 = 4.0
   a.phi  = 2.0,  b.phi  = 4.0

(These choices yield infinite prior variance for tau2 and phi, and
marginal prior moments for y that correspond to rough prior guesses of 0
and 14 for the "center" and the "range" of the data, respectively,
based on the limiting case of the DP mixture model with one cluster,
i.e., for alpha --> 0. Note that under the model with one cluster,
marginally, we have E(y) = a.mu and Var(y) = b.phi + b.tau2 + b.mu.
Therefore, a.mu = 0, and the choice b.phi = b.tau2 = b.mu = 4 arises
by setting b.phi = b.tau2 = b.mu and estimating Var(y) by
(range/4)^2.)

Sensitivity of posterior predictive inference to the prior
specification can be studied by considering other prior choices.

The starting values for the Gibbs sampler are specified in the "mcmc"
function of the R program. Their choice is not that important for this
model, but they can be changed as needed.

The output from the R code includes 8 files. Files "alpha.txt",
"mu.txt", "tau2.txt", "phi.txt", and "n_star.txt" include the
posterior samples for the hyperparameters and for the number of
clusters. File "post_pred_theta.txt" includes the sample from the
posterior predictive distribution, p(theta_0|data), for a ``new''
theta_0 associated with a ``new'' observation y_0. For instance, a
histogram of this sample indicates the clustering in the imputed means
theta_i. Finally, files "prior_pred_y.txt" and "post_pred_y.txt"
include samples from the prior, p(y_0), and posterior, p(y_0|data),
predictive distributions, respectively, for the ``new'' observation
y_0. (Recall that the prior predictive distribution for y_0 is defined
as the average of the Normal(y_0|theta,phi) distribution over the
prior for phi and theta, where theta|mu,tau2 ~ G_0 = Normal(mu,tau2),
with the specified priors for mu and tau2.) For instance, a density
plot based on the sample in "post_pred_y.txt" provides the Bayesian
density estimate arising from the DP mixture model. The corresponding
plot based on the sample in "prior_pred_y.txt" indicates the amount of
prior-to-posterior learning based on the data.
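The claim above that a.alpha = 2.0, b.alpha = 1.8 gives, approximately,
E(n^{*}) = 6 can be checked numerically with the standard DP identity
E(n^{*}|alpha) = sum_{i=1}^{n} alpha/(alpha+i-1), averaged over the
Gamma prior for alpha. A minimal standalone sketch (in Python for
illustration; this is not part of "mcmc.r"):

```python
# Monte Carlo check that alpha ~ Gamma(shape a.alpha=2, rate b.alpha=1.8)
# gives E(n*) close to 6 for n = 250 observations, using the DP identity
# E(n* | alpha) = sum_{i=1}^{n} alpha / (alpha + i - 1).
import numpy as np

rng = np.random.default_rng(0)
n = 250

# Gamma(shape=2, rate=1.8) corresponds to scale = 1/1.8 in NumPy
alpha = rng.gamma(shape=2.0, scale=1.0 / 1.8, size=20000)

i = np.arange(1, n + 1)
# E(n* | alpha) for each prior draw of alpha, then average over the prior
e_nstar = (alpha[:, None] / (alpha[:, None] + i - 1)).sum(axis=1).mean()
print(round(float(e_nstar), 2))
```

The printed value should be close to 6, consistent with the prior
specification suggested above.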
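The prior predictive p(y_0) described above can also be simulated
directly, by averaging Normal(y_0|theta,phi) over the priors. The
following standalone sketch (in Python for illustration; "mcmc.r"
itself produces this sample in "prior_pred_y.txt") uses the example
hyperparameter values a.mu = 0, b.mu = b.tau2 = b.phi = 4,
a.tau2 = a.phi = 2:

```python
# Sketch of drawing from the prior predictive p(y_0): draw mu, tau2, phi
# from their priors, then theta | mu,tau2 ~ G_0 = Normal(mu,tau2), and
# finally y_0 | theta,phi ~ Normal(theta,phi). Here b.mu, tau2, and phi
# are variances. IGamma(a,b) draws are obtained as b / Gamma(a, scale=1).
import numpy as np

rng = np.random.default_rng(1)
B = 50000

mu = rng.normal(0.0, np.sqrt(4.0), size=B)      # mu ~ Normal(a.mu=0, b.mu=4)
tau2 = 4.0 / rng.gamma(2.0, 1.0, size=B)        # tau2 ~ IGamma(2, 4)
phi = 4.0 / rng.gamma(2.0, 1.0, size=B)         # phi ~ IGamma(2, 4)
theta = rng.normal(mu, np.sqrt(tau2))           # theta | mu,tau2 ~ G_0
y0 = rng.normal(theta, np.sqrt(phi))            # y_0 | theta,phi

# Under the one-cluster limit, E(y) = a.mu = 0 and
# Var(y) = b.phi + b.tau2 + b.mu = 12, matching (range/4)^2 = (14/4)^2.
print(round(float(y0.mean()), 2))
```

The sample mean should be near the prior guess of 0 for the center of
the data; a density plot of y0 mimics the plot based on
"prior_pred_y.txt".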
For instance, for the simulated data:

> y <- scan("sim_data.txt")
> y.0.post <- scan("post_pred_y.txt")
> y.0.prior <- scan("prior_pred_y.txt")
> hist(y,probability=T,xlim=c(-8,7),ylim=c(0,0.2),main="",xlab="",ylab="")
> par(new=T)
> plot(density(y.0.prior),xlim=c(-8,7),ylim=c(0,0.2),main="",ylab="",type="l",lty=5,
+ xlab="p(y_0) dashed line, p(y_0|data) solid line, overlaid on data histogram")
> par(new=T)
> plot(density(y.0.post),xlim=c(-8,7),ylim=c(0,0.2),main="",xlab="",ylab="",type="l",lty=1)

********* NOTE *********
The R program for the simulated data executes at a rate of about 100
iterations per minute (on a SunBlade 1000 workstation). The computing
time increases with the data sample size. Substantial gains in
computing speed can be obtained by using C or Fortran instead.