2.6 Simulations
2.6.1 Description of Simulated Data
To conduct simulations we first need to establish a rubric for them. To maintain the authenticity of the data, we use the data from the HDP example as the basis for generating simulated data. The first step in this process is to create “true” values for the parameters, which we take from the results of our analysis of the HDP data using JAGS. We analyze the entire dataset using the same priors (β ∼ N_3(0, I) and τ ∼ Gamma(1/2, 1/2)) and the same tuning parameters of the MCMC chains (i.e., the same number of iterations, 5000, the same burn-in length, the first 10%, etc.) as before. From this analysis we calculate the posterior medians of all the parameters. These posterior medians serve as the true values from which we generate the data and, we hope, are the values we capture in the analysis of our simulations.
We elected to include only two of the three physician-level variables in our data: the years of experience and whether or not the physician attended a top-ranked medical school. We next establish a parameter of the simulations, the number of observational units, k. In terms of the original data, k represents the number of physicians; the HDP data contain 308 physicians. We initially set k = 300 but will also try k = 100 and k = 500.
We draw k random samples from a standard normal distribution for the experience covariate and an additional k random samples from a Bernoulli distribution for the medical school covariate. In the original analysis we standardized the experience covariate and we do the same here. The success probability of the Bernoulli distribution for the medical school covariate matches the proportion in the original data (approximately 22%). In addition to these two covariates we include an intercept term; together they form the systematic component, say X, a k × 3 matrix of simulated data.
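The covariate step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_covariates`, the argument `p_school`, and the seed handling are ours, and we read "standardized" as centering and scaling the simulated experience draws within the sample.

```python
import random
import statistics


def simulate_covariates(k, p_school=0.22, seed=None):
    """Return a k x 3 design matrix as rows [intercept, experience, top_school].

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Experience covariate: k draws from N(0, 1), then standardized in-sample.
    raw = [rng.gauss(0.0, 1.0) for _ in range(k)]
    mean = statistics.fmean(raw)
    sd = statistics.stdev(raw)
    experience = [(x - mean) / sd for x in raw]
    # Medical school covariate: k Bernoulli(p_school) draws coded 0/1.
    school = [1.0 if rng.random() < p_school else 0.0 for _ in range(k)]
    # The intercept column is a constant 1.
    return [[1.0, e, s] for e, s in zip(experience, school)]


X = simulate_covariates(300, seed=1)
```

Each row of `X` plays the role of X_i in the text; changing the first argument to 100 or 500 gives the other values of k considered.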
Using the “true” values along with the covariate data just created, we can create simulated data which mimics “perfect simulations.” To do this, we need the simulated data to represent the variance due to the random effect. Recall that the random effect is normally distributed with zero mean and variance τ^{-1}. Since the true value of τ is known prior to the analysis of the simulated data, we can effectively represent “perfect sampling” by finding k quantiles, u_i, of the N(0, τ^{-1}) distribution instead of randomly sampling from that distribution. We also randomly permute the u_i before assigning them to the observational units (physicians). Then logit(p_i) = X_i β + u_i, where X_i is the i-th row of the simulated covariate data and u_i is the randomly assigned quantile of the N(0, τ^{-1}) distribution.
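A minimal sketch of this quantile construction follows. It assumes the probabilities (i − 0.5)/k, which is our reading of Algorithm 1, and uses made-up values `beta_demo` and `tau_demo` purely for illustration; they are not the posterior medians from the HDP analysis. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def perfect_random_effects(k, tau, seed=None):
    """Quantiles of N(0, 1/tau) at probabilities (i - 0.5)/k, randomly permuted."""
    rng = random.Random(seed)
    dist = NormalDist(mu=0.0, sigma=1.0 / math.sqrt(tau))
    u = [dist.inv_cdf((i - 0.5) / k) for i in range(1, k + 1)]
    rng.shuffle(u)  # random assignment of quantiles to observational units
    return u


def inv_logit(x):
    """Numerically stable inverse of the logit function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def success_probabilities(X, beta, u):
    """p_i = logit^{-1}(X_i beta + u_i) for each row X_i of the design matrix."""
    return [
        inv_logit(sum(xj * bj for xj, bj in zip(row, beta)) + ui)
        for row, ui in zip(X, u)
    ]


# Hypothetical values for illustration only.
beta_demo = (-1.0, 0.1, 0.3)
tau_demo = 4.0
X_demo = [[1.0, -0.5, 0.0], [1.0, 0.0, 1.0], [1.0, 0.5, 0.0]]
u_demo = perfect_random_effects(len(X_demo), tau_demo, seed=0)
p_demo = success_probabilities(X_demo, beta_demo, u_demo)
```

Because the u_i are fixed quantiles rather than random draws, their empirical distribution matches N(0, τ^{-1}) as closely as k points allow; only the assignment to units is random.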
Given the true probability of success, p_i, for each observational unit, we need a way to determine the number of trials and the number of successes within those trials. In the HDP data, this is the equivalent of knowing how many patients the physician treated and how many of those patients' cancer went into remission. The total number of observations within observational unit i, n_i, is controlled by an additional simulation parameter, λ: we let n_i be a random sample from a Poisson distribution with rate λ. Relating this to the original problem, λ can be regarded as the average number of patients a physician treated. In the original data, the average number of patients was approximately 22, but since we need larger values of n_i to demonstrate the effectiveness of the normal approximation, we set λ to a range of values. The number of successes for each observational unit is not a random sample from a Binomial distribution; instead, we obtain data as if it came from a “perfectly” sampled Binomial distribution by setting y_i = n_i p_i, rounding when necessary, for each of the i = 1, . . . , k observational units.
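The count step might look as follows in Python. This is a sketch under assumptions: the helper names are ours, the Poisson draw uses Knuth's multiplication method (adequate for the moderate λ considered here, but not for very large rates), and Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from the rounding rule used in the original analysis.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def perfect_counts(p, lam, seed=None):
    """Return (n, y) with n_i ~ Poisson(lam) and y_i = round(n_i * p_i)."""
    rng = random.Random(seed)
    n = [poisson_draw(lam, rng) for _ in p]
    # "Perfectly" sampled Binomial: the expected count, rounded to an integer.
    y = [round(ni * pi) for ni, pi in zip(n, p)]
    return n, y


# Hypothetical probabilities for illustration only.
n_demo, y_demo = perfect_counts([0.2, 0.5, 0.8], lam=25, seed=7)
```

Only the n_i are random here; given n_i and p_i, the response y_i is deterministic.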
One issue that arises is that n_i p_i may not always be a whole number between 1 and n_i − 1. Rounding makes the response a whole number; however, the normal approximation method does not work if y_i = n_i or y_i = 0. In that case we simply remove the observation from the data. When we tested this method, about 5% of the data were removed on average with λ = 25, and about 1% with λ = 50. Clearly, as λ increases the percentage goes down. Calculating these percentages took several simulations under the same (k, λ) values, and the number of discarded observational units varies between simulations. This leads us to question whether such variability is acceptable and whether it will affect our ability to capture the true values. Pseudo-code for how we generated the simulation data is included in Algorithm 1.
Algorithm 1 Pseudo Perfect Simulation Data
Require: True values for β and τ
procedure Data Generate for Simulation(k, λ)
    p ← 0.224
    for i = 1 : k do
        n[i] ← Poisson(λ)
        Xnew[i, 1] ← 1; Xnew[i, 2] ← N(0, 1); Xnew[i, 3] ← Bern(p)
        u[i] ← Φ^{-1}((i − 0.5)/k) / √τ_true
    Randomly permute the u[i] values.
    for i = 1 : k do
        u[i] ← u[i] + Xnew[i, 1] β_{1,true} + Xnew[i, 2] β_{2,true} + Xnew[i, 3] β_{3,true}
        p[i] ← logit^{-1}(u[i])
        y[i] ← round(n[i] p[i])
        if y[i] = 0 or y[i] = n[i] then Remove case i
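For readers who prefer executable code, Algorithm 1 can be rendered roughly as below. This is a sketch, not the implementation used for the reported results: the function name, the dictionary output, and the returned discard fraction are our additions, the experience covariate is drawn from N(0, 1) exactly as in the algorithm, and `beta_demo` and `tau_demo` are hypothetical stand-ins for the posterior medians.

```python
import math
import random
from statistics import NormalDist


def generate_simulation_data(k, lam, beta, tau, p_school=0.224, seed=None):
    """Return (kept_cases, discarded_fraction) following Algorithm 1."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Quantiles of N(0, 1/tau), permuted before assignment to units.
    u = [std_normal.inv_cdf((i - 0.5) / k) / math.sqrt(tau)
         for i in range(1, k + 1)]
    rng.shuffle(u)
    threshold = math.exp(-lam)
    kept = []
    for ui in u:
        # n_i ~ Poisson(lam), drawn with Knuth's multiplication method.
        n, product = 0, rng.random()
        while product > threshold:
            n += 1
            product *= rng.random()
        # Row of the design matrix: intercept, experience, top-school flag.
        x = [1.0, rng.gauss(0.0, 1.0), 1.0 if rng.random() < p_school else 0.0]
        eta = sum(xj * bj for xj, bj in zip(x, beta)) + ui
        p = 1.0 / (1.0 + math.exp(-eta))
        y = round(n * p)
        # Keep only cases with 0 < y < n; boundary cases are removed.
        if 0 < y < n:
            kept.append({"x": x, "n": n, "y": y})
    return kept, 1.0 - len(kept) / k


# Hypothetical "true" values for illustration only.
beta_demo = (-1.0, 0.1, 0.3)
tau_demo = 2.0
data, dropped = generate_simulation_data(300, 25, beta_demo, tau_demo, seed=42)
```

Calling the function repeatedly with different seeds under the same (k, λ) gives a rough sense of how the discarded fraction varies from one simulated data set to the next.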