To identify more concretely the impact of multicollinearity and concurvity, and the influence of missing data imputation methods, in the GGAMM, a series of simulated data sets is generated for three simulations. The concurvity simulation and the missing data imputation simulation use artificial data, whereas the data in the multicollinearity simulation are generated from real data.
The artificial data generating procedure is similar to the simulation method proposed by Lin and Zhang (1999). The first step is to construct a simulated model from which random data are generated repeatedly. A total of 1,000 data sets are generated, with 10 subjects in each data set and 100 repeated measurements per subject. The simulated GGAMM is constructed as
$$\log(\mu_{ij}) = \beta_0 + b_{0i} + (\beta_1 + b_{1i})X_{1ij} + f(X_{2ij}) + f_{\mathrm{spa}}(E, P) + e_{ij} \tag{2.65}$$
for $i = 1, \ldots, 10$ and $j = 1, \ldots, 100$, where $X_{1ij}$ is an independent random variable generated from a normal distribution $N(0, 0.16)$. The variable $X_{2ij}$ is a covariate that changes within each subject over 100 equally spaced knots in $[0, 1]$, and it is assumed to follow the normal distribution
6LE3 7 , 41005 , 0.01 h 1, 0.0001,
where trun{.} denotes the truncation operator, which keeps only the integer part of its argument. The between-subject error term $e$ is generated from a normal distribution $N(0, 0.09)$, and the within-subject error term has a first-order autoregressive correlation structure with coefficient $\rho = 0.2$. The smoothing function $f$ is a bimodal function
$$f(x) = \frac{1}{10}\left\{6F_{30,17}(x) + 4F_{3,11}(x)\right\} - 1, \tag{2.66}$$
where $F_{\alpha,\beta}(\cdot)$ is the probability density function of the beta distribution,
$$F_{\alpha,\beta}(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \tag{2.67}$$
and $\Gamma(\cdot)$ is the gamma function. The constant 1 in $f(x)$ serves to center the smoothing function. The spatial function is simply defined as
where $E$ and $P$ are generated from a uniform distribution $U[0, 10]$. It can be regarded as a monotonically increasing linear function from south-west to north-east.
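As a numerical illustration of (2.66) and (2.67), the following minimal Python sketch evaluates the bimodal function from the two beta densities and checks with a midpoint rule that subtracting the constant 1 centers it on [0, 1]. The helper names `beta_pdf` and `f_bimodal` and the grid size are illustrative choices, not part of the original study.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, evaluated on the log scale; zero outside (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_c + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def f_bimodal(x):
    """Bimodal smoothing function of (2.66): (6 F_{30,17} + 4 F_{3,11}) / 10 - 1."""
    return (6 * beta_pdf(x, 30, 17) + 4 * beta_pdf(x, 3, 11)) / 10 - 1

# Midpoint-rule approximation of the integral of f over [0, 1].
# Each beta density integrates to 1, so (6 + 4) / 10 - 1 = 0 is expected.
n_grid = 10_000  # illustrative grid size
integral = sum(f_bimodal((k + 0.5) / n_grid) for k in range(n_grid)) / n_grid
```

The two modes sit near the modes of the component densities, 2/12 for Beta(3, 11) and 29/45 for Beta(30, 17).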
The true values of the fixed intercept and slope $(\beta_0, \beta_1)$ are set to $(0.1, 0.1)$, and the random intercept and slope $(b_{0i}, b_{1i})$ are generated jointly from a multivariate normal distribution
§00¨,§0.709 0.609¨
.
Finally, the response variable is simulated from a Poisson distribution with mean parameter
$$\mu_{ij} = \exp\left\{\beta_0 + b_{0i} + (\beta_1 + b_{1i})X_{1ij} + f(X_{2ij}) + f_{\mathrm{spa}}(E, P) + e_{ij}\right\}. \tag{2.69}$$
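The generating steps can be sketched in Python as follows. This is a minimal sketch under several loudly stated assumptions rather than the procedure actually used in the study: the second parameter of each normal distribution is read as a variance; the mean of the smoothing covariate is taken to be the knot j/100; the random-effect covariance entries `S11`, `S12`, `S22` and the linear spatial function are illustrative placeholders; and the autoregressive error is driven by innovations with variance 0.09. All function names are hypothetical.

```python
import math
import random

# Placeholder covariance entries for (b0, b1); NOT the values used in the study.
S11, S12, S22 = 0.04, 0.02, 0.04

def beta_pdf(x, a, b):
    """Beta(a, b) density; zero outside (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_c + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def f_bimodal(x):
    """Bimodal smoothing function of (2.66)."""
    return (6 * beta_pdf(x, 30, 17) + 4 * beta_pdf(x, 3, 11)) / 10 - 1

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for moderate means)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_dataset(rng, n_subj=10, n_rep=100, beta0=0.1, beta1=0.1, rho=0.2):
    """Return rows (subject, time, x1, x2, E, P, y) for one simulated data set."""
    rows = []
    for i in range(n_subj):
        # Correlated random intercept and slope via a hand-written 2x2 Cholesky factor.
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        b0 = math.sqrt(S11) * z1
        b1 = S12 / math.sqrt(S11) * z1 + math.sqrt(S22 - S12 ** 2 / S11) * z2
        E, P = rng.uniform(0, 10), rng.uniform(0, 10)
        f_spa = (E + P) / 20 - 0.5           # hypothetical increasing linear surface
        e = 0.0
        for j in range(1, n_rep + 1):
            x1 = rng.gauss(0, 0.4)           # variance 0.16 (assumed reading)
            x2 = rng.gauss(j / n_rep, 0.01)  # assumed mean j/100, variance 0.0001
            e = rho * e + rng.gauss(0, 0.3)  # AR(1) error, innovation variance 0.09
            eta = beta0 + b0 + (beta1 + b1) * x1 + f_bimodal(x2) + f_spa + e
            rows.append((i, j, x1, x2, E, P, rpois(math.exp(eta), rng)))
    return rows

data = simulate_dataset(random.Random(1))
```

Calling `simulate_dataset` 1,000 times with different seeds would give a collection of data sets of the size described above.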
The artificial data simulated by the above steps can be used directly in the missing data imputation simulation. Once the 1,000 simulated data sets are prepared, the linear predictor $X_1$ and the covariate $X_2$ in the smoothing function are randomly dropped from each data set at different missing rates. To assess the efficiency of each missing data imputation method, $X_1$ and $X_2$ are dropped independently, which gives two scenarios in this simulation. Note that the dependent variable is always complete in both scenarios. This procedure relies on the assumption that the missing data mechanism is missing completely at random (MCAR) or, at least, missing at random (MAR). The missing rates are 5%, 10%, 20%, 30%, 40% and 50%.
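The dropout step under MCAR can be illustrated with a short Python sketch. It assumes that a fixed fraction of values is removed at each missing rate (the text does not say whether the fraction is exact or only expected), and the function name `drop_mcar` is hypothetical.

```python
import random

MISSING_RATES = (0.05, 0.10, 0.20, 0.30, 0.40, 0.50)

def drop_mcar(values, rate, rng):
    """Return a copy of `values` with round(rate * n) random entries set to None.

    Positions are chosen uniformly at random, independently of the data,
    which is what the MCAR mechanism requires.
    """
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if k in missing_idx else v for k, v in enumerate(values)]

rng = random.Random(2024)
x1 = [rng.gauss(0, 0.4) for _ in range(1000)]  # stand-in for one covariate
scenarios = {rate: drop_mcar(x1, rate, rng) for rate in MISSING_RATES}
```

Applying the same function separately to each covariate, while leaving the response untouched, reproduces the two scenarios described above.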
The targeted estimates are $\hat{\beta}_1$ and the estimated smoothing function $\hat{f}$. Two different $\hat{\beta}_1$ values are estimated, from the simulated data with missing $X_1$ and with missing $X_2$, respectively. An adjusted sample mean square error (ASMSE), modified from the sample mean square error in Nittner (2002), is used as the criterion for assessing the estimated smoothing functions. It is defined as
ASMSE^"J, "J_ Où∑ >5LuOù ú "J, ûBý"J, "Jþ9, 2.70
where ߢ is the number of valid y in the smoothing function, and Bý is the bias between "J and "J.
Using the same data generating procedure, (2.65) to (2.69), a set of concurvity data can be generated from the initial $X_1$ and $f(X_2)$. Suppose a new variable is defined by
$$X_1^{*} = X_1 + K \times f(X_2), \tag{2.71}$$
where $K$ is a numeric value that controls the concurvity level. For $K$ = 0, 0.02, 0.05, 0.09, 0.13, 0.17, 0.22, 0.30, 0.41 and 0.64, the concurvity level between $X_1^{*}$ and $f(X_2)$ is 0.03, 0.10, 0.19, 0.31, 0.41, 0.50, 0.59, 0.70, 0.80 and 0.90, respectively. Each scenario with a specific concurvity level contains 1,000 simulated data sets, and the averages of $\hat{\beta}_1$, its standard error estimates and the ASMSE are evaluated.
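The construction in (2.71) can be sketched in Python as below. The Pearson correlation between the new covariate and the smooth term is used here only as an illustrative proxy for the concurvity level, since the exact measure is not restated in this section, and the covariate values are simulated stand-ins. The sketch is therefore not expected to reproduce the levels listed above; it only shows that the association grows with $K$. The names `concurvity_variable` and `pearson` are hypothetical.

```python
import math
import random

K_VALUES = (0, 0.02, 0.05, 0.09, 0.13, 0.17, 0.22, 0.30, 0.41, 0.64)

def beta_pdf(x, a, b):
    """Beta(a, b) density; zero outside (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_c + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def f_bimodal(x):
    """Bimodal smoothing function of (2.66)."""
    return (6 * beta_pdf(x, 30, 17) + 4 * beta_pdf(x, 3, 11)) / 10 - 1

def concurvity_variable(x1, x2, K):
    """New covariate X1* = X1 + K * f(X2), following (2.71)."""
    return [a + K * f_bimodal(b) for a, b in zip(x1, x2)]

def pearson(u, v):
    """Sample Pearson correlation, used as a simple proxy for concurvity."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

rng = random.Random(7)
x1 = [rng.gauss(0, 0.4) for _ in range(1000)]  # stand-in for X1
x2 = [rng.random() for _ in range(1000)]       # stand-in for X2 on [0, 1]
fx2 = [f_bimodal(v) for v in x2]
corr_by_K = {K: pearson(concurvity_variable(x1, x2, K), fx2) for K in K_VALUES}
```

For a fixed sample, this correlation is non-decreasing in $K$ by the Cauchy-Schwarz inequality, which is why a single tuning constant is enough to sweep through the scenarios.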
Some previous air pollution studies have simulated data from real observations (Dominici, McDermott, Zeger, & Samet, 2002b; He, Mazumdar, & Arena, 2006), and we use a similar procedure to generate data from the original PM10 and SO2 concentrations. To speed up the simulation, we restrict the study period to 1991 only. Suppose a pair of principal component variables $x = (PC_1, PC_2)$ is calculated by PCA from the original one-year data of PM10 and SO2. Define a covariance matrix
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},$$
whose Cholesky decomposition is $\Sigma = R^{\top}R$, where $R$ is an upper triangular matrix. As a result, two new variables with correlation coefficient $\rho$ can be generated by $x \times R$. When the two new variables are used to fit
- , o, ®\ , O, OO,, 9, 99,, "
," , "$%&!, 2.72
the prior prediction can be estimated from (2.72). A total of 1,000 new responses can then be generated from a Poisson distribution with this prediction as its mean. Each scenario repeats the above steps to generate its own simulated data, and the corresponding estimates are evaluated by taking the average.
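The Cholesky step used to induce a chosen correlation can be illustrated with the Python sketch below. It assumes that the two principal component scores are uncorrelated and scaled to unit variance, and it uses simulated standard normal values as stand-ins for the real PM10 and SO2 components. The function names and the value `rho = 0.6` are hypothetical.

```python
import math
import random

def cholesky_upper_2x2(rho):
    """Upper-triangular R with R^T R = [[1, rho], [rho, 1]]."""
    return [[1.0, rho], [0.0, math.sqrt(1.0 - rho * rho)]]

def correlate(pc1, pc2, rho):
    """Post-multiply each row (pc1, pc2) by R to obtain two correlated variables."""
    R = cholesky_upper_2x2(rho)
    z1 = [a * R[0][0] + b * R[1][0] for a, b in zip(pc1, pc2)]
    z2 = [a * R[0][1] + b * R[1][1] for a, b in zip(pc1, pc2)]
    return z1, z2

def pearson(u, v):
    """Sample Pearson correlation."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

rng = random.Random(11)
rho = 0.6  # illustrative target correlation
# Stand-ins for standardized, uncorrelated principal component scores.
pc1 = [rng.gauss(0, 1) for _ in range(20_000)]
pc2 = [rng.gauss(0, 1) for _ in range(20_000)]
z1, z2 = correlate(pc1, pc2, rho)
sample_rho = pearson(z1, z2)
```

If the component scores are not standardized first, the product with $R$ has covariance $R^{\top}\mathrm{Cov}(x)R$ rather than $\Sigma$, so the achieved correlation would differ from $\rho$.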