Simulation - Multi-city time series analyses of air pollution and mortality data using generali

To identify more concretely the impact of multicollinearity and concurvity, and the influence of missing data imputation methods, in the GGAMM, a series of simulated data sets is generated for three simulations. The concurvity simulation and the missing data imputation simulation use artificial data, whereas the data in the multicollinearity simulation are generated from real data.

The artificial data generating procedure is similar to the simulation method proposed by Lin and Zhang (1999). The first step is to construct a simulated model from which random data are generated repeatedly. In total, 1,000 data sets are generated with 10 subjects in each data set, and each subject has 100 repeated measurements. The framework of the simulated GGAMM is constructed as

$$\eta_{ij} = \beta_0 + b_{0i} + (\beta_1 + b_{1i})X_{1ij} + f(X_{2ij}) + f_{spat}(u, v) + \varepsilon_{ij}, \qquad (2.65)$$

for $i = 1, \dots, 10$ and $j = 1, \dots, 100$, where $X_{1ij}$ is an independent random variable generated from a normal distribution $N(0, 0.16)$; the variable $X_{2ij}$ is a covariate that changes within each subject over 100 equally spaced knots in $[0, 1]$ and is defined to follow a normal distribution

6LE3 7 , 4₁₀₀5 , 0.01 h 1, 0.0001,

where trun{·} denotes a truncation operator, which keeps only the integer part of its argument. The between-subject error term $e_{ij}$ is generated from a normal distribution $N(0, 0.09)$, and the within-subject error term has an autoregressive correlation structure given by

$$\varepsilon_{ij} = \rho\,\varepsilon_{i,j-1} + e_{ij}, \quad \text{with } \rho = 0.2.$$

The smoothing function $f$ is a bimodal function

$$f(x) = \frac{1}{10}\left\{6F_{30,17}(x) + 4F_{3,11}(x)\right\} - 1, \qquad (2.66)$$

where $F_{p,q}(\cdot)$ is the probability density function of the beta distribution,

$$F_{p,q}(x) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\,x^{p-1}(1-x)^{q-1}, \qquad (2.67)$$

and $\Gamma(\cdot)$ is the gamma function. The constant 1 subtracted in $f(x)$ serves to center the smoothing function. The spatial function is simply defined as

where $u$ and $v$ are generated from a uniform distribution $U(0, 10)$. It can be regarded as a monotonically increasing linear function from south-west to north-east.
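The bimodal smoothing function in (2.66) and (2.67) can be sketched in a few lines of Python. This is only an illustration: the function names `beta_pdf` and `smooth_f` are hypothetical, and the numerical integration at the end is an added check rather than part of the original procedure. Because each beta density integrates to one, the weighted sum $(6 + 4)/10$ integrates to one as well, which is why subtracting 1 centers $f$ over $[0, 1]$.

```python
import math


def beta_pdf(x, p, q):
    """Beta(p, q) density F_{p,q}(x) from (2.67), for 0 <= x <= 1."""
    const = math.gamma(p + q) / (math.gamma(p) * math.gamma(q))
    return const * x ** (p - 1) * (1 - x) ** (q - 1)


def smooth_f(x):
    """Bimodal smoothing function f(x) from (2.66)."""
    return (6 * beta_pdf(x, 30, 17) + 4 * beta_pdf(x, 3, 11)) / 10 - 1


# Numerical check of the centering: midpoint rule on [0, 1].
n = 10_000
integral = sum(smooth_f((k + 0.5) / n) for k in range(n)) / n
print(f"integral of f over [0, 1] ~ {integral:.6f}")
```

The two modes sit near the modes of the two beta densities, roughly $x = 2/12$ for $F_{3,11}$ and $x = 29/45$ for $F_{30,17}$.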

The true values of the fixed intercept and slope $(\beta_0, \beta_1)$ are set to $(0.1, 0.1)$, and the random intercept and slope $(b_{0i}, b_{1i})$ are generated jointly from a multivariate normal distribution

§00¨,§0.709 0.609¨

.

Finally, the response variable can be simulated from a Poisson distribution with mean parameter

$$\exp\left\{\beta_0 + b_{0i} + (\beta_1 + b_{1i})X_{1ij} + f(X_{2ij}) + f_{spat}(u, v) + \varepsilon_{ij}\right\}. \qquad (2.69)$$
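The last two ingredients of the generating procedure, the autoregressive within-subject error and the Poisson draw, might be sketched as follows. Several choices here are assumptions rather than details from the text: the value 0.09 is read as a variance (so the innovation standard deviation is 0.3), the first error of each subject is a plain innovation draw, and the example linear predictor contains only $\beta_0 + \beta_1 X_1$, leaving out the random effects, the smoothing function and the spatial term. The helper names are hypothetical, and Knuth's multiplication method is used for the Poisson draw only because the Python standard library has no Poisson sampler.

```python
import math
import random


def ar1_errors(n, rho=0.2, innov_sd=0.3, rng=random):
    """AR(1) errors eps_j = rho * eps_{j-1} + e_j with e_j ~ N(0, innov_sd**2).

    Assumption: the first error is a plain innovation draw.
    """
    eps = [rng.gauss(0.0, innov_sd)]
    for _ in range(n - 1):
        eps.append(rho * eps[-1] + rng.gauss(0.0, innov_sd))
    return eps


def poisson_draw(mu, rng=random):
    """Draw one Poisson(mu) count with Knuth's multiplication method."""
    limit = math.exp(-mu)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_responses(eta_fixed, rho=0.2, innov_sd=0.3, rng=random):
    """Poisson responses with mean exp(eta_fixed[j] + eps_j) for one subject."""
    eps = ar1_errors(len(eta_fixed), rho, innov_sd, rng)
    return [poisson_draw(math.exp(eta + e), rng) for eta, e in zip(eta_fixed, eps)]


rng = random.Random(1)
# Hypothetical fixed part: beta_0 + beta_1 * X_1 only, with X_1 ~ N(0, 0.4**2).
eta_fixed = [0.1 + 0.1 * rng.gauss(0.0, 0.4) for _ in range(100)]
y = simulate_responses(eta_fixed, rng=rng)
print(len(y), min(y), max(y))
```

Passing a seeded `random.Random` instance keeps each simulated data set reproducible, which is convenient when 1,000 replicates have to be regenerated.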

The artificial data simulated by the steps above can be used directly in the missing data imputation simulation. Once the 1,000 simulated data sets are prepared, values of the linear predictor $X_1$ and of the covariate $X_2$ in the smoothing function are randomly dropped from each data set at different missing rates. To clarify the efficiency of each missing data imputation method, $X_1$ and $X_2$ are dropped independently, which gives two scenarios in this simulation. Note that the dependent variable is always complete in both scenarios. This procedure relies on the assumption that the missing data mechanism is missing completely at random (MCAR) or, at least, missing at random (MAR). The missing rates are 5%, 10%, 20%, 30%, 40% and 50%.
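A minimal sketch of the MCAR deletion step is given below. It assumes that a fixed number of values, `round(rate * n)`, is removed per data set and that missing entries are marked with `None`; the text does not say whether the missing rate is applied exactly or only in expectation, so this is one possible reading. The function name `drop_mcar` and the stand-in covariate are hypothetical.

```python
import random


def drop_mcar(values, rate, rng=random):
    """Return a copy with round(rate * n) entries set to None.

    The missing positions are chosen uniformly at random, independently
    of the data values, which is what MCAR requires.
    """
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


rng = random.Random(2024)
# Stand-in for X_1 in one data set: 10 subjects x 100 measurements.
x1 = [rng.gauss(0.0, 0.4) for _ in range(1000)]
for rate in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    x1_missing = drop_mcar(x1, rate, rng)
    observed = sum(v is not None for v in x1_missing)
    print(f"rate={rate:.2f}: {observed} of {len(x1)} values observed")
```

Applying the same function to $X_2$ instead of $X_1$ gives the second scenario; the response is never passed through it, so it stays complete.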

We investigate the targeted estimate $\hat{\beta}_1$ and the estimated smoothing function $\hat{f}$. Two different values of $\hat{\beta}_1$ are estimated from the simulated data with missing $X_1$ and with missing $X_2$, respectively. An adjusted sample mean square error (ASMSE), modified from the sample mean square error in Nittner (2002), is applied as the criterion for assessing the estimated smoothing functions. It is defined as

ASMSE^"J, "J_ O_ù∑ >5LuOù ú "J, ûBý"J, "Jþ9, 2.70

where ߢ is the number of valid y in the smoothing function, and Bý is the bias between "J and "J.

Using the same data generating procedure from (2.65) to (2.69), a set of concurvity data can be generated based on the initial $X_1$ and $f(X_2)$. Suppose a new variable is defined by

$$X_1^{*} = X_1 + K \times f(X_2), \qquad (2.71)$$

where $K$ is a numeric value that controls the concurvity level. When $K$ = 0, 0.02, 0.05, 0.09, 0.13, 0.17, 0.22, 0.30, 0.41 and 0.64, the concurvity level between $X_1^{*}$ and $f(X_2)$ is 0.03, 0.10, 0.19, 0.31, 0.41, 0.50, 0.59, 0.70, 0.80 and 0.90, respectively. Each scenario with a specific concurvity level contains 1,000 simulated data sets, and the averages of $\hat{\beta}_1$, se($\hat{\beta}_1$) and ASMSE are evaluated.
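The way $K$ drives the dependence between the new variable and $f(X_2)$ can be illustrated with a short sketch. Two assumptions are made and should be read as such: the Pearson correlation between $X_1 + K f(X_2)$ and $f(X_2)$ is used as a stand-in for the concurvity level, because the text does not state how that level is measured, and $X_2$ is placed on the grid $j/100$ instead of being drawn at random. The sketch is therefore not expected to reproduce the levels listed above exactly; it only shows the monotone relationship. All function names are hypothetical.

```python
import math
import random


def beta_pdf(x, p, q):
    """Beta(p, q) density, as in (2.67)."""
    const = math.gamma(p + q) / (math.gamma(p) * math.gamma(q))
    return const * x ** (p - 1) * (1 - x) ** (q - 1)


def smooth_f(x):
    """Bimodal smoothing function, as in (2.66)."""
    return (6 * beta_pdf(x, 30, 17) + 4 * beta_pdf(x, 3, 11)) / 10 - 1


def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def concurvity_proxy(k, n_subjects=10, n_obs=100, seed=0):
    """Correlation between X_1 + k * f(X_2) and f(X_2) in one data set."""
    rng = random.Random(seed)
    x1_star, f_x2 = [], []
    for _ in range(n_subjects):
        for j in range(1, n_obs + 1):
            fx = smooth_f(j / n_obs)  # assumption: X_2 sits on the grid j/100
            f_x2.append(fx)
            x1_star.append(rng.gauss(0.0, 0.4) + k * fx)  # X_1 ~ N(0, 0.16)
    return pearson(x1_star, f_x2)


for k in (0.0, 0.13, 0.30, 0.64):
    print(f"K={k:.2f}: correlation proxy = {concurvity_proxy(k):.2f}")
```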

Some previous air pollution studies simulated data from real observations (Dominici, McDermott, Zeger, & Samet, 2002b; He, Mazumdar, & Arena, 2006), and we use a similar procedure to generate data from the original PM10 and SO2 concentrations. To speed up the simulation, we restrict the study period to 1991 only. Suppose a pair of principal component variables $(PC_1, PC_2)$ is calculated from the original 1-year data of PM10 and SO2 by PCA. Define a covariance matrix as

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},$$

and a Cholesky decomposition factorizes it as $\Sigma = R^{\top}R$, where $R$ is an upper triangular matrix. As a result, two correlated variables with correlation coefficient $\rho$ can be generated by $(PC_1, PC_2) \times R$. When the two new variables are used to fit

- , o, ®\ , O, OO,, 9, 99,, "

," , "_$%&!, 2.72

the prior prediction can be estimated from (2.72). A total of 1,000 new responses can then be generated from a Poisson distribution based on this prediction. Each scenario repeats the above steps to generate its own simulated data, and the corresponding estimates are evaluated by taking the average.
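The Cholesky step used to induce a chosen correlation can be sketched as below. For a 2 x 2 correlation matrix the upper triangular factor has a closed form, so no linear algebra library is needed. The sketch assumes that the two input columns are uncorrelated and scaled to unit variance, since only then does the product have correlation exactly $\rho$ in expectation; independent standard normal draws stand in for the principal components of PM10 and SO2, which are not available here. The function names are hypothetical.

```python
import math
import random


def cholesky_upper_2x2(rho):
    """Upper triangular R with R^T R = [[1, rho], [rho, 1]]."""
    return [[1.0, rho], [0.0, math.sqrt(1.0 - rho ** 2)]]


def correlate(z1, z2, rho):
    """Right-multiply the n x 2 matrix with columns (z1, z2) by R."""
    r = cholesky_upper_2x2(rho)
    w1 = [a * r[0][0] + b * r[1][0] for a, b in zip(z1, z2)]
    w2 = [a * r[0][1] + b * r[1][1] for a, b in zip(z1, z2)]
    return w1, w2


def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(42)
# Stand-ins for two standardized, uncorrelated principal components.
pc1 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
pc2 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
for rho in (0.2, 0.6, 0.9):
    w1, w2 = correlate(pc1, pc2, rho)
    print(f"target rho={rho:.1f}, sample correlation={pearson(w1, w2):.3f}")
```

The first output column equals the first input column, and the second is $\rho\,PC_1 + \sqrt{1-\rho^2}\,PC_2$, which makes the role of $\rho$ in each multicollinearity scenario explicit.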
