Missing data analysis - Multi-city time series analyses of air pollution and mortality data usi

Most of the published literature handles severe missing data problems by complete case analysis, that is, by deleting every record with a missing value even when the other variables in that record are complete (Dominici et al., 2000a; Dominici et al., 2003a). We are interested in how efficiently missing data imputation methods can compensate for the resulting loss of information, especially in the smoothing functions. Assuming that the missing data mechanism in the NMMAPS is missing at random (MAR), the following imputation methods were applied to restore complete data sets:

1. Nearest neighbor imputation – version I (NNI1)
2. Nearest neighbor imputation – version II (NNI2)
3. Multiple imputation by MCMC (MI-MCMC)

In detail, NNI1 is a kind of hot deck imputation with a long history, and it has been used in many surveys conducted by Statistics Canada, the U.S. Bureau of Labor Statistics, and the U.S. Census Bureau. Its statistical properties were not derived until Chen and Shao (2001), who gave detailed inference on several issues needed to obtain asymptotically unbiased and consistent variance estimates. Generally speaking, suppose the data structure with m missing values for the row indices i = n-m+1, …, n can be written as

$$
\begin{array}{ll}
\text{observed:} & x_1, \ldots, x_{n-m} \\
\text{observed:} & y_1, \ldots, y_{n-m}, y_{n-m+1}, \ldots, y_n \\
\text{missing:} & x_{n-m+1}, \ldots, x_n
\end{array}
$$

A missing value $x_i$, $i = n-m+1, \ldots, n$, is imputed by the value $x_j$, $j = 1, \ldots, n-m$, whose response $y_j$ is closest to $y_i$. This is the literal meaning of nearest neighbor. Closeness is defined by the distance between two response values. In other words, the nearest neighbor is found from all observed values of Y by

$$ |y_i - y_j| = \min_{1 \le l \le n-m} |y_l - y_i|. \tag{2.54} $$

Once we find the response value $y_j$, $j = 1, \ldots, n-m$, that is closest to $y_i$, $i = n-m+1, \ldots, n$, we impute its corresponding $x_j$ for the missing value. If more than one $x_j$ has a response value $y_j$ at the same minimum distance from $y_i$, the mean of those x values is imputed.
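The NNI1 rule described above can be illustrated with a minimal Python sketch. This is not the %NNI1 SAS macro used in this work; the function name `nni1_impute`, the use of `None` to mark a missing x, and the toy numbers are assumptions made only for illustration.

```python
def nni1_impute(y, x):
    """Impute each missing x (None) from the donor(s) whose y is closest, as in (2.54)."""
    # Donors are the rows where x is observed.
    donors = [(yj, xj) for yj, xj in zip(y, x) if xj is not None]
    if not donors:
        raise ValueError("at least one observed x value is required")
    completed = []
    for yi, xi in zip(y, x):
        if xi is not None:
            completed.append(xi)  # observed values are kept unchanged
            continue
        # Minimum distance between y_i and any donor response y_j.
        d_min = min(abs(yj - yi) for yj, _ in donors)
        # All donors at that minimum distance; ties are averaged.
        tied = [xj for yj, xj in donors if abs(yj - yi) == d_min]
        completed.append(sum(tied) / len(tied))
    return completed


# Toy data (hypothetical): the last two x values are missing.
y = [1.0, 2.0, 4.0, 7.0, 2.9, 5.5]
x = [10.0, 20.0, 40.0, 70.0, None, None]
print(nni1_impute(y, x))
```

In this toy example the row with y = 2.9 borrows x from the donor with y = 2.0, while the row with y = 5.5 is equally close to the donors with y = 4.0 and y = 7.0, so the mean of their x values is imputed.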

However, the classical NNI has a potential disadvantage, because a smoothing function relating x and y can lead to substitutes that are far away from the ‘true’ value (Nittner, 2002). Hence, NNI2 modifies NNI1 to handle possible imprecision in this situation. Consider a neighborhood of $y_i$ that contains a pre-determined number of neighbors k. The key idea of this method is to control the range of this fixed neighborhood, and to impute by different rules after comparing that range with a percentage p of the length of the data interval. Suppose k = 3 and p = 0.05, and let $x_{(s)}$, s = 1, 2, 3, be the ordered x values of the neighbors satisfying equation (2.54). Then the range R and the interval I can be expressed as

$$ R = x_{(3)} - x_{(1)}, \tag{2.55} $$

$$ I = \left( x_{(\max)} - x_{(\min)} \right) \times 0.05. \tag{2.56} $$

Then, a concise step-by-step procedure for NNI2 under the above assumptions is:

Step 1. If $R \le I$, then impute a random number generated from $[x_{(1)}, x_{(3)}]$.

Step 2. If $R > I$, then compute $z = R - I$ and generate a random number u from $[0, z_{(\max)}]$, where $z_{(\max)} = 0.95 \times R$.

Step 2.1. If $u > z$, then impute $(x_{(1)} + x_{(2)} + x_{(3)})/3$.

Step 2.2. If $u \le z$, then compute an empirical distribution of X from the observed x, together with the three probabilities $P(X \le x_{(1)})$, $P(x_{(1)} < X \le x_{(3)})$, and $P(X > x_{(3)})$. After ordering them, impute a value that satisfies the condition with the maximum probability and also satisfies $x_{(\min)} \le X \le x_{(\max)}$.

Note that if more than one value ties for $x_{(1)}$ or $x_{(3)}$, the averages of the tied values can be used in the procedure. The efficiency of NNI1 and NNI2 for missing data in the independent variable when fitting an additive model has been confirmed (Nittner, 2002). To our knowledge, no statistical software package provides these methods, so two self-written SAS macros, %NNI1 and %NNI2, were used to carry out the two imputation procedures.
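The NNI2 steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the %NNI2 macro: the text does not say how the random numbers are distributed or how a value is chosen in Step 2.2, so the sketch assumes uniform draws and, in Step 2.2, a uniform draw from the region with the largest empirical probability. It also generalizes (2.55) to $x_{(k)} - x_{(1)}$ for arbitrary k and takes $z_{(\max)} = (1 - p) \times R$ as written in Step 2. The name `nni2_impute` and the toy data are hypothetical.

```python
import random


def nni2_impute(y, x, k=3, p=0.05, rng=None):
    """Sketch of NNI2: impute missing x (None) from the k nearest neighbors in y."""
    if rng is None:
        rng = random.Random(0)
    donors = [(yj, xj) for yj, xj in zip(y, x) if xj is not None]
    if len(donors) < k:
        raise ValueError("need at least k observed x values")
    obs_x = [xj for _, xj in donors]
    x_min, x_max = min(obs_x), max(obs_x)
    interval = p * (x_max - x_min)                 # I in (2.56)
    completed = []
    for yi, xi in zip(y, x):
        if xi is not None:
            completed.append(xi)
            continue
        # The k donors whose responses are closest to y_i, then their ordered x values.
        nearest = sorted(donors, key=lambda d: abs(d[0] - yi))[:k]
        xs = sorted(xj for _, xj in nearest)
        lo, hi = xs[0], xs[-1]
        r = hi - lo                                # R in (2.55)
        if r <= interval:                          # Step 1 (assumed uniform draw)
            completed.append(rng.uniform(lo, hi))
            continue
        z = r - interval                           # Step 2
        u = rng.uniform(0.0, (1.0 - p) * r)        # z_max = 0.95 * R when p = 0.05
        if u > z:                                  # Step 2.1: mean of the neighbors
            completed.append(sum(xs) / len(xs))
            continue
        # Step 2.2 (assumption): pick the region with the largest empirical
        # probability and draw uniformly inside it, staying within [x_min, x_max].
        regions = [(x_min, lo), (lo, hi), (hi, x_max)]
        counts = [
            sum(v <= lo for v in obs_x),
            sum(lo < v <= hi for v in obs_x),
            sum(v > hi for v in obs_x),
        ]
        a, b = regions[counts.index(max(counts))]
        completed.append(rng.uniform(a, b))
    return completed


# Toy data (hypothetical): the last two x values are missing.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 2.5, 4.5]
x = [1.1, 2.3, 2.9, 4.2, 5.1, 5.8, None, None]
print(nni2_impute(y, x, rng=random.Random(1)))
```

Whatever branch is taken, every imputed value stays inside the observed range of x, which is the restriction stated at the end of Step 2.2.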

The original methodology of NNI1 and NNI2 imposes no special restriction, and both continuous and categorical variables can be imputed. However, although no study establishes what size of data set the methods can support, too small a sample size may cause some imprecision. In addition, NNI1 and NNI2 apply directly to one independent variable paired with one dependent variable, but they cannot be used for multivariate imputation. A compromise is to compute a correlation matrix among the variables and, for each variable containing missing data, pick the complete variable most highly correlated with it. Nonetheless, this modification has not been scientifically validated, and it loses too much information from the other variables that are not used in NNI1 and NNI2. That is why multiple imputation is popular in this situation, even though it requires more assumptions about the data.
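The correlation-based compromise mentioned above can be sketched as follows. The sketch assumes Pearson correlation computed on the rows where the incomplete variable is observed, and ranks candidates by absolute correlation; both choices, as well as the function and variable names, are illustrative assumptions rather than the procedure actually used.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the rows where both values are observed (not None)."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in pairs)
    s_uu = sum((u - mean_u) ** 2 for u, _ in pairs)
    s_vv = sum((v - mean_v) ** 2 for _, v in pairs)
    return s_uv / sqrt(s_uu * s_vv)


def pick_predictor(target, candidates):
    """Name of the complete candidate variable with the largest |r| with the target."""
    return max(candidates, key=lambda name: abs(pearson(target, candidates[name])))


# Hypothetical variables: one incomplete series and two complete ones.
incomplete = [10.0, 12.0, None, 15.0, 18.0, None]
complete = {
    "temperature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "humidity": [5.0, 1.0, 4.0, 2.0, 3.0, 6.0],
}
print(pick_predictor(incomplete, complete))
```

The selected variable would then play the role of y in NNI1 or NNI2, while all other variables are ignored, which is exactly the loss of information noted in the text.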

The multiple imputation method can easily handle a large number of variables simultaneously, whether the variables themselves are complete or incomplete. Among multiple imputation methods, the Markov chain Monte Carlo (MCMC) method can simulate the joint posterior distribution of the unknown values and yield simulation-based estimates of the posterior parameters. Consider a general regression model with outcomes $Y$ and a vector of predictors $X$. For a given subject, these variables are either observed or partially missing. We define $Z = (Z_{obs}, Z_{mis})$, where $Z_{obs} = (Y_{obs}, X_{obs})$ and $Z_{mis} = (Y_{mis}, X_{mis})$, and $R$ as a set of indicator variables, where $R_j = 1$ if the jth element of $Z$ is observed and $R_j = 0$ otherwise. Multiple imputation is appropriate when the data follow either the missing completely at random (MCAR) mechanism

$$ P(R \mid Z) = P(R \mid Z_{obs}, Z_{mis}) = P(R \mid \theta), \tag{2.57} $$

which means that missingness is not related to any variable, whether observed or unobserved, or the missing at random (MAR) mechanism

$$ P(R \mid Z) = P(R \mid Z_{obs}, Z_{mis}) = P(R \mid Z_{obs}, \theta), \tag{2.58} $$

which indicates that missingness is related only to the observed quantities of the variables. Note that $\theta$ is the presumed parameter set. Under MCAR or MAR, the analyst can generate imputations $Z^{(1)}, Z^{(2)}, \ldots, Z^{(m)}$ from the conditional distribution $P(Z_{mis}, \theta \mid Z_{obs})$ iteratively. The whole procedure can be implemented using the IP algorithm (Schafer, 1997), whose two steps at the tth iteration are defined as:

Imputation step: Draw $Z_{mis}^{(t+1)}$ from $P(Z_{mis} \mid Z_{obs}, \theta^{(t)})$.

Parameter step: Draw $\theta^{(t+1)}$ from $P(\theta \mid Z_{obs}, Z_{mis}^{(t+1)})$.

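In the imputation step, suppose $\mu = (\mu_1', \mu_2')'$ is the partitioned mean vector of $Z_{obs}$ and $Z_{mis}$, and the partitioned covariance matrix of $Z_{obs}$ and $Z_{mis}$ is

$$ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}, $$

where $\Sigma_{11}$ and $\Sigma_{22}$ are the covariance matrices of $Z_{obs}$ and $Z_{mis}$, respectively, and $\Sigma_{12}$ is the covariance matrix between $Z_{obs}$ and $Z_{mis}$. Hence, the conditional distribution of $Z_{mis}$ given $Z_{obs} = z_{obs}$ is a multivariate normal distribution with the mean vector

$$ \mu_{2.1} = \mu_2 + \Sigma_{12}' \Sigma_{11}^{-1} (z_{obs} - \mu_1), \tag{2.59} $$

and the conditional covariance matrix

$$ \Sigma_{22.1} = \Sigma_{22} - \Sigma_{12}' \Sigma_{11}^{-1} \Sigma_{12}. \tag{2.60} $$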
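For intuition, the imputation step can be sketched in Python for the simplest case of one observed and one missing variable, where every block in (2.59) and (2.60) is a scalar. The function names and the numeric values are hypothetical, and the general matrix case would require a linear algebra library.

```python
import random


def conditional_normal(mu1, mu2, s11, s12, s22, z_obs):
    """Scalar version of (2.59) and (2.60): moments of Z_mis given Z_obs = z_obs."""
    cond_mean = mu2 + (s12 / s11) * (z_obs - mu1)   # (2.59)
    cond_var = s22 - (s12 / s11) * s12              # (2.60)
    return cond_mean, cond_var


def draw_missing(mu1, mu2, s11, s12, s22, z_obs, rng):
    """Imputation step: one draw of the missing value from its conditional normal."""
    cond_mean, cond_var = conditional_normal(mu1, mu2, s11, s12, s22, z_obs)
    return rng.gauss(cond_mean, cond_var ** 0.5)


# Hypothetical current parameter values and one observed value.
print(conditional_normal(mu1=0.0, mu2=1.0, s11=4.0, s12=2.0, s22=3.0, z_obs=2.0))
print(draw_missing(0.0, 1.0, 4.0, 2.0, 3.0, 2.0, random.Random(0)))
```

The conditional variance never exceeds the marginal variance of the missing variable, so a stronger covariance with the observed variable gives tighter imputations.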

In the Bayesian framework, suppose that the rows of an $n \times p$ matrix $Z = (z_1', z_2', \ldots, z_n')'$ follow a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. The posterior distributions of $\mu$ and $\Sigma$ are then

$$ \Sigma \mid Z \sim W^{-1}\left( n + m,\; (n-1)S + \Psi + \frac{n\tau}{n+\tau} (\bar{z} - \mu_0)(\bar{z} - \mu_0)' \right), \tag{2.61} $$

$$ \mu \mid \Sigma, Z \sim N\left( \frac{1}{n+\tau} (n\bar{z} + \tau\mu_0),\; \frac{1}{n+\tau} \Sigma \right), \tag{2.62} $$

where $W^{-1}(a, \Lambda)$ denotes an inverted Wishart distribution with $a$ degrees of freedom and precision matrix $\Lambda$; n is the total number of observations in $Z$; m and $\Psi$ are the degrees of freedom and the precision matrix of the prior distribution of $\Sigma$; $(n-1)S$ is the corrected sum of squares and crossproducts (CSSCP) matrix; and $\mu_0$ and $\tau$ are the mean and the denominator of the variance-covariance matrix in the prior distribution of $\mu \mid \Sigma$, respectively (Anderson, 1984).

Based on (2.61) and (2.62), we can derive the posterior distributions of $\mu$ and $\Sigma$ from their prior information in the posterior (parameter) step. Here we use only a noninformative prior, i.e., the Jeffreys prior, to obtain

$$ \Sigma^{(t+1)} \mid Z \sim W^{-1}\left( n - 1,\; (n-1)S \right), \tag{2.63} $$

$$ \mu^{(t+1)} \mid \Sigma^{(t+1)}, Z \sim N\left( \bar{z},\; \frac{1}{n} \Sigma^{(t+1)} \right). \tag{2.64} $$

The two steps construct a Markov chain that simulates draws $(Z_{mis}^{(1)}, \theta^{(1)}), (Z_{mis}^{(2)}, \theta^{(2)}), \ldots, (Z_{mis}^{(t+1)}, \theta^{(t+1)})$, and this chain converges to the posterior distribution $P(Z_{mis}, \theta \mid Z_{obs})$. After replicating the above procedure m times to generate m imputed data sets, we fit the GGAMM to each imputed data set and obtain m model-fitting results. Finally, the m results are combined into a single result in the pooling step, whose purpose is to provide robust estimates of the parameters and their standard errors. For the asymptotic behavior of multiple imputation methods, see Barnard and Rubin (1999), Meng and Rubin (1992), and Robins and Wang (2000).
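The parameter step under the Jeffreys prior and the pooling step can be sketched for a single variable as follows. Two assumptions are made here. First, for one variable the inverted Wishart draw in (2.63) reduces to dividing the corrected sum of squares by a chi-square draw with n - 1 degrees of freedom. Second, the text does not spell out the pooling formula, so the sketch assumes the standard combining rules for multiple imputation (commonly known as Rubin's rules). Function names and input numbers are hypothetical.

```python
import random
from statistics import mean, variance


def posterior_draw(z, rng):
    """One-variable version of (2.63)-(2.64): draw (mu, sigma2) given completed data z."""
    n = len(z)
    z_bar = mean(z)
    cssq = (n - 1) * variance(z)  # scalar CSSCP, (n - 1) S
    # For one variable, sigma2 ~ W^{-1}(n - 1, (n - 1) S) is equivalent to
    # cssq / sigma2 ~ chi-square(n - 1); chi-square(v) is Gamma(shape=v/2, scale=2).
    chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
    sigma2 = cssq / chi2
    mu = rng.gauss(z_bar, (sigma2 / n) ** 0.5)  # (2.64)
    return mu, sigma2


def pool(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total ** 0.5     # estimate and pooled standard error


# Hypothetical completed data for one variable, and hypothetical results
# (estimate and squared standard error) from m = 3 imputed data sets.
rng = random.Random(2024)
print(posterior_draw([9.8, 10.4, 10.1, 9.6, 10.9, 10.2], rng))
print(pool([0.42, 0.47, 0.39], [0.010, 0.012, 0.011]))
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term carries the extra uncertainty due to the missing values.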

In document Multi-city time series analyses of air pollution and mortality data using generalized geoadditive mixed models (Page 73-78)