Missing data imputation analysis - Multi-city time series analyses of air pollution and mortali

Analyses of air pollution data have long been undermined by missing data. Even though air pollution monitoring stations have been built in most large cities, budget and cost constraints mean that no station can guarantee continuous, 24-hour collection of every air pollutant. From the NMMAPS database, we can identify several patterns of missing data in specific air pollutants. For the six main air pollutants (CO, NO2, O3, SO2, PM10, PM2.5) from 1987 to 2000 among 108 cities, only 32, 11, 21 and 12 cities had complete data for CO, NO2, O3 and SO2, respectively (Table 3.8). The missing data problem was worse for the particulate matters PM10 and PM2.5: no city had complete PM10 or PM2.5 data over the 14 years. In particular, most air pollution monitoring stations started to collect PM2.5 data only after 1999. Even within this valid collection period, no city had complete PM2.5 data from 1999 to 2000. Moreover, that duration was not long enough for a spatio-temporal data analysis with other co-pollutants, which is why PM2.5 was not included in this study.
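The completeness counts reported for Table 3.8 amount to checking, for each city and pollutant, whether the daily series contains any missing value. The following is a minimal sketch of that check, assuming a hypothetical mapping from (city, pollutant) to a list of daily readings in which None marks a missing day. The function name and the toy values are illustrative only and are not NMMAPS data.

```python
from collections import defaultdict


def complete_cities(series_by_key):
    """Return {pollutant: sorted cities whose series has no missing day}."""
    complete = defaultdict(list)
    for (city, pollutant), values in series_by_key.items():
        # A series counts as complete only if it is non-empty and fully observed.
        if values and all(v is not None for v in values):
            complete[pollutant].append(city)
    return {p: sorted(c) for p, c in complete.items()}


# Toy data (hypothetical, not NMMAPS values); None marks a missing day.
toy = {
    ("Boston", "CO"): [0.9, 1.1, 1.0],
    ("Akron", "CO"): [0.8, None, 0.7],
    ("Boston", "PM10"): [30.0, None, None],
}

result = complete_cities(toy)
print(result)
```

A pollutant with no complete city, such as PM10 in the toy data, simply does not appear in the result, which corresponds to the "None" rows of Table 3.8.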

Table 3.8
The cities with complete air pollutant data in 108 U.S. cities from 1987 to 2000.

Air pollutant   Cities
CO              Akron, Albuquerque, Boston, Chicago, Cincinnati, Cleveland, Columbus, Dayton, Denver, Dallas/Fort Worth, El Paso, Fresno, Houston, Indianapolis, Kansas City (MO), Los Angeles, Louisville, Memphis, Milwaukee, Nashville, Norfolk, Richmond, Sacramento, Salt Lake City, San Bernardino, Seattle, Spokane, Santa Ana/Anaheim, St. Petersburg, Tampa, Tucson, Wichita
NO2             Bakersfield, Boston, Chicago, Dallas/Fort Worth, Fresno, Houston, Kansas City (MO), Los Angeles, Oakland, San Bernardino, Santa Ana/Anaheim
O3              Albuquerque, Bakersfield, Baton Rouge, Chicago, Denver, Dallas/Fort Worth, El Paso, Fresno, Houston, Los Angeles, Little Rock, Nashville, Oakland, Riverside, Sacramento, San Bernardino, Shreveport, Santa Ana/Anaheim, St. Petersburg, Tampa, Tucson
SO2             Boston, Cleveland, Detroit, Houston, Indianapolis, Kansas City (MO), Los Angeles, Milwaukee, Pittsburgh, Providence, St. Petersburg, Tampa
PM10            None
PM2.5           None

The missing data pattern was also not totally regular among different air pollutant [...] Angeles) or half-year cycle (O3 in Akron). In the statistical analysis, the missing data were assumed to be missing completely at random (MCAR), and all records with missing values were excluded when fitting the models in chapter 3.2. This data management strategy is called complete case analysis (CCA). We were interested in how missing data imputation methods work in the GGAMM, and in whether they can repair the damage caused by missing data while also improving model fit.
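Complete case analysis can be illustrated with a short sketch. It assumes a hypothetical list of daily records in which None marks a missing measurement. The column names and values are invented for illustration and do not reproduce the data handling actually used for the GGAMM.

```python
def complete_cases(rows, columns):
    """Keep only the rows in which every listed column is observed (not None)."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


# Hypothetical daily records for one city; None marks a missing measurement.
rows = [
    {"date": "1987-01-01", "deaths": 12, "pm10": 35.0, "no2": 21.0},
    {"date": "1987-01-02", "deaths": 9, "pm10": None, "no2": 19.5},
    {"date": "1987-01-03", "deaths": 11, "pm10": 28.0, "no2": None},
    {"date": "1987-01-04", "deaths": 10, "pm10": 31.0, "no2": 22.0},
]

single = complete_cases(rows, ["deaths", "pm10"])           # PM10-only model
co_poll = complete_cases(rows, ["deaths", "pm10", "no2"])   # PM10 + NO2 model
print(len(single), len(co_poll))
```

In this toy example the co-pollutant model keeps fewer days than the single-pollutant model, because a day is dropped as soon as any one of its pollutants is missing.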

Table 3.9 is organized by the main estimates under CCA, NNI1, NNI2 and MI. When applying missing data imputation, we also encountered convergence problems in obtaining reasonable estimates in some models, such as model 4 with NNI2 or model 6 with NNI2. [...] imputations have convergent results simultaneously. We tried to search for good starting values of the smoothing parameter in either the time smoother or the temperature smoother for some imputations within each model, but this succeeded only for model 1, model 2, model 4, model 5 and model 6. Note that model 3 with CCA, NNI1, NNI2 and MI-MCMC used the initial settings throughout, because no starting value of the smoothing function or number of knots gave a good result.

Table 3.9
The model-fitting results from complete case analysis (CCA), nearest neighbor imputation version 1 and version 2 (NNI1, NNI2) and multiple imputation (MI-MCMC).

Fixed effect estimate
Model    Variable          CCA       NNI1       NNI2    MI-MCMC
Model 1  PM10         0.000105   0.000186  -0.000044   0.000139
Model 2  PM10         0.000093   0.000202  -0.000013   0.000127
         PM10-lag1   -0.000037  -0.000069  -0.000102   0.000008
         PM10-lag2    0.000142   0.000065   0.000047   0.000061
Model 3  PM10         0.000196   0.000546   0.000258   0.000183
         CO          -0.000005  -0.000010  -0.000011  -0.000009
Model 4  PM10         0.000141   0.000021   0.000114  -0.000234
         NO2          0.001256   0.000748   0.000459   0.001658
Model 5  PM10         0.000223   0.000146   0.000004   0.000027
         O3           0.001772   0.001496   0.000546   0.002448
Model 6  PM10         0.000392   0.000355   0.000373   0.000519
         SO2         -0.000266   0.000265   0.000251   0.000027

se() of the fixed effect
Model    Variable          CCA       NNI1       NNI2    MI-MCMC
Model 1  PM10         0.000287   0.000345   0.000262   0.000295
Model 2  PM10         0.000355   0.000322   0.000320   0.000330
         PM10-lag1    0.000370   0.000355   0.000319   0.000322
         PM10-lag2    0.000337   0.000371   0.000283   0.000316
Model 3  PM10         0.088258   0.076300   0.125322   0.082237
         CO           0.081648   0.081645   0.081634   0.081649
Model 4  PM10         0.000413   0.000364   0.000301   0.000437
         NO2          0.000837   0.000727   0.000663   0.000879
Model 5  PM10         0.000378   0.000314   0.000266   0.000330
         O3           0.000837   0.000652   0.000598   0.000825
Model 6  PM10         0.000378   0.000394   0.000428   0.000391
         SO2          0.001813   0.001875   0.001846   0.001920

se() of the random effect
Model    Variable          CCA       NNI1       NNI2    MI-MCMC
Model 1  PM10         0.000194   0.000691   0.000247   0.000434
Model 2  PM10         0.000414   0.000417   0.000554   0.000540
         PM10-lag1    0.000385   0.000540   0.000509   0.000412
         PM10-lag2    0.000389   0.000767   0.000371   0.000502
Model 3  PM10         0.341816   0.295501   0.485365   0.318495
         CO           0.316223   0.316210   0.316165   0.316226
Model 4  PM10         0.000480   0.000474   0.000345   0.000818
         NO2          0.000951   0.000835   0.000855   0.001668
Model 5  PM10         0.000672   0.000435   0.000242   0.000564
         O3           0.000670   0.000786   0.000836   0.001663
Model 6  PM10         0.000343   0.000414   0.000714   0.000417
         SO2          0.003936   0.004269   0.004205   0.004408

Comparing the estimated PM10 fixed effect across the six models, NNI1 increased the fixed effect in model 1, model 2 and model 3, but did not raise it in the remaining co-pollutant models. In model 4 in particular, after controlling for NO2, the PM10 fixed effect with NNI1 was reduced around 8 times compared with the same effect with CCA. With NNI2, the PM10 fixed effects were reduced in every model except model 3. In particular, the PM10 fixed effect became negative in model 1 and model 2. The largest decrement appeared in model 5, where the PM10 fixed effect was reduced from 0.000227 to 0.000004. MI-MCMC generally performed like NNI1 for PM10, except in model 6, where the PM10 fixed effect rose by 33.76%.
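The direction of the NNI2 changes can be checked directly against the tabulated estimates. The sketch below copies the PM10 fixed effect estimates from the CCA and NNI2 columns of Table 3.9 and flags which models were not reduced and which showed the largest drop. The variable names are illustrative.

```python
# PM10 fixed effect estimates copied from Table 3.9, keyed by model number.
pm10_cca = {1: 0.000105, 2: 0.000093, 3: 0.000196,
            4: 0.000141, 5: 0.000223, 6: 0.000392}
pm10_nni2 = {1: -0.000044, 2: -0.000013, 3: 0.000258,
             4: 0.000114, 5: 0.000004, 6: 0.000373}

# Change in the estimate when moving from CCA to NNI2.
diff = {m: pm10_nni2[m] - pm10_cca[m] for m in pm10_cca}

not_reduced = [m for m, d in diff.items() if d >= 0]    # models without a decrease
largest_drop = min(diff, key=diff.get)                  # most negative change
negative = [m for m, v in pm10_nni2.items() if v < 0]   # sign flipped under NNI2

print(not_reduced, largest_drop, negative)
```

With the tabulated values, only model 3 is not reduced, the largest drop is in model 5, and the estimate is negative in model 1 and model 2.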

The fixed effect estimates of the lag effects were all underestimated after applying the missing data imputation methods, except for the 1-day lag effect with NNI1, and the most serious reduction occurred with MI-MCMC. Both NNI1 and NNI2 decreased the effects of the co-pollutants CO, NO2 and O3 relative to CCA, but SO2 was estimated as a positive effect, with values of 0.000265, 0.000251 and 0.000027 under NNI1, NNI2 and MI-MCMC, respectively. In addition, MI-MCMC not only raised the fixed effect of SO2, but also increased the NO2 and O3 effects.

The missing data imputation methods influenced the standard errors of the fixed effects, se(), less than they influenced the fixed effect estimates themselves, but most standard errors were still underestimated, which means that the confidence intervals of the fixed effects would become narrower. The se() values from NNI2 all decreased, except for SO2 in model 6 and PM10 in model 3, and its decrements were all larger than those of NNI1 and MI. The missing data imputation methods seemed to have little influence on se() in model 6, except for PM10 with NNI2. Also, MI had se() values similar to those of CCA in most models, especially in model 1, model 4 and model 5.
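The link between a smaller se() and a narrower confidence interval can be made concrete with a small calculation. The sketch below assumes a normal-approximation (Wald) 95% interval, estimate ± 1.96 × se. The source does not state how its intervals were formed, so this is an illustrative assumption. The values are the O3 fixed effect and its se() in model 5, copied from Table 3.9.

```python
Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumed interval type)


def wald_ci(estimate, se, z=Z_95):
    """Return (lower, upper) of a normal-approximation confidence interval."""
    return estimate - z * se, estimate + z * se


# (estimate, se) for the O3 fixed effect in model 5, copied from Table 3.9.
o3 = {"CCA": (0.001772, 0.000837), "NNI2": (0.000546, 0.000598)}

widths = {}
for method, (est, se) in o3.items():
    lo, hi = wald_ci(est, se)
    widths[method] = hi - lo  # interval width equals 2 * z * se
    print(f"{method}: [{lo:.6f}, {hi:.6f}] width={widths[method]:.6f}")
```

Because the width depends only on se(), the NNI2 interval is narrower than the CCA interval here, regardless of how the point estimate moved.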

As we had reported, the variability of city-specific effects in the GGAMM was small because it was well controlled by the entire model structure. We wondered whether more valid data could increase the variability of city-specific effects, in which case missing data imputation would be expected to be a good tool for producing more variation among cities. The best index is whether the estimated standard errors of the random effects increased with imputed data. From our analysis in Table 3.9, the se() of the PM10 random effects did increase when missing data imputation was applied, especially with NNI1, which increased se() 3.56 times, but NNI1 and NNI2 could not maintain the same efficacy in the co-pollutant models (model 4 and model 5). However, the se()s of the co-pollutants all increased, except for CO, which barely changed from CCA. This increment became larger with MI-MCMC. For example, the se()s of NO2 and O3 with MI-MCMC were 2.10-fold and 2.55-fold the values with CCA. For the lag effects, the se()s of the 1-day lag effect were slightly reduced with the missing data imputation methods, but remained close to the CCA value. However, the se()s of the 2-day lag effect increased considerably, especially with NNI1, where se() was raised from 0.000767 to 0.000244. This implied that the variability of city-specific effects in shorter lag effects would be reduced and smaller than that in relatively longer lag effects. Note that the relative comparison between the two lag effects in model 2 could change if more lag effects were included. Nonetheless, this would introduce additional problems, so we will discuss distributed lag models with more lag effects in the GGAMM in section 3.11.
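The fold changes quoted for the random effects are ratios of the imputed-data se() to the CCA se(). As a minimal sketch, the calculation below uses the se() of the PM10 random effect in model 1, copied from Table 3.9. The helper name fold_change is illustrative.

```python
def fold_change(imputed, reference):
    """Ratio of an imputed-data value to the complete-case (CCA) value."""
    return imputed / reference


# se() of the PM10 random effect in model 1, copied from Table 3.9.
se_random_pm10 = {"CCA": 0.000194, "NNI1": 0.000691,
                  "NNI2": 0.000247, "MI-MCMC": 0.000434}

folds = {
    method: round(fold_change(value, se_random_pm10["CCA"]), 2)
    for method, value in se_random_pm10.items()
    if method != "CCA"
}
print(folds)
```

The NNI1 ratio of 0.000691 to 0.000194 gives the 3.56-fold increase reported for the PM10 random effects.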

In document Multi-city time series analyses of air pollution and mortality data using generalized geoadditive mixed models (Page 151-156)