Chapter 4: Model building and evaluation

4.2 Data preparation

4.2.1 Dealing with missing values

The modelling of air quality trends rests largely on statistical analysis of data collected at monitoring stations. However, it is common that not all scheduled measurements are made. Reasons for missing data include machine failure, routine maintenance, human error and other factors. It is acknowledged that incomplete datasets may produce results that differ from those that would have been obtained from a complete dataset (Hawthorne and Elliott, 2005), and database users are often obliged to complete the datasets themselves. Imputation is a common method for estimating missing values in a dataset. However, in this analysis we chose to leave missing data as NA values, for two reasons. First, the development and testing of a model for imputing the missing values was beyond the scope of this project: a statistical multiple imputation model that accurately reflects the trend, seasonal cycle, and joint error structure of multiple atmospheric gases is time-consuming to develop, because extensive analysis is needed to evaluate the most appropriate imputation technique and the assumptions of the imputation need to be checked. Second, a simple imputation technique that did not accurately reflect the joint distribution of the variables at any given timepoint may have biased our results.
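The concern about simple imputation can be illustrated with a minimal sketch. It uses synthetic values (not the monitoring data), and the variable names are hypothetical. It shows one well-known side effect of mean imputation: filling gaps with the observed mean leaves the mean unchanged but shrinks the sample variance, so the imputed series no longer reflects the spread of the observed data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "concentration" values; NOT the study data.
rng = np.random.default_rng(0)
pm = pd.Series(rng.gamma(shape=2.0, scale=5.0, size=500))

# Mark 100 of the 500 values as missing (NaN plays the role of NA).
pm_missing = pm.copy()
missing_rows = rng.choice(500, size=100, replace=False)
pm_missing.iloc[missing_rows] = np.nan

# pandas skips NaN by default, so this is the variance of the 400 observed values.
var_observed = pm_missing.var()

# Simple mean imputation: every gap receives the same value.
pm_mean_imputed = pm_missing.fillna(pm_missing.mean())
var_imputed = pm_mean_imputed.var()

print(f"observed variance: {var_observed:.2f}")
print(f"variance after mean imputation: {var_imputed:.2f}")
```

Because the imputed points sit exactly on the mean, they add nothing to the sum of squared deviations while increasing the sample size, which is why the variance after imputation is always smaller here.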

4.2.2 Outlier detection and removal

Outliers are data points that deviate significantly from the others, and they are a challenge to deal with properly in scientific research. The different methods of defining, identifying, and handling outliers can significantly change study conclusions (Aguinis et al., 2013). As emphasized by Cortina (2002), “caution also must be used because, in most cases, deletion [of outliers] helps us to support our hypothesis” (p. 359). Removing outliers can therefore be problematic, because it can artificially improve the fit of a model and so produce favorable results. However, outliers can also have such a strong influence that they bias the fitted estimates, the predictions and the accuracy of the model. Ultimately there is a tradeoff, and it is left to the researcher to decide whether removing outliers is appropriate.

Visual inspection of the TEOM and BAM measurements reveals cases where the values were quite different, to such an extent that one would expect them to represent erroneous data. However, given that the purpose of this study is to identify any evidence of differences in response between the PM2.5 TEOM and BAM, standard air quality data editing practice was applied only to a limited extent, as it might remove data that reflect real biases in each of the samplers. Hence, a conservative approach to data consistency was applied, allowing the inclusion of data displaying significant levels of inconsistency.

The data were visually inspected to identify outliers, using a scatter plot (Figure 3-1A) in combination with a plot of residuals against leverages (Figure 4-1). In Figure 3-1A, the data point furthest to the right on the x-axis appears to be an outlier (TEOM = 170.9 µg/m3, BAM = 45.1 µg/m3 on 04/09/2012 at 01:00 a.m.). The point does not appear to follow the same trend as the rest of the data.

Figure 4-1 assists with identifying points that influence the regression line by showing Cook’s distance, indicated by the red dashed line. Cook’s distance measures the effect of removing a given observation: the larger the distance, the more influential the observation. Observations outside the red dashed line are influential to the regression results. The observation in row 17,577 was identified as influential. This is the same data point identified by the visual inspection. The TEOM value was removed from the dataset and assigned an NA value, due to concerns that it might influence the
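As an illustration of how Cook’s distance flags an influential point, the following is a minimal sketch using the standard textbook formula. It is not the code used in this study. The data are synthetic, with one planted point that reuses the TEOM and BAM values quoted above; the function name, the choice of TEOM as predictor and BAM as response, and the simulated trend are all assumptions made for the example.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an ordinary least squares fit.

    X is the design matrix (including the intercept column), y the response.
    D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2, where e_i is the residual,
    h_ii the leverage, p the number of coefficients and s^2 the residual variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    # Leverages are the diagonal of the hat matrix H = Q Q^T (reduced QR of X).
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    s2 = residuals @ residuals / (n - p)
    return residuals**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

# Synthetic data: 200 well-behaved points plus one planted point far to the right.
rng = np.random.default_rng(1)
teom = rng.uniform(0.0, 50.0, size=200)
bam = 2.0 + 0.8 * teom + rng.normal(0.0, 3.0, size=200)
teom = np.append(teom, 170.9)  # planted point, values quoted in the text
bam = np.append(bam, 45.1)

X = np.column_stack([np.ones_like(teom), teom])
distances = cooks_distance(X, bam)
most_influential = int(np.argmax(distances))
print(most_influential, distances[most_influential])
```

In this sketch the planted point combines a high leverage (it lies far from the other x values) with a large residual (it lies well off the trend), which is the combination that produces a large Cook’s distance.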

4.2.3 Lagging

Because our exploratory analysis revealed the significance of lagged variables, it is important to include lagged variables in the pool of covariates available for model construction. Each lag is added to the dataset as a new variable, called a lag-response, in addition to the standard exposure-response relationship. Lag-response variables of 1 hr, 2 hrs and 24 hrs for the PM2.5 TEOM, nephelometer and PM10 measurements were made available as variables for selection in the model.
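The construction of lag-response variables can be sketched as follows. This is a hedged example and not the code used in the study: it assumes an hourly series indexed by timestamp, and the column name teom_pm25 and the function add_lag_variables are hypothetical. Lags are taken by timestamp rather than by row position, so a gap in the record yields a missing lag value instead of a value from the wrong hour.

```python
import numpy as np
import pandas as pd

def add_lag_variables(df, columns, lags_hours=(1, 2, 24)):
    """Return a copy of df with lagged copies of the given columns.

    For each column and each lag h, the new column holds the value
    observed h hours earlier (NaN where no such observation exists).
    """
    out = df.copy()
    for col in columns:
        for h in lags_hours:
            # Move each value forward in time by h hours, then look it up
            # again at the original timestamps.
            shifted = df[col].shift(freq=pd.Timedelta(hours=h))
            out[f"{col}_lag{h}h"] = shifted.reindex(df.index)
    return out

# Small synthetic hourly series (values 0, 1, 2, ...), for illustration only.
index = pd.date_range("2012-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"teom_pm25": np.arange(48, dtype=float)}, index=index)
lagged = add_lag_variables(hourly, ["teom_pm25"])
print(lagged.iloc[22:27])
```

The first h rows of each lagged column are necessarily missing, which is consistent with leaving missing values as NA rather than imputing them.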

Figure 4-1. A plot of residuals against leverages, along with Cook’s distance.

4.2.4 Breaking up monthly and hourly data into blocks

Data blocking was performed to satisfy the aim of constructing a parsimonious model, that is, a model that achieves a good level of prediction while using as few variables as possible, without sacrificing rigor. It would be inappropriate to input 24 hourly values and 12 monthly values into the model, as this would lead to an excessive number of input variables. Hence, we blocked the hourly and monthly data based on their significance levels. Hourly values were segmented into block a: 11:00 p.m. to 2:00 a.m., block b: 3:00 a.m. to 7:00 a.m., block c: 8:00 a.m. to 3:00 p.m., and block d: 4:00 p.m. to 10:00 p.m. These statistical blocks of hours appear to correspond to the overnight period, a fall and rise of values during the day, and higher values during the morning and afternoon peak periods. Monthly data were blocked into block a: November to March, and block b: April to October. Again, the monthly statistical blocks correspond reasonably well with the physical cause: November through to March have fairly constant average PM2.5 readings, while April through to October show more variation. The reasons for the statistical cut-off points for these blocks are explained in Appendix 4.
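The blocking described above can be sketched as a mapping from each timestamp to an hour block and a month block. This is an illustrative example only: the function names are hypothetical, and it assumes that each hourly value is labelled by its clock hour (0 to 23), so that block a covers hours 23, 0, 1 and 2.

```python
import pandas as pd

def hour_block(hour):
    """Map a clock hour (0-23) to the hour blocks a-d given in the text."""
    if hour >= 23 or hour <= 2:
        return "a"  # 11:00 p.m. to 2:00 a.m.
    if hour <= 7:
        return "b"  # 3:00 a.m. to 7:00 a.m.
    if hour <= 15:
        return "c"  # 8:00 a.m. to 3:00 p.m.
    return "d"      # 4:00 p.m. to 10:00 p.m.

def month_block(month):
    """Map a calendar month (1-12) to the month blocks a-b given in the text."""
    return "a" if month >= 11 or month <= 3 else "b"

# Apply the mapping to a few synthetic timestamps.
stamps = pd.to_datetime(["2012-01-15 23:00", "2012-04-09 01:00", "2012-07-01 12:00"])
blocks = pd.DataFrame(
    {
        "hour_block": [hour_block(h) for h in stamps.hour],
        "month_block": [month_block(m) for m in stamps.month],
    },
    index=stamps,
)
print(blocks)
```

Replacing 24 hourly and 12 monthly indicators with these two categorical variables reduces the number of candidate inputs considerably, in keeping with the aim of a parsimonious model.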

In document Development and evaluation of a model to correct tapered element oscillating microbalance (TEOM) readings of PM2.5 in Chullora, Sydney. (Page 56-59)