
3.2 Methodology

3.2.3 Data analysis

3.2.3.1 Handling of measurements below the limit of detection

For thirty-four per cent of the filters analysed by Hill Laboratories, the reported arsenic result was below the laboratory limit of detection (LOD). In these cases the mass of arsenic on the filter was known only to lie somewhere between zero and the ICP-MS detection limit of 0.05 µg/sample. Such measurements are considered too uncertain to report as a single number and so were reported as “<0.05”. There is no regulatory guidance for how to incorporate results below the LOD when calculating an annual average to be compared against national ambient air quality guidelines.

In environmental chemistry, a variety of approaches can be used for incorporating values reported as below the LOD into statistical analyses. Typically, substitution methods are used: for example, all values reported below the LOD can be set to zero, to half the LOD, or to the LOD itself. Alternatively, the actual non-reported laboratory values (if available) can be used. In this thesis, values below the LOD were set to zero when calculating summary statistics for use in the health risk assessment and for comparison against the regulatory guidelines, for consistency with previous studies. In the statistical modelling, all values below the LOD were set equal to the LOD and were converted to concentrations based on the volume of air sampled in the 24-hour exposure period.
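The substitution options described above can be sketched in a few lines. This is an illustrative Python sketch rather than the R code used in the thesis; the function names, the use of `None` to mark a “<0.05” result, and the 24 m³ sample volume are assumptions made for the example only.

```python
LOD_UG_PER_SAMPLE = 0.05  # ICP-MS detection limit stated in the text


def substitute_below_lod(mass_ug, method, lod=LOD_UG_PER_SAMPLE):
    """Return the arsenic mass (ug/sample) to use for one filter.

    mass_ug is None when the laboratory reported the result as "<0.05"
    (an encoding assumed for this sketch).
    """
    if mass_ug is not None:
        return mass_ug
    substitutes = {"zero": 0.0, "half": lod / 2.0, "lod": lod}
    if method not in substitutes:
        raise ValueError(f"unknown substitution method: {method!r}")
    return substitutes[method]


def to_ng_per_m3(mass_ug, air_volume_m3):
    """Convert a filter mass in ug to an air concentration in ng/m3."""
    return mass_ug * 1000.0 / air_volume_m3


# Hypothetical data: one detected filter and one reported as "<0.05",
# each with an assumed 24 m3 of air sampled over the 24-hour period.
filters = [(0.12, 24.0), (None, 24.0)]
for method in ("zero", "half", "lod"):
    concs = [to_ng_per_m3(substitute_below_lod(m, method), v) for m, v in filters]
    print(method, [round(c, 2) for c in concs])
```

In the terms of this sketch, the summary statistics and guideline comparison correspond to `method="zero"` and the statistical modelling to `method="lod"`.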

3.2.3.2 Graphical presentation and statistical analyses

Data analysis was carried out using R version 3.1.0 (R Core Team, 2014) to assess data distributions and look for potential correlations and dependencies among variables. Data aggregation, time series, scatterplots, time variation, correlation plots and the wind rose plots were produced using the R package openair (Carslaw & Ropkins, 2012). Box plots, density plots and summary statistics were produced in the default Base R package.

A non-parametric bootstrap method was used to calculate means and associated confidence intervals for the metrics required for the objectives underpinning Aim 1 (Chapter 1). This method was used because it was not possible to calculate a confidence interval analytically following the post hoc weighting of the mean to account for the non-uniform monitoring frequency. A further advantage of the non-parametric bootstrap method is that statistical inference does not need to be based on the assumptions of a particular probability distribution (e.g., normal), as it is based instead on the assumption that the empirical sample observations represent the population distribution (Efron & Tibshirani, 1986). The bootstrap analysis employed the sample function in the default Base R package to calculate means for replicate sample observations and two-sided confidence intervals at the 0.05 significance level, these being the 0.025 and 0.975 quantiles of the frequency distribution of 10,000 bootstrapped means for the period of interest, i.e., annual, winter and non-winter.

The major assumption for valid inferences based on the bootstrap method, and for correlations between time series variables, is that the observations in the original sample are independent of one another (Zieffler, Harring, & Long, 2011). In environmental time series data there is potential for autocorrelation, that is, temporal correlation in which an observation on one day (xₜ) may be correlated with the previous day’s observation (xₜ₋₁). Air quality and meteorological time series are prone to positive autocorrelation due to physical processes. For instance, higher than normal temperature on one day is often associated with higher than normal temperature on the next day (Weatherhead et al., 1998). Violation of the independence assumption can result in systematic underestimation of variation in a population, which leads to erroneous inference (Zieffler et al., 2011).
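The percentile bootstrap and the independence check discussed above can be illustrated as follows. This is a minimal Python sketch, not the R code used in the thesis: it resamples an unweighted mean (the post hoc weighting is omitted for brevity), the function names are hypothetical, and the concentrations are made-up values.

```python
import random


def quantile(sorted_values, p):
    """Quantile by linear interpolation between adjacent order statistics."""
    h = (len(sorted_values) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (h - lo) * (sorted_values[hi] - sorted_values[lo])


def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=None):
    """Return (sample mean, lower, upper) from a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    # Resample n observations with replacement and record each replicate mean.
    boot_means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return (sum(values) / n,
            quantile(boot_means, alpha / 2),
            quantile(boot_means, 1 - alpha / 2))


def lag1_autocorrelation(series):
    """Sample correlation between each observation and the previous one."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return num / denom


# Hypothetical daily arsenic concentrations (ng/m3), for illustration only.
arsenic = [0.0, 2.1, 3.4, 0.0, 8.9, 12.5, 4.2, 0.0, 1.7, 6.3]
print(bootstrap_mean_ci(arsenic, seed=1))
print(lag1_autocorrelation(arsenic))
```

A lag-1 autocorrelation well above zero would suggest that the independence assumption is doubtful and that the bootstrap interval may be too narrow.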

Principal Components Analysis (PCA) was applied to the multi-element data set, using the prcomp function in the Stats R package, to identify elements that are associated with each other (i.e., co-vary) and may therefore indicate a common emission source.
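To show what "elements that co-vary" looks like numerically, the sketch below extracts only the first principal component of a small, made-up element data set. It is written in Python and uses power iteration on the correlation matrix, which is a different algorithm from prcomp (prcomp works from a singular value decomposition of the centred data). Standardising the variables is an assumption of this sketch, as the text does not state whether the data were scaled, and all names and values are hypothetical.

```python
import math


def standardise(columns):
    """Centre each column on zero and scale it to unit sample variance."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        out.append([(x - mean) / sd for x in col])
    return out


def correlation_matrix(columns):
    """Pearson correlation matrix of the columns."""
    z = standardise(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_principal_component(columns, n_iter=500):
    """Return (loadings, share of total variance) for the first component."""
    corr = correlation_matrix(columns)
    p = len(corr)
    # Non-uniform start vector; power iteration fails only if the start is
    # exactly orthogonal to the leading eigenvector.
    vec = [1.0 / (i + 1) for i in range(p)]
    for _ in range(n_iter):
        new = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    eigenvalue = sum(vec[i] * sum(corr[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    return vec, eigenvalue / p


# Hypothetical concentrations of three elements on five filters.
element_a = [1.0, 2.0, 3.0, 4.0, 5.0]
element_b = [2.0, 4.0, 6.0, 8.0, 10.5]   # tracks element_a closely
element_c = [5.0, 1.0, 4.0, 2.0, 3.0]    # largely unrelated
loadings, share = first_principal_component([element_a, element_b, element_c])
print([round(v, 3) for v in loadings], round(share, 3))
```

Elements with large loadings of the same sign on a component vary together, which is the pattern interpreted as a possible common source.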

Multiple linear regression using the least squares estimator was used to model the dependence of arsenic on other measured variables, using the lm function in the default Base R package. Candidate predictor variables were identified from recursive partitioning and regression tree analysis (the rpart function in the Rpart R package), from Pearson’s correlation coefficients, and on the basis of physical plausibility. Diagnostics for the linear regression used plots of the fitted model residuals (error term) to check for violations of the assumptions underpinning model validity, i.e., a linear relationship between mean response and predictor, independence, constant variance and an approximately normal distribution. Omnibus F tests were carried out using the anova function in R to ensure that the most parsimonious model was selected, i.e., the best fit for the least number of predictor variables.
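The least squares fit and the F comparison of nested models can be sketched as below. This Python sketch solves the normal equations directly for brevity, which is not how lm computes its estimates, and it returns only the F statistic, not a p-value. The function names, predictors and data are hypothetical.

```python
def fit_ols(rows, y):
    """Least squares fit with an intercept.

    rows holds one list of predictor values per observation.
    Returns (coefficients, residual sum of squares); coefficients[0] is
    the intercept.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    aug = [[sum(x[i] * x[j] for x in X) for j in range(k)]
           + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    rss = sum((yi - sum(c * xi for c, xi in zip(coefs, x))) ** 2
              for x, yi in zip(X, y))
    return coefs, rss


def nested_f_statistic(rss_reduced, rss_full, p_reduced, p_full, n):
    """F statistic for a reduced model nested within a full model.

    p_reduced and p_full count estimated coefficients, intercept included.
    """
    return (((rss_reduced - rss_full) / (p_full - p_reduced))
            / (rss_full / (n - p_full)))


# Hypothetical data: arsenic against temperature and wind speed.
temperature = [2.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0, 15.0]
wind_speed = [1.0, 3.0, 2.0, 4.0, 2.5, 5.0, 3.5, 6.0]
arsenic = [14.1, 11.8, 11.2, 8.7, 7.9, 4.8, 5.1, 1.2]

_, rss_reduced = fit_ols([[t] for t in temperature], arsenic)
_, rss_full = fit_ols([[t, w] for t, w in zip(temperature, wind_speed)], arsenic)
print(nested_f_statistic(rss_reduced, rss_full, 2, 3, len(arsenic)))
```

A small F statistic indicates that the extra predictor adds little explanatory power, which favours the model with fewer predictors.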

3.2.3.3 Regulatory guideline comparison

The New Zealand ambient air quality guideline for inorganic arsenic is expressed as an annual average of arsenic in PM10 of 5.5 ng/m³ (MfE, 2002; Section 2.3.2). Although not stated in the MfE guidelines, it is the norm to express the annual average on a calendar-year basis. In this study the timing and duration of the monitoring campaign did not match calendar years. For comparing results to the MfE guideline value, the annual averages were therefore taken over the period 1 November to 31 October and were weighted to account for non-uniform sampling frequency.
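The effect of weighting for non-uniform sampling frequency can be illustrated with a short sketch. The weighting scheme itself is not specified here, so the Python example below assumes one simple option, giving each calendar month equal weight by averaging the monthly means. The function name and the data are hypothetical.

```python
from collections import defaultdict

GUIDELINE_NG_M3 = 5.5  # MfE (2002) annual guideline for arsenic in PM10


def month_weighted_annual_mean(observations):
    """Average of monthly means, so each sampled month counts equally.

    observations is an iterable of ((year, month), concentration) pairs.
    Equal weight per month is an assumption of this sketch.
    """
    by_month = defaultdict(list)
    for month_key, conc in observations:
        by_month[month_key].append(conc)
    monthly_means = [sum(v) / len(v) for v in by_month.values()]
    return sum(monthly_means) / len(monthly_means)


# Hypothetical campaign: winter sampled three times as often as summer.
observations = [
    ((2013, 7), 10.0), ((2013, 7), 14.0), ((2013, 7), 12.0),
    ((2013, 12), 2.0),
]
unweighted = sum(c for _, c in observations) / len(observations)
weighted = month_weighted_annual_mean(observations)
print(unweighted, weighted, weighted > GUIDELINE_NG_M3)
```

Without weighting, the more intensively sampled winter month dominates the mean; the weighted value gives each month the same influence on the annual average.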

As noted in Section 3.2.3.1, there is no regulatory guidance for how to incorporate results below the LOD when calculating an annual average to be compared against national ambient air quality guidelines. For this comparison, all observations below the LOD were set to zero.