AEROSOL SAMPLING DEVICES AND ANALYTICAL METHODOLOGY
2.6 Data and statistical analysis
All datasets obtained in the study were used to analyse and evaluate the sources and formation pathways of major chemical components in airborne particulate matter. In principle, the two most common practices to analyse datasets are nonparametric and parametric analysis.
Nonparametric statistical procedures evaluate data without requiring restrictive assumptions about the underlying distribution. Parametric methods, by contrast, require that the form of the population distribution be specified, usually as a normal distribution. Nonparametric methods are therefore more flexible and more appropriate for this study, since the observed data tend to be heterogeneous and the heterogeneity is not well described. The Kolmogorov-Smirnov test was applied to the datasets to investigate the sample distributions. Nonparametric tests, such as the Mann-Whitney test and the Kruskal-Wallis test, were used mainly for testing hypotheses that compare datasets.
However, nonparametric procedures are less suited to estimating association in these datasets. The methods of data analysis are detailed below.
Determination of association
The levels of association between chemical components in air samples were determined both between sites (inter-site) and within sites (intra-site) in order to investigate their origins and formation pathways. The Pearson (product-moment) correlation should be the first choice (Belle, 2008) for measuring correlation coefficients (r). This test measures the strength of the linear relationship between two variables. Because all concentration data (raw data) were used in this study without removing outliers, the Pearson correlation test is appropriate and more sensitive than rank correlation analysis (a nonparametric method). A linear association implies that as one variable increases, the other increases or decreases linearly.
The correlation coefficient does not imply cause and effect. Results for a pair of variables should therefore be reported as a high correlation or a strong relationship. Correlation coefficients range from -1.00 to +1.00. Values close to +1.00 indicate that as one variable increases, the other increases nearly linearly. Conversely, values close to -1.00 indicate that as one variable increases, the other decreases nearly linearly. Values close to 0 imply little correlation between the variables. A rough guide to the degree of association given by some statistical references (Yin, 2002) indicates that r in the range 0.00 - 0.19 represents a very weak correlation, 0.20 - 0.39 a weak correlation, 0.40 - 0.69 a modest correlation, 0.70 - 0.89 a strong correlation and 0.90 - 1.00 a very strong correlation.
To determine whether the correlation between two variables is significant, the null hypothesis of no correlation was tested by consulting a table of critical values of the Pearson product-moment correlation coefficient at the 0.05 level of significance, with n - 2 degrees of freedom (n is the number of observations). The distributions of both variables should be analysed because a skewed distribution produces a smaller r than a normal distribution. The Kolmogorov-Smirnov test was used to determine the distribution of the datasets, and the results are reported in Appendix E. The test results indicate that the ionic species analysed in air samples deviate significantly from a normal distribution. Thus, smaller r values might be observed and reported for the determination of association. In this thesis, “good” correlations of chemical components in aerosol samples denote strong and very strong correlations, and “poor” correlations denote weak and very weak correlations.
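The calculation described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in this thesis (the significance test here relied on tabulated critical values); the function names and the example data are hypothetical. Two assumptions are made: the rough guide is applied to the absolute value of r, and the class boundaries are taken as 0.20, 0.40, 0.70 and 0.90. The t statistic shown is the standard equivalent of the tabulated test, with n - 2 degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with n >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for the null hypothesis of no correlation (n - 2 degrees of freedom).

    Undefined for |r| == 1, where the denominator is zero.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def strength_label(r):
    """Map |r| onto the rough guide quoted in the text (assumed class boundaries)."""
    a = abs(r)
    if a < 0.20:
        return "very weak"
    if a < 0.40:
        return "weak"
    if a < 0.70:
        return "modest"
    if a < 0.90:
        return "strong"
    return "very strong"

# Hypothetical concentrations of two species in five samples.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)
print(round(r, 3), strength_label(r), round(t_statistic(r, len(x)), 3))
```

The computed t value would then be compared with the critical value of Student's t distribution for n - 2 degrees of freedom at the 0.05 level, which is equivalent to consulting a table of critical r values.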
Regression analysis
Correlation analysis is commonly used in conjunction with regression analysis. When data for two variables are plotted on a graph, they are said to have a linear relationship if the points tend to fall on a straight line. The strength of the association between the variables can be estimated by judging how close the points are to the line. The correlation between the variables is high when the points lie very near the line and low when the line is a poor summary of the positions of the points. The position of the line shows how a change in one variable is expected to affect the other. Reduced major axis (RMA) regression was applied in this study (Ayers, 2001). The best-fitting line is obtained by minimising the deviations in both variables, rather than in the dependent variable only. The coefficient of determination was reported as R². If there is a perfect relationship, the coefficient of determination is 1.00. This means that all the points on the graph lie on the regression line and all the variation in one variable is explained by the other. RMA regression was applied to the dataset to determine the relationship between rural and urban background concentrations for samples collected simultaneously (chapters 3.3.3 and 4.4).
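For illustration, RMA regression can be sketched in a few lines of Python. The sketch assumes the common textbook formulation, in which the slope is the ratio of the standard deviations of y and x, carrying the sign of r, and the line passes through the point of means; whether this matches every detail of the procedure in Ayers (2001) is not verified here. The function name and the data are hypothetical.

```python
import math

def rma_regression(x, y):
    """Reduced major axis fit: returns (slope, intercept, R squared).

    Assumed formulation: slope = sign(r) * sd(y) / sd(x),
    intercept = mean(y) - slope * mean(x), R squared = r ** 2.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with n >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("the fit is undefined when one variable is constant")
    r = sxy / math.sqrt(sxx * syy)
    # The ratio of standard deviations equals sqrt(syy / sxx); the sign follows r.
    slope = math.copysign(math.sqrt(syy / sxx), r)
    intercept = mean_y - slope * mean_x
    return slope, intercept, r * r

# Hypothetical paired rural (x) and urban background (y) concentrations.
slope, intercept, r2 = rma_regression([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept, r2)
```

Unlike ordinary least squares, this fit treats the two variables symmetrically: exchanging x and y gives the reciprocal slope, which suits cases where both variables carry measurement error.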
Measures of central tendency
The central tendency of the samples was estimated by the sample mean in this study. Where the mean exceeded the 75th percentile of the observations, the discussion was still based on the mean, with special caution, and the sample median was also presented, as reported in Table 3.1. Such cases arise mostly from outliers in the measurement data. Outliers commonly result from transcription errors, data-coding errors, or measurement system problems. However, outliers may also represent true extreme values of a distribution and indicate more variability in the population than was expected. Failing to remove true outliers and removing false outliers both distort estimates of population parameters. Rejecting an outlier from a dataset should be done with extreme caution, particularly for environmental datasets, which often contain legitimate extreme values (U.S. EPA, 2006).
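The check described above, flagging datasets whose mean exceeds the 75th percentile, can be sketched as follows. This is an illustration only: the function name and the data are hypothetical, and the percentile definition (the "inclusive" interpolation method of Python's statistics module) is an assumption, since the thesis does not state which definition was used.

```python
import statistics

def central_tendency(values):
    """Return the mean, median and 75th percentile, and flag mean > 75th percentile."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    # quantiles(..., n=4) returns the three quartile cut points.
    _, _, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "median": median,
        "p75": p75,
        "mean_exceeds_p75": mean > p75,
    }

# Hypothetical concentrations with one extreme value.
summary = central_tendency([1.0, 2.0, 3.0, 4.0, 100.0])
print(summary)
```

A flagged dataset is one where the median should be reported alongside the mean, as is done in Table 3.1.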
Data below the limits of detection
As noted in the description of how detection limits were determined for the chemical analyses, some data obtained in this study fell below the detection limit of the analytical procedure. These measurements are generally reported as non-detects (<dl). In these cases, the concentrations of the analysed species are unknown, although they lie somewhere between zero and the detection limit. There are a variety of ways to evaluate data that include values below the detection limit. However, no general procedure is applicable in all cases (U.S. EPA, 2006). In this thesis, data below the limit of detection were replaced with dl/2 and the usual analysis was performed.
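The dl/2 substitution can be written as a short Python sketch. The representation of a non-detect as None, the function name and the example values are assumptions made for illustration.

```python
def substitute_nondetects(values, dl):
    """Replace non-detects with dl / 2.

    A non-detect is assumed to be recorded either as None or as a
    numeric value below the detection limit dl.
    """
    return [dl / 2.0 if (v is None or v < dl) else v for v in values]

# Hypothetical concentrations with a detection limit of 0.1.
print(substitute_nondetects([0.5, None, 0.01, 0.2], dl=0.1))
```

Each species would be processed with its own detection limit before the means, correlations and hypothesis tests are calculated.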
Comparison of two populations
Generally, it is difficult to obtain a truly random sample in the environmental sciences (Watts and Halliwell, 1996). Many of the statistical tests described are based on the assumption that the observations being tested have a normal or almost normal distribution. When this condition cannot be met, a range of distribution-free or nonparametric statistical tests is appropriate.
In this study, the Kolmogorov-Smirnov test was applied to the datasets, and the results indicated that the chemical components analysed in air samples deviate significantly from a normal distribution, as shown in Appendix E. Therefore, nonparametric tests were used in all hypothesis tests of significant differences. The Mann-Whitney U test is a nonparametric test that can be used to analyse data from a two-group independent-groups design when the measurements can at least be ranked or ordered. The test does not require the two datasets to be drawn from normal distributions. It does assume that the two distributions are similar in shape, but they need not be symmetric. The test evaluates the null hypothesis that the two groups of samples come from the same population.
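The ranking logic behind the Mann-Whitney U test can be sketched in Python as follows. This is an illustrative implementation, not the software used in this thesis. The function names and the data are hypothetical, and the two-sided p-value uses the large-sample normal approximation without tie or continuity correction, which is an assumption and is unreliable for small samples; exact tables or a statistical package (for example `scipy.stats.mannwhitneyu`) would normally be used instead.

```python
import math

def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U, z, p) for two independent samples.

    U is the smaller of the two U statistics; z and the two-sided p
    come from the normal approximation (assumed, no tie correction).
    """
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:n1])
    u1 = rank_sum_a - n1 * (n1 + 1) / 2.0
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, z, p

# Hypothetical concentrations at two sites.
print(mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

The null hypothesis that both groups come from the same population is rejected at the 0.05 level when p falls below 0.05.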
Comparison of several populations
The Kruskal-Wallis nonparametric test can be used to assess whether there are any significant differences between k independent samples. It is used for comparing more than two samples. The test assumes that the measurements under study are at least on an ordinal scale and that the underlying variable has a continuous distribution. The null hypothesis is that the k samples come from the same population, or that their underlying distributions have the same average.
This test was performed to investigate significant differences in component concentrations between clusters of air mass back trajectories (chapter 5.5). For the trajectory analysis, a clustering method was used to classify air mass backward trajectories generated by the HYSPLIT4 (HYbrid Single-Particle Lagrangian Integrated Trajectory) model, as detailed in Chapter 5.
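As an illustration of how the Kruskal-Wallis statistic is formed, the following Python sketch computes H with the usual correction for ties. It is not the software used in this thesis; the function names and the grouped data (standing in for concentrations grouped by trajectory cluster) are hypothetical. The returned H would be compared with the chi-square critical value for k - 1 degrees of freedom, an approximation that is assumed to hold when group sizes are not very small.

```python
def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for k independent samples."""
    pooled = [v for group in groups for v in group]
    n_total = len(pooled)
    ranks = midranks(pooled)
    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3.0 * (n_total + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N) over tied groups.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("H is undefined when all values are identical")
    return h / correction, len(groups) - 1

# Hypothetical concentrations grouped by three trajectory clusters.
h, df = kruskal_wallis_h([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(h, df)
```

For three clusters (2 degrees of freedom) the chi-square critical value at the 0.05 level is 5.991, so an H above that value would indicate a significant difference between clusters.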
Lastly, the measurement uncertainty associated with the analysed data was estimated in order to ensure the reliability and traceability of the measurement results, as described in Appendix A.