
Chapter 4: Research Design and Methodology


4.7.3 Multiple regression

Having examined the overall relationships between water chemistry and catchment data, attention is now focused on the main thrust of the thesis, namely whether critical load can be predicted by one or more catchment attributes.

Simple linear regression examines the relationship between a single dependent variable, Y, and an independent variable, X, where the latter is thought to determine the former to some extent. A straight line is fitted through the series of data points in the scatter so as to minimise the sum of squares of errors, or deviations, between the observed values of Y and those predicted by the line. This line can be described by the equation:

$$Y = \alpha + \beta X + \varepsilon \qquad (4.2)$$

where Y is the response variable, $\alpha$ is the intercept of the line with the y axis, $\beta$ is the slope of the line, X is the explanatory variable and $\varepsilon$ is the error term. Where the response variable is to be predicted by values of two or more explanatory variables, multiple regression is used. This is described by the equation:

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n + \varepsilon \qquad (4.3)$$

where $X_1, \ldots, X_n$ are the explanatory variables. Fitting this equation finds the coefficients of all the X variables that minimise the error sum of squares (Manly, 1992). The partial regression coefficients here will only be identical to coefficients from separate simple regressions when the predictor variables are uncorrelated. The prior use of RDA, as well as facilitating a reductionist approach, allows the identification of collinearities between environmental variables. This eliminates the need for an 'extra sum of squares' approach (Manly, 1992)

where, for variables $X_1$ to $X_n$, successive regressions are fitted, relating Y to $X_1$, Y to $X_1$ and $X_2$, and so on until Y is related to all X variables. Variation in Y accounted for by $X_n$ on top of that accounted for by $X_1$ to $X_{n-1}$ is given by the extra sum of squares accounted for by adding $X_n$ to the model.
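The two points made above, that partial coefficients differ from simple regression coefficients when predictors are correlated, and that the extra sum of squares measures what one variable adds on top of another, can be illustrated with a minimal sketch. It assumes NumPy and entirely synthetic data; the helper `fit_ols`, the variable names and the chosen coefficients are hypothetical and are not taken from the thesis.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = a + X b; returns (coefficients, error sum of squares)."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimises the error sum of squares
    resid = y - A @ coef
    return coef, float(resid @ resid)

# Synthetic data (assumption): two correlated predictors and a known relationship.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)           # x2 is correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Simple regression of Y on X1, then multiple regression of Y on X1 and X2.
coef_simple, sse_x1 = fit_ols(x1[:, None], y)
coef_multi, sse_x1x2 = fit_ols(np.column_stack([x1, x2]), y)

# Sequential decomposition of the total sum of squares.
sst = float(((y - y.mean()) ** 2).sum())
ss_x1 = sst - sse_x1                # variation accounted for by X1 alone
extra_ss_x2 = sse_x1 - sse_x1x2     # extra sum of squares from adding X2

print("slope on X1, simple regression:", coef_simple[1])
print("partial slope on X1, multiple regression:", coef_multi[1])
print("extra sum of squares for X2:", extra_ss_x2)
```

Because `x2` is built to be correlated with `x1`, the simple slope on X1 absorbs part of the effect of X2 and differs from the partial slope. The three quantities `ss_x1`, `extra_ss_x2` and `sse_x1x2` add up to the total sum of squares.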

Residual plots are used to assess whether the assumptions of the regression model are fulfilled (Manly, 1992; Johnston, 1991). These are discussed further in Chapter 7. There should be no pattern when residuals are plotted against Y estimates or X values. Tests for randomness, constant variance and normality can be applied to the residuals to ensure these criteria are met. Outliers can also be identified using residual plots, and these can be examined to assess whether the statistical relationships generated by the regression model are being disproportionately influenced by individual observations.
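As a rough illustration of these residual checks, the sketch below fits a simple regression to synthetic data and computes some crude numerical counterparts to the visual inspection of residual plots. The data, the planted outlier, the split-half variance comparison, the lag-1 autocorrelation and the threshold of three standardised units are all assumptions chosen for illustration; they are not the specific tests applied in the thesis.

```python
import numpy as np

# Synthetic data (assumption) with one deliberately planted outlier.
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
y[10] += 8.0

# Least-squares fit and residuals.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted

# Constant variance (crude): compare residual variance in the lower and
# upper halves of the fitted values; a ratio far from 1 suggests a problem.
order = np.argsort(fitted)
low, high = resid[order[: n // 2]], resid[order[n // 2:]]
variance_ratio = high.var(ddof=1) / low.var(ddof=1)

# Randomness (crude): lag-1 autocorrelation of residuals ordered by x.
r = resid[np.argsort(x)]
lag1 = np.corrcoef(r[:-1], r[1:])[0, 1]

# Normality (crude): sample skewness of the residuals, which should be near 0.
z = resid / resid.std()
skewness = float((z ** 3).mean())

# Outliers: residuals more than 3 residual standard errors from zero.
s = np.sqrt(resid @ resid / (n - 2))
outliers = np.flatnonzero(np.abs(resid / s) > 3)

print("variance ratio:", variance_ratio)
print("lag-1 autocorrelation:", lag1)
print("skewness:", skewness)
print("flagged observations:", outliers)
```

A single large outlier, as planted here, also inflates the variance and skewness measures, which is one reason flagged observations are worth examining before the other checks are interpreted.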

4.8 Discussion

A number of issues arising from the research design and methodology need to be addressed, particularly relating to the adequacy of the data used to characterise the calibration catchments. These are introduced here and discussed further in Chapter 8.

4.8.1 Land use data

Land cover and land classification data are only available nationally in raster (grid) format. Land cover is available at 1km and 25m resolution. Ground based and aerial surveys have been undertaken to validate the satellite derived land cover database. These show 75% to 95% accuracy depending on the spatial scale (Fuller and Groom, 1993b). At the 25m scale certain cover types (e.g. suburban and arable) appear more likely to be classed erroneously than others (0.Curtis, pers. comm.; J.Hall, pers. comm.). However, despite these uncertainties, the ITE land cover dataset represents the most accurate picture of land cover presently available in Britain at this resolution.

Land classification data are only available at 1km resolution. Superimposing small catchments onto such grid data could lead to a significant loss of information. This will be particularly problematical at boundaries between different classes.
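The loss of information when a small catchment is superimposed on a coarse grid can be sketched numerically. The example below assumes NumPy and a wholly hypothetical 2km by 2km area with three cover classes mapped at 25m resolution; the class layout, the majority-class aggregation rule in `majority_aggregate` and the catchment outline are illustrative assumptions and do not describe how the ITE datasets were actually derived.

```python
import numpy as np

# Hypothetical 2 km x 2 km area at 25 m resolution: 80 x 80 cells, classes 0 to 2.
fine = np.zeros((80, 80), dtype=int)
fine[:, 50:] = 1              # class 1 occupies an eastern strip
fine[30:50, 30:50] = 2        # class 2 is a small 500 m x 500 m patch

def majority_aggregate(grid, factor, n_classes):
    """Assign each factor x factor block of cells its most frequent class."""
    rows, cols = grid.shape[0] // factor, grid.shape[1] // factor
    blocks = grid.reshape(rows, factor, cols, factor).swapaxes(1, 2)
    blocks = blocks.reshape(rows, cols, factor * factor)
    counts = np.stack([(blocks == k).sum(axis=-1) for k in range(n_classes)], axis=-1)
    return counts.argmax(axis=-1)

# 40 cells of 25 m make one 1 km cell.
coarse = majority_aggregate(fine, 40, 3)

# Expand the 1 km grid back to 25 m cells so both can be sampled identically.
coarse_on_fine = np.repeat(np.repeat(coarse, 40, axis=0), 40, axis=1)

# Hypothetical small catchment: a 750 m x 750 m square straddling 1 km cell boundaries.
catchment = (slice(25, 55), slice(25, 55))
true_props = np.bincount(fine[catchment].ravel(), minlength=3) / fine[catchment].size
coarse_props = np.bincount(coarse_on_fine[catchment].ravel(), minlength=3) / fine[catchment].size

print("class proportions from 25 m data:", true_props)
print("class proportions from 1 km data:", coarse_props)
```

In this contrived layout the class 2 patch never forms the majority of any 1km cell, so it disappears from the coarse grid altogether even though it covers a large share of the catchment. Real catchments will rarely be this extreme, but the mechanism is the same.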

4.8.2 Geology data

It has to be accepted that geological maps at any resolution are only a guide as to the conditions existing on the ground. This is particularly the case with geological boundaries. The scale of geology maps is such that small outcrops of a rock with a high buffering capacity may not be mapped. These may have a disproportionate ameliorating effect on surface water acidity.

A further problem relates to the sensitivity classification proposed by Kinniburgh and Edmunds (1986). This is based on the stratigraphy, or age, of the rock and not on the lithology. As such, it allows for broad lithological differentiation but some facies changes may not be mapped, a difficulty noted by the authors. Limestones and clays, for example, may be present in poorly buffering formations.

Additionally, a classification of rocks on a scale from acidic to basic, although indicative of the chemistry of the rock, might not necessarily reflect the chemistry of the runoff water (M.Clark, pers. comm.). The movement of water through bedrock is important in this context. Where fissure flow dominates over matrix flow, the geochemistry of the fissure surface will control buffering. Preferential flow along these pathways can remove the buffering cement that has accrued (Kinniburgh and Edmunds, 1984).

4.8.3 Soil data

Mapped soil data present the same difficulties of resolution as those encountered with geology. The boundaries between soil map units are even less sharply defined in reality and are impossible to represent accurately with two-dimensional maps. The quantification of soil data also presents difficulties. The pH and base saturation data for each soil map unit are based on an insufficient number of samples at best and a single sample at worst. Clearly soils

caution against using mean weathering rate to characterise soil associations.

4.8.4 Deposition data

The uncertainties associated with using monitored atmospheric inputs of N and S for individual catchments are discussed in detail in Chapters 2 and 3. These uncertainties also apply to rainfall values, although these are based on a much larger monitoring network.

Although a series of problems has been recognised, it is likely that these will always be encountered when attempting to use nationally available data at a local scale. However, the development of a nationally applicable statistical model requires that the best available data are used. In the absence of comprehensive and freely available data at high resolution, the data used here offer the most pragmatic means of developing a predictive catchment critical loads model.

In document Predicting surface water critical loads at the catchment scale (Page 110-115)