
4. Modelling Ambient Particle Concentrations

4.3. ESCAPE LUR Model Preparation and Development

4.3.2. Data Preparation and Model Development

Within the ESCAPE project, a standardised protocol was applied for LUR model development; it is described in detail in the study manual (Beelen & Hoek 2010) and outlined below. LUR models for predicting particulate concentrations within the London and Thames Valley region were developed using adjusted annual average concentrations of PM from the bespoke monitoring campaign described in chapter 3. In a first step, the data sets presented in Table 29 were prepared and integrated into a GIS, and potential explanatory variables were derived for the locations of the monitoring sites. Examples include the sum of road length within buffers of different sizes (radius: 25 m, 50 m, 100 m, 300 m, 500 m and 1000 m) around each location, and the altitude at each site. Buffers are calculated in a GIS as circular areas around a site. Each type of variable was calculated for several buffer sizes in order to evaluate the impact of predictor variables at different spatial scales. The selection of minimum and maximum buffer sizes for LUR input variables was based on known dispersion patterns. The effect of traffic emissions, for example, declines rapidly with distance from the road, which is reflected in the small buffer sizes of 25 m and 50 m for traffic variables. Larger buffers around roads, such as 1000 m, were included to reflect the sum of emissions from roads in the wider area. Land use buffers were generally larger than road buffers (100 m to 5000 m), as they do not reflect a specific source but represent diffuse sources over an area. In total, 30 principal predictor variables (not counting variations or different buffer sizes of the same variable) were produced; these are shown in Table 30.
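The buffer calculation described above can be illustrated with a minimal sketch. It is not the GIS workflow used in the study: it assumes roads are given as straight line segments in projected metre coordinates, and the function names `length_in_buffer` and `road_length_variables` are hypothetical. Only the buffer radii and the ROADLENGTH_X naming pattern are taken from the text and Table 30.

```python
import math

# Buffer radii (m) used for road variables in the text
BUFFER_RADII_M = (25, 50, 100, 300, 500, 1000)


def length_in_buffer(segment, centre, radius):
    """Length of one straight road segment that lies inside a circular buffer."""
    (x1, y1), (x2, y2) = segment
    cx, cy = centre
    dx, dy = x2 - x1, y2 - y1
    seg_len = math.hypot(dx, dy)
    if seg_len == 0.0:
        return 0.0
    # Solve |P1 + t * (P2 - P1) - C|^2 = r^2 for the line parameter t
    fx, fy = x1 - cx, y1 - cy
    a = dx * dx + dy * dy
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - radius * radius
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:
        return 0.0  # the line misses (or only touches) the circle
    root = math.sqrt(disc)
    t_in = max(0.0, (-b - root) / (2.0 * a))   # clip entry point to segment start
    t_out = min(1.0, (-b + root) / (2.0 * a))  # clip exit point to segment end
    return max(0.0, t_out - t_in) * seg_len


def road_length_variables(segments, site):
    """One ROADLENGTH_X value per buffer radius for a single monitoring site."""
    return {
        f"ROADLENGTH_{r}": sum(length_in_buffer(s, site, r) for s in segments)
        for r in BUFFER_RADII_M
    }


# Illustrative (made-up) geometry: a long road through the site and a short side road
roads = [((-2000.0, 0.0), (2000.0, 0.0)), ((0.0, 30.0), (100.0, 30.0))]
variables = road_length_variables(roads, (0.0, 0.0))
for name, value in variables.items():
    print(name, round(value, 1))
```

The side road lies 30 m from the site, so it contributes nothing to the 25 m buffer but is fully counted from the 300 m buffer upwards, which illustrates how small and large buffers capture different spatial scales.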


| Data type | Name | Description | Resolution | Version | Source | Remarks |
| --- | --- | --- | --- | --- | --- | --- |
| Land Use Data | CORINE 2000 | Vector data | 1:100,000 | 2000 | European Environmental Agency | 44 land-use categories grouped into 6 classes: high and low density residential land, industry, ports, urban green and natural land |
| Roads | Meridian 2 | Vector data | 1m | 2009 | Ordnance Survey | |
| Traffic Intensity | National Traffic Estimates | Traffic counts & grid reference for main roads | - | 2009 | Department for Transport, Road Traffic Statistics | Data was attached to Meridian 2 dataset |
| Population Density | Headcount Population Data | Population estimates based on 2001 census | Postcode level (points) | 2001 | Office of National Statistics | - |
| Altitude | Digital elevation data (SRTM) | Raster data | 90m | - | CGIAR-CSI GeoPortal | |

Table 29: Datasets used to calculate ESCAPE-LUR model predictor variables for the London and the Thames Valley study area. Source: Eeftens et al. (2012), modified. SRTM = Shuttle Radar Topography Mission

The LUR model was developed using standard linear regression. In a forward stepwise approach, each candidate predictor variable was first entered into the model individually. The resulting R² values were compared, and the variable with the strongest predictive power (highest R²) was chosen to form part of the regression equation. In the next step, all remaining candidate predictor variables were again introduced individually, and the variable that raised the R² the most was retained in the model. This process was repeated until no additional variable could raise the predictive power (R²) of the model further. The introduction of variables was closely monitored and several rules were applied (Beelen & Hoek 2010):

1. A variable needed to improve the R² by at least 1% to enter the equation.

2. The minimum significance level for all entered variables was 0.05.

3. Each variable had a predefined direction of effect (Table 30): for example, increased nearby traffic intensity was expected to have a positive slope, as it should increase concentrations, whereas urban green was expected to have a negative slope, as it provides space for particle dispersion.

4. The conditions in rules 1, 2 and 3 had to continue to hold for every variable already included when another variable was added.
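The selection procedure and its rules can be sketched as follows. This is an illustrative simplification, not the ESCAPE implementation: it assumes NumPy is available, interprets the 1% criterion as an absolute R² gain of 0.01, enforces the direction-of-effect rule for all variables in each trial model, and omits the significance check (rule 2) for brevity. The function names and the synthetic data are hypothetical; only the variable name patterns follow Table 30.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2


def forward_stepwise(candidates, y, expected_sign, min_gain=0.01):
    """Forward selection.

    candidates: dict of variable name -> 1-D array of values per site
    expected_sign: dict of variable name -> +1 or -1 (predefined direction)
    min_gain: assumed absolute R^2 improvement needed to enter (rule 1)
    """
    selected, best_r2 = [], 0.0
    while True:
        best_name, best_trial_r2 = None, best_r2
        for name in candidates:
            if name in selected:
                continue
            trial = selected + [name]
            X = np.column_stack([candidates[v] for v in trial])
            beta, r2 = fit_ols(X, y)
            # Rules 3 and 4: every slope must keep its predefined direction
            signs_ok = all(
                np.sign(b) == expected_sign[v] for b, v in zip(beta[1:], trial)
            )
            if signs_ok and r2 > best_trial_r2:
                best_name, best_trial_r2 = name, r2
        # Stop when no variable qualifies or the gain is below the threshold
        if best_name is None or best_trial_r2 - best_r2 < min_gain:
            return selected, best_r2
        selected.append(best_name)
        best_r2 = best_trial_r2


# Synthetic example with 20 sites (values are made up for illustration)
rng = np.random.default_rng(0)
n_sites = 20
candidates = {
    "TRAFLOAD_50": rng.uniform(0, 1000, n_sites),
    "GREEN_500": rng.uniform(0, 1000, n_sites),
    "PORT_5000": rng.uniform(0, 1000, n_sites),
}
expected_sign = {"TRAFLOAD_50": 1, "GREEN_500": -1, "PORT_5000": 1}
pm = (
    10.0
    + 0.01 * candidates["TRAFLOAD_50"]
    - 0.005 * candidates["GREEN_500"]
    + rng.normal(0.0, 0.1, n_sites)
)
selected, r2 = forward_stepwise(candidates, pm, expected_sign)
print(selected, round(r2, 3))
```

A fuller implementation would also compute p-values for each coefficient and reject trial models in which any entered variable exceeds the 0.05 threshold.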

Table 30: Principal predictor variables for the ESCAPE-LUR model with predefined variable names, units, defined buffer sizes, and directions of effect. Variables chosen for London and the Thames Valley models (PM2.5 and PM10) are highlighted in blue and bold. Source: Eeftens et al. (2012).

| GIS dataset | Predictor variable | Variable name¹ | Unit | Buffer size (radius of buffer in metres) | Direction of effect |
| --- | --- | --- | --- | --- | --- |
| Background | | | | | |
| - | Coordinate variables⁴ | XCOORD, YCOORD | m | NA | NA |
| CORINE | Surface area of high density residential land | HDRES_X | m² | 100, 300, 500, 1000, 5000 | + |
| CORINE | Surface area of low density residential land | LDRES_X | m² | 100, 300, 500, 1000, 5000 | + |
| CORINE | Surface area of low and high density residential land combined | HLDRES_X | m² | 100, 300, 500, 1000, 5000 | + |
| CORINE | Surface area of industry | INDUSTRY_X | m² | 100, 300, 500, 1000, 5000 | + |
| CORINE | Surface area of port | PORT_X | m² | 100, 300, 500, 1000, 5000 | + |
| CORINE | Surface area of urban green² | URBGREEN_X | m² | 100, 300, 500, 1000, 5000 | - |
| CORINE | Surface area of semi-natural and forested areas³ | NATURAL_X | m² | 100, 300, 500, 1000, 5000 | - |
| CORINE | Sum of urban green and semi-natural and forested areas | GREEN_X | m² | 100, 300, 500, 1000, 5000 | - |
| CORINE | Surface area of water | WATER_X | m² | 100, 300, 500, 1000, 5000 | - |
| Population density | Number of inhabitants | POP_X | N(umber) | 100, 300, 500, 1000, 5000 | + |
| Household density | Number of households | HHOLD_X | N(umber) | 100, 300, 500, 1000, 5000 | + |
| Altitude | Altitude | SQRALT | m | NA | - |
| Traffic⁶ | | | | | |
| Local road network | Traffic intensity on nearest road | TRAFNEAR | Veh.day⁻¹ | NA | + |
| Local road network | Inverse distance and inverse squared distance to the nearest road | DISTINVNEAR1, DISTINVNEAR2 | m⁻¹, m⁻² | NA | + |
| Local road network | Product of traffic intensity on nearest road and inverse of distance to the nearest road and distance squared | INTINVDIST, INTINVDIST2 | Veh.day⁻¹ m⁻¹, Veh.day⁻¹ m⁻² | NA | + |
| Local road network | Traffic intensity on nearest major road⁶ | TRAFMAJOR | Veh.day⁻¹ | NA | + |
| Local road network | Inverse distance and inverse squared distance to the nearest major road⁶ | DISTINVMAJOR1, DISTINVMAJOR2 | m⁻¹, m⁻² | NA | + |
| Local road network | Product of traffic intensity on nearest major road and inverse of distance to the nearest major road and distance squared⁶ | INTMAJORINVDIST, INTMAJORINVDIST2 | Veh.day⁻¹ m⁻¹, Veh.day⁻¹ m⁻² | NA | + |
| Local road network | Total traffic load of major roads in a buffer (sum of (traffic intensity * length of all segments))⁶ | TRAFMAJORLOAD_X | Veh.day⁻¹ m | 25, 50, 100, 300, 500, 1000 | + |
| Local road network | Total traffic load of all roads in a buffer (sum of (traffic intensity * length of all segments)) | TRAFLOAD_X | Veh.day⁻¹ m | 25, 50, 100, 300, 500, 1000 | + |
| Local road network | Heavy-duty traffic intensity on nearest road | HEAVYTRAFNEAR* | Veh.day⁻¹ | NA | + |
| Local road network | Product of heavy-duty traffic intensity on nearest road and inverse of distance to the nearest road and distance squared | HEAVYINTINVDIST, HEAVYINTINVDIST2 | Veh.day⁻¹ m⁻¹, Veh.day⁻¹ m⁻² | NA | + |
| Local road network | Heavy-duty traffic intensity on nearest major road⁶ | HEAVYTRAFMAJOR | Veh.day⁻¹ | NA | + |
| Local road network | Total heavy-duty traffic load of major roads in a buffer (sum of (heavy-duty traffic intensity * length of all segments))⁶ | HEAVYTRAFMAJORLOAD_X | Veh.day⁻¹ m | 25, 50, 100, 300, 500, 1000 | + |
| Local road network | Total heavy-duty traffic load of all roads in a buffer (sum of (heavy-duty traffic intensity * length of all segments)) | HEAVYTRAFLOAD_X | Veh.day⁻¹ m | 25, 50, 100, 300, 500, 1000 | + |
| Central road network | Road length of all roads in a buffer | ROADLENGTH_X | m | 25, 50, 100, 300, 500, 1000 | + |
| Central road network | Road length of major roads in a buffer⁵ | MAJORROADLENGTH_X | m | 25, 50, 100, 300, 500, 1000 | + |
| Central road network | Distance to the nearest road | DISTINVNEARC1, DISTINVNEARC2 | m⁻¹, m⁻² | NA | + |
| Central road network | Inverse distance and inverse squared distance to the nearest major road⁵ | DISTINVMAJORC1, DISTINVMAJORC2 | m⁻¹, m⁻² | NA | + |

¹ Variable name: Combining name and buffer size, e.g. for HDRES_X: HDRES_100, HDRES_300, HDRES_500, HDRES_1000, HDRES_5000

² CORINE Urban green is the sum of CORINE classes 141 and 142

³ CORINE semi-natural is the sum of CORINE classes 311, 312, 313, 321, 322, 323, 324, 331, 332, 333, 334, 335, 411, 412, 421, 422, 423, 512, 521, 522 and 523

⁴ Variables were only offered if a model has been developed to test if the model with more explicit variables could be improved with these variables (describing slow trends in background).

⁵ Definition of major road for central road network: classes 0, 1, and 2 (+ classes 3 and 4 based on local knowledge and decision)

⁶ Definition of major road for local road network: road with traffic intensity > 5,000 mvh/24h

The concentrations collected during each seasonal monitoring campaign (described in chapter 3) were adjusted to the annual average for the analysis as follows: the 2-weekly average concentrations measured at the reference site, which operated during the whole monitoring period, were used to calculate their ratio to the reference site’s annual mean. These ratios were subsequently applied as an adjustment to all sites for each round (Eeftens et al. 2012).

Several tests and checks were applied to the final model to ensure that the assumptions underlying the regression analysis were met:

1. The selected regression variables were checked for collinearity. A regression with multiple variables can contain correlated explanatory variables, which can inflate the adjusted R² without adding explanatory power for the dependent variable. The Variance Inflation Factor (VIF) quantifies the collinearity between variables. Variables with a VIF below 5 are generally considered to have low collinearity (Harris & Jarvis 2011).

2. Regression results can be influenced by single observations in the dataset. To check for unusually influential data points, Cook’s D was calculated. Cook’s D (or Cook’s distance) is a measure of each observation’s influence on the fitted regression. Observations with a Cook’s D above 1 should be investigated further (Harris & Jarvis 2011).

3. The normality of the residuals was checked by inspecting histograms of the residual distribution.

4. In spatial data, locations near each other tend to be more similar than locations further apart. This tendency, called spatial autocorrelation, can violate the regression assumption of independent observations. Spatial autocorrelation was tested using the Moran’s I statistic (Eeftens et al. 2012; ESCAPE project 2010). A Moran’s I close to zero signifies low spatial autocorrelation. Moran’s I values can be transformed to z-scores, for which values greater than 1.96 or smaller than -1.96 indicate autocorrelation that is significant at the 5% level (Harris & Jarvis 2011).
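Three of the checks listed above (VIF, Cook's D and Moran's I) can be sketched with textbook formulas. This is an illustrative sketch assuming NumPy is available, not the software used in the study. The inverse-distance weighting in `morans_i` is an assumption, since the text does not state which spatial weights were used, and the transformation to z-scores is not shown. All function names and test data are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n x p, no intercept column)."""
    out = []
    for j in range(X.shape[1]):
        # Regress predictor j on all other predictors (plus intercept)
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


def cooks_distance(X, y):
    """Cook's D for every observation of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    n, k = A.shape                            # k counts the intercept too
    H = A @ np.linalg.pinv(A.T @ A) @ A.T     # hat matrix
    h = np.diag(H)                            # leverages
    resid = y - H @ y
    mse = (resid @ resid) / (n - k)
    return resid**2 / (k * mse) * h / (1.0 - h) ** 2


def morans_i(values, coords):
    """Moran's I with inverse-distance weights (weighting scheme is an assumption)."""
    z = values - values.mean()
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = np.zeros_like(d)
    mask = d > 0
    w[mask] = 1.0 / d[mask]                   # zero weight on the diagonal
    return len(values) / w.sum() * (z @ w @ z) / (z @ z)


# Made-up data for 20 sites
x1 = np.arange(20.0)
x2 = x1 + np.tile([0.5, -0.5], 10)            # almost a copy of x1
x3 = np.tile([1.0, -1.0], 10)                 # nearly unrelated to x1
print("VIF (collinear pair):", vif(np.column_stack([x1, x2])))
print("VIF (unrelated pair):", vif(np.column_stack([x1, x3])))

y = 2.0 * x1
y[19] += 50.0                                  # one outlier at a high-leverage site
print("Most influential site:", int(np.argmax(cooks_distance(x1.reshape(-1, 1), y))))

coords = np.column_stack([x1, np.zeros(20)])   # sites along a line
print("Moran's I of a smooth trend:", morans_i(x1, coords))
```

Against the thresholds quoted in the list, the collinear pair would fail the VIF < 5 criterion and the outlying site would be flagged for further investigation by Cook's D > 1.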

Model validation was performed by leave-one-out cross-validation (LOOCV). For LOOCV, the predictor variables are fixed and the regression is fitted to n-1 sites (in this analysis 20-1=19 sites), leaving out each data point in turn in n repeated regression analyses. The resulting adjusted R² is then compared with the adjusted R² of the full dataset (Wang et al. 2013).
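The LOOCV procedure can be sketched as below, assuming NumPy is available. The sketch computes a cross-validation R² from the held-out predictions, which is one common way of summarising LOOCV and is an assumption here; the text itself refers to comparing adjusted R² values. The function name `loocv_r2` and the data are hypothetical.

```python
import numpy as np


def loocv_r2(X, y):
    """R^2 of leave-one-out predictions for a fixed set of predictors."""
    A = np.column_stack([np.ones(len(y)), X])
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # fit on the other n-1 sites
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[i] = A[i] @ beta                 # predict the site left out
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up data for 20 sites with one predictor
x = np.arange(20.0)
pm = 3.0 + 0.5 * x + np.tile([0.2, -0.2], 10)
print("LOOCV R^2:", round(loocv_r2(x.reshape(-1, 1), pm), 4))
```

Because each held-out prediction comes from a model that never saw that site, the LOOCV R² cannot exceed the R² of the model fitted to all sites; a large gap between the two suggests the model depends heavily on individual sites.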

In document Micro-environmental models of human exposure to air pollution (Page 135-139)