Aggregated cross-section analysis - Explanation of household car ownership with proximity to tr

4 Methodology

4.2 Aggregated cross-section analysis

The cross-section analysis contains the analysis to relations of the influencing factors with household car ownership. The steps that finally lead to models explaining car ownership are discussed in this chapter.

Data preparation

Gross of the data of the for the cross-sectional analysis is obtained from CBS (Dutch national agency for statistics). Nonetheless, the CBS database did not contain all the required variables; therefore data is used from other sources too. After all the data is collected, the data must be on a comparable level of detail. Most of the data have the same neighbourhood level.

Methodology

_P

₂₂ Nonetheless, not all data is aggregated to the appropriate level yet. Those data are prepared with ArcGIS, a computer program to visualise and analyse spatial data, to the same neighbourhood level. Only the neighbourhoods in the scope stay in the dataset for the analysis. An extensive description of the data is in shown in Appendix A.

Open source national data has several spatial detail levels. The chosen level of detail for this part of the study is “Buurt”. That is a part of a municipality with a homogenous socio-economic structure or spatial planning (CBS, 2019). The boundaries of buurten are determined by municipalities, and the geographic administration is coordinated by CBS. Moreover, the detail levels are further described and compared in Appendix A.1. The year 2016 is the most recent year wherefore gross of the data is available and this year is included in the MPN data too. Therefore, the year 2016 is selected for the analysis.

The dataset contains three variables with information about car ownership: average household car ownership and the total number of cars in the buurt. The dependent variable is average household car ownership in the buurt because the total number of cars is strongly dependent on the size of the buurt and the number of cars per km2_{is strongly depended on the density of residents.}

The variables that are available for the analysis of influencing factors of average household car ownership are listed in Table 4-2.The chosen variables are based on the literature review. If data was not available by CBS, data of other sources were prepared to the same detail level. Only national data about parking permits were not available, therefore is for every buurt in the neighbourhood, the website of the relevant municipality checked on price and requirement of parking permits. Enriched variables have a reference in the final column to the relevant section of the Appendix. Outliers were removed if the value for the variable was extreme (more than three times deviation of the mean) and if this record behaves as an influential outlier.

Table 4-2 Variables for cross-sectional analysis

Variable Name in code (Based on) source(s) Detail level

Chapter of explanation Average household car

ownership

auto_hh (CBS, 2016a) Buurt

Density

Density of residents bev_dichth (CBS, 2016a) Buurt

Density of residents bev_dich_wk (CBS, 2016a) Wijk

Density of residents bev_dich_gm (CBS, 2016b) Municipality

Urbanisation level sted (CBS, 2016a) Buurt

Urbanisation level sted_wk (CBS, 2016a) Wijk

Urbanisation level sted_gm (CBS, 2016b) Municipality

Diversity

Entropy index Entropy (CBS, 2015) Buurt A.3.2

Entropy index Entropy_wk (CBS, 2015) Wijk A.3.2

Job density job_density (Kadaster, 2018) Buurt A.3.3

Job- housing ratio ratio_job_resident (Kadaster, 2018) & (CBS, 2016a) Buurt A.3.3

Design

Network distance to (nearest) strongly urbanised city centre

Netw_dist_centre_12 (Bikeprint, 2016) Buurt A.3.5

Centre, shell or other built-up area

Schil (Bikeprint, 2016) Buurt A.3.5

Public Transport and Accessibility

Nearest train station type min_distance (Prorail, 2019) & (Bikeprint, 2016) Buurt A.3.1

Distance to nearest train station min_station_type (Prorail, 2019) & (Bikeprint, 2016) Buurt A.3.1

Larger train station than nearest train station available in 3km

Larger

Largest train station type in 3km Largest

Density of bus stops bus_density Goudappel Groep Buurt A.3.1

Bike and ride Accessibility Average University of Twente PC4 A.3.1

Demand

Parking costs Parking (RDW, 2018) Parking area A.3.6

Methodology

_P

₂₃

Socio-demographics

Percentage of households with lowest 40% of income

p_hh_li (CBS, 2016a) Buurt

Percentage of people with income

p_inkont (CBS, 2016a) Buurt

Average house hold size gem_hh_gr (CBS, 2016a) Buurt

Percentage of rental properties p_huurw (CBS, 2016a) Buurt

Average building value woz (CBS, 2016a) Buurt

Class of building value price (CBS, 2016a) Buurt A.3.4

Average income of residents g_ink_pi (CBS, 2016a) Buurt

Percentage of people with age between 0-14

p_00_14_jr (CBS, 2016a) Buurt

Percentage of people with age between 15-24

p_15_24_jr (CBS, 2016a) Buurt

Percentage of people with age between 25-44

p_25_44_jr (CBS, 2016a) Buurt

Percentage of people with age between 45-64

p_45_64_jr (CBS, 2016a) Buurt

Percentage of people with age 65+

p_65_eo_jr (CBS, 2016a) Buurt

Average building year Bouwjaar (Kadaster, 2018) Unit A.3.4

Average surface area average (Kadaster, 2018) Unit A.3.4

Analysis of variables

The relations of the variables with household car ownership are analysed visually by creating bin plots and by creating maps in ArcGIS. Those plots provide more insight into the relations of the variables and are useful in comparison to the literature analysis but are not measures that can be used to quantify the relations.

Therefore, Pearson Correlation Coefficients (PCC) and their p-value are used to show which variables have a significant linear relationship with car ownership. The formula for the PCC is shown in Formula 4-1 (Zwillinger & Kokoska, 2000).

PCC _{𝑟 =}(

∑ ((𝑥𝑖 𝑖− 𝑥̅) ∙ (𝑦𝑖− 𝑦̅))

𝑛 − 1 )

𝜎(𝑥) ∙ 𝜎(𝑦)

4-1

With i for the observation number, x and y for the two variables of interest and n the total number of observations.

For the categorical variables, the Spearman Rank Correlation Coefficient (SRCC) is a more appropriate measure. The formula for the SRCC is shown in Formula 4-2 (Zwillinger & Kokoska, 2000). SRCC 𝑟 = 1 − 6 ∑ (𝑢𝑖− 𝑣𝑖) 2 𝑛 𝑖 𝑛(𝑛2_{− 1)} 4-2

With u the rank of the ith_{observation of the first variable of interest and v the rank of the i}th

observation of the second variable of interest.

Regression models

The multiple linear regression models are used to analyse the selected variables in more depth. Since the results of the models show the effect of the individual influencing factors while controlling for the other factors. The estimation of the model is executed by Python Software with the module Statsmodels. As dependent variable, the estimated variable, average household car ownership is used. The models are built by step by step adding a new independent variable into the model: step-wise linear regression modelling. The variable with the largest impact on the residuals of the previous model will be added to the new model. Residual plots analyse this impact.

Methodology

_P

₂₄ So, the output of the model is a continuous value for average household car ownership dependent on continuous and categorical influencing factors. The coefficients of the regression model will represent the strength of the relation between the variable and average household car ownership.

4.2.3.1 Formulas

A multiple linear regression model is used to model average household car ownership. The general formula is shown in Formula 4-3 (Zwillinger & Kokoska, 2000).

General formula 𝑦̂ = 𝐸(𝑌|𝑋) + 𝜀𝑖 𝑖 = 𝛽0+ 𝛽𝑗𝑥𝑖𝑗+ ⋯ + 𝜀𝑖 4-3 𝑦̂𝑖 represents the predicted value for the 𝑖𝑡ℎ observation. Dependent on 𝛽0, the intercept or the constant and 𝛽𝑗 the partial effect of variable 𝑥𝑖𝑗 on 𝐸(𝑦|𝑥) with 𝑗 for the (number of the) variable. So, 𝛽𝑗 represents the effect of the change in 𝑥𝑗 units while keeping the other independent variables constant. 𝜀𝑖 represents the error variable. The continuous variables can be used without dummy variables, but the categorical variables need dummies to be analysed. An important assumption of multiple linear regression models is that the independent variables have a linear relation with the dependent variable.

The ordinary least squares (OLS) is the chosen method for estimating the parameters. Therefore, the residual sum of squared (RSS) is minimised, see Formula 4-4. The residual is calculated by 𝑦𝑖: the actual value, minus 𝑦̂𝑖 : the predicted value.

RSS 𝑅𝑆𝑆 = ∑(𝑦𝑖− 𝑦̂ )𝑖 2

𝑖

4-4

To assess the goodness-of-fit of the regression, the R-squared value is used: the coefficient of determination. The R-squared value represents how the model can reduce many variations in the sample. Formula 4-5 shows how this value is calculated. The numerator of this formula is again the RSS, the total variation in the residuals, and the denominator of the formula represents the total variation in the sample.

R-squared 𝑅2= 1 − ∑ (𝑦𝑖− 𝑦̂)𝑖 2 𝑖 ∑ (𝑦_𝑖− 𝑦̅)2 𝑖 4-5 4.2.3.2 Flowchart

The analysed relations of the factors are visualised in a flowchart in Figure 4-2. A the flowchart shows, the direct linear effect of the train stations, built environment and socio-demographics on average household car ownership are analysed.

Methodology

_P

₂₅

Figure 4-2 Flowchart of the aggregated cross-section MLR model

In document Explanation of household car ownership with proximity to train stations : Quantification of household car ownership with proximity to train stations and a determination where parking standards in urban areas can be improved in the Netherlands (Page 33-37)