Chapter 4: A multilevel spatial interaction model of transit flows incorporating spatial and network autocorrelation
4.5 Data
Study area: Arnhem Nijmegen region
We apply the models to the bus transit flows in the Arnhem Nijmegen region, located in the eastern part of the Netherlands. This region comprises 20 municipalities with a total area of more than 1,000 square kilometres and almost 750,000 inhabitants. The cities of Nijmegen (168,000 inhabitants) and Arnhem (151,000 inhabitants), located only about 15 kilometres apart, form the core of the region.
The local and regional public transport system in the region is mainly bus based. The bus transit services are operated by Hermes, a private company that runs them under a 10-year public tender using the brand name “Breng”. It offers a more or less integrated network of bus services throughout the entire region. Heavy rail is also available, with major train stations in Arnhem and Nijmegen, and several smaller train stations within these cities and in surrounding towns. The transit services within the region largely focus on the city centres and main train stations of Arnhem and Nijmegen: virtually all bus routes serve at least one of these locations.
Local buses serve neighbourhoods within the cities of Arnhem and Nijmegen, while regional buses connect the surrounding towns to the city centres and main train stations of Arnhem and Nijmegen. In this study, our focus is on the bus system and only journeys made using the public buses are included in the analysis (i.e., we exclude trips made by the heavy rail system as well as the limited number of trips made by on-demand transit services).
Figure 4.1: Neighbourhood boundaries and centroids (dots), the bus transit system (green lines), and heavy rail (dashed lines) of the Arnhem Nijmegen region
Variables and data sources
Dependent variable: travel flow (Tij)
The dependent variable is the total number of public transport passengers between each OD pair in March 2014, that is, the total number of trips made in the studied month from an origin i to a destination j. The number of passengers (flows) between any two bus stops is retrieved from smart card registrations. Since 2012, passengers of most public transport services in the Netherlands have been strongly encouraged to use a smart card (the “OV-chipkaart”) to pay for their journeys. The underlying smart card system is a nationwide, integrated fare collection system, used for all public transport modes such
as heavy rail, light rail, trams, and regional and local buses. As fares are calculated based on the distance travelled, passengers are required to tap-in when they enter the public transport system, and tap-out when they leave it. For both taps the date, time of day, (stop-) location, and card number are registered. This implies that complete trajectories travelled (i.e., origin destination pairs) can be reconstructed. Smart cards were used as payment method for about 96% of all trips in Dutch local and regional bus services in the year 2014 (KpVV CROW, 2016).
If passengers enter a second bus within 35 minutes after leaving the first bus, this is considered a transfer, and the journey is registered as one journey with the first boarding location as origin and the final alighting location as destination. Because of privacy regulations, no exact data were provided for trajectories with fewer than five journeys: trajectories with between one and four travellers are included as “one” in the data set. In total, there were 1,464 active bus stops in the study area, and thus 2,141,832 potential origin-destination combinations. Only 4% of these combinations (91,362 OD pairs) actually show at least five trips during the study period. Another 7% (142,454 OD pairs) show between one and four trips.
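The counts above follow from simple arithmetic on ordered stop pairs. As a quick check, the sketch below assumes that a stop is never paired with itself; the function and variable names are illustrative only and do not come from the original analysis.

```python
def potential_od_pairs(n_locations: int) -> int:
    """Ordered origin-destination pairs among n locations, excluding i == j."""
    return n_locations * (n_locations - 1)

# Figures reported in the text for the bus stop level.
n_stops = 1_464
pairs_five_plus = 91_362      # OD pairs with at least five trips
pairs_one_to_four = 142_454   # OD pairs with one to four trips (stored as "one")

stop_pairs = potential_od_pairs(n_stops)
share_five_plus = pairs_five_plus / stop_pairs
share_one_to_four = pairs_one_to_four / stop_pairs

print(stop_pairs)                    # 2141832
print(f"{share_five_plus:.1%}")      # 4.3%
print(f"{share_one_to_four:.1%}")    # 6.7%
```

The unrounded shares (4.3% and 6.7%) are consistent with the rounded 4% and 7% quoted in the text.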
In order to analyse the overall flow patterns in the region, we have aggregated the data to travel flows at the neighbourhood-to-neighbourhood level. We do so for two reasons. First, the real origins and destinations of the travellers are not the bus stops themselves, but are likely to be in the neighbourhood of the specific bus stops. This means that aggregating the data to neighbourhood level gives a realistic indication of the complete flow patterns. Second, we do not have data on land use patterns at the level of bus stops, but only at the level of neighbourhoods (as defined by the Dutch Bureau of Statistics). Flows between bus stops are therefore aggregated to flows between neighbourhoods, both for the origin and for the destination. If the origin and/or destination bus stop is close to the border between two neighbourhoods, the flows are divided, using an exponential function, based on the part of the 400-metre service area of the bus stop that is located in each neighbourhood. A distance of 400 metres is typically used as the service area of a bus stop in the literature, as well as in the practice of transit planning (Horner & Murray, 2004). The region is divided into 485 neighbourhoods. Disregarding journeys that both start and end in the same neighbourhood, the total number of possible OD combinations is 234,740. Of these OD combinations, 29,252 (or 12 percent) were actually used by travellers in the studied month.
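The aggregation step can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes that each stop already has a set of neighbourhood weights summing to one, and it leaves open how those weights are derived (the text mentions an exponential function over the 400-metre service area, whose exact form is not given here). All names and numbers are hypothetical.

```python
from collections import defaultdict

def aggregate_to_neighbourhoods(stop_flows, stop_shares):
    """Distribute stop-to-stop flows over neighbourhood pairs.

    stop_flows:  {(origin_stop, destination_stop): trips}
    stop_shares: {stop: {neighbourhood: weight}}, weights per stop sum to 1
    """
    flows = defaultdict(float)
    for (o_stop, d_stop), trips in stop_flows.items():
        for o_nb, o_w in stop_shares[o_stop].items():
            for d_nb, d_w in stop_shares[d_stop].items():
                # A trip is split proportionally over all neighbourhood pairs.
                flows[(o_nb, d_nb)] += trips * o_w * d_w
    return dict(flows)

# Hypothetical example: stop s1 lies on the border of neighbourhoods A and B.
example_flows = {("s1", "s2"): 100, ("s3", "s2"): 40}
example_shares = {
    "s1": {"A": 0.75, "B": 0.25},
    "s2": {"C": 1.0},
    "s3": {"B": 1.0},
}
nb_flows = aggregate_to_neighbourhoods(example_flows, example_shares)
print(nb_flows)  # {('A', 'C'): 75.0, ('B', 'C'): 65.0}
```

Because the weights of each stop sum to one, the total number of trips is preserved by the aggregation. Flows that start and end in the same neighbourhood would still need to be dropped afterwards, as described in the text.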
Independent variables included in the lower model: boarding model
The lower model (boarding model) aims to capture the transit attractiveness of each neighbourhood, represented by the (estimated) number of boardings and alightings. The numbers of boardings and alightings are estimated from variables describing potential demand and transit supply, whose influence on transit usage is well established in the transport literature. The boarding model, including the model development, is described in detail in Kerkman et al. (2015). There are, however, two main differences between that model and the model used in this paper. First, the current model is estimated for neighbourhoods instead of bus stops. This results in some slight differences in the variables used. Second, we estimate a spatial lag model instead of OLS to correct for spatial autocorrelation among neighbourhoods.
The model includes variables of potential demand and transit supply (see Table 4.1).
Table 4.1: Independent variables included in the boarding model
Variable | Description | Source
Potential travellers | [...] students, and train travellers in the neighbourhood | Statistics Netherlands
Elderly | Number of inhabitants aged 65 years or older | Statistics Netherlands (2014)
Distance to urban centre [km] | Euclidean distance between the neighbourhood centroid and the city centre of Arnhem or [...] | [...]
LU: Agriculture | Part of the neighbourhood with agricultural land use | [...]
Bus stop density | Number of bus stops per m² in the neighbourhood | OVapi (2015)
Bus terminus [0/1] | Start or end stop of at least one scheduled route of a bus in the neighbourhood | OVapi (2015)
Transfer stop [0/1] | A transfer to another line is possible at at least one stop in the neighbourhood | OVapi (2015)
Independent variables included in the upper models
Travel impedance
As specified in equation 2, the travel impedance is an exponential distance decay function of the transit travel time and the directness of the connection.
Transit travel time (cij)
Travel time is generally used as an important indicator for travel impedance in transport studies.
The transit travel time is calculated as the mean of the travel times for departures on a Monday morning at 08:00, 08:07, and 08:18, from the centroid of the origin neighbourhood to the centroid of the destination neighbourhood. The travel time includes (modelled) walking time (from the centroid of the neighbourhood to the nearest bus stop or vice versa), waiting time at the bus stop, in-vehicle travel time, and waiting time in case of a transfer. The travel time is calculated using the “Add GTFS to a Network Dataset” add-in in ArcGIS 10.2. The transit schedules were retrieved from OVapi (2015) and the road network from NWB-Wegen (2014).
Direct (no transfer)
In general, public transport passengers prefer to travel without changing vehicles. Therefore, lower travel impedance is expected between origins and destinations which allow travel without a transfer. To capture this in our distance decay function, we include a dummy variable that indicates whether a direct connection is available between at least one bus stop in the origin neighbourhood and at least one bus stop in the destination neighbourhood.
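To make the role of the two impedance components concrete, the sketch below assumes that the decay function has the log-linear form exp(beta * travel time + delta * direct). This functional form is an assumption made for illustration (equation 2 itself is not reproduced in this section), and the default parameter values are taken from the SIM1 column of Table 4.5. The unit of travel time is whatever unit the model was estimated in.

```python
import math

def impedance(travel_time: float, direct: bool,
              beta: float = -0.05, delta: float = 1.30) -> float:
    """Assumed exponential decay: exp(beta * travel_time + delta * direct).

    beta and delta default to the SIM1 estimates for 'Travel time' and
    'Direct'; the functional form itself is an illustrative assumption.
    """
    return math.exp(beta * travel_time + delta * (1 if direct else 0))

# Relative advantage of a direct connection at equal travel time.
direct_ratio = impedance(30, True) / impedance(30, False)

# Travel time that offsets the benefit of a direct connection.
equivalent_time = 1.30 / 0.05

print(round(direct_ratio, 2))      # 3.67
print(round(equivalent_time, 1))   # 26.0
```

Under these assumptions, a direct connection is worth about as much as 26 units of travel time, which illustrates why the dummy is included alongside travel time.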
Other independent variables included in the upper models
In addition to the attractiveness of neighbourhoods and the travel impedance between them, we add other independent variables to the various SIMs that describe characteristics of each OD-set and are expected to influence transit travel flows. These variables were selected based on existing literature, and different specifications and combinations of them were tested in the SIMs (including residual analysis) before they were included in the final model.
Train station
In this research, we only include the travel flow patterns of travellers using the local and regional bus transit system. These are, however, likely to be influenced by the heavy rail network, which serves the same region and connects it to the rest of the Netherlands.
The heavy rail services influence the bus travel flows in two ways. First, heavy rail is a competitor of the bus system, especially on OD pairs where both the origin and the destination areas have access to a train station. Second, a train station is an important ‘generator’ of bus trips, as the bus system functions as a feeder service for the heavy rail system. We include these effects in the SIM using three dummy variables. The first two dummy variables indicate whether a train station is located in or within 400 metres of the origin or the destination neighbourhood, respectively. The third dummy variable indicates whether a train station is located in or near (i.e., within 400 metres of the neighbourhood border) both the origin and the destination neighbourhood.
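The three dummies and their combined effect can be sketched as below. The sketch assumes that the dummies are not mutually exclusive, so that an OD pair with a station at both ends has all three set to 1; the variable names are hypothetical, and the coefficients are the SIM3 estimates from Table 4.5.

```python
def station_dummies(origin_near_station: bool, dest_near_station: bool) -> dict:
    """Build the three train station dummies for one OD pair."""
    return {
        "station_O": int(origin_near_station),
        "station_D": int(dest_near_station),
        "station_OD": int(origin_near_station and dest_near_station),
    }

# SIM3 estimates from Table 4.5 (Train station O, D, and O&D).
SIM3_COEFFS = {"station_O": 1.072, "station_D": 1.097, "station_OD": -1.22}

def station_effect(origin_near_station: bool, dest_near_station: bool,
                   coeffs: dict = SIM3_COEFFS) -> float:
    """Sum of the station terms in the linear predictor for one OD pair."""
    dummies = station_dummies(origin_near_station, dest_near_station)
    return sum(coeffs[name] * value for name, value in dummies.items())

print(round(station_effect(True, False), 3))  # 1.072
print(round(station_effect(True, True), 3))   # 0.949
```

Under this reading, the negative O&D term offsets a large part of the two feeder effects: with a station at both ends, the combined term (0.949) is lower than with a station at only one end.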
Residential land use
A large part of all transit trips is made for commuting, for travel to or from education, or for shopping and leisure. These trips often originate or end in a residential area, making residential areas relatively attractive as an origin or destination for transit trips. Between residential areas, however, the number of transit trips is usually rather low. To prevent the SIM from strongly overestimating the number of trips between residential areas, we include a variable describing the share of residential land use in both the origin and the destination neighbourhood. This variable is defined as the product of the share of residential land use area in the origin neighbourhood and the share of residential land use area in the destination neighbourhood. The share of residential land use for each individual neighbourhood is derived from Statistics Netherlands (2010).
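A minimal sketch of this interaction variable is given below. The multiplier function additionally assumes a log-linear SIM, so that a coefficient translates into a multiplicative effect on the expected flow; that assumption and all names are illustrative, and the coefficient is the SIM3 estimate for Residential LU from Table 4.5.

```python
import math

def residential_interaction(share_origin: float, share_destination: float) -> float:
    """Product of the residential land use shares (each between 0 and 1)."""
    return share_origin * share_destination

BETA_RESIDENTIAL = -0.622  # SIM3 estimate for 'Residential LU' in Table 4.5

def flow_multiplier(share_origin: float, share_destination: float,
                    beta: float = BETA_RESIDENTIAL) -> float:
    """Multiplicative effect on the expected flow, ASSUMING a log-linear model."""
    return math.exp(beta * residential_interaction(share_origin, share_destination))

print(residential_interaction(0.5, 0.5))   # 0.25
print(round(flow_multiplier(1.0, 1.0), 3)) # 0.537
```

The product is only large when both ends are predominantly residential, so the variable penalises residential-to-residential pairs without penalising trips between a residential area and, for example, a city centre.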
Independent variables included in the SIMs
The specifications of the different SIMs (as described in section 4.4) result in different combinations of variables used in the SIMs. An overview of the independent variables included in each specific SIM is given in Table 4.2.
Table 4.2: Overview of independent variables included in each SIM
Descriptive statistics for the variables included in the upper models
Descriptive statistics of the different variables included in the SIMs are displayed in Table 4.3.
The distribution of (the natural logarithm of) the estimated number of boardings in neighbourhood i (Oi) and alightings in j (Dj) estimated by the lower model is fairly similar to the actually observed boardings and alightings. This indicates a good performance of the lower model. For most dummy variables used in the models, the majority of the observations are “0”.
The variable indicating the availability of a train station at both the origin and the destination is true (1) for only 1% of the OD-pairs. However, as we have a large number of observations, this variable might still add valuable information to the models.
Table 4.3: Descriptive statistics of upper model variables
Variable Min. Max. Mean Std. Dev.
4.6 Results
In this section, the results of the different modelling exercises are presented. First, the results of the boarding model, which models the number of travellers boarding and alighting buses in each neighbourhood, are displayed. After this, the results of the different SIMs are given and described.
Lower model: Boarding model
In the lower-level model, the boarding model, the numbers of boardings and alightings of the neighbourhoods are estimated using a spatial lag model. The results in Table 4.4 show that most variables are highly significant. Stop frequency has the highest relative influence on the number of bus passengers. The number of potential passengers in a neighbourhood also has a large positive influence, while the distance to an urban centre has a large negative influence. The latter might be explained by decreasing urban densities at larger distances from the centre. Finally, ρ, the spatial autocorrelation coefficient, is highly significant.
Although not all variables are significant, we keep them in the model because their influence on transit usage is generally accepted, and they do contribute to the overall performance of the model (as confirmed by test runs of the model excluding these variables).
Table 4.4: Spatial lag model of neighbourhood mass (n = 485)
Variable | Coefficient | Std. Error | Sig. | Std. Coeff.
(intercept) | .714 | .675 | .290 |
Potential demand variables
Potential travellers (log) | .393 | .050 | .000 | .214
Income [x€1.000] | -.013 | .016 | .411 | -.016
Elderly | .0004 | .0002 | .052 | .049
Distance to urban centre [km] | -.0001 | .00002 | .000 | -.202
LU: Agriculture | -.184 | .217 | .396 | -.020
Transit supply variables
Stop frequency (log) | .231 | .028 | .000 | .266
Bus stop density | 2.247 | 1.57 | .153 | .035
Bus terminus [0/1] | .363 | .166 | .029 | .047
Transfer stop [0/1] | .543 | .176 | .002 | .087
Spatial autocorrelation
Rho | .384 | .044 | .000 |
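To illustrate what the spatial lag term in Table 4.4 implies, the sketch below solves the generic spatial lag equation y = ρWy + Xβ for a toy example. It assumes a row-standardised weights matrix W, which is a common choice but an assumption here; the three-neighbourhood layout and all names are hypothetical. Only the value of ρ is taken from the table.

```python
def solve_spatial_lag(w, xb, rho, tol=1e-12, max_iter=10_000):
    """Solve y = rho * W y + xb by fixed-point iteration.

    Converges when W is row-standardised and |rho| < 1.
    """
    n = len(xb)
    y = list(xb)
    for _ in range(max_iter):
        y_new = [rho * sum(w[i][j] * y[j] for j in range(n)) + xb[i]
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(y_new, y)) < tol:
            return y_new
        y = y_new
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical layout: three neighbourhoods on a line, row-standardised weights.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
RHO = 0.384  # spatial lag coefficient reported in Table 4.4

# A uniform unit contribution of X*beta in every neighbourhood.
y_uniform = solve_spatial_lag(W, [1.0, 1.0, 1.0], RHO)
print([round(v, 3) for v in y_uniform])  # [1.623, 1.623, 1.623]
```

With row-standardised weights, a uniform change in the linear predictor is amplified by 1 / (1 - ρ), here about 1.62, because each neighbourhood's outcome feeds back into that of its neighbours.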
SIMs
First, a base model (SIM1) was estimated based on the actual numbers of boardings and alightings at the origin and the destination, and the transit travel time between them. Next, the actual numbers of boardings and alightings were replaced by the estimates of the lower-level boarding model described in the previous section (SIM2). In SIM3, additional variables were added via upper model 1. In SIM4, the effects of spatial structure are included using upper model 2. SIM5 specifically addresses the influence of spatial and network autocorrelation using upper model 3. The results of the different SIMs are displayed in Table 4.5.
Table 4.5: Results of SIMs (N = 234,740)
SIM1: base model (base); SIM2: estimations Oi and Dj (lower + base); SIM3: OD-set (lower + upper1); SIM4: spatial structures (lower + upper2); SIM5: autocorrelation (lower + upper3)
Cells for model variables give: coefficient, std. error, z value.
Variable | SIM1 | SIM2 | SIM3 | SIM4 | SIM5
(Intercept) | -6.38, .044, -145.9 | -2.74, .040, -69.0 | -3.11, .040, -78.1 | 5.15, .180, 28.5 | .287, .066, 4.4
Oi (actual, log) | .675, .003, 199.4 | | | |
Dj (actual, log) | .655, .003, 192.0 | | | |
Oi (estimate, log) | | .493, .003, 152.0 | .490, .003, 146.1 | .597, .005, 120.7 | .041, .001, 45.5
Dj (estimate, log) | | .478, .003, 147.2 | .473, .003, 140.8 | .618, .005, 126.7 | .041, .001, 45.9
Travel time | -.05, .000, -207.6 | -.058, .000, -206.8 | -.056, .000, -205.6 | -.063, .000, -210.6 | -.0002, .000, -3.4
Direct | 1.30, .018, 73.8 | 1.365, .023, 59.4 | 1.315, .022, 60.0 | .985, .022, 44.4 | .833, .006, 150.4
Train station O | | | 1.072, .019, 57.2 | .991, .019, 52.7 | .131, .004, 31.0
Train station D | | | 1.097, .019, 57.9 | 1.053, .019, 55.7 | .130, .004, 30.7
Train station O&D | | | -1.22, .045, -26.9 | -1.09, .045, -24.4 | .078, .008, 9.4
Residential LU | | | -.622, .027, -22.9 | -.499, .027, -18.3 | -.003, .005*, -0.7
CO (log) | | | | -.854, .031, -40.6 | -.075, .009, -8.5
CD (log) | | | | -1.19, .029, -27.6 | -.085, .009, -9.6
Rho (spatial lag) | | | | | .792, .006, 132.7
Lambda (error) | | | | | .805
Moran’s I | .203 | .088 | .050 | .048 |
Model fit (AIC) | 304,203 | 334,441 | 328,087 | 325,109 |
Correlation (actual flows / estimates) | .642 | .373 | .421 | .504 | .834
* Not significant at .1 level. All other coefficients are significant at .01 level.
SIM1: Base model (single level model)
The results of SIM1 show that all variables included are highly significant and have the expected sign. The coefficients for Oi and Dj are very similar, which means that the masses of the origin and the destination are equally important. This was expected, because we included OD pairs in both directions in our data set (both trip A-B and B-A). The standardized coefficients show that the relative influence of travel time is slightly larger than that of the mass of the neighbourhoods (Oi and Dj). The highly significant Moran’s I of .2 indicates the existence of spatial autocorrelation.
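For readers less familiar with the statistic, the sketch below computes the standard global Moran’s I for a small set of values. It is a generic illustration on a hypothetical four-unit example with binary contiguity weights; the weights that relate OD flows to each other in the SIMs are defined elsewhere in this chapter and are not reproduced here.

```python
def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i ** 2)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]               # deviations from the mean
    s0 = sum(sum(row) for row in weights)        # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)

# Hypothetical example: four units on a line, neighbours share a border.
W_LINE = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]

clustered = morans_i([1.0, 1.0, -1.0, -1.0], W_LINE)     # similar values adjacent
alternating = morans_i([1.0, -1.0, 1.0, -1.0], W_LINE)   # dissimilar values adjacent

print(round(clustered, 3))    # 0.333
print(round(alternating, 3))  # -1.0
```

Positive values indicate that similar residuals tend to occur near each other, which is the pattern the Moran’s I of .2 points to for SIM1.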
SIM2: Base multilevel (lower model + base model)
In SIM2, the actual numbers of boardings and alightings (Oi and Dj) are replaced by the estimates of the lower-level boarding model. As expected, this results in a lower model fit compared to SIM1 (higher AIC), but the change is relatively small. More importantly, the values of the different coefficients in the models are fairly similar. Only the coefficients of Oi and Dj are slightly lower, and the importance of travel time is slightly higher. Overall, the results show that replacing the actual numbers of boardings and alightings by estimates of the lower-level model does not substantially change the model. This suggests that the model can be used for prediction purposes. However, although Moran’s I is much lower, it is still highly significant, indicating remaining spatial autocorrelation.
SIM3: OD-set (lower model + upper model 1)
The added variables describing the OD-set (upper model 1) are all highly significant, with the expected directions: a positive effect of train station availability at either the origin or the destination; a negative effect of a train station at both origin and destination (due to competition from the train); and a negative effect of the share of residential-to-residential land use. SIM3 is overall very similar to SIM2. This shows that the model is very stable and that the added variables add new information. The latter is also confirmed by a slightly better model fit. Adding the mentioned variables results in a lower Moran’s I, indicating that part of the spatial autocorrelation in the previous SIMs was due to omitted autocorrelated variables. However, Moran’s I is still highly significant, indicating remaining spatial autocorrelation.
SIM4: Spatial structure (lower model + upper model 2)
Including variables that describe the spatial structure, defined as competing origins (CO) and competing destinations (CD), has a small positive effect on the model fit. They do not have a large effect on the other variables in the model. Only the intercept changes substantially, from a negative coefficient to an intuitively more logical positive coefficient. The large change in the intercept shows that the spatial structure variables add genuinely new information to the model, which is not captured by any of the other variables. As expected, both CO and CD have a negative sign, implying competition among origins and destinations. The coefficient of CD (competing destinations) is larger in magnitude than that of CO (competing origins): the existence of alternative destinations has a larger impact on the flow between an origin and a destination than the number and size of competing origins. This makes sense, as competing destinations directly affect the flow between origin i and destination j (the flow from i is distributed over all possible destinations), while competing origins only indirectly affect the flow between i and j (competing origins also lead to an increase in ‘capacity’ in the destinations, as discussed before, thereby mitigating their effect). CO and CD hardly influence the remaining spatial autocorrelation in the model, as indicated by a barely changed Moran’s I.
SIM5: Spatial and network autocorrelation (lower model + upper model 3)
The SIMs of the previous sections do not incorporate the effects of spatial and network autocorrelation. As described before, spatial and network autocorrelation might influence the transit flows. The significant values of Moran’s I for the residuals of previous SIMs indicate the existence of autocorrelation in the models. Ignoring this aspect can result in misleading results.
The coefficients for most variables are highly significant in SIM5 and have the expected sign, similar to SIM4. Only the availability of a train station in both the origin and the destination neighbourhood now has a positive coefficient. This might be due to the low number of observations where this holds. The share of residential-to-residential land use is not significant in SIM5.
The coefficients of the spatial lag (rho) and spatial error (lambda) components are both highly significant, positive, and, at about .8, large in magnitude. This indicates that there is both a direct effect of the size of (spatially) related flows on each other and an effect of spatially correlated omitted variables in the model. The goodness-of-fit measures, specifically the correlation between the actual flows and the model estimates, indicate that SIM5 performs better in explaining the flows than the previous SIMs.
Because of the dependence structure of SIM5, the coefficient estimates of this model do not have the same interpretation as in the previous SIMs. In the previous SIMs, the coefficients represent the total marginal effect of a change in the independent variables. In SIM5, the coefficients only describe the short-run direct impact of xi on yi. However, as described in