

2.2 On-train crowding

2.2.3 Automatic fare collection (AFC) data and rail passenger crowding

2.2.3.1 Predicting crowding using AFC data

Ceapa et al. (2012) analysed peak crowding at stations on the London Underground using one month of Oyster card entrance and exit data.

After cleaning the data, they conducted a spatio-temporal analysis, which revealed that station crowding patterns were very regular on weekdays due to commuting, but typically had higher variance at weekends. For the weekday data, they identified at least three types of station: ‘Residential’, with high entrances in the AM Peak and high exits in the PM Peak (e.g. Finchley Central); ‘Business’, with high exits in the AM Peak and high entrances in the PM Peak (e.g. Canary Wharf); and ‘Transport Hub’, with high entrances and high exits in both the AM Peak and the PM Peak (e.g. Waterloo). Figure 1 shows the average weekday entrances (blue) and exits (red) at Canary Wharf Underground station, which had a very low standard deviation (shaded area) in the evening peak, along with three ‘sub-peaks’ at around 17:10, 17:40 and 18:10.

Figure 1: Average weekday entrances and exits at Canary Wharf Underground station (… permission)

They then undertook a more systematic classification of the stations using an agglomerative hierarchical clustering technique. This is a ‘bottom up’ approach in which each observation starts in its own cluster and pairs of clusters are then successively merged. They used a technique related to ‘Dynamic Time Warping’ (DTW), a popular algorithm in data mining and time series clustering analysis, as well as in other fields. Senin (2008) describes the DTW algorithm as “being extremely efficient as the time-series similarity measure which minimises the effects of shifting and distortion in time by allowing elastic transformation of time series in order to detect similar shapes with different phases”. Specifically, Ceapa et al. chose to use an approximation to DTW called ‘FastDTW’, which had the advantage of being more suitable for large time series datasets. From visual inspection, they chose to terminate the clustering algorithm at six clusters: there was good compactness within each cluster, and there were also large inter-cluster distances, suggesting good separation between the clusters.
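The clustering idea can be illustrated with a minimal Python sketch. It is not the implementation used by Ceapa et al.: it uses the classic quadratic-time DTW recurrence rather than FastDTW, it assumes average linkage (the linkage criterion is not stated above), and the station names and profiles are hypothetical.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic DTW distance (not FastDTW) with absolute-difference step cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def agglomerative_clusters(profiles, n_clusters):
    """Bottom-up clustering: start with one cluster per station, then merge
    the closest pair (average linkage, an assumption) until n_clusters remain."""
    names = list(profiles)
    dist = {
        frozenset(pair): dtw_distance(profiles[pair[0]], profiles[pair[1]])
        for pair in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical entrance profiles: 'res' stations have a large early peak,
# 'bus' stations a large late peak; the '_b' variants are time-shifted copies.
profiles = {
    "res_a": [1, 9, 1, 1, 3, 1, 1],
    "res_b": [1, 1, 9, 1, 1, 3, 1],
    "bus_a": [1, 3, 1, 1, 9, 1, 1],
    "bus_b": [1, 1, 3, 1, 1, 9, 1],
}
clusters = agglomerative_clusters(profiles, n_clusters=2)
```

In this toy example the time-shifted copies have a DTW distance of zero, so each pair is merged first, which reflects the property Senin describes of detecting similar shapes with different phases.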

As there was no official definition for station crowding, they defined a proxy measure for crowding, which was the proportion of touch-ins and touch-outs at the station relative to the maximum number observed in the data. This measure meant that a value of 1 represented the station at its peak level of crowding and a value of 0 indicated no entrances or exits within the period.
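As a rough illustration of this proxy measure, the sketch below scales the combined touch-ins and touch-outs in each period by the maximum observed at the station. The function name and the counts are hypothetical, and returning zeros for a station with no activity is an assumption made here to avoid division by zero.

```python
def crowding_proxy(counts):
    """Scale each period's touch-ins plus touch-outs by the station's observed maximum."""
    peak = max(counts)
    if peak == 0:
        # Assumption: a station with no recorded activity gets a level of 0 throughout.
        return [0.0 for _ in counts]
    return [c / peak for c in counts]


# Hypothetical combined touch-in and touch-out counts for four periods at one station.
counts = [0, 120, 480, 240]
levels = crowding_proxy(counts)
```

Here the busiest period maps to 1 and the empty period to 0, matching the interpretation given above.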

They then built and evaluated three prediction models and investigated the effect of several parameters on the accuracy of the results, using half of the data for training the model and half for testing. The three predictors were as follows:

• ‘Historic value’ – This was the most basic predictor, taking a single value from the training dataset for the same day of week and time of day.

• ‘Historic mean’ – This was similar to the ‘Historic value’ predictor, but used the mean of all values in the training dataset for the same day of week and time of day.

• ‘Historic trend’ – This attempted to improve on the other two predictors by also taking into consideration the crowding level at the current time.
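The three predictors might be sketched as follows. This is an illustrative reading rather than the formulation used in the study: the data layout, the choice of the most recent observation for ‘Historic value’, and in particular the additive offset used for ‘Historic trend’ are all assumptions, since the description above only states that the current crowding level is taken into account.

```python
from statistics import mean

# Hypothetical training data: crowding levels (0 to 1) observed in successive
# weeks, keyed by (day of week, time of day).
training = {
    ("Mon", "08:00"): [0.5, 0.75],
    ("Mon", "08:15"): [0.75, 1.0],
}


def historic_value(training, day, slot):
    """One past observation for the same day and time (assumed: the most recent)."""
    return training[(day, slot)][-1]


def historic_mean(training, day, slot):
    """Mean of all past observations for the same day and time."""
    return mean(training[(day, slot)])


def historic_trend(training, day, slot, current_slot, current_level):
    """Assumed form: historic mean for the target slot, shifted by how far the
    level observed now deviates from the historic mean for the current slot."""
    offset = current_level - historic_mean(training, day, current_slot)
    prediction = historic_mean(training, day, slot) + offset
    return min(1.0, max(0.0, prediction))  # keep within the proxy's 0 to 1 range


value_pred = historic_value(training, "Mon", "08:15")
mean_pred = historic_mean(training, "Mon", "08:15")
trend_pred = historic_trend(training, "Mon", "08:15", "08:00", current_level=0.5)
```

In this example the station is quieter than usual at 08:00, so the assumed trend predictor lowers its 08:15 forecast relative to the historic mean.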

The project concluded that the high regularity of commuting travel meant that patterns are indeed predictable, even with as little as two weeks of training data. The three predictors based on historic data achieved very good results in predicting crowding, which suggested that providing information on crowding levels to passengers would be “highly feasible”.

They proposed one use of this analysis as follows: “Identifying times when public transport is overcrowded could help travellers change their travel patterns, by either travelling slightly earlier or later, or by travelling from/to a different but geographically close station”. However, the study did not assess how passengers would make use of such information, and recommended this as a topic for future studies.

The finding that weekday travel on the London Underground has low variability is not surprising; for the most part this is likely explained by high levels of commuting trips, many of which will involve interchanges with timetabled train services. Nevertheless, it is novel that even relatively simple predictors yielded accurate results due to this regularity. A limitation of the study should be noted in that it did not consider any seasonal effects; the study was based on data for the 31 days in March 2010, which was a ‘typical’ month in that it contained no school holidays or bank holidays.

The fundamental approach of station-by-station analysis, clustering and then prediction may be applicable to heavy rail, although it is not clear whether journeys would also have low levels of variability and in turn high levels of predictability. This would likely be affected by the proportion of passengers on a particular route who were travelling for commuting, business, leisure or other journey purposes.

As discussed above, the study used the proportion of touch-ins and touch-outs relative to the maximum number observed in the data as a proxy measure for crowding. This measure was fit for purpose in the context of the study in assessing the predictability of demand, although a limitation of the study is that there was no attempt to calibrate it against available capacity, i.e. to understand actual levels of crowding. This could be attempted in a more detailed study, which might involve analysis of service timetables, walking times, predicted interchanges, the proportion of tickets that are still paper-based, validation with manual counts, etc. As such, it seems that it would be easier to use automatic passenger counting (APC) data, if available, rather than AFC data for measuring levels of crowding.

In document Optimising the Loading Diversity of Rail Passenger Crowding using On-Board Occupancy Data (Page 36-38)