RESEARCH FRAMEWORK AND METHODOLOGIES - The prediction of bus arrival time using Automatic Vehic

1.3.1 Perform a Literature Review

Related research reports, journal articles, and Ph.D. dissertations were reviewed. The primary areas of interest are 1) the current state of practice and trends in the provision of traveler information, 2) the technology of AVL systems, 3) GPS theory, and 4) the methodology for travel time prediction. The purpose of this task is to ensure that no research relevant to this study is overlooked or inappropriately duplicated.

1.3.2 Collect Data and Define Test Bed

Actual AVL data collected in Houston, Texas, were used as a test bed. The data were collected by Houston Metro buses equipped with DGPS receivers that record data at 5-second intervals. Data were collected over 6 months in 2000 (from June to November). The test bed is route 60, which runs along a congested corridor in Houston. In addition to bus location, the DGPS receiver provides time, speed, heading, etc.

There are two test bed sites: a downtown area corridor and a north area corridor. The first corridor has 9 bus stops and is 1.6 kilometers long. Stop 1 and stop 9 are time check points, where bus drivers are expected to keep to the scheduled time. The second corridor has 25 bus stops and is 4.26 kilometers long. Stop 6 and stop 20 are time check points. The scheduled headway is about 30 minutes during the weekday peak period and about 1 hour during the weekday non-peak period and on weekends.

1.3.3 Reduce Data and Correct Errors

There are two types of errors associated with GPS data. The first is the noise added by the U.S. DOD in order to degrade the accuracy of GPS data. This error was corrected by using DGPS. The second type is measurement error. It was anticipated that some of the bus location data would correspond to off-route locations (e.g., parking lots or refueling stations). In addition, even when the bus is located on the road, there are errors associated with its exact location. Missing data are an additional source of error. Where data were missing, existing data were used to calculate the input data according to distance. Data points located unreasonably far from the road were identified as outliers.
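The two clean-up steps described above (flagging points that lie too far from the road, and filling gaps from neighboring observations according to distance) can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes positions have already been projected to planar coordinates in meters, the route is a simple polyline, the 30 m threshold is a hypothetical value, and linear interpolation is only one possible reading of "according to distance". All function names are illustrative.

```python
import math

def point_to_segment_distance(p, a, b):
    """Shortest distance from point p to segment a-b (planar x, y in meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_route(p, route):
    """Distance from p to the nearest segment of the route polyline."""
    return min(
        point_to_segment_distance(p, route[i], route[i + 1])
        for i in range(len(route) - 1)
    )

def drop_off_route(points, route, max_offset_m=30.0):
    """Keep only points within max_offset_m of the route (threshold is assumed)."""
    return [p for p in points if distance_to_route(p, route) <= max_offset_m]

def interpolate_time_at(distance_m, d0, t0, d1, t1):
    """Linearly interpolate the passing time at distance_m between two fixes."""
    if d1 == d0:
        return t0
    return t0 + (t1 - t0) * (distance_m - d0) / (d1 - d0)

route = [(0.0, 0.0), (100.0, 0.0)]
fixes = [(50.0, 5.0), (50.0, 200.0)]   # the second fix is far off the road
on_route = drop_off_route(fixes, route)
```

Here `interpolate_time_at(50.0, 0.0, 0.0, 100.0, 20.0)` estimates when the bus passed the 50 m mark from fixes recorded at 0 m and 100 m.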

1.3.4 Cluster Data

The transit schedule and the level of congestion differ among weekday peak hours, weekday non-peak hours, evenings, and weekends. Dwell time and link travel time would therefore also be expected to differ. To account for these differences, data were clustered by time of the week and time of the day.
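As a rough sketch of this clustering step, each observation can be assigned a period label from its timestamp and grouped accordingly. The hour boundaries below are assumptions made for illustration only; the text does not state the peak or evening hours that were used, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumed period boundaries (hours of the day); not taken from the study.
PEAK_HOURS = ((6, 9), (16, 19))
EVENING_START_HOUR = 19

def time_period(ts):
    """Label a timestamp by time of the week and time of the day."""
    if ts.weekday() >= 5:          # Saturday = 5, Sunday = 6
        return "weekend"
    if any(start <= ts.hour < end for start, end in PEAK_HOURS):
        return "weekday_peak"
    if ts.hour >= EVENING_START_HOUR:
        return "weekday_evening"
    return "weekday_non_peak"

def cluster_by_period(observations):
    """Group (timestamp, value) pairs, e.g. link travel times, by period label."""
    clusters = defaultdict(list)
    for ts, value in observations:
        clusters[time_period(ts)].append(value)
    return dict(clusters)

link_times = [
    (datetime(2000, 6, 5, 7, 30), 95.0),    # Monday morning
    (datetime(2000, 6, 5, 12, 0), 70.0),    # Monday midday
    (datetime(2000, 6, 10, 12, 0), 60.0),   # Saturday
]
clusters = cluster_by_period(link_times)
```

Each cluster can then be used to calibrate a separate model or to compute period-specific averages.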

1.3.5 Develop Prediction Models

A number of modeling techniques were used, including a simple statistical model (historical data), a regression model, and an artificial neural network (ANN) model. In this research, the input variables were arrival time, dwell time, and schedule adherence at each stop. To reflect traffic congestion, schedule adherence was calculated by subtracting the scheduled arrival time from the actual arrival time. A positive value of schedule adherence means that the bus is delayed at the stop, while a negative value means that the bus arrives early. Traffic congestion was also taken into account by clustering the link travel times by time period in Task 4 (Section 1.3.4). The output variable is the arrival time at each stop.
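The sign convention for schedule adherence, together with one plausible form of the simple historical-data model, can be illustrated as below. The predictor is an assumption: the text does not specify how the historical model combines its inputs, so this sketch simply adds the mean historical stop-to-stop time (taken from the matching time-period cluster) to the latest observed arrival. Times are in seconds after midnight, and the function names are hypothetical.

```python
from statistics import mean

def schedule_adherence(actual_arrival_s, scheduled_arrival_s):
    """Positive: the bus is late at the stop. Negative: the bus is early."""
    return actual_arrival_s - scheduled_arrival_s

def predict_arrival_historical(arrival_at_stop_s, historical_stop_to_stop_s):
    """Assumed baseline: last observed arrival plus the mean historical
    stop-to-stop time (dwell included) for the same time period."""
    return arrival_at_stop_s + mean(historical_stop_to_stop_s)

late = schedule_adherence(36120, 36000)    # arrived 10:02:00, scheduled 10:00:00
early = schedule_adherence(35940, 36000)   # arrived 09:59:00, scheduled 10:00:00
next_arrival = predict_arrival_historical(36120, [170, 180, 190])
```

The regression and ANN models would take the same quantities as inputs but learn the mapping to arrival time from the calibration data.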

1.3.6 Evaluate Prediction Models

All three model architectures were calibrated. With these calibrated models, the arrival times were predicted. A validation data set was obtained in order to test which model is most appropriate. Predicted arrival times were compared to the observed arrival times from the validation data set. The Mean Absolute Percentage Error (MAPE) was used as the measure of effectiveness (MOE). The MAPE is shown in Equation 1-1. It represents the average percentage difference between the observed value (in this case, observed arrival times at a bus stop) and the predicted value (in this case, predicted arrival times at a bus stop). A smaller MAPE means that the model predicts more accurately than the other models.

$$\mathrm{MAPE} = \frac{1}{n} \sum \frac{\left| y_i - y_o \right|}{y_o} \times 100\% \tag{1-1}$$

where,

$y_i$ = Predicted value (i.e. arrival time at given transit stop);

$y_o$ = Observed value (i.e. arrival time at given transit stop);

$n$ = The number of data considered.
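Equation 1-1 translates directly into a few lines of code. The sketch below assumes the predicted and observed values are paired lists of positive numbers (for example, times in seconds measured from a common reference); the function name is illustrative.

```python
def mape(predicted, observed):
    """Mean Absolute Percentage Error, in percent, following Equation 1-1."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("predicted and observed must be non-empty and equal in length")
    # Each term is |y_i - y_o| / y_o; observed values must be non-zero.
    total = sum(abs(p - o) / o for p, o in zip(predicted, observed))
    return 100.0 * total / len(observed)

error_pct = mape([110.0, 90.0], [100.0, 100.0])
```

In this example both predictions miss the observed value by 10 percent, so the MAPE is 10 percent. Because the error is divided by the observed value, the result depends on the reference from which times are measured, so all models should be compared on the same basis.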

1.3.7 Identify the Prediction Interval of the Bus Arrival Time

The model with the smallest MAPE was chosen as the prediction model for bus arrival time. With the selected model, prediction intervals on the estimates were provided. If the ANN model is chosen as the best-performing model, the conventional method for finding prediction intervals is not appropriate. In that case, the bootstrap method, a statistical method that provides prediction intervals for non-parametric models, was used. To statistically test the differences in mean and variance among the three models, one of several pairwise comparison methods, such as Tukey's procedure, was used.
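The idea behind a bootstrap interval can be sketched with a simplified residual-resampling variant. This is not necessarily the bootstrap procedure used in the study: it ignores the uncertainty in the fitted model itself (a fuller treatment would refit the model on resampled data) and assumes that validation residuals, defined here as observed minus predicted arrival time, are representative of future errors. The function name and parameters are hypothetical.

```python
import random

def bootstrap_prediction_interval(point_prediction, residuals,
                                  level=0.95, n_resamples=2000, seed=0):
    """Percentile interval from residuals resampled with replacement."""
    if not residuals:
        raise ValueError("at least one residual is required")
    rng = random.Random(seed)
    # Simulate plausible outcomes: point prediction plus a resampled residual.
    draws = sorted(point_prediction + rng.choice(residuals)
                   for _ in range(n_resamples))
    alpha = 1.0 - level
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return draws[lo_idx], draws[hi_idx]

validation_residuals = [-30.0, -10.0, 0.0, 10.0, 20.0, 45.0]   # seconds, made up
lower, upper = bootstrap_prediction_interval(600.0, validation_residuals)
```

The interval is read as a range of arrival times, in seconds, around the point prediction of 600 s. The residual values above are illustrative only.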

1.3.8 Identify the Probability of a Bus Being on Time

The probability density function of schedule adherence was identified. To determine which distribution best fit schedule adherence, a chi-squared goodness-of-fit test was used. After the best-fit probability density function was identified, the probabilities of a bus being on time, ahead of schedule, and behind schedule were estimated.
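The following sketch shows how such a test and the resulting probabilities might be computed for one candidate distribution. It assumes a normal distribution as the candidate (the text does not say which distributions were tested), equal-probability bins, and a hypothetical on-time window of 60 s early to 300 s late. The statistic would be compared with a chi-squared critical value for the returned degrees of freedom, for example 5.991 at the 5 percent level with 2 degrees of freedom.

```python
from statistics import NormalDist, mean, stdev

def chi_squared_statistic_normal(samples, n_bins=5):
    """Chi-squared statistic of samples against a fitted normal distribution,
    using n_bins bins of equal expected probability."""
    fitted = NormalDist(mean(samples), stdev(samples))
    # Interior bin edges at the 1/k, 2/k, ... quantiles of the fitted normal.
    edges = [fitted.inv_cdf(i / n_bins) for i in range(1, n_bins)]
    observed = [0] * n_bins
    for x in samples:
        observed[sum(x > e for e in edges)] += 1
    expected = len(samples) / n_bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    dof = n_bins - 1 - 2   # two parameters (mean, std) estimated from the data
    return statistic, dof, fitted

def on_time_probabilities(fitted, early_tol_s=60.0, late_tol_s=300.0):
    """Probabilities of being ahead of schedule, on time, and behind schedule.
    The tolerance window is an assumed definition of 'on time'."""
    p_ahead = fitted.cdf(-early_tol_s)
    p_behind = 1.0 - fitted.cdf(late_tol_s)
    return {"ahead": p_ahead, "on_time": 1.0 - p_ahead - p_behind, "behind": p_behind}

# Synthetic schedule-adherence values (seconds), for illustration only.
adherence = NormalDist(60.0, 120.0).samples(200, seed=1)
statistic, dof, fitted = chi_squared_statistic_normal(adherence)
probabilities = on_time_probabilities(fitted)
```

If the statistic exceeds the critical value, the candidate distribution is rejected and another one is tried.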

FIGURE 1-1 Framework for Research
