
CHAPTER 4 MODELING DEMAND PER STATION

4.2 Evaluating models

This section presents the scores used in this work to compare algorithms, and gives some insight into what each score captures and represents. Many measures exist to compare models; each focuses on different aspects and provides different insights into the quality of an algorithm. These measures can be computed per station or globally, giving different levels of understanding. To compare models, we first compare the global scores to select the best algorithms, then examine the per-station scores to gain a deeper understanding of each model and its limits.

The first scores are precision scores, which measure how close the prediction is to reality. However, these scores have to be interpreted with care: rental and return demands at each station are stochastic, so a precision score may not be well suited. Gast et al. (2015) show that even if the probabilistic model were known perfectly, the precision score would still be significantly different from zero. These measures are therefore not adapted to evaluating stochastic processes. They remain pertinent, however, for comparing estimates of the expectation. Section 4.5.1 reports these scores together with the log-likelihood, which allows us to compare different models.
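To illustrate why a precision score cannot reach zero on stochastic demand, here is a minimal sketch. It assumes, purely for illustration, that the hourly demand at a station follows a Poisson distribution with a known rate; the Poisson choice and the function name `expected_abs_error_poisson` are assumptions, not taken from the thesis. It computes the expected absolute error of the best possible expectation estimate, the true rate itself.

```python
import math


def expected_abs_error_poisson(rate, max_count=200):
    """Expected |X - rate| for X ~ Poisson(rate).

    The sum is truncated at max_count, which is far in the tail
    for the small rates used here.
    """
    prob = math.exp(-rate)  # P(X = 0)
    total = 0.0
    for k in range(max_count):
        total += prob * abs(k - rate)
        prob *= rate / (k + 1)  # P(X = k + 1) derived from P(X = k)
    return total


# Even when the true rate (2 trips per hour) is predicted exactly,
# the expected absolute error stays well above zero.
print(expected_abs_error_poisson(2.0))
```

Under this assumption, a perfect knowledge of the rate still yields an expected absolute error of roughly 1.08 trips per hour, which is why such scores are better read as a way to compare expectation estimates than as an absolute measure of quality.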

Notation:

$S$ is the set of stations.

$n_s$ is the number of stations, $n_s = |S|$.

$T$ is the set of hours of the dataset.

$n_t$ is the number of hours, $n_t = |T|$.

$D$ is the $|T| \times |S|$ trip count matrix.

$D_{t,s}$ is the real value of the objective (trip count) at time $t$ in station $s$.

$\hat{D}_{t,s}$ is the estimated value of $D_{t,s}$.

$\bar{D}$ is the mean of $D_{t,s}$ over time and stations: $\bar{D} = \frac{1}{|T||S|} \sum_{t \in T} \sum_{s \in S} D_{t,s}$.

$C$ is the set of features.

$n_C$ is the number of features, $n_C = |C|$.

$F$ is the feature matrix (time and weather features for each time step); it is an $n_t \times n_C$ matrix.

Throughout this thesis, if $M$ is a matrix, $M_{i,j}$ is the element of $M$ at the $i$-th row and $j$-th column. If $i$ or $j$ is replaced with an underscore (_), the whole row or column is selected. Hence $M_{\_,j}$ is the $j$-th column of $M$ and $M_{i,\_}$ is the $i$-th row of $M$. A bar (as in $\bar{M}$) indicates the mean of all selected elements, and a hat (as in $\hat{M}$) represents an estimate.

4.2.1 Splitting Data

The trip and feature data are time-dependent. The trip data depend on the features but also on the activities near each station. These activities are considered constant, as they should not change during the year, so their influence is learned implicitly by the model. However, changes can occur and degrade the performance of the model, since the model cannot learn changes in activities for which no data are available. This must be taken into account in the data split. Selecting test data at random from the trip data gives the learning algorithm information about the future behavior of the network, which leads to an overestimation of the model's performance. We therefore use the first part of the data for training, the second part for validation and the last part for testing. The training set covers 01/01/2015 to 30/06/2016, the validation set 01/07/2016 to 31/08/2016 and the test set 01/09/2016 to 30/10/2016.
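The chronological split can be sketched as follows. This is a minimal illustration that assumes each observation carries an hourly timestamp; the function name `assign_split` and the constant names are hypothetical, while the boundary dates are those stated in the text (end dates taken as inclusive).

```python
from datetime import date, datetime

# Period boundaries from the text; end dates are treated as inclusive.
TRAIN_START = date(2015, 1, 1)
TRAIN_END = date(2016, 6, 30)
VALID_END = date(2016, 8, 31)
TEST_END = date(2016, 10, 30)


def assign_split(timestamp):
    """Return 'train', 'validation' or 'test' for a timestamp, or None if out of range."""
    day = timestamp.date()
    if day < TRAIN_START or day > TEST_END:
        return None
    if day <= TRAIN_END:
        return "train"
    if day <= VALID_END:
        return "validation"
    return "test"


# The split depends only on time, so no future hour leaks into training.
print(assign_split(datetime(2016, 7, 15, 8)))
```

Because the assignment depends only on the timestamp, every training hour precedes every validation hour, which in turn precedes every test hour.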

The 2017 data are not used because the network was modified between 2016 and 2017 and a full year of data is not yet available for the new network. This loss is compensated by proposing solutions for adding new stations to the network (Subsection 4.8.4).

4.2.2 Size of Stations

The size of a station is defined as the average number of trips per hour in the station. This number is not an error measure, but a characteristic of a station. It is computed as :

$$\mathrm{Size}(s) = \frac{1}{|T|} \sum_{t \in T} D_{t,s}$$

We group stations into three categories: big stations (more than 2 trips per hour), small stations (fewer than 0.5 trips per hour) and medium stations (all remaining stations). Big stations represent 30% of the stations, small stations 15% and medium stations 55%.
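As a minimal sketch of this definition, the following computes the size of each station from a trip count matrix stored as a list of hourly rows, then applies the thresholds given above. The function names and the toy data are illustrative assumptions.

```python
def station_sizes(trips):
    """Average number of trips per hour for each station.

    trips: list of rows, one per hour; each row holds one count per station.
    """
    n_hours = len(trips)
    n_stations = len(trips[0])
    return [sum(row[s] for row in trips) / n_hours for s in range(n_stations)]


def size_category(size):
    """Category of a station, using the thresholds from the text."""
    if size > 2:
        return "big"
    if size < 0.5:
        return "small"
    return "medium"


# Toy data: 4 hours, 3 stations.
toy_trips = [[3, 0, 1], [4, 1, 1], [2, 0, 0], [3, 0, 2]]
sizes = station_sizes(toy_trips)
print([(size, size_category(size)) for size in sizes])
```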

4.2.3 MAE

The MAE, or Mean Absolute Error, is defined as the mean of the absolute values of the errors. It is one of the simplest error measures. It is used in our work to better approximate the expected mean.

The penalization of errors is linear, hence large errors are not discouraged as much as in other scores.
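A minimal sketch of the MAE computed globally, over all (hour, station) cells, is given below. The nested-list representation of $D$ and $\hat{D}$ and the function name `mae` are assumptions made for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error over all (hour, station) cells."""
    errors = [
        abs(d - d_hat)
        for row_d, row_p in zip(actual, predicted)
        for d, d_hat in zip(row_d, row_p)
    ]
    return sum(errors) / len(errors)


# Toy data: 2 hours, 2 stations. Absolute errors are 1, 0, 2 and 2.
actual = [[2, 0], [5, 1]]
predicted = [[1, 0], [3, 3]]
print(mae(actual, predicted))  # 1.25
```

Restricting the rows to a single column gives the per-station version of the same score.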

4.2.4 RMSE

The RMSE, or Root Mean Squared Error, is the square root of the mean of the squared errors. It is one of the most commonly used error measures. It penalizes large errors heavily. It is widely used in machine learning because of its relation to the Gaussian distribution.
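The effect of the squared penalty can be seen in a minimal sketch. The function name `rmse` and the toy predictions are illustrative assumptions: both predictions have the same total absolute error, but one concentrates it in a single cell.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over all (hour, station) cells."""
    squared = [
        (d - d_hat) ** 2
        for row_d, row_p in zip(actual, predicted)
        for d, d_hat in zip(row_d, row_p)
    ]
    return math.sqrt(sum(squared) / len(squared))


actual = [[5, 2], [3, 4]]
spread = [[4, 3], [2, 5]]        # an error of 1 in every cell
concentrated = [[1, 2], [3, 4]]  # a single error of 4

print(rmse(actual, spread))        # 1.0
print(rmse(actual, concentrated))  # 2.0
```

Both predictions have a mean absolute error of 1, but the RMSE doubles when the error is concentrated in one cell.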

The MAPE, or Mean Absolute Percentage Error, quantifies the relative error: the error is expressed as a percentage rather than as a raw value. This compensates for the fact that the magnitude of the error tends to grow with the predicted value. The MAPE score is computed as:

$\mathrm{MAPE}(\{D_{i,s}\}_{i \in T, s \in S}, \{\hat{D}_{i,s}\}_{i \in T, s \in S}) = 1$
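As a hedged sketch only, the following computes a relative error using the textbook definition of the MAPE, the mean of $|D_{i,s} - \hat{D}_{i,s}| / |D_{i,s}|$. This is an assumption and may differ from the exact form used in this work. In particular, cells where the observed count is zero are skipped here to avoid a division by zero, which is one possible convention among several. The result is returned as a ratio; multiply by 100 for a percentage.

```python
def mape(actual, predicted):
    """Textbook mean absolute percentage error, returned as a ratio.

    Assumption: cells with an observed count of zero are skipped.
    """
    ratios = [
        abs(d - d_hat) / abs(d)
        for row_d, row_p in zip(actual, predicted)
        for d, d_hat in zip(row_d, row_p)
        if d != 0
    ]
    return sum(ratios) / len(ratios)


# Toy data: the cell with an observed count of 0 is ignored.
actual = [[2, 0], [5, 1]]
predicted = [[1, 0], [3, 3]]
print(mape(actual, predicted))
```

In this toy example the same raw error of 2 counts as 0.4 on a cell with 5 trips and as 2.0 on a cell with 1 trip, which is the compensation for magnitude described above.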

The RMSLE, or Root Mean Squared Logarithmic Error, is the square root of the mean of the squared logarithmic errors. It penalizes positive errors more than negative ones, and large errors more than small ones. It is used in cases where the prediction has to be positive.
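A minimal sketch of the RMSLE follows, assuming the common convention of taking $\log(1 + x)$ so that zero counts are allowed; this convention and the function name `rmsle` are assumptions, not taken from the thesis. The example illustrates the asymmetry: for the same raw error of 5, an underestimate is penalized more than an overestimate.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x) (assumed convention)."""
    squared = [
        (math.log1p(d_hat) - math.log1p(d)) ** 2
        for row_d, row_p in zip(actual, predicted)
        for d, d_hat in zip(row_d, row_p)
    ]
    return math.sqrt(sum(squared) / len(squared))


under = rmsle([[10]], [[5]])   # prediction too low by 5
over = rmsle([[10]], [[15]])   # prediction too high by 5
print(under > over)  # True
```

If the error is read as $D_{i,s} - \hat{D}_{i,s}$ (an assumption about the sign convention), the underestimate is the positive error, which matches the asymmetry described above.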

The R squared coefficient is a measure of the deviation from the mean explained by the model.

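Let us define $SS_{tot} = \sum_{i \in T} \sum_{s \in S} (D_{i,s} - \bar{D})^2$ and $SS_{res} = \sum_{i \in T} \sum_{s \in S} (D_{i,s} - \hat{D}_{i,s})^2$. Then $R^2$ is defined by:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i \in T} \sum_{s \in S} (D_{i,s} - \hat{D}_{i,s})^2}{\sum_{i \in T} \sum_{s \in S} (D_{i,s} - \bar{D})^2} \qquad (4.1)$$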

This score is between −∞ and 1. A score of 1 means that all variance has been explained by the model. This score can be negative because a model can be arbitrarily bad. This score measures the explained ratio of variance. It is not a measure of the goodness of a model. A good model can have a low R² score, and a model that does not fit the data can have a high R² score.
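The range of the score can be checked with a minimal sketch of equation (4.1). The function name `r_squared` and the toy data are illustrative assumptions.

```python
def r_squared(actual, predicted):
    """R squared computed as 1 - SS_res / SS_tot over all (hour, station) cells."""
    flat_d = [d for row in actual for d in row]
    flat_p = [p for row in predicted for p in row]
    mean_d = sum(flat_d) / len(flat_d)
    ss_tot = sum((d - mean_d) ** 2 for d in flat_d)
    ss_res = sum((d - p) ** 2 for d, p in zip(flat_d, flat_p))
    return 1 - ss_res / ss_tot


actual = [[1, 2], [3, 4]]
print(r_squared(actual, actual))                    # 1.0: perfect prediction
print(r_squared(actual, [[2.5, 2.5], [2.5, 2.5]]))  # 0.0: always predicting the mean
print(r_squared(actual, [[4, 3], [2, 1]]))          # -3.0: worse than the mean
```

A constant prediction equal to the global mean scores exactly 0, and any model that does worse than this baseline scores below 0.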

In document Towards Station-Level Demand Prediction for Effective Rebalancing in Bike-Sharing Systems (Page 46-50)