One of the main advantages of tree-based learning algorithms for both continuous (regression) and discrete (classification) prediction is that they are comparatively hard to over-fit. In our training process, both the training loss and the validation loss decrease, which indicates that the model does not over-fit the training set. After training the gradient boosting trees for the cost prediction task, we examine feature importance to identify the fifteen features that contribute most to forming the trees.
1. Route features:
• Distance is the most important feature for splitting.
• Duration is the 2nd most important feature for splitting.
2. Spatial features:
• Pickup latitude is the 3rd most important feature for splitting.
• Dropoff latitude is the 5th most important feature for splitting.
• Pickup longitude is the 6th most important feature for splitting.
• Dropoff longitude is the 7th most important feature for splitting.
• Jfk dist is the 4th most important feature for splitting.
• Direction is the 14th most important feature for splitting.
• Delta longitude is the 9th most important feature for splitting.
• Delta latitude is the 10th most important feature for splitting.
• Lga dist is the 11th most important feature for splitting.
• Sol dist is the 13th most important feature for splitting.
3. Temporal features:
• Pickup hour is the 8th most important feature for splitting.
• Pickup day is the 12th most important feature for splitting.
• Pickup year is the 15th most important feature for splitting.
The two most important features are trip distance and trip duration. Note that the route characteristics (tertiary, secondary, trunk, nTrafficSignals, and residential) are already encoded in the trip duration, so the model leaves them out when splitting.
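To make the notion of "important for splitting" concrete, the following minimal sketch ranks features by how often they are used as a split in a set of trees. It is only an illustration under assumptions: the text does not state whether split count or gain is the importance measure used, the nested-dictionary tree format and the names `count_splits` and `split_importance` are hypothetical, and the three toy trees are made up rather than taken from the trained model.

```python
from collections import Counter

def count_splits(node, counts):
    """Recursively count the splitting feature of every internal node."""
    if "feature" not in node:  # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

def split_importance(trees):
    """Return (feature, split count) pairs, most frequently used first."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()

# Toy ensemble (hypothetical format, made-up trees; not the thesis model).
leaf = {"value": 0.0}
toy_trees = [
    {"feature": "distance",
     "left": {"feature": "duration", "left": leaf, "right": leaf},
     "right": leaf},
    {"feature": "distance",
     "left": leaf,
     "right": {"feature": "pickup_latitude", "left": leaf, "right": leaf}},
    {"feature": "duration",
     "left": leaf,
     "right": {"feature": "distance", "left": leaf, "right": leaf}},
]

ranking = split_importance(toy_trees)
print(ranking)
```

In this toy ensemble, distance is used in three splits, duration in two, and pickup latitude in one, so the ranking mirrors the ordering style of the list above.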
The model is evaluated on the test set, and the results are shown in Table 4.4.
$R^2$      RMSE
0.8548     3.6954
TABLE 4.4: Prediction results for trip cost
The RMSE is relatively low, which indicates that the trained trees predict the trip cost accurately. The other evaluation metric, $R^2$, measures how much of the variance of the target feature (trip cost) in the test set the trees explain. The trained model used to predict the trip cost is denoted $\hat{C}(\cdot)$; it takes a trip and returns the trip cost.
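For reference, the two metrics can be computed as in the following minimal sketch. The function names `rmse` and `r2_score` are chosen here for illustration, and the fares are made-up numbers, not the thesis test set.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up trip costs for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

print(rmse(observed, predicted))      # error in the same unit as the target
print(r2_score(observed, predicted))  # share of variance explained
```

RMSE is expressed in the unit of the target, so it should be read against the typical trip cost, whereas $R^2$ is unit-free and compares the model against always predicting the mean.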
Two important functions for the ridesharing simulation are approximated in this chapter. In the next chapter, we define the simulation environment, in which the cost function and the duration function give the trip cost and the trip duration on the edges of the city network.
Dijkstra’s shortest path algorithm is used to calculate the shortest travel time for a trip.
These two functions help us to capture the dynamics in the city network with respect to cost and duration. Both functions play an important role in providing an accurate simulation for ridesharing.
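The role of the duration function in the shortest-path computation can be sketched as follows. This is a minimal illustration under assumptions: the city network is taken to be a 4-connected m x n grid, and `toy_duration` is a hypothetical stand-in for the learned duration function; the names `grid_neighbors` and `shortest_duration` are likewise chosen here.

```python
import heapq

def grid_neighbors(cell, m, n):
    """Yield the 4-connected neighbours of cell (row, col) in an m x n grid."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < m and 0 <= nc < n:
            yield (nr, nc)

def shortest_duration(source, target, m, n, edge_duration):
    """Dijkstra's algorithm; returns (total duration, path) or (inf, [])."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route to u was found
        for v in grid_neighbors(u, m, n):
            nd = d + edge_duration(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

def toy_duration(u, v):
    # Hypothetical stand-in for the duration function: entering cell (0, 1)
    # takes 10 minutes (e.g. congestion), any other move takes 1 minute.
    return 10.0 if v == (0, 1) else 1.0

total, path = shortest_duration((0, 0), (0, 2), 3, 3, toy_duration)
print(total, path)
```

In this toy grid the fastest route avoids the congested cell and takes four one-minute moves instead of the two-move direct route, which is the kind of time-dependent behaviour the duration weights are meant to capture.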
Chapter 5
Ridesharing Simulation
In this chapter, our ridesharing approach is simulated using NYC yellow-cab data.
To run the simulation, we first provide the algorithm for our ridesharing procedure based on chapter 3 and then prepare the data for the simulation. As noted in [90], the static-case results for ridesharing are an upper bound for the results in the dynamic case. Based on this observation, instead of forecasting future demand, we assume that a perfect model is available that forecasts the exact demand for the near future (within 10 minutes of the processing time). In the remainder of this chapter, section 5.1 gives the exact algorithm for our ridesharing approach, section 5.2 describes the simulation environment from both the spatial and the temporal perspective, and section 5.3 presents the simulation results, including the different aspects of adaptive waiting time and its influence on the matching rate.
5.1 Ridesharing Algorithm
In this section, we provide our ridesharing algorithm for a defined time horizon $T$ with one-minute intervals. The inputs of the algorithm are $\varepsilon$, $G_{m,n}$, $\hat{C}(\cdot)$, and $\hat{\varphi}(\cdot)$, which are the detour flexibility, the city network, the cost function, and the duration function, respectively. The output is the set of matched passengers who share their ride during the time horizon. To mimic the online nature of dynamic ridesharing, data is streamed in a discretized manner (one-minute intervals), and future demand is used only to calculate the waiting time for passengers.
output: A matching set $M$ that is iteratively updated
Data: Request set $D$, which is streamed over the time horizon
for $t \in \{0, \dots, T\}$ do
    Calculate the trip duration between adjacent vertices in $G_{m,n}$ based on their centers' longitude and latitude at time $t$.
    Add the unmatched passengers from $\Pi_{t-1}$ to $\Pi_t$: $\Pi_t \leftarrow \Pi_{t-1}$.
    Calculate the trip duration for $P = \{r_i \in D \mid t_i = t\}$ by Dijkstra's algorithm.
    Calculate the optimal waiting time for $P = \{r_i \in D \mid t_i = t\}$ based on equation 5.
    Indicate the leaving candidates as follows: $L_t = \{r_i \in \Pi_t \mid t_i + w_i = t\}$.
    Create the passengers' graph $G_p$ and add edges for feasible matches.
    Construct the matching $M_t$ using the greedy approach in algorithm 1 and remove matched passengers from $\Pi_t$.
    $M \leftarrow M \cup M_t$
At each iteration, the ridesharing algorithm updates the weights on the city network based on the trip duration from each vertex's center to the center of each adjacent vertex. Afterward, the shortest path (in terms of duration) is calculated for each request using Dijkstra's algorithm and added to the request's features. Knowing the shortest path, we calculate the upper bound for the waiting time, and subsequently the optimal waiting time is calculated by equation 5 and added to the features. The leaving candidate set is then defined as the set of passengers in the passengers' pool whose waiting time has elapsed and who are now ready to be matched. After the passengers' graph is created, the greedy algorithm 1 finds a matching set $M$ on it. Passengers joined by an edge in $M$ share a ride.
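The control flow of this loop can be sketched as follows. This is a simplified illustration, not the implementation used in this work: the waiting times are passed in rather than computed from equation 5, the `feasible` and `weight` callbacks are hypothetical placeholders for the feasibility test and edge weights defined in chapter 3, the greedy routine is a generic weight-ordered greedy matching that may differ from algorithm 1, edges are only built between a leaving candidate and another pooled passenger, and unmatched leaving candidates are assumed to depart alone.

```python
from collections import namedtuple

# rid: request id, t: request minute t_i, w: waiting time w_i (given here).
Request = namedtuple("Request", "rid t w")

def greedy_matching(edges):
    """Pick edges in descending weight, skipping already matched endpoints."""
    matched, result = set(), []
    for _, a, b in sorted(edges, reverse=True):
        if a not in matched and b not in matched:
            matched.update((a, b))
            result.append((a, b))
    return result

def simulate(requests, horizon, feasible, weight):
    pool = {}       # passengers' pool, keyed by request id
    matching = []   # accumulated matching M
    for t in range(horizon + 1):
        # Stream in the requests issued at minute t.
        for r in requests:
            if r.t == t:
                pool[r.rid] = r
        # Leaving candidates: passengers whose waiting time elapses at t.
        leaving = [r for r in pool.values() if r.t + r.w == t]
        # Passengers' graph: one edge per feasible pair (assumed structure).
        edges, seen = [], set()
        for a in leaving:
            for b in pool.values():
                key = frozenset((a.rid, b.rid))
                if a.rid != b.rid and key not in seen and feasible(a, b):
                    seen.add(key)
                    edges.append((weight(a, b), a.rid, b.rid))
        m_t = greedy_matching(edges)
        for a, b in m_t:
            pool.pop(a)
            pool.pop(b)
        # Assumption: unmatched leaving candidates depart alone.
        for r in leaving:
            pool.pop(r.rid, None)
        matching.extend(m_t)
    return matching

# Made-up requests; feasibility and weights are toy rules for illustration.
requests = [Request(1, 0, 2), Request(2, 1, 3), Request(3, 2, 1),
            Request(4, 5, 0), Request(5, 3, 0)]
result = simulate(
    requests,
    horizon=6,
    feasible=lambda a, b: abs(a.t - b.t) <= 2,
    weight=lambda a, b: -abs(a.t - b.t),
)
print(result)
```

In the toy run, request 1 waits until minute 2 and is matched with request 2, requests 3 and 5 are matched at minute 3, and request 4 finds no partner and leaves alone.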
5.2 Simulation Environment
We explained the ridesharing algorithm in the last section; in this section, we describe the simulation environment precisely. By the simulation environment, we mean all the factors that shape the ridesharing, including the detour flexibility factor, the city graph $G_{m,n}$, the request set $D$, and the time horizon $T$. First, we define the request set $D$.
We consider the NYC yellow-cab data for our simulation [93]. We have selected our data from January 2012 for the simulation. The dataset consists of more than 14 million ride requests.