A use of GPS data for analyzing customer behavior insight


Tho Thiravetyan^1,a, Tinnapat Thiamtawan^1,b, Tanapon Wangpataravanich^1,c, Pisit Jarumaneeroj^1,d

^1 Department of Industrial Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, 10330, Thailand

Email addresses: ^a[email protected], ^b[email protected], ^c[email protected], ^d[email protected]

Acknowledgement

This research was supported by Pisit Jarumaneeroj, Department of Industrial Engineering, Faculty of Engineering, Chulalongkorn University, and also supported by the study company providing the data.

Abstract: Many companies now collect data on their products, services, and related activities in order to analyze their businesses in detail. Such data are crucial both for improving operations and for creating new products or services that differentiate a company from its competitors. One example is the GPS data collected by the car-sharing company investigated in this paper. While GPS data are mainly used for location and status tracking, we investigate how they can also be used to improve the company’s operational efficiency through a detailed trip chain analysis. In this analysis, a Gaussian mixture model (GMM) is used to divide customers into segments based on their historical usage, i.e. their trip chains. The trip chains then serve as consumer insight and as a description of customer needs, which are inputs for improving the operational efficiency of the car-sharing system and for deciding where to open new stations.

Keywords: Car-sharing; Gaussian mixture model; Clustering; GPS; Behavior; Trip chain


1. Introduction

According to a study by Dresner Advisory Services in 2017 (B.E. 2560), big data is a valuable resource for organizational development: the share of organizations that analyze big data rose from 17 percent to 53 percent over the preceding 3 years, and the same trend holds in Thailand. We may therefore assume that data analysis is indispensable for developing an organization. Data are collected in many forms, such as customer service records, financial transactions, internet access logs, and GPS travel records of car users, and the benefits of analyzing them vary by industry. Some researchers have used big data to develop their companies (Sanjay Kumar Singh, Abdul-Nasser El-Kassar, 2019). In the ride-hailing and car-sharing industry, whether Grab, Uber, or car-sharing companies abroad, the data that companies collect in very large volumes, and can analyze for organizational development, is GPS data.

One way to use GPS data is the trip chain. A trip chain orders activities by time and place, connecting them from the start of a car service until its end (Guido Perboli, Francesco Ferrero, Stefano Musso, Andrea Vesco, 2018). A trip chain makes user behavior clearly visible, and this information can then feed into the development of the operating system.

The researchers chose to develop the operating system of a car-sharing company, which is a short-term car rental business. Car sharing links to public transport in densely populated cities with traffic problems, fits the concept of future public transportation, reduces environmental problems, and is more flexible than daily car rental.

However, the car-sharing company in Thailand that the researchers worked with has not analyzed its users' travel data to the extent that it could. We therefore saw an opportunity: if the travel data that are already being collected continuously are analyzed, the company will learn the behavior of its customer groups in depth and can use this to improve the operational efficiency of its system.

If the system improves significantly, it can attract more people, become more widely known, and increase the number of users. It would also make effective use of the collected data, which helps increase sales and profits and ultimately the value of the company.

The case study company is a car-sharing company in Thailand with more than 5,000 users. Cars can be rented for between 30 minutes and 7 days, and there are over 100 parking spots throughout the country. Customers use the service through the application, which is available 24 hours a day. The company currently operates a round-trip service model, in which a car taken from any station must be returned to that same station (Nourinejad & Roorda, 2015). The fleet consists of cars for 2-5 passengers and cars for 5-7 passengers.

The purpose of this project is to analyze the GPS data of the study company, using a Gaussian mixture model (GMM), to gain in-depth information about users' service usage behavior that can support the selection of business strategies. The project focuses on using GPS data to improve the car-sharing system of a car-sharing company in Thailand and consists of two main parts: converting GPS data into trip chains, and using the trip chains for future benefit.

2. Literature Review

This project aims to improve the efficiency of the operating system. To support this aim in the long run, the researchers studied the related theories described below.

2.1 Analysis of current car-sharing systems

First, the researchers reviewed current car-sharing systems in order to understand the service models that car-sharing companies use. The three models below are compared in terms of their similarities and differences (Francesco Ferrero, 2018).

2.1.1 Free floating (Firnkorn & Müller, 2011)

A model in which the car can be parked anywhere in a public place, subject to the company's terms. This is extremely convenient for users, but from the company's point of view it makes the occupancy of a specific car hard to forecast and schedule.


2.1.2 One trip (Station based) (Nourinejad & Roorda, 2015)

A model in which users can park anywhere during the trip but must return the car to a station, which need not be the station where the trip started.

2.1.3 Round trip or Two trip (Station based) (Nourinejad & Roorda, 2015)

A model in which a car taken from any station must be returned to that same station. The operator therefore does not have to account for return locations that depend on customer demand. The case study company operates in this manner.

2.2 Customers using car-sharing services

Several theories can be applied to analyze how trip chains can be used. The first concerns the factors that affect the choice of car-sharing services (Taekwan Yoon, Christopher R. Cherry, Luke R. Jones, 2017), with results that apply to round-trip car sharing. Most customers who use the service do not own a car. Other typical characteristics are being male, facing a higher cost of using one's own car than of using car sharing, and having a high income. Such customers may also be environmentally conscious.

2.3 Trip chain

2.3.1 Definition of trip chain

The trip chain was mentioned in the introduction; this section gives its definitions. McGuckin and Murakami (1995), who are transport planners, define a trip chain as an ordering of activities in time and place that connects trips between work and non-work activities. Srinivasan (1998) gives two further definitions: first, a trip chain is a sequence of sub-trips that starts from home, travels to other activities, and eventually returns home; second, it is a sequence of sub-trips between home and work or school.

The car-sharing system of the case study company is a round-trip (station-based) system. Our trip chain therefore corresponds to the first definition of Srinivasan (1998), with the car-sharing station taking the role of the home.

2.3.2 Model creation process, changes in activity in the walking chain (D’Este, 1997)

D’Este describes a technique for modeling the overall factors that change travel behavior. The model extends an existing model with a Markov chain and builds the trip chain in 3 steps, as follows.

1. Create a Markov transition matrix. This matrix gives the probability of switching from one activity to another. The ‘Home’ activity represents both the starting point and the final destination.

2. Calculate the state probabilities at each position in the chain, where each state probability is the probability that a person performs a given activity at that position of the trip chain. It is the sum, over all previous activities, of the probability of having done the previous activity multiplied by the probability of traveling from its location to the location of the current activity. The initial state is assumed to be staying at home and the last state is returning home.

3. Calculate the matrix of cumulative transition probabilities from the results of step 2. The cumulative transition probabilities combine the probability of every activity up to the current position with the probability of moving on to the next one, using the probabilities of traveling from the origin activity to the destination activity in the Markov transition matrix.

The researchers can apply such a process to predict the next activity of the user of car-sharing services. The results of the predictions include the activities that have occurred, where it occurred, and time spent.
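The first two steps above can be illustrated with a short sketch. It is not D’Este's implementation: the function names, the state labels, and the four example chains are hypothetical, and the transition probabilities are simply estimated as relative frequencies of observed transitions.

```python
from collections import Counter, defaultdict

def estimate_transition_matrix(chains):
    """Step 1: relative frequency of each observed transition (first-order Markov)."""
    counts = defaultdict(Counter)
    for chain in chains:
        for current, following in zip(chain, chain[1:]):
            counts[current][following] += 1
    matrix = {}
    for state, row in counts.items():
        total = sum(row.values())
        matrix[state] = {following: n / total for following, n in row.items()}
    return matrix

def state_distribution(matrix, start, steps):
    """Step 2: probability of each state after `steps` transitions from `start`."""
    dist = {start: 1.0}
    for _ in range(steps):
        next_dist = defaultdict(float)
        for state, p in dist.items():
            # Sum over previous states of P(previous) * P(previous -> next).
            for following, q in matrix.get(state, {}).items():
                next_dist[following] += p * q
        dist = dict(next_dist)
    return dist

# Hypothetical chains; each starts and ends at "home" (the station in our setting).
chains = [
    ["home", "work", "home"],
    ["home", "shop", "home"],
    ["home", "work", "shop", "home"],
    ["home", "work", "home"],
]
P = estimate_transition_matrix(chains)
after_two = state_distribution(P, "home", 2)
```

With these example chains, three of the four departures from "home" go to "work", so `P["home"]["work"]` is 0.75, and after two transitions the chain is back at "home" with probability 0.75.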

2.4 Methods for analyzing types of activities and factors that affect

The researchers refer to methods for analyzing GPS data from two research projects. The first is an analysis of activity patterns from the trip chains of metro and public bus users in Korea (Gain Han, Keemin Sohn, 2015), which predicts the activities that users perform from the starting time, the duration of each activity, and the land use of the area. The resulting trip chain is a sequence of locations and activities. The starting time and duration of each activity are taken from the users' card taps, while the land-use data are taken from other sources. In that research, land use is divided into 4 categories: residential areas, commercial buildings, offices, and others. The researchers cannot know the exact location of each activity, only the bus stop or subway station used. This is similar to our car-sharing research: when a user stops at a parking spot, we cannot know exactly which activity took place, only what kinds of places are nearby. That research characterizes land use within a radius of 200 meters of a bus stop and 400 meters of a metro station. The area is divided into the 4 categories above using GIS technology, and the result can be used for further model development.

The second is the research of (Cory M. Krause, Lei Zhang, 2018), which, like this project, creates trip chains from car-sharing data. Its data consist of time, date, latitude, longitude, distance traveled in 30 seconds, and average speed in 30 seconds, recorded every 30 seconds for a total of 2,880 data points per day. The data are plotted on a GIS map as a vector diagram, and the points are classified as follows.

- Initial origin points: points at which the distance moved in 30 seconds is over 300 meters and the car has passed a main road.

- Destination points: points at which the distance moved in 30 seconds is not more than 300 meters, or the car stops for more than 30 minutes, or the average speed over 30 seconds is less than 1.2 meters/second.

- Trip purpose points: identified from the area or location type, such as department stores, hospitals, office buildings, and residential areas (set up in the GIS map).

The data are then cleaned down to these 3 types of points, which are used to analyze activities. The inferred activity depends on the area characteristics at each point and on the parking duration.
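As a rough illustration of how stops could be found in such 30-second records, the sketch below groups consecutive slow records into runs and keeps runs longer than 30 minutes. Combining the 1.2 meters/second speed threshold with the 30-minute stop threshold in this way is our own simplifying assumption, not the rule used in the cited research; the function name and the example speeds are hypothetical.

```python
def find_stops(speeds_mps, interval_s=30, speed_limit_mps=1.2, min_stop_s=30 * 60):
    """Return (start_index, duration_s) for each slow run longer than min_stop_s."""
    stops = []
    run_start = None
    # A trailing fast sentinel closes a slow run that reaches the end of the data.
    for i, speed in enumerate(list(speeds_mps) + [float("inf")]):
        if speed < speed_limit_mps:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            duration_s = (i - run_start) * interval_s
            if duration_s > min_stop_s:
                stops.append((run_start, duration_s))
            run_start = None
    return stops

# One record every 30 seconds gives 2,880 records per day.
records_per_day = 24 * 60 * 60 // 30

# Hypothetical average speeds (m/s): driving, a 35-minute stop, driving, a 5-minute stop.
speeds = [10.0] * 5 + [0.5] * 70 + [12.0] * 3 + [0.2] * 10
stops = find_stops(speeds)
```

Only the first slow run (70 records, 2,100 seconds) exceeds 30 minutes, so it is the only stop reported; the 5-minute pause at the end is ignored.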

2.5 The method to divide cluster in activity forms

Knowing only where a customer went is not enough to be useful. The researchers therefore studied customer activities in more depth using Gaussian mixture models (Santosh, D. 2013), a probabilistic model that represents normally distributed subgroups within an overall population.

Mixture models do not need to know in advance which subgroup each data point belongs to, so the model learns the sub-populations by itself. Because the subgroup assignments are unknown, this is a form of unsupervised learning. Each subgroup has a different mean and variance, which makes it possible to separate the groups. The model can be used to divide the activities that customers perform, since each activity has a different mean and variance for each variable.

2.6 Walking distance

The researchers studied the distance that drivers walk from the parking spot to their destination. The research of (Peter van der Waerden, Harry Timmermans, Marloes de Bruin-Verhoeven, 2017) found that the maximum distance a car user will walk to a shopping place or workplace is about 50 meters. Its experiments also found that the longer the driver stays at the destination, the more likely the driver is to park farther away.

Therefore, when considering the distance between the parking spot and the destination, the researchers give more weight to places within 50 meters than to places farther away.

2.7 k nearest neighbors

The researchers refer to the k-nearest neighbors (kNN) method from a research project called Efficient kNN classification algorithm for big data (Zhenyun Deng, Xiaoshu Zhu, Debo Cheng, Ming Zong, Shichao Zhang, 2016), which classifies big data into several categories in 2 steps, a training process and a testing process, and selects an appropriate value of k.

The study also compared accuracy with other kNN variants, such as LC-kNN and RC-kNN, and found that kNN gives more accurate and effective results and is the most suitable for big data.

3. Methodology

This research uses mathematical models to infer the activities that users perform from 3 main factors: the time at which the activity starts, the time spent on the activity, and the type of place where the activity occurs. These activities are then used to create a trip chain.


3.1 Data preparation

The data were obtained from a car-sharing company in Thailand and cover service usage from 28/04/2017 to 19/12/2018, for a total of 12,272 trips. In this project, the researchers analyze only trips within the Bangkok Metropolitan Region, because the land-use data must be taken from Google Maps and are more accurate within this region. We also restrict the study to trips of no more than 48 hours, because a car used for more than 48 hours is likely to have been driven out of town. This leaves 3,690 trips, as shown in Table 1.

Table 1

Number of trips available at each stage.

| | All available trips | Cleaned trips | Studied trips |
|---|---|---|---|
| Number of trips (%) | 12,272 (100%) | 6,859 (55.9%) | 3,690 (30.0%) |

Data cleaning combines the GPS points recorded at the same parking spot. These points are scattered because of GPS error, and since the GPS error is normally distributed, the center of the scatter is used as the parking location. Trips whose GPS scatter exceeds a standard threshold are discarded.
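A minimal sketch of this cleaning step is shown below. The 50-meter threshold, the use of the root-mean-square distance as the measure of scatter, the function names, and the example coordinates are all assumptions made for illustration; the threshold actually used in the study is not stated here.

```python
import math

EARTH_RADIUS_M = 6_371_000

def centroid(points):
    """Mean latitude and longitude of a list of (lat, lon) pairs in degrees."""
    n = len(points)
    return (sum(lat for lat, _ in points) / n, sum(lon for _, lon in points) / n)

def rms_spread_m(points):
    """Root-mean-square distance (meters) of the points from their centroid.

    Uses a flat-earth approximation, which is adequate over tens of meters.
    """
    lat0, lon0 = centroid(points)
    cos_lat = math.cos(math.radians(lat0))
    total = 0.0
    for lat, lon in points:
        dy = math.radians(lat - lat0) * EARTH_RADIUS_M
        dx = math.radians(lon - lon0) * EARTH_RADIUS_M * cos_lat
        total += dx * dx + dy * dy
    return math.sqrt(total / len(points))

def clean_parking_spot(points, max_spread_m=50.0):
    """Return the centroid, or None if the scatter exceeds the (assumed) threshold."""
    if rms_spread_m(points) > max_spread_m:
        return None
    return centroid(points)

# Hypothetical GPS fixes recorded while parked at one spot.
tight = [(13.7460, 100.5340), (13.7461, 100.5341), (13.7459, 100.5339)]
scattered = [(13.7460, 100.5340), (13.7460, 100.5360)]
spot = clean_parking_spot(tight)
rejected = clean_parking_spot(scattered)
```

The tightly grouped fixes collapse to a single parking location, while the two fixes that lie roughly 200 meters apart are rejected.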

The time at which a user parks the car is taken as the start time of the activity, and the parking duration at that spot is taken as the duration of the activity. Both are inputs to the mathematical model.

For the type of place where the activity occurs, we wrote a program that extracts each stop point and its surrounding area from Google Maps and applies image processing. The program detects the pins around the stop point and their types in order to identify the surrounding places, which are then used in 2 types of analysis.

Pin types are distinguished by color: orange for food & drink, blue for shopping, purple for services, pink for hotel/apartment, grey for civil services/worship, red for health, and green for outdoor. These are the standard colors of Google Maps.

The 2 data preparation methods are as follows:

3.1.1 Location-based analyzing: The researchers use machine learning to classify the type of parking place, using the places around the parking spot as input variables. A training and test set is built by randomly sampling 1,000 stop points and labeling the place type of each by inspecting Google Maps. The data are then split: 80% is used to train the classifier and the remaining 20% to test its accuracy. The chosen classifier is k-nearest neighbors. In testing, k = 14 gives the highest accuracy, 74.32%. We therefore use this classifier to label the location type of all stop points.
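The train/test procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the program used in the study: the two-dimensional features, the two labels, the synthetic data, and the function names are all hypothetical, and the real input would be the 1,000 hand-labeled stop points.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training items nearest to `query` (Euclidean)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test items whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)

def best_k(data, candidates, seed=0):
    """Shuffle, split 80/20, and return the k with the highest test accuracy."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    scores = {k: accuracy(train, test, k) for k in candidates}
    return max(scores, key=scores.get), scores

# Synthetic stand-in data: two hypothetical features per stop point.
rng = random.Random(1)
data = [((rng.gauss(5, 0.5), rng.gauss(1, 0.5)), "restaurant") for _ in range(50)]
data += [((rng.gauss(1, 0.5), rng.gauss(5, 0.5)), "shopping") for _ in range(50)]
k_best, scores = best_k(data, candidates=range(1, 16))
```

Choosing k on the same 20% that is used to report accuracy tends to give an optimistic estimate; a separate validation split or cross-validation would be a more conservative design.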

3.1.2 Weight-point analyzing: The input is calculated from the type of pins, the number of pins, and the distance between each pin and the parking spot. The researchers define a weight that reflects the assumption that a car user is more likely to visit a location near the parking spot than one farther away. Eq. (1) gives this weight as a function of the distance from the parking spot, where $x$ is the distance in meters between the pin and the stop point. We divide $x$ by 66.5 because that is the largest pin distance captured in the image processing, and we raise the ratio to the power 10.5 so that the typical walking distance of 50 meters receives a weight of about 0.95. The weight is a decreasing function of distance that falls off faster as the distance grows.

$$\text{Weight} = 1 - \left(\frac{x}{66.5}\right)^{10.5} \qquad (1)$$
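Eq. (1) can be checked numerically with the short sketch below. The function `pin_weight` implements the equation directly; `weighted_pin_counts`, which sums the weights per pin type, is only our assumption about how the weights might be aggregated into model inputs, and the example pins are hypothetical.

```python
from collections import defaultdict

def pin_weight(distance_m, max_distance_m=66.5, exponent=10.5):
    """Weight of a pin at `distance_m` meters from the stop point, as in Eq. (1)."""
    if not 0 <= distance_m <= max_distance_m:
        raise ValueError("distance must lie between 0 and the maximum pin distance")
    return 1.0 - (distance_m / max_distance_m) ** exponent

def weighted_pin_counts(pins):
    """Sum of weights per pin type (assumed aggregation, for illustration only)."""
    totals = defaultdict(float)
    for pin_type, distance_m in pins:
        totals[pin_type] += pin_weight(distance_m)
    return dict(totals)

# Hypothetical pins around one stop point: (type, distance in meters).
pins = [("food & drink", 10.0), ("food & drink", 50.0), ("shopping", 66.5)]
scores = weighted_pin_counts(pins)
```

A pin at 50 meters receives a weight of approximately 0.95, a pin at the stop point itself receives 1, and a pin at the maximum distance of 66.5 meters receives 0.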

3.2 Activity model

The model chosen for the activity analysis is a Gaussian mixture model (GMM). It divides observations of several variables into groups that each follow a normal distribution. It is also an unsupervised machine learning method, meaning that it does not need labeled training data before it can be used, which avoids the effort of conducting a survey.

The GMM is described as follows. Eq. (2) gives the probability density of a data vector $\vec{x}$ as a weighted sum of $K$ components, where each component is a multivariate normal distribution, Eq. (3), with weight $\phi_k$, mean $\vec{\mu}_k$, and covariance matrix $\Sigma_k$. The weights sum to 1, as shown in Eq. (4).

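$$p(\vec{x}) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\vec{x} \mid \vec{\mu}_k, \Sigma_k) \qquad (2)$$

$$\mathcal{N}(\vec{x} \mid \vec{\mu}_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^{d} \, |\Sigma_k|}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu}_k)^{T} \Sigma_k^{-1} (\vec{x} - \vec{\mu}_k)\right) \qquad (3)$$

$$\sum_{k=1}^{K} \phi_k = 1 \qquad (4)$$

Where:

$\vec{x}$: vector of variables for each data point

$d$: number of variables, i.e. the dimension of $\vec{x}$

$K$: number of components of the model

$\vec{\mu}_k$: mean of the $k^{th}$ component for the multivariate case

$\Sigma_k$: covariance matrix of the $k^{th}$ component for the multivariate case

$\phi_k$: mixture component weight of component $C_k$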

To fit the GMM to the research data, the expectation maximization (EM) algorithm is used. EM for a Gaussian mixture model begins with an initialization step that sets the model parameters to reasonable values given the data. The algorithm then alternates between the expectation (E) step and the maximization (M) step until the parameter estimates converge, i.e. until $|\theta_t - \theta_{t-1}| \le \epsilon$ for every parameter $\theta_t$ at iteration $t$, where $\epsilon$ is a user-defined threshold. In the initialization step, $K$ samples are drawn at random without replacement from the data set $\vec{X} = \{\vec{x}_1, \dots, \vec{x}_N\}$ and assigned to the component mean estimates $\hat{\mu}_1, \dots, \hat{\mu}_K$. The covariance estimates of all components are set to the sample variance, and the component weight estimates are set to the uniform distribution $\hat{\phi}_1 = \dots = \hat{\phi}_K = 1/K$.

The next step is the expectation step, or E step. For each data point $\vec{x}_i \in \vec{X}$, it calculates the expectation of the component assignment $C_k$ given the current model parameters $\phi_k$, $\vec{\mu}_k$, and $\Sigma_k$, as shown in Eq. (5).

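$$\hat{\gamma}_{ik} = \frac{\hat{\phi}_k \, \mathcal{N}(\vec{x}_i \mid \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{K} \hat{\phi}_j \, \mathcal{N}(\vec{x}_i \mid \hat{\mu}_j, \hat{\Sigma}_j)} \qquad (5)$$

where $\hat{\gamma}_{ik}$ is the probability that $\vec{x}_i$ is generated by component $C_k$.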

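The following step is the maximization step, or M step. It maximizes the expectation calculated in the E step with respect to the model parameters, which means updating the values of $\phi_k$, $\vec{\mu}_k$, and $\Sigma_k$ as follows.

$$\hat{\phi}_k = \sum_{i=1}^{N} \frac{\hat{\gamma}_{ik}}{N} \qquad (6)$$

$$\hat{\mu}_k = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ik} \, \vec{x}_i}{\sum_{i=1}^{N} \hat{\gamma}_{ik}} \qquad (7)$$

$$\hat{\Sigma}_k = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ik} \, (\vec{x}_i - \hat{\mu}_k)(\vec{x}_i - \hat{\mu}_k)^{T}}{\sum_{i=1}^{N} \hat{\gamma}_{ik}} \qquad (8)$$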

The E and M steps are repeated until the algorithm converges, which yields a (local) maximum likelihood estimate. Intuitively, the algorithm works because knowing the component assignment $C_k$ for each $\vec{x}_i$ makes solving for $\phi_k$, $\vec{\mu}_k$, and $\Sigma_k$ easy, while knowing $\phi_k$, $\vec{\mu}_k$, and $\Sigma_k$ makes inferring the component assignments easy. The expectation step corresponds to the latter case, and the maximization step corresponds to the former. By alternating between which values are assumed fixed, or known, maximum likelihood estimates of the unknown values can be calculated efficiently.
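To make the E and M steps concrete, the sketch below runs EM on one-dimensional data in plain Python. It is a simplified illustration, not the implementation used in the study: the data are scalar (so each covariance matrix reduces to a variance), the small variance floor is our own safeguard, and the synthetic activity durations and the explicit starting means passed in the example are hypothetical. When `init_means` is omitted, the means are initialized from random samples as described above.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, k, tol=1e-6, max_iter=500, seed=0, init_means=None):
    """Fit a k-component 1-D Gaussian mixture with EM; returns weights, means, variances."""
    n = len(data)
    # Initialization: random samples without replacement unless means are supplied.
    means = list(init_means) if init_means is not None else random.Random(seed).sample(data, k)
    overall_mean = sum(data) / n
    sample_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [sample_var] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: re-estimate weights, means, and variances from the responsibilities.
        new_weights, new_means, new_vars = [], [], []
        for c in range(k):
            nk = sum(r[c] for r in resp)
            mu = sum(r[c] * x for r, x in zip(resp, data)) / nk
            var = sum(r[c] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
            new_weights.append(nk / n)
            new_means.append(mu)
            new_vars.append(max(var, 1e-9))  # floor avoids a degenerate zero variance
        old = weights + means + variances
        new = new_weights + new_means + new_vars
        weights, means, variances = new_weights, new_means, new_vars
        # Stop when no parameter changed by more than the tolerance.
        if max(abs(a - b) for a, b in zip(new, old)) <= tol:
            break
    return weights, means, variances

# Synthetic durations in hours: short activities near 1 h, long activities near 7 h.
rng = random.Random(42)
durations = [rng.gauss(1.0, 0.3) for _ in range(100)] + [rng.gauss(7.0, 1.0) for _ in range(100)]
weights, means, variances = fit_gmm_1d(durations, k=2, init_means=[2.0, 6.0])
```

On this synthetic sample the two fitted means settle near 1 hour and 7 hours with roughly equal weights. Because EM only finds a local optimum, different initial means can lead to different fits, so running it from several starting points is a common precaution.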

Using Bayes' theorem, Eq. (9), and the estimated model parameters, we can also estimate the posterior probability of the component assignment. Knowing that a data point is more likely to come from one component distribution than another provides a way to learn clusters, in which the cluster assignment of a point is its most probable component.

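$$p(C_i \mid \vec{x}) = \frac{p(\vec{x}, C_i)}{p(\vec{x})} = \frac{p(C_i) \, p(\vec{x} \mid C_i)}{\sum_{j=1}^{K} p(C_j) \, p(\vec{x} \mid C_j)} = \frac{\phi_i \, \mathcal{N}(\vec{x} \mid \vec{\mu}_i, \Sigma_i)}{\sum_{j=1}^{K} \phi_j \, \mathcal{N}(\vec{x} \mid \vec{\mu}_j, \Sigma_j)} \qquad (9)$$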

With the parameters of the multivariate model, $p(C_i \mid \vec{x})$, the probability that the data point $\vec{x}$ belongs to component $C_i$, can be calculated. Clustering has many applications in machine learning, from differentiating tissues in medical imaging to customer segmentation in market research.

In this research, two models are used, one for each of the 2 types of input:

3.2.1 Location-based model: A model that takes the output of the Location-based analyzing as input, which makes the model more difficult to create. The location labels may be biased or non-standard, but the type of place is known exactly.

3.2.2 Weight-based model: A model that takes the output of the Weight-point analyzing as input to classify activities. The input variables are numerical, which may reduce bias and make them more standardized, but the type of place may not be clear.

3.3 Trip chain creation

Once we have the probabilities of the activity types at each stop point, we calculate the probability of every possible path that starts from the station and returns to the station, which gives the trip chains. The probabilities are then summed to estimate the number of trips that follow each trip chain. The trip chains are produced in 2 forms: by activity and by place.
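One way to picture this calculation is to enumerate every chain that leaves the station and returns to it, multiplying the transition probabilities along the path. The sketch below does this for a small hypothetical transition matrix; the state names, the probabilities, the limit of two stops, and the figure of 1,000 trips are assumptions for illustration and are not values from this study.

```python
def enumerate_trip_chains(matrix, home, max_stops):
    """Probability of each chain that starts at `home`, makes 1..max_stops stops, and returns."""
    chains = {}

    def extend(path, prob):
        stops_so_far = len(path) - 1
        for following, p in matrix[path[-1]].items():
            if following == home:
                if stops_so_far > 0:
                    chains[tuple(path + [home])] = prob * p
            elif stops_so_far < max_stops:
                extend(path + [following], prob * p)

    extend([home], 1.0)
    return chains

# Hypothetical transition probabilities between the station and two activity types.
P = {
    "station": {"company": 0.5, "restaurant": 0.5},
    "company": {"station": 0.6, "restaurant": 0.4},
    "restaurant": {"station": 0.8, "company": 0.2},
}
chains = enumerate_trip_chains(P, "station", max_stops=2)

# Expected number of trips per chain for a hypothetical total of 1,000 trips.
n_trips = 1000
expected = {chain: prob * n_trips for chain, prob in chains.items()}
ranked = sorted(chains, key=chains.get, reverse=True)
```

In this example the most probable chain is station, restaurant, station with probability 0.5 × 0.8 = 0.40. The four enumerated chains sum to 0.92 rather than 1 because chains with more than two stops are cut off.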

4. Results & Discussion

4.1 Normal trip chain result

Using the location labels from the machine learning step, the original data are converted into trip chains by location, with the results shown in Table 2. The researchers divided locations into 7 categories according to the dominant place type in the area: restaurant, where eating places dominate; shopping zone, where department stores dominate; residence, where housing estates or dormitories dominate; gas station, where petrol stations dominate; university, for schools and universities; company, where office buildings dominate; and uncertain, if nothing is visible or the buildings are too far from the center of the image. For each location we know the mean and standard deviation of the activity start time and of the time users spend there. We also obtain the probability of going from one location to another, as shown in Table 3. This shows user behavior at a glance, but it does not yet reveal the activities that users perform.

Table 2

Distribution of the top 20 most probable location trip chains that start and end at the station.

| Location sequence | Counts (%) | Location sequence | Counts (%) |
|---|---|---|---|
| Restaurant | 523 (16.2%) | Shopping-Restaurant | 37 (1.1%) |
| Company | 461 (14.3%) | Shopping-Company | 34 (1.1%) |
| Shopping | 201 (6.2%) | Residence-Restaurant | 29 (0.9%) |
| Residence | 146 (4.5%) | University | 28 (0.9%) |
| Gas station | 117 (3.6%) | Restaurant-Restaurant-Restaurant | 27 (0.8%) |
| Company-Company | 89 (2.8%) | Residence-Company | 26 (0.8%) |
| Restaurant-Restaurant | 82 (2.5%) | Company-Residence | 25 (0.8%) |
| Company-Restaurant | 74 (2.3%) | Company-Restaurant-Company | 24 (0.7%) |
| Restaurant-Company | 68 (2.1%) | Restaurant-Residence | 23 (0.7%) |
| Company-Shopping | 37 (1.1%) | Restaurant-Shopping | 22 (0.7%) |


Table 3

Estimated transition probabilities between locations.

| From \ To | Location 0 (Station) | Location 1 (Company) | Location 2 (Gas station) | Location 3 (Residence) | Location 4 (Restaurant) | Location 5 (University) | Location 6 (Shopping) |
|---|---|---|---|---|---|---|---|
| Location 0 (Station) | 0.000 | 0.339 | 0.069 | 0.109 | 0.334 | 0.019 | 0.130 |
| Location 1 (Company) | 0.399 | 0.205 | 0.028 | 0.080 | 0.206 | 0.010 | 0.073 |
| Location 2 (Gas station) | 0.462 | 0.185 | 0.042 | 0.062 | 0.178 | 0.007 | 0.066 |
| Location 3 (Residence) | 0.397 | 0.209 | 0.028 | 0.105 | 0.194 | 0.006 | 0.061 |
| Location 4 (Restaurant) | 0.433 | 0.178 | 0.030 | 0.072 | 0.211 | 0.007 | 0.070 |
| Location 5 (University) | 0.512 | 0.174 | 0.025 | 0.083 | 0.165 | 0.000 | 0.041 |
| Location 6 (Shopping) | 0.410 | 0.196 | 0.030 | 0.067 | 0.194 | 0.008 | 0.095 |

4.2 Model result

Analyzing the data with the GMM lets us look beyond locations to the activities themselves, under the assumption that people perform activities within walking distance of the parking spot. We obtain estimates for 4 parameter sets: the initial state probabilities, the transition probabilities, the membership probabilities, and the mean, variance, and covariance of each cluster in the feature space. First, we label each cluster as an activity by examining its characteristics, namely the mean vector and the variances of the variables. Then, we match each parking spot to an activity type based on the estimated membership probabilities. Next, we order the activities that occur in each trip chain to create a transition matrix. Last, we combine the activity types and their order to create an activity-based trip chain.

4.2.1 Location-based model

When the data are fed into the model, the result is a total of 13 clusters, with the mean and standard deviation of each variable given in Table 4. Based on the cluster analysis, the researchers divided the activities into 6 categories: 1. Work, an activity at the location “company” that starts on average in the morning and lasts almost 7 hours; 2. Run some errands, an activity that occurs from the afternoon to the evening at the locations “company” and “residence”, with a duration of about 1 hour; 3. Refuel and have a meal, which occurs at a “gas station”; 4. Sleepover, an activity that occurs in the evening and at night and lasts more than 6 hours; 5. Have a meal, an activity that occurs in the restaurant zone, often in the evening, with two typical durations, about 1 and a half hours and about 6 hours; and 6. Shopping and have a meal, which occurs in the evening at about 4-6 o'clock, with two typical durations, about 1 hour and about 2 and a half hours. The estimated membership probabilities obtained by grouping the clusters into activities are shown in the appendix (Table 12).

Table 4

Description of each cluster from the Location-based model.

| Group no. | Average start time | Average duration time (hr.) | Location type | Standard deviation start time | Standard deviation duration time (hr.) | Probable activity | Define activity number |
|---|---|---|---|---|---|---|---|
| Group 1 | 11:33 | 6:43 | Company | 7:26 | 4:52 | Working | 1 |
| Group 2 | 15:18 | 1:21 | Company | 4:51 | 0:53 | Run some errands | 2 |
| Group 3 | 16:20 | 1:41 | Gas | 5:50 | 1:54 | Refuel, Have a meal | 3 |
| Group 4 | 19:37 | 9:29 | Residence | 5:56 | 5:25 | Sleepover | 4 |
| Group 5 | 16:51 | 0:56 | Residence | 5:2 | 0:30 | Run some errands | 2 |
| Group 6 | 17:06 | 2:47 | Residence | 4:42 | 1:18 | Offside work | 2 |
| Group 7 | 17:21 | 6:30 | Restaurant | 7:24 | 4:22 | Have a meal/Party | 5 |
| Group 8 | 17:23 | 1:22 | Restaurant | 17:60 | 1:37 | Have a meal | 5 |
| Group 9 | 16:43 | 1:02 | University | 17:27 | 1:14 | Have a meal | 5 |
| Group 10 | 19:03 | 6:22 | University | 19:46 | 7:38 | Sleepover | 4 |
| Group 11 | 17:59 | 2:33 | Shopping district | 3:41 | 1:13 | Shopping + Have a meal | 6 |
| Group 12 | 16:09 | 1:04 | Shopping district | 4:34 | 0:36 | Shopping/Have a meal | 6 |
| Group 13 | 16:31 | 3:07 | Uncertain | 11:41 | 4:02 | - | - |

We then created a transition matrix, shown in Table 5, that gives the probability of moving from one activity to another. Finally, we created trip chains to see how many users travel from one location to another and which activities they perform there. The top 20 activity trip chains are shown in Table 6.

Table 5

Estimated transition probabilities of activities from the Location-based model.

| From \ To | Activity 0 (Car-sharing station) | Activity 1 (Working) | Activity 2 (Run some errands) | Activity 3 (Refuel/Have a meal) | Activity 4 (Sleepover) | Activity 5 (Have a meal/Party) | Activity 6 (Shopping/Have a meal) |
|---|---|---|---|---|---|---|---|
| Activity 0 (Car-sharing station) | 0.00 | 0.05 | 0.38 | 0.07 | 0.01 | 0.35 | 0.13 |
| Activity 1 (Working) | 0.40 | 0.08 | 0.21 | 0.03 | 0.02 | 0.20 | 0.06 |
| Activity 2 (Run some errands) | 0.40 | 0.04 | 0.23 | 0.03 | 0.01 | 0.21 | 0.07 |
| Activity 3 (Refuel/Have a meal) | 0.46 | 0.06 | 0.48 | 0.04 | 0.02 | 0.18 | 0.07 |
| Activity 4 (Sleepover) | 0.33 | 0.07 | 0.25 | 0.02 | 0.01 | 0.23 | 0.09 |
| Activity 5 (Have a meal/Party) | 0.44 | 0.05 | 0.18 | 0.03 | 0.02 | 0.21 | 0.07 |
| Activity 6 (Shopping/Have a meal) | 0.41 | 0.05 | 0.20 | 0.03 | 0.02 | 0.20 | 0.09 |

Table 6

Distribution of the top 20 most probable activity trip chains from the Location-based model.

| Activity sequence | Counts (%) | Activity sequence | Counts (%) |
|---|---|---|---|
| 0-5-0 | 551 (17.1%) | 0-6-2-0 | 38 (1.2%) |
| 0-2-0 | 537 (16.6%) | 0-6-5-0 | 37 (1.1%) |
| 0-6-0 | 201 (6.2%) | 0-5-5-5-0 | 29 (0.9%) |
| 0-3-0 | 117 (3.6%) | 0-5-2-5-0 | 25 (0.8%) |
| 0-2-2-0 | 104 (3.2%) | 0-5-6-0 | 23 (0.7%) |
| 0-2-5-0 | 103 (3.2%) | 0-2-5-2-0 | 23 (0.7%) |
| 0-5-5-0 | 88 (2.7%) | 0-5-1-0 | 21 (0.7%) |
| 0-5-2-0 | 75 (2.3%) | 0-2-5-5-0 | 21 (0.7%) |
| 0-1-0 | 60 (1.9%) | 0-2-2-2-0 | 20 (0.6%) |
| 0-2-6-0 | 40 (1.2%) | 0-2-1-0 | 19 (0.6%) |

4.2.2 Weight-based model

Fitting this model to the data yields the means and standard deviations of 11 clusters. The means, standard deviations, and probable activities are shown in Table 7, Table 8, and Table 9, respectively.

Based on the cluster analysis, we divided the activities into 5 categories, which are similar to the activities in the location-based model described before, except that refueling does not appear in this model. The activities are as follows:

1. Shopping and have a meal: in this model, the activity starts around noon, in the afternoon, or in the evening, and lasts from 1.5 to 5 hours. It occurs in areas with many restaurants, shopping centers, and civil service offices.

2. Run some errands: an activity that occurs in the afternoon, lasts about 1-2 hours, and takes place in a civil service area.

3. Have a meal: an activity that occurs in areas with many restaurants. It takes 1-3 hours, in the afternoon.

4. Working: an activity whose average start time is around noon and whose average duration is about 12 hours.

5. Sleepover: an activity whose average start time is around 1 am and which lasts about 7 hours.

We also obtain the estimated membership probabilities, shown in the appendix (Table 13).
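To illustrate how a fitted mixture assigns a single stop to a cluster, the following minimal sketch computes posterior group probabilities from the start time and duration only, using the means and standard deviations of Groups 2, 6, and 8 from Tables 7 and 8 (converted to minutes). Several simplifications are assumptions made here and are not taken from the paper: the covariance is treated as diagonal, the mixing weights are taken as equal because they are not reported, the pin-count variables are ignored, and start time is treated as a linear quantity without wrap-around at midnight. The names `GROUPS`, `log_density`, and `membership` are hypothetical.

```python
import math

# Means and SDs in minutes for (start time, duration), transcribed from
# Tables 7 and 8. Only three of the 11 groups are used in this sketch.
GROUPS = {
    "Group 2 (Run some errands)": {"mean": (922.0, 74.0), "sd": (308.0, 45.0)},
    "Group 6 (Working)": {"mean": (723.0, 722.0), "sd": (511.0, 427.0)},
    "Group 8 (Sleepover)": {"mean": (77.0, 433.0), "sd": (81.0, 288.0)},
}


def log_density(x, mean, sd):
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for xi, m, s in zip(x, mean, sd)
    )


def membership(x, groups=GROUPS):
    """Posterior probability of each group for one stop (equal weights assumed)."""
    logs = {g: log_density(x, p["mean"], p["sd"]) for g, p in groups.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    unnorm = {g: math.exp(v - top) for g, v in logs.items()}
    total = sum(unnorm.values())
    return {g: u / total for g, u in unnorm.items()}


night_stop = membership((60.0, 420.0))      # starts at 01:00, lasts 7 hours
afternoon_stop = membership((900.0, 60.0))  # starts at 15:00, lasts 1 hour
print(max(night_stop, key=night_stop.get))
print(max(afternoon_stop, key=afternoon_stop.get))
```

Under these assumptions the night-time stop is assigned to the Sleepover group and the short afternoon stop to the Run some errands group. A full model would use all nine variables and the fitted mixing weights.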

Table 7

Average of each variable in each cluster from the Weight-based model.

| Group no. | Average start time (hh:mm) | Average duration time (h:mm) | Average Restaurant | Average Shopping | Average Gas station | Average Outdoor pin | Average Residence | Average Health | Average Civil Services |
|---|---|---|---|---|---|---|---|---|---|
| Group 1 | 15:54 | 2:00 | 3.84 | 7.39 | 2.59 | 0.00 | 0.17 | 0.18 | 3.54 |
| Group 2 | 15:22 | 1:14 | 0.88 | 0.42 | 0.18 | 0.00 | 0.34 | 0.00 | 1.44 |
| Group 3 | 16:10 | 2:40 | 4.81 | 2.08 | 2.34 | 0.00 | 0.02 | 0.05 | 0.89 |
| Group 4 | 15:28 | 6:40 | 1.29 | 0.21 | 0.87 | 0.11 | 0.13 | 0.05 | 0.43 |
| Group 5 | 12:37 | 4:48 | 4.06 | 2.58 | 1.76 | 0.03 | 0.65 | 0.32 | 4.19 |
| Group 6 | 12:03 | 12:02 | 2.77 | 1.20 | 0.41 | 0.08 | 0.34 | 0.15 | 2.39 |
| Group 7 | 15:19 | 1:55 | 0.33 | 0.74 | 0.15 | 0.00 | 0.00 | 0.03 | 0.21 |
| Group 8 | 1:17 | 7:13 | 0.08 | 0.06 | 0.10 | 0.02 | 0.11 | 0.02 | 0.45 |
| Group 9 | 14:19 | 1:00 | 4.08 | 1.75 | 0.94 | 0.02 | 0.07 | 0.11 | 1.06 |
| Group 10 | 14:56 | 1:30 | 4.97 | 3.75 | 0.39 | 0.00 | 0.00 | 0.22 | 2.98 |
| Group 11 | 16:14 | 2:18 | 0.89 | 1.85 | 0.54 | 0.04 | 0.25 | 0.21 | 2.55 |

Table 8

Standard deviation of each variable in each cluster from the Weight-based model.

| Group no. | SD start time (h:mm) | SD duration time (h:mm) | SD Restaurant | SD Shopping | SD Gas station pin | SD Outdoor | SD Residence | SD Health pin | SD Civil Services pin |
|---|---|---|---|---|---|---|---|---|---|
| Group 1 | 5:44 | 1:15 | 2.84 | 4.50 | 1.64 | 0.00 | 0.40 | 0.44 | 2.27 |
| Group 2 | 5:08 | 0:45 | 1.04 | 0.59 | 0.42 | 0.00 | 0.54 | 0.00 | 1.12 |
| Group 3 | 4:30 | 1:40 | 3.03 | 1.32 | 1.87 | 0.00 | 0.15 | 0.21 | 0.91 |
| Group 4 | 5:45 | 5:35 | 1.68 | 0.43 | 1.44 | 0.33 | 0.35 | 0.23 | 0.61 |
| Group 5 | 7:03 | 3:23 | 3.33 | 1.99 | 1.79 | 0.18 | 0.90 | 0.64 | 2.82 |
| Group 6 | 8:31 | 7:07 | 2.40 | 1.20 | 0.72 | 0.37 | 0.55 | 0.46 | 1.37 |
| Group 7 | 4:47 | 1:27 | 0.62 | 1.84 | 0.38 | 0.00 | 0.00 | 0.16 | 0.41 |
| Group 8 | 1:21 | 4:48 | 0.27 | 0.23 | 0.35 | 0.16 | 0.32 | 0.13 | 0.66 |
| Group 9 | 6:14 | 0:34 | 2.81 | 1.34 | 1.06 | 0.13 | 0.25 | 0.40 | 1.04 |
| Group 10 | 5:53 | 0:54 | 4.23 | 2.60 | 0.61 | 0.00 | 0.00 | 0.45 | 2.21 |
| Group 11 | 3:54 | 1:30 | 0.92 | 1.36 | 0.79 | 0.19 | 0.53 | 0.52 | 2.00 |

Table 9

Description of each cluster from the Weight-based model.

| Group no. | Probable activity | Defined activity number |
|---|---|---|
| Group 1 | Shopping + Have a meal | 1 |
| Group 2 | Run some errands | 2 |
| Group 3 | Have a meal | 3 |
| Group 4 | - | - |
| Group 6 | Working | 4 |
| Group 7 | - | - |
| Group 8 | Sleepover | 5 |
| Group 9 | Have a meal | 3 |
| Group 11 | Run some errands | 2 |

Finally, we obtain the transition matrix shown in Table 10, which gives the probability of moving from one activity to another. As with the location-based model, we also built trip chains to see how users move from one location to another and which activities take place at those locations. The top 20 activity trip chains are shown in Table 11.

Table 10

Estimated transition probabilities of activities from the Weight-based model.

| From \ To | Activity 0 (Car-sharing station) | Activity 1 (Shopping/ Have a meal) | Activity 2 (Run some errands) | Activity 3 (Have a meal) | Activity 4 (Working) | Activity 5 (Sleepover) |
|---|---|---|---|---|---|---|
| Activity 0 (Car-sharing station) | 0.00 | 0.22 | 0.41 | 0.30 | 0.03 | 0.04 |
| Activity 1 (Shopping/ Have a meal) | 0.45 | 0.16 | 0.19 | 0.13 | 0.03 | 0.04 |
| Activity 2 (Run some errands) | 0.47 | 0.12 | 0.22 | 0.15 | 0.02 | 0.02 |
| Activity 3 (Have a meal) | 0.50 | 0.10 | 0.19 | 0.14 | 0.02 | 0.04 |
| Activity 4 (Working) | 0.45 | 0.17 | 0.20 | 0.14 | 0.02 | 0.02 |
| Activity 5 (Sleepover) | 0.45 | 0.14 | 0.15 | 0.17 | 0.01 | 0.08 |

Table 11

Distribution of the top 20 most probable activity trip chains from the Weight-based model.

| Activity sequence | Counts (%) | Activity sequence | Counts (%) |
|---|---|---|---|
| 0-2-0 | 568 (20.4%) | 0-1-1-0 | 36 (1.3%) |
| 0-3-0 | 457 (16.4%) | 0-4-0 | 36 (1.3%) |
| 0-1-0 | 297 (10.7%) | 0-1-3-0 | 35 (1.3%) |
| 0-2-2-0 | 125 (4.5%) | 0-3-1-0 | 31 (1.1%) |
| 0-2-3-0 | 85 (3.1%) | 0-2-2-2-0 | 29 (1.0%) |
| 0-3-3-0 | 60 (2.2%) | 0-3-2-2-0 | 17 (0.6%) |
| 0-3-2-0 | 58 (2.1%) | 0-2-1-2-0 | 16 (0.6%) |
| 0-1-2-0 | 48 (1.7%) | 0-2-2-1-0 | 15 (0.5%) |
| 0-5-0 | 47 (1.7%) | 0-2-3-2-0 | 15 (0.5%) |
| 0-2-1-0 | 46 (1.7%) | 0-3-2-3-0 | 12 (0.4%) |
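One plausible way to derive a trip-chain distribution and a transition matrix of the kind reported in Tables 10 and 11 is sketched below. It assumes that each rental is already encoded as a sequence of activity numbers that starts and ends at the car-sharing station (activity 0), and that transition probabilities are estimated as row-normalized counts of consecutive activity pairs. Both the toy `chains` data and the function names are hypothetical, and the paper does not state that its figures were computed exactly this way.

```python
from collections import Counter, defaultdict

# Hypothetical trip chains: one activity sequence per rental, starting and
# ending at the car-sharing station (activity 0). Not data from the study.
chains = [
    [0, 2, 0],
    [0, 2, 0],
    [0, 3, 0],
    [0, 2, 3, 0],
    [0, 1, 0],
]


def chain_distribution(chains):
    """Count identical chains and return (label, count, percent) rows."""
    counts = Counter("-".join(map(str, c)) for c in chains)
    total = sum(counts.values())
    return [(k, n, 100.0 * n / total) for k, n in counts.most_common()]


def transition_matrix(chains):
    """Estimate P(next activity | current activity) from consecutive pairs."""
    pair_counts = defaultdict(Counter)
    for chain in chains:
        for current, nxt in zip(chain, chain[1:]):
            pair_counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in pair_counts.items()
    }


top = chain_distribution(chains)
P = transition_matrix(chains)
for label, n, pct in top:
    print(f"{label}: {n} ({pct:.1f}%)")
```

Because every chain leaves the station immediately, the estimated probability of moving from activity 0 to activity 0 is zero in this sketch, which matches the 0.00 entry in the first row of Table 10.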

4.3 Model validation

Although GMM is an unsupervised machine learning method that does not require a survey to create labeled data, we conducted a survey in this research to test the accuracy of the method. Although the model alone cannot tell the actual activity, the survey records the activities that users really performed, and these were collected to validate the model.

For a period of one month, the car-sharing company under study sent a questionnaire to each user after the user completed the service. More than 100 questionnaires were sent; some users responded, and their answers covered more than 100 parking spots. We put these spots into the models to infer the activities and compared them with the activities that users reported in the survey. The weight-based model had a predictive accuracy of 65.19%, and the location-based model had a predictive accuracy of 60%. For applications and further research, we therefore suggest using the weight-based model.
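The accuracy figures above correspond to the share of surveyed parking spots for which the model's activity matches the activity reported by the user. A minimal sketch of that comparison follows. The label lists are hypothetical toy data, not the survey responses, and the function name `accuracy` is chosen here for illustration.

```python
def accuracy(predicted, surveyed):
    """Share of parking spots whose predicted activity matches the survey answer."""
    if len(predicted) != len(surveyed) or not predicted:
        raise ValueError("need two non-empty label lists of equal length")
    hits = sum(p == s for p, s in zip(predicted, surveyed))
    return hits / len(predicted)


# Hypothetical labels for five parking spots (not data from the study).
surveyed = ["Have a meal", "Working", "Run some errands", "Sleepover", "Have a meal"]
predicted = ["Have a meal", "Working", "Have a meal", "Sleepover", "Have a meal"]
score = accuracy(predicted, surveyed)
print(f"{score:.2%}")
```

Running the same comparison once per model, on the same set of surveyed spots, gives directly comparable accuracy values.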

4.4 Application

We clustered the results to gain deeper insight into car-sharing usage. The results can be viewed as the usage characteristics of each station, each time period, and so on. They also help when considering a new station: if the new station is set up in a setting similar to an existing one, we can estimate how many users it will have and what kinds of activities will occur. This makes it possible to estimate the income of that station more accurately for a project feasibility study. Moreover, the results can be used in market research for customer segmentation, and for designing promotions that increase car-sharing usage among existing customers and attract new users.

5. Conclusion

Data analysis is indispensable at present. New knowledge and technology in data analysis make data more valuable, and data can be analyzed and used in more ways to benefit the company.

Our model can be used to analyze the GPS data of a car-sharing company in Thailand, which have been stored without being used as much as they could be. Our analysis gives insight into consumer usage: it makes us aware of users' behavior and what they use the car for. There are some limitations, however. We assume that users perform activities at places within some distance of the parking spot, and the type of activity is inferred from variables that resemble that activity, without being able to determine what the activity really is. Although the best model we created predicted activities with an accuracy of only 65.19%, it can be used to group activities with similar time characteristics together.

In the future, better methods for clustering or for identifying activities from the characteristics of the variables would improve our model by making the results more accurate.

With an increasing amount of data and the improving accuracy of Google Maps, we believe that this method will become more robust and precise.

Appendix

Table 12

Estimated membership probabilities of the Location-based model.

Group 1 Group 2 Group 3 Group 4 Group 5 Group 6 Group 7 Group 8 Group 9 Group 10 Group 11 Group 12

Activity 1 (Working) 1.000

Activity 2 (Run some errands) 0.715 0.183 0.102

Activity 3 (Refuel/ Have a meal) 1.000

Activity 4 (Sleepover) 0.880 0.120

Activity 5 (Have a meal/Party) 0.152 0.811 0.037

Activity 6 (Shopping/ Have a meal) 0.362 0.638

Table 13

Estimated membership probabilities of the Weight-based model.

Group 1 Group 2 Group 3 Group 4 Group 5 Group 6 Group 7 Group 8 Group 9 Group 10 Group 11

Activity 1 (Shopping/ Have a meal) 0.026 0.211 0.763

Activity 2 (Run some errands) 0.316 0.684

Activity 3 (Have a meal) 0.087 0.913

Activity 4 (Working) 1.000

Activity 5 (Sleepover) 1.000

References

Francesco Ferrero. (2018). Car-sharing services: An annotated review, Sustainable Cities and Society, 37, 501–518.

Firnkorn, J., & Müller, M. (2011). What will be the environmental effects of new free-floating car-sharing systems? The case of car2go in Ulm, Ecological Economics, 70, 1519–1528.

Nourinejad, M., & Roorda, M. (2015). Carsharing operations policies: A comparison between one-way and two-way systems, Transportation, 42, 497–518.

Gain Han, Keemin Sohn. (2015). Activity imputation for trip-chains elicited from smart-card data using a continuous hidden Markov model, Transportation Research Part B, 83, 121–135.

Cory M. Krause, Lei Zhang. (2018). Short-term travel behavior prediction with GPS, land use, and point of interest data, Transportation Research Part B: Methodological, 000, 1–13.

McGuckin and Murakami. (1995). Examining Trip-Chaining Behavior, Defining and understanding trip chaining behaviour, Travel Behavior Analyst, 2.

Srinivasan, S. (1998). Linking land use, transportation and travel behaviour: understanding trip chaining in terms of land use and accessibility patterns. Cambridge, Department of Urban Studies and Planning, Massachusetts Institute of Technology.

D’Este, G. (2016). A technique for incorporating the effect of changing patterns of travel behaviour into the traditional transport planning paradigm, Journal of the Eastern Asia Society for Transportation Studies, 2, 1099–1111.

Santosh, D. (2013). Tracking Multiple Moving Objects Using Gaussian Mixture Model, International Journal of Soft Computing and Engineering, 3, 114-119.

Peter van der Waerden, Harry Timmermans, Marloes de Bruin-Verhoeven. (2017). Car drivers’ characteristics and the maximum walking distance between parking facility and final destination, The Journal of Transport and Land Use, 10, 1-11.

Zhenyun Deng, Xiaoshu Zhu, Debo Cheng, Ming Zong, Shichao Zhang. (2016). Efficient kNN classification algorithm for big data, Neurocomputing, 195, 143-148.

Sanjay Kumar Singh, Abdul-Nasser El-Kassar. (2019). Role of big data analytics in developing sustainable capabilities, Journal of Cleaner Production, 213, 1264-1273.

Guido Perboli, Francesco Ferrero, Stefano Musso, Andrea Vesco. (2018). Business models and tariff simulation in car-sharing services, Transportation Research Part A, 115, 32-48.