Applying machine learning on the data of a control
tower in a retail distribution landscape
Thomas Kolner
30 August 2019
Master thesis
Abstract
Retail distribution is the activity of getting goods into stores where they are sold to the public. A new concept in retail distribution in the Netherlands is the transport control tower. In this context, a transport control tower is defined as an integrated platform where transportation companies and their stakeholders share information or data and connect different services. This thesis aims to build a predictive model of the on-time arrival rate of trucks at the stores and to help explain the variance in on-time arrivals of trucks by using the data from a transport control tower. To that end, this thesis asks: How can a control tower provide insights and be valuable within the chain distribution of a retailer using data on planned and actual truck arrivals? And how can a control tower be used to explain the variance in the on-time arrival rate of trucks?
Based on a review of the literature on integration platforms in transportation, it is argued, at a conceptual level, that there is a huge potential for the use of a control tower in the field of retail distribution. To validate the use of the control tower, this thesis conducted a case study in collaboration with Albert Heijn in which machine learning is applied to the control tower data. The case study shows a successful application of machine learning on the data of a control tower in a retail distribution landscape. It describes a method to extend the control tower data with open data on weather and traffic, and to apply machine learning to the extended control tower data. The results show that the Random Forest model is most suited for the detection of on-time arrivals. The Random Forest classifier achieves an F1 score of 0.86. Analysis of the outcomes showed that the on-time arrival rate is driven by several variables. The most important variables in this case study are ranked by using the feature importance from the proposed Random Forest model. Human factors could also influence the time of arrival, and it is concluded that such factors should be considered in future research.
Title: Applying machine learning on the data of a control tower in a retail distribution landscape
Authors: Thomas Kolner, [email protected], s1505432 Supervisors: prof. dr. J. van Hillegersberg, dr. N. Sikkel
End date: August 30, 2019
Faculty of Electrical Engineering, Mathematics and Computer Science University of Twente
Drienerlolaan 5, 7522 NB Enschede
Preface
This thesis could not have been completed without the contribution and help of multiple persons.
First of all, I would like to thank everyone from the Transport department at Albert Heijn for the interest they have shown in my graduation project and the input they gave me. Next, I want to thank Noortje van Genugten for the opportunity to write my thesis at Albert Heijn, which gave me the possibility to get to know the company from within, and for her advice and comments on my thesis. In particular, I would like to thank Rene Krukkert for all his input and his critical review of my writing.
I also thank my supervisors from the university, Prof. dr. Jos van Hillegersberg and dr. N. Sikkel for guiding me through the process of writing a thesis. Your advice and comments on my research were most useful and helped me to create the result you are reading now.
Lastly, I would like to thank my family for their support over the past 11 months.
Thomas Kolner
Contents
Introduction
1.1 Company introduction
1.2 Problem introduction
1.2.1 Problem context
1.2.2 Problem identification
1.3 Research introduction
1.3.1 Research scope
1.3.2 Research objective
1.3.3 Research questions
1.3.4 Research method
1.3.5 Report structure
Background
2.1 Distribution network
2.1.1 Supply chain
2.1.2 Distribution network
2.1.3 Retail distribution
2.2 Machine learning
2.2.1 Machine Learning classifiers
2.2.2 Evaluation classifiers
2.2.3 Cross-validation
2.3 Related work
2.3.1 Academic work on control towers
2.3.2 Whitepapers on control towers
Data understanding
3.1 Data collection
3.2 Data description
3.3 Data exploration
3.3.1 Variables in the dataset
3.3.2 Transformed variables
3.3.3 External variables
4.2 Data cleansing
4.3 Data transformation
4.4 Data integration
4.5 Data balancing
4.6 Data formatting
4.7 Variables
Modelling
5.1 Selection machine learning techniques
5.2 Experimental design
5.2.1 Label
5.2.2 Datasets
5.2.3 Featureset
5.3 Training and testing
5.4 Feature importance
5.5 Tools
Results analysis
6.1 Performance overview
6.2 Performance per classifier
6.2.1 Random Forest
6.2.2 Logistic Regression
6.2.3 K-nearest neighbour
6.2.4 Conclusion
6.3 Feature importance
6.3.1 Important ranking
6.3.2 Recursive feature elimination
6.3.3 Conclusion
Usability of the Model
7.1 Technical Usability
7.2 Business usability
7.3 Conclusion
Conclusion
8.1 Answers to the research questions
8.2 Variance in arrival time
Discussion
9.1 Results
9.2 Contribution
9.3 Challenges
9.4 Limitations
9.6 Future research
Appendix A - Data integration
Introduction
This report is written for the completion of the Master Business & IT within the Data Science & Business track at the University of Twente. To do so, research is conducted at Albert Heijn, located in Zaandam. The research is done at the Transport department of Albert Heijn. This research focuses on the complex retail distribution network of Albert Heijn in the southwest of the Netherlands. Retail distribution is the activity of getting goods into stores where they are sold to the public.
This chapter gives an introduction to the company, the problem statement and the research questions.
1.1 Company introduction
This research is conducted at Albert Heijn (AH). Albert Heijn B.V. is a supermarket chain in the Dutch retail market. It was founded by Mr. Albert Heijn, who opened his first store in 1887 in Oostzaan, the Netherlands. From there it expanded through the first half of the 20th century and became the largest supermarket chain in the Netherlands. In 1973, parent company Ahold was established. In 2016, Ahold merged with the Belgian food retailer Delhaize. AH is therefore currently a subsidiary of Ahold Delhaize N.V. Nowadays Albert Heijn B.V. operates more than 880 stores, of which approximately 840 are located in the Netherlands. The other 42 stores are located in Belgium. One third of the stores in the Netherlands are franchise stores. At this moment, approximately 80,000 people work for the AH brand in the Netherlands. Most of them work at AH itself, and 30,000 of them work for franchisees who operate an AH store. Several formats exist within the AH stores. The most common format is the Albert Heijn district store. The other formats are Albert Heijn XL, Albert Heijn To Go and Albert Heijn Online. The online service delivers groceries at home, or customers can pick up their groceries at one of the 54 pick-up points. Together, the stores had a market share of 34.7 percent at the end of 2018 [1].
The assortment of AH consists of more than 30,000 product items from numerous brands. Four of these are own brands, called AH Huismerk, AH puur & eerlijk, AH Excellent and AH BASIC.
towards the stores. In total, AH Transport operates between 1,000 and 1,200 trucks every day.
1.2 Problem introduction
This section introduces the problem. First, it describes the context. Next, it describes the problem identification.
1.2.1 Problem context
The slow-moving items are products with low sales volumes and are distributed over the four central DCs (LDC, NOTE, SFC and SWH). The fast-moving items are products with high sales volumes and are located at the four regional DCs (RDCs). Downstream, all stores are supplied by trucks coming from the RDCs. The bundling of products from the different DCs thus happens not at special consolidation points but at the RDCs. The products allocated to a central DC therefore require a shipment to the RDCs, which AH calls a ‘transit’ ride. Each RDC is dedicated to a subset of stores in its region of the Netherlands: DCZ supplies the stores in the North-West, DCO the stores in the North-East, DCP the stores in the South-West, and DCT the stores in the South-East.
Albert Heijn Transport is responsible for the entire transport planning. This consists of inbound transport and outbound transport. Three types of rides can be distinguished: rides from suppliers to one of the DCs (inbound), transit rides between DCs (inbound) and deliveries to the stores (outbound).
1.2.2 Problem identification
The supply chain and transport domain is a rapidly changing environment [4]. Albert Heijn Transport, its external partners, and several experts identified three main challenges that retail distribution is facing at this moment. These challenges are:
• Increased supply chain complexity
• Increasing traffic congestion
• High dependency on driver availability
The increased supply chain complexity is a challenge AH identified. A supply chain is a complex network of business entities involved in the upstream and downstream flows of products and/or services, along with the related finances and information [5]. Complexity in a supply chain grows as customer requirements, the competitive environment and industry standards change, and as the companies in the supply chain form strategic alliances, engage in mergers and acquisitions, outsource functions to third parties, adopt new technologies, launch new products or services, and extend their operations to new geographies, time zones and markets [5]. In other words, the growth of supply chain complexity accelerates with trends such as globalization, sustainability, customization, outsourcing, innovation, and flexibility.
any reliable information about the actual arrival times of the vehicles. In these circumstances, it becomes difficult to satisfy the time windows during which the customers must be visited. This increases supply chain and logistics cost.
The European road transport firms are racing towards a driver shortage crisis [7]. According to the UWV, the Dutch employment agency, the shortage of drivers in the Netherlands has been escalating since April 2016. More and more transport companies have trouble finding potential truckers, which makes carrying out everyday activities difficult. There are a few reasons for the lack of truckers in the Netherlands [8]. Economic growth, as in other EU countries, results in growing demand for transport. Currently the carriers are not able to meet market demand. At the same time, many truckers leave the industry or retire, while there is little interest in the profession among young Dutch people.
The supply chain at Albert Heijn is a collaboration between different departments. These departments are Replenishment, Transport and Logistics. Replenishment is responsible for the forecast. Transport is responsible for the transportation of all the goods. Logistics is the department that is responsible for all the distribution centers. For Albert Heijn Transport to be competitive in the future, it needs to come up with smart solutions. One of the initiatives that has been started is the Simacan control tower. In cooperation with Simacan and all the external transport partners of Albert Heijn, the control tower was launched in December 2017. Simacan designed the control tower to monitor and control all the transportation activities. The Simacan control tower is a neutral, supplier-independent data exchange platform. It gives the spatial data sets from all sources, sensors, and logistics systems a place where they can be interpreted and connected to all other data. This creates an ecosystem of information that can lead to insight into the physical world. In this research, a control tower is defined as an integrated platform where transportation companies and their stakeholders share information or data and connect different services. The main advantage of the control tower is the shared real-time information. The Transport department can keep real-time track of all trucks and deliveries. Since every delivery is followed by the control tower, the Transport department is able to gather a lot of data about its performance from this real-time data. One of the problems that AH Transport identified is on-time delivery. Since more reliable data on all deliveries has been available since the introduction of the control tower, Albert Heijn can identify the number of on-time deliveries with higher accuracy.
The next step is to gain insights and optimize the entire transport domain based on the real-time data collection, and to be able to deal with the three main challenges Albert Heijn identified. This next step defines the goal of this research. The problem Albert Heijn identified is how to use and analyse the data from its control tower in order to deal with the challenges in the transport domain.
Traffic congestion in the Netherlands is increasing. The trucks that operate from DC Pijnacker and DC Zaandam face the most problems with traffic congestion. Due to the traffic congestion, it becomes more difficult to plan the routes such that the time windows in which the stores must be visited are met. This increases supply chain and logistics cost.
Especially during holiday periods, there is a shortage of truck drivers. This results in long workdays for the available drivers. During the summer of 2018, the driver shortage eventually resulted in cancelled deliveries.
1.3 Research introduction
This section introduces the research. This research consists of a case study in collaboration with Albert Heijn. With a case study, this research can investigate a contemporary phenomenon within its real-life context [9]. First, the scope is determined; second, the objective is formulated; third, the research questions are stated; and finally, the research method is given.
1.3.1 Research scope
This thesis focuses on the outbound transport at the regional distribution center Pijnacker (DCP). The performance of Albert Heijn Transport in the DCP region is below average in comparison with the other regional distribution centers. In particular, the service level of on-time deliveries for stores supplied by DC Pijnacker is significantly lower than the average service level. The below-average service level can be partly explained by the previously mentioned main challenges. In the DCP region there is relatively more traffic congestion, and the external transport partners face problems with driver availability. DCP supplies 139 stores and handles the regular product chain and the fresh chain for those stores. One of the main advantages of DCP is the availability of data. Since Albert Heijn started with the control tower, all the data on the deliveries of the regular chain and the fresh chain from DCP have been available in the control tower.
1.3.2 Research objective
The objective of this research is to determine to what extent the data collected from a transport control tower can be valuable for a retail company and, furthermore, to what extent machine learning is applicable in the field of retail distribution. To apply machine learning in retail distribution, the objective is to build a predictive model to predict the on-time arrival rate and to help explain the variance in on-time arrivals of trucks.
1.3.3 Research questions
This research uses the following main research question and sub-questions:
The main question in this research is developed with Albert Heijn. The answer to the main research question provides insights for Albert Heijn based on the available data from the control tower. Furthermore, it provides a model that is able to explain the on-time delivery rate from the distribution center located in Pijnacker.
The practical contribution of this thesis is to propose a model to explain the variance in arrival time of trucks in the area of retail distribution. The major contribution of this thesis is presenting the application of machine learning models on a distribution control tower in a retail distribution landscape.
The following sub-questions are formulated to help answer the main research question:
S.Q.1: What can we learn from literature about collaboration integration platforms in the domain of logistics?
S.Q.2: Which data is available in the Simacan control tower for Albert Heijn?
S.Q.3: What features are relevant for explaining the accuracy of arrival times of trucks?
S.Q.4: What would be a good machine learning classifier?
S.Q.5: How do different machine learning techniques perform in explaining on-time arrival rate of trucks?
S.Q.6: What is the usability of the machine learning model and the outcomes of the model?
1.3.4 Research method
A research method is required to answer the research questions in a structured manner. The research method used in this study is based on the research of Shmueli and Koppius [10]. This method provides several steps in building an empirical model (Predictive or Explanatory).
This thesis is organized according to the research method shown in Figure 1.1. The research method and the report structure are described below. In this research, steps 3 and 4 are exchanged. This is done to first explore the data and understand its context and values, so that in the data preparation step the different data sources can be prepared and combined into one dataset for the modelling phase.
1. Goal Definition
Building a predictive model requires careful specification of what needs to be predicted, as this impacts the type of models and methods used later on. One common goal in predictive modeling is to accurately predict an outcome value for a new set of observations. This goal is known in predictive analytics as prediction (for a numerical outcome) or classification (for a categorical outcome).
The goal of this research is to build a model to predict and explain the on-time arrival of a truck. The model will be used to classify if a truck is on-time or not. Based on this predictive model and the outcomes of it, the impact of several variables can be ranked in order to explain which variables impact the on-time arrival of a truck. So, the goal is to build a predictive classifier.
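As a minimal sketch of the classification target described above, the snippet below derives a binary on-time label from a planned and an actual arrival time. The 15-minute tolerance, the symmetric window and the function name are assumptions made for illustration only; this section does not state how Albert Heijn defines an on-time arrival.

```python
from datetime import datetime, timedelta

# Hypothetical tolerance: the actual on-time window is not given in this section.
TOLERANCE = timedelta(minutes=15)


def is_on_time(planned: datetime, actual: datetime,
               tolerance: timedelta = TOLERANCE) -> bool:
    """Return True when the actual arrival lies within +/- tolerance of the plan."""
    return abs(actual - planned) <= tolerance


# Toy example: one planned arrival and two hypothetical actual arrivals.
planned = datetime(2019, 3, 1, 8, 0)
arrivals = [datetime(2019, 3, 1, 8, 10), datetime(2019, 3, 1, 8, 40)]

labels = [is_on_time(planned, actual) for actual in arrivals]
on_time_rate = sum(labels) / len(labels)
```

The resulting list of labels is the categorical outcome a classifier would be trained on, and the mean of the labels corresponds to an on-time arrival rate.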
2. Data collection & study design
Even at the early stages of study design and data collection, issues of what and how much data to collect, according to what design, and which collection instrument to use are considered differently for prediction versus explanation.
Since the dataset used in this research is an external dataset provided by Simacan, the data collection step is not applicable to this research. Furthermore, the external datasets used in this research are publicly available datasets. The only relevant question for this research is the amount of data that will be used. As stated in section 1.3.1, only DC Pijnacker is selected for this research, which limits the amount of data.
3. Data preparation
There are two common data preparation operations: handling missing values and data partitioning. Most real datasets contain missing values, thereby requiring one to identify the missing values, to determine the extent and type of missingness, and to choose a course of action accordingly. A popular solution for avoiding overoptimistic predictive accuracy is to evaluate performance not on the training set, that is, the data used to build the model, but rather on a holdout sample which the model ‘did not see’. The creation of a holdout sample can be achieved in various ways, the most commonly used being a random partition of the sample into training and holdout sets. A popular alternative is cross-validation.
In this research the data preparation step consists of handling errors in the data and preparing the dataset that will be used by the machine learning models. Preparing the complete dataset consists of combining the Simacan dataset with the external data sources and preparing the variables for analysis.
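The two generic operations named above, handling missing values and partitioning the data, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the procedure used in this thesis: dropping incomplete records is only one possible course of action, and the record fields, the 20 percent holdout fraction, the seed and the five folds are hypothetical choices.

```python
import random


def drop_incomplete(records):
    """Keep only records without missing (None) values; one simple option."""
    return [r for r in records if all(v is not None for v in r.values())]


def holdout_split(rows, holdout_fraction=0.2, seed=42):
    """Randomly partition rows into a training set and a holdout set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Folds are contiguous; shuffle the rows beforehand if order matters.
    """
    indices = list(range(n_rows))
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Toy data with hypothetical field names.
records = [
    {"planned": "08:00", "actual": "08:10"},
    {"planned": "09:00", "actual": None},
]
complete = drop_incomplete(records)

rows = list(range(10))
train, holdout = holdout_split(rows)
folds = list(k_fold_indices(len(rows), k=5))
```

In cross-validation every row is used for testing exactly once, which is why it is a common alternative to a single random holdout set.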
4. Exploratory data analysis
The EDA is conducted to understand the datasets that are used in this research. The data is summarized, and several statistics are shown. After the EDA the data preparation is conducted. This is done to understand the data before it was processed and prepared.
5. Choice of variables
Predictive models are based on association rather than causation between the predictors and the response. The choice of potential predictors is often wider than in an explanatory model to better allow for the discovery of new relationships. Predictors are chosen based on a combination of theory, domain knowledge, and empirical evidence of association with the response.
The choice of variables is done at first by using the expertise of several experts in the transportation domain. After the dataset was prepared for the modelling phase, the relevant variables are selected based on the domain knowledge of the experts.
6. Choice of potential methods
In predictive modeling, where the top priority is generating accurate predictions of new observations and the prediction is often unknown, the range of plausible methods includes not only statistical models (interpretable and uninterpretable) but also data mining or machine learning algorithms.
In this research the choice of suitable models is based on theory. Well-known machine learning classifiers are selected and briefly explained in the introduction to machine learning.
7. Evaluation, validation & model selection
Choosing the final model among a set of models, validating it, and evaluating its performance based on different metrics.
In Shmueli and Koppius [10], model selection is based on the best scoring model for prediction purposes. In this research a prediction model is used to help explain the variance in the on-time arrival rate. The selected models are therefore compared on their evaluation metrics, and the best scoring models are selected and further analyzed. However, the main goal of these models is not to predict but to explain the variance. The best scoring classifier is therefore selected to help explain the variance in the on-time arrival of trucks.
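Comparing classifiers on a metric can be sketched as below, using the F1 score as an example metric. The labels, the predictions and the names model_a and model_b are toy values invented for illustration; they are not results from this research.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy holdout labels (1 = on time) and predictions of two hypothetical models.
y_true = [1, 1, 1, 0, 0, 1]
candidates = {
    "model_a": [1, 0, 1, 0, 1, 1],
    "model_b": [1, 1, 1, 0, 0, 0],
}

scores = {name: precision_recall_f1(y_true, y_pred)[2]
          for name, y_pred in candidates.items()}
best_model = max(scores, key=scores.get)
```

The F1 score is the harmonic mean of precision and recall, so a model only scores well when it both finds the on-time arrivals and avoids labelling late arrivals as on time.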
8. Model Use & Reporting
At the end of the explanatory modeling process, a predictive model is used to make predictions from new data, and the results are used to formulate new hypotheses, establish relevance of existing theories, and assess predictability of certain relations.
1.3.5 Report structure
Background
This consists of the background required to understand this thesis. The main topics of this chapter are an introduction to supply chain and retail distribution to understand the context of this thesis and to understand the domain of retail logistics. Next, a background on machine learning is given, to understand the models and methods used in this thesis. Furthermore, recent papers on collaboration integration platforms in the domain of logistics are examined. This last step is performed to find relevant literature on the use of a control tower in distribution networks. This phase presents the results needed to answer the sub-question S.Q.1. Chapter 2 contains the findings of this phase.
Data understanding
The dataset used in this research is provided by an external party. Therefore this phase is required to understand the content of the provided dataset. The dataset content is explored with the use of multiple descriptive statistics. This phase also consists of verifying the data quality. This phase presents the results needed to answer sub-question S.Q.2. Chapter 3 contains the findings of this phase and steps 2 and 4 of the research methodology. Steps 3 and 4 are exchanged in this research, as stated in section 1.3.4.
Data preparation
This chapter describes the data preparation and the choice of variables. Multiple preparation steps are needed to construct a dataset that can be used in this research. Chapter 4 describes the steps taken during this phase and presents the answer to sub-question S.Q.3. Chapter 4 contains the findings of this phase and steps 3 and 5 of the research methodology.
Modelling
This phase describes the choice of potential methods. This phase consists of selecting machine learning techniques, setting up experiments, and training and testing of the machine learning techniques. This phase presents the results needed to answer the sub-question S.Q.4. Chapter 5 describes the steps taken during this phase and contains step 6 of the research methodology.
Results analysis
Usability of the Model
This chapter describes the used model and how it can be used. This phase consists of analyzing the usability of the proposed model in this research. This phase results in the answer to S.Q.6. Chapter 7 describes the findings of this phase and step 8 of the research methodology.
Conclusion, discussion and future research
Background
Each subsection of this chapter describes the necessary background knowledge for a specific subsection of this thesis, to understand its content. Section 2.1 describes the relevant literature on supply chain within the retail area, to understand the context of this thesis. In section 2.2 an introduction to machine learning is given, to understand the models and methods used in the modelling phase of this thesis. Section 2.3 describes the related work on collaboration integration platforms in the domain of logistics.
2.1 Distribution network
A distribution network is a part of the supply chain. This section first defines a supply chain. The second subsection defines a distribution network and gives several distribution network design options.
2.1.1 Supply chain
A supply chain is an integrated manufacturing process where raw materials are converted into final products, then delivered to customers. At its highest level, a supply chain is comprised of two basic, integrated processes [11]:
1. The Production Planning and Inventory Control Process
2. The Distribution and Logistics Process
Figure 2.1: Test
2.1.2 Distribution network
At the highest level, performance of a distribution network should be evaluated along two dimensions [12] :
1. Customer needs that are met
2. Cost of meeting customer needs
The customer needs that are met influence the company’s revenues, which along with cost decide the profitability of the delivery network. While customer service consists of many components, we will focus on those measures that are influenced by the structure of the distribution network. These include:
• Response time
• Product variety
• Product availability
• Customer experience
• Order visibility
• Return-ability
to wait longer than those that drive to a nearby store to get the same item. On the other hand, customers can find a far larger variety of items at an online store compared to the nearby store.
(a) Relationship between desired response time and number of facilities
(b) Relationship between number of facilities and logistics cost.
(c) Variation in logistics cost and response time with number of facilities
Firms that target customers who can tolerate a long response time require few locations that may be far from the customer and can focus on increasing the capacity of each location. On the other hand, firms that target customers who value short response times need to locate close to them. These firms must have many facilities, with each location having a low capacity. Thus, a decrease in the response time customers desire increases the number of facilities required in the network, as shown in Figure 2.2a. For example, a local store provides its customers with items on the same day but requires a large number of stores to achieve this goal. An online shop, on the other hand, takes about a week to deliver an item to its customers, but only uses a few locations to store its products.
of scale are maintained, increasing the number of facilities decreases total transportation cost, as shown in Figure 2.2b. If the number of facilities is increased to a point where there is a significant loss of economies of scale in inbound transportation, increasing the number of facilities increases total transportation cost. A distribution network with more than one warehouse allows Amazon.com to reduce transportation cost relative to a network with a single warehouse. Facility costs decrease as the number of facilities is reduced, as shown in Figure 2.2b, because a consolidation of facilities allows a firm to exploit economies of scale.
Total logistics costs are the sum of inventory, transportation, and facility costs for a supply chain network. As the number of facilities is increased, total logistics costs first decrease and then increase, as shown in Figure 2.2c. Each firm should have at least the number of facilities that minimize total logistics costs (see Figure 2.2c). As a firm wants to further reduce the response time to its customers, it may have to increase the number of facilities beyond the point with minimum logistics costs. A firm should add facilities beyond the cost-minimizing point only if managers are confident that the increase in revenues because of better responsiveness is greater than the increase in costs because of the additional facilities.
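The trade-off described above can be illustrated with a small numerical sketch. The three cost curves below are purely hypothetical shapes chosen for illustration (inventory and facility costs rising with the number of facilities, transportation cost falling); the eventual rise in transportation cost from lost inbound economies of scale is deliberately left out, and none of the coefficients come from this thesis or from Albert Heijn.

```python
import math


def inventory_cost(n: int) -> float:
    """Hypothetical inventory cost: grows with the number of facilities."""
    return 100.0 * math.sqrt(n)


def facility_cost(n: int) -> float:
    """Hypothetical facility cost: grows linearly with the number of facilities."""
    return 50.0 * n


def transportation_cost(n: int) -> float:
    """Hypothetical transportation cost: falls as facilities move closer to customers."""
    return 1200.0 / n


def total_logistics_cost(n: int) -> float:
    """Total logistics cost as the sum of the three components."""
    return inventory_cost(n) + facility_cost(n) + transportation_cost(n)


# Evaluate networks with 1 to 10 facilities and pick the cost-minimizing one.
totals = {n: total_logistics_cost(n) for n in range(1, 11)}
best_n = min(totals, key=totals.get)
```

With these assumed curves the total first falls and then rises, so the minimum marks the smallest network a firm should consider; adding facilities beyond it only pays off when the revenue gained from faster response exceeds the extra cost.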
2.1.3 Retail distribution
The present section elaborates on a retail distribution system. First, we characterize the structure of the retail network. Second, we describe the subsystems and processes that are part of a retailer’s distribution network.
Characterization of the retail network structure
Characterization of the subsystems and processes
Several processes occur in the distribution network described above, such as transportation, storing products, picking store orders, and stocking the shelves in the stores. Each of these processes belongs to a certain subsystem of the retail chain, i.e., (1) inbound transportation (supplier - DC), (2) warehousing, (3) outbound transportation (DC - Stores), and (4) instore operations. These subsystems are explained in the following paragraphs.
Inbound transportation
Inbound transportation comprises transportation tasks between the supply points of the manufacturers and the DCs. The transportation efforts depend on the number of shipments and the distance from a supplier to a specific DC. The number of shipments required to a DC depends on the sales volumes and the physical volumes of the products delivered. From a single product perspective, an assignment to the DC type with the lowest volume-weighted average distance may save inbound transportation costs, assuming that trucks always have full loads. Transportation costs are especially important for the allocation of products with high sales and high physical volumes. The assignment of a product to a certain DC type influences the volume per period that a supplier has to deliver to a DC. Regional or local DCs serve fewer stores than a central DC, which serves a large number of stores and where correspondingly more store demand has to be fulfilled from a central site. When choosing more decentralized distribution, i.e., via regional or local DCs, less volume of this product is required per DC but at multiple places. The efficiency of inbound transportation then particularly depends on the extent to which transportation units can be utilized. Consequently, inbound delivery costs for a stock keeping unit (SKU) depend on the other SKUs of the same supplier that are allocated to the same DC type, as freight space can be shared.
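The volume-weighted average distance mentioned above can be computed as in the sketch below. The distances, volumes and the two DC type options are invented toy values, and the comparison assumes full truckloads as stated in the text; it is an illustration of the measure, not the allocation method of any retailer.

```python
def volume_weighted_distance(shipments):
    """Average supplier-to-DC distance, weighted by the volume shipped to each DC.

    shipments: list of (distance_km, volume) pairs, one per DC of that type.
    """
    total_volume = sum(volume for _, volume in shipments)
    return sum(distance * volume for distance, volume in shipments) / total_volume


# Hypothetical options for one product of one supplier.
options = {
    "central": [(120.0, 1000.0)],
    "regional": [(60.0, 400.0), (150.0, 300.0), (200.0, 200.0), (90.0, 100.0)],
}

averages = {name: volume_weighted_distance(s) for name, s in options.items()}
best_option = min(averages, key=averages.get)
```

In this toy case the regional option wins narrowly (118 km against 120 km), which shows that the outcome depends on how the volume is spread over nearby and distant DCs.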
Warehousing
Outbound transportation
In inbound transportation, the source-destination relation results from the allocation of the products, so not every supplier has to serve all DC types; in outbound transportation, by contrast, it is predetermined. Outbound transportation bridges the geographical disparity between the DCs and the stores. In addition, store delivery frequency depends strongly on store-specific situations and last-mile volume bundling across stores [16]. Outbound transportation from the DCs to the stores depends on the allocation of products across the several DCs. Distributing products via more regional DCs saves outbound transportation effort, since distances from the DCs to the stores get shorter, and vice versa for central DCs. This is especially important for high-volume products. However, it may conflict with inbound transportation and the resulting cycle stock. A more central DC type comprises fewer DCs, and suppliers may therefore supply higher volumes or deliver their products more frequently. In addition, the safety stocks required increase as DC types become more decentralized [17].
Instore operations
Instore operations efforts are known to cause a high share of total operational costs in retailing [13]. In the stores, the products delivered have to be stacked onto the shelves. As known from literature, this effort decreases as for example the case pack size and the number of case packs per order line rise, assuming sufficient shelf space [18]. Each product sold in the stores belongs to a specific product category. Shelf filling activities are more efficient if all products of one layout segment are sourced from a single DC.
Figure 2.3 gives an overview of the sub-processes in the retail network structure: (A) inbound transportation costs, (B) warehouse inventory costs, (C) warehouse picking costs, (D) outbound transportation costs, and (E) instore operational costs. In addition, the associated cost drivers are denoted: inbound delivery volume and inbound delivery frequency from one supplier to a specific DC, cycle stocks and system-wide safety stocks, picking volume, technology and labor costs, outbound delivery volume, and the number of different DC types to which products of one store layout segment are assigned.
Figure 2.3: Sub-process in the retail network structure
2.2 Machine learning
phase. The models are examined and tested to determine whether they are applicable to the dataset of the AH/Simacan control tower. The definition of machine learning used throughout this research is: "the complex computation process of automatic pattern recognition and intelligent decision making based on training sample data" [19]. A more general definition of machine learning is "the process of applying a computing-based resource to implement learning algorithms" [19]. Based on several books on machine learning [19][20][21][22], this section describes the basic theory of the machine learning techniques used in this research. Three categories of learning algorithms are supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning, the goal is to create a model that predicts $y$ based on some $x$, given a training set consisting of example pairs $(x_i, y_i)$. Here $y_i$ is called the label of the example $x_i$.
When $y$ is continuous, the problem is called a regression problem, and when $y$ is discrete it is called a classification problem. Throughout this research, the focus is on supervised learning, as we try to detect whether a truck described by some features $x$ is on time at its destination. In this case, the prediction value $y$ takes the value 1 if a truck arrives too late and 0 if a truck arrives on time. Section 2.2.1 describes the machine learning classifiers used in this research. Section 2.2.2 then describes the metrics used to evaluate classifiers and the challenges of an imbalanced dataset. Lastly, Section 2.2.3 describes the cross-validation methodology.
2.2.1 Machine Learning classifiers
In this section several machine learning algorithms are briefly explained.
Random Forest
Figure 2.4: Example of Decision Tree
Algorithms for constructing decision trees for a Random Forest usually work top-down, by choosing at each step the variable that best splits the remaining items. Different algorithms use different metrics for measuring the 'best' split variable. One of the most used metrics is the Gini Impurity [23]. Gini Impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. The Gini impurity can be computed by summing, over all labels $i$, the probability $p_i$ of an item with label $i$ being chosen times the probability $\sum_{k \neq i} p_k = 1 - p_i$ of a mistake in categorizing that item:
$$G = \sum_{i} p_i \sum_{k \neq i} p_k = \sum_{i} p_i (1 - p_i)$$
It reaches its minimum (zero) when all cases in the node fall into a single target category [24].
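The Gini impurity described above can be illustrated with a minimal Python sketch. The function names `gini_impurity` and `weighted_split_impurity` are chosen here for illustration and are not taken from the thesis; the size-weighted average over the two child nodes is a common way to score a candidate split, and is an assumption about how the split metric is applied.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a collection of class labels: sum_i p_i * (1 - p_i)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def weighted_split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

pure = gini_impurity([0, 0, 0, 0])    # node containing a single class
mixed = gini_impurity([0, 0, 1, 1])   # node with two equally frequent classes
```

A split that separates the classes perfectly yields a weighted impurity of zero, which is why a tree-building algorithm would prefer it over a split that leaves both children mixed.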
K-Nearest Neighbour
The K-nearest neighbour (KNN) classifier is a distance-based classifier. Distance-based classifiers generalise from training data to unseen data by looking at similarities between training instances. Given a query instance $q$, the classifier finds the $k$ training instances that are closest in distance to the query instance $q$. Subsequently, it classifies the query instance using a majority vote among the $k$ neighbours. The distance from the query instance to the training instances can be calculated using different metrics, such as the Euclidean distance, the Minkowski distance, or the Manhattan distance. An example of k-nearest neighbour classification is given in Figure 2.5. Three algorithms are available for finding the nearest neighbours. These are described below.
Brute-force
The most basic computation method for Nearest Neighbor classification is brute force. This algorithm simply calculates the distances between all points in the data set and uses those to determine which points are closest. For small sample sizes this algorithm can return accurate results. Due to its naive nature, brute force quickly becomes infeasible when the sample size increases.
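The brute-force approach can be sketched in a few lines of Python. This is an illustrative sketch under assumptions, not the implementation used in this research: the function name `knn_predict`, the Euclidean distance, and the toy data are all chosen here for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Brute force: compute the distance from the query to every training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated clusters with labels 0 and 1.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(train_X, train_y, query=(0.95, 1.0), k=3)
```

Every query requires one distance computation per training instance, which is the cost that the tree-based methods try to avoid.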
k-d tree
points A, B and C: of these points, A and B are very distant and C and B are close. From this information it follows that points A and C are also very distant. A k-d tree is constructed by iterating over several steps. In each iteration a (not previously used) feature is selected at random, on which a decision will be based. The median value of the selected feature is calculated, and values larger than this median are separated from the smaller values. Two branches have now been created, each with approximately half of the samples. These steps are repeated on both branches. This process continues until the number of samples in a branch drops below a certain threshold. After a tree has been generated, the approximately closest neighbors can easily be determined. The decisions in the nodes of the tree are applied to the new instance, and the branch in which the new instance ends up contains samples that are close. For all of these samples the distance to the new instance is calculated to find the nearest ones.
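The construction and lookup steps above can be sketched as follows. This is a simplified sketch that follows the description in the text (a randomly chosen unused feature, a median split, and a leaf-size threshold); the names `build_tree` and `approximate_nearest`, the dictionary-based node layout, and the toy points are assumptions made for illustration. Production implementations typically also backtrack into neighbouring branches to guarantee the exact nearest neighbour, which is omitted here.

```python
import math
import random
import statistics

def build_tree(points, unused_features, leaf_size, rng):
    """Recursively split `points` at the median of a randomly chosen unused feature."""
    if len(points) <= leaf_size or not unused_features:
        return {"leaf": points}
    feature = rng.choice(sorted(unused_features))
    median = statistics.median(p[feature] for p in points)
    left = [p for p in points if p[feature] <= median]
    right = [p for p in points if p[feature] > median]
    if not left or not right:  # all values identical: cannot split further
        return {"leaf": points}
    remaining = unused_features - {feature}
    return {
        "feature": feature,
        "median": median,
        "left": build_tree(left, remaining, leaf_size, rng),
        "right": build_tree(right, remaining, leaf_size, rng),
    }

def approximate_nearest(tree, query):
    """Descend to the leaf the query falls into and return its closest sample."""
    node = tree
    while "leaf" not in node:
        side = "left" if query[node["feature"]] <= node["median"] else "right"
        node = node[side]
    return min(node["leaf"], key=lambda p: math.dist(p, query))

points = [(1.0, 9.0), (2.0, 3.0), (3.0, 7.0), (6.0, 1.0), (7.0, 8.0), (8.0, 2.0)]
tree = build_tree(points, unused_features={0, 1}, leaf_size=2, rng=random.Random(0))
neighbour = approximate_nearest(tree, (7.5, 1.5))
```

Only the samples in one leaf are compared with the query, so the result is approximate: a closer point may sit just across a split boundary.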
Ball tree
In high-dimensional space it becomes computationally expensive to create a k-d tree. In those situations it is computationally favorable to create a ball tree. Omohundro [25] describes a ball tree as a binary tree in which each node represents a hypersphere called a ball. Each node of the tree splits the data into two disjoint sets, and each set is contained in the smallest ball containing all of its points. The hyperspheres are allowed to intersect; data points are assigned to the sphere whose center is closest.
Advantages of KNN are [3]: i) high precision and accuracy, ii) non-linear classification, iii) no assumptions about the features. The disadvantages are: i) it is sensitive to an unbalanced sample set, ii) it is computationally expensive.
Figure 2.5: Example of K-Nearest Neighbour Classification
Naive Bayes
evidence $e$, according to the following formula:
$$P(H \mid e) = \frac{P(e \mid H) \, P(H)}{P(e)}$$
The classifier is called naive because it assumes conditional independence, making the computation of the above formula less computationally expensive; especially for datasets with many features. Although Naive Bayes assumes conditional independence, it performs well in domains where independence is violated [14]. Advantages of Naive Bayes are: i) high speed ii) insensitive to irrelevant feature data iii) simple and mature algorithm. A disadvantage is that it requires the assumption of independence of features [26].
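To make the role of the independence assumption concrete, the sketch below computes the posterior for each hypothesis by multiplying the prior with one likelihood term per piece of evidence and then normalising. All probabilities, feature names, and the function name `naive_bayes_posterior` are hypothetical and chosen purely for illustration; they are not estimates from the control tower data.

```python
def naive_bayes_posterior(priors, likelihoods, evidence):
    """Posterior P(H | e) per hypothesis H, assuming conditionally independent evidence.

    priors:      {hypothesis: P(H)}
    likelihoods: {hypothesis: {feature: {value: P(value | H)}}}
    evidence:    {feature: observed value}
    """
    unnormalised = {}
    for hypothesis, prior in priors.items():
        score = prior
        # Conditional independence: P(e | H) factorises into per-feature terms.
        for feature, value in evidence.items():
            score *= likelihoods[hypothesis][feature][value]
        unnormalised[hypothesis] = score
    total = sum(unnormalised.values())  # this sum equals P(e)
    return {h: s / total for h, s in unnormalised.items()}

# Hypothetical numbers, for illustration only.
priors = {"late": 0.2, "on_time": 0.8}
likelihoods = {
    "late": {"rain": {True: 0.6, False: 0.4}, "rush_hour": {True: 0.7, False: 0.3}},
    "on_time": {"rain": {True: 0.2, False: 0.8}, "rush_hour": {True: 0.3, False: 0.7}},
}
posterior = naive_bayes_posterior(priors, likelihoods, {"rain": True, "rush_hour": True})
```

Because each feature contributes a single factor, the cost of a prediction grows linearly with the number of features rather than with the size of their joint distribution.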
Generalized Linear Model
The Generalized Linear Model is a generalization of the general linear model [27]. In its simplest form, a linear model specifies the (linear) relationship between a dependent (or response) variable $Y$ and a set of predictor variables, the $x$'s, so that:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
Generalized linear models are a class of linear models which unify several widely used models, including linear regression and logistic regression. The distribution over each output is assumed to be an exponential family distribution whose natural parameters are a linear function of the inputs.
Logistic Regression
Logistic Regression is one of the generalized linear models used in this research. It is one of the simplest and most commonly used Machine Learning algorithms for two-class classification [28]. It is easy to implement and can be used as the baseline for any binary classification problem. Its fundamental concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and a set of independent variables. Figure 2.6 gives an example of a Logistic Regression.
Figure 2.6: Example of Logistic Regression
are used. In this method, an iterative process is used in which the coefficients are slightly changed in each iteration to try to improve the likelihood. In this research, two methods to fit the model are taken into account. These methods are not discussed in depth, since that is outside the scope of this research. The first method is Liblinear, which uses a coordinate descent algorithm to find suitable values for the coefficients. The second method is saga, which uses stochastic average gradient descent. The second method is usually faster on large data sets.
The loss function that is minimized usually includes a regularization term. Such a term penalizes complex models and favors simpler ones. With Logistic Regression, two types of regularization are commonly used, L1 and L2. L1 is a regularization that favors sparse models, i.e., models in which a large fraction of the coefficients is zero. L2 is used as the regularization term when a sparse model is not suitable. When the data set contains highly correlated features, L1 should be used as the regularization term: it picks a single one of the correlated features and sets the coefficients of the other features to zero, whereas L2 would simply shrink the coefficients of all correlated features. Usually a parameter is added to the algorithm to determine the strength of the regularization.
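The effect of the regularization strength can be illustrated with a small sketch. Note that it uses plain batch gradient descent with an L2 penalty, not the Liblinear or saga solvers mentioned above; the function name `fit_logistic`, the hyperparameter values, and the toy data with two identical (perfectly correlated) features are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l2=0.1, lr=0.5, epochs=500):
    """Fit weights and bias by gradient descent on the L2-regularised log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += error * xi[j] / n
            grad_b += error / n
        # The L2 penalty (l2 / 2) * ||w||^2 adds l2 * w to the gradient,
        # which shrinks all weights towards zero; the bias is not penalised.
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data: the two features are identical, i.e. perfectly correlated.
X = [(-2.0, -2.0), (-1.0, -1.0), (1.0, 1.0), (2.0, 2.0)]
y = [0, 0, 1, 1]
w_weak, _ = fit_logistic(X, y, l2=0.01)
w_strong, _ = fit_logistic(X, y, l2=1.0)
```

With L2 the two correlated features receive the same shrunken weight, and a larger `l2` value shrinks both further, in line with the behaviour described above.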
Artificial neural networks
An artificial neural network (ANN) is a machine learning model that uses a structure of nodes, i.e., artificial neurons, to classify testing instances [29]. These nodes are connected to each other by directed links. An ANN consists of an input layer, some hidden layers, and an output layer. Every directed link between neurons has a numeric weight, shown as $w_{ij}$ in the example ANN in Figure 2.7. These numeric weights are used in the activation function of each node. This activation function is used to determine the output of a node. Different learning algorithms can be used to determine the number of hidden layers, the number of neurons, and the weights between the neurons. Some of the most popular are feed-forward back-propagation and radial basis function networks. This research uses the Perceptron classifier, which is a class of ANN that uses back-propagation for learning.
AdaBoost
Adaptive boosting (AdaBoost or Ada) is, like the Random Forest classifier, an ensemble classifier. AdaBoost uses multiple training iterations on subsets of the dataset to boost the accuracy of a (weak) machine learning classifier. The machine learning classifier is first trained on a subset of the dataset. Then all training instances are weighted: any sample in the training set that is not correctly classified is weighted more, thereby having a higher probability of being chosen in the training set of the next iteration. Likewise, any sample that is correctly classified is weighted less. This process is repeated until the set maximum number of estimators is reached. AdaBoost is known for offering accurate machine learning classifiers [19]. However, a disadvantage of AdaBoost is that it is a greedy learner, i.e., it may offer suboptimal solutions. In this research, AdaBoost is used with (standard) decision trees.
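The re-weighting step can be sketched as a single boosting round. The update below follows the widely used discrete AdaBoost formulation, which is an assumption about the variant meant here; the function name `adaboost_reweight` and the toy predictions are illustrative.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (estimator weight alpha, updated sample weights).

    Assumes the weighted error of the weak classifier lies strictly between 0 and 1.
    """
    # Weighted error rate of the current weak classifier.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples are scaled up, correctly classified ones are scaled down.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]   # the weak classifier misses the second sample
weights = [0.2] * 5        # start from uniform weights
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

After normalisation the single misclassified sample carries as much weight as the four correctly classified samples together, so the next weak classifier is pushed to get it right.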
Gradient Boosting
Boosting is a method of converting weak learners into strong learners. Gradient Boosting is, like the Random Forest classifier and AdaBoost, an ensemble classifier. In boosting, each new tree is fitted on a modified version of the original data set.
Gradient Boosting trains many models in a gradual, additive and sequential manner. The major difference between AdaBoost and Gradient Boosting is how the two algorithms identify the shortcomings of weak learners (e.g., decision trees). While the AdaBoost model identifies the shortcomings by using highly weighted data points, gradient boosting does so by using the gradients of the loss function (for a model such as $y = ax + b + e$, where $e$ is the error term) [30]. The loss function is a measure indicating how well the model's coefficients fit the underlying data. A logical understanding of the loss function depends on what we are trying to optimize. For example, if we are trying to predict house prices by using a regression, then the loss function would be based on the error between true and predicted house prices. Similarly, if our goal is to classify credit defaults, then the loss function would be a measure of how good our predictive model is at classifying bad loans. One of the biggest motivations for using gradient boosting is that it allows one to optimize a user-specified cost function, instead of a fixed loss function that usually offers less control and does not necessarily correspond with real-world applications.
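The additive, sequential fitting can be sketched for the simplest case: a squared-error loss, where the negative gradient is just the residual, and one-dimensional regression stumps as weak learners. Both choices, the function names `fit_stump` and `gradient_boost`, the hyperparameters, and the toy data are assumptions made for illustration and do not reflect the configuration used in this research.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump for 1-D inputs under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additively fit stumps to the residuals (negative gradient of the squared loss)."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
```

Swapping in a different differentiable loss only changes how the residuals line is computed, which is the flexibility referred to above.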
2.2.2 Evaluation classifiers
Different performance metrics exist to evaluate a classifier.
Confusion matrix
                      Actual Positive        Actual Negative
Predicted Positive    True Positive (TP)     False Positive (FP)
Predicted Negative    False Negative (FN)    True Negative (TN)
Table 2.1: Sample cost function
The confusion matrix shows how many too-late instances were correctly classified as being too late (TP), how many too-late instances were missed (FN), how many on-time instances were correctly classified as being on time (TN), and how many on-time instances were incorrectly classified as being too late (FP). Other metrics and their formulas are shown in Table 2.2. These metrics use the counts shown in Table 2.1. A frequently used metric is the accuracy, defined as the number of correct predictions (TP + TN) as a fraction of the total number of predictions (TP + TN + FP + FN). This metric, however, might not reflect the performance of a classifier well. In a skewed dataset, that is, a dataset containing more of one class than the other, high accuracy can be achieved by always predicting the majority class. For example, in a dataset consisting of 90% malicious actions and 10% benign actions, always predicting malicious actions results in an accuracy of 90%. In the case of a skewed dataset, the performance metrics Precision (PPV) and/or Recall (TPR) reflect the performance of a classifier more realistically. The harmonic mean of the Precision and Recall is given by the f1-score (F-score with α = 1).
Metric                       Formula
Accuracy                     $\frac{TP + TN}{TP + TN + FP + FN}$
True Positive Rate (TPR)     $\frac{TP}{TP + FN}$
False Positive Rate (FPR)    $\frac{FP}{FP + TN}$
True Negative Rate (TNR)     $\frac{TN}{TN + FP}$
Precision (PPV)              $\frac{TP}{TP + FP}$
F-score (F-measure)          $(1 + \alpha^2) \frac{PPV \cdot TPR}{\alpha^2 \cdot PPV + TPR}$
Table 2.2: Performance Metrics
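The metrics of Table 2.2 can be computed directly from the four confusion-matrix counts, as the sketch below shows. The function name `classification_metrics` and the example counts are hypothetical and serve only to illustrate how accuracy can look favourable on a skewed dataset while recall and precision do not.

```python
def classification_metrics(tp, fp, fn, tn, alpha=1.0):
    """Compute the metrics of Table 2.2 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "tpr": tpr,
        "fpr": fp / (fp + tn),
        "tnr": tn / (tn + fp),
        "precision": precision,
        "f_score": (1 + alpha**2) * precision * tpr / (alpha**2 * precision + tpr),
    }

# Hypothetical counts: 1000 deliveries, of which 50 are too late (the positive class).
metrics = classification_metrics(tp=30, fp=20, fn=20, tn=930)
```

In this hypothetical case the accuracy is 96%, yet only 60% of the too-late deliveries are detected, which is the kind of gap the f1-score is meant to expose.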
Area under the curve
To understand the details of the area under the curve, it is required to first explain the receiver operating characteristics (ROC) curve. The ROC is used for visualizing classifier performance and has long been used in signal detection theory to depict the trade off between true and false positive rates of classifiers [31].
positive rate ($FP/N$) for different thresholds. Since classifiers calculate a score between 0.0 and 1.0, a threshold has to be chosen as the border between positive and negative classifications. The calculated score, $x$, can be seen as being sampled from a continuous random distribution $X$. An instance is classified as positive if $x > T$, with $T$ being the chosen threshold. Different thresholds result in different true and false positive rates. Figure 2.8 shows three examples of an ROC curve: a random model and two models with predictive capabilities. The ROC curve of a random model approaches the diagonal stretching from (0, 0) to (1, 1). The reason behind this behavior is best explained with an example. Assume that a random fraction $K$ of all instances is classified as positive. Then a fraction $K$ of the instances that should be classified as positive will be correctly classified, and the same fraction $K$ of the instances that should be negative will be incorrectly classified as positive, so the true positive rate equals the false positive rate. For models that perform better than random guessing, the true positive rate will be higher than the false positive rate and thus the model will have a ROC curve above the diagonal.
The area under the curve (AUC) is a measure that summarizes the ROC curve in a single number. It is important to note that it is impossible to summarize the curve in a single number without loss of information. As the name indicates, the AUC is the area under the ROC curve. For the ROC curve of the random model graphed in Figure 2.8, the AUC is exactly 0.5; for Model B the AUC is approximately 0.67. Models that are better than a random classifier have an AUC above 0.5, and a perfect classifier has an AUC of 1.0.
Figure 2.8: Graph containing the ROC of a random classifier and two better performing classifiers.
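The construction of the ROC curve and the AUC can be sketched as follows. The threshold is swept over the observed scores and the area is approximated with the trapezoidal rule; the function names `roc_points` and `auc` and the example scores are assumptions made for illustration. The sketch uses `score >= threshold`, which is equivalent to the rule $x > T$ with $T$ placed just below each observed score.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold over all observed scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores; label 1 is the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
perfect_area = auc(roc_points(scores, [1, 1, 1, 0, 0, 0]))
```

The resulting area equals the fraction of positive-negative pairs in which the positive instance receives the higher score, which gives the AUC an interpretation independent of any single threshold.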
Imbalanced Dataset
time examples and 5% too late examples, an accuracy of 95% might be the result of the classifier predicting on-time labels 100% of the time. This research addresses this challenge by using metrics that take into account the skewness of a dataset, such as the f1-score, which is the harmonic mean of the Precision and the Recall (True Positive Rate).
In regular learning, we treat all misclassifications equally, which causes issues in imbalanced classification problems, as there is no extra reward for identifying the minority class over the majority class [32]. Cost-sensitive learning changes this and uses a function $C(p, t)$ (usually represented as a matrix) that specifies the cost of misclassifying an instance of class $t$ as class $p$. This allows us to penalize misclassifications of the minority class more heavily than misclassifications of the majority class, in the hope that this increases the true positive rate. A common scheme is to set the cost equal to the inverse of the proportion of the data set that the class makes up. This increases the penalization as the class size decreases.
                                Actual Positive ($y_i = 1$)    Actual Negative ($y_i = 0$)
Predicted Positive ($c_1 = 1$)  $C_{TP_i}$                     $C_{FP_i}$
Predicted Negative ($c_0 = 1$)  $C_{FN_i}$                     $C_{TN_i}$
Table 2.3: Sample cost function matrix
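The inverse-proportion cost scheme can be sketched as follows. The function names `inverse_frequency_costs` and `total_cost` are illustrative, and the sketch assumes that correct predictions carry zero cost, which the text does not state explicitly.

```python
from collections import Counter

def inverse_frequency_costs(labels):
    """Misclassification cost per class, set to the inverse of the class proportion."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / count for cls, count in counts.items()}

def total_cost(y_true, y_pred, costs):
    """Sum of costs over misclassified instances (correct predictions assumed free)."""
    return sum(costs[t] for t, p in zip(y_true, y_pred) if t != p)

y_true = [0] * 95 + [1] * 5   # 95% on time (0), 5% too late (1)
costs = inverse_frequency_costs(y_true)
cost_always_on_time = total_cost(y_true, [0] * 100, costs)
cost_always_late = total_cost(y_true, [1] * 100, costs)
```

Under this scheme, missing all five too-late instances costs exactly as much as misclassifying all 95 on-time instances, so always predicting the majority class is no longer a cheap strategy.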
A simple way to deal with imbalanced data sets is to balance them, either by oversampling instances of the minority class or by undersampling instances of the majority class [32]. This creates a balanced data set that, in theory, should not lead to classifiers biased toward one class or the other. However, in practice, these simple sampling approaches have flaws: i) oversampling the minority class can lead to model overfitting, since it introduces duplicate instances drawn from a pool of instances that is already small; ii) undersampling the majority class can leave out instances that capture important differences between the two classes.
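Both sampling approaches can be sketched with the standard library. The function names, the fixed random seed, and the integer stand-ins for data rows are assumptions made for illustration; the sketch assumes a binary problem in which `minority` names the smaller class.

```python
import random

def split_by_class(rows, labels, minority):
    """Separate the rows of the minority class from all other rows."""
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    majority_rows = [r for r, l in zip(rows, labels) if l != minority]
    return minority_rows, majority_rows

def random_oversample(rows, labels, minority, rng):
    """Duplicate random minority rows until both classes are the same size."""
    minority_rows, majority_rows = split_by_class(rows, labels, minority)
    extra = rng.choices(minority_rows, k=len(majority_rows) - len(minority_rows))
    return minority_rows + extra, majority_rows

def random_undersample(rows, labels, minority, rng):
    """Keep a random subset of majority rows as large as the minority class."""
    minority_rows, majority_rows = split_by_class(rows, labels, minority)
    return minority_rows, rng.sample(majority_rows, k=len(minority_rows))

rng = random.Random(42)
rows = list(range(100))                      # stand-ins for 100 data rows
labels = [1 if i < 5 else 0 for i in rows]   # 5 minority rows, 95 majority rows
over_min, over_maj = random_oversample(rows, labels, minority=1, rng=rng)
under_min, under_maj = random_undersample(rows, labels, minority=1, rng=rng)
```

The oversampled minority set has 95 rows but still only five distinct ones, and the undersampled majority set discards 90 of the 95 original rows, which makes both flaws listed above directly visible.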
2.2.3 Cross-validation
When training models to make predictions, a method is needed for estimating how accurately the model will make predictions in practice. In cross-validation the data is split into a set used to train the model, the training set, and a set against which the model is tested, the test set. Many types of cross-validation rely on multiple iterations to reduce variability. In the following paragraphs three methods are explained.
Holdout method
The holdout method randomly splits the data set into two sets, $d_0$ and $d_1$, the training set and test set, respectively. The model is trained using $d_0$ and validated using $d_1$. Usually, split ratios range from 30:70 to 10:90 [29]. The disadvantage of this method is its reliance on a single train/test split, which makes the method susceptible to random variations.
Repeated random sub-sampling validation
This method works by repeating the holdout method. Because of this repetition, it is also known as Monte Carlo cross-validation. In each replication the data set is randomly split into a training and a test set. The results are averaged over all iterations. The disadvantage of this method is that some samples may never be used for validation, whereas others may be selected multiple times.
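Repeating the holdout split and averaging the results can be sketched as follows. The function names, the default of ten repetitions, the 30% test fraction, and the `majority_accuracy` stand-in for actual model training are all hypothetical choices made for illustration.

```python
import random

def holdout_split(n_samples, test_fraction, rng):
    """Randomly split sample indices into a training and a test set."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def monte_carlo_cv(n_samples, evaluate, n_repeats=10, test_fraction=0.3, seed=0):
    """Average `evaluate(train_idx, test_idx)` over repeated random holdout splits."""
    rng = random.Random(seed)
    scores = [evaluate(*holdout_split(n_samples, test_fraction, rng)) for _ in range(n_repeats)]
    return sum(scores) / len(scores)

labels = [0] * 80 + [1] * 20

def majority_accuracy(train_idx, test_idx):
    """Hypothetical stand-in for model training: predict the training majority class."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

mean_accuracy = monte_carlo_cv(len(labels), majority_accuracy)
```

Because every repetition draws a fresh random split, nothing prevents a sample from landing in the test set several times or not at all, which is the disadvantage noted above.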
k-fold cross-validation
In k-fold cross-validation the data set is shuffled and split into $k$ equally sized subsets. Of these $k$ sets, a single set is retained to be used as the test set, and the remaining $k-1$ sets are used as the training set. This process is repeated $k$ times, with each of the subsets being used as the test set once. The performance is averaged over all iterations to get an accurate estimation of model performance. Based on a sensitivity analysis, the value of $k$ should usually be 5 or 10 when the method is used for error estimation [33].
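The fold construction can be sketched in a few lines. The generator name `k_fold_indices` and the fixed seed are illustrative; when the number of samples is not divisible by $k$, the fold sizes in this sketch differ by at most one.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and yield (train, test) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds whose sizes differ by at most one
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, k=5))
```

Unlike repeated random sub-sampling, every sample appears in exactly one test set, so each observation contributes to the performance estimate once.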
2.3 Related work
The previous sections introduced the background on supply chain and retail distribution, and an introduction to explain and to understand the models and methods used in the modelling phase. This section examines related papers on integration platforms in the domain of logistics. These papers are found using a systematic literature research by using the steps described in Webster & Watson [34]:
1. Databases/journals: Scopus was used, as it indexes transportation, logistics and information systems.
2. Query abstracts: The following query on ScienceDirect was used to search for relevant literature:
("supply chain" OR "logistics" OR "transport" ) AND
("control tower" OR "tracking" OR "control") AND
("integration" OR "collaboration" OR "information sharing" OR "cloud" OR "shar* data")
AND NOT
("bullwhip" OR "service logistics")
Subject area                           Papers
Engineering                            117
Computer Science                       71
Business, Management and Accounting    43
Economics, Econometrics and Finance    3
Table 2.4: Results of the literature review
3. Evaluation of abstracts: After identifying the initial set of papers, their titles and abstracts were examined to see if they are indeed relevant to the current research. After this step, 33 papers remained.
4. Evaluation of papers: The 33 papers were then reviewed to discard those that do not discuss an integration platform or shared information system within a logistic partnership. After this step, 10 papers remained.
An overview of these papers is shown in Table 2.5. Section 2.3.1 describes the most important findings per paper. In addition, several whitepapers were reviewed to find relevant information. Section 2.3.2 describes three whitepapers.
2.3.1 Academic work on control towers
Liotine [35] summarizes the findings of an industry panel study evaluating how new Autonomous Intelligence technologies, such as artificial intelligence and machine learning, impact the system and operational architecture of supply chain control tower (CT) implementations that serve the pharmaceutical industry. Such technologies can shift CTs to a model in which real-time information gathering, analysis, and decision-making are possible. There were some challenges in achieving this, which include data quality and integrity, collaboration and data sharing across supply chain tiers, cross-system interoperability, decision-validation, and organizational impacts, among others.
are reluctant to invest in an integrated system for a better information sharing practice primarily due to a lack of mutual trust, security risk, and a shortage of both investment capital and human resource capacity. These challenges affect the performance of the logistics sector in Vietnam as they decrease the value-added relationship with other partners in the logistics sector. The authors recommend some corresponding actions and practices which can be implemented practically to overcome the existing problems to enhance the information sharing practices in the logistics sector in Vietnam, given the different types of logistics firms and the Vietnamese government.
Li et al. [38] presented the concept of a cloud-based ubiquitous object sharing platform (CUOSP) to solve heterogeneous system integration problems, especially for smaller enterprises in manufacturing and logistics. CUOSP acts as an overall cloud middleware solution to assist enterprise users in constructing intelligent infrastructure. It primarily simplifies and facilitates the integration and deployment of heterogeneous systems through the sharable ubiquitous object, thereby reducing the cost of integration and development.
Raweewan & Ferral [39] presented a quantitative approach to assessing the value of information relative to sharing with a potential collaboration partner for competition-cooperation and coopetition strategies within a supply chain. The approach is particularly well suited for collaboration where the parties considering collaboration might be routine competitors, because it captures risks and benefits.
Korpela et al. [40] focus on business-to-business (B2B) integration within the supply chain, referring to the electronic data exchanged over the internet between business partners and value-added service providers. Even the biggest organizations lack the power, knowledge or capability to design or deploy end-to-end information integration through the supply network by themselves. The aim of this study was to establish how B2B DSC integration can be accelerated.
Cheng & Wang [41] examined the shortfalls of the current supply chain performance measurement (SCPM) system in monitoring the performance of stakeholders in a globalized environment. SCPM lags in measuring the supply chain performance of stakeholders from the cross-hierarchy perspective. The capability of a logistics and transport data exchange platform to function as an SCPM mechanism is illustrated through a case study of DTTN.
Natvig & Wienhofen [42] performed a case study with several elicitation methods such as observations, interviews, innovation games, and paper prototyping. The suggested solutions are expressed through paper prototypes which have been co-created and validated by the stakeholders in the supply chain during an iterative incremental process. Due to the identified need for more robust and automated solutions, the paper prototypes suggest unified solutions that (1) provide easy and automated access to the right information at the right time for all actors in the supply chain; (2) support easy detection of deviations; and (3) support decisions that can improve efficiency and deviation handling.
as recent advances from control theory, through cooperative game theory, distributed machine learning to holonic systems, cooperative enterprise modeling, system integration, and autonomous logistics processes are surveyed.
Xian et al. propose a scalable and reconfigurable solution architecture based on cloud- and IoT-enabled technologies for the logistics of an auction logistics center (ALC). This proposal integrates IoT hardware devices and software services to form a platform for planning and control at the ALC. It enables real-time visibility and traceability for pre/post-logistics operations. A case study shows that the services provided improve the efficiency and effectiveness of supervision and decision making. The major contribution of this paper is a cloud- and IoT-enabled architecture, consisting of three layers, to solve practical industrial challenges.
Conclusion
In summary, it has been shown that there is a huge potential for the use of a control tower, which is defined as an integrated platform where transportation companies and their stakeholders share information or data and connect different services, in the field of retail distribution. It is not only applicable in retail distribution; the potential benefits apply to other areas in the distribution domain as well. The potential benefits are real-time information sharing, improved efficiency and deviation handling, and support for complex decision-making tasks. However, none of the reviewed papers investigated a real-life implementation. The papers reviewed the existing literature, proposed a method for an integration platform, or tested it in a small environment that was set up for the purpose. Therefore, there is a gap in the literature regarding practical experiments with control towers in retail distribution.
2.3.2 Whitepapers on control towers
This section describes the used method to find relevant whitepapers and gives a summary of the whitepapers.
1. Sources: logistiek.nl, ict-en-logistiek.nl and dinalog.nl
2. Search terms: The following query was used to find relevant information or whitepapers:
(whitepaper OR simacan) AND
(’logistiek’ OR ’logistics’ OR ’transport’) AND
(’control tower’ OR ’open data’ OR ’platform’ OR ’integration’)
Publisher                    Topic                    Year
Siemens Digital Logistics    Control tower            2019
SyncroSupply                 Control tower            2018
ABN AMRO                     Open data                2016
Albert Heijn Online          Simacan Control tower    2016
PostNL                       Simacan Control tower    2016
KPN                          Smart logistics          2014
Table 2.6: Overview of whitepapers
3. Evaluation of articles: After identifying the initial set of papers, the papers were examined to see if they are indeed relevant to the current research. After this step, 4 papers remained.
Siemens published a whitepaper on a case study of a control tower [45]. Siemens stated that the potential uses of the control tower grow with each day. In the use case, an industrial enterprise with more than 20 production facilities uses a control tower to manage its global transports. This makes it possible to seamlessly map all transport activities between the company's facilities and its customers. In addition to the production facilities, the cloud-based IT solution also handles the supply chain integration of all the company's modes of transport and the transport service providers it hires. This not only lends visibility across the multimodal transport network, but it also unleashes the potential for optimization of the distribution processes. Various monitoring functions along the supply chain automatically check transport-related data and information. This makes it possible to know whether there is any deviation from the planned delivery date, whether transport-related documents such as waybills will be ready on time, and whether a shipment has been confirmed by the carrier by the stipulated deadline. Furthermore, Siemens stated that a control tower offers huge potential savings for logistics service providers and their customers in all sectors of the economy. Transport and cargo space capacities are used to maximum effect, routes are optimized using real-time data, and potential bottlenecks are avoided. According to Siemens, all this means that the control tower will undoubtedly play a key role in the supply chain of tomorrow.
SyncroSupply published a whitepaper on their implementation of a control tower [46]. Their solution is described below:
1. Keep everything in view
Often, unforeseen events, such as traffic holdups or breakdowns, disrupt the delivery schedule and this affects the operations of the day. That is why SyncroSupply also controls day-to-day operations and, thanks to intelligent optimization algorithms, always finds solutions for the following situations: (1) the truck is too early, too late, or arrives unannounced, (2) the truck brings material for other loading points or more material than planned, (3) the truck brings urgently needed bottleneck material.
2. Optimal truck throughput
on the expected time of arrival via an ETA calculation. SyncroSupply determines the optimal throughput sequence of all trucks based on the status of all vehicles and the utilization of the loading points.
3. Comprehensive process
With SyncroSupply, companies manage their delivery logistics as a comprehensive process from the truck supply control to the arrival of the trucks at the plant, through the loading or unloading up to the departure.
ABN AMRO published a whitepaper on the value of Open Data within the logistics sector [47]. According to the whitepaper, applying available data speeds up innovation, and more than ever companies must be willing to share data. Combining Open Data with existing, often static, closed data increases the quality of the information. This makes it possible to organize the logistics process even more efficiently. The following characteristics indicate a successful implementation of Open Data:
1. Belief that open data accelerates logistics innovation
2. Strive for continuous improvement
3. Transparent communication
Nowadays, logistics companies have gathered a lot of data. The challenge for these companies is to make this data commercially applicable. The importance of collaboration is clear to them (98% of the interviewed transportation firms believe in Open Data).
Data understanding
This chapter describes the dataset used throughout this research and covers steps 2 and 4 (data collection, exploratory data analysis) from Shmueli and Koppius [10]. Steps 3 and 4 are exchanged in this research, as stated in section 1.3.4. The data is explored using the standard operations in Alteryx and the Python library pandas-profiling [49]. Section 3.1 describes how the dataset was created and how it was collected. Section 3.2 describes the basic characteristics of the dataset, such as the data types. Section 3.3 describes the data exploration phase with tables, charts, and other visualisation tools to better understand the content of the dataset.
3.1 Data collection
The dataset used throughout this research is provided by Simacan’s Control Tower [50]. The Control Tower tracks real-time information on the trucks of Albert Heijn Transport and keeps track of the realization of the planning. In general, the data of the Control Tower consists of the planning data and the realization data. Section 3.2 describes the dataset in more detail. The data from the Control Tower is available from 1 December 2016 until 30 November 2018 for all four regional DCs. Since this research only focuses on the region that is handled by DC Pijnacker, the other DCs were filtered out. To extend the dataset from the AH/Simacan Control Tower, additional data sources are joined with it. These additional data sources are internal data from Albert Heijn, weather data from the Royal Netherlands Meteorological Institute (KNMI) and traffic data from Rijkswaterstaat (the Dutch institute for mobility).
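The filtering and joining described above could be sketched as follows. This is a minimal illustration, not the procedure actually used in this research (which relied on Alteryx): the rows are invented, the weather table with its columns `date` and `rain_mm` is hypothetical, and only the column names `DC_naam`, `datum_rit` and `ritid` are taken from the Control Tower metadata.

```python
import pandas as pd

# Invented miniature of the Control Tower export (values are not real data).
trips = pd.DataFrame({
    "DC_naam": ["Pijnacker", "OtherDC", "Pijnacker", "Pijnacker"],
    "datum_rit": pd.to_datetime(
        ["2016-11-30", "2017-05-01", "2017-05-01", "2018-11-30"]
    ),
    "ritid": ["r1", "r2", "r3", "r4"],
})

# Hypothetical daily weather table; column names are assumptions.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2017-05-01", "2018-11-30"]),
    "rain_mm": [2.4, 0.0],
})

# Keep only DC Pijnacker within the available period (both bounds inclusive).
start, end = pd.Timestamp("2016-12-01"), pd.Timestamp("2018-11-30")
in_scope = trips[
    (trips["DC_naam"] == "Pijnacker") & trips["datum_rit"].between(start, end)
]

# Left join keeps every trip, even when no weather record exists for that day.
extended = in_scope.merge(
    weather, how="left", left_on="datum_rit", right_on="date"
).drop(columns="date")

print(extended)
```

A left join is chosen here so that trips without a matching external record are kept with missing values rather than silently dropped.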
3.2 Data description
Name Rename Type
DC_naam DC string
land_winkel Country string
datum_rit Date datetime
dagtype Day string
ritid RouteID string
transporteur_naam Carrier string
voertuig VehicleID string
herkomst Origin string
bestemming Destination string
locatietype_bestemming DestinationType string
stopnummer StopNumber string
stopid_bestemming DestinationID integer
minuten_geplande_rijtijd_herkomst_naar_bestemming PlannedDrivetime integer
minuten_gerealiseerde_rijtijd_herkomst_naar_bestemming RealDrivetime integer
geplandestarttijd Startime datetime
geplandestarttijd Endtime datetime
minuten_geplande_verblijftijd PlannedDocking integer
minuten_gerealiseerde_verblijftijd RealDocking integer
lzv_boolean LZV boolean
Table 3.7: Overview of the metadata
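As an illustration of how the renaming in Table 3.7 could be applied, the sketch below uses pandas. It is an assumption that the raw export is loaded as a DataFrame with the original Dutch column names; the example row is invented. Because Table 3.7 lists `geplandestarttijd` for both Startime and Endtime, the two time columns are left out of the mapping here.

```python
import pandas as pd

# Mapping from the original column names to the renamed ones (Table 3.7).
RENAME = {
    "DC_naam": "DC",
    "land_winkel": "Country",
    "datum_rit": "Date",
    "dagtype": "Day",
    "ritid": "RouteID",
    "transporteur_naam": "Carrier",
    "voertuig": "VehicleID",
    "herkomst": "Origin",
    "bestemming": "Destination",
    "locatietype_bestemming": "DestinationType",
    "stopnummer": "StopNumber",
    "stopid_bestemming": "DestinationID",
    "minuten_geplande_rijtijd_herkomst_naar_bestemming": "PlannedDrivetime",
    "minuten_gerealiseerde_rijtijd_herkomst_naar_bestemming": "RealDrivetime",
    "minuten_geplande_verblijftijd": "PlannedDocking",
    "minuten_gerealiseerde_verblijftijd": "RealDocking",
    "lzv_boolean": "LZV",
}

# Invented single-row example with a subset of the columns, read as text.
raw = pd.DataFrame({
    "datum_rit": ["2018-03-30"],
    "ritid": ["r1"],
    "minuten_geplande_rijtijd_herkomst_naar_bestemming": ["45"],
})

df = raw.rename(columns=RENAME)

# Cast to the types listed in Table 3.7.
df["Date"] = pd.to_datetime(df["Date"])
df["PlannedDrivetime"] = df["PlannedDrivetime"].astype(int)

print(df.dtypes)
```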
Figure 3.1 gives an overview of a given route with example data. In this case the truck started the route at 0:00. It starts with loading at a dock at the DC, after which it departs towards its first destination, in this case Store1. In the Control Tower the arrival at Store1 is saved as a timestamp. The travel time is calculated from the departure time at the DC and the arrival time at the store. In this case the travel time between the DC and Store1 is 45 minutes. The time between the arrival of a truck at a store and its departure from that store is the docking time at a stop. In this example the load/unload time at Store1 is 30 minutes.
[Figure 3.1: Timeline of an example route]
Start (Loading_DC) 0:00
End_Loading_DC 0:15
Arrival_Store1 1:00
Departure_Store1 1:30
Arrival_Store2 3:00
Departure_Store2 3:30
End (Arrival_MC) 4:30
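The travel and docking times of the example route in Figure 3.1 can be reproduced with a few lines of Python. This is a minimal sketch: the event names follow the figure, the times are expressed as minutes since the start of the route, and the helper functions are hypothetical names introduced only for this illustration.

```python
# Timestamps of the example route, in minutes since the start of the route.
timeline = {
    "Start_Loading_DC": 0,
    "End_Loading_DC": 15,
    "Arrival_Store1": 60,
    "Departure_Store1": 90,
    "Arrival_Store2": 180,
    "Departure_Store2": 210,
    "Arrival_MC": 270,
}


def travel_minutes(departure_event: str, arrival_event: str) -> int:
    """Travel time: arrival at the next stop minus departure from the previous one."""
    return timeline[arrival_event] - timeline[departure_event]


def docking_minutes(stop: str) -> int:
    """Docking time: departure from a stop minus arrival at that same stop."""
    return timeline[f"Departure_{stop}"] - timeline[f"Arrival_{stop}"]


print(travel_minutes("End_Loading_DC", "Arrival_Store1"))  # DC to Store1
print(docking_minutes("Store1"))                           # load/unload at Store1
```

For the example data this yields the 45 minutes of travel time between the DC and Store1 and the 30 minutes of docking time at Store1 mentioned in the text.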
3.3 Data exploration
To verify the data quality of the given dataset and to understand the content of the dataset, the variables are explored individually. This step is reported in section 3.3.1 (variables), 3.3.2 (transformed variables) and section 3.3.3 (external variables).
3.3.1 Variables in the dataset
This section describes the variables in the dataset provided by Simacan. The Python library pandas-profiling is used to explore the variables [49].
DC: Since this research only focuses on DC Pijnacker, the variable DC is constant and has only one value, so it should be ignored in the analysis.
Country: The shops that are supplied by DC Pijnacker are all located in the Netherlands. Therefore the variable Country is constant and has only one value, so it should be ignored in the analysis.
Date: The variable Date gives the corresponding date for a given ride in the dataset. The type of this variable is datetime. There are 699 distinct dates in the dataset (see table 3.8). The date 3/30/18 is the most common value in the dataset (see table 3.9).
Distinct count 699
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Table 3.8: Variable: day
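The per-variable statistics reported in this section (distinct count, unique percentage, missing values, most common values) can also be computed directly with pandas. The sketch below is an illustration on invented rows, not the profiling code used in this research. It assumes that "Unique (%)" means the number of distinct values divided by the number of rows; `summarize` is a hypothetical helper name.

```python
import pandas as pd


def summarize(series: pd.Series) -> dict:
    """Basic profile of one variable (assumed definitions, see lead-in)."""
    n = len(series)
    distinct = series.nunique(dropna=True)
    missing = int(series.isna().sum())
    return {
        "distinct_count": distinct,
        "unique_pct": 100 * distinct / n,
        "missing_n": missing,
        "missing_pct": 100 * missing / n,
        "is_constant": distinct <= 1,  # constant variables carry no information
    }


# Invented example rows using the renamed columns DC and Date.
df = pd.DataFrame({
    "DC": ["Pijnacker"] * 4,
    "Date": pd.to_datetime(
        ["2018-03-30", "2018-03-30", "2018-05-11", "2018-03-29"]
    ),
})

date_profile = summarize(df["Date"])
most_common = df["Date"].value_counts().head(3)  # counts per date, descending
constant_cols = [c for c in df.columns if summarize(df[c])["is_constant"]]

print(date_profile)
print(most_common)
print(constant_cols)
```

Columns flagged as constant, such as DC in this example, are candidates to be dropped before modelling.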
Value Count Frequency (%)
3/30/18 3220 0.3%
5/11/18 3075 0.3%
3/29/18 3003 0.3%
Other values (696) 957089 99.1%
Table 3.9