
3.3 Data exploration

3.3.1 Variables in the dataset

This section describes the variables in the dataset provided by Simacan. The Python library pandas-profiling is used to explore the variables [49].

DC: Since this research focuses only on DC Pijnacker, the variable DC is constant (it has only one value) and should be ignored in the analysis.

Country: The shops that are supplied by DC Pijnacker are all located in the Netherlands. Therefore the variable Country is constant (it has only one value) and should be ignored in the analysis.
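Constant variables such as DC and Country can also be detected programmatically rather than by inspection. The following is a minimal standard-library sketch, not the procedure used in this research; the function name `constant_columns` and the sample records (including the values "Pijnacker" and "NL") are hypothetical.

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    if not rows:
        return []
    return [
        name
        for name in rows[0]
        if len({row[name] for row in rows}) == 1
    ]


# Hypothetical records; the real dataset has many more rows and columns.
rides = [
    {"DC": "Pijnacker", "Country": "NL", "Day": "vr"},
    {"DC": "Pijnacker", "Country": "NL", "Day": "za"},
    {"DC": "Pijnacker", "Country": "NL", "Day": "vr"},
]

# Drop every constant column from every record.
dropped = constant_columns(rides)
cleaned = [{k: v for k, v in row.items() if k not in dropped} for row in rides]
```

In this sample only Day varies, so DC and Country are dropped.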

Date: The variable Date gives the date of a given ride in the dataset. The type of this variable is datetime. There are 699 distinct dates in the dataset (see table 3.8). The date 3/30/18 is the most common value in the dataset (see table 3.9).

Distinct count 699

Unique (%) 0.1%

Missing (%) 0.0%

Missing (n) 0

Table 3.8: Summary statistics of the variable Date

Value Count Frequency (%)

3/30/18 3220 0.3%

5/11/18 3075 0.3%

3/29/18 3003 0.3%

Other values (696) 957089 99.1%

Table 3.9: Most frequent values of the variable Date
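The summary and frequency tables in this section follow one pattern, which can be sketched with the standard library as shown below. This is an illustration, not the code of the profiling library. It assumes that all percentages are taken relative to the total number of rows, which is consistent with the figures reported here. The function name `summarize` and the sample values are hypothetical.

```python
from collections import Counter


def summarize(values, top=3):
    """Summarise one column; None marks a missing value."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    missing = total - len(present)
    return {
        "distinct_count": len(counts),
        "unique_pct": 100 * len(counts) / total,
        "missing_n": missing,
        "missing_pct": 100 * missing / total,
        # (value, count, frequency %) for the most common values
        "top": [(v, n, 100 * n / total) for v, n in counts.most_common(top)],
    }


# Hypothetical sample, not taken from the Simacan dataset.
dates = ["3/30/18"] * 4 + ["5/11/18"] * 3 + ["3/29/18"] * 2 + [None]
summary = summarize(dates, top=2)
```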

Day: The variable Day gives the day of the week of a given ride in the dataset. The type of this variable is categorical (string). There are 7 distinct values (see table 3.10), which correspond to the days of the week, recorded as Dutch abbreviations (ma, di, wo, do, vr, za and zo for Monday through Sunday). Friday (vr) has the highest frequency in the dataset and Sunday (zo) the lowest (see table 3.11).

Distinct count 7

Unique (%) 0.0%

Missing (%) 0.0%

Missing (n) 0

Table 3.10: Summary statistics of the variable Day

Value Count Frequency (%)

vr 168306 17.1%

za 160660 16.3%

do 151686 15.4%

wo 147064 14.9%

di 141024 14.3%

ma 139212 14.1%

zo 76435 7.8%

Table 3.11: Frequencies of the variable Day
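Because Day is determined by Date, the two variables can be checked against each other. The sketch below maps a date string to the Dutch weekday abbreviation used in the dataset. The month/day/two-digit-year format is an assumption based on values such as 3/30/18, and the function name `day_from_date` is hypothetical.

```python
from datetime import datetime

# Dutch weekday abbreviations, indexed by datetime.weekday() (Monday = 0).
DUTCH_DAYS = ["ma", "di", "wo", "do", "vr", "za", "zo"]


def day_from_date(text):
    """Map a date such as '3/30/18' to its Dutch weekday abbreviation."""
    # Assumed format: month/day/two-digit year.
    return DUTCH_DAYS[datetime.strptime(text, "%m/%d/%y").weekday()]


# The three most frequent dates in the dataset.
busiest = {d: day_from_date(d) for d in ("3/30/18", "5/11/18", "3/29/18")}
```

Under this assumption the two most frequent dates are Fridays and the third is a Thursday, which agrees with Friday being the most frequent day.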

RouteID: The variable RouteID is a unique identifier for every route in the dataset. Because it only identifies the route, a machine learning algorithm cannot learn anything from it.

Carrier: The variable Carrier gives the name of the external carrier of a given ride in the dataset. The type of this variable is categorical (string). There are 22 different carriers in the dataset (see table 3.12). Simon Loos B.V., Post-Kogeko B.V. and v.d.Brink%Zn. are the most common carriers for rides in the area of DC Pijnacker (see table 3.13).

Distinct count 22

Unique (%) 0.0%

Missing (%) 0.0%

Missing (n) 0

Table 3.12: Summary statistics of the variable Carrier

Value Count Frequency (%)

1 354136 36.0%

2 273956 27.8%

3 254248 25.8%

4 45202 4.6%

5 29819 3.0%

6 8783 0.9%

7 6487 0.7%

8 5777 0.6%

9 3908 0.4%

10 796 0.1%

Other values (12) 1275 0.1%

Table 3.13: Frequencies of the variable Carrier

VehicleID: The variable VehicleID gives the license plate of the trailer of a given ride in the dataset. The type of this variable is categorical (string). There were 858 different trailers active in the area of Pijnacker (see table 3.14). License plate '78BHT4' is the most used one during the time window of the dataset (see table 3.15). Table 3.14 also shows the missing values: for the variable VehicleID, 7.6% of the data points (74883) are missing.

Distinct count 858

Unique (%) 0.1%

Missing (%) 7.6%

Missing (n) 74883

Table 3.14: Summary statistics of the variable VehicleID

Value Count Frequency (%)

1 12409 1.3%

2 11721 1.2%

3 11338 1.2%

Other values (847) 874036 88.8%

(Missing) 74883 7.6%

Table 3.15: Most frequent values of the variable VehicleID

Origin: The variable Origin gives the starting point of a given ride in the dataset. The type of this variable is categorical (string). There were 264 different origins in the area of Pijnacker (see table 3.16). DCP is the point where most rides start, which is logical, since all rides start at the DC (see table 3.17). Table 3.16 also shows the missing values: for the variable Origin, 28.1% of the data points (276830) are missing. These values are missing because of the first stop of a route: the first leg runs from DCP (a null value in the data) to DCP and gives the loading time at a dock.

Distinct count 264

Unique (%) 0.0%

Missing (%) 28.1%

Missing (n) 276830

Table 3.16: Summary statistics of the variable Origin

Value Count Frequency (%)

DCP 289376 29.4%

8991 10364 1.1%

1624 4239 0.4%

Other values (253) 403578 40.9%

(Missing) 276830 28.1%

Table 3.17: Most frequent values of the variable Origin

Destination: The variable Destination gives the end point of a given ride in the dataset. The type of this variable is categorical (string). There were 267 different destinations in the area of Pijnacker (see table 3.18). Every ride ends at DCP or MCP, hence the frequencies of these two values are high (see table 3.19).

Distinct count 267

Unique (%) 0.0%

Missing (%) 0.0%

Missing (n) 0

Table 3.18: Summary statistics of the variable Destination

Value Count Frequency (%)

DCP 289376 29.4%

MCP 276707 28.1%

8991 10375 1.1%

1624 4239 0.4%

Other values (257) 407929 41.0%

Table 3.19: Most frequent values of the variable Destination

DestinationType: The variable DestinationType gives the type of destination of a given ride in the dataset, that is, whether the destination is a store or the DC. The type of this variable is categorical (string). There are only 3 different destination types (see table 3.20): store (WINKEL), DC and MC. The frequencies are given in table 3.21.

Distinct count 3

Unique (%) 0.0%

Missing (%) 0.0%

Missing (n) 0

Table 3.20: Summary statistics of the variable DestinationType

Value Count Frequency (%)

WINKEL 418282 42.5%

DC 289376 29.4%

MC 276729 28.1%

Table 3.21: Frequencies of the variable DestinationType

Stopnumber: The variable Stopnumber indicates the number of stops in the trip. The type of this variable is numerical. There are 10 distinct values, ranging from a minimum stop number of 1 to a maximum of 10 (see table 3.22). Table 3.23 gives the frequencies of the stop numbers. Most routes in the dataset consist of one, two or three stops.

Distinct count 10

Unique (%) 0.0%

Missing (%) 0.0%

Missing (n) 0

Table 3.22: Summary statistics of the variable Stopnumber

Value Count Frequency (%)

1 276731 28.1%

2 276821 28.1%

3 276757 28.1%

4 128830 13.1%

Other values 25244 2.6%

Table 3.23: Frequencies of the variable Stopnumber

DestionationID: The variable DestionationID is a unique identifier for every destination in the dataset. Because it only identifies the destination, a machine learning algorithm cannot learn anything from it.

PlannedDriveTime: The variable PlannedDriveTime gives the planned driving time in minutes for a given leg in the dataset. The type of this variable is numerical (integer). There are 629 distinct values and no missing values (see table 3.24). Table 3.25 gives the quantile statistics of this variable. The minimum is -3146 minutes and the maximum is 1439 minutes. These extreme values are outliers: they lie abnormally far outside the overall pattern of the distribution of the variable. Section 4.2 describes the steps taken to deal with outliers in the dataset. The median of the planned drive time is 27 minutes and the interquartile range spans the interval [0, 42].

Distinct count 629

Unique (%) 0.1%

Missing (%) 0.0%

Missing (n) 0

Table 3.24: Summary statistics of the variable PlannedDriveTime

Minimum -3146

5-th percentile 0

Q1 0

Median 27

Q3 42

95-th percentile 75

Maximum 1439

Range 4585

Interquartile range 42

Table 3.25: Quantile statistics of the variable PlannedDriveTime
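The quantile statistics in such a table can be reproduced with the standard library, as sketched below. The sketch assumes linear interpolation between data points (`method="inclusive"`), which may differ from the method used by the profiling library. The function name `quantile_stats` and the sample drive times are hypothetical and are not taken from the dataset.

```python
from statistics import quantiles


def quantile_stats(values):
    """Return the quantile statistics listed in the profiling tables."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    twentieths = quantiles(data, n=20, method="inclusive")
    return {
        "minimum": data[0],
        "p5": twentieths[0],    # 5th percentile
        "q1": q1,
        "median": median,
        "q3": q3,
        "p95": twentieths[-1],  # 95th percentile
        "maximum": data[-1],
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
    }


# Hypothetical drive times in minutes, including one negative value.
stats = quantile_stats([-10, 0, 0, 20, 27, 30, 42, 45, 90])
```

The range is the maximum minus the minimum and the interquartile range is Q3 minus Q1, which matches the reported values (1439 - (-3146) = 4585 and 42 - 0 = 42).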

RealDriveTime: The variable RealDriveTime gives the realized driving time in minutes for a given leg in the dataset. The type of this variable is numerical (float). There are 88408 distinct values and no missing values (see table 3.26). Table 3.27 gives the quantile statistics of this variable. The minimum is -1429 minutes and the maximum is 2945 minutes. These extreme values are outliers: they lie abnormally far outside the overall pattern of the distribution of the variable. Section 4.2 describes the steps taken to deal with outliers in the dataset. The median of the realized drive time is 19 minutes and the interquartile range spans the interval [0, 35].

Distinct count 88408

Unique (%) 9.0%

Missing (%) 0.0%

Missing (n) 0

Table 3.26: Summary statistics of the variable RealDriveTime

Minimum -1429

5-th percentile 0

Q1 0

Median 19.043

Q3 35.083

95-th percentile 74.153

Maximum 2945

Range 4374

Interquartile range 35.083

Table 3.27: Quantile statistics of the variable RealDriveTime
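To illustrate how far the extremes lie from the bulk of the data, the sketch below applies Tukey's rule (1.5 times the interquartile range beyond the quartiles) to the quartiles of RealDriveTime. This is a common rule of thumb chosen here for illustration only. The outlier treatment actually used in this research is the one described in Section 4.2, which may differ. The function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds outside which values are flagged."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of RealDriveTime as reported in table 3.27.
lower, upper = tukey_fences(0, 35.083)  # roughly -52.6 and 87.7 minutes

# The reported minimum and maximum fall outside the fences; the median does not.
flagged = [x for x in (-1429, 19.043, 2945) if x < lower or x > upper]
```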

StartTime: The variable StartTime is the start datetime of each leg in the dataset. Since this datetime is unique to each leg, a machine learning algorithm cannot learn anything from it.

EndTime: The variable EndTime is the end datetime of each leg in the dataset. Since this datetime is unique to each leg, a machine learning algorithm cannot learn anything from it.

PlannedDocking: The variable PlannedDocking gives the planned docking time in minutes at the destination of a given leg in the dataset. The type of this variable is numerical (integer). There are 116 distinct values and no missing values (see table 3.28). Table 3.29 gives the quantile statistics of this variable. The minimum is 1 minute and the maximum is 2926 minutes. These extreme values are outliers: they lie abnormally far outside the overall pattern of the distribution of the variable. Section 4.2 describes the steps taken to deal with outliers in the dataset. The median of the planned docking time is 26 minutes and the interquartile range spans the interval [16, 34].

Distinct count 116

Unique (%) 0.0%

Missing (%) 0.0%

Missing (n) 0

Table 3.28: Summary statistics of the variable PlannedDocking

Minimum 1

5-th percentile 13

Q1 16

Median 26

Q3 34

95-th percentile 51

Maximum 2926

Range 2925

Interquartile range 18

Table 3.29: Quantile statistics of the variable PlannedDocking

RealDocking: The variable RealDocking gives the realized docking time in minutes at the destination of a given leg in the dataset. The type of this variable is numerical (float). There are 121227 distinct values and no missing values (see table 3.30). Table 3.31 gives the quantile statistics of this variable. The minimum is about -20 minutes and the maximum is 1748 minutes. These extreme values are outliers: they lie abnormally far outside the overall pattern of the distribution of the variable. Section 4.2 describes the steps taken to deal with outliers in the dataset. The median of the realized docking time is 25 minutes and the interquartile range spans approximately the interval [12, 40].

Distinct count 121227

Unique (%) 12.3%

Missing (%) 0.0%

Missing (n) 0

Table 3.30: Summary statistics of the variable RealDocking

Minimum -20.377

5-th percentile 0

Q1 11.695

Median 25

Q3 40

95-th percentile 60.15

Maximum 1748

Range 1768.4

Interquartile range 28.305

Table 3.31: Quantile statistics of the variable RealDocking

LZV: This variable indicates whether the truck is a long combination vehicle. In the area of Pijnacker, long combination vehicles are not used. Therefore the variable LZV is constant (it has only one value) and should be ignored in the analysis.

In document Applying machine learning on the data of a controltower in a retail distribution landscape (Page 42-48)