Functional Analysis of Real World Truck Fuel Consumption Data


Technical Report, IDE0806, January 2008

Master’s Thesis in Computer Systems Engineering

Georg Vogetseder

School of Information Science, Computer and Electrical Engineering Halmstad University

Box 823, S-301 18 Halmstad, Sweden


Acknowledgement

If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family anatidae on our hands.

Douglas Adams (1952-2001)

Thanks to my family, especially my mother Eva and friends.


Abstract

This thesis covers the analysis of sparse and irregular fuel consumption data of long distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables the use of observations from many trucks to compensate for the sparsity of observations in order to get continuous results.

The principal component scores generated by PACE can then be used to get rough estimates of the trajectories for single trucks as well as to detect outliers.

The data centric approach of PACE is very useful to enable functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view on the data.


List of Figures

3.1 Fuel Consumption between Observations
3.2 Fuel consumption plot generated from the raw data
3.3 Histograms of the original and the cleaned data
3.4 Fuel consumption plot generated from the clean data
3.5 Scatter plot and histograms
3.6 Histogram of the distance between observations
4.1 Distribution and mean/variance of binned data
4.2 Boxplots of binned data
4.3 Outlier detection based on feature extraction
4.4 Straight line fitting
4.5 Plot of mean function and principal components
4.6 Scree Plot
4.7 Smoothed covariance matrix
4.8 Reconstructed curves versus mean function and raw observations of selected trucks
4.9 Reconstructed curves and raw measurements for all trucks
4.10 Reconstructed traces of misfitted trucks
4.11 Comparison of reconstructed trajectories with differing number of PCs
4.12 Reconstructed trajectories without measurement error assumed
4.13 A comparison of µ with different smoothing kernels
4.14 A comparison of 3 PCs with different smoothing kernels
4.15 Distribution of all mean curves
4.16 Graph of all mean curves
4.17 Trucks with a high influence on the results of PACE
4.18 Data variance
4.19 Normal Distribution Plots of the PC scores
4.20 Histograms of the probability of trucks
4.21 Samples of truck probability
4.22 PACE Results of Speed Data
4.23 PACE Results on Seasonal Fuel Consumption
4.24 Selected trucks from the Seasonal Fuel Consumption Data

List of Tables

4.1 MSE of PACE with 8 principal components
4.2 MSE of PACE with 3 PCs
4.5 MSE of PACE with 8 PCs and error cut-off

1

Introduction

1.1

Background

The original idea for analyzing this data came from Volvo Parts AB, one of the main business units of Volvo Group AB. The role of Volvo Parts is to provide solutions and tools to the after-market, which includes vehicle electronics diagnostic tools. When a truck is in the workshop, the vehicle electronics data is read out from the truck using diagnostics tools from Volvo Parts and transmitted to a central database.

This data, which is collected from the vehicle's electronic systems, is called logged vehicle data (LVD); it originates from sensors within the truck. Several electronic subsystems supply information for LVD, which can include data from the electronic suspension, the transmission, and most importantly from the Engine Electric Control Unit. The current main use of LVD is seemingly just basic analysis, e.g. remote diagnostics of faulty components and simple statistics.

One of the problems with analysing LVD is the relative lack of observations. The source of this lack of information is the data retrieval process. The procedure is a time consuming process, making it a cost factor for the workshops. The time consumption affects the adoption rate of this procedure in the field negatively, which leads to the data composition detailed in Section 3.1.

The basic idea behind the problems detailed in this thesis is to expand the usefulness of the data for Volvo Parts, to retrieve additional new information from it and to provide means to access this information. This is done by using recent advanced statistical techniques. As a starting point to the application of these techniques, the analysis of the fuel consumption data contained in LVD was suggested.

Fuel consumption data is very interesting from a statistical point of view. This interest stems from fuel consumption being a major cost factor, as well as being influenced by a high number of other factors, such as:

• Usage patterns of the operator, i.e. the driving style and habits

• Maintenance of the truck

• Gross Combination Weight usage, i.e. the cargo of the truck

• Environment, i.e. hilliness, road condition, etc.

The influence of these and more factors makes this data a good indicator. But the mass of influences also makes exact determination of the underlying cause impossible. Additionally, some of these influences might cancel each other out, thus removing information. If it is possible to extract information from fuel consumption data, then it should work for the rest of the data too.

1.2

Motivation and Novelty

From LVD, it should be possible to extract information on hidden trends, i.e. the principal components (see Section 2.1.1) that are common to all similar trucks. Based on these components, it should be possible to determine if a truck is unrelated to other trucks, i.e. an outlier, and to predict future developments in fuel consumption when the truck's behavior is similar to that of other vehicles.

It is very easy to take the last observation of each truck in a group of similar trucks to determine abnormal fuel consumption, but it is hardly possible to calculate underlying trends or other information from these facts.

To discover information like trends or outliers from LVD, the data of a truck has to include not only the last observation available, but also past ones. These requirements, multiple observations of a truck and a set of similar trucks, lead to the irregular and sparse structure of the data used in this thesis. The data is described in more detail in Section 3.1.

The analysis of this data can be done in at least two ways. The most obvious choice in methodology would be the use of multivariate statistics, but for several reasons detailed below, the central methodology for this thesis is functional statistics. Functional statistics focuses on analysing the data as functions, rather than a set of discrete values.

Multivariate statistics are a set of methods which work on more than one variable at a time. Some examples for these methods are regression analysis, principal components analysis and artificial neural networks. Principally, functional statistics are also part of this set, as both have multiple variables as input. However, the focus on handling the input variables as continuous functions rather than arbitrary variables separates those two fields.

As trucks are not observed in the workshop at regular intervals, i.e. the observations cannot be fitted to a grid, it is difficult to incorporate all information from the input into variables for use in multivariate statistics. Therefore, features like mean, variance, duration of all observations, date of first observation, odometer count at the last observation, etc. have to be extracted from the data to be able to do analysis. Inevitably, the extraction of this knowledge leads to information loss, which is problematic on this already sparse data. The process of discovery and selection of important features for multivariate analysis is very difficult and time consuming. It is crucial to extract and select the best and most important features from the data to minimize the data loss and maximize the information content of the features for the success of all further steps in analysis. Feature extraction creates an additional layer of data processing and introduces a large number of tunable knobs.

Functional Data Analysis (FDA), on the other hand, preserves the information present in the data and does not need feature extraction at all. Furthermore, it facilitates a more natural handling of the data, describing not only more or less abstract features of the data, but a function which resembles the data. The choice of using functional over multivariate data analysis is also motivated by the ability to analyze the functional properties of the data, e.g. derivatives of the data. Additionally, FDA does not introduce a high number of additional parameters, unlike multivariate analysis.

However, multivariate analysis has an advantage over FDA when a high number of different functions have to be analysed at the same time. FDA has problems in visualizing this higher dimensional data, and it requires a high amount of data for each dimension (curse of dimensionality).

The most important step in FDA is the transformation of the discrete data to a functional basis. Again, the irregular and sparse nature of the data makes this transformation difficult. To be able to perform FDA on this data, a method called Principal Components Analysis through Conditional Expectation (PACE) is applied. The foundation of PACE is the assumption that a smooth function is underlying the sparse data. Under this assumption, it is possible to use even irregular data for the discovery of principal components.

The main novel aspect of this thesis is the application of FDA and PACE to automotive data. Previously it has successfully been applied to biological data, economic processes, bidding in online auction houses, but not automotive data. PACE itself is highly interesting to be applied to the data at hand, because it is able to work on it without the need for feature extraction or regular observations.

The methods used in this work can be used to describe the actual fuel consumption of the observed trucks in customer hands. This means the methods applied to LVD are driven by data and not by a model.

1.3

Related Work

General sources of information on data analysis related to this work are The Elements of Statistical Learning [1], Functional Data Analysis [2] and Nonparametric Functional Data Analysis [3].

The single most important paper related to this work is Functional Data Analysis for Sparse Longitudinal Data [4], which proposed the method PACE and applied it to yeast cell cycle gene expression data and to longitudinal CD4 cell percentages. The percentage is used as a marker for the progress of AIDS in adults.

Functional Data Analysis for Sparse Auction Data [5] combines the PACE approach with linear regression to predict closing prices of online auctions.

The most related of the few public papers on fuel consumption in heavy trucks is Heavy Truck Modeling for Fuel Consumption Simulations and Measurements [6]. This work deals with building a simulation model of fuel consumption. Another paper, which discusses methods to reduce idle fuel consumption in North American long distance trucks and highlights typical driver behavior, is Analysis of Technology Options to Reduce the Fuel Consumption of Idling Trucks [7].

Additional information on doing PCA on sparse and irregular data can be found in Principal component models for sparse functional data [8] and Sparse Principal Component Analysis [9]. More related to PACE is Properties of principal component methods for functional and longitudinal data analysis [10]. Another paper which is related to the estimation of Functional Principal Component Scores is [11]. Knowledge relating to linear regression analysis for longitudinal data can be found in [12].

1.4

Limitations

The scope of this thesis is to research the possibilities for the application of FDA methods to the sparse and irregular automotive data from LVD. It is outside of the scope of this thesis to establish a conclusive theory about a true long term fuel consumption model of all truck engines.

The conclusive, globally valid model is impossible because of a relatively low number of individuals in the data, as well as a limited observation duration and possible differences in usage patterns of the trucks, i.e. vehicles with a high mileage in a limited time span do not necessarily exhibit a similar fuel consumption to low mileage trucks in the same time span.


1.5

Outline

The next chapter, "Methods", describes the central methods used. This includes underlying basic methods as well as the foundations of FDA and PACE. Chapter 3, "Application", provides a description of the data used in this thesis and includes information on the interplay of the proposed methods and the data. Chapter 4 provides comprehensive information on the results. The last two chapters, "Conclusion" and "Discussion", wrap up the results from this thesis and provide an outlook on possible continuations of the research.

2

Methods

This chapter is divided into three parts. General Statistical Methods describes non-functional methods which are fundamental to this work. Functional Data Analysis provides an introduction into this field. The final part, Principal Components Analysis through Conditional Expectation gives an overview of this crucial method.

2.1

General Statistical Methods

This section introduces general statistical concepts used in this thesis and a number of tools to visualize data and test results.

2.1.1

Principal Component Analysis

One of the fundamental methods for analysing LVD is the Karhunen-Loève transformation, widely known as Principal Component Analysis (PCA). PCA is also the foundation of Functional Principal Component Analysis (FPCA) [1, 13].

Basically, PCA is a method to explore data by finding the most important ways in which the variables in the data differ from one another. It can compress the data by discovering a low number of linear combinations of input variables which contribute most to the variability of the input. These linear combinations are found by constructing a linear basis for the data where the retained variability is maximal.

Mathematically speaking, the goal is to reduce or compress high dimensional data X to lower dimensional data Y.

To do this reduction, a number of algorithms are available; here, a method involving the calculation of the covariance is described.

The first step is to calculate the mean $\mu_i$ of each variable:

$$\mu_i = \frac{1}{K_i} \sum_{j=1}^{K_i} x_{ij}, \qquad i = 1, \dots, N$$

where $N$ denotes the number of variables and $K_i$ the number of observations in one variable.

Subsequently, µ is subtracted from every observation in X; the centred data is denoted as $X - \bar{X}$.

In the next step the covariance matrix $\operatorname{cov}(X - \bar{X})$ has to be calculated. Covariance is a measure of how two variables vary together. If the two variables deviate from their means in the same direction (i.e. with the same sign), the covariance will be positive. If, on the other hand, the deviations have opposite signs, the covariance will be negative. The covariance matrix is the result of calculating the covariance for all pairs of variables. The resulting matrix describes how strongly each pair of variables varies together.

To find a mapping M that is able to transform the high dimensional data into low dimensional data, the M that maximizes $M^T \operatorname{cov}(X - \bar{X}) M$ has to be found. It can be shown that the best (variance maximizing) mapping is formed by the eigenvectors of the covariance matrix. Hence, PCA has to solve the eigenproblem to get the transformation matrix.

$$\operatorname{cov}(X - \bar{X})\, M = \lambda M$$

The eigenproblem has to be solved d times with different principal eigenvalues λ to get the d principal eigenvectors (or principal components). The low dimensional representation Y can then be computed by simple multiplication:

$$Y = M^T (X - \bar{X})$$

where the columns of M are the d principal eigenvectors and the rows of X correspond to the variables.
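The covariance-based procedure described in Section 2.1.1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used in the thesis: the function name `pca_via_covariance` is hypothetical, and the data matrix is assumed to hold one variable per row and one observation per column, matching the indices of $x_{ij}$.

```python
import numpy as np


def pca_via_covariance(X, d):
    """Reduce X (N variables x K observations) to d dimensions.

    Returns the mapping M (N x d), the d largest eigenvalues and
    the low dimensional representation Y (d x K).
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=1, keepdims=True)      # mean of each variable
    Xc = X - mu                             # centred data, X - X_bar
    C = np.cov(Xc)                          # N x N covariance matrix (rows are variables)
    eigvals, eigvecs = np.linalg.eigh(C)    # symmetric eigenproblem, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]   # indices of the d largest eigenvalues
    M = eigvecs[:, order]                   # columns are the principal eigenvectors
    Y = M.T @ Xc                            # projection onto the principal components
    return M, eigvals[order], Y


if __name__ == "__main__":
    rng = np.random.default_rng(0)          # synthetic data, for illustration only
    X = rng.normal(size=(3, 50))
    M, lam, Y = pca_via_covariance(X, d=2)
    print(M.shape, Y.shape)
```

The variance of each row of Y equals the corresponding eigenvalue, which is what makes the eigenvalues usable as a measure of retained variability.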

2.1.2

Hierarchical Clustering

Hierarchical clustering is a relatively simple method [1] to segment data into related groups. Clustering is used within this thesis for testing if differing clusters of trucks can be found from extracted features. Hierarchical clustering needs a dissimilarity measure between the elements. The standard for measuring the dissimilarity is the Euclidean distance, which is also used in this thesis.

When the distance between all possible pairs of elements is calculated, the clusters can be built. For building these clusters, there are two different approaches. The agglomerative approach starts with as many clusters as there are individuals and successively merges them. The divisive method starts with one big cluster which is then split into smaller clusters. Agglomerative methods are guaranteed to have a monotonically increasing level of dissimilarity between merged clusters, growing with the level of merging. This property is not guaranteed for divisive approaches.

The second choice for building the clusters is to decide on the measurement for the distance between two clusters:

• Single Linkage – The link between the clusters is defined by the smallest distance between elements in the two clusters.

• Complete Linkage – The link is defined by the largest distance between elements in the two clusters, the opposite of the first method.

• Average Linkage – Uses the average distance between all pairs of elements in both clusters.
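The three linkage criteria listed above differ only in how the pairwise Euclidean distances between two clusters are summarised. A minimal sketch in plain Python follows; the function name `cluster_distance` and the example points are hypothetical and chosen purely for illustration.

```python
import math
from itertools import product


def cluster_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters of points under a given linkage."""
    # Euclidean distance between every element of A and every element of B.
    pairwise = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if linkage == "single":
        return min(pairwise)                    # smallest distance
    if linkage == "complete":
        return max(pairwise)                    # largest distance
    if linkage == "average":
        return sum(pairwise) / len(pairwise)    # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")


if __name__ == "__main__":
    a = [(0.0, 0.0), (1.0, 0.0)]
    b = [(3.0, 0.0), (5.0, 0.0)]
    for name in ("single", "complete", "average"):
        print(name, cluster_distance(a, b, name))
```

An agglomerative procedure would repeatedly merge the two clusters with the smallest such distance until only one cluster remains.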

2.1.3

Validation Methods

A number of methods to validate the results and to estimate variation were used in the scope of this thesis. These include brief usage of bootstrap, jackknife and various cross validation methods, such as k-fold and leave-one-out[1].

Bootstrapping is the process of randomly picking samples from the given observations with replacement, i.e. a single observation can be chosen multiple times. The goal of a bootstrap is to approximate the distribution from these samples.

Jackknifing can be used to estimate the bias and standard error. Jackknife is very similar to k-fold and leave-one-out cross validation, as it systematically removes one or more observations from a sample and then recalculates the results once for every possible choice of removed observations.
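Both resampling ideas can be illustrated on the sample mean. The sketch below is an assumption-laden illustration, not the procedure used in the thesis: the function names are hypothetical, the sample values are made up, and only the leave-one-out variant of the jackknife is shown.

```python
import math
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement from the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def jackknife_se_of_mean(sample):
    """Leave-one-out jackknife estimate of the standard error of the mean."""
    n = len(sample)
    total = sum(sample)
    # Mean of the sample with observation i removed, for every i.
    leave_one_out = [(total - x) / (n - 1) for x in sample]
    centre = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((m - centre) ** 2 for m in leave_one_out))


if __name__ == "__main__":
    sample = [2.1, 2.4, 2.6, 2.9, 3.0]   # made-up fuel mileage values in km/l
    print(jackknife_se_of_mean(sample))
    print(statistics.stdev(bootstrap_means(sample)))
```

For the mean, the jackknife standard error coincides with the textbook value s/√n, which makes this case a convenient sanity check.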

2.1.4

Diagrams

A number of special diagrams were used to illustrate some results of this thesis. Those diagrams are dendrograms, boxplots and scree plots [1, 2].

• Dendrograms are tree diagrams which are used to illustrate the result of a clustering algorithm. An example for such a diagram is Figure 4.3. On the vertical axis the distance between clusters is plotted. A horizontal line denotes a split between classes at this specific distance measure. This implies that a split at a higher distance value has a higher dissimilarity between the split classes, as opposed to a lower distance value split.

• Boxplots describe groups of data – such as binned data – through five statistical properties. A boxplot example can be seen in Figure 4.2. The box spans from the lower to the upper quartile and thus contains half of the data. The line in this box illustrates the median of the data in this group. The whiskers attached to this box extend to the furthest data point, up to a maximum of 1.5 times the distance between the quartiles. Data points outside of this boundary are usually marked with a cross, indicating a possible outlier.

• Scree plots give an indication of the relevance of a principal component (eigenfunction) by indicating the accumulated eigenvalue up to the n-th principal component. This plot can be used to select a suitable number of eigenfunctions. An example for a scree plot is Figure 4.6.
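The quantity shown in a scree plot of the kind described above, the accumulated eigenvalue up to the n-th component, can be computed as a fraction of the total. The sketch below is illustrative only: the function names, the example eigenvalues and the 95% threshold are assumptions, not values from this thesis.

```python
def cumulative_variance_fraction(eigenvalues):
    """Accumulated eigenvalue up to each component, as a fraction of the total."""
    total = sum(eigenvalues)
    fractions, running = [], 0.0
    for lam in sorted(eigenvalues, reverse=True):   # largest component first
        running += lam
        fractions.append(running / total)
    return fractions


def components_needed(eigenvalues, threshold=0.95):
    """Smallest number of components whose accumulated fraction reaches the threshold."""
    for n, fraction in enumerate(cumulative_variance_fraction(eigenvalues), start=1):
        if fraction >= threshold:
            return n
    return len(eigenvalues)


if __name__ == "__main__":
    eigenvalues = [6.0, 3.0, 1.0]                   # made-up eigenvalues
    print(cumulative_variance_fraction(eigenvalues))
    print(components_needed(eigenvalues, threshold=0.85))
```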

2.2

Functional Data Analysis

Functional data analysis (FDA) [2, 3] is a collection of methods which enable the investigation of data in a functional form. Functional data is the idea of looking at a set of observations not as a vector in discrete time, but as a continuous function. The analysis of functions rather than discrete samples offers advantages over multivariate analysis.

One such advantage is that the rate of change or derivatives of these functions can easily be calculated and analysed. FDA also includes variants of multivariate methods like PCA. Functional PCA, like normal PCA, not only provides a method for dimensionality reduction, but also characterizes the main modes of variation from a mean function.

To perform FDA on discretely sampled data, the data has to be converted to a continuous, functional format. This means a function has to be fitted to the sampled data points. It is not feasible to convert every dataset to a functional form. Especially in the case of sparse and irregular observations, this task is very difficult, but central to the success of functional data analysis.

Usually, the methods used to convert data into a functional format are interpolation and smoothing, or more generally function fitting. A very simple method to do this conversion would be a least squares fit of a first order polynomial (a straight line). Usually, a more flexible method is used for this step, namely spline interpolation. Depending on the underlying data, other fits like Fourier functions are possible. FDA is easily applicable if the measurements were done with a regular spacing, and the data is complete over the observation duration. In the opposite case, it is very difficult to estimate the complete trajectory, when only a single subject is taken into calculation.
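The simplest conversion mentioned above, a least squares fit of a straight line, works directly on irregularly spaced samples. The following sketch uses the closed-form solution; the function name `fit_line` and the sample values are hypothetical.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                     # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Irregularly spaced, made-up observations of a single subject.
    x = [0.0, 2.0, 5.0, 9.0]
    y = [3.0, 2.0, 0.5, -1.5]
    intercept, slope = fit_line(x, y)
    print(intercept, slope)
```

Once fitted, the line can be evaluated at any position, and its derivative is simply the slope. With only a handful of observations per subject, more flexible fits such as splines quickly become poorly determined.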

2.3

Principal Components Analysis through Conditional Expectation

Principal Components Analysis through Conditional Expectation (PACE) is a derivative of functional principal components analysis for sparse longitudinal data, proposed in the paper Functional Data Analysis for Sparse Longitudinal Data by Yao, Müller and Wang [4].

PACE is an algorithm for extracting the principal components from irregular and sparse data. It also provides an estimation of individual smooth trajectories of the data. PACE assumes that the observation points are randomly located, with a random number of observations per subject. Furthermore, it assumes that the data is determined by an underlying smooth trajectory.

The first step in PACE is the estimation of the smooth mean function µ, using a local linear smoother on all measurements combined into one pool of data. The smoothing parameter, or bandwidth, is chosen automatically [14] or by hand in this step.
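A local linear smoother of the kind used for the mean function fits a weighted straight line around each evaluation point and takes its value at that point. The sketch below is a simplified illustration, not the PACE implementation: the function name is hypothetical, a Gaussian kernel is assumed for convenience, and the bandwidth is supplied by hand.

```python
import math


def local_linear_smooth(x, y, x0, bandwidth):
    """Local linear estimate at x0 from pooled observations (x, y)."""
    # Gaussian kernel weights (an assumption; other kernels are possible).
    w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    s0 = sum(w)
    s1 = sum(wi * (xi - x0) for wi, xi in zip(w, x))
    s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
    # Intercept of the weighted least squares line centred at x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


if __name__ == "__main__":
    # Pooled, made-up measurements from several subjects.
    x = [0.0, 1.0, 2.0, 4.0, 7.0]
    y = [1.0, 3.0, 5.0, 9.0, 15.0]
    print(local_linear_smooth(x, y, x0=2.5, bandwidth=1.5))
```

A useful property of the local linear fit is that it reproduces data lying exactly on a straight line regardless of the bandwidth, which the example data exploits.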

The covariance surface can then be calculated like a regular covariance matrix. This raw covariance surface is stripped of the variance (the main diagonal), since the diagonal additionally contains the variance of the measurement error. The remaining raw matrix is then smoothed utilizing a local linear surface smoother. The bandwidth is chosen by leave-one-curve-out cross-validation. The smoothing step is necessary to fill in for missing observations. The estimation of these two model components shares the same smoothing kernel. The choice of a smoothing kernel is discussed in Chapter 4.

From these model components, it is possible to calculate the estimates of the eigenvalues and eigenfunctions, i.e. the functional principal components of sparse and irregular data.

The last step is the calculation of the functional principal component scores. Those scores describe how much of a principal component is retained in a single subject. However, the conventional method of using numerical integration to recover the Principal Component (PC) scores leads to biased results because of the sparse and irregular data. In this step, the conditional expectation comes into play. It provides the best prediction of the PC scores if the measurement error is Gaussian, or the best linear prediction otherwise. PACE is discussed in detail by Yao, Müller and Wang [4].
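For orientation, the model and the score estimate behind this last step can be written compactly. The notation below follows the usual presentation of PACE in [4] and is not defined elsewhere in this chapter, so the symbols should be read as assumptions: $Y_{ij}$ is the j-th observation of subject i at position $T_{ij}$, $\phi_k$ and $\lambda_k$ are the eigenfunctions and eigenvalues, $\xi_{ik}$ the PC scores, $G$ the smoothed covariance surface and $\sigma^2$ the measurement error variance.

```latex
% Model: smooth trajectory plus measurement error
Y_{ij} = \mu(T_{ij}) + \sum_{k=1}^{\infty} \xi_{ik}\,\phi_k(T_{ij}) + \varepsilon_{ij}

% Conditional expectation estimate of the k-th score of subject i
\hat{\xi}_{ik} = \hat{\lambda}_k\, \hat{\phi}_{ik}^{T}\, \hat{\Sigma}_{Y_i}^{-1} \left( Y_i - \hat{\mu}_i \right),
\qquad
\left( \hat{\Sigma}_{Y_i} \right)_{jl} = \hat{G}(T_{ij}, T_{il}) + \hat{\sigma}^2 \delta_{jl}
```

Here $\hat{\phi}_{ik}$ and $\hat{\mu}_i$ denote the estimated eigenfunction and mean function evaluated at the observation points of subject i, and $\delta_{jl}$ is 1 for $j = l$ and 0 otherwise. Because only the positions actually observed for a subject enter the formula, no integration over unobserved parts of the trajectory is needed.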


3

The Vehicle Application and Data Description

The purpose of this chapter is to outline the connection between the methods proposed in Chapter 2 and the application of those methods on the Volvo data.

3.1

Volvo Truck Data

The original data received from Volvo Parts AB consists of 2027 observations of 267 trucks. It was collected between June 2004 and May 2007 in North America.

All trucks have the same engine and are configured as articulated trucks for long distance transports on smooth roads. The gross combination weight (GCW), which includes the weight of the towed trailer and the truck itself, is 36 tons, the US federal GCW limit. Data is retrieved when a truck is in a workshop that is equipped to read out the onboard electronics and performs this procedure. It is then sent to the Volvo headquarters in Gothenburg for storage and analysis.

The data from each observation contains only information from one of the truck's onboard electronic systems, the Engine Control Unit (ECU). From this data, two variables are mainly relevant for this thesis:

• Total distance driven

• Total amount of fuel consumed

Figure 3.1: This figure shows the distribution of the fuel consumption when the fuel mileage is calculated only between two observations (horizontal axis: Distance Driven [km]; vertical axis: Incremental Fuel Mileage [km/l]). The outliers visible in this figure can be explained by a high amount of idling between two close observations. When the fuel mileage is calculated accumulatively, those outliers do not occur.

These variables are not reset when the ECU is read out in the workshop and therefore behave accumulatively. Using these variables as a basis to calculate the fuel consumption per distance or time has an averaging effect, as the result includes all former mileage data. This is necessary because of the unevenly distributed data. If a truck was read out twice within a very short span of time, the fuel consumption in this interval is possibly vastly different from the normal fuel consumption behavior of the truck, for example because the truck was not moved very far within this time span, but was idling for some time. The outliers caused by this effect can be seen in Figure 3.1. These outliers are the reason for not using the difference in fuel amounts between two observations as a calculation basis in this thesis. With the accumulative approach, the affected observations can remain in the data set.
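The difference between the two ways of calculating fuel mileage can be made concrete with a small sketch. The function names and the counter readings are made up for illustration; the second readout is deliberately placed shortly after the first, with a lot of fuel burnt while idling.

```python
def accumulative_mileage(distance_km, fuel_l):
    """Fuel mileage [km/l] from the total counters at each observation."""
    return [d / f for d, f in zip(distance_km, fuel_l)]


def incremental_mileage(distance_km, fuel_l):
    """Fuel mileage [km/l] between consecutive observations."""
    readings = list(zip(distance_km, fuel_l))
    return [
        (d1 - d0) / (f1 - f0)
        for (d0, f0), (d1, f1) in zip(readings, readings[1:])
    ]


if __name__ == "__main__":
    distance_km = [100_000, 100_200, 200_000]   # total distance driven
    fuel_l = [40_000, 40_400, 80_000]           # total fuel consumed
    print(accumulative_mileage(distance_km, fuel_l))
    print(incremental_mileage(distance_km, fuel_l))
```

In this example the incremental value for the short interval drops to 0.5 km/l, while the accumulative values stay close to 2.5 km/l throughout.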

3.1.1

Impurities in the Truck Data

The raw data retrieved from the trucks contains irregular observations or changes in the truck data which result, in some cases, in a removal of specific observations or the whole truck from the data set. See Figure 3.2 for a plot of the raw fuel consumption data.

Figure 3.2: Fuel consumption plot generated from the raw data (horizontal axis: Distance Driven [km]; vertical axis: Fuel Mileage [km/l]). The lines are linear interpolations between the observations.

• Incomplete Observations – A truck is missing one or more variables that would be required for analysis. The observations from this individual cannot be used for the calculations.

• Physically impossible changes in accumulative variables – Between two observations of a single truck, accumulative variables changed to a smaller value. This means, for example, that a later observation in time has a smaller total driving distance than an earlier measurement. This is physically impossible, but observable if the ECU has been replaced or the contents of the ECU were erased during a software update. This criterion applies to 44 trucks. Although it would be possible to use a subset of the observations from each of these trucks, this was not done, because the quality of the measurements might have been compromised and cleaning the data manually is a time consuming task that yields very few usable measurements.

• Empty and Duplicated Observations – Some observations do not contain any new information, but only seem to be resubmissions of earlier or empty observations with a different time stamp. These particular observations are removed from the final data, but the remaining observations of this truck are used. Phenomena like these might occur when the data acquisition process in the workshop was interrupted, or a transmission error occurred.

• Early Observations – These observations are too early in the life of the truck to give meaningful information. The removal of these observations is motivated by the unusual fuel consumption of a truck in this state. The unusual fuel consumption is caused by a high number of short trips the truck has to travel before it can be put into regular service. Examples are drives to paint shops or truck customizers as well as transfers to the customer. When this criterion is set to remove all measurements below 10000 km, 150 observations are purged; when all measurements before 1000 km are deleted, the number of observations drops by 100. See Figure 3.3.

From the 269 initial individual trucks, 56 trucks are removed. In terms of observations, 1320 of the original 2027 observations remain in the data set when the lower border for observations is set to 1000 km. See Figure 3.4 for a plot of the cleaned fuel consumption data. The most visible change compared to Figure 3.2 is the lower number of outliers at roughly 0 kilometers, which is mostly an effect of the removal of very early observations.
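The cleaning criteria of this section can be expressed as a small filter. The sketch below is not the procedure actually used to prepare the data; the record layout (truck id, time stamp, total distance, total fuel), the use of `None` for a missing variable and the function name are assumptions made for illustration, and empty observations are not modelled separately.

```python
from collections import defaultdict

MIN_DISTANCE_KM = 1000  # lower border for early observations, as stated in the text


def clean_observations(observations, min_distance_km=MIN_DISTANCE_KM):
    """Filter (truck_id, time, distance_km, fuel_l) tuples; returns a dict per truck."""
    by_truck = defaultdict(list)
    for obs in observations:
        by_truck[obs[0]].append(obs)

    cleaned = {}
    for truck, rows in by_truck.items():
        # Incomplete observations: the truck cannot be used.
        if any(value is None for row in rows for value in row):
            continue
        rows = sorted(rows, key=lambda row: row[1])
        # Accumulative counters must never decrease over time.
        if any(b[2] < a[2] or b[3] < a[3] for a, b in zip(rows, rows[1:])):
            continue
        kept, seen = [], set()
        for _, time, distance, fuel in rows:
            if (distance, fuel) in seen:
                continue                      # duplicate with a new time stamp
            seen.add((distance, fuel))
            if distance < min_distance_km:
                continue                      # too early in the life of the truck
            kept.append((time, distance, fuel))
        if kept:
            cleaned[truck] = kept
    return cleaned


if __name__ == "__main__":
    raw = [
        ("A", 1, 500, 200), ("A", 2, 50_000, 20_000),
        ("A", 3, 50_000, 20_000), ("A", 4, 120_000, 48_000),
        ("B", 1, 80_000, 32_000), ("B", 2, 30_000, 12_000),
        ("C", 1, 60_000, None),
    ]
    print(clean_observations(raw))
```

In the example only truck A survives, with its early and duplicated observations removed; truck B shows a decreasing counter and truck C is incomplete.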

3.1.2

Data structure

Some properties of the data make the task of analyzing inherently difficult. Most of these properties stem from the sparsity of the data. Sparseness in this case means that every truck has been observed on average just 7.405 times with a standard deviation of 2.4083 observations. The sparseness of the data is visualized in Figure 3.5.

Figure 3.3: This comparison shows the number of observations in the raw data versus the cleaned data (horizontal axis: Distance Driven [km]; vertical axis: Number of Observations). The overall reduction in the number of observations as well as the lower amount of observations at the beginning is noticeable.

Figure 3.4: Fuel consumption plot generated from the clean data (horizontal axis: Driven Distance [km]; vertical axis: Fuel Mileage [km/l]). Note the lack of outliers at the beginning of the data.

Figure 3.5: The scatter plot in this figure highlights the sparse and irregular distribution of the data (horizontal axis: Driven Distance [km]; vertical axis: Fuel Mileage [km/l]). The histograms describe the distribution of the observations along the axes.

• The data is not fully observed. The observations of a single truck often are not scattered over a very long distance in time or driven distance, but measured only within a short span. The average distance between the first and the last observation of a truck is 317841 kilometers with a standard deviation of 114208 kilometers. The mean focus of the observations is at 303232 kilometers, deviating by 133609 kilometers, which means that most of the trucks are not observed from the beginning, but later on in their life-cycle.

• The density of measurements varies. This implies that the placement of measurements is irregular throughout the duration of their observation. As the trucks are independent of each other, the times at which observations happen are not correlated between trucks. For a visual representation of the irregular distances between the measurements, see Figure 3.6. This figure indicates a non-normal distribution. The average distance between observations is 52020 kilometers with a standard deviation of 61858 kilometers.

• Unsupported curvature. This property is caused by the irregular placement and the sparsity of the data. The curvature of a part of a curve can be approximated by $\left\| \frac{d^2 y}{dx^2} \right\|$ or $\left( \frac{d^2 y}{dx^2} \right)^2$. Where this curvature is high, the relative resolution of the data should also be high to enable a good estimation of the underlying function [2].

Figure 3.6: This figure shows the distribution of distances between two observations of the same truck. (Histogram; x-axis: Distance Driven between Observations [km], y-axis: Number of Observations.)

3.2

Approach

The first part in analyzing truck data, which is described in Section 4.1, is to establish results with basic multivariate analysis as a baseline against which the results of functional analysis can be compared. This part shows pitfalls and difficulties when applying standard multivariate methods to the data.

The first possible way for multivariate analysis is feature extraction. It is a difficult task to find relevant features to extract. A simple statistical feature will be extracted from the data to illustrate how feature extraction works. The second possibility for multivariate analysis is to put the observations into bins. This is done in order to be able to align the data onto a vertical grid.

The second way is necessary because it is very hard to visualize the extracted features or to convert them back to the original data format. However, binning cannot easily be used for outlier detection. Usually, some of the bins are likely to have only a low number of observations, which makes outlier determination in these bins very difficult. If the bins are made larger, multiple, or even all, observations of a single truck might be put into a single bin. This makes it harder to differentiate between normal and outlying observations.


These steps should lead to two results: a simple outlier detection, based on a clustering of the extracted features, and a variance and mean estimation for the data, based on the binned data.

The task of estimating fuel consumption behavior for a single truck outside of its observation duration using the extracted features is very hard. This is because the mapping between the values of the features and a function is not available. Additionally, information from other, similar trucks is not taken into consideration.

The last step in Basic Analysis (Sect. 4.1) is a demonstration of the main problem of applying FDA on the data at hand: The difficulty of fitting a function to a single truck.

The main task of this thesis is to apply the PACE algorithm to the data (Sect. 4.2), and to try out the various options within the PACE algorithm. In that section, the results of PACE in general are assessed, along with the differences between PACE runs with different options, both with regard to the PACE-generated functions and to general statistical properties such as the mean function.

The first advantage of using the PACE algorithm in comparison to the basic methods is that there is no need to pre-process the data, i.e. to extract features or otherwise process the data. This non-parametric input of the data is complemented by a number of options to tune the algorithm itself for various needs (amount of information retained, whether the input data has measurement errors, etc.).

The next step is to try out a number of methods which can be applied to the results of PACE. One example is to calculate the probability of the fuel consumption of a particular truck, given all the other trucks.

PACE enables the user to analyse the sparse and irregular data at hand and to use additional techniques from FDA, whereas using only multivariate data analysis or normal FDA on the same data is very difficult and does not incorporate the information gathered from the other trucks.

PACE makes outlier detection, estimation of the function outside the observation duration and the gathering of common statistical properties, like mean and variance in functional form, from sparse and irregular data a lot easier or even possible in the first place.


4

Results

4.1

Basic Data Analysis

The aim of this section is to provide an overview of the basic multivariate analysis possibilities with the available data. Functional methods are applied from Section 4.2 onwards.

4.1.1

Data Binning

One approach, as described in the previous chapter, is the creation of a vertical grid for the data domain followed by binning the data into a limited number of “buckets” along the time or distance axis, similar to creating a histogram. If there is more than one observation of a truck in one of these bins, an average of these measurements is put into the bin. This has to be done to avoid biasing in case of dense observations of a truck within a short timespan.

The size and the quantity of the bins is crucial for binning. With the data at hand, 25 bins were used, which results in a size of 36087 kilometers per bin.
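The binning step described above can be sketched as follows. This is a minimal illustration and not the implementation used for the thesis: the function name `bin_observations`, the dictionary layout of `trucks` and the equal-width grid are assumptions. With 25 bins of 36087 kilometers each, the grid would cover 25 × 36087 = 902175 kilometers.

```python
from collections import defaultdict
from statistics import mean


def bin_observations(trucks, n_bins, max_distance):
    """Bin observations on an equal-width distance grid.

    trucks: truck id -> [(distance_km, value), ...] (hypothetical layout).
    Several observations of one truck inside the same bin are averaged first,
    so every truck contributes at most one value per bin.
    """
    width = max_distance / n_bins
    per_truck = defaultdict(list)  # (bin index, truck id) -> values
    for truck_id, observations in trucks.items():
        for distance, value in observations:
            # Clamp so that an observation at max_distance lands in the last bin.
            index = min(int(distance // width), n_bins - 1)
            per_truck[(index, truck_id)].append(value)
    bins = [[] for _ in range(n_bins)]
    for (index, _truck_id), values in per_truck.items():
        bins[index].append(mean(values))
    return bins
```

The mean and the standard deviation per bin can then be estimated from the returned lists, keeping in mind that bins with very few entries give unreliable estimates.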

In Figure 4.1 the number of observations per bin, as well as an estimation of the mean function and the variance of the data can be seen.

In Figure 4.2 a boxplot of the binned data and one of the results of bootstrapping [1] the mean value per bin (10000 bootstrap samples) are illustrated.
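Bootstrapping the mean of a single bin amounts to resampling the values of that bin with replacement and recording the mean of every resample. A minimal sketch, assuming the values of one bin are available as a plain list; the function name and the fixed seed are illustrative choices:

```python
import random


def bootstrap_means(values, n_samples=10_000, seed=0):
    """Return the means of n_samples resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = len(values)
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_samples)]
```

The spread of the returned means indicates how much the bin mean could vary if new data followed the same distribution as the data at hand.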

Figure 4.1: The histogram (left; x-axis: Bins, y-axis: Observations) depicts the number of observations per bin. Especially the first and the last few bins have a very small number of observations, which leads to the abnormal results in these bins in the mean and standard deviation panel on the right (x-axis: Distance Driven [km], y-axis: Fuel Mileage [km/l]). That panel shows the mean as well as the standard deviation estimated from the binned data.

Figure 4.2: The figures show boxplots for the binned data (left, "Binned Data Boxplot") and bootstrapped mean values (right, "Bootstrapped Mean Boxplot"); both panels plot Values against Bin. The left boxplot is a simple plot of the raw binned data, providing an easy visualization. The right boxplot is generated by bootstrapping the mean of each bin 10000 times. Bootstrapping should give an idea of how much the mean can vary if new data has the same distribution as the data at hand.


4.1.2

Feature Extraction

The features which are retrieved from all observations of a single truck are used to construct a simple outlier detector with hierarchical clustering.

The goal of this simple outlier detector is to find trucks, whose mean is deviating significantly from the mean of the entire data. A single extracted feature was used in this case:

$\Delta_{Truck} = (\mu_{Truck} - \mu_{All})^2$

The data was then clustered with a hierarchical algorithm, using average distance linking. The outlying classes were subjectively selected by looking at the resulting dendrogram. For the results, see Figure 4.3.
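The feature and the clustering step can be sketched as follows. This is a naive illustration under stated assumptions: $\mu_{All}$ is taken as the mean over all pooled observations, the dictionary layout of `trucks` is hypothetical, and the dendrogram is cut at a fixed distance `cut`, whereas in the thesis the outlying classes were selected by inspecting the dendrogram.

```python
from statistics import mean


def mean_deviation_feature(trucks):
    """Delta_Truck = (mu_Truck - mu_All)^2 for every truck."""
    mu_truck = {t: mean(v for _, v in obs) for t, obs in trucks.items()}
    # Assumption: mu_All is the mean of all pooled observations.
    mu_all = mean(v for obs in trucks.values() for _, v in obs)
    return {t: (mu - mu_all) ** 2 for t, mu in mu_truck.items()}


def average_linkage(features, cut):
    """Agglomerative clustering with average linkage on a 1-D feature.

    Clusters are merged until the closest pair is farther apart than `cut`.
    """
    clusters = [[t] for t in features]

    def distance(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return mean(abs(features[x] - features[y]) for x in a for y in b)

    while len(clusters) > 1:
        d, i, j = min(
            (distance(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if d > cut:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters
```

Small clusters that stay separate from the bulk of the trucks are the outlier candidates.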

Figure 4.3: Results of outlier detection based on feature extraction. The left figure shows the dendrogram of the clustering algorithm (x-axis: Class, y-axis: Class Distance). This figure shows that class 6 is an extreme outlier, whereas classes 3 and 7 are also quite different from the main part of the data. The basis for these classes being outliers is a mean that is vastly different from the rest of the data. In the other figure (x-axis: Distance Driven [km], y-axis: Fuel Mileage [km/l]), the outlying clusters are highlighted. The extreme outlier is marked red, the normal outliers are marked green and the normal data is colored blue. Class 3 has 5 members, whereas the other outlier classes have just 1 member. Classes 1 and 5 have 114 and 52 members, respectively. Class 2 has 27 members, whereas class 4 has 13 members.


4.1.3

Function Fitting

Finding a plausible function that fits the data of the trucks well is difficult because of the open-ended nature of the measurements. If a set of observations has a defined start and end, i.e. the data is fully observed, it is easy to interpolate the data in between, even if the data within this span is sparse. This property of the data at hand is also discussed in Section 3.1.

If the set of data is not fully observed, it is almost impossible to get a reliable fit outside the observation span of a single entity. This reliable fit outside of this span is necessary for performing FDA on this data, as FDA needs the same set of basis functions, or in the case of spline interpolation, the same knots for all functions to work.

It was not possible to get a good fit on this data with splines where all of the knots are distributed the same for all truck entities. Polynomial fits, i.e. the approximation of the data with polynomials of low (< 5) order, did not result in a stable fit for the available data either. The most reliable fit under these conditions was generated by fitting a linear function to the fuel consumption observations. These results in fitting the sparse and irregular data motivate the idea of combining the observations by means of PACE, to be able to get better fits from the reconstructed trajectories.
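The linear fit per truck mentioned above can be sketched with ordinary least squares. The function name and the fallback for trucks whose observations all lie at the same distance are assumptions made for this illustration:

```python
def fit_line(observations):
    """Least squares fit y = offset + slope * x for one truck.

    observations: [(distance_km, km_per_l), ...]
    Returns (offset, slope).
    """
    n = len(observations)
    mean_x = sum(x for x, _ in observations) / n
    mean_y = sum(y for _, y in observations) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in observations)
    if sxx == 0:
        # Assumption: with no spread in distance, fall back to a flat line.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in observations)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

With only a handful of observations within a short span, the estimated slope can become steep, which is why such a line should not be trusted far outside the observation span of the truck.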

Figure 4.4: On the left, all fitted straight lines are shown. The right figure shows the mean straight line along with the standard deviation of the slope and the offset (blue) and the standard deviation of just the offset (dashed). The main problem with this straight line fit is a number of fits with high gradients, which are not valid outside their observation span. However, the mean line shows a slight increase in fuel economy, just like the mean curve from PACE (Figure 4.5). (Both panels: x-axis: Distance Driven [km], y-axis: Fuel Mileage [km/l].)


4.2

Application of PACE

The goal of this section is to elaborate on the application of the PACE method on the truck data, focusing only on fuel consumption per kilometer over the distance axis. Along with the results of this first application, some options available for a fine-tuning of the method will be presented and a general estimate of variability will be given.

4.2.1

Baseline PACE Results

The data in use for this initial run of the PACE method is the cleaned set, with all trucks removed which have fewer than 2 observations. Additionally, every observation that happened before a threshold of 10000 km has been removed. The PACE method has some interchangeable sub-methods. For the baseline results, mostly the same parts as in the original method described in [4] were used. Thus, the kernel used for smoothing the mean function is the Epanechnikov kernel [4] and the input data is assumed to contain measurement errors.

A small deviation from the original method is the choice of Fraction of Variance Explained¹ (FVE) instead of the Akaike Information Criterion [1] (AIC) to select the number of PCs. The FVE threshold is set at 95 % of variance explained.

The smoothed mean curve in Figure 4.5 should be taken with a grain of salt; in particular, the variance plots and the measurement density plot in Figure 3.3 should be taken into account. The number of PCs selected by FVE is 8, which accounts for 96.57 % of the total variation. The scree plot (Section 2.1.4) of the principal components from this analysis can be seen in Figure 4.6. The first, strong principal component is almost a straight line, which basically shifts the mean from its starting point closer to the position of the measurements. The second and the fourth principal component seem to serve partially as a correction for trucks with a higher initial fuel economy than the average truck. The smoothed covariance matrix generated and used by PACE is visualized in Figure 4.7 as a color matrix.

¹ The sum of the eigenvalues of a certain number of eigenfunctions divided by the sum of all eigenvalues has to exceed a certain threshold. The first number of PCs which exceeds this threshold is subsequently used.
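The FVE criterion can be written down in a few lines. This is a minimal sketch, assuming the eigenvalues are available as a plain list; the function name is hypothetical, and the comparison uses "reaches or exceeds" so that a threshold of 1.0 can still be met.

```python
def select_by_fve(eigenvalues, threshold=0.95):
    """Return (number of PCs, fraction of variance explained).

    The smallest number of leading eigenvalues whose cumulative share of the
    total reaches the threshold is selected.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, eigenvalue in enumerate(ordered, start=1):
        cumulative += eigenvalue
        if cumulative / total >= threshold:
            return k, cumulative / total
    return len(ordered), 1.0
```

For example, for eigenvalues 6, 3 and 1 and a threshold of 0.85, two components are selected, since 6/10 is below the threshold and (6 + 3)/10 is above it.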

Figure 4.5: The smooth mean function generated by PACE (left, "Smooth mean curve"; x-axis: Distance Driven [km], y-axis: Fuel Mileage [km/l]) is the basis for all other results. The four most significant PCs (right; x-axis: Distance Driven [km], y-axis: Principal Components) are the strongest ways in which the individual trucks vary. The legend quantifies the strength of the PCs (55.72 %, 11.88 %, 8.65 %, 4.03 %).

Figure 4.6: The scree plot, which highlights the trade-off between the number of PCs used versus the variance retained. The use of more than 10 PCs makes little sense, as the Fraction of Variance Explained (FVE) is not improving much. (x-axis: Number of principal components, y-axis: Fraction of variance explained (%); marked point: 8 PCs, 96.57 %.)

Figure 4.7: The smoothed covariance matrix generated by PACE. (The diagonal, which is the variance, has been removed prior to smoothing.) The main part of the matrix shows a small positive covariance (green). (Both axes: Distance Driven [km]; color scale from −0.04 to 0.08.)

Figure 4.8: These plots exhibit the mean curve (red), the corresponding original observations (green) and the reconstructed curve (blue). Vehicles 14 and 106 have high values on all major PC scores, with opposite signs. Number 92 has the lowest PC scores overall; trucks 72 and 4 have average PC scores. High PC scores lead to extreme values, especially on the strong first PC. (Panels: Vehicles #14, #106, #92, #72, #4; y-axis: Fuel Consumption [km/l].)

From the estimated PCA scores, the mean function µ and the principal component functions, the individual traces of the trucks can be reconstructed, which should give a rough estimate on the behavior of the truck. A number of selected reconstructions can be viewed in Figure 4.8 and a collection of all traces and the original measurements can be seen in Figure 4.9.
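The reconstruction described above adds the principal component functions, weighted by the scores of the truck, to the mean function. A minimal sketch on a common evaluation grid; the function name and the list-based representation of the curves are assumptions:

```python
def reconstruct(mean_curve, eigenfunctions, scores):
    """Evaluate mu(t) + sum_k score_k * phi_k(t) on a common grid.

    mean_curve: values of the mean function on the grid.
    eigenfunctions: one list of grid values per principal component.
    scores: the PC scores of one truck, in the same order.
    """
    return [
        mu + sum(score * phi[g] for score, phi in zip(scores, eigenfunctions))
        for g, mu in enumerate(mean_curve)
    ]
```

A truck whose scores are all zero is reconstructed as the mean curve; large scores on the first PC move the whole trace away from the mean.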

As a next step in the analysis of the results, the goodness-of-fit of the reconstructed traces against the original measurements is assessed. To estimate the goodness-of-fit, the mean squared error [1] between the discrete observations and the estimated reconstruction is considered. However, the irregular measurement intervals make the assessment of the results difficult.

Figure 4.9: This graph shows all reconstructed traces (gray) and original measurements (blue). Note how the traces tend to follow the observations, especially when the relative occurrence of observations is low. (x-axis: Distance Driven [km], y-axis: Fuel Mileage [km/l].)

Figure 4.10: As described in the text, these figures depict misfitted trucks. Vehicles #73 and #106 show trucks which provide bad fits, whereas #102 is a truck which is only identifiable as a misfit when the median mean square error (MSE) is applied. Truck #202 is a counter-example, where the misfit is more noticeable when the mean MSE is used. (Panels: Vehicles #73, #102, #106, #202; y-axis: Fuel Consumption [km/l].)

Method                     Max. MSE   Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck         0.189%     0.0343%    0.0209%      0.0383%
Median MSE per Truck       0.238%     0.0215%    0.0096%      0.0331%
All Observations Pooled    0.679%     0.0310%    0.0089%      0.0629%

Table 4.1: MSE of the reconstructed traces by PACE and the original observations with 8 PCs. In the last column, the standard deviation of the MSE is given.

In Figure 4.10 some examples of bad fits are shown. Just taking the mean of the mean square error (MSE) over all observations of one truck is prone to skewing, as is just summing up the MSE for each single truck. A more sensible approach to get reliable error measurements is to use the median of the individual MSE as the error measure. A good example of a bad fit is truck #102 (Figure 4.10), which is the third worst fitting truck when the median MSE is used, whereas it is ranked 63rd by mean MSE.

A counter-example is provided by vehicle #202, which is ranked 3rd using the median and 19th with mean MSE. In this example, one of the observations is a strong outlier, which is influencing the median MSE because of the low number of observations on this truck.

Because both measurement methods have their respective merits, both are used for judging the fit of the individual trucks. In addition to these two methods, which view the trucks as separate entities, all truck observations are pooled and the overall MSE is given. The results can be seen in Table 4.1.
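The three ways of aggregating the errors can be sketched as follows. The input layout (truck identifier mapped to the squared errors between its observations and its reconstructed trace) and the function names are assumptions, and the sketch works on raw squared errors rather than on the percentage values of Table 4.1.

```python
from statistics import mean, median, pstdev


def summarize(values):
    """Maximum, mean, median and standard deviation of a list of errors."""
    return {
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),
    }


def mse_table(errors_by_truck):
    """errors_by_truck: truck id -> list of squared errors of that truck."""
    per_truck_mean = [mean(errors) for errors in errors_by_truck.values()]
    per_truck_median = [median(errors) for errors in errors_by_truck.values()]
    pooled = [e for errors in errors_by_truck.values() for e in errors]
    return {
        "mean_per_truck": summarize(per_truck_mean),
        "median_per_truck": summarize(per_truck_median),
        "pooled": summarize(pooled),
    }
```

A single large error dominates the mean of a truck but leaves its median largely untouched as long as the truck has enough observations, which is why the two per-truck measures can rank the same truck very differently.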

After the establishment of these baseline results, various parts of the PACE method can be changed to see their influence on the results.

4.2.2

Number of Principal Components

As a first variation, the number of PCs is varied and the resulting MSE tables are compared. A visual comparison is also offered. In addition to the baseline FVE threshold of 0.95, thresholds of 0.75, 0.85 and 0.9999 are subject to this experiment. The only difference between the baseline and the variants is the number of PCs. With a lower number of PCs, the MSE in the data should be higher. Using a higher number of PCs results in a lower MSE, but probably causes worse generalisation performance, i.e. the principal components might be over-fitted to the existing data.

Figure 4.11: These plots show how much reconstructed traces vary with different numbers of PCs involved. (Panels: Vehicles #14, #105, #92, #72, #4; y-axis: Fuel Consumption [km/l]; legend: 8 PC, 3 PC, 4 PC, 29 PC.)

Table 4.2: MSE of the reconstructed traces with 3 PCs (76.69% variance retained).

As the only difference to the baseline result is the number of PCs, graphs of the mean function and the PCs themselves are omitted. Only the MSE table and the reconstructed trajectories of selected trucks are shown. For the baseline table, see Table 4.1, and for a comparative visualization of reconstructed trajectories see Figure 4.11.

As expected, the MSE results from the variations (Tables 4.2, 4.3, 4.4) behave analogously to the scree plot in Figure 4.6. When using a lower number of PCs the error increases, whereas a high number of principal components does not necessarily improve the error much. This means that the size of the MSE follows the fraction of variance retained shown in the scree plot: the less variance is retained, the larger the MSE.

Table 4.4: MSE of the reconstructed traces with 29 PCs (99.99% variance retained).

4.2.3

Error Assumptions in PACE

There are two possibilities to tune the behavior of PACE regarding “measurement errors”:

• The assumption that the observations contain no “measurement errors”.

• In addition to the presence of “measurement errors”, the estimated errors are cut off at the quartiles for the estimation of the error variance σ.

The notion of “measurement errors” in this context is a bit misleading, as PACE assumes an underlying smooth function. The accumulative fuel consumption data is precise enough, but the variation of the observations around this smooth function can be considered as noise. The assumptions on measurement error mostly influence the calculation of the PC scores.

The previous results were obtained with PACE under the assumption of measurement errors without cut off. Thus, this section covers just the other two possible modes of operation.

Table 4.5: MSE of the reconstructed traces with 8 PCs. For the estimation of error variance, all data outside the quartiles were cut off.

Figure 4.12: Reconstructed traces of selected trucks with no measurement error assumed. The influence of this assumption can be seen clearly in Vehicle #105, where PC scores are maximized to fit at the observation points. (Panels: Vehicles #14, #105, #92, #72, #4; y-axis: Fuel Consumption [km/l].)

Baseline PACE with error cut off:

Table 4.5 shows that the MSE with error cut off is almost as small as the MSE with 29 PCs, which is a clear improvement over the baseline. Basically, the additional cut off reduces the outliers, which seems to improve the performance in comparison with the baseline results.

Using PACE under the assumption of zero measurement error:

As there is no assumed measurement error, the reconstruction MSE is very small. The problem with this tight fit is that the reconstruction is only accurate at the original measurement points. This is shown in Figure 4.12.

Figure 4.13: This figure shows the effects of using different kernels for smoothing the mean curve µ. The Gaussian kernel produces a very smooth mean curve, whereas the rectangular kernel picks up noise from the measurements. The Epanechnikov kernel produces a compromise between these kernel variants. (x-axis: Driven Distance [km], y-axis: Fuel Consumption [km/l]; legend: Epanechnikov kernel, Rectangular kernel, Gaussian kernel.)

4.2.4

Different Kernel Functions

Usually, the Epanechnikov kernel [4] is the standard choice for the smoothing steps in the PACE method. This kernel function has compact support, i.e. it drops to zero at definite end points. Alternative choices of kernels are rectangular and Gaussian kernels [4]. Whereas the rectangular kernel also has definite end points, the Gaussian kernel extends to infinity. For the smooth mean curve and the principal components the rectangular kernel has the effect of adding some noise to the curves, whereas the Gaussian kernel has stronger smoothing properties. In Figure 4.13 all three mean curves and in Figure 4.14 the three most significant PC curves are visible.
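The three kernels in their common textbook form, together with a simple kernel-weighted mean, can be sketched as follows. This is only an illustration of how the kernel shapes weight nearby observations: PACE itself uses local smoothers that are more elaborate than the local constant estimate shown here, and the function names and the bandwidth parameter are assumptions.

```python
import math


def epanechnikov(u):
    """Parabolic weights, zero outside [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0


def rectangular(u):
    """Constant weights, zero outside [-1, 1]."""
    return 0.5 if abs(u) <= 1 else 0.0


def gaussian(u):
    """Standard normal density, non-zero everywhere."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_mean(x0, xs, ys, bandwidth, kernel):
    """Kernel-weighted mean of ys around x0 (local constant estimate)."""
    weights = [kernel((x - x0) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0:
        return None  # no observation within reach of a compact kernel
    return sum(w * y for w, y in zip(weights, ys)) / total
```

With the rectangular kernel every observation inside the window counts equally, so single observations entering or leaving the window cause jumps, whereas the Gaussian kernel always averages over all observations.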

In comparison, the overall MSE of the pooled data is slightly higher for PACE with a rectangular kernel than it is with an Epanechnikov kernel (0.0351% mean, 0.0113% median in the rectangular case versus 0.031% mean, 0.0089% median with the Epanechnikov kernel). In the Gaussian case, the fit is worse than with the other kernels (0.0468% mean, 0.0162% median)².

Figure 4.14: This figure shows the effect of using different kernels for smoothing on the PCs. The order of the PCs is visualized by the thickness of the lines, i.e. the thickest line depicts the first principal component. Generally, the same observations as in Figure 4.13 apply. (x-axis: Distance [km], y-axis: Principal Components; legend: PC1 to PC3 for the rectangular, Epanechnikov and Gaussian kernels.)

4.2.5

Variances

There are two different variances in the results, model variance and data variance. As the different names indicate, the variances result from different sources and therefore must be handled differently.

The model variance is based on the question of how sure we are of a model. A generalisation of this variance would be to do leave-one-curve-out cross validation on the smooth mean curve. This enables a visualization of how much influence a single curve has on the overall result of the mean curve or the principal components.

The data variance represents the density of measurements in a certain part of the curve. For calculating the variance and the confidence interval of a certain part of the curve, the number of trucks that influence a part of the curve has to be given. There are two different approaches to this:

One approach is to bin the data and calculate the variances of the bins as shown in Section 4.1, and use them to get approximate results for the variance.


Another approach is the use of reconstructed curves as a basis for calculating the variances. There are two different implementations of this approach. Either the reconstructed curves are taken into account only within the interval of their real observations, i.e. only observations which are relevant for this particular interval are incorporated, or the complete reconstructed curves are used, which ignores the number of real observations in a part of the curve.

4.2.5.1 Model Variance

The data used for this experiment is generated by PACE with 8 principal components; every observation which happened before the truck had run 1000 kilometers was removed. The method used to generate the data needed to analyze the model variance is leave-one-curve-out cross validation. This validation method generates one PACE result for each truck, computed with that truck excluded from the data. Thus, there are as many PACE results generated as there are trucks.
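The leave-one-curve-out procedure can be sketched independently of PACE. In the following illustration, `estimator` is a hypothetical stand-in for a full PACE run, and the pooled mean used here returns a single number instead of a curve; with a curve-valued estimator, the absolute difference would be replaced by a distance between curves.

```python
def pooled_mean(trucks):
    """Stand-in estimator (assumption): mean over all pooled observations."""
    values = [v for observations in trucks.values() for _, v in observations]
    return sum(values) / len(values)


def leave_one_curve_out(trucks, estimator):
    """Return the full-data result and the influence of every truck.

    The influence of a truck is the absolute change of the result when the
    estimator is run on the data without that truck.
    """
    full = estimator(trucks)
    influence = {}
    for left_out in trucks:
        reduced = {t: obs for t, obs in trucks.items() if t != left_out}
        influence[left_out] = abs(estimator(reduced) - full)
    return full, influence
```

Sorting the trucks by their influence gives a ranking of the most and the least influential trucks for the chosen result.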

Model variance gives us two results. The first result is a ranking of the most influential trucks to the mean curve or a PC. This result can be found when a PACE result which excludes a particular truck and the overall PACE result is compared.

Additionally, model variance gives the distribution and variation for a particular result from PACE. In Figure 4.15 the very peaky distribution of all the mean curves at different points can be seen. Figure 4.16 shows all leave-one-out mean curves, the overall mean curve µ and the standard deviation σ curves. The average deviation of σ from µ is 0.0016 km/l, the maximal deviation 0.0039 km/l. Figure 4.17 is a fuel consumption plot of interesting vehicles with regard to their influence on the mean curve or on the PCs.

4.2.5.2 Data Variance

Analysing data variance gives an idea of the variation of the data around the mean function. As mentioned before, there are two methods to accomplish this: The first, simpler method is to use the data from binning. The other method, outlined in this part of the thesis, is to consider only the segments of the reconstructed data where real data support exists. Both results can be seen in Figure 4.18. Both methods deliver a similar result. The main difference is the resolution of the result based on PACE, which is much higher. However, unlike the binning results, the estimated data between the observations is also incorporated into the variance results, which means that regions with low data support are also represented in the variance.

Figure 4.15: The distribution of all mean curves generated with the leave-one-curve-out method at various points. Two properties of the mean curves are visible, namely the peakiness of the distribution and the higher deviation from the mean at 50000 km and at 750000 km. (Panels: 50000 km, 225000 km, 400000 km, 575000 km, 750000 km; x-axis: Fuel [km/l], y-axis: Number of Curves.)

Figure 4.16: All mean curves generated by leave-one-out cross validation and the original µ (blue) and σ (red) curves. The small distance between the σ curves and the µ [...] (x-axis: Distance [km], y-axis: Fuel Consumption [km/l].)

Figure 4.17: Plot of trucks with a high influence on the results of PACE. Trucks 43 and 15 have a strong influence on the µ curve since they provide data at the end of all observations, where data is very sparse. Vehicle 88 is the truck with the smallest influence on µ. It has both a small observation duration and average measurements. Truck #23 has the highest influence on the first PC and truck #185 has the smallest influence on the first PC. (Panels: Vehicles #43, #15, #88, #23, #186; x-axis: Distance [km], y-axis: Fuel Consumption [km/l].)

4.3

Prediction of Fuel Consumption with PACE

Prediction in this case essentially is the usage of the reconstructed trajectories from the PC scores to guess the fuel consumption of a truck at a certain point³.

As the baseline to measure the effectiveness of the prediction, the value of the last measurement available is used as the predicted value. This straight assumption works well on the accumulative data because the fuel consumption usually develops in an almost straight line.

³ If the data, unlike the available truck data, is not open ended, an alternative to the direct [...]

Figure 4.18: The standard deviation extracted from the binned data is visible on the left. The right graph shows the standard deviation of the data, reconstructed from the observation duration and the trajectories regenerated from the PC scores of PACE. (Both panels: x-axis: Distance Driven [km], y-axis: Fuel Mileage [km/l].)

For testing the prediction of new observations, the last observation of the truck to predict is removed from the data, and the PACE results are calculated without it. The prediction at the position of the removed observation is taken from these results. This procedure is repeated for each available truck.

In this general prediction test, straight line prediction produces a maximum error of 5.04% and a mean error of 0.58% with a standard deviation of 0.81%, whereas using the reconstructed trajectories to predict produces a maximum error of 5.49% and a mean error of 1.25% with a standard deviation of 1.06%.

These results emphasize the straight nature of the data. In general, they show that it is better to assume steadily continuing fuel consumption behavior for forward prediction.
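The straight line baseline of this test can be sketched as follows: the last observation of every truck is held out and predicted by the value of the observation before it. The function name, the dictionary layout of `trucks` and the use of the relative error in percent are assumptions made for this illustration.

```python
def last_value_errors(trucks):
    """Relative error (in percent) of predicting the held-out last observation
    of each truck by the value of the observation before it.

    trucks: truck id -> [(distance_km, km_per_l), ...], sorted by distance.
    Trucks with fewer than two observations are skipped.
    """
    errors = {}
    for truck_id, observations in trucks.items():
        if len(observations) < 2:
            continue
        predicted = observations[-2][1]
        actual = observations[-1][1]
        errors[truck_id] = abs(predicted - actual) / abs(actual) * 100.0
    return errors
```

The maximum, mean and standard deviation of the returned values correspond to the kind of figures quoted for the straight line prediction; the PACE-based variant would replace `predicted` by the reconstructed trajectory evaluated at the held-out position.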

If the trajectory is used as the true reference instead of the real observations, straight line prediction has a maximum error of 2.81% and a mean error of 0.48% with a standard deviation of 0.56%. Prediction using PACE produces a maximum error of 2.31% and a mean error of 0.38% with a standard deviation of 0.40%. These results are not necessarily an indicator for prediction quality, but for result stability in the case of a single missing observation.


Using the reconstructed trajectories directly for prediction is affected by the assumption of the presence of a measurement error – i.e. a basic underlying deviation even at the points with known observations, and the bad fit which usually occurs when dealing with outliers. However, given the relatively constant measurements of individual trucks and the preexisting error between the actual observations and the trajectories, the prediction works and is quite stable regarding the removal of observations.

4.4 Detection of Outliers with PACE

The main idea behind outlier detection with PACE, in particular with the PC scores, is to be able to quantify how normal and likely the fuel consumption behavior of a single truck is.

The first step in quantifying this probability is to find the distribution of the PC scores. In this case, the scores are normally distributed, which can be seen in Figure 4.19. This makes it easy to calculate probabilities for a single PC score. By using the probabilities from just the first principal component, the same outliers as with simple feature extraction (Section 4.1.2) can be found. Basically, the same outliers can also be found by just using the raw PC scores.
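As a rough illustration of how a probability can be attached to a single PC score, the sketch below fits a normal distribution to the scores of one principal component and evaluates each score against it. The text does not state exactly how the probability is defined; using the two-sided tail probability (1 at the mean, approaching 0 for extreme scores) is an assumption, and the function name `score_probabilities` is hypothetical.

```python
import math
import statistics

def score_probabilities(scores):
    """Two-sided tail probability of each score under a fitted normal.

    A score at the mean gets probability 1; scores far out in either
    tail get probabilities close to 0.
    """
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)  # sample standard deviation
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return [math.erfc(abs(s - mu) / (sigma * math.sqrt(2.0))) for s in scores]
```

Trucks whose first-PC probability falls below a chosen threshold would then be flagged as outlier candidates; the threshold itself is a free choice and is not given in the text.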

However, if the probabilities of several PCs are calculated, it is possible to calculate the “normality” of a truck. The distribution of these probabilities, for varying numbers of PCs, can be seen in Figure 4.20. Figure 4.21 shows a few example fuel consumption plots of trucks along with their normality. For these samples, the first four PCs, weighted by their eigenvalues, were used.
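The combination of several per-PC probabilities into one normality value might be sketched as below. The unweighted case is the plain product of the probabilities, as described for Figure 4.20. The exact weighting scheme is not given in the text, so the weighted variant here, a geometric mean with exponents proportional to the eigenvalues, is purely an assumption, and the function name `normality` is hypothetical.

```python
import math

def normality(probabilities, eigenvalues=None):
    """Combine per-PC probabilities of one truck into a single value.

    Without eigenvalues: plain product of the probabilities.
    With eigenvalues: assumed weighting, a geometric mean whose
    exponents are the eigenvalue shares (they sum to 1).
    """
    if eigenvalues is None:
        return math.prod(probabilities)
    total = sum(eigenvalues)
    return math.prod(
        p ** (lam / total) for p, lam in zip(probabilities, eigenvalues)
    )
```

Under this assumed weighting, an unlikely score on a dominant component lowers the normality more than the same unlikely score on a minor component, and the result stays in the range from 0 to 1 regardless of how many PCs are used.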

In comparison to clustering the data with extracted features, as in Section 4.1, this approach delivers a probability value and not a grouping of the data. The probability value provides finer increments than the clusters do. The normality method is tightly coupled to the PCs calculated with the help of PACE, whereas clustering works on arbitrary extracted features. However, for finding outliers in the data, calculating the normality comes much closer to a non-parametric method.

[Figure 4.19: four normal probability plots, panels PC 1, PC 2, PC 3 and PC 4; x-axis: Data, y-axis: Probability]

Figure 4.19: These are the normal probability plots for the first four PCs. The dashed red line represents the ideal normal distribution, whereas the blue crosses are the actual observations.

[Figure 4.20: four histograms, panels 1 PC, 3 PCs, 4 PCs and 4 PCs (weighted); x-axis: Probability, y-axis: Number of Observations]

Figure 4.20: These histograms show the likelihoods for the occurrence of a single truck, with different numbers of PCs used for the calculation. When multiple PCs are used, the result is given by the product of the probabilities of all principal components. In the rightmost histogram, the likelihoods are weighted by the eigenvalues of the PCs.

[Figure 4.21: eight panels of mileage in km/l over distance in km: Vehicle # 88 (Prob. 78.84%), Vehicle # 92 (Prob. 75.60%), Vehicle # 4 (Prob. 50.74%), Vehicle # 72 (Prob. 53.48%), Vehicle # 6 (Prob. 25.93%), Vehicle # 16 (Prob. 18.20%), Vehicle # 14 (Prob. 2.360%), Vehicle # 106 (Prob. 11.84%)]

Figure 4.21: These graphs depict trajectories of several trucks, along with their normality. The normality describes how average the fuel consumption of one truck is relative to all other trucks.

[Figure 4.22: three panels over Distance [km]: Mean Function (Average Speed [km/h]), PCs (legend: PC1 43%, PC2 16%, PC3 13%, PC4 7%), and Observations]

Figure 4.22: This figure shows the mean curve of average vehicle speed, the PCs and a scatter plot of all available observations. The mean curve indicates that trucks with a high odometer count have a higher average speed.

4.5 Expansion of our Application

As an example of applying PACE to other data, the average vehicle speed is used. Furthermore, in this section the PACE method is applied to cyclic fuel consumption data, even though PACE was developed for longitudinal data.

The results of PACE on the average vehicle speed can be seen in Figure 4.22. The observations in the speed data are distributed similarly to those in the fuel consumption data.

The most interesting outcome from the analysis of seasonal fuel consumption was not the mean curve, but the second and the third principal component, which seem to resemble high fuel efficiency in spring and autumn. Based on those