In this chapter, we address several important issues to be considered when designing outlier detection techniques for WSNs and explain why general outlier detection techniques are not directly applicable to WSNs. We provide a technique-based taxonomy of the outlier detection techniques specifically developed for WSNs and compare them in a table. We further present a guideline of requirements for an optimal outlier detection technique for WSNs.
Based on the analysis of general outlier detection techniques and the overview of current outlier detection techniques for WSNs, we conclude that no generic outlier detection technique is applicable to all application domains or data types. Also, no existing outlier detection technique considers all issues of outlier detection and satisfies all requirements. As we will show in the coming chapters, we take into account all the requirements presented in the guideline for an optimal outlier detection technique for WSNs while designing our techniques. The performance evaluation and the design choices we make show that these techniques fulfill the specified requirements.
Sensor Data Labelling Techniques
To measure the performance of an outlier detection technique, one needs a reference value, usually called a ground truth. Often, labelling techniques are used to label sensor data and classify each data point as normal or outlier. The choice of the labelling technique strongly influences the performance of outlier detection techniques. Therefore, it is important to choose the right technique for labelling a dataset before performing outlier detection.
In doing so, the shape of the dataset and the definition of outliers used by the application at hand are the two deciding factors in which labelling technique should be used. In this chapter, we first define different types of outliers and then investigate the performance of four labelling techniques, based on Mahalanobis distance, density, running average, and Bayesian networks, in identifying them. To show the impact of labelling techniques on the outlier detection process, we will use the dataset labelled with these techniques in the following chapters.
As shown in the previous chapters, the term outlier can be defined in many different ways, depending on the context and the outlier detection technique used. The definition of an outlier depends on the application and on the characteristics of the sensor data to be analyzed.
To evaluate the performance of an outlier detection technique, one needs a reference value, usually called a ground truth. However, quite often the ground truth is not available. Therefore, labelling techniques are used to label sensor data and assign each data point to a normal or outlier class. Because the term outlier can be interpreted in various ways, and because a labelling technique that works well for one dataset may perform badly on another, it is hard to choose a suitable labelling technique. The characteristics of the dataset as well as the labelling technique are the two deciding factors in this selection. To complicate matters, an outlier detection technique might achieve a very high detection rate on the labels produced by one labelling technique while failing on the labels produced by another.
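The dependence of the measured performance on the labelling can be made concrete with a minimal sketch. It assumes that the detection rate is the fraction of labelled outliers that the detector also flags, and the false alarm rate the fraction of labelled normal points that it flags. The function names and all data below are hypothetical and serve only as an illustration.

```python
def detection_rate(labels, predictions):
    """Fraction of labelled outliers that the detector also flags."""
    flagged = sum(1 for l, p in zip(labels, predictions) if l and p)
    total = sum(1 for l in labels if l)
    return flagged / total if total else 0.0


def false_alarm_rate(labels, predictions):
    """Fraction of labelled normal points that the detector flags."""
    false_alarms = sum(1 for l, p in zip(labels, predictions) if not l and p)
    normals = sum(1 for l in labels if not l)
    return false_alarms / normals if normals else 0.0


# Hypothetical data: True marks an outlier, False a normal point.
# The detector output is identical in both comparisons; only the
# labelling (the assumed ground truth) differs.
detector_output = [False, True, False, False, True, False, False, False]
labels_a = [False, True, False, False, True, False, False, False]
labels_b = [False, False, True, False, False, False, True, False]

dr_a = detection_rate(labels_a, detector_output)
far_a = false_alarm_rate(labels_a, detector_output)
dr_b = detection_rate(labels_b, detector_output)
far_b = false_alarm_rate(labels_b, detector_output)

print(f"labelling A: detection rate {dr_a:.2f}, false alarm rate {far_a:.2f}")
print(f"labelling B: detection rate {dr_b:.2f}, false alarm rate {far_b:.2f}")
```

With these made-up values, the same detector output matches labelling A perfectly and misses every outlier of labelling B, although nothing about the detector itself has changed.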
To clarify, let us assume we use a clustering-based technique to label a dataset and then use a time series streaming based outlier detection technique to identify outliers. Clustering-based labelling techniques consider outliers to be either data points that do not belong to any cluster or clusters that are significantly smaller than other clusters [136, 56]. Time series streaming based outlier detection techniques, however, state that if the removal of a point from the time sequence results in a sequence that can be represented more briefly than the original one, then the point is an outlier [75]. These two definitions of outliers have very little in common, so the outlier detection technique fails to correctly identify the outliers labelled by the labelling technique. On the one hand, a dataset is usually not labelled solely for the purpose of outlier detection, and many other applications will use it. On the other hand, often no information is available about which labelling technique has been used to label the data. Therefore, it is necessary to have a guideline on the circumstances under which each labelling technique can identify outliers.
Since there exists no universally accepted definition of an outlier, there is also no general-purpose labelling technique. In this chapter we investigate and compare four data labelling techniques based on Mahalanobis distance, density, running average, and Bayesian networks. This results in the identification of various types of outliers occurring in sensor datasets, which will be presented in Section 3.2.
The real dataset used by our labelling techniques is described in Section 3.3.
A detailed explanation of the four labelling techniques used in this chapter is provided in Section 3.4. We present a thorough comparison between these labelling techniques in terms of performance, complexity, and the effect of the data characteristics in Section 3.5. Based on these comparisons, we present in Section 3.6 a guideline on choosing the labelling technique that best fits the characteristics of the outlier detection techniques presented in later chapters. Finally, this chapter is concluded in Section 3.7.
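As a rough preview of what a labelling technique does, the sketch below labels two-dimensional readings by their Mahalanobis distance to the sample mean, d(x) = sqrt((x - m)^T S^-1 (x - m)), where m is the sample mean and S the sample covariance matrix. This is only an illustration under stated assumptions and not necessarily the formulation given in Section 3.4: the function name, the fixed threshold, the restriction to two attributes, and the temperature and humidity values are all hypothetical.

```python
import math


def mahalanobis_labels(points, threshold):
    """Label 2-D points as outliers (True) when their Mahalanobis distance
    to the sample mean exceeds `threshold` (an assumed, fixed cut-off)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)

    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    labels = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse(S) * (dx, dy)^T, written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        labels.append(math.sqrt(d2) > threshold)
    return labels


# Hypothetical (temperature, humidity) readings: nine regular readings
# followed by one reading that deviates strongly in both attributes.
readings = [(t, h) for t in (19.5, 20.0, 20.5) for h in (48.0, 50.0, 52.0)]
readings.append((24.0, 66.0))

labels = mahalanobis_labels(readings, threshold=2.5)
print(labels)
```

In this made-up example only the last reading is labelled as an outlier. Because the mean and covariance are estimated from the same data that is being labelled, a strongly deviating reading also inflates the covariance, so the choice of threshold matters.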