3.4 Data Labelling Techniques
3.4.4 Naive Bayesian-Based Labelling Technique
We choose the Naive Bayesian classifier presented by [28] to label the data. To be able to use Bayesian classifiers, the data should be divided into a number of mutually exclusive classes over the full length of the data range. This technique needs to learn the data distribution before it can detect outliers. During the training phase, for each node, the number of occurrences of each combination of historical and current samples of the node itself, as well as current samples of the node's neighbors, is calculated. Based on these numbers, the labelling technique calculates the probabilities of the occurrence of different combinations of classes for the current, previous and neighbor samples [28]. Since this technique calculates, for each sample, the probability of belonging to each possible class, a sensor value that does not fall in the class with the highest probability is labelled as an outlier.

Figure 3.12: Outliers found using Naive Bayesian-based labelling technique
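The training and labelling steps described above can be sketched as follows. This is a minimal sketch under the naive independence assumption, not the implementation of [28]: the function names, the add-one smoothing parameter `alpha`, and the way the counts are factorised per context variable are illustrative assumptions.

```python
from collections import Counter

def train(samples):
    """Count class co-occurrences.

    samples: iterable of (current, previous, neighbor_a, neighbor_b)
    class indices for one node (hypothetical input layout).
    """
    counts = {"cur": Counter(), "prev": Counter(),
              "nb_a": Counter(), "nb_b": Counter()}
    total = 0
    for cur, prev, nb_a, nb_b in samples:
        counts["cur"][cur] += 1
        counts["prev"][(prev, cur)] += 1
        counts["nb_a"][(nb_a, cur)] += 1
        counts["nb_b"][(nb_b, cur)] += 1
        total += 1
    return counts, total

def class_probabilities(model, prev, nb_a, nb_b, n_classes, alpha=1.0):
    """Probability of each class for the current sample, given its context.

    Assumption: the context variables are treated as conditionally
    independent given the current class; alpha is add-one smoothing.
    """
    counts, total = model
    scores = []
    for c in range(n_classes):
        n_c = counts["cur"][c]
        score = (n_c + alpha) / (total + alpha * n_classes)  # class prior
        for key, ctx in (("prev", prev), ("nb_a", nb_a), ("nb_b", nb_b)):
            score *= (counts[key][(ctx, c)] + alpha) / (n_c + alpha * n_classes)
        scores.append(score)
    norm = sum(scores)
    return [s / norm for s in scores]

def is_outlier(model, cur, prev, nb_a, nb_b, n_classes):
    """Label the sample an outlier if its class is not the most probable one."""
    probs = class_probabilities(model, prev, nb_a, nb_b, n_classes)
    return cur != probs.index(max(probs))
```

In this sketch, a node whose previous sample and both neighbor samples fall in class 0 is expected to report class 0 as well, so a current sample in any other class is labelled as an outlier.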
This definition of an outlier, however, only applies to 1-D datasets. To be able to use the technique for 2-D datasets, we consider an observation to be an outlier if one or both of its sensor values are marked as outliers by the technique. In the original paper, the local neighborhood consists of two randomly chosen distinct neighbors from the immediate one-hop neighborhood of the computing node. Since we do not know the connectivity of the Grand St. Bernard deployment, we instead use two randomly chosen neighbors within a radius of 0.5 kilometers around the computing node. We choose to use four classes for each feature. The boundaries of the classes are defined as:
$$[\min(\vec{s}_d),\ \mathrm{median}(\vec{s}_d) - \mathrm{std}(\vec{s}_d),\ \mathrm{median}(\vec{s}_d),\ \mathrm{median}(\vec{s}_d) + \mathrm{std}(\vec{s}_d),\ \max(\vec{s}_d)] \tag{3.4}$$

where $\vec{s}_d$ is a vector containing the sensor data of feature $d$. An example of a dataset labelled using this technique is depicted in Figure 3.12.
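The class boundaries of Equation 3.4 can be computed as in the sketch below. The function names are hypothetical, and two details are assumptions because the text does not specify them: the population standard deviation is used, and a value lying exactly on an interior boundary is assigned to the upper class.

```python
import bisect
import statistics

def class_boundaries(values):
    """Return the five boundaries of Equation 3.4 for one feature."""
    med = statistics.median(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [min(values), med - std, med, med + std, max(values)]

def assign_class(value, boundaries):
    """Map a value to one of the four classes 0..3.

    Only the three interior boundaries are needed to split the range.
    Assumption: a value equal to an interior boundary goes to the upper class.
    """
    return bisect.bisect_right(boundaries[1:-1], value)
```

For example, `assign_class(3.0, class_boundaries(data))` returns the index of the class that the reading 3.0 falls in for the feature vector `data`.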
Implementing the original algorithm presented in [28] did not lead to good outlier detection results. As can be observed from Figure 3.13, this technique labels around 37% of the data as outliers. The data values labelled as outliers appear to be concentrated mostly on the boundaries of the classes. One cause is that the Grand St. Bernard dataset is not linearly repeatable. Another is that the dataset is very diverse in both the range and the distribution of the data. Moreover, during some periods the data barely changed, while during others it fluctuated constantly.

Feature       min     median − std   median   median + std   max
Temperature   -0.92   0.22           2.79     5.37           10.93
Humidity      64.75   79.34          84.86    90.39          93.66

Table 3.6: Four classes used in the Naive Bayesian-based labelling technique
We improve the percentage of detected outliers in the original algorithm by changing the distinguishing factor of the classes to:

$$p_n \times c < p_s \tag{3.5}$$

where $p_n$ and $p_s$ are the probabilities of a sensor observation being in class $n$ and class $s$, respectively, and $c$ is a constant weight applied to $p_n$. In other words, $p_s$ must be more than $c$ times as large as $p_n$ for the condition to hold. We repeat the experiments using the new distinguishing factor for different values of $c$ ($c$ = 1, 10, 100, 1000). Figures 3.13, 3.14, 3.15, and 3.16 illustrate these experimental results. The four classes used for this dataset are shown in Table 3.6.
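The effect of the weight in Equation 3.5 can be illustrated with a small sketch. It rests on an interpretation that the text does not state explicitly: we assume that class n is the class the observed value falls in and that class s is the class with the highest probability. The function name and the probability values in the usage note are hypothetical.

```python
def is_outlier_weighted(probs, observed_class, c=1.0):
    """Apply the condition p_n * c < p_s of Equation 3.5.

    Assumption: p_n is the probability of the observed class and
    p_s is the highest class probability.
    """
    p_n = probs[observed_class]
    p_s = max(probs)
    return p_n * c < p_s
```

With the made-up probabilities `[0.05, 0.6, 0.3, 0.05]`, an observation in class 2 is labelled as an outlier for c = 1 but not for c = 10, while an observation in class 0 is still labelled as an outlier for c = 10 and only passes for c = 100. Larger values of c therefore reserve the outlier label for observations whose class is far less probable than the best class.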
To address the problem of non-linearly repeatable data, we convert the Cartesian coordinates to polar coordinates, which improved the performance of the algorithm. However, the translation itself is very difficult. It is usually easy only for data with a circular or elliptical shape, whereas the datasets selected from the Grand St. Bernard deployment have irregular shapes with no similarities between them. This makes it hard to prepare the data for labelling with this algorithm. To deal with this irregularity, a clustering algorithm can be applied to determine the centers of the data, which eases the translation into polar coordinates.
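The coordinate translation can be sketched as below for a 2-D dataset of (temperature, humidity) pairs. The function name is hypothetical, and the choice of center is an assumption: the sketch falls back to the centroid of the points, whereas a center found by a clustering algorithm could be passed in through the `center` argument.

```python
import math

def to_polar(points, center=None):
    """Convert (x, y) pairs to (radius, angle) pairs around a center.

    Assumption: if no center is given, the centroid of the points is used.
    Angles are in radians, in the range (-pi, pi].
    """
    if center is None:
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
    else:
        cx, cy = center
    return [(math.hypot(x - cx, y - cy), math.atan2(y - cy, x - cx))
            for x, y in points]
```

For data that is roughly circular around the chosen center, the radius stays nearly constant and only the angle varies, which is why the translation is easy in that case and hard for irregular shapes.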
Another problem faced by this technique is the correlation between the nodes. This correlation is not necessarily constant and is not always bound to distance. Therefore, the sensor values of the neighboring nodes used to calculate the probability are not always related to the sensor values of the node in question. A prediction based on a neighboring node might therefore not help to predict the current sensor observation and might even disturb the prediction. An example of these correlation changes is depicted in Figures 3.17 and 3.18, which show the correlation between humidity and temperature in both small and big network clusters for two consecutive days.

Figure 3.13: Data labelled using Bayesian classifiers, c=1
Figure 3.14: Data labelled using Bayesian classifiers, c=10
Figure 3.15: Data labelled using Bayesian classifiers, c=100
Figure 3.16: Data labelled using Bayesian classifiers, c=1000
Figure 3.17: Correlation between humidity and temperature on 2007/09/26
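One way to check whether a correlation holds from one day to the next is to compute a correlation coefficient per day and compare the values. The sketch below uses the Pearson coefficient, which is our assumption: the text does not state which correlation measure underlies the figures. The function name is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series.

    Returns None when either series is constant, since the coefficient
    is undefined in that case.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson` on the temperature and humidity readings of each day separately gives one coefficient per day. A large difference between consecutive days indicates that the relation learned during training may no longer hold when outliers are detected.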
The complexity of this Naive Bayesian-based labelling technique is $O(n)$ for training, $O(c^3)$ for the storage of the probability tables, and $O(n)$ for the outlier detection procedure, where $n$ is the number of classes and $c$ is a constant value representing the weight of the probability of a data point fitting in class $n$.
Because of the poor results obtained in the experiments, we exclude the Naive Bayesian-based labelling technique from the rest of the discussion.