Data preprocessing - Semantic IoT for reasoning and BigData analytics

5.4 Implementation

5.4.2 Data preprocessing

Data preprocessing is done inside the car. In this project, the tasks performed by the preprocessing layer are limited to filtering the multivariate data for its relevant attributes and to anomaly detection. There are two reasons for limiting the tasks at this point.

First, while the actual reverse engineering process is considered to be transparent, it should still be taken into account that the full computational power of the Raspberry Pi used in the car is unavailable. Secondly, a Raspberry Pi is an unusually powerful device to have this close to the sensing layer. In most cases, this will be either a smart device or a less powerful microcontroller. Leveraging the full power of a Raspberry Pi would create an unrealistic processing model for the vast majority of IoT applications, and the architecture would lose its generality.

Anomaly detection

Discovering potholes in the data can be considered a specific case of anomaly detection. In the general case, anomalies are patterns in data that do not exhibit the expected behavior. The problem of anomaly detection has been studied extensively, because anomalies contain data of interest in a broad range of applications. Consequently, a broad variety of approaches exist [44].

Some of these can be adapted for processing streaming data.

Statistical approaches. The oldest approaches to anomaly detection are the statistics-based approaches. Assuming the dataset is generated by a statistical model, the probability that a particular data element is obtained from it is known. Values that have a significantly low probability of occurring are considered outliers, which in turn can be considered anomalies. In the case of a Gaussian model, a simple method is to use a box plot, which roughly equates to considering any value that is not within 3σ of the mean to be anomalous. In a data streaming context, these kinds of methods offer a low computational complexity, but in many cases the data will not follow a known statistical distribution.
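As an illustration of why such methods are cheap on a stream, the 3σ rule can be sketched with a running mean and variance (Welford's online algorithm), so that no past values need to be stored. This is a minimal sketch, not the implementation used in the project; the names `RunningGaussian` and `detect`, the warm-up length and the default `k=3.0` are assumptions made for the example.

```python
import math


class RunningGaussian:
    """Running mean and variance using Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation; 0.0 until two values are seen.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0

    def is_anomalous(self, x, k=3.0):
        """True if x lies more than k standard deviations from the mean."""
        sigma = self.std()
        if sigma == 0.0:
            return False
        return abs(x - self.mean) > k * sigma


def detect(stream, k=3.0, warmup=10):
    """Return indices of values flagged before they are added to the model."""
    model = RunningGaussian()
    flagged = []
    for i, x in enumerate(stream):
        # Hypothetical warm-up: do not flag until a few values are seen.
        if i >= warmup and model.is_anomalous(x, k):
            flagged.append(i)
        model.update(x)
    return flagged
```

Each value costs a constant number of operations and a constant amount of memory. Note that in this sketch flagged values are still added to the model, so a large outlier inflates the estimated standard deviation for the values that follow.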

Similar to the statistical approaches are the regression-based approaches. In these approaches, a regression model is fitted to the data. The difference between the actual data and the regression model, often called the residual, can then be used to determine how anomalous the behavior is. Once again, the main issue is that data is often not generated following a clear model [44].
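The residual idea can be sketched with the simplest possible regression model, a least-squares straight line fitted to equally spaced samples. The choice of a linear model and the function names `fit_line` and `residuals` are assumptions made for the example; any other regression model could take their place.

```python
def fit_line(ys):
    """Least-squares line y = a + b*t through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)  # requires at least two points
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b


def residuals(ys):
    """Absolute difference between each value and the fitted line."""
    a, b = fit_line(ys)
    return [abs(y - (a + b * t)) for t, y in enumerate(ys)]
```

A value whose residual is much larger than the others is a candidate anomaly. The sketch also shows the weakness mentioned above: if the data does not follow the assumed model, every residual is large and the score loses its meaning.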

Nearest neighbor-based techniques build upon the idea that normal data instances have close neighbors, while anomalies occur far away. These algorithms use the distance from an instance to its k-th nearest neighbor, or compute the density of the data around an instance. The local outlier factor technique [45] is one of the major contributions in this field. It compares the local density of an instance to that of its neighbors in order to detect anomalies. Nearest neighbor techniques are computationally more expensive than the previous methods, but can be applied more often in the general case.
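The distance-based variant can be sketched as follows for one-dimensional data. This is the plain k-th nearest neighbor distance, not the local outlier factor of [45]; the function name `knn_scores` and the default `k=2` are assumptions made for the example.

```python
def knn_scores(values, k=2):
    """Score each 1-D value by the distance to its k-th nearest neighbor."""
    scores = []
    for i, v in enumerate(values):
        # Distances to every other instance, smallest first.
        distances = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(distances[k - 1])
    return scores
```

Every instance is compared with every other instance, which illustrates why these techniques cost more than the statistical methods: the work grows quadratically with the number of instances unless an index structure is used.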

Clustering techniques. Clustering techniques can be considered similar to density-based techniques, in that they often require some kind of distance computation in order to do the clustering. The key difference is that each instance is compared to the cluster it belongs to, rather than to its local neighborhood. In the case of clustering-based techniques, an anomaly is either not part of a cluster, part of a sparse cluster, or far from its respective cluster centroid. One of the main concerns for clustering-based anomaly detection is the high computational complexity [44].

Hierarchical Temporal Memory (HTM) algorithm. In recent work, the Hierarchical Temporal Memory (HTM) algorithm has been adapted to perform online, unsupervised anomaly detection [46]. HTM is a machine intelligence framework based on neuroscience. It models spatial and temporal patterns in time sequences. The framework itself does not produce anomaly values, but was adapted to do so by using some of its internal data. First, a raw anomaly score is computed from the prediction vector π(x_t) and the actual value a(x_t) by comparing the actual value with its prediction. Both are binary vectors, which is the data representation used internally in the HTM framework. This raw score is then used to compute the anomaly likelihood. To determine the anomaly likelihood, a rolling window over the past raw anomaly scores is used to estimate a normal distribution. The mean μ_t and standard deviation σ_t, along with a moving average μ̃_t computed over a smaller range, are then used in a Gaussian tail function Q. The anomaly likelihood L_t can thus be determined as in Eq. 5.2.

$$L_t = 1 - Q\left(\frac{\tilde{\mu}_t - \mu_t}{\sigma_t}\right) \qquad (5.2)$$

An anomaly is detected by applying a threshold to this value. An interesting property of this method is that it simply offers a means of judging an anomaly score against the recent history of anomaly scores. Hence, this part of the detection method could be extended to other methods that produce anomaly scores as well. It should be noted, however, that this algorithm is more complex and requires considerable work to tune in order to achieve the desired accuracy. Likewise, it also requires more computing resources than, for instance, statistics-based methods.
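Because the likelihood step of Eq. 5.2 only needs a stream of raw anomaly scores, it can be sketched independently of HTM. The sketch below assumes the raw scores are supplied by some other detector; the class name `AnomalyLikelihood`, the window sizes, and the value returned when the standard deviation is zero are assumptions made for the example, not values taken from [46]. The Gaussian tail function is written with the complementary error function, Q(x) = erfc(x/√2)/2.

```python
import math
from collections import deque


def gaussian_tail(x):
    """Q(x): probability that a standard normal variable exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2))


class AnomalyLikelihood:
    def __init__(self, long_window=100, short_window=10):
        # Hypothetical window sizes; both need tuning in practice.
        self.scores = deque(maxlen=long_window)
        self.short_window = short_window

    def update(self, raw_score):
        """Add a raw anomaly score and return the likelihood of Eq. 5.2."""
        self.scores.append(raw_score)
        scores = list(self.scores)
        mean = sum(scores) / len(scores)
        sigma = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
        recent = scores[-self.short_window:]
        short_mean = sum(recent) / len(recent)
        if sigma == 0.0:
            # Assumption: with no spread the short-term mean equals the
            # long-term mean, so treat the argument of Q as 0.
            return 0.5
        return 1.0 - gaussian_tail((short_mean - mean) / sigma)
```

A run of high raw scores pulls the short-term average above the long-term mean and drives the likelihood towards 1, while a single noisy score is averaged out over the short window.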

Data filtering

The filtering of the data is a simple process of reading only the relevant sensor data from the data file. In the use case of detecting potholes, the data of interest is the rotational speed of the individual wheels. Each wheel is considered separately from the others. A small sample from the dataset is given in figure 5.4.

Approach using differences. There is a clear difference in size between the spikes caused by holes and the spikes that occur when the speed of the car increases. A very simple idea is thus to measure the difference between consecutive points and apply a threshold to it, in order to determine whether or not it was caused by a hole in the road.

Essentially, the pattern that occurs when the car drives through a hole is simplified to its peak to peak value this way. The result of the algorithm is a simple classification - a hole is either detected or not.
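The difference approach can be sketched in a few lines. The function name `detect_holes`, the sample values and the threshold are hypothetical and only serve to illustrate the idea.

```python
def detect_holes(speeds, threshold):
    """Return the indices where the jump from the previous point exceeds threshold."""
    return [
        i
        for i in range(1, len(speeds))
        if abs(speeds[i] - speeds[i - 1]) > threshold
    ]


# Hypothetical wheel speed samples: a slow increase, then a short spike.
samples = [50.0, 50.0, 50.5, 50.5, 53.0, 48.0, 50.5, 51.0]
holes = detect_holes(samples, threshold=2.0)
```

With these values the small steps caused by acceleration stay below the threshold, while the three large jumps around the spike are reported.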

Figure 5.4: A sample of the dataset

Simple statistical approach. Another approach implemented is based on the concept of outliers in statistics. Although there is no clear presence of a distribution, this approach uses the observation that, when there is no pothole, the data is either stable or exhibits a step-like function (which is a result of the quantization levels of the sensor). When there is a pothole, the values spike up and down for a short period, as depicted in figure 5.5. The lower values and higher values cancel each other when calculating the mean.

A spike can therefore be expected to lie more standard deviations away from the mean than a step. The anomaly score of a point is calculated as follows:

• Center a window with N points around the point to calculate the anomaly score for.

• Calculate the mean and standard deviation of the window.

• Calculate the number of standard deviations by which the current value deviates from the mean (the z-score).

• Obtain the confidence interval associated with this z-score. The anomaly score is the inverse of the confidence.

This leaves two parameters to decide: the size of the window, and the threshold above which a value is considered anomalous. These will be investigated later when evaluating the algorithm in section 6.3.3. An advantage of the method is that the resulting anomaly value lies in a normalized range by default.
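The steps of the simple statistical approach can be sketched as follows. The text does not spell out how the z-score is mapped to a score, so the sketch assumes one possible reading: the score is the probability mass of a standard normal distribution that lies within ±z, which is 0 for a point at the mean and approaches 1 for a strong outlier. The function name `anomaly_scores`, the truncation of the window at the edges of the series and the use of the population standard deviation are also assumptions.

```python
import math


def anomaly_scores(values, window=5):
    """Score each point in [0, 1] from its z-score within a centered window."""
    half = window // 2
    scores = []
    for i, x in enumerate(values):
        # The window is truncated at the start and end of the series.
        win = values[max(0, i - half): i + half + 1]
        mean = sum(win) / len(win)
        std = math.sqrt(sum((v - mean) ** 2 for v in win) / len(win))
        if std == 0.0:
            scores.append(0.0)  # flat window: nothing deviates
            continue
        z = abs(x - mean) / std
        # Assumed mapping: mass of a standard normal within +/- z.
        scores.append(math.erf(z / math.sqrt(2)))
    return scores
```

In this sketch the point being scored is part of its own window, so its z-score can never exceed √(N − 1) for a window of N points. The window size therefore also limits the highest score that can be reached, which is worth keeping in mind when choosing the threshold.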

Figure 5.5: Occurrence of a pothole at wheel A. The rotation speed of the wheels fluctuates during a period of 0.15s.

Data communication

To communicate the data to the edge, the publish-subscribe protocol MQTT (Message Queuing Telemetry Transport) is used. It is an established standard [47] for topic-based communication in IoT environments. It is lightweight, requiring few device resources. Furthermore, it has a low network overhead, which is highly desirable in this use case. The MQTT standard also provides options for delivery assurance (at most once, at least once and exactly once) and data availability in bad network environments [48].

Lastly, implementations exist for various programming languages, increasing the portability of the system. The topic based model itself simplifies the aggregation of the data, and can also later be reused in the API to manage all nodes that are located at the same level in the layered model.
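To make the topic-based model concrete, the sketch below builds the topic, payload and quality-of-service level for one anomaly reading without depending on a particular client library. The three QoS numbers are those defined by the MQTT standard; the topic layout, the JSON field names and the function name `build_message` are hypothetical and not taken from the project.

```python
import json

# MQTT quality-of-service levels as defined by the standard.
QOS_AT_MOST_ONCE = 0
QOS_AT_LEAST_ONCE = 1
QOS_EXACTLY_ONCE = 2


def build_message(car_id, wheel, timestamp, score):
    """Return (topic, payload, qos) for one anomaly reading."""
    # Hypothetical topic layout: one level per car and per wheel.
    topic = f"cars/{car_id}/wheels/{wheel}/anomaly"
    payload = json.dumps({"t": timestamp, "score": score})
    return topic, payload, QOS_AT_LEAST_ONCE
```

The resulting triple can be handed to the publish call of whichever MQTT client library is used on the device. With a hierarchical layout like this one, a subscriber at the edge can aggregate readings per car or per wheel by subscribing at the corresponding topic level.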
