• No results found

Temporal Correlation-Based Outlier Detection Technique

4.5 Statistical-Based Outlier Detection Techniques

4.5.1 Temporal Correlation-Based Outlier Detection Technique

We here present our temporal correlation-based outlier detection technique (TOD), which instantly identifies temporal outliers in a single node using temporal corre-lation of local data. A temporal outlier is an actual observation x(s, t) exceeding the confidence interval of its predicted value ˆx(s, t) with a given probability.

Upon completion of temporal correlation modelling and value prediction, it is now time to identify outliers. To clarify TOD, we consider three questions, that are, (i) how to deal with x(s, t) after identifying it as normal or outlier. (ii) how to distinct x(s, t) as an error or indication of the occurrence of an event, if x(s, t) is identified as an outlier. (iii) when and how to update the temporal model to better capture changes in normal behavior of sensor data.

For the first question, identified normal observations can be directly used for the next prediction. Otherwise, after detecting an outlier x(s, t), TOD uses observations at the previous time instances {x(s, t − p), . . . , x(s, t − 1)} and the temporal correlation model to identify the next observation sequence as normal or outlier. The standard error of predicted values will be incrementally updated step by step. Here two possibilities exist:

If all observations of the entire sequence are identified as outliers, the entire sequence represents occurrence of an event and indicates changes in normal

behavior of sensor data. Consequently, this observation sequence including x(s, t) will replace {x(s, t − p), . . . , x(s, t − 1)} for the future prediction.

If only a few observations in the sequence are detected as outliers, we con-sider the observations labelled as outlier as errors and will not use them for the next prediction process. Instead, the predicted values of these errors are used to predict the next observation. In case of identifying absolute errors, the predicted values will directly replace the outliers in the next prediction process.

Being able to correctly label the sequence as event or error is very important because an error-labelled sequence will be replaced by a new sequence of predicted values, while an event-labelled sequence will be directly used in the following prediction process.

We further discuss how to decide about the length of the observation sequence to distinguish between errors or events. This actually is a practical problem depending on application requirements and sampling rates. One may note that the duration of an event is not known beforehand. Moreover, some events last longer than the others. These characteristics make identification of the entire event difficult. Therefore, we aim at identifying changes in the normal behavior of data instead of identifying the entire event. This implies that the length of the observation sequence is not necessary to be long. As stated in Chapter 3, we chose the minimum length of a sequence to be four observations to distinguish between different types of outliers in the Grand St. Bernard dataset. This implies that each observation sequence lasts for 8 minutes.

TOD waiting to distinguish between events and errors does not cause a consid-erable delay in the process of identifying type of outlier, as each new observation is labelled as outlier or normal upon arrival. Spending a short delay for making this distinction is unavoidable. In this way, the second question is completely answered.

For the third question regarding when and how to update the temporal model, changing this parametric temporal model requires the same amount of observa-tions to perform the model update. Updating the model consumes high resources in terms of memory and processor. To solve these problems, we can update the model only at the end of the day. After outlier detection is finished for one day, we have a new clean time series consisting of the original normal observations, the event-based observations, and the predicted values which replace the detected errors. This new time series can be directly used to model the possibly changed time correlation without any smoothing operation. Each node can check whether any event has been detected in the whole time series. Otherwise, the existing tem-poral correlation model can still be used for the following day. In this condition,

1 procedure FittingTemporalModel(time series data) 2 each node smoothes time series data;

3 each node models temporal correlation by fitting a AR model;

4 initiate OutlierDetectionProcess(AR model parameters);

5 return;

6 procedure OutlierDetectionProcess(AR model parameters) 7 when a new observation x(s, t) arrives

8 predict ˆx(s, t) using AR model parameters as well as previous observations;

9 if(x(s, t) > (ˆx(s, t) + ci ∗ σ) or x(s, t) < (ˆx(s, t) − ci ∗ σ)) 10 x(s, t) indicates a temporal outlier;

11 successive outliers + 1;

12 use the same previous observations for the next prediction;

13 if(the number of successive outliers > the length of time sequence) 14 it indicates a change in normal behavior of sensor data;

15 these successive outliers are used for the next prediction;

16 endif;

17 else

18 if(the number of successive outliers > 0)

19 predicted values of previous temporal outliers are used for the next prediction;

20 endif

20 x(s, t) indicates a normal observation and is used for the next prediction;

21 endif;

22 if(at the end of the day)

23 initiate UpdatingTemporalModel(time series data);

24 endif 25 return;

26 procedure UpdatingTemporalModel(time series data) 27 refit a temporal correlation AR model;

28 return;

Table 4.1: Pseudocode of TOD

each node keeps on using the model until a significant change in the behavior of data is detected. This can contribute to reduction of number of model updates.

The corresponding pseudocode for TOD is shown in Table 4.1.

TOD enables that each node analyzes its local data to detect outliers in an online manner and to further distinguish between different types of outliers in a timely manner. Although its detection performance may be impacted by having only local data analysis, its local processing and low resource consumption makes it ideal for WSNs. An offline outlier detection requires existence of the entire time series and has a considerable detection delay. It also fails in detecting changes in behavior of normal data or between two consecutive time series [93] as soon as it occurs. In contrary, our TOD detects outliers upon arrival of a new observation and can also solve the problem of occurrence of missing values, which can be timely replaced by reliable predicted values.

4.5.2 Spatial Correlation-Based Outlier Detection Techniques