
In this section, we compare the outlier detection performance of top-n LDOF with two typical top-n outlier detection methods, top-n KNN and top-n LOF. Experiments start with a synthetic 2-D dataset which contains outliers that are meaningful but difficult for top-n KNN and top-n LOF to detect. In Experiments 2 and 3, we identify outliers in two real world datasets to illustrate the effectiveness of our method in real world situations. For consistency, we use only the parameter k to represent the neighbourhood size in the investigation of the three methods. In particular, in top-n LOF, the parameter MinPts is set to the neighbourhood size k chosen in the other two methods.
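To make the role of the shared parameter k concrete, the following is a minimal sketch of two of the three scores, not the implementation used in the experiments. It assumes the usual formulations: the top-n KNN score is taken to be the distance to the k-th nearest neighbour, and the LDOF score is the average distance from an object to its k nearest neighbours divided by the average pairwise distance among those neighbours. The function names (`knn_indices`, `knn_score`, `ldof_score`, `top_n`) are hypothetical, and the brute-force neighbour search is chosen for clarity rather than speed.

```python
import math


def knn_indices(points, i, k):
    """Return the indices of the k nearest neighbours of points[i] (excluding i)."""
    if k < 2 or k >= len(points):
        raise ValueError("k must satisfy 2 <= k < number of points")
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def knn_score(points, i, k):
    """Assumed top-n KNN score: distance from points[i] to its k-th nearest neighbour."""
    kth = knn_indices(points, i, k)[-1]
    return math.dist(points[i], points[kth])


def ldof_score(points, i, k):
    """Assumed LDOF score: mean kNN distance divided by mean inner distance of the kNN set."""
    nbrs = knn_indices(points, i, k)
    d_bar = sum(math.dist(points[i], points[j]) for j in nbrs) / k
    pair_sum = sum(math.dist(points[a], points[b])
                   for a in nbrs for b in nbrs if a != b)
    inner = pair_sum / (k * (k - 1))
    # Coinciding neighbours give a zero inner distance; treat the object as maximally outlying.
    return math.inf if inner == 0 else d_bar / inner


def top_n(points, score, k, n):
    """Rank all objects by the given score (highest first) and return the top-n indices."""
    ranked = sorted(range(len(points)),
                    key=lambda i: score(points, i, k), reverse=True)
    return ranked[:n]
```

Both scores take the same k, which mirrors the convention above of using a single neighbourhood size for all methods.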

Synthetic Data. In Figure 3.1(b), there are 150 objects in cluster C1, 50 objects in cluster C2, 10 objects in cluster C3, and 4 additional objects {o1, o2, o3, o4} which are genuine outliers. We ran the three outlier detection methods over a large range of k and used detection precision† to evaluate the performance of each method. In this experiment, we set n = 4 (the number of real outliers). The experimental result is shown in Figure 3.3(a). The precision of top-n KNN becomes 0 when k is larger than 10, due to the effect of the mini-cluster C3 discussed in Section 3.1. For the same reason, the precision of top-n LOF falls sharply when k is larger than 11. When k reaches 13, top-n LOF misses all genuine outliers in the top-4 ranking (they even drop out of the top 10). In contrast, our method does not suffer from the effect of the mini-cluster. As shown in Figure 3.3(a), the precision of our approach remains stable at 100% over a large range of neighbourhood sizes (i.e. 20-50).

Medical Diagnosis Data. In real world data repositories, it is hard to find a dataset for evaluating outlier detection algorithms, because only for very few real world datasets is it known exactly which objects really behave differently [38]. In this experiment, we use a medical dataset, WDBC (Diagnosis)∗,

† Precision = (number of real outliers in the top-n ranking) / n.

Figure 3.3: Detection precision of top-n LDOF, top-n KNN and top-n LOF, plotted against neighbourhood size k, on (a) the synthetic dataset and (b) the WDBC dataset.

which has been used for nuclear feature extraction for breast tumor diagnosis. The dataset contains 569 medical diagnosis records (objects), each with 32 attributes (ID, diagnosis, 30 real-valued input features). The diagnosis is binary: ‘Benign’ and ‘Malignant’. We regard the objects labeled ‘Benign’ as normal data. In the experiment, we use all 357 ‘Benign’ diagnosis records as normal objects and add a certain number of ‘Malignant’ diagnosis records to them as outliers. As shown in Figure 3.4, the majority of the normal data is crowded into one big cluster, which is adjacent to a few scattered clusters. Most of the outliers (malignant cases), indicated by red triangles, are hidden among those scattered clusters. Intuitively, LDOF should achieve better detection performance than LOF and KNN because of its new definition of outlierness.

Figure 3.4: Data visualisation for WDBC dataset.

Figure 3.3(b) shows the experimental result for adding the first 10 ‘Malignant’ records from the original dataset. Based on the rule for selecting the neighbourhood size k suggested in Section 3.3, we set k ≥ 30 with regard to the data dimension. We measure the percentage of real outliers detected among the top-10 potential outliers as detection precision†. In the experiments, we progressively increase the value of k and calculate the detection precision for each method. As shown in Figure 3.3(b), the precision of our method begins to rise at k = 32 and remains stable at 80% when k is greater than 34. In comparison, the precision of the other two techniques stays low over the whole range of k.
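The precision measure used here can be sketched in a few lines. This is an illustrative helper rather than the code behind the reported figures; the function name `top_n_precision` and the toy scores are assumptions, and any of the three methods could supply the scores.

```python
def top_n_precision(scores, is_outlier, n):
    """Fraction of the n highest-scoring objects that are labelled as real outliers.

    scores     -- one outlier score per object (higher means more outlying)
    is_outlier -- ground-truth flags, aligned with scores
    n          -- size of the top-n ranking
    """
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of objects")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:n] if is_outlier[i])
    return hits / n


# Toy example with made-up scores: two real outliers among five objects.
toy_scores = [0.1, 0.9, 0.8, 0.2, 0.7]
toy_labels = [False, True, False, False, True]
precision_at_2 = top_n_precision(toy_scores, toy_labels, 2)
```

With n = 10 and ten injected ‘Malignant’ records, a precision of 80% corresponds to eight real outliers in the top-10 ranking.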

To further validate our approach, we repeat the experiment 5 times with a different number of outliers (randomly extracted from ‘Malignant’ objects). Each time, we perform 30 independent runs, and calculate the average detection precision and standard deviation over the k range from 30 to 50. The experimental results are listed in Table 3.1. The bold numbers indicate that the detection precision vector over the range of k is statistically significantly improved compared to the other two methods (paired T-test at the 0.1 level, with p-value 0.07, 0.01, 0.08 and 0.04 respectively).

Table 3.1: Detection precision of each method based on 30 independent runs on the WDBC dataset.

Number of outliers    Precision (mean ± std.)
                      LDOF          LOF           KNN
1                     0.29±0.077    0.12±0.061    0.05±0.042
2                     0.33±0.040    0.13±0.028    0.11±0.037
3                     0.31±0.033    0.22±0.051    0.22±0.040
4                     0.35±0.022    0.27±0.040    0.26±0.035
5                     0.38±0.026    0.28±0.032    0.28±0.027
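For readers unfamiliar with the paired T-test referred to above, the sketch below computes the test statistic for two precision vectors measured at the same values of k. It is an illustration under assumptions: the function name `paired_t_statistic` is hypothetical, the input numbers are toy values rather than experimental results, and the conversion from the statistic to a p-value (which needs the Student t distribution) is left to a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy precision vectors over four values of k (made-up numbers).
method_a = [0.30, 0.35, 0.40, 0.45]
method_b = [0.20, 0.30, 0.30, 0.40]
t_value, dof = paired_t_statistic(method_a, method_b)
```

Pairing by k removes the variation that both methods share at a given neighbourhood size, so the test looks only at the per-k differences.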

Space Shuttle Data. In this experiment, we use a dataset originally used for classification, named Shuttle‡. We use the testing dataset, which contains 14500 objects; each object has 9 real-valued features and an integer label (1-7). We regard the (only 13) objects with label 2 as outliers, and the objects of the remaining six classes as normal data.

As shown in Figure 3.5, the normal data consists of a main cluster and a few small clusters. Most of the outliers (red triangles) deviate from the main cluster by only a very short distance. The traditional outlierness definitions, i.e. LOF and KNN, are therefore not able to distinguish those genuine outliers from normal data effectively. In contrast, owing to its new definition of outlierness, LDOF should perform significantly better.

Figure 3.5: Data visualisation for Space Shuttle dataset.

We run the experiment 15 times, and each time we randomly pick a sample of normal objects (i.e. 1,000 objects) to mix with the 13 outliers. The mean values of the detection precision of the three methods are presented in Figure 3.6. As illustrated there, top-n KNN has the worst performance (its precision rapidly drops to 0). Top-n LOF is better: it has a narrow precision peak (k from 5 to 15) and then declines dramatically. Top-n LDOF has the best performance, as it rises steadily and keeps a relatively high precision over the k range from 25 to 45. Table 3.2 shows the average precisions for the three methods over 15 independent runs. The bold numbers indicate that the precision vector is statistically significantly improved compared to the other two methods (paired T-test at the 0.05 level, with p-value 0.02).

Table 3.2: Detection precision of each method based on 15 independent runs on the Shuttle dataset.

Precision (mean ± std.)
LDOF          LOF           KNN
0.25±0.081    0.03±0.057    0.08±0.114
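The repeated-sampling protocol of this experiment can be sketched as follows. This is a simplified illustration, not the experimental code: the function name `repeated_precision` and the `score_fn` callback are hypothetical, the sketch assumes n equals the number of injected outliers, and it fixes a random seed only so that the example is reproducible.

```python
import random
from statistics import mean, stdev


def repeated_precision(normal, outliers, score_fn, sample_size, runs, seed=0):
    """Mean and standard deviation of top-n precision over repeated random samples.

    normal      -- list of normal objects to sample from
    outliers    -- list of known outliers, mixed into every sample
    score_fn    -- hypothetical callback: list of objects -> list of outlier scores
    sample_size -- number of normal objects drawn per run (1,000 in the text)
    runs        -- number of independent runs (15 in the text); must be >= 2
    """
    rng = random.Random(seed)
    n = len(outliers)  # assumption: n is set to the number of real outliers
    precisions = []
    for _ in range(runs):
        data = rng.sample(normal, sample_size) + list(outliers)
        scores = score_fn(data)
        ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
        # Outliers occupy the last n positions of `data`.
        hits = sum(1 for i in ranked[:n] if i >= sample_size)
        precisions.append(hits / n)
    return mean(precisions), stdev(precisions)
```

Any scoring method that returns one value per object, including the three compared here, can be passed as `score_fn`.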

3.6 Summary

Figure 3.6: Outlier detection precision of top-n LDOF, top-n LOF and top-n KNN over different neighbourhood sizes k for the Shuttle dataset.

In this chapter, we have proposed a new outlier detection definition, LDOF. Our definition uses a local distance-based outlier factor to measure the degree to which an object deviates from its scattered neighbourhood. Owing to the definition of outlierness proposed in LDOF, genuine outliers are effectively distinguished from small and scattered clusters. In other words, our proposed LDOF algorithm is particularly designed for scattered datasets. Moreover, the definition of outlierness pushes genuine outliers into higher rankings than other methods do. Therefore, in real world applications, LDOF is more effective when combined with a top-n detection algorithm. We have analysed the properties of LDOF, including its lower bound and false-detection probability. Furthermore, a method for selecting k has been suggested. To ease parameter setting in real world applications, the top-n technique has been used in this approach. Experimental results have demonstrated that, compared to top-n KNN and top-n LOF, our new approach discovers outliers with higher precision and remains stable over a large range of neighbourhood sizes. However, from the application point of view, how to judge whether a dataset is scattered remains a major problem. In other words, a mechanism is needed that can properly determine whether LDOF should be used instead of KNN and LOF for a given dataset. Therefore, as future work, we plan to propose an approach for judging whether a dataset is scattered, as well as to further enhance the outlier detection accuracy.

An Effective Pattern Based Outlier Detection Approach for Mixed Attribute Data

As we mentioned in Chapter 2, detecting outliers in mixed attribute datasets is one of the major challenges in real world applications. Existing outlier detection methods are ineffective for mixed attribute real world datasets, mainly because they are unable to consider interactions among different types of attributes, e.g. numerical and categorical attributes. To address this issue, we propose a novel Pattern based Outlier Detection approach (POD). A pattern in this thesis is defined as a mathematical representation that describes the majority of the observations in a dataset and captures the interactions among different types of attributes. In POD, the more an object deviates from these patterns, the higher its outlier factor is. We use logistic regression to learn patterns and to formulate the outlier factor in mixed attribute datasets. A series of experiments shows that the performance improvement achieved by POD is statistically significant compared with several classic outlier detection methods.

The rest of this chapter is organised as follows. In Section 4.1, the mixed attribute data problem is discussed and relevant work is reviewed. In Section 4.2, we introduce and discuss the pattern in mixed attribute data. In Section 4.3, we formally present our pattern based outlier definition and our outlier factors. In Section 4.4, the top-n pattern based outlier detection algorithm is described. Experiments are reported in Section 4.5. Finally, a summary is presented in Section 4.6.

4.1 Introduction

As discussed in Sections 1.2 and 2.8, real world data usually contain different types of attributes, called mixed attribute data, which most existing outlier detection methods are unable to handle. In recent years, researchers have proposed several algorithms to deal with mixed attribute datasets. A typical method, LOADED [19], uses association rules to explore infrequent items among categorical values, and calculates a covariance matrix to examine anomalies in numerical values. Outliers in mixed attribute datasets are then determined by their anomaly scores, which are the sum of the anomaly scores for the categorical and the numerical values. Although LOADED gives a specific method for exploring anomalies in either categorical or numerical values, its performance is limited because it does not consider interactions between different types of attributes. In 2005, Otey et al. [43] proposed an improved version, named RELOADED. Though RELOADED needs less main memory than LOADED, it still considers the anomaly of each type of attribute separately rather than taking their interactions into account. Yu et al. [61] proposed a graph-based outlier detection algorithm for dealing with mixed attribute data. They separately compute the Euclidean distance for numerical values and the Hamming distance for categorical values to calculate outlier indicators. More recently, Ye et al. [60] proposed a projection-based outlier detection method. They use the equi-width method to discretise numerical attribute values in order to handle mixed attribute datasets. Although both papers claim to be designed for mixed attribute data, they consider the two types of attributes separately and then simply assemble the results.
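The separate treatment of attribute types described above can be illustrated with a short sketch. It is not taken from the cited papers: the function names `equi_width_bins` and `separate_distances` are hypothetical, and the code only shows the generic building blocks (equi-width discretisation, Euclidean distance on numerical values, Hamming distance on categorical values), not how those methods combine them into outlier indicators.

```python
import math


def equi_width_bins(values, n_bins):
    """Discretise numerical values into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def separate_distances(a_num, a_cat, b_num, b_cat):
    """Return (Euclidean distance on numerical parts, Hamming distance on categorical parts)."""
    euclidean = math.dist(a_num, b_num)
    hamming = sum(1 for x, y in zip(a_cat, b_cat) if x != y)
    return euclidean, hamming


# Toy objects with two numerical and two categorical attributes (made-up values).
bins = equi_width_bins([0, 1, 2, 3, 4, 10], 5)
dists = separate_distances((0.0, 0.0), ("a", "x"), (3.0, 4.0), ("a", "y"))
```

Because the two distances are computed independently, neither reflects how a categorical value relates to the numerical values it co-occurs with, which is the interaction that the text argues is being lost.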

In this thesis, we propose a Pattern based Outlier Detection approach (POD), which is able to consider interactions between different types of attributes effectively, without attribute conversion processes (discretising or recoding). A pattern in this thesis is defined as a mathematical representation that describes the majority of the observations in a dataset and captures the interactions among different types of attributes. Based on this notion of pattern, a new outlier factor for mixed attribute data is proposed: the more an object deviates from these patterns, the higher its outlier factor is. In POD, we use logistic regression to learn patterns and to formulate the outlier factor in mixed attribute datasets. To validate our approach, we compare POD with three other typical methods, LOADED [19], KNN [46] and LOF [7], over a series of synthetic and real world mixed attribute datasets. Experimental results show a statistically significant improvement of POD over the three methods.
