Outlier Detection in Data Mining, Pattern Recognition and

Besides statistics and computer vision, many efficient outlier detection approaches have been developed in areas such as machine learning, pattern recognition and data mining, for applications including information systems, health care, network systems, news documentation, industrial machines and video surveillance (Breunig et al., 2000; Hodge and Austin, 2004; Chandola, 2008; Chandola et al., 2009; Yang and Wang, 2006; Hido et al., 2011; Zimek et al., 2012; Aggarwal, 2013; Liu et al., 2013; Sugiyama and Borgwardt, 2013). Hodge and Austin (2004) stated that there is no single universally applicable or generic outlier detection approach. Researchers therefore try to develop more effective methods for their particular application area, taking into account the characteristics of their data. We choose the following three algorithms, which are based on different approaches and have been recently proposed or are popular in the data mining, machine learning and pattern recognition literature.

2.6.1 Local Outlier Factor

Breunig et al. (2000) introduced the Local Outlier Factor (LOF), assuming that many real-world datasets have complex structures in which observations can be outliers relative to their local neighbourhoods. Since the LOF was introduced, many variants of the algorithm have been developed; it is considered accurate and efficient, and it is frequently used for comparison with newly proposed methods (Kriegel et al., 2009; Schubert et al., 2014). Breunig et al. (2000) assigned to each observation in a dataset a measure of being an outlier, called the LOF. The measure depends on how isolated an observation is w.r.t. its surrounding neighbourhood, particularly w.r.t. the densities of the neighbourhood. To find the LOF for a point $p_i$, the algorithm uses three consecutive steps. First, the reachability distance of an observation $p_i$ w.r.t. observation $p_j$ is calculated as:

$$\text{reach-dist}_k(p_i, p_j) = \max\{k\text{-distance}(p_j),\ d(p_i, p_j)\}, \tag{2.41}$$

where $k\text{-distance}(p_j)$ is the distance between $p_j$ and its $k$th neighbour, and $d(p_i, p_j)$ is the distance between $p_i$ and $p_j$. The reachability distance of an observation to one of its neighbours is shown in Figure 2.7a. Second, the local reachability density for $p_i$, with neighbourhood size $k = MinPts$, is computed as:

$$\text{lrd}_{MinPts}(p_i) = \left( \frac{\sum_{p_j \in N_{MinPts}(p_i)} \text{reach-dist}_{MinPts}(p_i, p_j)}{|N_{MinPts}(p_i)|} \right)^{-1}, \tag{2.42}$$

where the local reachability density is the inverse of the average reachability distance based on the $MinPts$ nearest neighbours of $p_i$, denoted $N_{MinPts}(p_i)$, which can be considered as the local neighbourhood of $p_i$, and $|N_{MinPts}(p_i)|$ is the number of observations in that neighbourhood. Figure 2.7b depicts the elements of the local reachability density of $p_i$. Finally, the LOF of an observation $p_i$ is defined as the average ratio of the local reachability densities of its $MinPts$-nearest neighbours to the local reachability density of $p_i$ itself:

$$\text{LOF}_{MinPts}(p_i) = \frac{\sum_{p_j \in N_{MinPts}(p_i)} \dfrac{\text{lrd}_{MinPts}(p_j)}{\text{lrd}_{MinPts}(p_i)}}{|N_{MinPts}(p_i)|}. \tag{2.43}$$

A large LOF indicates that the observation $p_i$ is a potential outlier: on average, the local density around its neighbours is higher than the density around $p_i$ itself. Usually, outliers have LOF scores above a threshold that lies between 1.2 and 2.0, depending on the data (Goldstein, 2012). The reader is referred to Breunig et al. (2000) for more details about the LOF algorithm.
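The three steps described above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration rather than the reference implementation: the function name `lof_scores` and the parameter `min_pts` are hypothetical, distances are assumed to be Euclidean, ties in the $k$-distance are ignored (each point gets exactly `min_pts` neighbours), and duplicate points, which would give a zero reachability distance, are not handled.

```python
import numpy as np

def lof_scores(X, min_pts=3):
    """Return one LOF score per row of X (brute force, O(n^2) distances)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances d(p_i, p_j).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbour
    # Indices of the min_pts nearest neighbours of every point.
    nn = np.argsort(D, axis=1)[:, :min_pts]
    # k-distance(p_j): distance from p_j to its min_pts-th nearest neighbour.
    k_dist = D[np.arange(n), nn[:, -1]]
    # Step 1: reach-dist(p_i, p_j) = max{k-distance(p_j), d(p_i, p_j)}.
    reach = np.maximum(k_dist[nn], D[np.arange(n)[:, None], nn])
    # Step 2: lrd(p_i) is the inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)
    # Step 3: LOF(p_i) is the mean of lrd(p_j) / lrd(p_i) over the neighbours.
    return lrd[nn].mean(axis=1) / lrd

# Usage: a Gaussian cluster of 50 points plus one isolated point (index 50).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])
scores = lof_scores(X, min_pts=5)
print("index of largest LOF:", int(np.argmax(scores)))
```

Points inside a region of uniform density get scores close to 1, so a data-dependent threshold on `scores` separates the isolated point from the cluster.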

[Figure 2.7: two panels, (a) and (b); panel (a) is annotated with reach-dist$_k(p_1, p_j)$ = $k$-distance$(p_j)$ and reach-dist$_k(p_2, p_j)$.]

Figure 2.7 Local outlier factor: (a) reachability distances of $p_1$ and $p_2$ to $p_j$, and (b) graphical representation of the local reachability density for $p_i$ with its nearest neighbours $p_1$, $p_2$ and $p_3$. Euclidean distances are shown as solid lines and $k$-distances as dotted lines; the neighbourhood size is $k = 3$, and the neighbourhoods of the points are indicated by the coloured circles.

2.6.2 Direct Density Ratio Based Method

The density ratio based approach is one of the most well-known approaches in the statistical, machine learning and pattern recognition literature. It performs outlier detection using the ratio of the probability density functions of the training and test datasets. Outliers in a test (or validation) dataset are identified with reference to a training (or model) dataset that contains only inlier data (Schölkopf et al., 2001; Kanamori et al., 2009). Density estimation is not trivial, and finding an appropriate parametric model may not be possible. Therefore, Direct Density Ratio (DDR) estimation methods have been developed that do not require density estimation. Recently, Hido et al. (2011) introduced an inlier-based outlier detection method based on DDR estimation that calculates the importance, or inlier score, defined as:

$$w(p) = \frac{p_{tr}(p)}{p_{te}(p)}, \tag{2.44}$$

where $p_{tr}(p)$ and $p_{te}(p)$ are the densities of the independent and identically distributed (i.i.d.) training samples $\{p_j^{tr}\}_{j=1}^{n_{tr}}$ and test samples $\{p_i^{te}\}_{i=1}^{n_{te}}$, respectively. It is plausible to consider observations with small inlier scores as outliers. Hido et al. (2011) used unconstrained Least Squares Importance Fitting (uLSIF), which originated from the idea of LSIF (Kanamori et al., 2009). In uLSIF, the closed-form solution is computed by solving a system of linear equations. The importance $w(p)$ in Eq. (2.44) is modelled as:

$$\hat{w}(p) = \sum_{l=1}^{b} \alpha_l \varphi_l(p), \tag{2.45}$$

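where $\{\alpha_l\}_{l=1}^{b}$ are parameters and $\{\varphi_l(p)\}_{l=1}^{b}$ are basis functions such that $\varphi_l(p) \ge 0$. The parameters are determined by minimizing the following objective function:

$$\frac{1}{2} \int \left( \hat{w}(p) - \frac{p_{tr}(p)}{p_{te}(p)} \right)^2 p_{te}(p)\, dp. \tag{2.46}$$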
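To see why this objective leads to a system of linear equations, it can be expanded as sketched below. The symbols $\hat{H}$, $\hat{h}$, the regularisation parameter $\lambda$ and the identity matrix $I_b$ are not defined in the text above; they are introduced here as assumptions, following the usual presentation of uLSIF (Kanamori et al., 2009).

```latex
\begin{align*}
\frac{1}{2}\int \Bigl(\hat{w}(p) - \frac{p_{tr}(p)}{p_{te}(p)}\Bigr)^{2} p_{te}(p)\,dp
  &= \frac{1}{2}\int \hat{w}(p)^{2}\, p_{te}(p)\,dp
     - \int \hat{w}(p)\, p_{tr}(p)\,dp + \text{const} \\
  &\approx \frac{1}{2}\,\alpha^{\top}\hat{H}\alpha - \hat{h}^{\top}\alpha + \text{const},
\end{align*}
% the constant does not depend on alpha; expectations are replaced by sample means:
\[
\hat{H}_{l,l'} = \frac{1}{n_{te}}\sum_{i=1}^{n_{te}} \varphi_l(p_i^{te})\,\varphi_{l'}(p_i^{te}),
\qquad
\hat{h}_l = \frac{1}{n_{tr}}\sum_{j=1}^{n_{tr}} \varphi_l(p_j^{tr}).
\]
% adding a ridge penalty (lambda/2) alpha^T alpha and setting the gradient to zero:
\[
\tilde{\alpha} = (\hat{H} + \lambda I_b)^{-1}\hat{h},
\qquad
\hat{\alpha}_l = \max(0, \tilde{\alpha}_l).
\]
```

The final clipping step keeps the estimated importance non-negative, since every $\varphi_l(p) \ge 0$.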

The solution of uLSIF is computed through matrix inversion, and the leave-one-out cross-validation score (Kanamori et al., 2009) for uLSIF is computed analytically. Hido et al. (2011) showed that uLSIF is competitively accurate and computationally more efficient than the best existing methods, e.g. OSVM (Schölkopf et al., 2001). The reader is referred to Hido et al. (2011) for further information about uLSIF.
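A minimal sketch of the inlier score computation is given below, under several assumptions that are not taken from the text: Gaussian basis functions centred on a random subset of the training samples, a kernel width `sigma` and a ridge parameter `lam` fixed by hand (in practice they would be chosen by the leave-one-out cross-validation mentioned above), and negative coefficients clipped to zero. The function names are hypothetical.

```python
import numpy as np

def gaussian_basis(P, centres, sigma):
    """Non-negative basis: exp(-||p - c_l||^2 / (2 sigma^2)) for each centre c_l."""
    sq = ((P[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def ulsif_inlier_scores(P_tr, P_te, sigma=1.0, lam=0.1, b=50, seed=0):
    """Estimate w(p) = p_tr(p) / p_te(p) at the test points (small = outlier)."""
    P_tr = np.asarray(P_tr, dtype=float)
    P_te = np.asarray(P_te, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: basis centres are a random subset of the training samples.
    idx = rng.choice(len(P_tr), size=min(b, len(P_tr)), replace=False)
    centres = P_tr[idx]
    Phi_tr = gaussian_basis(P_tr, centres, sigma)
    Phi_te = gaussian_basis(P_te, centres, sigma)
    # Sample estimates of the quadratic and linear terms of the objective.
    H = Phi_te.T @ Phi_te / len(P_te)  # average over test samples
    h = Phi_tr.mean(axis=0)            # average over training samples
    # Regularised linear system, then clip negative coefficients to zero.
    alpha = np.linalg.solve(H + lam * np.eye(len(centres)), h)
    alpha = np.maximum(alpha, 0.0)
    return Phi_te @ alpha

# Usage: inlier-only training set; the test set has one far-away point (last row).
rng = np.random.default_rng(1)
P_tr = rng.normal(size=(200, 2))
P_te = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])
scores = ulsif_inlier_scores(P_tr, P_te)
print("score of the far-away point:", scores[-1])
print("median test score:", np.median(scores))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is the usual numerical choice when only the coefficient vector is needed.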

2.6.3 Distance based Outlier Detection

Knorr and Ng (1998) introduced the paradigm of Distance Based (DB) outlier detection, which generalises the statistical distribution based approaches. In contrast to those approaches, it does not need prior knowledge about the data distribution. In DB outlier detection, a point $p$ is considered an outlier w.r.t. parameters $\alpha$ and $\delta$ if at least a fraction $\alpha$ of the data has a distance from $p$ larger than $\delta$, that is:

$$|\{q \in P \mid d(p, q) > \delta\}| \ge \alpha n, \tag{2.47}$$

where $q \in P$, $n$ is the number of points in $P$, and $\alpha, \delta \in \mathbb{R}$ with $0 \le \alpha \le 1$ are user-defined parameters. The difficulty is how to fix the distance threshold $\delta$. To overcome this limitation, Ramaswamy et al. (2000) proposed the $k$th Nearest Neighbour ($k$thNN) distance as a measure of outlyingness. The score of a point is defined as:

$$q_{k\text{thNN}}(p) := d^{k}(p; P), \tag{2.48}$$

where $d^{k}(p; P)$ is the distance between $p$ and its $k$thNN in $P$. This method is computationally intensive, so Wu and Jermaine (2006) proposed a sampling algorithm to estimate the score in Eq. (2.48) efficiently, defined as:

$$q_{k\text{th}S_p}(p) := d^{k}(p; S_p(P)), \tag{2.49}$$

where $S_p(P)$ is a subset of $P$, which is randomly and iteratively sampled for each point in $P$. To save computation time without losing accuracy, Sugiyama and Borgwardt (2013) recently suggested sampling only once. They define the score as:

$$q_{Sp}(p) := \min_{q \in S_p} d(p, q), \tag{2.50}$$

where $\min_{q \in S_p} d(p, q)$ is the minimum distance between $p$ and $q$, with $q$ a point in the subset $S_p$. Sugiyama and Borgwardt (2013) named the algorithm qSp, and stated that it outperforms state-of-the-art DB algorithms including Angle Based Outlier Factor (ABOF; Kriegel et al., 2008) and OSVM (Schölkopf et al., 2001) in terms of efficiency and effectiveness.
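The DB criterion in Eq. (2.47) and the scores in Eqs. (2.48) and (2.50) can be compared on a toy dataset with the sketch below. It is an illustration under stated assumptions, not the authors' code: the function names, the sample size `s` and the fixed random seed are hypothetical, distances are Euclidean, and Eq. (2.50) is implemented literally, so a point that happens to be drawn into the sample receives a score of zero.

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def db_outliers(P, alpha, delta):
    """Eq. (2.47): flag p if at least alpha * n points lie farther than delta."""
    D = pairwise(P, P)
    return (D > delta).sum(axis=1) >= alpha * len(P)

def kth_nn_scores(P, k):
    """Eq. (2.48): distance from each point to its k-th nearest neighbour."""
    D = np.sort(pairwise(P, P), axis=1)
    return D[:, k]  # column 0 holds the zero distance of a point to itself

def qsp_scores(P, s, seed=0):
    """Eq. (2.50): distance to the nearest member of one random sample of size s."""
    rng = np.random.default_rng(seed)
    S = P[rng.choice(len(P), size=s, replace=False)]  # sampled only once
    return pairwise(P, S).min(axis=1)

# Usage: four points on a line plus one distant point at 10.
P = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
print(db_outliers(P, alpha=0.75, delta=5.0))  # only the point at 10 is flagged
print(kth_nn_scores(P, k=2))                  # scores 2, 1, 1, 2, 8
print(qsp_scores(P, s=2))                     # depends on the random sample
```

The $k$thNN score needs all pairwise distances, whereas the sampled score needs only the distances to `s` points, which is where the saving in computation time comes from.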

In document Robust statistical approaches for feature extraction in laser scanning 3D point cloud data (Page 62-66)