An Outlier Mining Algorithm Based on Attribute Entropy


Procedia Environmental Sciences 11 (2011) 132 – 138

doi:10.1016/j.proenv.2011.12.021


Ming-jian Zhou^{1,a}, Jun-cai Tao^{2,b}

^{1} Department of Computer Science and Technology, Nanchang University, Nanchang, Jiangxi, China
^{2} Computer Center, Nanchang University, Nanchang, Jiangxi, China

^{a} [email protected], ^{b} [email protected]

Abstract

This paper reviews outlier mining and the commonly used outlier mining methods and, on this basis, proposes an outlier mining algorithm based on attribute entropy (OMABAE). First, the concept of attribute entropy is introduced, the attribute entropy of each attribute value is calculated, and the attribute entropy matrix is constructed. Second, the object entropy of each object is computed from the attribute entropy matrix. Finally, outliers are detected by comparing the object deviation degree with a preset threshold. The experimental results show that this algorithm can detect outliers efficiently.

Keywords: Outlier; Attribute Entropy; deviation degree; Data Mining

1. Introduction

The rapid development of data technology, together with the growth of the Internet, has produced an overwhelming amount of data for the business community. Data mining came into being to find the useful information hidden in this data. Data mining is the process of extracting potentially useful information from large amounts of data. In general, it can be divided into four categories: discovery of correlations and dependencies, type determination, category description, and outlier mining [1].

An outlier, according to Hawkins [2], is “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. Outliers are frequently treated as noise that must be removed from a data set for a specific model or algorithm to succeed, so they are often simply discarded or ignored. However, researchers have gradually realized that some outliers may be genuine reflections of the underlying data. Outlier mining has therefore become an important aspect of data mining.

The goal of outlier mining is to find small groups of data that are exceptional compared with the large remainder of the data. It often reveals real but unexpected knowledge. Therefore, outlier mining has a wide range of real-life applications, such as detecting malicious credit card overdrafts, network intrusion detection, loan proof checking and so on [3].

Selection and/or peer-review under responsibility of the Intelligent Information Technology Application Research Association.

Open access under CC BY-NC-ND license.

Currently, the classical outlier mining techniques can be divided into five categories: statistic-based methods [4], distance-based methods [5,6], density-based methods [7,8,9], clustering-based methods [10] and deviation-based methods [11,12].

The statistic-based methods use a discordancy test to determine outliers under a known probability distribution of the data set, so the test requires knowledge of the data distribution. In many instances, however, the data distribution is unknown and does not fit any standard mathematical distribution.

The distance-based methods make no assumption about the distribution of the data, since they essentially compute distances among points. Outliers are those objects that do not have enough neighbors. However, these methods require suitable parameters to be identified.

The density-based methods determine whether a given data point is an outlier by comparing its density with that of its neighbors. They assign an object to a nearby cluster when the density of points in an area exceeds a certain threshold, and they can detect local outliers that distance-based methods cannot distinguish.

The clustering-based methods first cluster the sample data and then detect the isolated points that cannot be assigned to any cluster; these isolated points are the outliers. The advantage of these methods is that they need no prior knowledge about the data set.

The deviation-based methods identify outliers by examining the main characteristics of objects in a group instead of applying statistical tests or distance-based measures. Objects that deviate from the given description are considered outliers. Their complexity is linear in the size of the data set, so they compute efficiently, but their assumption about what constitutes an exception is too idealized.

The classical methods above each have advantages in application, but all have limitations in certain respects. Therefore, this paper proposes OMABAE, an outlier mining algorithm based on attribute entropy. The algorithm addresses some practical problems in outlier mining and remedies deficiencies of the existing algorithms.

2. Formal definition

To make the algorithm proposed in this paper easier to follow, the relevant concepts are introduced as follows.

Definition 1: The data set D is defined as D=(U,A), where U stands for the object set, U={u_i | i∈L}, L={1,2,…,m}, and A stands for the attribute set, A={a_i | i∈S}, S={1,2,…,n}.

Definition 2: Attribute similarity coefficient. Given a data set D=(U,A), u_i∈U, a_j∈A, i∈L, j∈S, where L and S are the same as in Definition 1, let x_ij denote the sample value of u_i on a_j. The attribute similarity coefficient of x_ij is:

$$\sigma_{ij} = \frac{\left| x_{ij} - a_{\mathrm{avg}_j} \right|}{a_{\max_j} - a_{\mathrm{avg}_j}} \qquad (1)$$

where $a_{\mathrm{avg}_j}$ is the average value of a_j and $a_{\max_j}$ is the maximal value of a_j.

Definition 3: Attribute Entropy. Given a data set D=(U,A), u_i∈U, a_j∈A, i∈L, j∈S, where L and S are the same as in Definition 1, let x_ij denote the sample value of u_i on a_j and let $\sigma_{ij}$ be the attribute similarity coefficient of x_ij. The attribute entropy of x_ij is:

$$\rho_{ij} = -\sigma_{ij} \ln \sigma_{ij} \qquad (2)$$
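As a hedged numerical illustration of Definitions 2 and 3, consider a hypothetical attribute a_j with four sample values 2, 4, 6 and 12 (these values are assumptions for illustration, not data from the paper). The sketch uses the absolute-value form of the similarity coefficient that appears in the algorithm listing of Section 3, the entropy form ρ = −σ ln σ, and additionally assumes the common convention 0·ln 0 = 0, which the paper does not state:

```latex
\begin{align*}
a_{\mathrm{avg}_j} &= \tfrac{2 + 4 + 6 + 12}{4} = 6, \qquad
a_{\max_j} = 12, \qquad
a_{\max_j} - a_{\mathrm{avg}_j} = 6 \\
(\sigma_{1j}, \sigma_{2j}, \sigma_{3j}, \sigma_{4j})
  &= \left( \tfrac{|2-6|}{6},\ \tfrac{|4-6|}{6},\ \tfrac{|6-6|}{6},\ \tfrac{|12-6|}{6} \right)
   = \left( \tfrac{2}{3},\ \tfrac{1}{3},\ 0,\ 1 \right) \\
(\rho_{1j}, \rho_{2j}, \rho_{3j}, \rho_{4j})
  &= \left( -\tfrac{2}{3}\ln\tfrac{2}{3},\ -\tfrac{1}{3}\ln\tfrac{1}{3},\ 0,\ -1 \cdot \ln 1 \right)
   \approx (0.270,\ 0.366,\ 0,\ 0)
\end{align*}
```

Under these assumptions, both the value equal to the attribute average (6) and the attribute maximum (12) receive zero attribute entropy.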


Definition 4: Attribute Entropy Matrix. Given a data set D=(U,A), where U and A are the same as in Definition 1, the attribute entropy matrix of D is:

$$\Lambda = \begin{bmatrix} \rho_{11} & \cdots & \rho_{1n} \\ \vdots & \ddots & \vdots \\ \rho_{m1} & \cdots & \rho_{mn} \end{bmatrix} \qquad (3)$$

Definition 5: Object Entropy. Given a data set D=(U,A), u_i∈U, i∈L, and D’s attribute entropy matrix $\Lambda$, where U, A and L are the same as in Definition 1, the object entropy of u_i is:

$$oe_i = \sum_{j=1}^{n} \rho_{ij} \ast \eta_j \qquad (4)$$

where $\eta_k$ (k∈S, with S the same as in Definition 1) is the weight coefficient of attribute a_k, and $\sum_{k=1}^{n} \eta_k = 1$.

Definition 6: Maximal Attribute Entropy. Given a data set D=(U,A) and its attribute entropy matrix $\Lambda$, where U and A are the same as in Definition 1, the maximal attribute entropy of D is the largest object entropy over all objects:

$$oe_{\max} = \max_{i=1}^{m} oe_i \qquad (5)$$

Definition 7: Object Deviation Degree. Given a data set D=(U,A), u_i∈U, i∈L, where U, A and L are the same as in Definition 1, the object deviation degree of u_i is:

$$odd_i = \frac{oe_{\max} - oe_i}{oe_{\max}} \qquad (6)$$

All attributes are taken into account in the object deviation degree. The larger odd_i is, the greater the possibility that the object is an outlier, and vice versa.
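A small worked example may help connect the object entropy, the maximal entropy and the object deviation degree. The numbers below are purely hypothetical (three objects, two attributes, equal weights η = (0.5, 0.5), an assumed attribute entropy matrix, and an assumed threshold of 0.5); they are not taken from the paper's experiments:

```latex
\begin{align*}
\Lambda &= \begin{bmatrix} 0.30 & 0.20 \\ 0.36 & 0.34 \\ 0.02 & 0.08 \end{bmatrix},
\qquad \eta = (0.5,\ 0.5) \\
oe_1 &= 0.5 \cdot 0.30 + 0.5 \cdot 0.20 = 0.25, \quad
oe_2 = 0.5 \cdot 0.36 + 0.5 \cdot 0.34 = 0.35, \quad
oe_3 = 0.5 \cdot 0.02 + 0.5 \cdot 0.08 = 0.05 \\
oe_{\max} &= \max(0.25,\ 0.35,\ 0.05) = 0.35 \\
odd_1 &= \tfrac{0.35 - 0.25}{0.35} \approx 0.286, \quad
odd_2 = \tfrac{0.35 - 0.35}{0.35} = 0, \quad
odd_3 = \tfrac{0.35 - 0.05}{0.35} \approx 0.857
\end{align*}
```

With the assumed threshold of 0.5, only u_3 would be reported as an outlier in this illustration.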

3. Outlier mining algorithm

Generally, outliers lie far from normal data: they deviate from the center of the data set and are few in number. Outlier detection therefore focuses on finding the data objects that are very dissimilar to the other objects in a data set. In our approach, the object deviation degree odd of each object is calculated from the attribute entropy matrix, and outliers are then identified by comparing each odd value with a preset deviation threshold dt: if the odd value of object u_i is larger than dt, then u_i is an outlier. The OMABAE algorithm is as follows:

OMABAE(x[1..m][1..n], eta[1..n], dt)

Input: 1) data set D represented by x[1..m][1..n], in which x[i][j] is the jth attribute value of the ith object; 2) the weight coefficients eta[1..n] of each attribute; 3) deviation threshold dt

Output: the outliers

// step 1a: get xmax and xavg of each attribute
for (i = 1; i <= n; i++) {
    xmax[i] = x[1][i];
    xavg[i] = x[1][i];
    for (j = 2; j <= m; j++) {
        if (x[j][i] > xmax[i]) xmax[i] = x[j][i];
        xavg[i] += x[j][i];
    }
    xavg[i] /= m;
}

// step 1b: compute attribute entropy
for (i = 1; i <= m; i++) {
    for (j = 1; j <= n; j++) {
        sigma[i][j] = abs(x[i][j] - xavg[j]) / (xmax[j] - xavg[j]);
        ro[i][j] = -sigma[i][j] * ln(sigma[i][j]);
    }
}

// step 2: compute object entropy and object deviation degree
oemax = 0.0;
for (i = 1; i <= m; i++) {
    oe[i] = 0.0;
    for (j = 1; j <= n; j++) oe[i] += ro[i][j] * eta[j];
    if (oemax < oe[i]) oemax = oe[i];
}
for (i = 1; i <= m; i++) odd[i] = (oemax - oe[i]) / oemax;

// step 3: detect outliers using dt
for (i = 1; i <= m; i++) {
    if (odd[i] > dt) output(i);
}
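The listing above can be turned into a compilable C++ sketch along the following lines. This is only an illustration under stated assumptions, not the authors' VC++ implementation: the function name `omabae`, the 0-based indexing, and the use of `std::vector` are choices made here, and three edge-case conventions that the paper does not specify are assumed (0·ln 0 = 0, σ = 0 for a constant attribute, and no outliers reported when every object entropy is zero).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the 0-based indices of the objects whose
// object deviation degree exceeds dt. x is m-by-n, eta holds n weights.
std::vector<std::size_t> omabae(const std::vector<std::vector<double>>& x,
                                const std::vector<double>& eta,
                                double dt) {
    const std::size_t m = x.size();
    assert(m > 0);
    const std::size_t n = x[0].size();
    assert(eta.size() == n);

    // Step 1a: maximum and average of each attribute.
    std::vector<double> xmax(n), xavg(n);
    for (std::size_t j = 0; j < n; ++j) {
        xmax[j] = x[0][j];
        double sum = 0.0;
        for (std::size_t i = 0; i < m; ++i) {
            assert(x[i].size() == n);
            if (x[i][j] > xmax[j]) xmax[j] = x[i][j];
            sum += x[i][j];
        }
        xavg[j] = sum / static_cast<double>(m);
    }

    // Steps 1b and 2: attribute entropy, accumulated into object entropy.
    std::vector<double> oe(m, 0.0);
    double oemax = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double range = xmax[j] - xavg[j];
            // Assumption: a constant attribute (range == 0) gives sigma = 0.
            const double sigma =
                range > 0.0 ? std::fabs(x[i][j] - xavg[j]) / range : 0.0;
            // Assumption: 0 * ln(0) is taken to be 0.
            const double rho = sigma > 0.0 ? -sigma * std::log(sigma) : 0.0;
            oe[i] += rho * eta[j];
        }
        if (oe[i] > oemax) oemax = oe[i];
    }

    // Step 3: compare each object deviation degree with the threshold.
    std::vector<std::size_t> outliers;
    if (oemax <= 0.0) return outliers;  // Assumption: nothing to compare against.
    for (std::size_t i = 0; i < m; ++i) {
        const double odd = (oemax - oe[i]) / oemax;
        if (odd > dt) outliers.push_back(i);
    }
    return outliers;
}
```

For a single hypothetical attribute with values 2, 4, 6 and 12, weight 1 and dt = 0.5, calling `omabae({{2.0}, {4.0}, {6.0}, {12.0}}, {1.0}, 0.5)` returns the indices 2 and 3, since both of those values have zero attribute entropy under the assumed conventions.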

The time complexity of the OMABAE algorithm depends on the size of the data set (m) and the number of attributes (n). OMABAE has three main steps:

1) Calculating the attribute entropy of each object on each attribute;
2) Calculating the object deviation degree of each object;
3) Detecting the outliers by comparing the object deviation degree with the preset threshold.

The time complexities of the three steps are O(2*m*n), O(m*n+m) and O(m), respectively. Therefore, the total complexity of the OMABAE algorithm is O(3*m*n+2*m), that is, O(m*n).

4. Experiment

To verify the performance of the algorithm proposed in this paper, we implemented it and compared it with the traditional KNN [8], LOF [9] and FINDCBLOF [13] algorithms. All programs were written in VC++ 6.0 and run on a Pentium IV 2.8 GHz machine with 1 GB of RAM running Windows XP. We experimented with three real-life data sets: the ginger data set of SIRC-TCM [14] (150 objects and 20 attributes), the Lymphography data set of UCI [15] (148 instances and 19 attributes) and the Glass Identification data set of UCI (214 instances and 9 attributes). The experimental data sets were obtained by inserting 10% outliers with large deviations into the original data sets.


4.1 Recall ratio

The recall ratio of algorithm is defined as:

$$\mathrm{recall\_ratio} = \frac{\text{number of the outliers identified by the algorithm}}{\text{number of the expected outliers}}$$

As shown in Fig. 1, across data sets with different numbers of objects and attributes, the recall ratio of OMABAE stays above 0.9 and is higher than that of the three comparison algorithms. This is because the other three algorithms require their parameters to be repeatedly set and tested to achieve satisfactory results.

4.2 Precision ratio

The precision ratio of algorithm is defined as:

$$\mathrm{precision\_ratio} = \frac{\text{number of the outliers identified by the algorithm}}{\text{number of the objects identified by the algorithm}}$$
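The recall and precision ratios can be computed from two sets of object indices, as in the following minimal C++ sketch. The names `Ratios` and `evaluate` are hypothetical, and the sketch assumes that "identified outliers" means the flagged objects that are also in the expected (inserted) outlier set; returning a precision of 0 when nothing is flagged is likewise an assumption, since the paper does not define that case.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical result type holding both ratios.
struct Ratios {
    double recall;
    double precision;
};

// expected:   indices of the outliers known to be in the data set.
// identified: indices of the objects the algorithm reported as outliers.
Ratios evaluate(const std::set<std::size_t>& expected,
                const std::set<std::size_t>& identified) {
    assert(!expected.empty());

    // Count reported objects that really are expected outliers.
    std::size_t hits = 0;
    for (std::size_t id : identified) {
        if (expected.count(id) > 0) ++hits;
    }

    Ratios r;
    r.recall = static_cast<double>(hits) /
               static_cast<double>(expected.size());
    // Assumption: precision is defined as 0 when nothing was reported.
    r.precision = identified.empty()
                      ? 0.0
                      : static_cast<double>(hits) /
                            static_cast<double>(identified.size());
    return r;
}
```

For example, with expected outliers {1, 2, 3, 4} and reported objects {2, 3, 4, 9, 10}, three reported objects are true outliers, so the sketch gives a recall ratio of 3/4 = 0.75 and a precision ratio of 3/5 = 0.6.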

Fig. 2 shows the precision ratio of the KNN, LOF, FINDCBLOF and OMABAE algorithms. As can be seen from Fig. 2, OMABAE has a higher precision ratio than the three comparison algorithms, but its precision ratio decreases as the number of attributes increases. This is because the interference among attributes grows with the number of attributes.

[Figure 1. The recall ratio with different dataset. Chart titled "The recall ratio of four different algorithms", comparing FINDCBLOF, KNN, LOF and OMABAE on the Lymphography, Glass Identification and Ginger data sets; vertical axis from 0.5 to 1.0.]

5. Conclusion

This paper presents an outlier mining algorithm based on attribute entropy. Experimental results show that, compared with the traditional algorithms, it achieves better recall and precision ratios, so it is more suitable for massive data.

However, the new algorithm is limited to numerical data sets. Future work will address how to detect outliers in non-numerical data sets.

Acknowledgment

This work is partially supported by the Education Department of Jiangxi Province of China (Grant No. GJJ10296).

References

[1] Liu Ying, Sprague A, “Outlier Detection and Evaluation by Network Flow,” International Journal of Computer Applications in Technology, vol. 33, Dec. 2008, pp. 237-246, doi: 10.1504/IJCAT.2008.021946.

[2] Hawkins D, “Identification of Outliers,” London: Chapman and Hall, 1980.

[3] Wang Hongding, Tong Yunhai, “Research Progress on Outlier Mining,” CAAI Transactions on Intelligent Systems, vol. 1, no. 1, pp. 67-73, 2006.

[Figure 2. The precision ratio with different dataset. Chart titled "The precision ratio of four different algorithms", comparing FINDCBLOF, KNN, LOF and OMABAE on the Lymphography, Glass Identification and Ginger data sets; vertical axis from 0.5 to 1.0.]

[4] Barnett V, Lewis T, “Outliers in Statistical Data,” New York: John Wiley & Sons, 1994.

[5] Bay S, Schwabacher M, “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule,” Proc. of ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, Washington, 2003, pp. 29-38.

[6] Knorr E, Ng R, “Finding Intensional Knowledge of Distance-Based Outliers,” Proc. of the VLDB Conf., Edinburgh: Morgan Kaufmann Publishers, 1999, pp. 211-222.

[7] Mao Junguo, Duan Lijuan, Wang Shi, Shi Yun, “Data Mining Principle and Algorithm,” Tsinghua University Press, 2007.

[8] Ramaswamy S, Rastogi R, Shim K, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. of the ACM SIGMOD International Conf. on Management of Data, Texas: ACM Press, 2000, pp. 473-478.

[9] Breunig M, Kriegel H, Ng R, Sander J, “LOF: Identifying Density-Based Local Outliers,” Dallas, Texas: Proc. of ACM SIGMOD Conf., 2000, pp. 94-104.

[10] Han Jiawei, Kamber M, “Data Mining: Concepts and Techniques,” New York: Morgan Kaufmann Publishers, 2001.

[11] Yu Zhongqing, Fang Yi, Pan Zhenkuan, Shao Fengjing, “OLPA Architecture,” Qingdao-Hong Kong International Computer Conf., 1999, 10.

[12] Arning A, Agrawal R, Raghavan, “A Linear Method for Deviation Detection in Large Databases,” Proc. of the KDD Conf., Portland: AAAI Press, 1996, pp. 164-169.

[13] He Zengyou, Xu Xiaofei, Deng Shengchun, “A Fast Greedy Algorithm for Outlier Mining,” Proc. of the 10th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining, Singapore, 2006, pp. 567-576.

[14] http://www.tcm120.com/