Pattern Definition in Mixed Attribute Data

Outlierness, or anomaly, in mixed attribute data often results from interactions between categorical and numerical values. For example, in a social income survey dataset, it is common for a man whose occupation is engineer to hold a Bachelor's degree. However, the record becomes unusual if the man is only 16 years old. That is to say, outlierness in mixed attribute datasets has its own characteristics, and it is hard to apply the outlier definitions given by single-type attribute outlier detection methods, e.g. KNN [35] and LOF [7], or by existing methods for mixed attribute datasets such as LOADED [19] or a graph-based technique [61].

Before exploring a suitable outlier definition, we first have to define the normal behaviour, or majority, in mixed attribute datasets. We call this normal behaviour or majority the pattern. Although the notion of a pattern has been mentioned in [26], that work follows the definition of pattern used in pattern recognition (i.e., it regards an example or a cluster as a pattern) and only discusses it in numerical space. Distinct from the definition in the pattern recognition domain, pattern here refers to the common characteristics or behaviour in mixed attribute data space. We give an example to illustrate what a pattern looks like in a mixed attribute dataset. In Figure 4.1, an indicative example of a simple mixed attribute dataset, with two numerical attributes and one binary categorical attribute, is projected into a 2-D space. “Dot” objects indicate the data objects with categorical value “Male”, while “cross” objects indicate the data objects with categorical value “Female”.

As demonstrated in Figure 4.1, most of the “dot” and “cross” objects fall regularly into two groups. We can regard this distribution of data objects as the pattern in the example, which indicates the normal behaviour or majority in the simple mixed attribute dataset. Intuitively, if an object “looks” like it might not belong to the pattern, it is suspected of being an outlier. As shown in Figure 4.1, objects A and B are outliers as they deviate from the pattern.

We denote $D$ as a set of mixed attribute data objects. $O_i \in D$ is the $i$th mixed attribute data object in $D$. Each data object contains $M$ numerical attributes and $N$ categorical attributes. Denote object $i$ as $O_i = [x_i, c_i]$, where $x_i$ contains numerical values and $c_i$ contains categorical values. Denote $x_i = [x_i^1, x_i^2, \cdots, x_i^j, \cdots, x_i^M]$ and $c_i = [c_i^1, c_i^2, \cdots, c_i^k, \cdots, c_i^N]$, where $x_i^j$ is the $j$th numerical attribute value in $O_i$ and $c_i^k$ is the $k$th categorical attribute value in $O_i$.
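The notation above maps naturally onto a small record type. The following is a minimal sketch, not part of the original method: the class name `MixedObject`, the helper `check_dataset` and the survey-style values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MixedObject:
    """One object O_i = [x_i, c_i] (class and field names are illustrative)."""
    x: List[float]  # numerical values x_i^1 ... x_i^M
    c: List[str]    # categorical values c_i^1 ... c_i^N


def check_dataset(D: List[MixedObject]) -> Tuple[int, int]:
    """Return (M, N) after checking that every object has the same attribute counts."""
    M, N = len(D[0].x), len(D[0].c)
    for obj in D:
        if len(obj.x) != M or len(obj.c) != N:
            raise ValueError("objects must share M numerical and N categorical attributes")
    return M, N


# Hypothetical records: x = [age, years of education], c = [occupation, degree].
# The values are made up and do not come from any real survey.
D = [
    MixedObject(x=[34.0, 16.0], c=["engineer", "bachelor"]),
    MixedObject(x=[41.0, 12.0], c=["clerk", "none"]),
    MixedObject(x=[16.0, 10.0], c=["engineer", "bachelor"]),
]
M, N = check_dataset(D)
```

Keeping the numerical and categorical parts in separate fields mirrors the split $O_i = [x_i, c_i]$ and makes it easy to select a single categorical attribute later.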

In this thesis, we concentrate on a specific type of pattern, where only one categorical attribute is involved in each pattern. To simplify the discussion below, we define $S'$ as a subspace of $D$ which only contains a subset of the attributes in $D$. $S'^{k}$ is a subspace of $D$ which only contains the $k$th categorical attribute and all the numerical attributes. $O'^{k}_i$ is the projection of $O_i$ on the subspace $S'^{k}$. Given $S'^{k}$, most of the projected objects $O'^{k}$ exhibit common characteristics or behaviour in the mixed attribute subspace. The pattern in a mixed attribute subspace is defined as follows:


Figure 4.1: An indicative example of the pattern in a simple mixed attribute dataset.

Definition 4.1. We call the common characteristics or behaviour demonstrated by the projected objects $O'^{k}$ the mixed attribute subspace PATTERN $P^k$.
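The projection that Definition 4.1 relies on keeps all numerical values of an object and exactly one of its categorical values. A minimal sketch follows, assuming each object is stored as a pair of lists; the function name `project`, the tuple layout and the toy values are illustrative assumptions, and `k` is 0-based in the code whereas the text counts attributes from 1.

```python
from typing import List, Tuple

# Each object is (x_i, c_i): M numerical values and N categorical values.
# This tuple layout is an assumption made for illustration.
MixedObject = Tuple[List[float], List[int]]


def project(D: List[MixedObject], k: int) -> List[Tuple[List[float], int]]:
    """Project every object onto the k-th subspace: all of x_i plus c_i^k only."""
    return [(list(x), c[k]) for x, c in D]


# Toy dataset with M = 2 numerical and N = 2 binary categorical attributes.
D = [
    ([1.0, 2.0], [0, 1]),
    ([1.5, 1.8], [0, 0]),
    ([6.0, 7.0], [1, 1]),
]
N = len(D[0][1])

# One projected dataset per categorical attribute, hence one pattern per attribute.
subspaces = [project(D, k) for k in range(N)]
```

Each entry of `subspaces` can then be analysed on its own, which is what keeps the number of attributes per pattern small.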

Based on the pattern definition above, if a mixed attribute dataset contains $N$ categorical attributes, there are $N$ patterns, $P = \{P^1, P^2, \cdots, P^N\}$. Definition 4.1 describes what the normal objects are in mixed attribute space. Furthermore, this definition significantly simplifies the mixed attribute space compared to the original one. Our pattern definition focuses on one subspace at a time, in which the projected objects $O'$ contain only one categorical attribute, rather than considering all $N$ categorical attributes of the full objects $O$. Such a simplification provides a simple mechanism for handling datasets with a large set of categorical attributes. We will use this mechanism to handle a set of categorical attributes in Section 4.4.

In order to take interactions between categorical and numerical attributes into account, we propose to use logistic regression to represent patterns in mixed attribute datasets. To simplify our discussion, we assume in this section that all the categorical attributes are binary, i.e. $c_i^k \in \{0, 1\}$. We will return to this issue in Section 4.4. Given a projected object $O'^{k}_i = [x_i, c_i^k]$ on subspace $S'^{k}$ and a binary variable $Y$ where $Y = c_i^k$, logistic regression can take a simple form:

$$
P_{O'^{k}_i} =
\begin{cases}
P(Y = 1 \mid x_i) = \dfrac{1}{1 + \exp(w x_i^T)} & \text{if } c_i^k = 1, \\[2ex]
P(Y = 0 \mid x_i) = \dfrac{\exp(w x_i^T)}{1 + \exp(w x_i^T)} & \text{otherwise,}
\end{cases}
\tag{4.1}
$$

where $P_{O'^{k}_i}$ measures the degree to which the projected object $O'^{k}_i$ belongs to pattern $P^k$, and $w$ is the parameter vector in logistic regression. The parameters in Equation 4.1 can be learnt directly from data by, e.g., maximum likelihood [8]:

$$
w^k \leftarrow \arg\max_{w^k} \prod_i P(Y_i^k \mid x_i, w^k), \tag{4.2}
$$

where $w^k$ is the logistic regression parameter vector in subspace $S'^{k}$ and $Y_i^k$ takes the value of $c_i^k$. As information from both categorical and numerical attributes is considered in the learning procedure, we propose a categorical outlier factor in the next section, which represents the interaction between categorical and numerical attributes.
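To make the learning step concrete, the sketch below fits the parameters for a single subspace and evaluates the membership degree of Equation 4.1. It is an illustration under stated assumptions, not the thesis implementation: plain gradient ascent stands in for whatever maximum likelihood solver was actually used, a constant 1 is appended to each $x_i$ so that the last weight acts as an intercept (the text does not mention one), and the function names and toy data are hypothetical. Note that Equation 4.1 places $\exp(w x_i^T)$ without a minus sign in $P(Y = 1 \mid x_i)$, so the learnt weights have the opposite sign to those of the more common sigmoid form.

```python
import math
from typing import List, Tuple


def p_y1(w: List[float], x: List[float]) -> float:
    """P(Y = 1 | x) = 1 / (1 + exp(w x^T)), the sign convention of Equation 4.1."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    if z >= 0:  # algebraically equal form that avoids overflow for large z
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def fit(data: List[Tuple[List[float], int]], steps: int = 2000, lr: float = 0.1) -> List[float]:
    """Maximise the log-likelihood by gradient ascent (an assumed optimiser)."""
    # Append a constant 1 so the last weight is an intercept (an assumption).
    rows = [(x + [1.0], y) for x, y in data]
    w = [0.0] * len(rows[0][0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in rows:
            err = p_y1(w, x) - y  # derivative of the log-likelihood w.r.t. w x^T
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj + lr * gj / len(rows) for wj, gj in zip(w, grad)]
    return w


def membership(w: List[float], x: List[float], c: int) -> float:
    """Degree of belonging: the probability assigned to the observed category c."""
    p1 = p_y1(w, x + [1.0])
    return p1 if c == 1 else 1.0 - p1


# Toy projected objects (x_i, c_i^k) for one subspace; the values are made up.
train = [([1.0, 1.2], 0), ([0.8, 1.0], 0), ([1.3, 0.9], 0),
         ([4.0, 4.2], 1), ([4.3, 3.9], 1), ([3.8, 4.1], 1)]
w_k = fit(train)
typical = membership(w_k, [4.1, 4.0], 1)   # category agrees with the nearby group
atypical = membership(w_k, [4.1, 4.0], 0)  # same position, opposite category
```

In this toy setting an object whose category agrees with its numerical neighbourhood receives a membership above 0.5, while the same numerical values paired with the other category receive the complementary, lower value. Repeating the fit once per categorical attribute would give one parameter vector per subspace.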

4.3 Pattern Based Outlier in Mixed Attribute

In document Towards Outlier Detection For Scattered Data and Mixed Attribute Data (Page 79-82)