
Dynamic Clustering Algorithm on Symbolic Objects: SYNTHO


The SYNTHO algorithm is a convergent clustering algorithm (a type of dynamic clustering) (Diday 1971 [93]) on symbolic objects, which respects the four principles of symbolic data processing: fidelity to the data, predominance of knowledge, consistency, and interpretability (Diday 1987 [94]), (Kodratoff 1986 [188]). Indeed, it is guided by background knowledge represented by the affinities and taxonomies. The synthesis objects it generates are more general than the input assertions and are easy to explain, interpret, and use.

The data consist, on the one hand, of knowledge represented by a list of “assertions” defined on different domains (observation spaces) and, on the other hand, of additional knowledge provided by the expert, represented by taxonomies and by affinities defined between the “events” whose conjunctions form the “assertions” defined in Section 15.2.

15.6.1 Description of the Algorithm

Before describing the SYNTHO algorithm, it is necessary to provide the following three measures:

• a measure of similarity r(ai, aj) or affinity aff(ai, aj) between assertions, to measure the adequacy between two assertions;

• an index of aggregation ∆(ai, Pj), which measures the similarity between an assertion and a cluster;

• a criterion W(L, P), to be optimized, measuring the adequacy between a partition P of A (a set of assertions) and its representation L.

The purpose of the SYNTHO algorithm is to set up an optimization criterion W (L, P ), which is to be maximized through functions of assignment (f ) and representation (g) that we define later.

We denote by $A = \{a_i\}_{i=1..n}$ the set of given assertions to partition, by $e^n_{a_i}$ the $n$th event of assertion $a_i$, and by $\mathit{aff}(e^n_{a_i}, e^m_{a_j})$ the affinity between two events $e^n_{a_i}$ and $e^m_{a_j}$, given by an expert.

We define the affinity $\mathit{aff}(a_i, a_j)$ between the assertions $a_i$ and $a_j$ such that:

\[ \mathit{aff}(a_i, a_j) = \max_{n,\,m} \left[ \mathit{aff}(e^n_{a_i}, e^m_{a_j}) \right]. \]
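The maximum over event pairs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: events are simplified to hypothetical (variable, value) pairs, the expert affinities are stored in a dictionary that is looked up symmetrically, and a pair the expert did not rate counts as 0. None of these choices comes from the text.

```python
from itertools import product

def assertion_affinity(events_i, events_j, event_affinity):
    """aff(a_i, a_j): largest expert affinity over all pairs of events.

    Assumptions of this sketch: the lookup is symmetric and an unrated
    pair of events has affinity 0.0.
    """
    return max(
        (event_affinity.get((e, f), event_affinity.get((f, e), 0.0))
         for e, f in product(events_i, events_j)),
        default=0.0,  # one of the assertions has no event
    )

# Hypothetical events and expert affinities, for illustration only.
vase = [("color", "blue"), ("material", "crystal")]
flower = [("type", "rose"), ("color", "yellow")]
expert = {
    (("material", "crystal"), ("type", "rose")): 0.8,
    (("color", "blue"), ("color", "yellow")): 0.3,
}
print(assertion_affinity(vase, flower, expert))  # 0.8
```

The strongest rated pair (crystal, rose) determines the affinity between the two assertions, whatever the other pairs are worth.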

In the case of assertions defined on the same variables, we use the Russell and Rao measure of similarity $r_{RR}(a_i, a_j)$ between two assertions:

\[ r_{RR}(a_i, a_j) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{p_k} \sum_{l=1}^{p_k} \delta^l_{ij}, \]

with

\[ \delta^l_{ij} = \begin{cases} 0 & \text{if } y^k_l(a_i) \neq y^k_l(a_j) \\ 1 & \text{if } y^k_l(a_i) = y^k_l(a_j) \end{cases} \]

where $y^k_l$ is the value of the $l$th category of the $k$th variable $y^k$, with $k \in \{1, \dots, n\}$ and $l \in \{1, \dots, p_k\}$, $p_k$ is the number of categories of the $k$th variable, and $n$ is the number of variables of the assertions $a_i$, $a_j$.
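As a rough illustration of this formula, the sketch below reads $y^k_l(a)$ as a binary indicator telling whether the $l$th category of variable $k$ is among the values allowed by assertion $a$. That reading, the dictionary layout, and the category lists are assumptions made for the example, not definitions taken from the text.

```python
def similarity(a_i, a_j, categories):
    """Average over variables of the share of categories on which the two
    assertions agree (category allowed by both, or by neither)."""
    total = 0.0
    for var, cats in categories.items():
        # delta = 1 when the indicator of category c is the same for both
        agree = sum((c in a_i[var]) == (c in a_j[var]) for c in cats)
        total += agree / len(cats)      # inner sum divided by p_k
    return total / len(categories)      # outer sum divided by n

# Hypothetical category lists for two variables shared by both assertions.
categories = {"color": ["yellow", "white", "violet"], "size": ["large", "medium"]}
flower2 = {"color": {"yellow", "white"}, "size": {"large"}}
flower3 = {"color": {"yellow", "violet"}, "size": {"large", "medium"}}
print(similarity(flower2, flower3, categories))
```

Here the two assertions agree on one of three colour categories and one of two size categories, so the value is (1/3 + 1/2) / 2 = 5/12.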

In the case of assertions defined on different variables, we denote $r'(a_i, a_j) = \mathit{aff}(a_i, a_j)$, but in the case of assertions defined on the same variables, we denote $r'(a_i, a_j) = r_{RR}(a_i, a_j)$.

Let $P = \{P_j\}_{j=1 \dots k}$ be a partition of $A$ into $k$ clusters. We define an index of aggregation $\Delta(a_i, P_j)$ between an assertion $a_i$ and the cluster $P_j$ such that:

\[ \Delta(a_i, P_j) = \sum_{j} \left[ r'(a_i, a_j),\ a_i \in L_i,\ a_j \in P_j \right] \]

where $r'$ is either an affinity $\mathit{aff}$ or a measure of similarity $r_{RR}$, depending on whether or not the assertions are defined on the same variables. For $L_i$ being a representation:

\[ \Delta(L_i, P_j) = \sum_{j} \left[ r'(a_i, a_j),\ a_i \in L_i,\ a_j \in P_j \right]. \]

We denote by $L = \{L_1, \dots, L_k\} \subset A$ the set of prototypes, or the space of representation of a partition, and define the criterion to be optimized, $W(L, P)$, representing the improvement of the affinities and similarities between the prototype and the assertions of the cluster, as:

\[ W : L^k \times P^k \to \mathbb{R}^+, \qquad W(L, P) = \sum_{j} \left[ \Delta(L_j, P_j),\ j = 1 \dots k \right]. \]
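To make the index of aggregation and the criterion concrete, here is a minimal sketch. The similarity `r` is passed in as a function, and the toy data (numbers standing in for assertions, with a hypothetical similarity that decreases with distance) are assumptions chosen only to keep the arithmetic easy to follow.

```python
def aggregation(candidate, cluster, r):
    """Delta(candidate, P_j): sum of r'(candidate, a) over the members a of the cluster."""
    return sum(r(candidate, a) for a in cluster)

def criterion(prototypes, partition, r):
    """W(L, P): sum over the clusters of Delta(L_j, P_j)."""
    return sum(aggregation(L_j, P_j, r) for L_j, P_j in zip(prototypes, partition))

# Toy stand-ins for assertions and a hypothetical similarity in (0, 1].
r = lambda x, y: 1.0 / (1.0 + abs(x - y))
partition = [[0, 1], [10, 12]]
prototypes = [0, 10]
print(criterion(prototypes, partition, r))
```

With these values, the first cluster contributes 1 + 1/2 and the second 1 + 1/3, so W is about 2.83.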

We will demonstrate later that the value of this criterion increases with each iteration until it becomes stable.

The SYNTHO algorithm is iterative, proceeding by alternating between two steps and based on two functions:

• f, the assignment function, which maps the k prototypes L to a partition P of k classes (f : Lk → Pk);

• g, the representation function, which maps a partition P of k classes to its k prototypes L (g : Pk → Lk).

1. Start from an initial set of k prototypes L = {L1, . . . , Lk}, taken from the set of assertions A = {a1, . . . , an}, which may be chosen randomly or according to criteria determined by the expert. The choice of k is based on the number of desired clusters.

2. Assign each assertion $a_i \in A \setminus L$ to the cluster with the closest prototype $L_j \in L$, according to the criterion for assigning the assertion $a_i$ to the prototype $L_j$:

\[ r'(a_i, L_j) = \max \left[ r'(a_i, L_m),\ L_m \in L \right]. \]

This forms the cluster $P_j$, and we obtain a partition $P = \{P_1, \dots, P_j, \dots, P_k\}$ of $A$:

\[ f : L^k \to P^k, \quad f(L_j) = P_j \ \text{with} \ P_j = \{ a_i \in A \setminus L,\ r'(a_i, L_j) > r'(a_i, L_m) \ \text{with} \ L_m \in L \setminus \{L_j\} \}. \]

In the case where the criterion for assigning an assertion ap to two prototypes

would be the same, we arbitrarily assign this assertion, ap, to the prototype of

3. Once the partition P is given, we have to find all the new prototypes $L = \{L_1, \dots, L_j, \dots, L_k\} \subset A$ on $P = \{P_1, \dots, P_j, \dots, P_k\}$. Each new prototype $L_j$ of $P_j$ is determined as the assertion of $A$ having the strongest representation criterion $\Delta(L_j, P_j)$ for the cluster $P_j$:

\[ g : P^k \to L^k, \quad g(P_j) = L_j \ \text{with} \ L_j \in A,\ \Delta(L_j, P_j) = \max \left[ \Delta(a_i, P_j),\ a_i \in A \right]. \]

The algorithm is deemed to have converged when the assignments no longer change.
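The alternation between the two steps can be sketched as follows. This is a simplified illustration, not the authors' implementation. It makes four assumptions: the similarity `r` is symmetric; every assertion, prototypes included, is assigned in step 2; ties go to the lowest cluster index; and a hypothetical `max_iter` cap guards against endless loops. The numeric "assertions" are toy stand-ins.

```python
def syntho(assertions, initial_prototypes, r, max_iter=100):
    """Alternate assignment (f) and representation (g) until the partition is stable."""
    prototypes = list(initial_prototypes)
    k = len(prototypes)
    partition, trace = None, []
    for _ in range(max_iter):
        # f: put each assertion in the cluster of its most similar prototype
        # (max returns the first maximum, so ties go to the lowest index).
        new_partition = [[] for _ in range(k)]
        for a in assertions:
            j = max(range(k), key=lambda m: r(a, prototypes[m]))
            new_partition[j].append(a)
        if new_partition == partition:   # assignments no longer change
            break
        partition = new_partition
        # g: for each cluster, pick the assertion of A with the largest Delta.
        prototypes = [
            max(assertions, key=lambda c: sum(r(c, a) for a in P_j))
            for P_j in partition
        ]
        # Record W(L, P) after this iteration.
        trace.append(sum(r(L_j, a) for L_j, P_j in zip(prototypes, partition) for a in P_j))
    return prototypes, partition, trace

r = lambda x, y: 1.0 / (1.0 + abs(x - y))
protos, parts, trace = syntho([0, 1, 2, 10, 11, 12], [0, 1], r)
print(protos, parts, trace)
```

Starting from the poor initial prototypes 0 and 1, the sketch ends with prototypes 1 and 11 and the clusters [0, 1, 2] and [10, 11, 12]. The recorded values of W do not decrease from one iteration to the next.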

15.6.2 Convergence of the Algorithm

The SYNTHO algorithm induces two sequences $U_n$ and $V_n$ defined as follows:

Let $U_n = W(L^n, P^n) \in \mathbb{R}^+$ and $V_n = (L^n, P^n) \in L^k \times P^k$, where $L^n = g(P^{n-1})$ and $P^n = f(L^n)$. $U_n$ is a positive sequence, because it is a sum of positive values.

Proposition 1 The SYNTHO algorithm converges.

Proof 1 We first have to show that f does not decrease the value of the criterion W. In other words, we have to prove that:

\[ W(L, P) \le W(L, f(L)), \quad \text{i.e.,} \quad \sum_{j=1}^{k} \Delta(L^n_j, P^n_j) \le \sum_{j=1}^{k} \Delta(L^n_j, P^{n+1}_j), \]

which is equivalent to:

\[ \sum_{j=1}^{k} \sum_{a_j \in P^n_j} \left[ r'(L^n_j, a^n_j) \right] \le \sum_{j=1}^{k} \sum_{a_j \in P^n_j} \left[ r'(L^n_j, a^{n+1}_j) \right]. \]

Each assertion is assigned to the class $P^n_j$ if its affinity to $L^n_j$ is the highest among all prototypes. Therefore, any assertion $a_i$ that changes cluster increases the value of the criterion W or leaves it unchanged, from which it follows that:

\[ W(L^n, P^n) \le W(L^n, P^{n+1}). \tag{15.1} \]

We now need to show that:

\[ W(L^n, P^{n+1}) \le W(L^{n+1}, P^{n+1}). \tag{15.2} \]

In other words, we have to prove that $W(L, P) \le W(g(P), P)$, which is equivalent to:

\[ \sum_{j=1}^{k} \Delta(L^n_j, P^n_j) \le \sum_{j=1}^{k} \Delta(L^{n+1}_j, P^n_j), \]

which is true by construction, since for $j = 1, \dots, k$ we have:

\[ \Delta(L^{n+1}_j, P^n_j) = \max \left[ \Delta(a_i, P^n_j),\ a_i \in A \right]. \]

Hence, it results from (15.1) and (15.2) that $W(L^n, P^n) \le W(L^{n+1}, P^{n+1})$. Therefore, we have $U_n \le U_{n+1}$, which means that $U_n$ is a non-decreasing sequence. As it is positive and bounded by the sum of all affinities for all the given events that can be combined, we can conclude that it is convergent.

15.6.3 Building of Synthesis Objects

For each cluster $C_i$, we generate $S_i$, the synthesis object representing the cluster. The synthesis object $S_i$ is characterized by the conjunction of the assertions contained in the cluster $C_i$. The obtained synthesis object $S_i$ is thus a conjunction of assertions $a_i$, which are themselves conjunctions of elementary events. We represent $S_i$ as follows: $S_i$ is a mapping $\Omega_1 \times \dots \times \Omega_k \to \{true, false\}$ denoted by $S_i = \{\wedge\, a^j_i(\omega_j),\ a^j_i \in C_i\}$, where $\omega_j$ is a generic element of $\Omega_j$.

Example 8 (of obtained cluster) C1 = (vase1, flower2, flower3) where

vase1 = [color ∈ {blue}] ∧ [size ∈ {large}] ∧ [material ∈ {crystal}] ∧ [shape ∈ {round}]

flower2 = [type ∈ {rose}] ∧ [color ∈ {yellow, white}] ∧ [size ∈ {large}] ∧ [stem ∈ {thorny}]

flower3 = [type ∈ {iris}] ∧ [color ∈ {yellow, violet}] ∧ [size ∈ {large, medium}] ∧ [stem ∈ {smooth}].

The synthesis object S1, not yet generalized, representing the obtained cluster C1 will be:

S1 = [color(vase1) ∈ {blue}] ∧ [size(vase1) ∈ {large}] ∧ [material(vase1) ∈ {crystal}]

∧ [shape(vase1) ∈ {round}] ∧ [type(flower2) ∈ {rose}] ∧ [color(flower2) ∈ {yellow, white}]

∧ [size(flower2) ∈ {large}] ∧ [stem(flower2) ∈ {thorny}] ∧ [type(flower3) ∈ {iris}]

∧ [color(flower3) ∈ {yellow, violet}] ∧ [size(flower3) ∈ {large, medium}]

∧ [stem(flower3) ∈ {smooth}].
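The construction of the non-generalized synthesis object amounts to concatenating the events of every assertion in the cluster while remembering which assertion each event came from. The sketch below illustrates this on Example 8. The dictionary layout and the helper names `synthesis_object` and `render` are hypothetical, and the values inside each event are sorted alphabetically, so their order may differ from the text.

```python
def synthesis_object(cluster):
    """Conjunction of every event of every assertion, tagged with its assertion name."""
    return [
        (variable, name, sorted(values))
        for name, assertion in cluster.items()
        for variable, values in assertion.items()
    ]

def render(events):
    """Write the conjunction in the bracket notation used in the example."""
    return " ∧ ".join(f"[{v}({n}) ∈ {{{', '.join(vals)}}}]" for v, n, vals in events)

C1 = {
    "vase1": {"color": {"blue"}, "size": {"large"},
              "material": {"crystal"}, "shape": {"round"}},
    "flower2": {"type": {"rose"}, "color": {"yellow", "white"},
                "size": {"large"}, "stem": {"thorny"}},
    "flower3": {"type": {"iris"}, "color": {"yellow", "violet"},
                "size": {"large", "medium"}, "stem": {"smooth"}},
}
S1 = synthesis_object(C1)
print(render(S1))
```

The result contains the twelve events of S1. The generalization steps that lead to the modal object are not covered by this sketch.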

After three steps of generalization, which we explain in Section 15.7, we obtain the following final modal synthesis object:

S1mod = [material(vase) ∈ {crystal}] ∧ [color(flower) ∈ {clear}]

∧ [type(flower) ∼ {0.5(rose), 0.3(tulip)}] ∧ [size(flower) ∼ {0.6(large), 0.2(medium)}].
