
PROBABILISTIC CONCEPTUAL CLUSTERING

The groundwork for the area of conceptual clustering was largely carried out by Michalski and Stepp (1980, 1981). Other researchers have since developed a number of conceptual clustering methods that are probabilistic in design rather than conjunctive. The probabilistic method developed by Hanson and Bauer (1989), WITT, is discussed here, and COBWEB and two other methods are discussed in chapter four. In total, four probabilistic methods are discussed in this thesis.

Hanson and Bauer (1989) suggest that there are four disadvantages to using concept descriptions based on logic statements.

The first problem with using logic statement methods is that membership of a concept, or category, is strictly based upon meeting the given conditions. In other words, a value is either necessary (equality or inequality) or sufficient (within a given range). Hanson and Bauer argue that this creates an “Aristotelian” view of categories, meaning that they are characterised solely by their shared properties and not by their actual likeness, thus possibly failing to reflect their true similarity. Within human categorisation, objects may be considered related without specific values being necessary or sufficient; this has been referred to as the concept of polymorphy (Wittgenstein 1953).

Secondly, concepts that are expressed through logic expressions have firm boundaries and do not allow a gradient, or level, of membership. However, some members of a category fit its representation more tightly than others. Some objects are therefore better suited to membership of a particular category than others, yet the less suited objects are still members of the category.

Chapter 2 - Clustering Techniques

A third problem is that a key feature of a concept is the interrelationship between the features within the contained objects. Logical-statement-based methods, while ignoring the relationships between features, can be overly focused on the commonality of features belonging to each object within the concept. The cohesion within a concept, that is the interrelation between features, can provide for structuring within a concept.

Finally, the absoluteness of a logical expression used to express a concept does not cater for comparisons between categories, or relative properties. Within human categorisation, categories arise from direct comparison with other objects and categories within context. As such, each category is defined relative to the others by comparison.

All four of these points expressed by Hanson and Bauer (1989) share a common thread: logic expressions are not sufficiently flexible to express categories. Logic expressions, by definition, are finite rules, and they provide a model which does not completely express categorisation from the human perspective. It is this flexibility that probabilistic concept formation aims to provide.

2.4.2.1

WITT

WITT (Hanson and Bauer 1989) is a conceptual clustering system which builds upon the work done by Michalski and Stepp. The method is similar to PAF in that it generates a concept description for disjoint clusters, created utilising the attribute-value pairs of a group of instances. However, the focus of WITT’s concept creation and clustering is the interrelatedness of features, and not just the attribute-value pairs on their own. As such, the concepts are represented as co-occurrences between features across attribute-value pairs. WITT realises these co-occurrences through the use of contingency tables. A contingency table for a group of instances is a matrix that counts the number of times that particular values of two attributes appear in conjunction with each other. WITT, unlike PAF, is probabilistic in nature and utilises these contingency tables to calculate how likely certain features are to be found together, based upon how many times different attribute-value pairs have occurred together. WITT measures the inter-instance correlation using a metric called cohesion. It acts in a similar way to a distance measure in a data clustering technique, but is used to illustrate conceptual likeness and is far more computationally complex. It is a measure of the distance in terms of relations between features, calculated from the contingency tables. The following section details how this cohesion metric is calculated.
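To make the contingency tables concrete, the following is a minimal Python sketch of how co-occurrence counts could be gathered for every pair of attributes. It is not taken from Hanson and Bauer (1989): the function name `contingency_tables`, the dictionary representation of instances, and the example data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def contingency_tables(instances):
    """Count co-occurring values for every pair of attributes.

    `instances` is a list of dicts mapping attribute name -> value
    (an assumed representation). The result maps each attribute pair
    to a Counter of (value_u, value_v) -> number of co-occurrences.
    """
    attributes = sorted(instances[0])
    tables = {pair: Counter() for pair in combinations(attributes, 2)}
    for instance in instances:
        for attr_i, attr_j in tables:
            tables[(attr_i, attr_j)][(instance[attr_i], instance[attr_j])] += 1
    return tables

# Hypothetical instances, invented for this example.
birds = [
    {"covering": "feathers", "moves": "flies", "diet": "meat"},
    {"covering": "feathers", "moves": "flies", "diet": "meat"},
    {"covering": "feathers", "moves": "flies", "diet": "seeds"},
]
tables = contingency_tables(birds)
print(tables[("covering", "moves")])
```

Each table holds the raw counts from which a co-occurrence measure can later be derived; sparse counters are used here so that value combinations that never occur simply do not appear.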

2.4.2.2

COHESION

Hanson and Bauer (1989) defined cohesion, Cc, of a concept c as:

$$C_c = \frac{W_c}{O_c}$$

where Wc is the within-concept cohesion of the concept c, and Oc is the average cohesiveness between c and all other concepts. Categories, or concepts, are not usually formed in isolation from outside input or comparison to existing concepts. Concepts are formed utilising both knowledge within the cluster, and outside of it. A person will form a concept in their mind that an eagle and a hawk are both birds, while at the same time acknowledging that they are not fish. The concept is formed by maximising the closeness within the concept of birds, while also minimising the similarity across categories.

The within-concept cohesion, Wc, is a measure of the average variance across the co-occurrence of attribute-pairs within c. It is defined as:

$$W_c = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} D_{ij}}{N(N-1)/2}$$

where N is the total number of attributes, and Dij is the co-occurrence distribution within the contingency table for attributes i and j. Dij is further defined as:

$$D_{ij} = \frac{\sum_{u=1}^{i} \sum_{v=1}^{j} f_{uv} \log(f_{uv})}{\left(\sum_{u=1}^{i} \sum_{v=1}^{j} f_{uv}\right)\left(\log\left(\sum_{u=1}^{i} \sum_{v=1}^{j} f_{uv}\right)\right)}$$

where fuv is the frequency with which value u of attribute i and value v of attribute j co-occur. Each contingency table is a matrix of n by m values, while u and v index the values of attributes i and j respectively. As such, Dij involves summing over all u and v values across the whole contingency table. Using this equation, if there was perfect co-occurrence within a given table, with the attributes always occurring together, Dij would equal 1.0. If, instead, co-occurrence occurred equally across all combinations, then the resultant Dij would be zero. All other combinations fall between these two extremes, so Dij serves as a metric of the distribution of co-occurrence within the table. Wc can be calculated using the value of Dij for each contingency table within concept c. The summed values of each Dij within c are divided by N(N-1)/2, the number of attribute pairs, to produce the average variance within c, thus demonstrating its cohesion.
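The calculation of Dij and Wc can be sketched in a few lines of Python. This is an illustrative reading of the definitions in this section, not Hanson and Bauer's implementation: the function names, the list-of-rows table layout, the treatment of empty cells as contributing zero, and the handling of tables with at most one observation are assumptions.

```python
import math

def d_ij(table):
    """Co-occurrence score D_ij for one contingency table.

    `table` is a list of rows of raw co-occurrence counts f_uv.
    Computes sum(f * log f) / (total * log total), with total = sum(f).
    """
    # Assumption: empty cells contribute nothing (0 * log 0 taken as 0).
    cells = [f for row in table for f in row if f > 0]
    total = sum(cells)
    if total <= 1:
        return 0.0  # assumption: the ratio is undefined here, so report 0
    return sum(f * math.log(f) for f in cells) / (total * math.log(total))

def within_cohesion(tables, n_attributes):
    """W_c: the D_ij values of all attribute pairs i < j, divided by N(N-1)/2."""
    n_pairs = n_attributes * (n_attributes - 1) / 2
    return sum(d_ij(t) for t in tables.values()) / n_pairs

# Hypothetical tables for a concept described by three attributes (0, 1, 2).
tables = {
    (0, 1): [[6, 0], [0, 0]],  # every observation falls in one cell
    (0, 2): [[1, 1], [1, 1]],  # one observation in every cell
    (1, 2): [[3, 0], [0, 3]],  # two value pairings, equally frequent
}
scores = {pair: d_ij(t) for pair, t in tables.items()}
w_c = within_cohesion(tables, n_attributes=3)
print(scores, w_c)
```

With raw counts, this sketch gives 1.0 for the first table and 0.0 for the second, and the third table falls between the two; `w_c` is the mean of the three scores.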

The second component required to calculate the cohesion within a concept, Cc, is Oc which is defined as:

$$O_c = \frac{\sum_{k=1,\, k \neq c}^{K} B_{ck}}{K-1}$$

where K is the total number of concepts, and Bck is the measure of relative cohesion between the concepts c and k. Bck is defined as:

$$B_{ck} = \frac{1}{W_c + W_k - 2W_{c \cup k}}$$

where Wc is the measure of within-concept cohesion of c, Wk is the measure of within-concept cohesion of k, and Wc∪k is the measure of cohesion within the union of the two concepts c and k. Oc is thus the sum of the relative cohesion measures between c and all K-1 other concepts, divided by K-1 to give the average cohesion across the whole set of concepts.
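As a rough illustration of how Bck, Oc and Cc fit together, the sketch below takes the within-concept cohesion values as given and combines them. The function names, the dictionary layout, and every numeric value are hypothetical; the numbers were chosen only to keep the arithmetic easy to follow.

```python
def relative_cohesion(w_c, w_k, w_union):
    """B_ck = 1 / (W_c + W_k - 2 * W_{c∪k})."""
    denominator = w_c + w_k - 2.0 * w_union
    if denominator == 0:
        raise ValueError("B_ck is undefined when W_c + W_k equals 2 * W_union")
    return 1.0 / denominator

def average_between(c, w, w_union):
    """O_c: the mean of B_ck over every other concept k."""
    others = [k for k in w if k != c]
    total = sum(
        relative_cohesion(w[c], w[k], w_union[frozenset((c, k))]) for k in others
    )
    return total / len(others)

def cohesion(c, w, w_union):
    """C_c = W_c / O_c."""
    return w[c] / average_between(c, w, w_union)

# Hypothetical within-concept cohesion values for three concepts ...
w = {"birds": 0.75, "fish": 0.5, "mammals": 0.5}
# ... and for the union of each pair of concepts.
w_union = {
    frozenset(("birds", "fish")): 0.125,
    frozenset(("birds", "mammals")): 0.375,
    frozenset(("fish", "mammals")): 0.25,
}
o_birds = average_between("birds", w, w_union)
c_birds = cohesion("birds", w, w_union)
print(o_birds, c_birds)
```

For these invented values the two relative cohesion terms for birds are 1.0 and 2.0, so their average is 1.5 and the resulting cohesion is 0.75 / 1.5 = 0.5.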

2.4.2.3

THE WITT ALGORITHM

The WITT algorithm is largely controlled by the cohesion metric. Given a set of N instances, the algorithm could in principle consider all possible clusters; however, as N increases so does the number of resultant concepts. The algorithm is therefore bounded by thresholds so that it creates “good” concepts while operating as efficiently as possible. The first phase of the algorithm is not discussed at length here, but is described in Hanson and Bauer (1989). In brief, the initial phase creates some starting clusters, utilising a simple distance metric and a strict threshold (T1) to verify quality. This phase is largely a data clustering technique, and is referred to as the pre-clustering algorithm. Once it has completed, the cohesion measure is utilised to create a concept hierarchy. Again, this phase utilises thresholds (T2 and T3) to ensure quality. The algorithm continues to iterate as long as the cohesion factor between the two most similar clusters is greater than T3, which allows these clusters to be merged. Once the prospective merges score less than the threshold, the clustering is complete. This phase is detailed in Table 6:

1) Compute the cohesion score C for all unclustered instances and existing concepts.

2) Select the instance-cluster pair with the highest score S.

3) If S is greater than T2 then add the instance to the cluster and go back to Step 1.

4) If not, then use the pre-clustering algorithm again to generate more initial clusters.

   a) For each new cluster c, if Wi∪c is less than T3 for all i then add c.

   b) If any new clusters are added then go to Step 1.

5) Else calculate the within-cluster cohesion factor Wc∪j for all pairs of clusters and select the pair with the highest score. If the score is higher than T3 then merge the clusters and go to Step 1, else stop.

Table 6. The WITT Algorithm
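The final merging step of Table 6 (Step 5) can be sketched as follows. This is a simplified illustration under stated assumptions, not the WITT implementation: it covers only the merge loop, omits the pre-clustering phase and the T2 test, represents instances as tuples of attribute values, and uses hypothetical function names and data.

```python
import math
from collections import Counter
from itertools import combinations

def within_cohesion(instances):
    """W for a cluster: the mean D_ij over all pairs of attributes."""
    n_attributes = len(instances[0])
    scores = []
    for i, j in combinations(range(n_attributes), 2):
        counts = Counter((inst[i], inst[j]) for inst in instances)
        total = sum(counts.values())
        if total <= 1:
            scores.append(0.0)  # assumption: undefined ratio treated as 0
            continue
        numerator = sum(f * math.log(f) for f in counts.values())
        scores.append(numerator / (total * math.log(total)))
    return sum(scores) / len(scores)

def merge_phase(clusters, t3):
    """Merge the pair whose union has the highest W while that score exceeds t3."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        score, a, b = max(
            (within_cohesion(clusters[a] + clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if score <= t3:
            break  # no prospective merge passes the threshold
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters

# Hypothetical starting clusters; each instance is (covering, habitat).
start = [
    [("feathers", "air"), ("feathers", "air")],
    [("feathers", "air"), ("feathers", "air")],
    [("scales", "water"), ("scales", "water")],
]
result = merge_phase(start, t3=0.75)
print(result)
```

In this invented example the two identical clusters are merged first, and the loop then stops because merging the remaining pair would score below the threshold. Recomputing every pairwise union on each pass is expensive, which is in keeping with the observation that the cohesion measure is far more computationally complex than a simple distance measure.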

2.5

SUMMARY

This chapter has given a brief overview of the topic of clustering, focusing on the two main forms: data clustering and conceptual clustering. Within data clustering, partitional and hierarchical methods were explained. Conceptual clustering was then explored, highlighting both conjunctive and probabilistic methods, with in-depth explanations of the PAF and WITT algorithms.

This research examined several conceptual clustering methods, with special attention paid to methods that handle change over time. The next chapter will examine several more conceptual clustering techniques which aim to adapt to change over time.

3

Knowledge Acquisition