
Visualizing Transactional Data with Multiple Clusterings for Knowledge Discovery


Academic year: 2021


Nicolas Durand(1), Bruno Crémilleux(1) and Einoshin Suzuki(2)

(1) GREYC, University of Caen, France

(2)

Introduction

• Data visualization: great importance in data mining
◦ To “understand” the data,
◦ To discover knowledge,
◦ Tools (information retrieval), . . .

• 80% of information: obtained from eyesight [Card et al. 1999][Fayyad et al. 2002]

• Visualization based on clustering: SOM [Kohonen 1982][Vesanto 1999], CLUSION [Strehl et al. 2003], CDCS [Chang et al. 2005], gCLUTO [Rasmussen et al. 2004], . . .

(3)

Motivations

• Clustering = to form groups of similar objects
◦ Hard-clustering (distinct groups)
◦ Soft-clustering (overlapping groups)
◦ Problems: a lot of algorithms, goodness of a clustering (subjective)
◦ 1 clustering = a single “view” of the data

• Our idea: use of multiple clusterings/views of the data

(4)

Contributions

VISUMCLUST (Visualization of Multiple Clusterings): new approach of transactional data visualization

• How to represent multiple clustering results?

[Figure: data → clustering 1, clustering 2, ..., clustering n → VISUMCLUST (linking clusterings using colors) → image generation]

• Algorithm of color attribution to clusters, based on hypergraph minimal transversals

(5)

Transactional data set

Id   Items
t1   A J
t2   A B C
t3   A B C
t4   A B C
t5   D E
t6   D E H
t7   A D E F G H
t8   A F G I J

9 transactions (consumers, patients, users, . . . )
10 items (products, medical examinations, web pages, . . . )
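As a rough illustration of this data structure, the sketch below stores each transaction as a set of items and looks up the transactions that contain a given itemset. The names `transactions` and `cover` are hypothetical, and only the rows t1 to t8 that appear in the table are encoded.

```python
# Transactions of the running example, as listed in the table (t1 to t8 only).
transactions = {
    "t1": set("AJ"),
    "t2": set("ABC"),
    "t3": set("ABC"),
    "t4": set("ABC"),
    "t5": set("DE"),
    "t6": set("DEH"),
    "t7": set("ADEFGH"),
    "t8": set("AFGIJ"),
}


def cover(itemset):
    """Return the sorted ids of the transactions containing every item of `itemset`."""
    items = set(itemset)
    return sorted(tid for tid, content in transactions.items() if items <= content)


print(cover("DE"))   # transactions sharing items D and E
print(cover("AFG"))  # transactions sharing items A, F and G
```

A cluster in the later slides pairs an itemset with such a list of transactions, which is why a set-based lookup is a natural fit here.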

(6)

Clustering results

Clustering algorithm: ECCLAT (Extraction of Clusters from Concept LATtice) [Durand et al. 2002]
◦ Soft-clustering (overlapping groups)
◦ Categorical data
◦ Based on frequent closed itemsets

3 clusterings obtained using different parameters:

Id   Clusters
C1   (A; t1, t2, t3, t4, t7, t8) (DE; t5, t6, t7) (I; t8, t9)
C2   (AFG; t7, t8) (ABC; t2, t3, t4) (DEH; t6, t7) (I; t8, t9) (J; t1, t8)
C3   (ABC; t2, t3, t4) (DE; t5, t6, t7) (I; t8, t9) (J; t1, t8)

Example of 1 cluster: (DE; t5, t6, t7) = itemset DE with transactions t5, t6, t7

(10)

VISUMCLUST result

Display result on the running example:

• line = clustering
• column = transaction
• rod = presence of transactions in clusterings
• color = membership of transactions to clusters

[Figure: VISUMCLUST display of the running example, with legend Presence / Absence and labels C1, t5, t8]
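The line/column/rod reading of the display can be sketched in text form. The snippet below is only an illustration, not the VISUMCLUST renderer: it prints one line per clustering and one column per transaction, with `X` for presence and `.` for absence. The cluster extents are copied from the clustering results of the running example; the function names are hypothetical.

```python
# Cluster extents of the running example (transaction numbers 1..9 stand for t1..t9).
clusterings = {
    "C1": {"A": {1, 2, 3, 4, 7, 8}, "DE": {5, 6, 7}, "I": {8, 9}},
    "C2": {"AFG": {7, 8}, "ABC": {2, 3, 4}, "DEH": {6, 7}, "I": {8, 9}, "J": {1, 8}},
    "C3": {"ABC": {2, 3, 4}, "DE": {5, 6, 7}, "I": {8, 9}, "J": {1, 8}},
}
TRANSACTIONS = range(1, 10)


def membership(clustering, t):
    """Clusters of `clustering` that contain transaction t (empty list = absence)."""
    return [name for name, extent in clusterings[clustering].items() if t in extent]


def presence_row(clustering):
    """One display line: 'X' where the transaction is present, '.' where it is absent."""
    return "".join("X" if membership(clustering, t) else "." for t in TRANSACTIONS)


for name in clusterings:
    print(name, presence_row(name))

print(membership("C2", 8))  # t8 belongs to several overlapping clusters of C2
```

With these extents, t5 is the only transaction absent from a clustering (C2), while t8 belongs to three clusters of C2, so it would carry several colors on its rod.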

(11)

Allocated colors

Allocated colors in the example:

Color        Cluster from C1               Cluster from C2     Cluster from C3
1 (blue)     (I; t8, t9)                   (I; t8, t9)         (I; t8, t9)
2 (green)    (DE; t5, t6, t7)              (DEH; t6, t7)       (DE; t5, t6, t7)
3 (yellow)   (A; t1, t2, t3, t4, t7, t8)   (ABC; t2, t3, t4)   (ABC; t2, t3, t4)
4 (orange)   ∅                             (J; t1, t8)         (J; t1, t8)
5 (red)      ∅                             (AFG; t7, t8)       ∅

• 1 color = 1 cluster of each clustering

(12)

Color attribution algorithm: idea

• Related to clustering of clusters, but . . .
• Enumerate all possibilities: not feasible ($k^n$ where n = number of clustering results, k = maximal number of clusters in a clustering result)
• . . . 1 cluster of each clustering . . . related to hypergraph transversals
→ candidate groups, evaluated and selected according to a similarity measure

(13)

Hypergraph

[Figure: hypergraph with hyperedges C1, C2, C3 and vertices (A; 1,2,3,4,7,8), (DE; 5,6,7), (I; 8,9), (AFG; 7,8), (ABC; 2,3,4), (DEH; 6,7), (J; 1,8)]

• Vertices = clusters
• Hyperedges = clusterings

(14)

Hypergraph minimal transversals

[Figure: hypergraph with hyperedges C1, C2, C3 and vertices (A; 1,2,3,4,7,8), (DE; 5,6,7), (I; 8,9), (AFG; 7,8), (ABC; 2,3,4), (DEH; 6,7), (J; 1,8)]

• Transversal = set of vertices meeting all hyperedges, e.g. {A, AFG, ABC} (the transversals are also called hitting sets)

• Minimal transversal = transversal with a minimal

(15)

Minimal transversals as candidate groups

Set of all minimal transversals = set of combinations of clusters through the clustering results, gathering the identical clusters

• Minimal transversals = good candidate groups to form colors

In the example: 7 minimal transversals (whereas 60 candidate groups in all)

• CMT prototype [Hébert 2005]: to compute the
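The two counts quoted for the example can be reproduced with a small brute-force sketch. This is not the CMT prototype: it simply tries vertex subsets by increasing size, which is only reasonable for a handful of clusters. The function name `minimal_transversals` is an assumption made for illustration.

```python
from itertools import combinations
from math import prod

# Hyperedges = clusterings, vertices = clusters (identical clusters share one vertex).
C1 = {"A", "DE", "I"}
C2 = {"AFG", "ABC", "DEH", "I", "J"}
C3 = {"ABC", "DE", "I", "J"}
hyperedges = [C1, C2, C3]


def minimal_transversals(edges):
    """Brute-force enumeration, smallest sets first (fine for a few vertices only)."""
    vertices = sorted(set().union(*edges))
    found = []
    for size in range(1, len(vertices) + 1):
        for candidate in combinations(vertices, size):
            s = set(candidate)
            hits_all = all(s & e for e in edges)
            # A superset of an already found transversal cannot be minimal.
            if hits_all and not any(t <= s for t in found):
                found.append(s)
    return found


mts = minimal_transversals(hyperedges)
candidate_groups = prod(len(e) for e in hyperedges)  # one cluster per clustering
print(candidate_groups)  # 3 * 5 * 4 = 60
print(len(mts))          # 7
for t in mts:
    print(sorted(t))
```

The transversal {A, AFG, ABC} is not returned because it contains the smaller transversal {A, ABC}.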

(16)

Minimal transversals evaluation

• Similarity between two clusters = Jaccard coefficient (on the itemset)

$\mathrm{Jaccard}(I_1, I_2) = \dfrac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$

Jaccard(ABC, AFG) = 0.2

• Similarity of a group (set of clusters): intra-color similarity (noted SimC)

SimC = average of the similarity values of all pairs of clusters of the group

SimC({A, AFG, ABC}) = 0.29

SimC({A, ABC}) = 0.55 (ABC present both for C2 and C3)

Goodness measure SD of the allocation of colors =
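The similarity values quoted on this slide can be checked by hand. The derivation below assumes that a group is read as one cluster per clustering, so that {A, ABC} stands for ⟨A, ABC, ABC⟩ with ABC counted once for C2 and once for C3, as the slide's remark suggests.

```latex
\begin{align*}
\mathrm{Jaccard}(ABC, AFG) &= \frac{|\{A\}|}{|\{A,B,C,F,G\}|} = \frac{1}{5} = 0.2 \\
\mathrm{SimC}(\langle A, AFG, ABC \rangle) &= \frac{1}{3}\left(\frac{1}{3} + \frac{1}{3} + \frac{1}{5}\right) = \frac{13}{45} \approx 0.29 \\
\mathrm{SimC}(\langle A, ABC, ABC \rangle) &= \frac{1}{3}\left(\frac{1}{3} + \frac{1}{3} + 1\right) = \frac{5}{9} \approx 0.56
\end{align*}
```

The last value is 0.5555..., which matches the reported 0.55 if the slide truncates rather than rounds.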

(17)

Color attribution algorithm

Input: C, ..., Cn where Ci = {ci,1, ..., ci,m}.

Output: D = (color1, ..., colorp) where colori = {c1,j, ..., cn,j0}.

Start

0. Color = 0

1. Transform Ci into a hypergraph H

Repeat step 2-6 until there are no clusters (ci,j) without color: 2. Color += 1

3. Compute the minimal transversals of H

4. Select the best candidate Cand (the higher intra-color similarity)

5. Attribute the current Color to each ci,j ∈ Cand

6. Add Cand to D and remove the colored clusters (ci,j) from H End
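The steps above can be sketched in Python on the running example. This is a minimal sketch under stated assumptions, not the authors' implementation: minimal transversals are enumerated by brute force instead of with the CMT prototype, a clustering with no uncolored cluster left contributes `None` with similarity 0, and when a transversal offers several clusters for the same clustering every one-per-clustering choice is scored. All function names are hypothetical.

```python
from itertools import combinations, product


def pair_sim(a, b):
    """Jaccard coefficient on itemsets; 0 when a clustering contributes no cluster."""
    if a is None or b is None:
        return 0.0
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def sim_c(group):
    """Average similarity over all pairs of a group (one entry per clustering)."""
    pairs = list(combinations(group, 2))
    return sum(pair_sim(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0


def minimal_transversals(edges):
    """Brute force, smallest sets first (acceptable for a few vertices only)."""
    vertices = sorted(set().union(*edges))
    found = []
    for size in range(1, len(vertices) + 1):
        for candidate in combinations(vertices, size):
            s = set(candidate)
            if all(s & e for e in edges) and not any(t <= s for t in found):
                found.append(s)
    return found


def attribute_colors(clusterings):
    """Greedy loop of steps 2-6; returns one (group, SimC) pair per color."""
    remaining = [set(c) for c in clusterings]  # uncolored clusters per clustering
    colors = []
    while any(remaining):
        edges = [c for c in remaining if c]  # empty hyperedges cannot be hit
        best, best_score = None, -1.0
        for t in minimal_transversals(edges):
            # Assumption: pick one member of t per clustering, None if none is left.
            choices = [sorted(t & c) or [None] for c in remaining]
            for group in product(*choices):
                score = sim_c(group)
                if score > best_score:
                    best, best_score = group, score
        colors.append((best, best_score))
        for c, chosen in zip(remaining, best):
            c.discard(chosen)  # step 6: remove the colored clusters
    return colors


colors = attribute_colors([
    {"A", "DE", "I"},                    # C1
    {"AFG", "ABC", "DEH", "I", "J"},     # C2
    {"ABC", "DE", "I", "J"},             # C3
])
for number, (group, score) in enumerate(colors, start=1):
    print(number, group, round(score, 2))
```

On this input the sketch yields five groups in the same order as the five colors of the example: ⟨I, I, I⟩, ⟨DE, DEH, DE⟩, ⟨A, ABC, ABC⟩, ⟨−, J, J⟩ and ⟨−, AFG, −⟩.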

(18)

Example

Color 1 (blue)

minimal transversal: {I}

→ <I, I, I> SimC = 1.00

[Figure: hypergraph with hyperedges C1, C2, C3 and vertices (ABC; 2,3,4), (AFG; 7,8), (DEH; 6,7), (J; 1,8), (DE; 5,6,7), (A; 1,2,3,4,7,8), (I; 8,9)]

(19)

Example

Color 2 (green)

minimal transversal: {DE, DEH}

→ <DE, DEH, DE> SimC = 0.77

[Figure: hypergraph with hyperedges C1, C2, C3 and vertices (ABC; 2,3,4), (AFG; 7,8), (J; 1,8), (A; 1,2,3,4,7,8), (DE; 5,6,7), (DEH; 6,7)]

(20)

Example

Color 3 (yellow)

minimal transversal: {A, ABC}

→ <A, ABC, ABC> SimC = 0.55

[Figure: hypergraph with hyperedges C1, C2, C3 and vertices (J; 1,8), (A; 1,2,3,4,7,8), (AFG; 7,8), (ABC; 2,3,4)]

(21)

Example

Color 4 (orange)

minimal transversal: {J}

→ <−, J, J> SimC = 0.33

[Figure: hypergraph with hyperedges C2, C3 and vertices (AFG; 7,8), (J; 1,8)]

(22)

Example

Color 5 (red)

The remaining cluster

→ <−, AFG, −>

[Figure: hyperedge C2 with vertex (AFG; 7,8)]

(23)

Experiments

Data sets:

• Benchmarks: Mushroom, Votes, Hepatitis, Ionosphere, Titanic
• Geographical real-world database: Cross channel atlas (called “Geo”)

For each data set:

• ECCLAT to obtain several clusterings
• VISUMCLUST

• Evaluation of the color attribution using the

(24)

Results

                  Mushroom   Votes    Hepatitis   Ionosphere   Titanic   Geo
Baseline method   0.554      0.617    0.436       0.342        0.675     0.472
VISUMCLUST        0.658      0.839    0.521       0.389        0.878     0.684
Gain              +18.7%     +35.9%   +19.5%      +13.9%       +30.1%    +44.9%

(25)

Cross channel atlas (Geo)

• 41 English counties (south) and 28 French regions (north, west)
• Described by 99 items (demographic and economic indicators)

(26)

Result of VISUMCLUST on Geo

[Figure: VISUMCLUST display on Geo; columns (units): 1 Finistère ; 2 Côtes d'Armor ; 3 Ille et Vilaine ; 4 Morbihan ; 5 Eure ; 6 Seine-Maritime ; 7 Loire-Atlantique ; 8 Maine-et-Loire ; 9 Mayenne ; 10 Sarthe ; 11 Vendée ; 12 Calvados ; 13 Manche ; 14 Orne ; 15 Eure-et-Loir ; 16 Aisne ; 17 Oise ; 18 Somme ; 19 Nord ; 20 Pas-de-Calais ; 21 Essonne ; 22 Hauts-de-Seine ; 23 Ville de Paris ; 24 Seine Saint-Denis ; 25 Seine-et-Marne ; 26 Val d'Oise ; 27 Val-de-Marne ; 28 Yvelines ; 29 Swindon ; 30 London ; 31 Isle of Wight ; 32 Poole ; 33 Bournemouth ; 34 Torbay ; 35 Plymouth ; 36 South Gloucestershire ; 37 North Somerset ; 38 City of Bristol ; 39 Bath and Noth Somerset ; 40 Kent ; 41 East Sussex ; 42 West Sussex ; 43 Buckinghamshire ; 44 Surrey ; 45 Windsor ; 46 Reading ; 47 Hampshire ; 48 West ; 49 Oxfordshire ; 50 Medway ; 51 Southampton ; 52 Milton Keynes ; 53 Bracknell Forest ; 54 Portsmouth ; 55 Brighton and Hove ; 56 Wockingham ; 57 Slough ; 58 Wiltshire ; 59 Gloucestershire ; 60 Dorset ; 61 Devon ; 62 Somerset ; 63 Cornwall and Isle of Scilly ; 64 Hertfordshire ; 65 Thurrock ; 66 Southend-on-Sea ; 67 Bedfordshire ; 68 Essex ; 69 Luton]

Display result and description of each cluster

(27)

Result of VISUMCLUST on Geo

[Figure: VISUMCLUST display on Geo]

1) Blue and green = English and French units far from London and Paris (respectively)
→ aging population, . . .

2) Yellow = capitals and their surroundings
→ dynamic units, large agglomerations with good demographic indicators (birth rate, . . . )

3) . . .

Observations validated by


(30)

Conclusion

• New approach for visualizing transactional data: VISUMCLUST
• Display multiple clustering results at the same time
◦ Consistent view obtained with the color allocation algorithm based on the minimal transversals of a hypergraph
• Importance of multiple clusterings/views of the data
◦ Very useful for knowledge discovery

• Evaluation on benchmarks, experiment on a

(31)

Future Work

• More experiments (clustering algorithms, data sets)
• Interactive interface (with clickable objects to access some details/descriptions, . . . )
• Applications:
◦ comparison of clustering results,
◦ help to choose a clustering algorithm, the parameters

(32)

References

N. Durand and B. Crémilleux. ECCLAT: a New Approach of Clusters Discovery in Categorical Data. In ES 2002, Cambridge, UK, December 2002.

C. Hébert. Enumerating the Minimal Transversals of a Hypergraph Using Galois Connections. Technical report, University of Caen Basse-Normandie, France, 2005.

UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/. (Mushroom, Votes, Hepatitis, Ionosphere)

http://www.amstat.org/publications/jse/. (Titanic)

http://atlastransmanche.certic.unicaen.fr/index.gb.html.
