Visualizing Transactional Data
with Multiple Clusterings for
Knowledge Discovery
Nicolas Durand(1), Bruno Crémilleux(1) and
Einoshin Suzuki(2)
(1) GREYC, University of Caen, France
Introduction
• Data visualization: great importance in data mining
◦ To “understand” the data,
◦ To discover knowledge,
◦ Tools (information retrieval), . . .
• 80% of information: obtained from eyesight
[Card et al. 1999][Fayyad et al. 2002]
• Visualization based on clustering
SOM [Kohonen 1982][Vesanto 1999], CLUSION [Strehl et al. 2003], CDCS [Chang et al. 2005], gCLUTO [Rasmussen et al. 2004], . . .
Motivations
• Clustering = to form groups of similar objects
Hard clustering (distinct groups) vs. soft clustering (overlapping groups)
◦ Problems: a lot of algorithms,
goodness of a clustering (subjective)
◦ 1 clustering → single “view” of the data
• Our idea: use of multiple clusterings/views of the data
Contributions
• VISUMCLUST (Visualization of Multiple Clusterings):
New approach to transactional data visualization
• How to represent multiple clustering results?
[Figure: VISUMCLUST pipeline: data → clusterings (clustering 1, clustering 2, ..., clustering n) → linking using colors → image generation]
• Algorithm of color attribution to clusters, based on
hypergraph minimal transversals
Transactional data set
Id   Items
t1   A J
t2   A B C
t3   A B C
t4   A B C
t5   D E
t6   D E H
t7   A D E F G H
t8   A F G I J
• 9 transactions (consumers, patients, users, . . . )
• 10 items (products, medical examinations, web pages, . . . )
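A transactional data set of this kind can be held as a mapping from transaction id to item set. The sketch below is only an illustration: it assumes the eight rows listed in the table (the row for t9 is not listed), and the helper name `cover` is hypothetical.

```python
# Running example: transaction id -> set of items (rows t1..t8 as listed in the table).
transactions = {
    "t1": set("AJ"),
    "t2": set("ABC"),
    "t3": set("ABC"),
    "t4": set("ABC"),
    "t5": set("DE"),
    "t6": set("DEH"),
    "t7": set("ADEFGH"),
    "t8": set("AFGIJ"),
}

def cover(itemset):
    """Return the sorted ids of the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sorted(tid for tid, row in transactions.items() if items <= row)

# All distinct items occurring in the listed rows.
all_items = sorted(set().union(*transactions.values()))

print(cover("DE"))      # transactions containing both D and E
print(len(all_items))   # number of distinct items
```

With these rows, the itemset DE is covered by t5, t6 and t7, and the rows use 10 distinct items.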
Clustering results
• Clustering algorithm: ECCLAT (Extraction of
Clusters from Concept LATtice) [Durand et al. 2002]
◦ Soft-clustering (overlapping groups)
◦ Categorical data
◦ Based on frequent closed itemsets
• 3 clusterings obtained using different parameters:
Id   Clusters
C1   (A; t1, t2, t3, t4, t7, t8) (DE; t5, t6, t7) (I; t8, t9)
C2   (AFG; t7, t8) (ABC; t2, t3, t4) (DEH; t6, t7) (I; t8, t9) (J; t1, t8)
C3   (ABC; t2, t3, t4) (DE; t5, t6, t7) (I; t8, t9) (J; t1, t8)
• 1 cluster = (itemset; transactions), e.g. (DE; t5, t6, t7)
VISUMCLUST result
Display result on the running example:
• line = clustering
• column = transaction
• rod = presence of transactions in clusterings
• color = membership of transactions in clusters
[Figure: VISUMCLUST display of the running example, one line per clustering and one column per transaction; legend: Presence / Absence]
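The presence/absence layer of the display can be sketched as a small boolean matrix. This is a minimal illustration built from the three clusterings of the running example, not the VISUMCLUST implementation; the name `presence_row` and the text rendering are assumptions.

```python
# Clusterings of the running example: name -> {itemset: transaction numbers}.
clusterings = {
    "C1": {"A": {1, 2, 3, 4, 7, 8}, "DE": {5, 6, 7}, "I": {8, 9}},
    "C2": {"AFG": {7, 8}, "ABC": {2, 3, 4}, "DEH": {6, 7}, "I": {8, 9}, "J": {1, 8}},
    "C3": {"ABC": {2, 3, 4}, "DE": {5, 6, 7}, "I": {8, 9}, "J": {1, 8}},
}
transaction_ids = range(1, 10)  # t1..t9

def presence_row(clusters):
    """One display line: True where the transaction belongs to at least one cluster."""
    covered = set().union(*clusters.values())
    return [t in covered for t in transaction_ids]

presence = {name: presence_row(clusters) for name, clusters in clusterings.items()}

# Text rendering: 'X' = presence, '.' = absence.
print("    " + " ".join(f"t{t}" for t in transaction_ids))
for name, row in presence.items():
    print(f"{name}  " + "  ".join("X" if p else "." for p in row))
```

Under this encoding the only absence is t5 on the line of C2, since no cluster of C2 contains t5.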
Allocated colors
Allocated colors in the example:
Color             1 (blue)     2 (green)         3 (yellow)                   4 (orange)   5 (red)
Clusters from C1  (I; t8, t9)  (DE; t5, t6, t7)  (A; t1, t2, t3, t4, t7, t8)  ∅            ∅
Clusters from C2  (I; t8, t9)  (DEH; t6, t7)     (ABC; t2, t3, t4)            (J; t1, t8)  (AFG; t7, t8)
Clusters from C3  (I; t8, t9)  (DE; t5, t6, t7)  (ABC; t2, t3, t4)            (J; t1, t8)  ∅
• 1 color = 1 cluster of each clustering
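To make the link between colors and memberships concrete, the following sketch looks up the colors shown for one transaction on the line of one clustering. It encodes the allocation of the example by hand; the function name `colors_of` and the dictionary layout are assumptions made for illustration.

```python
# Colors allocated in the example: for each color, one cluster (by itemset)
# per clustering, or None when the color has no cluster in that clustering.
colors = {
    "blue":   {"C1": "I",  "C2": "I",   "C3": "I"},
    "green":  {"C1": "DE", "C2": "DEH", "C3": "DE"},
    "yellow": {"C1": "A",  "C2": "ABC", "C3": "ABC"},
    "orange": {"C1": None, "C2": "J",   "C3": "J"},
    "red":    {"C1": None, "C2": "AFG", "C3": None},
}
# Transactions of each cluster (numbers stand for t1..t9).
members = {
    "A": {1, 2, 3, 4, 7, 8}, "ABC": {2, 3, 4}, "AFG": {7, 8},
    "DE": {5, 6, 7}, "DEH": {6, 7}, "I": {8, 9}, "J": {1, 8},
}

def colors_of(transaction, clustering):
    """Colors shown for one transaction on the line of one clustering."""
    return [
        color
        for color, by_clustering in colors.items()
        if by_clustering[clustering] is not None
        and transaction in members[by_clustering[clustering]]
    ]

print(colors_of(8, "C2"))  # t8 belongs to several overlapping clusters of C2
print(colors_of(5, "C2"))  # t5 is absent from C2, so no color is shown
```

Because the clustering is soft, one transaction can carry several colors on the same line, as t8 does in C2.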
Color attribution algorithm: idea
• Related to clustering of clusters, but . . .
• Enumerate all possibilities: not feasible
(k^n where n = number of clustering results,
k = maximal number of clusters in a clustering result)
• . . . 1 cluster of each clustering . . . related to
hypergraph transversals
→ candidate groups, evaluated and selected
according to a similarity measure
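The size of the search space can be checked on the running example. The sketch below counts the groups made of one cluster per clustering and compares the count with the k^n bound; the variable names are illustrative.

```python
from itertools import product
from math import prod

# Clusters of each clustering in the running example (identified by itemset).
clusterings = {
    "C1": ["A", "DE", "I"],
    "C2": ["AFG", "ABC", "DEH", "I", "J"],
    "C3": ["ABC", "DE", "I", "J"],
}

n = len(clusterings)                                # number of clustering results
k = max(len(c) for c in clusterings.values())       # largest number of clusters
upper_bound = k ** n                                # the k^n bound
exact = prod(len(c) for c in clusterings.values())  # one cluster per clustering

# Explicit enumeration, feasible only because the example is tiny.
groups = list(product(*clusterings.values()))

print(upper_bound, exact, len(groups))
```

On this example the bound is 5^3 = 125 and the exact count is 3 × 5 × 4 = 60; both grow exponentially with the number of clusterings.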
Hypergraph
[Figure: hypergraph with vertices (A; 1,2,3,4,7,8), (ABC; 2,3,4), (AFG; 7,8), (DE; 5,6,7), (DEH; 6,7), (I; 8,9), (J; 1,8) and hyperedges C1, C2, C3]
• Vertices = clusters
• Hyperedges = clusterings
Hypergraph minimal transversals
• Transversal = set of vertices meeting all hyperedges
(e.g. {A, AFG, ABC})
(The transversals are also called hitting sets)
• Minimal transversal = transversal with a minimal
Minimal transversals as candidate groups
• Set of all minimal transversals = set of combinations
of clusters across the clustering results, with identical clusters merged
• Minimal transversals → good candidate groups to
form colors
In the example: 7 minimal transversals (whereas 60 candidate groups in all)
• CMT prototype [Hébert 2005]: to compute the minimal transversals
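The count of minimal transversals can be reproduced with a brute-force enumeration. This is a sketch for small hypergraphs only and is not the CMT method; the function name `minimal_transversals` is an assumption, and clusters are identified by their itemset so that identical clusters coincide.

```python
from itertools import combinations

# Hyperedges = clusterings, vertices = clusters (identified by their itemset).
hypergraph = {
    "C1": {"A", "DE", "I"},
    "C2": {"AFG", "ABC", "DEH", "I", "J"},
    "C3": {"ABC", "DE", "I", "J"},
}

def minimal_transversals(hyperedges):
    """Brute force: try smallest sets first and skip supersets of earlier results."""
    edges = [set(e) for e in hyperedges if e]
    vertices = sorted(set().union(*edges)) if edges else []
    found = []
    for size in range(1, len(vertices) + 1):
        for candidate in map(set, combinations(vertices, size)):
            hits_all = all(candidate & edge for edge in edges)
            # A hitting set with no smaller transversal inside it is minimal.
            if hits_all and not any(mt <= candidate for mt in found):
                found.append(candidate)
    return found

transversals = minimal_transversals(hypergraph.values())
for mt in transversals:
    print(sorted(mt))
print(len(transversals))
```

On the running example this yields 7 sets, among them {I} and {A, ABC}; {A, AFG, ABC} is a transversal but not a minimal one, since it contains {A, ABC}.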
Minimal transversals evaluation
• Similarity between two clusters = Jaccard coefficient
(on the itemset)
Jaccard(I1, I2) = |I1 ∩ I2| / |I1 ∪ I2|
Jaccard(ABC, AFG) = 0.2
• Similarity of a group (set of clusters):
intra-color similarity (noted SimC)
SimC = average of the similarity values of all pairs of clusters of the group
SimC({A, AFG, ABC}) = 0.29
SimC({A, ABC}) = 0.55 (ABC is present in both C2 and C3)
• Goodness measure SD of the allocation of colors =
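The Jaccard coefficient and the intra-color similarity SimC can be sketched as follows. The function names are illustrative, and one convention is an assumption: a group holds one itemset per clustering, `None` marks a clustering without a cluster in the group, and any pair involving `None` counts as 0. That convention is chosen because it reproduces the 0.33 reported for <−, J, J> in the example slides.

```python
from itertools import combinations

def jaccard(itemset1, itemset2):
    """Jaccard coefficient between two itemsets (given as strings of item letters)."""
    a, b = set(itemset1), set(itemset2)
    return len(a & b) / len(a | b)

def sim_c(group):
    """Intra-color similarity: mean Jaccard value over all pairs of the group.

    `group` holds one itemset per clustering; None marks a clustering with no
    cluster in the group. Assumption: a pair involving None contributes 0.
    """
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard(x, y) for x, y in pairs if x is not None and y is not None)
    return total / len(pairs)

print(round(jaccard("ABC", "AFG"), 2))
print(round(sim_c(["A", "AFG", "ABC"]), 2))
print(sim_c(["A", "ABC", "ABC"]))
```

For <A, ABC, ABC> the sketch gives (1/3 + 1/3 + 1) / 3 = 5/9 ≈ 0.556, which the slide reports as 0.55.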
Color attribution algorithm
Input: C1, ..., Cn where Ci = {ci,1, ..., ci,m}.
Output: D = (color1, ..., colorp) where colori = {c1,j, ..., cn,j'}.
Start
0. Color = 0
1. Transform the Ci into a hypergraph H
Repeat steps 2-6 until there are no clusters (ci,j) without color:
2. Color += 1
3. Compute the minimal transversals of H
4. Select the best candidate Cand (the one with the highest intra-color similarity)
5. Attribute the current Color to each ci,j ∈ Cand
6. Add Cand to D and remove the colored clusters (ci,j) from H
End
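The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: minimal transversals are enumerated by brute force rather than with CMT, clusters are identified by their itemset, pairs involving a missing cluster (`None`) count as 0 in SimC, a transversal holding several clusters of the same clustering is resolved by keeping the most similar combination, and ties go to the first candidate found. All function names are hypothetical.

```python
from itertools import combinations, product

def jaccard(x, y):
    a, b = set(x), set(y)
    return len(a & b) / len(a | b)

def sim_c(group):
    """Mean Jaccard over all pairs; pairs involving None count as 0 (assumption)."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(x, y) for x, y in pairs if x and y) / len(pairs)

def minimal_transversals(edges):
    """Brute force over the non-empty hyperedges, smallest sets first."""
    edges = [e for e in edges if e]
    vertices = sorted(set().union(*edges))
    found = []
    for size in range(1, len(vertices) + 1):
        for cand in map(set, combinations(vertices, size)):
            if all(cand & e for e in edges) and not any(mt <= cand for mt in found):
                found.append(cand)
    return found

def best_group(transversal, hypergraph):
    """Best tuple (one cluster or None per clustering) drawn from a transversal."""
    choices = [sorted(transversal & edge) or [None] for edge in hypergraph]
    return max(product(*choices), key=sim_c)

def attribute_colors(clusterings):
    hypergraph = [set(c) for c in clusterings]     # step 1
    colors = []                                    # D; color number = position + 1
    while any(hypergraph):                         # some cluster has no color yet
        candidates = [best_group(mt, hypergraph)   # step 3
                      for mt in minimal_transversals(hypergraph)]
        cand = max(candidates, key=sim_c)          # step 4
        colors.append(cand)                        # steps 2, 5 and first half of 6
        for edge, cluster in zip(hypergraph, cand):
            edge.discard(cluster)                  # step 6: remove colored clusters
    return colors

clusterings = [
    ["A", "DE", "I"],                  # C1
    ["AFG", "ABC", "DEH", "I", "J"],   # C2
    ["ABC", "DE", "I", "J"],           # C3
]
for color, group in enumerate(attribute_colors(clusterings), start=1):
    print(color, group)
```

On the running example the sketch allocates five colors, in the order <I, I, I>, <DE, DEH, DE>, <A, ABC, ABC>, <−, J, J>, <−, AFG, −>, with `None` standing for "−".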
Example
Color 1 (blue)
minimal transversal: {I}
→ <I, I, I> SimC = 1.00
[Figure: hypergraph H over C1, C2, C3 with clusters A, ABC, AFG, DE, DEH, I, J]
Color 2 (green)
minimal transversal: {DE, DEH}
→ <DE, DEH, DE> SimC = 0.77
[Figure: H after removing I; remaining clusters A, ABC, AFG, DE, DEH, J]
Color 3 (yellow)
minimal transversal: {A, ABC}
→ <A, ABC, ABC> SimC = 0.55
[Figure: H after removing DE and DEH; remaining clusters A, ABC, AFG, J]
Color 4 (orange)
minimal transversal: {J}
→ <−, J, J> SimC = 0.33
[Figure: H reduced to C2 and C3; remaining clusters AFG, J]
Color 5 (red)
The remaining cluster
→ <−, AFG, −>
[Figure: H reduced to C2; remaining cluster AFG]
Experiments
Data sets:
• Benchmarks: Mushroom, Votes, Hepatitis,
Ionosphere, Titanic
• Geographical real-world database: Cross channel
atlas (called “Geo”)
For each data set:
• ECCLAT to obtain several clusterings
• VISUMCLUST
• Evaluation of the color attribution using the
Results
                 Mushroom  Votes    Hepatitis  Ionosphere  Titanic  Geo
Baseline method  0.554     0.617    0.436      0.342       0.675    0.472
VISUMCLUST       0.658     0.839    0.521      0.389       0.878    0.684
Gain             +18.7%    +35.9%   +19.5%     +13.9%      +30.1%   +44.9%
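The Gain row can be approximately recomputed from the two score rows, assuming it is the relative improvement of VISUMCLUST over the baseline (the slides do not define it). Because the table scores are rounded to three decimals, the last digit of a recomputed gain may differ slightly from the reported one.

```python
# Scores as reported in the results table.
baseline = {"Mushroom": 0.554, "Votes": 0.617, "Hepatitis": 0.436,
            "Ionosphere": 0.342, "Titanic": 0.675, "Geo": 0.472}
visumclust = {"Mushroom": 0.658, "Votes": 0.839, "Hepatitis": 0.521,
              "Ionosphere": 0.389, "Titanic": 0.878, "Geo": 0.684}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent (assumed definition)."""
    return 100.0 * (new - old) / old

gains = {name: relative_gain(visumclust[name], baseline[name]) for name in baseline}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```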
Cross channel atlas (Geo)
• 41 English counties (south) and 28 French regions
(north, west)
• Described by 99 items (demographical and
economic indicators)
Result of VISUMCLUST on Geo
Display result and description of each cluster
[Figure: VISUMCLUST display of the Geo data set; the columns are the 69 geographical units listed below]
1 Finistère ; 2 Côtes d'Armor ; 3 Ille et Vilaine ; 4 Morbihan ; 5 Eure ; 6 Seine-Maritime ; 7 Loire-Atlantique ; 8 Maine-et-Loire ; 9 Mayenne ; 10 Sarthe ; 11 Vendée ; 12 Calvados ; 13 Manche ; 14 Orne ; 15 Eure-et-Loir ; 16 Aisne ; 17 Oise ; 18 Somme ; 19 Nord ; 20 Pas-de-Calais ; 21 Essonne ; 22 Hauts-de-Seine ; 23 Ville de Paris ; 24 Seine Saint-Denis ; 25 Seine-et-Marne ; 26 Val d'Oise ; 27 Val-de-Marne ; 28 Yvelines ; 29 Swindon ; 30 London ; 31 Isle of Wight ; 32 Poole ; 33 Bournemouth ; 34 Torbay ; 35 Plymouth ; 36 South Gloucestershire ; 37 North Somerset ; 38 City of Bristol ; 39 Bath and North Somerset ; 40 Kent ; 41 East Sussex ; 42 West Sussex ; 43 Buckinghamshire ; 44 Surrey ; 45 Windsor ; 46 Reading ; 47 Hampshire ; 48 West ; 49 Oxfordshire ; 50 Medway ; 51 Southampton ; 52 Milton Keynes ; 53 Bracknell Forest ; 54 Portsmouth ; 55 Brighton and Hove ; 56 Wokingham ; 57 Slough ; 58 Wiltshire ; 59 Gloucestershire ; 60 Dorset ; 61 Devon ; 62 Somerset ; 63 Cornwall and Isle of Scilly ; 64 Hertfordshire ; 65 Thurrock ; 66 Southend-on-Sea ; 67 Bedfordshire ; 68 Essex ; 69 Luton
1) Blue and green = English and French units far from London and Paris (respectively)
→ aging population, . . .
2) Yellow = capitals and their surroundings
→ dynamic units, large
agglomerations with good demographic indicators (birth rate, . . . )
3) . . .
→ Observations validated by
Conclusion
• New approach for visualizing transactional data
→ VISUMCLUST
• Display multiple clustering results at the same
time
• Consistent view obtained with the color allocation
algorithm based on the minimal transversals of a hypergraph
• Importance of multiple clusterings/views of the data
Very useful for knowledge discovery
• Evaluation on benchmarks, experiment on a
Future Work
• More experiments (clustering algorithms, data sets)
• Interactive interface (with clickable objects to access
details/descriptions, . . . )
• Applications:
◦ comparison of clustering results,
◦ help in choosing a clustering algorithm and its
parameters
References
• N. Durand and B. Crémilleux. ECCLAT: a New Approach of
Clusters Discovery in Categorical Data. In ES 2002, Cambridge, UK, December 2002.
• C. Hébert. Enumerating the Minimal Transversals of a Hypergraph
Using Galois Connections. Technical report, University of Caen Basse-Normandie, France, 2005.
• http://www.ics.uci.edu/~mlearn/. UCI Machine Learning
Repository. (Mushroom, Votes, Hepatitis, Ionosphere)
• http://www.amstat.org/publications/jse/. (Titanic)
• http://atlas-transmanche.certic.unicaen.fr/index.gb.html.