Cluster analysis and Association analysis for the same data


Huaiguo Fu

Telecommunications Software & Systems Group, Waterford Institute of Technology

Abstract: Cluster analysis and association analysis are both important data mining tasks, and some applications need both for the same data. Each task is very time-consuming on large data. To reduce the cost of running the two tasks on a large set of transactions, we propose a strategy that unifies cluster analysis and association analysis. This paper also presents a new core algorithm of the strategy for the analysis of large and high-dimensional data. The experimental results show the efficiency of this algorithm.

Key-Words: Association analysis, Clustering, Closed set, Concept lattice, Algorithm

1 Introduction

Both cluster analysis and association analysis are important tasks of data mining. In recent years, they have attracted a lot of attention in both research and applications. They play an important role in data mining applications such as text mining, Web mining, information retrieval, biomedical informatics, and many others. A variety of techniques and approaches for cluster analysis and association analysis have been developed and successfully applied to real-life data mining problems. However, as data continue to grow in size and complexity, these techniques and approaches face challenges such as very large data, high-dimensional data, distributed heterogeneous data, and complex data. In some applications, we need both cluster analysis and association analysis for the same data, and each task is very time-consuming on large data. Although cluster analysis and association analysis are separate tasks in research and applications, we propose to unify them for mining a database of transactions in order to reduce the expensive cost of the data mining tasks. This is the key motivation of our work. Furthermore, we can unify cluster analysis and association analysis for a database of transactions for the following reasons:

1) Both of them analyze the relationship between the elements of a data set. In fact, the two tasks extract the same essential relationship: similarity. Only the description and bounds of the relationship differ. A frequent pattern reveals one kind of similarity between elements of data. Cluster analysis may reveal associations and relationships in data that help in mining models or rules from the data. So the elements in a frequent pattern are similar, and similar elements may have the same frequency.

2) Mining closed sets can be an essential step for cluster analysis and association analysis on transactional data. Existing work shows that we can extract clusters and frequent patterns from closed sets [2, 15]. Cluster analysis and association analysis may share the closed sets when mining the same data set, so we need not extract closed sets separately for each of them.

3) Closed sets mining provides a solution to interpret the clusters and frequent patterns.

For most techniques and approaches of cluster analysis and association analysis, it is hard to interpret the mining results. For example, it is hard to interpret the clusters and frequent patterns produced with existing mining techniques. It is also hard to give the meaning of the distance measure in most clustering methods.

Closed sets are derived from formal concept analysis (FCA). The formal concept can help us to interpret the closed sets, so closed sets mining facilitates pattern interpretation. In human thinking and life, objects are clustered by concepts and attributes, and we can interpret attribute patterns and object patterns with concepts. So concept-based methods can be used for the interpretation of the clusters and frequent patterns.

In this paper, the idea of unifying cluster analysis and association analysis focuses on the database of transactions. The main framework of the idea is:

• Generating the data context, with the description of items or transactions, from the database of transactions

• Mining the closed sets and the lattice of closed sets of the database of transactions with FCA

• Adding extended information such as support, similarity, and interpretation to each closed set. We propose a new structure for each node of the lattice: the node contains the attribute set, the object set, the number of objects, the support and a similarity description.

• Generating the clusters and closed frequent patterns with their interpretation

The core of FCA is the concept lattice. The theoretical foundation of the concept lattice rests on mathematical lattice theory [1, 8]. A lattice is a popular mathematical structure for modeling conceptual hierarchies. The concept lattice is a method for deriving conceptual structures out of data. It allows us to analyze and mine complex data for tasks such as classification [11, 13], association rules mining [6, 7], and clustering [10, 9, 4].

Because of the high dimension and large volume of data, we need to develop more scalable and more efficient techniques and methods to analyze and represent large and high-dimensional data sets. In this paper we present a new algorithm to analyze large and high-dimensional data.

The rest of this paper is organized as follows. Basic definitions for unifying cluster analysis and association analysis are presented in the next section. The framework of unifying cluster analysis and association analysis is introduced in section 3. In section 4, we present a new algorithm. Section 5 shows the experimental results. The paper ends with a short conclusion in section 6.

2 Definitions

Definition 1 A data context is defined by a triple $(O, A, R)$, where $O$ and $A$ are two sets, and $R$ is a relation between $O$ and $A$. The elements of $O$ are called transactions or objects, while the elements of $A$ are called items or attributes.

For example, Figure 1 represents a data context $(O, A, R)$. $O = \{1, 2, 3, 4, 5, 6, 7, 8\}$ is the set of objects, and $A = \{a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8\}$ is the set of items. The crosses in the table describe the relation $R$ between $O$ and $A$. In a real data context, each item and object carries a detailed descriptive name. In this example, we simply use numeric labels for the items and objects.

     a1  a2  a3  a4  a5  a6  a7  a8
1    ×   ×                   ×
2    ×   ×                   ×   ×
3    ×   ×   ×               ×   ×
4    ×       ×               ×   ×
5    ×   ×       ×       ×
6    ×   ×   ×   ×       ×
7    ×       ×   ×   ×
8    ×       ×   ×       ×

Figure 1: An example of data context

A data context is usually represented by binary data. In practice, attribute values are often not binary, but we can transform a many-valued data context into a binary data context by concept scaling [8].

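Definition 2 Two closure operators are defined as $O_1 \to O_1''$ for subsets of $O$ and $A_1 \to A_1''$ for subsets of $A$, where

$O_1' := \{a \in A \mid oRa \text{ for all } o \in O_1\}$

$A_1' := \{o \in O \mid oRa \text{ for all } a \in A_1\}$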

These two operators are called the Galois connection for $(O, A, R)$. These operators are used to determine a formal concept.

Definition 3 A formal concept of $(O, A, R)$ is a pair $(O_1, A_1)$ with $O_1 \subseteq O$, $A_1 \subseteq A$, $O_1 = A_1'$ and $A_1 = O_1'$. $O_1$ is called the extent and $A_1$ is called the intent.

For example, $(68, a_1a_3a_4a_6)$ is a formal concept of the data context of Figure 1. $a_1a_3a_4a_6$ is the intent of $(68, a_1a_3a_4a_6)$, and $68$ is its extent.
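The two derivation operators of Definition 2 and the concept test of Definition 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the cross table is read off the extents shown in Figure 2 (an assumption, since the table itself is not part of the code), and the names `items_of`, `objects_of` and `is_formal_concept` are hypothetical helpers, not names used by the paper.

```python
# Object -> items, written as digit strings ("127" means a1, a2, a7).
# Assumption: rows are read off the concept extents shown in Figure 2.
ROWS = {1: "127", 2: "1278", 3: "12378", 4: "1378",
        5: "1246", 6: "12346", 7: "1345", 8: "1346"}
CONTEXT = {o: frozenset(f"a{d}" for d in row) for o, row in ROWS.items()}
ITEMS = frozenset().union(*CONTEXT.values())


def items_of(objects):
    """O1': items shared by every object in O1 (all items if O1 is empty)."""
    common = set(ITEMS)
    for o in objects:
        common &= CONTEXT[o]
    return frozenset(common)


def objects_of(items):
    """A1': objects that contain every item in A1."""
    return frozenset(o for o, row in CONTEXT.items() if set(items) <= row)


def is_formal_concept(objects, items):
    """(O1, A1) is a formal concept iff O1 = A1' and A1 = O1'."""
    return (items_of(objects) == frozenset(items)
            and objects_of(items) == frozenset(objects))


# The pair (68, a1a3a4a6) from the text.
print(is_formal_concept({6, 8}, {"a1", "a3", "a4", "a6"}))
# Closure A1'' of the single item a6.
print(sorted(items_of(objects_of({"a6"}))))
```

Applying `items_of` after `objects_of` gives the closure $A_1''$ of an itemset, which is the operation the later definitions rely on.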

Definition 4 We say that there is a hierarchical order between two formal concepts $(O_1, A_1)$ and $(O_2, A_2)$ if $O_1 \subseteq O_2$ (or, equivalently, $A_2 \subseteq A_1$).

All formal concepts, together with the hierarchical order of concepts, form a complete lattice called the concept lattice.

Definition 5 An itemset $C \subseteq A$ is a closed itemset iff $C'' = C$.

Lattice nodes, written (intent, extent)e(number of objects in the extent):

(a1, 12345678)e(8)
(a1a2, 12356)e(5)   (a1a3, 34678)e(5)
(a1a4, 5678)e(4)   (a1a7, 1234)e(4)
(a1a2a7, 123)e(3)   (a1a7a8, 234)e(3)   (a1a3a4, 678)e(3)   (a1a4a6, 568)e(3)
(a1a2a3, 36)e(2)   (a1a2a7a8, 23)e(2)   (a1a3a7a8, 34)e(2)   (a1a2a4a6, 56)e(2)   (a1a3a4a6, 68)e(2)
(a1a2a3a7a8, 3)e(1)   (a1a2a3a4a6, 6)e(1)   (a1a3a4a5, 7)e(1)
(a1a2a3a4a5a6a7a8, ∅)e(0)

Figure 2: An example of knowledge lattice

Definition 6 If $C_1$ and $C_2$ are closed itemsets and $C_1 \subseteq C_2$, then we say that there is a hierarchical order between $C_1$ and $C_2$.

All closed itemsets, together with the hierarchical order of closed itemsets, form a complete lattice called the closed itemset lattice.

Definition 7 A formal concept is called an extended concept if it is augmented with described information about the formal concept in the data context. We note $(O_1, A_1)_{e(\text{described information})}$ or $(A_1, O_1)_{e(\text{described information})}$ the extended concept of $(O_1, A_1)$.

A concept lattice is called a knowledge lattice if all formal concepts of the concept lattice are replaced by their extended concepts.

Figure 2 presents an example of a knowledge lattice. Each node contains the intent, the extent and the number of objects in the extent.
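The nodes of such a knowledge lattice can be reproduced with a short Python sketch. It assumes the cross table implied by the extents in Figure 2, and it uses a simple (not scalable) way of obtaining all intents: the full item set plus every intersection of object rows. The names `closed_itemsets` and `knowledge_lattice_nodes` are hypothetical, and this is not the algorithm proposed in section 4.

```python
# Object -> items, written as digit strings ("127" means a1, a2, a7).
# Assumption: rows are read off the concept extents shown in Figure 2.
ROWS = {1: "127", 2: "1278", 3: "12378", 4: "1378",
        5: "1246", 6: "12346", 7: "1345", 8: "1346"}
CONTEXT = {o: frozenset(f"a{d}" for d in row) for o, row in ROWS.items()}


def closed_itemsets(context):
    """All intents: the full item set plus every intersection of rows."""
    closed = {frozenset().union(*context.values())}
    for row in context.values():
        # Intersect the new row with everything found so far.
        closed |= {row & c for c in closed}
    return closed


def knowledge_lattice_nodes(context):
    """Extended concepts: intent, extent and number of objects."""
    nodes = []
    for intent in closed_itemsets(context):
        extent = frozenset(o for o, row in context.items() if intent <= row)
        nodes.append({"intent": intent, "extent": extent,
                      "count": len(extent)})
    # Largest extents first, then alphabetical by intent.
    return sorted(nodes, key=lambda n: (-n["count"], sorted(n["intent"])))


for node in knowledge_lattice_nodes(CONTEXT):
    intent = "".join(sorted(node["intent"]))
    extent = "".join(str(o) for o in sorted(node["extent"])) or "∅"
    print(f"({intent},{extent})e({node['count']})")
```

Only the nodes are produced here; the edges of the lattice follow from the inclusion order of Definition 6.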

3 Framework of unifying cluster analysis and association analysis

In this section, we propose a framework of unifying cluster analysis and association analysis (see Figure 3).

From the database of transactions, we can generate a data context that should be described by the items and transactions. Then an efficient algorithm should be applied to generate the formal concepts.

When the formal concepts are produced, some extended information should be extracted with them, according to the needs of the mining task, to form extended concepts. Extended concepts can contain the intent, extent, support and a similarity description. The knowledge lattice can be generated from the extended concepts. Finally, closed frequent patterns and clusters can be produced from the same knowledge lattice or extended concepts.

Database → Data context → Formal concepts → Extended concepts → Knowledge lattice (concepts, support, description, ...) → Closed frequent patterns and clusters

Figure 3: Framework of unifying cluster analysis and association analysis
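As a rough illustration of this pipeline, the sketch below goes from a small transaction database to closed frequent patterns and clusters through one shared set of extended concepts. Several choices are assumptions made only for the example: the transactions are the eight objects implied by Figure 2, `minsupport` is an absolute object count, and every non-empty extent is treated as one overlapping cluster described by its intent. The function names are hypothetical.

```python
# Transaction id -> items, written as digit strings ("127" means a1, a2, a7).
# Assumption: the database is the example context implied by Figure 2.
ROWS = {1: "127", 2: "1278", 3: "12378", 4: "1378",
        5: "1246", 6: "12346", 7: "1345", 8: "1346"}
DATABASE = {tid: [f"a{d}" for d in row] for tid, row in ROWS.items()}


def closed_itemsets(rows):
    """All intents: the full item set plus every intersection of rows."""
    closed = {frozenset().union(*rows)}
    for row in rows:
        closed |= {row & c for c in closed}
    return closed


def mine(database, minsupport):
    """Return (closed frequent patterns, clusters) from one concept set."""
    # Step 1: data context (each transaction becomes a set of items).
    context = {tid: frozenset(items) for tid, items in database.items()}
    # Steps 2-3: formal concepts, extended with their extent.
    nodes = []
    for intent in closed_itemsets(list(context.values())):
        extent = sorted(t for t, row in context.items() if intent <= row)
        nodes.append((tuple(sorted(intent)), extent))
    # Step 4a: closed frequent patterns with their support count.
    patterns = {i: len(e) for i, e in nodes if len(e) >= minsupport}
    # Step 4b: overlapping clusters, each interpreted by its intent.
    clusters = {i: e for i, e in nodes if e}
    return patterns, clusters


patterns, clusters = mine(DATABASE, minsupport=4)
for intent, support in sorted(patterns.items(), key=lambda p: -p[1]):
    print("pattern", "".join(intent), "support", support)
print("cluster described by a1a7a8:", clusters[("a1", "a7", "a8")])
```

The point of the design is that the concepts are computed once and both results are read from them, which is the saving the framework aims for.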

The data context is the base of the mining task. It needs an understandable description for each item and transaction. Sometimes we need to reduce, transpose or order the data context. For example, when the data have a high dimension, especially when the size of the object set is smaller than the size of the item set, we can transpose the data context to generate formal concepts for mining high-dimensional data. Analyzing most lattice algorithms, we find that an algorithm can focus on either the items or the transactions of the data context. The performance of an algorithm can differ according to the number of items and transactions.

In this framework, the generation of formal concepts and of the knowledge lattice is the essential step. The key to the applications is the performance of the algorithm that generates the formal concepts or closed itemsets. So we focus on the lattice algorithm and, in the next section, propose a new algorithm based on the lattice structure to generate frequent patterns.

4 New algorithm

In this section, we analyze the search space of the closed itemsets of a data context, and then present a new algorithm to analyze and represent large data.

4.1 Analysis of the search space

Using one example, a data context with 4 attributes ($a_m$, $a_{m-1}$, $a_{m-2}$, $a_{m-3}$), we analyze the search space of closed itemsets (see Figure 4).

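Nodes of the search space, by itemset size:

$a_{m-3}a_{m-2}a_{m-1}a_m$
$a_{m-2}a_{m-1}a_m$   $a_{m-3}a_{m-1}a_m$   $a_{m-3}a_{m-2}a_m$   $a_{m-3}a_{m-2}a_{m-1}$
$a_{m-1}a_m$   $a_{m-2}a_m$   $a_{m-2}a_{m-1}$   $a_{m-3}a_m$   $a_{m-3}a_{m-1}$   $a_{m-3}a_{m-2}$
$a_m$   $a_{m-1}$   $a_{m-2}$   $a_{m-3}$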

Figure 4: An example of the search space of closed itemsets

Figure 4 illustrates that each node may be a closed itemset for some data context with 4 attributes. The search space of closed itemsets is very large if there are many attributes, and it is hard for the concept lattice structure to cope with the complexity of very large data. So we propose a new method that decomposes the search space and then deals with each partition separately. In order to discuss the decomposition of the search space, we give the following definition.

Definition 8 Given an attribute $a_i \in A$ of the context $(O, A, R)$ and a set $E$ with $a_i \notin E$, we define $a_i \otimes E = \{\{a_i\} \cup X \text{ for all } X \subseteq E\}$, and

$A_k = \{\{a_k\}\} \cup \bigcup_{X_i \in \bigcup_{k+1 \le j \le m} A_j} \{\{a_k\} \cup X_i\} = a_k \otimes \{a_{k+1}, a_{k+2}, \ldots, a_m\}$
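A small Python sketch may help to see what the sets $A_k$ look like. It builds $a_k \otimes \{a_{k+1}, \ldots, a_m\}$ for four attributes, here simply named a1 to a4 for readability, and lets one check that the $A_k$ are pairwise disjoint and together cover every non-empty itemset. The function names `otimes` and `partitions` are hypothetical.

```python
from itertools import combinations


def otimes(a, rest):
    """a ⊗ E: every itemset {a} ∪ X with X ⊆ E."""
    rest = list(rest)
    return {frozenset((a, *xs))
            for r in range(len(rest) + 1)
            for xs in combinations(rest, r)}


def partitions(items):
    """A_k = a_k ⊗ {a_(k+1), ..., a_m} for each item, in the given order."""
    return {a: otimes(a, items[k + 1:]) for k, a in enumerate(items)}


items = ["a1", "a2", "a3", "a4"]
parts = partitions(items)
for a in items:
    print(a, len(parts[a]), sorted("".join(sorted(s)) for s in parts[a]))
```

Each $A_k$ holds exactly the itemsets whose first item in the chosen order is $a_k$, so its size halves from one $k$ to the next. This is why the partitions are very unbalanced before any closed itemsets are counted.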

We can decompose the search space into many partitions such as $A_m$, $A_{m-1}$, $A_{m-2}$, $A_{m-3}$, or combinations of some of them. In each partition we can look for the closed itemsets independently. But two problems arise:

• how to balance the number of closed itemsets across the partitions

• whether each partition contains closed itemsets

For example, for the data context of Figure 1, we can decompose the search space into the following 4 partitions:

partition1: A8 A7 A6 A5
partition2: A4 A3
partition3: A2
partition4: A1

Figure 5: Decomposition of the search space of the data context Figure 1

The result is that all 17 closed itemsets fall in partition4 ($A_1$), because $a_1$ belongs to every object (see Figure 2), while partition1, partition2 and partition3 contain none. So this strategy of decomposing the search space has problems and needs to be improved. One solution is to order the data context.

Definition 9 A data context is called an ordered data context if we order the items of the data context by the number of objects of each item, from the smallest to the biggest, and merge items with the same objects into one item. We note $(O, A^C, R)$ the ordered data context of the data context $(O, A, R)$.

Figure 6 shows the ordered data context of the data context of Figure 1. From the ordered data context, using the same method as above to decompose the search space into 4 partitions, we get closed itemsets in each partition. We can prove that there exist closed itemsets in each $A_i$ of an ordered data context. For example, there are respectively 6, 6, 4 and 1 closed itemsets in the 4 partitions of the ordered data context (see Figure 6).
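The counts 6, 6, 4 and 1 can be checked with a short Python sketch. It assumes the cross table implied by the extents in Figure 2, orders the items by ascending number of objects (ties are broken by descending item index, an assumption chosen to reproduce the column order of Figure 6), and assigns each closed itemset with a non-empty extent to the $A_k$ of its first item in that order. No two items of this example share the same objects, so the merging step of Definition 9 is not needed here. All function names are hypothetical.

```python
# Object -> items, written as digit strings ("127" means a1, a2, a7).
# Assumption: rows are read off the concept extents shown in Figure 2.
ROWS = {1: "127", 2: "1278", 3: "12378", 4: "1378",
        5: "1246", 6: "12346", 7: "1345", 8: "1346"}
CONTEXT = {o: frozenset(f"a{d}" for d in row) for o, row in ROWS.items()}

# Partitions as groups of indices k of A_k, as in Figure 5.
GROUPS = [{8, 7, 6, 5}, {4, 3}, {2}, {1}]


def closed_itemsets(context):
    """All intents: the full item set plus every intersection of rows."""
    closed = {frozenset().union(*context.values())}
    for row in context.values():
        closed |= {row & c for c in closed}
    return closed


def ordered_items(context):
    """Items by ascending number of objects (ordered data context)."""
    support = {}
    for row in context.values():
        for a in row:
            support[a] = support.get(a, 0) + 1
    # Tie-break by descending item index: an assumption matching Figure 6.
    return sorted(support, key=lambda a: (support[a], -int(a[1:])))


def counts_per_partition(context, order, groups):
    """Number of closed itemsets (non-empty extent) in each partition."""
    position = {a: k for k, a in enumerate(order, start=1)}
    counts = [0] * len(groups)
    for c in closed_itemsets(context):
        if not any(c <= row for row in context.values()):
            continue  # skip the itemset whose extent is empty
        k = min(position[a] for a in c)  # c lies in A_k
        for g, indices in enumerate(groups):
            if k in indices:
                counts[g] += 1
    return counts


order = ordered_items(CONTEXT)
print(order)
print(counts_per_partition(CONTEXT, order, GROUPS))
```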

Definition 10 Given an item $a_i$ of a data context $(O, A, R)$, all subsets of $\{a_i, a_{i+1}, \ldots, a_{m-1}, a_m\}$ that include $a_i$ form a search sub-space (for closed itemsets) that is called the folding search sub-space (F3S) of $a_i$, denoted $F3S_i$.

     a5  a8  a6  a7  a4  a3  a2  a1
1                ×           ×   ×
2        ×       ×           ×   ×
3        ×       ×       ×   ×   ×
4        ×       ×       ×       ×
5            ×       ×       ×   ×
6            ×       ×   ×   ×   ×
7    ×               ×   ×       ×
8            ×       ×   ×       ×

Figure 6: An example of ordered data context

Summing up the analysis of the search space of closed itemsets: we can convert the data context into an ordered data context, whose search space of closed itemsets is $F3S_m \cup F3S_{m-1} \cup F3S_{m-2} \cup F3S_{m-3} \cup \cdots \cup F3S_i \cup \cdots \cup F3S_1 \cup \emptyset$, and then decompose the search space into some partitions. We can generate closed itemsets in each partition.

4.2 The new algorithm

Definition 11 Given an itemset $A_1 \subset A$, $A_1 = \{b_1, b_2, \ldots, b_i, \ldots, b_k\}$, $b_i \in A$, where $A_1$ is an infrequent itemset. The candidate of next closed itemset of $A_1$, noted $C^{\bowtie}_{A_1}$, is $A_1 \uplus a_i = ((A_1 \cap (a_1, a_2, \ldots, a_{i-1})) \cup \{a_i\})''$, where $a_i < b_k$ and $a_i \notin A_1$, and $a_i$ is the biggest item of $A$ with $A_1 < A_1 \uplus a_i$ following the order $a_1 < \ldots < a_i < \ldots < a_m$.

We propose a new algorithm that can be used to generate closed itemsets or frequent closed itemsets. The principle of the algorithm is given by the following steps:

• Decompose the search space into some partitions

– Convert $(O, A, R)$ to $(O, A^C, R)$, where $A^C = \{a^C_1, a^C_2, \ldots, a^C_i, \ldots, a^C_m\}$

– In order to balance the number of closed itemsets of the partitions, some items of $A^C$ are chosen to form an ordered set $P$

1) $P = \{a^C_{P_T}, a^C_{P_{T-1}}, \ldots, a^C_{P_k}, \ldots, a^C_{P_1}\}$, $|P| = T$, $a^C_{P_k} \in A^C$

2) $a^C_{P_T} < a^C_{P_{T-1}} < \ldots < a^C_{P_k} < \ldots < a^C_{P_2} < a^C_{P_1} = a^C_m$

3) A parameter $DP$ ($0 < DP < 1$) is used to choose $a^C_{P_k}$, where $DP = \frac{|\{a^C_1, \ldots, a^C_{P_k}\}|}{|\{a^C_1, \ldots, a^C_{P_{k-1}}\}|}$

– Get the partitions $[a^C_{P_k}, a^C_{P_{k+1}})$ and $[a^C_{P_T})$

1) The interval $[a^C_{P_k}, a^C_{P_{k+1}})$ is the search space from item $a^C_{P_k}$ to $a^C_{P_{k+1}}$

2) $[a^C_{P_k}, a^C_{P_{k+1}}) = \bigcup_{P_k \le i < P_{k+1} \le P_T} F3S_i$ and $[a^C_{P_T}) = F3S_{P_T}$

• Generate the next frequent closed itemset from an itemset $A_1$ for each partition

– If $|A_1'| \ge minsupport$, we search the next closure of $A_1$

– If $|A_1'| < minsupport$, we search $C^{\bowtie}_{A_1}$. The closed itemsets between $A_1$ and $C^{\bowtie}_{A_1}$ are ignored
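The role of the parameter DP in the first step can be illustrated with a small Python sketch. It works on item positions 1 to m of the ordered data context and makes two assumptions that the text does not spell out: each split position is obtained by rounding DP times the previous one, and a partition is the run of consecutive folding search sub-spaces between two split positions. The names `split_points` and `partitions_from` are hypothetical.

```python
def split_points(m, dp):
    """Positions P_1 > P_2 > ... > P_T with P_1 = m and P_k ≈ dp * P_(k-1)."""
    if not 0 < dp < 1:
        raise ValueError("DP must lie strictly between 0 and 1")
    points = [m]
    while points[-1] > 1:
        # Assumption: round to the nearest position, but always move down
        # by at least one position and never below position 1.
        nxt = max(1, min(points[-1] - 1, round(dp * points[-1])))
        points.append(nxt)
    return points


def partitions_from(points):
    """Group the indices i of F3S_i into runs between split positions."""
    bounds = points + [0]
    return [list(range(hi, lo, -1)) for hi, lo in zip(bounds, bounds[1:])]


points = split_points(8, 0.5)
print(points)
print(partitions_from(points))
```

With m = 8 and DP = 0.5 the sketch gives the same grouping as Figure 5 (indices 8 to 5, then 4 and 3, then 2, then 1). A smaller DP yields fewer and larger partitions.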

Conceptual clustering [5, 12] can seek clusters by concept structures. One approach of conceptual clustering is based on the concept lattice [3]. When minsupport = 1, this algorithm can be used to generate all closed itemsets and then conceptual overlapping clusters based on the algorithm of [3].
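For reference, the classical way to search "the next closure" of an itemset is the NextClosure procedure from formal concept analysis [8], which enumerates closed itemsets in lectic order. The Python sketch below shows that classical procedure only, not the partitioned algorithm proposed here: it does not implement the jump to the candidate $C^{\bowtie}_{A_1}$ and simply filters by support afterwards. The cross table is the one implied by the extents in Figure 2 (an assumption), and all function names are hypothetical.

```python
# Object -> items, written as digit strings ("127" means a1, a2, a7).
# Assumption: rows are read off the concept extents shown in Figure 2.
ROWS = {1: "127", 2: "1278", 3: "12378", 4: "1378",
        5: "1246", 6: "12346", 7: "1345", 8: "1346"}
CONTEXT = {o: frozenset(f"a{d}" for d in row) for o, row in ROWS.items()}
ITEMS = [f"a{i}" for i in range(1, 9)]  # linear order a1 < a2 < ... < a8


def support(itemset):
    """Number of objects that contain every item of the itemset."""
    return sum(1 for row in CONTEXT.values() if itemset <= row)


def closure(itemset):
    """A1'': items shared by all objects that contain A1."""
    shared = set(ITEMS)
    for row in CONTEXT.values():
        if itemset <= row:
            shared &= row
    return frozenset(shared)


def next_closure(current):
    """Next closed itemset after `current` in lectic order, or None."""
    for i in range(len(ITEMS) - 1, -1, -1):
        if ITEMS[i] in current:
            continue
        prefix = frozenset(ITEMS[:i])
        candidate = closure((current & prefix) | {ITEMS[i]})
        # Accept only if no item smaller than a_i was added.
        if candidate & prefix == current & prefix:
            return candidate
    return None


def all_closed_itemsets():
    """Every closed itemset, starting from the closure of the empty set."""
    current = closure(frozenset())
    found = [current]
    while (current := next_closure(current)) is not None:
        found.append(current)
    return found


closed = all_closed_itemsets()
frequent = [c for c in closed if support(c) >= 3]
print(len(closed), len(frequent))
```

The expression inside `next_closure` has the same shape as the formula of Definition 11, which suggests that the candidate step is a variation on this lectic search.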

5 Experimental results

We test our algorithm for generating frequent closed itemsets and clusters on several data sets from the UCI repository [14] (see Table 1).

DataSet                       Objects  Items  Closed itemsets
1) breast-cancer-wisconsin        699    110             9860
2) house-votes-84                 435     18            10642
3) audiology                       26    110            30401
4) lung-cancer                     32    228           186092
5) agaricus-lepiota              8124    124           227594
6) promoters                      106    228           304385
7) soybean-large                  307    133           806030
8) dermatology                    366    130          1484088

Table 1: The datasets for experiments

The algorithm is implemented in Java and tested on all of the above contexts in two cases, in order to compare and analyze its performance:

• Case1: generating frequent itemsets and clusters separately from the context;

• Case2: generating frequent itemsets and clusters from closed itemsets based on the new strategy.

The experimental results (see Figure 7) show that the total time cost of Case1 is much higher than that of Case2. So the integration of cluster analysis and association analysis based on closed itemsets mining can reduce the expensive cost of the two mining tasks for a large data set of transactions.

Figure 7: The time cost (milliseconds) for two cases on test datasets

6 Conclusion and further work

In this paper, we propose a strategy to unify cluster analysis and association analysis for transactional databases in order to reduce the expensive cost of the data mining tasks. From the data context, a knowledge lattice can be generated with extended concepts. Extended concepts can contain the intent, extent, support and a similarity description. So closed frequent patterns and clusters can be produced from the same knowledge lattice or extended concepts. Furthermore, we present a new algorithm for the analysis of large and high-dimensional data.

For future work, we will develop the algorithm to analyze huge and distributed data, and improve the algorithm for mining non-transactional databases.

Acknowledgements: This work is supported by Science Foundation Ireland via the Autonomic Management of Communications Networks and Services programme (grant no. 04/IN3/I4040C) and the project of EU IST Network of Excellence "OPAALS".

References:

[1] G. Birkhoff. Lattice Theory. American Mathematical Society, Providence, RI, 3rd edition, 1967.

[2] C. Carpineto and G. Romano. Galois: An order-theoretic approach to conceptual clustering. In Proc. of the Machine Learning conf., pages 33–40, 1993.

[3] C. Carpineto and G. Romano. Galois: An order-theoretic approach to conceptual clustering. In Proceedings of ICML'93, pages 33–40, Amherst, July 1993.

[4] C. Carpineto and G. Romano. Concept Data Analysis: Theory and Applications. John Wiley and Sons, 2004.

[5] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, (2):139–172, 1987.

[6] H. Fu and E. Mephu Nguifo. Partitioning large data to scale up lattice-based algorithm. In Proceedings of ICTAI03, pages 537–544, Sacramento, CA, November 2003. IEEE Computer Press.

[7] H. Fu and E. Mephu Nguifo. Mining frequent closed itemsets for large data. In Proceedings of The 2004 International Conference on Machine Learning and Applications (ICMLA04), Louisville, USA, December 2004.

[8] B. Ganter and R. Wille. Formal Concept Analysis. Mathematical Foundations. Springer, 1999.

[9] R. Godin, G. Mineau, R. Missaoui, and H. Mili. Méthodes de classification conceptuelle basées sur les treillis de Galois et applications. Revue d'intelligence artificielle, 9(2):105–137, 1995.

[10] R. Godin, R. Missaoui, and A. April. Experimental comparison of Galois lattice browsing with conventional information retrieval methods. Internat. J. Man-Machine Studies, (38):747–767, 1993.

[11] D. Kourie and G. Oosthuizen. Lattices in Machine Learning: Complexity Issues. Acta Informatica, 35(4):269–292, 1998.

[12] M. Lebowitz. Experiments with incremental concept formation: Unimem. Machine Learning, (2):103–138, 1987.

[13] E. Mephu Nguifo and P. Njiwoua. Treillis de concepts et classification supervisée. Technique et Science Informatiques, 24, 2005. Hermes-Lavoisier.

[14] C. Merz and P. Murphy. UCI Repository of Machine Learning databases, 1996.

[15] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemsets lattices. Journal of Information Systems, 24(1):25–46, 1999.