2.1.2 Preliminary Concepts and Problem Statement

The problem of mining frequent patterns was first introduced by Agrawal et al. [4] in 1993, and the problem of mining frequent closed patterns was first introduced by Pasquier et al. [92] in 1999. These works remain the standard references for all mining algorithms of frequent (closed) patterns.

Definition 2.1.1. Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of possible items in a given dataset. All items in $I$ are distinct. An itemset or pattern $z$ is an unordered subset of $I$, i.e., $z \subseteq I$. It can contain any item of $I$ at most once.

As pattern $z$ is unordered, the patterns $\{a, b, c\}$ and $\{c, a, b\}$ are equal. For ease of writing, patterns will be written as a sequence of item identifiers with the set brackets omitted. For example, pattern $\{a, b, c\}$ will be written as abc and pattern $\{b, d\}$ will be written as bd.

Definition 2.1.2. The length $|z|$ of a pattern $z$ is the number of items in $z$, i.e.,

$$|z| = \sum_{i \in z} 1.$$

Definition 2.1.3. A pattern $z$ is called a sub-pattern of another pattern $z'$ if $z \subseteq z'$. Symmetrically, $z'$ is called a super-pattern of $z$.

Definition 2.1.4. A transaction $T$ is represented by a pair $\{TID, z\}$, where $TID$ is a unique transaction identification number and $z$ is a pattern. The length of transaction $T$ is denoted by $|T|$ and is equal to the length $|z|$, i.e.,

$$|T| = |z| = \sum_{i \in z} 1.$$

Definition 2.1.5. A transaction dataset $TD$ is a set of transactions. The number of transactions in $TD$ is denoted by $|TD|$, and is defined as $|TD| = \sum_{T \in TD} 1$.

The term ‘size of a transaction dataset’ is sometimes taken to mean the number of individual items in the dataset. We avoid this term, which is potentially confusing.

Definition 2.1.6. The absolute support or support of a pattern $z$ in a transaction dataset $TD$, denoted as $sup(z, TD)$ or $sup(z)$, is the number of transactions in $TD$ containing pattern $z$, i.e., $sup(z) = \sum_{(T \in TD) \wedge (z \subseteq T)} 1$. The relative support or frequency of $z$ in $TD$, denoted as $freq(z, TD)$ or $freq(z)$, is defined as the absolute support value of $z$ divided by the number of transactions in $TD$, i.e.,

$$freq(z) = \frac{sup(z)}{|TD|}.$$

Note that the term support has sometimes been used for the relative support in the literature.
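As a minimal sketch of Definition 2.1.6, the following Python fragment counts absolute and relative support over a toy dataset. The dataset contents and the function names `support` and `frequency` are illustrative assumptions, not part of the original text.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical toy dataset: TID -> pattern (the items of the transaction).
TD: Dict[int, FrozenSet[str]] = {
    1: frozenset("abc"),
    2: frozenset("abd"),
    3: frozenset("bd"),
    4: frozenset("ac"),
}

def support(z: Iterable[str], td: Dict[int, FrozenSet[str]]) -> int:
    """Absolute support: number of transactions whose pattern contains z."""
    z = frozenset(z)
    return sum(1 for items in td.values() if z <= items)

def frequency(z: Iterable[str], td: Dict[int, FrozenSet[str]]) -> float:
    """Relative support: absolute support divided by |TD|."""
    return support(z, td) / len(td)

print(support("ab", TD), frequency("ab", TD))
```

Here the pattern ab is contained in transactions 1 and 2 only, so its absolute support is 2 and its relative support is 2/4.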

Definition 2.1.7. Frequent patterns are determined by a pre-specified threshold value. The threshold value can be defined in terms of minimum frequency (denoted as min_freq) or minimum support (denoted as min_sup). Given min_freq or min_sup, a pattern $z$ is said to be frequent in a transaction dataset $TD$ if and only if the support (or frequency) value of the pattern $z$ is not less than the minimum support (or minimum frequency) value specified by the user, i.e., $sup(z, TD) \geq \mathit{min\_sup} = \mathit{min\_freq} \times |TD|$. Otherwise, the pattern $z$ is said to be an infrequent pattern.

Note that in the literature, the term minimum support has sometimes been used for minimum frequency too, since in a static transaction dataset they have the same effect on the mining result. However, when they are applied to mining in dynamic transaction datasets, they have different effects.
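The difference between the two thresholds under updates can be illustrated with a small sketch. The toy transactions, the threshold of 0.5, and the helper names below are assumptions chosen for illustration: the two thresholds are set to be equivalent on the original dataset, and two transactions are then deleted.

```python
def support(z, transactions):
    """Number of transactions containing every item of z."""
    z = set(z)
    return sum(1 for t in transactions if z <= set(t))

def frequent_by_sup(z, transactions, min_sup):
    # Absolute threshold: does not change when the dataset changes.
    return support(z, transactions) >= min_sup

def frequent_by_freq(z, transactions, min_freq):
    # Relative threshold: scales with the current number of transactions.
    return support(z, transactions) >= min_freq * len(transactions)

before = ["ab", "ab", "c", "d", "e"]   # hypothetical original dataset
after = before[:3]                     # delete the last two transactions

min_freq = 0.5
min_sup = min_freq * len(before)       # equivalent threshold on `before`

status = {
    "sup_before": frequent_by_sup("ab", before, min_sup),
    "freq_before": frequent_by_freq("ab", before, min_freq),
    "sup_after": frequent_by_sup("ab", after, min_sup),
    "freq_after": frequent_by_freq("ab", after, min_freq),
}
print(status)
```

In this example the pattern ab is infrequent under both thresholds before the deletion. Afterwards it is still infrequent under the fixed minimum support, but it becomes frequent under the minimum frequency, because the relative threshold shrinks with the dataset.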

Definition 2.1.7 leads to the anti-monotone (downward closure) property [5] of frequent patterns, which is that all sub-patterns of a frequent pattern are also frequent. However, a super-pattern of a frequent pattern can be frequent or infrequent. Any super-pattern of an infrequent pattern is infrequent.
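The downward closure property can be checked by brute force on a small example. The sketch below enumerates every non-empty pattern over a hypothetical four-transaction dataset, which is feasible only because the item set is tiny; the dataset and the threshold `min_sup = 2` are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical toy dataset and threshold.
transactions = [set("abc"), set("abd"), set("bd"), set("ac")]
items = sorted(set().union(*transactions))
min_sup = 2

def support(z):
    return sum(1 for t in transactions if z <= t)

def all_patterns(item_list):
    """Every non-empty subset of the item list (exponential in len(item_list))."""
    for k in range(1, len(item_list) + 1):
        for combo in combinations(item_list, k):
            yield frozenset(combo)

frequent = {z for z in all_patterns(items) if support(z) >= min_sup}

# Downward closure: every non-empty proper sub-pattern of a frequent
# pattern must itself be frequent.
closure_holds = all(
    frozenset(sub) in frequent
    for z in frequent
    for k in range(1, len(z))
    for sub in combinations(z, k)
)
print(sorted("".join(sorted(z)) for z in frequent), closure_holds)
```

On this dataset the frequent patterns are a, b, c, d, ab, ac and bd. The pattern ad is infrequent, and so are its super-patterns such as abd.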

Definition 2.1.8. A pattern $z$ is said to be closed in a transaction dataset $TD$ if in $TD$ there is no pattern $z'$ satisfying $z \subset z'$ and $sup(z, TD) = sup(z', TD)$. Given min_freq or min_sup, if the closed pattern $z$ is frequent in $TD$ as well, the pattern $z$ is said to be a frequent closed pattern (FCP) in $TD$.
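A brute-force sketch of Definition 2.1.8 follows. It relies on one observation that is not stated in the text but follows from anti-monotonicity: if some proper super-pattern has the same support as $z$, then so does at least one super-pattern obtained by adding a single item to $z$, so checking one-item extensions is enough. The dataset and `min_sup = 2` are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical toy dataset and threshold.
transactions = [frozenset("abc"), frozenset("abd"), frozenset("bd"), frozenset("ac")]
items = sorted(frozenset().union(*transactions))
min_sup = 2

def support(z):
    return sum(1 for t in transactions if z <= t)

def is_closed(z):
    """True if no one-item extension of z has the same support as z."""
    s = support(z)
    return all(support(z | {i}) < s for i in items if i not in z)

patterns = [
    frozenset(combo)
    for k in range(1, len(items) + 1)
    for combo in combinations(items, k)
]

# Frequent closed patterns, written without set brackets as in the text.
fcps = sorted(
    "".join(sorted(z))
    for z in patterns
    if support(z) >= min_sup and is_closed(z)
)
print(fcps)
```

Here c is frequent but not closed, because ac has the same support of 2. Likewise d is absorbed by bd.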

Definition 2.1.9. A dynamic transaction dataset (DTD) is a transaction dataset in which dataset updates happen. A dynamic transaction dataset is composed of two parts: the base transaction dataset $TD_0$ and a series of updates $U_i$ ($i = 1, 2, \ldots, n$). Each update can change the data of the DTD. Transaction dataset $TD_n$ is the dataset after the $n$th update of a dynamic transaction dataset, i.e.,

$$TD_n = TD_0 + \sum_{i=1}^{n} U_i$$

Mining in dynamic transaction datasets can face the following four different simple cases of dataset update:

1. Adding new transactions $\Delta^+$ that contain only items that existed in the original transaction dataset $D$ before the update.

2. Adding new transactions $\Delta^+$ that contain new items that do not exist in the original transaction dataset $D$.

3. Deleting a number of existing transactions $\Delta^-$ from the original transaction dataset $D$.

4. Modifying a number of existing transactions in D, which includes adding new items, deleting existing items, or replacing some existing items with different items.

These four simple cases are demonstrated in Figure 2.3, where $D'$ denotes the updated transaction dataset after an update, and $\Delta'$ denotes the unchanged transactions existing in both $D$ and $D'$, i.e., $\Delta' = D \cap D'$. Cases 1 and 2 are shown in Figure 2.3(a). Case 1 is a special case of Case 2, since Case 1 can be regarded as adding new transactions with zero new items. Case 3 is shown in Figure 2.3(b). Case 4 can be considered as the combination of Cases 1, 2 and 3. Note that in Case 4, the total number of transactions in the dataset generally remains the same, since the operations in Case 4 operate on the item level instead of the transaction level. However, there is an exception. In an extreme situation where the operation of deleting existing items removes all existing items from some transactions, the number of transactions in the updated dataset becomes less than that in the original dataset, i.e., $|D'| < |D|$. Figure 2.3(c) illustrates this. Note that in Cases 1 and 2, $D' = \Delta' + \Delta^+ = D + \Delta^+$. In Case 3, $D' = \Delta' = D - \Delta^-$, while in Case 4, $D' = \Delta' + \Delta^+ = D - \Delta^- + \Delta^+$. As Cases 1 and 4 can be thought of as special cases of Cases 2 and 3, we only consider Cases 2 and 3 in this chapter.
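The relations between the original dataset, the added and deleted transactions, and the updated dataset can be sketched with a dictionary keyed by transaction identifier. The function `apply_update`, the toy data, and the choice to model a Case 4 modification as deleting the old version of a transaction and adding the new version under the same identifier are all assumptions made for illustration.

```python
# Hypothetical original dataset D: TID -> pattern.
D = {1: frozenset("abc"), 2: frozenset("abd"), 3: frozenset("bd")}

def apply_update(d, delta_plus=None, delta_minus=()):
    """Return the updated dataset: remove the deleted TIDs, then add new ones."""
    updated = {tid: z for tid, z in d.items() if tid not in delta_minus}
    updated.update(delta_plus or {})
    return updated

def unchanged(d, d_new):
    """Transactions present, with identical content, before and after."""
    return {tid: z for tid, z in d.items() if d_new.get(tid) == z}

# Cases 2 and 3 together: delete transaction 3, add one with the new item e.
D_new = apply_update(D, delta_plus={4: frozenset("ae")}, delta_minus={3})
kept = unchanged(D, D_new)

# Case 4: transaction 2 loses item d, modelled as delete-then-add.
D_mod = apply_update(D, delta_plus={2: frozenset("ab")}, delta_minus={2})

print(D_new, kept, D_mod, sep="\n")
```

In the modification example the number of transactions is unchanged, matching the general remark about Case 4, while the set of unchanged transactions no longer includes transaction 2.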

Given the original transaction dataset $D$ before the update, the update of adding new transactions $\Delta^+$ or deleting existing transactions $\Delta^-$, and the specified minimum frequency min_freq, the problem of incremental mining of frequent closed patterns in a dynamic transaction dataset is to mine the complete set of frequent closed patterns, together with their exact support values, in the updated transaction dataset $D'$ after an update, by using the knowledge about $D$. The required knowledge of $D$ is at least the complete set of frequent closed patterns in $D$, with their exact support values.

Figure 2.3: The simple cases of dataset update. (a) The updating case of adding new transactions. (b) The updating case of deleting existing transactions. (c) The updating case of modifying existing transactions.

Before we consider our incremental mining algorithm in more detail, we remark that, in common with full pattern mining, the problem is NP-hard, as is demonstrated by the following two propositions:

Proposition 2.1.10. Updating the frequent closed pattern set is NP-hard when new transactions are added to the original dataset.

Proof. Yang [124] showed that the problem of counting the number of frequent closed patterns in a static transaction dataset with an arbitrary specified support threshold is #P-complete, and that the problem of mining (i.e., enumerating) frequent closed patterns is NP-hard. In the worst case, the original dataset is empty, so the updated dataset consists solely of the added transactions, i.e., $D' = \Delta^+$, and the update reduces to the exact case considered by Yang.

Proposition 2.1.11. When an update consists only of deleting some of the existing transactions, the problem of incrementally mining frequent closed patterns is:

1. in P when the deleting update does not introduce new frequent patterns.

2. NP-hard when the deleting update introduces new frequent patterns.

Proof. Let the complete set of frequent closed patterns in the original transaction dataset $D$ be $FCS$. The length of the longest pattern in $FCS$ is $l_{FCS}$, the number of patterns in $FCS$ is $|FCS|$, the complete set of frequent closed patterns in the updated transaction dataset $D'$ is $FCS'$, and the length of the longest transaction in $D'$ is $l_{D'}$.

1. When the deleting update does not introduce new frequent patterns, $FCS'$ must be a subset of $FCS$, i.e., $FCS' \subseteq FCS$. For each frequent closed pattern $p$ in $FCS$, if the deleting update reduces the support value of $p$, then $p$ may cease to be closed because one of its super-patterns in $FCS$ now has the same support, or $p$ may even become infrequent; otherwise, $p$ will certainly be in $FCS'$. Thus, using $FCS$ as a candidate set $C$ for computing $FCS'$, we can first count the support value of each pattern in $C$ by scanning $D'$ once. This step takes at most $O(|FCS| \times l_{FCS} \times |D'| \times l_{D'})$ time. Then, $FCS'$ can be obtained by eliminating the infrequent and non-closed patterns in $C$, which takes at most $O(|FCS|^2 \times l_{FCS}^2)$ time. An algorithm that executes the above steps needs $O(\max(|FCS| \times l_{FCS} \times |D'| \times l_{D'},\ |FCS|^2 \times l_{FCS}^2))$ time, which is polynomial in the sizes of $FCS$ and $D'$.

2. A deleting update can introduce new frequent patterns if and only if a minimum frequency value, rather than a minimum support value, is used as the threshold for determining frequent patterns. When a deleting update does introduce new frequent patterns, the problem of mining frequent closed patterns in a static transaction dataset can be considered as a special case of incrementally mining frequent closed patterns in the updated transaction dataset after an update that only deletes some existing transactions from the original transaction dataset. This can be seen in the situation where the complete set of frequent closed patterns in the original dataset is empty, new frequent closed patterns appear after the deleting update, and the updated dataset $D'$ is exactly that static dataset.
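The two-step procedure used in part 1 of the proof can be sketched as follows. This is an illustrative reading of the proof rather than the incremental algorithm developed later: the function name `update_after_deletion`, the toy data, and the use of an absolute threshold `min_sup` (so that the deletion cannot introduce new frequent patterns) are assumptions.

```python
def update_after_deletion(fcs, d_new, min_sup):
    """Recompute the frequent closed patterns after a deleting update,
    assuming the update introduces no new frequent patterns."""
    # Step 1: one scan of the updated dataset to recount each candidate.
    counts = {p: sum(1 for t in d_new if p <= t) for p in fcs}
    # Step 2a: drop candidates that became infrequent.
    frequent = {p: s for p, s in counts.items() if s >= min_sup}
    # Step 2b: drop candidates absorbed by a candidate super-pattern
    # with the same support (pairwise comparison of candidates).
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and s == sq for q, sq in frequent.items())
    }

# Hypothetical example: D = {abc, abd, bd, ac}, min_sup = 2, and the
# frequent closed patterns of D are a, b, ab, ac and bd.
fcs_old = [frozenset(x) for x in ("a", "b", "ab", "ac", "bd")]
d_updated = [frozenset("abc"), frozenset("abd"), frozenset("bd")]  # ac deleted

fcs_new = update_after_deletion(fcs_old, d_updated, min_sup=2)
print({"".join(sorted(p)): s for p, s in fcs_new.items()})
```

After the deletion, ac becomes infrequent and a is no longer closed because ab now has the same support, so only b, ab and bd remain. Restricting the closedness check to the candidate set is valid here only because every frequent closed pattern of the updated dataset is assumed to be among the candidates.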