
Generalised Interaction Mining

CHAPTER 3. GENERALISED INTERACTION MINING

In probabilistic frequent itemset mining (PFIM), for example, the goal is to find itemsets that are frequent with a high probability. In an uncertain transaction database, it may not be certain whether an item is present in a transaction. This may be caused by, for example, noise, additive noise in privacy-preserving data mining, or inherent uncertainty in the problem domain. Therefore, the event that an item i is contained in a transaction t is associated with a probability. Chapter 10 will provide more details, motivation and examples. Prior to the work presented in part IV of this thesis, all previous approaches to frequent itemset mining in uncertain and probabilistic databases used the expected support method. While this approach has many drawbacks, as presented in chapter 10, it can also be implemented in GIM very efficiently. The alternative and superior method, based on computing the probability distribution of support and used in part IV, can also be implemented in GIM; this is considered in chapter 13.

The assumption made in work addressing this problem is that the items are independent, and therefore that the probability that the itemset $V'$ exists in a transaction $t_i$ can be computed as $\prod_{v \in V'} P(v \in t_i)$, where $P(E)$ is the probability that event $E$ occurs. The expected support of an itemset $V'$ is the expected number of times it occurs in the transactions: $\frac{1}{n} \sum_i \prod_{v \in V'} P(v \in t_i)$, where $n$ is the number of transactions in the database. The Expected Frequent Itemset Mining (EFIM) problem is to search for all itemsets whose expected support is above a user-defined threshold minExpSup. It is not hard to show that the expected support is anti-monotonic. EFIM can be solved in GIM as follows:

• The vectors $x_v$ are defined so that $x_v[i] = P(v \in t_i)$. This results in probability vectors. The order on the variables is arbitrary.

• $a_I(x_{V'}, x_v)$ is computed as $a_I(x_{V'}, x_v)[i] = x_{V'}[i] \cdot x_v[i]$. Note that $x_{V'}[i]$ is the probability that $V' \subseteq t_i$ under the independence assumption.

• $m_I(x_{V'}) = \frac{1}{n} \sum_i x_{V'}[i]$, where $n$ is the number of transactions. Note that this is the expectation of the support of $V'$.

• $M_I(\cdot)$ is trivial.

• $I_I(\cdot) = S_I(\cdot)$ and returns true if and only if $m_I(x_{V'}) \geq$ minExpSup.

This problem naturally fits into the vectorised framework, which provides both an intuitive way of thinking about the problem and an efficient solution. Consider in contrast how this would be implemented using an Apriori-style algorithm; in particular, the need to determine whether a candidate itemset is present in a transaction.
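The EFIM instantiation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name `efim`, the dictionary-of-lists input format, and the simple depth-first prefix extension are all hypothetical choices made for the example, and at least one transaction is assumed. Only the element-wise product (for $a_I$), the mean (for $m_I$) and the threshold test (for $I_I = S_I$) are taken from the text.

```python
from typing import Dict, List, Tuple


def efim(prob: Dict[str, List[float]],
         min_exp_sup: float) -> Dict[Tuple[str, ...], float]:
    """Find itemsets whose expected support reaches min_exp_sup.

    prob[v][i] is P(v in t_i); every vector has one entry per transaction.
    (Hypothetical input format chosen for this sketch.)
    """
    variables = sorted(prob)              # the order on the variables is arbitrary
    n = len(prob[variables[0]]) if variables else 0
    result: Dict[Tuple[str, ...], float] = {}

    def a(x_prefix: List[float], x_v: List[float]) -> List[float]:
        # a_I: element-wise product, valid under the independence assumption
        return [p * q for p, q in zip(x_prefix, x_v)]

    def m(x: List[float]) -> float:
        # m_I: expected support, (1/n) * sum_i x[i]
        return sum(x) / n

    def extend(prefix: Tuple[str, ...], x_prefix: List[float], start: int) -> None:
        for j in range(start, len(variables)):
            v = variables[j]
            x = a(x_prefix, prob[v])
            support = m(x)
            if support >= min_exp_sup:    # I_I = S_I: keep and extend further
                itemset = prefix + (v,)
                result[itemset] = support
                extend(itemset, x, j + 1)
            # otherwise prune: by anti-monotonicity no superset can qualify

    extend((), [1.0] * n, 0)              # the empty prefix has probability 1
    return result


example = {
    "a": [1.0, 0.5, 0.0, 1.0],
    "b": [0.5, 0.5, 1.0, 1.0],
    "c": [0.0, 0.2, 0.2, 0.0],
}
frequent = efim(example, min_exp_sup=0.4)
```

Note that no candidate itemset is ever tested against a transaction; each extension costs one pass over a probability vector, which is the contrast with an Apriori-style approach drawn above.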


3.11 Complex (Non-Trivial) Interestingness Measures

So far in this chapter, all problems considered had a trivial $M_I(\cdot)$. This section considers the case where $M_I(\cdot)$ is non-trivial. Recall from section 3.2 that this means that sub-interactions must be examined in order to calculate a measure on the interaction and to determine whether or not an interaction is interesting. This requires a way to store and quickly retrieve PrefixNodes given the interaction they represent, in order to obtain the value_m and value_M values stored within the node. This is done using a map from a given sequence of variables to the corresponding PrefixNode. Such a map is called a SequenceMap and provides constant time look-up for the required values. An efficient way to implement it is to use a Map or Hashtable that maps PrefixNodes to themselves, with identity based solely on the sequence of variables encountered in a traversal towards the root. That is, hashCode() and equality depend solely on the variableId sequence represented by the PrefixNode, not on the values. This allows an arbitrary sequence of variables to be created (for example, as a chain of PrefixNodes, effectively a singly linked list), and the values of that interaction can then be retrieved by looking up the PrefixNode that the algorithm created earlier via the Map's get(·) operation.

In the worst case, this look-up operation requires $O(|V'|)$ time, where $|V'|$ is the size of the interaction (in the case of a collision, checking whether two sequences are identical requires at most a scan over them).
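A sequence map of this kind might look as follows in Python, where `__hash__` and `__eq__` play the role of hashCode() and equality. This is a sketch only: the class layout, the attribute names `variable_id`, `value_m` and `value_M`, and the helper `chain` are assumptions made for illustration and are not taken from algorithm 3.1.

```python
from typing import Dict, Iterable, Iterator, Optional


class PrefixNode:
    """Prefix tree node; the walk towards the root spells out an interaction."""

    def __init__(self, variable_id: int,
                 parent: Optional["PrefixNode"] = None,
                 value_m: Optional[float] = None,
                 value_M: Optional[float] = None) -> None:
        self.variable_id = variable_id
        self.parent = parent
        self.value_m = value_m
        self.value_M = value_M

    def sequence(self) -> Iterator[int]:
        """Yield the variable ids met on the walk from this node to the root."""
        node: Optional[PrefixNode] = self
        while node is not None:
            yield node.variable_id
            node = node.parent

    def __hash__(self) -> int:
        # Depends only on the variable id sequence, never on the stored values.
        return hash(tuple(self.sequence()))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, PrefixNode):
            return NotImplemented
        # At most one scan over both sequences.
        return tuple(self.sequence()) == tuple(other.sequence())


def chain(variable_ids: Iterable[int]) -> PrefixNode:
    """Build a value-less probe: a singly linked chain, root-first order."""
    node: Optional[PrefixNode] = None
    for vid in variable_ids:
        node = PrefixNode(vid, node)
    if node is None:
        raise ValueError("an interaction needs at least one variable")
    return node


sequence_map: Dict[PrefixNode, PrefixNode] = {}

stored = chain([1, 4, 7])
stored.value_m, stored.value_M = 0.42, 0.9   # made-up values for the example
sequence_map[stored] = stored                # the node is mapped to itself

probe = chain([1, 4, 7])                     # same sequence, no values attached
found = sequence_map.get(probe)              # yields the node stored earlier
```

For brevity this sketch recomputes the hash by walking the chain on every call; an implementation aiming for the constant time look-up described above would presumably cache the hash in the node or derive it incrementally from the parent's hash.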

The additional space used by this method is only the bucket array of the Hashtable. However, recall that algorithm 3.1 assumes a garbage collector, and hence if PrefixNodes are not explicitly stored, they will be deleted. Storing them, as is required by a non-trivial $M_I(\cdot)$, leads to the prefix tree remaining in memory. Conversely, note that for a trivial $M_I(\cdot)$, only a single path in the prefix tree is retained in memory and, furthermore, a sequence map is not required. Insertion into this map is performed by the store(·) function in algorithm 3.1. Depending on the measure to be evaluated, store(·) may only need to store selected nodes. For example, some measures like minPI and maxPI require only that interactions of size 1 remain in memory for later use by $M_I(\cdot)$. In these cases, the memory requirement is the same as for a trivial $M_I(\cdot)$.

Algorithm 3.3 shows how the sequence map and the prefix tree nodes can be used to efficiently retrieve all immediate sub-interactions $\{V' - v : v \in V'\}$ of $V'$. Of course, non-immediate sub-interactions can also be retrieved.
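The idea of retrieving the immediate sub-interactions can be illustrated with a short Python sketch. It is not algorithm 3.3, which is not reproduced in this section. For brevity it assumes that a plain dictionary keyed by tuples of variable ids stands in for the SequenceMap, that interactions are kept in a fixed variable order, and that the stored values are a small dictionary; the function name and the example values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interaction = Tuple[int, ...]
Values = Dict[str, float]


def immediate_sub_interactions(
    interaction: Interaction,
    sequence_map: Dict[Interaction, Values],
) -> List[Tuple[Interaction, Optional[Values]]]:
    """Return each V' - v with its stored values, or None if it was not stored."""
    subs: List[Tuple[Interaction, Optional[Values]]] = []
    for k in range(len(interaction)):
        # Dropping the k-th variable keeps the remaining ids in their order,
        # so the result is the key under which the sub-interaction was stored.
        sub = interaction[:k] + interaction[k + 1:]
        subs.append((sub, sequence_map.get(sub)))
    return subs


# Stand-in sequence map with made-up values; (4, 7) was never stored.
stored_values: Dict[Interaction, Values] = {
    (1, 4): {"value_m": 0.6, "value_M": 0.8},
    (1, 7): {"value_m": 0.5, "value_M": 0.7},
}
subs = immediate_sub_interactions((1, 4, 7), stored_values)
```

An interaction of size $|V'|$ has $|V'|$ immediate sub-interactions, so the loop performs $|V'|$ look-ups. A missing entry (here `(4, 7)`) shows up as `None`, which the caller would need to interpret according to the measure being evaluated.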

The following lemma proves that all sub-interactions will be available, provided they are interesting. This is important for two reasons:


In document Verhein, Florian (2010): Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 89-91)