Generalised Interaction Mining

be output in time linear in their length, or in constant time if they are output in compressed format as a prefix tree.

If $M_I(\cdot)$ is trivial, $t(M_I) = O(1)$ and the prefix tree is never kept in memory, leading to low space usage. The effect of non-trivial $M_I(\cdot)$ is discussed in section 3.11, where the prefix tree allows compression.

Theorem 3.9. The space usage is $O(|V| \cdot v_s + |V|^2)$ if $M_I$ is trivial, where $v_s$ is the space required by a single interaction vector.

Proof. In the worst case (all single variables are interesting interactions), all individual variables' interaction vectors must remain in memory at one point. The search is depth first, and so the depth is at most $|V|$. At each node along the current path of the search, a list (joinTo) of at most size $|V|$ is kept (containing references to objects already in memory), as well as at most one additional vector (the vector corresponding to $x_{V'}$ required to build vectors for longer antecedents). Therefore, at most $O(|V| + |V|) = O(|V|)$ vectors, each requiring $v_s$ space, are in memory, along with $O(|V|^2)$ references to existing objects already counted.
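The accounting in the proof can be written out term by term. This is only a restatement under the proof's own assumptions; the symbol $d$ for the current search depth is introduced here for illustration and does not appear in the original text, and each reference is counted as one unit of space.

```latex
\[
\underbrace{|V| \cdot v_s}_{\text{single-variable vectors}}
+ \underbrace{d \cdot v_s}_{\text{one extra vector per level}}
+ \underbrace{d \cdot |V|}_{\text{joinTo references}}
\;\le\; 2\,|V| \cdot v_s + |V|^2
\;=\; O\!\left(|V| \cdot v_s + |V|^2\right),
\qquad d \le |V|.
\]
```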

In most applications, $v_s$ is at worst $n$, the number of samples. Since most databases are sparse, sparse methods can be used to simultaneously reduce the space required by interaction vectors and the run time of functions on them. Furthermore, the search can often progress in subspaces. This will be described in section 3.15. Note the order in which the $x_v$ are used. A vector $x_v$ will only ever be needed by the algorithm once all possible interactions that can be created from $\{v' \in V : v' < v\}$ have been mined. This means the vector will only need to be loaded into memory at this point. Furthermore, for any variable $v$ that is not interesting, its vector $x_v$ is not required. This means that the entire data set need never be in memory unless all single-variable interactions are interesting. It is worth highlighting the fact that there will only ever be a single interaction vector in memory at any level (depth) in the search, with the exception of depth 1 of course. This is an important advantage of the algorithm.
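The load order described above can be illustrated with a simplified sketch. It is not the algorithm as specified in this chapter: the function `mine_in_load_order` and its parameters `load_vector`, `is_interesting` and `join` are hypothetical names chosen for illustration, and interestingness is assumed to be downward closed so that uninteresting interactions can be pruned.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def mine_in_load_order(
    variables: Sequence[Hashable],
    load_vector: Callable[[Hashable], object],
    is_interesting: Callable[[object], bool],
    join: Callable[[object, object], object],
) -> Tuple[List[tuple], List[Hashable]]:
    """Depth-first search that touches x_v only after everything built
    from earlier variables has been mined (illustrative sketch only)."""
    kept: Dict[Hashable, object] = {}  # vectors of interesting single variables
    found: List[tuple] = []

    def extend(interaction: tuple, x: object, join_to: List[Hashable]) -> None:
        # Only one new vector (x_new) is alive at each depth of the recursion.
        for i, u in enumerate(join_to):
            x_new = join(x, kept[u])
            if is_interesting(x_new):
                longer = interaction + (u,)
                found.append(longer)
                # Join only with variables earlier than u, so that each
                # interaction is generated exactly once.
                extend(longer, x_new, join_to[:i])

    for v in variables:
        x_v = load_vector(v)  # first (and only) time x_v is loaded
        if not is_interesting(x_v):
            continue  # an uninteresting variable's vector is never retained
        found.append((v,))
        extend((v,), x_v, list(kept))
        kept[v] = x_v  # now available to later variables

    return found, list(kept)
```

In this sketch the dictionary `kept` holds at most $|V|$ single-variable vectors, and the recursion holds one further vector per level, which matches the accounting in the proof of Theorem 3.9.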

A treatment of related algorithmic approaches and their key differences is given in section 4.3 of chapter 4.

3.4 Counting Based Approaches: The Simplest Example

In data sets where a variable is either present or absent in a sample, the simplest operation is to count how many times an interaction pattern occurs in the samples. This can be used as the basis of more complex methods. Frequent itemset mining (FIM) (or more generally, frequent pattern mining) is perhaps the simplest and most widespread instance of interaction mining and aims to find all sets of items in a transaction database that occur in at least minSup transactions [10]. A survey of such methods may be found in [43] and chapter 4 considers this problem in depth. Since FIM aims to find items that occur frequently together, it can be assumed that there is some interaction between these variables in the process generating the data; for example, an unseen variable (the human purchaser) tends to like particular combinations of items.

FIM can be implemented efficiently in the GIM framework as follows: each item is a variable, and each transaction is a sample. The database consists of the $x_v : v \in V$, where each $x_v$ encodes the set of transactions in which item $v$ exists. Geometrically then, items exist in the space spanned by the transaction identifiers. This idea will be covered in more detail in chapter 4. Interaction vectors $x_{V'}$ encode the set of transaction IDs whose corresponding transactions contain $V'$. One efficient implementation uses bit-vectors so that $x_{V'}[i]$ is $1$ if the $i$th transaction contains $V'$ and $0$ otherwise (see footnote 2). With this encoding, $a(x_{V'}, x_v) = x_{V'} \text{ AND } x_v$, the bit-wise AND operation. Since $x_v$ encodes those transaction IDs for transactions containing $v$, and $x_{V'}$ those containing $V'$, $x_{V'v}$ therefore encodes those transactions containing all items in $V'$ and the item $v$. Note that this would be the induction step in a proof of correctness. $m_I(x_{V'}) = |x_{V'}|$, the number of set bits. Note that this is the support of the itemset $V'$. $M_I(\cdot)$ is trivial. Finally, $S_I(\cdot) = I_I(\cdot)$ and returns true if and only if the value computed by $m_I(x_{V'})$ is at least minSup. Note that the prefix tree is not kept in memory in this application since $M_I(\cdot)$ is trivial.
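The bit-vector encoding can be made concrete with a minimal sketch. It assumes Python integers as bit-vectors (bit $i$ is set if transaction $i$ contains the itemset) and a plain depth-first enumeration; the function names `build_item_vectors` and `frequent_itemsets` are illustrative and are not taken from the text.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, List, Sequence, Tuple


def build_item_vectors(transactions: Sequence[Iterable[Hashable]]) -> Dict[Hashable, int]:
    """Encode each item v as an int x_v whose bit i is 1 iff transaction i contains v."""
    vectors: Dict[Hashable, int] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors


def frequent_itemsets(
    transactions: Sequence[Iterable[Hashable]], min_sup: int
) -> Dict[FrozenSet[Hashable], int]:
    """Return every itemset with support >= min_sup, mapped to its support."""
    vectors = build_item_vectors(transactions)
    order: List[Hashable] = sorted(vectors)  # assumes items are mutually comparable
    result: Dict[FrozenSet[Hashable], int] = {}

    def extend(prefix: Tuple[Hashable, ...], x_prefix: int, candidates: List[Hashable]) -> None:
        for i, v in enumerate(candidates):
            x_new = x_prefix & vectors[v]    # a(x_V', x_v) = x_V' AND x_v
            support = bin(x_new).count("1")  # m_I(x) = number of set bits
            if support >= min_sup:           # S_I = I_I: support threshold
                itemset = prefix + (v,)
                result[frozenset(itemset)] = support
                extend(itemset, x_new, candidates[i + 1:])

    # The empty itemset is contained in every transaction: all bits set.
    all_transactions = (1 << len(transactions)) - 1
    extend((), all_transactions, order)
    return result


example = frequent_itemsets([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}], min_sup=2)
```

For the four example transactions with minSup of 2, the frequent itemsets are {a} and {b} with support 3, and {c}, {a, b} and {a, c} with support 2. Because results are written straight into `result`, nothing resembling a prefix tree needs to be held during the search, in line with $M_I(\cdot)$ being trivial.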

3.5 Mining Maximal Interactions

Interactions often overlap each other, and if an interaction is interesting then its sub-interactions are usually also interesting. In a number of applications then, only the maximal interaction is of interest. For example, this is useful in some graph mining problems and the maximal frequent itemset mining problem.

Definition 3.10. A maximal interaction $V' \subseteq V$ is an interaction for which no super-interaction $V''$ exists so that $V' \subset V''$ and $V''$ is interesting.
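Definition 3.10 can be checked directly on a small collection of interesting interactions. The following is a naive post-processing sketch for illustration only, not a mining algorithm from this chapter; the function name `maximal_interactions` is hypothetical, and the quadratic comparison is only suitable for small inputs.

```python
from typing import FrozenSet, Hashable, Iterable, Set


def maximal_interactions(
    interesting: Iterable[Iterable[Hashable]],
) -> Set[FrozenSet[Hashable]]:
    """Keep each interesting interaction that has no interesting proper superset."""
    sets = [frozenset(s) for s in interesting]
    # For frozensets, s < t tests whether s is a proper subset of t.
    return {s for s in sets if not any(s < t for t in sets)}


maximal = maximal_interactions([{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}])
```

Here `maximal` contains only {a, b} and {a, c}, since each of the single-item interactions is a proper subset of one of them.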

2 A compressed TID-set or TID-list may also be used, which lists only the identifiers of the
