
Fast Mining of Complex Spatial Co-location Patterns


5.5 Mining Interesting Complex Relationships

The complex relationships considered in this chapter are specific types of interactions between variables, and hence can be mined using the GIM framework and algorithm. At the time this work was performed, GIM had not been developed. Instead, the problem is mapped to itemset mining. Solving the problem directly in GIM is described in Section 5.6.

In itemset mining, the data set consists of a set of transactions T, where each transaction t ∈ T is a subset of a set of items I; that is, t ⊆ I. To map complex spatial co-location mining to the itemset mining task, the set of complex maximal cliques (relationships) becomes the set of transactions T. The items are the object types, including the complex types such as A+ and −A. For example, if the object types are {A, B, C}, and each of these types is present and absent in at least one maximal clique, then I = {A, A+, −A, B, B+, −B, C, C+, −C}. An interesting itemset mining algorithm mines T for interesting itemsets. The support of an itemset I′ ⊆ I is the number of transactions containing the itemset: support(I′) = |{t ∈ T : I′ ⊆ t}|. So-called frequent itemset mining uses the support as the measure of interestingness. For reasons described in Section 5.1, this work uses minPI (see Equation 5.1) which, under the mapping described above, is equivalent to

(5.2) $\mathrm{minPI}(I') = \min_{i \in I'} \left\{ \frac{\mathrm{support}(I')}{\mathrm{support}(\{i\})} \right\}$


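Since minPI is anti-monotonic (the minPI of an itemset never exceeds that of any of its subsets), the search space for interesting patterns can easily be pruned: once an itemset falls below the threshold, none of its supersets needs to be examined.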
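As a minimal sketch of the mapping described above, the following Python computes the support and minPI of an itemset over a small set of transactions. The data in `TRANSACTIONS` and the function names `support` and `min_pi` are hypothetical, chosen for illustration only; the sketch also assumes that every item in the queried itemset occurs in at least one transaction and that the itemset is non-empty.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: each transaction stands for one complex maximal
# clique, written as the set of (complex) object types it contains.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"A", "B"}),
    frozenset({"A", "A+", "B"}),
    frozenset({"A", "-C"}),
    frozenset({"B", "-C"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def min_pi(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """minPI under the itemset mapping:
    the minimum over items i of support(itemset) / support({i}).

    Assumes a non-empty itemset whose items all occur in the data.
    """
    items = frozenset(itemset)
    joint = support(items, transactions)
    return min(joint / support({i}, transactions) for i in items)
```

On this toy data, `min_pi({"A", "B"}, TRANSACTIONS)` is 2/3, and adding a further item such as `"A+"` can only keep the value the same or lower it, which is the anti-monotonic behaviour that pruning relies on.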

5.5.1 Mapping the Problem to GLIMIT

Recall from Chapter 4 that GLIMIT is a fast and efficient itemset mining algorithm that has been shown to outperform Apriori [11] and FP-Growth [47]. Since it is based on a framework of functions (Section 4.4), new measures can easily be incorporated. In particular, the minPI measure can be built on top of the frequent itemset mining (FIM) approach. Recall from Example 4.4 on page 79 that it can be incorporated into GLIMIT as follows (let I′ = {1, 2, ..., q} for simplicity): g(·) is the identity function (there is no transformation on the data set), ◦ = ∩ (intersection) and f(·) = |·| (the set size). Let m_{I′} be the result computed by f(·) on the itemset I′. This means that m_{I′} = support(I′), as is the case for FIM. To evaluate minPI, F(·) is used: $F(m_{I'}, m_1, \ldots, m_q) = \min_{i \in I'} \{ m_{I'} / m_i \}$.
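The four functions can be illustrated with a short Python sketch. It assumes that each item is represented by the set of IDs of the transactions containing it, and it shows only the function instantiations, not GLIMIT's traversal or prefix tree. The names `item_vectors`, `combine` and `evaluate` are hypothetical helpers introduced for this example.

```python
from functools import reduce
from typing import Dict, FrozenSet, Iterable, List, Sequence, Set


def item_vectors(transactions: Sequence[Iterable[str]]) -> Dict[str, FrozenSet[int]]:
    """Map each item to the set of IDs of the transactions containing it."""
    vectors: Dict[str, Set[int]] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            vectors.setdefault(item, set()).add(tid)
    return {item: frozenset(tids) for item, tids in vectors.items()}


def g(vector: FrozenSet[int]) -> FrozenSet[int]:
    """g(.): identity, no transformation of the data."""
    return vector


def combine(x: FrozenSet[int], y: FrozenSet[int]) -> FrozenSet[int]:
    """The combination operator: set intersection."""
    return x & y


def f(vector: FrozenSet[int]) -> int:
    """f(.): set size, which equals the support."""
    return len(vector)


def F(m_itemset: int, m_items: List[int]) -> float:
    """F(.): minPI from the itemset's measure and the single-item measures."""
    return min(m_itemset / m_i for m_i in m_items)


def evaluate(itemset: Sequence[str], vectors: Dict[str, FrozenSet[int]]) -> float:
    """Apply g, the combination operator, f and F to one non-empty itemset."""
    transformed = [g(vectors[i]) for i in itemset]
    joint = reduce(combine, transformed)
    return F(f(joint), [f(v) for v in transformed])
```

Because only `F` is specific to minPI, replacing it with a function that returns `m_itemset` unchanged would give plain frequent itemset mining under the same assumptions.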

GLIMIT is used with the above instantiations of its framework to mine interesting complex co-locations, as shown in Figure 5.2. For comparison, an Apriori [11] style implementation will be used in the experiments.

The Apriori [11] and Apriori-like algorithms are bottom-up, item-enumeration itemset mining algorithms. Apriori works in a breadth-first fashion, making one pass over the data set for each level expanded. This is in contrast to GLIMIT, which makes only one pass over the entire data set. In Apriori, a candidate generation step generates candidate itemsets (itemsets that may be interesting) for the next level, followed by a data set pass (support counting) in which each candidate itemset is either confirmed as interesting or discarded. The support counting step is computationally intensive, as subsets of the transactions need to be generated. This is particularly problematic when the transaction width is large, as is the case for spatial co-location data that includes complex relationships. GLIMIT operates on completely different principles and does not have these drawbacks.
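The level-wise structure described above can be sketched in Python as follows. This is a simplified illustration and not the implementation used in the experiments: the function name `apriori_min_pi` is hypothetical, and support counting here tests every candidate against every transaction instead of generating transaction subsets.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Sequence, Set


def apriori_min_pi(
    transactions: Sequence[Iterable[str]], threshold: float
) -> Dict[FrozenSet[str], float]:
    """Breadth-first (level-wise) mining of itemsets with minPI >= threshold."""
    data = [frozenset(t) for t in transactions]

    # Level 1: one pass to count single items. Their minPI is 1 by
    # definition, so every single item is kept.
    item_support: Dict[str, int] = {}
    for t in data:
        for item in t:
            item_support[item] = item_support.get(item, 0) + 1
    result: Dict[FrozenSet[str], float] = {
        frozenset({i}): 1.0 for i in item_support
    }

    level: Set[FrozenSet[str]] = set(result)
    k = 1
    while level:
        # Candidate generation: join k-itemsets into (k+1)-itemsets and keep
        # only those whose k-subsets were all interesting (anti-monotonicity).
        candidates: Set[FrozenSet[str]] = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)

        # Support counting: one full pass over the data set for this level.
        counts = {c: 0 for c in candidates}
        for t in data:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # Confirm or discard each candidate using minPI.
        level = set()
        for c, joint in counts.items():
            score = min(joint / item_support[i] for i in c)
            if score >= threshold:
                result[c] = score
                level.add(c)
        k += 1
    return result
```

The nested loop over transactions and candidates makes the per-level cost visible: it grows with both the number of candidates and the number of transactions, and the pass is repeated for every level.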

It is also worth noting that since all single-item itemsets are always interesting (by definition, their minPI has the maximum value of 1), they cannot be discarded from the search. Note that were an FP-Growth style algorithm developed for this mining task, it would need to build an FP-Tree for the entire data set without any pruning. Hence, it is not a practical choice for this problem.


5.6 Mapping the Problem to GIM

GIM is more abstract than GLIMIT (which focuses only on itemset mining) and operates on a different (but related) framework and algorithm. Furthermore, while GLIMIT performs subset checking and requires that the entire PrefixTree remain in memory, both of these can be avoided in GIM, saving space and time. This was briefly discussed in Section 3.11.

While the experiments in this chapter were performed using GLIMIT, it is worth showing how the problem can be solved directly in GIM:

• Each complex type is a variable, and each complex maximal clique is a sample.

• Interaction vectors x_{V′} contain the set of complex maximal clique IDs that contain V′, where V′ is a complex spatial co-location pattern. Suppose x_{V′} is implemented as a bit vector.

• a(x_{V′}, x_v) = x_{V′} AND x_v, the bit-wise AND operation.

• m_I(x_{V′}) = |x_{V′}|, the number of set bits. This is the support of the interaction V′.

• M_I(·) evaluates minPI of V′ by using the result of m_I(x_{V′}) and looking up the m_I(x_v) : v ∈ V′ using the sequence map. As explained in Section 3.11, store(·) only needs to store PrefixNodes corresponding to single variables. Hence, the prefix tree is not stored in memory.

• S_I(·) = I_I(·) and returns true iff the value computed by M_I(·) is at least equal to the minPI threshold.
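The function definitions in the list above can be exercised with a small Python sketch, using Python integers as bit vectors. This is not the GIM algorithm itself: the depth-first enumeration in `mine` is a simple stand-in, the dictionary `single` plays the role of the sequence map, and all names other than `a`, `m_I` and `M_I` are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Tuple


def build_bit_vectors(samples: Sequence[Iterable[str]]) -> Dict[str, int]:
    """One bit vector per variable: bit k is set iff sample k contains it."""
    vectors: Dict[str, int] = {}
    for sample_id, sample in enumerate(samples):
        for variable in sample:
            vectors[variable] = vectors.get(variable, 0) | (1 << sample_id)
    return vectors


def a(x_prefix: int, x_v: int) -> int:
    """Aggregation function: bit-wise AND of two interaction vectors."""
    return x_prefix & x_v


def m_I(x: int) -> int:
    """Number of set bits, i.e. the support of the interaction."""
    return bin(x).count("1")


def mine(
    samples: Sequence[Iterable[str]], threshold: float
) -> Dict[Tuple[str, ...], float]:
    """Depth-first enumeration of patterns whose minPI is >= threshold."""
    vectors = build_bit_vectors(samples)
    order = sorted(vectors)
    # Only the measures of single variables are kept for later look-ups.
    single = {v: m_I(vectors[v]) for v in order}
    found: Dict[Tuple[str, ...], float] = {}

    def M_I(pattern: Tuple[str, ...], x_pattern: int) -> float:
        """minPI of the pattern from its support and the single supports."""
        joint = m_I(x_pattern)
        return min(joint / single[v] for v in pattern)

    def expand(pattern: Tuple[str, ...], x_pattern: int, start: int) -> None:
        for idx in range(start, len(order)):
            v = order[idx]
            new_pattern = pattern + (v,)
            x_new = a(x_pattern, vectors[v])
            score = M_I(new_pattern, x_new)
            if score >= threshold:  # the role of S_I: keep and extend
                found[new_pattern] = score
                expand(new_pattern, x_new, idx + 1)

    # The empty pattern is contained in every sample, so all bits are set.
    expand((), (1 << len(samples)) - 1, 0)
    return found
```

Because minPI is anti-monotonic, a pattern that fails the threshold is never extended, so the enumeration does not need to retain anything beyond the current path and the single-variable supports.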

Recall that mining maximal cliques is performed using a specialist algorithm [12]. This step can also be performed using GIM, as described in Section 3.8.2.

In document Verhein, Florian (2010): Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 142-144)