5.5 Enumeration in Convex Set Patterns Structures
5.5.5 Empirical Evaluation
We report an experimental study of the different algorithms, carried out on a machine equipped with an Intel Core i7-2600 CPU at 3.4 GHz and 16 GB of RAM. All materials are available at https://github.com/BelfodilAimene/MiningConvexPolygonPatterns.
5.5.5.1 Mining polygon patterns
We compare the three algorithms without any constraint: EXTCBO, DELAUNAYENUM and EXTREMEPOINTSENUM. Figure 5.13 plots, for each of them, the run time and the number of pattern candidates generated. Datasets consist of n objects drawn uniformly from the three classes of the IRIS dataset, for the attributes sepal-length and sepal-width (or petal-length and petal-width). First, notice that EXTCBO generates many redundant candidates that are discarded by the canonicity test, while the two others generate each pattern only once. This implies that EXTCBO is one to two orders of magnitude slower (it is the only one computing closures). Interestingly, EXTREMEPOINTSENUM is faster than DELAUNAYENUM, as it does not need to compute and update a Delaunay triangulation (even when the state-of-the-art [156] is used).
5.5.5.2 Impact of the constraints
EXTCBO enumerates convex polygons in a bottom-up fashion w.r.t. inclusion. It can thus only handle maximum perimeter and area constraints, which are monotone (the proof is given, e.g., by [26]): when a generated pattern does not satisfy the constraint, the algorithm backtracks (a well-known property in pattern mining). DELAUNAYENUM enumerates convex polygons in a top-down fashion w.r.t. inclusion. It can thus naturally prune w.r.t. minimal support, area and perimeter. EXTREMEPOINTSENUM enumerates patterns by inclusion but also from simpler to more complex shapes (extreme points inclusion). It can thus handle maximum shape complexity, perimeter and area constraints.
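The backtracking argument for a bottom-up enumeration can be illustrated with a minimal Python sketch. It is not EXTCBO itself: it naively enumerates point subsets by increasing index (so several subsets may share the same convex hull) and relies only on the fact that the area of a convex hull cannot decrease when points are added. The function names `convex_hull`, `hull_area` and `enumerate_bottom_up` are assumptions made for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area of the convex hull of `points` (shoelace formula)."""
    hull = convex_hull(points)
    n = len(hull)
    twice_area = 0
    for i in range(n):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


def enumerate_bottom_up(points, max_area):
    """Depth-first enumeration of point subsets, pruning on max-area violation.

    Returns the subsets that satisfy the constraint and the number of
    candidates that were actually generated.
    """
    kept = []
    visited = 0

    def extend(current, start):
        nonlocal visited
        for i in range(start, len(points)):
            candidate = current + [points[i]]
            visited += 1
            if hull_area(candidate) > max_area:
                # Every superset has a hull at least as large: backtrack.
                continue
            kept.append(candidate)
            extend(candidate, i + 1)

    extend([], 0)
    return kept, visited
```

Because the hull area only grows along a branch, skipping a violating candidate also skips all of its extensions without losing any valid subset. A minimum-area constraint offers no such guarantee in this direction, which is consistent with the text's remark that a top-down enumeration handles minimum constraints more naturally.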
Figure 5.13: Polygon pattern enumeration performance comparison. IRIS sepal-length × sepal-width (left). IRIS petal-length × petal-width (right).
Figure 5.14: Run time and generated patterns count for our three algorithms when introducing constraints. Panels: (a) max shape variation, (b) min support variation, (c) max perimeter variation, (d) min perimeter variation, (e) max area variation, (f) min area variation.
Figure 5.14 reports the run time of our three algorithms on the IRIS sepal-length vs. sepal-width dataset when each constraint is introduced separately and its threshold is varied (min. and max. perimeter are computed as they have the same behavior as min. and max. area, respectively). It also reports the number of generated patterns: the lower, the better. Some algorithms output more patterns because they cannot handle the constraints efficiently (such invalid patterns need to be removed during post-processing). As such, depending on the constraints the user is interested in, one algorithm may be preferred to another.
Figure 5.15: (Left) Comparing interval and convex polygon top-3 patterns. (Right) Comparing interval (red) and convex polygon (green) patterns by Gini, area and density (represented by point diameter).
5.5.5.3 Intervals vs. convex polygons
Our main motivation for introducing convex polygon patterns is to discover shapes with high density and area, and possibly with high class homogeneity (e.g., low Gini). Figure 5.15 (left) considers the IRIS dataset (sepal-length vs. sepal-width). It presents the three most frequent polygons that have null Gini and, for a fair comparison, either 3 or 4 extreme points: convex polygons fit the data more tightly without over-fitting excessively.
We also compare interval and polygon patterns in several datasets and plot their area, density and Gini. Figure 5.15 (right) plots, for all discovered patterns, the area (Y), the Gini (X), and the density (point diameter). It appears that convex polygons make it possible to find shapes with higher density, yet smaller area, over the same Gini range. Rectangles with high area are exactly those that we want to avoid for spatial data: they are likely to enclose zones of both high and low density, and to have high impurity (high Gini).
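The three quantities compared above can be sketched in Python as follows. This is only an illustration under stated assumptions: the Gini index is taken to be the usual Gini impurity of the class labels covered by a pattern, and density is assumed to be the support divided by the area, which this section does not define explicitly. All function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def polygon_area(vertices):
    """Shoelace formula; `vertices` must be given in boundary order."""
    n = len(vertices)
    twice_area = 0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


def bounding_box_area(points):
    """Area of the smallest axis-parallel rectangle (interval pattern) covering `points`."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def density(support, area):
    """Assumed definition: number of covered objects per unit of area."""
    return support / area if area > 0 else float("inf")


# Toy comparison: the same three points described by a triangle or by a rectangle.
triangle = [(0, 0), (4, 0), (0, 4)]
polygon_density = density(len(triangle), polygon_area(triangle))
interval_density = density(len(triangle), bounding_box_area(triangle))
```

On this toy triangle, the bounding rectangle has twice the area of the polygon for the same support, so under the assumed definition its density is halved; this is the effect the comparison in the text attributes to large rectangles.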
5.5.6 Conclusion
In this section, we have investigated the problem of enumerating, exhaustively and non-redundantly, the sets separable by the language of convex sets, or more particularly convex polygons. We have seen that this language is equivalent to the language of finite conjunctions of linear inequalities, making it far more expressive than the language of intervals discussed in the preceding section. However, the expressivity of this pattern language comes at a cost: its intelligibility. Indeed, in d-dimensional finite numerical datasets (G, M), while the vertex representation has a maximal size of |G|, the size of the halfspace representation (the number of linear inequalities) is of the order of |G|^⌊d/2⌋, making the intelligibility of such patterns questionable. Moreover, the number of extents can reach 2^|G| when the points are co-spherical. Other simplifications of the language could be investigated, for instance the language of neighborhood patterns [86] which, sadly, does not induce a pattern structure.
5.6 Conclusion
In this chapter, we have investigated the problem of enumerating definable sets in a finite pattern setup for the particular case of pattern structures and formal contexts. We have extended the state-of-the-art of enumeration techniques by proposing: (1) Algorithm CBOI for concept enumeration in a formal context, which leverages existing (inherent) implications between the context attributes [16], and (2) three algorithms for enumerating definable sets for the particular language of convex sets [20].
However, in this chapter, we have not yet discussed Problem 5.1 when pattern setups that are not pattern structures are considered. Consider, for instance, the sequential pattern language [6]. This language does not induce a pattern structure [48] but still induces a pattern multistructure. Hence, according to Theorem 3.5, all definable sets can be obtained from support-closed patterns. Therefore, a first approach to enumerate all possible definable sets is to enumerate support-closed patterns [166]. However, since two support-closed patterns can have the same extent (see Section 3.5.1.2), such algorithms are complete and sound but potentially redundant. A second approach used in the literature when dealing with pattern setups that are not pattern structures is to use completions in order to transform them into pattern structures (see Section 3.7.1). The commonly used technique is the antichain completion, as is the case for sequential patterns [39, 48] and graph patterns [78]. After such a transformation, algorithms solving Problem 5.3 (e.g., Algorithm 6) can be used. However, one needs to keep in mind that the language considered after completion is no longer the same, i.e. it is the logical conjunction of the basic patterns. Hence, using such a technique makes it possible to produce algorithms that are complete and non redundant but not necessarily sound, i.e. some output definable sets are not induced by the basic pattern language.
Whether the first or the second approach is used to tackle the general Problem 5.1, one can correct the different algorithms by:
• Obtaining non-redundancy for the first approach by storing in memory all the already generated extents then checking before an output if the extent has already been output. Such a solution is costly in memory.
• Obtaining soundness for the second approach by checking, before outputting a set A, whether there exists at least one maximal common description in cov∗(A) whose extent is A. If not, the algorithm does not output the set A. Such a solution can be costly in time, since the number of extents in the completion can be exponential in the number of extents in the basic language (see Example 3.31).
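The two corrections listed above can be sketched in Python as simple filters placed in front of the output step. This is a minimal illustration, not one of the algorithms of this chapter: `non_redundant`, `is_sound`, `cov_star` and `ext` are hypothetical names, and the enumeration itself is assumed to be provided elsewhere as an iterable.

```python
def non_redundant(pattern_extent_pairs):
    """First correction: keep only the first pattern seen for each extent.

    `pattern_extent_pairs` is any iterable of (pattern, extent) pairs, e.g. the
    output of a support-closed pattern miner. All extents already output are
    stored, hence the memory cost mentioned in the text.
    """
    seen = set()
    for pattern, extent in pattern_extent_pairs:
        key = frozenset(extent)
        if key in seen:
            continue  # this extent was already output through another pattern
        seen.add(key)
        yield pattern, key


def is_sound(candidate, cov_star, ext):
    """Second correction: accept a set A only if some maximal common
    description of A has exactly A as its extent.

    `cov_star(A)` is assumed to return the maximal common descriptions of A in
    the basic language, and `ext(d)` the extent of a description d.
    """
    candidate = frozenset(candidate)
    return any(frozenset(ext(d)) == candidate for d in cov_star(candidate))
```

The first filter trades memory for non-redundancy, while the second performs one check per candidate set of the completion, which is where the potential exponential time overhead comes from.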
Proposing a better algorithm for Problem 5.1 remains an open problem that we are currently investigating. Our main intuition is to use upper-approximation extents to jump between definable sets. Moreover, when the number of maximal covering descriptions is finite and the considered pattern setup is a pattern multistructure, upper-approximations can be computed using only maximal covering descriptions, thanks to Theorem 3.4.
Chapter 6 Discriminative Subgroup Discovery
There is a large set of definitions of the task of subgroup discovery in the literature [96, 106, 150, 163]. In a very general way, we define here subgroup discovery as follows: “The task of finding a small subset of interesting patterns in a given dataset”. There are four important terms in this definition. The notions of dataset and pattern have been presented in Chapter 3 and Chapter 5. The two remaining terms are interpreted below:
Interesting. In the sense that one needs a formal way to say whether a pattern is more interesting than another, i.e. an order relation. Often, the interestingness of a pattern is evaluated through some quality measure, that is, a mapping that associates a value to each pattern. The higher this value, the more interesting the pattern.
Small. In the sense that someone needs a way to select a small set of interesting patterns from the set of all interesting ones.
While the definition of subgroup discovery is quite broad, we consider here a particular subgroup discovery task which we call Discriminative Subgroup Discovery. The remainder of this chapter is organized as follows:
• Section 6.1 presents the task of discriminative subgroup discovery.
• Section 6.2 presents the notion of relevance theory introduced in [81].
• Section 6.3 presents the common way used in the literature to evaluate subgroup interestingness via a quality measure.
• Section 6.4 gives an idea about the different approaches existing in the literature tackling the problem of discriminative subgroup discovery.
6.1 Introduction by Examples
From now on, we abstract from the pattern language discussed in Chapter 3 and from the provided dataset. We thus consider as input an arbitrary pattern setup P = (G, (D, ⊑), δ). In Discriminative Subgroup Discovery, the set of objects G is partitioned into two sets: the set of positive or target instances G+ and the set of negative instances G−.
Informally, Discriminative Subgroup Discovery strives to find subgroups that discriminate/separate positive instances from negative ones. We call subgroup any extent S ∈ P_ext.
Example 6.1. Fig. 6.1 (left) depicts a formal context P (or equivalently a pattern structure) where G+ = {g1, g2, g3} and G− = {g4, g5, g6}. The set of all possible subgroups is given by P_ext, depicted in Fig. 6.1 (right). Recall that each extent A ∈ P_ext has at least one description inducing it. In a pattern structure, one could choose the intent int(A).
Objects and labels (left): g1 (+), g2 (+), g3 (+), g4 (−), g5 (−), g6 (−); each object has two of the three attributes m1, m2, m3.
Subgroups P_ext (right): ∅, {g3}, {g6}, {g3, g6}, {g1, g2, g4, g5}, {g1, g2, g3, g4, g5}, {g1, g2, g4, g5, g6}, {g1, g2, g3, g4, g5, g6}.
Figure 6.1: Formal context P = (G, M, I) (left) and its subgroups P_ext (right). Objects in bold are positive instances.
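The set of subgroups of a formal context can be computed as the set of all intersections of attribute extents. The following Python sketch does so for a context shaped like the one of Figure 6.1. Note that the assignment of crosses to the columns m1, m2 and m3 used below is an assumption: it is one placement that is consistent with two crosses per object and with the extents listed in the figure, not necessarily the one of the original table. The name `all_extents` is hypothetical.

```python
from itertools import combinations

OBJECTS = ["g1", "g2", "g3", "g4", "g5", "g6"]
POSITIVES = {"g1", "g2", "g3"}

# ASSUMED cross placement: attribute -> set of objects having it.
CONTEXT = {
    "m1": {"g1", "g2", "g3", "g4", "g5"},
    "m2": {"g1", "g2", "g4", "g5", "g6"},
    "m3": {"g3", "g6"},
}


def all_extents(objects, context):
    """Extents of a formal context: intersections of attribute extents.

    The empty attribute set yields the whole object set.
    """
    attributes = list(context)
    found = set()
    for size in range(len(attributes) + 1):
        for subset in combinations(attributes, size):
            extent = set(objects)
            for attribute in subset:
                extent &= context[attribute]
            found.add(frozenset(extent))
    return found


subgroups = all_extents(OBJECTS, CONTEXT)

# For each subgroup, count how many positive and negative instances it covers.
coverage = {
    extent: (len(extent & POSITIVES), len(extent - POSITIVES))
    for extent in subgroups
}
```

Under this assumed placement, the set of positive instances {g1, g2, g3} is not itself a subgroup, so no description isolates exactly the positives; the `coverage` counts are the kind of raw material a quality measure would be built on.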
Example 6.2. Consider now the numerical dataset depicted in Fig. 6.2. The set of subgroups depends on the hypothesis space, i.e. the description language. If we consider the interval pattern structure, the set {g1, g2, g3} is a subgroup whose intent is [1, 2] × [1, 3]. If the convex set pattern language is considered, then G+ = {g1, g2, g3, g4} is an extent.

      m1   m2   label
g1    1    2    +
g2    1    3    +
g3    2    1    +
g4    3    5    +
g5    3    4    −
g6    2    5    −

(Plot: objects g1 to g6 in the plane, with m1 on the horizontal axis and m2 on the vertical axis.)
Figure 6.2: A labeled numerical dataset (G, M) (left) and its representation in the plane (right), where black (resp. white) dots represent positive (resp. negative) instances.
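The claims of Example 6.2 can be checked with a short Python sketch that computes, for a set of objects, the objects covered by its smallest enclosing interval pattern and by its convex hull. The coordinates are those of Figure 6.2; the function names `interval_closure` and `convex_closure` are hypothetical, and the hull routine is a standard monotone chain rather than one of the algorithms of Section 5.5.

```python
DATA = {
    "g1": (1, 2), "g2": (1, 3), "g3": (2, 1),
    "g4": (3, 5), "g5": (3, 4), "g6": (2, 5),
}
POSITIVES = {"g1", "g2", "g3", "g4"}


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(hull, p):
    """True if p lies inside or on the boundary of a non-empty hull."""
    if len(hull) < 3:  # degenerate hull: a single point or a segment
        a, b = hull[0], hull[-1]
        return (cross(a, b, p) == 0
                and min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
                and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def interval_closure(names):
    """Objects covered by the smallest axis-parallel box around `names`."""
    xs = [DATA[g][0] for g in names]
    ys = [DATA[g][1] for g in names]
    return {g for g, (x, y) in DATA.items()
            if min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)}


def convex_closure(names):
    """Objects covered by the convex hull of `names`."""
    hull = convex_hull([DATA[g] for g in names])
    return {g for g, p in DATA.items() if in_hull(hull, p)}
```

With these definitions, `interval_closure({"g1", "g2", "g3"})` gives back the same three objects, while `interval_closure(POSITIVES)` covers the whole dataset because the box [1, 3] × [1, 5] also contains g5 and g6. In contrast, `convex_closure(POSITIVES)` returns exactly the positive instances, which is why G+ is an extent for convex sets but not for intervals.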