
Selection and Generation

CHAPTER 7. CORRELATED MULTIPLICATION RULES WITH

7.3 Correlated Multiplication Rules (CMRules)

Linear models such as regression are important for modeling [44] and classification [104], but their power is limited by their linear decision boundary. They can be used to model non-linearities (in terms of the original variables) by using non-linear functions of existing variables as regressors (or 'features' in machine learning). Furthermore, when such functions are applied to multiple variables at a time, they generate composite variables capable of capturing non-linear interactions between variables. These can replace existing variables or function as additional variables in a linear model. The use of such composite variables transforms the space in which the (linear) model is built. If the composite features are non-linear in the original features, the model becomes non-linear in the original variables. This is analogous to using kernel functions to map the space so that a non-linear decision boundary is achieved using linear models.

Good candidates for composite variables are multiplications of sets of variables. For example, the regression/decision problem $y = \alpha_1 v_1 + \alpha_2 v_2 + \alpha_3 v_3 + \alpha_{1,2} v_1 v_2 + \beta$ can


capture non-linearities in $v_1$ and $v_2$ as well as an interaction between $v_1$ and $v_2$, since it includes the term $\alpha_{1,2} v_1 v_2$. Finding the right variables to multiply together in order to achieve a better fit is a challenging problem. The best composite variables are highly correlated with the dependent variable; they capture the non-linearities and interactions well and therefore allow a better fit. This is a simple but powerful idea. However, if there are $m$ variables, there are $O(m^k)$ composite variables when the composite feature size is limited to $k$ and, in general, there are $O(2^m)$. Therefore, an intelligent search with pruning is required.
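To give a feel for how quickly the candidate space grows, the following minimal sketch counts the products of distinct variables that could serve as composite variables. The function name `num_composites` and the choice $m = 20$ are illustrative assumptions, not part of the original text.

```python
from math import comb

def num_composites(m: int, k: int) -> int:
    """Count products of 2..k distinct variables chosen from m variables.

    Hypothetical helper for illustration only.
    """
    return sum(comb(m, size) for size in range(2, k + 1))

m = 20
pairs = num_composites(m, 2)         # 190
up_to_triples = num_composites(m, 3) # 190 + 1140 = 1330
unrestricted = num_composites(m, m)  # 2**20 - 20 - 1 = 1048555
print(pairs, up_to_triples, unrestricted)
```

Even for a modest number of variables, exhaustive enumeration becomes impractical once the size limit is lifted, which is why a pruned search is needed.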

Example 7.1. Suppose we have the regression/decision problem $P: y = \alpha_1 v_1 + \alpha_2 v_2 + \beta$. We can add an extra (composite) variable $v_1 v_2$, giving $P': y = \alpha_1 v_1 + \alpha_2 v_2 + \beta + \alpha_{1,2} v_1 v_2$. This model is now capable of expressing a non-linear decision boundary in terms of $v_1$ and $v_2$ and can also capture a multiplicative interaction between $v_1$ and $v_2$. If there are no such patterns, $\alpha_{1,2}$ will be found to be 0. If there are non-linearities or interactions in the problem that can be approximated by $v_1 v_2$, then $P'$ will have $\alpha_{1,2} \neq 0$ and will fit better than $P$.

Correlated Multiplication Rules (CMRules) are rules of the form $v_i * v_j * \ldots * v_k \to c$, where the $v_i \in A$ are observed variables and $c$ is the dependent variable to be predicted. Based on the above discussion, the antecedents of rules with the highest correlations define ideal composite variables and capture interactions explaining the dependent variable $c$. Furthermore, such rules are easy to interpret and understand and can therefore provide useful information to help explain interactions in the data. The data set consists of a set of variables $A$ and a dependent variable $c$. Each row in the data set consists of samples of these variables. Each column therefore contains all samples for a particular variable $v \in A$ or $c$. Consider these columns as vectors and denote them by $x_v$ and $x_c$ respectively. $x_v[i]$ is therefore the $i$th sample of variable $v$.

The CMRules presented in this chapter use Pearson's product moment correlation coefficient, although other measures can easily be used in its place. Pearson's correlation coefficient between two variables $v$ and $c$ may be written as

$$r_{v,c} = \frac{\sum_{i=1}^{n} (x_v[i] - \bar{x}_v)(x_c[i] - \bar{x}_c)}{\sqrt{\sum_{i=1}^{n} (x_v[i] - \bar{x}_v)^2}\,\sqrt{\sum_{i=1}^{n} (x_c[i] - \bar{x}_c)^2}}$$

where $r_{v,c} \in [-1, 1]$ and $\bar{x}_v$ is the mean of $v$. If all variables are binary valued, it reduces to the $\phi$ coefficient.
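The formula above can be sketched directly in plain Python. This is a minimal illustration rather than the implementation used in the chapter; the function name `pearson` and the convention of returning 0 for a constant vector (where the coefficient is undefined) are assumptions made here.

```python
from math import sqrt

def pearson(xv, xc):
    """Pearson's product moment correlation between two equal-length vectors."""
    n = len(xv)
    mean_v = sum(xv) / n
    mean_c = sum(xc) / n
    cov = sum((a - mean_v) * (b - mean_c) for a, b in zip(xv, xc))
    var_v = sum((a - mean_v) ** 2 for a in xv)
    var_c = sum((b - mean_c) ** 2 for b in xc)
    denom = sqrt(var_v) * sqrt(var_c)
    # Assumption: a constant vector has zero variance, so the coefficient
    # is undefined; this sketch returns 0.0 in that case.
    return cov / denom if denom else 0.0

# A perfectly linear relationship gives a coefficient of (approximately) 1.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
# Two binary vectors with no association give a phi coefficient of 0.
print(pearson([1, 1, 0, 0], [1, 0, 1, 0]))
```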

The correlation $C(A' \to c)$ of a rule $A' \to c$ is defined as Pearson's product moment correlation coefficient between the antecedent and the consequent of the rule. Let the vector corresponding to the antecedent be denoted $x_{A'}$. $C(A' \to c)$ may then be written as:

$$C(A' \to c) = \frac{(x_{A'} - \bar{x}_{A'}) \cdot (x_c - \bar{x}_c)}{\lVert x_{A'} - \bar{x}_{A'} \rVert \, \lVert x_c - \bar{x}_c \rVert}$$

Since CMRules have multiplication semantics, the vector $x_{A'}$ is defined by

$$x_{A'}[i] = \prod_{v \in A'} x_v[i]$$

That is, for each sample (row) in the database, the values of the variables in the antecedent of the rule are multiplied together. Accordingly, CMRules with a high correlation show those variables that, when multiplied together in all the samples, are highly correlated with the dependent variable $c$.
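As an illustration of these two definitions, the sketch below builds the antecedent vector by row-wise multiplication and then computes the rule correlation. The helper names (`antecedent_vector`, `rule_correlation`) and the toy data are assumptions chosen so that neither variable alone is correlated with $c$, while their product is.

```python
from math import prod, sqrt

def pearson(xv, xc):
    """Pearson correlation; returns 0.0 for a constant vector (assumption)."""
    n = len(xv)
    mv, mc = sum(xv) / n, sum(xc) / n
    cov = sum((a - mv) * (b - mc) for a, b in zip(xv, xc))
    denom = sqrt(sum((a - mv) ** 2 for a in xv)) * sqrt(sum((b - mc) ** 2 for b in xc))
    return cov / denom if denom else 0.0

def antecedent_vector(columns, antecedent):
    """x_{A'}[i] = product over v in A' of x_v[i]."""
    n = len(next(iter(columns.values())))
    return [prod(columns[v][i] for v in antecedent) for i in range(n)]

def rule_correlation(columns, antecedent, xc):
    """C(A' -> c): correlation between the antecedent vector and x_c."""
    return pearson(antecedent_vector(columns, antecedent), xc)

# Toy data (made up for illustration): c is exactly v1 * v2.
columns = {"v1": [-1, -1, 1, 1], "v2": [-1, 1, -1, 1]}
xc = [1, -1, -1, 1]

print(rule_correlation(columns, ("v1",), xc))       # 0.0
print(rule_correlation(columns, ("v2",), xc))       # 0.0
print(rule_correlation(columns, ("v1", "v2"), xc))  # 1.0
```

In this toy case a purely linear model in $v_1$ and $v_2$ has nothing to work with, whereas the composite variable $v_1 v_2$ explains $c$ completely.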

7.3.1 Directing the Search by Correlation Improvement

Recall that the goal is to find highly correlated CMRules. The simplest approach is to attempt to find rules with a correlation above some user-defined threshold. However, it is problematic to direct the search by only expanding rules with correlation above a threshold, for two reasons.

• First, it introduces an arbitrary parameter to which the approach becomes sensitive.

• More subtly, since correlation is not downwards closed, it also introduces a dependency on the order in which variables are added to the antecedent. This could be addressed by forcing anti-monotonicity as a greedy heuristic, but the limitation inherent in absolute threshold based techniques remains.

Instead, the following approach is used: a variable should only be added to the antecedent of a rule if it improves the rule compared to its generalizations, that is, compared to those less specific rules having fewer variables in the antecedent. The Correlation Improvement (CI) measures this improvement in terms of correlation:

Definition 7.2. The Correlation Improvement is

$$CI(A' \to c) = C(A' \to c) - \max_{a \in A'} \{ C(A' - \{a\} \to c) \}$$


where $A' - \{a\} \to c$ is a less specific sub-rule obtained by removing variable $a$ from the antecedent. Note that $A' - \{a\} \to c$ is a generalization (a less specific rule) that therefore applies to more samples, while $A' \to c$ is more specific and applies to fewer samples. For an empty antecedent (the base case), $CI(\emptyset \to c) = C(\emptyset \to c)$, where $x_\emptyset$ is a vector of 1s.
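Definition 7.2 can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, and because Pearson's coefficient is undefined for the constant vector $x_\emptyset$, the code assumes $C(\emptyset \to c) = 0$.

```python
from math import prod, sqrt

def pearson(xv, xc):
    """Pearson correlation; returns 0.0 for a constant vector (assumption)."""
    n = len(xv)
    mv, mc = sum(xv) / n, sum(xc) / n
    cov = sum((a - mv) * (b - mc) for a, b in zip(xv, xc))
    denom = sqrt(sum((a - mv) ** 2 for a in xv)) * sqrt(sum((b - mc) ** 2 for b in xc))
    return cov / denom if denom else 0.0

def rule_correlation(columns, antecedent, xc):
    """C(A' -> c); an empty antecedent yields a vector of 1s (empty product)."""
    x_ant = [prod(columns[v][i] for v in antecedent) for i in range(len(xc))]
    return pearson(x_ant, xc)

def correlation_improvement(columns, antecedent, xc):
    """CI(A' -> c) = C(A' -> c) - max over a in A' of C(A' - {a} -> c)."""
    antecedent = frozenset(antecedent)
    c_rule = rule_correlation(columns, antecedent, xc)
    if not antecedent:  # base case: CI(empty -> c) = C(empty -> c)
        return c_rule
    best_sub = max(rule_correlation(columns, antecedent - {a}, xc)
                   for a in antecedent)
    return c_rule - best_sub

# Toy data (made up for illustration): c is exactly v1 * v2.
columns = {"v1": [-1, -1, 1, 1], "v2": [-1, 1, -1, 1]}
xc = [1, -1, -1, 1]

print(correlation_improvement(columns, {"v1"}, xc))        # 0.0
print(correlation_improvement(columns, {"v1", "v2"}, xc))  # 1.0
```

On this toy data the single-variable rules show no improvement over the empty rule, while the two-variable rule improves on both of its immediate sub-rules by the maximum possible amount.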

The Correlation Improvement is positive if the rule to which it is applied has a higher correlation than any of its immediate generalizations. This means it is a better predictor of the consequent variable than any of the less specific rules. This also has a geometric interpretation: since correlation is the cosine of the angle between vectors (recall figure 6.2), using the correlation improvement technique means the antecedent is built so that it moves closer to the consequent (in terms of the subtended angle) in comparison to the immediate sub-rules.

The following lemma shows that Correlation Improvement is downward closed.

Lemma 7.3. If $A' \to c$ is expanded (and considered interesting) only when $CI(A' \to c) \geq 0$, then Correlation Improvement is downwards closed: if $A' \to c$ is interesting, then so are all sub-rules.

Proof. This follows by induction over all sub-rules.

This means that $A' \to c$ is not only more correlated than all immediate sub-rules, but indeed all sub-rules. Consequently, Correlation Improvement is a useful method to direct the search for CMRules.

It is now possible to define the CMRule problem:

Problem definition: Correlated Multiplication Rules (CMRules): Find all CMRules with $CI(A' \to c) > minCI$, a threshold. (Recall that $A'$ is interpreted as a multiplication of variables, $\prod_{v_i \in A'} v_i$.)

Note that if $minCI > 0$, only variable interactions that predict the dependent variable better than all individual variables are found.

Finally, note that CMRules with a single variable in the antecedent are also useful: they can (trivially) be used to select a good subset of variables (features), since they are simply variables that are highly correlated with the dependent variable. Recall that CMRules can be used for automated supervised variable (feature) selection and composite variable generation, in particular for generating composite variables that allow non-linear interactions to be captured in the model.


7.4 CMRules for Feature Selection and Generation

Recall that the antecedents of CMRules are ideal candidates for composite features, and CMRules with only a single variable in the antecedent are also useful for feature selection. This is because the interactions defined by the antecedent variables are highly correlated with the dependent variable to be predicted, and since the antecedent is the product of variables, it introduces non-linearity. In order to use CMRules for feature selection and composite feature generation, this chapter proposes the Top Correlated Multiplication Rules (TCMR) method as follows:

1. First, all CMRules are mined using the correlation improvement method.

2. Then, the rules are sorted according to their $C(A' \to c)$ values and the top ranked rules selected.

3. Finally, the antecedents of the selected rules are used as (composite) variables in the model. This means the vectors $x_{A'}$ of the selected rules $A' \to c$ become the new values of the composite features $A'$.

This simple procedure works well, as will be shown in section 7.6.
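Steps 2 and 3 of the TCMR procedure can be sketched as below, assuming the rules have already been mined (step 1). The function name `tcmr_features`, the `top_k` parameter, the representation of a rule as a tuple of variable names, and the correlation values in the example are all hypothetical and serve only to illustrate the ranking and feature construction.

```python
from math import prod

def tcmr_features(columns, mined_rules, top_k):
    """Select the top_k rules by C(A' -> c) and build their feature vectors.

    columns:     dict mapping variable name -> list of samples
    mined_rules: dict mapping antecedent (tuple of names) -> C(A' -> c)
    """
    ranked = sorted(mined_rules.items(), key=lambda item: item[1], reverse=True)
    selected = [antecedent for antecedent, _ in ranked[:top_k]]
    n = len(next(iter(columns.values())))
    # Each selected antecedent becomes one (composite) feature column x_{A'}.
    return {
        antecedent: [prod(columns[v][i] for v in antecedent) for i in range(n)]
        for antecedent in selected
    }

# Made-up data and correlation values, for illustration only.
columns = {"v1": [1, 2, 3], "v2": [2, 2, 2], "v3": [0, 1, 0]}
mined_rules = {("v1", "v2"): 0.9, ("v3",): 0.4, ("v1",): 0.1}

features = tcmr_features(columns, mined_rules, top_k=2)
print(features)  # {('v1', 'v2'): [2, 4, 6], ('v3',): [0, 1, 0]}
```

The resulting columns can then be passed to any linear model in place of, or alongside, the original variables.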

7.5 Mining CMRules

This chapter adopts the Generalised Rule Mining (GRM) method proposed in chapter 6. GRM is a combination of a framework of functions on vectors and an efficient algorithm for mining rules directly. It solves rule mining at the abstract level. It is used for the following reasons:

• GRM proved to be easy to apply to the problem addressed in this paper. Since GRM solves rule mining at the abstract level, the CMRules problem can be solved by instantiating the functions in the framework appropriately (below) and using the GRM algorithm.

• GRM supports real valued variables and any rule semantics. Since CMRules is the first rule based method with multiplicative semantics and real valued variables, other algorithms are not applicable.

• The GRM algorithm is efficient and uses linear time in the rules mined. It does not use candidate generation, does not require multiple scans and does not generate a compressed version of the database.


The core of GRM is its framework, which abstracts and vectorises rule mining. Recall that $A$ is the set of variables that may be in the antecedent, and $C$ is the set of variables that may be in the consequent. In CMRules, $C = \{c\}$, the dependent variable to be predicted. As an aside, if there are multiple dependent variables, one may have $|C| > 1$. This is equivalent to mining CMRules for each dependent variable separately; however, mining them separately takes $O(|C|)$ times longer than mining them together in one go.

In the GRM framework, each possible antecedent $A' \subseteq A$ and each possible consequent $c \in C$ are expressed as vectors, denoted by $x_{A'}$ and $x_c$ respectively. Recall that these were already defined in section 7.3. The GRM framework is composed of five functions, which can be instantiated in the following way to solve the CMRules problem:

1. $m_R : X^2 \to \mathbb{R}$ is a distance measure between the antecedent and consequent vectors $x_{A'}$ and $x_c$. $m_R(x_{A'}, x_c)$ evaluates the quality of the rule $A' \to c$.

In CMRules, $m_R(x_{A'}, x_c) = C(A' \to c)$, Pearson's correlation coefficient as per section 7.3. Geometrically, in CMRules "close" means the angle between $x_{A'}$ and $x_c$ is small (figure 6.2 on page 123).

2. $a_R : X^2 \to X$ operates on vectors of the antecedent so that $x_{A'a} = a_R(x_{A'}, x_a)$, where $A' \subseteq A$ and $a \in (A - A')$. This means $a_R(\cdot)$ combines the vector $x_{A'}$ for an existing antecedent $A' \subseteq A$ with the vector $x_a$ for a new antecedent element $a \in A - A'$. The resulting vector $x_{A'a}$ represents the larger antecedent $A' \cup \{a\}$. Therefore, $a_R(\cdot)$ allows the antecedent vector required by $m_R(\cdot)$ to be built incrementally. Since $a_R(\cdot)$ defines how the vectors are built, it also implicitly defines the semantics of the rule.

In CMRules, $a_R(\cdot)$ is the element-wise multiplication of (typically) real valued vectors: $a_R(x_{A'}, x_a)[j] = x_{A'}[j] * x_a[j]$. The semantics of CMRules are multiplicative.

3. $M_R : \mathbb{R}^{|\mathcal{P}(A')|} \to \mathbb{R}$ is a measure that evaluates a rule $A' \to c$ based on the value computed by $m_R(\cdot)$ for any sub-rule $A'' \to c$ with $A'' \subseteq A'$. This supports interestingness measures where a rule $A' \to c$ needs to be compared with some or all of its sub-rules.

Recall that in CMRules, the Correlation Improvement method ensures rules are mined that are more correlated with $c$ than their immediate sub-rules. The Correlation Improvement method is implemented using $M_R(\cdot)$ according to definition 7.2.

4. $I_R : \mathbb{R}^2 \to \{true, false\}$ determines whether a rule $A' \to c$ is interesting based on the values produced by $m_R(\cdot)$ and $M_R(\cdot)$. Only interesting rules are output and further expanded (i.e. more specific rules are examined) by the algorithm.

In CMRules, the desired search strategy is achieved when $I_R$ returns true for $A' \to c$ if and only if $CI(A' \to c) > minCI$. Geometrically, since $C(A' \to c) = \cos(\theta)$ and $CI(\cdot)$ must be positive, the search progresses by adding variables to the antecedent in a way that the antecedent vector moves closer to the consequent vector in terms of the subtended angle $\theta$ (figure 6.2 on page 123). Recall that this means variables will only be added to the antecedent if they improve the correlation with the dependent variable $c$ in comparison to all the immediate sub-rules, and therefore, by lemma 7.3, all sub-rules.

5. Sometimes it is possible to determine that a rule is not interesting based purely on the antecedent. This is supported by $I_A : X \to \{true, false\}$. $I_A(x_{A'}) = false$ implies $I_R(\cdot) = false$ for all $A' \to c$ with $c \in C$.

In CMRules, $I_A(x_{A'}) = true$ for all $A'$, since it is not possible to prune the search based on the antecedent only.
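The five instantiations above can be sketched as plain Python functions. The search loop `mine_cmrules` below is a naive level-wise driver written only to exercise them; it is not the GRM algorithm of chapter 6, and its name, the simplified signature of `M_R`, the convention that the correlation with a constant vector is 0, and the toy data are all assumptions.

```python
from math import sqrt

def m_R(x_ant, x_c):
    """Pearson correlation C(A' -> c); 0.0 for a constant vector (assumption)."""
    n = len(x_c)
    ma, mc = sum(x_ant) / n, sum(x_c) / n
    cov = sum((p - ma) * (q - mc) for p, q in zip(x_ant, x_c))
    denom = sqrt(sum((p - ma) ** 2 for p in x_ant)) * sqrt(sum((q - mc) ** 2 for q in x_c))
    return cov / denom if denom else 0.0

def a_R(x_ant, x_a):
    """Element-wise multiplication: extends the antecedent by one variable."""
    return [p * q for p, q in zip(x_ant, x_a)]

def M_R(corr, sub_corrs):
    """Correlation Improvement over the immediate sub-rules (definition 7.2)."""
    return corr - max(sub_corrs)

def I_R(ci, min_ci):
    """A rule is interesting iff CI > minCI."""
    return ci > min_ci

def I_A(x_ant):
    """No antecedent-only pruning is possible in CMRules."""
    return True

def mine_cmrules(columns, x_c, min_ci=0.0):
    """Naive level-wise search (illustrative only, not the GRM algorithm)."""
    empty = frozenset()
    vec = {empty: [1.0] * len(x_c)}        # x_empty is a vector of 1s
    corr = {empty: m_R(vec[empty], x_c)}   # correlations of interesting rules
    rules, seen, frontier = {}, set(), [empty]
    while frontier:
        next_frontier = []
        for ant in frontier:
            for a in columns:
                new = ant | {a}
                if a in ant or new in seen:
                    continue
                seen.add(new)
                subs = [new - {b} for b in new]
                if not all(s in corr for s in subs):
                    continue  # some immediate sub-rule was not interesting
                x_new = a_R(vec[ant], columns[a])
                if not I_A(x_new):
                    continue
                c_new = m_R(x_new, x_c)
                ci = M_R(c_new, [corr[s] for s in subs])
                if I_R(ci, min_ci):
                    vec[new], corr[new] = x_new, c_new
                    rules[new] = (c_new, ci)
                    next_frontier.append(new)
        frontier = next_frontier
    return rules

# Toy data (made up for illustration): c is exactly v1 * v2.
columns = {"v1": [1, 2, 3, 4], "v2": [2, 1, 4, 3]}
x_c = [2, 2, 12, 12]
rules = mine_cmrules(columns, x_c)
for antecedent, (c_val, ci_val) in rules.items():
    print(sorted(antecedent), round(c_val, 3), round(ci_val, 3))
```

On this data each single variable is interesting (correlation of about 0.894 against the empty rule), and the product $v_1 v_2$ is retained because it improves on both single-variable sub-rules.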

With the above instantiations of the GRM framework, the GRM algorithm (algorithm 6.1 on page 132) is used to mine all CMRules efficiently.

Theorem 6.9 on page 131 gives the run time complexity of any approach that can be implemented in GRM, given the times $t(X)$ taken to compute function $X$ in the framework.

Lemma 7.4. Mining all CMRules takes $O(R \cdot |A|^2 \cdot |C| \cdot n)$ time, where $n$ is the number of samples (instances) and $R$ is the number of rules mined.

Proof. Using theorem 6.9: $t(m_R) = t(a_R) = O(n)$, where $n$ is the number of samples, since these functions require examining each element of the vector once. $t(I_R) = O(1)$. $t(M_R) = O(|A|)$, since $M_R$ examines immediate sub-rules.

This result states that the performance is linear in the number of interesting rules found by the algorithm, that is, the number of CMRules output. It is therefore not possible to improve the algorithm other than by a constant factor, since each CMRule must at least be output by the algorithm. Note that this result is not the same as saying that the run time is linear in the size of the search space. The search space is known beforehand, but the required rules are not. The search space may be $O(2^{|A \cup C|})$, but if only $R$ of these rules are interesting (and typically $R \ll 2^{|A \cup C|}$),
