MASSIVE algorithm - meyerTheseDistrib

The DISR maximization step can be expressed as a weighted Dense Subgraph Problem (DSP) [3] or, equivalently, as a Dispersion Sum Problem [106]. The dense subgraph problem is defined for a complete undirected graph on the node set $A = \{1, \ldots, n\}$, where each edge $(i, j)$ takes a weight $w_{ij} \geq 0$, with $w_{ii} = 0$. The goal is

\begin{align}
\text{maximize} \quad & \sum_{i \in A} \sum_{j \in A} w_{ij} v_i v_j \tag{4.11} \\
\text{subject to} \quad & \sum_{i \in A} v_i = d \tag{4.12} \\
& v_i \in \{0, 1\}, \quad i \in A \tag{4.13}
\end{align}

which means selecting a node subset $S \subseteq A$ of fixed size $|S| = d$ such that the total edge weight in the induced subgraph is maximal.
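As an illustration of the formulation (4.11)-(4.13), the following minimal Python sketch evaluates the objective for a given subset and solves a tiny instance by exhaustive enumeration. The function names `dsp_objective` and `brute_force_dsp` and the 4-node weight matrix are hypothetical, chosen only for this example; enumeration is only practical for very small $n$, and nodes are indexed from 0 rather than from 1.

```python
from itertools import combinations

def dsp_objective(W, subset):
    """Double sum of w_ij over all ordered pairs (i, j) of selected nodes, as in (4.11)."""
    return sum(W[i][j] for i in subset for j in subset)

def brute_force_dsp(W, d):
    """Enumerate every subset of size d and keep the one with the largest objective.

    Illustrative only: the number of subsets grows combinatorially with n.
    """
    n = len(W)
    return max(combinations(range(n), d), key=lambda s: dsp_objective(W, s))

# Hypothetical symmetric weight matrix with a zero diagonal (w_ii = 0).
W_toy = [
    [0, 5, 1, 1],
    [5, 0, 4, 1],
    [1, 4, 0, 1],
    [1, 1, 1, 0],
]

best_subset = brute_force_dsp(W_toy, 3)
best_value = dsp_objective(W_toy, best_subset)
```

Because the double sum runs over ordered pairs, each undirected edge is counted twice; this rescales the objective but does not change which subset is optimal.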

In the dispersion sum problem, $n$ locations are given, where location $i$ is at distance $w_{ij} = \frac{I(X_{i,j}; Y)}{H(X_{i,j}, Y)}$ from location $j$, and the objective is to establish $d$ facilities among the $n$ locations, as distant from each other as possible (i.e. having the maximum average distance between facilities).

The DISR optimization problem can be put in a DSP framework by setting:

1. the $i$th node represents the variable $X_i$,

2. the binary variable $v_i$, $i = 1, \ldots, n$, takes the value 1 if the $i$th variable is selected and 0 otherwise,

3. the weight $w_{ij} = \frac{I(X_{i,j}; Y)}{H(X_{i,j}, Y)}$ is the symmetrical relevance of the two variables linked by the edge.
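The weights of this graph can be sketched in Python as follows, assuming discrete variables and the empirical (plug-in) entropy estimator. The function names, the convention of returning 0 when the joint entropy is zero, and the toy XOR data set are assumptions made for this illustration and are not taken from the original implementation. The mutual information is obtained from the identity $I(X_{i,j}; Y) = H(X_{i,j}) + H(Y) - H(X_{i,j}, Y)$.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Empirical joint entropy (in bits) of one or more discrete columns."""
    m = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / m) * log2(c / m) for c in counts.values())

def symmetrical_relevance(xi, xj, y):
    """I(X_{i,j}; Y) / H(X_{i,j}, Y), using plug-in entropy estimates."""
    h_joint = entropy(xi, xj, y)
    if h_joint == 0.0:
        return 0.0  # assumption: constant data carry no information, so weight 0
    mutual_information = entropy(xi, xj) + entropy(y) - h_joint
    return mutual_information / h_joint

def disr_matrix(X, y):
    """Symmetric n x n weight matrix W with a zero diagonal.

    X is a list of n columns (one per variable); only the n(n-1)/2 pairs
    with i < j are evaluated and then mirrored.
    """
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = symmetrical_relevance(X[i], X[j], y)
    return W

# Hypothetical toy data: y is the XOR of the first two inputs, the third is constant.
X_xor = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
y_xor = [0, 1, 1, 0]
W_xor = disr_matrix(X_xor, y_xor)
```

In this toy case, neither of the first two inputs is informative about the output on its own, yet the edge linking them receives the largest weight; this is the kind of complementarity that a pairwise criterion can capture.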

The DSP is an NP-hard problem since the CLIQUE problem can be reduced to it (see [106]). However, there exists a branch-and-bound algorithm able to deal with up to 90 variables [106], and there are several promising results on the performance of greedy searches [21, 111]. [10] indicates that backward elimination combined with a sequential search (BESR) performs well on binary quadratic problems such as the DSP ([10] uses BESR before a linear programming optimization). The BESR method starts with a set containing all the variables (i.e. $v_j = 1$ for all $j \in A$), then removes the variable $i$ whose removal (i.e. $v_i \leftarrow 0$) induces the lowest decrease of the objective function, and so on, until the required number of variables is reached (i.e. $\sum_{i \in A} v_i = d$). The procedure is enhanced by an iterative sequential replacement which, at each step, swaps the status of a selected and a non-selected variable (i.e. swapping $v_i = 1$ and $v_j = 0$) such that the largest increase in the objective function is achieved. The sequential replacement is stopped when no further improvement is possible.
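The sequential replacement step described above can be sketched as follows. This is a deliberately naive illustration that recomputes the objective for every candidate swap; it is not the efficient implementation referred to in the literature cited here, and the function names and the toy weight matrix are hypothetical.

```python
def subset_objective(W, subset):
    """Sum of w_ij over all ordered pairs of selected variables."""
    return sum(W[i][j] for i in subset for j in subset)

def sequential_replacement(W, subset):
    """Apply the best swap repeatedly until no swap improves the objective.

    Each accepted swap strictly increases the objective, so the loop stops
    after a finite number of iterations.
    """
    n = len(W)
    selected = set(subset)
    while True:
        current = subset_objective(W, selected)
        best_gain, best_swap = 0, None
        for i in sorted(selected):                       # candidate to drop
            for j in sorted(set(range(n)) - selected):   # candidate to add
                candidate = (selected - {i}) | {j}
                gain = subset_objective(W, candidate) - current
                if gain > best_gain:
                    best_gain, best_swap = gain, (i, j)
        if best_swap is None:
            return selected
        i, j = best_swap
        selected = (selected - {i}) | {j}

# Hypothetical symmetric weights with a zero diagonal.
W_swap = [
    [0, 5, 1, 1],
    [5, 0, 4, 1],
    [1, 4, 0, 1],
    [1, 1, 1, 0],
]

# Start from a deliberately poor subset of size 3.
improved = sequential_replacement(W_swap, {0, 2, 3})
```

Starting from the subset {0, 2, 3}, a single swap (dropping variable 3 and adding variable 1) raises the objective from 6 to 20, after which no further swap helps and the procedure stops.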

The combination of backward elimination and sequential search is a bidirectional search (Section 3.2.3). The backward elimination strategy can be adopted here since considering an $n$-variable problem does not mean estimating an $n$-variate probability distribution, as is usually the case in variable selection (Section 3.2); instead, it increases the number of elements $w_{ij}$ that have to be summed in (4.9). The DISR criterion requires only the symmetrical relevance of pairwise combinations of inputs, which are all computed and stored in the matrix $W$.

The proposed method combines an evaluation function (DISR) able to select complementary variables and a search algorithm (the backward elimination) also able to select complementary variables. We call this combination Matrix of Average Sub-Subset Information for Variable Elimination (MASSIVE).

4.4.1 Computational Complexity

The proposed implementation works as follows:

1. the DISR-matrix is computed. This step demands $\frac{n(n-1)}{2}$ evaluations $w_{ij} = \frac{I(X_{i,j}; Y)}{H(X_{i,j}, Y)}$, since the average sub-subset information is symmetric.

2. a backward elimination is applied to the DISR-matrix. This computation has an $O(n^2)$ complexity if the implementation for binary quadratic problems of [85] is adopted (see Algorithm 4.2 for a detailed pseudo-code).

3. a sequential replacement is performed. This also has an $O(n^2)$ complexity [85].

Table 4.4 compares the computational complexity of the evaluation step of different techniques. Note that the table reports a naive implementation of CMIM; a more efficient implementation of CMIM is given in [49]. The total complexity of MASSIVE is in $O(F \times n^2)$, where $F$ is the cost of an estimation of the mutual information involving $m$ samples and three variables (two inputs and one output). For instance, if the empirical entropy is used (Section 2.6), MASSIVE has a cost $O(m \times n^2)$. A complete ranking of variables can be returned by MASSIVE by selecting the subset composed of the $n$ variables (with no increase in the asymptotic computational cost). In that case, the ranking is given by the backward elimination, since there are no remaining variables to be used for sequential replacement. Note that a conventional forward selection (up to $d$ variables) based on an information criterion (e.g., MRMR) demands $O(n \times d)$ evaluations, each having a complexity depending on $d$ and on the number of samples $m$ (i.e. $O(m \times n \times d^2)$ with the empirical estimator). A conventional backward selection, where the evaluation is performed inside the loop and not precomputed as in MASSIVE, demands $O(n^2)$ evaluations (i.e. $O(m \times n^3)$ with the empirical estimator). Hence, the MASSIVE implementation makes it possible to adopt a BESR strategy at a cost lying between the conventional forward and backward approaches.

Algorithm 4.2: Detailed pseudo-code of the backward elimination for quadratic optimization problems (given a matrix of weights $W$). The C++ code for this method is freely available on the Internet: http://www.ulb.ac.be/di/map/pmeyer/links.html.

Inputs: number $d$ of variables to select, matrix of weights $W$ (with elements $w_{ij}$) of size $n \times n$

$S \leftarrow \{1, 2, \ldots, n\}$

Initialize score vector: for $k \in S$: $score_k \leftarrow \sum_j w_{jk}$

Select minimal score: $b \leftarrow \arg\min_{k \in S}(score_k)$

while $|S| > d$

    Update subset by eliminating worst variable: $S \leftarrow S \setminus b$

    Update score vector: $score_k \leftarrow score_k - w_{kb}$, $k \in S$

    Select minimal score: $b \leftarrow \arg\min_{k \in S}(score_k)$

end-while

Output: subset $S$
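The backward elimination of Algorithm 4.2 can be transcribed into Python roughly as follows. This is a sketch for illustration, not the C++ code referenced in the algorithm caption. The function name, the 0-based indexing, the tie-breaking rule (lowest index first) and the returned elimination order are assumptions of this example. Since $W$ is symmetric with a zero diagonal, removing variable $b$ decreases the objective by twice its current score, so removing the variable with the minimal score is the removal with the lowest decrease.

```python
def backward_elimination(W, d):
    """Greedy backward elimination on a symmetric weight matrix W.

    Returns the selected subset of size d and the order in which the
    other variables were eliminated (first eliminated first).
    """
    n = len(W)
    selected = set(range(n))
    # score[k] = sum of the weights linking k to the currently selected variables
    score = [sum(W[j][k] for j in range(n)) for k in range(n)]
    eliminated = []
    while len(selected) > d:
        # Worst variable: minimal score; ties go to the lowest index (assumption).
        b = min(sorted(selected), key=lambda k: score[k])
        selected.remove(b)
        eliminated.append(b)
        # Incremental update: drop the contribution of b from the remaining scores.
        for k in selected:
            score[k] -= W[k][b]
    return selected, eliminated

# Hypothetical symmetric weights with a zero diagonal.
W_be = [
    [0, 5, 1, 1],
    [5, 0, 4, 1],
    [1, 4, 0, 1],
    [1, 1, 1, 0],
]

kept, removal_order = backward_elimination(W_be, 2)
```

Each pass of the loop costs $O(n)$ for the minimum search and $O(n)$ for the score update, and there are at most $n$ passes, which is consistent with the $O(n^2)$ complexity stated for this step. On the toy matrix, variables 3 and then 2 are eliminated, leaving the pair {0, 1} linked by the heaviest edge.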

methods                         REL        CMIM       MRMR       DISR       MASSIVE

calls of evaluation function    (n×d)/2    (n×d)/2    (n×d)/2    (n×d)/2    (n×n)/2

calls of MI by evaluation       1          d−1        d          d−1        1

k-variate density               d+1        3          2          3          3

Table 4.4: The number of calls of the evaluation function is n×d in a forward selection strategy. Note that d = n for a backward elimination or for a complete ranking of the n variables. The computational cost of the criteria REL, CMIM, MRMR, DISR and MASSIVE is the number of calls of mutual information (MI) multiplied by the cost of an estimation of the mutual information involving a k-variate density and m samples.
