
5.3 Optimization strategies

5.3.4 Preprocessing

In particular, the logarithm in eq. (5.9) can cause serious global convergence problems by inducing local maxima in the log-likelihood function. Any point Xij = 1 with a small probability 1 − exp(−[WH]ij) results in a logarithmic divergence of the log-likelihood, and the optimization algorithm will try to compensate for this divergence by increasing [WH]ij. To attenuate this problem, we propose an appropriate preprocessing step for our optimization procedure. Introducing an auxiliary variable α ∈ ]0, 1[, we set

\[
P(X_{ij} = 1 \mid S_1, \dots, S_K) =
\begin{cases}
0, & \text{if } X_{ij} = 0 \\
\alpha, & \text{if } X_{ij} = 1
\end{cases}
\qquad \text{for all } i, j. \tag{5.22}
\]

This can be summarized by

\[
\alpha X_{ij} = 1 - \exp(-[WH]_{ij}) \;\Leftrightarrow\; -\ln(1 - \alpha X_{ij}) = [WH]_{ij} \tag{5.23}
\]

Since the left-hand side of the last equation is always nonnegative, we recover a standard NMF problem X′ ≈ WH when substituting X′ij := −ln(1 − αXij).
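The substitution −ln(1 − αXij) described above can be illustrated with a short NumPy sketch. The function name `preprocess` and the toy matrix are assumptions made for illustration only; α = 0.87 is simply the example value quoted in Figure 5.6.

```python
import numpy as np

def preprocess(X, alpha):
    """Map binary data X to the nonnegative NMF target -ln(1 - alpha * X).

    Hypothetical helper, not taken from the source text.
    """
    X = np.asarray(X, dtype=float)
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # log1p(-alpha * X) equals ln(1 - alpha * X)
    return -np.log1p(-alpha * X)

# Toy binary data (assumed for illustration)
X = np.array([[1, 0, 1],
              [0, 0, 1]])
X_prime = preprocess(X, alpha=0.87)
# Entries with X_ij = 0 map to 0, entries with X_ij = 1 map to -ln(1 - alpha)
```

Because the transformed matrix contains only the two values 0 and −ln(1 − α), any standard NMF routine can be applied to it directly.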

Figure 5.5: The preprocessing procedure approximates the penalties considered by the actual log-likelihood (top row) for Xij = 1 (left) and Xij = 0 (right) by a quadratic form with an adjustable parameter α (bottom row).

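Using the squared Euclidean distance as cost function,

\[
E(\alpha, W, H) = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \ln(1 - \alpha X_{ij}) + [WH]_{ij} \right)^2, \tag{5.24}
\]

the well-known Alternating Least Squares (ALS) algorithm as described in [CZA08] can be used to minimize (5.24) with respect to W ≥ 0 and H ≥ 0. The ALS updates are given by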

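\[
H_{rs} \leftarrow \max\Big\{ \varepsilon,\; -\sum_{i=1}^{N} \big[(W^T W)^{-1} W^T\big]_{ri} \ln(1 - \alpha X_{is}) \Big\} \tag{5.25}
\]

\[
W_{lm} \leftarrow \max\Big\{ \varepsilon,\; -\sum_{j=1}^{M} \ln(1 - \alpha X_{lj}) \big[H^T (H H^T)^{-1}\big]_{jm} \Big\} \tag{5.26}
\]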
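A minimal NumPy sketch of these alternating updates is given below. It is an illustration under stated assumptions, not the implementation used in the text: the function names, the iteration count, the lower bound `eps` used for truncation and the toy data are all hypothetical. The least-squares solves use `np.linalg.lstsq` rather than forming the explicit inverses, which yields the same solution when the inverses exist.

```python
import numpy as np

def als_nmf(X_prime, K, n_iter=100, eps=1e-9, seed=None):
    """Truncated ALS for X_prime ~ W @ H with W, H >= eps (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, M = X_prime.shape
    W = rng.random((N, K))  # random nonnegative initialization
    H = rng.random((K, M))
    for _ in range(n_iter):
        # H <- max(eps, (W^T W)^{-1} W^T X'), cf. (5.25)
        H = np.maximum(eps, np.linalg.lstsq(W, X_prime, rcond=None)[0])
        # W <- max(eps, X' H^T (H H^T)^{-1}), cf. (5.26)
        W = np.maximum(eps, np.linalg.lstsq(H.T, X_prime.T, rcond=None)[0].T)
    return W, H

def euclidean_error(X_prime, W, H):
    """Squared Euclidean distance between X_prime and W @ H, cf. (5.24)."""
    return float(np.sum((X_prime - W @ H) ** 2))

# Toy binary data (assumed), transformed with alpha = 0.87
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
X_prime = -np.log1p(-0.87 * X)
W, H = als_nmf(X_prime, K=2, seed=0)
err = euclidean_error(X_prime, W, H)
```

Since the truncation step makes the result depend on the starting point, a single call such as the one above would typically be repeated with different seeds.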

In fact, any NMF algorithm could be used in the preprocessing step. Our choice of the ALS procedure is motivated by its simplicity and speed. Its poor theoretical convergence properties (induced by the truncation of negative elements) are alleviated by repeating the procedure several times with different random initializations for H and W and retaining only the solution with the smallest Euclidean distance. Multiple random initializations also lead to a more complete coverage of the search space.

Determining the Parameter α

The effect of α can be understood in the following way: as long as the matrix factorization framework permits it, an optimization algorithm should increase those [WH]ij for which Xij = 1 and diminish those [WH]ij for which Xij = 0 in order to reach the maximum likelihood solution. This kind of behaviour is mimicked by the simplified problem, whose target values for [WH]ij are −ln(1 − α) for Xij = 1 and 0 if Xij = 0. Thus, qualitatively, the basic properties of the data are roughly reflected in the simplification. However, an optimal α cannot be estimated from the data directly but has to be determined by an additional optimization process.

Figure 5.6: Log-likelihood of the approximations computed by the ALS method as a function of α for 10 random initializations. The best value is obtained for α = 0.87 in this example. The horizontal line denotes the true log-likelihood.

The true quantity to be maximized is the log-likelihood function (5.9), which is a sum over NM individual terms. The costs corresponding to terms Xij = 1 and Xij = 0 are asymmetric. An Xij = 1 term leads to a cost of ln(1 − exp(−[WH]ij)), whereas an Xij = 0 term yields a cost of −[WH]ij (see Fig. 5.5, top row). The (negative of the) Euclidean cost function (5.24) instead implies a quadratic cost for both cases (see Fig. 5.5, bottom row): Xij = 0 leads to a term −[WH]ij², while Xij = 1 yields a cost of −(ln(1 − α) + [WH]ij)². The latter is an inverted parabola with a maximum at −ln(1 − α) > 0. Note that only the left branch of this parabola needs to be considered, since the terms Xij = 1 seek to approach the maximum at [WH]ij = −ln(1 − α) and the terms Xij = 0 favour small values of [WH]ij.
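The two cost functions compared in this paragraph can be written down in a few lines of NumPy. The sketch below assumes that the log-likelihood (5.9) is the sum of the per-term costs just described, i.e. ln(1 − exp(−[WH]ij)) for Xij = 1 and −[WH]ij for Xij = 0; the function names are hypothetical.

```python
import numpy as np

def log_likelihood(X, WH):
    """Sum of ln(1 - exp(-WH_ij)) over X_ij = 1 and of -WH_ij over X_ij = 0."""
    X = np.asarray(X, dtype=float)
    WH = np.asarray(WH, dtype=float)
    with np.errstate(divide="ignore"):
        # -expm1(-a) equals 1 - exp(-a); the log diverges to -inf as a -> 0
        ones_term = np.log(-np.expm1(-WH))
    return float(np.sum(np.where(X == 1, ones_term, -WH)))

def neg_euclidean(X, WH, alpha):
    """Negative of the quadratic cost (5.24)."""
    X = np.asarray(X, dtype=float)
    WH = np.asarray(WH, dtype=float)
    return float(-np.sum((np.log1p(-alpha * X) + WH) ** 2))

alpha = 0.87                      # example value quoted in Figure 5.6
target = -np.log(1.0 - alpha)     # location of the parabola's maximum

# Single-entry examples: one X_ij = 1 term at the target, one X_ij = 0 term at 0.5
ll_one = log_likelihood([[1]], [[target]])          # equals ln(alpha)
quad_one = neg_euclidean([[1]], [[target]], alpha)  # equals 0, the maximum
ll_zero = log_likelihood([[0]], [[0.5]])            # equals -0.5
quad_zero = neg_euclidean([[0]], [[0.5]], alpha)    # equals -0.25
```

At the target value [WH]ij = −ln(1 − α), an Xij = 1 term contributes ln α to the log-likelihood, so values of α closer to 1 place the quadratic maximum where the true penalty is smaller.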

From simulations on toy data sets, we observed that the best log-likelihood LL(X, W(α), H(α)) obtained among several randomly initialized runs resembles a concave function of α (see Figure 5.6). Thus, a Golden Section Search procedure can be applied to obtain the optimal α within a reasonable number of trials and a reasonable amount of computational time.
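The search over α might be sketched as follows. This is an illustration under several assumptions rather than the procedure actually used: the function names, the search interval [0.01, 0.99], the tolerance, the number of restarts and iterations, the truncation bound `eps`, and the synthetic data are all hypothetical. The same seed is reused for every α so that the objective is deterministic.

```python
import numpy as np

INV_PHI = (np.sqrt(5.0) - 1.0) / 2.0  # inverse golden ratio, about 0.618

def golden_section_max(f, lo, hi, tol=1e-2):
    """Maximize a unimodal function f on [lo, hi] by golden section search."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:   # maximum lies in [lo, d]; old c becomes the new d
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:         # maximum lies in [c, hi]; old d becomes the new c
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def best_log_likelihood(X, alpha, K, n_restarts=5, n_iter=50, eps=1e-9, seed=0):
    """Best log-likelihood among several randomly initialized ALS runs."""
    rng = np.random.default_rng(seed)
    X_prime = -np.log1p(-alpha * X)   # preprocessing step
    best = -np.inf
    for _ in range(n_restarts):
        W = rng.random((X.shape[0], K))
        for _ in range(n_iter):
            H = np.maximum(eps, np.linalg.lstsq(W, X_prime, rcond=None)[0])
            W = np.maximum(eps, np.linalg.lstsq(H.T, X_prime.T, rcond=None)[0].T)
        WH = W @ H                    # strictly positive because W, H >= eps
        ll = np.sum(np.where(X == 1, np.log(-np.expm1(-WH)), -WH))
        best = max(best, float(ll))
    return best

# Synthetic binary data drawn with P(X_ij = 1) = 1 - exp(-[WH]_ij) (assumed)
rng = np.random.default_rng(1)
W_true, H_true = rng.random((30, 3)), rng.random((3, 20))
X = (rng.random((30, 20)) < 1.0 - np.exp(-W_true @ H_true)).astype(float)

alpha_opt = golden_section_max(lambda a: best_log_likelihood(X, a, K=3), 0.01, 0.99)
```

Golden section search only requires the objective to be unimodal on the interval, which matches the observation that the best log-likelihood resembles a concave function of α; each step costs one additional evaluation, i.e. one batch of ALS restarts.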

In summary, with the help of the auxiliary parameter α, the original log-likelihood components for Xij = 1 and Xij = 0 are approximated by quadratic forms. This parameter α has to be optimized in order to find the best such approximation for a given dataset.

5.3.5 Uniqueness

The non-uniqueness of unconstrained NMF solutions is alleviated in the case of binary datasets.

As shown schematically in Fig. 5.7, several equivalent solutions exist for continuous-valued data because the spanning basis vectors H1∗, . . . , HK∗ can lie anywhere between the data cloud and the boundaries of the non-negative orthant (see also chapter 3 and [SPTL09]). In a binary problem setting, the data lie in the corners of an M-dimensional hypercube in the non-negative orthant, which has one corner at the origin and edge length −ln(1 − α). The (continuous-valued) basis vectors Hk∗ coincide with the borders of this hypercube if K = M and lie inside the hypercube if K < M. Thus, the additional freedom of the basis vectors due to multiple possibilities outside the data cloud in the continuous case is missing in the binary situation, since the basis vectors are inside the data cloud in this case.

Figure 5.7: Left: Non-uniqueness of NMF solutions for continuous-valued data, illustrated with a 2-dim manifold embedded in 3-dim space. Different solutions are indicated by bundles of spanning basis vectors H1∗, H2∗. Right: In the case of binary data, there is no such ambiguity since the data lie in the corners of a hypercube with edge length −ln(1 − α). Note that one of the corners corresponds to the origin and the edges span the positive orthant.

We can summarize the whole optimization strategy as searching the parameter space by repeatedly running a fast ALS algorithm on a simplified problem involving an additional parameter α. This α is optimized by a Golden Section Search procedure. The simplified problem is solvable by standard NMF algorithms. Considering the results of chapter 3, NMF is in general known not to produce unique results without additional restrictions. In the special case of binary data, however, these uniqueness problems do not exist. Thus, the preprocessing procedure for fixed α leads to optimal solutions of E(α, W, H) in eq. (5.24). Optimization of α then leads to a good approximation of the original problem LL(W, H) in eq. (5.9). From this point, we can run the AGA or the multiplicative algorithm to drive the actual log-likelihood LL to the nearest local maximum. In that sense, the binary NMF problem can be interpreted as having a quasi-optimal solution inherited from the quadratic approximation.

In document Extensions of non-negative matrix factorization and their application to the analysis of wafer test data (Page 65-69)