Illustrative example - Extensions of non-negative matrix factorization and their application to

In this section the determinant criterion is illustrated by simulations. First, we show that plain NMF without additional constraints produces various distinct decompositions of an artificial data set, whereas the detNMF algorithm consistently recovers the correct solution in multiple runs.

3.3.1 Unconstrained NMF versus detNMF

The following simulation setup was used. A fixed 3 × 5 matrix H represents basis components of characteristic shapes (see also Fig. 3.5, left):

$$H = \begin{pmatrix} H_{1*} \\ H_{2*} \\ H_{3*} \end{pmatrix} = \begin{pmatrix} \text{bowl} \\ \text{stair} \\ \text{block} \end{pmatrix} \qquad (3.15)$$

The true weight matrix W is generated from non-negative, uniformly distributed random numbers in [0, 1]. The algorithm receives X = WH and the correct factorization rank K = 3 as input. The matrices H and W are initialized with random positive numbers and then iteratively updated as described above until the normalized reconstruction error falls below a certain threshold (e.g. $(NM)^{-1} E(\alpha = 0) < 10^{-10}$).
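The setup above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the numerical entries of `H_true` are hypothetical stand-ins for the bowl, stair, and block shapes (the actual values are not given here), and `plain_nmf` uses the standard multiplicative updates for the squared error from [LS99], i.e. the unconstrained case α = 0. The detNMF update itself is not reproduced.

```python
import numpy as np


def plain_nmf(X, K, rng, tol=1e-10, max_iter=5000, eps=1e-12):
    """Unconstrained NMF with multiplicative updates for the squared error.

    Returns W, H and the history of the normalized reconstruction error
    (NM)^{-1} * sum_ij (X_ij - [WH]_ij)^2, starting with the initial error.
    """
    N, M = X.shape
    # Random positive initialization of both factors.
    W = rng.uniform(0.1, 1.0, size=(N, K))
    H = rng.uniform(0.1, 1.0, size=(K, M))
    errors = [np.sum((X - W @ H) ** 2) / (N * M)]
    for _ in range(max_iter):
        # eps only guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.sum((X - W @ H) ** 2) / (N * M))
        if errors[-1] < tol:
            break
    return W, H, errors


rng = np.random.default_rng(0)

# Hypothetical 3 x 5 feature matrix (rows: bowl, stair, block); assumed values.
H_true = np.array([
    [1.0, 0.5, 0.0, 0.5, 1.0],
    [0.2, 0.2, 0.6, 0.6, 1.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
])
W_true = rng.uniform(0.0, 1.0, size=(100, 3))  # uniform weights in [0, 1]
X = W_true @ H_true

for run in range(3):
    W, H, errors = plain_nmf(X, K=3, rng=rng)
    # Normalize rows to unit length before comparing determinants across runs.
    H_unit = H / np.linalg.norm(H, axis=1, keepdims=True)
    print(run, errors[-1], np.linalg.det(H_unit @ H_unit.T))
```

Printing the determinant of the row-normalized H for each run gives a quick way to see whether differently initialized runs end in different factorizations; the iteration cap `max_iter` is an assumption made to keep the sketch fast and may stop a run before the threshold is reached.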

Figure 3.7 compares the results obtained via plain (unconstrained) NMF and detNMF. The pictures show 3-dimensional projections of 10 randomly initialized runs using uniformly and sparsely distributed weights W, together with H as in Figure 3.5. While the detNMF algorithm consistently recovers the original basis vectors H, which constitute the edges of a tetrahedron in a three-dimensional projection, the solutions obtained via plain NMF without additional constraints vary.

In the simulations, the algorithm detNMF always extracted the correct features despite starting with random initializations as well as different original coefficient matrices W and varying numbers of individuals (e.g. N = 100, 1000, 10000).

3.3. ILLUSTRATIVE EXAMPLE 35

Figure 3.5: Left: Original features bowl $H_{1*}$ (top), stair $H_{2*}$ (center), and block $H_{3*}$ (bottom), which are perfectly recovered by the detNMF algorithm. Each vector $H_{k*}$ is a 5-dimensional analogue of the 2-dimensional vectors in Fig. 3.2. Right: Example of a valid solution, but with wrong features $H'_{1*}$, $H'_{2*}$, $H'_{3*}$, obtained via unconstrained NMF.

While iteratively reconstructing the data matrix X with sufficient precision, the determinant criterion pushes the feature vectors towards the optimal solution with the smallest possible determinant (see Fig. 3.6). In contrast, the unconstrained version (α = 0, as proposed by [LS99]) converges to several different solutions, depending on the initialization of the NMF procedure. In Fig. 3.5 (right) we give an example of an exact non-negative factorization of the data into $(W', H')$ which does not reproduce the original features H correctly. Note that $H'_{2*} + 0.2\,H'_{3*} \approx H_{2*}$. Furthermore, note that $\det(HH^T) = 0.18$ and $\det(H'H'^T) = 0.31$. The basis $H'$ is sufficient to explain all data by a non-negative superposition, but does not coincide with the correct solution, which generated the data and is characterized by a minimal determinant.
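The non-uniqueness argument can be checked numerically. The sketch below is an illustration, not the factorization shown in Fig. 3.5: the feature values are hypothetical, rows are assumed to be normalized to unit Euclidean length, and the alternative basis is built by hand by moving the second feature outward so that the true second feature becomes a non-negative mixture of two alternative features.

```python
import numpy as np


def normalize_rows(H):
    """Scale every row of H to unit Euclidean length."""
    return H / np.linalg.norm(H, axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Hypothetical features (rows: bowl, stair, block); assumed values.
H = normalize_rows(np.array([
    [1.0, 0.5, 0.0, 0.5, 1.0],
    [0.2, 0.2, 0.6, 0.6, 1.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
]))
W = rng.uniform(0.0, 1.0, size=(100, 3))
X = W @ H

# Alternative basis: H_raw[1] + 0.2 * H_raw[2] == H[1], other rows unchanged.
H_raw = H.copy()
H_raw[1] = H[1] - 0.2 * H[2]          # stays non-negative for these values
scales = np.linalg.norm(H_raw, axis=1)
H_alt = H_raw / scales[:, None]        # unit-length alternative features

# H = B @ H_raw with a non-negative mixing matrix B, hence X = (W @ B) @ H_raw.
B = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.0, 1.0],
])
W_alt = (W @ B) * scales[None, :]      # absorb the row normalization into W

det_true = np.linalg.det(H @ H.T)
det_alt = np.linalg.det(H_alt @ H_alt.T)
print("max |X - W_alt H_alt| =", np.abs(X - W_alt @ H_alt).max())
print("det(H H^T)         =", det_true)
print("det(H_alt H_alt^T) =", det_alt)
```

Both factor pairs are non-negative and reproduce X exactly, but the alternative basis spans a wider cone and therefore has the larger determinant, which is the property the determinant criterion exploits.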

3.3.2 Determinant Criterion versus Sparseness Constraints

In the following, we will demonstrate that the determinant criterion offers a natural way to induce unique solutions to exactly solvable NMF problems, whereas sparseness constraints as suggested by [Hoy02] or [LHZC03] among others can be misleading. We discuss two extreme toy examples:

1. In the first example our detNMF algorithm correctly discovers the unique features of a non-sparse data distribution, whereas a sparse NMF approach fails to do so.

2. In the second example we use a data distribution with a multi-modal density which lends itself to a very sparse representation.

With these extreme examples we intend to highlight the fact that sparse coding does not represent a robust and natural criterion for unique solutions, while a determinant criterion can be, provided enough data is available.

For comparison, we chose the nnsc algorithm (nnsc: non-negative sparse coding) described in [Hoy02], which is tailored to provide good solutions for sparse data sets. It minimizes the following objective function:

$$E_{nnsc} = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( X_{ij} - [WH]_{ij} \right)^2 + \lambda \sum_{ij} W_{ij} \qquad (3.16)$$

Figure 3.6: Typical evolution of simulation parameters using the determinant criterion. Top: logarithm of the reconstruction error; bottom: determinant $\det(HH^T)$.

The term $\lambda \sum_{ij} W_{ij}$, $\lambda \geq 0$, penalizes large mixing coefficients $W_{ij}$ and hence pushes the solution towards small coefficients. Ideally, using an appropriate sparsity constraint, most data points should be explained by a minimal set of basis vectors $H_{s*}$, with the majority of all other coefficients $W_{ij}$, $j \neq s$, being zero. An optimally sparse W thus corresponds to a solution where the columns $W_{*j}$ contain as many zeros as possible.
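For concreteness, the objective in eq. 3.16 can be evaluated as follows. This is a minimal sketch: the function name `nnsc_objective` and the small test matrices are chosen here for illustration only, and the optimization procedure of [Hoy02] is not reproduced.

```python
import numpy as np


def nnsc_objective(X, W, H, lam):
    """Squared reconstruction error plus lam times the sum of all entries of W."""
    residual = X - W @ H
    return np.sum(residual ** 2) + lam * np.sum(W)


# Tiny hand-checkable case: W is the identity, so W @ H equals H.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.eye(2)
H = np.array([[1.0, 2.0], [3.0, 3.0]])

value = nnsc_objective(X, W, H, lam=0.5)               # 1.0 (error) + 0.5 * 2.0 (penalty)
error_only = nnsc_objective(X, W, H, lam=0.0)          # pure reconstruction error
print(value, error_only)
```

Setting `lam=0.0` reduces the function to the plain squared reconstruction error, which is the quantity compared against the stopping threshold.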

With the first example we discuss a data distribution which is concentrated right in the middle between the basis vectors which span the data space. This non-sparse data distribution serves to demonstrate potential drawbacks of sparsity constraints by means of the following idea:

The representation of all data points using only non-negative expansion coefficients (W ≥ 0) requires all basis vectors to lie on the periphery of the data subspace (see Fig. 3.2). If, however, the NMF algorithm tries to satisfy a sparseness condition by moving one basis vector towards a data agglomeration which is not located near a true basis vector, the non-negativity constraint immediately forces other basis vectors to compensate for this close approach by moving further away from the cloud of data points. Otherwise, not all non-negativity constraints can be met. The property of maximal sparseness, imposed by some suitable criterion, can thus lead to a valid unique solution which might not be the solution sought and certainly does not represent a minimum volume solution. On the other hand, we expect our minimal volume constraint to cope with this situation and to yield the correct solution.

Example 2 demonstrates that even for very sparse data distributions, where any sparse NMF algorithm is expected to yield good results, our minimal volume constraint should also yield unique and correct solutions.

In example 1, using the feature vectors $H_{k*}$ (eq. 3.15), we construct the coefficients in W as follows: 90% of the data points are generated via $s\,(t \cdot H_{1*} + (1 - t) \cdot H_{3*})$, where the parameter t is randomly drawn from a Gaussian distribution with $(\mu, \sigma) = (0.5, 0.03)$ and s is uniformly distributed in the interval [0, 1]. Projected onto the first three principal components of X, the feature vectors $H_{k*}$ constitute the edges of a tetrahedron (see Fig. 3.8, solid lines). Note that by construction, the data has exactly three principal components related to nonzero eigenvalues. Most data points are an approximately equally weighted mixture of two features and thus lie on a surface between two edges of the tetrahedron. The remaining 10% of the data are uniformly distributed in the space between the feature vectors, as illustrated in Fig. 3.8.
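The construction of the example 1 data can be sketched as follows. Several details are assumptions made for illustration: the feature values in `H` are hypothetical, the remaining 10% of the points are drawn with all three coefficients uniform in [0, 1], and t is clipped to [0, 1] purely as a safeguard (with σ = 0.03 the clipping is practically never active).

```python
import numpy as np


def make_example1_weights(N, rng, frac_concentrated=0.9):
    """Weights for example 1: most points mix features 1 and 3 about equally."""
    n_conc = int(round(frac_concentrated * N))
    t = np.clip(rng.normal(0.5, 0.03, size=n_conc), 0.0, 1.0)
    s = rng.uniform(0.0, 1.0, size=n_conc)
    # Data point s * (t * H_1 + (1 - t) * H_3) has coefficients (s t, 0, s (1 - t)).
    W_conc = np.column_stack([s * t, np.zeros(n_conc), s * (1.0 - t)])
    # Assumed form of the remaining points: uniform coefficients for all features.
    W_rest = rng.uniform(0.0, 1.0, size=(N - n_conc, 3))
    return np.vstack([W_conc, W_rest])


rng = np.random.default_rng(0)

# Hypothetical features (rows: bowl, stair, block); assumed values.
H = np.array([
    [1.0, 0.5, 0.0, 0.5, 1.0],
    [0.2, 0.2, 0.6, 0.6, 1.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
])
W = make_example1_weights(1000, rng)
X = W @ H

# By construction X has rank 3, i.e. three nonzero singular values.
rank = np.linalg.matrix_rank(X)
print(W.shape, rank)
```

The rank check mirrors the remark that the data has exactly three principal components with nonzero eigenvalues.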


Figure 3.7: Comparison of plain NMF (left) and detNMF (right). The results of 10 randomly initialized runs are shown. Top: W uniformly distributed; bottom: W sparse.

Figure 3.8: 3D visualization of the data distribution of example 1 (see text for details). Data and feature vectors are projected onto the three principal components of the data. In this space, the original feature vectors $H_{1*}$, $H_{2*}$, $H_{3*}$ constitute the edges of a tetrahedron with length 1. These features are exactly recovered by the detNMF algorithm (solid lines). Obviously, the sparse NMF approach fails to position all basis vectors correctly. Note that the normalized feature vectors deduced from the nnsc algorithm are drawn on a larger scale (dashed lines) to render them visible, as two feature vectors, deduced with both algorithms, almost coincide. All feature vectors intersect at the vertex located at the origin. Left: top view, the vertex at the origin is placed in the center of the figure. Right: side view, the left corner represents the origin of the axis system.

Figure 3.9: 3D visualization of the data distribution of example 2 (see text for details). Data and feature vectors are projected onto the three principal components of the data. In this space, the original feature vectors $H_{1*}$, $H_{2*}$, $H_{3*}$ constitute the edges of a tetrahedron with length 1. These features are recovered by both the nnsc algorithm (broken lines) and the detNMF algorithm (solid lines). All feature vectors intersect at the vertex located at the origin, which is placed in the center of the figure.

After random initialization of W and H, we ran both algorithms (detNMF, nnsc) until the reconstruction error ($(NM)^{-1} E(\alpha = 0)$ in eq. 3.8 and $(NM)^{-1} E_{nnsc}(\lambda = 0)$ in eq. 3.16, respectively) was smaller than $10^{-10}$. Again, the detNMF algorithm recovered the correct feature vectors. On the other hand, the nnsc algorithm produced a solution with a smaller value of $\sum_{i,j} W_{ij}$, but also

In document Extensions of non-negative matrix factorization and their application to the analysis of wafer test data (Page 40-45)