AltMin for Regularized Formulation - Alternating Minimization

2.1 Alternating Minimization

2.1.2 AltMin for Regularized Formulation

It is reasonable to add a regularizer $\|X\|_F^2 + \|Y\|_F^2$ to control the norms of $X$ and $Y$. However, as we will see later, the regularizer helps not because it controls the norms of the iterates (in fact, it does not!), but because it controls the row-norms of the iterates, though via an unknown mechanism.

Using AltMin to solve the regularized formulation (1.13) is quite similar to the original AltMin, and we denote this algorithm as AltMinReg; see details in Table 2.3.

Table 2.3: AltMinReg

Initialization: randomly generate $(X_0, Y_0)$.

The $k$-th iteration:

$$X_k \leftarrow \arg\min_X \; F(X, Y_{k-1}) + \lambda \|X\|_F^2,$$

$$Y_k \leftarrow \arg\min_Y \; F(X_{k-1}, Y) + \lambda \|Y\|_F^2.$$

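Similar to AltMin, each subproblem of AltMinReg can still be solved in closed form. In fact, $(x_1^*, \dots, x_m^*) \triangleq (\arg\min_X [F(X, Y) + \lambda \|X\|_F^2])^T$ and $(y_1^*, \dots, y_n^*) \triangleq (\arg\min_Y [F(X, Y) + \lambda \|Y\|_F^2])^T$ are given by

$$x_i^* = \Big( \sum_{j \in \Omega_i^x} y_j y_j^T + 2\lambda I \Big)^{\dagger} \Big( \sum_{j \in \Omega_i^x} M_{ij} y_j \Big), \quad i = 1, \dots, m,$$

$$y_j^* = \Big( \sum_{i \in \Omega_j^y} x_i x_i^T + 2\lambda I \Big)^{\dagger} \Big( \sum_{i \in \Omega_j^y} M_{ij} x_i \Big), \quad j = 1, \dots, n. \tag{2.8}$$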
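The closed-form updates in (2.8) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names `update_rows` and `altmin_reg`, the boolean `mask` representation of $\Omega$, the iteration count, and the initialization scale are all choices made here. The Y-update uses the freshly computed X, which is the usual alternating convention and is also an assumption of this sketch.

```python
import numpy as np

def update_rows(M, mask, Y, lam):
    """Solve the regularized least-squares subproblem for each row x_i, as in (2.8).

    M    : (m, n) array; only entries where mask is True are used
    mask : (m, n) boolean array, True where (i, j) is observed
    Y    : (n, r) array holding the fixed factor
    lam  : regularization coefficient lambda
    """
    m, r = M.shape[0], Y.shape[1]
    X = np.zeros((m, r))
    for i in range(m):
        Yi = Y[mask[i]]                          # rows y_j with j observed in row i
        A = Yi.T @ Yi + 2.0 * lam * np.eye(r)    # sum_j y_j y_j^T + 2*lambda*I
        b = Yi.T @ M[i, mask[i]]                 # sum_j M_ij y_j
        X[i] = np.linalg.pinv(A) @ b             # pseudo-inverse, matching the dagger
    return X

def altmin_reg(M, mask, r, lam, iters=50, seed=0):
    """Alternate the two closed-form updates (hypothetical driver loop)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    # Random initialization; the scale is an assumption of this sketch.
    X = rng.normal(scale=np.sqrt(1.0 / m), size=(m, r))
    Y = rng.normal(scale=np.sqrt(1.0 / n), size=(n, r))
    for _ in range(iters):
        X = update_rows(M, mask, Y, lam)         # X-step with Y fixed
        Y = update_rows(M.T, mask.T, X, lam)     # Y-step: same formula on the transpose
    return X, Y
```

The Y-step reuses the same routine on the transposed problem, since the two formulas in (2.8) are symmetric in the roles of $X$ and $Y$. Each row requires one $r \times r$ solve, which is why the per-iteration cost matches that of the unregularized updates.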

For a given λ, the computation cost and the memory space for AltMinReg are almost the same as those for original AltMin. Nevertheless, it may take extra time to pick a good parameter λ (usually by cross validation).

While the SVD factors of the true matrix $M$ form one global optimal solution of (1.10), they may not be a global optimal solution of (1.13). The global optimal value of (1.13) can differ from the global optimal value of (1.10), which is 0, by an amount proportional to $\lambda$. More precisely, suppose $(U, V)$ is a global optimal solution of $F(X, Y) = \frac{1}{2}\|P_\Omega(M - XY^T)\|_F^2$, i.e. $UV^T = M$, and $(\hat{X}, \hat{Y})$ is a global optimal solution of $F_1(X, Y) = F(X, Y) + \lambda(\|X\|_F^2 + \|Y\|_F^2)$. Due to the optimality of $(\hat{X}, \hat{Y})$, we have

$$F_1(\hat{X}, \hat{Y}) \le F_1(U, V) = \lambda(\|U\|_F^2 + \|V\|_F^2),$$

which implies

$$F(\hat{X}, \hat{Y}) \le \lambda(\|U\|_F^2 + \|V\|_F^2) - \lambda(\|\hat{X}\|_F^2 + \|\hat{Y}\|_F^2).$$

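Choosing $U = \hat{U}\Sigma^{1/2}$ and $V = \hat{V}\Sigma^{1/2}$ (assuming $M = \hat{U}\Sigma\hat{V}^T$ is the SVD), we have $\|U\|_F^2 = \|V\|_F^2 = \|M\|_*$, and the above relation becomes

$$F(\hat{X}, \hat{Y}) \le \lambda(2\|M\|_* - \|\hat{X}\|_F^2 - \|\hat{Y}\|_F^2).$$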

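The training error corresponding to $(\hat{X}, \hat{Y})$ is then bounded as

$$\mathrm{RMSE}_{\mathrm{train}}^2 = \frac{F(\hat{X}, \hat{Y})}{\|P_\Omega M\|_F^2 / 2} \le \frac{2\lambda}{\|P_\Omega M\|_F^2}\big(2\|M\|_* - \|\hat{X}\|_F^2 - \|\hat{Y}\|_F^2\big). \tag{2.9}$$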

The above relation is a necessary condition for $(\hat{X}, \hat{Y})$ to be a global optimizer of the regularized function, and can be used to test the global optimality of $(\hat{X}, \hat{Y})$. This criterion can help understand how well the regularized function has been solved.
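The necessary condition (2.9) can be checked numerically in a synthetic experiment, where the true matrix $M$ (and hence its nuclear norm) is known. The sketch below is one possible way to do this; the function name `check_bound`, the boolean `mask` for $\Omega$, and the small tolerance `tol` are assumptions introduced here.

```python
import numpy as np

def check_bound(M, mask, X_hat, Y_hat, lam, tol=1e-12):
    """Evaluate both sides of (2.9) for a candidate pair (X_hat, Y_hat).

    Returns (rmse2, bound, ok), where ok is True when the necessary
    condition rmse2 <= bound holds up to the tolerance tol.
    """
    residual = mask * (M - X_hat @ Y_hat.T)       # P_Omega(M - X Y^T)
    F = 0.5 * np.sum(residual ** 2)               # unregularized objective
    denom = np.sum((mask * M) ** 2)               # ||P_Omega M||_F^2
    rmse2 = 2.0 * F / denom                       # left-hand side of (2.9)
    nuc = np.linalg.norm(M, ord="nuc")            # nuclear norm of the true M
    bound = 2.0 * lam / denom * (
        2.0 * nuc - np.sum(X_hat ** 2) - np.sum(Y_hat ** 2)
    )
    return rmse2, bound, rmse2 <= bound + tol
```

A pair that violates the inequality cannot be a global optimizer of the regularized function. Passing the check does not prove global optimality, since the condition is only necessary.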

We consider the same setting as before, i.e. the original matrix $M$ is generated by $M = UV^T$, where $U, V \in \mathbb{R}^{1000 \times 10}$ have independent Gaussian entries with zero mean and variance 1/1000, and $\Omega$ is generated by a Bernoulli model with parameter $p$. We consider a random initial point $X_0, Y_0 \in \mathbb{R}^{1000 \times 10}$ with independent Gaussian entries (also zero mean and variance 1/1000). The figures are obtained by averaging 100 Monte Carlo runs.
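One instance of this experimental setting could be generated as in the sketch below. The function name `make_instance`, its parameter names, and the use of a seeded NumPy generator are assumptions for illustration; only the dimensions, the variance, and the Bernoulli sampling model come from the text.

```python
import numpy as np

def make_instance(m=1000, n=1000, r=10, p=0.05, var=1.0 / 1000, seed=0):
    """Draw one synthetic matrix completion instance.

    Returns the true matrix M, the observation mask (True with
    probability p, independently per entry), and a random initial
    point (X0, Y0) with the same Gaussian distribution as the factors.
    """
    rng = np.random.default_rng(seed)
    std = np.sqrt(var)
    U = rng.normal(scale=std, size=(m, r))
    V = rng.normal(scale=std, size=(n, r))
    M = U @ V.T                                   # rank-r ground truth
    mask = rng.random((m, n)) < p                 # Bernoulli(p) sampling of Omega
    X0 = rng.normal(scale=std, size=(m, r))
    Y0 = rng.normal(scale=std, size=(n, r))
    return M, mask, X0, Y0
```

Averaging over Monte Carlo runs then amounts to calling the function with different seeds and aggregating the resulting errors.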

Fig. 2.7 shows the performance of AltMinReg for p = 0.05, under two choices of $\lambda$: $\lambda = 10^{-3}$ and $\lambda = 10^{-4}$. It can be seen from the figure that for both choices of $\lambda$ the training error and the test error both converge to a small number. This greatly improves the success probability compared to the unregularized AltMin (the success probability of AltMin for p = 0.05 is between 20% and 30%). We summarize the finding below:

Observation 2.1.5 Regularizers $\lambda(\|X\|_F^2 + \|Y\|_F^2)$ can be very helpful for AltMin even in the noiseless case. In particular, in a certain setting the probability of successful recovery can be increased from less than 30% to more than 99% by using AltMinReg.

Figure 2.7: Performance of AltMinReg, i.e. AltMin for the MF formulation with the regularizer $\lambda(\|X\|_F^2 + \|Y\|_F^2)$. (a) $\lambda = 10^{-3}$; (b) $\lambda = 10^{-4}$. m = n = 1000, r = 10, p = 0.05.

We present various quantities of the convergent solutions for 10 experiments in Fig. 2.8 and Fig. 2.9. As shown in the two tables, the incoherence constants of $X$ are all between 2.8 and 3.2, and the incoherence constants of $Y$ are all between 3.4 and 3.8, regardless of the value of $\lambda$. The incoherence constant of $XY^T$ is around 7.5 for $\lambda = 10^{-3}$ and around 6.5 for $\lambda = 10^{-4}$, both of which are rather small. The choice of $\lambda = 10^{-3}$ leads to a much larger test error than $\lambda = 10^{-4}$, because a large $\lambda$ leads to a large distortion of the optimal solution.

Figure 2.8: Various quantities related to the convergent solution of AltMinReg with $\lambda = 10^{-3}$ in 10 experiments. m = n = 1000, r = 10, p = 0.05. The incoherence constants of $X$ and $Y$ are all below 4, and the incoherence constant of $XY^T$ is around 7.5.

Figure 2.9: Various quantities related to the convergent solution of AltMinReg with $\lambda = 10^{-4}$ in 10 experiments. m = n = 1000, r = 10, p = 0.05. The incoherence constants of $X$ and $Y$ are all below 4, and the incoherence constant of $XY^T$ is around 6.5.

We claim that AltMinReg improves upon AltMin because AltMinReg controls the incoherence constants, not because it controls the norms of $X$ and $Y$. To validate this claim, we give an example in which the norms of $X$ and $Y$ are well controlled but the test error is still high. We run 10 experiments for $\lambda = 3 \times 10^{-5}$, and record the results for each experiment in the table of Fig. 2.10 and the average results for success/failure in Table 2.4. In Figure 2.10, Columns 1-4 indicate failed instances (test error > 0.3),⁴ and Columns 5-10 indicate successful instances (test error < 0.006). The norms of $X, Y$ for the failed instances are well controlled and almost indistinguishable from the norms of $X, Y$ for the successful instances; thus the norms of $X, Y$ are not indicators of success. In contrast, the incoherence constants for failed and successful instances are hugely different. More specifically, either $\mu(XY^T)$ or $\max\{\mu(X), \mu(Y)\}$ is much larger in failed instances than the corresponding quantity in successful instances.

Figure 2.10: Results for AltMinReg with $\lambda = 3 \times 10^{-5}$ in 10 experiments. m = n = 1000, r = 10, p = 0.05. Columns 1-5 indicate failed instances, and Columns 6-10 indicate successful instances. The norms of $X, Y$ for the failed and successful instances are rather close, but the incoherence constants for failed and successful instances are hugely different. See Table 2.4 for an averaged version of this table.

We summarize the observations obtained from Figure 2.10 and Table 2.4 below.

Observation 2.1.6 When applying AltMinReg, the norms of $X, Y$ for the failed instances are almost indistinguishable from the norms of $X, Y$ for the successful instances. In contrast, either $\mu(XY^T)$ or $\max\{\mu(X), \mu(Y)\}$ is much larger in failed instances than the corresponding quantities in successful instances.
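To make the contrast between norms and row-norm concentration concrete, the sketch below computes a simple row-norm ratio for a factor matrix. This section does not restate the definition of the incoherence constant $\mu$, so the quantity here is an assumed proxy (largest squared row norm divided by the average squared row norm) and may differ from the definition used elsewhere in the document. The function name `row_norm_ratio` is hypothetical.

```python
import numpy as np

def row_norm_ratio(X):
    """Largest squared row norm divided by the average squared row norm.

    Assumed proxy for incoherence: the value is 1 when all rows have
    equal norm and grows up to the number of rows when a single row
    carries all of the Frobenius norm.
    """
    sq = np.sum(X ** 2, axis=1)   # squared Euclidean norm of each row
    return sq.max() / sq.mean()

# Two matrices with the same Frobenius norm but very different row profiles.
flat = np.ones((4, 2))                       # all rows equal
spiky = np.zeros((4, 2))
spiky[0] = [2.0, 2.0]                        # one row holds all the mass
print(np.linalg.norm(flat), np.linalg.norm(spiky))
print(row_norm_ratio(flat), row_norm_ratio(spiky))
```

The two example matrices have identical Frobenius norms, yet their row-norm ratios differ, which mirrors the point of the observation: a norm-based quantity cannot distinguish them, while a row-based quantity can.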

⁴ Note that in some applications a test error of 0.3 might be acceptable, but since a much smaller test error < 0.01 can be achieved, we still view a test error of 0.3 as a failure.

| | Succeed | Fail |
|---|---|---|
| Average test error | 5.5e-3 | 0.54 |
| Average $\max\{\|X\|_F, \|Y\|_F\}$ | 4.65 | 4.76 |
| Average $\mu(XY^T)$ | 7.2 | 70.2 |
| Average $\mu(X)$ | 2.96 | 120.3 |
| Average $\mu(Y)$ | 2.92 | 16 |

Table 2.4: Comparison of successful and failed instances for AltMinReg, $\lambda = 3 \times 10^{-5}$, m = n = 1000, r = 10, p = 0.05. This table is an averaged version of the table in Figure 2.10.

This observation clearly validates our previous claim, which is restated below.

Claim 2.1.1 (Empirical claim) The reason why AltMinReg improves upon AltMin for some p is not that it can control the norms of X, Y , but that it can control the incoherence constants of X, Y , even though the underlying mechanism is unknown.

In the above experiments where p = 5r/n = 0.05, we can obtain a test error < $10^{-3}$ for properly chosen $\lambda$. However, as p becomes smaller, the regularizer coefficient $\lambda$ must be larger to control the incoherence, resulting in larger training and test errors. Simulation results show that when p = 4r/n = 0.04, AltMinReg can achieve a training error below 0.05 and a test error below 0.1; when p = 3r/n = 0.03, AltMinReg can only achieve a test error of around 0.2.

In document Matrix Completion via Nonconvex Factorization: Algorithms and Theory (Page 51-56)