Main Results - EFFICIENT ALGORITHMS FOR COLLABORATIVE FILTERING

proposed in [72] in the context of column sampling. A similar approach could be developed for the uniform sampling of entries.

(iv) The Hessian of F can be computed explicitly as well. This opens the way to quadratically convergent minimization algorithms (e.g. the Newton method).

However, the computational complexity of such a procedure might limit its applicability.

Figure 4.4 illustrates the effectiveness of the manifold optimization step. Here, we present the results of our simulations with 1000 × 1000 matrices of rank 10. We generate the matrix as M = UV^T where U_ij, V_ij ∼ N(0, 1). We plot both the fit error $\|P_E(M - \widehat{M})\|_F/\sqrt{|E|}$ and the prediction error $\|M - \widehat{M}\|_F/n$ as a function of the number of iterations of gradient descent for two different revealed set sizes, |E| = 10⁵ and |E| = 2 · 10⁵. We see that the prediction error decays exponentially with the number of iterations. Further, the prediction error is close to the fit error, thus validating our use of the fit error as a stopping criterion. We refer to Chapter 6 for more extensive simulations demonstrating the performance of OptSpace on simulated and real datasets.
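The two error measures plotted in Figure 4.4 can be sketched in a few lines of code. The following is a minimal illustration, not the simulation code behind the figure: the helper names `fit_and_prediction_error` and `random_low_rank`, the small sizes, and the artificial estimate `M_hat` (the true matrix shifted by a constant) are all assumptions made for the example, and square matrices (m = n) are assumed as in the simulations above.

```python
import math
import random


def fit_and_prediction_error(M, M_hat, E):
    """Return (fit error, prediction error) for an estimate M_hat of M.

    fit error        = ||P_E(M - M_hat)||_F / sqrt(|E|)  (revealed entries only)
    prediction error = ||M - M_hat||_F / n               (all entries, m = n)
    """
    n = len(M)
    fit_sq = sum((M[i][j] - M_hat[i][j]) ** 2 for (i, j) in E)
    pred_sq = sum((M[i][j] - M_hat[i][j]) ** 2
                  for i in range(n) for j in range(n))
    return math.sqrt(fit_sq / len(E)), math.sqrt(pred_sq) / n


def random_low_rank(n, r, rng):
    """M = U V^T with i.i.d. N(0, 1) entries in the n x r factors U and V."""
    U = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    return [[sum(U[i][k] * V[j][k] for k in range(r)) for j in range(n)]
            for i in range(n)]


rng = random.Random(0)
n, r = 30, 2                      # hypothetical small sizes, for speed
M = random_low_rank(n, r, rng)
cells = [(i, j) for i in range(n) for j in range(n)]
E = rng.sample(cells, 300)        # uniformly random revealed set of size 300

# Hypothetical estimate: the true matrix with every entry shifted by 0.01.
M_hat = [[M[i][j] + 0.01 for j in range(n)] for i in range(n)]
fit_err, pred_err = fit_and_prediction_error(M, M_hat, E)
```

In this artificial case the error is spread evenly over all entries, so both measures equal 0.01 up to rounding. Only the fit error is computable without knowing M, which is why it can serve as a stopping criterion when it tracks the prediction error.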

4.2 Main Results

We characterize the performance of OptSpace based on the following analytical results. Since we are interested in large datasets, we shall strive to prove performance guarantees that are asymptotically optimal for large m and n. However, our main results are completely non-asymptotic and provide bounds for any m and n.

We begin by analyzing the “Projection” step of OptSpace. There are efficient algorithms for finding the rank-r projection of a sparse matrix and the complexity of the whole procedure is O(|E|r log n). Our first result bounds the estimation error for this simple procedure.

Figure 4.4: Empirical demonstration of the rate of convergence of Manifold Optimization. We plot the fit error and the prediction error as a function of the number of iterations of gradient descent. Simulations with m = n = 1000, r = 10, and ε = |E|/n = 100 and 200.

Theorem 4.2.1. Let M be a rank r matrix of dimensions nα × n that satisfies |M_{i,j}| ≤ M_max for all (i, j) ∈ [nα] × [n]. Assume that the revealed set E ⊂ [nα] × [n] is uniformly random given the size |E|. Then there is a constant C such that with probability larger than 1 − 1/n³

Recall that $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_2$ denotes the operator norm.

Further, the left hand side is simply the root mean squared error of the estimation.

Observe that we do not need any of the incoherence assumptions for this result.

Projection is a standard procedure employed for dimensionality reduction. Indeed, procedures similar to our Projection step have been widely used in learning. The above result is a rigorous analysis of this intuitive step. The error bound achieved above consists of two terms. The first term corresponds to the missing entries. Indeed, the bound is non-trivial as soon as the number of revealed entries is larger, in order, than the number of degrees of freedom O(nr). The second term in the error bound corresponds to the noise W.


The second main result provides performance guarantees for the entire algorithm, i.e., the output of the manifold optimization step. This theorem is order-optimal in a number of important circumstances, including the noiseless case (bounded r and µ) and the case of i.i.d. Gaussian noise. We refer to Chapter 6 for further comparisons.

Theorem 4.2.2. Let M be a rank r matrix of dimensions nα × n satisfying the incoherence conditions with parameter µ. Let $\Sigma_{\min} = \Sigma_1 \le \cdots \le \Sigma_r = \Sigma_{\max}$ be the singular values of M and define $\kappa \equiv \Sigma_{\max}/\Sigma_{\min}$. Assume that the revealed set E ⊂ [nα] × [n] is uniformly random given the size |E|. Let $\widehat{M}$ be the output of OptSpace given the input $N^E = M^E + W^E$. Then there exist numerical constants C and C′ such that if

$$|E| \ge C\, n \mu r \alpha \kappa^2 \max\left\{ \log n \,;\; \mu r \sqrt{\alpha}\, \kappa^4 \right\}$$

then, with probability at least 1 − 1/n³,

$$\frac{1}{\sqrt{mn}} \|M - \widehat{M}\|_F \le C' \, \frac{n \kappa^2 \sqrt{r\alpha}}{|E|} \, \|W^E\|_2 \qquad (4.23)$$

provided that the right-hand side is smaller than $\Sigma_{\min}/(n\sqrt{\alpha})$.
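To get a feel for how the quantities in Theorem 4.2.2 scale, the sample-size condition and the right-hand side of (4.23) can be evaluated numerically. The sketch below is only illustrative: the theorem does not specify the numerical constants C and C′, so the defaults `C=1.0` and `C_prime=1.0`, the function names, and the example parameter values are assumptions, and the resulting numbers indicate scaling only, not actual guarantees. The proviso that the right-hand side be smaller than $\Sigma_{\min}/(n\sqrt{\alpha})$ is not checked.

```python
import math


def sample_size_threshold(n, alpha, r, mu, kappa, C=1.0):
    """C n mu r alpha kappa^2 max{log n, mu r sqrt(alpha) kappa^4}.

    C is an unspecified numerical constant; C=1.0 is a placeholder.
    """
    return C * n * mu * r * alpha * kappa ** 2 * max(
        math.log(n), mu * r * math.sqrt(alpha) * kappa ** 4)


def rmse_bound(n, alpha, r, kappa, num_revealed, noise_op_norm, C_prime=1.0):
    """Right-hand side of (4.23): C' n kappa^2 sqrt(r alpha) / |E| * ||W^E||_2.

    C_prime is an unspecified numerical constant; 1.0 is a placeholder.
    """
    return (C_prime * n * kappa ** 2 * math.sqrt(r * alpha)
            / num_revealed * noise_op_norm)


# Hypothetical parameters: square matrix (alpha = 1), perfectly conditioned
# (kappa = 1), incoherence parameter mu = 1, rank 10, n = 1000.
threshold = sample_size_threshold(n=1000, alpha=1.0, r=10, mu=1.0, kappa=1.0)

# Doubling the number of revealed entries halves the bound for fixed noise norm.
bound_small = rmse_bound(1000, 1.0, 10, 1.0, 100_000, noise_op_norm=50.0)
bound_large = rmse_bound(1000, 1.0, 10, 1.0, 200_000, noise_op_norm=50.0)
```

With these placeholder values the second term in the maximum dominates (µr√α κ⁴ = 10 exceeds log 1000 ≈ 6.9), and the strong dependence on κ (a factor κ⁶ in that regime) is visible directly in the formula.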

As was noted before, the cleaning step essentially eliminates the term corresponding to the missing entries in Theorem 4.2.1. In Figure 4.5, we demonstrate the performance of OptSpace in practice. Here, we plot the reconstruction rate, i.e., the empirical probability of recovering the matrix M within a tolerance of 10⁻⁴, meaning $\|M - \widehat{M}\|_F/\|M\|_F \le 10^{-4}$, as a function of the sampling probability p = |E|/mn.

The matrices M are generated as UV^T with U_ij, V_ij ∼ N(0, 1). We conduct the experiments with m = n = 500 and for ranks 10, 20 and 40. For comparison, we have also plotted the upper bound on reconstruction error computed using rigidity theory [98]. It is clear from the plots that OptSpace performs close to the fundamental limit even in practice. Results of more extensive simulations are presented in Section 6.3.

Figure 4.5: Empirical reconstruction probability for OptSpace as a function of the sampling probability |E|/n². Simulations with m = n = 500 and r = 10, 20, 40. The plain red curve is the fundamental limit computed using rigidity theory [98].

A key observation concerning the results presented is that the dependence on noise in Theorems 4.2.1 and 4.2.2 is only through the operator norm of $\widetilde{W}^E$. This is of significance since in many cases of interest (when the noise is not spectrally concentrated, e.g., i.i.d. sub-Gaussian entries), the operator norm $\|\widetilde{W}^E\|_2$ is much smaller than the noise intensity, typically measured by the Frobenius norm $\|\widetilde{W}^E\|_F$. In order to gain more intuition about the results, it is instructive to consider a couple of simple models for the noise matrix W:

Independent entries model. We assume that W's entries are independent random variables, with zero mean $\mathbb{E}\{W_{ij}\} = 0$ and sub-Gaussian tails. The latter means that

$$\mathbb{P}\{|W_{ij}| \ge x\} \le 2\, e^{-\frac{x^2}{2\sigma^2}},$$

for some bounded constant σ².

Worst case model. In this model W is arbitrary, but we have a uniform bound on the size of its entries: |W_ij| ≤ W_max.
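The gap between the operator norm and the Frobenius norm under these two models can be checked numerically. The sketch below is an illustration under simplifying assumptions rather than part of the analysis: it works with a full square noise matrix W instead of the revealed part, uses standard Gaussian entries as one instance of the independent entries model, uses the all-ones matrix as one spectrally concentrated instance of the worst case model (W_max = 1), and estimates the operator norm with a plain power iteration. The function names and the size n = 40 are hypothetical.

```python
import math
import random


def frobenius_norm(W):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in W for x in row))


def operator_norm(W, iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W.

    The returned value ||W v|| for a unit vector v never exceeds the true
    operator norm, so it is a lower estimate that improves with iterations.
    """
    rng = random.Random(seed)
    m, n = len(W), len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = W v
        w = [sum(W[i][j] * u[i] for i in range(m)) for j in range(n)]  # w = W^T u
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
    u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(m)]
    return math.sqrt(sum(x * x for x in u))


n = 40
rng = random.Random(1)

# Independent entries model: i.i.d. N(0, 1) entries.
W_iid = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
op_iid, fro_iid = operator_norm(W_iid), frobenius_norm(W_iid)

# Worst case model instance: all entries equal to W_max = 1 (rank one).
W_ones = [[1.0] * n for _ in range(n)]
op_ones, fro_ones = operator_norm(W_ones), frobenius_norm(W_ones)
```

For the Gaussian matrix the operator norm is on the order of √n while the Frobenius norm is on the order of n, so the ratio shrinks as n grows. For the all-ones matrix the two norms coincide (both equal n), which is the spectrally concentrated situation in which a bound through the operator norm gives no improvement.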

The basic parameter entering our main results is the operator norm of $\widetilde{W}^E$, which is bounded as follows.
