
2.4 Learning

2.4.3 Generative Model: Matrix Factorization

One generative approach to modeling high-dimensional samples ($x_i \in \mathbb{R}^D$) is to arrange them as columns (or rows) of a matrix (say $X \in \mathbb{R}^{D \times N}$) and derive a decomposition as an abstract summarization of the data. In linear algebra, matrix factorization is the decomposition of a matrix into a canonical form, e.g.,

$$X = BC + E, \qquad B \in \mathcal{B},\ C \in \mathcal{C},\ E \in \mathcal{E} \tag{2.4.3}$$

where $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{E}$ denote the sets of feasible choices for $B$, $C$, and $E$, respectively. Different choices of $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{E}$ yield various decomposition methods. For example, assuming $E = 0$ (i.e., Eq. 2.4.3 is an exact decomposition): 1) if $\mathcal{B}$ is the set of all “orthogonal” matrices and $\mathcal{C}$ is the set of all “orthonormal” ones, Eq. 2.4.3 is the Singular Value Decomposition (SVD) method²¹; 2) if $\mathcal{B}$ is the set of lower triangular matrices and $\mathcal{C}$ is the set of upper triangular ones, then Eq. 2.4.3 is the LU decomposition; etc.

There are many flavors of matrix factorization; here we focus only on low-rank matrix approximation. By low-rank matrix approximation we mean that $\mathrm{rank}(X) > \mathrm{rank}(BC)$ and that $E$ denotes an error or noise matrix whose entries should be close to zero; hence $X \approx BC$. We need a measure of distance $D(\cdot\,;\cdot)$ (a divergence) to quantify the quality of the approximation; therefore, Eq. 2.4.3 can be written as an optimization problem

$$\min_{B,\,C}\ D(X; BC) \quad \text{subject to: } B \in \mathcal{B},\ C \in \mathcal{C} \tag{2.4.4}$$

Here we show a few examples of popular algorithms that can be cast as low-rank matrix approximation. Most dictionary learning methods can be viewed as variations of Eq. 2.4.4, e.g., k-SVD [7], Non-negative Matrix Factorization [141], Independent Component Analysis (ICA) [25], etc. [75], [190]. Table 2.2 lists some other examples of popular methods that can be described by $X \approx BC$ (for more examples see [190]). For illustration purposes, we derive a matrix factorization for k-means clustering, which is widely known as a straightforward and fairly efficient method for solving unsupervised learning problems.
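As a minimal sketch of Eq. 2.4.4 in its simplest setting, the snippet below takes $D$ to be the squared Frobenius norm and imposes only a rank constraint. In that case truncating the SVD gives a best rank-$r$ approximation (the Eckart-Young theorem). The helper name `low_rank_factors`, the random test matrix, and the choice $r = 2$ are illustrative assumptions and do not come from the text.

```python
import numpy as np

def low_rank_factors(X, r):
    """Return B (D x r) and C (r x N) with X ≈ B @ C via truncated SVD.

    Hypothetical helper for illustration only. B has orthonormal columns
    and C = diag(s[:r]) @ Vt[:r] has mutually orthogonal rows.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    B = U[:, :r]
    C = s[:r, None] * Vt[:r, :]   # scale each of the first r rows of Vt
    return B, C

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))    # D = 5 features, N = 8 samples (assumed sizes)
B, C = low_rank_factors(X, r=2)
E = X - B @ C                      # residual (error) matrix

# The squared Frobenius error equals the sum of the discarded
# squared singular values.
s = np.linalg.svd(X, compute_uv=False)
err = np.linalg.norm(E, "fro") ** 2
discarded = np.sum(s[2:] ** 2)
```

Here `B.T @ B` is the identity and `C @ C.T` is diagonal, which is one way of splitting the SVD factors into a basis matrix and a coefficient matrix.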

²¹ Typically, SVD is represented as $X = U \Sigma V^T$, where $U \in \mathbb{R}^{D \times r}$ and $V \in \mathbb{R}^{N \times r}$ are orthonormal matrices and $\Sigma$ is

Table 2.2: Examples of well-known methods that can be viewed as matrix factorization: Singular Value Decomposition (SVD), k-means/medians, Probabilistic Latent Semantic Indexing (pLSI), and Non-negative Matrix Factorization (NMF). In the table, $\|\cdot\|_F^2$ denotes the squared Frobenius norm, $\Lambda$ is a diagonal matrix, and KL denotes the Kullback-Leibler divergence [65].

Method       D(X;BC)            B                                                  C
SVD          $\|X - BC\|_F^2$   $B^T B = I$                                        $C C^T = \Lambda$
k-means      $\|X - BC\|_F^2$   -                                                  $C C^T = I$, $c_{ij} \in \{0, 1\}$
k-medians    $\|X - BC\|_1$     -                                                  $C C^T = I$, $c_{ij} \in \{0, 1\}$
pLSI [109]   $KL(X; BC)$        $\mathbf{1}^T B \mathbf{1} = 1$, $b_{ij} \ge 0$    $\mathbf{1}^T C = \mathbf{1}^T$, $c_{ij} \ge 0$
NMF [141]    $KL(X; BC)$        $b_{ij} \ge 0$                                     $c_{ij} \ge 0$

Example 1: k-means clustering is a method of cluster analysis that aims to partition $N$ observations into $K$ clusters, in which each observation belongs to the cluster with the nearest mean. The difference between the k-means algorithm and its soft version is that the variable describing how data points belong to clusters takes “degree” values instead of binary (0 and 1) values. Assuming that each of the $N$ observations ($x_i$) belongs to a $D$-dimensional feature space ($x_i \in \mathbb{R}^D$):

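hard k-means:

$$\min_{c_{ki},\, b_k}\ \sum_{i=1}^{N} \Big\| x_i - \sum_{k=1}^{K} b_k c_{ki} \Big\|_2^2 \quad \text{s.t.: } \sum_{k=1}^{K} c_{ki} = 1,\ c_{ki} \in \{0, 1\}$$

soft k-means:

$$\min_{c_{ki},\, b_k}\ \sum_{i=1}^{N} \Big\| x_i - \sum_{k=1}^{K} b_k c_{ki} \Big\|_2^2 \quad \text{s.t.: } \sum_{k=1}^{K} c_{ki} = 1,\ c_{ki} \ge 0 \tag{2.4.5}$$

where $b_k$ are the centroids of the clusters and $c_{ki}$ are the cluster membership values. Because of the constraint, $\{c_{ki}\}_{k=1}^{K}$ can be viewed as probabilities or membership values.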

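Alternatively, one can view Eq. 2.4.5 as a constrained matrix factorization problem:

$$\min_{C,\,B}\ \|X - BC\|_F^2 \quad \text{subject to } C \in \mathcal{C} \tag{2.4.6}$$

where $\mathcal{C} := \{c_i : c_i \ge 0,\ \mathbf{1}^T c_i = 1,\ 1 \le i \le N\}$ for soft k-means, with $c_i$ denoting the $i$-th column of $C$; for hard k-means the entries are additionally restricted to binary values, $C \in \{0, 1\}^{K \times N}$. $X \in \mathbb{R}^{D \times N}$ is the matrix holding the observations; each column of $X$ is a sample.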
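The step from Eq. 2.4.5 to Eq. 2.4.6 follows by splitting the squared Frobenius norm over the columns of the residual. As a short worked derivation, writing $c_i = (c_{1i}, \dots, c_{Ki})^T$ for the $i$-th column of $C$ and $b_k$ for the $k$-th column of $B$ (notation assumed here for the sketch):

```latex
\begin{aligned}
\|X - BC\|_F^2
  &= \sum_{i=1}^{N} \| x_i - B c_i \|_2^2 \\
  &= \sum_{i=1}^{N} \Big\| x_i - \sum_{k=1}^{K} b_k c_{ki} \Big\|_2^2 .
\end{aligned}
```

In the hard case exactly one $c_{ki}$ equals one in each column, so each summand reduces to the squared distance between $x_i$ and its assigned centroid.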

Figure 2.9: (a) shows some common choices for the discriminative loss function. Notice that the zero-one loss function is a sign function. (b) shows the maximum-margin hyperplane and margins for an SVM trained with samples of the two classes. (c) shows an example of a loss function for multi-class classification.

Similarly, the columns of $B \in \mathbb{R}^{D \times K}$ are cluster centroids and the columns of $C \in \mathbb{R}^{K \times N}$ hold the membership values. For brevity of notation, $\mathcal{C}$ encodes the feasible set for the columns of $C$ that was shown earlier in Eq. 2.4.5; its elements are the columns of the matrix $C$. In matrix nomenclature, $B$ and $C$ can be called the basis matrix and the coefficient matrix, respectively. Notice that, from the matrix factorization point of view, Eq. 2.4.6 clusters the columns of $X$ and the constraints are defined on the columns of $C$. If the constraint is defined on the rows of $B$ instead, the matrix factorization clusters the rows of $X$ instead of the columns, and the rows of $C$ play the role of centroids while the rows of $B$ hold the membership values.
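The factorization view of hard k-means can be sketched in a few lines of NumPy. The snippet below alternates between updating the one-hot coefficient matrix and the basis matrix (Lloyd-style updates) and then checks numerically that the Frobenius objective equals the usual sum of squared distances to the assigned centroids. The function name `kmeans_factorization`, the initialisation from random samples, the iteration count, and the toy data are all assumptions made for illustration, not a method prescribed by the text.

```python
import numpy as np

def kmeans_factorization(X, K, n_iter=20, seed=0):
    """Alternate between updating C (assignments) and B (centroids).

    X is D x N with one sample per column. Returns B (D x K) and a
    one-hot C (K x N). Illustrative sketch, not a tuned solver.
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    # Initialise centroids with K distinct samples (an arbitrary choice).
    B = X[:, rng.choice(N, size=K, replace=False)].copy()
    for _ in range(n_iter):
        # C-step: squared distance from every centroid to every sample (K x N).
        d2 = ((B[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=0)
        C = np.zeros((K, N))
        C[labels, np.arange(N)] = 1.0
        # B-step: each non-empty cluster's centroid is the mean of its columns.
        counts = C.sum(axis=1)
        nonempty = counts > 0
        B[:, nonempty] = (X @ C.T)[:, nonempty] / counts[nonempty]
    return B, C

# Toy data: two well-separated groups of 10 points in 2-D (assumed for the demo).
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(0.0, 0.1, (2, 10)), rng.normal(5.0, 0.1, (2, 10))])
B, C = kmeans_factorization(X, K=2)
labels = C.argmax(axis=0)

# Matrix-factorization objective versus the classical k-means objective.
mf_obj = np.linalg.norm(X - B @ C, "fro") ** 2
km_obj = sum(np.sum((X[:, i] - B[:, labels[i]]) ** 2) for i in range(X.shape[1]))
```

Clustering the rows of `X` instead amounts to running the same routine on `X.T`, since $X \approx BC$ is equivalent to $X^T \approx C^T B^T$.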