Spectral Methods for Learning
Latent Variable Models:
Unsupervised and Supervised Settings
Anima Anandkumar
Data vs. Information
Messy data: missing observations, gross corruptions, outliers.
High-dimensional regime: as data grows, more variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
Learning is finding a needle in a haystack.
Learning with big data: computationally challenging!
How to model information structures?
Latent variable models: incorporate hidden or latent variables.
Information structures: relationships between latent variables and observed data.
Basic approach: mixtures/clusters. The hidden variable is categorical.
Advanced: probabilistic models. Hidden variables have more general distributions. Can model mixed membership/hierarchical groups.
[Figure: graphical model with hidden variable h1 and observed variables x1, x2, x3, x4, x5.]
Latent Variable Models (LVMs)
Document modeling. Observed: words. Hidden: topics.
Social network modeling. Observed: social interactions. Hidden: communities, relationships.
Recommendation systems. Observed: recommendations (e.g., reviews). Hidden: user and business attributes.
LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g., computer vision and NLP.
Sparse Coding/Dictionary Learning
Sparse representations, low-dimensional hidden structures. A few dictionary elements make complicated shapes.
Associative Latent Variable Models
Supervised learning
Given labeled examples $\{(x_i, y_i)\}$, learn a classifier $\hat{y} = f(x)$. Associative/conditional models: $p(y|x)$.
Example: logistic regression: $E[y|x] = \sigma(\langle u, x \rangle)$.
Mixture of logistic regressions: $E[y|x, h] = g(\langle U h, x \rangle + \langle b, h \rangle)$
Multi-layer/deep network
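To make the mixture-of-logistic-regressions formula concrete, here is a minimal NumPy sketch of the conditional mean. It assumes the link g is the sigmoid and that h is a one-hot vector selecting a mixture component; the names `conditional_mean`, `U`, and `b` as arrays are illustrative choices, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_mean(x, h, U, b):
    """E[y | x, h] = g(<U h, x> + <b, h>), with g assumed to be the sigmoid."""
    return sigmoid((U @ h) @ x + b @ h)

d, k = 4, 3                      # input dimension, number of components (illustrative)
rng = np.random.default_rng(0)
U = rng.standard_normal((d, k))  # column j holds the weights of component j
b = rng.standard_normal(k)       # per-component bias
x = rng.standard_normal(d)
h = np.eye(k)[1]                 # one-hot hidden variable: component 1 is active
p = conditional_mean(x, h, U, b)
```

With a one-hot h the expression reduces to an ordinary logistic regression using column 1 of U and entry 1 of b, which is what makes this a mixture of logistic regressions.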
Challenges in Learning LVMs
Computational Challenges
Maximum likelihood is NP-hard in most scenarios.
Practice: local search approaches such as back-propagation, EM, and Variational Bayes have no consistency guarantees.
Sample Complexity
Sample complexity is exponential (w.r.t. the hidden variable dimension) for many learning methods.
Outline
1 Introduction
2 Spectral Methods
  Classical Matrix Methods
  Beyond Matrices: Tensors
3 Moment Tensors for Latent Variable Models
  Topic Models
  Network Community Models
  Experimental Results
4 Moment Tensors in Supervised Setting
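Classical Spectral Methods: Matrix PCA and CCA
Unsupervised setting: PCA. For centered samples $\{x_i\}$, find a projection $P$ with $\mathrm{Rank}(P) = k$ s.t.
$\min_P \frac{1}{n} \sum_{i \in [n]} \|x_i - P x_i\|^2$.
Result: eigen-decomposition of $S = \mathrm{Cov}(X)$.
Supervised setting: CCA. For centered samples $\{x_i, y_i\}$, find
$\max_{a,b} \frac{a^\top \hat{E}[x y^\top] b}{\sqrt{a^\top \hat{E}[x x^\top] a \; b^\top \hat{E}[y y^\top] b}}$.
Result: generalized eigen-decomposition.
[Figure: x and y, with the projection $\langle a, x \rangle$.]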
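The PCA statement above can be sketched in a few lines of NumPy: the rank-k projection that minimizes the average reconstruction error is built from the top-k eigenvectors of the sample covariance. The function names and the synthetic data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k orthogonal projection onto the top-k eigenvectors of the sample covariance."""
    Xc = X - X.mean(axis=0)               # center the samples
    S = Xc.T @ Xc / len(Xc)               # sample covariance (1/n normalization)
    vals, vecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    V = vecs[:, np.argsort(vals)[::-1][:k]]
    return V @ V.T

def reconstruction_error(X, P):
    """(1/n) * sum_i ||x_i - P x_i||^2 on centered samples (P is symmetric)."""
    Xc = X - X.mean(axis=0)
    return np.mean(np.sum((Xc - Xc @ P) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])  # synthetic data
k = 2
P = pca_projection(X, k)
err = reconstruction_error(X, P)
```

The minimum error equals the sum of the d - k smallest eigenvalues of the covariance, which gives a simple sanity check on the result.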
Shortcomings of Matrix Methods
Learning through spectral clustering:
Dimension reduction through PCA (on the data matrix).
Clustering on the projected vectors (e.g., k-means).
The basic method works only for single memberships. It fails to cluster under small separation.
Beyond SVD: Spectral Methods on Tensors
How to learn mixture models without separation constraints?
◮ PCA uses the covariance matrix of the data. Are higher-order moments helpful?
Unified framework?
◮ Moment-based estimation of probabilistic latent variable models?
SVD gives the spectral decomposition of matrices.
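Moment Matrices and Tensors
Multivariate moments in the unsupervised setting: $M_1 := E[x]$, $M_2 := E[x \otimes x]$, $M_3 := E[x \otimes x \otimes x]$.
Matrix
$E[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second-order tensor.
$E[x \otimes x]_{i_1, i_2} = E[x_{i_1} x_{i_2}]$.
For matrices: $E[x \otimes x] = E[x x^\top]$.
Tensor
$E[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third-order tensor.
$E[x \otimes x \otimes x]_{i_1, i_2, i_3} = E[x_{i_1} x_{i_2} x_{i_3}]$.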
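As a small illustration of these definitions, the empirical versions of the first three moments can be formed with `numpy.einsum`. This is a sketch on synthetic samples; the function name `empirical_moments` is hypothetical.

```python
import numpy as np

def empirical_moments(X):
    """Empirical M1, M2, M3 from an (n, d) sample matrix X (one sample per row)."""
    n = X.shape[0]
    M1 = X.mean(axis=0)                               # E[x]
    M2 = np.einsum("ni,nj->ij", X, X) / n             # E[x (x) x], a d x d matrix
    M3 = np.einsum("ni,nj,nk->ijk", X, X, X) / n      # E[x (x) x (x) x], d x d x d
    return M1, M2, M3

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))                    # synthetic samples, d = 4
M1, M2, M3 = empirical_moments(X)
```

Note that the third-order tensor needs d^3 entries, so forming it explicitly is only practical for moderate d.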
Multivariate Moments in Supervised Setting
Spectral Decomposition of Tensors
Matrix $M_2$: $M_2 = \sum_i \lambda_i \, u_i \otimes v_i = \lambda_1 u_1 \otimes v_1 + \lambda_2 u_2 \otimes v_2 + \cdots$
Tensor $M_3$: $M_3 = \sum_i \lambda_i \, u_i \otimes v_i \otimes w_i = \lambda_1 u_1 \otimes v_1 \otimes w_1 + \lambda_2 u_2 \otimes v_2 \otimes w_2 + \cdots$
$u \otimes v \otimes w$ is a rank-1 tensor since its $(i_1, i_2, i_3)$-th entry is $u_{i_1} v_{i_2} w_{i_3}$.
How to solve this non-convex problem?
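Decomposition of Orthogonal Tensors
$M_3 = \sum_i w_i \, a_i \otimes a_i \otimes a_i$.
Suppose $A$ has orthogonal columns (of unit norm).
$M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1$.
The $a_i$ are eigenvectors of the tensor $M_3$.
Analogous to matrix eigenvectors: $M v = M(I, v) = \lambda v$.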
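The eigenvector identity can be checked numerically. The sketch below builds a tensor from orthonormal columns (obtained here via a QR factorization, an illustrative choice) and contracts it as M3(I, a, a); the helper name `contract` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
A, _ = np.linalg.qr(rng.standard_normal((d, k)))    # orthonormal columns a_1..a_k
w = np.array([0.5, 0.3, 0.2])                       # illustrative weights
M3 = np.einsum("i,ai,bi,ci->abc", w, A, A, A)       # sum_i w_i a_i (x) a_i (x) a_i

def contract(T, u, v):
    """T(I, u, v): contract the second and third modes of T with u and v."""
    return np.einsum("abc,b,c->a", T, u, v)

a1 = A[:, 0]
lhs = contract(M3, a1, a1)                          # should equal w_1 * a_1
```

Because the cross terms vanish by orthogonality, only the i = 1 term survives, which is exactly the eigenvector relation stated above.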
Two Problems
How to find eigenvectors of a tensor? A is not orthogonal in general.
Orthogonal Tensor Power Method
Symmetric orthogonal tensor $T \in \mathbb{R}^{d \times d \times d}$: $T = \sum_{i \in [k]} \lambda_i \, v_i \otimes v_i \otimes v_i$.
Recall the matrix power method: $v \mapsto \frac{M(I, v)}{\|M(I, v)\|}$.
Algorithm (tensor power method): $v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}$.
How do we avoid spurious solutions (not part of the decomposition)?
• The $\{v_i\}$ are the only robust fixed points.
• All other eigenvectors are saddle points.
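A minimal sketch of the tensor power update follows. The update rule is the one stated above; the random restarts and the deflation step (subtracting each recovered rank-1 term before searching for the next) are assumptions added here to recover all k components, and the function names are hypothetical.

```python
import numpy as np

def tensor_apply(T, v):
    """T(I, v, v): contract the last two modes of T with v."""
    return np.einsum("abc,b,c->a", T, v, v)

def power_iteration(T, v0, n_iter=100):
    """Repeat v -> T(I, v, v) / ||T(I, v, v)|| from the start vector v0."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(n_iter):
        u = tensor_apply(T, v)
        v = u / np.linalg.norm(u)
    lam = float(v @ tensor_apply(T, v))      # eigenvalue estimate T(v, v, v)
    return lam, v

def decompose(T, k, n_restarts=10, seed=0):
    """Recover k (eigenvalue, eigenvector) pairs.

    Restarts and deflation are illustrative assumptions, not from the slides.
    """
    rng = np.random.default_rng(seed)
    T = T.copy()
    d = T.shape[0]
    lams, vecs = [], []
    for _ in range(k):
        runs = (power_iteration(T, rng.standard_normal(d)) for _ in range(n_restarts))
        lam, v = max(runs, key=lambda pair: pair[0])    # keep the best restart
        lams.append(lam)
        vecs.append(v)
        T -= lam * np.einsum("a,b,c->abc", v, v, v)     # deflate the found component
    return np.array(lams), np.column_stack(vecs)

# Synthetic symmetric orthogonal tensor for illustration.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((5, 3)))
lam_true = np.array([3.0, 2.0, 1.0])
T = np.einsum("i,ai,bi,ci->abc", lam_true, V, V, V)
lams, vecs = decompose(T, 3)
```

Components may come out in any order, so comparisons against a ground truth should be made up to permutation.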
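Whitening: Conversion to Orthogonal Tensor
$M_3 = \sum_i w_i \, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i \, a_i \otimes a_i$.
Find a whitening matrix $W$ s.t. $W^\top A = V$ is an orthogonal matrix.
When $A \in \mathbb{R}^{d \times k}$ has full column rank, it is an invertible transformation.
[Figure: W maps $a_1, a_2, a_3$ to orthogonal vectors $v_1, v_2, v_3$.]
Use the pairwise moments $M_2$ to find $W$.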
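One common way to obtain a whitening matrix from the pairwise moments is through the top-k eigenpairs of M2; the sketch below assumes that construction (W = U D^{-1/2}), which the slides do not spell out. In this construction the weights are absorbed into the columns, so the matrix that comes out orthogonal is W^T A diag(sqrt(w)).

```python
import numpy as np

def whitening_matrix(M2, k):
    """W = U_k D_k^{-1/2} from the top-k eigenpairs of M2, so that W^T M2 W = I_k."""
    vals, vecs = np.linalg.eigh(M2)              # ascending eigenvalues
    top = np.argsort(vals)[::-1][:k]
    return vecs[:, top] / np.sqrt(vals[top])

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.standard_normal((d, k))                  # full column rank (illustrative)
w = np.array([0.5, 0.3, 0.2])
M2 = (A * w) @ A.T                               # sum_i w_i a_i a_i^T
W = whitening_matrix(M2, k)
V = W.T @ A * np.sqrt(w)                         # columns sqrt(w_i) * W^T a_i
```

Here `V` is a k x k orthogonal matrix, which is what lets the whitened third moment be treated as an orthogonal tensor.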
Putting it together
Non-orthogonal tensor: $M_3 = \sum_i w_i \, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i \, a_i \otimes a_i$.
Whitening matrix $W$. Multilinear transform: $T = M_3(W, W, W)$.
[Figure: W maps $a_1, a_2, a_3$ to $v_1, v_2, v_3$, taking tensor $M_3$ to tensor $T$.]
Tensor Decomposition: Guaranteed Non-Convex Optimization!
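The whole pipeline (whiten, transform, decompose, undo the whitening) can be sketched end to end on exact synthetic moments. Several pieces are assumptions made for illustration: the eigenvalue-based whitening, the restarts and deflation in the power method, and the un-whitening formulas w_i = 1 / lambda_i^2 and a_i = lambda_i (W^T)^+ v_i, which follow from that particular whitening.

```python
import numpy as np

def whiten(M2, k):
    """W = U_k D_k^{-1/2}, so that W^T M2 W = I_k (assumed construction)."""
    vals, vecs = np.linalg.eigh(M2)
    top = np.argsort(vals)[::-1][:k]
    return vecs[:, top] / np.sqrt(vals[top])

def multilinear(M3, W):
    """T = M3(W, W, W): apply W^T along each of the three modes."""
    return np.einsum("abc,ai,bj,ck->ijk", M3, W, W, W)

def power_method(T, k, n_restarts=10, n_iter=100, seed=0):
    """Tensor power iterations with restarts and deflation (illustrative)."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    lams, vecs = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.standard_normal(T.shape[0])
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                u = np.einsum("abc,b,c->a", T, v, v)
                v = u / np.linalg.norm(u)
            lam = float(np.einsum("abc,a,b,c->", T, v, v, v))
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam)
        vecs.append(best_v)
        T -= best_lam * np.einsum("a,b,c->abc", best_v, best_v, best_v)
    return np.array(lams), np.column_stack(vecs)

def learn(M2, M3, k):
    W = whiten(M2, k)
    lams, V = power_method(multilinear(M3, W), k)
    w_hat = 1.0 / lams**2                          # since lambda_i = w_i^{-1/2}
    A_hat = np.linalg.pinv(W.T) @ V * lams         # column i: lambda_i * pinv(W^T) v_i
    return w_hat, A_hat

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.standard_normal((d, k))
w = np.array([0.5, 0.3, 0.2])
M2 = (A * w) @ A.T
M3 = np.einsum("i,ai,bi,ci->abc", w, A, A, A)
w_hat, A_hat = learn(M2, M3, k)
```

With exact moments the factors are recovered up to a permutation of the components; with empirical moments the same steps apply but the output carries estimation error.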
Types of Latent Variable Models
What is the form of the hidden variables $h$?
Basic approach: mixtures/clusters. The hidden variable $h$ is categorical.
Advanced: probabilistic models. The hidden variable $h$ has a more general distribution. Can model mixed memberships, e.g., Dirichlet distribution.
[Figure: graphical model with hidden variable h1 and observed variables x1, x2, x3, x4, x5.]
Geometric Picture for Topic Models
Topic proportions vector ($h$).
Single topic ($h$).
Word generation ($x_1, x_2, \ldots$).
[Figure: hidden topic h generates words x1, x2, x3, each through the matrix A.]
Linear model: $E[x_i | h] = A h$.
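Moments for Single Topic Models
$E[x_i | h] = A h$, $w := E[h]$.
Learn the topic-word matrix $A$ and the vector $w$.
[Figure: hidden topic h generates words x1, x2, x3, x4, x5, each through the matrix A.]
Pairwise co-occurrence matrix $M_2$: $M_2 := E[x_1 \otimes x_2] = E[E[x_1 \otimes x_2 | h]] = \sum_{i=1}^{k} w_i \, a_i \otimes a_i$
Triples tensor $M_3$: $M_3 := E[x_1 \otimes x_2 \otimes x_3] = E[E[x_1 \otimes x_2 \otimes x_3 | h]] = \sum_{i=1}^{k} w_i \, a_i \otimes a_i \otimes a_i$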
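The moment identities for the single topic model can be checked by simulation. The sketch below assumes words are encoded as one-hot vectors, each document has one topic drawn with probabilities w, and its words are drawn independently from the corresponding column of A; the vocabulary size, topic count, and sample size are illustrative.

```python
import numpy as np

def sample_single_topic_docs(A, w, n, rng, words_per_doc=3):
    """Draw n documents: one topic h per document, then i.i.d. words from column A[:, h]."""
    d, k = A.shape
    h = rng.choice(k, size=n, p=w)
    words = np.empty((n, words_per_doc), dtype=int)
    for j in range(k):
        idx = np.flatnonzero(h == j)
        words[idx] = rng.choice(d, size=(len(idx), words_per_doc), p=A[:, j])
    return words

def cross_moments(words, d):
    """Empirical E[x1 (x) x2] and E[x1 (x) x2 (x) x3] with one-hot word vectors."""
    E = np.eye(d)
    X1, X2, X3 = E[words[:, 0]], E[words[:, 1]], E[words[:, 2]]
    n = len(words)
    M2 = X1.T @ X2 / n
    M3 = np.einsum("ni,nj,nk->ijk", X1, X2, X3) / n
    return M2, M3

rng = np.random.default_rng(0)
d, k, n = 6, 3, 200_000
A = rng.random((d, k))
A /= A.sum(axis=0)                       # columns are word distributions per topic
w = np.array([0.5, 0.3, 0.2])            # topic probabilities
words = sample_single_topic_docs(A, w, n, rng)
M2_hat, M3_hat = cross_moments(words, d)
M2_true = (A * w) @ A.T
M3_true = np.einsum("i,ai,bi,ci->abc", w, A, A, A)
```

Using three distinct words per document is what makes the cross-moments factor cleanly, since the words are conditionally independent given the topic.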
Moments under LDA
$M_2 := E[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1} E[x_1] \otimes E[x_1]$
$M_3 := E[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0 + 2} E[x_1 \otimes x_2 \otimes E[x_1]] - \text{more stuff...}$
Then $M_2 = \sum_i \tilde{w}_i \, a_i \otimes a_i$ and $M_3 = \sum_i \tilde{w}_i \, a_i \otimes a_i \otimes a_i$.
Three words per document suffice for learning LDA. Similar forms for HMM, ICA, sparse coding, etc.
“Tensor Decompositions for Learning Latent Variable Models” by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky. JMLR 2014.
Network Community Models
[Figure: network community model; node labels (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1).]
Subgraph Counts as Graph Moments
3-star count tensor:
$\tilde{M}_3(a, b, c) = \frac{1}{|X|} \, \#\text{ of common neighbors in } X = \frac{1}{|X|} \sum_{x \in X} G(x, a) G(x, b) G(x, c)$.
$\tilde{M}_3 = \frac{1}{|X|} \sum_{x \in X} [G_{x,A}^\top \otimes G_{x,B}^\top \otimes G_{x,C}^\top]$
[Figure: node x in X with neighbors a in A, b in B, c in C.]
“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.
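The 3-star count tensor can be computed directly from an adjacency matrix. The sketch below assumes a 0/1 adjacency matrix G and four disjoint node sets X, A, B, C given as index lists; the random graph and the partition sizes are illustrative only.

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """(1/|X|) * sum over x in X of G[x, A] (x) G[x, B] (x) G[x, C]."""
    GA = G[np.ix_(X, A)].astype(float)   # edges from X to A
    GB = G[np.ix_(X, B)].astype(float)   # edges from X to B
    GC = G[np.ix_(X, C)].astype(float)   # edges from X to C
    return np.einsum("xa,xb,xc->abc", GA, GB, GC) / len(X)

rng = np.random.default_rng(0)
n = 40
upper = np.triu(rng.random((n, n)) < 0.3, k=1)
G = (upper | upper.T).astype(int)        # symmetric adjacency, no self-loops
X, A, B, C = (list(range(i, i + 10)) for i in (0, 10, 20, 30))
M3 = three_star_tensor(G, X, A, B, C)
```

Entry (a, b, c) is the fraction of nodes in X that are adjacent to all three of a, b, and c, matching the common-neighbor count in the definition.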
Computational Complexity ($k \ll n$)
n = # of nodes, N = # of iterations, k = # of communities, c = # of cores.

        Whiten            STGD          Unwhiten
Space   O(nk)             O(k^2)        O(nk)
Time    O(nsk/c + k^3)    O(N k^3/c)    O(nsk/c)

Whiten: matrix/vector products and SVD.
STGD: Stochastic Tensor Gradient Descent.
Unwhiten: matrix/vector products.
Our approach: O(nsk/c + k^3)
Tensor Decomposition on GPUs
[Figure: running time (secs) vs. number of communities k, logarithmic axes. Curves: MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), Eigen Sparse (CPU).]
Summary of Results
Datasets: Facebook (users, friend links), n ∼ 20k. Yelp (businesses and users, reviews), n ∼ 40k. DBLP (authors, coauthor links), n ∼ 1 million (sub: ∼100k).
Error (E) and recovery ratio (R)

Dataset             $\hat{k}$   Method        Running Time   E        R
Facebook (k=360)    500         ours          468            0.0175   100%
Facebook (k=360)    500         variational   86,808         0.0308   100%
Yelp (k=159)        100         ours          287            0.046    86%
Yelp (k=159)        100         variational   N.A.
DBLP sub (k=250)    500         ours          10,157         0.139    89%
DBLP sub (k=250)    500         variational   558,723        16.38    99%
DBLP (k=6000)       100         ours          5407           0.105    95%
Experimental Results on Yelp
Lowest-error business categories and largest-weight businesses

Rank   Category         Business                    Stars   Review Counts
1      Latin American   Salvadoreno Restaurant      4.0     36
2      Gluten Free      P.F. Chang’s China Bistro   3.5     55
3      Hobby Shops      Make Meaning                4.5     14
4      Mass Media       KJZZ 91.5FM                 4.0     13
5      Yoga             Sutra Midtown               4.5     31
Bridgeness: distance from the vector $[1/\hat{k}, \ldots, 1/\hat{k}]^\top$
Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe
Moment Tensors for Associative Models
Multivariate moments: many possibilities...
$E[x \otimes y]$, $E[x \otimes x \otimes y]$, $E[\psi(x) \otimes y]$, ...
Feature transformations of the input: $x \mapsto \psi(x)$.
How to exploit them?
Are the moments $E[\psi(x) \otimes y]$ useful?
If $\psi(x)$ is a matrix/tensor, we have matrix/tensor moments. Can carry out spectral decomposition of the moments.
Score Function Features
Higher-order score function: $S_m(x) := (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}$
∗ Can be a matrix or a tensor instead of a vector.
∗ Derivative w.r.t. parameter or input.
Form the cross-moments: $E[y \cdot S_m(x)]$.
Extension of Stein’s lemma: $E[y \cdot S_m(x)] = E[\nabla^{(m)} G(x)]$ when $E[y|x] := G(x)$.
Spectral decomposition: $E[\nabla^{(m)} G(x)] = \sum_{j \in [k]} u_j^{\otimes m}$
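The first-order case (m = 1) of the identity above can be checked by Monte Carlo under an assumed input distribution. The sketch takes x to be standard Gaussian, for which the score function is S_1(x) = x, and uses a logistic model G(x) = sigmoid(<u, x>); both choices, and the sample size, are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 200_000, 3
u = np.array([0.6, -0.8, 0.0])              # unit-norm weight vector (illustrative)
x = rng.standard_normal((n, d))             # assumption: x ~ N(0, I), so S_1(x) = x
G = sigmoid(x @ u)                          # G(x) = E[y | x]
y = (rng.random(n) < G).astype(float)       # binary labels with mean G(x)

lhs = (y[:, None] * x).mean(axis=0)                 # estimate of E[y * S_1(x)]
rhs = ((G * (1.0 - G))[:, None] * u).mean(axis=0)   # estimate of E[grad G(x)]
cosine = lhs @ u / np.linalg.norm(lhs)              # alignment of the moment with u
```

Both sides are proportional to u, so the cross-moment with the score function already reveals the direction of the weight vector; higher orders give tensors whose decomposition separates several such directions.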
Learning Deep Neural Networks
Realizable setting: $E[y|x] = \sigma_d(A_d \, \sigma_{d-1}(A_{d-1} \, \sigma_{d-2}(\cdots A_2 \, \sigma_1(A_1 x))))$
$M_3 = E[y \cdot S_3(x)] = \sum_{i \in [r]} \lambda_i \cdot u_i^{\otimes 3}$
where $u_i = e_i^\top A_1$ are the rows of $A_1$.
Guaranteed learning of weights (layer-by-layer) via tensor decomposition.
Conclusion: Guaranteed Non-Convex Optimization
Tensor decomposition
Efficient sample and computational complexities.
Better performance compared to EM, Variational Bayes, etc.
In practice
Scalable and embarrassingly parallel: handles large datasets.
Efficient performance: perplexity or ground-truth validation.
Related topics
Overcomplete tensor decomposition: neural networks, sparse coding and ICA models tend to be overcomplete (more neurons than input dimensions).
Provable non-convex iterative methods: robust PCA, dictionary learning, etc.
My Research Group and Resources
Furong Huang, Majid Janzamin, Hanie Sedghi, Niranjan UN, Forough Arabshahi
ML summer school lectures available at