Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings

(1)

Anima Anandkumar

(2)

(3)

Data vs. Information

Messy Data

Missing observations, gross corruptions, outliers.

High dimensional regime: as data grows, more variables!

Useful information: low-dimensional structures.

(4)

Learning with big data: ill-posed problem.

(5)

Learning is finding a needle in a haystack.

Learning with big data: computationally challenging!

(6)

How to model information structures?

Latent variable models

Incorporate hidden or latent variables.

Information structures: Relationships between latent variables and observed data.

(7)

Basic Approach: mixtures/clusters

Hidden variable is categorical.

(8)

Advanced: Probabilistic models

Hidden variables have more general distributions.

Can model mixed membership/hierarchical groups.

[Figure: graphical model with hidden variable h1 and observed variables x1, x2, x3, x4, x5]

(9)

Latent Variable Models (LVMs)

Document Modeling

Observed: words. Hidden: topics.

Social Network Modeling

Observed: social interactions. Hidden: communities, relationships.

Recommendation Systems

Observed: recommendations (e.g., reviews). Hidden: user and business attributes.

(10)

LVM for Feature Engineering

Learn good features/representations for classification tasks, e.g., computer vision and NLP.

Sparse Coding/Dictionary Learning

Sparse representations, low dimensional hidden structures.

A few dictionary elements make complicated shapes.

(11)

Associative Latent Variable Models

Supervised Learning

(12)

Given labeled examples $\{(x_i, y_i)\}$, learn a classifier $\hat{y} = f(x)$.

Associative/conditional models: $p(y \mid x)$.

(13)

Example: Logistic regression: $\mathbb{E}[y \mid x] = \sigma(\langle u, x \rangle)$.

Mixture of Logistic Regressions

(14)

$\mathbb{E}[y \mid x, h] = g(\langle U h, x \rangle + \langle b, h \rangle)$
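To make the mixture of logistic regressions concrete, here is a minimal sketch of the conditional mean $g(\langle U h, x \rangle + \langle b, h \rangle)$. It assumes that $g$ is the sigmoid and that $h$ is a one-hot vector selecting one mixture component; the sizes and numbers are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mixture_logistic_mean(U, b, x, h):
    """E[y | x, h] = g(<U h, x> + <b, h>), with g assumed to be the sigmoid."""
    return sigmoid((U @ h) @ x + b @ h)

# Hypothetical sizes: d = 3 input features, k = 2 mixture components.
U = np.array([[1.0, -1.0],
              [0.5,  0.0],
              [0.0,  2.0]])      # column j holds the weights of component j
b = np.array([0.0, -1.0])        # one bias per component
x = np.array([1.0, 2.0, 0.5])

h0 = np.array([1.0, 0.0])        # one-hot: component 0 is active
h1 = np.array([0.0, 1.0])        # one-hot: component 1 is active
p0 = mixture_logistic_mean(U, b, x, h0)   # sigmoid(2.0)
p1 = mixture_logistic_mean(U, b, x, h1)   # sigmoid(-1.0)
```

With a one-hot $h$, the product $U h$ simply picks out one column of $U$, so each component behaves like an ordinary logistic regression.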

Multi-layer/Deep Network

(15)

Challenges in Learning LVMs

Computational Challenges

Maximum likelihood is NP-hard in most scenarios.

Practice: Local search approaches such as Back-propagation, EM, Variational Bayes have no consistency guarantees.

Sample Complexity

Sample complexity is exponential (w.r.t. hidden variable dimension) for many learning methods.

(16)

Outline

1 Introduction

2 Spectral Methods

Classical Matrix Methods

Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models

Topic Models

Network Community Models

Experimental Results

4 Moment Tensors in Supervised Setting

(17)

(18)

Classical Spectral Methods: Matrix PCA and CCA

Unsupervised Setting: PCA

For centered samples $\{x_i\}$, find projection $P$ with $\mathrm{Rank}(P) = k$ s.t.

$\min_P \frac{1}{n} \sum_{i \in [n]} \|x_i - P x_i\|^2.$

Result: Eigen-decomposition of $S = \mathrm{Cov}(X)$.
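As an illustration of the PCA result, the following sketch builds the rank-k projection from the top-k eigenvectors of the empirical covariance. The synthetic data and the helper name `pca_projection` are assumptions made for this example only.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k orthogonal projection built from the top-k covariance eigenvectors."""
    Xc = X - X.mean(axis=0)                 # center the samples
    S = Xc.T @ Xc / Xc.shape[0]             # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    V = eigvecs[:, -k:]                     # top-k eigenvectors
    return V @ V.T, Xc

# Synthetic data: two high-variance directions, two low-variance ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 0.1, 0.1])
P, Xc = pca_projection(X, k=2)

# Mean squared reconstruction error (1/n) * sum_i ||x_i - P x_i||^2.
err = np.mean(np.sum((Xc - Xc @ P) ** 2, axis=1))
```

The reconstruction error equals the sum of the discarded eigenvalues of the covariance, which is small here because only the two low-variance directions are dropped.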

Supervised Setting: CCA

For centered samples $\{x_i, y_i\}$, find

$\max_{a,b} \frac{a^\top \hat{\mathbb{E}}[x y^\top]\, b}{\sqrt{a^\top \hat{\mathbb{E}}[x x^\top]\, a \;\; b^\top \hat{\mathbb{E}}[y y^\top]\, b}}.$

Result: Generalized eigen decomposition.
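A small sketch of CCA follows. Instead of calling a generalized eigen-solver, it whitens both views and takes an SVD of the whitened cross-covariance, which solves the same problem; this substitution, the helper names, and the synthetic data are assumptions of the example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_top_pair(X, Y):
    """Top canonical directions (a, b) and their canonical correlation."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)   # whitened cross-covariance
    return Wx @ U[:, 0], Wy @ Vt[0], s[0]

# Synthetic views sharing one latent signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
X = np.column_stack([z + 0.1 * rng.normal(size=1000), rng.normal(size=1000)])
Y = np.column_stack([rng.normal(size=1000), z + 0.1 * rng.normal(size=1000)])

a, b, rho = cca_top_pair(X, Y)
corr = np.corrcoef(X @ a, Y @ b)[0, 1]   # empirical correlation of the projections
```

The top singular value `rho` is the maximized objective, so it matches the empirical correlation between the projections $\langle a, x \rangle$ and $\langle b, y \rangle$.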

[Figure: views x and y with the projection ⟨a, x⟩]

(19)

Shortcomings of Matrix Methods

Learning through Spectral Clustering

Dimension reduction through PCA (on data matrix).

Clustering on projected vectors (e.g. k-means).
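The two steps can be sketched as follows, assuming a plain Lloyd-style k-means with a farthest-point initialization (both are choices made for this example, not something the slide specifies) and well-separated synthetic clusters.

```python
import numpy as np

def spectral_cluster(X, k, n_iter=20):
    """PCA projection to k dimensions followed by a simple k-means."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                        # projected vectors
    # Farthest-point initialization of the k centers.
    centers = [Z[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d2)])
    centers = np.array(centers)
    # Lloyd iterations: assign points, then recompute centers.
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Two well-separated synthetic clusters in 5 dimensions.
rng = np.random.default_rng(0)
shift = np.array([5.0, 0.0, 0.0, 0.0, 0.0])
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 5)) + shift,
               rng.normal(0.0, 0.1, size=(50, 5)) - shift])
labels = spectral_cluster(X, k=2)
```

This works when clusters are well separated and each point belongs to a single cluster, which is exactly the regime where the basic method is expected to succeed.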

(20)

Basic method works only for single memberships.

Failure to cluster under small separation.

(21)

(22)

(23)

Beyond SVD: Spectral Methods on Tensors

How to learn the mixture models without separation constraints?

◮ PCA uses covariance matrix of data. Are higher order moments helpful?

Unified framework?

◮ Moment-based estimation of probabilistic latent variable models?

SVD gives spectral decomposition of matrices.

(24)

Moment Matrices and Tensors

Multivariate Moments in Unsupervised Setting

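$M_1 := \mathbb{E}[x]$, $M_2 := \mathbb{E}[x \otimes x]$, $M_3 := \mathbb{E}[x \otimes x \otimes x]$.

Matrix

$\mathbb{E}[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second order tensor.

$\mathbb{E}[x \otimes x]_{i_1, i_2} = \mathbb{E}[x_{i_1} x_{i_2}]$.

For matrices: $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top]$.

Tensor

$\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third order tensor.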

(25)

$\mathbb{E}[x \otimes x \otimes x]_{i_1, i_2, i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}]$.
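The first three empirical moments can be computed directly from samples. The sketch below uses `numpy.einsum` on a tiny hand-made data matrix; the data and the helper name `empirical_moments` are illustrative assumptions.

```python
import numpy as np

def empirical_moments(X):
    """Empirical M1, M2, M3 from the rows of X (one sample per row)."""
    n = X.shape[0]
    M1 = X.mean(axis=0)
    M2 = np.einsum('ni,nj->ij', X, X) / n          # entries E[x_i x_j]
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n   # entries E[x_i x_j x_k]
    return M1, M2, M3

X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0],
              [2.0, 1.0]])
M1, M2, M3 = empirical_moments(X)
# M3[0, 1, 1] averages x_0 * x_1 * x_1 over the samples: (0 + 0 + 1 + 2) / 4.
```

Both $M_2$ and $M_3$ are symmetric under any permutation of their indices, since they are built from outer products of a vector with itself.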

Multivariate Moments in Supervised Setting

(26)

Spectral Decomposition of Tensors

Matrix $M_2$: $M_2 = \sum_i \lambda_i\, u_i \otimes v_i = \lambda_1 u_1 \otimes v_1 + \lambda_2 u_2 \otimes v_2 + \cdots$

(27)

Tensor $M_3$: $M_3 = \sum_i \lambda_i\, u_i \otimes v_i \otimes w_i = \lambda_1 u_1 \otimes v_1 \otimes w_1 + \lambda_2 u_2 \otimes v_2 \otimes w_2 + \cdots$

$u \otimes v \otimes w$ is a rank-1 tensor since its $(i_1, i_2, i_3)$th entry is $u_{i_1} v_{i_2} w_{i_3}$.

How to solve this non-convex problem?
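To see what a rank-1 tensor and a sum of rank-1 terms look like numerically, here is a short sketch. The vectors are arbitrary and the helper names `rank1` and `cp_tensor` are hypothetical.

```python
import numpy as np

def rank1(u, v, w):
    """Rank-1 tensor with entries u[i] * v[j] * w[k]."""
    return np.einsum('i,j,k->ijk', u, v, w)

def cp_tensor(lams, U, V, W):
    """Sum over r of lams[r] * U[:, r] (x) V[:, r] (x) W[:, r]."""
    return np.einsum('r,ir,jr,kr->ijk', lams, U, V, W)

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0, 5.0])
w = np.array([6.0, 7.0])
T1 = rank1(u, v, w)              # T1[1, 2, 0] = 2 * 5 * 6

# A two-term decomposition assembled from its factors.
lams = np.array([2.0, 0.5])
U = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
W = np.array([[1.0, 2.0], [1.0, 0.0]])
T2 = cp_tensor(lams, U, V, W)
T2_sum = sum(lams[r] * rank1(U[:, r], V[:, r], W[:, r]) for r in range(2))
```

Building a tensor from its factors is easy; the hard direction is recovering the factors from the tensor, which is the non-convex problem the slide refers to.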

(28)

Decomposition of Orthogonal Tensors

$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i.$

(29)

Suppose $A$ has orthogonal columns.

(30)

$M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1.$

The $a_i$ are eigenvectors of tensor $M_3$.

Analogous to matrix eigenvectors:

(31)

$M v = M(I, v) = \lambda v.$

Two Problems

How to find eigenvectors of a tensor?

$A$ is not orthogonal in general.
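For the first problem, one standard option is a tensor power iteration, the analogue of the matrix power method with the update $v \leftarrow M_3(I, v, v) / \|M_3(I, v, v)\|$. The slide itself does not name a method, so treat this as an assumed approach; the sketch also assumes orthonormal components $a_i$ and uses a hand-picked starting vector.

```python
import numpy as np

def tensor_apply(T, v):
    """Return the vector T(I, v, v)."""
    return np.einsum('ijk,j,k->i', T, v, v)

def power_iteration(T, v0, n_iter=50):
    """Repeat v <- T(I, v, v) / ||T(I, v, v)|| and return (eigenvalue, eigenvector)."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(n_iter):
        v = tensor_apply(T, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)
    return lam, v

# Orthonormal components (the standard basis) with weights 3, 2, 1.
A = np.eye(3)
w = np.array([3.0, 2.0, 1.0])
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)

lam, v = power_iteration(M3, np.array([1.0, 0.4, 0.2]))
fixed_point = tensor_apply(M3, A[:, 1])   # equals w[1] * a_2
```

In this orthogonal setting the iteration converges to the component with the largest value of $w_i |\langle a_i, v_0 \rangle|$, so different starting points can recover different components.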

(32)

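$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i, \quad M_2 = \sum_i w_i\, a_i \otimes a_i.$

Find whitening matrix $W$ s.t. $W^\top A = V$ is an orthogonal matrix.

When $A \in \mathbb{R}^{d \times k}$ has full column rank, it is an invertible transformation.

[Figure: $W$ maps the components $a_1, a_2, a_3$ to orthogonal vectors $v_1, v_2, v_3$]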

Use pairwise moments M2 to find W.
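One way to obtain a whitening matrix from $M_2$ is through its top-k eigenpairs, choosing $W$ so that $W^\top M_2 W = I$. The sketch below takes this route; the specific construction and the toy values of $A$ and $w$ are assumptions of the example.

```python
import numpy as np

def whitening_matrix(M2, k):
    """Return W (d x k) with W^T M2 W = I_k, built from the top-k eigenpairs of M2."""
    vals, vecs = np.linalg.eigh(M2)          # ascending eigenvalues
    return vecs[:, -k:] / np.sqrt(vals[-k:])

# Toy model: d = 3, k = 2, non-orthogonal components as columns of A.
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
w = np.array([0.6, 0.4])
M2 = (A * w) @ A.T                           # sum_i w_i a_i a_i^T

W = whitening_matrix(M2, k=2)
V = (W.T @ A) * np.sqrt(w)                   # columns sqrt(w_i) * W^T a_i
```

With this choice the whitened components $W^\top a_i$ are mutually orthogonal, and rescaling them by $\sqrt{w_i}$ gives an orthonormal set.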

(39)

Putting it together

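Non-orthogonal tensor $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.

Whitening matrix $W$. Multilinear transform: $T = M_3(W, W, W)$.

[Figure: $W$ maps $a_1, a_2, a_3$ (tensor $M_3$) to orthogonal $v_1, v_2, v_3$ (tensor $T$)]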
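The whole pipeline can be sketched end to end: whiten with $M_2$, run power iterations with deflation on $T = M_3(W, W, W)$, then undo the whitening. The power iteration, the deflation step, and the recovery formulas $w_i = 1/\lambda_i^2$ and $a_i = \lambda_i (W^\top)^{+} v_i$ follow from the particular whitening used here and are assumptions of this sketch, as are the toy inputs.

```python
import numpy as np

def decompose(M2, M3, k, n_iter=100, seed=0):
    """Recover weights w_i and components a_i from exact moments M2, M3."""
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(M2)
    W = vecs[:, -k:] / np.sqrt(vals[-k:])              # W^T M2 W = I
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)    # T = M3(W, W, W)
    B = np.linalg.pinv(W.T)                            # undoes the whitening
    weights, comps = [], []
    for _ in range(k):
        v = rng.normal(size=k)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):                        # tensor power iteration
            v = np.einsum('ijk,j,k->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the found term
        weights.append(1.0 / lam ** 2)                 # lam = w_i^(-1/2)
        comps.append(lam * (B @ v))                    # a_i = lam * pinv(W^T) v
    return np.array(weights), np.array(comps).T

# Toy model with non-orthogonal components.
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
w = np.array([0.6, 0.4])
M2 = np.einsum('r,ir,jr->ij', w, A, A)
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)

w_hat, A_hat = decompose(M2, M3, k=2)
order = np.argsort(-w_hat)                             # fix the permutation
```

The components come back in an arbitrary order, so they are sorted by weight before comparing with the ground truth. With empirical rather than exact moments the recovery would only be approximate.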

(40)

(41)

Tensor Decomposition: Guaranteed Non-Convex Optimization!

(42)

(43)

Types of Latent Variable Models

What is the form of hidden variables $h$?

Basic Approach: mixtures/clusters

Hidden variable $h$ is categorical.

Advanced: Probabilistic models

Hidden variable $h$ has more general distributions.

Can model mixed memberships, e.g. Dirichlet distribution.

[Figure: graphical model with hidden variable h1 and observed variables x1, x2, x3, x4, x5]

(44)

(45)

(46)

Geometric Picture for Topic Models

Topic proportions vector (h)

(47)

Single topic (h)

(48)

Word generation $(x_1, x_2, \ldots)$

[Figure: single topic h generating words x1, x2, x3, each through the matrix A]

(49)

Linear model: $\mathbb{E}[x_i \mid h] = A h$.

(50)

Moments for Single Topic Models

$\mathbb{E}[x_i \mid h] = A h$. $w := \mathbb{E}[h]$.

Learn topic-word matrix $A$, vector $w$.

[Figure: hidden topic h generating words x1, x2, x3, x4, x5, each through the matrix A]

(51)

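Pairwise Co-occurrence Matrix $M_2$

$M_2 := \mathbb{E}[x_1 \otimes x_2] = \mathbb{E}[\mathbb{E}[x_1 \otimes x_2 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i$

Triples Tensor $M_3$

$M_3 := \mathbb{E}[x_1 \otimes x_2 \otimes x_3] = \mathbb{E}[\mathbb{E}[x_1 \otimes x_2 \otimes x_3 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i \otimes a_i$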
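As a sanity check on these formulas, the sketch below builds the population moments of a small single-topic model. It assumes words are one-hot encoded, so that entry $(i, j)$ of $M_2$ is the probability that the first word is $i$ and the second is $j$; the vocabulary size and probabilities are made up.

```python
import numpy as np

# Hypothetical model: d = 4 words, k = 2 topics.
A = np.array([[0.5, 0.0],
              [0.3, 0.1],
              [0.2, 0.4],
              [0.0, 0.5]])       # column i: word distribution a_i of topic i
w = np.array([0.7, 0.3])         # w = E[h], the topic probabilities

# Population moments from the formulas for M2 and M3.
M2 = np.einsum('r,ir,jr->ij', w, A, A)
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)

# Direct check: condition on the topic, where words are independent draws from a_t.
M2_check = sum(w[t] * np.outer(A[:, t], A[:, t]) for t in range(2))
```

Because each $a_i$ is a probability vector, $M_2$ and $M_3$ are themselves joint distributions over word pairs and triples, and summing $M_3$ over its last index gives back $M_2$.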

(52)

Moments under LDA

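$M_2 := \mathbb{E}[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1}\, \mathbb{E}[x_1] \otimes \mathbb{E}[x_1]$

$M_3 := \mathbb{E}[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0 + 2}\, \mathbb{E}[x_1 \otimes x_2 \otimes \mathbb{E}[x_1]] - \text{more stuff...}$

Then

$M_2 = \sum_i \tilde{w}_i\, a_i \otimes a_i, \quad M_3 = \sum_i \tilde{w}_i\, a_i \otimes a_i \otimes a_i.$

Three words per document suffice for learning LDA.

Similar forms for HMM, ICA, sparse coding, etc.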
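The second-order statement can be checked numerically. This sketch assumes $h \sim \mathrm{Dirichlet}(\alpha)$ with $\alpha_0 = \sum_i \alpha_i$ and uses the standard Dirichlet second moments; under that assumption the adjusted $M_2$ has weights $\tilde{w}_i = \alpha_i / (\alpha_0 (\alpha_0 + 1))$. Only $M_2$ is covered, because the slide elides the remaining terms of $M_3$. All numbers are hypothetical.

```python
import numpy as np

# Hypothetical topic-word matrix (columns are word distributions) and Dirichlet parameter.
A = np.array([[0.5, 0.0],
              [0.3, 0.1],
              [0.2, 0.4],
              [0.0, 0.5]])
alpha = np.array([0.5, 1.5])
alpha0 = alpha.sum()

# Dirichlet moments: E[h] and E[h h^T].
Eh = alpha / alpha0
Ehh = (np.outer(alpha, alpha) + np.diag(alpha)) / (alpha0 * (alpha0 + 1))

# Raw word moments under the linear model E[x_i | h] = A h.
E_x1 = A @ Eh
E_x1x2 = A @ Ehh @ A.T

# Adjusted second moment from the slide.
M2 = E_x1x2 - alpha0 / (alpha0 + 1) * np.outer(E_x1, E_x1)

# Expected diagonal form with weights alpha_i / (alpha0 * (alpha0 + 1)).
w_tilde = alpha / (alpha0 * (alpha0 + 1))
M2_expected = (A * w_tilde) @ A.T
```

The correction term cancels the off-diagonal part of $\mathbb{E}[h h^\top]$, which is why the adjusted moment has the same diagonal structure as in the single-topic case.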

“Tensor Decompositions for Learning Latent Variable Models” by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky. JMLR 2014.

(53)

“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.

(61)

Subgraph Counts as Graph Moments

(62)

3-Star Count Tensor

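$\tilde{M}_3(a, b, c) = \frac{1}{|X|}\, \#\text{ of common neighbors in } X = \frac{1}{|X|} \sum_{x \in X} G(x, a)\, G(x, b)\, G(x, c).$

$\tilde{M}_3 = \frac{1}{|X|} \sum_{x \in X} \left[ G_{x,A}^\top \otimes G_{x,B}^\top \otimes G_{x,C}^\top \right]$

[Figure: node x in partition X connected to nodes a, b, c in partitions A, B, C]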
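The 3-star count tensor can be computed from an adjacency matrix in one `einsum` call. The tiny graph and the partition into node sets X, A, B, C below are made up for illustration.

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """Entry (a, b, c): fraction of nodes in X adjacent to a, b and c simultaneously."""
    GA = G[np.ix_(X, A)].astype(float)
    GB = G[np.ix_(X, B)].astype(float)
    GC = G[np.ix_(X, C)].astype(float)
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)

# Hypothetical undirected graph on 8 nodes: X = {0, 1}, A = {2, 3}, B = {4, 5}, C = {6, 7}.
G = np.zeros((8, 8), dtype=int)
edges = [(0, 2), (0, 3), (0, 4), (0, 6), (1, 2), (1, 4), (1, 6), (1, 7)]
for i, j in edges:
    G[i, j] = G[j, i] = 1

M3 = three_star_tensor(G, X=[0, 1], A=[2, 3], B=[4, 5], C=[6, 7])
# Nodes 0 and 1 are both adjacent to 2, 4 and 6, so M3[0, 0, 0] = 2 / 2.
# Only node 1 is adjacent to 2, 4 and 7, so M3[0, 0, 1] = 1 / 2.
```

Each row of the adjacency matrix restricted to A, B and C contributes one rank-1 term, which is what the outer-product form of the tensor expresses.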

(63)

(64)

Computational Complexity (k ≪ n)

n = # of nodes, N = # of iterations, k = # of communities, c = # of cores.

        Whiten            STGD          Unwhiten
Space   O(nk)             O(k^2)        O(nk)
Time    O(nsk/c + k^3)    O(Nk^3/c)     O(nsk/c)

Whiten: matrix/vector products and SVD.

STGD: Stochastic Tensor Gradient Descent.

Unwhiten: matrix/vector products.

Our approach: O(nsk/c + k^3)

(65)

Tensor Decomposition on GPUs

[Figure: running time (secs) versus number of communities k, on logarithmic axes, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU)]

(66)

Summary of Results

Facebook (friend, users): n ∼ 20k

Yelp (business, user, reviews): n ∼ 40k

DBLP (sub) (author, coauthor): n ∼ 1 million (∼100k)

Error (E) and Recovery ratio (R)

Dataset             k̂     Method        Running Time   E        R
Facebook (k=360)    500   ours          468            0.0175   100%
Facebook (k=360)    500   variational   86,808         0.0308   100%
Yelp (k=159)        100   ours          287            0.046    86%
Yelp (k=159)        100   variational   N.A.
DBLP sub (k=250)    500   ours          10,157         0.139    89%
DBLP sub (k=250)    500   variational   558,723        16.38    99%
DBLP (k=6000)       100   ours          5407           0.105    95%

(67)

(68)

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank   Category         Business                     Stars   Review Counts
1      Latin American   Salvadoreno Restaurant       4.0     36
2      Gluten Free      P.F. Chang’s China Bistro    3.5     55
3      Hobby Shops      Make Meaning                 4.5     14
4      Mass Media       KJZZ 91.5FM                  4.0     13
5      Yoga             Sutra Midtown                4.5     31

Bridgeness: Distance from vector $[1/\hat{k}, \ldots, 1/\hat{k}]^\top$

Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe
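The distance used in the bridgeness definition can be computed as below. The sketch assumes Euclidean distance and membership vectors that sum to one; the slide does not give the exact normalization or how the distance is turned into a ranking, so only the distance itself is shown, on made-up membership vectors.

```python
import numpy as np

def distance_from_uniform(pi):
    """Euclidean distance of each membership vector from [1/k, ..., 1/k]."""
    pi = np.asarray(pi, dtype=float)
    k = pi.shape[-1]
    return np.linalg.norm(pi - 1.0 / k, axis=-1)

# Hypothetical community memberships for three nodes, k = 4 communities.
memberships = np.array([[1.0, 0.0, 0.0, 0.0],       # single community
                        [0.25, 0.25, 0.25, 0.25],   # spread evenly
                        [0.5, 0.5, 0.0, 0.0]])      # split between two
d = distance_from_uniform(memberships)
ranking = np.argsort(d)          # nodes ordered from closest to uniform
```

A node whose membership is spread evenly over all communities has distance zero, while a node belonging to a single community is as far from the uniform vector as possible.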

(69)

(70)

Moment Tensors for Associative Models

Multivariate Moments: Many possibilities...

$\mathbb{E}[x \otimes y]$, $\mathbb{E}[x \otimes x \otimes y]$, $\mathbb{E}[\psi(x) \otimes y]$, ...

Feature Transformations of the Input: $x \mapsto \psi(x)$

How to exploit them?

Are moments $\mathbb{E}[\psi(x) \otimes y]$ useful?

If $\psi(x)$ is a matrix/tensor, we have matrix/tensor moments.

Can carry out spectral decomposition of the moments.

(71)

Score Function Features

Higher order score function: $S_m(x) := (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}$

∗ Can be a matrix or a tensor instead of a vector.

∗ Derivative w.r.t. parameter or input.

Form the cross-moments: $\mathbb{E}[y \cdot S_m(x)]$.

Extension of Stein’s lemma: $\mathbb{E}[y \cdot S_m(x)] = \mathbb{E}[\nabla^{(m)} G(x)]$ when $\mathbb{E}[y \mid x] := G(x)$.

Spectral decomposition: $\mathbb{E}[\nabla^{(m)} G(x)] = \sum_{j \in [k]} u_j^{\otimes m}$
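A quick Monte Carlo check of the second-order case may help. It assumes a standard Gaussian input, for which the definition gives $S_2(x) = x x^\top - I$, and a noise-free label $y = G(x) = \langle u, x \rangle^2$ whose Hessian is $2 u u^\top$. The input distribution, the choice of $G$, and the sample size are all assumptions of this sketch.

```python
import numpy as np

def score2_gaussian(X):
    """S_2(x) = x x^T - I for x ~ N(0, I); one d x d matrix per sample row."""
    d = X.shape[1]
    return np.einsum('ni,nj->nij', X, X) - np.eye(d)

rng = np.random.default_rng(0)
d, n = 3, 200_000
u = np.array([1.0, -0.5, 0.0])

X = rng.normal(size=(n, d))
y = (X @ u) ** 2                              # G(x) = <u, x>^2

# Empirical cross-moment E[y * S_2(x)].
cross_moment = np.einsum('n,nij->ij', y, score2_gaussian(X)) / n

# Expected Hessian E[grad^2 G(x)] = 2 u u^T, a rank-1 matrix in the direction u.
hessian = 2.0 * np.outer(u, u)
max_gap = np.max(np.abs(cross_moment - hessian))
```

The cross-moment is computed from samples of $(x, y)$ alone, yet it approximates a derivative of $G$, whose spectral decomposition exposes the direction $u$.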

(72)

Learning Deep Neural Networks

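Realizable Setting

$\mathbb{E}[y \mid x] = \sigma_d(A_d\, \sigma_{d-1}(A_{d-1}\, \sigma_{d-2}(\cdots A_2\, \sigma_1(A_1 x))))$

$M_3 = \mathbb{E}[y \cdot S_3(x)] = \sum_{i \in [r]} \lambda_i \cdot u_i^{\otimes 3}$

where $u_i = e_i^\top A_1$ are rows of $A_1$.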

Guaranteed learning of weights (layer-by-layer) via tensor decomposition.

(73)

(74)

(75)

Conclusion: Guaranteed Non-Convex Optimization

Tensor Decomposition

Efficient sample and computational complexities.

Better performance compared to EM, Variational Bayes, etc.

In practice

Scalable and embarrassingly parallel: handle large datasets.

Efficient performance: perplexity or ground truth validation.

Related Topics

Overcomplete Tensor Decomposition: Neural networks, sparse coding and ICA models tend to be overcomplete (more neurons than input dimensions).

Provable Non-Convex Iterative Methods: Robust PCA, Dictionary learning etc.

(76)

My Research Group and Resources

Furong Huang, Majid Janzamin, Hanie Sedghi, Niranjan UN, Forough Arabshahi

ML summer school lectures available at