Learning, Sparsity and Big Data

(1)

Learning, Sparsity and Big Data

M. Magdon-Ismail

(Joint Work)

(2)

Out-of-Sample is What Counts

NO YES ? • A pattern exists • We don’t know it

• We have data to learn it

• Tested on new cases

“Teaching Machine Learning to a Diverse Audience: the Foundation-based Approach,” Lin,_M-I, Abu-Mostafa,ICML 2012.

c

(3)

DMOZ Data and SVMs

Good way to generate human labeled text data.

US:Pennsylvania:Business&Economy vs.

US:Arkansas:Localities:M

Bag of (18K) words and SVMs:

300 important words All words

Error 24.7% 27.6%

important words:

pennsylvania, arkansas, business, baptist, company, . . .

Other kinds of tasks: polarity relevance interest topic clustering/classification ... c

(4)

Financial Market Price Data

50 stocks in S&P100 2000-2011.

15 stocks reconstruct price changes (30% error):

AA: Alcoa

AMGN: Amgen

AVP: Avon

BHI: Baker Hughes

WMB: Williams Companies DIS: Disney

C: Citigroup

AEP: American Express

F: Ford GD: General Dynamics INTC: Intel IBM: IBM MCD: MacDonald’s TXN: Texas Instruments XRX: Xerox

Optimal, 15 d.o.f. (PCA): 24% error.

c

(5)

Throwing Out Unnecessary Features is Good

Sparsity: represent your solution using only a few features.

‘Sparse’ solutions generalize to out-of-sample better – less overfitting.

Sparse solutions are easier to interpret – few important features.

Computations are more efficient.

Problem (NP-Hard): Find the few relevant features quickly.

Question: Are those features provably good?

c

(6)

Three Learning Problems

1. Data Summarization/Compression (PCA)

2. Categorization/Clustering (k-Means)

3. Prediction (Regression)

c

(7)

Data

Data Matrix Response Matrix

n d at a p oi nt s d dimensions

name age debt income · · · hair weight sex

          

John 21yrs −$10K $65K · · · black 175lbs M

Joe 74yrs −$100K $25K · · · blonde 275lbs M

Jane 27yrs −$20K $85K · · · blonde 135lbs F

...

Jen 37yrs −$400K $105K · · · brun 155lbs F

          

credit? limit risk

           X ₂_K _high × 0 − X ₁₀_K _low ... X ₁₅_K _high           

X

∈

_R

n×d

Y

∈

_R

n×ω c

(8)

More Beautiful Data

X ∈ _R231×174 Y ∈ _R231×166

number of principal components

% re co n st ru ct io n er ro r 10 20 30 40 50 60 0 5 10 15

number of principal components

% re co n st ru ct io n er ro r 10 20 30 40 50 60 0 5 10 15 c

(9)

PCA,

K

-means, Linear Regression

k = 20

r = 2k

PCA K-Means Regression

=

c

(10)

Sparsity

Represent your solution using only a few . . .

Example: linear regression

        =         Xw = y

y is an optimal linear combination of only a few columns in X.

(sparse regression; regularization (||w ||₀ ≤ k); feature subset selection; . . . )

c

(11)

Singular Value Decomposition (SVD)

X = Uk Ud−k Σk 0 0 Σ_d₋_k Vt k Vt d−k O(nd2) U Σ Vt (n×d) (d×d) (d×d) X_k = U_kΣ_kVt k = XV_kVt k

X_k is the best rank-k approximation to X.

Reconstruction of X using only a fewdeg. of freedom.

X X20 X40 X60

V_k is an orthonormal basis for the best k-dimensional subspace of the row space of X.

c

(12)

Fast Approximate SVD

1: Z = XR R ∼ N(d × r), Z ∈ Rn×r

2: Q = qr.factorize(Z) 3: Vˆk ← svdk(QtX)

Theorem. Let r = k(1 + 1_ǫ) and E = X − X ˆV_kVˆt

k. Then,

E[||E ||] ≤ (1 + ǫ)|| X − Xk ||

running time is O(ndk) = o(svd)

[BDM, FOCS 2011]

We can construct

V

_k

quickly

c

(13)

V

_k

and Sparsity

Important “dimensions” of Vt

k are important for X

×s1 ×s2 ×s3 ×s4 ×s5     −→     Vt k Vˆkt ∈ Rk×r

The sampled r columns are “good” if I = Vt

kVk ≈ Vˆ

t

kVˆk.

Sampling schemes: Largest norm (Jollife, 1972);

Randomized norm sampling (Rudelson, 1999; RudelsonVershynin, 2007);

Greedy (Batson et al, 2009;BDM, 2011).

V

_k

is important

c

(14)

Sparse PCA – Algorithm

1: Choose a few columns C of X; C ∈ Rn×r.

2: Find the best rank-k approximation of X in the span of C, XC,k. 3: Compute the SVDk of

XC,k = UC,kΣC,kVt_C_,k.

4:

Z = XV_C_,k.

Each feature in Z is a mixture of only the few original r feature dimensions in C.

||X − XV_C_,kVt C,k || ≤ || X − XC,kVC,kV t C,k || = ||X − XC,k || ≤ 1 + O(2_rk) ||X − X_k ||. [BDM, FOCS 2011] c

(15)

Sparse PCA

k = 20 k = 40 k = 60

Dense PCA

Sparse PCA, r = 2k

Theorem. One can construct, in o(svd), k features that are r-sparse, r = O(k), that

are as good as exact dense top-k PCA-features.

c

(16)

Clustering:

K

-Means

Full, slow Fast, sparse

3 clusters

4 Clusters

Theorem. There is a subset of features of size O(#clusters) which produces nearly

the optimal partition (within a constant factor). One can quickly produce features with a log-approximation factor.

[BDM,2013]

c

(17)

PCA Regression using Few Important Features

PCA, slow, dense Sparse, fast

k = 20

k = 40

Theorem. Can find O(k) pure features which performs as well top-k pca-regression

(additive error controlled by ||X − X_k ||_F/σ_k).

[BDM,2013]

c

(18)

The Proofs

All the algorithms use the sparsifier of Vt

k in [BDM,FOCS2011].

1. Choose columns of Vt

k to preserve its singular values.

2. Ensure that the selected columns preserve the structural properties of the objective with respect to the columns of X that are sampled.

3. Use dual set sparsification algorithms to accomplish (2).

c

(19)

THANKS!

• Data Compression (PCA):

• Unsupervised k-Means Clustering:

• Supervised PCA Regression:

• Support Vector Machines

• Coresets (subsets of data instead features)

Few features: easy to interpret; better generalizers; faster computations.

c