• No results found

Learning, Sparsity and Big Data

N/A
N/A
Protected

Academic year: 2021

Share "Learning, Sparsity and Big Data"

Copied!
19
0
0

Loading.... (view fulltext now)

Full text

(1)

Learning, Sparsity and Big Data

M. Magdon-Ismail

(Joint Work)

(2)

Out-of-Sample is What Counts

NO YES ? • A pattern exists • We don’t know it

• We have data to learn it

• Tested on new cases

“Teaching Machine Learning to a Diverse Audience: the Foundation-based Approach,” Lin,M-I, Abu-Mostafa,ICML 2012.

c

(3)

DMOZ Data and SVMs

Good way to generate human labeled text data.

US:Pennsylvania:Business&Economy vs.

US:Arkansas:Localities:M

Bag of (18K) words and SVMs:

300 important words All words

Error 24.7% 27.6%

important words:

pennsylvania, arkansas, business, baptist, company, . . .

Other kinds of tasks: polarity relevance interest topic clustering/classification ... c

(4)

Financial Market Price Data

50 stocks in S&P100 2000-2011.

15 stocks reconstruct price changes (30% error):

AA: Alcoa

AMGN: Amgen

AVP: Avon

BHI: Baker Hughes

WMB: Williams Companies DIS: Disney

C: Citigroup

AEP: American Express

F: Ford GD: General Dynamics INTC: Intel IBM: IBM MCD: MacDonald’s TXN: Texas Instruments XRX: Xerox

Optimal, 15 d.o.f. (PCA): 24% error.

c

(5)

Throwing Out Unnecessary Features is Good

Sparsity: represent your solution using only a few features.

‘Sparse’ solutions generalize to out-of-sample better – less overfitting.

Sparse solutions are easier to interpret – few important features.

Computations are more efficient.

Problem (NP-Hard): Find the few relevant features quickly.

Question: Are those features provably good?

c

(6)

Three Learning Problems

1. Data Summarization/Compression (PCA)

2. Categorization/Clustering (k-Means)

3. Prediction (Regression)

c

(7)

Data

Data Matrix Response Matrix

n d at a p oi nt s d dimensions

name age debt income · · · hair weight sex

          

John 21yrs −$10K $65K · · · black 175lbs M

Joe 74yrs −$100K $25K · · · blonde 275lbs M

Jane 27yrs −$20K $85K · · · blonde 135lbs F

...

Jen 37yrs −$400K $105K · · · brun 155lbs F

          

credit? limit risk

           X 2K high × 0 − X 10K low ... X 15K high           

X

R

n×d

Y

R

n×ω c

(8)

More Beautiful Data

X ∈ R231×174 Y ∈ R231×166

number of principal components

% re co n st ru ct io n er ro r 10 20 30 40 50 60 0 5 10 15

number of principal components

% re co n st ru ct io n er ro r 10 20 30 40 50 60 0 5 10 15 c

(9)

PCA,

K

-means, Linear Regression

k = 20

r = 2k

PCA K-Means Regression

=

c

(10)

Sparsity

Represent your solution using only a few . . .

Example: linear regression

        =         Xw = y

y is an optimal linear combination of only a few columns in X.

(sparse regression; regularization (||w ||0 ≤ k); feature subset selection; . . . )

c

(11)

Singular Value Decomposition (SVD)

X = Uk Ud−k Σk 0 0 Σdk Vt k Vt d−k O(nd2) U Σ Vt (n×d) (d×d) (d×d) Xk = UkΣkVt k = XVkVt k

Xk is the best rank-k approximation to X.

Reconstruction of X using only a fewdeg. of freedom.

X X20 X40 X60

Vk is an orthonormal basis for the best k-dimensional subspace of the row space of X.

c

(12)

Fast Approximate SVD

1: Z = XR R ∼ N(d × r), Z ∈ Rn×r

2: Q = qr.factorize(Z) 3: Vˆk ← svdk(QtX)

Theorem. Let r = k(1 + 1ǫ) and E = X − X ˆVkVˆt

k. Then,

E[||E ||] ≤ (1 + ǫ)|| X − Xk ||

running time is O(ndk) = o(svd)

[BDM, FOCS 2011]

We can construct

V

k

quickly

c

(13)

V

k

and Sparsity

Important “dimensions” of Vt

k are important for X

×s1 ×s2 ×s3 ×s4 ×s5     −→     Vt k Vˆkt ∈ Rk×r

The sampled r columns are “good” if I = Vt

kVk ≈ Vˆ

t

kVˆk.

Sampling schemes: Largest norm (Jollife, 1972);

Randomized norm sampling (Rudelson, 1999; RudelsonVershynin, 2007);

Greedy (Batson et al, 2009;BDM, 2011).

V

k

is important

c

(14)

Sparse PCA – Algorithm

1: Choose a few columns C of X; C ∈ Rn×r.

2: Find the best rank-k approximation of X in the span of C, XC,k. 3: Compute the SVDk of

XC,k = UC,kΣC,kVtC,k.

4:

Z = XVC,k.

Each feature in Z is a mixture of only the few original r feature dimensions in C.

||X − XVC,kVt C,k || ≤ || X − XC,kVC,kV t C,k || = ||X − XC,k || ≤ 1 + O(2rk) ||X − Xk ||. [BDM, FOCS 2011] c

(15)

Sparse PCA

k = 20 k = 40 k = 60

Dense PCA

Sparse PCA, r = 2k

Theorem. One can construct, in o(svd), k features that are r-sparse, r = O(k), that

are as good as exact dense top-k PCA-features.

c

(16)

Clustering:

K

-Means

Full, slow Fast, sparse

3 clusters

4 Clusters

Theorem. There is a subset of features of size O(#clusters) which produces nearly

the optimal partition (within a constant factor). One can quickly produce features with a log-approximation factor.

[BDM,2013]

c

(17)

PCA Regression using Few Important Features

PCA, slow, dense Sparse, fast

k = 20

k = 40

Theorem. Can find O(k) pure features which performs as well top-k pca-regression

(additive error controlled by ||X − Xk ||Fk).

[BDM,2013]

c

(18)

The Proofs

All the algorithms use the sparsifier of Vt

k in [BDM,FOCS2011].

1. Choose columns of Vt

k to preserve its singular values.

2. Ensure that the selected columns preserve the structural properties of the objective with respect to the columns of X that are sampled.

3. Use dual set sparsification algorithms to accomplish (2).

c

(19)

THANKS!

• Data Compression (PCA):

• Unsupervised k-Means Clustering:

• Supervised PCA Regression:

• Support Vector Machines

• Coresets (subsets of data instead features)

Few features: easy to interpret; better generalizers; faster computations.

c

References

Related documents

The main wall of the living room has been designated as a "Model Wall" of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

Conceptualizing advanced nursing practice: Curriculum issues to consider in the educational preparation of advanced practice nurses in

According to a recent Ventana Research report on big data technology, only 22% of 163 organizations that Ventana polled last year were using Hadoop, and 45% said they had no plans

We agree that many of the elements listed above are an important part of the accounting for disclosure. Elements such as time, date, patient identification, user identification,

Reading Recovery and Descubriendo la Lectura teacher leaders maintain registered status through affiliation with a university training center and continued employment in the role

Delivery can be arranged and will be charged on a pallet basis.. Stock will be available on a first come, first

Note: Students who fail to earn a Scholars Merit in a designated course are required to propose their own service learning activity.. Transfer students will be required to earn

come a part of the culture of the Russian Empire; on the contrary – in Polish cultural tradition, its representatives established themselves as the great prophets, and that, in