
Big Data Analytics: Optimization and Randomization




(1)

Tianbao Yang, Qihang Lin, Rong Jin

Tutorial@SIGKDD 2015

Sydney, Australia

Department of Computer Science, The University of Iowa, IA, USA

Department of Management Sciences, The University of Iowa, IA, USA

Department of Computer Science and Engineering, Michigan State University, MI, USA

Institute of Data Science and Technologies at Alibaba Group, Seattle, USA

August 10, 2015

(2)

URL

http://www.cs.uiowa.edu/~tyng/kdd15-tutorial.pdf

(3)

No

This tutorial is not an exhaustive literature survey

It is not a survey of different machine learning/data mining algorithms

Yes

It is about how to efficiently solve machine learning/data mining problems (formulated as optimization) for big data

(4)

Outline

Part I: Basics

Part II: Optimization

Part III: Randomization

(5)
(6)

Outline

1 Basics

Introduction

Notations and Definitions

(7)

Model Optimization

[Figure: distance to optimal objective versus iterations, with curves 0.5^T, 1/T^2, and 1/T]

Data

(8)

Big Data Challenge

Big Data

(9)

Big Model

60 million parameters

(10)

Learning as Optimization

Ridge Regression Problem:

$$\min_{w \in \mathbb{R}^d} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2} \|w\|_2^2}_{\text{Regularization}}$$

$x_i \in \mathbb{R}^d$: $d$-dimensional feature vector
$y_i \in \mathbb{R}$: target variable
$w \in \mathbb{R}^d$: model parameters
$n$: number of data points
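As a rough illustration of the ridge regression objective, the following NumPy sketch evaluates it, computes its gradient, and solves the first-order condition directly. The function names and the synthetic data are assumptions made for this example only; they do not come from the tutorial.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """(1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2."""
    residual = y - X @ w
    return residual @ residual / len(y) + 0.5 * lam * (w @ w)

def ridge_gradient(w, X, y, lam):
    """Gradient of the objective above with respect to w."""
    return -2.0 / len(y) * X.T @ (y - X @ w) + lam * w

def ridge_closed_form(X, y, lam):
    """Solve the first-order condition (2/n) X^T (X w - y) + lam * w = 0."""
    n, d = X.shape
    A = 2.0 / n * X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, 2.0 / n * X.T @ y)

# Synthetic data, purely illustrative (hypothetical sizes and noise level).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.ones(5) + 0.1 * rng.standard_normal(100)
w_star = ridge_closed_form(X, y, lam=0.1)
```

The closed-form solve costs on the order of n d^2 + d^3 operations, which is one reason iterative methods become attractive when n or d is large.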

(13)

Classification Problems:

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^\top x_i) + \frac{\lambda}{2} \|w\|_2^2$$

$y_i \in \{+1, -1\}$: label

Loss function $\ell(z)$, where $z = y\, w^\top x$:

1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1 - z)^p$, where $p = 1, 2$

2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$
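The two classification losses can be written as short functions of the margin z = y w^T x. This is a minimal sketch; the function names and the sample margins are illustrative assumptions.

```python
import numpy as np

def hinge(z, p=1):
    """(Squared) hinge loss max(0, 1 - z)^p for p in {1, 2}."""
    return np.maximum(0.0, 1.0 - z) ** p

def logistic(z):
    """Logistic loss log(1 + exp(-z)), computed stably via logaddexp."""
    return np.logaddexp(0.0, -z)

z = np.array([-1.0, 0.0, 2.0])   # hypothetical margins z = y * w^T x
print(hinge(z))        # [2. 1. 0.]
print(hinge(z, p=2))   # [4. 1. 0.]
print(logistic(z))
```

A confidently correct prediction (large positive margin) incurs zero hinge loss, while the logistic loss is always positive and only decays toward zero.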

(14)

Learning as Optimization

Feature Selection:

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda \|w\|_1$$

$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^{d} |w_i|$

$\lambda$ controls the sparsity level

(15)

Feature Selection using Elastic Net:

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda \|w\|_1 + \gamma \|w\|_2^2$$

Elastic net regularizer, more robust than the $\ell_1$ regularizer

(16)

Learning as Optimization

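Multi-class/Multi-task Learning:

$$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \ell(W x_i, y_i) + \lambda\, r(W), \qquad W \in \mathbb{R}^{K \times d}$$

$r(W) = \|W\|_F^2 = \sum_{k=1}^{K} \sum_{j=1}^{d} W_{kj}^2$: Frobenius norm

$r(W) = \|W\|_* = \sum_i \sigma_i$: nuclear norm (sum of singular values)

$r(W) = \|W\|_{1,\infty} = \sum_{j=1}^{d} \|W_{:j}\|_\infty$: $\ell_{1,\infty}$ mixed norm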
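The three matrix regularizers can be computed directly with NumPy. This is a small sketch on a hypothetical 2 x 2 matrix; the helper names are assumptions for illustration.

```python
import numpy as np

def frobenius_sq(W):
    """Squared Frobenius norm: sum of squared entries."""
    return np.sum(W ** 2)

def nuclear_norm(W):
    """Nuclear norm: sum of singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()

def l1_inf_norm(W):
    """l_{1,inf} mixed norm: sum over columns j of max_k |W_kj|."""
    return np.abs(W).max(axis=0).sum()

W = np.array([[3.0, 0.0],
              [0.0, -4.0]])        # hypothetical example matrix
print(frobenius_sq(W), nuclear_norm(W), l1_inf_norm(W))
```

For this diagonal example the singular values are 4 and 3, so the nuclear norm and the mixed norm both equal 7, while the squared Frobenius norm is 25.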

(17)

Regularized Empirical Loss Minimization

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$

Both $\ell$ and $R$ are convex functions

Extensions to matrix cases are possible (sometimes straightforward)

Extensions to kernel methods can be combined with randomized approaches

Extensions to non-convex problems (e.g., deep learning) are in progress

(18)

Data Matrices and Machine Learning

The Instance-feature Matrix: $X \in \mathbb{R}^{n \times d}$

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$$

(19)

The output vector:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^{n \times 1}$$

continuous $y_i \in \mathbb{R}$: regression (e.g., house price)

discrete, e.g., $y_i \in \{1, 2, 3\}$: classification (e.g., species of iris)

(20)

Data Matrices and Machine Learning

The Instance-Instance Matrix: $K \in \mathbb{R}^{n \times n}$

Similarity Matrix

Kernel Matrix

(21)

Some machine learning tasks are formulated on the kernel matrix

Clustering

Kernel Methods

(22)

Data Matrices and Machine Learning

The Feature-Feature Matrix: $C \in \mathbb{R}^{d \times d}$

Covariance Matrix

Distance Metric Matrix

(23)

Some machine learning tasks require the covariance matrix

Principal Component Analysis

Top-k Singular Value (Eigenvalue) Decomposition of the Covariance Matrix

(24)

Why Is Learning from Big Data Challenging?

High per-iteration cost

High memory cost

High communication cost

Large iteration complexity

(25)

1 Basics

Introduction

Notations and Definitions

(26)

Norms

Vector $x \in \mathbb{R}^d$

Euclidean vector norm: $\|x\|_2 = \sqrt{x^\top x} = \sqrt{\sum_{i=1}^{d} x_i^2}$

$\ell_p$-norm of a vector: $\|x\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}$, where $p \ge 1$

1. $\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2}$

2. $\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^{d} |x_i|$

3. $\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$
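The vector norms can be checked numerically with a few lines of NumPy. The helper `lp_norm` and the sample vector are assumptions for illustration; `np.linalg.norm` provides the same values through its `ord` argument.

```python
import numpy as np

def lp_norm(x, p):
    """l_p norm for p >= 1; p = inf gives the largest absolute entry."""
    if np.isinf(p):
        return np.max(np.abs(x))
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])           # hypothetical example vector
for p in (1, 2, np.inf):
    # The hand-written formula and NumPy's built-in norm should agree.
    print(p, lp_norm(x, p), np.linalg.norm(x, ord=p))
```

For this vector the three norms are 7, 5, and 4, illustrating that the l_p norm is non-increasing in p.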

(29)

Matrix Factorization

Matrix $X \in \mathbb{R}^{n \times d}$

Singular Value Decomposition: $X = U \Sigma V^\top$

1. $U \in \mathbb{R}^{n \times r}$: orthonormal columns ($U^\top U = I$), spanning the column space

2. $\Sigma \in \mathbb{R}^{r \times r}$: diagonal matrix with $\Sigma_{ii} = \sigma_i > 0$, $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$

3. $V \in \mathbb{R}^{d \times r}$: orthonormal columns ($V^\top V = I$), spanning the row space

4. $r \le \min(n, d)$: the largest value such that $\sigma_r > 0$, i.e., the rank of $X$

5. $U_k \Sigma_k V_k^\top$: top-$k$ approximation

Pseudo-inverse: $X^\dagger = V \Sigma^{-1} U^\top$

QR factorization: $X = QR$ ($n \ge d$)

$Q \in \mathbb{R}^{n \times d}$: orthonormal columns

$R \in \mathbb{R}^{d \times d}$: upper triangular matrix
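The factorizations above map onto standard NumPy routines. The sketch below uses a random matrix with hypothetical sizes n = 6 and d = 4; the rank tolerance `1e-12` is an assumption chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                     # n = 6, d = 4

# Thin SVD: X = U diag(s) Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-12))                          # numerical rank
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # top-k approximation
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T              # pseudo-inverse V Sigma^{-1} U^T

# Reduced QR: Q has orthonormal columns, R is upper triangular.
Q, R = np.linalg.qr(X)
```

The spectral-norm error of the top-k approximation equals the (k+1)-th singular value, which is what makes truncated SVD a natural target for the randomized methods covered later.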

(32)

Norms

Matrix $X \in \mathbb{R}^{n \times d}$

Frobenius norm: $\|X\|_F = \sqrt{\mathrm{tr}(X^\top X)} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{d} X_{ij}^2}$

Spectral (induced) norm of a matrix: $\|X\|_2 = \max_{\|u\|_2 = 1} \|X u\|_2$

$\|X\|_2 = \sigma_1$ (maximum singular value)
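Both matrix norms can be verified numerically. This is a small sketch on a random matrix of hypothetical size 5 x 3; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))

fro = np.sqrt(np.trace(X.T @ X))                    # Frobenius norm via the trace
sigma = np.linalg.svd(X, compute_uv=False)          # singular values, descending
spec = sigma[0]                                     # spectral norm = sigma_1

# No unit vector u can be stretched by more than sigma_1.
u = rng.standard_normal(3)
u /= np.linalg.norm(u)
stretch = np.linalg.norm(X @ u)
```

The Frobenius norm also equals the square root of the sum of squared singular values, so it always upper-bounds the spectral norm.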

(34)

Convex Optimization

$$\min_{x \in \mathcal{X}} f(x)$$

$\mathcal{X}$ is a convex domain

for any $x, y \in \mathcal{X}$ and $\alpha \in [0, 1]$, their convex combination $\alpha x + (1 - \alpha) y \in \mathcal{X}$

$f(x)$ is a convex function

(35)

Convex Function

Characterization of Convex Function

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \forall x, y \in \mathcal{X},\ \alpha \in [0, 1]$$

$$f(x) \ge f(y) + \nabla f(y)^\top (x - y), \quad \forall x, y \in \mathcal{X}$$

local optimum is global optimum

(37)

Convex vs Strongly Convex

Convex function:

$$f(x) \ge f(y) + \nabla f(y)^\top (x - y), \quad \forall x, y \in \mathcal{X}$$

Strongly Convex function:

$$f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{\lambda}{2} \|x - y\|_2^2, \quad \forall x, y \in \mathcal{X}$$

Global optimum is unique

$\lambda$: strong convexity constant
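As a short worked example (not taken from the slides), the $\ell_2$ regularizer $f(x) = \frac{\lambda}{2}\|x\|_2^2$ meets the strong convexity inequality with equality, which is why it is the standard example of a $\lambda$-strongly convex function. Using $\nabla f(y) = \lambda y$:

```latex
\begin{aligned}
f(x) - f(y) - \nabla f(y)^\top (x - y)
  &= \frac{\lambda}{2}\|x\|_2^2 - \frac{\lambda}{2}\|y\|_2^2 - \lambda\, y^\top (x - y) \\
  &= \frac{\lambda}{2}\|x\|_2^2 - \lambda\, y^\top x + \frac{\lambda}{2}\|y\|_2^2 \\
  &= \frac{\lambda}{2}\|x - y\|_2^2 .
\end{aligned}
```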

(39)

Non-smooth function vs Smooth function

Non-smooth function

Lipschitz continuous: e.g. absolute loss $f(x) = |x|$

$$|f(x) - f(y)| \le G \|x - y\|_2$$

$G$: Lipschitz constant

Subgradient: $f(x) \ge f(y) + \partial f(y)^\top (x - y)$

[Figure: the non-smooth function $|x|$ with a sub-gradient lower bound]

Smooth function

e.g. logistic loss $f(x) = \log(1 + \exp(-x))$

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2$$

$L$: smoothness constant

[Figure: $\log(1 + \exp(-x))$ with its tangent $f(y) + f'(y)(x - y)$ at $y$ and a quadratic function]
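Both definitions can be checked numerically on a grid. The sketch below assumes the subgradient choice sign(y) for |y| (any value in [-1, 1] is valid at y = 0) and uses L = 1/4 for the logistic loss, which follows from its second derivative being at most 1/4. Function names and the grid are illustrative assumptions.

```python
import numpy as np

def abs_subgradient(y):
    """One valid subgradient of |y|: sign(y), taking 0 at y = 0."""
    return np.sign(y)

def logistic_grad(x):
    """Derivative of log(1 + exp(-x)), i.e. -1 / (1 + exp(x))."""
    return -1.0 / (1.0 + np.exp(x))

xs = np.linspace(-5.0, 5.0, 201)
ys = xs[:, None]                    # broadcast to all pairs (x, y) on the grid

# Subgradient inequality |x| >= |y| + g(y) * (x - y) for every pair.
sub_ok = np.all(np.abs(xs) >= np.abs(ys) + abs_subgradient(ys) * (xs - ys) - 1e-12)

# Smoothness |f'(x) - f'(y)| <= L * |x - y| with L = 1/4.
smooth_ok = np.all(
    np.abs(logistic_grad(xs) - logistic_grad(ys)) <= 0.25 * np.abs(xs - ys) + 1e-12
)
print(sub_ok, smooth_ok)
```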

(42)

Next ...

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$

Part II: Optimization

stochastic optimization
distributed optimization

Reduce Iteration Complexity: utilizing properties of functions

Part III: Randomization

Classification, Regression
SVD, K-means, Kernel methods

Reduce Data Size: utilizing properties of data

Please stay tuned!

(44)

Big Data Analytics: Optimization and Randomization

Part II: Optimization

(45)

2 Optimization

(Sub)Gradient Methods

Stochastic Optimization Algorithms for Big Data

Stochastic Optimization

Distributed Optimization

(46)

Learning as Optimization

Regularized Empirical Loss Minimization

$$\min_{w \in \mathbb{R}^d} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)}_{F(w)}$$

(47)

Convergence Measure

Most optimization algorithms are iterative:

$$w_{t+1} = w_t + \Delta w_t$$

Iteration Complexity: the number of iterations $T(\epsilon)$ needed to have

$$F(\widehat{w}_T) - \min_{w} F(w) \le \epsilon \quad (\epsilon \ll 1)$$

Convergence Rate: after $T$ iterations, how good is the solution

$$F(\widehat{w}_T) - \min_{w} F(w) \le \epsilon(T)$$

[Figure: objective versus iterations, marking the iteration count $T$ at which the objective gap reaches $\epsilon$]

Total Runtime = Per-iteration Cost × Iteration Complexity

(51)

More on Convergence Measure

Big $O(\cdot)$ notation: explicit dependence on $T$ or $\epsilon$

             Convergence Rate                    Iteration Complexity
linear       $O(\mu^T)$ ($\mu < 1$)              $O(\log(1/\epsilon))$
sub-linear   $O(1/T^\alpha)$, $\alpha > 0$       $O(1/\epsilon^{1/\alpha})$

Why are we interested in Bounds?

(53)

[Figure: distance to optimum versus iterations ($T$) for the rates $0.5^T$ (seconds), $1/T$ (minutes), and $1/T^{0.5}$ (hours)]

Theoretically, we consider

$$O(\mu^T) \prec O\!\left(\frac{1}{T^2}\right) \prec O\!\left(\frac{1}{T}\right) \prec O\!\left(\frac{1}{\sqrt{T}}\right)$$

$$\log\frac{1}{\epsilon} \prec \frac{1}{\sqrt{\epsilon}} \prec \frac{1}{\epsilon} \prec \frac{1}{\epsilon^2}$$
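To make the gap between these rates concrete, the sketch below converts each convergence rate into the number of iterations needed for a target accuracy. The constants hidden by the big-O notation are taken to be 1, and the target `eps = 1e-4` is an arbitrary assumption for illustration.

```python
import math

def linear_iters(eps, mu):
    """Smallest T with mu**T <= eps."""
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / mu))

def sublinear_iters(eps, alpha):
    """Roughly the smallest T with 1 / T**alpha <= eps (up to float rounding)."""
    return math.ceil((1.0 / eps) ** (1.0 / alpha))

eps = 1e-4
counts = [("0.5^T", linear_iters(eps, 0.5)),
          ("1/T^2", sublinear_iters(eps, 2.0)),
          ("1/T", sublinear_iters(eps, 1.0)),
          ("1/sqrt(T)", sublinear_iters(eps, 0.5))]
for name, T in counts:
    print(f"{name:>10}: about {T} iterations")
```

The linear rate needs 14 iterations here, while the sub-linear rates need on the order of a hundred, ten thousand, and a hundred million iterations respectively, which matches the seconds/minutes/hours intuition.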

(57)

Non-smooth $\ell(z)$

hinge loss: $\ell(w^\top x, y) = \max(0, 1 - y\, w^\top x)$
absolute loss: $\ell(w^\top x, y) = |w^\top x - y|$

Smooth $\ell(z)$

squared hinge loss: $\ell(w^\top x, y) = \max(0, 1 - y\, w^\top x)^2$
logistic loss: $\ell(w^\top x, y) = \log(1 + \exp(-y\, w^\top x))$
square loss: $\ell(w^\top x, y) = (w^\top x - y)^2$

(58)

Strongly convex vs. non-strongly convex

$\lambda$-strongly convex $R(w)$

$\ell_2$ regularizer: $\frac{\lambda}{2} \|w\|_2^2$
Elastic net regularizer: $\tau \|w\|_1 + \frac{\lambda}{2} \|w\|_2^2$

Non-strongly convex $R(w)$

unregularized problem: $R(w) \equiv 0$
$\ell_1$ regularizer: $\tau \|w\|_1$

(59)

Gradient Method in Machine Learning

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$$

Suppose $\ell(z)$ is smooth

Full gradient: $\nabla F(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(w^\top x_i, y_i) + \lambda w$

Per-iteration cost: $O(nd)$

Gradient Descent

$$w_t = w_{t-1} - \gamma_t \nabla F(w_{t-1})$$

$\gamma_t$: step size
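A minimal sketch of gradient descent on this objective, assuming the square loss and a constant step size of 1/L, where L is the largest eigenvalue of the Hessian. The data, the step-size rule, and the iteration count are assumptions for illustration, not choices stated in the tutorial.

```python
import numpy as np

def gradient_descent(grad, w0, step, num_iters):
    """Iterate w_t = w_{t-1} - step * grad(w_{t-1})."""
    w = w0.copy()
    for _ in range(num_iters):
        w = w - step * grad(w)
    return w

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1                      # hypothetical problem sizes
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)

# Square loss: grad F(w) = (2/n) X^T (X w - y) + lam * w, costing O(nd) per call.
grad_F = lambda w: 2.0 / n * X.T @ (X @ w - y) + lam * w

H = 2.0 / n * X.T @ X + lam * np.eye(d)      # Hessian of F (constant for square loss)
L = np.linalg.eigvalsh(H).max()              # smoothness constant
w_gd = gradient_descent(grad_F, np.zeros(d), step=1.0 / L, num_iters=300)
w_star = np.linalg.solve(H, 2.0 / n * X.T @ y)   # exact minimizer, for comparison
```

Each call to `grad_F` touches every entry of X, which is the O(nd) per-iteration cost that stochastic methods aim to avoid.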

(61)

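$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$$

If $\lambda = 0$: $R(w)$ is non-strongly convex

Iteration complexity $O\!\left(\frac{1}{\epsilon}\right)$

If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex

Iteration complexity $O\!\left(\frac{1}{\lambda} \log\frac{1}{\epsilon}\right)$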

(62)

Accelerated Gradient Method

Accelerated Gradient Descent

$$w_t = v_{t-1} - \gamma_t \nabla F(v_{t-1})$$

$$v_t = w_t + \eta_t (w_t - w_{t-1})$$

The second update is the momentum step.

$w_t$ is the output and $v_t$ is an auxiliary sequence.
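The two-sequence update can be sketched as follows. The momentum schedule eta_t = (t - 1) / (t + 2) is one common choice and is an assumption here, since the slides do not specify eta_t; the data and step size are likewise illustrative.

```python
import numpy as np

def accelerated_gd(grad, w0, step, num_iters):
    """w_t = v_{t-1} - step * grad(v_{t-1});  v_t = w_t + eta_t * (w_t - w_{t-1})."""
    w_prev, v = w0.copy(), w0.copy()
    for t in range(1, num_iters + 1):
        w = v - step * grad(v)               # gradient step at the auxiliary point
        eta = (t - 1.0) / (t + 2.0)          # assumed momentum schedule
        v = w + eta * (w - w_prev)           # momentum step
        w_prev = w
    return w_prev

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1                      # hypothetical problem sizes
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)

# Ridge objective with square loss: grad F(w) = H w - b.
H = 2.0 / n * X.T @ X + lam * np.eye(d)
b = 2.0 / n * X.T @ y
grad_F = lambda w: H @ w - b

step = 1.0 / np.linalg.eigvalsh(H).max()     # 1 / L
w_agd = accelerated_gd(grad_F, np.zeros(d), step, num_iters=300)
w_star = np.linalg.solve(H, b)               # exact minimizer, for comparison
```

The per-iteration cost is the same as plain gradient descent, since only one gradient is evaluated per step; the gain comes entirely from the reduced iteration count.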

(64)

Accelerated Gradient Method

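$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$$

If $\lambda = 0$: $R(w)$ is non-strongly convex

Iteration complexity $O\!\left(\frac{1}{\sqrt{\epsilon}}\right)$, better than $O\!\left(\frac{1}{\epsilon}\right)$

If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex

Iteration complexity $O\!\left(\frac{1}{\sqrt{\lambda}} \log\frac{1}{\epsilon}\right)$, better than $O\!\left(\frac{1}{\lambda} \log\frac{1}{\epsilon}\right)$ for small $\lambda$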

(65)

Deal with $\ell_1$ regularizer

Consider a more general case

$$\min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R'(w)}_{F'(w)} + \tau \|w\|_1$$

$R(w) = R'(w) + \tau \|w\|_1$

$R'(w)$: $\lambda$-strongly convex and smooth

(67)

Deal with $\ell_1$ regularizer

Accelerated Gradient Descent

$$w_t = \arg\min_{w \in \mathbb{R}^d} \; \nabla F'(v_{t-1})^\top w + \frac{1}{2\gamma_t} \|w - v_{t-1}\|_2^2 + \tau \|w\|_1$$

$$v_t = w_t + \eta_t (w_t - w_{t-1})$$

The first update is a proximal mapping.

The proximal mapping has a closed-form solution: soft-thresholding

Iteration complexity and runtime remain unchanged.
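The closed-form solution can be sketched as follows. Completing the square shows that the proximal step equals soft-thresholding applied to the plain gradient step v - gamma * g with threshold gamma * tau. The function names and sample values are assumptions for illustration.

```python
import numpy as np

def soft_threshold(u, kappa):
    """Entrywise sign(u) * max(|u| - kappa, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - kappa, 0.0)

def prox_step(v, grad_v, step, tau):
    """argmin_w grad_v^T w + ||w - v||^2 / (2 * step) + tau * ||w||_1."""
    return soft_threshold(v - step * grad_v, step * tau)

def prox_objective(w, v, grad_v, step, tau):
    """The objective minimized by prox_step, used here only as a sanity check."""
    return grad_v @ w + np.sum((w - v) ** 2) / (2.0 * step) + tau * np.sum(np.abs(w))

v = np.array([1.0, -2.0, 0.1])               # hypothetical auxiliary point
g = np.array([0.5, 0.5, 0.5])                # hypothetical gradient at v
w_new = prox_step(v, g, step=0.5, tau=1.0)
print(w_new)
```

Entries whose magnitude after the gradient step falls below the threshold are set exactly to zero, which is how the l1 regularizer produces sparse solutions.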

(69)

Sub-Gradient Method in Machine Learning

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$$

Suppose $\ell(z)$ is non-smooth

Full sub-gradient: $\partial F(w) = \frac{1}{n} \sum_{i=1}^{n} \partial \ell(w^\top x_i, y_i) + \lambda w$

Sub-Gradient Descent

$$w_t = w_{t-1} - \gamma_t \partial F(w_{t-1})$$
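A minimal sketch of sub-gradient descent, assuming the hinge loss and the step size gamma_t = 1 / (lambda * t), which is a common choice for strongly convex objectives but is not specified on the slides. Because the objective need not decrease at every step, the sketch keeps the best iterate seen. Data and names are illustrative assumptions.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Mean hinge loss plus (lam/2) * ||w||_2^2."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)

def svm_subgradient(w, X, y, lam):
    """One element of the subdifferential of the objective above."""
    active = (y * (X @ w) < 1.0).astype(float)   # examples with positive hinge loss
    return -(X.T @ (active * y)) / len(y) + lam * w

def subgradient_descent(X, y, lam, num_iters):
    w = np.zeros(X.shape[1])
    best_w, best_obj = w.copy(), svm_objective(w, X, y, lam)
    for t in range(1, num_iters + 1):
        w = w - (1.0 / (lam * t)) * svm_subgradient(w, X, y, lam)  # assumed step size
        obj = svm_objective(w, X, y, lam)
        if obj < best_obj:                       # not a descent method: track the best
            best_w, best_obj = w.copy(), obj
    return best_w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                # hypothetical data
y = np.sign(X @ (np.ones(5) / np.sqrt(5.0)))     # labels from a hypothetical true model
w_best = subgradient_descent(X, y, lam=0.1, num_iters=200)
```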

(71)

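$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$$

If $\lambda = 0$: $R(w)$ is non-strongly convex

Iteration complexity $O\!\left(\frac{1}{\epsilon^2}\right)$

If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex

Iteration complexity $O\!\left(\frac{1}{\lambda \epsilon}\right)$

No efficient acceleration scheme in general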

(72)

Problem Classes and Iteration Complexity

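$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$

Iteration complexity, with $\ell(z) \equiv \ell(z, y)$:

R(w)                    Non-smooth $\ell(z)$          Smooth $\ell(z)$
Non-strongly convex     $O(1/\epsilon^2)$             $O(1/\sqrt{\epsilon})$
λ-strongly convex       $O(1/(\lambda\epsilon))$      $O((1/\sqrt{\lambda}) \log(1/\epsilon))$

Per-iteration cost: $O(nd)$, too high if $n$ or $d$ is large.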

(73)

2 Optimization

(Sub)Gradient Methods

Stochastic Optimization Algorithms for Big Data

Stochastic Optimization

Distributed Optimization

(74)

Stochastic First-Order Method by Data Sampling

Stochastic Gradient Descent (SGD)

Stochastic Variance Reduced Gradient (SVRG)

Stochastic Average Gradient Algorithm (SAGA)

Stochastic Dual Coordinate Ascent (SDCA)

Accelerated Proximal Coordinate Gradient (APCG)

Assumption: $\|x_i\| \le 1$ for any $i$

(75)

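$$F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$$

Full sub-gradient: $\partial F(w) = \frac{1}{n} \sum_{i=1}^{n} \partial \ell(w^\top x_i, y_i) + \lambda w$

Randomly sample $i \in \{1, \dots, n\}$

Stochastic sub-gradient: $\partial \ell(w^\top x_i, y_i) + \lambda w$

$$\mathbb{E}_i\!\left[\partial \ell(w^\top x_i, y_i) + \lambda w\right] = \partial F(w)$$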
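The unbiasedness claim follows in one line if the index $i$ is drawn uniformly from $\{1, \dots, n\}$ (the uniform distribution is an assumption here; the slide only says "randomly sample"):

```latex
\mathbb{E}_i\!\left[\partial \ell(w^\top x_i, y_i) + \lambda w\right]
  = \sum_{i=1}^{n} \frac{1}{n} \left( \partial \ell(w^\top x_i, y_i) + \lambda w \right)
  = \frac{1}{n} \sum_{i=1}^{n} \partial \ell(w^\top x_i, y_i) + \lambda w
  = \partial F(w).
```

Each stochastic sub-gradient costs $O(d)$ to compute, compared with $O(nd)$ for the full sub-gradient.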

(76)

Basic SGD (Nemirovski & Yudin (1978))

Applicable in all settings!

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2$$

sample: $i_t \in \{1, \dots, n\}$

update: $w_t = w_{t-1} - \gamma_t \left( \partial \ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1} \right)$

output: $\widehat{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$
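The sample/update/output steps above can be sketched as follows, assuming the hinge loss, uniform sampling, and the step size gamma_t = 1 / (lambda * t). These three choices, the synthetic data, and the function names are assumptions for illustration; the rows of X are rescaled to unit norm to match the assumption ||x_i|| <= 1.

```python
import numpy as np

def objective(w, X, y, lam):
    """Mean hinge loss plus (lam/2) * ||w||_2^2."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)

def sgd(X, y, lam, num_iters, rng):
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, num_iters + 1):
        i = rng.integers(n)                              # sample i_t uniformly
        margin = y[i] * (X[i] @ w)
        loss_sub = -y[i] * X[i] if margin < 1.0 else np.zeros(d)   # hinge sub-gradient
        w = w - (1.0 / (lam * t)) * (loss_sub + lam * w)  # assumed step size 1/(lam*t)
        w_sum += w
    return w_sum / num_iters                             # averaged output

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))                        # hypothetical data
X /= np.linalg.norm(X, axis=1, keepdims=True)            # enforce ||x_i|| <= 1
y = np.sign(X @ np.ones(5))                              # labels from a hypothetical model
w_avg = sgd(X, y, lam=0.1, num_iters=5000, rng=rng)
```

Each iteration touches a single example, so the per-iteration cost is O(d) rather than O(nd).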
