Big Data Analytics: Optimization and Randomization

(1)

Tianbao Yang^†, Qihang Lin^♮, Rong Jin^∗‡

Tutorial@SIGKDD 2015, Sydney, Australia

†Department of Computer Science, The University of Iowa, IA, USA

♮Department of Management Sciences, The University of Iowa, IA, USA

∗Department of Computer Science and Engineering, Michigan State University, MI, USA

‡Institute of Data Science and Technologies at Alibaba Group, Seattle, USA

August 10, 2015

(2)

URL

http://www.cs.uiowa.edu/~tyng/kdd15-tutorial.pdf

(3)

No

This tutorial is not an exhaustive literature survey

It is not a survey on different machine learning/data mining algorithms

Yes

It is about how to efficiently solve machine learning/data mining (formulated as optimization) problems for big data

(4)

Outline

Part I: Basics

Part II: Optimization

Part III: Randomization

(5)

(6)

Outline

1 Basics

Introduction

Notations and Definitions

(7)

[Figure: Data, Model, and Optimization, with a plot of distance to optimal objective versus iterations for the rates 0.5^T, 1/T², and 1/T]

(8)

Big Data Challenge

Big Data

(9)

Big Model

60 million parameters

(10)

Learning as Optimization

Ridge Regression Problem:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \frac{\lambda}{2} \|w\|_2^2$$

x_i ∈ R^d: d-dimensional feature vector

y_i ∈ R: target variable

w ∈ R^d: model parameters

n: number of data points
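The ridge regression objective above can be evaluated directly. The following is a minimal sketch in plain Python; the function name `ridge_objective` and the toy data are illustrative assumptions, not part of the tutorial.

```python
def ridge_objective(w, X, y, lam):
    """Evaluate (1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2."""
    n = len(X)
    # Empirical loss: mean squared residual over the n data points.
    loss = sum(
        (yi - sum(wj * xj for wj, xj in zip(w, xi))) ** 2
        for xi, yi in zip(X, y)
    ) / n
    # Regularization: (lam / 2) times the squared Euclidean norm of w.
    reg = 0.5 * lam * sum(wj * wj for wj in w)
    return loss + reg


# Toy data (hypothetical): n = 3 points, d = 2 features, y = x_1 + 2 * x_2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]

value_at_fit = ridge_objective([1.0, 2.0], X, y, lam=0.1)   # zero loss, penalty only
value_at_zero = ridge_objective([0.0, 0.0], X, y, lam=0.1)  # zero penalty, loss only
```

At w = (1, 2) the data are fit exactly, so only the penalty term contributes; at w = 0 only the loss contributes.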

(11)

$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2} \|w\|_2^2}_{\text{Regularization}}$$

(13)

Classification Problems:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^\top x_i) + \frac{\lambda}{2} \|w\|_2^2$$

y_i ∈ {+1, −1}: label

Loss function ℓ(z), where z = y w^⊤ x

1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2

2. Logistic Regression: ℓ(z) = log(1 + exp(−z))
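The two loss functions listed above can be sketched as follows. The function names are illustrative assumptions, and the branch in `logistic_loss` is only a numerical-stability detail, not something the slides prescribe.

```python
import math


def hinge_loss(z, p=1):
    """(Squared) hinge loss max(0, 1 - z)^p for p = 1 or p = 2."""
    return max(0.0, 1.0 - z) ** p


def logistic_loss(z):
    """Logistic loss log(1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    # For z < 0, log(1 + exp(-z)) = -z + log(1 + exp(z)).
    return -z + math.log1p(math.exp(z))


def margin(w, x, y):
    """z = y * w^T x for a label y in {+1, -1}."""
    return y * sum(wj * xj for wj, xj in zip(w, x))


# A correctly classified point with margin 2 incurs no hinge loss.
z = margin([1.0, 1.0], [1.0, 1.0], +1)
```

Both losses decrease as the margin z grows; the hinge loss is exactly zero once z ≥ 1, while the logistic loss only approaches zero.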

(14)

Feature Selection:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda \|w\|_1$$

$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^{d} |w_i|$

λ controls sparsity level

(15)

Feature Selection using Elastic Net:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda \|w\|_1 + \gamma \|w\|_2^2$$

Elastic net regularizer, more robust than the $\ell_1$ regularizer
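As a small illustration of the two feature-selection regularizers, the penalties can be computed as below. This is a sketch with assumed function names; it covers only the regularization terms, not the loss.

```python
def l1_penalty(w, lam):
    """lam * ||w||_1, the l1 regularizer."""
    return lam * sum(abs(wj) for wj in w)


def elastic_net_penalty(w, lam, gamma):
    """lam * ||w||_1 + gamma * ||w||_2^2, the elastic net regularizer."""
    return l1_penalty(w, lam) + gamma * sum(wj * wj for wj in w)


w = [3.0, -4.0, 0.0]  # hypothetical weight vector with one zero coordinate
l1_value = l1_penalty(w, lam=0.5)
en_value = elastic_net_penalty(w, lam=0.5, gamma=0.1)
```

With γ = 0 the elastic net penalty reduces to the $\ell_1$ penalty.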

(16)

Multi-class/Multi-task Learning:

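$$\min_{W} \frac{1}{n} \sum_{i=1}^{n} \ell(W x_i, y_i) + \lambda\, r(W)$$

W ∈ R^{K×d}

$r(W) = \|W\|_F^2 = \sum_{k=1}^{K} \sum_{j=1}^{d} W_{kj}^2$: (squared) Frobenius Norm

$r(W) = \|W\|_* = \sum_i \sigma_i$: Nuclear Norm (sum of singular values)

$r(W) = \|W\|_{1,\infty} = \sum_{j=1}^{d} \|W_{:j}\|_\infty$: $\ell_{1,\infty}$ mixed norm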
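Two of the matrix regularizers above are easy to compute entrywise. The sketch below uses plain nested lists for W and assumed function names; the nuclear norm is omitted because it needs a singular value decomposition.

```python
def frobenius_sq(W):
    """Squared Frobenius norm: sum of squared entries of W."""
    return sum(v * v for row in W for v in row)


def l1_inf_norm(W):
    """l_{1,inf} mixed norm: sum over columns j of max_k |W_kj|."""
    num_cols = len(W[0])
    return sum(max(abs(row[j]) for row in W) for j in range(num_cols))


# Hypothetical K = 2 by d = 3 weight matrix; the third feature is unused by every class.
W = [[1.0, -2.0, 0.0],
     [3.0, 1.0, 0.0]]

frob = frobenius_sq(W)
mixed = l1_inf_norm(W)
```

A column of zeros contributes nothing to the mixed norm, which is why this regularizer is associated with selecting features jointly across classes or tasks.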

(17)

Regularized Empirical Loss Minimization

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$

Both ℓ and R are convex functions

Extensions to Matrix Cases are possible (sometimes straightforward)

Extensions to Kernel methods can be combined with randomized approaches

Extensions to Non-convex (e.g., deep learning) are in progress

(18)

Data Matrices and Machine Learning

The Instance-feature Matrix: X ∈ R^{n×d}

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$$

(19)

The output vector:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^{n \times 1}$$

continuous y_i ∈ R: regression (e.g., house price)

discrete, e.g., y_i ∈ {1, 2, 3}: classification (e.g., species of iris)

(20)

The Instance-Instance Matrix: K ∈ R^{n×n}

Similarity Matrix

Kernel Matrix

(21)

Some machine learning tasks are formulated on the kernel matrix:

Clustering

Kernel Methods

(22)

The Feature-Feature Matrix: C ∈ R^{d×d}

Covariance Matrix

Distance Metric Matrix

(23)

Some machine learning tasks require the covariance matrix:

Principal Component Analysis

Top-k Singular Value (Eigen-Value) Decomposition of the Covariance Matrix
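To make the shapes of the three data matrices concrete, here is a small sketch. It assumes a linear kernel, K = X X^⊤, and the uncentered, unscaled second-moment matrix C = X^⊤ X (which matches the covariance matrix only up to centering and a 1/n factor); the slides do not fix either choice. The helper names are hypothetical.

```python
def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Instance-feature matrix with n = 3 instances and d = 2 features.
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

K = matmul(X, transpose(X))  # instance-instance matrix, n x n (linear kernel, assumed)
C = matmul(transpose(X), X)  # feature-feature matrix, d x d (uncentered, assumed)
```

K grows with the number of instances and C with the number of features, so which of the two is cheaper to form depends on whether n or d is larger.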

(24)

Why is Learning from Big Data Challenging?

High per-iteration cost

High memory cost

High communication cost

Large iteration complexity

(25)

1 Basics

Introduction

Notations and Definitions

(26)

Norms

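Vector x ∈ R^d

Euclidean vector norm: $\|x\|_2 = \sqrt{x^\top x} = \sqrt{\sum_{i=1}^{d} x_i^2}$

$\ell_p$-norm of a vector: $\|x\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$, where p ≥ 1

1. $\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2}$

2. $\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^{d} |x_i|$

3. $\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$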
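The three special cases above follow from one definition. A minimal sketch, with `p_norm` as an assumed helper name and `float("inf")` standing in for p = ∞:

```python
def p_norm(x, p):
    """l_p norm of a vector x for p >= 1, with p = float('inf') for the max norm."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3.0, -4.0]
norm_1 = p_norm(x, 1)               # |3| + |-4|
norm_2 = p_norm(x, 2)               # sqrt(9 + 16)
norm_inf = p_norm(x, float("inf"))  # max(|3|, |-4|)
```

For any vector, the values are ordered as ‖x‖_∞ ≤ ‖x‖_2 ≤ ‖x‖_1.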


(29)

Matrix X ∈ R^{n×d}

Singular Value Decomposition: $X = U \Sigma V^\top$

1. U ∈ R^{n×r}: orthonormal columns ($U^\top U = I$), which span the column space

2. Σ ∈ R^{r×r}: diagonal matrix with $\Sigma_{ii} = \sigma_i > 0$, $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$

3. V ∈ R^{d×r}: orthonormal columns ($V^\top V = I$), which span the row space

4. r ≤ min(n, d): max value such that σ_r > 0, i.e., the rank of X

5. $U_k \Sigma_k V_k^\top$: top-k approximation

Pseudo inverse: $X^\dagger = V \Sigma^{-1} U^\top$

QR factorization: X = QR (n ≥ d)

Q ∈ R^{n×d}: orthonormal columns

R ∈ R^{d×d}: upper triangular matrix

(30)

Matrix Factorization

(31)

(32)

Norms

Frobenius norm: $\|X\|_F = \sqrt{\mathrm{tr}(X^\top X)} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{d} X_{ij}^2}$

Spectral (induced) norm of a matrix: $\|X\|_2 = \max_{\|u\|_2 = 1} \|X u\|_2$

$\|X\|_2 = \sigma_1$ (maximum singular value)


(34)

Convex Optimization

$$\min_{x \in \mathcal{X}} f(x)$$

$\mathcal{X}$ is a convex domain: for any $x, y \in \mathcal{X}$ and $\alpha \in [0, 1]$, their convex combination $\alpha x + (1 - \alpha) y \in \mathcal{X}$

f(x) is a convex function

(35)

Characterization of Convex Function

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \forall x, y \in \mathcal{X},\ \alpha \in [0, 1]$$

$$f(x) \ge f(y) + \nabla f(y)^\top (x - y), \quad \forall x, y \in \mathcal{X}$$

local optimum is global optimum
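Both characterizations can be checked numerically on a grid of points. The sketch below is a sanity check for one-dimensional functions, not a proof; the function name, grid, and tolerance are assumptions chosen for illustration.

```python
def satisfies_convexity(f, grad, points, alphas, tol=1e-9):
    """Check both convexity inequalities for a 1-D function on the given points."""
    for x in points:
        for y in points:
            # First-order condition: f(x) >= f(y) + f'(y) * (x - y).
            if f(x) < f(y) + grad(y) * (x - y) - tol:
                return False
            for a in alphas:
                # Zeroth-order condition on the convex combination of x and y.
                if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + tol:
                    return False
    return True


grid = [-3.0 + 0.5 * i for i in range(13)]  # points in [-3, 3]
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]

square_ok = satisfies_convexity(lambda x: x * x, lambda x: 2 * x, grid, alphas)
neg_square_ok = satisfies_convexity(lambda x: -x * x, lambda x: -2 * x, grid, alphas)
```

The check passes for x² and fails for −x², which violates both inequalities.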


(37)

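Convex vs Strongly Convex

Convex function:

$$f(x) \ge f(y) + \nabla f(y)^\top (x - y), \quad \forall x, y \in \mathcal{X}$$

Strongly Convex function:

$$f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{\lambda}{2} \|x - y\|_2^2, \quad \forall x, y \in \mathcal{X}$$

λ: strong convexity constant

Global optimum is unique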
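The extra quadratic term is what separates the two definitions. As a hedged one-dimensional illustration, the helper below (a hypothetical name) returns the slack in the strong convexity inequality; a nonnegative value for all pairs means the inequality holds with that λ.

```python
def strong_convexity_gap(f, grad, x, y, lam):
    """f(x) - [f(y) + f'(y) * (x - y) + (lam / 2) * (x - y)^2] for a 1-D function."""
    return f(x) - (f(y) + grad(y) * (x - y) + 0.5 * lam * (x - y) ** 2)


def square(x):
    return x * x


def square_grad(x):
    return 2.0 * x


x, y = 2.0, -1.0
gap_lam1 = strong_convexity_gap(square, square_grad, x, y, lam=1.0)  # positive slack
gap_lam2 = strong_convexity_gap(square, square_grad, x, y, lam=2.0)  # tight: zero slack
gap_lam3 = strong_convexity_gap(square, square_grad, x, y, lam=3.0)  # violated

# A linear function is convex but not strongly convex for any lam > 0.
gap_linear = strong_convexity_gap(lambda v: v, lambda v: 1.0, x, y, lam=1.0)
```

For f(x) = x² the inequality holds for every λ up to 2 and fails beyond it, so 2 is its strong convexity constant.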


(39)

Non-smooth function vs Smooth function

Non-smooth function

Lipschitz continuous, e.g., absolute loss f(x) = |x|

$|f(x) - f(y)| \le G \|x - y\|_2$, where G is the Lipschitz constant

Subgradient: $f(x) \ge f(y) + \partial f(y)^\top (x - y)$

[Figure: |x| (non-smooth) with a sub-gradient line]

Smooth function

e.g., logistic loss f(x) = log(1 + exp(−x))

$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2$, where L is the smoothness constant

[Figure: log(1+exp(−x)) with the line f(y) + f'(y)(x − y) at y and a curve labeled "Quadratic Function"]
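The two constants can be estimated numerically for the example functions. The sketch below takes the largest difference quotient over a grid; the grid and names are assumptions. For reference, |x| has Lipschitz constant G = 1, and the logistic loss has smoothness constant L = 1/4 because its second derivative is at most 1/4.

```python
import math


def abs_loss(x):
    return abs(x)


def logistic_grad(x):
    """Derivative of log(1 + exp(-x)), which equals -1 / (1 + exp(x))."""
    return -1.0 / (1.0 + math.exp(x))


grid = [-5.0 + 0.25 * i for i in range(41)]  # points in [-5, 5]
pairs = [(x, y) for x in grid for y in grid if x != y]

# Largest observed |f(x) - f(y)| / |x - y| for the absolute loss.
max_abs_ratio = max(abs(abs_loss(x) - abs_loss(y)) / abs(x - y) for x, y in pairs)

# Largest observed |f'(x) - f'(y)| / |x - y| for the logistic loss.
max_grad_ratio = max(
    abs(logistic_grad(x) - logistic_grad(y)) / abs(x - y) for x, y in pairs
)
```

The first ratio reaches G = 1, while the second stays just below L = 1/4 on this grid.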


(42)

Next ...

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$

Part II: Optimization

stochastic optimization

distributed optimization

Reduce Iteration Complexity: utilizing properties of functions

(43)

Part III: Randomization

Classification, Regression

SVD, K-means, Kernel methods

Reduce Data Size: utilizing properties of data

Please stay tuned!

(44)

Big Data Analytics: Optimization and Randomization

Part II: Optimization

(45)

2 Optimization

(Sub)Gradient Methods

Stochastic Optimization Algorithms for Big Data

Stochastic Optimization

Distributed Optimization

(46)

Regularized Empirical Loss Minimization

$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)}_{F(w)}$$

(47)

Optimization (Sub)Gradient Methods

Convergence Measure

Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$

Convergence Rate: after T iterations, how good is the solution $\hat{w}_T$?

$$F(\hat{w}_T) - \min_{w} F(w) \le \epsilon(T)$$

[Figure: objective versus iterations, with T and ε marked]

Total Runtime = Per-iteration Cost × Iteration Complexity
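As a concrete instance of an iterative update and its convergence measure, the sketch below runs gradient descent on a one-dimensional quadratic. The choice Δw_t = −η F′(w_t), the objective F(w) = ½(w − 3)², and the step size are assumptions made only for illustration.

```python
def F(w):
    """Toy objective with minimizer w = 3 and minimum value 0 (hypothetical)."""
    return 0.5 * (w - 3.0) ** 2


def grad_F(w):
    return w - 3.0


def gradient_descent(w0, step, T):
    """Run T updates w_{t+1} = w_t + delta_w_t with delta_w_t = -step * grad_F(w_t)."""
    w = w0
    gaps = []
    for _ in range(T):
        delta_w = -step * grad_F(w)
        w = w + delta_w
        gaps.append(F(w) - 0.0)  # F(w_t) - min_w F(w), with min_w F(w) = 0
    return w, gaps


w_T, gaps = gradient_descent(w0=0.0, step=0.5, T=20)
```

With this step size the distance to the minimizer halves at every iteration, so the objective gap shrinks by a factor of four per step, an example of a linear convergence rate.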

(48)

Convergence Measure

Iteration Complexity: the number of iterations T(ε) needed to have

$$F(\hat{w}_T) - \min_{w} F(w) \le \epsilon \quad (\epsilon \ll 1)$$


(51)

Big O(·) notation: explicit dependence on T or ε

              Convergence Rate       Iteration Complexity
linear        O(μ^T) (μ < 1)         O(log(1/ε))
sub-linear    O(1/T^α) (α > 0)       O((1/ε)^{1/α})

Why are we interested in Bounds?
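One practical reading of the table is that a bound turns a target accuracy ε into an iteration budget. The sketch below inverts the two rates, assuming the bounds hold exactly in the forms c·μ^T and c/T^α with a known constant c; the function names and the constant are assumptions for illustration.

```python
import math


def iterations_linear(eps, mu, c=1.0):
    """Smallest T with c * mu**T <= eps, for 0 < mu < 1 (up to floating-point rounding)."""
    return max(0, math.ceil(math.log(c / eps) / math.log(1.0 / mu)))


def iterations_sublinear(eps, alpha, c=1.0):
    """Smallest T >= 1 with c / T**alpha <= eps, for alpha > 0 (up to rounding)."""
    return max(1, math.ceil((c / eps) ** (1.0 / alpha)))


eps = 1e-6
t_linear = iterations_linear(eps, mu=0.5)            # grows like log(1/eps)
t_sublinear_1 = iterations_sublinear(eps, alpha=1.0)  # grows like 1/eps
t_sublinear_2 = iterations_sublinear(eps, alpha=2.0)  # grows like sqrt(1/eps)
```

For ε = 10⁻⁶ the linear rate needs a few tens of iterations, the 1/T² rate on the order of a thousand, and the 1/T rate on the order of a million.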

(52)