Support Vector Machines: Training with Stochastic Gradient Descent

(1)

1

Support Vector Machines: Training with Stochastic Gradient Descent

Machine Learning Fall 2017

Supervised Learning: The Setup

1 Machine Learning Spring 2018

The slides are mainly from Vivek Srikumar

(2)

Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

(3)

SVM objective function

3

Regularization term:

• Maximize the margin

• Imposes a preference over the hypothesis space and pushes for better generalization

• Can be replaced with other

regularization terms which impose other preferences

Empirical Loss:

• Hinge loss

• Penalizes weight vectors that make mistakes

• Can be replaced with other loss functions which impose other preferences

A hyper-parameter that controls the tradeoff

between a large margin and a small hinge-loss

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

max

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

max

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

^>₀

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

5

(4)

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent 2. Stochastic gradient descent

3. Gradient descent vs stochastic gradient descent 4. Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM

6. Comparison to perceptron

(5)

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent 2. Stochastic gradient descent

3. Gradient descent vs stochastic gradient descent 4. Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM 6. Comparison to perceptron

5

(6)

Solving the SVM optimization problem

This function is convex in w, b

For convenience, use simplified notation:

w

₀

← w w ← [w

₀

,b]

x

_i

← [x

_i

,1]

6

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

max

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

max

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 max

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

^>₀

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

max

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

max

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

^>₀

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

5

(7)

A function ! is convex if for every ", $ in the domain, and for every % ∈ [0,1] we have

! %" + 1 − % $ ≤ %! " + 1 − % !($)

7

u v

f(v) f(u)

Recall: Convex functions

From geometric perspective

Every tangent plane lies below the function

(8)

A function ! is convex if for every ", $ in the domain, and for every % ∈ [0,1] we have

! %" + 1 − % $ ≤ %! " + 1 − % !($)

8

u v

f(v) f(u)

Recall: Convex functions

From geometric perspective

Every tangent plane lies below the function

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

L_Hinge(y, x, w, b) = max 0, 1 y(w^>x + b)

minw

1

2w^>₀ w₀ + C X

i

max(0, 1 y_iw^>x_i)

J^t(w) = 1

2w^>₀ w₀ + C X

i

max(0, 1 y_iw^>x_i)

rJ^t =

⇢[w₀; 0] if max(0, 1 y_iw_i^x) = 0 [w₀; 0] Cy_ix_i otherwise

f (x) f (u) + rf(u)^>(x u)

(9)

Convex functions

9

Linear functions max is convex

Some ways to show that a function is convex:

1. Using the definition of convexity

2. Showing that the second derivative is

nonnegative (for one dimensional functions) 3. Showing that the second derivative is

positive semi-definite (for vector functions)

(10)

Not all functions are convex

These are concave

These are neither

! "# + 1 − " ' ≥ "! # + 1 − " !(')

(11)

Convex functions are convenient

A function ! is convex if for every ", $ in the domain, and for every % ∈ [0,1] we have

! %" + 1 − % $ ≤ %! " + 1 − % !($)

In general: Necessary condition for x to be a minimum for the function f is ∇f (x)= 0

For convex functions, this is both necessary and sufficient

11

u v

f(v) f(u)

(12)

This function is convex in w

• This is a quadratic optimization problem because the objective is quadratic

• Older methods: Used techniques from Quadratic Programming

– Very slow

• No constraints, can use gradient descent

– Still very slow!

Solving the SVM optimization problem

12

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

max

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

max

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 max

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

(13)

Gradient descent

General strategy for minimizing a function J(w)

• Start with an initial guess for w, say w

⁰

• Iterate till convergence:

– Compute the gradient of J at w

^t

– Update w

^t

to get w

^t+1

by taking a step in the opposite direction of the gradient

13

J(w)

w w⁰

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + CX

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

5

(14)

Gradient descent

General strategy for minimizing a function J(w)

• Start with an initial guess for w, say w

⁰

• Iterate till convergence:

– Compute the gradient of J at w

^t

– Update w

^t

to get w

^t+1

by taking a step in the opposite direction of the gradient

14

J(w)

w w⁰

w¹

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + CX

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

(15)

Gradient descent

General strategy for minimizing a function J(w)

• Start with an initial guess for w, say w

⁰

• Iterate till convergence:

– Compute the gradient of J at w

^t

– Update w

^t

to get w

^t+1

by taking a step in the opposite direction of the gradient

15

J(w)

w w⁰

w¹ w²

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + CX

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

5

(16)

Gradient descent

General strategy for minimizing a function J(w)

• Start with an initial guess for w, say w

⁰

• Iterate till convergence:

– Compute the gradient of J at w

^t

– Update w

^t

to get w

^t+1

by taking a step in the opposite direction of the gradient

16

J(w)

w w⁰

w¹ w² w³

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + CX

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

(17)

Gradient descent for SVM

1. Initialize w

⁰

2. For t = 0, 1, 2, ….

1. Compute gradient of J(w) at w

^t

. Call it ∇J(w

^t

) 2. Update w as follows:

17

r: Called the learning rate .

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

LHinge(y, x, w, b) = max 0, 1 y(w^>x + b)

minw

1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

5

(18)

Outline: Training SVM by optimization

ü Review of convex functions and gradient descent 2. Stochastic gradient descent

3. Gradient descent vs stochastic gradient descent 4. Sub-derivatives of the hinge loss

5. Stochastic sub-gradient descent for SVM

6. Comparison to perceptron

(19)

Gradient descent for SVM

1. Initialize w

⁰

2. For t = 0, 1, 2, ….

1. Compute gradient of J(w) at w

^t

. Call it ∇J(w

^t

) 2. Update w as follows:

19

r: Called the learning rate

Gradient of the SVM objective requires summing over the entire training set

Slow , does not really scale

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

LHinge(y, x, w, b) = max 0, 1 y(w^>x + b)

minw

1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

5

(20)

Stochastic gradient descent for SVM

Given a training set S = {(x

_i

, y

_i

)}, x ∈ ℜ

ⁿ

, y ∈ {-1,1}

1. Initialize w

⁰

= 0 ∈ ℜ

ⁿ

2. For epoch = 1 … T:

1. Pick a random example (x

_i

, y

_i

) from the training set S

2. Treat (x

_i

, y

_i

) as a full dataset and take the derivative of the SVM objective at the current w

^t-1

to be rJ

^t

(w

^t-1

)

3. Update: w

^t Ã w^t-1

– °

_t

rJ

^t

(w

^t-1

)

3. Return final w

20 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

(21)

Stochastic gradient descent for SVM

Given a training set S = {(x

_i

, y

_i

)}, x ∈ ℜ

ⁿ

, y ∈ {-1,1}

1. Initialize w

⁰

= 0 ∈ ℜ

ⁿ

2. For epoch = 1 … T:

1. Pick a random example (x

_i

, y

_i

) from the training set S

2. Treat (x

_i

, y

_i

) as a full dataset and take the derivative of the SVM objective at the current w

^t-1

to be rJ

^t

(w

^t-1

)

3. Update: w

^t Ã w^t-1

– °

_t

rJ

^t

(w

^t-1

)

3. Return final w

21 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

5

(22)

Stochastic gradient descent for SVM

Given a training set S = {(x

_i

, y

_i

)}, x ∈ ℜ

ⁿ

, y ∈ {-1,1}

1. Initialize w

⁰

= 0 ∈ ℜ

ⁿ

2. For epoch = 1 … T:

1. Pick a random example (x

_i

, y

_i

) from the training set S

2. Repeat (x

_i

, y

_i

) to make a full dataset and take the derivative of the SVM objective at the current w

^t-1

to be ∇J

^t

(w

^t-1

)

3. Update: w

^t Ã w^t-1

– °

_t

rJ

^t

(w

^t-1

)

3. Return final w

22 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

(23)

Stochastic gradient descent for SVM

Given a training set S = {(x

_i

, y

_i

)}, x ∈ ℜ

ⁿ

, y ∈ {-1,1}

1. Initialize w

⁰

= 0 ∈ ℜ

ⁿ

2. For epoch = 1 … T:

1. Pick a random example (x

_i

, y

_i

) from the training set S

2. Repeat (x

_i

, y

_i

) to make a full dataset and take the derivative of the SVM objective at the current w

^t-1

to be ∇J

^t

(w

^t-1

)

3. Update: w

^t

←w

^t-1

– %

_t

∇J

^t

(w

^t-1

)

3. Return final w

23 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + CX

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J(w) = 1

2w^>₀ w₀ + CX

i

max(0, 1 y_iw^>x_i)

5 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

^>₀

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

^>₀

w

₀

+ C · N max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0 [w

₀

; 0] Cy

_i

x

_i

otherwise

f (x) f (u) + rf(u)

^>