Support Vector Machines: Training with Stochastic Gradient Descent
Machine Learning Fall 2017
Machine Learning Spring 2018
The slides are mainly from Vivek Srikumar
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
SVM objective function
Regularization term:
• Maximize the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences
Empirical Loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences
C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
Hard-margin SVM:
$$\min_{w} \frac{1}{2} w^\top w \quad \text{s.t. } \forall i,\; y_i(w^\top x_i + b) \ge 1$$
Soft-margin SVM with slack variables $\xi_i$:
$$\min_{w} \frac{1}{2} w^\top w + C \sum_i \xi_i \quad \text{s.t. } \forall i,\; y_i(w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0$$
Equivalent unconstrained form (the SVM objective):
$$\min_{w,b} \frac{1}{2} w^\top w + C \sum_i \max\big(0,\, 1 - y_i(w^\top x_i + b)\big)$$
where the hinge loss is
$$L_{Hinge}(y, x, w, b) = \max\big(0,\, 1 - y(w^\top x + b)\big)$$
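The objective above can be evaluated directly. The following is a minimal sketch in plain Python; the helper names (`dot`, `hinge_loss`, `svm_objective`) and the three toy examples are assumptions made for illustration and do not come from the slides.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def hinge_loss(y, x, w, b):
    """max(0, 1 - y (w^T x + b)) for a single example."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def svm_objective(w, b, data, C):
    """Regularizer (1/2) w^T w plus C times the summed hinge loss."""
    regularizer = 0.5 * dot(w, w)
    empirical_loss = sum(hinge_loss(y, x, w, b) for x, y in data)
    return regularizer + C * empirical_loss

# Hypothetical toy data: (feature vector, label in {-1, +1}).
data = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([0.5, 0.0], 1)]
w, b = [1.0, -1.0], 0.0

print(svm_objective(w, b, data, C=1.0))   # 1.5
print(svm_objective(w, b, data, C=10.0))  # 6.0
```

Only the third example falls inside the margin (its hinge loss is 0.5), so raising C from 1 to 10 raises the objective from 1.5 to 6.0 while the regularizer stays at 1.0.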
Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Solving the SVM optimization problem
$$\min_{w,b} \frac{1}{2} w^\top w + C \sum_i \max\big(0,\, 1 - y_i(w^\top x_i + b)\big)$$
This function is convex in w, b
For convenience, use simplified notation:
$$w_0 \leftarrow w, \qquad w \leftarrow [w_0, b], \qquad x_i \leftarrow [x_i, 1]$$
With this notation, the objective becomes
$$\min_{w} \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\, 1 - y_i w^\top x_i\big)$$
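The simplified notation folds the bias into the weight vector by appending a constant 1 to every example. A small sketch of that bookkeeping follows; the function names and toy data are hypothetical, and the bias is assumed to be stored as the last coordinate.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def augment_weights(w0, b):
    """w <- [w0, b]: the bias becomes the last weight."""
    return list(w0) + [b]

def augment_example(x):
    """x_i <- [x_i, 1]: a constant feature multiplies the bias."""
    return list(x) + [1.0]

def objective_simplified(w, data, C):
    """(1/2) w0^T w0 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    w0 = w[:-1]  # only the original weights are regularized, not the bias
    regularizer = 0.5 * dot(w0, w0)
    loss = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in data)
    return regularizer + C * loss

# Hypothetical toy data, augmented once up front.
raw = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([0.5, 0.0], 1)]
data = [(augment_example(x), y) for x, y in raw]
w = augment_weights([1.0, -1.0], -0.25)

print(dot(w, data[0][0]))                    # 1.75, i.e. w0^T x + b
print(objective_simplified(w, data, C=2.0))  # 2.5
```

Note that the regularizer uses only `w[:-1]`, which is why the simplified objective is written with $w_0^\top w_0$ rather than $w^\top w$.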
Recall: Convex functions
A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1] we have
$$f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)$$
[Figure: a convex function with the points u and v and the values f(u) and f(v) marked]
From a geometric perspective: every tangent plane lies below the function
$$f(x) \ge f(u) + \nabla f(u)^\top (x - u)$$
Convex functions
[Figure: examples of convex functions, labeled "Linear functions" and "max is convex"]
Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is nonnegative (for one-dimensional functions)
3. Showing that the second derivative (the Hessian) is positive semi-definite (for vector functions)
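The definition of convexity can also be probed numerically. The sketch below checks the inequality on a small grid; it can expose a non-convex function but cannot prove convexity, and the function name `violates_convexity`, the grid, and the three test functions are all assumptions chosen for illustration.

```python
def violates_convexity(f, points, lambdas, tol=1e-12):
    """Return True if some (u, v, lam) breaks
    f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v)."""
    for u in points:
        for v in points:
            for lam in lambdas:
                lhs = f(lam * u + (1 - lam) * v)
                rhs = lam * f(u) + (1 - lam) * f(v)
                if lhs > rhs + tol:
                    return True
    return False

points = [float(p) for p in range(-5, 6)]
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]

linear = lambda z: 3.0 * z + 1.0      # linear functions are convex
hinge = lambda z: max(0.0, 1.0 - z)   # max of two linear functions
cube = lambda z: z ** 3               # concave for z < 0, so not convex

print(violates_convexity(linear, points, lambdas))  # False
print(violates_convexity(hinge, points, lambdas))   # False
print(violates_convexity(cube, points, lambdas))    # True
```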
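Not all functions are convex
[Figure: example curves, labeled "These are concave" and "These are neither"]
Concave functions satisfy the reverse inequality:
$$f(\lambda u + (1 - \lambda) v) \ge \lambda f(u) + (1 - \lambda) f(v)$$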
Convex functions are convenient
A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1] we have
$$f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)$$
[Figure: a convex function with the points u and v and the values f(u) and f(v) marked]
In general: a necessary condition for x to be a minimum of the function f is ∇f(x) = 0
For convex functions, this condition is both necessary and sufficient
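As a small worked illustration of why convexity matters for this condition (the two functions are chosen for this example and are not from the slides), compare a convex and a non-convex function:

$$f(x) = (x - 3)^2: \quad f'(x) = 2(x - 3) = 0 \iff x = 3, \quad \text{and } f \text{ is convex, so } x = 3 \text{ is the global minimum.}$$

$$g(x) = x^3: \quad g'(0) = 0, \quad \text{but } g(-1) = -1 < g(0) = 0, \text{ so } x = 0 \text{ is not a minimum.}$$

The function g is not convex, so a zero derivative is only a necessary condition there, not a sufficient one.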
Solving the SVM optimization problem
$$\min_{w} \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\, 1 - y_i w^\top x_i\big)$$
This function is convex in w
• This is a quadratic optimization problem because the objective is quadratic
• Older methods: used techniques from quadratic programming
– Very slow
• No constraints, so we can use gradient descent
– Still very slow!
Gradient descent
General strategy for minimizing a function J(w)
• Start with an initial guess for w, say $w^0$
• Iterate till convergence:
– Compute the gradient of J at $w^t$
– Update $w^t$ to get $w^{t+1}$ by taking a step in the opposite direction of the gradient
[Figure: J(w) plotted against w, with successive iterates $w^0, w^1, w^2, w^3$ moving toward the minimum]
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction
We are trying to minimize
$$J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\, 1 - y_i w^\top x_i\big)$$
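The general strategy can be sketched on a one-dimensional convex function. In this minimal example the function $J(w) = (w - 3)^2$, the step size, and the fixed number of steps (standing in for "iterate till convergence") are all assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, rate, steps):
    """Repeatedly step against the gradient; return every iterate."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = w - rate * grad(w)  # move opposite to the gradient
        trajectory.append(w)
    return trajectory

# J(w) = (w - 3)^2 has gradient 2 (w - 3) and its minimum at w = 3.
grad_J = lambda w: 2.0 * (w - 3.0)
trajectory = gradient_descent(grad_J, w0=0.0, rate=0.25, steps=20)

print(trajectory[:3])  # [0.0, 1.5, 2.25]
print(abs(trajectory[-1] - 3.0) < 1e-5)  # True
```

With this step size each update halves the distance to the minimum, which mirrors the picture of successive iterates approaching the bottom of the curve.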
Gradient descent for SVM
1. Initialize $w^0$
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at $w^t$. Call it $\nabla J(w^t)$
   2. Update w as follows:
$$w^{t+1} \leftarrow w^t - r \nabla J(w^t)$$
r: called the learning rate
We are trying to minimize
$$J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\, 1 - y_i w^\top x_i\big)$$
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient descent for SVM
1. Initialize $w^0$
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at $w^t$. Call it $\nabla J(w^t)$
   2. Update w as follows:
$$w^{t+1} \leftarrow w^t - r \nabla J(w^t)$$
r: called the learning rate
Gradient of the SVM objective requires summing over the entire training set
Slow, does not really scale
We are trying to minimize
$$J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\, 1 - y_i w^\top x_i\big)$$
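To make the cost concrete, here is a rough sketch of full-batch descent on the SVM objective. It assumes the simplified notation (each example ends with a constant 1, the bias is the last weight) and uses a sub-gradient of the hinge loss, since the hinge is not differentiable where $y_i w^\top x_i = 1$. The function names, toy data, and hyper-parameter values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def full_gradient(w, data, C):
    """Sub-gradient of J(w): [w0; 0] minus C * y_i * x_i for every
    example whose hinge loss is positive."""
    grad = list(w[:-1]) + [0.0]  # bias (last entry) is not regularized
    for x, y in data:            # one full pass over the training set
        if y * dot(w, x) < 1.0:
            for j in range(len(grad)):
                grad[j] -= C * y * x[j]
    return grad

def gradient_descent_svm(data, C, rate, steps):
    """w^{t+1} <- w^t - r * grad J(w^t), for a fixed number of steps."""
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        g = full_gradient(w, data, C)
        w = [wj - rate * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical augmented toy data: ([x, 1], y).
data = [([2.0, 0.0, 1.0], 1), ([0.0, 2.0, 1.0], -1)]
print(full_gradient([0.0, 0.0, 0.0], data, C=1.0))  # [-2.0, 2.0, 0.0]
w = gradient_descent_svm(data, C=1.0, rate=0.1, steps=100)
print(all(y * dot(w, x) > 0 for x, y in data))      # True
```

Every call to `full_gradient` loops over all of `data`, so each single update costs time proportional to the size of the training set.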
Stochastic gradient descent for SVM
Given a training set $S = \{(x_i, y_i)\}$, $x \in \Re^n$, $y \in \{-1, 1\}$
1. Initialize $w^0 = 0 \in \Re^n$
2. For epoch t = 1 … T:
   1. Pick a random example $(x_i, y_i)$ from the training set S
   2. Treat $(x_i, y_i)$ as a full dataset and take the derivative of the SVM objective at the current $w^{t-1}$ to be $\nabla J^t(w^{t-1})$
   3. Update: $w^t \leftarrow w^{t-1} - \gamma_t \nabla J^t(w^{t-1})$
3. Return final w
The SVM objective being minimized is
$$J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\, 1 - y_i w^\top x_i\big)$$
Stochastic gradient descent for SVM
Given a training set $S = \{(x_i, y_i)\}$, $x \in \Re^n$, $y \in \{-1, 1\}$
1. Initialize $w^0 = 0 \in \Re^n$
2. For epoch t = 1 … T:
   1. Pick a random example $(x_i, y_i)$ from the training set S
   2. Repeat $(x_i, y_i)$ to make a full dataset and take the derivative of the SVM objective at the current $w^{t-1}$ to be $\nabla J^t(w^{t-1})$
   3. Update: $w^t \leftarrow w^{t-1} - \gamma_t \nabla J^t(w^{t-1})$
3. Return final w
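The stochastic procedure can be sketched as follows. This is a hedged illustration, not the course's reference implementation: it assumes the simplified notation (examples end with a constant 1), takes the per-example sub-gradient to be $[w_0; 0]$ when the hinge loss is zero and $[w_0; 0] - C y_i x_i$ otherwise, and uses a decaying step size $\gamma_t = \gamma_0 / (1 + t)$, which is an assumed schedule. All names and the toy data are hypothetical.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def example_gradient(w, x, y, C):
    """Sub-gradient of the objective built from one example (x, y)."""
    grad = list(w[:-1]) + [0.0]  # regularizer part; bias is last
    if y * dot(w, x) < 1.0:      # hinge loss is positive
        grad = [g - C * y * xj for g, xj in zip(grad, x)]
    return grad

def sgd_svm(data, C, gamma0, T, seed=0):
    """Run T stochastic updates starting from w = 0."""
    rng = random.Random(seed)         # seeded for reproducibility
    w = [0.0] * len(data[0][0])
    for t in range(1, T + 1):
        x, y = rng.choice(data)       # pick a random example
        gamma_t = gamma0 / (1.0 + t)  # assumed learning-rate schedule
        g = example_gradient(w, x, y, C)
        w = [wj - gamma_t * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical augmented toy data: ([x, 1], y).
data = [([2.0, 0.0, 1.0], 1), ([0.0, 2.0, 1.0], -1)]
print(example_gradient([0.0, 0.0, 0.0], data[0][0], 1, C=1.0))  # [-2.0, 0.0, -1.0]
w = sgd_svm(data, C=1.0, gamma0=1.0, T=200)
```

Each update touches a single example, so its cost does not grow with the size of the training set, in contrast to the full gradient. If the picked example is instead repeated to the size of the full dataset, the loss term would be scaled accordingly; that scaling is left out here.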