1
Support Vector Machines
Machine Learning Fall 2017
Supervised Learning: The Setup
Machine Learning Spring 2018
The slides are mainly from Vivek Srikumar
Big picture
Linear models
Big picture
Linear models
How good is a learning
Big picture
Linear models
How good is a learning algorithm?
Mistake- driven learning Perceptron,
Winnow
Big picture
Linear models
How good is a learning Mistake-
driven learning
PAC, Agnostic learning Perceptron,
Winnow
Big picture
Linear models
How good is a learning algorithm?
Mistake- driven learning
PAC, Agnostic learning Perceptron,
Winnow Support Vector
Machines
The Risk Minimization Principle
Big picture
Linear models
How good is a learning Mistake-
driven learning
PAC, Agnostic learning Perceptron,
Winnow Support Vector
Machines
….
The Risk Minimization Principle
….
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
VC dimensions and linear classifiers
What we know so far
1. If we have m examples, then with probability 1 - !, the true error of a hypothesis h with training error err
S(h) is bounded by
Generalization error Training error A function of VC dimension.
Low VC dimension gives tighter bound
VC dimensions and linear classifiers
What we know so far
1. If we have m examples, then with probability 1 - !, the true error of a hypothesis h with training error err
S(h) is bounded by
2. VC dimension of a linear classifier in d dimensions = d + 1
Generalization error Training error A function of VC dimension.
Low VC dimension gives tighter bound
VC dimensions and linear classifiers
What we know so far
1. If we have m examples, then with probability 1 - !, the true error of a hypothesis h with training error err
S(h) is bounded by
2. VC dimension of a linear classifier in d dimensions = d + 1
Generalization error Training error A function of VC dimension.
Low VC dimension gives tighter bound
But are all linear classifiers the same?
Recall: Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
+
+ +
+ ++ + - +
- - - - - - - -
-
- - -
- - -
-
-
Recall: Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
+
+ +
+ ++ + - +
- - - - - - - -
-
- - -
- - - -
-
Margin with respect to this hyperplaneWhich hyperplane is better?
h1
h2
Recall: Margin
Which hyperplane is better?
h1
h2
The farther from the data points, the less chance to make wrong prediction
Recall: Margin
Data dependent VC dimension
• Intuitively, larger margins are better
• Suppose we only consider linear classifiers with margins !
1and !
2– H
1= linear classifiers that have a margin at least !
1– H
2= linear classifiers that have a margin at least !
2– And !
1> !
2• The entire set of functions H
1is “better”
Data dependent VC dimension
Theorem (Vapnik):
– Let H be the set of linear classifiers that separate the training set by a margin at least !
– Then
– R is the radius of the smallest sphere containing the data
Data dependent VC dimension
Theorem (Vapnik):
– Let H be the set of linear classifiers that separate the training set by a margin at least !
– Then
– R is the radius of the smallest sphere containing the data Larger margin ⟹ Lower VC dimension
Lower VC dimension ⟹ Better generalization bound
Learning strategy
Find the linear classifier that maximizes the margin
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
Recall: The geometry of a linear classifier
Prediction = sgn(b +w
1x
1+ w
2x
2)
+ + + + ++ +
+
- - - -
- - - - -
-
- - -
- - - - -
b +w1 x1 + w2x2=0
Recall: The geometry of a linear classifier
Prediction = sgn (b +w
1x
1+ w
2x
2)
+ + + + ++ +
+
- - - -
- - - - -
-
- - -
- - - - -
b +w1 x1 + w2x2=0
We only care about the sign, not the magnitude
Recall: The geometry of a linear classifier
Prediction = sgn(b +w
1x
1+ w
2x
2)
b +w1 x1 + w2x2=0 2b +2w1 x1 + 2w2x2=0
1000b +1000w1 x1 + 1000w2x2=0
+
+ +
+ ++ + +
- - - -
- - - - -
-
- - -
- - - - -
We only care about the sign, not the magnitude
Maximizing margin
• Margin = distance of the closest point from the hyperplane
• We want max
w°
Some people call this the geometric margin
The numerator alone is called the functional margin
Maximizing margin
• Margin = distance of the closest point from the hyperplane
• We want max
w!
Some people call this the geometric margin
The numerator alone is called the functional margin
Recall: The geometry of a linear classifier
Prediction = sgn(b +w
1x
1+ w
2x
2)
b +w1 x1 + w2x2=0
We only care about the sign, not the magnitude
+ + + + ++ +
+
- - - -
- - - - -
-
- - -
- - -
-
-
Recall: The geometry of a linear classifier
Prediction = sgn(b +w
1x
1+ w
2x
2)
b +w1 x1 + w2x2=0
We only care about the sign, not the magnitude
+ + + + ++ +
+
- - - -
- - - - -
-
- - -
- - - - -
We have the freedom to scale up/down w and b so that the numerator is 1.
Maximizing margin
• Margin = distance of the closest point from the hyperplane
• We want max
w!
• We only care about the sign of w
Tx
i+b in the end and not the magnitude
– Set the absolute score (functional margin) of the closest point to be 1 and allow w to adjust itself
max
%! is equivalent to max
%
&
%
in this setting
Max-margin classifiers
• Learning problem:
30
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
ix) = 0
w Cy
ix
iotherwise
Max-margin classifiers
• Learning problem:
31
Minimizing gives us max
$
%
$
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
ix) = 0
w Cy
ix
iotherwise
Max-margin classifiers
• Learning problem:
32
This condition is true for every example, specifically, for the example closest to the
separator
Minimizing gives us max
$
%
$
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
ix) = 0
w Cy
ix
iotherwise
Max-margin classifiers
• Learning problem:
• This is called the “hard” Support Vector Machine
We will look at how to solve this optimization problem later
33
This condition is true for every example, specifically, for the example closest to the
separator
Minimizing gives us max
$
%
$
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
ix) = 0
w Cy
ix
iotherwise
Hard SVM
• This is a constrained optimization problem
• If the data is not separable, there is no w that will classify the data
• Infeasible problem, no solution!
What if the data is not separable?
34
Maximize margin Every example has an
functional margin of at least 1 216
217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
xi) = 0
w Cy
ix
iotherwise
What if the data is not separable?
Hard SVM
• This is a constrained optimization problem
• If the data is not separable, there is no w that will classify the data
• Infeasible problem, no solution!
35
Maximize margin Every example has an
functional margin of at least 1 216
217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
xi) = 0
w Cy
ix
iotherwise
Dealing with non-separable data
Key idea: Allow some examples to “break into the margin”
+
+ +
+ ++ + - +
- - - - - - - -
-
- - -
- - - - -
+ -
Dealing with non-separable data
Key idea: Allow some examples to “break into the margin”
+
+ +
+ ++ + - +
- - - - - - - -
-
- - -
- - - - -
+ -
Dealing with non-separable data
Key idea: Allow some examples to “break into the margin”
+
+ +
+ ++ + - +
- - - - - - - -
-
- - -
- - - - -
+ -
This separator has a large enough margin that it should generalize well.
Dealing with non-separable data
Key idea: Allow some examples to “break into the margin”
So, when computing margin, ignore the
examples that make the margin smaller or the data inseparable.
+
+ +
+ ++ + - +
- - - - - - - -
-
- - -
- - - - -
+ -
This separator has a large enough margin that it should generalize well.
Soft SVM
• Hard SVM:
• Introduce one slack variable »
iper example – And require
yiwTxi ¸ 1 - »i and »i ¸ 040
Maximize margin
Every example has a functional margin of at least 1
Intuition: The slack variable allows examples to “break” into the margin
If the slack value is zero, then the example is either on or outside the margin 216
217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
xi) = 0
w Cy
ix
iotherwise
Soft SVM
• Hard SVM:
• Introduce one slack variable per example
– And require and
41
Maximize margin
Every example has a functional margin of at least 1
Intuition: The slack variable allows examples to “break” into the margin
If the slack value is zero, then the example is either on or outside the margin
⇠
i216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
maxw
1
2w>w
s.t.8i, yi(w>xi + b) 1
maxw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
5
⇠i 0 216
217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
xi) = 0
w Cy
ix
iotherwise
Soft SVM
• Hard SVM:
• Introduce one slack variable per example
– And require and
42
Maximize margin
Every example has a functional margin of at least 1
Intuition: The slack variable allows examples to “break” into the margin
If the slack value is zero, then the example is either on or outside the margin
⇠
i216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
maxw
1
2w>w
s.t.8i, yi(w>xi + b) 1
maxw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
5
⇠i 0 216
217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
xi) = 0
w Cy
ix
iotherwise
Soft SVM
• Hard SVM:
• Introduce one slack variable per example
– And require and
• New optimization problem for learning
43
Maximize margin
Every example has a functional margin of at least 1
⇠
i⇠i 0
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
maxw
1
2w>w
s.t.8i, yi(w>xi + b) 1
maxw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
5
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
xi) = 0 w Cy
ix
iotherwise
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
minw
1
2w>w
s.t.8i, yi(w>xi + b) 1
minw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
minw,b
1
2w>w + C X
i
max 0, 1 yi(w>xi + b)
LHinge(y, x, w, b) = max 0, 1 y(w>x + b)
minw
1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
Jt(w) = 1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
rJt =
⇢[w0; 0] if max(0, 1 yiwxi ) = 0 w Cyixi otherwise
5
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
min
w1
2 w
>w
s.t. 8i, y
i(w
>x
i+ b) 1
min
w1
2 w
>w + C X
i
⇠
is.t. 8i, y
i(w
>x
i+ b) 1 ⇠
i⇠
i0
min
w,b1
2 w
>w + C X
i
max 0, 1 y
i(w
>x
i+ b)
L
Hinge(y, x, w, b) = max 0, 1 y(w
>x + b)
min
w1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
J
t(w) = 1
2 w
0>w
0+ C X
i
max(0, 1 y
iw
>x
i)
rJ
t=
⇢ [w
0; 0] if max(0, 1 y
iw
xi) = 0 w Cy
ix
iotherwise
Soft SVM
• Hard SVM:
• Introduce one slack variable per example
– And require and
• New optimization problem for learning
44
Maximize margin
Every example has a functional margin of at least 1
⇠
i216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
maxw
1
2w>w
s.t.8i, yi(w>xi + b) 1
maxw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
5
⇠i 0 216
217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
minw
1
2w>w
s.t.8i, yi(w>xi + b) 1
minw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
minw,b
1
2w>w + C X
i
max 0, 1 yi(w>xi + b)
LHinge(y, x, w, b) = max 0, 1 y(w>x + b)
minw
1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
Jt(w) = 1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
rJt =
⇢[w0; 0] if max(0, 1 yiwxi ) = 0 w Cyixi otherwise
5
Soft SVM
45
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
minw
1
2w>w
s.t.8i, yi(w>xi + b) 1
minw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
minw,b
1
2w>w + C X
i
max 0, 1 yi(w>xi + b)
LHinge(y, x, w, b) = max 0, 1 y(w>x + b)
minw
1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
Jt(w) = 1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
rJt =
⇢[w0; 0] if max(0, 1 yiwix) = 0 w Cyixi otherwise
Soft SVM
46
Maximize margin
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
minw
1
2w>w
s.t.8i, yi(w>xi + b) 1
minw
1
2w>w + C X
i
⇠i
s.t.8i, yi(w>xi + b) 1 ⇠i
⇠i 0
minw,b
1
2w>w + C X
i
max 0, 1 yi(w>xi + b)
LHinge(y, x, w, b) = max 0, 1 y(w>x + b)
minw
1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
Jt(w) = 1
2w0>w0 + C X
i
max(0, 1 yiw>xi)
rJt =
⇢[w0; 0] if max(0, 1 yiwix) = 0 w Cyixi otherwise