Support Vector Machines

(1)

1

Support Vector Machines

Machine Learning Fall 2017

Supervised Learning: The Setup

Machine Learning Spring 2018

The slides are mainly from Vivek Srikumar

(2)

Big picture

Linear models

(3)

Big picture

Linear models

How good is a learning

(4)

Big picture

Linear models

How good is a learning algorithm?

Mistake- driven learning Perceptron,

Winnow

(5)

Big picture

Linear models

How good is a learning Mistake-

driven learning

PAC, Agnostic learning Perceptron,

Winnow

(6)

Big picture

Linear models

How good is a learning algorithm?

Mistake- driven learning

Winnow Support Vector

Machines

The Risk Minimization Principle

(7)

Big picture

Linear models

How good is a learning Mistake-

driven learning

Winnow Support Vector

Machines

….

The Risk Minimization Principle

….

(8)

This lecture: Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

(9)

This lecture: Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

(10)

VC dimensions and linear classifiers

What we know so far

1. If we have m examples, then with probability 1 - !, the true error of a hypothesis h with training error err

_S

(h) is bounded by

Generalization error Training error A function of VC dimension.

Low VC dimension gives tighter bound

(11)

VC dimensions and linear classifiers

What we know so far

1. If we have m examples, then with probability 1 - !, the true error of a hypothesis h with training error err

_S

(h) is bounded by

2. VC dimension of a linear classifier in d dimensions = d + 1

(12)

VC dimensions and linear classifiers

What we know so far

1. If we have m examples, then with probability 1 - !, the true error of a hypothesis h with training error err

_S

(h) is bounded by

2. VC dimension of a linear classifier in d dimensions = d + 1

But are all linear classifiers the same?

(13)

Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

+

+ +

+ ++ + - +

- - - - - - - -

-

- - -

-

(14)

Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

+

+ +

+ ++ + - +

- - - - - - - -

-

- - -

- - - -

-

Margin with respect to this hyperplane

(15)

Which hyperplane is better?

h₁

h₂

Recall: Margin

(16)

Which hyperplane is better?

h₁

h₂

The farther from the data points, the less chance to make wrong prediction

Recall: Margin

(17)

Data dependent VC dimension

• Intuitively, larger margins are better

• Suppose we only consider linear classifiers with margins !

₁

and !

₂

– H

₁

= linear classifiers that have a margin at least !

₁

– H

₂

= linear classifiers that have a margin at least !

₂

– And !

₁

> !

₂

• The entire set of functions H

₁

is “better”

(18)

Data dependent VC dimension

Theorem (Vapnik):

– Let H be the set of linear classifiers that separate the training set by a margin at least !

– Then

– R is the radius of the smallest sphere containing the data

(19)

Data dependent VC dimension

Theorem (Vapnik):

– Let H be the set of linear classifiers that separate the training set by a margin at least !

– Then

– R is the radius of the smallest sphere containing the data Larger margin ⟹ Lower VC dimension

Lower VC dimension ⟹ Better generalization bound

(20)

Learning strategy

Find the linear classifier that maximizes the margin

(21)

This lecture: Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

(22)

Recall: The geometry of a linear classifier

Prediction = sgn(b +w

₁

x

₁

+ w

₂

x

₂

)

+ + + + ++ +

+

- - - -

- - - - -

-

- - -

- - - - -

b +w₁ x₁ + w₂x₂=0

(23)

Recall: The geometry of a linear classifier

Prediction = sgn ^{(b +w}

₁

^x

₁

^{+ w}

₂

^x

₂

⁾

+ + + + ++ +

+

- - - -

- - - - -

-

- - -

- - - - -

b +w₁ x₁ + w₂x₂=0

We only care about the sign, not the magnitude

(24)

Recall: The geometry of a linear classifier

Prediction = sgn(b +w

₁

x

₁

+ w

₂

x

₂

)

b +w₁ x₁ + w₂x₂=0 2b +2w₁ x₁ + 2w₂x₂=0

1000b +1000w₁ x₁ + 1000w₂x₂=0

+

+ +

+ ++ + +

- - - -

- - - - -

-

- - -

- - - - -

We only care about the sign, not the magnitude

(25)

Maximizing margin

• Margin = distance of the closest point from the hyperplane

• We want max

_w

°

Some people call this the geometric margin

The numerator alone is called the functional margin

(26)

Maximizing margin

• Margin = distance of the closest point from the hyperplane

• We want max

_w

!

Some people call this the geometric margin

The numerator alone is called the functional margin

(27)

Recall: The geometry of a linear classifier

Prediction = sgn(b +w

₁

x

₁

+ w

₂

x

₂

)

b +w₁ x₁ + w₂x₂=0

We only care about the sign, not the magnitude

+ + + + ++ +

+

- - - -

- - - - -

-

- - -

-

(28)

Recall: The geometry of a linear classifier

Prediction = sgn(b +w

₁

x

₁

+ w

₂

x

₂

)

b +w₁ x₁ + w₂x₂=0

We only care about the sign, not the magnitude

+ + + + ++ +

+

- - - -

- - - - -

-

- - -

- - - - -

We have the freedom to scale up/down w and b so that the numerator is 1.

(29)

Maximizing margin

• Margin = distance of the closest point from the hyperplane

• We want max

_w

!

• We only care about the sign of w

^T

x

_i

+b in the end and not the magnitude

– Set the absolute score (functional margin) of the closest point to be 1 and allow w to adjust itself

max

%

! is equivalent to max

%

&

%

in this setting

(30)

Max-margin classifiers

• Learning problem:

30

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

_i^x

) = 0

w Cy

_i

x

_i

otherwise

(31)

Max-margin classifiers

• Learning problem:

31

Minimizing gives us max

$

%

$

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

_i^x

) = 0

w Cy

_i

x

_i

otherwise

(32)

Max-margin classifiers

• Learning problem:

32

This condition is true for every example, specifically, for the example closest to the

separator

$

%

$

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

_i^x

) = 0

w Cy

_i

x

_i

otherwise

(33)

Max-margin classifiers

• Learning problem:

• This is called the “hard” Support Vector Machine

We will look at how to solve this optimization problem later

33

This condition is true for every example, specifically, for the example closest to the

separator

$

%

$

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

_i^x

) = 0

w Cy

_i

x

_i

otherwise

(34)

Hard SVM

• This is a constrained optimization problem

• If the data is not separable, there is no w that will classify the data

• Infeasible problem, no solution!

What if the data is not separable?

34

Maximize margin Every example has an

functional margin of at least 1 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0

w Cy

_i

x

_i

otherwise

(35)

What if the data is not separable?

Hard SVM

• This is a constrained optimization problem

• If the data is not separable, there is no w that will classify the data

• Infeasible problem, no solution!

35

Maximize margin Every example has an

functional margin of at least 1 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0

w Cy

_i

x

_i

otherwise

(36)

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

+

+ +

+ ++ + - +

- - - - - - - -

-

- - -

- - - - -

+ -

(37)

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

+

+ +

+ ++ + - +

- - - - - - - -

-

- - -

- - - - -

+ -

(38)

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

+

+ +

+ ++ + - +

- - - - - - - -

-

- - -

- - - - -

+ -

This separator has a large enough margin that it should generalize well.

(39)

Dealing with non-separable data

Key idea: Allow some examples to “break into the margin”

So, when computing margin, ignore the

examples that make the margin smaller or the data inseparable.

+

+ +

+ ++ + - +

- - - - - - - -

-

- - -

- - - - -

+ -

This separator has a large enough margin that it should generalize well.

(40)

Soft SVM

• Hard SVM:

• Introduce one slack variable »

_i

per example – And require

^y_i^w^T^x_i ^{¸ 1 - »}_i ^{and »}_i ^{¸ 0}

40

Maximize margin

Every example has a functional margin of at least 1

Intuition: The slack variable allows examples to “break” into the margin

If the slack value is zero, then the example is either on or outside the margin 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0

w Cy

_i

x

_i

otherwise

(41)

Soft SVM

• Hard SVM:

• Introduce one slack variable per example

– And require and

41

Maximize margin

If the slack value is zero, then the example is either on or outside the margin

⇠

_i

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

maxw

1

2w^>w

s.t.8i, yⁱ(w^>xi + b) 1

maxw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

5

⇠i 0 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0

w Cy

_i

x

_i

otherwise

(42)

Soft SVM

• Hard SVM:

• Introduce one slack variable per example

– And require and

42

Maximize margin

If the slack value is zero, then the example is either on or outside the margin

⇠

_i

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

maxw

1

2w^>w

s.t.8i, yⁱ(w^>xi + b) 1

maxw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

5

⇠i 0 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0

w Cy

_i

x

_i

otherwise

(43)

Soft SVM

• Hard SVM:

• Introduce one slack variable per example

– And require and

• New optimization problem for learning

43

Maximize margin

⇠

_i

⇠i 0

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

maxw

1

2w^>w

s.t.8i, yⁱ(w^>xi + b) 1

maxw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

5

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0 w Cy

_i

x

_i

otherwise

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

L_Hinge(y, x, w, b) = max 0, 1 y(w^>x + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J^t(w) = 1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

rJ^t =

⇢[w₀; 0] if max(0, 1 y_iw^x_i ) = 0 w Cy_ix_i otherwise

5

(44)

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

min

w

1 2 w

^>

w

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1

min

w

1 2 w

^>

w + C X

i

⇠

_i

s.t. 8i, y

ⁱ

(w

^>

x

_i

+ b) 1 ⇠

_i

⇠

_i

0 min

w,b

1 2 w

^>

w + C X

i

max 0, 1 y

_i

(w

^>

x

_i

+ b)

L

_Hinge

(y, x, w, b) = max 0, 1 y(w

^>

x + b)

min

w

1 2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

J

^t

(w) = 1

2 w

₀^>

w

₀

+ C X

i

max(0, 1 y

_i

w

^>

x

_i

)

rJ

^t

=

⇢ [w

₀

; 0] if max(0, 1 y

_i

w

^x_i

) = 0 w Cy

_i

x

_i

otherwise

Soft SVM

• Hard SVM:

• Introduce one slack variable per example

– And require and

• New optimization problem for learning

44

Maximize margin

⇠

_i

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

maxw

1

2w^>w

s.t.8i, yⁱ(w^>xi + b) 1

maxw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

5

⇠i 0 216

217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J^t(w) = 1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

rJ^t =

⇢[w₀; 0] if max(0, 1 y_iw^x_i ) = 0 w Cy_ix_i otherwise

5

(45)

Soft SVM

45

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J^t(w) = 1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

rJ^t =

⇢[w₀; 0] if max(0, 1 y_iw_i^x) = 0 w Cy_ix_i otherwise

(46)

Soft SVM

46

Maximize margin

216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

minw

1

2w^>w

s.t.8i, yⁱ(w^>x_i + b) 1

minw

1

2w^>w + C X

i

⇠_i

s.t.8i, yⁱ(w^>x_i + b) 1 ⇠_i

⇠_i 0

minw,b

1

2w^>w + C X

i

max 0, 1 y_i(w^>x_i + b)

minw

1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

J^t(w) = 1

2w₀^>w₀ + C X

i

max(0, 1 y_iw^>x_i)

rJ^t =

⇢[w₀; 0] if max(0, 1 y_iw_i^x) = 0 w Cy_ix_i otherwise