The Perceptron Algorithm

(1)

1

The Perceptron Algorithm

Machine Learning Fall 2017

Supervised Learning: The Setup

1 Machine Learning

Spring 2018

The slides are mainly from Vivek Srikumar

(2)

Outline

• The Perceptron Algorithm

• Perceptron Mistake Bound

• Variants of Perceptron

2

(3)

Where are we?

• The Perceptron Algorithm

• Perceptron Mistake Bound

• Variants of Perceptron

3

(4)

Recall: Linear Classifiers

• Input is a n dimensional vector x

• Output is a label y ∈ {-1, 1}

• Linear Threshold Units (LTUs) classify an example x using the following classification rule

– Output = sgn(w^Tx + b) = sgn(b +åw_i x_i) – w^Tx + b ≥ 0 → Predict y = 1

– w^Tx + b < 0 → Predict y = -1

4

b is called the bias term

(5)

Recall: Linear Classifiers

• Input is a n dimensional vector x

• Output is a label y ∈ {-1, 1}

• Linear Threshold Units (LTUs) classify an example x using the following classification rule

– Output = sgn(w^Tx + b) = sgn(b +åw_i x_i) – w^Tx + b ¸ 0 ) Predict y = 1

– w^Tx + b < 0 ) Predict y = -1

5

b is called the bias term

∑ sgn

&_' &₍ &₎ &_* &₊ &_, &_- &_.

/_' /₍ /₎ /_* /₊ /_, /_- /_. 1

1

(6)

The geometry of a linear classifier

6

sgn(b +w₁ x₁ + w₂x₂)

In n dimensions, a linear classifier

represents a hyperplane that separates the space into two half-spaces

x

₁

x

₂

+ + + + ++ +

+

- - - -

- - - - -

-

- - -

- - - - -

b +w₁ x₁ + w₂x₂=0

We only care about the sign, not the magnitude

[w₁ w₂]

(7)

The Perceptron

7

(8)

The hype

8

The New Yorker, December 6, 1958 P. 44

The New York Times, July 8 1958 The IBM 704 computer

(9)

The Perceptron algorithm

• Rosenblatt 1958

• The goal is to find a separating hyperplane

– For separable data, guaranteed to find one

• An online algorithm

– Processes one example at a time

• Several variants exist (will discuss briefly at towards the end)

9

(10)

The Perceptron algorithm

Input: A sequence of training examples (x₁, y₁), (x₂, y₂),!

where all x_i ∈ ℜⁿ, y_i ∈ {-1,1}

• Initialize w₀= 0 ∈ ℜⁿ

• For each training example (x_i, y_i):

– Predict y’ = sgn(w_t^Tx_i) – If y_i ≠ y’:

• Update w_t+1 ←w_t + r (y_i x_i)

• Return final weight vector

10

(11)

The Perceptron algorithm

11

Remember:

Prediction = sgn(w^Tx)

There is typically a bias term also (w^Tx + b), but the bias may be treated as a

constant feature and folded into w

(12)

The Perceptron algorithm

12

Footnote: For some algorithms it is mathematically easier to represent False as -1, and at other times, as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.

Remember:

Prediction = sgn(w^Tx)

There is typically a bias term also (w^Tx + b), but the bias may be treated as a

constant feature and folded into w

(13)

The Perceptron algorithm

13

Mistake on positive: w_t+1 ← w_t + r x_i Mistake on negative: w_t+1 ← w_t - r x_i

(14)

The Perceptron algorithm

14

r is the learning rate, a small positive number less than 1

(15)

The Perceptron algorithm

15

Update only on error. A mistake- driven algorithm

(16)

The Perceptron algorithm

16

This is the simplest version. We will see more robust versions at the end

(17)

The Perceptron algorithm

17

This is the simplest version. We will see more robust versions at the end Mistake can be written as y_iw_t^Tx_i ≤ 0

(18)

Intuition behind the update

Suppose we have made a mistake on a positive example That is, y = +1 and w_t^Tx <0

Call the new weight vector w_t+1 = w_t + x (say r = 1)

The new dot product will be w_t+1^Tx = (w_t + x)^Tx = w_t^Tx + x^Tx > w_t^Tx

For a positive example, the Perceptron update will increase the score assigned to the same input

Similar reasoning for negative examples

18

(19)

Geometry of the perceptron update

19

w_old Predict

(20)

Geometry of the perceptron update

20

w_old

(x, +1)

Predict

(21)

Geometry of the perceptron update

21

w_old

(x, +1)

For a mistake on a positive example

Predict

(22)

Geometry of the perceptron update

22

w_old

(x, +1)

w ← w + x

(x, +1)

Predict Update

(23)

Geometry of the perceptron update

23

w_old

(x, +1)

w ← w + x

(x, +1)

Predict Update

x

(24)

Geometry of the perceptron update

24

w_old

(x, +1)

w ← w + x

(x, +1)

Predict Update

x

(25)

Geometry of the perceptron update

25

w_old

(x, +1) (x, +1)

w_new w ← w + x

(x, +1)

Predict Update After

x

(26)

Geometry of the perceptron update

26

w_old

Predict

(27)

Geometry of the perceptron update

27

w_old (x, -1)

Predict

For a mistake on a negative example

(28)

Geometry of the perceptron update

28

w_old (x, -1)

w ← w - x

(x, -1)

Predict Update

- x

(29)

Geometry of the perceptron update

29

w_old (x, -1)

w ← w - x

(x, -1)

Predict Update

- x

(30)

Geometry of the perceptron update

30

w_old (x, -1)

w ← w - x

(x, -1)

Predict Update

- x

(31)

Geometry of the perceptron update

31

w_old

(x, -1) (x, -1)

w_new w ← w - x

(x, -1)

Predict Update After

- x

(32)

Perceptron Learnability

• Obviously Perceptron cannot learn what it cannot represent

– Only linearly separable functions

• Minsky and Papert (1969) wrote an influential book

demonstrating Perceptron’s representational limitations

– Parity functions can’t be learned (XOR)

• But we already know that XOR is not linearly separable

– Feature transformation trick

32

(33)

What you need to know

• The Perceptron algorithm

• The geometry of the update

• What can it represent

33

(34)

Where are we?

• The Perceptron Algorithm

• Perceptron Mistake Bound

• Variants of Perceptron

34

(35)

Convergence

Convergence theorem

– If there exist a set of weights that are consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge.

35

(36)

Convergence

Convergence theorem

– If there exist a set of weights that are consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge.

Cycling theorem

– If the training data is not linearly separable, then the

learning algorithm will eventually repeat the same set of weights and enter an infinite loop

36

(37)

Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

37

+ + + + ++ + - +

- - - - - - - -

-

- - -

- - - -

-

Margin with respect to this hyperplane

(38)

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

• The margin of a data set (!) is the maximum margin possible for that dataset using any weight vector.

38

+ + + + ++ + - +

- - - - - - - -

-

- - -

- - - - -

Margin of the data

(39)

Mistake Bound Theorem

[Novikoff 1962, Block 1962]

Let {(x₁, y₁), (x₂, y₂),!, (x_m, y_m)} be a sequence of training

examples such that for all i, the feature vector x_i ∈ ℜⁿ, ||x_i|| ≤ R and the label y_i ∈ {-1, +1}.

Suppose there exists a unit vector u 2 <ⁿ(i.e ||u|| = 1) such that for some ° 2 < and ° > 0 we have y_i (u^T x_i) ¸ °.

Then, the perceptron algorithm will make at most (R/°)² mistakes on the training sequence.

39

(40)

Mistake Bound Theorem

Suppose there exists a unit vector u ∈ ℜⁿ(i.e ||u|| = 1) such that for some $ ∈ ℜ and $ > 0 we have y_i (u^T x_i) ≥ $.

Then, the perceptron algorithm will make at most (R/°)² mistakes on the training sequence.

40

(41)

Mistake Bound Theorem

Then, the perceptron algorithm will make at most (R/ $)² mistakes on the training sequence.

41

(42)

Mistake Bound Theorem

42

We can always find such an R. Just look for the farthest data point from the origin.

(43)

Mistake Bound Theorem

43

The data and u have a margin !.

Importantly, the data is separable.

! is the complexity parameter that defines the separability of data.

Suppose there exists a unit vector u ∈ ℜⁿ(i.e ||u|| = 1) such that for some ! ∈ ℜ and ! > 0 we have y_i (u^T x_i) ≥ !.

Then, the perceptron algorithm will make at most (R/ !)² mistakes on the training sequence.

(44)

Mistake Bound Theorem

44

If u hadn’t been a unit vector, then we could scale ! in the mistake bound. This will change the final mistake bound to (||u||R/ !)²

Suppose there exists a unit vector u ∈ ℜⁿ(i.e ||u|| = 1) such that for some ! ∈ ℜ and ! > 0 we have y_i (u^T x_i) ≥ !.

Then, the perceptron algorithm will make at most (R/ !)² mistakes on the training sequence.

(45)

Mistake Bound Theorem

45

Suppose we have a binary classification dataset with n dimensional inputs.

If the data is separable,…

…then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes

(46)

Proof (preliminaries)

The setting

• Initial weight vector w is all zeros

• Learning rate = 1

– Effectively scales inputs, but does not change the behavior

• All training examples are contained in a ball of size R

– ||x_i|| ≤ R

• The training data is separable by margin " using a unit vector u

– y_i (u^T x_i) ≥ "

46

• Receive an input (x_i, y_i)

• if sgn(w_t^Tx_i) ≠ y_i:

Update w_t+1 ← w_t + y_i x_i

(47)

Proof (1/3)

1. Claim: After t mistakes, u

^T

w

_t

≥ t "

47

(48)

Proof (1/3)

1. Claim: After t mistakes, u

^T

w

_t

≥ t "

48

Because the data is separable by a margin "

(49)

Proof (1/3)

1. Claim: After t mistakes, u

^T

w

_t

≥ t "

Because w

₀

= 0 (i.e u

^T

w

₀

= 0), straightforward induction gives us u

^T

w

_t

≥ t "

49

Because the data is separable by a margin "

(50)

2. Claim: After t mistakes, ||w

_t

||

²

≤ tR

²

Proof (2/3)

50

(51)

Proof (2/3)

51

The weight is updated only when there is a mistake. That is when y_i w_t^T x_i < 0.

||x_i|| ≤ R, by definition of R

2. Claim: After t mistakes, ||w

_t

||

²

≤ tR

²

(52)

Proof (2/3)

2. Claim: After t mistakes, ||w

_t

||

²

≤ tR

²

Because w

₀

= 0 (i.e u

^T

w

₀

= 0), straightforward induction gives us ||w

_t

||

²

≤ tR

²

52

(53)

Proof (3/3)

What we know:

1. After t mistakes, u^Tw_t ≥ t"

2. After t mistakes, ||w_t||² ≤ tR²

53

(54)

Proof (3/3)

What we know:

54

From (2)

(55)

Proof (3/3)

What we know:

55

From (2)

u^T w_t =||u|| ||w_t|| cos(<angle between them>) But ||u|| = 1 and cosine is less than 1

So u^T w_t · ||w_t||

(56)

Proof (3/3)

What we know:

56

From (2)

So u^T w_t · ||w_t|| (Cauchy-Schwarz inequality)

(57)

Proof (3/3)

57

From (2)

So u^T w_t ≤ ||w_t||

From (1)

What we know:

1. After t mistakes, u^Tw_t ≥ t#

(58)

Proof (3/3)

58

Number of mistakes

What we know:

(59)

Proof (3/3)

59

Bounds the total number of mistakes!

Number of mistakes

What we know:

(60)

Mistake Bound Theorem

60

(61)

Mistake Bound Theorem

61

(62)

Mistake Bound Theorem

62

162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215

L

₁

( U, B) = 1

2 log |K

^BB

| 1

2 log |K

^BB

+ A

₁

| 1

2 a

₂

1 2 a

₃

1

2 X

K k=1

kU

^(k)

k

²F

+ 1 2

2

a

₄^>

(K

_BB

+ A

₁

)

¹

a

₄

+ 2 tr(K

_BB¹

A

₁

) + CONST L

₂

( U, B, ) = 1

2 log |K

^BB

| 1

2 log |K

^BB

+ A

₁

| 1

2 a

₂

+ a

₃

1

2

>

K

_BB

+ 1

2 tr(K

_BB¹

A

₁

) 1 2

X

K k=1

kU

^(k)

k

²_F

(t+1)

= (K

_BB

+ A

₁

)

¹

(A

₁ ^(t)

+ a

₆

) a

₆

= X

j

k(B, x

_i_j

)(2y

_i_j

1) N k(B, x

ⁱj

)

^{> (t)}

|0, 1 (2y

_i_j

1)k(B, x

_i_j

)

^{> (t)}

L

₁

( U, B, q(v)) = log(p(U)) +

Z

q(v) log p(v |B)

q(v) dv + X

j

Z

q(v)F

_v

(y

_i_j

, )dv L

₁

y, U, B, q(v)

B, U

L

^⇤₁

(y, U, B) L

₁

y, U, B, q(v)

(A

₁

, rA

¹

) (a

₂

, ra

²

) (a

₃

, ra

³

) (a

₄

, ra

⁴

) (a

₅

, ra

⁴

)

L

₁

(y, U, B, q(v)) = log(p( U)) + H q(v) +

Z

log N (v|0, k(B, B ) + X

j

Z

q(v)F

_v

(y

_i_j

, )dv

sgn

X

k i=1

c

_i

· sgn(w

i^>

x) (2)

= sgn

X

k i=1

c

_i

w

_i^>

x (3)

y

_i

w

^>

x

_i

kwk < ⌘ (4)

s

₁

+ s

₂

+ s

₃

0 (5)

R

4

(63)

Mistake Bound Theorem

63

162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215

L

₁

( U, B) = 1

2 log |K

^BB

| 1

2 log |K

^BB

+ A

₁

| 1

2 a

₂

1 2 a

₃

1

2 X

K k=1

kU

^(k)

k

²F

+ 1 2

2

a

₄^>

(K

_BB

+ A

₁

)

¹

a

₄

+ 2 tr(K

_BB¹

A

₁

) + CONST L

₂

( U, B, ) = 1

2 log |K

^BB

| 1

2 log |K

^BB

+ A

₁

| 1

2 a

₂

+ a

₃

1

2

>

K

_BB

+ 1

2 tr(K

_BB¹

A

₁

) 1 2

X

K k=1

kU

^(k)

k

²_F

(t+1)

= (K

_BB

+ A

₁

)

¹

(A

₁ ^(t)

+ a

₆

) a

₆

= X

j

k(B, x

_i_j

)(2y

_i_j

1) N k(B, x

ⁱj

)

^{> (t)}

|0, 1 (2y

_i_j

1)k(B, x

_i_j

)

^{> (t)}

L

₁

( U, B, q(v)) = log(p(U)) +

Z

q(v) log p(v |B)

q(v) dv + X

j

Z

q(v)F

_v

(y

_i_j

, )dv L

₁

y, U, B, q(v)

B, U

L

^⇤₁

(y, U, B) L

₁

y, U, B, q(v)

(A

₁

, rA

¹

) (a

₂

, ra

²

) (a

₃

, ra

³

) (a

₄

, ra

⁴

) (a

₅

, ra

⁴

)

L

₁

(y, U, B, q(v)) = log(p( U)) + H q(v) +

Z

log N (v|0, k(B, B ) + X

j

Z

q(v)F

_v

(y

_i_j

, )dv

sgn

X

k i=1

c

_i

· sgn(w

i^>

x) (2)

= sgn

X

k i=1

c

_i

w

_i^>

x (3)

y

_i

w

^>

x

_i

kwk < ⌘ (4)

s

₁

+ s

₂

+ s

₃

0 (5)

R 4

162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215

L₁(U, B) = 1

2 log|K^BB| 1

2 log|K^BB + A₁| 1

2 a₂ 1 2 a₃ 1

2

XK k=1

kU^(k)k²F + 1 2

2a₄^>(K_BB + A₁) ¹a₄

+ 2 tr(K_BB¹ A₁) + CONST L₂(U, B, ) = 1

2 log|K^BB| 1

2 log|K^BB + A₁| 1

2a₂ + a₃ 1

2

>K_BB + 1

2tr(K_BB¹ A₁) 1 2

XK k=1

kU^(k)k²F

(t+1) = (K_BB + A₁) ¹(A₁ ^(t) + a₆)

a₆ = X

j

k(B, x_i_j)(2y_i_j 1) N k(B, xⁱj)^{> (t)}|0, 1 (2y_i_j 1)k(B, x_i_j)^{> (t)} L₁(U, B, q(v)) = log(p(U)) +

Z

q(v) log p(v|B)

q(v) dv + X

j

Z

q(v)F_v(y_i_j, )dv L₁ y,U, B,q(v)

B, U L^⇤₁(y,U,B) L₁ y,U, B,q(v)

(A₁, rA¹) (a₂,ra²) (a₃,ra³) (a₄,ra⁴) (a₅,ra⁴)

L₁(y,U, B,q(v)) = log(p(U)) + H q(v) +

Z

log N (v|0, k(B, B) + X

j

Z

q(v)F_v(y_i_j, )dv

sgn

Xk i=1

c_i · sgn(wi^>x) (2)

= sgn

Xk i=1

c_iw_i^>x (3)

y_iw^>x_i

kwk < ⌘ (4)

s₁ + s₂ + s₃ 0 (5)

4