machine learning

(1)

CSE 575: Sta*s*cal Machine Learning

Jingrui He CIDSE, ASU

(2)

(3)

3

1-‐Nearest Neighbor

Four things make a memory based learner:

1.  A distance metric

Euclidian (and many more)

2.  How many nearby neighbors to look at?

One

1.  A weigh:ng func:on (op:onal)

Unused

2.  How to ﬁt with the local points?

Just predict the same output as the nearest neighbor.

(4)

4

Consistency of 1-‐NN

•  Consider an es*mator f_n trained on n examples

–  e.g., 1-‐NN, regression, ...

•  Es*mator is consistent if true error goes to zero as amount of

data increases

–  e.g., for no noise data, consistent if:

•  Regression is not consistent!

–  Representa*on bias

•  1-‐NN is consistent (under some mild ﬁneprint)

(5)

5

(6)

6

k-‐Nearest Neighbor

Four things make a memory based learner:

1.  A distance metric

Euclidian (and many more)

2.  How many nearby neighbors to look at?

k

1.  A weigh:ng func:on (op:onal)

Unused

2.  How to ﬁt with the local points?

(7)

7

k-‐Nearest Neighbor (here k=9)

K-‐nearest neighbor for funcFon ﬁHng smooth away noise, but there are clear deﬁciencies.

(8)

8

Curse of dimensionality for instance-‐based learning

•  Must store and retrieve all data!

–  Most real work done during tes*ng

–  For every test sample, must search through all dataset – very slow! –  There are fast methods for dealing with large datasets, e.g., tree-‐based

methods, hashing methods,

(9)

(10)

w.x = ∑_j w(j)_x(j) 10

Linear classiﬁers – Which line is beber?

Data:

(11)

11

Pick the one with the largest margin!

w.x = ∑_j w(j)_x(j)

w.x

+ b = 0

(12)

12

Maximize the margin

w.x

+ b = 0

(13)

13

But there are a many planes…

w.x

+ b = 0

(14)

14 w.x

+ b = 0

(15)

15 w.x + b = +1 w.x + b = -‐1 w.x + b = 0 margin 2γ x-‐ x+

Normalized margin – Canonical hyperplanes

(16)

16

Normalized margin – Canonical hyperplanes w.x + b = +1 w.x + b = -‐1 w.x + b = 0 margin 2γ x-‐ x+

(17)

17

Margin maximiza*on using canonical hyperplanes w.x + b = +1 w.x + b = -‐1 w.x + b = 0 margin 2γ

(18)

18

Support vector machines (SVMs)

w.x + b = +1 w.x + b = -‐1 w.x + b = 0 margin 2γ

•  Solve eﬃciently by quadra*c programming (QP)

–  Well-‐studied solu*on algorithms

(19)

19

What if the data is not linearly separable?

Use features of features of features of features….

(20)

20

What if the data is s*ll not linearly separable?

•  Minimize w.w and number of training mistakes

–  Tradeoﬀ two criteria?

•  Tradeoﬀ #(mistakes) and w.w

–  0/1 loss

–  Slack penalty C

–  Not QP anymore

–  Also doesn’t dis*nguish near misses

(21)

21

Slack variables – Hinge loss

•  If margin ≥ 1, don’t care

(22)

22

Side note: What’s the diﬀerence

between SVMs and logis*c regression?

SVM: LogisFc regression:

(23)

23

(24)

24

Lagrange mul*pliers – Dual variables

Moving the constraint to objecFve funcFon Lagrangian:

(25)

25

Lagrange mul*pliers – Dual variables

(26)

26

Dual SVM deriva*on (1) – the linearly separable case

(27)

27

Dual SVM deriva*on (2) – the linearly separable case

(28)

28

Dual SVM interpreta*on

w.x

+ b = 0

(29)

29

Dual SVM formula*on – the linearly separable case

(30)

30

Dual SVM deriva*on – the non-‐separable case

(31)

31

Dual SVM formula*on – the non-‐separable case

(32)

32

Why did we learn about the dual SVM?

•  _{There
are
some
quadra*c
programming}

algorithms that can solve the dual faster than the primal

•  _{But,
more
importantly,
the
“kernel
trick”!!!}

(33)

33

Reminder from last *me: What if the data is not linearly separable?

Use features of features of features of features….

(34)

34

Higher order polynomials

number of input dimensions

nu m be r o f m on om ial te rm s d=2 d=4 d=3 m – input features

d – degree of polynomial

grows fast! d = 6, m = 100

(35)

35

Dual formula*on only depends on dot-‐products, not on w!

(36)

36

(37)

37

Finally: the “kernel trick”!

•  Never represent features explicitly

–  Compute dot products in closed form

•  Constant-‐*me high-‐dimensional dot-‐ products for many classes of features •  Very interes*ng theory – Reproducing

(38)

38

Polynomial kernels

•  _{All
monomials
of
degree
d
in
O(d)
opera*ons:}

•  _{How
about
all
monomials
of
degree
up
to
d?}

– Solu*on 0:

(39)

39

Common kernels

•  Polynomials of degree d

•  Polynomials of degree up to d

•  Gaussian kernels

(40)

40

Overﬁvng?

•  _{Huge
feature
space
with
kernels,
what
about}

overﬁvng???

– Maximizing margin leads to sparse set of support vectors

– Some interes*ng theory says that SVMs search for simple hypothesis with large margin

(41)

41

What about at classiﬁca*on *me

•  _{For
a
new
input
x,
if
we
need
to
represent} Φ(x), we are in trouble!

•  _{Recall
classiﬁer:
sign(w.Φ(x)+b)}

(42)

42

SVMs with kernels

•  _{Choose
a
set
of
features
and
kernel
func*on}

•  _{Solve
dual
problem
to
obtain
support
vectors}

α_i

•  _{At
classiﬁca*on
*me,
compute:}

(43)

43

What’s the diﬀerence between SVMs and Logis*c Regression?

SVMs Logistic

Regression

Loss function Hinge loss Log-loss

High dimensional features with

kernels

(44)

44

Kernels in logis*c regression

•  Deﬁne weights in terms of support vectors:

(45)

45

What’s the diﬀerence between SVMs and Logis*c Regression? (Revisited)

SVMs Logistic

Regression

Loss function Hinge loss Log-loss

High dimensional features with

kernels

Yes! Yes!

Solution sparse Often yes! Almost always no!

Semantics of output