CSE 575: Statistical Machine Learning
Jingrui He CIDSE, ASU
3
1-Nearest Neighbor
Four things make a memory based learner:
1. A distance metric
Euclidean (and many more)
2. How many nearby neighbors to look at?
One
3. A weighting function (optional)
Unused
4. How to fit with the local points?
Just predict the same output as the nearest neighbor.
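The four choices above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names and the toy data are made up here, and Euclidean distance is only one of the possible metrics.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict_1nn(train_x, train_y, query):
    """Return the stored output of the training point closest to `query`."""
    nearest = min(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    return train_y[nearest]

# Toy data (invented for illustration): two points per class.
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["a", "a", "b", "b"]

print(predict_1nn(train_x, train_y, (0.2, 0.4)))
print(predict_1nn(train_x, train_y, (5.5, 4.0)))
```

There is no training step: all the work happens when a query arrives, which is what makes the learner "memory based".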
4
Consistency of 1-NN
• Consider an estimator $f_n$ trained on n examples
– e.g., 1-NN, regression, ...
• Estimator is consistent if true error goes to zero as amount of
data increases
– e.g., for noise-free data, consistent if: $\lim_{n \to \infty} \mathrm{error}_{\mathrm{true}}(f_n) = 0$
• Regression is not consistent!
– Representation bias
• 1-NN is consistent (under some mild fine print)
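A small numerical experiment can make the consistency claim concrete. The sketch below is an illustration under assumptions, not a proof: the noise-free target x², the evenly spaced training inputs, and the fixed test grid are all choices made here for the example.

```python
def predict_1nn_1d(train_x, train_y, query):
    """1-NN on scalar inputs: copy the output of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[nearest]

def target(x):
    """Noise-free target function, chosen only for illustration."""
    return x * x

def max_error(n, n_test=200):
    """Worst-case 1-NN error on a fixed test grid, using n training points in [0, 1]."""
    train_x = [i / (n - 1) for i in range(n)]
    train_y = [target(x) for x in train_x]
    test_x = [(j + 0.5) / n_test for j in range(n_test)]
    return max(abs(predict_1nn_1d(train_x, train_y, x) - target(x)) for x in test_x)

errors = {n: max_error(n) for n in (5, 50, 500)}
for n, err in errors.items():
    print(n, err)
```

The worst-case error shrinks as n grows, because every test point ends up with a training point arbitrarily close to it.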
5
6
k-Nearest Neighbor
Four things make a memory based learner:
1. A distance metric
Euclidean (and many more)
2. How many nearby neighbors to look at?
k
3. A weighting function (optional)
Unused
4. How to fit with the local points?
7
k-Nearest Neighbor (here k=9)
K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.
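The smoothing effect can be seen in a short sketch. It assumes the common unweighted rule of averaging the outputs of the k nearest neighbors (the local-fit rule is not spelled out in the text above), and the noisy samples are invented for illustration.

```python
def predict_knn(train_x, train_y, query, k):
    """Average the outputs of the k training points nearest to `query` (scalar inputs)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    neighbors = order[:k]
    return sum(train_y[i] for i in neighbors) / len(neighbors)

# Invented noisy samples of roughly y = x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.1, 0.7, 2.6, 2.9, 4.1, 5.0]

print(predict_knn(train_x, train_y, 2.0, k=1))  # copies one noisy neighbor
print(predict_knn(train_x, train_y, 2.0, k=3))  # averages three neighbors
```

With k=1 the prediction inherits the noise of a single point; with k=3 the average lands closer to the underlying trend, at the cost of blurring sharp features.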
8
Curse of dimensionality for instance-based learning
• Must store and retrieve all data!
– Most real work done during testing
– For every test sample, must search through the entire dataset – very slow!
– There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ...
$w \cdot x = \sum_j w^{(j)} x^{(j)}$
10
Linear classifiers – Which line is better?
Data:
11
Pick the one with the largest margin!
$w \cdot x = \sum_j w^{(j)} x^{(j)}$
$w \cdot x + b = 0$
12
Maximize the margin
$w \cdot x + b = 0$
13
But there are many planes…
$w \cdot x + b = 0$
14
$w \cdot x + b = 0$
15
Normalized margin – Canonical hyperplanes
[Figure: hyperplanes $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$; margin $2\gamma$; points $x^{-}$ and $x^{+}$]
16
Normalized margin – Canonical hyperplanes
[Figure: hyperplanes $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$; margin $2\gamma$; points $x^{-}$ and $x^{+}$]
17
Margin maximization using canonical hyperplanes
[Figure: hyperplanes $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$; margin $2\gamma$]
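The link between the margin and $w$ can be worked out in a few lines. The sketch below assumes $x^{+}$ and $x^{-}$ are the closest points on the two canonical hyperplanes, so that they differ by $2\gamma$ along the unit normal $w / \lVert w \rVert$; the label $y_j \in \{-1, +1\}$ is notation introduced here.

```latex
\begin{align*}
w \cdot x^{+} + b &= +1, \qquad w \cdot x^{-} + b = -1, \qquad
x^{+} = x^{-} + 2\gamma \frac{w}{\lVert w \rVert} \\
w \cdot x^{+} + b &= (w \cdot x^{-} + b) + 2\gamma \frac{w \cdot w}{\lVert w \rVert}
  = -1 + 2\gamma \lVert w \rVert = +1 \\
\gamma &= \frac{1}{\lVert w \rVert} = \frac{1}{\sqrt{w \cdot w}}
\end{align*}
```

Under this normalization, maximizing $\gamma$ is the same as minimizing $w \cdot w$ subject to $(w \cdot x_j + b)\, y_j \ge 1$ for every training point $j$.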
18
Support vector machines (SVMs)
[Figure: hyperplanes $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$; margin $2\gamma$]
• Solve efficiently by quadratic programming (QP)
– Well-studied solution algorithms
19
What if the data is not linearly separable?
Use features of features of features of features….
20
What if the data is still not linearly separable?
• Minimize w.w and number of training mistakes
– Tradeoff two criteria?
• Tradeoff #(mistakes) and w.w
– 0/1 loss
– Slack penalty C
– Not QP anymore
– Also doesn’t distinguish near misses
21
Slack variables – Hinge loss
• If margin ≥ 1, don’t care
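The hinge loss and the resulting soft-margin objective can be sketched in plain Python. Note the swap: this uses simple subgradient descent rather than a QP solver, and the objective form 0.5 * w.w + C * (sum of hinge losses), the step size, the epoch count, and the toy data are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def hinge(margin):
    """Hinge loss: zero once the margin y * (w.x + b) reaches 1."""
    return max(0.0, 1.0 - margin)

def train_soft_margin(xs, ys, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 0.5 * w.w + C * sum_j hinge(y_j * (w.x_j + b))."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * w.w term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1:  # only margin violators contribute
                grad_w = [g - C * y * xi for g, xi in zip(grad_w, x)]
                grad_b -= C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Invented, linearly separable toy data with labels in {-1, +1}.
xs = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
ys = [1, 1, -1, -1]
w, b = train_soft_margin(xs, ys)
print(w, b)
print([hinge(y * (dot(w, x) + b)) for x, y in zip(xs, ys)])
```

Points whose margin is at least 1 drop out of the gradient entirely, which is the "don't care" behavior stated above.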
22
Side note: What’s the difference
between SVMs and logistic regression?
SVM: Logistic regression:
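For reference, the two per-example losses being compared can be written side by side. These are the standard textbook forms, given here as an assumption about what is being contrasted; $f(x) = w \cdot x + b$ and $y \in \{-1, +1\}$ are notation introduced for this note.

```latex
\begin{align*}
\text{SVM (hinge loss):} \quad & \ell_{\text{hinge}}(x, y) = \max\bigl(0,\; 1 - y\, f(x)\bigr) \\
\text{Logistic regression (log-loss):} \quad & \ell_{\log}(x, y) = \ln\bigl(1 + \exp(-y\, f(x))\bigr)
\end{align*}
```

The hinge loss is exactly zero once $y\, f(x) \ge 1$, while the log-loss is positive for every finite margin and only approaches zero.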
23
24
Lagrange multipliers – Dual variables
Moving the constraint to the objective function
Lagrangian:
25
Lagrange multipliers – Dual variables
26
Dual SVM derivation (1) – the linearly separable case
27
Dual SVM derivation (2) – the linearly separable case
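The derivation for the separable case can be sketched as follows. This is the standard textbook route, written under the assumption that the primal is $\min_{w,b} \tfrac{1}{2} w \cdot w$ subject to $(w \cdot x_j + b)\, y_j \ge 1$; the constant factor on $w \cdot w$ may differ from the one used in lecture.

```latex
\begin{align*}
L(w, b, \alpha) &= \tfrac{1}{2}\, w \cdot w - \sum_j \alpha_j \bigl[(w \cdot x_j + b)\, y_j - 1\bigr],
  \qquad \alpha_j \ge 0 \\
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_j \alpha_j y_j x_j \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_j \alpha_j y_j = 0 \\
\text{substituting back:} \quad
\max_{\alpha}\; & \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
  \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \;\; \alpha_i \ge 0
\end{align*}
```

The data enter the final problem only through the dot products $x_i \cdot x_j$.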
28
Dual SVM interpretation
$w \cdot x + b = 0$
29
Dual SVM formulation – the linearly separable case
30
Dual SVM derivation – the non-separable case
31
Dual SVM formulation – the non-separable case
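For reference, the non-separable dual in its standard textbook form, assuming the primal is $\min_{w,b,\xi} \tfrac{1}{2} w \cdot w + C \sum_j \xi_j$ subject to $(w \cdot x_j + b)\, y_j \ge 1 - \xi_j$ and $\xi_j \ge 0$ (the scaling of the $w \cdot w$ term is an assumption):

```latex
\begin{align*}
\max_{\alpha}\; & \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \\
\text{s.t.}\; & \sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \\
w &= \sum_i \alpha_i y_i x_i, \qquad
b = y_k - w \cdot x_k \;\; \text{for any } k \text{ with } 0 < \alpha_k < C
\end{align*}
```

The slack penalty shows up only as the upper bound $C$ on each $\alpha_i$; the objective itself is the same as in the separable case.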
32
Why did we learn about the dual SVM?
• There are some quadratic programming
algorithms that can solve the dual faster than the primal
• But, more importantly, the “kernel trick”!!!
33
Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features….
34
Higher order polynomials
[Figure: number of monomial terms vs. number of input dimensions, with curves for d=2, d=3, and d=4]
m – input features
d – degree of polynomial
grows fast! d = 6, m = 100
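How fast the feature count grows can be checked directly. The sketch below uses the standard count of distinct monomials of degree exactly d in m variables, C(m + d - 1, d); the function name is made up for this example.

```python
import math

def num_monomials(m, d):
    """Number of distinct monomials of degree exactly d in m input features."""
    return math.comb(m + d - 1, d)

for d in (2, 3, 4, 6):
    print(d, num_monomials(100, d))
```

For d = 6 and m = 100 this gives 1,609,344,100 terms, roughly 1.6 billion, which is far too many to represent explicitly.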
35
Dual formulation only depends on dot-products, not on w!
36
37
Finally: the “kernel trick”!
• Never represent features explicitly
– Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
• Very interesting theory – Reproducing
38
Polynomial kernels
• All monomials of degree d in O(d) operations:
• How about all monomials of degree up to d?
– Solution 0:
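The claim that a degree-d feature dot product collapses to a single power can be verified numerically. The sketch below assumes the feature map consists of all ordered products of d coordinates (so repeated monomials are counted with multiplicity), for which the closed form is (u.v)^d; the helper names and vectors are invented for the example.

```python
from itertools import product

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def explicit_features(x, d):
    """All ordered products of d coordinates of x: m**d features in total."""
    feats = []
    for idx in product(range(len(x)), repeat=d):
        value = 1.0
        for i in idx:
            value *= x[i]
        feats.append(value)
    return feats

def poly_kernel(u, v, d):
    """Degree-d polynomial kernel from one dot product in the input space."""
    return dot(u, v) ** d

u, v, d = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0), 3
explicit = dot(explicit_features(u, d), explicit_features(v, d))
closed_form = poly_kernel(u, v, d)
print(explicit, closed_form)
```

The explicit route builds 27 features per vector here; the kernel route needs one 3-dimensional dot product and a power, and the gap widens quickly with m and d.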
39
Common kernels
• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian kernels
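The three kernels listed above are sketched below in their usual textbook forms. The exact parameterization is an assumption here, in particular the bandwidth name `sigma` and the 2 * sigma**2 scaling in the Gaussian kernel.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_degree_d(u, v, d):
    """Polynomials of degree exactly d: (u.v)^d."""
    return dot(u, v) ** d

def poly_up_to_d(u, v, d):
    """Polynomials of degree up to d: (u.v + 1)^d."""
    return (dot(u, v) + 1) ** d

def gaussian(u, v, sigma):
    """Gaussian kernel: exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

u, v = (1.0, 2.0), (3.0, 4.0)
print(poly_degree_d(u, v, 2), poly_up_to_d(u, v, 2), gaussian(u, v, 1.0))
```

Each function touches only the original input vectors, never an explicit feature vector.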
40
Overfitting?
• Huge feature space with kernels, what about
overfitting???
– Maximizing margin leads to sparse set of support vectors
– Some interesting theory says that SVMs search for simple hypothesis with large margin
41
What about at classification time?
• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x)+b)
42
SVMs with kernels
• Choose a set of features and kernel function
• Solve dual problem to obtain support vectors $\alpha_i$
• At classification time, compute:
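The classification step can be sketched as below, assuming the usual expansion w.Φ(x) = sum_i α_i y_i K(x_i, x) obtained from the dual. The function names are hypothetical; the two support vectors with α = 0.25 and b = 0 are a hand-worked toy solution (they give w = (0.5, 0.5) under a linear kernel), not output from a solver.

```python
def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kernel_decision(x, support_x, support_y, alphas, b, kernel):
    """sign(sum_i alpha_i * y_i * K(x_i, x) + b), without ever forming Phi(x)."""
    score = sum(a * y * kernel(sx, x)
                for a, y, sx in zip(alphas, support_y, support_x)) + b
    return 1 if score >= 0 else -1

# Hand-worked toy solution for two points, one per class.
support_x = [(1.0, 1.0), (-1.0, -1.0)]
support_y = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

print(kernel_decision((2.0, 0.5), support_x, support_y, alphas, b, linear_kernel))
print(kernel_decision((-0.5, -3.0), support_x, support_y, alphas, b, linear_kernel))
```

Any other kernel function with the same two-argument signature could be passed in place of `linear_kernel`; the cost per query is one kernel evaluation per support vector.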
43
What’s the difference between SVMs and Logistic Regression?
                                         SVMs          Logistic Regression
Loss function                            Hinge loss    Log-loss
High dimensional features with kernels
44
Kernels in logistic regression
• Define weights in terms of support vectors:
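One common way to write this out is shown below. It is a sketch under assumptions: the coefficients $\alpha_i$ here are free real-valued weights (not the SVM dual variables), and the sign convention inside the sigmoid varies between textbooks.

```latex
\begin{align*}
w &= \sum_i \alpha_i \, \Phi(x_i) \\
P(Y = 1 \mid x) &= \frac{1}{1 + \exp\bigl(-(w \cdot \Phi(x) + b)\bigr)}
  = \frac{1}{1 + \exp\Bigl(-\bigl(\sum_i \alpha_i K(x, x_i) + b\bigr)\Bigr)}
\end{align*}
```

The model again depends on the data only through $K(x, x_i)$, so the optimization can be carried out over the $\alpha_i$ instead of over $w$.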
45
What’s the difference between SVMs and Logistic Regression? (Revisited)
                                         SVMs          Logistic Regression
Loss function                            Hinge loss    Log-loss
High dimensional features with kernels   Yes!          Yes!
Solution sparse                          Often yes!    Almost always no!
Semantics of output