Regression Analysis
PCA
LD
K-Nearest Neighbor Classifier
ANN as Classifier
SVM became famous when, using images as input, it gave accuracy comparable to neural-network with
hand-designed features in a handwriting recognition task Currently, SVM is widely used in object detection &
Large Margin Linear Classifier
Nonlinear SVM: The Kernel Trick
( )
( ) for all
i j
g
x
g
x
j
i
An example we’ve learned before:
Minimum-Error-Rate Classifier
For two-category case,
g
( )
x
g
1( )
x
g
2( )
x
1 2
Decide
if ( )
g
x
0; otherwise decide
( )
(
| )
(
| )
Nearest Neighbor
Decision Tree
Linear Functions
( )
Tg
x
w x
b
( )
Tg
x
w x
b
x1 wT x + b < 0
A hyper-plane in the
feature space
(Unit-length) normal vector
of the hyper-plane:
w
n
w
How would you classify these points using a linear discriminant function in order to minimize the error rate?
x1 x2
How would you classify these points using a linear discriminant function in order to minimize the error rate?
x x2
How would you classify these points using a linear discriminant function in order to minimize the error rate?
x1 x2
x x2
How would you classify these points using a linear discriminant function in order to minimize the error rate?
Infinite number of answers!
The linear discriminant
function (classifier) with the maximum margin is the
best
“safe zone”
Margin is defined as the
width that the boundary
could be increased by before hitting a data point
Why it is the best?
Robust to outliners and thus
strong generalization ability
Margin
Given a set of data points:
With a scale transformation
on both w and b, the above is equivalent to
x1 x2
For
1,
0
For
1,
0
T i i T i i
y
b
y
b
w x
w x
{( ,
x
iy
i)},
i
1, 2,
,
n
, where
For
1,
1
For
1,
1
We know that
The margin width is:
x1 x2
1
1
T Tb
b
w x
w x
Margin x+ x+ x-(
)
2
(
)
M
x
x
n
w
x
x
w
w
n
Formulation: x x2 Margin x+ x+ x -n
such that
2
maximize
w
For
1,
1
For
1,
1
Formulation: x1 x2 Margin x+ x+ x -n 2
1
minimize
2
w
such that
For
1,
1
For
1,
1
Formulation:
x x2
Margin
x+
x+
x
-n
(
T) 1
i i
y
w x
b
2
1
minimize
2
w
(
T) 1
i i
y
w x
b
2
s.t.
with linear constraints
2 11
minimize
( , ,
)
(
) 1
2
n
T
p i i i i
i
L
b
y
b
w
w
w x
1
2
is.t.
i
0
0
pL
b
0
pL
w
1n
1
2
is.t.
i
0
1 1 1
1
maximize
2
n n n
T i i j i j i j
i i j
y y
x x
s.t.
i
0
The solution has the form:
(
T) 1
0
i
y
i ib
w x
Thus, only support vectors have
0
i
1 SV
n
i i i i i i
i i
y
y
w
x
x
get from (
) 1
0,
where is support vector
T i i
b
y
w x
b
SV
( )
T i Tii
g
b
b
x
w x
x x
Notice it relies on a dot product between the test point x
and the support vectors xi
Also keep in mind that solving the optimization problem
involved computing the dot products xiTx
j between all pairs
What if data is not linear separable? (noisy data, outliers, etc.)
Slack variables ξi can be
added to allow
mis-classification of difficult or noisy data points
x x2
1
2
(
T) 1
i i i
y
w x
b
2
1
1
minimize
2
w
C
i
isuch that
0
i
1 1 1
1
maximize
2
n n n
T i i j i j i j
i i j
y y
x x
such that
0
i
C
Datasets that are linearly separable with noise work out great:
0 x
0 x
x2
0 x
But what are we going to do if the dataset is just too hard?
How about… mapping data to a higher-dimensional space:
General idea: the original input space can be mapped to
some higher-dimensional feature space where the
training set is separable:
With this mapping, our discriminant function is now:
SV
( )
T( )
i( )
i T( )
i
g
b
b
x
w
x
x
x
No need to know this mapping explicitly, because we only use
the dot product of feature vectors in both the training and test.
A kernel function is defined as a function that corresponds to
a dot product of two feature vectors in some expanded feature space:
( ,
i j)
( )
i T(
j)
2-dimensional vectors x=[x1 x2];
let K(xi,xj)=(1 + xiTx j)2,
Need to show that K(xi,xj) = φ(xi)Tφ(x j):
K(xi,xj)=(1 + xiTx j)2,
= 1+ xi12xj12 + 2 xi1xj1xi2xj2+ xi22xj22 + 2xi1xj1 + 2xi2xj2
= [1 xi12 √2 xi1xi2 xi22 √2xi1 √2xi2]T [1 xj12 √2 xj1xj2 xj22 √2xj1 √2xj2] = φ(xi)Tφ(x
j), where φ(x) = [1 x12 √2 x1x2 x22 √2x1 √2x2]
Linear kernel:
2
2
( ,
)
exp(
)
2
i j i jK
x
x
x x
( ,
i j)
Ti jK
x x
x x
( ,
i j)
(1
Ti j)
pK
x x
x x
0 1
( ,
i j)
tanh(
Ti j)
K
x x
x x
Examples of commonly-used kernel functions:
Polynomial kernel:
Gaussian (Radial-Basis Function (RBF) ) kernel:
Sigmoid:
In general, functions that satisfy Mercer’s condition can be
1 1 1
1
maximize
( ,
)
2
i i j i j i j
i i j
y y K
x x
such that
0
i
C
1
0
n i i iy
The solution of the discriminant function is
SV
( )
i( , )
ii
g
K
b
x
x x
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages available)
- if ineffective, more elaborate kernels are needed - domain experts can give assistance in formulating
appropriate similarity measures
Choice of kernel parameters
- e.g. σ in Gaussian kernel
- σ is the distance between closest points with different
classifications
- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.