CS 3750 Advanced Machine Learning
Support Vector Machines for Regression
Min Chi [email protected] Intelligent system program
November 10, 2003
•Introduction
•Basic techniques in SVR
•Basic linear regression(separable case)
• Linear ε- Intensive Loss Algorithm(non- separable case)
• Primal Formulation
• Dual Formulation
• Nonlinear regression
•Kernel Formulation
•Some SVM algorithms
Support vector machine SVM Support vector machine SVM
•SVM maximize the margin around the separating hyperplane.
•The decision function is fully specified by a subset of the training data, the support vectors.
Introduction Introduction
• Transformation with -linear function
-nonlinear function-Kernel.
• Nonlinear become linear boundary in the transformed space.
• Aim: To find the optimal Kernel or linear
function and corresponding support vectors.
The regression Problem The regression Problem
•Regression=find a function that fits the observations.
• “Close point” may be wrong due to the noise only.
-Line should be influenced by the real data not the noise.
-Ignore the errors from those point which are close.
•Uniqueness: Every strictly convex constrained optimization problem has a unique solution.
) 1 ( for
) ( ) 1 ( ) ( )
(
x f x1 f x2 x x1 x2f λ
< λ + − λ
λ= λ + − λ
•Lagrange Function.
•Dual Objective Function
•Duality Gap
•Karush-Kuhn-Tucker (KKT) conditions.A set of primal and dual variables that is both feasible and satisfies the KKT conditions is the optimal solution.
Duality theory in the convex Duality theory in the convex optimization
optimization
.• Training data:
• Our goal is to find a function f(x) that as at most εdeviation from the actually obtained target for all the training data. At the same time as flat as possible.
• With a hyperplane( assuming linear model and the data can be separated!)
R y R x y
x y
x , ),..., ( l, l)}, ∈ n , ∈ {( 1 1
b x w a
x
f ( , ) = 〈 , 〉 +
y
iBasic Linear regression (separable Basic Linear regression (separable case)
case)
(2) y
, , y
subject to 2 1 minimize
i i
2
≤
− +
〉
〈
≤
−
〉
〈
−
ε ε b
x w
b x w w
i i
i i
(1) ,
,
)
(x x b with R b R
f = 〈ω 〉+ ω∈ n ∈
Linear function f taking the form:
Flatness in the equation (1) means that one seeks small ω. Formally, we can write this problem as a convex optimization problem by:
All pair with All pair with
εεprecisionprecision(xi,yi)
Primal regression problem
Primal regression problem
Support Vector regression (
Support Vector regression (
VapnikVapnik1995)1995) (non(non--separable case)separable case)
• Before, we introduce the case that function f actually exists that approximates all data pairs with ε precision.
Sometimes, we may want to allow some errors.
Linear
Linear ε ε -support vector regression - support vector regression
Define if there is no error for . -minimize the error made outside of the tube.
-The parameter C to control the amount of influence of error. C balance the two competing goals.(error and
||w||)
= 0 ξ
ix i
(3) 0
,
y ,
, y
subject to
) 2 (
1 minimize
*
* i
i
1 2 *
≥
+
≤
− +
〉
〈
+
≤
−
〉
〈
−
+
+ ∑
=
i i
i i
i
i i
i l
i i i
b x w
b x w
C w
ξ ξ
ξ ε
ξ ε ξ ξ
ε-ε-intensive loss functionintensive loss function
−
= ≤
otherwise
for 0
ε ξ
ε ξ ε ξ
The Optimization problem The Optimization problem
*
*
* 1
*
*
* 1
* 1
1 2 *
, , , : variable primal
(5) 0 , , , : Multiplier Lagrange
subject to
) (
) , (
) , (
) 2 (
: 1 max
: is problem the
of dual The
i i
i i i i l i
i i i i i
i i l
i i
i i
i l
i i
l i
i i
b
a a b x y
a
b x y
a
C L
ξ ξ ω
η η
ξ η ξ η ω
ξ ε
ω ξ
ε
ξ ξ ω
≥ +
−
−
〉
〈
− + +
−
+
〉
〈 +
−
−
− + +
=
∑
∑
∑
∑
=
=
=
=
Lagrangian function will help us to formulate the dual problem
(6) 0
) (
1
*
− =
∂ =
∂ ∑
= l
i
a
ia
ib L
(7) 0 )
(
1
*
− =
−
∂ =
∂ ∑
= i
l
i
a
ia
ix
L ω
ω
(8) 0
(*) (*)
(*)
= − − =
∂
∂
i i
i
a
L C η
ξ
It follows from the saddle point condition that the partial derivations of L with respect to the primal variables has to varnish for optimality.
∑
∑
∑
∑
∑
∑
∑
∑
∑
∑
∑
∑
∑
∑
=
=
=
=
=
=
=
=
=
=
=
=
=
=
−
−
+
〉
〈 +
−
−
−
+
〉
〈
−
−
−
−
+ +
〉
〈
=
l
i i i
l
i i i
l
i i
l
i i i
l
i i i
l
i i i
l
i i
l
i i
l
i i i
l
i i i
l
i i i
l
i i
l
i i
l
i i
b a x
a y
a a
a
b a x
a y
a a
a
C C
w w L
1
*
* 1
1
* 1
* 1
* 1
*
* 1
*
1 1
1 1
1
1
* 1
, , 2 ,
1
ξ η ξ
η
ω ξ
ε
ω ξ
ε
ξ ξ
calculation
calculation
∑
∑
∑
∑
∑
∑
=
∑ − =
= =
∑ +
=
〉
〈
=
=
= =
−
−
= =
=
−
−
= =
=
=
− +
〉
〈
−
−
+
− +
−
−
−
+
−
− +
〉
〈
=
l
i
a a i i l
i
x a a w
w
i i
i
l
i
i i i l
i
i i )
a C
i i l
i i
) a C
i i l
i i
l
i i i
l
i i i i
i i
i i
b a a x
a a
y a a a
a a
C
a C
w w L
1
) 0 ) ( (6), From ( 0
* 1
) ) ( (7), From ( ,
*
1
* 1
*
0 (8),
from ( 0
*
* 1
*
0 (8),
from ( 1 0
1
* 1
* (*) (*)
(*) (*)
) (
, ) (
) (
) (
) (
) (
2 , 1
43 42 1 4
4 3 4
4 2 1
4 43 4
42 1
4 43 4
42 1
ω η
η
ω
ε η
ξ
η ξ
(9)
] , 0 [ ,
0 ) : (
subject to
) (
) (
, ) (
) 2 (
- 1
) (
) (
2 , - 1
* 1
* 1
* 1
*
* 1
*
1
* 1
*
∈
=
− +
− +
−
〉
〈
−
−
=
+
− +
−
〉
〈
=
∑
∑
∑
∑
∑
∑
=
=
=
=
=
=
C a
a
a a
y a a a
a
x x a a a a
y a a a
a w
w L
i i l
i i i
l
i i i i
l
i i i
j i j j l
i i i
l
i i i i
l
i i i
ε
ε
Results.
Results.
From the equation (7)
i i l
i
( a
ia
*) x
1
−
= ∑
=
ω
We can get:
(10) ,
) (
)
(
*1
b x x a a x
f
l i ii i
− 〈 〉 +
= ∑
=
How to computing b?
How to computing b?
(12)
0 )
(
0 )
(
(11) 0 ) ,
(
0 ) ,
(
*
*
*
*
=
−
=
−
=
−
〉
〈
− + +
= +
〉
〈 +
−
−
i i
i i
i i
i i
i i
a C
a C
b x y
a
b x y
a
ξ ξ
ω ξ
ε
ω ξ
ε
• At the KKT(Karush-Kuhn-Tucker) conditions: at the optimal solution the product between dual variables and constraints has to vanish. This means that the Lagrange multipliers will only be non-zero for points outside the ε band. Thus these points are the support vectors.In the SV case:
(13) ) , 0 ( for ,
) , 0 ( for ,
: follows as
computed be
can b Hence
* C
a x
y b
C a
x y
b
i i
i
i i
i
∈ +
〉
〈
−
=
∈
−
〉
〈
−
=
ε ω
ε ω
How to computing b?
How to computing b?
Firstly, only samples with corresponding a lie outside the the ε-insensitive tube around f.f
Secondly, there can never be a set of dual variables which are both simultaneously nonzero.
Finally, for we have .
) , ( x
iy
i0
*
=
i i
a a
i,
ai*a
) , 0 (
(*) C
ai
∈
ξi(*) = 0C ai(*)
=
Nonlinear SVR Nonlinear SVR
•Key idea:transform xi to a higher dimension to make life easier.
-input space: the space xi are in.
-feature space: the space of Φ(xi) after transformation.
•Why transform?
- Linear operation in the feature space is equivalent to the nonlinear operation in the input space.
-make the non-separable problem separable.
(continue) ( continue)
Possible problems of the transformation:
•High computation burden SVM solved the problem:
•Kernel tricks for efficient computation.
Example Transformation Example Transformation
•Define the Kernel Function:
2 2 2 1 1 2
1 2
1 ) (1 x y x y )
y , y x
k( x = + +
y ) , y x k( x
) y x y x (
y y y y y y
x x x x x x
=
+ +
=
〉
Φ
Φ
〈
=
Φ
=
Φ
2 1 2 1
2 2 2 1 1 2
1 2
1
2 1 2 2 2 1 2 1 2
1
2 1 2 2 2 1 2 1 2
1
1 y )
( y x ), ( x
) 2 , , , 2 , 2 , 1 ( y ) ( y
) 2 , , , 2 , 2 , 1 ( x ) ( x
•Consider the following transformation:
without by
computed be
can product inner
the
So k
Kernel Trick Kernel Trick
. ) ( ), ( )
: is (.) mapping the
and
funtion Kernel
e between th ip
relationsh The
〉 Φ Φ
〈
=
Φ
y x k(x,yk
• This is known as the Kernel trick.
•We choose k instead of choosing Φ(.)
•K(x,y) need to satisfy a technical condition (Mercer Conditions) in order for Φ(.) to exist.
Which functions correspond to a dot product in the feature space?
) , ( x x
'k
Kernel Kernel
) ( ).
( )
,
(xi x j xi x j
K = Φ Φ
Hilbert Schmidt Theory:
is a symmetric function.
) , (
xi xj KMercer's conditions:
∞
<
≥ Φ Φ
=
∫
∫∫
∑
∞=
dx x f
dxdx x
f x f x x K
x x
a x
x
K i j
m m
j i
) (
, 0 )
( ) ( ) , (
if ) ( ).
( )
, (
2
' '
'
1
Linear combination of Kernels:
) , ( )
, ( :
) ,
(x x' c1k1 x x' c2k2 x x'
k = +
Integral of Kernels:
dz z x s z x s x
x
k
( ,
') : = ∫ ( , ) (
', )
Smola, Scholkopf and Muller(1998):0 )
( )
2 ( ) ](
[
if ) (
: ) , (
2 , ' '
≥
=
−
=
∫
− 〈 〉− e k x dx
k F
x x k x
x k
x i d
π
ωω
Burges(1999):
k ( x , x
') : = k ( 〈 x , x
'〉 )
0 )
( )
(
, 0 )
(
, 0 )
(
2
≥
∂ +
∂
≥
∂
≥
ξ ξ
ξ ξ ξ
ξ ξ
ξ
k k
k k
Non Non - - linear SVR algorithm linear SVR algorithm
) ( )
(
*1 i i
l
i
a
i− a Φ x
= ∑
=
ω
Primal problem:
(3) 0
,
y )
( ,
) ( , y
subject to
) 2 (
1 minimize
*
* i
i
1 2 *
≥
+
≤
− +
〉 Φ
〈
+
≤
−
〉 Φ
〈
−
+
+
∑
=
i i
i i
i
i i
i l
i i i
b x w
b x w
C w
ξ ξ
ξ ε
ξ ε ξ ξ
Parameters used in SV Parameters used in SV Regression
Regression
Non Non - - linear SVR algorithm linear SVR algorithm
(15) ) , ( ) (
) ( and ) ( ) (
: (10) th Similar wi .
) ( ), ( : ) , ( If
* 1
* 1
' '
b x x k a a x
f x
a a
x x x
x k
i i l
i i
i i l
i i − Φ = − +
=
〉 Φ Φ
〈
=
∑
∑
= =ω
]
, 0 [ ,
0 ) : (
subject to
) 14 ( )
( ) (
) , ( ) (
) 2 (
- 1 maximize
(9) th similar wi
* 1
* 1
* 1
*
* 1
*
∈
=
−
+
− +
−
−
−
=
∑
∑
∑
∑
=
=
=
=
C a
a
a a
y a a a
a
x x k a a a a L
i i l
i i i
l
i i i i
l
i i i
j i j j l
i i i
ε
Cost Function Cost Function
Training data:
We assume that the training data has been drawn iid from some probability distribution p(x,y). Our goal is to find a function f(x) that minimizes a risk functional.
A possible approximation:
Add a capacity control term:
R y R x y
x y
x , ),..., ( l, l)}, ∈ n , ∈ {( 1 1
= ∫ ( , , ( )) ( , ) ]
[ f c x y f x dp x y R
∑
== l
i i i i
emp c x y f x
f l R
1
)) ( , , 1 (
: ] [
constant.
regulation called
0
2 ] [ :
]
[ 2
>
+
=
λ
λ ω
fR f
Rreg emp
)) ( (
log ))
( , ,
( x y f x p y f x
c = − −
One hand, we want to avoid using a very complicated function c as this may lead to difficult optimization; on the other hand, we should use the cost function that suits data best. Under the assumption that the data were idd and generated by function plus addictive noise.
For the loss function:
(15) otherwise
) ) (
~(
) ( for )) 0
( , ,
(
−
−
≤
= −
ε
ε x
f y c
x f x y
f y x c
−
= ≤
otherwise
for 0
ε ξ
ε ξ ε ξ
Compare with
ε
−intensivelossfunction•Extend the special choice to more general convex cost functions.
• Moreover, we might choose different cost functions and different values of for each sample. Similar with (3), we can get:
~
*~ ,
i
i
c
c ε , ε
*(16) 0
,
y ,
, y
subject to
))
~( )
~( 2 (
1 minimize
*
* i
i
1 2 *
≥
+
≤
− +
〉
〈
+
≤
−
〉
〈
−
+
+
∑
=
i i
i i
i
i i
i l
i
i i
b x w
b x w
c c
C w
ξ ξ
ξ ε
ξ ε
ξ ξ
By standard Lagrange multiplier technique, exactly the same manner as in the ε-intensive loss function.
(18)
0
} )
~(
| inf{
)]
~( ,
0 [
0 ) (
: subject to
(17) )
~( )
~( : ) (
) where (
))) ( ) ( ( ) (
) (
(
, ) (
) 2 (
- 1
1
* 1
*
*
* 1
*
* 1
*
≥
≥
∂
=
∂
∈
=
−
∂
−
= Τ
−
=
Τ + Τ + +
−
−
+
〉
〈
−
−
=
∑
∑
∑
∑
=
=
=
=
ξ
ξ ξ
ξ
ξ ξ ξ
ξ ξ
ω
ξ ξ
ε
ξ ξ
ξ
a c
C c C a
a a
c c
x a a
C a
a a
a y
x x a a a a L
l i
i i l i
i i i
i i
i i l
i
i i i
j i j j l
i
i i
Architecture
Architecture
Algorithms
Algorithms- - Interior Point Algorithms. Interior Point Algorithms.
m
n A R b R
R l a c l
b A
a c q a
∈
∈
∈
≤
≤
= +
−
, , ,
, ,
subject to
. ) 2 ( 1 minimize
component.
th i the denotes and
vector denote
variable i
µ µ
α α
α
α
•Add slack variables:
0 , subject t
. ) 2 ( 1 minimize
≥
= +
=
= +
t g µ t l, a b a-g o Aα
a c q
α
•Write the Wolfe dual.
•Get the KKT conditions
•Solve the Equations
SMO(Sequential minimal SMO(Sequential minimal
optimization) optimization)
-each iteration,SMO chooses only two ,and find optimal value, updates
the SVM to reflect new optimal value - 3 components to SMO
1. Analytic method to solve for two lagrange multiplier
2. Heuristic for choosing
3. A method for computing bias b
αi
Other Algorithms
• Subset selection algorithms
• Inverse problems.(suitable for specific settings)
• Convex combination and -norms.(different ways of measuring capacity and reduction to the linear programming)
• Semiparametric modeling.(different ways of controlling capacity and different classes).
l
1conclusion conclusion
• Linear operation in the feature space is equivalent to the nonlinear operation in the input space.
• key concepts of SVM
• optimization
• kernel trick
• There are still a lot of open issues in SVR.
Both time complexity & storage capacity problem are
• increasing as train data increase