Support Vector Machines for Regression

(1)

CS 3750 Advanced Machine Learning

Support Vector Machines for Regression

Min Chi [email protected] Intelligent system program

November 10, 2003

•Introduction

•Basic techniques in SVR

•Basic linear regression(separable case)

• Linear ε- Intensive Loss Algorithm(non- separable case)

• Primal Formulation

• Dual Formulation

• Nonlinear regression

•Kernel Formulation

•Some SVM algorithms

(2)

Support vector machine SVM Support vector machine SVM

•SVM maximize the margin around the separating hyperplane.

•The decision function is fully specified by a subset of the training data, the support vectors.

Introduction Introduction

• Transformation with -linear function

-nonlinear function-Kernel.

• Nonlinear become linear boundary in the transformed space.

• Aim: To find the optimal Kernel or linear

function and corresponding support vectors.

(3)

The regression Problem The regression Problem

•Regression=find a function that fits the observations.

• “Close point” may be wrong due to the noise only.

-Line should be influenced by the real data not the noise.

-Ignore the errors from those point which are close.

•Uniqueness: Every strictly convex constrained optimization problem has a unique solution.

) 1 ( for

) ( ) 1 ( ) ( )

(

x f x₁ f x₂ x x₁ x₂

f _λ

< λ + − λ

_λ

= λ + − λ

•Lagrange Function.

•Dual Objective Function

•Duality Gap

•Karush-Kuhn-Tucker (KKT) conditions.A set of primal and dual variables that is both feasible and satisfies the KKT conditions is the optimal solution.

Duality theory in the convex Duality theory in the convex optimization

optimization

^.

(4)

• Training data:

• Our goal is to find a function f(x) that as at most εdeviation from the actually obtained target for all the training data. At the same time as flat as possible.

• With a hyperplane( assuming linear model and the data can be separated!)

R y R x y

x y

x , ),..., ( _l, _l)}, ∈ ⁿ , ∈ {( ₁ ₁

b x w a

x

f ( , ) = 〈 , 〉 +

y

i

Basic Linear regression (separable Basic Linear regression (separable case)

case)

(2) y

, , y

subject to 2 1 minimize

i i

2

 



≤

− +

〉

〈

≤

−

〉

〈

−

ε ε b

x w

b x w w

i i

(1) ,

,

)

(x x b with R b R

f = 〈ω 〉+ ω∈ ⁿ ∈

Linear function f taking the form:

Flatness in the equation (1) means that one seeks small ω. Formally, we can write this problem as a convex optimization problem by:

All pair with All pair with

εεprecisionprecision(x_i,y_i)

Primal regression problem

(5)

Support Vector regression (

VapnikVapnik1995)1995) (non

(non--separable case)separable case)

• Before, we introduce the case that function f actually exists that approximates all data pairs with ε precision.

Sometimes, we may want to allow some errors.

Linear

Linear ε ε -support vector regression - support vector regression

Define if there is no error for . -minimize the error made outside of the tube.

-The parameter C to control the amount of influence of error. C balance the two competing goals.(error and

||w||)

= 0 ξ

i

x i

(3) 0

,

y ,

, y

subject to

) 2 (

1 minimize

*

* i

i

1 2 *







≥

+

≤

− +

〉

〈

+

≤

−

〉

〈

−

+

+ ∑

=

i i

i

i i

i l

i i i

b x w

C w

ξ ξ

ξ ε

ξ ε ξ ξ

(6)

ε-ε-intensive loss functionintensive loss function







−

= ≤

otherwise

for 0

ε ξ

ε ξ _ε ξ

The Optimization problem The Optimization problem

*

* 1

*

* 1

1 2 *

, , , : variable primal

(5) 0 , , , : Multiplier Lagrange

subject to

) (

) , (

) 2 (

: 1 max

: is problem the

of dual The

i i

i i i i l i

i i i i i

i i l

i i

i l

i i

l i

i i

b

a a b x y

a

b x y

a

C L

ξ ξ ω

η η

ξ η ξ η ω

ξ ε

ω ξ

ε

ξ ξ ω

≥ +

−

〉

〈

− + +

−

+

〉

〈 +

−

− + +

=

∑

=

Lagrangian function will help us to formulate the dual problem

(7)

(6) 0

) (

1

*

− =

∂ =

∂ ∑

= l

i

a

i

a

i

b L

(7) 0 )

(

1

*

− =

−

∂ =

∂ ∑

= i

l

i

a

i

a

i

x

L ω

ω

(8) 0

(*) (*)

(*)

= − − =

∂

i i

i

a

L C η

ξ

It follows from the saddle point condition that the partial derivations of L with respect to the primal variables has to varnish for optimality.

∑

=

−

+

〉

〈 +

−

+

〉

〈

−

+ +

〉

〈

=

l

i i i

l

i i i

l

i i

l

i i i

l

i i i

l

i i i

l

i i

l

i i

l

i i i

l

i i i

l

i i i

l

i i

l

i i

l

i i

b a x

a y

a a

a

b a x

a y

a a

a

C C

w w L

1

*

* 1

1

* 1

*

* 1

*

1 1

1

* 1

, , 2 ,

1 ξ η ξ

η

ω ξ

ε

ω ξ

ε

ξ ξ

calculation

(8)

∑

=

∑ − =

= =

∑ +

=

〉

〈

=

= =

−

= =

=

−

= =

=

− +

〉

〈

−

+

− +

−

+

−

− +

〉

〈

=

l

i

a a i i l

i

x a a w

w

i i

i

l

i

i i i l

i

i i )

a C

i i l

i i

) a C

i i l

i i

l

i i i

l

i i i i

i i

b a a x

a a

y a a a

a a

C

a C

w w L

1

) 0 ) ( (6), From ( 0

* 1

) ) ( (7), From ( ,

*

1

* 1

*

0 (8),

from ( 0

*

* 1

*

0 (8),

from ( 1 0

1

* 1

* (*) (*)

(*) (*)

) (

, ) (

) (

2 , 1

43 42 1 4

4 3 4

4 2 1

4 43 4

42 1

4 43 4

42 1

ω η

η

ω

ε η

ξ

η ξ

(9)

] , 0 [ ,

0 ) : (

subject to

) (

, ) (

) 2 (

- 1

) (

2 , - 1

* 1

*

* 1

*

1

* 1

*



 



∈

=

− +

−

〉

〈

−

=

+

− +

−

〉

〈

=

∑

=

C a

a

a a

y a a a

a

x x a a a a

y a a a

a w

w L

i i l

i i i

l

i i i i

l

i i i

j i j j l

i i i

l

i i i i

l

i i i

ε

(9)

Results.

From the equation (7)

i i l

i

( a

i

a

^*

) x

1

−

= ∑

=

ω

We can get:

(10) ,

) (

)

(

^*

1

b x x a a x

f

^l _i _i

i i

− 〈 〉 +

= ∑

=

How to computing b?

(12)

0 )

(

0 )

(

(11) 0 ) ,

(

0 ) ,

(

*

=

−

=

−

=

−

〉

〈

− + +

= +

〉

〈 +

−

i i

a C

b x y

a

b x y

a

ξ ξ

ω ξ

ε

ω ξ

ε

• At the KKT(Karush-Kuhn-Tucker) conditions: at the optimal solution the product between dual variables and constraints has to vanish. This means that the Lagrange multipliers will only be non-zero for points outside the ε band. Thus these points are the support vectors.In the SV case:

(10)

(13) ) , 0 ( for ,

) , 0 ( for ,

: follows as

computed be

can b Hence

* C

a x

y b

C a

x y

b

i i

i

i i

i

∈ +

〉

〈

−

=

∈

−

〉

〈

−

=

ε ω

How to computing b?

Firstly, only samples with corresponding a lie outside the the ε-insensitive tube around f.f

Secondly, there can never be a set of dual variables which are both simultaneously nonzero.

Finally, for we have .

) , ( x

_i

y

_i

0

*

=

i i

a a

_i

,

a_i*

a

) , 0 (

(*) C

a_i

∈

ξi^(*) ⁼ ⁰

C a_i^(*)

=

Nonlinear SVR Nonlinear SVR

•Key idea:transform xi to a higher dimension to make life easier.

-input space: the space xi are in.

-feature space: the space of Φ(xi) after transformation.

•Why transform?

- Linear operation in the feature space is equivalent to the nonlinear operation in the input space.

-make the non-separable problem separable.

(11)

(continue) ( continue)

Possible problems of the transformation:

•High computation burden SVM solved the problem:

•Kernel tricks for efficient computation.

Example Transformation Example Transformation

•Define the Kernel Function:

2 2 2 1 1 2

1 2

1 ) (1 x y x y )

y , y x

k( x  = + +



 



 



 





y ) , y x k( x

) y x y x (

y y y y y y

x x x x x x



 



 



 



= 

+ +

=

 〉



 

 Φ 



 

 Φ 

〈

 =



 

 Φ 

 =



 

 Φ 

2 1 2 1

2 2 2 1 1 2

1 2

1

2 1 2 2 2 1 2 1 2

1

2 1 2 2 2 1 2 1 2

1

1 y )

( y x ), ( x

) 2 , , , 2 , 2 , 1 ( y ) ( y

) 2 , , , 2 , 2 , 1 ( x ) ( x

•Consider the following transformation:

without by

computed be

can product inner

the

So k

(12)

Kernel Trick Kernel Trick

. ) ( ), ( )

: is (.) mapping the

and

funtion Kernel

e between th ip

relationsh The

〉 Φ Φ

〈

=

Φ

y x k(x,y

k

• This is known as the Kernel trick.

•We choose k instead of choosing Φ(.)

•K(x,y) need to satisfy a technical condition (Mercer Conditions) in order for Φ(.) to exist.

Which functions correspond to a dot product in the feature space?

) , ( x x

^'

k

Kernel Kernel

) ( ).

( )

,

(x_i x _j x_i x _j

K = Φ Φ

Hilbert Schmidt Theory:

is a symmetric function.

) , (

x_i x_j K

Mercer's conditions:

∞

<

≥ Φ Φ

=

∫

∫∫

∑

^∞

=

dx x f

dxdx x

f x f x x K

x x

a x

x

K _i _j

m m

j i

) (

, 0 )

( ) ( ) , (

if ) ( ).

( )

, (

2

' '

'

1

(13)

Linear combination of Kernels:

) , ( )

, ( :

) ,

(x x^' c₁k₁ x x^' c₂k₂ x x^'

k = +

Integral of Kernels:

dz z x s z x s x

x

k

( ,

^'

) : ⁼ ∫ ( , ) (

^'

, )

Smola, Scholkopf and Muller(1998):

0 )

( )

2 ( ) ](

[

if ) (

: ) , (

2 , ' '

≥

=

−

=

∫

⁻ ^〈 ^〉

− e k x dx

k F

x x k x

x k

x i d

π

ω

Burges(1999):

k ( x , x

^'

) : = k ( 〈 x , x

^'

〉 )

0 )

( )

(

, 0 )

(

, 0 )

(

2

≥

∂ +

∂

≥

∂

≥

ξ ξ

ξ ξ ξ

ξ ξ

ξ

k k

Non Non - - linear SVR algorithm linear SVR algorithm

) ( )

(

^*

1 i i

l

i

a

i

− a Φ x

= ∑

=

ω

Primal problem:

(3) 0

,

y )

( ,

) ( , y

subject to

) 2 (

1 minimize

*

* i

i

1 2 *







≥

+

≤

− +

〉 Φ

〈

+

≤

−

〉 Φ

〈

−

+

∑

=

i i

i

i i

i l

i i i

b x w

C w

ξ ξ

ξ ε

ξ ε ξ ξ

(14)

Parameters used in SV Parameters used in SV Regression

Regression

Non Non - - linear SVR algorithm linear SVR algorithm

(15) ) , ( ) (

) ( and ) ( ) (

: (10) th Similar wi .

) ( ), ( : ) , ( If

* 1

' '

b x x k a a x

f x

a a

x x x

x k

i i l

i i

i i l

i i − Φ = − +

=

〉 Φ Φ

〈

=

∑

= =

ω

]

, 0 [ ,

0 ) : (

subject to

) 14 ( )

( ) (

) , ( ) (

) 2 (

- 1 maximize

(9) th similar wi

* 1

*

* 1

*







∈

=

−







+

− +

−

=

∑

=

C a

a

a a

y a a a

a

x x k a a a a L

i i l

i i i

l

i i i i

l

i i i

j i j j l

i i i

ε

(15)

Cost Function Cost Function

Training data:

We assume that the training data has been drawn iid from some probability distribution p(x,y). Our goal is to find a function f(x) that minimizes a risk functional.

A possible approximation:

Add a capacity control term:

R y R x y

x y

x , ),..., ( _l, _l)}, ∈ ⁿ , ∈ {( ₁ ₁

= ∫ ( , , ( )) ( , ) ]

[ f c x y f x dp x y R

∑

=

= ^l

i i i i

emp c x y f x

f l R

1

)) ( , , 1 (

: ] [

constant.

regulation called

0

2 ] [ :

]

[ ²

>

+

=

λ

λ ω

f

R f

R_reg _emp

)) ( (

log ))

( , ,

( x y f x p y f x

c = − −

One hand, we want to avoid using a very complicated function c as this may lead to difficult optimization; on the other hand, we should use the cost function that suits data best. Under the assumption that the data were idd and generated by function plus addictive noise.

(16)

For the loss function:

(15) otherwise

) ) (

~(

) ( for )) 0

( , ,

( 





−

≤

= −

ε

ε x

f y c

x f x y

f y x c







−

= ≤

otherwise

for 0

ε ξ

ε ξ _ε ξ

Compare with

ε

−intensivelossfunction

•Extend the special choice to more general convex cost functions.

• Moreover, we might choose different cost functions and different values of for each sample. Similar with (3), we can get:

~

*

~ ,

i

c

c ε , ε

^*

(16) 0

,

y ,

, y

subject to

))

~( )

~( 2 (

1 minimize

*

* i

i

1 2 *







≥

+

≤

− +

〉

〈

+

≤

−

〉

〈

−

+

∑

=

i i

i

i i

i l

i

i i

b x w

c c

C w

ξ ξ

ξ ε

ξ ξ

By standard Lagrange multiplier technique, exactly the same manner as in the ε-intensive loss function.

(17)

(18)

0

} )

~(

| inf{

)]

~( ,

0 [

0 ) (

: subject to

(17) )

~( )

~( : ) (

) where (

))) ( ) ( ( ) (

) (

(

, ) (

) 2 (

- 1

1

* 1

*

* 1

*

* 1

*















≥

∂

=

∂

∈

=

−







∂

−

= Τ

−

=

Τ + Τ + +

−

+

〉

〈

−

=

∑

=

ξ

ξ ξ

ξ

ξ ξ ξ

ξ ξ

ω

ξ ξ

ε

ξ ξ

ξ

a c

C c C a

a a

c c

x a a

C a

a a

a y

x x a a a a L

l i

i i l i

i i i

i i

i i l

i

i i i

j i j j l

i

i i

Architecture

(18)

Algorithms

Algorithms- - Interior Point Algorithms. Interior Point Algorithms.

m

n A R b R

R l a c l

b A

a c q a

∈

≤

= +

−

, , ,

, ,

subject to

. ) 2 ( 1 minimize

component.

th i the denotes and

vector denote

variable _i

µ µ

α α

α

•Add slack variables:

0 , subject t

. ) 2 ( 1 minimize

≥

= +

=

= +

t g µ t l, a b a-g o Aα

a c q

α

•Write the Wolfe dual.

•Get the KKT conditions

•Solve the Equations

SMO(Sequential minimal SMO(Sequential minimal

optimization) optimization)

-each iteration,SMO chooses only two ,and find optimal value, updates

the SVM to reflect new optimal value - 3 components to SMO

1. Analytic method to solve for two lagrange multiplier

2. Heuristic for choosing

3. A method for computing bias b

αi

(19)

Other Algorithms

• Subset selection algorithms

• Inverse problems.(suitable for specific settings)

• Convex combination and -norms.(different ways of measuring capacity and reduction to the linear programming)

• Semiparametric modeling.(different ways of controlling capacity and different classes).

l

1

conclusion conclusion

• Linear operation in the feature space is equivalent to the nonlinear operation in the input space.

• key concepts of SVM

• optimization

• kernel trick

• There are still a lot of open issues in SVR.

Both time complexity & storage capacity problem are

• increasing as train data increase

(20)