• No results found

Optimal Control Theory and Linear Quadratic Regulators. Sham Kakade and Wen Sun CS 6789: Foundations of Reinforcement Learning

N/A
N/A
Protected

Academic year: 2021

Share "Optimal Control Theory and Linear Quadratic Regulators. Sham Kakade and Wen Sun CS 6789: Foundations of Reinforcement Learning"

Copied!
22
0
0

Loading.... (view fulltext now)

Full text

(1)

Optimal Control Theory and

Linear Quadratic Regulators

Sham Kakade and Wen Sun

(2)

Today

Recap:

LQR model + Ricatti equations


Today: LQRs

Infinite horizon model + SDP formulations

(3)
(4)

Robotics and Controls

Dexterous Robotic Hand Manipulation

(5)

Optimal Control

maps a state , a control (the action) , and a disturbance , 


to the next state , starting from an initial state . 


The objective is to find the control policy which minimizes the long term cost, 


where is the time horizon (which can be finite or infinite)

often solved by considering the linearized control (sub-)problem where


the dynamics are approximated by 


with the matrices and are derivatives of the dynamics (around some

trajectory) and where the costs are approximated by a quadratic function in and .

f

t

x

t

R

d

u

t

R

k

w

t

x

t+1

R

d

x

0

x

t+1

=

f

t

(x

t

,

u

t

,

w

t

)

π

minimize

E

π

[

H−1

t=0

c

t

(x

t

,

u

t

)

]

such that

x

t+1

=

f

t

(x

t

,

u

t

,

w

t

)

H

x

t+1

=

A

t

x

t

+

B

t

u

t

+

w

t

,

A

t

B

t

f

x

t

u

t
(6)

The Linear Quadratic Regulator (LQR)

(finite horizon case)

Let’s suppose this local approximation to a non-linear model is globally valid. 


(clearly false but this is an effective approach once when we ‘close’).


The finite horizon LQR problem is given by


where initial state is randomly distributed according ; 


the disturbance is multi-variate normal, with covariance ; 


and are referred to as system (or transition) matrices;


and are psd matrices that parameterize the quadratic costs. 


Note that this model is a finite horizon MDP, where the and .

minimize

E

[

x

H

Qx

H

+

H−1

t=0

(x

t

Qx

t

+

u

t

Ru

t

)

]

such that

x

t+1

=

A

t

x

t

+

B

t

u

t

+

w

t

,

x

0

D,

w

t

N(0,

σ

2

I

) ,

x

0

D

D

w

t

R

d

σ

2

I

A

t

R

d×d

B

t

R

d×k

Q

R

d×d

R

R

k×k

S

=

R

d

A

=

R

k
(7)

Same defs

(but for costs)

define the value function as

and the state-action value as:


V

hπ

:

R

d

R

V

hπ

(x) =

E

[

x

H

Qx

H

+

H−

1 t=h

(x

t

Qx

t

+

u

t

Ru

t

)

π

,

x

h

=

x

]

,

Q

hπ

:

R

d

×

R

k

R

Q

hπ

(x,

u) =

E

[

x

H

Qx

H

+

H−

1 t=h

(x

t

Qx

t

+

u

t

Ru

t

)

π

,

x

h

=

x,

u

h

=

u

]

,

(8)

Value Iteration and the Ricatti Equations

Theorem: (for the finite horizon case, with time homogenous )


The optimal policy is a linear controller specified by: 


where

where can be computed iteratively, in a backwards manner, using the following

algebraic Ricatti equations, where for ,


and where .


The above equation is simply the value iteration algorithm.

Furthermore, for , we have that:


A

t

=

A

,

B

t

=

B

π

(x

t

) =

K

t

x

t

K

t

= (B

P

t+1

B

+

R)

−1

B

P

t+1

A

P

t

t

[

H

]

P

t

=

A

P

t+1

A

+

Q

A

P

t+1

B(B

P

t+1

B

+

R)

−1

B

P

t+1

A

=

A

P

t+1

A

+

Q

(K

t+1

)

(B

P

t+1

B

+

R)K

t+1

P

H

=

Q

t

[

H

]

V

t

(x) =

x

P

t

x

+

σ

2

H h=t+1

Trace(P

h

)

(9)

Proof: optimal control at

h

=

H

1

Bellman equations there is an optimal policy which is deterministic + only a function of and .

Due to that , we have:


This is a quadratic function of . Solving for the optimal control at , gives:


where the last step uses that .

x

t

t

x

H

=

Ax

+

Bu

+

w

H1

Q

H−1

(x,

u) =

E

[

(Ax

+

Bu

+

w

H−1

)

Q(Ax

+

Bu

+

w

H−1

)

]

+

x

Qx

+

u

Ru

= (Ax

+

Bu)

Q(Ax

+

Bu) +

σ

2Trace

(Q) +

x

Qx

+

u

Ru

u

x

π

H−1

(x) =

(B

QB

+

R)

−1

B

QAx

=

K

H−1

x,

P

H

:=

Q

(10)

Proof: optimal value at

h

=

H

1

(shorthand ). using the optimal control at:


Continuing


where the fourth step uses our expression for .

K

H1

=

K

V

H−1

(x) =

Q

H−1

(x,

K

H−1

x)

=

x

(A

BK

)

Q(A

BK

)x

+

x

Qx

+

x

K

RKx

σ

2Trace

(Q)

V

H−1

(x)

σ

2Trace

(Q) =

x

(

(A

BK

)

Q(A

BK

) +

Q

+

K

RK

)

x

=

x

(

AQA

+

Q

2K

B

QA

+

K

(B

QB

+

R)K

)

x

=

x

(

AQA

+

Q

2K

(B

QB

+

R)K

+

K

(B

QB

+

R)K

)

x

=

x

(

AQA

+

Q

K

(B

QB

+

R)K

)

x

=

x

P

H−1

x

.

K

=

K

H1
(11)

Proof: wrapping up…

This implies that:


The remainder of the proof follows from a recursive argument, which can be verified along identical lines to the case.

Q

H−2

(x,

u) =

E[V

H−1

(Ax

+

Bu

+

w

H−2

)] +

x

Qx

+

u

Ru

= (Ax

+

Bu)

P

H−1

(Ax

+

Bu) +

σ

2

(

Trace

(P

H−1

) +

Trace

(Q)

)

+

x

Qx

+

u

Ru

.

(12)
(13)

The Linear Quadratic Regulator (LQR)

(infinite horizon case)

The infinite horizon LQR problem is given by


where and are time homogenous.


Studied often in theory, but less relevant in practice (?)


(largely due to that time homogenous, globally linear models are rarely good approximations)

Discounted case never studied.


(discounting doesn’t necessarily make costs finite)

Note that we can have ‘unbounded’ average cost.

minimize

lim

H→∞

1

H

E

[

H

t=0

(x

t

Qx

t

+

u

t

Ru

t

)

]

such that

x

t+1

=

Ax

t

+

Bu

t

+

w

t

,

x

0

D,

w

t

N(0,

σ

2

I

) .

A

B

(14)

Infinite horizon case

Theorem:


Suppose that the optimal average cost is finite.


Let be a solution to the following algebraic Riccati equation:


(Note that is a positive definite matrix).

P

P

=

A

T

PA

+

Q

A

T

PB(B

T

PB

+

R)

−1

B

T

PA

.

(15)

Infinite horizon case

Theorem:


Suppose that the optimal average cost is finite.


Let be a solution to the following algebraic Riccati equation:


(Note that is a positive definite matrix).

P

P

=

A

T

PA

+

Q

A

T

PB(B

T

PB

+

R)

−1

B

T

PA

.

P

We have that the optimal policy is: 


where the optimal control gain is:


We have that is unique and that the optimal average cost is .

π

(x) =

K

x

K* =

(B

T

PB

+

R)

−1

B

T

PA

(16)
(17)

The Primal SDP:

(for the infinite horizon LQR)

The primal optimization problem is given as:


where the optimization variable is . 


maximize

σ

2Trace

(P)

subject to

[

A

T

PA

+

Q

I A

PB

B

T

PA B

PB

+

R

]

0,

P

0

P

(18)

The Primal SDP:

(for the infinite horizon LQR)

The primal optimization problem is given as:


where the optimization variable is . 


maximize

σ

2Trace

(P)

subject to

[

A

T

PA

+

Q

I A

PB

B

T

PA B

PB

+

R

]

0,

P

0

P

This SDP has a unique solution, , which implies:

satisfies the Ricatti equations.

The optimal average cost of the infinite horizon LQR is

The optimal policy use the gain matrix: 


P

P

σ

2Trace

(P

)

(19)

The Primal SDP:

(for the infinite horizon LQR)

The primal optimization problem is given as:


where the optimization variable is . 


maximize

σ

2Trace

(P)

subject to

[

A

T

PA

+

Q

I A

PB

B

T

PA B

PB

+

R

]

0,

P

0

P

This SDP has a unique solution, , which implies:

satisfies the Ricatti equations.

The optimal average cost of the infinite horizon LQR is

The optimal policy use the gain matrix: 


P

P

σ

2Trace

(P

)

K* =

(B

T

PB

+

R)

−1

B

T

PA

Proof idea: Following from the Ricatti equation, 


we have the relaxation that for all matrices , the matrix must satisfy:

K

P

(20)

The Dual SDP:

The dual optimization problem is:


where the optimization variable is , a matrix, with the block structure:



 minimize Trace

(

Σ ⋅

[

Q

0

0

R

]

)

subject to

Σ

xx

= (A B)Σ(A B)

+

σ

2

I,

Σ ⪰

0

Σ

(

d

+

k

)

×

(

d

+

k

)

Σ

=

[

Σ

Σ

xx

Σ

xu ux

Σ

uu

]

(21)

The Dual SDP:

The dual optimization problem is:


where the optimization variable is , a matrix, with the block structure:



 minimize Trace

(

Σ ⋅

[

Q

0

0

R

]

)

subject to

Σ

xx

= (A B)Σ(A B)

+

σ

2

I,

Σ ⪰

0

Σ

(

d

+

k

)

×

(

d

+

k

)

Σ

=

[

Σ

Σ

xx

Σ

xu ux

Σ

uu

]

The interpretation of is that it is the covariance matrix of the stationary distribution. 

(22)

The Dual SDP:

The dual optimization problem is:


where the optimization variable is , a matrix, with the block structure:



 minimize Trace

(

Σ ⋅

[

Q

0

0

R

]

)

subject to

Σ

xx

= (A B)Σ(A B)

+

σ

2

I,

Σ ⪰

0

Σ

(

d

+

k

)

×

(

d

+

k

)

Σ

=

[

Σ

Σ

xx

Σ

xu ux

Σ

uu

]

The interpretation of is that it is the covariance matrix of the stationary distribution. 


This analogous to state-action visitation distributions (the dual variables in the MDP LP).

Σ

This SDP has a unique solution, say

Σ

⋆. The optimal gain matrix is then given by:


References

Related documents

The results for the Pulsed technique showed it was better than both the other techniques when it came to their weaknesses (relatively poor spatial awareness for Teleport and

Due to its combination of optoelectronic activity, high surface area, mesostructure, and controllable morphology, mp-TiO 2 has many potential applications, including

After you select a resolution by left clicking on it, go ahead and click the "Save" button at the bottom of the screen, this will save your current selected resolution and

Problem-solving gift strategy: Gerald might leave $5 million to a testamentary charitable lead annuity trust that pays AICR income for 10 or 15 years and

Na pitanje o načinima na koje bi odgojitelji mogli podizati ugled odgojiteljske struke odgovorili su: stavljanjem u isti rang s učiteljima, financiranjem iz ministarstva, a

the federal funds rate in the auto equation should be interpreted with caution because the funds rate has been near zero for much of this period. The effect of the federal funds

A majority of Mary Blair concept paintings made for Disney films and several personal artifacts in the exhibition are from the collection of the Walt Disney Family Foundation. This

The development of this cross-disciplinary problem-solving ability has increasingly become part of clinical legal education.' The pediatric medical-legal partnership, a