Binary Action Search for Learning Continuous-Action Control Policies

(1)

Binary Action Search

for Learning Continuous-Action Control Policies

Jason Pazis and Michail G. Lagoudakis

Intelligent Systems Laboratory

Department of Electronic and Computer Engineering Technical University of Crete

Chania, Crete, Greece

The 26th International Conference on Machine Learning June 14-18, 2009

Montreal, Canada

Partially supported by Marie Curie International Reintegration Grant MIRG-CT-2006-044980.

(2)

Introduction Binary Action Search Experiments Conclusion

Motivation: Discrete agents in a continuous world

Current Algorithms

Can easily handle continuous state spaces Mostly handle discrete action spaces

Real-world problems

Many problems have continuous control variables

The problem

Current continuous-action approaches are often inefficient Can we control continuous variables using discrete decisions?

(3)

Motivation: Discrete agents in a continuous world

Current Algorithms

Real-world problems

The problem

Jason Pazis and Michail G. Lagoudakis, ICML 2009 Binary Action Search

(4)

Motivation: Discrete agents in a continuous world

Current Algorithms

Real-world problems

The problem

(5)

Outline

1 Introduction

2 Binary Action Search

3 Experiments

4 Conclusion

(6)

Markov Decision Process Planning

Learning Continuous actions

Outline

1 Introduction

3 Experiments

4 Conclusion

(7)

Markov Decision Process

Markov Decision Process MDP µ = (S, A, P, R, γ, D)

S is the state space A is the action space

P is the transition model: P(s⁰|s, a) R is the reward function: R(s, a) γ ∈ (0, 1] is the discount factor D is the initial state distribution

µ

a_t st s_t+1

rt+1

Markov Property

Transitions and rewards are independent of history

(8)

Planning

Optimization

Optimize the expected total discounted reward

E_{s∼D ;}_a_t∼?; st∼P

∞

X

t=0

γ^trt

s0 = s

!

Policy π

A way of making decisions in all situations

Deterministic policy: a mapping from states to actions There exists at least one deterministic optimal policy π^∗

Algorithms

Value iteration, policy iteration, linear programming

(9)

Planning

Optimization

E_{s∼D ;}_a_t∼?; st∼P

∞

X

t=0

γ^trt

s0 = s

!

Policy π

Algorithms

(10)

Planning

Optimization

E_{s∼D ;}_a_t∼?; st∼P

∞

X

t=0

γ^trt

s0 = s

!

Policy π

Algorithms

(11)

Learning

Interaction

Transition model and reward function are unknown Repeated interaction with an unknown process Sample at time t: (s_t, a_t, r_t, s_t+1)

Reinforcement Learning

Prediction: learn/predict the value of a fixed policy Control : learn a good policy for controlling the process

Algorithms

Prediction: DUE, TD-Learning, LSTD, . . . Control: Q-Learning, Sarsa, LSPI, FQI, . . .

(12)

Learning

Interaction

Algorithms

(13)

Learning

Interaction

Algorithms

(14)

The need for continuous actions in control

Benefits

Smoothness of motion Power consumption Mechanical stresses Induced power line noise

Problems

An infinite number of choices at each step Tabular approaches are not sufficient Discrete maximization is not sufficient Fine discretization is inefficient

(15)

Related work

Neural network approaches

Gaskett et al., AI 1999 Str¨oslin et al., ICANN 2003

Monte Carlo sampling

Lazaric et al., NIPS 2008 Sallans and Hinton, JMLR 2004

Single state-action approximator

Santamaria, Sutton, Ram, Adaptive Behavior 1998

Exploitation of temporal locality

Pazis and Lagoudakis, ADPRL 2009 Riedmiller, ESANN 1997

(16)

Binary Action Search Properties

Outline

1 Introduction

3 Experiments

4 Conclusion

(17)

Binary Action Search

Choosing continuous actions

Choosing a continuous action value in a single step is hard!

How about breaking this hard decision into many easier ones?

Multi-step action choice

Given a continuous action value in some state ...

... decide whether it’s better to increase it or decrease it!

Searching for an action

Need a discrete binary policy π : S × A 7→ {Inc, Dec} Perform N-step binary search over the action space

Successively approximate the single continuous action value Anytime algorithm: more accurate action choice with larger N

(18)

Binary Action Search

Need a discrete binary policy π : S × A 7→ {Inc, Dec} Perform N-step binary search over the action space

(19)

Binary Action Search

Need a discrete binary policy π : S × A 7→ {Inc, Dec}

Perform N-step binary search over the action space

(20)

Binary Action Search (s, π, N)

// s : The current state of the process

// π : A policy making binary decisions, +1 or −1 // N : The number of resolution bits

a ← (a_max+ a_min)/2 // Initialize a

∆ ← (a_max− amin)2^N−1/(2^N− 1) // Initialize ∆

for i = 1 to N do

∆ ← ∆/2 // update ∆

e ← π(s, a) // binary decision (+1 or −1)

a ← a + e∆ // update a

end for return a

(21)

Learning Binary Policies

Requirements

continuous augmented state space (S, A)

discrete binary action space {Increase (+1), Decrease (−1)}

most reinforcement learning algorithms can be used

Learning Data

need to get sample transitions for the transformed MDP for each actual transition sample with a continuous action ... ... generate N transition samples with discrete actions

(22)

Learning Binary Policies

Requirements

continuous augmented state space (S, A)

discrete binary action space {Increase (+1), Decrease (−1)}

most reinforcement learning algorithms can be used

Learning Data

need to get sample transitions for the transformed MDP for each actual transition sample with a continuous action ...

... generate N transition samples with discrete actions

(23)

A Simple Example

A simplified domain

Continuous action range [1.0, 8.0]

N = 3 (3-bit resolution)

Actual sample

s a = 2.0 s⁰

r

Derived samples

s, 4.5 −1(a = 2.5) s, 2.5 s, 1.5 s⁰, 4.5 0

−1 (a = 1.5)

0

+1^{(a = 2.0)} r

(24)

A Simple Example

A simplified domain

Actual sample

s a = 2.0 s⁰

r

Derived samples

s, 4.5 −1(a = 2.5) s, 2.5 s, 1.5 s⁰, 4.5 0

−1 (a = 1.5)

0

+1^{(a = 2.0)} r

(25)

A Simple Example

A simplified domain

Actual sample

s a = 2.0 s⁰

r

Derived samples

s, 4.5 −1(a = 2.5) s, 2.5 s, 1.5 s⁰, 4.5 0

−1 (a = 1.5)

0

+1^{(a = 2.0)} r

(26)

Properties

Optimality

Searching for the single best action - no local optima

Integration

BAS can be combined with most existing RL algorithms The RL algorithm needs to support continuous state spaces

Transformation

Binary decisions over an augmented state space: (S, A)

Efficiency

Needs only learning a binary search policy

(27)

Properties

Optimality

Integration

Transformation

Efficiency

(28)

Properties

Optimality

Integration

Transformation

Efficiency

(29)

Properties

Optimality

Integration

Transformation

Efficiency

(30)

Inverted Pendulum Double Integrator

Outline

1 Introduction

3 Experiments

4 Conclusion

(31)

Inverted Pendulum

Balancing a pendulum at the upright position States: vertical angle θ and angular velocity ˙θ Discrete actions: three actions [−50 N, 0 N, +50 N]

Continuous actions: 2⁸ equally spaced in [−50 N, +50 N]

Uniform noise in [−10N, +10N] is added to all actions

(32)

Inverted Pendulum

Learning Setup

Training samples collected in advance from “random episodes”

Starting in a randomly perturbed state near equilibrium Following a policy that made random decisions

Parameters

Reward function: −((2θ/π)²+ ( ˙θ)²+ (F /50)²)

|θ| > π/2 signals the end of episode and a reward of −1000 Discount factor γ = 0.95

Control interval dt = 100 msec

(33)

Basis Functions

Augmented state vector s = (θ, ˙θ, F )

Block of 28 basis functions for each discrete action 1 constant term and 27 radial basis functions (Gaussians) Arranged in a 3 × 3 × 3 grid

φ =

1 , e⁻

√

(θ/nθ−θ1)2+( ˙θ/nθ˙− ˙θ1)2+(F /nF −F1)2

2σ2 ,

· · · , e⁻

√

(θ/nθ−θ3)2+( ˙θ/nθ˙− ˙θ3)2+(F /nF −F3)2 2σ2

>

,

θ_i’s, ˙θ_i’s and F_i’s are in {−1, 0, +1}

nθ= π/2, nθ˙= 2 and nF = 50

(34)

Inverted Pendulum: Total accumulated reward

0 100 200 300 400 500 600 700 800 900 1000

−1800

−1600

−1400

−1200

−1000

−800

−600

−400

−200 0

Number of training episodes

Reward

FQI−BAS LSPI−BAS

LSPI−11action

FQI−11action

LSPI−Combined State−Action

LSPI−3action FQI−5action

FQI−3action FQI−Combined State−Action

LSPI−5action

(35)

Inverted Pendulum: 10N (left) and 20N (right) noise

−50 −40 −30 −20 −10 0 10 20 30 40 50

0 10 20 30 40 50 60 70

Force

Number of steps

−50 −40 −30 −20 −10 0 10 20 30 40 50

0 10 20 30 40 50 60 70

Force

Number of steps

10N noise BAS mean force magnitude: 6.65N 10N noise 3-action mean force magnitude: 17.91N 20N noise BAS success rate: 99.64%

20N noise 3-action success rate: 39.49%

(36)

Double Integrator

Control a car moving on a one-dimensional flat terrain States: position p and velocity v

Actions: Control the acceleration a Linear dynamics: ˙p = v and ˙v = a

Setup

Reward function: −(p²+ a²)

Constraints: |p| ≤ 1, |v | ≤ 1 and |a| ≤ 1

−50 reward for constraint violation Discount factor γ = 0.98

Control interval dt = 500 msec

(37)

Double Integrator

Basis Functions

Augmented state vector s = (p, v , a)

Simple polynomial approximator with 10 terms

φ = (1, p, v , a, p²a, v²a, a², pv , pa, va, a²p, a²v )^>

(38)

Double Integrator: Total accumulated reward

0 100 200 300 400 500 600 700 800 900 1000

−50

−45

−40

−35

−30

−25

−20

−15

−10

−5 0

Number of training episodes

Reward

BAS Combined State−Action

11action

5action 25action

(39)

Strengths and Weaknesses Future Work

Acknowledgments

Outline

1 Introduction

3 Experiments

4 Conclusion

(40)

Acknowledgments

Strengths Simplicity

Requires no tuning

Requires only 2 actions from the discrete policy

Achieves resolutions impossible to reach with discrete actions Can be used in conjunction with any RL algorithm

Can be used in an online, offline, on-policy or off-policy setting

Weaknesses

The state space of the problem is now more complex

More samples have to be processed by the learning algorithm

(41)

Acknowledgments

Strengths Simplicity

Requires no tuning

Requires only 2 actions from the discrete policy

Achieves resolutions impossible to reach with discrete actions Can be used in conjunction with any RL algorithm

Can be used in an online, offline, on-policy or off-policy setting

Weaknesses

The state space of the problem is now more complex

More samples have to be processed by the learning algorithm

(42)

Acknowledgments

Future Work

Ongoing Research

High-dimensional action spaces

Increasing learning and execution efficiency

Future Research

Planning with BAS

Skewing functions over action range

(43)

Acknowledgments

Thank you for your attention!

(44)

Acknowledgments

Scalability

Binary Action Search n continuous actions 2ⁿ discrete actions

Discrete controller n action variables m actions per variable mⁿactions

(45)

Acknowledgments

Least-Squares Policy Iteration

LSPI (D, φ, γ, )

// D : Source of samples (s, a, r , s⁰) // φ : Basis functions

// γ : Discount factor // : Stopping criterion w⁰← 0

repeat

w ← w⁰, A ← 0, b ← 0 for each (s, a, r , s⁰) in D

a⁰= arg maxa⁰⁰∈Aw^>φ(s⁰, a⁰⁰) A ← A + φ(s, a)“

φ(s, a) − γφ(s⁰, a⁰)”>

b ← b + φ(s, a)r w⁰← A⁻¹b until (kw − w⁰k < ) return w

(46)

Acknowledgments

Fitted Q Iteration

Fitted Q Iteration (D, γ, N)

// Input: samples D, discount factor γ, iterations N // Output: learned value function Q^N

Initialize k ← 0, Q^k← 0 repeat

TS ← ∅ for each (s, a, r , s⁰) in D

TS ← TS ∪˘`(s, a), r + γ max_a0∈AQ^k(s⁰, a⁰)´¯

Q^k+1← use regression on TS k ← k + 1

until (k = N)

return QN