
Binary Action Search

for Learning Continuous-Action Control Policies

Jason Pazis and Michail G. Lagoudakis

Intelligent Systems Laboratory

Department of Electronic and Computer Engineering
Technical University of Crete

Chania, Crete, Greece

The 26th International Conference on Machine Learning
June 14-18, 2009

Montreal, Canada

Partially supported by Marie Curie International Reintegration Grant MIRG-CT-2006-044980.


Introduction Binary Action Search Experiments Conclusion

Motivation: Discrete agents in a continuous world

Current Algorithms
Can easily handle continuous state spaces
Mostly handle discrete action spaces

Real-world problems
Many problems have continuous control variables

The problem
Current continuous-action approaches are often inefficient
Can we control continuous variables using discrete decisions?


Jason Pazis and Michail G. Lagoudakis, ICML 2009 Binary Action Search


Outline

1 Introduction

2 Binary Action Search

3 Experiments

4 Conclusion


Markov Decision Process

An MDP is a tuple µ = (S, A, P, R, γ, D)
S is the state space
A is the action space
P is the transition model: P(s′ | s, a)
R is the reward function: R(s, a)
γ ∈ (0, 1] is the discount factor
D is the initial state distribution

[Figure: interaction loop with the process µ, showing a_t, s_t, s_{t+1} and r_{t+1}]

Markov Property
Transitions and rewards depend only on the current state and action, not on earlier history


Planning

Optimization
Optimize the expected total discounted reward
$$E_{s \sim D;\ a_t \sim \pi;\ s_t \sim P}\left( \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s \right)$$

Policy π
A way of making decisions in all situations
Deterministic policy: a mapping from states to actions
There exists at least one deterministic optimal policy π*

Algorithms
Value iteration, policy iteration, linear programming
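
For a single sampled trajectory, the quantity inside the expectation is just a discounted sum of rewards. The sketch below computes it for a finite reward sequence; the function name `discounted_return` is made up for illustration, and averaging over many trajectories (the expectation itself) is left out.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite sequence."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme),
    # so each reward is discounted once per step that precedes it.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With rewards (1, 1, 1) and γ = 0.5 the sum is 1 + 0.5 + 0.25.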


Learning

Interaction
Transition model and reward function are unknown
Repeated interaction with an unknown process
Sample at time t: (s_t, a_t, r_t, s_{t+1})

Reinforcement Learning
Prediction: learn/predict the value of a fixed policy
Control: learn a good policy for controlling the process

Algorithms
Prediction: DUE, TD-Learning, LSTD, ...
Control: Q-Learning, Sarsa, LSPI, FQI, ...


The need for continuous actions in control

Benefits
Smoothness of motion
Power consumption
Mechanical stresses
Induced power line noise

Problems
An infinite number of choices at each step
Tabular approaches are not sufficient
Discrete maximization is not sufficient
Fine discretization is inefficient


Related work

Neural network approaches
Gaskett et al., AI 1999
Ströslin et al., ICANN 2003

Monte Carlo sampling
Lazaric et al., NIPS 2008
Sallans and Hinton, JMLR 2004

Single state-action approximator
Santamaria, Sutton, Ram, Adaptive Behavior 1998

Exploitation of temporal locality
Pazis and Lagoudakis, ADPRL 2009
Riedmiller, ESANN 1997


Binary Action Search

Choosing continuous actions
Choosing a continuous action value in a single step is hard!
How about breaking this hard decision into many easier ones?

Multi-step action choice
Given a continuous action value in some state ...
... decide whether it's better to increase it or decrease it!

Searching for an action
Need a discrete binary policy π : S × A → {Inc, Dec}
Perform N-step binary search over the action space
Successively approximate the single continuous action value
Anytime algorithm: more accurate action choice with larger N


Binary Action Search (s, π, N)
// s : The current state of the process
// π : A policy making binary decisions, +1 or −1
// N : The number of resolution bits
a ← (a_max + a_min)/2                        // Initialize a
∆ ← (a_max − a_min) · 2^(N−1) / (2^N − 1)    // Initialize ∆
for i = 1 to N do
    ∆ ← ∆/2                                  // update ∆
    e ← π(s, a)                              // binary decision (+1 or −1)
    a ← a + e∆                               // update a
end for
return a
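
The pseudocode above translates almost line for line into Python. The sketch below is one possible rendering; the explicit `a_min` and `a_max` arguments and the helper `toward_target` (a stand-in for a learned binary policy) are assumptions added for illustration.

```python
def binary_action_search(s, policy, n_bits, a_min, a_max):
    """Choose a continuous action through n_bits binary decisions.

    policy(s, a) must return +1 (increase) or -1 (decrease).
    """
    a = (a_max + a_min) / 2.0
    delta = (a_max - a_min) * 2 ** (n_bits - 1) / (2 ** n_bits - 1)
    for _ in range(n_bits):
        delta /= 2.0          # halve the step size
        e = policy(s, a)      # binary decision (+1 or -1)
        a += e * delta        # move the action value
    return a


def toward_target(target):
    """Hypothetical stand-in policy: always step toward a fixed target."""
    return lambda s, a: 1 if a < target else -1


if __name__ == "__main__":
    # Range [1, 8] with 3 bits: the search visits 4.5, 2.5, 1.5, then 2.0.
    print(binary_action_search(None, toward_target(2.0), 3, 1.0, 8.0))
```

With this initialization of ∆, the 2^N reachable actions are equally spaced and include both ends of the range.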


Learning Binary Policies

Requirements
continuous augmented state space (S, A)
discrete binary action space {Increase (+1), Decrease (−1)}
most reinforcement learning algorithms can be used

Learning Data
need to get sample transitions for the transformed MDP
for each actual transition sample with a continuous action ...
... generate N transition samples with discrete actions


A Simple Example

A simplified domain
Continuous action range [1.0, 8.0]
N = 3 (3-bit resolution)

Actual sample
(s, a = 2.0, r, s′)

Derived samples
(s, 4.5)  --[−1 (a = 2.5), reward 0]-->  (s, 2.5)
(s, 2.5)  --[−1 (a = 1.5), reward 0]-->  (s, 1.5)
(s, 1.5)  --[+1 (a = 2.0), reward r]-->  (s′, 4.5)
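
The expansion in this example can be sketched in a few lines of Python. The function name `derive_samples` and the tuple layout are illustrative assumptions; the sketch also assumes the executed action lies on the 2^N-point grid, so that the sequence of binary decisions leading to it is unambiguous.

```python
def derive_samples(s, a, r, s_next, n_bits, a_min, a_max):
    """Expand one continuous-action sample into n_bits binary-action samples.

    Each derived sample is
    ((state, action_value), decision, reward, (next_state, next_action_value)).
    """
    a_init = (a_max + a_min) / 2.0
    delta = (a_max - a_min) * 2 ** (n_bits - 1) / (2 ** n_bits - 1)
    current = a_init
    samples = []
    for i in range(n_bits):
        delta /= 2.0
        e = 1 if a > current else -1        # step toward the executed action
        nxt = current + e * delta
        if i == n_bits - 1:
            # Last decision: the real transition occurs, the reward r is
            # received, and the search restarts from the middle of the range.
            samples.append(((s, current), e, r, (s_next, a_init)))
        else:
            # Intermediate decision: the process state is unchanged, reward 0.
            samples.append(((s, current), e, 0.0, (s, nxt)))
        current = nxt
    return samples


if __name__ == "__main__":
    for sample in derive_samples("s", 2.0, -1.0, "s'", 3, 1.0, 8.0):
        print(sample)
```

Running it on the example reproduces the three derived samples listed above, with −1.0 standing in for the reward r.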


Properties

Optimality
Searching for the single best action: no local optima

Integration
BAS can be combined with most existing RL algorithms
The RL algorithm needs to support continuous state spaces

Transformation
Binary decisions over an augmented state space: (S, A)

Efficiency
Only a binary search policy needs to be learned


Inverted Pendulum

Balancing a pendulum at the upright position
States: vertical angle θ and angular velocity θ̇
Discrete actions: three actions [−50 N, 0 N, +50 N]
Continuous actions: 2^8 = 256 equally spaced in [−50 N, +50 N]
Uniform noise in [−10 N, +10 N] is added to all actions


Learning Setup
Training samples collected in advance from "random episodes"
Starting in a randomly perturbed state near equilibrium
Following a policy that made random decisions

Parameters
Reward function: $-\left((2\theta/\pi)^2 + \dot{\theta}^2 + (F/50)^2\right)$
|θ| > π/2 signals the end of episode and a reward of −1000
Discount factor γ = 0.95
Control interval dt = 100 msec
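
The reward and termination rule can be written as a small function. This is a sketch only: the name `pendulum_reward` is hypothetical, and the slides do not say whether θ and θ̇ are taken before or after the transition, so the caller decides which state to pass in.

```python
import math


def pendulum_reward(theta, theta_dot, force):
    """Return (reward, episode_done) for the inverted pendulum task."""
    if abs(theta) > math.pi / 2:
        # The pendulum has fallen: large penalty and end of episode.
        return -1000.0, True
    cost = (2.0 * theta / math.pi) ** 2 + theta_dot ** 2 + (force / 50.0) ** 2
    return -cost, False
```

Each of the three terms is normalized so that the angle and the force contribute at most 1 within their allowed ranges.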


Basis Functions

Augmented state vector s = (θ, θ̇, F)
Block of 28 basis functions for each discrete action
1 constant term and 27 radial basis functions (Gaussians)
Arranged in a 3 × 3 × 3 grid

$$\phi = \left(1,\ e^{-\frac{(\theta/n_\theta - \theta_1)^2 + (\dot\theta/n_{\dot\theta} - \dot\theta_1)^2 + (F/n_F - F_1)^2}{2\sigma^2}},\ \ldots,\ e^{-\frac{(\theta/n_\theta - \theta_3)^2 + (\dot\theta/n_{\dot\theta} - \dot\theta_3)^2 + (F/n_F - F_3)^2}{2\sigma^2}}\right)^{\top}$$

θ_i's, θ̇_i's and F_i's are in {−1, 0, +1}
n_θ = π/2, n_θ̇ = 2 and n_F = 50
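
A minimal sketch of how such a feature block could be computed is given below. The slides do not state the value of σ, so `sigma=1.0` is a placeholder assumption; the function names and the ordering of the two per-decision blocks (+1 first) are likewise illustrative choices.

```python
import itertools
import math

N_THETA, N_THETA_DOT, N_F = math.pi / 2, 2.0, 50.0          # normalizers
CENTERS = list(itertools.product((-1.0, 0.0, 1.0), repeat=3))  # 3 x 3 x 3 grid


def rbf_block(theta, theta_dot, force, sigma=1.0):
    """Return 28 features: 1 constant term followed by 27 Gaussians."""
    x = (theta / N_THETA, theta_dot / N_THETA_DOT, force / N_F)
    feats = [1.0]
    for center in CENTERS:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        feats.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats


def features(theta, theta_dot, force, decision, sigma=1.0):
    """Place the block in the slot of the binary decision (+1 or -1)."""
    block = rbf_block(theta, theta_dot, force, sigma)
    zeros = [0.0] * len(block)
    return block + zeros if decision == 1 else zeros + block
```

With one block per binary decision, the full feature vector has 56 entries, of which only the 28 belonging to the chosen decision are non-zero.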

Inverted Pendulum: Total accumulated reward

[Figure: reward (−1800 to 0) versus number of training episodes (0 to 1000). Curves: FQI-BAS, LSPI-BAS, LSPI-11action, FQI-11action, LSPI-Combined State-Action, LSPI-3action, FQI-5action, FQI-3action, FQI-Combined State-Action, LSPI-5action]

Inverted Pendulum: 10N (left) and 20N (right) noise

[Figure: two panels, 10 N noise (left) and 20 N noise (right); axes: Force (−50 to 50) and Number of steps (0 to 70)]

10 N noise, BAS mean force magnitude: 6.65 N
10 N noise, 3-action mean force magnitude: 17.91 N
20 N noise, BAS success rate: 99.64%
20 N noise, 3-action success rate: 39.49%


Double Integrator

Control a car moving on a one-dimensional flat terrain
States: position p and velocity v
Actions: Control the acceleration a
Linear dynamics: ṗ = v and v̇ = a

Setup
Reward function: −(p^2 + a^2)
Constraints: |p| ≤ 1, |v| ≤ 1 and |a| ≤ 1
−50 reward for constraint violation
Discount factor γ = 0.98
Control interval dt = 500 msec
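
One way to simulate this domain is sketched below. Several details are assumptions not stated on the slide: the acceleration is held constant over each control interval and the linear dynamics are integrated exactly, the reward uses the position before the step, the constraints are checked on the resulting state, and the acceleration passed in is assumed to already satisfy |a| ≤ 1.

```python
DT = 0.5  # control interval in seconds (500 msec)


def double_integrator_step(p, v, a):
    """Advance one control interval; return (p_next, v_next, reward, done)."""
    # Exact integration of p' = v, v' = a for constant a over DT.
    p_next = p + v * DT + 0.5 * a * DT ** 2
    v_next = v + a * DT
    if abs(p_next) > 1.0 or abs(v_next) > 1.0:
        # Constraint violation: fixed penalty and end of episode.
        return p_next, v_next, -50.0, True
    return p_next, v_next, -(p ** 2 + a ** 2), False
```

For instance, starting from p = 0.5, v = 0 with a = 1 moves the car to p = 0.625, v = 0.5 at a cost of 0.25 + 1.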


Basis Functions

Augmented state vector s = (p, v, a)
Simple polynomial approximator with 10 terms
$\phi = (1, p, v, a, p^2 a, v^2 a, a^2, pv, pa, va, a^2 p, a^2 v)^{\top}$

Double Integrator: Total accumulated reward

[Figure: reward (−50 to 0) versus number of training episodes (0 to 1000). Curves: BAS, Combined State-Action, 11action, 5action, 25action]


Strengths
Simplicity
Requires no tuning
Requires only 2 actions from the discrete policy
Achieves resolutions impossible to reach with discrete actions
Can be used in conjunction with most RL algorithms that support continuous state spaces
Can be used in an online, offline, on-policy or off-policy setting

Weaknesses
The state space of the problem is now more complex
More samples have to be processed by the learning algorithm (N derived samples per actual transition)


Future Work

Ongoing Research

High-dimensional action spaces

Increasing learning and execution efficiency

Future Research

Planning with BAS

Skewing functions over action range


Acknowledgments

Thank you for your attention!


Scalability

Binary Action Search
n continuous actions
2n discrete actions

Discrete controller
n action variables
m actions per variable
m^n actions


Least-Squares Policy Iteration

LSPI (D, φ, γ, ε)
// D : Source of samples (s, a, r, s′)
// φ : Basis functions
// γ : Discount factor
// ε : Stopping criterion
w′ ← 0
repeat
    w ← w′, A ← 0, b ← 0
    for each (s, a, r, s′) in D
        a′ = arg max_{a″ ∈ A} w^T φ(s′, a″)
        A ← A + φ(s, a) (φ(s, a) − γφ(s′, a′))^T
        b ← b + φ(s, a) r
    w′ ← A^(−1) b
until (‖w − w′‖ < ε)
return w
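
The loop above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the action set is passed explicitly as `actions`, a `max_iter` cap is added as a safeguard, and `np.linalg.lstsq` replaces the explicit inverse so that a singular matrix does not raise an error.

```python
import numpy as np


def lspi(samples, phi, actions, gamma, eps, max_iter=50):
    """Least-squares policy iteration over a fixed batch of samples.

    samples: list of (s, a, r, s_next); phi(s, a) returns a 1-D array.
    """
    k = len(phi(samples[0][0], samples[0][1]))
    w_new = np.zeros(k)
    for _ in range(max_iter):
        w = w_new
        A = np.zeros((k, k))
        b = np.zeros(k)
        for s, a, r, s_next in samples:
            # Greedy action at s_next under the current weights.
            a_next = max(actions, key=lambda x: float(w @ phi(s_next, x)))
            f = phi(s, a)
            A += np.outer(f, f - gamma * phi(s_next, a_next))
            b += f * r
        # Least-squares solve (assumption) instead of inverting A.
        w_new = np.linalg.lstsq(A, b, rcond=None)[0]
        if np.linalg.norm(w - w_new) < eps:
            break
    # Latest weights; on convergence they are within eps of w.
    return w_new
```

When combined with Binary Action Search, `actions` would be the two binary decisions and `s` the augmented state.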


Fitted Q Iteration

Fitted Q Iteration (D, γ, N)
// Input: samples D, discount factor γ, iterations N
// Output: learned value function Q_N
Initialize k ← 0, Q_k ← 0
repeat
    TS ← ∅
    for each (s, a, r, s′) in D
        TS ← TS ∪ {((s, a), r + γ max_{a′ ∈ A} Q_k(s′, a′))}
    Q_{k+1} ← use regression on TS
    k ← k + 1
until (k = N)
return Q_N
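
A compact Python rendering of this loop is sketched below. The regression step is left pluggable through a `fit` argument; the toy regressor `fit_table` (mean target per state-action pair) is an assumption chosen only to keep the example self-contained, and any supervised regressor could take its place.

```python
def fitted_q_iteration(samples, actions, gamma, n_iter, fit):
    """Run n_iter rounds of fitted Q iteration.

    samples: list of (s, a, r, s_next).
    fit(training_set) must return a callable q(s, a).
    """
    def q(s, a):          # Q_0 = 0 everywhere
        return 0.0

    for _ in range(n_iter):
        # Regression targets built from the current estimate Q_k.
        training_set = [
            ((s, a), r + gamma * max(q(s_next, b) for b in actions))
            for s, a, r, s_next in samples
        ]
        q = fit(training_set)   # Q_{k+1}
    return q


def fit_table(training_set):
    """Toy regressor (assumption): mean target per (s, a), 0 if unseen."""
    sums, counts = {}, {}
    for key, target in training_set:
        sums[key] = sums.get(key, 0.0) + target
        counts[key] = counts.get(key, 0) + 1
    table = {key: sums[key] / counts[key] for key in sums}
    return lambda s, a: table.get((s, a), 0.0)
```

With a lookup-table regressor the procedure reduces to sample-based value iteration; function approximators are what make it usable on continuous (augmented) state spaces.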
