Linear Bayesian Reinforcement Learning

(1)

Linear Bayesian Reinforcement Learning

Nikolaos Tziortziotis

University of Ioannina

Christos Dimitrakakis

EPFL

Konstantinos Blekas

University of Ioannina

Abstract

This paper proposes a simple linear Bayesian ap-proach to reinforcement learning. We show that with an appropriate basis, a Bayesian linear Gaus-sian model is sufficient for accurately estimating the system dynamics, and in particular when we allow for correlated noise. Policies are estimated by first sampling a transition model from the cur-rent posterior, and then performing approximate dynamic programming on the sampled model. This form of approximate Thompson sampling results in good exploration in unknown environments. The approach can also be seen as a Bayesian general-isation of least-squares policy iteration, where the empirical transition matrix is replaced with a sam-ple from the posterior.

1 Introduction

Reinforcement learning is the problem of learning how to act in an unknown environment solely by interaction. The agent’s goal is to find a policy for selecting actions that maximises its expected utility. More specifically, we consider a discrete-time setting, such that at discrete-timetthe agent observes a reward

rt∈R, while its utility is the random quantity:

U =

∞ X

t=0

γtrt, (1.1)

whereγ∈(0,1)is a discount factor. The expected utility of policyπ∈Π for environmentµ∈ Mis denoted byEµ,πU. However, since the environmentµis unknown to the agent, optimising with respect toπis not possible.

We consider reinforcement learning problems where the underlying environment is a Markov decision process (MDP). In this setting at each time steptthe agent observes the envi-ronment statest∈ S, as well as the rewardrt. The agent then takes an actionat ∈ A, before observing a new state-reward pair (st+1, rt+1). In MDPs, the environment dynamics are

Markovian. Consequently, for a given MDPµ∈ Mwe have:

Pµ(st+1∈S|st=s, at=a) =Tµ(S|s, a) (1.2) whereTµ is a conditional measure on the space of statesS. That is,Tµ(S|s, a)is the probability that the next state is in

then setSwhen we take actionafrom statesin the MDPµ. We assume that the reward is a deterministic function of the state and actionrt+1=Rµ(st, at).

This paper focuses on Bayesian methods for solving the re-inforcement learning problem (see [Vlassiset al., 2012] for an overview). This is a decision-theoretic approach [DeG-root, 1970], with two key ideas. The first is to select an appropriate prior distributionξabout the unknown environ-ment, such thatξ(µ)represents our subjective belief thatµis the true environment. The second is to replace the expected utility over the real environment, which is unknown, with the expected utility under the subjective beliefξ, i.e.

Eξ,πU = Z

M

(_Eµ,πU) dξ(µ). (1.3) Formally, it is then possible to optimise with respect to the policy which maximises the expected utility over all possible environments, according to our belief. However, our future observations will alter our future beliefs according to Bayes’ theorem. In particular the posterior mass placed on a set of MDPsB⊂ Mgiven a historyhtcomposed of a sequence of states,st₌_{s1, . . . , s}_{t, actions}_at−1₌_{a1, . . . , a}

t−1, is: ξ(B|ht), R B Qt k=1 d dνTµ(st+1|at, st) dξ(µ) R M Qt k=1 d dνTµ(st+1|at, st) dξ(µ) , (1.4)

where _dd_νTµdenotes the Radon-Nikodym derivative with re-spect to some measureν on S.1 _{Consequently, the} Bayes-optimal policy must take into account all potential future be-lief changes. For that reason, it will not in general be Marko-vian with respect to the states, but will depend on the com-plete history.

Most previous work on Bayesian reinforcement learning in continuous environments has focused on Gaussian process models for estimation. However, these suffer from two limita-tion. Firstly, they have significant computational complexity. Secondly, each dimension of the predicted state distribution is modelled independently. In this paper, we investigate the use of Bayesian inference under the assumption that the dy-namics are (perhaps under a transformation) linear. Then the modelling problem becomes multivariate Bayesian linear re-gression, for which we can calculate (1.4) efficiently online.

1

(2)

An other novelty of our approach in this context is that we do not simply use the common heuristic of acting as though the most likely or expected model is correct. Instead, gener-ate asample modelfrom the posterior distribution. We then draw trajectories from the sampled model and collect simu-lated data which we use to obtain a policy. The policy is then executed in the real environment. This form ofThompson samplingis known to be a very efficient exploration method in bandit and discrete problems. We also show its efficacy for continuous domains.

The remainder of this paper is organised as follows. Sec-tion 2 gives an overview of related work and our contribuSec-tion. Section 3 formally introduces our approach, with a descrip-tion of the inference model in Sec. 3.1, and the policy se-lection method used in Sec. 3.2. Finally, the details of the online and off-line versions of our algorithms are detailed in Sec. 3.3. Experimental results are presented in Sec. 4 and we conclude with a discussion of future directions in Sec. 5.

2 Related work and our contribution

As mentioned in the introduction, Bayesian reinforcement learning models the reinforcement learning problem as a decision-theoretic problem by placing a prior distributionξ

on the set of possible MDPsM. However, the exact solution of the decision problem is generally intractable as the trans-formed problem becomes a Markov decision processes with exponentially many states [Duff, 2002].

One of the first and most interesting approaches for approx-imate Bayesian reinforcement learning is Thompson pling, which is also used in this paper. The idea is to sam-ple a model from the posterior distribution, calculate the optimal policy for the sampled model, and then follow it for some period [Strens, 2000]. Thompson sampling has been recently shown to perform very well both in theory and practice in bandit problems [Kaufmanna et al., 2012; Agrawal and Goyal, 2012]. Extensions and related models include Bayesian sparse sampling [Wanget al., 2005], which uses Thompson sampling to deal with node expansion in the tree search. Taking multiple samples from the posterior can be used to estimate upper and lower bounds on the Bayes-optimal value function, which can then be used for tree search algorithms [Dimitrakakis, 2008; 2010]. Multiple samples can also be used to create an augmented optimistic model [As-muth et al., 2009; Castro and Precup, 2010]; or they can be used to construct a better lower bound [Dimitrakakis, 2011]. Finally, multiple samples can also be combined via voting schemes [Doshi-Velez, 2009]. Other approaches at-tempt to build optimistic models without sampling. For ex-ample [Kolter and Ng, 2009] adds an exploration bonus to rewards, while [Arayaet al., 2012] uses optimistic transition functions by constructing an augmented MDP in a Bayesian analogue of UCRL [Jackshet al., 2010].

For continuous state spaces, most Bayesian approaches have focused on Gaussian process (GP) models [Rasmussen and Kuss, 2004; Jung and Stone, 2010; Engelet al., 2005; Reisingeret al., 2008; Deisenrothet al., 2009]. There are two key ideas that set our method apart from this work. Firstly, GP models are typically employed independently for each

state feature. In contrast, the model we use deals naturally with correlated state features – consequently, less data may be necessary. Secondly, we do not calculate policies using the expected transition dynamics of the environment, as this is known to have potentially bad effects [Poupartet al., 2006; Dimitrakakis, 2011]. Instead, value functions and policies are calculated bysamplingfrom the posterior distribution of environments. This also important for efficient exploration.

This paper proposes a linear model-based Bayesian frame-work for reinforcement learning, for arbitrary state spacesS and for discrete action spacesAusing Thompson sampling. First, we define a prior distribution onlinear dynamical mod-els, using a suitably chosenbasis. Bayesian inference in this model isfully closed form, so that given a set of example tra-jectories it is easy to sample a model from the posterior dis-tribution. For each such sample, we estimate the optimal pol-icy. Since closed-form calculation of the optimal policy is not possible for general cost functions even with linear dynamics, we use approximate dynamic programming (ADP, see [Bert-sekas, 2005] for an overview) with trajectories drawn from the sampled model. The resulting policy can then be applied to the real environment.

We experimented with two different ADP approaches for finding a policy for a given sampled MDP. Both are approx-imate policy iteration (API) schemes, using a set of trajecto-ries generated from the sampled MDP to estimate a sequence of value functions and policies. For the policy evaluation step, we experimented with fitted value iteration (FVI) [Ernstet al., 2005] and least-square temporal differences (LSTD) [Bradtke and Barto, 1996].

In the case where we use LSTD, the approach can be seen as an online, Bayesian generalisation of least-squares policy iteration (LSPI) [Lagoudakis and Parr, 2003]. Instead of forming LSTDQ on the empirical transition matrix, we per-form a least-squares fit on a sample model drawn from the posterior. This fit can be very accurate by drawing a large amount of simulated trajectories in the sampled model.

We consider two applications of this approach. In the of-flinecase, data is collected using a uniformly random policy. We then generate a model from the posterior distribution and calculate a policy for it, which is then evaluated in the real en-vironment. In theonlinecase, data is collected using policies generated from the sampled models, as in Thompson sam-pling. At the beginning each episode, a model is sampled from the posterior and the resulting policy is executed in the real environment. Thus, there is no separate data collection and evaluation phase. Our results show that this approach successfully finds optimal policies quickly and consistently both in the online and in the offline case, and that it has a significant overall advantage over LSPI.

3 Linear Bayesian reinforcement learning

The model presented in this paper uses Bayesian inference to estimate the environment dynamics. The assumption is that, with a suitable basis, these dynamics are linear with Gaussian noise. Unlike approaches using Gaussian pro-cesses, however, the next-state distribution is not modelled using a product distribution, i.e. we do not assume that

(3)

the various components of the state are independent. In a further innovation, rather than using the expected posterior parameters, we employ sampling from the posterior distri-bution. For each sampled model, we then obtain an ap-proximately optimal policy by using approximate dynamic programming, which is then executed in the real environ-ment. This form of Thompson sampling [Strens, 2000; Thompson, 1933] allows us to perform efficient exploration, with the policies naturally becoming greedier as the posterior distribution converges.

3.1 The predictive model

In our model we assume that, for a state setSthere exists a mappingf :S → X to ak-dimensional vector spaceXsuch that the transformed state at timetisxt ,f(st). The next statest+1is given by the output of a functiong:X × A → S

of the transformed state, the action and some additive noise:

st+1=g(xt, at) +εt. (3.1) In this paper, we model the noise εtand the function g as a multivariate linear-Gaussian model. This is parametrised via a set of k × k design matrices {Ai|i∈ A }, such that g(xt, at) = Aatxt and a set of covariance matrices {Vi|i∈ A }for the noise. Then, the next state distribution is:

st+1|xt=x, at=i∼N(Aix,Vi). (3.2) In order to model our uncertainty with a (subjective) prior distributionξ, we have to specify the model structure. In our model, we do not assume independence between the output dimensions, something which could potentially make infer-ence difficult. Fortunately, in this particular case, a conjugate prior exists in the form of thematrix-normal distributionfor

Aand theinverse-Wishartdistribution forV. GivenVi, the distribution forAi is matrix-normal, while the marginal dis-tribution ofViis inverse-Wishart. More specifically,

Ai |Vi=Vb ∼φ(Ai|M,C,Vb) (3.3)

Vi ∼ψ(Vi|W, n), (3.4) whereφiis the prior distribution on dynamics matrices condi-tional on the covariance and two prior parameters:M, which is the prior mean andCwhich is the prior output (dependent variable) covariance. Finally,ψis the marginal prior on co-variance matrices, which has an inverse-Wishart distribution withW andn. More precisely, the distributions are:

φ(Ai|M,C,Vb)∝e− 1 2tr[(Ai−M)>Vi−1(Ai−M)C]_, _(3.5) ψ(Vi|W, n)∝ |V−1W/2|n/2e− 1 2tr(V −1_W₎ . (3.6) Essentially, the model is an extension of the univariate Bayesian linear regression model (see for example [DeGroot, 1970]) to the multivariate case via vectorisation of the mean matrix. Since the prior is conjugate, it is relatively simple to calculate posterior values of the parameters after each ob-servation. While we omit the details, a full description of inference using this model is given in [Minka, 2001].

Throughout this text, we shall employ ξt = (φt, ψt) to denote our posterior distributions at timet, withξtreferring to our complete posterior. The remaining problem is how to estimate the Bayes-expected utility of the current policy and how to perform policy improvement.

3.2 Policy evaluation and optimisation

In the Bayesian setting, policy evaluation and optimisation are not trivial. The most common method used is the ex-pected MDP heuristic, where policies are evaluated or opti-mised on the expected MDP. However, this ignores the shape of the posterior distribution. Alternatively, policies can be evaluated via Monte Carlo sampling, but then optimisation becomes hard. A good heuristic that does not ignore the com-plete posterior distribution and for which it is easy to calcu-late a policy, called Thompson sampling, is the one we shall actually employ in this paper. The following paragraphs give a quick overview of each method.

Expected MDP A naive way to estimate the expected util-ity of a policy is to first calculate the expected (or most proba-ble) dynamics, and then use either an exact or an approximate dynamic programming algorithm. This may very well be a good idea if the posterior distribution is sharply concentrated around the mean, since then:

Eξ,πU ≈Eµξ,πU, µξ ,Eξµ. (3.7) whereµξ is the expected MDP model.2 However, as pointed out in [Arayaet al., 2012; Dimitrakakis, 2011] this approach may give completely incorrect results,

Monte Carlo sampling An alternative method is to take a number of samplesµifrom the current posterior distribution and then calculate the expected utility of each, i.e.

Eξ,πU = 1 K K X i=1 Eµi,πU+O(K −1/2₎_, _µ i∼ξt. (3.8) This form of Monte Carlo sampling gives much more accu-rate results, at the expense of some additional computation. However, finding an optimal policy over a set of sampled MDPs is difficult even for restricted classes of policies [Dim-itrakakis, 2011]. Nevertheless, Monte Carlo sampling can also be used to obtain stochastic upper and lower bounds on the value function, which can be used to improve the policy search [Dimitrakakis, 2010; 2008].

Thompson sampling An interesting special case is when we only sample a single MDP, i.e. when we perform Monte Carlo sampling with K = 1. Then it is relatively easy to calculate the optimal policy for this sample. This method, which we employ in this work, is calledThompson sampling, and was first used in the context of reinforcement learning by [Strens, 2000]. The idea is to sample an MDP from the current posterior and then calculate a policy that is optimal with respect to that MDP. We then execute this policy in the environment. The major advantage of Thompson sampling is that it is known to result in a very efficient form of exploration (see for example [Agrawal and Goyal, 2012] for recent results on bandit problems).

2

Similar problems exist when using the most probable MDP in-stead.

(4)

3.3 Algorithm overview

We can now put everything together for the complete linear Bayesian reinforcement learning (LBRL) algorithm. The al-gorithm has four steps. Firstly, sampling a model from the posterior distribution. Secondly, using the sampled model to calculate a new policy. Finally, executing this policy in the real environment. In the online version of the algorithm the data obtained by executing this policy is then used to calcu-late a new posterior distribution.

Sampling from the posterior Our posterior distribution at timetisξt = (φt, ψt), withψtbeing the marginal posterior on covariance matrices, andφtbeing the posterior on design matrices (conditional on the covariance). In order to sample a model from the posterior, we first draw a covariance ma-trix Vi using (3.4) for every action i ∈ A, and then plug those into (3.3) to generate a set of design matricesAi. The first step requires sampling from the inverse-Wishart distribu-tion (which can be done efficiently using the algorithm sug-gested by [Smith and Hocking, 1972]), and the second from the matrix-normal distribution.

ADP on the sampled MDP Given an MDPµsampled from our posterior belief, we can calculate a nearly-optimal pol-icyπusing approximate dynamic programming (ADP) onµ. This can be done with a number of algorithms. Herein, we investigated two approximate policy iteration (API) schemes, using either fitted value iteration (FVI) or least-squares tem-poral differences (LSTD) for the policy evaluation step. Both of these algorithms require sample trajectories from the envi-ronment. This is fortunately very easy to achieve, since we can use the sampled model to generate any number of trajec-tories arbitrarily. Consequently, we can always have enough simulateddata to perform a good fit with FVI or LSTD.3We note here that API using LSTD additionally requires a gen-erative model for the policy improvement step. Happily, we can use the sampled MDPµfor that purpose.4

Algorithm 1LBRL: Linear Bayesian reinforcement learning InputBasisf, ADP parametersP, priorξ0

forepisodekdo

µ(k)∼ξtk(µ)// generate MDP from posterior

π(k)₌_ADP₍_µ(k)_{, P}₎_{// Get new policy}

fort=tk, . . . , tk+1−1do

at|st=s∼π(k)(a|s)// Take action

ξt+1(µ) =ξt(µ|st+1, at, st)// Update posterior end for

end for

Offline LBRL In the offline version of the algorithm, we simply collect a set of trajectories from auniformly random

3

These use no data collected in the real environment.

4

In preliminary experiments, we also investigated the use of fit-ted Q-iteration and LSPI, but found that these had inferior perfor-mance.

policy, comprising a historyhtof lengtht. Then, we sample an MDP from the posteriorξt(µ) =ξ0(µ|ht)and calculate the optimal policy for the sample using ADP. This policy is thenevaluatedon the real environment.

Online LBRL In the online version of the algorithm, shown in Alg. 1 we collect samples using ourown generated policies. We begin with some initial beliefξ0 = (φ0, ψ0) and a uniformly random policyπ(0)_{. This policy is executed}

until either the episode ends naturally or due to reaching a time-limitT. At thek-th episode, which starts at timetk, we sample a new MDPµ(k) _∼ _ξ

tk from our current posterior

ξtk(·) =ξ(· |htk)and then calculate a near-optimal station-ary policy forµ(k)_:

π(k)≈arg max π Eµ

(k),πU,

such thatπ(k)₍_a _| _s₎_{is a conditional distribution on actions} a ∈ Agiven statess ∈ S. This policy is then executed in the real environment until the end of the episode and the data collected are used to calculate the new posterior. As the calcu-lation of posterior parameters is fully incremental, we incur no additional computational cost for running this algorithm online.

4 Experiments

We conducted two sets of experiments to analyze both the of-fline and the online performance of the various algorithms. Comparisons have been made with the well-known least square policy iteration (LSPI) algorithm [Lagoudakis and Parr, 2003] for the offline case, as well as an online variant [Bus¸oniuet al., 2010] for the online case. We used prelim-inary runs and guidance from the literature to select the fea-tures for the LSTDQ algorithm used in the inner loop of LSPI. The source for all the experiments can be found in [Dimi-trakakiset al., ].

We employed the same features for the ADP algorithms used in LBRL. However, the basis used for the Bayesian re-gression model in LBRL was simply f(s) _, [s,1]>. In preliminary experiments, we found this sufficient for a high-quality approximation. After that, we use API to find a good policy for a sampled MDP, where we experimented with reg-ularised FVI and LSTD for the policy evaluation step, adding a regularisation factor10−2_I_{. In both cases, we drew}

sin-gle step transitions from a set of3000uniformly drawn states from the sampled model.

For the offline performance evaluation, we first drew roll-outs fromk={50,100, . . . ,1000}states drawn from the en-vironment’s starting distribution, using a uniformly random policy. The maximum horizon of each rollout was set equal to40. The collected data was then fed to each algorithm in or-der to produce a policy. This policy was evaluated over1000 rollouts on the environment.

In the online case, we simply use the last policy calculated by each algorithm at the end of the last episode, so there is no separate learning and evaluation phase. This means that effi-cient exploration must be performed. For LBRL, this is done using Thompson sampling. For online-LSPI, we followed the

(5)

200 400 600 800 1,000 50 100 150 200 250 300

Number of Training Episodes (Rollouts)

# Steps Mountain car LBRL-FVI LBRL-LSTD LSPI 200 400 600 800 1,000 0 1,000 2,000 3,000

Number of Training Episodes (Rollouts) Pendulum

LBRL-FVI LBRL-LSTD LSPI

Figure 1: Offline performance comparison between LBRL-FVI, LBRL-LSTD and LSPI. The error bars show 95% confidence intervals, while the shaded regions show 90% percentiles over 100 runs.

approach of [Bus¸oniuet al., 2010], who adopts an-greedy exploration scheme with an exponentially decaying schedule

t=td, with0 = 1. In preliminary experiments, we found

d = 0.9968to be a reasonable compromise. We compared the algorithms online for1000episodes.

4.1 Inverted pendulum

The first set of experiments includes theinverted pendulum domain, which tries to balance a pendulum by applying forces of a mixed magnitude (50 Newtons). The state space consists of two continuous variables, the vertical angle (θ) and the an-gular velocity (θ˙) of the pendulum. The agent has at his arse-nal three actions: no force, left force or right force. A zero re-ward is received at each time step except in the case where the pendulum falls (|θ| ≤ π/2). In this case, a negative (-1) re-ward is given and a new rollout begins. Each rollout starts by setting the pendulum in a perturbed state close to the equilib-rium point. More information about the environment dynam-ics can be found at [Lagoudakis and Parr, 2003]. Each rollout is allowed to run for3000steps at maximum. Additionally, the discount factor is set to0.95. For FVI/LSTD and LSPI, we used an equidistant3×3grid of RBFs over the state space fol-lowing the suggestions of [Lagoudakis and Parr, 2003], which was replicated for each action for the LSTDQ algorithm used in LSPI.

4.2 Mountain car

In the second experimental set, we have used themountain carenvironment. Two continuous variables characterise the vehicle state in the domain, its position (p) and its velocity (u). The objective in this task is to drive an underpowered vehicle up a steep road from a randomly selected position to the right hilltop (p≥0.5) with at most1000steps. In order to achieve our goal, we can select between three actions: for-ward, reverse and zero throttle. The received reward is−1 except in the case where the target is reached (zero reward).

At the beginning of each rollout, the vehicle is positioned to a new state, with the position and the velocity uniformly ran-domly selected. The discount factor is equal to0.999. An equidistant4×4 grid of RBFs over the state space plus a constant term is selected for FVI/LSTD and LSPI.

4.3 Results

In our results, we show the average performance in terms of number of steps of each method, averaged over 100 runs. For each average, we also plot the 95% confidence interval for the accuracy of the mean estimate with error bars. In addition, we show the 90% percentile region of the runs, in order to indicate inter-run variability in performance.

Figure 1 shows the results of the experiments in theoffline case. For themountain car, it is clear that the most stable approach is LBRL-LSTD, while LSPI is the most unstable. Nevertheless, on average the performance of LBRL-LSTD and LSPI is similar, while LBRL-FVI is slightly worse. For the pendulum domain, the performance of LBRL remains quite good, with LBRL-LSTD being the most stable. While LSPI manages to find the optimal policy frequently, neverthe-less around 5% of its runs fail.5

Figure 2 shows the results of the experiments in the on-line case. For the mountain car, both LSPI and LBRL-LSTD managed to find an excellent policy in the vast ma-jority of runs. In thependulumdomain, we see that LBRL-LSTD significantly outperforms LSPI. In particular, after 80 episodes all more than 90% of the runs are optimal, while many LSPI runs fail to find a good solution even after hun-dreds of episode. The mean difference is somewhat less spec-tacular, though still significant.

The success of LBRL-LSTD over LSPI can be attributed to a number of reasons. Firstly, it could be themore efficient

5

We note that the results presented in [Lagoudakis and Parr, 2003] for LSPI are slightly better, but remain significantly below the LBRL results.

(6)

0 200 400 600 800 1,000 50 100 150 200 250 300

Number of Training Episodes (Rollouts)

# Steps Mountain car LBRL-LSTD LSPl 0 200 400 600 800 1,000 0 1,000 2,000 3,000

Number of Training Episodes (Rollouts) Pendulum

LBRL-LSTD LSPI

Figure 2: Online performance comparison between LBRL-LSTD and LSPI. The error bars show 95% confidence intervals, while the shaded regions show 90% percentiles over 100 runs.

exploration. Indeed, in the mountain car domain, where the starting state distribution is uniform, we can see that LBRL and LSPI have very similar performance. Another possible reason is that LBRL also makesbetter use of the data, since it uses it to calculate the posterior distribution over MDP dy-namics. It is then possible to perform very accurate ADP using simulated data from a model sampled from the poste-rior. This is supported by the offline results in the pendulum domain.

5 Conclusion

We presented a simple linear Bayesian approach to reinforce-ment learning in continuous domains. Unlike Gaussian pro-cess models, by using a linear-Gaussian model, we have the potential to scale up to real world problems which Bayesian reinforcement learning usually fails to solve with a reasonable amount of data. In addition, this model easily takes into ac-count correlations in the state features, further reducing sam-ple comsam-plexity. We solve the problem of computing a good policy in continuous domains with uncertain dynamics by us-ing Thompson samplus-ing. This not much more expensive than computing the expected MDP and forces a natural exploration behaviour.

In practice, the algorithm is at least as good as LSPI in offline mode, while being considerably more stable overall. When LBRL is used to perform online exploration, we find that the algorithm very quickly converges to a near-optimal policy and is extremely stable. Experimentally, it would be interesting to compare LBRL with standard GP methods that employ the expected MDP heuristic.

Thompson sampling could be used with other Bayesian models for continuous state spaces. A natural extension would thus be to move to a non-parametric model, e.g. re-place the multivariate linear model with a multivariate Gaus-sian process. The major hurdle would be the computational cost. Consequently, in future work we would like to use a

re-cently proposed methods for efficient Gaussian processes in the multivariate case, such as [Alvarezet al., 2011] that uses convolution processes. Other Bayesian schemes for multi-variate regression analysis [Mehmet, 2012] may be applica-ble as well.

Finally, it would be highly interesting to consider other ex-ploration methods. One example is the Monte-Carlo exten-sion of Thompson sampling used in [Dimitrakakis, 2011], which can also be used for continuous state spaces. Other approaches, such as teh optimistic transition MDP used in [Arayaet al., 2012] may not be so straightforward to adopt to the continuous case. Nevertheless, while these approaches may be costly computationally, we believe that they will be beneficial in terms of performance.

Acknowledgements

We wish to thank the anonymous reviewers for their excel-lent comments and suggestions. This work was partially sup-ported by the Marie Curie Project ESDEMUU, Grant Number 237816 and by an ERASMUS exchange grant.

References

[Agrawal and Goyal, 2012] S. Agrawal and N. Goyal. Anal-ysis of Thompson sampling for the multi-armed bandit problem. InCOLT 2012, 2012.

[Alvarezet al., 2011] M. Alvarez, D. Luengo-Garcia, M. Titsias, and N. Lawrence. Efficient multioutput gaussian processes through variational inducing kernels. 2011.

[Arayaet al., 2012] M. Araya, V. Thomas, O. Buffet, et al. Near-optimal BRL using optimistic local transitions. In ICML, 2012.

[Asmuthet al., 2009] J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach

(7)

to exploration in reinforcement learning. In UAI 2009, 2009.

[Bertsekas, 2005] D. Bertsekas. Dynamic programming and suboptimal control: From ADP to MPC. Fundamental Issues in Control, European Journal of Control, 11(4-5), 2005. From 2005 CDC, Seville, Spain.

[Bradtke and Barto, 1996] S.J. Bradtke and A.G. Barto. Lin-ear least-squares algorithms for temporal difference lLin-earn- learn-ing. Machine Learning, 22(1):33–57, 1996.

[Bus¸oniuet al., 2010] L. Bus¸oniu, D. Ernst, B. De Schutter, and R. Babuˇska. Online least-squares policy iteration for reinforcement learning control. InProceedings of the 2010 American Control Conference, pages 486–491, 2010. [Castro and Precup, 2010] P. Castro and D. Precup. Smarter

sampling in model-based Bayesian reinforcement learn-ing. Machine Learning and Knowledge Discovery in Databases, pages 200–214, 2010.

[DeGroot, 1970] M. H. DeGroot. Optimal Statistical Deci-sions. John Wiley & Sons, 1970.

[Deisenrothet al., 2009] M.P. Deisenroth, C.E. Rasmussen, and J. Peters. Gaussian process dynamic programming. Neurocomputing, 72(7-9):1508–1524, 2009.

[Dimitrakakiset al., ] C. Dimitrakakis, N. Tziortziotis, and A. Tossou. Beliefbox: A framework for statistical meth-ods in sequential decision making. http://code. google.com/p/beliefbox/.

[Dimitrakakis, 2008] C. Dimitrakakis. Tree exploration for Bayesian RL exploration. InComputational Intelligence for Modelling, Control and Automation, International Conference on, pages 1029–1034, Wien, Austria, 2008. IEEE Computer Society.

[Dimitrakakis, 2010] C. Dimitrakakis. Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning. In ICAART 2010, pages 259–264. Springer, 2010.

[Dimitrakakis, 2011] C. Dimitrakakis. Robust bayesian re-inforcement learning through tight lower bounds. In Euro-pean Workshop on Reinforcement Learning (EWRL 2011), number 7188 in LNCS, pages 177–188, 2011.

[Doshi-Velez, 2009] Finale Doshi-Velez. The infinite par-tially observable Markov decision process. InAdvances in Neural Information Processing Systems 21, Cambridge, MA, 2009. MIT Press.

[Duff, 2002] M. O. Duff. Optimal Learning Computa-tional Procedures for Bayes-adaptive Markov Decision Processes. PhD thesis, University of Massachusetts at Amherst, 2002.

[Engelet al., 2005] Y. Engel, S. Mannor, and R. Meir. Rein-forcement learning with gaussian process. InInternational Conference on Machine Learning, pages 201–208, 2005. [Ernstet al., 2005] D. Ernst, P. Geurts, and L. Wehenkel.

Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

[Jackshet al., 2010] T. Jacksh, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning.Journal of Machine Learning Research, 11:1563–1600, 2010. [Jung and Stone, 2010] T. Jung and P. Stone. Gaussian

pro-cesses for sample-efficient reinforcement learning with RMAX-like exploration. In ECML/PKDD 2010, pages 601–616, 2010.

[Kaufmannaet al., 2012] E. Kaufmanna, N. Korda, and R. Munos. Thompson sampling: An optimal finite time analysis. InALT-2012, 2012.

[Kolter and Ng, 2009] J. Z. Kolter and A. Y. Ng. Near-Bayesian exploration in polynomial time. InICML 2009, 2009.

[Lagoudakis and Parr, 2003] M.G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.

[Mehmet, 2012] G. Mehmet. A bayesian multiple kernel learning framework for single and multiple output regres-sion. InProceedings of the 20th European Conference on Artificial Intelligence (ECAI), pages 354–359, 2012. [Minka, 2001] T. P. Minka. Bayesian linear regression.

Tech-nical report, Microsoft research, 2001.

[Poupartet al., 2006] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian rein-forcement learning. InICML 2006, pages 697–704. ACM Press New York, NY, USA, 2006.

[Rasmussen and Kuss, 2004] C.E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Ad-vances in Neural Information Processing Systems 16, pages 751–759, 2004.

[Reisingeret al., 2008] J. Reisinger, P. Stone, and R. Mi-ikkulainen. Online kernel selection for bayesian reinforce-ment learning. In International Conference on Machine Learning, pages 816–823, 2008.

[Smith and Hocking, 1972] WB Smith and RR Hocking. Wishart variates generator, algorithm as 53.Applied Statis-tics, 21:341–345, 1972.

[Strens, 2000] M. Strens. A Bayesian framework for rein-forcement learning. InICML 2000, pages 943–950, 2000. [Thompson, 1933] W.R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of two Samples. Biometrika, 25(3-4):285–294, 1933.

[Vlassiset al., 2012] N. Vlassis, M. Ghavamzadeh, S. Man-nor, and P. Poupart. Reinforcement Learning, chap-ter Bayesian Reinforcement Learning, pages 359–386. Springer, 2012.

[Wanget al., 2005] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line re-ward optimization. In ICML ’05, pages 956–963, New York, NY, USA, 2005. ACM.

Linear Bayesian Reinforcement Learning