Deep deterministic policy gradient is an actor-critic algorithm that provides an improvement to DQN discussed before by making it more tractable to continuous actions where both the ac- tor and critic functions can be approximated by two separate neural networks with parameter vectorsθπandθQrespectively. As the name suggests, the algorithm updates the parameters of the actor network using deterministic policy gradient theorem [34]. Each training step modi- fies the policy in such a way that its outputs are pushed in the direction of the positive gradient of the action-value function. Especially for continuous actions, this strategy is very effective, as it directly pushes the generated actions towards the assumed best action with respect to the action-value estimations. ∇θπJ(πθ)=Es∼ρπ £ ∇θππ¡s|θπ¢∇aQ¡s,a|θQ¢|a=π(s|θπ) ¤ (3.12) Like DQN, it was observed that directly updating the parameters of the actor and critic net- works using temporal difference as in equation (3.5), leads to divergence of the learning al- gorithm. Thus, same concepts of target networks with parameter vectorsθπ0 andθQ0 for both
the actor and the critic respectively and experience replay are used in DDPG algorithms as well. However, in contrast to DQN, the target networks are updated after each gradient step to slowly replicate the changes made to the trained networks.
θQ0
=τθQ+(1−τ)θQ0 θπ0
=τθπ+(1−τ)θπ0 (3.13) This improves the stability of the learning algorithm [6]. The parameterτ determines how quicklyθQ0 andθπ0 trackθQandθπ. Values ofτclose to one result in fast yet unstable learn- ing, whereas small values ofτ result in slow yet stable learning. To encourage exploration, a stochastic policy is still used as behavioural policy to generate the training samples, which yields an off-policy training algorithm.
The complete DDPG algorithm is shown below Algorithm 4Deep Deterministic Policy Gradient
Initialize a replay bufferDwith sizeN.
Initialize critic networkQ(s,a;θQ) and actor networkπ(s,a;θπ) with random weightsθQand θπrespectively.
Initialize target networksQ0(s,a;θQ0
) andπ0(s,a;θπ0) withθQ0
=θQ,θπ0=θπ
for allepisodedo Initializes1
Initialize random processN
repeat
for each steptin an episodedo
Select an actionat∈Athroughat=π(st;θπ)+N.
Execute actionat, observe rewardrt+1and new statest+1. Store transition〈st,at,rt,st+1〉in replay bufferD.
Sample random mini-batch of transitions〈si,ai,ri,si+1〉fromD.
Update the critic network by performing a gradient descent on (yi−Q(si,ai|θQ))2.
Update actor network using (3.12). Update the target networks using
θQ0 =τθQ+(1−τ)θQ0 θπ0=τθπ+(1−τ)θπ0 end for untilterminated end for
Here it should be pointed out that the noise used in the previous algorithm is action space noise since the noise is added directly to the output of the actor network.
Figure 3.5:Architecture of DDPG where the policyπis trained by backpropagating the q-gradient with respect to actiona
3.5 Summary
In this chapter, it was discussed how to apply deep learning to reinforcement learning problems while guaranteeing training stability. This is done by introducing experience replay and target network concepts. Additionally, the deep deterministic policy gradient algorithm within the off-policy actor-critic framework is discussed upon which the proposed approach in this thesis is built.
4 Grid Mapping with Rao-Blackwellized Particle Filters
In this chapter, a grid-based SLAM with Rao-Blackwellized particle filters introduced in [35] is discussed in order to provide an accurate estimate of the robot’s pose and a partial map of the environment that can be utilized later in the reward shaping of the reinforcement learning in an attempt to speed-up the learning rate.4.1 A Brief Introduction to SLAM Problem
Building maps is one of the fundamental tasks of mobile robots. In literature, the mobile robot mapping problem is often referred to as the simultaneous localization and mapping (SLAM) problem [36]. SLAM depicts the process of a robot creating a map of an unknown environment while simultaneously estimating its own position within the self-created map. The main chal- lenge of the SLAM problem is the cyclic dependency between both tasks, namely, localization and mapping, i.e., for localization, a consistent map of the environment is required and for acquiring the map, a robot needs a good estimate of its location. SLAM is considered to be one of the most important functionalities a mobile robot must posses in order to become truly autonomous [37]. In order to introduce the basic concepts and notations, the classical probabilistic framework will be used.
The pose of the robot which is to be estimated at a certain time instance is represented by the state vectorxt where the dimension ofxt is problem dependent. In case of a planar mobile
robot, the robot under study,xt∈R3since it consists of the 2D coordinates and the heading of
the robotxt =(x,y,θ)T. The second variable to be estimated is the map of the environment
mthat can take different forms depending on the chosen map representation which can be feature-based, grid-based, volumetric or topological. It is often assumed that the map is static and, therefore, does not evolve with time. Additionally, the robot is equipped with multiple information sources which provide the required perception inputs. The information that is provided to the robot can be classified into two classes, namely idiothetic and allothetic [38]. Idiothetic information is related to internal cues and is regarded as the control inputut.
This internal information can be retrieved from velocity commands, wheel encoders, inertial measurement unit (IMU), etc. On the other hand, the allothetic information is related to the external cues which can be represented as a landmark in the environment that the robot can observe and deduce its pose relative to it. These measurements are denoted by vectorzt.
The robot then seeks to calculate the conditional probability given in equation (4.1) that esti- mates the trajectory of the robotx1:t and the mapmbased on both idiothetic and allothetic
information given byu1:t−1andz1:trespectively.
p(x1:t,m|z1:t,u1:t−1) (4.1) There are different methods to solve the conditional probability which are, in general, based on Bayes filters that is considered the classical and popular technique used for SLAM.