Reward Function Definition - Towards Continuous Control for Mobile Robot Navigation: A Reinforc

The reward function is the most important aspect of a reinforcement learning problem, since actions are selected so that the cumulative reward is maximized. The reward signal is the means by which the goal of learning is specified for the agent. It is a designed, task-specific function that, given the action of the agent and the state of the system, returns a single real number indicating how good or bad that action was. The reward signal corresponds to pleasure and pain in biological systems. Designing a good reward signal for a robotic reinforcement learning task can be challenging in several ways. This area of reinforcement learning, known as reward design or reward shaping, is considered an art rather than a well-established science [18]. Moreover, the purpose of the reward function is not to tell the learning algorithm how to achieve a certain goal but rather what to achieve. The learning algorithm should then use this reward to discover the actions needed to achieve the ultimate goal, without those actions being hard-coded.

For the navigation problem, the reward function can be as simple as a bonus when the agent reaches the target and a penalty when it hits an obstacle. This sparse reward prioritizes actions that lead the agent to the goal and penalizes actions that cause a collision. Alternatively, the reward can be more sophisticated and depend on the distance between the agent and the target. This is called a dense reward. In the autonomous navigation problem, the advantage of a dense reward over a sparse reward can be formulated intuitively as follows. With a dense reward, if a sequence of the robot's actions results in a positive reward, the parameters of the policy network are updated so that this sequence of actions becomes more probable in the future.
With a sparse reward, however, the robot keeps taking random actions until, by chance, it receives a non-zero reward. The disadvantage is that, since non-zero rewards are seen so rarely (only when the robot reaches the target), the sequence of actions that led to the reward might be very long. More importantly, it is not clear which of these actions were actually useful in obtaining the reward. In the reinforcement learning context, this problem is known as credit assignment.
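The contrast between the two reward types can be made concrete with a small sketch. The function names, the tolerance, the bonus value, and the straight-line trajectory below are illustrative assumptions, and the collision penalty is left out for brevity; the dense variant simply uses the negative distance to the goal as one possible distance-dependent signal.

```python
import math

def sparse_reward(position, goal, tolerance=0.1, bonus=1.0):
    """Bonus only when the agent is within `tolerance` of the goal (values are assumptions)."""
    return bonus if math.dist(position, goal) < tolerance else 0.0

def dense_reward(position, goal):
    """Negative distance to the goal, so every step carries a learning signal."""
    return -math.dist(position, goal)

# Hypothetical trajectory that drives straight to the goal.
goal = (5.0, 0.0)
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (5.0, 0.0)]

sparse = [sparse_reward(p, goal) for p in trajectory]
dense = [dense_reward(p, goal) for p in trajectory]

print(sparse)  # non-zero only at the final step
print(dense)   # increases at every step as the agent approaches the goal
```

In the sparse case only the last step is rewarded, which illustrates why it is hard to tell which of the earlier actions deserved credit.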

This section first discusses how a reward function is designed for the autonomous navigation problem for mobile robots. It then presents a novel idea for shaping reward functions: the reward is shaped using knowledge about the environment acquired online, provided in the form of an occupancy grid map that the robot builds during training.

5.2.1 Designing the Reward Function

In this section, two different reward functions are defined to assess the quality of the robot's performance while it interacts with its environment.

5.2.1.1 Exponential Euclidean Distance

The goal is to move the robot towards a defined goal position. The agent receives a penalty that grows with the exponential of the Euclidean distance between its current position and the goal position. The Euclidean distance between the robot and the target is evaluated as

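$$d = \left\| p_t^{x,y} - g \right\|_2 = \sqrt{\left(p_t^{y} - g^{y}\right)^2 + \left(p_t^{x} - g^{x}\right)^2} \tag{5.5}$$

where $p_t$ represents the current position of the robot at time $t$ with respect to the inertial frame and $g$ is the goal position.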

Then, the reward based on the exponential function of the Euclidean distance is evaluated as follows

$$r_{exp} = 1 - e^{\gamma d} \tag{5.6}$$

where $\gamma$ represents the decay rate of the exponent. In this way, all the states close to the goal receive much higher rewards than the ones far away. In addition, a sparse reward is added if the agent reaches the target position within a predefined tolerance. On the other hand, the agent receives a large negative reward (a penalty) when it gets too close to an obstacle. Note that the episode is terminated in three different scenarios: i) the agent reaches the goal within some tolerance $d_{min}$, ii) the agent gets closer to an obstacle than a minimum threshold, iii) the agent exceeds the maximum number of allowed time steps $T$ per episode without either reaching the target or hitting an obstacle. The maximum number of time steps per episode is a hyperparameter that is tuned based on the average number of actions the agent needed to reach the goal in the preliminary experiments. The reward $r(s_t, a_t)$ is given after executing every navigation action $a_t$ and can be formulated mathematically as:

$$r(s_t, a_t) = \begin{cases} r_{reached}, & d < d_{min}, \\ r_{crashed}, & s_{ts}, \\ 1 - e^{\gamma d}, & \text{otherwise.} \end{cases} \tag{5.7}$$

where $d$ is the Euclidean distance between the agent and the target, $\gamma$ is a hyperparameter that can be tuned, and $s_{ts}$ represents an undesirable terminal state, which includes getting too close to an obstacle or exceeding the maximum number of steps allowed in an episode.

5.2.1.2 Difference in Distance in Two Consecutive Time Steps

The second reward function, $r(s_t, a_t)$ in equation (5.8), is based on the difference in the distance between the agent and the target in two consecutive time steps, $d_{t-1} - d_t$. This means that the reward is positive when the agent moves towards the target and negative otherwise. To motivate the robot to move towards the target, this term is multiplied by a hyperparameter, the scaling factor $\lambda_g$. This distance-based reward can be formulated as

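$$r(s_t, a_t) = \begin{cases} r_{reached}, & d < d_{min}, \\ r_{crashed}, & s_{ts}, \\ \lambda_g \left( \left\| p_{t-1}^{x,y} - g \right\|_2 - \left\| p_t^{x,y} - g \right\|_2 \right), & \text{otherwise.} \end{cases} \tag{5.9}$$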
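The two piecewise rewards in equations (5.7) and (5.9) share the same terminal cases and differ only in the last branch. The following is a minimal sketch of both, not the implementation used in this work: the function names, the boolean flag `undesired_terminal` (standing in for the state $s_{ts}$), and all numeric defaults are assumptions chosen for illustration.

```python
import math

def distance(position, goal):
    """Euclidean distance of equation (5.5) for 2D (x, y) tuples."""
    return math.hypot(position[0] - goal[0], position[1] - goal[1])

def exponential_reward(position, goal, undesired_terminal,
                       gamma=0.5, d_min=0.2,
                       r_reached=100.0, r_crashed=-100.0):
    """Equation (5.7); all default values are illustrative assumptions."""
    d = distance(position, goal)
    if d < d_min:
        return r_reached
    if undesired_terminal:  # too close to an obstacle, or step limit exceeded
        return r_crashed
    return 1.0 - math.exp(gamma * d)

def progress_reward(prev_position, position, goal, undesired_terminal,
                    lambda_g=10.0, d_min=0.2,
                    r_reached=100.0, r_crashed=-100.0):
    """Equation (5.9): scaled decrease in distance between consecutive steps."""
    d_prev = distance(prev_position, goal)
    d = distance(position, goal)
    if d < d_min:
        return r_reached
    if undesired_terminal:
        return r_crashed
    return lambda_g * (d_prev - d)

goal = (5.0, 0.0)
print(progress_reward((0.0, 0.0), (1.0, 0.0), goal, False))  # moving closer: positive
print(progress_reward((1.0, 0.0), (0.0, 0.0), goal, False))  # moving away: negative
print(exponential_reward((3.0, 0.0), goal, False))           # penalty grows with distance
```

With $\gamma > 0$ the exponential term is zero at the goal and increasingly negative further away, whereas the progress term only looks at the change in distance over one step.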

In addition, an orientation-based reward $r_t^{\omega}$ is added to motivate the robot to correct its heading with respect to the target. This term is defined as

$$r_t^{\omega} = \left\| \operatorname{atan2}\left(p_t^{y} - g^{y},\; p_t^{x} - g^{x}\right) - p_t^{\omega} \right\|_1 \tag{5.10}$$
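As a sketch, the orientation term of equation (5.10) can be computed as below. The first function follows the equation literally; for a scalar angle the 1-norm reduces to an absolute value. The second function wraps the angular difference to $[-\pi, \pi]$, which is an added assumption and not something stated in the text. How the term is signed or weighted when combined with the distance-based reward is not specified in this section, so that step is left out.

```python
import math

def orientation_term(position, goal, heading):
    """Equation (5.10) taken literally: |atan2(p_y - g_y, p_x - g_x) - heading|."""
    bearing = math.atan2(position[1] - goal[1], position[0] - goal[0])
    return abs(bearing - heading)

def wrapped_orientation_term(position, goal, heading):
    """Same term with the difference wrapped to [-pi, pi] (an assumption)."""
    bearing = math.atan2(position[1] - goal[1], position[0] - goal[0])
    diff = bearing - heading
    return abs(math.atan2(math.sin(diff), math.cos(diff)))

# Hypothetical pose: the atan2 angle is pi and the heading is just past -pi.
position, goal, heading = (-1.0, 0.0), (0.0, 0.0), -math.pi + 0.1
print(orientation_term(position, goal, heading))          # close to 2*pi - 0.1
print(wrapped_orientation_term(position, goal, heading))  # close to 0.1
```

The wrapped variant avoids reporting a large error for two directions that are nearly identical but lie on either side of the $\pm\pi$ discontinuity.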

5.2.2 Shaping the Reward Function

In this subsection, the reward function is shaped using the knowledge about the environment gained through the robot's experience. For this purpose, a 2D occupancy grid map of the surroundings is built by the SLAM algorithm discussed in chapter 4 while the robot explores the unknown environment, using data from the laser range finder and the robot's odometry. Every cell in the occupancy grid is classified as occupied, free, or unknown by comparing its occupancy probability with a predefined threshold value. Furthermore, the occupancy probability of every cell is updated as the robot keeps exploring the environment. In that sense, the reward function depends not only on how far the agent is from the target but also on its distance to the multiple obstacles inside the workspace. The incorporation of this knowledge about the environment should be weighted by the level of certainty of the map's posterior, $p(m \mid z_{1:t}, x_{1:t}) = \prod_{i=0}^{M} p(m_i \mid z_{1:t}, x_{1:t})$.
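The cell classification and the incremental probability update can be illustrated with a short sketch. It uses the standard log-odds update for occupancy grids as a stand-in, since the update rule of the SLAM algorithm from chapter 4 is not given in this section. The two thresholds, the prior of 0.5, and the function names are assumptions.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def update_cell(p_cell, p_measurement, p_prior=0.5):
    """Standard log-odds occupancy update (assumed here, not taken from the text)."""
    log_odds = logit(p_cell) + logit(p_measurement) - logit(p_prior)
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify_cell(p_occupied, occupied_threshold=0.65, free_threshold=0.35):
    """Label a cell from its occupancy probability; both thresholds are assumptions."""
    if p_occupied >= occupied_threshold:
        return "occupied"
    if p_occupied <= free_threshold:
        return "free"
    return "unknown"

# A cell starts unknown and is observed as likely occupied twice in a row.
p = 0.5
for _ in range(2):
    p = update_cell(p, 0.7)
    print(round(p, 3), classify_cell(p))
```

Repeated consistent observations push the probability away from 0.5, so a cell moves from unknown towards occupied or free as the robot explores.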

Moreover, since every obstacle inside the environment is represented by a number of occupied grid cells, this part of the reward is normalized by the total number of occupied grid cells in the field of view (FOV) of the robot. This can be formulated as follows:

$$r(s_t, a_t) = \underbrace{\frac{1}{k} \prod_{i=0}^{M} p(m_i \mid z_{1:t}, x_{1:t}) \sum_{i=0}^{k} e^{-c_{min}}}_{\text{map-dependent term}} \tag{5.11}$$

where $M$ is the total number of grid cells in the constructed map, $k$ is the total number of occupied cells in the field of view of the robot, and $c_{min}$ is the distance between the robot and the occupied cell. Since the reward function evolves with time due to the incorporation of this uncertainty, it no longer follows the MDP framework. Note that the map-dependent term defined in equation (5.11) is added to both rewards defined in equations (5.6) and (5.8), and a comparison between the resulting four rewards is made in chapter 7.
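A minimal sketch of the map-dependent term in equation (5.11) and of adding it to a base reward is given below. It assumes that the per-cell posteriors and the distances from the robot to each occupied cell in the field of view are already available as plain lists. The function names, the handling of an empty field of view, and the example values are assumptions.

```python
import math

def map_dependent_term(cell_posteriors, occupied_distances):
    """(1/k) * prod_i p(m_i | z, x) * sum_j exp(-c_j), following equation (5.11).

    cell_posteriors: per-cell posterior values for all cells in the map.
    occupied_distances: distance from the robot to each occupied cell in the FOV.
    """
    k = len(occupied_distances)
    if k == 0:
        return 0.0  # assumption: no occupied cell in view contributes nothing
    certainty = math.prod(cell_posteriors)
    proximity = sum(math.exp(-c) for c in occupied_distances)
    return certainty * proximity / k

def shaped_reward(base_reward, cell_posteriors, occupied_distances):
    """Base reward plus the map-dependent term, as described in the text."""
    return base_reward + map_dependent_term(cell_posteriors, occupied_distances)

# Hypothetical two-cell map with two occupied cells at 1 m and 2 m.
posteriors = [0.9, 0.8]
distances = [1.0, 2.0]
print(map_dependent_term(posteriors, distances))
print(shaped_reward(-1.5, posteriors, distances))
```

For a map with many cells the plain product of probabilities can underflow, so a practical version would likely accumulate the sum of logarithms instead.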

In document Towards Continuous Control for Mobile Robot Navigation: A Reinforcement Learning and SLAM Based Approach (Page 42-44)