
3.3 Exploration as an Optimisation Problem

3.4.2 Reward calculation

Equation (3.5) defines the underlying optimisation process of CBE and has a general form that is invariant to the choice of objective function f. CBE only requires access to f through noisy observations. The implementation of CBE in this paper uses mutual information (MI), also often referred to as information gain, as the objective function. MI measures the reduction in map entropy after observations are made:

$$\mathrm{MI}_{\xi(\theta)} = H(m) - H(m \mid \xi(\theta)). \tag{3.7}$$

Here H(m) is the entropy of the map. The conditional entropy H(m | ξ(θ)) is the entropy of the map given the observations taken along the path ξ(θ). To estimate H(m | ξ(θ)) along an entire path, CBE builds a hypothetical map m0 for each path sample. m0 is built by simulating laser scans as the robot executes the path. The simulated scans are generated under the assumption of an optimistic agent, i.e. it adheres to the existing map m but assumes free space (maximum range reading) for unknown areas. Under this assumption, the entropy of the hypothetical map is an upper bound for the empirical MI:

$$\mathrm{MI}_{\xi(\theta)} \leq H(m_0). \tag{3.8}$$

Another approach for computing MI was proposed by Charrow et al. (2015), where m0 is computed by marginalising out cell occupancy over each simulated beam. This approach is useful for maps built using noisy 3D sensors such as RGB-D, but adds little value when dealing with laser scanners. With high-update-rate and low-noise sensors such as laser scanners, only the transition areas from unoccupied to occupied, e.g. next to walls, are uncertain. In other observed regions of the map, p(m) tends to 0 or 1, which makes marginalisation redundant and yields similar MI results with both methods. In unobserved areas of the map, where p(m) = 0.5, the marginalised MI of Charrow et al. (2015) differs from the optimistic agent MI estimate only by a multiplicative factor. Consequently, the procedure proposed by Charrow et al. (2015) offers little value for the exploration scenario presented in this work, especially given its additional computational overhead. We emphasise again that the exact method for MI calculation is irrelevant, as CBE, a black-box optimiser, learns a representation of the MI response surface from forward-simulation observations, regardless of the mechanism used to generate them.
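The entropy bookkeeping behind Eq. (3.7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the occupancy grid is a flat list of independent per-cell occupancy probabilities, the names `cell_entropy`, `map_entropy` and `optimistic_update` are hypothetical, the free-space probability `p_free = 0.05` is an arbitrary choice, and ray casting is replaced by an explicit list of cell indices that the simulated beams are assumed to cross.

```python
import math

def cell_entropy(p):
    """Shannon entropy (bits) of one Bernoulli occupancy cell."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully known cell carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def map_entropy(grid):
    """Entropy of the map, assuming independent cells."""
    return sum(cell_entropy(p) for p in grid)

def optimistic_update(grid, visible_cells, p_free=0.05):
    """Build a hypothetical map: unknown cells crossed by simulated beams
    are assumed free; already-observed cells keep their current value."""
    hypothetical = list(grid)
    for i in visible_cells:
        if hypothetical[i] == 0.5:  # unknown cell
            hypothetical[i] = p_free
    return hypothetical

# Toy map: two observed cells (nearly free, nearly occupied) and four unknown cells.
current_map = [0.02, 0.98, 0.5, 0.5, 0.5, 0.5]
hypothetical_map = optimistic_update(current_map, visible_cells=[1, 2, 3])

h_before = map_entropy(current_map)
h_after = map_entropy(hypothetical_map)
mi_estimate = h_before - h_after  # entropy reduction, in the spirit of Eq. (3.7)
print(f"H(m) = {h_before:.3f} bits, H(m0) = {h_after:.3f} bits, MI = {mi_estimate:.3f} bits")
```

In this toy setting each unknown cell contributes one bit of entropy, while observed cells with p(m) close to 0 or 1 contribute very little, so the estimate is dominated by how many unknown cells the simulated scans reveal.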

The parametric optimisation of Eq. (3.5) was chosen as a trade-off between computational complexity and the expressivity of the solution. In certain scenarios, the limited path expressiveness may result in safe but non-informative paths. To overcome this, the following heuristics are used:

• The first term provides a global context to the overall objective function. A coarse path is planned from the robot's location to the nearest frontier. This path does not have to be traversable by the robot, and it may violate safety or kinematic constraints. However, it biases path selection towards a region of unexplored space. We define a penalty term, $P^H_{\xi(\theta)} = \cos(\omega)$, where ω is the difference between the direction of development of the candidate path and that of the coarse path. Therefore, a path that develops in the opposite direction to the global coarse path (cos(ω) close to −1) is penalised more heavily than a path oriented in a similar direction (cos(ω) close to 1). As PH is a cosine, the amplitude of

[Figure 3.2 plot: Path Length Penalty Term, P, against Scaled Path Length]

Figure 3.2: PL, path length penalty term, as a function of path length. The length is scaled with respect to the sensor maximum range. This penalty term penalises very short and long paths. Short paths are undesirable as they have little effect on map building. Longer paths are penalised as a function of their length, in order to prevent overly confident solutions.

• The second term PL is a function of the path length. Figure 3.2 depicts the choice of PL used in this paper, where the path length is scaled with respect to the maximum sensor range. The rationale behind PL is to penalise very short and long paths. Very short paths are undesirable as their ability to reduce uncertainty is negligible. Longer paths are penalised in order to prevent overly confident decisions.
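The heading term can be illustrated with a short Python sketch. It rests on an assumption that is not taken from the text: the "direction of development" of a path is approximated here by the vector from its first to its last waypoint. The function names `path_direction` and `heading_term` and the example waypoints are hypothetical.

```python
import math

def path_direction(path):
    """Overall direction of a path: vector from its first to its last waypoint
    (an assumed stand-in for the 'direction of development')."""
    (x0, y0), (x1, y1) = path[0], path[-1]
    return (x1 - x0, y1 - y0)

def heading_term(candidate_path, coarse_path):
    """P^H = cos(omega), omega being the angle between the two path directions."""
    cx, cy = path_direction(candidate_path)
    gx, gy = path_direction(coarse_path)
    norm = math.hypot(cx, cy) * math.hypot(gx, gy)
    if norm == 0.0:
        return 0.0  # degenerate path: no direction, so no bias either way
    return (cx * gx + cy * gy) / norm

coarse = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]    # coarse path towards the nearest frontier
aligned = [(0.0, 0.0), (1.0, 0.2), (3.0, 0.0)]
opposite = [(0.0, 0.0), (-1.0, 0.1), (-2.0, 0.0)]
sideways = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]

for name, path in [("aligned", aligned), ("opposite", opposite), ("sideways", sideways)]:
    print(name, round(heading_term(path, coarse), 3))
```

Using only the endpoints keeps the sketch simple. A tangent-based or averaged direction would be an equally valid reading of the definition.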

The additional penalties are added to the MI reward with corresponding weights, W1 and W2. These weights keep the penalties small compared to the typical MI utility:

$$\mathrm{MI}^{\mathrm{Modified}}_{\xi(\theta)} = \mathrm{MI}_{\xi(\theta)} + W_1 \cdot P^H_{\xi(\theta)} + W_2 \cdot P^L_{\xi(\theta)}. \tag{3.9}$$

The weights W1 and W2 are user defined and capture the user's approach to exploration. For example, increasing W1 will result in a process which resembles frontier exploration. In our implementation, the goal of PH is to pull the robot towards unexplored space when the MI reward is negligible. Consequently, we set W1 = 100. This value is approximately 10% of the average MI reward in standard scenarios, so it has little effect in standard planning scenarios. In situations where MI is close to zero, however, W1 = 100 guarantees that the PH term exceeds the expected noise of MI observations. The weight of the second penalty, W2, was set to 50, which affects the overall reward only for very short paths or if the expected path exceeds 5 times the maximum laser range.
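As a worked illustration of Eq. (3.9), the following Python sketch combines the three terms with the weights quoted in the text. The function name `modified_reward` is hypothetical, and the MI value of 1000 is only a stand-in implied by the statement that W1 = 100 is roughly 10% of the average MI reward. The heading and path length terms are passed in as already-computed numbers, since the functional form of PL is given only graphically.

```python
def modified_reward(mi, p_heading, p_length, w1=100.0, w2=50.0):
    """Eq. (3.9): MI plus the two weighted heuristic terms."""
    return mi + w1 * p_heading + w2 * p_length

# Standard scenario: the heading term shifts the reward by about 10%.
typical = modified_reward(mi=1000.0, p_heading=1.0, p_length=0.0)

# Stalled scenario: MI is close to zero, so the heading term decides the ranking.
stalled_towards = modified_reward(mi=0.0, p_heading=1.0, p_length=0.0)
stalled_away = modified_reward(mi=0.0, p_heading=-1.0, p_length=0.0)

print(typical, stalled_towards, stalled_away)
```

When MI is large the heading term barely changes the ordering of candidate paths, whereas when MI vanishes the gap between a path heading towards the coarse path and one heading away from it is 2 · W1.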

Even with the simplest occupancy map representation, the forward projection model needed to estimate MI is expensive to evaluate. This is the main motivation for using BO: optimising decision making while keeping the number of samples low. Instead of optimising by explicitly calculating the forward simulation MI results, BO learns a model for MI from sparse samples. It then uses this model to infer the next sampling location. The efficiency of BO relies on the accuracy of the learned GP models. However, a high-fidelity GP surrogate model requires a substantial number of function observations. We take advantage of the way MI is sampled in order to increase the number of observations without increasing the computational cost. We note that MI along the path is a monotonically non-decreasing function. Since the robot motion along the path is a set of sequential observation points, MI at any given point is the sum of the accumulated effect of all previous observations and the contribution of the current observation:

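$$\begin{aligned}
\mathrm{MI}(\theta_{k+1}) &\equiv \mathrm{MI}([u_1 \dots u_{k+1}]) \\
&= \mathrm{MI}([u_1 \dots u_k]) + \delta\mathrm{MI}(u_{k+1} \mid z_{k+1}) \\
&\equiv \mathrm{MI}(\theta_k) + \delta\mathrm{MI}(u_{k+1} \mid z_{k+1}).
\end{aligned} \tag{3.10}$$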

Thus, by evaluating MI sequentially, CBE produces several reward observations from a single path sample with no additional computational cost. More samples produce a denser and more accurate GP model of the objective function.
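The recursion in Eq. (3.10) amounts to a running sum over per-step contributions, which the following Python sketch makes explicit. The function name `sequential_mi` and the increment values are hypothetical. In practice each increment would come from the forward simulation of one step along the path.

```python
from itertools import accumulate

def sequential_mi(increments):
    """Running MI along a path: MI(theta_k) = MI(theta_{k-1}) + delta_MI_k."""
    return list(accumulate(increments))

# Hypothetical per-step MI gains from one forward simulation (all non-negative).
delta_mi = [120.0, 80.0, 0.0, 45.0]
mi_along_path = sequential_mi(delta_mi)

# One (prefix length, reward) observation per step, all from a single path sample.
observations = list(enumerate(mi_along_path, start=1))
for k, reward in observations:
    print(f"prefix of {k} step(s): MI = {reward}")
```

A single simulated path with four steps therefore yields four reward observations for the GP model instead of one, and non-negative increments make the sequence non-decreasing by construction.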

In document Autonomous Exploration over Continuous Domains (Page 55-58)