Related Work in BDI Learning - Learning plan selection for BDI agent systems

The issue of combining online learning with deliberation in BDI agent systems has not been widely addressed in the literature. In terms of offline approaches, Guerra-Hernández et al. [2005] reported preliminary results on learning the context condition for a single plan using a decision tree in a simple paint-world example. They do not, however, consider learning in plan hierarchies, non-deterministic domains, or nuances such as noisy training data, all of which we address in this thesis. Lokuge and Alahakoon [2007] give a detailed account using a real-world ship berthing logistics application. The authors use operational shipping data to train a neural network offline, which is then integrated into the BDI deliberation cycle to improve plan selection. They show that the trained system is able to outperform the human operators in scheduling the docking of ships to loading berths. Similar approaches that integrate previously (offline) learnt knowledge with BDI deliberation have also been used in robotic soccer [Brusey, 2002; Riedmiller et al., 2001], although no new learning is done in the deployed system. Nguyen and Wobcke [2006] incorporate learnt user preferences during BDI plan selection in a dialogue manager application using a decision tree learner. In contrast, Karim et al. [2006] take the approach of refining existing BDI plans, or learning new plans as a sequence of recorded actions, based on prescriptions provided by the domain expert.

A closely related area to BDI is that of hierarchical task network (HTN) planning, where the task decompositions used are similar to BDI goal-plan hierarchies [Erol et al., 1994]. In particular, we are interested in the fact that BDI and HTN systems map quite well to each other, and that plans’ context conditions in BDI systems correspond to methods’ preconditions in the HTN case. We explore several related works in this area in some detail in Chapter 7. The key difference between learning in HTN systems and our BDI approach, however, is that in our case learning is performed online in a trial-and-error manner, since we do not have a model of the environment, whereas in HTN planning systems it is predominantly done offline and a model of the environment is assumed. As such, the issue of determining confidence in ongoing learning (Chapter 4), which may be unreliable due to insufficient data, is generally not a concern in HTN systems.

The work of Simari and Parsons [2006] has highlighted the relationship between BDI and Markov Decision Processes, on which the reinforcement learning literature is founded.

Recently, Broekens et al. [2010] reported progress on integrating reinforcement learning to improve plan selection in GOAL, a declarative agent programming language in the BDI flavour. They use an abstract state representation consisting only of the count of action rules and a sum cost heuristic that captures the number of pending goals. The intent is to keep the representation domain independent, with the focus on improving the plan selection functionality in the framework itself. In that way, their approach complements ours, and may be integrated as “meta-level” learning to influence plan selection. We note that such work is still preliminary and it is difficult to ascertain the generality of their approach in other domains. Nevertheless, their early results are encouraging in that the agent always achieves the goal state in fewer tries with learning enabled than without. Our work also relates to the existing work in hierarchical reinforcement learning [Barto and Mahadevan, 2003], where task hierarchies similar to those of BDI programs are used. We discuss this related area further in Chapter 7. Of particular interest is the early work by Dietterich [2000], which supports learning at all levels in the task hierarchy (as we do in our learning framework described in Chapter 3) in contrast to waiting for learning to converge at the bottom levels first.

To our knowledge, the first attempt at a principled integration of online learning in BDI systems was made within our own research group by Airiau et al. [2009], who first introduced the use of decision trees for learning plan selection in BDI systems. Their work explored the nuances of learning within the hierarchical structure of a BDI program, and showed that it can be problematic to assume a mistake at a higher level in the hierarchy when a poor outcome may have been the result of a wrong decision at lower levels. That research formed the starting point for this thesis, and the learning framework described here builds upon this earlier work.

Chapter 3

A BDI Learning Framework†

In this chapter we discuss the elements that constitute our BDI learning framework. Our learning task is one of plan selection, in that we would like our BDI agent to improve its plan selection in any situation based on ongoing experience. Our approach to this is to learn to refine the applicability or context conditions of plans over time.

To this end, we provide a new account of plans’ context conditions to include decision trees. The idea is that as more experiences are collected regarding outcomes under different situations in which a plan was selected, the induced decision tree from those samples will provide a meaningful generalisation of the real applicability conditions of that plan. We present the key mechanisms that are required for this scheme to function: first, an approach to determining the input for the decision trees and recording the input experience samples from the hierarchy of decisions in the BDI plan library; and second, a new selection scheme that probabilistically selects from the candidate plans based on each plan’s believed likelihood of success in the situation as given by its decision tree. Next, we discuss learning in the context of the BDI goal failure recovery mechanism, and in recursive goal-plan structures.
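As a minimal sketch of the two mechanisms outlined above, the following Python fragment records experience samples per plan and then selects among candidate plans with probability proportional to each plan's estimated likelihood of success. Several simplifications are assumptions made purely for illustration: a per-state success frequency table stands in for the induced decision tree, the names PlanExperience and select_plan are hypothetical, and proportional weighting is only one possible reading of "probabilistic selection", not necessarily the scheme developed later in this chapter.

```python
import random
from collections import defaultdict


class PlanExperience:
    """Outcome samples for one plan (hypothetical stand-in for its decision tree)."""

    def __init__(self, prior=0.5):
        # Likelihood assumed for states never seen before (an illustrative choice).
        self.prior = prior
        # Maps a hashable world state to [successes, trials].
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, state, succeeded):
        """Store one experience sample: the state at selection time and the outcome."""
        entry = self.counts[state]
        if succeeded:
            entry[0] += 1
        entry[1] += 1

    def likelihood(self, state):
        """Estimated probability of success in `state` (observed success frequency)."""
        successes, trials = self.counts.get(state, (0, 0))
        return successes / trials if trials else self.prior


def select_plan(likelihoods, rng=random):
    """Pick a plan name with probability proportional to its estimated likelihood."""
    if not likelihoods:
        raise ValueError("no candidate plans")
    names = list(likelihoods)
    weights = [likelihoods[name] for name in names]
    if sum(weights) <= 0:
        # No plan is believed to work: fall back to a uniform choice.
        return rng.choice(names)
    return rng.choices(names, weights=weights, k=1)[0]
```

With such a scheme, a plan that has repeatedly failed in a given state is chosen less and less often in that state, but a plan with any non-zero estimate still has some chance of being tried, so the agent keeps gathering experience about it.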

We conclude with a discussion of an important challenge in this setup: that of the reliability of ongoing learning. Since the decision trees we use for plan selection are built from ongoing experiences, they will initially not be very reliable. Our solution for this issue of confidence in ongoing learning is given separately in Chapter 4.

† Parts of the work presented in this chapter have appeared or will appear in [Airiau et al., 2009; Singh et al., 2010a,b, 2011].

What Causes Plan Failure?

In saying that we wish to improve plan selection in any situation we imply that we would like to avoid, as much as possible, plan selections that lead to failures. If we are to learn in a meaningful way from failures, it becomes important to also understand the reasons for such failures.

As described previously in Chapter 2, the context condition of a plan encodes the programmed applicability conditions in which the plan is considered to be a reasonable strategy to address a given event-goal. The agent’s plan library captures the “know-how” information about the domain that the agent operates in and is specified by the domain expert. So, given that a plan’s selection in a given situation implies applicability in that situation, why should the chosen plan fail? It may be for one of the following reasons:

1. The plan was a bad choice in the situation. This may happen if there is a mismatch between the programmed context condition and the real applicability conditions of the plan. In other words, the context condition does not fully capture the state of affairs of the world.

2. The plan was the correct choice in the situation but the environment changed during plan execution. In other words, the reasons for executing the plan changed, while the plan was executing. This is perhaps the most common reason for failure in a dynamic environment, and is also the motivation behind the BDI failure recovery mechanism.

3. The plan was the correct choice in the situation but nevertheless failed due to unknown reasons. It may be that the world is only partially observable, in which case the reasons for failure appear non-deterministic to the agent.

4. The plan was the correct choice in the situation but a poor plan choice was made further below in the goal-plan hierarchy. Since plans often post subgoals that are then addressed by further plan choices, it may be the case that the failure occurred at the sub-task level.

5. The plan was a correct choice for addressing the event-goal and all choices in the hierarchy below were also correct, but the way in which prior event-goals were resolved meant that there was no way for the plan to succeed. This may occur, for example, when two subgoals interact over some common resource, such that the resolution of the first subgoal depletes the resource in a way that makes the second subgoal impossible to achieve.

For instance, consider the example of an agent controller for an unmanned aerial vehicle (UAV) that may contain several plans in its library to address the event-goal of landing the airplane. While some plans may apply in normal weather conditions, others may apply only in what are classified as emergency situations.

It may be that a plan to land the UAV in a field in case of an emergency fails because, despite the programmer’s best attempts, it was not possible to craft its context condition to capture every situation that constitutes an emergency (reason 1).

Even if the plan was activated correctly in an emergency, it may be aborted during execution if the agent no longer believes that landing in the field is an option, perhaps based on new sensor data confirming risk to farm animals in the field below (reason 2). In this case, if an alternative exists, for landing on a nearby airstrip for example, then the agent could recover from the initial failure by trying this alternative. Otherwise, if no alternatives remain then it might have to abort the goal to land safely.

That is not to say that landing on a nearby airstrip could not fail for all sorts of unknown factors beyond its control (reason 3). Or, it could potentially fail causing the plane to overshoot the airstrip, because the subgoal to determine final approach speed was incorrectly resolved for the current weather conditions (reason 4).

Finally, consider the case where a landing event-goal is the final subgoal in a higher level plan to survey the landscape and where the prior subgoals are used for navigating a set of waypoints in the flight path. It is foreseeable that the UAV successfully navigates all waypoints, but in the process consumes too much fuel, making returning to base and landing the plane unachievable (reason 5).
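To make reason 4 and the failure recovery behaviour more concrete, the following Python sketch executes a small goal-plan hierarchy and records the outcome of every plan that was tried. It is an illustration under stated assumptions only: the Goal and Plan classes, the achieve and run functions, and all plan names are hypothetical, and plans are tried in a fixed order for simplicity rather than selected by any learnt criterion.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str
    # Each step is either a Goal (a posted subgoal) or a callable returning bool.
    steps: list = field(default_factory=list)


@dataclass
class Goal:
    name: str
    plans: list = field(default_factory=list)


def run(plan, trace):
    """Execute the plan's steps in order; the plan fails as soon as one step fails."""
    for step in plan.steps:
        ok = achieve(step, trace) if isinstance(step, Goal) else step()
        if not ok:
            return False
    return True


def achieve(goal, trace):
    """Try the goal's plans in turn until one succeeds (simple failure recovery)."""
    for plan in goal.plans:
        ok = run(plan, trace)
        trace.append((goal.name, plan.name, ok))
        if ok:
            return True
    return False  # no alternatives remain, so the goal itself fails


# Hypothetical UAV example: the airstrip plan fails only because its subgoal
# for the approach speed was resolved with an unsuitable plan.
approach = Goal("SetApproachSpeed", [Plan("UseCalmWeatherSpeed", [lambda: False])])
land = Goal("Land", [
    Plan("LandOnAirstrip", [approach]),
    Plan("LandInField", [lambda: True]),
])

trace = []
landed = achieve(land, trace)
```

Here trace ends up containing a failure for UseCalmWeatherSpeed, a failure for LandOnAirstrip, and a success for LandInField. The failure recorded against LandOnAirstrip originates entirely from the choice made one level below it, which is why a failed outcome cannot simply be read as evidence that the higher-level plan was a bad choice.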
