Contributions - Rule-based interactive assisted reinforcement learning

9.1.1 Assisted Reinforcement Learning

The first contribution was a taxonomy and framework for describing and classifying Reinforcement Learning agents that use external information to aid the learning process and to supplement the environment and reward functions. The taxonomy and framework have been named Assisted Reinforcement Learning. The Assisted Reinforcement Learning framework was designed to promote collaboration between the sub-fields of Reinforcement Learning, and to help in describing and comparing the methods they introduce. Greater collaboration can reduce duplication in published ideas and methods, and increase the pace at which new and outstanding ideas and methods are adopted.

9.1.2 Evaluative versus Informative Advice

The second contribution was a comparison of evaluative and informative advice-giving styles, in which the accuracy, availability, and number of interactions of the two advice delivery styles were measured on the Mountain Car environment. A human trial found that informative advice givers were more accurate, were more engaged, and better understood the behaviour of the agent than evaluative advice givers, and that the informative style was preferred. The findings of this trial should inform future Interactive Reinforcement Learning development, as they have in this thesis, with a greater emphasis on the engagement of the human and the informative advice they provide.

9.1.3 Simulated Users in Interactive Reinforcement Learning

The third contribution was the introduction of simulated users into Interactive Reinforcement Learning. While simulated users have been used in the field before, they have not been formally acknowledged, and a methodology for their design and use has not been published. This thesis showed that simulated users present an effective method for providing indicative evaluations of Interactive Reinforcement Learning agents for comparison and development.

The fourth and fifth contributions of this thesis were a list of characteristics of human interactions used for designing simulated users, and a set of principles for the evaluation of simulated users that were adopted from the spoken dialogue systems field.

9.1.4 Persistent Advice

The sixth contribution of this thesis is the introduction of a method for the retention and reuse of human-sourced advice, named persistence. The use of persistent advice was found to substantially improve the performance of the agent while reducing the number of interactions required of the human. To handle the risk that incorrect advice introduces, and to manage the exploration-exploitation trade-off, probabilistic policy reuse (PPR) was introduced. PPR was found to be a viable method for balancing the advantages and disadvantages of retained advice.
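The interplay between retained advice and probabilistic policy reuse can be illustrated with a short sketch. It is not the implementation used in this thesis: the function names, the dictionary-based advice store, and the multiplicative decay schedule are all assumptions made for illustration. With probability `reuse_prob` the agent follows the advice retained for the current state, and otherwise falls back to ordinary epsilon-greedy selection over its own value estimates.

```python
import random


def select_action(state, q_values, advice, reuse_prob, epsilon, actions, rng=random):
    """Choose an action with a PPR-style rule (illustrative sketch only).

    q_values: dict mapping (state, action) -> estimated value
    advice:   dict mapping state -> action retained from a human advisor
    """
    advised = advice.get(state)
    # Reuse retained advice with probability reuse_prob, if any exists.
    if advised is not None and rng.random() < reuse_prob:
        return advised
    # Otherwise act epsilon-greedily on the agent's own estimates.
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))


def decay_reuse_probability(reuse_prob, rate=0.95):
    """Assumed schedule: shrink reliance on retained advice over time."""
    return reuse_prob * rate
```

Decaying `reuse_prob` lets the agent rely on advice early in learning while leaving room for its own policy to override advice that later proves to be incorrect.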

9.1.5 Rules-Based Reinforcement Learning

The final contribution of this thesis was Rules-Based Interactive Reinforcement Learning. Rules as an advice delivery method were shown to provide the same performance impact as state-based advice, but with a substantially reduced interaction count. Rules allow advice to be provided that generalises over multiple states. When coupled with Ripple-Down Rules, an exception-driven decision tree generation algorithm, conflicting rules are managed and a rule model can be built interactively. Additionally, Ripple-Down Rules have previously been shown to assist users in defining rules (Gaines & Compton, 1995; Compton et al., 1991, 2006).
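The exception-driven structure of Ripple-Down Rules can be sketched as a binary tree in which each rule has an exception branch (followed when the rule fires) and an alternative branch (followed when it does not). The code below is a simplified single-classification variant written for illustration only; the class, the function names, and the Mountain Car style state keys are assumptions, not the implementation used in this thesis.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    condition: Callable[[dict], bool]
    action: str
    if_true: Optional["Rule"] = None   # exception branch (rule fired)
    if_false: Optional["Rule"] = None  # alternative branch (rule did not fire)


def evaluate(root, state, default=None):
    """Return (action, last_node_visited, whether_that_node_fired)."""
    action, node, last, fired = default, root, None, False
    while node is not None:
        last, fired = node, node.condition(state)
        if fired:
            action = node.action  # the most recently fired rule wins
            node = node.if_true
        else:
            node = node.if_false
    return action, last, fired


def add_rule(root, state, condition, action):
    """Attach a correction where the evaluation of `state` ended."""
    new = Rule(condition, action)
    if root is None:
        return new
    _, last, fired = evaluate(root, state)
    if fired:
        last.if_true = new
    else:
        last.if_false = new
    return root


# Hypothetical advice: build a small rule model interactively.
root = add_rule(None, {}, lambda s: s["velocity"] < 0, "left")
root = add_rule(root, {"position": -0.5, "velocity": 0.02},
                lambda s: s["velocity"] >= 0, "right")
root = add_rule(root, {"position": 0.45, "velocity": -0.01},
                lambda s: s["position"] > 0.4, "right")
```

Because each new rule is attached only at the point where the corrected case ended its evaluation, a correction does not alter the conclusions reached for cases that never arrive at that node, which is how conflicting rules are kept local.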

9.2 Future Work

This section discusses possible future directions for the research presented in this thesis. In general, all the techniques and concepts contributed by this thesis should be tested and validated in more complex settings, such as higher-dimensional and continuous state spaces, and in real-world scenarios. Additionally, these contributions should be tested in combination with other technologies such as function approximators, deep learning, and AI safety, to demonstrate their performance on the leading edge of Reinforcement Learning. The remainder of this section lists some more specific research directions for simulated users, Interactive Reinforcement Learning, and human-agent interaction.

9.2.1 Effect of Latency on Accuracy

Chapter 5 presented a human trial that compared evaluative and informative advice delivery, measuring human accuracy, availability, and engagement. Humans providing evaluative advice were observed to have considerably worse accuracy than informative advice-giving users. A potential reason for this disparity is latency. If the humans were late in giving their advice, accuracy would suffer more for evaluative advice givers than for informative advice givers on the Mountain Car environment. A more detailed study that compares latency for the environment, and between advice delivery methods, should be a future direction of research.

9.2.2 Simulated Users

This dissertation introduced the use of simulated users into Reinforcement Learning for the purpose of indicative evaluation and development of RL technologies. In Chapter 4, a list of characteristics of human interactions was introduced. This list included accuracy, availability, and knowledge level, all of which were used extensively in this research. The other characteristics included concept drift, reward and cognitive bias, and latency. Future research is required to investigate methods for simulating these characteristics for different types of advice delivery methods and environments. Additionally, a larger and more comprehensive trial is required to compare how well simulated users can replicate the interaction behaviour of real humans, and to investigate how well the evaluations provided by simulated users reflect the actual behaviour presented by humans. This research is applicable not only to the Interactive Reinforcement Learning field, but also to the spoken dialogue systems and knowledge acquisition fields.
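As an illustration of how the accuracy, availability, and knowledge level characteristics might be parameterised, the following is a minimal sketch of a simulated user. The class, its fields, and the use of a lookup table of known-correct actions are assumptions made for this example; the simulated users in this thesis may be constructed differently.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    known_actions: dict        # state -> correct action (the user's knowledge level)
    actions: list              # all actions available in the environment
    accuracy: float = 0.9      # probability that the advice given is correct
    availability: float = 0.5  # probability that the user responds at all
    rng: random.Random = field(default_factory=random.Random)

    def advise(self, state):
        """Return an advised action, or None when no advice is given."""
        if state not in self.known_actions:
            return None  # outside the user's knowledge
        if self.rng.random() >= self.availability:
            return None  # user is unavailable on this step
        correct = self.known_actions[state]
        if self.rng.random() < self.accuracy:
            return correct
        # Inaccurate advice: pick any action other than the correct one.
        wrong = [a for a in self.actions if a != correct]
        return self.rng.choice(wrong) if wrong else correct
```

Passing a seeded `random.Random` instance as `rng` makes an experiment repeatable, which is useful when comparing agents against the same simulated advisor.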

9.2.3 Human/Agent Interfaces

The Interactive Reinforcement Learning agents demonstrated in this thesis used two types of advice: state-based and rule-based. Providing state-based or rule-based advice directly may not be the most user-friendly way for humans to provide assistance. Improving the user experience for the human when interacting with the agent may improve engagement and the amount of advice the user provides. While the agent needs to receive advice in a specific format, the human may provide assistance in any form, as long as an interpreter is used to transform the input.

Research and development into methods for humans to provide advice that are user-friendly and allow interactions with a high informational payload should be a priority. These improved interaction interfaces should improve engagement, decrease the interactions required to convey a lesson, and improve the learning speed of the agent.

9.2.4 Closing the Loop

In Interactive Reinforcement Learning, the focus is on the advice that the human provides to the agent. However, in human teaching, it is well established that teaching is a two-way communication task, with teachers and students informing each other. Interactive Reinforcement Learning differs slightly from this teacher-student model, as the RL agent can learn a better solution than the teacher initially demonstrates. Research is needed into methods for conveying behaviour learnt by the agent back to the human, so that the agent can teach the user the better behaviour. Ideally, this transfer of behaviour between the agent and the human would occur repeatedly and in both directions, making use of the agent’s rapid trial-and-error learning and the human’s problem-solving and pattern recognition abilities. This tandem learning may help in finding optimal behaviours in shorter periods of time, and in teaching the human the optimal behaviour once found. An example of Reinforcement Learning agents teaching humans can be seen in Tesauro’s TD-Gammon paper, in which Tesauro explains that the agent’s style of play frequently differs from traditional human strategies, and that in some cases this has led to major revisions in the positional thinking of top human players (Tesauro, 1994).

9.2.5 Multiple Users

A direction for future research is the extension of Interactive Reinforcement Learning to support multiple advice-giving users. This extension would allow groups of users, each with their own areas of expertise and accuracy, to assist the agent in learning and decision making. The possibility of receiving advice from a few users or a few thousand (C. Zhang & Liu, 2015; Haque, 2014) can allow the agent to rapidly receive massive amounts of advice and would address issues of limited availability of individual users. The support of multiple users introduces challenges such as the management of large amounts of conflicting advice, inaccurate advice, malicious users, and optimal advisor discovery.

9.2.6 Incorrect Advice Identification and Mitigation

Regardless of the intentions and accuracy of the advising user, at some point inaccurate advice will be provided to the agent. Inaccurate advice may come directly from the user, because of a misunderstanding of the reward function or concept drift, or it may come simply from noise in the communication path. Experiments performed in Chapter 7 (Figure 7.6) demonstrated the effect that incorrect advice can have on an agent’s performance. While probabilistic policy reuse was found to reduce the impact of incorrect advice, other technologies and preprocessing methods should be explored.
