domain. Second, the features can be regarded as causes for the observed decisions, allowing for a deeper understanding of the observed behavior. In particular, we consider an LFD problem and propose a Bayesian nonparametric framework for feature learning in LFD. We assume that the policies depend on the features of the observed demonstrations. With this model, we are able to (i) significantly reduce the state space by (ii) learning the features as well as the number of features and (iii) provide a better understanding of what caused the teacher to take the observed actions.

5.3 Problem Formulation

The goal of this work is to introduce a feature-based LFD framework. While this model can be used for teaching an agent given observations of the desired behavior, the focus of this work lies on the analysis of the observed behavior by means of the features. Since we consider a decision-making task, the investigated problem can be modeled by means of a Partially Observable Markov Decision Process (POMDP) [182, 183], which is defined by

• a set of observations, Oz,

• a set of states, Os,

• a finite set of Nu actions, Ou,

• a transition model, which describes the probability of entering a state after taking an action in the current state,

• an observation model which explains how the observations are generated from the states,

• a discount factor, which down-weights rewards received far in the future,

• and a reward function, R.
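To make the components listed above concrete, the following minimal sketch collects them in a small tabular container. The class name `TabularPOMDP`, the array layouts, and the two-state example are illustrative assumptions and not part of the original formulation, which also allows continuous states and observations.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TabularPOMDP:
    """Hypothetical container for a finite POMDP (illustration only)."""
    transition: np.ndarray   # (n_actions, n_states, n_states); each row sums to 1
    observation: np.ndarray  # (n_states, n_observations); each row sums to 1
    reward: np.ndarray       # (n_states, n_actions)
    discount: float          # discount factor in [0, 1)

    def step(self, state, action, rng):
        """Sample the next state and its observation; also return the reward."""
        n_states = self.transition.shape[1]
        next_state = int(rng.choice(n_states, p=self.transition[action, state]))
        n_obs = self.observation.shape[1]
        obs = int(rng.choice(n_obs, p=self.observation[next_state]))
        return next_state, obs, float(self.reward[state, action])


# Toy example: action 0 stays, action 1 switches state; observations are exact.
stay = np.eye(2)
switch = np.array([[0.0, 1.0], [1.0, 0.0]])
pomdp = TabularPOMDP(
    transition=np.stack([stay, switch]),
    observation=np.eye(2),
    reward=np.array([[0.0, 1.0], [0.0, 0.0]]),
    discount=0.95,
)
next_state, obs, rew = pomdp.step(state=0, action=1, rng=np.random.default_rng(0))
```

In the imitation learning setting considered here, only the states, observations, and actions would be used; the `reward` and `discount` fields are included solely to mirror the list.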

As the main goal of this work is to provide a means to understand observed behavior, we consider an imitation learning approach and learn the relation between states and actions from the observations directly, where the reward and, hence, the discount factor, are not considered.

We assume that we (or an agent) have access to Nz noisy observations, zn ∈ Oz,

86 Chapter 5: Bayesian Feature-Based Learning From Demonstrations

i.e., zn describes an observation of the state, xn, with additive Gaussian noise. Further, we assume that the actions, un ∈ U, taken in the corresponding states can be observed. The observations are assumed to be optimal in the sense that they represent the behavior of the agent seeking its goal, i.e., without any exploratory steps.

A simple approach to the considered problem would be to apply a feature extraction technique such as PCA [9,10] or NMF [132] to the observed states. With such an approach, however, the features are learned independently of the actions, so they cannot be learned jointly with the policies and shared among different actions. Moreover, the learned features are not guaranteed to be discriminative. Therefore, we argue that the features and policies need to be learned jointly such that a trade-off between feature sharing, promoting a compact model, and discrimination capability is found based on the observations.
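The limitation of the two-stage approach can be illustrated with a small synthetic sketch. The data below are hypothetical: two-dimensional states in which only the low-variance coordinate determines the action, observed with additive Gaussian noise. PCA is computed with a plain SVD, and the threshold rule is a stand-in for any classifier; none of this is taken from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: only the low-variance second coordinate decides the action.
n = 500
x = np.column_stack([
    rng.normal(0.0, 10.0, n),  # high variance, irrelevant for the action
    rng.normal(0.0, 1.0, n),   # low variance, determines the action
])
u = (x[:, 1] > 0).astype(int)           # observed actions u_n
z = x + rng.normal(0.0, 0.1, x.shape)   # noisy observations z_n

# PCA via SVD of the centred observations; keep the first principal component.
zc = z - z.mean(axis=0)
_, _, vt = np.linalg.svd(zc, full_matrices=False)
component = vt[0]
feature = zc @ component                # one-dimensional PCA feature


def threshold_accuracy(f, labels):
    """Accuracy of the rule 'f > 0', taking the better of both orientations."""
    acc = float(((f > 0).astype(int) == labels).mean())
    return max(acc, 1.0 - acc)


acc_pca = threshold_accuracy(feature, u)        # close to chance level
acc_relevant = threshold_accuracy(zc[:, 1], u)  # close to perfect
```

Because PCA ranks directions by variance alone, the retained feature aligns with the irrelevant coordinate and carries almost no information about the actions, whereas the discarded direction would have separated them well.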

5.4 State of the Art

In the following, we provide an overview of the current state of the art of feature representations in LFD. As IRL includes the problem of solving an RL problem, we start this section with an overview of feature learning for RL.

5.4.1 Reinforcement Learning

Large state and action spaces are especially problematic for value-based RL algorithms such as value iteration [178,184] or Q-learning [185], since the value function, representing the expected reward for each state, needs to be approximated. In early approaches, a pre-defined set of basis functions is linearly weighted to represent the value function. These basis functions are often referred to as feature mappings. Exploiting the linear relationship, efficient methods for estimating the values are proposed, e.g., in [180]. However, only a few approaches exist to learn the basis functions. For example, in [186], a feature selection approach is presented. By means of an ℓ1-regularized approximate linear program (RALP), the value function is approximated by selecting features from a large set of potential features.
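The idea of representing the value function as a linear combination of pre-defined basis functions can be sketched as follows. The chain environment, the Gaussian basis functions, and the plain least-squares fit are assumptions chosen for illustration; this is neither the method of [180] nor the RALP of [186].

```python
import numpy as np

n_states, gamma = 20, 0.9

# Hypothetical deterministic chain: state s moves to s + 1, the last state is absorbing.
P = np.zeros((n_states, n_states))
for s in range(n_states - 1):
    P[s, s + 1] = 1.0
P[-1, -1] = 1.0
r = np.zeros(n_states)
r[-1] = 1.0  # reward only in the last state

# Reference values from the Bellman equation V = r + gamma * P V.
v_exact = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# Pre-defined basis functions (feature mappings): a constant plus five Gaussian bumps.
positions = np.arange(n_states, dtype=float)
centres = np.linspace(0.0, n_states - 1.0, 5)
bumps = np.exp(-0.5 * ((positions[:, None] - centres[None, :]) / 3.0) ** 2)
Phi = np.column_stack([np.ones(n_states), bumps])  # shape (20, 6)

# Linear weights by ordinary least squares, so that V is approximated by Phi @ w.
w, *_ = np.linalg.lstsq(Phi, v_exact, rcond=None)
v_approx = Phi @ w
rmse = float(np.sqrt(np.mean((v_approx - v_exact) ** 2)))
```

Here six weights replace twenty tabulated values. The quality of the approximation depends entirely on how well the fixed basis functions match the shape of the value function, which is what motivates learning or selecting them.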

To learn nonlinear features, Riedmiller [187] has proposed neural fitted Q-iteration. Here, the value function is approximated by means of a neural network and the features are learned in the hidden layers of the network. Although this approach shows good performance in practice, it lacks convergence guarantees. In [43], the neural network is replaced with a deep-layered counterpart. The capability of this architecture is demonstrated on playing old computer games, often outperforming experienced human players. Research in this area is ongoing. The latest results have gained global public interest in a match of the traditional Chinese game Go between a human expert and an artificial intelligence [44]. However, the underlying deep architectures need to be designed carefully and parameter tuning can be challenging. As the inferred features mainly serve the purpose of dimensionality reduction, they do not necessarily possess a meaning that can be easily interpreted. A survey on feature learning for batch-based reinforcement learning is provided in [188].

A different concept of features is proposed by Hutter with the Feature Reinforcement Learning framework [189]. The goal is to learn a feature mapping from the agent's history (comprising actions, states, and rewards) to an MDP state, enabling decision learning for infinite state spaces [190].

Contingent Feature Analysis (CFA) [40] is motivated by the question of which information is needed to understand the behavior of an agent. To this end, features are sought that explain the high variance of the temporal derivative of the observed states when the agent performs an action.

An alternative framework for learning the latent structure in the state space based on a [191] is proposed in [192]. In this framework, it is assumed that the observed states can be compactly represented by exploiting the structure within the states, enabling efficient learning. An extension to an online approach has been proposed in [193], where the features are selected from a large set by means of Group LASSO [194].

5.4.2 Inverse Reinforcement Learning

IRL is concerned with the problem of learning the reward function from observed behavior [176]. In the context of IRL, features are mainly used to parameterize the reward function, often in a linear way, e.g., in [176, 177]. Recent attempts have been made to consider a DL architecture [195] for feature learning.
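As an illustration of the linear parameterization mentioned above, the reward can be written as a weighted sum of feature mappings. The symbols $\phi_k$, $w_k$, and $K$ are notational assumptions introduced here and are not taken from the original text.

```latex
R(s) = \mathbf{w}^{\top} \boldsymbol{\phi}(s) = \sum_{k=1}^{K} w_k \, \phi_k(s)
```

Under this form, learning the reward reduces to estimating the weight vector $\mathbf{w}$ for a fixed set of $K$ features.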

A Bayesian nonparametric approach is proposed in [196]. Here, an IBP prior is utilized to model feature activations. As the features of the reward function are assumed to be known, this approach can be understood as feature selection rather than feature learning for IRL, where the number of features is estimated by the IBP. Further results on Bayesian nonparametrics for IRL that are indirectly related to feature learning are given in [197], where a partitioning of the state space is sought, and in [198], where complex behavior is decomposed into several simpler behaviors that can be easily learned.

5.4.3 Imitation Learning

Instead of estimating the reward as in IRL, Imitation Learning aims at inferring the underlying policy directly. Usually, handcrafted features are used, e.g., in [199–201]. Attempts to introduce new features are made in [202] as an extension of the maximum margin planning algorithm proposed in [199]. As explained in [4], imitation learning can be considered as a supervised learning task. Thus, feature selection and learning techniques developed for classification and regression can also be used in imitation learning. An excellent overview is given in [14]. Though these models work well in practice, they might not be able to provide a deeper understanding of the observed behavior as they do not explicitly model the states and policies.
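The view of imitation learning as supervised learning can be sketched with a deliberately simple classifier. The demonstrations, the teacher rule, and the nearest-centroid policy below are hypothetical choices made for illustration; they do not correspond to any of the cited methods.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical demonstrations: 2-D states; the teacher picks action 1 whenever
# the first coordinate is positive and action 0 otherwise.
states = rng.uniform(-1.0, 1.0, size=(200, 2))
actions = (states[:, 0] > 0).astype(int)

# "Training": one centroid per action, computed from the demonstrated states.
centroids = np.stack([states[actions == a].mean(axis=0) for a in (0, 1)])


def policy(state):
    """Return the action whose centroid is closest to the given state."""
    distances = np.linalg.norm(centroids - np.asarray(state), axis=1)
    return int(np.argmin(distances))


train_acc = float(np.mean([policy(s) == a for s, a in zip(states, actions)]))
```

The learned policy maps states directly to actions without any reward function. As noted above, such a classifier reproduces the demonstrations but offers little insight into why the teacher chose them.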