
5.2 Graph-Based Skill Discovery in Continuous Domains

5.2.1 Prior Work

In this subsection, we review heuristics for generating sample transition graphs in unknown continuous MDPs, where the agent has to construct the graph from experience collected while exploring its environment. In domains with continuous state spaces $S \subset \mathbb{R}^{n_s}$, there cannot be a one-to-one correspondence between states and graph nodes since there are infinitely many states. Thus, several states need to be aggregated into one node, i.e., $V \subsetneq S$. Accordingly, one has to choose how many nodes there should be and where in the state space they should be placed. Prior work on choosing the positions of the graph nodes has mainly focused on covering the state space uniformly with nodes and has neglected the domain's dynamics $P^a_{ss'}$. We summarize three heuristics that have been proposed for choosing graph nodes based on a set of transitions sampled from the domain. All three heuristics are parametrized by $v_{num}$, which determines the number of nodes of the generated graph.

One straightforward choice for the graph node positions is a uniform grid over the state space. This approach has been used in the context of graph-based skill discovery, e.g., by Mannor et al. (2004). For an $n_s$-dimensional state space where the range of possible values in each dimension has been scaled to $[0, 1]$, the resolution in


disadvantage is that the approach will not scale to domains with many dimensions since the resolution in each dimension declines exponentially with $n_s$.
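The scaling problem of the grid heuristic can be made concrete with a small sketch. It assumes (this is not stated in the text above) that the grid uses the same number of points along every axis, namely the largest integer $m$ with $m^{n_s} \le v_{num}$, and that nodes sit at cell centres of the unit hypercube. The function names `points_per_axis` and `grid_nodes` are hypothetical.

```python
import itertools


def points_per_axis(v_num, n_s):
    """Largest m such that an m x ... x m grid (n_s axes) has at most v_num nodes."""
    m = 1
    while (m + 1) ** n_s <= v_num:
        m += 1
    return m


def grid_nodes(v_num, n_s):
    """Cell-centred grid nodes in [0, 1]^n_s (assumed placement, see lead-in)."""
    m = points_per_axis(v_num, n_s)
    axis = [(i + 0.5) / m for i in range(m)]
    return [tuple(p) for p in itertools.product(axis, repeat=n_s)]


# With a fixed budget of nodes, the per-axis resolution shrinks quickly
# as the number of state dimensions grows.
for n_s in (1, 2, 4, 8):
    print(n_s, points_per_axis(256, n_s), len(grid_nodes(256, n_s)))
```

Under these assumptions, a budget of 256 nodes leaves only two grid points per axis in an eight-dimensional state space, which illustrates why the grid heuristic is unattractive beyond a few dimensions.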

A second heuristic is the on-policy sampling heuristic (also denoted as “random subsampling” by Mahadevan and Maggioni (2007)), which samples the graph node positions uniformly at random from the set of states $S'$ encountered during exploration. In contrast to the grid-based heuristic, this heuristic does not depend directly on the state space’s dimensionality $n_s$, but rather on the “effective” dimensionality of the manifold of feasible states. If there is redundancy in the dimensions of the state space, this effective dimensionality might be considerably lower than $n_s$. The heuristic is on-policy, i.e., regions of the state space that are visited often by the sampling policy are represented by more graph nodes.
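A minimal sketch of on-policy sampling follows. It assumes that states are stored as tuples and that nodes are drawn without replacement; the function name `on_policy_nodes` and the toy data are hypothetical and serve only to show the density bias described above.

```python
import random


def on_policy_nodes(visited_states, v_num, seed=None):
    """Draw v_num node positions uniformly at random from the visited states.

    Assumption: sampling is without replacement, so no visit is used twice.
    """
    rng = random.Random(seed)
    return rng.sample(visited_states, v_num)


# Toy 1-D exploration data: 900 visits near 0.1 and 100 visits near 0.9.
data_rng = random.Random(0)
visited = [(data_rng.gauss(0.1, 0.02),) for _ in range(900)]
visited += [(data_rng.gauss(0.9, 0.02),) for _ in range(100)]

nodes = on_policy_nodes(visited, v_num=50, seed=1)
low_region = sum(1 for (x,) in nodes if x < 0.5)
# The frequently visited region receives most of the graph nodes.
print(low_region, len(nodes) - low_region)
```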

The ε-net heuristic, also denoted as “trajectory-based subsampling” (Mahadevan and Maggioni, 2007), aims at covering the set of states encountered during exploration as uniformly as possible. It follows a greedy strategy: the first graph node $v_0$ is picked at random from $S'$. By induction, for $k \ge 1$ suppose the graph nodes $v_0, \ldots, v_{k-1}$ have already been selected and their pairwise Euclidean distance is at least ε. Search for states $s \in S'$ whose distance to each of $v_0, \ldots, v_{k-1}$ is at least ε: if such states exist, pick one at random and add it to the set of graph nodes. If there is no such candidate, return the current set of graph nodes. This set corresponds to a locally maximal set of graph nodes with pairwise distance at least ε. The advantage of this approach compared to the on-policy sampling method is that the effective state space is covered more uniformly. To parametrize the heuristic by $v_{num}$ instead of ε, we perform a binary search for a value of ε that yields a set of $v_{num}$ graph nodes. As discussed in Section 3.3.4, Bacon and Precup (2013) have proposed to use the ε-net heuristic for skill discovery.
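The greedy construction and the binary search over ε can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the cited work: states are tuples, the random choice among remaining candidates is realised by visiting the states in a shuffled order, the same seed is reused for every ε, and the closest net found is returned when no ε yields exactly $v_{num}$ nodes. The function names are hypothetical.

```python
import math
import random


def epsilon_net(states, eps, rng):
    """Greedy epsilon-net: nodes with pairwise Euclidean distance >= eps.

    States are visited in random order; a state becomes a node if it is at
    least eps away from every node selected so far.
    """
    order = list(states)
    rng.shuffle(order)
    nodes = []
    for s in order:
        if all(math.dist(s, v) >= eps for v in nodes):
            nodes.append(s)
    return nodes


def epsilon_net_for_v_num(states, v_num, seed=0, max_iter=40):
    """Binary search over eps for a net with (close to) v_num nodes."""
    lows = [min(c) for c in zip(*states)]
    highs = [max(c) for c in zip(*states)]
    lo, hi = 0.0, math.dist(lows, highs)  # from 0 up to the bounding-box diagonal
    best = None
    for _ in range(max_iter):
        eps = 0.5 * (lo + hi)
        nodes = epsilon_net(states, eps, random.Random(seed))
        if best is None or abs(len(nodes) - v_num) < abs(len(best) - v_num):
            best = nodes
        if len(nodes) == v_num:
            break
        if len(nodes) > v_num:
            lo = eps  # too many nodes: require a larger spacing
        else:
            hi = eps  # too few nodes: allow a smaller spacing
    return best


demo_rng = random.Random(1)
visited = [(demo_rng.random(), demo_rng.random()) for _ in range(300)]
net = epsilon_net_for_v_num(visited, v_num=20)
print(len(net))
```

Because the number of nodes is not guaranteed to change monotonically with ε, the search keeps track of the best net seen so far rather than relying on the final iterate.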

Once a finite set of graph nodes has been chosen with any of the discussed heuristics, the states of the original MDP can be associated with their closest graph nodes. This allows one to use the approach discussed in Section 4.2.1 to create graph edges and their weights by replacing each observed transition between two states with a transition between the two respective closest graph nodes. More sophisticated strategies for mapping state transitions onto node transitions would be possible: for instance, one could associate each state transition with a weighted sum of node transitions such that the weights sum to one and the weighted sum of node transitions is maximally close to the original state transition. This would, however, reduce the sparsity of the graph’s connectivity and thereby increase the runtime of agglomerative graph clustering approaches as discussed in Section 2.4.3.3. Note that by constructing a state transition graph, one effectively creates a discrete version of the MDP that is embedded into the continuous state space. See Section 5.3.2.1 for an illustration of the three discussed heuristics.
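The nearest-node mapping of transitions can be sketched as below. Since Section 4.2.1 is not reproduced here, the edge weight is assumed to be the number of observed transitions mapped onto that edge, and self-loops are kept; both choices, like the function names, are assumptions for illustration only.

```python
import math
from collections import Counter


def nearest_node(state, nodes):
    """Index of the graph node closest to state (Euclidean distance)."""
    return min(range(len(nodes)), key=lambda i: math.dist(state, nodes[i]))


def build_edges(transitions, nodes):
    """Map each observed transition (s, s_next) onto its pair of closest nodes.

    Returns a Counter {(i, j): count}; the count serves as the assumed
    edge weight.
    """
    edges = Counter()
    for s, s_next in transitions:
        edges[(nearest_node(s, nodes), nearest_node(s_next, nodes))] += 1
    return edges


# Toy 1-D example with two nodes and four observed transitions.
nodes = [(0.0,), (1.0,)]
transitions = [
    ((0.1,), (0.2,)),
    ((0.4,), (0.6,)),
    ((0.45,), (0.7,)),
    ((0.9,), (0.8,)),
]
edges = build_edges(transitions, nodes)
print(dict(edges))
```

Each transition contributes to exactly one edge, which keeps the graph sparse; the weighted-sum alternative mentioned above would spread one transition over several edges.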

5.3 Methods

While the heuristics discussed in Section 5.2.1 focus on covering the state space uniformly, they do not take the domain’s dynamics into account. Thus, for many valid state transitions $s \to s'$ of the domain, there may not be any pair of graph nodes $v_1, v_2 \in V$ such that $v_1 \to v_2$ is a good representation of $s \to s'$. Accordingly, the graph may not be able to capture the domain’s dynamics $P^a_{ss'}$ accurately. In this section, we propose a generative model which defines how likely it is that a set of observed transitions has been generated from a transition graph. We then propose the heuristic FIGE, which is derived from this generative model as the maximum likelihood solution under simplifying assumptions. Thereupon, we illustrate different approaches for transition graph generation and discuss how skill prototypes can be generated for a given transition graph in a continuous domain.