Relation to hPOMDPs - Generic Reinforcement Learning Beyond Small MDPs

A resolving history is one that completely determines the internal POMDP state following that history. Holmes et al. (2006) show that there exists a looping suffix tree that can perfectly predict the observations of any (strongly connected) deterministic POMDP without rewards (abbreviated as POMDP\R), given a resolving history sequence to start with. Their proof uses the following steps.

• The determinism of the POMDP\R implies that given a resolving sequence, every history following that sequence is also resolving (see Lemma 1).

• Given a set of minimal resolving sequences for each state, one can construct an infinite-depth suffix tree that represents this set and maps any history to an internal POMDP state.

• The infinite history suffix tree can be made finite by looping over certain sequences (excisable sequences, which we will subsequently define).

In this section we show that it is not the determinism of the POMDP\R that enables the first step, but the fact that deterministic POMDP\Rs also satisfy the history condition (Lemma 1). Thus hPOMDP\Rs can also be predicted using looping suffix trees, where hPOMDP\R is the natural definition of an hPOMDP without rewards. We show that there exists a looping suffix tree such that the emission probabilities at the leaf nodes correspond to the appropriate emission probabilities of the hPOMDP\R. We will need the following notation, lemmas and definitions from Holmes et al. (2006).

• We assume here that the history starts with an action and ends with an observation i.e.

• S_λ = S ∪ {λ}, where S is the state set of the hPOMDP\R and λ is the empty state.

• We abuse notation and define h : S_λ → S_λ to also be a function mapping each state s_i to the state reached starting from s_i and following history h. If the history sequence h cannot occur from a particular starting state s_i then we set h(s_i) = λ. In the text, we will say “the function h” wherever it is not clear whether we are referring to the history sequence or the related function.

• trans(h) = {ao : a ∈ A, o ∈ O, and ao is a possible transition following h}

For this section, we will assume that the emission probabilities of the hPOMDP\R are dependent on the tuple (s, a) rather than on the state arrived at, s′; see Section 2.3 for a discussion of this. The two definitions are equivalent, although a POMDP\R with emissions depending on s′ alone may have more states, as seen in Figure 4.3. This change in definition is to be consistent with Holmes et al. (2006), and is easier for the purposes of constructing an associated looping suffix tree. The edges of the hPOMDP\R are now labelled by ao, and by Lemma 1 on page 30 we know that each edge ao uniquely determines the next state s′. Thus the resulting hPOMDP\R can be seen as a finite state machine with transitions given by the ao pairs. trans(h) also defines exactly the states s′ that can follow the history h if it is in some state s. The determinism of the transitions ao also means that h(S_λ) is a well-defined function.
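As a minimal sketch of this finite-state-machine view, the following Python tabulates the function h and the set trans(h) for a small machine. The two-state transition table, the names `apply_history`, `history_function` and `trans_after`, the use of `None` for λ, and the convention h(λ) = λ are illustrative assumptions rather than anything taken from Holmes et al. (2006). trans(h) is read here as the set of ao pairs available from any state that h can reach.

```python
LAMBDA = None  # stands in for the empty state λ (assumed encoding)

# Hypothetical two-state machine: (state, "ao") -> next state.
TRANS = {
    ("s0", "a0"): "s1",
    ("s0", "a1"): "s0",
    ("s1", "b0"): "s0",
    ("s1", "a1"): "s1",
}
STATES = ["s0", "s1"]


def apply_history(trans, state, history):
    """Return h(state): follow each ao pair in turn, or λ if a step is impossible."""
    for ao in history:
        if state is LAMBDA:
            return LAMBDA
        state = trans.get((state, ao), LAMBDA)
    return state


def history_function(trans, states, history):
    """Tabulate h : S_λ -> S_λ as a dict, assuming h(λ) = λ."""
    table = {s: apply_history(trans, s, history) for s in states}
    table[LAMBDA] = LAMBDA
    return table


def trans_after(trans, states, history):
    """One reading of trans(h): ao pairs possible from any state reachable via h."""
    reached = {apply_history(trans, s, history) for s in states} - {LAMBDA}
    return {ao for (s, ao) in trans if s in reached}
```

For example, `history_function(TRANS, STATES, ["a0"])` sends s0 to s1 and s1 to λ, because a0 cannot occur from s1 in this toy machine.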

Definition 9. A history sequence h resolves to a state s_i iff the function h maps every state in S_λ to either s_i or λ, with at least one state mapping to s_i.

Figure 4.1 shows a deterministic POMDP\R without a resolving sequence. hPOMDPs always have resolving sequences via the map φ that resolves any h to an internal state, including the initial empty history e (although we can slightly weaken this condition as discussed in Section 2.3.3). However, the history condition (Lemma 1) alone is not enough for the existence of a resolving sequence; the same example (Figure 4.1) satisfies this condition but has no resolving sequences.
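Definition 9 can be checked mechanically once the transitions are written as a table. The sketch below is illustrative only: the table `FIG_4_1` is an assumed reading of the Figure 4.1 diagram (a b:1 self-loop on each state and an a:0 edge in each direction), and the helper names and the use of `None` for λ are hypothetical. Restricting the starting set to s0 is used here to mimic equipping the machine with φ(e) = s₀.

```python
LAMBDA = None  # stands in for the empty state λ (assumed encoding)


def apply_history(trans, state, history):
    """Return h(state), or λ if the history cannot occur from that state."""
    for ao in history:
        if state is LAMBDA:
            return LAMBDA
        state = trans.get((state, ao), LAMBDA)
    return state


def resolves_to(trans, start_states, history):
    """Return the unique non-λ image of the history (Definition 9), else None."""
    images = {apply_history(trans, s, history) for s in start_states}
    images.discard(LAMBDA)
    return images.pop() if len(images) == 1 else None


# Assumed reading of Figure 4.1: (state, "ao") -> next state.
FIG_4_1 = {
    ("s0", "b1"): "s0",
    ("s1", "b1"): "s1",
    ("s0", "a0"): "s1",
    ("s1", "a0"): "s0",
}

history = ["b1", "b1", "a0"]
unknown_start = resolves_to(FIG_4_1, ["s0", "s1"], history)  # both states possible
known_start = resolves_to(FIG_4_1, ["s0"], history)          # start pinned to s0
```

Under this reading every history permutes the two states, so no history can resolve unless the starting state is already known.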

Definition 10. A state s_i is reachable from state s_j if there exists a finite sequence of actions a_1...a_n such that the probability of seeing state s_i after taking the sequence a_1...a_n is non-zero.

Definition 11. An hPOMDP\R is strongly-connected if every state is reachable from every other state.

The following lemma for hPOMDP\Rs illustrates how the strongly-connected nature of the hPOMDP\R and the existence of a map φ are together sufficient for the existence of resolving sequences for every state, given a single resolving sequence. Figure 4.2a provides an example of a strongly-connected stochastic hPOMDP, demonstrating that the class is not empty.

Lemma 12. We can construct infinitely many resolving sequences for every state of a strongly-connected hPOMDP.

[Figure 4.1: state diagram with states s₀ and s₁ and edges labelled a: 0, b: 1, b: 1.]

Figure 4.1: This deterministic POMDP\R has no resolving sequences. It still satisfies Lemma 1; given s₀, taking action b determines the next state to be s₀ while taking a determines the next state to be s₁. However, given a history (e.g. b1b1a0) it is impossible to say whether we are in s₀ or s₁. If we equipped this POMDP\R with a map φ such that φ(e) = s₀ then the history b1b1a0 resolves to s₁.

[Figure 4.2: panel (a) is a state diagram with states s₀ and s₁ and edges labelled a: 0, b: 0, a: 1, a: 1; panel (b) is a tree rooted at ∅ with nodes s₀ and s₁ and edges labelled a0, a1.]

(a) A stochastic hPOMDP\R with φ(b0) = s₀, φ(a0) = s₁ and Pr(1|s₀,a) ∈ [0, 1). The corresponding looping suffix tree is shown on the right.

(b) The emission probabilities for s₀ are given by Ω(·|s₀,a); all other emissions are deterministic. The transition sets are trans(s₀) = {a1, a0} and trans(s₁) = {b0, a1}.

Figure 4.2: Stochastic hPOMDP\R and corresponding LST

Proof. Let us assume we have a history sequence h that resolves to some initial state s_i in an hPOMDP\R via the map φ. For instance, this could be the empty history e such that φ(e) = s_i. Let a_i be some action feasible from state s_i that results in observation o_i and leads (deterministically) to s_j. Then, by definition, h(s) = s_i or λ for all states s. Therefore, writing ao for the pair a_i o_i, hao(s) = ao(h(s)) = s_j or λ, i.e. hao resolves to s_j. Since the hPOMDP\R is strongly-connected, we can construct a resolving sequence for every state by repeating this construction using an appropriate sequence of actions (and resulting observations) that make each state reachable from s_i. Note that this process does not rely on the (potentially low) probability of the sequences, simply their possibility, since we can read the possible transitions directly from the hPOMDP\R specification. By the strongly-connected nature of the hPOMDP\R, for every s ∈ S there exists a suffix q such that hq resolves to s. This includes the state s_i itself, which allows us to construct an infinite number of resolving sequences for every state.
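The construction in this proof can be sketched as a breadth-first search over the transition graph: starting from one resolving history, each possible ao edge extends it to a history that resolves to the edge's target. The code below is a sketch under assumptions: the transition table `TOY` is a hypothetical strongly-connected two-state machine, and all function names and the `None` encoding of λ are illustrative.

```python
from collections import deque

LAMBDA = None  # stands in for the empty state λ (assumed encoding)

# Hypothetical strongly-connected machine: (state, "ao") -> next state.
TOY = {
    ("s0", "a0"): "s1",
    ("s0", "a1"): "s0",
    ("s1", "b0"): "s0",
    ("s1", "a1"): "s1",
}
STATES = ["s0", "s1"]


def apply_history(trans, state, history):
    """Return h(state), or λ if the history cannot occur from that state."""
    for ao in history:
        if state is LAMBDA:
            return LAMBDA
        state = trans.get((state, ao), LAMBDA)
    return state


def resolves_to(trans, states, history):
    """Return the unique non-λ image of the history, else None."""
    images = {apply_history(trans, s, history) for s in states}
    images.discard(LAMBDA)
    return images.pop() if len(images) == 1 else None


def resolving_extensions(trans, start_state, base_history):
    """Extend a history resolving to start_state into one per reachable state."""
    found = {start_state: list(base_history)}
    queue = deque([start_state])
    while queue:
        s = queue.popleft()
        for (src, ao), dst in trans.items():
            if src == s and dst not in found:
                found[dst] = found[s] + [ao]  # h ao resolves to dst
                queue.append(dst)
    return found


seqs = resolving_extensions(TOY, "s1", ["a0"])
```

Only the possibility of each edge is used, never its probability. Appending any cycle that returns to a state (for instance b0 followed by a0 from s1 in this toy machine) yields a longer resolving sequence for the same state, which is where the infinitely many sequences come from.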

[Figure: flip automaton in two panels. Panel (a): states s₀ and s₁ with edges labelled r: 1, l: 1, and r: 0, u: 0 (twice). Panel (b): four states labelled 1, 1, 0, 0 with edges labelled r, l and u.]

(a) The original flip automaton as described in Holmes et al. (2006). Here we use the convention that the emission probabilities depend on state and action pairs.

(b) An equivalent flip automaton with emission probabilities restricted to depend only on the (next) state. The state labels are now the observations that will be emitted if the agent transitions to that state.

In Lemma 12 we saw that we can have arbitrarily many resolving sequences of arbitrary length. However, we are particularly interested in the smallest possible resolving sequences. A suffix tree built from these resolving sequences (called a history suffix tree) would then allow us to map any history to the corresponding internal state.

Definition 13. The set of minimal resolving sequences for a state s_i is the set of resolving sequences h such that no shorter sequence h′ formed by removing prefixes of h is also resolving.

Definition 14. For any two histories h and q such that h = eq, the sequence e is excisable from h iff ∀p: trans(ph) = trans(pq). Otherwise e is non-excisable.

Definition 15. Two histories h and q are functionally equivalent iff the functions h : S_λ → S_λ and q : S_λ → S_λ are equal.

A minimal resolving sequence may be unbounded in length due to the presence of excisable sequences within the resolving sequence. Thus, the history suffix tree might be infinite.
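To make Definition 13 concrete, the sketch below searches the suffixes of a history from shortest to longest and returns the first one that resolves, which is then minimal in the sense that no shorter suffix resolves. The transition table `TOY` is a hypothetical machine, and the function names and `None` encoding of λ are illustrative assumptions.

```python
LAMBDA = None  # stands in for the empty state λ (assumed encoding)

# Hypothetical machine: (state, "ao") -> next state.
TOY = {
    ("s0", "a0"): "s1",
    ("s0", "a1"): "s0",
    ("s1", "b0"): "s0",
    ("s1", "a1"): "s1",
}
STATES = ["s0", "s1"]


def apply_history(trans, state, history):
    """Return h(state), or λ if the history cannot occur from that state."""
    for ao in history:
        if state is LAMBDA:
            return LAMBDA
        state = trans.get((state, ao), LAMBDA)
    return state


def resolves_to(trans, states, history):
    """Return the unique non-λ image of the history, else None."""
    images = {apply_history(trans, s, history) for s in states}
    images.discard(LAMBDA)
    return images.pop() if len(images) == 1 else None


def minimal_resolving_suffix(trans, states, history):
    """Return (suffix, state) for the shortest resolving suffix, or None."""
    for k in range(1, len(history) + 1):
        suffix = history[-k:]
        state = resolves_to(trans, states, suffix)
        if state is not None:
            return suffix, state
    return None
```

In this toy machine a1 leaves both states unchanged, so a0 followed by any number of a1 pairs has no shorter resolving suffix; this is the kind of unbounded growth the paragraph above describes.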

Lemma 16 (from Holmes et al. (2006)). For two histories h and q such that h = eq, if h and q are functionally equivalent then e is excisable from h.

The proof for the above lemma can be found in Holmes et al. (2006). Intuitively, it follows from the definition of the function h and excisability. Note that excisability is precisely the property that loops in looping suffix trees cover. If there are resolving histories h and q such that h = eq, then we can simply treat h as if it were q, since they resolve to the same internal hPOMDP state and they behave the same way for every possible prefix. Effectively we create a loop from the node corresponding to h in the history suffix tree to the node corresponding to q.

Lemma 17 (from Holmes et al. (2006)). Every branch of a history suffix tree either becomes resolving or reaches a level that begins with an excisable sequence after finite depth.

Proof. Consider a branch which does not become resolving at a finite depth, represented by the (infinite) history h. Let h_i be some prefix of h. There can only be a finite number of such prefixes that are functionally distinct, since each function h_i has finite domain and range. Thus for some j with i < j < ∞, h_j is functionally equivalent to h_i. Additionally, h_j = e h_i for some e. By the previous lemma, e is excisable from h_j.

Effectively, this lemma allows us to make a finite looping suffix tree out of a potentially infinite history suffix tree by looping over the excisable sequences. We use this lemma for the following theorem.
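The counting argument behind Lemma 17 can be illustrated by walking down one branch and recording the function computed at each depth until one repeats. In the sketch below the depth-k node of a branch is taken to be the last k ao pairs of a history, which is an assumption about how the branch is indexed; the machine `TOY` and the names `signature` and `find_loop` are hypothetical.

```python
LAMBDA = None  # stands in for the empty state λ (assumed encoding)

# Hypothetical machine: (state, "ao") -> next state.
TOY = {
    ("s0", "a0"): "s1",
    ("s0", "a1"): "s0",
    ("s1", "b0"): "s0",
    ("s1", "a1"): "s1",
}
STATES = ["s0", "s1"]


def apply_history(trans, state, history):
    """Return h(state), or λ if the history cannot occur from that state."""
    for ao in history:
        if state is LAMBDA:
            return LAMBDA
        state = trans.get((state, ao), LAMBDA)
    return state


def signature(trans, states, history):
    """Tuple of images of every state; equal tuples mean functional equivalence."""
    return tuple(apply_history(trans, s, history) for s in states)


def find_loop(trans, states, history):
    """Return depths (i, j), i < j, of the first functionally equivalent pair."""
    seen = {}
    for k in range(1, len(history) + 1):
        sig = signature(trans, states, history[-k:])
        if sig in seen:
            return seen[sig], k  # loop from depth k back to depth seen[sig]
        seen[sig] = k
    return None
```

Because there are only finitely many functions from S_λ to S_λ, a sufficiently long non-resolving branch must produce a repeat. For the history a1 a1 a1 in this toy machine, depths 1 and 2 already compute the same function, so the extra a1 can be excised.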

Theorem 18. Let M be a strongly-connected hPOMDP\R. Given a resolving history to begin with, there exists a prediction looping suffix tree L_M such that each (non-looping) leaf node of L_M corresponds to a state s_i in M and the emission probabilities at the node mapping to s_i correspond to the emission probabilities of s_i in M.

Proof. The following proof is similar to the proof of Theorem 1 in Holmes et al. (2006) with the addition of the appropriate emission probabilities at the leaf nodes.

By Lemma 17, the history suffix tree either resolves at a finite depth or reaches the start of an excisable sequence. If the branch h resolves at a finite depth, we are done. If the branch h does not resolve at some finite depth, then let k be the depth at which it first begins with an excisable sequence e such that e h_j = h_k for j < k. We can then place a loop from h_k to h_j, and any history that follows h_k will be looped back to h_j, since h_j and h_k are functionally equivalent.

The emission probabilities for the observations at each leaf node are simply the emission probabilities for the corresponding hPOMDP\R state-action pairs, Ω(o|s,a), where s corresponds to the hPOMDP\R state mapped to by the corresponding branch of the looping suffix tree. That is, if the leaf node corresponds to the minimal resolving sequence mapping to s, then the emission probabilities assigned to that node must be Ω(o|s,a) for each o and a. Thus we define a looping suffix tree that has the same emission probabilities as the hPOMDP\R M for each state and action.
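A small sketch may help picture the resulting object. Below, a looping suffix tree is encoded as nested dictionaries read from the most recent ao pair backwards: a string child is a leaf naming a state, and a child that points back to an ancestor dictionary plays the role of a loop. The encoding, the tree itself, the emission table `EMISSIONS` and its numerical values are all hypothetical and chosen for a toy two-state machine; they are not read off any figure or taken from Holmes et al. (2006).

```python
# Hypothetical looping suffix tree for a toy machine in which a1 never
# changes the state: the a1 child loops back to the root.
ROOT = {"a0": "s1", "b0": "s0"}
ROOT["a1"] = ROOT  # loop edge over the excisable pair a1

# Hypothetical emission table Ω(o | s, a), copied onto the leaves.
EMISSIONS = {
    ("s0", "a"): {"0": 0.7, "1": 0.3},
    ("s1", "a"): {"1": 1.0},
    ("s1", "b"): {"0": 1.0},
}


def lst_lookup(root, history):
    """Walk the tree from the newest ao pair backwards; return a state or None."""
    node = root
    for ao in reversed(history):
        node = node.get(ao)
        if node is None:
            return None  # no branch for this history
        if isinstance(node, str):
            return node  # reached a leaf: the history is resolved
    return None  # history exhausted before any leaf was reached


def predict(root, emissions, history, action):
    """Return the observation distribution at the leaf for this history."""
    state = lst_lookup(root, history)
    if state is None:
        return None
    return emissions.get((state, action))
```

Here `lst_lookup(ROOT, ["a0", "a1", "a1"])` skips the two trailing a1 pairs via the loop and lands on the leaf for s1, and `predict` then simply reads off the stored distribution for the chosen action.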

An hPOMDP\R has the property that there exists a function φ : H → S such that φ(h) = s ∈ S. The looping suffix tree constructed in the above proof is effectively a representation of this function φ. It should be noted that the above is a proof of existence, and not a completely constructive one, unlike the proof of Holmes et al. (2006), since the emission probabilities are simply copied from the original generating process rather than learned. It is possible that, by observing the frequency of observation-action tuples at each leaf node of the constructed LST that is consistent with the history so far, one can make an empirical estimate of the emission probabilities that converges asymptotically to the true emission probabilities. However, we are more interested in the use of LSTs in predicting reward distributions, which we pursue empirically in the following section.
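The frequency-based estimate mentioned above could look like the following sketch, which counts observations per (leaf state, action) pair and normalises the counts. This is a plain relative-frequency estimator offered only as an illustration; the function name `estimate_emissions`, the input format and the sample data are assumptions, and nothing here establishes the convergence that the text describes as a possibility.

```python
from collections import Counter, defaultdict


def estimate_emissions(visits):
    """Estimate Ω(o | s, a) from (state, action, observation) triples.

    `state` is assumed to be the leaf reached in the LST before the action.
    """
    counts = defaultdict(Counter)
    for state, action, obs in visits:
        counts[(state, action)][obs] += 1
    estimates = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        estimates[key] = {obs: n / total for obs, n in counter.items()}
    return estimates


# Hypothetical sample of visits to two leaves.
sample = [
    ("s0", "a", "0"),
    ("s0", "a", "1"),
    ("s0", "a", "0"),
    ("s1", "b", "0"),
]
estimates = estimate_emissions(sample)
```

Pairs that were never visited simply do not appear in the result, so a caller would need its own policy for unseen (state, action) pairs.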
