
3 Abstract Value Functions

Programs of the type we consider cannot be executed directly; the nondeterminism in the program must be resolved first. Naturally, the agent executing the program strives for an optimal execution strategy. Optimality, in this case, is defined with respect to the expected reward accumulated during the first $h$ steps of the program and the probability that these first $h$ steps can be executed successfully (i.e., the probability of not running into a situation in which the program cannot be executed any further). These two quantities are measured by the value functions $V_h^{\delta}(s)$ and $P_h^{\delta}(s)$, respectively. Our intention, and the key to abstracting from the actual situation, is to identify regions of the state space in which these functions are constant. The advantages of such abstract value functions are twofold. First, they can be pre-computed, since they are independent of the actual situation (and of the initial situation). This allows us to simplify the formulas in advance and thereby lower the time needed to evaluate them. Second, abstract value functions allow us to assess the value of a program containing nondeterministic choices of action arguments without explicitly enumerating all possible (ground) choices for these arguments. Rather, they abstract from the actual choices by identifying properties of the action arguments that lead to a certain value for the expected reward and for the probability of successfully executing the program, respectively. For instance, suppose a high reward is given to situations where a green block is on top of a non-green block, and the program tells the agent to pick a block and move it onto another nondeterministically chosen block. Then the value function for the expected reward distinguishes the cases where a green block is moved on top of a non-green block from the other constellations. What it does not do is refer explicitly to the green and the non-green blocks in the domain instance. Thus, given these abstract value functions, the nondeterminism in the program can be resolved by settling on the choice that maximizes the abstract value functions when evaluated in the current situation.
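The blocks example can be made concrete with a minimal Python sketch. It is not part of the formalism: the encoding of cases as (condition, value) pairs, the names `CASES` and `evaluate`, and the values 10.0 and 0.0 are assumptions chosen for illustration. The point it illustrates is that the same pre-computed cases can be evaluated in different domain instances.

```python
# Hypothetical abstract value function for "move block x onto block y".
# Each case pairs a condition on the arguments with a value; the
# values 10.0 and 0.0 are assumptions chosen for illustration.
def green_on_non_green(color, x, y):
    return color[x] == "green" and color[y] != "green"

CASES = [
    (green_on_non_green, 10.0),
    (lambda color, x, y: not green_on_non_green(color, x, y), 0.0),
]

def evaluate(cases, color, x, y):
    """Return the value of the unique case that holds for (x, y)."""
    matches = [value for cond, value in cases if cond(color, x, y)]
    assert len(matches) == 1, "cases must partition the argument space"
    return matches[0]

# The same pre-computed cases serve two different domain instances.
instance_1 = {"a": "red", "b": "green"}
instance_2 = {"p": "green", "q": "green", "r": "blue"}
v1 = evaluate(CASES, instance_1, "b", "a")  # green onto red
v2 = evaluate(CASES, instance_2, "p", "q")  # green onto green
```

The cases mention only properties of the arguments (their colours), never particular blocks, which is what allows them to be computed before the domain instance is known.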

For a program $\delta$ and a horizon $h$ we compute case statements $V_h^{\delta}(s)$ and $P_h^{\delta}(s)$ representing the abstract value functions. As can be seen in the definition below, the computation of these case statements is independent of the situation $s$. $V_h^{\delta}(s)$ and $P_h^{\delta}(s)$ are defined inductively on the structure of $\delta$. Since the definition is recursive, we first need to assume that the horizon $h$ is finite and that the programs are nil-terminated, which is easily achieved by sequentially combining a program with the empty program nil.

1. Zero horizon: $V_0^{\delta}(s) \stackrel{def}{=} rCase(s)$ and $P_0^{\delta}(s) \stackrel{def}{=} case[true, 1]$.

For the remaining cases we assume $h > 0$.

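2. The empty program nil:

\[ V_h^{nil}(s) \stackrel{def}{=} rCase(s) \quad\text{and}\quad P_h^{nil}(s) \stackrel{def}{=} case[true, 1] \]

3. The program begins with a stochastic action $A(\vec{x})$ with outcomes $N_1(\vec{x}), \ldots, N_k(\vec{x})$:

\[ V_h^{A(\vec{x});\delta}(s) \stackrel{def}{=} rCase(s) \oplus \bigoplus_{j=1}^{k} pCase_j^{A}(\vec{x}, s) \otimes \mathcal{R}\bigl(V_{h-1}^{\delta}(do(N_j(\vec{x}), s))\bigr) \]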

That is, the expected value is the sum of the immediate reward and the expected values of executing the remaining program in the possible successor situations $do(N_j(\vec{x}), s)$, each weighted by the probability of seeing the deterministic action $N_j(\vec{x})$ as the outcome. Due to regression, the formulas refer only to the situation $s$ and not to any of the successor situations.

For the probability of successfully executing the program, the definition is similar, except that the immediate reward is ignored:

\[ P_h^{A(\vec{x});\delta}(s) \stackrel{def}{=} \bigoplus_{j=1}^{k} pCase_j^{A}(\vec{x}, s) \otimes \mathcal{R}\bigl(P_{h-1}^{\delta}(do(N_j(\vec{x}), s))\bigr) \]

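4. The program begins with a test action:

\[ V_h^{\vartheta?;\delta}(s) \stackrel{def}{=} (\vartheta[s] \wedge V_h^{\delta}(s)) \cup (\neg\vartheta[s] \wedge rCase(s)) \]

If the test does not hold, the execution of the program has to be aborted and consequently no further rewards are obtained.

\[ P_h^{\vartheta?;\delta}(s) \stackrel{def}{=} (\vartheta[s] \wedge P_h^{\delta}(s)) \cup case[\neg\vartheta[s], 0] \]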

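5. The program begins with a conditional:

\[ V_h^{\text{if } \vartheta \text{ then } \delta_1 \text{ else } \delta_2 \text{ end};\delta}(s) \stackrel{def}{=} (\vartheta[s] \wedge V_h^{\delta_1;\delta}(s)) \cup (\neg\vartheta[s] \wedge V_h^{\delta_2;\delta}(s)) \]

$P_h^{\text{if } \vartheta \text{ then } \delta_1 \text{ else } \delta_2 \text{ end};\delta}(s)$ is defined analogously.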

6. The program begins with a nondeterministic branching:

\[ V_h^{(\delta_1 \mid \delta_2);\delta}(s) \stackrel{def}{=} casemax\bigl(V_h^{\delta_1;\delta}(s) \cup_{\geq} V_h^{\delta_2;\delta}(s)\bigr) \]

where $\cup_{\geq}$ is an extended version of the $\cup$-operator that additionally sorts the formulas according to their values such that $v_i \geq v_{i+1}$ holds in the resulting case statement. Another minor modification of the $\cup$-operator is necessary to keep track of where the formulas originate. The resulting case statement then looks like this:

\[ case[\phi_i, v_i \rightarrow idx_i] \]

where $idx_i = 1$ if $\phi_i$ stems from $V_h^{\delta_1;\delta}(s)$ and $idx_i = 2$ if $\phi_i$ stems from $V_h^{\delta_2;\delta}(s)$. This allows the agent to reconstruct which branch has to be chosen when $\phi_i$ holds in the current situation. For all further operations on the case statement these mappings can be ignored. $P_h^{(\delta_1 \mid \delta_2);\delta}(s)$ is defined analogously.

7. The program begins with a nondeterministic choice of arguments:

\[ V_h^{\pi x.(\gamma);\delta}(s) \stackrel{def}{=} casemax\ \exists x.\, V_h^{\gamma;\delta}(s) \]

Note that the resulting case statement is independent of the actually available choices for $x$. The formulas $\phi_i(x, s)$ in $V_h^{\gamma;\delta}(s)$ (which mention $x$ as a free variable) describe how the choice for $x$ influences the expected reward for the remaining program $\gamma;\delta$. To obtain $V_h^{\pi x.(\gamma);\delta}(s)$, the casemax-operator is applied to the existentially quantified case statement $V_h^{\gamma;\delta}(s)$. Again, $P_h^{\pi x.(\gamma);\delta}(s)$ is defined analogously.

8. The program begins with a sequence:

\[ V_h^{[\delta_1;\delta_2];\delta_3}(s) \stackrel{def}{=} V_h^{\delta_1;[\delta_2;\delta_3]}(s) \]

that is, we associate the sequential composition to the right. By applying this rule, repeatedly if necessary, the program is transformed into a form to which one of the cases above applies.

9. Procedure calls:

The problem with procedures is that it is not clear how to macro-expand a procedure's body when it includes a recursive procedure call. Similar to how procedures are handled by Golog's Do-macro, we define an auxiliary macro:

\[ V_h^{P(t_1,\ldots,t_n);\delta}(s) \stackrel{def}{=} P(t_1[s], \ldots, t_n[s], \delta, s, h, v) \]

We consider programs including procedure definitions to have the following form:

\[ \{\text{proc } P_1(\vec{v}_1)\ \delta_1\ \text{end}; \cdots; \text{proc } P_n(\vec{v}_n)\ \delta_n\ \text{end}; \delta_0\} \]

Then, we define the optimal expected value obtainable for executing the first $h$ steps of such a program as:

\[ V_h^{\{\text{proc } P_1(\vec{v}_1)\ \delta_1\ \text{end}; \cdots; \text{proc } P_n(\vec{v}_n)\ \delta_n\ \text{end}; \delta_0\}}(s) \stackrel{def}{=} \forall P_i.\Bigl[\bigwedge_{i=1}^{n} \forall \vec{v}_i, s', h', \delta', v.\ v = V_{h'}^{\delta_i;\delta'}(s') \supset P(\vec{v}_i, \delta', s, h, v)\Bigr] \supset V_h^{\delta_0}(s) \]
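The stochastic-action case (item 3) can be traced on a toy domain with the following Python sketch. It assumes the usual cross-product reading of $\oplus$ and $\otimes$ on case statements (conjoin the formulas, add or multiply the values) and replaces regression by composing a formula with the outcome's successor function. The list-of-pairs encoding, the helper names and the toy fluent `goal` are all illustrative assumptions.

```python
from itertools import product

# A case statement is encoded as a list of (formula, value) pairs, where a
# formula is a predicate over a state. This encoding is an assumption.

def combine(c1, c2, op):
    """Cross product of two case statements; values are combined with op."""
    return [(lambda s, f=f, g=g: f(s) and g(s), op(v, w))
            for (f, v), (g, w) in product(c1, c2)]

def case_add(c1, c2):  # plays the role of the ⊕ operator
    return combine(c1, c2, lambda v, w: v + w)

def case_mul(c1, c2):  # plays the role of the ⊗ operator
    return combine(c1, c2, lambda v, w: v * w)

def regress(case, outcome):
    """Rewrite formulas about outcome(s) into formulas about s itself."""
    return [(lambda s, f=f: f(outcome(s)), v) for f, v in case]

def evaluate(case, s):
    """Return the value of the single formula that holds in s."""
    values = [v for f, v in case if f(s)]
    assert len(values) == 1, "formulas must partition the state space"
    return values[0]

# Toy domain (hypothetical): a state is a dict with one fluent "goal".
r_case = [(lambda s: s["goal"], 1.0), (lambda s: not s["goal"], 0.0)]
n1 = lambda s: {"goal": True}   # outcome 1 achieves the goal
n2 = lambda s: dict(s)          # outcome 2 changes nothing
p_case = {1: [(lambda s: True, 0.75)], 2: [(lambda s: True, 0.25)]}

# V_1^{A;nil}(s) = rCase(s) ⊕ ⊕_j pCase_j(s) ⊗ R(V_0^{nil}(do(N_j, s)))
v0 = r_case
v1 = r_case
for j, outcome in ((1, n1), (2, n2)):
    v1 = case_add(v1, case_mul(p_case[j], regress(v0, outcome)))

value_without_goal = evaluate(v1, {"goal": False})
value_with_goal = evaluate(v1, {"goal": True})
```

In this toy domain the case statement `v1` is built without reference to any particular state; a state is only supplied when `evaluate` is called. The cross product yields eight cases here, several of them unsatisfiable, which hints at why simplifying the formulas ahead of time is worthwhile.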

Lemma 1. For every $\delta$ and $h$, the formulas in $V_h^{\delta}(s)$ and $P_h^{\delta}(s)$ partition the state space.

Proof. (Sketch) By definition the formulas in $rCase(s)$ and $pCase_j^{A}(\vec{x}, s)$ partition the state space. The operations on case statements used in the definition of $V_h^{\delta}(s)$ and $P_h^{\delta}(s)$ retain this property.

4 Semantics

Informally speaking, the semantics of the kind of programs we consider is given by the optimal $h$-step execution of the program. Formally, it is defined by means of the macro $BestDo^{+}(\delta, s, h, \rho)$, where $\delta$ is the program for which an $h$-step policy $\rho$ in situation $s$ is to be computed. A policy is a special kind of program that is intended to be handed directly to the execution system of the agent and executed without further deliberation. A policy for a program $\delta$ "implements" an ($h$-step) execution strategy for $\delta$: it resolves the nondeterminism in $\delta$ and considers the possible outcomes of stochastic actions. In particular, it may proceed differently depending on which outcome has actually been chosen by Nature.

The macro $BestDo^{+}(\delta, s, h, \rho)$ is defined inductively on the structure of $\delta$. Its definition is in parts quite similar to that of DTGolog's BestDo, which is why we do not present all cases here but focus on those where the definitions differ. Clearly, if $h$ equals zero the horizon has been reached and the execution of $\delta$ is terminated. This is denoted by the special action Stop.

\[ BestDo^{+}(\delta, 0, s, \rho) \stackrel{def}{=} \rho = Stop \]

If the program begins with a stochastic action $\alpha$, a policy for every possible outcome $n_1, \ldots, n_k$ is determined by means of the auxiliary macro $BestDoAux^{+}$, which expects a list of deterministic outcome actions as its first argument. $senseEffect_{\alpha}$ is the sense action associated with $\alpha$.

\[ BestDo^{+}([\alpha; \delta], h, s, \rho) \stackrel{def}{=} \exists\rho'.\ \rho = [\alpha; senseEffect_{\alpha}; \rho'] \wedge BestDoAux^{+}(\{n_1, \ldots, n_k\}, h, s, \rho') \]

If the first argument, the list of outcome actions, is empty, then $BestDoAux^{+}$ expands to

\[ BestDoAux^{+}(\{\}, h, s, \rho) \stackrel{def}{=} \rho = Stop. \]

Otherwise, the first action $n_1$ of the list is extracted and (if it is possible) a policy for the remaining program starting in the situation $do(n_1, s)$ is computed by $BestDo^{+}$. Then, the policy is assembled by branching over the sense outcome condition $\theta_1$ for outcome $n_1$. The if-branch is determined by $BestDo^{+}(\delta, h - 1, do(n_1, s), \rho_1)$; the else-branch by the $BestDoAux^{+}$ macro for the remaining outcome actions.

\[
\begin{aligned}
BestDoAux^{+}(\{n_1, \ldots, n_k\}, h, s, \rho) \stackrel{def}{=}\ & \neg Poss(n_1, s) \wedge BestDoAux^{+}(\{n_2, \ldots, n_k\}, h, s, \rho) \\
\vee\ & Poss(n_1, s) \wedge \exists\rho'.\ BestDoAux^{+}(\{n_2, \ldots, n_k\}, h, s, \rho') \\
& \wedge \exists\rho_1.\ BestDo^{+}(\delta, h - 1, do(n_1, s), \rho_1) \wedge \rho = \text{if } \theta_1 \text{ then } \rho_1 \text{ else } \rho'
\end{aligned}
\]
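The recursion over the outcome list can be sketched functionally in Python. This is a simplified reading under stated assumptions: policies are nested tuples, `"Stop"` stands for the Stop action, and `poss`, `sense_cond`, `do` and `best_do_rest` are hypothetical callables standing in for $Poss$, the sense outcome condition $\theta$, the successor situation and the recursive $BestDo^{+}$ call for the remaining program.

```python
# Policies are nested tuples; "Stop" is the terminal policy. All helper
# names passed in as arguments are hypothetical stand-ins.

def best_do_aux(outcomes, h, s, poss, sense_cond, do, best_do_rest):
    """Build an if-then-else chain over the possible outcome actions."""
    if not outcomes:
        return "Stop"
    first, rest = outcomes[0], outcomes[1:]
    else_policy = best_do_aux(rest, h, s, poss, sense_cond, do, best_do_rest)
    if not poss(first, s):
        return else_policy  # impossible outcomes contribute no branch
    then_policy = best_do_rest(h - 1, do(first, s))
    return ("if", sense_cond(first), then_policy, else_policy)

# Toy usage: situations are tuples of actions, and outcome n2 is impossible.
policy = best_do_aux(
    ["n1", "n2", "n3"], 2, (),
    poss=lambda a, s: a != "n2",
    sense_cond=lambda a: "theta_" + a,
    do=lambda a, s: s + (a,),
    best_do_rest=lambda h, s: ("rest", h, s),
)
```

The resulting policy branches on `theta_n1`, then on `theta_n3`, and ends in `"Stop"`; the impossible outcome `n2` does not appear.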

The cases where the program begins with a test action or a conditional are handled much as in DTGolog's BestDo, which is why we omit them here. The cases where the program begins with a nondeterministic statement are handled quite differently, though. Whereas BestDo computes the expected reward as well as the probability of successfully executing the remaining program for the current situation, $BestDo^{+}(\delta, s, h, \rho)$ relies on $V_h^{\delta}(s)$ and $P_h^{\delta}(s)$ for that. If the program begins with a nondeterministic branching, another auxiliary macro is necessary:

\[ BestDo^{+}((\delta_1 \mid \delta_2); \delta, s, h, \rho) \stackrel{def}{=} BestDoNDet((\delta_1 \mid \delta_2); \delta, s, h, case[\phi_i(s), (v_i, p_i) \rightarrow idx_i], \rho) \]

where the fourth argument of BestDoNDet, the case statement, is the result of applying the casemax-operator to

\[ \bigl(V_h^{\delta_1;\delta}(s) \circ P_h^{\delta_1;\delta}(s)\bigr) \cup_{\geq} \bigl(V_h^{\delta_2;\delta}(s) \circ P_h^{\delta_2;\delta}(s)\bigr) \]

where $\geq$ implies an ordering over tuples $(v_i, p_i)$ and implements the trade-off between the expected reward and the probability of successfully executing the program. The BestDoNDet-macro is then defined as:

\[ BestDoNDet((\delta_1 \mid \delta_2); \delta, s, h, case[\phi_i(s), (v_i, p_i) \rightarrow idx_i], \rho) \stackrel{def}{=} \bigvee_i \phi_i(s) \wedge BestDo^{+}(\delta_{idx_i}; \delta, s, h, \rho) \]

According to Lemma 1 exactly one of the $\phi_i(s)$ holds, and thus the decision of whether to continue with the policy computed for $\delta_1; \delta$ or for $\delta_2; \delta$ is unambiguous.
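The branch selection can be illustrated with a small Python sketch. It assumes that casemax restricts each formula of a sorted case statement to the states in which no earlier (better) formula holds. The ordering over $(v_i, p_i)$ tuples is left open in the text; the key `v * p` used below is purely an assumption for the example, as are the encoding and the function names.

```python
# Each entry is (formula, (v, p), idx): value, success probability and the
# index of the branch the formula stems from. Encoding is an assumption.

def tag(case, idx):
    return [(f, vp, idx) for f, vp in case]

def union_sorted(c1, c2, key):
    """Merge both case statements, best (v, p) tuple first."""
    merged = tag(c1, 1) + tag(c2, 2)
    return sorted(merged, key=lambda c: key(c[1]), reverse=True)

def casemax(sorted_case):
    """Restrict each formula to states where no earlier formula holds."""
    result = []
    for i, (f, vp, idx) in enumerate(sorted_case):
        earlier = [g for g, _, _ in sorted_case[:i]]
        result.append((lambda s, f=f, earlier=earlier:
                       f(s) and not any(g(s) for g in earlier), vp, idx))
    return result

def choose_branch(case, s):
    """Return the index of the branch to continue with in state s."""
    chosen = [idx for f, _, idx in case if f(s)]
    assert len(chosen) == 1, "exactly one formula must hold"
    return chosen[0]

# Toy case statements for the two branches (hypothetical values).
branch_1 = [(lambda s: s["goal"], (2.0, 1.0)),
            (lambda s: not s["goal"], (0.5, 1.0))]
branch_2 = [(lambda s: True, (1.0, 1.0))]

trade_off = lambda vp: vp[0] * vp[1]  # assumed ordering over (v, p)
best = casemax(union_sorted(branch_1, branch_2, trade_off))
idx_with_goal = choose_branch(best, {"goal": True})
idx_without_goal = choose_branch(best, {"goal": False})
```

With these toy values the first branch is chosen when `goal` holds and the second otherwise; the third formula produced by `casemax` is unsatisfiable and never selected.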

If the remaining program begins with a nondeterministic choice of arguments, the definition of $BestDo^{+}$ again relies on an auxiliary macro, BestDoPick:

\[ BestDo^{+}(\pi x.(\gamma); \delta, s, h, \rho) \stackrel{def}{=} BestDoPick(\pi x.(\gamma); \delta, s, h, V_h^{\gamma;\delta}(s) \circ P_h^{\gamma;\delta}(s), \rho) \]

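The definition of $BestDoPick(\pi x.(\gamma); \delta, s, h, case[\phi_i(x, s), (v_i, p_i)], \rho)$ resembles the way the casemax-operator works. We assume that the $\phi_i(x, s)$ are sorted such that $(v_i, p_i) \geq (v_{i+1}, p_{i+1})$. Then:

\[ BestDoPick(\pi x.(\gamma); \delta, s, h, case[\phi_i(x, s), (v_i, p_i)], \rho) \stackrel{def}{=} \bigvee_i \Bigl[\bigwedge_{j<i} \neg\exists x.\ \phi_j(x, s) \wedge \exists x.\ [\phi_i(x, s) \wedge BestDo^{+}(\gamma; \delta, s, h, \rho)]\Bigr] \vee \Bigl[\bigwedge_i \neg\exists x.\ \phi_i(x, s) \wedge \rho = Stop\Bigr] \]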

Note that the existential quantifier over the $\phi_i$ also ranges over the macro $BestDo^{+}$, and thus the $x$ which occurs as a free variable in the policy returned by $BestDo^{+}(\gamma; \delta, s, h, \rho)$ is bound by the existential quantifier such that $\phi_i(x, s)$ holds.
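Operationally, the BestDoPick definition amounts to scanning the sorted cases and committing to the first one that has a witness for $x$. The Python sketch below is one such reading. It assumes a finite, enumerable domain of candidate objects, which the logical definition itself does not require; `best_do_pick`, `domain` and `best_do_rest` are hypothetical names.

```python
def best_do_pick(sorted_cases, domain, s, best_do_rest):
    """Pick x from the best case that has a witness; otherwise Stop.

    sorted_cases: list of (formula(x, s), (v, p)), best tuple first.
    domain: finite collection of candidate objects (an assumption).
    """
    for formula, _ in sorted_cases:
        witnesses = [x for x in domain if formula(x, s)]
        if witnesses:
            # Any witness of this case yields the same (v, p) tuple.
            return best_do_rest(witnesses[0], s)
    return "Stop"

# Toy usage: prefer picking up a green block (hypothetical values).
cases = [
    (lambda x, s: s[x] == "green", (10.0, 1.0)),
    (lambda x, s: s[x] != "green", (0.0, 1.0)),
]
pickup = lambda x, s: ("pickup", x)

policy_mixed = best_do_pick(cases, ["a", "b"], {"a": "red", "b": "green"}, pickup)
policy_all_red = best_do_pick(cases, ["a", "b"], {"a": "red", "b": "red"}, pickup)
policy_empty = best_do_pick(cases, [], {}, pickup)
```

The chosen `x` is passed on to the policy for the remaining program, mirroring the way the existential quantifier binds the free variable in the returned policy.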

Theorem 1. For any DTGolog program $\delta$,

\[ \mathcal{D} \models \forall\rho.\ \exists p, v.\ BestDo(\delta, h, S_0, p, v, \rho) \equiv BestDo^{+}(\delta, h, S_0, \rho) \]

(We assume that all restricted nondeterministic choices of arguments in $\delta$ have been rewritten as nondeterministic branchings.)

There seems to be an anomaly in the definition of DTGolog's BestDo-macro. For primitive actions, the reward obtained in the situation before the action is executed is taken into account; for stochastic actions it is not. For instance, let $A$ be a primitive, deterministic action and $B$ a stochastic action with $A$ as its sole outcome action (which is chosen by Nature with a probability of 1). Then the expected rewards for executing $A$ and $B$ may differ, which seems strange. This anomaly can easily be "fixed" by considering the reward obtained in the situation before a stochastic action is executed:

\[ BestDo([\alpha; \delta], s, h, \rho, v, pr) \stackrel{def}{=} \exists\rho', v'.\ BestDoAux(\{n_1, \ldots, n_k\}, \delta, s, h, \rho', v', pr) \wedge v = reward(s) + v' \wedge \rho = [\alpha; senseEffect_{\alpha}; \rho'] \]

For the proof of Theorem 1 we assumed the definition of BestDo as shown above.
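The anomaly can be reproduced numerically with a deliberately simplified Python model. It assumes a one-step program followed by nil, whose value is the reward of the resulting situation, and it reads the two BestDo variants as described above; the reward function, the integer situations and the function names are invented for the example.

```python
def value_primitive(reward, s, succ):
    """Deterministic action followed by nil: reward(s) plus the rest."""
    return reward(s) + reward(succ(s))

def value_stochastic(reward, s, outcomes, include_current_reward):
    """Stochastic action followed by nil, with or without reward(s)."""
    expected = sum(p * reward(succ(s)) for p, succ in outcomes)
    return (reward(s) if include_current_reward else 0.0) + expected

# Toy model (hypothetical): situations are integers, reward(s) = s.
reward = lambda s: float(s)
succ_a = lambda s: s + 1

value_a = value_primitive(reward, 3, succ_a)
# B is stochastic with A as its sole outcome (probability 1).
value_b_original = value_stochastic(reward, 3, [(1.0, succ_a)], False)
value_b_fixed = value_stochastic(reward, 3, [(1.0, succ_a)], True)
```

In this model the values of $A$ and $B$ differ by exactly $reward(s)$ under the original reading and coincide once the reward of the current situation is added for stochastic actions as well.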