Strongly feasible trajectories - Saddle point algorithm

4.4 Saddle point algorithm

4.4.1 Strongly feasible trajectories

We begin by studying the saddle point controller defined by (4.32) and (4.33) in a problem in which optimality is not taken into account, i.e., f0(t, x) ≡ 0. In this case the action

descent equation of the controller (4.32) takes the form ˙

x= ΠX(x,−KLx(t, x, λ)) = ΠX(x,−Kfx(t, x)λ), (4.34)

while the multiplier ascent equation (4.33) remains unchanged. The bounds to be derived for the fit ensure that the trajectories x(t) are strongly feasible in the sense of Definition 5. To state the result consider an arbitrary fixed action ¯x∈X and an arbitrary multiplier ¯

λ∈Λ and define the energy function

V_x,_¯_λ¯(x, λ) = 1

2 ||x−x¯||

2₊_||_λ₋_¯_λ_||2

. (4.35) We can then bound fit in terms of the initial value V_x,_¯¯_λ(x(0), λ(0)) of the energy function

for properly chosen ¯x and ¯λas we formally state next.

Theorem 9. Let f : R×X → Rm, satisfying assumptions 6 and 7, where X ⊆ Rn is

a convex set. If the environment is viable, then the solution x(t) of the dynamical system defined by (4.34)and (4.33)is strongly feasible for allT >0. Specifically, the fit is bounded by

FT ,i ≤ min x†_∈_X†

KVx†,ei(x(0), λ(0)), (4.36) where ei withi= 1. . . m form the canonical base of Rm.

Proof. Consider action trajectoriesx(t) and multiplier trajectoriesλ(t) and the corresponding energy functionV_x,_¯_λ¯(x(t), λ(t)) in (4.35) for arbitrary given action ¯x∈Xand multiplier

λ∈Λ. The derivative ˙V_x,_¯¯_λ(x(t), λ(t)) of the energy with respect to time is then given by

V_x,_¯¯_λ(x(t), λ(t)) = (x(t)−x¯)>x˙(t) + (λ(t)−λ¯)>λ˙(t). (4.37)

Substitute the action and multiplier derivatives by their corresponding values given in (4.34) and (4.33) to reduce (4.37) to

V_x,_¯¯_λ(x(t), λ(t)) =(x(t)−x¯)>ΠX(x,−Kfx(t, x(t))λ(t))

Then, using the result of Lemma 11 for bothX and Λ, the following inequality holds ˙

V_x,_¯_λ¯(x(t), λ(t))≤K(¯x−x(t))>fx(t, x(t))λ(t)

+K(λ(t)−λ¯)>f(t, x(t)). (4.39) Notice that f(t, x)λ(t) is a convex function with respect to the action, therefore we can upper bound the inner product (¯x−x(t))>fx(t, x(t))λ(t) by the quantity f(t,x¯)>λ(t)− f(t, x(t))>λ(t) and transform (4.39) into

V_x,_¯¯_λ(x(t), λ(t))≤K(f(t,x¯)−f(t, x(t)))>λ(t)

+K(λ(t)−¯λ)>f(t, x(t)). (4.40) Further note that in the above equation the second and the third term are opposite. Thus, it reduces to

V_x,_¯¯_λ(x(t), λ(t))≤K h

λ(t)>f(t,x¯)−λ¯>f(t, x(t))i. (4.41) Observe that the integral of the left hand side of the above equation can be written as

Z T

V_¯_x,_λ¯(x(t), λ(t))dt=V_x,_¯_λ¯(x(T), λ(T))−V_x,_¯¯_λ(x(0), λ(0)). (4.42)

Then using the fact thatV_x,_¯¯_λ(x(t)), λ(t))≥0 for all t, yields Z T

V_¯_x,_λ¯(x(t), λ(t))dt≥ −V_x,_¯¯_λ(x(0), λ(0)). (4.43)

Then, integrating both sides of (4.42) and using the bound in (4.43), we have that

Z T

λ>f(t, x(t))−λ(t)>f(t,x¯)dt≤ Vx†,¯λ(x(0), λ(0))

K . (4.44)

Since the environment is viable, there exist a fixed action x† such that f(t, x†) 0 for all

t≥0. Then choosing ¯x=x†, sinceλ(t)0 for allt, we have that

λ(t)>f(t, x†) ≤0∀t∈[0, T]. (4.45) Therefore the left hand side of (4.44) can be lower bounded by

¯ λ> Z T 0 f(t, x(t))dt≤ Vx†,λ¯(x(0), λ(0)) K . (4.46)

all i= 1. . . m: Z T 0 fi(t, x(t))dt≤ V_x†_,ei(x(0), λ(0)) K . (4.47)

Notice that since the above inequality holds for anyx†∈X†it is also true for the particular

x†that minimizes the right hand side. The left hand side of the above inequality is thei–th component of the fit. Thus, since the m components of the fit of the trajectory generated by the saddle point algorithm are bounded for allT, the trajectory is strongly feasible with the specific upper bound stated in (4.36).

Theorem 9 assures that if an environment is viable for an agent that selects actions over a set X, the solution of the dynamical system given by (4.34) and (4.33) is a trajectory

x(t) that is strongly feasible in the sense of Definition 5. This result is not trivial, since the functionf that defines the environment is observed causally and can change arbitrarily over time. In particular, the agent could be faced with an adversarial environment that changes the functionf in a way that makes the value off(t, x(t)) larger. The caveat is that the choice of the functionf must respect the viability condition that there exists a feasible action x† such that f(t, x†) 0 for all t ∈ [0, T]. This restriction still leaves significant leeway for strategic behavior. E.g., in the shepherd problem of Section 4.2.2 we can allow for strategic sheep that observe the shepherd’s movement and respond by separating as much as possible. The strategic action of the sheep are restricted by the condition that the environment remains viable, which in this case reduces to the not so stringent condition that the sheep stay in a ball of radius 2r if all ri =r.

Since the initial value of the energy function V_x†_,ei(x(0), λ(0)) is the square of the dis-

tance betweenx(0) andx†added to a term that depends on the distance between the initial multiplier andei, the bound on the fit in (4.36) shows that the closer we start to the feasible

set the smaller the accumulated constraint violation becomes. Likewise, the larger the gain

K, the smaller the bound on the fit is. As in section 4.3 we observe that increasing K

can make the bound on the fit arbitrarily small, yet for the same reasons discussed in that section this can’t be done.

Further notice that for the saddle point controller defined by (4.34) and (4.33) the action derivatives are proportional not only to the gain K but to the value of the multiplier λ. Thus, to select gains that are compatible with the system’s physical constraints we need to determine upper bounds in the multiplier values λ(t). An upper bound follows as a consequence of Theorem 9 as we state in the following corollary.

Corollary 4. Given the controller defined by (4.34) and (4.33) and assuming the same hypothesis of Theorem 9, if the set of actionsXis bounded in norm byR, then the multipliers

λare bounded for all times by

0≤λi(t)≤ 4R2+ 1, for alli= 1, . . . , m. (4.48)

Proof. First of all notice that according to (4.33) a projection over the positive orthant is performed for the multiplier update. Therefore, for each component of the multiplier we have that λi(t) ≥ 0 for all t ∈ [0, T]. On the other hand, since the trajectory of the multipliers is defined by ˙λ(t) = ΠΛ(λ(t), Kf(t, x(t)), while λ(t) > 0 we have that ˙λ(t) =

Kf(t, x(t)). Lett0 be the first time instant for whichλi(t)>0 for a giveni∈ {1,2, . . . , m},

i.e.,

t0 = inf{t∈[0, T], λi(t)>0}. (4.49)

In addition, let T₀∗ be the first time instant greater thant0 whereλi(t) = 0, if this time is

larger than T we set T₀∗ =T

T₀∗ = max{inf{t∈(t0, T], λi(t)>0}, T}. (4.50)

Further define ts+1 = inf{t∈[Ts∗, T], λi(t)>0},and

T_s∗ = max{inf{t∈(ts, T], λi(t)>0}, T}. (4.51)

From the above definition it holds that in any time in the interval (T_s∗, ts+1],λi(t) = 0. And

therefore in those intervals the multipliers are bounded. In the case whereτ ∈(ts, Ts∗]

Z τ ts ˙ λi(t)dt= Z τ ts Kfi(t, x(t))dt. (4.52) Notice that the right hand side of the above equation is, proportional to thei–th component of the fit restricted to the time interval [t0, τ]. In Theorem 9 it was proved that the i–th

component of the fit is bounded for all time horizons byV_x†_,ei(x(ts),0)/K. In this particular

case we have that

V_x†_,ei(x(t_s),0) = 1 2 (x(ts)−x†)2+ (0−ei)2 , (4.53) and since for any x∈X we have thatkxk ≤R, we conclude

V_x†_,ei(x(ts),0)≤

1 2 (2R)

2_{+ 1}2

. (4.54) Therefore, for all τ ∈ (tsT_s∗] λi(τ) ≤ 1₂ 4R2+ 12. This completes the proof that the multipliers are bounded.

The bound in Corollary 4 ensures that action derivatives ˙x(t) remain bounded if the subgradients are. This means that action derivatives increase, at most, linearly withK and is not compounded by an arbitrary increase of the multipliers. Observe as well, tat the cumulative nature of the fit does not guarantee that the constraint violation is controlled. This is because time intervals of constraint violations can be compensated by time intervals where the constraints are negative. To overcome this issue, we next show that the saddle point controller archives bounded saturated fit for all time horizon.

Corollary 5. Let the hypothesis of Theorem 9 hold. Letδ >0and letF¯_T be the saturated fit defined in (4.5). Then, the solution of the dynamical system (4.34)and (4.33) whenf(t, x)

is replaced by f¯δ(t, x)) = max{f(t, x),−δ} archives a bounded saturated fit. Furthermore the bound is given by

F_{T ,i} ≤ min

x†_∈_X†

KVx†,ei(x(0), λ(0)), (4.55) where ei withi= 1. . . m form the canonical base of Rm.

Proof. Since ¯fδ(t, x) is the pointwise maximum of two convex functions, it is a convex

function itself. As a consequence of Theorem 9 the fit for the environment ¯fδ(t, x) satisfies

Z T 0 ¯ fδ,i(t, x(t))dt≤ min x†_∈_X† 1 KVx†,ei(x(0), λ(0)). (4.56)

The fact that the left hand side of the above equation corresponds to the saturated fit [cf., (4.5)] completes the proof.

The above result establishes that a trajectory that follows the saddle point dynamics for the environment defined by ¯fδ(t, x) achieves bounded saturated fit. This means that it is possible to adapt the controller (4.34) and (4.33), so that the fit is bounded while not alternating between periods of large under and over satisfaction of the constraints.

In document Stochastic Control Foundations Of Autonomous Behavior (Page 97-101)