Dispatch-and-Search: Dynamic Multi-Ferry Control in
Partitioned Mobile Networks
∗Ting He
†, Ananthram Swami
‡, and Kang-Won Lee
††IBM Research, Hawthorne, NY 10532 USA
{the,kangwon}@us.ibm.com
‡Army Research Laboratory, Adelphi, MD 20783 USA
ABSTRACT
We consider the problem of disseminating data from a base station to a sparse, partitioned mobile network by control-lable data ferries with limited ferry-node and ferry-ferry communication ranges. Existing solutions to data ferry con-trol mostly assume the nodes to be stationary, which reduces the problem to designing fixed ferry routes. In the more challenging scenario of mobile networks, existing solutions have focused on single-ferry control and left out an impor-tant issue of ferry cooperation in the presence of multiple ferries. In this paper, we jointly address the issues of ferry navigation and cooperation using the approach of stochastic control. Under the assumption that ferries can communi-cate within each partition, we propose a hierarchical control system called Dispatch-and-Search (DAS), consisting of a global controller that dispatches ferries to individual par-titions and local controllers that coordinate the search for nodes within each partition. Formulating the global and the local control as Partially Observable Markov Decision Processes (POMDPs), we develop efficient control policies to optimize the (discounted) total throughput, which sig-nificantly improve the performance of their predetermined counterparts in cases of limited prior knowledge.
Categories and Subject Descriptors
C.2.1 [Computer-Communication Networks]: Network Architecture and Design—store and forward networks; G.4 [Mathematical Software]:algorithm design and analysis; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search—control theory; I.2.9 [Artificial In-telligence]: Robotics—autonomous vehicles
∗Research was sponsored by the U.S. Army Research Laboratory and
the U.K. Ministry of Defence under Agreement Number W911NF-06-3-0001. The views and conclusions are those of the authors and do not represent the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding copyright notation.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MobiHoc’11,May 16–20, 2011, Paris, France.
Copyright 2011 ACM 978-1-4503-0722-2/11/05 ...$10.00.
General Terms
Algorithms, Design, Performance
Keywords
Data ferry control, Partially Observable Markov Decision Processes, Myopic control policies
1.
INTRODUCTION
The development of robotic technology has led to im-provements in many applications, a recent one being the use of controllable mobile nodes to assist communications in sparse networks. Sparse network topologies are routinely encountered in a broad variety of applications, e.g., sen-sor/actuator networks, search-and-rescue operations, under-water networks, spacecraft networks (“interplanetary com-munications”), vehicular ad hoc networks (VANETs), and military mobile ad hoc networks (MANETs). The sparse networks discussed above pose significant challenges to tra-ditional communication schemes since existing in-network solutions such as routing and delay-tolerant communications are only designed for (intermittently) connected networks and will fail if the network is permanently partitioned. In such partitioned networks, the only remedy is to look for external help, and mobile helper nodes called data ferries, usually mounted on controllable platforms such as UAVs, stand out as a preferred solution in many scenarios due to their agility in contrast to fixed infrastructures. The prob-lem, though, is that proper control has to be in place to leverage these data ferries effectively.
Previous solutions to data ferry control have mostly fo-cused on designing fixed ferry routes [3, 5, 12, 13] for static networks. Although the approach still applies to mobile net-works [9], the performance is generally suboptimal as fixed routes can only leverage long-term statistics of node move-ments. To fully utilize the ferries’ capability, we have shown in [2] that each ferry needs to dynamically adjust its move-ments based on run-time observations. As in [2], we as-sume a ferry can only observe nodes within a certain dis-tance defined by ferry-node communication range, referred to as partial observations1. Although it shows promising
performance improvements over fixed routes, the solution in [2] is limited by (i) only considering a single ferry and (ii) only focusing on contact rate optimization. The avail-ability of multiple ferries raises a new challenge because one
1This is in contrast to assuming complete observations of
ferry’s optimal movements will depend on the movements of the other ferries, which induces a need for the ferries to cooperate. While this is not an issue under unlimited ferry-ferry communications, global cooperation can be expensive, especially for networks with scattered partitions. Further-more, although contact rates are performance indicators, the ultimate goal of the ferries should be to optimize the end-to-end communication performance perceived by nodes, which also depends on other factors such as traffic demands. These new challenges call for a new solution that can efficiently uti-lize multiple ferries to optimize the end-to-end performance without excessive cooperation overhead.
1.1
Related Work
Designated communication nodes, typically called data ferries or data mules, have been considered as an effective so-lution to support communications in sparse networks. Data ferries physically carry data either between task (i.e., non-ferry) nodes [3, 11–14] or between nodes and a base sta-tion [1, 4–6]. The main issue in using such ferries is how to control their mobility, where existing solutions can be di-vided intosingle-ferry control [1, 2, 4, 9, 14] and multi-ferry control [3,5,6,12,13]. An in-between case was studied in [11] where nodes can switch between the roles of ferry and non-ferry. Most existing work focuses on ferry route design un-der the assumption that either the task nodes are station-ary [3–5, 12, 13], or their locations are always known by the ferries if they are mobile (i.e.,completeobservations) [11,14]. A related problem of static but randomly distributed nodes has been studied: using predetermined routes in [6] and dynamic route adaptation based on complete observations in [1]2. The problem of stochastic node movements and
par-tial observability was first studied in [9] using predetermined routes and later improved in [2] using a control policy that navigates the ferry dynamically.
1.2
Our Approach and Results
We consider controlling multiple ferries in partitioned mo-bile networks to optimize the total throughput. We present our solution in the scenario of data dissemination, although our approach extends naturally to the cases of data har-vesting and peer-to-peer communications (see discussions in Section 7). Our specific contributions are:
Flexible control framework: To avoid long-range ferry-ferry communications, we propose a hierarchical control frame-work called Dispatch-and-Search (DAS), which divides the problem into the global control of allocating ferries among partitions and the local control of searching for nodes within each partition. Since the global control can be implemented by the base station (BS), ferries only need to communicate within each partition to cooperate in local control.
Rigorous problem formulation: Under Markovian node mobility, we show that both the global and the local control problems can be formulated as special cases of Partially Ob-servable Markov Decision Process (POMDP). The POMDP model provides a comprehensive representation of available information, control options, and performance objective in a dynamic system, which provides a foundation for rigorous optimization.
Efficient solutions: Based on the POMDP formulation,
2Both works treat each message as a new node which arrives
randomly in space and time; [1] also assumes locations and arrival times of pending messages are known at run time.
domain 1 gateway node domain 2 ferry dispatch return Base Station search r
Figure 1: Example: K = 3 ferries serve D = 2
do-mains in despatch-and-search mode. r: (projected) ferry communication radius.
we develop solutions to global and local control in the form of dynamic control policies. Due to the intrinsic hardness of POMDP, we focus on an efficient heuristic called the my-opic policy. In particular, in global control, even the mymy-opic policy can be computationally expensive; hence, we also develop two approximations with similar performance but much lower complexity than the exact solution.
Performance insights: We evaluate the proposed policies through both analysis and simulations. We obtain closed-form upper and lower bounds on the optimal local perfor-mance, as well as a lower bound on the optimal global per-formance which can be evaluated numerically. Moreover, we compare our solutions with predetermined control through extensive simulations, which show significant performance gain (e.g., 100%+ on discounted throughput and 40%+ on undiscounted throughput) in cases of minimum prior knowl-edge.
We assume Markovian mobility models to rigorously rep-resent the impact of past actions on future decisions. Al-though a limiting assumption, it includes popular models such as random walk/waypoint and their variations as spe-cial cases and can be viewed as a second-order approxima-tion of general mobility models.
The rest of the paper is organized as follows. Section 2 specifies the network model and the DAS framework, Sec-tion 3 gives a brief overview of POMDP, SecSec-tions 4 and 5 address the local and the global control problems respec-tively, Section 6 presents simulation evaluation, and Section 7 concludes the paper.
2.
PROBLEM FORMULATION
2.1
Network Model
Suppose that there is a total ofK ferries andD network partitions, each partition occupying a disjoint region called adomain, as illustrated in Fig. 1. A base station (BS) gen-erates traffic for all nodes at constant rates3 λ = (λd)Dd=1
(fluid model), which is then delivered to the respective mains by the ferries in a delay-tolerant manner. In each do-main, we select a node as the gateway to the ferries, which then disseminates received traffic within the domain using in-network routing; we will simply call these gateway nodes “nodes”. According to the ferry-node communication range, each domain is partitioned into Nd (d= 1, . . . , D) cells
de-noted by a setSd (Sd∩ Sd′ =∅for d6=d′), such that the
transmission of each ferry can cover one cell at a time. We
round ∆
search return dispatch
time
Figure 2: DAS procedure: Dispatch control occurs at the time-scale of rounds (∆ + 1slots) and search control at the time-scale of slots.
assume that node mobility in domaind can be modeled as a Markov chain onSd with transition matrixPd. We also
assume that ferry-ferry communications can span an entire domain but not across domains. The total network field is denoted byS :=SD
d=1Sd of size|S|=N =PDd=1Nd. We
do not restrict the relationships betweenK, D, andNd,
al-though restrictions may apply in later analysis (e.g., Section 5.3).
2.2
System Architecture
Due to the ferry-ferry communication constraint, ferries can only cooperate locally within a domain. Accordingly, we propose a hierarchical control system called “Dispatch-and-Search” (DAS), consisting of a global dispatch controller
πgat the BS in charge of allocating ferries to individual
do-mains, and a local search controller πl,d for each domain
d, located at an arbitrarily selected lead ferry, in charge of jointly navigating ferries dispatched to that domain. Specif-ically, the lead ferry will collect observations from the other ferries in its domain and broadcast navigation actions deter-mined by the search controller in terms of the set of cells to cover (and their matching with the ferries) in each slot in order to contact the gateway node (by any ferry) and deliver traffic. We assume that all ferries dispatched to the same domain carry the same traffic. As illustrated in Fig. 2, dis-patch control operates periodically with period ∆ + 1, called around(∆ is a design parameter); the first ∆ slots are used to search for the gateway nodes in the individual domains; the last slot is used for all the ferries to return to the BS and pick up new traffic, after which the next round begins4.
2.3
Main Problem: DAS Control
Let λπ
β := E[P∞t=1N
π(t)βt] denote the discounted total
throughput for a given discount factorβ ∈ (0, 1) under a DAS control policyπ:= (πg,(πl,d)Dd=1), whereNπ(t) is the
amount of data delivered (over all domains) at timetunder policyπ. Our goal is to design a policyπthat maximizesλπ
β.
Remarks: The discounts are used to model the objective of delivering the data as soon as possible. The above defini-tion is a cumulative throughput, but it represents a delivery rate, defined as the ratio of the delivered traffic and the delivery timeλπβ :=E[(P∞t=1N π(t)βt)/(P∞ t=1β t)], in that λπβ=λπβ(1−β)/β.
3.
PRELIMINARY
3.1
POMDP Recap
POMDP [8] is a general framework to model the con-trol of a stochastic system. We provide a quick overview of POMDP in this section. A POMDP is represented by
4The above assumes that the ferries can move between the
BS and any domain in one slot. Longer BS-domain distances can be modeled by replacing the last slot byT slots during which a round trip can be completed; all the results hold by replacing ∆ + 1 with ∆ +T.
a tuple (X, A,O, r, Px, Po), whereX represents the
pos-sible states of the controlled system, A the set of control actions, O the observations of controller, and r(x, a, o) a reward function modeling the performance criterion based on the state x∈ X, the actiona∈ A, and the observation
o∈ O. The other elementsPxandPodescribe how the state
evolves (Px(x, a, x′) := Pr{xt+1=x′|xt=x, at=a}) and
the observation is generated (Po(x, a, o) := Pr{ot=o|xt=
x, at =a}), and are derived from the first three elements
in a given problem setting. Under partial observability, the controller does not know the true statexprecisely, but can only estimate it up to a distributionb= (b(x))x∈X, called
thebelief, based on the action and observation history. The objective of the controller is to find a policyπ that determines the action to take based on the current belief so as to maximize the total reward over a time window. With an infinite horizon, a common form of the total reward is thediscounted reward
Rπ∞:=E[ ∞
X
t=1
βtr(xt, aπt, ot)] (1)
for a fixed discount factor β ∈(0, 1). The optimal policy
π∗ that maximizes Rπ
∞ is known to be the solution to the
Bellman equation:
V∞(b) =βarg max a
E[r(b, a, o) +V∞(T(b, a, o))], (2)
where V∞(b) := maxπRπ∞|b1=b, called the value function,
is the maximum reward starting from a given belief stateb, the expectation is overPo(b, a, o) :=Pxb(x)·Po(x, a, o),
and the functionb′=T(b, a, o) generates a new belief for
the next step based on the previous belief, the action, and the observation. It is known that solving (2) for the optimal policy is PSPACE-hard in general [7]. A popular alternative is to use themyopic policyin which one only maximizes the averageimmediatereward, and is given by
πMY (b) := arg max a X x b(x)E[r(x, a, o)], (3)
where the expectation is overPo(x, a, o). The myopic
pol-icy is easy to implement and has exhibited excellent per-formance in single-ferry control [2]. We will focus on this policy in the sequel and examine its different forms and per-formances at different levels of control.
3.2
Applying POMDP to Multi-Ferry Control
The use of POMDP is natural for the problem under con-sideration. Under the DAS architecture, the goal of multi-ferry control is to jointly design global dispatch controller and local search controller such that the overall (discounted) throughput is optimized. The challenge is that both con-trollers face a dynamic environment that changes over time, e.g., the search controller relies on node distributions, which evolve due to node mobility, and the dispatch controller re-lies on (predicted) contact processes as well as traffic de-mands for each domain. Moreover, characteristics of this dynamic environment are only partially observable to the controllers due to limited ferry-node communication range. POMDP is well suited for addressing these challenges using its model of dynamic systems under partial observability.
The local and the global controllers have inherent con-nections: the global action affects the local control because the number of ferries dispatched to a domain will largely
determine how long it takes to contact the node; the lo-cal performance also affects the global control because the dispatch controller needs to weigh the impact of dispatch-ing more/fewer ferries to each domain to make a balanced decision. The other parameters such as domain sizes, node mobility models, and traffic rates all play a role in the perfor-mance. How can we capture all these factors in the POMDP framework? What reward functions should we use for lo-cal/global control so that together they achieve throughput optimization? These are the main questions we will answer in the sequel.
4.
LOCAL CONTROL: SEARCH
Consider a domaindto whichk∈ {0, . . . , K}ferries have been dispatched in the current round; assumek ≤N, size of the domain (subscriptd is dropped). Given initial node distributionb0 and mobility modelP, the goal of the local controllerπlis to jointly navigate thekferries to contact the
node and deliver traffic. Moreover, due to the discount in throughput calculation, it is intuitively desirable to make the contact as soon as possible. In this section, we cast the prob-lem of local control as a POMDP and establish lower and upper bounds on the optimal reward, where the lower bound is provided by the myopic policy. Based on the bounds, we discuss conditions under which the myopic policy is close to optimal.
4.1
Local Control as a POMDP
In the language of POMDP, the problem can be formu-lated as follows:
1. Local state: xl
t = (st, δt), where st ∈ S is the node’s
location at timet, andδt ∈ {0,1} indicates whether
the next contact will be the first in the current round (δt= 1) or not (δt= 0);
2. Local action: al
t ⊆ Swith|alt| ≡kdenotes the set of
cells for the ferries to cover in slot t (no two ferries cover the same cell);
3. Local observation: olt = zt ∈ {0, 1} is a joint
con-tact indicator, i.e., zt =Ist∈al
t, where I· denotes the
indicator function;
4. Local reward: the one-time rewardrl(xl, al, ol) =δz
gives a unit reward for the first contact, after whichδ
transits to 0 and no further reward occurs; the overall rewardRπl
∞ is defined as in (1) forrl.
We elaborate on the preceding POMDP formulation: let Υπl denote the time till (the first) contact (TTC) under
policyπl. It is easy to see thatR∞πl =E[βΥ
πl
]. Sinceβx is
a decreasing function, maximizingRπl
∞ leads to minimizing
the TTC in a discounted average sense5. Intuitively, the
local rewardRπl
∞represents the weight each unit of delivered
traffic will contribute to the discounted total throughputλβ,
starting from the current round. Later we will show that local policies optimizing this reward will help to optimize the overallλβ under DAS framework (see Section 5.1).
5The convexity of βx implies that the mean TTC satisfies
E[Υπl]≥logRπl
∞/logβ.
4.2
Local Control Policy
The true statexl
t is not directly known to the controller
since node location st is unknown. Instead, the controller
observes another state yl
t, which consists of the belief bt
of st (bt(s) := Pr{st = s|al1, . . . , alt−1, ol1, . . . , olt−1}) and
the indicator δt. At the end of the slot, the new state
yl t+1= (bt+1, δt+1) transits as6 bt+1 =PT(ztest+ (1−zt)bt\al t), δt+1 =δt(1−zt), (4) whereestorbt\al
tis the updated belief based on observation zt, and multiplying it by the transition matrixPT gives the
predicted belief for the next slot. The myopic search pol-icy is the one that maximizes the probability of immediate contact: πMY l (b) = arg max al X s∈al b(s), (5)
i.e., thekferries will search the cells with the topk proba-bilities in the belief.
4.3
Performance of Local Control
For a given search policy π (subscript l is dropped for simplicity), let Υπdenote the TTC underπfor a given ini-tial beliefb0 andk ferries. The distribution of the random variable Υπ can be characterized as follows. Conditioned
on the event that no contact has occurred beforet(i.e., the previous t−1 slots are misses), letaπ
t and bπt denote the
action and the belief under policy π in slot t, and pπ t the
conditional probability of contact. By definition, we have
pπt = Ps∈aπ
t b
π
t(s), and aπt, bπt can be computed from the
following updates: starting frombπ
1 =PTb0,
aπt =π(bπt), bπt+1=PTbπt\aπ
t, t= 1,2, . . . (6)
Based on pπ
t, it is easy to see that the distribution of Υπ
and the correspondingRπ
∞are given by Pr{Υπ=t}=pπt t−1 Y j=1 (1−pπj), (7) Rπ∞= ∞ X t=1 βtPr{Υπ=t}= ∞ X t=1 βtpπt t−1 Y j=1 (1−pπj). (8) In particular, plugging inaπ t =πlMY(bπt) forπlMYin (5) yields
the reward of myopic policyRMY ∞ .
Although the above method will give the reward for any given policy, it does not indicate how close its performance is to the optimal. Characterization of theoptimal reward R∗∞
is an open question, since it is computationally intractable to compute the optimal policy. Hence, we derive the following lower and upper bounds on the optimal reward. The lower bound comes naturally from the myopic policy. For the up-per bound, consider the passive belief transitions without any observation: b(t) := (PT)tb0. Let B
t,k := max|a|=k
P
s∈ab
(t)(s) be the sum of the k largest elements in b(t). 6Here PT denotes matrix transpose, e
s the unit vector
with one in the sth element, and b\a the posterior
be-lief after a miss given by b\a(s) = 0 for all s ∈ a and
Define R∞:=βT0(1− T0−1 X t=1 Bt,k) + T0−1 X t=1 βtBt,k, (9)
whereT0 := inf{t ≥1 : 1−Pjt−=11Bj,k ≤Bt,k}. We have
the following result.
Theorem 4.1. The rewardR∗
∞of the optimal search
pol-icy satisfies: RMY
∞ ≤R∗∞≤R∞.
To prove the theorem, we introduce the following lemmas; their proofs can be found in Appendix.
Lemma 4.2. For any policy π, its reward R∞π is
mono-tone increasing with the conditional contact probability pπ t
for anyt.
Lemma 4.3. For anytand any π,
pπt ≤
Bt,k
max(1−Pt−1
j=1Bj,k, Bt,k)
=:pt. (10)
Proof: (Theorem 4.1) The lower bound holds trivially. For the upper bound, Lemma 4.2 implies that substituting
pt in Lemma 4.3 into (8) will give an upper bound on the reward: Rπ∞≤ T0 X t=1 βtpt t−1 Y j=1 (1−pj) =βT0(1− T0−1 X t=1 Bt,k) + T0−1 X t=1 βtBt,k=R∞,
which impliesR∗∞≤R∞as the bound holds for anyπ. Note
thatQt−1
j=1(1−pj) = 1−
Pt−1
j=1Bj,k.
Closed-form bounds can be obtained by further bounding
RMY
∞ andR∞(see Appendix for proofs).
Corollary 4.4. For k ferries and a domain of size N
(cells),RMY
∞ ≥βk/[N(1−β) +βk].
In general, there is no non-trivial closed-form upper bound, i.e., R∗∞ can approach β arbitrarily. Under certain
condi-tions, however, we have the following result.
Corollary 4.5. If the initial belief is the steady-state
distributionb∗ (b∗=PTb∗), then
R∞=βT0[1−B∗,k(T0−1)] +B∗,k(β−β
T0)
1−β , (11) whereB∗,k:= max|a|=kPs∈ab∗(s)andT0=⌈B
−1
∗,k⌉.
More-over, if P is doubly stochastic7, then the above reduces to R∞≤βk/[(1−β)N].
Remark: For a large domain (N ≫k), Corollary 4.4 im-plies thatR∗
∞ ≥βk/[(1−β)N], achievable by the myopic
search policy. The worst case is when the node moves equal likely in all directions (i.e.,Pis doubly stochastic) and the initial distribution is uniform, under which Corollary 4.5 saysβk/[(1−β)N] is also an upper bound. In this case, the optimal rewardR∗
∞ ≈ βk/[(1−β)N]. Note that the 7That is, each column ofPalso sums up to one.
steady-state distribution b∗ is uniform if and only if P is
doubly stochastic.
Similar analysis can also be performed for the mean TTC
E[Υπ]. Given a policyπ, the mean TTC can be computed by E[Υπ] =P∞ t=1tp π t Qtj−=11(1−p π j). Analogous to Lemma 4.2,
it can be shown thatE[Υπ] is monotone decreasing with each pπ
t. Accordingly, we can bound the minimum mean TTC by
E[Υ]≤min
π
E[Υπ]≤E[ΥMY], (12)
whereE[ΥMY] is the mean TTC for the myopic policy, bounded
by E[ΥMY] ≤N/k, and E[Υ] is an analytical lower bound
analogous to R∞, given by E[Υ] = T0(1−PTt=10−1Bt,k) +
PT0−1
t=1 tBt,k. Under the conditions thatb0 =b∗ and Pis
doubly stochastic,E[Υ] can be relaxed toE[Υ]≥ 1
2(1 +
N k).
Note that the optimal policy that maximizes Rπ
∞ may not
achieve the minimumE[Υπ].
5.
GLOBAL CONTROL: DISPATCH
The global control operates on top of local control at the macro time scale of rounds. Given local search policies, the goal of the global dispatch policy is to allocate the ferries among the domains at the beginning of each round to max-imize the total throughput. In this section, we formulate the global control as another POMDP, based on which we develop an approximate myopic dispatch policy with a sig-nificantly lower complexity than the original and analyze its performance.
5.1
Global Control as a POMDP
The global control problem can be formulated as the fol-lowing POMDP:
1. Global state: xg
τ = (sτ, mτ), wheresτ = (sτ(d))Dd=1
still denotes node locations in each domain, andmτ =
(mτ(d))Dd=1 denotes the buffer state of the BS, where
mτ(d) is the amount of traffic generated for domain
d, all at the beginning of roundτ (i.e., at time (∆ + 1)(τ−1));
2. Global action: ag
τ = (agτ(d))Dd=1 represents the
distri-bution of the ferries among the domains in round τ
(i.e.,ag τ(d)∈NandPDd=1a g τ(d) =K); 3. Global observation: og τ = (υτ,b′τ), whereυτ = (υτ(d))Dd=1
denotes the TTCs for each domain, andb′τ = (b′τ,d)Dd=1
the updated beliefs of node locations at the end of roundτ (at time (∆ + 1)τ); if there is no contact with a domain either because no ferry serves the domain (ag
τ(d) = 0) or because the ferries fail to contact the
node within time ∆, defineυτ(d) :=∞;
4. Global reward: the one-time reward is defined as
rg(xg,ag,og) :=PDd=1m(d)βυ(d), whereas the overall
reward has a slightly different form from (1):
Rπg ∞ :=E[ ∞ X τ=1 β(∆+1)(τ−1)rg(xgτ,a πg τ ,ogτ)]. (13)
We now explain the meaning of this reward function.
Lemma 5.1. The global reward Rπg
∞ is equal to the
dis-counted total throughputλβ (see Section 2.3) under the
Proof: For a domain d, delivery occurs if and only if there is a contact within ∆, each containingmτ(d) data
(assuming the entire ferry buffer content can be delivered in one shot) at time (∆ + 1)(τ−1) +υτ(d). Its throughput is
thusE[P∞
τ=1mτ(d)·β(∆+1)(τ−1)+υτ(d)]. Summing over all
domains givesRπg
∞ =λβ.
Remarks: Our control objective of maximizingλβ is thus
converted into maximizingRπg
∞ at global level via ferry
allo-cationagτ and maximizing the weightsE[βυ(d)] at local level
via ferry navigation al
t in each domain (under given agτ).
Note that the latter is consistent with the local reward8 in
Section 4.1.
Here the period ∆ is introduced to facilitate synchroniza-tion among ferries. Its selecsynchroniza-tion involves a tradeoff: a smaller ∆ means lower probabilities of contact per round but more rounds during a given time, whereas a larger ∆ means higher probabilities of contact per round but fewer rounds. The ex-act value can be tuned to optimize the overall performance (see Fig. 8 and 11).
5.2
Global Control Policy
Let yg
τ denote the observed state. Since one part (sτ)
of the state xg
τ is partially observable and the other part
(mτ) is completely observable, we have a mixed stateygτ :=
(bτ,mτ), wherebτ = (bτ,d)Dd=1are the beliefs of node
loca-tionssτ as reported by the local controllers. At the end of
each round, the new beliefsbτ+1will go through a sequence
of transitions depending on the actions and observations of the local controllers by (4). Instead of repeating those de-tails, we simply consider the outputb′
τ as part of the global
observation such thatbτ+1=b′τ. The BS buffer statemτ+1
transits differently for each domain depending on whether the domain receives service or not:
mτ+1(d) =λd(∆ + 1) +mτ(d)Iυτ(d)>∆. (14)
The global control is much harder to solve than the local control due to the large solution space with|A|= K+KD−1
actions. Themyopic dispatch policy is given by:
πMY g (yg) := arg max ag D X d=1 m(d)E[βυ(d)]. (15)
Let Υ(d) denote the TTC in domain d without deadline constraint, whose distribution is given by (7) for initial node distributionbdandag(d) ferries. By definition, we can
com-puteE[βυ(d)] as a truncated averageP∆
t=1βtPr{Υ(d) =t}.
The problem is that even the myopic policy can become complex if the action space is large. To address this issue, we look at simpler approximations. First, we note that the global reward is related to the local reward as follows.
Lemma 5.2. For each domain with ag(d) > 0 and local
control rewardRπl d,R πl d −β∆≤E[βυ(d)]≤R πl d .
Proof: See Appendix.
The lemma indicates that for sufficiently large ∆, the my-opic policy for global control will allocate the ferries to maxi-mize the weighted sum of local control rewardsP
dm(d)R πl
d .
Assuming myopic search policies are used for local control, from the previous analysis (Corollary 4.4), we can replace
8A subtle difference is thatυ(d) is the TTC truncated at
∆, although the difference will be small for large ∆; see Lemma 5.2.
Rπl
d by its lower boundR πl
d ≥βa
g(d)/[(1−β)N
d+βag(d)],
which gives anapproximate myopic dispatch policy:
πMY g (yg)≈arg max ag D X d=1 m(d)ag(d) (1−β)Nd+βag(d) . (16)
A nice property of this approximation is that the function
fd(a) :=m(d)a/[(1−β)Nd+βa] is increasing and concave
in a, i.e., extra ferries for a domain only provide diminish-ing gain compared with existdiminish-ing ferries. This monotonicity means that instead of searching all K+KD−1
possible val-ues ofag, we can sequentially dispatch one ferry at a time to maximize the reward gain among all the domains. This property implies the following algorithm for implementing (16). Starting fromag=0, repeat fork= 1, . . . , K:
1. findd∗= arg maxd=1,...,Dfd(ag(d) + 1)−fd(ag(d));
2. updateag(d∗)←ag(d∗) + 1.
Compared with theO(∆D K+KD−1
) complexity9(per round) of the myopic dispatch policy, this approximation signifi-cantly reduces the complexity toO(KD).
In the special case of large domains (Nd ≫ K, ∀d), we
can further simplify (16) into anasymptotically approximate myopic dispatch policy:
πMY g (yg)≈arg max ag D X d=1 m(d)ag(d) Nd , (17)
for which the optimal action is simply to dispatch all the ferries to domaind∗= arg maxdm(d)/Nd. In contrast, for
smaller domains, the ferries may be split among multiple domains in a round. In fact, under (16), they will be dis-patched to the same domain if and only if ∃d such that
fd(K)−fd(K−1)≥fd′(1)−fd′(0) for anyd′6=d.
5.3
Performance of Global Control
To analyze the performance of global control, it is neces-sary to characterize the evolution of buffer states{mτ}∞τ=1.
For each domaind, ifqτ(d) denotes the delivery probability
in roundτ, then we see from (14) that{mτ(d)}∞τ=1follows a
Markovian-like transition diagram in Fig. 3. However, it is not a real Markov chain, asqτ(d) depends on the node beliefs
and the policies, making it intractable for analysis. In this section, we derive bounds for the closed-form dispatch policy (17) to obtain insights under the assumption ofNd≥K,∀d.
λd∆ 2λd∆ 3λd∆ . . .
qτ(d)
qτ(d)
1−qτ(d)
1−qτ(d) 1−qτ(d)
Figure 3: Evolution of buffer statemτ(d).
Under this policy, the delivery probabilityqτ(d) and the
average immediate rewardE[βυτ(d)] can be bounded in closed
form as follows.
Lemma 5.3. Under myopic search policies and the
dis-patch policy in (17), let dτ := arg maxdmτ(d)/Nd denote
9By precomputingE[βυ(d)|ag(d) =k] for eachd= 1, . . . , D
and k = 1, . . . , K, we can reduce the complexity to
O(DK∆ +D K+KD−1
the domain served in roundτ. We have qτ(dτ)≥1− 1− K Ndτ ∆ =:qτ, (18) E[βυτ(dτ)]≥ βK 1−β∆1− K Ndτ ∆ Ndτ−β(Ndτ −K) . (19)
Proof: See Appendix.
These bounds suggest the following evolution ofmτ10:
mτ+1=
m
τ+λ(∆ + 1)−mτ(dτ)edτ w.p. qτ,
mτ+λ(∆ + 1) o.w. (20)
Sincedτis a function ofmτ, we see that the process{mτ}∞τ=1
following evolution (20) is a Markov chain. This chain also determines the reward: since the expected immediate reward is mτ(dτ)E[βυτ(dτ)], applying (19) gives a lower bound on
the reward r(mτ) := βKmτ(dτ) 1−β∆1− K Ndτ ∆ Ndτ −β(Ndτ−K) , (21)
which is only a function ofmτ. These results lead to the
following performance bound.
Theorem 5.4. Under myopic search policies and the
ap-proximate myopic dispatch policy in (17), the discounted to-tal throughput is lower bounded by
Rπg ∞ ≥E[ ∞ X τ=1 β(∆+1)(τ−1)r(mτ)|m1=0] =:Rπ∞g, (22)
where the expectation is over the Markov chain {mτ}∞τ=1
specified in (20).
Proof: See Appendix.
Generally, (22) does not have a closed-form solution as it depends on the transient statistics of the Markov chain. Hence, we develop a numerical method to approximate it arbitrarily. LetMτ :={(mτ(i), p(τi))} denote the set of all
possible values ofmτ and the probabilitypτ of reaching it
from the initial buffer state m1. Then Algorithm 1 com-putes a T-step approximation of Rπg
∞ by computing Mτ
(τ = 1, . . . , T+ 1) iteratively. In each iteration (lines 2– 8), it enumerates all elements ofMτ (line 4), computes the
possible next states and their probabilities inMτ+1(line 7),
and accumulates the corresponding reward (line 8). Obviously, Rπg
T is a lower bound of Rπ∞g. On the other
hand, it is easy to see that assuming immediate delivery for every domain in roundsτ > T gives an upper bound:
Rπg ∞ −R πg T ≤β (∆+1)T X (mT+1,pT+1)∈MT+1 pT+1 D X d=1 mT+1(d) ! +β (∆+1)(T+1)(∆ + 1) 1−β∆+1 D X d=1 λd ! . (23)
Replacing line 2 by a stopping rule requiring the above bound≤ǫfor a constantǫ >0 will guarantee the resulting
Rπg
T to beǫ-close toRπ∞g. Note that the size ofMτ grows
exponentially as |Mτ| = 2τ−1, and thus the accuracy of
the above approximation will be limited by the complexity constraint.
10Here “w.p.” stands for “with probability” and “o.w.” for
“otherwise”.
Algorithm 1Evaluate Global Reward
Require: Domain size (Nd)Dd=1, data rateλ, number of
fer-ries K, round length ∆, discount factor β, initial BS buffer statem1, and horizonT.
Ensure: Return the T-step approximated reward lower boundRπg T . 1: M1← {(m1,1)},Rπ0g = 0 2: forτ= 1 toT do 3: Mτ+1← ∅,Rπτg ←R πg τ−1 4: for all(mτ, pτ)∈ Mτ do 5: dτ←arg maxdmNτ(dd) 6: qτ ←1−1− K Ndτ ∆ 7: add (mτ +λ(∆ + 1)−mτ(dτ)edτ, pτqτ), (mτ + λ(∆ + 1), pτ(1−qτ)) toMτ+1 8: Rπg τ ←Rπτg+β(∆+1)(τ−1)pτr(mτ)
6.
SIMULATION AND COMPARISON WITH
PREDETERMINED CONTROL
Having developed the DAS policies, we now evaluate their performance against appropriate benchmarks. A benchmark of particular interest is the class of predetermined control policies specified by fixed ferry routes. In the sequel, we will derive the optimal predetermined policies for local and global control and compare them with the proposed policies.
6.1
Local Control
For the predetermined local controller, knowledge of node location is fixed at the steady-state distribution b∗. Thus,
to maximize the chance of contact, the best policy for k
ferries is to wait for the node at thek most probable cells
s(i) (i= 1, . . . , k) such thatb
∗(s(1)) ≥ b∗(s(2)) ≥. . .. We
will call this thewaiting policy.
We now compare the waiting policy with the myopic search policy. Suppose the node follows a 2-D random walk on a grid with the level of mobility controlled by an activeness parameter α:=P
j6=iP(i, j) (i.e., probability of moving to
a different cell), starting from the steady-state distribution (uniform distribution). As illustrated in Fig. 4–5, the per-formance of both policies improves (reward increases while TTC decreases) as the number of ferrieskincreases, and the myopic policy clearly outperforms the waiting policy. The performance gap, however, varies depending on the level of mobility, with a larger gap in low mobility cases. Intuitively, this is because the waiting policy relies on node mobility to create contact opportunities, whereas the myopic policy will actively search for the node and is thus less affected. We also plot the lower and upper bounds on the myopic reward (Corollary 4.4–4.5) and the corresponding bounds on its mean TTC; we note that both bounds track the actual per-formance consistently and that the bounds are fairly tight.
We have also conducted similar simulations under biased mobility (Fig. 6–7), modeled by alocalized random walkwith
tightness parameter η [10]. This model allows non-uniform transition probabilitiesP(i, j)∝e−η||j−h|| biased toward a
home cell h (ifη ≥0), where||j−h||denotes the taxicab distance between cells j and h, andη controls the level of bias (η= 0 for standard random walk). In the simulations, the home cell was at the center of the grid. The results in-dicate that performance trends and comparisons are similar
1 2 3 4 5 6 7 8 9 10 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 waiting myopic upper bound lower bound k R πl ∞
(a) Local reward.
1 2 3 4 5 6 7 8 9 10 100 101 102 103 waiting myopic upper bound lower bound k E [Υ πl ] (b) Mean TTC. Figure 4: Myopic search policy vs. waiting policy: low-mobility random walk (α= 0.1, η = 0, N = 25,
β= 0.9, 5000 Monte Carlo runs).
1 2 3 4 5 6 7 8 9 10 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 waiting myopic upper bound lower bound k R πl ∞
(a) Local reward.
1 2 3 4 5 6 7 8 9 10 0 5 10 15 20 25 30 35 waiting myopic upper bound lower bound k E [Υ πl ] (b) Mean TTC. Figure 5: Myopic search policy vs. waiting policy: high-mobility random walk (α= 0.9, η= 0, rest as in Fig. 4). 1 2 3 4 5 6 7 8 9 10 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 waiting myopic upper bound lower bound k R πl ∞
(a) Local reward.
1 2 3 4 5 6 7 8 9 10 0 10 20 30 40 50 60 waiting myopic upper bound lower bound k E [Υ πl ] (b) Mean TTC. Figure 6: Myopic search policy vs. waiting policy: low-mobility, localized random walk (α= 0.1,η= 0.5, rest as in Fig. 4).
to those under symmetric mobility, but with a smaller gap between the two policies, as biased mobility provides more information to the waiting policy through its non-uniform steady-state distribution. The analytical bounds are looser in these cases.
In all of the above simulation results, we observe that the reward and the TTC metrics both improve monotonically as the number of ferries increases, but the marginal gain becomes smaller, particularly in the case of TTC; this is in accordance with the theoretical results developed in Sec-tion 4.3.
6.2
Global Control
The problem of predetermined global control is analogous to that of dynamic global control in Section 5.1 except that the beliefs of node locations are fixed at their steady states, and the BS buffer state is replaced by its expectationmτ.
For fair comparison with the myopic dispatch policy (15), we consider the followingmyopic predetermined dispatch policy:
πg0(m) = arg max ag D X d=1 m(d)E[βυ0(d)], (24) 1 2 3 4 5 6 7 8 9 10 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 waiting myopic upper bound lower bound k R πl ∞
(a) Local reward.
1 2 3 4 5 6 7 8 9 10 0 5 10 15 20 25 waiting myopic upper bound lower bound k E [Υ πl ] (b) Mean TTC. Figure 7: Myopic search policy vs. waiting policy: high-mobility, localized random walk (α = 0.9, η = 0.5, rest as in Fig. 4).
whereυ0(d) is the TTC in domaindunder the waiting pol-icy, starting from the steady-state distribution. After one round, the expected buffer state will evolve as mτ+1(d) =
λd(∆ + 1) +mτ(d) Pr{υ0(d)>∆}. A crucial difference
be-tween (24) and (15) is that E[βυ0(d)] and Pr{υ0(d) > ∆}
are fixed for any given ag(d). Therefore, given the initial buffer statem1=m1, the entire sequences of{mτ}∞τ=1and {ag
τ}∞τ=1are determined.
To complete the policy specification, we need to eval-uate E[βυ0(d)] and Pr{υ0(d) > ∆} for a given ag(d) = k ∈ {1, . . . , K}. Let Hk (index d is dropped for
simplic-ity) denote the hitting time for the k most probable cells
{s(i)}ki=1 in the steady state b∗, starting from b∗. Then
E[βυ0(d)] = P∆
t=1βtPr{Hk = t} and Pr{υ0(d) > ∆} =
Pr{Hk>∆}. By analysis similar to that in Section 4.3, we
have Pr{Hk = t}=pt0Qtj−=11(1−p 0
j), wherep0t is the
con-ditional contact probability given byp0
t =Pki=1b 0
t(s(i)), for
b01=b∗ andb0t+1=PTb0t\{s(i)}k
i=1.
We now compare the performance of the above predeter-mined policy with that of the myopic dispatch policy (15) and its approximations (16–17). We assume 2-D random walk for each domain with identical size, activeness, and traffic rate. We first evaluate the performance of these poli-cies for different values of the round length ∆; see Fig. 8 (“approx myopic 1” for (16) and “approx myopic 2” for (17)), where for each ∆, we simulate the policies for enough time (≥100 slots) to approximate the total rewardRπg
∞.
Interest-ingly, although it takes much longer to guarantee a contact (at least 5 slots even if nodes do not move), all the dis-patch policies achieve their best performance at a smaller ∆: ∆ = 3 for the myopic policy (15) and its approximation (16), ∆ = 2 for its other approximation (17), and ∆ = 1 for the predetermined policy (24).
Based on the above results, we compare the policies under their best ∆ with respect to (wrt) the discounted through-put λβ,T := E[PTt=1N(t)β
t] (Fig. 9), where N(t) denotes
the amount of traffic delivered in slott, and wrt the undis-counted throughput λβ=1,T (Fig. 10). The myopic policy
significantly outperforms the predetermined policy on both metrics (by 125% on λβ,T and by 47% on λ1,T).
More-over, the two approximations provide close-to-myopic per-formance at much reduced complexities, where (16) even outperforms the myopic policy onλβ,T. This shows that
my-opic policy is suboptimal for global control. We also evaluate anǫ-approximation (withǫ= 0.25) of the analytical bound in (22) by Algorithm 1 (“myopic lower bound”), which gives a conservative lower bound for myopic policies.
1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 11 12 myopic approx myopic 1 approx myopic 2 predetermined ∆ R πg ∞ Figure 8: Optimizing ∆ (D = 2, Nd≡ 25, αd ≡0.5, ηd≡0, K = 5, λd≡1, β= 0.9, 500Monte Carlo runs). 0 20 40 60 80 100 0 2 4 6 8 10 12 myopic approx myopic 1 approx myopic 2 predetermined myopic lower bound
T
λβ
,T
Figure 9: Myopic vs.
prede-termined dispatch policy:
dis-counted throughput (∆
opti-mized by Fig. 8, rest as in Fig. 8).
0 20 40 60 80 100 0 20 40 60 80 100 120 140 160 180 200 myopic approx myopic 1 approx myopic 2 predetermined T λ1 ,T
Figure 10: Myopic vs.
prede-termined dispatch policy: undis-counted throughput (same as Fig. 9).
of different mobility parameters and traffic rates for each domain under the localized random walk model defined in Section 6.1; see Fig. 11–13 (the optimal ∆ is 2 for the my-opic policies and 1 for the predetermined policy). The re-sults show similar trends, except that the performance gap between the myopic policies and the predetermined policy shrinks (the myopic policy outperforms the predetermined policy by 68% onλβ,Tand by 26% onλ1,T). Intuitively, this
is because in heterogeneous cases, prior knowledge such as domain sizes, mobility parameters, and traffic rates already provides good distinction between domains, which helps the predetermined dispatch policy just as biased mobility helps the waiting policy in local control.
7.
CONCLUSION
We have considered the control of multiple data ferries in partitioned wireless networks. Compared with the lit-erature, our solution deals with a more challenging case of stochastically moving nodes and limited ferry-node and ferry-ferry communication ranges. Using the approach of stochastic control, we develop a fully dynamic solution via the hierarchical policy of “dispatch” and “search”, which only requires local cooperation between ferries and provides sub-stantial performance improvements over existing predeter-mined control in high uncertainty scenarios (e.g., uniform node steady-state distributions, identical partitions).
Although we have focused on data dissemination in pre-senting the detailed solution, our approach is applicable to other communication scenarios as well. Under other commu-nication modes, the local control remains the same whereas the global control needs to be modified. For example, a global POMDP similar to that in Section 5.1, withmτ(d)
denoting the traffic generatedby domaindand reward
rg(xg, ag, og) := PDd=1m(d)β∆+1Iυ(d)≤∆, corresponds to
data harvesting (with delayed pickup). A more complicated variation can model peer-to-peer communication via the re-lay of BS by tracking the buffer states of each source domain and the BS with a set of state variables and using the BS buffer statemτ to compute the rewardrg(xg,ag,og) as in
Section 5.1. Detailed studies are left to future work.
8.
REFERENCES
[1] G. D. Celik and E. Modiano. Dynamic vehicle routing for data gathering in wireless networks. InIEEE CDC, December 2010.
[2] T. He, K.-W. Lee, and A. Swami. Flying in the dark: Controlling autonomous data ferries with partial observations. InACM MobiHoc, 2010.
[3] D. Henkel and T. Brown. On controlled node mobility in delay-tolerant networks of unmannned aerial vehicles. InISART, 2006.
[4] D. Henkel and T. Brown. Towards autonomous data ferry route design through reinforcement learning. In
IEEE/ACM WoWMoM, 2008.
[5] D. Jea, A. Somasundara, and M. Srivastava. Multiple controlled mobile elements (data mules) for data collection in sensor networks. InDCOSS’05. [6] V. Kavitha and E. Altman. Analysis and design of
message ferry routes in sensor networks using polling models. InIEEE WiOpt, May 2010.
[7] C. Papadimitriou and J. Tsitsiklis. The complexity of markov decision processes.Math. of Operation Research, 1987.
[8] E. Sondik. The optimal control of partially observable markov processes over the infinite horizon: Discounted costs.OR, 1978.
[9] M. Tariq, M. Ammar, and E. Zegura. Message ferry route design for sparse ad hoc networks with mobile nodes. InACM MobiHoc, 2006.
[10] B. Walker, T. Clancy, and J. Glenn. Using localized random walks to model delay-tolerant networks. In
IEEE MILCOM, 2008.
[11] J. Wu, S. Yang, and F. Dai. Logarithmic
store-carry-forward routing in mobile ad hoc networks.
IEEE Trans. PDS, 2007.
[12] Z. Zhang and Z. Fei. Route design for multiple ferries in delay tolerant networks. In IEEE WCNC, 2007. [13] W. Zhao, M. Ammar, and E. Zegura. Controlling the
mobility of multiple data transport ferries in a delay-tolerant network. InIEEE INFOCOM’05. [14] W. Zhao, M. Ammar, and E. Zegura. A message
ferrying approach for data delivery in sparse mobile ad hoc networks. InACM MobiHoc, 2004.
APPENDIX
Proof of Lemma 4.2
Based on (8), we have ∂Rπ∞ ∂pπ t =βt t−1 Y j=1 (1−pπj)− ∞ X s=t+1 βspπs s−1 Y j=1,j6=t (1−pπj) = "t−1 Y j=1 (1−pπj) # βt−E[βΥπ|Υπ> t]≥0. Therefore,Rπ1 2 3 4 5 6 7 8 9 10 4 5 6 7 8 9 10 11 12 myopic approx myopic 1 approx myopic 2 predetermined ∆ R πg ∞ Figure 11: Optimizing∆(D= 3, N := (Nd)Dd=1 = (16, 25, 36), α = (0.9, 0.5,0.1), η= (0,0.2,0.5), K= 5, λ = (1, 0.75, 0.5), β = 0.9, 500
Monte Carlo runs).
0 20 40 60 80 100 0 2 4 6 8 10 12 myopic approx myopic 1 approx myopic 2 predetermined myopic lower bound
T
λβ
,T
Figure 12: Myopic vs.
prede-termined dispatch policy:
dis-counted throughput (∆
opti-mized by Fig. 11, rest as Fig. 11).
0 20 40 60 80 100 0 50 100 150 200 250 myopic approx myopic 1 approx myopic 2 predetermined T λ1 ,T
Figure 13: Myopic vs.
prede-termined dispatch policy: undis-counted throughput (same as Fig. 12).
Proof of Lemma 4.3
First, we prove by induction that (recallb(t)= (PT)tb0)
bπt(i)≤ b(t)(i) 1−Pt−1 j=1 P s∈aπ j b (j)(s), ∀i∈ S, t≥1. (25) Fort= 1,bπ 1 =PTb0=b(1). Fort >1, bπt(i)= X j6∈aπ t−1 bπ t−1(j)Pj,i 1−P s∈aπ t−1b π t−1(s) ≤ b (t)(i) 1−Pt−1 j=1 P s∈aπ j b (j)(s),
obtained by applying (25) forbπt−1(j). This proves (25).
Then, we apply the above topπt =Ps∈aπ
t b π t(s): pπt ≤ P s∈aπ t b (t)(s) 1−Pt−1 j=1 P s∈aπ j b (j)(s) ≤pt forptdefined as in (10).
Proof of Corollary 4.4
The proof is based on the fact that pMY
t ≥ k/N, ∀t. By
Lemma 4.2, we will lower boundRMY
∞ by replacingpMYt with k/N, which yields RMY ∞ ≥ ∞ X t=1 βtk N(1− k N) t−1 = βk N(1−β) +βk.
Proof of Corollary 4.5
Ifb0 =b∗, then b(t) ≡b∗,∀t. Accordingly, Bt,k ≡B∗,k,andT0=⌈B−∗,k1⌉. Plugging these into (9) gives (11).
If, in addition, P is doubly stochastic, then it is known thatb∗is uniform, i.e.,B∗,k =k/N, applying which to (11)
gives R∞=β⌈N/k⌉ 1− k N ⌈N k⌉ −1 +k(β−β ⌈N/k⌉) N(1−β) . (26) If we maximize the right-hand side of (26) with respect to
⌈N
k⌉, calculation shows that the maximum is achieved at ⌈N k⌉= N k, N k + 1, or N k − 1 1−β− 1
logβ + 1. At the first two
values, the right-hand side is
kβ(1−βN/k)
N(1−β) <
βk
(1−β)N;
at the third value, it is
k N βNk+1− 1 1−β− 1 logβ logβ + β 1−β ! < βk (1−β)N.
Combining both givesR∞≤βk/[(1−β)N].
Proof of Lemma 5.2
The upper bound holds becauseRπl
d =E[βΥ(d)] whileE[βυ(d)]
is a truncated average. The lower bound is because
Rπl d −E[β υ(d)] = ∞ X t=∆+1 βtPr{Υ(d) =t} ≤β∆Pr{Υ(d)>∆} ≤β∆. (27)
Proof of Lemma 5.3
Letpt be the conditional contact probability in domaindτ.
By definition, we have pt ≥K/Ndτ for the myopic search
policy. From the analysis of local control (Section 4.3), we haveqτ(dτ) = 1−Q∆t=1(1−pt) and E[βυτ(dτ)] = ∆ X t=1 βtPr{υτ(dτ) =t}= ∆ X t=1 βtpt t−1 Y j=1 (1−pj),
both increasing withpt. Plugging in the lower bound ofpt
yields the results.
Proof of Theorem 5.4
Due to discount, the total throughputRπg
∞ is an increasing
function of the probability of delivery qτ(dτ) and is thus
lower bounded if qτ(dτ) is replaced by its lower bound in
(18), makingmτ evolve according to (20).
Moreover, for given (bτ, mτ), the expected immediate
reward under the dispatch policy (17) ismτ(dτ)E[βυτ(dτ)],
which is lower bounded byr(mτ) defined in (21) for anybτ