Estimating optimal control strategies for large scale spatio-temporal decision problems.

ABSTRACT

MEYER, NICHOLAS JAMES. Estimating optimal control strategies for large scale spatio-temporal decision problems. (Under the direction of Dr. Eric Laber.)

Sequential decision problems arise in many disciplines of science. Applications include management of infectious diseases, pursuit and evasion games, and power grid optimization. This dissertation addresses large-scale, complex decision problems that evolve not only over time but also over space. The chapters that follow are a collection of research papers. Each chapter highlights different challenges, presents a novel solution, and demonstrates the results through extensive simulation experiments.

Infectious diseases are complex systems that are inherently difficult to control. Limited information, unknown dynamics, and high-dimensional state spaces require a novel approach to estimating optimal treatment allocation strategies. Chapter 2 presents a model-based approach to estimating an optimal treatment strategy. The method uses a postulated model for the disease dynamics and maximizes the value function using simulation optimization. When the dynamics model is correctly specified, this approach produces an effective treatment allocation strategy. However, the method lacks robustness to misspecification of the dynamics model. Chapter 3 addresses this issue of model misspecification using a semi-parametric estimator of the optimal treatment strategy. Both methods are demonstrated using simulation experiments with case studies for white-nose syndrome and the Ebola virus.

© Copyright 2017 by Nicholas James Meyer

Estimating optimal control strategies for large scale

spatio-temporal decision problems

by

Nicholas James Meyer

A dissertation submitted to the Graduate Faculty of

North Carolina State University

in partial fulfillment of the

requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina

2017

APPROVED BY:

Dr. Marie Davidian

Dr. Krishna Pacifici

Dr. Brian Reich

Dr. Butch Tsiatis

Dr. Eric Laber

DEDICATION

BIOGRAPHY

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1 Introduction
    1.1 Sequential Decision Problems
    1.2 Management of infectious diseases
    1.3 Pursuit and evasion

Chapter 2 Optimal treatment allocations in space and time for online control of an emerging infectious disease
    2.1 Introduction
    2.2 White-nose syndrome in bats
    2.3 Defining an optimal treatment allocation strategy
    2.4 Estimating an optimal allocation strategy
        2.4.1 A scalable class of allocation strategies
    2.5 Simulation experiments
        2.5.1 Spread of an infectious disease in Euclidean space
        2.5.2 Spread of an infectious disease across a network
    2.6 Controlling the spread of white-nose syndrome
        2.6.1 A system dynamics model for WNS under no interventions
        2.6.2 Simulating management of WNS
    2.7 Discussion

Chapter 3 Model-free estimation of the optimal treatment strategy for dynamical spatio-temporal systems
    3.1 Introduction
    3.2 Ebola Virus
    3.3 Notation and Setup
    3.4 Features for Constructing Treatment Strategies
    3.5 Estimating the Optimal Strategy via Model-Based Policy Search
    3.6 Estimating the Optimal Strategy via the Q-function
        3.6.1 Working Model of the Q-function
        3.6.2 Estimating the Q-function
    3.7 System Dynamics Models
        3.7.1 Model 1: No Resistance to Treatment
        3.7.2 Model 2: Resistance to Treatment Dictated by Covariates
    3.8 Simulation Experiment
        3.8.1 Network Structures
        3.8.3 Experiment Results
    3.9 Management of the Ebola Virus
    3.10 Conclusion

Chapter 4 Cooperative search strategies for pursuing adversarial evaders
    4.1 Introduction
    4.2 Setup and notation
    4.3 Estimating the optimal search strategy
    4.4 Estimating the evader’s location
    4.5 Evader behaviors
    4.6 Simulation experiment
    4.7 Conclusion

References

Appendices

Appendix A Supplemental Material for Chapter 2
    A.1 Illustrative example of Thompson Sampling
    A.2 Additional details for experiments using networks
    A.3 Tuning procedure for simultaneous perturbation
    A.4 Generating spatial locations and network structures
        A.4.1 Lattice layout and network
        A.4.2 Random $k$-nearest neighbor layout and network
        A.4.3 Scalefree (“small world”) network
        A.4.4 Clustered Network
        A.4.5 Covariates
    A.5 Tuning the generative model
    A.6 Approximating the posterior distribution of $\theta$
    A.7 Model fit diagnostics
        A.7.1 Estimated infection spread
        A.7.2 Posterior predictive checks

Appendix B Supplemental Material for Chapter 3
    B.1 Network Structures
        B.1.1 Lattice Structure
        B.1.2 Random Structure

LIST OF TABLES

Table 2.1 Estimated coefficients and 95% credible intervals (CIs) for the gravity model fit using WNS data from 2006-2014. Rows for $\theta_3$ and $\theta_4$ correspond to intervention effects, which are not identifiable in the WNS data because these data do not contain any interventions.

Table 2.2 Average proportion of infected counties in 100 Monte Carlo simulations of the spread of WNS from 2015-2022 under the gravity model. Policy-search resulted in markedly fewer infected counties than the next best estimator.

Table 3.1 Parameters indexing the generative model for the Ebola simulations.

Table 3.2 Simulation results for the management of the Ebola Virus.

Table A.1 Estimated coefficients and 95% credible intervals (CIs) for the network spread dynamics model fit using WNS data from 2006-2014. Rows for $\theta_3$ and $\theta_4$ correspond to intervention effects which are not identifiable in the WNS

LIST OF FIGURES

Figure 2.1 Spread of white-nose syndrome [U.S. Fish and Wildlife Service, 2015]. Outlined counties contain caves. Those without color are uninfected as of June 2014.

Figure 2.2 Left: Schematic for Thompson sampling over a generic class of dynamics models. Right: Schematic for Thompson sampling with a parametric class of models indexed by $(\theta, \beta)$ and a finite simulation horizon $T$.

Figure 2.3 Left: (S1) regular lattice layout with 1000 locations. Center: (S2) uniformly distributed layout with 1000 locations. Right: (S3) clustered layout with 1000 locations.

Figure 2.4 Estimated average proportion infected based on 100 Monte Carlo replications under correct specification of the system dynamics model. Horizontal line represents two standard errors.

Figure 2.5 Estimated average proportion infected based on 100 Monte Carlo replications under misspecification of the system dynamics model. Horizontal line represents two standard errors.

Figure 2.6 Left: (N1) regular lattice network with 1000 locations. Center: (N2) random $k$-nearest neighbor network with 1000 locations. Right: (N4) small-world network with 1000 locations.

Figure 2.7 Estimated average proportion infected based on 100 Monte Carlo replications under correct specification of the network spread dynamics model. Horizontal line represents two standard errors.

Figure 2.8 Estimated average proportion infected based on 100 Monte Carlo replications under misspecification of the network spread dynamics model. Horizontal line represents two standard errors.

Figure 2.9 Posterior distribution of the regression parameters associated with uninfected ($\theta_1$) and infected ($\theta_2$) counties in the gravity model (2.2) applied to the white-nose syndrome data. The covariates are the number of caves in the county (“caves”), the average number of days per year below 10°C (“cold days”), area in km$^2$ (“Area”), and species richness (“SR”).

Figure 2.10 Histograms for each feature coefficient at the final time point during the WNS simulation experiment.

Figure 3.1 Observed outbreaks for West Africa with the first infections on April 26, 2014.

Figure 3.2 Histogram of log-population for all administrative units in the Ebola study.

Figure 3.3 Example network structure to illustrate construction of features.

Figure 3.4 Figure displaying examples of the three network structures. Left: lattice network with 1000 locations. Center: random nearest neighbor network with 1000 locations. Right: scale-free network with 1000 nodes.

Figure 3.5 Simulation results for all network structures showing effects of

Figure 4.1 Schematic for starting positions of all units and evader’s goal locations. Red diamond is the starting evader location. White squares are the starting pursuer locations. Green crosses are the possible evader goal locations.

Figure 4.2 Simulation results comparing finite Q-function approximation with a heuristic against a random walk. Error bars are a symmetric two standard error interval. X-axis is the number of time points, $n$, evaluated before using a heuristic. Y-axis is the estimated probability of capture.

Figure A.1 Average cumulative outcome for Thompson sampling and greedy allocation selection based on $k = 1, 5,$ and $10$ exploration steps of each allocation and a gap in means of $\eta = 0.2, 0.4, 0.8,$ and $1.0$. The figure illustrates that greedy action selection need not be consistent for the optimal strategy (i.e., these methods do not converge to 1). Results are based on 1000 Monte Carlo replications.

Figure A.2 Posterior distribution of the regression parameters associated with uninfected ($\theta_1$) and infected ($\theta_2$) counties in the network spread model applied to the white-nose syndrome data. The covariates are the number of caves in the county (“caves”), the average number of days per year below 10°C (“cold days”), area in km$^2$ (“Area”), and species richness (“SR”).

Figure A.3 The lattice network with 1000 locations.

Figure A.4 A random $k$-nearest neighbor network with 1000 locations.

Figure A.5 The scalefree network with 1000 locations.

Figure A.6 A clustered network with 1000 locations.

Figure A.7 Estimated spread starting at 2006 and moving through 2013 for the spatial spread system dynamics model.

Figure A.8 Estimated spread starting at 2006 and moving through 2013 for the network spread system dynamics model.

Figure A.9 Estimated spread starting at 2011 and moving through 2013 for the spatial spread system dynamics model.

Figure A.10 Estimated spread starting at 2011 and moving through 2013 for the network spread system dynamics model.

Figure A.11 Posterior predictive check for the total number of infections.

Figure A.12 Posterior predictive check for mean year of infection.

Figure A.13 Posterior predictive check for the number of infections in 2013.

Figure A.14 Posterior predictive check for the number of infections in 2012.

Figure A.15 Posterior predictive check for the number of infections in 2011.

Figure A.16 Posterior predictive check for the number of infections in 2010.

Figure A.17 Posterior predictive check for the number of infections in 2009.

Figure A.18 Posterior predictive check for the number of infections in 2008.

Figure A.19 Posterior predictive check for the number of infections in 2007.

Figure A.20 Posterior predictive check for maximum distance from starting location to final set of infected locations.

Figure A.22 Posterior predictive check for maximum difference in longitude from the starting location to final set of infected locations.

Figure A.23 Posterior predictive check for mean distance from the starting location to final set of infected locations.

Figure A.24 Posterior predictive check for mean difference in latitude from the starting location to final set of infected locations.

Figure A.25 Posterior predictive check for mean difference in longitude from the starting location to final set of infected locations.

Figure A.26 Posterior predictive check for minimum difference in latitude from the starting location to final set of infected locations.

Figure A.27 Posterior predictive check for minimum difference in longitude from the starting location to final set of infected locations.

Figure A.28 Posterior predictive check for the total number of infections using out of sample statistics.

Figure A.29 Posterior predictive check for the mean infection year using out of sample statistics.

Figure A.30 Posterior predictive check for the number of infections in 2013 using out of sample statistics.

Figure A.31 Posterior predictive check for the number of infections in 2012 using out of sample statistics.

Figure A.32 Posterior predictive check for the minimum difference in longitude from the starting location to final set of infected locations using out of sample statistics.

Figure A.33 Posterior predictive check for the minimum difference in latitude from the starting location to final set of infected locations using out of sample statistics.

Figure A.34 Posterior predictive check for the mean difference in longitude from the starting location to final set of infected locations using out of sample statistics.

Figure A.35 Posterior predictive check for the mean difference in latitude from the starting location to final set of infected locations using out of sample statistics.

Figure A.36 Posterior predictive check for the max difference in longitude from the starting location to final set of infected locations using out of sample statistics.

Figure A.37 Posterior predictive check for the max difference in latitude from the starting location to final set of infected locations using out of sample statistics.

Figure A.38 Posterior predictive check for the max distance from the starting location to final set of infected locations using out of sample statistics.

Figure A.39 Posterior predictive check for the mean distance from the starting location to final set of infected locations using out of sample statistics.

Figure B.1 Example lattice structure with 1000 locations.

Figure B.2 Example random structure with 1000 locations.

LIST OF ALGORITHMS

Algorithm 2.1 Policy-search algorithm for an optimal allocation strategy.

Algorithm 2.2 Stochastic approximation algorithm for $\arg\max_{d \in \mathcal{D}} C_T(d; \beta, \theta)$.

Algorithm 3.1 Stochastic approximation for policy search.

Algorithm 3.2 Stochastic approximation for estimating the Q-function.

Algorithm A.1 Tuning simultaneous perturbation.

Algorithm A.2 Connecting sub-networks.

Chapter 1

Introduction

1.1 Sequential Decision Problems

We are constantly presented with problems or decisions in which we try to select the choice that will maximize our utility. Often we are required to make multiple sequential decisions. Whether conscious or not, our brains make quick mental judgments about the probability of certain outcomes and how much we value each outcome, and this drives our decisions. Sometimes we are unsure of how our decisions will impact our lives and those around us and thus need to learn from our experiences to improve. By learning from our successes and mistakes, we can improve our decision-making ability to maximize our long-run utility. This informal process of making sequential decisions can be formalized with a rigorous mathematical framework that allows us to set up a problem, define the outcomes, select utility values, and solve for the optimal decision strategy.

according to the utility function.

Let $\mathcal{T} = \{1, 2, 3, \ldots\}$ be the set of time points at which the agent makes a decision. At each decision point $t$, the agent observes the state of the system, $S_t \in \mathcal{S}$. Using the state information, the agent makes a decision $A_t \in \mathcal{A}$. The system transitions to the next state $S_{t+1}$ and the agent receives utility feedback $R(S_t, A_t, S_{t+1}) \in \mathbb{R}$. Receiving feedback at each step informs the agent about its decision for the one-step transition. However, it is important to remember that making yourself as happy as possible today does not necessarily make you happier in the long run. Thus, evaluating an agent’s long-run utility is important for comparison.

Define a strategy of an agent, $\pi$, to be a mapping from $\mathcal{S}$ to the space of all random variables with support $\mathcal{A}$. The value of an agent’s strategy is the expected long-run utility when starting in a specified state, $s$, and making decisions according to the strategy $\pi$. The value function is defined as

$$V^{\pi}(s) = E_{\pi}\left[ \sum_{v \ge t} \gamma^{v-t} R(S_v, A_v, S_{v+1}) \,\middle|\, S_t = s \right],$$

where $E_{\pi}$ denotes expectation with respect to the distribution induced when decisions are made according to $\pi$ and $\gamma \in [0, 1)$ is a discount factor. A strategy $\pi^{*}$ is optimal if $V^{\pi^{*}} \ge V^{\pi}$ for any considered strategy $\pi$.
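As a concrete illustration of the value function, the following sketch approximates the value of a strategy by Monte Carlo: it simulates trajectories under the strategy, truncates the infinite discounted sum at a finite horizon, and averages the discounted returns. The two-state system `toy_step`, the strategies `always_one` and `always_zero`, and all function names are hypothetical and serve only to illustrate the definition; they are not the methods developed in later chapters.

```python
import random


def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


def estimate_value(step, policy, s0, gamma, horizon, n_rollouts, rng):
    """Monte Carlo estimate of the value of `policy` started in state s0.

    The infinite discounted sum is truncated after `horizon` decisions,
    which introduces a bias of order gamma**horizon.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, rewards = s0, []
        for _ in range(horizon):
            a = policy(s, rng)              # decision A_v given state S_v
            s_next, r = step(s, a, rng)     # S_{v+1} and R(S_v, A_v, S_{v+1})
            rewards.append(r)
            s = s_next
        total += discounted_return(rewards, gamma)
    return total / n_rollouts


def toy_step(s, a, rng):
    """Hypothetical dynamics: action 1 leads to state 1 with probability 0.9,
    action 0 with probability 0.1; the utility equals the next state."""
    p = 0.9 if a == 1 else 0.1
    s_next = 1 if rng.random() < p else 0
    return s_next, float(s_next)


def always_one(s, rng):
    """Hypothetical strategy that always takes action 1."""
    return 1


def always_zero(s, rng):
    """Hypothetical strategy that always takes action 0."""
    return 0


if __name__ == "__main__":
    rng = random.Random(0)
    for name, pol in [("always_one", always_one), ("always_zero", always_zero)]:
        v = estimate_value(toy_step, pol, 0, 0.9, 50, 500, rng)
        print(name, round(v, 2))
```

Comparing the two estimates shows how the value function ranks strategies: in this toy system the strategy that always takes action 1 has the larger estimated value.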

1.2 Management of infectious diseases

Management of an infectious disease is a sequential decision problem that evolves not just over time but also over space. A critical component of managing these large dynamical diseases is the ability to apply treatment effectively. With constraints on resources, it is critical that experts allocate treatments in the most effective way.

There are three main challenges in estimating the optimal treatment strategy for managing an infectious disease. First, the decision space is large and grows rapidly with the number of susceptible locations due to treatment interactions. The large dimension increases computational costs and restricts the feasibility of existing methodology. Second, the transmission dynamics of a disease are often complex, which makes it difficult to estimate a dynamics model. Because of the complex dynamics, a quality estimator of the optimal strategy should be robust to model misspecification. Third, contrary to many sequential decision problems, in this context only a single trajectory of the system is observed. With limited data, estimating the optimal strategy can be unstable.

In Chapters 2 and 3, we develop estimators of the optimal treatment allocation strategy to control the spread of an infectious disease. In Chapter 2, we propose a method for estimating the optimal treatment allocation strategy using a model-based approach. The method uses a postulated model for the spread dynamics of the disease and uses simulation-based optimization to estimate the optimal strategy. This approach performs well when the dynamics model is correctly specified, but it is not robust to misspecification. Chapter 3 addresses this issue and proposes a more robust approach. Both chapters demonstrate the methods through a series of simulation experiments and a case study.

1.3 Pursuit and evasion

2015]. All of these areas have scenarios where a group of search agents are pursuing an adversary

who is actively avoiding capture. Sometimes this adversary is simply trying to hide, but other

times the adversary has a mission it is trying to complete before being caught. Regardless of

the situation, it is imperative that the adversary be caught as quickly as possible.

Most approaches in the existing literature for constructing pursuit strategies focus on greedy methods and heuristics. Two such methods are the local-max and global-max strategies [Hespanha et al., 2000, Vidal et al., 2002, Kwak and Kim, 2014]. A local-max strategy is a one-step greedy strategy that maximizes the probability of capture at the next time step. A global-max strategy locates the position with the highest probability of housing the evader and moves the pursuers as close as possible to it on the next move. Kwak and Kim [2014] combined both methods using a weighted combination of local-max and global-max and trained the weights using reinforcement learning. A different approach is taken by Wang and Liu [2016], who utilize online learning algorithms designed for adversarial multi-armed bandit problems. In this setting, both the pursuers and evaders adapt over time.
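The difference between the local-max and global-max heuristics can be sketched for a single pursuer on a rectangular grid. This is a simplified illustration under several assumptions that are not taken from the cited papers: the pursuer may stay in place or move one cell up, down, left, or right; capture at the next step occurs when the pursuer occupies the evader’s cell, so the capture probability of a move equals the belief mass at the destination cell; and distance is measured in the Manhattan metric. The belief map and the function names are hypothetical.

```python
def reachable(pos, n_rows, n_cols):
    """Cells the pursuer can occupy after one move (stay or 4-neighborhood)."""
    r, c = pos
    steps = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    return [(r + dr, c + dc) for dr, dc in steps
            if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols]


def local_max_move(pos, belief, n_rows, n_cols):
    """One-step greedy move: go to the reachable cell with the highest
    probability of containing the evader."""
    return max(reachable(pos, n_rows, n_cols),
               key=lambda cell: belief.get(cell, 0.0))


def global_max_move(pos, belief, n_rows, n_cols):
    """Move as close as possible to the most probable evader cell on the grid."""
    target = max(belief, key=belief.get)
    return min(reachable(pos, n_rows, n_cols),
               key=lambda cell: abs(cell[0] - target[0]) + abs(cell[1] - target[1]))


if __name__ == "__main__":
    # Hypothetical belief over evader locations on a 5 x 5 grid;
    # cells not listed carry zero probability.
    belief = {(4, 0): 0.6, (0, 1): 0.3, (2, 2): 0.1}
    start = (0, 0)
    print("local-max move:", local_max_move(start, belief, 5, 5))
    print("global-max move:", global_max_move(start, belief, 5, 5))
```

In this example the two heuristics disagree: local-max steps toward the nearby cell with moderate probability, whereas global-max heads for the distant cell with the highest probability.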

Chapter 2

Optimal treatment allocations in

space and time for online control of

an emerging infectious disease

2.1 Introduction

systems and to human health. The impact of bat loss due to white-nose syndrome is projected

to produce several billion dollars of agricultural costs per year [Subcommittee on Fisheries,

Wildlife, and Oceans, 2011]. Understanding the dynamics of these epidemics and providing

tools to efficiently and effectively control them is of paramount importance.

A key component in controlling the spread of an epidemic is deciding where, when, and

to whom to apply an intervention. A treatment allocation strategy formalizes this process

as a sequence of functions, one per treatment period, that map up-to-date information on the

epidemic to a subset of locations to receive treatment. An optimal treatment allocation strategy

optimizes the expectation of some cumulative outcome, e.g., the cumulative number of infected

individuals, the geographic footprint of the disease, the estimated total cost of the disease,

or a composite of several important outcomes. Estimation of an optimal treatment allocation

strategy for an emerging epidemic presents several major challenges:

(i) data scarcity, at the onset of the epidemic there is little information about disease dynamics and typically no information on the effectiveness of potential treatments;

(ii) scalability, the number of possible allocations is exponential in the number of locations;

e.g., in the problem of white-nose syndrome, there are more than 1,100 locations leading

to more treatment allocations than can possibly be enumerated using existing computing

resources;

(iii) interference, dependence among locations violates the no interference among experimental

units assumption [Sobel, 2006b, Hudgens and Halloran, 2008b]; and

(iv) a long time horizon, an epidemic can persist for decades before eradication, and thus

an optimal treatment allocation strategy must adapt to evolving logistical constraints,

technologies, and system dynamics.
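To give a sense of the scale described in (ii), the following sketch counts binary treatment allocations on a base-10 logarithmic scale. The function name and the optional budget `max_treated` (an upper limit on the number of locations treated at once) are hypothetical and are used only for illustration.

```python
import math


def log10_num_allocations(n_locations, max_treated=None):
    """Base-10 logarithm of the number of binary allocations.

    With no budget, every location is either treated or not, giving
    2**n_locations allocations. With a budget, only allocations that
    treat at most `max_treated` locations are counted.
    """
    if max_treated is None:
        return n_locations * math.log10(2)
    count = sum(math.comb(n_locations, k) for k in range(max_treated + 1))
    return math.log10(count)


if __name__ == "__main__":
    # Unconstrained allocations over 1,100 locations.
    print(round(log10_num_allocations(1100), 1))
    # Allocations that treat at most 10 of 1,100 locations.
    print(round(log10_num_allocations(1100, max_treated=10), 1))
```

With 1,100 locations and no constraint there are $2^{1100} \approx 10^{331}$ allocations, which is far too many to enumerate.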

strategy is the maximizer over a pre-specified class of strategies of the mean outcome under this

model. The system dynamics model and estimated optimal allocation strategy are updated each

time new data are collected to provide a continually evolving strategy. Furthermore, the class

of potential allocation strategies is chosen to reduce computational complexity when scaling to

large decision problems and to ensure that logistical/feasibility constraints are satisfied. We

show that the proposed estimator can scale to problems with more than one thousand nodes, four covariates per node, fifteen treatment periods, and $O(10^{150})$ possible allocations at each time period.

and Zhao, 2015], however, these methods heavily rely on smoothness of an outcome regression

model across treatment values which does not apply in the treatment allocation problem.

Both estimation of dynamic treatment regimes and estimation of an optimal treatment allocation fall under the umbrella of reinforcement learning problems [Bertsekas, 1996, Sutton and Barto, 1998b, Powell, 2007, Sugiyama, 2015]. Our proposed estimator is an approximate variant of Thompson sampling [Thompson, 1933] wherein allocations are chosen with probability that is proportional to the posterior probability that they are optimal. Thompson sampling has been studied in the reinforcement learning literature primarily in its application to bandit problems [Scott, 2010, Chapelle and Li, 2011, Agrawal and Goyal, 2011, 2012, Kaufmann et al., 2012, Korda et al., 2013, Agrawal and Goyal, 2013, Gopalan et al., 2014, Russo and Van Roy, 2014]. Osband et al. [2013] and Gopalan and Mannor [2015] applied Thompson sampling to sequential decision problems modeled as Markov decision processes. However, these estimators require: (i) a finite set of system states; and (ii) that a fixed allocation strategy be applied without adjustment for potentially long periods of time. In the settings we consider, the system state is continuous and high-dimensional (making discretization impractical) and the application of a fixed sub-optimal allocation strategy for a prolonged period is neither ethical nor feasible. For a comprehensive survey of Bayesian reinforcement learning see Ghavamzadeh et al. [2015].
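For readers unfamiliar with Thompson sampling, the following is a minimal sketch in its simplest setting, a Bernoulli bandit with independent Beta(1, 1) priors. It is not the estimator proposed in this chapter, which operates on a posterior over system dynamics models; the function name, the arm means, and the prior are assumptions made for illustration. At each round one value is drawn from each arm’s posterior and the arm with the largest draw is played, so each arm is selected with probability equal to the posterior probability that it is optimal.

```python
import random


def thompson_bernoulli(true_means, n_rounds, rng):
    """Run Thompson sampling on a Bernoulli bandit and return pull counts.

    `true_means` are the (unknown to the algorithm) success probabilities.
    Each arm starts with a Beta(1, 1) prior, updated after every pull.
    """
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # One posterior draw per arm; play the arm whose draw is largest.
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(k)]
        arm = max(range(k), key=lambda i: draws[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical two-arm problem with success probabilities 0.2 and 0.8.
    print(thompson_bernoulli([0.2, 0.8], 2000, random.Random(0)))
```

Because the draws come from the posterior rather than from point estimates, inferior arms continue to be tried occasionally while the data are still consistent with their being optimal, and are played less often as evidence accumulates.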

2.2 White-nose syndrome in bats

White-nose syndrome (WNS) is a disease caused by the fungus Pseudogymnoascus destructans (formerly Geomyces destructans) and predominately affects hibernating bats in North America [Blehert et al., 2009]. An infected bat will present with a white fungus on the muzzle, ears and/or wings, and erratic behavior during hibernation. The erratic behavior during hibernation depletes fat reserves and expends valuable energy resulting in low survival and death [Blehert et al., 2009]. Mortality rates exceed 90 percent in some areas and more than 5.7 million bats have died due to WNS [Blehert et al., 2009, U.S. Fish and Wildlife Service, 2015].

WNS was first recorded in Schoharie County, NY in 2006 [Blehert et al., 2009] and is now found in 25 states, 5 Canadian provinces, as far south as Mississippi, and as far west as Missouri; see Figure 2.1. More than half of the 47 species of bats in the U.S. hibernate, making them vulnerable to exposure. Currently, two endangered species, the Gray bat, Myotis grisescens, and the Indiana bat, Myotis sodalis, as well as one threatened species, the Northern long-eared bat, Myotis septentrionalis, are infected with WNS [Blehert et al., 2009]. The ecological damage due to loss of bats and speed of spread is unprecedented and the long-term damage is still considered to be immeasurable [Blehert et al., 2009]. Short-term estimates of economic damage hover around \$3.7 billion/year mainly due to agricultural loss [Boyles et al., 2011]. The estimated value of bats to the entire agricultural industry is \$22.9 billion/year not including many secondary effects and impacts, e.g., downstream effects of increased pesticide use; predation effects on evolved resistance of insects to pesticides and genetically modified crops [Boyles et al., 2011].

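[Figure 2.1: Spread of white-nose syndrome [U.S. Fish and Wildlife Service, 2015]. Outlined counties contain caves. Those without color are uninfected as of June 2014. Color legend: year infected.]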

not explicitly provide a treatment plan or strategy to control WNS [Szymanski et al., 2009].

Each state is left to implement treatments at its own discretion. Potential treatments include

anti-fungal biological or non-chemical agents for bats at risk, modifying cave environmental

variables, e.g., temperature and humidity, to slow fungus growth and improve bat survival,

vaccines to boost resistance, and artificial caves [Cornelison et al., 2014, Hoyt et al., 2015].

Unfortunately, many of these have not been tested in the field and their efficacy is currently

unknown. Additional challenges exist because the disease has a highly complex nature of spread

including a large spatial range [Maher et al., 2012]. Therefore, to maximize what benefits these

treatments may provide, it is essential to develop a principled, adaptive, and data-driven control

strategy that addresses the full potential range of WNS before further devastation occurs. We

estimate such a control strategy and demonstrate that, if implemented, it may have a profound

effect on the course of the current epidemic.

2.3 Defining an optimal treatment allocation strategy

methodology can be extended to handle settings in which there are several treatment options

available at each location. A treatment allocation strategy formalizes the treatment allocation

process as a map from current information on all locations to a probability distribution over

possible allocations. An allocation strategy is said to be optimal if it maximizes the mean

cumulative utility over a pre-specified class of strategies (minimizing cost can be handled in the

obvious way).

Let $\mathcal{L} = \{1, \ldots, L\}$ denote the set of locations and $\mathcal{T} = \{1, 2, \ldots\}$ the set of treatment stages. The treatment stages may be dictated by the evolving decision process. Define $S^t_\ell \in \mathbb{R}^p$ to be a summary of the information collected at location $\ell \in \mathcal{L}$ up to and including time $t \in \mathcal{T}$ and let $S^t$ be $\{S^t_\ell\}_{\ell \in \mathcal{L}}$; we assume that $S^t$ is completely observed and measured without error. Let $A^t_\ell \in \{0, 1\}$ denote an indicator that location $\ell$ received treatment at time $t$ and let $A^t = \{A^t_\ell\}_{\ell \in \mathcal{L}}$ be the allocation at time $t$. Let $\mathcal{B}_L$ denote the set of all probability distributions over $\{0, 1\}^L$. A treatment allocation strategy, $\pi$, is a function from $\mathcal{S} = \operatorname{supp} S^t$ into $\mathcal{B}_L$ so that under $\pi$, a decision maker presented with $S^t = s^t$ will select allocation $a^t$ with probability $\pi(a^t; s^t)$. Allocation strategies of this type are termed stochastic strategies to contrast them with deterministic strategies, which map states to allocations rather than to a distribution over allocations [Sutton and Barto, 1998b]. In the context of online estimation and optimization, the use of stochastic allocation strategies is critical to ensure consistent estimation of an optimal strategy [Kaelbling et al., 1996, Cesa-Bianchi and Lugosi, 2006] much in the same way that randomization is critical in adaptive clinical trials to ensure consistent estimation of an optimal treatment [Berry and Fristedt, 1985]; see the Supplemental Materials for an illustrative example. Let $Y^t_\ell \in \mathbb{R}$ denote an outcome measured at location $\ell$ at time $t$ and let $Y^t = \{Y^t_\ell\}_{\ell \in \mathcal{L}}$. For a pre-specified constant, $\gamma \in (0, 1)$, the goal is to choose an allocation strategy that maximizes the mean of the discounted total utility $\sum_{t \ge 1} \gamma^{t-1} u(Y^t)$, where $u(\cdot)$ is a scalar utility function and the constant $\gamma$ balances proximal and distal outcomes. In some settings, it may be desirable to choose an alternative measure of cumulative utility, e.g., $\lim_{T \to \infty} T^{-1} \sum_{t=1}^{T} u(Y^t)$; our

an optimal allocation strategy using potential outcomes [Rubin, 1978, Splawa-Neyman et al., 1990].

Let $\Pi$ denote a class of allocation strategies of interest; throughout, we implicitly assume all allocation strategies under consideration belong to $\Pi$. Hence, the definition of optimality depends on $\Pi$. This class can be used to enforce logistical constraints, e.g., a limit on the number of locations that can be treated at each time point. Because our estimation algorithm is online, this class of allocation strategies can be changed in real-time to reflect changing constraints. Define $\mathcal{F} = \{a \in \{0, 1\}^L : a \in \operatorname{supp} \pi \text{ for some } \pi \in \Pi\}$ to be the set of feasible allocations. We use overline notation to denote past history, e.g., $\bar{a}^t = \{a^v\}_{v=1}^{t}$, and a ‘*’ superscript to denote potential outcomes, e.g., $Y^{*t}(\bar{a}^t)$ denotes the outcome that would be observed under treatment sequence $\bar{a}^t$. Define $W^{*} = \{Y^{*t}(\bar{a}^t), S^{*(t+1)}(\bar{a}^t) : \bar{a}^t \in \mathcal{F}^t\}_{t \in \mathcal{T}}$ to be the set of potential outcomes under $\{a^t\}_{t \in \mathcal{T}}$, i.e., the states and outcomes that would be observed under actions $\{a^t\}_{t \in \mathcal{T}}$.

For any $\pi \in \Pi$, let $\{\xi^t_\pi(s)\}_{t \in \mathcal{T}, s \in \mathcal{S}}$ denote a collection of independent random variables so that $P\{\xi^t_\pi(s^t) = a^t\} = \pi(a^t; s^t)$. Define

$$Y^{*t}(\pi) = \sum_{\overline{a}^t} Y^{*t}(\overline{a}^t) \prod_{v=1}^{t} I\left[\xi^v_\pi\{S^{*v}(\overline{a}^{v-1})\} = a^v\right]$$

to be the potential outcome under allocation strategy $\pi$, where $S^{*1}(\overline{a}^0) = S^1$. An allocation strategy, $\pi^{\mathrm{opt}} \in \Pi$, is optimal if

$$E\left[\sum_{t \ge 1} \gamma^{t-1} u\{Y^{*t}(\pi^{\mathrm{opt}})\}\right] \ge E\left[\sum_{t \ge 1} \gamma^{t-1} u\{Y^{*t}(\pi)\}\right]$$

for all $\pi \in \Pi$. If there are multiple optimal strategies within $\Pi$ there is no loss in choosing among them arbitrarily. Thus, for concision, we assume hereafter that $\pi^{\mathrm{opt}}$ is unique. In order to estimate $\pi^{\mathrm{opt}}$ from the observed data, we require assumptions about the data-generating mechanism.

At time $t$, the available data to estimate $\pi^{\mathrm{opt}}$ are $H^1 = S^1$ if $t = 1$ and $H^t = (S^1, A^1, Y^1, \ldots, S^{t-1}, A^{t-1}, Y^{t-1}, S^t)$ if $t \ge 2$. We assume: (A1) sequential ignorability [Robins, 2004b], $A^t \perp W^* \mid H^t$ for all $t \in \mathcal{T}$; (A2) the observed outcomes are the potential outcomes under treatment actually received, $Y^t = Y^{*t}(\overline{A}^t)$ and $S^t = S^{*t}(\overline{A}^{t-1})$ for all $t \in \mathcal{T}$; and (A3) positivity, there exists $\epsilon > 0$ so that $P(A^t = a \mid H^t) > \epsilon$ for all $a \in \mathcal{F}$ and $t \in \mathcal{T}$


Given a data-generating process which satisfies (A1)-(A3), for any $\pi \in \Pi$ it follows that

$$E\left[\sum_{t \ge 1} \gamma^{t-1} u\{Y^{*t}(\pi)\}\right] = \lim_{T \to \infty} \int \left\{\sum_{t=1}^{T} \gamma^{t-1} u(y^t)\right\} \prod_{v=1}^{T} \left\{ f_v(y^v \mid h^v, a^v)\, \pi(a^v; s^v)\, g_v(h^v \mid h^{v-1}) \right\} d\lambda(\overline{y}^T, \overline{a}^T, \overline{h}^T), \tag{2.1}$$

where $f_v$ is the conditional density for $Y^v$ given $H^v$ and $A^v$, $g_v$ is the conditional density for $H^v$ given $H^{v-1}$ with $g_1(h^1 \mid h^0) = g_1(h^1)$, and $\lambda$ is a dominating measure. Thus, (2.1) shows how the expected cumulative utility can be expressed using the data-generating model.

The foregoing assumptions along with the assumption of no interference among experimental units are standard in causal inference for non-spatial sequential decision making problems [Chakraborty and Moodie, 2013, Schulte et al., 2014]. However, in spatio-temporal decision problems, the proximity of the locations can induce spillover effects thereby causing interference among experimental units (locations) [Halloran and Struchiner, 1995, Diez Roux, 2004, Hong and Raudenbush, 2006, Hudgens and Halloran, 2008b, VanderWeele and Tchetgen Tchetgen, 2011, Ogburn and VanderWeele, 2014]. Furthermore, in many settings, there are cost constraints of the form $\sum_{\ell \in \mathcal{L}} \omega^t_\ell a^t_\ell \le c^t$, where $\omega^t_\ell$ is the cost of applying treatment at location $\ell$ at time $t$ and $c^t$ is a total budget at time $t$. Constraints of this form are another reason why the


methods rely on estimation of part or all of the conditional distribution of $Y^t$ given $(S^t, A^t)$ treating $A^t$ as a categorical variable with $2^L$ levels. Fitting such a model, even if sufficient replications were available to identify the distribution, would be computationally infeasible.

2.4 Estimating an optimal allocation strategy

In the context of an emerging epidemic, there is typically little or no data that can be used to form reliable estimators for some (or all) components of the system dynamics model. Thus, it is essential to add information from scientific theory into the estimation process. We integrate scientific theory into the estimation process by taking a Bayesian perspective on parameter uncertainty and allowing the use of informative priors on some (or all) of the parameters indexing our postulated system dynamics model.

An overview of our estimation procedure is as follows. Let $\mathcal{D}$ denote a class of deterministic, i.e., non-stochastic, allocation strategies. Under $d \in \mathcal{D}$, a decision maker presented with state $S = s$ will select allocation $d(s)$. At each time $t$, we draw a system dynamics model from the posterior distribution over dynamics models and subsequently use simulation-optimization [Law et al., 1991, Banks et al., 1998, Gosavi, 2003b] to compute a maximizer, say $\widehat{d}^t$, of (2.1) over $\mathcal{D}$ where (2.1) is computed with respect to the sampled dynamics model. Given state $S^t = s^t$, the selected allocation at time $t$ is $\widehat{d}^t(s^t)$. This implicitly defines a stochastic allocation, $\widehat{\pi}^t(s^t)$, as a mixture over $\{d(s^t) : d \in \mathcal{D}\}$ with mixture probabilities equal to the posterior probability that $d$ is the maximizer of (2.1); thus, the implied class of stochastic strategies, $\Pi$, is the class of all mixtures over strategies in $\mathcal{D}$. A schematic for this procedure is displayed in the left panel of Figure 2.2.


[Figure 2.2 schematic, image not reproduced. Left panel nodes: posterior over dynamics models; sample model; simulation optimization; $\widehat{d}$; set $A = \widehat{d}(S)$; observe $S$. Right panel nodes: posterior over $(\beta, \theta)$; sample $(\widetilde{\beta}, \widetilde{\theta})$; simulation optimization; $\widehat{\pi} = \arg\max_{d \in \mathcal{D}} C_T(d; \widetilde{\beta}, \widetilde{\theta})$; set $A = \widehat{\pi}(S)$; observe $S$.]

Figure 2.2: Left: Schematic for Thompson sampling over a generic class of dynamics models. Right: Schematic for Thompson sampling with a parametric class of models indexed by $(\theta, \beta)$ and a finite simulation horizon $T$.

special cases [Chapelle and Li, 2011, Agrawal and Goyal, 2011, Kaufmann et al., 2012, Korda et al., 2013]. Intuitively, a stochastic allocation strategy should balance exploration of the space of potential allocations with choosing allocations that are estimated to produce high expected utility; the proposed version of Thompson sampling achieves this balance through the posterior of mean utility under each $d \in \mathcal{D}$, which becomes increasingly concentrated on the maximizer as data accumulate.

To describe the implementation of our estimator we make several assumptions in addition to (A1)-(A3). We assume the system is Markov and homogeneous in time so that, for any $v$, the densities in (2.1) become $f_v(y^v \mid h^v, a^v) = f(y^v \mid s^v, a^v)$ and $g_v(s^v \mid h^{v-1}) = g(s^v \mid s^{v-1}, a^{v-1})$.


data $(\overline{S}^T, \overline{Y}^T, \overline{A}^T)$ is

$$L_T(\beta, \theta) = \prod_{v=1}^{T} f(Y^v \mid S^v, A^v; \beta)\, \pi^v(A^v; S^v)\, g(S^v \mid S^{v-1}, A^{v-1}; \theta),$$

where we define $g(s^1 \mid s^0, a^0) = g(s^1)$ to be the distribution of the initial state.

For any deterministic strategy $d$ and fixed $T > 0$, define

$$C_T(d; \beta, \theta) = \int \left\{\sum_{t=1}^{T} \gamma^{t-1} u(y^t)\right\} \prod_{v=1}^{T} f\{y^v \mid s^v, d(s^v); \beta\}\, g\{s^v \mid s^{v-1}, d(s^{v-1}); \theta\}\, d\lambda(\overline{y}^T, \overline{s}^T),$$

and for $t \le T$ define $\widehat{\pi}^{t,T} = \arg\max_{d \in \mathcal{D}} C_T(d; \widetilde{\beta}^t, \widetilde{\theta}^t)$ where $\widetilde{\beta}^t, \widetilde{\theta}^t$ are distributed according to the posterior of $\beta, \theta$ given $H^t$.
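The integral defining $C_T(d; \beta, \theta)$ is rarely available in closed form, but it can be approximated by simulating trajectories from the postulated model under $d$ and averaging the discounted utilities. The sketch below illustrates this under assumptions that are not part of the text: a toy transition model in which infection is absorbing, an outcome equal to the post-transition infection status (so that $f$ is folded into $g$), a utility equal to minus the proportion of infected locations, and the hypothetical names `simulate_step`, `utility`, and `estimate_C_T`.

```python
import numpy as np

def simulate_step(state, action, theta, rng):
    """Toy transition g(s' | s, a; theta); a hypothetical model for illustration.

    state, action: 0/1 integer arrays of length L.
    theta: (base, spread, treat) on the logit scale.
    """
    base, spread, treat = theta
    frac_infected = state.mean()
    # Probability that a currently uninfected location becomes infected.
    logit = base + spread * frac_infected - treat * action
    p = 1.0 / (1.0 + np.exp(-logit))
    newly_infected = rng.random(state.shape[0]) < p
    # Assumption: infected locations stay infected.
    return np.where(state == 1, 1, newly_infected.astype(int))

def utility(state):
    """Assumed utility: minus the proportion of infected locations."""
    return -float(state.mean())

def estimate_C_T(d, s1, theta, T, gamma, n_reps, rng):
    """Monte Carlo approximation of C_T(d; theta) starting from state s1."""
    total = 0.0
    for _ in range(n_reps):
        s = s1.copy()
        for t in range(1, T + 1):
            a = d(s)                           # deterministic allocation d(s^t)
            s = simulate_step(s, a, theta, rng)
            total += gamma ** (t - 1) * utility(s)
    return total / n_reps
```

A strategy is passed as any callable mapping a state vector to a 0/1 allocation vector, e.g. `lambda s: np.zeros_like(s)` for "treat nowhere". The number of replicates `n_reps` trades Monte Carlo error against run time.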

If the parametric densities are correctly specified, i.e., $f(y^t \mid s^t, a^t) = f(y^t \mid s^t, a^t; \beta^*)$ and $g(s^t \mid s^{t-1}, a^{t-1}) = g(s^t \mid s^{t-1}, a^{t-1}; \theta^*)$ for 'true' parameters $\beta^*$ and $\theta^*$, then under standard regularity conditions [Gelman et al., 2014], $\pi^{\mathrm{opt}} = \arg\max_{\pi \in \Pi} \lim_{t \to \infty} \left\{ \lim_{T \to \infty} C_T(d; \widetilde{\beta}^t, \widetilde{\theta}^t) \right\}$ with probability one.

Algorithm 2.1: Policy-search algorithm for an optimal allocation strategy.

Input: $T < \infty$, $S^1$
1 Draw $\widetilde{\beta}^1, \widetilde{\theta}^1$ from the prior
2 Compute $\widehat{\pi}^1 = \arg\max_{d \in \mathcal{D}} C_T(d; \widetilde{\beta}^1, \widetilde{\theta}^1)$ (via Algorithm 2)
3 for $j \ge 1$ do
4     Apply allocation $A^j = \widehat{\pi}^j(S^j)$, observe $Y^j$, $S^{j+1}$
5     Draw $\widetilde{\beta}^{j+1}, \widetilde{\theta}^{j+1}$ from posterior of $(\beta, \theta)$ given $H^{j+1}$
6     Compute $\widehat{\pi}^{j+1} = \arg\max_{d \in \mathcal{D}} C_T(d; \widetilde{\beta}^{j+1}, \widetilde{\theta}^{j+1})$ (via Algorithm 2)
7 end
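The loop in Algorithm 2.1 can be sketched in a few lines for a toy problem. Everything specific below is an assumption made for illustration rather than the method of the text: the dynamics are a two-state (uninfected/infected) chain per location with transition probabilities `theta[i, j]`, the priors are independent Beta(1, 1) so that the posterior is conjugate, the class $\mathcal{D}$ contains only two hypothetical strategies, the maximization over $\mathcal{D}$ is done by brute-force Monte Carlo from the current state instead of the stochastic approximation referred to as Algorithm 2, and all function names are invented.

```python
import numpy as np

def step(s, a, theta, rng):
    """Toy Markov transition: theta[i, j] = P(infected next | infected == i, treated == j)."""
    return (rng.random(s.size) < theta[s, a]).astype(int)

def make_strategy(prefer_infected, c):
    """Deterministic strategy that treats c locations, infected ones first or last."""
    def d(s):
        order = np.argsort(-s if prefer_infected else s, kind="stable")
        a = np.zeros_like(s)
        a[order[:c]] = 1
        return a
    return d

def estimate_C_T(d, s, theta, T, gamma, n_reps, rng):
    """Monte Carlo estimate of the discounted utility of d under theta,
    with assumed utility -(proportion infected)."""
    total = 0.0
    for _ in range(n_reps):
        cur = s.copy()
        for t in range(T):
            cur = step(cur, d(cur), theta, rng)
            total -= gamma ** t * cur.mean()
    return total / n_reps

def thompson_sampling(theta_true, D, s1, n_steps, T=5, gamma=0.9, n_reps=20, seed=0):
    rng = np.random.default_rng(seed)
    # counts[i, j, 0] and counts[i, j, 1] are the Beta parameters for theta[i, j];
    # all ones corresponds to independent Beta(1, 1) priors.
    counts = np.ones((2, 2, 2))
    s, chosen = s1.copy(), []
    for _ in range(n_steps):
        # Draw parameters from the current posterior (the prior at the first step).
        theta_draw = rng.beta(counts[..., 0], counts[..., 1])
        # Maximize the simulated value over the finite class D.
        values = [estimate_C_T(d, s, theta_draw, T, gamma, n_reps, rng) for d in D]
        best = int(np.argmax(values))
        # Apply the allocation and observe the next state from the true system.
        a = D[best](s)
        s_next = step(s, a, theta_true, rng)
        # Conjugate update: one Bernoulli observation per location.
        np.add.at(counts, (s, a, 1 - s_next), 1)
        s = s_next
        chosen.append(best)
    return chosen, counts
```

Because a fresh parameter draw is taken at every step, the strategy that is applied is random even though each member of `D` is deterministic; as the counts grow, the draws concentrate and the same strategy tends to be selected repeatedly.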


accuracy of the numerical integration used to compute $C_T(d; \beta, \theta)$. In Section 2.4.1, we provide a class of strategies under which sampling from $\pi(a; s)$ scales linearly in the number of locations, $L$, making it feasible even when $L$ is on the order of tens of thousands. In most ecological applications, the dimensions of $\beta$ and $\theta$ are orders of magnitude smaller than $L$, e.g., the "gravity model" for WNS [Maher et al., 2012] is determined by thirteen parameters; thus, integrating over the posterior of these parameters is typically not a computational bottleneck. As detailed in the next section, we use stochastic approximation to compute $\arg\max_{\pi \in \Pi} C_T(d; \beta, \theta)$; the number of Monte Carlo replicates in the numerical integration used to approximate $C_T(d; \beta, \theta)$ is generally smaller than $L$.

Remark 2.4.1 The Markov dynamics assumption used above is always trivially true if $S^t = H^t$ for all $t$ (more formally, let $S^t \in \mathbb{R}^\infty$ and define $S^t = (t, H^t, 0)$, where $0$ is the zero element in $\mathbb{R}^\infty$). However, this choice of state is rarely useful in large systems as the growing dimension makes modeling difficult. Thus, the Markov assumption can be viewed as an assumption about the ability of domain experts and analysts to construct a concise summary of the past that captures all salient features of the decision problem. One approach is to construct the state by concatenating information from the past $k$ time points where $k$ is dictated by domain knowledge or estimated from historical data. State construction for Markov decision processes is currently an active area of research [Mahadevan, 2009, Sugiyama, 2015].
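As a small illustration of building a state from the past $k$ time points, the sketch below concatenates the most recent $k$ per-time summary vectors. The function name `lagged_state`, the use of a fixed-length feature vector per time point, and the zero padding used before $k$ observations are available are all assumptions made for the example.

```python
import numpy as np

def lagged_state(features, k):
    """Concatenate the last k per-time feature vectors into one state vector.

    features: non-empty list of 1-D arrays of equal length p, oldest first.
    k: number of lags to keep (k >= 1).
    Early time points are zero-padded on the left so the state always has length k * p.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    p = features[0].shape[0]
    recent = list(features[-k:])
    padding = [np.zeros(p) for _ in range(k - len(recent))]
    return np.concatenate(padding + recent)
```

With this construction the state dimension stays fixed at $k \times p$ regardless of $t$, in contrast to taking the full history as the state.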

Remark 2.4.2 The assumption of a low-dimensional parametric model for the transition may seem overly restrictive in some settings. Because the dynamics model is being estimated online, sieves are a natural mechanism to add flexibility as data accumulate. A template for this approach is as follows. Assume that $g(s^t \mid s^{t-1}, a^{t-1}) = g(s^t \mid s^{t-1}, a^{t-1}; \theta^*)$ where $\theta^* \in \Theta \subseteq \mathbb{R}^\infty$. Let $k_t$ denote a sequence of non-decreasing integers that satisfies $k_t \to \infty$ as $t \to \infty$ and postulate models of the form $g_{k_t}(s^t \mid s^{t-1}, a^{t-1}; \theta^*_{k_t})$ where $\theta^*_{k_t} \in \Theta_{k_t} \subseteq \mathbb{R}^{k_t}$ is the projection of $\theta^*$ onto $\Theta_{k_t}$. Let $\widehat{\theta}^t_{k_t}$ denote the maximum likelihood estimator of $\theta^*_{k_t}$. Then, under appropriate regularity conditions [e.g., Newey, 1997] $(\widehat{\theta}^t_{k_t} - \theta^*_{k_t}) = O_P$


sampling algorithm could be implemented by drawing $\widetilde{\theta}^t_{k_t}$ from a $k_t^{\gamma}/\sqrt{t}$-neighborhood of $\widehat{\theta}^t_{k_t}$.

An alternative is to fit a Bayesian nonparametric (BNP) model. A BNP model assumes an infinite-dimensional parameter space so that with sufficient data the estimated model converges to the true model with few assumptions about its parametric form. BNP has recently been applied to spatiotemporal epidemics [Xu et al., 2016a] and used to estimate a non-spatial dynamic treatment regime [Xu et al., 2016b], and so this appears to be a promising direction for future work.

2.4.1 A scalable class of allocation strategies

The class of allocation strategies $\mathcal{D}$ has a large impact on the quality of the estimated optimal decision strategy and the computational complexity of Algorithm 2.1. We propose a flexible but computationally efficient class of allocation strategies that is designed to scale to large decision problems with potentially tens of thousands of locations. However, as we demonstrate in the next section, this class of strategies is also useful for problems with as few as 100 nodes. Throughout, we assume that at time $t$ exactly $c^t$ locations can be treated; while $c^t$ is allowed to depend on the state $S^t$, we suppress this in the notation.

Because of spatial interference, the effect of treating a given location will depend on the

configuration of treatments applied at nearby locations. Thus, finding an optimal treatment

allocation is a complex discrete optimization problem. To reduce computational burden, we

select an allocation in batches. Each location is assigned a priority score that depends both on

the current state and allocations selected in the preceding batches. For each batch, we then

select locations with highest priority scores for treatment.

The class of allocation strategies that we propose depends on a parametric class of functions from $\mathrm{supp}\, S^t \times \{0, 1\}^L$ into $\mathbb{R}^L$, $\mathcal{R} = \{R(s^t, a^t; \eta) : \eta \in E\}$, where $E \subseteq \mathbb{R}^q$. Given $\eta \in E$, the function $R(s^t, a^t; \eta)$ is a vector of priority scores, one per location, so that $R_\ell(s^t, a^t; \eta)$ represents


locations $\{j : a^t_j = 1\}$ are certain to be treated. If $a^t_\ell = 1$ then $R_\ell(s^t, a^t; \eta) = -\infty$ so that each location is selected for treatment at most once per time point. For each non-negative integer $m$, define the binary vector

$$U^t_\ell(s^t, a^t; \eta, m) = \begin{cases} 1 & \text{if } R_\ell(s^t, a^t; \eta) \ge R_{(m)}(s^t, a^t; \eta), \\ 0 & \text{else,} \end{cases}$$

where $\ell \in \mathcal{L}$ and $v_{(k)}$ denotes the $k$th order statistic of $v$. Let $k \le c^t$ be a non-negative integer and let $0$ be a vector of zeros. Define $d^{(1)}(s^t; \eta)$ to be the binary vector that selects the $\lfloor c^t/k \rfloor$ locations with the highest priority scores. Let $w^{(1)}$ denote $d^{(1)}(s^t; \eta)$. Recursively, for $j = 2, \ldots, k$, set $w^{(j)} = d^{(j-1)}(s^t; \eta)$, $\Delta_j = \lfloor j c^t/k \rfloor - \lfloor (j-1) c^t/k \rfloor$, and $d^{(j)}(s^t; \eta) = U^t(s^t, w^{(j-1)}; \eta, \Delta_j) + w^{(j-1)}$. The final decision rule is $d(s^t; \eta) = d^{(k)}(s^t; \eta)$; the dependence of this rule on $t$ occurs only through $c^t$.
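The batched selection can be sketched as follows. This is one reading of the recursion, in which batch $j$ adds the $\Delta_j$ highest-scoring locations not yet selected to the allocation built by the previous batches; it is not a transcription of the authors' implementation. The callable `priority` is a hypothetical stand-in for $R(s^t, a^t; \eta)$ with $\eta$ already fixed, `toy_priority` is an invented score on a line graph, and ties are broken arbitrarily by `numpy.argpartition` rather than by the order-statistic comparison in the definition of $U^t$.

```python
import numpy as np

def batched_allocation(s, c, k, priority):
    """Select exactly c of L locations in k batches (requires 1 <= k <= c <= L).

    priority(s, w) must return a length-L array of scores given the state s
    and the partial allocation w chosen so far.
    """
    L = s.shape[0]
    w = np.zeros(L, dtype=int)
    for j in range(1, k + 1):
        # Batch size Delta_j = floor(j c / k) - floor((j - 1) c / k).
        delta = (j * c) // k - ((j - 1) * c) // k
        if delta == 0:
            continue
        scores = np.array(priority(s, w), dtype=float)
        scores[w == 1] = -np.inf          # never select a location twice
        # Indices of the delta largest scores, found in O(L) time.
        top = np.argpartition(-scores, delta - 1)[:delta]
        w[top] = 1
    return w

def toy_priority(s, w):
    """Hypothetical score: the state value, reduced by 0.5 for each adjacent
    location (on a line) that has already been selected."""
    selected_neighbors = np.zeros(s.shape[0])
    selected_neighbors[1:] += w[:-1]
    selected_neighbors[:-1] += w[1:]
    return s - 0.5 * selected_neighbors
```

With `k = 1` the scores are computed once and the top `c` locations are treated; with larger `k` the scores are refreshed between batches, which lets the rule account for interference from locations selected earlier at the cost of `k` score evaluations.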

The parameter $k$ in the above class of strategies governs the number of locations that are selected each time the priority scores are updated. If $k = 1$ then the priority scores are computed once, under no treatments, and the top $c^t$ locations are treated; if $k = c^t$ then the algorithm updates the priority scores after every location selection. In large problems, we anticipate choosing $k \ll L$, e.g., $k = O(\log L)$. If the computational complexity of computing $R_\ell(s^t, a^t; \eta)$ is $N$, then the complexity of computing $d(s^t; \eta)$ is $O(kLN)$. Thus, if $k = O(\log L)$ and $N$ is negligible relative to $L$, then evaluating the strategy is $O(L \log L)$, which is feasible even for large values of $L$.

Let $\mathcal{D}$ denote the class of policies $\{d(s; \eta) : \eta \in E\}$. Algorithm 2.1 requires maximization of $C_T(d; \beta, \theta)$ over $d \in \mathcal{D}$ (or, equivalently, over $\eta \in E$). Thus, the order of computation


Statistics

Raleigh, North Carolina

2017

APPROVED BY:

Dr. Marie Davidian

Dr. Krishna Pacifici

Dr. Brian Reich

Dr. Butch Tsiatis

Dr. Eric Laber

DEDICATION

BIOGRAPHY

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ALGORITHMS

Chapter 1 Introduction

1.1 Sequential Decision Problems

1.2 Management of infectious diseases

1.3 Pursuit and evasion

Chapter 2 Optimal treatment allocations in space and time for online control of an emerging infectious disease

2.1 Introduction

2.2 White-nose syndrome in bats

2.3 Defining an optimal treatment allocation strategy