Markov Decision Processes and Approximate Dynamic Programming Methods for Optimal Treatment Design.


ABSTRACT

MASON, JENNIFER ELIZABETH. Markov Decision Processes and Approximate Dynamic Programming Methods for Optimal Treatment Design. (Under the direction of Brian T. Denton.)




Markov Decision Processes and Approximate Dynamic Programming Methods for Optimal Treatment Design

by

Jennifer Elizabeth Mason

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Industrial Engineering

Raleigh, North Carolina

2012

APPROVED BY:

Russell E. King David L. Roberts

Nilay D. Shah James R. Wilson


DEDICATION


BIOGRAPHY


ACKNOWLEDGEMENTS

First, I would like to thank my advisor Dr. Brian Denton for his ever-present guidance, support, and patience over the last ﬁve years. I appreciate all he has done for me to help develop my research and teaching skills, and I know that this training will help me immensely as I start my career. I would also like to acknowledge the funding that supported this research. This work was supported in part by the National Science Foundation under Grant Number CMMI 0968885 (Denton). The research was also supported in part by a Doctoral Dissertation Grant from the Agency for Healthcare Research and Quality under Grant Number 1R36HS020878 (Mason).

I would like to thank my committee members Dr. Russell King, Dr. David Roberts, Dr. Nilay Shah, and Dr. James Wilson for serving on my committee and providing me with helpful suggestions and edits for this dissertation. In addition, a special thanks to Dr. Nilay Shah and Dr. Steven Smith for their help and guidance over the last four years; I owe my understanding of medical problems to both of them.


TABLE OF CONTENTS

List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Literature Review
2.1 Basis Function Approximation
2.2 Reinforcement Learning
2.3 Applications of ADP Methods
2.4 Contributions of this Dissertation
Chapter 3 Optimal Control of Medication Treatment Initiation
3.1 Introduction
3.2 Diabetes Treatment Background and Literature Review
3.3 Model
3.4 Results
3.4.1 Data and Study Population
3.4.2 Model Validation
3.4.3 Primary Prevention Treatment Policies
3.4.4 Primary and Secondary Prevention Treatment Policies
3.4.5 Estimated Benefit of the Optimal Guidelines to the U.S. Diabetes Population
3.5 Conclusions
Chapter 4 Approximate Dynamic Programming Approaches for Optimal Treatment
4.1 Continuous-State MDP Formulation
4.2 Finite-State MDP
4.3 ADP Approach 1: Policy Mapping
4.4 ADP Approach 2: Basis Function Approximation
4.4.1 Linear Programming Formulation
4.4.2 Basis Functions
4.5 Monte Carlo Simulation
4.6 Results
4.6.1 Sensitivity Analysis
4.7 Conclusions
Chapter 5 Using Electronic Health Records to Monitor and Improve Adherence to Medication
5.2 Background on Medication Adherence
5.3 Literature Review
5.3.1 Machine Maintenance Applications
5.3.2 Medical Decision Making Applications
5.4 Model Formulation
5.5 Model Properties and Insights
5.5.1 Model Assumptions
5.5.2 Model Properties
5.6 Case Study: Statin Adherence for Patients with Type 2 Diabetes
5.6.1 Data and Model Parameter Estimation
5.6.2 Numerical Results
5.6.2.1 Active vs. Inactive Surveillance
5.6.2.2 Sensitivity to Cost of Intervention
5.6.2.3 Sensitivity to Individual Patient Risk Factors
5.6.2.4 Potential Yearly Benefits of AAS to the U.S. Diabetes Population
5.7 Conclusions
Chapter 6 Conclusions


LIST OF TABLES

Table 3.1 International guideline thresholds for initiation of cholesterol and blood pressure medications. Guidelines that assume diabetes patients are not considered CHD risk equivalent are represented with *. LDL is measured in mg/dL for the U.S. guidelines, and LDL, HDL, and TC are measured in mmol/L for all other guidelines. LR is unitless, and SBP is measured in mmHg.

Table 3.2 Ranges for TC, HDL, and SBP states based on [24].

Table 3.3 Baseline characteristics for the study population (N = 663), including mean and variance.

Table 3.4 Percentage change in risk factors for given medications as computed from Mayo Electronic Medical Records and Diabetes Electronic Management System.

Table 3.5 Description of model parameters including cost inputs and utility decrements for the reward function of the MDP model.

Table 3.6 Costs and utility decrements for each medication used in the model.

Table 3.7 Male comparison of expected LYs before death, expected LYs before a stroke or CHD event, and expected LYs after an event from age 50 for our MDP model and the Framingham Heart Study (FHS). The 95% confidence intervals are provided for the FHS estimates.

Table 3.8 Female comparison of expected LYs before death, expected LYs before a stroke or CHD event, and expected LYs after an event from age 50 for our MDP model and the Framingham Heart Study (FHS). The 95% confidence intervals are provided for the FHS estimates.

Table 3.9 Yearly costs (billions) and future event-free LYs for newly diagnosed diabetes patients using no treatment, optimal guidelines (R0 = $100,000, R0 = $250,000, and R0 = $10 million), and U.S. I.

Table 3.10 Yearly costs (billions) and future QALYs for newly diagnosed diabetes patients using no treatment, optimal guidelines (R0 = $100,000, R0 = $250,000, and R0 = $1 billion), and U.S. I.

Table 4.1 Parameter values for Equations (4.37), (4.38), (4.39), and (4.40) found in Stevens et al. [97] and Kothari et al. [61].

Table 4.2 Comparison among the ADP methods and no treatment of expected QALYs before a stroke, CHD event, or death from other causes for males and females. For each simulation, 120,000 patients are sampled. The 95% confidence intervals for simulated results are provided in parentheses.

Table 4.3 Sensitivity analysis results for base case probabilities and 50% higher medication decrements to QALYs.

Table 4.4 Sensitivity analysis results for base case probabilities and 50% lower

Table 4.5 Sensitivity analysis results for 25% higher probabilities and 50% higher medication decrements to QALYs.

Table 4.6 Sensitivity analysis results for 25% higher probabilities and base case medication decrements to QALYs.

Table 4.7 Sensitivity analysis results for 25% higher probabilities and 50% lower medication decrements to QALYs.

Table 4.8 Sensitivity analysis results for 25% lower probabilities and 50% higher medication decrements to QALYs.

Table 4.9 Sensitivity analysis results for 25% lower probabilities and base case medication decrements to QALYs.

Table 4.10 Sensitivity analysis results for 25% lower probabilities and 50% lower medication decrements to QALYs.

Table 4.11 Sensitivity analysis results for base case probabilities and 5 times higher medication decrements to QALYs.

Table 5.1 Adherence States Defined by Percentage of Days Covered (PDC) and the Corresponding Percent Change in Total Cholesterol (TC) for Patients that Initiate Statins.

Table 5.2 Initial hospitalization costs and follow-up events for adverse events.

Table 5.3 Optimal ages to begin having yearly interventions for female patients using active surveillance. Imperfect (probabilistic) interventions are assumed. Note: ‘–’ denotes it is never optimal for the patient to have interventions.

Table 5.4 Optimal ages to begin having yearly interventions for female patients using active surveillance. Perfect interventions are assumed.

Table 5.5 Yearly costs (billions) and future LYs for newly-diagnosed diabetes


LIST OF FIGURES

Figure 3.1 Simplified state transition diagram for the case of two medications. When medications are initiated (actions denoted by the solid lines), the risk factors are improved and the probability of the occurrence of an adverse event (denoted by the dashed lines) is reduced.

Figure 3.2 Comparison of optimal treatment policies for male patients to treatment by U.S. and international guidelines.

Figure 3.3 Comparison of optimal treatment policies for female patients to treatment by U.S. and other international guidelines.

Figure 3.4 Histograms to provide the difference in LYs and medication costs for males between the optimal guidelines (R0 = $100,000) and U.S. I.

Figure 3.5 Histograms to provide the difference in LYs and medication costs for females between the optimal guidelines (R0 = $100,000) and U.S. I.

Figure 3.6 Comparison of optimal treatment policies for male patients to treatment by U.S. and international guidelines.

Figure 3.7 Comparison of optimal treatment policies for female patients to treatment by U.S. and international guidelines.

Figure 3.8 Histograms to provide the difference in QALYs and medication and treatment costs for males between the optimal guidelines (R0 = $100,000) and U.S. I.

Figure 3.9 Histograms to provide the difference in QALYs and medication and treatment costs for females between the optimal guidelines (R0 = $100,000) and U.S. I.

Figure 4.1 Example of the partitioned continuous state space for LR and SBP. For this particular partitioning, which is used in the numerical experiments, LR is divided into three (qLR = 3) discrete states (low (L), medium (M), and high (H)), and SBP is divided into four (qSBP = 4) discrete states (low (L), medium (M), high (H), and very high (V)). The dots in each cell of the partition represent the conditional mean LR and SBP values for the cell.

Figure 4.2 Example of the tile coding in which the bounded continuous state space for LR and SBP is partitioned using two tilings. One tiling is shown with solid lines dividing the state space, and the other tiling is shown with a dotted line, providing a single tile over the entire state space.

Figure 5.2 Comparison of expected LYs versus costs for medication, interventions, and treatment of events for active adherence surveillance (AAS) policies (with varying R values) and inactive adherence surveillance (IAS) policies (when interventions occur every k years) for female patients using imperfect interventions. Results are a weighted average of LYs and costs for the 16 possible risk states.

Figure 5.3 Comparison of expected LYs versus costs, as shown in Figure 5.2, for male patients.

Figure 5.4 Comparison of expected LYs versus costs for medication, interventions,


Chapter 1

Introduction

Chronic diseases are the leading cause of death in the United States and other countries, accounting for seven out of ten deaths each year [62]. Fortunately, for many chronic diseases there are treatment options to manage the disease and reduce the risk of adverse events or death related to the disease. However, in many cases the cost of treatment is high, and some treatments have side effects that can reduce a patient’s quality of life. In light of these facts, the optimal control of treatment for chronic diseases is very important. Improving treatment plans for chronic diseases has the potential to prolong lives, improve quality of life, and reduce costs.


options are expensive, and the total cost of treatment for such a large portion of the population can be high. Therefore, the optimal time and order to initiate drug treatments (if at all) over the course of a patient’s lifetime is unclear.

Treatment optimization problems can pose many challenges. First, there are advantages and disadvantages to initiating medications. While treatment has the long-term benefit of reducing the probability of serious health outcomes, this must be traded off against the burden of taking medication, side effects, and the monetary cost of treatment. Second, if treatment is initiated, the decision is further complicated by choosing which medications to initiate and in which order. The large number of treatment options is coupled with a large state space that defines the possible health states of the patient based on risk factors, medication history, and the occurrence of adverse events. Uncertainties in the effects of treatment and the evolution of a patient’s health state as he or she ages further complicate the decision process.


Chapters 3, 4, and 5.

In Chapter 3 we describe an initial MDP model for optimal control of medication initiation decisions. This model aims to answer the following question: When and in what order should medications be initiated to reduce the risk of adverse health events? The time horizon is defined by a finite set of annual decision epochs. The states represent the patient’s health status based on risk factors for cardiovascular disease and stroke. The actions represent initiating medications or deferring initiation at each epoch. Once patients initiate a medication, they are assumed to remain on that medication for the remainder of their lives. The yearly rewards include a monetary reward for the patient’s quality of life minus the costs of treatment. A series of experiments are performed to evaluate the trade-off between these competing criteria. We compare expected outcomes for the optimal policy with outcomes for current blood pressure and cholesterol initiation guidelines in the United States and around the world. In general, we find that male patients should initiate treatments earlier than female patients. We report a number of findings related to the optimal sequence and time of treatment. We also present structural properties related to the initiation of one medication over another and the benefits of coordinated treatment.

In Chapter 4, we use ADP methods to approximate solutions to the true continuous-state MDP underlying the approximate discrete-state MDP formulation of Chapter 3. Since the continuous-state MDP is computationally intractable, we use a basis function approximation of the true value function for the continuous-state MDP. While the discrete-state MDP formulation of Chapter 3 provides one approximation, we experiment with basis function approximation methods to find alternative policies that will achieve the best results according to specific criteria. Numerical experiments are performed using a simulation model to compare the different policies found from the ADP approach and the MDP model from Chapter 3.


benefits of treatment. In Chapter 5 we formulate a new MDP model to answer the following question: When should adherence-improving interventions occur for patients who have already initiated medication? We use this MDP model to determine the optimal policy for interventions based on the patient’s adherence state. The time horizon is defined by a finite set of annual decision epochs. The states represent the percentage of time that the patient takes his or her medication as prescribed based on pharmacy claims data. The actions are deciding to have an intervention or deferring an intervention to a later stage. This decision is revisited each year regardless of the actions chosen in the previous years. The yearly rewards for this model are a monetary reward for the patient’s quality of life minus the costs of treatment and interventions. We find that the optimal policy is highly dependent on the cost and effectiveness of the intervention. In addition, we prove that the optimal policy is monotonic with respect to the patient’s adherence state. We also prove theorems providing insight into how the optimal control limit changes when interventions with different effects are considered and when patients of different health statuses are considered.


Chapter 2

Literature Review

The treatment of chronic diseases involves sequential decision making under uncertainty: actions (e.g., initiating a medication or waiting to treat) must be taken today without knowing what effect the actions will have on a patient’s health status or what the natural progression of a patient’s health will be. The need for sequential decision making under uncertainty also arises in many other settings such as machine maintenance, inventory control, and artificial intelligence. This literature review highlights solution methods for finite-horizon, discrete-time problems for which the Markov assumption holds. In particular, the primary model to define such problems is the MDP. MDPs are Markov processes that can be controlled through actions. At each stage of the process, the optimal action is taken according to some criteria as governed by optimality equations. A finite-horizon, discrete-time MDP is defined by the following. The decision horizon is defined by a discrete set of decision epochs indexed by $t = 1, \ldots, T$. The states in the MDP, $s \in S$, capture all information needed to make decisions. The set of possible actions is defined by $a \in A$. Transition probabilities among the states define the probability of being in state $s'$ at time $t+1$ given that the state at time $t$ is $s$ and the action taken at time $t$ is $a$:

$$p_t(s' \mid s, a) = \Pr\{s_{t+1} = s' \mid s_t = s,\ a_t = a\}. \tag{2.1}$$


action taken. In each decision epoch for each state, the optimal value function, $v_t(s)$, for a finite-horizon MDP is defined by the optimality equations:

$$v_t(s) = \max_{a \in A} \Big\{ r_t(s, a) + \lambda \sum_{s' \in S} p_t(s' \mid s, a)\, v_{t+1}(s') \Big\}, \tag{2.2}$$

along with the following boundary condition at stage $T$:

$$v_T(s) = \mu_T(s), \tag{2.3}$$

where $\lambda \in (0, 1]$ is the discount factor and $\mu_T(s)$ is the expected future rewards accrued after the end of the decision horizon. The optimal action, $a_t^*(s)$, at time $t$ for state $s$ is chosen based on the optimality equations:

$$a_t^*(s) = \arg\max_{a \in A} \Big\{ r_t(s, a) + \lambda \sum_{s' \in S} p_t(s' \mid s, a)\, v_{t+1}(s') \Big\}. \tag{2.4}$$
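The recursion in Equations (2.2) through (2.4) can be sketched as a short backward induction routine. This is a minimal illustration rather than the implementation used in this dissertation: the function names, the two-state "healthy"/"sick" example, and all numerical values are hypothetical and chosen only so that the calculation can be followed by hand.

```python
def backward_induction(states, actions, T, reward, trans, terminal, lam=1.0):
    """Solve a finite-horizon MDP by backward induction.

    reward(t, s, a)  -> r_t(s, a)
    trans(t, s, a)   -> dict {s': p_t(s' | s, a)}
    terminal(s)      -> mu_T(s)
    Returns value tables v[t][s] and optimal actions a_star[t][s].
    """
    v = {T: {s: terminal(s) for s in states}}    # boundary condition (2.3)
    a_star = {}
    for t in range(T - 1, 0, -1):                # epochs T-1, ..., 1
        v[t], a_star[t] = {}, {}
        for s in states:
            def q(a):
                # bracketed term of (2.2) for a fixed action a
                return reward(t, s, a) + lam * sum(
                    p * v[t + 1][s2] for s2, p in trans(t, s, a).items())
            best = max(actions, key=q)           # argmax in (2.4)
            a_star[t][s] = best
            v[t][s] = q(best)                    # max in (2.2)
    return v, a_star


# Hypothetical toy problem (all numbers are made up for illustration).
def reward(t, s, a):
    base = 1.0 if s == "healthy" else 0.5        # yearly quality-of-life reward
    return base - (0.05 if a == "treat" else 0.0)  # minus treatment cost

def trans(t, s, a):
    if s == "sick":
        return {"sick": 1.0}                     # absorbing state
    p_sick = 0.1 if a == "treat" else 0.3        # treatment lowers the risk
    return {"healthy": 1.0 - p_sick, "sick": p_sick}

v, a_star = backward_induction(
    ["healthy", "sick"], ["wait", "treat"], T=3,
    reward=reward, trans=trans, terminal=lambda s: 0.0)
```

In this toy instance, treating a healthy patient is worthwhile at the first epoch, when a future year of benefit remains, but not at the last decision epoch, when only the treatment cost is felt.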

Solution methods for MDPs have been well established [81]; however, MDPs can become more diﬃcult to solve or even intractable when the number of states and/or actions grows large or is inﬁnite. This phenomenon is often referred to as the curse of dimensionality [80]. This curse can arise when considering medical decision making problems. For example, for some chronic diseases, there are many possible treatment options available (especially when considering dosage of medications and treatment of multiple risk factors), leading to a large action space. In addition, the state space can grow very large for medical decision making problems when the state includes multiple patient risk factors, the state space is continuous, or the Markov process is of higher order incorporating the dependence of state transitions on the history of the patient’s health.


for MDPs that suffer from the curse of dimensionality. The ADP methods addressed include basis function approximation of the value function and sampling-based reinforcement learning (RL) techniques such as Q-learning. We also highlight particular examples of ADP techniques applied to different types of health care problems, including medical decision making problems. The remainder of this chapter is organized as follows: Section 2.1 reviews the literature related to basis function approximation of the value function. Section 2.2 highlights the main methods and algorithms related to RL. Section 2.3 provides examples of health care applications of ADP methods. Finally, Section 2.4 highlights the main contributions of this dissertation.

2.1 Basis Function Approximation

Approximation methods have long been used to solve problems for which computing power is inadequate. Bellman and Dreyfus propose the use of functional approximations to solve dynamic programming problems [13]. A function is approximated by adding up other functions, referred to as basis functions, multiplied by appropriately chosen coeﬃcients. Bellman and Dreyfus use the basis function approximation method to solve a recurrence relation. The basis functions used in their example are Legendre polynomials.

Basis function approximation has become a common approach for many types of problems (e.g., solving systems of partial diﬀerential equations). Powell (see Chapter 7) [80] describes the method of basis function approximation for estimating the value function of an MDP as a way of dealing with the curse of dimensionality. With this method, the number of state variables can be greatly reduced to the number of coeﬃcients for the basis functions. The estimated value function is given by the following:

$$\tilde{v}(s) = \sum_{k \in K} w_k f_k(s), \tag{2.5}$$

where $K$ is the set of indices for the basis functions and the coefficients, $w_k$, that serve as the


in Equation (2.5) reduces the dimensionality of the underlying problem to the selection of |K| parameters.
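As a rough illustration of this reduction, the sketch below fits a value function over 1,000 discrete states with only two weights. The basis functions (a constant and the state index), the made-up target values, and the least-squares fit are all assumptions for the example. They are not the basis functions or the fitting method used later in this dissertation.

```python
def fit_two_basis(states, values, f0, f1):
    """Least-squares weights (w0, w1) for v(s) ~ w0*f0(s) + w1*f1(s).

    Solves the 2x2 normal equations by Cramer's rule.
    """
    a00 = sum(f0(s) * f0(s) for s in states)
    a01 = sum(f0(s) * f1(s) for s in states)
    a11 = sum(f1(s) * f1(s) for s in states)
    b0 = sum(f0(s) * v for s, v in zip(states, values))
    b1 = sum(f1(s) * v for s, v in zip(states, values))
    det = a00 * a11 - a01 * a01
    return (b0 * a11 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det


def f0(s):
    return 1.0           # constant basis function (hypothetical choice)

def f1(s):
    return float(s)      # linear basis function (hypothetical choice)

states = list(range(1000))                   # 1,000 states in a lookup table
values = [3.0 + 2.0 * s for s in states]     # made-up "true" values

w0, w1 = fit_two_basis(states, values, f0, f1)

def v_tilde(s):
    """Equation (2.5) with |K| = 2."""
    return w0 * f0(s) + w1 * f1(s)
```

Here the table of 1,000 values is summarized by two numbers because the made-up target is exactly linear. In general the quality of the fit depends on how well the chosen basis functions match the shape of the true value function.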

Schweitzer and Seidmann [88] present a general framework for approximation of the value function for a stationary, infinite horizon semi-MDP, with finite state space. For problems with large state spaces, traditional solution methods for infinite-horizon MDPs (linear programs (LPs), value iteration, and policy iteration) may be too time consuming or even infeasible. For these large-scale problems it is important to reduce the size of the problem, and it may not be as important to have the optimal solution if a near-optimal solution is achievable with less effort. The value function may be approximated with K basis functions. Solving for the coefficients of these basis functions instead of using the traditional solution methods reduces the dimension of the problem from the number of the states to K. Schweitzer and Seidmann present three algorithms for estimating the values of the coefficients of the basis functions both with and without discounting: linear programming, policy iteration, and least squares. They also provide a framework for assessing the quality of the approximations.

De Farias and Van Roy [28] extend the work of Schweitzer and Seidmann by providing theoretical guarantees (error bounds) on the performance of the linear programming approach to ADP using basis functions. In addition, De Farias and Van Roy emphasize the importance of choosing appropriate coeﬃcients for the value-function approximations in the objective function. These state-relevant weights, represented by the column vector c, aﬀect the quality of the solution found using the approximate LP, provided here for a cost minimization problem:

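$$\begin{aligned}
\max \quad & \sum_{s \in S} \Big[ c(s) \sum_{k \in K} w_k f_k(s) \Big] \\
\text{s.t.} \quad & r(s, a) + \lambda \sum_{s' \in S} p(s' \mid s, a) \sum_{k \in K} w_k f_k(s') \ \geq\ \sum_{k \in K} w_k f_k(s), \quad \forall s \in S,\ a \in A.
\end{aligned} \tag{2.6}$$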
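The role of the constraints in (2.6) can be checked numerically on a very small problem. The sketch below is illustrative only: the two-state, two-action cost-minimization MDP and its numbers are made up, and the indicator basis functions (one per state) are an assumption chosen so that the approximation can represent the optimal cost-to-go exactly. It does not solve the LP. It computes the optimal cost-to-go by value iteration and then evaluates the smallest constraint slack for three candidate weight vectors.

```python
LAM = 0.9                       # discount factor (lambda in (2.6))
S, A = [0, 1], [0, 1]           # two states, two actions (made up)
cost = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}     # r(s, a)
P = {(0, 0): [0.5, 0.5], (0, 1): [1.0, 0.0],
     (1, 0): [0.0, 1.0], (1, 1): [0.8, 0.2]}                    # p(s' | s, a)
basis = [lambda s: float(s == 0), lambda s: float(s == 1)]      # indicator f_k


def v_tilde(w, s):
    """Approximate value sum_k w_k f_k(s)."""
    return sum(wk * fk(s) for wk, fk in zip(w, basis))


def min_slack(w):
    """Smallest (left side minus right side) over all constraints in (2.6)."""
    return min(
        cost[s, a]
        + LAM * sum(P[s, a][s2] * v_tilde(w, s2) for s2 in S)
        - v_tilde(w, s)
        for s in S for a in A)


# Optimal cost-to-go J* by value iteration (minimization).
J = [0.0, 0.0]
for _ in range(2000):
    J = [min(cost[s, a] + LAM * sum(P[s, a][s2] * J[s2] for s2 in S)
             for a in A)
         for s in S]

slack_at_opt = min_slack(J)                     # approximately zero
slack_above = min_slack([j + 1.0 for j in J])   # negative: infeasible
slack_below = min_slack([j - 1.0 for j in J])   # positive: feasible
```

With this basis, weights equal to the optimal cost-to-go satisfy every constraint with a binding constraint in each state, weights shifted upward violate the constraints, and weights shifted downward are feasible but give a smaller objective for any positive state-relevance weights. This matches the interpretation of the approximate LP as pushing a lower bound on the optimal cost-to-go as high as the constraints allow.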


approximate LP solution method were provided for an uncontrolled queuing system and a controlled queuing system in which the service rate could be controlled.

De Farias and Van Roy [29] further extend the use of the approximate LP to estimate weights for basis functions. While the approximate LP may have only a small number of variables, the number of constraints could be intractable, particularly when the action space is large. A constraint sampling method is presented to create a reduced linear program that still provides a near-feasible solution. A controlled queuing system is again used to demonstrate how the method could be applied.

There are several potential types of basis functions that have been shown to perform well in practice. Many sets of basis functions form a complete set of functions over $L^2(\mathbb{R})$. There are infinitely many functions in these sets, and as more of these basis functions are used to approximate the unknown function, a better approximation is achieved. As the number of basis functions used from this set approaches infinity, the approximate function approaches the true function, with appropriately chosen weights for the basis functions. Some examples of complete sets of basis functions include Legendre polynomials, radial basis functions, and Fourier series (where Fourier series form a complete set over $L^2([0, 2\pi])$).

For medical decision making problems, the topic of this dissertation, hazard functions can be used for the basis functions, as used by Lee et al. [64] (reviewed below in Section 2.3). Hazard functions can work well for medical decision making problems because often the state space includes inputs for the hazard functions. While there is no guarantee that use of this type of function will provide a good approximation of the value function being estimated (over $L^2(\mathbb{R}^+)$), these functions appear to be an intuitive choice for medical decision making problems.

2.2 Reinforcement Learning


of the environment (anything that cannot be changed by the agent), a policy that describes how the agent acts under certain conditions, rewards that indicate what is favorable in the short term, and a value function that provides information about what is favorable in the long term. The decision taken by the agent depends on the value associated with each action in the given state. RL models are closely related to MDPs, and solution methods for MDPs, such as dynamic programming, can also be considered RL techniques. RL models are divided into two categories: episodic tasks that occur for a finite length of time (e.g., finite-horizon MDPs) and continuing tasks that go on for an infinite amount of time (e.g., infinite-horizon MDPs). For the purpose of RL algorithms, $V^{\pi}(s)$ is the expected value of starting in state $s$ and proceeding with policy $\pi$, and the action-value function $Q^{\pi}(s, a)$ is the expected value of starting in state $s$, taking action $a$, and proceeding with policy $\pi$ thereafter. RL algorithms can be more useful than traditional dynamic programming algorithms when a perfect model of an MDP is not available. This may be the case, for example, when a large number of transition probabilities must be estimated. In addition, RL algorithms often require less computational effort.

The class of RL algorithms, including value iteration, can be described by the umbrella term generalized policy iteration (GPI). GPI involves both policy evaluation and policy improvement. Through policy evaluation the value, $V^{\pi}(s)$, of starting in a given state and proceeding with policy $\pi$ is computed for each state. Policy improvement determines whether the current policy can be improved upon by comparing $V^{\pi}(s)$ and $Q^{\pi}(s, a)$.


subset of states must be evaluated since the computational eﬀort required to estimate the value of a particular state is independent of the total number of states.

Gosavi [45] presents a tutorial on RL and reviews recent advances. Gosavi briefly describes several methods for solving discrete, stationary, infinite-horizon control problems with RL. He presents Q-learning methods that employ sampling with value iteration and policy iteration methods based on the Robbins-Monro (RM) stochastic approximation. The RM algorithm [85] was originally used to determine a unique root of a function. In Q-learning, the mean, $E[X]$, of a random variable $X$ can be estimated using the RM algorithm, where $X_m$ represents the sample from the $m$th iteration, $Y_m$ is the estimate of the mean from the $m$th iteration, and $\mu_m$ is the $m$th step size. The RM algorithm is provided in Algorithm 1. Convergence of this algorithm is guaranteed if $\lim_{M \to \infty} \sum_{m=1}^{M} \mu_m = \infty$ and $\lim_{M \to \infty} \sum_{m=1}^{M} \mu_m^2 < \infty$. One example step size that satisfies these conditions is $\mu_m = \frac{1}{1+m}$.

Algorithm 1 The Robbins-Monro Algorithm.

Step 1
    $m := 1$
    Set $Y_0$ to an arbitrary number
    Specify $\epsilon > 0$

Step 2
    $Y_m \leftarrow (1 - \mu_m) Y_{m-1} + \mu_m X_m$

Step 3
    if $|Y_m - Y_{m-1}| < \epsilon$ then
        stop
    else
        $m := m + 1$
        return to Step 2
    end if
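A minimal Python sketch of Algorithm 1 follows, assuming a step size of $1/(1+m)$, which satisfies both convergence conditions. The function name, the `sample` callback, the iteration cap `max_iter`, and the normal distribution in the usage line are hypothetical additions for illustration and are not part of the algorithm as stated.

```python
import random


def robbins_monro_mean(sample, y0=0.0, eps=1e-6, max_iter=1_000_000):
    """Estimate E[X] following Steps 1-3 of Algorithm 1.

    sample() draws X_m. The step size mu_m = 1/(1+m) is one admissible
    choice. max_iter is an added safeguard, not part of Algorithm 1.
    Returns the final estimate Y_m and the iteration count m.
    """
    y_prev = y0                                    # Step 1: arbitrary Y_0
    for m in range(1, max_iter + 1):
        mu = 1.0 / (1.0 + m)
        y = (1.0 - mu) * y_prev + mu * sample()    # Step 2: update
        if abs(y - y_prev) < eps:                  # Step 3: stopping test
            return y, m
        y_prev = y
    return y_prev, max_iter


# Usage with a made-up distribution (mean 5, standard deviation 1).
rng = random.Random(0)
estimate, iterations = robbins_monro_mean(lambda: rng.gauss(5.0, 1.0))
```

With this step size the update is a running average in which $Y_0$ is weighted like one extra sample. The stopping test in Step 3 compares successive estimates only, so with noisy samples it can trigger by chance before the estimate is accurate. A smaller $\epsilon$ makes this less likely at the cost of more iterations.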


estimate of the value function. If $W_k$ is the estimate from the $k$th iteration, the TD-learning algorithm defines the next iterate as the following:

$$W_{k+1} \leftarrow (1 - \mu) W_k + \mu\,[\text{feedback}_k], \tag{2.7}$$

where $\mu$ again represents the step size, and in the second term $\mu$ is multiplied by the value of feedback. The value of feedback depends on the type of TD-learning algorithm being used. However, the feedback typically is a function of the immediate reward. For the purpose of trying to estimate the value of state-action pairs, a unique $W$ will represent each pairing. For the case of maximizing rewards, larger feedback will increase the likelihood that a particular action should be taken while negative feedback has the opposite effect. As trials proceed, the optimal policy is learned. TD(0) is the same as the RM algorithm, while TD($\lambda$) incorporates more rewards in the future. For TD($\lambda$), the feedback is given by the following:

$$\text{feedback}_k = \sum_{i=0}^{\infty} \lambda^i r_{k+i}, \tag{2.8}$$

where $\lambda \in [0, 1]$ is the discount factor and $r_k$ is the immediate reward from iteration $k$.
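The update in Equation (2.7) with a discounted-sum feedback can be sketched in a few lines. This is an illustration under stated assumptions: the infinite sum is truncated to a finite, made-up reward sequence, and the function names and the constant step size are hypothetical.

```python
def td_feedback(rewards, k, lam):
    """Discounted sum of the rewards observed from iteration k onward.

    Finite truncation of the feedback for TD(lambda): rewards beyond the
    end of the list are treated as zero (an assumption of this sketch).
    """
    return sum(lam ** i * r for i, r in enumerate(rewards[k:]))


def td_update(w, feedback, mu):
    """One step of Equation (2.7)."""
    return (1.0 - mu) * w + mu * feedback


rewards = [1.0, 0.0, 2.0]                 # made-up reward sequence
fb = td_feedback(rewards, k=0, lam=0.5)   # 1.0 + 0.5 * 0.0 + 0.25 * 2.0
w_next = td_update(0.0, fb, mu=0.1)       # moves W a fraction mu toward fb
```

A larger step size moves the estimate toward the latest feedback more aggressively, while a smaller one averages over more trials.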

Q-learning [111] is an algorithm that is related to the RM algorithm and TD-learning. Q-learning uses simulation in a model-free context, in which transition probabilities are not assumed, to update the action-value function, $Q(s, a)$. The Q-learning algorithm for the $k$th iteration is provided in Algorithm 2 [112]. Note that $V_{k-1}(y) \equiv \max_b \{Q_{k-1}(y, b)\}$ in the Q-learning algorithm.


Algorithm 2 Q-Learning Algorithm.

Step 1
    Observe the current state $s_k$
    Select and take action $a_k$
    Observe the next state $y_k$
    Receive immediate reward $r_k$

Step 2
    Update $Q_{k-1}$ values based on the learning factor $\alpha_k$:
    if $s = s_k$ and $a = a_k$ then
        $Q_k(s, a) = (1 - \alpha_k) Q_{k-1}(s, a) + \alpha_k [r_k + \lambda V_{k-1}(y_k)]$
    else
        $Q_k(s, a) = Q_{k-1}(s, a)$
    end if
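The two steps of Algorithm 2 can be sketched as a simulation loop. The sketch below makes several assumptions that are not part of the algorithm as stated: actions are selected uniformly at random, the learning factor is held constant, and the two-state environment in `step` is a made-up deterministic example. All names are hypothetical.

```python
import random
from collections import defaultdict


def q_learning(step, states, actions, lam=0.9, alpha=0.5, iters=5000, seed=0):
    """Tabular Q-learning following Steps 1 and 2 of Algorithm 2.

    step(s, a, rng) -> (next state y, immediate reward r) simulates the
    environment. No transition probabilities are supplied to the learner.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)                        # Q_0(s, a) = 0 for all pairs
    s = rng.choice(states)
    for _ in range(iters):
        a = rng.choice(actions)                   # Step 1: select an action
        y, r = step(s, a, rng)                    # observe next state, reward
        v_next = max(Q[y, b] for b in actions)    # V_{k-1}(y) = max_b Q_{k-1}(y, b)
        # Step 2: only the visited pair (s_k, a_k) is updated.
        Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * (r + lam * v_next)
        s = y
    return Q


def step(s, a, rng):
    """Made-up environment: action 0 stays, action 1 switches state.

    A reward of 1 is received whenever the next state is state 1.
    rng is unused here and kept for stochastic environments.
    """
    y = s if a == 0 else 1 - s
    return y, (1.0 if y == 1 else 0.0)


states, actions = [0, 1], [0, 1]
Q = q_learning(step, states, actions)
greedy = {s: max(actions, key=lambda a: Q[s, a]) for s in states}
```

In this example the learned greedy policy switches out of state 0 and stays in state 1. A constant learning factor is adequate here because the made-up environment is deterministic. With random transitions or rewards, a decreasing learning factor such as the step sizes discussed for the RM algorithm would be the more usual choice.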

impractical (or impossible). In such situations Q-values need to be approximated using regression or some other state space approximation. Gosavi also discusses further extensions, recent advances in RL, and issues of convergence.

Kaelbling et al. [58] also present a survey of RL techniques. Kaelbling et al. describe RL as “the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.” Appropriate actions for the stochastic environment are chosen by searching the space of possible actions (with techniques such as genetic algorithms) to find the one that performs the best, or by using dynamic programming techniques to estimate utilities of taking certain actions. Kaelbling et al. also discuss the dilemma of exploitation versus exploration. They provide model-free methods (including TD-learning and Q-learning), in which the structure of the model is not assumed, and model-based methods (including certainty equivalent methods for learning the model), in which known data are used more efficiently to learn the model and determine the best actions. In addition, Kaelbling et al. present models for partially observable environments and describe applications of some of the models presented, including backgammon game play and robotics.


methods such as aggregation and basis function approximation of the value function.

2.3 Applications of ADP Methods

ADP methods have been applied to problems in many diﬀerent settings including applications such as energy allocation, vehicle routing, and backgammon. More recently ADP methods have been used for health care applications, including patient scheduling and medical decision making problems. In this section we present a summary of the ADP health care applications.

Maxwell et al. [67] use an ADP approach to determine the best strategy for dynamic repositioning of ambulances in metropolitan areas in order to maximize the number of calls reached within a designated length of time. The problem is formulated as an MDP with the state space including information about the number of ambulances and the number of waiting calls in the emergency medical services system. The ambulance classification includes the ambulance’s status, location, and timing of any movement. The call classification includes the call’s status, location, timing, and priority level. Decisions can only be made when events occur, such as a call coming in or an ambulance transporting a patient to the hospital. Only ambulances that have just finished transporting a patient to the hospital are available for redeployment to one of the possible ambulance bases. If a call with high priority is answered in more than the designated time, a cost of 1 is incurred; otherwise, there is no cost. This simple cost function does not assign different costs according to the length of time past the threshold that the call is answered. While this may be a shortcoming, the authors note that high-priority calls are served first since calls are attended to by priority level, and the modeling framework is general enough that it could take medical outcomes into account as done by Erkut et al. [36]. The objective of the MDP is to find the policy that minimizes the total discounted number of calls that cannot be answered within the threshold time.


to evaluate expected costs. Six basis functions are used: ϕ1(s) = 1 to allow for the value function to be shifted, and functions to describe assigned calls that cannot be reached within the threshold, the rate of calls that cannot be met by the threshold because of ambulance distance, the rate of missed calls due to queueing issues, and the same two rates for future calls. The basis function coeﬃcients are estimated using the cost projections found through Monte Carlo simulation. Two examples of implementation are provided for two large metropolitan areas to show the beneﬁt of the approximately optimal policies over traditional deployment policies.

Patrick et al. [75] present a discounted, infinite-horizon MDP to schedule appointments for incoming patients of different priority levels while meeting requirements for priority-specific wait times. The model considers an N-day planning horizon. The state includes information about the number of patients currently scheduled over the planning horizon and the number of patients of each priority type that are waiting to be scheduled. At each decision epoch, the person in charge of scheduling the patients must decide which appointment slots to allocate to patients waiting to be assigned. If there is more demand than time slots, then the action of diverting patients to other facilities may also be taken. Costs are associated with booking patients beyond their targeted time, diverting patients, and leaving patients unbooked. An ADP method is proposed to deal with the large state space of this MDP. Patrick et al. use a basis-function approximation of the value function and solve for the weighting parameters using an approximate LP. They also show how to derive a policy from the solution to the approximate LP. Simulation of the policy is used to estimate performance against the wait time targets in practice.


the daily dose of treatment the doctor should prescribe. Transition probabilities are defined among the health states, and a cost function rewards health measurements in the target treatment range. The objective is to minimize the expected cost over the COH time horizon. One solution method presented for the COH problem is to discretize the state space and solve the problem using backward induction. The authors also present an ADP method to approximate the value function as an alternative method to deal with the continuous state space problem [50]. The RM algorithm is implemented, and the value function is approximated using two different sets of piecewise linear basis functions. The basis function approximations result in costs 1% to 3% higher than the solution found by solving the MDP. However, solving the problems using basis function approximation of the value function took less than 20 seconds while the MDP took over 40 hours to solve.

Hsih et al. [53, 87] present an optimal learning approach for glycemic control for patients with type 2 diabetes. The main learning approach used is knowledge gradient. This approach uses one-step look ahead to determine the value of learning with each possible action. The state space includes the patient’s fasting plasma glucose, HbA1c, body mass index, and presence of side eﬀects from medication. Monthly time steps are used, and the decision made by the doctor is to recommend blood sugar management through diet and exercise or one of four glycemic control medications. A utility function is deﬁned in terms of the patient’s fasting plasma glucose, and the objective is to maximize the expected utility over the entire horizon. Learning techniques with one-step look ahead, including the knowledge gradient method, are used to learn the transition probabilities among the health states for each of the medication states. Numerical experiments show that this method can yield better policies than policies derived from a model which assumes estimated probabilities.


maximization of rewards for QALYs minus costs for all medical expenditures, and minimization of costs subject to no reduction in QALYs relative to current practice. The dialysis problem is formulated as an MDP with a continuous-time model for disease progression and discrete-time control of doses used in dialysis. Lee et al. also show how the MDP can be transformed into a discrete-time stochastic shortest path problem. The MDP is solved using approximate policy iteration, incorporating the steps of policy evaluation and policy improvement. The approximate value function is expressed as the sum of basis functions multiplied by weights, and the weights are estimated using the simulation model by finding weights that minimize the squared difference between the simulated rewards and the approximate value function. The basis functions are hazard rates of events dependent on the dose of dialysis, quality of life estimates for each of the dose levels, and the logarithm of each of the hazard rates and quality of life estimates. Complex dose strategies that were dependent on patient risk factors proved to be cost effective and could potentially reduce expected costs of treating patients with chronic kidney failure.
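The weight-estimation step that recurs in the applications above (fit basis function weights so that the approximate value function matches simulated values in the least-squares sense) can be sketched as follows. This is only an illustrative sketch, not the procedure of any cited paper: the function names, the two-term basis `phi`, and the sample data in the usage note are hypothetical.

```python
def phi(s):
    """Hypothetical basis: a constant term (phi_1(s) = 1) plus the raw state value."""
    return [1.0, float(s)]


def fit_weights(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    features: list of basis vectors phi(s), one per sampled state
    targets:  list of simulated values observed for those states
    """
    k = len(features[0])
    # Assemble X^T X and X^T y.
    a = [[sum(row[i] * row[j] for row in features) for j in range(k)]
         for i in range(k)]
    b = [sum(row[i] * y for row, y in zip(features, targets)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
                b[r] -= f * b[col]
    return [b[i] / a[i][i] for i in range(k)]


def approximate_value(weights, s):
    """Approximate value function: weighted sum of basis functions."""
    return sum(w * f for w, f in zip(weights, phi(s)))
```

For example, fitting to the made-up samples `[(0, 1.0), (1, 3.0), (2, 5.0)]` with `fit_weights([phi(s) for s, _ in samples], [y for _, y in samples])` recovers an intercept of 1 and a slope of 2.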

2.4 Contributions of this Dissertation


Chapter 3

Optimal Control of Medication Treatment Initiation

3.1 Introduction

Currently 25.8 million people in the United States have diabetes. Approximately 1.9 million people aged 20 or older were newly diagnosed with diabetes in 2010 [19]. Treatment of the diabetes population can be costly; currently it is estimated that $113 billion per year in direct medical costs is spent for diabetes-related treatment in the United States [55]. This yearly cost is expected to triple in the next 25 years. Part of the challenge of controlling costs is reducing medication costs for diabetes patients. Another consideration for controlling costs is to prevent or delay the occurrence of stroke and CHD events for which diabetes patients are particularly at risk, thereby reducing the hospitalization and other signiﬁcant follow-up costs associated with these events.


hand, can be used as a measure of primary and secondary prevention: QALYs trade off the benefits of treatment, including increase in event-free years, with the reduction in quality of life due to treatment or the occurrence of adverse events. QALYs are one of the most commonly used criteria in the health policy literature [42]. They define a measure of a life year on a 0 to 1 scale based on a patient’s health status, with QALY decrements to represent the burden of treatment or minor illnesses and debilitating diseases or events such as stroke or CHD events. QALYs can be used to measure the trade-off between the burden of treatment and the benefits of prevention of stroke and CHD events.

Recently, several risk models have been developed to predict the probability of complications of type 2 diabetes over the course of an individual's lifetime [61, 97, 96, 34, 35]. These models serve as a guide to clinicians for establishing the importance of treatment; however, there has been little investigation of how to effectively use these risk models to design optimal treatment policies for blood pressure and cholesterol management. The research presented in this chapter seeks to bridge this gap by furthering the basic knowledge of how to optimally treat cardiovascular risk in patients with diabetes over the course of their lifetime.

We present an MDP to determine the optimal timing of medical treatment decisions for blood pressure and cholesterol control in patients with type 2 diabetes. We consider two different bi-criteria formulations of our MDP problem. First, we use our model to find the optimal treatment decisions that trade off the expected time to first event and the cost of medication. Second, we use our model to find the optimal treatment decisions that trade off expected QALYs and total costs of treatment (medication costs plus one-time and follow-up treatment costs for adverse events). In both cases we combine the two criteria using a willingness-to-pay factor to balance life years (LYs) and QALYs against the costs of medication and treatment, respectively. We vary the willingness-to-pay factor to estimate the efficient frontier of treatment policies. We also evaluate the most common treatment guidelines in the United States and other countries applied to U.S. patients and compare them to the Pareto-optimal policies from our model.
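To illustrate how a willingness-to-pay factor combines two criteria into one objective, the sketch below scores a fixed set of candidate policies by (willingness-to-pay × LYs − cost) and records which policy wins as the factor varies. This is a simplification for illustration: in the chapter the MDP is re-solved for each value of the factor, whereas here the candidate policies, their names, and their (LY, cost) values are all hypothetical.

```python
def efficient_frontier(policies, wtp_values):
    """Return the policies that are best for at least one willingness-to-pay value.

    policies:   dict mapping a policy name to (expected life years, expected cost)
    wtp_values: iterable of willingness-to-pay factors (dollars per life year)
    """
    frontier = []
    for wtp in wtp_values:
        # Scalarized objective: dollar value of life years minus cost.
        best = max(policies,
                   key=lambda name: wtp * policies[name][0] - policies[name][1])
        if best not in frontier:
            frontier.append(best)
    return frontier


# Hypothetical candidates: (expected LYs, expected discounted cost in dollars).
candidates = {
    "no treatment": (10.0, 0.0),
    "one medication": (10.5, 3000.0),
    "two medications": (10.6, 20000.0),
}
frontier = efficient_frontier(candidates, [0.0, 10000.0, 500000.0])
```

With these made-up numbers each candidate is optimal for one of the three factors, so all three appear on the estimated frontier, ordered from cheapest to most effective.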


arises in the context of many chronic diseases. There is a significant literature on treatment optimization. However, to our knowledge ours is the first to examine simultaneous control of multiple risk factors. We highlight the benefits of coordinated treatment over the myopic nature of current guidelines by comparing costs, QALYs, and event-free LYs for the different policies. We also present structural properties for the order of treatment initiation when primary prevention is considered, and we provide discussion on the benefits of coordinated treatment for multiple risk factors.

We address several specific research questions in this chapter including the following: How much can coordinated management of coexisting risk factors improve patient outcomes (e.g., QALYs, LYs before an adverse health event) over current guidelines? What effect does treatment coordination have on costs? How should treatment plans differ for males and females? How dependent is the optimal treatment regimen on an individual patient's metabolic risk profile? To help answer these questions, we present patient-specific treatment plans based on our model. We also compare expected LYs and medication costs, and expected QALYs and total costs for optimal treatment plans and current practice guidelines.

The remainder of this chapter is organized as follows: In Section 3.2 we provide background on diabetes treatment and a review of the relevant literature. In Section 3.3 we give a detailed description of the MDP model. In Section 3.4 we present numerical results at both the individual level and the population level. Finally, in Section 3.5 we highlight main conclusions and directions for future work.

3.2 Diabetes Treatment Background and Literature Review


of glucose in individuals with diabetes has signiﬁcant risk reduction for cardiovascular events [103, 47, 32].

There are many published recommendations in the United States and other countries for initiation of blood pressure and cholesterol medications. Table 3.1 provides a summary of U.S. and international guidelines for initiation of these medications based on well-established risk factors. For comparison, we provide both the current U.S. guideline for diabetes patients that uses the same treatment threshold for all patients (U.S. I) and the current U.S. guideline for patients without diabetes that uses risk-based treatment thresholds (U.S. II). The patients are assigned a risk level (low, medium, or high) based on risk factors such as age and gender. In the United States, initiation of blood pressure and cholesterol medications has been recommended by two independent committees [10, 22]. For diabetes patients these guidelines are “one size ﬁts all”; all diabetes patients are treated to the same threshold, regardless of risk of events, gender, age, or any other factors. The uncoordinated treatment of these risk factors is questionable since blood pressure and cholesterol both aﬀect the overall health of a patient and his or her risk of complications [97, 96].

U.S. and other international guidelines are typically defined by clinical thresholds for stroke and CHD risk factors (other events which are less common such as kidney failure and neuropathy also influence guidelines). The most common risk factors considered by the guidelines are cholesterol and systolic blood pressure (SBP). There are several measures associated with cholesterol including low-density lipoprotein (LDL), high-density lipoprotein (HDL), lipid ratio (LR), and total cholesterol (TC). A patient's TC is a combination of LDL, HDL, and triglycerides, a relationship estimated by the Friedewald equation [40]. A patient's LR is TC divided by HDL. If any of these risk factors are outside of the specified threshold the patient should begin an additional medication for cholesterol or blood pressure treatment, as appropriate.
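The relationships among the cholesterol measures can be made concrete with a short sketch. It uses the commonly stated form of the Friedewald equation in mg/dL, LDL = TC − HDL − triglycerides/5, and the definition of LR given above. The threshold check mirrors the "U.S. I" thresholds listed in Table 3.1 (LDL ≥ 100 mg/dL, SBP > 130 mmHg); the function names and the patient values in the usage note are hypothetical.

```python
def friedewald_ldl(tc, hdl, triglycerides):
    """Estimate LDL (mg/dL) via the Friedewald equation: LDL = TC - HDL - TG/5."""
    return tc - hdl - triglycerides / 5.0


def lipid_ratio(tc, hdl):
    """LR is TC divided by HDL (unitless)."""
    return tc / hdl


def medications_to_start(ldl, sbp, ldl_threshold=100.0, sbp_threshold=130.0):
    """Threshold check in the style of the U.S. I guideline row of Table 3.1.

    Returns which treatment classes the thresholds would trigger.
    """
    return {
        "cholesterol": ldl >= ldl_threshold,
        "blood_pressure": sbp > sbp_threshold,
    }
```

For a hypothetical patient with TC 200, HDL 50, and triglycerides 150 (all mg/dL), `friedewald_ldl` gives an LDL of 120 mg/dL and `lipid_ratio` gives 4.0; with an SBP of 128 mmHg, only the cholesterol threshold is exceeded.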


Table 3.1: International guideline thresholds for initiation of cholesterol and blood pressure medications. Guidelines that assume diabetes patients are not considered CHD risk equivalent are represented with *. LDL is measured in mg/dL for the U.S. guidelines, and LDL, HDL, and TC are measured in mmol/L for all other guidelines. LR is unitless, and SBP is measured in mmHg.

Guideline | Cholesterol | Blood Pressure
U.S. I [10, 22] | ATP III: LDL ≥ 100 | JNC 7: SBP > 130
U.S. II [10, 22] | ATP III*: High Risk: LDL ≥ 100, Medium Risk: LDL ≥ 130, Low Risk: LDL ≥ 190 | JNC 7*: SBP > 140
Australia [48] | LDL ≥ 2.5 or TC ≥ 4.0 or HDL < 1.0 | SBP > 130
Canada [15] | LDL ≥ 2.5 or LR ≥ 4.0 | SBP > 130
European Union [46] | LDL ≥ 2.5 or TC ≥ 4.5 | SBP > 130
British [57] | LDL ≥ 2.0 or TC ≥ 4.0 | SBP > 130

UKPDS model is a set of risk equations based on a large cohort of diabetes patients in the United Kingdom; inputs for the risk equations include time since diagnosis of diabetes, age, SBP, LR, and gender. We use the UKPDS model to estimate probabilities of fatal and nonfatal stroke and CHD events in our MDP.


over the course of a patient’s lifetime.

MDP models have also been used to determine the optimal timing of one-time medical interventions for a number of diseases other than diabetes. Alagoz et al. [7, 8] provide a discrete-time, infinite-horizon, stationary MDP model to determine the optimal timing of liver transplantation based on a patient’s MELD score. They also present structural results, proving sufficient conditions for the existence of a control-limit policy for transplantations. Shechter et al. [90] present an MDP model for the optimal initiation of HIV treatment according to a patient’s CD4 count with the goal of maximizing a patient’s quality-adjusted lifetime. They assume a stationary, infinite-horizon model and prove that a control-limit policy exists in terms of the patient’s CD4 count.

This chapter contributes to the existing literature in two main ways. First, we present a novel model formulation to determine optimal treatment policies for management of a chronic disease over the course of a patient's lifetime. Our model involves the use of multiple medications for simultaneous control of multiple risk factors. To our knowledge, we are the first to model simultaneous control of multiple risk factors. Most related research concentrates on optimal treatment decisions for a single risk factor. Other diabetes models are more descriptive and do not provide dynamic, prescriptive policies over time as our work does. Second, we use our model to answer important policy questions regarding the benefits of coordinating treatment guidelines for cholesterol and blood pressure control. We anticipate our findings will provide insights into the ordering of treatment decisions in other contexts.

3.3 Model


a simplified state transition diagram of our model for the purpose of illustrating the problem. In the diagram, solid lines illustrate the actions of initiating one or both of the most common medications (statins (ST), ACE inhibitors (AI)), and dashed lines represent the occurrence of an adverse event (stroke or CHD event) or death from other causes. In each medication state, including the no medication state (∅), patients probabilistically move between health states, here represented by L (low), M (medium), H (high), and V (very high). These health state levels represent the levels of patient risk factors (e.g., blood pressure and cholesterol). For patients on one or both medications, improvements in patient risk factors (blood pressure, cholesterol, or both) reduce the probability of adverse events.

Figure 3.1: Simplified state transition diagram for the case of two medications. When medications are initiated (actions denoted by the solid lines), the risk factors are improved and the probability of the occurrence of an adverse event (denoted by the dashed lines) is reduced.


treatment (e.g., statins can cause liver problems or severe muscle pain). However, this occurs in a small proportion of patients.

The problem we explore in this chapter is a generalization of the above two-medication problem, depicted in Figure 3.1, in which the patient may elect to initiate one or more of a set of available treatments at each decision epoch. This optimal treatment problem can be viewed as a nested stopping time problem. After the ﬁrst medication is initiated (the ﬁrst stopping time is chosen), there is a subsequent stopping time problem for the next medication to be initiated, and so on. A brief description of the MDP model is presented below.

Actions are taken at a discrete set of decision epochs indexed by $t = 1, \ldots, T$, where epoch $t$ represents the year $[t, t+1)$. This range constitutes the finite decision horizon. Similar to other studies [30, 63, 89], yearly decision epochs are used to represent annual visits to a clinician. Ages above $T$ are represented by an infinite post-decision horizon, assuming no new medications are initiated, allowing for accrual of rewards for patients living past the end of the decision horizon. States are composed of living states and absorbing states. Each living state is defined by the factors that influence a patient's cardiovascular risk: the patient's TC, HDL, and SBP levels, medication status, and history of stroke and CHD events. We denote the set of the TC states by $\mathcal{L}_{TC} = \{L, M, H, V\}$, with similar definitions for HDL, $\mathcal{L}_{HDL} = \{L, M, H, V\}$, and SBP, $\mathcal{L}_{SBP} = \{L, M, H, V\}$. The thresholds for these ranges are based on clinically-relevant cut points for treatment found in Table 3.2 [24]. The history of stroke and CHD events is defined by the current number of events the patient has had up to some maximum number, $k$: $\mathcal{L}_{S} = \{0, 1, \ldots, k\}$ and $\mathcal{L}_{CHD} = \{0, 1, \ldots, k\}$. Elements of these sets are indexed by $\ell_{TC}$, $\ell_{HDL}$, $\ell_{SBP}$, $\ell_{S}$, and $\ell_{CHD}$, respectively. The set of health states is given by $\mathcal{L} = \mathcal{L}_{TC} \times \mathcal{L}_{HDL} \times \mathcal{L}_{SBP} \times \mathcal{L}_{S} \times \mathcal{L}_{CHD}$. Elements of $\mathcal{L}$ are indexed by $\ell$.

The set of medication states is denoted by $\mathcal{M} = \{m = (m_1, m_2, \ldots, m_n) : m_i \in \{0, 1\}, \forall i = 1, 2, \ldots, n\}$, where $n$ denotes the number of medications. If $m_i = 0$, the patient is not currently on medication $i$, and if $m_i = 1$, the patient is currently on the medication. When a patient


change in TC, $\omega^{HDL}(i)$, representing the proportional change in HDL, and $\omega^{SBP}(i)$, representing the proportional change in SBP. Note, in general cholesterol medications result in decreased TC and increased HDL, while blood pressure medications result in decreased SBP. The medications we consider are targeted specifically at either cholesterol or blood pressure and each has negligible effect on the other risk factor. For example, if medication $i$ is a blood pressure medication, then $\omega^{TC}(i) = \omega^{HDL}(i) = 0$.

The living states in the model are denoted by $(\ell, m) \in \mathcal{L} \times \mathcal{M}$. The absorbing states are represented by the death states: $\mathcal{D} = \{D_S, D_{CHD}, D_O\}$. The three types of death states represent dying from a stroke, $D_S$, a CHD event, $D_{CHD}$, or other causes, $D_O$. The absorbing states will be denoted by $d \in \mathcal{D}$. Including living and absorbing states, there are a total of $4^3 \times 2^n \times (k+1)^2 + 3$ states in our model for each time period.
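The state count above can be checked by enumerating the state space directly. The sketch below is illustrative only; the tuple encoding of states and the string labels for the death states are assumptions, not the representation used in the original implementation.

```python
from itertools import product

LEVELS = ("L", "M", "H", "V")            # levels for TC, HDL, and SBP
DEATH_STATES = ("D_S", "D_CHD", "D_O")   # stroke, CHD, other causes


def living_states(n_meds, k):
    """Enumerate living states (health state, medication state).

    A health state is (TC level, HDL level, SBP level, strokes, CHD events);
    a medication state is a 0/1 tuple of length n_meds.
    """
    health = product(LEVELS, LEVELS, LEVELS, range(k + 1), range(k + 1))
    meds = list(product((0, 1), repeat=n_meds))
    return [(h, m) for h in health for m in meds]


def state_count(n_meds, k):
    """Closed-form count: 4^3 * 2^n * (k + 1)^2 living states plus 3 death states."""
    return 4**3 * 2**n_meds * (k + 1)**2 + len(DEATH_STATES)
```

For instance, with two medications and at most one event of each type, the enumeration yields 1,024 living states, so `state_count(2, 1)` is 1,027.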

At each decision epoch, it must be determined which medications to initiate (if any). The action space is dependent on the history of medications that have been initiated in previous epochs. For each medication, at each epoch, medication $i$ can be initiated ($I$) or initiation can be delayed ($W$). These actions are defined for medication $i$ as follows:

$$A_{(\ell, m_i)} = \begin{cases} \{I_i, W_i\} & \text{if } m_i = 0, \\ \{W_i\} & \text{if } m_i = 1, \end{cases} \tag{3.1}$$

where $A_{(\ell, m)} = A_{(\ell, m_1)} \times A_{(\ell, m_2)} \times \cdots \times A_{(\ell, m_n)}$. Action $a \in A_{(\ell, m)}$ denotes the action taken in state $(\ell, m)$. If a patient is in living state $(\ell, m)$ and takes action $a$, the medication state is then denoted by $m'$, where $m'_i$ is set to 1 for any medications $i$ that are newly initiated by action $a$; $m'_i = m_i$ for all medications $i$ which are not newly initiated. Once medication $i$ is initiated, the patient's blood pressure and cholesterol are modified by the medication effects denoted by $\omega^{TC}(i)$, $\omega^{HDL}(i)$, and $\omega^{SBP}(i)$, resulting in a reduction in the probability of having a stroke or CHD event.

Table 3.2: Ranges for TC, HDL, and SBP states based on [24].

Risk factor | L | M | H | V
TC (mg/dL) | <160 | [160, 200) | [200, 240) | ≥240
HDL (mg/dL) | <40 | [40, 50) | [50, 60) | ≥60
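The action set in (3.1) and the resulting medication state can be sketched in a few lines. This is an illustrative encoding, with actions written as the strings "I" and "W" and medication states as 0/1 tuples; the function names are hypothetical.

```python
from itertools import product


def action_set(m):
    """All joint actions available in medication state m, following (3.1).

    Medications not yet started (m_i = 0) may be initiated ("I") or delayed ("W");
    medications already started (m_i = 1) admit only "W".
    """
    per_medication = [("I", "W") if mi == 0 else ("W",) for mi in m]
    return list(product(*per_medication))


def next_medication_state(m, a):
    """Medication state m' after action a: newly initiated entries become 1."""
    return tuple(1 if ai == "I" else mi for mi, ai in zip(m, a))
```

A patient already on the second of two medications, `m = (0, 1)`, has two available actions, `("I", "W")` and `("W", "W")`; choosing the first moves the medication state to `(1, 1)`. Because medications are never discontinued in this model, the action set shrinks as treatment progresses.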

Three types of probabilities are incorporated into the model: probabilities among health states, probability of events (both fatal and nonfatal), and probability of death from other causes. At epoch $t \in \{1, \ldots, T\}$, death from other causes occurs with probability $\pi^{O}_t$. If the patient is in state $(\ell, m) \in \mathcal{L} \times \mathcal{M}$, a nonfatal stroke or CHD event occurs with probability $\pi^{S}_t(\ell, m)$ and $\pi^{CHD}_t(\ell, m)$, respectively, which depend on the patient's age, health state, medication status, and other risk factors such as race and gender. Fatal stroke and CHD events occur with probability $\tilde{\pi}^{S}_t(\ell, m)$ and $\tilde{\pi}^{CHD}_t(\ell, m)$, respectively. Given that the patient is in state $(\ell, m)$ at epoch $t$, the probability of moving into one of the absorbing states $d \in \mathcal{D}$ at epoch $t+1$ is denoted by $\bar{p}^{m}_t(d \mid \ell)$, where

$$\bar{p}^{m}_t(d \mid \ell) = \begin{cases} \pi^{O}_t & \text{if } d = D_O, \\ \tilde{\pi}^{CHD}_t(\ell, m) & \text{if } d = D_{CHD}, \\ \tilde{\pi}^{S}_t(\ell, m) & \text{if } d = D_S, \end{cases} \tag{3.2}$$

for $(\ell, m) \in \mathcal{L} \times \mathcal{M}$, and $\bar{p}^{m}_t(d \mid d) = 1$ for all $t \in \{1, \ldots, T\}$. The probability of having a nonfatal event or dying (from an event or other causes) is denoted by $\pi^{*}_t(\ell, m)$, where

$$\begin{aligned} \pi^{*}_t(\ell, m) = {} & \left(1 - \pi^{CHD}_t(\ell, m) - \tilde{\pi}^{CHD}_t(\ell, m)\right) \pi^{S}_t(\ell, m) \\ & + \left(1 - \pi^{S}_t(\ell, m) - \tilde{\pi}^{S}_t(\ell, m)\right) \pi^{CHD}_t(\ell, m) \\ & + \tilde{\pi}^{S}_t(\ell, m) + \tilde{\pi}^{CHD}_t(\ell, m) + \pi^{O}_t. \end{aligned} \tag{3.3}$$
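As a numerical illustration of (3.3), the following sketch evaluates the probability of a first event or death from the five component probabilities. The function and argument names are hypothetical, and the probabilities in the usage note are made-up values rather than outputs of the risk equations.

```python
def pi_star(pi_s, pi_chd, pi_s_fatal, pi_chd_fatal, pi_other):
    """Probability of a nonfatal event or death in one epoch, following (3.3).

    pi_s, pi_chd:             nonfatal stroke and CHD event probabilities
    pi_s_fatal, pi_chd_fatal: fatal stroke and CHD event probabilities
    pi_other:                 probability of death from other causes
    """
    return ((1.0 - pi_chd - pi_chd_fatal) * pi_s
            + (1.0 - pi_s - pi_s_fatal) * pi_chd
            + pi_s_fatal + pi_chd_fatal + pi_other)
```

With hypothetical inputs `pi_star(0.02, 0.03, 0.005, 0.01, 0.01)`, the three lines of (3.3) contribute 0.0192, 0.02925, and 0.025, for a total of 0.07345.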


be altered for events that are not assumed to be independent. Given that the patient is in health state $\ell \in \mathcal{L}$, the probability of being in health state $\ell'$ in the next epoch is denoted by $q_t(\ell' \mid \ell)$. The transition probabilities between health states do not depend on the medication state since the transition probabilities $q_t(\ell' \mid \ell)$ are computed from the natural progression of blood pressure and cholesterol in the absence of medication. We define $p^{m}_t(j \mid \ell)$ to be the probability of a patient being in state $j \in \mathcal{L} \cup \mathcal{D}$ at epoch $t+1$, given the patient is in living state $(\ell, m)$ at epoch $t$, where $m$ incorporates the action $a$ taken at time $t$. The probability $p^{m}_t(j \mid \ell)$ is defined by the following:

$$p^{m}_t(j \mid \ell) = \begin{cases} \left[1 - \sum_{d \in \mathcal{D}} \bar{p}^{m}_t(d \mid \ell)\right] q_t(j \mid \ell) & \text{if } \ell, j \in \mathcal{L}, \\ \bar{p}^{m}_t(j \mid \ell) & \text{if } \ell \in \mathcal{L},\ j \in \mathcal{D}, \\ 1 & \text{if } \ell = j \in \mathcal{D}, \\ 0 & \text{otherwise.} \end{cases} \tag{3.4}$$
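The first two cases of (3.4) can be sketched as a function that builds one row of the transition matrix for a living state: the health-state transition probabilities are scaled by the probability of not entering a death state, and the death probabilities are appended. The dictionary encoding, the function name, and the numbers in the usage note are assumptions for illustration.

```python
def transition_row(q_row, death_probs):
    """Row of p_t^m(. | l) for a living state, following the first two cases of (3.4).

    q_row:       dict {next health state: q_t(l' | l)}, summing to 1
    death_probs: dict {death state: p-bar_t^m(d | l)} from (3.2)
    """
    survive = 1.0 - sum(death_probs.values())
    row = {j: survive * q for j, q in q_row.items()}
    row.update(death_probs)
    return row


# Hypothetical inputs for one living state.
row = transition_row({"L": 0.7, "M": 0.3},
                     {"D_S": 0.01, "D_CHD": 0.02, "D_O": 0.03})
```

The resulting row sums to 1 whenever `q_row` does, which is a useful sanity check when assembling the full transition matrix.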

The reward $r_t(\ell, m)$ is the dollar reward for QALYs minus treatment and medication costs accrued in decision epoch $t$ in living state $(\ell, m)$ as described in the following equation:

$$r_t(\ell, m) = R(\ell, m) - C^{O} - \left(C^{S}(\ell) + C^{CHD}(\ell)\right) - \left(CF^{S}(\ell) + CF^{CHD}(\ell)\right) - C^{MED}(m), \tag{3.5}$$


health care for diabetes patients, cost of medications, cost of initial hospitalization for stroke and CHD events, and cost of follow-up treatment for stroke and CHD events, respectively.

For a patient in living state $(\ell, m)$ in epoch $t$, let $v_t(\ell, m)$ denote the patient's maximum total expected discounted rewards prior to her first event or death. The following recursion defines the optimal action in each state for $t = 1, \ldots, T-1$:

$$v_t(\ell, m) = \max_{a \in A_{(\ell, m)}} \left\{ r_t(\ell, m'(a)) + \lambda \sum_{j \in \mathcal{L} \cup \mathcal{D}} p^{m'(a)}_t(j \mid \ell)\, v_{t+1}(j, m'(a)) \right\}, \tag{3.6}$$

where $j$ indexes states in $\mathcal{L} \cup \mathcal{D}$, $m'(a)$ is defined as the medication state $m$ with action $a$ taken into account, and $\lambda \in [0, 1)$ is the discount factor per decision epoch, which is commonly set to 0.97 in health economic evaluations (see Chapter 7 of [41] for a discussion of this). The boundary condition is given by $v_T(\ell, m) = r_T(\ell, m) + E[\text{PDHR} \mid \ell, m]$, where $E[\text{PDHR} \mid \ell, m]$ is the expected post-decision horizon reward (PDHR). This represents expected rewards for a patient living past the decision horizon (e.g., past age 100). The PDHR depends on the state and treatment status of the patient in the last year of the decision horizon and the number of years into the post-decision horizon that the patient lives. This approximation of rewards is needed because of the limited samples in the data set for older patients.
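The recursion (3.6) can be solved by backward induction, which the following sketch illustrates on a generic interface. It is not the original implementation: the callback-based design and all names are hypothetical, the boundary values are supplied by the caller through `terminal`, and absorbing states are assumed to contribute zero continuation value, so `trans` only needs to return probabilities of living health states.

```python
def backward_induction(T, states, actions, next_med, reward, trans, terminal, lam=0.97):
    """Solve (3.6) backward from the boundary condition at epoch T.

    states:   list of living states (l, m)
    actions:  actions(l, m) -> iterable of available actions
    next_med: next_med(m, a) -> medication state m'(a)
    reward:   reward(t, l, m2) -> r_t(l, m'(a))
    trans:    trans(t, l, m2) -> dict {l2: p_t^{m'(a)}(l2 | l)} over living health states
              (assumption: absorbing states have zero value and are omitted)
    terminal: terminal(l, m) -> boundary value v_T(l, m)
    Returns (v_1, policy), where policy maps (t, l, m) to the maximizing action.
    """
    v = {(l, m): terminal(l, m) for (l, m) in states}
    policy = {}
    for t in range(T - 1, 0, -1):
        v_new = {}
        for (l, m) in states:
            best_value, best_action = None, None
            for a in actions(l, m):
                m2 = next_med(m, a)
                value = reward(t, l, m2) + lam * sum(
                    p * v[(l2, m2)] for l2, p in trans(t, l, m2).items())
                if best_value is None or value > best_value:
                    best_value, best_action = value, a
            v_new[(l, m)] = best_value
            policy[(t, l, m)] = best_action
        v = v_new
    return v, policy
```

Each epoch requires one pass over all living states and their available actions, so the work per epoch grows with the size of the state space computed earlier multiplied by the number of joint actions.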


of other medical care for diabetes patients) can greatly aﬀect optimal policies.

3.4 Results

In this section we present numerical results illustrating optimal treatment policies for two bicriteria perspectives: (a) expected time to first event versus medication costs, and (b) expected QALYs versus medication and treatment costs. Backward induction was used to compute the optimal treatment decisions over the patient's lifetime. The model and solution method were coded in C/C++. Model instances were solved in under 40 minutes using a 2.83GHz PC with 8GB of RAM. We provide results for each perspective for a population of 40-year-old patients newly diagnosed with type 2 diabetes. The proportion of patients in each of the health states at age 40 is estimated using the Mayo cohort described in Section 3.4.1.

The remainder of this section is organized as follows: In Section 3.4.1 we define the specific parameters for our problem, the model inputs, and their sources. In Section 3.4.2 we present a comparison of outputs from our model to those found in the literature for validation purposes. In Section 3.4.3 we present a model for primary prevention with results for maximization of LYs before an event. In Section 3.4.4, we provide results from the population level for maximizing QALYs over the patient's lifetime (i.e., average results for patients with diabetes). For the results presented in Sections 3.4.3 and 3.4.4, we compare the optimal treatment outcomes to the outcomes from applying U.S. and international guidelines. We also highlight the main differences in the policies for individual patients. In Section 3.4.5 we provide estimates of the yearly benefit of applying the optimal guidelines to the U.S. diabetes population over the current U.S. guidelines.

3.4.1 Data and Study Population


The DEMS dataset included 663 patients with cholesterol, HbA1c, blood pressure, and other laboratory values. Population statistics are provided in Table 3.3. The patients in this dataset are hereafter referred to as the Mayo cohort. Changes in TC, HDL, and SBP values from medications are found in Table 3.4. These values were estimated by computing the change in metabolic values before and after initiation of the given treatments using methods reported in Denton et al. [30]. These changes are assumed to be independent and additive for patients on multiple medications. It is important to note that while fibrates have been shown to improve cholesterol values, there is debate if the use of fibrates actually reduces a patient’s risk of CHD events [43]. It is possibly a limitation that we use modification of surrogate markers (blood pressure and cholesterol) to reflect the benefits of medication rather than modified risk of events (stroke and CHD).

Table 3.3: Baseline characteristics for the study population (N = 663), including mean and variance.

Patient Attribute | Study Cohort
Age | 52.46 (8.83)
Years with Diabetes | 3.24 (5.33)
% Female | 39.67
HDL | 43.65 (11.58)
LDL | 126.98 (37.31)
TC | 216.98 (37.31)
SBP | 139.11 (19.75)
HbA1c | 8.01 (2.38)


Table 3.4: Percentage change in risk factors for given medications as computed from Mayo Electronic Medical Records and Diabetes Electronic Management System.

Medication (i) | ωTC(i) | ωHDL(i) | ωSBP(i)
Statins | -14.0 | +7.3 | -
Fibrates | -3.9 | +4.7 | -
ACE/ARBs | - | - | -3.7
Thiazides | - | - | -5.0
β Blockers | - | - | -4.6
Calcium Channel Blockers | - | - | -2.5

Table 3.6. The costs are the lower bound values based on U.S. pharmaceutical cost estimates [4], and the utility decrements are drawn from the literature [20, 77]. The costs presented in Tables 3.5 and 3.6 are in 2009 dollars. For all the numerical experiments, we consider a decision horizon from age 40 to age 100 with an inﬁnite horizon estimate of rewards accrued after the end of the decision horizon.

Table 3.5: Description of model parameters including cost inputs and utility decrements for the reward function of the MDP model.

Parameter Type | Parameter | Value | Source
Cost Inputs | Initial hospitalization for stroke ($C^{S}$) | $13,204 | [1]
 | Initial hospitalization for CHD ($C^{CHD}$) | $18,590 | [1]
 | Follow-up for stroke ($CF^{S}$) | $1,664 | [101]
 | Follow-up for CHD ($CF^{CHD}$) | $2,576 | [86, 101]
 | Willingness-to-pay Factor ($R_0$) | $100,000 | [82]
 | Discount Factor ($\lambda$) | 0.97 | [41]
Utility Inputs | CHD decrement ($d^{CHD}$) | 0.07 | [23, 106]
 | Stroke decrement ($d^{S}$) | 0.21 | [23, 100, 99]


Table 3.6: Costs and utility decrements for each medication used in the model.

Medication | Cost [4] | Utility Decrement
Statins | $212 | 0.003 [20]
Fibrates | $652 | 0.003 [20]
ACE/ARBs | $48 | 0.005 [77]
Thiazides | $48 | 0.005 [77]
β Blockers | $48 | 0.005 [77]
Calcium Channel Blockers | $866 | 0.005 [77]

dataset. A spline fit was used to interpolate missing laboratory values for cholesterol values to obtain an estimate of yearly levels for these risk factors [30]. Each risk factor was divided into L, M, H, and V categories (as defined in Section 3.3). The transition probabilities among metabolic states were estimated from the percentages of patients that moved between each state at time t to each state at time t + 1.
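The frequency-based estimate described above can be sketched as follows: count the observed year-to-year moves between categories and normalize each row. This is an illustrative sketch only; the function name and the trajectories in the usage note are hypothetical, and the sketch pools all years together rather than estimating a separate matrix for each epoch.

```python
from collections import Counter, defaultdict


def estimate_transitions(trajectories):
    """Estimate transition probabilities from observed yearly state sequences.

    trajectories: list of per-patient sequences of categories, e.g. ["L", "L", "M"]
    Returns {state: {next state: fraction of observed moves}}.
    """
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        # Pair each yearly observation with the observation one year later.
        for state, next_state in zip(trajectory, trajectory[1:]):
            counts[state][next_state] += 1
    return {
        state: {j: c / sum(row.values()) for j, c in row.items()}
        for state, row in counts.items()
    }


# Two made-up patient trajectories over three yearly observations.
estimates = estimate_transitions([["L", "L", "M"], ["L", "M", "M"]])
```

Here state L is left three times (once to L, twice to M), so the estimated probabilities from L are 1/3 and 2/3. States that are never observed as a starting point simply do not appear in the result, which is one reason sparse data for older patients requires separate handling.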

Transition probabilities to event and death states were drawn from the literature. The UKPDS risk equations [61, 97, 96] were used to compute probabilities of incurring a CHD event or stroke, both fatal and nonfatal, based on patient risk factors including age, gender, TC, SBP, and HbA1c. The Centers for Disease Control and Prevention (CDC) mortality tables [18] were used to estimate the probability of death from other causes.

3.4.2 Model Validation


estimates compared from the FHS and our model were the expected LYs before a stroke or CHD event from age 50 and the expected LYs before death from age 50.


Table 3.7: Male comparison of expected LYs before death, expected LYs before a stroke or CHD event, and expected LYs after an event from age 50 for our MDP model and the Framingham Heart Study (FHS). The 95% confidence intervals are provided for the FHS estimates.

Measure | FHS: Diabetes Patients | FHS: Overall | MDP: U.S. I | MDP: No Treatment
Life expectancy | 21.3 (19.4 to 23.1) | 27.9 (27.3 to 28.6) | 28.8 | 26.9
LYs before event | 14.2 (12.3 to 16.1) | 21.2 (20.5 to 22.0) | 21.2 | 18.9
LYs after event | 7.1 (6.0 to 8.3) | 6.7 (6.2 to 7.1) | 7.6 | 8.0

Table 3.8: Female comparison of expected LYs before death, expected LYs before a stroke or CHD event, and expected LYs after an event from age 50 for our MDP model and the Framingham Heart Study (FHS). The 95% confidence intervals are provided for the FHS estimates.

Measure | FHS: Diabetes Patients | FHS: Overall | MDP: U.S. I | MDP: No Treatment
Life expectancy | 26.5 (24.4 to 28.5) | 33.8 (33.2 to 34.4) | 32.1 | 29.4
LYs before event | 19.6 (17.5 to 21.9) | 27.3 (26.7 to 28.0) | 25.3 | 23.1


Unfortunately there is no perfect way to validate medical decision making models such as ours. There are many possible reasons why our estimates for event-free years and life expectancy would differ from the estimates from the FHS. First, our model uses 2007 estimates of probabilities of death from other causes; life expectancies have increased significantly since the 1950s when the FHS study began. Second, we use the UKPDS risk equations to estimate the risk of stroke and CHD events; while these equations are widely believed to provide valid estimates of risk, they are based on observed events from a population of diabetes patients from the United Kingdom. Finally, it is impossible to know what medication policy was used by the patients in the FHS, and available medications and U.S. treatment guidelines have changed significantly since the 1950s.

3.4.3 Primary Prevention Treatment Policies

In this section we consider primary prevention of stroke and CHD events. The yearly rewards for primary prevention are deﬁned as follows:

$$r(\ell, m) = \begin{cases} R_0 - C^{O} - C^{MED}(m) & \forall \ell : \ell_{S} = \ell_{CHD} = 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7}$$

Patients receive rewards for LYs ($R_0$) minus costs of medication and other health care costs up until the occurrence of a first event, including CHD or stroke (fatal or nonfatal), or death from other causes. We set all costs other than medication costs and the costs of other health care equal to zero: $C^{S}(\ell) = C^{CHD}(\ell) = CF^{S}(\ell) = CF^{CHD}(\ell) = 0$. In addition, no costs are incurred in this model after a patient has an event. With this reward structure, the objective is to maximize the reward for LYs minus costs incurred prior to an event or death. In other words, the goal of this reward structure is to delay the patient's first event, which is in line with a physician's goal of primary prevention and consistent with the goal of primary prevention in the U.S. guidelines [10, 22].
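The primary prevention reward (3.7) is simple enough to state directly in code. In the sketch below, the default willingness-to-pay factor of $100,000 is the value listed in Table 3.5; the default of zero for the cost of other health care is a placeholder assumption, since its value is not given in this section, and the function name is hypothetical.

```python
def primary_prevention_reward(n_strokes, n_chd, med_cost, r0=100_000.0, other_cost=0.0):
    """Yearly reward following (3.7).

    n_strokes, n_chd: number of stroke and CHD events the patient has had so far
    med_cost:         yearly cost of the patient's current medications
    r0:               willingness-to-pay factor R_0 (Table 3.5 lists $100,000)
    other_cost:       yearly cost of other health care, C^O (placeholder default)
    """
    if n_strokes == 0 and n_chd == 0:
        return r0 - other_cost - med_cost
    return 0.0  # no rewards or costs accrue after a first event
```

For an event-free patient on statins alone (listed at $212 in Table 3.6), the yearly reward under these defaults is $99,788; once any event has occurred the reward is zero.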


Figure 3.2: Comparison of optimal treatment policies for male patients to treatment by U.S. and international guidelines.

guideline results for males and females, respectively. These graphs present the expected LYs versus the expected discounted costs of medication before an event has occurred (the other costs have not been included in the graph). There is great similarity in costs and LYs between the U.S. and international guideline results. While the LYs achieved with the guidelines are very near the optimal policy curves for both the males and females, we see that the costs of the guidelines could be greatly reduced by implementing the optimal policies due to the ﬂatness of the optimal policy curve as the LYs increase.