
CHAPTER 4: PRIORITIZATION IN A MULTI-SERVER QUEUEING SYSTEM

4.3 Model description and the MDP formulation

4.3.2 Formulation as an MDP

We next formulate this system as a discrete-time MDP. For notational convenience, we use bold symbols to denote two-dimensional vectors in $\mathbb{N}^2$, where $\mathbb{N}$ denotes the set of natural numbers. For $\mathbf{x} \in \mathbb{N}^2$, let $x_i$ denote the $i$th component of the vector $\mathbf{x}$ for $i = 1, 2$. We also define a partial ordering of the vectors as follows: for two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{N}^2$, $\mathbf{x}$ is said to be smaller than $\mathbf{y}$, denoted by $\mathbf{x} \le \mathbf{y}$, if $x_1 \le y_1$ and $x_2 \le y_2$. Finally, we let $\mathbf{e}_0 = (0,0)$, $\mathbf{e}_1 = (1,0)$ and $\mathbf{e}_2 = (0,1)$.

System state: Let $\mathbf{x}_t = (x_{t,1}, x_{t,2})$ denote the system state at discrete time points $t = 0, 1, 2, \ldots$, where $x_{t,i}$ is the number of type $i$ customers at time $t$ for $i \in \{1, 2\}$, which includes all existing customers as well

customers that belong to the same type due to the Markov property of customer transitions. The state space is $S = \mathbb{N}^2$.

Actions: Decisions are made at the beginning of each time period, immediately after the customer arrival (if there is one), at which time we need to decide how to assign the servers to all customers present. Let $a_i$ denote the number of servers we allocate to stage $i$ customers, and let $\mathbf{a} = (a_1, a_2)$ denote a possible action if $\mathbf{a} \in A$, where

\[
A = \{\mathbf{a} : \mathbf{a} \in \mathbb{N}^2 \text{ and } a_1 + a_2 \le b\}.
\]

For any state $\mathbf{x}$, let $A(\mathbf{x})$ denote the set of all feasible actions, where $A(\mathbf{x}) = \{\mathbf{a} : \mathbf{a} \in A \text{ and } \mathbf{a} \le \mathbf{x}\}$. Let $A_0(\mathbf{x})$ denote the set of all feasible non-idling actions, where $A_0(\mathbf{x}) = \{\mathbf{a} : \mathbf{a} \in A(\mathbf{x}) \text{ and } a_1 + a_2 = \min\{x_1 + x_2, b\}\}$.
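The two action sets can be enumerated directly for small instances. The following is a minimal sketch, assuming that b is the number of servers; the function names are hypothetical and chosen only for illustration.

```python
from itertools import product

def feasible_actions(x, b):
    """A(x): allocations a in N^2 with a <= x componentwise and a1 + a2 <= b."""
    x1, x2 = x
    return [(a1, a2)
            for a1, a2 in product(range(min(x1, b) + 1), range(min(x2, b) + 1))
            if a1 + a2 <= b]

def non_idling_actions(x, b):
    """A0(x): feasible actions that keep min(x1 + x2, b) servers busy."""
    busy = min(x[0] + x[1], b)
    return [a for a in feasible_actions(x, b) if a[0] + a[1] == busy]

# Illustrative state with two type 1 customers, one type 2 customer, b = 2.
print(feasible_actions((2, 1), b=2))
print(non_idling_actions((2, 1), b=2))
```

In this illustrative state, the non-idling set contains only the allocations that use both servers.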

One step expected reward: When the process is in state $\mathbf{x} \in S$ and an action $\mathbf{a} \in A(\mathbf{x})$ is chosen, there will be an expected immediate reward $R(\mathbf{x}, \mathbf{a}) = \sum_{i=1}^{2} a_i R_i p_{i0}$, which is bounded since $a_i$ and $R_i$ are nonnegative and bounded.

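Transition probabilities: Let $P_{\mathbf{a}}(\mathbf{x}, \mathbf{y})$ denote the probability that the process will transit to state $\mathbf{y} \in S$, starting from state $\mathbf{x} \in S$ given action $\mathbf{a} \in A(\mathbf{x})$. The transition probabilities can be computed by conditioning on how each customer evolves. We can also compute these probabilities recursively as follows: first, we can obtain the transition probabilities in state $\mathbf{x} = \mathbf{e}_0$ with the only feasible action $\mathbf{a} = \mathbf{e}_0$:

\[
P_{\mathbf{e}_0}(\mathbf{e}_0, \mathbf{e}_0) = \lambda_0, \quad
P_{\mathbf{e}_0}(\mathbf{e}_0, \mathbf{e}_1) = \lambda_1, \quad
P_{\mathbf{e}_0}(\mathbf{e}_0, \mathbf{e}_2) = \lambda_2, \quad \text{and} \quad
P_{\mathbf{e}_0}(\mathbf{e}_0, \mathbf{y}) = 0 \text{ for } \mathbf{y} \notin \{\mathbf{e}_0, \mathbf{e}_1, \mathbf{e}_2\}.
\]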

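Then, for $\mathbf{x} \in S$ and $\mathbf{x} \ne \mathbf{e}_0$, we have $\mathbf{x} \ge \mathbf{e}_i$ for at least one $i \in \{1, 2\}$. Using the fact that all customers evolve independently, $P_{\mathbf{a}}(\mathbf{x}, \mathbf{y})$ satisfies the following properties for $\mathbf{x} \ge \mathbf{e}_i$ and $\mathbf{a} \in A(\mathbf{x})$:

\[
P_{\mathbf{a}}(\mathbf{x}, \mathbf{y}) = \sum_{j=0}^{2} p_{ij}\, I\{\mathbf{y} \ge \mathbf{e}_j\}\, P_{\mathbf{a} - \mathbf{e}_i}(\mathbf{x} - \mathbf{e}_i, \mathbf{y} - \mathbf{e}_j), \quad \text{if } \mathbf{a} \ge \mathbf{e}_i, \tag{4.1}
\]

\[
P_{\mathbf{a}}(\mathbf{x}, \mathbf{y}) = \sum_{j=0}^{2} q_{ij}\, I\{\mathbf{y} \ge \mathbf{e}_j\}\, P_{\mathbf{a}}(\mathbf{x} - \mathbf{e}_i, \mathbf{y} - \mathbf{e}_j), \quad \text{if } \mathbf{a} \le \mathbf{x} - \mathbf{e}_i. \tag{4.2}
\]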

The intuition behind (4.1) is as follows. Suppose that in state $\mathbf{x}$ we take an action $\mathbf{a}$ that keeps at least one stage $i$ customer in service. We pick any one such customer (referred to as customer A) and compute the transition probability to state $\mathbf{y}$ by conditioning on how customer A evolves. Customer A jumps to stage $j$ with probability $p_{ij}$, and the probability that the system then transitions to state $\mathbf{y}$ equals the probability that the remaining customers (we have $\mathbf{x} - \mathbf{e}_i$ remaining customers, among which we take action $\mathbf{a} - \mathbf{e}_i$) transit to state $\mathbf{y} - \mathbf{e}_j$. Similarly, if there is at least one stage $i$ customer in the queue when we take action $\mathbf{a}$ in state $\mathbf{x}$, then we can obtain (4.2) by picking a stage $i$ customer in the queue and conditioning on how this customer evolves.

We will need only properties (4.1) and (4.2) later in the proofs of our analytical results. In the numerical study, we can obtain the values of the transition probabilities recursively from these two properties and the initial transition probabilities at $\mathbf{x} = \mathbf{e}_0$.
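The recursive computation described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the numerical study: all numerical values of p_ij, q_ij and λ_j below are hypothetical, and j = 0 is read here as the customer leaving the system, which is an assumption consistent with e_0 = (0,0).

```python
from functools import lru_cache

# Hypothetical parameters (rows sum to 1); index j = 0, 1, 2.
P_SERVICE = {1: (0.3, 0.5, 0.2), 2: (0.4, 0.1, 0.5)}   # p_ij, customer in service
Q_QUEUE = {1: (0.1, 0.7, 0.2), 2: (0.05, 0.15, 0.8)}   # q_ij, customer in queue
LAM = (0.5, 0.3, 0.2)                                   # lambda_0, lambda_1, lambda_2
E = ((0, 0), (1, 0), (0, 1))                            # e_0, e_1, e_2

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])

def geq(u, v):
    return u[0] >= v[0] and u[1] >= v[1]

@lru_cache(maxsize=None)
def trans_prob(a, x, y):
    """P_a(x, y) via (4.1)-(4.2); assumes a is feasible, i.e. a <= x."""
    if x == (0, 0):
        # Base case: only the arrival moves the state.
        return LAM[E.index(y)] if y in E else 0.0
    i = 1 if x[0] >= 1 else 2              # any i with x >= e_i
    if geq(a, E[i]):
        probs, a_rest = P_SERVICE[i], sub(a, E[i])   # (4.1): peel a served customer
    else:
        probs, a_rest = Q_QUEUE[i], a                # (4.2): here a_i = 0, so a <= x - e_i
    x_rest = sub(x, E[i])
    return sum(probs[j] * trans_prob(a_rest, x_rest, sub(y, E[j]))
               for j in range(3) if geq(y, E[j]))

print(trans_prob((1, 1), (2, 1), (1, 1)))
```

Memoizing on (a, x, y) keeps the recursion cheap, because the same reduced subproblems recur for many target states y.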

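Objectives: We consider two models with different optimality criteria, which we refer to as the discounted model and the average model, respectively. For the discounted model, we maximize the expected total discounted reward over an infinite horizon, which can be expressed as

\[
V_{\pi,\alpha}(\mathbf{x}) = E_{\pi}\left[ \sum_{t=0}^{\infty} R(\mathbf{x}_t, \mathbf{a}_t)\, \alpha^t \,\middle|\, \mathbf{x}_0 = \mathbf{x} \right],
\]

where $\alpha \in (0, 1)$ is the discount factor. Note that $V_{\pi,\alpha}(\mathbf{x})$ is well defined since $R(\mathbf{x}, \mathbf{a})$ is bounded and $\alpha < 1$. Let

\[
V_{\alpha}(\mathbf{x}) = \max_{\pi} V_{\pi,\alpha}(\mathbf{x}),
\]

where the maximum is attainable since the action space is finite and $V_{\pi,\alpha}(\mathbf{x})$ is bounded. A policy $\pi^*$ is said to be $\alpha$-optimal if $V_{\pi^*,\alpha}(\mathbf{x}) = V_{\alpha}(\mathbf{x})$ for all $\mathbf{x} \in S$.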
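The boundedness of the discounted value can be made explicit. As a sketch: since $a_1 + a_2 \le b$, $R_i \ge 0$ and $p_{i0} \le 1$, one admissible (not necessarily tight) bound on the one-step reward is $M = b \max\{R_1, R_2\}$, where $M$ is a symbol introduced here only for illustration. Then, for every policy $\pi$ and state $\mathbf{x}$,

\[
0 \le V_{\pi,\alpha}(\mathbf{x}) \le \sum_{t=0}^{\infty} \alpha^t M = \frac{M}{1 - \alpha}.
\]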

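For the average model, we would like to maximize the long-run average reward, which can be expressed as

\[
g_{\pi}(\mathbf{x}) = \liminf_{T \to \infty} \frac{E_{\pi}\left[ \sum_{t=0}^{T} R(\mathbf{x}_t, \mathbf{a}_t) \,\middle|\, \mathbf{x}_0 = \mathbf{x} \right]}{T + 1}.
\]

A policy $\pi^*$ is said to be average optimal if $g_{\pi^*}(\mathbf{x}) = \max_{\pi} g_{\pi}(\mathbf{x})$ for all $\mathbf{x} \in S$.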

The following two lemmas provide the optimality equations for these two models.

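Lemma 4.1. (The optimality equation for the discounted model.) (a) $V_{\alpha}(\mathbf{x})$ satisfies the following optimality equation:

\[
V_{\alpha}(\mathbf{x}) = \max_{\mathbf{a} \in A(\mathbf{x})} \left\{ R(\mathbf{x}, \mathbf{a}) + \alpha \sum_{\mathbf{y} \in S} P_{\mathbf{a}}(\mathbf{x}, \mathbf{y})\, V_{\alpha}(\mathbf{y}) \right\}, \quad \mathbf{x} \in S. \tag{4.3}
\]

(b) The stationary policy that selects any action maximizing the right-hand side of (4.3) in state $\mathbf{x}$ is $\alpha$-optimal.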

Lemma 4.1 follows directly from Theorems 2.1 and 2.2 in Chapter II of Ross (1983).

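Lemma 4.2. (The optimality equation for the average model.) (a) There exist a bounded function $h(\mathbf{x})$ and a constant $g$, where

\[
h(\mathbf{x}) = \lim_{\alpha \to 1} \left[ V_{\alpha}(\mathbf{x}) - V_{\alpha}(\mathbf{x}_0) \right] \quad \text{and} \quad g = \lim_{\alpha \to 1} (1 - \alpha)\, V_{\alpha}(\mathbf{x}_0)
\]

for some $\mathbf{x}_0 \in S$, which satisfy the following optimality equation:

\[
g + h(\mathbf{x}) = \max_{\mathbf{a} \in A(\mathbf{x})} \left\{ R(\mathbf{x}, \mathbf{a}) + \sum_{\mathbf{y} \in S} P_{\mathbf{a}}(\mathbf{x}, \mathbf{y})\, h(\mathbf{y}) \right\}, \quad \mathbf{x} \in S. \tag{4.4}
\]

(b) There exists a stationary policy $\pi^*$ that is average optimal with $g_{\pi^*}(\mathbf{x}) = g$ for all $\mathbf{x} \in S$, and $\pi^*$ is any policy that selects an action maximizing the right-hand side of (4.4) in state $\mathbf{x}$.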

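Proof. For any policy $\pi$ and any $\alpha < 1$, $V_{\pi,\alpha}(\mathbf{x})$ is bounded. Then, $V_{\alpha}(\mathbf{x})$ is bounded for any $\alpha$ by definition. Thus, for some $\mathbf{x}_0 \in S$,

\[
|V_{\alpha}(\mathbf{x}) - V_{\alpha}(\mathbf{x}_0)| \le |V_{\alpha}(\mathbf{x})| + |V_{\alpha}(\mathbf{x}_0)|
\]

is bounded for all $\alpha$ and $\mathbf{x}$. Then, part (a) follows from Theorem 2.2 in Chapter V of Ross (1983) and part (b) follows from Theorem 2.1 of the same chapter.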

Before proceeding further, it is convenient to define the first-order difference operator in vector form as follows:

Definition 4.1. For a real-valued function $w(\mathbf{x})$ defined on $S$, the first-order difference operator $D_j$ is defined as

\[
D_j w(\mathbf{x}) = w(\mathbf{x} + \mathbf{e}_j) - w(\mathbf{x}), \quad \text{for } j = 1, 2.
\]
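As a quick illustration of Definition 4.1, the operator can be written as a small helper. This is a sketch only: the function name `diff` and the test function `w` are hypothetical and not part of the model.

```python
E = {1: (1, 0), 2: (0, 1)}   # unit vectors e_1 and e_2

def diff(w, x, j):
    """D_j w(x) = w(x + e_j) - w(x) for j in {1, 2}."""
    shifted = (x[0] + E[j][0], x[1] + E[j][1])
    return w(shifted) - w(x)

def w(x):
    # Hypothetical test function, chosen only to make the arithmetic easy to check.
    return x[0] ** 2 + 3 * x[1]

print(diff(w, (2, 5), 1))   # (3^2 + 15) - (2^2 + 15) = 5
print(diff(w, (2, 5), 2))   # (4 + 18) - (4 + 15) = 3
```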

4.4 Main results for the discounted model

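In this section, we start with the analysis of the optimal control of the $\alpha$-discounted model. For a real-valued function $w(\mathbf{x})$ defined on the state space $S$, define the mapping $T_{\mathbf{a}}$ as follows:

\[
T_{\mathbf{a}} w(\mathbf{x}) =
\begin{cases}
R(\mathbf{x}, \mathbf{a}) + \alpha \sum_{\mathbf{y} \in S} P_{\mathbf{a}}(\mathbf{x}, \mathbf{y})\, w(\mathbf{y}), & \text{if } \mathbf{a} \in A(\mathbf{x}), \\
-\infty, & \text{otherwise.}
\end{cases} \tag{4.5}
\]

Then, the optimality equation (4.3) can be rewritten as

\[
V_{\alpha}(\mathbf{x}) = \max_{\mathbf{a} \in A(\mathbf{x})} \left\{ T_{\mathbf{a}} V_{\alpha}(\mathbf{x}) \right\},
\]

and $\mathbf{a}^*$ is an optimal action in state $\mathbf{x}$ if $T_{\mathbf{a}^*} V_{\alpha}(\mathbf{x}) \ge T_{\mathbf{a}} V_{\alpha}(\mathbf{x})$ for all $\mathbf{a}$.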

We first propose a finite-horizon MDP model with the objective of maximizing the total expected discounted reward over $n$ time periods. Then, we prove some structural properties of the optimal value function of the discounted model, $V_{\alpha}$, by letting $n$ go to infinity.

Let $V_n(\mathbf{x}) = \max_{\mathbf{a} \in A(\mathbf{x})} T_{\mathbf{a}} V_{n-1}(\mathbf{x})$ for $n \ge 1$, and let $V_0(\mathbf{x})$ be a bounded nonnegative function. Then, we have the following result, which follows directly from Proposition 3.1 of Chapter II in Ross (1983).

Lemma 4.3. $V_n(\mathbf{x}) \to V_{\alpha}(\mathbf{x})$ uniformly in $\mathbf{x}$ as $n \to \infty$.

As a result of Lemma 4.3, we can prove the structural properties of $V_{\alpha}(\mathbf{x})$ by induction on $V_n(\mathbf{x})$, starting from any bounded $V_0(\mathbf{x})$. For simplicity, we assume $V_0(\mathbf{x}) = 0$ for any $\mathbf{x} \in S$ from now on.
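The iteration $V_n = \max_{\mathbf{a}} T_{\mathbf{a}} V_{n-1}$ and its uniform convergence can be illustrated numerically. The sketch below does not use the queueing model itself: it runs the same recursion on a small hypothetical finite MDP (three states, two actions, all actions feasible in every state), with made-up transition probabilities, rewards and discount factor, and starts from $V_0 = 0$.

```python
ALPHA = 0.9  # hypothetical discount factor

# Hypothetical finite MDP: P[a][s][s2] are transition probabilities, R[s][a] rewards.
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.4, 0.4, 0.2], [0.2, 0.2, 0.6]],
]
R = [[0.0, 0.5], [1.0, 1.5], [2.0, 1.0]]
STATES, ACTIONS = range(3), range(2)

def apply_T(V):
    """One step of value iteration: V_n(s) = max_a T_a V_{n-1}(s)."""
    return [max(R[s][a] + ALPHA * sum(P[a][s][s2] * V[s2] for s2 in STATES)
                for a in ACTIONS)
            for s in STATES]

def value_iteration(n_steps):
    V = [0.0] * 3            # V_0 = 0
    gaps = []                # sup-norm distance between successive iterates
    for _ in range(n_steps):
        V_next = apply_T(V)
        gaps.append(max(abs(u - v) for u, v in zip(V_next, V)))
        V = V_next
    return V, gaps

V, gaps = value_iteration(200)
print([round(v, 4) for v in V])
print(gaps[:3], gaps[-1])
```

In this toy instance the sup-norm gap between successive iterates shrinks by at least a factor of α per step, which is the contraction behavior that drives uniform convergence in the discounted setting.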
