• No results found

Partially Observable Markov Decision Process

4.4 Optimization for online energy management

5.2.2 Partially Observable Markov Decision Process

Next, we define a constrained Partially Observable (PO) Markov Decision Process (MDP).

States: ¯S is a finite set of states describing the position of the MU within the road link, along with the absorbing state ¯s when the MU exits the coverage area of the BS. Therefore, ¯S ” S Y t¯su, and S ” t1, . . . , Su is the set of road sub-links.

Actions: A is a finite set of actions that the BS can perform. Specifically, the BS can perform actions for Beam Training (BT) and actions for Data Trans- mission (DT), which involve selection of the transmission beam, power, and duration. In general, aP A is in the form a “ p ˆS, Pc, Tcq, where: c “ tBT, DTu

refers to the action class; ˆS Ď S is a sub-set of sub-links, defining the sup- port of the transmission beam; Pc is the transmission power per beam, such

that Pt “ Pc| ˆS| in Eq. (5.6) is the total transmit power, Tc is the transmis-

sion duration of action a P A (number of micro time-slots of duration ∆t).

If c “ BT, then a “ p ˆS, PBT, TBTq, where PBT and TBT are fixed parame-

ters of the model. Specifically, we assume that BT actions perform simul- taneous beamforming over ˆS in one interval of ∆t seconds (i.e., TBT “ 1).

Also, PBT “ Pmin, where Pmin is the power per beam required to attain the

alarm and mis-detection probabilities PFA and PMD, which are also fixed pa-

rameters of the model, via Eq. (5.10). If c “ DT, then a “ p ˆS, PDT, TDTq,

where PDT and TDT are part of the optimization. Specifically, we assume that

DT actions perform simultaneous data communication over ˆS for TDT ´ 1

micro time-slots, where the last interval of ∆t seconds is dedicated to the

ACK/NACK feedback transmission from the MU to the BS. During DT, the transmission rate follows from Eq. (5.7). Note that the action space grows as | ˆS| “řSs“1S!{s!{pS ´sq! “ 2S´1. To reduce its cardinality, we restrict A such

that ˆS is a sub-set of consecutive indices in S, i.e., the beam directions specified by ˆS define a compact range of transmission for the BS. Thus, | ˆS| “ SpS`1q{2. Observations: O is a finite set of observations, defined as O” tACK, NACK, ¯su. Specifically, o“ ¯s means that the MU exited the coverage area of the BS; for simplicity, in this chapter we assume that such event is observable, i.e., the BS knows when the MU exited its coverage area.

Transition probabilities: Pps1|s, aq is the transition probability from s P ¯S

to s1 P ¯S given a P A. Note that these probabilities are a function of the

duration Tc of action a. If the transmission duration of a P A is Tc “ 1,

then we store the 1-step transition probabilities into matrix M“ rMss1s, with elements Mss1 “ Pps1|s, aq given by the 1-step mobility model, as:

Pps1|s, aq “ p s1 “ s ` 1, s “ 1, . . . , S Pps1|s, aq “ q s1 “ s ´ 1, s “ 2, . . . , S Pps1|s, aq “ 1 ´ p ´ q s1 “ s, s “ 2, . . . , S Pps1|s, aq “ 1 ´ p s1 “ s, s “ 1 Pps1|s, aq “ 1 s1 “ s, s “ ¯s. (5.11)

If the transmission duration of aP A is Tc “ N , then we compute the N -step

transition probabilities into matrix MN, i.e., we take the N -th power of matrix

M so as to account for the N -step evolution of the system state under aP A with transmission duration Tc.

Observation model: Ppo|s, a, s1q is the probability of observing o P O given

s P ¯S and a P A with transmission duration Tc, ending in s1 P ¯S. We assume

that the BS can successfully perform a“ p ˆS, Pc, Tcq if the MU remains within

ˆ

S for Tc subsequent micro time-slots, i.e., the MU does not exit from the beam

ACK to the BS, o“ ACK. Therefore, we define XTc

0 “ tX0, . . . , XTcu, as the system state path from time 0 to time Tc, and the event Es,sN1 “ tX0Tc P ˆSTc`1 | X0 “ s, XTc “ s1, Tc “ N u, meaning that the system state path X

Tc

0 remains

within ˆS for Tc subsequent micro time-slots, given that X0 “ s, XTc “ s1, Tc “ N . In order to compute it, we also define matrix ˜M as the transition

probability matrix restricted to the beam support ˆS, i.e., ˜M “ r ˜Mss1s, with elements ˜Mss1 “ Mss1 if ts, s1u P ˆS, otherwise ˜Mss1 “ 0. We derive PpEs,sN1q as:

PpEs,sN1q “ PpX0Tc P ˆSTc`1 | X0 “ s, XTc “ s1, Tc “ N q “ 1 MN ss1PpX Tc 0 P ˆS Tc`1, X Tc “ s1 | X0 “ s, Tc “ N q “ 1 MN ss1 ÿ sN 0 P ˆSN`1 Nź´1 z“0 PpXz`1“sz`1 | Xz“szq“ ˜ MN ss1 MN ss1 . (5.12)

Given a P A with transmission duration Tc, Ppo|s, a, s1q is defined as follows.

If c “ BT, we account for false-alarm and mis-detection errors in the beam detection process. In particular, if ts, s1u P ˆS (i.e., the MU is within the

beam support during the duration of BT) then PpACK|s, a, s1q “ PD (correct

detection) and PpNACK|s, a, s1q “ PMD (mis-detection); on the other hand, if

ts, s1u R ˆS (i.e., the MU is outside of the beam support during the duration of

BT), then PpACK|s, a, s1q “ PFA (false-alarm) and PpNACK|s, a, s1q “ 1´PFA.

If c “ DT, then the transmission is successful if the event EN

s,s1 occurs, so that PpACK|s, a, s1q “ PpEN

s,s1q for ts, s1u P S, and PpNACK|s, a, s1q “ 1 ´ PpACK|s, a, s1q. Finally, Pp¯s|s, a, s1q “ 1 whenever either s “ ¯s or s1 “ ¯s, i.e.,

the BS knows when the MU exited its coverage area.

Rewards: rps, aq is the expected reward given s P ¯S and a P A, defined as the transmission rate (number of bits transmitted from the BS to the MU) during DT if the MU remains within ˆS for Tc subsequent micro time-slots. Formally,

rps, aq “ řs1P ¯SPps1|s, aqrps, a, s1q “ ř

s1P ¯SMNss1ErRpTDT ´ 1qX pEs,sN1qs, where X pEN

s,s1q “ 1 iff the event Es,sN1 is true (thus rps, aq “ 0 if the MU exits from the beam support). Note that ErX pEN

s,s1qs “ PpEs,sN1q, which is computed in Eq. (5.12). The transmission rate R follows from Eq. (5.7) when Pt “ PDT| ˆS|.

Finally, rps, aq “ RpTDT´ 1qřs1P ¯SM˜Nss1, where TDT ´ 1 refers to the fact that

transmission. If c“ BT, then rps, aq “ 0, as no bits of data are transmitted.

Costs: cps, aq is the expected energy cost given s P ¯S and a P A. The total expected cost during a transmission episode is subject to the constraint C. If c “ DT, then cps, aq “ PDT| ˆS|pTDT ´ 1q, @s P S (we reserve one micro

time-slot for the feedback transmission). If c“ BT, then cps, aq “ PBT| ˆS|. In

this POMDP formulation, the overhead cost given by BT with respect to DT is implicitly modelled with the total fraction of time spent by BT with respect to DT.