Partially Observable Markov Decision Process

4.4 Optimization for online energy management

5.2.2 Partially Observable Markov Decision Process

Next, we define a constrained Partially Observable (PO) Markov Decision Process (MDP).

States: ¯S is a finite set of states describing the position of the MU within the road link, along with the absorbing state ¯s when the MU exits the coverage area of the BS. Therefore, ¯S ” S Y t¯su, and S ” t1, . . . , Su is the set of road sub-links.

Actions: A is a finite set of actions that the BS can perform. Specifically, the BS can perform actions for Beam Training (BT) and actions for Data Trans- mission (DT), which involve selection of the transmission beam, power, and duration. In general, aP A is in the form a “ p ˆS, Pc, Tcq, where: c “ tBT, DTu

refers to the action class; ˆS Ď S is a sub-set of sub-links, defining the support of the transmission beam; Pc is the transmission power per beam, such

that Pt “ Pc| ˆS| in Eq. (5.6) is the total transmit power, Tc is the transmis-

sion duration of action a P A (number of micro time-slots of duration ∆t).

If c “ BT, then a “ p ˆS, PBT, TBTq, where PBT and TBT are fixed parame-

ters of the model. Specifically, we assume that BT actions perform simultaneous beamforming over ˆS in one interval of ∆t seconds (i.e., TBT “ 1).

Also, PBT “ Pmin, where Pmin is the power per beam required to attain the

alarm and mis-detection probabilities PFA and PMD, which are also fixed pa-

rameters of the model, via Eq. (5.10). If c “ DT, then a “ p ˆS, PDT, TDTq,

where PDT and TDT are part of the optimization. Specifically, we assume that

DT actions perform simultaneous data communication over ˆS for TDT ´ 1

micro time-slots, where the last interval of ∆t seconds is dedicated to the

ACK/NACK feedback transmission from the MU to the BS. During DT, the transmission rate follows from Eq. (5.7). Note that the action space grows as | ˆS| “řS_s_“1S!{s!{pS ´sq! “ 2S_{´1. To reduce its cardinality, we restrict A such}

that ˆS is a sub-set of consecutive indices in S, i.e., the beam directions specified by ˆS define a compact range of transmission for the BS. Thus, | ˆS| “ SpS`1q{2. Observations: O is a finite set of observations, defined as O” tACK, NACK, ¯su. Specifically, o“ ¯s means that the MU exited the coverage area of the BS; for simplicity, in this chapter we assume that such event is observable, i.e., the BS knows when the MU exited its coverage area.

Transition probabilities: Pps1_{|s, aq is the transition probability from s P ¯}_S

to s1 _{P ¯}_{S given a P A. Note that these probabilities are a function of the}

duration Tc of action a. If the transmission duration of a P A is Tc “ 1,

then we store the 1-step transition probabilities into matrix M“ rMss1s, with elements Mss1 “ Pps1|s, aq given by the 1-step mobility model, as:

If the transmission duration of aP A is Tc “ N , then we compute the N -step

transition probabilities into matrix MN_{, i.e., we take the N -th power of matrix}

M _{so as to account for the N -step evolution of the system state under a}_{P A} with transmission duration Tc.

Observation model: Ppo|s, a, s1_{q is the probability of observing o P O given}

s P ¯S and a P A with transmission duration Tc, ending in s1 P ¯S. We assume

that the BS can successfully perform a“ p ˆS, Pc, Tcq if the MU remains within

S for Tc subsequent micro time-slots, i.e., the MU does not exit from the beam

ACK to the BS, o“ ACK. Therefore, we define XTc

0 “ tX0, . . . , XTcu, as the system state path from time 0 to time Tc, and the event Es,sN1 “ tX₀Tc P ˆSTc`1 | X0 “ s, XTc “ s1, Tc “ N u, meaning that the system state path X

0 remains

within ˆS for Tc subsequent micro time-slots, given that X0 “ s, XTc “ s1, Tc “ N . In order to compute it, we also define matrix ˜M as the transition

probability matrix restricted to the beam support ˆS, i.e., ˜M _{“ r ˜}M_ss1s, with elements ˜M_ss1 “ M_ss1 if ts, s1u P ˆS, otherwise ˜M_ss1 “ 0. We derive PpE_s,sN1q as:

PpEs,sN1q “ PpX₀Tc P ˆSTc`1 | X0 “ s, XTc “ s1, Tc “ N q “ 1 MN ss1PpX Tc 0 P ˆS Tc`1_{, X} Tc “ s1 | X0 “ s, Tc “ N q “ 1 MN ss1 ÿ sN 0 P ˆSN`1 N_ź´1 z_“0 PpXz_`1“sz_`1 | Xz“szq“ ˜ MN ss1 MN ss1 . (5.12)

Given a P A with transmission duration Tc, Ppo|s, a, s1q is defined as follows.

If c “ BT, we account for false-alarm and mis-detection errors in the beam detection process. In particular, if ts, s1_{u P ˆ}_{S (i.e., the MU is within the}

beam support during the duration of BT) then PpACK|s, a, s1_{q “ P}_D _(correct

detection) and PpNACK|s, a, s1q “ PMD (mis-detection); on the other hand, if

ts, s1_{u R ˆ}_{S (i.e., the MU is outside of the beam support during the duration of}

BT), then PpACK|s, a, s1_{q “ P}_FA _{(false-alarm) and PpNACK|s, a, s}1_{q “ 1´P}_FA_.

If c “ DT, then the transmission is successful if the event EN

s,s1 occurs, so that PpACK|s, a, s1_{q “ PpE}N

s,s1q for ts, s1u P S, and PpNACK|s, a, s1q “ 1 ´ PpACK|s, a, s1q. Finally, Pp¯s|s, a, s1_{q “ 1 whenever either s “ ¯}_s _{or s}1 _{“ ¯}_{s, i.e.,}

the BS knows when the MU exited its coverage area.

Rewards: rps, aq is the expected reward given s P ¯S and a P A, defined as the transmission rate (number of bits transmitted from the BS to the MU) during DT if the MU remains within ˆS for Tc subsequent micro time-slots. Formally,

rps, aq “ ř_s1_{P ¯}_SPps1|s, aqrps, a, s1q “ ř

s1_{P ¯}_SMN_ss1ErRpTDT ´ 1qX pE_s,sN1qs, where X pEN

s,s1q “ 1 iff the event E_s,sN1 is true (thus rps, aq “ 0 if the MU exits from the beam support). Note that ErX pEN

s,s1qs “ PpE_s,sN1q, which is computed in Eq. (5.12). The transmission rate R follows from Eq. (5.7) when Pt “ PDT| ˆS|.

Finally, rps, aq “ RpTDT´ 1qř_s1_{P ¯}SM˜Nss1, where TDT ´ 1 refers to the fact that

transmission. If c“ BT, then rps, aq “ 0, as no bits of data are transmitted.

Costs: cps, aq is the expected energy cost given s P ¯S and a P A. The total expected cost during a transmission episode is subject to the constraint C. If c “ DT, then cps, aq “ PDT| ˆS|pTDT ´ 1q, @s P S (we reserve one micro

time-slot for the feedback transmission). If c“ BT, then cps, aq “ PBT| ˆS|. In

this POMDP formulation, the overhead cost given by BT with respect to DT is implicitly modelled with the total fraction of time spent by BT with respect to DT.

In document Bayesian Learning Strategies in Wireless Networks (Page 144-147)