Modified Bayesian Network Model - Theoretical Evaluation

6.4 Theoretical Evaluation

6.4.1 Modified Bayesian Network Model

Figure 6.3 presents an adaptation of the Bayesian network model proposed in Section 5.3 which describes the behaviour of DIAQ when applied to the simple 2 eNB 2 UE cellular network described Section 5.2. The shaded nodes and dotted edges show extra dependencies introduced by DIAQ, compared to classical stateless Q-learning described by the Bayesian network shown in Figure 5.2. The variables used to denote the Bayesian network nodes are the following:

• RNT P ∈ {Y es, No} - whether or not, at the latest file arrival time, the corre- sponding eNB has an up-to-date RNTP message from its neighbour.

• IU Ex ∈ {Y es, No} - whether or not UE1 or UE2 is located within the interference range of the adjacent eNB during the current file arrival.

• Πn ∈ {Same, Dif f } - joint policy of the eNBs after n learning iterations. The policy of an eNB is defined as its preferred subchannel (1 or 2). Πn takes two values of interest - whether the policies of 2 eNBs are the same or different (Πn = Dif f is the learning objective).

• Πm

n ∈ {Same, Dif f } - joint masked policy, i.e. the combination of Πn and the heuristic functions of both eNBs defined in Equation (6.3). It is conditionally dependent onΠnand RN T P (Πmn may be different toΠn, based on the Q-table transformation described by Equation (6.4)).

• RU Ex ∈ {S, F } - whether or not a file transmission to UE1 or UE2 is successful

Πn Πn+1 Πm n RN T P RU E1 RU E2 IU E1 IU E2

Figure 6.3: Bayesian network describing the behaviour of distributed ICIC accelerated Q-learning applied to the 2 eNB 2 UE dynamic spectrum access network

(S), or whether it failed (F ) due to interference. It is conditionally dependent on Πm

n and IU Ex.

• Πn+1 ∈ {Same, Dif f } - the updated joint policy for the next iteration as a result of the outcome at the current iteration. It is conditionally dependent on Πn,Πmn, RU E1and RU E2.

The key difference between this Bayesian network model and the one describing regular stateless Q-learning shown in Figure 5.2 is the addition of the joint masked policy nodeΠm

n, which takes into account the RN T P signal from the neighbouring eNB and which is now used for decision making instead of the regular joint Q-learning pol- icyΠn. The environmental T xOL variable from the original Bayesian network that indicates whether transmissions overlap in time is omitted in the new version for sim- plicity, since it only affects the convergence speed of the learning process and does not have any effect on the relative performance of DIAQ compared to regular Q-learning. It can be viewed as part of the IU E1 and IU E2 variables whose only role is determin- ing whether either UE receives harmful inter-cell interference from the adjacent eNB during a pair of transmissions from both eNBs.

Similarly to the Bayesian network model discussed in Section 5.3, based on the conditional dependencies depicted in Figure 6.3, the equation for calculating the joint probability distribution over all variables Pjoint = P (Πn+1, Πn, Πmn, RU E1, RU E2, IU E1, IU E2, RN T P ) is the following:

Pjoint= P (Πn+1|Πn,Πmn, RU E1, RU E2) P (RU E1|Πmn, IU E1) P (RU E2|Πmn, IU E2) ×P (Πm

n|Πn, RN T P) P (Πn) P (RNT P ) P (IU E1) P (IU E2)

(6.5) which consists of a number of prior probabilities of the form P(X), and conditional probabilities of the form P(X|Y1...Y_n).

6.4.2 Prior and Conditional Probability Distributions

The prior probability distributions that appropriately describe the 2 eNB 2 UE sce- nario from Section 5.2 are defined in Table 6.1. The P(Π0) and P (IU Ex) distributions are identical to those proposed in Section 5.3 for classical stateless Q-learning. The

Table 6.1: Prior probability distributions used in the Bayesian network model of distributed ICIC accelerated Q-learning (DIAQ)

P(Π0) P(IU Ex) P(RNT P )

Same Dif f Y es N o Y es N o

0.5 0.5 0.4 0.6 High Low

probability of an RNTP message exchange RN T P is a new additional environmental variable that is specific to inter-eNB ICIC signalling used by the DIAQ scheme. P(RNT P = Y es) = High represents a high chance of an RNTP message exchange taking place between current file arrivals at the two eNBs. Since these exchanges can take place as often as every 20 ms, an eNB is highly likely to have an up-to-date RNTP message from its neighbour. If P(RNT P = Y es) is changed to 0, the Bayesian network model will describe the regular stateless Q-learning algorithm introduced in Subsection 4.1.2.

The conditional probability distributions are defined in Table 6.2. The values used for P(Πm

n|Πn, RN T P) state that the masked policies Πmn of the eNBs will be the same (Same) with a probability of 1, if their Q-learning policies are the same (Πn = Same) and there was no RNTP exchange between the file arrivals that could change them (RN T P = No). In all other cases, i.e if the Q-learning policies of the eNB are al- ready different (Πn = Dif f ) or if there has been a timely RNTP signal exchange (RN T P = Y es) to correct them, the masked policies of the eNBs will always be different (Dif f ). The reasoning behind the P(RU Ex|Πmn, IU Ex) distribution is to indi- cate, that a transmission to U E1 or UE2 will fail with a probability of 1 (RU Ex = F ), if IU Ex = Y es and both eNBs have chosen the same subchannel (Πmn = Same). If Πm

n = Dif f or IU Ex= No, then the transmission will be successful: RU Ex= S. The P(Πn+1|Πn,Πmn, RU E1, RU E2) table defines how the Q-learning policies of both eNBs (Πn+1) are likely to change, given their current Πn and Πmn, and the result of transmissions to both UEs (RU E1 and RU E2). Firstly, if bothΠnandΠmn are Same or both are Dif f , and the transmissions to both UEs are successful (RU E1 = RU E2 = S), then both eNBs will reward their respective subchannels and maintain the same policies with a probability of 1 (Πn+1 = Πn). Secondly, if both Πn and Πmn are Same and only a transmission to one of the UEs failed ({S, F } or {F, S}), this UE is

Table 6.2: Conditional probability distributions used in the Bayesian network model of distributed ICIC accelerated Q-learning (DIAQ)

P(Πm

n|Πn, RN T P)

Same 0 1 0 0

Dif f 1 0 1 1

Same, Y es Same, N o Dif f , Y es Dif f , N o Πn, RN T P

P(RU Ex|Πmn, IU Ex)

S 0 1 1 1

F 1 0 0 0

Same, Y es Same, N o Dif f , Y es Dif f , N o Πm

n, IU Ex

P(Πn+1|Πn,Πmn, RU E1, RU E2)

Same 1 Low Low High f(n) 0

Dif f 0 High High Low 1 − f (n) 1

Same Same Same Same Same Dif f

Same Same Same Same Dif f Dif f

S, S S, F F , S F , F S, S S, S

Πn,Πmn, RU E1, RU E2

more likely to change its policy due to the WoLF learning rate used in its Q-learning algorithm, described Subsection 4.2.1. Therefore, there is a relatively high probability of the policies being different at the next iteration: P(Πn+1 = Dif f ) = High. If transmissions to both UEs fail ({F, F }), both eNBs are likely to change their policies, thus makingΠn+1 = Same a more likely outcome. Lastly, if the Q-learning policies of both eNBs are the same (Πn = Same), the masked policies are different (Πmn = Dif f ), and both transmissions are successful (RU E1 = RU E1 = S), the probability of theΠn+1 = Same at the next iteration is time-dependent. A realistic approximation of its value at different stages of learning is the following:

f(n) =          0 n = 0 0.5 n = 1 High n >1 (6.6)

to zeros. Therefore, if different subchannels are successfully used (Πm

n = Dif f ), they will be positively reinforced and used at the next iteration with a probability of 1: P(Πn+1= Same|...) = 0. After one learning iteration, there is about a 50% chance of one of the eNBs changing its Q-learning policy, depending on whether its first trial was a success on its preferred subchannel, or a failure on the other subchannel: P(Πn+1= Same|...) = 0.5. Afterwards, the eNB, whose Q-learning policy is overriden by the RNTP exchange (sinceΠn 6= Πmn), is relatively unlikely to change its policy due to the effect of the WoLF learning rates, i.e. the Q-values undergo smaller step changes after successful trials: P(Πn+1 = Same|...) = High.

The remaining ten combinations ofΠn,Πmn, RU E1and RU E2values are not considered, since they can never occur according to the P(Πm

n|Πn, RN T P) and P (RU Ex|Πmn, IU Ex) conditional probability distributions. Regardless of the values used for these combinations in the P(Πn+1|Πn,Πmn, RU E1, RU E2) table, they will be multiplied by zero during the calculation of the joint probability distribution defined in Equation (6.5).

In document Accelerating Reinforcement Learning for Dynamic Spectrum Access in Cognitive Wireless Networks (Page 95-99)