Figure 5.2 presents the Bayesian network which describes the behaviour of the distributed Q-learning algorithm introduced in Subsection 4.1.2 when applied to the simple DSA network model shown in Figure 5.1.
The variables used to denote the Bayesian network nodes are the following:
• Πn ∈ {Same, Diff} - the joint policy of the BSs after n learning iterations. The individual policy of one BS is defined as its preferred subchannel πx ∈ {1, 2} and is derived from the Q-table based on Equation (4.3). The joint policy Πn takes two values of interest: whether the individual policies of the 2 BSs are the same or different (Πn = Diff is the learning objective).
• IUEx ∈ {Yes, No} - whether or not UE1 or UE2 is located within the interference range of the adjacent BS during the current file arrival.
• TxOL ∈ {Yes, No} - whether file transmissions to UE1 and UE2 overlap in time during the current iteration.
• RUEx ∈ {S, F} - whether a file transmission to UE1 or UE2 was successful (S), or whether it failed (F) due to interference. It is conditionally dependent on Πn, IUEx and TxOL.
• Πn+1 ∈ {Same, Diff} - the joint policy after the Q-learning updates are performed based on Equation (4.4), as a result of the outcome at the current iteration. It is conditionally dependent on Πn, RUE1 and RUE2.

[Figure 5.2: Bayesian network with nodes Πn, IUE1, TxOL, IUE2, RUE1, RUE2 and Πn+1]
Based on the conditional dependencies described above and depicted in Figure 5.2, the joint probability distribution over all variables, Pjoint = P(Πn+1, Πn, RUE1, RUE2, IUE1, IUE2, TxOL), is calculated as follows:

$$
\begin{aligned}
P_{joint} = {} & P(\Pi_{n+1} \mid \Pi_n, R_{UE1}, R_{UE2}) \\
& \times P(R_{UE1} \mid \Pi_n, I_{UE1}, Tx_{OL}) \, P(R_{UE2} \mid \Pi_n, I_{UE2}, Tx_{OL}) \\
& \times P(\Pi_n) \, P(I_{UE1}) \, P(I_{UE2}) \, P(Tx_{OL})
\end{aligned}
\tag{5.1}
$$

which consists of a number of prior probabilities of the form P(X), and conditional probabilities of the form P(X|Y1...Yn).
5.3.1 Prior and Conditional Probability Distributions
The prior probability distributions that describe the given 2 BS 2 UE scenario are defined in Table 5.1. Before any file arrivals at either BS, the Q-tables of both BSs are initialised to zero for both subchannels. Therefore, there is a 50% chance of the BSs choosing the same subchannel, since each of them will choose a subchannel at random, i.e. P(Π0 = Same) = 0.5. Furthermore, it is assumed that the interference range overlap of the BSs is such that there is a 40% chance of a UE being located in it, i.e. P(IUEx = Yes) = 0.4. Finally, the offered traffic level is assumed to produce a 60% chance of transmissions to both UEs overlapping in time at any given learning iteration, thus potentially resulting in inter-cell interference: P(TxOL = Yes) = 0.6. The values chosen for P(IUEx) and P(TxOL) only affect the relative difficulty of the DSA problem. They can be changed without loss of generality of the proposed probabilistic model.
Table 5.1: Prior probability distributions used in the Bayesian network model of distributed stateless Q-learning

P(Π0)           P(IUEx)         P(TxOL)
Same    Diff    Yes     No      Yes     No
0.5     0.5     0.4     0.6     0.6     0.4

The conditional probability distributions are defined in Table 5.2. The values used for the P(RUEx|Πn, IUEx, TxOL) distribution state that a transmission to UE1 or UE2 will fail with a probability of 1 (RUEx = F) only if the given UE is within the interference range of the other BS (IUEx = Yes), transmissions to both UEs overlap in time (TxOL = Yes) and both BSs have chosen the same subchannel (Πn = Same). In any other case, i.e. if Πn = Diff, IUEx = No or TxOL = No, the transmission will be successful: RUEx = S.
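As a quick check of how these values combine, the failure probability at the first learning iteration can be worked out by hand. Because the conditional distribution is deterministic and assigns failure to a single parent combination, marginalising over the parents reduces to a product of three priors from Table 5.1. The subscripted symbols below are a LaTeX rendering of Πn, IUEx, TxOL and RUEx.

$$
P(R_{UEx} = F) = P(\Pi_0 = Same)\, P(I_{UEx} = Yes)\, P(Tx_{OL} = Yes) = 0.5 \times 0.4 \times 0.6 = 0.12
$$

For later iterations the same expression holds with P(Π0 = Same) replaced by P(Πn = Same), so the failure probability scales directly with the probability that the BSs still share a subchannel.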
The P(Πn+1|Πn, RUE1, RUE2) table defines how the Q-learning policies of both BSs (Πn+1) are likely to change, given their current joint policy Πn and the result of transmissions to both UEs (RUE1 and RUE2). Both BSs are running the stateless Q-learning algorithm introduced in Subsection 4.1.2. Firstly, if the transmissions to both UEs are successful (RUE1 = RUE2 = S), then both BSs will reward their respective subchannels and maintain the same policies regardless of whether they are the same or different (Πn+1 = Πn). Secondly, if Πn = Same and only a transmission to one of the UEs failed ({S, F} or {F, S}), the BS serving this UE is more likely to change its policy due to the WoLF learning rate used in its Q-learning algorithm, described in Subsection 4.2.1. Therefore, there is a relatively high probability of the policies being different at the next iteration: P(Πn+1 = Diff) = High. If transmissions to both UEs fail ({F, F}), both BSs are likely to change their policies to the same other subchannel, thus making Πn+1 = Same a more likely outcome: P(Πn+1 = Same) = High. The remaining three combinations of Πn, RUE1 and RUE2 values are not considered, since they can never occur according to the P(RUEx|Πn, IUEx, TxOL) conditional probability distribution. Regardless of the values used for these combinations in the P(Πn+1|Πn, RUE1, RUE2) table, they will be multiplied by zero during the calculation of the joint probability distribution defined in Equation (5.1).

Table 5.2: Conditional probability distributions used in the Bayesian network model of distributed stateless Q-learning

P(RUEx|Πn, IUEx, TxOL)

Πn          Same   Same   Same   Same   Diff   Diff   Diff   Diff
IUEx        Yes    Yes    No     No     Yes    Yes    No     No
TxOL        Yes    No     Yes    No     Yes    No     Yes    No
RUEx = S    0      1      1      1      1      1      1      1
RUEx = F    1      0      0      0      0      0      0      0

P(Πn+1|Πn, RUE1, RUE2)

Πn             Same   Same   Same   Same   Diff
RUE1, RUE2     S, S   S, F   F, S   F, F   S, S
Πn+1 = Same    1      Low    Low    High   0
Πn+1 = Diff    0      High   High   Low    1

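The transition behaviour described above can be condensed into a single expression by marginalising over the transmission outcomes, sketched here using the prior values from Table 5.1. Given Πn = Same, a UE fails only when IUEx = Yes and TxOL = Yes, and both outcomes share the same TxOL variable. High and Low are left symbolic, since the source gives no numeric values for them.

$$
\begin{aligned}
P(F, F \mid \Pi_n = Same) &= 0.6 \times 0.4 \times 0.4 = 0.096 \\
P(S, F \mid \Pi_n = Same) + P(F, S \mid \Pi_n = Same) &= 0.6 \times 2 \times 0.4 \times 0.6 = 0.288 \\
P(S, S \mid \Pi_n = Same) &= 1 - 0.096 - 0.288 = 0.616 \\
P(\Pi_{n+1} = Diff \mid \Pi_n = Same) &= 0.288 \cdot High + 0.096 \cdot Low
\end{aligned}
$$

Since Πn = Diff can only be followed by {S, S} and therefore by Πn+1 = Diff, the probability of different policies can never decrease in this model: P(Πn+1 = Diff) = P(Πn = Diff) + P(Πn = Same) (0.288 High + 0.096 Low).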
5.3.2 Bayesian Network Inference
The aim of the Bayesian network model described above is to establish the marginal likelihood of the joint Q-learning policy at the next iteration, P(Πn+1), by taking a sum over all other variables in Pjoint as follows:

$$
P(\Pi_{n+1}) = \sum_{\Pi_n} \sum_{R_{UE1}} \sum_{R_{UE2}} \sum_{I_{UE1}} \sum_{I_{UE2}} \sum_{Tx_{OL}} P_{joint} \tag{5.2}
$$

The resulting distribution can then be substituted as the prior for the next learning iteration: P(Πn) ← P(Πn+1). This enables iterative evaluation of the Bayesian network model, which shows how the probability of transmission failure P(RUEx) and the probability of BSs using different subchannels P(Πn) change over time, as the learning process progresses. The individual P(RUEx) distribution can be obtained using the same principle of marginalisation as follows:

$$
P(R_{UE1/2}) = \sum_{\Pi_{n+1}} \sum_{\Pi_n} \sum_{R_{UE2/1}} \sum_{I_{UE1}} \sum_{I_{UE2}} \sum_{Tx_{OL}} P_{joint} \tag{5.3}
$$

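The iterative evaluation described above can be sketched in a few lines of Python by enumerating all 2^7 variable assignments of the joint distribution. This is a minimal illustration rather than the implementation used in this work: the numeric values assigned to High and Low (0.8 and 0.2) are assumptions chosen only to make the example runnable, and the function names `p_r`, `p_next` and `step` are hypothetical.

```python
from itertools import product

# Assumed numeric stand-ins for the qualitative "High"/"Low" entries of
# Table 5.2 (hypothetical values, not given in the source).
HIGH, LOW = 0.8, 0.2

P_I = {"Yes": 0.4, "No": 0.6}    # P(IUEx), Table 5.1
P_TX = {"Yes": 0.6, "No": 0.4}   # P(TxOL), Table 5.1


def p_r(r, pi, i, tx):
    """P(RUEx = r | Pi_n, IUEx, TxOL): failure is certain only for (Same, Yes, Yes)."""
    fails = pi == "Same" and i == "Yes" and tx == "Yes"
    return 1.0 if (r == "F") == fails else 0.0


def p_next(pi_next, pi, r1, r2):
    """P(Pi_n+1 = pi_next | Pi_n, RUE1, RUE2), following Table 5.2."""
    if pi == "Diff":
        # Only (Diff, S, S) is reachable; the other three combinations are
        # multiplied by zero in the joint, so the value used here is arbitrary.
        p_same = 0.0
    elif (r1, r2) == ("S", "S"):
        p_same = 1.0
    elif (r1, r2) == ("F", "F"):
        p_same = HIGH
    else:
        p_same = LOW
    return p_same if pi_next == "Same" else 1.0 - p_same


def step(p_pi):
    """One iteration: returns (P(Pi_n+1), P(RUE1 = F)) given the prior P(Pi_n)."""
    p_pi_next = {"Same": 0.0, "Diff": 0.0}
    p_fail = 0.0
    for pi_next, pi, r1, r2, i1, i2, tx in product(
        ("Same", "Diff"), ("Same", "Diff"), "SF", "SF",
        ("Yes", "No"), ("Yes", "No"), ("Yes", "No"),
    ):
        # Equation (5.1): product of all factors of the joint distribution.
        joint = (p_next(pi_next, pi, r1, r2)
                 * p_r(r1, pi, i1, tx) * p_r(r2, pi, i2, tx)
                 * p_pi[pi] * P_I[i1] * P_I[i2] * P_TX[tx])
        p_pi_next[pi_next] += joint      # Equation (5.2)
        if r1 == "F":
            p_fail += joint              # Equation (5.3), for UE1
    return p_pi_next, p_fail


p_pi = {"Same": 0.5, "Diff": 0.5}        # P(Pi_0), Table 5.1
history = []                             # (n, P(Pi_n = Diff), P(RUE1 = F))
for n in range(20):
    p_pi_next, p_fail = step(p_pi)
    history.append((n, p_pi["Diff"], p_fail))
    p_pi = p_pi_next                     # P(Pi_n) <- P(Pi_n+1)

for n, p_diff, p_fail in history[:3]:
    print(f"n={n}: P(Diff)={p_diff:.4f}, P(RUE1=F)={p_fail:.4f}")
```

Brute-force enumeration is sufficient here because the network has only seven binary variables. Under the assumed High and Low values, the first iteration gives a failure probability of 0.12, and P(Πn = Diff) rises from 0.5 to 0.6248 after one update.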
This probabilistic analysis is only valid for the 2 BS 2 UE network model described in Section 5.2, and is not designed to be scalable to larger and more realistic networks. The purpose of this model is to enable theoretical analysis of the relative behaviour of RL algorithms using a simple and tractable problem. An additional, useful approach to evaluating such algorithms, used in Chapter 6, is to perform realistic large-scale simulations and assess similarities between the simulation results and the theoretical predictions obtained via the method proposed in this chapter.