Absorbing Markov Chain Formulation

Figure 5.4 shows an alternative formulation of the convergence properties of distributed Q-learning, derived from the Bayesian network model introduced in Section 5.3. It is a Markov chain describing the probabilities of transitions between two different states of the joint policy: Same (π1 = π2) and Diff (π1 ≠ π2). The transition probabilities are taken from the $P(\Pi_{n+1} \mid \Pi_n)$ distribution which, in turn, is calculated using the following definition of conditional probability:

$$P(\Pi_{n+1} \mid \Pi_n) = \frac{P(\Pi_{n+1}, \Pi_n)}{P(\Pi_n)} \tag{5.4}$$

where $P(\Pi_{n+1}, \Pi_n)$ is obtained by marginalising all other variables from the overall joint distribution as follows:

$$P(\Pi_{n+1}, \Pi_n) = \sum_{R_{UE1}} \sum_{R_{UE2}} \sum_{I_{UE1}} \sum_{I_{UE2}} \sum_{TxOL} P_{joint} \tag{5.5}$$
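The two steps in Equations (5.4) and (5.5) can be sketched numerically as follows. This is only an illustration under stated assumptions: the joint distribution is stored as a NumPy array, the axis order and the cardinality of each variable are hypothetical placeholders, and the joint table is filled with random values rather than the distributions of the actual Bayesian network model.

```python
import numpy as np


def transition_matrix(p_joint):
    """Return P(Pi_{n+1} | Pi_n) from a joint table.

    Assumed (hypothetical) axis order of p_joint:
    (Pi_n, Pi_{n+1}, R_UE1, R_UE2, I_UE1, I_UE2, TxOL).
    """
    # Eq. (5.5): sum out every variable except Pi_n and Pi_{n+1}.
    p_pair = p_joint.sum(axis=(2, 3, 4, 5, 6))  # indexed [Pi_n, Pi_{n+1}]
    # P(Pi_n): additionally sum out Pi_{n+1}.
    p_prev = p_pair.sum(axis=1, keepdims=True)
    # Eq. (5.4): divide the pairwise joint by the marginal of Pi_n.
    return p_pair / p_prev


# Placeholder joint distribution; the cardinalities are illustrative only.
rng = np.random.default_rng(0)
p_joint = rng.random((2, 2, 2, 2, 2, 2, 3))
p_joint /= p_joint.sum()  # normalise to a valid joint distribution

p_cond = transition_matrix(p_joint)
print(p_cond)  # each row is a distribution over Pi_{n+1}
```

Each row of the resulting matrix corresponds to one value of the current joint policy state and sums to one, which is the form needed for the Markov chain in Figure 5.4.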

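[State diagram: from "π1 = π2", the chain stays in "π1 = π2" with probability 0.73 and moves to "π1 ≠ π2" with probability 0.27; from "π1 ≠ π2", it stays in "π1 ≠ π2" with probability 1.]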

Figure 5.4: An absorbing Markov chain describing the transitions between two states of the joint policy, derived from the Bayesian network model of the 2 base station, 2 user equipment cellular network

Firstly, the Markov chain in Figure 5.4 shows that “π1 ≠ π2” is an absorbing state, i.e. a state that cannot be left, since the probability of transition from “π1 ≠ π2” to “π1 = π2” is zero. Therefore, this is an absorbing Markov chain, which formally demonstrates that the RL algorithm is guaranteed to converge on the desired absorbing state “π1 ≠ π2”. The speed of convergence is controlled by the probability of transition from “π1 = π2” to “π1 ≠ π2”, which in this case is 0.27. The objective of future, more advanced RL algorithms, designed using the method proposed in this chapter, is to increase this transition probability to speed up their convergence and, thus, increase their adaptability, whilst preserving the absorbing state “π1 ≠ π2”.
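The link between the 0.27 transition probability and the speed of convergence can be made concrete with a short numerical sketch. It uses the standard fundamental-matrix result for absorbing Markov chains and assumes that the chain starts in the “π1 = π2” state and that the transition probabilities of Figure 5.4 stay fixed over time; the variable and function names are illustrative.

```python
import numpy as np

# Transition matrix from Figure 5.4, rows and columns ordered [Same, Diff].
P = np.array([[0.73, 0.27],
              [0.00, 1.00]])


def prob_absorbed_by(n):
    """Probability of having reached Diff within n steps, starting in Same."""
    return np.linalg.matrix_power(P, n)[0, 1]


# Fundamental matrix N = (I - Q)^-1, where Q is the transient-to-transient block.
Q = P[:1, :1]
N = np.linalg.inv(np.eye(1) - Q)

# Expected number of steps before absorption, starting in Same.
expected_steps = N.sum(axis=1)[0]

print(expected_steps)        # equals 1 / 0.27, roughly 3.7 steps
print(prob_absorbed_by(10))  # equals 1 - 0.73**10
```

With a single transient state the expected time to absorption reduces to the reciprocal of the exit probability, so raising the 0.27 transition probability directly shortens the expected number of steps before the joint policy reaches “π1 ≠ π2”.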

5.6 Conclusion

The Bayesian network based joint policy transition analysis methodology proposed in this chapter provides a simple and accurate probabilistic model of distributed RL algorithms applied to a minimum complexity DSA problem. A Monte Carlo simulation of a distributed Q-learning based DSA algorithm shows that the proposed approach predicts the convergence behaviour of such algorithms with remarkable accuracy. Furthermore, their behaviour can also be expressed in the form of an absorbing Markov chain, derived from the novel Bayesian network model. This representation enables further theoretical analysis of the convergence and adaptability properties of RL based DSA algorithms. Finally, the main benefit of the analysis tool presented in this chapter is that it enables the design and theoretical evaluation of novel RL based DSA algorithms by extending the proposed Bayesian network model, which describes a standard distributed Q-learning scheme.

Chapter 6. Distributed Heuristically Accelerated Q-Learning

6.1 Motivation
6.2 Heuristically Accelerated Reinforcement Learning
6.3 Distributed ICIC Accelerated Q-Learning
6.4 Theoretical Evaluation
6.4.1 Modified Bayesian Network Model
6.4.2 Prior and Conditional Probability Distributions
6.4.3 Convergence Behaviour of DIAQ
6.4.4 Absorbing Markov Chain Analysis
6.5 Simulation Results
6.5.1 Temporal Performance
6.5.2 Initial and Final Performance
6.6 Conclusion

6.1 Motivation

Although RL algorithms such as stateless Q-learning investigated in Chapters 4 and 5 have been shown to be a powerful approach to problem solving, their common disadvantage is the need for many learning iterations before convergence on an acceptable solution, which significantly limits their adaptability in challenging and potentially dynamic multi-agent environments. One of the more recent promising solutions to this issue, proposed in the artificial intelligence domain, is the heuristically accelerated reinforcement learning (HARL) approach. Its goal is to speed up RL algorithms by guiding the exploration process using additional heuristic information [11]. In [10], case-based reasoning is used for heuristic acceleration in a multi-agent RL algorithm to assess similarity between states of the environment and to make a guess at what action needs to be taken in a given state, based on the experience obtained in other similar states. In [11], Bianchi et al. prove the convergence of four multi-agent HARL algorithms and show how they outperform the regular RL algorithms. There appears to be no evidence in the literature of the HARL approach being applied in the wireless communications domain.

The purpose of this chapter is to alleviate the problem of poor temporal performance of RL based DSA algorithms and, thus, to improve their adaptability, by proposing a cognitive DSA scheme which combines distributed Q-learning and standardised inter-cell interference coordination (ICIC) signalling in LTE networks using a novel adaptation of the HARL framework. Furthermore, it is designed to comply with the current LTE standards and enables robust distributed machine intelligence to be easily implemented in current or future LTE releases.

In previous work on combining ICIC and RL, researchers have only considered applying RL to learning various parameters related to ICIC or radio resource management in Orthogonal Frequency-Division Multiple Access (OFDMA) cellular systems, such as LTE or WiMAX. For example, Simsek et al. [81] use RL to learn optimal cell range bias and power allocation strategies and compare them to static ICIC methods; Dirani and Altman [25] use a fuzzy Q-learning algorithm and ICIC to learn a coordinated power allocation strategy; and Vlacheas et al. [91] use a fuzzy RL principle for automatic tuning of the Relative Narrowband Transmit Power (RNTP) indicator, which is a key ICIC parameter in the LTE downlink. However, no evidence of previous work in the literature was found on using heuristic ICIC methods to enhance the performance of RL based DSA algorithms.

In document Accelerating Reinforcement Learning for Dynamic Spectrum Access in Cognitive Wireless Networks (Page 87-90)
