Q-Learning - MAPE-K Framework - Proposed Solution

Chapter 4 Application of Fuzzy Q-Learning

4.2 Proposed Solution

4.2.1 MAPE-K Framework

4.2.1.2 Q-Learning

Like any other Reinforcement Learning technique, Q-Learning uses rewards to learn and to make decisions on several policies. This is a fall under off- policy method where the target policy is not equal to the behavioural policy. In Q-Learning, it is derived from Markov Decision Process (MDP) and consists of three inputs. State, action and reward. Every state will have a different score depending on the steps taken toward the final policy and continuing with the learning update.

Figure 4.8 illustrates how reinforcement learning applied the mentioned parameters.

Figure 4.8. Q-Learning Framework [141]

The algorithm will have an iteration process. This is known as an episode to ensure that it will optimise with the alpha and gamma values:

a) γ = Gamma value in the range of 0 and 1. The lowest value will instruct Q- Learning to find the instance reward and ignore the total score of the accumulated reward.

b) α = Alpha value identical to gamma, which is a range between 0 and 1. In normal circumstances, the value is set low to optimise the algorithm, such as 0.2.

Definition 13: Q-Learning Update

. 𝑄(𝑠𝑡, 𝑎𝑡) ⟵ 𝑄(𝑠𝑡, 𝑎𝑡) + 𝑎[𝑟𝑡+1+ 𝛾𝑎𝑚𝑎𝑥 𝑄(𝑠𝑡+1, 𝑎) − 𝑄(𝑠𝑡, 𝑎𝑡)] Q-Learning Algorithm [131]

Initialize 𝑄0 (s, a) to random values Choose a starting point 𝑠0

While the policy is not good enough Choose at according to values 𝑄𝑡 (𝑠𝑡, . ) 𝑎𝑡 = 𝑓(𝑄𝑡(𝑠𝑡, . ))

Obtain in return: 𝑠𝑡 + 1 (𝑠′_{) and 𝑟𝑡} Update using Definition 13

End While

4.2.1.3 Fuzzy Q-Learning

This is the algorithm based on Q-Learning and the limitations of fuzzy, introduced by [131] and iteratively enhanced by [133], [130] and [134]. The constraint of fuzzy [147], is that it must be heavier that ‘W’, where in his case the ‘W’ refers to Weight. On the other hand, Fuzzy Q-Learning actions are able to process fuzzy constraints and proceed with the optimal policy. This feature is not available for the actions in Q-learning introduced by [131].

Fuzzy Q-Learning Algorithm – Basic [107] Observe the state x

For each rule, choose the actual consequence using some EEP

Compute the global consequence a(x) and its corresponding Q-value Q(x,a) Apply the action a(x). Let y be the new state

Receive the reinforcement r

Update the q-values using Definition 13.

Enhanced Fuzzy Q-Learning Algorithm [4-9] Require: 𝛾, 𝜂, ∈

1) Initialize q-values 𝑞[𝑖, 𝑗] = 0,1 < 𝑖, 1 < 𝑗 < 𝐽

2) select an action for each fired rule: 𝑎𝑖_{= 𝑎𝑟𝑔𝑚𝑎𝑥}

𝑘𝑞[𝑖, 𝑘]𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 1−∈

𝑎𝑖= 𝑟𝑎𝑛𝑑𝑜𝑚 {𝑎𝑘, 𝑘 = 1,2, … . , 𝐽}𝑤𝑖𝑡ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 ∈ 3) Calculate the control action by the fuzzy controller:

𝑎 = ∑𝑁𝑖=1𝜇𝑖(𝜒)𝑥 𝑎𝑖 , where 𝛼𝑖(𝑠)𝑖𝑠 𝑡ℎ𝑒 𝑓𝑖𝑟𝑖𝑛𝑔 𝑙𝑒𝑣𝑒𝑙 𝑜𝑓 𝑡ℎ𝑒 𝑟𝑢𝑙𝑒 𝑖

4) Approximate the Q function from the current q-values and the firing level of the rules:

𝑄(𝑠(𝑡), 𝑎) = ∑𝑁 𝛼(𝑠)

𝑖=1 𝑋 𝑞[𝑖, 𝑎𝑖], where 𝑄(𝑠(𝑡), 𝑎) is the value of the Q function for the state current state 𝑠(𝑡) in iteration 𝑡 and the action 𝑎.

6) Observe the reinforcement signal, 𝑟(𝑡 + 1) and compute the value for the new state: 𝑉(𝑠(𝑡 + 1)) = ∑𝑁𝑖=1𝛼𝑖(𝑠(𝑡 + 1)). 𝑚𝑎𝑥𝑘(𝑞|𝑖, 𝑞𝑘]).

7) Calculate the error signal:

∆𝑄 = 𝑟(𝑡 + 1) + 𝛾𝑥 𝑉𝑡(𝑠(𝑡 + 1)) − 𝑄(𝑠(𝑡), 𝑎) , where 𝛾 is the discount factor 8) Update q-values:

𝑞[𝑖, 𝑎𝑖] = 𝑞[𝑖, 𝑎𝑖] + 𝜂. Δ𝑄. 𝛼𝑖(𝑠(𝑡)) , where 𝜂 is the learning rate 9) Repeat the process for the new state until it converges

4.2.1.4 Architecture

The proposed architecture is available in Figure 3.7 and the components are mapped into a MAPE-K model in Figure 4.10. This architecture is the enhanced version of [108-110] with the introduction of an adaptive policy. 4.2.1.5 Algorithm

This section presents an approach which provides the overall algorithm of this research. The algorithm itself is segmented into phases and produces the results for further analysis.

Implementation of the Complete Algorithm Phase One.

1. Acceptance of the three QoS parameters (Latency, Availability and Packet loss).

Phase Two

1. Execution of QoS parameters from case study [139] (Pass, Low, Normal, High, Peak) to centre of weighted approach.

- Use scores for each performance - Use reward for Q-Learning algorithm - Populate the values for next phase Phase Three

1. Application of fuzzy Sugeno for membership functions and rule base

Phase Four

i. Comparison of three QoS parameters

a. Execution of QoS pair

i. Require: Latency, Availability (QoS) 1. 1st_Set

a. Require: Epsilon (Small =0.1), Lambda (Small =0.1) and Alpha (Small =0.1)