RL for EH Communications (unknown underlying model)


4.4 RL for EH Communications (unknown underlying model)

This section provides a solution for the second scenario, in which RL is used to handle the lack of knowledge about the channel gain and EH processes. The SARSA learning algorithm is used to evaluate different actions. The performance of the proposed model is investigated using two different exploration algorithms: the convergence-based algorithm and the ε-greedy algorithm.

4.4.1 RL prediction methods

In this work, SARSA and Q-learning are used to predict the action-value function for different state-action pairs. SARSA is an on-policy updating strategy, which attempts to evaluate the policy that is used to make decisions. On the other hand, Q-learning is an off-policy method, where the action-value function is estimated for a policy (the greedy one) that is independent of the policy used to select actions [4].

Updating in SARSA works as follows. Starting from time slot i, let the agent be at state s, and let a be the action selected according to the current policy π. Based on the selected action, the agent moves to the next state s′ and receives a reward r(s, a, s′). Using a policy derived from Q(s, a) (e.g., the ε-greedy algorithm), an action a′ is selected for the next state s′. At this point, the estimate of the action-value function, Q(s, a), is updated using the gained experience. The updating equation in SARSA is given by [4]

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a, s') + \gamma\, Q(s', a') - Q(s, a) \right] \tag{4.15}$$

Using Q-learning, actions are assigned as follows. At the current state, actions are selected according to a policy derived from Q(s, a) (e.g., the ε-greedy algorithm), while the greedy action is assigned to the next state s′. The updating equation in Q-learning is given by [4]

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a, s') + \gamma \max_{b} Q(s', b) - Q(s, a) \right] \tag{4.16}$$

where 0 < α < 1 is the learning rate and γ is the discount factor. The learning rate determines how much the newly acquired information contributes to updating the action-value function. In the limiting case α = 0, the agent would not learn anything from the acquired information. At the other extreme, α = 1, the agent would consider only the newly acquired information [89].
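The two update rules in (4.15) and (4.16) differ only in the bootstrap term, which the following minimal Python sketch makes explicit. The function names, the dictionary-based Q table, and the toy states, actions, and numbers are illustrative assumptions and are not taken from this work.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step as in (4.15): bootstrap on the action actually chosen for s_next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning step as in (4.16): bootstrap on the greedy action in s_next."""
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


def make_table():
    """Hypothetical Q table: unseen pairs default to 0, two entries preset for s1."""
    Q = defaultdict(float)
    Q[("s1", "low")] = 1.0
    Q[("s1", "high")] = 3.0
    return Q


# Same transition (s0, low) -> s1 with reward 2, evaluated with both rules.
sarsa_value = sarsa_update(make_table(), "s0", "low", r=2.0,
                           s_next="s1", a_next="low", alpha=0.5, gamma=0.9)
q_value = q_learning_update(make_table(), "s0", "low", r=2.0,
                            s_next="s1", actions=["low", "high"],
                            alpha=0.5, gamma=0.9)
print(sarsa_value, q_value)
```

In this toy transition the SARSA target uses Q(s1, low) because that is the action the policy picked, whereas the Q-learning target uses the larger Q(s1, high), so the two estimates differ after a single step.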

4.4.2 RL exploration algorithms

This part discusses two exploration algorithms for RL that deal with the case where knowledge about the underlying model is unavailable. Exploration algorithms play an essential role in RL: they balance exploration and exploitation in order to maximize the cumulative reward. Exploitation means using the currently available knowledge to select the best known policy. Exploration, on the other hand, means investigating new policies in the hope of finding one that is better than the current best [4].

4.4.2.1 The ε-greedy algorithm

This algorithm [53] uses the exploration probability ε to balance the exploration and exploitation modes. At each time slot, the value of ε determines which mode is more likely to be used.

In this algorithm, the current best action is selected with probability 1 − ε, and a random non-greedy action is selected with probability ε. The value of ε can be either fixed [4] or adapted during the learning time [36]. In adaptive ε-greedy, ε takes values that change with time. For example, in [36], ε is set to e^{−0.1i}, where i is the time slot number. In this case, the exploration probability is large at the beginning of the session, which favors exploration. As time increases, the probability of exploration decreases and the probability of exploitation increases. This favors exploitation toward the end of the session, when most of the policies have been explored and it is preferable to exploit the best known policy.
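As a minimal sketch of the selection rule described above, the following Python snippet combines the decaying schedule ε = e^{−0.1i} with a choice between the greedy action and a uniformly drawn non-greedy action. The function names, the `decay` parameter, and the example action values are assumptions made for illustration only.

```python
import math
import random


def epsilon_schedule(i, decay=0.1):
    """Adaptive exploration probability exp(-decay * i) for time slot i."""
    return math.exp(-decay * i)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon,
    otherwise a uniformly random non-greedy action.

    q_values maps each action to its current estimate."""
    greedy = max(q_values, key=q_values.get)
    others = [a for a in q_values if a != greedy]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return greedy


# Hypothetical action-value estimates for one state.
q_values = {"low": 0.2, "mid": 0.9, "high": 0.5}
rng = random.Random(0)
for i in (0, 10, 50):
    eps = epsilon_schedule(i)
    print(i, round(eps, 4), epsilon_greedy(q_values, eps, rng))
```

At i = 0 the schedule gives ε = 1, so the first selection is always exploratory; by i = 50 the exploration probability is below 1%, and the greedy action is chosen almost surely.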

4.4.2.2 The convergence-based algorithm

This part presents our exploration algorithm. It uses two parameters to balance exploration and exploitation. The first parameter is the action-value function convergence error ζ. The same action at a state is selected for a number of iterations until the estimated value of this state-action pair converges with an error less than or equal to ζ. The second parameter is the exploration time threshold τ. This parameter controls the exploration process: the agent can explore different actions for a duration τ out of the total available time T, after which it is forced to exploit the best available policy πbest during the remaining time [90; 91].

In this algorithm, the first step is to assign random feasible actions to all available states. Then, for each visited state, the same action is selected repeatedly until its estimated value converges, as determined by ζ. Once the estimated value of a state-action pair has converged with an error less than or equal to ζ, a new action is drawn uniformly at random from the unexplored actions of that state. This mechanism continues for all states and stops in two cases. The first case occurs if all available actions for a state s are evaluated before reaching τ; from then on, the action with the best value, πbest(s), is exploited. The second case occurs when the elapsed time reaches τ; the agent then suspends exploration and starts exploiting the best available policy πbest, regardless of whether all available actions have been explored.

Using SARSA with the convergence-based algorithm, an action for the next state s′ is selected according to the current policy π

$$p^{Tx\prime} \leftarrow \pi(s') \tag{4.17}$$

and, when Q-learning is combined with the convergence-based algorithm, an action is assigned to the next state s′ according to

$$p^{Tx\prime} \leftarrow \arg\max_{a} Q(s', a) \tag{4.18}$$

One of the main advantages of the convergence-based algorithm is that once an action at a state has been evaluated, and its action-value function has converged to an unfavorable value, this action will not be exploited in the future. This is an important property that contributes to discarding actions that may reduce the cumulative reward in the future. One more characteristic is that it assigns dynamic evaluation time for different actions at different states. This evaluation time depends on the required time by the estimated action-value function to converge for each state-action pair. Algorithm 4.2 summarizes the proposed algorithm.

Algorithm 4.2 Convergence-based Algorithm for estimating π∗

1: Initialize Q_0(s, p^{Tx}), ∀s ∈ S, ∀p^{Tx} ∈ P_s^{Tx}, arbitrarily
2: Initialize the action-value convergence error ζ, the exploration time threshold τ, and the learning rate α
3: Initialize Q_best(s) = −∞, ∀s ∈ S
4: Initialize the policy π and the current best policy π_best by random actions ∀s ∈ S
5: for each s ∈ S do
6:     π_best(s), π(s) ← ϱ
7:     P_s^{Tx} ← P_s^{Tx} − ϱ
8: end for
9: for each step i of episode do
10:     Observe current state S
11:     Select action P^{Tx} for state S according to the policy π (i.e., P^{Tx} ← π(S))
12:     Observe the immediate reward r(S, P^{Tx}) and next state S′
13:     Predict Q(S, P^{Tx}) using a prediction method (e.g., SARSA or Q-learning)
14:     if |Q_i(S, P^{Tx}) − Q_{i−1}(S, P^{Tx})| ≤ ζ AND i < τ then
15:         if Q_i(S, P^{Tx}) ≥ Q_best(S) then
16:             Q_best(S) ← Q_i(S, P^{Tx})
17:             π_best(S) ← P^{Tx}
18:         end if
19:         if P_S^{Tx} ≠ ∅ then
20:             Update π by selecting a new random action ϱ ∈ P_S^{Tx} for state S
                π(S) ← ϱ
                P_S^{Tx} ← P_S^{Tx} − ϱ
21:         else
22:             π(S) ← π_best(S)
23:         end if
24:     else if i ≥ τ then
25:         π ← π_best
26:     end if
27:     S ← S′
28: end for
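The steps of Algorithm 4.2 can be sketched in Python as follows, using the SARSA prediction step with the next action taken from the current policy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function `convergence_based_sarsa`, the callback `env_step`, the zero initialization of Q, and the two-state toy environment (reward proportional to a power-level index, states alternating deterministically) are all hypothetical.

```python
import random


def convergence_based_sarsa(states, actions, env_step, zeta, tau, total_steps,
                            alpha=0.5, gamma=0.2, seed=0):
    """Sketch of the convergence-based exploration loop with SARSA updates."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}    # step 1 (zeros assumed)
    q_best = {s: float("-inf") for s in states}           # step 3
    unexplored = {s: list(actions) for s in states}       # per-state action sets
    policy, best = {}, {}
    for s in states:                                      # steps 5-8
        a = unexplored[s].pop(rng.randrange(len(unexplored[s])))
        policy[s] = best[s] = a
    s = rng.choice(states)
    for i in range(total_steps):                          # steps 9-28
        a = policy[s]                                     # step 11
        r, s_next = env_step(s, a, rng)                   # step 12
        old = Q[(s, a)]
        # Step 13, SARSA: bootstrap on the action the current policy gives s_next.
        Q[(s, a)] = old + alpha * (r + gamma * Q[(s_next, policy[s_next])] - old)
        if abs(Q[(s, a)] - old) <= zeta and i < tau:      # step 14
            if Q[(s, a)] >= q_best[s]:                    # steps 15-18
                q_best[s], best[s] = Q[(s, a)], a
            if unexplored[s]:                             # steps 19-20
                policy[s] = unexplored[s].pop(rng.randrange(len(unexplored[s])))
            else:                                         # step 22
                policy[s] = best[s]
        elif i >= tau:                                    # steps 24-25
            policy = dict(best)
        s = s_next                                        # step 27
    return best, Q


def toy_env(state, action, rng):
    """Hypothetical environment: reward grows with the power-level index,
    and the two states alternate deterministically (rng is unused here)."""
    return 10.0 * action, 1 - state


best_policy, q_table = convergence_based_sarsa(
    states=[0, 1], actions=[0, 1, 2], env_step=toy_env,
    zeta=0.01, tau=500, total_steps=600)
print(best_policy)
```

In this toy setting the reward gap between actions is larger than the discounted bootstrap term can ever be, so each action's converged estimate falls in a separate range and the highest power level is retained as the best action in both states. In a stochastic environment the single-step convergence test is noisier, and ζ and τ would need to be chosen accordingly.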

In document Enhancing the performance of energy harvesting wireless communications using optimization and machine learning (Page 57-62)