MDP, also known as stochastic dynamic programs, is a commonly used model for sequential decision making for stochastic problems with uncertain outcomes. An MDP model mainly consists of five elements: decision epochs, states, actions, transition probabilities and rewards. Choosing an action from the action set in a current state generates a reward. This reward, along with the transition probability, determines the state at the next decision epoch. Policies and strategies are prescriptions of which action to choose under any eventuality at every future decision epoch.
In this section, we explain how the handover decision problem in a heterogenous network can be formulated as an MDP and how value iteration algorithm (VIA) solves the MDP problem to obtain the optimal policy of selecting the appropriate candidate BS at each decision epoch.
Decision Epochs: The decision epoch is defined as a sequence of time π = 1, 2, ..., π which represents the time of succesive decision as shown in Fig.6-2.
State: In the proposed handoff decision algorithm, the state is defined as the target base station that the UE is going to be connected with. Hence the state sapce is presented as Eq.6.6:
Figure 6-2: Timing diagram of an Markov decision process
Action: For our proposed model, the action represents the selection of the appro- priate target base station that the UE is going to connect with at a particular decision epoch based on the reward function. The reward function is defined in the next para- graph. Based on the action chosen by the policy, the UE may stay connected to the current serving base station or choose another target base station in order to reduce energy consumption and handover delay. Hence, the action (π΄) is defined as selecting the appropriate candidate BS and is represented as π΄ β {πππ, πππ, ππ π, ππ π}, where πππ represents the action of staying with the macro BS, πππ represents the action of switching from macro BS to small BS, ππ π represents the action of switching from small to macro BS and ππ π represents the action of staying with small BS.
Our proposed model can be represented as Fig.6-3, where π πππ, π πππ, π ππ π and π ππ π represents the transition probability between macro-macro, macro-small, small-macro and small-small respectively.
Reward: For a state π β π and selected action π β π΄, the reward function π (π , π) is defined as Eq.6.7,
π (π , π) = π€πππ(π , π) + π€πππ(π , π) (6.7)
Where ππ(π , π) and ππ(π , π) are the power consumption reward function and delay reward function respectively; π€π and π€π are the respective weight factors where π€π+ π€π = 1. The power consumption and delay reward functions are defined as Eq.6.8 and Eq.6.9:
amf afm amm aff State 1: Macro State 2: Femto Prmf Prff Prfm Prmm
Figure 6-3: Proposed markov model
ππ(π , π) = ππππ₯β ππ ππππ₯β ππππ (6.8) ππ(π , π) = ππππ₯β ππ ππππ₯β ππππ (6.9) where ππ and ππ are the UE power consumption and handover delay of the BS selected by action π respectively; ππππ₯ is the maximum transmit power of the UE, ππππ₯ is the maximum tolerable time delay to transfer a call to the new BS, ππππ is the minimum transmit power required to meet the QOS, and ππππ is zero.
Transition Probability: For a current state is π β π and the chosen action is π β π΄, the probability that the next state would be π β² β π is denoted as transition probability π π[π β²|π , π]. We determine the percentage of coverage area [177] for each base station in order to calculate transition probability. It is noteworthy that, some locations within the coverage area of the BS will be below the desired received signal strength threshold due to random effects of shadowing. Hence the percentage of the coverage area gives us a good estimation of the transition probability. We start by
calculating the radius of the cells from link budget calculation for a mean received sig- nal strength (πΎ). If π π(π) is the received signal at the UE located at π radial distance from the BS, then the probability that this received signal exceeds the threshold πΎ is represented by the following eqaution Eq.6.10, where π is the pathloss exponent:
π π(π π(π) > πΎ) = π (οΈ πΎ β π π(π) π )οΈ = 1 2β 1 2πππ ( πΎ β π π(π) πβ2 ) = 1 2β 1 2πππ ( πΎ β [ππ‘β (π πΏ(π0) + 10ππππ(ππ0))] πβ2 ) (6.10)
Where π(.) is the Gaussian tail integral function and πππ (.) is the Gaussian error function. In order to determine the path loss as referenced to the cell boundary (where π = π ), we refer to the following equation Eq.6.11
π πΏ(π) = 10ππππ(π π0
) + 10ππππ(π
π ) + π πΏ(π0) (6.11)
Therefore Eq.6.10 may be expressed as Eq.6.12
π π(π π(π) > πΎ) = 1 2β 1 2πππ ( πΎ β [ππ‘β (π πΏ(π0) + 10ππππ(ππ 0) + 10ππππ( π π ))] πβ2 ) (6.12) If we let π = πΎβππ‘+π πΏ(π0)+10π πππ( π π0) πβ2 and π = 10π πππ π
πβ2 , then the percentage of coverage area with a received signal greater than or equal to πΎ is expressed as Eq.6.13
π (πΎ) = 1 2β 1 π 2 β«οΈ π 0 π πππ (π + π πππ π ) ππ (6.13)
By substituting π‘ = π + π πππ π in Eq.6.13 we get,
π (πΎ) = 1 2(1 β πππ (π) + ππ₯π( 1 β 2ππ π2 )[1 β πππ ( 1 β ππ π )]) (6.14)
the probability that the received signal strength will be greater than the threshold and the percentage of coverage area that would receive coverage above the threhold received signal respectively. Then from the percentage coverage area of macro BS (π΄π) and the percentage coverage area of small BS (π΄π), we can find the transi- tion probabilities between different cells. For instance, the probability of staying with macro BS, π πππ =
π΄πβπ΄π
π΄π ; the transition probability between macro-small BS, π πππ =
π΄π
π΄π; the transition probability between small-macro BS, π ππ π =
π΄πβπ΄π
π΄π ; and the probability of staying with small BS, π ππ π =
π΄π
π΄π.
Optimality Equations and Value Iteration Algorithm: Let us now denote
π (π ) as the maximum expected total reward for an initial state π , then the optimality equations are given by Eq. 6.15 as follows:
π (π ) = πππ₯πβπ΄{π (π , π) + βοΈ
π β²βπ
ππ π[π β²|π , π]π (π β²)} (6.15)
The solution of the optimality equation correspond to the maximum expected total reward π (π ) and the MDP optimal policy π(π ). This MDP optimal policy π(π ) indicates the decision of selecting the appropriate candidate cell. Value iteration algorithm (VIA) [159] is used to solve this optimization problem. The VIA is shown in algorithm 3 where π is an error function.
Algorithm 3 Value Iteration Algorithm (VIA)
1: Set π0(π β²) = 0 for each state π . Specify π > 0, and set π = 0. 2: For each state π , compute ππ+1(π ) by
ππ+1(π ) = πππ₯ πβπ΄{π (π , π) + βοΈ π β²βπππ π[π β²|π , π]ππ(π β²)} 3: πΏ = πππ₯(ππ+1(π ) β ππ(π )) 4: If πΏ < π1βπ2π , go to step 5.
otherwise increase π by 1 and return to step 2. 5: Output a stationary optimal policy, π, such that
π(π ) = ππππππ₯πβπ΄{π (π , π) +βοΈπ β²βπππ π[π β²|π , π]ππ+1(π β²)} and stop
6.3.1
Handover decision process
The handover decision process involves three main phases: 1) Base station discovery; 2) Base station analysis; and 3) Base station selection for the handoff execution.
β Base station Discovery: During the Base station(BS) discovery phase, the user equipment determines which BS is available for handover. In our proposed method the advertised information from the candidate BS is the required trans- mit power of UE for target BS and delay needed to be connected to the target BS. Please note that the transmit power is determined from a target ππΌπ π , which ensures that the service meets the QOS requirement. The traffic loads in the BS may also change with time and therefore the BSs are monitored after a particular period of time,π π π . We refer to this time as decision epochs as mentioned earlier. At each decision epoch the UE determines its state based on the maximum reward function for the candidate BS.
β Base Station Analysis: In this stage, the expected total reward and transition probability for all of the target BSs are determined using the above mentioned equations.
β Base Station Selection: Finally by executing all the steps of VIA, we get the optimal policy for handoff execution. A candidate BS which has highest corresponding value of the total expected reward (TER) is selected as the best BS to handoff. Therefore for our environment we can select the small BS if π πΈπ π ππππ > π πΈπ π ππππ. Fig.6-4 shows a sample sequence of states and actions taken by optimal policy for providing maximum expected reward, over a sample time period. According to this sample policy, at first the UE will be connected to the macro BS (state π1), then will take action πππ and will handover to the small BS (state π2). At the next decision epoch the optimal policy will take action πππ and let the UE stay at state π2 followed by taking action ππ π and switching to state π1 at the following decision epoch.
Macro Macro Macro
Femto Femto Femto
Time amm amf afm aff amm amm amf amf afm afm aff aff S1 S1 S1 S2 S2 S2 Macro S1 Femto S2
Figure 6-4: A sample of states and actions taken by optimal policy
6.3.2
Avoiding Ping Pong effect
We might face a problem while executing handover, namely Ping-pong effect, where the UE can jump between the macro and small BSs very frequently. In order to get rid of this problem we introduce hysteresis time ππ . Therefore if a BS offers maximum expected reward for a time not less than the hysteresis time ππ , only then the UE would be handed over to it. That way we can avoid the frequent occurance of handover.