Reinforcement Learning for Resource Allocation
and Time Series Tools for Mobility Prediction
Baptiste Lefebvre (1,2), Stephane Senecal (2) and Jean-Marc Kelif (2)
(1) École Normale Supérieure (ENS), Paris, France
(2) Orange Labs, Issy-les-Moulineaux, France
[email protected], [email protected]
First GdR MaDICS Workshop on Big Data for the 5G RAN 25 November 2015 @ Huawei FRC
Context | Current Controller | Proposed Controller | Mobility Prediction | Conclusion
Agenda
1. Context
2. Current Controller
3. Proposed Controller
4. Mobility Prediction
5. Conclusion
1. Context
Wireless Networks
UE = User Equipment
Radio Resource Management (RRM)
[Figure: time-frequency resource grid; one PRB spans one slot (0.5 ms) and 12 subcarriers (180 kHz)]
Allocation: sharing of joint time slots and frequency bands
Load: $\rho = \sum_{c=1}^{C} \frac{T}{\frac{r}{R} D_c} \, n_c$
Quality of Service (QoS): $\mathrm{QoS} = 1 - \left(\tfrac{1}{2}\right)^{\rho}$
Energy/Power Consumption: $P = P_{BS} + r \, P_{RS} + \tilde{\rho} \, P_{AP}$, with $\tilde{\rho} = \min(\rho, 1)$
PRB: Physical Resource Block
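As a rough numerical illustration of the load and power expressions on this slide, the sketch below assumes the reading rho = sum_c n_c * T / ((r / R) * D_c) and P = P_BS + r * P_RS + min(rho, 1) * P_AP. The interpretation of the symbols (n_c UEs in radio condition c, rate D_c, target throughput T, r active resources out of R) and all numerical values are assumptions made for the example.

```python
def load(n, D, T, r, R):
    """Load rho = sum over classes c of n_c * T / ((r / R) * D_c)."""
    return sum(n_c * T / ((r / R) * D_c) for n_c, D_c in zip(n, D))


def power(rho, r, P_BS, P_RS, P_AP):
    """Power P = P_BS + r * P_RS + min(rho, 1) * P_AP."""
    rho_tilde = min(rho, 1.0)  # the load is capped at 1 in the power model
    return P_BS + r * P_RS + rho_tilde * P_AP


# Hypothetical cell: 2 UEs with rate 10, 1 UE with rate 5, target rate 1,
# r = 2 active resources out of R = 4. Power values are made up.
rho = load(n=[2, 1], D=[10.0, 5.0], T=1.0, r=2, R=4)
p = power(rho, r=2, P_BS=100.0, P_RS=50.0, P_AP=20.0)
print(rho, p)
```

Doubling r in this example halves the load, which is the lever the controller acts on.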
Goal: optimization of the energy consumption under QoS constraints
Formal framework considered: reinforcement learning [SB98]
More specifically, Markov Decision Processes (MDP) [Put94]:
• A system state enumerates the UEs in each radio condition and the active resources
• An action is either null, a deactivation, or an activation of a resource
• A policy associates with every state an action to perform
• To achieve energy savings, one needs to compute or estimate an optimal policy, i.e. a policy that implements a good trade-off between energy (electrical power) consumption and the targeted QoS level
2. Current Controller
MDP Controller
• A controller executes a policy Π which, for a given traffic amount, aims at maximizing an objective function (QoS, power)
Transition Probability Operator: $P(s, a, s')$
Instantaneous Reward Function: $R(s, a)$
• Searching for an optimal policy for a fully known MDP model can be performed by dynamic programming
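Controller for Geometric Criterion
$$\max_{\Pi} \left( \mathbb{E}\left[ \sum_{t=0}^{\infty} \phi^{t} R(s_t, \Pi(s_t)) \,\middle|\, s_0 = s \right] \right)$$
• Solving a system of equations by iterating until a fixed point is reached (geometric criterion):
$$\Pi(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s, a, s') \left( R(s, a) + \phi V(s') \right)$$
$$V(s) = \sum_{s' \in S} P(s, \Pi(s), s') \left( R(s, \Pi(s)) + \phi V(s') \right)$$
Parameter $\phi \in [0, 1)$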
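The fixed-point iteration for the geometric (discounted) criterion can be sketched with value iteration, one standard dynamic programming scheme; whether the authors iterate on values or on policies is not stated, so this choice is an assumption. The two-state MDP, its state and action names and its rewards are hypothetical.

```python
def q_value(s, a, P, R, V, phi):
    """Expected return of action a in state s: sum_s' P(s,a,s') * (R(s,a) + phi * V(s'))."""
    return sum(p * (R[s][a] + phi * V[s2]) for s2, p in P[s][a].items())


def value_iteration(states, actions, P, R, phi=0.9, tol=1e-9, max_iter=10_000):
    """Iterate V <- max_a Q(s, a) until the update is smaller than tol."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        V_new = {s: max(q_value(s, a, P, R, V, phi) for a in actions) for s in states}
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = {s: max(actions, key=lambda a: q_value(s, a, P, R, V, phi)) for s in states}
    return V, policy


# Hypothetical toy MDP: staying in "high" pays 1, everything else pays 0.
states = ["low", "high"]
actions = ["stay", "switch"]
P = {
    "low": {"stay": {"low": 1.0}, "switch": {"high": 1.0}},
    "high": {"stay": {"high": 1.0}, "switch": {"low": 1.0}},
}
R = {"low": {"stay": 0.0, "switch": 0.0}, "high": {"stay": 1.0, "switch": 0.0}}
V, policy = value_iteration(states, actions, P, R, phi=0.9)
print(policy)
```

Because phi < 1, the update is a contraction, which is why the iteration reaches a fixed point from any starting values.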
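Controller for Average Criterion
$$\max_{\Pi} \left( \lim_{T \to \infty} \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T} R(s_t, \Pi(s_t)) \,\middle|\, s_0 = s \right] \right)$$
• Solving a system of equations by iterating until a fixed point is reached (average criterion):
$$\Pi(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s, a, s') \left( R(s, a) + V(s') \right)$$
$$V(s) = \sum_{s' \in S} P(s, \Pi(s), s') \left( R(s, \Pi(s)) + V(s') \right)$$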
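Without a discount factor the values V(s) grow without bound under plain iteration, so a common remedy is relative value iteration, which subtracts the value of a reference state at each sweep. The sketch below uses that variant as an assumption; it is not necessarily the scheme used by the authors. The toy MDP and the names `ref`, `gain` and `h` are hypothetical.

```python
def relative_value_iteration(states, actions, P, R, ref, tol=1e-9, max_iter=100_000):
    """Return (gain, h, policy): average reward, relative values, greedy policy."""
    h = {s: 0.0 for s in states}
    gain = 0.0

    def q(s, a):
        # Undiscounted backup: sum_s' P(s,a,s') * (R(s,a) + h(s'))
        return sum(p * (R[s][a] + h[s2]) for s2, p in P[s][a].items())

    for _ in range(max_iter):
        backup = {s: max(q(s, a) for a in actions) for s in states}
        gain = backup[ref]  # estimate of the optimal average reward
        h_new = {s: backup[s] - gain for s in states}  # keep the iterates bounded
        delta = max(abs(h_new[s] - h[s]) for s in states)
        h = h_new
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return gain, h, policy


# Hypothetical toy MDP: staying in "high" pays 1, everything else pays 0.
states = ["low", "high"]
actions = ["stay", "switch"]
P = {
    "low": {"stay": {"low": 1.0}, "switch": {"high": 1.0}},
    "high": {"stay": {"high": 1.0}, "switch": {"low": 1.0}},
}
R = {"low": {"stay": 0.0, "switch": 0.0}, "high": {"stay": 1.0, "switch": 0.0}}
gain, h, policy = relative_value_iteration(states, actions, P, R, ref="low")
print(gain, policy)
```

Convergence of this variant generally requires an aperiodic chain under the policies of interest; the toy example satisfies this through its self-loops.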
States Transitions and Rewards
• The system evolves in continuous time and not in discrete time
• It is possible to turn a continuous-time MDP into a discrete-time MDP via uniformization and discretization schemes
• $P(s, a, s')$ is replaced by $Q(s, a, s')$, which denotes the transition rate (i.e. the Poisson process parameter)
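Uniformization can be sketched as follows for a single fixed action: pick a constant Lambda at least as large as every total exit rate, divide the rates by Lambda, and put the remaining probability mass on a self-loop. The three-state chain and its rates are made up for the example, and the function name `uniformize` is hypothetical.

```python
def uniformize(rates):
    """Turn transition rates into transition probabilities.

    rates[s] is a dict {s_next: rate} with s_next != s, for one fixed action.
    Returns (Lambda, P) where P[s] is a dict {s_next: probability}.
    """
    # Uniformization constant: the largest total exit rate over all states.
    Lambda = max(sum(out.values()) for out in rates.values())
    P = {}
    for s, out in rates.items():
        P[s] = {s2: q / Lambda for s2, q in out.items()}
        # Fictitious self-transition absorbs the remaining probability mass.
        P[s][s] = 1.0 - sum(out.values()) / Lambda
    return Lambda, P


# Hypothetical birth-death chain: arrivals at rate 2, departures at rate 3.
rates = {
    0: {1: 2.0},
    1: {2: 2.0, 0: 3.0},
    2: {1: 3.0},
}
Lambda, P = uniformize(rates)
print(Lambda, P)
```

In an MDP the constant has to dominate the exit rates over all states and all actions, so that the same time step applies whatever the controller decides.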
States Transitions Modeling
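$$Q\big((n, r), a, (n', r')\big) = \begin{cases} \lambda_i & \text{if } B_{\lambda_i}(s, a, s') \\ n_i \, \frac{1}{n} \, \frac{r}{R} \, \frac{D_i}{F_i} & \text{if } B_{\mu_i}(s, a, s') \\ 0 & \text{else} \end{cases}$$
$$B_{\lambda_i}(s, a, s') = \big( n' = n + e(i) \big) \wedge \big( r' = r + a \big)$$
$$B_{\mu_i}(s, a, s') = \big( n' = n - e(i) \big) \wedge \big( r' = r + a \big)$$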
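The transition rates of this model can be enumerated with a short function. The sketch assumes the reading "arrival of a class-i UE at rate lambda_i, departure of a class-i UE at rate n_i * (1/n) * (r/R) * D_i / F_i, with n the total number of UEs", and it ignores any bound on the number of UEs or on r since none is stated on this slide. All parameter values are hypothetical.

```python
def transition_rates(n, r, a, lam, D, F, R):
    """Return {(n_next, r_next): rate} from state (n, r) under action a.

    n is a tuple of UE counts per radio condition; a is -1, 0 or +1.
    """
    rates = {}
    r_next = r + a
    total = sum(n)  # total number of UEs sharing the active resources
    for i in range(len(n)):
        # Arrival of one UE of class i: n' = n + e(i)
        up = tuple(n_j + (j == i) for j, n_j in enumerate(n))
        rates[(up, r_next)] = lam[i]
        # Departure of one UE of class i: n' = n - e(i)
        if n[i] > 0:
            down = tuple(n_j - (j == i) for j, n_j in enumerate(n))
            rates[(down, r_next)] = n[i] * (1 / total) * (r / R) * D[i] / F[i]
    return rates


# Hypothetical state: 2 UEs of class 0 and 1 UE of class 1, r = 2 of R = 4.
rates = transition_rates(
    n=(2, 1), r=2, a=1, lam=(0.5, 0.25), D=(10.0, 5.0), F=(4.0, 4.0), R=4
)
for target, rate in rates.items():
    print(target, rate)
```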
Rewards - Costs Functions
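$$C(s, a) = \sum_{\substack{s' \in S \\ Q(s, a, s') \neq 0}} \Big( \gamma \, E(n', r + a) + (1 - \gamma) \, F(n', r + a) \Big)$$
$$E(n, r) = \begin{cases} \dfrac{P_{BS} + r \, P_{RS}}{P_{BS} + R \, (P_{RS} + P_{AP})} & \text{if } n = 0 \\[2ex] \dfrac{P_{BS} + r \, (P_{RS} + P_{AP})}{P_{BS} + R \, (P_{RS} + P_{AP})} & \text{else} \end{cases}$$
$$F(n, r) = 1 - \exp\left( - \log(2) \, \frac{T}{\frac{r}{R \, n}} \, \frac{\sum_{i=1}^{C} n_i}{\sum_{i=1}^{C} D_i \, n_i} \right)$$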
Current Results
• The optimal policy is a threshold policy
• The optimal policy depends on traffic volume, on target throughput and on cell capacity
• Executing the optimal policy enables energy savings on the order of 40%
• Proposal to take activation time into account by adding a timer
Optimization under Congestion
Under congestion, the controller does not activate all available resources, although doing so would reduce congestion as fast as possible
Unused Resources
Excessive QoS
The controller can deliver an effective QoS level much higher than the initially targeted QoS level (e.g. 50 kbps → 400 kbps)
3. Proposed Controller
States Transitions Modeling
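$$Q(s, a, s') = \begin{cases} \lambda_i & \text{if } B_{\lambda_i}(s, a, s') \\ n_i \, \frac{1}{n} \, \frac{r + a}{R} \, \frac{D_i}{F_i} & \text{if } B_{\mu_i}(s, a, s') \wedge \neg B(s, a, s') \\ n_i \, \frac{1}{n} \, \frac{r}{R} \, \frac{D_i}{F_i} & \text{if } B_{\mu_i}(s, a, s') \wedge B(s, a, s') \\ 0 & \text{else} \end{cases}$$
$$B_{\lambda_i}\big((n, r), a, (n', r')\big) = \big( n' = n + e(i) \vee (n' = n \wedge n = N) \big) \wedge \big( r' = r + a \vee B(r', a, r) \big)$$
$$B_{\mu_i}\big((n, r), a, (n', r')\big) = \big( n' = n - e(i) \big) \wedge \big( r' = r + a \vee B(r', a, r) \big)$$
$$B(r', a, r) = (r' = r = 1 \wedge a = -1) \vee (r' = r = R \wedge a = 1)$$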
Ideal and Effective Power Consumption
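• Ideal Power Consumption:
$$P^{*}(n) = \begin{cases} P_{BS} + P_{RS} & \text{if } \alpha(n) = 0 \\ P_{BS} + \lceil \alpha(n) \rceil \, P_{RS} + \alpha(n) \, P_{AP} & \text{else} \end{cases}$$
• Ideal Number of Resources (obtained by solving the equation $F(n, r) = \tfrac{1}{2} = \beta$):
$$\alpha(n) = \min\left( \sum_{i=1}^{C} n_i \, \frac{T}{D_i} \, R, \; R \right)$$
• Effective Power Consumption:
$$\hat{P}(n, r) = \begin{cases} P_{BS} + r \, P_{RS} & \text{if } n = 0 \\ P_{BS} + r \, P_{RS} + r \, P_{AP} & \text{else} \end{cases}$$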
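The three quantities on this slide translate directly into code. The sketch below is a minimal transcription in which the condition "n = 0" is read as "no UE in the cell"; the power values and traffic parameters are hypothetical.

```python
import math


def alpha(n, D, T, R):
    """Ideal number of resources: min(sum_i n_i * (T / D_i) * R, R)."""
    return min(sum(n_i * T / D_i for n_i, D_i in zip(n, D)) * R, R)


def ideal_power(n, D, T, R, P_BS, P_RS, P_AP):
    """Ideal power P*(n), using the ceiling of alpha(n) for the resource term."""
    a = alpha(n, D, T, R)
    if a == 0:
        return P_BS + P_RS
    return P_BS + math.ceil(a) * P_RS + a * P_AP


def effective_power(n, r, P_BS, P_RS, P_AP):
    """Effective power P^(n, r) with r active resources."""
    if sum(n) == 0:  # assumption: "n = 0" means an empty cell
        return P_BS + r * P_RS
    return P_BS + r * P_RS + r * P_AP


# Hypothetical values: 2 UEs with rate 10, 1 UE with rate 5, target rate 1.
params = dict(P_BS=100.0, P_RS=50.0, P_AP=20.0)
a = alpha((2, 1), (10.0, 5.0), T=1.0, R=4)
p_star = ideal_power((2, 1), (10.0, 5.0), T=1.0, R=4, **params)
p_hat = effective_power((2, 1), r=3, **params)
print(a, p_star, p_hat)
```

The gap between `p_hat` and `p_star` is what the proposed controller tries to drive toward zero.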
Power Consumption Error Modeling
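• Normalized Regret:
$$E\big((n, r), a\big) = \begin{cases} \dfrac{\hat{P}(n, r) - P^{*}(n)}{R \, (P_{RS} + P_{AP})} & \text{if } B(r, a) \\[2ex] \dfrac{\hat{P}(n, r + a) - P^{*}(n)}{R \, (P_{RS} + P_{AP})} & \text{else} \end{cases}$$
$$B(r, a) = (r = 1 \wedge a = -1) \vee (r = R \wedge a = 1)$$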
Rewards - Costs Functions
• Symmetrical Instantaneous Reward:
$$R(s, a) = - \left| E(s, a) \right|$$
• Asymmetrical Instantaneous Reward:
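The normalized regret and the symmetrical reward built from it can be sketched as below; only the symmetrical case is covered. The callables `p_hat` and `p_star` stand for the effective and ideal power consumptions, and the stand-in lambdas and numerical values are hypothetical.

```python
def normalized_regret(p_hat, p_star, n, r, a, R, P_RS, P_AP):
    """Regret E((n, r), a) between effective and ideal power, normalized.

    p_hat(n, r) and p_star(n) are callables supplied by the caller.
    """
    # B(r, a): the action would leave the range [1, R], so r is unchanged.
    blocked = (r == 1 and a == -1) or (r == R and a == 1)
    r_eff = r if blocked else r + a
    return (p_hat(n, r_eff) - p_star(n)) / (R * (P_RS + P_AP))


def symmetric_reward(regret):
    """Symmetrical instantaneous reward R(s, a) = -|E(s, a)|."""
    return -abs(regret)


# Hypothetical stand-ins for the power models (P_BS=100, P_RS=50, P_AP=20).
p_hat = lambda n, r: 100.0 + r * (70.0 if sum(n) else 50.0)
p_star = lambda n: 232.0

e_down = normalized_regret(p_hat, p_star, (2, 1), r=3, a=-1, R=4, P_RS=50.0, P_AP=20.0)
e_edge = normalized_regret(p_hat, p_star, (2, 1), r=4, a=1, R=4, P_RS=50.0, P_AP=20.0)
print(symmetric_reward(e_down), symmetric_reward(e_edge))
```

A symmetrical reward penalizes over-provisioning and under-provisioning equally, since only the magnitude of the regret matters.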
Overall Performance
        current controller                  proposed controller
β       γ       q̂_0.01   q̂_0.5    q̂_0.99    θ       q̂_0.01   q̂_0.5    q̂_0.99
1/2     0.604   −0.98    +0.40    +0.80     1       −0.02    +0.00    +0.02
1/2     0.5     +0.21    +0.52    +0.84     1e−4    −0.02    +0.00    +0.02
1/2     0.4     +0.23    +0.54    +0.85     1e−8    −0.02    +0.00    +0.02
3/4     0.604   −0.98    +0.33    +0.60     1       −0.08    +0.02    +0.04
3/4     0.5     +0.21    +0.44    +0.66     1e−2    −0.04    +0.02    +0.08
3/4     0.4     +0.22    +0.47    +0.70     1e−4    +0.00    +0.06    +0.12
9/10    0.604   −0.98    +0.03    +0.41     1       −0.34    −0.02    +0.32
9/10    0.5     −0.15     0.21    +0.50     55e−3   −0.13    +0.21    +0.44
9/10    0.4     −0.09    +0.27    +0.54     3e−3    +0.00    +0.35    +0.54
4. Mobility Prediction
Mobility
• Traffic due to arrivals and departures of UEs in the coverage zone of the BS, modeled by Poisson processes
• Moves of UEs inducing propagation losses, shadowing and [...]
Problem Statement
• The activation/deactivation time frame of a physical resource is not taken into account in the modeling
• Idea: predict the states to be visited in the next few seconds
• This approach makes it possible to consider mobile users
• Given SINR traces of users who crossed the cell and the SINR trace of a user currently crossing the cell, we aim at estimating the SINR to be measured in the near future
Problem Modeling
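• Let $\mathcal{T} = \{T_1, \dots, T_K\}$ denote a set of time series
• Let $T_1 = \langle t_{1,1}, \dots, t_{1,N_1} \rangle$ denote a time series
• ...
• Let $T_K = \langle t_{K,1}, \dots, t_{K,N_K} \rangle$ denote a time series
• Let $T = \langle t_1, \dots, t_N \rangle$ denote a time series to be completed
$\hat{t}_{N+1} = f(T)$
$\hat{t}_{N+1} = g(T)$ with $T_k \sim D$
$\hat{t}_{N+1} = h(T)$ with $T_k \sim D = \{D_1, \dots, D_M\}$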
Dynamic Time Warping (DTW)
• Let $T = \langle t_1, \dots, t_N \rangle$ denote a time series
• Let $T' = \langle t'_1, \dots, t'_{N'} \rangle$ denote another time series
• Let $d$ denote a distance measure between elements of these time series
$$D(t_i, t'_j) = d(t_i, t'_j) + \min\big\{ D(t_{i-1}, t'_{j-1}), \; D(t_{i-1}, t'_j), \; D(t_i, t'_{j-1}) \big\}$$
$$\mathrm{DTW}(T, T') = D(t_N, t'_{N'})$$
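The recursion above maps onto a small dynamic programming table. The sketch below is a plain quadratic-time version that uses the absolute difference as the element distance d, which is an assumption; the slide leaves d unspecified.

```python
def dtw(T, T2, d=lambda x, y: abs(x - y)):
    """Dynamic time warping cost between sequences T and T2.

    D[i][j] holds the cost of the best alignment of T[:i] with T2[:j].
    """
    N, M = len(T), len(T2)
    INF = float("inf")
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i][j] = d(T[i - 1], T2[j - 1]) + min(
                D[i - 1][j - 1],  # advance in both series
                D[i - 1][j],      # advance in T only
                D[i][j - 1],      # advance in T2 only
            )
    return D[N][M]


# A repeated sample is absorbed by the warping, so these two series match.
print(dtw([1, 2, 3], [1, 2, 2, 3]))
print(dtw([1, 3, 4], [1, 2, 4]))
```

The table has (N + 1) x (M + 1) entries, so both time and memory grow with the product of the two lengths.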
Barycentric Averaging DTW
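• Let $\mathcal{T} = \{T_1, \dots, T_K\}$ denote a set of time series
• Let $T_1 = \langle t_{1,1}, \dots, t_{1,N_1} \rangle$ denote a time series
• ...
• Let $T_K = \langle t_{K,1}, \dots, t_{K,N_K} \rangle$ denote a time series
The barycentric average $\bar{T}$ under DTW satisfies (cf. [PKG11]):
$$\forall N \in \mathbb{N}^{*}, \; \forall T = \langle t_1, \dots, t_N \rangle: \quad \sum_{k=1}^{K} \mathrm{DTW}(\bar{T}, T_k)^2 \le \sum_{k=1}^{K} \mathrm{DTW}(T, T_k)^2$$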
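The inequality says that the barycentric average minimizes the sum of squared DTW costs to the series of the set. The sketch below only evaluates that objective and picks the best candidate among the given series (a medoid); this is a much simpler stand-in and not the iterative averaging procedure of [PKG11]. The function names and the sample series are hypothetical, and the absolute difference is assumed as element distance.

```python
def dtw(T, T2):
    """Dynamic time warping cost with absolute difference as element distance."""
    INF = float("inf")
    D = [[INF] * (len(T2) + 1) for _ in range(len(T) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(T) + 1):
        for j in range(1, len(T2) + 1):
            D[i][j] = abs(T[i - 1] - T2[j - 1]) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1]
            )
    return D[len(T)][len(T2)]


def sum_squared_dtw(candidate, series_set):
    """Objective minimized by the barycentric average: sum_k DTW(candidate, T_k)^2."""
    return sum(dtw(candidate, T_k) ** 2 for T_k in series_set)


def medoid(series_set):
    """Series of the set with the smallest objective (a crude stand-in for the average)."""
    return min(series_set, key=lambda T: sum_squared_dtw(T, series_set))


series_set = [[1, 2, 3], [1, 2, 2, 3], [5, 6, 7]]
best = medoid(series_set)
print(best, sum_squared_dtw(best, series_set))
```

A true barycentric average is not restricted to members of the set, so its objective value can be lower than that of the medoid.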
Fast Dynamic Time Warping (FastDTW)
• Multi-level approach for the computation of the dynamic time warping, cf. [SC04]
• Linear space complexity
• Linear time complexity
• Approximation method with good precision (adjusted via the tuning parameter r)
Preliminary Results
Estimates are obtained with a precision on the order of 1 dB for time horizons on the order of 1 s
5. Conclusion
Summary:
• Review of state-of-the-art controllers
• Proposal of a modified and improved controller
• Proposal of a mobility prediction mechanism
(different from those proposed for inter-cell handover management)
Work in progress/Perspectives:
• Integration of the mobility prediction module into the controller
• Enhancement of the mobility prediction mechanism
• Design of a higher-level control system for many cells, even for an entire network
References
[PKG11] François Petitjean, Alain Ketterlin, and Pierre Gançarski.
A global averaging method for dynamic time warping, with applications to clustering.
Pattern Recognition, 44(3):678–693, 2011.
[Put94] Martin Puterman.
Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley-Interscience, 1994.
[SB98] Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: An Introduction.
MIT Press, Cambridge, 1998.
[SC78] Hiroaki Sakoe and Seibi Chiba.
Dynamic Programming Algorithm Optimization for Spoken Word Recognition.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
[SC04] Stan Salvador and Philip Chan.
FastDTW: Toward accurate dynamic time warping in linear time and space.
Thank you!
Thanks for your attention!
Questions?
These research works are funded by Orange and supported by the collaborative research project ANR NETLEARN (ANR-13-INFR-0004)
Appendix: example of an MDP-based controller
[Figure: example policy shown over a grid of states labeled from "0 1" to "5 4"; successive overlays mark the states with the actions 0, +1 and −1]