Reinforcement Learning for Resource Allocation and Time Series Tools for Mobility Prediction


Baptiste Lefebvre (1,2), Stephane Senecal (2) and Jean-Marc Kelif (2)

(1) École Normale Supérieure (ENS), Paris, France

(2) Orange Labs, Issy-les-Moulineaux, France

First GdR MaDICS Workshop on Big Data for the 5G RAN, 25 November 2015 @ Huawei FRC


Agenda

1. Context
2. Current Controller
3. Proposed Controller
4. Mobility Prediction
5. Conclusion


Wireless Networks

UE = User Equipment


Radio Resource Management (RRM)

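[Figure: time-frequency grid of Physical Resource Blocks (PRB) (1); each block spans one slot (0.5 ms) and 12 subcarriers (180 kHz)]

Allocation: sharing of joint time slots and frequency bands

Load:

\[ \rho = \sum_{c=1}^{C} \frac{T}{\frac{r}{R} D_c} \, n_c \]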

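Quality of Service (QoS):

\[ \mathrm{QoS} = 1 - \left( \frac{1}{2} \right)^{\rho} \]

Energy/Power Consumption:

\[ P = P_{BS} + r \, P_{RS} + \tilde{\rho} \, P_{AP}, \qquad \tilde{\rho} = \min(\rho, 1) \]

1. Physical Resource Block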


Goal: optimization of the energy consumption under QoS constraints

Formal framework considered: reinforcement learning [SB98], and more specifically Markov Decision Processes (MDP) [Put94]:

• A system state enumerates the UEs in each radio condition and the active resources

• An action is either null, a deactivation, or an activation of a resource

• A policy associates with every state an action to perform

• To save energy, one needs to compute or estimate an optimal policy, i.e. a policy that implements a good trade-off between energy (electrical power) consumption and the targeted QoS level
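The state, action and policy notions above can be sketched as plain data structures. This is only an illustration: the names `State`, `ACTIONS`, `enumerate_states` and the sizes used (one radio condition, at most 5 UEs, 4 resources, at least one resource always active) are assumptions, not taken from the slides.

```python
from itertools import product
from typing import Dict, List, NamedTuple, Tuple


class State(NamedTuple):
    n: Tuple[int, ...]  # n[i] = number of UEs in radio condition i
    r: int              # number of active resources


# -1 = deactivate one resource, 0 = null action, +1 = activate one resource
ACTIONS = (-1, 0, +1)

# A policy maps every state to one action.
Policy = Dict[State, int]


def enumerate_states(num_classes: int, max_ues: int, num_resources: int) -> List[State]:
    """List all states with 0..max_ues UEs per class and 1..num_resources active resources."""
    counts = product(range(max_ues + 1), repeat=num_classes)
    return [State(n, r) for n in counts for r in range(1, num_resources + 1)]


# Hypothetical sizes, chosen only for illustration.
states = enumerate_states(num_classes=1, max_ues=5, num_resources=4)
null_policy: Policy = {s: 0 for s in states}  # the policy that never acts
```

An optimal policy would replace the constant 0 with the action chosen for each state by one of the solvers discussed in the next part.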


MDP Controller

• A controller executes a policy Π which, for a given traffic amount, aims at maximizing an objective function (QoS, power)

• Transition probability operator: $P(s,a,s')$

• Instantaneous reward function: $R(s,a)$

• Searching for an optimal policy for a fully known MDP model can be performed by dynamic programming


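Controller for Geometric Criterion

\[ \max_{\Pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \varphi^{t} R(s_t, \Pi(s_t)) \,\middle|\, s_0 = s \right] \]

• Solving a system of equations by iterating until a fixed point is reached (geometric criterion):

\[ \Pi(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s,a,s') \left( R(s,a) + \varphi V(s') \right) \]

\[ V(s) = \sum_{s' \in S} P(s,\Pi(s),s') \left( R(s,\Pi(s)) + \varphi V(s') \right) \]

Parameter $\varphi \in [0, 1)$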
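The fixed-point iteration above can be sketched as a standard value iteration. This is a minimal sketch under assumptions: the dictionary layout of `P` and `R`, the stopping tolerance and the two-state toy model are hypothetical and are not the controller model of the slides.

```python
def value_iteration(states, actions, P, R, phi=0.9, tol=1e-9, max_iter=10_000):
    """Iterate V(s) = max_a sum_s' P(s,a,s') (R(s,a) + phi V(s')) up to a fixed point.

    P[(s, a)] is a dict {s': probability}; R[(s, a)] is a float.
    Returns the value function V and a greedy policy.
    """
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Expected discounted return of action a in state s, given the current V.
        return sum(p * (R[(s, a)] + phi * V[s2]) for s2, p in P[(s, a)].items())

    for _ in range(max_iter):
        new_V = {s: max(q(s, a) for a in actions) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, policy


# Toy model (hypothetical): staying in state 1 pays 1, everything else pays 0.
toy_states = [0, 1]
toy_actions = ["stay", "move"]
toy_P = {(s, "stay"): {s: 1.0} for s in toy_states}
toy_P.update({(s, "move"): {1 - s: 1.0} for s in toy_states})
toy_R = {(0, "stay"): 0.0, (0, "move"): 0.0, (1, "stay"): 1.0, (1, "move"): 0.0}

V, policy = value_iteration(toy_states, toy_actions, toy_P, toy_R, phi=0.9)
```

Because $\varphi < 1$ the update is a contraction, so the loop converges regardless of the initial values.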


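Controller for Average Criterion

\[ \max_{\Pi} \; \lim_{T \to \infty} \mathbb{E}\left[ \frac{1}{T} \sum_{t=0}^{T} R(s_t, \Pi(s_t)) \,\middle|\, s_0 = s \right] \]

• Solving a system of equations by iterating until a fixed point is reached (average criterion):

\[ \Pi(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s,a,s') \left( R(s,a) + V(s') \right) \]

\[ V(s) = \sum_{s' \in S} P(s,\Pi(s),s') \left( R(s,\Pi(s)) + V(s') \right) \]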


States Transitions and Rewards

• The system evolves in continuous time, not in discrete time

• A continuous-time MDP can be turned into a discrete-time MDP via uniformization and discretization schemes

• $P(s,a,s')$ is replaced by $Q(s,a,s')$, which denotes the transition rate (i.e. the Poisson process parameter)
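As an illustration of the uniformization step, the sketch below converts transition rates into transition probabilities for one fixed action, using the textbook construction $P = I + Q/\Lambda$ with $\Lambda$ the largest exit rate. The function name, the dictionary layout and the two-state example are assumptions; the slides do not detail the scheme actually used.

```python
def uniformize(Q):
    """Turn rates Q[s][s2] (s2 != s, one fixed action) into probabilities.

    Returns (P, Lambda), where Lambda is the uniformization rate and
    P[s][s2] is the probability of moving from s to s2 in one uniformized step.
    """
    exit_rates = {s: sum(rates.values()) for s, rates in Q.items()}
    Lambda = max(exit_rates.values())
    if Lambda <= 0:
        raise ValueError("at least one state needs a positive exit rate")
    P = {}
    for s, rates in Q.items():
        row = {s2: q / Lambda for s2, q in rates.items()}
        # Remaining probability mass becomes a fictitious self-transition.
        row[s] = row.get(s, 0.0) + 1.0 - exit_rates[s] / Lambda
        P[s] = row
    return P, Lambda


# Hypothetical two-state chain: a -> b at rate 2, b -> a at rate 1.
P_unif, Lambda = uniformize({"a": {"b": 2.0}, "b": {"a": 1.0}})
```

The resulting discrete-time chain has the same stationary behaviour as the continuous-time one, which is what allows the dynamic programming equations to be reused.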


States Transitions Modeling

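\[
Q\big((n,r),a,(n',r')\big) =
\begin{cases}
\lambda_i & \text{if } B_{\lambda_i}(s,a,s') \\
n_i \, \frac{1}{n} \, \frac{\frac{r}{R} D_i}{F_i} & \text{if } B_{\mu_i}(s,a,s') \\
0 & \text{else}
\end{cases}
\]

\[ B_{\lambda_i}(s,a,s') = (n' = n + e(i)) \wedge (r' = r + a) \]

\[ B_{\mu_i}(s,a,s') = (n' = n - e(i)) \wedge (r' = r + a) \]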
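A small sketch of this rate function follows, assuming that the scalar n in the service rate is the total number of UEs and that r + a stays within the valid range. The function name, argument layout and numerical values are hypothetical.

```python
def transition_rates(n, r, a, lam, D, F, R):
    """Return {(n', r'): rate} for state (n, r) under action a.

    n: tuple of UE counts per radio condition; lam, D, F: per-class arrival
    rates, throughputs and file sizes; R: total number of resources.
    """
    total = sum(n)
    r_next = r + a
    rates = {}
    for i in range(len(n)):
        # Arrival of a class-i UE: n' = n + e(i), rate lambda_i.
        up = tuple(c + 1 if k == i else c for k, c in enumerate(n))
        rates[(up, r_next)] = lam[i]
        # Departure of a class-i UE: n' = n - e(i), rate n_i (1/n) (r/R) D_i / F_i.
        if n[i] > 0:
            down = tuple(c - 1 if k == i else c for k, c in enumerate(n))
            rates[(down, r_next)] = n[i] * (1.0 / total) * (r / R) * D[i] / F[i]
    return rates


# Hypothetical numbers: two radio conditions, 2 of the 4 resources active.
rates = transition_rates(n=(2, 1), r=2, a=0, lam=(0.5, 0.2),
                         D=(10.0, 4.0), F=(5.0, 2.0), R=4)
```

All target states not listed in the returned dictionary have rate 0, which matches the "else" branch of the definition.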


Rewards - Costs Functions

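\[ C(s,a) = \sum_{\substack{s' \in S \\ Q(s,a,s') \neq 0}} \left[ \gamma E(n', r+a) + (1-\gamma) F(n', r+a) \right] \]

\[
E(n,r) =
\begin{cases}
\frac{P_{BS} + r P_{RS}}{P_{BS} + R (P_{RS} + P_{AP})} & \text{if } n = 0 \\
\frac{P_{BS} + r (P_{RS} + P_{AP})}{P_{BS} + R (P_{RS} + P_{AP})} & \text{else}
\end{cases}
\]

\[ F(n,r) = 1 - \exp\left( -\log(2) \, \frac{T}{\frac{r}{R n}} \, \frac{\sum_{i=1}^{C} n_i}{\sum_{i=1}^{C} D_i n_i} \right) \]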


Current Results

• The optimal policy is a threshold policy

• The optimal policy depends on the traffic volume, the target throughput and the cell capacity

• Executing the optimal policy enables energy savings on the order of 40%

• Proposal: take the activation time into account by adding a timer


Optimization under Congestion

Under congestion, the controller does not activate all resources to reduce the congestion as fast as possible


Unused Resources


Excessive QoS

The controller can grant an effective QoS level much greater than the initially targeted QoS level (e.g. 50 Kbps → 400 Kbps)


States Transitions Modeling

\[
Q(s,a,s') =
\begin{cases}
\lambda_i & \text{if } B_{\lambda_i}(s,a,s') \\
n_i \, \frac{1}{n} \, \frac{\frac{r+a}{R} D_i}{F_i} & \text{if } B_{\mu_i}(s,a,s') \wedge \neg B(s,a,s') \\
n_i \, \frac{1}{n} \, \frac{\frac{r}{R} D_i}{F_i} & \text{if } B_{\mu_i}(s,a,s') \wedge B(s,a,s') \\
0 & \text{else}
\end{cases}
\]

\[ B_{\lambda_i}\big((n,r),a,(n',r')\big) = \big( n' = n + e(i) \vee (n' = n \wedge n = N) \big) \wedge \big( r' = r + a \vee B(r',a,r) \big) \]

\[ B_{\mu_i}\big((n,r),a,(n',r')\big) = (n' = n - e(i)) \wedge \big( r' = r + a \vee B(r',a,r) \big) \]

\[ B(r',a,r) = (r' = r = 1 \wedge a = -1) \vee (r' = r = R \wedge a = 1) \]
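The boundary handling can be sketched as follows: when an action would push the number of active resources outside 1..R, the resource count is left unchanged and the service rate uses r instead of r + a. The helper names and the numerical example are assumptions made for illustration.

```python
def at_boundary(r, a, R):
    """Predicate B: the action would leave the valid range 1..R."""
    return (r == 1 and a == -1) or (r == R and a == 1)


def next_resources(r, a, R):
    """Number of active resources after action a, with boundary handling."""
    return r if at_boundary(r, a, R) else r + a


def service_rate(n, i, r, a, D, F, R):
    """Departure rate of class-i UEs: n_i (1/n) (r_next/R) D_i / F_i."""
    total = sum(n)
    if n[i] == 0:
        return 0.0
    return n[i] / total * (next_resources(r, a, R) / R) * D[i] / F[i]


# Hypothetical example: all 4 resources active and the action asks for one more.
rate = service_rate(n=(2, 1), i=0, r=4, a=1, D=(10.0, 4.0), F=(5.0, 2.0), R=4)
```

Writing the two service-rate cases through `next_resources` makes it explicit that they differ only in which resource count is in effect after the action.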


Ideal and Effective Power Consumption

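• Ideal Power Consumption:

\[
P^{*}(n) =
\begin{cases}
P_{BS} + P_{RS} & \text{if } \alpha(n) = 0 \\
P_{BS} + \lceil \alpha(n) \rceil P_{RS} + \alpha(n) P_{AP} & \text{else}
\end{cases}
\]

• Ideal Number of Resources (2):

\[ \alpha(n) = \min\left( \sum_{i=1}^{C} n_i \frac{T}{D_i} R, \; R \right) \]

• Effective Power Consumption:

\[
\hat{P}(n,r) =
\begin{cases}
P_{BS} + r P_{RS} & \text{if } n = 0 \\
P_{BS} + r P_{RS} + r P_{AP} & \text{else}
\end{cases}
\]

2. Solving equation $F(n,r) = \frac{1}{2} = \beta$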
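To make the three definitions concrete, here is a short sketch that evaluates them. The function names and all numerical values (power figures, throughputs, target T) are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def ideal_resources(n, T, D, R):
    """alpha(n) = min(sum_i n_i (T / D_i) R, R)."""
    return min(sum(n_i * T / D_i for n_i, D_i in zip(n, D)) * R, R)


def ideal_power(n, T, D, R, P_BS, P_RS, P_AP):
    """P*(n): power needed if exactly alpha(n) resources were in use."""
    alpha = ideal_resources(n, T, D, R)
    if alpha == 0:
        return P_BS + P_RS
    return P_BS + math.ceil(alpha) * P_RS + alpha * P_AP


def effective_power(n, r, P_BS, P_RS, P_AP):
    """P_hat(n, r): power actually drawn with r active resources."""
    if sum(n) == 0:
        return P_BS + r * P_RS
    return P_BS + r * P_RS + r * P_AP


# Hypothetical values: alpha = (2 * 0.1 + 1 * 0.25) * 4 = 1.8 resources.
params = dict(P_BS=100.0, P_RS=10.0, P_AP=20.0)
alpha = ideal_resources((2, 1), T=1.0, D=(10.0, 4.0), R=4)
p_star = ideal_power((2, 1), T=1.0, D=(10.0, 4.0), R=4, **params)
p_hat = effective_power((2, 1), r=3, **params)
```

With these numbers the ideal power is 100 + 2·10 + 1.8·20 = 156 while three active resources draw 190, so the gap between the two is what the controller should try to close.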


Power Consumption Error Modeling

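• Normalized Regret:

\[
E\big((n,r),a\big) =
\begin{cases}
\frac{\hat{P}(n,r) - P^{*}(n)}{R (P_{RS} + P_{AP})} & \text{if } B(r,a) \\
\frac{\hat{P}(n,r+a) - P^{*}(n)}{R (P_{RS} + P_{AP})} & \text{else}
\end{cases}
\]

\[ B(r,a) = (r = 1 \wedge a = -1) \vee (r = R \wedge a = 1) \]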
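The regret can be computed as in the sketch below, which also shows the symmetrical reward $R(s,a) = -|E(s,a)|$ used by the proposed controller. The function signatures, the linear effective-power model and the numbers are assumptions for illustration only.

```python
def normalized_regret(r, a, R, P_RS, P_AP, effective_power, ideal_power_value):
    """E((n, r), a) for a fixed UE vector n.

    effective_power: callable mapping a resource count to P_hat(n, .);
    ideal_power_value: the number P*(n).
    """
    blocked = (r == 1 and a == -1) or (r == R and a == 1)  # predicate B(r, a)
    r_used = r if blocked else r + a
    return (effective_power(r_used) - ideal_power_value) / (R * (P_RS + P_AP))


def symmetrical_reward(regret):
    """Over- and under-provisioning are penalized equally."""
    return -abs(regret)


# Hypothetical setting: P_hat(n, r) = 100 + 30 r, P*(n) = 156, R = 4.
p_hat_of = lambda r: 100.0 + 30.0 * r
e_inner = normalized_regret(2, +1, 4, 10.0, 20.0, p_hat_of, 156.0)
e_edge = normalized_regret(4, +1, 4, 10.0, 20.0, p_hat_of, 156.0)
reward = symmetrical_reward(e_inner)
```

A positive regret means more power is drawn than ideally needed; a negative one means the cell is under-provisioned relative to the ideal number of resources.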


Rewards - Costs Functions

• Symmetrical Instantaneous Reward:

\[ R(s,a) = -|E(s,a)| \]

• Asymmetrical Instantaneous Reward:


Overall Performance

β     | current controller                 | proposed controller
      | γ      q̂(0.01)  q̂(0.5)  q̂(0.99)   | θ      q̂(0.01)  q̂(0.5)  q̂(0.99)
1/2   | 0.604  −0.98    +0.40   +0.80     | 1      −0.02    +0.00   +0.02
      | 0.5    +0.21    +0.52   +0.84     | 1e−4   −0.02    +0.00   +0.02
      | 0.4    +0.23    +0.54   +0.85     | 1e−8   −0.02    +0.00   +0.02
3/4   | 0.604  −0.98    +0.33   +0.60     | 1      −0.08    +0.02   +0.04
      | 0.5    +0.21    +0.44   +0.66     | 1e−2   −0.04    +0.02   +0.08
      | 0.4    +0.22    +0.47   +0.70     | 1e−4   +0.00    +0.06   +0.12
9/10  | 0.604  −0.98    +0.03   +0.41     | 1      −0.34    −0.02   +0.32
      | 0.5    −0.15    0.21    +0.50     | 55e−3  −0.13    +0.21   +0.44
      | 0.4    −0.09    +0.27   +0.54     | 3e−3   +0.00    +0.35   +0.54


Mobility

• Traffic due to arrivals and departures of UEs in the coverage zone of the BS, modeled by Poisson processes

• Moves of UEs inducing propagation losses, shadowing and ...


Problem Statement

• The activation/deactivation time frame of a physical resource is not taken into account in the model

• Idea: predict the states to be visited in the next few seconds

• This approach makes it possible to consider mobile users

• Given the SINR traces of users who previously crossed the cell and the SINR trace of a user currently crossing it, we aim at estimating the SINR to be measured in the near future


Problem Modeling

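• Let $\mathcal{T} = \{T_1, \dots, T_K\}$ denote a set of time series

• Let $T_1 = \langle t_{1,1}, \dots, t_{1,N_1} \rangle$ denote a time series

• ...

• Let $T_K = \langle t_{K,1}, \dots, t_{K,N_K} \rangle$ denote a time series

• Let $T = \langle t_1, \dots, t_N \rangle$ denote a time series to be completed

\[ \hat{t}_{N+1} = f(T) \]

\[ \hat{t}_{N+1} = g(T), \qquad T_k \sim D \]

\[ \hat{t}_{N+1} = h(T), \qquad T_k \sim D = \{D_1, \dots, D_M\} \]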


Dynamic Time Warping (DTW)

• Let $T = \langle t_1, \dots, t_N \rangle$ denote a time series

• Let $T' = \langle t'_1, \dots, t'_{N'} \rangle$ denote another time series

• Let $d$ denote a distance measure between elements of these time series

\[ D(t_i, t'_j) = d(t_i, t'_j) + \min\left\{ D(t_{i-1}, t'_{j-1}), \; D(t_{i-1}, t'_j), \; D(t_i, t'_{j-1}) \right\} \]

\[ \mathrm{DTW}(T, T') = D(t_N, t'_{N'}) \]
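The recurrence can be implemented directly with a cost table. The sketch below assumes the usual boundary convention (cost 0 before both series start, infinite cost elsewhere on the border) and an absolute-difference element distance; neither is stated on the slide.

```python
def dtw(T, T2, d=lambda x, y: abs(x - y)):
    """Dynamic time warping cost between sequences T and T2.

    D[i][j] is the cost of the best alignment of the first i elements
    of T with the first j elements of T2.
    """
    N, M = len(T), len(T2)
    inf = float("inf")
    D = [[inf] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0  # assumed boundary condition
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i][j] = d(T[i - 1], T2[j - 1]) + min(
                D[i - 1][j - 1],  # advance in both series
                D[i - 1][j],      # advance in T only
                D[i][j - 1],      # advance in T2 only
            )
    return D[N][M]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated 2 can be absorbed by warping, whereas a pointwise comparison is not even defined for series of different lengths.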


Barycentric Averaging DTW

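• Let $\mathcal{T} = \{T_1, \dots, T_K\}$ denote a set of time series

• Let $T_1 = \langle t_{1,1}, \dots, t_{1,N_1} \rangle$ denote a time series

• ...

• Let $T_K = \langle t_{K,1}, \dots, t_{K,N_K} \rangle$ denote a time series

The barycentric averaging DTW $\bar{T}$ satisfies (cf. [PKG11]):

\[ \forall N \in \mathbb{N}^{*}, \; \forall T = \langle t_1, \dots, t_N \rangle : \quad \sum_{k=1}^{K} \mathrm{DTW}(\bar{T}, T_k)^2 \le \sum_{k=1}^{K} \mathrm{DTW}(T, T_k)^2 \]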
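In practice this minimizer is approximated iteratively. Below is a minimal sketch of a DBA-style update in the spirit of [PKG11]: each point of the current average is replaced by the mean of the points aligned to it by DTW. The function names, the squared-difference cost, the initialization with the first series and the fixed iteration count are assumptions.

```python
def dtw_path(a, b):
    """Optimal DTW alignment [(i, j), ...] (0-based) under squared differences."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Backtrack from the last cell to the first one.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def dba_update(average, series_set):
    """One refinement step: average the points aligned to each coordinate."""
    buckets = [[] for _ in average]
    for series in series_set:
        for i, j in dtw_path(average, series):
            buckets[i].append(series[j])
    return [sum(b) / len(b) for b in buckets]


def dba(series_set, n_iter=10):
    """Iterate the update, starting from the first series (an arbitrary choice)."""
    average = list(series_set[0])
    for _ in range(n_iter):
        average = dba_update(average, series_set)
    return average
```

The result is a local optimum that depends on the initialization, so the inequality above is only approached, not guaranteed, by this heuristic.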


Fast Dynamic Time Warping (FastDTW)

• Multi-level approach for the computation of the dynamic time warping, cf. [SC04]

• Linear space complexity

• Linear time complexity

• Approximation method with good accuracy (via the tuning parameter r)
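The multi-level idea can be sketched as follows: halve both series, solve the coarse problem recursively, then refine the coarse warping path at full resolution inside a window widened by r cells. This is a simplified sketch in the spirit of [SC04], not the reference implementation; all function names are hypothetical and an absolute-difference element distance is assumed.

```python
def constrained_dtw(x, y, window):
    """DTW restricted to the 0-based cells in `window`; returns (cost, path)."""
    inf = float("inf")
    D = {(-1, -1): (0.0, None)}  # virtual start cell
    for i, j in sorted(window):  # lexicographic order visits predecessors first
        prev = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda c: D.get(c, (inf, None))[0])
        D[(i, j)] = (abs(x[i] - y[j]) + D.get(prev, (inf, None))[0], prev)
    path, cell = [], (len(x) - 1, len(y) - 1)
    while cell != (-1, -1):
        path.append(cell)
        cell = D[cell][1]
    path.reverse()
    return D[(len(x) - 1, len(y) - 1)][0], path


def full_window(n, m):
    return [(i, j) for i in range(n) for j in range(m)]


def halve(x):
    """Coarsen a series by averaging adjacent pairs (a trailing odd point is dropped)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def expand_window(path, n, m, radius):
    """Project a coarse path to the finer resolution and widen it by `radius`."""
    coarse = {(i + di, j + dj) for i, j in path
              for di in range(-radius, radius + 1)
              for dj in range(-radius, radius + 1)}
    return {(fi, fj) for i, j in coarse
            for fi in (2 * i, 2 * i + 1) for fj in (2 * j, 2 * j + 1)
            if 0 <= fi < n and 0 <= fj < m}


def fastdtw(x, y, radius=1):
    """Approximate DTW cost and path; larger radius trades speed for accuracy."""
    if radius < 1:
        raise ValueError("this sketch requires radius >= 1")
    if len(x) < radius + 2 or len(y) < radius + 2:
        return constrained_dtw(x, y, full_window(len(x), len(y)))
    _, coarse_path = fastdtw(halve(x), halve(y), radius)
    window = expand_window(coarse_path, len(x), len(y), radius)
    return constrained_dtw(x, y, window)
```

Since the refined path is searched in a restricted window, the returned cost is never below the exact DTW cost, and it coincides with it whenever the optimal path lies inside the window.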


Preliminary Results

Estimates are obtained with a precision on the order of 1 dB for time horizons on the order of 1 s


Conclusion

Summary:

• Review of state-of-the-art controllers

• Proposal of a modified and improved controller

• Proposal of a mobility prediction mechanism (different from those proposed for inter-cell handover management)

Work in progress/Perspectives:

• Integration of the mobility prediction module into the controller

• Enhancement of the mobility prediction mechanism

• Design of a higher-level control system for many cells, even for an entire network


References

[PKG11] François Petitjean, Alain Ketterlin, and Pierre Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678–693, 2011.

[Put94] Martin Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley-Interscience, 1994.

[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press Cambridge, 1998.

[SC78] Hiroaki Sakoe and Seibi Chiba. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. Transactions on Acoustics, Speech and Signal Processing, 26(1):43–49, 1978.

[SC04] Stan Salvador and Philip Chan. FastDTW: Toward accurate dynamic time warping in linear time and space.


Thank you!

Thanks for your attention!

Questions?

These research works are funded by Orange and supported by the collaborative research project ANR NETLEARN (ANR-13-INFR-0004)

Appendix: example of an MDP-based controller

[Figure: grid of states labeled by pairs from (0, 1) to (5, 4); successive overlays mark the states where the controller selects action 0, +1 and −1]