

4.3 Multi Armed-Bandit Solution for Vehicular Edge Computing


subject to:

\[ \text{C1}: \quad \sum_{m \in \Im_M(t)} Q_t^m \le 1 \tag{4.33} \]

Constraint (4.33) states that each EU offloads to at most one network.

The waiting time for an offloaded task depends on the number of tasks offloaded to that network. If this information were available, the EU would select the BS given by $\arg\min_m \{T^{wl}_{m_j}(\vartheta_t)\}$. However, the EU is aware of neither the offloading decisions of other users nor, consequently, the BSs' queue status. Therefore, in the following, we develop a learning-based solution for network selection for computation offloading in the vehicular environment.

4.3.4 Learning-based Solution for Network Selection

The waiting time of a task depends on several parameters related to either the EU or the BSs. The EU knows the task parameters, such as the size and the required number of processing operations. However, the traffic load in each network is unknown to the EU, as it depends on the cars' arrivals and departures and on the offloading demands. Hence, we utilize the single-player MAB model, which is suitable for problems with limited information such as P1.

In a bandit model, an agent gambles on a machine with a finite set of arms. Upon pulling an arm, the agent receives an instantaneous reward from the reward-generating process of that arm, which is a priori unknown. Since the agent does not have sufficient knowledge, at each trial it might pull an arm that is inferior in terms of reward, which results in some instantaneous regret. By pulling arms sequentially over the rounds of the game, the agent aims at satisfying some optimality conditions [153]. Since the objective in this work is to minimize the waiting time, we use the notion of cost instead of reward. Therefore, the goal is to minimize the cost. In brief, in our model:

• The EU and networks represent the agent and the arms;

• The instantaneous loss of pulling an arm is the difference between the expected waiting time of that arm and the waiting time of the optimal arm.

At every round, the player selects an arm (a network), for offloading a packet, observes its loss and updates the estimation of its loss distribution. Each time a network is selected, the player observes the waiting time that is used for cost calculation. The objective is to minimize the loss over time. We define the instantaneous cost function for taking action m (network selection), at round t as:

\[ c_t^m = \left( T^{wl}_{m_j}(\vartheta_t) \right) \cdot \mathbb{1}\{Q_t^m = 1\} \tag{4.34} \]

The offloading delay depends largely on the task queuing time; however, due to the dynamics of a vehicular network, such as the varying vehicle density, often no information is available about this variable. Moreover, the statistical characteristics of the cars' arrivals and density, as well as of the offloaded tasks, change over time. Hence, we assume that $\lambda_m$ and $\mu_m$ are not identically distributed through time; instead, their distribution remains identical only over a specific period of time and changes from one period to another. Hence, the queue status of the BSs is piece-wise stationary, where the length of each period and the distribution are not known. Therefore, we introduce the following assumption.

Assumption 2. For all BSs in network $m$, $\lambda_m$ and $\mu_m$ are piece-wise constant over intervals of unknown length and suffer ruptures at change points.

Based on this assumption, BSs of the same network have the same probability distribution for the arrival and departure of the tasks, while these distributions change from one period to the next.

Network selection for task offloading with MAB is a stochastic problem. The previously offloaded tasks provide latency/cost information. However, this information may not be accurate because each arm may have been tried too few times within the window period. Hence, there exists an exploration-exploitation trade-off to be addressed. One of the most influential works in the literature that considers the exploration-exploitation trade-off is the Upper Confidence Bound (UCB) algorithm [154]. In the UCB algorithm, at every round of the game, an index is calculated for each arm corresponding to the average gained reward of pulling the arm in all previous

that uses the last τ observations for learning.

The number of times the $m$th arm has been selected during a window of length $\tau$ up to round $t$ is given by

\[ C_t^m(\tau) = \sum_{s=t-\tau+1}^{t} \mathbb{1}\{Q_s^m = 1\} \tag{4.35} \]

Let us define the total number of tasks, $C_t$, offloaded by the EU to all networks by round $t$ as

\[ C_t = \sum_{s=1}^{t} \sum_{F_{m_j} \in \Im_F(s)} \mathbb{1}\{Q_s^m = 1\} \tag{4.36} \]
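The two counters in (4.35) and (4.36) can be illustrated with a minimal Python sketch. It assumes one offloaded task per round; the function name `window_counts` and the list `selections` (the arm index chosen at each round) are hypothetical and not part of the original formulation.

```python
from collections import deque


def window_counts(selections, num_arms, tau):
    """Return ([C_t^m(tau) for each arm m], C_t) after the last round.

    `selections` is a hypothetical list holding the arm (network) index
    chosen at each round; one offloaded task per round is assumed.
    """
    # A bounded deque keeps only the last `tau` selections (the window).
    window = deque(selections, maxlen=tau)
    per_arm = [0] * num_arms
    for arm in window:
        per_arm[arm] += 1
    # C_t counts every offloaded task since round 1, not only the window.
    total = len(selections)
    return per_arm, total


per_arm, total = window_counts([0, 1, 1, 2, 1, 0], num_arms=3, tau=4)
print(per_arm, total)  # [1, 2, 1] 6
```

With a window of length 4, only the last four selections (1, 2, 1, 0) contribute to the per-arm counts, while the total still counts all six rounds.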

Inspired by the SW-UCB, we propose a learning algorithm in this work. We define the cost index of pulling arm m at round t as

\[ \hat{c}_t^m(\tau) = \bar{c}_t^m(\tau) - \beta \sqrt{\frac{\xi \log(\min\{C_t, \tau\})}{C_t^m(\tau)}} \tag{4.37} \]

where the first term on the right-hand side is the exploitation factor, the second is the exploration factor, $0.5 < \xi < 1$ is a constant weight, $\beta$ is an upper bound on the exploration factor, and $\bar{c}_t^m(\tau)$ is the average cost accumulated up to round $t$ within the window of length $\tau$, defined as

\[ \bar{c}_t^m(\tau) = \frac{1}{C_t^m(\tau)} \sum_{s=t-\tau+1}^{t} c_s^m \tag{4.38} \]
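As a rough illustration of (4.37) and (4.38), the sketch below computes the cost index of one arm from the costs it produced inside the current window. The function name `cost_index`, the default values `xi=0.6` and `beta=1.0`, and the choice to return negative infinity for an arm with no sample in the window are assumptions made for the example only.

```python
import math


def cost_index(window_costs, total_pulls, tau, xi=0.6, beta=1.0):
    """Cost index of one arm: windowed mean cost minus exploration bonus.

    window_costs: costs observed for this arm within the last `tau` rounds.
    total_pulls:  C_t, the number of tasks offloaded so far to all arms.
    xi, beta:     illustrative defaults, not values from the text.
    """
    n = len(window_costs)  # C_t^m(tau)
    if n == 0:
        # Assumption: an arm without samples in the window is tried first.
        return -math.inf
    mean_cost = sum(window_costs) / n  # exploitation term, as in (4.38)
    bonus = beta * math.sqrt(xi * math.log(min(total_pulls, tau)) / n)
    return mean_cost - bonus  # as in (4.37); a lower index is preferred


few = cost_index([3.0], total_pulls=20, tau=10)
many = cost_index([3.0] * 8, total_pulls=20, tau=10)
print(few < many)  # True
```

For the same mean cost, the arm with fewer samples in the window receives the lower index, so the minimizing agent is drawn towards rarely tried networks.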

Each time there is a task to be offloaded, the agent pulls the arm with the minimum $\hat{c}_t^m(\tau)$. The proposed MAB algorithm is given in Algorithm 13. In lines 5-6, the agent pulls each arm once and calculates the immediate cost. In lines 8-9, the cost index, which accounts for both exploitation and exploration, is calculated. In lines 10-11, the best arm, i.e., the one that minimizes the cost index, is selected. Lines 12-15 update the total number of rounds, the number of times the selected arm has been pulled, and the average accumulated cost.

Algorithm 13 The proposed MAB Algorithm
1: Input: $\xi > 0.5$, $C_0^m = 0$, $\bar{c}_0^m(\tau) = 0$, $C_0 = 0$
2: Output: a selected arm for each offloading task
3: Set the window length $\tau$
4: for $t = 1$ to $T$ do
5:   if $\exists\, m$ that has not been pulled yet then
6:     Pull the arm, and update $\bar{c}_t^m(\tau)$, $C_t^m(\tau)$ and $C_t$
7:   else
8:     calculate the cost function $\forall m$
9:     $\hat{c}_t^m(\tau) = \bar{c}_t^m(\tau) - \beta \sqrt{\frac{\xi \log(\min\{C_t, \tau\})}{C_t^m(\tau)}}$
10:    Select the arm such that:
11:    $\arg\min_m \hat{c}_t^m(\tau)$
12:    calculate $c_t^m$ and update:
13:    $C_t^m(\tau) \leftarrow C_{t-1}^m(\tau) + 1$
14:    $C_t \leftarrow C_{t-1} + 1$
15:    $\bar{c}_t^m(\tau) = \frac{1}{C_t^m(\tau)} \sum_{s=t-\tau+1}^{t} c_s^m$
16:  end if
17: end for
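The loop of Algorithm 13 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this work: the class name `SlidingWindowCostBandit`, the helper `run`, the cost oracle `cost_fn`, and the defaults `xi=0.6`, `beta=1.0` are all hypothetical. One deliberate deviation is labelled in the code: an arm with no sample left in the window is pulled again, which avoids a division by zero once its old observations have expired.

```python
import math
from collections import deque


class SlidingWindowCostBandit:
    """Sliding-window, cost-minimizing UCB-style selector (illustrative)."""

    def __init__(self, num_arms, tau, xi=0.6, beta=1.0):
        self.num_arms = num_arms
        self.tau = tau
        self.xi = xi
        self.beta = beta
        self.history = deque(maxlen=tau)  # (arm, cost) of the last tau rounds
        self.total = 0  # C_t: tasks offloaded so far

    def select(self):
        counts = [0] * self.num_arms
        sums = [0.0] * self.num_arms
        for arm, cost in self.history:
            counts[arm] += 1
            sums[arm] += cost
        # Assumption: arms never pulled, or with no sample left in the
        # window, are pulled first (covers lines 5-6 of Algorithm 13).
        for m in range(self.num_arms):
            if counts[m] == 0:
                return m
        log_term = math.log(min(self.total, self.tau))
        index = [
            sums[m] / counts[m]
            - self.beta * math.sqrt(self.xi * log_term / counts[m])
            for m in range(self.num_arms)
        ]
        # Lines 10-11: choose the arm with the minimum cost index.
        return min(range(self.num_arms), key=index.__getitem__)

    def update(self, arm, cost):
        # Lines 12-15: record the observed cost; the deque drops old rounds.
        self.history.append((arm, cost))
        self.total += 1


def run(bandit, cost_fn, rounds):
    """Drive the bandit; cost_fn(t, arm) is a hypothetical cost oracle."""
    picks = []
    for t in range(rounds):
        arm = bandit.select()
        bandit.update(arm, cost_fn(t, arm))
        picks.append(arm)
    return picks
```

As a usage example, `run(SlidingWindowCostBandit(num_arms=2, tau=50), lambda t, arm: [1.0, 5.0][arm] if t < 100 else [5.0, 1.0][arm], 300)` feeds deterministic waiting times that swap at round 100, mimicking a change point under Assumption 2. Because old observations leave the window, the selector is expected to move to the second network shortly after the change.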

Let $L^*(t, m) = \min_m \mathbb{E}[c^m_{\psi_t}]$ represent the expected cost of offloading to the network that is optimal in expectation, as selected by nature during interval $\psi_t$, and let $L(t, m) = \hat{c}_t^m(\tau)$ denote the accumulated cost of offloading to the $m$th network selected by the proposed MAB method. We define the regret during $T$ rounds as

\[ R_T = \mathbb{E}\left[ \sum_{t=1}^{T} L(t, m) \right] - \sum_{t=1}^{T} L^*(t, m) \tag{4.39} \]

which is the expected loss of the algorithm compared with the optimal network selection.
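In a simulation where the expected costs are known, a quantity in the spirit of (4.39) can be computed as in the sketch below. It is a simplification: it sums, per round, the gap between the expected cost of the chosen network and the smallest expected cost, rather than evaluating the index $\hat{c}_t^m(\tau)$. The function name `empirical_regret` and the table `expected_costs` are hypothetical.

```python
def empirical_regret(picks, expected_costs):
    """Sum over rounds of (expected cost of chosen arm - best expected cost).

    picks[t]:             arm chosen at round t.
    expected_costs[t][m]: hypothetical expected cost of arm m at round t,
                          known only to the simulator, not to the agent.
    """
    regret = 0.0
    for arm, costs in zip(picks, expected_costs):
        regret += costs[arm] - min(costs)
    return regret


# Three rounds, two networks; the best network changes in the last round.
print(empirical_regret([0, 1, 1], [[1.0, 2.0], [1.0, 2.0], [3.0, 2.0]]))  # 1.0
```

Only the second round contributes here, since it is the only round in which the chosen network is not the one with the smallest expected cost.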