Estimating the number of receiving nodes - Stochastic Optimization and Machine Learning Modelin

In Section 3.3, we used the number of active nodesN as an input feature to predict the ETA. However, this information is usually available at the transmitting device only, which, in general, does not share this value unless a modification of the trans- mitter protocol is introduced. Therefore, in this section, we investigate how to derive the correct value ofN at the receiving nodes, only exploiting those features actually available at the receivers’ side: we use different subsets of the network parameters as input features, with the goal of understanding how the knowledge of an additional feature affects the final classification performance. We tested four parameters subsets: the first subset contains only the total ETA (n= 1),i.e., the total transmission duration, whereas the others include either the total ETA and the transmission power (ET A, PT x) or the total ETA and the distance (ET A, d) (n = 2), or all

also tested, including adding the information about the PoD and the transmission channel, but the improvement over the results showed here was found to be negli- gible. For this task, we exploit all the 5400 experiments collected from our testbed, as described in Section 3.2, equally distributed among the 4 classes (N = 1, N = 2,

N = 3, N = 4). Since the goal is to correctly identify the value of N, here l = 1, hence Xtr ∈ R4320×n, Ytr =ytr ∈ R4320 and Xtest ∈ R1080×n, Ytest = ytest ∈ R1080.

Analogously to Section 3.3, a comparison of the results derived from the diﬀerent techniques will be shown in Section 3.6.2.

3.4.1 Naive Bayes classiﬁer

The Bayesian classiﬁer is a probability model that computes the posterior probability

p(yk|x) of a class given an input examplex, for each of thek possible classes. Using

Bayes’ rule, this probability is computed as

p(yk|x)∝p(x|yk)p(yk), (3.20)

where the posterior probability is proportional to the likelihood p(x|yk) and the

priorp(yk). The objectxis classiﬁed under the class that holds the highest posterior

probability:

y_MAP∗ = arg max

p(x|yk)p(yk), (3.21)

known as Maximum A Posteriori (MAP) formulation. The likelihood expresses how probable the input data x is for a given class yk, whereas the prior captures the

assumptions about the class, before observing the data, in the form of a probability distribution p(yk). If no prior knowledge about the class distribution is available, a

uniform prior p(yk) =c is assumed and (3.21) turns into the Maximum Likelihood

(ML) solution y_ML∗ = arg max

p(x|yk). The likelihood function can be modeled

using different probability distributions: in this work we used both a Gaussian and a Poisson pdf, setting their parameters (mean and variance for the former, mean for the latter) according to the examples in the training set. In particular, we ran the classification algorithm with input examplesx of different dimensions, based on the number of features n involved: the idea behind this approach is to analyze how additional knowledge on an experiment impacts the performance of the classifier. Therefore, the input vector contains either the ETA alone, or the pair (ETA,PT x) or

(ETA, distance), or the triple (ETA,PT x, distance), and diﬀerent pdfs are generated

as input features, (3.21) is reformulated as

y∗_MAP= arg max

p(ET A|yk, d)p(d)p(yk), (3.22)

wherep(d|yk) = p(d) for a homogeneous dataset.

3.4.2 Support Vector Classiﬁcation

Support Vector Classiﬁcation (SVC) machines [82] identify the optimal hyperplane maximizing the margin which separates diﬀerent classes in a multi-dimensional feature space. A generic hyperplane can be written as the set of points x satisfying

wT_x₋_b _{= 0, where} _w _{is the normal vector to the hyperplane and} _b _{represents the}

oﬀset from the origin along w. When the data points are linearly separable, it is possible to identify a pair of hyperplanes able to perfectly separate all the objects. By deﬁning these hyperplanes as wT_x₋_b ₌ _±_{1, the margin,} _i.e._{, the region in-}

between, has length _||_w2_||, and the objective is to maximize _||_w1_||. If we consider a binary classiﬁcation problem, y(i) _{∈ {−}₁_,₁_}_{, the maximization of} 1

||w|| is equivalent

to the minimization of 1₂||w||2; therefore the problem can be formulated as:

min w,b 1 2||w|| 2 s.t. y(i)(wTx(i)+b)≥1, i= 1, . . . , m, (3.23)

where the constraint y(i)₍_wT_x(i) ₊ _b₎ _≥ _{1 has been added to guarantee that all}

x(i)_{s lie outside the margin. In (3.23), we assumed that all training examples can}

be separated, which is not verified in general. As a result, (3.23) can be modified to allow for some tolerance in the classification error. Therefore, for each training examplex(i)_{, there is an associated non-negative slack-variable}_ξ

i ≥0, i= 1, . . . , m,

allowing each example to lie on the other side of the hyperplane separating its class. The optimization problem then becomes:

min w,b 1 2||w|| 2₊_Cm i=1 ξi s.t. y(i)(wTx(i)+b)≥1−ξi, ξi ≥0, i= 1, . . . , m, (3.24)

where the parameter C balances the trade-oﬀ between having a large margin and ensuring that most examples lie in the region associated to their class. The “dual”

formulation of Eq. (3.24) is max ααα m i=1 αi− 1 2 m i,j=1 y(i)y(j)αiαj(x(i))Tx(j) s.t. 0≤αi ≤C, i = 1, . . . , m, m i=1 αiy(i)= 0, (3.25)

whereααα= (α₁, . . . , αm) is the vector of Lagrange multipliers. It can be shown that

i=1

αiy(i)x(i), (3.26)

by which it is possible to calculate the optimal weights in terms of the optimal values of ααα. Now, a prediction for a new example input x can be performed calculating

wT_x₊_b_{, and predicting} _y _{= 1 (or} _y ₌ ₋_{1) if this quantity is bigger (or smaller)}

than zero. Hence, the prediction function can be written as

f_w(x) = sgn _m i₌₁ αiy(i)(x(i))Tx+b , (3.27)

which only depends on the inner product between the input vectorxand the subset of training vectorsx(i₎ _{for which}_α

i = 0. Moreover, from (3.25) it turns out that this

subset only contains those training examples, known as Support Vectors (SVs), lying within the margin or in the region of the hyperspace belonging to the other class: as a consequence, the actual complexity of the prediction process only depends on the number of support vectors. The inner product (x(i)₎T_x₊_b _{in (3.27) can be replaced}

by particular non linear functions k(x(i)_,_x_{), known as}_kernels_{, which correspond to}

scalar products between either linear or non linear transformations of their inputs. Substituting k(x(i)_,_x_{) in (3.27), we thus obtain the optimal prediction function in}

a non-linear feature space, rather than in input space:

f_w(x) = sgn _m i=1 αiy(i)k(x(i),x) . (3.28)

Finally, as the classiﬁcation problem studied here involves |C| = 4 classes, we adopted a generalization of the standard SVM known as multiclass SVM [77], which works by reducing the multiclass problem into a number of binary problems. This generalization, known as one-vs-one classiﬁcation, works by building a set of

|C|(|C| −1)/2 binary classiﬁers, and then selecting the class that is assigned by the majority of the classiﬁers.

3.4.3 k-Nearest Neighbor

In k-Nearest Neighbor (k-NN), an example x is classified under the most common class among itsk nearest neighbors: k-NN is a type of lazy learning, since a learning phase is not needed at all. The neighbors are taken from a set of examples for which the class is known: this set can be thought of as the training set for the algorithm. In other words, the training phase of the algorithm simply consists in storing the training examples: once a test example is provided, the algorithm classifies it by assigning the most frequent class among its k nearest training examples. The pa- rameterkis chosen so as to balance the trade-off between reducing the effect of noise (largek) and avoiding the creation of indistinct boundaries between the classes (low

k).

In document Stochastic Optimization and Machine Learning Modeling for Wireless Networking (Page 63-67)