Predicting network performance in IoT environments using LSTM

IT 20 087

Degree project, 15 credits, December 2020

Wietze Schelhaas

Department of Information Technology

Abstract

There are still many problems that need to be solved with Internet of Things (IoT) technology, one of them being performance assurance. To ensure a certain quality of service in an IoT environment, the network has to be monitored and actively measured. However, due to the limited computational resources of IoT nodes, active measurement is difficult to achieve without also inducing energy and network overhead. A potential solution to this problem is to apply a machine-learning algorithm to predict network performance metrics such as round-trip time or packet loss. Substituting active performance measurements with a machine-learning algorithm reduces the overhead that those measurements create.

Previous research has revolved around applying traditional machine-learning algorithms to wireless sensor network features, such as packet statistics and topological information of the network, to predict round-trip time. The purpose of this thesis is to use a more advanced deep learning algorithm, namely long short-term memory (LSTM), to try to exploit time dependencies in the data. Three different datasets containing network statistics are used in three different experiments. In every experiment, LSTM models with different configurations are created, and their prediction capabilities are compared to traditional neural networks with equivalent configurations. In all experiments, the LSTM model and its equivalent neural network model produced similar results, meaning that a time dependency in the data could not be proven.

Examiner: Johannes Borgström. Subject reviewer: Christian Rohner. Supervisor: Andreas Johnsson.

1 Introduction

This thesis reports on research about predicting network performance in wireless sensor networks. Previous research consists of collecting data on wireless sensor networks, focusing on the topological structure and the quality of the network in the form of the nodes’ signal strengths and LQI (link quality indicator). Different machine learning techniques are then applied to this data and used to predict future network performance. This section gives a brief motivation and states the research goals.

1.1 Problem specification

Internet of things is an emerging technology in many enterprises. The ability of internet-connected devices to communicate with each other is very attractive for business. Nonetheless, there are still many problems that need to be dealt with.

One problem is related to performance assurance. Guaranteeing quality of service is a necessity in some networked applications, such as health monitoring. In order to achieve this, the network performance needs to be measured.

Measuring network performance in wireless sensor networks has proven to be difficult because active measurements using these nodes cause a large network overhead [1]. Previous research has explored the idea of using machine learning to predict network performance metrics such as packet loss or round-trip time [2]. For this purpose, wireless sensor network data has been gathered.

Traditional techniques such as random forest and linear regression have been utilized on that data for prediction. More concretely, a machine learning model M is given feature set X as input and uses this to predict Y in the form of network performance such as packet loss or round trip time.

$$M : X_t \rightarrow Y_t$$

This thesis investigates the use of a more advanced machine learning model, namely the LSTM (long short-term memory) model, on the same gathered data. This deep learning algorithm performs particularly well on sequential data such as time series. One of the main aims of this thesis is to find out whether there are any time dependencies in the already captured data, that is, whether the prediction of the network performance at the current time can be improved given network statistics of prior time steps. More concretely, can we improve the prediction of $Y_t$ at the current time $t$ given the current and prior feature sets $\{X_{t-n}, \ldots, X_{t-1}, X_t\}$?

$$M : \{X_{t-n}, \ldots, X_{t-1}, X_t\} \rightarrow Y_t$$

The problem discussed above is framed as the following research questions:

• Can we improve performance of existing ML-models for estimation of round trip times by exploiting temporal dependencies in IoT-network data?

• Can we use LSTM models to exploit potential dependencies?

• Is there a dependency from network statistics at time t − n to a sampled network performance at time t?

• What n optimizes these predictions?

2 Background

Below follows some background information about IoT systems and previous research that focuses on solving the problem addressed in section 1.1.

2.1 IoT networks

The Internet of Things and its applications continue to grow in various industries such as agriculture, food processing, environmental monitoring and more [3]. Every device or machine that benefits from an internet connection, and therefore from the ability to communicate with other such devices, could be used in an IoT system. By linking those devices with automation, it is possible for the system to produce a desirable action for a user without human intervention.

A typical IoT system is shown in figure 1. It consists of a wireless sensor network (WSN) that has nodes as the wireless communicating "things" in the system, and a gateway (GW) which connects the wireless sensor network to a network infrastructure which, in turn, provides a link to customers.

One interesting but futuristic use case for IoT is real-time heart attack detection using smartwatches [4]. A heart rate monitor in the watch would detect a heart attack, and GPS data would be sent to the nearest ambulance. Cars in the vicinity would be notified seamlessly so that they can avoid colliding with the ambulance. For this, all cars and smartwatches have to be connected to the internet.

Figure 1: A typical IoT system from [2]

2.1.1 Wireless sensor network

One foundational technology for IoT is wireless sensor networks. These are networks of intelligent nodes that are dedicated to monitoring and recording the physical conditions of the environment around them. Wireless sensor networks are particularly useful in remote areas with hazardous conditions [5]. Besides environmental monitoring, wireless sensor networks can be applied to monitoring in areas such as healthcare, industry and traffic.

These applications and scenarios all have different requirements for ensuring the quality of service, and they depend on what the WSN is trying to achieve.

For example, in an area such as healthcare monitoring, where sensors are used to monitor conditions like Alzheimer’s disease and heart attacks, it is imperative that reliability and timeliness on the network are always achieved. However, in other contexts, mobility or security can be of greater importance. To ensure the required quality, the network has to be monitored and evaluated.

2.1.2 RPL routing

RPL (Routing Protocol for Low-Power and Lossy Networks) [6] is an IPv6 routing protocol for networks that are characterized by high loss rates, low data rates, and instability. These issues are generally caused by nodes that operate with constraints such as low processing power or a limited battery. RPL generates a directed acyclic graph that serves as the routing topology. This graph has a root node that is usually a router with an internet connection. All nodes have an associated rank value which is determined by how far, topologically, the node is from the root node. The rank value increases with the number of nodes between a node and the root node. The nodes take this rank value into consideration whenever data needs to be sent over the network by picking the neighboring node with the lowest rank value.
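As a rough illustration of the rank idea, the sketch below assigns every node a rank equal to its hop distance from the root and lets a node forward data via the neighbour with the lowest rank. This is a deliberate simplification: real RPL derives rank through an objective function, and the topology and node names used here are hypothetical.

```python
from collections import deque


def hop_ranks(neighbours, root):
    """Simplified rank: number of hops from each node to the root (BFS)."""
    ranks = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in neighbours[node]:
            if nbr not in ranks:
                ranks[nbr] = ranks[node] + 1
                queue.append(nbr)
    return ranks


def next_hop(node, neighbours, ranks):
    """Forward towards the root via the neighbour with the lowest rank."""
    return min(neighbours[node], key=lambda nbr: ranks[nbr])


# Hypothetical topology (node names are made up for this example).
topology = {
    "root": ["a", "b"],
    "a": ["root", "c"],
    "b": ["root", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
ranks = hop_ranks(topology, "root")
print(ranks)
print(next_hop("d", topology, ranks))
```

In this toy topology the rank grows by one for every extra hop away from the root, so a node that always forwards to its lowest-ranked neighbour moves its data one step closer to the root each time.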

2.2 Performance measurements

Active measurements have proven to be problematic since the nodes in wireless sensor networks have very limited resources in terms of battery capacity, memory size and computing power. When the frequency of measurements increases, a problematic overhead emerges [1].

The need for measuring performance comes from a service-level agreement (SLA). This is a special kind of contract between customer and service provider which guarantees that the service the customer is paying for is also received[3].

An SLA also lays out the metrics which are used when measuring the performance of the service. To make sure the contract is followed, these metrics have to be mutually agreed upon so that they can be measured and reported by both parties. This thesis is only concerned with round-trip time and packet loss. Below follows an explanation of what they are and how they are measured.

2.2.1 TWAMP

Two-Way Active Measurement Protocol (TWAMP) is an active measurement protocol for two-way metrics such as round-trip time [7] [8]. It operates on the two nodes between which the round-trip time is calculated. Moreover, there is no need to synchronize the hosts’ clocks when only the round-trip time is calculated. A Session-Controller and a Session-Reflector decide which TWAMP test to run and exchange packets accordingly. For example, when the round-trip time is calculated, the controller transmits a special TWAMP probe packet to the reflector and records a timestamp while doing so. The reflector sends a response packet, and another timestamp is recorded at the controller upon receiving the response. The difference between these two timestamps is the round-trip time. Figure 2 shows how TWAMP is used to calculate different metrics. Four timestamps are recorded during the entire interaction: t1 is the time when the TWAMP controller initiates the exchange of the TWAMP packets, t2 is the time the reflector receives the controller’s packet, t3 is the time at which the reflector reflects the packet, and t4 is the time when the controller receives the reflector’s packet. Using these timestamps, different metrics can be calculated. For example, the one-way delay is the difference between t2 and t1, and the round-trip time is the difference between t4 and t1.

Figure 2: Illustration of how TWAMP calculates different metrics between two nodes [2]
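The timestamp arithmetic described above can be sketched in a few lines. The class and function names and the example timestamps (in milliseconds) are hypothetical and chosen only for illustration; they do not come from a real TWAMP implementation.

```python
from dataclasses import dataclass


@dataclass
class TwampTimestamps:
    t1: float  # controller sends the probe packet
    t2: float  # reflector receives the probe packet
    t3: float  # reflector sends the response packet
    t4: float  # controller receives the response packet


def round_trip_time(ts):
    """Round-trip time as defined in the text: t4 - t1.

    Both timestamps are taken at the controller, so no clock
    synchronization between the two nodes is required.
    """
    return ts.t4 - ts.t1


def one_way_delay(ts):
    """Forward one-way delay: t2 - t1 (needs synchronized clocks)."""
    return ts.t2 - ts.t1


# Made-up timestamps in milliseconds.
sample = TwampTimestamps(t1=100.0, t2=112.5, t3=113.0, t4=126.0)
print(round_trip_time(sample))  # 26.0
print(one_way_delay(sample))    # 12.5
```

Because t1 and t4 come from the same clock, the round-trip time stays meaningful even when the reflector's clock is offset, whereas the one-way delay mixes two clocks and is only meaningful if they are synchronized.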

2.2.2 ICMP

Similar to TWAMP, the Internet Control Message Protocol (ICMP) can be used for active delay measurements. When data propagates through a link, going from one router to the next until it reaches its destination, many things can go wrong.

If for some reason a packet gets dropped, the host needs to become aware of this. Besides active measurement, ICMP also provides a way to inform network nodes about errors that might occur. If a packet is lost, the router informs the transmitting host of the loss and its cause through an ICMP packet. ICMP packets are encapsulated in IP datagrams for transmission, but ICMP is not considered a higher-level protocol like TCP or UDP. The ICMP header has a simple structure. Its most important field is the type field, which signifies the error that occurred. Most common are ICMP packets with the type value set to 3, which signals that the destination is unreachable. This could be due to many things, for example that the destination is located in a network that is not reachable or that no device is listening on the address [9].
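To make the header structure concrete, the following sketch decodes the first four bytes of an ICMP message (type, code and checksum) using Python's `struct` module. The lookup tables list only a handful of well-known values, the function name is hypothetical, and the example message is built by hand with a placeholder checksum rather than captured from a network.

```python
import struct

# A few well-known ICMP type and code values; the tables are not exhaustive.
ICMP_TYPES = {0: "echo reply", 3: "destination unreachable",
              8: "echo request", 11: "time exceeded"}
UNREACHABLE_CODES = {0: "network unreachable", 1: "host unreachable",
                     3: "port unreachable"}


def parse_icmp_header(packet):
    """Decode the first four bytes of an ICMP message: type, code, checksum."""
    icmp_type, code, checksum = struct.unpack("!BBH", packet[:4])
    description = ICMP_TYPES.get(icmp_type, "other")
    if icmp_type == 3:
        description += ": " + UNREACHABLE_CODES.get(code, "other reason")
    return {"type": icmp_type, "code": code, "checksum": checksum,
            "description": description}


# Hand-built example message: type 3, code 1, placeholder checksum of 0,
# followed by four unused bytes. Not captured from a real network.
example = struct.pack("!BBH4s", 3, 1, 0, b"\x00" * 4)
header = parse_icmp_header(example)
print(header["description"])
```

For a type 3 message, the code field narrows down the reason, which corresponds to the two examples in the text: an unreachable network or an unreachable host.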

2.3 Machine learning and deep learning

Machine learning is a concept where a machine learns to solve a given problem without it being explicitly programmed to. Special machine-learning algorithms are fed training data and its desired output. Through learning algorithms, a model is built. Learning consists of finding a mapping from the training data to the desired output. The desired output for the training data is known and whenever the model makes a prediction this prediction is compared to the desired value and its mistakes will be corrected. This type of machine learning is called supervised learning [10] since the correction part of the algorithm can be seen as a teacher supervising the whole process. The model is then used to make predictions on newly introduced input data.

For this thesis, data has already been gathered, so the focus is on different machine-learning models. Below follows a description of the models that are used.

2.3.1 Neural networks

A neural network is based on a set of connected artificial neurons. An artificial neuron is also called a perceptron. A perceptron is a simple computing unit and the smallest component of the network. It can classify inputs into one of two classes by drawing a line through the input space that separates the data. The inputs to the perceptron have weights attached to them. The computation is divided into two parts. First, a weighted sum is calculated from the input vector and its corresponding weights, which are associated with the links between the inputs and the perceptron; see figure 3.

Figure 3: A perceptron with n inputs.

$$s_i = \sum_{j=1}^{n} w_j x_j \qquad (1)$$

The output of this linear combination is then sent as input to an activation function φ that squashes its input to a value between 0 and 1, such as the sigmoid function; see equation 2. As the name suggests, the result serves as the neuron’s activation value a_i, with 1 being fully activated. The activation function also introduces non-linearity into the network, which is essential for the network to work properly; without it, the network could perform only linear mappings [11].

$$\varphi(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$

$$a_i = \varphi(s_i) \qquad (3)$$

This activated value is the prediction of the perceptron. Before the learning process has started, the weights would be randomized so the prediction would also be a random one. The perceptron needs to learn to make better predictions and it does this by adjusting its weights. This is done using this formula:

$$w_j = w_j + \alpha \cdot x_j \cdot \mathrm{Err} \qquad (4)$$

where α is known as the learning rate, w_j is the weight associated with input x_j, and Err is the error produced by the perceptron, which is given by:

$$\mathrm{Err} = y - a_i \qquad (5)$$

where a_i is the prediction of the perceptron and y is the actual value.

The reason that the inputs are weighted is that some of the inputs contribute more into making a decision than others.
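The perceptron equations can be put together in a short sketch. The training data (logical OR with a constant bias input), the learning rate and the number of passes over the data are assumptions made for this example, and the weights start at zero rather than at random values so that the run is reproducible.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, inputs):
    s = sum(w * x for w, x in zip(weights, inputs))  # weighted sum, equation 1
    return sigmoid(s)                                # activation, equations 2 and 3


def update(weights, inputs, target, alpha):
    err = target - predict(weights, inputs)          # equation 5
    return [w + alpha * x * err for w, x in zip(weights, inputs)]  # equation 4


# Toy data (assumption): logical OR; the first input is a constant 1 acting as a bias.
samples = [([1.0, 0.0, 0.0], 0.0), ([1.0, 0.0, 1.0], 1.0),
           ([1.0, 1.0, 0.0], 1.0), ([1.0, 1.0, 1.0], 1.0)]

weights = [0.0, 0.0, 0.0]  # zeros instead of random values, for reproducibility
for _ in range(1000):
    for inputs, target in samples:
        weights = update(weights, inputs, target, alpha=0.5)

for inputs, target in samples:
    print(inputs[1:], round(predict(weights, inputs)), target)
```

OR is linearly separable, so a single perceptron is enough here; the same loop would not converge to a correct classifier for a function such as XOR.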

Perceptrons suffer from major limitations. It can be shown that a single perceptron can only represent linearly separable functions [11], which means that it can only classify data whose two classes can be separated by a straight line (or, in higher dimensions, a hyperplane). This issue is solved by creating a network of layers of perceptrons. This is what is referred to as a multilayer perceptron or neural network. Figure 4 shows an example of such a network. Every perceptron is connected to every perceptron in the next layer. For the hidden and output layers the input comes from the computation of the previous layer, and all these edges have weights attached to them just like the inputs of any other perceptron. The learning process is considerably more difficult for a neural network since the output now depends on the weighted summation and activation of all previous layers.

Figure 4: A neural network with an input layer, hidden layer and output layer.

2.3.2 Backpropagation

In order for a neural network to make better predictions, the network has to learn. This learning process is similar to the learning process of a perceptron, in the sense that the weights of every neuron have to be adjusted in such a way that the prediction error is minimized. This process is a lot more complicated when there are multiple layers of perceptrons. The error cannot be calculated until every neuron has finished its calculations and forwarded its activation value all the way to the output layer, where the prediction is made and can be compared to the actual value. In order for the weights of the previous layers to be updated, the error value has to be propagated back into the network. This process is called backpropagation [12].
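A minimal sketch of backpropagation for a network with one hidden layer and a single output neuron is given below. It assumes sigmoid activations, a squared-error loss and no bias terms; the weights, the training sample and the learning rate are hypothetical values chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Forward pass: inputs -> sigmoid hidden layer -> single sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def loss(w_hidden, w_out, x, y):
    _, output = forward(w_hidden, w_out, x)
    return 0.5 * (y - output) ** 2


def backprop(w_hidden, w_out, x, y):
    """Gradients of the loss with respect to every weight, via the chain rule."""
    hidden, output = forward(w_hidden, w_out, x)
    # Error signal at the output neuron (derivative of sigmoid is a * (1 - a)).
    delta_out = (output - y) * output * (1.0 - output)
    grad_out = [delta_out * h for h in hidden]
    # Propagate the error signal back through the output weights.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


# Hypothetical weights and a single training sample.
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.5, -0.6]
x, y = [1.0, 0.5], 1.0

grad_hidden, grad_out = backprop(w_hidden, w_out, x, y)
alpha = 0.5  # learning rate
new_hidden = [[w - alpha * g for w, g in zip(wr, gr)]
              for wr, gr in zip(w_hidden, grad_hidden)]
new_out = [w - alpha * g for w, g in zip(w_out, grad_out)]
print(loss(w_hidden, w_out, x, y), loss(new_hidden, new_out, x, y))
```

The hidden-layer error signal is computed from the output error signal, which is the "propagating back" step: each hidden neuron receives a share of the output error in proportion to the weight that connects it to the output.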

2.3.3 Deep learning

Deep learning is a special class of machine learning that tries to mimic the human brain even further by using an increased number of hidden layers [13], which allows machine-learning algorithms to solve more complex problems. Even though research on deep learning has been around for a long time, it has only recently become popular. The most important reason is that current hardware can now provide the resources that deep learning algorithms require.

2.3.4 RNN

A recurrent neural network (RNN) is a special kind of deep learning model. Recurrent neural networks address a problem that is very difficult for normal neural networks: predicting sequential data. RNNs are normal neural networks with added memory. They remember what they have seen in the past by feeding the output of some layers back as input to a previous layer. This is illustrated in figure 5. The idea behind this is that some things are hard to predict if you do not know the data that came before. For example, if you try to predict the next word in a sentence, it is very useful to know the words that came before it.

Figure 5: Recurrent neural networks have feedback loops

An RNN has a state called the hidden state, denoted as h_t in figure 5. This state serves as the input to the previous layer. Once the entire sequence has run through the RNN, the hidden state becomes the output, that is, the predicted class for classification or the predicted value for regression. The calculations done inside an RNN unit are very simple. The input x_t is concatenated with the output of the prior time step h_{t−1}, weighted, and sent through a tanh activation function. The result is the new output. This is illustrated in figure 6.
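The computation inside one RNN unit can be sketched as follows. The bias term, the layer sizes and all numeric values are assumptions made for this example.

```python
import math


def rnn_step(weights, bias, h_prev, x_t):
    """One RNN unit: h_t = tanh(W . [h_{t-1}, x_t] + b)."""
    concat = h_prev + x_t  # list concatenation of previous hidden state and input
    return [math.tanh(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(weights, bias)]


def rnn_forward(weights, bias, sequence, hidden_size):
    """Run a whole sequence through the unit; the last hidden state is the output."""
    h = [0.0] * hidden_size
    for x_t in sequence:
        h = rnn_step(weights, bias, h, x_t)
    return h


# Hypothetical sizes and values: 2 hidden units, 1 input feature per timestep.
weights = [[0.5, -0.3, 0.8],
           [0.1, 0.7, -0.4]]   # one row per hidden unit, columns match [h_prev, x_t]
bias = [0.0, 0.1]
sequence = [[1.0], [0.5], [-0.2]]
print(rnn_forward(weights, bias, sequence, hidden_size=2))
```

The same weights are reused at every timestep; only the hidden state changes as the sequence is consumed.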

Figure 6: Inside an RNN unit

Recurrent neural networks suffer from the vanishing gradients problem [14]. Every time the gradient is propagated back to the previous layer, it becomes smaller. Having many layers means that the gradient could end up so small that, when it has been backpropagated all the way to the input layer, the weights are updated with values so small that they barely change, and therefore no learning is done [15]. When unrolled over time, an RNN has as many layers as there are steps in the input sequence. This becomes evident when an RNN is represented as in figure 7. One could say that RNNs have short-term memory only, while sometimes more context is needed in order to accurately predict what comes next in a sequence.

Figure 7: Unrolled recurrent neural network
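The effect of sequence length on the gradient can be illustrated numerically with a scalar RNN, h_t = tanh(w · h_{t−1} + x_t). By the chain rule, the derivative of the last hidden state with respect to the initial one is a product of one factor per timestep, each equal to w · (1 − h_t²). The recurrent weight and the input values below are hypothetical.

```python
import math


def gradient_through_time(w, inputs):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return abs(grad)


grad_short = gradient_through_time(0.9, [0.5] * 5)   # sequence of 5 steps
grad_long = gradient_through_time(0.9, [0.5] * 50)   # sequence of 50 steps
print(grad_short, grad_long)
```

Every factor in the product is smaller than one in magnitude here, so the gradient that reaches the earliest timestep shrinks roughly exponentially with the length of the sequence.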

2.3.5 LSTM

As described in the previous section, recurrent neural networks struggle to find long-term time dependencies. In some machine-learning problems, we need time dependencies not only from recent data but also from further back. This is the motivation behind long short-term memory (LSTM) [15]. An LSTM model is a recurrent neural network that learns what data to forget and what data to remember when predicting. Figure 8 illustrates the insides of one LSTM unit. It is far more complex than a vanilla RNN unit, but decomposing it into its components makes it less intimidating. It consists of one added state, the cell state, and three gates. The key component of an LSTM is the cell state. This state can be seen as the long-term memory of the LSTM unit, and it is represented as a matrix with a size depending on the number of neurons in the LSTM unit. These neurons are not drawn in figure 8, but can be seen as a hidden layer in a normal neural network. The gates interact with the cell state, deciding what information to forget and what new information to add to the cell state. In figure 8, the boxes which contain σ and tanh are activation functions. As in normal neural networks, the inputs to these functions are weighted sums of a feature vector. The + and X boxes are element-wise addition and multiplication, respectively. The concat circle concatenates the previous hidden state h_{t−1} with the input of the current time step x_t; the result is then sent to all gates. Below follows a detailed explanation of what these gates do and what their purpose is.

Forget gate

The forget gate identifies previous information that is no longer needed, so that it can be dropped from the cell state. A concatenation of the previous hidden state and the current input is weighted and sent through a sigmoid activation function. As mentioned before, the sigmoid function squashes its input to a value between 0 and 1, where 0 means forget completely and 1 means remember everything. This value is then multiplied with the previous cell state. This can be seen as asking: given the input x_t at the current timestep, how relevant is the information we already know?

Input gate

The input gate updates the cell state with new information. Again, a concatenation of the previous hidden state and the current input is weighted and sent through a sigmoid activation function. The same concatenation is also weighted and sent through a tanh activation function. The sigmoid activation decides what information to keep: 1 means important and 0 means not important. The tanh activation creates a candidate cell state $\tilde{c}_t$. After this, the candidate cell state $\tilde{c}_t$ and the result of the sigmoid activation are multiplied element-wise so that only the information that the LSTM unit deems useful is kept. This information is added to the output of the forget process.

Output gate

The new cell state contains information about what should be forgotten or kept from prior steps and what should be added as useful new information. This new cell state is used in the output gate to calculate the LSTM unit’s hidden state, which is also its output. The same concatenation is again weighted and sent through a sigmoid function, and the result is multiplied element-wise with the cell state after the cell state has been sent through a tanh function. Since the cell state contains what should be kept and what should be forgotten given previous and current input, multiplying it with the sigmoid-activated previous hidden state and current input drops useless information and preserves useful information. This is the output of the LSTM unit and can be used to make a prediction. The prediction will only be as good as the ability of the LSTM to find time dependencies, that is, to forget and preserve the right information. This is learned during training by adjusting the weights of all activation functions in such a way that the error is minimized.

All these steps are done at every timestep. Since an LSTM is a recurrent neural network, its output becomes input in the next timestep. This is illustrated in figure 9.

Recurrent neural networks such as the LSTM model can be described mathematically using the following formula:

$$M : \{X_{t-n}, \ldots, X_{t-1}, X_t\} \rightarrow Y_t$$

Here $\{X_{t-n}, \ldots, X_{t-1}, X_t\}$ consists of the data of the current and all previous n timesteps, which is used as explained above to generate a prediction $Y_t$.

Figure 8: Insides of one LSTM unit
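The three gates can be written out as a single function that performs one LSTM timestep. This is a sketch of the commonly used LSTM formulation, not necessarily the exact variant used in the experiments; the bias terms, the parameter names (`W_f`, `b_f` and so on), the helper `constant_params` and all numeric values are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(weights, bias, vector, activation):
    """Weighted sum of the concatenated vector followed by an activation."""
    return [activation(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]


def lstm_step(p, h_prev, c_prev, x_t):
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    f = gate(p["W_f"], p["b_f"], concat, sigmoid)           # forget gate
    i = gate(p["W_i"], p["b_i"], concat, sigmoid)           # input gate
    c_tilde = gate(p["W_c"], p["b_c"], concat, math.tanh)   # candidate cell state
    c_t = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_tilde)]
    o = gate(p["W_o"], p["b_o"], concat, sigmoid)           # output gate
    h_t = [oj * math.tanh(cj) for oj, cj in zip(o, c_t)]    # new hidden state
    return h_t, c_t


def constant_params(hidden_size, input_size, value):
    """Hypothetical helper: every weight set to `value`, every bias to zero."""
    p = {}
    for name in ("f", "i", "c", "o"):
        p["W_" + name] = [[value] * (hidden_size + input_size)
                          for _ in range(hidden_size)]
        p["b_" + name] = [0.0] * hidden_size
    return p


params = constant_params(hidden_size=2, input_size=3, value=0.1)
h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in [[0.2, 0.1, 0.4], [0.3, 0.0, 0.5], [0.1, 0.2, 0.6]]:
    h, c = lstm_step(params, h, c, x_t)
print(h, c)
```

The cell state `c_t` is only ever scaled by the forget gate and incremented by the gated candidate, which is what lets information survive across many timesteps, while the hidden state `h_t` is a filtered view of the cell state that is passed on as output.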

2.4 Related work: ML for IoT performance management

Machine learning can be used to address the large overhead of active measurements by letting a machine-learning model predict the performance of the network. Research around this idea has been done before; the sections below summarise some papers written on the subject.

Figure 9: The output of the LSTM unit becomes input in the next time step.

2.4.1 Machine-Learning Based Active Measurement Proxy for IoT Systems

One candidate solution discussed in this paper is to place a proxy in a wireless sensor network that handles measurement packets [2]. When a sensor network node wants to perform measurements, for example using TWAMP to calculate the round-trip time, a proxy intercepts those packets and acts as the answering node. The proxy predicts the performance using machine learning and forges a response. The proxy bases its response on data it has gathered from the network using a special Feature Monitoring module. If a packet loss is predicted, the proxy simply refrains from sending a response back. To reduce the cost of transferring feature data across the network, feature reduction is used.

An active measurement proxy is installed at a testbed located at Uppsala University to run experiments and evaluate its capabilities. The network consists of 18 motes (wireless sensor nodes) running the Contiki 3.0 OS, all connected to Raspberry Pis for accurate round-trip time calculation. Special jam motes are used to reproduce interference patterns in order to more accurately simulate a real environment when capturing the data later used for the machine learning models. The gathered features contain information about the quality of the communication, such as RSSI (Received Signal Strength Indicator) and LQI (Link Quality Indicator), but also topological information such as geographical coordinates and the number of motes within radio reach of a certain device.

2.4.2 Predicting Round-Trip Time Distributions in IoT Systems using Histogram Estimators

This paper [16] extends the work described in the previous section by predicting the distribution of the round-trip time instead of discrete values. The same feature data is used as before to model a conditional density function using a histogram estimator. Statistics such as the mean and quantiles can be derived from this distribution and then used to determine probabilistic bounds.

2.4.3 Predicting service metrics for cluster-based services using real-time analytics

This paper [17] discusses predicting the performance of cloud services using statistical learning. Understanding and predicting the performance of cloud services is difficult, since it requires a thorough understanding of the components of such services and their interactions. Instead, a machine can learn the complex behavior of such a system and predict future performance. Device statistics are gathered, using an analytics engine, from a server cluster that runs a video-on-demand (VoD) service. The objective is to use these statistics to predict video frame rate and audio buffer rate.

2.4.4 Round-trip delay time as a linear function of distance between the sensor nodes in wireless sensor network

The work presented in [18] investigates how round-trip time is affected by the distance between sensor nodes in a wireless sensor network. The authors derive a linear function that predicts round-trip time from parameters such as the number of nodes on the round-trip path, the amount of traffic present in the network, and the throughput. To prove that a relationship exists between the round-trip time and the distance between the sensor nodes, all parameters except the distance are changed when applied to the linear function.

3 Testbed and data traces

This section describes the data used for the LSTM models. The data sets were created through the work in [19]. Below follows a detailed description of the data and its gathering process.

3.1 Testbed

The EWSN’17 testbed located at Uppsala University was used in [19] for data collection and to run experiments involving the reduction of overhead when performing active measurements. The network consists of 18 wireless sensor nodes, also called motes. These motes are specialized in gathering and processing sensory information. Figure 10 shows where the motes are located in the building.

This university building poses a challenging experimental environment, not only because a large number of people with mobile wireless devices enter and leave it but also because the walls contain thick stone and metal structures.

Figure 10: EWSN’17 testbed and its highlighted motes.

3.2 TWAMP for wireless sensor networks

A TWAMP protocol specialized for wireless sensor networks has been implemented as part of the work in [1] to perform active measurements between the motes in the testbed. TWAMP is used to calculate the round-trip time. One mote from [#125; #126; #134] is selected to serve as the controller, and one of the other motes is picked to serve as the reflector. The controller sends a TWAMP probe packet to the selected reflector every 6 seconds. This is repeated five times, after which a new reflector is chosen and the process is repeated. When a reflector receives a packet, information that can be used to predict the round-trip time is gathered from the network, along with the round-trip time calculated by TWAMP. This data is cleaned, pre-processed, and put into one file, creating a data trace. One sample in this data trace consists of the information gathered when an active measurement is performed between two motes. For this project, three different data traces, all containing data collected from the EWSN’17 testbed, are analyzed using machine learning models. The samples are sorted by time in ascending order; as mentioned before, there are 6 seconds between consecutive samples. In the following section, this data and its features are described in greater detail.

3.3 Data description

Every sample has a target Y, which in this thesis is the round-trip time of that sample. The features can be split up into three groups of statistics: basic, sensing, and topological.

Basic statistics

• src x, src y: geographical coordinates of the source node, the source node being the controller.

• node x, node y: geographical coordinates of the reflector node

• distance: the euclidean distance between the controller and the reflector node.

• hour: the hour of the day the reflector receives a packet.

• day: the day of the month (from 1 to 31)

• nodeID: The Unique ID of the reflector

• RTT: the round-trip time measured using TWAMP, this is the target feature.

• CH: the wireless channel that was used to transmit the packet.

• RSSI: Received Signal Strength Indication, indicates the signal strength received by the source node.

• LQI: Link Quality Indicator, a number that indicates how good a communication channel is for transmitting and receiving correct data.

Sensing statistics

• Light: the light intensity at the controller node.

Topological statistics

• Rank: the RPL rank value, as explained in section 2.1.2, from the controller to all other nodes.

• nbr: the number of topological neighbours each node has.

4 Method

This section describes how the data is split up and applied to machine learning models to address the questions raised in the problem specification; see section 1.1. An LSTM model is applied to data traces which contain statistics collected from a wireless sensor network, as explained in section 3. Experiments have been run on three different data traces using a number of different model configurations that predict round-trip times. The main question this thesis tries to answer is whether temporal dependencies exist in IoT network data.

This can be done in a number of ways, for example through auto-correlation models such as the Auto Regressive Integrated Moving Average (ARIMA) model, which captures how correlated a time series is with its past values. Other alternatives for finding time dependencies are shown in [20]. In this thesis another approach is taken. In all experiments, the predictive capabilities of the LSTM are compared to those of normal neural networks using the same configuration. If a time dependency is present in the data, the LSTM would be expected to perform better than the corresponding neural network. Below follows a more detailed description of how the experiments were conducted and how the models were implemented.
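For comparison with the approach taken in this thesis, the auto-correlation alternative mentioned above can be illustrated with a plain sample autocorrelation at a given lag. This is not the method used in the experiments and not a full ARIMA model; the function name and the round-trip times below are made up for the example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum((series[t] - mean) * (series[t + lag] - mean)
                     for t in range(n - lag))
    return covariance / variance


# Made-up round-trip times (ms) that alternate between a low and a high value.
rtt = [10.0, 20.0] * 10
for lag in (1, 2, 3):
    print(lag, autocorrelation(rtt, lag))
```

Values close to +1 or −1 at some lag indicate that past values carry information about the current one, whereas values close to 0 at all lags suggest that the series has little linear time dependency.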

4.1 LSTM

The data traces consist of rows of samples. One sample contains the network statistics recorded after one active measurement has been performed between two motes, as explained in section 3.2. Every sample has an associated round-trip time; the idea is that the statistics collected during the measurement can be used to predict it. One sample in the data trace is the input to the LSTM at a given timestep. For example, if the data trace contains n samples sorted in ascending order of time, the first sample enters the LSTM at timestep t = 1, the second at timestep t = 2, and so on until the last sample at timestep t = n.

The data traces used for this thesis have tens of thousands of samples. This means data has been collected across multiple days. It does not make sense to input a sequence of data with a time horizon of multiple days since there will most likely be no dependencies between the quality of the network now and for example 5 days ago.

The input to an LSTM layer is a matrix of size (n, m), where n is now the number of timesteps in one input sequence and m is the number of features, as explained in 3.2. The rows of this matrix are the samples and the columns are the features of those samples. At every timestep, the next row is fed to the LSTM, as illustrated in figure 12. Such a matrix can be seen as one subsequence of the data trace. The data trace has to be split up into subsequences of n × m matrices, which are then fed one by one into the LSTM model. One feedforward step consists of one such matrix going through the LSTM, with each row fed to its corresponding timestep. Figure 13 illustrates how the data trace is split up into matrices when n = 3. It is important to note that the target value for each such matrix is the round-trip time captured at the last sample in the matrix.
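The splitting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis code: it assumes that consecutive windows overlap and advance by one sample at a time (the stride is not stated in the text), and the names `make_windows`, `features` and `rtts` are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Sequence[float]  # one row of the data trace: m feature values


def make_windows(
    features: Sequence[Sample], rtts: Sequence[float], n: int
) -> Tuple[List[List[Sample]], List[float]]:
    """Split a time-ordered trace into subsequences of n rows each.

    Assumption: windows overlap and advance by one sample.
    The target of each window is the RTT of its last sample.
    """
    if len(features) != len(rtts):
        raise ValueError("features and rtts must have the same length")
    if n < 1 or n > len(features):
        raise ValueError("n must be between 1 and the trace length")
    windows: List[List[Sample]] = []
    targets: List[float] = []
    for end in range(n, len(features) + 1):
        windows.append(list(features[end - n:end]))  # rows end-n .. end-1
        targets.append(rtts[end - 1])                # RTT of the last row
    return windows, targets


# Toy trace: 6 samples with m = 2 made-up feature values each.
trace = [[float(i), float(10 * i)] for i in range(1, 7)]
rtt = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
windows, targets = make_windows(trace, rtt, n=3)
print(len(windows), targets)
```

Under the overlapping-window assumption, a trace of 6 samples with n = 3 yields 4 matrices of size 3 × m, whose targets are the round-trip times of samples 3 to 6.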

The LSTM model in this thesis predicts the round-trip time $Y_t$ given the samples $\{X_{t-n}, \ldots, X_{t-1}, X_t\}$:

$$M : \{X_{t-n}, \ldots, X_{t-1}, X_t\} \longrightarrow Y_t \qquad (6)$$

In equation 6 the feature set $\{X_{t-n}, \ldots, X_{t-1}, X_t\}$ is one n × m subsequence matrix as explained above, each $X_t$ being one sample. Since we do not know which n maximizes the prediction capabilities, the same model is run multiple times with increasing n. n starts at 2 and is increased by one in every iteration until it reaches 10. n is not increased further because of scalability issues: the training time grows considerably as n increases.

4.2 Neural network

The neural network model used in this thesis predicts round trip time and is used as a comparison to the predicting capabilities of the LSTM model. Since a neural network doesn’t have memory of previous timesteps it only takes into consideration the sample X_t at the current time step t when predicting round trip time Y_t:

$$M : X_t \longrightarrow Y_t \qquad (7)$$

For the neural network no restructuring of the data is needed; the samples are fed one by one through the network and the corresponding round-trip time is predicted.

4.3 Experiments

The machine learning models are built using Keras [21]. Figure 11 illustrates one of those networks. There is no rule of thumb for setting the hyperparameters of a machine learning model; some values work well on a given data set and others do not. For this thesis, experiments were run on a number of different LSTM and neural network architectures, which differ in the number of hidden layers and the number of neurons per layer. For all architectures, the last two dense layers shown in figure 11 stayed the same. These dense layers are simple non-recurrent neural network layers. The last one always has a single neuron with no activation function, since the output should be a single number, namely the predicted round-trip time.

As explained in section 3.2, three experiments were conducted, each with its own data trace. For each experiment, the architecture of the model is varied to have one, two or three LSTM layers before the last dense layers, with 32, 64, or 128 neurons per layer. The same data was subsequently run through an equivalent model in which all LSTM layers are replaced by ordinary non-recurrent layers, yielding a simple neural network. Since neural networks do not have memory as LSTMs do, the two models can be compared to see whether time dependencies exist in the data.
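To make the size of this search concrete, the configurations can be enumerated as follows. This is only a sketch of the experiment grid under the assumption that every combination of layer count, neuron count and n (2 to 10, as in section 4.1) is trained once per data trace; the names `experiment_grid` and `best_horizon` are hypothetical, and the NMAE values in the usage example are made up.

```python
from itertools import product

LAYER_COUNTS = (1, 2, 3)        # layers placed before the final dense layers
NEURON_COUNTS = (32, 64, 128)   # neurons per such layer
HORIZONS = range(2, 11)         # n = 2 .. 10, only used by the LSTM variant


def experiment_grid():
    """Return (lstm_runs, nn_runs) as lists of configuration dicts."""
    lstm_runs = [
        {"model": "LSTM", "layers": layers, "neurons": units, "n": n}
        for layers, units, n in product(LAYER_COUNTS, NEURON_COUNTS, HORIZONS)
    ]
    nn_runs = [
        {"model": "NN", "layers": layers, "neurons": units}
        for layers, units in product(LAYER_COUNTS, NEURON_COUNTS)
    ]
    return lstm_runs, nn_runs


def best_horizon(nmae_by_n):
    """Pick the n with the lowest NMAE from a {n: nmae} mapping."""
    n = min(nmae_by_n, key=nmae_by_n.get)
    return n, nmae_by_n[n]


lstm_runs, nn_runs = experiment_grid()
print(len(lstm_runs), len(nn_runs))

# Made-up scores for one architecture, for illustration only.
print(best_horizon({2: 0.60, 3: 0.50, 4: 0.55}))
```

Under this assumption each data trace involves 3 × 3 × 9 = 81 LSTM runs and 9 neural network runs, which is consistent with the remark that training time limits how far n can be increased.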

Figure 11: Typical LSTM model used in this thesis

4.3.1 Evaluation metric

The metric used for evaluating the accuracy of the machine learning models is the Normalized Mean Absolute Error (NMAE). It allows for easy comparison with previous work [19]. In the following equation, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for sample i:

$$\mathrm{NMAE} = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{\sum_{i=1}^{n} y_i} \qquad (8)$$
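As a small worked example of the NMAE metric defined in equation 8, the function below computes it for two lists of values. It is a plain-Python sketch; the function name `nmae` and the example numbers are illustrative and not taken from the experiments.

```python
from typing import Sequence


def nmae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Normalized Mean Absolute Error: sum |y_i - yhat_i| / sum y_i."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(y_true)
    if total == 0:
        raise ValueError("sum of actual values must be non-zero")
    abs_err = sum(abs(y - p) for y, p in zip(y_true, y_pred))
    return abs_err / total


# Made-up round-trip times, for illustration only.
actual = [2.0, 4.0]
predicted = [1.0, 5.0]
score = nmae(actual, predicted)  # (1 + 1) / (2 + 4)
print(score)
```

By this definition, an NMAE of 0.5 means that the summed absolute prediction error is half of the summed actual round-trip time.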

Figure 12: $S_n$ is the sample at timestep n; $f_{i,j}$ is feature j of sample i

Figure 13: Entire data-trace with 6 samples split up into subsequent matrices when n = 3

5 Result and Discussion

This section presents implementation details of the machine learning models, how the different experiments were performed, and the predictive capabilities of those models. The number n mentioned below always refers to the number of timesteps the LSTM model takes into consideration, as explained in the Method section.

5.1 Evaluation results

Tables 1, 2 and 3 show the predictive capabilities of the different LSTM and neural network architectures for the three experiments and their corresponding data traces. All three tables have the same format. The first column shows the number of layers used in the architecture, as explained in section 4.3. For every such architecture, the layers are set to have either 32, 64, or 128 neurons. The remaining columns present the NMAE of both the neural network (NN) and the LSTM model for every combination of layers and neurons. For the LSTM architecture, the table shows the lowest NMAE over all tested values of n. For the neural network architecture no such n is needed, since it does not take the features at previous timesteps into consideration. For example, in table 1 the LSTM architecture with 1 layer and 128 neurons scored an NMAE of 0.55, while the corresponding neural network scored an NMAE of 0.53.

The best NMAE over all LSTM architectures was produced in experiment 3 using 3 layers with 128 neurons each, which scored an NMAE of 0.50. Overall, the results of all experiments are quite similar, with the neural network models performing on par with or slightly better than the LSTM models in most configurations. Experiment 3 has the best overall result; the data trace used for experiment 3 contained the most samples of all three data traces.

Based on these experiments, we cannot conclude that a time dependency exists in the data. At least, the LSTM model is not able to capture one, since the neural network model, which does not take prior features into consideration, performs similarly. Figure 14 makes this more evident. The figure shows NMAE on the y-axis and n on the x-axis. If a time dependency existed, one would expect the error to decrease as n increases, or a particular n to perform clearly better than the others. Instead, the NMAE oscillates as n increases. The same pattern appears when plotting the other LSTM architectures in the same way; see the appendix for these plots.

Figure 14: change in NMAE for different time horizons.

5.2 Discussion

From the evaluation results, we cannot conclude that a temporal dependency exists since the LSTM model produced similar results to the Neural network.

This section discusses potential reasons and how to move forward.

• As mentioned in 3.2, one sample in the data traces is generated whenever a probe packet is received by a reflector. A problem occurs when probe packets get lost, since the reflector then does not generate a sample.

Probes being lost during transmission is quite common in wireless sensor networks. This implies that the time between samples is not constant, which is an issue when trying to find temporal dependencies in the data, especially if multiple consecutive probes get lost and the time between two samples grows. If this is the problem, a solution would be to acquire a data trace which does not contain any lost probes, or to sample the features independently of the received probe packets.

• Another problem might be that the probes are not sent frequently enough.

Data from the wireless sensor network was gathered every time a probe packet was received. In [19] probes were sent every 6 seconds, which means that data is gathered every 6 seconds, provided that the probe was successfully received by the reflector. The 6-second interval was chosen to avoid congestion, but valuable data could have been gathered within those 6 seconds. If this is part of the problem, it could be fixed by gathering new data traces in which probes are sent more frequently, or by sampling the features independently of the received probe packets.

• The approach taken in this thesis assumes that the time horizon is constant over the entire time the data trace is fed to the model. In reality, though, this might not be optimal. Sometimes the current network quality may depend on events that happened further back, and sometimes on events that occurred more recently.
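The first two points above concern irregular spacing between samples. A simple way to check a data trace for this is sketched below. It assumes that each sample carries a timestamp in seconds, which is a hypothetical field not listed among the features in this section; the 6-second period follows [19], while the function name `find_gaps` and the tolerance value are illustrative choices.

```python
from typing import List, Sequence, Tuple


def find_gaps(
    timestamps: Sequence[float], period: float = 6.0, tolerance: float = 0.5
) -> List[Tuple[int, float, int]]:
    """Return (index, gap, estimated_lost_probes) for every irregular gap.

    index is the position of the sample that follows the gap.
    Assumes timestamps are in seconds and sorted in ascending order.
    """
    gaps = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        # A gap well above one probe period suggests at least one lost probe.
        if gap > period * (1 + tolerance):
            lost = max(round(gap / period) - 1, 1)
            gaps.append((i, gap, lost))
    return gaps


# Made-up arrival times: one probe lost before index 3, two before index 5.
stamps = [0.0, 6.0, 12.0, 24.0, 30.0, 48.0]
irregular = find_gaps(stamps)
print(irregular)
```

Subsequences that span such a gap could then be dropped or treated separately, so that every n × m matrix fed to the LSTM covers roughly the same time span.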

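Table 1: Experiment 1

                    32 Neurons      64 Neurons      128 Neurons
Number of layers    NN     LSTM     NN     LSTM     NN     LSTM
1                   0.57   0.57     0.55   0.54     0.53   0.55
2                   0.56   0.66     0.52   0.65     0.51   0.56
3                   0.62   0.62     0.64   0.65     0.53   0.55

Table 2: Experiment 2

                    32 Neurons      64 Neurons      128 Neurons
Number of layers    NN     LSTM     NN     LSTM     NN     LSTM
1                   0.52   0.53     0.51   0.54     0.55   0.56
2                   0.53   0.53     0.54   0.55     0.57   0.55
3                   0.56   0.54     0.53   0.55     0.55   0.57

Table 3: Experiment 3

                    32 Neurons      64 Neurons      128 Neurons
Number of layers    NN     LSTM     NN     LSTM     NN     LSTM
1                   0.54   0.52     0.52   0.51     0.53   0.54
2                   0.54   0.51     0.51   0.52     0.55   0.53
3                   0.52   0.53     0.49   0.51     0.50   0.50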

6 Conclusion

In this thesis, neural network and LSTM models were developed and evaluated for predicting round-trip time in wireless sensor networks. Network attributes such as link quality and topology were used as features for these machine learning models. The models can be used in wireless sensor networks inside a special proxy mote, which intercepts measurement probes and predicts a desired network performance metric. For this thesis, the performance metric to be predicted is round-trip time.

Previous research has used traditional machine learning models to predict round-trip times, so one of the main research questions for this thesis was whether the prediction accuracy would increase when using more advanced models. LSTM models, which are good at processing sequential data, were used in the hope of exploiting temporal dependencies in the data traces. Three experiments were conducted, each with its own data trace containing numerous samples. The lack of difference between the non-recurrent neural network and the LSTM architectures does not indicate a time dependence. Experiment 3 produced the best result, with a Normalized Mean Absolute Error of 0.5. The data trace used in experiment 3 contained the most samples of all three data traces, which might be the reason that it performs better.

6.1 Future work

There are several ways in which this research could be extended:

• Mainly, new data traces have to be gathered using a higher probe frequency and, if possible, with less loss, to confirm or rule out the problems mentioned in the Discussion section, i.e. that because only one sample is generated every 6 seconds, we might lose valuable information in between.

• Second, experiments with different recurrent neural network models, such as vanilla recurrent neural networks or Gated Recurrent Units, could be conducted.

References

[1] Andreas Johnsson and Christian Rohner. On performance observability in IoT systems using active measurements. In NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, pages 1–5. IEEE, 2018.

[2] W. Yan, C. Flinta, and A. Johnsson. Machine-learning based active measurement proxy for IoT systems. In 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pages 198–206, 2019.

[3] Li Da Xu, Wu He, and Shancang Li. Internet of things in industries: A survey. IEEE Transactions on industrial informatics, 10(4):2233–2243, 2014.

[4] Samr Ali and Mohammed Ghazal. Real-time heart attack mobile detection service (rhamds): An IoT use case for software defined networks. In 2017 IEEE 30th Canadian conference on electrical and computer engineering (CCECE), pages 1–6. IEEE, 2017.

[5] Vasco Nuno Simões Pereira. Performance measurement in wireless sensor networks. Technical report, Universidade de Coimbra, 2016.

[6] Tim Winter, P Thubert, A Brandt, J Hui, R Kelsey, P Levis, K Pister, R Struik, JP Vasseur, and R Alexander. RPL: IPv6 routing protocol for low-power and lossy networks. IETF RFC 6550, 2012.

[7] Kaynam Hedayat, R Krzanowski, Al Morton, Kiho Yum, and Jozef Babiarz. A two-way active measurement protocol (TWAMP). RFC 5357, October, 2008.

[8] S Baillargeon, C Flinta, and A Johnsson. Ericsson two-way active measurement protocol (TWAMP) value-added octets. IETF RFC 6802, 2012.

[9] AG CoNe, Rechnernetze und Telematik, and Christian Schindelhauer. Internet control message protocol. Technical report, University of Freiburg, 2008.

[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.

[11] Stuart Russell and Peter Norvig. Artificial intelligence: a modern approach. Alan apt, 2002.

[12] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. ICS-8506, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[14] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[16] Christofer Flinta, Wenqing Yan, and Andreas Johnsson. Predicting round-trip time distributions in IoT systems using histogram estimators. In NOMS 2020-2020 IEEE/IFIP Network Operations and Management Symposium, pages 1–9. IEEE, 2020.

[17] Rerngvit Yanggratoke, Jawwad Ahmed, John Ardelius, Christofer Flinta, Andreas Johnsson, Daniel Gillblad, and Rolf Stadler. Predicting service metrics for cluster-based services using real-time analytics. In 2015 11th International Conference on Network and Service Management (CNSM), pages 135–143. IEEE, 2015.

[18] Ravindra N Duche and Nisha P Sarwade. Round trip delay time as a linear function of distance between the sensor nodes in wireless sensor network. International Journal of Engineering Sciences & Emerging Technologies, 1:2231–6604, 2012.

[19] Wenqing Yan. Machine learning for enabling active measurements in IoT environments. Technical report, KTH, 2018.

[20] Richard M Levich and Rosario C Rizzo. Alternative tests for time series dependence based on autocorrelation coefficients. WORKING PAPER SERIES-NEW YORK UNIVERSITY SALOMON CENTER S, 1999.

[21] François Chollet. keras. https://github.com/fchollet/keras, 2015.

Contents

1 Introduction

1.1 Problem specification

2 Background

2.1 IoT networks

2.2 Performance measurements

2.3 Machine learning and deep learning

2.4 Related work: ML for IoT performance management

3 Testbed and data traces

3.1 Testbed

3.2 TWAMP for wireless sensor networks

3.3 Data description

4 Method

4.1 LSTM

4.2 Neural network

4.3 Experiments

5 Result and Discussion

5.1 Evaluation results

5.2 Discussion

6 Conclusion

6.1 Future work

References