II. THESIS
5. CONTRIBUTIONS AND LESSONS LEARNED
5.1 Contributions
The following table presents the main contributions provided by this thesis.
Objective/Area Contributions
Intrusion detection -Proposal of a new model (ID-CVAE), which is essentially an unsupervised
technique trained in a supervised manner thanks to the use of class labels during training.
-First application, as far as we know, of a conditional VAE to perform intrusion detection.
-ID-CVAE integrates the intrusion label in the decoder layer, which results in a model that is less complex than an equivalent model based exclusively on a VAE and that achieves better detection performance.
-For ID-CVAE, the classification process requires a single training stage followed by as many test stages as there are distinct label values to predict. A VAE would require as many training and test stages as there are distinct label values. Since the training phase is the most costly, the efficiency gained by using a conditional VAE (CVAE) is clear.
-Used for classification, ID-CVAE obtains an accuracy above 80% in the NSL-KDD 5-label scenario, which is better than the values obtained with Random Forest, Linear SVM, Logistic Regression and MLP.
Type of traffic classification -The natural domain of a CNN, which is image processing, is extended to
Network Traffic Classification (NTC) in an easy and natural way. It is demonstrated that a CNN can be successfully applied to NTC, giving an easy way to extend the image-processing paradigm of CNNs to vector time-series data (in a similar way to previous extensions to text and audio processing).
-First application, as far as we know, of a CNN+RNN model to an NTC problem.
-It is shown that an RNN combined with a CNN provides better detection results than alternative algorithms without requiring the feature engineering that is usual when applying other models.
-A robust model is provided that gives excellent F1 detection scores on a highly unbalanced dataset with over 100 different classification labels. It works with a very small number of features and does not require feature engineering. The model is trained with high-level header-based data extracted from the packets. It does not need to rely on IP addresses or payload data, which are likely to be confidential or encrypted.
Traffic prediction -Results of applying machine learning techniques to forecast the on-off
activity state of IoT mobile devices, using time-series and non-time-series methods, are presented. Data from real IoT mobile devices is employed. The work provides new insights by comparing the results of time-series and non-time-series methods, applied to a large number of devices with very different connectivity behaviours.
-Data pre-processing to present the data in a form that could be used by both time-series and non-time-series methods.
-Test results were obtained with a specifically developed cross-validation process, applied to both time-series and non-time-series methods.
predicting on/off connectivity (a discrete variable), which constitutes a new starting point and a different issue itself, providing novel results.
-It is shown that a mixed method (ARIMAX) provides the best accuracy (93%) but requires a very long training time. ARIMA and some non-time-series methods (logistic regression and random forest) also perform very well, with accuracy over 90%. Given the higher computational requirements of ARIMAX compared to logistic regression, random forest and ARIMA, the latter methods would be a better choice for a production environment (as they provide similar practical accuracy with less computing time).
QoE estimation -It is demonstrated that a CNN can be applied to a time-series of samples (formed by aggregated information from network packets) to predict QoE for video transmission.
-The proposed model can be integrated into a network management system to monitor network quality (as observed by the end-user), which is an essential part of a self-adapting network (e.g. SDN, edge computing...). The model is applicable to a real-time environment (in time-steps of 1-second) and is able to predict video QoE for current and near-future video transmissions.
-The best proposed model combines CNN and RNN networks, with the CNN being the most critical piece. This is somewhat surprising given the time-series nature of the data, which was formed by adding 3 samples of elementary flows (which consist of aggregated information from network packets taken in a 1-second period).
-Excellent prediction performance is obtained with a small dataset when the labels are not extremely unbalanced.
Synthesize training data to improve classification
-First application, as far as we know, of a conditional VAE to generate fully synthetic network traffic data.
-Application of the synthetic data to an intrusion detection problem, which shows that the detection results improve substantially when an ML algorithm is trained with the new synthetic data. The improvement is greater with the synthetic data generated by the proposed method than with synthetic data generated by alternative SOTA over-sampling methods: SMOTE, ADASYN, …
-Innovative methods to assess the similarity of the probability distributions of features for the real and synthetic data are proposed.
-Analysis of different variants of a VAE architecture for the proposed model, providing an extensive study on the alternatives.
-The new method allows new samples to be synthesized knowing only the intrusion label to which the synthetic data should belong, with the advantage of not relying on specific samples associated with the labels. This association is usually noisy, and identifying a canonical set of samples associated with each label can be complex. Therefore, the proposed model streamlines the data generation process by basing it exclusively on the intrusion label.
Synthesize missing data -First proposal, as far as we know, of a feature reconstruction model using a
conditional VAE and first application for intrusion detection.
-General framework available for other areas that may need a technique for feature reconstruction and imputation of missing values
-Generative method that learns the probability distribution of the features conditioned on the label value. Inclusion of the label value to obtain the
from a particular set of labels, we generate training samples associated with that set of labels, replicating the probabilistic structure of the original data that comes from those labels. In this way, we obtain the probability distribution of P(X, Y) instead of P(X) (Section 3.1.3).
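The ID-CVAE classification procedure listed in the table above (a single training stage followed by one test pass per candidate label) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ToyCVAE` is a hypothetical stand-in with fixed linear weights rather than a trained network, and the decision rule (choose the label whose conditioned reconstruction is closest to the input) is assumed here, not taken from this chapter.

```python
import numpy as np


class ToyCVAE:
    """Hypothetical stand-in for an already trained conditional VAE."""

    def __init__(self, n_features, n_labels, latent_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.n_labels = n_labels
        # Fixed small random weights play the role of learned parameters.
        self.w_enc = 0.05 * rng.normal(size=(n_features, latent_dim))
        self.w_dec = 0.05 * rng.normal(size=(latent_dim, n_features))
        # The label enters only at the decoder, here as a per-label offset.
        self.label_offset = 3.0 * np.eye(n_labels, n_features)

    def encode(self, x):
        # Deterministic encoding (latent mean only, no sampling).
        return x @ self.w_enc

    def decode(self, z, label):
        # Reconstruction conditioned on a single candidate label.
        return z @ self.w_dec + self.label_offset[label]


def classify(model, x):
    """One decoder pass per candidate label; keep the closest reconstruction."""
    z = model.encode(x)
    errors = np.stack(
        [
            np.linalg.norm(model.decode(z, label) - x, axis=1)
            for label in range(model.n_labels)
        ],
        axis=1,
    )
    return errors.argmin(axis=1)


# Toy data: each sample lies close to the decoder offset of its true label.
rng = np.random.default_rng(1)
model = ToyCVAE(n_features=8, n_labels=5)
true_labels = rng.integers(0, 5, size=20)
x = model.label_offset[true_labels] + 0.1 * rng.normal(size=(20, 8))
predicted = classify(model, x)
```

The point of the sketch is the cost structure: the model is built once, and only the inexpensive decoder pass is repeated for each label value.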
5.2 Lessons learned
The following table presents the lessons learned from the research carried out for this thesis:
Objective/Area Lessons learned
Intrusion detection -An unsupervised algorithm (VAE) can be used for intrusion detection
providing better results than classic supervised ML methods: random forest, SVM, MLP and logistic regression.
-Very simple encoder and decoder networks (3 layers only) are enough to obtain the best results. Increasing the number of layers does not improve results.
-Using a conditional VAE instead of a VAE provides a clear advantage in the form of a faster classification algorithm.
-VAEs and conditional VAEs are more robust and easier to train than alternatives that do not use variational methods.
Type of traffic classification -In a prediction/detection problem related to time-series of vectors, in addition
to using an RNN, which is the natural choice given the time-series nature of the problem, it is a good strategy to include a CNN as an initial step in a deep learning architecture. A CNN, in addition to performing feature engineering, is able to extract time patterns that are useful when subsequently processed by the RNN.
-When a sequence of packet headers is used to predict the type of traffic of a network flow, it is not necessary to deal with the entire sequence; a small number of packets is enough to make the prediction with high accuracy.
Traffic prediction -One week of historical data is enough to provide good forecasts
-Independently of the method, the on-off connectivity from IoT devices presents a rich periodic structure, allowing good prediction results, even with short training data.
-The non-time-series methods require, in general, less training time than the time-series methods.
-From the results obtained, we expect future work to find additional predictors (covariates) that will probably improve the predictive power of the methods presented. One of these new predictors could be obtained by clustering the signals and using the cluster index as an additional predictor. The main problem will be the high computational demand of this task.
-It is interesting that logistic regression, which is a very simple model, can provide an accuracy comparable (in practical terms) to that of more sophisticated models (e.g. ARIMAX, Random Forest, ARIMA).
QoE estimation -It is possible to apply new deep learning models, originally developed mainly
for video, audio and language processing, to the prediction of QoE of transmitted videos.
-We extended the study with the inclusion of a GP Classifier that, being a non-parametric model, could make full use of the scarce data available. This inclusion provides a slight improvement in the prediction results. However, it requires much more memory and processing time, which makes it less
one-shot learning advances.
Synthesize training data to improve classification
-A VAE with an adequately tuned architecture provides better data over-sampling/synthesis results than alternative methods (SMOTE, ADASYN...)
-Very simple encoder and decoder networks (3 layers only) are enough to obtain the best results. Increasing the number of layers does not improve results.
-Using a conditional VAE instead of a VAE provides many advantages in terms of the ability to generate features conditioned on the classification labels.
-VAEs and conditional VAEs are more robust and easier to train than alternatives that do not use variational methods.
Synthesize missing data