II. THESIS

7. PAPERS SUMMARY

7.5 PAPER 5: VARIATIONAL DATA GENERATIVE MODEL FOR INTRUSION DETECTION

Title: Variational data generative model for intrusion detection
Journal: Knowledge and Information Systems
Impact Factor: 2.247
Quartile: Q2
Citations: 0 (https://scholar.google.es/citations?user=3RSZbOYAAAAJ&hl=es)
Status: Published, 13-December-2018
Link: https://doi.org/10.1007/s10115-018-1306-7

7.5.1 Objectives

To train an intrusion detection classifier, it is very important to have access to representative and balanced training data. This is usually difficult, since samples of network traffic are strongly biased towards normal traffic and traffic associated with intrusion events is hard to obtain. Given these difficulties, it is important to have a way to produce traffic samples associated with intrusion events, which are rare compared with normal traffic.

There are several classic techniques to over-sample the minority classes in order to obtain a more balanced dataset (e.g. SMOTE, ADASYN…). These techniques create new data points based on proximity to existing points of the same class. They rely on topological proximity and do not consider the probability distribution of the features for the different classes.
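As an illustration of what "based on proximity" means, the following minimal sketch interpolates between a minority sample and one of its nearest same-class neighbours, which is the core idea behind SMOTE-style methods. It is not the reference implementation of SMOTE or ADASYN; the function name `smote_like_sample`, the parameter `k` and the toy points are assumptions made only for this example.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a minority
    sample and one of its k nearest neighbours of the same class."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Nearest neighbours of `base` within the minority class (excluding itself).
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    target = rng.choice(neighbours)
    gap = rng.random()  # position along the segment base -> target
    return [b + gap * (t - b) for b, t in zip(base, target)]


# Toy minority class with two small clusters (values are illustrative only).
minority = [[0.10, 0.20], [0.15, 0.25], [0.90, 0.80], [0.85, 0.75]]
rng = random.Random(0)
synthetic = [smote_like_sample(minority, k=1, rng=rng) for _ in range(5)]
```

Every synthetic point lies between two existing points, so the method can only fill in the neighbourhood of samples it has already seen; it never uses the class-conditional distribution of the features.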

This work presents a new method to create synthetic data based on the probability distribution of the data conditioned on the class to which they belong. The method consists of a generative model based on a customized Variational Autoencoder (VAE). The VAE architecture has been modified into a novel conditional VAE that takes the class label as an additional input to the decoder.

The advantage of a conditional generative model is its capacity to create data using noise as input. Synthetic samples are therefore not created from their proximity to existing samples (a noisy and error-prone task) but by following a probability distribution for the minority class, which is learned during the training of the model.
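The generation step can be sketched as follows, under the assumption that a trained decoder is available as a function of a noise vector and a one-hot class label. Here `toy_decoder` is only a stand-in (a real decoder would be a trained neural network), and `LATENT_DIM` and `generate` are hypothetical names chosen for the example; only the five class labels come from the dataset description.

```python
import random

CLASSES = ["Normal", "DoS", "Probe", "R2L", "U2R"]  # NSL-KDD intrusion labels
LATENT_DIM = 4  # hypothetical latent size, chosen only for this example


def one_hot(label):
    """One-hot encode a class label."""
    return [1.0 if c == label else 0.0 for c in CLASSES]


def toy_decoder(z, label_vec):
    """Stand-in for a trained decoder: maps (noise, label) to features.
    It only shifts the noise by the class index so the example runs."""
    shift = label_vec.index(1.0)
    return [shift + 0.1 * v for v in z]


def generate(label, n, decoder, rng):
    """Draw n synthetic samples for `label` from noise and the label alone."""
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # z ~ N(0, I)
        samples.append(decoder(z, one_hot(label)))
    return samples


rng = random.Random(0)
u2r_samples = generate("U2R", 3, toy_decoder, rng)
```

Note that `generate` never looks at an existing training sample: any number of samples for a rare class can be requested by changing `n` and the label.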

classifier for intrusion detection. The work carries out an extensive comparison of the synthetic data produced by the new method with data produced by classic over-sampling techniques, showing that the proposed method performs better when its output is used as synthetic training data.

7.5.2 Datasets

For this work we have used the NSL-KDD [67] dataset, a classic intrusion detection dataset. It has 32 continuous and 3 categorical features, and an intrusion label with 5 values (Normal, DoS, Probe, R2L and U2R). The dataset is highly imbalanced.

We have performed an additional data transformation: scaling all NSL-KDD continuous features to the range [0,1] and one-hot encoding all categorical features. This yields a final dataset with 116 features: 32 continuous and 84 with values in {0,1}, the latter produced by one-hot encoding the three categorical features.
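The transformation can be sketched in a few lines. This is a minimal illustration with toy values, not the preprocessing code used in the paper; the helper names are assumptions. The final arithmetic uses the cardinalities 3, 11 and 70 reported for the protocol, flag and service features.

```python
def min_max_scale(column):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map every value to 0
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(value, categories):
    """Encode `value` as a 0/1 vector over the given category list."""
    return [1 if value == c else 0 for c in categories]


# Toy columns (illustrative values only).
duration = [0, 2, 10]
protocol = ["tcp", "udp", "icmp"]

protocol_values = sorted(set(protocol))  # ['icmp', 'tcp', 'udp']
scaled = min_max_scale(duration)
rows = [[s] + one_hot(p, protocol_values) for s, p in zip(scaled, protocol)]

# Feature count of the transformed dataset.
N_CONTINUOUS = 32
CARDINALITIES = {"protocol": 3, "flag": 11, "service": 70}
n_binary = sum(CARDINALITIES.values())   # 3 + 11 + 70 = 84
total_features = N_CONTINUOUS + n_binary  # 32 + 84 = 116
```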

The three categorical features (protocol, flag and service) have 3, 11 and 70 distinct values, respectively. The accuracy obtained when synthesizing these discrete features, taking the original ones as reference, depends heavily on the cardinality of the feature.
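One plausible way to measure this accuracy is to decode each one-hot block of a synthesized sample with an argmax and count how often the decoded category matches the original one. The sketch below assumes that procedure; the function names and the toy values are hypothetical and are not taken from the paper.

```python
def decode_block(block):
    """Index of the largest entry in a one-hot (or soft) block."""
    return max(range(len(block)), key=lambda i: block[i])


def categorical_accuracy(original_blocks, synthetic_blocks):
    """Fraction of samples whose decoded category matches the original."""
    matches = sum(
        decode_block(o) == decode_block(s)
        for o, s in zip(original_blocks, synthetic_blocks)
    )
    return matches / len(original_blocks)


# Toy example for a 3-valued feature such as protocol (illustrative values).
original = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
synthetic = [
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.60, 0.30, 0.10],  # decoded as category 0, original is category 2
    [0.90, 0.05, 0.05],
]
acc = categorical_accuracy(original, synthetic)  # 3 matches out of 4
```

With more categories there are more ways to pick the wrong one, which is consistent with the accuracy depending on the cardinality of the feature.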

We provide all results using the full original training dataset of 125973 samples and the full original test dataset of 22544 samples.

7.5.3 Models

The proposed architecture is a VAE that tries to recover an output identical to its inputs, the inputs being the network traffic features used to detect the intrusion class. It differs from a standard VAE in that the decoder receives an additional input: the one-hot encoded class label. Adding this input is critical and improves the model in two ways. It makes the data generation process easier, since generation is now conditioned on the class label, and it produces better synthetic data, which is closer to the original data in terms of its probability distribution conditioned on the class label.
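The pieces that distinguish this model from a plain autoencoder can be sketched without a deep learning framework. The sketch assumes the usual VAE formulation, in which the encoder outputs a mean and a log-variance, the latent sample is drawn with the reparameterization trick, and the loss adds a KL term to the reconstruction error. The function names and the toy encoder outputs are assumptions for illustration; only the concatenation of the one-hot label with the decoder input reflects the modification described above.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def decoder_input(z, label_index, n_classes=5):
    """Concatenate the latent sample with the one-hot class label."""
    label = [1.0 if i == label_index else 0.0 for i in range(n_classes)]
    return z + label


rng = random.Random(0)
mu, log_var = [0.3, -0.1], [0.0, 0.0]  # toy encoder outputs
z = reparameterize(mu, log_var, rng)
x_in = decoder_input(z, label_index=1)  # 2 latent values + 5 label values
kl = kl_to_standard_normal(mu, log_var)
```

Because the label enters only at the decoder, generating data for a chosen class at inference time requires nothing more than a noise vector and the desired label.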

To arrive at the proposed model, we have analyzed different VAE architecture variants, providing an extensive study of the alternatives.

Besides the proposal of a new architecture based on a conditional VAE we have used several machine learning techniques to demonstrate that the generated synthetic data can be used to

classification results obtained from the application of original and synthesized data to several classification algorithms.

To conclude the comparison, we have compared the results obtained with the new model against those of several well-known over-sampling methods: SMOTE, SMOTE-Borderline, SMOTE-ENN, SMOTE-Tomek and ADASYN.

7.5.4 Results/Conclusions

This work is the response to several challenges:

- Generate synthetic data for an intrusion detection dataset with many heterogeneous features, both continuous and discrete, and with a highly imbalanced distribution of intrusion labels.

This has been achieved by using a new generative model based on a conditional VAE.

- To show that the generated synthetic data have a probabilistic structure similar to that of the original data. Verifying this similarity is a hard problem, since it involves comparing the probability distributions of multivariate vectors (116 features) with non-Gaussian marginals (discrete and continuous features) and complex joint probability distributions. The challenge is twofold: obtaining the joint probability distributions and comparing them.

To handle these problems, we have proposed several methods based on extended histograms and on the comparison of classification results under different scenarios of training with original and synthetic data.

- To show that the synthetic data generated with the new architecture produce better results than synthetic data generated by state-of-the-art (SOTA) over-sampling methods.

This has been shown by comparing the accuracy and F1 classification results obtained with training data generated by several over-sampling algorithms, including the proposed one. We demonstrate that both accuracy and F1 improve when the new architecture is used.
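The two kinds of checks mentioned in the list above can be illustrated with a small sketch. It compares a single feature through plain normalised histograms and the total variation distance, which is a simplification chosen for this example and not the extended-histogram method of the paper, and it computes accuracy and a macro-averaged F1, where the averaging scheme is also an assumption. All names and toy values are hypothetical.

```python
def histogram(values, bins=10):
    """Normalised histogram of values assumed to lie in [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy feature values for original and synthetic data (illustrative only).
original_feature = [0.05, 0.15, 0.95]
synthetic_feature = [0.05, 0.95, 0.95]
distance = total_variation(
    histogram(original_feature, bins=2), histogram(synthetic_feature, bins=2)
)

# Toy predictions of a classifier trained on synthetic data.
y_true = ["Normal", "Normal", "DoS", "U2R"]
y_pred = ["Normal", "DoS", "DoS", "Normal"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

A per-feature comparison like this only looks at marginal distributions; it says nothing about the joint distribution, which is why the classification-based comparison is used as a complementary check.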

In document Novel applications of Machine Learning to Network Traffic Analysis and Prediction (Page 77-80)