II. THESIS
3. RESEARCH CONTEXT AND RELATED WORKS REVIEW
3.2 SPECIFIC APPLICATION AREAS
3.2.5 Synthetic data generation
The main principle behind all ML models is that they learn from data rather than following predefined rules, as in the imperative programming paradigm. This is why large, representative datasets matter: the objective is to create algorithms that generalize to data outside the data used for training. A dataset is representative if it includes samples of all the behaviours that we try to model with our algorithm and avoids non-representative samples (noise). Since the behaviour of systems is often complex, their representative datasets are usually large. When acquiring a representative dataset is hampered by cost, time, privacy or technical difficulties, and we end up with small datasets or datasets that lack sufficient samples of under-represented behaviours, we need to consider the use of synthetic data.
There are three alternatives for creating a dataset that can be used for model training [120]:
• Real data: data produced by its normal generation environment; this is the data that we try to model with our ML algorithm.
• Semi-synthetic data: data generated by an artificial generation environment that tries to be similar to the normal generation environment of the data. The intention is to reproduce virtual entities (e.g. users, systems...) whose behaviour is similar to the real one, so that the data produced by the simulated environment is similar to, and representative of, the real data. The simulation can be based on physical entities (e.g. network, switches, computers...) or on software processes.
• Synthetic data: data created directly, without using a simulated generation environment. This data is created to be similar to the real data (e.g. in correlation, probability distribution, patterns...). In this case, we synthesize the data directly instead of obtaining it by simulating the data generation environment.
There are pros and cons for all three alternatives [120]. The best option is to have a real and representative dataset. Since this is not always possible, the next best option is to create semi-synthetic data by simulating the data generation process in a realistic way. In many cases, however, cost, time, or technical difficulties leave synthetic data as the only available option. This option can be problematic, since the generated data can be noisy and not representative of the original data. It is therefore important to devise good methods to generate synthetic data when no other possibility is feasible. Synthetic data should resemble the actual data, but with enough variability that it is not an exact copy of the original data.
Intrusion detection is a particularly interesting area for the generation of synthetic data. Acquiring a representative dataset can be costly and time-consuming even with a simulated
Another possibility for creating synthetic data is provided by generative models that learn the latent joint probability distribution of the data. Sampling from the learned distribution then yields synthetic data whose joint probability distribution is similar to that of the original data. This is the approach followed in this thesis [5], using a variational autoencoder. The authors of [121] present work of a similar nature, in which a generative model is constructed to capture the joint probability distribution of the data. In their case, the data to be synthesized is relational data (contained in a database). The joint probability distribution for the complete dataset is obtained through a complex process that identifies the probability distribution of each column in the database and then estimates the covariance between columns using a Gaussian Copula. The covariance estimate is extended to related tables. To synthesize new data, they sample from the resulting (and complex) joint probability distribution. A similar approach, with different generative models, has also been applied to generate images [10][35][38] and text [36][43].
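The Gaussian Copula idea described for [121] can be illustrated with a minimal sketch. It is not the method of [121] and not the variational autoencoder used in this thesis: it assumes only two continuous columns, uses empirical marginals instead of fitted parametric distributions, and all function names (`normal_scores`, `fit_copula`, `sample_copula`...) and the toy data are hypothetical, chosen for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def normal_scores(column):
    """Map a column to standard-normal scores via its empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is finite
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF: the observed value at probability u."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def fit_copula(col_x, col_y):
    """Return the marginals (sorted data) and the copula correlation."""
    rho = pearson(normal_scores(col_x), normal_scores(col_y))
    return sorted(col_x), sorted(col_y), rho


def sample_copula(model, n_samples, rng):
    """Draw synthetic (x, y) rows from the fitted two-column copula."""
    sorted_x, sorted_y, rho = model
    residual_std = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal with correlation rho to z1
        z2 = rho * z1 + residual_std * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_x, u1),
                     empirical_quantile(sorted_y, u2)))
    return rows


# Toy "real" data (assumed for illustration): a skewed column and a
# second column that depends on it.
rng = random.Random(0)
real_x = [rng.expovariate(1.0) for _ in range(500)]
real_y = [2.0 * x + rng.gauss(0.0, 0.5) for x in real_x]

model = fit_copula(real_x, real_y)
synthetic = sample_copula(model, 500, rng)
```

Because the marginals are empirical, every synthetic value in a column is an observed value of that column; only the pairing of values across columns is new. A fitted parametric marginal, as described for [121], would also allow values that were never observed.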
When generating synthetic data there are two scenarios: (a) creating synthetic samples with all their features [5], or (b) completing partially filled samples in which the values of some features are known but others are missing. In the second case the synthesis is restricted to the missing features, with the important constraint that they must be synthesized conditioned on the values of the known ones [3].
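The difference between the two scenarios can be sketched with a deliberately simple model. The example below swaps in a bivariate Gaussian for the generative models used in [3] and [5], purely to show what "conditioned on the known features" means. The function names and the toy data are hypothetical.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, variances and covariance from complete (x1, x2) rows."""
    n = len(rows)
    mu1 = sum(r[0] for r in rows) / n
    mu2 = sum(r[1] for r in rows) / n
    var1 = sum((r[0] - mu1) ** 2 for r in rows) / (n - 1)
    var2 = sum((r[1] - mu2) ** 2 for r in rows) / (n - 1)
    cov = sum((r[0] - mu1) * (r[1] - mu2) for r in rows) / (n - 1)
    return mu1, mu2, var1, var2, cov


def complete_missing(params, known_x1, rng):
    """Scenario (b): synthesize the missing x2 conditioned on the known x1."""
    mu1, mu2, var1, var2, cov = params
    # Conditional distribution of x2 given x1 for a bivariate Gaussian
    cond_mean = mu2 + cov / var1 * (known_x1 - mu1)
    cond_var = max(var2 - cov * cov / var1, 0.0)
    return rng.gauss(cond_mean, math.sqrt(cond_var))


def sample_full(params, rng):
    """Scenario (a): synthesize a complete sample (x1, x2)."""
    mu1, _, var1, _, _ = params
    x1 = rng.gauss(mu1, math.sqrt(var1))
    return x1, complete_missing(params, x1, rng)


# Toy training data (assumed for illustration): x2 depends linearly on x1.
rng = random.Random(1)
train = []
for _ in range(1000):
    x1 = rng.gauss(0.0, 1.0)
    train.append((x1, 3.0 * x1 + 1.0 + rng.gauss(0.0, 0.1)))

params = fit_gaussian(train)
new_sample = sample_full(params, rng)            # all features synthesized
filled_x2 = complete_missing(params, 2.0, rng)   # only x2 synthesized, x1 = 2.0 known
```

In scenario (b) the known value constrains the result: with this toy data a sample with x1 = 2.0 receives an x2 close to 7, whereas an unconditioned draw of x2 would ignore x1 altogether.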
There are several works related to the creation of semi-synthetic data for intrusion detection. In [122] the authors propose a modular synthetic dataset generation framework for web applications, together with a monitoring environment to collect data at multiple protocol layers (e.g. TCP, database queries, system calls...). They can create different types of attacks or reuse existing ones by adopting the Metasploit Framework within their own simulation environment, which they call Wind Tunnel. The approach corresponds to a semi-synthetic model. The work in [123] proposes a simulated environment to create intrusion data for a vehicular ad hoc network (VANET). The authors present an experiment using a network simulator with 10 simulated mobility scenarios for VANET hosts and 5 types of emulated security threats, with the capacity to define the total number of vehicles and the number of malicious hosts in the VANET. In [124] a generator architecture (semi-synthetic approach) is proposed for datasets of system calls used for host intrusion detection systems (HIDS). The generator architecture is generic, but it is demonstrated using Ubuntu Linux, with Mozilla Firefox as the profiled application. The authors of [125] implement a software-simulated environment to create high-level human threats produced by malicious employees/agents inside an organization. They build a complex high-level simulated environment that includes aspects such as human behaviour, relationships and communication models within the organization, and they create synthetic datasets corresponding to complex threat scenarios associated with personal dynamics within the organization.
Synthetic data generation is an interesting research area that will surely be further explored with the arrival of new algorithms (e.g. variational autoencoders and generative adversarial networks). The following table presents a summary of the main works related to the research carried out for this thesis. For each work, it gives the reference, the dataset used and the scope of the work. In this case, only similar works (for synthetic data) are presented in a
| Objective/Area | Ref. | Dataset | Scope |
|---|---|---|---|
| Synthesize data | [126] | Data collected from sensors deployed in the Intel Berkeley Research Laboratory | They propose a method to recover missing (incomplete) data from sensors in IoT networks using data obtained from related sensors. The method is based on probabilistic matrix factorization and is more applicable to the recovery of continuous features. |
| | [127] | MNIST and Cocaine-Opioid and Alcohol-Cannabis datasets (NIH-funded project) | Reconstruction of missing data for multimodal datasets. The proposed model is based on a combination of a denoising autoencoder and a variant of a generative adversarial network. It obtains better results than alternative models such as matrix factorization, multimodal autoencoder, pix2pix and CycleGAN. This work can be considered aligned (but not strictly similar) with the present thesis work, but it requires both a more complex training process and a more complex network. |
| | [128] | MNIST | Reconstruction of missing parts of digits of the MNIST dataset using a VAE and a variant of principal component analysis (PCA). The model based on VAE provides the best reconstruction of the missing parts. It does not employ a conditional VAE. |
| | [129] | MNIST and Frey Face datasets | First application of VAE to image generation. |
| | [130] | MNIST, CIFAR-10 and Toronto Face Database | First application of Generative Adversarial Networks to image generation. |
| | [131] | QASent and WikiQA datasets | Text generation with a variational autoencoder, conditioned on an input text. |
| | [132] | Yahoo Answer and Yelp15 review datasets | Text generation with a VAE and a dilated CNN as the decoder. |
| | [121] | Biodegradability, Mutagenesis, Airbnb, Rossmann and Telstra. All open-source datasets. | Generative model for relational data in general. They fit the probability distribution of each column's data, using the Kolmogorov-Smirnov test as a measure of goodness of fit to some predefined distributions. They use a Gaussian Copula to model the covariance between different columns. |