Real-world (raw) data are usually not directly usable by learning algorithms for several reasons: the data are often incomplete, contain many missing values and errors, are inconsistent, and are stored in multiple locations. Data pre-processing techniques are used to resolve these problems and to transform the raw data into an understandable format. In practice, most of the time is spent on data pre-processing and less on applying the learning algorithms. The figure below shows where data pre-processing takes place in the machine learning process:
Figure 3-7 Pre-processing Stage in Machine Learning Process [107]
3.4.3.1 Data Collection
The first step is gathering the data, often referred to as data collection. Data collection often depends on an ETL (extract, transform, load) process: the data are extracted from multiple sources such as web pages, flat files or multiple databases, transformed into an appropriate format, and loaded into a unified location where machine learning takes place [108], [109].
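As an illustration only, the ETL flow described above can be sketched in Python using the standard library. The CSV text, the `people` source table and the `unified` target table are hypothetical stand-ins for real sources and are not taken from this study.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source (CSV text stands in for a file on disk).
CSV_SOURCE = "country,age\nUK,19\nIraq,27\n"


def make_source_db():
    """Build a hypothetical source database with untidy values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (country TEXT, age TEXT)")
    conn.execute("INSERT INTO people VALUES (' Yemen ', '20')")
    return conn


def extract(csv_text, source_db):
    """Extract: read records from the flat file and from the database."""
    from_file = list(csv.DictReader(io.StringIO(csv_text)))
    from_db = [
        {"country": country, "age": age}
        for country, age in source_db.execute("SELECT country, age FROM people")
    ]
    return from_file + from_db


def transform(records):
    """Transform: trim whitespace and convert ages to integers."""
    return [(r["country"].strip(), int(r["age"])) for r in records]


def load(rows, target_db):
    """Load: write the cleaned rows into one unified table."""
    target_db.execute("CREATE TABLE unified (country TEXT, age INTEGER)")
    target_db.executemany("INSERT INTO unified VALUES (?, ?)", rows)
    target_db.commit()


target = sqlite3.connect(":memory:")
load(transform(extract(CSV_SOURCE, make_source_db())), target)
unified_rows = target.execute(
    "SELECT country, age FROM unified ORDER BY rowid"
).fetchall()
```

After the three steps, `unified_rows` holds the records from both sources in one consistent format, which is the state a learning algorithm would start from.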
3.4.3.2 Missing Values
Missing values in a dataset can degrade the performance of a learning algorithm and lead to inaccurate inferences about the data. Therefore, it is important to resolve any missing values. There are multiple techniques for dealing with missing data, but the two most prominent are either to delete rows with missing values or to replace the missing values with the mean, median or mode. Deleting rows is acceptable in some cases, but it can reduce the data volume significantly, and the deleted rows may contain crucial information. Depending on the problem and the data type, it is sometimes better to use the second technique and replace missing values with the overall mean of the variable, as this can give better results [110], [111].
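The two techniques can be compared with a minimal Python sketch. The `ages` column and the helper names `drop_missing` and `impute` are hypothetical and chosen only for illustration; `None` marks a missing value.

```python
from statistics import mean, median

# Hypothetical column with two missing entries.
ages = [19, None, 20, 27, None]


def drop_missing(values):
    """Technique 1: discard the entries that are missing."""
    return [v for v in values if v is not None]


def impute(values, statistic=mean):
    """Technique 2: replace missing entries with a statistic
    (mean, median or mode) computed from the observed entries."""
    fill = statistic(drop_missing(values))
    return [fill if v is None else v for v in values]


dropped = drop_missing(ages)
mean_filled = impute(ages, mean)
median_filled = impute(ages, median)
```

Deletion shrinks the column from five entries to three, whereas imputation keeps all five rows at the cost of introducing estimated values.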
3.4.3.3 Categorical Values
Machine learning algorithms are based on mathematical equations and therefore require numerical values. Data often contain categorical columns, for example a dataset with a 'country' column, and such a variable would cause problems for a learning algorithm. Categorical values need to be converted to numerical values in such a way that the new numerical values carry equal importance. This is done by converting each category into its own variable (column) and filling the rows with 1/0 values.
Table 3-1 Handling Categorical Values
Before:
Country  Age
UK       19
Iraq     27
Yemen    20
UK       28
Iraq     30

After:
UK  Iraq  Yemen  Age
1   0     0      19
0   1     0      27
0   0     1      20
1   0     0      28
0   1     0      30
Table 3-1 is an example of how categorical values are handled in the pre-processing stage of a machine learning investigation [112], [113].
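The conversion in Table 3-1 can be reproduced with a short Python sketch. The function name `one_hot` is hypothetical, and the column order follows the first appearance of each category, which is an assumption made here to match the table.

```python
# The "Before" rows of Table 3-1 as (country, age) pairs.
rows = [("UK", 19), ("Iraq", 27), ("Yemen", 20), ("UK", 28), ("Iraq", 30)]


def one_hot(records):
    """Replace the categorical column with one 1/0 column per category."""
    categories = []
    for country, _ in records:
        if country not in categories:
            categories.append(country)  # keep first-appearance order
    encoded = [
        [1 if country == c else 0 for c in categories] + [age]
        for country, age in records
    ]
    return categories, encoded


categories, encoded = one_hot(rows)
```

Each row contains exactly one 1 among the country columns, so no category is given a larger numerical weight than another.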
3.4.3.4 Data Normalization & Rescaling
Other pre-processing steps that might be needed include the removal of unnecessary or duplicated variables. Finally, before the data are explored, all numeric values must be rescaled to the range 0 to 1. This is called data normalization and is achieved by subtracting the column minimum from every value in the column and then dividing the result by the column range (the maximum minus the minimum). The equation for data normalization is given below, where
$x = (x_1, \ldots, x_n)$ and $z_i$ is the $i$th normalized value [114], [115]:

$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
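A minimal Python sketch of this min-max normalization is given below. The function name `min_max_normalize` and the sample `ages` column are hypothetical, and returning zeros for a constant column is a choice made here to avoid division by zero rather than part of the cited definition.

```python
def min_max_normalize(x):
    """Rescale the values of x to [0, 1] via (x_i - min) / (max - min)."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to 0.
        return [0.0 for _ in x]
    return [(xi - lo) / (hi - lo) for xi in x]


# Hypothetical column of ages.
ages = [19, 27, 20, 28, 30]
normalized = min_max_normalize(ages)
```

The smallest value (19) maps to 0 and the largest (30) maps to 1, with every other value falling proportionally in between.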
3.4.3.5 Synthetic Minority Over-Sampling Technique (SMOTE)
Pre-processing, especially after data cleaning, normalization and the handling of missing values, can often result in an imbalanced dataset. Imbalanced data can compromise the learning process of some classifiers, such as the Support Vector Machine, leading to biased predictions and reduced accuracy. Where there is enough data, a quick solution is Random Under-sampling, which removes data from the larger classes until all classes have equal size. However, models generally benefit from being trained on as much data as possible, so removing data is not always advisable.
The alternative is Random Over-Sampling, which randomly replicates minority-class data to balance the classes. This method avoids any loss of information; however, the downside is that the model becomes prone to overfitting because of the duplicated data. Therefore, a better alternative is another commonly used technique, the Synthetic Minority Over-sampling Technique (SMOTE).
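Both random resampling strategies can be illustrated with a short Python sketch for a two-class problem. The function names and the toy samples are hypothetical, and the sketch assumes that the majority class is the larger of the two.

```python
import random


def random_undersample(majority, minority, rng):
    """Randomly keep only as many majority samples as there are minority samples."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Randomly replicate minority samples until both classes have equal size."""
    shortfall = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return list(majority), list(minority) + extra


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = list(range(10))   # hypothetical majority-class samples
minority = [100, 101, 102]   # hypothetical minority-class samples

under_major, under_minor = random_undersample(majority, minority, rng)
over_major, over_minor = random_oversample(majority, minority, rng)
```

Under-sampling discards seven of the ten majority samples, whereas over-sampling keeps everything but adds seven exact duplicates of minority samples, which is the source of the overfitting risk noted above.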
Figure 3-8 Synthetic Minority Over-Sampling Technique (SMOTE) [116]
[Figure 3-8: two panels, each with axes Feature 1 and Feature 2]
SMOTE first identifies the feature vectors to resample and then takes the difference between each feature vector and its nearest neighbour. The difference is multiplied by a random number between 0 and 1, and the result is added to the feature vector, which gives a new point on the line segment between the feature vector and its neighbour. This process is repeated for each identified feature vector. The figure above illustrates how SMOTE resolves the imbalanced data problem [117–119].
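The interpolation step can be sketched in Python as follows. This is a simplified illustration rather than a reference implementation: the function name `smote`, its parameters and the toy minority points are hypothetical, the neighbour is drawn at random from the `k` nearest minority points, and at least two minority points are assumed.

```python
import math
import random


def smote(minority, n_new, k=2, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        point = minority[i]
        # Rank the other minority points by Euclidean distance to the point.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(point, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        # New point = point + gap * (neighbour - point), per feature.
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(point, neighbour))
        )
    return synthetic


# Hypothetical minority class with two features per sample.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=6, k=2, rng=random.Random(0))
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class instead of duplicating existing samples exactly.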