ML enables a system to self-organise and build itself by analysing data. It applies algorithms that allow the system to modify itself without being ‘explicitly programmed’ [42]. In general, ML can be classified into two categories: unsupervised learning and supervised learning.
1.5.1 Unsupervised Learning
Generally, the main objective of supervised learning is to establish a model from labelled training data. In unsupervised learning, by contrast, nothing indicates the pattern or ‘correct answer’ for the input, and the output is not provided. The objective is instead to find patterns in the input dataset, that is, structures that certain data points follow more often than others [43]. One of the most commonly used unsupervised algorithms is clustering, which aims to divide the input dataset into several subsets whose elements generally follow the same pattern [44]. Based on the parameters given by the dataset, the algorithm automatically distributes the data elements into groups of similar elements, where similarity is defined by those parameters. Once clustering is finished, we can study the pattern of each group and make decisions according to the result, such as taking measures to mitigate negative parameters or finding outliers that do not fit their group [45].
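As a minimal sketch of the grouping and outlier-finding idea described above, the following example uses k-means, one common clustering algorithm, chosen here only for illustration. The data points, the choice of the first k points as initial centroids, and the outlier rule (a point lying more than twice the mean distance of its group from the group centre) are all assumptions made for this example and are not taken from the cited works.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; returns (centroids, labels)."""
    # Assumption: the first k points serve as initial centroids (simple and
    # deterministic, but not a robust initialisation in general).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids, labels

def find_outliers(points, centroids, labels, factor=2.0):
    """Indices of points lying more than `factor` times the mean
    point-to-centroid distance of their own group (hypothetical rule)."""
    distances = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    outliers = []
    for i, lab in enumerate(labels):
        group = [d for d, other in zip(distances, labels) if other == lab]
        if distances[i] > factor * sum(group) / len(group):
            outliers.append(i)
    return outliers

# Hypothetical two-parameter dataset: two tight groups plus one stray point.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.8), (0.8, 1.2),
          (7.8, 8.2), (1.0, 1.2), (8.0, 8.2), (3.0, 3.5)]

centroids, labels = kmeans(points, k=2)
outliers = find_outliers(points, centroids, labels)
print("labels:", labels)
print("outlier indices:", outliers)
```

Here the similarity between elements is the Euclidean distance between their parameter values; a different distance measure or outlier rule may suit other datasets better.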
Clustering is applied in various fields. Short-range weather prediction can be realised by collecting daily weather conditions as a database and clustering them [46]. In the financial field, clustering can be applied to analyse stock prices and find potential manipulation factors [47]. Image compression uses clustering to group image pixels according to their RGB values (the parameters) so that pixels with similar colours or patterns are assembled for better compression [48]. Clustering is also widely applied in geography. By clustering the referent vectors of a self-organising map, a model can be built to analyse and measure the colour of the ocean [49].
1.5.2 Supervised Learning
To implement a supervised learning algorithm, the following two requirements should be satisfied: 1) the target and predictor variables should be clearly identified and listed and 2) sufficient samples giving the ‘correct’ values for the target variables should be provided. The algorithm learns from these data the pattern between the input variables and the output results. Supervised learning models therefore commonly follow a similar methodology [50]:
The first step is to collect the training dataset. This set should contain pre-defined values of the parameters and of the output result variables, for example, a list of patients with the name of each patient’s illness as the result variable and their gender, age, and occupation as the pre-defined parameter values. Normally, this training set is incomplete because we cannot collect the data of all patients in the world and, most importantly, we cannot collect data for new patients, whose illnesses have not yet occurred at the time of collection. The algorithm can therefore only generate a model of the pattern between the input and output for the given data.
The next step is therefore to evaluate the generated model. This phase requires a test dataset with the same characteristics as the training dataset, whose result variables are held back for later evaluation. The model generated from the training dataset by the ML algorithm is applied to the test data to obtain predicted result variables. These predictions are then compared with the result variables provided in the test dataset, so that the performance of the model can be evaluated. Finally, the model is modified to reduce the error rate on the test dataset. Nevertheless, the modified model may still not predict unseen data satisfactorily, so a separate validation dataset is needed, to which the modified model is applied in the same way as to the test dataset. The model is modified further until the error rate on the validation dataset is also minimised, and the final version of the model can then be applied to predict unseen data. The whole process is illustrated in Figure 1-5.
Figure 1-5 Supervised machine learning process
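The train, test, and validation steps above can be sketched as follows, keeping the dataset names used in this section. The example is illustrative only: the synthetic two-feature data, the k-nearest-neighbour classifier, and the candidate values of k are assumptions made for this sketch, and ‘modifying the model’ is reduced here to choosing the k with the lowest test error.

```python
import math
import random

def make_samples(rng, n):
    """Hypothetical toy data: two numeric parameters and a binary result."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 3.0 * label  # class 0 centred at (0, 0), class 1 at (3, 3)
        features = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        samples.append((features, label))
    return samples

def predict(train, features, k):
    """Majority vote among the k training samples nearest to `features`."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], features))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train, data, k):
    """Fraction of samples in `data` whose held-back result is mispredicted."""
    wrong = sum(predict(train, features, k) != label for features, label in data)
    return wrong / len(data)

rng = random.Random(0)             # fixed seed so the sketch is repeatable
train = make_samples(rng, 200)     # step 1: training dataset
test = make_samples(rng, 60)       # step 2: test dataset, results held back
validation = make_samples(rng, 60) # step 3: further check on separate data

# Modify the model against the test dataset: keep the k with the lowest error.
candidates = [1, 3, 5, 7, 9]       # odd values avoid tied votes
test_errors = {k: error_rate(train, test, k) for k in candidates}
best_k = min(candidates, key=lambda k: test_errors[k])

# Apply the modified model to the validation dataset in the same way.
validation_error = error_rate(train, validation, best_k)
print("test errors:", test_errors)
print("chosen k:", best_k, "validation error:", validation_error)
```

If the validation error were much higher than the test error, this would suggest that the model had been tuned too closely to the test dataset and needed further modification.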
1.5.3 Overfitting vs Underfitting
When evaluating the performance of a supervised ML model, we introduce the term ‘fitting’. In statistics, it is normally used to describe how well a newly built model adapts to the data [51]. Both overfitting and underfitting lead to poor performance of the proposed model.
From the beginning of learning, the error rate of the model gradually drops as the model continues learning and modifying itself. During this period, the model is still in the underfitting phase and requires more relevant features and a more accurate approach to improve. However, if the model includes more features or more complicated approaches than necessary, it picks up the noise and random fluctuations of the training data, which causes overfitting. This negatively impacts the model’s performance on new data and therefore reduces its ability to generalise [52]. From this point, the error rate on new data starts to increase as the model’s complexity increases. The model has accounted for so much irrelevant information from the training dataset that the importance of the useful information is suppressed during the computation. Furthermore, the computation time increases because of the high complexity caused by the unnecessary information. Thus, the model becomes a ‘personalised’ version of the training dataset and is not reliable for predicting the results of the test and validation sets [52]. As a result, it is very important to understand the background of the data we aim to train on so that only relevant features are included. Our simulations also suggest that directly applying a machine learning algorithm without introducing background knowledge leads to poorly performing models.
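The relationship between model complexity and error can be illustrated with a small sketch. It is not the simulation referred to above: the noisy sine data, the piecewise-constant model, and the use of the number of bins as a stand-in for model complexity are all assumptions made for this example.

```python
import math
import random

def make_data(rng, n, noise=0.2):
    """Hypothetical noisy samples of y = sin(2*pi*x) for x in [0, 1)."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0.0, noise)))
    return data

def bin_index(x, n_bins):
    """Index of the equal-width bin on [0, 1) that contains x."""
    return min(int(x * n_bins), n_bins - 1)

def fit_bins(train, n_bins):
    """Piecewise-constant model: the mean training y inside each bin."""
    overall = sum(y for _, y in train) / len(train)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in train:
        b = bin_index(x, n_bins)
        sums[b] += y
        counts[b] += 1
    # Bins without training data fall back to the overall mean.
    return [sums[b] / counts[b] if counts[b] else overall for b in range(n_bins)]

def mse(model, data):
    """Mean squared error of the binned model on `data`."""
    n_bins = len(model)
    return sum((y - model[bin_index(x, n_bins)]) ** 2 for x, y in data) / len(data)

rng = random.Random(1)           # fixed seed so the sketch is repeatable
train = make_data(rng, 40)
held_out = make_data(rng, 200)   # stands in for new, unseen data

results = {}
for n_bins in [1, 2, 4, 8, 16, 32, 64]:  # increasing model complexity
    model = fit_bins(train, n_bins)
    results[n_bins] = (mse(model, train), mse(model, held_out))
    print(f"bins={n_bins:2d}  train MSE={results[n_bins][0]:.3f}  "
          f"held-out MSE={results[n_bins][1]:.3f}")
```

With a single bin the model underfits, since it predicts the same value everywhere. With far more bins than the data can support, most bins hold at most one training sample, so the model reproduces the noise of individual samples: the training error keeps falling, while the held-out error is expected to rise again.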