CHAPTER 5: Broad Absorption Line Quasar (BALQ)
5.1.5 Machine Learning (ML) Algorithms
With the advent of big data era, many promising resources have been developed to handle the challenge. One of these tools is machine learning (ML). The term machine
5.1. BACKGROUND
learning refers to a technique that enables computers to learn information based on data and experiences without being explicitly programmed. It is closely linked with statistics in respect of performing predictions. ML is broadly applied to untangle real world problems. From winning the American quiz game show, Jeopardy, with the IBM Watson computer system (Ferrucci et al. 2010), to automating rovers for space missions
(McGovern & Wagstaff 2011), the applications of ML extend to many fields.
ML can be branched into supervised learning, unsupervised learning, and rein- forcement learning. With supervised learning, the outputs are already established and the learning process attempts to predict the outcome of an event given prior knowledge of other outcomes. In contrast, the outputs are not provided in unsupervised learning. The algorithm then seeks to discover the inherent pattern in the data. Reinforcement learning is an automated approach whereby machine learns by interacting with the environment. It decides the ideal course of action based on its past trials in order to acquire the greatest reward or outcome.
In supervised learning, a model is trained using a training dataset given inputs where the desired outputs are identified beforehand. A set of features to be examined is used as the input to make predictions on the outputs. The feature is also known by its many names such as attribute, input, and explanatory variable. Outputs are also called classes, labels, outcomes, response variables, and targets. There are two subdivisions of supervised learning, namely classification and regression problems, which differ depending on the type of labels. For classification, the label is categorical or discrete, while it is continuous for regression. For the purposes of this study, we use supervised ML for binary classification problems and only review these methods.
Various ML algorithms have been developed to solve different tasks. This ideology is clearly captured by the “No Free Lunch” theorems. There are two of these theorems, one for supervised ML (Wolpert 1996) and another for optimisation (Wolpert & Macready 1997). Essentially the main message conveyed is that there is no unique algorithm that works perfectly in every scenario and is free from any drawbacks. A model is often based on some simplifications of the actual reality, which is achieved by making assumptions. Generally, the assumptions might hold for one situation but possibly fail for other. For this reason, it is vital to examine a variety of algorithms to assess the performance of the model.
A good ML model will strive to achieve a balanced bias-variance trade-off. Bias measures the error between the expected predictions and the actual values, while variance error measures the consistency of the model predictions in the training dataset. A model will suffer underfitting (high bias) when it is far too simple, and is therefore unable to grasp the underlying pattern in the data. In this case, the learning algorithm performs badly on both the training and test dataset. On the other hand, a model suffers from overfitting (high variance) when it is extremely complex, for example containing too many parameters. This model is able to perform exceedingly well on the training data, but has a poor performance on the test dataset. Hence, managing the trade-off is crucial
CHAPTER 5. BROAD ABSORPTION LINE QUASAR (BALQ)
in selecting an optimal model that will generalise well on real world data. We investigate several algorithms that will be described in detail next.
5.1.5.1 Decision Tree
Decision tree (Breiman et al. 1984) is a predictive model that uses a series of observations and sample splitting according to those observations to make conclusions about the possible classification of an input. The training process starts at the tree root node, i.e., the origin of all branches, and the samples are split into branches of child nodes. At a node, all features are considered and the split is chosen that maximises the purity of the children sample given the parent sample. The purity of the sample can be thought of as the least amount of mixing between the different classifications. In the case, where the parent node is a mix of objects with a classification of either A or B, the perfect split (aka the split that gives the maximum purity) would perfectly split the sample with classification A into one child node and objects with classification B into the other child node. This purity can rarely be achieved after one split. Therefore, the splitting procedure is iterated at each child node until the samples are perfectly separated or upon reaching the specified minimum number of samples at the node. The final node is called the terminal or leaf node, and the predicted outcome is given at this point.
Technically, the objective function of the decision tree algorithm is to maximise the impurity decrease at each split. The impurity decrease, ∆I(s, t), evaluates the quality of a split s given node t. In the binary case, the parent node at t is separated into two child nodes, the left (tL) and the right (tR):
∆I(s, t) = I(t)−NtL Nt
I(tL)− NtR Nt
I(tR), (5.5)
where Nt, NtL, and NtR are the number of samples in the parent, left, and right nodes, respectively. The quantity I(t) is the purity in the sample and ∆I(s, t) describes the changes in purity from the parent node to the left and right child nodes.
The two common impurity measures, I(t), are the Gini index (Gini 1921) and the Shannon entropy (Shannon 1948). The Gini index minimises the probability of misclassification and is defined as
IG(t) = c X
i=1
p(i|t)[1 − p(i|t)]. (5.6)
The parameter p(i|t) is the fraction of samples that are in class c at node t. For binary classification, this can be simplified to
5.1. BACKGROUND
The impurity function using Shannon entropy maximises the amount of information in the tree. This is given by
IH(t) =− c X
i=1
p(i|t) log2p(i|t) (5.8)
and for the binary problem
IH(t) =−p(i|t) log2p(i|t) − [1 − p(i|t)] log2[1− p(i|t)]. (5.9) The impurity decrease based on Gini index, ∆IG(s, t), is also known as Gini impurity, while it is termed information gain, ∆IH(s, t), using Shannon entropy.
A quantitative representation of the feature importances can be allocated based on how much they contribute to the prediction during the training process. For a single decision tree, T , the importance of variable Xj (Breiman et al. 1984) is given by
Imp(Xj, T ) = X t∈NT 1(Xj, t) Nt N∆I(s, t), (5.10)
where Nt/N is the proportion of samples at node t and ∆I(s, t) is the impurity decrease of a split s at t as specified in Eq. (5.5). The quantity 1(Xj, t)is an indicator function that equals 1 when node t splits on input variable Xj and 0 otherwise.
5.1.5.2 Random Forest
Random forest (Breiman 2001) is an ensemble of decision trees. A collection of trees is built from random samples of the training set generated by bootstrap sampling without replacement. The final classification is obtained via averaging all the votes from each trees. By combining multiple independently trained decision trees, the variance captured by individual trees in the forest is reduced.
By taking the mean of Eq. (5.10), the equation can be extended to calculate the feature importance trained with random forest algorithm (Breiman 2001) by taking the mean of Eq. (5.10) from a number of trees, which is also known as the mean decrease impurity: Imp(Xj) = 1 NT X T Imp(Xj, T ), (5.11)
where NT is the number of trees in the forest. The value is normalised to have a sum of unity. If Gini index is used as the impurity function, consequently the equation is sometimes called the Gini importance or mean decrease Gini.
CHAPTER 5. BROAD ABSORPTION LINE QUASAR (BALQ)
5.1.5.3 Logistic Regression
Logistic regression (Cox 1958), also called logit regression or maximum-entropy classifi- cation, belongs to a generalised linear model. It is used to predict the probability of being one of two possible classifications based on the input features. The regions that separate the input space into its respective class are called the decision boundaries. If there exists a region that perfectly divides the two classes, then the classes are linearly separable and the decision boundary is called a hyperplane.
Logistic regression algorithm attempts to search for the optimal hyperplane in the input features by maximising the log likelihood function. The hyperplane has a linear form that can be written as a combination of the predictor variables ~x and the corresponding weight vector ~w:
z = w0+ w1x1+ . . . + wmxm= w0+ n X i=1
wmxm = w0+ ~wT~x,
where n is the number of samples, m is the number of features, and w0 is the intercept or bias. The probability model is characterised by a logistic or sigmoid function
P (y =±1|~x, ~w) = 1
1 + exp[−y(w0+ ~wT~x)]
, (5.12)
which can be interpreted as the conditional probability that a sample belongs to class y given feature ~x and weight ~w.
Consider the training vectors of n samples for binary class (~x(i), y(i)) and y ∈ {−1, +1}n, where ~x(i)∈ Rdis a d-dimensional real vector and i = 1, . . . , n. The logistic regression solves the following optimisation problem:
min w0, ~w 1 2w~ Tw + C~ n X i
log[1 + exp(−y(i)(w0+ ~wT~x(i)))], (5.13) where the first term is the L2 regularisation. The misclassification penalty is controlled by the parameter C = 1/λ, which is the inverse of the regularisation parameter, λ. It controls the trade-off between large decision boundaries and correctly identified ~x(i). Smaller C value implies that the optimisation will be less strict on penalising misclassification and seeks a hyperplane with larger decision bounds. Conversely, large C means weak regularisation and tends to impose higher penalties on misidentified points. This leads to a stricter decision bound.
The probability is then compared with a threshold of 0.5 to determine either the object is in y binary class −1 or +1. If the likelihood is greater than the threshold, the object will be predicted as a +1, and vice versa if it is less. The weight function is iteratively calculated using a training sample and is chosen to minimise the devia- tions between the predicted classification given a trial weight function and the known
5.1. BACKGROUND
classification.
5.1.5.4 Support Vector Machine (SVM)
Support vector machine (SVM;Boser et al. 1992;Cortes & Vapnik 1995) is a discrimi- native classifier with the aim of obtaining a maximum margin hyperplane that divides the classes. This is illustrated in Fig.5.1. Margin is defined as the distance between the closest points, i.e., the support vectors, and the hyperplane. In a linear model, the hyperplane is a linear combination of input vector, ~x, and is written as w0+ ~wT~x, where
~
wis the weight vector and w0 is the intercept.
X1 X2
Figure 5.1: Separating hyperplane for samples from two classes using two features, X1 and X2, trained with linear SVM. The hyperplane is indicated by the solid line. The margin is the grey shaded region between the two dashed lines with circled points on the boundary as the support vectors.
Given the training vectors of n samples for binary class (~x(i), y(i)) and y ∈ {−1, +1}n, where ~x(i) ∈ Rd is a d-dimensional real vector and i = 1, . . . , n. The objective function to be optimised consists of solving the primal problem:
min w0, ~w,~ξ 1 2w~ Tw + C~ n X i ξ(i) (5.14)
subject to y(i)(w0+ ~wT~x(i))≥ 1 − ξ(i), ξ(i)≥ 0, ∀i.
The slack parameter, ξ(i), is introduced to relax the constraints on the margin. Similar to logistic regression, the penalty is governed by C. If the value is large, the algorithm will try to enforce a small margin hyperplane such that all the points are grouped correctly, and vice versa for small value of C.
CHAPTER 5. BROAD ABSORPTION LINE QUASAR (BALQ)
This optimisation problem can also be represented in its dual form: max α n X i αi− 1 2 X i X j αiαjy(i)y(j)~x(i)T· ~x(j), (5.15) subject to 0≤ αi≤ C, n X i αiy(i) = 0,∀i,
where αi is the Lagrangian multiplier. Solving the dual problem yields the weight coefficient and intercept:
~ w = n X i αiy(i)~x(i) (5.16) w0 = y(i)− ~wT~x(i). (5.17)
The predictions for unseen data can be made using sign w0+ n X i αiy(i)~x(i)T · ~x(j) ! . (5.18)
A technique called kernel trick can be implemented with SVM. The idea is to apply functions, i.e., kernels, to map inputs from a low dimensional space to higher dimensional feature spaces, φ(·), such that the data becomes linearly separable. In this case, the dot product (~x(i)T · ~x(j)) is replaced with a kernel function K(~x(i), ~x(j)) = φ(~x(i))T · φ(~x(j)).
The algorithm searches for decision boundaries with the largest margin and solves for the hyperplane parameters. Once the optimal hyperplane is found, discrete class labels are assigned to make predictions on the new points.