CHAPTER THREE
STATISTICAL LEARNING MODELS
Seven types of statistical learning model were used to automate the coding of item descriptions. Some of these types had more than one variation, giving ten statistical learning models in total. Many of these models require hyperparameters to be set; these are noted throughout the descriptions of the individual models. Where hyperparameters were required, they were set using a tuning grid of plausible values combined with ten-fold cross-validation when training the models. The best-performing set of hyperparameters was then used when applying the models to make predictions on the SS1 data.
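The tuning procedure described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the analysis: the function names, the toy threshold classifier, and the toy data are all assumptions introduced here, and accuracy is assumed as the selection criterion.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_accuracy(fit, predict, X, y, params, k=10):
    """Mean held-out accuracy of one hyperparameter set over k folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        fold_scores.append(hits / len(held_out))
    return sum(fold_scores) / len(fold_scores)


def grid_search(fit, predict, X, y, grid, k=10):
    """Return (score, params) for the best-scoring entry of the tuning grid."""
    scored = [(cv_accuracy(fit, predict, X, y, params, k), params) for params in grid]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical toy model: the only "hyperparameter" is a cut-off on x.
def fit_threshold(X_train, y_train, threshold):
    return threshold


def predict_threshold(model, x):
    return int(x > model)


X = list(range(100))
y = [int(x >= 50) for x in X]
grid = [{"threshold": t} for t in (20, 49.5, 80)]
best_score, best_params = grid_search(fit_threshold, predict_threshold, X, y, grid)
```

In this sketch the winning entry of the grid would then be refitted on the full training data before predicting on new data.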
Multinomial logistic regression. Multinomial logistic regression extends standard logistic regression to a dependent variable with more than two classes. The multinomial logistic regression model used in this analysis and the models implementing the lasso and ridge regularisations (outlined below) were all fitted using the glmnet package in R (Hastie and Qian, 2014).
Multinomial logistic regression – LASSO. LASSO regression (Tibshirani, 1996) uses L1 regularisation, which penalises the sum of the absolute values of the coefficients, so that large coefficients are shrunk and the risk of overfitting is reduced. One effect of this form of regularisation is that it also performs variable selection, as some of the coefficients shrink to exactly zero and are excluded from the final model. The λ hyperparameter, which sets the level of shrinkage in the model, was tuned.
Multinomial logistic regression – Ridge. Ridge regression uses another form of regularisation, L2 regularisation, which introduces a penalty equal to the sum of the squared coefficients. All coefficients are therefore shrunk towards zero (by the same factor in the special case of orthonormal predictors). With L2 regularisation, variable selection does not take place, as the shrinkage does not set any coefficient exactly to zero. Once again, the λ hyperparameter sets the level of shrinkage in the model.
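The contrast between the two penalties can be illustrated with the textbook special case of orthonormal predictors in a linear model, where both estimators have closed forms in terms of the unpenalised coefficient. The sketch below is an illustration under that assumption only (and up to the scaling convention of the penalty); it is not how glmnet fits the multinomial models, and the function names are hypothetical.

```python
def lasso_shrink(beta, lam):
    """L1 (soft-thresholding): subtract lam from |beta|, flooring at zero."""
    magnitude = max(abs(beta) - lam, 0.0)
    return magnitude if beta >= 0 else -magnitude


def ridge_shrink(beta, lam):
    """L2: scale every coefficient by the same factor 1 / (1 + lam)."""
    return beta / (1.0 + lam)


# Hypothetical unpenalised coefficients and penalty strength.
unpenalised = [3.0, -0.5, 0.2]
lam = 1.0
lasso = [lasso_shrink(b, lam) for b in unpenalised]
ridge = [ridge_shrink(b, lam) for b in unpenalised]
```

Under these assumptions the two small coefficients are set to zero by the L1 rule, whereas the L2 rule halves every coefficient and leaves none at zero.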
k-nearest Neighbours. The k-nearest neighbours (knn) algorithm is a non-parametric classification algorithm that assigns a class by a vote among the classes of the k nearest points under a chosen distance metric (Altman, 1992). The knn algorithm is often implemented using the Euclidean distance metric; however, this can result in a large number of ties when used for text classification. Therefore, two alternative distance metrics that have previously been applied to knn classification of texts were implemented: distances based on cosine similarity (Manning et al., 2010) and on Jaccard similarity (Ouyang, 2016). The dbscan package in R (Hahsler et al., 2015) was used to calculate the predicted class based on the corresponding distance metrics. The probability of the assigned class is then estimated as the proportion of the k nearest neighbours that belong to the assigned class.
The value of k, that is, the number of points upon which the classification is based, is a hyperparameter set through cross-validation.
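The vote and the associated probability estimate can be sketched as below. This is a simplified illustration, not the dbscan-based implementation used in the analysis: the functions, the representation of texts as token sets or count vectors, and the toy item descriptions and class labels are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """One minus the cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """One minus the share of tokens common to both token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def knn_predict(train, query, k, distance):
    """Return the modal class of the k nearest points and its vote share."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)


# Hypothetical training items: (token set, class label).
train = [
    ({"whole", "milk"}, "dairy"),
    ({"skimmed", "milk"}, "dairy"),
    ({"white", "bread"}, "bakery"),
    ({"brown", "bread"}, "bakery"),
    ({"milk", "chocolate"}, "confectionery"),
]
label, probability = knn_predict(train, {"semi", "skimmed", "milk"}, 3, jaccard_distance)
```

Here two of the three nearest neighbours are in the assigned class, so the estimated probability is two thirds.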
Support Vector Machines. Support Vector Machines (SVMs) (Vapnik, 1998) are non-probabilistic binary linear classifiers. An SVM fits a hyperplane that divides the data into the two classes of the outcome variable whilst maximising the distance between the hyperplane and the support vectors, that is, the data points at the edge of each class closest to the hyperplane. The SVMs applied to the data were fitted using the e1071 package in R (Meyer et al., 2019), which provides an interface to the libsvm C++ package (Chang and Lin, 2011).
Where the outcome has more than two classes, multiple SVMs are fitted. The libsvm package uses one-versus-one multiclass classification, in which k(k − 1)/2 models are fitted (with k being the number of classes, in this case 11, giving 55 models), each of which involves only two classes. A voting mechanism is then used to assign the final suggested class, using the method outlined by Friedman (1996).
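The one-versus-one scheme can be sketched as follows. This is an illustration of the pairing and voting logic only, not libsvm's implementation; the pairwise rule used here is a hypothetical stand-in for a fitted binary SVM, and tie handling is not addressed.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """Every unordered pair of classes: k(k - 1)/2 binary problems."""
    return list(combinations(classes, 2))


def predict_one_vs_one(classes, pairwise_predict, x):
    """Run every pairwise classifier on x and return the most-voted class."""
    winners = [pairwise_predict(a, b, x) for a, b in one_vs_one_pairs(classes)]
    return Counter(winners).most_common(1)[0][0]


# Hypothetical stand-in for a fitted binary SVM: of the two classes in the
# pair, pick the one whose numeric label is nearer to the score x.
def toy_pairwise(a, b, x):
    return a if abs(a - x) <= abs(b - x) else b


classes = list(range(11))  # 11 classes, as in the analysis
n_models = len(one_vs_one_pairs(classes))
predicted = predict_one_vs_one(classes, toy_pairwise, 3.2)
```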
The performance of SVMs can be improved through the use of kernels (Hofmann et al., 2008). Where the classes of a dependent variable cannot be linearly separated in the dimensional space of the original independent variables, transforming to a higher-dimensional space can often allow linear separation to be achieved. For example, squaring the independent variables may separate classes in the new higher-dimensional space that were not previously separable. Explicitly transforming all of the independent variables can quickly become computationally intensive. A kernel avoids this by computing the dot product of two observations in the transformed space directly from their original vectors of independent variables. Calculating this single value is less computationally intensive than transforming all of the independent variables. In addition to the linear kernel, models with radial and polynomial kernels were fitted.
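The equivalence that kernels exploit can be checked numerically for a simple case. The sketch below assumes two-dimensional inputs and a homogeneous polynomial kernel of degree 2 (a special case of the polynomial kernel with unit scale and no offset); the function names and input values are illustrative.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def explicit_map(x):
    """Explicit degree-2 feature map for a two-dimensional input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Same quantity computed in the original space: (x . z) squared."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)
via_transform = dot(explicit_map(x), explicit_map(z))
via_kernel = poly2_kernel(x, z)
```

Both routes give the same value, but the kernel never constructs the transformed variables, which is where the computational saving arises as the transformed space grows.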
As SVMs are non-probabilistic, it is not possible to estimate class probabilities directly from the model. However, libsvm uses Platt scaling (Platt, 1999) to estimate the probability as a logistic transformation of the classifier scores after the SVM is fitted. For the multiclass case, libsvm implements the approach outlined by Wu et al. (2004).
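The logistic transformation used in Platt scaling for the binary case can be sketched as below. The parameters A and B are normally estimated from the fitted classifier's scores; the values used here are hypothetical, and the estimation step and the multiclass coupling are omitted.

```python
import math


def platt_probability(score, a, b):
    """Map an SVM decision score to a probability: 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hypothetical fitted parameters; a negative slope means that larger
# decision scores map to larger probabilities.
A, B = -2.0, 0.0
probabilities = [platt_probability(s, A, B) for s in (-1.5, 0.0, 1.5)]
```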
Random Forest. The Random Forest algorithm (Breiman, 2001a) is a homogeneous ensemble that fits a series of decision trees (Breiman et al., 1984) and assigns the modal classification across the fitted trees. The algorithm incorporates bootstrap aggregation (Breiman, 1996), whereby each decision tree is fitted using a random sample, drawn with replacement, of the training data set. In addition, each decision node within any given tree selects the variable to split on from a random subset of the predictor variables. Randomising both the observations and the candidate features for each tree reduces the correlation between trees, producing better estimates than an individual decision tree (Breiman, 2001a). The probability of the assigned class is then estimated as the proportion of all fitted trees that predicted the modal class, which is the final class assigned by the model. The ranger package in R (Wright and Ziegler, 2015) was used to fit the random forest model in this analysis.
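The two sources of randomness and the final vote can be sketched as follows. This is a schematic illustration rather than the ranger implementation: tree fitting itself is omitted, and the function names, sizes, and per-tree predictions are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(n_rows, rng):
    """Row indices for one tree: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, n_candidates, rng):
    """Distinct features considered at one node of a tree."""
    return rng.sample(range(n_features), n_candidates)


def forest_vote(tree_predictions):
    """Modal class across trees and the share of trees that voted for it."""
    votes = Counter(tree_predictions)
    label, count = votes.most_common(1)[0]
    return label, count / len(tree_predictions)


rng = random.Random(0)
rows = bootstrap_sample(20, rng)
features = candidate_features(10, 3, rng)

# Hypothetical predictions from five fitted trees for one item.
label, probability = forest_vote(["A", "B", "A", "A", "C"])
```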
Gradient Boosting Machine. Gradient Boosting Machines (Friedman, 2001, 2002) are another example of a homogeneous ensemble, but they fit decision trees iteratively rather than independently, as the Random Forest algorithm does. In the case of classification, the initial decision tree returns the probability of class membership. A pseudo-residual is then calculated, that is, the difference between the predicted probability and the observed class membership (either 0 or 1). This pseudo-residual is then the outcome of a subsequent model, predicted from the same set of predictor variables as the initial model. This process is repeated iteratively to reduce the size of the pseudo-residuals. In the case of multiclass classification, a decision tree is fitted for each class at each step, and the softmax function is used to produce k probabilities of class membership, that is, one probability per output class. A variant of the standard GBM, extreme gradient boosting or xgboost (Chen et al., 2015), was used in these analyses. This adapts the GBM algorithm described above to incorporate random subsampling of the observations for each iterative tree, and of the set of features considered within a tree, similar to the Random Forest algorithm.
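One multiclass boosting step can be sketched for a single observation as below. The sketch assumes the pseudo-residual is taken as observed minus predicted (the negative gradient of the multinomial log-loss) and, for simplicity, that each class's tree reproduces its pseudo-residual exactly; the function names, learning rate, and starting scores are illustrative and do not reflect xgboost's internals.

```python
import math


def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_residuals(scores, observed_class):
    """Observed membership (0 or 1) minus predicted probability, per class."""
    probs = softmax(scores)
    return [(1.0 if k == observed_class else 0.0) - p for k, p in enumerate(probs)]


def boosting_step(scores, observed_class, learning_rate=0.5):
    """Move each class score by a damped copy of its pseudo-residual."""
    residuals = pseudo_residuals(scores, observed_class)
    return [s + learning_rate * r for s, r in zip(scores, residuals)]


scores = [0.0, 0.0, 0.0]  # three classes, equal starting scores
residuals = pseudo_residuals(scores, observed_class=1)
updated = boosting_step(scores, observed_class=1)
before, after = softmax(scores)[1], softmax(updated)[1]
```

After the step, the predicted probability of the observed class is higher than before, and repeating the step shrinks the pseudo-residuals further.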