In this section we describe an approach to classifying Android malware using support vector machines (SVM). We then elaborate on conformal prediction (CP), which we apply where SVM results are poor in order to determine precise confidence levels for new predictions. Because conformal prediction is computationally expensive, we attempt to use it only in the cases where it is most useful. The structure and advantages of this hybrid method are further explained in Section 4.3, and its results are presented in Section 4.6.
4.2.1 Support Vector Machines (SVM)
In Section 2.2.6 of the survey, we briefly differentiated between binary classification (i.e., a sample is either malicious or benign), and multi-class classification (i.e., a sample can belong to one of any number of classes). When given a dataset of samples belonging to different classes, support vector machines (SVM) can be used to segregate the samples using hyperplanes. A single hyperplane can be defined by the set of points x that satisfies the following relation:
x · w − b = 0
where w is the normal vector to the hyperplane, x · w is the dot product of x and w, and b/‖w‖ is the offset of the hyperplane from the origin along the normal.
Two-Class Support Vector Machines, i.e. binary classification, operate on a training dataset D consisting of a set of tuples (x_i, y_i). Here x_i is a p-dimensional vector of features, normally represented by real numbers, and y_i ∈ {−1, +1} denotes the class label of sample i. The SVM separates the two classes {−1, +1} by constructing the optimal hyperplane, subjecting w and b to the following class constraints:
∀ y_i = +1 : x_i · w − b ≥ +1 (4.1)
∀ y_i = −1 : x_i · w − b ≤ −1 (4.2)
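As an illustration of constraints (4.1) and (4.2), the following minimal sketch checks whether a candidate hyperplane satisfies both constraints for every training sample. The toy data, the helper names `dot` and `satisfies_margin`, and the chosen values of w and b are assumptions made for this example only; they are not taken from our experiments.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_margin(samples, labels, w, b):
    """Return True if every (x_i, y_i) satisfies (4.1)/(4.2).

    Both constraints can be folded into one test:
    y_i * (x_i . w - b) >= 1 for all i.
    """
    return all(y * (dot(x, w) - b) >= 1 for x, y in zip(samples, labels))


# Hypothetical, linearly separable toy data with two features per sample.
samples = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]

ok = satisfies_margin(samples, labels, w=(1.0, 1.0), b=2.0)
too_flat = satisfies_margin(samples, labels, w=(0.1, 0.1), b=0.2)
print(ok, too_flat)
```

The second candidate separates the classes but violates the margin constraints, since its decision values are smaller than 1 in magnitude for the samples closest to the hyperplane.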
Complete class segregation using a hyperplane is only possible when the samples are linearly separable. Normally this is not the case for multi-class methods, as the number of classes leads to a high-dimensional space. In these cases, it is possible to use other kernels such as the polynomial or radial basis function kernel [160]. For the purpose of our experiments, we use the standard radial basis function (RBF), whose value depends solely on a sample’s distance to a “centre” point. Once the hyperplane is established, a classification decision y_i can be obtained for each sample i in the testing dataset.
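To make the distance dependence of the RBF concrete, the sketch below uses the common Gaussian form exp(−γ‖x − c‖²). The function name `rbf_kernel` and the width parameter `gamma` with its default value are assumptions of this example; the text above does not fix a particular parameterisation.

```python
import math


def rbf_kernel(x, centre, gamma=0.5):
    """Gaussian RBF: exp(-gamma * ||x - centre||^2).

    The value depends only on the distance between x and the centre;
    gamma (an assumed parameter) controls how quickly it decays.
    """
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, centre))
    return math.exp(-gamma * sq_dist)


# A sample at the centre gives the maximum value of 1.0.
at_centre = rbf_kernel((1.0, 2.0), (1.0, 2.0))
# Samples at the same distance give the same value, whatever the direction.
east = rbf_kernel((0.0, 0.0), (1.0, 0.0))
north = rbf_kernel((0.0, 0.0), (0.0, 1.0))
# A more distant sample gives a smaller value.
far = rbf_kernel((0.0, 0.0), (3.0, 4.0))
```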
Multi-Class Support Vector Machines extend the two-class classification approach. Adapting SVMs to multi-class classification is straightforward and has two main approaches: the one-vs-all approach and the one-vs-one approach. An in-depth comparison of the two approaches can be found in [100]; see also Figure 4.1. In the one-vs-all approach, k SVM classifiers are constructed, one for each class in the training dataset. Each classifier considers the samples of its own class as positive and all others as negative. In detail, the i-th SVM (i ∈ [1 . . . k]) labels samples of class i as +1 and the remaining samples as −1. The result is k decision functions as shown below:
x · w_1 + b_1, . . . , x · w_k + b_k (4.3)
where the class of each sample is chosen according to the following decision criterion, derived from the decision functions of all k SVMs:
class_i ≡ argmax_{j=1,...,k} (x_i · w_j + b_j) (4.4)
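The decision criterion (4.4) can be sketched as follows for linear decision functions. The weights, biases, and the function name `one_vs_all_predict` are hypothetical values chosen for illustration; classes are numbered from 1 to k to match the text.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def one_vs_all_predict(x, weights, biases):
    """Return the 1-based class j that maximises x . w_j + b_j (Eq. 4.4)."""
    scores = [dot(x, w) + b for w, b in zip(weights, biases)]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1


# Hypothetical decision functions for k = 3 classes.
weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
biases = [0.0, 0.0, 0.0]

print(one_vs_all_predict((2.0, 0.5), weights, biases))    # scores 2.0, 0.5, -2.5
print(one_vs_all_predict((0.0, 3.0), weights, biases))    # scores 0.0, 3.0, -3.0
print(one_vs_all_predict((-2.0, -2.0), weights, biases))  # scores -2.0, -2.0, 4.0
```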
Compared with the one-vs-all (or one-vs-rest) approach, the layout of features is more involved in one-vs-one. In this method, k(k − 1)/2 classifiers are constructed for k classes, each trained on the samples of two distinct classes. After training, testing is done using a voting system. For each decision function for classes i and j, denoted by x · w_ij + b_ij, the sign of the result (i.e., + or −) indicates whether the sample belongs to class i or j. If it belongs to i, the vote for class i is increased by 1; otherwise, the vote for class j is increased by 1. After all k(k − 1)/2 decision functions have contributed a vote, each sample is assigned to the class for which it received the most votes.
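A minimal sketch of the voting scheme is given below. It assumes linear pairwise decision functions, that a positive sign votes for class i and any other result votes for class j, and that ties are broken in favour of the lowest class index; none of these conventions is fixed by the description above, and all weights are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def one_vs_one_predict(x, pairwise, k):
    """Classify x by majority vote over the k(k - 1)/2 pairwise classifiers.

    `pairwise` maps a class pair (i, j) to the parameters (w_ij, b_ij)
    of the decision function x . w_ij + b_ij.
    """
    votes = {c: 0 for c in range(1, k + 1)}
    for (i, j), (w, b) in pairwise.items():
        if dot(x, w) + b > 0:   # assumed convention: positive sign means class i
            votes[i] += 1
        else:
            votes[j] += 1
    # max() returns the first maximal key, so ties go to the lowest class index.
    return max(votes, key=votes.get), votes


# Hypothetical pairwise classifiers for k = 3, i.e. 3 * 2 / 2 = 3 classifiers.
k = 3
pairwise = {
    (1, 2): ((1.0, -1.0), 0.0),
    (1, 3): ((1.0, 0.0), 0.0),
    (2, 3): ((0.0, 1.0), 0.0),
}

label_a, votes_a = one_vs_one_predict((2.0, 1.0), pairwise, k)
label_b, votes_b = one_vs_one_predict((-1.0, -2.0), pairwise, k)
```

For the first sample, classes 1, 2, and 3 receive 2, 1, and 0 votes respectively, so it is assigned to class 1; the second sample is assigned to class 3 with 2 votes.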
For the experiments in Section 4.6, we applied the one-vs-one method (see Figure 4.1) as it gives us a better notion of non-conformity scores. These are a crucial part of our statistical classification (see Section 4.5.3), otherwise known as conformal prediction.
Figure 4.1: (a) One-vs-All, (b) One-vs-One.
4.2.2 Conformal Prediction (CP)
In traditional classification, the algorithm typically chooses a single class label per sample. This decision is absolute and inflexible, regardless of how well the sample actually fits, and ignores alternative choices despite their likelihood. Thus, in cases where multiple class choices for a single sample have similar probabilities of being correct, a traditional classification algorithm is prone to error. To address these shortcomings, conformal prediction [210] can statistically assess how well a sample fits into a class with the use of qualitative scoring. This relies on non-conformity (NC) scores.
NC scores are a geometric measurement (e.g., distance to the hyperplane using the RBF kernel) of how well a sample fits into one or more classes. For incorrect predictions they increase with the distance to the hyperplane, whereas for correct predictions they decrease with it. NC scores can be used to derive p-values that assess how unusual a sample is relative to previous samples. Specifically, a p-value p is calculated as the proportion of a class’s samples with equal or greater NC scores. These p-values are therefore highly useful statistical measurements for gauging how well a sample fits into one class compared to all other classes, and they lend flexibility toward more accurate classification.
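The p-value computation described above can be sketched in a few lines. The function name `p_value` and the NC scores are hypothetical. The sketch follows the description literally and counts only the class's existing samples; some formulations of conformal prediction additionally count the test sample itself, which is not shown here.

```python
def p_value(test_score, class_scores):
    """Proportion of a class's samples whose NC score is >= the test sample's.

    A large p-value means many known samples of the class are at least as
    non-conforming as the test sample, i.e. the test sample is not unusual.
    """
    at_least_as_strange = sum(1 for s in class_scores if s >= test_score)
    return at_least_as_strange / len(class_scores)


# Hypothetical NC scores of the samples already assigned to one class.
class_scores = [0.1, 0.4, 0.5, 0.9]

typical = p_value(0.45, class_scores)   # two of four scores are >= 0.45
outlier = p_value(1.0, class_scores)    # no score is >= 1.0
conforming = p_value(0.0, class_scores) # every score is >= 0.0
```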
For intelligent classification we consider the credibility and confidence of each choice. A high credibility score, i.e. a high value for the largest p-value of the sample’s set, indicates a clear pairing between a sample and a class label. The qualitative metric confidence is defined as 1 − p, where p defines the line between confident and ambiguous labelling. By analysing CP credibility and confidence scores, one can determine the quality of classification much better than with standard classification. For example, choices with high credibility but poor confidence imply that multiple class labels have a p-value close to that of the chosen class. Alternatively, poor credibility and confidence scores may indicate that a sample does not match any known class and belongs to a new family (i.e., zero-day malware).
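The following sketch derives both metrics from a sample's per-class p-values. It assumes the convention common in the conformal prediction literature, in which confidence is one minus the second-largest p-value; the text above only states 1 − p, so this reading is an assumption. The family names and p-values are invented for illustration.

```python
def credibility_and_confidence(p_values):
    """Return (chosen label, credibility, confidence) from per-class p-values.

    credibility = largest p-value (that of the chosen label)
    confidence  = 1 - second-largest p-value (assumed convention)
    Requires p-values for at least two classes.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, runner_up_p) = ranked[0], ranked[1]
    return best_label, best_p, 1.0 - runner_up_p


# High credibility but poor confidence: two families fit almost equally well.
ambiguous = credibility_and_confidence(
    {"FamilyA": 0.80, "FamilyB": 0.75, "FamilyC": 0.05}
)

# High credibility and high confidence: one family clearly dominates.
clear = credibility_and_confidence(
    {"FamilyA": 0.90, "FamilyB": 0.02, "FamilyC": 0.01}
)
```

In the first case the runner-up p-value of 0.75 pulls the confidence down to 0.25 even though the credibility of 0.80 is high, which matches the ambiguous situation described above.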
Furthermore, by implicitly setting a confidence threshold, i.e. a p-value threshold, we can obtain a set of likely class labels per sample. This is highly desirable for classifications with low confidence, as one can tune the threshold for higher accuracy. However, conformal prediction is computationally costly, and it is still necessary to choose the most likely option from each set of predictions. It is therefore essential to choose a p-value threshold that maximizes accuracy for the least performance cost.
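As a rough sketch of how a p-value threshold yields a set of candidate labels, the example below keeps every label whose p-value exceeds the threshold and then selects the label with the largest p-value from that set. The function names, the strict comparison, the threshold values, and the p-values are assumptions made for illustration.

```python
def prediction_set(p_values, threshold):
    """Labels whose p-value exceeds the threshold (assumed strict comparison)."""
    return {label for label, p in p_values.items() if p > threshold}


def most_likely(p_values, candidates):
    """Pick the candidate with the largest p-value, or None if the set is empty."""
    if not candidates:
        return None
    return max(candidates, key=p_values.get)


# Hypothetical p-values for one sample.
p_values = {"FamilyA": 0.80, "FamilyB": 0.75, "FamilyC": 0.05}

loose = prediction_set(p_values, threshold=0.10)   # keeps both close candidates
strict = prediction_set(p_values, threshold=0.78)  # keeps only the best one
empty = prediction_set(p_values, threshold=0.90)   # no label is credible enough

choice = most_likely(p_values, loose)
```

Raising the threshold shrinks the set, and an empty set signals that no known class fits the sample at the requested level.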
While SVM can provide probabilities for each classification choice, e.g. for malware detection [174], the derivation of these probabilities is based on Platt’s scaling [163] which, like other regression techniques, is sensitive to outliers (i.e. distant data points). Such predictions also tend towards extremes, unlike those of conformal prediction [245], as they transform the dataset produced by SVM instead of the actual dataset.
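For comparison, Platt's scaling maps an SVM decision value f to a probability through a fitted sigmoid, 1 / (1 + exp(A·f + B)). The sketch below only evaluates this mapping; the parameter values A and B are hypothetical (in practice they are fitted on held-out decision values), and the function name is an assumption of this example.

```python
import math


def platt_probability(decision_value, a=-2.0, b=0.0):
    """Sigmoid mapping from an SVM decision value to P(y = +1).

    a and b are assumed, hand-picked parameters; a negative `a` makes the
    probability grow with the decision value.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


on_boundary = platt_probability(0.0)    # sample exactly on the hyperplane
far_positive = platt_probability(5.0)   # sample far on the positive side
far_negative = platt_probability(-5.0)  # sample far on the negative side
```

Because the input is the SVM output rather than the original features, samples far from the hyperplane are pushed close to 0 or 1, which illustrates the tendency towards extremes noted above.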