4.6 Statistics and Results
4.6.2 Classification Using SVM
In this section we discuss classification results when running SVM classifiers on features derived from CopperDroid. We do this in several modes, as previously described in Section 4.4.2 and Table 4.3. We first evaluate our classifier on vectors extracted from basic system calls without argument modelling. This basic baseline uses the SVM results of boolean feature vectors that model the presence or absence of each system call in the trace. We then repeat the SVM classification experiment with system-call frequencies (i.e., mode sys). For all subsequent experiments, which are based on behaviours instead, we use the performance of mode sys as our enriched baseline. After establishing our baselines, we present and compare baseline results and performance to evaluate our SVM classification using CopperDroid’s high-level behaviours reconstructed from just system call data. The goal is to reduce runtimes without sacrificing accuracy and, where possible, to improve classification accuracy. These system call baselines, produced by a collaborator, demonstrate the novel and useful aspects of the author’s behavioural reconstruction in CopperDroid, as well as the use of these behaviours in our hybrid multi-class classifier.
Table 4.3: Operational SVM modes. The first two are baselines for the following modes.
Mode     Type       Features                                   Argument Modelling  Filter Trivial
sys*     Boolean    system calls (syscall)                     ✗                   -
sys      Frequency  system calls (syscall)                     ✗                   -
rec_b+   Frequency  syscall + high-level behaviour + binder    ✗                   -
rec_ba+  Frequency  syscall + high-level behaviour + binder    ✓                   -
rec_b    Frequency  rec_b+                                     ✗                   syscall
rec_ba   Frequency  rec_ba+                                    ✓                   syscall
4.6.2.1 Baseline: Classification Using System Calls
The overall results for SVM classification in different operational modes, using different feature sets, are shown in Figure 4.6 (page 110). More specifically, comparisons on the number of features used for SVM-based classification, the overall runtime divided into feature extraction and classification, and the classification accuracy of each operation mode can be found in Figures 4.6(a), 4.6(b) and 4.6(c), respectively.
In general, across our sets of experiments, we see that CopperDroid’s behaviour reconstruction retains high accuracy levels despite drastically reducing the number of features (roughly 80 to 20, see Figure 4.6(a)). Furthermore, we see that lowering the number of features improves runtime performance, as it results in fewer calculations, allowing us to lessen the trade-off between performance and accuracy found in most traditional systems.
For experiments based on basic system calls only, we ran the SVM in a boolean mode (i.e., whether a call was used or unused) as well as a frequency mode (i.e., the number of times a call is executed). The latter yielded marginally better results than boolean mode, so for all subsequent experiments we used the results from this mode (sys) as our baseline. It should be mentioned again that the system call names and frequencies were stored in a text file and fed to our classifier. We deliberately used this fast-to-read representation to avoid skewing runtime measurements, as reading large system call traces, and not modelling system call arguments, is memory intensive.
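The difference between the two baseline modes can be illustrated with a minimal sketch. The vocabulary, the trace, and the function names below are hypothetical and chosen only for illustration; they are not the feature extraction code used in our experiments.

```python
from collections import Counter

# Hypothetical, fixed vocabulary of system call names (illustrative only).
VOCAB = ["open", "write", "close", "connect", "brk"]


def boolean_vector(trace, vocab=VOCAB):
    """Record only presence (1) or absence (0) of each vocabulary call."""
    seen = set(trace)
    return [1 if name in seen else 0 for name in vocab]


def frequency_vector(trace, vocab=VOCAB):
    """Count how many times each vocabulary call occurs in the trace."""
    counts = Counter(trace)
    return [counts[name] for name in vocab]


# Hypothetical trace: a flat list of system call names in execution order.
trace = ["open", "write", "write", "write", "close", "brk", "brk"]

print(boolean_vector(trace))    # [1, 1, 1, 0, 1]
print(frequency_vector(trace))  # [1, 3, 1, 0, 2]
```

Both vectors have one entry per vocabulary call, so the two modes share the same number of features and differ only in how much information each entry carries.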
4.6.2.2 Enrichment of the Baseline
We improve our classification in three steps, each unique to our CopperDroid behaviour profiles. In the first step, the author moves away from individual system calls to focus instead on reconstructed behaviours. By extracting actions from sequences of related system calls, we can reduce noise from irrelevant fluctuations. For example, whether a file is written in ten one-byte writes or in one ten-byte write, our classifier registers both as the same file access behaviour.
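The file-write example can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: the Syscall record and the file_write_behaviours function are hypothetical, and the real reconstruction in CopperDroid involves considerably more than summing byte counts per path.

```python
from collections import namedtuple

# Hypothetical trace record: call name, target path, bytes transferred.
Syscall = namedtuple("Syscall", ["name", "path", "nbytes"])


def file_write_behaviours(trace):
    """Collapse all write calls on the same path into one behaviour.

    Returns a dict mapping path -> total bytes written, so the number
    of individual write calls no longer influences the result.
    """
    totals = {}
    for call in trace:
        if call.name == "write":
            totals[call.path] = totals.get(call.path, 0) + call.nbytes
    return totals


ten_small = [Syscall("write", "/data/out.txt", 1) for _ in range(10)]
one_large = [Syscall("write", "/data/out.txt", 10)]

# Both traces yield the same single behaviour.
print(file_write_behaviours(ten_small) == file_write_behaviours(one_large))  # True
```

A frequency vector over raw system calls would separate these two traces (ten writes against one), whereas the behaviour-level view treats them identically.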
Secondly, the author utilizes CopperDroid’s IPC binder behaviour extraction. This is a useful, but not straightforward, process that relies heavily on CopperDroid’s unmarshalling Oracle, which was a core contribution by the author to CopperDroid (see Section 3.4). In the third step, the author used each behaviour’s details (e.g., filename, filetype, IP address, port, parameters) to further improve accuracy with a more fine-grained and expressive feature set. These improvements can be seen in Figure 4.6(c).
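One way to picture the third step is to compare feature keys built with and without behaviour details. The sketch below is an assumption-laden illustration: the behaviour records, their field names, and the key format are all hypothetical and do not reflect the actual encoding of our feature vectors.

```python
from collections import Counter

# Hypothetical behaviour records; field names are illustrative only.
behaviours = [
    {"type": "fs_write", "filename": "/data/a.db"},
    {"type": "fs_write", "filename": "/sdcard/b.txt"},
    {"type": "net_connect", "ip": "10.0.0.1", "port": 80},
    {"type": "net_connect", "ip": "10.0.0.1", "port": 80},
]


def coarse_features(behaviours):
    """One feature per behaviour type (no argument modelling)."""
    return Counter(b["type"] for b in behaviours)


def detailed_features(behaviours):
    """One feature per behaviour type together with its details."""
    keys = []
    for b in behaviours:
        # Sort by field name so the same details always give the same key.
        details = ",".join(
            f"{k}={v}" for k, v in sorted(b.items()) if k != "type"
        )
        keys.append(f"{b['type']}({details})")
    return Counter(keys)


print(len(coarse_features(behaviours)))    # 2
print(len(detailed_features(behaviours)))  # 3
```

With details included, the two file writes become distinct features while the repeated connection remains a single feature with frequency two, which is the sense in which the feature set becomes more expressive.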
The optimizations and additions we introduced to our feature sets visibly improved accuracies for modes rec_b+ and rec_ba+ when compared to our sys baseline (see Figure 4.7). However, a larger corpus of features typically leads to slower runtimes for feature extraction and classification phases. Hence, in order to further improve runtimes, we filter out uninteresting system calls (43% across all samples), such as brk, which we found to be of no particular help towards classification accuracy.
(a) Number of features across SVM operation modes
(b) Time to extract feature vectors (EXT) and classify samples (SVM)
(c) Classification accuracy across SVM modes
Figure 4.6: Feature amount, runtime, and accuracy for each SVM operational mode.
Filters: The filtering of baseline system calls is determined by what CopperDroid did not use to recreate behaviours. A non-exhaustive list of used system calls includes file-related calls, such as write, writev, open, close, and unlink; network calls, such as connect and sendto; and others like clone and ioctl.
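The filter described above amounts to keeping only calls from a known set. The following sketch assumes a trace is a flat list of call names; the set contains only the calls named in the text, which is explicitly non-exhaustive, and the sample trace is hypothetical.

```python
# Calls named in the text as used for behaviour reconstruction.
# The source list is non-exhaustive, so this set is incomplete by design.
USED_CALLS = {
    "write", "writev", "open", "close", "unlink",  # files
    "connect", "sendto",                            # network
    "clone", "ioctl",                               # others
}


def filter_trace(trace, keep=USED_CALLS):
    """Drop every call whose name is not in the keep set."""
    return [name for name in trace if name in keep]


# Hypothetical trace mixing used and filtered calls.
trace = ["brk", "open", "getgid32", "write", "brk", "close", "setsid"]

print(filter_trace(trace))  # ['open', 'write', 'close']
```

Because filtered calls never reach the feature vector, both the number of features and the extraction time shrink with the size of the keep set.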
(a) Classifying with bare-bone system calls, threshold of 20 samples per family.
(b) Classifying with reconstructed behaviours, thresholds of 10 behaviours per sample and 20 samples per family.
Figure 4.7: Visual t-SNE classification improvements from system calls to behaviours. Each circle is a sample, each colour is a family.
Of the 70 or so system calls filtered out (the exact value depends on the Android version), there were several get methods (e.g., getdents64, getgid32) and set methods (e.g., setsid, setpriority). In our experience, filtering these calls results in noticeable accuracy improvements for our multi-class classification. However, as these calls still have some effect on the Android system, they may be more useful in two-class classification (i.e., malware detection) developed in the future. This may be because these system calls can do no harm, or because all malware uses them evenly; they therefore cannot help differentiate between malware families, but can help separate malware from benign apps. As further discussed in Chapter 6, future work on two-class classification would involve a dataset of benign apps (e.g., PlayDrone [217]). System call filtering also significantly reduces the number of features and improves overall runtime, as shown in Figure 4.6.
Behaviour Threshold: The author investigated the impact of behaviour quantities on the classifier using the behaviour threshold mentioned in Section 4.4.2. For each sample, the author measured the number of extracted behaviours it exhibited while being run in CopperDroid emulators. This is the sample’s behaviour count. Samples with higher behaviour counts typically produce richer traces which, in turn, result in more detailed feature sets and better classification accuracy. In our experiments we used a behaviour threshold to filter out samples exhibiting few or no behaviours. The effect of the behaviour threshold on the classification accuracy is demonstrated in Figure 4.8. It can be observed that as the behaviour threshold increases (i.e., 0, 2, 5, 15, 20, and 30), so does the accuracy. We did not continue testing past a threshold of 30 as it was already above 26.5, the mean number of behaviours seen across all samples.
The trade-off to using a behaviour threshold to boost accuracy is that, although the ratio of behaviours to samples is higher, the number of discarded samples increases. This is also shown in Figure 4.8, where the number of samples that meet the threshold goes down as accuracy improves. In Sections 4.6.3 and 4.6.4, we apply conformal prediction to lessen the trade-off of filtering a small set of samples. The hybrid technique can be applied with any number of samples. However, based on the intersection in Figure 4.8, we choose to do our experiments with a base case of 10 behaviours per sample.
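The sample-count side of this trade-off can be sketched with a simple filter. The per-sample counts below are hypothetical and are not data from our experiments; the sketch also assumes that a sample meets the threshold when its behaviour count is greater than or equal to it.

```python
def apply_threshold(behaviour_counts, threshold):
    """Keep samples whose behaviour count meets the threshold.

    Assumption: "meets" is read as count >= threshold.
    """
    return {s: n for s, n in behaviour_counts.items() if n >= threshold}


# Hypothetical per-sample behaviour counts (not experimental data).
counts = {"s1": 0, "s2": 3, "s3": 12, "s4": 27, "s5": 41}

# Threshold values as tested in the text.
for t in (0, 2, 5, 15, 20, 30):
    kept = apply_threshold(counts, t)
    print(f"threshold={t:2d} samples kept={len(kept)}")
```

Raising the threshold can only shrink the retained set, which is why the number of samples meeting the threshold falls as the threshold grows.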