Future Directions - Analysis and Classification of Android Malware

There are many opportunities to further the work presented in this thesis. First, CopperDroid, and dynamic frameworks in general, can be improved with more effective stimulation. As previously discussed by the author in Section 2.5 of the survey, hybrid solutions have many attractive aspects. Combining static and dynamic analysis to improve code coverage and to discover sequences of valid stimuli that reach all interesting behaviours is particularly compelling. Furthermore, hybrid setups that combine emulators with physical devices could provide an interesting countermeasure to VM-aware and VM-evasive malware (see Section 2.3).

A second area for future work is unique to CopperDroid. It is possible that, at run-time, CopperDroid could both intercept system calls and alter their return values after copying the original values. This could serve as a novel form of stimulation and could better disguise the emulator as a real device. For example, instead of actually modifying the emulator IMEI to a realistic number, CopperDroid could alter the return values of the system calls involved so that they contain a range of believable values. Furthermore, while allowing network access triggers more behaviours for analysis, similar “trickery” techniques may help elicit these behaviours with less risk to other systems and users.
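The IMEI example can be illustrated with a minimal sketch. It is not part of CopperDroid: the function names, the placeholder type allocation code `35123456`, and the idea of passing the intercepted value as a plain string are all assumptions made for illustration. The only fixed fact used is that a 15-digit IMEI ends in a Luhn check digit, so a "believable" value should satisfy that check.

```python
import random

def luhn_check_digit(digits: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def believable_imei(rng: random.Random, tac: str = "35123456") -> str:
    """Build a 15-digit IMEI: 8-digit TAC (placeholder), 6-digit serial, check digit."""
    serial = "".join(str(rng.randrange(10)) for _ in range(6))
    body = tac + serial
    return body + str(luhn_check_digit(body))

def spoof_return_value(original: str, rng: random.Random, log: list) -> str:
    """Hypothetical hook: keep a copy of the original value, return a spoofed one."""
    log.append(original)          # the original is preserved for the analysis record
    return believable_imei(rng)   # the app under analysis sees this value instead
```

Seeding the random generator per analysis run would keep the spoofed value stable within a run, so an app that queries the IMEI twice does not see two different devices.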

As a part of our analysis, it would also be interesting to build more dependencies between objects and behaviours. For instance, although we can track a file access behaviour creating a file A and a network behaviour sending the contents of file A, CopperDroid does not automatically see this as a chain of behaviours relating to a single file. This may be solved with methods like taint tracking or symbolic execution. As shown in Chapter 5, mapping specific app components to specific behaviours, malicious or benign, also provides a wealth of information. This may require building a tool to automatically extract APK components and systematically trigger different component sets over a series of experiments. There is also room to create better hardware/system stimuli (e.g., accelerometer, geo-location) or to integrate previous works [22, 87, 133].
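To make the file A example concrete, the sketch below links a file write to a later network send by checking whether the written bytes reappear in the sent payload. This is a simple content-matching heuristic, not taint tracking or symbolic execution, and it would miss data that is encoded or encrypted before sending. The `Behaviour` record and its fields are hypothetical and do not reflect CopperDroid's actual output format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Behaviour:
    kind: str     # hypothetical labels: "file_write" or "net_send"
    target: str   # file path or remote endpoint
    data: bytes   # payload observed for this behaviour

def link_file_to_network(behaviours):
    """Pair each file write with later network sends that carry the same bytes."""
    chains = []
    for i, b in enumerate(behaviours):
        if b.kind != "file_write" or not b.data:
            continue
        # Only look at behaviours that happened after the write.
        for later in behaviours[i + 1:]:
            if later.kind == "net_send" and b.data in later.data:
                chains.append((b.target, later.target))
    return chains
```

For a trace that writes `b"secret"` to one file and later sends a payload containing `b"secret"`, the function returns a single (file, endpoint) pair, which is the chain the paragraph above describes.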

Future work on our proposed multi-class classifier can be divided into several areas. Firstly, the feature set could be enhanced with more behaviours. This may be achieved with alternative sources such as memory artefacts and/or memory fingerprints. Similarly, analysing a much larger, more diverse, and more current set of malware may reveal additional features essential for the classification (both multi-class and binary) of Android malware. Furthermore, our feature vector currently includes one element representing the number of bytes across all network behaviours of a sample. Future work should include determining whether splitting this amount, or any other element, into two elements (e.g., received bytes and sent bytes) would improve classification.
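The proposed split of the network element can be sketched as follows. The input format, a sequence of `(direction, n_bytes)` pairs, and the function name are assumptions for illustration and are not the thesis's feature extraction code.

```python
def network_features(transfers, split=False):
    """Return the network part of a feature vector.

    transfers: iterable of (direction, n_bytes), direction "sent" or "received".
    split=False gives one element (total bytes), as in the current feature vector;
    split=True gives two elements, [sent bytes, received bytes].
    """
    transfers = list(transfers)  # allow generators to be summed twice
    sent = sum(n for d, n in transfers if d == "sent")
    received = sum(n for d, n in transfers if d == "received")
    return [sent, received] if split else [sent + received]
```

Training the same classifier on both variants and comparing the results would show whether the extra element helps.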

Determining whether there are better-fitting machine learning methods than SVM would also be useful. This could be achieved by analysing more extensive datasets to discover areas where the classifier is currently lacking. Many available machine learning approaches, as well as different settings, have not yet been tested to find the most appropriate method. Moreover, automatic tools to determine (1) the optimal p-value limit for our CP, (2) the best thresholds for behaviours per sample, and (3) the ideal samples-per-family thresholds would greatly enhance the conformal prediction component of this work.

While our classifier should be able to detect zero-day malware, i.e., samples likely to be malware but dissimilar to all available classes, we have not implemented a tool to do so. Theoretically, however, setting an upper p-value limit and a lower p-value limit could separate clear classification labels (above the upper limit), classifications to be done with CP (between the two limits), and zero-day malware (below the lower limit). Automatically finding these limits would be an interesting topic for future work. Furthermore, discovering which behaviour features are best for two-class identification and/or multi-class classification would be an interesting area of work. Similarly, applying the contributions in this thesis to two-class classification may prove more accurate than multi-class classification due to more available features and more defined classes.
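The three zones described above can be sketched as a small triage function. This is an illustration only: it assumes each sample comes with one p-value per malware family, compares the highest of these against the two limits, and uses limit values chosen arbitrarily by the caller. Neither the function nor these choices come from the thesis.

```python
def triage(p_values: dict, upper: float, lower: float):
    """Sort a sample into one of three zones using its per-family p-values.

    Returns (zone, families):
      "confident"          best p-value is at or above the upper limit
      "defer_to_cp"        best p-value lies between the two limits
      "possible_zero_day"  every p-value is below the lower limit
    """
    if not lower < upper:
        raise ValueError("lower limit must be below upper limit")
    label, best = max(p_values.items(), key=lambda kv: kv[1])
    if best >= upper:
        return ("confident", [label])
    if best < lower:
        return ("possible_zero_day", [])
    # In between: hand the plausible families to conformal prediction.
    return ("defer_to_cp", sorted(f for f, p in p_values.items() if p >= lower))
```

For example, with `upper=0.8` and `lower=0.1`, a sample whose p-values are all below 0.1 falls into the zero-day zone, while one whose best p-value is 0.4 is deferred to CP together with every family at or above 0.1.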

For memory forensics, future work could enable networking to analyse downloaded content (e.g., malicious APKs), uploaded data (e.g., IMEI), and SMS communications to and from a C&C server (e.g., NickiBot malware [159]). This seems a logical step forward, as previous work has shown memory forensics to be capable of analysing network activities such as messaging and email [44, 134, 198]. Further analysis in this area, with a larger and more diverse dataset, could provide more useful memory artefacts.

To automate forensic analysis and the discovery of memory artefacts, one could implement the algorithms in Section 5.5 (Algorithms 5.1, 5.2, and 5.3) that were manually applied during our initial experiments. While these can be generic enough to detect malicious or dangerous behaviours, with more fine-tuning these tools may be sensitive enough to detect specific behaviours and malware, as Algorithms 5.2 and 5.3 demonstrate best. Automatic tools to generate signatures of significant areas of memory are another area of future work, as is determining whether such signatures can be used to detect evasive malware such as bootkits.

During analysis, we found that most Volatility plugins, such as pstree, proc_maps, or psxview, ran in 1-5 seconds across the whole memory image. We also found that string scans, i.e., yarascan, were the most time-consuming. When analysing the entire image, yarascan could take 20+ minutes. Conversely, when analysing specific processes or only app processes, a string search never exceeded 15 minutes. Specifically, scanning for a string in a single process may take one minute, and did not exceed six. Therefore, in future work it would be more efficient to run fast plugins first (e.g., pstree) and filter out uninteresting processes before scanning. Parallel processing, with each thread processing a subset of app processes and/or different threads scanning for different artefacts simultaneously, should also improve performance. In this model, when an artefact is discovered in one Android process, the other threads can shift their focus to look for more incriminating artefacts in the same process.
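The "filter first, then scan in parallel" idea can be sketched as below. The sketch does not call Volatility: `scan_process` is a stand-in for a per-process string scan, and the process records (dictionaries with `pid`, `uid`, and `memory`) are hypothetical. The filter assumes that Android app processes can be recognised by a UID of 10000 or above, which in a real pipeline would come from a fast plugin's output. The refocusing of threads after a hit is not modelled here.

```python
from concurrent.futures import ThreadPoolExecutor

def is_app_process(proc) -> bool:
    """Cheap filter: keep only processes that look like Android apps (assumed UID rule)."""
    return proc["uid"] >= 10000

def scan_process(proc, patterns):
    """Stand-in for a per-process string scan; returns the patterns found."""
    return [p for p in patterns if p in proc["memory"]]

def scan_image(processes, patterns, workers=4):
    """Filter out uninteresting processes, then scan the rest in parallel."""
    candidates = [p for p in processes if is_app_process(p)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: scan_process(p, patterns), candidates)
        # Keep only processes in which at least one artefact was found.
        return {p["pid"]: hits for p, hits in zip(candidates, results) if hits}
```

Threads are a reasonable fit if each scan is an external tool invocation that mostly waits on I/O; for scans implemented in pure Python, a process pool would likely be the better choice.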

While the majority of our analyses yielded multiple artefacts per sample (as we can detect some failed or dormant malicious behaviours), malware with very few artefacts would contribute to a higher false negative rate. Few artefacts may also increase false positives, as a single “dangerous” artefact may cause a sample to be flagged as malware even though the app is merely “dangerous” or slightly intrusive rather than malicious.
