Evaluation - Analysis and Classification of Android Malware

Our experimental setup is as follows. We ran unmodified Android images on top of the CopperDroid-enhanced emulator. Occasionally, a clean image was customised to include personal information, such as contacts, SMS texts, call logs, and pictures, to mimic a real device as closely as possible. Each analysed malware sample was then installed in the image and traced via CopperDroid until a timeout was reached (10 minutes by default). At the end of the analysis, a clean execution environment was restored to prevent corruption and side effects caused by installing multiple malware samples in the same system. To limit noisy results, each sample was executed and analysed six times: three times without stimulation and three times with stimulation. Afterwards, the results of the single executions were merged. Future work and improvements are discussed later in Chapter 6.
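The merging step is not described in detail here. As a minimal sketch, assuming each execution yields a set of behaviour labels and that merging means taking the union of those sets, the six runs of one sample could be combined as follows (`merge_runs`, `incremental`, and the string labels are illustrative names, not part of CopperDroid):

```python
from typing import Iterable, Set

Behaviour = str  # assumption: a behaviour is identified by a plain label


def merge_runs(runs: Iterable[Set[Behaviour]]) -> Set[Behaviour]:
    """Union the behaviours observed across several executions of one sample."""
    merged: Set[Behaviour] = set()
    for run in runs:
        merged |= run
    return merged


def incremental(unstimulated: Iterable[Set[Behaviour]],
                stimulated: Iterable[Set[Behaviour]]) -> Set[Behaviour]:
    """Behaviours seen only when stimulation was applied."""
    return merge_runs(stimulated) - merge_runs(unstimulated)


# Three runs without and three runs with stimulation (made-up labels).
plain = [{"fs_write"}, {"fs_write", "http"}, {"http"}]
stim = [{"fs_write", "send_sms"}, {"http"}, {"send_sms", "read_contacts"}]
print(sorted(incremental(plain, stim)))
```

Under this assumption, a behaviour that appears in any one of the three unstimulated runs is never counted as incremental.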

We evaluated CopperDroid on three well-known and diverse datasets. These included the public Contagio dump and Android Malware Genome datasets [52, 249] and one provided by McAfee [141]. The Genome, Contagio, and McAfee datasets are composed of 1,226, 395 and 1,365 samples, respectively (see Table 3.2), equating to more than 2,900 samples overall.

3.6.1 Effectiveness

To evaluate the effectiveness of CopperDroid’s stimulation, we first analysed all samples without external stimulation. We then performed full stimulation-driven analyses on the same malware sets. A summary of the results is presented in Table 3.2, while more detailed results on the McAfee dataset are reported in Table B.1 in Appendix B. These all-or-nothing stimulation results were generated by collaborators. For a fine-grained analysis of incremental behaviours induced by stimuli, the author presents Table 3.4.

As Table 3.2 shows, stimulation results for the newer McAfee dataset are consistent with those for the older datasets: 836 of 1,365 McAfee samples exhibited additional high-level behaviours (defined in Section 3.5) and, on average, stimulation revealed 6.5 additional behaviours on top of the 22.8 behaviours observed without CopperDroid’s stimulation. While not the most effective solution, this stimulation technique allowed CopperDroid to analyse a significant number of additional behaviours at very little performance cost.

Table 3.2: Summary of stimulation results, per dataset.

Malware     Incremental Behaviours   Average             Standard
Dataset     (Samples)                Increment           Deviations
Genome      752/1226 (60%)           2.9/10.3 (28.1%)    2.4/11.8
Contagio    289/395 (73%)            5.2/23.6 (22.0%)    3.3/19.8
McAfee      836/1365 (61%)           6.5/22.8 (28.5%)    9.5/30.1
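The ratios in Table 3.2 follow from simple divisions. The sketch below recomputes them from the counts and averages in the table; the dictionary layout and the `share` helper are illustrative, and the table's own rounding may differ slightly from the recomputed values.

```python
# Values copied from Table 3.2:
# (samples with incremental behaviour, total samples,
#  average increment, average behaviours without stimulation)
TABLE_3_2 = {
    "Genome": (752, 1226, 2.9, 10.3),
    "Contagio": (289, 395, 5.2, 23.6),
    "McAfee": (836, 1365, 6.5, 22.8),
}


def share(part: float, whole: float) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


for name, (n_incr, n_total, avg_incr, avg_base) in TABLE_3_2.items():
    print(f"{name}: {share(n_incr, n_total):.1f}% of samples, "
          f"{share(avg_incr, avg_base):.1f}% average increment")
```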

Table 3.3: Overall behaviour breakdown of McAfee dataset.

Behaviour Class               No Stimulation       Stimulation
FS Access                     889/1365 (65.13%)    912/1365 (66.81%)
Access Personal Information   558/1365 (40.88%)    903/1365 (66.15%)
Network Access                457/1365 (33.48%)    461/1365 (33.77%)
Execute External App          171/1365 (12.52%)    171/1365 (12.52%)
Send SMS                      38/1365 (2.78%)      42/1365 (3.08%)
Make/Alter Call               1/1365 (0.07%)       55/1365 (4.03%)

Of course, it is important to understand whether an observed behaviour is new or if it refers to a similar, previously-observed action (e.g., same network transmission but different timestamp). To achieve this, we currently disregard pseudorandom or ephemeral values observed in specific behaviours, like a timestamp or an ID, found in otherwise identical behaviours. Hence, a repeated behaviour will not contribute to the percentage of additional behaviours observed with stimulation. All the other behaviours are considered to be new and therefore contribute to the aforementioned percentage.
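One way to picture this filtering is sketched below, under the assumption that a behaviour is recorded as a flat dictionary of hashable fields and that the ephemeral fields are known by name. The field names `timestamp` and `id`, and the functions `canonical` and `new_behaviours`, are hypothetical and do not describe CopperDroid's internal representation.

```python
from typing import Dict, List, Tuple

# Assumption: these field names mark pseudorandom or ephemeral values.
EPHEMERAL_KEYS = frozenset({"timestamp", "id"})


def canonical(behaviour: Dict[str, object]) -> Tuple:
    """Comparison key for a behaviour with ephemeral fields dropped."""
    return tuple(sorted((k, v) for k, v in behaviour.items()
                        if k not in EPHEMERAL_KEYS))


def new_behaviours(baseline: List[dict], stimulated: List[dict]) -> List[dict]:
    """Behaviours from the stimulated run not already seen in the baseline."""
    seen = {canonical(b) for b in baseline}
    fresh = []
    for b in stimulated:
        key = canonical(b)
        if key not in seen:
            seen.add(key)  # count repeats within the stimulated run only once
            fresh.append(b)
    return fresh


baseline = [{"class": "Network Access", "host": "example.org", "timestamp": 1}]
stimulated = [
    {"class": "Network Access", "host": "example.org", "timestamp": 2},
    {"class": "Send SMS", "dest": "12345", "id": 7},
]
print(new_behaviours(baseline, stimulated))
```

Here the second network transmission differs from the first only in its timestamp, so only the SMS behaviour is reported as new.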

During the analysis of the McAfee dataset, roughly 10% of the samples did not exhibit any behaviour, regardless of the stimulation technique adopted. For nearly half of these samples, this was because CopperDroid could not successfully install them in the emulator. The other half were installed but stayed dormant or did not exhibit any interesting behaviour before CopperDroid’s analysis timeout. There are a variety of possible reasons, including “incorrect” stimulation or environment elements and VM evasions (see Discussions in Chapter 2). While more sophisticated code coverage solutions could be deployed, many would hinder fast, lightweight performance. We may adopt better stimulation techniques in the future, but this is not the current focus of the CopperDroid analysis framework.

Table 3.3 reports the overall breakdown of the observed behaviours (i.e., application actions defined in Figure 3.10) on the McAfee dataset. Each row identifies the class of behaviour and how many samples, out of the total dataset, exhibited at least one occurrence of that behaviour with and without stimulation. Here, we see that the two behaviours most reactive to stimulation are Access Personal Information and Make/Alter Call. The first is triggered by CopperDroid’s stimulation technique, resulting in an access to the user’s personal information. The latter is mostly due to a set of malware that, whenever a phone call is received, hides its notification from the user. In contrast to this coarse breakdown, the author presents Table 3.4, which provides a more fine-grained overview of the effects of stimulation on all behavioural subclasses defined in Section 3.5.
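The ranking of behaviour classes by reactivity can be checked from the sample counts in Table 3.3. The sketch below sorts the classes by the absolute increase in the number of samples exhibiting them; the variable and function names are illustrative.

```python
# Sample counts copied from Table 3.3: (no stimulation, stimulation).
TABLE_3_3 = {
    "FS Access": (889, 912),
    "Access Personal Information": (558, 903),
    "Network Access": (457, 461),
    "Execute External App": (171, 171),
    "Send SMS": (38, 42),
    "Make/Alter Call": (1, 55),
}


def ranked_by_increase(table: dict) -> list:
    """Return (class, extra samples) pairs, largest increase first."""
    deltas = [(name, after - before) for name, (before, after) in table.items()]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


for name, delta in ranked_by_increase(TABLE_3_3):
    print(f"{name}: +{delta} samples")
```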

Lastly, the author ran a number of malware samples with no, selective, and full stimulation with the help of a collaborator. The aim of this experiment was to qualitatively identify which individual stimulus induced what amount of incremental behaviour, and whether combinations of stimuli are more effective than individual triggers. For illustration, we deliberately show the Android malware samples that had the highest, average, and lowest incremental behaviours, both in percentage and in absolute terms. If several families had the same maximum amount of incremental behaviour, we chose the one with the highest percentage of incremental behaviour, and vice versa.

Table 3.4: Incremental behaviour induced by various stimuli.

Sample    Behaviour             Behaviour   Behaviours  Incr. Behaviour  Incr. Behaviour  Incr. Behaviour
Family    Class                 Subclass    No Stim.    Type Stim.       SMS Stim.        Loc. Stim.

YZHC      Network Access        HTTP        4           -                -                N/A
                                DNS         1           -                -                N/A
          Exec External App     Generic     3           +10 (+433%)      -                N/A
                                Shell       1           +3 (+400%)       -                N/A
                                Priv. Esc.  2           -                +2 (+100%)       N/A
                                Install APK 4           -                -                N/A
          Access Personal Info  Account     -           -                +1 (⊥)           N/A
          FS Access             Write       414         -                -                N/A

zHash     Network Access        HTTP        2           +2 (+100%)       +5 (+350%)       N/A
                                DNS         -           -                +1 (⊥)           N/A
          Exec External App     Generic     1           +12 (+1300%)     +3 (+400%)       N/A
                                Shell       1           +3 (+400%)       -                N/A
                                Priv. Esc.  4           -                -                N/A
                                Install APK 4           -                -                N/A
          Access Personal Info  Account     2           -                -                N/A
          FS Access             Write       163         -                +255 (+257%)     N/A

SHBreak   Network Access        HTTP        3           -                N/A              N/A
          Exec External App     Generic     2           +113 (+5750%)    N/A              N/A
                                Shell       1           +22 (+2500%)     N/A              N/A
                                Install APK 4           +4 (+100%)       N/A              N/A
          FS Access             Write       195         +353 (+281%)     N/A              N/A

DKF       Network Access        HTTP        13          -                N/A              -
          Exec External App     Generic     1           +2 (+300%)       N/A              +1 (+200%)
                                Shell       1           -                N/A              -
                                Install APK 4           -                N/A              -
          FS Access             Write       3           +197 (+6667%)    N/A              +144 (+4800%)

Fladstep  Network Access        HTTP        15          -                N/A              N/A
          Exec External App     Generic     3           +17 (+633%)      N/A              N/A
                                Shell       1           +5 (+500%)       N/A              N/A
                                Install APK 4           -                N/A              N/A
          FS Access             Write       171         +80 (+47%)       N/A              N/A

(Priv. Esc. = Privilege Escalation, DKF = DroidKungFu, N/A = stimuli not possible based on Manifest)

The author then determined the best representative sample from each family based on the amount and diversity of behaviours. The results of various stimulations on these malware samples can be seen in Table 3.4. Here, we can begin to see correlations between different stimuli and behaviours. As the table shows, our selective stimulation was able to disclose a number of additional previously-unseen behaviours (e.g., YZHC SMS stimulation showed more access to personal account information) or already-observed behaviours (e.g., SHBreak showed 113 additional generic executions).

3.6.2 Performance

In this section we evaluate CopperDroid’s overhead through a number of experiments conducted on a 64-bit Debian GNU/Linux system equipped with a 3.30GHz Intel Core i5 processor and 3GB of RAM. Performance evaluations of CopperDroid’s system call collection were performed by collaborators, while the Oracle was evaluated by the author. As the CopperDroid framework, specifically the system call collection part, is still undergoing moderate changes, newer evaluations have not yet been conducted.

Benchmarking a multi-layered system such as Android, in conjunction with an emulated environment, can be rather complicated. Traditional benchmarking suites based on measuring I/O operations are similarly affected by the caching mechanisms of emulated environments. On the other hand, CPU-intensive benchmarks are meaningless against the overhead of CopperDroid, as it operates purely on system calls.

To address such issues, we performed two different benchmarking experiments. The first is a macrobenchmark that tests the overhead introduced by CopperDroid on common Android-specific actions, such as accessing contacts and sending SMS texts. Because such actions are performed via the Binder protocol, these tests give a good evaluation of the overhead caused by CopperDroid’s Binder analysis infrastructure. The second set of experiments is a microbenchmark that measures the computational time CopperDroid needs to analyse a subset of interesting system calls.

[Figure 3.13: (a) Binder Macrobenchmark; (b) System Call Microbenchmark]

To execute the first set of benchmarks, we created a fictional Android app that performs generic tasks, such as sending and reading (SMS) texts, accessing local account information (GET_ACC), and reading all contacts (CONTACTS). We then ran the test app for 100 iterations and collected the average time required to perform these operations under three settings: on a vanilla Android emulator, on a CopperDroid emulator with CopperDroid configured to monitor the targeted test app, and on a CopperDroid emulator with CopperDroid configured to track all system-wide events. Results are reported in Figure 3.13 (a). As can be observed, the overhead introduced by the targeted analysis is relatively low: ≈ 26%, ≈ 32%, ≈ 24% and ≈ 20%, respectively. On the other hand, system-wide analyses increase the overhead considerably (>2x). This is due to the number of Android components that are analysed concurrently.
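The text does not spell out how the overhead percentages were derived. A common definition, assumed here, is the relative increase of the mean instrumented time over the mean baseline time. The timings in the sketch below are made up for illustration and are not measurements from Figure 3.13; `relative_overhead` is an illustrative name.

```python
from statistics import mean
from typing import Sequence


def relative_overhead(baseline_s: Sequence[float],
                      instrumented_s: Sequence[float]) -> float:
    """Relative slowdown of the instrumented runs, e.g. 0.26 for 26%."""
    base = mean(baseline_s)
    return (mean(instrumented_s) - base) / base


# Hypothetical per-iteration timings in seconds (not the thesis data).
vanilla = [0.100, 0.102, 0.098]
targeted = [0.126, 0.128, 0.124]
print(f"targeted overhead: {relative_overhead(vanilla, targeted):.0%}")
```

With this definition, an overhead above 100% corresponds to the ">2x" slowdown reported for system-wide analysis.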

The second set of experiments measures the average time CopperDroid requires to inspect a subset of interesting system calls. This experiment collected more than 150,000 system calls obtained by executing apps with arbitrary workloads. As tracking a system call requires intercepting its entry and exit points, we report each measurement separately in Figure 3.13 (b) (the average times are 0.092ms for entry and 0.091ms for exit).

The author evaluated the Oracle’s performance by sending various object types to be unmarshalled. In each of ten tests, a hundred requests for one object were sent to the multi-threaded Oracle, and the performance scores were then averaged. This test was run on simple (Integer) and complex primitives (String Array), simple (Account) and very complex objects (Intent), and an IBinder object (i.e., only the handle). When unmarshalling an IBinder completely, the result would be a combination of the IBinder performance and the performance of the referenced object. As seen with our Android object examples, this can cover a wide range of values (see Figure 3.14). Although unmarshalling real method parameters (e.g., sendText) would involve a mix of types and far fewer than 100 parameters, its performance can be estimated from these results.

[Figure 3.14: Oracle unmarshalling performance in seconds (axis 0 to 20) for IBinder, Intent, Account, String Array, and Integer; data labels as extracted: 0.21, 0.21, 0.36, 20.8, 0.22]
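The measurement procedure described above (ten tests of a hundred requests each, averaged) can be sketched as a small timing harness. This is only an illustration of the procedure, not the author's test code: `unmarshal` stands in for a request to the Oracle, and the worker count and the stub used in the example are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable


def time_batch(unmarshal: Callable, payload, requests: int = 100,
               workers: int = 4) -> float:
    """Seconds needed to serve `requests` copies of one payload concurrently."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces every request to complete before the clock stops.
        list(pool.map(unmarshal, [payload] * requests))
    return time.perf_counter() - start


def benchmark(unmarshal: Callable, payload, tests: int = 10,
              requests: int = 100) -> float:
    """Average batch time over `tests` repetitions."""
    return mean([time_batch(unmarshal, payload, requests) for _ in range(tests)])


def stub_unmarshal(payload: bytes) -> int:
    """Hypothetical stand-in for the Oracle: just reports the payload size."""
    return len(payload)


print(f"{benchmark(stub_unmarshal, b'integer-payload'):.4f} s per 100 requests")
```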
