5.1.2 Usage Scenarios
ToGather is designed to be practical and efficient in the hands of security practitioners. (1) Security analysts can use the ToGather framework as an investigation tool to minimize the effort of generating threat networks for a given Android malware family. The analyst leverages the IP addresses and domain names, ordered by their importance in the generated threat network, to prioritize takedown and mitigation operations. (2) ToGather acts as a monitoring system. It analyzes a feed of Android malware (e.g., new samples daily) to generate a snapshot of the threat network and thus uncover malicious activities (e.g., spamming and phishing). Periodic reporting gives insights into the evolution and the malicious behaviors of a given malware family over time.
5.2 Methodology
In this section, we present the overall workflow of the ToGather framework, as shown in Figure 5.1, starting from Android malware samples and ending with the produced relevant threat intelligence:
[Figure 5.1: ToGather Approach Overview. Stages shown: extraction of cyber network information (dynamic and static analyses of Android malware samples, yielding IP addresses and domain names, followed by filtering and white-listing); correlation and network information enrichment (passive DNS database, geo-location, filtering and white-listing, network building); generation of the global threat network.]
1) The first step in ToGather consists of deriving network information from Android samples in a given analysis window (e.g., day, week, month), whether or not the samples are from the same malware family. However, we consider one malware family as the typical use case of ToGather, as presented in the evaluation section. ToGather conducts dynamic and static analyses, each of which produces a report for every Android malware sample. Leveraging both analysis types enhances the resiliency of ToGather against common obfuscation techniques, which hide relevant information about malicious activities such as domain names and IP addresses (network information). Afterward, ToGather extracts network information (IP addresses and domain names) by parsing the related text blocks (strings) from the analysis reports and applying a simple text pattern search. In static analysis, we mainly concentrate on the Dalvik compiled code (classes.dex) for such extraction. We collect network information more efficiently from dynamic analysis reports since they are more structured and have labeled fields.
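As an illustration of the text pattern search described in this step, the following minimal sketch pulls IPv4 addresses and domain names out of a report string and tags them with the sample hash. The regular expressions, the function name extract_network_info, and the report text are assumptions made for this example; they are not the patterns used by ToGather.

```python
import re

# Illustrative patterns (assumptions), not the ones used by ToGather.
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def extract_network_info(report_text, sample_hash):
    """Return the IPs and domains found in one analysis report.

    The sample hash is kept next to the indicators so that the tag
    survives the later workflow steps.
    """
    ips = set(IPV4_RE.findall(report_text))
    domains = {d.lower() for d in DOMAIN_RE.findall(report_text)}
    return {"hash": sample_hash, "ips": ips, "domains": domains}


report = "GET http://Evil.Example.com/gate from 203.0.113.7:8080"
print(extract_network_info(report, "abc123"))
```

A pattern search of this kind also matches strings that merely look like domains (a file name such as gate.php, for instance), which is one reason a filtering step is needed afterward.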
2) Next, we filter noise such as non-routed IP addresses out of the extracted network identifiers. We also filter out domain names and URLs that use Unicode characters: in the current ToGather implementation, we consider only domain names and URLs written in the standard English/Latin alphabet. In the case of URLs, we keep only the domains. At this point, we have a set of valid IP addresses and domain names found in Android malware. It is important to note that malware hashes tag the network information, and these tags are kept during all the workflow steps of ToGather. To minimize false positives, ToGather applies whitelisting mechanisms. For domain names, ToGather leverages Alexa [51] and Quantcast [53] (more than one million domain names). However, the number of whitelisted domain names is a hyper-parameter of ToGather that can be used to control the number of false positives. In the case of IP addresses, we leverage a set of public whitelisted IPs such as Google DNS servers and others [20]. It is important to emphasize that ToGather considers public cloud vendor IPs and domain names as a whitelist. The aim is to observe and then gain insight into the use of cloud infrastructure by Android malware. This idea originates from the observation that malicious Android apps (and malware in general) increasingly use the cloud as a low-cost infrastructure for their malicious activity.
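The filtering rules of this step can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, "non-routed" is approximated with the is_global property of Python's ipaddress module, and the whitelists are small stand-ins for the Alexa, Quantcast, and public IP lists.

```python
import ipaddress
from urllib.parse import urlsplit


def is_routable_ip(value):
    """True for globally routable addresses (assumed meaning of 'routed')."""
    try:
        return ipaddress.ip_address(value).is_global
    except ValueError:
        return False


def to_domain(value):
    """Reduce a URL to its host; plain domain names pass through."""
    host = urlsplit(value).hostname if "://" in value else value
    return (host or "").lower().rstrip(".")


def filter_indicators(ips, domains, ip_whitelist, domain_whitelist):
    """Drop non-routed IPs, non-ASCII domains, and whitelisted entries."""
    kept_ips = {
        ip for ip in ips if is_routable_ip(ip) and ip not in ip_whitelist
    }
    kept_domains = set()
    for value in domains:
        domain = to_domain(value)
        if domain and domain.isascii() and domain not in domain_whitelist:
            kept_domains.add(domain)
    return kept_ips, kept_domains


print(filter_indicators(
    {"10.0.0.1", "8.8.8.8", "1.2.3.4"},
    {"http://bad.example.net/x", "google.com"},
    ip_whitelist={"8.8.8.8"},
    domain_whitelist={"google.com"},
))
```

The size of domain_whitelist plays the role of the hyper-parameter mentioned above: a longer list removes more candidates and therefore trades false positives for possible misses.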
3) In this step, we propose a mechanism to enhance and enrich the malicious network information so that it covers related domains and IPs. In essence, ToGather aims at answering the following questions: (i) What are the IP addresses of the current malicious domains? Here we investigate the IP addresses of server machines that host malicious activities most likely related to the analyzed Android malware. (ii) What are the domain names pointing to the current malicious IP addresses? The intuition is that a malicious server machine with a given IP address could host various malicious contents, and the adversary could use multiple domains pointing to such contents. To answer these questions, ToGather has a module that enriches network information using passive DNS replication. The latter is a technology that builds zone replicas without the cooperation of zone administrators, based on captured name server responses, as presented in Section 5.2.3. We use the network information, whether IP addresses or domains, as parameters of two functions applied to a passive DNS database. The goal of these functions is to enrich the list of domains and IP addresses that could be part of the adversary's threat network. The enrichment services are: (i) GetIP(Domain): This function takes a domain as a parameter to query the passive DNS database. The result is all the IP addresses that the domain has resolved to. (ii) GetDomain(IP): This function returns all the domains that resolve to the IP address given as a parameter.
We consider passive DNS correlation for two reasons: (i) A small number of Android malware samples generally yields limited network information. (ii) Security practitioners aim at having more in-depth situational awareness of malware Internet activity. As such, they would like to consider all related IPs and domain names. The result of the correlation is a set of IP addresses and domain names, inferred using passive DNS, that are related to the Android malware apps. The correlation results could, however, overwhelm the investigation process. Passive DNS correlation is therefore optional if we have a significant number of samples from a given Android family. The correlation with passive DNS could also produce some known benign entries. For this reason, we filter the likely harmless network information by matching the newly found domain names and IP addresses against the top Alexa [51] and Quantcast [53] domain names and known public IP addresses [52].
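The two enrichment services can be illustrated with a small in-memory stand-in for a passive DNS database. This is only a sketch: the PassiveDNS class, the enrich helper, and the sample records are hypothetical, and a real deployment would query an actual passive DNS service instead of a Python dictionary.

```python
from collections import defaultdict


class PassiveDNS:
    """Toy passive DNS store built from observed (domain, ip) pairs."""

    def __init__(self, records):
        self._ips = defaultdict(set)
        self._domains = defaultdict(set)
        for domain, ip in records:
            self._ips[domain].add(ip)
            self._domains[ip].add(domain)

    def get_ip(self, domain):
        """GetIP(Domain): all IPs the domain was seen resolving to."""
        return set(self._ips.get(domain, ()))

    def get_domain(self, ip):
        """GetDomain(IP): all domains seen resolving to the IP."""
        return set(self._domains.get(ip, ()))


def enrich(pdns, ips, domains, whitelist=frozenset()):
    """One enrichment round followed by the benign-entry filter."""
    new_ips = set(ips)
    new_domains = set(domains)
    for domain in domains:
        new_ips |= pdns.get_ip(domain)
    for ip in ips:
        new_domains |= pdns.get_domain(ip)
    return new_ips - whitelist, new_domains - whitelist


pdns = PassiveDNS([
    ("a.example", "1.2.3.4"),
    ("a.example", "5.6.7.8"),
    ("b.example", "1.2.3.4"),
    ("benign.example", "1.2.3.4"),
])
print(enrich(pdns, {"1.2.3.4"}, {"a.example"}, whitelist={"benign.example"}))
```

Running a single round keeps the result bounded; repeating the round on the newly found indicators would widen the network further, at the risk of the overload mentioned above.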
4) To produce relevant and actionable intelligence, ToGather aggregates all the previous records into a heterogeneous network with different types of nodes: malware hashes, IP addresses, and domain names. We consider the heterogeneous network extracted from a given Android malware family as the malicious activity map of that family on the Internet. We call such a heterogeneous network a threat network. Furthermore, ToGather produces homogeneous networks by executing multiple projections according to the node type (IP address or domain name). ToGather thus produces three homogeneous graphs: one that considers only IP address connections, one that considers only domain name connections, and a threat network with both IPs and domains as network information. Graph homogeneity is required to apply graph partitioning to the IP threat network, the domain threat network, and the network information threat network.
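A minimal sketch of the aggregation and projection is given below. It assumes, for illustration only, that two indicators become neighbors in a projected graph when they were observed in the same malware sample; the source does not spell out the projection rule, and the function names and sample data are hypothetical.

```python
from itertools import combinations


def build_threat_network(samples):
    """Heterogeneous network as (hash, (kind, value)) edges.

    samples maps a malware hash to {"ips": set, "domains": set}.
    """
    edges = set()
    for sample_hash, info in samples.items():
        for ip in info["ips"]:
            edges.add((sample_hash, ("ip", ip)))
        for domain in info["domains"]:
            edges.add((sample_hash, ("domain", domain)))
    return edges


def project(edges, kinds):
    """Homogeneous graph over the chosen node kinds.

    Assumption: indicators sharing a malware hash are connected.
    """
    by_hash = {}
    for sample_hash, node in edges:
        if node[0] in kinds:
            by_hash.setdefault(sample_hash, set()).add(node)
    graph = {}
    for nodes in by_hash.values():
        for node in nodes:
            graph.setdefault(node, set())
        for a, b in combinations(sorted(nodes), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


samples = {
    "h1": {"ips": {"1.2.3.4"}, "domains": {"a.example"}},
    "h2": {"ips": {"1.2.3.4", "5.6.7.8"}, "domains": set()},
}
edges = build_threat_network(samples)
ip_graph = project(edges, {"ip"})
domain_graph = project(edges, {"domain"})
netinfo_graph = project(edges, {"ip", "domain"})
print(ip_graph, domain_graph, netinfo_graph, sep="\n")
```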
[Figure 5.2: Graph Analysis Overview. Stages shown: global threat network; community detection (sub-threat networks); page ranking (computation of action priority); threat tagging (malicious activity tagging, producing tagged threat networks).]
5) Further, ToGather aims at producing more granular graphs (see Figure 5.2) from the threat networks generated in the previous step. In this respect, ToGather checks whether communities can be identified in these threat networks based on the connectivity between nodes. The higher the connectivity between the nodes in a particular area of the network, the higher the likelihood of a malicious community. For community detection (Section 5.2.1), we adopt a highly scalable algorithm [82] in the ToGather community detection module. The intuition behind using the community concept is as follows: (i) Considering the typical ToGather usage scenario, where the input consists of malicious Android apps from the same family, the communities could define different threat networks that are related to the malicious activities. In other words, either one adversary is using these threat networks as backups, or we have multiple adversaries instead. In the case of Android malware, the second hypothesis is more plausible because of the low cost of repackaging existing malware samples to suit the needs of the adversary. (ii) In case ToGather receives Android malware from different families, the communities are interpreted as the threat networks of the different Android malware families, and the focus is on the relations between them. The output of this step is a set of threat networks related to IPs, domains, and network information, together with their communities (sub-threat networks).
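To make the community concept concrete, the sketch below groups densely connected nodes with a simple deterministic label propagation. This is a swapped-in technique chosen for brevity; it is not the scalable algorithm of [82], and the function name and the toy graph are assumptions made for this example.

```python
from collections import Counter, defaultdict


def label_propagation(graph, max_rounds=100):
    """Split an undirected graph {node: set(neighbors)} into communities.

    Each node repeatedly adopts the most frequent label among its
    neighbors; ties are broken by the smallest label so that the
    result is deterministic. Nodes must be mutually comparable.
    """
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nb] for nb in graph[node])
            best = max(counts.values())
            choice = min(lb for lb, c in counts.items() if c == best)
            if labels[node] != choice:
                labels[node] = choice
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return sorted(communities.values(), key=sorted)


# Two separate clusters of domains, e.g. two sub-threat networks.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
print(label_propagation(graph))
```

Each returned set corresponds to one candidate sub-threat network in the sense of this step.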
6) To produce actionable cyber-threat intelligence, we leverage the page ranking algorithm (Section 5.2.2) to deliver ranking scores for the critical nodes of a given (sub-)threat network. Consequently, the investigator has a priority list when it comes to the mitigation or takedown of nodes that are associated with a malicious cyber-infrastructure. As a result, ToGather produces a threat network for each Android malware family together with the ranking of each node. Because ToGather generates multiple homogeneous graphs based on the node type (IP, domain, network information), it produces different ranking lists. Therefore, security practitioners can select the node type to act on during mitigation or takedown to protect their systems. It is also essential to mention that it is expensive for the adversary to obtain new IP addresses. In contrast, domain names can be changed frequently because they are affordable.
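The ranking idea can be sketched with a plain power-iteration PageRank over an undirected adjacency dictionary. The damping factor of 0.85, the fixed iteration count, and the toy graph are assumptions for illustration; the exact variant used by ToGather is the one described in Section 5.2.2.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """PageRank scores for an undirected graph {node: set(neighbors)}.

    The adjacency is assumed symmetric, so a node's neighbors are also
    the nodes that pass rank to it. Isolated nodes spread their rank
    uniformly over all nodes.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {}
        for node in nodes:
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (
                (1 - damping) / n + damping * (incoming + dangling / n)
            )
        rank = new_rank
    return rank


# One IP contacted together with three others: it should rank first.
graph = {
    "1.2.3.4": {"5.6.7.8", "9.9.9.9", "4.4.4.4"},
    "5.6.7.8": {"1.2.3.4"},
    "9.9.9.9": {"1.2.3.4"},
    "4.4.4.4": {"1.2.3.4"},
}
scores = pagerank(graph)
priority = sorted(scores, key=scores.get, reverse=True)
print(priority[0], round(scores[priority[0]], 3))
```

Sorting the nodes by score yields the priority list that an investigator would work through during takedown.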
7) We do not focus only on Android malware. Instead, we aim to gain insights into the IP addresses and domains that the analyzed Android malware samples share with malware families from other platforms. Indeed, an adversary could run malicious activities on several operating systems to achieve wider coverage. Therefore, similarly to the first step, we conduct dynamic and static analyses on Windows and Linux malware samples to extract the corresponding network information. The same steps are applied to this network information. Afterward, we correlate the Android network information with the non-Android malware information to discover another dimension of the adversary network. The result is all the IP addresses and domains of the Android malware, in addition to all the network records of a given non-Android malware family if the two share some network information. It is essential to note that malware family labels also tag the network information of non-Android malware.
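The cross-platform correlation amounts to a set intersection per family, as in the minimal sketch below. The function name, the family names, and the indicators are hypothetical and serve only to illustrate the rule stated above.

```python
def correlate(android_indicators, other_families):
    """Match Android indicators against non-Android malware families.

    other_families maps a family label to its set of indicators.
    A family is reported only if it shares at least one indicator;
    its full record set is then merged with the Android one.
    """
    result = {}
    for family, indicators in other_families.items():
        shared = android_indicators & indicators
        if shared:
            result[family] = {
                "shared": shared,
                "combined": android_indicators | indicators,
            }
    return result


android = {"1.2.3.4", "a.example"}
others = {
    "win_family": {"1.2.3.4", "c.example"},
    "linux_family": {"7.7.7.7"},
}
print(correlate(android, others))
```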
8) In this final workflow step of ToGather, we leverage other intelligence sources to label the malicious activities that are committed by the discovered threat networks. The current ToGather implementation includes correlation with spam emails, reconnaissance traces, and phishing URLs. We consider ToGather an active service that receives, at every epoch (day, week, month), Android malware with the corresponding family (the typical use case) and produces valuable intelligence about this malware family.
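The tagging of this last step can be pictured as a lookup of each threat-network node in external feeds. In the sketch below, the feed names follow the three sources listed above, while the function name and the feed contents are invented for illustration.

```python
def tag_nodes(nodes, feeds):
    """Label each node with the feeds in which it appears.

    feeds maps an activity name (e.g. "spam") to a set of indicators.
    Nodes that appear in no feed get an empty tag set.
    """
    return {
        node: {name for name, seen in feeds.items() if node in seen}
        for node in nodes
    }


feeds = {
    "spam": {"a.example", "1.2.3.4"},
    "phishing": {"a.example"},
    "reconnaissance": {"5.6.7.8"},
}
print(tag_nodes({"a.example", "1.2.3.4", "b.example"}, feeds))
```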