Conclusion - Analyzing and Defending Against Evolving Web Threats

In this chapter, we examined the security model that high-interaction honeyclients use, and we evaluated their weaknesses in practice. We introduced and discussed a number of possible attacks, and we tested them against several popular, well-known high-interaction honeyclients. In particular, we introduced three novel attack techniques (JavaScript-based honeyclient detection, in-memory execution, and whitelist-based attacks) and examined already-known attacks in detail. Our attacks evade the detection of the tested honeyclients, while successfully compromising regular visitors. Furthermore, we suggested several countermeasures aiming to improve honeyclients. By employing these countermeasures, a honeyclient will be better protected from evasion attempts and will provide more accurate results.

SHDocVw::IShellWindowsPtr spSHWinds;
IDispatchPtr spDisp;
IWebBrowser2 *pWebBrowser = NULL;
HRESULT hr;

// get all active browsers
spSHWinds.CreateInstance(__uuidof(SHDocVw::ShellWindows));

// get one, or iterate va to get each one
spDisp = spSHWinds->Item(va);

// get IWebBrowser2 pointer
hr = spDisp.QueryInterface(IID_IWebBrowser2, &pWebBrowser);

if (SUCCEEDED(hr) && pWebBrowser != NULL) {
    visitUrl(pWebBrowser); // with the use of IWebBrowser2::Navigate2
}

Figure 3.7: Confuse honeyclient: find an Internet Explorer instance and force it to visit a URL of our choice.

An Automated Approach to the Detection of Evasive Web-based Malware

In the previous chapter we showed how evasions work against high-interaction honeyclients. In this chapter, we focus on JavaScript-based evasions that target low-interaction honeyclients by introducing Revolver, a novel approach to automatically detect evasive behavior in malicious JavaScript. Revolver uses efficient techniques to identify similarities between a large number of JavaScript programs (despite their use of obfuscation techniques, such as packing, polymorphism, and dynamic code generation), and to automatically interpret their differences to detect evasions.

In the last several years, we have seen web-based malware—malware distributed over the web, exploiting vulnerabilities in web browsers and their plugins—becoming a prevalent threat. Microsoft reports that it detected web-based exploits against over 3.5 million

attacks are the method of choice for attackers to compromise and take control of victim machines [39, 82]. At the core of these attacks are pieces of malicious HTML and JavaScript code that launch browser exploits.

Recently, a number of techniques have been proposed to detect the code used in drive-by-download attacks. A common approach is the use of honeyclients (specially instrumented browsers) that visit a suspect page and extract a number of features that help in determining if a page is benign or malicious. Such features can be based on static characteristics of the examined code [14, 19], on specifics of its dynamic behavior [18, 61, 75, 81, 88, 106], or on a combination of static and dynamic features [91].

Drive-by downloads initially contained only the code that exploits the browser. This approach was defeated by static detection of the malicious code using signatures. Attackers then started to obfuscate their code so that signatures could no longer match it. Obfuscated code needs to be executed by a JavaScript engine to reveal the final code that performs the attack. This is why researchers moved to dynamic analysis systems, which execute the JavaScript code and thereby deobfuscate the attacks regardless of the targeted vulnerable browser or plugin. In response, attackers introduced evasions: JavaScript code that detects the presence of the monitoring system and behaves differently at runtime. Any deviation from the originally targeted vulnerable browser (e.g., missing functionality, additional objects, etc.) can be used as an evasion.

As a result, malicious code is not a static artifact that, after being created, is reused without changes. On the contrary, attackers have strong motivations to modify the code they use so that it is more likely to evade the defense mechanisms employed by end-users and security researchers, while continuing to be successful at exploiting vulnerable browsers. For example, attackers may obfuscate their code so that it does not match the string signatures used by antivirus tools (a situation similar to the polymorphic techniques used in binary malware). Attackers may also mutate their code with the intent of evading a specific detection tool, such as one of the honeyclients mentioned above.

In this chapter we propose Revolver, a novel approach to automatically identify evasions in drive-by-download attacks. In particular, given a piece of JavaScript code, Revolver efficiently identifies scripts that are similar to that code, and automatically classifies the differences between two scripts that have been determined to be similar. Revolver first identifies syntactic-level differences in similar scripts (e.g., insertion, removal, or substitution of snippets of code). Then Revolver attempts to explain the semantics of such differences (i.e., their effect on page execution). We show that these changes often correspond to the introduction of evasive behavior (i.e., functionality designed to evade popular honeyclient tools).

There are several challenges that Revolver needs to address to make this approach feasible in practice. First, typical drive-by-download web pages serve malicious code that is heavily

simple polymorphic techniques, e.g., by randomly renaming variable and function names. Polymorphism creates a multitude of differences between two pieces of code. On superficial analysis, two functionally identical pieces of code will appear very different. In addition, malicious code may be produced on-the-fly, by dynamically generating and executing new code (through JavaScript and browser DOM constructs such as the eval() and setTimeout() functions). Dynamic code generation poses a problem of coverage; that is, not all JavaScript code may be readily available to the analyzer. Therefore, a naive approach that attempts to directly compare two malicious scripts would be easily thwarted by these obfuscation techniques and would fail to detect their similarities. Instead, Revolver dynamically monitors the execution of JavaScript code in a web page so that it can analyze both the scripts that are statically present in the page and those that are generated at runtime. In addition, to overcome polymorphic mutations of code, Revolver performs its similarity matching by analyzing the Abstract Syntax Tree (AST) of code, thereby ignoring superficial changes to its source code.
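The effect of comparing ASTs rather than source text can be illustrated with a minimal sketch. It assumes that the scripts have already been parsed into ESTree-style objects (written by hand here; in practice a parser would produce them). The helper names nodeTypeSequence and declarationAst are hypothetical and are not part of Revolver.

```javascript
// Hypothetical helper: flatten an ESTree-style AST into its pre-order
// sequence of node types, dropping identifier names and literal values.
function nodeTypeSequence(node, out = []) {
  if (Array.isArray(node)) {
    node.forEach((child) => nodeTypeSequence(child, out));
  } else if (node && typeof node === "object") {
    if (typeof node.type === "string") out.push(node.type);
    for (const key of Object.keys(node)) {
      if (key !== "type") nodeTypeSequence(node[key], out);
    }
  }
  return out;
}

// Hand-written AST for the statement: var <name> = unescape(<arg>);
function declarationAst(name, arg) {
  return {
    type: "VariableDeclaration",
    declarations: [{
      type: "VariableDeclarator",
      id: { type: "Identifier", name },
      init: {
        type: "CallExpression",
        callee: { type: "Identifier", name: "unescape" },
        arguments: [{ type: "Identifier", name: arg }],
      },
    }],
  };
}

// Two variants that differ only in randomly chosen variable names.
const original = nodeTypeSequence(declarationAst("sc", "payload"));
const renamed = nodeTypeSequence(declarationAst("x9f", "qq1"));

console.log(original.join(" "));
console.log(original.join(" ") === renamed.join(" ")); // true
```

Because only node types are kept, the two renamed variants map to the same sequence, whereas a string comparison of their source code would report them as different.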

Another challenge that Revolver must address is scalability. For a typical analysis of a web page, Revolver needs to compare several JavaScript scripts (more precisely, their ASTs) with a repository of millions of ASTs (potential matches) to identify similar ones. To make this similarity matching computationally efficient, we use a number of machine learning techniques, such as dimensionality reduction and clustering algorithms.
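One way to picture the role of dimensionality reduction is a cheap pre-filter that summarizes every AST as a short, fixed-length vector before any expensive tree comparison takes place. The sketch below is only an illustration of that idea, not Revolver's actual algorithm: the vocabulary, the cosine measure, the threshold, and all function names are assumptions chosen for brevity.

```javascript
// Illustrative pre-filter (an assumption, not Revolver's algorithm):
// summarize each AST as a fixed-length vector of node-type counts.
const VOCABULARY = [
  "VariableDeclaration", "CallExpression", "TryStatement",
  "IfStatement", "Identifier",
];

function toCountVector(nodeTypes) {
  return VOCABULARY.map((t) => nodeTypes.filter((n) => n === t).length);
}

function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Return the ids of repository entries whose vectors are close to the query.
function candidateMatches(query, repository, threshold) {
  const q = toCountVector(query);
  return repository
    .filter((e) => cosineSimilarity(q, toCountVector(e.nodeTypes)) >= threshold)
    .map((e) => e.id);
}

// Toy repository with two hand-written node-type sequences.
const repository = [
  { id: "evasive", nodeTypes: ["TryStatement", "CallExpression",
      "VariableDeclaration", "VariableDeclaration", "CallExpression", "Identifier"] },
  { id: "unrelated", nodeTypes: ["IfStatement", "IfStatement", "IfStatement"] },
];
const query = ["VariableDeclaration", "VariableDeclaration", "CallExpression", "Identifier"];

const matches = candidateMatches(query, repository, 0.8);
console.log(matches); // [ 'evasive' ]
```

Only the entries that survive such a filter would then need to be compared node by node, which keeps the number of expensive comparisons small even for a large repository.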

Finally, not all code changes are security-relevant. For example, a change in a portion of the code that is never executed is less interesting than one that causes a difference in the runtime behavior of the script. In particular, we are interested in identifying code changes that cause detection tools to misclassify a malicious script as benign. To identify such evasive code changes, Revolver focuses on modifications that introduce control flow changes in the program. These changes may indicate that the modified program checks whether it is being analyzed by a detector tool (rather than an unsuspecting visitor) and exhibits a different behavior depending on the result of this check.

By automatically identifying code changes designed to evade drive-by-download detectors, one can improve detection tools and increase their detection rate. We also leverage Revolver to identify benign scripts (e.g., well-known libraries) that have been injected with malicious code, and, thus, display malicious behavior.

This chapter makes the following contributions:

• Code similarity detection: We introduce techniques to efficiently identify JavaScript code snippets that are similar to each other. Our tool is resilient to obfuscation techniques, such as polymorphism and dynamic code generation, and also pinpoints the precise differences (changes in their ASTs) between two different versions of similar scripts.

• Detection of evasive code: We present several techniques to automatically classify

executed code. In particular, Revolver has identified several techniques that attackers use to evade existing detection tools by continuously running in parallel with a honeyclient.

4.1 Background and Overview

To give the reader a better understanding of the motivation for our system and the problems that it addresses, we start with a discussion of malicious JavaScript code used in drive-by-download attacks. Moreover, we present an example of the kind of code similarities that we found in the wild.

Malicious JavaScript code. The web pages involved in drive-by-download attacks typically include malicious JavaScript code. This code is usually obfuscated, and it fingerprints the visitor’s browser, identifies vulnerabilities in the browser itself or the plugins that the browser uses, and finally launches one or more exploits. These attacks target memory corruption vulnerabilities or insecure APIs that, if successfully exploited, enable the attackers to execute arbitrary code of their choice.

Figure 4.1 shows a portion of the code used in a recent drive-by-download attack against users of the Internet Explorer browser. The code (slightly edited for the sake of clarity) instantiates a shellcode (Line 8) by concatenating the variables defined at Lines 1–7; a later portion of the code (not shown in the figure) triggers a memory corruption vulnerability, which, if successful, causes the shellcode to be executed.

A common approach to detect such attacks is to use honeyclients, which are tools that pose as regular browsers, but are able to analyze the code included in the page and the side-effects of its execution. More precisely, low-interaction honeyclients emulate regular browsers and use various heuristics to identify malicious behavior during the visit of a web page [18, 41, 75]. High-interaction honeyclients consist of full-featured web browsers running in a monitoring environment that tracks all modifications to the underlying system, such as files created and processes launched [81, 99, 106]. If any unexpected modification occurs, it is considered to be a manifestation of a successful exploit. Notice that this sample is difficult to detect with a signature, as strings are randomized on each visit to the compromised site.

Evasive code. Attackers have a vested interest in crafting their code to evade the detection of analysis tools, while remaining effective at exploiting regular users. This allows their pages to stay “under the radar” (and actively malicious) for a longer period of time, by avoiding being included in blacklists such as Google’s Safe Browsing [37] or being targeted by take-down requests.

Attackers can use a number of techniques to avoid detection [86]: for example, code obfuscation is effective against tools that rely on signatures, such as antivirus scanners; requiring arbitrary user interaction can undermine high-interaction honeyclients; probing for arcane characteristics of browser features (likely not correctly emulated in browser emulators) can thwart low-interaction honeyclients.

1 var nop="%uyt9yt2yt9yt2";
2 var nop=(nop.replace(/yt/g,""));
3 var sc0="%ud5db%uc9c9%u87cd...";
4 var sc1="%"+"yutianu"+"ByutianD"+ ...;
5 var sc1=(sc1.replace(/yutian/g,""));
6 var sc2="%"+"u"+"54"+"FF"+...+"8"+"E"+"E";
7 var sc2=(sc2.replace(/yutian/g,""));
8 var sc=unescape(nop+sc0+sc1+sc2);

Figure 4.1: Malicious code that sets up a shellcode.

An effective way to implement this kind of circumvention technique is to add specialized “evasive code” whose only purpose is to cause detection tools to fail on an existing malicious script. Of course, the evasive code is designed in such a way that regular browsers (used by victims) effectively ignore it. Such evasive code could, for example, pack the exploit code in an obfuscation routine, check for human interaction, or implement a test for detecting browser emulators (such evasive code is conceptually similar to the “red pills” employed in binary malware to detect and evade commonly-used analysis tools [29]).

Figure 4.2 shows an evasive modification to the original exploit of Figure 4.1, which we also found used in the wild. More precisely, the code tries to load a non-existent ActiveX control, named yutian (Line 2). On a regular browser, this operation fails, triggering the execution of the catch branch (Lines 4–11), which contains an identical copy of the malicious code of Figure 4.1. However, low-interaction honeyclients usually emulate the ActiveX API by simulating the presence of any ActiveX control. In these systems, the loading of the ActiveX control does not raise any exception; as a consequence, the shellcode is not instantiated correctly, which stops the execution of the exploits and causes the honeyclient to fail to detect the malicious activity.

1 try {
2 new ActiveXObject("yutian");
3 } catch (e) {
4 var nop="%uyt9yt2yt9yt2";
5 var nop=(nop.replace(/yt/g,""));
6 var sc0="%ud5db%uc9c9%u87cd...";
7 var sc1="%"+"yutianu"+"ByutianD"+ ...;
8 var sc1=(sc1.replace(/yutian/g,""));
9 var sc2="%"+"u"+"54"+"FF"+...+"8"+"E"+"E";
10 var sc2=(sc2.replace(/yutian/g,""));
11 var sc=unescape(nop+sc0+sc1+sc2);
12 }

Figure 4.2: An evasion using non-existent ActiveX controls.
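The difference between the two environments can be reproduced with a small, harmless sketch. It reduces the script of Figure 4.2 to its control flow and runs it against two stand-ins for the ActiveXObject constructor. Both stand-ins (RealBrowserActiveX and EmulatedActiveX) and the function runEvasiveScript are hypothetical simplifications introduced here for illustration; they do not model any particular browser or honeyclient.

```javascript
// The script from Figure 4.2, reduced to the control flow that matters:
// the payload is only built when loading the bogus control throws.
function runEvasiveScript(ActiveXObject) {
  let shellcodeBuilt = false;
  try {
    new ActiveXObject("yutian");
  } catch (e) {
    shellcodeBuilt = true; // stands in for Lines 4-11 of Figure 4.2
  }
  return shellcodeBuilt;
}

// Assumed behavior of a regular browser: unknown controls cannot be created.
function RealBrowserActiveX(name) {
  throw new Error("cannot create object: " + name);
}

// Assumed behavior of a permissive emulator: every control "exists".
function EmulatedActiveX(name) {
  this.name = name;
}

console.log(runEvasiveScript(RealBrowserActiveX)); // true
console.log(runEvasiveScript(EmulatedActiveX));    // false
```

In the emulated case the catch branch never runs, so an analyzer that waits for the shellcode to appear observes nothing suspicious.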

Detecting evasive code using code similarity. Code similarity approaches have been proposed in the past, but none of them has focused specifically on malicious JavaScript. There are several challenges involved when processing malicious JavaScript for similarities. Attackers actively try to trigger parsing issues in analyzers. The code is usually heavily

code itself is designed to evade signature detection from antivirus products. This renders string-based and token-based code similarity approaches ineffective against malicious JavaScript. We will show later how regular code similarity tools, such as Moss [96], fail when analyzing obfuscated scripts. In Revolver, we extend tree-based code similarity approaches and focus on making our system robust against malicious JavaScript. We elaborate on our novel code similarity techniques in §4.2.4.

At a high level, we use Revolver to detect and understand the similarity between two scripts. Intuitively, Revolver is provided with the code of both scripts and their classification by one or more honeyclient tools. In our running example, we assume that the code in Figure 4.1 is flagged as malicious and the one in Figure 4.2 as benign. Revolver starts by extracting the Abstract Syntax Tree (AST) corresponding to each script. Revolver inspects the ASTs rather than the original code samples to abstract away possible superficial differences in the scripts (e.g., the renaming of variables). When analyzing the AST of Figure 4.2, it detects that it is similar to the AST of the code in Figure 4.1. The change is deemed to be interesting, since it introduces a difference (the try-catch statement) that may cause a change in the control flow of the original program. Our system also determines that the added code (the statement that tries to load the ActiveX control) is indeed executed by tools visiting the page, thus increasing the relevance of the detected change (execution bits are described in more detail in §4.2.1). Finally, Revolver classifies the modification as a possible evasion attempt, since it causes the honeyclient to change its detection result (from malicious to benign).
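The decision described in this overview can be sketched as a simple rule over two already-matched scripts. The sketch is a strong simplification under stated assumptions: each script is represented by a hand-written list of AST node types, the difference is computed as a multiset difference instead of a tree diff, and the set of control-flow node types, the field names, and the function names are hypothetical.

```javascript
// Node types treated as control-flow relevant in this sketch (an assumption).
const CONTROL_FLOW = new Set([
  "TryStatement", "IfStatement", "ConditionalExpression", "SwitchStatement",
]);

// Node types that occur in `candidate` more often than in `reference`.
function addedNodeTypes(reference, candidate) {
  const remaining = new Map();
  for (const t of reference) remaining.set(t, (remaining.get(t) || 0) + 1);
  const added = [];
  for (const t of candidate) {
    const left = remaining.get(t) || 0;
    if (left > 0) remaining.set(t, left - 1);
    else added.push(t);
  }
  return added;
}

// Hypothetical decision rule: flag a similar script when it adds executed
// control-flow nodes and the detector verdict flips from malicious to benign.
function classifyChange(reference, candidate) {
  const added = addedNodeTypes(reference.nodeTypes, candidate.nodeTypes);
  const addsControlFlow = added.some((t) => CONTROL_FLOW.has(t));
  const verdictFlipped =
    reference.verdict === "malicious" && candidate.verdict === "benign";
  return addsControlFlow && candidate.addedCodeExecuted && verdictFlipped
    ? "possible evasion"
    : "not flagged";
}

// Simplified stand-ins for the scripts of Figure 4.1 and Figure 4.2.
const figure41 = {
  nodeTypes: ["VariableDeclaration", "VariableDeclaration", "CallExpression"],
  verdict: "malicious",
  addedCodeExecuted: false,
};
const figure42 = {
  nodeTypes: ["TryStatement", "NewExpression", "CatchClause",
    "VariableDeclaration", "VariableDeclaration", "CallExpression"],
  verdict: "benign",
  addedCodeExecuted: true,
};

console.log(classifyChange(figure41, figure42)); // possible evasion
console.log(classifyChange(figure41, figure41)); // not flagged
```

The rule requires all three conditions at once, which mirrors the reasoning in the text: a control-flow change alone, or a verdict change alone, is not treated as evidence of evasion.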

Assumptions and limitations. Our approach is based on a few assumptions. Revolver relies on external detection tools to collect (and make available) a repository of JavaScript code, and to provide a classification of such code as either malicious or benign (i.e., Revolver is not a detection tool by itself). To obtain code samples and classification scores, we can rely on several publicly-available detectors [18, 41, 75].

Attackers might write a brand new attack with all components (evasion, obfuscation, exploit code) written from scratch. In such cases, Revolver will not be able to find any similarities the first time it analyzes these attacks. The lack of similarities, though, can be used to our advantage, since we can isolate brand-new attacks (provided that they can be identified by other means) based on the fact that we have never observed such code before.

In the same spirit, to detect evasions, Revolver needs to inspect two versions of a malicious script: the “regular” version, which does not contain evasive code, and the “evasive” version, which attempts to circumvent detection tools. Furthermore, if an evasion is occurring, we assume that a detection tool would classify these two versions differently. In particular, if only the evasive version of a JavaScript program is available, Revolver will not be able to detect this evasion. We consider this condition to be unlikely. In fact, trend results from a recent Google study on circumvention [86] suggest that malicious code evolves over

sufficiently large code repository should allow us to have access to both regular and evasive versions of a script. Furthermore, we have anecdotal evidence of malware authors creating different versions of their malicious scripts and submitting them to public analyzers, until they determine that their programs are no longer detected (this situation is reminiscent of the use of anti-antivirus services in the binary malware world [54]).

Revolver is not effective when server-side evasion (for example, IP cloaking) is used: in such cases, the malicious web site does not serve the malicious content at all to a detector coming from a blocked IP address, and, therefore, no analysis of its content is possible. This is a general limitation of all analysis tools and can be solved by means of a better analysis infrastructure (for example, by visiting malicious sites from IP addresses and networks that are not known to be associated with analysts and security researchers and cannot be easily fingerprinted by attackers).
