3.6 Related Work
3.6.2 Static Taint Checking
Static taint checking is essentially information flow analysis specialized to determine whether data from an untrusted source flows into a sensitive sink. Static taint checking has a long history, but Huang et al. were perhaps the first to apply it to SQLCIVs [40]. They used a CQual-like [24] type system to propagate taint information through PHP programs. Livshits and Lam [59] used a precise points-to analysis for Java [98] and queries specified in PQL [55] to find paths in Java programs that allow “raw” input to flow into SQL queries. Both of these tools are sound with respect to the policy they enforce and the language
3.6. Related Work 59
features they support, and both find many vulnerabilities, but both consider all values returned from designated filtering functions to be safe. Because the policy they use says nothing about the context of the user input and the structure of the query, both techniques may miss real SQLCIVs. Additionally, Huang et al.’s type system does not support some of PHP’s more dynamic features, in part because it does not track string values at all and supporting these features would result in excessively many false positives.
Jovanovic et al. sought to address this last shortcoming with Pixy [43, 44], a static taint analysis for PHP that propagates limited string information and implements a finely tuned alias analysis. Xie and Aiken designed a more precise and scalable analysis for finding SQLCIVs in PHP by using block- and function-summaries [99]. The precision they gained comes at the expense of automation — the user must provide the filenames when the analysis encounters a dynamic include statement, and the user must tell the analysis whether each regular expression encountered in a filtering function is “safe.” We are able to make a stronger guarantee about the absence of SQLCIVs because we analyze the possible values of the strings and check conformance to a policy that takes into account the query’s structure.
Subsequent to the publication of our work in this chapter (as [94]), Pixy was modified and extended to become Saner [3]. Saner, like the work we present in this chapter, aims to find SQL injection vulnerabilities while taking into account the semantics of input validation routines. Saner uses FSTs to model string functions, although some of the FSTs Saner uses are less precise than their counterparts in this work, and Saner represents sets of strings as regular languages. Saner checks against the simpler policy that tainted strings must be literals and so differs slighly from our work both in precision and expressivity. However, Saner does have a dynamic component whereby it attempts to provide witnesses of vulnerabilities in order to counter the problem of false positives that static analyses generally have.
60
Chapter 4
Static Analysis for Detecting XSS
Vulnerabilities
As we mentioned in Section 1.2, XSS is related to SQL injection in that both result from input validation errors in web applications, but of the two, XSS is more prevalent. XSS attacks can leak confidential information, such as session identifiers for secure sessions, or serve as launching pads for other, more severe attacks on web users’ local systems. This chapter describes the nature of XSS vulnerabilities and builds on the static analysis presented in Chapter 3 to formulate and experiment with the first practical, static, server- side analysis for detecting XSS vulnerabilities that takes into account the semantics of input validation routines.
4.1
Causes of XSS Vulnerabilities
Section 1.2 provides an initial description of XSS as an integrity vulnerability in web ap- plications. More specifically, we consider XSS to be the class of vulnerabilities in which an attacker causes the web application server to produce HTML documents in which some substring that did not come from the application’s source code causes the web browser’s JavaScript interpreter to be invoked. Section 4.3.2 provides more discussion on this policy.
4.1. Causes of XSS Vulnerabilities 61
Several factors contribute to the prevalence of XSS vulnerabilities. First, the system requirements for XSS are minimal: XSS afflicts web applications that display untrusted input, and most do. Second, most web application programming languages provide an unsafe default for passing untrusted input to the client. Typically, printing the untrusted input directly to the output page is the most straightforward way of displaying such data. Static taint analysis addresses this second factor. It marks data values assigned from untrusted sources as tainted and reports a vulnerability if the application may display that data without first sanitizing it. The analysis considers untrusted data sanitized if that data has passed through one of a designated set of sanitizing functions. This chapter addresses a third factor: proper validation for untrusted input is difficult to get right, primarily because of the many, often browser-specific, ways of invoking the JavaScript interpreter.
The MySpace worm, which infected more systems quickly than previous Internet worms (see Figure 1.3a), exploited a weakness in the MySpace input filters. The MySpace input filters prohibited all occurrences of the strings “<script>” and “javascript;” however, Internet Explorer concatenates strings broken by newlines and allows JavaScript within cascading style sheet tags, so the worm introduced its script by means of a string like:
<div style="background:url(’java script:. . . ’)">.
As if to further illustrate the point about proper input filtering being difficult to get right, several online services sell website visitor-tracking code for sites that officially forbid JavaScript, such as Xanga and MySpace. The sites that sell this code provide their service by continually finding new XSS vulnerabilities, so that whenever their current exploit stops working, they can use another vulnerability to inject their code.
Browser-specific ways of invoking the JavaScript interpreter exist because browsers han- dle permissively pages that are not standards compliant. In the early days of the web, this design decision on the part of the browser implementers seemed good because it enabled browsers to make a “best effort” to display poorly written pages. However, as the JavaScript
4.2. Overview 62
language became more standardized and its use increased, this decision has exacerbated the XSS problem.
4.2
Overview
This section provides an overview of our approach and introduces a running example to be used for a more detailed presentation of our approach in the next section.
4.2.1 Current Practice
Currently, XSS scanners rely on either testing or static taint analysis. Automated testing is ill-suited to finding errors in input validation code, because even flawed validation code catches most malicious uses, and, in order for an exploit to work, it must be crafted specif- ically for a certain validation’s weakness. Taint analysis takes as input a list of functions designated as sanitizers, but it does not perform any analysis on them, so it will not catch errors caused by weak input validation.
4.2.2 Our Approach
This chapter proposes an approach to finding not only XSS vulnerabilities due to unchecked untrusted data but also XSS vulnerabilities due to insufficiently-checked untrusted data. The approach has two parts: (1) an adapted string analysis to track untrusted substring values, and (2) a check for untrusted scripts based on formal language techniques.
The string taint analysis phase is largely the same as in Section 3.3.1, but this analysis produces a representation of the language of HTML documents rather than queries that a web application produces.
The second phase of our approach enforces the policy that generated web pages include no untrusted scripts. In order to generate the right low-level description of this high-level policy, we must consider how web browsers’ layout engines parse web documents, and under which circumstances they invoke the JavaScript engine. In order to generate the policy
4.2. Overview 63 lib.inc.php 481 f u n c t i o n s t o p x s s ( $ da ta ) { 524 /∗ Get a t t r i b u t e =” j a v a s c r i p t : f o o ( ) ” t a g s . Catch s p a c e s i n t h e r e g e x 525 ∗ /(=| u r l \ ( ) ( ” ? ) [ ˆ > ] ∗ s c r i p t : / ∗/ 526 $ pr eg = ’/(=|(U\s*R\s*L\s *\())\s *("|\ ’)?[^ >]*\s*’ . 527 ’S\s*C\s*R\s*I\s*P\s*T\s *:/i’;
528 $ da ta = preg replace ( $preg , ’ HordeCleaned’ , $ da ta ) ;
543 /∗ Get a l l on<foo >=”bar ( ) ” . NEVER a l l o w t h e s e . ∗/
544 $ da ta = preg replace ( ’/([\s"\ ’]+ on\w +)\s*=/i’ ,
545 ’HordeCleaned=’, $ da ta ) ;
550 /∗ Remove a l l s c r i p t s . ∗/
551 $ da ta = preg replace ( ’|< script [^ >]* >.*? </script >| is’ ,
552 ’<HordeCleaned_script />’, $ da ta ) ;
553 r e t u r n $ da ta ;
554 }
projects stats pop.php
21 $ use = ( i n t ) $ REQUEST [ ’use’ ] ;
22 $module = s t o p x s s ($ REQUEST [ ’module ’ ] ) ;
62 i f ( $ use ) {
63 echo ’<br >’ ;
71 } e l s e {
72 $ h i d d e n f i e l d s = "<input type=’ hidden ’ "
73 . " name=’ module ’ value=’ $module ’/>\n" ;
74 echo ’
75 <form action =" stats.php" method =" get">
76 <div style =" width :60%; float: right">
77 <input name =" speichern" />
78 ’. g e t b u t t o n s ( ) . ’
79 </div >
80 ’. $ h i d d e n f i e l d s . ’
81 </form >’;
82 }
Figure 4.1: Vulnerable PHP code.
description, we studied the Gecko layout engine [27], which Firefox [23] and Mozilla [67] use, looked at the W3C recommendation for scripts in HTML documents [39], and looked at online documentation for how other browsers handle HTML documents [78]. We represent the policy using regular languages and check whether untrusted parts of the document can invoke the JavaScript interpreter using language inclusion.
4.3. Analysis Algorithm 64
4.2.3 Running Example
Section 4.3 presents our analysis algorithm using the PHP code in Figure 4.1 as input. This section explains the example code and describes its vulnerability.
The program fragment in Figure 4.1 is pared down and adapted for display from PHPro- jekt 5.2.0, a modular application for coordinating group activities and sharing information and documents via the web [72]. It is widely used and can be configured for 38 languages and 9 DBMSs. The untrusted data comes from the $ REQUEST array on lines 21 and 22. Line 22 assigns untrusted data to the $module variable after passing it through the stop xss function.
The stop xss function, which PHProjekt uses to perform input validation, uses Perl- style regular expressions to remove dangerous substrings from the input. The complete function also checks for alternate character encoding schemes, likely phishing attacks, and other dangerous tags, but these checks are omitted here for brevity. The primary ways to invoke the JavaScript interpreter are through script URIs; event handlers, all of which begin with “on”; and “<script>” tags. The function stop xss removes these three cases with the regular expression replacements on lines 528, 544, and 551, respectively.
The W3C recommendation for HTML attributes specifies that white space characters may separate attribute names from the following ‘=’ character. The regular expression on line 545 reflects this specification: ‘\w’ represents word characters (word characters include alphanumeric characters, ‘ ’, and ‘.’), and ‘\s’ represents white space characters. However, the Gecko layout engine’s HTML parser permits arbitrary non-word characters between the attribute name and the ‘=’ character. Consequently, if an input string includes a substring such as “’ onload#=” followed by arbitrary script, it will pass the filter on line 544 but will cause untrusted JavaScript to be executed when the output page is viewed in Firefox.
4.3
Analysis Algorithm
4.3. Analysis Algorithm 65
$ da ta 1 = $ REQUEST [ ’module ’ ] ;
$ da ta 2 = preg replace ( ’/([\s"\ ’]+ on\w+)\s *=/i’ , ’HordeCleaned=’ , $ da ta 1 ) ; $module = $ da ta 2 ;
i f ( $ use ) {
$ o utput1 = ’<br >’ ; } e l s e {
$ h i d d e n f i e l d s = "<input value=’ $module ’/>" ; $ o utput2 = ’ </div > ’ . $ h i d d e n f i e l d s ;
}
$ o utput3 = φ ( $output1 , $ o utput2 ) ;
Figure 4.2: Code in SSA form. REQUESTmoduleT → Σ∗
data1 → REQUESTmodule
data2 → preg replace( /([\s"\’]+on\w+)\s*=/i, HordeCleaned=, data1);
module → data2
output1 → <br>
hiddenfields → <input value=’module’/> output2 → </div> hiddenfields
output3 → output1 | output2
Figure 4.3: Productions for extended CFG.
4.3.1 String-taint Analysis
The string-taint phase of our analysis comes from Section 3.3.1; we review the main steps here to illustrate it on an example relevant to XSS and set up the running example for the next section. The first phase of the string-taint analysis translates output statements (e.g., echo statements) into assignments to an added output variable, and translates the program into static single assignment (SSA) form [17] in order to encode data dependencies. Figure 4.2 illustrates this translation on the example code, omitting and simplifying several parts of the program for the sake of presentation.
Because the SSA form encodes data dependencies, the next phase of the string-taint analysis drops control structures, translates assignment statements into grammar produc- tions, and labels untrusted data sources. This phase constructs an extended context free
4.3. Analysis Algorithm 66 REQUESTmoduleT → ♦ data1 → REQUESTmodule data2 → data1; module → data2 output1 → <br>
hiddenfields → <input value=’module’/> output2 → </div> hiddenfields
output3 → output1 | output2
Figure 4.4: CFG with untrusted substrings summarized; see Section 4.3.2 for an explanation of ‘♦.’
grammar (CFG); it is extended in the sense the grammar productions’ right hand sides may contain string functions. Figure 4.3 shows the extended CFG for our example, with output3 as the start symbol. In Figure 4.3, REQUESTmodule is labeled with a taint annotation indicating that substrings derived from that non-terminal are untrusted. The only string function in our example grammar is preg replace, which takes three arguments: a pattern, a replacement, and a subject. It searches for instances of the pattern in the subject and replaces them with the replacement. In general the replacement may reference and include parts of the string that the regular expression pattern matched, but in this example it does not.
We transform the extended CFG into a standard CFG using FSTs as described previ- ously. In the extended CFG in Figure 4.3, the subject argument to preg replace has Σ∗ as its language, so the output language of the preg replace function is:
Σ∗ L((\s|"|’)+
(o|O)(n|N)\w+\s∗=).
We use regular expression notation to simplify the presentation of this example. The ‘♦’ character in Figure 4.4 replaces the regular expression shown above in the grammar. The next section discusses this substitution in more detail.
4.3. Analysis Algorithm 67