In this section we define an intermediate representation called the Input Validation and Sanitization Language (IVSL). This language represents sanitizer functions, as defined in Section 2.1, which are then analyzed by our string analysis algorithms in Chapter 3 in a programming-language-independent way. Before analyzing the input validation and sanitization code for one or more input fields in a web application, we first extract that code as an IVSL program.
Sanitizer    → sanitizer( Var [, Var]* ) { Block }
Var          → <identifier>
Block        → Stmt [; Stmt]*
Stmt         → Var := Exp | return Var | reject
             | if( Pred ) { Block } [ else { Block } ]
             | while( Pred ) { Block }
Exp          → "<string-literal>" | Var | ? | StringFunc
Pred         → Pred && Pred | Pred || Pred | !Pred | ? | ( Pred )
             | Var RelOp "<string-literal>" | Var matches RegExp
             | StringFunc RelOp "<string-literal>" | IntFunc RelOp <integer-literal>
RelOp        → < | <= | > | >= | == | !=
StringFunc   → replace( RegExp, "<string-literal>" | Var, Var )
             | concat( "<string-literal>" | Var, "<string-literal>" | Var )
             | trim( Var, '<char>' [, '<char>']* )
             | addslashes( Var )
             | htmlspecialchars( Var )
             | substring( Var, [<integer-literal>], [<integer-literal>] )
IntFunc      → length( Var ) RelOp <integer-literal>
             | indexof( Var, "<string-literal>" [, "<string-literal>"] )
RegExp       → /[^] UnionExp [$]/
UnionExp     → InterExp | UnionExp
             | InterExp
InterExp     → ConcatExp & InterExp
             | ConcatExp
ConcatExp    → RepeatExp ConcatExp | RepeatExp
RepeatExp    → RepeatExp ? | RepeatExp * | RepeatExp +
             | RepeatExp {<integer-literal> [, <integer-literal>]}
             | ComplExp
ComplExp     → ~ ComplExp
             | CharClassExp
CharClassExp → [CharClasses] | [^CharClasses] | SimpleExp
CharClasses  → CharClass CharClasses | CharClass
CharClass    → <char> - <char> | <char>
SimpleExp    → <char> | . | (UnionExp) |
Figure 2.4: The abstract grammar for IVSL, the intermediate language used to represent sanitizers.
Figure 2.4 shows the syntax for IVSL. Keywords and operators are written in bold, nonterminals in italic, terminals are surrounded by < and >, and the typewriter font is used for the built-in functions.
An IVSL program has a single main function called sanitizer, which represents one single-input or multi-input sanitizer function as defined in Section 2.1. sanitizer takes one or more string variables as input and either rejects or returns a string value as output. In IVSL, variables are declared and defined simply by assigning them values. Only string variables are allowed, and ASCII is the only encoding currently supported.
<string-literal> represents a string literal (i.e., a string constant) in which the characters " and \ must be escaped with \. <char> represents a single ASCII character constant that is escaped according to its context. Outside a regular expression, only ' and \ must be escaped with \. Inside a regular expression, in addition to ' and \, all reserved regular-expression characters such as / and ? must also be escaped. <integer-literal> represents an integer constant, preceded by the - sign if the number is negative. Integer literals are allowed only as parameters to functions, as repetition counts in regular expressions, or in predicates that constrain the length of a string variable or an index within a string variable or value. The syntax of <identifier> follows the rules for PHP identifiers.
The language allows conditional statements, loops, and assignment statements with string operations. Assigning ? to a variable represents the assignment of an arbitrary string value s ∈ Σ∗. This allows translation into IVSL from other languages when right-hand-side expressions of non-string type are present. The operator matches is the language membership operator: it returns true if the string value of a variable is an element of the regular language defined by the regular expression. Comparison operators such as < and != refer to lexicographical ordering when applied to string expressions. We also use ? to indicate non-deterministic branch conditions. Since the language only allows string expressions, ? can be used to represent non-string predicates such as predicates on boolean or integer expressions. matches and the comparison operators have the highest precedence, followed by parentheses and then the logical operators.
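The predicate semantics described above can be illustrated with a small Python sketch. It is not part of IVSL or of our analysis; the helper names are hypothetical. It assumes that a regular expression without ^ or $ may match anywhere in the string (so anchors must be written explicitly), and it relies on the fact that Python compares ASCII strings lexicographically by character code. The last helper only enumerates both outcomes of a ? branch, which is how a non-deterministic condition has to be treated.

```python
import re


def matches(value: str, pattern: str) -> bool:
    """Sketch of the IVSL `matches` operator (hypothetical helper).

    Assumption: a pattern without ^ or $ may match anywhere in `value`,
    so re.search is used. Note that Python's `$` also matches just before
    a trailing newline, which IVSL may not do.
    """
    return re.search(pattern, value) is not None


def outcomes_of_unknown_branch(s: str) -> set:
    """Outcomes of `if (?) { reject } else { return s }`.

    The `?` predicate may be either true or false, so both branches
    are kept as possible outcomes.
    """
    outcomes = set()
    for unknown in (True, False):
        outcomes.add("reject" if unknown else "return " + s)
    return outcomes


# Membership with and without anchors.
unanchored = matches("xfoox", "foo")
anchored = matches("abc1", r"^[a-z]+$")

# Lexicographical ordering on ASCII strings.
ordered = "abc" < "abd"
```

Here unanchored holds because foo occurs inside the value, while anchored does not hold because the digit falls outside the character class.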
The language does not allow user-defined functions. It provides two types of built-in functions: (1) string functions, which return string values, and (2) integer functions, which return integer values. There are three core built-in functions: concat, replace, and length. These functions can be used to model a wide range of string manipulation operations in different programming languages. Table 2.1 shows some examples of translating PHP and JavaScript string operations into IVSL code. Notice that the translation of the same operation may differ depending on the context in which the operation is used. In addition, the language provides a number of specialized built-in functions that allow for more precise modeling of built-in library string functions in PHP, JavaScript, and Java.
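As a rough illustration of the three core functions, the following Python sketch models them with the argument order given in the grammar. It is only an approximation under stated assumptions: replace is assumed to substitute every occurrence of the pattern, the replacement is treated as a literal string, and Python's regular-expression syntax stands in for the IVSL one.

```python
import re


def concat(left: str, right: str) -> str:
    """Model of concat(a, b): string concatenation."""
    return left + right


def replace(pattern: str, replacement: str, subject: str) -> str:
    """Model of replace(RegExp, replacement, Var).

    Assumptions: all occurrences are replaced, and the replacement is a
    literal string (the lambda prevents backreference interpretation).
    """
    return re.sub(pattern, lambda _match: replacement, subject)


def length(value: str) -> int:
    """Model of length(Var): number of characters."""
    return len(value)


greeting = concat("Hello, ", "world")
escaped = replace(r"<", "&lt;", "a<b<c")
size = length(greeting)
```

A composite operation in a source language can then be modeled as a sequence of such calls, each assigned to a string variable.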
Lang.  Original Operation       IVSL Code
JS     if (v1.match(/foo/))     if (v1 matches /foo/)
JS     v2 = v1.match(/bar/)     v2 = ?
JS     if (/foo/.test(v1))      if (v1 matches /foo/)
JS     v2 = /bar/.test(v1)      v2 = ?
PHP    $v2 = nl2br($v1)         v2 = replace(/\\n/, "<br/>", v1);
                                v2 = replace(/\\r/, "<br/>", v2);
                                v2 = replace(/\\r\\n/, "<br/>", v2);
                                v2 = replace(/\\n\\r/, "<br/>", v2);
Table 2.1: Examples of translation from JavaScript and PHP string operations to IVSL code.

2.2.1 Using IVSL to Validate and Sanitize Inputs
An IVSL program has only one accepting sink, the return Var statement, which returns the value of the string variable Var. The input strings are validated using branch conditions that test whether a set of validation constraints is satisfied. For example, all string values for variable s of length greater than or equal to 10 are filtered out by the following branch condition: length(s) < 10. If a string value is not valid, it is rejected by executing the reject statement, which halts execution and exits the program; that is, the reject statement corresponds to the exit() statement in PHP. Unlike return Var, we allow multiple reject statements, since a string may be rejected on the basis of many validation constraints.

Input sanitization is carried out either through the core string manipulation operations concat and replace or through specific operations such as trim, addslashes, and htmlspecialchars.
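The interplay of validation and sanitization can be sketched in Python as follows. This is a hypothetical example, not a sanitizer taken from our benchmarks: the Reject exception stands in for the reject statement, str.strip approximates trim, and html.escape approximates htmlspecialchars (the two differ in details such as how the single quote is encoded). The comments show the IVSL statement each line is meant to mirror.

```python
import html


class Reject(Exception):
    """Stands in for the IVSL reject statement (hypothetical name)."""


def sanitizer(s: str) -> str:
    # if (length(s) < 10) { ... } else { reject }
    if not (len(s) < 10):
        raise Reject("length constraint violated")
    # s := trim(s, ' ')
    s = s.strip(" ")
    # s := htmlspecialchars(s)   (approximated by html.escape)
    s = html.escape(s, quote=True)
    # return s   (the single accepting sink)
    return s


accepted = sanitizer(" <b> ")

try:
    sanitizer("0123456789")
    rejected = False
except Reject:
    rejected = True
```

The short input passes the length check and is returned in sanitized form, while the ten-character input is rejected before any sanitization takes place.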