
5.3 The Model

5.3.1 Building the corpus

Our approach configures the model automatically using machine learning. The corpus is a set of sequences of observations annotated with states, and it contains the knowledge that the model will learn. The corpus is crucial for the approach because it encodes which sequences of instructions lead to vulnerabilities and which do not.

The corpus is built in four steps: collecting a set of (PHP) instructions from vulnerable and non-vulnerable slices; representing these instructions in ISL (as sequences of observations); manually annotating each observation (each ISL token) of the sequences with a state; and removing duplicate sequences of annotated observations. The upper part of Figure 5.1(a) represents these steps.

The most critical step is the first, in which a set of slices representing existing vulnerabilities (and non-vulnerabilities) with different combinations of code elements has to be obtained. In practice we used a large number of slices from open source applications (see Section 5.4).

A sequence of the corpus is composed of two or more pairs ⟨token,state⟩. The instruction $var = $_POST['parameter'], for instance, translated into ISL becomes input var and is represented in the corpus as ⟨input,Taint⟩ ⟨var_vv,Taint⟩. Both states are Taint (compromised) because the input is always Taint (input is the source of the attacks we consider).

In the corpus, observations are annotated according to their taintedness status and type, as presented in column 4 of Table 5.1, and according to the tokens that represent some class of functions from that table. For instance, the PHP instruction $var = htmlentities($_POST['parameter']) is translated to sanit_f input var and represented in the corpus by the sequence ⟨sanit_f,San⟩ ⟨input,San⟩ ⟨var,N-Taint⟩. The first two tokens were annotated with the San state because the sanitization function sanitizes its parameter, and the last token was annotated with the N-Taint state, meaning that the operation and the final state of the sequence are not tainted.

Notice that in the previous examples the state of the last observation is the final state of the sequence. In the sanitization example that state is N-Taint, indicating that the sequence is not tainted (not compromised), while in the other example that state is Taint, indicating that the sequence is tainted (compromised).
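The rule that the last observation carries the final state of the sequence can be illustrated with a minimal sketch. It assumes a corpus sequence is stored as a list of (token, state) pairs; the names Pair and final_state are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

# Assumption: a corpus sequence is modelled as a list of (token, state) pairs.
Pair = Tuple[str, str]


def final_state(sequence: List[Pair]) -> str:
    """Return the state of the last observation, which labels the whole sequence."""
    if len(sequence) < 2:
        raise ValueError("a corpus sequence is composed of two or more pairs")
    return sequence[-1][1]


# The two examples discussed in the text.
tainted = [("input", "Taint"), ("var_vv", "Taint")]
sanitized = [("sanit_f", "San"), ("input", "San"), ("var", "N-Taint")]

print(final_state(tainted))
print(final_state(sanitized))
```

Under these assumptions, the first call yields the state of the tainted example and the second the state of the sanitization example.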

As mentioned above, the token var_vv is not produced when slices are translated into ISL, but is used in the corpus to represent variables with state Taint (tainted variables). In fact, during translation into ISL it is not known whether variables are tainted, so they are represented by the token var. In the corpus, if the state of the variable is annotated as Taint, the variable is represented by var_vv, forming the pair ⟨var_vv,Taint⟩.
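The renaming rule of this paragraph can be sketched as follows. The function name annotate and the parallel-list input format are assumptions made for illustration, and the states are supplied by hand, as in the manual annotation step. The sketch covers only the rule stated above (var annotated as Taint becomes var_vv); it does not handle a tainted variable that appears under another state, such as the parameter of a sanitization function.

```python
from typing import List, Tuple


def annotate(tokens: List[str], states: List[str]) -> List[Tuple[str, str]]:
    """Pair each ISL token with its manually assigned state.

    A var token whose state is Taint is rewritten to var_vv.
    """
    if len(tokens) != len(states):
        raise ValueError("each token needs exactly one state")
    pairs = []
    for token, state in zip(tokens, states):
        if token == "var" and state == "Taint":
            token = "var_vv"  # tainted variables get their own corpus token
        pairs.append((token, state))
    return pairs


# ISL translation of $var = $_POST['parameter'] with its annotation.
print(annotate(["input", "var"], ["Taint", "Taint"]))
```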

Listings 5.2 and 5.3 show an example of this process of creating the corpus, with its four steps. Listing 5.2(a) presents PHP instructions extracted from vulnerable and non-vulnerable slices. Two examples of these slices, respectively, are the sequences of instructions of lines {1, 8} and {2, 5, 8}. Listing 5.2(b) represents each of these instructions in ISL (second step). Some instructions have more than one representation, depending on whether the extracted slice is vulnerable or not. For example, the instruction labeled 5 has two representations (the two lines immediately below it) to represent the sanitization of an untainted and a tainted variable, respectively (first and second representations). The listing makes the difference between the var and var_vv tokens visible. For the two example slices above, line 8 is represented in ISL by the first representation for the vulnerable slice, and by the second representation for the non-vulnerable slice. Listing 5.3 represents the last two steps and the corpus. Each sequence of observations is annotated as explained above. Duplicate sequences are reduced to one sequence, because different PHP instructions can result in the same sequence. For example, the PHP instructions from lines 1 and 2 (Listing

1 $var = $_POST['parameter']
2 $var = $_GET['parameter']
3 $var = htmlentities($_POST['parameter'])
4 $var = mysql_real_escape_string($_GET['parameter'])
5 $var = htmlentities($var)
6 $var = "SELECT field FROM table WHERE field = $var"
7 $var = mysql_query($var)
8 echo $var
9 include($var)
10 if (isset($var) && $var > number)
11 if (is_string($var) && preg_match('pattern', $var))

(a) collecting step.

1 $var = $_POST['parameter']
    input var_vv
2 $var = $_GET['parameter']
    input var_vv
3 $var = htmlentities($_POST['parameter'])
    sanit_f input var
4 $var = mysql_real_escape_string($_GET['parameter'])
    sanit_f input var
5 $var = htmlentities($var)
    sanit_f var var
    sanit_f var_vv var
6 $var = "SELECT field FROM table WHERE field = $var"
    var var
    var_vv var_vv
7 $var = mysql_query($var)
    ss var var
    ss var_vv var_vv
8 echo $var
    ss var_vv
    ss var
9 include($var)
    ss var_vv
    ss var
10 if (isset($var) && $var > number)
    cond fillchk var_vv cond
    cond fillchk var cond
11 if (is_string($var) && preg_match('pattern', $var))
    cond typechk_str var_vv contentchk var_vv cond
    cond typechk_str var_vv contentchk var cond
    cond typechk_str var contentchk var_vv cond
    cond typechk_str var contentchk var cond

(b) representing step.

1 <input,Taint> <var_vv,Taint>
2 <sanit_f,San> <input,San> <var,N-Taint>
3 <sanit_f,San> <var,San> <var,N-Taint>
4 <sanit_f,San> <var_vv,San> <var,N-Taint>
5 <var,N-Taint> <var,N-Taint>
6 <var_vv,Taint> <var_vv,Taint>
7 <ss,N-Taint> <var,N-Taint> <var,N-Taint>
8 <ss,N-Taint> <var_vv,Taint> <var_vv,Taint>
9 <ss,N-Taint> <var_vv,Taint>
10 <ss,N-Taint> <var,N-Taint>
11 <cond,N-Taint> <fillchk,Val> <var_vv,Val> <cond,N-Taint>
12 <cond,N-Taint> <fillchk,Val> <var,Val> <cond,N-Taint>
13 <cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var_vv,Val> <cond,N-Taint>
14 <cond,N-Taint> <typechk_str,Val> <var_vv,Val> <contentchk,Val> <var,Val> <cond,N-Taint>
15 <cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var_vv,Val> <cond,N-Taint>
16 <cond,N-Taint> <typechk_str,Val> <var,Val> <contentchk,Val> <var,Val> <cond,N-Taint>

Listing 5.3: Building the corpus: annotating and removing steps.
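The removing step can be sketched as an order-preserving deduplication. This is a minimal illustration, assuming each annotated sequence is a list of (token, state) pairs; the function name remove_duplicates is hypothetical and not taken from the original tool.

```python
from typing import List, Tuple

Sequence = List[Tuple[str, str]]


def remove_duplicates(sequences: List[Sequence]) -> List[Sequence]:
    """Keep the first occurrence of each annotated sequence, preserving order."""
    seen = set()
    unique = []
    for seq in sequences:
        key = tuple(seq)  # lists are not hashable, tuples of pairs are
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique


# Annotated sequences for instructions 1 to 4 of the collecting step:
# 1 and 2 coincide, and so do 3 and 4.
annotated = [
    [("input", "Taint"), ("var_vv", "Taint")],
    [("input", "Taint"), ("var_vv", "Taint")],
    [("sanit_f", "San"), ("input", "San"), ("var", "N-Taint")],
    [("sanit_f", "San"), ("input", "San"), ("var", "N-Taint")],
]
corpus = remove_duplicates(annotated)
print(len(corpus))
```

Here the four annotated sequences collapse to the first two entries of the corpus in Listing 5.3.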