Static Analysis Versus Penetration Testing:
a Controlled Experiment
Riccardo Scandariato, iMinds-DistriNet, KU Leuven, 3001 Leuven, Belgium. Email: firstname.lastname@example.org
Department of Computer Science Northern Kentucky University
Highland Heights, KY 41076 Email: email@example.com
Wouter Joosen, iMinds-DistriNet, KU Leuven, 3001 Leuven, Belgium. Email: firstname.lastname@example.org
Abstract: Suppose you have to assemble a security team, which is tasked with performing the security analysis of your organization’s latest applications. After researching how to assess your applications, you find that the most popular techniques (also offered by most security consultancies) are automated static analysis and black box penetration testing. Under time and budget constraints, which technique would you use first? This paper compares these two techniques by means of an exploratory controlled experiment, in which 9 participants analyzed the security of two open source blogging applications. Despite its relatively small size, this study shows that static analysis finds more vulnerabilities, and in less time, than penetration testing.
Validating the security of an application’s code entails eliminating software defects that make the application vulnerable to attack. Several techniques are used to secure applications, ranging from formal verification of security properties to security testing. Most security consultancy firms, for instance, provide services for inspecting the application code by means of automated static analysis or assessing the attack surface via penetration testing. Automated static analysis relies on code scanners, which provide a report of potential weaknesses in the application. A scanner analyzes source code to identify poor programming practices that enable potential attacks, such as a loop that incorrectly iterates over the elements of an array, opening the possibility of a buffer overflow. Penetration testing applies known attack patterns to the entry points of an application, such as an SQL injection via an input field of a web form. Penetration testing also leverages tools that automate the discovery of vulnerabilities in running applications.
In both cases, the humans overseeing the tools have a great impact on the outcome. Since scanners report false positives, their output must be manually validated. Penetration testing likewise needs a knowledgeable individual to configure the tools and evaluate their output.
In this study, we take the viewpoint of a newly formed security team (e.g., at a medium sized software company) that is composed of employees with good development experience and less extensive security expertise. This is often the case when the team members are drawn from company personnel rather than being hired externally. As the team would likely have limited resources to devote to assessing the application, it is important to determine which vulnerability discovery technique provides a more favorable cost-to-benefit ratio.
The contribution of this paper is an exploratory controlled experiment comparing automated static analysis (SA) and penetration testing (PT) in the context of the security analysis of web applications. In particular, we chose to use Fortify’s Static Code Analyzer as the tool supporting the static analysis process and Burp Suite as the tool supporting the penetration testing process. These tools are widely used and representative of the state of the practice in software security.
The experiment enrolled nine students in their final year of a master’s degree in computer science at Northern Kentucky University. The participants are returning students, i.e., they had significant industry experience after earning their bachelor’s degree and prior to the beginning of the master’s program. The majority of the participants are getting their master’s degree while working full time. The experiment was designed in such a way that each participant applied both techniques. This gave us the opportunity to collect the comparative opinion of the participants.
The experiment aims at answering the following three research questions: (1) which technique (static analysis or penetration testing) discovers more vulnerabilities, (2) which technique is more precise, i.e., has a lower rate of false alarms, and (3) which technique yields a higher productivity, i.e., more discovered vulnerabilities per hour.
The rest of the paper is organized as follows. Section II describes the two techniques in more detail and Section III discusses the related work. The planning and operation of the experiment are described in Sections IV and V, respectively. The results are presented in Sections VI and VII. The threats to the validity of this study are presented in Section VIII. Finally, Section IX presents the concluding remarks.
Vulnerabilities can be identified in software using both static and dynamic analysis techniques. Static analysis is the evaluation of an application through examination of the code without execution. Dynamic analysis is the testing and evaluation of a running application. Static and dynamic analysis techniques have different advantages and disadvantages in terms of identifying vulnerabilities. In our study, we compared a static technique, automated static analysis, with a dynamic technique, penetration testing with a testing proxy.
Neither static nor dynamic vulnerability discovery techniques are perfect. Sometimes code is incorrectly identified as containing a fault. This misidentification is called a false positive. A true positive exists when an actual vulnerability is identified in the code.
A. Static Analysis with Fortify SCA
Inspecting code for security flaws enables vulnerabilities to be identified early in the software life cycle, before an application is capable of being tested. However, manual code review is time-consuming and requires a detailed understanding of security. Automated static analysis tools like Fortify Source Code Analyzer (SCA, www.fortify.com) enhance code reviews by identifying vulnerabilities faster than a human could while requiring less expertise from the tool user and reviewer. Fortify SCA is the market leader according to Gartner’s magic quadrant for static security analysis.
In this experiment, participants conducted individual code reviews based on vulnerability warnings reported by Fortify SCA. They filtered and prioritized warnings based on the experimental instructions, then evaluated the warnings one by one. To evaluate a warning, participants would examine the data flow from the input source to the vulnerable sink. If a malicious user could not control the input source or if the sink was not actually vulnerable, then the warning would be marked as a false positive. Similarly, if an input validation function was applied at an intermediate point in the data flow to cleanse data correctly, then the warning would also be marked as a false positive. Alternatively, if input validation did not detect all possible dangerous inputs or if there was no such validation, then the warning would be labeled as a true positive.
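The decision rule that participants applied to each warning can be sketched in a few lines of Python. This is only an illustration of the rule as described above: the `WarningTrace` record and its field names are hypothetical and are not part of Fortify SCA or of the experimental materials.

```python
from dataclasses import dataclass


@dataclass
class WarningTrace:
    """Hypothetical summary of what a reviewer learns by tracing one warning."""
    source_attacker_controlled: bool  # can a malicious user control the input source?
    sink_vulnerable: bool             # is the sink actually vulnerable?
    validation_present: bool          # is input validation applied along the data flow?
    validation_complete: bool         # does it catch all possible dangerous inputs?


def triage(trace: WarningTrace) -> str:
    """Label a warning following the rule described in the text."""
    # No attacker-controlled source, or a sink that is not vulnerable: false alarm.
    if not trace.source_attacker_controlled or not trace.sink_vulnerable:
        return "false positive"
    # Data is cleansed correctly before it reaches the sink: false alarm.
    if trace.validation_present and trace.validation_complete:
        return "false positive"
    # Validation is missing or incomplete: the warning is an actual vulnerability.
    return "true positive"


# Example: validation exists but misses some dangerous inputs.
label = triage(WarningTrace(True, True, True, False))
```

In practice each of the four flags is the outcome of a manual code inspection, which is where the reviewer's expertise matters.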
B. Penetration Testing with Burp Suite
Penetration testing is the process of evaluating the security posture of a running application by attempting to compromise its security. Penetration testing can be performed with full access to the application’s design documentation and source code (white-box testing), without any such access (black-box testing), or with an intermediate level of access (gray-box testing). Penetration testing can be performed as an exploratory technique, i.e., without a test plan, or as a systematic technique with a test plan. It can also be performed using automated penetration testing tools.
In this experiment, participants used black-box exploratory penetration testing with the assistance of PortSwigger’s Burp Suite (portswigger.net) tool. Burp Suite is not an automated vulnerability discovery tool like Fortify SCA. Instead, Burp Suite provides a set of testing tools including an intercepting proxy, a web application spider, and a configurable web application fuzzer. Participants use the spider to map the web application’s functionality, then use the intercepting proxy to observe and modify HTTP requests either manually or programmatically according to patterns applied via the fuzzer. Burp Suite is mentioned as one of the best niche players in Gartner’s magic quadrant for dynamic security testing.
Case studies. Earlier work has examined the differences between automated static analysis and penetration testing. Typically, the evaluation is conducted in the context of a case study where one expert uses the studied techniques, while our work is a controlled experiment involving several individuals. Antunes and Vieira compared the effectiveness of three commercial automated penetration testing tools with three free automated static analysis tools in finding SQL injection vulnerabilities in web services. They found that static analysis tools found more vulnerabilities but also had more false positives than automated penetration testing tools. Static and dynamic analysis tools found different vulnerabilities in the same parts of the source code studied.
Austin and Williams compared, in a case study, four vulnerability discovery techniques on two web-based electronic health record systems. The techniques studied included exploratory manual penetration testing, systematic manual penetration testing, static analysis, and automated penetration testing. They found that different techniques found different types of vulnerabilities, while automated penetration testing was the most productive technique as measured in terms of validated vulnerabilities discovered per hour of effort.
Descriptive studies. Prior work has evaluated the effectiveness of individuals applying a single technique in the context of a descriptive study, while this work compares techniques. Baca et al. found that average developers were no better than chance at correctly identifying static analysis warnings as actual vulnerabilities. However, developers with security experience and/or static analysis experience were two to three times as accurate as typical developers. Their study indicates that developer experience is an essential aspect in evaluating the effectiveness of static analysis.
Edmundson et al. hired 30 developers from an outsourcing site to perform a manual code review of a small open source web application written in PHP with seven known vulnerabilities. They did not find significant correlations between development or security experience and the number of correct vulnerabilities found.
IV. PLANNING THE EXPERIMENT
According to the template defined by Basili et al., the goals of this study are defined as follows.
Purpose: The purpose of this study is to characterize the
Object of study: the choice of a security bug finding technique (either white-box static analysis or black-box pen testing)
Focus: on both the analyst’s productivity and the quality of the analysis results
Stakeholder: from the point of view of the (junior) security
Context: and in the context of a master’s level course at Northern Kentucky University.
The participants of this study are nine students of a course on secure software engineering that is positioned in the last year of the two-year master’s degree program in computer science at Northern Kentucky University. The participants are in their sixth year of academic education. To enroll in this course, students must have completed two prerequisite courses, one in software engineering and the other in computer security. The program is often selected by professionals who, after a few years in the software industry, are seeking a higher degree to advance in their career. Most of these students are taking evening classes while working full time as software developers.
We surveyed the background of the participants by means of a questionnaire administered at the beginning of the experiment. As shown in Figure 1, seven participants have at least one year of experience as employees in the software industry and four have a seniority that is greater than three years. Figure 2 describes the roles covered by the participants in their occupation. All of them have worked as developers, and as many as seven also have experience as software designers. Four of the participants also reported some experience as testers, which is a useful asset for the tasks they carried out in the experiment, especially for penetration testing. There are more roles than participants, since participants may work in multiple different roles over the course of their career.
Fig. 1. Seniority of the participants
Fig. 2. Roles covered in the industry
We also investigated the programming skills and security expertise of the participants, and the results are reported in Figure 3. Good Java programming skills are important for the static analysis task as the code needs to be inspected in order to validate the warnings produced by the analysis tool. Two thirds of our participants claimed to be skilled Java programmers, although we did not test their skills directly. Concerning their security knowledge, the participants are not complete novices, although two thirds admit to having limited expertise and only one third claimed adequate security skills.
Fig. 3. Skill levels of the participants
In summary, despite their enrollment in a degree program, the participants have a profile which is closer to that of professionals than of students. Indeed, they have substantial industrial experience and advanced development skills. Clearly, the participants are not entirely representative of the population of security analysts, due to their sub-optimal security skills. However, they have the necessary maturity to substantiate the validity of the results of this work, which focuses on professionals beginning their activity in a security team.
C. Experimental Objects
For this experiment, we needed to select two approximately equivalent applications that were written in a language with which the students were familiar. We also needed vulnerabilities found in the applications to be of types that the students had studied. In order for the applications to be approximately equivalent, we selected them from the same application domain: weblogs written in Java. In order for the experiment to be authentic, we decided to use open source applications that were currently in use instead of using applications created solely for the purposes of this experiment.
We selected two weblog applications for the experiment: Apache Roller (roller.apache.org) and Pebble (pebble.sourceforge.net). Both Roller and Pebble are comprehensive blogging platforms, with support for templates, feeds, multiple users, threaded comments, and plugins. Both applications are currently in development and have a history of vulnerabilities recorded in public databases. The applications are approximately of the same size and complexity.
1) Pebble: Pebble 2.6.3, the version used in the experiment, consists of 56,168 executable lines of code as measured by Fortify SCA. Version 2.6.4 is the current version (not available at the time of the experiment) and 37 versions of Pebble have been released since version 1.0 was released in 2003. Pebble stores its data in XML files rather than in a SQL database.
Pebble has had five vulnerability advisories in the National Vulnerability Database (NVD). Note that advisories may aggregate multiple vulnerabilities of the same type, so that the number of vulnerabilities may be higher than the number of advisories. The reported vulnerabilities refer to cross-site scripting and HTTP response splitting. Three of these vulnerabilities had not yet been fixed in the version of Pebble used in the experiment.
2) Apache Roller: The version of Apache Roller used in the experiment is 5.0.1, which is the latest version. It consists of 62,217 executable lines of code as measured by Fortify SCA. Five versions of Roller have been released since version 3.1 went public in 2007, although Roller has been under development since 2002. Roller stores data in a SQL database, which opens the possibility of SQL injection vulnerabilities.
We found three vulnerability advisories for Roller in the NVD. Eight vulnerabilities were reported in the Open Source Vulnerability Database (OSVDB) and in the Apache Roller project’s Jira issue repository. As an example, one historical vulnerability mentioned in both OSVDB and Jira involved the database and was related to the storage of clear-text passwords in early versions of Roller. All the above-mentioned vulnerabilities had been fixed in the version of Roller used in the experiment.
1) Static analysis (SA): Participants are provided with the source code of the application to analyze and with the corresponding results of the Fortify Static Code Analysis (SCA) tool. The results of the static analysis are stored as a Fortify Project (FPR) file, which contains the list of vulnerability warnings reported by the tool. The FPR file is the result of a completely automated analysis, i.e., it is obtained by using the default configuration of the scanner. This choice makes our experiment easier to reproduce. Participants examine the vulnerability warnings by opening the FPR file with the Fortify Audit Workbench tool, which is also provided to them. In the FPR file, each warning is associated with the line number of the source file where the vulnerability is allegedly present. The tool also reports the type of vulnerability, such as Cross-Site Scripting or SQL Injection, as well as the severity of the discovered vulnerability, rated on a scale from 1 (low) to 5 (high).
In the Audit Workbench, filters can be applied to the list of vulnerabilities to organize them by different categorization systems. As a means to scope the analysis to the most important security issues, participants are asked to use the “OWASP Top 10 2010” filter and to focus only on those vulnerability warnings that belong to one of those ten categories. The Open Web Application Security Project (OWASP) is a well-known consortium that publishes a list of the ten most critical risks affecting web applications, such as Injection and Cross-Site Scripting. When the filter is applied in Audit Workbench, vulnerability reports are categorized according to the OWASP Top 10 classification. Vulnerabilities of types that are not found in the OWASP Top 10 are shown in an unclassified folder, which participants were instructed to ignore.
In the case of Pebble, Fortify SCA reports 37 warnings in OWASP Top 10 categories, of which 21 are confirmed vulnerabilities. For Apache Roller, Fortify SCA lists 12 high-priority warnings, of which 9 are confirmed vulnerabilities. The number of confirmed vulnerabilities is obtained through the scrutiny of a security expert (as described in Section V-C) and is not disclosed to the participants. The task of the participants is to analyze the source code of the application and decide whether each of the vulnerability warnings is correct (an actual vulnerability exists) or not (the tool is wrong). The participants are also asked to keep track of the order they follow while assessing the vulnerability warnings.
Participants worked on the task during one supervised lab session. However, they were given a week to write up their findings at home and to turn in a written report. The results documented by the participants correspond to the work observed by the supervisors in the lab, i.e., the participants did not cheat by performing extra analysis at home. According to the template provided to the participants, each analyzed warning is documented in the written report as follows:
• Vulnerability number: the sequential number of the vulnerability, according to the order in which they are analyzed.
• Vulnerability type: the type reported by Fortify SCA.
• Location: the file name, class, and line number of the vulnerability sink. If two vulnerabilities have the same sink, they were counted as a single vulnerability.
• Status: whether the vulnerability warning is correct or not. The conclusion needs to be backed up with evidence.
• Description: the nature of the vulnerability and what impact it would have on the application.
2) Penetration testing (PT): In order to perform a penetration test, a running instance of the target application needs to be accessible to the participants. To this aim, we set up nine virtual machines on a server, each containing an identical copy of the application under analysis. The only difference between virtual servers was the IP address. Participants were requested to create a map of the application’s entry points, which can be accomplished by means of the spider functionality of Burp Suite. Afterward, participants were asked to discover weaknesses that could be exploited through the entry points. The discovery process could employ manual techniques as well as the fuzz testing capabilities of Burp Suite. Note that participants were asked to test the application both when unauthenticated and when authenticated as a default user (credentials were provided). Participants were also requested to classify discovered vulnerabilities according to the OWASP Top 10 list, and they were expected to focus solely on vulnerabilities from this list. Participants were also asked to keep track of the order of discovery of the vulnerabilities.
According to the template provided to the participants, each analyzed warning should be documented in the written report as follows:
• Vulnerability number: the sequential number of the vulnerability, according to the order in which they are analyzed.
• Vulnerability type: a type selected by the participant out of the OWASP Top 10 2010 list of vulnerability categories.
• URL: the URL through which the vulnerability is
• Input Field: the input field(s) used to exploit the vulnerability.
• Input Data: the input data that is necessary to exploit the vulnerability.
• Description: the nature of the vulnerability and what impact it would have on the application. The participants should also state any assumptions that are made in determining that this is a vulnerability.
The above documentation contains all the information that is necessary to replicate the attack described by the participant.
Fig. 4. Design of the experiment
E. Design of the Experiment
As shown in Figure 4, the experiment is organized in two laboratories. In the first lab, five randomly chosen participants analyzed the Pebble application by means of static analysis, while the other four analyzed the same application via penetration testing. In the second lab, the Apache Roller application was analyzed, and participants were assigned to the treatment they did not apply in the previous lab.
In summary, we chose a paired comparison design (each participant is administered both treatments), but we deemed that randomizing the order of the treatments would not have been enough to counter the learning effect and, hence, used two objects, i.e., applications. The two objects can be considered equivalent for the sake of the experiment, as the two applications provide the same functionality (blogging) and have similar feature sets. The two applications also have a comparable size of about 60,000 executable lines of code. Furthermore, they have the same maturity as both have about 10 years of development history. We also assume that the two applications have the same complexity. This assumption has been validated by means of a specific question in the questionnaire that we administered at the end of the experiment. To the question “Did you find that the two applications were of comparable complexity?”, the participants replied that they agreed: the median answer is 3 (‘agree’) on a 4-value scale ranging from 1 (strongly disagree) to 4 (strongly agree).
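The two-lab design described above can be sketched as a small assignment routine. This is a minimal illustration under stated assumptions: the function name, the participant identifiers, and the use of a seeded pseudo-random shuffle are hypothetical, since the source does not say how the random choice was actually made.

```python
import random


def assign_crossover(participants, sa_first_size, seed=None):
    """Split participants into two groups and give each both treatments.

    The SA-first group applies static analysis (SA) to Pebble in lab 1 and
    penetration testing (PT) to Apache Roller in lab 2; the other group does
    the reverse. Each participant therefore sees every technique and every
    application exactly once.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    sa_first = set(shuffled[:sa_first_size])

    schedule = {}
    for person in participants:
        if person in sa_first:
            schedule[person] = [("Pebble", "SA"), ("Apache Roller", "PT")]
        else:
            schedule[person] = [("Pebble", "PT"), ("Apache Roller", "SA")]
    return schedule


# Nine participants, five of whom start with static analysis (as in the text).
schedule = assign_crossover([f"P{i}" for i in range(1, 10)], sa_first_size=5, seed=1)
```

Using a different application in each lab means that nobody analyzes the same code twice, which is how the design limits the learning effect.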
TABLE I. TERMINOLOGY.
Measure | Definition | Formula | Wish
TP (True positive) | An actual vulnerability is correctly reported by the participant (a.k.a. correct result) | |
FP (False positive) | A vulnerability is reported by the participant but it is not present in the code (a.k.a. error, incorrect result, false alarm) | |
TOT (Reported vulnerabilities) | The total number of vulnerabilities reported by the participant | TP + FP |
TIME (Time) | The time (in hours) that it takes the participant to complete the task | |
PREC (Precision) | Percentage of the reported vulnerabilities that are correct | TP / TOT |
PROD (Productivity) | Number of correct results produced in a unit of time | TP / TIME |
According to the goals mentioned at the beginning of this section, we are interested in both the quality of the analysis results and the productivity of the analyst. The quality is primarily characterized by the number of correct results (the actual vulnerabilities that are found) as more results mean a more complete analysis and, consequently, a more secure application. Another important aspect is the number of errors (false alarms) as they result in a waste of resources for both the analysis team and the quality assurance team that is tasked with the bug fixing. As summarized in Table I, the correct results are called true positives (TP) and the errors are called false positives (FP). Next to the total number of true positives, we quantify the quality by means of the precision (PREC), i.e., the ratio of correct results over the total number of vulnerabilities reported. The measure of precision takes into account the number of errors but it scales them with respect to the total number of results. This corresponds to the reasonable assumption that mistakes are more likely when more work is done. The productivity (PROD) is quantified with respect to only correct results. Therefore, it is calculated as the number of true positives produced per hour.
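The two derived measures can be illustrated with a short Python sketch. The function names and the sample numbers are hypothetical and serve only to show the arithmetic implied by the definitions above; returning `None` when nothing is reported is a choice made for this sketch, not something the source specifies.

```python
def precision(tp, fp):
    """PREC: share of reported vulnerabilities (TOT = TP + FP) that are correct."""
    tot = tp + fp
    if tot == 0:
        return None  # precision is undefined when nothing was reported
    return tp / tot


def productivity(tp, time_hours):
    """PROD: number of true positives produced per hour of work."""
    if time_hours <= 0:
        raise ValueError("time_hours must be positive")
    return tp / time_hours


# Hypothetical participant: 6 correct findings, 2 false alarms, 2.5 hours.
prec = precision(tp=6, fp=2)              # 6 / 8 = 0.75
prod = productivity(tp=6, time_hours=2.5)  # 6 / 2.5 true positives per hour
```

Because PROD counts only true positives, a participant who reports many false alarms quickly does not appear more productive.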
According to the above definitions, we refine the overall research goals into the following three null hypotheses. First, we wonder whether, on average, the two techniques produce the same amount of true positives.
Assuming that the discovered vulnerabilities have a comparable importance, a technique that unearths more vulnerabilities is clearly to be preferred. Note that in our experiment, the task of the participants is to focus on the vulnerabilities of highest importance (as defined by OWASP) and therefore, the above-mentioned assumption holds in this study.
Moreover, we are interested in knowing whether, on average, the two techniques have the same precision.
A more precise technique implies that less “garbage” is present in the analysis results, and therefore less effort is wasted when the recommendations of the analysis report are followed in order to fix the security flaws.
TABLE II. QUESTIONNAIRE ABOUT THE TRAINING. VALUES ARE ON A SCALE FROM 1 (‘STRONGLY DISAGREE’) TO 4 (‘STRONGLY AGREE’).
The training and the warm-up exercises were sufficient to become familiar with: | Median answer
The static analysis technique | 3 (‘agree’)
The penetration testing technique | 3 (‘agree’)
The Burp Suite tool | 3 (‘agree’)
The Fortify SCA tool | 3 (‘agree’)
Finally, we question whether, on average, the two techniques yield the same productivity. Knowing which technique has a higher productivity can be a critical element when choosing the appropriate technique, e.g., if time-to-market is a business constraint.
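The three null hypotheses stated in words in this section can be written compactly as follows. The notation is an assumption made for illustration: $\mu(\cdot)$ denotes the mean over participants, and the subscripts SA and PT denote the static analysis and penetration testing treatments; the symbols used in the original formulation may differ.

```latex
\begin{align*}
H_0^{\mathit{TP}}   &: \mu(\mathit{TP}_{\mathit{SA}})   = \mu(\mathit{TP}_{\mathit{PT}}) \\
H_0^{\mathit{PREC}} &: \mu(\mathit{PREC}_{\mathit{SA}}) = \mu(\mathit{PREC}_{\mathit{PT}}) \\
H_0^{\mathit{PROD}} &: \mu(\mathit{PROD}_{\mathit{SA}}) = \mu(\mathit{PROD}_{\mathit{PT}})
\end{align*}
```

Rejecting one of these hypotheses would indicate that the two techniques differ, on average, in the corresponding measure.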
V. OPERATION OF THE EXPERIMENT
A. Training of the Participants and Warm-up Exercises
The experiment is embedded in a master’s course on secure software engineering. The course lasts 15 weeks with one class of 3 hours per week, which runs during the evening. The course covers topics like risk analysis, secure design, and secure coding. The training of the participants, with respect to the skill-set necessary to perform the tasks, took place in weeks 5 and 6 of the course.
In week 5, the participants are taught the penetration testing technique. The class is structured into a theoretical lecture of 1 hour and a practical lab session of 2 hours. The lecture describes the process that is followed in order to test an application and gives details about how to map, analyze, and exploit the target. The lab mimics the set-up used in the upcoming experiment (same room, same format for the assignment, and so on) and plays the role of the first warm-up exercise. In the warm-up, the participants have the opportunity to gain experience with Burp Suite and penetration testing on a small scale, toy application.
In week 6, the participants are taught the static analysis technique. Again, the class is organized as a lecture of 1 hour followed by a lab session of 2 hours. The lecture explains the process of reviewing the code with the support of a static analysis tool, provides checklists to optimize the review results, and explains how to interpret the vulnerability warnings generated by the static analysis tool. The lab is the second warm-up exercise where the participants gain experience with Fortify SCA and perform static analysis on a toy application, which is different from the one used for penetration testing. The set-up of this lab is identical to the one used for the experiment. Lecture materials and warm-up exercises for both weeks are available online.
We assessed whether the training and the warm-up exercises were sufficient to become familiar with the techniques and the tools involved in the experiment via a questionnaire administered at the start of the experiment. As shown in Table II, the participants agreed that the training was sufficient in all respects.
B. Execution of the Experiment
The experiment was executed in October 2012. At the beginning of the experiment, we asked the participants to sign an informed consent form, which was approved by the Institutional Review Board of Northern Kentucky University. The experiment was articulated over weeks 8 and 9 of the course. In both weeks, the participants carried out their tasks in a lab session of 3 hours, which started at 6 PM.
Each participant worked on his/her own laptop. Access to the Internet was provided. At the beginning of each lab, each participant received a printed assignment containing the description of the task to be carried out and all the necessary information, like the IP address of the virtual machine and login credentials for the penetration testing task. The labs were supervised by one instructor who also answered the technical questions of the participants. A second supervisor monitored the adherence to the experimental protocol, e.g., by checking that the participants were using the time tracking tool at all times (as explained later) and that they were filling in the questionnaires, when requested to do so.
Concerning the static analysis task, the participants had the Fortify Audit Workbench installed on their own laptops. This software is used to browse the vulnerability report produced by the Fortify SCA tool. The vulnerability report (FPR file) and the application source code (compressed archive file) were also made available for download to the participants.
Concerning the penetration testing task, the participants had the Burp Suite installed on their own laptops. Each participant was provided with access to her own virtual machine with a running instance of the application. Therefore, it was not possible for the participants to interfere with each other.
At the beginning of the first experiment session (week 8), we administered an online entry questionnaire, while at the end of the second experiment session (week 9) we administered an online exit questionnaire. The list of questions and the replies of the participants are publicly available. The questionnaires were not anonymous. The goals of the entry questionnaire were to understand the background of the participants (as already described in Section IV-B) and to assess the quality of the training and warm-up exercises provided to the participants (as already described in Section V-A). The goals of the exit questionnaire were to validate some of the assumptions that underpin the design of the experiment (e.g., the fact that the two applications are of comparable complexity, as discussed in Section IV-E) and to gather the opinion of the participants concerning the two techniques (see Section VII).
We report that the participants adhered to the instructions and respected deadlines quite strictly. Further, their engagement with the experiment was very high, as demonstrated by the positive attitude during the lab sessions. Also, participants demonstrated a high level of commitment with respect to the experiment and the course in general. Participants obtained an average grade of 93 out of 100 points (standard deviation of 5.7 and 95% confidence interval of [86.9, 96.9] points), which is remarkable. They were graded on the correctness of their findings (number of correctly classified vulnerabilities), not simply on the number of vulnerabilities reported.
C. Measurement Procedure
The reports turned in by the participants have been validated and graded by a senior security researcher. For static analysis, each participant labeled the vulnerability warnings contained in one of the FPR files. The participant has made an evaluation of whether the warning is correct (i.e., a vulnerability is indeed present in the code) or bogus. The security expert has produced an independent assessment of the warnings in the FPR files and his judgment is assumed to be correct. Given the seniority level of the subject and his expertise with static analysis, we have no reason to doubt it. Using this ‘reference solution’, the labeling of the participant can be classified as a true positive (TP, the warning is a vulnerability for the expert and the participant concurs) or a false positive (FP, the warning is not a vulnerability for the expert but the participant believes it is). The other cases (warnings that the participant labeled as bogus) are out of scope in this experiment, as they are harder to compare with penetration testing.
For penetration testing, it is straightforward to validate the reports of the participants. The report contains the list of vulnerabilities discovered by the participant, each associated with the parameters (link to the entry point, input, and so on) that describe how to exploit the vulnerability. The security expert has to mount the attack as it is described in the report and verify whether it is successful (TP, true positive) or not (FP, false positive).
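The classification against the reference solution can be illustrated with a minimal sketch. The function name, the dictionary layout, and the warning identifiers below are hypothetical and are not taken from the study material; the sketch only assumes that both the expert and the participant assign a yes/no judgment to each warning.

```python
def classify_labels(reference, participant):
    """Count TP and FP among the warnings a participant flagged as real.

    reference:   dict, warning id -> True if the expert judged it a vulnerability
    participant: dict, warning id -> True if the participant judged it a vulnerability
    Warnings the participant labeled as bogus are skipped, since those
    cases are out of scope in the experiment.
    """
    counts = {"TP": 0, "FP": 0}
    for warning_id, flagged in participant.items():
        if not flagged:
            continue
        if reference[warning_id]:
            counts["TP"] += 1  # expert and participant agree it is a vulnerability
        else:
            counts["FP"] += 1  # participant reports it, expert does not
    return counts


# Hypothetical labels for four warnings of one FPR file.
reference = {"W1": True, "W2": False, "W3": True, "W4": False}
participant = {"W1": True, "W2": True, "W3": False, "W4": False}
print(classify_labels(reference, participant))
```

For penetration testing the same counting applies, with the expert's verdict given by whether the reported attack could be reproduced.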
The time has been tracked by means of Kimai (kimai.org), which is a simple, online time-sheeting tool. At the beginning of the experiment, the participants have been given a personal login to the time tracking tool. After having logged in, the participant had to select the activity he/she was busy with. The time tracking could be started and paused by means of a single button. We have pre-configured the Kimai tool with two activities. The first activity refers to the discovery of the first vulnerability. The second activity refers to finishing the task after the first vulnerability has been discovered. Hence, the total time (TIME) spent on a task is the sum of the time spent on the two activities. The tool did not allow the participants to define other activities. We have already used this tool in other experiments and found that it is both very usable and non-invasive. Also, notice that one supervisor was monitoring the correct usage of the Kimai tool during the experiment. Therefore, the time measures that we obtained from the logs of Kimai are accurate.
The opinions of the participants about the two techniques and the related tools have been extracted from the exit questionnaire and will be discussed in Section VII.
In order to enable the replication of this study, all the data used in this paper is available online. The data analysis is performed with R. Given the limited sample size, the analysis presented in this section makes use of non-parametric tests. In particular, the location shifts between the two treatments are tested by means of the Wilcoxon signed-rank test for paired samples. The same test is used to analyze the exit questionnaire. A significance level of 0.05 is always used. The 95% confidence intervals are computed by means of the one-sample Wilcoxon signed-rank test. The association between two variables is studied by means of the Spearman rank correlation coefficient. A correlation is considered only if the modulus of the coefficient is at least 0.70 and the p-value of the significance test is smaller than 0.05.
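To make the paired test concrete, the following is a minimal, standard-library sketch of an exact two-sided Wilcoxon signed-rank test. It is an illustration only and not the R code used for the analysis; the function names and the sample values are hypothetical. The exact p-value is obtained by enumerating all sign assignments, which is practical only for small samples such as the nine pairs of this study.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (V, p), where V is the sum of the ranks of the positive
    differences x - y. Pairs with a zero difference are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    v_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2  # mean of V under the null hypothesis
    extreme = 0
    # Under the null hypothesis every sign assignment is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        v = sum(r for r, s in zip(ranks, signs) if s)
        if abs(v - expected) >= abs(v_obs - expected):
            extreme += 1
    return v_obs, extreme / 2 ** len(ranks)


# Hypothetical TP counts for nine participants under each treatment.
sa_tp = [12, 5, 9, 20, 3, 7, 15, 4, 11]
pt_tp = [2, 1, 4, 3, 5, 2, 1, 0, 6]
v, p = wilcoxon_signed_rank(sa_tp, pt_tp)
print(v, p, "significant" if p < 0.05 else "not significant")
```

The final comparison against 0.05 mirrors the significance level adopted throughout the analysis.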
Fig. 5. Boxplot of reported results (TOT), correct results (TP) and false alarms (FP)
A. True Positives (H0TP)
The left-hand side of Figure 5 summarizes the results concerning the total number of reported vulnerabilities (TOT), which appears to be quite different in the two treatments. With static analysis (SA), the participants reported an average of 14.8 vulnerabilities (standard deviation of 13.3, confidence interval of [5, 30]). With penetration testing (PT) the average is 3.1 vulnerabilities, which is much lower, and the standard deviation is 2.0 (confidence interval of [2, 5]). The location shift is not statistically significant (p-value>0.05).
As shown by the boxplot in the middle of Figure 5, there is an imbalance also for the number of correct results (TP). With static analysis, the participants discovered an average of 9.7 confirmed vulnerabilities (standard deviation of 7.9, confidence interval of [4.5, 18.5]). With penetration testing they discovered only 2.2 confirmed vulnerabilities on average, with a standard deviation of 1.7 and a confidence interval of [1, 4]. The location shift is statistically significant (p-value=0.0249). The left-hand side of Figure 6 shows that static analysis has produced more correct results in both the Pebble and the Apache Roller applications.
We can reject the null hypothesis H0TP and conclude that static analysis produces, on average, a higher number of correct results than penetration testing.
This conclusion is not surprising. Reviewing the application code with the support of a security scanner can be seen as a checklist-based approach, where the participants have to skim through a list of suggestions made by the tool. Usually, these approaches are superior as far as the true positives are concerned. Penetration testing, instead, does not enjoy a similar level of guidance and it is easier for less experienced participants to get ‘stuck’. However, having a list of warnings in the static analysis technique might be somewhat limiting as the scope of the analysis is bounded by the alarms produced by the tool. The vulnerabilities reported using static analysis were, of course, similar across participants. Instead, vulnerabilities reported using penetration testing differed among participants. In our study, though, penetration testing did not find novel application-specific vulnerabilities w.r.t. what has been generated by the tool. Rather, penetration testing reports included vulnerability types that could not be found in the source code, such as vulnerabilities in the configuration of the web application server on top of which the web application was running.
Fig. 6. Barplot of average correct results (TP) and average false alarms (FP). Data is organized by application
We have run a series of correlation tests to investigate the associations among the variables we have measured. The only strong association that is also statistically significant is between the number of correct vulnerabilities discovered (TP) with penetration testing and the years of experience in the software industry (described in Figure 1). There is an inverse relationship (ρ=-0.83, p-value=0.0058), which might hint at the fact that younger professionals perform better when it comes to penetration testing. This is a speculation, of course, which needs to be validated further.
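As an illustration of the correlation criterion, the sketch below computes the Spearman coefficient as the Pearson correlation of the ranks and applies the two thresholds stated for the analysis (modulus of at least 0.70 and p-value below 0.05). The function names and the sample data are hypothetical; the p-value is taken as an input here rather than computed.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def is_considered(rho, p_value, min_abs_rho=0.70, alpha=0.05):
    """Apply the two thresholds used to decide whether a correlation is reported."""
    return abs(rho) >= min_abs_rho and p_value < alpha


# Hypothetical data: years of experience versus TP found with penetration testing.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9]
tp_pt = [9, 8, 7, 6, 5, 4, 3, 2, 1]
print(spearman_rho(years, tp_pt))

# The association reported in the text passes both thresholds.
print(is_considered(-0.83, 0.0058))
```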
B. Precision (H0PREC)
The precision (PREC) is influenced by the ratio between the correct results (TP) and the false alarms (FP). The right-hand side of Figure 5 provides the characterization of the amount of false alarms (i.e., errors) produced by the participants. Overall, the participants have made more mistakes (FP) with static analysis than penetration testing (about 5 errors versus 1, on average). The difference, however, is not statistically significant (p-value>0.05). Note that in Figure 5 there is an outlier (see footnote 1) for FP in PT. The location shift remains not significant even if the outlier is (pairwise) removed. If we examine the data by application, as presented in the right-hand side of Figure 6, we see that each treatment has underperformed in one of the two objects. Therefore, no clear trend can be identified. Consequently, we have not observed an advantage of either technique as far as the precision is concerned. As shown in Figure 7, static analysis has scored a precision (PREC) of 74.3 percent (standard deviation of 20.6 and confidence interval of [54.8, 91.7]) and penetration testing has achieved a precision of 76.3 percent (standard deviation of 30.7 and confidence interval of [58.3, 100]). The difference is not statistically significant (p-value>0.05). This is also true if the outlier is removed.
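A small sketch may help in reading the precision figures. It assumes the usual definition of precision, TP / (TP + FP), and that the reported average is the mean of the per-participant precision values, which the accompanying standard deviations suggest. Under that assumption the average precision need not equal the ratio of the average TP to the average number of reported results. The data below are hypothetical.

```python
def precision(tp, fp):
    """Fraction of reported vulnerabilities that are correct: TP / (TP + FP)."""
    reported = tp + fp
    if reported == 0:
        raise ValueError("precision is undefined when nothing is reported")
    return tp / reported


# Hypothetical (TP, FP) pairs for three participants.
results = [(12, 8), (3, 0), (6, 2)]

# Mean of the per-participant precision values.
per_participant = [precision(tp, fp) for tp, fp in results]
mean_of_ratios = sum(per_participant) / len(per_participant)

# Precision computed on the pooled counts, for comparison.
ratio_of_totals = precision(
    sum(tp for tp, _ in results),
    sum(fp for _, fp in results),
)

print(round(mean_of_ratios, 3), round(ratio_of_totals, 3))
```

In this example a participant with few but entirely correct findings raises the mean of the ratios above the pooled value.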
1 Unfortunately, we do not have a precise explanation for the behavior of the outlying participant. The participant reported some technical difficulties with the Tomcat server during the penetration testing.
Fig. 7. Boxplot and barplot for precision (PREC)
We cannot reject the null hypothesis H0PREC and we conclude that there is no difference between static analysis and penetration testing as far as the average precision of the analysis results is concerned.
This conclusion comes as a surprise because it contradicts the expectations. As a penetration tester can directly observe the running system and assess the success of an attack he/she is performing, we expected penetration testing to be less prone to false alarms than static analysis. When validating the results of the static analysis, indeed, one has to ‘picture in her mind’ whether a snippet of code could be exploited. Due to the abstract nature of this endeavor, one would expect the participants to make more errors. These errors have shown up in our experiment, but the proportion with respect to the correct results is the same in both treatments. This is certainly an interesting area for future work.
C. Productivity (H0PROD)
The left-hand side of Figure 8 summarizes the measures of the total time spent by the participants to carry out the tasks. The right-hand side of the figure shows the productivity of the participants, which is the ratio between correct results (the TP in Figure 5) and the time. The top row of the figure reports the results organized by treatment. The bottom row breaks down the results by both application and treatment.
Participants using static analysis spent on average 1.68 hours to complete the task (standard deviation of 0.47 and confidence interval of [1.26, 2.06] hours), while with penetration testing they spent 2.01 hours on average (standard deviation of 0.33 and confidence interval of [1.77, 2.27] hours). The location shift is statistically significant (p=0.0273).
The difference between the treatments for both TP and TIME is consequently reflected by the difference in productivity. Participants using static analysis discovered 7.1 vulnerabilities per hour (standard deviation of 8.1 and confidence interval of [2.38, 15.7]). Participants using penetration testing discovered 1.1 vulnerabilities per hour (standard deviation of 0.82 and confidence interval of [0.62, 1.85]). The location shift is statistically significant (p=0.0195).
We can reject the null hypothesis H0PROD and conclude that static analysis produces, on average, a higher number of correct results per hour than penetration testing.
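The two derived measures can be written down as a short sketch: TIME is the sum of the two tracked activities and productivity is the number of correct results divided by TIME. The function names and the sample values are hypothetical, and the assumption that the time log is available in minutes is ours.

```python
def total_time_hours(first_vulnerability_minutes, remaining_minutes):
    """TIME: sum of the two tracked activities, converted to hours."""
    return (first_vulnerability_minutes + remaining_minutes) / 60


def productivity(tp, time_hours):
    """PROD: correct results (TP) per hour spent on the task."""
    if time_hours <= 0:
        raise ValueError("time must be positive")
    return tp / time_hours


# Hypothetical participant: 30 minutes until the first vulnerability,
# 90 more minutes to finish the task, 6 confirmed vulnerabilities.
time_spent = total_time_hours(30, 90)
print(time_spent, productivity(6, time_spent))
```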
Fig. 8. Boxplot of time spent by the participants to complete the task (TIME) and their productivity (PROD). Data is organized by treatment (top) and by application (bottom)
TABLE III. PARTICIPANTS’ OPINION ABOUT SA AND PT.
Question | Median | C.I.
Q1. Rate your understanding of the description of Task “Static Analysis” (i.e., whether you knew what to do and how to proceed) | 4 (very clear) | [3, 4]
Q2. Rate your understanding of the description of Task “Penetration Testing” (i.e., whether you knew what to do and how to proceed) | 3 (clear) | [2, 3.5]
Q3. Rate the difficulty of Task “Static Analysis” | 2 (easy) | [1.5, 2.5]
Q4. Rate the difficulty of Task “Penetration [...] | 4 (very difficult) | [3, 4]
Q5. Did you find that the two tasks were of [...] | 2 (disagree) | [1.5, 3]
Q6. If not, please motivate | – | –
Q7. If you needed to identify vulnerabilities in a future project you would use “Static Analysis” | 4 (strongly agree) | [3, 4]
Q8. If you needed to identify vulnerabilities in a future project you would use “Penetration Testing” | 3 (agree) | [2.5, 3.5]
Q9. Which of the two techniques did you like [...] | PT (mode) | –
Q10. What is the reason of your preference? | – | –
Q11. Looking back to your hands-on experience, describe the advantages and/or disadvantages of using “Static Analysis” | – | –
Q12. Looking back to your hands-on experience, describe the advantages and/or disadvantages of using “Penetration Testing” | – | –
We also monitored the time that it took participants to report the first vulnerability. Unsurprisingly, this time is significantly shorter for the participants using static analysis: 8.5 minutes for SA versus 43.6 minutes for PT, on average (p-value=0.0195). This confirms the advantage provided by the static analysis tool in structuring the activity of the participants. Penetration testing requires much more out-of-the-box thinking, which in turn requires more time.
To further interpret the quantitative results illustrated in the previous section, this section discusses observations gathered from the exit questionnaire. The questionnaire puts significant focus on the opinion and preference of the participants about the two techniques. In this respect, we asked the questions reported in Table III. All close-ended questions are on a scale from 1 to 4. Questions Q6 and Q10–Q12 are open-ended.
From the answers to questions Q1 and Q2, it appears that the participants had a good understanding of how to apply the techniques in order to complete their tasks. There is no statistically significant difference in the median rank of the two questions (p-value>0.05).
Instead, there is a significant difference between questions Q3 and Q4 (p-value=0.0115). Penetration testing is perceived as two levels more difficult than static analysis. This is confirmed by question Q5, where the participants disagreed that the two tasks were of comparable difficulty. With question Q6, we asked the participants to explain why they think the two techniques differ in difficulty. Once more, comments of the participants pointed in the direction of penetration testing being more difficult. One participant remarked that penetration testing gives less guidance and requires more creativity. Three other participants mentioned that they felt like they needed more knowledge of the different types of attacks in order to be more successful with the penetration testing technique.
As shown in questions Q7 and Q8, the participants would consider using both techniques in the real world, with a preference for static analysis. The location shift in the medians is statistically significant (p-value=0.0477).
With Q9, we posed a direct question about the participants’ preference with respect to the two techniques. Five participants preferred penetration testing and four chose static analysis. Their answers are independent from the treatment they started with. A majority for PT is quite surprising, given that the previous answers always favored SA. The explanation lies in the answers to question Q10. Out of the four participants preferring SA, one mentioned that he appreciated the opportunity of learning about bad programming patterns by looking at the broken code. The three other participants enjoyed the guidance provided by SA and the positive feeling of making progress (PT could be more frustrating if one gets ‘stuck’ and does not find vulnerabilities). All five participants preferring PT mentioned that this technique is more fun as it feels like a game, is more interactive and less tedious than SA. Additionally, one has the opportunity of seeing the results of exploiting the discovered vulnerabilities. The fun-factor was an unexpected and interesting explanation for the PT preference.
Finally, in questions Q11 and Q12, we asked the participants to list the advantages and disadvantages of the two techniques. These two questions confirmed the observations made with the previous questions. Concerning static analysis, the participants perceived it as quicker and easier thanks to the guidance it provides (in the shape of a checklist of warnings to be verified). The downside of the guidance is that the participants find SA tedious and feel limited as there might be additional vulnerabilities that are not spotted by the static analysis tool. Some participants felt like the tool catches the low-hanging fruit, i.e., the most common vulnerabilities, although they were not negative about it. Concerning penetration testing, participants felt that fewer false positives are produced (which turned out to be untrue in our study) and that the technique is able to catch problems that would go undetected with SA. On the flip side, some participants mentioned that there are no guarantees of completeness. Most notably, participants are strongly aligned in saying that PT requires more experience and is more time-consuming.
Wrap up. The results of the exit questionnaire suggest that a newly formed security team should start using static analysis first. This would give the team members the opportunity to learn about code weaknesses and to gain experience about attacks, without too much frustration while still delivering results. Only then should penetration testing be introduced in the team, as a complementary activity that broadens the scope of the analysis. Otherwise, expecting results from penetration testing without adequate experience could be unrealistic.
VIII. THREATS TO VALIDITY
We have identified no relevant threats to the construct validity. Concerning the conclusion validity, the measures are reliable and the appropriate statistics have been used. However, we remark that the number of participants is small. Further, we have employed only one security expert to evaluate the reports of the participants. Nevertheless, we are confident that no major errors have been made. As a minor threat, we report that in one case, a participant using penetration testing had to interrupt the execution of the task for a few minutes (less than 10) because the virtual machine hosting the target application had to be restarted. This might have influenced the number of vulnerabilities discovered by the subject. However, the time measures for this participant are reliable as the timer was paused during the interruption.
As far as the internal validity is concerned, the main threat is posed by having used two different objects in the two labs. To thwart this threat, we chose two applications that are very similar in terms of functionality, size, complexity and maturity. To further balance the design of our experiment, both objects are used in each treatment. The assumption that the participants have good development skills is based on the answers to the entry questionnaires. Although it is common practice to estimate skills this way, it could also be a threat. We also mention that the experiment sessions began at 6PM. Hence, participants could have been fatigued.
The usual threats to external validity apply to the results of this study. Our conclusions could be specific to the type of applications we used (weblogs) or their technology (web applications written in Java). Further, the results are obtained for the case of analysts focusing on top-priority vulnerabilities, as defined by the OWASP Top 10 list. The results might be specific to the size, complexity, and maturity of the applications we have used. Results might differ if different tools than Fortify SCA and Burp Suite are used in the task execution. Finally, the results are valid for professionals with a seniority and expertise similar to our participants.
This paper presented the results of an exploratory controlled experiment carried out at Northern Kentucky University in the Fall of 2012. The study analyzed the differences between performing a white box security analysis (static analysis supported by a scanning tool) and a black box security analysis (penetration testing supported by a proxy/spider tool). Both the techniques and the tools used in the experiment are representative of what security analysts use in industry. The participants were nine professionals who had returned to university to earn a master’s degree. The study focused on the number of true positives produced by the participants, their precision, and their productivity. Despite the limited scale of the study, the results show that static analysis produces more true positives and in a shorter time than penetration testing. No statistically significant difference was observed in the precision of the two techniques. Furthermore, results of the questionnaire suggest that, despite penetration testing being more fun, a newly formed security team should start using static analysis first and introduce penetration testing later, once the team has matured.
In future work, it would be interesting to investigate variations of this study, such as using different support tools for both static analysis and penetration testing or choosing different types of applications. This would contribute to extending the applicability of our results to a larger spectrum of techniques.
This research is partially funded by the Research Fund KU Leuven and by the EU FP7 project NESSoS, with the financial support from the Prevention of and Fight against Crime Programme of the European Union (B-CCENTRE).
[1] B. Chess and G. McGraw, “Static analysis for security,” Security & Privacy, IEEE, vol. 2, no. 6, 2004.
[2] Gartner. Research methodologies: Magic quadrant. [Online]. Available: http://www.gartner.com/technology/research/methodologies/research mq.jsp
[3] K. R. van Wyk. (2013) Adapting penetration testing for software development purposes. [Online]. Available: https://buildsecurityin.us-cert.gov/bsi/articles/best-practices/penetration/655-BSI.html
[4] N. Antunes and M. Vieira, “Comparing the effectiveness of penetration testing and static code analysis on the detection of SQL injection vulnerabilities in web services,” in IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2009.
[5] A. Austin and L. Williams, “One technique is not enough: A comparison of vulnerability discovery techniques,” in International Symposium on Empirical Software Engineering and Measurement (ESEM), 2011.
[6] D. Baca, K. Petersen, B. Carlsson, and L. Lundberg, “Static code analysis to detect software security vulnerabilities: Does experience matter?” in International Conference on Availability, Reliability and Security (ARES), 2009.
[7] A. Edmundson, B. Holtkamp, E. Rivera, M. Finifter, A. Mettler, and D. Wagner, “An empirical study on the effectiveness of security code review,” in International Symposium on Engineering Secure Software and Systems (ESSoS), 2013.
[8] V. R. Basili, G. Caldiera, and H. D. Rombach, “The goal question metric approach,” in Encyclopedia of Software Engineering. Wiley, 1994.
[9] OWASP. (2010) Top 10 application security risks. [Online]. Available: https://www.owasp.org/index.php/Top 10 2010-Main
[10] J. Walden. (2012) CS 666: Secure software engineering. [Online]. Available: http://faculty.cs.nku.edu/~waldenj/classes/2012/fall/csc666/
[11] R. Scandariato and J. Walden. Static analysis versus penetration testing: Supporting material. [Online]. Available: https://sites.google.com/site/nkustudy/