Chapter 2 Background and related work
2.6 Computer systems benchmarking
A benchmark is an instrument for evaluating and comparing different entities (systems, components, tools, etc.) under the same conditions and according to specific characteristics, such as performance, robustness, and dependability (Gray 1993). In practice, benchmarking is a process that executes the system under test under conditions that are constant over time and measures specific characteristics at each execution, so that the results are fair and comparable across alternative systems and/or components. The main components of a benchmark are:
Metrics that characterize the objects under comparison. For instance, metrics for benchmarking CPU throughput are Instructions Per Second (typically scaled to millions, MIPS, or above, e.g., GIPS, TIPS) and Floating-Point Operations Per Second (e.g., MFLOPS), while complex computer systems such as web servers are analyzed with respect to their response time, availability, and latency (“Transaction Processing Performance Council (TPC)”). Defining the metrics is of utmost importance, because they determine whether the characteristics of the system to be measured are modeled properly;
Workload, which is the set of operations that the system under test must execute during the benchmark run, usually including several components (instructions, software components, other systems) and parameters (defining a particular instance of the workload). Workloads are typically built according to the characteristics of the system being benchmarked. For instance, a workload for measuring CPU throughput in terms of FLOPS must consist of computation-intensive floating-point instructions, while measuring the response time of a web server requires several remote nodes that request, at a given rate, operations that the system must execute. Several techniques are available for defining a proper workload (Calzarossa, Italiani, and Serazzi 1986; Agrawala, Mohr, and Bryant 1976; Calzarossa and Serazzi 1993; D. Ferrari 1972; Eeckhout et al. 2005; Domenico Ferrari 1984), although workload definition remains an open problem in many scenarios;
Benchmarking procedure that describes the setup required to run the benchmark and the set of steps and rules to be followed during its execution (Gray 1993). For instance, benchmarking a web server requires, among other steps, setting up the remote nodes that submit the workload, configuring the environment to start the web server automatically, starting and stopping the workload execution, and calculating the defined metrics.
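The three components above can be illustrated with a minimal sketch in Python. It is not taken from any of the cited benchmarks: the function `run_benchmark`, the two toy "systems", the warm-up rule, and the workload size are all hypothetical and chosen only to show how a fixed workload, a fixed procedure, and explicit metrics fit together.

```python
import time
from statistics import mean


def run_benchmark(system, workload, warmup=2):
    """Hypothetical procedure: warm up, run the workload, compute metrics."""
    # Procedure, step 1: warm-up executions that are not measured.
    for operation in workload[:warmup]:
        system(operation)

    # Procedure, step 2: measured execution of the full workload.
    response_times = []
    start = time.perf_counter()
    for operation in workload:
        t0 = time.perf_counter()
        system(operation)
        response_times.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    # Metrics: mean response time and throughput.
    return {
        "operations": len(workload),
        "mean_response_time_s": mean(response_times),
        "throughput_ops_per_s": len(workload) / elapsed if elapsed > 0 else float("inf"),
    }


# Two toy systems under test (assumptions made for illustration only).
def system_a(n):
    return sum(i * i for i in range(n))


def system_b(n):
    return sum([i * i for i in range(n)])


# Workload: the same 50 operations are submitted to every system.
workload = [10_000] * 50

results = {
    name: run_benchmark(system, workload)
    for name, system in [("A", system_a), ("B", system_b)]
}
for name, metrics in results.items():
    print(name, metrics)
```

Because both systems receive the same workload and follow the same procedure, differences in the reported metrics can be attributed to the systems themselves rather than to the measurement setup.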
To give confidence in its results, a proper benchmark must satisfy several properties (M. Vieira and Madeira 2003): it should be easy to implement and use, provide repeatable results, be portable to different systems in a given domain, include representative components, and be non-intrusive so that it does not interfere with the results.
Work on performance benchmarking ranges from simple benchmarks that target a very specific hardware system or component to very complex benchmarks focusing on complex systems (e.g., databases, operating systems, web servers (M. Vieira and Madeira 2003)). Performance benchmarks have contributed to improving successive generations of systems (Gray 1993), and the beginning of the millennium saw a surge in research on dependability benchmarking, with several works carried out by different groups following different approaches (e.g., experimentation, modeling, fault injection) (Koopman et al. 1997; M. Vieira and Madeira 2003; Zheng 1993; Antunes and Vieira 2010).
The goal of dependability benchmarking is to characterize the behavior of a system in the presence of faults, quantifying dependability attributes. A dependability benchmark thus relies on techniques such as fault injection and robustness testing, and adds to the main components of a benchmark a faultload (the set of faults in the presence of which the system is assessed) and measures related to dependability attributes.
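As a rough illustration of how a faultload extends a benchmark, the sketch below corrupts a fraction of the workload inputs and classifies the outcome of each operation. Everything in it is an assumption made for the example: the toy system `parse_record`, the single-character corruption used as fault model, and the three outcome classes are hypothetical and are not the measures defined by the cited dependability benchmarks.

```python
import random


def parse_record(text):
    """Hypothetical system under test: parse 'key=value' into (key, int)."""
    key, value = text.split("=", 1)
    return key, int(value)


def inject_fault(text, rng):
    """Hypothetical fault model: overwrite one character of the input."""
    position = rng.randrange(len(text))
    return text[:position] + "#" + text[position + 1:]


def run_dependability_benchmark(system, workload, fault_rate, seed=0):
    """Run the workload while injecting faults; count outcomes per class."""
    rng = random.Random(seed)  # fixed seed keeps the faultload repeatable
    outcomes = {"correct": 0, "detected": 0, "silent": 0}
    for operation in workload:
        expected = system(operation)  # fault-free reference result
        inject = rng.random() < fault_rate
        data = inject_fault(operation, rng) if inject else operation
        try:
            result = system(data)
        except ValueError:
            outcomes["detected"] += 1  # the system signalled the error
            continue
        # A wrong result returned without any error is a silent failure.
        outcomes["correct" if result == expected else "silent"] += 1
    return outcomes


workload = [f"sensor{i}={i * 10}" for i in range(200)]
baseline = run_dependability_benchmark(parse_record, workload, fault_rate=0.0)
with_faults = run_dependability_benchmark(parse_record, workload, fault_rate=0.3)
print("baseline:   ", baseline)
print("with faults:", with_faults)
```

The fixed random seed reflects the repeatability property discussed earlier: the same faults are injected at the same points on every run, so results from different systems remain comparable.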
In the last few years, benchmarks have also been developed for evaluating the security of systems (e.g., Mendes, Madeira, and Duraes 2014; Marco Vieira and Madeira 2005; Mendes, Duraes, and Madeira 2011). Such benchmarks are based on the idea of evaluating a system in the presence of vulnerabilities related to its security (i.e., software faults that weaken the security attributes of a system), and consist of a benchmarking procedure, a workload, a vulnerability injector and a vulnerability library, and an attackload (a set of attacks executed against the system under test). Neto and Vieira (2011) proposed a different approach to security benchmarking, which assesses the trustworthiness (i.e., the accumulation of evidence that something can be trusted) of web applications and systems. Unlike a security benchmark, a trustworthiness benchmark aims to increase the trust in the security attributes of a system or of parts of it. The benchmarking procedure analyzes the code of a specific system or component with static code analyzers (SCA), which yields a number of vulnerabilities reported (NVR) that is used to estimate trustworthiness.
Benchmarking frameworks are lacking for failure prediction. Benchmarks for machine learning models can be adapted to the failure prediction problem, although only some models can take advantage of the existing approaches. Benchmarking machine learning models is a well-known problem in the machine learning community, typically addressed by using well-established datasets (see, e.g., Zheng 1993; Maxion and Tan 2000), which correspond to the workload mentioned above. The datasets include data generally accepted by a community (e.g., the IRIS dataset and others (Bache and Lichman 2013)) that the tool or algorithm must process so that its performance (prediction accuracy, recognition error rate, etc.) can be assessed. These datasets can be used independently of any system configuration. However, as mentioned before, such repositories are not enough for assessing and comparing failure prediction algorithms on a particular system, because the data may reflect the behavior of several different systems.
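The dataset-as-workload idea can be sketched as follows. The example is only an illustration under stated assumptions: the six labeled samples are invented toy data (not drawn from any repository cited above), the two threshold predictors are hypothetical, and precision, recall, and accuracy are used as example metrics.

```python
def evaluate(predictor, dataset):
    """Score a failure predictor on a labeled dataset (the 'workload')."""
    tp = fp = fn = tn = 0
    for features, failed in dataset:
        predicted = predictor(features)
        if predicted and failed:
            tp += 1
        elif predicted and not failed:
            fp += 1
        elif not predicted and failed:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(dataset),
    }


# Invented toy data: ((memory usage fraction, error count), failure followed?)
dataset = [
    ((0.95, 12), True),
    ((0.40, 0), False),
    ((0.88, 7), True),
    ((0.91, 1), False),
    ((0.35, 9), True),
    ((0.50, 2), False),
]


# Two hypothetical predictors compared on the same dataset.
def memory_threshold(features):
    return features[0] > 0.85


def error_threshold(features):
    return features[1] > 5


scores = {
    predictor.__name__: evaluate(predictor, dataset)
    for predictor in (memory_threshold, error_threshold)
}
for name, metrics in scores.items():
    print(name, metrics)
```

The comparison is meaningful only for the system that produced the samples; if the samples came from several different systems, the scores would not say how either predictor behaves on a particular one.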