
2.2 Automated Techniques for Load Testing

2.2.2 Identifying Performance Problems

Profiling-based Techniques. The most common way of identifying performance problems is through profiling of the system under test. Over the years, profiling tools such as gprof [54] have been used to find bottlenecks related to excessive CPU usage.

From the memory consumption perspective, Seward et al. propose Memcheck [103], a tool that performs memory usage profiling of a subject program. The results of Memcheck can be used to help identify memory bottlenecks and other types of memory-related errors. Memcheck is built on top of Valgrind [85], a binary instrumentation framework. It maintains a shadow value for every register and memory location, and uses these shadow values to store additional information, which enables tracking of all types of memory operations, such as value initialization, allocation/deallocation, and copying. Although programs instrumented with Memcheck typically run 20-30 times slower than normal, it is claimed to be fast enough to use with large programs that reach 2 million lines of code.
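The shadow-value idea can be illustrated with a toy model. The sketch below is not Memcheck's implementation: the `ShadowMemory` class and its method names are hypothetical, and it keeps only two shadow flags per address (allocated, defined) rather than per-bit state for registers and memory.

```python
class ShadowMemory:
    """Toy memory with shadow flags per address (hypothetical model)."""

    def __init__(self, size):
        self.data = [0] * size            # program-visible values
        self.allocated = [False] * size   # shadow: is the address allocated?
        self.defined = [False] * size     # shadow: has the value been initialized?
        self.errors = []                  # reports are collected, not raised

    def alloc(self, start, length):
        for addr in range(start, start + length):
            self.allocated[addr] = True
            self.defined[addr] = False    # fresh memory is allocated but undefined

    def free(self, start, length):
        for addr in range(start, start + length):
            self.allocated[addr] = False
            self.defined[addr] = False

    def store(self, addr, value):
        if not self.allocated[addr]:
            self.errors.append(("invalid write", addr))
            return
        self.data[addr] = value
        self.defined[addr] = True

    def load(self, addr):
        if not self.allocated[addr]:
            self.errors.append(("invalid read", addr))
        elif not self.defined[addr]:
            self.errors.append(("uninitialized read", addr))
        return self.data[addr]

    def copy(self, src, dst):
        # In this model a copy moves the shadow flag along with the value.
        if not (self.allocated[src] and self.allocated[dst]):
            self.errors.append(("invalid copy", dst))
            return
        self.data[dst] = self.data[src]
        self.defined[dst] = self.defined[src]


mem = ShadowMemory(8)
mem.alloc(0, 4)
mem.store(0, 42)
mem.copy(1, 2)      # copies an undefined value; the flag travels with it
mem.load(2)         # recorded as an uninitialized read
print(mem.errors)
```

Every program-visible operation also updates the shadow state, which is where the instrumentation overhead in such a design comes from.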

However, profiling tools often rely on one specific run of the program under test. Selecting the ‘right’ input that will expose the resource consumption problems becomes critical, and in practice a poor choice often leads to missed performance bugs. In recent years, several efforts have sought to alleviate this problem, either by aggregating information from multiple runs (possibly with orders-of-magnitude differences in input data sizes) in the hope of providing “cost functions” for the key methods in the system, or by mining millions of traces collected from deployed software systems. Goldsmith et al. propose an approach that, given a set of loads, executes the program under those loads, groups code blocks of similar performance, and applies various fit functions to the resulting data to summarize the complexity [52]. This approach is applied to various applications such as a data compression program, a C language parser, and a string matching algorithm, and is able to confirm the expected performance of the implementations of those classic algorithms. Although the approach improves dramatically on single-run profiling, the user-provided workloads still prove critical to its effectiveness.
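The measure-then-fit step can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool from [52]: it counts comparisons in a bubble sort (a stand-in workload chosen here) at several input sizes and fits a single power law by least squares in log-log space, whereas the original approach groups code blocks and tries several fit functions. The function names are hypothetical.

```python
import math


def count_bubble_sort_steps(n):
    """Run bubble sort on a reversed list of length n and count comparisons."""
    data = list(range(n, 0, -1))
    steps = 0
    for i in range(len(data)):
        for j in range(len(data) - 1 - i):
            steps += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return steps


def fit_power_law(sizes, costs):
    """Least-squares fit of cost ~ a * size**b in log-log space; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Workloads spanning more than an order of magnitude in input size.
sizes = [50, 100, 200, 400, 800]
costs = [count_bubble_sort_steps(n) for n in sizes]
a, b = fit_power_law(sizes, costs)
print(f"cost ~ {a:.3f} * n^{b:.2f}")
```

The fitted exponent is only as good as the chosen sizes: if all workloads were small or similar, the fit would say little about behavior at scale, which is the workload dependence noted above.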

with counters, uses an invariant generator to compute bounds on these counters, and composes individual bounds together to estimate the upper bound of loop iterations. This approach relies on the power of the invariant generator and the user input of quantitative functions to bound any type of data structures. As a result, this approach is demonstrated to scale only to small examples up to a hundred lines of code.

Coppa et al. propose a new profiling tool, aprof [28], that further alleviates the problem of relying on user-provided workloads. aprof also generates performance curves of individual methods in terms of their input sizes. However, instead of requiring users to provide workloads that range over orders of magnitude in size, aprof only needs a few runs under a typical usage scenario. The key insight is that aprof automatically identifies the different input sizes of a specific method by monitoring its read memory size in each invocation. The approach is evaluated on a few components of the SPEC CPU2006 benchmarks, and is shown to provide informative plots from single runs on typical workloads. However, aprof does not consider alternative types of input that are received at runtime, such as data received on-line (e.g., reads from external devices such as the network, keyboard, or timer). The accuracy of the approach would be undermined in those situations.
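A rough sketch of how a per-invocation input size can be derived from observed reads is shown below. It is a simplification rather than aprof's actual metric or instrumentation: the hypothetical `TracedList` wrapper counts the distinct list indices a call reads, and `profile` groups the cost of each invocation by that observed size.

```python
from collections import defaultdict


class TracedList:
    """List wrapper that records which indices are read (hypothetical helper)."""

    def __init__(self, items):
        self._items = list(items)
        self.read_indices = set()   # distinct cells read during the call
        self.reads = 0              # total read operations, used as the cost

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        self.read_indices.add(index)
        self.reads += 1
        return self._items[index]


def linear_search(seq, target):
    for i in range(len(seq)):
        if seq[i] == target:
            return i
    return -1


def profile(calls):
    """Map observed input size (distinct cells read) to the worst cost seen."""
    curve = defaultdict(int)
    for items, target in calls:
        traced = TracedList(items)
        linear_search(traced, target)
        size = len(traced.read_indices)
        curve[size] = max(curve[size], traced.reads)
    return dict(curve)


# Three invocations on lists of the same nominal length.
calls = [(list(range(100)), 9), (list(range(100)), 49), (list(range(100)), 99)]
curve = profile(calls)
print(sorted(curve.items()))
```

Although every call receives a list of 100 elements, the observed input sizes differ, so a single ordinary run yields several points on the cost curve. Data that arrives from outside the traced memory (for example, from the network) would not be counted by such a scheme.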

Zaparanuks et al. propose another profiling tool, AlgoProf [133], that pursues a goal similar to that of aprof but uses different types of metrics. Instead of using cost metrics such as invocation counts, response times, or instruction counts, AlgoProf uses a set of metrics that focus on repetitions, i.e., loop counts, recursions, and data structure access counts. As a result, AlgoProf produces cost functions that are more focused on algorithmic complexity. AlgoProf is shown to provide accurate performance plots for classical data structure algorithms such as those on trees and graphs, and proves useful in uncovering algorithmic inefficiencies. However, it is not clear whether these types of metrics are as useful in broader types of applications.

Han et al. propose StackMine [60], a tool that takes a different approach to avoiding the workload dependence problem. Instead of focusing on a few isolated runs, StackMine works on millions of stack traces collected from deployed software systems (e.g., through the Microsoft Windows Error Reporting Tool). It uses machine learning algorithms to mine suspicious patterns out of those traces: a sequence of calls that runs for a long time across multiple stacks may indicate a CPU consumption bug; a sequence of calls that waits for a long time across multiple stacks may indicate a wait bug. This technique provides an alternative approach to performance debugging in the large, provided the user can afford to collect a large amount of stack trace data from deployed products, which is generally expensive and fragmented.
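The intuition of mining recurring call sequences across many traces can be sketched with a simple frequency count. This is a deliberately simplified stand-in, not StackMine's mining algorithm: it counts contiguous call sequences of a fixed length, keeps those that occur in at least `min_stacks` traces, and ranks them by accumulated duration. The trace data and all names are made up for illustration.

```python
from collections import Counter


def mine_costly_patterns(traces, length, min_stacks):
    """Return (pattern, total_ms) for call sequences seen in >= min_stacks traces."""
    support = Counter()   # number of traces containing the pattern
    cost = Counter()      # summed duration of those traces
    for frames, duration_ms in traces:
        seen = set()
        for i in range(len(frames) - length + 1):
            seen.add(tuple(frames[i:i + length]))
        for pattern in seen:          # count each pattern once per trace
            support[pattern] += 1
            cost[pattern] += duration_ms
    frequent = [(p, cost[p]) for p in support if support[p] >= min_stacks]
    # Most expensive first; ties broken by pattern for a stable order.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Hypothetical traces: (call stack from outermost to innermost, duration in ms).
traces = [
    (["main", "render", "layout", "measure_text"], 120),
    (["main", "on_click", "layout", "measure_text"], 300),
    (["main", "render", "paint"], 15),
    (["main", "load", "layout", "measure_text"], 250),
]
patterns = mine_costly_patterns(traces, length=2, min_stacks=2)
for pattern, total_ms in patterns:
    print(" -> ".join(pattern), total_ms)
```

A pattern that is cheap in any single trace can still rank first once its cost is aggregated across traces, which is why this style of analysis depends on collecting data at scale.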

Our work complements these techniques, because we focus on selecting input values that lead to worst-case performance. We conjecture that the generated input values can in turn be used to enable more accurate profiling of the program under test.

Search-based Techniques. Another thread of related work aims to predict the overall performance of a system at a stage where only performance evaluations of the constituent components are available [1, 12, 107]. This type of information is valuable, because it can predict the performance of systems that are yet to be built, thereby avoiding system architectures that are doomed to perform badly. For example, Siegmund et al. propose a technique that can predict program performance based on selected features in a software product line environment [107]. The proposed technique models the problem as a search problem, and uses heuristic search to reduce the number of measurements required to detect the feature interactions that actually contribute to performance out of an exponential number of potential interactions.
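To make the notion of a performance-relevant feature interaction concrete, the sketch below builds an additive performance model from measurements and flags feature pairs whose combined cost deviates from the sum of their individual costs. It rests on several assumptions: the `measure` function simulates measurements from made-up numbers, all names are hypothetical, and every pair is checked exhaustively, whereas the technique in [107] uses heuristics precisely to avoid measuring all combinations.

```python
from itertools import combinations

# Hypothetical ground truth standing in for measurements of a real system.
BASE_MS = 100.0
FEATURE_MS = {"compression": 40.0, "encryption": 25.0, "logging": 5.0}
INTERACTION_MS = {frozenset({"compression", "encryption"}): 30.0}


def measure(config):
    """Simulated measurement of a configuration (a set of enabled features)."""
    total = BASE_MS + sum(FEATURE_MS[f] for f in config)
    for pair, extra in INTERACTION_MS.items():
        if pair <= config:
            total += extra
    return total


def learn_model(features, tolerance=1e-6):
    """Learn per-feature deltas, then flag pairs that deviate from additivity."""
    base = measure(frozenset())
    deltas = {f: measure(frozenset({f})) - base for f in features}
    interactions = {}
    for a, b in combinations(sorted(features), 2):
        predicted = base + deltas[a] + deltas[b]
        observed = measure(frozenset({a, b}))
        if abs(observed - predicted) > tolerance:
            interactions[frozenset({a, b})] = observed - predicted
    return base, deltas, interactions


def predict(model, config):
    """Predict the cost of an unmeasured configuration from the learned model."""
    base, deltas, interactions = model
    total = base + sum(deltas[f] for f in config)
    total += sum(extra for pair, extra in interactions.items() if pair <= config)
    return total


model = learn_model(FEATURE_MS)
full = frozenset(FEATURE_MS)   # a configuration that was never measured directly
print(predict(model, full), measure(full))
```

With n features there are 2^n configurations, so even the pairwise check above stops scaling quickly; reducing the number of such measurements is the role of the heuristic search.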

mance information of individual components, in order to evaluate the performance of the whole system. However, the information we gather consists of input constraints corresponding to the worst-performing program paths, which can enable more accurate compositional analysis and produce more accurate results.