2.2 Automated Techniques for Load Testing
2.2.1 Generating Load Tests
Black-box Techniques. A large number of tools support load testing activities [79], and some of them let the tester define an input specification (e.g., input ranges, recorded input session values) and use that specification to generate load tests [108]. For example, a popular tool like Silk [108] provides a user interface and wizards to define a typical user profile and scenario, manipulate the number of virtual users that load a system, and monitor a vast set of resources to measure the impact of different configurations. Clearly, more accurate and richer user and scenario specifications could yield more powerful load tests. Support for identifying the input values corresponding to the most load-effective profiles and scenarios, however, is very limited.
A common trait among these tools is that they provide limited support for selecting load-inducing inputs, as they all treat the program as a black box. The program is not analyzed to determine what inputs would lead to higher loads, so the effectiveness of the test suite depends exclusively on the ability of the tester to select values. Similar trends appear in load testing techniques and processes in general: they use other sources of information (e.g., user profiles [11], adaptive resource models [14]) to decide how to induce a given load, but still operate from a black-box perspective.
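As a rough illustration of this black-box setup, the sketch below drives a stand-in system with a configurable number of virtual users whose inputs are drawn from a tester-supplied range. Everything here is an assumption made for illustration: `system_under_test`, `virtual_user`, and `run_load_test` are hypothetical names, and the code does not reflect the interface of Silk or any other tool.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def system_under_test(order_size):
    # Hypothetical stand-in for the real system; work grows with the input.
    return sum(i * i for i in range(order_size))


def virtual_user(user_id, requests, input_range, seed=0):
    """Issue `requests` calls with inputs drawn from the tester's range."""
    rng = random.Random(seed + user_id)  # deterministic inputs per user
    latencies = []
    for _ in range(requests):
        value = rng.randint(*input_range)  # black box: no program analysis
        start = time.perf_counter()
        system_under_test(value)
        latencies.append(time.perf_counter() - start)
    return latencies


def run_load_test(virtual_users, requests_per_user, input_range):
    """Run all virtual users concurrently and collect every latency."""
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        futures = [
            pool.submit(virtual_user, uid, requests_per_user, input_range)
            for uid in range(virtual_users)
        ]
        return [latency for f in futures for latency in f.result()]
```

Note that the load induced depends entirely on the `input_range` chosen by the tester; nothing in the harness inspects the program to find values that would stress it more.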
One recent advance in this area is the FOREPOST technique proposed by Grechanik et al. [55]. The technique is novel in that it uses an adaptive, feedback-directed learning algorithm to learn rules from execution traces of applications, and then uses these rules to select test input data to find performance bottlenecks. It has been shown to help identify bottlenecks in two applications: an insurance application, for which the inputs are customer profiles; and an online pet store, for which the inputs are URLs selecting different functions in online shopping. When FOREPOST is applied to these applications, it automatically learns rules that characterize high-load inputs, such as: a customer has a home/auto discount, or a customer has viewed more than 16 items. It then uses these rules to derive test inputs, such as customer profiles that have home/auto discounts, that lead to high workloads. Bottlenecks are identified by comparing good test cases (high load) with bad test cases (low load): a prominent resource-consuming method that occurs in good test cases, but is not invoked or has little significance in bad test cases, is likely to be a bottleneck. This approach works well if there exists a large pool of candidate inputs, the variety among candidate inputs is high, and the properties of the inputs can be expressed by simple rules.
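The comparison of good and bad test cases can be sketched as follows. This is a simplified illustration and not the algorithm of [55]: it assumes each trace is a mapping from method name to time spent, and it ranks methods by how much larger their average share of execution time is in high-load traces than in low-load ones. The function names and the scoring rule are hypothetical.

```python
from collections import defaultdict


def mean_share(traces):
    """Average fraction of total execution time spent in each method."""
    totals = defaultdict(float)
    for trace in traces:
        total = sum(trace.values())
        if total == 0:
            continue  # an empty trace contributes nothing
        for method, cost in trace.items():
            totals[method] += cost / total
    return {method: s / len(traces) for method, s in totals.items()}


def rank_bottleneck_candidates(high_load_traces, low_load_traces):
    """Rank methods that are prominent under high load but not under low load."""
    high = mean_share(high_load_traces)
    low = mean_share(low_load_traces)
    scores = {m: share - low.get(m, 0.0) for m, share in high.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For instance, with made-up traces in which a hypothetical `applyDiscount` method dominates the high-load runs and barely appears in the low-load runs, that method is ranked first.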
Unlike these black-box techniques, our load test generation technique uses a white-box approach to generate tests. We view black-box approaches as complementary: a hybrid approach may combine the benefits of both in gray-box performance testing, in which a white-box approach is used to select precise input values and a black-box approach is used to learn input rules, which in turn would help improve the scalability of the white-box approach.
White-box Techniques. Until recently, techniques and tools for performance validation or characterization have treated the target program as a black box. One interesting exception is an approach proposed by Yang et al. [129]. Conceptually, the approach aims to assign load sensitivity indices to software modules based on their potential to allocate memory, and to use that information to drive load testing. Our approach also considers program structure, but a key difference is that, instead of having to come up with static indices, our approach explores the program systematically with the support of symbolic execution to identify promising paths that we later instantiate as tests.
The WISE technique proposed by Burnim et al. uses symbolic execution to identify a worst-case scenario [20]. The technique applies full symbolic execution to small data sizes, and then attempts to generalize the worst-case complexity from those small scenarios. This works well when the user can provide a branch policy indicating which branches or branch sequences should or should not be taken in order to generalize the worst case from small examples, which requires an extremely good understanding of the program behavior. The study shows that an ill-defined branch policy fails to scale even for tiny programs like Mergesort.
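The idea of learning a worst case on small sizes and generalizing it can be illustrated with a toy example. The sketch below is not WISE: it replaces symbolic execution with brute-force enumeration of all permutations, uses insertion sort as a hypothetical target, and hard-codes the candidate pattern (reverse-sorted input) where WISE would rely on a branch policy. It only shows the two steps, exhaustive exploration at small sizes followed by reuse of the observed pattern at a size where enumeration is infeasible.

```python
from itertools import permutations


def insertion_sort_comparisons(values):
    """Sort a copy of `values` and return the number of comparisons made."""
    items = list(values)
    comparisons = 0
    for i in range(1, len(items)):
        j = i
        while j > 0:
            comparisons += 1
            if items[j - 1] <= items[j]:
                break
            items[j - 1], items[j] = items[j], items[j - 1]
            j -= 1
    return comparisons


def small_size_worst_case(n):
    """Exhaustively find the maximum comparison count over all inputs of size n."""
    return max(insertion_sort_comparisons(p) for p in permutations(range(n)))


def reverse_sorted(n):
    """Candidate worst-case pattern suggested by the small-size results."""
    return list(range(n - 1, -1, -1))


def pattern_holds_up_to(max_n):
    """Check that the candidate pattern reaches the exhaustive maximum."""
    return all(
        insertion_sort_comparisons(reverse_sorted(n)) == small_size_worst_case(n)
        for n in range(2, max_n + 1)
    )
```

Once `pattern_holds_up_to` confirms the pattern for small sizes, `reverse_sorted(n)` can be used to build a load test for large `n` without enumerating inputs. The difficulty noted above shows up in the hard-coded pattern: someone must already understand the program well enough to propose it.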
Our approach is different in two significant respects. First, our work is designed to target scalability specifically. We characterize the components of a system by performing an incremental symbolic execution that favors the deeper exploration of a subset of the paths associated with code structures. This removes the requirement for a user-provided branch policy. For the entire system, we use a compositional approach to avoid exploring whole-program paths and facing the path explosion problem. Second, our goal is to develop a suite of diverse tests, not just to identify the worst case. This requires the incorporation of additional mechanisms and criteria to capture distinct paths that contribute to a diverse test suite, and of a family of performance estimators that can be associated with program paths.
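To make the notion of a diverse suite concrete, the sketch below shows one possible selection criterion. It is an illustrative assumption and not necessarily the mechanism used in this work: each explored path is represented by a hypothetical identifier, the set of branches it takes, and a cost produced by some performance estimator. Paths are considered in order of decreasing estimated cost, and a path is kept only if its branch set is sufficiently different (by Jaccard distance) from every path already selected.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def select_diverse_paths(paths, k, min_distance):
    """Greedily pick up to k costly paths that differ from one another.

    paths: list of (path_id, frozenset_of_branch_ids, estimated_cost).
    """
    selected = []
    for path in sorted(paths, key=lambda p: p[2], reverse=True):
        if len(selected) >= k:
            break
        # Keep the path only if it is far enough from all chosen paths.
        if all(jaccard_distance(path[1], s[1]) >= min_distance for s in selected):
            selected.append(path)
    return selected
```

With this rule, a second path that is almost as costly as the first but exercises nearly the same branches is skipped in favor of a cheaper path through different code, which is the trade-off a diverse suite makes relative to a pure worst-case search.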