3.3 Validation
3.3.2 Empirical Validation on Case Studies
In this chapter, we show our model-based performance testing approach [58] [63] in practice and demonstrate its applicability with a set of experiments on a case study. We show that using abstract models to describe user profiles allows us to quickly experiment with different load mixes and detect worst-case scenarios. We also show how using log data as a basis for workload model creation can radically reduce the effort needed to create workload models.
Figure 3.8: PTA model for an aggressive-bidder user type.
Generating Load With MBPeT
We demonstrate the applicability of our tool and approach by using it to evaluate the performance of an auction web service, generically called YAAS. The web service was developed in-house and implemented as a stand-alone application. YAAS has a RESTful [66] interface based on the HTTP protocol and works as a normal auction web site where users create, search, browse, and bid on items that other users have created.
In the experiment we created three types of users: Passive, Aggressive, and Non-Bidders. The idea was that Aggressive users bid frequently on items while Passive users bid less often. Non-Bidders do not bid on items at all but only search and browse through the catalog of items. Figure 3.8 shows the PTA model for an Aggressive user. Similar models were created for the two other user types.
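To make the idea of a probabilistic user model concrete, the following minimal Python sketch draws one trace from a small model. The states, probabilities, and think times are invented for illustration and do not reproduce Figure 3.8; the names MODEL and generate_trace are hypothetical and are not part of the MBPeT tool.

```python
import random

# Hypothetical user model (illustration only, not the model of Figure 3.8).
# state -> list of (probability, think_time_seconds, action, next_state)
MODEL = {
    "start": [(1.0, 0, "browse()", "browsing")],
    "browsing": [
        (0.2, 5, "browse()", "browsing"),
        (0.7, 4, "get_auction(id)", "auction"),
        (0.1, 0, "exit()", None),
    ],
    "auction": [
        (0.6, 3, "bid(id,price,username,password)", "auction"),
        (0.2, 5, "browse()", "browsing"),
        (0.2, 0, "exit()", None),
    ],
}


def generate_trace(model, rng, max_actions=45):
    """Walk the model from "start" until exit() or until max_actions is hit.

    max_actions is an assumed safety bound, not a documented tool parameter.
    """
    state = "start"
    trace = []
    while state is not None and len(trace) < max_actions:
        edges = model[state]
        weights = [edge[0] for edge in edges]
        # Pick one outgoing edge according to its probability.
        _, think_time, action, state = rng.choices(edges, weights=weights, k=1)[0]
        trace.append((action, think_time))
    return trace


if __name__ == "__main__":
    for action, think_time in generate_trace(MODEL, random.Random()):
        print(f"wait {think_time}s, then {action}")
```

Changing the load mix then amounts to editing the probabilities on the edges, or the share of virtual users assigned to each model, rather than rewriting test scripts.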
We set out to test how many concurrent users the host node can support without exceeding the target response time values that we had set as thresholds for the different actions. The test lasted 20 minutes, and the number of concurrent users was ramped up linearly from 0 to 300 over the course of the test run. Table 3.2 shows that the system could support 64 concurrent users before breaching one of the response time thresholds. The MBPeT tool also maintains a log file of all the traces generated from the workload models during each test run. In this particular example, the tool generated a total of 2576 unique traces in 20 minutes. The length of the traces varied from 2 to 45 actions.
| Actions | Target response time, average (sec) | Target response time, max (sec) | Non-Bidders (22 %), time of breach (sec) | Non-Bidders (22 %), time of breach (sec) | Passive Users (33 %), time of breach (sec) | Passive Users (33 %), time of breach (sec) | Aggressive Users (45 %), time of breach (sec) | Aggressive Users (45 %), time of breach (sec) | Verdict (pass/fail) |
|---|---|---|---|---|---|---|---|---|---|
| browse() | 4.0 | 8.0 | 279 (78 users) | 394 (110 users) | 323 (90 users) | 394 (110 users) | 279 (78 users) | 394 (110 users) | Failed |
| search(string) | 3.0 | 6.0 | 279 (78 users) | 394 (110 users) | 279 (78 users) | 394 (110 users) | 229 (64 users) | 327 (92 users) | Failed |
| get_auction(id) | 2.0 | 4.0 | 280 (79 users) | 325 (91 users) | 279 (78 users) | 279 (78 users) | 276 (77 users) | 325 (91 users) | Failed |
| get_bids(id) | 3.0 | 6.0 | 279 (78 users) | 446 (130 users) | 325 (91 users) | 394 (110 users) | 327 (92 users) | 394 (110 users) | Failed |
| bid(id, price, username, password) | 5.0 | 10.0 | - | - | 327 (92 users) | 474 (132 users) | 328 (92 users) | 468 (131 users) | Failed |
Table 3.2: Response time measurements for user actions when ramping up from 0 to 300 users.
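The figure of 64 supported users is the earliest breach across all actions and user types in Table 3.2. The sketch below shows that lookup; the user counts are transcribed by hand from the table (taking the earlier of the two breaches in each cell pair), and the names FIRST_BREACH_USERS and earliest_breach are hypothetical.

```python
# (action, user type) -> concurrent users at the earliest recorded breach,
# transcribed from Table 3.2. Non-Bidders have no entry for bid().
FIRST_BREACH_USERS = {
    ("browse()", "Non-Bidders"): 78,
    ("browse()", "Passive Users"): 90,
    ("browse()", "Aggressive Users"): 78,
    ("search(string)", "Non-Bidders"): 78,
    ("search(string)", "Passive Users"): 78,
    ("search(string)", "Aggressive Users"): 64,
    ("get_auction(id)", "Non-Bidders"): 79,
    ("get_auction(id)", "Passive Users"): 78,
    ("get_auction(id)", "Aggressive Users"): 77,
    ("get_bids(id)", "Non-Bidders"): 78,
    ("get_bids(id)", "Passive Users"): 91,
    ("get_bids(id)", "Aggressive Users"): 92,
    ("bid(id,price,username,password)", "Passive Users"): 92,
    ("bid(id,price,username,password)", "Aggressive Users"): 92,
}


def earliest_breach(breaches):
    """Return the (action, user type) key with the lowest user count, and that count."""
    key = min(breaches, key=breaches.get)
    return key, breaches[key]


if __name__ == "__main__":
    (action, user_type), users = earliest_breach(FIRST_BREACH_USERS)
    print(f"First breach: {action} for {user_type} at {users} users")
```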
| Traces | JMeter: Nr of traces | JMeter: percent of traces | MBPeT: Nr of traces | MBPeT: percent of traces |
|---|---|---|---|---|
| browse, get_auction | 3706 | 27.77 % | 3682 | 27.31 % |
| search, get_auction | 3101 | 23.24 % | 3218 | 23.87 % |
| browse, get_auction, get_bids | 2427 | 18.19 % | 2447 | 18.15 % |
| search, get_auction, get_bids | 2306 | 17.28 % | 2305 | 17.10 % |
| browse, get_auction, get_bids, bid | 1803 | 13.51 % | 1831 | 13.58 % |
| Total | 13343 | 100 % | 13483 | 100 % |
Table 3.3: JMeter vs MBPeT: Running 5 different traces with 100 users in parallel for 20 minutes. A uniform think time of 3 seconds between actions was used.
Comparing MBPeT to JMeter
We investigated how our approach compares to JMeter [67], a well-known Java-based load generator. In this experiment, we ran 5 different test sequences with 100 concurrent users and a test session of 20 minutes with a ramp-up period of 120 seconds. Between actions, a uniform think time of 3 seconds was used. The tests were run three times and an average was computed. The test sequences were selected based on a previous test run of the YAAS application, from which the 5 most frequently executed sequences were chosen. We constructed a model containing the 5 selected traces in MBPeT and did the same in JMeter.
The test sequences and the distribution between them can be seen in Table 3.3. The table shows that JMeter executed on average a total of 13343 test sequences, while MBPeT executed on average a total of 13483. This corresponds to roughly a 1 percent speed advantage for the MBPeT tool. We note that the percentages with which the test sequences were executed differ between the tools. This is because the MBPeT tool generates load from probabilistic models, and the randomness in the models makes it difficult to control the exact distribution between test sequences. The test was not conducted to show that our tool is in any way better than JMeter, but rather to show that our tool can compete with a modern load generator in terms of generation speed. In this experiment, developing the JMeter scripts required less time than creating our models and adapter. This is because JMeter is a mature load generator that comes equipped with all kinds of ready-made templates, databases, and other elements which our tool lacks. However, because we use graphical models, our approach allows for a higher degree of model reuse between experiments compared to JMeter.
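The 1 percent figure and the per-sequence percentages can be recomputed directly from the counts in Table 3.3. The following sketch does so; the counts are copied from the table, while the names JMETER, MBPET, shares, and relative_gain are hypothetical.

```python
# Executed test sequences per tool, copied from Table 3.3.
JMETER = {
    "browse,get_auction": 3706,
    "search,get_auction": 3101,
    "browse,get_auction,get_bids": 2427,
    "search,get_auction,get_bids": 2306,
    "browse,get_auction,get_bids,bid": 1803,
}
MBPET = {
    "browse,get_auction": 3682,
    "search,get_auction": 3218,
    "browse,get_auction,get_bids": 2447,
    "search,get_auction,get_bids": 2305,
    "browse,get_auction,get_bids,bid": 1831,
}


def shares(counts):
    """Percentage of the total contributed by each test sequence."""
    total = sum(counts.values())
    return {trace: 100.0 * n / total for trace, n in counts.items()}


def relative_gain(baseline_total, other_total):
    """Relative difference of other_total over baseline_total, in percent."""
    return 100.0 * (other_total - baseline_total) / baseline_total


if __name__ == "__main__":
    gain = relative_gain(sum(JMETER.values()), sum(MBPET.values()))
    print(f"MBPeT executed {gain:.2f} % more sequences than JMeter")
    for trace, pct in shares(MBPET).items():
        print(f"{trace}: {pct:.2f} %")
```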
Log2Model
To validate the applicability of our Log2Model tool, we tried to re-create the workload models from log data that was produced during an earlier test run of the YAAS application. We created a set of PTA models for the YAAS system and loaded them into our MBPeT tool. We ran load tests based on the models for 2 hours in order to produce a log file. From the produced log file, containing roughly 10,000 requests, we tried to re-create the original models as accurately as possible using our Log2Model tool.
Figure 3.9 shows a comparison between the original model and the re-created model. The figure shows that the two models are virtually identical apart from the probability values on a few edges. This is because we use a stochastic model for generating the load and do not have exact control over which traces are generated. With a larger log file, approximately 1,000,000 lines, the probability values would be expected to be even closer to the original values, if not exact. The paper shows that automation can reduce the effort necessary for creating workload models for performance testing. By comparison, Cai et al. [68] report that they spent around 18 hours manually creating a test plan and the JMeter scripts for the reference Java PetStore application [69]. In our experiment, parsing the 10,000 lines took 2 seconds, while the rest of the information processing, including creating the models, took less than a second. Table 3.4 shows a summary of the execution time for the different steps of the algorithm for different log sizes. The final step, constructing the workload model, was purposely left out since it varies considerably depending on the chosen number of clusters and threshold value.
| Phase | 30,000 | 50,000 | 100,000 | 200,000 | 400,000 |
|---|---|---|---|---|---|
| Parsing | 6 sec | 10 sec | 22 sec | 50 sec | 2 min |
| Pre-processing | 4 sec | 9 sec | 10 sec | 21 sec | 31 sec |
| Request tree reduction | 0.3 sec | 0.3 sec | 0.8 sec | 2 sec | 5 sec |
| Clustering | 0.08 sec | 0.08 sec | 0.4 sec | 5 sec | 60 sec |
| Total | 10.38 sec | 19.38 sec | 33.2 sec | 1 min 18 sec | 3 min 36 sec |
Table 3.4: Execution times for different log sizes, given as the number of lines of log entries.
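The recovery of edge probabilities from logged traces, discussed above for the re-created YAAS model, can be illustrated with a simple relative-frequency estimate. The sketch below is not the Log2Model algorithm, which also involves request tree reduction and clustering; it assumes that a state is identified by the last executed action, and the function name estimate_transitions is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(traces):
    """Estimate P(next action | previous action) from a list of traces.

    Each trace is a list of action names without an explicit exit action;
    the end of a trace is counted as a transition to "exit".
    """
    counts = defaultdict(Counter)
    for trace in traces:
        previous = "start"
        for action in trace:
            counts[previous][action] += 1
            previous = action
        counts[previous]["exit"] += 1

    probabilities = {}
    for state, outgoing in counts.items():
        total = sum(outgoing.values())
        probabilities[state] = {a: n / total for a, n in outgoing.items()}
    return probabilities


if __name__ == "__main__":
    logged = [
        ["browse", "get_auction"],
        ["browse", "get_auction", "get_bids"],
        ["browse"],
        ["browse", "get_auction"],
    ]
    for state, outgoing in estimate_transitions(logged).items():
        print(state, outgoing)
```

Because the estimates are relative frequencies of randomly generated traces, they deviate slightly from the generating probabilities for small logs and move closer to them as the number of logged traces grows.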
Figure 3.9: Original (a) and re-created (b) workload models.
In a second experiment we show how complex models can be generated from only a small amount of log data. In this experiment we used log data from a web site called pubiliiga [70]. The web site maintains football scores in the league of pubiliiga. The web site also stores information about where and when the games are played, rules, game scores, teams, etc. The web site has been created using the Django framework [71] and runs on top of an Apache [59] web server. The results shown in this section are generated from a mere 30,000 lines of log data (7 MB) out of the original 1.3 million lines (320 MB). Generating the model on a computer equipped with an 8-core Intel i7 2.93 GHz processor and 16 GB of memory took approximately 2 seconds. After a model has been generated, the user decides the desired level of complexity of the model. This is done by choosing a threshold value. Traces are ordered according to their execution frequency as they are found in the log. Some traces are executed more than others. For example, a threshold value of 20 percent (0.2) means that a model is constructed with 20 percent of the most frequent traces and the rest will be filtered out. Figure 3.10 shows the generated workload model when using a threshold value of 0.5, which means that 50 percent of the traces are included in the model, starting with the most frequent ones.
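The threshold-based filtering can be sketched as follows. The sketch assumes that the threshold is applied to the number of distinct traces ranked by execution frequency; the text leaves open whether the tool instead applies it to the share of executions covered. The function name filter_traces is hypothetical.

```python
import math
from collections import Counter


def filter_traces(traces, threshold):
    """Return the most frequent distinct traces, up to the given fraction.

    threshold is a fraction in (0, 1]; at least one trace is always kept.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in the interval (0, 1]")
    frequency = Counter(tuple(trace) for trace in traces)
    # most_common() orders distinct traces from most to least frequent.
    ranked = [trace for trace, _ in frequency.most_common()]
    keep = max(1, math.ceil(threshold * len(ranked)))
    return ranked[:keep]


if __name__ == "__main__":
    log = [["a"]] * 5 + [["b"]] * 3 + [["c"]] * 2 + [["d"], ["e"]]
    print(filter_traces(log, 0.5))
```

A lower threshold yields a smaller and more maintainable model at the cost of dropping rarely executed behaviour.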
Figure 3.10: Models created from pubiliiga.fi log data, with the threshold set at 50 percent.
This experiment serves to show that complex models can be generated even when using relatively small amounts of data and when many traces are filtered out. From this experiment we can conclude that a model that perfectly matches reality would be highly complex and unmaintainable. At the same time, a performance test in which only a few static traces are repeated over and over again would be a huge simplification of reality.
3.4 Related Work
The related work presented in this section is divided into two parts: the first part addresses research related to different model-based performance testing approaches, while the second part presents work related to different performance testing tools. The approaches presented below are not an exhaustive list of related work, but they are the ones that we find the most interesting or the most similar to our approach.