Application Robustness Classification using Perturbation Testing


Amol Khanapurkar, Mohit Nanda, Suresh Malan

Performance Engineering Research Center

Tata Consultancy Services

Mumbai, India

{amol.khanapurkar, mohit.nanda, suresh1.malan}@tcs.com

Load testing captures performance at discrete load levels. It is naïve, however, to assume that application performance at a given load level will always stay within the band that load testing has indicated. By introducing a small spike for a short duration, we observed that applications either show resiliency, degrade gracefully and then recover, or crash outright and stay down even after the spike has subsided. This behavior provides insights into application robustness and subsequent tune-ability that cannot be captured in load tests. Perturbation testing is a useful method to classify applications and mitigate risks.

1. Introduction

Enterprise applications cater to different workloads, belong to different domains and provide different functionalities. This makes it difficult for enterprises to compare, rate and rank them against a consistent yardstick. Such a yardstick can be a useful tool to drive policies and implement mandates in an organization. Based purely on technical parameters, we propose a methodology for comparing heterogeneous applications consistently.

A second issue is that applications spend a lot of time in test-tune iterations. A very difficult question to answer here is: is the application tuned to the fullest extent, or is there further scope for tuning? In practice, tuning is carried out until SLAs are met or until a cost-benefit analysis suggests that the cost of tuning further cannot be justified because cheaper, though possibly short-term, options are available. We attempt to answer the tune-ability question by capturing application behaviors and analyzing them for further tuning scope.

This paper presents Perturbation Testing as a method for arriving at consistent classification criteria for application robustness, and shows how this classification can help in analyzing the tune-ability of applications.

By application robustness, we mean the capability of the application to maintain desirable characteristics such as low response times, guaranteed high throughput and no downtime under the peak loads relevant to that application. We also take robustness to mean the ability to recover from adverse conditions such as low availability of resources and prolonged periods of peak load. Thus robustness, in the context of this paper, means no or slow degradation and quick recovery.

Tune-ability of an application, in the context of this paper, means how quickly the application can reach its optimal state within existing constraints. We argue that the robustness and tune-ability of an application can be identified quickly by a technique we call Perturbation Testing, compared with the industry practice of performing load testing. In load testing, the workload exercised against the application is systematically varied to determine performance at different load levels. Load testing is useful because it captures performance numbers at given load levels. Put mathematically, load testing applies a discrete load, whereas the real-life production workload is closer to continuous. Hence it is essential to capture the transition from one load level to another. Load testing fails to capture these transitions, and this is where Perturbation Testing plays an important role.


Perturbation testing introduces a spike for a short duration. After the spike starts, response times increase, which is expected. But during and after the spike, different applications show different behaviors. Some are resilient to the increased load, and their response times increase only for the duration of the spike; some degrade more but recover gracefully within a certain interval after the spike; and some crash as soon as the spike begins and never recover. These behaviors, coupled with utilization data, provide considerable insight into application robustness and the scope for further tuning. We exploit them to form the classifications presented in the rest of the paper.

2. Related Work

Though there is no directly related work in the area, there has been some similar work in the fields of computer networks and availability. Moshe & Tan [1] discuss the effect of explicitly introduced perturbation in TCP queues and study the effect of different types of perturbation. Similarly, Arpaci-Dusseau, in his doctoral dissertation [2], discusses various types of perturbation and the resulting network availability and performance behavior. Although some such studies exist for the application programming world [3], that field has largely neglected application behavior under perturbation and has relied chiefly on load tests.

3. Terms of Reference

3.1. Definition of Terms

Load Testing: Testing with increasing loads to find out performance at different load levels. Pictorially, load testing can be depicted as below.

Figure 1: Depiction of a Typical Load Testing

Perturbation Testing: We use the term perturbation testing to mean the introduction of a short spike of up to 2X workload in a test that runs for T units of time. In our test configuration the spike lasted for one-third of the test duration. A perturbation testing workload graph is depicted below.

Figure 2: Perturbation Testing Workload Pattern
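As a rough illustration of the workload pattern defined above, the following Python sketch returns the target load at a given point in the test. It assumes that the spike occupies the middle third of the test, which is inferred from the figure's axis labels and from the sample indices used in the RoR calculation; the function and parameter names are hypothetical and are not taken from the test scripts used in the paper.

```python
def perturbation_load(elapsed: float, test_duration: float,
                      normal_load: int, spike_factor: float = 2.0) -> int:
    """Return the target load (e.g. concurrent users) at time `elapsed`.

    Assumption: the load is X for the first third of the test, spike_factor * X
    for the middle third, and X again for the final third.
    """
    if not 0 <= elapsed <= test_duration:
        raise ValueError("elapsed must lie within the test duration")
    spike_start = test_duration / 3
    spike_end = 2 * test_duration / 3
    if spike_start <= elapsed < spike_end:
        return int(normal_load * spike_factor)
    return normal_load


# Example: a 30-minute test with a normal load of 1000 users.
LOAD_AT_MINUTE_15 = perturbation_load(15, 30, 1000)
```

Any load generator that can change the number of active virtual users over time could be driven from such a schedule; how this is configured depends on the tool.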

3.2. Applications used for Testing

We used 3 applications in Perturbation Testing. This section describes the applications, their technology and workloads.

Table 1: Application Matrix

| Application Type | Technology | Perturbation Testing Workload (peak-to-normal ratio) | Load Testing Workload |
|---|---|---|---|
| eCommerce | J2EE-MySQL | 2X | 0.5X, 1X, 1.5X and 2X |
| Quizzing | J2EE-MySQL | 2X | 0.5X, 1X, 1.5X and 2X |
| Reporting | J2EE-MySQL | 2X | 0.5X, 1X, 1.5X and 2X |

Where, X = normal workload expected for that application.

At a workload of X, all 3 applications have sub-second response times under realistic think times on the test hardware. The application functionalities are described below.

• The eCommerce application is an online retail application built around a shopping-cart transaction involving credit-card payment. The performance test script consists of 11 web pages covering Login, Item selection, Shopping Cart checkout and Logout.

• The Quizzing application comprises Login, navigating to the quiz, taking a quiz and Logout. The performance test script comprises 43 web pages, which include 30 questions.

• The Reporting application is a Telecom-domain application. Its reports are generated from a database of 2.5 M billing records for 10K customers. The performance test script for this application comprises 5 web pages.

Sections 4 and 5 present details of Load Testing and Perturbation Testing respectively.

4. Load Testing

This section provides the test environment, testing methodology and test results. It also comments on shortcomings of this testing and paves the way for describing perturbation testing.

4.1. Test Environment

All 3 applications were tested on similar hardware. Each had Application Server and Database server on the same machine in the test environment. The hardware configurations and software versions are mentioned in Table 2.

All the servers were present in the same local network and were dedicated only for the respective applications deployed on them.

Table 2: Server and Load Generator Configurations

App & DB Server Configuration

| Component | Value |
|---|---|
| CPU | 4 core 2.1 GHz AMD Opteron |
| Memory | 4 GB |
| OS | CentOS 5.7 |
| App Server | Apache Tomcat [4] 6.0.14 |
| DB Server | MySQL [5] 5.5 |
| Load Generator | The Grinder [6] 3.2 |
| Network | 1 Gbps |

Load Generator Configuration

| Component | Value |
|---|---|
| OS | CentOS 5.4 |
| RAM | 8 GB |
| Processor | Intel(R) Xeon(R) CPU E543 2.66GHz |

4.2. Test Methodology

The applications were tested under realistic workloads applicable to each application. The tests were designed to find out the performance of the applications at workloads of 0.5X, 1X, 1.5X and 2X. The typical workload for each application, along with the think time between pages, is specified in Table 3.

Table 3: Load Test Workloads and Think Times

| Application | X | Think time (seconds) | Load Test Workload |
|---|---|---|---|
| eCommerce application | 1000 | 2.5 | 500, 1000, 1500, 2000 |
| Quizzing application | 250 | 5 | 125, 250, 375, 500 |
| Reporting application | 1000 | 1 | 500, 1000, 1500, 2000 |

To ensure that all the test results were collected in a consistent fashion, both the application and the database server software were rebooted between tests. Software configuration parameters for both servers were never changed throughout the load tests.

Best practices, such as validating the observed throughput against the theoretical throughput calculated using Little’s Law [7], were followed.

4.3. Test Results

This section provides results for all 3 applications under load tests.

4.3.1 Load Test Results for eCommerce application

Figure 3: eCommerce Application Load Test Results

Little’s Law Verification [N = (R + Z) * X], where

N = Number of users
R = Response time in seconds
X = Throughput in web pages / second
Z = Think time in seconds

Table 4: eCommerce Application Load Test Results - Little’s law verification

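| N | RMeasured (s) | XMeasured (pages/s) | XTheory (pages/s) | % Deviation |
|---|---|---|---|---|
| 500 | 0.12 | 95 | 97.64 | 3.15 |
| 1000 | 0.36 | 193 | 186.61 | 3.37 |
| 1500 | 46.34 | 6 | 29.21 | 78.19 |
| 2000 | 47.19 | 4 | 38.32 | 89.59 |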

4.3.2 Load Test Results for Quizzing application

Figure 4: Quizzing Application Load Test Results

Table 5: Quizzing Application Load Test Results - Little’s law verification

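| N | RMeasured (s) | XMeasured (pages/s) | XTheory (pages/s) | % Deviation |
|---|---|---|---|---|
| 125 | 0.30 | 26 | 23.59 | 8.48 |
| 250 | 0.56 | 47 | 44.98 | 5.04 |
| 375 | 0.32 | 30 | 70.53 | 57.88 |
| 500 | 1.24 | 40 | 80.10 | 50.16 |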

4.3.3 Load Test Results for Reporting application

Figure 5: Reporting Application Load Test Results

Table 6: Reporting Application Load Test Results - Little’s law verification

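| N | RMeasured (s) | XMeasured (pages/s) | XTheory (pages/s) | % Deviation |
|---|---|---|---|---|
| 500 | 0.39 | 373 | 360 | 3.62 |
| 1000 | 0.68 | 645 | 597 | 8.10 |
| 1500 | 2.04 | 67 | 493 | 86.40 |
| 2000 | 53.38 | 14 | 37 | 62.34 |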
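The theoretical throughput used in the verification follows from rearranging Little’s Law as X = N / (R + Z). The Python sketch below is illustrative only: it applies that rearrangement to the Reporting application's N and measured R values from Table 6, using the 1-second think time listed in Table 3. The function and constant names are hypothetical. The computed values agree with the XTheory column only to within the rounding of the published response times.

```python
def theoretical_throughput(users: int, response_time_s: float,
                           think_time_s: float) -> float:
    """Little's Law rearranged: X = N / (R + Z), in web pages per second."""
    cycle_time_s = response_time_s + think_time_s
    if cycle_time_s <= 0:
        raise ValueError("response time plus think time must be positive")
    return users / cycle_time_s


# (N, measured R in seconds) for the Reporting application, from Table 6.
REPORTING_ROWS = [(500, 0.39), (1000, 0.68), (1500, 2.04), (2000, 53.38)]
REPORTING_THINK_TIME_S = 1.0  # think time for Reporting, from Table 3

for users, response_time_s in REPORTING_ROWS:
    x_theory = theoretical_throughput(users, response_time_s,
                                      REPORTING_THINK_TIME_S)
    print(f"N={users:5d}  R={response_time_s:6.2f}s  X_theory={x_theory:7.2f}")
```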


In all 3 cases, Little’s Law verification failed above the 1X workload. All tests were time-based, i.e. each test was stopped after a fixed duration. Since all 3 applications degraded at loads beyond 1X, transactions that were started but never finished before the test duration expired were never reflected in the throughput numbers. The same applies to response time measurements: since the response time of transactions that never finished was not captured, the reported average is misleading.
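The bias introduced by discarding unfinished transactions can be illustrated with a small Python sketch. The transaction data here are hypothetical and chosen only to make the effect visible; they are not measurements from the paper. The sketch compares the average over completed transactions with an average that also counts, as a lower bound, the time each unfinished transaction had already spent waiting when the test ended.

```python
from statistics import mean


def response_time_summary(transactions, test_end_s):
    """Summarise response times for a time-boxed test.

    `transactions` is a list of (start_s, service_time_s) pairs. A transaction
    counts as completed only if it finishes before `test_end_s`.
    Returns (mean of completed, number unfinished, lower-bound mean over all).
    """
    completed = []
    unfinished_elapsed = []
    for start_s, service_time_s in transactions:
        if start_s + service_time_s <= test_end_s:
            completed.append(service_time_s)
        else:
            # Only the time already spent waiting is known when the test stops.
            unfinished_elapsed.append(test_end_s - start_s)
    reported_mean = mean(completed) if completed else float("nan")
    everything = completed + unfinished_elapsed
    lower_bound_mean = mean(everything) if everything else float("nan")
    return reported_mean, len(unfinished_elapsed), lower_bound_mean


# Hypothetical 60-second test: three fast requests and two that never finish.
SAMPLE = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 90.0), (40, 120.0)]
SUMMARY = response_time_summary(SAMPLE, test_end_s=60)
print(SUMMARY)
```

In this made-up case the completed-only average looks healthy even though two of the five requests were still outstanding when the test stopped, which is the kind of distortion described above.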

Besides the above-mentioned common factor, the other factors were as follows:

• CPU utilization for the eCommerce application was nearly 100% throughout the test duration beyond 1X load.

• The Quizzing application had spikes of up to 100% CPU utilization.

• The Reporting application had a huge queue on the disk, and CPU utilization fell to near 0%.

These systemic factors caused disturbances in the system, which led to the deviations from Little’s Law. This failure of the measured numbers to match the theoretical numbers from Little’s Law paved the way for performing Perturbation Testing.

5. Perturbation Testing

The rationale behind perturbation testing is that although load tests indicate that the applications will not scale to higher loads (in current conditions), in real life a peak of 2X is quite likely. Thus the question we had to ask ourselves was: will the applications be able to sustain a spike of 2X?

Also, note that at 1X load, the response times of all applications were below 1 second. At this point we were intrigued by the question: which application has the maximum scope for tuning, so that efforts can be directed where they matter most?

To answer both questions, we performed perturbation testing on our applications to see how they behave before, during and after being perturbed. The results are presented in the sections below.

5.1 Perturbation Test Results

Refer to Section 3.1 for the perturbation testing workload pattern. For the eCommerce, Quizzing and Reporting applications, the 1X load was 1000, 250 and 1000 users respectively. The test duration T was 30 minutes; hence T/3, the duration of the perturbation, was 10 minutes.

5.1.1 Perturbation Test Results for eCommerce application

Figure 6: eCommerce Application Perturbation Test Results

Applying Little’s Law to the perturbation test results gives Table 7.

Table 7: eCommerce Application Perturbation Test Results – Little’s Law verification

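| Period | N | RMeasured (s) | XMeasured (pages/s) | XTheory (pages/s) | % Deviation |
|---|---|---|---|---|---|
| Before | 1000 | 1.03 | 300 | 283 | 5.82 |
| During | 2000 | 17.92 | 74 | 98 | 32.56 |
| After | 1000 | 6.13 | 297 | 116 | 61.05 |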

5.1.2 Perturbation Test Results for Quizzing application


Figure 7: Quizzing Application Perturbation Test Results

Table 8: Quizzing Application Perturbation Test Results – Little’s Law verification

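| Period | N | RMeasured (s) | XMeasured (pages/s) | XTheory (pages/s) | % Deviation |
|---|---|---|---|---|---|
| Before | 250 | 0.18 | 46.10 | 48.25 | 4.66 |
| During | 500 | 0.76 | 88.05 | 86.74 | 1.48 |
| After | 250 | 0.63 | 51.50 | 44.38 | 13.82 |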

5.1.3 Perturbation Test Results for Reporting application

Figure 8: Reporting Application Perturbation Test Results

Table 9: Reporting Application Perturbation Test Results – Little’s Law verification

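| Period | N | RMeasured (s) | XMeasured (pages/s) | XTheory (pages/s) | % Deviation |
|---|---|---|---|---|---|
| Before | 1000 | 2.23 | 346.13 | 308.76 | 10.80 |
| During | 2000 | 36.44 | 44.58 | 53.42 | 19.83 |
| After | 1000 | 83.25 | 23.29 | 11.87 | 49.04 |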
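The paper does not state how the % deviation column is normalised. The Python sketch below assumes that it is the absolute difference between measured and theoretical throughput, taken relative to the measured throughput. Under that assumption, which is an inference rather than a documented definition, the Table 9 deviations for the Reporting application are reproduced to within rounding. All names are hypothetical.

```python
def percent_deviation(measured: float, theoretical: float) -> float:
    """Deviation of theoretical from measured throughput, as a percentage.

    Assumption: normalised by the measured value (not stated in the paper).
    """
    if measured == 0:
        raise ValueError("measured throughput must be non-zero")
    return abs(measured - theoretical) / measured * 100.0


# (measured X, theoretical X) per period for the Reporting application, Table 9.
TABLE_9 = {
    "Before": (346.13, 308.76),
    "During": (44.58, 53.42),
    "After": (23.29, 11.87),
}

DEVIATIONS = {period: percent_deviation(x_measured, x_theory)
              for period, (x_measured, x_theory) in TABLE_9.items()}

for period, deviation in DEVIATIONS.items():
    print(f"{period:6s} {deviation:6.2f}%")
```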

5.2 Perturbation Test Utilization

The graphs below show the CPU utilization trend for the performance tests with perturbation load. The behavior seen in the three applications varies considerably and is consistent with the response time results.

5.2.1 Utilization for eCommerce Application

Figure 9: eCommerce Application Utilization

5.2.2 Utilization for Quizzing Application

Figure 10: Quizzing Application Utilization

5.2.3 Utilization for Reporting Application

Figure 11: Reporting Application Utilization

5.3 Analysis of Perturbation Test Results and Utilization

Analysis of the results helps in understanding the behavior of the three applications before, during and after perturbation. All three applications perform equally well under their typical workloads, with normal CPU utilization. However, each reacts differently to the additional load applied during perturbation testing.

In the case of the eCommerce application, CPU utilization soared to 100% under the perturbation load, causing response times to shoot up and throughput to come down. However, the application regained its performance after the perturbation period. This can be seen in Figures 6 and 9 and Table 7.

For the Quizzing application, CPU utilization increased substantially during perturbation and response times increased by around 5x. In the post-perturbation period, the application recovered very slowly, with CPU utilization and response times gradually subsiding to normal. This is evident in Figures 7 and 10 and Table 8.

The Reporting application had the weakest performance: under perturbation it lost approximately 80% of its throughput and its response times rose to about 15x. CPU utilization dropped to its lowest as the perturbation started and remained negligible. At the same time, the high response times and low throughput indicate that the system was waiting on some long queue or resource, so the application can be perceived as unresponsive. It also did not recover in the post-perturbation period, as can be seen in Figures 8 and 11 and Table 9.

5.4 Applicability of Perturbation Test Results

5.4.1 Computation of Metrics

Unlike Load Testing, which provides ‘Discrete’ result values for specific load levels, Perturbation Testing provides results over a ‘Continuous’ range. Using perturbation testing we see the following 3 types of behavior:

• Resilient – Absorbs most of the perturbation and is quick to recover.

• Graceful – Quick to degrade, but recovers slowly and surely.

• Weak – Crashes when perturbed with a spike.

Using Perturbation Testing one can compute two metrics:

• Rate of Degradation (RoD)
• Rate of Recovery (RoR)

Since degradation is almost certain, the RoD is less useful. From a practical standpoint, RoR is the more important metric. We demonstrate how RoR is computed using response times, since response time is an end-user-facing metric. RoR can also be computed in throughput terms. The following perturbation result graph depicts the RoR calculation.

Figure 12: RoR Calculation

Rate of Recovery (RoR) is computed as the rate at which response times improve. It is the slope of the imaginary line between (X1, Y1) and (X2, Y2), where:

X1 = End time-stamp of the test
X2 = End of the perturbation period
Y1 = Average response time after the perturbation period
Y2 = Average response time during the perturbation period

Slope = (Y1 - Y2) / (X1 - X2)

Table 10: Slope Calculation for 3 applications

| Application Name | Y1 | Y2 | X1 | X2 | Slope |
|---|---|---|---|---|---|
| eCommerce | 6.13 | 17.92 | 180 | 120 | -0.196 |
| Quizzing | 0.63 | 0.76 | 180 | 120 | -0.002 |
| Reporting | 83.25 | 36.44 | 180 | 120 | 0.780 |


X1 and X2 are timestamps, expressed here as sample numbers. An X2 of 120 indicates that the perturbation period ends with the collection of the 120th response time sample, and an X1 of 180 indicates that the test ends with the 180th sample. A slope of -0.196 means that response times improve by 0.196 seconds per unit time, where the unit is the data collection interval (10 seconds in this case). A negative slope indicates recovery and a positive slope indicates a crash. Conversely, when using a throughput-based calculation, a positive slope indicates recovery and a negative slope indicates a crash.
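The slope calculation can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical names; the input values are the average response times and sample indices reported in Table 10.

```python
def rate_of_recovery(y_after: float, y_during: float,
                     x_test_end: float, x_spike_end: float) -> float:
    """Slope of the line between (X2, Y2) and (X1, Y1): (Y1 - Y2) / (X1 - X2)."""
    if x_test_end == x_spike_end:
        raise ValueError("the test must continue after the perturbation ends")
    return (y_after - y_during) / (x_test_end - x_spike_end)


# Application -> (Y1: avg response time after, Y2: avg response time during).
RESPONSE_TIMES_S = {
    "eCommerce": (6.13, 17.92),
    "Quizzing": (0.63, 0.76),
    "Reporting": (83.25, 36.44),
}
X_TEST_END, X_SPIKE_END = 180, 120  # sample indices, one sample every 10 s

SLOPES = {app: rate_of_recovery(y1, y2, X_TEST_END, X_SPIKE_END)
          for app, (y1, y2) in RESPONSE_TIMES_S.items()}

for app, slope in SLOPES.items():
    print(f"{app:10s} slope = {slope:+.3f} s per sample")
```

Because each sample spans 10 seconds, dividing a slope by 10 would express it in seconds of response time per second of wall-clock time.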

5.4.2 Using the metrics

RoR is a sufficient metric to perform

• Application Robustness Classification
• Computation of Tune-ability Index

However, more metrics can be captured if desired, such as

• Rate of Degradation (RoD)
• Magnitude of Degradation (avg. response time after perturbation / avg. response time before perturbation)
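As an illustration of the Magnitude of Degradation metric defined above, the following Python sketch applies the ratio to the "Before" and "After" average response times from Tables 7, 8 and 9. The names are hypothetical, and the paper itself does not report these ratios.

```python
def magnitude_of_degradation(rt_after_s: float, rt_before_s: float) -> float:
    """Avg. response time after perturbation / avg. response time before."""
    if rt_before_s <= 0:
        raise ValueError("the pre-perturbation response time must be positive")
    return rt_after_s / rt_before_s


# Application -> (avg response time before, avg response time after), seconds.
BEFORE_AFTER_S = {
    "eCommerce": (1.03, 6.13),   # Table 7
    "Quizzing": (0.18, 0.63),    # Table 8
    "Reporting": (2.23, 83.25),  # Table 9
}

MAGNITUDES = {app: magnitude_of_degradation(after, before)
              for app, (before, after) in BEFORE_AFTER_S.items()}

for app, ratio in MAGNITUDES.items():
    print(f"{app:10s} {ratio:6.2f}x")
```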

5.4.3 Application Robustness Classification

Using response time metrics, we can classify applications as below:

• Slope less than or equal to -0.1 => Resilient
• Slope between -0.1 and 0 => Graceful
• Slope greater than 0 => Weak

Thus, Resilient applications are those that recover quickly even after being hit by a spike sustained for a reasonable period of time. Weak applications crash irrecoverably when perturbed. Graceful applications show in-between behavior. The three applications used in this paper are classified according to the above criteria as follows.

Table 11: Applications classified by robustness criteria

| Application Name | Robustness Classification |
|---|---|
| eCommerce | Resilient |
| Quizzing | Graceful |
| Reporting | Weak |
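The classification rule can be written as a small Python function. This is a sketch with hypothetical names; the thresholds are the ones given above. The treatment of a slope of exactly 0 as Graceful is an assumption, since the criteria only say that slopes greater than 0 are Weak.

```python
RESILIENT_THRESHOLD = -0.1  # slope at or below this value => Resilient


def classify_robustness(slope: float) -> str:
    """Map a response-time-based RoR slope to a robustness class."""
    if slope <= RESILIENT_THRESHOLD:
        return "Resilient"
    if slope <= 0:
        # Assumption: a slope of exactly 0 falls in the Graceful band.
        return "Graceful"
    return "Weak"


# Slopes as reported in Table 10.
SLOPES = {"eCommerce": -0.196, "Quizzing": -0.002, "Reporting": 0.780}
CLASSES = {app: classify_robustness(slope) for app, slope in SLOPES.items()}
print(CLASSES)
```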

5.4.4 Computation of Tune-ability Index

The tune-ability index is loosely defined as the potential for improvement. The value of the slope itself can be taken as the tune-ability index. Based on response time metrics, a positive slope indicates greater potential for improving the application. It is possible to evolve a stronger definition of the tune-ability index by coupling it with utilization data; however, for answering the question of which application has the maximum potential for improvement, even the loose definition suffices. In the context of the applications used in the paper, the tune-ability index is as follows.

| Application Name | Tune-ability Index | Comments |
|---|---|---|
| eCommerce | -0.196 | Easiest to tune |
| Quizzing | -0.002 | Less easy to tune |
| Reporting | 0.780 | Difficult to tune |

Since the CPU is the fastest device in the system, the CPU utilization patterns presented in Section 5.2 are used to interpret the tune-ability index. The degradation of the applications can be prevented as shown in the table below.


Table 12: Subjective analysis of tune-ability index

| Application Name | CPU Utilization | Possible Fix | Difficulty Level |
|---|---|---|---|
| eCommerce | High | Implement Admission Control | Easy |
| Quizzing | Medium | Identify why CPU utilization alternates between high and low states. Augment the other resource if necessary. | Hard |
| Reporting | Low | CPU utilization is low, indicating that the bottleneck is elsewhere. Detecting the bottleneck will dictate what solution is possible. | Hardest |

When the tune-ability index is computed from RoR, the lower the index value, the easier it is to tune the application in terms of effort.
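Ordering applications by tuning effort then reduces to sorting them by index value. The Python sketch below uses the index values reported for the three applications; the function name is hypothetical.

```python
# Tune-ability index (response-time-based RoR slope) per application.
TUNE_ABILITY_INDEX = {"eCommerce": -0.196, "Quizzing": -0.002, "Reporting": 0.780}


def rank_by_tuning_effort(index_by_app: dict) -> list:
    """Return application names ordered from easiest to hardest to tune.

    A lower index value is taken to mean less tuning effort.
    """
    return sorted(index_by_app, key=index_by_app.get)


RANKING = rank_by_tuning_effort(TUNE_ABILITY_INDEX)
print(RANKING)
```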

6 Future Work

From the data presented above, it is easy to see that enforcing certain rules or constraints can improve the robustness rating of an application. For example, the Reporting application degrades rapidly due to the build-up of a queue and simply crashes while trying to process the backlog. In such a case, enforcing Admission Control, i.e. allowing only a certain number of concurrent reports, can improve performance. Similarly, the Quizzing application exhibits a pattern in which CPU utilization alternates between high and low states, suggesting that the application's demand is divided between the CPU and some other resource. Augmenting the other resource can significantly improve the probability of achieving higher CPU utilization and hence a better robustness rating. Future work includes devising strategies to move applications from Weak to Graceful and from Graceful to Resilient, based on their behavior in perturbation tests.
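One possible shape for such admission control is sketched below in Python, using a bounded semaphore to cap the number of reports generated concurrently. This is an illustration of the idea only, not the mechanism used in the applications studied (which are J2EE-based). The class, the function and the choice to reject rather than queue excess requests are all assumptions.

```python
import threading


class AdmissionController:
    """Admit at most `max_concurrent` requests at any one time."""

    def __init__(self, max_concurrent: int):
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_admit(self) -> bool:
        """Take a slot without blocking; return False if none is free."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Give back a slot once the admitted request has finished."""
        self._slots.release()


def run_report(controller: AdmissionController, generate_report):
    """Run `generate_report` only if admitted; return None when rejected.

    Assumption: a rejected request is turned away immediately so that no
    backlog builds up; the caller decides how to report this to the user.
    """
    if not controller.try_admit():
        return None
    try:
        return generate_report()
    finally:
        controller.release()


# Hypothetical usage: allow two concurrent reports.
controller = AdmissionController(max_concurrent=2)
result = run_report(controller, lambda: "report ready")
```

Rejecting early keeps the work in progress bounded during a spike; a suitable limit would have to be found experimentally for each application.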

7 Conclusion

When dealing with business applications with sub-second response times, it is easy to get carried away by current performance levels. Load testing helps to capture performance at different load levels. However, it is naïve to assume that application performance at a certain load level will stay in the band that load testing has indicated.

Using the Perturbation Testing technique, we can methodically rank applications by comparing them against themselves, because it is the growth in workloads and data sizes that brings applications down. Considering only workload, we devised a methodology that combines different performance testing techniques. The methodology is based on introducing perturbations and collecting metrics that quickly and convincingly capture the qualitative state of the applications. These states are apples-to-apples comparable and can be used for ranking different types of application and for computing the tuning potential of applications that appear to be super-quick under normal usage.

8 Acknowledgements

The authors are thankful to Dr Rajesh Mansharamani, TCS and Prof. Kishor Trivedi, Duke University for their support and guidance.

9 REFERENCES

[1] Fei Ge, Z. Moshe, Ji Li, Liansheng Tan, "FAST TCP performance under perturbation imposed queueing delay in equilibrium", 2nd International Conference on Future Computer and Communication (ICFCC) 2010, 21-24 May 2010, Wuhan

[2] Remzi H. Arpaci-Dusseau, "Performance Availability for Networks of Workstations", Ph.D. Dissertation, Fall 1999, University of California, Berkeley

[3] Cheer-Sun D. Yang, Lori L. Pollock, "Towards a structural load testing tool", ISSTA '96 Proceedings of 1996 ACM SIGSOFT international symposium on Software testing and analysis

[4] Apache Tomcat, http://tomcat.apache.org

[5] MySQL, http://dev.mysql.com/downloads

[6] The Grinder, Load Testing Tool, http://grinder.sourceforge.net

[7] J. D. C. Little, "A Proof of the Queueing Formula L = λW", Operations Research, 9, 383-387 (1961)