Profiling services for resource optimization and capacity planning in distributed systems

(1)

DOI 10.1007/s10586-008-0063-x

Profiling services for resource optimization and capacity planning

in distributed systems

Guofei Jiang· Haifeng Chen · Kenji Yoshihira

Received: 22 May 2008 / Accepted: 8 September 2008 / Published online: 24 September 2008 © Springer Science+Business Media, LLC 2008

Abstract The capacity needs of online services are mainly determined by the volume of user loads. For large-scale distributed systems running such services, it is quite diffi-cult to match the capacities of various system components. In this paper, a novel and systematic approach is proposed to profile services for resource optimization and capacity planning. We collect resource consumption related measure-ments from various components across distributed systems and further search for constant relationships between these measurements. If such relationships always hold under var-ious workloads along time, we consider them as invariants of the underlying system. After extracting many invariants from the system, given any volume of user loads, we can follow these invariant relationships sequentially to estimate the capacity needs of individual components. By comparing the current resource configurations against the estimated ca-pacity needs, we can discover the weakest points that may deteriorate system performance. Operators can consult such analytical results to optimize resource assignments and re-move potential performance bottlenecks. In this paper, we propose several algorithms to support capacity analysis and guide operator’s capacity planning tasks. Our algorithms are evaluated with real systems and experimental results are also included to demonstrate the effectiveness of our approach.

G. Jiang (

)· H. Chen · K. Yoshihira

NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA e-mail:[email protected] H. Chen e-mail:[email protected] K. Yoshihira e-mail:[email protected]

Keywords System management· Capacity planning · Resource optimization· Distributed systems · System invariants· Algorithms

1 Introduction

In the last decade, with the great success of Internet technol-ogy, many large-scale distributed systems have been built to run various online services. These systems have unprece-dented capacity to process large volume of transaction re-quests simultaneously. For example, Google has thousands of servers to handle millions of user queries every day and Amazon sold 3.6 million items or 41 items per second in the single day of December 12, 2005 [2]. While clients may only see a single website, the distributed systems running such services are very complex and could include thousands of components such as operating systems, application soft-ware, servers, networking and storage devices. Meantime, clients always expect high Quality of Services (QoS) such as short latency and high availability from online transac-tion services. Studies have shown that clients are easy dis-satisfied because of unreliable services or even seconds de-lay in response time. However, under many dynamics and uncertainties of user loads and behaviors, some components inside the system may suddenly become the performance bottleneck and then deteriorate system QoS. Therefore, it is very desirable to make right capacity planning for each com-ponent so as to maintain the whole system at a good state.

Operators usually need to consider two important factors for capacity planning and resource optimization. From one side, sufficient hardware resources have to be deployed so as to meet customer’s QoS expectation. From the other side, an oversized system with scale could significantly waste hard-ware resources, increase IT costs and reduce profits. For

(2)

distributed systems, it is especially important to balance re-sources across distributed components so as to achieve max-imal system level capacity. Otherwise, mis-matched com-ponent capacities could lead to performance bottlenecks at some segments of the system while wasting resources at other segments. Therefore, a critical challenge is how to match the capacities of large number of components in a dis-tributed system. In practice, operators may take many trial and error procedures for their capacity planning work and assign resources based on their intuition, practical experi-ences or rules of thumb [1]. Therefore, it is difficult to sys-tematically and precisely analyze the capacity needs for in-dividual components in a distributed system with scale and complexity.

For standalone software, we usually use fixed numbers to specify hardware requirements such as CPU frequency and memory size. For example, Microsoft recommends the following minimum system requirements to run Microsoft Office 2003: a Pentium III processor with a clock speed of at least 233 MHz, a minimum of 128 MB RAM and at least 400 MB free hard-disk space [24]. However, it is difficult to give such specifications for online services be-cause their system requirements are mainly determined by an external factor—the volume of user loads. For various user loads, we need a model rather than a fixed number to analyze the capacity needs of each component. Such mod-els could enable us to answer those “what. . . if. . . ” questions in capacity planning. For example, what should be the size of the database server if the volume of web requests in-creases three times tomorrow? Though queuing and other models are widely applied in performance modeling [21], these models are often used to analyze a limited number of components under various assumptions. For example, a closed queuing network was used to model multi-tier Inter-net applications [31] but only CPU resources were consid-ered in the performance model. In practice, we have to con-sider many other resources as well such as memory, disk I/O and network. Therefore queuing models seem not to scale well in profiling large distributed systems.

In this paper, we propose a novel approach to profile ser-vices for capacity planning and resource optimization. Dur-ing operation, distributed systems generate large amounts of monitoring data to track their operational status. We collect resource consumption related monitoring data from various components across distributed systems. CPU usage, network traffic volume, and number of SQL queries are typical exam-ples of such monitoring data. While large volumes of user requests flow through various components, many resource consumption related measurements respond to the intensity of user loads accordingly. Here we introduce a new concept named flow intensity to measure the intensity with which in-ternal measurements react to the volume of user loads. Fur-ther we search for constant relationships between these flow

intensity measurements collected at various points across the system. If such relationships always hold under vari-ous workloads along time, we consider them as invariants of the underlying distributed system. In this paper, we pro-pose an algorithm to automatically extract such invariants from monitoring data. After extracting many invariants from a distributed system, given any volume of user loads, we can follow these invariant relationships sequentially to estimate the capacity needs of individual components. By compar-ing the current resource assignments against the estimated capacity needs, we can also discover and rank the weakest points that may deteriorate system performance. Later oper-ators can consult such analytical results to optimize resource assignments and remove potential performance bottlenecks. Several graph-based algorithms are proposed in this paper to support such capacity analysis. Our algorithms are tested with real distributed systems including a large production system and experimental results are also included to demon-strate the effectiveness of our approach.

2 System invariants and capacity planning

As discussed earlier, we need to collect monitoring data from operational systems to profile online services. In this paper, capacity planning is discussed in the context of tem testing or operational management stage but not sys-tem design stage. For example, we do not analyze how to optimize architecture design for improving system capacity. Instead, we assume that services are already deployed and functional on distributed systems so that we are able to col-lect monitoring data from operational systems. In fact, our approach is proposed to guide operators’s capacity planning tasks during system evolution. For example, how to upgrade system’s capacity during sales events and how to locate per-formance bottlenecks?

For online services, many of internal measurements re-spond to the intensity of user loads accordingly. For exam-ple, network traffic volume and CPU usage usually go up and down in accordance with the volume of user requests. This is especially true for many resource consumption re-lated measurements because they are mainly driven by the intensity of user loads. In this paper, a metric named flow intensity is used to measure the intensity with which such internal measurements react to the volume of user requests. For example, number of SQL queries and average CPU us-age (per sampling unit) are such flow intensity measure-ments.

We observe that there exist strong correlations between these flow intensity measurements. Along time, many re-source assumption related measurements have similar evolv-ing curves because they mainly respond to the same external factor—the volume of user requests. As an example, Fig.1

(3)

Fig. 1 Examples of flow

intensities

Fig. 2 An example of invariant

networks

shows the intensities of HTTP requests and SQL queries col-lected from a typical three-tier web system and their curves are very similar. As an engineered system, a distributed sys-tem imposes many constraints on the relationships among these internal measurements. Such constraints could result from many factors such as hardware capacity, application software logic, system architecture and functionality. For ex-ample, in a web system, if a specific HTTP request x always leads to two related SQL queries y, we should always have

I (y)= 2I (x) because this logic is written in its application

software. Note that here we use I (x) and I (y) to represent the flow intensities measured at the point x and y respec-tively. No matter how flow intensities I (x) and I (y) change in accordance with varying user loads, such relationships (the equation I (y)= 2I (x)) are always constant. In this pa-per, we model and search for the relationships among mea-surements collected at various points across distributed sys-tems. If the modeled relationships hold all the time, they are regarded as invariants of the underlying system. Note that the relationship I (y)= 2I (x) but not the measurements is considered as an invariant. Our previous work [13] verified that such invariant relationships widely exist in real distrib-uted systems, which are governed by the physical proper-ties or logic constraints of system components. For a typical three-tier web system including a web server, an applica-tion server and a database server, we collected 111 measure-ments and further extracted 975 of such invariants among them.

In this paper, we include an algorithm to automatically extract such invariants from the measurements collected at various points across distributed systems. These invariants characterize the constant relationships between various flow intensity measurements and further formulate a network of invariants. A simple example of such networks is illustrated

in Fig.2. In this network, each node represents a measure-ment while each edge represents an invariant relationship between the two associated measurements. After extracting invariants from a distributed system, we can use such an in-variant network to profile services for capacity planning and resource optimization. For online services, we can use trend analysis [5] to predict the future volume of user requests. The challenge here is how to estimate and upgrade the ca-pacity of various components inside the distributed system so as to serve the predicted volume of user requests. For ex-ample, based on the analysis of web server’s access log, the volume of HTTP requests is predicted to grow 150% in a month. Now we need to analyze whether the current capac-ities of internal components (such as the memory of appli-cation server and the disk I/O utilization of database server) are sufficient to support such a growth.

Since the validity of invariants is not affected by the change of user loads, we choose the volume of user requests as the starting node and sequentially follow the edges (i.e., the invariant relationships shown in Fig.2) to derive the ca-pacity needs of various internal components. In the above example, if the predicted number of HTTP requests is I (x1), we can use the invariant relationship I (y)= 2I (x) to con-clude that the number of SQL queries must be 2I (x1). Fur-ther we can consult this information for upgrading the re-lated database server. Note that here the capacity needs of components are quantitively represented by these resource consumption related measurements. For example, given a workload, a server may be required to have two 1 GHz CPUs, 4 GB memory and 100 MB/s network bandwidth etc. By comparing the current resource configurations against the estimated capacity needs, we can also discover the weak-est points that may become performance bottlenecks. There-fore, given any volume of new user loads, operators can use such a network of invariants to estimate the capacity needs of various components, balance resource assignments and remove potential performance bottlenecks.

In the following two sections, we introduce the models of invariants and our invariant search algorithm first before we

(4)

propose and discuss capacity planning algorithms. A simi-lar invariant search algorithm was proposed in our previous work [14] for fault detection and isolation. However in this paper, we have made the following new contributions: – We extended our work to extract invariants among

mul-tiple workload classes and also modified invariant search algorithm accordingly for capacity planning and resource optimization;

– We proposed a new algorithm to use invariant networks to predict the capacity needs of components under any new workloads;

– We also proposed a method to optimize and balance re-source assignments in distributed systems based on the estimated capacity needs;

– Finally we evaluated our approach with real systems in-cluding a commercial distributed system.

3 Correlation of flow intensities

For convenience, in the following sections, variables such as x and y are used to represent flow intensity measure-ments and we use equations such as y= f (x) to represent invariants. With flow intensities measured at various points across systems, we need to consider how to model the rela-tionships between these measurements, i.e., with measure-ments x and y, how to learn a function f so that we can have

y= f (x)? As mentioned earlier, many of such resource

con-sumption related measurements change in accordance with the volume of user requests. As time series, these measure-ments have similar evolving curves along the time t and have linear relationships. In this paper, we use AutoRegressive models with eXogenous inputs (ARX) [20] to learn linear relationships between measurements.

At time t , we denote the flow intensities measured at the input and output of a component by x(t) and y(t) respec-tively. The ARX model describes the following relationship between two flow intensities:

y(t )+ a1y(t− 1) + · · · + any(t− n)

= b0x(t− k) + · · · + bm−1x(t− k − m + 1) + b, (1) where[n, m, k] is the order of the model and it determines how many previous steps are affecting the current output.

ai and bj are the coefficient parameters that reflect how strongly a previous step is affecting the current output. Since there exist time delays in correlating measurements across distributed systems and various system components may also have unsynchronized time clocks, we consider the tem-poral dependency in the above ARX model. In fact, even with synchronized time clocks, different components may

log their data timestamps with various time delays. Lets de-note:

θ= [a1, . . . , an, b0, . . . , bm−1, b]T, (2) ϕ(t )= [−y(t − 1), . . . , −y(t − n), x(t − k), . . . ,

x(t− k − m + 1), 1]T. (3)

Then (1) can be rewritten as:

y(t )= ϕ(t)Tθ. (4)

Assuming that we have observed two measurements over a time interval 1≤ t ≤ N, lets denote this observation by:

ON= {x(1), y(1), . . . , x(N), y(N)}. (5) For a given θ , we can use the observed inputs x(t) to calcu-late the simucalcu-lated outputs ˆy(t|θ) according to (1). Thus we can compare the simulated outputs with the real observed outputs and further define the estimation error by:

EN(θ, ON)= 1 N N t=1 (y(t )− ˆy(t|θ))2 = 1 N N t=1 (y(t )− ϕ(t)Tθ )2. (6)

The Least Squares Method (LSM) can find the following ˆθ

that minimizes the estimation error EN(θ, ON):

ˆθN= _N t=1 ϕ(t )ϕ(t )T _{−1 N} t=1 ϕ(t )y(t ). (7)

There are several criteria to evaluate how well the learned model fits the real observation. In this paper, we use the fol-lowing equation to calculate a normalized fitness score for model validation: F (θ )= 1− Nt=1|y(t) − ˆy(t|θ)|2 N t=1|y(t) − ¯y|2 , (8)

where¯y is the mean of the real output y(t). Basically Equa-tion (8) introduces a metric to evaluate how well the learned model approximates the real data. A higher fitness score in-dicates that the model fits the observed data better and its upper bound is 1. Given the observation of two flow inten-sities, we can always use (7) to learn a model even if this model does not reflect their real relationship at all. Based on statistical theory, only a model with high fitness score is really meaningful in characterizing linear data relationship. We can set a range of the order[n, m, k] rather than a fixed number to learn a list of model candidates and then select the model with the highest fitness score. Other criteria such

(5)

as Minimum Description Length (MDL) [28] can also be used to select models. Note that we use the ARX model to learn the long-run relationship between two measurements, i.e., a model y= f (x) only captures the main characteris-tics of their relationship. The precise relationship between two measurements should be represented with y= f (x) + where is the modeling error. Note that is usually small for a model with high fitness score.

The ARX model shown in (1) can be easily extended to learn a relationship with multiple inputs and multiple out-puts. For example, the volume of HTTP requests x can usu-ally be split into multiple classes such as browsing, shop-ping and ordering. The different types of user requests may result in different amount of resource consumptions. Lets use xi(1≤ i ≤ N) to represent the volume of N request types respectively and then we have x=N_i₌₁xi. Now if the relationship y = f (x) is sensitive to the distribution change of request types, we can derive a new relationship

y = f (x1, . . . , xN). In this case, (1) can be replaced with the following equation with multiple inputs. Here we use {bi

0, . . . , bimi−1} to denote the coefficient parameters for the ithrequest type.

y(t )+ a1y(t− 1) + · · · + any(t− n) = b1 0x1(t− k1)+ · · · + bm11−1x1(t− k1− m1+ 1) + · · · + bN 0xN(t− kN)+ · · · + bN mN−1xN(t− kN− mN+ 1) + b = N i=1 _m_i−1 j=0 (bi_jxi(t− ki− j)) + b. (9)

Now we can define new θ and ϕ(t) with the following equations: θ= [a1, . . . , an, b10, . . . , bm11−1, . . . , b₀N, . . . , bN_m_N−1, b]T, (10) ϕ(t )= [−y(t − 1), . . . , −y(t − n), x1(t− k1), . . . , x1(t− k1− m1+ 1), . . . , xN(t− kN), . . . , xN(t− kN− mN+ 1), 1]T. (11) It is straightforward to see that the same (7) and (8) can be used to estimate the parameter θ and calculate its fit-ness score respectively. In practice, we can select ki= k and mi= m (1 ≤ i ≤ N) (k and m are fixed values) to reduce pa-rameter search spaces. Especially if we do not consider any time delay in (9), we have y=N_i₌₁bi₀xi. In this case, bi₀ represents the resource consumption unit for each request of type i. The same equations also work here to estimate the best-fit parameters bi₀(1≤ i ≤ N). With this extension to model invariants among multiple classes of workloads,

our capacity planning approach works even if the distribu-tion of workload classes is not stadistribu-tionary [29]. Without loss of generality, in the following sections, we will use (1) to introduce the concept of our algorithms. Later in this paper we will extend our algorithms to support capacity planning under multiple classes of workloads.

4 Extracting invariants

Given two measurements, we analyzed how to automatically learn a model in the above section. In practice, we may col-lect many resource consumption related measurements from a complex system but obviously not any pairs of them would have such linear relationships. Due to system uncertainties and user behavior changes, some learned models may not be robust along time. The challenging question is how to extract invariants from large number of measurements. In practice, we may manually build some relationships based on prior system knowledge. However, this knowledge is usu-ally very limited and system dependent. In this section, we introduce an automated algorithm to extract and validate in-variants from monitoring data.

Note that for capacity planning purpose, we only need to search for invariants among resource consumption re-lated measurements. Assume that we have m of such mea-surements denoted by Ii, 1≤ i ≤ m. Since we have lit-tle knowledge about their relationships in a specific sys-tem, we try any combination of two measurements to con-struct a model first and then continue to validate whether this model fits with new observations, i.e., we use brute-force search to construct all hypotheses of invariants first and then sequentially test the validity of these hypotheses in operation. Note that we always have sufficient monitoring data from an 24∗365 operational system to validate these hypotheses along time. The fitness score Fk(θ ) given by (8) is used to evaluate how well a learned model matches the data observed during the kth time window. We denote the length of this window by l, i.e., each window includes

l sampling points of measurements. As discussed earlier, given two measurements, we can always use (7) to learn a model. However, models with low fitness scores do not char-acterize the real data relationships well so that we choose a threshold F to filter out those models in sequential test-ings. Denote the set of valid models at time t= k · l by Mk (i.e., after k time windows). During the sequential testings, once if Fk(θ )≤ F, we stop testing this model and remove it from Mk.

After receiving monitoring data for k of such windows, i.e., total k· l sampling points, we can calculate a confidence score with the following equation:

pk(θ )= k i=1Fi(θ ) k = pk−1(θ )· (k − 1) + Fk(θ ) k . (12)

(6)

Algorithm 4.1

Input: Ii(t ), 1≤ i ≤ m

Output: Mkand pk(θ )for each time window k Part I: Model Construction

at time t= l (i.e. k = 1), set M1to an empty set for each Iiand Ij, 1≤ i, j ≤ m, i = j

learn a model θijusing (7); compute F1(θij)with (8); if F1(θij) > F,

then set M1= M1∪ {θij}, p1(θij)= F1(θij). Part II: Sequential Validation

for each time t= k · l(k > 1), set Mk= Mk−1;

for each θij∈ Mk,

compute Fk(θij)with (8) using Ii(t ) and Ij(t ), (k− 1) · l + 1 ≤ t ≤ k · l; if Fk(θij)≤ F,

then remove θijfrom Mk; otherwise update pk(θij)with (12). output Mkand pk(θ ).

k= k + 1.

Fig. 3 Invariants extraction algorithm

In fact, pk(θ ) is the average fitness score for k time win-dows. Since the set Mk only includes valid models and Fi(θ ) > F (1≤ i ≤ k), we always have F < pk(θ )≤ 1. The invariant extraction algorithm is shown in Fig.3, which in-cludes two parts: Part I for model construction and Part II for sequential validation.

The invariants extracted with Algorithm4.1 should es-sentially be considered as likely invariants. As mentioned earlier, a model can be regarded as an invariant of the under-lying system only if this model holds all the time. However, even if the validity of a model has been sequentially tested for a long time, we still can not guarantee that this model will always hold. Therefore, it is more accurate to consider these valid models as likely invariants. Based on historical monitoring data, in fact each confidence score pk(θ ) mea-sures the robustness of an invariant. Note that given two measurements, logically we do not know which one should be chosen as the input or output (i.e., x or y in (1)) in com-plex systems. Therefore, we always construct two models with reverse input and output. If two learned models have very different fitness scores, we must have constructed an AutoRegressive (AR) model rather than an ARX model. Since we are only interested in strong correlation between two measurements, we filter out those AR models by re-questing the fitness scores of both models to overpass the threshold. Therefore, an invariant relationship between two measurements is bi-directional in this paper.

Our previous work verified that invariants widely exist in distributed systems and Fig. 4 shows an invariant

net-Fig. 4 An invariant network extracted from a three-tier web system

work extracted from a typical three-tier web system [13]. In this figure, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements. Since we do not need any it-erations to calculate θ in (7), the computational complexity of Algorithms4.1is usually acceptable even under O(m2)

brute-force searches. For example, it takes a common lap-top around 20 minutes to extract the invariants from 100 flow intensity measurements (i.e. m= 100) and each mea-surement (as a time series) includes 1000 data points. Note that Part II runs much faster than Part I in Algorithm4.1. It only takes 2 seconds to validate nearly 1000 invariants as shown in Fig.4. For large systems, we have proposed effi-cient and scalable algorithms to extract invariants by com-promising search accuracy [15]. Due to limited space, we will not discuss the scalability issue here. Note that the in-variant extraction time is negligible compared to the time taken to manually build models in capacity analysis. As dis-cussed in Sect.3, if some flow intensity metrics include mul-tiple classes of measurements, we can use mulmul-tiple variables instead of one in model construction and Algorithm 4.1

still works without any change. For example, we can set

Ii= {I_i1, . . . , I_iN} in Algorithm4.1. In this case, Ii is a vec-tor rather than a scalar.

5 Estimation of capacity needs

In the above section, Algorithm 4.1 automatically ex-tracts all possible invariants among the measurements Ii, 1≤ i ≤ m. Further these measurements and invariants for-mulate a relation network that can be used as a model to systematically profile services. Under low volume of user

(7)

Fig. 5 Capacity planning using invariant networks

requests, we extract a network of invariants from a system when its quality of services meets client’s expectation, i.e., we only profile a system when it is at a good state. As-sume that we have collected ten resource consumption re-lated measurements (i.e., m= 10) from a system and extract an invariant network as shown in Fig.5. For simplicity, here we use this network as an example to illustrate our capacity planning algorithms. In this graph, each node with number i represents the flow intensity measurement Ii. As discussed earlier, Iicould also be a vector including multiple sub-class measurements.

Since we use a threshold F to filter out those models with low fitness scores, not any pair of measurements would have invariant relationships. Therefore, in Fig.5, we observe two disconnected subnetworks and even some isolated nodes such as node 1. An isolated node implies that this mea-surement does not have any linear relationship with other measurements. All edges are bi-directional because we al-ways construct two models (with reverse input and output) between two measurements. Now lets consider a triangle re-lationship among three measurements such as{I10, I3, I4}. Assume that we have I3= f (I10)and I4= g(I3), where

f and g are both linear functions as shown in (1). Based on the triangle relationship, theoretically we can conclude that I4= g(I3)= g(f (I10)). According to linear property of functions f and g, the function g(f (.)) should be linear too, which implies that there should exist an invariant re-lationship between the measurements I10and I4. However, since we use a threshold to filter out those models with low fitness scores, such a linear relationship may not be robust enough to be considered as an invariant by Algorithm4.1. This explains why there is no edge between I10and I4.

As discussed in Sect.2, invariants characterize constant long-run relationships between measurements and their va-lidity is not affected by the dynamics of user loads. While each invariant models some local relationship between its associated measurements, the network of invariants could well capture many invariant constraints underlying the whole distributed system. Rather than using one or several analytical models to profile services, here we combine a large number of invariant models together into a network to analyze capacity needs and optimize resource assignments. At time t (e.g., in a month or during a sales event), assume that the maximal volume of user requests is predicted to

be x. Without loss of generality, in Fig.5, we assume that the measurement I10represents the volume of user requests, i.e., I10= x. Now the challenging question is how to up-grade the capacities of other nodes so as to serve this volume of user requests.

Starting from the node I10= x, we sequentially follow edges to estimate the capacity needs of other nodes in the in-variant network. According to Fig.5, we can reach the nodes {I3, I5, I7} with one hop. Given I10= x, now the question is how to follow invariants to estimate these measurements. As discussed in Sect.3, we use the model shown in (1) to search for invariant relationships between measurements so that all invariants can be considered as instances of this model tem-plate. According to linear property of models, the capacity needs of system components increase monotonically as the volume of user loads increases. Therefore, we use the max-imal value of user loads to estimate the capacity needs of internal components. Here we use x to denote the maximal value of I10. In (1), if we set the inputs x(t)= x at all time steps, we expect the output y(t) to converge to a constant value y(t)= y, where y can be derived from the following equations: y+ a1y+ · · · + any= b0x+ · · · + bm−1x+ b, y= m−1 i=0 bix+ b 1+n_j₌₁aj . (13)

For convenience, we use f (θij)to represent the propaga-tion funcpropaga-tion from Ii to Ij, i.e., f (θij)=

m₋₁ k=0bkIi+b

1+n_k₌₁ak where

all coefficient parameters are from the vector θij, as shown in (2). Therefore, given a value of the input measurement, we can use (13) to estimate the value of the output measure-ment. For example, given I10= x, we can use invariants to derive the values of I3,I5 and I7 respectively. Since these measurements are the inputs of other invariants, in the same way, we can further propagate their values to other nodes in the network such as the nodes I4and I6.

As shown in Fig.5, some nodes such as I4and I7can be reached from the starting node I10 via multiple paths. Be-tween the same two nodes, multiple paths may include dif-ferent number of edges and each invariant (edge) also may have different quality in modeling two nodes’ relationship. Therefore, the capacity needs of a node will be estimated via different paths with different accuracy. For each node, the question is how to locate the best path for propagating the volume of user loads from the starting node. At first, we will choose the shortest path (i.e., with minimal number of hops) to propagate this value. As discussed in Sect.3, each invari-ant includes some modeling error when it characterizes the relationship between two measurements. These model-ing errors could accumulate along a path and a longer path usually results in a larger estimation error. In Sect. 4, we

(8)

introduce a confidence score pk(θ )to measure the robust-ness of invariants. According to the definition of confidence score, an invariant with higher fitness score will lead to bet-ter accuracy in capacity estimation. For simplicity, here we use pij to represent the latest pk(θ )between the measure-ments Ii and Ij. If there is no invariant between Ii and Ij, we set pij= 0. Given a specific path s, we can derive an ac-cumulated score qs=

pij to evaluate the accuracy of this whole path. Therefore, for multiple paths with same number of edges, we choose the path with the highest score qs to estimate capacity needs.

In Fig.5, we also observe that some nodes are not reach-able from the starting node. However, these measurements may still have linear relationships with a set of other nodes because they may have a similar but nonlinear or stochastic way to respond to user loads. Now the question is how to es-timate the capacity needs of these unreachable nodes. Note that it is extremely difficult to model and estimate the capac-ity needs of all components at fine granularcapac-ity. As discussed earlier, many analytical models like queuing models hardly can scale to model thousands of resource metrics in a dis-tributed system, especially if these metrics have complicated dependencies. In this paper, we extract invariant networks to automatically profile as many relationships as possible and then have to manually build complicated models for the remaining part. In performance modeling, many models have been developed to characterize individual components. For example, Menasce et al. utilized several known laws to quantify performance models of various components, which include utilization law, service demand law and the forced flow law etc. [21]. Following these laws and classic theory, we can manually build complicated nonlinear or stochastic models to link those unreachable nodes. In some cases, we can also use bound analysis to derive rough relationship be-tween measurements.

Here we give some examples on how to build models to connect those unreachable nodes. Following the utiliza-tion law [21], a disk I/O utilization U can be expressed as

U = λ · S, where λ is the intensity of disk I/Os (e.g. the

number of disk I/Os per second) and S is the average ser-vice time. Further S includes the disk controller time as well as the time taken to seek a data block from the disk. Under various workloads (random vs. sequential workloads, low vs. heavy workloads), S could be quite different. Therefore, while λ (as a flow intensity node) can be reachable from the starting node, there is no linear relationship between U and λ. However, many literatures [23] on disk performance conclude that a system’s performance becomes sluggish if

S≥ 0.02 s. Therefore, we can use bound analysis to estimate

the capacity needs of disk I/Os with U > 0.02λ. We can fur-ther propagate the value of λ to the unreachable node U by manually modeling their relationship with domain knowl-edge.

Recently researchers at Google.com [9] investigated the power provisioning problem for a warehouse-sized system with 15 thousand servers. At the server level, they discov-ered the following nonlinear relationship between power consumption P and CPU utilization U :

P = Pidle+ (Ppeak− Pidle)(2U− U1.4), (14) where Pidleand Ppeakare the power consumptions at server idle and peak hours respectively. Since various servers be-hind an online service run different service logics and func-tions (e.g. database function or web server function), their CPU utilization could be very different even under the same volume of incoming workloads. Given a new volume of workload x, we can follow the invariant network to estimate the CPU utilization Ui for each server. According to (14), we can then estimate the power consumption Pi of each server and further sum up Pi to estimate the whole power supply needs under the new volume of workloads. Accord-ing to their work, other power consumptions from network-ing and coolnetwork-ing systems etc. are proportional to the power consumptions of all servers. Therefore, we can also use in-variant networks to support the capacity planning of power supplies in a data center.

Since invariant networks have automatically modeled the relationships among many resource consumption related metrics, it becomes easier to manually model the remain part of distributed systems. Therefore, our approach and other performance modeling methods are essentially complimen-tary to each other. By introducing other models as shown in the above examples, we can continue to propagate the volume of user loads to those isolated nodes. For example, in Fig.5, if we can manually bridge any two nodes from the two disconnected subnetworks, we will be able to prop-agate the volume of user loads several hops further. Even in this case, our invariant network is still very useful be-cause it can guide us on where to manually bridge two dis-connected subnetworks. In fact, it is usually much easier to build models among the measurements from the same com-ponent because their dependencies are much more straight-forward in this local context. As shown in the above exam-ple, it is much easier to build a model between U and λ from the same disk than a model between U (of the back-end database server) and the volume of HTTP requests x (of the front web server) directly. This is because we have known domain knowledge on disks to support modeling. Es-sentially it is the invariant network that enables us to propa-gate the value of x into various internal components in large distributed systems. Therefore, rather than building models for measurements across many segments of distributed sys-tems, we can manually build local models from the same segment to link disconnected subnetworks. Conversely if we develop a model of each resource metric as a direct function of exogenous workload, we will not be able to observe the

(9)

Algorithm 5.1 Input: M, P and x.

Output: Ii(1≤ i ≤ N) and R. at step k= 0,

set V0= S0= {I1}, q1= 1, I1= x and other Ii= 0. do

k= k + 1;

set Sk= φ;

for each Ii∈ Sk−1and Ij∈ U − Vk−1, if pij= 0,

then Sk= Sk∪ {Ij}; for each Ij∈ Sk,

Il= arg maxIi∈Sk₋₁(qi· pij); compute Ij= f (θlj); qj= plj· ql;

Vk= Sk∪ Vk−1; while Sk= φ

R= Vk;

output R and all Ii.

Fig. 6 Capacity needs estimation algorithm

relationship between disconnected subnetworks and have to manually model each disconnected metric once. In addition, it is very difficult to model such a relationship across mul-tiple system segments because their dependencies could be very complicated and difficult for us to understand. In this paper, we just consider such complicated models as another class of invariants constructed from domain knowledge and do not distinguish them in our analysis and algorithms.

Now we summarize the above discussion and propose the following algorithm shown in Fig.6for estimating capacity needs. For convenience, we define the following variables in our algorithms:

– Ii: the individual measurements, 1≤ i ≤ N. – U : the set of all measurements, i.e., U= {Ii}.

– M: the set of all invariants, i.e., M= {θij} where θijis the invariant model between the measurements Ii and Ij. – pij: the confidence score of the model θij. Note that we

set pij= 0 if there is no invariant (edge) between the mea-surements Ii and Ij.

– P : the set of all confidence scores, i.e., P= {pij}. – x: the predicted maximal volume of user loads.

– I1: the starting node in the invariant network, i.e., I1= x. – Sk: the set of nodes that are only reachable at the kthhop

from I1but not at earlier hops.

– Vk: the set of all nodes that have been visited up to the kth hop.

– R: the set of all nodes that are reachable from Ii. – φ: the empty set.

– f (θij): the propagation function from Ii to Ij. For linear invariants, f (θij)=

m−1

k=0bkIi+b

1+n_k₌₁ak . For those nonlinear or

stochastic models, it may include a variety of equations. – qs: the maximal accumulated confidence score of the

paths from the starting node I1to Is.

As discussed in Sect.4, Algorithm4.1automatically ex-tracts robust invariants after long sequential testing phases. In this section, Algorithm5.1follows the extracted invariant network specified by M and P to estimate capacity needs. Since we always choose the shortest path to propagate from the starting node to other nodes, at each step Algorithm5.1

only searches those unvisited nodes for further propagation and all those nodes visited before this step must already have their shortest paths to the starting node. Meantime, Algo-rithm5.1only uses those newly visited nodes at each step to search their next hop because only these newly visited nodes may link to some unvisited nodes. For those nodes with multiple same-length paths to the starting node, we choose the best path with the highest accumulated confi-dence score for estimating the capacity needs. Essentially Algorithm5.1is an efficient graph algorithm based on dy-namic programming [7]. We incrementally estimate the ca-pacity needs of those newly visited nodes and compute their accumulated confidence scores at each step until no more nodes are reachable from the starting node. Our algorithm can be easily extended to support those models with multi-ple inputs. Before estimating the capacity of a new node, we just need to check whether all of its input nodes have already been visited. If not, this node is considered as an unvisited node until all of its input nodes are ready.

6 Resource optimization

In the above section, Algorithm5.1sequentially estimates the resource usages of components under a given volume of user loads. Assume that we have collected the informa-tion about current resource configurainforma-tions when the system was deployed or upgraded. In practice, such configuration information is usually maintained and updated by a Config-uration Management DataBase (CMDB). For each measure-ment Ii, we denote the capacity of its related resource con-figuration by Ci. For example, if I (i) is the Megabyte of real memory usage, Ci will be the total memory size of a server. Similarly if I (i) is the number of concurrent SQL connec-tions, the maximum C(i) for MySQL database is 4096. In performance analysis, there are also some benchmarks that could provide specifications for Ci. Note that this configura-tion informaconfigura-tion includes hardware specificaconfigura-tions like mem-ory size as well as software configurations like the maximal number of database connections. In practice, software con-figurations such as the maximum number of file descriptors

(10)

Fig. 7 Capacity analysis and resource optimization

and the maximum number of concurrent threads could also affect system performance. In this paper, we also consider them as system resources. Given a volume of user loads x, we use Algorithm5.1to estimate the values of Ii. By com-paring Iiagainst Ci, we can further get information to locate potential performance bottlenecks and balance resource as-signments.

Lets denote Oi = Ci−Ii_C_i , where Oi represents the per-centage of resource shortage or available margin. Given an estimated volume of user loads, all those components with negative Oi are short of capacities under the new workload and we should assign more resources to these components to remove performance bottlenecks. Conversely, for those components with large positive Oi, they must have over-sized capacities to serve such a new volume of user loads and we may remove some resources from these components to reduce IT cost. Note that we always need to keep right capacity margin for each component. Therefore, as shown in Fig.7, we can use such analytical results as a guideline to adjust each component’s capacity and build a resource balanced distributed system. The values of Oi can also be sorted and ranked to list the priority of resource assignments and optimization.

Note that we propagate the maximal volume of user loads

x through the invariant network to estimate capacity needs. All Ii resulting from Algorithm5.1represent the capacity needs of internal components that can serve this maximal volume of user loads. Given a step input x(t)= x, we derive its stable output y(t)= y using (13). However, we did not consider the transient response of y(t) before it converges to the stable value y. As shown in Fig.8, theoretically y(t) may respond with overshoot and its transient value may be larger than the stable value y. The overshoot is generated be-cause a system component does not respond quickly enough to the sudden change of user loads. For example, with a sud-den increase of user loads in a three-tier web system, the ap-plication server may take some time to initialize more EJB instances and create more database connections so as to han-dle the increasing workloads. During this overshoot period, we may observe longer latency of user requests.

Fig. 8 System response with

overshoot

However, computing systems usually respond to the dy-namics of user loads quickly. Therefore, even if the over-shoot exists, it must only last very short time. In fact, in our experiments, we do not observe any overshoot responses at all though theoretically it may exist. Now if we want a sys-tem to have enough capacity to handle such overshoots, we can calculate the volume of overshoot and propagate this as the maximal value (rather than the stable y) in capac-ity planning. In practice, the capacities of many systems are only expected to support high QoS in 95% of its operational time. Since it may take huge amount of extra resources to guarantee QoS in some rare events, some service providers are willing to compromise QoS for short period of time so as to reduce IT costs. However, for applications with stringent QoS requirements such as real-time video/voice communi-cation, we may have to consider the overshoot situations.

For low order ARX models with n, m≤ 2, much liter-ature in classic control theory [18] introduces how to ana-lytically calculate the overshoot. Basically we can apply Z-transform [4] of (1) to analytically derive the transient re-sponse of y(t) and then calculate the overshoot value by locating the maximal point of y(t). For high order ARX models, given an input x(t)= x, we have to use (1) iter-atively to simulate the transient response of y(t) and then estimate the overshoot value. At each step of Algorithm5.1, rather than using the function f (θlj)to estimate a stable Ij, we can use the simulation results from (1) to estimate tran-sient Ijand further propagate the maximal value to estimate the capacity needs of other nodes. All other parts of Algo-rithm 5.1remain the same. Therefore, with ARX models, we are also able to analyze the transient behavior of various components.

In our earlier discussion, the volume of user loads is cho-sen as the starting point to propagate the capacity estimation. Therefore following the invariant networks, we can only estimate the capacity needs of those resource metrics that are reachable from this starting point. However, if we just want to check whether various components have matched resource assignments, any nodes in the invariant networks can be chosen as a starting point to estimate the capacity needs of others and to determine whether they have consis-tent “sizes”. For example, we may just follow a single in-variant to check whether component A and B have matched resource assignments. In this case, we do not need a fully connected invariant network and even a disconnected sub-network can enable us to evaluate whether all of its nodes

(11)

Fig. 9 The experimental system

have balanced resource assignments. If we do not extract in-variant network but directly correlate resource metrics with exogenous workloads, we can not support such resource op-timization procedures.

7 Experiments

Our capacity planning approach is evaluated with a three-tier web system as well as a commercial mobile internet system. In both cases, we extract invariants from resource consumption related measurements and then follow the in-variant network to estimate the capacity needs under heavy workloads. Later we measure the real resource usages of components under such heavy workloads and compare them with the estimated values to verify our prediction accuracy.

7.1 Three-tier Web systems

In this section, our capacity planning experiments were per-formed in a typical three-tier web system which includes an Apache web server, a JBoss application server [12] and a MySQL database server. Figure9illustrates the architec-ture of our experimental system and its components. The ap-plication software running on this system is Pet store [27], which was written by Sun Microsystems.

Just like other web services, here users can visit the Pet store website to buy a variety of pets. We developed a client emulator to generate a large class of user scenarios and workload patterns. Various user actions such as browsing items, searching items, account login, adding an item to a shopping-cart, payment and checkout are all included in our workloads. Certain randomness of user behaviors is also considered in the emulated workloads. For example, a user action is randomly selected from all possible user scenarios that could follow the previous user action. The time interval between two user actions is also randomly selected from a reasonable time range. Note that workloads are dynamically generated with much randomness and variance so that we never get a similar workload twice in our experiments.

In our experiments, at first we run a low volume of user loads to collect measurements from various components and further use these measurements to extract invariants. As dis-cussed in Sect. 5, given a predicted high volume of user loads, we can use Algorithm 5.1to estimate the capacity

Fig. 10 Low and high volume of user loads in experiments

Fig. 11 Categories of measurements

needs of various components. Meantime, we also run this predicted volume of user loads to collect measurements and these real measurements are then compared with the esti-mated values to verify the capacity estimation accuracy of our algorithm. Figure10shows examples of both the low volume of user loads and the high volume of user loads used in our experiments, which have very different intensity and dynamics. We have repeated such experiments many times with various workloads to verify our results. Note that we do not have to repeat the same workload in our evaluations. Measurements were collected from the three servers used in our testbed system. Figure 11 shows the categories of our resource consumption related measurements. Totally we have eight categories and each category includes different number of measurements. In our previous work [13], we collected as many as 111 measurements and extracted 975 invariants from our testbed system. Figure4 illustrates its invariant network and it is too meshed for us to observe its detailed connectivity. For capacity planning, we only chose those resource consumption related metrics and extracted a small invariant network so as to analyze its connectivity in details. Meantime, in a typical three-tier web system, the ap-plication server runs bulk of business logic and usually

(12)

de-Fig. 12 Measurements and

invariants

mands much more capacities than the web server and the database server. Therefore, we collected many of our mea-surements from the application server for our capacity plan-ning experiments. Note that the web server has two network interfaces eth0 and eth1, which communicate with clients and the application server respectively.

This monitoring data was used to calculate various flow intensities with sampling unit equal to 6 seconds. We have total 20 resource consumption related measurements from three servers and all of these measurements were used in the following experiments, i.e., n= 20. In our experiments, we collected one hour data to construct models and then con-tinued to test these models for every half an hour, i.e., the window size is half an hour and includes 300 data points. Studies have shown that users are often dissatisfied if the latency of their web requests is longer than 10 seconds. Therefore the order of the ARX model (i.e., [n, m] shown in (1)) should have very narrow range. In our experiments, since the sampling time was selected to be 6 seconds, we set 0≤ n, m ≤ 2.

By combining every two measurements from the total 20 measurements, Algorithm4.1totally built 190 models (i.e.,

n(n− 1)/2 = 190) as invariant candidates. For each model,

a fitness score was calculated according to (8). In our exper-iments, we selected the threshold of fitness score F = 0.3.

A model is considered as an invariant only if its fitness score is always higher than 0.3 in each testing phases. After sev-eral phases of sequential testings with various workloads, eventually we extract 68 invariants that are distributed as shown in Fig.12. In the following equations, we list several examples of such extracted invariants. If we use Iejb, Iweb and Isql to represent the flow intensities of ‘number of EJB created’,‘number of HTTP requests’ and ‘number of SQL queries’ measured from the test bed system, we extract the

following invariants with Iwebas the input:

Iejb(t )= −0.44Iejb(t− 1) + 0.18Iejb(t− 2)

+ 1.01Iweb(t )+ 1.40Iweb(t− 1), (15)

Isql(t )= 0.16Isql(t− 1) + 0.37Isql(t− 2)

+ 3.01Iweb(t )− 0.84Iweb(t− 1). (16) In Fig. 12, each node represents a measurement while each edge represents an invariant relationship between the two associated measurements. Therefore, this invariant net-work totally includes 20 nodes and 68 edges. All measure-ments with initial tag ‘A_ ’, ‘W_’ or ‘D_’ are collected from the application server, the web server or the database server respectively. From this figure, we notice that the 20 measure-ments can be split into many clusters. The biggest cluster in this figure includes 13 measurements. These measurements respond to the volume of user loads directly so that they have many linear relationships among each other. Meantime, we also observe several smaller clusters, which characterize some local relationships between measurements. As men-tioned earlier, some measurements may respond to the vol-ume of user loads in same but nonlinear or stochastic ways. For example, ‘Disk Write Merge’ has an invariant linear re-lationship with ‘Disk Write Sectors’ though they both do not have linear relationships with the volume of user loads. The measurement ‘W_CPU Utilization’ is very noisy and it does not have robust relationships with any other measure-ments. Therefore it is an isolated node in the invariant net-work. Later we discovered that the CPU usage of the web server in our testbed was extremely low (only close to 1%) so that the value of ‘W_CPU Utilization’ was too small and easily affected by other system processes, i.e., we barely ob-serve any intensities from ‘W_CPU Utilization’ because the

(13)

Fig. 13 Estimated and real values of measurements

web server only presents static html files but runs on a pow-erful machine.

Now we choose the volume of HTTP requests ‘W_Apache’ as the starting node in our capacity analysis. In our experiments, the predicted volume of HTTP requests (high user loads) is shown in Fig.10and its maximal value is 840 requests/second, i.e., x= 840. Following the invari-ant network shown in Fig. 12, we use Algorithm 5.1 to estimate other measurements. For example, with one hop from ‘W_Apache’, we can estimate the values of the fol-lowing 10 measurements: ‘W_eth1 Packets’, ‘W_eth0 Pack-ets’, ‘W_CPU Soft IRQ Time’, ‘A_JBoss EJB Created’, ‘A_eth0 Packets’, ‘A_JVM Processing Time’, ‘D_MySQL’, ‘D_ CPU Soft IRQ Time’, ‘D_CPU Utilization’ and ‘D_eth0 Packets’. With another hop, we can also estimate the values of ‘A_CPU Utilization’ and ‘A_CPU Soft IRQ Time’. These values resulting from Algorithm5.1are then compared with the real maximal values collected from our experiments and their differences are shown in Fig.13. Here we use e and r to denote the estimated values and the real monitoring values respectively. From this figure, we observe that our approach achieves very good accuracy and averagely it only results in 5% error in estimating capacity needs. We have repeated our experiments many times with different user loads and observed similar estimation accuracy in every of our exper-iments.

In Fig.13, the value of each measurement has its own specific metric. For example, the value of ‘A_JVM Process-ing Time’ refers to the CPU time (nanoseconds) used by JVM (among 6 seconds) and the value of ‘A_CPU Uti-lization’ refers to the percentage of user CPU utiliza-tion. As illustrated in (13), given Iweb = 840, we can use the coefficient parameters from (15) to derive that

Iejb = ₁_+0.44−0.181.01+1.40 Iweb = 1607. In the same way, we can use the coefficient parameters from (16) to derive that

Isql =₁_{−0.16−0.37}3.01−0.84 Iweb= 3878. As examples, Figs.14 and

Fig. 14 The number of created EJBs

Fig. 15 The number of SQL queries

15illustrate the real numbers of EJBs created in the appli-cation server and the real number of SQL queries collected from the database server respectively. Both measurements result from the user loads shown in Fig.10so that they both have similar curves with that of the user loads. In these two figures, the noisy curves represent the real values while the dashed lines represent the estimated maximal values. The estimated values are very close to the real maximal values monitored from the system though some rare peaks may have larger values.

As discussed earlier, invariants are used to characterize long-run relationships between measurements and they do not capture some rare system uncertainties and noises. How-ever, since those peaks are rare and we always add right margin on the estimated capacity needs, our estimated val-ues should be accurate enough for capacity planning [21]. In Fig.12, we also observe some nodes that are unreach-able from the starting node ‘W_Apache’. As discussed in Sect. 5, with domain knowledge, we have to manually

(14)

build some models between these isolated measurements and those listed in Fig.13 so as to propagate the volume of user loads. In our future work, we will build a library to include nonlinear models of common components such as disks to complement our invariant network in capacity plan-ning. Based on various scenarios, operators can pull nonlin-ear models from the library to link disconnected invariant networks.

7.2 Mobile Internet systems

In this subsection, our approach was also evaluated with a commercial mobile internet system with multi-class work-loads. The mobile internet system provides mobile users with services such as web access, e-mail, news delivery and location information. It provides direct access to the Internet for subscribers using mobile terminals. This system consists of dozens of high end servers including portal web servers, mail servers, picture mail servers and account authentication servers. Web portal access and mail access are the two ma-jor classes of user traffics. Due to business confidentiality, we are not able to illustrate its specific architecture here and will only report field testing results.

Most of these servers run on Unix operating systems and many measurements are collected by using Unix “sar” com-mand. Network measurements are collected by SNMP mon-itoring agents. CPU, memory and network usages are the three types of resource metrics used in our evaluations. We collected these measurements from every servers once per 10 seconds for a week. During this period, we observed that the ratio between mail access traffic and web access traffic could vary roughly from 0.11 to 0.29. Following the same evaluation approach used in the above section, we extracted invariants from a two-hour period of low workloads and then used the invariant network to estimate the capacity needs of components under 5 periods of heavy workloads during the week. The real resource usages observed from log files were then compared with these estimated values to verify estima-tion accuracy. Our approach averagely results in 6.8% error in estimating capacity needs and the biggest estimation er-ror is 9.5%. Based on operators’ feedbacks, they are quite satisfied with such accuracies in capacity planning tasks.

8 Discussions

Large-scale distributed systems consist of thousands of com-ponents. Each component could potentially become the per-formance bottleneck and deteriorate system QoS. As dis-cussed earlier, it is extremely difficult to model and analyze the capacity needs of each individual components at fine granularity in a large system. Many classic approaches can not scale to profile the behaviors of large number of com-ponents precisely. For example, it is not clear how to model

the relationships among hundreds of metrics with queuing models and it is also very time-consuming to manually build such models. Meantime, if we only model the system be-havior at coarse granularity, some system metrics will not be considered in the model so that we may not be able to predict real performance bottlenecks and optimize resource assignments.

In this paper, we extract invariant networks to profile the relationships among resource consumption related metrics across distributed systems. Our motivation is that we should develop algorithms to automatically profile as many rela-tionships as possible among system metrics. Though such relationships may be simple, the invariant networks essen-tially enable us to cover large number of system metrics and scale up our capacity analysis. For the remaining compli-cated models, we do not have a choice so that we have to manually build them with system knowledge. Therefore, our approach is complimentary to many existing modeling tech-niques and could greatly reduce system modeling complex-ity. To the best of our knowledge, this paper proposes the first solution to analyze system capacity and optimize re-source assignments across large systems at such a fine gran-ularity. In fact, most of existing works just model hardware resources such as CPU or memory in capacity analysis and do not consider system metrics like the number of concur-rent database connections, which could also affect system performance.

Our approach has several limitations as well. As dis-cussed earlier, there exist some disconnected subnetworks that are not reachable from the starting point—the volume of workloads. In order to estimate the capacity needs of these isolated nodes, we have to manually build models with do-main knowledge to bridge disconnected subnetworks. How-ever, since specific system knowledge (e.g. disks) is needed here, we do not expect that any other methods will be able to automatically profile those complicated relationships. Com-pared to other modeling techniques, our invariant networks automatically extract the maximal coverage of system met-rics whose capacity needs can be estimated from the volume of workloads. Therefore, we are enabled to map the volume of exogenous workloads into the intensities of internal sys-tem metrics and decide where to build models in their local context. This greatly reduces modeling complexity when we applied our approach on several commercial systems. Mean-time, even with a disconnected subnetwork, we can choose one node as a starting point to verify whether all system metrics inside this subnetwork have balanced resource con-figurations. This is also proved to be useful in operational environments.

Another limitation is that our approach can not automat-ically model the relationship between response time and system configurations. For example, operators often raise the following questions in capacity planning: how much re-sponse time can be improved if a specific part of resource

(15)

is upgraded? how many users can the system support if its response time is allowed to increase 10%. For a distributed system with scale and complexity, so far we are not aware of any solutions that can address such problems. We pro-file systems and extract invariant networks at a good state when clients are satisfied with system QoS. It seems to be difficult to reason about other scenarios beyond this good state. We do not know which parts of system models are still valid and which part are not under those new scenarios. Recently many researchers employed a closed queuing net-work to model multi-tier Internet applications and then used mean value analysis to calculate system response time [31]. However, performance bottlenecks may shift under the new scenarios so that the original model of response time might become invalid. In fact, the original performance model may even not include the metric of new bottleneck. For exam-ple, if a performance model profiles the relationship between CPU resources and system response time, it may not be use-ful if memory suddenly becomes the bottleneck.

9 Related work

Queuing models have been widely used to analyze system performance. Menasce et al. [21] wrote a book on how to use queuing networks for performance modeling of vari-ous components such as a database server or a disk. How-ever, as discussed earlier, queuing models are often used to characterize individual components with many assumptions such as stationary workloads. It is not clear how to pro-file large-scale complex systems with queuing models. Re-cently Urgaonkar et al. [31,32] employed a closed queuing network to model multi-tier Internet applications and used Mean Value Analysis (MVA) to calculate the response time of multi-tier distributed systems. They only considered CPU resources in their queuing models but not other resources like memory, disk I/O and network. Besides multi-tier In-ternet applications, distributed systems have many other ar-chitectures and the closed queuing network may not model them well. Stewart and Shen [30] profiled applications to predict throughput and response time of multi-component online services. Stewart et al. [29] used a transaction mix vector to characterize workloads and exploited nonstationar-ity for performance prediction in distributed systems. Rather than using queuing models, in this paper we automatically extract a network of invariants from operational systems and use this network as a model for performance analysis. There-fore, our approach provides a systematic way to profile com-plex services for capacity planning, which is complimentary to those classic modeling techniques.

Many companies have developed their practical ap-proaches to capacity planning. For example, Microsoft [22] published scalability recommendations for their portal server

deployment. IBM Global Services [11] developed their practical capacity planning processes for web application deployment. Oracle [25] provides capacity planning tools for database sizing. Most of these approaches are developed from their practical experiences on some specific compo-nents and they may not scale and generalize well for capac-ity planning tasks in large scale distributed systems. There exists much work on characterizing web traffic for web server capacity planning. Kant and Won [16] analyzed the patterns of web traffic and proposed a methodology to de-termine bandwidth requirements for hardware components of a web server. Barford and Crovella [3] developed a tool to generate representative web workloads for web server management and capacity planning. Our approach does not characterize workloads but extract invariants to characterize the relationships between the volume of workloads and the capacity needs of individual components.

Machine learning methods have also been applied to identify performance bottlenecks in distributed systems. Co-hen et al. [6] used a tree-augmented naive (TAN) Bayesian network to learn the probabilistic relationship between SLA violations and resource usages. They further used this learned Bayesian network to identify performance bottle-necks during SLA violations. In their Elba project, Parekh et al. [26] compared the TAN Bayesian network with other classifiers like decision tree in performance bottle detection. Both of their work employed supervised learning mecha-nisms and required large number of SLA violation samples in their classifier training. For capacity planning tasks, this is not practical because we want to avoid such SLA violations in real business. Our approach is able to predict and remove performance bottlenecks ahead of real SLA violations.

In autonomic computing community [17], there is much work on how to dynamically allocate resources for service provisioning. Hellerstein et al. [10] wrote a book on how to apply feedback control theory to resource management in computing systems. Kephart et al. [8,33] defined util-ity functions based on service-level attributes and proposed an utility-based approach for efficient allocations of server resources. Kusic and Kandasamy [19] used multiple queu-ing models to characterize the performance models of server clusters and then applied control theory for optimal resource allocation. Assume that we have a large cluster of servers for multiple applications, autonomic service provisioning is about how to dynamically allocate and share resources among these applications so as to achieve some optimal goals like maximizing the profits of services. Though this topic is related, the scope of our work focuses on offline capacity planning and resource optimization rather than on-line service provisioning. In addition, dynamic service pro-visioning also needs right capacity planning to support sys-tem evolution.