Queuing theory - Towards auto-scaling in the cloud: online resource allocation techniques

Queuing theory based models are common approach to estimate application performance and status metrics such as response time, length of queue and requests waiting time. The models have also been successfully applied for resource allocation and application scaling problems.

The general representation of queuing model is drown in3. Customers arrive to the system with certain mean arrival rate λ and M processing nodes(application servers) serve them with a mean rate of µ. The common way to characterize queuing model is to apply Kendall’s notation [84]. The notation is used to describe and classify queuing model. The model is described as follows: A/S/c/K/N/D. Below presented the meaning of each value.

• A - inter arrival time distribution • B - service time distribution • C - number of servers

• K - system capacity. Maximum number of customers allowed in the system. If the limit is

reached, then further arrivals are not allowed. If the value is not present, then the capacity is assumed to be unlimited

• N - calling population. The size of the population from which the customers come. If the

• D - the Service Discipline or priority order. It defines how jobs are served in the system.

Two most used disciplines are FIFO and LIFO.

The last three values (K, N, D) often are not shown. If they are not present, then K = ∞, N =∞, D = F IF O. Usually only arrival time (A) and service time (B) distributions are considered.

Ofter they marked as M, D and G. M stands for exponential distribution. D - means deterministic distribution. If the G is present, then it refers to Gaussian or normal distribution.

The queuing model presented in figure 3 is one of examples of an application that can be scaled horizontally. The queue of customers can be considered as a loadbalancer, which redis- tributes requests among M servers running inside VMs. To model n-tier application the model has to be extended, so it becomes queuing network as it shown on figure4.

Most of the works apply two queuing models: M/M/1 and G/G/1. The first model assumes that arrivals λ and service rates µ follow Poisson distribution. In this case the response time can be calculated as follows R = _µ−λ1 . The second model assumes that normal distribution governs service rate and arrival rate. In this case arrival rate calculated in a following way: λ ≥ [s + qa2+q2b

2(R−s)]−1. For multi-tier applications presented on figure 4 the end-to-end response

time is a sum of response times provided by each tier [158].

Researchers usually evaluate two types of interactive applications: single tier and multi-tier applications. Ali-Eldin et al. [8] [7] applied G/G/n model to perform horizontal scaling. n - in the model stands for number of servers of the application. They design adaptive proportional controller which reacts to changes in the load dynamics and targets the application performance SLO. The controller has predictive component that anticipates future workload. However, the evaluation presented in the paper is based on simulation, so it is hard to judge whether the controller will be able to provide same quality of control for the real application. In [144] authors analyze actual traces of request arrivals to the application tier. They observe that arrival process follows Poisson distribution. Based on the observation they build provisioning scheme to scale application tier of multi-tier web application. Unfortunately both works target only single tier, while many web applications are multi-tiered.

Zhang, Cherkasova, and Smirni [158] evaluated multi-tier web application. Each tier is considered as queue. Authors empirically found the correlation between request rate and CPU uti- lization. They proposed a controller that uses regression to approximate CPU usage based on request rate. In [62] authors developed auto-scaling system which performs horizontal scaling of multi-tier web application. The system aims to provide desired response time with minimal number of VMs. Each tier modeled as G/G/n queue. The approach requires offline service time profiling of each tier. During runtime the proposed system classifies workload to decide which tier to scale. For example, ’ordering’ requests put more load on data tier, while ’browsing’ requests utilize front-end tier.

Many web applications characterized by high dynamics of incoming workloads. Urgaonkar et al. [136][135] combined predictive and reactive methods to provision multi-tier application with respect to incoming workload. To perform capacity planning and respond to flash crowds or deviations from expected long-term behavior it is better to use the combination of two methods. Each server of a tier modeled as G/G/1 queue. The prediction mechanism uses histograms to anticipate peak loads. However, peak load provisioning leads to high resource wastage. Gandhi et al. [53] build auto-scaling system based on queuing-theoretic model that dynamically resizes application tier of a web application to meet user specified performance goal. With help of Kalman filters the system dynamically learns the model parameters and proactively scales the application.

All presented above works exploit horizontal scaling to deal with workload changes. Vertical scaling of front-end tier server was investigated by [38]. One of the issues of e-commerce applications is scalability of database tier that was discussed in section2.2. Authors propose M/M/1 queuing model to simplify scaling of database tier. They show that single large VM

delivers lower response time and higher throughput in comparison to multiple small VMs, even though there is the same total amount of resources (CPU, RAM) in both systems.

It is common to apply queuing model to scale tiered web applications. However, there are works that address scaling of batch applications. N. Bennani and A. Menasce [105] propose multi-class queuing network to provision batch applications. Authors implemented horizontal scaling of batch application servers based on predicted workload. However, the proposed system does not consider workload dynamics such as change from CPU-intensive jobs to I/O- intensive jobs. The parameters of the model are fixed at design time.

The queuing theory usually applied to the systems with stationary characteristics such as con- stant arrival and service rates, user think time. However, many cloud applications facing workload fluctuations [82,119], so the arrival rate can change dramatically. It means that auto-scaling system will not be able properly provision the application when one of the parameters changes. Therefore, the queuing model of the application requires periodical reevaluation, which usually done offline.

In document Towards auto-scaling in the cloud: online resource allocation techniques (Page 43-45)