This section illustrates how the predicted distributions can be used in workload management of data-intensive applications. Figs. 4.4(a) and 4.4(b) visualize sample predicted PDFs from the test set of the TPC-H workload. In particular, the figures plot 14 randomly sampled predicted PDFs for CPU and execution time. The histograms show the actual CPU and runtime values for the whole test dataset. Because the PDFs were randomly selected from the test set, they may belong to different queries, meaning they are conditioned on different inputs. The dotted vertical line shows the observed value. As the figures show, the PDFs accurately approximate the resource usage and performance distributions, which lie primarily within the ranges (0.1, 0.4) and (0.1, 0.25) for CPU and runtime respectively. Consistently, the models are much more uncertain for CPU and execution time values beyond 0.5 and 0.3. Put differently, the tendency of all CPU and runtime PDFs is to the right-hand side of the diagram, which is consistent with the plotted histograms of actual resource and performance values in which, for example, we hardly face resource demand above 0.5.
Figure 4.4: Sample PDF predictions for (a) CPU and (b) execution time of Hive queries based on the TPC-H workload.
These sample PDFs demonstrate that the MDN is also a reliable estimator in the classic point-estimate sense: the PDFs cover the observation points with high probability in all cases except PDF number 14 in Fig. 4.4(a) and PDF number 8 in Fig. 4.4(b). Even in those two cases, they locate the shape of the distributions precisely.
We also argue that distribution-based prediction gives resource and workload management systems a concise yet lucid way of interpreting workload behaviour. Such a capability is crucial for a number of resource management activities, such as run-time performance isolation and diagnostic inspection. In particular, upper and lower bounds on resource usage simplify the task of performance isolation, since, for example, our predictions in all figures capture the dominant CPU time precisely. SLA management also becomes more practical when, for example, we already know the minimum and maximum required share of CPU for a given query. When it comes to performance inspection, diagnosing abnormal behaviour against the predicted numbers is also viable. Specifically, Fig. 4.4(a) reports that, for a given set of queries, we will not face peak CPU time (>0.5) very often; hence a higher peak CPU time indicates the possible presence of a fault in the software or cluster.
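As a minimal sketch of how such bounds could be derived, the snippet below assumes that the predicted PDF for one query is a Gaussian mixture, takes a central 95% interval as the lower and upper bounds on CPU time, and flags observations outside that interval as candidates for inspection. The mixture parameters, the 95% coverage level and all function names are illustrative assumptions, not values or interfaces from this chapter.

```python
import math

# Hypothetical mixture for one query: (weight, mean, standard deviation) per
# Gaussian component on the normalized CPU-time scale. These numbers are
# made up for illustration and are not taken from the experiments.
MIXTURE = [(0.7, 0.20, 0.04), (0.3, 0.32, 0.06)]


def mixture_cdf(x, mixture):
    """Cumulative probability P(X <= x) under a Gaussian mixture."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
        for w, mu, sigma in mixture
    )


def mixture_quantile(p, mixture, lo=-10.0, hi=10.0, tol=1e-9):
    """Invert the CDF by bisection; valid because the CDF is increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, mixture) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def usage_bounds(mixture, coverage=0.95):
    """Lower and upper bounds of the central interval with given coverage."""
    tail = (1.0 - coverage) / 2.0
    return mixture_quantile(tail, mixture), mixture_quantile(1.0 - tail, mixture)


def is_anomalous(observed, mixture, coverage=0.95):
    """True if the observation falls outside the predicted interval."""
    lower, upper = usage_bounds(mixture, coverage)
    return observed < lower or observed > upper


lower, upper = usage_bounds(MIXTURE)
print(f"95% CPU-time interval: [{lower:.3f}, {upper:.3f}]")
print("0.25 anomalous?", is_anomalous(0.25, MIXTURE))
print("0.60 anomalous?", is_anomalous(0.60, MIXTURE))
```

The interval endpoints could serve as the minimum and maximum CPU share in an SLA, while the anomaly check mirrors the diagnosis scenario above. A real deployment would need to choose the coverage level to balance false alarms against missed faults.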
4.6 Summary
In this chapter, we presented a novel approach that uses mixture density networks to predict the CPU and runtime distributions of large-scale analytics Hive queries. We evaluated our approach on TPC-H, showing that it outperforms the state-of-the-art techniques in half of the experiments. This result is promising, as it shows that the proposed approach not only predicts the full distribution over targets accurately but is also a reliable single-point estimator.
In the next chapter, we will investigate the elasticity management of data-intensive workloads on clouds.
Chapter 5
Elasticity Management of Data Analytics Flows on Clouds
Growing attention to real-time insights into streaming data has led to the formation of many complex data analytics flows. For example, by analyzing data using data analytics flows, real-time situational awareness can be developed for handling events such as natural disasters, traffic congestion, or major traffic incidents [43].
A data analytics flow typically operates on three layers: ingestion, analytics, and storage, each of which is provided by a data-intensive system. These systems are often available as cloud managed services, enabling users to deploy data analytics flow applications with little effort. For example, Fig. 5.1 depicts a click-stream data analytics flow in which Amazon Kinesis [5] manages the ingestion of streaming data at scale. Apache Storm [12], deployed on EC2, processes the streaming data and persists the aggregated results in DynamoDB [3].
Despite straightforward orchestration, elasticity management of these flows is challenging. This is due to: i) the heterogeneity of workloads and the diversity of cloud resources such as queue partitions, compute servers, and NoSQL throughput capacity, ii) workload dependencies between layers, and iii) different performance behaviours and resource consumption patterns.
To address the issues above, in this chapter we investigate the problem of multi-layered and holistic resource allocation for data analytics flows deployed on public clouds. We propose a framework for the design and stability analysis of adaptive controllers that employs tools from classic nonlinear control theory. Through numerous experiments on a real-world click-stream data analytics flow, we show that, compared to the state-of-the-art techniques, our approach reduces the error (i.e. the deviation from the desired utilization) by up to 48%.
5.1 Challenges in Elasticity Management of Data Analytics Flows
Figure 5.1: A data analytics flow that performs real-time sliding-window analysis over click-stream data.
Elasticity management of data analytics flow applications is challenging due to three unique characteristics of cloud-hosted data-intensive systems. First, data analytics flow applications have heterogeneous workloads, in which different platforms and workloads depend on each other. For example, Fig. 5.2 shows that the workload dynamics in the data ingestion layer are strongly correlated with those in the analytics layer. To provide smooth elasticity management, these workload dependencies need to be detected and considered in resource allocation.
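One simple way to detect such a dependency between two layers is to compute the Pearson correlation coefficient between their monitored metrics. The sketch below is illustrative only: the sample values are made up (they are not the measurements behind Fig. 5.2), and the 0.8 cut-off for declaring two layers dependent is a hypothetical choice.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up samples taken at the same time instants, for illustration only.
input_records = [12000, 18000, 25000, 31000, 40000, 52000, 47000, 36000]
storm_cpu = [6.0, 8.5, 11.0, 14.5, 18.0, 24.0, 22.5, 16.0]

DEPENDENCY_THRESHOLD = 0.8  # hypothetical cut-off for "dependent" layers

r = pearson(input_records, storm_cpu)
print(f"r = {r:.3f}, dependent: {abs(r) >= DEPENDENCY_THRESHOLD}")
```

A coefficient close to 1 suggests that scaling decisions for the analytics layer can be informed by the arrival rate observed at the ingestion layer. Correlation alone does not establish the direction of the dependency, so the layer order of the flow is still needed to interpret it.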
Second, data analytics flow applications often deal with immense data volumes which, together with the uncertain velocity of data streams, lead to changing resource consumption patterns. This mandates an elasticity technique that can sustain workload fluctuations in a time-efficient manner, meaning that resources should be acquired and released as soon as required.
Third, a data analytics flow is deployed on heterogeneous cloud services and resources, each of which exhibits different performance behaviours and a different pricing scheme. In this setting, resource allocation needs to cater for diverse resource requirements and their associated cost dimensions to meet the users' Service Level Objectives (SLOs). Existing solutions that employ dynamic, intelligent auto-scaling algorithms [82, 75, 94] lack a holistic approach to managing the resource requirements of big data analytics workloads. Instead, they focus on one resource type, such as Virtual Machines (VMs), or on a particular workload, such as Hadoop. Nevertheless, [119] shows that the ability to scale down, for example, both the web server and cache tiers leads to a 65% saving in peak operational cost, compared to 45% if only the web server tier is resized.
5.2 Related Work
Elasticity techniques have been studied extensively in recent years [85]. Several techniques, such as control theory [86], queueing models [109], and Markov decision processes [75], have been used to tackle the problem with respect to different resource types such as cache servers [64], HDFS storage [82], or VMs [48]. Among these, recent studies in resource management using control theory [82, 86, 65, 66] have clearly shown the benefits of dynamic resource allocation against fluctuating workloads. More importantly, what makes the control theory approach stand out among workload management techniques is that it does not rely on any prior information about the workload behavior and, unlike for example queueing models, it imposes very mild assumptions on the system model. Such features lead to a simple yet effective approach that can sustain any workload shape and dynamics.
[Figure 5.2 plot: top panel "Ingestion Layer (Kinesis)", input records (×10^4) versus time; bottom panel "Analytics Layer (Apache Storm)", CPU versus time (min).]
Figure 5.2: The data arrival rate at the ingestion layer (Amazon Kinesis in Fig. 5.1) is strongly correlated (coefficient = 0.95) with the CPU load at the analytics layer (Apache Storm in Fig. 5.1).
A number of studies [86, 82, 65, 66, 77, 78, 47, 87, 67, 96, 95] have investigated the elasticity management of either data-intensive systems or single/multi-tier web applications using control theory. Lama et al. [77] propose a fuzzy controller for efficient server provisioning on multi-tier clusters, which bounds the 90th-percentile response time of requests flowing through the multi-tier architecture. They further improve their approach in [78] by adding neural networks to the controller, which avoids tuning the parameters on a manual trial-and-error basis and yields a more effective model in the face of highly dynamic workloads. Similarly, Jamshidi et al. [65] propose a fuzzy controller that enables qualitative specification of elasticity rules for cloud-based software. In [66], they further equip their technique with Q-Learning, a model-free reinforcement learning strategy, to free users of most tuning parameters. More recently, Farokhi et al. [47] use a fuzzy controller for vertical elasticity of both CPU and memory to meet the performance objective of an application.
In [82], the authors propose a fixed-gain controller for elasticity management of a Hadoop Distributed File System (HDFS) [102] under dynamic web 2.0 workloads. To avoid oscillatory behavior of the controller, [82] develops a proportional thresholding technique, which works by dynamically configuring the range for the controller variables. Similarly, in [87], the authors propose a multi-model controller that integrates decisions from an empirical model and a workload forecast model with the classical fixed-gain controller. The empirical model retrieves, from recorded past data, distinct configurations that are capable of sustaining the anticipated Quality of Service (QoS). In contrast, the forecast model, which is built using the Fourier transform, provides proactive resource resizing decisions for specific classes of workloads.
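To make the idea of a fixed-gain controller with a dynamically configured target range concrete, the following sketch applies an integral control law with a constant gain to the number of servers and takes no action while the measured utilization stays inside the range. It is an illustration under stated assumptions rather than the controller of [82]: the rule for the lower threshold (removing one server must not push the remaining servers above the upper threshold), the gain of 0.05, the 70% upper threshold and the toy plant model (utilization = demand / servers) are all assumed here.

```python
import math


def low_threshold(high, servers):
    """Assumed rule: after removing one of `servers` nodes, the remaining
    nodes must stay at or below `high`, so the bound scales with size."""
    if servers <= 1:
        return 0.0
    return high * (servers - 1) / servers


def fixed_gain_step(servers, utilization, high=70.0, gain=0.05, min_servers=1):
    """One control step; returns the new (integer) number of servers."""
    low = low_threshold(high, servers)
    if low <= utilization <= high:
        return servers  # inside the target range: no action
    # Integral law with constant gain: u(k+1) = u(k) + K * e(k), rounded
    # away from the current value because servers are whole units.
    if utilization > high:
        new = math.ceil(servers + gain * (utilization - high))
    else:
        new = math.floor(servers + gain * (utilization - low))
    return max(min_servers, new)


# Toy plant: one server at 100% utilization serves 100 units of demand.
servers = 2
history = []
for demand in [120, 120, 300, 300, 300, 300, 150, 150, 150, 150]:
    utilization = min(100.0, demand / servers)
    servers = fixed_gain_step(servers, utilization)
    history.append(servers)

print(history)
```

Because the lower threshold moves with the cluster size, a scale-in step in this toy setting does not immediately trigger a scale-out, which is the kind of oscillation the thresholding is meant to avoid. The gain remains constant throughout, so the controller reacts with the same aggressiveness regardless of how the workload evolves.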
More closely related to the topic of this chapter, the authors in [96] propose a resource controller for multi-tier web applications. The proposed control system is built upon a black-box system modelling approach to compensate for the absence of first-principle models for complex multi-tier enterprise applications. Unlike [96], the authors of [86] model the system (i.e. the web server) as a second-order differential equation. However, the estimated system model used for control would become inaccurate if the real workload range were to deviate significantly from the range used for developing the performance model. The authors of [96] subsequently extend that work in [95] by employing multi-input multi-output (MIMO) control combined with a model estimator that captures the relationship between resource allocations and performance in order to assign the right amount of resources. The resource allocation system can then automatically adapt to workload changes in a shared virtualized infrastructure to achieve the average response time. Along similar lines, the authors in [67] incorporate a Kalman filter into a feedback controller to dynamically allocate CPU resources to virtual machines hosting server applications. However, our work differs in that our control system, rather than adjusting CPU allocation in a shared infrastructure, a capability that commercial cloud providers do not offer, regulates resources at a higher level of abstraction, for example the number of instantiated VMs. Above all, unlike our work, this class of control systems is only quasi-adaptive, as their gain parameters do not rely on the history of the previously computed control gains and hence are unable to dynamically adapt to workload changes (see Section 5.3.3.2).
In summary, almost all of the above studies share the same limitation: they lack a holistic view of resource requirement management, having primarily investigated virtual server allocation problems, even in multi-tier Internet services. Our work completes this picture by studying different cloud resources, including distributed messaging queue partitions (data ingestion layer), VMs (data analytics layer), and provisioned read or write throughput of tables (data storage layer).