SYSTEM MODEL - Networking for Big Data Chapman pdf

A cloud user generates large amounts of data dynamically over time, and aims to transfer the data into a cloud comprising geo-distributed data centers, for processing using a MapReduce-like framework. We investigate two representative scenarios in this big picture [22,23].

Timely Migration of Geo-Distributed Big Data into a Cloud

Consider a cloud consisting of K geo-distributed data centers in a set of regions [K]. A cloud user (e.g., a global astronomical telescope application) continuously produces large volumes of data at a set [Ξ] of multiple geographic locations (e.g., dispersed telescope sites). The user connects to the data centers from different data generation locations via virtual private networks (VPNs), with G VPN gateways at the user side and K VPN gateways each co-located with a data center. Let [G] denote the set of VPN gateways at the user side. An illustration of the system is in Figure 5.1. A private (the user’s) network interconnects the data generation locations and the VPN gateways at the user side. Such a model reflects typical connection approaches between users and public clouds (e.g., Windows Azure Virtual Network [24]), where dedicated private network connections are established between a user’s premises and the cloud, for enhanced security and reliability, and guaranteed interconnection bandwidth.

We aim to upload the geo-dispersed data sets to the best data center, for processing with a MapReduce-like framework. It is common practice to process data within one data center rather than over multiple data centers. For example, Amazon Elastic MapReduce launches all processing nodes of a MapReduce job in the same EC2 Availability Zone [25].

Moving Big Data to the Cloud ◾ 79

Inter-datacenter connections within a cloud are dedicated high-bandwidth lines [26]. Within the user’s private network, the data transmission bandwidth between a data generation location d ∈ [Ξ] and a VPN gateway g ∈ [G] is large as well. The bandwidth U_gi on a VPN link (g,i) from user-side gateway g to data center i is limited, and constitutes the bottleneck in the system.

Assume the system executes in a time-slotted fashion [16,17] with slot length τ. In slot t, F_d(t) bytes of data are produced at location d for upload to the cloud. Let l_dg denote the latency between data location d ∈ [Ξ] and user-side gateway g ∈ [G], p_gi the delay along VPN link (g,i), and η_ik the latency between data centers i and k. These delays, which can be obtained by a simple command such as ping, are dictated by the respective geographic distances. Given a typical cloud platform that encompasses disparate data centers of different resource charges, detailed cost composition and performance bottlenecks should be analyzed for efficiently moving data into the cloud.
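As a minimal sketch of how these delay parameters combine, the Python snippet below adds up the latency along one upload path d → g → i, plus an optional inter-datacenter hop i → k. The class name `DelayModel`, the method `path_latency`, the dictionary layout, the identifiers `d1`, `g1`, `dc1`, `dc2`, and the millisecond values are all illustrative assumptions, not part of the original model. Relaying data from data center i to data center k is assumed here only to show where η_ik would enter.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class DelayModel:
    """Hypothetical container for the latency parameters of the system model."""
    l: Dict[Tuple[str, str], float]    # l_dg: data location d -> user-side gateway g
    p: Dict[Tuple[str, str], float]    # p_gi: VPN link from gateway g to data center i
    eta: Dict[Tuple[str, str], float]  # eta_ik: data center i -> data center k

    def path_latency(self, d: str, g: str, i: str, k: Optional[str] = None) -> float:
        """Latency of d -> g -> i, plus i -> k when a different destination k is given."""
        total = self.l[(d, g)] + self.p[(g, i)]
        if k is not None and k != i:
            total += self.eta[(i, k)]
        return total


# Made-up latencies in milliseconds, e.g. as one might measure with ping.
model = DelayModel(
    l={("d1", "g1"): 5.0},
    p={("g1", "dc1"): 40.0},
    eta={("dc1", "dc2"): 20.0},
)
direct = model.path_latency("d1", "g1", "dc1")           # data stays in dc1
relayed = model.path_latency("d1", "g1", "dc1", "dc2")   # data relayed on to dc2
```

Keeping the three latency tables separate mirrors the three network segments in the model: the user's private network, the VPN links, and the inter-datacenter lines.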

Uploading Deferrable Big Data to the Cloud

The next scenario we look at is to upload deferrable Big Data to a cloud by considering practical bandwidth charging models through the Internet, instead of assuming a VPN between the user and the cloud. Commercial Internet access, particularly the transfer of Big Data, is nowadays routinely priced by ISPs through a percentile charge model, a dramatic departure from the more intuitive total-volume-based charge model as in residential utility billing or the flat-rate charge model as in personal Internet and telephone billing [6,7,27,28]. Specifically, in a θ-th percentile charge scheme, the ISP divides the charge period, for example, 30 days, into small intervals of equal fixed length, for example, 5 min. Statistical logs summarize traffic volumes witnessed in different time intervals, sorted in ascending order. The traffic volume of the θ-th percentile interval is chosen as the charge volume. For example, under the 95th-percentile charge scheme, the cost is proportional to the traffic volume sent in the 8208th (95% × 30 × 24 × 60/5 = 8208) interval in the list [7,27,28]. The MAX contract model is simply the 100th percentile charge scheme. Such percentile charge models are perhaps less surprising when one considers the fact that infrastructure provisioning cost is more closely related to peak instead of average demand.

[Figure 5.1. Labels: data locations 1 and 2; gateways 1 to 4 and 1′ to 4′; data centers 1 to 4; intranet links at the user side; Internet links; intranet links in cloud.]
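The θ-th percentile charge rule can be sketched in a few lines of Python. The function name `charge_volume` and the synthetic traffic values are hypothetical. Rounding the rank up to the next integer is an assumption made here for illustration, since ISPs may round differently; integer arithmetic is used so the rank does not suffer from floating-point error.

```python
from typing import Sequence


def charge_volume(volumes: Sequence[float], theta: int) -> float:
    """Return the charge volume under a theta-th percentile scheme.

    volumes: traffic volume observed in each fixed-length interval.
    theta:   integer percentile in [1, 100]; 100 corresponds to the MAX model.
    """
    if not volumes:
        raise ValueError("need at least one interval")
    if not 1 <= theta <= 100:
        raise ValueError("theta must be between 1 and 100")
    ordered = sorted(volumes)                # ascending order, as in the ISP's log
    n = len(ordered)
    rank = max(1, (theta * n + 99) // 100)   # 1-based rank, ceil(theta * n / 100)
    return ordered[rank - 1]


# A 30-day charge period with 5-minute intervals has 8640 intervals.
intervals_per_month = 30 * 24 * 60 // 5
# Synthetic volumes 1, 2, ..., 8640 so that the value equals its rank.
volumes = [float(v) for v in range(1, intervals_per_month + 1)]

p95 = charge_volume(volumes, 95)     # volume of the 8208th interval
p100 = charge_volume(volumes, 100)   # MAX contract model: the largest volume
```

Because only the interval at the chosen rank matters, traffic bursts confined to the top 5% of intervals do not raise the bill under a 95th-percentile scheme.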

Toward minimizing bandwidth cost based on the percentile charge scheme, most existing studies examine strategies through careful traffic scheduling, multihoming (subscribing to multiple ISPs), and inter-ISP traffic shifting. They model the cost minimization problem with a critical, although sometimes implicit, assumption that all data generated at the user location have to be uploaded to the cloud immediately, without any delay [27,28]. Consequently, the solution space is restricted to traffic smoothing in the spatial domain only. Real-world Big Data applications reveal a different picture, in which a reasonable amount of uploading delay (often specified in SLAs) is tolerable by the cloud user, providing a golden time window for traffic smoothing in the temporal domain, which can substantially slash peak traffic volumes and hence communication cost. For instance, astronomical data from observatories are periodically generated at huge volumes but require no urgent attention. Another well-known example is human genome analyses [10], where data are also “big” but not time sensitive. We present a model and online algorithms that exploit time deferral of the data upload to achieve cost efficiency.
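To illustrate why a tolerable upload delay helps, the Python sketch below compares uploading each batch immediately with spreading it evenly over the slots allowed by its deadline. This even-spreading rule is a simple illustrative heuristic, not the online algorithms of this chapter, and the names `spread_upload`, `workloads`, and `max_delay` as well as the sample numbers are assumptions.

```python
from typing import List, Sequence


def spread_upload(workloads: Sequence[float], max_delay: int) -> List[float]:
    """Spread each slot's workload evenly over max_delay + 1 consecutive slots.

    workloads[t] is the data volume generated in slot t; every byte is sent no
    later than slot t + max_delay. Returns the traffic volume sent in each slot.
    """
    if max_delay < 0:
        raise ValueError("max_delay must be non-negative")
    traffic = [0.0] * (len(workloads) + max_delay)
    for t, volume in enumerate(workloads):
        share = volume / (max_delay + 1)
        for slot in range(t, t + max_delay + 1):
            traffic[slot] += share
    return traffic


# Bursty generation: 12 units of data every fourth slot (made-up numbers).
workloads = [12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 0.0, 0.0]

immediate = spread_upload(workloads, max_delay=0)  # upload without any delay
deferred = spread_upload(workloads, max_delay=3)   # tolerate up to 3 slots of delay

peak_immediate = max(immediate)
peak_deferred = max(deferred)
```

In this example the same total volume is uploaded in both cases, but the peak per-slot traffic drops from 12 to 3 units, which is the quantity a MAX or high-percentile charge scheme bills on.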

We also assume the general case where the mappers and reducers may reside in geographically dispersed data centers. The Big Data can tolerate bounded upload delays specified in their SLA. We model a cloud user producing a large volume of data every hour, as exemplified by astronomical observatories. As shown in Figure 5.2, the data location is multihomed with multiple ISPs, for communicating with data centers. Through the infrastructure provided by ISP i, data can be uploaded to a corresponding data center DCi. Each ISP has its own traffic charge model and pricing function.

[Figure 5.2. Labels: data sources, mappers, reducers; data locations 1 and 2; data centers 1 to 3 and 1′ to 3′.]

After arrival at the data centers, the uploaded data are processed using a MapReduce-like framework. Intermediate data need to be transferred among data centers in the shuffling stage. In a general model, multiple ISPs may be employed by the cloud to communicate among its distributed data centers, for example, ISP A for communicating between DC1 and DC2, and ISP B for communicating between DC1 and DC3. If two inter-DC connections are covered by the same ISP, it can be equivalently viewed as two ISPs with identical traffic charge models.

The system runs in a time-slotted fashion. Each time slot is 5 min. The charge period is a month (30 days). Assume there are M mappers and R reducers in the system. [M] and [R] denote the set of mappers and the set of reducers, respectively. Since each mapper is associated with a unique ISP in the first stage, we employ m ∈ [M] to represent the ISP used to connect the user to mapper m. All mappers use the same hash function to map the intermediate keys to reducers [21]. The upload delay is defined as the duration from when data are generated to when they are transmitted to the mappers. We focus on uniform delays, that is, all jobs have the same maximum tolerable delay D, which is reasonable assuming data generated at the same user location are of similar nature and importance. We use W_t to represent each workload released at the user location in time slot t.
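The shared hash function that maps intermediate keys to reducers can be sketched as follows. The text states only that all mappers use the same hash function; the choice of SHA-256 reduced modulo the number of reducers, the function name `reducer_for`, and the sample keys are assumptions for illustration. A cryptographic digest is used instead of Python's built-in `hash` because the latter is randomized per process for strings, so different mappers could disagree.

```python
import hashlib


def reducer_for(key: str, num_reducers: int) -> int:
    """Map an intermediate key to a reducer index in [0, num_reducers)."""
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo R.
    return int.from_bytes(digest[:8], "big") % num_reducers


num_reducers = 3  # R in the text; the value 3 is made up
keys = ["galaxy", "nebula", "pulsar", "quasar"]
assignment = {key: reducer_for(key, num_reducers) for key in keys}
```

Since every mapper computes the same index for the same key, all records sharing a key reach the same reducer, regardless of which data center hosts the mapper that produced them.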

ONLINE ALGORITHMS FOR TIMELY MIGRATION OF BIG

In document Networking for Big Data Chapman pdf (Page 101-104)