

4.3 MapReduce Case Study

4.3.1 MapReduce Data Flow Analysis

Recalling from the previous chapter, we analyze here what happens when a MapReduce request is sent to a data center. In the Map phase, the input is split into $n_t$ inputs of size $S_t$ so that tasks can be assigned to hosts. Each task, with its specified input, is allocated to a host in the data center. To complete a task, a host must acquire not only the task input data but also the appropriate VM containing the execution code. Therefore, the data transmitted over the data center communication infrastructure includes both the VMs and the input data. For the input data size, $S_{in}$, we consider several cases in the range of 1 GB to 50 GB.

In the Reduce phase, the output is aggregated into an output file of size $S_{out}$ and delivered as the job result. In addition, the Map phase output, called the intermediate output, is exchanged among hosts during the shuffle-exchange phase and has size $S_{inter}$. The data to be transmitted is therefore as depicted in Figure 4.2.

$S_{VM}$ and $|hosts|$ denote the VM size and the number of hosts assigned to the job, respectively.

The output and intermediate output sizes vary with the MapReduce application type and the input file. Our evaluation scenarios in this case study cover a diverse range of output and intermediate output sizes; in the typical case, however, we take the intermediate output and the output to be 30% and 10% of the input data size, respectively.
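The typical-case ratios can be turned into a small calculation. The following is a minimal sketch, not part of the original study: the function name `typical_sizes` and the choice of gigabytes as the unit are illustrative assumptions; only the 30% and 10% ratios and the 1 GB to 50 GB input range come from the text.

```python
def typical_sizes(s_in_gb, inter_ratio=0.30, out_ratio=0.10):
    """Return (S_inter, S_out) in GB for an input of s_in_gb gigabytes.

    The default ratios are the typical-case values quoted in the text;
    other evaluation scenarios can pass different ratios.
    """
    if s_in_gb < 0:
        raise ValueError("input size must be non-negative")
    return inter_ratio * s_in_gb, out_ratio * s_in_gb


if __name__ == "__main__":
    # Sample points from the 1 GB to 50 GB input range considered in the text.
    for s_in in (1, 10, 50):
        s_inter, s_out = typical_sizes(s_in)
        print(f"S_in={s_in} GB -> S_inter={s_inter:.1f} GB, S_out={s_out:.1f} GB")
```

Keeping the ratios as parameters makes it straightforward to sweep other output and intermediate output sizes.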

4.3.1.0.1 Data center communication: The energy consumed to transmit the required data for a job, as shown in (3.8), is the power drawn for communication multiplied by the amount of data to be transmitted, as depicted in Figure 4.2, divided by the network throughput, $\tau_{DC}$. Considering a three-tier, switch-centric network architecture for intra-data center communication and, recalling (3.8), assuming a one-hop path in each tier, the energy consumption of this part follows (4.1). $P^{intra\,DC}_{switch}(l)$ denotes the power drawn by the switch at layer $l$ of the three-tier architecture. Note that we assume a homogeneous replication factor and strategy at all levels of the data center network. $S_d$ represents the size of the data to be transmitted over the data center network, which can be computed according to the formula in Figure 4.2.

$$E^{MR}_{intra\,DC\,comm} = \frac{S_d \times r}{\tau_{DC}} \times \sum_{l=1}^{3} P^{intra\,DC}_{switch}(l) \tag{4.1}$$
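As a rough illustration of (4.1), the sketch below multiplies the transfer time $S_d \times r / \tau_{DC}$ by the summed power of the three switch layers. It is a hedged example rather than the authors' implementation: the function name, the units (bits, bits per second, watts) and all numeric inputs are assumptions, and $r$ is read here as the homogeneous replication factor mentioned in the text.

```python
def intra_dc_comm_energy(s_d_bits, r, tau_dc_bps, p_switch_w):
    """Energy in joules per (4.1): (S_d * r / tau_DC) * sum of P_switch(l).

    s_d_bits   -- S_d, data to transmit, in bits
    r          -- factor r from (4.1), read here as the replication factor
    tau_dc_bps -- tau_DC, network throughput in bits per second
    p_switch_w -- power drawn by the switch at each of the three layers, in watts
    """
    if len(p_switch_w) != 3:
        raise ValueError("expected one power value per tier (three tiers)")
    if tau_dc_bps <= 0:
        raise ValueError("throughput must be positive")
    transfer_time_s = s_d_bits * r / tau_dc_bps
    return transfer_time_s * sum(p_switch_w)


if __name__ == "__main__":
    # Hypothetical inputs: 1 GB of data, r = 1, a 1 Gbps network and
    # made-up switch powers of 100 W, 200 W and 300 W.
    energy_j = intra_dc_comm_energy(8e9, 1, 1e9, (100.0, 200.0, 300.0))
    print(f"{energy_j:.0f} J")
```

With these made-up inputs the transfer takes 8 s against a combined 600 W, so the result is an energy in joules that scales linearly with both $S_d$ and $r$.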

4.3.1.0.2 P2P communication: To analyze the energy consumed in the P2P-cloud per MapReduce job, we consider two scenarios: intra-vicinity responses, where jobs are assigned to hosts within the same vicinity, and inter-vicinity responses, where a job may be assigned to hosts in another vicinity. In the inter-vicinity case, the input data, intermediate output and VM images must be sent to the distant host through the Internet. In the intra-vicinity case, by contrast, the VM, input and intermediate output data need only be sent to a host over the wireless network. For example, with an IEEE 802.11a wireless infrastructure and TCP, we obtained a throughput of $\tau_{intra\,P2P} = 10.9$ Mbps, with an average of 2 hops across access points and 4.78 hops across routers, as discussed in the previous chapter. Overall, the energy required to complete a MapReduce job on the community network in intra-vicinity mode is given in (4.2).

$$E^{MR}_{intra\,P2P} = \frac{S_d}{\tau_{intra\,P2P}} \left(2P_{AP} + 4.78P_{router}\right) \tag{4.2}$$
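Equation (4.2) can be evaluated directly once power figures for the access points and routers are available. The sketch below is illustrative only: the throughput of 10.9 Mbps and the hop counts of 2 and 4.78 come from the text, while the function name, the units and the power values used in the example are assumptions.

```python
TAU_INTRA_P2P_BPS = 10.9e6   # measured throughput quoted in the text (10.9 Mbps)
AP_HOPS = 2                  # average access-point hops, from the text
ROUTER_HOPS = 4.78           # average router hops, from the text


def intra_p2p_energy(s_d_bits, p_ap_w, p_router_w, tau_bps=TAU_INTRA_P2P_BPS):
    """Energy in joules per (4.2): S_d / tau * (2 * P_AP + 4.78 * P_router).

    s_d_bits   -- S_d, data to transmit, in bits
    p_ap_w     -- power drawn by an access point, in watts (caller-supplied)
    p_router_w -- power drawn by a router, in watts (caller-supplied)
    """
    if tau_bps <= 0:
        raise ValueError("throughput must be positive")
    transfer_time_s = s_d_bits / tau_bps
    return transfer_time_s * (AP_HOPS * p_ap_w + ROUTER_HOPS * p_router_w)


if __name__ == "__main__":
    # Hypothetical power values of 5 W per access point and 10 W per router.
    one_gb_bits = 8e9
    print(f"{intra_p2p_energy(one_gb_bits, 5.0, 10.0):.1f} J")
```

Because the wireless throughput is fixed, the energy grows linearly with $S_d$, which is why the size of the VM images matters for this mode.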

In inter-vicinity mode, besides the energy consumed within the vicinity, some energy is dissipated in transmitting the input data and VM images to distant vicinities. The energy drawn in such scenarios therefore adds a term to (4.2):

$$E^{MR}_{inter\,P2P} = E^{MR}_{intra\,P2P} + \left(S_{in} + (n_{hosts} \times S_{VM})\right) \times E_{Internet}$$

$$E^{P2P}_{c} = \sum_{h \in hops} P_c(h) \times r_h \times \frac{S_d}{\tau_h} \tag{4.3}$$

4.3.1.0.3 Communication over the Internet: In addition to intra-data center or intra-vicinity P2P communication, input and output data must be transmitted over the Internet, both in the data center case and in the inter-vicinity case of P2P service provisioning. Recalling the Internet energy consumption model from the previous chapter, the energy consumed over the Internet follows:

$$E^{MR}_{Internet} = \frac{S_{in} + S_{out}}{\phi} \times \sum_{* \in L} \frac{P_c(*) \times |hops(*)|}{\tau_*} \tag{4.4}$$
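The structure of (4.4) can be sketched as a sum over the elements of $L$, each contributing its power times its hop count divided by its throughput. This is a hedged illustration: $\phi$ and $L$ are defined in the previous chapter and not in this section, so the code treats $\phi$ as a plain divisor and each element of $L$ as a caller-supplied `(power_w, hops, throughput_bps)` tuple. The function name and all example values are hypothetical.

```python
def internet_energy(s_in_bits, s_out_bits, phi, layers):
    """Evaluate (4.4): (S_in + S_out) / phi * sum(P_c * |hops| / tau).

    s_in_bits, s_out_bits -- input and output data sizes, in bits
    phi                   -- the divisor phi from (4.4); its meaning is taken
                             from the previous chapter and not assumed here
    layers                -- iterable of (power_w, hops, throughput_bps)
                             tuples, one per element of L (assumed shape)
    """
    if phi == 0:
        raise ValueError("phi must be non-zero")
    per_bit = 0.0
    for power_w, hops, throughput_bps in layers:
        if throughput_bps <= 0:
            raise ValueError("throughput must be positive")
        per_bit += power_w * hops / throughput_bps
    return (s_in_bits + s_out_bits) / phi * per_bit


if __name__ == "__main__":
    # Made-up layer figures purely to exercise the formula.
    example_layers = [(10.0, 2, 1e9), (100.0, 3, 1e10)]
    print(internet_energy(8e9, 0.8e9, 1.0, example_layers))
```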

4.3.1.0.4 Computing Energy: The energy drained within each host is $P_{host} \times t_{task}$ for each phase, where $t_{task}$ is the time needed to process the assigned task on the host. Therefore, the overall computing energy dissipation follows (4.5).

$$E^{MR}_{compute} = \sum_{i \in phases} \sum_{j \in n_t} P_{host} \times t_{task} \tag{4.5}$$
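The double sum over phases and tasks can be sketched as follows, assuming the summand is the per-host term $P_{host} \times t_{task}$ stated in the text. The function name, the dictionary layout and the example timings are illustrative assumptions, and a single $P_{host}$ is used for all hosts for simplicity.

```python
def compute_energy(p_host_w, task_times_s):
    """Sum P_host * t_task over all phases and all tasks, in joules.

    p_host_w     -- power drawn by a host while processing, in watts
    task_times_s -- mapping from phase name to a list of per-task
                    processing times in seconds (assumed layout)
    """
    total_j = 0.0
    for phase, times in task_times_s.items():
        for t_task in times:
            if t_task < 0:
                raise ValueError(f"negative task time in phase {phase!r}")
            total_j += p_host_w * t_task
    return total_j


if __name__ == "__main__":
    # Hypothetical job: two map tasks and one reduce task on 100 W hosts.
    timings = {"map": [10.0, 20.0], "reduce": [5.0]}
    print(f"{compute_energy(100.0, timings):.0f} J")
```

If hosts differ in power draw, the same loop applies with a per-task power value in place of the single `p_host_w`.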


Table 4.1: VM Specifications for MapReduce Case Study

| Type | Cores | Memory (GB) | Storage (GB) | Number of mappers | Number of reducers |
|---|---|---|---|---|---|
| Small | 1 | 1 | 1 | 1 | 1 |
| Medium | 1 | 3.75 | 4 | 1 | 1 |
| Large1 | 2 | 7.5 | 32 | 1 | 1 |
| Large2 | 2 | 7.5 | 32 | 2 | 2 |

4.3.1.0.5 Caching: As stated, the MapReduce workload is composed of input data, intermediate output data and the VMs that contain the computing platform. A considerable amount of energy is consumed in transmitting the VMs over the community network. To alleviate the burden of VM transmission, in this work we introduce a caching mechanism that stores the most prevalent VMs on community nodes, i.e. P2P-cloud with cache. In this way, we save the energy otherwise required to transmit the VMs over the community network each time.
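The effect of such a cache on VM traffic can be illustrated with a small simulation. The sketch below is an assumption-laden example, not the mechanism evaluated in this work: it models a single node that keeps the `capacity` most frequently requested VM types (a simple frequency-based policy chosen here for illustration) and counts how many VM transfers remain. The function name and the request trace are hypothetical.

```python
from collections import Counter


def vm_transfers(requests, capacity):
    """Count VM transfers for a request sequence under a frequency-based cache.

    requests -- iterable of VM type names, one per job assigned to the node
    capacity -- number of VM types the node can keep cached (0 = no cache)
    """
    counts = Counter()   # how often each VM type has been requested so far
    cached = set()       # VM types currently stored on the node
    transfers = 0
    for vm in requests:
        counts[vm] += 1
        if vm in cached:
            continue     # cache hit: nothing is sent over the network
        transfers += 1   # cache miss: the VM must be transmitted
        # Keep only the most frequently requested types; ties are broken
        # by name so the outcome is deterministic.
        candidates = cached | {vm}
        ranked = sorted(candidates, key=lambda v: (-counts[v], v))
        cached = set(ranked[:capacity])
    return transfers


if __name__ == "__main__":
    trace = ["small", "small", "large", "small", "large", "medium", "small"]
    print("without cache:", vm_transfers(trace, 0))
    print("with one cached VM type:", vm_transfers(trace, 1))
```

Multiplying the number of avoided transfers by $S_{VM}$ and by the per-bit transmission energy of the relevant network gives a rough estimate of the energy saved.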