
ROUTING ALGORITHM BASED COST MINIMIZATION FOR BIG DATA PROCESSING


Academic year: 2020



D. Vinotha, PG Scholar, Department of CSE, RVS Technical Campus

Dr. Y. Baby Kalpana, Head of the Department, Department of CSE, RVS Technical Campus

ABSTRACT: The information explosion is the rapid increase in the amount of published information or data and the effects of this abundance. As the amount of available data grows, managing the information becomes more difficult, which can lead to information overload. It is therefore imperative to study the cost minimization problem for big data processing in geo-distributed data centres. Existing approaches to big data processing are not cost-efficient because of the following weaknesses. First, data locality may result in a waste of resources. Second, the links in networks vary in transmission rate and cost according to their unique features, e.g., the distances and physical optical fiber facilities between data centers. To overcome these weaknesses, the cost minimization problem for big data processing is studied via joint optimization of task assignment, data placement, and routing in geo-distributed data centers. Finally, a comparison is made and the changes and improvements are studied.

Keywords: Big data, data centre resizing, routing algorithm, data centres, Markov chain process

1. INTRODUCTION

The explosive growth of demands on big data processing imposes a heavy burden on computation, storage, and communication in data centers, which hence incurs considerable operational expenditure to data center providers. Therefore, cost minimization has become an emergent issue for the upcoming big data era. Different from conventional cloud services, one of the main features of big data services is the tight coupling between data and computation as computation tasks can be conducted only when the corresponding data is available. As a result, three factors, i.e., task assignment, data placement and data movement, deeply influence the operational expenditure of data centers. In this paper, we are motivated to study the cost minimization problem via a joint optimization of these three factors for big data services in geo-distributed data centers.

Big data analysis has shown great potential in unearthing valuable insights from data to improve decision making, minimize risk, and develop new products and services. By 2015, 71% of worldwide data center hardware spending will come from big data processing, which will surpass $126.2 billion. Prior work has studied the cost minimization problem for big data services in geo-distributed data centers via a joint optimization of the three factors that deeply influence operational expenditure: task assignment, data placement, and data movement. To describe the task completion time with consideration of both data transmission and computation, a two-dimensional Markov chain has been proposed and the average task completion time derived in closed form. Furthermore, the problem has been modeled as a Mixed-Integer Non-Linear Programming (MINLP) problem, and an efficient solution that linearizes it has been proposed. The high efficiency of that proposal is validated by extensive simulation-based studies [6].

2 RELATED WORKS

2.1 Multi-level Power Management

The coordination problem among power management solutions has been investigated. There are two key contributions. First, a power management solution that coordinates different individual approaches is proposed and validated; simulations based on 180 server traces from nine different real-world enterprises demonstrate the correctness, stability, and efficiency advantages of the solution. Second, using the unified architecture as the base, a detailed quantitative sensitivity analysis is performed, and conclusions are drawn about the impact of different architectures, implementations, workloads, and system design choices. The main advantage is this detailed sensitivity analysis, which evaluates several interesting variations in the architecture and implementation and in the space of mechanisms and policies. Power delivery, electricity consumption, and heat management are becoming key challenges in data center environments. The demerit is that individual solutions address these challenges without any coordination between them [9].

2.2 Poisson Model


The work in [2] concerns the discovery and modeling of the user's aggregate interest in a session. The approach relies on the premise that the visiting time of a page is an indicator of the user's interest in that page. Even the same person may have different desires at different times. The model has an advantage over previous proposals in terms of speed and memory usage, and the experiments show that it can be used on Web sites with different structures. To confirm this finding, the model is compared with two previously proposed recommendation models; the results show that it improves the efficiency significantly. If the representation is not appropriate for the model, however, the prediction accuracy will decrease [2].

2.3 Geographical Load Balancing

Whether geographical load balancing can encourage the use of “green” renewable energy and reduce the use of “brown” fossil fuel energy has been explored. The work makes two contributions. First, it derives two distributed algorithms for achieving optimal geographical load balancing. Second, it shows that if electricity is dynamically priced in proportion to the instantaneous fraction of the total energy that is brown, then geographical load balancing provides significant reductions in brown energy use. Geographical load balancing provides a huge opportunity for environmental benefit as the penetration of green, renewable energy sources increases. Specifically, an enormous challenge facing the electric grid is that of incorporating intermittent, unpredictable renewable sources such as wind and solar. Geographical load balancing aims to reduce energy costs, but this can come at the expense of increased total energy usage: by routing to a data center farther from the request source to use cheaper energy, the data center may need to complete the job faster, and so use more service capacity, and thus energy, than if the request were served closer to the source [6].

2.4 Cost Minimization

Data centre resizing (DCR) has been proposed to reduce the computation cost by adjusting the number of activated servers via task placement. To describe the rate-constrained computation and transmission in big data processing, a two-dimensional Markov chain has been proposed and the expected task completion time derived in closed form. To deal with the high computational complexity of solving the MINLP, the problem is linearized into a mixed-integer linear programming (MILP) problem, which can be solved using a commercial solver. DCR and task placement are usually jointly considered to match the computing requirement [5].

See Table 1 for the various references.

3 SYSTEM MODEL

Based on the study of data placement, task assignment, data center resizing, and routing, the overall operational cost in large-scale geo-distributed data centers for big data applications is minimized. We first characterize the data processing procedure using a two-dimensional Markov chain and derive the expected completion time in closed form, based on which the joint optimization is formulated as an MINLP problem. To tackle the high computational complexity of solving the MINLP, we linearize it into an MILP problem. Extensive experiments show that the joint-optimization solution has a substantial advantage over the approach of two-step separate optimization. The K shortest path algorithm is used to find the minimum-cost paths for routing.

3.1 Big Data and Data Flow

Collecting the dataset for big data is the first task. The whole system can be modelled as a directed graph G = (N, E). Nodes receive data flows from source nodes and forward them according to the routing strategy. The weight w(u, v) of each link represents the corresponding communication cost and is defined in terms of C_R and C_L, the inter-data centre and intra-data centre communication costs, respectively.
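The graph model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the defining equation for w(u, v) is not reproduced here, so the sketch assumes that a link crossing two data centres costs C_R and any other link costs C_L. The node names, the topology, and the numeric values of C_R and C_L are all hypothetical.

```python
# Hypothetical per-unit link costs; the text names C_R and C_L but gives no values.
C_R = 5.0  # assumed cost of a link between two data centres
C_L = 1.0  # assumed cost of a link inside one data centre

# Hypothetical topology: every node is tagged with the data centre it belongs to.
data_centre = {"s1": "dc1", "sw1": "dc1", "sw2": "dc2", "s2": "dc2"}
links = [("s1", "sw1"), ("sw1", "sw2"), ("sw2", "s2")]


def link_weight(u, v):
    """Return C_R for a link that crosses data centres and C_L otherwise (assumption)."""
    return C_R if data_centre[u] != data_centre[v] else C_L


def build_graph(edge_list):
    """Build the directed graph G = (N, E) as a dict {u: {v: w(u, v)}}."""
    graph = {node: {} for node in data_centre}
    for u, v in edge_list:
        graph[u][v] = link_weight(u, v)
        # Assumption: each physical link can be used in both directions.
        graph[v][u] = link_weight(v, u)
    return graph


graph = build_graph(links)
```

Storing the graph as a nested dictionary keeps the weight lookup w(u, v) a constant-time operation, which is convenient for the routing step described later.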


3.2 Data Placement

We define a binary variable y_{jk} to denote whether chunk k is placed on server j as follows:

y_{jk} = 1 if chunk k is placed on server j, and y_{jk} = 0 otherwise.

In the distributed file system, we maintain P copies of each chunk k \in K, which leads to the following constraint:

\sum_{j \in J} y_{jk} = P, \quad \forall k \in K.

Furthermore, the data stored on each server j \in J cannot exceed its storage capacity, i.e., the total size of the chunks placed on server j is bounded by the capacity of server j.

The data placement and task assignment are transparent to the data users with guaranteed QoS.
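The two placement constraints can be checked mechanically. The following is a minimal sketch, not the authors' implementation; the function name, the argument names (`chunk_size`, `capacity`, `copies`), and the sample instance are hypothetical and chosen only for illustration.

```python
def placement_is_feasible(y, chunk_size, capacity, copies):
    """Check the replication and storage constraints for a 0/1 matrix y[j][k].

    y[j][k] == 1 means chunk k is placed on server j.
    """
    num_servers = len(y)
    num_chunks = len(chunk_size)
    # Replication constraint: each chunk k is stored on exactly `copies` servers.
    for k in range(num_chunks):
        if sum(y[j][k] for j in range(num_servers)) != copies:
            return False
    # Storage constraint: the chunks on server j must fit within its capacity.
    for j in range(num_servers):
        used = sum(y[j][k] * chunk_size[k] for k in range(num_chunks))
        if used > capacity[j]:
            return False
    return True


# Hypothetical instance: three servers, three chunks, two copies per chunk.
chunk_size = [4, 2, 3]
capacity = [6, 7, 5]
y_ok = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]   # every server is filled exactly to capacity
y_bad = [[1, 1, 1], [1, 0, 0], [0, 1, 1]]  # server 0 would hold 9 units but has room for 6
```

With this instance, `placement_is_feasible(y_ok, chunk_size, capacity, 2)` holds, while `y_bad` violates the storage constraint on the first server.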

Each data chunk k on server j has a processing rate and a loading rate. The processing procedure can then be described by a two-dimensional Markov chain.

According to the QoS requirement,

(5)

where

(6)

(7)

3.3 Routing of Distributed Data Centers and Cost Minimization

The cost minimization problem for big data processing is addressed via joint optimization of task assignment, data placement, and routing in geo-distributed data centers. Specifically, the following issues are considered in the joint optimization. Servers are equipped with limited storage and computation resources. Each data chunk has a storage requirement and will be required by big data tasks.

K Shortest Path Routing Algorithm

The K shortest path routing algorithm is an extension of the shortest path routing algorithm in a given network. It is sometimes crucial to have more than one path between two nodes in a given network. When there are additional constraints, paths other than the shortest path can be computed. To find the shortest path, one can use shortest path algorithms such as Dijkstra's algorithm or the Bellman-Ford algorithm and extend them to find more than one path. The K shortest path routing algorithm is a generalization of the shortest path problem: it finds not only the shortest path but also K - 1 other paths in order of increasing cost, where K is the number of shortest paths to find. The problem can be restricted to the K shortest paths without loops (loopless K shortest paths) or allowed to include loops [4].

3.4 Task Assignment

When a task is distributed to a server where its requested data chunk does not reside, it needs to wait for the data chunk to be transferred. Each task should be responded to within time D. Moreover, in practical data center management, many task prediction mechanisms based on historical statistics have been developed and applied to decision making in data centers. To keep the data center settings up to date, data center operators may make adjustments according to the task prediction period by period. To deal with the high computational complexity of solving the MINLP, we linearize it into a mixed-integer linear programming (MILP) problem, which can be solved using a commercial solver. Extensive numerical studies show the high efficiency of the proposed joint-optimization based algorithm. The flow of work is illustrated in Fig. 1.1. During file transfer, files larger than 10 MB are transferred to their destination. If the cost of sending a file from source S to destination D exceeds the server cost, cost minimization is performed, where D is the number of copies.

Algorithm

Dijkstra's algorithm can be generalized to find the K shortest paths.

P = empty
count_u = 0, for all u in V
insert path P_s = {s} into B with cost 0
while B is not empty and count_t < K:
    - let P_u be the shortest cost path in B with cost C
    - B = B - {P_u}, count_u = count_u + 1
    - if u = t then P = P U {P_u}
    - if count_u ≤ K then
        for each vertex v adjacent to u:
            - let P_v be a new path with cost C + w(u, v) formed by concatenating edge (u, v) to path P_u
            - insert P_v into B
return P
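The generalized Dijkstra procedure can be sketched in Python with a binary heap playing the role of the set B. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function name `k_shortest_paths`, the dictionary-based graph format, and the sample graph are hypothetical, link weights are assumed non-negative, and the variant shown allows paths with loops.

```python
import heapq
from collections import defaultdict


def k_shortest_paths(graph, source, target, k):
    """Return up to k lowest-cost paths from source to target as (cost, path) pairs.

    graph maps each node u to a dict {v: w(u, v)} of non-negative link weights.
    Paths may revisit nodes, i.e. this is the variant that allows loops.
    """
    found = []                    # P: paths that have reached the target
    count = defaultdict(int)      # count_u: how many times node u was popped
    heap = [(0, [source])]        # B: candidate paths ordered by cost
    while heap and count[target] < k:
        cost, path = heapq.heappop(heap)   # cheapest candidate path in B
        u = path[-1]
        count[u] += 1
        if u == target:
            found.append((cost, path))
        if count[u] <= k:
            # Extend the path by every outgoing link (u, v).
            for v, weight in graph.get(u, {}).items():
                heapq.heappush(heap, (cost + weight, path + [v]))
    return found


# Hypothetical four-node network with distinct path costs.
example = {
    "s": {"a": 1, "b": 4},
    "a": {"b": 2, "t": 7},
    "b": {"t": 3},
    "t": {},
}
paths = k_shortest_paths(example, "s", "t", 3)
# paths == [(6, ['s', 'a', 'b', 't']), (7, ['s', 'b', 't']), (8, ['s', 'a', 't'])]
```

Because each node is expanded at most K times, the loop terminates even when the network contains cycles.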

4 JOINT OPTIMIZATION

To linearize the constraints that contain the product of two variables, joint optimization is performed. We define a new variable as follows:

(9)

which can be equivalently replaced by the linear constraints

(10)

(11)

The constraints can then be written in linear form as

(12)

(13)

In a similar way, we define a new variable as

(14)

which can be linearized by

(15)

(16)
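The idea of replacing a product of two variables by linear constraints can be illustrated for the binary case. The sketch below is an assumption about the kind of product involved and is not a reconstruction of equations (9) to (16): it checks by enumeration the standard linearization of z = x * y for 0/1 variables, namely z ≤ x, z ≤ y, and z ≥ x + y - 1. The function name is hypothetical.

```python
from itertools import product


def satisfies_linear_form(x, y, z):
    """Linear constraints that stand in for z = x * y when x, y, z are 0/1."""
    return z <= x and z <= y and z >= x + y - 1


# For every binary (x, y), the only z allowed by the linear constraints is x * y.
for x, y in product((0, 1), repeat=2):
    feasible_z = [z for z in (0, 1) if satisfies_linear_form(x, y, z)]
    assert feasible_z == [x * y]
```

Since the three inequalities admit exactly the same integer solutions as the product, a solver can work with them in place of the nonlinear term.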

5 PERFORMANCE MEASURE

The performance of the routing algorithm (K-map) is analyzed and compared with a separate optimization scheme (joint), in which the minimum number of servers to be activated is found first and the traffic routing scheme is then described using the network flow model. The resulting performance graphs cover the non-joint, joint, and genetic algorithm schemes. In the graphs below, the values of the joint and K-map schemes are compared; the K-map values are higher than those obtained with the joint linear method. On this basis, the individual costs for servers, communication, and operation are determined.

[Figure: (a) Server cost versus number of replicas; (b) Communication cost versus number of servers; (c) Operation cost versus number of servers. Each panel compares JOINT and K-MAP.]


Extensive numerical studies show the high efficiency of the proposed joint-optimization based algorithm. This work can be enhanced by coupling a genetic algorithm with a grid search method to solve mixed-integer nonlinear programming problems.

REFERENCES

[1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[2] S. Gunduz and M. Ozsu, “A Poisson model for user accesses to web pages,” in Computer and Information Sciences - ISCIS 2003, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, vol. 2869, pp. 332-339.

[3] H. Xu, C. Feng, and B. Li, “Temperature Aware Workload Management in Geo-distributed Datacenters,” in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2013, pp. 33-36.

[4] http://en.wikipedia.org/wiki/K_shortest_path_routing

[5] L. Gu, D. Zeng, P. Li, and S. Guo, “Cost Minimization for Big Data Processing in Geo-Distributed Data Centers,” IEEE Transactions on Emerging Topics in Computing, doi: 10.1109/TETC.2014.2310456.

[6] Z. Liu, M. Lin, A. Wierman, S. H. Low, and L. L. Andrew, “Greening Geographical Load Balancing,” in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2011, pp. 233-244.

[7] Z. Liu, Y. Chen, C. Bash, A. Wierman, D. Gmach, Z. Wang, M. Marwah, and C. Hyser, “Renewable and Cooling Aware Workload Management for Sustainable Data Centers,” in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2012, pp. 175-186.

[8] I. Marshall and C. Roadknight, “ss,” Computer Networks and ISDN Systems, vol. 30, no. 223, pp. 2123-2130, 1998.

[9] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, “No “Power” Struggles: Coordinated Multi-level Power Management for the Data Center,” in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2008, pp. 48-59.

[10] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, “Xoring elephants: novel erasure codes for big data,” in Proceedings of the 39th International Conference on Very Large Data Bases, ser. PVLDB'13. VLDB Endowment, 2013, pp. 325-336.

[11] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, “Cutting the Electric Bill for Internet-scale Systems,” in Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM). ACM, 2009, pp. 123-134.

[12] R. Urgaonkar, B. Urgaonkar, M. J. Neely, and A. Sivasubramaniam, “Optimal Power Cost Management Using Stored Energy in Data
