Workload Factoring with the Cloud: A Game-Theoretic Perspective

(1)

Workload Factoring with the Cloud: A

Game-Theoretic Perspective

Amir Nahir

Department of Computer Science Technion, Israel Institue of Technology

Haifa 32000, Israel Email: [email protected]

Ariel Orda

Department of Electrical Engineering Technion, Israel Institue of Technology

Danny Raz

Department of Computer Science Technion, Israel Institue of Technology

Abstract—Cloud computing is an emerging paradigm in which tasks are assigned to a combination (“cloud”) of servers and devices, accessed over a network. Typically, the cloud constitutes an additional means of computation and a user can perform workload factoring, i.e., split its load between the cloud and its other resources. There is an intrinsic relation between the “benefit” that a user perceives from the cloud and the usage pattern followed by other users, giving rise to a non-cooperative game, which we model and investigate. We show that the considered game admits a Nash equilibrium. Moreover, we show that this equilibrium is unique. We investigate the “price of anarchy” of the game and show that, while in some cases of interest the Nash equilibrium coincides with a social optimum, in other cases it can be arbitrarily large. We demonstrate that, somewhat counter-intuitively, exercising admission control to the cloud may deteriorate its performance. Furthermore, we demonstrate that certain (heavy) users may “scare off” other, potentially large, communities of users. Accordingly, we propose a resource allocation scheme that addresses this problem and opens the cloud to a wide range of user types.

I. INTRODUCTION

Cloud computing is a new and emerging paradigm in which tasks are executed by several services deployed on a set of distributed servers and accessed via a common network. This network of servers and devices, collectively known as

the cloud, can offer a significant reduction in computing / storage cost due to economy of scale. In fact, computing at the scale of the cloud allows users to reach into the cloud for resources as they need them, and to attain supercomputer-level power using any standard client with network connectivity. For this reason, cloud computing has also been described as “on-demand computing” [6]. It is important to note that, for many typical users, the cloud constitutesan additionalmeans of computation, i.e., in additional to other (typically local) resources. A user can thus split its load between the cloud and its other resources, that is, perform workload factoring

[20].

Since the cloud offers its services to a multiplicity of potential users, there is an intrinsic relation between the “benefit” that a user perceives from the cloud and the usage pattern followed by other users. This relation can come in many different forms. For example, when the cloud instan-taneously shares its resources among current active jobs, i.e., essentially provides a “best effort” service, each user perceives

the presence of the other users through the congestion effect of their submitted jobs. A second example is a cloud that provides some form of QoS (quality of service) guarantees, hence mitigating (or eliminating altogether) such congestion effects. In this case, the cloud may exercise some admission or rate control, in order to meet such guarantees. Hence, the presence of other users is perceived at a longer time scale, namely by occasionally being denied access to the cloud, or experiencing rate limitations [15]. These two examples are typical of clouds that provide free access, e.g. Google Apps for end users or private clouds for corporate users. Yet a third example is a cloud that charges prices for usage, possibly offering various levels of QoS for different price ranges, e.g. [2]. Here, high demand for service would (slowly) translate into higher prices, hence relating again between the aggregate behavior of users and the benefit of the cloud as perceived by an individual user1.

While applying to a wide range of time scales, all these ex-amples identify a common and fundamental property, namely: a dependence between the level of usage of the cloud and the utility perceived by each user. The dependence is, in fact, bi-directional, as the perceived utility affects the user’s decision on whether to use the cloud and to what extent. Since cloud users can be expected to make decisions in a self-optimizing (selfish) manner, we face a non-cooperative game, whose investigation is the subject of this study.

We consider users that can use the cloud as well as local resources in order to execute computation tasks, i.e., “jobs”. More specifically, each user needs to decide how to split its job between the cloud and some local, private resources, such that the the completion time of the job would be minimized. Since the local resource is private, it provides guaranteed performance, i.e., it does not depend on the decisions nor job patterns of other users; the cloud, on the other hand, consitutes a more powerful resource, yet its performance does depend on the loads presented by other users. We establish that the underlying non-cooperative game admits a Nash equilibrium. We note that a major complication in establishing this property

1_{We note that some price-charging clouds may fall within the framework} of the second example rather than the third, due to the observed preference of ISP customers of flat fees over usage-based pricing [12], which would call for the employment of rate control schemes [15].

(2)

stems from an inherent discontinuity of the users’ objective functions. Informally, this discontinuity occurs at the point where a user starts sending part of its job to the cloud, making its completion time depend not only on the local resource but on thelaterbetween two completion times, i.e., of the portions executed locally and in the cloud. Furthermore, we establish that the equilibrium point is unique. These structural results enable us to investigate the performance of such systems, and, in particular, the inherent inefficiency that stems from the noncooperative behavior. Specifically, we study the price of anarchyof the game and indicate that it may heavily depend on the specific manner in which users perceive the performance they experience in the cloud, as well as on the cloud’s behavior under heavy load. For example, we demonstrate a case where users make worst-case decisions (e.g., when handling time-critical jobs), in which the unique Nash equilibrium coincides with the social (i.e., system-wide) optimum; whereas in an-other case, where the cloud performs outsourcing under heavy load (employing techniques such as cloud federation [16]), we demonstrate that the price of anarchy may be arbitrarily large. We then turn to investigate practical implications of our formal model and analysis. An important question is whether proper design and management of the system can mitigate the defficiencies of noncooperative behavior. One defficiency, measured by the price of anarchy, is the sub-optimality of the social performance at the Nash equilibrium. Yet, from a practical standpoint, there are other defficient aspects that need to be considered, for example the potential “conquest” of the cloud by certain communities of users (e.g., “heavy users”), which “scare off” all others. Our study indicates that the employment of management tools can indeed come at rescue in some cases, yet, and somewhat counter-intuitively, in other cases they may fail to deliver. Specifically, we demonstrate the latter (negative) finding by showing that the employment of an apparently appealing admission control scheme deteriorates system-wide performance; whereas we demonstrate the former (positive) finding by showing how the partition of the cloud into “virtual clouds” can result in increasing its attractiveness to various types of users.

To summarize, the main contributions of this paper are as follows.

• Formal modeling of workload factoring with the cloud as a noncooperative game.

• Establishment of the existence and uniqueness of the game’s Nash equilibrium.

• Investigation of the game’s price of anarchy.

• Investigation of practical implications of the formal model and analysis, specifically:

– demonstrating that admission control may fail to improve system performance;

– demonstrating how proper resource allocation within the cloud can cope with its potential “conquest” by certain classes of users.

The paper is organized as follows. After discussing related work in the next section, in Section III we formalize the

model. In Section IV we establish the existence and uniqueness of the Nash equlibrium. In Section V we investigate the price of anarchy. The next two sections deal with practical implications: admission control is considered in Section VI and resource allocation is considered in Section VII. Finally, conclusions appear in Section VIII. Due to space limits some details are omitted and can be found (online) in [11].

II. RELATEDWORK

Game theoretic models have been employed in various related contexts, such as networking [7], [9], [13], peer-to-peer networks [10] and distributed systems [3], [5]. These studies mainly investigated the structure of the system operating points i.e., the Nash equilibria of the respective games. Such equilib-ria are inherently inefficient [4] and, in general, exhibit sub-optimal system performance. As a result, the question of how much worse the quality of a Nash equilibrium is with respect to a centrally enforced optimum has received considerable attention e.g., [8], [17], [18]. A prominent measure proposed to quantify this effect is the price of anarchy[14], which is the worst possible ratio between the social performance at a Nash equilibrium and the corresponding social optimum.

A study that is particularly relevant to our framework is [1]. That study considers a problem where “customers” arrive to a “public” mainframe and each needs to decide whether to execute its job there or on a less powerful yet private machine. While choosing between executing jobs on a shared infrastructure (namely, the cloud) or on a private facility is also the problem investigated in our study, there are several major differences between our framework and that of [1]. First, in [1], a customer appears once in the system, and, based on the current load of the mainframe, makes a single, one time and binary decision, namely “mainframe” or “private machine”; we, on the other hand, envision users that consider using the cloud facilities over a long time scale and for multiple jobs, hence the decision of each user is a strategy on the portion of its jobs that would be served by the cloud. Moreover, while [1] focuses solely on a processor-sharing service regime in the mainframe, we consider also other, and more general, classes of service; in particular, while [1] assumes that all jobs perceive the service in the mainframe in the same way, we allow heterogeneity among users. Game theoretic tools have been hardly used in the context of cloud computing. One example is [19], which addresses a task scheduling problem in the presence of multiple decision makers, taking a game theoretic perspective. Another study that is relevant to ours is [20], which also deals with workload factoring; however, that study assumed that the resources that a user gets from the cloud are not shared in any way, hence the game scenario was not considered.

III. MODEL

We consider a “cloud workload factoring” (CWF) game of

N players, each of which is a user that has a job that requires processing. Player strategies are the choice of the fraction of the job that will be processed by a shared resource (i.e.,

(3)

the cloud). We denote player i’s strategy byσi. The strategy profile of all players is denoted byσ. The goal of each player is to minimize the expected completion time of its job.

Each playeriis characterized by the size of its job, denoted by ωi (measured in bits), and by its private computational resource, whose performance is manifested by a delay (i.e., computation time) function Li(·). For example, a player may have a private computer that serially processes jobs, in which case Li(σi) =

(1−σi)ωi

µi for some computing rate µi

2_{. We} assume that Li(·)is monotonically increasing in the amount of work (i.e., monotonically deceasing inσi). In addition, we assume that Li(1) = 0, i.e., if player isubmits its entire job to the cloud, it local computation time is 0.

We denote the estimated completion time of a job of player

iin the cloud byT_icloud, noting that, in general, it depends on the loads submitted to the cloud by the various players. We assume thatTcloud

i is a sum of two components, namelydi(σi) andD(σ). The first component,di(σi), is a delay component that is specific to the player and independent of the load submitted by other players to the cloud, i.e.,di:σi→ <+. For example,di can correspond to the time required to send a job to the cloud, which, in turn, may depend on the geographical location of the player with respect to the cloud’s resources. We assume thatdi(·)is continuous, monotonically non-decreasing, and, in addition, di(0) = 0. The second component, D(σ), is the cloud’s processing delay, which depends on the particular loads submitted to it. Since the job sizes,{ωj}N1 , are assumed

to be constant, D is a function of the strategy profile σ, i.e., D : σ → <+_{. We assume that} _D₍_·₎ _{is continuous and}

monotonically increasing (in every one of its parameters), and, in addition,D(0) = 0. We thus have:

T_icloud(σ) =D(σ) +di(σi).

A player’s completion time is the time in which its job completes processing (both locally and in the cloud), formally:

Ti(σ) = max

T_icloud(σ), Li(σi) .

Note that in case the player chooses not to schedule any work to the cloud, its completion time is Li(0).

We denote the cloud’s actual processing completion time by

TCloud(σ). We note thatTCloud(σ)may be different than the expected completion time perceived by the player, T_icloud(σ). For example, consider a case in which the cloud processes jobs serially. A player may expect to get an average place in the cloud’s queue, and therefore D(σ) = 1

2 · N P i=1 σi·ωi µc , and di(σi) = 1₂ · σi_µ·ωi

c . However, the cloud’s actual completion

time, i.e., among all jobs, would beTCloud(σ) =

N P i=1 λi·ωi µc 3_. A player best-response moveis a strategyσi which, given the strategies of all other players,σ−i, yields the lowest value

2_{Note that}₍₁₋_σ

i)is the portion of the job that is processed locally.

3_{We note that if players are concerned with worst-case performance, rather} than average performance, then D(σ) = TCloud₍_σ₎_{; this point is further}

discussed in Section V.

toT_icloud(< σ−i, σi >). A strategy profileσ is said to be at

Nash equilibriumif each player considers its chosen strategy to be the best under the given choices of other players. We note that for certain values of σ−i, Ticloud(< σ−i, σi >) is not a continuous function ofσi since the completion time for

σi = 0(no computation in the cloud) may be much smaller than the completion time where a very small fraction is sent to the cloud. This discontinuity prevents us from adopting “classic” results on the existence of Nash equilibria. Moreover, we note that each player may have a differentdi, which further complicates the analysis of Nash equilibria.

In order to quantify the performance of the CWF game from a social (i.e., system-wide) perspective, we associate with the game a social cost function, denoted byC(·). The social cost typically depends on the performance of the users, and various forms can be considered, e.g., completion time of the last job (i.e., makespan), average completion time, etc.

The selfish behavior of the players may lead to system-wide inefficiency. We quantify this inefficiency through the worst possible ratio between the social cost at Nash equilibrium and the social cost of the corresponding optimal solution. In keeping with common terminology [8], [14], this ratio is termed the price of anarchy and it quantifies the “penalty” incurred by lack of cooperation (or coordination) among the players in a noncooperative game.

IV. NASHEQUILIBRIUMEXISTENCE ANDUNIQUENESS

In this section we establish the existence and uniqueness of the Nash equilibrium.

A. Existence

First, we prove that the CWF game admits a Nash equi-librium. As pointed out in the previous section, this task is especially challenging due to the discontinuity of the objective function as well as the heterogeneity among users in terms of thedi component. We begin by establishing several properties of the CWF game.

Lemma 1: Let ibe some player. Let σ denote a profile in whichiis at best response. Then σi<1.

Proof: Assume by negation that the claim is wrong. Therefore, there exists a profile σ and a player i, such that

σi = 1 and i is at best response. First, we choose σi0, such that

Li(σi0)< D(< σ−i, σi0>) +di(σi0),

. We note that suchσ_i0 is guaranteed to exist sinceLi(·),D(·) and di(·) are continuous, and Li(1) = 0. Next, we observe that bothLi(σ0i)< Ticloud(< σ−i, σ0i>)(sinceσ0iwas chosen that way), and, in addition,Tcloud

i (< σ−i, σ0i>)< Ticloud(σ). Therefore, it holds that Ti(< σ−i, σi0 >)< Ti(σ), which is in contradiction to the assumption thatiis at best response in

σ.

Corollary 1: Let σdenote a Nash equilibrium profile. For every playeri,σi<1.

The above corollary implies that, at Nash equilibrium, every player makes some use of its local resource. Following similar

(4)

reasoning, it is easy to see that in every Nash equilibrium profile, there exists at least one player that makes use of the cloud.

Next, we show that, when a player is at best response, either it processes its entire job locally, or its local completion time is equal to that of the cloud.

Lemma 2: Leti be some player. Let σdenote a profile in whichi is at best response. Then, eitherLi(σ0i) =T

cloud i (σ) or σi= 0andD(σ)≥Li(ωi).

Proof: Assume by negation that the claim is wrong. Ifσi= 0, andD(σ)< Li(ωi), then there exists some >0 such that

D(< σ−i, >) +di()≤Li(),

so player i can unilaterally improve its cost by moving of its job for processing into the cloud. Otherwise (i.e., σi 6= 0 ), we consider two cases: in case Li(σi0)< T

cloud

i (σ), player

ican unilaterally improve its cost by reducingσi; and in case

Li(σ0i)> Ticloud(σ)playeri can unilaterally improve its cost by increasingσi.

Corollary 2: Letσ denote a Nash equilibrium profile. For every player i, either Li(σi0) = T

cloud

i (σ) or σi = 0 and

D(σ)≥Li(ωi).

The following claims establish some monotonicity among the players in terms of their usage of the cloud, according to the power of the local resources.

Lemma 3: Leti, jbe two players such thatLi(0)≥Lj(0). Let σ be a profile such that j is at best-response, and, in addition,σj >0. Letσ0 be a profile following playeri’s best-response move to σ. Thenσ_i0>0.

Proof:First, we note that, sincejis at best-response when

σ is applied, andσj >0, it holds thatLj(σj) =Tjcloud(σ)<

Lj(0); this is due to the fact thatLj(0)is the time required to compute all of player j’s work locally. Assume, by negation, that the claim is wrong, i.e., following playeri’s best-response move, σ0_i = 0. Hence, Li(0) ≤ D(< σ−i,0 >) (i.e., player

i opts not to submit any work to the cloud, since the cloud’s delay is higher then the time required for player ito process its job locally). Thus we get:

Li(0)≤D(< σ−i,0>)≤D(σ)≤Tjcloud(σ) =

Lj(σj)< Lj(0) which is a contradiction.

Corollary 3: Let σ∗ denote a Nash equilibrium profile. Assume, w.l.o.g., that the players are ordered in ascending order of local computation time, i.e., L1(0)≥L2(0)≥ · · · ≥

LN(0). Let k be a player such thatσ∗k >0. It holds that for every player i,i < k,σ∗_i >0.

Corollary 4: Let σ∗ denote a Nash equilibrium profile. Assume, w.l.o.g., that the players are ordered in ascending order of local computation time, i.e., L1(0)≥L2(0)≥ · · · ≥

LN(0). Thenσ1∗>0.

Proof: Follows directly from Lemma 3 and from the observation that, at Nash equilibrium, at least one player makes use of the cloud.

We turn to prove the existence of a Nash equilibrium. We do so by first establishing the existence of a Nash equilibrium in a 2-player CWF game, and then move on to extend it, inductively, to the general,N-player case.

Theorem 1: The 2-player CWF game admits a Nash equi-librium profile.

Proof: Assume, w.l.o.g., thatL1(0) ≥L2(0). We show,

by construction, that the game has a Nash equilibrium point. We begin with a state in which neither of the players submits any of its workload to the cloud, i.e., σ0 ₌_{< σ}0

1 = 0, σ20 =

0>, and proceed by having each player, in its turn, perform a best-response move.

Letσ1

1 denote the relative amount of workload submitted by

player 1 to the cloud following the first best-response move. It is easy to see that σ1

1 >0. Next, we consider player 2; if

D(< σ1

1,0 >)≥L2(0) (i.e., the relative workload submitted

by player 1 to the cloud generates a delay that is longer than the time required for player2 to process its entire work locally), it holds that< σ11,0>is a Nash equilibrium profile.

Otherwise, it holds that player 2 will submit some non-zero relative workload, denoted byσ12, for processing in the cloud.

Consider, then, the latter case.

When player 1 performs a best-response move again, it is clear thatD(< σ11,0>)< D(< σ11, σ21>)(i.e., the workload

submitted by player2to the cloud increased the cloud’s delay). Thus, it holds that L1(σ11) < D(< σ11, σ12 >) +d1(σ11), so

player1’s best-response move would be to decrease its relative workload submitted to the cloud, i.e.,σ2

1< σ11.

In a similar fashion, when player2performs a best-response move again, it holds that D(< σ11, σ12 >)> D(< σ12, σ12 >)

(i.e., since player1reduced its relative workload in the cloud, the cloud’s delay has decreased). Therefore, player 2’s best-response move would be to increase it relative work load, i.e.,

σ2 2 > σ21.

Once again, since player 2 increased its relative workload in the cloud, it holds that player1 would decrease its as part of his next best-response move, i.e., σ3

1 < σ12; we can apply

the same process repeatedly.

We conclude that the sequence of player2’s relative work-loads following its best-response moves constitues a monoton-ically increasing sequence,{σ2}

k

1. By definition, it holds that

this sequence is upper bounded (by1), and therefore converges to a non-zero value. We denote its limit byσ₂∗.

Similarly, it holds that the sequence of player 1’s relative workloads following its best-response moves constitutes a monotonically decreasing sequence, {σ1}

k

1. By definition, it

holds that this sequence is lower bounded (by0), and therefore converges. We denote its limit by σ∗

1. Following Corollary 3,

sinceσ1

2 >0, it holds thatσ1∗>0.

Finally, we conclude that σ∗ =< σ1∗, σ2∗ > is a Nash

equilibrium profile and the Theorem follows.

For extending the above result to the N-player case, we need the following auxiliary claims, which compare between clouds of different performance in terms of best responses and Nash equilibrium profiles.

(5)

Lemma 4: Leti, j be two players. Let D1(·), D2(·) repre-sent two clouds. Letσ1_{be a profile such that players}_{i, j}_{are at}

best response withD1₍_·₎_{. Let}_σ2_{be a profile such that players}

i, j are at best response withD2₍_·₎_{. Assume}_σ2

i > σ1i. Then

σ2

j ≥σ1j.

Proof: First, we note that since σ2

i > σ1i, it holds that

Li(σ2i)< Li(σi1), in addition, since playeriis at best response both in σ1 _and_σ2_{, it follows that:}

D2(σ2) +di(σi2) =Li(σ2i)<

Li(σi1)≤D

1₍_σ1₎_.

Hence, it follows thatD1₍_σ1₎_{> D}2₍_σ2₎_.

Assume, by negation, that σ2

j < σj1. It follows that

Lj(σj2) ≥ Lj(σj1). In addition, since player j is at best response both in σ1 _and_σ2_{, it follows that:}

D2(σ2) +dj(σj2)≥Lj(σj2)≥

Lj(σj1) =D

1₍_σ1_{) +}_d

j(σ1j), which is a contradiction.

Lemma 5: Let D1₍_·₎_{, D}2₍_·₎ _{represent two clouds, such}

that for every σ, D1₍_σ₎ _{> D}2₍_σ₎ _(i.e., _D2₍_·₎ _represents

a “better” cloud). Let σ1, σ2 be Nash equilibrium profiles w.r.t D1(·), D2(·), respectively. Furthermore, assume that σ1

is such that all players submit some work to the cloud. Then, there exists at least one player, i, such thatσ2_i > σ_i1.

Proof:Assume, by negation, that the claim is wrong, i.e., for every player j,σ2

j ≤σj1. SinceD1(σ)> D2(σ), and, in addition, both D1₍_·₎ _and_D2₍_·₎_{are monotonicaly increasing,}

we get that:

D1(σ1)≥D1(σ2)> D2(σ2).

Consider some playerk. Sinceσ1_{is a Nash equilibrium profile}

and σ1

k > 0, it holds that Lk(σ1k) = D1(σ1) +dk(σ1k). In addition, since σ2 _{is a Nash equilibrium profile, it holds that}

Lk(σk2) ≤ D2(σ2) +dk(σk2). Since σ2k ≤ σk1, it holds that

Lk(σk2)≥Lk(σk1); it follows that: D1(σ1) +dk(σk1) =Lk(σk1)≤ Lk(σ2k)≤D 2₍_σ2_{) +}_d k(σ2k), which is a contradiction.

Corollary 5: Let D1₍_·₎_{, D}2₍_·₎ _{represent two clouds, such}

that, for every σ, D1₍_σ₎ _{> D}2₍_σ₎ _(i.e., _D2₍_·₎ _represents

a “better” cloud). Let σ1_{, σ}2 _{be Nash equilibrium profiles}

w.r.t D1₍_·₎_{, D}2₍_·₎_{, respectively. Furthermore, assume that} _σ1

is such that all players submit some work to the cloud. For every player i, σ2

i ≥ σ1i, and, in addition, for at least one player a strict inequality holds.

Proof: Follows directly from Lemmas 4 and 5.

Theorem 2: TheN-player CWF game admits a Nash equi-librium profile.

Proof:By induction on the number of players. ForN = 2, the claim was established in Theorem 1. Assume now that any

N =k−1player game has a Nash equilibrium profile. We turn to consider ak-player game. Assume, w.l.o.g., that players are

ordered in ascending order of workload, i.e.,L1(0)≥L2(0)≥

. . .≥Lk(0).

We define a new (k−1)-player’s game, which includes only players 1,2, . . . , k−1 (i.e., we exclude player k from the game). This game’s cloud load function,D1(·), is defined based onD(·)(the load function of the original game), in the following manner:

D1(σ1, σ2, . . . , σk−1) =D(σ1, σ2, . . . , σk−1,0),

i.e., D1(·) is equal to D(·), when player k sets its relative workload to0. By the inductive hypothesis, thek−1-player’s game has a Nash equilibrium profile. We denote it byσ₋1_k.

Returning to thek-player’s game, we consider the strategy profile σ1 =< σ₋1_k,0 > (i.e., the first k−1 players play as in the Nash equilibrium profile of the(k−1)-player’s game, and player k does not submit any work to the cloud). We continue by letting player k perform its best-response move. IfD(σ1₎_≥_L

k(0)(i.e., the work submitted by the firstk−1 players to the cloud generates a delay which is longer then the time required for playerk to process its entire work locally), it holds that σ1 _{is a Nash equilibrium profile. Assume, then,}

that D(σ1₎_{< L}

k(0). First, we note that, in this case, every playeri,1≤i≤k−1, submits some work to the cloud, since

D(σ1)< Lk(0)≤Lk−1(0)≤ · · · ≤L1(0).

Next, we let player k perform its best-response move, and denote the relative workload it submits to the cloud byσ1

k. Again, we define a new(k−1)-player’s game, by excluding playerk. For this game, we define the cloud’s delay function

D2₍₎_{in the following manner:}

D2(σ1, σ2, . . . , σk−1) =D(σ1, σ2, . . . , σk−1, σ1k). By the inductive hypothesis, the(k−1)-player’s game has a Nash equilibrium profile, which we denote byσ2

−k. Following Corollary 5, it holds that, for every playeri, i= 1,2, . . . , k,

σ2

−k(i) ≤ σ1−k(i) and, in addition, for at least one player a strict inequality holds. Therefore, when player k is given another chance to perform a best-response move, the cloud delay that it observes is better (lower) than the one that it observed in its previous turn. Therefore, playerkwill increase the relative load it submits to the cloud, and the process continues repeatedly.

We conclude that the sequence of playerk’s relative work-loads following its best-response moves,{σk}

m

1 , is

monotoni-cally increasing. By definition, this sequence is upper bounded (by1) and therefore converges. Denote its limit byσ_k∗.

Similarly, it holds that the sequence of Nash equilibrium profiles of the(k−1)player games,{σ−k}m₁, is monotonically decreasing. By definition, this sequence is lower bounded (by

~₀_{), and therefore converges. Denote the limit of this sequence}

by σ∗₋_k. Following Corollary 3, since σ_k∗ >0, it holds that, for every playeri,σ∗_i >0.

We conclude thatσ∗=< σ₋∗_k, σ∗_k >is a Nash equilibrium profile and the Theorem follows.

(6)

B. Uniqueness

We turn to establish the uniqueness of the Nash equilibrium.

Theorem 3: The CWF game admits a unique Nash equilib-rium profile.

Proof:

Assume, by negation, that the claim is wrong. Therefore, there exist two different profiles, σ1_{, σ}2_{, such that both}

con-stitute Nash equilibrium points. Hence, there exists at least one player i, such that σ1

i 6=σ2i. Assume, w.l.o.g, thatσ1i > σi2. Since Li(·) is monotonically increasing in the workload, it follows thatLi(σi1)< Li(σi2). Since bothσ1andσ2are Nash equilibrium profiles, it holds that:

D(σ1) +di(σi1) =Li(σ1i)< Li(σi2)≤D(σ

2₎_,

therefore,D(σ1₎_{< D}₍_σ2₎_.

Next, we consider any player j, and claim that σ1_j ≥ σ_j2. Assuming the claim is wrong, it holds that there exists at least on player j, such that σ1

j < σj2. Following similar reasoning to that applied above, we conclude that:

D(σ2) +dj(σi2) =Li(σ2i)< Li(σi1)≤D(σ

1₎_,

which contradicts our previous conclusion that D(σ1₎ _<

D(σ2₎_.

We thus conclude that, for every player i, σ1

j ≥ σj2, and, in addition, for at least one player a strict inequality applies. Since D(·)is monotonically increasing, it follows that

D(σ1·ω)> D(σ2·ω),

which contradicts our first conclusion, hence the theorem follows.

V. PRICE OFANARCHY

We turn to analyze the Price of Anarchy [8] of the CWF game. We shall demonstrate that this value may heavily depends on the specific manner in which users perceive the performance they experience in the cloud, as well as on the cloud’s behavior under heavy load.

First, we consider the case in which players base their decisions on worst-case cloud performance (i.e., D(·) ≡

TCloud₍_·₎_{). In addition, we assume players incur no private} cost (i.e., di(·)≡ 0). We consider as the game’s social cost its makespan, i.e., the time in which the last job completes processing. We show that, in this case, the game’s Price of Anarchy is1, i.e., it coincides with the social optimum.

Theorem 4: Considering the makespan as the social cost, if

di(·)≡0 andD(·)≡TCloud(·), then the price of anarchy of the CWF game is1.

Proof:Letσ∗denote the game’s Nash equilibrium profile and σOP T _{a socially optimal profile. We show that} _σOP T ₌

σ∗. Assume by negation that the claim is wrong. It follows that there exists some player i, such that σOP T

i 6=σ∗i. First, we consider the case that, for some playeri,σ_i∗> σOP T

i >0. In this case, it holds thatLi(σ∗i)< Li(σOP Ti ), and therefore

C(σ∗) =D(σ∗) =Li(σi∗)< Li(σiOP T)≤C(σ OP T₎_,

where the second equality follows from Corollary 2. The above is in contradiction toσOP T_{’s optimality. Therefore, we} conclude that, for any playerisuch thatσ∗

i >0, it holds that

σOP T i ≥σ∗i.

Next, we consider the case that, for some playeri,σOP T i >

σ_i∗. In this case, since for every other player σ_iOP T ≥σ_i∗, it holds thatD(σOP T)> D(σ∗), and therefore

C(σ∗) =D(σ∗)< D(σOP T)≤C(σOP T),

which is in contradiction toσOP T_{’s optimality, and the} theo-rem follows.

Consider now the same setting, with the following change: now, players do not base their decisions onworst-casecloud performance but rather onexpectedcloud performance.

Specifically, consider a cloud that processes jobs in serial fashion, formally: TCloud(σ) = N P i=1 σi·ωi µc .

We assume that all players are identical (in job size and local computation function), and, in addition, the local computation model is that of linear processing, that is

Li(σi) =

(1−σi)·ωi

µ ,

for some arbitraryµ.

Players expect the cloud to behave as a simple queue, and expect to get an average location in the cloud’s job queue, that is: D(σ) = N−1 2 · σi·ωi µc .

Note that the above expression relies on the fact that all players are identical, and therefore assumes that the Nash equilibrium is symmetric.

In addition the to waiting time in the queue, each player also has to account for the processing of its own job, therefore, we set

di(σi) =

σi·ωi

µc

.

As before, the game’s social cost is its makespan, i.e.,

C(σ) = maxnTCloud(σ),max

i {Li((1−σi)·ωi, µi)}

o

First, we calculate the Nash equilibrium profile,σ∗, of this game. In view of the above assumptions, we have:

(1−σ∗)·ω µ = N−1 2 · σ∗·ω µc +σ ∗_·_ω µc , which yields: σ∗= 2µc 2µc+ (N+ 1)µ .

(7)

Thus, the social cost at the Nash equilibrium is

C(σ∗) = 2N ω 2µc+ (N+ 1)µ

.

Next, we calculate the optimal value of the social cost. By Theorem 4, an optimal configuration, σOP T_{, coincides with} the Nash equilibrium of a game in which

D(·)≡TCloud(·)≡ N X i=1 σi·ωi µc

and d(·) ≡ 0. In such a game, all players are at Nash equilibrium when, for each player,

(1−σ)·ω µ = N·σ·ω µc , which yields: σ= µc µc+N·µ .

The optimal value of the social cost is thus:

C(σOP T) = N·ω

µc+N·µ

.

Therefore, the price of anarchy is

C(σ∗) C(σOP T₎ = 2N ω 2µc+(N+1)µ N·ω µc+N·µ = 2(µc+N µ) 2µc+ (N+ 1)µ .

We note that this expression is upper bounded by 2, and, the stronger the cloud, the lower the value of this expression. For example, if the cloud has equal power to the aggregate of all users, i.e., µc=N µ, then the price of anarchy is roughly

4

3, while ifµc = 10N µ, then the price of anarchy is (roughly) 22

21.

We turn to demonstrate a case where the price of anarchy can be arbitrarily large. Consider a cloud that employs cloud federation[16], thus enabling it to “outsource” some of its load to other clouds. More specifically, the cloud processes jobs serially, but, due to limited capacity, after reaching a certain load threshold, it starts sending (outsourcing) some of its jobs to another cloud. That is, under such conditions, every job that is submitted to the cloud is outsourced with some probability

p. Furthermore, we assume that an outsourced job experiences a (fixed) delay, denoted by∆, which is significantly larger than the delays incurred when being processed at the original cloud. Specifically, we assume: ∆>1 p N X i=1 ωi µc .

As before, we assume that all players are identical (in job size and local computation resources), di(σi) ≡ 0, and the social cost of the game is its makespan. Finally, we assume that

2·p·∆< Li(0)<3·p·∆.

Since players are identical, the load threshold can be ex-pressed in terms of the strategy profile σ, namely: the cloud performs outsourcing when

N

P

i=1

σi> σT for someσT.4 A player considers the expected delay that its submitted work would experience at the cloud, including possible out-sourcing, therefore, for

N P i=1 σi> σT, we have: D(σ) = (1−p)· N X i=1 σi·ωi µc +p·∆, whereas for N P i=1 σi< σT, we have: D(σ) = N X i=1 σi·ωi µc .

Due to the assumption made on∆, it follows that, below the threshold, D(σ) < p·∆, and above the threshold, D(σ) <

2·p·∆.

Since Li(0) > 2·p·∆ and Li(1) = 0, the continuity of Li(σi) implies that there is a σ˜i, 0 < σ˜i < 1, such that

Li( ˜σi) =p·∆. Assume now that the number of users, N, is large enough, so that:

N·σ˜i> σT.

Let σ∗ be the Nash equilibrium profile. Since the players are identical, it follows from Corollary 4 thatσ_i∗>0for alli. Therefore:Li(σi∗) =D(σ∗). We proceed to show that, at Nash equilibrium, outsourcing is in effect. By way of contradiction, assume that, at Nash equilibrium, the cloud operates below the threshold, i.e.,

N

P

i=1

σ_i∗< σT. This implies thatD(σ∗)< p·∆, thusLi(σ∗i)< p·∆. SinceLi( ˜σi) =p·∆, the monotonicity of Li implies that σi∗ > σ˜i, thus N · σi∗ > σT, hence establishing that, at Nash equilibrium, the cloud operates above the threshold.

Consider now the social cost at Nash equilibrium, i.e.,

C(σ∗). Since at Nash equilibrium all players are subject to outsourcing and the social cost is the makespan, C(σ∗) is lower-bounded by ∆ times the probability of at least one outsourcing occurs, i.e.,C(σ∗)>(1−(1−p)N₎_·_∆_{. On the} other hand,Li(0)<3·p·∆implies that the optimal social cost is upper bounded by the latter value, i.e.,C(σOP T)<3·p·∆; indeed, this can be achieved by executing all jobs locally (since all jobs incur the same delay, the makespan is equal to that value). Therefore, the price of anarchy is larger than

(1−(1−p)N₎_·_∆

3·p·∆ =

1−(1−p)N

3·p

which can be made arbitrarily large by considering a large population of users and small values of p, e.g., p = and

N =1 for a sufficiently small .

4_{For sustaining continuity, we could assume that the probability of} outsourc-ing grows linearly from0topin a small neighborhood around the threshold, without affecting the discussion and findings that follow. For simplicity, we ommit these details.

(8)

VI. ADMISSIONCONTROL

As demonstrated in the previous section, when all players are at Nash equilibrium, the system may exhibit degraded performance. To account for that, a cloud service provider may try to enforce some admission control policy to guarantee that the system’s delay remains adequate. Indeed, this is a rather “natural” reaction of a provider to high load. In this section we explore this possibility, focusing on the following scenario:

• di(·)≡0,

• the players’ perceived cloud delay, D(·), depends (only) on the aggregate load,

• all players are identical in terms of job size and local computation, and

• the local computation model is that of linear processing, that is

Li(σi) =

(1−σi)·ωi

µ ,

for some arbitrary µ.

We show that, under this scenario, even the application of the most extreme admission control policy, namely, preventing any number of users from accessing the cloud, causes a deterioration in the social cost.

We first show that the average completion time increases when any number of players are not permitted to access the cloud.

Lemma 6: For the scenario specified above, preventing any number of players from accessing the cloud increases the average job completion time of the CWF game.

Proof: Assume, by negation, that the claim is wrong. Therefore, there exist someKplayers, such that if they are not allowed to access the cloud, the average job completion time does not increase. We note that, since all players are identical, the identity of the K players is irrelevant.

Denote byσ1_{the Nash equilibrium profile when all players}

may access the cloud. since all players are identical, it is easy to see that, for every two players i, j, it holds thatσ1

i =σj1. Since D(·)is monotonically increasing in the aggregate load, it also holds that

Li(σ1i) =D(N·σ

1

i). (1)

Denote byσ2_{the Nash equilibrium profile when}_K_players

may not access the cloud. Therefore,(N−K)identical players make use of the cloud, hence all of them submit the same amount of work for processing in the cloud. Therefore,

Li(σ2i) =D((N−K)·σ

2

i). (2)

It is easy to see that when K players cannot access the cloud, the remaining(N−K)players will increase their usage of the cloud, i.e., σ2

i > σi1, which in turn implies that

Li(σ2i)> Li(σi1). (3) Following our assumption that D(·) is monotonically in-creasing in the aggregated load, expressions (1), (2) and (3) imply that

N·σ1_i >(N−K)·σ_i2. (4)

Next, we require that the average job completion time remains the same, or improves. To satisfy this requirement, it must hold that

K·Li(0) + (N−K)·Li(σ2i)

N ≤Li(σ

1

i).

Since all local computing resources process jobs in a linear fashion, it follows that

K· ωi µ + (N−K)· (1−σ2 i)·ωi µ N ≤ (1−σ1_i)·ωi µ .

Simplifying the above expression yields the conclusion that

(N−K)·σ2_i ≥N·σ1_i,

which conradicts expression (4) and the theorem follows. Next, we turn to show that preventing any number of users from accessing the cloud causes a deterioration in the social cost.

Theorem 5: For the scenario specified above, preventing any number of players from accessing the cloud increases the social cost of the CWF game.

Proof:First, we note that, since all players are identical, it is easy to see that the Nash equilibrium profile (in case all players may access the cloud) is symmetrical. This, in turn, implies that the average completion time is equivalent to the social cost. Following Lemma 6, it holds that preventing any number of users from accessing the cloud, causes the average completion time to increase, and the theorem follows.

VII. RESOURCEALLOCATION WITHIN THECLOUD

From Corollary 3 it follows that “heavy” players, i.e., play-ers with large jobs, may “scare off” other playplay-ers by placing much of their work in the cloud. In such a case, potentially large communities may not make use of the cloud at all. While this might not be detrimental from the point of view of overall performance (e.g., average or worst-case performance across all users), it may negatively affect other goals sought by the cloud provider. For example, if the provider charges some fixed fee to open an account, driving away many “light” users ends up in reducing profitability. To address this concern, we propose a scheme for allocating resources within the cloud. The idea is to partition the resources of the cloud, effectively making a cluster of virtual clouds, such that each would serve a different class of users. We demonstrate the idea on the following setting. We consider two classes of users: “small” users, each having a job of size ω, and “large” users, each having a job of sizeω·t, for somet >1. We denote byKthe number of large players and byN−K the number of small players. We assume that the local computation model is that of linear processing, according to the following expression:

Li(σi) =

(1−σi)·ωi

µ ,

for someµ >0.

Furthermore, we make the following assumptions regarding processing in the cloud:

(9)

• di(·)≡0;

• the cloud processes user jobs in serial fashion, that is:

TCloud(σ) = N P i=1 σi·ωi µc for some µc>0;

• users base their decisions on worst-case performance, hence: D(·)≡TCloud(·).

First, we identify the conditions under which small players stay out of the cloud.

Lemma 7: For the scenario specified above, in case

t≥ µc+k·µ

k·µ , (5)

only the K large players make use of the cloud.

Proof: By construction. Assume the condition specified in the claim holds, and consider the strategy profileσin which each of theK large players submitsσ0= µc

µc+K·µ, while the

other (N−K)small players do not submit any work to the cloud. It holds that for all large players,

Li(σi) = (1−σi)·ωi µ = (1− µc µc+K·µ)·ω·t µ = (_µK·µ c+K·µ)·ω·t µ = K· 1 µc+K·µ ·µc µc ·ω·t=K·σ0·ω·t µc =D(σ).

That is, all large players are at best response with the cloud. Next, we consider the cloud’s completion time. It holds that

TCloud(σ) =K·σ 0_·_ω_·_t µc = K·ω·t µc+K·µ .

This equation, together with the assumption that t≥ µc+k·µ

k·µ , yield that the cloud’s completion time is at least ω_µ, i.e., the delay generated by the work submitted by large players is at least as large as the time required for a small player to process its entire job locally. Therefore, small players are also at best response with the cloud (staying away from it), and the claim follows.

We turn to propose concrete resource allocation guidelines for two examplary cases. In the first, the cloud provider provides large users with enough computational resources such that their performance would not degrade much with respect to the case in which they make use of all of the cloud’s resources. In the second case that we consider, the cloud provider attempts to allocate resources in a way that supports global fairness, that is, each user can access an amount of resources that is proportional to its job size.

A. Preventing Large User Service Degradation

Recall that, when all cloud resources are shared by all players, and condition 5 is met, the expected completion time for large players is

k·t·ω µc+k·µ

;

in a similar fashion, if only µ1c of the cloud’s computational power is allocated for large players, their completion time is

k·t·ω µ1

c+k·µ

.

Given a permissible “degradation ratio”,ρ, we require that k·t·ω µ1 c+k·µ k·t·ω µc+k·µ ≤ρ.

After some algebra, the above expression yields:

µ1c ≥

µc+ (1−ρ)·k·µ

ρ ,

that is, as long as a large player may access at least µc+(1−ρ)·k·µ

ρ of the cloud’s computational power, its perfor-mance degradation would be at most ρ. We note that, since some resources are exclusively allocated to the small players (namely, µc−µ1c), they would all make some (nonzero) use of the cloud.

B. Creating System-wide Fairness

Here, the cloud provider partitions its resources such that, at equilibrium, the completion time of a large player would be at most t times that of a small player (we recall thatt is the proportion between large and small job sizes). We denote by

µ1c the amount of resources allocated for large players, hence the remainder,µc−µ1c, is allocated to the small players.

As shown above, when K large players get access to µ1

c computational resources, their completion time is

k·t·ω µ1

c+k·µ

.

In a similar fashion, when (N−K) small players access

(µc−µ1c)of the cloud’s resources, their completion time is

(N−k)·ω

(µc−µ1c) + (N−k)·µ

.

Requiring that large players completion time is at most t

times that of small players yields the following condition:

µ1_c ≥K

N ·µc,

that is, large players should get access to resources in an amount that is at least proportional to their relative size. Again, we note that, as long as some resources are exclusively allocated to the small players, i.e., µ1_c < µc, they would all make some (nonzero) use of the cloud.

VIII. CONCLUSIONS

Paraphrasing on the classic saying,to send or not to sendis a dilemma confronted by users in the cloud era. As opposed to the origin, in this case the answer is not binary, and a user may decide to split its job between the cloud and local resources, i.e., perform workload factoring. Yet, in this case too, the dilemma is addressed individually. We attempted to understand the global behavior of such an environment, by investigating the underlying noncooperative game. Such

(10)

an understanding enables to identify possible shortcomings that stem from the noncooperative, uncoordinated behavior. The potentially large price of anarchy is one such finding. The possibility that certain populations of users would fully possess the cloud resources, scaring off other populations, is yet another finding. Based on the foundations laid in this study, future work can identify other interesting patterns of behavior. For example, the di(·) components could account for the geographical distance between a user and the cloud, in which case investigation of the Nash equilibrium could detect to what extent a cloud is efficient in attracting remote users. By better understanding possible shortcomings we could explore schemes for better managing the cloud’s resources. For example, we demonstrated how a simple scheme for allocating the cloud’s resources can increase its attractiveness to different types of users. At the same time, we also demonstrated that exercising control, even in an apparently “right way”, may end up (quite counter-intuitively) in deteriorated performance.

This study is a first attempt to understand “cloud workload factoring games”. As such, it adopted the traditional assump-tion of full informaassump-tion, i.e., users have full and instantaneous knowledge regarding the performance of the cloud. In reality, of course, this is not quite the case, and a very interesting question for future research is the effect of incomplete in-formation on the game’s structure and the implied behavior and performance of the users. Another important subject for future work is to consider QoS (quality of service) issues. For example, some of the users may need to obey strict delay bounds for completing their jobs, hence requiring the incorporation of constraints in the determination of their best responses. Another example is to consider clouds that offer various (price-dependent) levels of QoS. Yet another intriguing subject for future research is to consider a multiplicity of competing clouds, giving rise to a two-level game.

REFERENCES

[1] E. Altman and N. Shimkin, “Individual equilibrium and learning in processor sharing systems,”Oper. Res., vol. 46, no. 6, pp. 776–784, 1998.

[2] Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2. [3] B.-G. Chun, K. Chaudhuri, H. Wee, M. Barreno, C. H. Papadimitriou,

and J. Kubiatowicz, “Selfish caching in distributed systems: a game-theoretic analysis,” inPODC ’04: Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing. New York, NY, USA: ACM, 2004, pp. 21–30.

[4] G. Debreu, “A Social Equilibrium Existence Theorem,”Proceedings of the National Academy of Science, vol. 38, pp. 886–893, Oct. 1952. [5] D. Grosu and A. T. Chronopoulos, “Noncooperative load balancing in

distributed systems,”J. Parallel Distrib. Comput., vol. 65, no. 9, pp. 1022–1034, 2005.

[6] B. Hayes, “Cloud computing,”Commun. ACM, vol. 51, no. 7, pp. 9–11, 2008.

[7] Y. A. Korilis and A. A. Lazar, “On the existence of equilibria in noncooperative optimal flow control,”Journal of the ACM, vol. 42, no. 3, pp. 584–613, 1995.

[8] E. Koutsoupias and C. H. Papadimitriou, “Worst-case equilibria,” Lec-ture Notes in Computer Science, no. 1563, pp. 404–413, 1999. [9] A. A. Lazar, A. Orda, and D. E. Pendarakis, “Virtual path bandwidth

allocation in multiuser networks,”IEEE/ACM Trans. Netw., vol. 5, no. 6, pp. 861–871, 1997.

[10] T. Moscibroda, S. Schmid, and R. Wattenhofer, “On the topologies formed by selfish peers,” inPODC ’06: Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing. New York, NY, USA: ACM Press, 2006, pp. 133–142.

[11] A. Nahir, A. Orda, and D. Raz, “Workload factoring with the cloud: A game-theoretic perspective,” Department of Electrical Engineering, Technion, Haifa, Israel, Tech. Rep., 2010. [Online]. Available: http://www.ee.technion.ac.il/Sites/People/ ArielOrda/Info/Other/NOR10CWF.pdf

[12] A. M. Odlyzko, “Internet pricing and the history of communications,” Computer Networks, vol. 36, no. 5/6, pp. 493–517, 2001.

[13] A. Orda, R. Rom, and N. Shimkin, “Competitive routing in multiuser communication networks,” IEEE/ACM Transactions on Networking, vol. 1, no. 5, pp. 510–521, oct 1993.

[14] C. Papadimitriou, “Algorithms, games, and the internet,” inSTOC ’01: Proceedings of the thirty-third annual ACM symposium on Theory of computing. New York, NY, USA: ACM, 2001, pp. 749–753. [15] B. Raghavan, K. Vishwanath, S. Ramabhadran, K. Yocum, and A. C.

Snoeren, “Cloud control with distributed rate limiting,” inSIGCOMM ’07: Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA: ACM, 2007, pp. 337–348.

[16] B. Rochwerger, D. Breitgand, E. Levy, A. Galis, K. Nagin, I. M. Llorente, R. Montero, Y. Wolfsthal, E. Elmroth, J. Caceres, M. Ben-Yehuda, W. Emmerich, and F. Galan, “The RESERVOIR model and architecture for open federated cloud computing,” IBM Journal of Research and Development, vol. 53, no. 4, 2009.

[17] T. Roughgarden, “Designing networks for selfish users is hard,” in FOCS ’01: Proceedings of the 42nd IEEE symposium on Foundations of Computer Science. Washington, DC, USA: IEEE Computer Society, 2001, p. 472.

[18] T. Roughgarden and E. Tardos, “How bad is selfish routing?”J. ACM, vol. 49, no. 2, pp. 236–259, 2002.

[19] G. Wei, A. V. Vasilakos, Y. Zheng, and N. Xiong, “A game-theoretic method of fair resource allocation for cloud computing services,” The journal of Supercomputing.

[20] H. Zhang, G. Jiang, K. Yoshihira, H. Chen, and A. Saxena, “Intelligent workload factoring for a hybrid cloud computing model,” inSERVICES ’09: Proceedings of the 2009 Congress on Services - I. Washington, DC, USA: IEEE Computer Society, 2009, pp. 701–708.