Optimize the Dynamic Provisioning and Request Dispatching in Distributed Memory Cache Services

(1)

Optimize the Dynamic Provisioning and Request

Dispatching in Distributed Memory Cache Services

Boyang Yu and Jianping Pan

University of Victoria, British Columbia, Canada

Abstract—The dynamic provisioning of distributed cache ser-vices helps to improve the system efficiency. We model the system as groups of servers caching different and none-overlapping key segments of content objects, and investigate the benefit of cache hit and request batching. A stochastic network optimization prob-lem is formulated, which aims at achieving the system stability, low energy cost and certain cache hit rate simultaneously through the dynamic control of server activeness and request dispatching. The problem is transformed into a minimization problem at each time slot and the online algorithm to solve it is proposed. Also we show that dynamic programming helps to lower the computational complexity. Finally, the proposed algorithm is evaluated through extensive simulations.

I. INTRODUCTION

Today the design of large-scale services attracts more atten-tion than before, motivated by the unprecedented increase of contents stored on the network and the continuous growth of user requests to retrieve them. Scalability and efficiency are two of the most important issues in the system design. In this paper, we aim at improving the efficiency of the distributed memory cache services through the optimized dynamic control of server provisioning and request dispatching.

The distributed cache services are broadly used in different large-scale networked systems to relieve the heavy workload. It provides temporary key-value storage in memory and makes the memory of different servers unified as whole, functioning as the cache of the whole key-value space. Memcached [1] is the state-of-the-art and most popular implementation of such a service. It achieves the horizontal scalability mainly through the consistent hashing[2], where the key space is partitioned into segments and each server is responsible for one segment or sub-space of the whole space. Besides, it also supports the mirroring of one sub-space, i.e., assigning multiple servers to redundantly cover the same sub-space and distributing the incoming traffic to them through a load balancer.

In contrast to the well-achieved scalability as stated above, in this paper, we are more interested in improving the effi-ciency of such a system, which is often overlooked in the existing practices of system design. Due to the dynamics of the workload, it is not necessary to keep all the servers of the system active all the time. However, inappropriately turning servers into inactive status will impact the system stability or service quality, shown as a large number of requests queued to wait for service or a large average response time of the system. Therefore, the reasonable control of the server activeness, which leads to a tradeoff between the system stability and

other goals such as the energy cost and cache hit rate, is the main objective of this work, termed as dynamic provisioning. The stochastic network optimization framework is adopted to achieve such an objective, which ensures the system stability by avoiding the queue backlog becoming infinite and takes the other goals into account. Based on the framework, we are to solve a minimization problem and adjust the server activeness accordingly at each discretized time slot. That is feasible in a data center because of its high-performance internal network. Request dispatching is another issue that will be solved. We will discuss where to dispatch a request if there are multiple choices and the favourable amount of content objects to be batched in a request, based on the queue backlogs at that time slot as input. These decisions will also contribute to the system stability because they affect the arrival of requests at servers. Since the applied framework only requires the current status of queue backlogs in making decisions, the scheme proposed in this paper does not rely on the prediction of the future workload, which makes it easy to apply and less vulnerable to the unpredictability of the workload.

Compared with existing works optimizing the distributed systems similarly using stochastic network optimization (for example [3], [4], [5]), ours is characterized by its specific modeling and scheme design towards the distributed cache service. The differences between the cache service analyzed and other common distributed services include: in the cache service, a request can only be dispatched to certain specific servers because of the consistent hashing design and the preference for a high cache hit rate; the change of server provisioning can further influence the responsible key space of servers still being active as well as the resultant cache hit rate, which is crucial to the effective throughput of the system; the diminishing overhead in batching requests was observed in the distributed cache system but not paid enough attention to in the existing design of dynamic server provisioning.

The rest of the paper is organized as follows. Section II presents the related work. Section III states the modeling framework. Section IV presents the proposed algorithm. Sec-tion V makes the evaluaSec-tion through simulaSec-tions. Finally we present the conclusions in Section VI.

II. RELATEDWORK

The distributed memory cache system [6] is broadly used in different networked services today and consistent hashing technique [2] is applied to support the scalability. How Face-book leverages and improves Memcached, a typical imple-978-1-4799-4852-9/14/$31.00 c2014 IEEE

(2)

mentation of the cache system, to support its social network is stated in [7]. Raindel et al. [8] considered the distributed memory storage system, and discussed how to serve the multi-get requests to achieve the maximum throughput. Byers et al. [9] discussed the method in achieving load balance among servers under consistent hashing through mapping a key to multiple servers. These works mostly try to improve the scalability or throughput performance, but the possibility of adjusting the server activeness to achieve the well balance between the performance and efficiency is overlooked.

The stochastic network optimization framework [10] is developed based on the system capacity analysis [11] and Lyapunov optimization. It helps to make optimized decisions only with current observations, not relying on the knowledge of arrival rate distribution. Urgaonkar et al. [3] proposed that the throughput and energy of a data center can be optimized through the admission control and routing control. Zhou et al. [5] solved the tradeoff between the performance and cost in the VM resource pool through the framework with specific consid-erations on the nonlinear energy consumption and the power budget. Maguluri et al. [4] focused on task scheduling under the cloud scenario, and also used the Lyapunov optimization as the main method to solve the problem. Comparatively, we considered more detailed perspectives in the distributed cache system, e.g., a request can only be dispatched to some of the active servers in the system.

III. MODELINGFRAMEWORK

A. Distributed Memory Cache Systems

In a common distributed memory cache system, such as Memcached [1], the cache servers work together to fulfill the content object requests initiated by the clients. Consistent hashing [2] is applied in dispatching requests to different servers which results in that each single server covers a sub-space of the whole key sub-space and is only responsible to serve the requests falling into the range it covers. Because the workload in requesting objects with specific keys is dynamic, the provisioning of the servers, keeping or turning them to be active or inactive, should be adjusted dynamically to tradeoff the resultant service quality and the energy cost.

The whole key space is denoted by S. Each key is a non-negative integer between 0 and the space size. We simplify the consistent hashing scheme as equally dividing S into N

segments, denoted by S1, ..., SN. Formally∪Nk=1Sk =S and

Si∩Sj=∅for anyi6=j. Then the segments form a directional

circle based on their indexes, such as Sk is the predecessor

of Sk+1 andSN is the predecessor ofS1, which is similar to

Chord [12], a typical DHT system.

There is a pool of available servers and we virtually partition the servers into N groups. Each active server in group i

owns the key segment Si, where 1 ≤i≤N. Meanwhile, it

might also serve the segments precedingSiin the circle when

necessary. The number of available servers in each group is assumed to beM. Totally, there areN×M servers considered at most in our scheme. We consider the time is discretized into slots. In each time slott, the activeness status of a server can

Group 4 Server 2

Server 1 Server 1 Server 1 Server 2 Server 2 Server 2

Server 3

Server 3 Server 3 Server 3

Server 1

Relay 1 Relay 2 Relay 3 Relay 4 Group 1 Group 2 Group 3

Client A

Fig. 1. System architecture

be changed. Accordingly, theactiveness of a serverjin group

iis denoted byaij(t)and the value 1 or 0 is used to represent

the status of active or inactive, where1≤j≤M. We further model theactiveness of a server groupiby

ai(t) =1( X

j

aij(t)), (1)

where 1(x) is an indicator function, with the return value 1

if x > 0, or 0 otherwise. And then we obtain the binary

provisioning vectorY(t), such as

Y(t) ={a1(t), ..., aN(t)} , (2)

which represents the activeness of server groups in slot t. Fig. 1 shows an example with N = 4 and M = 3. As illustrated, the whole group 2 and some servers in group 1 and group 4 are set to be inactive. HereY(t) ={1,0,1,1}.

Determined by the consistent hashing applied in dispatching requests to servers, the responsible key space of each active server group i includes its own key segment Si and the key

segments of one or multiple consecutively inactive groups preceding it in the circle, denoted by

ˆ

Si=∪ik=p+1Sk , (3)

wherepis the first active server group precedingiin the circle. In the example illustrated by Fig. 1, the servers in group 3 will also be responsible for the segment owned by group 2 due to the inactive status of the latter. Meanwhile, the servers in the same group cover the same sub-space of keys, so there would be multiple candidates in serving requests to that sub-space.

Besides, we set another layer between clients and servers in the system, which is shown as relays in Fig. 1. Each of such logical relays is responsible for one of the N key segments. And the requests falling into the same key segment, even if from different clients, will be aggregated at the relay, leading to a higher performance by the diminishing overhead effect. The relays can make the dynamic adjustment of server provisioning transparent to the clients. The request arrival at each relay kis also the arrival at each key segment Sk. B. Covered Key Space v.s. Cache Hit Rate

For a lower content retrieval latency, either the cache hit rate or the cache access latency should be improved. We are to achieve the latter through bounding the average queue length at servers. Meanwhile, increasing the number of active server groups can help the former improved, because some

(3)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.5 0.6 0.7 0.8 0.9

Ratio of covered space to whole space

Cache hit rate

Cache = 8,000 Cache = 16,000 Cache = 32,000 Cache = 64,000

Fig. 2. Influence of key space

1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6

Objects to retrieve per request

Normalized delay

1 server 2 servers 4 servers

Fig. 3. Diminishing overhead servers might serve a smaller key space. When the unequal distribution of workloads in different key segments is ignored, the expected cache hit rate Ci of server groupiis a function

of the ratio of its responsible space size over the whole space size, such as Ci =C(|Sˆi|/|S|). An experiment is conducted

to determine that relationship, based on the trace of HTTP requests to Wikipedia [13]. We simulate 10 LRU-based cache servers, each with the same cache capacity. Depending on the provisioning status, an active server might be responsible for

n/10 of the whole space, where 1 ≤n ≤10. n is adjusted from 1 to 10 and the results are shown in Fig. 2. We notice that either a smaller key space or a larger cache size can increase the cache hit rate. With the measurement, we use the function fitting tool in Matlab to model the relationship between the covered space ratioxand the cache hit rateC(x). When each server can store 64,000 objects at most, we obtainC(x) as

C(x) =−0.257x3−0.586x2−0.495x+ 0.830 . (4)

C. Diminishing Overhead by Request Batching

The diminishing overhead by batching objects in requests was initially noticed as the multi-get hole [14] in the dis-tributed cache service. Specifically, when the average number of objects to fetch through each request increases, the request processing efficiency will improve, which would be positive to enlarge the system throughput. To take it into account in the dynamic server provisioning, some experiments are conducted with real servers. We make clients send requests repeatedly, each of which is to obtain x objects by using the multi-get command. With x varied between 1 and 10, we measured the average number of requests served per second, denoted byT(x). Then the average delay of any request containingx

objects, normalized to that of any request containing 1 object, is calculated as M(x) = ₁1/T_/T(₍₁₎x) = _TT(1)₍_x₎. Our experiment results on the normalized mean delay with varyingxand the number of servers are shown in Fig. 3. Based on the results, we can make a linear approximation to M(x), such as

M(x) =σ+τ x , (5)

which can be intuitively understood as for each request, no matter how many objects are in it, there is an initial overhead

σin delay, and another part of the delay is determined by the number of objects in it, with a rate of τ. In the experiment with only one server, σ≈0.6 andτ≈0.4.

D. Problem Formulation

By the modeling, the incoming requests are queued re-spectively at each relays based on the key of the content

object to fetch. There are N queues at relays, corresponding to the N segments of the whole key space. Hk(t)is used to

represent thequeue backlog at relayk, also corresponding to the segment Sk, where k = 1,2, ..., N. When the time goes

fromt tot+ 1, the queue backlog of relay kis updated by

Hk(t+ 1) = max(Hk(t)−dk(t),0) +Ak(t), (6)

where

0≤dk(t)≤dmax . (7)

Ak(t)denotes the number of arrived objects to retrieve at time

t, whose keys are in the segmentSk.dk(t)denotes the number

of requested objects planned to be dispatched from the queue

k to cache servers at time t, constrained by the maximum rate dmax. Because the actual number of departing objects is

limited by the number of existing objects in the queue, we use

ˆ

dk(t)to denote the actual number of departure, such as ˆ

dk(t) = min(dk(t), Hk(t)). (8)

For the cache servers, thequeue backlog at the server j in group i is denoted by Qij(t). It decreases by lmax at most

in each time slot if the server is in the active status such as

aij(t) = 1. So the queue backlog is updated by

Qij(t+ 1) = max(Qij(t)−aij(t)lmax,0) +Aij(t), (9)

where Aij(t) is the number of objects newly arrived at the

server at the time slot t.

The dispatching of requests from the relays to the cache servers is decided by the active status of server groups. An active server groupiwill be responsible for its own segmenti

as well as the segments of its inactive predecessors. So the exact arrival Aij(t) at a server is determined by both the

provisioning vectorY(t)and corresponding departure amounts

ˆ

dk(t) in the related relays. If we consider that the departure ˆ

dk(t) at relay k can be distributed to multiple servers in the

same groupi∗ to be determined, we have

ˆ dk(t) = N X j=1 dkj(t), (10)

where dkj(t) is the specific amount dispatched to server j

among dˆk(t). Then the actual arrival amount to server j in

groupicould be represented by

Aij(t) = i X

k=p+1

M(dkj(t)), (11)

wherep is the first index beforei satisfying ap(t) = 1. The

functionM(x)is from the modeling of diminishing overhead in (5). Serving a request to retrieve a larger number of content objects, can lower the latency shared by each object in the request. SoM(x)can be intuitively considered as there is a discount on the counted number of arrived objects at servers when xis larger than 1.

H andQare used to represent the set of relay queues and server queues. Our first objective is to stabilize the queues

(4)

bounded, so that a certain cache access latency is ensured. Also we try tomaximize a utility functionat each time slot to achieve certain other goals. The utility function is defined as

U(t) =−X Q aij(t) +γ PN i=1Ci(t)ai(t) PN i=1ai(t) , (12)

whereCi(t) is the cache hit rate of server groupiat time t,

obtained from (4). In (12), the first item represents the energy consumption in keeping some servers active at timet, and the second item represents the benefit of cache hit, modeled as the average hit rate of all the active server groups. Besides,

γ is the weight of the cache hit rate performance which is to tradeoff the energy cost and cache hit rate.

IV. ONLINEALGORITHM

A. Algorithm Design

In order to achieve the goals specified above in an optimal way, we are to design an online algorithm that controls two sets of variables, i.e., the cache server activeness aij, and

the dispatched amount dkj from the relay k to the cache

server j in the corresponding server group. Following the stochastic network optimization [10], Lyapunov drift is used in quantifying the queue stability. First we define the Lyapunov function on the queue backlog ofH andQ, such as

L(Q(t),H(t)) =X Q Qij(t)2+ X H Hk(t)2 , (13)

which is abbreviated as L(t) in the following. Then the conditional Lyapunov drift functionat timetis obtained as

4(Q(t),H(t)) =EnL(t+ 1)−L(t)|Q(t),H(t)o, (14) which represents the change of Lyapunov function condition-ing on the known queue backlogs at a previous time slot.

Minimizing the drift function can potentially ensure the queue stability or system stability. With the other goals on the energy cost and cache hit rate considered, we are to minimize the drift-plus-penalty function, defined as

4(Q(t),H(t))−V EnU(t)|Q(t),H(t)o, (15) whereV is a parameter to tradeoff the importance of the queue stability and utility. Here we use the minus of the utility as the penalty in the drift-plus-penalty method [10].

With the queue backlog functions defined in (6) and (9) and based on the stochastic network optimization, (15) is relaxed to a minimization problem at each time slott, such as

min 2X Q Qij(t)(Aij(t)−aij(t)lmax) + 2X H Hk(t)(Ak(t)−dk(t))−V U(t) s.t. (7)(8)(10)(11) . (16)

In our scheme, the problem is solved in every time slot to give instructions on how to control the server provisioning and request dispatching of the system. In the problem,Qij(t),

Hk(t) andAk(t) are known parameters in each time slot, so

below we are to determine the values of other variables.

1) Server Group Activeness: The binary provisioning vec-torY(t)determines the destination server group when requests are dispatched from the relays and it affects the request arrival at each server group. So we need to iterate all the 0-1 combinations of the vector Y(t) to search the best solution of (16); and in each iteration, the vector of Y(t)is given, so we obtain the best conditional solution based on that specific

Y(t); after the iterations, we will choose the minimum in all the conditional solutions. Next we give the scheme to obtain the conditional best solution under a givenY(t). When Y(t)

is given, the cache utility part in the objective function of (16) can be ignored, because its value is already determined by

Y(t). Besides,2P

HHk(t)Ak(t)can also be ignored, because

it is not relevant in deciding the value of the variables. 2) Dispatching Destination: We are to determine the value of dkj(t), which is related to the constraint (10) and (11) in

the problem (16). Assuming the dispatched amountdk(t)of all

the relays and the activenessaij(t)of all the servers are given,

along with the dˆk(t)determined by dk(t), then the objective

of (16) would be simplified to minimize

2X Q Qij(t)Aij(t) = 2 X Q Qij(t) i X k=p+1 M(dkj(t)). (17)

The problem can be decoupled on i, so each active server group is considered respectively. For any related relay k

whose destination server group isi∗based on the givenY(t), although the departure amount dˆk(t)is assumed to be given,

the exact portion dkj(t) distributed to each active server j

in the group i∗ is still open to be determined. On the left hand side of (17),Qij(t)s are in fact the constant weight of

a sum, and Aij(t)s are possible to be adjusted to make the

sum smaller. First, it is more beneficial to dispatch the request to a shorter queue, as it ensures a smaller weight in the sum. Second, there is no benefit to partition dˆk(t) into parts and

dispatch each part to different servers in the group, proved by the fact thatM(x1) +M(x2) =T(1)/T(x1) +T(1)/T(x2)≥ T(1)/T(x1+x2) = M(x1+x2) holds for any positive x1

andx2. So the value ofdkj(t), given the requests from relay

kwould be dispatched to the group i∗, should be set to

dkj(t) =

(_ˆ

dk(t), ifj= arg minjQi∗_j(t)witha_i∗_j(t) = 1 0, otherwise

(18) It can be intuitively understood as Completely Joining the Shortest Queue, which means that for any relayk, its departing requests should all be dispatched to the active server with the shortest queue in its destination group determined byY(t).

3) Dispatching Amount: Then the solution ofdk(t)is to be

determined. Here we still assume the activeness statusaij(t)of

servers in a group is known. And the problem (16) is simplified to minimize −2X H Hk(t)dk(t) + 2 X Q Qij(t)Aij(t), ₍₁₉₎

which can be decoupled onH, so in fact we are to minimize

(5)

wherei∗ is the destined group of relaykdetermined byY(t)

andj∗_{is the server being active and with the shortest queue in}

the group. The indicator function1( ˆdk(t))is used to reflect the

dispatching amount could be 0. It is obvious that ifdk(t) = 0,

(20) is 0, therefore we set dk(t)to positive only if it makes

(20) negative. First we assume 1( ˆdk(t)) is 1. Because of the min function in (8), there are two cases when comparing between dmax and Hk(t): 1) in the case of dmax ≥Hk(t),

we need to minimize −Hk(t)dk(t) +Qi∗_j∗(t)τ H_k(t), so we

set dk(t) to the maximum value dmax if dmax ≥τ Qi∗_j∗(t),

otherwise to 0; 2) in the case of dmax < Hk(t), we need to

minimize −Hk(t)dk(t) +Qi∗_j∗(t)τ d_max, so we set d_max to

the maximum valuedmaxifHk(t)≥τ Qi∗_j∗(t), otherwise to

0. Then we have the generalized solution

dk(t) = (

dmax, if max(dmax, Hk(t))≥τ Qi∗_j∗(t)

0, otherwise (21)

After the dk(t) is determined, since we have assumed 1(P _ˆ

dk(t)) = 1, we need to check whether (20) with obtained

dk(t)is less than 0. If it is, then the solution ofdk(t)obtained

is adopted; otherwise,dk(t)is set to 0 to make (20) to 0. 4) Cache Server Activeness: The value of aij(t) is still

not determined yet. Since the activeness status ai(t)is known

from given Y(t), each active serve group can be considered respectively, along with the relays relying on it. Note that if a server in groupiis kept active, there are two possible reasons: to dispatch requests to it from the corresponding relays or to make it serve requests that are already in the queue.

After decoupling (16) on i, then in any group considered, when the server with the shortest queue among the active servers is given, denoted bys, the variable part in the objective function is simplified to −2X j Qij(t)aij(t)lmax+V X j aij(t). (22)

Because each server has a queue, below we represent a server by its queue. Now we make decisions about whether to keep a queue longer than sactive in the group. Intuitively, to keep a longer queue active has a larger potential to make (22) negative, because it implies the larger weight Qij(t) in the

sum. Then in the solution, by iterations, each of the queues is assumed be the shortest in turn; in each iteration, we search among all the queues longer than s following the decreasing order of the queue length and choose those that can keep (22) negative to be active. For a given server group, the computational complexity of this method is O(M2), because of the two-level iterations.

5) Improvement by Dynamic Programming: We have de-vised the method to obtain the solution conditioning on a specific Y(t). Y(t) has 2N cases, so dynamic programming is introduced to reuse the intermediate results of shared sub-problems in the iterations and to lower the complexity from

O(2N)toO(N3). We devise the recursive formula of DP as,

D(l1,l2]_{= min} n min l1<m<l2 {D(l1,m]_⊕_D(m,l2]_}_{, I}(l1,l2] o , (23) 1 4 7 10 13 16 x 104 0 500 1000 1500 Parameter V

Average queue length

Servers Relays Active servers Fig. 4. Effect ofV 1000 200 300 400 500 600 700 100 200 300 400

Average arrival rate

Average queue length of servers

V=5k V=10k V=20k V=40k

Fig. 5. Queue length

100 200 300 400 500 600 700 0 20 40 60 80 100

Average number of active servers

V=5k V=10k V=20k V=40k

Fig. 6. Number of activer servers

100 200 300 400 500 600 700 0.5 0.55 0.6 0.65 0.7 0.75 0.8

Average cache hit rate

gamma=1 gamma=5 gamma=10 gamma=20 gamma=40

Fig. 7. Cache hit rate where I(l1,l2] _{means the best solution for server groups in}

the range (l1, l2], assuming that group l1 and l2 are active

and all the groups in between are inactive andD(l1,l2] _means

the best solution for groups in the range (l1, l2] assuming at

least the group l1 and l2 are active. Besides, the operation D(l1,m] ⊕_D(m,l2] _{is to merge two none-overlapping}

sub-solutions together. Based on the formula, we begins from the solutions for the shortest ranges of(l1, l2], and then gradually

obtain the solution for the whole range of server groups. More details of this method are given in our technical report [15].

V. EVALUATIONS

We simulate a memory cache system, with20server groups and 10 servers in each group at most. The same maximum service rate of servers lmax and that of relays dmax are set

to 70 and 1,800 objects/slot, respectively. The mean arrival rates at each of the 20 key segments are set to satisfy the ratio λ1 : λ2 : ... : λ20, which is obtained by hashing the

URL in each record of the Wikipedia request trace [13] into

20 segments and counting the number of requests in each segment. The average of arrival rates,λ=P20_i₌₁λi/20, is set

to700by default which makes the largest arrival rateλ16near

the system capacity. In each time slot, the actual amount of arrivals for segmentiis randomly generated through sampling the uniformly distributed random variable between [0,2λi].

Each experiment runs500,000 time slots by default.

First, how the change of parameterV influences the queue length at servers and relays is investigated. We expect with the increase of V, the average queue length would be increased, because it was shown in stochastic network optimization that the scheme ensures an O(V)-approximation of the queue length. Fig. 4 shows the results of time-averaged queue length of servers, relays and active servers with the increasing value of V. The average queue length at relays is shown to keep stable, because under the shown range of V, the throughput capacity at servers is still higher than the actual arrival. Besides, the increase ofV results in fewer servers being active to save energy and makes the average queue length increased.

(6)

0 250 500 750 1000 0 20 40 60 80 100 120 Time

Queue length / Active servers

Queue length Number of active servers

Fig. 8. Handling bursty traffic

100 200 300 400 500 600 700 0 200 400 600 800 1000

Average queue length

Proposed JSQ+MaxWeight MaxWeight JSQ Random

Fig. 9. Performance comparison

Second, we varied the overall average arrival rate λ and compared the performance. Fig. 5 shows the evaluation result on the time-averaged queue length and Fig. 6 shows the resultant number of active servers. The smaller value of V

leads to the shorter queue length, however it results in a larger number of servers being active. Besides, when the workload becomes higher, the active number of servers and the queue length will increase. We also noticed that whenλis comparatively larger, the effect ofV in lowering the number of active servers is weakened, since the arrival rate is close to the system capacity. Fig. 7 gives the cache hit rate performance with varying λ and γ. When the workload is comparatively low, we see the effect of parameter γin helping the cache hit rate to increase is higher. In the extreme case, the cache hit rate is improved from about61%to76%. But when the arrival rate is near the same capacity, the same extent of increasing

γ leads to a much less noticeable effect on the cache hit rate. The reason is that when the arrival rate is larger, the queue length at servers in certain groups would be longer, and then those groups have to be kept active, which leads to the less freedom in improving the cache hit rate.

Third, an extra experiment was conducted to verify the performance under the dynamic or bursty changing workload. The experiment lasts1,000time slots. We setλ= 200, but in the time period of [300,700], the arrival rates were increased to twice of the normal rate, so that the workload was largely increased in that period. We monitored the average server queue length and the number of active servers in each time slot. Fig. 8 shows that although the bursty workload increases the average queue length during the period of[300,700], after the workload becomes normal, the queue length decreases in quite a short time. Besides, the increased number of active servers during the bursty period shows that the scheme can adapt to the dynamic workload.

Finally, the performance of our scheme was compared with others. For all the schemes compared, we adjust parameters to make the resultant number of active servers similar, so the average queue length is the metric of comparison. In the other schemes compared, the ratio of the number of active servers in different groups is set to be proportional to the workload predicted by Exponentially Weighted Moving Averaging (EWMA) [16]. Then the exact servers to be active in each group can be selected randomly or by giving a higher priority to the servers with a larger queue backlog (MaxWeight). Besides, the dispatching from a relay to an active server can be random or by Join-the-Shortest-Queue (JSQ).

The results after the first 5,000 slots are shown in Fig. 9. It shows that the proposed scheme outperforms the others under all the settings of λ. Besides, the performance of the JSQ+MaxWeight scheme is quite close to the proposed when the mean arrival rate is low, but the difference becomes larger when λincreases.

VI. CONCLUSIONS

We investigated the dynamic provisioning and request dis-patching in the distributed memory cache services and pro-posed the online scheme in achieving the system stability, low energy cost and high cache hit rate. The stochastic network optimization framework was applied in devising the scheme, and dynamic programming was used to lower the algorithm complexity. The scheme proposed only requires the information that can be obtained in the current time slot and its performance is evaluated through extensive simulations.

ACKNOWLEDGMENT

This work is supported in part by NSERC, CFI and BCKDF. REFERENCES

[1] Online, “http://memcached.org/.”

[2] D. Karger, A. Sherman, A. Berkheimer, B. Bogstad, R. Dhanidina, K. Iwamoto, B. Kim, L. Matkins, and Y. Yerushalmi, “Web caching with consistent hashing,”Computer Networks, vol. 31, no. 11, pp. 1203–1213, 1999.

[3] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely, “Dynamic resource allocation and power management in virtualized data centers,” inin Proc. of NOMS, 2010, pp. 479–486.

[4] S. T. Maguluri, R. Srikant, and L. Ying, “Stochastic models of load balancing and scheduling in cloud computing clusters,” inProc. of IEEE INFOCOM, 2012, pp. 702–710.

[5] Z. Zhou, F. Liu, H. Jin, B. Li, B. Li, and H. Jiang, “On arbitrating the power-performance tradeoff in SaaS clouds,” in Proc. of IEEE INFOCOM, 2013.

[6] M. Tomaˇsevic, J. Protic, M. Tomasevic, and V. Milutinovi´c,Distributed Shared Memory: Concepts and Systems. John Wiley & Sons, 1998. [7] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li,

R. McElroy, M. Paleczny, D. Peek, P. Saabet al., “Scaling memcache at facebook,” inProc. of USENIX NSDI, 2013, pp. 385–398.

[8] S. Raindel and Y. Birk, “Replicate and bundle (rnb)–a mechanism for relieving bottlenecks in data centers,” inProc. of IEEE IPDPS, 2013, pp. 601–610.

[9] J. Byers, J. Considine, and M. Mitzenmacher, “Simple load balancing for distributed hash tables,” inPeer-to-Peer Systems II. Springer, 2003, pp. 80–87.

[10] M. J. Neely, “Stochastic network optimization with application to communication and queueing systems,”Synthesis Lectures on Commu-nication Networks, vol. 3, no. 1, pp. 1–211, 2010.

[11] L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks,” Automatic Control, IEEE Transactions on, vol. 37, no. 12, pp. 1936–1948, 1992.

[12] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” inACM SIGCOMM Computer Communication Review, vol. 31, no. 4. ACM, 2001, pp. 149–160.

[13] G. Urdaneta, G. Pierre, and M. van Steen, “Wikipedia workload analysis for decentralized hosting,”Elsevier Computer Networks, vol. 53, no. 11, pp. 1830–1845, 2009.

[14] Online, “http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html/.” [15] B. Yu and J. Pan, “Optimize the dynamic provisioning and request

dispatching in distributed memory cache services,” http://grp.pan.uvic. ca/∼_{boyangyu/tr-dcache.pdf, Tech. Rep., 2014.}

[16] J. S. Hunter, “The exponentially weighted moving average.”Journal of Quality Technology, vol. 18, no. 4, pp. 203–210, 1986.