A Load-Balancing Algorithm for Cluster-based Multi-core Web Servers

Available at http://www.Jofcis.com

Guohua YOU†, Ying ZHAO

College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China

Abstract

The demand for high-performance web servers has led to the use of multi-core, cluster-based web servers. At the same time, the growing share of dynamic requests is changing the traditional web environment, which makes the load-balancing algorithm crucial to cluster-based web servers. However, traditional load-balancing algorithms consider neither the service time distribution of dynamic requests nor the characteristics of multi-core web servers. This paper proposes a new load-balancing algorithm that assigns dynamic requests according to their service time distribution and keeps the load balanced across the cores of multi-core web servers by means of a Genetic Algorithm. Simulation experiments were carried out to evaluate the new algorithm. The results indicate that the new algorithm is fairer and performs better.

Keywords: Dynamic Requests; Web Server; Cluster; Multi-core; Genetic Algorithm

1. Introduction

With the rapid development of the Internet industry, people increasingly rely on the web for their daily activities. Consequently, web servers play a crucial role in information and business-oriented services. To meet user demand, web servers must deliver higher performance. One of the most popular schemes for addressing this problem is the cluster-based web server [1, 2]. Fig.1 describes its architecture, which mainly consists of a web switch and a set of web servers [3]. In this design, we assume that all the web server nodes are homogeneous. Conceptually, the web switch acts as a centralized global scheduler that assigns requests according to a load-balancing algorithm. Furthermore, with the emergence of multi-core technology, most web servers have adopted multi-core CPUs in the past few years to improve hardware performance. A multi-core system integrates two or more processing cores into one silicon chip [4]. In this design, every processing core has its own private L1 cache, and the cores share an L2 cache [5]. All the processing cores also share the main memory and the system bandwidth. Fig.2 shows the architecture of a multi-core system. When cluster-based web servers employ multi-core CPUs, some new problems arise.

Each node of the cluster runs a service application with multiple threads, which serve the requests. When the multi-threaded application on a multi-core node serves dynamic requests, it easily gives rise to the ping-pong effect [6], which greatly degrades the performance of the multi-core system. To eliminate the ping-pong effect, we introduce CPU affinity. CPU affinity is the ability to bind a process or thread to a specific CPU core [7]. Several works [8-12] have used CPU affinity to improve the performance of applications on multi-core systems.

The processing of dynamic requests is complicated: some dynamic requests are very simple, while others are very complex. As a result, the service time of dynamic requests varies greatly and usually follows a heavy-tailed distribution [13]. Traditional load-balancing algorithms for cluster-based web servers, such as Round-Robin (RR), Content Aware Policy (CAP) [14] and Weighted Round-Robin (WRR), consider neither the characteristics of multi-core web servers nor the service time distribution of dynamic requests. We therefore propose a load-balancing algorithm that addresses the above-mentioned problems and effectively improves the performance of cluster-based multi-core web servers.

† Corresponding author.

Fig.1 Web Server Cluster Architecture

Fig.2 Architecture of Multi-core CPUs [5]

The remainder of the paper is organized as follows. Section 2 describes the new load-balancing algorithm. Section 3 introduces the simulation experiments and evaluates the performance of the new algorithm. Finally, Section 4 presents our conclusions.

2. New Load-balancing Algorithm

2.1. Description of Algorithm

Although a website receives a large number of dynamic requests, the types of dynamic requests are limited. Many dynamic requests differ only in their parameters (for example, http://www.testExample.com/Web.aspx?name=tom and http://www.testExample.com/Web.aspx?name=mike), while the requested page is the same. We treat dynamic requests for the same page as the same type. Generally, at a multi-core node of the cluster, incoming requests are assigned to threads from a thread pool. When dynamic requests of the same type are assigned to threads, these threads execute the same code and therefore share data. Furthermore, under the thread scheduling strategy of a multi-core system, the O/S always tries to spread these threads over different processing cores to balance the load between cores [15], so the shared data is continually transferred back and forth between the L1 caches of the different processing cores. This is the ping-pong effect.

Fig.3 Load-balancing Algorithm

Because the web server nodes in the cluster are homogeneous, we can assume that each web server node has the same number of threads. We can therefore regard all the multi-core CPUs in the cluster as one “Super Multi-core CPU”, which includes all the cores of the cluster. Consequently, we can assign the threads that serve requests of the same type to the same core by the hard affinity method. Moreover, to balance the load between different cores, we calculate the thread allocation strategy with a Genetic Algorithm. Load balance between cores in turn ensures load balance between the nodes of the cluster.
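To make the notion of hard affinity concrete, the following is a minimal sketch of binding the caller to a single core. It assumes a platform on which Python exposes `os.sched_setaffinity` (some Unix platforms, notably Linux); the helper name `pin_to_core` is hypothetical and is not part of the system described in this paper.

```python
import os


def pin_to_core(core):
    """Bind the caller to one core.

    Returns the resulting affinity set, or None when the platform does not
    offer the call or the requested core cannot be used.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # the call is not available on every operating system
    try:
        os.sched_setaffinity(0, {core})  # 0 refers to the caller itself
    except (OSError, ValueError):
        return None  # the core is not available to this process
    return os.sched_getaffinity(0)
```

In a server built along these lines, every thread of a group serving one request type would call such a helper with the same core index, so that the group's shared data stays in one core's cache.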

As shown in Fig. 3, when dynamic requests arrive at the classifier from the TCP queue, they are classified by their URLs, and requests of the same type are assigned to the same request queue. The weight of each request queue is calculated from the access frequency and the mean service time of that type of dynamic request, both of which can be obtained from the log file [16]. In a multi-threaded system, CPU time is allocated equally to every thread during a CPU cycle, so the number of threads represents a proportion of CPU capacity. The number of threads that serve a request queue is therefore calculated from the weight of the request queue. To avoid the ping-pong effect, all threads that process the same request queue should be assigned to the same core. After the thread allocation scheme is determined, the dynamic requests in the request queues are assigned to these threads, which then begin to execute. The results of execution are the new dynamic pages, which are sent to the network as responses under the scheduling of I/O management. In this design, we deploy the new load-balancing algorithm in the web switch.
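The classification step can be sketched as follows. This is an illustration only: it assumes that the request type is identified by the host and path of the URL with the query string ignored, and the function names `request_type` and `classify` are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def request_type(url):
    """Requests for the same page are the same type; parameters are ignored."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path


def classify(urls):
    """Group incoming request URLs into one FIFO queue per request type."""
    queues = defaultdict(deque)
    for url in urls:
        queues[request_type(url)].append(url)
    return queues


queues = classify([
    "http://www.testExample.com/Web.aspx?name=tom",
    "http://www.testExample.com/Web.aspx?name=mike",
    "http://www.testExample.com/Other.aspx?id=7",
])
```

Here the first two URLs fall into the same queue because they request the same page, while the third opens a second queue.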

2.2. Calculation of Algorithm Parameter

2.2.1. Weight of Dynamic Request Queue

After obtaining the number of visits for each type of request over a specified time interval from the log files, we can calculate C_i, the proportion of the accesses to request queue i in the total accesses to all request queues. Likewise, we can calculate the average service time T_i of the requests in request queue i. The weight W_i of request queue i is then given by

$$
W_i = C_i T_i \tag{1}
$$

where $\sum_{i=1}^{M} C_i = 1$ and M is the total number of request queues.

2.2.2. Threads Number of Request Queue

According to the weight W_i of request queue i, the number λ_i of threads that serve request queue i is calculated by the following formula

$$
\lambda_i = \frac{W_i}{\sum_{j=1}^{M} W_j} H \tag{2}
$$

where H is the total number of threads in the thread pool and M is the total number of request queues. λ_i is the number of threads used to handle request queue i, and it is rounded to an integer.

2.2.3. Load Balance between Cores
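Before turning to the assignment of threads to cores, the two quantities defined in Sections 2.2.1 and 2.2.2 can be made concrete with a short sketch. It assumes that the log has already been reduced to (request type, service time) pairs; the record format, the function names and the sample numbers are illustrative assumptions, not data from this paper.

```python
from collections import defaultdict


def queue_weights(log_records):
    """Formula (1): W_i = C_i * T_i for every request queue i."""
    counts = defaultdict(int)
    total_time = defaultdict(float)
    for req_type, service_time in log_records:
        counts[req_type] += 1
        total_time[req_type] += service_time
    total = sum(counts.values())
    weights = {}
    for req_type, count in counts.items():
        c_i = count / total                   # share of accesses; shares sum to 1
        t_i = total_time[req_type] / count    # mean service time of the queue
        weights[req_type] = c_i * t_i
    return weights


def thread_counts(weights, total_threads):
    """Formula (2): lambda_i = W_i / sum_j W_j * H, rounded to an integer."""
    weight_sum = sum(weights.values())
    return {q: round(w / weight_sum * total_threads) for q, w in weights.items()}


# Assumed sample log: 6 fast, 3 medium and 1 slow request (times in seconds).
records = [("/a", 0.1)] * 6 + [("/b", 0.3)] * 3 + [("/c", 0.9)]
weights = queue_weights(records)
lam = thread_counts(weights, total_threads=8)
```

With these sample numbers the weights are roughly 0.06, 0.09 and 0.09, so a pool of H = 8 threads is split 2, 3 and 3. Because each λ_i is rounded independently, the rounded counts do not always sum exactly to H.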

To avoid the ping-pong effect between threads, the threads serving the same request queue should be assigned to the same processing core as a whole. Once these threads are allocated to the processing cores as whole groups, however, the number of threads may differ greatly from core to core. This gives rise to a new problem: load balance between cores. To keep the load balanced, we must allocate the threads serving the same request queue to the same processing core while keeping the numbers of threads on the different cores as even as possible. We solve this problem by means of the Genetic Algorithm.

Suppose the “Super Multi-core CPU” has N cores and the number of request queues is M. We can then define the chromosome of the Genetic Algorithm as follows

$$
e = (R_1, R_2, \ldots, R_j, \ldots, R_M)
$$

where R_j is an integer and 0 ≤ R_j ≤ N−1. R_j is a gene of the chromosome and represents the serial number of the core to which the threads serving request queue j are assigned. A chromosome therefore stands for a thread assignment solution.

For brevity, we define the threads that serve the same request queue as a Service Thread Group (STG). If an STG serves request queue j, it is denoted STG_j, and the number of threads in STG_j is λ_j. From a chromosome, we can find the STGs that are allocated to core i. If we define the number of STGs on core i as B_i, then we can enumerate all the STGs on core i:

$$
STG_{A_1}, STG_{A_2}, \ldots, STG_{A_k}, \ldots, STG_{A_{B_i}}, \qquad 1 \le k \le B_i
$$

where STG_{A_k} is the Service Thread Group that serves request queue A_k. If the number of threads on core i is X_i, then X_i can be calculated by the following formula:

$$
X_i = \sum_{k=1}^{B_i} \lambda_{A_k} \tag{3}
$$

We thus obtain the number of threads on every core: X_1, X_2, ..., X_N. We define D(X) as the variance of X_1, X_2, ..., X_N. A large D(X) means that the numbers of threads on different cores differ greatly, so a lower value of D(X) is preferable for keeping the load balanced between cores.
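As a small worked illustration, consider N = 2 cores and three STGs with thread counts 4, 3 and 1. These numbers are assumed for the example and do not come from the experiments, and the variance is taken to be the population variance, which the text does not specify. If the first STG is placed on one core and the other two on the second core, the thread counts are X_1 = 4 and X_2 = 3 + 1 = 4, so the variance is 0. If instead the first two STGs share a core, then X_1 = 7 and X_2 = 1, and

$$
\bar{X} = \frac{7 + 1}{2} = 4, \qquad D(X) = \frac{(7 - 4)^2 + (1 - 4)^2}{2} = 9 .
$$

The first assignment is therefore the better balanced of the two.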

Therefore, we define the fitness function of the Genetic Algorithm as follows

$$
f(e) = \frac{1}{D(X) + 1} \tag{4}
$$

where e is a chromosome. A lower value of D(X) means better load balance between cores, but D(X) may sometimes be zero, so we use the reciprocal of D(X)+1 as the fitness function. A larger value of f(e) thus means better load balance between cores.
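The fitness evaluation can be sketched in a few lines. The sketch assumes that a chromosome is stored as a list whose entry j is the zero-based core index of request queue j, that `lam[j]` holds the thread count of that queue, and that D(X) is the population variance; these representation choices and the function names are assumptions made for illustration.

```python
from statistics import pvariance


def threads_per_core(chromosome, lam, n_cores):
    """Formula (3): total number of threads that land on each core."""
    x = [0] * n_cores
    for queue, core in enumerate(chromosome):
        x[core] += lam[queue]
    return x


def fitness(chromosome, lam, n_cores):
    """Formula (4): reciprocal of the variance of per-core thread counts plus one."""
    return 1.0 / (pvariance(threads_per_core(chromosome, lam, n_cores)) + 1.0)


lam = [4, 3, 1]                         # assumed thread counts of three STGs
balanced = fitness([0, 1, 1], lam, 2)   # per-core counts [4, 4], variance 0
skewed = fitness([0, 0, 1], lam, 2)     # per-core counts [7, 1], variance 9
```

The balanced assignment reaches the maximum fitness of 1, while the skewed one scores 1/(9+1) = 0.1.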

The Genetic Algorithm proceeds as follows:

(1) Initial Population: a population is a collection of chromosomes. The population size L can be determined experimentally. The initial population is usually generated randomly: we generate L chromosomes by assigning a random integer in the range 0 to N−1 to every gene of each chromosome.

(2) Calculation of the Fitness Value: we use the fitness function (4) and the method above to calculate the fitness value of every chromosome in the population.

(3) Selection: we select the fitter chromosomes by the roulette wheel method. The greater the fitness value of a chromosome, the larger its probability of being chosen. We repeat the selection operation as many times as there are chromosomes in the population.

(4) Crossover: for a randomly selected pair of chromosomes, we decide whether to perform crossover based on the crossover probability. If crossover is performed, it generates a new pair of chromosomes by exchanging portions of the two old chromosomes.

(5) Mutation: we randomly choose a chromosome from the population. For a gene of the chromosome, we allow a random change with a very small probability. If the change happens, the gene is assigned a random integer that ranges from 0 to N−1 and differs from the original value of the gene.

(6) Termination Condition: in the Genetic Algorithm, steps (2), (3), (4), and (5) are repeated, and the chromosome with the largest fitness value is recorded at each iteration. If the largest fitness value does not change for five consecutive iterations, the iterations end.

By this procedure we can obtain the best thread assignment solution, which effectively keeps the load balanced between cores. Load balance between cores in turn ensures load balance between the nodes of the cluster.
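The six steps above can be sketched as a compact program. This is an illustrative reading of the procedure, not the RSSP implementation: the parameter values (population size, crossover and mutation probabilities), the single-point crossover on consecutive pairs of the selected population, the cap on the number of generations and the use of the population variance are all assumptions.

```python
import random
from statistics import pvariance


def fitness(chromosome, lam, n_cores):
    """Formula (4), with per-core thread counts computed as in formula (3)."""
    x = [0] * n_cores
    for queue, core in enumerate(chromosome):
        x[core] += lam[queue]
    return 1.0 / (pvariance(x) + 1.0)


def run_ga(lam, n_cores, pop_size=30, p_cross=0.8, p_mut=0.05,
           patience=5, max_generations=500, seed=0):
    rng = random.Random(seed)
    m = len(lam)
    # (1) Initial population: every gene is a random core index in [0, N-1].
    population = [[rng.randrange(n_cores) for _ in range(m)]
                  for _ in range(pop_size)]
    best, best_fit, unchanged = None, -1.0, 0
    for _ in range(max_generations):
        # (2) Fitness value of every chromosome.
        scores = [fitness(c, lam, n_cores) for c in population]
        top = max(range(pop_size), key=scores.__getitem__)
        # (6) Record the best chromosome; stop once it stalls `patience` times.
        if scores[top] > best_fit:
            best, best_fit, unchanged = population[top][:], scores[top], 0
        else:
            unchanged += 1
            if unchanged >= patience:
                break
        # (3) Roulette-wheel selection, repeated pop_size times (with copies).
        population = [c[:] for c in
                      rng.choices(population, weights=scores, k=pop_size)]
        # (4) Single-point crossover between consecutive selected chromosomes.
        for a, b in zip(population[0::2], population[1::2]):
            if m > 1 and rng.random() < p_cross:
                cut = rng.randrange(1, m)
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # (5) Mutation: one random chromosome, each gene changed with p_mut.
        if n_cores > 1:
            victim = rng.choice(population)
            for j in range(m):
                if rng.random() < p_mut:
                    victim[j] = rng.choice(
                        [c for c in range(n_cores) if c != victim[j]])
    return best, best_fit


lam = [5, 3, 3, 2, 2, 1]                  # assumed thread counts of six STGs
best, best_fit = run_ga(lam, n_cores=2)
```

The returned list gives, for each request queue, the core on which its thread group should be pinned. Since the search is randomized, the result is a good assignment rather than a guaranteed optimum.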

3. Experiments and Evaluation

3.1. Experiments Setup

To validate the new load-balancing algorithm, we developed a simulation program called RSSP and deployed it in the web switch. The classifier in RSSP parses each request’s URL and assigns requests with the same URL to the same request queue. After classification, RSSP calculates the weight of each request queue and decides the number of threads that serve each request queue. The thread assignment scheme is then determined by the Genetic Algorithm. After the threads are assigned to the cores according to the thread assignment solution, the scheduler in RSSP assigns the requests to the corresponding cores. We use the hard affinity method to accomplish the thread assignment. To simulate the generation of dynamic pages, we created 50 DLL files in place of 50 dynamic pages. When a request is assigned to a thread, the corresponding DLL file is loaded and executed to simulate the generation of the dynamic page. The functions of these DLL files are distinct, so their execution times differ. Furthermore, because threads that execute the same dynamic web page share data, we set shared data in these DLL files. The default size of the shared data is 2k.

To simulate the visiting behavior of users, we designed a request-sending module, which automatically sends requests to the cluster-based web servers according to a specified arrival process. In our experiment, the default arrival process follows a heavy-tailed distribution.
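One way to generate such an arrival process is sketched below. The text does not say which heavy-tailed distribution was used, so the Pareto distribution, its shape parameter `alpha`, the minimum gap `scale` and the function name `arrival_times` are all assumptions chosen for illustration.

```python
import random


def arrival_times(n_requests, alpha=1.5, scale=0.01, seed=0):
    """Return increasing send times whose gaps are Pareto distributed."""
    rng = random.Random(seed)
    now, times = 0.0, []
    for _ in range(n_requests):
        # paretovariate returns values >= 1, so every gap is at least `scale`.
        now += scale * rng.paretovariate(alpha)
        times.append(now)
    return times


times = arrival_times(1000)
```

Most gaps produced this way are close to the minimum, with occasional very long pauses, which is the bursty pattern that a heavy tail implies.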

3.2. Results Evaluation

In this section, we discuss the results of the simulation experiments. We varied the load intensity and measured the cluster response time, throughput and scalability for the three scheduling policies: WRR, CAP and RSSP.

3.2.1. Mean Response Time

Fig. 4 (a) shows the mean response time curves of the three policies. For all algorithms, the mean response time curves are exponential: they are relatively flat at first and then rise sharply as the number of clients increases. The mean response time of CAP is similar to that of RSSP for small numbers of clients, but the two differ greatly for larger numbers of clients. As can be seen in Fig.4 (a), WRR has a higher mean response time than CAP and RSSP for every number of clients. This is because one or more web servers in a cluster under the WRR scheme reach the CPU bottleneck sooner owing to poorer load balance.

3.2.2. Throughput Evaluation

Fig.4 (b) illustrates the throughput of these algorithms in the cluster. Generally, throughput rises at first as the number of clients increases and then peaks when the CPU resource becomes the bottleneck on the web servers. From Fig.4 (b), we can see that WRR peaks earlier than CAP and RSSP because it reaches the CPU bottleneck more easily. CAP shows throughput comparable to that of RSSP. However, because of its better load balance, RSSP can respond to more clients than the WRR and CAP schemes and can reach higher throughput.

3.2.3. Scalability for Three Scheduling Policies

One of the important characteristics of a cluster is scalability, and the load-balancing algorithm has great influence on it. We therefore evaluate the scalability of the cluster in terms of maximum throughput, measuring the maximum throughput of the cluster as the number of servers changes. Fig. 4(c) shows the scalability of the cluster with an increasing number of server nodes. RSSP scales better than WRR and CAP owing to its better load balance. As the number of nodes increases, the throughput curve of CAP flattens because of the overhead of the CAP algorithm. WRR has lower throughput than RSSP and CAP because its request assignment scheme is inefficient and blind to the type of request.

Fig.4 Evaluation for Three Load Balance Algorithms (a) Mean Response Time (b) Throughput (c) Scalability (d) Mean Response Time when Size of Shared Data is Changed

3.2.4. Changing the Shared Data

In our experiment, the DLL files contain shared data to simulate dynamic web pages with shared data. We changed the size of the shared data in the 50 DLL files and measured the mean response time of the three strategies; the results are shown in Fig. 4 (d). The mean response time of WRR and CAP increases as the size of the shared data increases, whereas the mean response time of RSSP changes little. The reason is that RSSP adopts the new load-balancing algorithm, which eliminates the ping-pong effect, so increasing the size of the shared data has only a slight impact on its mean response time.

4. Conclusions

To avoid the ping-pong effect and improve load balance in a cluster-based multi-core web server, we proposed a new load-balancing algorithm, which applies the affinity method to the cluster-based web server and keeps the load balanced between the cores and nodes of the cluster. We described the principle of the new algorithm and gave its calculation formulas. Furthermore, we developed RSSP, a simulation program based on the new method, and carried out simulation experiments with it. We analyzed the key performance indices and compared them with those of the WRR and CAP strategies. The new algorithm performs better and effectively avoids the ping-pong effect.

Acknowledgement

This paper has been partially supported by the National Grand Fundamental Research 973 Program of China (No. 2011CB706900).

References

[1] E. Casalicchio, V. Cardellini, and M. Colajanni. Content-aware dispatching algorithms for cluster-based web servers. Cluster Computing, 5: 65–74, 2002.

[2] J. Yang, G. Tan, F. Wang, and D. Pan. Solution to new task allocation problem on multi-core clusters. Journal of Computational Information Systems, 7(5): 1691-1697, 2011.

[3] S. Sharifian, S.A. Motamedi, and M.K. Akbari. A predictive and probabilistic load-balancing algorithm for cluster-based web servers. Applied Soft Computing, 11: 970-981, 2011.

[4] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way multithreaded sparc processor. IEEE Micro, 25: 21-29, 2005.

[5] J. M. Calandrino, J. H. Anderson, and D. P. Baumberger. A hybrid real-time scheduling approach for large-scale multicore platforms. In 19th Euromicro Conference on Real-Time Systems (ECRTS'07), pages 247-258, 2007.

[6] G. You, Y. Zhao. Dynamic requests scheduling model in multi-core web server. In the 9th International Conference on Grid and Cloud Computing (GCC2010), pages 201-206, 2010.

[7] R. Bolla, R. Bruschi. PC-based software routers: high performance and application service support. In Workshop on Programmable Routers for Extensible Service of Tomorrow (PRESTO’08), pages 27-32, 2008.

[8] A. Chonka, W. Zhou, K. Knapp, and Y. Xiang. Protecting information systems from DDoS attack using multi-core methodology. In Proceedings of the IEEE 8th International Conference on Computer and Information Technology, pages 270–275, 2008.

[9] Y. Lu, J. Tang, J. Zhao, and X. Li. A case study for monitoring-oriented programming in multi-core architecture. In Proceedings of the 1st international workshop on Multicore software engineering (IWMSE 08), pages 47-52, 2008.

[10] R. Islam, W. Zhou, Y. Xiang, and A. N. Mahmood. Spam filtering for network traffic security on a multi-core environment. Concurrency Computat.: Pract. Exper, 21: 1307-1320, 2009.

[11] H. Feng, E. Li, Y. Chen, and Y. Zhang. Parallelization and characterization of SIFT on multi-core systems. In IEEE International Symposium on Workload Characterization (IISWC 2008), pages 14-23, 2008.

[12] C. Terboven, D. an Mey, D. Schmidl, H. Jin, and T. Reichstein. Data and thread affinity in openMP programs. In Proceeding of the 2008 workshop on Memory access on future processor (MAW’08), pages 377-384, 2008.

[13] E. Hernández-Orallo, J. Vila-Carbó. Web server performance analysis using histogram workload models. Computer Networks, 53: 2727-2739, 2009.

[14] M. Andreolini, E. Casalicchio, M. Colajanni, and M. Mambelli. A cluster-based web system providing differentiated and guaranteed services. Cluster Computing, 7(1): 7–19, 2004.

[15] S.B. Siddha. Multi-core and Linux Kernel. http://oss.intel.com/pdf/mclinux.pdf.

[16] S. Sharifian, S. A. Motamedi, and M. K. Akbari. A content-based load balancing algorithm with admission control for cluster webservers. Future Generation Computer Systems, 24: 775-787, 2008.