Dynamic Applications Scheduling on Heterogeneous Multi-core Systems


Chia-Chiao Ho

A Thesis Submitted to

Institute of Computer Science and Information Engineering

College of Engineering

National Chung Cheng University

for the Degree of

Master

in

Computer Science and Information Engineering


Abstract

Emerging heterogeneous multi-core systems provide high computing power and have become mainstream computing systems. However, we notice that resource contention arises in heterogeneous multi-core systems. To address this problem, we use a global resource management method to manage the heterogeneous computing resources in a system. The resource manager needs an efficient algorithm to manage the resources. We compared three algorithms in this Thesis and found that the Min-min algorithm gives the shortest makespan among them, but its average turnaround time is high. To solve this problem, we propose a new method called MMA, which is based on the idea of the Min-min algorithm and adds an aging technique to decrease the average turnaround time. We implemented a simulator to evaluate the performance of four algorithms: the MMA algorithm, the Min-min algorithm, the Max-min algorithm, and the MET algorithm. Experiments show that the Min-min algorithm still has the shortest makespan, but the makespan of the MMA algorithm is only about 1.5% longer in the first type of test cases and between 0.4% and 6.6% longer in the second type of test cases. The MMA algorithm gives the shortest average turnaround time in all test cases and outperforms the other three algorithms by 20%.


List of Figures

1.1 A scheduling example of two applications using GPU for acceleration

1.2 Alternative scheduling example of two applications using GPU for acceleration

2.1 The platform model for OpenCL

2.2 An expected execution time example for Min-min heuristic algorithm

2.3 The updated completion time after one step

3.1 An example of decomposing an application

3.2 The centralized task scheduler

4.1 The flow of the algorithm

4.2 An example with 3 compute devices and 6 applications

4.3 The scheduling result at time 0

4.6 The scheduling result

5.1 The Makespan of the First Type of Test Cases

5.2 The Turnaround Time of the First Type of Test Cases

5.3 The Makespan comparison between MMA and Min-min

5.4 The Turnaround Time comparison between MMA and Min-min

5.9 Number of Completed Applications every 50 Time Units


List of Tables

1.1 A simple example

4.1 Task execution time on each compute device

4.2 Applications arrival time

4.3 Task completion time at time 0

4.4 Task completion time after dispatching T1,0

4.7 Task completion time at time 6 after dispatching T1,1

5.1 The ranges used in the first type of test cases

5.2 Number of Completed Applications every 50 Time Units

5.3 Accumulation of Completed Applications


List of Algorithms


Chapter 1 Introduction

The first microprocessor was invented in the 1970s, and this invention made computers more common to people. Because computers are applied in different fields, such as financial services, Internet services, scientific calculation, and entertainment, they need more computing power to handle these complex jobs. To satisfy growing demands, increasing the clock frequency was the most common solution in the early years. However, the gain from increasing the clock frequency was not as large as expected. To pursue higher performance, the new strategy in designing microprocessors is to integrate several slower microprocessors into a single chip, which is called a multi-core microprocessor. In the early designs, most multi-core architectures were homogeneous. Programmers exploit new programming models to design applications that can be executed in parallel, so that performance is in most cases better than on single-core microprocessors with a high clock speed.

Nevertheless, some domain-specific applications, such as 3D video processing and physics simulation, need more computing power. These domain-specific applications run faster on domain-specific processors, such as digital signal processors (DSPs) and graphics processing units (GPUs).

Hence, a recent approach to improving the performance of domain-specific applications is to integrate accelerators such as DSPs or GPUs with a general-purpose processor (GPP) in a system or on a chip. These systems are called heterogeneous computing systems. With the introduction of such heterogeneous computing systems, programmers can assign application-specific tasks to be executed on the accelerators to increase the overall performance.

Although integrating these accelerators is thought to be a better solution, under some circumstances these accelerators would become bottlenecks in the system. In this chapter, we will explain the problems that may be encountered while designing applications on heterogeneous computing systems and will propose a solution to the problems.

1.1 Background

Heterogeneous computing systems consist of different types of processors. Traditionally there is at least one general-purpose processor (GPP) together with other application-specific processors, such as DSPs and GPUs. In recent years, the most popular heterogeneous computing systems combine CPUs and GPUs.

In a classic computer system, the GPU is designed for rendering graphics and outputting them to screens. After the introduction of the general-purpose computing on graphics processing units (GPGPU) architecture, programmers are able to exploit the computing power of GPUs to accelerate more applications in different fields.


more computing power of GPUs, there are several programming models proposed by different hardware manufacturers and software companies. With these new programming models, such as Compute Unified Device Architecture (CUDA) [1] provided by NVIDIA and Open Computing Language (OpenCL) [2] provided by the Khronos Group, programmers can exploit the computing power of GPUs.

Among these new programming models, the most attractive one is OpenCL. OpenCL was first developed by Apple Inc. and is now managed by a working group under the Khronos Group. With the contributions made by the members of the working group, such as Advanced Micro Devices, Inc., Intel Corporation, and NVIDIA Corporation, a set of cross-platform APIs and heterogeneous programming models has been established. Programmers can use the APIs and programming models to compose new applications that run on heterogeneous computing platforms without extra modification.

For example, if a programmer wants to accelerate a JPEG encoding application with GPUs made by NVIDIA, he needs to rewrite the encoder with the NVIDIA CUDA SDK. If the underlying GPU architecture or the programming model changes, for example, if the accelerators are replaced with GPUs made by AMD, then the programmer has to rewrite the application with the AMD Stream SDK [3]. Moreover, there are other accelerators not mentioned here, such as the IBM Cell B.E. processor and TI DSPs. Programmers would spend a lot of time learning different programming models and implementing different versions of applications for different accelerators.

Now consider programming with OpenCL, which provides cross-platform APIs. Once the programmer has used OpenCL to compose the accelerated parts of an application for NVIDIA GPUs, he only needs to modify the configuration of the accelerated code, with little effort, to make the application run on other accelerators.

1.2 Motivation

Although OpenCL has provided a solution for programmers to program on heterogeneous computing systems, there are still some problems that rely on programmers to deal with. One of these problems is task scheduling. When using the OpenCL programming model to accelerate parts of an application, the programmer may assign some tasks to be executed on specific GPUs or DSPs.

However, the GPU or DSP tasks are non-preemptive once they are already in execution through OpenCL APIs. If there are many applications running simultaneously in the system with only one accelerator, then most of the tasks are queued for the busy accelerator.

Take, for example, two applications A1 and A2, where A1 has two tasks T1,1 and T1,2, and A2 has two tasks T2,1 and T2,2. T1,2 needs to wait for T1,1 to complete, and T2,2 needs to wait for T2,1 to complete. Suppose the tasks T1,2 and T2,2 can be executed on either a CPU or a GPU. The execution times of these tasks were measured individually and are listed in Table 1.1.


Table 1.1: A simple example

Tasks   Execution time on CPU   Execution time on GPU
T1,1    20 ms                   N/A
T1,2    15 ms                   10 ms
T2,1    20 ms                   N/A
T2,2    15 ms                   10 ms

[Gantt chart: rows CPU core #1, CPU core #2, and GPU; tasks T1,1, T2,1, T1,2, and T2,2; time axis in ms with marks at 0, 20, 30, and 40]

Figure 1.1: A scheduling example of two applications using GPU for acceleration

This Thesis is inspired by the scenario mentioned above. It is not difficult to accelerate a single application by using different accelerators to achieve the best performance or to meet other requirements. But when more than one application is running on a system and every application has its own requirements to meet, the original arrangement cannot guarantee that the requirements are satisfied. We can regard this as a resource contention problem. To solve it, a global resource scheduling and management mechanism is needed. We propose a global resource scheduling and management framework in this Thesis and explain the framework in detail in the following chapters.

[Gantt chart: rows CPU core #1, CPU core #2, and GPU; tasks T1,1, T2,1, T1,2, and T2,2; time axis in ms with marks at 0, 20, 30, and 35]

Figure 1.2: Alternative scheduling example of two applications using GPU for acceleration
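The difference between the two schedules can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `makespan` and the data layout are hypothetical, the execution times come from Table 1.1, and the two placements are read from the time axes of Figures 1.1 and 1.2 (both second tasks on the GPU, versus T2,2 moved to a CPU core).

```python
# Execution times in ms from Table 1.1 (None: the task cannot run there).
EXEC_MS = {
    "T1,1": {"cpu": 20, "gpu": None},
    "T1,2": {"cpu": 15, "gpu": 10},
    "T2,1": {"cpu": 20, "gpu": None},
    "T2,2": {"cpu": 15, "gpu": 10},
}

def makespan(placement):
    """placement: ordered list of (task, device name, device kind)."""
    device_free = {}  # time at which each device becomes free
    app_free = {}     # time at which each application's previous task ends
    for task, device, kind in placement:
        app = task.split(",")[0]  # "T1,2" -> "T1"
        # A task starts when both its device and its predecessor are done.
        start = max(device_free.get(device, 0), app_free.get(app, 0))
        end = start + EXEC_MS[task][kind]
        device_free[device] = end
        app_free[app] = end
    return max(device_free.values())

first_tasks = [("T1,1", "cpu1", "cpu"), ("T2,1", "cpu2", "cpu")]
both_on_gpu = first_tasks + [("T1,2", "gpu", "gpu"), ("T2,2", "gpu", "gpu")]
split = first_tasks + [("T1,2", "gpu", "gpu"), ("T2,2", "cpu2", "cpu")]

print(makespan(both_on_gpu))  # 40
print(makespan(split))        # 35
```

Under these assumptions, running T2,2 on the slower CPU core finishes 5 ms earlier than waiting for the busy GPU, which is the kind of decision a global scheduler can make and a single application cannot.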


1.3 Thesis Organization


Chapter 2 Related Work

In this chapter we first introduce the Open Computing Language (OpenCL), a programming framework for heterogeneous computing systems. Then we review several programming frameworks contributed by other researchers and examine two heuristic task scheduling and mapping algorithms.

2.1 Open Computing Language

OpenCL is used for improving the programmability of heterogeneous computing systems. There are four models used to express the fundamental concepts in the OpenCL programming framework, as listed below:

• Platform model: This model describes the relationship between a host and one or more compute devices in a heterogeneous computing system.

• Memory model: There are different memory regions on the host and compute devices, and this model defines the scope, usage, and restriction of the memory regions. Moving data from one region to another region should obey the usage defined by the OpenCL programming framework.

• Execution model: The execution model is the core concept in the OpenCL programming framework. This model defines how the kernels execute on the compute devices and how the OpenCL applications work on the host. A “kernel” is a function declared in an OpenCL program and executed on the compute devices. We explain this model in detail in Section 2.1.2.

• Programming model: This model expresses two different parallel programming models that can be used by a programmer: data parallel and task parallel. Task parallel programming is not widely supported by the compute devices.

2.1.1 OpenCL Platform Model


resources are listed as follows:

• Devices: The compute devices that are available for use.

• Kernels: The OpenCL functions composed by a programmer and to be executed on compute devices.

• Program Objects: These objects include the source codes of kernels and executable binaries.

• Memory Objects: The data is encapsulated into memory objects on the host and these objects can be operated by the kernel instances on the compute devices.

Within the context, a data-structure called command-queue is used to coordinate the execution of the kernels on the compute devices. A command-queue is associated with each compute device. The OpenCL application can enqueue three types of commands into the command-queues, as listed below:

• Kernel commands: Commands of this type are used to execute kernels.

• Memory commands: Any memory-related operations between the host and compute devices belong to this type.

• Synchronization commands: The synchronization commands are a set of event manipulators. These commands help the programmer to control the execution order of commands, query the status of commands, and profile the execution time of kernels.


device, the compute device cannot execute the commands from these two command-queues at the same time. So the OpenCL runtime would select one command from these command-queues to be executed on the compute device. But there is no explicit rule to decide which command-queue will be selected first in the OpenCL specification.

2.2 Heterogeneous Computing Framework

In order to manage the computing resources on heterogeneous computing systems, there is a need to build a programming framework that supports resource management. We have surveyed several programming frameworks for heterogeneous computing systems built with CPUs and GPUs in the literature [4, 5, 6, 7, 8, 9, 10].

Spafford et al. [8] proposed a programming framework based on OpenCL, which helps programmers to deal with the data transfer between CPUs and GPUs, and also helps them to decompose and dispatch the tasks onto devices. The motivation of this work is to improve the portability of programs written in OpenCL. Although OpenCL enables kernel programs to be executed on different platforms without modification of the kernel code, the performance after porting to a different platform may decrease because the original configuration may be unsuitable. Programmers may spend extra time on tuning these configurations before porting programs from one platform to another. To address this drawback of OpenCL, they built a library that helps programmers to use OpenCL without hand-tuning the configurations. The core of this library is a high-level centralized queue, which receives applications submitted by users and is responsible for decomposing, configuring, and dispatching the applications onto proper computing resources. This framework also provides auto-tuning of the tasks’ execution parameters. When tasks are assigned to different processors, the configurations are adjusted automatically.

Another programming model proposed by Bahga et al. [9] also aims to optimize the performance with a centralized queue. But this work provided a view that is different from [8]. Their motivation is from the resource contention problem. If all applications are designed to use the fastest processors in a heterogeneous computing system, then some applications would be forced to wait until the release of desired processors. As a result, we cannot gain any benefit from using faster processors, and the execution time might be longer than executing on slower processors.

To solve this problem, they ask programmers to decompose their programs into smaller tasks and to assign different priorities to them. Smaller tasks and priority assignment can prevent some tasks from occupying the resources for too long. A resource monitor is also proposed in this work. They profiled each task to get its resource requirement. When a task is submitted to the centralized queue, the resource monitor examines the current availability of computing resources and decides whether to allow the task to execute. If the computing resources are not available, the request is rejected. This resource monitoring mechanism ensures that the tasks in execution are not interfered with by other tasks, which prevents the resource contention problem.


2.3 Heterogeneous Task Scheduling and Mapping Algorithms

After reviewing the related programming frameworks, we examine in this section the core of the scheduling mechanism, namely the scheduling and task mapping algorithms.

Given a machine with m processors and n tasks, scheduling and mapping these tasks onto the processors with a minimum makespan is an NP-Complete problem [11, 12]. No algorithm is known that solves this scheduling problem optimally in polynomial time. As a result, many heuristic scheduling algorithms have been developed to find near-optimal solutions to the scheduling problem.

In [13, 14], many scheduling algorithms are classified and compared. Among these scheduling algorithms, we adopt the ideas of the Min-min heuristic and the Max-min heuristic algorithms, because we observed that the Min-min algorithm gives a good makespan and the Max-min algorithm also gives a good makespan under some conditions.

The Min-min heuristic algorithm originates from [15] and was later implemented in [16]. The goal of the Min-min heuristic is to shorten the makespan of the scheduling result. The heuristic is given a set of tasks that are ready to be executed. These tasks are profiled or predicted to obtain their execution times. With this execution time information and the processors’ busy times, we can calculate an expected completion time for every task on every processor.

The Min-min heuristic algorithm first chooses the minimum completion time for each task, forming a minimum completion time list. Then we pick the task with the minimum value from this list and map this task onto the processor on which it can be completed the earliest.
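The two selection steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation from [16]: the function name `min_min`, the dictionary layout, the tie-breaking rule, and the sample execution times are all assumptions made for the example.

```python
def min_min(exec_time, busy):
    """Map every task to a processor with the Min-min heuristic.

    exec_time: dict task -> list of execution times, one per processor
    busy:      list of times at which each processor becomes free
    Returns (assignments, makespan); assignments holds
    (task, processor index, completion time) in scheduling order.
    """
    busy = list(busy)                # do not modify the caller's list
    unscheduled = sorted(exec_time)  # sorted for deterministic tie-breaking
    assignments = []
    while unscheduled:
        # Step 1: minimum completion time (and its processor) per task.
        best = {}
        for task in unscheduled:
            best[task] = min(
                (busy[p] + t, p) for p, t in enumerate(exec_time[task])
            )
        # Step 2: pick the task whose minimum completion time is smallest.
        task = min(unscheduled, key=lambda name: best[name][0])
        finish, proc = best[task]
        assignments.append((task, proc, finish))
        busy[proc] = finish          # the processor is busy until then
        unscheduled.remove(task)
    return assignments, max(busy)

# Hypothetical execution times of three tasks on three idle processors.
times = {"A": [7, 13, 11], "B": [15, 15, 6], "C": [9, 19, 15]}
print(min_min(times, [0, 0, 0]))
# ([('B', 2, 6), ('A', 0, 7), ('C', 0, 16)], 16)
```

Because the completion times are recomputed after every assignment, task C ends up behind A on processor 0 (finishing at 16) instead of on the idle processor 1, where it would finish at 19.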


Chapter 3 Preliminaries

The system model and the problem formulation for the dynamic scheduling mechanism proposed in this Thesis are described in this chapter.

3.1 System Model

In this section, we describe how we model the target heterogeneous computing systems and the applications.

3.1.1 Target Heterogeneous Multi-core Systems


In our system model, we assume there are two or more compute devices in the system. With more than one compute device, we can demonstrate the idea of global resource management. In other words, we can show that applications execute more efficiently when they are scheduled on two or more compute devices.

3.1.2 Applications

In our model, the applications are designed with the OpenCL framework. An application is composed of a series of OpenCL kernels; Figure 3.1 illustrates an application with four OpenCL kernels. We assume that the kernels in an application are executed sequentially. For example, the kernel K2 in the application illustrated in Figure 3.1 cannot be executed until the kernel K1 has finished. We set this restriction to represent the dependency between kernels in an application. However, we do not take kernels as the basic unit of scheduling. Instead, we define a task as the basic unit of scheduling. A task is a set of one or more kernels, depending on the programmer's design. These tasks are dispatched by the scheduler to appropriate compute devices when the application is executed.

Since tasks can be executed on various compute devices, the computation time of a task depends on the characteristics of the task and the compute device. Further, a compute device needs to spend time on reading data from the host and writing data back to the host, which constitute the communication time.

The sum of the computation time and the communication time of a task is called the execution time of the task. Other system models normally consider the computation time and the communication time separately while scheduling tasks in a system. But in our system model, we assume that a task still needs to spend time on


contend the compute devices. As a result, the resource contention problem decreases the performance of the system.

With regard to this drawback of the OpenCL execution model, we should build a centralized task scheduler to coordinate the applications in the system. The centralized task scheduler receives the tasks submitted by applications, so we know which applications intend to use the compute devices. With this information about the requested compute devices, a scheduling algorithm can make proper scheduling and mapping decisions.

We have considered several functionalities that the centralized task scheduler should provide. The centralized task scheduler should be able to receive tasks from applications. That is, the programmer should describe a task at design time and submit it to the scheduler. Other information related to the tasks, including the execution time, should also be submitted by the applications.

The centralized task scheduler is responsible for building a command queue for each compute device and for enqueuing the tasks according to the scheduling result. To maintain the execution order given by the scheduling result, the centralized task scheduler should use OpenCL event objects.

3.2 The Age of an Application

Applications are submitted to a system at different points of time, which are called their arrival times. The age of an application is defined as the time from its arrival in the system to the time it is scheduled. At every scheduling point, we can calculate the age of each ready task according to the application it belongs to. The age value can be used to observe which application has stayed in the system for a long time. If an application is finished with a shorter age, that means the application has a shorter response time. The time span from the ready time to the finished time of an application is called its turnaround time. The average turnaround time can reflect the response time of a system.
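The two quantities defined in this section can be written down directly. The sketch below is illustrative only: the helper names `age` and `average_turnaround` are hypothetical, the ready time of an application is assumed to equal its arrival time, and the finish times are made-up values.

```python
def age(arrival_time, now):
    """Age of an application at the scheduling point `now`."""
    return now - arrival_time

def average_turnaround(arrival, finish):
    """Mean of (finish time - arrival time) over all applications.

    Assumption: an application is ready as soon as it arrives.
    """
    spans = [finish[app] - arrival[app] for app in arrival]
    return sum(spans) / len(spans)

arrival = {"a0": 0, "a1": 0, "a3": 1}
finish = {"a0": 12, "a1": 9, "a3": 16}  # hypothetical finish times

print(age(arrival["a3"], now=6))            # 5
print(average_turnaround(arrival, finish))  # (12 + 9 + 15) / 3 = 12.0
```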

3.3 Problem Definition

In this Thesis, we target a resource contention problem. In this problem, there is a heterogeneous multi-core system consisting of a set of m heterogeneous compute devices P, where P = {pk, 1 ≤ k ≤ m}. A compute device pk can be a single-core or multi-core CPU, GPU, DSP, or other accelerator.

There is a set of n applications A, A = {ai, 1 ≤ i ≤ n}, to be executed on this system. Each application has a sequence of tasks. The jth task in the application ai is denoted as Ti,j. There is a dependence between Ti,j and Ti,j+1, where j ≥ 1: task Ti,j+1 cannot be executed until Ti,j is finished. Every task is non-preemptive in this system, which means that a task Ti,j will not be interrupted during its execution.

Tasks can be assigned to any compute device in a heterogeneous multi-core system. The execution time of the task Ti,j on a compute device pk is denoted as τ(Ti,j, pk). The completion time of the task Ti,j on pk is denoted as ε(Ti,j, pk).

With a dedicated scheduling algorithm, the tasks of applications would be scheduled and mapped on compute devices in the system.

The symbol Rk stands for the scheduling result on the compute device pk. The sequence of tasks assigned to a compute device pk is recorded in Rk. The completion time of Rk is the completion time of the last task in Rk. A compute device pk executes tasks following the sequence in Rk.
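As a small illustration of this notation, the following sketch computes the completion time of each Rk and takes the largest one as the makespan, which is the usual meaning of that term. The function names and the dictionary layout are hypothetical, and the sketch assumes that the tasks in a queue run back to back, that is, no task has to wait for a predecessor on another device.

```python
def device_completion(R_k, tau_k, start=0):
    """Completion time of R_k: its tasks run back to back in queue order.

    R_k:   ordered list of tasks assigned to device p_k
    tau_k: dict task -> execution time of that task on p_k
    """
    clock = start
    for task in R_k:
        clock += tau_k[task]
    return clock

def makespan(R, tau):
    """Largest completion time over all device queues."""
    return max(device_completion(R[k], tau[k]) for k in R)

# Hypothetical result: two tasks queued on p0, none on p1, one on p2.
R = {"p0": ["T0,0", "T2,0"], "p1": [], "p2": ["T1,0"]}
tau = {"p0": {"T0,0": 7, "T2,0": 9}, "p1": {}, "p2": {"T1,0": 6}}

print(device_completion(R["p0"], tau["p0"]))  # 16
print(makespan(R, tau))                       # 16
```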


Chapter 4 Dynamic Application Scheduling on Heterogeneous Multi-Core Systems

In this chapter, the Min-min based Scheduling Algorithm with Aging technique (MMA) for heterogeneous multi-core systems is proposed. Some examples are also presented in this chapter to explain the proposed algorithm.

4.1 The Scheduling Algorithm

The Min-min based Scheduling Algorithm with Aging technique is described in Algorithm 1 and the flow of this algorithm is shown in Figure 4.1.

The scheduler initializes the task queues of the compute devices at steps 6 to 8 of Algorithm 1. If there are unfinished tasks on a certain compute device pk, then the task queue Rk saves the amount of time remaining for these tasks to finish. This information helps the scheduler to calculate the completion time of a task.


idle. The time when the scheduler is triggered is called the scheduling point. The scheduler will continuously run until the ready tasks are all scheduled on appropriate compute devices or all compute devices have jobs to do.

We use a variable idleFlag to check the task queues of the compute devices. If any compute device has no task assigned, then idleFlag is true. After each task assignment, the scheduler examines idleFlag to decide whether to continue.

The scheduler gathers the ready tasks and puts them into a ready queue Qr at steps 11 to 19 of Algorithm 1. Meanwhile, three pieces of information related to each task are retrieved from the system: the execution time of the task, the completion time of the task, and the age of the task. With these three pieces of information, the scheduler can make an appropriate decision when assigning tasks to compute devices.

Next, we move on to the core of the scheduling algorithm, steps 20 to 32 of Algorithm 1. Because we want to minimize the turnaround time, the tasks with the oldest age should be scheduled earlier than the younger tasks. The first step is to determine the oldest age value in the ready queue Qr. We use maxAge to save the oldest age value. Then, from the tasks whose age equals maxAge, we choose the task Ti,j with the minimum completion time. The next step is to assign this task Ti,j to the compute device ps which gives the minimum completion time. Here we use a task queue Rs to record this assignment.


scheduling point, their completion time should be calculated again.

In this algorithm, the scheduling process stops when every compute device has been assigned at least one task. To check this, we retrieve the idle information again at steps 28 to 31. If the value of the variable idleFlag becomes false, the scheduling process stops. If one or more compute devices are still idle, the scheduling process continues until there are no more tasks to schedule at this scheduling point. Once the scheduling process is finished, the scheduler returns a set of task queues R.

The dispatcher uses the task queues returned from the scheduler to dispatch tasks according to the scheduled order in the task queues. As a result, the computing resources can be shared in a better manner.
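The scheduling loop described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the simulator code used in this Thesis: the function name `mma_schedule` and its argument layout are hypothetical, completion times are expressed relative to the scheduling point, and a device with unfinished work is treated as not idle. The input values are taken from Table 4.3, where all devices are idle at time 0.

```python
def mma_schedule(ready, busy, now):
    """One scheduling point of a Min-min scheduler with aging (sketch).

    ready: dict task -> (arrival time of its application,
                         list of execution times, one per device)
    busy:  remaining busy time of each device, relative to `now`
    Returns a list of (task, device index, completion time).
    """
    busy = list(busy)
    assigned = [b > 0 for b in busy]  # devices that already have work
    queue = dict(ready)               # the ready queue Q_r
    result = []
    # Continue while tasks remain and some device is still idle.
    while queue and not all(assigned):
        max_age = max(now - arrival for arrival, _ in queue.values())
        oldest = [t for t, (arrival, _) in queue.items()
                  if now - arrival == max_age]
        # Among the oldest tasks, take the minimum completion time.
        finish, task, dev = min(
            (busy[k] + queue[t][1][k], t, k)
            for t in oldest
            for k in range(len(busy))
        )
        result.append((task, dev, finish))
        # Raising the device's busy time adds the dispatched task's
        # execution time to every remaining completion time on it.
        busy[dev] = finish
        assigned[dev] = True
        del queue[task]
    return result

ready_at_0 = {
    "T0,0": (0, [7, 13, 11]),
    "T1,0": (0, [15, 15, 6]),
    "T2,0": (0, [9, 19, 15]),
}
print(mma_schedule(ready_at_0, busy=[0, 0, 0], now=0))
# [('T1,0', 2, 6), ('T0,0', 0, 7), ('T2,0', 0, 16)]
```

When all ready tasks have the same age, as in this call, the sketch behaves like the plain Min-min heuristic. The aging rule only changes the outcome when applications that arrived earlier are still waiting.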

4.2 Scenarios of the Proposed Approach

In Figure 4.2, a scenario of six applications being executed on a heterogeneous multi-core system with three compute devices is illustrated. On the left side of Figure 4.2, the three blocks labeled p0, p1, and p2 are the three compute devices. In the middle of Figure 4.2, there is a scheduler which is responsible for scheduling and dispatching tasks for applications. On the right side of Figure 4.2, there are six blocks with dotted borders, representing the applications. The blocks inside a dotted block are tasks, which are submitted to the scheduler one by one. The task execution times are listed in Table 4.1, and the arrival times of the applications are listed in Table 4.2.

We assume the system starts at time point 0. According to Table 4.2, the applications a0, a1, and a2 arrive at this time and are ready to be executed. Tasks are submitted from these three applications to the scheduler, and then the tasks T0,0, T1,0, and T2,0 are put into the ready queue


1 Input: A set of n applications: A; a set of m compute devices in the heterogeneous system: P
2 Output: The scheduling result R
3 idleFlag: a flag used to indicate whether there is any idle compute device
4 R_k: a queue for recording the order of tasks assigned to compute device p_k
5 Q_r: a queue for storing the ready tasks
6 foreach task queue R_k for compute device p_k do
7     R_k ← getComputeDeviceStatus(p_k)
8 end
9 idleFlag ← checkIdle(R)
10 if idleFlag = true then
11     Q_r ← ∅
12     foreach ready task T_{i,j} do
13         Q_r ← Q_r ∪ {T_{i,j}}
14         age(T_{i,j}) = getAge(a_i)
15         foreach compute device p_k do
16             τ(T_{i,j}, p_k) = getExecutionTime(T_{i,j}, p_k)
17             ε(T_{i,j}, p_k) = getCompletionTime(T_{i,j}, p_k)
18         end
19     end
20     while Q_r ≠ ∅ and idleFlag = true do
21         maxAge ← max{age(T_{i,j}) | T_{i,j} ∈ Q_r}
22         ε(T_{i,j}, p_s) = min{ε(T_{i,j}, p_k) | T_{i,j} ∈ Q_r, age(T_{i,j}) = maxAge}
23         R_s ← R_s ∪ {T_{i,j}}
24         remove T_{i,j} from Q_r
25         foreach task T_{x,y} in Q_r do
26             ε(T_{x,y}, p_s) = ε(T_{x,y}, p_s) + τ(T_{i,j}, p_s)
27         end
28         idleFlag ← checkIdle(R)
29         if idleFlag = false then
30             break
31         end
32     end
33 end
34 return R

[Flowchart with the following boxes: put ready tasks into Qr; calculate τ and ε for the tasks in Qr; retrieve maxAge from all tasks in Qr; pick the task Ti,j from Qr that has the least ε among the tasks whose age equals maxAge; map the task Ti,j onto the device ps; remove Ti,j from Qr; update ε with τ for the rest of the tasks in Qr; idleFlag = checkIdle(R); decision "is idleFlag true?" with Yes and No branches; return R]

Figure 4.1: The flow of the algorithm


Table 4.2: Applications arrival time

application   arrival time
a0            0
a1            0
a2            0
a3            1
a4            6
a5            5

Table 4.3: Task completion time at time 0

Tasks   completion time on p0   completion time on p1   completion time on p2   age
T0,0    7                       13                      11                      0
T1,0    15                      15                      6                       0
T2,0    9                       19                      15                      0

Qr. When the system reaches the scheduling point, the scheduler would schedule and dispatch these tasks in Qr onto appropriate compute devices.

At time 0, all the compute devices are idle and there are tasks in Qr, so the conditions to trigger the scheduler are satisfied. The scheduler retrieves the task execution times and calculates their completion times according to the utilization of the compute devices. Meanwhile, the ages of these tasks are also calculated for deciding the priorities. We list the task completion times and ages in Table 4.3.

According to Algorithm 1, the next step is to compare the completion times among the tasks which have the largest age value. In this case, all three tasks have the same age, hence the scheduler should choose the minimum completion time among these three tasks. According to Table 4.3, the task T1,0 has the minimum completion time when executed on the compute device p2. So, the scheduler schedules the task T1,0 first, dispatches it to the compute device p2, and removes T1,0 from Qr. After dispatching the task T1,0, the completion time of the rest of the tasks would be affected if they are also dispatched to the same


Table 4.4: Task completion time after dispatching T1,0

Tasks   completion time on p0   completion time on p1   completion time on p2   age
T0,0    7                       13                      17                      0
T2,0    9                       19                      21                      0

compute device. The scheduler updates the completion time of the rest of the tasks according to the execution time of the task T1,0. We list the updated completion times in Table 4.4.

The scheduler will check the triggering condition after dispatching a task. In this case, there are still two tasks not scheduled yet and two compute devices are also idle, hence the scheduler will continue to schedule the rest of the tasks in Qr.

By repeating the steps above we can schedule the rest of the tasks. T0,0 is dispatched to p0, and T2,0 is also dispatched to p0. The result of this scheduling event is illustrated in Figure 4.3. The x-axis represents the time line and the y-axis represents the compute devices. The blocks in this chart stand for the tasks. We can observe the utilization of the compute devices from this chart.


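Table 4.6: Task completion time at time 6

Tasks   completion time on p0   completion time on p1   completion time on p2   age
T1,1    28                      4                       19                      6
T3,1    20                      15                      12                      5
T4,0    19                      9                       10                      0
T5,0    26                      8                       20                      1

Table 4.7: Task completion time at time 6 after dispatching T1,1

Tasks   completion time on p0   completion time on p1   completion time on p2   age
T3,1    20                      19                      12                      5
T4,0    19                      13                      10                      0
T5,0    26                      12                      20                      1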

The next scheduling point is at time 6, because two compute devices, p1 and p2, are idle and four tasks are ready: T1,1, T3,1, T4,0, and T5,0. Again, the completion times and ages of the tasks are calculated; these values are listed in Table 4.6. The completion time on p0 is affected by the unfinished tasks remaining in the task queue of p0. In this case, the scheduler schedules the oldest task first, which is T1,1. The execution time of T1,1 on p1 is the smallest, so T1,1 is assigned to p1. The next step is to update the completion times of the remaining tasks; the updated values are listed in Table 4.7. We follow the same rule to pick the oldest tasks and compare the completion times among them. The scheduling result is illustrated in Figure 4.5.
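The age-based selection and the update from Table 4.6 to Table 4.7 can be checked with a short Python sketch. The helper names `select_oldest_fastest` and `dispatch` are hypothetical and are not taken from Algorithm 1. The sketch assumes that, because p1 is idle, the value listed for T1,1 on p1 (4) is also the amount by which the other tasks' completion times on p1 grow.

```python
from typing import Dict, List, Tuple

Table = Dict[str, List[int]]  # task -> completion time on p0, p1, p2

def select_oldest_fastest(completion: Table, age: Dict[str, int]) -> Tuple[str, int]:
    """Among the tasks with the largest age, pick the smallest completion time."""
    top_age = max(age[t] for t in completion)
    oldest = [t for t in completion if age[t] == top_age]
    task = min(oldest, key=lambda t: min(completion[t]))
    row = completion[task]
    return task, row.index(min(row))

def dispatch(completion: Table, task: str, dev: int, exec_time: int) -> Table:
    """Remove the dispatched task and delay the other tasks on the chosen device."""
    return {
        t: [c + exec_time if d == dev else c for d, c in enumerate(row)]
        for t, row in completion.items()
        if t != task
    }

# Values from Table 4.6.
table_4_6 = {"T1,1": [28, 4, 19], "T3,1": [20, 15, 12],
             "T4,0": [19, 9, 10], "T5,0": [26, 8, 20]}
age = {"T1,1": 6, "T3,1": 5, "T4,0": 0, "T5,0": 1}

task, dev = select_oldest_fastest(table_4_6, age)
# Assumption: p1 is idle, so the listed value is used as the execution time.
table_4_7 = dispatch(table_4_6, task, dev, exec_time=table_4_6[task][dev])
print(task, dev, table_4_7)
```

With these inputs the sketch selects T1,1 on p1, and the p1 column of the remaining tasks increases by 4, matching the values in Table 4.7.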


Figure 4.6: The scheduling result


Chapter 5 Experiments

In this chapter, we first introduce the environment used for experiments. Then, several comparative algorithms are listed in Section 5.2.

5.1 Experimenting Environment

To evaluate the quality and performance of the scheduling algorithms, we designed a simulator that simulates the scheduling of a set of applications and lets us observe the scheduling results. Using the simulator to evaluate the algorithms has several advantages over using real heterogeneous multi-core platforms. One advantage is that the simulator can cover a wide range of hardware types by modifying its arguments. Another advantage is that the evaluation time is reduced because no real-life applications need to be run. However, the correctness of the simulation results still needs to be verified on a real system after simulation.

The simulator consists of three parts: the Test Case Generator (TCG), the Scheduling Simulator (SS), and the Simulation Result Analyzer (SRA). The three parts are explained below:

• The Test Case Generator (TCG): The TCG needs five arguments to generate the test cases according to user requirements. These arguments are as follows:

– The number of compute devices: This argument specifies the number of compute devices installed in a system, and the TCG will generate the execution time of a task according to this argument.

– The maximum time for the execution of a task on a specific compute device: Since the system consists of heterogeneous compute devices, the execution time of a task varies when it is executed on different compute devices. In this simulator, the execution time is randomly assigned to each task and the maximum execution time is limited by this argument.

– The number of applications: This argument gives the number of incoming applications. The characteristics of an application are specified in terms of the number of tasks belonging to the application and the arrival time of the application. These two characteristics can be assigned via the following two arguments.

– The maximum number of tasks in an application: The number of tasks in an application can be a fixed number or a range. In this Thesis, we assume the number of tasks is not a fixed number. The TCG will randomly assign the number of tasks to an application according to this argument.


• The Scheduling Simulator (SS): The SS requires the user to implement the algorithm as a scheduling function; the SS then feeds the test cases to the scheduling function. We have implemented four algorithms in the SS, including the algorithm we proposed and the other three listed in Section 5.2.

• The Simulation Result Analyzer (SRA): In the SS, the scheduling function outputs a set of task queues after simulation. These task queues are used to record the tasks assigned to the compute device which owns the task queue. The records in the task queue include the start time of the tasks and their finish time. The SRA collects and analyzes the data from the task queues, and then calculates the makespan and turnaround time of the scheduling results.
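The analysis performed by the SRA can be sketched as follows. The `Record` type and both function names are hypothetical, and the example queues are made-up values for illustration only. The makespan is taken as the latest finish time over all task queues. The turnaround time of an application is assumed here to be the finish time of its last task minus its arrival time; this is a common definition and may differ from the one used in the simulator.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Record:
    app: int       # application that the task belongs to
    start: float   # start time of the task on its compute device
    finish: float  # finish time of the task on its compute device

def makespan(queues: List[List[Record]]) -> float:
    """Latest finish time over the task queues of all compute devices."""
    return max((r.finish for q in queues for r in q), default=0.0)

def average_turnaround(queues: List[List[Record]], arrival: Dict[int, float]) -> float:
    """Mean of (last task finish - arrival) over all applications (assumed definition)."""
    last_finish: Dict[int, float] = {}
    for q in queues:
        for r in q:
            last_finish[r.app] = max(last_finish.get(r.app, 0.0), r.finish)
    return sum(last_finish[a] - arrival[a] for a in last_finish) / len(last_finish)

# Hypothetical task queues of two compute devices.
queues = [
    [Record(app=0, start=0, finish=7), Record(app=1, start=7, finish=16)],
    [Record(app=0, start=0, finish=5)],
]
arrival = {0: 0.0, 1: 2.0}
print(makespan(queues), average_turnaround(queues, arrival))
```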

5.2 Comparative Algorithms

We chose three other scheduling algorithms and implemented them as three scheduling functions. These three algorithms are:

• Minimum Execution Time First (MET): The MET algorithm schedules tasks according to their execution time only. Each time the system reaches a scheduling point, the scheduler chooses one task arbitrarily and assigns it to the compute device that can execute the task the fastest.

• Min-min Heuristic Algorithm

• Max-min Heuristic Algorithm

The Min-min and Max-min heuristic algorithms are introduced in Chapter 2. We adopt the ideas of these two algorithms and implement them as scheduling functions in this simulator.
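The three comparative heuristics can be sketched as interchangeable selection functions. This is an illustrative sketch, not the scheduling functions of the simulator: the names `met`, `min_min`, `max_min`, and `run` are hypothetical, all tasks are assumed to be ready at time 0, and the example execution times are made up. Min-min and Max-min follow their usual definitions: both compute the minimum completion time of every task, then Min-min dispatches the task whose minimum is smallest and Max-min the task whose minimum is largest.

```python
from typing import Callable, Dict, List, Tuple

ExecTable = Dict[str, List[int]]  # task -> execution time on each compute device
Picker = Callable[[ExecTable, List[int]], Tuple[str, int]]

def _best_device(row: List[int], ready: List[int]) -> Tuple[int, int]:
    """Return (completion time, device) with the smallest completion time."""
    return min((ready[d] + row[d], d) for d in range(len(row)))

def met(exec_time: ExecTable, ready: List[int]) -> Tuple[str, int]:
    """MET: take an arbitrary task (here the first) and its fastest device."""
    task = next(iter(exec_time))
    row = exec_time[task]
    return task, row.index(min(row))  # device load is ignored

def min_min(exec_time: ExecTable, ready: List[int]) -> Tuple[str, int]:
    task = min(exec_time, key=lambda t: _best_device(exec_time[t], ready)[0])
    return task, _best_device(exec_time[task], ready)[1]

def max_min(exec_time: ExecTable, ready: List[int]) -> Tuple[str, int]:
    task = max(exec_time, key=lambda t: _best_device(exec_time[t], ready)[0])
    return task, _best_device(exec_time[task], ready)[1]

def run(pick: Picker, exec_time: ExecTable, n_devices: int) -> int:
    """Dispatch all tasks with the given heuristic and return the makespan."""
    ready = [0] * n_devices        # time at which each device becomes free
    pending = dict(exec_time)
    while pending:
        task, dev = pick(pending, ready)
        ready[dev] += pending.pop(task)[dev]
    return max(ready)

# Hypothetical execution times on two compute devices.
example = {"A": [3, 6], "B": [5, 4], "C": [9, 12]}
results = {f.__name__: run(f, example, 2) for f in (met, min_min, max_min)}
print(results)
```

Passing the heuristic as a function mirrors the way the SS accepts a user-supplied scheduling function, so an additional algorithm only needs another selection function with the same signature.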

5.3 Test Cases

Before explaining the experiment results, we describe the test cases in detail in this section. We use two types of test cases to evaluate the performance of the algorithm we proposed. These two types are explained below:

1. The aim of the first test case is to observe the system performance under different inter-arrival times of the applications. We set different ranges of application arrival times, in other words, different densities of applications that the system needs to process in a time span. The ranges used in this Thesis are listed in Table 5.1. The smaller the range is, the more applications arrive in a time span. In this case, the other arguments are set as follows:

• The number of compute devices: 4

• The number of applications: 100

• The maximum execution time of a task: 50

• The maximum number of tasks of an application: 15

Since the number of applications is set to 100, the range 10 in Table 5.1 means that all 100 applications arrive between time 0 and time 10. The range 8000 means that these 100 applications arrive at the system between time 0 and time 8000.


Table 5.1: The ranges used in the first type of test cases

Range   Inter-arrival Time
10      0.1
50      0.5
100     1
200     2
400     4
500     5
1000    10
2000    20
4000    40
8000    80

with a small number of compute devices, but performing badly when used in a large system with many compute devices. We test five different numbers of compute devices, namely 2, 4, 6, 8, and 16, in this test case, and the other arguments are set as follows:

• The number of applications: 100

• The maximum execution time of a task: 50

• The maximum number of tasks of an application: 15

• The range of arrival time: 500
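A test case with the arguments listed above can be generated along the following lines. This is a sketch under stated assumptions rather than the TCG itself: the function names and the dictionary layout are hypothetical, arrival times are assumed to be drawn uniformly from the arrival range, execution times are assumed to be integers of at least 1, and every application is assumed to have at least one task. The helper `mean_inter_arrival` reproduces the relation between the two columns of Table 5.1, where the inter-arrival time equals the range divided by the number of applications.

```python
import random
from typing import Dict, List

def generate_test_case(n_devices: int, n_apps: int, max_exec: int,
                       max_tasks: int, arrival_range: float,
                       seed: int = 0) -> List[Dict]:
    """Generate applications with random task counts, execution times, and arrivals."""
    rng = random.Random(seed)  # fixed seed keeps a test case reproducible
    apps = []
    for app_id in range(n_apps):
        n_tasks = rng.randint(1, max_tasks)
        # One execution time per compute device for every task.
        tasks = [[rng.randint(1, max_exec) for _ in range(n_devices)]
                 for _ in range(n_tasks)]
        apps.append({"id": app_id,
                     "arrival": rng.uniform(0, arrival_range),
                     "tasks": tasks})
    apps.sort(key=lambda a: a["arrival"])  # process applications in arrival order
    return apps

def mean_inter_arrival(arrival_range: float, n_apps: int) -> float:
    """Average gap between arrivals, as in the second column of Table 5.1."""
    return arrival_range / n_apps

# Arguments of the second type of test cases with 4 compute devices.
apps = generate_test_case(n_devices=4, n_apps=100, max_exec=50,
                          max_tasks=15, arrival_range=500)
print(len(apps), mean_inter_arrival(500, 100))
```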

5.4 Experiment Results

In this section, the experiment results will be examined with two performance metrics: makespan and average turnaround time. The makespan is defined as the longest processing time of a compute device in the system with a given test case. The average turnaround time is an average value of all the applications’ turnaround time. The turnaround time is the


5.5 Conclusions on the Experiment Results

After reviewing these two experiment results, we find that although the MMA algorithm delivers a longer makespan than the Min-min algorithm, it gives a shorter average turnaround time in both types of test cases.

To find out why the MMA algorithm performs worse than the Min-min algorithm on the makespan metric, we examine one test case and take throughput as an additional metric. The throughput metric indicates the number of applications completed in a given time span. Here, we take 50 time units as the time span; the number of completed applications is shown in Figure 5.9 and the data are listed in Table 5.2. We can see that at least two applications are completed every 50 time units with the MMA algorithm, whereas the Min-min algorithm gives unstable throughput in this case. If a system gives stable throughput, users receive execution results steadily as applications finish, instead of receiving all the results at once. As shown in Figure 5.10 and Table 5.3, the MMA algorithm completes applications at a stable rate, whereas the Min-min algorithm finishes applications at the last moment.
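The two views of throughput used above, the number of applications completed per 50-time-unit window and the running total, can be computed as in the following sketch. The function names are hypothetical and the completion times are made-up values, not data from Table 5.2 or Table 5.3. The sketch assumes that a completion falling exactly on a window boundary is counted in the later window.

```python
from itertools import accumulate
from typing import List

def completed_per_window(completion_times: List[float], window: float = 50.0) -> List[int]:
    """Count the applications completed in each consecutive time window."""
    if not completion_times:
        return []
    n_windows = int(max(completion_times) // window) + 1
    counts = [0] * n_windows
    for t in completion_times:
        counts[int(t // window)] += 1  # window k covers [k*window, (k+1)*window)
    return counts

def cumulative(counts: List[int]) -> List[int]:
    """Running total of completed applications after each window."""
    return list(accumulate(counts))

# Hypothetical application completion times.
finish_times = [12, 40, 49, 75, 130]
per_window = completed_per_window(finish_times)
print(per_window, cumulative(per_window))
```

A schedule with stable throughput yields per-window counts of similar size, while a schedule that finishes applications late concentrates the counts in the last windows.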


Table 5.2: Number of Completed Applications every 50 Time Units

Time Min-min MMA


Table 5.3: Accumulation of Completed Applications

Time Min-min MMA


Chapter 6 Conclusions and Future Work

We have noticed that there is a resource contention problem in the newly emerged heterogeneous multi-core systems. To solve this problem, we exploited a global resource management method to manage the computing resources in a system. The resource manager needs an efficient algorithm to manage the resources. Thus we compared three algorithms and found that the Min-min algorithm gives the shortest makespan among these algorithms, but its average turnaround time is high.


algorithm.

In this Thesis, we used a simulator to evaluate the performance of different scheduling algorithms on heterogeneous multi-core systems. In the future, we will implement these scheduling algorithms on a real heterogeneous multi-core system and compare the simulation results with the actual performance. The application model in this Thesis is restricted, so we will enhance this model to obtain more precise simulation results. With an actual implementation and a more precise model, we can then enhance the MMA scheduling algorithm to improve the makespan and meet more user requirements.


Bibliography

[1] NVIDIA Corporation. OpenCL Programming Guide for the CUDA Architecture, 3.2 edition, Auguest 2010.

[2] Aaftab Munshi. The OpenCL Specification. Khronos OpenCL Working Group, 1.1 edition, September 2010.

[3] Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing (APP) SDK OpenCLTMProgramming Guide, 1.2c edition, April 2011.

[4] V´ıctor J. Jim´enez, Llu´ıs Vilanova, Isaac Gelado, Marisa Gil, Grigori Fursin, and Na-cho Navarro. Predictive runtime code scheduling for heterogeneous architectures. In Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers, HiPEAC ’09, pages 19–33, Berlin, Heidelberg, 2009. Springer-Verlag.

(61)

[6] Tomoaki Hamano, Toshio Endo, and Matsuoka Satoshi. Power-aware dynamic task scheduling for heterogeneous accelerated clusters. In Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing, pages 1–8, Washing-ton, DC, USA, 2009. IEEE Computer Society.

[7] A.P.D. Binotto, B.M.V. Pedras, M. Gotz, A. Kuijper, C.E. Pereira, A. Stork, and D.W. Fellner. Effective dynamic scheduling on heterogeneous multi/manycore desktop platforms. In Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2010 22nd International Symposium on, pages 37 –42, oct. 2010. [8] Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter. Maestro: data orchestration and

tuning for opencl devices. In Proceedings of the 16th international Euro-Par confer-ence on Parallel processing: Part II, Euro-Par’10, pages 275–286, Berlin, Heidelberg, 2010. Springer-Verlag.

[9] A. Bahga and V.K. Madisetti. A dynamic resource management and scheduling en-vironment for embedded multimedia and communications platforms. Embedded Sys-tems Letters, IEEE, 3(1):24 –27, march 2011.

[10] Mario Kicherer, Rainer Buchty, and Wolfgang Karl. Cost-aware function migration in heterogeneous systems. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC ’11, pages 137–145, New York, NY, USA, 2011. ACM.

[11] J. D. Ullman. Np-complete scheduling problems. J. Comput. Syst. Sci., 10:384–393, June 1975.

(62)

[12] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990. [13] Howard Jay Siegel and Shoukat Ali. Techniques for mapping tasks to machines in

heterogeneous computing systems. J. Syst. Archit., 46:627–639, June 2000.

[14] Tracy D. Braun, Howard Jay Siegel, Noah Beck, Lasislau L. B¨ol¨oni, Muthucumara Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D. Theys, Bin Yao, Debra Hensgen, and Richard F. Freund. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. J. Parallel Distrib. Comput., 61:810–837, June 2001.

[15] Oscar H. Ibarra and Chul E. Kim. Heuristic algorithms for scheduling independent tasks on nonidentical processors. J. ACM, 24:280–289, April 1977.

[16] R F Freund, T Kidd, D Hensgen, and L Moore. Smartnet: A scheduling framework for metacomputing. In 2nd International Symposium on Parallel Architectures, Algo-rithms, and Networks (ISPAN ’96, pages 514–521, 1996.

[17] Muthucumaru Maheswaran, Shoukat Ali, Howard Jay Siegel, Debra Hensgen, and Richard F. Freund. Dynamic mapping of a class of independent tasks onto hetero-geneous computing systems. J. Parallel Distrib. Comput., 59:107–131, November 1999.

(63)

[19] C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst. Data-aware task scheduling on multi-accelerator based platforms. In Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, pages 291 –298, dec. 2010.

[20] Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli. State-of-the-art in heterogeneous computing. Sci. Program., 18:1– 33, January 2010.

[21] Michael D. Linderman, James Balfour, Teresa H. Meng, and William J. Dally. Em-bracing heterogeneity: parallel programming for changing hardware. In Proceedings of the First USENIX conference on Hot topics in parallelism, HotPar’09, pages 3–3, Berkeley, CA, USA, 2009. USENIX Association.

[22] Dominik Grewe and Michael OBoyle. A static task partitioning approach for het-erogeneous systems using opencl. In Jens Knoop, editor, Compiler Construction, volume 6601 of Lecture Notes in Computer Science, pages 286–305. Springer Berlin / Heidelberg, 2011.

[23] H. Topcuoglu, S. Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. Parallel and Distributed Systems, IEEE Transactions on, 13(3):260 –274, mar 2002.

[24] Richard F. Freund and Howard Jay Siegel. Guest editor’s introduction: Heterogeneous processing. Computer, 26:13–17, 1993.

[25] Ashfaq A. Khokhar, Viktor K. Prasanna, Muhammad E. Shaaban, and Cho-Li Wang. Heterogeneous computing: Challenges and opportunities. Computer, 26:18–27, 1993.

(64)

[26] Howard Jay Siegel, Seth Abraham, William L. Bain, Kenneth E. Batcher, Thomas L. Casavant, Doug DeGroot, Jack B. Dennis, David C. Douglas, Tse-Yun Feng, James R. Goodman, Alan Huang, Harry F. Jordan, J. Robert Jamp, Yale N. Patt, Alan Jay Smith, James E. Smith, Lawrence Snyder, Harold S. Stone, Russ Tuck, and Benjamin W. Wah. Report of the purdue workshop on grand challenges in computer architecture for the support of high performance computing. J. Parallel Distrib. Comput., 16(3):199– 211, 1992.

[27] I. Ekmecic, I. Tartalja, and V. Milutinovic. A survey of heterogeneous computing: concepts and systems. Proceedings of the IEEE, 84(8):1127 –1144, aug 1996.

[28] I. Ekmecic, I. Tartalja, and V. Milutinovic. Em3: a taxonomy of heterogeneous com-puting systems. Computer, 28(12):68 –70, dec 1995.
