High-Performance Algorithms for Compile-Time Scheduling of Parallel Processors

(1)

High-Performance Algorithms for Compile-Time

Scheduling of Parallel Processors

By

Yu-Kwong KWOK

A Thesis Presented to

The Hong Kong University of Science and Technology in Partial Fulfillment

of the Requirements for the Degree of Doctor of Philosophy

in Computer Science

Hong Kong, May 1997

(2)

Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

(3)

High-Performance Algorithms for Compile-Time

Scheduling of Parallel Processors

By

Yu-Kwong KWOK

APPROVED:

Dr. Ishfaq AHMAD, SUPERVISOR

Prof. Roland T. CHIN, HEAD OF DEPARTMENT

Department of Computer Science, HKUST May 1997

(4)

Acknowledgments

The completion of this thesis would not have been possible without the help and encouragement of my advisor, colleagues, and many professional acquaintances. I would like to thank all the people who have helped me in one way or another to complete my research work. In particular, I would like to thank my advisor, Dr. Ishfaq Ahmad, for introducing the field of parallel and distributed computing to me, and for his continual support on both academic and personal problems. Indeed, his guidance and advice have had a major positive impact on my development as a scientific researcher and as an individual. I am most grateful to my wife, Fyon, for her love and support on my studies that kept me going on whenever I was tired and frustrated. Without her encouragement, it would have been difficult to finish my doctoral studies.

I would also like to thank the other members of the thesis examination committee: Professor Ming L. Liou, Professor Michael Palis1, Dr. Ting-Chuen Pong, and Professor Derick Wood, for their helpful reviews and comments on the thesis. In particular, I am very grateful to Professor Derick Wood for his insightful advice and helpful technical suggestions. Thanks are also due to Dr. Dik-Lun Lee for his valuable comments on my research work. I would also like to thank our head of department, Professor Roland Chin, for his kindness and encouragements. I am also very grateful to our English tutor, Miss Elut Kwok, for her invaluable help in improving the presentation of the thesis. I would also like to take this opportunity to thank my other teachers including Dr. Manhoy Choi, Dr. Mordecai Golin, Dr. Jun Gu, Dr. Jogesh Muppala, Dr. Tin-Fook Ngai, and Dr. Dit-Yan Yeung.

I would like to acknowledge the Hong Kong Research Grants Council for supporting this work (under contract number HKUST 734/96E). Last but not the least, I would like to thank the Computer Science Department for its generosity in providing a nice, convenient, and professional environment for its graduate students.

(5)

Dedication

(6)

v

-Table of Contents

Authorization Page... i Signature Page... ii Acknowledgments...iii Dedication ...iv Table of Contents... v List of Figures ... ix List of Tables ... xv

List of Symbols ...xviii

Abstract ... xix

Chapter 1 Introduction . . . .1

1.1 Overview... 1

1.2 Parallel Architectures and The Scheduling Problem... 3

1.3 Research Objectives ... 5

1.4 Contributions... 6

1.5 Organization of the Thesis ... 7

Chapter 2 Background and Literature Survey . . . .9

2.1 Introduction... 9

2.2 Problem Statement and The Models Used... 13

2.2.1 The DAG Model ... 14

2.2.2 DAG Generation ... 14

2.2.3 Variations of the DAG Model ... 15

2.2.4 The Multiprocessor Model ... 16

2.3 NP-Completeness of the Scheduling Problem ... 16

2.4 Basic Techniques in DAG Scheduling ... 18

2.5 A Taxonomy of DAG Scheduling Algorithms ... 21

2.6 Survey of DAG Scheduling Algorithms... 23

2.6.1 Scheduling DAGs with Restricted Structures... 23

(7)

vi

-2.6.1.2 Coffman and Graham’s Algorithm for Two-Processor Scheduling ... 25

2.6.1.3 Scheduling Interval-Ordered DAGs ... 27

2.6.2 Scheduling Arbitrary DAGs without Communication ... 27

2.6.2.1 Level-based Heuristics ... 27

2.6.2.2 A Branch-and-Bound Approach... 28

2.6.2.3 Analytical Performance Bounds for Scheduling without Communication ... 29

2.6.3 UNC Scheduling ... 31

2.6.3.1 Scheduling of Primitive Graph Structures ... 31

2.6.3.2 The EZ Algorithm ... 32

2.6.3.3 The LC Algorithm ... 33

2.6.3.4 The DSC Algorithm ... 34

2.6.3.5 The MD Algorithm ... 36

2.6.3.6 The DCP Algorithm... 37

2.6.3.7 Other UNC Approaches... 39

2.6.3.8 Theoretical Analysis for UNC Scheduling ... 40

2.6.4 BNP Scheduling ... 40

2.6.4.1 The HLFET Algorithm ... 40

2.6.4.2 The ISH Algorithm ... 42

2.6.4.3 The MCP Algorithm ... 42

2.6.4.4 The ETF Algorithm ... 45

2.6.4.5 The DLS Algorithm... 46

2.6.4.6 The LAST Algorithm ... 47

2.6.4.7 Other BNP Approaches... 49

2.6.4.8 Analytical Performance Bounds of BNP Scheduling... 51

2.6.5 TDB Scheduling... 51 2.6.5.1 The PY Algorithm ... 52 2.6.5.2 The LWB Algorithm ... 53 2.6.5.3 The DSH Algorithm... 54 2.6.5.4 The BTDH Algorithm... 55 2.6.5.5 The LCTD Algorithm ... 56 2.6.5.6 The CPFD Algorithm... 58 2.6.5.7 Other TDB Approaches... 60 2.6.6 APN Scheduling... 62

2.6.6.1 The Message Routing Issue ... 62

2.6.6.2 The MH Algorithm ... 63

(8)

vii

-2.6.6.4 The BU Algorithm... 65

2.6.6.5 The BSA Algorithm... 67

2.6.6.6 Other APN Approaches ... 69

2.6.7 Scheduling in Heterogeneous Environments ... 70

2.6.8 Mapping Clusters to Processors ... 71

2.7 Summary and Concluding Remarks... 73

Chapter 3 A Low Complexity Scheduling Algorithm . . . .75

3.2 Scheduling Using Neighborhood Search ... 76

3.3 The Proposed Algorithms ... 78

3.3.1 A Solution Neighborhood Formulation ... 78

3.3.2 The Sequential FAST Algorithm... 79

3.3.3 The PFAST Algorithm... 82

3.4 An Example ... 83

3.5 Performance Results... 84

3.5.1 Number of Search Steps... 86

3.5.2 CASCH ... 86

3.5.3 Applications... 88

3.5.4 Results of the PFAST Algorithm Compared against Optimal Solutions... 89

3.5.5 Large DAGs ... 94

Chapter 4 Scheduling using Parallel Genetic Search . . . .98

4.2 Overview of Genetic Search Techniques... 100

4.2.1 Standard Genetic Algorithms ... 100

4.2.2 Genetic Search Operators ... 101

4.2.3 Control Parameters... 103

4.2.4 Parallel Genetic Algorithms ... 103

4.3 The Proposed Parallel Genetic Algorithm for Scheduling ... 104

4.3.1 A Scrutiny of List Scheduling ... 104

4.3.2 A Genetic Formulation of the Scheduling Problem... 106

4.3.3 Genetic Operators ... 108

(9)

viii

-4.3.5 Parallelization... 113

4.4.1 Comparison against Optimal Solutions... 115

4.4.2 Comparison with the PFAST Algorithm ... 119

4.4.3 Results on Regular Graphs ... 120

4.5 Related Work... 125

Chapter 5 A Parallel Algorithm for APN Scheduling . . . .128

5.2 The Proposed Algorithm ... 128

5.2.1 Scheduling Serially ... 129

5.2.2 Scheduling in Parallel... 133

5.3 An Example ... 137

Chapter 6 Conclusions and Future Research . . . .156

6.1 Summary... 156

6.1.1 A Low Time-Complexity Algorithm... 158

6.1.2 Scheduling using Parallel Genetic Search ... 158

6.1.3 A Parallel Algorithm for Scheduling under Realistic Environments... 159

6.2 Future Research Directions ... 160

References ... 162

(10)

ix

-List of Figures

Figure 1.1: (a) A shared-memory architecture; (b) A Message-passing (distributed memory) architecture; (c) An SMP cluster; (d) a heterogeneous system consisting of a network of workstations and

a high performance machine. ... 4 Figure 2.1: (a) A simplified taxonomy of the approaches to the scheduling problem; (b)

A task interaction graph; (c) A task precedence graph. ... 11 Figure 2.2: (a) A task graph; (b) The static levels (SLs), t-levels, b-levels and ALAPs of

the nodes... 19 Figure 2.3: A partial taxonomy of the multiprocessor scheduling problem. ... 21 Figure 2.4: (a) A simple tree-structured task graph with unit-cost tasks and without

communication among tasks; (b) The optimal schedule of the task graph using three processors. ... 25 Figure 2.5: (a) A simple task graph with unit-cost tasks and without communication

among tasks; (b) The optimal schedule of the task graph in a two-processor system. ... 26 Figure 2.6: (a) A unit-computational interval ordered DAG. (b) An optimal schedule

of the DAG. ... 27 Figure 2.7: (a) A fork set; and (b) a join set. ... 31 Figure 2.8: (a) The schedule generated by the EZ algorithm (schedule length = 18); (b)

A scheduling trace of the EZ algorithm... 33 Figure 2.9: (a) The schedule generated by the LC algorithm (schedule length = 19); (b)

A scheduling trace of the LC algorithm... 34 Figure 2.10:(a) The schedule generated by the DSC algorithm (schedule length = 17);

(b) A scheduling trace of the DSC algorithm (N.C. indicates “not

(11)

x

-Figure 2.11:(a) The schedule generated by the MD algorithm (schedule length = 17); (b) A scheduling trace of the MD algorithm (N.C. indicates “not considered”, N.R. indicates “no room”)... 38 Figure 2.12:(a) The schedule generated by the DCP algorithm (schedule length = 16);

(b) A scheduling trace of the DCP algorithm (N.C. indicates “not

considered”, N.R. indicates “no room”). ... 39 Figure 2.13:(a) The schedule generated by the HLFET algorithm (schedule length =

19); (b) A scheduling trace of the HLFET algorithm (N.C. indicates “not considered”)... 41 Figure 2.14:(a) The schedule generated by the ISH algorithm (schedule length = 19); (b)

A scheduling trace of the ISH algorithm (N.C. indicates “not

considered”)... 43 Figure 2.15:(a) The schedule generated by the MCP algorithm (schedule length = 20);

(b) A scheduling trace of the MCP algorithm (N.C. indicates “not

considered”)... 44 Figure 2.16:(a) The schedule generated by the ETF algorithm (schedule length = 19);

(b) A scheduling trace of the ETF algorithm (N.C. indicates “not

considered”)... 46 Figure 2.17:(a) The schedule generated by the DLS algorithm (schedule length = 19);

(b) A scheduling trace of the DLS algorithm (N.C. indicates “not

considered”)... 48 Figure 2.18:(a) The schedule generated by the LAST algorithm (schedule length = 19);

(b) A scheduling trace of the LAST algorithm (N.C. indicates “not

considered”)... 49 Figure 2.19:(a) The schedule generated by the PY algorithm (schedule length = 21); (b)

The e-values of the nodes computed by the PY algorithm. ... 53 Figure 2.20:(a) The schedule generated by the LWB algorithm (schedule length = 16);

(b) The lwb (lower bound) values of the nodes computed by the LWB algorithm. ... 54 Figure 2.21:(a) The schedule generated by the DSH algorithm (schedule length = 15);

(12)

xi

-(b) A scheduling trace of the DSH algorithm... 56 Figure 2.22:(a) The schedule generated by the LCTD algorithm (schedule length = 17);

(b) A scheduling trace of the LCTD algorithm. ... 57 Figure 2.23:(a) The schedule generated by the CPFD algorithm (schedule length = 15);

(b) A scheduling trace of the CPFD algorithm... 61 Figure 2.24:The schedule generated by the MH and DLS algorithm (schedule length =

20, total comm. costs incurred = 16). ... 64 Figure 2.25:The schedule generated by the BU algorithm (schedule length = 24, total

comm. costs incurred = 27). ... 66 Figure 2.26:Intermediate schedules produced by BSA after (a) serial injection

(schedule length = 30, total comm. cost = 0); (b) n4 migrates from PE 0 to PE 1 (schedule length = 26, total comm. cost = 2); (c) n3 migrates from PE 0 to PE 3 (schedule length = 23, total comm. cost = 4); (d) n8 migrates from PE 0 to PE 1 (schedule length = 22, total comm. cost = 9). ... 68 Figure 2.27:(a) Intermediate schedule produced by BSA after n6 migrates from PE 0 to

PE 3 (schedule length = 22, total comm. cost = 15); (b) final schedule produced by BSA after n9 migrates from PE 0 to PE 1 and n5 is bubbled up (schedule length = 16, total comm. cost = 21). ... 69 Figure 3.1: (a) A task graph; (b) The static levels (SLs), t-levels (ASAP times), b-levels,

and ALAP times of the nodes (CPNs are marked by an asterisk). ... 84 Figure 3.2: The schedule of the DAG in Figure 3.1 generated by (a) the MD algorithm;

(b) the ETF and DLS algorithms; (c) the DSC algorithm; (d) the procedure InitialSchedule(); and (e) the random search process in the FAST

algorithm. ... 85 Figure 3.3: Average normalized schedule lengths of the FAST algorithms for random

task graphs with 1000 nodes with various values of MAXSTEP and

MAXCOUNT. ... 86 Figure 3.4: The system organization of CASCH. ... 87 Figure 3.5: Normalized execution times, number of processors used, and scheduling

(13)

xii

-algorithm running times for the Gaussian elimination for all the scheduling algorithms... 88 Figure 3.6: Normalized execution times, number of processors used, and scheduling

algorithm running times for the Laplace equation solver for all the

scheduling algorithms. ... 89 Figure 3.7: Normalized execution times, number of processors used, and scheduling

algorithm running times for FFT for all the scheduling algorithms... 90 Figure 3.8: (a) The average running times of the PFAST algorithm for the type-1

random task graphs with three CCRs using 1 PPE on the Intel Paragon; (b) the average speedups of the PFAST algorithm for 2, 4, 8, and 16 PPEs.... 93 Figure 3.9: (a) The average running times of the PFAST algorithm for the type-2

random task graphs with three CCRs using 1 PPE on the Intel Paragon; (b) the average speedups of the PFAST algorithm for 2, 4, 8, and 16 PPEs.... 94 Figure 3.10:Normalized schedule lengths and scheduling times for the large DAGs for

all the scheduling algorithms. ... 95 Figure 4.1: Examples of the standard (a) crossover operator; (b) mutation operator;

and (c) inversion operator on a binary coded chromosome. ... 102 Figure 4.2: The schedule generated by the start-time minimization method (schedule

length = 16 time units)... 107 Figure 4.3: Examples of the (a) order crossover, (b) PMX crossover, (c) cycle crossover,

and (d) mutation operators... 109 Figure 4.4: (a) A simple task graph; (b) An example of generating an invalid ordering

of the graph by using the PMX crossover operator; (c) An example of generating an invalid ordering of the graph by using the cycle crossover operator... 110 Figure 4.5: (a) The average running times of the PGS algorithm for the type-1 random

task graphs with three CCRs using 1 PPE on the Intel Paragon; (b) the average speedups of the PGS algorithm for 2, 4, 8, and 16 PPEs. ... 118 Figure 4.6: The average percentage deviations from the optimal of the solutions

(14)

xiii

-generated by the PGS algorithm for the type-1 random task graphs with three CCRs using 8 PPEs on the Intel Paragon. ... 118 Figure 4.7: (a) The average running times of the PGS algorithm for the type-2 random

task graphs with three CCRs using 1 PPE on the Intel Paragon; (b) the average speedups of the PGS algorithm for 2, 4, 8, and 16 PPEs. ... 119 Figure 4.8: The average percentage deviations from the optimal of the solutions

generated by the PGS algorithm for the type-2 random task graphs with three CCRs using 8 PPEs on the Intel Paragon. ... 120 Figure 4.9: (a) The average running times of the PGS algorithm for all the regular task

graphs with three CCRs using 1 PPE on the Intel Paragon; (b) the average speedups of the PGS algorithm over the DSC algorithm; (c) the average speedups of the PGS algorithm over the DCP algorithm. ... 124 Figure 4.10:The average ratios of schedule lengths for all the regular task graphs with

three CCRs using 8 PPEs on the Intel Paragon: (a) PGS vs. DSC and (b) PGS vs. DCP. ... 125 Figure 5.1: (a) A task graph; (b) The static levels (SLs) and the CPN-Dominant

sequence of the nodes... 138 Figure 5.2: (a) Intermediate schedule produced by BSA after serial injection (schedule

length = 300, total comm. cost = 0); (b) Intermediate schedule after n4, n3, n8 are migrated (schedule length = 220, total comm. cost = 90); (c)

Intermediate schedule after n6 is migrated (schedule length = 220, total comm. cost = 150); (d) final schedule after n9 is migrated (schedule length = 160, total comm. cost = 160)... 139 Figure 5.3: (a) The randomly generated task execution times on the heterogeneous

processors; (b) Intermediate schedule produced by BSA after serial injection (schedule length = 238, total comm. cost = 0); (c) Intermediate schedule after n3, n4, and n7 are migrated (schedule length = 218, total comm. cost = 110); (d) Intermediate schedule after n8 and n9 are migrated (schedule length = 147, total comm. cost = 200); (e) final schedule after n3 is migrated (schedule length = 138, total comm. cost = 200). ... 140 Figure 5.4: (a) Computations of the EPFTs of the RPNs n1, n2, n3, n4, n6 and n7 based

(15)

xiv

-on the processing times for PE 1; (b) Computati-ons of the LPFTs of the RPNs; (c) The estimated values of the RPNs... 141 Figure 5.5: (a) The schedule of the first partition; (b) The schedule of the second

partition; (c) The schedule of the third partition; (d) The combined final schedule generated by the PBSA algorithm (schedule length = 146, total comm. costs incurred = 110). ... 142 Figure 5.6: (a) Average schedule length ratios of the PBSA algorithm to the BSA

algorithm; (b) Average speedups of PBSA with respect to BSA for regular graphs; (c) Average speedups of PBSA with respect to BSA for random graphs. ... 150 Figure 5.7: (a) Average schedule length ratios of the PBSA algorithm to the MH

algorithm; (b) Average speedups of PBSA over MH for regular graphs; (c) Average speedups of PBSA over MH for random graphs... 151 Figure 5.8: (a) Average schedule length ratios of the PBSA algorithm to the DLS

algorithm; (b) Average speedups of PBSA over DLS for regular graphs; (c) Average speedups of PBSA over DLS for random graphs. ... 152 Figure 5.9: (a) Average schedule length ratios of the PBSA algorithm to the BU

algorithm; (b) Average speedups of PBSA over BU for regular graphs; (c) Average speedups of PBSA over BU for random graphs. ... 153 Figure 5.10:The effect of parameterα on schedule length... 154

(16)

xv

-List of Tables

Table 2.1: Summary of optimal scheduling under various simplified situations. .... 17 Table 2.2: Some of the well known scheduling algorithms and

their characteristics. ... 24 Table 3.1: Results of the PFAST algorithm compared against optimal solutions

(percentage deviations) for the type-1 random task graphs with three CCRs using 1, 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 92 Table 3.2: Results of the PFAST algorithm compared against optimal solutions

(percentage deviations) for the type-2 random task graphs with three CCRs using 1, 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 94 Table 4.1: The percentage deviations from optimal schedule lengths for the type-1

random task graphs with three CCRs using four values of population-size constant: kp = 2, 3, 5, and 10; 8 PPEs were used for all the cases... 116 Table 4.2: The percentage deviations from optimal schedule lengths for the type-1

random task graphs with three CCRs using four values of number-of-generation constant: kg = 2, 5, 10, and 20; 8 PPEs were used for all the cases... 116 Table 4.3: Results of the PGS algorithm compared against optimal solutions

(percentage deviations) for the type-1 random task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 117 Table 4.4: Results of the PGS algorithm compared against optimal solutions

(percentage deviations) for the type-2 random task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 119 Table 4.5: Ratios of the schedule lengths generated by the PGS algorithm to that of the

DSC algorithm for the Gaussian elimination task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 121 Table 4.6: Ratios of the schedule lengths generated by the PGS algorithm to that of the

(17)

xvi

-DCP algorithm for the Gaussian elimination task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 121 Table 4.7: Ratios of the schedule lengths generated by the PGS algorithm to that of the

DSC algorithm for the LU-decomposition task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 122 Table 4.8: Ratios of the schedule lengths generated by the PGS algorithm to that of the

DCP algorithm for the LU-decomposition task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 122 Table 4.9: Ratios of the schedule lengths generated by the PGS algorithm to that of the

DSC algorithm for the Laplace equation solver task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 123 Table 4.10: Ratios of the schedule lengths generated by the PGS algorithm to that of the

DCP algorithm for the Laplace equation solver task graphs with three CCRs using 2, 4, 8, and 16 PPEs on the Intel Paragon. ... 123 Table 5.1: Ratios of schedule lengths generated by the MH, DLS, BU and PBSA (using

16 PPEs) algorithms for Cholesky factorization task graphs on four target topologies. ... 144 Table 5.2: Ratios of schedule lengths generated by the MH, DLS, BU and PBSA (using

16 PPEs) algorithms for Gaussian elimination task graphs on four target topologies. ... 146 Table 5.3: Ratios of schedule lengths generated by the MH, DLS, BU and PBSA (using

16 PPEs) algorithms for FFT task graphs

on four target topologies. ... 147 Table 5.4: Ratios of schedule lengths generated by the MH, DLS, BU and PBSA (using

16 PPEs) algorithms for random task graphs

on four target topologies. ... 148 Table 5.5: Average running times of the MH, DLS, BU, BSA and PBSA (using 16 PPEs)

algorithms for various task graphs across all CCRs

and target topologies. ... 149 Table 5.6: Ratios of the schedule length produced by the PBSA algorithm to that of the

(18)

xvii

-BSA algorithm for random task graphs of various sizes on a 4x2 mesh with heterogeneous processors. ... 152

(19)

List of Symbols

Notation Definition

The node number of a node in the parallel program task graph The computation cost of node

An edge from node to

The communication cost of the directed edge from node to

v Number of nodes in the task graph

e Number of edges in the task graph

p The number of processors in the target system

PE Processing Element (i.e., a processor)

CP A critical path of the task graph

CPN Critical Path Node

IBN In-Branch Node

OBN Out-Branch Node

b-level Bottom level of a node

t-level Top level of a node

ASAP As soon as possible start-time of a node

ALAP As late as possible start-time of a node

The possible data available time of on target processor P The start-time of node on target processor P

The finish-time of node on target processor P The parent node of that sends the data arrive last

TPE Target Processing Element to which the tasks are scheduled

PPE Physical Processing Element on which the scheduling algorithm is executed

q The number of physical processing elements

Pivot_PE The target processor from which nodes are migrated The processor accommodating node

CCR Communication-to-Computation Ratio

SL Schedule Length

RPN Remote Parent Node (i.e., a parent node that is handled by another PPE) The importance factor of a remote parent

The crossover rate The mutation rate The population size The number of generations Fitness value ni w n( )i ni ni,nj ( ) ni nj c n( i,nj) ni nj DAT n( i,P) ni ST n( i,P) ni FT n( i,P) ni VIP n( )i ni Proc n( )i ni α µc µm Np Ng f

(20)

High-Performance Algorithms for Compile-Time

Scheduling of Parallel Processors

By

Yu-Kwong KWOK

A Thesis Presented to

Department of Computer Science, The Hong Kong University of Science and Technology, Hong Kong May 1997

Abstract

Scheduling and mapping of precedence-constrained task graphs to the processors is one of the most crucial problems in parallel and distributed computing and thus continues to spur a great deal of interest from the researchers. Due to the NP-completeness of the problem, a large portion of related work is based on heuristic approaches with the objective of finding good solutions within a reasonable amount of time. The major drawback with most of the existing heuristics, however, is that they are evaluated with small problems sizes and thus their scalability is unknown. As a result, they are not applicable in a real environment when presented with moderately large problems. Furthermore, most heuristics ignore the important details of the application and the target system, or do not perform well when these details are taken into account. In this research, we propose three scheduling algorithms for achieving the conflicting goals of high-performance and low time-complexity. In addition, we aim at making our algorithms scalable and applicable when used in a real environment. The novelty of our algorithms is their efficient scheduling strategies which yield better solutions without incurring a high complexity. The second novelty is that our algorithms are parallelized which further lowers their complexities. The first algorithm, called the Parallel Fast Assignment using Search Technique (PFAST) algorithm, is a linear-time algorithm and uses an effective neighborhood search strategy for finding a good solution in a short time. The PFAST algorithm constructs an initial solution using a fast heuristic and then refines the solution using a parallel random search. The second algorithm, called the Parallel Genetic Search (PGS) algorithm, employs a parallel genetic technique which is based on the premises that the recombinative nature of a genetic algorithm can potentially determine an optimal schedule. Using well-defined crossover and mutation operators, the PGS algorithm judiciously combines good building-blocks of potential solutions to move towards a better solution. The third algorithm, called the Parallel Bubble Scheduling and Allocation (PBSA) algorithm, is applicable in a general distributed computing environment in that it takes into account more considerations such as link contention, heterogeneity of processors, and the network topology. The PBSA algorithm uses an efficient strategy for simultaneous scheduling of tasks and messages. The algorithm is parallelized by partitioning the task graph to smaller graphs, finding their sub-schedules concurrently, and then combining these sub-schedules. The proposed algorithms have been evaluated through extensive experimentations and yield consistently better performance when compared with numerous existing algorithms.

(21)

Chapter 1 Introduction

1.1 Overview

Parallel processing is a promising approach to meet the computational requirements of the Grand Challenge problems [92], [168] or to enhance the efficiency of solving emerging applications [110], [155]. However, it poses a number of problems which are not encountered in sequential processing including designing a parallel algorithm for the application, partitioning of the application into tasks, coordinating communication and synchronization, and scheduling of the tasks onto the machine. If these problems are not properly handled, parallelization of an application may not be beneficial. For example, if the tasks of an application are not properly scheduled to the machine, the extra inter-task communication overhead can offset the gain from parallelization. A large body of research efforts addressing these problems has been reported in the literature [15], [28], [38], [65], [81], [92], [123], [129], [130], [131], [153], [155], [164], [189], [192]. Our research focus is on the scheduling aspect.

The objective of scheduling is to minimize the completion time of a parallel application by properly allocating the tasks to the processors and sequencing the execution of the tasks. In a broad sense, the scheduling problem exists in two forms: static and dynamic. In static scheduling, which is also referred to as compile-time scheduling, the characteristics of a parallel program, including task processing times, data dependencies and synchronizations, are known before program execution [28], [38], [65]. A parallel program, therefore, can be represented by a node- and edge-weighted directed acyclic graph (DAG), in which the node weights represent task processing times and the edge weights represent data dependencies as well as the length of communication time. In dynamic scheduling, few assumptions about the parallel program can be made before execution, and thus, scheduling decisions have to be made on-the-fly [3], [94]. The goal of a dynamic scheduling algorithm as such includes not only the minimization of the program completion time but also the minimization of scheduling overhead, which represents a significant portion of the cost paid for running the scheduler. We address only the static scheduling problem. Hereafter we refer to the static

(22)

scheduling problem as simply scheduling.

The scheduling problem is NP-complete for most of its variants except for a few highly simplified cases (these cases will be elaborated in Chapter 2) [37], [40], [41], [51], [52], [63], [74], [78], [90], [100], [150], [151], [152], [158], [162], [185], and therefore, many heuristics with polynomial-time complexity have been suggested [8], [9], [31], [40], [51], [52], [66], [102], [134], [147], [154], [165], [174]. However, most of these heuristics are based on simplifying assumptions about the structure of the parallel program and the target parallel architecture, and thus are not useful in practical situations.

Common simplifying assumptions include uniform task execution times, zero inter-task communication times, contention-free communication, full connectivity of parallel processors, and availability of unlimited number of processors. These assumptions are not valid in practical situations for a number of reasons. For one thing, it is not realistic to assume that the task execution times of an application are uniform no matter whether data parallelism or functional parallelism is exploited because the amount of computation encapsulated in a task is usually varied. Furthermore, other simplifying assumptions are usually not valid for modern parallel machines such as distributed-memory multicomputers (DMMs) [92], shared-memory multiprocessors (SMMs) [92], clusters of symmetric multiprocessors (SMPs) [155], and networks of workstations (NOWs) [92]. First, inter-task communication in the form of message-passing or shared-memory access inevitably incurs a non-negligible amount of latency. Second, contention-free communication and full connectivity of processors are impossible for a DMM, a SMP or a NOW, and are hardly achievable even in a SMM. Finally, relative to the size of a parallel application in terms of the number of tasks [10], the number of processors in most parallel machines, which may be thousands [92], is nonetheless limited. Thus, a scheduling algorithm relying on such assumptions is apt to have restricted applicability in real environments.

To be applicable in a practical environment, a scheduling algorithm needs to address a number of challenging issues. It should exploit parallelism by identifying the task graph structure, and take into consideration task granularity, load balancing, arbitrary task weights and communication costs, the number of target processors available, and the scheduling of inter-processor communication. For example, to model link contention, messages have to be properly scheduled to the communication links. To properly handle message scheduling, the processor network topology should also be considered. To give good performance while taking into account all of these aspects can produce a scheduling algorithm with very high time-complexity. High complexity in turn limits scalability. For instance, many heuristics

(23)

have been evaluated with small task graphs and thus their scalability is not known. It is not clear whether they can yield good solutions within reasonable time for large task graphs found in many real-world problems [10]. On the other hand, to be of practical use, a scheduling algorithm should be fast. These conflicting requirements make the design of an efficient scheduling algorithm more intricate and challenging.

1.2 Parallel Architectures and The Scheduling Problem

Parallel processing platforms exist in a wide variety of forms [49], [84], [167]. The most common forms are the shared-memory (Figure 1.1(a)) and message-passing (distributed memory) (Figure 1.1(b)) architectures. Shared-memory machines (e.g., the BBN Butterfly [49], the KSR-1 [92] and KSR-2 [92]) present a uniform-address-space view of the memory to the programmer and therefore, interprocessor communication is accomplished through writing and reading of shared variables. The hardware generally provides an “equal-cost” access to any shared variable from any of the processors and there is no notion of communication locality. The major advantage of shared-memory architectures is that program development is relatively easy. Message-passing architectures (e.g., Intel iPSC hypercubes [84], Intel Paragon meshes [92], and CM-5 [92]) use direct communication links between processors and therefore, interprocessor communication and synchronization are achieved through explicit message-passing. In a message-passing architecture, each processing element (PE) is composed of a processor and a local memory, and is connected to a fixed number of PEs in some regular topology such as a ring or a hypercube (see Figure 1.1(b)) [166], [170]. The advantage of this approach over the shared-memory approach is the greater communication bandwidth in the system, due to the large number of simultaneous communications possible on the independent interprocessor links. Another advantage is scalability. One can add new PEs as well as communication channels to a message-passing multicomputer at a very low cost. The disadvantage is the longer communication delay when the destination PE is not directly connected to the source PE. This problem has recently been alleviated by employing a new message switching technique called the worm-hole routing [43]. However, the message set-up time remains a significant overhead in PE communications.

Recently the symmetric multiprocessor (SMP) machines has joined the parallel machines marketplace. They are designed using workstation microprocessor technology, but with several CPUs coupled together with a small shared memory in each SMP node. The term “symmetric” means that each CPU can retrieve data stored at a given memory location in the same amount of time. Each SMP node closely resemble a shared-memory multiprocessor, but is slower and less expensive. Examples include SGI’s PowerChallenge and Sun’s SparcServer

(24)

products. It is also possible to cluster SMP nodes into larger groups to aggregate more computational power, as shown in Figure 1.1(c). The resulting configuration then works like a distributed-memory multicomputer, except that each processing element now has multiple

Processors

Shared Memory Modules

P1 P2 Pn

M1 M2 Mk

(a) A shared-memory architecture using a shared bus.

PE0 PE2

PE1 PE3

PE5 PE7

PE6 PE4

Figure 1.1: (a) A shared-memory architecture; (b) A Message-passing (distributed memory) architecture; (c) An SMP cluster; (d) a heterogeneous system consisting of a network of workstations and a high performance machine.

(b) A hypercube with 8 processing elements. Shared Bus P0 P1 P3 P2 Memory Module SMP Node1 P0 P1 P3 P2 Memory Module SMP Node2 (c) Two nodes of an SMP cluster.

Communication Network

(d) A heterogeneous system comprising a parallel machine and multiple heterogeneous workstations connected via a high-speed network.

Parallel Machine Workstation Workstation Workstation Workstation Workstation Workstation Workstation

(25)

CPUs sharing a common memory.

With the advanced networking and communication tools, heterogenous integrated environments [167], as shown in Figure 1.1(d), connecting both high performance machines and workstations are also viable choices for parallel processing.

To effectively harness the computing power of all these parallel architectures, optimized scheduling of the parallel program tasks to the processors is crucial. Indeed, if the tasks of a parallel program are not scheduled properly, the parallel program is likely to produce a poor speedup which renders parallelization useless.

1.3 Research Objectives

In the design of our scheduling algorithms for efficient parallel processing, we address four fundamental aspects: performance, time-complexity, scalability, and applicability. These aspects are elaborated below.

Performance. We have designed high performance and robust scheduling algorithms. By high performance we mean the scheduling algorithms produce high quality solutions. The quality of the solutions are evaluated with respect to optimal solutions. While the proposed algorithms are heuristic in nature, they give better performance than contemporary algorithms in general. The algorithms are also robust in that they can be used under a wide range of input parameters (e.g., arbitrary number of available target processors and diverse task graph structures).

Time-complexity. We have designed our scheduling algorithms with low time-complexity by parallelization of the algorithms. The time-complexity of an algorithm is an important factor insofar as the quality of solution is not compromised. Parallelizing a scheduling algorithm is a novel as well as natural way to reduce the time-complexity. This approach is novel in that no previous work has been done in the parallelization of a scheduling algorithm. And it is natural in that parallel processing is realized only when a parallel processing platform is available. Furthermore, parallelization can be utilized not only to speed up the scheduling process further but also to improve the solution quality.

Scalability. Our parallel scheduling algorithms are scalable. On the one hand, the algorithms possess problem-size scalability, that is, the algorithms consistently give a good performance even for large input. This is essential since, in practice, a task graph generated from a numerical application can easily have a few thousands nodes [10]. On the other hand,

(26)

the algorithms possess processing-power scalability, that is, given more processors for a problem, the parallel scheduling algorithms produce solutions with almost the same quality in a shorter period of time.

Applicability. Our scheduling algorithms can be used in practical environments. To achieve this goal, we take into account realistic assumptions about the program and multiprocessor models such as arbitrary computation and communication weights, link contention, and processor network topology. Moreover, the proposed algorithms have been evaluated using real benchmarks generated by a parallel program compiler [10].

1.4 Contributions

The contributions of this thesis are as follows:

• A Low-Complexity Algorithm. We have developed a scheduling algorithm that takes linear-time to generate solutions with high quality. We have compared the algorithm with a number of well-known efficient scheduling algorithms and found that it outperforms the other algorithms in terms of solution quality as well as speed. We have also parallelized the algorithm to further reduce its running time. We have investigated the performance of the parallel algorithm under a wide range of input and scheduling parameters and found that it generates optimal solutions for a significant portion of the test cases and close to optimal solutions for the others. Furthermore, its running times are considerably shorter than other algorithms. The algorithm, called Parallel Fast Assignment using Search Technique (PFAST) algorithm, can be used in a parallel processing platform for producing high quality schedules under running time constraint.

• Near-to-Optimal Scheduling using Parallel Genetic Search. We have developed a parallel genetic-search-based scheduling algorithm which can potentially generate optimal solutions. Genetic algorithms are global search techniques which explore different regions of the search space by keeping track of a set of potential solutions of diverse characteristics, called a population. As such, genetic search techniques can potentially improve the quality of solutions when longer running time is affordable. In the proposed genetic-search-based algorithm, we have used a novel encoding scheme, an effective initial population generation strategy, and two computationally efficient genetic search operators. We have evaluated the algorithm by extensive experimentations and found that it generates optimal solutions for more than half of the test cases and close-to-optimal solutions for the other half. Another merit of genetic

(27)

search is that its inherent parallelism can be exploited to further reduce its running time. The proposed algorithm, called the Parallel Genetic Scheduling (PGS) algorithm, can be used when close-to-optimal solutions are desired.

• A Parallel Algorithm for Scheduling in Unrestricted Environments. We have developed a parallel algorithm for scheduling that is targetted for networked homogeneous or heterogeneous systems. The design of the algorithm takes into account the considerations such as limited number of processors, link contention, heterogeneity of processors and processor network topology. The algorithm has been evaluated under a board range of parameters: different number of processors, different types of interconnection topology, different extents of processor heterogeneity, and different task graph structures. In our experimental studies, we have found that the algorithm is highly scalable in that it outperforms a number of other general scheduling algorithms even for large task graphs. The proposed algorithm, called Parallel Bubble Scheduling and Allocation (PBSA) algorithm, is applicable in a general distributed ocmputing environment.

1.5 Organization of the Thesis

The organization of the thesis is as follows. In Chapter 2, we provide a detailed discussion about the scheduling problem. To enable the reader to better understand the state of current research, in the same chapter we also provide a brief literature survey, which is based on a taxonomy of scheduling algorithms. In Chapter 3, we present the proposed Parallel Fast Assignment using Search Technique (PFAST) algorithm. We first put forth the sequential version of the algorithm, called the FAST algorithm, followed by the results and comparisons with other fast scheduling algorithms. Finally we present the PFAST algorithm followed by the experimental results. In Chapter 4, we present the proposed Parallel Genetic Scheduling (PGS) algorithm. We first briefly review the basic ideas and mechanisms of genetic search techniques, followed by a discussion about the parallelization issues of a genetic algorithm. We then present a scrutiny of list scheduling techniques. This scrutiny motivates the formulation of the scheduling problem in a genetic-search framework. Finally, we present the PGS algorithm and its performance results. The PGS algorithm is also compared with the PFAST algorithm. In Chapter 5, we present the proposed Parallel Bubble Scheduling and Allocation (PBSA) algorithm. We discuss the target scheduling environments, followed by the design principles of the PBSA algorithm. A small example is used to illustrate the characteristics of the algorithm. Finally, performance results and a comparison with other algorithms are presented. Last, in Chapter 6, we conclude with some final remarks and a brief

(28)

(29)

Chapter 2 Background and Literature Survey

2.1 Introduction

To facilitate understanding the proposed algorithms and the state of current research on scheduling, in this chapter we present the background of the scheduling problem and briefly survey the representative work in the literature. Multiprocessors scheduling has been an active research area since the notion of parallel processing was introduced. However, many different assumptions and terminology are independently suggested. Unfortunately, some of the terms, concepts, and assumptions are not clearly stated nor consistently used by most of the researchers. As a result, it is difficult to appreciate the merits of various scheduling algorithms and quantitatively evaluate the performance of the algorithms. To avoid this problem, we first introduce the directed acyclic graph (DAG) model of a parallel program, and then proceed to describe the multiprocessor model. This is followed by a discussion about the NP-completeness of variants of the scheduling problem. Finally, some basic techniques used in most scheduling algorithms are introduced. Based on the classification of different scheduling environments, we describe a taxonomy of scheduling algorithms. We illustrate the taxonomy by describing several recently reported algorithms and use some scheduling examples to highlight the differences in the design of the algorithms.

Since the scheduling problem is of crucial importance to the effective utilization of large scale parallel computers and distributed computer networks, many different forms of scheduling have been studied. In a broad sense, the general scheduling problem can be divided into two categories—job scheduling and scheduling and mapping (see Figure 2.1(a)). In the former category, independent jobs are to be scheduled among the processors of a distributed computing system to optimize overall system performance [29], [33], [36], [68]. In contrast, the scheduling and mapping problem requires the allocation of multiple interacting tasks of a single parallel program in order to minimize the completion time on the parallel computer system [1], [8], [9], [18], [31], [40], [184]. While job scheduling requires dynamic run-time scheduling that is not a priori decidable, the scheduling and mapping problem can

(30)

be addressed in both static [51], [52], [66], [85], [86], [102], [134], [165] as well as dynamic contexts [3], [94], [145]. When the characteristics of the parallel program, including its task execution times, task dependencies, task communications and synchronization, are known a priori, scheduling can be accomplished offline during compile-time. On the contrary, dynamic scheduling is required when a priori information is not available and scheduling is done on-the-fly according to the state of the system. We consider only the static scheduling problem.

Two distinct models of the parallel program have been considered extensively in the context of static scheduling—the task interaction graph (TIG) model and task precedence graph (TPG) model. They are shown in Figure 2.1(b) and Figure 2.1(c).

The task interaction graph model, in which graph vertices represent parallel processes and edges denote the inter-process interaction [25], is usually used in static scheduling of loosely-coupled communicating processes to a distributed system. Since all tasks are considered as simultaneously and independently executable, there are no temporal execution dependency. For example, a TIG is commonly used to model the finite element method (FEM) [24], [26], [28]. The objective of scheduling is to minimize parallel program completion time by properly mapping the tasks to the processors [27], [119], [127]. This requires balancing the computation load uniformly among the processors while simultaneously keeping communication costs as low as possible. The research in this area was pioneered by Stone and Bohkari [24], [26], [27], [28], [177]: Stone [177] applied network-flow algorithms to solve the assignment problem while Bokhari described the mapping problem as being equivalent to graph isomorphism, quadratic assignment and sparse matrix bandwidth reduction problems [25].

The task precedence graph model, in which the nodes represent the tasks and the directed edges represent the execution dependencies as well as the amount of communication, is commonly used in static scheduling of a parallel program with tightly-coupled tasks to multiprocessors. For example, in the task precedence graph shown in Figure 2.1(c), task n4

cannot commence execution before tasks n1 and n2 finish execution and gathers all the

communication data from n2 and n3. The scheduling objective is to minimize the program

completion time or maximize the speed-up, defined as the time required for sequential execution divided by the time required for parallel execution. For most parallel applications, a task precedence graph can model the program more accurately because it captures the temporal dependencies among tasks. This is the model we use in our research.

(31)

the parallel program and the parallel machine, such as uniform node weights, zero edge weights and the availability of unlimited number of processors. However, even with some of these assumptions the scheduling problem has been proven to be NP-complete except for a few restricted cases. Indeed, the problem is NP-complete even in two simple cases: (1) scheduling tasks with uniform weights to an arbitrary number of processors [185] and (2) scheduling tasks with weights equal to one or two units to two processors [185]. There are only three special cases for which there exists optimal polynomial-time algorithms. These cases are: (1) scheduling tree-structured task graphs with uniform computation costs on arbitrary number of processors [90], (2) scheduling arbitrary task graphs with uniform computation costs on two processors [41] and (3) scheduling an interval-ordered task graph [57] with uniform node weights to an arbitrary number of processors [151]. However, even in

Parallel Program Scheduling

Job Scheduling

(independent tasks) Scheduling and Mapping (multiple interacting tasks)

Dynamic Scheduling Static Scheduling

Task Interaction Graph Task Precedence Graph

n2 n1 n3 n4 (a) (b) (c)

Figure 2.1: (a) A simplified taxonomy of the approaches to the scheduling problem; (b) A task interaction graph; (c) A task precedence graph.

(32)

these cases, communication among tasks of the parallel program is assumed to take zero time [40]. Given these observations, it is unlikely that the general scheduling problem can be solved in polynomial-time, unless .

Due to the intractability of the general scheduling problem, two distinct approaches have been taken: sacrificing efficiency for the sake of optimality and sacrificing optimality for the sake of efficiency. To obtain optimal solutions under relaxed constraints, state-space search and dynamic programming techniques have been suggested. However, these techniques are not useful because most of them are designed to work under restricted environments and most importantly they incur an exponential time in the worst case. In view of the ineffectiveness of optimal techniques, many heuristics have been suggested to tackle the problem under more pragmatic situations. While these heuristics are shown to be effective in experimental studies, they usually cannot generate optimal solutions, and there is no guarantee about their performance in general. Most of the heuristics are based on a list scheduling approach [40], which is explained below.

The basic idea of list scheduling is to make a scheduling list (a sequence of nodes for scheduling) by assigning them some priorities, and then repeatedly execute the following two steps until all the nodes in the graph are scheduled:

1) Remove the first node from the scheduling list.

2) Allocate the node to a processor which allows the earliest start-time.

There are various ways to determine the priorities of nodes such as HLF (Highest level First) [40], LP (Longest Path) [40], LPT (Longest Processing Time) [74] and CP (Critical Path) [78].

Recently a number of scheduling algorithms based on a dynamic list scheduling approach have been suggested [114], [172], [193]. In a traditional scheduling algorithm, the scheduling list is statically constructed before node allocation begins, and most importantly the sequencing in the list is not modified. In contrast, after each allocation, these recent algorithms re-compute the priorities of all unscheduled nodes which are then used to rearrange the sequencing of the nodes in the list. Thus, these algorithms essentially employ the following three-step approaches:

1) Determine new priorities of all unscheduled nodes. 2) Select the node with the highest priority for scheduling.

3) Allocate the node to the processor which allows the earliest start-time.

(33)

Scheduling algorithms which employ this three-step approach can potentially generate better schedules. However, a dynamic approach can increase the time-complexity of the scheduling algorithm.

There are a number of other variations of scheduling algorithms designed for particular target parallel processing environments. These variations include algorithms based on task-duplication technique and algorithms for scheduling tasks to processors connected by a particular topology in which link contention has to be considered. We survey representative work in these various categories. Examples are used to illustrate their difference in design.

The organization of this chapter is as follows. In Section 2.2 we provide the problem statement and describe in detail the DAG model and the multiprocessor model used in our study. In Section 2.3 we describe the optimality and NP-completeness of the various simplified cases of the problem. In Section 2.4 we describe and explain a set of basic techniques commonly used in most scheduling algorithms. In Section 2.5 we propose a taxonomy of the scheduling algorithms which is useful for understanding the different characteristics and assumptions of different algorithms. In Section 2.6 we apply the proposed taxonomy by presenting a survey of contemporary scheduling algorithms and analytical results relating to the theoretical performance of some algorithms. Small examples are given to illustrate the design of the surveyed algorithms. The final section summarizes this chapter.

2.2 Problem Statement and The Models Used

The objective of scheduling is to minimize the overall program finish-time by proper allocation of the tasks to the processors and arrangement of execution sequencing of the tasks. Scheduling is done in such a manner that the precedence constraints among the program tasks are not violated. An important implication of minimization of program finish-time is that the system throughput is maximized. This is because minimization of the overall finish-time of every parallel program increases the number of parallel programs that can be processed per unit of time. The overall finish-time of a parallel program is commonly called the schedule length or makespan.

In the literature, there have been some variations to this goal. For example, some researchers proposed algorithms to minimize the mean flow-time or mean finish-time, which is the average of the finish-times of all the program tasks [30], [122]. The significance of the mean finish-time criterion is that minimizing it in the final schedule leads to the reduction of the mean number of unfinished tasks at each point in the schedule. There are also some other

(34)

algorithms proposed to reduce the setup costs of the parallel processors [178]. We focus on algorithms for minimizing schedule length.

2.2.1 The DAG Model

In static scheduling, a parallel program can be represented by a directed acyclic graph (DAG) , where V is a set of v nodes and E is a set of e directed edges. A node in the DAG represents a task which in turn is a set of instructions which must be executed sequentially without preemption in the same processor. The weight of a node is called the computation cost and is denoted by . The edges in the DAG, each of which is denoted by , correspond to the communication messages and precedence constraints among the nodes. The weight of an edge is called the communication cost of the edge and is denoted by . The source node of an edge is called the parent node while the sink node is called the child node. A node with no parent is called an entry node and a node with no child is called an exit node. The communication-to-computation-ratio (CCR) of a parallel program is defined as its average edge weight divided by its average node weight. Hereafter we use the terms node and task interchangeably.

The precedence constraints of a DAG dictate that a node cannot start execution before it gathers all of the messages from its parent nodes. The communication cost between two tasks assigned to the same processor is assumed to be zero. If node is scheduled to some processor, then and denote the start-time and finish-time of , respectively. After all the nodes have been scheduled, the schedule length is defined as

across all processors. The goal of scheduling is to minimize .

The node and edge weights are usually obtained by estimation at compile-time [10], [38], [189]. Generation of the generic DAG model and some of the variations are described below.

2.2.2 DAG Generation

A parallel program can be modeled by a DAG. Although program loops cannot be explicitly represented by the DAG model, data-flow computations parallelism in loops can be exploited to subdivide the loops into a number of tasks by the loop-unraveling technique [20], [120]. The idea is that all iterations of the loop are started or fired together, and operations in various iterations can execute when their input data are ready for access. In addition, for a large class of data-flow computation problems and most of the numerical algorithms (such as matrix multiplication), there are very few, if any, conditional branches or indeterminism in the program. Thus, the DAG model can be used to accurately represent these applications so

G = (V E, ) n_i w n( )i ni,nj ( ) c n( _i,n_j) n_i ST n( )_i FT n( )_i n_i maxi{FT n( )i } maxi{FT n( )i }

(35)

that the scheduling techniques can be applied. Furthermore, in many numerical applications, such as Gaussian elimination or fast Fourier transform (FFT), the loop bounds are known during compile-time. As such, one or more iterations of a loop can be deterministically encapsulated in a task and, consequently, be represented by a node in a DAG.

The node- and edge-weights are usually obtained by estimation using profiling information of operations such as numerical operations, memory access operations, and message-passing primitives [97].

2.2.3 Variations of the DAG Model

There are a number of variations of the generic DAG model described above. The more important variations are: preemptive scheduling vs. non-preemptive scheduling, parallel tasks vs. non-parallel tasks and DAG with conditional branches vs. DAG without conditional branches.

Preemptive Scheduling vs. Non-preemptive Scheduling: In preemptive scheduling, the execution of a task may be interrupted so that the unfinished portion of the task can be re-allocated to a different processor [34], [75], [88], [157]. On the contrary, algorithms assuming non-preemptive scheduling must allow a task to execute until completion on a single processor. From a theoretical perspective, a preemptive scheduling approach allows more flexibility for the scheduler so that a higher utilization of processors may result. Indeed, a preemptive scheduling problem is commonly reckoned as “easier” than its non-preemptive counterpart in that there are cases in which polynomial time solutions exist for the former while the latter is proved to be NP-complete [40], [74]. However, in practice, interrupting a task and transferring it to another processor can lead to significant processing overhead and communication delays. In addition, a preemptive scheduler itself is usually more complicated since it has to consider when to split a task and where to insert the necessary communication induced by the splitting. We concentrate on the non-preemptive approaches. Parallel Tasks vs. Non-parallel Tasks: A parallel task is a task that requires more than one processor at the same time for its execution [187]. Blazewicz et al. investigated the problem of scheduling a set of independent parallel tasks to identical processors under preemptive and non-preemptive scheduling assumptions [22], [23]. Du and Leung also explored the same problem but with one more flexibility: a task can be scheduled to no more than a certain pre-defined maximum number of processors [48]. However, in Blazewicz et al. ‘s approach, a task must be scheduled to a fixed pre-defined number of processors. Wang and Cheng further extended the model to allow precedence constraints among tasks [187]. They devised a list scheduling approach to construct a schedule based on the earliest completion time (ECT)

(36)

heuristic. We concentrate on scheduling DAGs with non-parallel tasks.

DAG with Conditional Branches vs. DAG without Conditional Branches: Towsley [182] addressed the problem of scheduling a DAG with probabilistic branches and loops to heterogeneous distributed systems. Each edge in the DAG is associated with a non-zero probability that the child will be executed immediately after the parent. He introduced two algorithms based on the shortest path method for determining the optimal assignments of tasks to processors. El-Rewini and Ali [53] also investigated the problem of scheduling DAGs with conditional branches. Similar to Towsley’s approach, they also used a two-step method to construct a final schedule. However, unlike Towsley’s model, they modelled a parallel program by using two DAGs: a branch graph and a precedence graph. This model differentiates the conditional branching and the precedence relations among the parallel program tasks. The objective of the first step of the algorithm is to reduce the amount of indeterminism in the DAG by capturing the similarity of different instances of the precedence graph. After this pre-processing step, a reduced branch graph and a reduced precedence graph are generated. In the second step, all the different instances of the precedence graph are generated according to the reduced branch graph, and the corresponding schedules are determined. Finally, these schedules are merged to produce a unified final schedule. Since modelling branching and looping in DAGs is an inherently difficult problem, few work has been reported in this area. We concentrate on DAGs without conditional branching in this research.

2.2.4 The Multiprocessor Model

In DAG scheduling, the target system is assumed to be a network of processing elements (PEs), each of which is composed of a processor and a local memory unit so that the PEs do not share memory and communication relies solely on message-passing. The processors may be heterogeneous or homogeneous. Heterogeneity of processors means the processors have different speeds or processing capabilities. However, we assume every module of a parallel program can be executed on any processor even though the completion times on different processors may be different. The PEs are connected by an interconnection network of a certain topology. The topology may be fully-connected or of a particular structure such as a hypercube or mesh. Although processors may be heterogeneous, we assume the communication links are homogeneous, that is, a message is transmitted with the same speed on all links.

2.3 NP-Completeness of the Scheduling Problem

(37)

three simple cases. The first case is to schedule a uniform node-weight free-tree to an arbitrary number of processors. Hu [90] proposed a linear-time algorithm to solve the problem. The second case is to schedule an arbitrarily structured DAG with uniform node-weights to two processors. Coffman and Graham [41] devised a quadratic-time algorithm to solve this problem. Both Hu’s algorithm and Coffman et al.’s algorithm are based on node-labeling methods that produce optimal scheduling lists leading to optimal schedules. Sethi [162] then improved the time-complexity of Coffman et al.’s algorithm to almost linear-time by suggesting a more efficient node-labeling process. The third case is to schedule an interval-ordered DAG with uniform node-weights to an arbitrary number of processors. Papadimitriou and Yannakakis [151] designed a linear-time algorithm to tackle the problem. A DAG is called interval-ordered if every two precedence-related nodes can be mapped to two non-overlapping intervals on the real number line [57]. In all these three cases, communication between tasks is ignored. Recently, Ali and El-Rewini [11] showed that interval-ordered DAG with uniform edge weights, which are equal to the node weights, can also be optimally scheduled in polynomial time. These optimality results are summarized in Table 2.1.

The DAG scheduling problem is in general an NP-complete problem [63]. Ullman [185] showed that scheduling a DAG with unit computation to p processors is NP-complete. He also showed that scheduling a DAG with one or two unit computation costs to two processor is NP-complete [40], [185]. Papadimitriou and Yannakakis [151] showed that scheduling an interval ordered DAG with arbitrary computation costs to two processors is NP-complete. Garey et al. [64] showed that scheduling an opposing forest with unit computation to p

Table 2.1: Summary of optimal scheduling under various simplified situations.

Researcher(s) Complexity p Structure

Hu [90] — Uniform Free-tree NIL

Coffman and Graham

[41] 2 Uniform — NIL

Sethi [162] 2 Uniform — NIL

Papadimitriou and

Yannakakis [151] — Uniform Interval-Ordered NIL

Ali and El-Rewini [11] — Uniform (= c) Interval-Ordered Uniform (= c) Papadimitriou and

Yannakakis [151] NP-Complete — — Interval-Ordered NIL Garey and Johnson [63] Open Fixed, > 2 Uniform — NIL

Ullman [185] NP-Complete — Uniform — NIL

Ullman [185] NP-Complete Fixed, > 1 = 1 or 2 — NIL

w n( )i c n( i,nj) O v( ) O v( )2 O v( α( )v +e) O v( +e) O ev( )

(38)

processors is NP-complete. Finally, Papadimitriou and Yannakakis [152] showed that scheduling a DAG with unit computation to p processors possibly with task-duplication is also NP-complete.

2.4 Basic Techniques in DAG Scheduling

Most scheduling algorithms are based on the list scheduling technique [1], [8], [9], [31], [40], [51], [52], [66], [102], [117], [134], [165], [194]. In list scheduling, the nodes are assigned priorities and placed in a list arranged in a descending order of priority. A node with higher priority is examined for scheduling before a node with lower priority. If more than one node has the same priority, ties are broken using specific method. In the following, we describe different methods for assigning priority.

Two frequently used attributes for assigning priority are the t-level (top level) and b-level (bottom level) [1], [8], [66]. The t-level of a node is the length of a longest path (there can be ore than one longest path) from an entry node to (excluding ). Here, the length of a path is the sum of all the node and edge weights along the path. The t-level of highly correlates with ’s earliest start-time, denoted by , which is determined after is scheduled to a processor. The b-level of a node is the length of a longest path from to an exit node. The b-level of a node is bounded from above by the length of a critical path. A critical path (CP) of a DAG, which is an important structure in the DAG, is a longest path in the DAG. Clearly a DAG can have more than one CP. Consider the task graph shown in Figure 2.2(a). In this task graph, nodes are the nodes of the only CP and are called CPNs (Critical-Path Nodes). The edges on the CP are shown with thick arrows. The values of the priorities discussed above are shown in Figure 2.2(b).

Below is a procedure for computing the t-levels:

Computation of t-level:

(1) Construct a list of nodes in topological order. Call it TopList. (2) for each node in TopList do

(3) max = 0

(4) for each parent of do

(5) ift-level( ) + + > max then

(6) max = t-level( ) + +

(7) endif

(8) endfor

(9) t-level( ) = max

(10) endfor

The time-complexity of the above procedure is . A similar procedure, which also ni ni ni n_i n_i T_S( )n_i n_i ni ni n1, ,n7 n9 ni nx ni n_x w n( )_x c n( _x,n_i) nx w n( )x c n( x,ni) n_i O e( +v)

(39)

has time-complexity , for computing the b-levels is shown below:

Computation of b-level:

(1) Construct a list of nodes in reversed topological order. Call it RevTopList. (2) for each node in RevTopList do

(3) max = 0

(4) for each child of do

(5) if + b-level( ) > max then

(6) max = + b-level( )

(7) endif

(8) endfor

(9) b-level( ) = + max

(10) endfor

In the scheduling process, the t-level of a node varies while the b-level is usually a constant, until the node has been scheduled. The t-level varies because the weight of an edge may be zeroed when the two incident nodes are scheduled to the same processor. Thus, the path reaching a node, whose length determines the t-level of the node, may cease to be the longest one. On the other hand, there are some variations in the computation of the b-level of a node. Most algorithms examine a node for scheduling only after all the parents of the node have been scheduled. In this case, the b-level of a node is a constant until after it is scheduled to a processor. Some scheduling algorithms allow the scheduling of a child before its parents,

n1 2 n2 3 n5 5 n9 n3 3 n7 4 n4 4 n8 4 n6 4 1 6 10 1 1 1 1 4 1 1 5 5 (a) (b)

node SL t-level b-level ALAP

* 11 0 23 0 8 6 15 8 8 3 14 9 9 3 15 8 5 3 5 18 5 10 10 13 * 5 12 11 12 5 8 10 13 * 1 22 1 22 n1 n2 n3 n4 n5 n6 n7 n8 n9 1

Figure 2.2: (a) A task graph; (b) The static levels (SLs), t-levels, b-levels and ALAPs of the nodes.

O e( +v) n_i n_y n_i c n( _i,n_y) n_y c n( _i,n_y) n_y n_i w n( )_i