Turku Centre for Computer Science
TUCS Dissertations
No 181, August 2014
Johan Ersfolk
Scheduling Dynamic Dataflow
Graphs with Model Checking
To be presented, with the permission of the Department of Information
Technologies at Åbo Akademi University, for public criticism in
Auditorium Gamma on August 15, 2014, at 12 noon.
Åbo Akademi University
Department of Information Technologies
Joukahainengatan 3-5 A
Supervisor
Professor Johan Lilius
Department of Information Technologies
Åbo Akademi University
Joukahainengatan 3-5, FI-20520 Åbo, Finland
Reviewers
Associate Professor Stavros Tripakis
Department of Information and Computer Science
Aalto University
PO Box 15400, FI-00076 Aalto, Finland
Docent Johan Eker
Ericsson Research
Lund, Sweden
Opponent
Associate Professor Stavros Tripakis
Department of Information and Computer Science
Aalto University
PO Box 15400, FI-00076 Aalto, Finland
ISBN 978-952-12-3091-2 ISSN 1239-1883
Abstract
With the shift towards many-core computer architectures, dataflow programming has been proposed as one potential solution for producing software that scales to a varying number of processor cores. Programming for parallel architectures is considered difficult because the currently popular programming languages are inherently sequential, and introducing parallelism is typically left to the programmer. Dataflow, however, is inherently parallel: it describes an application as a directed graph, where nodes represent calculations and edges represent data dependencies in the form of queues. These queues are the only allowed communication between the nodes, which makes the dependencies between the nodes, and thereby also the parallelism, explicit. Once a node has sufficient inputs available, it can, independently of any other node, perform calculations, consume inputs, and produce outputs.
Dataflow models have existed for several decades and have become popular for describing signal processing applications, as the graph representation is very natural within this field. Digital filters are typically described with boxes and arrows, also in textbooks. Dataflow is also becoming more interesting in other domains, and in principle any application working on an information stream fits the dataflow paradigm. Such applications include network protocols, cryptography, and multimedia applications. As an example, the MPEG group standardized a dataflow language called RVC-CAL to be used within reconfigurable video coding. Describing a video coder as a dataflow network, instead of with conventional programming languages, makes the coder more readable, as it describes how the video data flows through the different coding tools.
While dataflow provides an intuitive representation for many applications, it also introduces some new problems that need to be solved in order for dataflow to be more widely used. The explicit parallelism of a dataflow program is descriptive and enables improved utilization of the available processing units; however, the independence of the nodes also implies that some kind of scheduling is required. The need for efficient scheduling becomes even more evident when the number of nodes is larger than the number of processing units and several nodes run concurrently on one processor core.
There exist several dataflow models of computation, with different trade-offs between expressiveness and analyzability. These vary from rather restricted but statically schedulable models, with minimal scheduling overhead, to dynamic models where each firing requires a firing rule to be evaluated. The model used in this work, namely RVC-CAL, is a very expressive language that in the general case requires dynamic scheduling. However, the strong encapsulation of dataflow nodes enables analysis, and the scheduling overhead can be reduced by using quasi-static, or piecewise static, scheduling techniques.
The scheduling problem is concerned with finding the few scheduling decisions that must be made at run-time, while most decisions are pre-calculated. The result is then a set of static schedules, as small as possible, that are dynamically scheduled. To identify these dynamic decisions and to find the concrete schedules, this thesis shows how quasi-static scheduling can be represented as a model checking problem. This involves identifying the relevant information to generate a minimal but complete model to be used for model checking. The model must describe everything that may affect scheduling of the application while omitting everything else in order to avoid state space explosion. This kind of simplification is necessary to make the state space analysis feasible. For the model checker to find the actual schedules, a set of scheduling strategies is defined, which are able to produce quasi-static schedulers for a wide range of applications.
The results of this work show that actor composition with quasi-static scheduling can be used to transform dataflow programs to fit many different computer architectures with different types and numbers of cores. This, in turn, enables dataflow to provide a more platform-independent representation, as one application can be fitted to a specific processor architecture without changing the actual program representation. Instead, the program representation is, in the context of design space exploration, optimized by the development tools to fit the target platform. This work focuses on representing the dataflow scheduling problem as a model checking problem and is implemented as part of a compiler infrastructure. The thesis also presents experimental results as evidence of the usefulness of the approach.
Sammandrag (Summary)
Over the past decade we have seen a shift from ever faster single-core processors towards computer architectures with an increasing number of processor cores. The reason for this development is that the limit has been reached where every increase in the speed of a processor causes a considerably larger increase in energy consumption, which means more expensive computations and, at the same time, problems with processors overheating. While developments in semiconductor technology still offer an increase in the number of transistors that fit on a given area, and thereby offer the same kind of increase in computing capacity as before, the trend is now to use these transistors to increase the number of processor cores. This in turn has made it harder for programmers to construct programs that efficiently use the computing capacity available on modern processors.
Programming parallel architectures is generally considered difficult because the currently popular programming languages are fundamentally sequential, while parallelism is often something the programmer introduces. Dataflow programming has been proposed as a possible solution for producing programs that scale to a varying number of processor cores. The dataflow description is parallel by nature and describes a program as a directed graph in which the nodes represent computations and the edges represent a data dependency in the form of a queue. These queues are the only permitted communication between the nodes, which makes the data dependencies between the nodes, and thereby also the parallelism, explicit. As soon as a node has its required input data available, the node can, independently of any other node, perform its computations, consume input data, and produce output data.
Dataflow programming has been of interest in research for several decades and has more generally become popular for describing signal processing applications, since the graph representation is a very natural representation in this field. Digital filters, for example, are usually described with rectangles and arrows, corresponding to computations and signals, also in textbooks. Dataflow is also becoming increasingly interesting in other areas, and in principle all programs that work on an information stream are suitable to be described with the dataflow paradigm. Such applications include network protocols, cryptography, and multimedia programs. As an example, the MPEG group has standardized a dataflow language called RVC-CAL to be used within reconfigurable video coding for describing video coding components, while the communication between these components is described as data queues. Describing a video coder as a dataflow network instead of with conventional programming languages makes the program more readable, since it describes how the video data flows through the different coding tools.
While dataflow gives an intuitive representation for many applications, some new problems also arise that must be solved for dataflow to be used more widely. The explicit parallelism in a dataflow program is descriptive and enables improved utilization of the available processing units, but the fact that a program consists of a number of independent nodes also means that some form of scheduling is needed. The need for efficient scheduling becomes especially evident when the number of nodes is larger than the number of processing units, which means that several nodes take turns running on one processor core. There are several dataflow models of computation with different trade-offs between expressiveness and analyzability. These vary from rather restricted but statically schedulable models, with minimal scheduling cost at run time, to dynamic ones in which every task requires a condition to be evaluated at run time. The model of computation used in this work, namely DPN (dataflow process network), on which the language RVC-CAL is also based, is a very expressive model that in the general case requires dynamic scheduling at run time. Fortunately, however, the strong encapsulation of dataflow nodes enables powerful analysis, which makes it possible to identify parts of a program that can be scheduled piecewise in advance, leading to reduced scheduling cost at run time.
The scheduling problem consists of finding the few scheduling decisions that must be made while the program is active, while most decisions can be computed at design time. The result is then a set of static schedules, as small as possible, where only the choice of schedule is made at run time. To identify these dynamic decisions and to find concrete schedules, this thesis presents how piecewise static scheduling can be represented as a model checking problem. This is mainly about identifying the relevant information needed to generate a small but complete model that can be used to explore the state space of the program. The model should describe everything that may affect the scheduling but hide all other details in order to avoid combinatorial explosion of the state space. This kind of simplification is necessary to make the analysis of the state space feasible. To identify the concrete schedules in the state space, a set of scheduling strategies is needed that give a partial description of a set of states that are repeatedly visited while the program runs and that can thereby be linked together by a few static schedules. In this way one can make an estimate of the schedules needed for the program to work correctly and describe them partially so that they can be searched for in the state space.
The results of this work show that composition of dataflow nodes with piecewise static scheduling methods can be used to adapt dataflow programs to a variety of computer architectures with different properties and a varying number of processor cores. This in turn enables dataflow to be used as a more platform-independent representation, since an application can be fitted to a specific processor architecture without changing the program representation itself. Instead, the program representation is modified by various development tools to suit the target platform. The work focuses on how the dataflow scheduling problem can be represented as a model checking problem, and the tools needed to show that the method works have been constructed as part of an existing compiler infrastructure. Finally, the thesis presents some case studies with experiments showing that the method is useful and can be used to make programs more efficient.
Acknowledgements
One may have a blast working on a dissertation if the work environment is right and the needed support is available. Many, many people and organizations have contributed to this thesis, be it with moral support, technical support, financial support, or pure cooperation.
First of all, I would like to thank my supervisor, Johan Lilius, for the opportunity to pursue doctoral studies and to work at the Embedded Systems Lab for all these years. Speaking of which, I am also grateful for the patience and the appropriate pushing-in-the-right-direction that finally resulted in this thesis. I am also grateful to Stavros Tripakis and Johan Eker for reviewing this thesis and for providing useful feedback regarding possible improvements to the contents. A special thanks also to Ream Barclay, who not only checked the language of the thesis but also helped and instructed me on how to improve it; after correcting hundreds of issues, the text is now in a readable shape.
The research work related to this thesis, including the collection of papers which is part of this thesis, is the result of several people's ideas and efforts. I am especially grateful to all the co-authors of these papers, and these are many: Ghislain Roquier, Marco Mattavelli, Jani Boutellier, Amanullah Ghazi, Olli Silvén, Fareed Jokhio, Wictor Lund, and Johan Lilius. Also, I would like to thank everyone behind the efforts related to the Orcc project and the large number of CAL applications that are available, as this has been essential for this work to be possible. In particular, I would like to mention Hervé Yviquel, Mathieu Wipliez, Antoine Lorence, and Mickaël Raulet, who have always been ready to answer questions related to Orcc and assist with any problems related to the implementation work.
The ESLab has been a great place to work; even though the set of people in the lab has changed during these years and the lab has increased in size, the atmosphere and the working philosophy are still the same as in the old days. It is always refreshing to visit the offices of Stefan Grönroos and Dag Ågren and check out their latest gadgets. Then, with the addition of Jerker Björkquist and Wictor Lund, the future of technology cannot stay unchanged. For the uncountable hours spent discussing the ideas that now are the contributions of this thesis, a huge thanks goes to Andreas Dahlin and Sébastien Lafond.
There are a large number of issues that are not directly related to research but are needed in order to get the research done. One needs to travel to conferences, get the computers to work, print the dissertation, etc. While the thanks belong to the whole department, I would like to mention the few I personally have felt free to consult at any time; these are especially Tove Österroos, Nina Rytkönen, Christel Engblom, Pia Kallio, Joachim Storrank, Niklas Grönblom, Karl Rönnholm, and Marat Vagapov. I would also like to thank Tomi Mäntylä for all the help with getting this thesis printed.
I would also like to thank the IT department, TUCS, and Tekniikan edistämissäätiö for funding my research.
Let us now consider the sequence of events that resulted in this thesis. It all started in 2001 somewhere in the deep forests between Myrkky and Perälä; the names, by the way, mean poison and behind. After roller skating about the kilometers on a gravel road, Richard Westerbäck convinced me that I should join him to Åbo Akademi. So, this we did. Several years later, when it was time to do my master's thesis, the same guy again convinced me to join him and do the master's thesis at the Embedded Systems Laboratory. So, an enormous thanks to Richard Westerbäck, who is responsible for me making all the right decisions regarding work-related questions.
Finally, a big thanks to my family. Being a student for more than two decades requires a lot of patience from everybody around you. First of all, I am grateful for all the support from my parents, Berit and Börje: "he ha it fäilast naa", and to Mikael, Reetta, Tanja, Abdullah, Yasmine, and Jacob: "ni ha heija oåp bra". Also to my wife Tabita and my family-in-law Riitta, Mauri, Tomi, and Aili: "kiitos kaikesta tuesta". And last, not least, but definitely the smallest, the finest of all dogs, Ruben, whose contribution to this thesis cannot be described with words.
List of Original Publications
1. Johan Ersfolk, Ghislain Roquier, Fareed Jokhio, Johan Lilius, and Marco Mattavelli. Scheduling of dynamic dataflow programs with model checking. In IEEE International Workshop on Signal Processing Systems (SiPS), 2011. Beirut, Lebanon.
2. Johan Ersfolk, Ghislain Roquier, Johan Lilius, and Marco Mattavelli. Scheduling of dynamic dataflow programs based on state space analysis. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 1661–1664, March 2012. Kyoto, Japan.
3. J. Ersfolk, G. Roquier, W. Lund, M. Mattavelli, and J. Lilius. Static and quasi-static compositions of stream processing applications from dynamic dataflow programs. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 2620–2624, 2013. Vancouver, Canada.
4. J. Ersfolk, G. Roquier, J. Lilius, and M. Mattavelli. Modeling control tokens for composition of CAL actors. In Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, pages 71–78, 2013. Cagliari, Italy.
5. J. Boutellier, A. Ghazi, O. Silven, and J. Ersfolk. High-performance programs by source-level merging of RVC-CAL dataflow actors. In Signal Processing Systems (SiPS), 2013 IEEE Workshop on, pages 360–365,
Contents
I Research Summary
1 Introduction
1.1 Recent Developments
1.2 Programming Many-Core Systems
1.2.1 Current State of the Art
1.2.2 Dataflow programming
1.3 The Problem Statement
1.4 Contribution of the Thesis
1.4.1 Constructed Artifacts
1.5 Research Methods in this Thesis
1.5.1 Design-science research
1.5.2 Experiments and Measurements
1.6 Structure of Thesis
2 Some Background Regarding Dataflow Models
2.1 Processes, Actors, and Dataflow Models
2.2 Dataflow Models of Computation
2.3 The Cal Actor Language
2.3.1 Relevant Language Constructs
2.3.2 Scheduling and Code Generation
3 The Scheduling Problem
3.1 Specifying Dynamism
3.2 Scheduling Based on Partition Input
3.2.1 Choices and Redundant Choices
3.2.2 Control Token Paths and Guard Strength
3.2.3 Variable Dependency Graph
3.2.4 A Few Complex or Many Simple Guards?
3.3 Nondeterminism and Time-dependent Actors
3.3.1 Introduction of Time-dependency by Composition
3.3.2 Analyzing Independent Data Paths
3.4.1 Boundedness of Nondeterminism
3.4.2 Valid Partitions
4 Analysis of an Actor in Isolation
4.1 A Simplified Actor Scheduler
4.2 A First Approximation
4.3 Concrete Schedules
4.4 Validation of Action Sequences
4.5 Describing the Actor State More Precisely
4.6 The Results of the Analysis
4.7 Related Work
5 Analysis of Actor Partitions
5.1 The Scheduling Model Format
5.2 Correctness of Scheduling Model
5.3 Schedule Construction
5.4 Scheduling Strategies
5.5 Actor Composition
5.6 Related Work
6 Discussion on Efficiency and Implementation
6.1 Design Parameters
6.2 Performance
6.3 Efficient Code Generation
6.4 Related Work
7 Some Scheduling Case Studies
7.1 Case I – FIR Filter
7.2 Case II – IIR Filter
7.3 Case III – Zigbee Transmitter
7.4 Case IV – MPEG-4 Decoder (Coarse Grained)
7.5 Case V – MPEG-4 Decoder (Fine Grained)
8 Vision of the Future
8.1 Identified Problems
8.2 Preciser Specification
8.3 Specification Verification
8.4 Related Work
9 Overview of the Original Publications
9.1 Paper I: Scheduling of Dynamic Dataflow Programs with Model Checking
9.2 Paper II: Scheduling of Dynamic Dataflow Programs Based on State Space Analysis
9.3 Paper III: Static and Quasi-Static Composition of Stream Processing Applications from Dynamic Dataflow Programs
9.4 Paper IV: Modeling Control Tokens for Composition of CAL Actors
9.5 Paper V: High-Performance Programs by Source-Level Merging of RVC-CAL Dataflow Actors
9.6 Positioning the Papers
10 Conclusions
10.1 Retrospective – Goals versus Approach
10.2 This Research in Perspective
Part I
Chapter 1
Introduction
Programming languages and methods are constantly evolving to further improve programmer productivity and make large software projects manageable. This development is driven by advances in semiconductor design, which provide processors and systems with more and more processing capacity, which in turn enables more complex software. Software developers invent new applications that utilize the increasing processing capacity of new hardware, introducing new services that simply were not possible earlier. To enable programmers to develop increasingly complex systems, more advanced tools and methods are developed.
For many years, the main difference between two generations of a processor, for a programmer, was that the uni-processor speed, including the clock rate, doubled every 18 months [12, 96]. Raising the speed of a processor mainly meant that the software, without modifications, would run twice as fast on the new processor, allowing the developer to add more features to utilize the increased processing capacity. To handle more complex software projects and improve programmer productivity, this development mainly resulted in raising the abstraction level of the programming languages by adding one more level of language constructs on top of the existing ones. In the early history of computers, programming was performed by directly using the instruction set of the processor; programming in such assembly languages became tedious as the processors gained more processing capacity, and the abstraction level was raised to languages where the focus was more on the algorithm than on which instructions to use. Imperative languages such as C, which have their roots in ALGOL, introduced in the late 1950s [15, 101], can be considered to be such languages: the specific instruction set of the processor is no longer important for the programmer, but the program still reflects a general processor design, as every programming construct has a counterpart in the processor instruction set. Such languages are in principle processor independent, as the compiler translates the program into the appropriate processor instructions.
To further raise the abstraction level, the introduction of new programming concepts such as objects, which are abstractions of the concepts programmers typically describe rather than of processor architectures, was studied already in the 1960s. These concepts only became mainstream as late as the 1990s, partly as a result of the growing popularity of graphical user interfaces, for which these kinds of abstractions were well suited. This next generation of languages improved reusability by collecting functionality and data in a natural way. Object-oriented programming has significantly improved productivity and reusability [88], and provides a really intuitive description for some applications, such as graphical user interfaces, where inheriting the basic functionality while adding new functionality hides the less interesting details from the programmer and makes it easy to add new features. Further improvements in methods for designing object-oriented programs have increased the productivity of the programmer. A natural next step for object-oriented methods was to introduce design patterns, that is, more or less agreed-upon general designs that are used to implement certain types of software components [92]. All these methods improve productivity and make large projects manageable; however, they are platform independent only as long as one processor is assumed, and they do not scale on systems with more processors.
1.1 Recent Developments
Currently, processor clock rates have reached the limit of what is practical with the technology available today, and instead of raising the clock rate, the increasing number of transistors on a chip is used to build parallel execution units or processor cores. The reason for this is that while the number of transistors on a chip and the density of transistors continue to increase, the power density on the chip limits which ones can be turned on at the same time and also the clock rate at which the processor core runs [12]. By inspecting the switching power of a CMOS transistor, $P = \alpha \times C \times f \times v^2$, where $\alpha$ is the activity factor, $C$ the switched capacitance, $f$ the clock frequency, and $v$ the supply voltage, it can easily be seen that the clock rate of a processor reaches a wall when the frequency, and consequently the voltage, is increased for a specific technology. For different applications or devices, the power is limited for various reasons. For hand-held devices, it is obvious that battery time is a limiting factor, but the power dissipation also plays a role, as it heats up the device. As an example, according to [111, 98], the power dissipation of a cell phone should be kept under 3 W. In contrast, in data centers the power consumption directly translates into a cost, as both the power to run the systems and the power to cool them result in enormous electricity bills [90].
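The effect of the switching-power relation can be made concrete with a small calculation. The sketch below evaluates the formula for a baseline core, for a core with doubled clock rate, and for two baseline cores. All numeric values, and in particular the assumption that a doubled frequency requires a 30% higher supply voltage, are hypothetical and chosen only for illustration; they are not taken from the cited sources.

```python
def switching_power(alpha, capacitance, frequency, voltage):
    """Dynamic switching power of CMOS logic: P = alpha * C * f * v^2."""
    return alpha * capacitance * frequency * voltage ** 2


# Hypothetical baseline values, for illustration only.
ALPHA = 0.2          # activity factor
CAPACITANCE = 1e-9   # switched capacitance in farads
BASE_FREQ = 1e9      # clock frequency in hertz
BASE_VOLT = 1.0      # supply voltage in volts

base = switching_power(ALPHA, CAPACITANCE, BASE_FREQ, BASE_VOLT)

# Doubling the frequency alone doubles the power.
freq_only = switching_power(ALPHA, CAPACITANCE, 2 * BASE_FREQ, BASE_VOLT)

# Assumption: sustaining the doubled frequency needs a 30% higher voltage,
# so the power grows with the square of the voltage as well.
fast_core = switching_power(ALPHA, CAPACITANCE, 2 * BASE_FREQ, 1.3 * BASE_VOLT)

# Two cores at the baseline operating point offer the same peak throughput
# as the fast core for perfectly parallel work, at twice the baseline power.
two_cores = 2 * base

print(f"baseline core:        {base:.3f} W")
print(f"2x frequency only:    {freq_only:.3f} W")
print(f"2x frequency, 1.3x v: {fast_core:.3f} W")
print(f"two baseline cores:   {two_cores:.3f} W")
```

Under these assumed numbers the single fast core draws more power than two baseline cores, which is the trade-off that motivates spending additional transistors on more cores rather than on a higher clock rate.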
capacity, has created an enormous problem for programmer productivity with the current methods for programming and has forced programmers to rethink how to design applications that can utilize many-core systems. Most of the current methods for programming systems with several cores depend on the programmer to decide what to run where and how the synchronization is handled. The problem with such approaches is that, in addition to describing the algorithm, the programmer also needs to think about how to distribute and manage the program on the available system. Furthermore, if the current development continues as predicted, transistors are becoming comparably inexpensive and it will not be possible to use all parts of a chip simultaneously [126, 71]. This means that architectures, already now but even more in the future, will contain accelerators and specialized hardware of which most parts normally are turned off. As an example, the OMAP 4 platform includes a dual-core general purpose processor (GPP), two additional low-power cores, a digital signal processor, a graphics processor, and multimedia accelerators. Using hardware accelerators allows the GPP to run at a lower clock frequency, reducing the power consumption [94]. For the programmer, this means that parallelization is not the only problem; heterogeneous platforms and the synchronization of the various processing elements also need to be handled in an efficient manner.
The potential overheads related to synchronizing complex systems with accelerators can be illustrated by the study, presented in [111], of the development in energy efficiency of mobile communication equipment between 1995 and 2003. The study shows that the usage time of such equipment had not improved despite improvements in silicon processes. The development of the hardware had enabled improved functionality and new features to be supported, with improved energy efficiency. However, as the complexity of the DSPs and the number of accelerators had grown, the software had become more difficult to manage. In 1995, the software on the phones was still scheduled by the developers in a static manner. This approach was efficient and predictable: the accelerators had deterministic latencies, were used without interrupts, and introduced almost no overhead. The more recent models were more complex systems developed by larger teams; to make the development manageable, the software was divided into layers, the accelerators were synchronized by interrupts, and the scheduling was performed by a preemptive operating system. The ordinary operation of a hardware accelerator is to interrupt the processor when it has finished its task. This interrupt makes the system switch tasks, run interrupt handlers and other parts of the operating system, and then switch back to the program using the accelerator. The context switches caused by the interrupts introduce an overhead. Experiments reveal that the impact the context switches have on cache-hit rates significantly affects the overall performance of the system [111]. The performance is further affected by the communication between the processing units, which introduces an overhead every time an accelerator is used. In order to improve the efficiency, the cache-hit ratios should be increased and the number of context switches kept low. Due to multitasking and sporadic events, static scheduling is not an option, and it is impossible to return to systems scheduled by the developers as the systems have grown in complexity. In order to obtain a system with better performance and less overhead, the compilers should have knowledge not only of the processor but also of the whole system, including the accelerators.
The other direction of hardware development is to make the processor cores more complex and use a number of functional units to exploit instruction-level parallelism (ILP), increasing the number of operations performed in parallel when running a sequential program [57]. ILP is a measure of how many operations in a computer program can be executed in parallel. A processor taking advantage of ILP is called superscalar and has multiple functional units executing instructions simultaneously. To be able to exploit ILP, superscalar processors perform dynamic scheduling of instructions during execution. As the number of simultaneously issued instructions increases, the cost of the logic gates required to handle the instructions and check dependencies at run time, at the CPU's clock rate, increases rapidly. Even if there are no real dependencies between a set of instructions, the processor must check for dependencies. In practice, there is a limited amount of instruction-level parallelism in programs [119].
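As a rough illustration of what such a measure means, the sketch below computes the earliest cycle in which each instruction of a short straight-line program could execute and derives the available ILP from it. The instruction format, the register names, and the idealized machine (unlimited functional units, unit latency, only true read-after-write dependencies) are assumptions made for this example and do not describe any particular processor.

```python
def dependency_levels(instructions):
    """Return the earliest cycle (starting at 1) for each instruction.

    Each instruction is a pair (dest, sources). Only true dependencies
    are tracked: an instruction must wait one cycle after the latest
    producer of any of its source registers.
    """
    ready_at = {}  # register name -> cycle in which its value is produced
    levels = []
    for dest, sources in instructions:
        cycle = 1 + max((ready_at.get(src, 0) for src in sources), default=0)
        levels.append(cycle)
        ready_at[dest] = cycle
    return levels


# Hypothetical program; x, y and z are inputs available from the start.
program = [
    ("a", ("x", "y")),  # a = x + y
    ("b", ("x", "z")),  # b = x * z   (independent of a)
    ("c", ("a", "b")),  # c = a - b   (needs a and b)
    ("d", ("y", "z")),  # d = y + z   (independent of a, b, c)
    ("e", ("c", "d")),  # e = c * d   (needs c and d)
]

levels = dependency_levels(program)
ilp = len(program) / max(levels)

print("earliest cycles:", levels)
print(f"available ILP: {ilp:.2f} instructions per cycle")
```

In this example five instructions need at least three cycles because of the dependency chain from `a` through `c` to `e`, so on average fewer than two instructions can execute per cycle regardless of how many functional units are available.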
One attempt to reduce the overhead of the processors related to dynamic scheduling is to move this complexity from the hardware to the compiler; this is called static scheduling [38]. The advantage of static compile-time scheduling is that a compiler has several orders of magnitude more time to make sophisticated scheduling decisions than a dynamic run-time scheduler. Static scheduling performed by the compiler implies that no special hardware is required to analyze dependencies; the result is reduced manufacturing costs and power consumption. Very Long Instruction Word (VLIW) processors use this approach: one VLIW instruction encodes multiple operations to be executed in parallel on different functional units of the device. VLIW instructions have as many operation fields as there are functional units on the processor, and each field specifies what operation should be executed on the corresponding unit. The compiler examines the dependencies in a sequence of instructions and encodes the operations into instructions containing several operations. To handle the communication inside the processor, VLIWs have a forwarding network. The forwarding network is used for bypassing operands from one functional unit to another or for writing operands to a register file. This network becomes complex when the number of functional units grows, and an attempt to move further complexity from the hardware to the compiler is the Transport Triggered Architecture (TTA) [38, 53], where the program instructions describe how operands are passed between the functional units and register files.
For the programmer, this kind of parallelism is transparent, as it is the compiler or the hardware that finds the parallelism and exploits it. The level of ILP is, however, limited and does not scale to systems with many processor cores. Also, as the number of cores grows, ILP is difficult to utilize across several cores, as the delays of communicating between cores are large compared to the time it takes to run an instruction. Instead, it is necessary for the programmer to describe larger parts of the program that can be executed in parallel; this, again, forces the programmer to construct programs in a different fashion than before.
1.2 Programming Many-Core Systems
Parallel programming has existed for several decades already, and the target of the language constructs has varied slightly with the available platforms; however, the high-level concepts are still rather unchanged. Parallel programs either utilize data parallelism, task parallelism, or a combination of these. Data parallelism can be seen as having many independent data elements for which the computations can be performed concurrently, while task parallelism can be seen as a series of tasks that can be performed in a pipelined fashion. A simple example from the real world should make this clear. If we have a box full of oranges, data parallelism would be to have several people peeling oranges in parallel, while task parallelism would be to have one person peeling oranges while another is cutting the oranges into pieces, and a third removes the seeds.
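The orange example can be sketched in code. The following is a minimal illustration and not part of the original work; the functions peel and cut, the worker count, and the use of Python threads and queues are assumptions chosen for brevity. The first function applies the same operation to independent elements (data parallelism), while the second connects two stages in a pipeline (task parallelism).

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def peel(orange: str) -> str:
    # Hypothetical stage 1: peel one orange.
    return f"peeled {orange}"


def cut(orange: str) -> str:
    # Hypothetical stage 2: cut one (peeled) orange.
    return f"cut {orange}"


def data_parallel(oranges):
    """Several workers apply the same operation to independent elements."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(peel, oranges))  # map preserves input order


def task_parallel(oranges):
    """One worker peels while another cuts what has already been peeled."""
    peeled, finished = queue.Queue(), queue.Queue()
    done = object()  # sentinel marking the end of the stream

    def peeler():
        for orange in oranges:
            peeled.put(peel(orange))
        peeled.put(done)

    def cutter():
        while (item := peeled.get()) is not done:
            finished.put(cut(item))
        finished.put(done)

    workers = [threading.Thread(target=peeler), threading.Thread(target=cutter)]
    for worker in workers:
        worker.start()
    results = []
    while (item := finished.get()) is not done:
        results.append(item)
    for worker in workers:
        worker.join()
    return results
```

The sketch shows only the structure of the two kinds of parallelism; it makes no claim about the speedup obtained on a particular platform.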
The difficulty with parallel programming is that it does not conform to the real world. If two persons reach for the same orange, it will be clear from the result which one acquired it; with a memory object in a computer program, if two parallel parts of the program grab it, they will both acquire it, and when they update it, the result may be a combination of what both of them did, leading to inconsistent or corrupted data. To assure correct behavior, shared data structures must be locked while being modified; however, locking results in waiting and potentially deadlocks. A further problem is that errors may occur sporadically when a program is executed, as threads may be interleaved differently in each run. Programming with locks is difficult and requires the programmer to keep track of which lock corresponds to which data; this kind of disciplined work is better suited for a tool than a human being [114].
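As a minimal sketch of the locking discipline described above (an illustration only; the function name and parameters are hypothetical), the shared counter below is only ever updated while holding the lock that protects it. Note that the association between the lock and the data exists only by convention, which is exactly the bookkeeping the programmer has to get right.

```python
import threading


def count_with_lock(n_threads: int, n_increments: int) -> int:
    """Increment a shared counter from several threads, guarded by one lock."""
    counter = 0
    lock = threading.Lock()  # by convention, this lock protects `counter`

    def worker() -> None:
        nonlocal counter
        for _ in range(n_increments):
            with lock:  # the read-modify-write must not be interleaved
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter
```

With the lock in place, the result is always n_threads * n_increments; the price is that threads wait for each other on every update.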
1.2.1 Current State of the Art
Current parallel programs typically depend on multi-threading as the basis for parallel execution, with the drawback that it is the programmer who needs to express the execution platform (threads) as a part of the implementation. Threads create an illusion of shared memory; however, it is up to the programmer to protect the data against the nondeterministic behavior of using threads as the abstraction for concurrency [84]. As it is up to the programmer to choose when to spawn parallel tasks and when shared data must be locked to prevent race conditions, while it is up to the operating system to choose in which order the threads are interleaved, the programming becomes difficult, error prone, platform dependent, and simply painful to have anything to do with.
One significant drawback with this kind of approach is that the programmer writes the application to fit the parallelism of the target platform. As an example, if the target platform is a quad-core processor, it would seem to make sense to implement a program with four threads co-operating to finish the task. On the other hand, this implementation would not efficiently utilize a new platform with eight cores, as the implementation has been tailored for the previous processor. The other problem is leaving the synchronization to be solved by the programmer, which is in general very difficult for a human being to get correct. Instead, high-level constructs that describe communication and synchronization should be available to the programmer, while mutexes and locks should be generated by the tools.
There are extensions to popular programming languages which add primitives for parallel operations. One such implementation of multi-threading is OpenMP, which consists of a set of compiler directives, library routines, and environment variables that let the programmer mark the parts of the code that are meant to run in parallel, while the framework takes care of the thread creation and synchronization [36].
Methods that have started to gain more interest let the programmer express parallel tasks that can be distributed on the threads/cores by a run-time system. OpenCL (Open Computing Language), for example, is a standard for cross-platform parallel computing. An OpenCL program is typically a program written in a language like C, but with some parts of the program, those that can run on a separate device such as a graphics processing unit or a digital signal processor, constructed as OpenCL kernels, which are functions written in a flavor of C/C++. The OpenCL kernels are controlled by the host through an application programming interface (API) that is used to define and then control the platforms [97]. OpenCL enables programmers to program heterogeneous platforms without binding the program too much to the platform.
The problem with adding parallelism with glued-on-top solutions is that the benefit from having more cores is limited by the sequential parts of the program. Amdahl's law [8] describes this relationship as
$$S = \frac{1}{B + \frac{1}{n}(1 - B)}$$
where n is the number of cores, B is the fraction of strictly sequential code, and S is the speedup. This gives an upper bound for how much speedup is possible for a program with a certain amount of parallelism; as n grows, S approaches 1/B, so the law also indicates the point after which adding more cores no longer improves the speedup noticeably. To avoid unnecessary limitations on the parallelism, a program should be described as parallel as possible; it is not enough that a programmer adds the parallelism needed for one target platform and hopes that it will scale in the future. A number of technologies are available where the computations are described with explicit dependencies between the calculations. A calculation may then, when it completes, enable other calculations that depend on it. The parallelism of the program is then a dynamic property which changes while the program executes, and typically a run-time system is needed to schedule the parallel calculations such that the system is utilized. Such a run-time system typically uses a thread pool to which it distributes parallel tasks, usually by using some kind of dispatch queues.
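As a small worked example of Amdahl's law as stated above, the following sketch evaluates the speedup for a given number of cores and sequential fraction. The function names are chosen here for illustration, and the value B = 0.1 used in the usage note is an arbitrary assumption, not a measurement.

```python
def amdahl_speedup(n_cores: int, sequential_fraction: float) -> float:
    """Speedup S = 1 / (B + (1 - B) / n) for n cores and sequential fraction B."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / n_cores)


def speedup_limit(sequential_fraction: float) -> float:
    """Upper bound 1 / B that S approaches as the number of cores grows."""
    return 1.0 / sequential_fraction


if __name__ == "__main__":
    for cores in (1, 2, 4, 8, 64, 1024):
        print(cores, round(amdahl_speedup(cores, 0.1), 2))
```

With an assumed sequential fraction of 10 percent, four cores give a speedup of roughly 3.08, while the speedup can never exceed 10 regardless of the number of cores.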
One implementation of this is Grand Central Dispatch (GCD), a technology by Apple Inc. [9] that enables concurrent execution of parallel parts of a program. The parallel tasks, which are either functions or a smaller unit introduced by GCD called blocks, are enqueued and executed on the thread pool of GCD.
Another example, which also extends the programming model from C/C++, is the Cilk [24] run-time system. With Cilk, a programmer can easily create parallel tasks (called threads in Cilk), which the run-time system distributes on the available processor cores. Cilk is especially good for describing recursive algorithms, and a dependency graph is constructed at run-time as a consequence of (recursive) function calls in the program code. This graph, which is a directed acyclic graph, describes the dependencies between the tasks of the application. A newer version of Cilk called Cilk Plus is currently part of Intel Parallel Studio, which is a toolkit by Intel for parallel programming.
For other types of applications, such as multimedia and signal processing algorithms, the dependency graph is more static and can therefore be described statically as a dataflow network, where the dataflow nodes describe tasks and the edges data dependencies.
1.2.2 Dataflow programming
Dataflow has been proposed as one solution for how to represent the parallelism of an application. A dataflow program consists of independent computational nodes only connected by communication queues. As such, the nodes can run in parallel as long as there is enough input and enough space on the output queue. Dataflow programs explicitly express the parallelism of the program, and the nodes can easily be distributed to many processor cores. This is guaranteed by the strong encapsulation of dataflow actors, as the actors are not allowed to share state but only communicate through directed connections [21].
Dataflow programs provide a parallelism which resembles task parallelism, as actor firings can be seen as tasks that can execute in parallel. The programmer describes firings, the operations that should happen, the input and output rates, and possibly a condition enabling a firing to take place. A dataflow program scales to platforms with different degrees of parallelism, with some restrictions, depending on how fine grained the dataflow implementation is. While dataflow programs provide a flexible parallel implementation, this does not automatically mean that the program will be efficient. As the control flow is not explicitly described, but instead the concurrency is specified, scheduling is required when several actors share a processor.
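The notions of firing, input rate, and firing rule can be illustrated with a small sketch. This is not CAL and not the representation used in this work; the Actor class, the unbounded queues, and the two example actors are assumptions made for illustration. Each actor fires only when its input queue holds enough tokens, and a simple loop plays the role of a run-time scheduler for actors sharing one processor.

```python
from collections import deque
from typing import Callable, Deque, List


class Actor:
    """A node that communicates only through FIFO queues (no shared state)."""

    def __init__(self, name: str, inp: Deque[int], out: Deque[int],
                 rate_in: int, action: Callable[[List[int]], List[int]]):
        self.name, self.inp, self.out = name, inp, out
        self.rate_in, self.action = rate_in, action

    def can_fire(self) -> bool:
        # Firing rule: enough input tokens (queues are unbounded in this sketch).
        return len(self.inp) >= self.rate_in

    def fire(self) -> None:
        tokens = [self.inp.popleft() for _ in range(self.rate_in)]
        self.out.extend(self.action(tokens))


def run(actors: List[Actor]) -> List[str]:
    """Fire enabled actors in turn until no firing rule holds."""
    trace, progress = [], True
    while progress:
        progress = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                trace.append(actor.name)
                progress = True
    return trace


q_in, q_mid, q_out = deque([1, 2, 3, 4]), deque(), deque()
pair_sum = Actor("pair_sum", q_in, q_mid, 2, lambda t: [t[0] + t[1]])
double = Actor("double", q_mid, q_out, 1, lambda t: [2 * t[0]])
trace = run([pair_sum, double])
```

Every pass of the loop evaluates the firing rule of every actor; this repeated testing is the kind of run-time overhead that scheduling at compile-time aims to reduce.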
Dataflow programming has proved itself useful in signal processing applications as it maps nicely to the mathematical representations used within this field. It is common practice to represent filters and transforms as boxes connected with data queues, and many tools used for modeling use a dataflow representation, e.g. Simulink and LabVIEW. However, for dataflow programming to be useful for a wider range of applications, the programming methods and tool support need to improve.
For the type of applications for which dataflow typically is used, the performance of the generated program is important for how useful the application will be. For dataflow programs, a central aspect that has a significant impact on the performance of the program is the scheduling of the dataflow nodes. On a theoretical platform with more cores than dataflow nodes and no communication overheads, a dataflow program does not need scheduling, as every node simply occupies one core and waits for input to arrive. On a real platform there are usually more dataflow nodes than processor cores and, in the case there actually are more processor cores, the communication overheads between the cores make it necessary to map several nodes to the same processor core. This means that scheduling, that is, deciding what to run next, is required.
1.3 The Problem Statement
Dataflow programming can be used to highlight the parallelism of an application. A dataflow program is a directed graph, where arcs represent data dependencies between the computational nodes; as any other communication is forbidden, the communication and dependencies between the nodes are explicitly described. For a dataflow program to be truly platform independent, it should adapt to both platforms with massive parallelism as well as to single processor systems, but it should also be possible to tune it for systems with different cache sizes and for processors with varying penalties for branch instructions. This means that the size of the computational nodes (actors) must adapt to the target platform, either by merging smaller nodes into larger or by splitting larger nodes into a set of smaller; these operations affect the scheduling of the nodes but also the size of data a node works on during one firing.
The approach chosen for this work requires the programmer to construct a design that is as fine-grained as possible, which then is transformed by actor composition into something that is efficient on the specific target platform. The composition implies a joint scheduling of the actors, which can be made efficient by removing some generality from the composition by removing some of the scheduling decisions. Such a simplified composed actor, again, needs to be proven correct in that it still accepts every possible input stream that the original actors did. Also, deadlock freeness must be proven for a program after composition of actors.
This thesis discusses approaches for choosing appropriate program partitions for composition, such that it is possible to reason about the efficiency and correctness of the scheduler of the composition. The thesis includes discussion about the different aspects of a scheduling model that relate to the correctness of the model and shows how the complexity of a scheduler can be approximated. A full prototype tool chain, which takes a CAL program and performs partitioning, scheduler composition, and actor merging, is presented.
As some of the definitions used in this work may have different meanings in other contexts, it is necessary to define what is intended with some of the central concepts discussed in this thesis.
Composition Actor composition is used to transform a set of actors into one single larger actor with less scheduling overhead. In this context, composition means that the action schedulers of a set of actors are composed into a single scheduler. The goal with the composition is to allow a simplified action scheduler. The composed scheduler is not required to allow all the different firing sequences that were allowed by the original program; however, it is required to produce identical outputs for an input sequence.
Scheduling The composed scheduler is simplified by having as many of the scheduling decisions as possible done at compile-time. The scheduling is about finding the sequences of action firings that always follow each other. The composed scheduler runs such sequences, called static schedules, instead of single actions. In practice this means that the scheduling sequentializes a set of actors. Scheduling may also refer to the run-time scheduling, which includes choosing and running one of the composed schedulers; however, it should be clear from the context which type of scheduling is discussed.
Merging The actual process of creating a new actor from the composed actor scheduler and the functionality of the actors that have been jointly scheduled is called actor merging. The actor merging decides how the functionality and data channels of the different actors are put into a single actor, and how the composed scheduler is integrated in the actor. The result of the merging is then a new actor that can be code generated instead of the original actors.
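The relation between the three concepts can be sketched with a toy example. The two actors below (split and inc), their token rates, and the schedule are hypothetical and chosen only for illustration; they do not come from the case studies of this thesis. The dynamic version tests a guard before every firing, while the composed version tests one guard and then runs a fixed sequence of firings, a static schedule, to completion.

```python
from collections import deque


def make_network(inputs):
    """Hypothetical actors: 'split' emits two tokens per input token,
    'inc' consumes one token and emits one token."""
    q0, q1, q2 = deque(inputs), deque(), deque()

    def split():
        x = q0.popleft()
        q1.extend([x, x * 10])

    def inc():
        q2.append(q1.popleft() + 1)

    actions = {"split": split, "inc": inc}
    guards = {"split": lambda: len(q0) >= 1, "inc": lambda: len(q1) >= 1}
    return q0, q2, actions, guards


def run_dynamic(inputs):
    """Run-time scheduling: every firing is preceded by a guard test."""
    _, out, actions, guards = make_network(inputs)
    fired = True
    while fired:
        fired = False
        for name in ("inc", "split"):
            if guards[name]():
                actions[name]()
                fired = True
    return list(out)


# One firing of 'split' always enables exactly two firings of 'inc'.
STATIC_SCHEDULE = ("split", "inc", "inc")


def run_composed(inputs):
    """Composed scheduler: one guard per schedule, none inside it."""
    q0, out, actions, _ = make_network(inputs)
    while len(q0) >= 1:
        for name in STATIC_SCHEDULE:
            actions[name]()
    return list(out)
```

The two versions fire the actions in different orders, but for the same input sequence they produce identical outputs, which is the correctness requirement placed on a composition above.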
1.4 Contribution of the Thesis
The problem of scheduling dynamic dataflow programs is manifold and can be explored from several angles. These range from a choice between different models of computation, which restrict or enrich the model with extra information, to approaches which try to fit parts of a program into more static descriptions and, by this, extract static schedules. The approach investigated in this thesis starts from an expressive model, which is allowed to describe dataflow actors with any behavior. These actors are analyzed, and a model describing the behavior of an actor partition, with a limited state space only including scheduling related information, is constructed. Then the state space is analyzed, and static schedules that link some of the reoccurring states of the program are extracted and used to generate a quasi-static scheduler. The problems discussed in this thesis are different aspects of how this analysis is made feasible.
The first thing that is needed to even make a state space analysis of a dataflow program feasible is to reduce the state space to only include the information essential for the scheduling analysis. A CAL program, essentially, is a set of dataflow actors which can be described as state machines with variables, and which are connected by data channels. One of the main points in the thesis is therefore to show how to describe the state space of a program, with the minimal information such that the program can be scheduled correctly. This property is a central requirement in the work presented and is therefore discussed throughout the thesis, but mainly in Papers 2 and 4, and in Chapters 3, 4, and 5.
A related issue, which is important for constructing a set of guards for a composition of actors, relates to the set of firing rules that are used by the quasi-static scheduler, and how these can be chosen. The related research problem is how one can show that a set of guards is strong enough to describe the behavior of a partition, and how control values should be modeled in order to choose suitable partitions for composition.
This is on a more abstract level discussed in Chapter 3 and with some more details in Paper 4; then in Chapter 5, more concrete examples are given of how this issue can be handled.
With a model that efficiently describes the state space of the program and correctly gives an adequate set of guards that describes the operations of a dataflow program, or part of one, the next problem is to actually extract both the static schedules and a scheduler which describes how the schedules should be fired based on the set of guards provided. The thesis, therefore, also presents a number of strategies that can be used to find static schedules out of the state space of a program partition by using a model checker. The initial work related to this issue is presented in Paper 1; a more in-depth discussion is provided in Chapter 5, while some case studies are presented in Chapter 7.
With the composition and partial compile-time scheduling of the actor partitions successfully completed, one would hope that the transformed program in every sense performs better than the original program, where each actor is individually scheduled at run-time. Now, this is unfortunately not simply a software issue, as the processor architecture plays a crucial role in deciding how a composition and a specific schedule impact the performance. For this reason, the thesis contributes with measurements and conclusions regarding how partitioning, scheduling, and composition affect the performance of a program. Measurements are mainly presented in Papers 3 and 5, while Chapter 6 provides a more general discussion of this topic.
The work related to this thesis shows that methods of this type can be used to schedule CAL actors for composition. However, to enable more automated and successful scheduling of CAL programs, some improvements regarding the specification of programs could be useful for a future tool set, to allow easier verification of properties related to scheduling, but potentially also others. The case studies in Chapter 7 highlight some constructs that make the scheduling difficult; some possible solutions on how to implement dataflow programs, or how specifications could be added to a program, are discussed in Chapter 8.
[Figure 1.1, diagram not reproduced. Labels: RVC-CAL Prog, Orcc, C Backend, Promela Backend, Scheduling Tool, Merger, Partitions, Promela Code, Input Sequences, Actor Schedules, Missing Schedules, Scheduler, C code.]
Figure 1.1: The artifacts produced for the scheduling task and how these connect to the existing tool chain (gray).
1.4.1 Constructed Artifacts
In order to evaluate the proposed approach and to measure the impact this kind of composition and scheduling has on a dataflow program when it is mapped onto different platforms, it is necessary to build a number of software artifacts that can be used for this purpose. The scheduling approach is implemented as part of an existing compiler infrastructure and some additional external tools, which produce interchange artifacts that describe different aspects of the dataflow program. Figure 1.1 illustrates the main software components and the information that is passed between these. A central part of the tool chain is the Orcc compiler, which is used both to generate the representation of the program that is used in the scheduling and to finally generate the executable program, based on the information generated by the scheduling tool, in a language such as C that can be compiled for the target platform.
The approach is described in more detail in the following chapters; however, to put these in a context, some of the high level ideas and key artifacts should be introduced.
Scheduling Tool The scheduling tool consists of a model checker and wrapping software that specifies the model checking tasks and analyses the model checking result. This software uses information generated from the dataflow program description and defines a set of schedules for the model checker to search for. The scheduling is an iterative task where one found schedule may introduce new schedules to search for; this is illustrated by the feedback loop of the scheduling tool in Figure 1.1. This part of the scheduling approach is mainly described in Chapter 5.
Partition The set of actors that are to be composed and thereby analyzed for scheduling is called a partition. A partition may correspond to the whole dataflow program, but usually partitions are chosen, by the designer, to minimize the scheduling overhead by grouping together actors with related behavior, and to improve run-time performance by choosing partition sizes that fit the target platform. Partitioning is only described in this thesis from the point of view of correctness, where correctness means that a model describing the scheduling of a partition must include all the necessary information, while the performance aspect is left for an external design space exploration. The required properties of a partition are discussed in Chapter 3, which presents the different types of dynamic behavior a partition of actors can have.
Schedules The task of the scheduling framework is to produce static schedules, that is, sequences of firings where, once a schedule is chosen for execution, the whole schedule can be guaranteed to run to completion without the need for evaluating guard expressions. When several schedules are needed to describe the behavior of a partition, the choice of schedule is performed by a scheduler at run-time. The produced scheduler, that is, the schedules and the firing rules for these, is then the artifact that is the end result of the scheduling tool and is used by the compiler to merge actors before the code is generated for the target platform. The construction of a set of static schedules for a partition, together with a scheduler that corresponds to the firing rules of the schedules, is mainly discussed in Chapter 5.
To construct and initialize the model checker for one scheduling task, a few pieces of information, either automatically generated from the dataflow program or provided by the developer, are needed.
Promela Code The behavior of the dataflow program is translated to the language used by the model checker, in this case Promela. For this purpose a backend producing Promela code is constructed for the Orcc compiler. The generation of the Promela code is not as such interesting in this context and is only briefly described in Chapter 5 and Paper 1; finding the relevant information to generate code from is more relevant. In addition to generating the Promela code, the backend runs a set of analysis passes which are used to simplify the program description to only include scheduling information, as well as to produce additional information which is used by the scheduling tool; this involves information regarding static schedules and their corresponding input sequences of individual actors.
Actor Schedules Simplifying the individual actors to only include input dependent firing rules, while firing sequences that can be resolved from the program text of the actor are described as static schedules, gives a good starting point for the actual scheduling task. Generating a set of static schedules that describe the possible firing sequences of the actor gives a prediction of what the static schedules of a partition, which are to be searched for by the model checker, should look like. Similarly, it also gives the data rates and the needed input sequences that are consumed by the corresponding schedules.
Input Sequences A final artifact that is produced as an input for the model checking task is the set of input sequences an actor accepts. These input sequences correspond to the per actor static schedules described above, and describe the number of data tokens a schedule consumes and, if it can be resolved, the value of a control token that enables the schedule. This information is used to initialize the model checking tasks as well as to decide when a schedule has finished based on the number of tokens it has consumed. The construction of the actor schedules and the corresponding input sequences is discussed in Chapter 4.
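To make the relation between schedules, firing rules, and input sequences more concrete, the following sketch shows a toy quasi-static scheduler. It is an illustration only: the schedule table, the action names, and the control values are hypothetical and are not output of the tool chain. Each entry pairs a control token value and a number of data tokens (the input sequence) with a static schedule; the guards are evaluated once per schedule and not between the firings inside it.

```python
from collections import deque

# Hypothetical schedule table: control value -> input sequence and firings.
SCHEDULES = {
    0: {"data_tokens": 1, "firings": ("read", "pass_through", "write")},
    1: {"data_tokens": 2, "firings": ("read", "read", "add", "write")},
}


def quasi_static_run(control_tokens, data_tokens):
    """Choose a static schedule per control token and run it to completion."""
    control, data = deque(control_tokens), deque(data_tokens)
    out, trace, stack = [], [], []
    actions = {
        "read": lambda: stack.append(data.popleft()),
        "pass_through": lambda: None,
        "add": lambda: stack.append(stack.pop() + stack.pop()),
        "write": lambda: out.append(stack.pop()),
    }
    while control:
        schedule = SCHEDULES[control[0]]         # guard 1: peek the control value
        if len(data) < schedule["data_tokens"]:  # guard 2: whole input sequence present
            break
        control.popleft()
        for name in schedule["firings"]:         # no guards inside the schedule
            actions[name]()
            trace.append(name)
    return out, trace
```

For the control tokens [1, 0] and the data tokens [3, 4, 5], the sketch first runs the two-token schedule and then the one-token schedule, producing the outputs [7, 5]. If the required input sequence is not yet available, no firing takes place at all, so a schedule is never started without being able to finish.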
1.5 Research Methods in this Thesis
This thesis presents research related to the scheduling problem of dynamic dataflow programs, where the dataflow programs in practice are programs implemented using the Cal Actor Language (CAL) [46]. The research starts from an idea that the dataflow program can be modeled as a set of communicating finite state machines (FSM), for which a state space analysis can give the set of schedules that describes the behavior of the program. Apart from building tools and prototypes for evaluating the ideas, the research also includes performing measurements evaluating the impacts of the approach and algebraic analysis of the correctness of the models.
To narrow down the discussion on research methods, the discussion is based on the book On Research Methods [74] by P. Järvinen, and the individual parts of the research work are compared to the methods presented in this book.
1.5.1 Design-science research
A great part of the work presented in this thesis relates to tools and artifacts that have been built to solve the problems of scheduling dataflow applications. This kind of a research problem fits well with design-science, which describes a methodology for research where the target is to create new innovative artifacts for a specific problem domain, and, while this method has not strictly been followed, it is relevant to compare the presented work with the main points of this research method.
According to March and Smith (1995) [91], design-science has two basic activities, build and evaluate, and four types of products or artifacts: constructs, models, methods, and instantiations.
The build activity corresponds to building tools and methods for solving a specific problem; it is performed to construct a prototype that shows that the proposed idea is feasible and enables an evaluation of the proposed approach. What separates design-science research from engineering is that the construction does not follow what is best practice in the field; instead, the purpose of building an artifact is to create new knowledge and to enable an evaluation of a new idea. It is therefore important in design-science that research resources are not used to build artifacts which are based on already known and evaluated ideas [91, 74].
A constructed artifact then needs to be evaluated in order to answer the question of whether the artifact contributes to the progress of the field. The questions that need to be asked are: How well does the artifact perform the task? How does it compare to previous work? To be able to evaluate the artifact, some criteria for success and metrics need to be defined, such that results are comparable. Of course, if no previous methods have been able to achieve the same goals, the feasibility of constructing the artifact is already a valid result [91, 74].
The building process related to this thesis includes extending a code generation tool chain with a set of analysis and transformation operations. To evaluate how well the artifacts work, some metrics that can be compared are: how fast the generated code is when run on different platforms, how usable the tool is regarding how large parts of a program can be transformed into something more efficient, and finally, how usable the tool is for a developer, where usable means how difficult the tool is to use and how much time it requires. These metrics can then be compared to previous work; however, some criteria are also needed to define what can be regarded as success. In this case, it is not required that the usability improves; instead, it is a trade-off with the other metrics.
Design-science research must produce an artifact. According to the definitions of March and Smith, there are four types of artifacts, as follows.
The first one, Constructs, provides the language in which problems and solutions are defined and forms a vocabulary for the domain [91, 74]. According to March, the evaluation of constructs typically involves completeness, simplicity, elegance, understandability, and ease of use. In the dataflow domain, which is rather mature, the constructs have been developed over several decades, and for this reason adding new constructs can typically be avoided. In this work, existing constructs are used to keep the discussion simple and not introduce overlapping constructs; instead, the focus is on other types of artifacts.
Models are built as a set of propositions or statements describing relationships between the constructs; they are used to improve the understanding of the problem and solution, and tie the approach to real-world problems. According to March, the models are evaluated in terms of their fidelity with real world phenomena, completeness, level of detail, robustness, and internal consistency. In this work, several models are constructed, each of which highlights one property of the dataflow program that is being analyzed. These models are for this reason presented formally to demonstrate the properties related to completeness and consistency, as an incomplete model, in the context of code analysis and transformation, is worthless. The evaluation of the models is presented in Paper 4 and Chapter 3.
The methods are algorithms or guidelines that describe how a specific problem should be solved, or, more applicable in the context of this thesis, how to search the solution space. According to March, the evaluation of methods mainly regards their efficiency, generality, and ease of use. In the work of this thesis, this relates mainly to strategies related to scheduling, that is, how the information from the models is used to perform the actual scheduling. The evaluation of methods is mainly presented as case studies in Chapter 7 of this thesis.
Instantiations are the actual tools constructed to demonstrate the approach and show that constructs, models, or methods can be implemented in a working system. The instantiations are the artifacts that link researchers to the real world and show how the artifacts react to it and how users react to the artifact. According to March, the evaluation of instantiations relates to the efficiency of the artifact and what impact it has on the environment and users. For the tools produced within this work, this would mean that a number of developers should be able to use the tools and feel that they help them achieve some of their goals.
Evaluation of the Work The research work behind this thesis was not directly based on design science, but because of many similarities regarding the goals of the research and the kind of experiments needed, it is relevant to evaluate this work as a design science research problem. This can be done by analyzing the seven guidelines, regarding design science research, given by Hevner et al. in [65].
The first one, Guideline 1: Design as an Artifact, is obviously followed as described above according to the definitions of March and Smith. The second guideline, Guideline 2: Problem Relevance, states that these artifacts should provide solutions to important and relevant business problems. In this context, this means that the problem that is solved has a value in industrial applications and that it aids the development in the field. The relevance cannot be evaluated from the research work but instead from the problem statement motivating the research. The relevance of the scheduling problem comes from the fact that using dataflow in many industrial applications is avoided due to the lack of methods for generating efficient code.
Guideline 3: Design Evaluation. It is crucial that the resulting artifacts are evaluated with proper metrics and sufficient experimental data. Within this context, when the artifact is part of a design tool chain, the relevant properties relate both to the usability of the artifact from the point of view of its user and to how the artifact affects the performance of the dataflow program it transforms. For the user, the question is how much manual labor and additional expertise is needed to use the artifact and how long the user must wait for the tool to finish. These properties are evaluated in Chapter 7, which provides a set of case studies where the artifact is used to construct schedulers for composed actors. While several of the studied examples are automatically transformed by the artifact, the ones that are not are also important for giving an improved understanding of the research problem. For the second part, regarding the performance of the actors that are composed by the artifact, the evaluation requires numerous experiments including different configurations and target platforms. Experimental data is presented in Chapter 6 and in several of the original publications; however, what is even more important than being able to show promising numbers is that the experiments are properly performed. We will get back to evaluating the measurements shortly, after discussing the other guidelines by Hevner et al.
Guideline 4: Research Contributions. Following the discussion on design evaluation, a more general evaluation is to define clear and verifiable contributions of the work. According to Hevner et al., the contribution can be the design artifact itself or contributions related to foundations and methodologies in the area of the design artifact. While much of the work presented in this thesis evaluates experimental strategies for building scheduling models and deriving schedules for composed actors, the design artifact both shows that such methods are implementable and enables an evaluation of the methods presented. The following guideline, Guideline 5: Research Rigor, relates to the research contributions by requesting rigorous methods in both the construction and evaluation of the design artifact [65]. In the context of this work, rigor relates to the set of dataflow programs that are used to design and evaluate the artifact, such that the artifact can be shown to solve a general problem and not only a special case. Similarly, rigor can be related to the data sets that are used to perform experiments and evaluation of the artifact; rigorous experiments make the knowledge acquired from the experiments more general. Furthermore, to show that the presented methods are complete with respect to the possible inputs to the artifact, the methods need to be described mathematically based on a formal description of the dataflow program.
The next guideline, Guideline 6: Design as a Search Process, defines the research as an iterative process, where available means, which are resources available to construct a solution, are used to reach desired ends, representing goals or constraints on the solution, while satisfying laws in the problem environment [65]. For complex problems, the first version of an artifact typically requires simplifications of the problem space or decomposition into subproblems. While such an artifact can hardly be expected to be usable as such, it enables an improved understanding of the problem and raises relevant questions. The last guideline, Guideline 7: Communication of Research, elaborates this idea by highlighting the communication of the research results to the relevant audience. The important result of the work is then not the artifact itself but the knowledge acquired from constructing the artifact and from conducting the experiments enabled by the artifact.
1.5.2 Experiments and Measurements
One aspect of evaluating the approach is to construct experiments and perform measurements. In the context of CAL, relevant experiments are multimedia applications such as video decoders and various signal processing applications like network protocols and digital filters.
For measurements to make sense, it is essential to plan what is actually measured and what parameters may affect the measurement. To draw more general conclusions from the measurements, it is also important that the experiments are repeated for several applications, input sequences, compiler settings, and hardware configurations. In Paper 3, where a larger set of experiments is presented, it is obvious from the results that, had the measurements been made for only a part of the chosen experiments, the results would not have given a realistic view of the impact of the presented program transformations. The reason is that the same configurations give not only different but contradicting results on the different platforms that were used in the experiment. Instead of simply concluding that a transformation of a program is good or not, it is then possible to find how platform parameters, such as cache size, have an impact on how the transformation affects the program performance. To be able to draw a conclusion, it is therefore important to have a large enough set of test cases.
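As an illustration of the experiment planning described above, the following is a minimal sketch, not the measurement setup used in Paper 3. It enumerates every combination of application, input sequence, compiler setting, and platform, and then checks whether a transformation speeds the program up on one platform while slowing it down on another. All function names, the layout of the `samples` dictionary, and the use of the median as summary statistic are assumptions made for this example.

```python
from itertools import product
from statistics import median


def experiment_matrix(applications, inputs, compiler_flags, platforms):
    """Enumerate every combination of the four experiment dimensions."""
    return list(product(applications, inputs, compiler_flags, platforms))


def speedup_by_platform(samples):
    """Group speedups (baseline time / transformed time) by platform.

    `samples` is a hypothetical layout: it maps
    (application, input, flags, platform, variant) to a list of repeated
    run times in seconds, where variant is "baseline" or "transformed".
    """
    result = {}
    for (app, inp, flags, platform, variant), times in samples.items():
        if variant != "baseline":
            continue
        other = samples.get((app, inp, flags, platform, "transformed"))
        if not other:
            continue  # incomplete pair, nothing to compare
        # The median of the repeated runs is used as the summary statistic.
        ratio = median(times) / median(other)
        result.setdefault(platform, []).append(ratio)
    return result


def contradicting_platforms(speedups):
    """True if the transformation helps on one platform and hurts on another."""
    verdicts = {p: median(r) > 1.0 for p, r in speedups.items()}
    return len(set(verdicts.values())) > 1
```

A caller would run each entry of `experiment_matrix` several times for both program variants, store the run times in `samples`, and inspect the per-platform speedups rather than a single aggregate number. A speedup above 1.0 means the transformed program is faster on that platform.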
Measurement Errors and Variation Performing measurements on a computer system is made complicated by operating systems, caches, and advanced processor features. In general, it is hard to know if a specific measurement is a result of the feature being investigated or if it is a coincidence of a combination of features that are not taken into account in the measurement setup. Further, the same experiment may give varying results for every