Simplifying System-on-Chip Design through Architecture and System CAD Tools


by

Lesley Shannon

A Thesis submitted in conformity with the requirements for the Degree of Doctor of Philosophy,

Department of Electrical and Computer Engineering, University of Toronto


2006


Abstract

Historically, designers created computing systems by combining Integrated Circuits (ICs) on Printed Circuit Boards (PCBs), whereas now they are able to form complete Systems-on-Chip (SoCs). For the purpose of this study, an SoC is defined as a collection of functional units on one chip that interact to perform a desired operation. These modules are typically of a coarse granularity to promote reuse of previously designed Intellectual Property (IP). The decreasing size of process technologies enables designers to implement increasingly complex SoCs using both Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs). The impact of increasing design complexity is increased design time and costs for electronics. Therefore, this research investigates methods to facilitate the design of SoCs through both architecture and CAD tools.

This thesis has two main contributions. The first is an architectural framework for SoCs, wherein they are modelled as Systems Integrating Modules with Predefined Physical Links (SIMPPL). The strength of the model is the Computing Element (CE) abstraction that separates the module’s datapath from system-level control and communications to facilitate design reuse. Although SIMPPL can be used to build SoCs for ASICs or FPGAs, using an FPGA provides designers with a reprogrammable implementation platform. Thus, our second contribution is to develop a design infrastructure that leverages the advantages of reconfigurability.

Acknowledgments

First, I would like to thank my parents and my brother, Matthew: I finally did it. I know you are probably wondering what took me so long, but I finally finished. Thank you for believing that I would finish... eventually. Also, thank you to my extended family for your love and support, and for keeping me grounded. All work and no play makes Lesley a dull girl – and a little stir crazy.

To the many friends I have made throughout my graduate studies at the University of Toronto. SF2206, LP392 and EA306 will always hold a special place in my heart. Thank you for all the long talks, extra-curricular events and crazy antics that helped to calm the madness of a stressful environment. My door will always be open to each of you.

To my many friends outside of work. You have given me sound advice and made sure that I have remembered that the world is a beautiful place built on science, but filled with art and music. Thank you for nurturing the non-engineering parts of my soul.

Lastly, but certainly not least, I must thank my supervisor, Dr. Paul Chow. My Doctorate studies have been a fantastic experience and I know that I am interminably indebted to you for this. Over the years I have seen how crucial the rapport between student and supervisor is to the graduate experience. All I can say is that if my students find me to be half as good a supervisor as you have been to me, I will consider myself a success.

I would also like to acknowledge the financial support, as well as the equipment and software donations, that I have received from the following organizations that have made this research possible: the Canadian Microelectronics Corporation, the Natural Sciences and Engineering Research Council, the O’Brien Foundation, the Ontario Government, the University of Toronto, the Walter C. Sumner Foundation, and Xilinx.

Abstract
Acknowledgments
List of Figures
List of Tables
Glossary

1 Introduction
1.1 Motivation
1.2 Objective
1.3 Contributions
1.4 Thesis Organization

2 Modelling SoCs: SIMPPL
2.1 IP Reuse
2.2 On-Chip Communication Structures
2.3 SIMPPL Model

3 The Computing Element Abstraction
3.1 SIMPPL Controller
3.1.1 Controller Architecture
3.1.2 Controller Instruction Set
3.2 Debug Controllers
3.2.1 Debug-Controller Architecture and Interface
3.2.2 Debugger Options and Detectable Errors
3.3 SIMPPL Control Sequencer
3.3.1 SCS Interface
3.3.2 Static Programming Example
3.3.3 Dynamic Programming Architecture
3.4 Summary

4.2 CE Implementations
4.2.1 Controller Implementation Statistics
4.2.2 CE Architectures
4.2.3 CE Implementation Statistics
4.3 Detailed Analysis of an SoC Implementation
4.3.1 Resource Usage
4.3.2 Design Time Statistics
4.4 Summary

5 Designing SoCs on FPGAs
5.1 Current Status of SoC Design for FPGAs
5.1.1 Systems Research using FPGAs
5.1.2 Commercial System Design Tools for FPGAs
5.2 Moving from Off-Chip Estimation to On-Chip Evaluation
5.2.1 Simulating versus Profiling Hardware/Software Codesigns
5.2.1.1 GNU’s gprof
5.2.2 Proposed Design Methodology
5.2.2.1 Benefits of Designing SIMPPL SoCs using FPGAs
5.3 Designing SIMPPL SoCs on FPGAs
5.3.1 Experimental Platform
5.3.2 System Generator
5.3.3 On-Chip Testbed
5.4 Summary

6 SnoopP
6.1 General Architecture
6.1.1 Design Decisions
6.2 Experimental Evaluation
6.2.1 Methodology
6.2.2 Dhrystone
6.2.3 AES
6.3 Summary

7 WOoDSTOCK
7.1 Multi-CE Profiling Architecture
7.1.1 Bottleneck Detection
7.1.2 Implementation and Design Decisions
7.2 Case Studies
7.2.1 Methodology
7.2.2 Pipelined System Example
7.2.3 Branching System Example
7.3 Summary

8.1.1 SoC Architecture
8.1.2 SoC Design Tools
8.2 Future Work

Bibliography

A SIMPPL Controllers HDL Source Code
A.1 Full Instruction Set
A.2 Consumer Execute Controller
A.3 Consumer Debug Controller
A.4 Producer Execute Controller
A.5 Producer Debug Controller
A.6 Full Execute Controller
A.7 Full Debug Controller

B Input File Format for the System Generator

C On-chip Testbed Source and Sink Packet Interpreters
C.1 Transmitter Testbed
C.2 Receiver Testbed

D Execution Profile Data from SnoopP Experiments

E SnoopP HDL Source Code

F Input File Format for WOoDSTOCK

List of Figures

2.1 Standardizing the IP interface using (a) SIMPPL for point-to-point communications and (b) OCP for different bus standards.
2.2 A generic SoC described using the SIMPPL model.
2.3 The system generator’s generic computing element.
3.1 The concept for the hardware CE.
3.2 The hardware CE abstraction.
3.3 An overview of the SIMPPL controller datapath architecture.
3.4 An internal link’s data packet format.
3.5 A Data packet with four bypass instructions.
3.6 The SIMPPL debug controller architecture.
3.7 The SIMPPL debug controller interface.
3.8 The standard SIMPPL control sequencer structure and interface to the SIMPPL controller.
3.9 Pseudocode for the sensor unit’s SCS program.
3.10 Pseudo-HDL code to implement the state machine for the sensor unit’s program counter and the valid instruction signal.
3.11 A CE with multiple packets of data in flight.
4.1 The SIMPPL model for the video streaming and snapshot applications.
4.2 The SIMPPL model for an MPEG-1 video decoder.
4.3 The shared Computing Element architecture for a shared memory CE.
4.4 The pipelined Computing Element architecture.
5.1 An SoC Design Methodology for Reconfigurable Platforms.
5.2 (a) A CE connected to an off-chip input peripheral. (b) A CE connected to an off-chip output peripheral.
5.3 System Generator Design Flow.
5.4 The on-chip testbed for debugging CEs.
6.1 The Generic SnoopP Architecture.
6.2 Profiling Results for Dhrystone using gprof on a Sun station and mb-gprof and SnoopP on a MicroBlaze using software implementations of the multiply and divide functions.
6.3 Profiling Results for Dhrystone using gprof on a Sun station and mb-gprof and SnoopP on a MicroBlaze that includes a hardware multiplier and a software implementation of the divide function.
6.5 Profiling Results for AES using gprof on a Sun station and mb-gprof and SnoopP on a MicroBlaze using software implementations of the multiply and divide functions.
7.1 The WOoDSTOCK architecture.
7.2 Examples of the different types of bottlenecks detectable by WOoDSTOCK: (a) interior bottleneck, (b) input bottleneck, and (c) output bottleneck.
7.3 The interface of WOoDSTOCK with the Fast Simplex Link (FSL) network of a multi-CE system and a host PC via the Microprocessor Debug Module.
7.4 Two application architectures described with the SIMPPL model: (a) a pipelined system (b) a system with branching.

List of Tables

3.1 The current instruction set supported by the SIMPPL controller.
3.2 The current error cases detectable using the debug controller.
4.1 Table of the System Integration times for SoCs.
4.2 SIMPPL Controller implementation statistics.
4.3 Execution overhead clock cycles for the Consumer, Producer, and Full SIMPPL Controllers.
4.4 Implemented CEs.
4.5 Table of the resource usage of the individual modules and total system.
4.6 Table of the CE design and integration times required for the system given in person-hours.
6.1 gprof Statistics on Functions comprising the Dhrystone Benchmark after One Hundred and One Million Passes.
6.2 Dhrystone SnoopP Counter Assignments.
6.3 gprof Statistics on Functions Comprising the AES Benchmark for 2 and 400 Different Keys with 10 Thousand Blocks Each.
6.4 AES SnoopP Counter Assignments.
7.1 Example output equations for the systems in Figure 7.2.
7.2 Table for pipelined system counter results describing the counter enables, what the counters represent, and reporting the measured results as percentages of the total monitor run time given in Counter 4 to the nearest million clock cycles.
7.3 Table for branching system counter results describing the counter enables, what the counters represent, and reporting the measured results as percentages of the total monitor run time given in Counter 7 to the nearest million clock cycles.
D.1 mb-gprof Statistics on Functions comprising the Dhrystone Benchmark after One Hundred and One Million Passes.
D.2 Cycle-Accurate Results using SnoopP to Profile Dhrystone on MicroBlaze systems that include and exclude the Hardware Multiplier and Divider.
D.3 The Results from Profiling AES on-chip with SnoopP for Both 2 and 400 Keys.
D.4 mb-gprof Statistics on Functions Comprising the AES Benchmark for 2 and 400 Different Keys with 10 Thousand Blocks Each.

Glossary

API Application Programming Interface
ASIC Application Specific Integrated Circuit
BRAM Block RAM
CAD Computer Aided Design
CE Computing Element
CLB Configurable Logic Block
EX IR Executing Instruction Register
FPGA Field Programmable Gate Arrays
FSL Fast Simplex Links
FSM Finite State Machine
HDL Hardware Description Language
IC Integrated Circuit
IDCT CE Inverse Discrete Cosine Transform CE
IP Intellectual Property
IQ CE Inverse Quantizer
ISS Instruction Set Simulator
LSB Least Significant Bits
LUTs Look Up Tables
MC/PR CE Motion Control/Picture Reconstruction CE
MDM Microprocessor Debug Module
MHS Microprocessor Hardware Specification
MMR CE Missing Macroblock Replacer CE
MPD Microprocessor Peripheral Definition
MSS Microprocessor Software Specification
NoC Network-on-Chip
OCP Open Core Protocol
OPB On-chip Peripheral Bus
PAO Peripheral Analyze Order
PC Program Counter
PCB Printed Circuit Board
PE Processing Element
SCS SIMPPL Control Sequencer
SIMPPL Systems Integrating Modules with Predefined Physical Links
SnoopP Snooping Profiler
SoC System on Chip
SUT System Under Test
VLD/RLD CE Variable Length Decoder/Run Level Decoder CE
VSIA Virtual Socket Interface Alliance
WOoDSTOCK Watching Over Data STreaming On Computing element linKs
XMP Xilinx Microprocessor Project
XPS Xilinx Platform Studio

Introduction

Historically, designers created computing systems by combining Integrated Circuits (ICs) on Printed Circuit Boards (PCBs). However, due to the decreasing size of process technologies, designers have been able to implement these same systems as Systems-on-Chip (SoCs) using Application Specific Integrated Circuits (ASICs) since the late 1990s. The term System-on-Chip (SoC) has been used with many different connotations in previous work. For this study, we define an SoC as a collection of functional units on one chip that interact to perform a desired operation. These modules are typically of a coarse granularity so that previously designed Intellectual Property (IP) modules can be reused to try to reduce the design time of more complex systems. Examples of IP modules range from data intensive processing cores, such as FIR filters and FFTs, to more control intensive cores, such as memory controllers and processors.

1.1 Motivation

IP reuse is more challenging in hardware designs than reusing software functions in new software applications. Software designers benefit from a fixed implementation platform with a highly abstracted programming interface, enabling them to focus on adapting the functionality to the new application. Unfortunately, to reuse hardware IP [1, 2], designers need to consider changes to the module’s:

• functionality,

• physical interface, and

• communication protocols.


Depending on the amount of time required to adapt IP to a new application, there may be little benefit in reusing the IP to create new SoCs. However, a framework for describing SoCs that simplifies the integration of IP modules would allow hardware designers to focus on adapting IP functionality, much as software designers update software functions.

Another challenge for SoC designers is verifying design functionality and performance in a timely fashion. Designers have traditionally relied on simulation and estimation to evaluate their systems. Given the potential size and complexity of SoCs, simulation can be a very time consuming process that takes orders of magnitude longer than on-chip execution. However, if an SoC is implemented on an ASIC, it has a restrictive design environment that is not easily altered post fabrication. Therefore, the importance of correctly implementing a design on an ASIC the first time necessitates lengthy simulation times to prevent a costly redesign.

Now that commercial Field Programmable Gate Arrays (FPGAs) are also large enough to implement entire systems on one chip [3, 4], as opposed to just the glue logic, they offer a unique opportunity for SoC designers. When using a reconfigurable implementation platform, there is no cost to reprogramming the hardware; therefore, we can develop a new design infrastructure where system evaluation is performed on-chip. Such an infrastructure would provide greater flexibility to the designer and allow a new approach to the design process. For example, Hemmert et al. [5] introduced a debugger for hardware designs that runs on an FPGA to benefit from accelerated execution during the debugging process. Recent work allows designers to incorporate a Statistics Module into a soft processor to obtain a variety of run-time statistics that can be dynamically reconfigured [6]. Furthermore, designing for a reconfigurable implementation platform enables designers to easily respecify the system’s architecture if the on-chip evaluation determines that the current architecture fails to meet design specifications.

1.2 Objective

The objective of this research is twofold:

• To study how creating a framework for SoC architectures can facilitate both IP reuse and system design.

• To develop a set of on-chip CAD tools that can exploit FPGAs to reduce time spent evaluating the functionality and performance of SoCs.


Defining a system framework requires that the system-level communication structure used to integrate the IP modules be characterized. It also necessitates the specification of the physical interface as well as the communication protocols for IP modules, to facilitate their reuse in different applications. Finally, a formalized method for adapting how an IP module is used by different systems without necessitating significant redesign is desired to facilitate design reuse.

Having created a system framework, it is possible to develop on-chip CAD tools that can be tailored to different SoC architectures. Given the unrestricted ability to reprogram an FPGA, on-chip CAD tools can be used during the design process to evaluate functionality and performance. By performing these operations on the runtime platform, designers can reduce simulation time and overall design time.

1.3 Contributions

This thesis can be divided into two significant contributions:

• an architectural framework for SoCs, and

• a design infrastructure developed to leverage the advantages of reconfigurability and a defined SoC model.

The proposed framework models SoCs as Systems Integrating Modules with Predefined Physical Links (SIMPPL [7]) to expedite system integration. Within this framework, IP modules are abstracted as Computing Elements (CEs) to reduce the complexities of adapting IP to new applications. A lightweight controller has been created to provide a fixed system-level interface for the IP module with standardized communication protocols. It also executes a program that dictates how the IP is used in the system, thus localizing the system-level control to simplify any necessary functional redesign of the IP for other applications.

Designers implementing SoCs on FPGAs can leverage configurability by moving the evaluation of the system on-chip. This can reduce system design time by decreasing the amount of time spent simulating the system’s runtime behaviour, while still providing accurate information. To this end, two on-chip profiling tools, SnoopP [8] and WOoDSTOCK [7], have been designed. Furthermore, fixing the SoC architectural framework allows us to create a system specification tool [7] that can facilitate the redesign of the system-level architecture and an on-chip verification environment [9] for SoCs implemented on FPGAs using the SIMPPL model.

1.4 Thesis Organization

This thesis is divided into eight chapters. Chapter 2 summarizes the previous work done on IP reuse and on-chip communication structures and presents the SIMPPL framework for SoC design. Chapter 3 describes the Computing Element (CE) abstraction that is central to this model. To demonstrate how the SIMPPL framework and CE abstraction can facilitate design, three applications are implemented as SoCs as described in Chapter 4. The remainder of the thesis discusses designing SoCs on FPGAs. Chapter 5 provides an overview of current research and describes the system-level design tools created for generating and verifying SoCs within the SIMPPL framework on FPGAs. Along with these tools, the design infrastructure for SoCs on FPGAs also includes two profiling tools for evaluating system performance. The first is SnoopP, a Snooping Profiler for measuring the performance of applications on processors at runtime, which is discussed in Chapter 6. The second is Watching Over Data STreaming On Computing element linKs (WOoDSTOCK); Chapter 7 describes how it can be used to detect processing load imbalances in systems modelled using SIMPPL. Finally, the conclusions and potential future work for this thesis are summarized in Chapter 8.


Modelling SoCs: SIMPPL

This chapter begins by summarizing popular methods of simplifying IP reuse in Section 2.1, followed by a discussion of some of the previous work investigating on-chip interconnect structures in Section 2.2. It concludes with a presentation of the SIMPPL system framework for SoC design in Section 2.3.

2.1 IP Reuse

Multiple books discuss the complexities involved in reusing legacy IP in new designs [1, 2]. Although IP reuse can reduce design time, problems that arise when incorporating previously designed modules into new designs are of significant concern. This has led to the development of well-defined IP design methodologies [10, 11] to ensure reusability of cores with fixed interfaces and functionality. These methodologies do not, however, address the common situation where a module has defined functionality but requires the ability to interface with different communication structures.

The Spirit Consortium [12] has created two specifications for facilitating IP reuse. The first is the IP meta-data description, which provides a generic method for describing IP modules. The consortium has also created an IP tool integration API that allows designers to integrate tools into an IP framework for SoC design.

The VSI Alliance has proposed the Open Core Protocol (OCP) [13] to enable the separation of external core communications from the IP core’s functionality, similar to the SIMPPL model. Both communication models are illustrated in Figure 2.1. The SIMPPL model targets the direct communication model using a defined, point-to-point interconnect structure for all on-chip communications. In contrast, OCP is used to provide a defined socket interface for IP that allows a designer to attach interface modules that act as adaptors to different bus standards that include point-to-point interconnect structures as shown in Figure 2.1(a). This allows a designer to easily connect a core to all bus types supported by the standard.

Figure 2.1: Standardizing the IP interface using (a) SIMPPL for point-to-point communications and (b) OCP for different bus standards.

More recently, an Interface Adaptor Logic (IAL) layer has been proposed [14] that uses a socket interface for IP modules, similar to the OCP. However, unlike OCP, it is specifically aimed at IP reuse in reconfigurable SoCs. FPGA companies also recognize the importance of simplifying the inclusion of previously designed IP into newer system designs. Xilinx provides its own bus-interface module for interconnecting IP with a defined socket interface [15].

All the protocols presented in this section make it easier to port IP among different bus standards. For example, the OCP and the IAL layer provide standardized adapters that allow cores of fixed functionality to connect to a variety of bus standards. The SIMPPL model, however, has a fixed interface, supporting only point-to-point connections, with the objective of enabling designers to treat IP modules as programmable coarse-grained functional units. Designers can then reprogram the IP module’s usage in the system to adapt to the requirements of new applications.

2.2 On-Chip Communication Structures

Many different on-chip interconnect strategies have been proposed for SoC design, including hierarchical buses that use bridges to connect to each other [16, 17, 18], but the maximum bandwidth for each bus is limited by the number of modules connected to it. The WISHBONE [19] SoC interconnect architecture provides multiple different interconnect structures, allowing the designer to select the bus architecture for a particular system. Since all the WISHBONE interconnects are designed as single-level buses, the standard provides the user with a simpler design approach, unless components running at different clock rates must share the same bus.

Berkeley’s SCORE [20] architecture divides system computations into fixed-size pages and uses the data abstraction of streams to pass data between pages. Streams provide a high-level description of point-to-point communication, comparable to the SIMPPL internal communication link, but without defining a physical connection. Adaptive System-on-chip (aSOC) [21] uses a physical implementation of a point-to-point communication architecture for heterogeneous systems, where unlike the SIMPPL model, the communication interface for each module is tailored in hardware to optimize the module’s performance.

Networks provide another form of scalable on-chip communication. Multiple Network-on-Chip (NoC) topologies have been studied for ASIC designs [22, 23]. One popular NoC topology is the mesh [24, 25], which has also been investigated on an FPGA platform [26]. The SIMPPL model, however, can be used to implement any fixed point-to-point network topology, allowing the designer to choose the appropriate topology for each application.

2.3 SIMPPL Model

The proposed SIMPPL model represents SoCs as Systems Integrating Modules with Predefined Physical Links (SIMPPL) [7], implementing an SoC as a combination of different Computing Elements (CEs) that are connected via communication links. Figure 2.2 illustrates a possible SoC architecture described using the SIMPPL model, where the solid lines indicate internal links and the dotted lines indicate I/O communication links. I/O communication links may require different protocols to interface with off-chip hardware peripherals, but the internal links are standardized physical links with defined communication protocols to make the actual interconnection of CEs a trivial problem and to create a framework for systems design. The current work using the SIMPPL model assumes that the internal links are n-bit wide asynchronous FIFOs with a user-defined depth. Using asynchronous FIFOs simplifies multi-clock domain systems, allowing designers to isolate different clock domains in different CEs and buffer the data transfers between CEs. Point-to-point links not only offer higher bandwidth than shared buses, but recent work has demonstrated that commercial FPGA routing fabrics can implement network topologies where CEs have a high degree of connectivity without performance degradation due to routing congestion [27].

Figure 2.2: A generic SoC described using the SIMPPL model.

Figure 2.3: The system generator’s generic computing element.

Each CE has the generic structure shown in Figure 2.3, with N input links and M output links. Internal links connect a CE to other CEs, where input links connect to parent CEs and output links connect to child CEs. The information passed between CEs is abstracted from the links themselves and instead, the data transfers are adapted to the specific requirements of each CE. This format of communicating data between modules is akin to software design, where the stack provides the physical interface between software functions, similar to the proposed internal links. However, the information passed on the stack, such as the number of parameters, is determined by the individual function calls. In the SIMPPL model, the size and nature of the data in the packet communicated between the IP modules performs this task. Each module has internal protocols capable of properly creating and interpreting the information in a packet.
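To make the role of the fixed-depth internal links concrete, the following is a minimal behavioural sketch in Python, not the HDL used in this work. It models one producer CE and one consumer CE joined by a single FIFO link and counts the cycles in which the producer must stall because the link is full. The names `Link`, `run_pipeline`, `depth`, and `consumer_period` are illustrative assumptions, and the one-word-per-cycle timing is a simplification rather than the behaviour of the actual asynchronous FIFOs.

```python
from collections import deque


class Link:
    """Hypothetical model of an internal link: a FIFO with a fixed depth."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()

    def full(self):
        return len(self.fifo) >= self.depth

    def empty(self):
        return not self.fifo

    def write(self, word):
        """Try to enqueue a word; return False if the writer must stall."""
        if self.full():
            return False
        self.fifo.append(word)
        return True

    def read(self):
        """Dequeue the oldest word, or return None if the link is empty."""
        return None if self.empty() else self.fifo.popleft()


def run_pipeline(words, depth, consumer_period):
    """Send words from a producer CE to a consumer CE over one link.

    The consumer accepts at most one word every `consumer_period` cycles
    (an assumed, simplified timing model).
    Returns (received_words, producer_stall_cycles).
    """
    link = Link(depth)
    pending = deque(words)
    received = []
    stalls = 0
    cycle = 0
    while pending or not link.empty():
        # Consumer CE: drain one word on its active cycles.
        if cycle % consumer_period == 0:
            word = link.read()
            if word is not None:
                received.append(word)
        # Producer CE: attempt to send one word per cycle.
        if pending:
            if link.write(pending[0]):
                pending.popleft()
            else:
                stalls += 1
        cycle += 1
    return received, stalls


if __name__ == "__main__":
    print(run_pipeline([10, 11, 12, 13], depth=2, consumer_period=1))
    print(run_pipeline([10, 11, 12, 13], depth=1, consumer_period=3))
```

In this sketch the data always arrives in order, but a shallow link combined with a slow consumer forces the producer to wait, which is the kind of interaction that a finite link capacity introduces.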

A proposed model for the future of SoC design using many interacting heterogeneous processors [28] can also have the same structure as a SIMPPL SoC; however, the SIMPPL model is more general, allowing CEs to depict either processors (software CEs) or dedicated logic modules (hardware CEs). The SIMPPL model representation of SoCs is more reminiscent of Kahn process networks [29], particularly Data process networks [30], in that it is a collection of CEs interconnected via unidirectional links and well suited to data intensive applications. However, unlike these models that assume the internal links have unbounded capacity, the SIMPPL model uses real FIFOs that have limited capacity. Work at Philips Research produced YAPI [31], an application model based on Kahn process networks that has been extended to support non-deterministic events and decouple the data types used for communications and computation.

Although the SIMPPL model allows non-deterministic events, they are supported by the CE abstraction. The abstraction allows inter-CE synchronization to be programmed to meet the specific requirements of each application. The SIMPPL model only provides a physical structure for the system and is oblivious to the meaning of the data flowing between CEs, deferring the interpretation of the data to the CE abstraction discussed in the following chapter.


The Computing Element Abstraction

The Computing Element (CE) is an abstraction of software or hardware IP that facilitates design reuse by separating the datapath (computation), the inter-CE communication, and the control. Researchers have demonstrated some of the advantages of isolating independent control units for a shared datapath to support sequential procedural units in hardware [32]. Similarly, when a CE is implemented as software on a processor (software CE), the software is designed with the communication protocols, the control sequence, and the computation as independent functions. Should a software CE need to be reused and updated for a new application, the software changes should be localized to only the control sequence functions.
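The separation described for software CEs can be sketched as follows. This is an illustrative Python outline under our own assumptions, not code from the implemented systems: the function names, the use of Python lists as stand-ins for the links, and the doubling operation are all hypothetical. It shows communication, computation, and control as independent functions, with a second control sequence reusing the other two unchanged.

```python
def receive(rx_link):
    """Communication: take the oldest word from the input link (a list used as a FIFO)."""
    return rx_link.pop(0)


def send(tx_link, word):
    """Communication: place a word on the output link."""
    tx_link.append(word)


def compute(sample):
    """Computation (datapath): a placeholder operation, here doubling the sample."""
    return 2 * sample


def control_sequence(rx_link, tx_link):
    """Control: process every incoming sample and forward the result."""
    while rx_link:
        send(tx_link, compute(receive(rx_link)))


def control_sequence_every_second(rx_link, tx_link):
    """Control for a hypothetical new application: process only every second sample.

    Only this function differs; receive, send, and compute are reused as-is.
    """
    keep = True
    while rx_link:
        sample = receive(rx_link)
        if keep:
            send(tx_link, compute(sample))
        keep = not keep


if __name__ == "__main__":
    out_a, out_b = [], []
    control_sequence([1, 2, 3], out_a)
    control_sequence_every_second([1, 2, 3, 4], out_b)
    print(out_a, out_b)
```

Adapting the CE to the second application touches only the control sequence function, which is the localization of changes that the abstraction aims for.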

Typically, complex control is easier to implement in software than in hardware. Figure 3.1 illustrates the desired functionality of a hardware CE. Using a microcontroller as the CE communication interface isolates the Processing Element’s (PE’s) functionality from the rest of the system. The PE now operates as a coarse grain functional unit that is only accessible via the microcontroller. The PE’s local control is encapsulated in the CE’s local program. Its instructions, along with data requests from adjoining CEs, are interpreted by the microcontroller and executed by the PE.

Figure 3.1: The concept for the hardware CE.

Figure 3.2: The hardware CE abstraction.

However, general purpose microcontrollers are too big and too slow for the hardware-to-hardware interactions of dedicated logic modules in hardware CEs. Ideally, a controller customized to each CE’s datapath could be used as a generic system interface, optimized for that specific CE’s datapath. To this end, we have created two versions of a fast, programmable, lightweight controller – an execution-only (execute) version and a run-time debugging (debug) version – that are both adaptable to different types of computations suitable to SoC designs on both ASICs and FPGAs.

Figure 3.2 illustrates how the control, communications and the datapath are decoupled in hardware CEs. The Processing Element (PE) represents the datapath of the CE or the IP module, where an IP module implements a functional block having data ports and control and status signals. It performs a specific function, be it a computation or communication with an off-chip peripheral, and interacts with the rest of the system via the SIMPPL controller, which interfaces with the internal communication links. The SIMPPL Control Sequencer (SCS) module allows the designer to specify, or “program”, how the PE is used in the SoC. It contains the sequence of instructions that are executed by the controller for a given application. The controller then manipulates the control bits of the PE based on the current instruction being executed by the controller and the status bits provided by the PE. Section 3.3.2 illustrates a programming example for the SCS.

The remainder of this chapter is divided into the following sections. Section 3.1 provides details on the underlying SIMPPL controller architecture and Section 3.2 outlines the additional functionality and hardware of the “debug” version of the controllers. Finally, the SIMPPL Control Sequencer’s interface and programming model are discussed in Section 3.3.

3.1 SIMPPL Controller

The SIMPPL controller acts as the physical interface of the IP core to the rest of the system. Its instruction set is designed to facilitate controlling the core’s operations and reprogramming the core’s use for different applications. Details on the controller’s architecture and the instructions it supports are given below.

3.1.1 Controller Architecture

Figure 3.3 illustrates the SIMPPL controller’s datapath architecture. The controller executes instructions received via both the internal receive (Rx) link and the SCS. Instructions from the Rx link are sent by other CEs as a way to communicate control or status information from one CE to another CE, whereas instructions from the SCS implement local control. Instruction execution priority is determined by the value of the Cont Prog bit so that designers can vary the priority of program instructions depending on how a CE is used in an application. If this status bit is high, then the “program” (SCS) instructions have the highest priority; otherwise the Rx link instructions have the highest priority. Since the user must be able to properly order the arrival of instructions to the controller from two sources, allowing multiple instructions in the execution pipeline greatly complicates the synchronization required to ensure that the correct execution order is achieved. Therefore, the SIMPPL controller is designed as a single-issue architecture, where only one instruction is in flight at a time, to reduce design complexity and to simplify program writing for the user. The SIMPPL controller also monitors the PE-specific status bits that are used to generate status bits for the SCS, which are used to determine the control flow of a program as discussed in Section 3.3.1.
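The priority rule described above can be sketched in software. The following Python model is only an illustration of the rule, not the controller's hardware: the function name `select_source` and its boolean arguments are hypothetical, and falling back to the other source when the preferred one has nothing available is an assumption of this sketch.

```python
from typing import Optional


def select_source(cont_prog: bool, scs_valid: bool, rx_valid: bool) -> Optional[str]:
    """Choose where a single-issue controller fetches its next instruction.

    cont_prog: value of the Cont Prog bit (True gives SCS instructions priority).
    scs_valid: an SCS ("program") instruction is available.
    rx_valid:  an instruction is waiting on the internal Rx link.
    Returns "SCS", "RX", or None when nothing can be fetched.
    """
    if cont_prog:
        order = (("SCS", scs_valid), ("RX", rx_valid))
    else:
        order = (("RX", rx_valid), ("SCS", scs_valid))
    # Assumed behaviour: use the first source in priority order that has
    # an instruction available; only one instruction is ever in flight.
    for name, available in order:
        if available:
            return name
    return None
```

With both sources ready, `select_source(True, True, True)` selects the SCS and `select_source(False, True, True)` selects the Rx link.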

[Figure 3.3 diagram labels: SIMPPL Controller, EX IR, a0 REG, Prog Instr, Cont Prog, Controller Status Bits, Internal Rx Link, Internal Tx Link, Received Data, Transmitted Data, PE Status, PE Control, Optional Asynchronous FIFOs, Processing Element (Hardware IP), SIMPPL Control Sequencer (SCS).]

by the instruction currently being executed. The inputs multiplexed to the Tx link are the Executing Instruction Register (EX IR), an immediate address that is required in some instructions, the address stored in the address register a0 and any data that the hardware IP transmits. Data can only be received and transmitted via the internal links and cannot originate from the SCS. Furthermore, the controller can only send and receive discrete packets of data, which may not be sufficient for certain types of PEs requiring continuous data streaming. To solve this problem, the controller supports the use of optional asynchronous FIFOs to buffer the data transmissions between the controller and the PE. The designer can then clock the controller at a faster rate than the PE to guarantee that it accurately receives/transmits at the necessary data rate.

3.1.2 Controller Instruction Set

Although the current SIMPPL controller uses a 33-bit wide FIFO, the data word is only 32 bits. The remaining bit is used to indicate whether the transmitted word is an instruction or data. Figure 3.4 provides a description of the generic data packet structure transmitted over an internal link. The instruction word is divided into the least significant byte, which is designated for the opcode, and the upper 3 bytes, which represent the Number of Data Words (NDW) sent or received in a data transmission instruction. The current instruction set uses only the five Least Significant Bits (LSBs) of the opcode byte to represent the instruction. The remaining bits are reserved for future extensions of the controller instruction set.
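As an illustration of this word layout, the following Python sketch packs and unpacks 33-bit link words. It is a software model under stated assumptions, not the controller's implementation: the helper names are hypothetical, the opcode value in the usage note is arbitrary (the real opcodes are listed in Appendix A.1), and a control bit of 1 is assumed to mark an instruction word.

```python
OPCODE_BITS = 8           # least significant byte holds the opcode
USED_OPCODE_BITS = 5      # only the five LSBs are used by the current set
NDW_BITS = 24             # upper three bytes hold the Number of Data Words
CONTROL_BIT = 1 << 32     # 33rd bit; polarity (1 = instruction) is assumed
WORD_MASK = 0xFFFFFFFF    # the 32-bit program/data word


def pack_instruction(opcode: int, ndw: int) -> int:
    """Build a 33-bit instruction word from an opcode and an NDW value."""
    if not 0 <= opcode < (1 << USED_OPCODE_BITS):
        raise ValueError("opcode must fit in five bits")
    if not 0 <= ndw < (1 << NDW_BITS):
        raise ValueError("NDW must fit in 24 bits")
    return CONTROL_BIT | (ndw << OPCODE_BITS) | opcode


def pack_data(word: int) -> int:
    """Build a 33-bit data word (control bit left clear)."""
    if not 0 <= word <= WORD_MASK:
        raise ValueError("data word must fit in 32 bits")
    return word


def unpack(link_word: int):
    """Split a link word into ("instr", opcode, ndw) or ("data", word)."""
    payload = link_word & WORD_MASK
    if link_word & CONTROL_BIT:
        return ("instr", payload & 0xFF, payload >> OPCODE_BITS)
    return ("data", payload)
```

For example, `unpack(pack_instruction(0x03, 8))` yields `("instr", 3, 8)`, a header announcing eight data words.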

For SoCs built with the SIMPPL model that do not require a 32-bit data word length or address space, designers can choose to reduce resource usage. If the width of the data word transmitted/received by a CE is less than 32 bits and the maximum number of data words, the NDW value, is less than 2^23, then the designer may choose to reduce the width of the FIFOs used as internal Rx and Tx links for that CE. For example, if the width of the data words being processed by a CE is 24 bits, the internal links can be 25 bits wide, where 24 bits are used for the data word and one bit is used as the control bit. The opcode of the instruction word would still be the eight LSBs; however, there would only be two bytes to represent the NDW value for the instruction, decreasing the packet size that could be received or transmitted by the CE.
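The arithmetic behind this trade-off can be written out as a short Python sketch. The function name `link_parameters` is hypothetical, and the lower bound on the data width (more than eight bits, so that the opcode byte still fits) is an assumption of the sketch.

```python
def link_parameters(data_width: int) -> dict:
    """Derive internal link sizing from a CE's data word width.

    The link carries the data word plus one control bit. The opcode keeps
    the eight LSBs of the instruction word, so the NDW field gets whatever
    bits remain above it.
    """
    if not 9 <= data_width <= 32:
        raise ValueError("sketch assumes a data word of 9 to 32 bits")
    ndw_bits = data_width - 8
    return {
        "link_width": data_width + 1,        # data word + control bit
        "ndw_bits": ndw_bits,                # bits left for the NDW field
        "max_ndw": (1 << ndw_bits) - 1,      # largest packet payload
    }
```

A 24-bit data word gives a 25-bit link with a two-byte NDW field, so a single packet can carry at most 65535 data words instead of 2^24 - 1.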

All SIMPPL controller instruction packets have three components: (1) the instruction word; (2) the address or state word (optional); and (3) the data words (optional). The instruction set is divided into two groups: instructions that perform a control operation, and those that transfer data. Instructions resulting in data transfers are further subdivided into three different categories: (1) read requests, (2) receives, and (3) writes. A read request is issued by the program of one CE and sent to another CE requesting that data be transmitted back to the original CE. A receive instruction must then be generated as the first transmitted word to accompany the data sent back to the initiating CE, since all transfers via internal links start with an instruction. Finally, the program can also use a write instruction to accompany data words transmitted to another CE.

[Figure 3.4 diagram: a data packet sent from a Tx CE to an Rx CE. Each link word carries a program word in bits 0 to 31 and a control bit in bit 32. The instruction word (opcode in bits 0 to 7, Num Data Words (NDW) above it) is shown with control bit 1. It is followed by an optional immediate address/state word and optional data words Data 0 through Data NDW - 1, each shown with control bit 0.]

Figure 3.4: An internal link’s data packet format.

Table 3.1 contains all the instructions currently supported by the SIMPPL controller and Appendix A.1 lists the instructions and their corresponding opcodes. The objective is to provide a minimal instruction set to reduce the size of the controller, while still providing sufficient programmability such that the cores can be easily reconfigured for any potential application. Although some instructions required to fully support the reconfigurability of some types of hardware PEs may be missing, the instructions in Table 3.1 support the hardware CEs that have been built to date. Furthermore, the controller supports the expansion of the instruction set to meet future requirements.

Table 3.1: The current instruction set supported by the SIMPPL controller.

Instruction Type Rd Rx Wr Issue Exec. Addr Data
Req Instr Instr Field Field
Imm. Data Transfer X X X S/R S/R X
Imm. Data + Imm. Addr. X X X S/R S/R X X
Addr. Reg. Initialization X S S X
Addr. Reg. Arithmetic X S S
Imm. Data + Indir. Addr. X X X S S X X
Imm. Data + Autoinc. X X X S S X X
Bypass S/R S/R X
No-op S R
Reset S R

The first column in Table 3.1 describes the operation being performed by the instruction. Columns 2 through 4 are used to indicate whether the different instruction types can be used to request data (Rd Req), receive data (Rx), or write data (Wr). The next two columns are used to denote whether each instruction may be issued from or executed from the SCS (S) or internal Receive Communication Link (R). Finally, the last two columns are used to denote whether the instruction requires an address field (Addr Field) or a data field (Data Field) in the packet transmission.

The first instruction type described in Table 3.1 is the immediate data transfer instruction. It consists of one instruction word of the format shown in Figure 3.4, excluding the address field, where the two LSBs of the opcode indicate whether the data transfer is a read request, a write, or a receive. The immediate data plus immediate address instruction is similar to the immediate data transfer instruction except that an address field is required as part of the instruction packet.

Instructions that use the a0 register have a one or two-word format, but are not transmitted as they only make sense in the context of the local controller. The initialization of the local address register with an immediate value is a two-word instruction, where the first word contains the opcode and the second is the new address. The address register arithmetic instructions are single-word instructions used to add or subtract an offset to the current local address register value. The value in the address register can provide the immediate address for any data transfer instructions sent to other CEs, using indirect addressing with an optional post-increment.

[Figure 3.5 diagram: four bypass headers (Bypass 0 through Bypass 3) prepended to a packet made up of an instruction word (opcode and Num Data Words (NDW)), an optional address/state word, and optional data words Data 0 through Data NDW - 1.]

Figure 3.5: A data packet with four bypass instructions.

The remaining instructions provide control functionality for the controller. The bypass instruction allows a packet of data received from one CE to bypass the current CE, such that the bypass instruction header is removed and the enclosed instruction is forwarded without execution. Figure 3.5 illustrates a data packet that is encompassed within four bypass instructions. By prepending N bypass instructions to a data packet, the packet will bypass N controllers before the (N+1)th controller processes the actual data packet. The no-op instruction can be used in combination with SCS status bits to provide handshaking controls between CEs. This will be further discussed in Section 3.3.1. Finally, the reset instruction can be transmitted from the CE to reset the controller and PE of the receiving CE.
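The bypass mechanism can be modelled with a few lines of Python. This is a minimal sketch under simplifying assumptions: a packet is represented as a list of symbolic words, the `BYPASS` marker and the function names are hypothetical, and the controllers are assumed to form a simple chain.

```python
BYPASS = "BYPASS"  # hypothetical marker standing in for a bypass header word


def wrap_with_bypass(packet, hops):
    """Prepend `hops` bypass headers to a packet."""
    return [BYPASS] * hops + list(packet)


def controller_step(packet):
    """Model one controller receiving a packet.

    A leading bypass header is stripped and the remainder is forwarded
    without execution; otherwise the packet is executed locally.
    """
    if packet and packet[0] == BYPASS:
        return "forward", packet[1:]
    return "execute", packet


def route(packet):
    """Send a packet down a chain of controllers.

    Returns the zero-based index of the controller that executes the
    packet, together with the packet it finally sees.
    """
    index = 0
    while True:
        action, packet = controller_step(packet)
        if action == "execute":
            return index, packet
        index += 1
```

Wrapping a write packet in four bypass headers, as in Figure 3.5, makes `route` report index 4: four controllers forward the packet and the fifth one executes it.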

Designers can reduce the size of the controller by tailoring the instruction set to the PE. Although some CEs may receive and transmit data, thus requiring the full instruction set, others may only produce data or consume data. The Producer controller (Producer) is designed for CEs that only generate data. It does not support any instructions that may read data from a CE. The Consumer controller (Consumer) is designed for CEs that receive input data without generating output data. It does not support any instructions that try to write PE data to a Tx link.

[Figure 3.6 diagram: the Debug SIMPPL Controller wraps the Execute SIMPPL Controller and adds the status registers CE ID, Error Type Register, Ex IR, Imm. Addr., A0 Register, Data Cntr, Prog IR, Rx IR, Controller Status, Prog/PE Status, and Execution/Fetch Time Counter. Signals and links shown: Status Check, Internal Error, Status Ready, the debug status upload link, and the Rx and Tx Communication Links.]

Figure 3.6: The SIMPPL debug controller architecture.

3.2 Debug Controllers

Here we introduce a debug SIMPPL controller (debug controller), based on the execute SIMPPL controller (execute controller) described in Section 3.1. This extension of the original architecture allows designers to detect low-level programming and integration errors for individual CEs.

3.2.1 Debug-Controller Architecture and Interface

Figure 3.6 shows the architecture of a debug controller, with the execute controller described in Section 3.1 forming the central component. While the execute controller has three states in the instruction execution state machine (fetch, decode, and execute), the debug controller has a fourth state, the stall state. An input signal (Status Check) has been added to the debug controller to allow designers to request a status check of the CE while the system is running. Additional output signals are used to indicate if a run-time error has occurred in the CE (int error) and when the CE’s status information is ready to be accessed (status ready). The controller enters the stall state if an error occurs during the execution of an instruction or if a status check has been requested (status check). The stall state allows the controller to upload all of the status information about the current executing instruction to the debug status upload link before executing the next instruction.

[Figure 3.7 diagram labels: Debug SIMPPL Controller (several instances), Debug Controller Interface, Off-Chip Interface Module, Off-Chip.]

Figure 3.7: The SIMPPL debug controller interface.

Eleven status registers have been added to the debug controller architecture, as shown in Figure 3.6, to store run-time status information about the CE. These include the CE’s ID register, registers that store information about the instruction currently executing (Ex IR, Imm Addr, A0 register, Data Cntr, Execution/Fetch Time Counter), the current state of the CE (Error Type Register, Prog/PE Status, Controller Status), and the “next” instructions available from the program and from the receive link (Rx IR and Prog IR). The status registers are connected to form a large shift register to upload the values from the CE to the debug status upload link. The debug controller requires twelve cycles, or one cycle plus the number of status registers, in the stall state to upload all of the status information from the CE to the link, assuming the upload link is not full. Otherwise, the controller will remain in the stall state until all the status register values have been uploaded.
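The cycle count quoted above can be checked with a small Python model of the stall state. It is a sketch only: the order of the registers in the shift chain, the use of the first stall cycle to start the upload, and the name `stall_cycles` are assumptions, since the text gives only the total of one cycle plus the number of status registers.

```python
# The eleven status registers named in the text; chain order is assumed.
STATUS_REGISTERS = [
    "CE ID", "Ex IR", "Imm Addr", "A0", "Data Cntr",
    "Execution/Fetch Time Counter", "Error Type", "Prog/PE Status",
    "Controller Status", "Rx IR", "Prog IR",
]


def stall_cycles(link_full_pattern=()):
    """Count the cycles spent in the stall state for one status upload.

    link_full_pattern: booleans, one per cycle after the first stall
    cycle; True means the upload link is full in that cycle, so no
    register is shifted out. Cycles beyond the pattern assume free space.
    """
    cycles = 1                            # assumed set-up cycle
    remaining = len(STATUS_REGISTERS)
    pattern = iter(link_full_pattern)
    while remaining:
        cycles += 1
        if not next(pattern, False):      # link has room: shift one register
            remaining -= 1
    return cycles
```

With an upload link that never fills, `stall_cycles()` returns 12, matching the figure given in the text; each cycle in which the link is full adds one more stall cycle.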

Table 3.2: The current error cases detectable using the debug controller.

Error Case                                           Error Code   Error Type
Instruction word not in Fetch Cycle                  8000 0001    Programming
Data word in Fetch Cycle                             4000 0001    Programming
Execution Time Overflow                              2000 0001    Programming
Fetch Time Overflow                                  1000 0001    Programming
Writing to a Full Tx Link                            0800 0001    Integration
Reading from an Empty Rx Link                        0400 0001    Integration
Writing data to the PE when it is not ready          0200 0001    Integration
Writing an address to the PE when it is not ready    0100 0001    Integration
Reading data from the PE when it is not ready        0080 0001    Integration
Executing an invalid instruction                     0040 0001    Programming

that is used to upload the debugging information to the debug interface shown in Figure 3.7. The debug controller interface connects via a bus to an off-chip peripheral interface module that allows users to read the available status information off-chip from the controllers. The interface also contains a status register that indicates which CEs have status information available and which, if any, CEs have encountered run-time errors. Alternatively, if a debug controller is implemented in ASIC technology, the status information can be downloaded off-chip by implementing the registers using scannable flip-flops.

3.2.2 Debugger Options and Detectable Errors

The debug controller supports two different run-time operations: error detection and status checks. When the Status Check signal is set high for a clock cycle, it triggers the CE to upload status information after the execution of the current instruction completes. This allows the designer to check what instruction is being executed by a CE at random points of operation of the application. The Status Check signal can also be tied high for the duration of the profile period to obtain a continuously running profile of the CE; however, the CE will stall if the upload link becomes full.

Column 1 of Table 3.2 lists the error cases that the debug controller is currently able to detect, but the number of detectable error cases may be extended if a future need is determined. The second column in the table indicates the error code that is uploaded from the debug controller when an error occurs. The final column indicates whether an error case is the result of a programming error or a CE/system integration error.
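A host-side tool could translate the uploaded codes back into the entries of Table 3.2. The Python sketch below assumes that the codes in the table are 32-bit hexadecimal values; the names `ERROR_TABLE` and `decode_error` are hypothetical. Because the text does not say whether several errors can be reported in one code, the lookup is an exact match only.

```python
# Error codes from Table 3.2, read as 32-bit hexadecimal values (assumed).
ERROR_TABLE = {
    0x80000001: ("Instruction word not in Fetch Cycle", "Programming"),
    0x40000001: ("Data word in Fetch Cycle", "Programming"),
    0x20000001: ("Execution Time Overflow", "Programming"),
    0x10000001: ("Fetch Time Overflow", "Programming"),
    0x08000001: ("Writing to a Full Tx Link", "Integration"),
    0x04000001: ("Reading from an Empty Rx Link", "Integration"),
    0x02000001: ("Writing data to the PE when it is not ready", "Integration"),
    0x01000001: ("Writing an address to the PE when it is not ready", "Integration"),
    0x00800001: ("Reading data from the PE when it is not ready", "Integration"),
    0x00400001: ("Executing an invalid instruction", "Programming"),
}


def decode_error(code: int):
    """Return (error case, error type) for an uploaded code, or None."""
    return ERROR_TABLE.get(code)
```

For instance, `decode_error(0x08000001)` identifies a write to a full Tx link and classifies it as an integration error.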

3.3 SIMPPL Control Sequencer

The SIMPPL Control Sequencer provides the local program that specifies how the PE is to be used by the system. For example, a CE that has an audio sampling PE can be reprogrammed to generate packets of different formats depending on the requirements of the application. In this section, we discuss the SCS’s architecture for both ASIC and FPGA platforms and provide a programming example. We then conclude with a discussion of how the CE abstraction allows a designer to dynamically generate program instructions, which we refer to as dynamic programming.

3.3.1 SCS Interface

The operation of a SIMPPL controller is analogous to a generic processor, where the controller’s instruction set is akin to assembly language. For a processor, programs consist of a series of instructions used to perform the designed operations. Execution order is dictated by the processor’s Program Counter (PC), which specifies the address of the next instruction of the program to be fetched from memory. While a SIMPPL controller and program perform the equivalent operations to a program running on a generic processor, the controller uses a remote PC in the SCS to select the next instruction to be fetched.

Figure 3.8 illustrates the SCS structure and its interface with the SIMPPL controller via six standardized signals. The 32-bit program word and the program control bit, which indicates if the program word is an instruction or address, are only valid when the valid instruction bit is high. The valid instruction signal is used by the SIMPPL controller in combination with the program instruction read to fetch an instruction from the Store Unit and update the PC. The continue program bit indicates whether the current program instruction has higher priority than the instructions received on the CE Rx link. It can be used in combination with PE-specific and controller status bits to help ensure the correct execution order of instructions.

For example, if the SCS has a status bit that indicates when the controller is executing an instruction from an Rx link (exec rx instr), it can be used to stall the CE until it has received a packet from an adjacent CE. To perform this handshaking, the SCS program initially stalls the controller by setting the valid instruction bit low. When the controller receives an instruction on the Rx link, it acts as a request signal and the exec rx instr bit will go high. In response to this request, the SCS’s valid instruction signal then goes high along with the continue program bit so that the next instruction executed by the controller is an SCS instruction to acknowledge the received request.

[Figure 3.8 diagram: the SIMPPL Control Sequencer (SCS), containing the Store Unit (Program) and the PC, connected to the SIMPPL Controller by the signals Program Instruction, program control bit, continue program, valid instruction, program instruction read, and Controller Status Bits.]

Figure 3.8: The standard SIMPPL control sequencer structure and interface to the SIMPPL controller.

Although a PC is traditionally implemented as a counter, the SCS’s remote PC can also be constructed as a Finite State Machine (FSM). This allows branches to be executed implicitly as transitions in the PC’s FSM depending on the control and status signal values. The PC FSM is application-specific and uses the current PC and status bit values to generate the correct index to the store unit to select the correct instruction to be fetched and sent to the controller. This reduces the size of both the SIMPPL controller and the program located in the store unit by eliminating the need for branch instructions in the instruction set. Furthermore, it reduces the performance overhead of using the SIMPPL controller as an interface since it does not have to execute conditional or explicit branch instructions.

If an SoC is implemented on an FPGA, the designer can choose to implement the program’s store unit in an on-chip memory. Yet many CEs only require small SCSs for an application, thus the instructions can be stored as a separate FSM. When an SoC is implemented as an ASIC, the designer could choose to design each SCS for its specific application by instantiating a small memory for the Store Unit and then implementing the PC as application-specific dedicated logic. However, one of the benefits of the CE abstraction is that it decouples the control from the datapath to support programmability. Hardwiring the PC means that the designer cannot alter the CE’s program post-fabrication. To allow post-fabrication programmability, ASIC designers can implement a small memory for the instruction words and a small region of programmable fabric that enables designers to change the PC to support a variety of SCS programs for the CE. The following example demonstrates how to write a program and use the SIMPPL controller interface.

    write start addr to a0;
    for (i=0; i< 1024; i++) {
        while (!valid_sensor_data);
        write 8 data words starting at addr (a0);
        a0 = a0 + 8;
    }

Figure 3.9: Pseudocode for the sensor unit’s SCS program.

3.3.2 Static Programming Example

Assume a hardware system that consists of two PEs: 1) a memory, and 2) a sensor unit used to measure multiple environmental quantities at set time intervals. The total storage requirement for each set of measurements is 32 bytes (eight data words) and the memory is large enough to store 1024 samples. The user wants to store the first 1024 samples to experimentally measure when the environmental system reaches steady state before deciding how often to record samples and upload the results to a host PC. The sensor unit has a status bit, valid sensor data, that indicates when a set of measurements is available for reading. The sensor unit’s SIMPPL controller passes the status information to its SCS to indicate that data is available for transmission to the memory unit. The pseudocode for the sensor unit’s SCS program is given in Figure 3.9. At present, we do not have compiler support for the SIMPPL controller and all programs (SCSs) are hand generated. Figure 3.10 illustrates pseudo-HDL implementations of the sensor CE’s Program Counter FSM and the valid instruction signal that dictate the program instruction and if it is available to be fetched by the SIMPPL controller using the prog instr read signal.

The PC requires four states to implement the pseudocode in Figure 3.9 and the PC state only changes after an instruction has been read or all 1024 samples have been written to memory. The first two states, Write a0 state and Write address state, write the starting address of the memory unit to the a0 register. The third state (Write autoinc state) writes eight data words to the memory unit starting at address (a0) and then post-increments a0 by eight. While the valid instruction signal is high during the first two states to initialize the address register, it is assigned the value of the valid sensor data status bit in the Write autoinc state because the data write instruction should only occur when the sensor has new data to transmit to the memory. A separate counter state machine (SampleCntr), not shown in Figure 3.10, is used to count the number of times the sensor unit measurements are sent to the memory unit. When the SampleCntr equals 1024, the program has completed so the PC goes to the Done state, where no further instructions are executed, and the valid instruction signal goes low permanently.

    if (rst=1) {
        PCstate <= Write a0 state;
    } else {
        PCstate <= nextPC;
    }

    //Next-state state machine for the PC:
    case (PCstate) {
        Write a0 state: //Instruction to initialize a0
            if ((prog_instr_read) && (rst=0))
                nextPC = Write address state;
            else
                nextPC = Write a0 state;
        Write address state: //New address for a0
            if (prog_instr_read)
                nextPC = Write autoinc state;
            else
                nextPC = Write address state;
        Write autoinc state: //Write data to (a0)+
            if ((prog_instr_read) && (SampleCntr=1024))
                nextPC = Done state;
            else
                nextPC = Write autoinc state;
        Done state:
            nextPC = Done state;
    }

    /*Used to indicate when the instruction is valid.
     *Stalls the processor when there is no valid
     *instruction. */
    case (PCstate) {
        Write a0 state:
            valid_instruction = 1;
        Write address state:
            valid_instruction = 1;
        Write autoinc state:
            valid_instruction = valid_sensor_data;
        Done state:
            valid_instruction = 0;
    }

Figure 3.10: Pseudo-HDL code to implement the state machine for the sensor unit’s program counter and the valid instruction signal.

[Figure 3.11 diagram: a CE in which a Consumer Controller (Rx) and a Producer Controller (Tx) surround the PE, with packets A through E in flight. The SCS holds an FSM, status bits, and a queue of instruction/state word entries (Instr A/State A, Instr B/State B, Null, Instr D/State D, Instr E/State E) whose continue program bits are 1, 1, 0, 1, 1.]

Figure 3.11: A CE with multiple packets of data in flight.
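The program counter FSM and valid instruction logic of Figure 3.10 can be mirrored in a short Python simulation, which is convenient for checking the control flow before writing HDL. This is a sketch under stated assumptions: the controller is assumed to read an instruction in the same cycle that it is valid, the SampleCntr (not shown in Figure 3.10) is assumed to increment when a write instruction is read, and the function names and the reduced sample count used in the usage note are illustrative.

```python
WRITE_A0, WRITE_ADDR, WRITE_AUTOINC, DONE = range(4)


def next_pc(state, prog_instr_read, sample_cntr, total=1024):
    """Next-state logic of the PC, following Figure 3.10."""
    if state == WRITE_A0:
        return WRITE_ADDR if prog_instr_read else WRITE_A0
    if state == WRITE_ADDR:
        return WRITE_AUTOINC if prog_instr_read else WRITE_ADDR
    if state == WRITE_AUTOINC:
        finished = prog_instr_read and sample_cntr == total
        return DONE if finished else WRITE_AUTOINC
    return DONE


def valid_instruction(state, valid_sensor_data):
    """Valid instruction signal: stalls the controller when low."""
    if state in (WRITE_A0, WRITE_ADDR):
        return True
    if state == WRITE_AUTOINC:
        return bool(valid_sensor_data)
    return False


def run(sensor_ready, total=1024):
    """Drive the FSM with one valid_sensor_data value per cycle.

    Returns the final PC state and the number of write instructions issued.
    """
    state, sample_cntr, writes = WRITE_A0, 0, 0
    for ready in sensor_ready:
        read = valid_instruction(state, ready)   # assumed: read when valid
        if read and state == WRITE_AUTOINC:
            sample_cntr += 1                     # assumed SampleCntr update
            writes += 1
        state = next_pc(state, read, sample_cntr, total)
    return state, writes
```

Running `run([False, False, True, False, True, True, True, True], total=4)` spends two cycles initializing a0, stalls once while the sensor has no data, issues four writes, and ends in the Done state.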

3.3.3 Dynamic Programming Architecture

For some applications, a designer may wish to have a CE support multiple processing operations that are data packet dependent. If the CE is pipelined with independent Producer and Consumer controllers for the PE, then the Consumer may receive a variety of instruction packets that should result in the Producer generating different instruction packets depending on the received data. The following example demonstrates how the Consumer and Producer controllers can work together to correctly process the received instruction packets and generate the appropriate output instruction packets, even in the presence of bypass instructions.

Figure 3.11 illustrates a CE that receives packets A through E in order, where packet


Producer’s SCS. For the purpose of this example, the Consumer does not have an SCS and the order of packets received by a CE must be maintained when they are transmitted to the subsequent CE. Therefore, it is imperative that data packets A and B, which were in flight when packet C arrived, are transmitted first. To enable this functionality, the instructions from the Producer’s Rx Communication Link and those created in the Producer’s SCS have variable processing priority determined by the value of the continue program status bit. When the continue program status bit is set, the controller continues to fetch available instructions from the SCS, even if there are data packets to be processed on the receive link. Therefore, each Producer’s SCS uses a 35-bit wide FIFO to store the instruction word, the control bit, the valid instruction bit and the continue program bit. The FIFO acts as the Store Unit, where the maximum depth is equal to the maximum number of data packets that can be processed concurrently. The PE enqueues valid instructions into the FIFO for every data packet in flight, setting the continue program bit for each instruction, as indicated in Figure 3.11.

To ensure that bypassed packets are transmitted in the proper order, the PE must detect if the Consumer receives a bypass instruction. In this situation, the PE will queue a null instruction into the FIFO with the continue program and valid instruction bits set low, as shown in Figure 3.11. To guarantee that instructions are enqueued in the Producer’s FIFO in the correct order, the SCS state machine must push the correct instruction onto the FIFO before the Consumer controller finishes reading the current data packet. The Producer will then dequeue the instructions and transmit the data packets in order. When the “null” instruction is detected with the continue program and valid instruction bits set low, the Rx Communication Link will be given priority. The bypassed packet will then be retransmitted by the Producer to the subsequent CE and the “null” instruction will be dequeued from the FIFO.

Thus, for the example shown in Figure 3.11, the Producer will transmit packets A and B from the PE. It will then detect a “null” instruction, with the continue program bit set low, and process packet C from the bypass link, while simultaneously dequeuing the “null” instruction. This will be followed by packets D and E being sent to the next CE.
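The ordering behaviour of this example can be reproduced with a small Python model of the Producer's SCS FIFO. It is a sketch only: the `Entry` tuple, the `NULL` constant, and `producer_order` are hypothetical names, packets are represented by their letters, and the bypassed packet is assumed to be waiting on the Rx link by the time the null entry reaches the head of the FIFO.

```python
from collections import deque, namedtuple

# One FIFO entry: the instruction plus its valid and continue program bits.
Entry = namedtuple("Entry", "instr valid cont")
NULL = Entry(None, False, False)   # placeholder queued for a bypassed packet


def producer_order(scs_fifo, rx_link):
    """Return the order in which packets leave the Producer."""
    scs_fifo, rx_link = deque(scs_fifo), deque(rx_link)
    sent = []
    while scs_fifo:
        entry = scs_fifo.popleft()
        if entry.valid and entry.cont:
            # SCS instruction has priority: send the packet built by the PE.
            sent.append(entry.instr)
        else:
            # Null entry: the Rx link gets priority, so the bypassed
            # packet is retransmitted in its original position.
            sent.append(rx_link.popleft())
    return sent


fifo = [Entry("A", True, True), Entry("B", True, True), NULL,
        Entry("D", True, True), Entry("E", True, True)]
order = producer_order(fifo, ["C"])
```

Here `order` is `["A", "B", "C", "D", "E"]`: the null entry holds packet C's place in the stream so that it leaves after A and B and before D and E.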

3.4 Summary

The Computing Element abstraction simplifies the reuse of software and hardware IP by isolating the functionality of the IP, i.e., the Processing Element (PE), from the system-level communication and control. This is particularly important for hardware reuse, where redesign can be extremely costly. The SIMPPL Controller has been created to facilitate the reuse of hardware PEs. It provides a fixed physical interface to the PE as well as a fixed set of communication protocols for transmitting data among the CEs. Although the controller’s underlying architecture is optimized to act as a PE’s system interface, it can also be extended to provide runtime debugging capabilities for verification of the PE and its integration with the controller. CEs use a SIMPPL Control Sequencer (SCS) to store the local program that dictates how the PE will be used for the given application. The SCS allows both static and dynamic programming models to increase the flexibility of the CE abstraction and facilitate the reuse of CEs over a variety of applications.


Implementing SIMPPL SoCs

To investigate the usage of a programmable controller interface for IP modules, three SoCs are created using the SIMPPL framework. All three of the SoCs are implemented on a Xilinx Multimedia board. The board’s resources include a Virtex II 2000, five ZBT memory banks, a YCrCb video decoder that runs at 27 MHz, and an RGB video DAC operating at 25 MHz. Section 4.1 of this chapter describes the nature of the three applications and the effects of using SIMPPL on system design time. Next, an overview of the CEs used to implement each application is given in Section 4.2. The chapter then concludes with a detailed examination of the effects of using the SIMPPL framework on a non-trivial SoC design in Section 4.3.

4.1 SIMPPL SoC Applications

Figure 4.1 illustrates the system level connections for two video-based systems. The first is a video streaming system, which does not include the Switch CE. Instead, it uses one of the two memory banks to buffer the video feed from the video camera while the other bank is displayed using the video DAC driving an SVGA monitor. The second system is a video snapshot system, which includes the Switch CE and only allows the user to update the SVGA display with a new image when the switch is toggled. The Vid In CE interfaces with a video decoder to read in data in YCrCb format and then convert it to RGB format. The Vid Out CE receives data in RGB format and transmits it to a video DAC used to drive an SVGA monitor. These CEs, in combination with two external memory banks controlled by the Mem CE, are used to implement a video streaming and a video snapshot application. The video recorder and video display need to be synchronized because the system may come out of reset when the video recorder is mid-frame. Although the video applications require synchronization between the Vid In CE and Vid Out CE to properly display the video camera images, they do not communicate directly. Since the user is able to write individual programs to control the operation of the Vid In, Vid Out, and Mem CEs, there are multiple ways to implement this system. The straightforward approach is to have the Vid In and Vid Out CEs become active as soon as the system comes out of reset, and have the Mem CE only execute the memory reads and writes requested via the internal links from the Video CEs. However, this would not guarantee synchronization between the video data being received and the video data written to the SVGA. Therefore, to achieve synchronization between the two Video CEs, the Vid In CE starts running as soon as the system comes out of reset and the Vid Out CE stalls, waiting for an indication that the Vid In CE has started writing a new frame to the Mem CE.

[Figure 4.1 diagram labels: Vid_In CE, Vid_Out CE, Mem CE, Switch CE, Mem Bank 0, Mem Bank 1, switch.]

Figure 4.1: The SIMPPL model for the video streaming and snapshot applications.

Another significant design challenge for the video systems is the different operating frequencies of the CEs. Fortunately, the CE abstraction and asynchronous FIFO commu-nication links effectively isolate the different clock domains to simplify their integration and synchronization. For instance, the Vid In and Input Switch CEs operate at 27 MHz, however, the Mem CE operates at 54 MHz. Furthermore, the Vid Out CE uses an asyn-chronous FIFO interface between its PE and controller so that the controller can run at 50


[Figure 4.2 block diagram: Video Stream Parser (Parser), Variable Length Decoder/Run-Level Decoder (VLD/RLD), Inverse Quantization (IQ), Inverse Discrete Cosine Transform (IDCT), Motion Compensation (MC), Picture Reconstruction (PR), Missing Macroblock Replacer, Colour Space Converter, Video Frame Buffer, Frame Storage Buffer, and an MC/PR status register, connected by SIMPPL links]

Figure 4.2: The SIMPPL model for an MPEG-1 video decoder.

Table 4.1: Table of the System Integration times for SoCs.

SoC Design                             System Integration Time
Custom Streaming Video System          140 hours
SIMPPL Streaming Video System          4.5 hours
SIMPPL Snapshot Video System           1.5 hours
SIMPPL MPEG-1 Video Decoder System     18 hours

MHz to guarantee that valid data will be available for the PE, which runs at 25 MHz to match the video DAC’s operating frequency.
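The rate relationship between the 50 MHz controller and the 25 MHz PE can be illustrated with a small behavioural sketch. It is only a rate model under stated assumptions: the FIFO depth, the phase of the slower clock, and the function name `simulate_fifo` are hypothetical, and the sketch does not model the pointer synchronization a real asynchronous FIFO needs to cross clock domains safely. One time step represents one 50 MHz period, and the 25 MHz side acts on every second step.

```python
from collections import deque


def simulate_fifo(depth=4, steps=20):
    """Rate model of a fast writer feeding a slow reader through a FIFO.

    The writer (controller side) attempts one write per step; the
    reader (PE side) attempts one read on every odd step, i.e. at half
    the writer's rate. Returns (received_words, writer_stalls,
    reader_underflows).
    """
    fifo = deque()
    received = []
    writer_stalls = 0
    reader_underflows = 0
    next_word = 0

    for t in range(steps):
        # Reader goes first so that a word written in this step is not
        # visible to the reader until a later step.
        if t % 2 == 1:
            if fifo:
                received.append(fifo.popleft())
            else:
                reader_underflows += 1
        # Writer stalls when the FIFO is full (back-pressure).
        if len(fifo) < depth:
            fifo.append(next_word)
            next_word += 1
        else:
            writer_stalls += 1

    return received, writer_stalls, reader_underflows


if __name__ == "__main__":
    words, stalls, underflows = simulate_fifo()
    print("received:", words)
    print("writer stalls:", stalls, "reader underflows:", underflows)
```

In this model the slower side always finds valid data, while the faster side is throttled once the FIFO fills, which is the property the text attributes to running the controller faster than the PE.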

Figure 4.2 illustrates the third system designed using the SIMPPL model. It is an MPEG-1 video decoder that runs at 30 frames per second, generating 320 by 240 pixel images on an SVGA monitor. The decoder was designed and implemented by four graduate students as a course project [9]. The synchronization challenge for this system is to maintain the order of packets processed in the system while ensuring that certain instruction packets are only processed by selected CEs. As discussed in Section 3.3.3 and shown in the CE architecture of Figure 3.11, the bypass instruction allows such packets to bypass processing by a CE, while the continue program status bit can be used to ensure that the bypassed packet maintains its position in the data stream.

4.1.1 SIMPPL SoC Implementation Statistics

Table 4.1 summarizes the time required to integrate the CEs and create the SCSs for the systems shown in Figures 4.1 and 4.2. Before the SIMPPL model was defined, a novice


designer created a custom version of the video streaming application. The student found it difficult to create the proper system-level control due to the multiple clock domains and synchronization requirements. After some redesign, the modules were reused and integrated with SIMPPL controllers to create the Vid In, Vid Out, and Mem CEs (Figure 4.1), which required approximately 40 hours. However, the integration of the CEs and the design of their respective SCSs took only 4.5 hours for the SIMPPL Streaming Video SoC, which is less than 3.5% of the time required to implement the system-level integration for the custom design. The CE abstraction simplified the system-level integration by isolating the different clock domains, which greatly reduced integration time. The SIMPPL Snapshot Video system only required the addition of the Input Switch CE and minor adjustments to the SCSs previously used in the streaming video system, reducing the system integration time to 1.5 hours. Thus, not only does the SIMPPL framework reduce system integration time, but it also facilitates the reuse of CEs for new applications.

The SIMPPL MPEG-1 Video Decoder System was built by a four-person design team, and it took 18 person-hours to properly connect all the CEs and to generate the appropriate SCSs. Integrating all the MPEG-1 hardware PEs with Producer and Consumer controllers required an additional 39 person-hours, or 2.4% of the total system design time of 1607 person-hours. For complex designs, system integration can be a significant portion of the total design time; however, the SIMPPL framework limits the system integration for the MPEG-1 Video Decoder to 1.1% of the total design time. Furthermore, the CE abstraction hides the implementation details of the CE from the rest of the system so that changes to the PE do not necessitate redesign at the system level. For example, the Video Stream Parser CE is currently implemented as a software CE on a MicroBlaze, due to design time constraints. However, the fixed communication links allow it to be swapped out in favour of a hardware CE implementation in the future without any changes to the rest of the system.
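The percentages quoted in this section follow directly from the hours reported in the text and in Table 4.1. The short sketch below reproduces that arithmetic; the variable and function names are chosen here for illustration only, and the input figures are those stated above.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Hours reported in the text and in Table 4.1.
custom_streaming_integration = 140.0   # custom streaming video system
simppl_streaming_integration = 4.5     # SIMPPL streaming video system
mpeg_total_design = 1607.0             # total MPEG-1 design, person-hours
mpeg_system_integration = 18.0         # connecting CEs and writing SCSs
mpeg_controller_integration = 39.0     # wrapping PEs with controllers

streaming_share = percent(simppl_streaming_integration,
                          custom_streaming_integration)
mpeg_integration_share = percent(mpeg_system_integration, mpeg_total_design)
mpeg_controller_share = percent(mpeg_controller_integration,
                                mpeg_total_design)

if __name__ == "__main__":
    print(f"SIMPPL vs. custom streaming integration: {streaming_share:.1f}%")
    print(f"MPEG-1 system integration share: {mpeg_integration_share:.1f}%")
    print(f"MPEG-1 controller integration share: {mpeg_controller_share:.1f}%")
```

The first ratio is about 3.2%, consistent with the "less than 3.5%" stated for the streaming system; the other two round to the 1.1% and 2.4% quoted for the MPEG-1 decoder.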

4.2 CE Implementations

This section describes the implementation statistics for the controllers implemented on FPGAs and ASICs, possible different CE architectures, and the different CEs that have been created and tested to date on an FPGA for the SoCs described in Section 4.1.


Table 4.2: SIMPPL Controller implementation statistics.

Controller Type    FPGA platform                    ASIC platform–Area           ASIC platform–Speed
                   LUTs  Flipflops  Max. Freq.      Area         Max. Freq.      Area         Max. Freq.
                                    (MHz)           (10^3 um^2)  (MHz)           (10^3 um^2)  (GHz)
Consumer Execute   277   117        287             5.25         183             12.16        1.59
Producer Execute   355   125        285             5.42         184             13.16        1.56
Full Execute       346   115        283             5.49         183             13.71        1.59
Consumer Debug     1002  477        180             19.17        165             29.62        1.24
Producer Debug     955   478        199             19.24        164             28.01        1.09
Full Debug         946   478        185             19.48        166             29.59        1.09

4.2.1 Controller Implementation Statistics

Table 4.2 summarizes the area and operating frequency measurements obtained for the different types of SIMPPL controllers implemented on both FPGA and ASIC platforms. The ASIC measurements are obtained using Synopsys synthesis tools for a 90 nm standard cell process. The ASIC platform–Area values are obtained by minimizing area with the operating frequency left unconstrained, whereas the ASIC platform–Speed values are obtained by maximizing the operating frequency with the area left unconstrained. To obtain comparable operating frequency measurements on an FPGA, the Virtex4 LX 40 -12 is used since it is the highest speed grade device fabricated in a 90 nm technology available from Xilinx. The FPGA measurements are generated using Xilinx's Place and Route tool from the ISE tool suite version 7.1.4.

Column 1 lists all of the different types of debug and execute SIMPPL controllers. Although the regularity of the controller’s architecture can allow them to be autogenerated