Reconfigurable Computing. Reconfigurable Architectures. Chapter 3.2

(1)

Prof. Dr.-Ing. Jürgen Teich

(3)

Recall:

1. Brief historical development (Estrin Fix-Plus and Rammig machine)
2. Programmable logic
   1. PALs and PLAs
   2. CPLDs
   3. FPGAs
      1. Technology
      2. Architecture by means of examples
         1. Actel
         2. Xilinx
         3. Altera

(4)

Once again: General purpose vs Special purpose



• With LUTs as function generators, FPGAs can be seen as general-purpose devices.

• Like any general-purpose device, they are flexible but often inefficient.

• Flexible, because any n-variable Boolean function can be implemented using an n-input LUT.

• Inefficient, since complex functions must be implemented in many LUTs at different locations. The LUTs are connected through the routing matrix, which increases the signal delays.

• A LUT implementation is usually slower than direct wiring.
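The flexibility claim can be made concrete with a small software model. The sketch below is only an illustration, not a description of any vendor's LUT: the class name `LUT` and the choice of the first input as the most significant address bit are assumptions. It shows that an n-input LUT is a table of 2^n stored bits, so "configuring" it with a different table yields a different Boolean function.

```python
from itertools import product


class LUT:
    """n-input look-up table: one stored output bit per input combination."""

    def __init__(self, n_inputs, func):
        self.n_inputs = n_inputs
        # "Configuration": evaluate func once for all 2**n input combinations.
        self.table = [int(bool(func(*bits)))
                      for bits in product((0, 1), repeat=n_inputs)]

    def __call__(self, *bits):
        # The inputs form the address into the table (first input = MSB).
        address = 0
        for bit in bits:
            address = (address << 1) | (bit & 1)
        return self.table[address]


# Two different functions on the same kind of 3-input LUT.
xor3 = LUT(3, lambda a, b, c: a ^ b ^ c)
majority = LUT(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Only the table contents differ between `xor3` and `majority`; the lookup hardware modelled by `__call__` is the same.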

(5)

Once again: General purpose vs Special purpose

Example: Implement the function F = ABD + ACD + ABC using 2-input LUTs.

• LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB

• Connections inside an LB are efficient (direct)

• Connections between LBs are slow (connection matrix)

[Figure: the product terms ABD, ACD and ABC are computed in separate LBs and combined into F through the connection matrix]
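The cost of such a mapping can be sketched in a few lines. The example assumes the function is F = ABD + ACD + ABC (the product terms that label the figure) and uses a hypothetical netlist format and LB assignment chosen for illustration; it is not the output of a real mapping tool.

```python
from itertools import product

AND2 = (0, 0, 0, 1)  # truth table of a 2-input LUT configured as AND
OR2 = (0, 1, 1, 1)   # truth table of a 2-input LUT configured as OR

# (logic block, output signal, LUT table, input 1, input 2); two LUTs per LB.
NETLIST = [
    (1, "ab",  AND2, "A",   "B"),
    (1, "abd", AND2, "ab",  "D"),
    (2, "ac",  AND2, "A",   "C"),
    (2, "acd", AND2, "ac",  "D"),
    (3, "ab2", AND2, "A",   "B"),
    (3, "abc", AND2, "ab2", "C"),
    (4, "or1", OR2,  "abd", "acd"),
    (4, "F",   OR2,  "or1", "abc"),
]


def evaluate(netlist, inputs):
    """Evaluate the LUT netlist in order for one assignment of A, B, C, D."""
    signals = dict(inputs)
    for _lb, out, table, x, y in netlist:
        signals[out] = table[(signals[x] << 1) | signals[y]]
    return signals["F"]


# LUT outputs that are consumed in a different LB must cross the connection matrix.
producer_lb = {out: lb for lb, out, _t, _x, _y in NETLIST}
crossing = {x for lb, _o, _t, a, b in NETLIST for x in (a, b)
            if x in producer_lb and producer_lb[x] != lb}

num_luts = len(NETLIST)
num_lbs = len({lb for lb, *_ in NETLIST})
```

Under these assumptions the function needs 8 LUTs in 4 LBs, and three intermediate signals travel over the slow connection matrix before F is available.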

(6)

Once again: General purpose vs Special purpose

Idea: Implement frequently used blocks as hard-core modules in the device.

[Figure: the LUT-based implementation of F over the connection matrix, next to a single hard-core block with inputs A, B, C, D and output F]

(7)

Coarse grained reconfigurable devices



• Overcome the inefficiency of FPGAs by providing efficiently implemented coarse-grained functional units (adders, multipliers, integrators, etc.)

• Advantage: Very efficient in terms of speed (no need for connections over connection matrices for basic operators)

• Advantage: Direct wiring instead of LUT implementation

• A coarse-grained device is usually an array of identical, programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication.

• Depending on the manufacturer, the functional units communicate via buses or can be directly connected using programmable routing matrices.

(8)

Coarse grained reconfigurable devices



• Memory exists between and inside the PEs.

• Several other functional units may be present, depending on the manufacturer.

• A PE is usually a tiny 8-bit, 16-bit or 32-bit ALU which can be configured to execute only one operation in a given period (until the next configuration).

• Communication among the PEs can be either packet oriented (on buses) or point-to-point (using crossbar switches).

• Since each vendor has its own implementation approach, the devices are studied by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, and TCPA.
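The idea of a PE that executes exactly one operation until it is reconfigured can be sketched as follows. The class `ProcessingElement`, its operation names and the wrap-around behaviour are assumptions made for illustration and do not model any of the devices listed above.

```python
import operator


class ProcessingElement:
    """Tiny ALU that performs one configured operation until reconfigured."""

    OPERATIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def __init__(self, width=16):
        self.mask = (1 << width) - 1  # results wrap to the PE's word width
        self.op = None

    def configure(self, op_name):
        # Reconfiguration: select the single operation for the next period.
        self.op = self.OPERATIONS[op_name]

    def execute(self, a, b):
        if self.op is None:
            raise RuntimeError("PE is not configured")
        return self.op(a, b) & self.mask


pe = ProcessingElement(width=8)
pe.configure("add")
sums = [pe.execute(a, 1) for a in (1, 2, 255)]
pe.configure("mul")
products = [pe.execute(a, 3) for a in (1, 2, 100)]
```

Between the two `configure` calls the PE behaves like a fixed adder; afterwards it behaves like a fixed multiplier, which is the sense in which the device is coarse grained.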

(9)

The PACT XPP – Overall structure

XPP (eXtreme Processing Platform) is a hierarchical structure consisting of:

• An array of Processing Array Elements (PAEs), grouped in clusters called Processing Arrays (PAs)

• Processing Array Cluster (PAC) = Processing Array (PA) + Configuration Manager (CM)

• A hierarchical configuration tree
  – Local CMs manage the configuration at the PA level
  – The local CMs access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole ...

(10)

The PACT XPP – The Processing Array Elements

• A communication network

• Memory elements beside the PACs

• A set of I/Os

The PAE: two types of PAE
• The ALU-PAE
• The RAM-PAE

The ALU-PAE:
• Contains an ALU which can be configured to perform basic operations
• The Back Register (BREG) provides routing channels for data and events from bottom to top
• The Forward Register (FREG) provides routing channels from top to bottom

(11)

The PACT XPP - The Processing Array Elements

• The DataFlow Register (DF-REG) can be used at the object outputs for buffering data
• Input registers can be preloaded with configuration data.

The RAM-PAE:

1. Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM.
2. Useful for data storage
3. Data is written or read after an address has been read at the RAM inputs
4. The BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE

(12)

The PACT XPP - Routing



Routing in PACT XPP:

• Two independent networks
  – One for data transmission
  – The other for event transmission
• A configuration bus exists besides the data and event networks (very little information exists about the configuration bus)
• All objects can be connected to horizontal routing channels using switch objects
• Vertical routing channels are provided by the BREGs and FREGs
  – BREGs route from bottom to top
  – FREGs route from top to bottom

[Figure: horizontal routing channels and vertical routing channels]

(13)

The PACT XPP - Interface



• Interfaces are available inside the chip

• The number and type of interfaces vary from device to device

• On the XPP42-A1: 6 internal interfaces consisting of:
  – 4 identical general-purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right)
  – One configuration manager
  – One JTAG (Joint Test Action Group, "IEEE Standard 1149.1") boundary-scan interface for testing purposes (not shown in the picture)

(14)

The PACT XPP - Interface



The I/O interfaces can operate independently of each other. Two operation modes are available:
• The RAM mode
• The streaming mode

RAM mode:
• Each port can access external static RAM (SRAM).
• Control signals for the SRAM transactions are available.

(15)

The PACT XPP - Interface

Streaming mode:

1. For high-speed streaming of data to and from the device
2. Each I/O element provides two bidirectional ports for data streaming
3. Handshake signals are used to synchronize data packets with the external port
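Handshake-based synchronization can be illustrated with a generic valid/ready scheme. The slides do not specify the XPP handshake signals, so the signal names `valid` and `ready` and the one-packet-per-cycle rule below are assumptions; the sketch only shows the general principle that a packet moves when both sides agree.

```python
def stream(packets, receiver_ready):
    """Simulate one cycle per entry of receiver_ready.

    A packet is transferred in a cycle only if the sender has data
    (valid) and the receiver signals that it can accept it (ready).
    """
    received = []
    pending = list(packets)
    for ready in receiver_ready:
        valid = bool(pending)
        if valid and ready:
            received.append(pending.pop(0))
    return received, pending


# The receiver stalls in the second cycle; no packet is lost or duplicated.
done, left = stream([10, 20, 30], [True, False, True, True])
partial, waiting = stream([10, 20, 30], [True, False])
```

The stalled cycle simply delays the transfer, which is what allows both sides of the port to run at their own pace.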

(16)

The Quicksilver ACM - Architecture

Structure: fractal-like structure

• Hierarchical groups of four nodes with full communication among the nodes
• 4 lower-level nodes are grouped in a higher-level node
• The lowest level consists of 4 heterogeneous processing nodes
• The connection is done in a Matrix Interconnect Network (MIN)
• A system controller

(17)

The Quicksilver ACM – The processing node

An ACM processing node consists of:

• An algorithmic engine. It is unique to each node type and defines the operation to be performed by the node.
• The node memory for data storage at the node level.
• A node wrapper, which is common to all nodes. It is used to hide the complexity of the ...

(18)

The Quicksilver ACM – The processing node

Four types of nodes exist:

• The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers
• The Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations
• The Domain Bit Manipulation (DBM) node provides bit manipulation and byte-oriented operations
• The External Memory Controller node provides DDR RAM, SRAM, random memory access, and DMA

(19)

The Quicksilver ACM – The processing node

ACM DBM-Node

ACM AXN-Node

(20)

The Quicksilver ACM – The processing node

The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features:

1. A MIN interface to support the communication among nodes via the MIN
2. A hardware task manager for task management at the node level
3. A DMA engine
4. Dedicated I/O circuitry
5. Memory controllers
6. Data distributors and aggregators

(21)

The Quicksilver ACM - The MIN

The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip.

1. Hierarchically organized. The MIN at a given level connects many lower-level MINs
2. The MIN root is used for:
   1. Off-chip communication
   2. Configuration
3. Supports the communication among nodes
4. Provides services like point-to-point dataflow streaming, real-time broadcasting, DMA, etc.
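The hierarchical organization suggests that traffic between two nodes only has to climb as far as the lowest MIN level that contains both of them. The sketch below assumes a hypothetical numbering in which nodes are grouped by four at every level (node `i` belongs to group `i // 4`, and so on upwards); the real ACM addressing scheme is not given in the slides.

```python
def min_levels_to_climb(src, dst):
    """Number of MIN levels a message climbs before it can descend to dst.

    Assumes groups of four at every level: integer division by 4 maps a
    node (or group) to its parent group.
    """
    levels = 0
    while src != dst:
        src //= 4  # parent group of the source side
        dst //= 4  # parent group of the destination side
        levels += 1
    return levels


same_node = min_levels_to_climb(5, 5)
same_quad = min_levels_to_climb(0, 3)
neighbour_quad = min_levels_to_climb(0, 4)
far_apart = min_levels_to_climb(0, 63)
```

Nodes in the same group of four communicate through the lowest-level MIN only, while distant nodes involve higher levels, up to the MIN root in the worst case.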

Example of ACM chip configuration

(22)

The Quicksilver ACM - The System Controller

The system controller is in charge of the system management:

• Loads tasks into the nodes' ready-to-run queues for execution
• Statically or dynamically sets the communication channels between the processing nodes
• Carries out the reconfiguration of nodes on a clock-cycle-by-clock-cycle basis

The ACM chip features a set of I/O interface controllers such as:

• PCI
• PLL
• SDRAM and SRAM

The system controller

(23)

The NEC DRP – Architecture

The NEC Dynamically Reconfigurable Processor (DRP) consists of:

• A set of byte-oriented processing elements (PEs)
• A programmable interconnection network for communication among the PEs
• A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process
• Memory around the device for storing configuration and computation data

(24)

The NEC DRP - The Processing Element

• ALU: ordinary byte arithmetic/logic operations
• DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations
• An instruction dictates the ALU/DMU operations and the inter-PE connections
• Source/destination operands can be from/to either
  – the PE's own register file
  – other PEs (i.e., flow through)
• The instruction pointer (IP) is provided by the STC (state transition controller)

(25)

The NEC DRP – Reconfiguration Process

• The instruction pointer (IP) from the STC identifies a datapath plane
• Spatial computation using a customized datapath plane
• When the IP changes, the datapath plane switches instantaneously
• The PE instructions as a collection behave like an extreme VLIW
• Sequencing through instructions => dynamic reconfiguration

[Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out; the control selects the task by descriptor through dynamic reconfiguration]

(26)

The NEC DRP – Reconfiguration Process

[Figure: PE array whose PEs are configured with Add, Sel and Cmp operations; each PE holds an ALU, a DMU and instruction entries 0, 1, 2; IP = "1" selects entry 1; callouts 1 to 4 mark the steps, and callout 4 is labelled "+"]

1. Identify the instruction to be executed
2. Decode the instruction in the ALU plane
3. Configure the ALU plane according to the instruction
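The plane-switching mechanism can be mimicked in software: every PE stores several instructions, and the instruction pointer supplied by the state transition controller selects one of them for all PEs at once. The three-PE array, the operations per plane and the two-state FSM below are hypothetical and serve only to show the principle.

```python
import operator

# Each datapath plane assigns one operation to every PE (hypothetical 3-PE array).
PLANES = [
    {"pe0": operator.add, "pe1": operator.sub, "pe2": max},  # plane 0
    {"pe0": operator.mul, "pe1": operator.add, "pe2": min},  # plane 1
]

# State transition controller as an FSM: state -> (instruction pointer, next state).
STC = {"s0": (0, "s1"), "s1": (1, "s0")}


def run(cycles, a, b, state="s0"):
    """Apply the plane selected by the IP to operands a and b in every cycle."""
    trace = []
    for _ in range(cycles):
        ip, next_state = STC[state]
        plane = PLANES[ip]  # switching planes only changes an index
        trace.append((ip, {pe: op(a, b) for pe, op in plane.items()}))
        state = next_state
    return trace


trace = run(2, 6, 3)
```

Because all planes are already stored in the PEs, changing the IP replaces the whole datapath from one cycle to the next, which is why the collection of PE instructions can be viewed as one very wide instruction word.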

(27)

Tightly-Coupled Processor Arrays (TCPA)

• Processor elements (PEs) with VLIW (very long instruction word) architecture

• Weakly programmable
  – Small local instruction memory
  – Limited, parameterizable instruction set focused on digital signal processing

• Data-flow-oriented control path, no global address space, data streaming over the processing field

• Regular interconnect network

• Application areas: digital signal processing, e.g., mobile communication, HDTV, multimedia, ...

(29)

• Basic structure: grid

• Dynamically reconfigurable

• By using a bypass, more than one hop is possible in a single clock cycle

• The interconnect wrapper is responsible for switching

(32)

• Multicast scheme for partial dynamic reconfiguration

• Differential reconfiguration (program/connections) also possible

(33)

24 Core TCPA – Lehrstuhl für Informatik 12

• 24 × 16-bit cores
• Technology
  – CMOS, 1.0 V
  – 9 metal layers
  – 90 nm standard cell layout
• FUs/PE
  – 2× Add, 2× Mul
  – 1× Shift, 1× DPU
• Registers/PE: 15
• Instruction memory
  – 1024 × 32 bit = 4 kB
• Clock frequency: 200 MHz
• Peak performance: 24 GOPS
• Power consumption
  – 133 mW @ 200 MHz (hybrid clock gating)
• Power efficiency: 180 MOPS/mW
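The figures above can be cross-checked with a few lines of arithmetic. One assumption is needed: the peak of 24 GOPS matches five operations per core and cycle (for instance 2 Add, 2 Mul and 1 Shift, with the DPU not counted); the slide does not state which units enter the peak figure.

```python
cores = 24
clock_hz = 200e6
ops_per_core_per_cycle = 5  # assumption: 2 Add + 2 Mul + 1 Shift, DPU not counted

# Peak performance in giga-operations per second.
peak_gops = cores * ops_per_core_per_cycle * clock_hz / 1e9

# Power efficiency in MOPS per mW at the reported 133 mW.
power_mw = 133
efficiency_mops_per_mw = peak_gops * 1000 / power_mw

# Instruction memory size: 1024 words of 32 bits, expressed in bytes.
imem_bytes = 1024 * 32 // 8

print(peak_gops, round(efficiency_mops_per_mw, 1), imem_bytes)
```

Under that assumption the computed values agree with the slide: 24 GOPS, about 180 MOPS/mW, and 4096 bytes (4 kB) of instruction memory.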