Reconfigurable Computing. Reconfigurable Architectures. Chapter 3.2

Academic year: 2021

(1)

Reconfigurable Computing

Reconfigurable Architectures

Chapter 3.2

Prof. Dr.-Ing. Jürgen Teich

(2)
(3)

Recall:

1. Brief historical development (Estrin Fix-Plus and Rammig machine)

2. Programmable Logic

   1. PALs and PLAs

   2. CPLDs

   3. FPGAs

      1. Technology

      2. Architecture by means of an example

         1. Actel

         2. Xilinx

         3. Altera

(4)

Once again: General purpose vs Special purpose

With LUTs as function generators, FPGAs can be seen as general-purpose devices.

Like any general-purpose device, they are flexible but often inefficient.

• Flexible, because any n-variable Boolean function can be implemented using an n-input LUT.

• Inefficient, since complex functions must be implemented in many LUTs at different locations. The LUTs are connected through the routing matrix, which increases the signal delays.

• A LUT implementation is usually slower than direct wiring.
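The flexibility argument can be illustrated with a minimal sketch, assuming a LUT is modelled as nothing more than a stored truth table addressed by its inputs. The class name `LUT` and the two example functions are hypothetical and chosen only for illustration.

```python
from itertools import product


class LUT:
    """Illustrative n-input look-up table: one stored bit per input combination."""

    def __init__(self, n_inputs, func):
        self.n_inputs = n_inputs
        # "Configuration": evaluate func once for every input combination
        # and store the resulting truth table (2**n entries).
        self.table = [int(bool(func(*bits)))
                      for bits in product((0, 1), repeat=n_inputs)]

    def __call__(self, *bits):
        # Evaluation is a pure table look-up; the inputs form the address,
        # with the first input as the most significant address bit.
        address = 0
        for b in bits:
            address = (address << 1) | (b & 1)
        return self.table[address]


# Two different 3-variable functions realised by the same structure.
majority = LUT(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
xor3 = LUT(3, lambda a, b, c: a ^ b ^ c)
```

Only the stored table differs between `majority` and `xor3`, which is what makes the LUT a general-purpose function generator.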

(5)

Once again: General purpose vs Special purpose

Example: Implement the function F = ABD + ACD + ABC using 2-input LUTs.

• LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB.

• Connections inside an LB are efficient (direct).

• Connections between LBs are slow (connection matrix).

[Figure: LUTs with inputs (A, B, D), (A, C, D) and (A, B, C), linked through the connection matrix to produce F]
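A minimal sketch of one possible mapping, assuming the function reads F = ABD + ACD + ABC (the LUT inputs in the figure are A B D, A C D and A B C, but complement marks may have been lost in extraction). The decomposition below is only one of several valid ones and is not claimed to be minimal; the helper names are hypothetical.

```python
from itertools import product


def lut2(func):
    """Build a 2-input LUT as a stored truth table."""
    table = {(x, y): func(x, y) & 1 for x, y in product((0, 1), repeat=2)}
    return lambda x, y: table[(x, y)]


AND2 = lut2(lambda x, y: x & y)
OR2 = lut2(lambda x, y: x | y)


def f_reference(a, b, c, d):
    # Assumed target function: F = ABD + ACD + ABC
    return (a & b & d) | (a & c & d) | (a & b & c)


def f_mapped(a, b, c, d):
    ab = AND2(a, b)          # LUT 1
    abd = AND2(ab, d)        # LUT 2
    ac = AND2(a, c)          # LUT 3
    acd = AND2(ac, d)        # LUT 4
    abc = AND2(ab, c)        # LUT 5
    partial = OR2(abd, acd)  # LUT 6
    return OR2(partial, abc) # LUT 7


LUTS_USED = 7
LUTS_PER_LB = 2
# Ceiling division: number of logic blocks needed for this mapping.
LOGIC_BLOCKS = -(-LUTS_USED // LUTS_PER_LB)
```

With two LUTs per logic block, this mapping spreads over four logic blocks, so several intermediate signals have to cross the connection matrix.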

(6)

Once again: General purpose vs Special purpose

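Idea: Implement frequently used blocks as hard-core modules in the device.

[Figure: the LUT-based implementation with inputs (A, B, D), (A, C, D), (A, B, C) and the connection matrix, compared with a single hard-core module with inputs A, B, C, D and output F]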

(7)

Coarse grained reconfigurable devices

Coarse-grained devices overcome the inefficiency of FPGAs by providing efficiently implemented coarse-grained functional units (adders, multipliers, integrators, etc.).

• Advantage: very efficient in terms of speed (basic operators need no connections over connection matrices).

• Advantage: direct wiring instead of LUT implementation.

A coarse-grained device is usually an array of identical programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication.

Depending on the manufacturer, the functional units communicate via buses or are directly connected through programmable routing matrices.

(8)

Coarse grained reconfigurable devices

• Memory exists between and inside the PEs.

• Several other functional units are provided, depending on the manufacturer.

• A PE is usually a tiny 8-bit, 16-bit or 32-bit ALU that can be configured to execute only one operation in a given period (until the next configuration).

• Communication among the PEs is either packet-oriented (on buses) or point-to-point (using crossbar switches).

• Since each vendor has its own implementation approach, the devices are studied by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP and TCPA.
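The notion of a PE that executes exactly one operation until it is reconfigured can be sketched as follows. This is a toy model under stated assumptions: the class name, the operation set and the wrap-around behaviour on overflow are hypothetical and do not describe any particular vendor's PE.

```python
import operator


class ProcessingElement:
    """Toy coarse-grained PE: a small ALU fixed to one operation per configuration."""

    # Hypothetical operation set, for illustration only.
    OPERATIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def __init__(self, width=16):
        self.mask = (1 << width) - 1  # results wrap to the PE word width
        self.op = None

    def configure(self, op_name):
        # Reconfiguration: select the single operation the PE will execute
        # until the next call to configure().
        self.op = self.OPERATIONS[op_name]

    def execute(self, a, b):
        if self.op is None:
            raise RuntimeError("PE has not been configured")
        return self.op(a, b) & self.mask


pe = ProcessingElement(width=8)
pe.configure("add")
sum_result = pe.execute(200, 100)   # wraps to 8 bits
pe.configure("mul")
product_result = pe.execute(3, 5)
```

Between two calls to `configure`, every call to `execute` performs the same operation, which mirrors the "one operation per configuration period" behaviour described above.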

(9)

The PACT XPP – Overall structure

XPP (Extreme Processing Platform) is a hierarchical structure consisting of:

• An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs)

  – PAC (Processing Array Cluster) = Processing Array (PA) + Configuration Manager (CM)

• A hierarchical configuration tree

  – Local CMs manage the configuration at the PA level

  – The local CMs access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole ...

• A communication network

• Memory elements alongside the PACs

• A set of I/Os

(10)

The PACT XPP – The Processing Array Elements

The PAE: two types of PAE

• The ALU PAE

• The RAM PAE

The ALU PAE:

• Contains an ALU which can be configured to perform basic operations

• The Backward Register (BREG) provides routing channels for data and events from bottom to top

• The Forward Register (FREG) provides routing channels from top to bottom

(11)

The PACT XPP - The Processing Array Elements

• A dataflow register (DF-REG) can be used at the object outputs for buffering data.

• Input registers can be preloaded with configuration data.

The RAM PAE:

1. Differs from the ALU PAE only in its function: instead of an ALU, a RAM PAE contains a dual-ported RAM.

2. Useful for data storage.

3. Data is written or read once an address has been received at the RAM inputs.

4. The BREG, FREG and DF-REG of the RAM PAE have the same function as in the ALU PAE.

(12)

The PACT XPP - Routing

Routing in PACT XPP:

• Two independent networks

  – One for data transmission

  – The other for event transmission

• A configuration bus exists besides the data and event networks (very little information is available about the configuration bus)

• All objects can be connected to horizontal routing channels using switch objects

• Vertical routing channels are provided by the BREGs and FREGs

  – BREGs route from bottom to top

  – FREGs route from top to bottom

[Figure: horizontal routing channels and vertical routing channels]

(13)

The PACT XPP - Interface

• Interfaces are available inside the chip

• The number and type of interfaces vary from device to device

• On the XPP42-A1: 6 internal interfaces consisting of

  – 4 identical general-purpose on-chip I/O interfaces (bottom left, upper left, upper right and bottom right)

  – One configuration manager interface

  – One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes (not shown in the picture)

(14)

The PACT XPP - Interface

The I/O interfaces can operate independently of each other. There are two operation modes:

• The RAM mode

• The streaming mode

RAM mode:

• Each port can access external static RAM (SRAM).

• Control signals for the SRAM transactions are available.

(15)

The PACT XPP - Interface

Streaming mode:

1. Used for high-speed streaming of data to and from the device

2. Each I/O element provides two bidirectional ports for data streaming

3. Handshake signals are used to synchronize data packets with the external port

(16)

The Quicksilver ACM - Architecture

Structure: fractal-like

• Hierarchical groups of four nodes with full communication among the nodes

• 4 lower-level nodes are grouped in a higher-level node

• The lowest level consists of 4 heterogeneous processing nodes

• The nodes are connected through a Matrix Interconnect Network (MIN)

• A system controller
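The fractal grouping can be made concrete with a small sketch, assuming (purely for illustration) that processing nodes are numbered consecutively so that nodes 0 to 3 form the first group of four, groups 0 to 3 form the next level, and so on. The function name and the numbering scheme are hypothetical and are not taken from the Quicksilver documentation.

```python
def common_level(node_a, node_b, fanout=4):
    """Return the lowest hierarchy level whose group contains both nodes.

    Level 0 means the two identifiers denote the same node, level 1 the same
    group of four nodes, level 2 the same group of four groups, and so on.
    """
    level = 0
    while node_a != node_b:
        node_a //= fanout  # move up to the enclosing group
        node_b //= fanout
        level += 1
    return level


same_group = common_level(0, 3)     # both in the first group of four
next_level = common_level(0, 4)     # neighbouring groups of four
two_up = common_level(0, 16)        # one level higher still
```

Under this assumed numbering, the returned level indicates how far up the hierarchy a transfer between two nodes has to travel.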

(17)

The Quicksilver ACM – The processing node

An ACM processing node consists of:

• An algorithmic engine. It is unique to each node type and defines the operation performed by the node.

• The node memory, for data storage at the node level.

• A node wrapper, which is common to all nodes. It is used to hide the complexity of the ...

(18)

The Quicksilver ACM – The processing node

Four types of nodes exist:

• The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers

• The Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations

• The Domain Bit Manipulation (DBM) node provides bit manipulation and byte-oriented operations

• The External Memory Controller node provides DDR RAM, SRAM, random memory access and DMA

(19)

The Quicksilver ACM – The processing node

[Figures: ACM DBM node; ACM AXN node]

(20)

The Quicksilver ACM – The processing node

The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features:

1. A MIN interface to support communication among nodes via the MIN

2. A hardware task manager for task management at the node level

3. A DMA engine

4. Dedicated I/O circuitry

5. Memory controllers

6. Data distributors and aggregators

(21)

The Quicksilver ACM - The MIN

The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip.

1. Hierarchically organized: the MIN at a given level connects many lower-level MINs

2. The MIN root is used for:

   1. Off-chip communication

   2. Configuration

3. Supports the communication among nodes

4. Provides services such as point-to-point dataflow streaming, real-time broadcasting, DMA, etc.

[Figure: example of an ACM chip configuration]

(22)

The Quicksilver ACM - The System Controller

The system controller is in charge of the system management:

• Loads tasks into the nodes' ready-to-run queues for execution

• Statically or dynamically sets the communication channels between the processing nodes

• Carries out the reconfiguration of nodes on a clock-cycle-by-clock-cycle basis

The ACM chip features a set of I/O interface controllers, such as:

• PCI

• PLL

• SDRAM and SRAM

[Figure: the system controller]

(23)

The NEC DRP – Architecture

The NEC Dynamically Reconfigurable Processor (DRP) consists of:

• A set of byte-oriented processing elements (PEs)

• A programmable interconnection network for communication among the PEs

• A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process

• Memory around the device for storing configuration and computation data

(24)

The NEC DRP - The Processing Element

• ALU: ordinary byte arithmetic/logic operations

• DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations

• An instruction dictates the ALU/DMU operations and the inter-PE connections

• Source/destination operands can be from/to either

  – the PE's own register file

  – other PEs (i.e., flow-through)

• The instruction pointer (IP) is provided by the STC (state transition controller)

(25)

The NEC DRP – Reconfiguration Process

• The instruction pointer (IP) from the STC identifies a datapath plane

• Spatial computation uses a customized datapath plane

• When the IP changes, the datapath plane switches instantaneously

• The PE instructions, taken as a collection, behave like an extreme VLIW

• Sequencing through instructions => dynamic reconfiguration

[Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out; control selects the task by descriptor, and dynamic reconfiguration switches between the planes]

(26)

The NEC DRP – Reconfiguration Process

[Figure: PE array in which each PE holds ALU and DMU instructions 0, 1, 2; with IP = "1", instruction 1 is selected in every PE, yielding a datapath plane of Add, Sel and Cmp operations. Steps 1 to 4 are marked in the figure.]

1. Identify the instruction to be executed

2. Decode the instruction in the ALU plane

3. Configure the ALU plane according to the instruction
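The way a single instruction pointer selects a whole datapath plane can be sketched in a few lines. This is a toy model under stated assumptions: the two PEs, their instruction memories and the state transition table are hypothetical, and real DRP instructions also encode DMU operations and inter-PE connections, which are omitted here.

```python
import operator

# Hypothetical per-PE instruction memories: entry i is the operation the PE
# performs when the instruction pointer (IP) equals i.
PE_INSTRUCTIONS = [
    [operator.add, operator.sub, operator.and_],  # PE 0
    [operator.mul, operator.add, operator.or_],   # PE 1
]


def datapath_plane(ip):
    """All PEs look up the same IP, so one IP value selects one plane."""
    return [instructions[ip] for instructions in PE_INSTRUCTIONS]


def run(stc_transitions, start_ip, steps, a, b):
    """Step a toy STC (modelled as a next-IP table) and apply each plane to (a, b)."""
    ip = start_ip
    trace = []
    for _ in range(steps):
        plane = datapath_plane(ip)
        trace.append((ip, [op(a, b) for op in plane]))
        ip = stc_transitions[ip]  # a change of IP switches the plane
    return trace


trace = run({0: 1, 1: 2, 2: 0}, start_ip=0, steps=3, a=6, b=3)
```

Each entry of `trace` pairs an IP value with the results of all PEs under that plane, so sequencing through IP values corresponds to dynamic reconfiguration of the array.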

(27)

Tightly-Coupled Processor Arrays (TCPA)

• Processor elements (PEs) with VLIW (very long instruction word) architecture

• Weakly programmable

  – Small local instruction memory

  – Limited, parametrizable instruction set focused on digital signal processing

• Data-flow-oriented control path, no global address space, data streaming over the processing field

• Regular interconnect network

• Application areas: digital signal processing, e.g., mobile communication, HDTV, multimedia, ...

(28)
(29)

• Basic structure: grid

• Dynamically reconfigurable

• By using a bypass, more than one hop is possible in a single clock cycle

• The interconnect wrapper is responsible for switching

(30)
(31)
(32)

• Multicast scheme for partial dynamic reconfiguration

• Differential reconfiguration (program/connections) also possible

(33)

24-Core TCPA – Lehrstuhl für Informatik 12

• 24 × 16-bit cores

• Technology

  – CMOS 1.0 V

  – 9 metal layers

  – 90 nm standard cell layout

• FUs/PE

  – 2×Add, 2×Mul, 1×Shift, 1×DPU

• Registers/PE: 15

• Instruction memory

  – 1024×32 = 4 kB

• Clock frequency: 200 MHz

• Peak performance: 24 GOPS

• Power consumption: 133 mW @ 200 MHz (hybrid clock gating)

• Power efficiency: 180 MOPS/mW
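The derived figures on this slide can be cross-checked from the listed base values. The short calculation below uses only numbers stated on the slide; the variable names are hypothetical, and the 1024×32 instruction memory is read as 1024 words of 32 bits.

```python
# Base values as listed on the slide.
cores = 24
clock_hz = 200e6
peak_ops_per_s = 24e9          # 24 GOPS
power_mw = 133
imem_words, imem_word_bits = 1024, 32

# Power efficiency in MOPS/mW.
efficiency_mops_per_mw = (peak_ops_per_s / 1e6) / power_mw

# Instruction memory size in bytes (1024 words x 32 bits).
imem_bytes = imem_words * imem_word_bits // 8

# Operations per clock cycle and per core implied by the peak performance.
ops_per_cycle_per_core = peak_ops_per_s / (clock_hz * cores)
```

The quotient of about 180.5 MOPS/mW matches the stated 180 MOPS/mW, the instruction memory comes to 4096 bytes (4 kB), and the peak figure corresponds to five operations per cycle per core.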
