Reconfigurable Computing
Reconfigurable Architectures
Chapter 3.2
Prof. Dr.-Ing. Jürgen Teich
Recall:
1. Brief historical development (Estrin Fix-Plus and Rammig machine)
2. Programmable logic
   1. PALs and PLAs
   2. CPLDs
3. FPGAs
   1. Technology
   2. Architecture by means of an example
      1. Actel
      2. Xilinx
      3. Altera
Once again: General purpose vs Special purpose
With LUTs as function generators, FPGAs can be seen as general purpose devices.
Like any general purpose device, they are flexible but often inefficient.
Flexible because any n-variable Boolean function can be implemented using an n-input LUT.
Inefficient since complex functions must be implemented in many LUTs at different locations. The LUTs are connected through the routing matrix, which increases the signal delays.
A LUT implementation is usually slower than direct wiring.
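The flexibility claim can be illustrated with a minimal sketch: an n-input LUT is modeled as a 2^n-entry truth table, and its inputs act as the read address. The names `make_lut` and `lut_read`, and the majority function chosen as an example, are illustrative assumptions and do not come from any vendor tool.

```python
from itertools import product


def make_lut(func, n):
    """Return the 2**n-entry truth table (the LUT 'configuration') of an n-input function."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n)]


def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form the address of the stored bit (first input = MSB)."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return table[address]


# Configure one 3-input LUT as a majority function (an arbitrary example).
majority = make_lut(lambda a, b, c: (a + b + c) >= 2, 3)
print(majority)                     # stored configuration bits
print(lut_read(majority, 1, 0, 1))  # look up one input combination
```

Any other 3-variable function would only change the stored table, not the lookup logic, which is what makes the LUT a general purpose function generator.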
Example:
Implement the function F = ABD + ACD + ABC using 2-input LUTs.
LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB.
Connections inside an LB are efficient (direct).
Connections between LBs are slow (they pass through the connection matrix).
[Figure: logic blocks with inputs (A, B, D), (A, C, D), and (A, B, C), joined through the connection matrix to produce F]
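To make the cost of the example concrete, here is a small sketch that builds the function from 2-input LUTs only and checks it against the direct expression. It assumes the function reads F = ABD + ACD + ABC (the third product term is taken from the figure inputs A, B, C); the decomposition shown is one possible mapping, not necessarily the one in the figure, and `lut2`, `f_with_lut2`, and `f_reference` are hypothetical names.

```python
from itertools import product

AND2 = [0, 0, 0, 1]  # truth table of a 2-input LUT configured as AND
OR2 = [0, 1, 1, 1]   # truth table of a 2-input LUT configured as OR


def lut2(table, x, y):
    """Read a 2-input LUT; (x, y) is the address into its 4-entry table."""
    return table[(x << 1) | y]


def f_with_lut2(a, b, c, d):
    """F built only from 2-input LUTs; each lut2 call stands for one LUT."""
    ab = lut2(AND2, a, b)
    ac = lut2(AND2, a, c)
    abd = lut2(AND2, ab, d)
    acd = lut2(AND2, ac, d)
    abc = lut2(AND2, ab, c)
    partial = lut2(OR2, abd, acd)
    return lut2(OR2, partial, abc)


def f_reference(a, b, c, d):
    """Direct form of the assumed function F = ABD + ACD + ABC."""
    return (a & b & d) | (a & c & d) | (a & b & c)


LUT_COUNT = 7  # number of lut2 calls in f_with_lut2
matches = all(
    f_with_lut2(*v) == f_reference(*v) for v in product((0, 1), repeat=4)
)
print(matches, LUT_COUNT)
```

With two LUTs per logic block, the seven LUTs of this mapping occupy four logic blocks, so several of the intermediate signals have to cross the connection matrix.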
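Idea: Implement frequently used blocks as hard-core modules in the device
[Figure: the LUT-based implementation (inputs A B D, A C D, A B C routed over the connection matrix to F) next to a single hard-core block with inputs A, B, C, D and output F]
Coarse grained reconfigurable devices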
Overcome the inefficiency of FPGAs by providing efficiently implemented coarse grained functional units (adders, multipliers, integrators, etc.)
Advantage: Very efficient in terms of speed (basic operators need no connections over connection matrices)
Advantage: Direct wiring instead of LUT implementation
A coarse grained device is usually an array of identical programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication.
Depending on the manufacturer, the functional units communicate via buses or are directly connected through programmable routing matrices.
Coarse grained reconfigurable devices
Memory exists between and inside the PEs.
Several other functional units are available, depending on the manufacturer.
A PE is usually a tiny 8-bit, 16-bit or 32-bit ALU which can be configured to execute only one operation in a given period (until the next configuration).
Communication among the PEs can be either packet oriented (on buses) or point-to-point (using crossbar switches).
Since each vendor has its own implementation approach, the devices are studied by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, and TCPA.
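The behavior of a PE that executes a single operation until its next configuration can be sketched as follows. This is a toy model under stated assumptions: the class `ProcessingElement`, its operation names, and the wrap-around to the word width are illustrative choices and do not describe any of the devices listed above.

```python
import operator


class ProcessingElement:
    """Toy coarse grained PE: a small ALU fixed to one operation per configuration."""

    OPERATIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def __init__(self, width=16):
        self.mask = (1 << width) - 1  # results wrap to the PE's word width
        self.op = None

    def configure(self, op_name):
        """Select the single operation used until the next configure() call."""
        self.op = self.OPERATIONS[op_name]

    def execute(self, x, y):
        if self.op is None:
            raise RuntimeError("PE has not been configured")
        return self.op(x, y) & self.mask


pe = ProcessingElement(width=8)
pe.configure("add")
sums = [pe.execute(x, 10) for x in (1, 2, 250)]      # 250 + 10 wraps in 8 bits
pe.configure("mul")                                   # reconfiguration
products = [pe.execute(x, 3) for x in (1, 2, 100)]   # 100 * 3 wraps in 8 bits
print(sums, products)
```

The point of the sketch is that the operation is chosen by configuration, not by the data stream: every operand pair between two `configure` calls goes through the same operator.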
The PACT XPP – Overall structure
XPP (Extreme Processing Platform) is a hierarchical structure consisting of:
An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs)
Processing Array Cluster (PAC) = Processing Array (PA) + Configuration Manager (CM)
A hierarchical configuration tree
Local CMs manage the configuration at the PA level
The local CMs access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole configuration process
A communication network
Memory elements beside the PACs
A set of I/Os
The PACT XPP – The Processing Array Elements
The PAE: two types of PAE exist, the ALU PAE and the RAM PAE
The ALU PAE:
Contains an ALU which can be configured to perform basic operations
The Back Register (BREG) provides routing channels for data and events from bottom to top
The Forward Register (FREG) provides routing channels from top to bottom
The DataFlow Register (DF-REG) can be used at the object outputs for buffering data
Input registers can be preloaded with configuration data.
The RAM PAE:
1. Differs from the ALU PAE only in its function: instead of an ALU, a RAM PAE contains a dual-ported RAM.
2. Useful for data storage
3. Data is written or read after an address has been received at the RAM inputs
4. The BREG, FREG, and DF-REG of the RAM PAE have the same function as in the ALU PAE
The PACT XPP - Routing
Routing in PACT XPP:
Two independent networks
One for data transmission
The other for event transmission
A configuration bus exists in addition to the data and event networks (very little information about the configuration bus is available)
All objects can be connected to horizontal routing channels using switch objects
Vertical routing channels are provided by the BREGs and FREGs
BREGs route from bottom to top
FREGs route from top to bottom
[Figure: horizontal and vertical routing channels]
The PACT XPP - Interface
Interfaces are available inside the chip
The number and type of interfaces vary from device to device
On the XPP42-A1:
6 internal interfaces consisting of:
4 identical general purpose on-chip I/O interfaces (bottom left, upper left, upper right, and bottom right)
One configuration manager interface
One JTAG (Joint Test Action Group, "IEEE Standard 1149.1") boundary scan interface for testing purposes (not shown in the picture)
The I/O interfaces can operate independently of each other, in one of two operation modes:
The RAM mode
The streaming mode
RAM mode:
Each port can access external static RAM (SRAM).
Control signals for the SRAM transactions are available.
Streaming mode:
1. For high speed streaming of data to and from the device
2. Each I/O element provides two bidirectional ports for data streaming
3. Handshake signals are used to synchronize data packets with the external port
The Quicksilver ACM - Architecture
Structure: Fractal-like structure
Hierarchical groups of four nodes with full communication among the nodes
4 lower level nodes are grouped in a higher level node
The lowest level consists of 4 heterogeneous processing nodes
The connection is done in a Matrix Interconnect Network (MIN)
A system controller
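The fractal-like grouping of four can be made concrete with a little arithmetic. The sketch below assumes that nodes are numbered consecutively so that each group of four shares an index prefix; the functions `nodes_at_level` and `common_level` are hypothetical helpers for illustration, not part of any Quicksilver tooling.

```python
def nodes_at_level(level):
    """Number of processing nodes contained in one cluster of the given level."""
    return 4 ** level


def common_level(a, b):
    """Lowest hierarchy level whose cluster contains both node a and node b.

    Level 0 means a and b are the same node; level 1 means they sit in the
    same group of four; each further level groups four lower-level clusters.
    """
    level = 0
    while a != b:
        a //= 4
        b //= 4
        level += 1
    return level


print(nodes_at_level(3))    # nodes in a three-level hierarchy
print(common_level(0, 3))   # neighbours inside one group of four
print(common_level(0, 4))   # nodes in adjacent groups
```

Under this numbering assumption, the returned level indicates how far up the hierarchy two nodes have to go before they share a cluster.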
The Quicksilver ACM – The processing node
An ACM processing node consists of:
An algorithmic engine. It is unique to each node type and defines the operation performed by the node.
The node memory for data storage at the node level.
A node wrapper which is common to all nodes. It is used to hide the complexity of the heterogeneous architecture.
Four types of nodes exist:
The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general purpose registers
The Adaptive Execution Node (AXN) provides variable size MAC and ALU operations
The Domain Bit Manipulation (DBM) node provides bit manipulation and byte oriented operations
The External Memory Controller node provides DDR RAM, SRAM, random memory access, and DMA
[Figures: the ACM DBM node and the ACM AXN node]
The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features:
1. A MIN interface to support the communication among nodes via the MIN network
2. A hardware task manager for task management at the node level
3. A DMA engine
4. Dedicated I/O circuitry
5. Memory controllers
6. Data distributors and aggregators
The Quicksilver ACM - The MIN
The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip
1. Hierarchically organized. The MIN at a given level connects many lower-level MINs
2. The MIN root is used for:
   1. Off-chip communication
   2. Configuration
3. Supports the communication among nodes
4. Provides services like point-to-point dataflow streaming, real-time broadcasting, DMA, etc.
[Figure: example of an ACM chip configuration]
The Quicksilver ACM - The System Controller
The system controller is in charge of the system management:
Loads tasks into the nodes' ready-to-run queues for execution
Statically or dynamically sets the communication channels between the processing nodes
Carries out the reconfiguration of nodes on a clock cycle-by-clock cycle basis
The ACM chip features a set of I/O interface controllers like:
PCI
PLL
SDRAM and SRAM
[Figure: the system controller]
The NEC DRP – Architecture
The NEC Dynamically Reconfigurable Processor (DRP) consists of:
A set of byte oriented processing elements (PEs)
A programmable interconnection network for communication among the PEs
A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process
Memory around the device for storing configuration and computation data
The NEC DRP - The Processing Element
ALU: ordinary byte arithmetic/logic operations
DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations
An instruction dictates ALU/DMU operations and inter-PE connections
Source/destination operands can be from/to either
its own register file
other PEs (i.e., flow through)
The instruction pointer (IP) is provided by the STC (state transition controller)
The NEC DRP – Reconfiguration Process
The instruction pointer (IP) from the STC identifies a datapath plane
Spatial computation using a customized datapath plane
When the IP changes, the datapath plane switches instantaneously
The PE instructions, taken as a collection, behave like an extreme VLIW
Sequencing through instructions => dynamic reconfiguration
[Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out; control selects the task by descriptor, and dynamic reconfiguration switches between the planes]
[Figure: PE array and the structure of one PE (ALU, DMU, and stored instructions 0, 1, 2); IP = "1" selects the active instruction, configuring operations such as Add, Sel, and Cmp across the array]
1. Identify the instruction to be executed
2. Decode the instruction in the ALU plane
3. Configure the ALU plane according to the instruction
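The plane-switching idea can be sketched in a few lines: every PE stores several instructions, one instruction pointer is broadcast to all PEs, and the instructions selected by that pointer together form the active datapath plane. The class `DrpPE`, the function `run_plane`, the plane contents, and the fixed state sequence standing in for the STC are all made-up illustrations, not the actual NEC DRP instruction set or controller.

```python
import operator


class DrpPE:
    """Toy PE holding several instructions; the instruction pointer picks the active one."""

    def __init__(self, instructions):
        self.instructions = instructions  # list indexed by instruction pointer

    def execute(self, ip, x, y):
        return self.instructions[ip](x, y)


def select(x, y):
    """Stand-in for a DMU 'select' operation: forward the first operand."""
    return x


# Two PEs, three datapath planes (IP = 0, 1, 2). The plane contents are invented.
pe_array = [
    DrpPE([operator.add, operator.sub, select]),
    DrpPE([max, operator.add, operator.mul]),
]


def run_plane(ip, x, y):
    """Broadcast one IP to all PEs; their selected instructions form one datapath plane."""
    return [pe.execute(ip, x, y) for pe in pe_array]


# A fixed state sequence stands in for the state transition controller (STC).
stc_sequence = [0, 1, 2, 1]
trace = [run_plane(ip, 7, 3) for ip in stc_sequence]
print(trace)
```

Changing the pointer changes what every PE does at once, which is why the collection of PE instructions behaves like one very wide instruction word.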
Tightly-Coupled Processor Arrays (TCPA)
• Processor elements (PEs) with VLIW (very long instruction word) architecture
• Weakly programmable
– Small local instruction memory
– Limited, parameterizable instruction set focused on digital signal processing
• Data flow oriented control path, no global address space, data streaming over the processing field
• Regular interconnect network
• Application areas: digital signal processing, e.g., mobile communication, HDTV, multimedia, ...
• Basic structure: grid
• Dynamically reconfigurable
• By using a bypass, more than one hop is possible in a single clock cycle
• An interconnect wrapper is responsible for switching
• Multicast scheme for partial dynamic reconfiguration
• Differential reconfiguration (program/connections) also possible
24 Core TCPA – Lehrstuhl für Informatik 12
• 24x 16-bit cores
• Technology
– CMOS 1.0 V
– 9 metal layers
– 90 nm standard cell layout
• FUs/PE
– 2x Add, 2x Mul, 1x Shift, 1x DPU
• Registers/PE: 15
• Instruction memory
– 1024x32 = 4 kB
• Clock frequency: 200 MHz
• Peak performance: 24 GOPS
• Power consumption
– 133 mW @ 200 MHz (hybrid clock gating)
• Power efficiency: 180 MOPS/mW
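The figures quoted for the 24-core TCPA are mutually consistent, as a short calculation shows. The sketch only reuses the numbers given above; the variable names are illustrative, and the derived operations-per-core-per-cycle value is an inference from the quoted peak performance, not a figure stated in the source.

```python
cores = 24
clock_hz = 200e6           # 200 MHz
peak_ops_per_s = 24e9      # 24 GOPS
power_mw = 133.0           # 133 mW at 200 MHz
instr_words, instr_bits = 1024, 32

# Instruction memory size: 1024 words of 32 bits each, expressed in bytes.
instr_mem_bytes = instr_words * instr_bits // 8

# Operations each core must complete per clock cycle to reach the peak figure.
ops_per_core_per_cycle = peak_ops_per_s / (cores * clock_hz)

# Power efficiency in MOPS per mW.
mops_per_mw = (peak_ops_per_s / 1e6) / power_mw

print(instr_mem_bytes, ops_per_core_per_cycle, round(mops_per_mw))
```

The instruction memory comes out at 4096 bytes (4 kB), and the efficiency rounds to the quoted 180 MOPS/mW.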