The Tilera TILE64 Processor - Many-Core Network on Chip Platforms

2.4 Many-Core Network on Chip Platforms

2.4.1 The Tilera TILE64 Processor

Processors in the Tile family are based on Tilera’s multi-core architecture. We take a brief look at the TILE64 processor as an example. The platform supports several programming languages, including full ANSI C, which eases the porting of legacy code.

Figure 2.9 shows the architecture diagram of the TILE64 processor. The processor features 64 homogeneous PEs arranged in a two-dimensional 8x8 grid. Each PE is referred to as a tile. All the tiles are connected with the I/O, the peripherals, and each other via multiple high-speed, on-die, packet-switched, two-dimensional mesh networks (based on Tilera’s iMesh interconnect technology) [13, 138]. By employing dedicated mesh networks with different latencies and bandwidths for inter-tile, memory, and I/O communications, the architecture provides high-bandwidth and extremely low-latency communication among tiles. There are four on-die Memory Controllers (MCs) that connect the tiles to on-board Double Data Rate (DDR) memories.

The tiles on the TILE64 can operate at clock frequencies between 600 and 1000 MHz. Additionally, cores can be grouped into islands to eliminate unnecessary communication and reduce power consumption (unused tiles can be put into sleep mode). As can be seen from Figure 2.9, each tile contains three major components: a processor engine, a cache engine, and a switch engine. The processor engine is a three-way Very Long Instruction Word (VLIW) processor with an independent program counter.
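As a rough worked example, the figures above give an upper bound on instruction throughput. The sketch below assumes every tile issues a full three-operation VLIW bundle on every cycle, which real code rarely achieves; the function and constant names are illustrative only.

```python
TILES = 64        # 8x8 grid of tiles
ISSUE_WIDTH = 3   # three-way VLIW: up to three operations per cycle


def peak_ops_per_second(clock_hz, tiles=TILES, issue_width=ISSUE_WIDTH):
    """Idealized upper bound: every tile fills every issue slot each cycle."""
    return tiles * issue_width * clock_hz


for mhz in (600, 1000):
    gops = peak_ops_per_second(mhz * 10**6) / 1e9
    print(f"{mhz} MHz: {gops:.1f} Gops/s peak")
```

At the two ends of the stated clock range this bound works out to 115.2 and 192.0 billion operations per second.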

The cache engine contains the tile’s Translation Lookaside Buffers (TLBs), caches, and cache sequencers. In addition to supporting both private and shared memory, the TLBs also support pinning blocks of memory in the cache. There are separate 8 KiB L1 instruction and data caches. The L1 instruction cache has an 8-entry TLB, while the L1 data cache has a 16-entry TLB. A unified 2-way 64 KiB L2 cache backs the L1 caches. Each tile also contains a 2D Direct Memory Access (DMA) engine that supports block copy functions such as cache-to-memory, memory-to-cache, and cache-to-cache. There is no L3 cache, but each tile’s L2 cache can be shared with other tiles, in effect providing a shared L3 cache.
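The per-tile cache sizes above can be summed to see how large the distributed cache becomes. This is a minimal sketch, assuming all 64 tiles contribute their entire L2 to the shared pool (an idealization); the names are illustrative.

```python
KIB = 1024

TILES = 8 * 8           # 8x8 grid of tiles
L1I_BYTES = 8 * KIB     # L1 instruction cache per tile
L1D_BYTES = 8 * KIB     # L1 data cache per tile
L2_BYTES = 64 * KIB     # unified L2 cache per tile


def aggregate_cache_bytes(tiles=TILES):
    """Return (total L1 bytes, total L2 bytes) summed over `tiles` tiles."""
    l1_total = tiles * (L1I_BYTES + L1D_BYTES)
    l2_total = tiles * L2_BYTES
    return l1_total, l2_total


l1_total, l2_total = aggregate_cache_bytes()
print(f"L1 total: {l1_total // KIB} KiB")  # 1024 KiB
print(f"L2 total: {l2_total // KIB} KiB")  # 4096 KiB
```

Under that assumption the combined L2 capacity, and hence the upper limit of the shared "L3", is 4 MiB.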

The switch in the switch engine is a full crossbar for non-blocking routing, with credit-based flow control. It serves five networks in total: four dynamic networks and one static network. The dynamic networks are dimension-ordered and wormhole-routed. A packet travelling straight through a switch incurs one cycle of latency per hop. If the packet has to make a turn at a switch, the latency increases by one cycle for the route calculation.
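The latency rule above can be sketched as a small model. The code assumes packets travel along the X dimension first and then along Y (the text states only that routing is dimension-ordered, not which dimension comes first), and it ignores injection, ejection, and contention delays. Tile coordinates and function names are hypothetical.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst, moving along X first, then Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def route_latency(src, dst):
    """Cycles under the stated model: one per hop, plus one if the packet turns."""
    hops = abs(dst[0] - src[0]) + abs(dst[1] - src[1])
    # Dimension-ordered routing turns at most once, and only if both
    # coordinates differ between source and destination.
    turns = 1 if (src[0] != dst[0] and src[1] != dst[1]) else 0
    return hops + turns


print(route_latency((0, 0), (7, 0)))  # straight along one row: 7 cycles
print(route_latency((0, 0), (7, 7)))  # corner to corner: 14 hops + 1 turn = 15 cycles
```

In this model the worst case on the 8x8 grid is the corner-to-corner route, at 15 cycles.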

The iMesh comprises five networks. Each network connects neighbouring tiles with 32-bit unidirectional links in each direction, allowing traffic to flow both ways at the same time. The five networks are:

Static Network (STN) is a scalar network with low latency allowing static configuration of the routing decisions. It is mainly used for streaming data from one tile to another via pre-configured routes.

User Dynamic Network (UDN) is a low-latency, user-programmable, packet-switched network used for communication between threads running in parallel on multiple tiles.

Memory Dynamic Network (MDN) is used for memory transfers such as loads, stores, and cache misses.

Tile Dynamic Network (TDN) supports data transfer between tile caches. TDN works in concert with MDN.

Input/Output Dynamic Network (IDN) is a network accessible only to OS-level code, not to user applications. It is used primarily to transfer data between tiles and I/O devices, and between I/O devices and memory.
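The division of labour among the five networks can be restated as a lookup table. This is only a summary of the list above in code form, not a Tilera API: the traffic category names, the `network_for` helper, and the `user_mode` flag are all hypothetical.

```python
from enum import Enum


class Network(Enum):
    STN = "Static Network"
    UDN = "User Dynamic Network"
    MDN = "Memory Dynamic Network"
    TDN = "Tile Dynamic Network"
    IDN = "Input/Output Dynamic Network"


# Hypothetical traffic categories mapped to the network described for them.
TRAFFIC_TO_NETWORK = {
    "stream": Network.STN,          # pre-configured tile-to-tile streaming
    "user_message": Network.UDN,    # messages between user threads
    "memory_access": Network.MDN,   # loads, stores, cache misses
    "cache_to_cache": Network.TDN,  # transfers between tile caches
    "io": Network.IDN,              # tile <-> I/O and I/O <-> memory traffic
}

# The text states that the IDN is reserved for OS-level code.
OS_ONLY = {Network.IDN}


def network_for(kind, user_mode=True):
    """Return the network for a traffic category, enforcing the OS-only rule."""
    network = TRAFFIC_TO_NETWORK[kind]
    if user_mode and network in OS_ONLY:
        raise PermissionError(f"{network.name} is not accessible to user code")
    return network


print(network_for("user_message").name)        # UDN
print(network_for("io", user_mode=False).name)  # IDN
```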


Figure 2.10 Adapteva Epiphany-64 Architecture Diagram [59] (each mesh node comprises a router, a RISC CPU, local memory, a DMA engine, and a network interface)

The Tile Processor architecture defines a flat, globally shared 64-bit physical address space and a 32-bit virtual address space. In addition to the default hardware-backed cache-coherent memory, the TILE64 also supports other memory modes, namely a non-coherent and a non-cacheable memory mode. Memory attributes and modes are managed and configured by means of page table entries and enforced through TLB entries. The TILE64 provides a directory-based coherence policy. Every node has a directory cache and an off-chip directory controller. Tile-to-tile memory requests and responses transit the TDN, while off-chip memory requests and responses transit the MDN. The traffic due to cache coherency was so high that an extra mesh network was added to the later TILEPro processors: the Coherence Dynamic Network (CDN), which is used only for passing the invalidation messages needed by the cache-coherency protocol.
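To illustrate why directory-based coherence generates invalidation traffic, the following toy model tracks which tiles hold a copy of each cache line. It is a generic textbook directory scheme, not the actual TILE64 protocol; the class, method names, and addresses are hypothetical.

```python
from collections import defaultdict


class Directory:
    """Toy directory: records which tiles hold a copy of each cache line."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> set of tile ids

    def read(self, tile, line):
        """Record that `tile` now holds a shared copy of `line`."""
        self.sharers[line].add(tile)

    def write(self, tile, line):
        """Make `tile` the sole holder; return the tiles that must invalidate."""
        to_invalidate = sorted(self.sharers[line] - {tile})
        self.sharers[line] = {tile}
        return to_invalidate


directory = Directory()
directory.read(3, 0x1000)
directory.read(5, 0x1000)
directory.read(9, 0x1000)
print(directory.write(3, 0x1000))  # [5, 9]: one invalidation message per other sharer
```

Each write to a widely shared line produces one invalidation per sharer, which is the kind of traffic the dedicated CDN on the TILEPro carries.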

On the TILE64 processor, each tile can independently run a full OS, e.g. GNU/Linux. In addition, multiple tiles taken together can run a multi-processor OS such as a Symmetric Multiprocessing (SMP) version of GNU/Linux.

In document RA-LPEL: A Resource-Aware Light-Weight Parallel Execution Layer for Reactive Stream Processing Networks on The SCC Many-core Tiled Architecture (Page 59-61)