
“Single-chip Cloud Computer”

IA Tera-scale Research Processor

Jim Held

Intel Fellow & Director, Tera-scale Computing Research

Intel Labs

August 31, 2010


Agenda

Tera-scale Research

SCC Architecture

Software environment

Co-travelers Program

Summary


Performance Scaling Challenges

Energy


Tera-scale Research

Cores

– power efficient general & special function

Interconnects

– High bandwidth, low latency

Memory Hierarchy

– Feed the compute engine

System Software

– Scalable services

Programming

– Empower the mainstream

Applications

– Identify, characterize & optimize


Teraflops Research Processor

Goals:

Deliver Tera-scale performance

– Single precision TFLOP at desktop power

– Frequency target 5GHz

– Bi-section B/W order of Terabits/s

– Link bandwidth in hundreds of GB/s

Prototype two key technologies

– On-die interconnect fabric

– 3D stacked memory

Develop a scalable design methodology

– Tiled design approach

– Mesochronous clocking

– Power-aware capability

[Figure: die photo, 21.72mm x 12.64mm, showing I/O areas, PLL and TAP; single tile 1.5mm x 2.0mm]

Technology: 65nm, 1 poly, 8 metal (Cu)

Transistors: 100 Million (full-chip), 1.2 Million (tile)

Die Area: 275mm2 (full-chip), 3mm2 (tile)

C4 bumps #: 8390


Within-Die Variation-Aware DVFS and scheduling

Max frequency variation per core

– 28% at 1.2V

– 62% at 0.8V

No correlation die to die: individual characterization required

Improved performance or energy efficiency with:

– Multiple frequency islands

– Dynamic scheduling of processing to core

Dighe, S, et al., “Within-Die Variation-Aware Dynamic Voltage-Frequency Scaling, Core Mapping and Thread Hopping for an 80-Core Processor”, in Proceedings of ISSCC 2010 (IEEE International Solid-State Circuits Conference), Feb. 2010


Cloud Computing Today

Cloud datacenters:

–1000s of networked computers

–Millions of threads & petabytes of data

Opportunity:

–Lower power, higher density via integration

–Greater efficiency and better programmability

Example: Intel’s Open Cirrus testbed, Intel Labs Pittsburgh

[Figure: testbed network diagram with 1 Gb/s point-to-point links between switches and racks and a 45 Mb/s T3 link to the Internet]


Motivations for SCC

Many-core processor research

–High-performance power-efficient fabric

–Fine-grain power management

–Message-based programming support

Parallel Programming research

–Better support for scale-out server model

– Operating system, communication architecture

–Scale-out programming model for client

– Programming languages, runtimes


Single-chip Cloud Computer Experimental Processor

[Figure: die photo, 26.5mm x 21.4mm, showing the tile array, four DDR3 memory controllers (MC), System Interface + I/O, PLL, JTAG and VRC; each tile, 5.2mm x 3.6mm, contains Core0, Core1, L2$0, L2$1, a router and the MPB]

Technology: 45nm Hi-K CMOS

Interconnect: 9 Metal (Cu)

Transistors: Die: 1.3B, Tile: 48M

Tile Area: 18.7mm2


Architectural Overview

Howard, J, et al., “A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS”, in Proceedings of ISSCC 2010 (IEEE International Solid-State Circuits Conference), Feb. 2010

2nd Generation Intel Labs experimental processor

– IA-based software research vehicle

“Cluster-on-die” architecture

– 48 Pentium™ Processor cores (P54C - x87FP only)

[Figure: block diagram of 24 tiles, each attached to a router (R), with four memory controllers and the System I/F at the edges; tile detail shows Core 0, Core 1, L2$0, L2$1, Router and MPB]


On-die Interconnect

Architecture

–6x4 2D Mesh NOC

–16B wide data links + 2B sideband

–8 Virtual Channels in 2 classes

–Fixed (X-Y) routing

Performance

–Target freq: 2GHz @ 1.1V

–Link Bandwidth 64GB/s

–4 cycle latency

Power Management

–Independent Frequency & Voltage control

–Sleep mode, clock gating, low power RF

[Figure: frequency (GHz, log scale) versus supply voltage (V) for router and core at 50°C; labeled points: 0.55V 60MHz, 0.73V 300MHz, 0.94V 0.9GHz, 0.94V 1.4GHz, 1.32V 1.3GHz, 1.34V 2.6GHz]
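To make the fixed (X-Y) routing on the 6x4 mesh concrete, here is a minimal Python sketch of dimension-order routing. It is an illustration only, not the router's implementation: the coordinate convention (x in 0..5, y in 0..3) and the function name `xy_route` are assumptions made for this example. The count of two cores per tile comes from the tile diagram.

```python
from typing import List, Tuple

MESH_COLS = 6          # 6x4 2D mesh, from the slide
MESH_ROWS = 4
CORES_PER_TILE = 2     # Core 0 and Core 1 in each tile
NUM_TILES = MESH_COLS * MESH_ROWS
NUM_CORES = NUM_TILES * CORES_PER_TILE

Coord = Tuple[int, int]  # (x, y) router position; the origin choice is an assumption


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the routers visited from src to dst, moving along X first, then Y."""
    for x, y in (src, dst):
        if not (0 <= x < MESH_COLS and 0 <= y < MESH_ROWS):
            raise ValueError(f"({x}, {y}) is outside the {MESH_COLS}x{MESH_ROWS} mesh")

    x, y = src
    path = [(x, y)]

    # Phase 1: travel in the X dimension until the destination column is reached.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))

    # Phase 2: travel in the Y dimension until the destination row is reached.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))

    return path


if __name__ == "__main__":
    print(NUM_TILES, "tiles,", NUM_CORES, "cores")
    print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on source and destination, every packet between the same two routers takes the same route. The longest route in this mesh, corner to corner, is 5 + 3 = 8 hops.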


Memory Architecture

Memory

– Up to 64GB DDR3 via 4 memory controllers @ 21.3GB/s

– 16KB SRAM in each tile as Message Passing Buffer (MPB)

Caching

– 32KB L1 per core (16KB I,D), 12MB L2 cache (256KB/core)

– No HW cache-coherent shared memory

Addressing

– Core physical to system physical addresses in 16MB sections

– Memory mapped configuration & control registers

[Figure: core physical address spaces mapped onto the system physical address space]
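The mapping of core physical addresses to system physical addresses in 16MB sections can be sketched as a table lookup. The sketch below assumes 32-bit core physical addresses, which gives 2^32 / 2^24 = 256 sections per core. The table contents, the name `translate`, and the idea that each entry holds a system section number are hypothetical and serve only to illustrate the split into section index and offset.

```python
from typing import List

SECTION_BITS = 24                     # 16MB = 2**24 bytes per section (from the slide)
SECTION_SIZE = 1 << SECTION_BITS
CORE_ADDR_BITS = 32                   # assumption: 32-bit core physical addresses
NUM_SECTIONS = 1 << (CORE_ADDR_BITS - SECTION_BITS)   # 256 sections per core


def translate(core_addr: int, section_table: List[int]) -> int:
    """Map a core physical address to a system physical address.

    section_table[i] is the (hypothetical) system section number that the
    core's i-th 16MB section points to.
    """
    if not 0 <= core_addr < (1 << CORE_ADDR_BITS):
        raise ValueError("core physical address out of range")
    if len(section_table) != NUM_SECTIONS:
        raise ValueError("section table must have one entry per 16MB section")

    section = core_addr >> SECTION_BITS          # upper bits select the section
    offset = core_addr & (SECTION_SIZE - 1)      # lower 24 bits pass through unchanged
    return (section_table[section] << SECTION_BITS) | offset


if __name__ == "__main__":
    # Hypothetical table: core section i maps to system section 0x100 + i.
    table = [0x100 + i for i in range(NUM_SECTIONS)]
    print(hex(translate(0x01000010, table)))

    # Cache arithmetic from the slide: 48 cores x 256KB of L2 each.
    print(48 * 256 // 1024, "MB of L2 in total")
```

Since there is no hardware cache coherence, two cores whose tables point at the same system section share that memory only in the sense that software must manage visibility itself.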


Power Management

Configurable MC, Mesh, SIF Voltage & Frequency

Software-controlled DVFS* of cores

– Fine-grain voltage control at 4 tile cluster level (6.25mV)

– Frequency control at tile level (16bit divider)

– Closed loop - thermal sensors per tile, current through BMC

[Figure: tile array with four memory controllers and the System I/F, showing voltage domains (V0, V1) and per-tile frequency domains (Fn)]

*Dynamic voltage and frequency scaling

DVFS gives wide operating range: 125W @ 1.14V 1GHz
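As a rough illustration of the two granularities above, the following Python sketch snaps a requested core voltage to the 6.25mV step and maps a tile to its voltage cluster. Several things here are assumptions and not taken from the slides: that a 4-tile cluster is a 2x2 block of the 6x4 mesh, that a request is rounded up, not down, and the function names themselves. The frequency divider is left out because the reference clock is not stated.

```python
VOLTAGE_STEP_UV = 6250            # 6.25mV step from the slide, in microvolts
MESH_COLS, MESH_ROWS = 6, 4       # 24 tiles
TILES_PER_CLUSTER = 4             # voltage is controlled per 4-tile cluster


def snap_voltage_uv(requested_uv: int) -> int:
    """Round a requested voltage up to the next multiple of the 6.25mV step."""
    if requested_uv <= 0:
        raise ValueError("voltage must be positive")
    steps = -(-requested_uv // VOLTAGE_STEP_UV)   # ceiling division on integers
    return steps * VOLTAGE_STEP_UV


def voltage_cluster(tile_x: int, tile_y: int) -> int:
    """Return the cluster index of a tile, assuming 2x2 tile clusters (hypothetical layout)."""
    if not (0 <= tile_x < MESH_COLS and 0 <= tile_y < MESH_ROWS):
        raise ValueError("tile coordinate outside the mesh")
    return (tile_y // 2) * (MESH_COLS // 2) + (tile_x // 2)


if __name__ == "__main__":
    print(snap_voltage_uv(1_140_000), "uV for a 1.14V request")
    clusters = {voltage_cluster(x, y) for x in range(MESH_COLS) for y in range(MESH_ROWS)}
    print(len(clusters), "voltage clusters for the tiles")
```

With 24 tiles and 4 tiles per cluster, the tiles fall into 6 voltage clusters, while each tile keeps its own frequency setting.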


Measured full chip power


Power breakdown

Full Power Breakdown (Cores-1GHz, Mesh-2GHz, 1.14V, 50°C)

Total: 125.3W

– Cores: 87.7W (69%)

– MC & DDR3-800: 23.6W (19%)

– Routers & 2D-mesh: 12.1W (10%)

– Global Clocking: 1.9W (2%)

Low Power Breakdown (Cores-125MHz, Mesh-250MHz, 0.7V, 50°C)

Total: 24.7W

– Cores: 5.1W (21%)

– MC & DDR3-800: 17.2W (69%)

– Routers & 2D-mesh: 1.2W (5%)

– Global Clocking: 1.2W (5%)
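The per-component wattages can be checked against the totals, and the core power at the low operating point can be compared with a first-order estimate. The sketch below uses the measured numbers from the slide. The scaling rule P ∝ V²·f is the textbook approximation for dynamic power and is an assumption here, not something the slides state. It ignores leakage, so it is only a rough comparison.

```python
FULL = {"Cores": 87.7, "MCs": 23.6, "Routers": 12.1, "Clocking": 1.9}   # watts at 1.14V, cores 1GHz
LOW = {"Cores": 5.1, "MCs": 17.2, "Routers": 1.2, "Clocking": 1.2}      # watts at 0.7V, cores 125MHz


def shares(breakdown: dict) -> dict:
    """Return each component's share of the total, in percent."""
    total = sum(breakdown.values())
    return {name: 100.0 * watts / total for name, watts in breakdown.items()}


def scaled_dynamic_power(p_ref: float, v_ref: float, f_ref: float, v: float, f: float) -> float:
    """First-order dynamic power estimate, assuming P is proportional to V^2 * f."""
    return p_ref * (v / v_ref) ** 2 * (f / f_ref)


if __name__ == "__main__":
    for label, data in (("full", FULL), ("low", LOW)):
        total = sum(data.values())
        print(label, f"total {total:.1f}W", {k: f"{v:.1f}%" for k, v in shares(data).items()})

    # Estimate core power at 0.7V / 125MHz from the 1.14V / 1GHz measurement.
    estimate = scaled_dynamic_power(FULL["Cores"], 1.14, 1000.0, 0.7, 125.0)
    print(f"estimated core power at the low point: {estimate:.1f}W (measured: {LOW['Cores']}W)")
```

The component values sum to the stated totals of 125.3W and 24.7W. The V²·f estimate for the cores comes out somewhat below the measured 5.1W, which is what one would expect if part of the remaining power does not scale with frequency. The memory controllers change much less between the two operating points, so they dominate the low power breakdown.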


Rocky Lake – SCC platform

Replacement for evaluation board

– 100 boards with more I/O, more robust, less expensive

– BIOS/Firmware in definition


SCC “Chipset”

System Interface FPGA

–Connects to SCC Mesh interconnect

–IO capabilities like PCIe, Ethernet & SATA

–Bitstream is part of sccKit distribution

Board Management Controller (BMC)

–JTAG interface for Clocking, Power etc.

–USB Stick with FPGA bitstream

–Network interface for User interaction via Telnet

–Status monitoring

–Firmware is part of sccKit distribution


Software Environment

SCC Software

– Bare Metal

– Customized Linux

– RCCE communication & power management API

– Tools

– Selected Intel tools (e.g., icc, ifort, ...)

– Microsoft research release of SCC extensions to Visual Studio

Management Console PC Software

– PCIe driver with integrated TCP/IP driver

– Programming API for communication with SCC platform

– GUI for interaction with SCC platform


RCCE Communication API

A compact, lightweight communication environment.

– SCC and RCCE were designed together side by side: a true HW/SW co-design project.

A research vehicle to understand how message passing APIs map onto many core chips.

For experienced parallel programmers willing to work close to the hardware.

Static SPMD Execution Model:

– identical UEs created together when a program starts (this is a standard approach familiar to message passing programmers)

UE: Unit of Execution, a software entity that advances a program counter (e.g. process or thread).
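The static SPMD model can be illustrated with a small Python sketch: a fixed number of identical UEs start together, run the same program, and differ only in their rank. This is not the RCCE API. The names `run_spmd` and `ring_program` and the per-UE mailbox queues are hypothetical stand-ins, and Python threads in one process replace SCC cores purely for illustration.

```python
import queue
import threading
from typing import Callable, List


def run_spmd(num_ues: int, program: Callable[[int, int, List[queue.Queue]], object]) -> list:
    """Create num_ues identical UEs together, run them, and return results by rank."""
    mailboxes = [queue.Queue() for _ in range(num_ues)]   # one inbox per UE (hypothetical)
    results = [None] * num_ues

    def ue_main(rank: int) -> None:
        # Every UE executes the same program; only the rank argument differs.
        results[rank] = program(rank, num_ues, mailboxes)

    threads = [threading.Thread(target=ue_main, args=(rank,)) for rank in range(num_ues)]
    for t in threads:
        t.start()          # all UEs are created when the program starts
    for t in threads:
        t.join()           # no UE is added or removed while the program runs
    return results


def ring_program(rank: int, num_ues: int, mailboxes: List[queue.Queue]) -> int:
    """Send this UE's rank to its right neighbor and return the rank received from the left."""
    right = (rank + 1) % num_ues
    mailboxes[right].put(rank)                 # message send
    return mailboxes[rank].get(timeout=5)      # blocking message receive


if __name__ == "__main__":
    print(run_spmd(4, ring_program))   # prints [3, 0, 1, 2]
```

Each UE receives exactly one message, from its left neighbor, so the result is the same regardless of thread scheduling. Communication happens only through explicit messages, which mirrors the message-passing style the slide describes.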


SCC Disclosure Demos

Financial Analytics

w/ shared virtual memory

Microsoft Visual Studio

Advanced Power Management


SCC Co-Travelers Program

Currently building SCC software research community

– 100 systems total, with 40 in Oregon Datacenter

– Research partners for 2010 have been selected

SCC community website available today

– Communities.intel.com/community/marc

– To share ideas, HowTo’s, code, tools


Summary

SCC provides a unique experimental platform for many-core research

–Better support for “Cloud” data center servers

–Scale-out programming model for client

We are sharing SCC with selected researchers in academia and industry

–Documentation and presentations

http://www.intel.com/info/scc


SCC Team

Jason Howard, Saurabh Dighe, Yatin Hoskote, Sriram Vangal, David Finan, Gregory Ruhl, David Jenkins, Howard Wilson, Nitin Borkar, Gerhard Schrom, Fabrice Pailet, Shailendra Jain, Tiju Jacob, Satish Yada, Sraven Marella, Praveen Salihundam, Vasantha Erraguntla, Michael Konow, Michael Riepen, Guido Droege, Joerg Lindemann, Matthias Gries, Thomas Apel, Kersten Henriss, Tor Lund-Larsen, Sebastian Steibl, Shekhar Borkar, Vivek De, Rob Van Der Wijngaart, Timothy Mattson
