“Single-chip Cloud Computer”
IA Tera-scale Research Processor
Jim Held, Intel Fellow & Director, Tera-scale Computing Research
Intel Labs
August 31, 2010
Agenda
Tera-scale Research
SCC Architecture
Software environment
Co-travelers Program
Summary
Performance Scaling Challenges
Energy
Tera-scale Research
Cores
– Power-efficient general & special function
Interconnects
– High bandwidth, low latency
Memory Hierarchy
– Feed the compute engine
System Software
– Scalable services
Programming
– Empower the mainstream
Applications
– Identify, characterize & optimize
Teraflops Research Processor
Goals:
Deliver Tera-scale performance
– Single precision TFLOP at desktop power
– Frequency target 5GHz
– Bi-section B/W order of Terabits/s
– Link bandwidth in hundreds of GB/s
Prototype two key technologies
– On-die interconnect fabric
– 3D stacked memory
Develop a scalable design methodology
– Tiled design approach
– Mesochronous clocking
– Power-aware capability
[Figure: die photo, 21.72mm x 12.64mm full chip with I/O areas, PLL and TAP; single tile 1.5mm x 2.0mm]
Technology: 65nm, 1 poly, 8 metal (Cu)
Transistors: 100 Million (full-chip), 1.2 Million (tile)
Die Area: 275mm2 (full-chip), 3mm2 (tile)
C4 bumps #: 8390
Within-Die Variation-Aware
DVFS and scheduling
Max frequency variation per core
– 28% at 1.2V
– 62% at 0.8V
No correlation die to die
– Individual characterization required
Improved performance or energy efficiency with:
– Multiple frequency islands
– Dynamic scheduling of processing to core
Dighe, S, et al., “Within-Die Variation-Aware Dynamic Voltage-Frequency Scaling, Core Mapping and Thread Hopping for an 80-Core Processor”, in
Proceedings of ISSCC 2010 (IEEE International Solid-State Circuits Conference), Feb. 2010
Cloud Computing Today
Cloud datacenters:
–1000s of networked computers
–Millions of threads & petabytes of data
Opportunity:
–Lower power, higher density via integration
–Greater efficiency and better programmability
Example: Intel’s Open Cirrus testbed, Intel Labs Pittsburgh
[Figure: Open Cirrus testbed network topology, with 1 Gb/s links between switches and racks and a 45 Mb/s T3 link to the Internet]
Motivations for SCC
Many-core processor research
–High-performance power-efficient fabric
–Fine-grain power management
–Message-based programming support
Parallel Programming research
–Better support for scale-out server model
– Operating system, communication architecture
–Scale-out programming model for client
– Programming languages, runtimes
[Figure: die photo, 26.5mm x 21.4mm, showing the tile array, four DDR3 memory controllers (MC), System Interface + I/O, PLL, JTAG and VRC; each 5.2mm x 3.6mm tile holds Core0, Core1, L2$0, L2$1, Router and MPB]
Single-chip Cloud Computer
Experimental Processor
Technology: 45nm Hi-K CMOS
Interconnect: 9 Metal (Cu)
Transistors: 1.3B (die), 48M (tile)
Tile Area: 18.7mm2
Architectural Overview
Howard, J, et al., “A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS”, in
Proceedings of ISSCC 2010 (IEEE International Solid-State Circuits Conference), Feb. 2010
2nd Generation Intel Labs experimental processor
– IA-based software research vehicle
“Cluster-on-die” architecture
– 48 Pentium™ Processor cores (P54C - x87FP only)
[Figure: block diagram of the tile array, each tile attached to a router (R), with four memory controllers and the System I/F; tile detail shows Core 0, Core 1, L2$0, L2$1, Router and MPB]
On-die Interconnect
Architecture
–6x4 2D Mesh NOC
–16B wide data links + 2B sideband
–8 Virtual Channels in 2 classes
–Fixed (X-Y) routing
Performance
–Target freq: 2GHz @ 1.1V
–Link Bandwidth 64GB/s
–4 cycle latency
Power Management
–Independent Frequency & Voltage control
–Sleep mode, clock gating, low power RF
[Figure: frequency (GHz, log scale) versus supply voltage (V) for router and core at 50°C; labeled points: 0.55V 60MHz, 0.73V 300MHz, 0.94V 0.9GHz, 0.94V 1.4GHz, 1.32V 1.3GHz, 1.34V 2.6GHz]
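The fixed (X-Y) routing on the 6x4 mesh listed above can be illustrated with a minimal sketch. It assumes plain dimension-order routing, in which a packet first travels along X to the destination column and then along Y to the destination row. The coordinate convention (x in 0..5, y in 0..3) and the function name `xy_route` are assumptions made for illustration and are not taken from SCC documentation.

```python
MESH_COLS, MESH_ROWS = 6, 4  # 6x4 2D mesh, as stated on the slide


def xy_route(src, dst):
    """Return the routers visited from src to dst, moving along X first, then Y.

    src and dst are (x, y) tuples; the coordinate convention is an assumption.
    """
    for x, y in (src, dst):
        if not (0 <= x < MESH_COLS and 0 <= y < MESH_ROWS):
            raise ValueError(f"router ({x}, {y}) is outside the mesh")

    x, y = src
    path = [(x, y)]
    # Step along X until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then step along Y until the destination row is reached.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (5, 3))
    print(route)
    print("hops:", len(route) - 1)
```

Because the path depends only on source and destination, every packet between the same pair of routers follows the same route, which is what makes the scheme "fixed".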
Memory Architecture
Memory
– Up to 64GB DDR3 via 4 memory controllers @ 21.3GB/s
– 16KB SRAM in each tile as Message Passing Buffer (MPB)
Caching
– 32KB L1 per core (16KB I,D), 12MB L2 cache (256KB/core)
– No HW cache-coherent shared memory
Addressing
– Core physical to system physical addresses in 16MB sections
– Memory mapped configuration & control registers
[Figure: two Core Physical Address Spaces mapped into one System Physical Address Space]
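The mapping of core physical addresses to system physical addresses in 16MB sections can be sketched as a table lookup. The sketch below assumes 32-bit core physical addresses, which gives 2^32 / 2^24 = 256 sections per core. The table layout (one system base address per section), the names `translate` and `make_lut`, and the example base addresses are hypothetical and chosen only for illustration.

```python
SECTION_BITS = 24                    # 16MB = 2**24 bytes per section
SECTION_SIZE = 1 << SECTION_BITS
CORE_ADDR_BITS = 32                  # assumption: 32-bit core physical addresses
NUM_SECTIONS = 1 << (CORE_ADDR_BITS - SECTION_BITS)  # 256 sections


def make_lut(mapping):
    """Build a per-core table from {section index: system base address}."""
    lut = [None] * NUM_SECTIONS
    for index, base in mapping.items():
        if base % SECTION_SIZE:
            raise ValueError("system base must be 16MB aligned")
        lut[index] = base
    return lut


def translate(lut, core_addr):
    """Translate a core physical address to a system physical address."""
    if not 0 <= core_addr < (1 << CORE_ADDR_BITS):
        raise ValueError("core address out of range")
    index = core_addr >> SECTION_BITS          # which 16MB section
    offset = core_addr & (SECTION_SIZE - 1)    # position inside the section
    base = lut[index]
    if base is None:
        raise LookupError(f"section {index} is not mapped")
    return base + offset


if __name__ == "__main__":
    # Hypothetical mapping: two sections of one core placed in system memory.
    lut = make_lut({0x00: 0x0_4000_0000, 0x80: 0x2_0000_0000})
    print(hex(translate(lut, 0x0000_1234)))
    print(hex(translate(lut, 0x80AB_CDEF)))
```

Giving each core its own table is one way two cores can map the same core physical address to different system addresses, or different core addresses to the same shared section.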
Power Management
Configurable MC, Mesh, SIF Voltage & Frequency
Software-controlled DVFS* of cores
– Fine-grain voltage control at 4 tile cluster level (6.25mV)
– Frequency control at tile level (16bit divider)
– Closed loop - thermal sensors per tile, current through BMC
[Figure: tile array with routers (R), four memory controllers and the System I/F, annotated with voltage domains (V0, V1) and frequency domains (Fn)]
*Dynamic voltage and frequency scaling
DVFS gives wide operating range: 125W @ 1.14V 1GHz
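To see why DVFS gives such a wide operating range, a rough estimate can use the standard first-order relation that dynamic CMOS power scales with V² · f. The sketch below applies it to the core figures reported in the measured-power slide of this deck (87.7W at 1.14V and 1GHz; 5.1W at 0.7V and 125MHz). The function name and the decision to ignore leakage and other non-scaling power are assumptions of this sketch.

```python
def scaled_dynamic_power(p_ref, v_ref, f_ref, v_new, f_new):
    """First-order estimate: dynamic power scales with V^2 * f (leakage ignored)."""
    return p_ref * (v_new / v_ref) ** 2 * (f_new / f_ref)


if __name__ == "__main__":
    # Core power at the full-power operating point, from the measured-power slide.
    cores_full_w = 87.7
    estimate = scaled_dynamic_power(cores_full_w, v_ref=1.14, f_ref=1.0e9,
                                    v_new=0.7, f_new=125e6)
    measured_low_w = 5.1
    print(f"estimated core power at 0.7V, 125MHz: {estimate:.2f} W")
    print(f"measured core power at 0.7V, 125MHz:  {measured_low_w:.2f} W")
```

The estimate comes out somewhat below the measured value, which is plausible because the V² · f model leaves out leakage and other power that does not scale with frequency.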
Measured full chip power
Full Power Breakdown (Cores 1GHz, Mesh 2GHz, 1.14V, 50°C)
Total: 125.3W
– Cores: 87.7W (69%)
– MC & DDR3-800: 23.6W (19%)
– Routers & 2D-mesh: 12.1W (10%)
– Global Clocking: 1.9W (2%)
Low Power Breakdown (Cores 125MHz, Mesh 250MHz, 0.7V, 50°C)
Total: 24.7W
– Cores: 5.1W (21%)
– MC & DDR3-800: 17.2W (69%)
– Routers & 2D-mesh: 1.2W (5%)
– Global Clocking: 1.2W (5%)
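The two breakdowns can be cross-checked with a few lines of arithmetic: the component wattages should add up to the stated totals, and each share is the component divided by the total. This is a minimal sketch using only the numbers on the slide; the dictionary and function names are illustrative. The computed shares can differ from the slide's percentages by up to about one point, presumably because the slide rounds them to sum to 100%.

```python
FULL_POWER_W = {            # Cores 1GHz, Mesh 2GHz, 1.14V, 50°C
    "Cores": 87.7,
    "MC & DDR3-800": 23.6,
    "Routers & 2D-mesh": 12.1,
    "Global Clocking": 1.9,
}
LOW_POWER_W = {             # Cores 125MHz, Mesh 250MHz, 0.7V, 50°C
    "Cores": 5.1,
    "MC & DDR3-800": 17.2,
    "Routers & 2D-mesh": 1.2,
    "Global Clocking": 1.2,
}


def breakdown(components):
    """Return (total watts, {component: share of total in percent})."""
    total = sum(components.values())
    shares = {name: 100.0 * watts / total for name, watts in components.items()}
    return total, shares


if __name__ == "__main__":
    for label, data in (("Full power", FULL_POWER_W), ("Low power", LOW_POWER_W)):
        total, shares = breakdown(data)
        print(f"{label}: total {total:.1f} W")
        for name, pct in shares.items():
            print(f"  {name}: {data[name]:.1f} W ({pct:.1f}%)")
```

The shift is visible in the shares: at the low-power point the memory controllers and DDR3 dominate, since their power drops far less than core power does.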
Rocky Lake – SCC platform
Replacement for evaluation board
– 100 boards with more I/O, more robust, less expensive
– BIOS/Firmware in definition
SCC “Chipset”
System Interface FPGA
–Connects to SCC Mesh interconnect
–IO capabilities like PCIe, Ethernet & SATA
–Bitstream is part of sccKit distribution
Board Management Controller (BMC)
–JTAG interface for Clocking, Power etc.
–USB Stick with FPGA bitstream
–Network interface for User interaction via Telnet
–Status monitoring
–Firmware is part of sccKit distribution
Software Environment
SCC Software
– Bare Metal
– Customized Linux
– RCCE communication & power management API
– Tools
– Selected Intel tools (e.g., icc, ifort, ...)
– Microsoft research release of SCC extensions to Visual Studio
Management Console PC Software
– PCIe driver with integrated TCP/IP driver
– Programming API for communication with SCC platform
– GUI for interaction with SCC platform
RCCE Communication API
A compact, lightweight communication environment.
– SCC and RCCE were designed together side by side:
– … a true HW/SW co-design project.
A research vehicle to understand how message passing APIs map onto many-core chips.
For experienced parallel programmers willing to work close to the hardware.
Static SPMD Execution Model:
– identical UEs created together when a program starts (this is a standard approach familiar to message passing programmers)
UE: Unit of Execution … a software entity that advances a program counter (e.g. process or thread).
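The static SPMD model described above can be illustrated with a minimal sketch in which every UE is a thread running the same function, distinguished only by its rank. This is not the RCCE API: the names `run_spmd` and `ring_exchange`, and the use of one in-process queue per UE as a stand-in for a message buffer, are assumptions made for illustration.

```python
import queue
import threading


def run_spmd(num_ues, body):
    """Create num_ues identical UEs together, run body in each, wait for all."""
    # One mailbox per UE stands in for a message buffer (assumption of this sketch).
    mailboxes = [queue.Queue() for _ in range(num_ues)]
    results = [None] * num_ues

    def ue_main(rank):
        results[rank] = body(rank, num_ues, mailboxes)

    ues = [threading.Thread(target=ue_main, args=(rank,)) for rank in range(num_ues)]
    for ue in ues:      # all UEs are created when the program starts
        ue.start()
    for ue in ues:      # none are added or removed while it runs
        ue.join()
    return results


def ring_exchange(rank, num_ues, mailboxes):
    """Same code in every UE: send own rank to the right, receive from the left."""
    right = (rank + 1) % num_ues
    mailboxes[right].put(rank)       # send into the right neighbour's mailbox
    return mailboxes[rank].get()     # blocking receive from own mailbox


if __name__ == "__main__":
    print(run_spmd(4, ring_exchange))
```

Each UE ends up holding the rank of its left neighbour, so the behaviour differs per UE even though the program text is identical, which is the essence of SPMD.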
SCC Disclosure Demos
Financial Analytics w/ shared virtual memory
Microsoft Visual Studio
Advanced Power Management
SCC Co-Travelers Program
Currently building SCC software research community
– 100 systems total, with 40 in Oregon Datacenter
– Research partners for 2010 have been selected
SCC community website available today
– Communities.intel.com/community/marc
– To share ideas, HowTo’s, code, tools
Summary
SCC provides a unique experimental platform for many-core research
–Better support for “Cloud” data center servers
–Scale-out programming model for client
We are sharing SCC with selected researchers in academia and industry
–Documentation and presentations
http://www.intel.com/info/scc
SCC Team
Jason Howard, Saurabh Dighe, Yatin Hoskote, Sriram Vangal, David Finan, Gregory Ruhl, David Jenkins, Howard Wilson,
Nitin Borkar, Gerhard Schrom, Fabrice Pailet, Shailendra Jain, Tiju Jacob, Satish Yada, Sraven Marella, Praveen Salihundam, Vasantha Erraguntla, Michael Konow, Michael Riepen, Guido Droege, Joerg Lindemann, Matthias Gries, Thomas Apel,
Kersten Henriss, Tor Lund-Larsen, Sebastian Steibl, Shekhar Borkar, Vivek De, Rob Van Der Wijngaart, Timothy Mattson