SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

(1)

SPARC64

™

X:

Fujitsu’s New Generation 16 Core Processor

for the next generation UNIX servers

August 29, 2012

Takumi Maruyama

Processor Development Division

Enterprise Server Business Unit

Fujitsu Limited

(2)

SPARC64™ X

Agenda

Fujitsu Processor Development History

SPARC64

TM

_X



Design concept



SWoC (Software on Chip)



Processor chip overview



u-Architecture



Performance

(3)

SPARC64 SPARC64 II Tr=190M CMOS Cu 130nm Tr=30M CMOS Cu 180nm / 150nm Tr=190M0 CMOS Cu 130nm SPARC64 V SPARC64 GP Tr=30M CMOS Al 250nm / 220nm Tr=46M CMOS Cu 180nm GS8900 :Technology generation GS21 600 GS8600 GS8800B Tr=400M CMOS Cu 90nm SPARC64 VII GS21 SPARC64 V + SPARC64 VI GS8800 GS21 900 Tr=500M CMOS Cu 90nm Tr=10M CMOS Al 350nm Tr=540M CMOS Cu 90nm SPARC64™ Processor Mainframe Tr=600M CMOS Cu 65nm Tr=760M CMOS Cu 45nm H ig h Perf o rm a nce T echno lo g y H ig h Relia bility T echno lo g y

Fujitsu Processor Development

Store Ahead Branch History Prefetch Single-chip CPU Non-Blocking $ O-O-O Execution Super-Scalar L2$ on Die HPC-ACE System On Chip Hardware Barrier $ ECC Register/ALU Parity Instruction Retry $ Dynamic Degradation RC/RT/History Multi-core Multi-thread SPARC64 GP SPARC64 IXfx SPARC64 X SPARC64 VIIIfx

Virtual Machine Architecture

Software On Chip Tr=1B CMOS Cu 40nm Tr=2.95B CMOS Cu 28nm

(4)

SPARC64™ X

SPARC64™ X Design Concept



Combine UNIX and HPC FJ processor features to realize

an extremely high throughput UNIX processor.

 SPARC64 VII/VII+ (UNIX processor) feature

 High CPU frequency (up-to 3GHz)

 Multicore/Multithread

 Scalability : up-to 64sockets

 SPARC64 VIIIfx (HPC processor) feature

 HPC-ACE: Innovative ISA extensions to SPARC-V9

 High Memory B/W: peak 64GB/s, Embedded Memory Controller



Add new features vital to current and future UNIX servers

 Virtual Machine Architecture  Software On Chip

 Embedded IOC (PCI-GEN3 controller)  Direct CPU-CPU interconnect

(5)

Software on Chip 1/2



HW for SW

Accelerates specific software function with HW



The targets

 Decimal operation (IEEE754 decimal and NUMBER)  Cypher operation (AES/DES)

 Database acceleration



HW implementation

 The HW engines for SWoC are implemented in FPU • To fully utilize 128 FP registers & software pipelining

 Implemented as instructions rather than dedicated co-processor to maximize flexibility of SW.

 Avoid complication due to “CISC” type instructions

• Various “RISC” type instructions are newly defined, instead. • 18 insts. for Decimal, and 10 insts. for Cypher operation

(6)

SPARC64™ X

Software on Chip 2/2

Decimal Instructions



Supported data type

 IEEE754 DPD(Densely Packed Decimal)

8B fixed length

 NUMBER

Variable length (max 21Byte)



Instructions

 Both DPD/NUMBER instructions are defined as 8B operation (add/sub/mul/div/cmp) on FP registers

 To maximize performance with reasonable

HW cost

 When the data length is > 8byte, multiple such

instructions will be used.

 An instruction for special byte-shift on FP

registers is newly added to support unaligned NUMBER 1...1 0...0 1...1 0 0 0 0 0 0 0 0 and and Fd[rs1] Fd[rs2] Fd[rd]

(7)

SPARC64™ X Chip Overview



Architecture Features

• 16 cores x 2 threads

• SWoC (Software on Chip) • Shared 24 MB L2$

• Embedded Memory and IO

Controller



28nm CMOS

• 23.5mm x 25.0mm • 2,950M transistors • 1,500 signal pins • 3GHz



Performance (peak)

• 288GIPS/382GFlops • 102GB/s memory throughput DDR3 Interface DDR3 Interface Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core L2 Cache Data L2 Cache Data L2 Cache Control SERDES Inter-CPU SERDES PCI GEN3 MAC MAC

(8)

SPARC64™ X Instruction Set Architecture SPARC-V9/JPS HPC-ACE VM SWoC Branch Prediction 4K BRHIS 16K PHT Integer Execution Units 156 GPR x 2 + 64 GUB ALU/SHIFT x2 ALU/AGEN x2 MULT/DIVIDE x1 FP Execution Units 128 FPR x 2 + 64 FUB FMA x4, FDIV x2 IMA/Logic x4 Decimal x1 / Cypher x2 L1$ L1I$ 64KB/4way L1D$ 64KB/4way L1$ control

SPARC64™ X Core spec

L1I$ L1D$ Instruction Control Execution Unit Register File

(9)

u-Architecture enhancements

from SPARC64

TM

VII+



CPU Core

 Deeper pipeline to increase Frequency  Better Branch Prediction Scheme

 Various Queue-size and #Floating point register increase  Richer execution Units, including

 2EX + 2EAG → 2EX + 2EX/EAG

 2FMA → 4FMA to support 2way-SIMD

 SWoC engine (Decimal and Cypher)

 More aggressive O-O-O execution of load and store  Multi-banked 2port L1-Cache



System On Chip

 #core and L2$ size (4core/12MB→16core/24MB)

 Memory Controller, IO Controller, and CPU-CPU I/F are all embedded to increase performance and reduce cost.

(10)

SPARC64™ X

SPARC64

TM

VII/VII+ Pipeline

L1 I$ 64KB 2Way Branch Target Address 8Kentry Decode & Issue RSE 8x2Entry RSA 10Entry RSF 8x2Entry RSBR 10Entry GUB 32Registers GPR 156Registers x2 EXA EXB EAGA EAGB FPR 64Registers x2 FUB 48Registers FLA FLB Fetch Port 16Entry Store Port 16Entry Store Buffer 16Entry L1 D$ 64KB 2Way System Bus Interface Fetch (4 stages) Issue (2 stages) Dispatch Reg.-Read (4 stages) Execute Memory (L1$: 3 stages) CSE 64Entry Commit (2 stages) PC x2 Control Registers x2 L2$ 6MB/12MB 12Way 4-core

(11)

Control Registers PC FPR 128Registers GPR 156Registers L1 I$ 64KB 4Way Branch Target Address 4Kentry Decode & Issue RSE 24Entry RSA 24Entry RSF 20Entry RSBR 16Entry GUB 64Registers GPR 156Registers EXA EXB EAGA EXC EAGB EXD FPR 128Registers FUB 64Registers FLA Decimal Cypher FLB Fetch Port 32Entry Store Port 24Entry L1 D$ 64KB 4Way Memory Controller Fetch (4 stages) Issue (4 stages) Dispatch Reg.-Read (5 stages) Execute Memory (L1$: 3 stages) CSE 96Entry Commit (2 stages) PC Control Registers

SPARC64

TM

X Pipeline

L2$ 24MB 24Way FLC Cypher FLD Write Buffer 10Entry Pattern History Table 16Kentry IO Controller Router 16-core _…

(12)

SPARC64™ X

Execution units enhancements (Ex.)

 Integer Execution Unit

 2EX + 2EAG → 2EX + 2EX/EAG

 2 → 4W GPR

→ _{4 integer instructions can be executed} per cycle (sustained)

 Load Store Unit

 Aggressive load/store O-O-O execution:

 Execute load without waiting for preceding store

address calculation.

 Multi-banked 2port L1-cache to execute 2 load or 1 load+1 store in parallel

 Doubled L1$ bus size

 Doubled L1$ associativity (2→4way)

→ Increase L1-cache throughput and hit-rate

GUB GPR EXA EXB EXC EXD L1$ 2R/1W Commit Update …. (banked) L1 cache 16B store 16B load x2 2R/1W

(13)

SPARC64

TM

X interconnects



SPARC64

TM

_{VII/VII+ interconnects}

 4 CPU require 8 additional LSIs to be connected with DIMM

 DIMM i/f: 4.35GB/s (STREAMtriad)



SPARC64

TM

_{X interconnects}

 No additional LSIs to be connected with DIMM

 DIMM i/f: 65.6GB/s (STREAMtriad)  CPU i/f: 14.5GB/s x 5ports (peak)

• 3 ports: glueless 4way CPU interconnect

• 2 ports: > 4way CPU

SC SC MAC MAC MAC MAC ＣＰＵ CPU ＣＰＵＣＰＵ DDR2 DIMMs SC SC

SPARC64TM_{VII/VII+ interconnects}

(SPARC Enterprise M8000) SPARC64TM_{X interconnects} DDR2 DIMMs DDR2 DIMMs DDR2 DIMMs DDR3 DIMMs DDR3 DIMMs DDR3 DIMMs DDR3 DIMMs ＣＰＵＣＰＵＣＰＵＣＰＵ 102GB/s 14.5GB/s

(14)

SPARC64™ X

High Speed Transceivers (SerDes)



CPU-CPU glue-less communication links

 14.5Gb/s x 8 lanes bi-directional serial interface, 5 ports

 Embedded equalizer circuit enables long distance signal transmission

 Embedded adaptive control logic optimizes equalizer parameters automatically

depending on the various system configurations



PCI Express ports

 8Gb/s x 8 lanes (Gen 3), 2 ports

14.5Gb/s x 8lanes SerDes

TX

Tx

RX

PLL



Built-in SerDes provides peak 88.5GB/s x2 (up/down) total

(15)

Reliability, Availability, Serviceability

 New RAS features from SPARC64™ VII/VII+

 Floating-point registers are ECC protected

 #checkers increased to ~53,000 to identify a failure point more precisely

→ Guarantees Data Integrity

Units Error detection and

correction scheme

Cache (Tag) ECC

Duplicate & Parity

Cache (Data) ECC

Parity

Register ECC (INT/FP)

Parity(Others)

ALU Parity/Residue

Cache dynamic degradation

Yes HW Instruction Retry Yes

History Yes Green: 1bit error Correctable _{Yellow: 1bit error Detectable}

Gray: 1bit error harmless

(16)

SPARC64™ X

Hardware Instruction Retry

1. Error

PC

Fetch Execute Commit

PC

SW visible resources

SW visible resources 3. Single step execution

Instruction Retry

5. Back to normal execution after the re-executed Instruction gets committed without an error.

X

GPR,FPR Memory

GPR,FPR Memory Fetch Execute Commit

4. Update of SW visible resources 2. Flush CSE RSE,RSF RSA RSBR GUB,FUB IWR IBF ALU EAGA/B EXA/B FLA/B CSE RSE,RSF RSA RSBR GUB,FUB IWR IBF ALU EAGA/B EXA/B FLA/B

 When an error is detected, Hardware re-execute the instruction automatically to remove the transient error by itself.

(17)

SPARC64

TM

X Performance @3GHz

 SPARC64TM_{X realizes 7x INT/FP/JVM throughput and 15x memory}

throughput of SPARC64TM_VII+

 The INT/FP/JVM result is with un-tuned Compiler/JVM.

 SWoC of SPARC64TM_{X results in max 98x throughput.}

 The NUMBER score is for scalar. Expect to be much better for vector data.

R elat iv e to SP AR C 64 TM VI [email protected] 6GH z

Hardware measured results

7x

SWoC 98x 15x

(18)

SPARC64™ X

SPARC64

TM

X CPI

(Cycle Per Instruction)

Example

Hardware measured results

 4 integer execution units and write port increase of GPR (integer register) improves overall performance.

 Memory latency reduction, Large L2$, branch prediction, and L1$ improvement also contribute to the high performance dramatically.

SPARC64TM_{VII+ v.s. SPARC64}TM_X

INT (single thread)

Lower Performance

Higher Performance

Shorter memory latency Large L2$ 2→4way L1$ Increased throughput of L1$ Improved Branch prediction 2EX+2EAG→2EX+2EX/EAG 2→4W GPR