New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

(1)

Alan Gara

Intel Fellow

Exascale Chief Architect

(2)

Legal Disclaimer

Today’s presentations contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially.

NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

(3)

The next step in perf/$

A typical DRAM memory die (2016) of ~8 Gb will be about 100 mm^2 (as always).

A processor floating point unit is ~0.03 mm^2 (2 flops/cycle) (see below).

Even if the core is 100x bigger than the FPU, at 1.0 GB/core we have >100x more silicon in memory than in processing. This is not cost balanced.

Threading gives us a mechanism to change this balance if we have enough bandwidth to support much higher compute/memory.

New memory architectures allow us to get a significant step in perf/$.

Source: DARPA Exascale computing study, Exascale_Final_report_100208.pdf

For cost balance we need to either

1) use much less memory per compute, or

2) make the physical size of capacity much smaller.
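The silicon balance argument on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the die and FPU areas quoted above, and the function name and the core-size multipliers are assumptions chosen for the example.

```python
# Figures quoted on the slide; the core-size multiplier is an input, not a measurement.
DRAM_DIE_AREA_MM2 = 100.0       # ~8 Gb DRAM die (2016)
DRAM_DIE_CAPACITY_GBIT = 8.0
FPU_AREA_MM2 = 0.03             # one floating point unit, 2 flops/cycle


def memory_to_compute_area_ratio(gb_per_core: float, core_to_fpu_area: float) -> float:
    """Return DRAM silicon area divided by core silicon area, per core."""
    dies_per_core = gb_per_core * 8.0 / DRAM_DIE_CAPACITY_GBIT  # GB -> Gb -> dies
    memory_area = dies_per_core * DRAM_DIE_AREA_MM2
    core_area = core_to_fpu_area * FPU_AREA_MM2
    return memory_area / core_area


for multiplier in (1, 10, 100):
    ratio = memory_to_compute_area_ratio(1.0, multiplier)
    print(f"core = {multiplier:>3}x FPU area: memory/compute silicon = {ratio:,.0f}x")
```

With these inputs the ratio is roughly 3,300x against the bare FPU and roughly 33x against a core 100 times the FPU area, so the exact multiple depends strongly on the assumed core size. Halving `gb_per_core` halves the ratio, which corresponds to option 1; shrinking the area per bit corresponds to option 2.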

(4)

Big Data meets HPC

                             Big Data          HPC
Large Memory Capacity        Large             Small to Large
Bandwidth to Large Memory    Small to Large    Large
System Fabric Bandwidth      Small to Large    Large
System compute capability    Small to Large    Large

Big Data bandwidth requirements vary

• Random access requires high bandwidth

• Cacheable accesses can tolerate much lower bandwidth to memory

• Where data can be cached matters…

at processor: fabric requirements lower
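The effect of caching on bandwidth can be illustrated with a first-order model. This is a minimal sketch under stated assumptions: the function names, the hit rates, and the remote fraction are hypothetical, and the model ignores write-back traffic and cache-line granularity.

```python
def bandwidth_to_memory(demand_gb_s: float, cache_hit_rate: float) -> float:
    """Bandwidth that still has to be served by the large memory after caching."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return demand_gb_s * (1.0 - cache_hit_rate)


def bandwidth_over_fabric(demand_gb_s: float,
                          hit_rate_at_processor: float,
                          remote_fraction: float) -> float:
    """Fabric traffic when the cache sits at the processor.

    Only misses leave the processor, and only the share of them that targets
    memory on another node (remote_fraction) crosses the fabric.
    """
    misses = bandwidth_to_memory(demand_gb_s, hit_rate_at_processor)
    return misses * remote_fraction


demand = 100.0  # GB/s requested by the application (hypothetical)
print("random access   :", bandwidth_to_memory(demand, 0.0), "GB/s to memory")
print("cacheable access:", round(bandwidth_to_memory(demand, 0.9), 3), "GB/s to memory")
print("fabric traffic  :", round(bandwidth_over_fabric(demand, 0.9, 0.5), 3), "GB/s")
```

In this model a workload with no reuse needs the full demand bandwidth from memory, while a 90% hit rate at the processor cuts both memory and fabric traffic by a factor of ten.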

(5)

Will HPC and Big Data drive a different system balance point?

[Figure: Big Data cost balance versus HPC cost balance across Storage/NVM, Interconnect, Memory, and Processor. Synthetic data for illustration only.]

Variation in budgeting will remain bounded (1:10)

(6)

System commonality between Big Data and HPC

Need to understand future memory technology characteristics

Comes down to bandwidth… (assuming we have better $/bit)

DRAM-like bandwidth?

System architectures for Big Data and HPC are very similar.

Both benefit from new technologies.

Big Data will benefit more.

Architectures will not be identical, but configurability will allow for crossover.

Can be a fundamental or market microarchitecture choice

(7)

This will also drive user effort

New memory technologies replace/augment DRAM:

• Memory capacity per compute 5x-10x better than DRAM.

• Modest need for threading when new technologies are available.

• Task scaling can be effectively applied to many applications.

DRAM remains the dominant load-store memory technology:

• Memory capacity per performance drops 10x to 20x from current levels.

• Aggressive threading is commonplace/necessary.

• Programming model changes focus on thread scaling. Aggressively strive for more performance for similar task numbers.

(8)

Microarchitecture choices will drive bandwidth/capacity tradeoff

[Figure: Density versus bandwidth trade-off, comparing DRAM (approx) and NVM (optimal).]

(9)

(10)

(11)

Two Design Options For Supercomputing

A processor with >10B transistors on a die in 2020

OR

A processor with fewer transistors on a smaller, cost-effective die

(12)

Option 1: Large Die With >10B Transistors

• More cache, fewer cores, “everything integrated”
  Pro: Enables on-package memory.
  Concern: Cache size beyond a certain threshold is not utilized by the programmer.

• More cores, enough cache for HPC, “everything integrated”
  Pro: High FLOPS count on a die.
  Concern: Enough on-package memory becomes difficult to implement. Extreme performance levels result in problematic off-package memory usage.

• Flavor of cores, enough cache for HPC, “everything integrated”
  Pro: “Powerful” cores for single-thread (ST) performance, “smaller” cores for highly parallel work.
  Concern: Enough on-package memory becomes difficult to implement. Extreme performance levels result in problematic off-package memory usage.

(13)

Option 2: Cost Effective Die That Supports On-package Memory

Stacked Memory

Processor die matched to performance. Can be much smaller than memory.

Scalable fabric

“Building Block”

• Broad Usage: With the right memory capacity per building block, it can address a large portion of the HPC market

• Cost: Building blocks can replace the compute and DRAM in a node (at the right price point)

• Scalability: Configure building block as memory or memory+compute

• Power: Better thermal solution with disaggregated compute blocks
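The scalability point, configuring each building block as memory or memory+compute, can be pictured as a small data structure. The sketch below is purely illustrative: the class names, the per-block capacity, and the per-block peak are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class BlockMode(Enum):
    MEMORY_ONLY = "memory"
    MEMORY_AND_COMPUTE = "memory+compute"


@dataclass(frozen=True)
class BuildingBlock:
    memory_gb: float     # stacked memory in the block (hypothetical value)
    peak_tflops: float   # compute available if the processor die is enabled
    mode: BlockMode

    @property
    def active_tflops(self) -> float:
        """Compute contributed by this block in its configured mode."""
        if self.mode is BlockMode.MEMORY_AND_COMPUTE:
            return self.peak_tflops
        return 0.0


def summarize(blocks: Iterable[BuildingBlock]) -> Tuple[float, float]:
    """Return (total memory in GB, total active compute in TFLOPS)."""
    blocks = list(blocks)
    total_memory = sum(b.memory_gb for b in blocks)
    total_compute = sum(b.active_tflops for b in blocks)
    return total_memory, total_compute


# A node of four identical blocks, two of them used for capacity only.
node = [
    BuildingBlock(16.0, 1.0, BlockMode.MEMORY_AND_COMPUTE),
    BuildingBlock(16.0, 1.0, BlockMode.MEMORY_AND_COMPUTE),
    BuildingBlock(16.0, 1.0, BlockMode.MEMORY_ONLY),
    BuildingBlock(16.0, 1.0, BlockMode.MEMORY_ONLY),
]
memory_gb, tflops = summarize(node)
print(f"{memory_gb} GB of memory, {tflops} TFLOPS active")
```

Switching a block between the two modes changes the compute total but not the memory total, which is one way to move a node between a capacity-heavy and a compute-heavy balance.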

(14)

The Possibilities With the “Building Block” Approach

                                      At Exascale    Evolved
Cost                                  1
Memory capacity (in-package)          2 TB           300 GB
Memory capacity (outside package)     Assume none    2 TB (DDR4/5)
Number of cores                       8000           1000
Memory bandwidth (in-package)         50 TB/s        5 TB/s
Memory bandwidth (outside package)    Assume none    400 GB/s
Performance peak                      512 TF         64 TF
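A few derived ratios make the two configurations on this slide easier to compare. The sketch below takes the listed figures at face value and assumes decimal units (1 TB = 1000 GB); the class and method names are illustrative, not part of the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    name: str
    in_package_gb: float
    off_package_gb: float
    cores: int
    in_package_bw_gb_s: float
    off_package_bw_gb_s: float
    peak_tflops: float

    def gflops_per_core(self) -> float:
        return self.peak_tflops * 1000.0 / self.cores

    def in_package_bytes_per_flop(self) -> float:
        # GB/s divided by GFLOP/s gives bytes per flop.
        return self.in_package_bw_gb_s / (self.peak_tflops * 1000.0)

    def total_gb_per_tflop(self) -> float:
        return (self.in_package_gb + self.off_package_gb) / self.peak_tflops


at_exascale = NodeConfig("At Exascale", 2000.0, 0.0, 8000, 50000.0, 0.0, 512.0)
evolved = NodeConfig("Evolved", 300.0, 2000.0, 1000, 5000.0, 400.0, 64.0)

for cfg in (at_exascale, evolved):
    print(
        f"{cfg.name:<12} "
        f"{cfg.gflops_per_core():6.1f} GF/core  "
        f"{cfg.in_package_bytes_per_flop():.3f} B/flop in-package  "
        f"{cfg.total_gb_per_tflop():6.2f} GB/TF"
    )
```

Both columns work out to 64 GF per core, so the difference between them lies in core count and in how memory is attached rather than in per-core peak.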

1) On-package memory has 8-10x the bandwidth compared to external memory

2) At iso cost and memory capacity, on-package memory enables 8-10x additional compute to