Algorithmic Challenges and Opportunities for Data Analysis and Visualization in the Co-design Process

(1)

Algorithmic Challenges and Opportunities

for Data Analysis and Visualization

in the Co-design Process

Hasan Abbasi, Janine Bennett, Peer-Timo Bremer,

Varis Carey, Greg Eisenhauer, Attila Gyulassy,

Scott Klasky, Robert Moser, Todd Oliver, Manish

Parashar , Valerio Pascucci, Karsten Schwan,

Hongfeng Yu, and Matthew Wolf

(2)

(3)

Combustion Workflow

RHS of S3D solver at each Stage of an explicit time step Asynchronous movement of data or share data in memory (different levels)

In situ, in transit data analysis/viz workflow via hybrid staging

(4)

We are Building Proxy and Skeletal Apps that Enable

Empirical Evaluation of Codesign Design Choices

• Proxy App for Topology driven feature extraction

• Proxy App for Topology driven feature tracking

• Proxy App for Statistical analysis

• Proxy App for Visualization

• Proxy App for Uncertainty Quantification

(5)

Codesign Questions for

Data Analysis and Visualization Algorithms

• How much memory will be available in

situ and with what characteristics?

• Will we have hardware and runtime

support for asynchronous computation

in situ?

• Will be performance for small

messages reduced?

• What is the ratio of network

bandwidth for “in situ” vs “in transit”

communication?

• How well will modern processors

support code that is branch heavy and

flop free?

(6)

A Wide Range of Analysis and Visualization Algorithms

Are Needed for Combustion Applications

• In situ multi-variate volume and particle rendering

• Lagrangian particle querying and analysis

• Topological segmentation:

– Contour trees

– Morse-Smale complex

– Time tracking

– Scalar field comparison

• Distance field (level set)

• Filtering and averaging (spatial and temporal)

• Shape analysis

• Statistical moments (conditional)

• Statistical dimensionality reduction (joint PDFS)

• Spectra (scalar, velocity, coherency)

Flame-centric control volume analysis

(7)

We Build Reduced Topological Models for

Characterization and Tracking of Combustion Features

domain

hierarchy

Death Birth Continuation Split t t + Δt

(8)

Merge Trees Represent Feature Extraction at Different

Scales with Thresholds for Noise Removal

(9)

Visualization and Analysis Bottlenecks May

Differ from Simulation Bottlenecks

• Typically I/O bound: limited by rate at which data can be accessed

– Memory layout may significantly impact efficiency beyond traditional

cache effects seen in solvers

– FLOP vs branch ratios can vary dramatically

• Algorithms with many branches are highly data-dependent

– Feature density: how much data is relevant

– Feature distribution: balance of work load distributed

(10)

Execution Models and Data Movements Depend on

Different Flavors of Hybrid Staging

In Situ

• Co-located: complete resource sharing/contention with simulation

• Partial Sharing: different processor/core less resource contention

• Out of Band: on node with minimal resource usage (e.g. use of ooc

techniques, low priority)

In Transit

• Local: sending data to different nodes of the same machine

(11)

The Codesign Process Provides a Unique

(12)

Multi-scale Design Patterns and Execution Models Lead

to Local, and global algorithm design parameters

• Global design parameters include:

– Number of execution units

– Data aggregation patterns • N-1 communication patterns

• Use of pair-wise data exchange schemes

• Local design parameters are algorithm-specific:

– Sort First , Filter Last

– Filter First, Sort Last – Filter First, Traverse Last

(13)

We can project behaviors for different design parameters and

execution models onto prospective hardware configurations

• SST and spreadsheet models provide a projection of

communication and wall time

• Prospective configurations:

– PIM (Processing in memory, cache-less architecture) – exaNode1 (commodity NIC and memory)

– exaNode2 (custom on-board NIC + faster memory)

Performance analysis spreadsheet

ExaCT tools

(14)

We are exploring 3 use cases that cover a

range of characteristic behaviors

Statistics Visualization Topology

Local compute behavior Two-Phases: All FLOPs Two-phases: Some FLOPS, Three-phases: 2 FLOP-free, 1 FLOP-heavy Complexity is data dependent? No Yes Yes Amount of data transferred for aggregation/gather

Constant - small Can be data-dependent

Data-dependent

Scatter required? Sometimes (small data)

(15)

Topology Algorithm

• Computation of local features

– Process vertices in sorted order, detecting joining of iso-surface components

– Uses Union-Find data structure

• Communication to resolve features spanning multiple blocks

– Local merge trees are communicated in N-to-1 merges

– Corrections to local trees are re-broadcast to local compute nodes

• Feature-based statistics computation

– The segmentation stored with the corrected local merge trees are used to compute per-feature statistics

(16)

Basic Topology Measurements Obtained with Byfl

• Basic analysis for on node computation (data size 560x560x560)

num cores num points points per core Total Loads (MB) Total Stores (MB) Total FLOPs Loads/core (MB) 1000 175,616,000 175,616 4,344,617 3806745 0 4,345 7000 175,616,000 25,088 4,098,019 3589344 0 585 28000 175,616,000 6,272 3,790,376 3319704 0 135

(17)

Wait for corrections then compute Simple Higher latency

Compute-and-correct Lower latency Re-do work, more complex

Interleave compute/comm Streaming compute Asynchronous comm

N-to-1 gather then 1-to-N scatter Simple com. Model Higher latency N-to-1 gather interleaved with

1-to-N scatter

Less idle time on compute node Re-do some work N-to-1 gather interleaved with

1-to-leaves scatter

Less idle time on compute node Complex communicators

Topology Design Space Exploration

• Computation of local features – alternatives

• Communication – alternatives

• Feature-based statistics – alternatives

Feature Advantage Disadvantage

Sort-first Efficient (possible GPU) O(n) Memory for sorted indices

Progressive-sort Smaller memory footprint extra pass over data

Union-Find Efficient O(n) Union-Find data structure

(18)

Example execution model 1:

in situ data transfers on node or on network

• Topology computed directly integrated with the solver

• K-rounds of merges on in-situ processes

• Factors analyzed:

– Number of nodes

– Number of cores per node

– Number of merge operations per stage – Initial data access through shared pointer

• Free parameters:

– Communicate on network first – Communicate on network last – Size of messages

(19)

Communication Loads for Different Data Transfer Patterns

Data Size: 2025x1600x400, Kay=0.31, Binary Merges

Msg.

Si

ze

Msg.

Co

un

t

Merge stages

(20)

Communication Loads for Different Data Transfer Patterns

Data Size: 2025x1600x400, Kay=0.31, 8-way Merges

Msg.

Si

ze

Msg.

Co

un

t

Merge stages

(21)

Example execution model 2:

part in situ and part plus in transit

• Initial local compute directly integrated with the solver

• K-rounds of merges on in-situ processes

• (N-K) rounds of merging in staging area

• Predicted cost factors:

– Initial data access through shared pointer

– K-round of merges in blocking mode as part of the solver code including shared memory (on-node) and MPI messages (off-node) communications

– Data transfer to staging area

– Asynchronous computation in staging area

• Free parameters: merging strategy and staging area break

• Things to watch out for:

– Initial data transfer might pollute the cache – In-situ merge becomes sparse quickly

(22)

Communication Loads for Different Data Transfer Patterns

Data Size: 2025x1600x400, Kay=0.31

T

otal

Co

mmunic

atio

n

Merge stages

2-cores per node

4-cores per node

8-cores per node

8 -w a y merg e bi na ry merg e

(23)

Statistics Algorithm

•

1

st

_-4

th

_{order moments, variance, skewness, kurtosis, minimum and}

maximum values are values commonly computed by physics codes

• Pair-wise update formulas for 1

st

_-4

th

_{order moments allow for a}

single-pass distributed implementation

– Given moment(A) and moment(B), compute moment(A U B)

• The global model can optionally be scattered to all processes to

allow for assessment of observations (e.g. to determine outliers)

(24)

Statistics Design Space Exploration

• The algorithm is mostly FLOPs, is not data-dependent, and requires

small amounts of data to be communicated

• Small-scale, algorithm-specific design parameters

– None: local computations are straightforward implementation of update formulas

• Large-scale design parameters

– Update formulas provide complete flexibility in communication patterns

– Support arbitrary depth/width of the compute tree

• Execution model

– Initial local compute level is good candidate for insitu (all data must be transferred otherwise)

– Later local compute levels could be placed anywhere (require very small data sizes transferred: moments, minima, and maxima only)

(25)

Measurements obtained with Byfl confirm data-parallel,

scalable nature of statistics algorithms

1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08 1.00E+09 1.00E+10

1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08

op erat ion s pe r co re

points per core

HCCI-ALU HCCI-FLOP LEJ-ALU LEJ-FLOP

Data movement options: all gather or gather of 3KB per processor

Data set Dim x Dim y Dim z num cores points/core Loads/core (MB) Stores/core (B) FLOPs/core ALU ops/core mem ops/core

LEJ 2,025 1,600 400 64 20,250,000 154.5 46.28 567,000,036 1,032,750,120 40,500,024 LEJ 2,025 1,600 400 640 2,025,000 15.45 22.63 56,700,036 103,275,113 4,050,018 LEJ 2,025 1,600 400 6,400 202,500 1.545 20.26 5,670,036 10,327,612 405,017 LEJ 2,025 1,600 400 64,000 20,250 0.154 20.03 567,036 1,032,862 40,517 LEJ 2,025 1,600 400 640,000 2,025 0.016 20.00 56,736 103,387 4,050 HCCI 560 560 560 56 3,136,000 23.929 137.18 87,808,036 159,936,131 6,272,041 HCCI 560 560 560 560 313,600 2.393 31.72 8,780,836 15,993,714 627,219 HCCI 560 560 560 5,600 31,360 0.239 21.17 878,116 1,599,472 62,737 HCCI 560 560 560 56,000 3,136 0.024 20.12 87,844 160,048 6,289 HCCI 560 560 560 112,000 1,568 0.012 20.06 43,940 80,080 3,153

(26)

Visualization Algorithm

• Volume rendering of local data to generate partial images

– Cast a ray from the eye through each pixel of an image

– For each ray, sample local volume, map data into color values via transfer function, and accumulate color values.

• Parallel image compositing to combine the partial images into a

final global image

– Build communication schedule according to distribution of pixel data – Exchange pixel data via communication

(27)

Visualization Design Space Exploration

• The algorithm is marginally FLOPs, can be data-dependent, and

requires a potentially large amount of messages exchanged.

• Small-scale, algorithm-specific design parameters

– Adaptive workload of local rendering

• Features specified by users in transfer function space, features identified by analysis algorithms, data resolution for different exploration purposes – Optimization and acceleration on CPUs and/or GPUs

• Large-scale design parameters

– Optimize communication schedule according to pixel data distribution • Minimize link contention, pixel data exchanged, and blending cost

– Exploration using MPI and/or OpenMP

• Execution model

– Local rendering of simulation data could be performed in situ to minimize data movement

(28)

Visualization Analysis Results

Measurements obtained with Byfl confirm data-parallel, nearly scalable nature of local rendering. The

number of operations varies marginally across cores depending on data features and rendering parameters.

Data set Dim x Dim y Dim z num cores points/core Loads/core (MB) Stores/core (B) FLOPs/core ALU ops/core mem ops/core

LEJ 2,025 1,600 400 64 20,250,000 1,422.38 ±1.67% 891.88 ±1.80% 55,276,662±1.34% 789,879,803 ±1.62% 466,466,350 ±1.84% LEJ 2,025 1,600 400 640 2,025,000 185.53 ±1.76% 110.74 ±2.05% 7,712,093 ±1.37% 102,163,763 ±1.77% 58,18,647 ±2.04% LEJ 2,025 1,600 400 6,400 202,500 80.77 ±1.87% 50.09 ±2.21% 383,819 ±1.26% 4,921,160 ±1.82% 2,691,746 ±2.21% LEJ 2,025 1,600 400 64,000 20,250 2.89 ±4.71% 1.84 ±5.92% 106,312 ±4.86% 1,655,434 ±6.11 975,687 ±5.29% LEJ 2,025 1,600 400 640,000 2,025 0.44 ±2.91% 0.23 ±3.11% 16,851 ±1.86% 272,890 ±1.43% 132,357 ±3.33% HCCI 560 560 560 64 2,744,000 153.36 ±3.28% 101.2 ±3.54% 5,508,745 ±3.06% 87,148,662 ±3.53% 53,197,224 ±3.47% HCCI 560 560 560 640 274,400 99.54 ±1.29% 58.38 ±1.52% 436,888 ±9.34% 5,569,581 ±1.26% 3,068,742 ±1.52% HCCI 560 560 560 6,400 27,440 8.84 ±0.58% 5.83 ±0.73% 322,109 ±0.64% 4,993,671 ±0.87% 3,036,488 ±0.64% HCCI 560 560 560 64,000 2,744 3.09 ±1.75% 2.05 ±1.96% 110,723 ±1.71% 1,749,591 ±2.06% 1,070,992 ±1.87% HCCI 560 560 560 512,000 343 0.57 ±1.36% 0.38 ±1.86% 20,762 ±1.56% 325,887 ±1.98% 196,553 ±2.03%

Measurements with a middle image quality

Middle image quality

High image quality

For un-optimized image compositing, the size of messages exchanged across cores depends on image resolutions. For the same image resolution, the size of messages is nearly same for the different numbers of cores. The messages can be reduced via optimization.

(29)

Successful execution hinges on tight integration

with all the co-design components

• Separate Proxy Apps for Data Analysis and Visualization

– Integration to understand combined behavior – Further reduction to build skeletal apps

• Coordination with Data Management

– Efficient data transfer strategies

– Where are the biggest reserves in performance, energy, wall time?

• Coordination with DSL

– Improve local compute patterns

• Fast index computation

• Elimination of unnecessary branches (boundary cases)

• UQ Analysis

– How much persistent memory is required?

• Modeling capabilities

– Compilers (Byfl, ROSE)

– Simulation(SST, spreadsheet model)

• Solvers

(30)

Uncertainty Quantification

within the SDMA workflow

(31)

UQ and Data Management

The Problem: Evaluate sensitivities of quantities of interest (QoIs) with

respect to chemistry model parameters, or modeled fields (e.g.

reaction rate fields)

The

to solving this problem requires solving

P+1

forward simulations, where P is the number of sensitivity

evaluations

• P can be very large (>> 1000) makes classical approach infeasible

solve one auxiliary problem, the

adjoint

problem

, which is linearized about the primal solution.

(32)

• The challenge:

• Solving the Adjoint Problem requires the Primal State • The Adjoint Problem must be solved backwards in time.

• Must store primal state at all time substep!

(33)

To reduce Storage requirements:

• Store a limited number of Primal states (check points) • Use check points to recompute Primal state when needed Example with two checkpoints:

(34)

• Storing full Primal solution state is prohibitive (e.g. 1PB/state)

• Only interested in sensitivities in a limited region in space & time (RoI, e.g. Extinction event)

• RoIs are not known a priori

– Solve Primal problem & identify RoIs using in situ analysis

– Resolve Primal problem only checkpointing state from the RoIs

– Only solve adjoint problem in RoIs

(35)

Example from Combustion Use Case

• Naïve adjoint solution (store full state in space & time)

– Storage: 5ZB

– Compute: 2 X Primal problem

• One level checkpointing

– Storage: 4EB

• Six level checkpointing

– Storage: 50PB

• One level checkpointing on N RoIs

– Storage: (1+1.3N)PB

– Compute: (2+0.0002N) X Primal Problem

• Important trade-off between storage and recomputing

(36)

(37)

Goals for SDMA in EXACT

1. Explore data staging techniques to deal with Exascale data

2. Design questions for data staging

– Where should data for A&V be stored

– Where should the A&V operations be executed – How should SDMA integrate with the solver

– What architectural features be leveraged for SDMA

(38)

(39)

Design Space

Solver Proxy

1. Where do we move the data to

2. How do we extract data from the solver

3. What hardware features can be exploited Descriptive Stats Visualization Topological Analysis

1. What processing resources are allocated

2. How do we schedule the execution of these tasks

(40)

Storage scalability and the power wall

• Disk sizes of 29.52 TB

• Single disk bandwidth of 384.2 MB/s

• Power consumption of ~45 W/disk

• System memory 32PB

3

• A full checkpoint every hour => 106.7 TB/s I/O bandwidth =>

277,633 disks => 13 MW of power

1,2

Without even considering RAID!

1. Power use of disk subsystems in supercomputers, Curry, M.L. and Ward, H.L. and Grider, G. and Gemmill, J. and Harris, J. and Martinez, D., Proceedings of the sixth workshop on Parallel Data Storage, Nov 2012

2. G. Grider. Exa-scale FSIO. HEC-FSIO workshop presentation. August 2010

3. R. Stevens and A. White. A DOE laboratory plan for providing exascale applications and technologies for critical DOE mission needs, SciDac Workshop, July 2010

(41)

Synchronous I/O is not the solution

S3D simulation Synchronous I/O

O(1M)

cores _{every 30 minutes}1 PB/dump

• Analysis

MS-Complex • Visualization

Volume, Surface, Particle rendering • Downstream Isomap O(400 PB)/run Synchronous I/O

• Storage space requirements

• 35 disks for each dump (No RAID) • 1.5 KW/live dump

• Performance requirements

• 5% overhead, ~31k disks, >1.4 MW • 10% overhead, ~15k disks, >0.65 MW • 50% overhead, ~3k disks, >0.14 MW

(42)

What EXACTly is Staging?

• Extra stage(s) in the data pipeline

• Use available memory resources to serve as a buffer

• Use available compute resources to serve as an execution target

• Traditional staging used discrete nodes

• Used for

• high performance I/O

• managing storage variability • for application coupling

(43)

Keep data in a Shared Data Space

• Maintain data in a shared space in staging

– Shared space can share the same memory as the simulation

(44)

Managing Data Movement

• Data movement is expected to be a

bottleneck for Exascale

• Use flexible resource allocation to

optimize data movement

(45)

Statistics Visualization Topology Statistics Visualization In transit ADIOS S3D-Box

In situ Analysis and Visualization ADIOS _H y b ri d S ta g in g ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS S3D-Box

In situ Analysis and Visualization

ADIOS

S3D-Box

ADIOS

S3D-Box

ADIOS ADIOS

S3D-Box

ADIOS

Compute cores

Parallel Data Staging coupling/analytics/viz

A s y n c h ro n o u s D a ta T ra n s fe r

• Use compute and deep-memory hierarchies to

optimize overall workflow for power vs.

performance tradeoffs

• Utilize hybrid staging for analytics and visualization

• Abstract complex/deep memory hierarchy access

Statistics

Visualization

Topology

Statistics

Visualization

In transit

ADIOS S3D-Box

In situ Analysis and Visualization ADIOS

H

y

b

ri

d

S

ta

g

in

g

ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS S3D-Box

ADIOS

S3D-Box

ADIOS

S3D-Box

ADIOS ADIOS

S3D-Box

ADIOS

Compute

cores

Parallel Data Staging coupling/analytics/viz

A

s

y

n

c

h

ro

n

o

u

s

D

a

ta

T

ra

n

s

fe

r

Hybrid Staging

(46)

Hybrid Staging

• Hybrid staging is a combination of the available solutions

• Classify data processing actions into

– In line/in situ

– Asynchronous/in situ

– Asynchronous/in staging

– Asynchronous/on disk

• What about tasks that span multiple classes?

– Partition the algorithms

(47)

Resources in Hybrid Staging

• Placement of analysis and visualization tasks in a complex system

• Leverage fast/slow DRAM and Leverage local NVRAM/SSD

• Impact of network data movement compared to memory movement

• Network topology impact on performance and power

(48)

Tradeoffs for Hybrid Staging

• Going to disk is slow

even for small

application sizes

• Inline approach adds

more overhead to

application runtime

• In transit approach gives

better overall

performance

– Additional cost of data movement

• Offline: Process data after writing to disk

• In line: Process data in place

synchronously with the application • Staging: Move data to staging

resources for processing

(49)

Impact of Task Mapping

• Data-centric task

mapping

• Significant saving in

amount of data

transferred data by

• co-locating data

producers and

consumers

Concurrent coupling (CAP1: 512, CAP2: 64 cores) Sequential coupling (SAP1: 512, SAP2: 128, SAP3: 384 cores)

CAP1 data CAP2

Time SAP1 SAP2 SAP3 data data Time

(50)

Complex Memory Hierarchy

• Workflow integrates knowledge of complex memory hierarchies • Placement decisions

are important factors for evaluation

• Local NVRAM must be leveraged

• Fast-small memory vs

(51)

(52)

Impact of NVRAM

• Study how deep memory hierarchy can be used for

end-to-end I/O analytic pipeline

– How NVRAM can be used as a staging area

– How much of each level of the memory hierarchy to use

for the staging area?

– Where to move data (RAM, NVRAM, SSD, disk, network)

– When (and how frequently) to move the data over the

(53)

Tradeoff between Frequency and Costs

• Frequency of analysis impacts energy and performance of analysis

• Not a linear function

Sweet Spot – case dependent

NVRAM/disk gap

Experiments in

collaboration with Steve Poole, ORNL

(54)

Asynchronous Workflow Impact

Bitter spot – case dependent • Frequency of analysis impacts energy and performance of analysis • Not a linear function

(55)

NVRAM for C/R

215 220 225 230 235 240 245 250 250 450 600 Ex ecution tim e (m s) NVM B/W per core (MB)

Optimizing by hiding data movement to NVM

Run time (sec)

Run time - Optimized (sec)

• Experiment was conducted in a 12 core - 2.8 GHz Intel Xeon

node, with 48 GB DDR3 memory. To emulate NVM, 24 GB of

memory was used for NVM

NVM per core data

copy bandwidth

was assumed to be

450 MB/sec

(compared to 2 GB

device B/W).

(56)

Deep Memory Tradeoffs

• Driving synthetic application benchmark:

– Generates data

• Allocates two matrices in DRAM memory and fills them up with random data

– Operates with generated data – runs a kernel

• Multiplication (MUL), addition (ADD) or read access (NOP) generated matrices

– Manage generated data:

• Keeps the data in RAM memory

• Staging area (Local Fusion-IO, remote Fusion-IO, HDD, SSD, etc…) – Asynchronously runs data analysis (reads data from staging area)

• Quality of solutions

– Frequency of data analysis

(57)

Co-Design Decisions for Complex Memory

• Impact of using slow memory for SDMA processing

– Power vs Performance tradeoffs

– Size of Memory vs Speed of Memory

• Using a combination of fast memory and slow memory

– Fast memory size and speed compared to slow memory size

• Managing performance at the application level

– Tradeoff frequency of analysis with memory usage

– Use combination of asynchronous and synchronous computation

• Use knowledge of the workflow to study tradeoffs

(58)

Dealing with UQ

• Our next big target for data management

• Use analysis to select only the ROI

– Use the feature detection algorithms and their

optimization

• Ideal candidate for deep memory

– Data is not used for a long time after output

– Data access is regular and predictable

(59)

The Proxy App

• Magically generates a workflow through tracing patterns

• Underdevelopment

http://www.olcf.ornl.gov/center-projects/adios/

Original Application Communication Phase Pattern Analyzer Tracing Tools Link Skel-xml file for I/O SKEL-Code Generator Desired Benchmark User’s Changes Tracing Records