Algorithmic Challenges and Opportunities
for Data Analysis and Visualization
in the Co-design Process
Hasan Abbasi, Janine Bennett, Peer-Timo Bremer,
Varis Carey, Greg Eisenhauer, Attila Gyulassy,
Scott Klasky, Robert Moser, Todd Oliver, Manish
Parashar , Valerio Pascucci, Karsten Schwan,
Hongfeng Yu, and Matthew Wolf
Combustion Workflow
RHS of S3D solver at each Stage of an explicit time step Asynchronous movement of data or share data in memory (different levels)In situ, in transit data analysis/viz workflow via hybrid staging
We are Building Proxy and Skeletal Apps that Enable
Empirical Evaluation of Codesign Design Choices
•
Proxy App for Topology driven feature extraction
•
Proxy App for Topology driven feature tracking
•
Proxy App for Statistical analysis
•
Proxy App for Visualization
•
Proxy App for Uncertainty Quantification
Codesign Questions for
Data Analysis and Visualization Algorithms
•
How much memory will be available in
situ and with what characteristics?
•
Will we have hardware and runtime
support for asynchronous computation
in situ?
•
Will be performance for small
messages reduced?
•
What is the ratio of network
bandwidth for “in situ” vs “in transit”
communication?
•
How well will modern processors
support code that is branch heavy and
flop free?
A Wide Range of Analysis and Visualization Algorithms
Are Needed for Combustion Applications
•
In situ multi-variate volume and particle rendering
•
Lagrangian particle querying and analysis
•
Topological segmentation:
– Contour trees
– Morse-Smale complex
– Time tracking
– Scalar field comparison
•
Distance field (level set)
•
Filtering and averaging (spatial and temporal)
•
Shape analysis
•
Statistical moments (conditional)
•
Statistical dimensionality reduction (joint PDFS)
•
Spectra (scalar, velocity, coherency)
Flame-centric control volume analysis
We Build Reduced Topological Models for
Characterization and Tracking of Combustion Features
domain
hierarchy
Death Birth Continuation Split t t + ΔtMerge Trees Represent Feature Extraction at Different
Scales with Thresholds for Noise Removal
Visualization and Analysis Bottlenecks May
Differ from Simulation Bottlenecks
•
Typically I/O bound: limited by rate at which data can be accessed
– Memory layout may significantly impact efficiency beyond traditionalcache effects seen in solvers
– FLOP vs branch ratios can vary dramatically
•
Algorithms with many branches are highly data-dependent
– Feature density: how much data is relevant– Feature distribution: balance of work load distributed
Execution Models and Data Movements Depend on
Different Flavors of Hybrid Staging
In Situ
•
Co-located: complete resource sharing/contention with simulation
•
Partial Sharing: different processor/core less resource contention
•
Out of Band: on node with minimal resource usage (e.g. use of ooc
techniques, low priority)
In Transit
•
Local: sending data to different nodes of the same machine
The Codesign Process Provides a Unique
Multi-scale Design Patterns and Execution Models Lead
to Local, and global algorithm design parameters
•
Global design parameters include:
– Number of execution units– Data aggregation patterns • N-1 communication patterns
• Use of pair-wise data exchange schemes
•
Local design parameters are algorithm-specific:
– Sort First , Filter Last– Filter First, Sort Last – Filter First, Traverse Last
We can project behaviors for different design parameters and
execution models onto prospective hardware configurations
•
SST and spreadsheet models provide a projection of
communication and wall time
•
Prospective configurations:
– PIM (Processing in memory, cache-less architecture) – exaNode1 (commodity NIC and memory)
– exaNode2 (custom on-board NIC + faster memory)
Performance analysis spreadsheet
ExaCT tools
We are exploring 3 use cases that cover a
range of characteristic behaviors
Statistics Visualization Topology
Local compute behavior Two-Phases: All FLOPs Two-phases: Some FLOPS, Three-phases: 2 FLOP-free, 1 FLOP-heavy Complexity is data dependent? No Yes Yes Amount of data transferred for aggregation/gather
Constant - small Can be data-dependent
Data-dependent
Scatter required? Sometimes (small data)
Topology Algorithm
•
Computation of local features
– Process vertices in sorted order, detecting joining of iso-surface components
– Uses Union-Find data structure
•
Communication to resolve features spanning multiple blocks
– Local merge trees are communicated in N-to-1 merges– Corrections to local trees are re-broadcast to local compute nodes
•
Feature-based statistics computation
– The segmentation stored with the corrected local merge trees are used to compute per-feature statistics
Basic Topology Measurements Obtained with Byfl
•
Basic analysis for on node computation (data size 560x560x560)
num cores num points points per core Total Loads (MB) Total Stores (MB) Total FLOPs Loads/core (MB) 1000 175,616,000 175,616 4,344,617 3806745 0 4,345 7000 175,616,000 25,088 4,098,019 3589344 0 585 28000 175,616,000 6,272 3,790,376 3319704 0 135
Wait for corrections then compute Simple Higher latency
Compute-and-correct Lower latency Re-do work, more complex
Interleave compute/comm Streaming compute Asynchronous comm
N-to-1 gather then 1-to-N scatter Simple com. Model Higher latency N-to-1 gather interleaved with
1-to-N scatter
Less idle time on compute node Re-do some work N-to-1 gather interleaved with
1-to-leaves scatter
Less idle time on compute node Complex communicators
Topology Design Space Exploration
•
Computation of local features – alternatives
•
Communication – alternatives
•
Feature-based statistics – alternatives
Feature Advantage Disadvantage
Sort-first Efficient (possible GPU) O(n) Memory for sorted indices
Progressive-sort Smaller memory footprint extra pass over data
Union-Find Efficient O(n) Union-Find data structure
Example execution model 1:
in situ data transfers on node or on network
•
Topology computed directly integrated with the solver
•
K-rounds of merges on in-situ processes
•
Factors analyzed:
– Number of nodes– Number of cores per node
– Number of merge operations per stage – Initial data access through shared pointer
•
Free parameters:
– Communicate on network first – Communicate on network last – Size of messages
Communication Loads for Different Data Transfer Patterns
Data Size: 2025x1600x400, Kay=0.31, Binary Merges
Msg.
Si
ze
Msg.
Co
un
t
Merge stages
Communication Loads for Different Data Transfer Patterns
Data Size: 2025x1600x400, Kay=0.31, 8-way Merges
Msg.
Si
ze
Msg.
Co
un
t
Merge stages
Example execution model 2:
part in situ and part plus in transit
•
Initial local compute directly integrated with the solver
•
K-rounds of merges on in-situ processes
•
(N-K) rounds of merging in staging area
•
Predicted cost factors:
– Initial data access through shared pointer
– K-round of merges in blocking mode as part of the solver code including shared memory (on-node) and MPI messages (off-node) communications
– Data transfer to staging area
– Asynchronous computation in staging area
•
Free parameters: merging strategy and staging area break
•
Things to watch out for:
– Initial data transfer might pollute the cache – In-situ merge becomes sparse quickly
Communication Loads for Different Data Transfer Patterns
Data Size: 2025x1600x400, Kay=0.31
T
otal
Co
mmunic
atio
n
Merge stages
2-cores per node
4-cores per node
8-cores per node
8 -w a y merg e bi na ry merg e
Statistics Algorithm
•
1
st-4
thorder moments, variance, skewness, kurtosis, minimum and
maximum values are values commonly computed by physics codes
•
Pair-wise update formulas for 1
st-4
thorder moments allow for a
single-pass distributed implementation
– Given moment(A) and moment(B), compute moment(A U B)
•
The global model can optionally be scattered to all processes to
allow for assessment of observations (e.g. to determine outliers)
Statistics Design Space Exploration
•
The algorithm is mostly FLOPs, is not data-dependent, and requires
small amounts of data to be communicated
•
Small-scale, algorithm-specific design parameters
– None: local computations are straightforward implementation of update formulas
•
Large-scale design parameters
– Update formulas provide complete flexibility in communication patterns
– Support arbitrary depth/width of the compute tree
•
Execution model
– Initial local compute level is good candidate for insitu (all data must be transferred otherwise)
– Later local compute levels could be placed anywhere (require very small data sizes transferred: moments, minima, and maxima only)
Measurements obtained with Byfl confirm data-parallel,
scalable nature of statistics algorithms
1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08 1.00E+09 1.00E+10
1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08
op erat ion s pe r co re
points per core
HCCI-ALU HCCI-FLOP LEJ-ALU LEJ-FLOP
Data movement options: all gather or gather of 3KB per processor
Data set Dim x Dim y Dim z num cores points/core Loads/core (MB) Stores/core (B) FLOPs/core ALU ops/core mem ops/core
LEJ 2,025 1,600 400 64 20,250,000 154.5 46.28 567,000,036 1,032,750,120 40,500,024 LEJ 2,025 1,600 400 640 2,025,000 15.45 22.63 56,700,036 103,275,113 4,050,018 LEJ 2,025 1,600 400 6,400 202,500 1.545 20.26 5,670,036 10,327,612 405,017 LEJ 2,025 1,600 400 64,000 20,250 0.154 20.03 567,036 1,032,862 40,517 LEJ 2,025 1,600 400 640,000 2,025 0.016 20.00 56,736 103,387 4,050 HCCI 560 560 560 56 3,136,000 23.929 137.18 87,808,036 159,936,131 6,272,041 HCCI 560 560 560 560 313,600 2.393 31.72 8,780,836 15,993,714 627,219 HCCI 560 560 560 5,600 31,360 0.239 21.17 878,116 1,599,472 62,737 HCCI 560 560 560 56,000 3,136 0.024 20.12 87,844 160,048 6,289 HCCI 560 560 560 112,000 1,568 0.012 20.06 43,940 80,080 3,153
Visualization Algorithm
•
Volume rendering of local data to generate partial images
– Cast a ray from the eye through each pixel of an image– For each ray, sample local volume, map data into color values via transfer function, and accumulate color values.
•
Parallel image compositing to combine the partial images into a
final global image
– Build communication schedule according to distribution of pixel data – Exchange pixel data via communication
Visualization Design Space Exploration
•
The algorithm is marginally FLOPs, can be data-dependent, and
requires a potentially large amount of messages exchanged.
•
Small-scale, algorithm-specific design parameters
– Adaptive workload of local rendering• Features specified by users in transfer function space, features identified by analysis algorithms, data resolution for different exploration purposes – Optimization and acceleration on CPUs and/or GPUs
•
Large-scale design parameters
– Optimize communication schedule according to pixel data distribution • Minimize link contention, pixel data exchanged, and blending cost
– Exploration using MPI and/or OpenMP
•
Execution model
– Local rendering of simulation data could be performed in situ to minimize data movement
Visualization Analysis Results
Measurements obtained with Byfl confirm data-parallel, nearly scalable nature of local rendering. The
number of operations varies marginally across cores depending on data features and rendering parameters.
Data set Dim x Dim y Dim z num cores points/core Loads/core (MB) Stores/core (B) FLOPs/core ALU ops/core mem ops/core
LEJ 2,025 1,600 400 64 20,250,000 1,422.38 ±1.67% 891.88 ±1.80% 55,276,662±1.34% 789,879,803 ±1.62% 466,466,350 ±1.84% LEJ 2,025 1,600 400 640 2,025,000 185.53 ±1.76% 110.74 ±2.05% 7,712,093 ±1.37% 102,163,763 ±1.77% 58,18,647 ±2.04% LEJ 2,025 1,600 400 6,400 202,500 80.77 ±1.87% 50.09 ±2.21% 383,819 ±1.26% 4,921,160 ±1.82% 2,691,746 ±2.21% LEJ 2,025 1,600 400 64,000 20,250 2.89 ±4.71% 1.84 ±5.92% 106,312 ±4.86% 1,655,434 ±6.11 975,687 ±5.29% LEJ 2,025 1,600 400 640,000 2,025 0.44 ±2.91% 0.23 ±3.11% 16,851 ±1.86% 272,890 ±1.43% 132,357 ±3.33% HCCI 560 560 560 64 2,744,000 153.36 ±3.28% 101.2 ±3.54% 5,508,745 ±3.06% 87,148,662 ±3.53% 53,197,224 ±3.47% HCCI 560 560 560 640 274,400 99.54 ±1.29% 58.38 ±1.52% 436,888 ±9.34% 5,569,581 ±1.26% 3,068,742 ±1.52% HCCI 560 560 560 6,400 27,440 8.84 ±0.58% 5.83 ±0.73% 322,109 ±0.64% 4,993,671 ±0.87% 3,036,488 ±0.64% HCCI 560 560 560 64,000 2,744 3.09 ±1.75% 2.05 ±1.96% 110,723 ±1.71% 1,749,591 ±2.06% 1,070,992 ±1.87% HCCI 560 560 560 512,000 343 0.57 ±1.36% 0.38 ±1.86% 20,762 ±1.56% 325,887 ±1.98% 196,553 ±2.03%
Measurements with a middle image quality
Middle image quality
High image quality
For un-optimized image compositing, the size of messages exchanged across cores depends on image resolutions. For the same image resolution, the size of messages is nearly same for the different numbers of cores. The messages can be reduced via optimization.
Successful execution hinges on tight integration
with all the co-design components
• Separate Proxy Apps for Data Analysis and Visualization
– Integration to understand combined behavior – Further reduction to build skeletal apps
• Coordination with Data Management
– Efficient data transfer strategies
– Where are the biggest reserves in performance, energy, wall time?
• Coordination with DSL
– Improve local compute patterns
• Fast index computation
• Elimination of unnecessary branches (boundary cases)
• UQ Analysis
– How much persistent memory is required?
• Modeling capabilities
– Compilers (Byfl, ROSE)
– Simulation(SST, spreadsheet model)
• Solvers
Uncertainty Quantification
within the SDMA workflow
UQ and Data Management
The Problem: Evaluate sensitivities of quantities of interest (QoIs) with
respect to chemistry model parameters, or modeled fields (e.g.
reaction rate fields)
The
to solving this problem requires solving
P+1
forward simulations, where P is the number of sensitivity
evaluations
•
P can be very large (>> 1000) makes classical approach infeasible
solve one auxiliary problem, the
adjoint
problem
, which is linearized about the primal solution.
• The challenge:
• Solving the Adjoint Problem requires the Primal State • The Adjoint Problem must be solved backwards in time.
• Must store primal state at all time substep!
To reduce Storage requirements:
• Store a limited number of Primal states (check points) • Use check points to recompute Primal state when needed Example with two checkpoints:
• Storing full Primal solution state is prohibitive (e.g. 1PB/state)
• Only interested in sensitivities in a limited region in space & time (RoI, e.g. Extinction event)
• RoIs are not known a priori
– Solve Primal problem & identify RoIs using in situ analysis
– Resolve Primal problem only checkpointing state from the RoIs
– Only solve adjoint problem in RoIs
Example from Combustion Use Case
• Naïve adjoint solution (store full state in space & time)
– Storage: 5ZB
– Compute: 2 X Primal problem
• One level checkpointing
– Storage: 4EB
– Compute: 3 X Primal problem
• Six level checkpointing
– Storage: 50PB
– Compute: 8 X Primal problem
• One level checkpointing on N RoIs
– Storage: (1+1.3N)PB
– Compute: (2+0.0002N) X Primal Problem
• Important trade-off between storage and recomputing
Goals for SDMA in EXACT
1.
Explore data staging techniques to deal with Exascale data
2.
Design questions for data staging
– Where should data for A&V be stored
– Where should the A&V operations be executed – How should SDMA integrate with the solver
– What architectural features be leveraged for SDMA
Design Space
Solver Proxy
1. Where do we move the data to
2. How do we extract data from the solver
3. What hardware features can be exploited Descriptive Stats Visualization Topological Analysis
1. What processing resources are allocated
2. How do we schedule the execution of these tasks
Storage scalability and the power wall
• Disk sizes of 29.52 TB
• Single disk bandwidth of 384.2 MB/s
• Power consumption of ~45 W/disk
• System memory 32PB
3• A full checkpoint every hour => 106.7 TB/s I/O bandwidth =>
277,633 disks => 13 MW of power
1,2Without even considering RAID!
1. Power use of disk subsystems in supercomputers, Curry, M.L. and Ward, H.L. and Grider, G. and Gemmill, J. and Harris, J. and Martinez, D., Proceedings of the sixth workshop on Parallel Data Storage, Nov 2012
2. G. Grider. Exa-scale FSIO. HEC-FSIO workshop presentation. August 2010
3. R. Stevens and A. White. A DOE laboratory plan for providing exascale applications and technologies for critical DOE mission needs, SciDac Workshop, July 2010
Synchronous I/O is not the solution
S3D simulation Synchronous I/O
O(1M)
cores every 30 minutes 1 PB/dump
• Analysis
MS-Complex • Visualization
Volume, Surface, Particle rendering • Downstream Isomap O(400 PB)/run Synchronous I/O
• Storage space requirements
• 35 disks for each dump (No RAID) • 1.5 KW/live dump
• Performance requirements
• 5% overhead, ~31k disks, >1.4 MW • 10% overhead, ~15k disks, >0.65 MW • 50% overhead, ~3k disks, >0.14 MW
What EXACTly is Staging?
• Extra stage(s) in the data pipeline
• Use available memory resources to serve as a buffer
• Use available compute resources to serve as an execution target
• Traditional staging used discrete nodes
• Used for
• high performance I/O
• managing storage variability • for application coupling
Keep data in a Shared Data Space
• Maintain data in a shared space in staging
– Shared space can share the same memory as the simulation
Managing Data Movement
• Data movement is expected to be a
bottleneck for Exascale
• Use flexible resource allocation to
optimize data movement
Statistics Visualization Topology Statistics Visualization In transit ADIOS S3D-Box
In situ Analysis and Visualization ADIOS H y b ri d S ta g in g ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS S3D-Box
In situ Analysis and Visualization
ADIOS
ADIOS
S3D-Box
In situ Analysis and Visualization
ADIOS
ADIOS
S3D-Box
In situ Analysis and Visualization
ADIOS ADIOS
S3D-Box
In situ Analysis and Visualization
ADIOS
Compute cores
Parallel Data Staging coupling/analytics/viz
A s y n c h ro n o u s D a ta T ra n s fe r
• Use compute and deep-memory hierarchies to
optimize overall workflow for power vs.
performance tradeoffs
• Utilize hybrid staging for analytics and visualization
• Abstract complex/deep memory hierarchy access
Statistics
Visualization
Topology
Statistics
Visualization
In transit
ADIOS S3D-BoxIn situ Analysis and Visualization ADIOS
H
y
b
ri
d
S
ta
g
in
g
ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS ADIOS In transit Analysis Visualization ADIOS S3D-BoxIn situ Analysis and Visualization
ADIOS
ADIOS
S3D-Box
In situ Analysis and Visualization
ADIOS
ADIOS
S3D-Box
In situ Analysis and Visualization
ADIOS ADIOS
S3D-Box
In situ Analysis and Visualization
ADIOS
Compute
cores
Parallel Data Staging coupling/analytics/viz
A
s
y
n
c
h
ro
n
o
u
s
D
a
ta
T
ra
n
s
fe
r
Hybrid Staging
Hybrid Staging
• Hybrid staging is a combination of the available solutions
• Classify data processing actions into
– In line/in situ
– Asynchronous/in situ
– Asynchronous/in staging
– Asynchronous/on disk
• What about tasks that span multiple classes?
– Partition the algorithms
Resources in Hybrid Staging
• Placement of analysis and visualization tasks in a complex system
• Leverage fast/slow DRAM and Leverage local NVRAM/SSD
• Impact of network data movement compared to memory movement
• Network topology impact on performance and power
Tradeoffs for Hybrid Staging
• Going to disk is slow
even for small
application sizes
• Inline approach adds
more overhead to
application runtime
• In transit approach gives
better overall
performance
– Additional cost of data movement
• Offline: Process data after writing to disk
• In line: Process data in place
synchronously with the application • Staging: Move data to staging
resources for processing
Impact of Task Mapping
• Data-centric task
mapping
• Significant saving in
amount of data
transferred data by
• co-locating data
producers and
consumers
Concurrent coupling (CAP1: 512, CAP2: 64 cores) Sequential coupling (SAP1: 512, SAP2: 128, SAP3: 384 cores)CAP1 data CAP2
Time SAP1 SAP2 SAP3 data data Time
Complex Memory Hierarchy
• Workflow integrates knowledge of complex memory hierarchies • Placement decisions
are important factors for evaluation
• Local NVRAM must be leveraged
• Fast-small memory vs
Impact of NVRAM
• Study how deep memory hierarchy can be used for
end-to-end I/O analytic pipeline
– How NVRAM can be used as a staging area
– How much of each level of the memory hierarchy to use
for the staging area?
– Where to move data (RAM, NVRAM, SSD, disk, network)
– When (and how frequently) to move the data over the
Tradeoff between Frequency and Costs
• Frequency of analysis impacts energy and performance of analysis• Not a linear function
Sweet Spot – case dependent
NVRAM/disk gap
Experiments in
collaboration with Steve Poole, ORNL
Asynchronous Workflow Impact
Bitter spot – case dependent • Frequency of analysis impacts energy and performance of analysis • Not a linear function
NVRAM for C/R
215 220 225 230 235 240 245 250 250 450 600 Ex ecution tim e (m s) NVM B/W per core (MB)Optimizing by hiding data movement to NVM
Run time (sec)
Run time - Optimized (sec)
• Experiment was conducted in a 12 core - 2.8 GHz Intel Xeon
node, with 48 GB DDR3 memory. To emulate NVM, 24 GB of
memory was used for NVM
NVM per core data
copy bandwidth
was assumed to be
450 MB/sec
(compared to 2 GB
device B/W).
Deep Memory Tradeoffs
• Driving synthetic application benchmark:
– Generates data
• Allocates two matrices in DRAM memory and fills them up with random data
– Operates with generated data – runs a kernel
• Multiplication (MUL), addition (ADD) or read access (NOP) generated matrices
– Manage generated data:
• Keeps the data in RAM memory
• Staging area (Local Fusion-IO, remote Fusion-IO, HDD, SSD, etc…) – Asynchronously runs data analysis (reads data from staging area)
• Quality of solutions
– Frequency of data analysis
Co-Design Decisions for Complex Memory
•
Impact of using slow memory for SDMA processing
– Power vs Performance tradeoffs
– Size of Memory vs Speed of Memory
•
Using a combination of fast memory and slow memory
– Fast memory size and speed compared to slow memory size
•
Managing performance at the application level
– Tradeoff frequency of analysis with memory usage
– Use combination of asynchronous and synchronous computation
•
Use knowledge of the workflow to study tradeoffs
Dealing with UQ
• Our next big target for data management
• Use analysis to select only the ROI
– Use the feature detection algorithms and their
optimization
• Ideal candidate for deep memory
– Data is not used for a long time after output
– Data access is regular and predictable
The Proxy App
• Magically generates a workflow through tracing patterns
• Underdevelopment
http://www.olcf.ornl.gov/center-projects/adios/
Original Application Communication Phase Pattern Analyzer Tracing Tools Link Skel-xml file for I/O SKEL-Code Generator Desired Benchmark User’s Changes Tracing RecordsSKEL xml file for
Communication
Insert
Skel-2.0
Child of Skel that creates proxies for
X-Stack Interactions
• Increase engagement with X-Stack projects
– Dynamic Task Scheduling
– Resource allocation
– Memory and Task dependencies
• Bring Fast Forwards into the conversation
– Storage
– Complex Memory hierarchies – DRAM, NVRAM,
ScratchPads, SSDs
Outstanding Questions
• GPU/Accelerators for SDMA tasks?
– Staging nodes can be customized with additional resources
– Analysis tasks can be split similarly to in situ/in transit
• Integration with the Solver
– Data extraction impact on Solver performance
– Inline processing can impact cache
– Data copies pollute cache
What about UQ?
Solver
Store
Solution
Compute
Adjoint
Need to store the ENTIRE data set
Feed the adjoint computation in
UQ v2.0
Solver
Store
Solution
Compute
Adjoint
Store smaller
subset of solution
Recompute
u(t)
UQ v3.0
Solver
Store
Solution
Compute
Adjoint
Store even smaller
subset of solution
Recompute
u(t)
u(t), from t
ito t
i-1Filter
Domains
Identify the interesting events