Scalable Parallel Distance Field Construction for Large-Scale Applications
Hongfeng Yu (UNL), Jinrong Xie, Kwan-Liu Ma (UC Davis)
Motivation
• Distance transform
A fundamental requirement for many
applications
• Image processing
• Computational geometry
• Robotics
A critical role in visualization
• Reduce visual clutter
• Index and compress data
at extreme resolutions
reference.wolfram.com
Challenges
Definition of distance transform:
Challenges at scale:
• High communication
cost
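The definition referenced on this slide can be illustrated with a brute-force sketch. It assumes the usual Euclidean form, in which the value at each point p is the minimum distance from p to any feature point q; the function name and the point-based representation of features are illustrative choices, not taken from the slides.

```python
import math

def distance_transform(grid_points, feature_points):
    """Return, for each grid point p, the distance to its nearest feature point.

    Brute force: every grid point is compared against every feature point,
    so the cost grows with len(grid_points) * len(feature_points).
    """
    return [min(math.dist(p, q) for q in feature_points) for p in grid_points]

# Five grid points on a line with a single feature at x = 2 (made-up data).
points = [(float(x), 0.0) for x in range(5)]
features = [(2.0, 0.0)]
field = distance_transform(points, features)
```

When the feature points live on other processors, each comparison in this loop implies data movement, which is one way to read the communication cost listed above.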
Objectives
• Scalable distance transform for large-scale scientific
computing
• Support of various data types, structures, and
semantics
• Used in an in-situ setting with a parallel computing
environment
Related Work
• Selected Existing Approaches
– Complete Euclidean distance transformation by parallel operation, H. Yamada, 1984
– Fast hierarchical 3D distance transforms on the GPU, N. Cuntz and A. Kolb, 2007
– Data-parallel octrees for surface reconstruction, K. Zhou et al., 2011
• Limitations
– Limited scalability
• Communication overhead and unbalanced workload
Our Contribution
• Highly scalable parallel distance transform
– Leverage spatial and temporal coherence in simulation data
– Minimize communication cost across processors
– Achieve balanced workload among processors
– Scale up to 69,120 CPU cores on a state-of-the-art supercomputer
Our Approach
Global Tree for Workload Partition
Local Tree for Distance Computation
Our Approach
• Collect coarse global element distribution
• Construct global distance tree
• Assign leaf octants
• Construct full-grown local distance tree
• Compute distance field
• Update tree and distance field
[Figure: leaf octants assigned to processors P0 to P4]
Local Tree for Distance Computation
Parallel Distance Tree Construction
Our Approach
[Figure: Ω and Γ distributed across processors P0 to P4]
Workload Partition
Our Approach
Workload Partition
[Figure: Ω and Γ distributed across processors P0 to P4]
Collective reduction of a bitmap
A bitmap of 128 KB can represent more than 1 million blocks
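The bitmap size quoted above follows from simple arithmetic: 128 KB is 131,072 bytes, or 1,048,576 bits. The sketch below assumes one bit per block and assumes the collective reduction is a bitwise OR of the per-processor bitmaps (in an MPI code this would plausibly be an all-reduce with a bitwise-OR operator); neither detail is stated on the slide, and all names are hypothetical.

```python
from functools import reduce

NUM_BLOCKS = 1 << 20             # 1,048,576 blocks
BITMAP_BYTES = NUM_BLOCKS // 8   # 131,072 bytes = 128 KB (assumes 1 bit per block)

def make_bitmap(occupied_blocks):
    """Build one processor's bitmap: bit b is set if block b holds local elements."""
    bitmap = bytearray(BITMAP_BYTES)
    for b in occupied_blocks:
        bitmap[b >> 3] |= 1 << (b & 7)
    return bitmap

def or_reduce(bitmaps):
    """Stand-in for the collective reduction: bytewise OR over all bitmaps."""
    return reduce(lambda a, b: bytearray(x | y for x, y in zip(a, b)), bitmaps)

def is_occupied(bitmap, block):
    """Test whether a block is marked in the bitmap."""
    return bool(bitmap[block >> 3] & (1 << (block & 7)))

# Three simulated processors, each marking the blocks it owns (made-up data).
local_bitmaps = [
    make_bitmap([0, 5]),
    make_bitmap([5, 1000]),
    make_bitmap([NUM_BLOCKS - 1]),
]
global_bitmap = or_reduce(local_bitmaps)
```

After the reduction every processor would hold the same coarse picture of which blocks contain elements, at a fixed cost of 128 KB per message regardless of the data size.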
Our Approach
For each processor do:
Input: bitmap
Output: global distance tree
[Figure: blocks a to p laid out in the x-y domain]
Our Approach
[Figure: blocks a to p in the x-y domain across processors P0 to P4, shown alongside the corresponding tree of octants a to p]
For each processor do:
Z-curve
Morton code
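A Morton code interleaves the bits of a block's integer coordinates, and sorting blocks by that code visits them in Z-curve order. The sketch below is a 2D illustration matching the x-y figures on these slides; an octree would interleave three coordinates in the same way. The function name and the 16-bit coordinate width are assumptions.

```python
def morton2d(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x in even bit positions, y in odd)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Sorting a 2 x 2 arrangement of blocks by Morton code traces a "Z" shape.
blocks = [(x, y) for y in range(2) for x in range(2)]
z_order = sorted(blocks, key=lambda b: morton2d(*b))
```

Because blocks that are close on the Z-curve tend to be close in space, cutting the sorted sequence into contiguous ranges is a common way to hand spatially compact groups of leaf octants to processors.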
Our Approach
[Figure: blocks a to p in the x-y domain across processors P0 to P4, shown alongside the corresponding tree of octants a to p]
For each processor do:
Two-pass task assignment
1 Each processor handles its local data
2.1 Assign to idle processors
2.2 Assign to a processor owning a neighboring data domain
Our Approach
b c d e
f
g h
i
j
k
l m n o
a
p
For each processor do:
Construct Full-grown Local Distance Tree
[Figure: a local octant of side r and its surrounding region of side 3r]
Our Approach
For each processor do:
Compute Distance on Leaf Node Vertices
[Figure: distance from a leaf node vertex to the local elements]
Our Approach
Exchange Vertices
Exchange the vertices that need to be checked with the remote processors
[Figure: vertices exchanged between P3 and P4]
Our Approach
For each processor do:
Compute Min-distance to Remote Vertices
[Figure: remote vertices checked against the local elements on P3 and P4]
Our Approach
Processors exchange the results
[Figure: remote distances exchanged between P3 and P4]
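The local and remote steps above can be sketched for two simulated processors. This is a simplified illustration under stated assumptions: elements are treated as points rather than triangles or grid cells, every vertex is sent to the remote processor (the slides exchange only the vertices that need checking), and the final value is taken as the minimum of the local and remote distances. All names and coordinates are made up.

```python
import math

def min_distance(vertex, elements):
    """Minimum Euclidean distance from a vertex to a list of point elements."""
    return min((math.dist(vertex, e) for e in elements), default=math.inf)

# Made-up data: each processor owns some elements and some leaf node vertices.
local_elements = {"P3": [(5.0, 0.0)], "P4": [(10.0, 0.0)]}
leaf_vertices = {"P3": [(1.0, 0.0), (4.0, 0.0)], "P4": [(6.0, 0.0), (9.0, 0.0)]}

def combined_distances(owner, remote):
    """Distances for the owner's vertices after one exchange with a remote processor."""
    # Step 1: distance from each vertex to the owner's local elements.
    local_d = [min_distance(v, local_elements[owner]) for v in leaf_vertices[owner]]
    # Step 2: the vertices are "sent" to the remote processor, which evaluates
    # them against its own elements.
    remote_d = [min_distance(v, local_elements[remote]) for v in leaf_vertices[owner]]
    # Step 3: the remote results are "returned" and merged by taking the minimum.
    return [min(a, b) for a, b in zip(local_d, remote_d)]

p3_field = combined_distances("P3", "P4")
p4_field = combined_distances("P4", "P3")
```

In this data the vertex at (6, 0) on P4 is closer to P3's element than to any of P4's own, which is the situation the exchange is meant to catch.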
• Collect coarse global element distribution
• Construct initial coarse global distance tree
• Assign leaf octants
• Construct full-grown local distance tree
• Compute distance field
• Update tree and distance field
Only a marginal portion of the tree structure needs to be updated with
respect to the field evolution between two consecutive time steps.
We do not need to update the distance field of a region if there are no field
changes within the region’s triple.
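The update rule above can be sketched on a 2D grid of blocks. The sketch assumes that a region's "triple" is the neighborhood of side 3r around a block of side r shown on the earlier slide, so a block needs recomputation only if some block in its 3 x 3 neighborhood changed. The function name and the grid layout are hypothetical.

```python
def blocks_to_update(changed, nx, ny):
    """Return the blocks whose 3 x 3 neighborhood contains a changed block.

    `changed` is a set of (i, j) block indices whose field values differ from
    the previous time step; nx and ny give the grid size in blocks.
    """
    dirty = set()
    for (i, j) in changed:
        # A change in block (i, j) affects every block that has (i, j)
        # inside its own 3 x 3 neighborhood, i.e. (i, j) and its neighbors.
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < nx and 0 <= nj < ny:
                    dirty.add((ni, nj))
    return dirty

# A single change in the corner of a 4 x 4 grid touches only four blocks.
dirty = blocks_to_update({(0, 0)}, nx=4, ny=4)
```

All other blocks keep their distance values from the previous time step, which is where the savings from temporal coherence would come from.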
Integration with Combustion Simulations
H. Yu, C. Wang, R. W. Grout, J. H. Chen, and K.-L. Ma, “In situ visualization for large-scale combustion simulations,” IEEE Computer Graphics and Applications, vol. 30, no. 3, pp. 45–57, 2010.
• Based on the APIs developed in our in-situ visualization framework
• Simulation provides the size and coordinates of each processor's global domain and local partition
• Simulation provides the pointer to the buffer of the local field data
• Distance field construction module is initialized and invoked by the solver at a given rate
Evaluation
Data Set     Data Type and Scale          Mode                 System
Combustion   Volume (1.3B grid points)    In-situ processing   Hopper
Combustion   Volume (1.6B grid points)    Post-processing      Intrepid
Car          Polygon (3.4M triangles)     Post-processing      Hopper
Boeing       Polygon (350M triangles)     Post-processing      Hopper

System       Configuration
Hopper       A Cray XE6 supercomputer, Lawrence Berkeley National Laboratory;
             6384 nodes, 2.1 GHz 12-core CPUs x 2, 32 GB of RAM/node
Intrepid     An IBM Blue Gene/P supercomputer, Argonne National Laboratory
Performance
[Figure: time (sec.) versus number of cores (4320, 8640, 17280, 34560) for Simulation and Distance Field Construction]
Performance
Post-processing
[Figure: two groups of plots of time (sec.) versus number of cores. Panels in the first group: Remote distance, Exchange, Local tree, Local distance, Distance volume, Accumulated time. Panels in the second group: Local tree, Local distance, Distance volume, Accumulated time.]
Applications
Conventional Transfer Function Distance-based Transfer Function
Applications
Distance-based Transfer Function
Applications
Study material types applied on the front hood of the car:
temperature distribution at different distances from the car
Applications
Conclusion and Future Work