EXPLOITING SHARED MEMORY TO IMPROVE
PARALLEL I/O PERFORMANCE
Andrew B. Hastings
Sun Microsystems, Inc.
Alok Choudhary
Northwestern University
September 19, 2006
Outline
• Motivation
• Previous work
• New shared memory solutions
• Performance evaluation
Why Shared Memory?
• Because it's there!
> For Phase II of DARPA's High Productivity Computing
Systems program, Sun proposed a petascale shared memory system
• Opportunity to improve performance without altering
applications
> Shared memory typically has lower latency and lower
overhead (especially for small payloads) than messages
> Change just the library to use shared memory
• Interesting research area
A Common Parallel I/O Problem
• Application accesses may be noncontiguous in
memory and in the file
> If not optimized, can result in tens of thousands of small
POSIX I/O operations
• For MPI-IO, two MPI derived datatypes specify the
file and memory access patterns
[Figure: Process 1 (P1) memory, file, and Process 2 (P2) memory]
Previous Solutions – 1
• Data sieving I/O
Each process locks and reads a contiguous block, fills in the altered data, then writes back and unlocks
[Figure: P1 and P2 memory data, buffers, and file, with numbered steps 1–6; 4 I/O requests]
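The lock, read, fill-in, write-back, unlock sequence can be sketched with plain POSIX calls. This is only an illustration under stated assumptions, not the ROMIO implementation: `struct piece`, `lock_range`, and `sieve_write` are hypothetical names, the pieces are assumed sorted and non-overlapping, and a single `pread`/`pwrite` is assumed to transfer the whole block.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type: one noncontiguous piece, `len` bytes at file offset `off`. */
struct piece { off_t off; size_t len; };

/* Apply a blocking byte-range lock operation (F_WRLCK or F_UNLCK). */
static int lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);
}

/*
 * Data-sieving write (sketch). `src` holds the pieces packed back to back.
 * Assumes pieces are sorted by offset and do not overlap.
 * Returns 0 on success, -1 on error.
 */
int sieve_write(int fd, const struct piece *p, size_t n, const char *src)
{
    assert(n > 0);
    off_t start = p[0].off;
    size_t span = (size_t)(p[n - 1].off - start) + p[n - 1].len;
    char *buf = calloc(1, span);   /* zero-filled: bytes past EOF stay 0 */
    if (buf == NULL)
        return -1;

    int rc = -1;
    if (lock_range(fd, F_WRLCK, start, (off_t)span) != 0) {
        perror("fcntl(F_SETLKW)");
    } else {
        /* Read the whole contiguous block (may be short at end of file). */
        if (pread(fd, buf, span, start) >= 0) {
            /* Fill in only the altered bytes. */
            for (size_t i = 0; i < n; i++) {
                memcpy(buf + (p[i].off - start), src, p[i].len);
                src += p[i].len;
            }
            /* Write the block back in one request. */
            if (pwrite(fd, buf, span, start) == (ssize_t)span)
                rc = 0;
        }
        lock_range(fd, F_UNLCK, start, (off_t)span);
    }
    free(buf);
    return rc;
}
```

The lock is what makes the read-modify-write safe: the gaps between pieces are rewritten with whatever was read, so another writer must not touch them in between.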
Previous Solutions – 2
• List I/O
Each process creates a list of memory regions and a list of file regions; calls a new filesystem interface
• Datatype I/O
Each process creates a small data structure describing repeating regions in memory and in file; calls a new
filesystem interface
[Figure: P1 and P2 memory data and file; each process passes one (memory list, file list) pair; 2 I/O requests]
Previous Solutions – 3
• Two-phase collective
Each process sends a round of data to each aggregator. The aggregators receive the data, merge it into a buffer, and make large write calls to the filesystem; repeat until done.
[Figure: P1 and P2 (aggregator) memory data and buffer; steps: send, receive, merge, write]
Using Shared Memory: mmap
• Each process maps file into its address space,
copies data to appropriate location in mapped file
> Similar to List I/O but mostly implemented in
library
[Figure: P1 and P2 memory data copied into the mapped file with loads/stores]
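A minimal sketch of the mmap approach, assuming a POSIX system: the file is mapped shared and each noncontiguous piece is placed with an ordinary `memcpy`. The names `struct piece` and `mmap_write` are hypothetical, and the file is assumed to be at least `file_size` bytes long already.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type: one noncontiguous piece, `len` bytes at file offset `off`. */
struct piece { off_t off; size_t len; };

/*
 * Copy packed data from `src` to its file locations through a shared mapping.
 * Assumes the file already holds at least `file_size` bytes.
 * Returns 0 on success, -1 on error.
 */
int mmap_write(int fd, size_t file_size, const struct piece *p, size_t n,
               const char *src)
{
    char *map = mmap(NULL, file_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        assert((size_t)p[i].off + p[i].len <= file_size);
        memcpy(map + p[i].off, src, p[i].len);   /* plain loads/stores */
        src += p[i].len;
    }
    int rc = msync(map, file_size, MS_SYNC);     /* push dirty pages to the file */
    if (munmap(map, file_size) != 0)
        rc = -1;
    return rc;
}
```

No explicit read or write call appears in the copy loop; the operating system pages the file in and out as the mapping is touched.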
Using Shared Memory: Collectives
• Collective Shared Data: Each aggregator copies
data between its working buffer and shared application memory
• Collective Shared Buffer: Each process copies data
between its application memory and the aggregators' shared working buffers
[Figure: P1 and P2 (aggregator) memory data and buffer; loads/stores, then write]
Datatype Iterators
• Problem: copy driven by (offset, length) list:
> Huge list thrashes processor cache
> List generation expensive; delays I/O
• Solution: datatype iterator tracks position in MPI
datatype, returns next (offset, length) on demand
> State fits in handful of cache lines
> Tiny startup cost; higher traversal cost can overlap I/O
[Figure: datatype iterator with its datatype stack over an MPI datatype, returning (offset, length); marks at 0 and 983,039]
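The on-demand idea can be sketched for a simplified case. The code below is an assumption-laden illustration, not the authors' implementation: it models a datatype as nested strided levels (similar in spirit to nested MPI vector types) and keeps one index per level as the "stack". The `struct level` and `struct dtc` layouts, `dtc_init`, and the C signature of `dtc_next` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DTC_MAX_DEPTH 8

/* One nesting level: `count` children placed `stride` bytes apart. */
struct level { size_t count; size_t stride; };

/* Iterator state: a few words, independent of how many blocks exist. */
struct dtc {
    const struct level *lv;      /* levels, outermost first */
    size_t depth;                /* number of levels */
    size_t blocklen;             /* contiguous bytes at each leaf */
    size_t idx[DTC_MAX_DEPTH];   /* the "stack": current index per level */
    int done;
};

void dtc_init(struct dtc *c, const struct level *lv, size_t depth,
              size_t blocklen)
{
    assert(depth > 0 && depth <= DTC_MAX_DEPTH);
    c->lv = lv;
    c->depth = depth;
    c->blocklen = blocklen;
    c->done = 0;
    for (size_t d = 0; d < depth; d++) {
        assert(lv[d].count > 0);
        c->idx[d] = 0;
    }
}

/* Store the next (offset, length) and return 1, or return 0 when exhausted. */
int dtc_next(struct dtc *c, size_t *off, size_t *len)
{
    if (c->done)
        return 0;

    /* Current offset is the sum of index * stride over all levels. */
    size_t o = 0;
    for (size_t d = 0; d < c->depth; d++)
        o += c->idx[d] * c->lv[d].stride;
    *off = o;
    *len = c->blocklen;

    /* Advance like an odometer, innermost level fastest. */
    size_t d = c->depth;
    while (d > 0) {
        d--;
        if (++c->idx[d] < c->lv[d].count)
            return 1;
        c->idx[d] = 0;
    }
    c->done = 1;   /* wrapped past the outermost level */
    return 1;
}
```

For two planes 100 bytes apart, each with three 4-byte rows 10 bytes apart, successive calls yield offsets 0, 10, 20, 100, 110, 120. No list is ever materialized, so startup cost is a handful of stores regardless of how many blocks the type describes.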
Overlapping I/O
• Strategy: Split working buffer into sub-buffers
> After sub-buffer is filled, initiate asynchronous I/O
> Before filling next sub-buffer, wait for previous
asynchronous I/O on it to complete
> Overlaps I/O and data rearrangement!
• Performance gain for collective shared buffer on
FLASH I/O benchmark:
> 60% with lists
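The sub-buffer strategy can be sketched with POSIX asynchronous I/O. This is a sketch under assumptions, not the authors' code: `fill_fn`, `fill_pattern`, `struct subbuf`, `sub_wait`, and `overlapped_write` are hypothetical names, the fill callback stands in for the data rearrangement step, and on some older systems the program may need to be linked with `-lrt`.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NSUB 4   /* number of sub-buffers in the working buffer */

/* Hypothetical callback: produce `len` bytes destined for file offset `file_off`. */
typedef void (*fill_fn)(unsigned char *dst, size_t len, off_t file_off);

/* Example filler for demonstration only: a repeating byte pattern. */
void fill_pattern(unsigned char *dst, size_t len, off_t file_off)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (unsigned char)((file_off + (off_t)i) % 251);
}

struct subbuf { struct aiocb cb; int busy; };

/* Wait for a sub-buffer's outstanding write; 0 if it completed in full. */
static int sub_wait(struct subbuf *s)
{
    if (!s->busy)
        return 0;
    const struct aiocb *const list[1] = { &s->cb };
    while (aio_error(&s->cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    int err = aio_error(&s->cb);
    ssize_t n = aio_return(&s->cb);   /* must be called exactly once */
    s->busy = 0;
    return (err == 0 && n == (ssize_t)s->cb.aio_nbytes) ? 0 : -1;
}

/* Write `total` bytes; `work` must hold NSUB * sub_size bytes. */
int overlapped_write(int fd, unsigned char *work, size_t sub_size,
                     size_t total, fill_fn fill)
{
    assert(sub_size > 0);
    struct subbuf sub[NSUB];
    memset(sub, 0, sizeof sub);
    int rc = 0;
    size_t done = 0;

    for (size_t i = 0; done < total; i++) {
        struct subbuf *s = &sub[i % NSUB];
        /* Before refilling, wait for the previous write on this sub-buffer. */
        if (sub_wait(s) != 0) { rc = -1; break; }

        size_t len = total - done < sub_size ? total - done : sub_size;
        unsigned char *buf = work + (i % NSUB) * sub_size;
        fill(buf, len, (off_t)done);   /* rearrangement overlaps earlier writes */

        memset(&s->cb, 0, sizeof s->cb);
        s->cb.aio_fildes = fd;
        s->cb.aio_buf = buf;
        s->cb.aio_nbytes = len;
        s->cb.aio_offset = (off_t)done;
        s->cb.aio_sigevent.sigev_notify = SIGEV_NONE;
        if (aio_write(&s->cb) != 0) { perror("aio_write"); rc = -1; break; }
        s->busy = 1;
        done += len;
    }
    /* Drain every write still in flight. */
    for (int k = 0; k < NSUB; k++)
        if (sub_wait(&sub[k]) != 0)
            rc = -1;
    return rc;
}
```

While one sub-buffer is being filled, up to three others can have writes in flight, which is where the overlap between I/O and data rearrangement comes from.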
Performance Evaluation
• Hardware: Sun Fire™ 6800, 24 × 1200 MHz
processors, 150 MHz system bus, 96 GB memory, 4 × 1 Gb FC channels, 4 Sun StorEdge™ T3 disk
arrays (T3 cache disabled)
• Software: LAM 7.1.1, ROMIO 1.2.4, Solaris™ 9,
Sun StorageTek™ QFS 4.5 (3 data+1 metadata), 64-bit execution model
• Bandwidth to data arrays: < 300 MB/s
• Caveat: Buffered reads benefit from warm buffer
cache!
Tile Reader Benchmark
• Tiled display simulation
> File size 7–37 MB
> From Parallel I/O Benchmarking Consortium, Argonne
[Figure: data distribution for 2×2 tile array, showing the data read by one process; dimensions marked 270, 128, 768, 1024]
[Chart: aggregate read bandwidth (MB/s, 0–600) vs. tile array dimensions (2×2, 2×4, 3×4, 4×4, 5×4, 6×4)]
Series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), *List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf)
ROMIO 3D Block Test
• 600×600×600 array of ints block-distributed to
processes
> Uneven data distribution for some process counts
> Fixed file size: 824 MB
[Figure: data distribution for 8 processes, showing the data accessed by one process]
ROMIO 3D Block Test Results
[Chart: aggregate read bandwidth (MB/s, 0–1000) vs. number of processes (4–24)]
[Chart: aggregate write bandwidth (MB/s, 0–300) vs. number of processes (4–24)]
Series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), *List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf)
FLASH I/O Benchmark
• From Argonne/Northwestern
• Checkpoint reorganizes to group
values by variable
• 80 blocks per process
• Each element has 24 variables
[Figure: FLASH block structure on X, Y, and Z axes; a slice cut from the block shows the guard cells and the blocks to access along the X and Y axes]
[Figure: memory organization (Block 0 … Block 79, each element holding Variable 0 … Variable 23) vs. file organization (Var 0 … Var 23, each holding Block 0 … Block 79)]
FLASH I/O Benchmark Results
[Chart: aggregate write bandwidth (MB/s, 0–300) vs. number of processes (4–24); block size: 20×20×20 cells]
[Chart: aggregate write bandwidth (MB/s, 0–300) vs. number of cells along block edge (8–36); number of processes: 22]
Series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf)
Conclusion
• Combination of collective shared buffer, datatype
iterators, and sub-buffering offered best aggregate performance for several application I/O patterns
> Achieved 90% of available disk bandwidth
> 5× improvement over two-phase collective
• Rediscovered streaming I/O principles:
1. Reduce startup overhead (datatype iterators)
2. Overlap I/O and computation when possible
Future Work
• Apply datatype iterators to MPI messages
> Direct sender-to-receiver copy if shared memory
• Apply datatype iterators to data sieving and
two-phase collective in ROMIO (currently list-based)
> Could benefit traditional clusters
• Possible standardization of datatype iterators
> Required for use of datatype iterators in ROMIO if
ROMIO is to remain portable across MPI implementations
Acknowledgements
• Harriet Coverston and Anton Rang of Sun
Microsystems also contributed to this work.
• This material is based on work supported by the US
Defense Advanced Research Projects Agency under Contract No. NBCH3039002.
Datatype Iterators – Interface
• Interfaces:
> dtc_next: advance cursor to next contiguous
block, return (offset, length)
> dtc_size_seek/dtc_extent_seek: position cursor
to size or extent within datatype
> dtc_size_tell/dtc_extent_tell: return size or extent
within datatype corresponding to cursor position
• Simplifies implementation:
> Collective shared buffer 62% fewer code lines
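One possible reading of the seek and tell calls, sketched for a single strided vector (`count` blocks of `blocklen` bytes, `stride` bytes apart). The slide does not give signatures or exact semantics, so everything here is an assumption: "size" is taken to mean data bytes consumed so far, "extent" the byte position in the strided layout, and `struct vec_dtc` is a hypothetical stand-in for the real cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor over `count` blocks of `blocklen` bytes, `stride` apart. */
struct vec_dtc {
    size_t count, blocklen, stride;   /* assumes 0 < blocklen <= stride */
    size_t pos;                       /* data bytes already consumed */
};

/* Position the cursor after `size` data bytes. */
void dtc_size_seek(struct vec_dtc *c, size_t size)
{
    assert(c->blocklen > 0 && c->blocklen <= c->stride);
    assert(size <= c->count * c->blocklen);
    c->pos = size;
}

/* Data bytes consumed so far. */
size_t dtc_size_tell(const struct vec_dtc *c)
{
    return c->pos;
}

/* Layout offset (including gaps) that corresponds to the cursor position. */
size_t dtc_extent_tell(const struct vec_dtc *c)
{
    return (c->pos / c->blocklen) * c->stride + c->pos % c->blocklen;
}

/* Position the cursor at the first data byte at or after layout offset `extent`. */
void dtc_extent_seek(struct vec_dtc *c, size_t extent)
{
    size_t block = extent / c->stride;
    size_t within = extent % c->stride;
    if (within >= c->blocklen) {      /* landed in a gap: skip to next block */
        block++;
        within = 0;
    }
    if (block >= c->count) {          /* past the end of the datatype */
        block = c->count;
        within = 0;
    }
    c->pos = block * c->blocklen + within;
}

/* Store the next contiguous (offset, length) and return 1, or 0 at the end. */
int dtc_next(struct vec_dtc *c, size_t *off, size_t *len)
{
    if (c->pos >= c->count * c->blocklen)
        return 0;
    *off = dtc_extent_tell(c);
    *len = c->blocklen - c->pos % c->blocklen;   /* rest of the current block */
    c->pos += *len;
    return 1;
}
```

With 4-byte blocks 10 bytes apart, seeking to size 6 puts the cursor at layout offset 12, and the next call returns the 2 bytes left in that block. Seeking this way lets a process start mid-datatype at a chunk boundary without walking the earlier blocks.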
Datatype Iterators – Example
• Copy (non-)contiguous application data directly to
(non-)contiguous shared working buffer:
while (file_off + file_len <= end_off) {          // Entire file block still fits in current chunk
    while (file_len >= mem_len) {                 // Mem block fits in file block
        src = app_buf + mem_off;
        memcpy(dest, src, mem_len);               // Copy remaining mem block
        file_off += mem_len;
        file_len -= mem_len;
        dest += mem_len;
        (mem_off, mem_len) = dtc_next(mem_dtc);   // Get next mem block
    }
    while (mem_len >= file_len) {                 // File block fits in mem block
        dest = temp_buf + file_off - start_off;
        memcpy(dest, src, file_len);              // Copy remaining file block
        mem_off += file_len;
        mem_len -= file_len;
        src += file_len;
        (file_off, file_len) = dtc_next(file_dtc); // Get next file block
        if (file_off + file_len > end_off)
            break;
    }
}
Legend
• CSB-dt: collective shared buffer with datatype iterators
> 1 aggregator, 32 MB buffer, 4 sub-buffers
• CSB-list: collective shared buffer with lists
> 1 aggregator, 32 MB buffer, 4 sub-buffers
• CSD: collective shared data (lists)
> All processes aggregators, 32 MB buffer, no sub-buffers
• 2PC: two-phase collective (lists)
> All processes aggregators, 16 MB buffer
• DS: data sieving
> 8 MB buffer