EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE

(1)

Andrew B. Hastings

Sun Microsystems, Inc.

Alok Choudhary

Northwestern University

September 19, 2006

(2)

Outline

• Motivation

• Previous work

• New shared memory solutions

• Performance evaluation

(3)

Why Shared Memory?

• Because it's there!

> For Phase II of DARPA's High Productivity Computing Systems program, Sun proposed a petascale shared memory system

• Opportunity to improve performance without altering applications

> Shared memory typically has lower latency and lower overhead (especially for small payloads) than messages

> Change just the library to use shared memory

• Interesting research area

(4)

A Common Parallel I/O Problem

• Application accesses may be noncontiguous in memory and in the file

> If not optimized, can result in tens of thousands of small POSIX I/O operations

• For MPI-IO, two MPI derived datatypes specify the file and memory access patterns

[Figure: Process 1 (P1) memory, file, and Process 2 (P2) memory]

(5)

Previous Solutions – 1

• Data sieving I/O

Each process locks and reads a contiguous block, fills in the altered data, then writes back and unlocks

[Figure: P1 and P2 memory data, buffer, and file; 4 I/O requests]
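The read-modify-write cycle above can be sketched with plain POSIX calls. This is a minimal illustration, not ROMIO's implementation: `struct piece`, `lock_range`, and `sieve_write` are hypothetical names, the pieces are assumed to be sorted by file offset, and a short read is assumed to happen only at end of file.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* One noncontiguous piece of altered data (hypothetical type). */
struct piece {
    off_t       file_off;   /* where the piece lands in the file */
    size_t      len;        /* bytes in the piece */
    const void *data;       /* source bytes in application memory */
};

/* Lock (F_WRLCK) or unlock (F_UNLCK) the byte range [start, start + len). */
static int lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);
}

/*
 * Data-sieving write of n >= 1 pieces sorted by file_off.
 * Returns 0 on success, -1 on error.
 */
int sieve_write(int fd, const struct piece *p, size_t n)
{
    off_t  start = p[0].file_off;
    size_t span = (size_t)(p[n - 1].file_off - start) + p[n - 1].len;
    char  *buf = malloc(span);
    int    rc = -1;

    if (buf == NULL)
        return -1;
    if (lock_range(fd, F_WRLCK, start, (off_t)span) == -1) {
        free(buf);
        return -1;
    }
    /* 1. Read the whole contiguous block, holes included.
     *    Bytes past end of file stay zero. */
    memset(buf, 0, span);
    if (pread(fd, buf, span, start) >= 0) {
        /* 2. Fill in the altered data. */
        for (size_t i = 0; i < n; i++)
            memcpy(buf + (p[i].file_off - start), p[i].data, p[i].len);
        /* 3. Write the block back with one large request. */
        if (pwrite(fd, buf, span, start) == (ssize_t)span)
            rc = 0;
    }
    lock_range(fd, F_UNLCK, start, (off_t)span);
    free(buf);
    return rc;
}
```

Two I/O requests per process replace one request per piece, at the cost of transferring the unmodified holes and holding a lock across the cycle.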

(6)

Previous Solutions – 2

• List I/O

Each process creates a list of memory regions and a list of file regions; calls a new filesystem interface

• Datatype I/O

Each process creates a small data structure describing repeating regions in memory and in file; calls a new filesystem interface

[Figure: each process passes its memory data to the file with one (memory list, file list) request; 2 I/O requests]

(7)

Previous Solutions – 3

• Two-phase collective

Each process sends a round of data to each aggregator. Aggregator(s) receive and merge into a buffer, then make large write call(s) to the filesystem; repeat until done.

[Figure: P1 and P2 send memory data; P2 (aggregator) receives, merges into its buffer, and writes]

(8)

Using Shared Memory: mmap

• Each process maps file into its address space, copies data to appropriate location in mapped file

> Similar to List I/O but mostly implemented in library

[Figure: P1 and P2 copy memory data into the mapped file with loads/stores]
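A minimal sketch of the mmap approach, assuming the file has already been extended to `file_size` bytes (for example with `ftruncate`). `struct piece` and `mmap_write` are hypothetical names for illustration, not part of the library described in these slides.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* One noncontiguous piece of application data (hypothetical type). */
struct piece {
    off_t       file_off;   /* destination offset in the file */
    size_t      len;        /* bytes in the piece */
    const void *data;       /* source bytes in application memory */
};

/*
 * Copy n pieces into a shared mapping of the first file_size bytes of fd.
 * Returns 0 on success, -1 on error.
 */
int mmap_write(int fd, size_t file_size, const struct piece *p, size_t n)
{
    char *map = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    int   rc = 0;

    if (map == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < n && rc == 0; i++) {
        if ((size_t)p[i].file_off + p[i].len > file_size)
            rc = -1;                       /* piece falls outside the mapping */
        else
            memcpy(map + p[i].file_off, p[i].data, p[i].len);  /* plain stores */
    }
    /* Push the modified pages to the file before unmapping. */
    if (rc == 0 && msync(map, file_size, MS_SYNC) == -1)
        rc = -1;
    if (munmap(map, file_size) == -1)
        rc = -1;
    return rc;
}
```

No explicit write call is issued per piece; the operating system's page cache turns the stores into file I/O.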

(9)

Using Shared Memory: Collectives

• Collective Shared Data: Each aggregator copies data between its working buffer and shared application memory

• Collective Shared Buffer: Each process copies data between its application memory and the aggregators' shared working buffer(s)

[Figure: P1 and P2 move memory data into the buffer of P2 (aggregator) with loads/stores; the aggregator writes]
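The collective shared buffer idea can be illustrated without MPI. In this sketch, forked child processes stand in for MPI ranks and an anonymous shared mapping stands in for the aggregator's working buffer. How the authors' library actually shares memory between processes is not stated on the slides, so `csb_write`, the constants, and the interleaved layout are all assumptions.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC  2   /* worker processes (assumed) */
#define BLOCK  4   /* bytes per contiguous file block (assumed) */
#define ROUNDS 3   /* file blocks owned by each process (assumed) */

/*
 * Process p owns file blocks p, p + NPROC, p + 2*NPROC, ...
 * Each child stores its data straight into the shared working buffer;
 * the parent (aggregator) then issues one large write.
 */
int csb_write(int fd)
{
    size_t size = (size_t)NPROC * ROUNDS * BLOCK;
    char  *shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int    started = 0, rc = 0;

    if (shared == MAP_FAILED)
        return -1;
    for (int p = 0; p < NPROC && rc == 0; p++) {
        pid_t pid = fork();
        if (pid == -1) {
            rc = -1;
        } else if (pid == 0) {
            char app[ROUNDS * BLOCK];          /* contiguous application data */
            memset(app, 'A' + p, sizeof app);
            for (int r = 0; r < ROUNDS; r++)   /* noncontiguous in the buffer */
                memcpy(shared + ((size_t)r * NPROC + p) * BLOCK,
                       app + r * BLOCK, BLOCK);
            _exit(0);
        } else {
            started++;
        }
    }
    /* Aggregator waits until every process has filled its slots. */
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) == -1 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            rc = -1;
    }
    if (rc == 0 && pwrite(fd, shared, size, 0) != (ssize_t)size)
        rc = -1;
    munmap(shared, size);
    return rc;
}
```

No messages are exchanged: the data rearrangement is done entirely with stores into the shared buffer.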

(10)

Datatype Iterators

• Problem: copy driven by (offset, length) list:

> Huge list thrashes processor cache

> List generation expensive; delays I/O

• Solution: datatype iterator tracks position in MPI datatype, returns next (offset, length) on demand

> State fits in handful of cache lines

> Tiny startup cost; higher traversal cost can overlap I/O

[Figure: datatype iterator with a datatype stack walks the MPI datatype and yields (offset, length) pairs, in place of a list with entries 0 … 983,039]
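To make the iterator idea concrete, here is a minimal sketch over a simplified stand-in for an MPI datatype: nested strided loops around one contiguous block, similar in spirit to nested vector types. The `struct datatype` layout and the `dt_iter_*` names are hypothetical and are not the `dtc_*` interface of the authors' library.

```c
#include <stddef.h>

#define DT_MAX_DEPTH 8

/* Hypothetical datatype: nested strided loops around a contiguous block. */
struct dt_level {
    size_t count;    /* repetitions at this level */
    size_t stride;   /* byte distance between repetitions */
};

struct datatype {
    int             depth;                /* number of levels, outermost first */
    struct dt_level level[DT_MAX_DEPTH];
    size_t          blocklen;             /* bytes in the innermost block */
};

/* Iterator state: one counter per level, a small fixed-size stack. */
struct dt_iter {
    const struct datatype *dt;
    size_t                 idx[DT_MAX_DEPTH];
    int                    done;
};

void dt_iter_init(struct dt_iter *it, const struct datatype *dt)
{
    it->dt = dt;
    it->done = 0;
    for (int i = 0; i < DT_MAX_DEPTH; i++)
        it->idx[i] = 0;
    for (int i = 0; i < dt->depth; i++)
        if (dt->level[i].count == 0)
            it->done = 1;                 /* empty datatype */
}

/* Store the next (offset, length) and return 1, or return 0 when exhausted. */
int dt_iter_next(struct dt_iter *it, size_t *off, size_t *len)
{
    const struct datatype *dt = it->dt;
    size_t                 o = 0;
    int                    i;

    if (it->done)
        return 0;
    for (i = 0; i < dt->depth; i++)
        o += it->idx[i] * dt->level[i].stride;
    *off = o;
    *len = dt->blocklen;
    /* Advance like an odometer, innermost level first. */
    i = dt->depth - 1;
    while (i >= 0 && ++it->idx[i] == dt->level[i].count) {
        it->idx[i] = 0;
        i--;
    }
    if (i < 0)
        it->done = 1;
    return 1;
}
```

The state is a few dozen bytes regardless of how many blocks the datatype describes, whereas a precomputed list grows with the block count.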

(11)

Overlapping I/O

• Strategy: Split working buffer into sub-buffers

> After sub-buffer is filled, initiate asynchronous I/O

> Before filling next sub-buffer, wait for previous asynchronous I/O on it to complete

> Overlaps I/O and data rearrangement!

• Performance gain for collective shared buffer on FLASH I/O benchmark:

> 60% with lists
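The sub-buffer strategy can be sketched with POSIX asynchronous I/O. The slides do not say which asynchronous interface the library uses, so `aio_write` is an assumption here, as are `write_overlapped`, `fill_pattern`, and the buffer sizes (far smaller than the 32 MB working buffer in the legend). On glibc older than 2.34, link with `-lrt`.

```c
#define _XOPEN_SOURCE 700
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define NSUB    4      /* number of sub-buffers (assumed) */
#define SUBSIZE 4096   /* bytes per sub-buffer (assumed) */

/* Stand-in for data rearrangement: pack len bytes for file offset off. */
void fill_pattern(char *dst, size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (char)((off + i) % 251);
}

/* Wait for one asynchronous write; return 0 if it wrote every byte. */
static int wait_write(struct aiocb *cb)
{
    const struct aiocb *const list[1] = { cb };

    while (aio_error(cb) == EINPROGRESS)
        if (aio_suspend(list, 1, NULL) == -1 && errno != EINTR)
            return -1;
    return aio_return(cb) == (ssize_t)cb->aio_nbytes ? 0 : -1;
}

/* Write total bytes to fd starting at offset 0, overlapping fill and I/O. */
int write_overlapped(int fd, size_t total,
                     void (*fill)(char *dst, size_t off, size_t len))
{
    char         sub[NSUB][SUBSIZE];
    struct aiocb cb[NSUB];
    int          busy[NSUB] = { 0 };
    int          rc = 0;
    size_t       off = 0;

    for (int i = 0; off < total && rc == 0; i = (i + 1) % NSUB) {
        size_t len = total - off < SUBSIZE ? total - off : SUBSIZE;

        /* Before refilling sub-buffer i, wait for its previous write. */
        if (busy[i]) {
            busy[i] = 0;
            if ((rc = wait_write(&cb[i])) != 0)
                break;
        }
        fill(sub[i], off, len);   /* runs while other writes are in flight */
        memset(&cb[i], 0, sizeof cb[i]);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf = sub[i];
        cb[i].aio_nbytes = len;
        cb[i].aio_offset = (off_t)off;
        if (aio_write(&cb[i]) == -1)
            rc = -1;
        else
            busy[i] = 1;
        off += len;
    }
    /* Drain outstanding writes before the sub-buffers go out of scope. */
    for (int i = 0; i < NSUB; i++)
        if (busy[i] && wait_write(&cb[i]) != 0)
            rc = -1;
    return rc;
}
```

With four sub-buffers, up to three writes can be in flight while the fourth sub-buffer is being filled.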

(12)

Performance Evaluation

• Hardware: Sun Fire™ 6800, 24 × 1200 MHz processors, 150 MHz system bus, 96 GB memory, 4 × 1 Gb FC channels, 4 Sun StorEdge™ T3 disk arrays (T3 cache disabled)

• Software: LAM 7.1.1, ROMIO 1.2.4, Solaris™ 9, Sun StorageTek™ QFS 4.5 (3 data + 1 metadata), 64-bit execution model

• Bandwidth to data arrays: < 300 MB/s

• Caveat: Buffered reads benefit from warm buffer cache!

(13)

Tile Reader Benchmark

• Tiled display simulation

> File size 7–37 MB

> From Parallel I/O Benchmarking Consortium, Argonne

[Figure: data distribution for a 2×2 tile array, showing the data read by one process; dimension labels 270, 128, 768, 1024]

[Plot: aggregate read bandwidth (MB/s, 0–600) vs. tile array dimensions (2×2, 2×4, 3×4, 4×4, 5×4, 6×4). Series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), *List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf)]

(14)

ROMIO 3D Block Test

• 600×600×600 array of ints block-distributed to processes

> Uneven data distribution for some process counts

> Fixed file size: 824 MB (consistent with 4-byte ints: 600³ × 4 = 864,000,000 bytes ≈ 824 MiB)

[Figure: data distribution for 8 processes, showing the data accessed by one process]

(15)

ROMIO 3D Block Test Results

[Plots: aggregate read bandwidth (MB/s, 0–1000) and aggregate write bandwidth (MB/s, 0–300) vs. number of processes (4–24). Series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), *List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf)]

(16)

FLASH I/O Benchmark

• From Argonne/Northwestern

• Checkpoint reorganizes to group values by variable

• 80 blocks per process

• Each element has 24 variables

[Figure: FLASH block structure with X, Y, and Z axes (indices 0–7), guard cells, and a slice cut from the block; memory organization and file organization, labeled Variable 0 … Variable 23 and Block 0 … Block 79]

(17)

FLASH I/O Benchmark Results

[Plots: aggregate write bandwidth (MB/s, 0–300) vs. number of processes (4–24) for a block size of 20×20×20 cells, and vs. number of cells along block edge (8–36) for 22 processes. Series: CSB-dt (dir), CSB-list (dir), CSD (dir), mmap (buf), List I/O (buf), 2PC (dir), 2PC (buf), *DS (dir), *DS (buf)]

(18)

Conclusion

• Combination of collective shared buffer, datatype iterators, and sub-buffering offered best aggregate performance for several application I/O patterns

> Achieved 90% of available disk bandwidth

> 5× improvement over two-phase collective

• Rediscovered streaming I/O principles:

1. Reduce startup overhead (datatype iterators)

2. Overlap I/O and computation when possible

(19)

Future Work

• Apply datatype iterators to MPI messages

> Direct sender-to-receiver copy if shared memory

• Apply datatype iterators to data sieving and two-phase collective in ROMIO (currently list-based)

> Could benefit traditional clusters

• Possible standardization of datatype iterators

> Required for use of datatype iterators in ROMIO if ROMIO is to remain portable across MPI implementations

(20)

Acknowledgements

• Harriet Coverston and Anton Rang of Sun Microsystems also contributed to this work.

• This material is based on work supported by the US Defense Advanced Research Projects Agency under Contract No. NBCH3039002.

(21)

Andrew B. Hastings

[email protected]

Alok Choudhary

[email protected]

(22)

Datatype Iterators – Interface

• Interfaces:

> dtc_next: advance cursor to next contiguous block, return (offset, length)

> dtc_size_seek/dtc_extent_seek: position cursor to size or extent within datatype

> dtc_size_tell/dtc_extent_tell: return size or extent within datatype corresponding to cursor position

• Simplifies implementation:

> Collective shared buffer: 62% fewer code lines

(23)

Datatype Iterators – Example

• Copy (non-)contiguous application data directly to (non-)contiguous shared working buffer:

    /* Slide shorthand: "(off, len) = dtc_next(...)" stands for fetching
       the next contiguous block from a datatype cursor. */
    while (file_off + file_len <= end_off) {       // Entire file block still
                                                   // fits in current chunk
        while (file_len >= mem_len) {              // Mem block fits in file block
            src = app_buf + mem_off;
            dest = temp_buf + file_off - start_off;   // Recompute: file block may have changed
            memcpy(dest, src, mem_len);            // Copy remaining mem block
            file_off += mem_len;
            file_len -= mem_len;
            dest += mem_len;
            (mem_off, mem_len) = dtc_next(mem_dtc);   // Get next mem block
        }

        while (mem_len >= file_len) {              // File block fits in mem block
            src = app_buf + mem_off;               // Recompute: mem block may have changed
            dest = temp_buf + file_off - start_off;
            memcpy(dest, src, file_len);           // Copy remaining file block
            mem_off += file_len;
            mem_len -= file_len;
            src += file_len;
            (file_off, file_len) = dtc_next(file_dtc);   // Get next file block
            if (file_off + file_len > end_off)     // Next file block belongs to next chunk
                break;
        }
    }

(24)

Legend

• CSB-dt: collective shared buffer with datatype iterators

> 1 aggregator, 32 MB buffer, 4 sub-buffers

• CSB-list: collective shared buffer with lists

> 1 aggregator, 32 MB buffer, 4 sub-buffers

• CSD: collective shared data (lists)

> All processes aggregators, 32 MB buffer, no sub-buffers

• 2PC: two-phase collective (lists)

> All processes aggregators, 16 MB buffer

• DS: data sieving

> 8 MB buffer