COSC 6374 Parallel Computation

Parallel I/O (I): I/O basics

Edgar Gabriel, Fall 2012

Concept of a cluster

[Figure: compute nodes, each with memory, Processor 1, Processor 2, Network card 1, Network card 2 and local disks, connected by a message passing network and an administrative network]


I/O Problem (I)

• Every node has its own local disk

• Most applications require data and executable to be locally available

– e.g. an MPI application using multiple nodes requires

• executable to be available on all nodes

• in the same directory

• using the same name

• Multiple processes need to access the same file

– potentially different portions

– efficiency

Basic characteristics of storage devices

• Capacity: amount of data a device can store

• Transfer rate or bandwidth: amount of data a device can read or write per unit of time

• Access time or latency: delay before the first byte is moved

Prefix      Abbreviation  Base ten  Base two
kilo, kibi  K, Ki         10^3      2^10 = 1024
Mega, mebi  M, Mi         10^6      2^20
Giga, gibi  G, Gi         10^9      2^30
Tera, tebi  T, Ti         10^12     2^40
Peta, pebi  P, Pi         10^15     2^50


UNIX File Access Model

• A file is a sequence of bytes

• When a program opens a file, the file system establishes a file pointer. The file pointer is an integer indicating the position in the file where the next byte will be written/read.

• Disk drives read and write data in fixed-sized units (disk sectors)

• File systems allocate space in blocks, each of which is a fixed number of contiguous disk sectors.

• In UNIX based file systems, the blocks that hold data are listed in an inode. An inode contains the information needed to find all the blocks that belong to a file.

• If a file is too large for an inode to hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced.
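The file pointer and the inode metadata described above can be observed from user space. The following is a minimal sketch using Python's `os` module on a scratch file; the file contents are purely illustrative, and the inode number reported depends on the file system.

```python
import os
import tempfile

# Create a scratch file; its contents are illustrative only.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")                    # file pointer advances by 11 bytes
    pos_after_write = os.lseek(fd, 0, os.SEEK_CUR)  # query the current position

    os.lseek(fd, 6, os.SEEK_SET)                    # move the file pointer to byte 6
    tail = os.read(fd, 5)                           # read 5 bytes starting at byte 6
    pos_after_read = os.lseek(fd, 0, os.SEEK_CUR)   # pointer moved by the read

    info = os.fstat(fd)                             # metadata kept in the file's inode
    print("position after write:", pos_after_write)
    print("bytes read from offset 6:", tail)
    print("inode number:", info.st_ino, "size in bytes:", info.st_size)
finally:
    os.close(fd)
    os.remove(path)
```

Every read and write moves the file pointer implicitly, which is why two processes sharing one pointer would interfere with each other.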

Write operations

• Write:

– the file system copies bytes from the user buffer into a system buffer.

– When the buffer fills up, the system sends the data to disk

• System buffering

+ allows file systems to collect full blocks of data before sending to disk

+ File system can send several blocks at once to the disk (delayed write or write behind)

- Data still held in the buffer is lost in the case of a system crash

- For very large write operations, the additional copy from user to system buffer could/should be avoided
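The effect of buffering on when data actually leaves the application can be sketched as follows. This example uses Python's own user-space buffer as a stand-in for the buffering described above; the file name and the 64 KiB buffer size are assumptions chosen for illustration. `flush()` hands the bytes to the operating system, and `os.fsync()` asks the operating system to push its buffers to the device.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "buffered.dat")       # hypothetical file name

    with open(path, "wb", buffering=64 * 1024) as f:   # 64 KiB user-space buffer
        f.write(b"x" * 100)                         # small write: held in the buffer
        size_before_flush = os.path.getsize(path)   # typically still 0

        f.flush()                                   # hand buffered bytes to the OS
        size_after_flush = os.path.getsize(path)    # the file now shows 100 bytes

        os.fsync(f.fileno())                        # ask the OS to write to the device

print(size_before_flush, size_after_flush)
```

Until `fsync` returns, the data may exist only in memory, which is the crash risk listed among the drawbacks.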


Read operations

• Read:

– File system determines which blocks contain the requested data

– Read blocks from disk into system buffer

– Copy data from system buffer into user memory

• System buffering:

+ file system always reads a full block (file caching)

+ If the application reads data sequentially, prefetching (read ahead) can improve performance

- Prefetching harms performance if the application has a random access pattern.
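The first step of a read, determining which blocks contain the requested bytes, is simple integer arithmetic. The sketch below assumes a block size of 4096 bytes; the function name `blocks_for_request` is hypothetical.

```python
def blocks_for_request(offset, length, block_size=4096):
    """Return the indices of the blocks that hold bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

# A 100-byte read starting at byte 4090 straddles a block boundary,
# so two full blocks must be fetched into the system buffer.
print(blocks_for_request(4090, 100))    # [0, 1]
print(blocks_for_request(8192, 4096))   # [2]
```

Because whole blocks are always read, a small request can cause far more data to move than the application asked for.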

Dealing with disk latency: caching and buffering

• Avoids repeated access to the same block

• Allows a file system to smooth out I/O behavior

• Helps to hide the latency of the hard drives

• Lowers the performance of I/O operations for irregular access

• Non-blocking I/O gives users control over prefetching and delayed writing

– Initiate read/write operations as soon as possible

– Wait for the completion of the read/write operations only when absolutely necessary.
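The initiate-early, wait-late pattern can be sketched with a background thread. This is only an emulation of non-blocking I/O under the assumption that a thread pool is acceptable; it is not the interface of any particular parallel I/O library, and the file name and the dummy computation are illustrative.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_file(path, data):
    """Blocking write, executed in a background thread."""
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.dat")   # hypothetical output file
    data = bytes(1024 * 1024)                # 1 MiB of zeros

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(write_file, path, data)  # initiate the write early
        checksum = sum(range(1000))                    # overlap: unrelated computation
        written = pending.result()                     # wait only when it is needed

    size_on_disk = os.path.getsize(path)

print(written, size_on_disk, checksum)
```

The computation between `submit` and `result` is where the latency of the write is hidden.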


Improving disk bandwidth: disk striping

• Utilize multiple hard drives

• Split a file into fixed-size chunks and distribute them across all disks

• Three relevant parameters:

– Stripe factor: number of disks

– Stripe depth: size of each block

– Which disk contains the first block of the file

[Figure: Block 1, Block 2, Block 3, …, Block n of a file distributed across Disk 1 to Disk 4]
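The three striping parameters are enough to locate any byte of a file. The following sketch assumes a round-robin distribution and numbers disks from 0 (the figure numbers them from 1); the function name `locate` is hypothetical.

```python
def locate(offset, stripe_factor, stripe_depth, first_disk=0):
    """Map a byte offset in the file to (disk index, byte offset on that disk)."""
    block = offset // stripe_depth                # which block of the file
    disk = (first_disk + block) % stripe_factor   # round-robin over the disks
    row = block // stripe_factor                  # blocks already stored on that disk
    return disk, row * stripe_depth + offset % stripe_depth

depth = 64 * 1024                                 # assumed stripe depth of 64 KiB
print(locate(0, 4, depth))                        # (0, 0)
print(locate(5 * depth + 10, 4, depth))           # (1, 65546)
print(locate(5 * depth + 10, 4, depth, first_disk=2))   # (3, 65546)
```

Changing `first_disk` shifts which disk receives the first block without changing the offset within each disk.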

Disk striping

• Ideal assumption

b(N, p) = p * b(N/p, 1)

with

N: number of bytes to be written

b: bandwidth

p: number of disks

• Realistically:

b(N,p) < p * b(N/p,1)

since

– N is often not large enough to fully utilize p hard drives

– networking overhead
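The first reason can be illustrated with a toy cost model. Everything in this sketch is an assumption made for illustration: each disk costs a fixed latency plus transfer time, the stripe depth is 64 KiB, the latency is 5 ms and the transfer rate is 100 MB/s. Networking overhead is not modelled at all.

```python
import math

def bandwidth(n_bytes, disks, depth=64 * 1024, latency=0.005, rate=100e6):
    """Effective bandwidth in bytes/s for n_bytes striped over `disks` disks."""
    blocks = math.ceil(n_bytes / depth)
    busiest = math.ceil(blocks / disks)      # blocks on the most loaded disk
    seconds = latency + min(n_bytes, busiest * depth) / rate
    return n_bytes / seconds

def ratio(n_bytes, disks):
    """b(N, p) divided by the ideal value p * b(N/p, 1)."""
    return bandwidth(n_bytes, disks) / (disks * bandwidth(n_bytes // disks, 1))

small = ratio(64 * 1024, 4)     # one stripe block: only one of four disks is used
large = ratio(1024 ** 3, 4)     # 1 GiB: all four disks are kept busy
print(f"small request: {small:.2f} of ideal, large request: {large:.2f} of ideal")
```

In this model a request that fits into a single stripe block stays below the ideal value, while a large request reaches it.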


Two levels of disk striping (I)

• Using a RAID controller

– Hardware

– typically a ‘single box’

– number of disks: 3…n

Redundant arrays of independent disks (RAID)

• Goals:

– improve reliability of an I/O system

– improve performance of an I/O system

• Several RAID levels defined

• RAID 0: disk striping without redundant storage (often mentioned alongside “JBOD” = just a bunch of disks, although JBOD strictly refers to disks used without striping)

– No fault tolerance

– Good for high transfer rates

• i.e. read/write bandwidth of a single large file

– Good for high request rates

• i.e. access time to many (small) files

• RAID 1: mirroring

– All data is replicated on two or more disks

– Does not improve write performance and improves read performance only moderately


RAID level 2

• RAID 2: Hamming codes

– Each group of data bits has several check bits appended to it forming Hamming code words

– Each bit of a Hamming code word is stored on a separate disk

– Very high additional costs: e.g. up to 50% additional capacity required

• Hardly used today since parity-based codes are faster and easier

RAID level 3

• Parity based protection:

– Based on exclusive OR (XOR)

– Reversible

– Example

    01101010 (data byte 1)
XOR 11001001 (data byte 2)
---------------------------
    10100011 (parity byte)

– Recovery

    11001001 (data byte 2)
XOR 10100011 (parity byte)
---------------------------
    01101010 (recovered data byte 1)
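The worked example above can be reproduced directly with the XOR operator. This short sketch uses the same two data bytes; the variable names are illustrative.

```python
d1 = 0b01101010            # data byte 1
d2 = 0b11001001            # data byte 2

parity = d1 ^ d2           # parity byte
recovered = d2 ^ parity    # XOR with the parity undoes the XOR with d2

print(format(parity, "08b"))      # 10100011
print(format(recovered, "08b"))   # 01101010
```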


RAID level 3 (cont.)

• Data divided evenly into N subblocks (N = number of disks, typically 4 or 5)

• Computing parity bytes generates an additional subblock

• Subblocks written in parallel on N+1 disks

• For best performance data should be of size (N * sector size)

• Problems with RAID level 3:

– All disks are always participating in every operation => contention for applications with high access rates

– If data size is less than N*sector size, system has to read old subblocks to calculate the parity bytes

• RAID level 3 good for high transfer rates
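The subblock scheme can be sketched in a few lines. This is a simplified model that assumes the data length is a multiple of N and works on byte strings in memory rather than on disk sectors; the helper names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def split_with_parity(data, n):
    """Divide data evenly into n subblocks and append one parity subblock."""
    size = len(data) // n          # assumes len(data) is a multiple of n
    subblocks = [data[i * size:(i + 1) * size] for i in range(n)]
    return subblocks + [xor_blocks(subblocks)]

def rebuild(stripes, lost):
    """Recompute the subblock at index `lost` from the N surviving subblocks."""
    return xor_blocks([s for i, s in enumerate(stripes) if i != lost])

stripes = split_with_parity(b"ABCDEFGHIJKLMNOP", 4)   # 4 data disks + 1 parity disk
restored = rebuild(stripes, lost=2)                   # pretend disk 2 failed
print(restored)                                       # b'IJKL'
```

The same `rebuild` call also regenerates the parity subblock if the parity disk is the one that is lost.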

RAID level 4

• Parity bytes for N disks calculated and stored

• Parity bytes are stored on a separate disk

• Files are not necessarily distributed over N disks

• For read operations:

– Determine disks for the requested blocks

– Read data from these disks

• For write operations:

– Retrieve the old data from the sector being overwritten

– Retrieve parity block from the parity disk

– Extract old data from the parity block using XOR operations

– Add the new data to the parity block using XOR

– Store new data

– Store new parity block

• Bottleneck: parity disk is involved in every operation
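The write sequence above updates the parity without reading the other data disks. The sketch below checks that this read-modify-write update gives the same parity as recomputing it from all blocks; the block contents and the function name `small_write` are illustrative.

```python
def small_write(old_data, old_parity, new_data):
    """Return the new parity block after overwriting one data block."""
    without_old = bytes(p ^ d for p, d in zip(old_parity, old_data))  # extract old data
    return bytes(p ^ d for p, d in zip(without_old, new_data))        # add new data

blocks = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]          # three data disks
parity = bytes(a ^ b ^ c for a, b, c in zip(*blocks))     # content of the parity disk

new_block = b"\xaa\x55"
new_parity = small_write(blocks[1], parity, new_block)    # touches one data disk + parity
blocks[1] = new_block

recomputed = bytes(a ^ b ^ c for a, b, c in zip(*blocks)) # full recomputation
print(new_parity == recomputed)                           # True
```

Each such write needs two reads and two writes, and every one of them involves the parity disk.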


RAID level 5

• Same as RAID 4, but parity blocks are distributed on different disks

Block 1   Block 2   Block 3   Block 4      P(1,2,3,4)
Block 5   Block 6   Block 7   P(5,6,7,8)   Block 8
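The layout above can be generated with a simple rotation rule. The rule used here, moving the parity block one disk to the left per row, is an assumption that reproduces the two rows shown; actual RAID 5 implementations differ in how they rotate parity. The function name `raid5_row` is hypothetical.

```python
def raid5_row(stripe, disks):
    """Return the labels stored on each disk for one stripe (row)."""
    data_per_row = disks - 1
    parity_disk = (disks - 1 - stripe) % disks   # assumed rotation: one disk left per row
    first = stripe * data_per_row + 1            # blocks are numbered from 1
    labels = [f"Block {b}" for b in range(first, first + data_per_row)]
    labels.insert(parity_disk, f"P({first}..{first + data_per_row - 1})")
    return labels

for stripe in range(3):
    print(raid5_row(stripe, disks=5))
```

Because the parity block moves from row to row, no single disk is involved in every write.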

RAID level 6

• Tolerates the loss of more than one disk

• Collection of several techniques

• E.g. P+Q parity: store parity bytes using two different algorithms and store the two parity blocks on different disks

• E.g. Two dimensional parity

[Figure: two dimensional parity layout with parity disks]


RAID level 10

• Is RAID level 1 + RAID level 0

[Figure: RAID 10 layout, labels: RAID 1 mirroring, RAID 0 striping]

• Also available: RAID 53 (RAID 0 + RAID 3)

Comparing RAID levels

RAID level  Protection     Space usage          Good at          Poor at
0           None           N                    Performance      Data protection
1           Mirroring      2N                   Data protection  Space efficiency
2           Hamming codes  ~1.5N                Transfer rate    Request rate
3           Parity         N+1                  Transfer rate    Request rate
4           Parity         N+1                  Read req. rate   Write perf.
5           Parity         N+1                  Request rate     Transfer rate
6           P+Q or 2-D     (N+2) or (MN+M+N)    Data protection  Write perf.
10          Mirroring      2N                   Performance      Space efficiency
53          Parity         N + striping factor  Performance      Space efficiency


Two levels of disk striping (II)

• Using a parallel file system

– exposes the individual units capable of handling data

• often called storage servers, I/O nodes, etc.

– each storage server might use multiple hard drives under the hood to increase its read/write bandwidth

– a metadata server keeps track of which parts of a file are on which storage server

– a single disk failure is less of a problem if each server uses a RAID 5 storage system under the hood

Parallel File Systems: Conceptual overview

[Figure: compute nodes, a metadata server, and storage servers 0 to 3]


File access on a parallel file system

[Figure: message flow between a compute node and the metadata server]

• Compute node: application calls write()

• Compute node: OS requests list of relevant I/O nodes for this write operation

• Metadata server: sends storage IDs, offsets etc.

• Compute node: OS sends data to storage servers
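The exchange above can be sketched as a toy model in which storage servers are in-memory byte arrays. The class `MetadataServer`, the function `client_write`, the round-robin layout and the stripe depth of 4 bytes are all assumptions made for illustration; they do not describe the protocol of any particular parallel file system.

```python
class MetadataServer:
    """Toy metadata server: knows the striping layout, stores no file data."""

    def __init__(self, num_servers, stripe_depth):
        self.num_servers = num_servers
        self.stripe_depth = stripe_depth

    def layout(self, offset, length):
        """Return (server id, offset on server, length) pieces covering the request."""
        pieces = []
        end = offset + length
        while offset < end:
            block = offset // self.stripe_depth
            within = offset % self.stripe_depth
            n = min(self.stripe_depth - within, end - offset)
            server = block % self.num_servers
            server_offset = (block // self.num_servers) * self.stripe_depth + within
            pieces.append((server, server_offset, n))
            offset += n
        return pieces

def client_write(md, servers, offset, data):
    """What the OS on the compute node does for one write() call."""
    sent = 0
    for server, server_offset, n in md.layout(offset, len(data)):   # ask the MD server
        store = servers[server]
        if len(store) < server_offset + n:                          # grow the toy "disk"
            store.extend(bytes(server_offset + n - len(store)))
        store[server_offset:server_offset + n] = data[sent:sent + n]  # send the piece
        sent += n
    return sent

md = MetadataServer(num_servers=4, stripe_depth=4)
servers = [bytearray() for _ in range(4)]
client_write(md, servers, 0, b"abcdefghijklmnopqrst")
print([bytes(s) for s in servers])
```

Note that the metadata server only answers the layout query; the data itself travels directly between the compute node and the storage servers.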

Disk striping

• Requirements to improve performance of I/O operations using disk striping:

– Multiple physical disks

– Have to balance network bandwidth and I/O bandwidth

• Problem of simple disk striping:

– for a fixed file size, the number of disks which can be used in parallel is limited

• Prominent parallel file systems

– PVFS2

– Lustre

– GPFS

– NFS v4.2 (new standard currently being ratified)


Distributed vs. Parallel File Systems

• Distributed File Systems

– Offer access to a collection of files on remote machines

– Typically client-server based approach

– Transparent for the user

• NFS – The Network File System

– Protocol for a remote file service

– Stateless server (v3)

– Communication based on RPC (Remote Procedure Call)

– NFS provides session semantics: changes to an open file are initially only visible to the process that modified the file

– File locking not part of NFS protocol (v3) but often available through a separate protocol/daemon

– Client caching not part of the NFS protocol (v3): implementation dependent behavior

Network File System (NFS)

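[Figure: message flow between the compute node (= NFS client) and the NFS server]

• Compute node (NFS client): application calls write()

• Compute node (NFS client): OS forwards data to NFS server

• NFS server: NFS daemon receives data

• NFS server: NFS daemon calls write()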
