COSC 6374 Parallel Computation
Parallel I/O (I) – I/O basics
Edgar Gabriel
Fall 2012
Concept of a cluster
[Figure: compute node with memory, processor 1, processor 2, network card 1, network card 2 and local disks, attached to a message passing network and an administrative network]
Parallel Computation Edgar Gabriel
I/O Problem (I)
• Every node has its own local disk
• Most applications require data and executable to be locally available
– e.g. an MPI application using multiple nodes requires
• executable to be available on all nodes
• in the same directory
• using the same name
• Multiple processes need to access the same file – potentially different portions
– efficiency
Basic characteristics of storage devices
• Capacity: amount of data a device can store
• Transfer rate or bandwidth: amount of data a device can read or write per unit of time
• Access time or latency: delay before the first byte is moved
Prefix        Abbreviation   Base ten   Base two
kilo, kibi    K, Ki          10^3       2^10 = 1024
Mega, mebi    M, Mi          10^6       2^20
Giga, gibi    G, Gi          10^9       2^30
Tera, tebi    T, Ti          10^12      2^40
Peta, pebi    P, Pi          10^15      2^50
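The gap between the base-ten and base-two prefixes grows with the size of the prefix. The following is a small sketch of that arithmetic; the function name to_binary_units is made up for this illustration.

```python
# Decimal (base ten) and binary (base two) prefixes from the table above.
DECIMAL = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}

def to_binary_units(count, decimal_prefix):
    """Express `count` decimal units (e.g. 1 'T' = 10^12 bytes) in the matching binary unit."""
    n_bytes = count * DECIMAL[decimal_prefix]
    return n_bytes / BINARY[decimal_prefix + "i"]

for prefix in DECIMAL:
    print(f"1 {prefix}B = {to_binary_units(1, prefix):.4f} {prefix}iB")
```

A disk sold as 1 TB therefore holds only about 0.91 TiB.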
UNIX File Access Model
• A File is a sequence of bytes
• When a program opens a file, the file system establishes a file pointer. The file pointer is an integer indicating the position in the file where the next byte will be read or written.
• Disk drives read and write data in fixed-sized units (disk sectors)
• File systems allocate space in blocks, each of which is a fixed number of contiguous disk sectors.
• In UNIX-based file systems, the blocks that hold data are listed in an inode. An inode contains the information needed to find all the blocks that belong to a file.
• If a file is too large for a single inode to hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced.
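The file pointer and the inode metadata can be observed from user space. The sketch below uses Python's os module, which wraps the corresponding UNIX system calls; the scratch file and its content are made up for the example.

```python
import os
import tempfile

# Create a small scratch file (hypothetical content, for illustration only).
fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")

# lseek() with offset 0 relative to the current position reports the file pointer.
pos_after_write = os.lseek(fd, 0, os.SEEK_CUR)   # pointer was advanced by the write

os.lseek(fd, 4, os.SEEK_SET)                     # move the pointer to byte 4
data = os.read(fd, 3)                            # reads bytes 4..6
pos_after_read = os.lseek(fd, 0, os.SEEK_CUR)    # pointer was advanced by the read

info = os.fstat(fd)                              # metadata associated with the file
print("file pointer after write:", pos_after_write)
print("read", data, "-> file pointer now", pos_after_read)
print("inode number:", info.st_ino, "size in bytes:", info.st_size)

os.close(fd)
os.remove(path)
```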
Write operations
• Write:
– the file system copies bytes from the user buffer into a system buffer.
– When the system buffer fills up, the system sends the data to disk
• System buffering
+ allows file systems to collect full blocks of data before sending to disk
+ File system can send several blocks at once to the disk (delayed write or write behind)
- Data still held in the system buffer is lost in the case of a system crash
- For very large write operations, the additional copy from user to system buffer could/should be avoided
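The two stages of a write (copy into the system buffer, then transfer to disk) map onto two separate calls on UNIX-like systems. A minimal sketch, assuming a scratch file in a temporary directory and an arbitrary 4096-byte payload:

```python
import os
import tempfile

payload = b"x" * 4096                                # user buffer (illustrative size)
path = os.path.join(tempfile.mkdtemp(), "out.dat")   # hypothetical scratch file

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
written = os.write(fd, payload)   # bytes are handed from the user buffer to the operating system
os.fsync(fd)                      # ask the OS to push its buffered data for this file to the device
os.close(fd)
print("bytes handed to the system:", written)
```

Without the fsync() call the data may sit in the system buffer for some time (delayed write), which is the crash risk listed above.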
Read operations
• Read:
– File system determines which blocks contain the requested data
– Read blocks from disk into system buffer
– Copy data from system buffer into user memory
• System buffering:
+ file system always reads a full block (file caching)
+ If the application reads data sequentially, prefetching (read ahead) can improve performance
- Prefetching is harmful to performance if the application has a random access pattern.
Dealing with disk latency: Caching and buffering
• Avoids repeated access to the same block
• Allows a file system to smooth out I/O behavior
• Helps to hide the latency of the hard drives
• Lowers the performance of I/O operations for irregular access
• Non-blocking I/O gives users control over prefetching and delayed writing
– Initiate read/write operations as soon as possible
– Wait for the read/write operations to finish only when absolutely necessary.
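The "initiate early, wait late" pattern can be sketched with a worker thread that performs an ordinary blocking read. This only emulates non-blocking I/O; it is not the interface of any particular parallel I/O library, and the file name and the computation are placeholders.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    """Blocking read, executed in a worker thread."""
    with open(path, "rb") as f:
        return f.read()

# Hypothetical input file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "input.dat")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 4)

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(read_file, path)     # initiate the read as early as possible
    partial = sum(i * i for i in range(1000))  # unrelated computation overlaps with the read
    data = pending.result()                    # wait only when the data is actually needed

print(len(data), partial)
```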
Improving Disk Bandwidth: disk striping
• Utilize multiple hard drives
• Split a file into fixed-size chunks and distribute them across all disks
• Three relevant parameters:
– Stripe factor: number of disks
– Stripe depth: size of each block
– Which disk contains the first block of the file
[Figure: Block 1, Block 2, Block 3, … Block n of a file distributed across Disk 1 to Disk 4]
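Given the three parameters above, the location of any byte of the file can be computed. The sketch below assumes round-robin placement of the chunks, which is one common convention and not necessarily what a given file system does; the function name locate is made up.

```python
def locate(offset, stripe_factor, stripe_depth, first_disk=0):
    """Map a byte offset in the file to (disk index, byte offset on that disk).

    Assumes fixed-size chunks placed round-robin, starting at `first_disk`.
    """
    chunk = offset // stripe_depth                 # which chunk of the file
    disk = (first_disk + chunk) % stripe_factor    # chunks rotate over the disks
    row = chunk // stripe_factor                   # earlier chunks already on that disk
    return disk, row * stripe_depth + offset % stripe_depth

# 4 disks, 64 KiB chunks, file starts on disk 0 (illustrative values).
for off in (0, 65536, 262144, 300000):
    print(off, "->", locate(off, stripe_factor=4, stripe_depth=65536))
```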
Disk striping
• Ideal assumption
    b(N, p) = p * b(N/p, 1)
  with
    N: number of bytes to be written
    b: bandwidth
    p: number of disks
• Realistically:
    b(N, p) < p * b(N/p, 1)
since
– N is often not large enough to fully utilize p hard drives
– networking overhead
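The first reason can be made concrete with a crude model: a write of N bytes can keep at most ceil(N / stripe depth) disks busy. The sketch assumes the write starts at a chunk boundary and ignores network overhead entirely; the function names are made up.

```python
import math

def disks_used(n_bytes, stripe_depth, p):
    """Number of disks that receive at least one chunk of an N-byte write."""
    return min(p, math.ceil(n_bytes / stripe_depth))

def fraction_of_ideal(n_bytes, stripe_depth, p):
    """Upper bound on the fraction of the ideal p-fold bandwidth, counting only busy disks."""
    return disks_used(n_bytes, stripe_depth, p) / p

# 8 disks with 64 KiB stripe depth (illustrative values).
for n in (128 * 1024, 1024 * 1024):
    print(n, disks_used(n, 65536, 8), fraction_of_ideal(n, 65536, 8))
```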
Two levels of disk striping (I)
• Using a RAID controller
– Hardware
– typically a ‘single box’
– number of disks: 3…n
Redundant arrays of independent disks (RAID)
• Goals: improve reliability and performance of an I/O system
• Several RAID levels defined
• RAID 0: disk striping without redundant storage (often grouped with “JBOD” = just a bunch of disks, although JBOD strictly refers to disks used without striping)
– No fault tolerance
– Good for high transfer rates
• i.e. read/write bandwidth of a single large file
– Good for high request rates
• i.e. access time to many (small) files
• RAID 1: mirroring
– All data is replicated on two or more disks
– Does not improve write performance and improves read performance only moderately
RAID level 2
• RAID 2: Hamming codes
– Each group of data bits has several check bits appended to it forming Hamming code words
– Each bit of a Hamming code word is stored on a separate disk
– Very high additional costs: e.g. up to 50% additional capacity required
• Hardly used today, since parity-based codes are faster and easier
RAID level 3
• Parity based protection:
– Based on exclusive OR (XOR)
– Reversible
– Example
        01101010 (data byte 1)
    XOR 11001001 (data byte 2)
    ---
        10100011 (parity byte)
– Recovery
        11001001 (data byte 2)
    XOR 10100011 (parity byte)
    ---
        01101010 (recovered data byte 1)
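The example can be reproduced in a few lines. The sketch below computes the parity of the two data bytes and then recovers data byte 1 from the surviving byte and the parity; the helper name parity is made up.

```python
from functools import reduce

def parity(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data1 = bytes([0b01101010])
data2 = bytes([0b11001001])

p = parity([data1, data2])       # parity byte
recovered = parity([data2, p])   # XOR of the survivors yields the lost byte

print(format(p[0], "08b"), format(recovered[0], "08b"))   # prints 10100011 01101010
```

The same function works for any number of blocks, because XOR-ing all surviving blocks with the parity block cancels everything except the missing block.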
RAID level 3 (cont.)
• Data divided evenly into N subblocks (N = number of disks, typically 4 or 5)
• Computing parity bytes generates an additional subblock
• Subblocks written in parallel on N+1 disks
• For best performance data should be of size (N * sector size)
• Problems with RAID level 3:
– All disks are always participating in every operation => contention for applications with high access rates
– If data size is less than N*sector size, system has to read old subblocks to calculate the parity bytes
• RAID level 3 good for high transfer rates
RAID level 4
• Parity bytes for N disks calculated and stored
• Parity bytes are stored on a separate disk
• Files are not necessarily distributed over N disks
• For read operations:
– Determine disks for the requested blocks
– Read data from these disks
• For write operations
– Retrieve the old data from the sector being overwritten
– Retrieve parity block from the parity disk
– Extract old data from the parity block using XOR operations
– Add the new data to the parity block using XOR
– Store new data
– Store new parity block
• Bottleneck: parity disk is involved in every operation
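The write procedure above updates the parity without reading the other data disks. A minimal sketch of the parity update, with made-up two-byte blocks on three data disks; the final comparison checks the result against a parity recomputed from all data blocks.

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, old_parity, new_data):
    """Return the new parity block when one data block is overwritten."""
    without_old = xor(old_parity, old_data)   # extract the old data from the parity block
    return xor(without_old, new_data)         # add the new data to the parity block

# Three data disks plus one parity disk (illustrative contents).
blocks = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
old_parity = xor(xor(blocks[0], blocks[1]), blocks[2])

new_block = b"\xff\x00"
new_parity = small_write(blocks[1], old_parity, new_block)
blocks[1] = new_block

recomputed = xor(xor(blocks[0], blocks[1]), blocks[2])
print(new_parity == recomputed)
```

Every such update reads and writes the parity disk, which is where the bottleneck comes from.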
RAID level 5
• Same as RAID 4, but parity blocks are distributed on different disks
[Figure: block layout across five disks, one row per stripe]
Block 1   Block 2   Block 3   Block 4      P(1,2,3,4)
Block 5   Block 6   Block 7   P(5,6,7,8)   Block 8
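A layout of this kind can be generated by moving the parity block one disk to the left with every stripe. That rotation rule is an assumption chosen to match the two stripes shown; real controllers use several different rotation schemes. The function name raid5_layout is made up.

```python
def raid5_layout(n_data_disks, n_stripes):
    """Return one row of labels per stripe, with the parity block rotating left."""
    n_disks = n_data_disks + 1
    rows, block = [], 1
    for stripe in range(n_stripes):
        parity_disk = (n_data_disks - stripe) % n_disks   # assumed rotation rule
        first, last = block, block + n_data_disks - 1     # data blocks covered by this parity
        row = []
        for disk in range(n_disks):
            if disk == parity_disk:
                row.append(f"P({first}..{last})")
            else:
                row.append(f"Block {block}")
                block += 1
        rows.append(row)
    return rows

for row in raid5_layout(n_data_disks=4, n_stripes=3):
    print("  ".join(f"{label:<10}" for label in row))
```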
RAID level 6
• Tolerates the loss of more than one disk
• Collection of several techniques
• E.g. P+Q parity: store parity bytes using two different algorithms and store the two parity blocks on different disks
• E.g. Two dimensional parity
[Figure: two-dimensional parity layout showing the parity disks]
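Two-dimensional parity can be sketched by arranging the data disks in an M x N grid with one parity disk per row and one per column, which gives MN + M + N disks in total. In the sketch each disk is reduced to a single byte and the values are made up. Two disks of the same row are lost, which row parity alone could not repair, and both are recovered through their column parity.

```python
from functools import reduce

def xor_all(values):
    """XOR of a list of integers."""
    return reduce(lambda a, b: a ^ b, values)

# M x N grid of data "disks" (M = 2, N = 3), one byte each for illustration.
grid = [[0x11, 0x22, 0x33],
        [0x44, 0x55, 0x66]]
row_parity = [xor_all(row) for row in grid]          # M row parity disks
col_parity = [xor_all(col) for col in zip(*grid)]    # N column parity disks

# Lose two data disks in the same row.
lost = [(0, 0), (0, 2)]
damaged = [row[:] for row in grid]
for r, c in lost:
    damaged[r][c] = None

# Each lost disk is the only loss in its column, so column parity recovers it.
for r, c in lost:
    survivors = [damaged[i][c] for i in range(len(damaged)) if i != r]
    damaged[r][c] = xor_all(survivors + [col_parity[c]])

total_disks = len(grid) * len(grid[0]) + len(row_parity) + len(col_parity)
print(damaged == grid, total_disks)
```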
RAID level 10
• Is RAID level 1 + RAID level 0
[Figure: RAID 1 mirroring combined with RAID 0 striping]
• Also available: RAID 53 (RAID 0 + RAID 3)
Comparing RAID levels
RAID level   Protection      Space usage           Good at             Poor at
0            None            N                     Performance         Data protection
1            Mirroring       2N                    Data protection     Space efficiency
2            Hamming codes   ~1.5N                 Transfer rate       Request rate
3            Parity          N+1                   Transfer rate       Request rate
4            Parity          N+1                   Read request rate   Write performance
5            Parity          N+1                   Request rate        Transfer rate
6            P+Q or 2-D      (N+2) or (MN+M+N)     Data protection     Write performance
10           Mirroring       2N                    Performance         Space efficiency
53           Parity          N + striping factor   Performance         Space efficiency
Two levels of disk striping (II)
• Using a parallel file system
– exposes the individual units capable of handling data
• often called storage servers, I/O nodes, etc.
– each storage server might use multiple hard drives under the hood to increase its read/write bandwidth
– a metadata server keeps track of which parts of a file are on which storage server
– a single disk failure is less of a problem if each server uses a RAID 5 storage system under the hood
Parallel File Systems: Conceptual overview
[Figure: compute nodes connected to a metadata server and to storage servers 0, 1, 2 and 3]
File access on a parallel file system
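[Figure: message sequence between compute node and metadata server]
– Compute node: application calls write()
– Compute node to metadata server: OS requests list of relevant I/O nodes for this write operation
– Metadata server to compute node: MD server sends storage IDs, offsets etc.
– Compute node: OS sends data to storage servers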
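The write path can be imitated with a toy in-memory model. All class and function names below are hypothetical and do not correspond to the interface of any real parallel file system; round-robin striping over the storage servers is assumed.

```python
class StorageServer:
    def __init__(self):
        self.chunks = {}                       # local offset -> bytes

    def store(self, local_offset, data):
        self.chunks[local_offset] = data

class MetadataServer:
    """Knows the striping layout; never touches file data."""
    def __init__(self, n_servers, stripe_depth):
        self.n_servers, self.stripe_depth = n_servers, stripe_depth

    def lookup(self, offset, length):
        """Return (server id, local offset, file offset, size) for each piece of the request."""
        pieces, pos, end = [], offset, offset + length
        while pos < end:
            chunk = pos // self.stripe_depth
            within = pos % self.stripe_depth
            size = min(self.stripe_depth - within, end - pos)
            local = (chunk // self.n_servers) * self.stripe_depth + within
            pieces.append((chunk % self.n_servers, local, pos, size))
            pos += size
        return pieces

def client_write(md, servers, offset, data):
    """What the OS on the compute node does when the application calls write()."""
    for server_id, local, file_off, size in md.lookup(offset, len(data)):  # ask the metadata server
        start = file_off - offset
        servers[server_id].store(local, data[start:start + size])          # send data to a storage server

servers = [StorageServer() for _ in range(4)]
md = MetadataServer(n_servers=4, stripe_depth=4)
client_write(md, servers, 0, b"AAAABBBBCCCCDDDDEEEE")
print([s.chunks for s in servers])
```

The metadata server is contacted once per request and the data itself flows directly between the compute node and the storage servers.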
Disk striping
• Requirements to improve performance of I/O operations using disk striping:
– Multiple physical disks
– Have to balance network bandwidth and I/O bandwidth
• Problem of simple disk striping:
– for a fixed file size, the number of disks which can be used in parallel is limited
• Prominent parallel file systems
– PVFS2
– Lustre
– GPFS
– NFS v4.2 (new standard currently being ratified)
Distributed vs. Parallel File Systems
• Distributed File Systems
– Offer access to a collection of files on remote machines
– Typically client-server based approach
– Transparent for the user
• NFS – The Network File System
– Protocol for a remote file service
– Stateless server (v3)
– Communication based on RPC (Remote Procedure Call)
– NFS provides session semantics: changes to an open file are initially only visible to the process that modified the file
– File locking not part of NFS protocol (v3) but often available through a separate protocol/daemon
– Client caching not part of the NFS protocol (v3): implementation dependent behavior
Network File System (NFS)
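[Figure: message sequence between compute node (= NFS client) and NFS server]
– NFS client: application calls write()
– NFS client to NFS server: OS forwards data to NFS server
– NFS server: NFS daemon receives data
– NFS server: NFS daemon calls write()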