Edgar Gabriel
COSC 6374
Parallel Computation Parallel I/O (I) –
I/O basics
Edgar Gabriel Spring 2008
Concept of a clusters
Compute node
message passing network administrative network
Memory
Processor 1
Processor 2
Network card 1Network card 2 local disks
COSC 6374 – Parallel Computation Edgar Gabriel
I/O Problem (I)
• Every node has its own local disk – no “globally visible” file-system
• Most applications require data and executable to be locally available
– e.g. an MPI application using 100 nodes requires the executable to be available on all nodes in the same directory using the same name
I/O problem (II)
• Current processor performance: e.g. Pentium 4 – 3 GHz ~ 6GFLOPS
– Memory Bandwidth: 133 MHz * 4 * 64Bit ~ 4.26 GB/s
• Current network performance:
– Gigabit Ethernet: latency ~ 40 µs, bandwidth=125MB/s – InfiniBand 4x: latency ~ 5 µs, bandwidth =1GB/s
• Disc performance:
– Latency: 7-12 ms
– Bandwidth: ~20MB/sec – 60 MB/sec
COSC 6374 – Parallel Computation Edgar Gabriel
Basic characteristics of storage devices
• Capacity: amount of data a device can store
• Transfer rate or bandwidth: amount of data at which a device can read/write in a certain amount of time
• Access time or latency: delay before the first byte is moved
Prefix Abbreviation Base ten Base two
kilo, kibi K, Ki 10^3 2^10=1024
Mega, mebi M, Mi 10^6 2^20
Giga, gibi G, Gi 10^9 2^30
Tera, tebi T, Ti 10^12 2^40
Peta, pebi P, Pi 10^15 2^50
UNIX File Access Model (I)
• A File is a sequence of bytes
• When a program opens a file, the file system establishes a file pointer. The file pointer is an integer indicating the position in the file, where the next byte will be
written/read.
• Multiple processes can open a file concurrently. Each process will have its own file pointer.
• No conflicts occur, when multiple processes read the same file.
• If several processes write at the same location, most UNIX file systems guarantee sequential consistency. (The data from one of the processes will be available in the file, but not a mixture of several processes).
COSC 6374 – Parallel Computation Edgar Gabriel
UNIX File Access Model (II)
• Disk drives read and write data in fixed-sized units (disk sectors)
• File systems allocate space in blocks, which is a fixed number of contiguous disk sectors.
• In UNIX based file systems, the blocks that hold data are listed in an inode. An inode contains the
information needed to find all the blocks that belong to a file.
• If a file is too large and an inode can not hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced.
Write operations
• Write:
– the file systems copies bytes from the user buffer into system buffer.
– If buffer filled up, system sends data to disk
• System buffering
+ allows file systems to collect full blocks of data before sending to disk
+ File system can send several blocks at once to the disk (delayed write or write behind)
- Data not really saved in the case of a system crash
- For very large write operations, the additional copy from user to system buffer could/should be avoided
COSC 6374 – Parallel Computation Edgar Gabriel
Read operations
• Read:
– File system determines, which blocks contain requested data
– Read blocks from disk into system buffer
– Copy data from system buffer into user memory
• System buffering:
+ file system always reads a full block (file caching)
+ If application reads data sequentially, prefetching (read ahead) can improve performance
- Prefetching harmful to the performance, if application has a random access pattern.
File system operations
• Caching and buffering improve performance – Avoiding repeated access to the same block – Allowing a file system to smooth out I/O behavior
• Non-blocking I/O gives users control over prefetching and delayed writing
– Initiate read/write operations as soon as possible – Wait for the finishing of the read/write operations just
when absolutely necessary.
COSC 6374 – Parallel Computation Edgar Gabriel
Distributed File Systems vs. Parallel File Systems
• Offer access to a collection of files on remote machines
• Typically client-server based approach
• Transparent for the user
• Concurrent access to the same file from several processes is considered to be an unlikely event
– in contrary to parallel file systems, where it is considered to be a standard operation
• Distributed file systems assume different numbers of processors than parallel file systems
• Distributed file systems have different security requirements than parallel file systems
NFS – Network File System
• Protocol for a remote file service
• Client – server based approach
• Stateless server (v3)
• Communication based on RPC (Remote Procedure Call)
• NFS provides session semantics – changes to an open file are initially only visible to the process that modified the file
• File locking not part of NFS protocol (v3)
• File locking handled by a separate protocol/daemon – Locking of blocks often supported
• Client caching not part of the NFS protocol (v3) – depending on implementation
– E.g. allowing cached data to be stale for 30 seconds
COSC 6374 – Parallel Computation Edgar Gabriel
NFS in a cluster
Front-end node hosts the file server
NFS in a cluster (II)
• All file operations are remote operations – file server (= NFS server) = bottleneck
• Extensive usage of file locking required to implement sequential consistency of UNIX I/O
• Communication between client and server typically uses the “slow” communication channel on a cluster
• Do we use several disks at all ?
• Some inefficiencies in the specification, e.g. a read operation involves two RPC operations
– Lookup file-handle – Read request
COSC 6374 – Parallel Computation Edgar Gabriel
Parallel I/O
•Basic idea: disk striping
• Stripe factor: number of disks
• Stripe depth: size of each block
Disk striping
• Requirements for improving disk performance:
– Multiple physical disks
– Separate I/O channels to each disk – Data transfer to all disks simultaneously
• Problem of simple disk striping:
– Minimum stripe depth (sector size) required for optimal disk performance
• since file size is limited, the number of disks which can be used in parallel is limited as well
– Loss of a single disk makes entire file useless
• Risk to loose a disk is proportional to the number of disks used
• RAID (Redundant Arrays of Independent Disks – see lecture 2)
COSC 6374 – Parallel Computation Edgar Gabriel
Parallel File Systems
• Goals
– Several process should be able to access the same file concurrently
– Several process should be able to access the same file efficiently
• Problems
– Unix sequential consistency semantics – Handling of file-pointers
– Caching and buffering
Concurrent file access – logical view
• Number of compute and I/O nodes need not match 1 1 2 2 3 3 4 4 Blocks from compute nodes
1 2 3 4 1 2 3 4 Logical view (“shared file”)
1 3 1 3 2 4 2 4 I/O nodes
Disks
COSC 6374 – Parallel Computation Edgar Gabriel
Concurrent file access – opening a file
• Each I/O node has a subset of the blocks
• File system needs to look up where the file resides – Each I/O node maintains its own directory information or – Centralized name service
• File system needs to look up striping factor (often fixed)
• Creating a new file
– file systems has to choose different I/O nodes for holding the first block to avoid contention
Concurrent write operations
• How to ensure sequential consistency ? – File locking
• Prevents parallelism even if processes write to different locations in the same file (false sharing) – Better: locking of individual blocks
• Parallel file systems often offer two consistency models – Sequential consistency
– A relaxed consistency model
• application is responsible for preventing overlapping write-operations
COSC 6374 – Parallel Computation Edgar Gabriel
File pointers
• In UNIX: every process has a separate file pointer (individual file pointers)
• Shared file pointers often useful (e.g. reading the next piece of work, writing a parallel log-file)
– On distributed memory machines: slow, since somebody has to coordinate the file pointer
– Can be fast on shared memory machines – General problems:
• file pointer atomicity
• Non blocking I/O
• Explicit file offset operations: each process tells the file system where to read/write in the file
– no update to file pointers!
Buffering and caching
• Client buffering: buffering at compute nodes
– Consistency problems (e.g. one node writes, another tries to read the same data)
• Server buffering: buffering at I/O nodes
– Prevents concatenating several small requests to a single large one => produces lots of traffic
COSC 6374 – Parallel Computation Edgar Gabriel
Example for a parallel file system: xFS
• Anderson et all., 1995
• Storage server: storing parts of a file
• Metadata manager: keeps track of data blocks
• Client: processes user requests
Client Manager
Manager Storage
server
Storage server
Storage server
xFS continued
• Communication based on active messages – Uses fast networking infrastructure
• Log-based file system
– Modifications to a file are written to a log-file and collectively written to disk
– To find a data block, a separate table (imap) holds inode references to the position in the log-file
– Log-file is distributed among several processes using RAID techniques
• Storage servers are organized in stripe groups
– I.e. not all storage servers are participating in all operations
– A globally replicated table stores which server is belonging to which stripe group
• Each file has a manager associated to it
• Manager map: identifies manager for a specific file
COSC 6374 – Parallel Computation Edgar Gabriel
xFS continued again
• Starting point: file, data
• Directory returns file id
• Manager map returns metadata manager
• Metadata manager returns exact location of inode in the log: stripe group id, segment id and offset in segment
• Client computes on which server the block really is
Directory
Manager map
fid File, data
imap
Stripe group map
Client caching in xFS
• xFS maintains a local block cache – Based on block caching
– Request of write permission transfers the ownership – Manager keeps track where a file block is cached
• Collaborative caching
– Manager transfer most recent version of a data block directly from one cache into another cache
COSC 6374 – Parallel Computation Edgar Gabriel
xFS versus NFS
Issue NFS v3 xFS
Design goals Access transparency Server-less system
Access model Remote Log-based
Communication RPC Active msgs.
Client process Thin Fat
Server groups No Yes
Name space Per client Global
Sharing semantics Session UNIX
Caching unit Implementation dep. Block Fault tolerance Reliable
communication
Striping
Summary
• Parallel I/O is a means to decrease the file I/O access times on parallel applications
• Performance relevant factors – Stripe factor
– Stripe depth
– Buffering and caching – Non-blocking I/O
• Parallel file systems offer support for concurrent access of processes to the same file
– Individual file pointers – Shared file pointers – Explicit offset
• Distributed file systems are a poor replacement for parallel file systems