COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Academic year: 2021


COSC 6374 Parallel Computation

Parallel I/O (I) – I/O basics

Edgar Gabriel

Fall 2012

Concept of a cluster

[Figure: a compute node with Processor 1, Processor 2, memory, local disks, and network cards 1 and 2, attached to a message passing network and an administrative network]

I/O Problem (I)

• Every node has its own local disk

• Most applications require data and executable to be locally available

– e.g. an MPI application using multiple nodes requires

• executable to be available on all nodes

• in the same directory

• using the same name

• Multiple processes need to access the same file, potentially different portions of it

– efficiency of this concurrent access becomes a concern

Basic characteristics of storage devices

• Capacity: amount of data a device can store

• Transfer rate or bandwidth: amount of data a device can read/write per unit of time

• Access time or latency: delay before the first byte is moved
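The three characteristics interact: a common first-order model (an assumption of this sketch, not a statement from the slides) charges the latency once per request and then streams the data at the transfer rate. The function names and the drive numbers in the usage note are hypothetical.

```python
def transfer_time(n_bytes: int, latency_s: float, bandwidth_Bps: float) -> float:
    """First-order model: wait latency_s for the first byte, then stream at full bandwidth."""
    return latency_s + n_bytes / bandwidth_Bps


def effective_bandwidth(n_bytes: int, latency_s: float, bandwidth_Bps: float) -> float:
    """Bytes per second actually observed for a single request of n_bytes."""
    return n_bytes / transfer_time(n_bytes, latency_s, bandwidth_Bps)
```

For a hypothetical drive with 5 ms latency and 100 MB/s transfer rate, `effective_bandwidth(4096, 0.005, 100e6)` stays well below 1 MB/s, while a 100 MB request gets close to the full 100 MB/s: small requests are dominated by latency, large ones by bandwidth.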

Prefix        Abbreviation   Base ten   Base two
kilo, kibi    K, Ki          10^3       2^10 = 1024
mega, mebi    M, Mi          10^6       2^20
giga, gibi    G, Gi          10^9       2^30
tera, tebi    T, Ti          10^12      2^40
peta, pebi    P, Pi          10^15      2^50
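The gap between the base-ten and base-two columns grows with every prefix. The sketch below converts byte quantities between the two families, using the abbreviations from the table as dictionary keys (the function name is illustrative).

```python
# Multipliers from the table above, in bytes per unit.
DECIMAL = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}
UNITS = {**DECIMAL, **BINARY}


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a byte quantity between decimal (K, M, ...) and binary (Ki, Mi, ...) units."""
    return value * UNITS[from_unit] / UNITS[to_unit]
```

For example, `convert(1, "T", "Ti")` is about 0.91: a disk whose capacity is given as 1 TB (base ten) holds roughly 0.91 TiB (base two), whereas at the kilo level the difference is only 2.4%.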

UNIX File Access Model

• A File is a sequence of bytes

• When a program opens a file, the file system establishes a file pointer. The file pointer is an integer indicating the position in the file where the next byte will be read or written.

• Disk drives read and write data in fixed-size units (disk sectors)

• File systems allocate space in blocks, each of which is a fixed number of contiguous disk sectors.

• In UNIX based file systems, the blocks that hold data are listed in an inode. An inode contains the information needed to find all the blocks that belong to a file.

• If a file is too large for the inode to hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced.
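To see how a file pointer turns into a block lookup, the sketch below maps a byte offset to the inode entry that locates its block. The parameters (4 KiB blocks, 12 direct pointers, 4-byte block numbers, single and double indirect blocks) are assumptions in the style of classic UNIX file systems, not values from the slides.

```python
BLOCK_SIZE = 4096                  # bytes per file system block (assumed)
N_DIRECT = 12                      # direct block pointers stored in the inode (assumed)
PTRS_PER_BLOCK = BLOCK_SIZE // 4   # 4-byte block numbers per indirect block (assumed)


def locate(offset: int):
    """Map a byte offset (e.g. the current file pointer) to (path, offset_within_block).

    path is ("direct", i)     -> i-th block pointer stored in the inode
            ("single", i)     -> i-th entry of the single indirect block
            ("double", i, j)  -> j-th entry of the i-th block listed in the double indirect block
    """
    block = offset // BLOCK_SIZE
    within = offset % BLOCK_SIZE
    if block < N_DIRECT:
        return ("direct", block), within
    block -= N_DIRECT
    if block < PTRS_PER_BLOCK:
        return ("single", block), within
    block -= PTRS_PER_BLOCK
    if block < PTRS_PER_BLOCK ** 2:
        return ("double", block // PTRS_PER_BLOCK, block % PTRS_PER_BLOCK), within
    raise ValueError("offset would need a triple indirect block, not modeled here")
```

Each additional level of indirection means one more block to read (unless it is cached) before the data block itself can be accessed.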

Write operations

• Write:

– the file system copies bytes from the user buffer into a system buffer

– If the buffer fills up, the system sends the data to disk

• System buffering

+ allows file systems to collect full blocks of data before sending them to disk

+ File system can send several blocks at once to the disk (delayed write or write behind)

- Data not really saved in the case of a system crash

- For very large write operations, the additional copy from the user buffer to the system buffer could/should be avoided
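The buffering trade-offs above can be imitated in user space. The sketch below is a hypothetical `WriteBehindFile` class (not an OS interface) whose `bytearray` plays the role of the system buffer: it makes the extra copy, hands only whole blocks down to the OS, and calls `os.fsync` on close to address the crash problem. The real kernel page cache does this on its own; the block size of 4096 bytes is an assumption.

```python
import os


class WriteBehindFile:
    """User-space sketch of write-behind: collect full blocks, then pass them to the OS together."""

    def __init__(self, path: str, block_size: int = 4096):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        self.block_size = block_size
        self.buf = bytearray()   # plays the role of the system buffer
        self.flushes = 0         # how often data was passed down to the OS

    def write(self, data: bytes) -> None:
        self.buf += data                                   # extra copy: user data -> buffer
        full = len(self.buf) - len(self.buf) % self.block_size
        if full:                                           # only whole blocks leave the buffer
            self._write_all(bytes(self.buf[:full]))
            del self.buf[:full]

    def _write_all(self, data: bytes) -> None:
        view = memoryview(data)
        while view:                                        # os.write may write fewer bytes than asked
            n = os.write(self.fd, view)
            view = view[n:]
        self.flushes += 1

    def close(self, sync: bool = True) -> None:
        if self.buf:                                       # trailing partial block
            self._write_all(bytes(self.buf))
            self.buf.clear()
        if sync:
            os.fsync(self.fd)   # force data to stable storage; without it a crash may lose data
        os.close(self.fd)
```

Ten writes of 1000 bytes each reach the OS as only two full-block writes plus one final partial write at close time.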

Read operations

• Read:

– File system determines which blocks contain the requested data

– Read blocks from disk into the system buffer

– Copy data from the system buffer into user memory

• System buffering:

+ file system always reads a full block (file caching)

+ If the application reads data sequentially, prefetching (read ahead) can improve performance

- Prefetching can hurt performance if the application has a random access pattern.
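The effect of prefetching can be shown with a small simulation. The hypothetical `ReadAheadCache` below fetches the requested block plus the next `read_ahead` blocks on every miss and counts both disk requests and blocks transferred; the cache size is unbounded and the read-ahead window of 7 blocks is an assumption.

```python
import random


class ReadAheadCache:
    """Simulated block cache: a miss fetches the block plus the next read_ahead blocks."""

    def __init__(self, n_blocks: int, read_ahead: int):
        self.n_blocks = n_blocks
        self.read_ahead = read_ahead
        self.cached = set()
        self.disk_requests = 0   # separate trips to the disk
        self.blocks_fetched = 0  # blocks transferred, useful or not

    def read(self, block: int) -> None:
        if block in self.cached:
            return                                   # hit: no disk access
        self.disk_requests += 1
        last = min(block + self.read_ahead, self.n_blocks - 1)
        for b in range(block, last + 1):
            if b not in self.cached:
                self.cached.add(b)
                self.blocks_fetched += 1


def simulate(pattern, n_blocks=1000, read_ahead=7):
    cache = ReadAheadCache(n_blocks, read_ahead)
    for b in pattern:
        cache.read(b)
    return cache.disk_requests, cache.blocks_fetched


if __name__ == "__main__":
    rng = random.Random(0)
    print("sequential:", simulate(range(1000)))
    print("random:    ", simulate(rng.sample(range(1000), 100)))
```

For a sequential scan of 1000 blocks, read-ahead reduces 1000 disk requests to 125 without transferring a single unneeded block; for 100 random reads it transfers blocks that are never requested.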

Dealing with disk latency: caching and buffering

• Avoids repeated access to the same block

• Allows a file system to smooth out I/O behavior

• Helps to hide the latency of the hard drives

• Can lower the performance of I/O operations with irregular access patterns

• Non-blocking I/O gives users control over prefetching and delayed writing

– Initiate read/write operations as soon as possible

– Wait for the read/write operations to finish only when absolutely necessary
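The "initiate early, wait late" pattern can be sketched with a background thread standing in for non-blocking I/O. Real non-blocking interfaces exist (e.g. POSIX `aio_read()` or MPI's `MPI_File_iread()`), but this sketch only uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()


def process_all(paths, compute):
    """Double buffering: start reading file i+1 before computing on file i."""
    results = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_file, paths[0])              # initiate the first read early
        for i in range(len(paths)):
            data = pending.result()                             # wait only when the data is needed
            if i + 1 < len(paths):
                pending = pool.submit(read_file, paths[i + 1])  # next read runs during compute()
            results.append(compute(data))
    return results
```

While `compute` works on one file, the read of the next file is already in flight, so disk latency overlaps with computation instead of adding to it.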

Improving Disk Bandwidth: disk striping

• Utilize multiple hard drives

• Split a file into fixed-size chunks and distribute them across all disks

• Three relevant parameters:

– Stripe factor: number of disks

– Stripe depth: size of each block

– Which disk contains the first block of the file

[Figure: Blocks 1 to n of a file distributed across Disk 1, Disk 2, Disk 3, and Disk 4]
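The three striping parameters fully determine where each byte of a file lives. The sketch below assumes round-robin placement and numbers disks from 0 (the figure numbers them from 1); the function name is illustrative.

```python
def stripe_location(offset: int, stripe_factor: int, stripe_depth: int, first_disk: int = 0):
    """Map a byte offset in the file to (disk, byte offset on that disk), round-robin striping.

    stripe_factor: number of disks; stripe_depth: bytes per block;
    first_disk: disk holding the first block of the file (disks numbered from 0 here).
    """
    block = offset // stripe_depth                 # which block of the file
    disk = (first_disk + block) % stripe_factor    # blocks are dealt out round-robin
    row = block // stripe_factor                   # blocks placed on that disk before this one
    return disk, row * stripe_depth + offset % stripe_depth
```

With 4 disks and a stripe depth of 64 bytes, byte 256 of the file wraps around to the second block on disk 0, i.e. `stripe_location(256, 4, 64) == (0, 64)`.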

Disk striping

• Ideal assumption

b(N, p) = p * b(N/p, 1)

with N: number of bytes to be written, b: bandwidth, p: number of disks

• Realistically:

b(N,p) < p * b(N/p,1)

since

– N is often not large enough to fully utilize p hard drives

– networking overhead
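Both effects can be made concrete with a simple model. The sketch below assumes that each disk behaves like the latency-plus-transfer model (single-disk bandwidth b(n, 1) = n / (latency + n / bw)); `stripe_depth` and `net_overhead` are hypothetical parameters that stand in for "N too small" and "networking overhead".

```python
import math


def single_disk_bw(n: float, latency: float, bw: float) -> float:
    """b(n, 1): one request of n bytes pays the latency once, then streams at bw bytes/s."""
    return n / (latency + n / bw)


def ideal_striped_bw(n: float, p: int, latency: float, bw: float) -> float:
    """Ideal assumption b(N, p) = p * b(N/p, 1): every disk gets an equal share, no overhead."""
    return p * single_disk_bw(n / p, latency, bw)


def modeled_striped_bw(n, p, latency, bw, stripe_depth, net_overhead):
    """Adds the two effects listed above, each in a deliberately simple form."""
    used = min(p, math.ceil(n / stripe_depth))   # small N cannot keep all p disks busy
    per_disk = n / used                          # even share among the disks used (ignores rounding)
    return n / (latency + net_overhead + per_disk / bw)
```

With `net_overhead = 0` and N large enough to cover all p disks, the model reproduces the ideal case; a single-block request uses only one disk, and any positive overhead pushes b(N, p) below p * b(N/p, 1).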

Two levels of disk striping (I)

• Using a RAID controller – Hardware

– typically a ‘single box’

– number of disks: 3…n

Redundant arrays of independent disks (RAID)

• Goals: improve the reliability and the performance of an I/O system

• Several RAID levels defined

• RAID 0: disk striping without redundant storage (not the same as JBOD, "just a bunch of disks", which combines disks without striping)

– No fault tolerance

– Good for high transfer rates

• i.e. read/write bandwidth of a single large file

– Good for high request rates

• i.e. access time to many (small) files

• RAID 1: mirroring

– All data is replicated on two or more disks

– Does not improve write performance, and improves read performance only moderately

RAID level 2

• RAID 2: Hamming codes

– Each group of data bits has several check bits appended to it, forming Hamming code words

– Each bit of a Hamming code word is stored on a separate disk

– Very high additional costs: e.g. up to 50% additional capacity required

• Hardly used today, since parity-based codes are faster and simpler
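As an illustration of how check bits are appended and used, the sketch below implements the classic Hamming(7,4) code, chosen here only because it is the smallest standard example; the slides do not say which code size RAID 2 systems used. In RAID 2 terms, each of the 7 positions would be stored on its own disk, and a single corrupted position can be located and corrected.

```python
def hamming74_encode(d):
    """4 data bits -> 7-bit code word; positions 1..7, check bits at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recheck positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # recheck positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # recheck positions 4, 5, 6, 7
    error_pos = s1 + 2 * s2 + 4 * s4  # 0 means no error, else the 1-based bad position
    if error_pos:
        c[error_pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The syndrome bits directly spell out the position of the bad bit, which is elegant but requires several check disks; a single XOR parity disk (RAID 3 and later) is cheaper because a failed disk usually identifies itself.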

RAID level 3

• Parity based protection:

– Based on exclusive OR (XOR)

– Reversible

– Example:

      01101010 (data byte 1)
  XOR 11001001 (data byte 2)
  ---------------------------
      10100011 (parity byte)

– Recovery:

      11001001 (data byte 2)
  XOR 10100011 (parity byte)
  ---------------------------
      01101010 (recovered data byte 1)

RAID level 3 (cont.)

• Data divided evenly into N subblocks (N = number of disks, typically 4 or 5)

• Computing parity bytes generates an additional subblock

• Subblocks written in parallel on N+1 disks

• For best performance data should be of size (N * sector size)

• Problems with RAID level 3:

– All disks participate in every operation => contention for applications with high access rates

– If the data size is less than N * sector size, the system has to read old subblocks to calculate the parity bytes

• RAID level 3 is good for high transfer rates
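The RAID 3 write path can be sketched in a few lines: pad the data to a multiple of N * sector size, cut it into N equal subblocks, and append one XOR parity subblock. The function names and the 512-byte sector size are assumptions; the same `xor_blocks` helper reproduces the byte example from the previous slide.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid3_stripe(data: bytes, n_disks: int, sector_size: int = 512):
    """Split data into n_disks equal subblocks plus one parity subblock (n_disks + 1 in total)."""
    unit = n_disks * sector_size
    padded = data + bytes(-len(data) % unit)      # zero-pad to a multiple of N * sector size
    sub = len(padded) // n_disks
    subblocks = [padded[i * sub:(i + 1) * sub] for i in range(n_disks)]
    return subblocks + [xor_blocks(subblocks)]


def raid3_recover(disks, lost: int) -> bytes:
    """Rebuild the subblock of a single failed disk by XOR-ing all surviving subblocks."""
    return xor_blocks([blk for i, blk in enumerate(disks) if i != lost])
```

Zero padding is only valid when a whole stripe is written from scratch; when a small write overwrites part of an existing stripe, a real array has to read the old subblocks first, which is exactly the problem listed above.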

RAID level 4

• Parity bytes for N disks are calculated and stored

• Parity bytes are stored on a separate disk

• Files are not necessarily distributed over N disks

• For read operations:

– Determine the disks holding the requested blocks

– Read data from these disks

• For write operations:

– Retrieve the old data from the sector being overwritten

– Retrieve the parity block from the parity disk

– Extract the old data from the parity block using XOR operations

– Add the new data to the parity block using XOR

– Store the new data

– Store the new parity block

• Bottleneck: the parity disk is involved in every write operation
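The six write steps amount to new parity = old parity XOR old data XOR new data. The sketch below models one stripe as a list of per-disk blocks (the function names are illustrative): a small write costs two reads and two writes regardless of N, and one of each always goes to the parity disk.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(stripe, parity_idx: int, target_idx: int, new_data: bytes) -> None:
    """Update one data block of a RAID 4 stripe in place (stripe = list of per-disk blocks)."""
    old_data = stripe[target_idx]                 # read old data
    old_parity = stripe[parity_idx]               # read old parity from the parity disk
    new_parity = xor(xor(old_parity, old_data),   # XOR extracts the old data ...
                     new_data)                    # ... XOR adds the new data
    stripe[target_idx] = new_data                 # store new data
    stripe[parity_idx] = new_parity               # store new parity
```

The resulting parity equals a full recomputation over all data disks, but only the target disk and the parity disk are touched.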

RAID level 5

• Same as RAID 4, but parity blocks are distributed across different disks

   Block 1   Block 2   Block 3   Block 4      P(1,2,3,4)
   Block 5   Block 6   Block 7   P(5,6,7,8)   Block 8
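The block placement shown above (5 disks, parity moving one disk to the left with every stripe) can be generated programmatically. This rotation rule is inferred from the two rows in the slide; other RAID 5 implementations rotate parity differently.

```python
def raid5_layout(n_disks: int, n_stripes: int):
    """Rows of per-disk labels; parity moves one disk to the left with every stripe (assumed)."""
    rows, block = [], 1
    for s in range(n_stripes):
        parity_disk = (n_disks - 1 - s) % n_disks
        first = block
        row = []
        for d in range(n_disks):
            if d == parity_disk:
                row.append(None)              # filled in once the stripe's data blocks are known
            else:
                row.append(f"Block {block}")
                block += 1
        covered = ",".join(str(b) for b in range(first, block))
        row[parity_disk] = f"P({covered})"
        rows.append(row)
    return rows
```

Because the parity block rotates, small writes to different stripes update parity on different disks, which removes the single-parity-disk bottleneck of RAID 4.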

RAID level 6

• Tolerates the loss of more than one disk

• Collection of several techniques

• E.g. P+Q parity: compute parity bytes using two different algorithms and store the two parity blocks on different disks

• E.g. two-dimensional parity

[Figure: two-dimensional parity with separate parity disks]
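Two-dimensional parity can be sketched as an M x N grid of data disks with one parity disk per row and one per column, which matches the MN+M+N space usage in the comparison table. Assumptions of this sketch: each disk is modeled as a single byte, the parity disks themselves survive, and the function names are illustrative. Any two failed data disks can be rebuilt: if they sit in different rows, row parity suffices; if they share a row, they sit in different columns and column parity applies.

```python
def xor_all(values) -> int:
    out = 0
    for v in values:
        out ^= v
    return out


def make_parity(grid):
    """Row parity (one parity disk per row) and column parity (one per column)."""
    return [xor_all(row) for row in grid], [xor_all(col) for col in zip(*grid)]


def recover(grid, row_par, col_par, lost):
    """Rebuild the cells in lost (a set of (row, col)); parity disks are assumed intact."""
    grid = [list(r) for r in grid]
    lost = set(lost)
    while lost:
        progress = False
        for i, j in sorted(lost):
            row_peers = [(i, k) for k in range(len(grid[i])) if k != j]
            col_peers = [(k, j) for k in range(len(grid)) if k != i]
            if not lost.intersection(row_peers):      # rest of the row is intact
                grid[i][j] = row_par[i] ^ xor_all(grid[r][c] for r, c in row_peers)
            elif not lost.intersection(col_peers):    # rest of the column is intact
                grid[i][j] = col_par[j] ^ xor_all(grid[r][c] for r, c in col_peers)
            else:
                continue
            lost.discard((i, j))
            progress = True
        if not progress:
            raise ValueError("failure pattern not recoverable with this simple scheme")
    return grid
```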

RAID level 10

• Is RAID level 1 + RAID level 0

[Figure: RAID 0 striping across RAID 1 mirrored sets]

• Also available: RAID 53 (RAID 0 + RAID 3)

Comparing RAID levels

RAID   Protection      Space usage            Good at             Poor at
0      None            N                      Performance         Data protection
1      Mirroring       2N                     Data protection     Space efficiency
2      Hamming codes   ~1.5N                  Transfer rate       Request rate
3      Parity          N+1                    Transfer rate       Request rate
4      Parity          N+1                    Read request rate   Write performance
5      Parity          N+1                    Request rate        Transfer rate
6      P+Q or 2-D      N+2 or MN+M+N          Data protection     Write performance
10     Mirroring       2N                     Performance         Space efficiency
53     Parity          N + striping factor    Performance         Space efficiency

(N: number of data disks; for 2-D parity, the data disks form an M x N grid.)

Two levels of disk striping (II)

• Using a parallel file system

– exposes the individual units capable of handling data

• often called storage servers, I/O nodes, etc.

– each storage server might use multiple hard drives under the hood to increase its read/write bandwidth

– a metadata server keeps track of which parts of a file are on which storage server

– A single disk failure is less of a problem if each server uses a RAID 5 storage system under the hood

[Figure: Parallel file systems, conceptual overview: compute nodes, a metadata server, and storage servers 0 to 3]

File access on a parallel file system

• The application on the compute node calls write()

• The OS requests from the metadata server the list of relevant I/O nodes for this write operation

• The metadata server sends storage IDs, offsets, etc.

• The OS sends the data to the storage servers
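On the client side, the layout returned by the metadata server is used to cut one write() into per-server pieces. The sketch below assumes a simple round-robin layout described by a hypothetical `Layout` record (list of storage server IDs plus a stripe size); actual parallel file systems use their own, richer layout formats.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    servers: list      # storage server IDs holding this file, in striping order (hypothetical)
    stripe_size: int   # bytes per stripe unit (hypothetical)


def split_request(layout: Layout, offset: int, length: int):
    """Split a byte range into (server_id, server_offset, file_offset, nbytes) pieces."""
    pieces = []
    end = offset + length
    n = len(layout.servers)
    while offset < end:
        unit = offset // layout.stripe_size              # stripe unit containing this byte
        within = offset % layout.stripe_size
        nbytes = min(layout.stripe_size - within, end - offset)
        server = layout.servers[unit % n]                 # round-robin over the servers
        server_offset = (unit // n) * layout.stripe_size + within
        pieces.append((server, server_offset, offset, nbytes))
        offset += nbytes
    return pieces
```

A 10000-byte write starting at offset 60000 with 64 KiB stripe units crosses a unit boundary and therefore goes to two servers: `[(0, 60000, 60000, 5536), (1, 0, 65536, 4464)]`.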

Disk striping

• Requirements to improve performance of I/O operations using disk striping:

– Multiple physical disks

– Have to balance network bandwidth and I/O bandwidth

• Problem of simple disk striping:

– for a fixed file size, the number of disks which can be used in parallel is limited

• Prominent parallel file systems

– PVFS2

– Lustre

– GPFS

– NFS v4.2 (new standard currently being ratified)

Distributed vs. Parallel File Systems

• Distributed File Systems

– Offer access to a collection of files on remote machines

– Typically client-server based approach

– Transparent for the user

• NFS – The Network File System

– Protocol for a remote file service

– Stateless server (v3)

– Communication based on RPC (Remote Procedure Call)

– NFS provides session semantics: changes to an open file are initially only visible to the process that modified the file

– File locking not part of the NFS protocol (v3), but often available through a separate protocol/daemon

– Client caching not part of the NFS protocol (v3): implementation-dependent behavior

Network File System (NFS)

• Compute node = NFS client

• The application calls write(); the OS forwards the data to the NFS server

• The NFS daemon receives the data

• The NFS daemon calls write()

Parallel vs. Distributed File Systems

• In distributed file systems:

– Concurrent access to the same file from several processes is considered to be an unlikely event

– Assume different (i.e. lower) numbers of processes accessing a file

– Different security requirements
