Distributed RAID Architectures
for Cluster I/O Computing
Kai Hwang
Internet and Cluster Computing Lab.
University of Southern California
Presentation Outline:
- Scalable Cluster I/O
- The RAID-x Architecture
- Cooperative disk drivers
- Benchmark Experiments
- Security and Fault Tolerance
K. Hwang , March 15,2001 in Beijing
Scalable clusters providing SSI
services are gradually replacing the
SMP, cc-NUMA, and MPP in Servers,
- Size Scalability (physical & application)
- Enhanced Availability (failure management)
- Single System Image (middleware, OS extensions)
- Fast Communication (networks & protocols)
- Load Balancing (CPU, Net, Memory, Disk)
- Security and Encryption (clusters of clusters)
- Distributed Environment (user friendly)
- Manageability (jobs and resources)
- Programmability (simple API required)
- Applicability (cluster- and grid-awareness)
The USC Trojans Cluster Project
Internet and Cluster Computing Lab, EEB Rm. 104
- Sixteen Pentium PCs are housed in two 9-ft computer racks.
- All PCs run RedHat Linux v. 6.0 (kernel v. 2.2.5).
- All nodes are connected by a 100 Mbps Fast Ethernet.
- DQS, LSF, MPI, PVM, TreadMarks, Elias, the NAS benchmarks, etc. have been ported to the cluster.
- Scalable to a future system with hundreds of PC nodes interconnected by Gigabit networks.
Trojans Linux Cluster with Middleware for Security and Checkpoint Recovery
[Figure: layered cluster architecture. Labels: programming environments (Java, EDI, HTML, XML); Web Windows user interface; other subsystems (database, OLTP, etc.); single-system image and availability infrastructure; security and checkpointing middleware; Pentium PC nodes; Gigabit network interconnect.]
An I/O-centric cluster architecture
[Figure: cluster partitions. Labels: client, Internet/Intranet, entry partition, service partition, database partition, Fast Ethernet, service flow, data flow.]
Distributed RAID Embedded in Clusters or Storage-Area Networks:
- I/O bottleneck in scalable cluster computing
  - The gap between CPU/memory speed and disk I/O widens as the µP doubles in speed every year.
  - Cluster applications are often I/O-bound.
- Disks attached to hosts often become unavailable when the hosts themselves fail. A distributed RAID achieves much higher availability through fault isolation, rollback recovery, and automatic file migration.
Distributed RAID with a single I/O space embedded in a cluster
[Figure: workstations or PCs connected by a cluster network (SAN or LAN).]
Research Projects on Parallel and Distributed RAID

| System Attributes | USC Trojans RAID-x | Princeton TickerTAIP | Digital Petal | Berkeley Tertiary Disk | HP AutoRAID |
|---|---|---|---|---|---|
| RAID architecture environment | Orthogonal striping and mirroring in a Linux cluster | RAID-5 with multiple controllers | Chained declustering in a Unix cluster | RAID-5 built with a PC cluster | Hierarchical with RAID-1 and RAID-5 |
| Enabling mechanism for SIOS | Cooperative device drivers in the Linux kernel | Single RAID server implementation | Petal device drivers at user level | xFS storage servers at file level | Disk array within a single controller |
| Data consistency checking | Locks at device driver level | Sequencing of user requests | Lamport's Paxos algorithm | Modified DASH protocol in the xFS file system | Use mark to update the parity disk |
| Reliability and fault tolerance | Orthogonal striping and mirroring | Parity checks in RAID-5 | Chained declustering | SCSI disks with parity in RAID-5 | Mirroring and parity checks |
Four RAID architectures using different mirroring and parity checking schemes
(B: data blocks, M: mirrored blocks, P: parity blocks)

(a) Striped mirroring in RAID-10
Disk 1: B0 B2 B4 B6 B8 B10
Disk 2: B1 B3 B5 B7 B9 B11
Disk 3: M0 M2 M4 M6 M8 M10
Disk 4: M1 M3 M5 M7 M9 M11

(b) Parity checking in RAID-5
Disk 1: B0 B4 B8 P4 B12 B16
Disk 2: B1 B5 P3 B9 B13 B17
Disk 3: B2 P2 B6 B10 B14 P6
Disk 4: P1 B3 B7 B11 P5 B15

(c) Orthogonal striping and mirroring (OSM) in the RAID-x
Disk 0: B0 B4 B8 | M9 M10 M11
Disk 1: B1 B5 B9 | M6 M7 M8
Disk 2: B2 B6 B10 | M3 M4 M5
Disk 3: B3 B7 B11 | M0 M1 M2

(d) Skewed striping in chained declustering
Disk 0: B0 M3 B4 M7 B8 M11
Disk 1: B1 M0 B5 M4 B9 M8
Disk 2: B2 M1 B6 M5 B10 M9
Disk 3: B3 M2 B7 M6 B11 M10
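The block placements in panels (c) and (d) follow simple index rules, which the following minimal Python sketch tries to capture. The function names are hypothetical, and the rules are inferred only from the four-disk example on the slide; whether they generalize to other disk counts is an assumption.

```python
def raidx_placement(block, n):
    """Orthogonal striping and mirroring (assumed rule for the 4-disk example).

    Data blocks are striped round-robin over n disks. The images of each
    group of n-1 consecutive blocks are clustered on the one disk that
    holds none of that group's data blocks.
    """
    data_disk = block % n
    group = block // (n - 1)          # index of the (n-1)-block image group
    image_disk = (n - 1 - group) % n  # the disk skipped by this group
    return data_disk, image_disk


def chained_placement(block, n):
    """Chained declustering: the image of a block sits on the next disk."""
    data_disk = block % n
    image_disk = (block + 1) % n
    return data_disk, image_disk


def layout(placement, blocks, n):
    """Build per-disk lists of data and image block names."""
    disks = [{"data": [], "images": []} for _ in range(n)]
    for b in range(blocks):
        d, m = placement(b, n)
        disks[d]["data"].append(f"B{b}")
        disks[m]["images"].append(f"M{b}")
    return disks


if __name__ == "__main__":
    for i, disk in enumerate(layout(raidx_placement, 12, 4)):
        print(f"Disk {i}:", *disk["data"], "|", *disk["images"])
```

In both schemes a block and its image never share a disk, which is what allows recovery from a single disk failure.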
Theoretical Peak Performance of Four RAID Architectures

| Performance Indicators | Operation | RAID-10 | RAID-5 | Chained Declustering | RAID-x |
|---|---|---|---|---|---|
| Max. I/O Bandwidth | Read | nB | nB | nB | nB |
| Max. I/O Bandwidth | Large Write | nB | (n-1)B | nB | nB |
| Max. I/O Bandwidth | Small Write | nB | nB/2 | nB | nB |
| Parallel Read or Write Time | Large Read | mR/n | mR/n | mR/n | mR/n |
| Parallel Read or Write Time | Small Read | R | R | R | R |
| Parallel Read or Write Time | Large Write | 2mW/n | mW/(n-1) | 2mW/n | mW/n + mW/(n(n-1)) |
| Parallel Read or Write Time | Small Write | 2W | R+W | 2W | W |
| Max. Fault Coverage | | n/2 disk failures | Single disk failure | n/2 disk failures | Single disk failure |
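To make the write-time expressions in the table concrete, the sketch below evaluates them for sample parameters. The slide does not define its symbols; the code assumes n is the number of disks, m the number of blocks in a file, and R and W the per-block read and write times. The function name `peak_times` and the sample values are illustrative only.

```python
def peak_times(n, m, R, W):
    """Evaluate the parallel read/write time formulas from the table.

    Assumed meanings: n disks, m blocks per file, R and W are the
    per-block read and write times (any consistent time unit).
    """
    mirrored = {
        "large_read": m * R / n,
        "small_read": R,
        "large_write": 2 * m * W / n,
        "small_write": 2 * W,
    }
    return {
        "RAID-10": dict(mirrored),
        "RAID-5": {
            "large_read": m * R / n,
            "small_read": R,
            "large_write": m * W / (n - 1),
            "small_write": R + W,
        },
        "Chained Declustering": dict(mirrored),
        "RAID-x": {
            "large_read": m * R / n,
            "small_read": R,
            # data stripes plus the clustered image writes
            "large_write": m * W / n + m * W / (n * (n - 1)),
            "small_write": W,
        },
    }


if __name__ == "__main__":
    # Hypothetical numbers: 4 disks, 12 blocks, R = 1 and W = 2 time units.
    for name, times in peak_times(4, 12, 1.0, 2.0).items():
        print(name, times)
```

With these sample values the RAID-x large-write time is lower than that of RAID-10 and chained declustering, and its small-write time is the lowest of the four, which is the relationship the table expresses symbolically.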
Distributed RAID-x architecture
[Figure: four nodes (Node 0 to Node 3), each with processor/memory (P/M) and a CDD, connected by the cluster network. Twelve disks D0 to D11 hold data blocks (B) and mirrored image blocks (M) as follows.]
D0: B0 B12 B24 M25 M26 M27
D1: B1 B13 B25 M14 M15 M24
D2: B2 B14 B26 M3 M12 M13
D3: B3 B15 B27 M0 M1 M2
D4: B4 B16 B28 M29 M30 M31
D5: B5 B17 B29 M18 M19 M28
D6: B6 B18 B30 M7 M16 M17
D7: B7 B19 B31 M4 M5 M6
D8: B8 B20 B32 M33 M34 M35
D9: B9 B21 B33 M22 M23 M32
D10: B10 B22 B34 M11 M20 M21
D11: B11 B23 B35 M8 M9 M10
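One property of this twelve-disk layout can be checked mechanically: every data block and its image sit on different disks, so any single disk can fail without losing data. The sketch below transcribes the layout from the figure and verifies that property. The helper names are hypothetical, and the check covers disks only, not the node-to-disk assignment, which the extracted figure does not state unambiguously.

```python
# Block layout transcribed from the RAID-x figure (disk number -> blocks).
LAYOUT = {
    0: "B0 B12 B24 M25 M26 M27",
    1: "B1 B13 B25 M14 M15 M24",
    2: "B2 B14 B26 M3 M12 M13",
    3: "B3 B15 B27 M0 M1 M2",
    4: "B4 B16 B28 M29 M30 M31",
    5: "B5 B17 B29 M18 M19 M28",
    6: "B6 B18 B30 M7 M16 M17",
    7: "B7 B19 B31 M4 M5 M6",
    8: "B8 B20 B32 M33 M34 M35",
    9: "B9 B21 B33 M22 M23 M32",
    10: "B10 B22 B34 M11 M20 M21",
    11: "B11 B23 B35 M8 M9 M10",
}


def locate(layout):
    """Map each block name (e.g. 'B7' or 'M7') to the disk holding it."""
    where = {}
    for disk, row in layout.items():
        for name in row.split():
            where[name] = disk
    return where


def survives_single_disk_failure(layout):
    """True if every data block has an image stored on a different disk."""
    where = locate(layout)
    for name, disk in where.items():
        if name.startswith("B"):
            image = "M" + name[1:]
            if image not in where or where[image] == disk:
                return False
    return True


if __name__ == "__main__":
    print(survives_single_disk_failure(LAYOUT))
```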
Single I/O space in a Distributed RAID enabled by CDDs at Linux kernel level
[Figure, two panels, each with cluster nodes on an interconnection network. (a) Separate disks driven by independent disk drivers (IDDs), with a central NFS server. (b) A global virtual disk with a SIOS formed by cooperative disks, each node running a CDD.]
Remote disk access using central NFS server versus using cooperative disk drivers in the RAID-x cluster
[Figure, two panels, each showing user level and kernel level on the client side and the server side, with numbered steps 1 to 6. (a) Parallel disk I/O using the NFS in a server/client cluster: user application, NFS client, NFS server, traditional device driver. (b) Using CDDs to achieve a SIOS in a serverless cluster: user application and a CDD on each side; the NFS server is bypassed.]
Architectural design of cooperative device drivers
[Figure, two panels. (a) Device masquerading: the CDDs on Node 1 and Node 2 present virtual disks alongside physical disks and communicate through the network. Second panel: structure of the Cooperative Disk Driver (CDD), consisting of a storage manager, a CDD client module, and a data consistency module, with communications through the network.]
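The idea of device masquerading, where each node's driver makes remote disks look like local ones, can be illustrated with a toy user-level model. This is not the kernel-level CDD described on the slides: the class, its methods, the round-robin block ownership rule, and the in-memory dictionaries standing in for physical disks are all assumptions made for illustration.

```python
class CooperativeDiskDriver:
    """Toy model of a CDD: serves local blocks, forwards the rest to peers."""

    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.n_nodes = n_nodes
        self.local_blocks = {}    # stands in for the node's physical disk
        self.peers = {}           # node_id -> driver on that node
        self.remote_requests = 0  # requests sent over the (simulated) network

    def connect(self, drivers):
        """Register all drivers in the cluster, including this one."""
        self.peers = {d.node_id: d for d in drivers}

    def owner(self, block):
        """Assumed placement rule: blocks are striped round-robin over nodes."""
        return block % self.n_nodes

    def write(self, block, data):
        owner = self.owner(block)
        if owner == self.node_id:
            self.local_blocks[block] = data       # storage-manager role
        else:
            self.remote_requests += 1             # client-module role
            self.peers[owner].write(block, data)

    def read(self, block):
        owner = self.owner(block)
        if owner == self.node_id:
            return self.local_blocks[block]
        self.remote_requests += 1
        return self.peers[owner].read(block)


if __name__ == "__main__":
    drivers = [CooperativeDiskDriver(i, 4) for i in range(4)]
    for d in drivers:
        d.connect(drivers)
    drivers[0].write(5, b"payload")  # block 5 lives on node 1
    print(drivers[2].read(5))        # any node can read it by block number
```

Every node addresses the same global block numbers, which is the single I/O space the slides describe; the data consistency module (locking) is deliberately left out of this sketch.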
Maintaining consistency of the global directory
[Figure: four cluster nodes, each running an application and a CDD on the cluster network. Every node sees the same global directory /sios containing dir1, dir2, file1, file2, and file3.]
Elapsed Time in Executing the Andrew Benchmark on the Linux Cluster at USC
[Figure: two bar charts of elapsed time (sec) versus number of clients (1, 4, 8, 12, 16), broken down by benchmark phase: Compile, Read File, Scan Dir, Copy Files, Make Dir.]
Parallel Write Performance of Four RAID Architectures against Traffic Rate
[Figure: parallel writes (20 MB per client). Aggregate bandwidth (MB/s) versus number of clients (1, 4, 8, 12, 16) for RAID-x, chained declustering, RAID-10, RAID-5, and NFS.]
Parallel Write Performance of Four RAID Architectures vs. Disk Array Size
[Figure: parallel write. Aggregate bandwidth (MB/s) versus number of disks (2, 4, 8, 12, 16) for RAID-x, chained declustering, RAID-10, and RAID-5.]
Achievable I/O Bandwidth and Improvement Factor on Trojans Cluster

| I/O Operations | NFS, 1 Client | NFS, 16 Clients | NFS, Improve | RAID-x, 1 Client | RAID-x, 16 Clients | RAID-x, Improve |
|---|---|---|---|---|---|---|
| Large Read | 2.58 MB/s | 2.3 MB/s | 0.89 | 2.59 MB/s | 15.63 MB/s | 6.03 |
| Large Write | 2.11 MB/s | 2.77 MB/s | 1.31 | 2.92 MB/s | 15.29 MB/s | 5.24 |
| Small Write | 2.47 MB/s | 2.81 MB/s | 1.34 | 2.35 MB/s | 15.1 MB/s | 6.43 |

| I/O Operations | Chained Declustering, 1 Client | Chained Declustering, 16 Clients | Chained Declustering, Improve | RAID-10, 1 Client | RAID-10, 16 Clients | RAID-10, Improve |
|---|---|---|---|---|---|---|
| Large Read | 2.46 MB/s | 15.8 MB/s | 6.42 | 2.37 MB/s | 10.76 MB/s | 4.54 |
| Large Write | 2.62 MB/s | 12.63 MB/s | 4.82 | 2.31 MB/s | 9.96 MB/s | 4.31 |
| Small Write | 2.31 MB/s | 12.54 MB/s | 5.43 | 2.27 MB/s | 9.98 MB/s | 4.39 |
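The "Improve" column appears to be the ratio of the 16-client bandwidth to the 1-client bandwidth. The short sketch below recomputes it for the RAID-x figures under that assumption; the function names are hypothetical and the numbers are copied from the slide.

```python
# (1-client MB/s, 16-client MB/s) for RAID-x, copied from the slide's table.
RAIDX_BANDWIDTH = {
    "Large Read": (2.59, 15.63),
    "Large Write": (2.92, 15.29),
    "Small Write": (2.35, 15.1),
}


def improvement_factor(one_client, sixteen_clients):
    """Assumed definition: aggregate bandwidth with 16 clients over 1 client."""
    return round(sixteen_clients / one_client, 2)


def improvement_table(bandwidth):
    """Apply improvement_factor to every operation in a bandwidth table."""
    return {op: improvement_factor(*pair) for op, pair in bandwidth.items()}


if __name__ == "__main__":
    for op, factor in improvement_table(RAIDX_BANDWIDTH).items():
        print(f"{op}: {factor}")
```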
Effects of Stripe Unit Size on I/O Bandwidth of RAID Architectures
[Figure: large write (320 MB for 16 clients). Aggregate bandwidth (MB/s) versus stripe unit size (16, 32, 64, 128 KB) for RAID-x, chained declustering, RAID-10, and RAID-5.]
Bonnie Benchmark Results on Trojans Cluster
[Figure: file rewrite. Output rate (MB/s) versus number of disks (2, 4, 8, 12, 16) for RAID-x, chained declustering, RAID-10, and RAID-5.]
Securing Networks, Intranets, Clusters, or Grid Resources
with intrusion control and automatic recovery from malicious attacks
[Figure: reliability versus scalability. Axes: increasing reliability and increasing scalability (SMP, cluster, Intranet, Grid). Levels shown: cluster with no security protection (no data protection); gateway firewall to screen traffic flow between networks; highly secured Intranet with intrusion detection and response, automatic recovery from malicious attacks, and fault tolerance with distributed storage for reliable I/O.]
Distributed Micro-Firewalls
Distributed Checkpointing on the RAID-x in Trojans Cluster
[Figure: checkpointing timeline for twelve processes (Process 0 to Process 11) writing to three stripes (Stripe 0, Stripe 1, Stripe 2). C: checkpointing overhead; S: synchronization overhead.]
Security Component Technologies
- Firewalls and Cryptography
- Cluster Middleware for Security
- Anti-virus and Immune Systems
- Intrusion Detection and Response
- Distributed Software RAIDs
Distributed Intrusion Detection and Responses

| Security Threats | Effectiveness in using Micro-Firewalls |
|---|---|
| Insider attacks | Protect hosts against attack from insiders |
| Denial-of-service attacks | Protect against denial-of-service attacks from any source |
| Trojan program | Protect hosts from trapdoors by any source |
| IP address spoofing | Can be reconfigured to prevent IP spoofing at the client host level |
| Probes and scans | Use with IDS to block the probes and scans close to their sources |
| Unauthorized external access | Can prevent unauthorized access to the external networks at the source |
| Attacks on Intranet infrastructure | Resist both internal and external attacks and provide fine-grained access control |
Checkpointing overhead on distributed RAIDs
[Figure: checkpoint overhead (sec, 0 to 3.5) versus checkpoint file size (MB, 1.21 to 8.11) for the NFS, Vaidya, and striped schemes.]
Advantages and Shortcomings of Distributed Checkpointing

| Checkpointing Scheme | Advantages | Shortcomings | Suitable applications |
|---|---|---|---|
| Simultaneous writing to a central storage (the NFS scheme) | Simple, no inconsistent state | Has network and I/O contention; NFS is a single point of failure | Small checkpoints, small number of nodes, low I/O operations |
| Staggered writing to a central storage (Vaidya scheme) | Eliminates the network and I/O contention | Network bandwidth is wasted; NFS is a single point of failure | Small checkpoints, small number of nodes, low I/O operations |
| Striped staggering checkpointing on any distributed RAID (our scheme) | Eliminates network and I/O contention, low checkpoint overhead, fully utilizes network bandwidth, tolerates multiple failures among stripe groups | Cannot tolerate more node failures within each stripe group | Large checkpoints, large number of nodes, low communication, I/O-intensive applications |
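The three schemes in the table differ mainly in how many processes write their checkpoints at the same time. The sketch below models that with a single schedule function. It is one plausible reading of the slides, not the authors' implementation: the function name, the contiguous assignment of processes to stripe groups, and the group size of four are assumptions.

```python
def checkpoint_schedule(n_processes, group_size):
    """Return a list of time slots; each slot lists the processes that
    write their checkpoints in parallel during that slot.

    group_size == n_processes : simultaneous writing (NFS scheme)
    group_size == 1           : fully staggered writing (Vaidya scheme)
    1 < group_size < n        : striped staggering, where one stripe group
                                writes in parallel and groups take turns
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        list(range(start, min(start + group_size, n_processes)))
        for start in range(0, n_processes, group_size)
    ]


if __name__ == "__main__":
    # Twelve processes as in the checkpointing figure; the group size of 4
    # is an assumed value chosen to give three groups.
    for slot, group in enumerate(checkpoint_schedule(12, 4)):
        print(f"slot {slot}: processes {group}")
```

Fewer processes per slot means less contention but more slots; striping each group's writes over several disks is what lets the striped scheme keep contention low without serializing every process.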