Christoph Lofi
Philipp Wille
Institut für Informationssysteme
Relational Database Systems 2
2. Physical Data Storage
Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme
[Figure: DBMS architecture overview. Users: application programmers, DB administrators, direct query, application programs. Query processor: DDL interpreter, embedded DML precompiler, DML compiler, query evaluation engine, application programs object code. Storage manager: transaction manager, buffer manager, file manager. Data: indices, statistics, catalog/dictionary.]
2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study
• DBMS needs to retrieve, update and process persistently stored data
– Storage consideration is an important factor in planning a database system (physical layer)
– Remember: The data has to be securely stored, but access to the data should be declarative!
2.1 Physical Storage Introduction
• Data is stored on storage media. Media differ widely in terms of
– Random access speed
– Random/sequential read/write speed
– Capacity
– Cost per capacity
• Capacity: Quantifies the amount of data which can be stored
– Base units: 1 Bit, 1 Byte = 2^3 Bit = 8 Bit
– Capacity units according to IEC, IEEE, NIST, etc.:
• Usually used for file sizes and primary storage (for a higher degree of confusion, sometimes used with SI abbreviations…)
• 1 KiB = 1024^1 Byte; 1 MiB = 1024^2 Byte; 1 GiB = 1024^3 Byte; …
– Capacity units according to SI:
• Usually used for advertising secondary/tertiary storage
• 1 KB = 1000^1 Byte ≈ 0.976 KiB; 1 MB = 1000^2 Byte ≈ 0.954 MiB; 1 GB = 1000^3 Byte ≈ 0.931 GiB; …
– Especially used by the networking community:
• 1 Kb = 1000^1 Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000^2 Bit = 0.125 MB ≈ 0.119 MiB
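The unit relations above can be checked with a few lines of Python. This is only a minimal sketch; the names `SI`, `IEC` and `convert` are made up for this illustration.

```python
# Bytes per unit for the decimal (SI) and binary (IEC) prefixes listed above.
SI = {"KB": 1000**1, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}
IEC = {"KiB": 1024**1, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
UNITS = {"Byte": 1, **SI, **IEC}


def convert(value, from_unit, to_unit):
    """Convert between byte-based units by going through plain bytes."""
    return value * UNITS[from_unit] / UNITS[to_unit]


def kilobits_to_kib(kilobits):
    """1 Kb = 1000 bit; 8 bit = 1 byte; 1 KiB = 1024 byte."""
    return kilobits * 1000 / 8 / 1024


if __name__ == "__main__":
    for si_unit, iec_unit in [("KB", "KiB"), ("MB", "MiB"), ("GB", "GiB")]:
        print(f"1 {si_unit} = {convert(1, si_unit, iec_unit):.4f} {iec_unit}")
    print(f"1 Kb = {kilobits_to_kib(1):.4f} KiB")
```

The gap between the two conventions grows with every prefix step, which is why a disk advertised in SI units appears smaller when the operating system reports binary units.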
• Random Access Time: Average time to access a random piece of data at a known media position
– Usually measured in ms or ns
– Within some media, access time can vary depending on position (e.g. hard disks)
• Transfer Rate: Average amount of consecutive data which can be transferred per time unit
– Usually measured in KB/sec, MB/sec, GB/sec, …
– Sometimes also in Kb/sec, Mb/sec, Gb/sec
• Volatile: Memory needs constant power to keep data
– Dynamic: Dynamic volatile memory needs to be “refreshed” regularly to keep data
– Static: No refresh necessary
• Access Modes
– Random Access: Any piece of data can be accessed in approximately the same time
– Sequential Access: Data can only be accessed in sequential order
• Write Mode
– Mutable Storage: Can be read and written arbitrarily
– Write Once Read Many (WORM)
• Online media
– “always on”
– Each single piece of data can be accessed fast
– e.g. hard drives, main memory
• Nearline media
– Compromise between online and offline
– Offline media can automatically be put “on line”
– e.g. juke boxes, robot libraries
• Offline media (disconnected media)
– Not under direct control of processing unit
– Have to be connected manually
– e.g. box of backup tapes in basement
2.1 Online, Nearline, Offline
• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels
– Primary Storage: Fast, limited capacity, high price, usually volatile electronic storage
• Frequently used data / current work data
– Secondary Storage: Slower, large capacity, lower price
• Main stored data
– Tertiary Storage: Even slower, huge capacity, even lower price, usually offline
2.1 The Storage Hierarchy
[Figure: storage hierarchy pyramid with axes Cost and Speed. Levels: Primary (Cache, RAM), Secondary (Flash, Magnetic Disks), Tertiary (Optical Disks, Tape). Access time markers: ~100 ns, ~10 ms.]
Type | Media | Size | Random Acc. Speed | Transfer Speed | Characteristics | Price | Price/GB
Pri | L1-Processor Cache | 32 KB | 5 x 10^-10 s | 15.4 GB/sec | Vol, Stat, RA, OL | |
Pri | DDR3-Ram (Corsair Dominator Platinum Series) | 8 GB | 2.6 x 10^-8 s | 12.3 GB/sec | Vol, Dyn, RA, OL | € 160 | € 20
Sec | Harddrive SSD (Samsung 840 PRO) | 256 GB | 4 x 10^-6 s | 513 MB/sec | Stat, RA, OL | € 187 | € 0.73
Sec | Harddrive Magnetic (Seagate ST2000DM001) | 2000 GB | 5.7 x 10^-4 s | 153 MB/sec | Stat, RA, OL | € 100 | € 0.05
Ter | Blank recordable DVD-R disk | 4.7 GB | 9.8 x 10^-2 s | 11 MB/sec | Stat, RA, OF, WORM | € 0.15 /Disk | € 0.03
Ter | LTO-5 tape (TDK - LTO Ultrium 5 Data Cartridge) | 1500 GB | 58 s | 280 MB/sec | Stat, SA, OF | € 15 /Tape | € 0.01
2.1 Storage Media – Examples
Pri= Primary, Sec=Secondary, Ter=Tertiary
• Hard drives are currently the standard for large, cheap and persistent storage
– Usually used as the main storage media for most data in a DB
• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Often used data should be accessible with highest speed, rarely needed data may take longer
– Different data items needed for certain recurring tasks should also be stored/accessed together
• Directional magnetization of a ferromagnetic material
• Realized on hard disk platters
– Base platter made of non-magnetic aluminum or glass substrate
– Magnetic grains worked into base platter to form magnetic regions
• Each region represents 1 Bit
– Read head can detect magnetization direction of each region
– Write head may change direction
• Giant Magnetoresistance Effect (GMR)
– Discovered simultaneously in 1988 by Peter Grünberg and Albert Fert
• Both honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads:
• The electric resistance of alternating ferromagnetic and non-magnetic layers changes enormously when the direction of the applied magnetic field changes
– http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm
• Perpendicular Recording (used since 2005)
– Longitudinal recording limited to ~200 Gb/inch^2 due to the superparamagnetic effect
• Thermal energy may spontaneously change magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch^2
– Very simplified: Align magnetic field orthogonal to surface instead of parallel
• Magnetic regions can be smaller
• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
• Areas of unsure magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced
• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
– Each surface has its own read and write head
– Heads are attached to arms
• Arms can position heads along the surface
• Heads cannot move independently
– Heads have no contact to the surface and hover on top of an air bearing
• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently
EN 13.2
• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
• Same number of sectors per track, changing density, constant speed
b) Fixed data density
• Outer tracks have more sectors than inner tracks
• Transfer speed higher on outer tracks
•
Adjacent sectors can be
• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary
• Backups
• Redundancy
• Hard drives age and wear down. Wear significantly increases by:
– Contact cycles (head parking)
– Spindle start-stop
– Power-on hours
– Operation outside ideal environment
• Temperature too low/high
• Unstable voltage
2.2 HD – Reliability
• Reliability measures are statistical values assuming certain usage patterns
• Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature
• Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature
– Non-recoverable read errors: A sector on the surface cannot be read anymore; the data is lost
• Desktop disk: 1 per 10^14 read bits, Server: 1 per 10^15 read bits
• Disk can detect this!
– Maximum contact cycles: Maximum number of allowed head contacts (parking)
– Mean Time Between Failure (MTBF): Statistically anticipated time after which 50% of a large disk population has failed
• Drive manufacturers usually use optimistic simulations to guess the MTBF
• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) (manufacturers' values)
– Annualized Failure Rate (AFR): Probability of a failure per year based on MTBF
• AFR = OperatingHoursPerYear / MTBF_hours
• Desktop: 0.34%, Server: 0.73%
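The AFR figures follow directly from the usage patterns and MTBF values listed above. A minimal sketch (the function name `annualized_failure_rate` is chosen for this illustration only):

```python
def annualized_failure_rate(operating_hours_per_year, mtbf_hours):
    """AFR = OperatingHoursPerYear / MTBF, both measured in hours."""
    return operating_hours_per_year / mtbf_hours


if __name__ == "__main__":
    # Usage patterns and manufacturer MTBF values as quoted on the slide.
    desktop = annualized_failure_rate(2_400, 700_000)
    server = annualized_failure_rate(8_760, 1_200_000)
    print(f"Desktop AFR: {desktop:.2%}")
    print(f"Server AFR:  {server:.2%}")
```

Although the server disk has the higher MTBF, its AFR is larger because it runs around the clock.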
2.2 HD – Reliability
• Failure rate during a hard disk's lifespan is not constant
• Can be better modeled by the “bathtub curve” having 3 components
– Infant Mortality Rate
– Wear Out Failures
– Random Failures
• Report by Google
– 100,000 consumer grade disks (80-400 GB, ATA interface, 5,400-7,200 RPM)
• Results (among others)
– Drives fail often!
– There is infant mortality
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
Failure trends in a large disk drive population
E. Pinheiro, W.-D. Weber, L. A. Barroso 5th USENIX conference on File and Storage Technologies, 2007
2.2 Real World Failure Rates
• Seagate ST32000641AS 2 TB (Desktop Harddrive, 2011)
– Manufacturer’s specifications
2.2 HD – Example Specs
Specification | Value
Capacity | 2 TB
Platters | 4
Heads | 8
Cylinders | 16,383
Sectors per track | 63
Bytes per sector | 512
Spindle Speed | 7200 RPM
MTBF | 85 years
• Assume a storage need of 100 TB. Only the following HDs are available
– Capacity: 1 TB each
– MTBF: 100,000 hours each (ca. 11 years)
• Consider using 100 of these disks independently (w/o RAID).
– Total Storage: 100,000 GB = 100 TB
– MTBF: 1,000 hours (ca. 42 days)
– THIS IS BAD!
• More sophisticated ways of using multiple disks are needed
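The 42-day figure can be reproduced with a short calculation. The sketch below assumes, as the slide implicitly does, that failures are independent and occur at a constant rate, so that the failure rates of the disks simply add up; `combined_mtbf_hours` is a name invented here.

```python
def combined_mtbf_hours(mtbf_disk_hours, num_disks):
    """Expected time until the first of num_disks independent disks fails.

    With constant, independent failure rates the rates add up,
    so the combined MTBF is the single-disk MTBF divided by the disk count.
    """
    return mtbf_disk_hours / num_disks


if __name__ == "__main__":
    hours = combined_mtbf_hours(100_000, 100)
    print(f"{hours:.0f} hours = {hours / 24:.1f} days")
```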
• Alternative to hard drives: SSD
– Use microchips which retain data in non-volatile memory chips and contain no moving parts
• Use the same interface as hard disk drives
– Easy replacement in most applications possible
• Key components
– Memory
– Controller
• Flash-Memory
– Most SSDs use NAND-based flash memory
– Retains memory even without power
– Slower than DRAM solutions
– Single-level cell versus multi-level cell
– Wears down!
• DRAM
– Use volatile Random Access Memory
– Ultrafast data access (< 10 microseconds)
– Sometimes use internal battery or external power device to ensure data persistence
– Only for applications that require even faster access, but do not need data persistence after power loss
2.2 Memory
• The controller is an embedded processor
• Incorporates the electronics that bridge the NAND memory components to the host computer
• Some of its functions
– Error correction, wear leveling, bad block mapping, read and write caching, encryption, garbage collection
• Advantages
– Low access time and latency
– No moving parts, hence shock resistant
• MTBF about 2 million hours
– Lighter and more energy-efficient than HDDs
• Disadvantages
– Divided into blocks/pages
• If one byte changes, the whole page has to be written
• The old page will be marked as stale
• Only whole blocks can be deleted
– Limited ability of being rewritten (between 3,000 and 100,000 cycles per page)
• Wear leveling algorithms ensure that write operations are equally distributed to the pages
2.2 SSD – Summary
• The disk controller organizes low level access to the disk
– e.g. head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disks (e.g. LBA)
– Connects disk to a peripheral bus (e.g. IDE, SCSI, FibreChannel, SAS)
• The host bus adapter (HBA) bridges between the peripheral bus and the system's internal bus (like PCIe, PCI)
– Internal bus usually integrated into the system's main board
– Often confused as being the disk controller
• DAS (Directly Attached Storage)
2.2 HD – Controller
[Figure: Disk, Disk Controller, Peripheral Bus, Host Bus Adapter, Internal Bus, Inner System / Mainboard]
• Sectors can be logically grouped into blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g. NTFS defaults to 4 KiB per block
• 8 sectors on a modern disk
• Hardware address of a block is a combination of
– Cylinder number, surface number, block number within track
– Controller maps hardware address to logical block address (LBA)
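How a (cylinder, surface, block) triple could map to a single logical block address can be sketched as follows. This is a simplified illustration with an idealized geometry (same number of blocks on every track, zero-based indices). Real controllers additionally deal with zoned recording and remapped bad sectors, so this is not the mapping of any specific drive. All names are hypothetical.

```python
from typing import NamedTuple


class Geometry(NamedTuple):
    cylinders: int
    surfaces: int
    blocks_per_track: int


class HardwareAddress(NamedTuple):
    cylinder: int
    surface: int
    block: int  # zero-based block number within the track


def to_lba(addr, geo):
    """Number the blocks cylinder by cylinder, surface by surface."""
    track_index = addr.cylinder * geo.surfaces + addr.surface
    return track_index * geo.blocks_per_track + addr.block


def from_lba(lba, geo):
    """Inverse mapping: split the LBA back into its three components."""
    track_index, block = divmod(lba, geo.blocks_per_track)
    cylinder, surface = divmod(track_index, geo.surfaces)
    return HardwareAddress(cylinder, surface, block)


if __name__ == "__main__":
    geo = Geometry(cylinders=1000, surfaces=8, blocks_per_track=63)
    addr = HardwareAddress(cylinder=2, surface=3, block=10)
    lba = to_lba(addr, geo)
    print(lba, from_lba(lba, geo))
```

Consecutive LBAs stay within one cylinder as long as possible in this numbering, which matches the observation that data within the same cylinder can be accessed without moving the heads.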
2.2 HD – Controller
• Disk controller transfers content of whole blocks to buffer
– Buffer resides in primary storage and can be accessed efficiently
– Time needed to transfer a random block (4 KiB/block on ST3100034AS): (<10 msec)
• Seek Time: Time needed to position head to correct cylinder (<8 msec)
• Latency (Rotational Delay): Time until the correct block arrives below the head (<0.14 msec)
• Block Transfer Time: Time to read all sectors of block (<0.01 msec)
– Bulk Transfer Rate for n adjacent blocks (<20 msec for n=10)
• Locating data on a disk is a major bottleneck
– Try operating on data already in buffer
– Aim for bulk transfer, avoid random block transfer
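The benefit of bulk transfer can be illustrated with a deliberately simple cost model. The sketch below treats the upper bounds quoted above as if they were fixed costs and assumes that adjacent blocks need only one seek and one rotational delay; both are simplifying assumptions, and the function names are invented for this example.

```python
# Upper bounds from the slide, used here as fixed costs (an assumption).
SEEK_MS = 8.0
LATENCY_MS = 0.14
BLOCK_TRANSFER_MS = 0.01


def random_access_ms(num_blocks):
    """Every block pays for seek, rotational delay and transfer."""
    return num_blocks * (SEEK_MS + LATENCY_MS + BLOCK_TRANSFER_MS)


def bulk_access_ms(num_blocks):
    """Adjacent blocks: position once, then only transfer time per block."""
    return SEEK_MS + LATENCY_MS + num_blocks * BLOCK_TRANSFER_MS


if __name__ == "__main__":
    n = 10
    print(f"{n} random blocks:   {random_access_ms(n):.2f} ms")
    print(f"{n} adjacent blocks: {bulk_access_ms(n):.2f} ms")
```

Under these assumptions positioning dominates the cost, so reading ten adjacent blocks costs about as much as reading one random block.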
2.2 HD – Controller
• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability
• Idea: Combine multiple HDs into a RAID array (Redundant Array of Independent Disks)
– RAID array treats multiple hardware disks as a single logical disk
• More HDs for increased capacity
• The RAID controller connects to multiple hard disks
– Disks are virtualized and appear to be just one single logical disk
– The RAID controller acts as an extended specialized HBA (Host Bus Adapter)
– Still DAS (Directly Attached Storage)
2.3 RAID Controller
[Figure: RAID controller between peripheral bus and internal bus; the attached disks are represented as a single logical disk]
• Mirroring (or shadowing): Increases reliability by complete redundancy
• Idea: Mirror disks are exact copies of the original disk
– Not space efficient
• Read speed can be n times as fast, write speed does not increase
• Increases reliability. Assume
– Two disks with an MTBF of 11 years each
• One original disk, one mirror disk
• Assume disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
• Striping: Improve performance by parallelism
• Idea: Distribute data among all disks for increased performance
• Bit-Level Striping: Split all bits of a byte to the disks
– e.g. for 8 disks, write the i-th bit to disk i
– Number of disks needs to be a power of 2
– Each disk is involved in each access
• Access rate does not increase
• Read and write transfer speed linearly increases
• Simultaneous accesses not possible
– Good for speeding up few, sequential and large accesses
Silber 11.3
• Block-Level Striping: Distribute blocks among the disks
– Only one disk is involved in reading a specific block
• Read and write speed of a single block not increased
• Other disks still free to read/write other blocks
• Read and write speed of multiple accesses increases
– Good for large number of parallel accesses
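A common way to distribute blocks is round-robin placement, which the following sketch assumes (the slide does not prescribe a particular placement scheme). The function `locate_block` and its return convention are made up for illustration.

```python
def locate_block(logical_block, num_disks):
    """Round-robin striping: return (disk index, block offset on that disk)."""
    offset, disk = divmod(logical_block, num_disks)
    return disk, offset


if __name__ == "__main__":
    num_disks = 4
    for block in range(8):
        disk, offset = locate_block(block, num_disks)
        print(f"logical block {block} -> disk {disk}, offset {offset}")
```

Neighbouring logical blocks end up on different disks, so independent requests for different blocks can often be served in parallel, while a single block is still read from exactly one disk.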
• Error Correction Codes: Increase reliability with computed redundancy
• Hamming Codes (~1950)
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
• n = 2^k - k - 1
• n = 1, k = 2; n = 4, k = 3; n = 11, k = 4; n = 26, k = 5; …
– Especially used for in-memory and tape error correction
• Media cannot detect errors autonomously
• Not really used for hard drives anymore
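The relation between parity bits and protected data bits can be tabulated quickly. A small sketch, with `hamming_data_bits` as an illustrative name:

```python
def hamming_data_bits(parity_bits):
    """Data bits n protected by k parity bits: n = 2^k - k - 1."""
    return 2**parity_bits - parity_bits - 1


if __name__ == "__main__":
    for k in range(2, 7):
        n = hamming_data_bits(k)
        overhead = k / n
        print(f"k={k}: n={n:3d} data bits, overhead {overhead:.0%}")
```

The relative overhead shrinks as k grows, which is why the ratio of redundant to data disks improves for larger arrays that use Hamming codes.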
• Interleaved Parity (Reed-Solomon algorithm on the Galois field GF(2))
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, no need for complete Hamming codes
– Basic Idea:
• From n data pieces D_1, …, D_n compute a parity data D_p by combining data using logical XOR (eXclusive OR)
– XOR is associative and commutative
– Important: A XOR B XOR B = A
• i.e. D_p = D_1 XOR D_2 XOR … XOR D_n
• Interleaved Parity. Example:
• A = 0101, B = 1100, C = 1011
• P = 0010 = A XOR B XOR C
• C is lost.
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = A XOR A XOR B XOR B XOR C
– C = 0 XOR C
– C = 1011
2.3 RAID Principles – Interleaved Parity
0101 XOR 1100 XOR 1011 = 0010 (P)
0010 XOR 0101 XOR 1100 = 1011 (C)
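The parity example can be replayed in code. The following is a minimal sketch that models each data piece as a small integer; `parity` and `recover` are names introduced here, not part of any RAID implementation.

```python
from functools import reduce
from operator import xor


def parity(pieces):
    """XOR all data pieces together to obtain the parity piece."""
    return reduce(xor, pieces, 0)


def recover(surviving_pieces, parity_piece):
    """Rebuild the single missing piece from the survivors and the parity.

    Works because XOR is associative and commutative and X XOR X = 0.
    """
    return parity(surviving_pieces) ^ parity_piece


if __name__ == "__main__":
    a, b, c = 0b0101, 0b1100, 0b1011
    p = parity([a, b, c])
    print(f"P = {p:04b}")
    print(f"recovered C = {recover([a, b], p):04b}")
```

The same function recovers any one lost piece, including the parity itself, as long as the position of the lost piece is known.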
• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels
– RAID 0 – RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5
• In the following examples, assume
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time to Repair (MTTR) of 6 hours
– Failure rate is constant and failures between disks are independent
– MTBF_raid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C_disk = 200 GB is the capacity of one disk, C_raid the capacity of the whole RAID
• Mean Time to Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume time to notice of 0.5 hours
– Rebuild time is the time for completely writing back lost data
• Assume disk capacity of 200 GB
• Write back speed of 10 MB/sec
– Consisting of reading remaining disks
– Computing parity / reconstructing data
• Rebuild time around 5.5 hours
– During rebuild, a RAID is especially vulnerable
– MTTR = 6 hours
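The MTTR estimate is plain arithmetic on the assumptions above. A minimal sketch, assuming SI units (1 GB = 1000 MB); the function names are illustrative:

```python
def rebuild_time_hours(capacity_gb, write_back_mb_per_sec):
    """Time to write back a full disk, assuming 1 GB = 1000 MB."""
    seconds = capacity_gb * 1000 / write_back_mb_per_sec
    return seconds / 3600


def mttr_hours(time_to_notice_hours, capacity_gb, write_back_mb_per_sec):
    """MTTR = TimeToNotice + RebuildTime."""
    return time_to_notice_hours + rebuild_time_hours(
        capacity_gb, write_back_mb_per_sec
    )


if __name__ == "__main__":
    print(f"Rebuild time: {rebuild_time_hours(200, 10):.2f} h")
    print(f"MTTR:         {mttr_hours(0.5, 200, 10):.2f} h")
```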
• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx)
• RAID 0
– Block-Level Striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBF_raid = MTBF_disk / D
– 4 disks:
• MTBF_raid = 2.86 years
• C_raid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
• RAID 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBF_raid = MTBF_disk^D / (D! * MTTR^(D-1))
– 4 disks:
• MTBF_raid = 2.2 trillion years
• C_raid = 200 GB (600 GB wasted (75%))
• Age of universe may be around 15 billion years…
– Common size: 2 disks
• MTBF_raid = 95,130 years
• C_raid = 200 GB (200 GB wasted (50%))
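The RAID 0 and RAID 1 figures can be recomputed from the two formulas. The sketch below uses the slide's assumptions (MTBF of 100,000 hours per disk, MTTR of 6 hours) and assumes a year of 8,760 hours for the conversion; function names are chosen for this example.

```python
from math import factorial

HOURS_PER_YEAR = 8_760  # assumption: 365 days


def mtbf_raid0_hours(mtbf_disk, num_disks):
    """RAID 0: any single disk failure loses data."""
    return mtbf_disk / num_disks


def mtbf_raid1_hours(mtbf_disk, num_disks, mttr):
    """RAID 1: data is lost only if all mirrors fail within the repair window."""
    return mtbf_disk**num_disks / (factorial(num_disks) * mttr ** (num_disks - 1))


if __name__ == "__main__":
    mtbf, mttr = 100_000, 6
    print(f"RAID 0, 4 disks: {mtbf_raid0_hours(mtbf, 4) / HOURS_PER_YEAR:.2f} years")
    print(f"RAID 1, 2 disks: {mtbf_raid1_hours(mtbf, 2, mttr) / HOURS_PER_YEAR:,.0f} years")
    print(f"RAID 1, 4 disks: {mtbf_raid1_hours(mtbf, 4, mttr) / HOURS_PER_YEAR:.2e} years")
```

Small deviations from the slide values stem from rounding the per-disk MTBF to 11.42 years.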
2.3 RAID Levels
• RAID 2
– Not used anymore in practice
• Was used in old mainframes
– Bit-Level Striping
– Use Hamming Codes
• Usually Hamming Code(7,4): 4 data bits, 3 parity bits
• Reliable 1-bit error recovery (i.e. one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
• Ratio better for larger number of disks
– MTBF_raid = MTBF_disk^2 / (D*(D-1) * MTTR)
– 7 disks (does not really make sense for 4; not comparable to other values)
• MTBF_raid = 4,530 years
• RAID 3
– Interleaved Parity
– Byte-Level Striping
– Dedicated parity disk
• Bottleneck! Every write operation needs to update the parity disk.
• No parallel writes
– 1 redundant disk per n data disks
• Overhead decreases with number of disks while reliability decreases
• 25% overhead for 4 disks
– MTBF_raid = MTBF_disk^2 / (D*(D-1) * MTTR)
– 4 disks
• MTBF_raid = 15,854 years
• C_raid = 600 GB (200 GB wasted (25%))
2.3 RAID Levels
• RAID 4
– Block-Level Striping
– As RAID 3 otherwise
– 4 disks (common size)
• MTBF_raid = 15,854 years
• C_raid = 600 GB (200 GB wasted (25%))
– 5 disks (also common size)
• MTBF_raid = 9,513 years
• C_raid = 800 GB (200 GB wasted (20%))
• RAID 5
– Parity is distributed among the hard disks
• May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
• Whole parity block has to be read and re-written for each minor write
– Can recover from a single disk failure
– MTBF_raid and C_raid as for RAID 3 & 4
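For all levels that survive exactly one disk failure (RAID 2 to RAID 5 in the slides), the same MTBF formula applies. A sketch under the slide's assumptions (MTBF 100,000 hours per disk, MTTR 6 hours, 200 GB per disk), with a year taken as 8,760 hours; the names are illustrative:

```python
HOURS_PER_YEAR = 8_760  # assumption: 365 days


def mtbf_single_parity_hours(mtbf_disk, num_disks, mttr):
    """Data loss requires a second failure while the first is being repaired."""
    return mtbf_disk**2 / (num_disks * (num_disks - 1) * mttr)


def usable_capacity_gb(num_disks, disk_capacity_gb):
    """One disk's worth of capacity is spent on parity (RAID 3, 4, 5)."""
    return (num_disks - 1) * disk_capacity_gb


if __name__ == "__main__":
    for disks in (4, 5, 7):
        years = mtbf_single_parity_hours(100_000, disks, 6) / HOURS_PER_YEAR
        print(f"{disks} disks: MTBF_raid = {years:,.0f} years")
    print(f"4 disks: C_raid = {usable_capacity_gb(4, 200)} GB")
    print(f"5 disks: C_raid = {usable_capacity_gb(5, 200)} GB")
```

Note that the capacity helper does not apply to RAID 2, which needs several redundant disks for its Hamming code.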
2.3 RAID Levels
• RAID 6
– Two independent parity blocks distributed among the disks
• May be implemented by parity on orthogonal data or by using Reed-Solomon on GF(2^8)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
• Can recover from a double disk failure
• No vulnerability during single failure rebuild
• Very suitable for larger arrays
• Write overhead due to more complicated parity computation
– MTBF_raid = MTBF_disk^3 / (D*(D-2)*(D-1) * MTTR^2)
– 4 disks
• MTBF_raid = 132 million years
• C_raid = 400 GB (400 GB wasted (50%))
– 8 disks (common)
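The RAID 6 formula can be evaluated in the same way. This sketch again assumes an MTBF of 100,000 hours per disk, an MTTR of 6 hours, 200 GB disks and 8,760 hours per year; the 8-disk result is simply what the formula yields under these assumptions, not a value taken from the slides.

```python
HOURS_PER_YEAR = 8_760  # assumption: 365 days


def mtbf_raid6_hours(mtbf_disk, num_disks, mttr):
    """Data loss requires three overlapping disk failures."""
    return mtbf_disk**3 / (
        num_disks * (num_disks - 1) * (num_disks - 2) * mttr**2
    )


def raid6_capacity_gb(num_disks, disk_capacity_gb):
    """Two disks' worth of capacity is spent on the two parity blocks."""
    return (num_disks - 2) * disk_capacity_gb


if __name__ == "__main__":
    for disks in (4, 8):
        years = mtbf_raid6_hours(100_000, disks, 6) / HOURS_PER_YEAR
        print(f"{disks} disks: MTBF_raid = {years / 1e6:.1f} million years, "
              f"C_raid = {raid6_capacity_gb(disks, 200)} GB")
```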
• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …
• RAID 1+0
– Mirrored sets nested in a striped set
• RAID 0 on sets of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– Most performant RAID combination
– D_1 = drives per RAID 1, D_0 = number of RAID 1 sets
– MTBF_raid = MTBF_disk^(D_1) / (D_1! * MTTR^(D_1 - 1)) / D_0
– 4 disks: D_1 = 2, D_0 = 2
• MTBF_raid = 47,565 years
• C_raid = 400 GB (400 GB wasted (50%))
– 6 disks: D_1 = 2, D_0 = 3
• MTBF_raid = 31,706 years
• C_raid = 600 GB (600 GB wasted (50%))
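The RAID 1+0 formula is the RAID 1 formula for one mirror set, divided by the number of striped sets. A sketch under the slide's assumptions (MTBF 100,000 hours, MTTR 6 hours, 200 GB per disk) and an assumed year of 8,760 hours; results match the slide values up to rounding. Function names are illustrative.

```python
from math import factorial

HOURS_PER_YEAR = 8_760  # assumption: 365 days


def mtbf_raid10_hours(mtbf_disk, drives_per_mirror, num_mirror_sets, mttr):
    """RAID 1 reliability per mirror set, divided by the number of striped sets."""
    mirror_set = mtbf_disk**drives_per_mirror / (
        factorial(drives_per_mirror) * mttr ** (drives_per_mirror - 1)
    )
    return mirror_set / num_mirror_sets


def raid10_capacity_gb(num_mirror_sets, disk_capacity_gb):
    """Each mirror set contributes the capacity of a single disk."""
    return num_mirror_sets * disk_capacity_gb


if __name__ == "__main__":
    for sets in (2, 3):
        years = mtbf_raid10_hours(100_000, 2, sets, 6) / HOURS_PER_YEAR
        print(f"{2 * sets} disks: MTBF_raid = {years:,.0f} years, "
              f"C_raid = {raid10_capacity_gb(sets, 200)} GB")
```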
• RAID controllers directly connect storage to the system bus
– Storage available to only one system/server/application
• Number of disks is limited
– Consumer grade RAID: 2-4 disks
– Enterprise grade RAID: 8-24+ disks
• Solutions
– NAS (Network Attached Storage): Provide abstracted file systems via network (software solution)
– SAN (Storage Area Network): Virtualized logical storage within a specialized network on block level (hardware solution)
• Before discussing NAS, we need file systems
• A file system is software for abstracting file operations on a logical storage device
– Files are a collection of binary data
• Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into top-level operations on a logical storage device?
• e.g. which blocks have to be read/written?
• Bridge between application software and (abstracted) hardware
2.4 File Systems vs. Raw Devices
[Figure: layers Application Software, File System, Logical Storage]
• Raw device access allows applications to bypass the OS and the file system
• Application may directly tune aspects of physical storage
– There is still the hard drive controller… so, it's not really direct
• May lead to very efficient implementations
2.4 File Systems vs. Raw Devices
• Idea: Provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Use specialized network protocols (e.g. CIFS, NFS, FTP, etc.)
– Easiest case: File server (e.g. Linux+Samba)
• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on file system level (easy for most applications)
• Disadvantages
– Inefficient and slow
• Large protocol and processing overhead
– Abstracts on file system level (not suitable for special purposes like raw devices or storage virtualization)
2.4 NAS – Network Attached Storage
[Figure: NAS stack with the labels Application Software, Network, NAS Server, File System, Logical Storage]
• SANs offer specialized high-speed networks for storage devices
– Usually use local FibreChannel networks
– Remote locations may be connected via Ethernet or IP-WAN (Internet)
– Network uses specialized storage protocols
• iFCP (SCSI on FibreChannel)
• iSCSI (SCSI on TCP/IP)
• HyperSCSI (SCSI on raw Ethernet)
• SANs provide raw block level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– To a client system using a logical disk, it might appear like a local disk or RAID
–
Client system has full control over file systems on
2.4 SAN – Storage Area Network
2.4 SAN – Storage Area Network
[Figure: SAN topology. Labels: SAN HBAs, SAN/RAID HBA, peripheral bus (SCSI, SAS, etc.), SAN switches, SAN bus (iFCP), WAN-SAN bus (HyperSCSI), NAS head, NAS protocol (CIFS), Ethernet network]
• Advantages:
– Very efficient
• Highly optimized local network infrastructure
• Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps with disaster protection
• SAN can transparently span even remote locations
– May also employ NAS heads for NAS-like behavior
•
Disadvantages
• There are different types of storage
– Usually, there is a storage hierarchy
• Faster, smaller, more expensive storage
• Slower, bigger, less expensive storage
• Hard drives are currently the most popular media
– Mechanical device
• High sequential transfer rates
• Bad random access times, low random transfer rates
• Prone to failure
– DBMS must be optimized for the used storage devices!