Relational Database Systems 2 2. Physical Data Storage

(1)

Christoph Lofi

Philipp Wille

Institut für Informationssysteme

Relational Database Systems 2

2. Physical Data Storage

(2)

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme

2 Data

Storage Manager

Query Processor

Application

Interfaces

Indices

Statistics

DDL

Interpreter

Query

Evaluation

Engine

Applications

Programs

Object Code

Transaction Manager

Buffer

Manager

File Manager

Catalog/

Dictionary

Embedded

DML

Precompiler

DML

Compiler

DB Scheme

Application

Programs

Direct Query

Application

Programmers

DB

Administrators

(3)

2.1 Introduction

2.2 Hard Disks

2.3 RAIDs

2.4 SANs and NAS

2.5 Case Study

(4)

• DBMS needs to retrieve, update and process

persistently stored

data

–

Storage consideration is an important factor in

planning a database system (physical layer)

–

Remember:

The data has to be securely stored, but

access to the data should be declarative!

4 2.1 Physical Storage Introduction

(5)

• Data is stored on a

storage media

. Media highly

differ in terms of

–

Random Access

Speed

–

Random/Sequential Read/Write

speed

–

Capacity

–

Cost

per Capacity

(6)

• Capacity:

Quantifies the amount of data

which can be stored

–

Base Units: 1 Bit, 1 Byte = 2

3 _{Bit = 8 Bit}

–

Capacity units according to IEC, IEEE, NIST, etc:

• Usually used for file sizes and primary storage (for higher

degree of confusion, sometimes used with SI abbreviations…)

•

1 KiB

= 1024

1 _{Byte; 1}

_MiB

_{= 1024}

2 _{Byte ; 1}

_GiB

_{= 1024}

3 _Byte;

…

–

Capacity units according to SI:

• Usually used for advertising secondary/tertiary storage

•

1 KB

= 1000

1 _Byte

_≈

_{0.976 KiB; 1}

_MB

_{= 1000}

2 _Byte

_≈

_{0.954 MiB;}

1 GB

= 1000

3 _Byte

_≈

_{0.931 GiB; …}

–

Especially used by the networking community:

•

1 Kb

= 1000

1 _Bit

₌

_{0.125 KB}

_≈

_{0.122 KiB ;}

1 Mb

= 1000

2 _Bit

₌

_{0.125 MB}

_≈

_{0.119 MiB}

6

(7)

(8)

• Random Access Time:

Average time to access

a random piece of data at a known media position

–

Usually measured in ms or ns

–

Within some media, access time can vary depending

on position (e.g. hard disks)

• Transfer Rate:

Average amount consecutive of

data which can be transferred per time unit

–

Usually measured in KB/sec, MB/sec, GB/sec,…

–

Sometimes also in Kb/sec, Mb/sec, Gb/sec

8

(9)

• Volatile:

Memory needs constant power to keep data

–

Dynamic:

Dynamic volatile memory needs to be “refreshed”

regularly to keep data

–

Static:

No refresh necessary

• Access Modes

–

Random Access:

Any piece of data can be accessed in

approximately the same time

–

Sequential Access:

Data can only be accessed in sequential

order

• Write Mode

–

Mutable

Storage: Can be read and written arbitrarily

–

Write Once Read Many (

WORM

)

(10)

• Online

media

–

„always on“

–

Each single piece of data can be accessed fast

–

e.g. hard drives, main memory

• Nearline

media

–

Compromise between online and offline

–

Offline media can automatically put “on line”

–

e.g. juke boxes, robot libraries

• Offline

media (disconnected media)

–

Not under direct control of processing unit

–

Have to be connected manually

–

e.g. box of backup tapes in basement

10

2.1 Online, Nearline, Offline

(11)

• Media characteristics result in a

storage hierarchy

• DBMS optimize data distribution among the storage

levels

–

Primary Storage:

Fast, limited capacity, high price,

usually volatile electronic storage

• Frequently used data / current work data

–

Secondary Storage:

Slower, large capacity, lower price

• Main stored data

–

Tertiary Storage:

Even slower, huge capacity, even lower

price, usually offline

(12)

12

2.1 The Storage Hierarchy

Cos

t

Spe

ed

Optical Disks, Tape

Primary

Secondary

Tertiary

Cache, RAM

Flash, Magnetic Disks

~100 ns

~10 ms

(13)

Type

Media

Size

Random

Acc. Speed

Transfer

Speed

Characteristics

Price

Price/GB

Pri

L1-Processor Cache

32 KB

5 x 10

-10

_s

_{15.4 GB/sec}

_{Vol, Stat,}

RA,OL

Pri

DDR3-Ram

(Corsair Dominator

Platinum Series)

8 GB

2.6 x 10

-8

_s

_{12.3 GB/sec}

_{Vol, Dyn, Ra,}

OL

€ 160

€ 20

Sec

Harddrive SSD

(Samsung 840 PRO)

256 GB

4 x 10

-6

_s

_{513 MB/sec}

_{Stat, RA, OL}

_{€ 187}

_{€ 0.73}

Sec

Harddrive Magnetic

(Seagate ST2000DM001)

2000 GB

5.7 x 10

-4

_s

_{153 MB/sec}

_{Stat, RA, OL}

_{€ 100}

_{€ 0.05}

Ter

Blank recordable

DVD-R disk

4.7 GB

9.8 x 10

-2

_s

_{11 MB/sec}

_{Stat, RA, OF,}

WORM

€ 0.15

/Disk

€ 0.03

Ter

LTO-5 tape

(TDK - LTO Ultrium 5 Data Cartridge)

1500 GB

58 s

280 MB/sec

Stat, SA, OF

€ 15

/Tape

€ 0.01

2.1 Storage Media – Examples

Pri= Primary, Sec=Secondary, Ter=Tertiary

(14)

• Hard drives are currently the standard for

large, cheap and persistent storage

–

Usually used as the

main storage

media

for most data in a DB

• DBMS need to be

optimized

for

efficient disk storage and access

–

Data access needs to be as fast as possible

–

Often used data should be accessible with highest speed,

rarely needed data may take longer

–

Different data items needed for certain reoccurring tasks

should also be stored/accessed together

14

(15)

• Directionally

magnetization of a ferromagnetic

material

• Realized on hard disk platters

–

Base platter made of non-magnetic aluminum or glass substrate

–

Magnetic grains worked into base platter to form

magnetic regions

• Each region represents 1 Bit

–

Read head

can detect

magnetization direction of

each region

–

Write head

may change

direction

(16)

• G

iant

M

agneto

r

esistance Effect (

GMR

)

–

Discovered 1988 simultaneously by Peter Grünberg and Albert

Fert

• Both honored with the 2007 Nobel Prize in Physics

–

Allows the construction of efficient

read heads:

• The electric resistance of an alternating ferro- and non-magnetic

material giantally changes within changing directed magnetic fields

–

http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm

16

(17)

• Perpendicular

Recording (used since 2005)

–

Longitudal Recording limited to ~200 Gb/inch

2 _{due to}

superparamagnetic effect

• Thermal energy may spontaneously change magnetic direction

–

Perpendicular recording allows for up to 1000 Gb/inch

2 –

Very simplified:

Align

magnetic field orthogonal

to surface instead of

parallel

• Magnetic regions can be

smaller

(18)

• Usage of magnetic

grains

instead of continuous

magnetic material

–

Between magnetic direction transitions,

Neel Spikes

are

formed

• Areas of unsure magnetic direction

–

Neel Spikes are larger for continuous materials

–

Magnetic regions can be smaller as the transition width can be

reduced

18

(19)

• A hard disk is made up of multiple

double-sided platters

–

Platter sides are called

surfaces

–

Platters are fixed on main

spindle

and rotate at equal and constant speed

(common: 5400 rpm / 7200 rpm)

–

Each surface has it’s own

read

and

write head

–

Heads are attached to

arms

• Arms can position heads

along the surface

• Heads cannot move

inde-pendently

–

Heads have no contact

to surface and hover on

top of an air bearing

(20)

• Each surface is divided into circular

tracks

–

Some disks may use spirals

• All tracks of all surfaces with the same diameter

are called

cylinder

–

Data within the same cylinder can be accessed very

efficiently

20

EN 13.2

(21)

• Each track is subdivided into

sectors

of equal

capacity

a) Fixed angle sector subdivision

• Same number of sectors per track, changing density,

constant speed

b) Fixed data density

• Outer tracks have more sectors than inner tracks

• Transfer speed higher on

outer tracks

• Adjacent sectors can be

(22)

• Hard drives are not completely reliable!

–

Drives do

fail

–

Means for physical failure recovery are necessary

• Backups

• Redundancy

• Hard drives age and

wear

down.

Wear significantly increases by:

–

Contact

cycles (head parking)

–

Spindle start-stop

–

Power-on

hours

–

Operation outside ideal environment

• Temperature too low/high

• Unstable voltage

22

2.2 HD – Reliability

(23)

• Reliability

measures are

statistical

values assuming

certain usage patterns

• Desktop usage (all per year): 2,400 hours, 10,000 motor

start/stops, 25°C temperature

• Server usage (all per year): 8,760 hours, 250 motor start/stops,

40°C temperature

–

Non-Recoverable read errors:

A sector on the surface

cannot be read anymore – the data is lost

• Desktop disk: 1 per 10

14 _{read bits, Server: 1 per 10}

15 _{read bits}

• Disk can detect this!

–

Maximum

contact cycles

: Maximum number of allowed

head contacts (parking)

(24)

–

Mean Time Between Failure (

MTBF

): Statistically

anticipated time for a large disk population failing to

50%

• Drive manufactures usually use optimistic simulations to

guess the MTBF

• Desktop: 0.7 million hours (80 years), Server: 1.2 million

hours (137 years) –

Manufacturers values

–

Annualized Failure Rate (

AFR

): Probability of a failure

per year based on MTBF

• AFR = OperatingHoursPerYear / MTBF

_hours

• Desktop: 0.34%, Server: 0.73%

24

2.2 HD – Reliability

(25)

• Failure rate during a hard disks lifespan is not constant

• Can be better modeled by the “

bathtub curve

” having

3 components

–

Infant Mortality Rate

–

Wear Out Failures

–

Random Failures

(26)

• Report by Google

–

100,000 consumer grade disks

(80-400GB, ATA Interface,

5,400-7,200 RPM)

• Results (among others)

–

Drives fail often!

–

There is infant mortality

–

High usage increases infant

mortality, but not later failure

rates

–

Observed AFR is around

7%

and

MTBF

16.6 years

!

Relational Database Systems 2 – Wolf-Tilo Balke – Institut für

Informationssysteme

26

Failure trends in a large disk drive population

E. Pinheiro, W.-D. Weber, L. A. Barroso 5th USENIX conference on File and Storage Technologies, 2007

2.2 Real World Failure Rates

(27)

• Seagate ST32000641AS 2 TB (Desktop Harddrive, 2011)

–

Manufacturer’s specifications

2.2 HD – Example Specs

Specification

Value

Capacity

2 TB

Platters

4 Heads

8 Cylinders

16,383

Sectors per track

63 Bytes per sector

512 Spindle Speed

7200 RPM

MTBF

85 years

(28)

• Assume a storage need of 100 TB. Only following

HDs are available

–

Capacity:

1 TB

capacity each

–

MTBF: 100,000 hours each (ca.

11 years

)

• Consider using 100 of these disks independently

(w/o RAID).

–

Total Storage: 100,000 GB = 100 TB

–

MTBF: 1,000 hours (ca.

42 days

)

–

THIS IS BAD!

• More sophisticated ways of using multiple disks are

needed

28

(29)

• Alternative to hard-drives:

SSD

–

Use microchips which retain data in non-volatile

memory chips

and contain

no moving parts

• Use the same interface as hard disk drives

–

Easy replacement in most applications possible

• Key components

–

Memory

–

Controller

(30)

• Flash-Memory

–

Most SSDs use NAND-based flash memory

–

Retains memory even without power

–

Slower than DRAM solutions

–

Single-level cell versus multi-level cell

–

Wears down!

• DRAM

–

Use volatile Random Access Memory

–

Ultrafast data access (< 10 microseconds)

–

Sometimes use internal battery or external power device to

ensure data persistence

–

Only for Applications that require even faster access, but do

not need data persistence after power loss

30

2.2 Memory

(31)

• The controller is an embedded processor

• Incorporates the electronics that bridge the

NAND memory components to the host

computer

• Some of its functions

–

error correction,

wear leveling

,

bad block

mapping

, read and write caching, encryption, garbage

collection

(32)

• Advantages

–

Low access time and latency

–

No moving parts



shock resistant

• MTBF about 2 million hours

–

Lighter and more energy-efficient than HDDs

• Disadvantages

–

Divided into blocks/pages

• If one byte changes the whole page has to be written

• The old page will be marked as stale

• Only whole blocks can be deleted

–

Limited ability of being rewritten (between 3,000 and 100,000

cycles per page)

• Wear leveling algorithms assure that write operations are equally

distributed to the pages

32

2.2 SSD – Summary

(33)

• The

disk controller

organizes low level

access to the disk

–

e.g. head positioning, error checking, signal

processing

–

Usually integrated into the disk

–

Provides

unified

and

abstracted

interface

to access the disks (e.g. LBA)

–

Connects disk to a peripheral bus (e.g. IDE,

SCSI, FiberChannel, SAS)

• The

host bus adapter

(HBA) bridges

between the peripheral bus and systems

internal bus (like PCIe, PCI)

–

Internal Bus usually integrated into systems

main board

–

Often confused as being the disk controller

• DAS

(Directly Attached Storage)

2.2 HD – Controller

Disk

Controller

Host Bus

Adapter

Peripheral

Bus

Internal Bus

Inner System / Mainboard

(34)

• Sectors can be logically grouped to

blocks

by the

operating system

–

Sectors in a block do not necessarily need to be

adjacent

–

e.g. NTFS defaults to 4 KiB per block

• 8 sectors on a modern disk

• Hardware address

of a block is combination of

–

Cylinder number, surface number, block number

within track

–

Controller maps hardware address to

logical block

address

(LBA)

34

2.2 HD – Controller

(35)

• Disk controller transfers content of whole blocks to

buffer

–

Buffer resides in

primary storage

and can be accessed

efficiently

–

Time needed to transfer a random block (4KiB/Block on

ST3100034AS): (<10 msec)

• Seek Time:

Time needed to position head to correct cylinder (<8

msec)

• Latency (Rotational Delay):

Time until the correct block arrives

below the head (<0.14 msec)

• Block Transfer Time:

Time to read all sectors of block (<0.01 msec)

–

Bulk Transfer Rate for

n

adjacent blocks (<20msec for

n

=10)

(36)

• Locating data on a disk is a major

bottleneck

–

Try operating on data already in buffer

–

Aim for bulk transfer, avoid random block transfer

36

2.2 HD – Controller

(37)

• A single HD is often not sufficient

–

Limited capacity

–

Limited speed

–

Limited reliability

• Idea: Combine multiple HD into a

RAID Array

(

R

edundant

A

rray of

I

ndependent

Disk

s)

–

RAID Array treats multiple hardware disks as a single

logical disk

• More HDs for increased capacity

(38)

• The RAID controller connects

to multiple hard disks

–

Disks are

virtualized

and

appear to be just one single

logical

disk

–

The RAID controller acts as an

extended specialized HBA

(Host Bus Adapter)

–

Still

DAS

(Directly Attached

Storage)

38

2.3 RAID Controller

RAID Controller

Peripheral Bus

Internal Bus

Represented

as single

logical Disk

(39)

• Mirroring

(or shadowing): Increases reliability by

complete redundancy

• Idea: Mirror Disks are

exact copies

of original disk

–

Not space efficient

• Read speed can be

n

times as fast, write speed does not

increase

• Increases reliability. Assume

–

Two disks with a MTBF 11 years each

• One original disk, one mirror disk

• Assume disk failures are independent of

each other (unrealistic)

–

Disk replacement time of 10 hours

(40)

• Striping:

Improve performance by parallelism

• Idea: Distribute data among all disks for increased

performance

• BitLevel Striping:

Split all bits of a byte to the disks

–

e.g. for 8 disks, write

i

-th bit to disk

i

–

Number of disk needs to be a power of 2

–

Each disk is involved in each access

• Access rate does not increase

• Read and write transfer speed linearly increases

• Simultaneous accesses not possible

–

Good for speeding up few, sequential and large

accesses

40

Silber 11.3

(41)

• Block Level Striping:

Distribute blocks among

the disks

–

Only one disk is involved reading a specific block

• Read and write speed of a single block not increased

•

• Read and write speed of multiple accesses increase

–

Good for large number of parallel accesses

(42)

• Error Correction Codes:

Increase reliability

with computed redundancy

• Hamming Codes

(~1940)

–

Can

detect

and

repair

1 bit errors within

a set of n

data bits

by computing k

parity bits

• n

= 2

k

_-

_k

_{– 1}

• n=1, k=2;

n=4, k=3;

n = 11, k=4;

n = 26, k=5; …

–

Especially used for in-memory and tape error

correction

• Media cannot detect errors autonomously

• Not really used for hard drives anymore

42

(43)

• Interleaved Parity

(Reed-Solomon Algorithm on

GD(2) Galois Field)

–

Can

repair

1-bit errors (when the error is known)

–

Hard Disks can detect read errors themselves, no need

for complete Hamming codes

–

Basic Idea:

• From

n

data pieces D

₁

,…,D

_n

compute a parity data D

_p

by

combining data using logical

XOR

(eXclusive OR)

–

XOR

is associative and commutative

–

Important: A

XOR

B

XOR

B = A

• i.e. D

_p

= D

₁

XOR

D

₂

XOR

…

XOR

D

_n

(44)

• Interleaved Parity. Example:

• A = 0101, B = 1100, C = 1011

• P

=

0010

= A

XOR

B

XOR

C

• C is lost.

–

P

= A

XOR

B

XOR

C

–

C =

P

XOR

A

XOR B

–

C =

A

XOR

B

XOR

C

XOR

A

XOR

B

–

C =

A

XOR

A

XOR

B

XOR

B

XOR

C

–

C = 0

XOR

C

–

C = 1011

44 2.3 RAID Principles – Interleaved Parity

0101

XOR

1100

XOR

1011

P

0010

XOR

0101

XOR

1100

C

1011

(45)

• The 3 RAID principles can be combined in multiple ways

–

Not every combination is useful

• This led to the definition of 7 core RAID

levels

–

RAID 0 – RAID 6

–

The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5

• In following examples, assume

–

A MTBF

of 100,000 hours (11.42 years) per disk

–

A Mean Time to Repair (MTTR) of 6 hours

–

Failure rate is constant and failures between disks are independent

–

MTBF

_raid

is the mean time to data loss within the raid if each failing

disk is replaced within the MTTR

–

D

is the number of drives in the RAID set

–

C=200 GB

is capacity of one disk, C

capacity of whole raid

(46)

• Mean Time to Repair (

MTTR

)

–

MTTR = TimeToNotice + RebuildTime

–

Assume time to notice of 0.5 hours

–

Rebuild time

is the time for completely writing back lost

data

• Assume disk capacity of 200GB

• Write back speed of 10 MB/sec

–

Consisting of reading remaining disks

–

Computing parity / Reconstructing data

• Rebuild time around 5.5 hours

–

During rebuild, a RAID is especially

vulnerable

–

MTTR = 6 hours

46

(47)

• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx)

• Raid 0

–

Block-Level-Striping

only

–

Increased parallel access and transfer speeds, reduced

reliability

–

All disks contain data (0% overhead)

–

Works with any number of disks

–

MTBF

_raid

= MTBF

_disk

/ D

–

4 disks:

• MTBF

_raid

= 2.86 years

• C

_raid

= 800 GB

(0 GB wasted (0%))

–

Common size:

2 disks

(48)

• Raid 1

–

Mirroring

only

–

Increased reliability, increased read transfer speed, low space

efficiency

–

MTBF

_raid

= MTBF

_disk

D

_{/ (D! * MTTR}

D-1

₎

–

4 disks

:

• MTBF

_raid

= 2.2 trillion years

• C

_raid

= 200 GB

(600 GB wasted (75%))

• Age of universe may be around 15 billion years…

–

Common size:

2 disks

• MTBF

_raid

= 95,130 years

• C

_raid

= 200 GB

(200 GB wasted (50%))

48

2.3 RAID Levels

(49)

• RAID 2

–

Not used anymore

in practice

• was used in old mainframes

–

Bit-Level-Striping

–

Use Hamming Codes

• Usually Hamming Code(7,4) – 4 data bits, 3 parity bits

• Reliable 1-Bit error recovery (i.e. one disk may fail)

–

3 redundant disks per 4 data disks (75% overhead)

• Ratio better for larger number of disks

–

MTBF

_raid

= MTBF

_disk

2 _{/ (D(D-1) MTTR)}

–

7 disks (does not really make sense for 4 –

not comparable

to other values)

• MTBF

= 4,530 years

(50)

• RAID 3

–

Interleaved-Parity

–

Byte-Level-Striping

–

Dedicated Parity Disk

• Bottleneck! Every write operation

needs to update the parity disk.

• No parallel writes

–

1 redundant disk per

n

data disks

• Overhead decreases with number of disks

while reliability decreases

• 25% overhead for 4 data disks

–

MTBF

_raid

= MTBF

_disk

2 _{/ (D(D-1) MTTR)}

–

4 disks

• MTBF

_raid

= 15,854 years

• C

_raid

= 600 GB

(200 GB wasted (25%))

50

2.3 RAID Levels

(51)

• RAID 4

–

Block-Level Striping

–

As RAID 3 otherwise

–

4 disks

(common size)

• MTBF

_raid

= 15,854 years

• C

_raid

= 600 GB

(200 GB wasted (25%))

–

5 disks

(also common size)

• MTBF

_raid

= 9,513 years

• C

_raid

= 800 GB

(200 GB wasted (20%))

(52)

• RAID 5

–

Parity is distributed

among the

hard disks

• May allow for parallel block writes

–

As RAID 4 otherwise

–

Bottleneck when writing many files

smaller than a block

• Whole parity block has to be read and

re-written for each minor write

–

Can recover from a single disk failure

–

MTBF

_raid

and C

_raid

as for RAID 3 & 4

52

2.3 RAID Levels

(53)

• RAID 6

–

Two independent parity blocks distributed

among the disks

• May be implemented by parity on orthogonal data or by using Reed-Solomon on

GF(2

8 ₎

–

As RAID 5 otherwise

–

2 redundant disks per

n

data disks

• Can recover from a

double disk failure

• No vulnerability during single failure rebuild

• Very suitable for larger arrays

• Writer overhead due to more complicated parity computation

–

MTBF

_raid

= MTBF

_disk

3 _{/ (D(D-2)(D-1) * MTTR}

2 ₎

–

4 disks

• MTBF

_raid

= 132 million years

• C

_raid

= 400 GB

(400 GB wasted (50%))

–

8 disks

(common)

(54)

• Additionally, there are hybrid levels combing the core levels

–

RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …

• Raid 1+0

–

Mirrored sets nested in a striped set

• RAID 0 on sets of RAID 1 sets

–

Very high read and write transfer speeds

, increased reliability, low space efficiency,

limited maximum size

–

Most performant RAID combination

–

D

₁

= Drives per RAID 1, D

₀

=Number of RAID1 sets

–

MTBF

_raid

= MTBF

_disk

D1

_{/ (D}

1 ! * MTTR

D1-1

) / D

0 –

4 disks

: D

₁

= 2, D

₀

= 2

• MTBF

_raid

= 47,565 years

• C

_raid

= 400 GB

(400 GB wasted (50%))

–

6 disks

: D

₁

= 2, D

₀

= 3

• MTBF

_raid

= 31,706 years

• C

_raid

= 600 GB

(600 GB wasted (50%))

54

(55)

• RAIDs controllers

directly

connect storage to the

system bus

–

Storage available to only one system/ server/ application

• Number of disks is limited

–

Consumer grade RAID: 2-4 disks

–

Enterprise grade RAID: 8-24+ disks

• Solutions

–

NAS

(Network Attached Storage): Provide abstracted

file

systems

via network (

software solution

)

–

SAN

(Storage Area Network): Virtualized logical storage

within a specialized network on

block level

(

hardware

(56)

• Before discussing NAS, we need file

systems

• A file systems is a

software

for

abstracting

file operations on a

logical storage device

–

Files are a collection of binary data

• Creating, reading, writing, deleting, finding,

organizing

–

How does a file access translate into

top-level operations on a logical

storage device?

• e.g. which blocks have to be read/written?

• Bridge between

application software

and (abstracted)

hardware

56 2.4 File Systems vs. Raw Devices

Application

Software

File System

Logical

Storage

(57)

• Raw Devices access allows

applications to bypass the OS and

the file system

• Application may directly tune

aspects of physical storage

–

There is still the hard drive

controller…so, its not really direct

• May lead to very efficient

implementations

2.4 File Systems vs. Raw Devices

Application

Software

(58)

• Idea: Provide a remote

file system

using already

available

network

infrastructure

–

NAS

: Network Attached Storage

–

Use specialized

network protocols

(e.g. CIFS, NFS,

FTP, etc)

–

Easiest case: File Server (e.g. Linux+Samba)

• Advantages:

–

Easy

to setup, easy to use,

cheap

infrastructure

–

Allows

sharing

of storage among several systems

–

Abstracts on file system level (easy for most

applications)

• Disadvantages

–

Inefficient

and slow

• large protocol and processing overhead

–

Abstracts on file system level (not suitable for special

purposes like raw devices or storage virtualization)

58 2.4 NAS – Network Attached Storage

Application

Software

File System

Logical

Storage

Network

NAS

Ser

ver

(59)

• SANs offer specialized

high-speed networks

for storage devices

–

Usually uses local FibreChannel

networks

–

Remote location may be connected via Ethernet

or IP-WAN (Internet)

–

Network uses specialized storage protocols

• iFCP (SCSI on FiberChannel)

• iSCSI (SCSI on TCP/IP)

• HyperSCSI (SCSI on raw ethernet)

• SANs provide

raw block level

access to

logical storage

devices

–

Logical disks of any size can be offered by the SAN

–

For a client system using a logical disk, it might

appears like a local disk or RAID

–

Client system has full control over file systems on

2.4 SAN – Storage Area Network

Application

Software

File System

(60)

60 2.4 SAN – Storage Area Network

SAN HBA

SAN/RAID HBA

Peripheral Bus (SCSI, SAS, etc.)

SAN Switch

SAN HBA

SAN Bus (iFCP)

SAN

HBA

SAN

HBA

SAN Switch

NAS Protocol

(CIFS)

Ethernet

Network

WAN-SAN Bus (HyperSCSI)

NAS

Head

(61)

• Advantages:

–

Very efficient

• Highly optimized local network infrastructure

• Optimized protocols with low overhead

–

Very flexible

(any number of systems may use any number of

disks at any location)

–

Helps for disaster protection

• SAN can transparently span to even remote locations

–

May also employ

NAS heads

for NAS-like behavior

• Disadvantages

(62)

• There are different types of storage

–

Usually, there is a storage hierarchy

• Faster, smaller, more expensive storage

• Slower, bigger, less expensive storage

• Hard drives are currently the most popular media

–

Mechanical device

• High sequential transfer rates,

• Bad random access times, low random transfer rates

• Prone to failure

–

DBMS must be optimized for the used storage

devices!

62

2 Physical Storage

(63)

• Access Pathes

–

Physical Data Access

–

Index Structures

–

Physical Tuning