CS 464/564 Introduction to Database Management System Instructor: Abdullah Mueen

(1)

CS 464/564

Introduction to Database Management System Instructor: Abdullah Mueen

LECTURE 14: DATA STORAGE AND REPRESENTATION

(2)

Data Storage

Memory Hierarchy Disks

Fields, Records, Blocks

Variable-length Data

Modifying Records

(3)

Strawman Implementation

Use UNIX file system to store relations, e.g.

◦ Students(name, id, dept) in file /usr/db/Students

One line per tuple, each component stored as character string, with

# as a separator, e.g.

◦ tuple could be: Smith#123#CS

Store schema in /usr/db/schema, e.g.:

◦ Students#name#STR#id#INT#dept#STR

3

(4)

What's Wrong?

The storage of the tuples on disk is inflexible: if a student changes major from EE to ECON, entire file must be rewritten

Search is very expensive (read entire relation)

Query processing is "brute force" -- there are faster ways to do joins, etc.

Data is not buffered between disk and main memory

No concurrency control

(5)

Memory Hierarchy

Cache 1ns

main memory 1-10ns

secondary storage (disk) 10ms

tertiary storage (tapes, CD-ROM) 

5

faster, smaller,

more expensive

slower, larger, cheaper

(6)

Cache Memory

Transfer a few bytes at a time between cache and main memory: instruction, integer, floating point, short string

Processor operates on instruction and data in the cache Typical size: 1 Mbyte (2²⁰ bytes)

Typical speed to/from main memory: 10 nanosec (1 nanosec = 10^-9 sec)

(7)

Main Memory

Typical size: 100 Mbytes to 10 Gbytes (1 Gbyte = 2³⁰ bytes) Typical access speed (to read or write): 10 to 100 nanosec At least 100 times larger than cache

At least 10 times slower than cache

7

(8)

Secondary Storage

Usually disk

Divided logically into blocks, unit of transfer between main memory (called disk I/O)

Typical size: 100 Gbytes-10TB

Typical speed: 10 millisec (10

^-3

sec)

At least 100 times larger than main memory

Much slower than main memory and much much slower than cache:

(9)

Tertiary Storage

Tape(s) CD-ROM(s)

At least 1000 times slower than secondary storage At least 50% cheaper than secondary storage

9

(10)

Volatile vs. Nonvolatile

Storage is volatile if the data is lost when the power is gone Usually main memory is volatile

Usually secondary and tertiary storage is nonvolatile

Thus every change made to a database in main memory must be backed up on disk before it can be permanent.

(11)

Disks

11

spindle disk heads

platters:

each has two surfaces, each surface consists of tracks (concentric rings)

one head per surface, very close to surface, does the reading and writing

(12)

More on Disks

orange ring is a track black squares are gaps, which don't hold data part of track between two gaps is a sector one or more sectors

(13)

Disk Controller

controls mechanical actuator that moves the heads in and out (radius, distance from spindle)

◦ one track from each surface at the same radius forms a cylinder

selects a surface

selects a sector (senses when that sector is under the corresponding head)

transfers bits

13

(14)

Typical Values (old vs. newer)

Rotation speed (old): 5400 rmp

Rotation speed (2014): 7200 rpm, 1 rotation in 8.33 ms Number of platters (old): 5

Number of platters (2014): 8

Number of tracks/surface: 20,000

Number of tracks/surface (2008): 65,536 Number of sectors/track: 500

Number of sectors/track (2008): 256 Number of bytes/sector: thousands

Number of bytes/sector (2008): thousands

(15)

Disk Latency for a Read

Time between issuing command to read a block and when contents of block appear in main memory:

◦ time for processor and disk controller to process request, including resolving any contention (negligible)

◦ seek time: time to move heads to correct radius (0 to ~18 millisec)

◦ rotational latency: time until first sector of block is under the head (~9 millisec)

◦ transfer time: until all sectors of the block have passed under the head; depends on rotation speed and size of block

15

(16)

Disk Latency for Updates

For a write: like reading plus verification (read back and compare) To modify a block:

◦ read it into main memory

◦ change it in main memory

◦ write it back to disk

(17)

Moral of the Story

Disks accesses are orders of magnitude slower than accesses to main memory.

They are unavoidable in large databases.

Thus do everything possible to minimize them.

Can lead to different algorithms.

17

(18)

Speeding Up Disk Accesses

1. Place blocks accessed together on same cylinder

◦ reduces seek time and rotational latency

2. Divide data among several disks

◦ head assemblies can move in parallel

3. Mirror a disk: make copies of it

◦ speeds up reads: get data from disk whose head is closest to desired block

◦ no effect on writes: write to all copies

◦ also helps with fault tolerance

(19)

Speeding up Disk Accesses

4. Be clever about order in which read and write requests are serviced, i.e., algorithm in OS or DBMS or disk controller

◦ Ex: elevator algorithm

5. Prefetch blocks to main memory in anticipation of future use (buffering)

19

(20)

Elevator Algorithm

Disk head assembly sweeps in and out repeatedly

When heads pass a cylinder with pending requests, they stop to do the request

When reaching a point with no pending requests ahead, change direction

Works well when there are many "independent" read and write

requests, i.e., don't need to be done in a particular order, that are

(21)

Prefetching

Suppose you can predict order in which blocks will be requested from disk.

Load them into main memory buffers before they are needed.

Have flexibility to schedule the reads efficiently

Can also delay writing buffered blocks if the buffers are not needed immediately

21

(22)

Disk Failures

Intermittent failure: attempt to read or write a sector fails but a subsequent try succeeds Impossible to read sector

Impossible to write a sector

Disk crash: entire disk becomes unreadable

(23)

Coping with Intermittent Failures

Use redundant bits in each sector

Store checksums in the redundant bits

After a read, check if checksums are correct; if not then try again

After a write, can do a read and compare with value written, or be optimistic and just check the checksum of the read

23

(24)

Checksums

Suppose we use one extra bit, a parity bit.

◦ if the number of 1's in the data bits is odd, then set the parity bit to 1, otherwise to 0

This is not foolproof: 101 and 110 both have even parity so checksum would be 0 for both

Use n parity bits in the checksum:

◦ parity bit 1 stores parity of every n-th bit, starting with first bit,

◦ parity bit 2 stores parity of every n-th bit, starting with second bit, etc.

◦ Probability of missing an error is 1/2ⁿ

(25)

Coping with Permanent Read/Write Errors

Stable storage policy:

Each "virtual" sector X is represented by two real sectors, X

_L

and X

_R

. To write value v to X:

◦ repeat {write v to X_L, read from X_L } until read's checksum is correct or exceed max # of tries

◦ do the same thing with X_R

◦ if X_L or X_R is discovered to be bad, then must find a substitute

25

(26)

Handling Write Failures

Suppose write(s) to X_L all fail.

◦ Then old value is safe in X_R.

Suppose write(s) to X_R all fail.

◦ Then new value is safe in X_L.

Assumption is that it is highly unlikely for two sectors to fail around the same time.

(27)

More on Stable Storage

To read from X:

◦ repeatedly read X_L until checksum is good or exceed max # tries

◦ if read of X_L failed then repeatedly read X_R until checksum is good or exceed max # tries

Handles permanent read failures, unless both X_L and X_R fail about the same time (unlikely)

27

(28)

Coping with Disk Crashes

"Mean time to failure" of a disk is length of time by which 50% of such disks will have had a head crash

Goal is to have a much longer "mean time to data loss" for your system Key idea: use redundancy

Discuss three such approaches next…

(29)

Mirroring (RAID Level 1)

Keep another copy of each disk: write to both, read from one.

Only way data can be lost is if second disk crashes while first is being repaired.

If mean time to crash of a single disk is 10 years and it takes 3 hours to repair a disk, then mean time to data loss is 146,000 years.

29

(30)

Parity Blocks (RAID Level 4)

Block i of the spare disk contains the parity checks for block i of all the data disks.

If spare disk fails, get a new spare.

•

Drawback of previous

scheme is that you need double the number of disks.

•

Instead use one spare disk no matter how many data disks you have.

Image source: wikipedia

(31)

RAID Level 5

Drawback of previous scheme is that spare disk is a bottleneck.

Instead, let each data disk also serve as the spare disk for some blocks.

All these assume only one crash at a time. RAID Level 6 uses error-correcting codes to be able to handle multiple crashes.

31

Image source: wikipedia

(32)

Data Representation

Attributes are represented by sequences of bytes, called fields Tuples are represented by collections of fields, called records Relations are represented by collections of records, called files

Files are stored in blocks, using specialized data structures to support

efficient modification and querying

(33)

Representing SQL Data Types

integers and reals: built-in CHAR(n): array of n bytes

VARCHAR(n): array of n+1 bytes (extra byte is either string length or null char) dates and times: fixed length strings

etc.

33

(34)

Representing Tuples

For now, assume all attributes (fields) are fixed length.

Concatenate the fields

Store the offset of each field in schema

0 30 286 287 297

(35)

More on Tuples

Due to hardware considerations, certain types of data need to start at addresses that are multiples of 4 or 8

Previous example becomes:

35

0 32 288 292 304

nameCHAR(30) 30 bytes + 2

address

VARCHAR(255) 256 bytes

gender CHAR(1) 1 byte + 3

birthdate DATE10 bytes + 2

(36)

Record Headers

Often it is convenient to keep some "header" information in each record:

◦ a pointer to schema information (attributes/fields, types, their order in the tuple, constraints)

◦ length of the record/tuple

◦ timestamp of last modification

0 30 286 287 297

To Schema

Length

Timestamp

(37)

Packing Records into Blocks

Start with block header:

◦ timestamp of last modification/access

◦ offset of each record in the block, etc.

Follow with sequence of records May end with some unused space

37

header block 1 block 2 … block n-1 block n

(38)

Representing Addresses

Often addresses (pointers) are part of records:

◦ the application data in object-oriented databases

◦ as part of indexes and other data structures supporting the DBMS

Every data item (block, record, etc.) has two addresses:

◦ database address: address on the disk (typically 8-16 bytes)

◦ memory address, if the item is in virtual memory (typically 4 bytes)

(39)

Translation Table

Provides mapping from database addresses to memory addresses for all blocks currently in memory

Later we'll discuss how to implement it

39

(40)

Pointer Swizzling

When a block is moved from disk into main memory, change all the disk addresses that point to items in this block into main memory addresses.

Need a bit for each address to indicate if it is a disk address or a memory address.

Why? Faster to follow memory pointers (only uses a single machine

instruction).

(41)

Example of Swizzling

41

Block 1

Block 2

Disk Main Memory

read into main memory

(42)

Swizzling Policies

Automatic swizzling: as soon as block is brought into memory, swizzle all relevant pointers Swizzling on demand: only swizzle a pointer if and when it is actually followed

No swizzling

Programmer control

(43)

Automatic Swizzling

Locating all pointers within a block:

◦ refer to the schema, which will indicate where addresses are in the records

◦ for index structures, pointers are at known locations

Update translation table with memory addresses of items in the block

Update pointers in the block (in memory) with memory addresses, when possible, as obtained from translation table

43

(44)

Unswizzling

When a block is moved from memory back to disk, all pointers must go back to database (disk) addresses

Use translation table again

Important to have an efficient data structure for the translation table

(45)

Pinned Records and Blocks

A block in memory is pinned if it cannot be safely written back to disk Indicate with a bit in the block header

Reasons for pinning:

◦ related to failure recovery (more later)

◦ because of pointer swizzling

If block B1 has swizzled pointer to an item in block B2, then B2 is pinned.

45

(46)

Unpinning a Block

Consider each item in the block to be unpinned

Keep in the translation table the places in memory holding swizzled pointers to that item (e.g., with a linked list)

Unswizzle those pointers (i.e., use translation table to replace the memory addresses with database (disk) addresses

(47)

Variable Length Data

Data items with varying size (e.g., if maximum size of a field is large but most of the time the values are small)

Variable-format records (e.g., NULLs method for representing a hierarchy of entity sets as relations)

Records that do not fit in a block (e.g., an MPEG of a movie)

47

(48)

Variable-Length Fields

Store the fixed-length fields before the variable-length fields in each record Keep in the record header

◦ record length

◦ pointers to the beginnings of all the variable-length fields

Book discusses variations on this idea

(49)

Variable-Format Records

Represent by a sequence of tagged fields Each tagged field contains

◦ name

◦ type

◦ length, if not deducible from the type

◦ value

49

(50)

Splitting Records Across Blocks

Called spanned records Useful when

◦ record size exceeds block size

◦ putting an integral number of records in a block wastes a lot of the block (e.g., record size is 51% of block size)

Each record or fragment header contains

◦ bit indicating if it is a fragment

◦ if fragment then pointers to previous and next fragments of the record (i.e., a

(51)

Record Modification

Modifications to records:

◦ insert

◦ delete

◦ update

issues even with fixed-length records and fields even more involved with variable-length data

51

(52)

Inserting New Records

If records need not be any particular order, then just find a block with enough empty space Later we'll see how to keep track of all the tuples of a given relation

But what if blocks should be kept in a certain order, such as sorted on primary key?

(53)

Insertion in Order

53

If there is space in the block, then add the record (going right to left), add a pointer to it (going left to right) and rearrange the pointers as needed.

record 4

record 3

record 2

record unused 1

header

(54)

What if Block is Full?

Records are stored in several blocks, in sorted order

One approach: keep a linked list of "overflow" blocks for each block in the main sequence Another approach: slide records down to another block, leave forwarding addresses in old block, slide records on both old and new blocks

(55)

Deleting Records

Try to reclaim space made available after a record is deleted

If using an offset table, then rearrange the records to fill in any hole that is left behind and adjust the pointers

Additional mechanisms are based on keeping a linked list of available space and compacting when possible

55

(56)

Tombstones

What about pointers to deleted records?

We place a tombstone in place of each deleted record Tombstone is permanent

Issue of where to place the tombstone

Keep a tombstone bit in each record header: if this is a tombstone, then no need to store additional data

(57)

Updating Records

For fixed-length records, there is no effect on the storage system For variable-length records:

◦ if length increases, like insertion

◦ if length decreases, like deletion except tombstones are not necessary

57