Physical Data Organization

(1)

UVA

DEPARTMENT OF COMPUTER SCIENCE

Physical Data Organization

Database design using logical model of the database - appropriate level for users to focus on

- user independence from implementation details Performance

- other major factor in user satisfaction - depends on

- efficient data structures for data representation - efficiency of system operation on those structures Disk access

- one of the most critical factors in performance

- main memory is in general not big enough for entire DB - recovery problem with main memory DB

- disk contains data files and system files including data dictionary and index files

(2)

UVA

Storage Media Hierarchy

Storage medium: primary storage and secondary storage

- database is stored physically on some some storage medium - primary storage: can be operated directly by CPU

--- main memory & cache

- secondary storage: larger capacity, lower cost, slower access;

cannot be operated directly by CPU; must be copied to primary Hierarchy

- access speed, cost per unit of data, reliability - cache: fastest and most costly

- main memory

- flash memory: limited number of writes (also slow) non-volatile: disk-substitute in embedded systems - magnetic disk and optical disk (CD-ROM)

- tape storage: sequential access; for backup and archival

(3)

UVA

Disk Access and Buffer Management

Disk

- direct access storage device (not sequential)

- arm movement involves seek time and latency time - goal is to reduce # of disk access and seek time - a block need not to be transferred every time

- buffer blocks: closely related with concurrency control and recovery strategy of the database system

Buffer management

- goal is to increase hit ratio

- similar to virtual memory management in OS - differences: forced writing for recovery and

MRU (most recently used first) replacement algorithm - priority-based replacement: data dictionary and

index blocks have high priority

(4)

UVA

RAID

Redundant arrays of independent disks

- motivation: large # of small disks might be cost effective; higher reliability and higher performance Higher reliability by redundancy

- mirroring/shadowing: a logical disk consists of two physical disks --- write on both

Higher performance by parallelism

- data striping: splitting data across multiple disks - bit-level or block-level striping

- with n disks, block i will go to disk (i mod n) + 1 RAID levels

- to provide redundancy at lower cost using disk striping combined with error-correcting bits, instead of mirroring

(5)

UVA

File Organization

File

- a sequence of records mapped unto disk blocks

- block: unit of data transfer between disk and memory - block size ranges from 512 bytes to few Kbytes

- fixed-length records vs variable-length records Fixed-length records

- size of each field is declared

- when delete, mark it to be ignored: searching for deleted free space may not be efficient

- use pointer for free space: danger of dangling pointer which no longer points to the desired record

- problem of interblock records: needs 2 accesses ... block i) (block i+1 ...

--- record j

(6)

UVA

Variable-length Records

When such situations occur?

- multiple record types in one file

- record type allows variable length fields - repeating groups (multiple values)

Methods to deal with them

- byte string representation: special end-of-record symbol ( ) at the end of each record

- each record is a string of consecutive bytes - difficulty in reusing the space of deleted record - fixed-length representation:

1) reserved space for expected maximum length - useful only if most are close to max. length 2) a list of fixed-length records chained by pointers 3) anchor block (first record of the chain) and overflow

block (all the others) chained by pointers

(7)

UVA

Mapping Data to Files

Relational database - straight-forward

- in most cases, each relation in a separate file File organization

- how to organize a given set of records in files

- heap file: any record can be placed anywhere (no ordering) - sequential file: records are stored in a sequential order

- hashing file: hash function computes the specific block for the record based some attribute value

- clustering file: records of different relations stored on the same file/block for efficient processing

- related records can be read by one block read - may be inefficient for other operations

(8)

UVA

Efficient Searching

Additional structures help searching

- associated with files to make the search for records based on certain field more efficient - for direct data locating w/o sequential search - two approaches: indexing and hashing

Sequential file

- records are chained together by pointers for fast retrieval in search key order

- records are stored physically in search key order to minimize the number of block accesses

- difficult to maintain the physical sequential order as records are inserted and deleted

- binary search for files can be done on the blocks rather than on the records, if block address are available in the file header

(9)

UVA

Index Structures

Index file

- index is usually defined on a single field of a record (index field)

- index file is for fast random access Dense index

- one index record for every search-key value - faster access but higher overhead

Sparse index

- index records for only some of the records - less faster but less overhead

(Brighton) (record: Brighton, ..) (Brighton) (Downhill) (record: Downhill, ..)

(Marinion) (record: Marinion, ..) (Marinion)

dense index sparse index

(10)

UVA

Index Structures

Hierarchy of index

- multi-level index for a large index file - index tree (search tree)

Primary and secondary index

- primary index is the one whose search key specifies the sequential order of the file

- secondary index: index other than primary one

- secondary index improves the performance of queries that use keys other than the primary search key

- modifying DB imposes a serious overhead on secondary index (compared to the primary index)

- dense index is desirable than sparse index for secondary index, since the file is not ordered physically according to the secondary index

(11)

UVA

Clustering Index

Clustering field

- a non-key field that does not have a distinct value for each record, on which records of a file are physically ordered

Clustering index

- clustering index is to speed up retrieval of records that have the same value for the clustering field

- differs from primary index which requires that ordering field should have a distinct value for each record

(12)

UVA

Index File

Index file size

- index file for a primary index need substantially fewer blocks than the data file

- why?

- fewer index entries: an entry exists for each block of data file rather than for each record - index entry is smaller in size than a data record:

only two fields (key value and block pointer) Blocking factor (bfr)

- savings in disk block accesses

- bfr = block size (B) / record length (R)

(13)

UVA

Index File: Example

An ordered file with 30,000 records, B = 1 Kb, R = 100 bytes - bfr = 10; data file needs 3000 blocks

- binary search would require (log₂ Blocks) = 12 accesses - with ordering key field of 9 bytes and block pointer

of 6 bytes, size of primary index entry = 15 bytes - bfr = block size (B) / record length (R) = 68 - total # of index entries: 3000

- # of blocks needed for the index = (3000/68) = 45 - binary search on index file would require

(log₂ B_i) = (log₂ 45) = 6 accesses - search for a record using the primary index

6 (for index) + 1 (for data) = 7 accesses

(14)

UVA

Search Tree

Disadvantage of indexed sequential file organization - performance degradation as file grows

- file reorganization can avoid this performance degradation with its own overhead

Search tree

- a special type of tree used to guide the search for a record given the value of one of its fields - in a search tree of order p, each node contains

at most p 1 search values and p pointers in the order

<P₁, K₁, ..., P_{q 1}, K_{q 1}, P_q>, where q p P_i: pointer to a child node or null pointer

K_i: search key value from some ordered set of

values (all search key values are assumed to be unique) for all values X in the subtree pointed by P_i, we have

K_{i 1}<X<K_i for 1<i<q, X<K_i for i=1, and K_{i 1}<X for i=q

(15)

UVA

B-tree Index Files

B-tree (balanced tree)

- a search tree with some additional constraints for efficient insertion and deletion

- number of access is fixed Formal definition

A B-tree of order n is a search tree that satisfies 1) the root has at least two children

2) all nodes other than root have at least n/2 children 3) all leaf nodes are at the same level (balanced)

Insertion and deletion

- insertion may need split when a node becomes too large (more than n children)

- deletion may need combining if a node becomes too small (less than n/2 pointers)

- balance property must be maintained

(16)

UVA

B-tree and B+-tree

Node structure of B-tree

<P₁, (K₁, Pr₁), P₂, ..., (K_{q 1}, Pr_{q 1}), P_q>

P_i: tree pointer to point another node K_i: search key value

Pr_i: data pointer to point record whose search key field value is K_i (or the data block containing it) - within each node, K₁ < K₂ < .. <K_{q 1}

- for all values X in the subtree pointed by P_i, we have K_{i 1}<X<K_i for 1<i<q, X<K_i for i=1, and K_{i 1}<X for i=q - a node with q tree pointers, q p, has q 1 search

key field values, and hence q 1 data pointers B⁺-tree: a variation of B-tree data structure

- most widely used multi-level index implementation

<P₁, K₁, ..., P_{q 1}, K_{q 1}, P_q>, where q p

- at leaf node, it is <K₁, Pr₁, ..., K_{q 1}, Pr_{q 1}, P_next>

(17)

UVA

B+-tree

Requirements for maintaining B⁺-tree

- every node must contain at least n/2 pointers except for the root (which should have at least 2) - balanced: for ensuring good performance

Searching for key field value K

1) visit the root node, looking for the smallest key value greater than K. Suppose the value is K_i. 2) follow pointer P_i to another node

- if K < K₁, then follow P₁

- if K > K_max, then follow P_max

3) repeat step 2 until reaching a leaf node

(18)

UVA

Differences of B+-tree from B-tree

1. In B⁺-tree, data pointers are stored only at the leaf nodes - more entires can be packed into internal (non-leaf) nodes

of a B⁺-tree than for a similar B-tree

- for the same block (node) size, the order p will be larger for the B⁺-tree than for the B-tree --- improved search time - B-tree eliminates redundant storage of search key values - faster search in some cases to find desired search key

values before reading a leaf node in B-tree

2. Leaf and non-leaf nodes are of the same size in B⁺-tree, while in B-tree, non-leaf nodes are larger

- complication in storage management for index structures 3. Deletion in B-tree is more complicated

- in B⁺-tree, deleted entry always appears in a leaf - in B-tree, it can be a non-leaf node, requiring

replacement by the proper value from the subtree