CS631 Quick Reference Author: Ramdas Rao
Data Storage
Overview of Physical Storage Media
In decreasing order of cost and performance / speed:
• Cache - Is volatile, L1 Cache operates at processor speed (e.g., if processor is 3GHz, then the memory access time is 1/3 ns)
• Main Memory: Access speed is about 10 to 100 ns; 300 times slower than cache; volatile
• Flash Memory: Read access speed about 100 ms (same as memory); Writing is slower 4 to 10 µs; Limited number of erase cycles sup-ported; NOR flash; NAND flash uses page-at-a-time read/write; non-volatile
• Magnetic disk storage: Non-volatile; 10 ms access time; order of magnitude slower
• Optical Storage: CD, DVD (Digital Versatile Disk); capacities of 700 MB to 17 GB; Write-once, read-many (WORM); optical disk juke-box
• Tape Storage: Sequential access; mostly used for backup and archival; High capacity (40 GB to 300 GB); Tape jukebozes (li-braries) - 100s of TB and PB
Magnetic Disks
• Platter - Tracks - Sectors - Blocks
• A cylinder is a set of tracks one below the other on each platter • Concept of zones: The number of sectors in outer zones is greater
than the number of sectors in the inner zones (e.g., 1000 v/s 500 sec-tors)
• Disk Controller Interfaces: ATA (PATA), IDE, SATA (Serial ATA), SCSI, Fiber Channel, Firewrite, SAN (Storage Area Network) - stor-age on network made to appear as one large disk, NAS (Network Attached Disks) - NFS or CIFS
• Performance Measures of disks: Access time, capacity, Data transfer rate, Reliability
• Access time: Time from when a red or write is issues to the time when the data transfer begins
• Access time = Seek Time (arm positioning) + Latency (waiting for sector to rotate under head)
• Average Seek Time = 1/2 of Worst Case Seek Time = 4 to 10 ms • Average Latency Time = 1/2 of time for full rotation = 4 to 10 ms • Average Access Time = 8 to 20 ms
• Data Transfer Rate = Rate at which date can be transferred = 25 MB/s to 100 MB/s
The transfer rate on the inner tracks are significantly lower (30 MB/s) than the outer tracks (since the umber of sectors on the inner tracks is lesser than the outer)
• Mean Time to Failure (MTTF):
– For a single disk, it is about 57 to 136 years
– If multiple disks are used, the MTTF reduces significantly -with 1000 new disks, MTTF is 1200 hours = 47 days
– If 100 disks are in an array and each has a MTTF of 100000 hours, then the MTTF of the array is 100000/100 = 1000 hours – If 2 disks have MTTF of 100000 hours and MTTR of 10 hours,
then the M T T DataLoss =1000002
10+10 = 500x10 6hours
• Mean Time to Data Loss = MTTF + MTTR, (MTTR is Mean Time to Repair)
• Optimization of Disk Block Access:
– Scheduling of access of blocks (e.g., Elevator algo)
– File organization (File Systems, fragmentation, sequential blocks, etc.)
– Non-volatile write buffers ∗ NVRAM to speed-up writes
∗ Log disk (since access to log disk is sequential)
∗ Log file (no separate disk - like in journaling file systems)
RAID
Redundant Array of Independent (Inexpensive) Disks; RAID improves reli-ability via redundancy
• How RAID improves performance via parallelism:
Increases the number of I/O requests hanbdled per second or the transfer rate or both
– Bit-level striping:
∗ The bits of each byte are split across several disks ∗ For an 8-disk configuration, transfer rate is 8 times that of
single disk, and number of I/Os are same as that for single disk; Bit i of each byte does to disk i
∗ For a 4-disk config, bits i and 4+i of each byte do to disk i – Block-level striping (most commonly used):
∗ Stripes blocks across multiple disks (one block on each disk)
∗ Logical block i goes to disk (i mod n) + 1 and it uses the floor(i/n)th physical block of the disk (forumlae assume disk number starts from 1 and blocks from 0)
∗ For large reads (multiple blocks), the data transfer rate is n times that of single disk (n is the number of disks) ∗ For single block read, transfer rate is the same as that of
single disk, but other disks are free to process other re-quests
– Other forms of striping: Bytes of a sector, sectors of a block • 2 main goals of parallelism in a disk system are:
– Load-balance smaller disk requests so that the throughput is increased
– Parallelize large accesses so that the response time of large accesses is reduced
• RAID Levels:
– RAID Level 0: No redundancy, block striping; used when backup is easily restorable
– RAID Level 1: Mirroring w/ block striping (aka level 1 + 0 or 10); Mirroring without block striping is called Level 1 (2M number of disks required; used when number of writes are more (e.g., log disk)
– RAID Level 2: Memory Style ECC (w/ parity bits); Fewer number of disks required than level 1; Some disks store par-ity (e.g., 3 Parpar-ity disks for 4 disks of data); Subsumed by level 3
– RAID Level 3: Bit-interleaved parity; a single parity bit can be used for error detection as well as correction (e.g., 1 P disk for 3 disks of data)
– RAID Level 4: Block-interleaved parity; separate disk for par-ity (at block level); Parpar-ity disk will be involved for every read / write; A single write requires 4 disk accesses: 2 to read the 2 old blocks and 2 to write the new blocks (parity and data); Subsumed by level 4
– RAID Level 5: Block interleaved distributed parity; all disks store parity for the other disks; subsumes level 4
– RAID Level 5: P+Q redundancy (like RAID Level 5, but stores extra redundant information to guard against multiple disk fail-ures); ECC such as Reed-Solomon used; 4-bits of parity instead of 2 can tolerate upto 2 disk failures
• Choice of RAID:
– RAID Level 0 : Use where data safety is not critical (and backup easily restorable)
– RAID Level 1: Offers best write performance, use for High I/O requirements for moderate storage (e.g., log files in a database system)
– RAID Level 5: Storage intensive apps such as vide data stor-age; More frequent reads and rare writes
– RAID Level 6: Use when data safety is very important • Hardware RAID v/s Software RAID: Hot-swapping, etc.
Buffer Manager
• Buffer Replacement Strategy: LRU (not good for nested loop), MRU (depends on the join strategy), Toss Immediate
• Pinning - pinning a block in memork (block that is not allowed to be written back to disk)
• Forced output of blocks: even if space is not required (used when xact log records need to go to stable storage)
File Organization
• Fixed Length Records:
– On deleting, move the records (expensive)
– On deleting, move final record to free space (requires addi-tional disk access)
– Store header and pointers to link (chain) free space for records • Variable Length Records:
– Use slotted page structure for each block – Each block has a header that stores:
∗ Number of record entries in the header ∗ End of free space in the block
∗ An array whose entries contain the location and the size of each record in the block
• Organization of the records within a file: – Heap file organization
– Seq. file organization (may require overflow blocks or periodic reorganization)
– Hashing file organization
– Multi-table clustering file organization: For example, for join of depositor nad customer, after on depositor record, store the customer records for that depositor
Data Dictionary Storage
• Store like a miniature database • Types of information stored:
– Names of the relations
– Names and attributes of each relation – Domains and lengths of each attribute – Names and definitions of views – Integrity constraints
– User, Auth and Accounting info, Passwords – Statistical data
– Info on indices
Indexing and Hashing
• 2 basic types of indices: Ordered Indices (based on a sorted order-ing of values) and Hash Indices (based on a uniform distribution of values across a range of buckets)
• Indexing techniques must be evaluated on these factors: – Access time
– Access type (range of values or point-based) – Insertion time
– Deletion time
– Space overhead for storing the index • Ordered Indices:
– Clustering or Primary Index: If the file containing the records is sequentially ordered, then a clustering or primary index is an index whose search key also defines the sequential order of the file
– Non-clustering or Secondary Index: Indices whose search key specifies an order different from the sequential order of the file • Indexed Sequential File: A file with a clustering index on the search
key is called an Indexed Sequential File.
– Dense index: An index record appears for every search-key value in the file
– Sparse index: An index record appears for only some of the search-key values (some sequential scan required to locate the records)
– Main disadvantages of Indexed Sequential File: Performance degrades (both for sequential scans as well as for index lookups) as file grows; can be remedied by periodic reorg of the file, but expensive
• Multi-level Index: – n-level sparse indices
– If an index occupies 100 blocks, using binary search requires ceiling (log2(b)) disk accesses
– Closely related to tree structures
Read update of indices pseudo code from notes / book • Secondary Indices:
– Cannot be sparse (must be dense)
– Pointers in a secondary index (on search keys that are not can-didate keys) do not point directly to the file (instead, each points to a bucket that contains pointers to the file)
– Disadvantages: Sequential scan in secondary-key order is very slow; they impose significant overhead on the modification of the file (note that when a file is modified, every index must be updated)
B+-Tree Index Files
P1K1P2K2. . . Kn−1Pn
• Most widely used index structure
• Maintains its efficiency despite insertions and deletions of data • A b+-tree index takes the form of a balanced tee in which every path
from the root of the tree to a leaf of the tree is of the same length • Each non-lead node has between ceil(n/2) and ceil(n) children, where
nis fixed for a particular tree
• Each leaf must have at least ceil((n-1)/2) values and at most ceil(n-1) values
• Each non-leaf must have at least ceil(n/2) pointers and at most ceil(n) pointers
• Imposes performance overhead for insertion and deletion and space overhead (as much as half of a node maybe empty), but is still pre-ferred (since periodic file reorg is not needed)
• A B+-tree index is like a multi-level search index • Queries on B+-trees:
– If there are K-search key values in a file, then the path from the root to the leaf is no larger than ceil(logn/2(K)). For
exam-ple, if K = 1000000 and n = 100, then ceil(log50(1000000))=
4. Therefore, at most 4 nodes need to be accessed. For binary search, it would require ceil(log2(1000000)) = 20 nodes.
Algo for B+tree from book
B+-Tree File Organization
• Leaf node stores records than pointers
• Need to consider capacity of mode while splitting and coalescing (since records are stored in the leaf nodes)
• SQL CLOBs and large objects are split into sequence of smaller records and organized in a B+-Tree file organization
• Forumla here
• Indexing Strings: Strings are variable length (need to consider capac-ity); prefix compression can be used to reduce the size
• Advantages: No repetition of search-key values; Non-leaf nodes contain pointer to data (so additional ptr is required in nonleaf nodes -height of the tree may increase compared to B+-tree); not much gain compared to B+-tree since anyway majority of the data is in the leaf node
Other properties
• Mutliple Key Access:
1. Use multiple-single key indices: Perform intersection. Perfor-mance is poor if there are many records satistying condition individually, but few satisfying both conditions
2. Use indices on multiple keys: composite search key; queries with conjunction predicates with equality on primary key is still okay (since we can treat as (P, -inf) to (P, inf); but if inequality for first, then inefficient
3. Bitmap indices can be used: existence bitmaps and presence of NULLs need to be handled
4. R-tree (extension of B+-tree) to handle indexing on multiple dimensions (e.g., for geographical data)
• Non-unique search keys: Use unique record if to prevent buckets and extra page lookups; Search for customer name = ’X’ internally be-comes (’X’, -inf) to (’X’, inf)
• Covering indices: Store multiple (extra) attributes alongwith pointer to records (e.g., balance can be stored if it is required frequently); saves one disk access
• Secondary indices and index relocation:
– Some file organaiztions (such as B+-tree) change the location of records even when the records may not have been updated – To overcome problems due to this in Secondary indices, we can
store the values of the search-key indices (instead of pointers) in the secondary index and use the primary index to lookup – Cost of access increases, but no change is required on file reorg
Hashing
• Hash function must be chosen so that: distribution is uniform and is random
• Hashing can be used for:
– Hash file organization: compute address of block directly – Hash index organization: organizes index into a hash file
struc-ture
• Example hash function: s[0[ to s[n-1] is a string of n characters long s[0] ∗ 31n−1+ s[1] ∗ 31n−2+ . . . + s[n − 1] mod number of buckets
• Bucket overflows can still occur due to: insufficient buckets, and skew
• Overflow can be handled through Overflow chaining, or Open hash-ing (linear or quadratic probhash-ing, etc.); open hashhash-ing is not good for db since deletion in this is troublesome
• Dynamic hashing: One form of extendable hashing; use of hash pre-fix; increasing and decreasing of bucket address table
• Advantages of dynamic hashing: No space reservation required; Per-formance does not degrade as file grows or shrinks
• Disadvantages of dynamic hashing: Additional lookup for bucket ad-dress table required
• Linear hashing avoids extra level of indirection at the possible cost of more buckets
• Ordered Indexing: Can handle range queries better
• Hashing: Bad for range queries; suitable for single-value compar-isons; good for temporary files during query processing
Bitmap Index Structure
Bitmaps and B+-trees can be combined
Query Processing
• Steps in Query Processing: Parser and Translator (gives rel. algebra expression), optimizer (also consults statistics about data and give ex-ecution plan), Evaluation Engine (evaluates plan and outputs results of the query)
Mention about join and sorting techniques here
Cost of Selections
ts: Seek time, br: number of blocks in file, tT: Transfer time for one block.
Index structures are called access paths since they provide a path through which data can be located and accessed.
• Linear Search (A1): Can be applied to any file
– Cost = ts+ (br∗ tT) (one seek + search all blocks)
– For key attributes, we can stop after finding the match. Average Cost is : ts+b2r ∗ tT. Worst-case cost is: ts+ br∗ tT
• Binary Search (A2):
– Cost for key searches: dlog2(br)e ∗ (ts+ tT)
– Cost for non-key searches: dlog2(br)e ∗ (bn+ tT), where n is
the number of items with duplicate keys
• Primary Index, equality on key (A3): For a B+-tree, if hiis the height
of the tree, then
– Cost = (hi+ 1) ∗ (tT+ ts)
• Primary Index, equality on non-key (A4):
– Cost = hi∗ (tT+ ts) + ts+ b ∗ tT, where b is the number of
blocks containing the matched duplicate keys • Secondary Index, equality on key (A5):
– For key: Cost is same as that for A3
– For non-key: Cost = (hi+ n) ∗ (tT+ ts), n is the number of
blocks contianing matching keys • Primary Index, comparison (A6):
– For A > v, first locate V and then sequential access. Cost is similar to A4
– For A < v or A ≤ v, no index is used. Similar to A1. • Secondary Index, comparison (A7):
– Searching index is similar to A6
– But, retrieving each block may require access a different block – Therefore, linear search may be better
• Conjunctive Selection using one index (A8): – Use one index to retrieve (use A2 through A7) – Compare each for satisfying the other condition
– To reduce the cost, we choose a θi and one of A1 through
A7 for which the combination results in the least cost for σthetai(r)
– Cost is cost of chosen algo
• Conjunctive Selection using composite index (A9): – Use composite index (same as A3, A4 or A5)
• Conjunctive Selection by using intersection of identifiers (A10): – Cost is sum of (cost of individual index scans) + (cost of
re-trieval of records in the intersection)
– Sorting can be used so that all pointers in a block come to-gether; blocks are read in sorted physical order to minimize disk arm movement
• Disjunctive Selection by using union of pointers (A11):
– If access paths are available on all conditions, each index is scanned to get the pointers, union is taken and records are re-treived
– Even if one of the condition does not have an access path, the most efficient method could be a linear scan
Cost of Joins
• Two reasons why sorting is important: the query may require output to be sorted, and joins and some other operations can be implemented efficiently, if the input relations are first sorted. Sorting physically is more important than sorting logically (to reduce disk arm movement) • Natural Join can be expressed as a θ join followed by elimination of
repeated attributes by a projection
• nr: number of tuples in r, ns: number of tuples in s, br: number of
blocks of r, bs: number of blocks of s
• Nested loop Join:
– If both relations can be read into memory, cost = (br+ bs)
– Else, if only one block of each relation fits into memory, cost = nr∗ bs+ br, assuming “r” is the outer relation
• Block Nested loop Join:
– Assumption: M+1 is the total blocks available (1 for o/p); else the denom will be (M-2)
– r is the outer relation: Cost = d br
M −1∗bs+br, where M blocks
are allocated to r • Sort Merge Join
– Soting cost of each reltion assuming they are not sorted is (M is the number of pages available for sorting - 1 is for o/p and M-1 for input):
∗ for r, br(2dlogM −1(br/M )e + 1 + 1
∗ for s, bs(2dlogM −1(bs/M )e + 1 + 1
– After sorting, assuming that all the tuples with the same value for the join attributs fir in memory, the cost is: (Sorting Cost) + br+ bs
• Hash Join
– Assume no overflow occurs
– Use smaller relation (say r) as the build relation and larger re-lation (say s) as the probe rere-lation
– If M > br/M , no need of recursive paritioning and cost is:
Cost = 3(br+ bs)
– Else, if recursive partioning occurs: Cost = 2(br +
bs)dlogM −1(br)−1e+br+bs(included is the cost for reading
and writing partitions)
• A B-tree organization has b(m − 1)n/mc entries for each node
Query Optimization
Equivalence Rules
(Set Version)
• Cascade of σ for conjunction selections σθ1∧θ2(E) = σθ1(σθ2(E))
• Commutativity of selection operations
σθ1(σθ2(E)) = σθ2(σθ1(E))
• Only final projection in a sequence of projections πL1(πL2(. . . πLn))) = πL1(E)
• Selections can be combined with Cartersian products and theta joins: –
σθ(E1xE2) = E1onθ E2
–
σθ1(E1σθ2E2) = E1onθ1∧θ2E2
• Theta joins and natural joins are commutative E1on E2= E2on E1
• Associativity of joins
– Natural Joins are associative:
(E1on E2) on E3= E1 on (E2on E3)
– Theta joins are associative in the following manner: If θ2 has
only attributes from E2and E3, then:
(E1onθ1 E2) onθ1∧θ3E3= E1 onθ1∧θ3 (E2onθ2E3)
– Cartesian products are also associative • Distributivity of selections
– If θ0only involves E1, then
σθ0(E1onθE2) = (σθ0(E1)) onθE2
– If θ1only involves E1and θ2only involves E2, then
σθ1∧θ2(E1onθE2) = (σθ1(E1)) onθσθ2(E2))
• Distributivity of projections If L1 are only attributes of E1 and L2
only of E2, then
πL1∪L2(E1onθE2) = πL1(E1) onθ (πL2(E2))
• Set operations of union and intersection are commutative E1∪ E2= E2∪ E1
E1∩ E2= E2∩ E1
However, Set Difference is not commutative. • Set union and intersection are associative
(E1∪ E2) ∪ E3= E1∪ (E2∪ E3)
(E1∩ E2) ∩ E3= E1∩ (E2∩ E3)
• Selection operation distributes over union, intersection and set differ-ence operations:
σP(E1− E2) = σP(E1) − σP(E2)
Also,
σP(E1− E2) = σP(E1) − E2
(this does not hold for intersection???) • Projection operation distributes over union:
πL(E1∪ E2) = (πL(E1)) ∪ (πL(E2))
Join ordering: choose such that the size of th temporary results are re-duced
Enumeration of Equivalent Expressions:
• Space requirements can be optimized by pointing to shared sub ex-pressions
• Time requirements can be reduced by optimization (dynamic pro-gramming, etc.)
Estimating Statistics of Expr Results
• Catalog Information:
– nr= number of tuples in r
– br= number of block containing tuples of r
– lr= size of a tuple of r in bytes
– fr= blocking factor of r (= number of tuples of r that fit in one
block)
– V (A, r) = number of distinct values that appear in r for at-tribute A
– If A is a key for r, then V (A, r) = nr
– If tuples of r are phsyically stored together, br= dnfrre
– Histogram for a range of values of attribute can be used for estimating (histograms can be equi-width or equi-height) • Selection Size Estimation:
– Equality (σA=a(r))
∗ Assuming equi-probable, N um = nr/V (A, r)
∗ With histogram, num = nrange/V (A, range)
– Comparison (σA≤v(r))
∗ If v < min(A, r), num = 0 ∗ If v ≥ max(A, r), N um = nr
∗ Else, num = nr.max(A,r)−min(A,r)v−min(A,r) (this can be
modi-fied to use a histogram, where available - use the number in the ranges, instead of in the entire relation)
∗ If v is not known as in the case of stored procedures, as-sume num = nr/2 – Complex selections ∗ Conjunctions (σθ1∧θ2...∧n(r)): num = nr. s1∗s2∗...sn nn r ,
where siis the number of tuples that satisfy the selection
σθ1(r).
si/nris called the selectivity of the selection σθ1(r)
∗ Disjunctions (σθ1∨θ2...θn(r)): num = nr ∗ [1 − (1 − s1 nr)(1 − s2 nr) . . . (1 − sn nr)
∗ Negations: Compute num as: num = nr− num(σθ(r).
If NULLs are present, compute as: num = nr −
num(σθ(r) − num(N U LLs).
• Join Size Estimation:
– Cartesian Product: N um(rxs) = nrxns
∗ R ∩ S = φ: same as Cartesian Product
∗ R ∩ S is a key for R: num ≤ ns. Similarly, when it is a
key for S.
∗ R ∩ S is a foreign key of S, referencing R: N um = ns
∗ R ∩ S is neither a key for R nor S: Choose minimum of the following, where R ∩ S = {A}
· N um = nr∗ ns/V (A, s)
· N um = nr∗ ns/V (A, r)
• Size Estimation for Other Operations:
– Projection (πA(r)): N um = V (A, r) (since projection
elimi-nates duplicates)
– Aggregation (AGF(r)): N um = V (A, r) (since one tuple in
output for each distinct value of A) – Set Operations
∗ Same relation operations: Rewrite as conjunctions, dis-junctions or negations and use previous results (e.g., σθ1(r) ∪ σθ2(r) = σθ1∨θ2(r)
∗ Different relation operations: Inaccurate, but provides up-per bound
· N um(r ∪ s) = nr+ ns
· N um(r ∩ s) = min(nr, ns)
· N um(r − s) = nr
– Outer Joins: Inaccurate, but provides upper bound
∗ N um(rlef touters) = N um(r on s) + nr. Similarly,
for (r right outer s)
∗ N um(routers) = N um(r on s) + nr+ ns
• Estimation of number of distinct values:
– If selection condition θ forces A to take a single value, N um = V (A, σθ(r)) = 1
– If range of values, then Num = Number of specified values in the selection condition
– If selection condition of the form (A op V), then V (A, σθ(r)) = V (A, r) ∗ s, where s is the selectivity of the
selection
– In all other cases, num = min(V (A, r), nσθ(r))
– For joins:
∗ If all attrs. in A are from r, then V (A, r on s) = min(V (A, r), nrons)
∗ If A has A1 from r and A2 from S, then N um =
min(V (A1, r) ∗ V (A2 − A1, s), V (A2, s) ∗ V (A1 −
A2, r), nrons)
– For projections: N um = nr
– For aggregates like sum, count, avg, it is same as nr
– For min(A) and max(A), N um = min(V (A, r), V (G, r)), where G denotes the grouping attributes
Choice of Evaluation Plans
• To choose the best overall algo, we must consider even nonoptimal algos for individual operations
• Cost-based optimization: With n relations, there are 2(n−1)!(n−1)! different join orders.
• Time complexity is O(3n)
• Dynamic Programming Algo: Outline algo here
Heuristics in Optimization
1. Perform selection operations as early as possible (may cause prob-lems if no index on selection attribute and r relation is small in σθ(r on s)
2. Perform projections early (similar problems as in 1 above) 3. Left-Deep Join Orders: convenient for pipelining 4. Avoid Cartesian products
5. Cached plan can be reused
Optimizing Nested Queries
“where exsits” type of query can be optimized by using “decorrelation”; rewrite as join of temporary table (remember to use select distinct and to take care of NULL values to preserve the number of tuples)
Materialized Views
Normally, only query defintion is stoed. In materialized views, we compute the contents of the view and store. View Maintenance is required to keep the materialized view up-to-date. View Maintenance can be relegated to the programmer or be taken care by the system (can be immediate or deferred)
• Incremental View Maintenance: Update can be treated conceptually to Delete followed by Insert
• Join Operation:
– For inserts, vnew = vold ∪ (iron s) – For deletes, vnew = vold − (dron s) • Selection and Projection Operation:
– For selection inserts, vnew = vold ∪ σθ(ir)
– For selection deletes, vnew = vold − σθ(dr)
– For projection, need to handle duplicates:
∗ Keep count for each tuple in projection πA(r)
∗ Decrement count on delete and delete record from view when count is 0
∗ Increment count on insert or add to view if not present • Aggregation Operations:
– Count: Similar to projection
– Sum: Similar to count (but need to keep sum as well as count) – Avg: Keep sum as well as count
– Min, Max: Insetion - easy; deletion - expensive - need to find new min, max
• Other Operations:
– Set Intersection: (r ∩ s)
∗ On insertion in r, check if it is in s. If so, add to view ∗ On deletion in r, check if it is in s. If so, delete from view – Outer Joins: (routerjoins)
∗ On insertion in r, check if it is in s. If so, add to view. If not is s, still add to view, but padded with NULLs ∗ On deletion from s, pad with NULLs if it is in r and no
longer in S
∗ On deletion from r, remove from view
Query Optimization using Materialized Views
• Optimizer may need to substitute the query (or sub-query) by materi-alized view, if it exists
• Replacing a use of a materialized view by the view definition. For example, if σA=10(V ), where V is defined as r on s and there is an index on A in r, but not in r on s.
Transactions
Transaction: a set of operations that form a single logical unit of work. • ACID properties of a transaction:
– A - Atomicity: all or none (handled by Transaction Mgmt. component)
– C - Consistency: if db was consistent before xact, then it should be consistent after the xact (handled by programmer or con-straints)
– I - Isolation: an xact does not see the effects of a concurrent running xact (handled by the Concurrency Control component) – D - Durability: once committed, stays committed (handled by
the Recovery Mgmt. component)
• Transaction States: Active, Failed, Aborted (perform rollback), Par-tially committed, Committed
• Shadow-copy technique: Ensures atomicity and durability, and is used by text editors. Disadvantage: Very expensive to make copies of entire db; no support for concurrent xacts
• Need for Concurrent Executions: Improved throughput (tps), Im-proved resource utilization, Reduced waiting time (e.g., smaller xacts queued up behind a large xact), Reduced average response time • Schedules: represent the chronological order in which the
instruc-tions are executed in the system. For a set for n transacinstruc-tions, there exist n! different serial schedules
• Consistency of the db under concurrent execution can be ensured by making sure that any schedule that is executed has the same effect as a serial schedule (that is, one w/o concurrent execution)
Conflict Serializability:
• Instructions I1and I2conflict if they are operations by different xacts
on the same data item and at least one of them is a write operation • If a schedule S can be transformed into a schedule S0
by a series of swaps of non-conflicting instructions, S and S0are said to be conflict equivalent
• A schedule S is said to be conflict serializable if it is conflict equiva-lent to some serial schedule
• This prohibits certain types of schedule even though there would be no problem (e.g., ops that simply add and subtract). However, these cases are harder to analyze.
View Serializability:
• Less stringent than Conflict Serializability
• View Equivalence: 2 schedules S and S0are view equivalent if ALL 3 conditions mentioned below are met:
– For each data item Q, if xact Tiread the initial value of Q in S,
then xact Tiin S0must also read the initial value of Q
– For each data item Q, if xact Tiexecutes read(Q) in S and if
that value was produced by xact Tj, then that read(Q) op of
xact Tiin S0must also read the value of Q that was produced
by that same write op of xact Tj
– For each data item Q, the xact (if any) that performs the final write(Q) op in S, must also perform the final write(Q) op in S0 • A schedule is said to be view serializable if it is view equivalent to
some serial schedule
• Blind Writes: Writing a value w/o reading it first
• Blind Writes appear in any view serializable schedule that is not con-flict serializable
Other properties
• Recoverable Schedule: is one where, for each pair of xacts Tiand
Tjsuch that Tjreads a data item previously written by Ti, then the
commit operation of Tiappears before the commit operation of Tj
• Cascading rollback is undesirable since it can lead to undoing a sig-nificant amount of work
• Cascadeless Schedule: is one where, for each pair of xacts TiandTj
such that Tjreads a data item previosuly written by Ti, the commit
operation of Tioccurs before the read operation of Tj.
• A cascadeless schedule is also recoverable, but not vice-versa. • The goal of concurrency control schemes is to provide a high degree
of concurrency, while ensuring that all schedules that can be gener-ated are conflict or view serializable, and are cascadeless.
• Testing for Conflict Serializability: (to show that the generated sched-ules are serializable)
– Construct precedence graph for a schedule S (vertices are xacts, edges indicate read/write dependencies)
– If the graph contains no cycles, then the schedule S is conflict serializable
– A serializability order of the xacts can be obtained through topological sorting of the precedence graph
– Cycle detection algos are O(n2) • Testing for View Serializability:
– NP-complete problem
– Sufficient conditions can be used
– If sufficient conditions are satisfied, the schedule is view-serializable
– But, there may be view-serializable schedules that do not sat-isfy the sufficient conditions
See examples and exercises of schedules from the book
Concurrency Control
Shared locks (S) and Exclusive locks (X): Compatibility matrix: (S,S) true, (S,X) false, (X,S) false, (X,X) false
Starvation can be avoided: by processing the lock requests in the order in which they were made
2PL
• Ensures serializability
• Growing phase, Shrinking phase • Does not prevent deadlock
• Cascading rollbacks may occur (e.g., if T7reads a data item that was
written by T5and then T5aborts
– To avoid cascading rollbacks, strict 2PL can be used where exclusive locks must be held till the xact aborts or commits (prevents xacts from reading uncommitted writes)
– rigorous 2PL can be used where ALL locks are held till the xact aborts or commits; xacts are serialized in their commit or-der
• Upgrading and Downgrading of locks can be done; upgrading should be allowed only in the growing phase, while downgrading only in the shrinking phase (e.g., series of reads followed by write to a data item - in other forms of 2PL above, the xact must obtain an X lock on the data item to be updated, even if it is much later)
Implementation of locking: Hash table for data items with linked list (of xacts that have been granted locks for that data item plus those that are wait-ing). Overflow chaining can be used.
Graph-based Protocols
• Acyclic graph of data item locking order
• A data item can be locked by Tionly if its parent is currently locked
by Ti
• Locks can be released earlier; so shorter waiting times and increased concurrency
• Deadlock free; so no rollbacks are required
• Disadvantages: may need to lock more data items than needed (lock-ing overhead and increased wait(lock-ing time), w/o prior knowledge of which data items to lock, xacts may have to lock the root of the tree and that can reduce concurrency greatly
• Cascadelessness can be obtained by tracking commit dependencies such that a transaction is not allowed to commit until the ones that it had read values written by have not commited
Timestamp-based Protocols
• Determines the serializability order by selecting the order in advance • Using timestamps: could be the system clock or a logical counter • Each xact is given a timestamp when it enters the system
• Each data item has 2 timestamps: W-timestamp (the largest ts of any xact that wrote the data item successfully) and R-timestamp (the largest ts of any xact that read the data item successfully)
• Timestamp-Ordering Protocol is: – If Tiissues read(Q)
∗ If T S(Ti) < W − timestamp(Q), reject the read and
rollback Ti
∗ If T S(Ti) ≥ W − timestamp(Q), execute the read
and set the R-timestamp of Q to maximum of T Si and
R-timestamp(Q) – If Tiissues write(Q)
∗ If T S(Ti) < R − timestamp(Q), reject the write and
rollback Ti
∗ If T S(Ti) < W − timestamp(Q), reject the write and
rollback Ti
∗ In all other cases, execute the write and set the W-timestamp of Q to T Si
– Rolled-back TS get a new timestamp when they are restarted – Freedom from deadlocks
– However, xacts could starve (e.g., long duration xact getting restarted repeatedly due to conflicts with short duration xacts) – Recoverability and cascadelessness can be ensured by:
∗ Performing all writes together at the end of the xact; no xact is permitted to access any of the data items that have been written
∗ Using a limited form of locking, whereby uncommitted reads are postponed until the xact that updated the item commits
– Recoverability alone can be guaranteed by using commit de-pendencies, that is, tracking uncommited writes and allowing a xact Tito commit only after the commit of all xacts that wrote
a value that Tiread.
• Thomas’ Write Rule:
– Allows greater potential concurrency
– Ignores writes if T Si < W − timestamp(Q), instead of
rolling it back
Validation-based Protocols
• Also called optimistic concurrency control
• Each xact goes through 3 phases (for update xacts and 2 for read-only xacts):
– Read phase: The system executes the xact Ti; it reads all data
items and performs all write operations on temporary local variables, w/o updates to the actual db
– Validation phase: Checks if the updates can be copied over to the db w/o conflict
– Write phase: Done only if the xact succeeds in the validation phase. If so, the system applies the updates to the db; otherwise the xact is rolled back
• Validation test for xact Tj requires that for all xacts Ti with
T S(Ti) < T S(T j), one of the following conditions must hold:
– F inish(Ti) < Start(Tj)
– The set of data item written by Tidoes not intersect with the set
of data items read by Tjand Ticompletes its write phase before
Tjstarts its validation phase. (Start(Tj) < F inish(Ti) <
V alidation(Tj). This ensures that the writes of Tiand Tjdo
not overlap
Multiple Granularity
• Hierarchy of granularity: DB, Areas, Files, Records; visualize as a tree with the DB at the root of the tree
• Explicit locking at one level will mean implicit locking at all nodes below it
• Care must be taken not to grant explicit lock at a level above which another lock has been granted already (e.g., cannot lock a record ex-plicitly, if the file has been locked). The tree must be traversed from the root to the required level to find out.
• Also, a db cannot be locked, if someone else is holding a lock at a lower level. Instead of searching the entire tree to determine this, intention lock modes are used.
– When an xact locks a node, it acquires an intention lock on all the nodes from the root to that node.
– IS (Intention-Shared) lock: If a node is locked in IS mode, then explicit shared locking is being at the lower level
– IX (Intention-Exclusive) lock: If a node is locked in IX mode, then explicit exclusive or shared locking is being at the lower level
– SIX (Shared and Intention-Exclusive) lock: If a node is locked in SIX mode, then the subtree rooted at that node is being locked in explicitly shared mode and explicit exclusive lock-ing is belock-ing at the lower level
– Compatibility Matrix:
IS IX S SIX X
IS true true true true false IX true true false false false S true false true false false SIX true false false false false X false false false false false
– Multiple-granularity protocol:
∗ The compat matrix above must be followed for granting locks
∗ It must lock the root of the tree first, and it can lock it in any mode
∗ It can lock a node Q in S or IS mode only if it currently has the parent of Q locked in either IS or IX mode ∗ It can lock a node Q in X, SIX, or IX mode only if it
cur-rently has the parent of Q locked in either IX or SIX mode ∗ It can lock a node only if it has not previously unlcoked
any node (that is, Tiis 2P)
∗ It can unlock a node Q only it it currently has none of the children of Q locked
∗ Locking is done top-down, whereas unlocking is done bottom-up
– This protocol enhances concurrency and reduces lock overhead and is good for apps that include a mix of:
∗ Short xacts that access only a few data items
∗ Long xacts that produce reports from the entire file or set of files
– Deadlock is possible
Multiversion Schemes
Instead of delaying the reads or aborting an xact, these schemes use old copies of the data. Each write produces a new version of a data item, and read is given one of the versions of the data item. The protocol must ensure that the version given ensures serializability and that an xact be able to easily determine which version to read.
• Multiversion Timestamp Ordering:
– Each xact has unique ts as before (for the TS Scheme) – Each version of data item has content of the data item, R-ts and
W-ts
– Whenever an xact writes to Q, a new version of Q is produced whose R-ts and W-ts are initialized to T S(Ti).
– Whenever an xact reads Q, the R-ts of Q is set to T S(Ti) only
if R − ts(Q) < T S(Ti)
– The protocol is (an xact Tiwants to read or write Q):
∗ Find a version Qkwhose w-ts is the largest ts ≤ T S(Ti)
∗ If xact Tiissues read(Q), the value returned is the content
of Qk
∗ If xact Ti issues write(Q) and T S(Ti) < R − ts(Qk),
then rollback Ti(some other xact already read the value
and so we cannot change it now). On the other hand, if T S(T i) = W − ts(Qk), overwrite the contents of Qk
(w/o creating a new version); else, create a new version. – Older versions of a data item are removed by: If there are 2
ver-sions of a data item with W-ts less than the oldest transaction in the system, the older of these 2 versions can be removed – A read request never fails and is never made to wait
– Disadv: Reading requires updating of R-ts (2 disk accesses than one), and conflicts between xacts are resolved through rollbacks rather than waits (Multiversion 2PL solves the roll-back problem).
– Does not ensure recoverability and cascadelessness; can be ex-tended in the same manner as the basic TS-ordering scheme • Multiversion 2 PL: Attempts to combine the adv. of multiversion
with 2PL; it differentiates between read-only xacts and update xacts. TODO
Deadlock Handling
2 methods to deal with deadlocks: Deadlock prevention, and Deadlock de-tection and recovery. Deadlock prevention is used if the probability of dead-locks is relatively high; otherwise detection and recovery are more efficient. Detection scheme requires overhead to maintain information while running to detect deadlocks as well as losses that can occur due to recovery from deadlocks.
• Deadlock Prevention using partial ordering: Use partial ordering technique like tree protocol
• Deadlock Prevention using total ordering and 2PL: Use total or-dering and 2PL; in this case, the xact cannot request locks on items that precede that item in the ordering
• Deadlock Prevention using wait-die: Using xact rollback; older xacts are made to wait; younger ones are rolled back if the lock is currently held by an older one; the older the xact gets, the more it must wait
• Deadlock Prevention using wound-wait: Pre-emptive technique; younger xact is wounded by older one; younger one is made to wait, if older xact has a lock on the item; there may be fewer rollbacks in this scheme
Both wait-die and wound-wait avoid starvation and both may cause unnecessary rollbacks
• Timeout-based schemes: In between deadlock prevention and de-tection schemes; allow an xact to wait for sometime; if timeout, as-sume that deadlock may have occurred and rollback xact; Easy to implement, but difficult to determine the correct duration of time to wait; suitable for short xacts
• Deadlock Detection and Recovery: Must check periodically if a deadlock has occurred (detection); Can be done to see if cycles exist in a wait-for graph; Selection of a victim can be done on the basis of minimum cost (how many xacts will be involved; how many data items have been used; how much longer to complete, etc.); Total roll-back or partial rollroll-back (just enough rollroll-back to the point where the appropriate lock is released that breaks the deadlock); Starvation can be prevented by including the number of times an xact has been rolled back in the cost factor while deciding the victim
Insert and Delete Operations
• Delete operation similar to write (X lock for delete op in 2PL; treated similar to write op in TS-ordering)
• Insert operation: X-lock in 2PL; in TS-ordering, assign TS of the xact that is inserting the item to the R-ts and W-ts of the data item
Phantom Phenomenon
Consider computation of sum by using a select and an insert statement; this can result in a non-serializable schedule if locking is done at the granularity of the data item; neither access any tuple in common - so the conflict would go undetected.
Can be alleviated by:
• Associating a virtual data item with every relation and having the xacts lock this (in addition to the tuples), if they are updating or read-ing info about the relation
• Index-locking protocol using 2PL can be used: nodes of the index must be locked in shared mode for lookups; writes must lock the ap-propriate nodes of the index in exlcusive mode
• Variants of index-locking can be used to implement the other schemes (apart from 2PL)
Weak Levels of Consistency
Serializability allows programmers to ignore issues related to concurrency when they code xacts.
• Degree-Two Consistency: Purpose is to avoid cascading aborts w/o necessarily ensuring serializability; S-locks may be acquired and re-leased at any time; X-locks can be acquired at any time, but cannot be released until the xact aborts or commits; results in non-serializable schedules; therefore, this approach is undesirable for many apps • Cursor Stability: Form of two-degree consistency for programs
written in host languages where iteration of tuples is done using a cursor; Instead of locking the entire relation, the tuple that is cur-rently being processed is locked in S-mode; Any modified tuples are locked in X-mode until the xact commits; 2PL is not used, Serializ-ability is not guaranteed; Heavily accessed relations gain increased concurreny and improved system performance. Programmers must take care at the app level so that db consistency is ensured.
• Weak Levels of Consistency in SQL: SQL-92 levels: – Serializable (default)
– Repeatable Read (xact may not be serializable wrt other xacts; e.g., when an xact is searching for records satisfying some con-ditions, the xact may find some records inserted by a committed xact, but not others)
– Read committed
– Read uncommitted (lowest level of consistency allowed in SQL-92)
Concurrency in Index Structures
Since indices are accessed frequently, they would become a point of great lock contention, leading to a low degree of concurrency. It is acceptable to have nonserializable concurrent access to an index, as long as the accuracy of the index is maintained.
2 technique:
• Crabbing Protocol:
– When searching, lock root node in shared mode, then the child node. After acquiring lock on child, release lock on parent. – When inserting, traverse tree as in search mode. Then, lock
the node affected in X-mode. If coalescing, splitting or redis-tribution is required, lock the parent in X-mode; then perform the operations on the node(s) and release the locks on the node and the siblings; retain lock on parent, if parent needs further splitting, coalescing, or resitribution.
– Progress of locking goes from top to bottom while searching and bottom to up when splitting, coalescing, redistributing • B-link-tree locking protocol: Achieves more concurrency by
avoid-ing holdavoid-ing the lock on one node while holdavoid-ing lock on another node, by using a modified version of B+-trees called B-link trees; these require that every node including the internal nodes and leaf nodes maintain a pointer to its right sibling
– Lookup: Each node must be locked in S mode before access-ing it; Split may occur concurrently with lookup, so the search value may have moved to the right node; Leaf nodes are locked in 2PL to avoid phantom phenomenon
– Insertion and deletion: Follows the rules to locate the lead node into which the insertion or deletion will take place; Upgrades the shared lock to X lock on the affected leaf; Leaf nodes are locked in 2PL to avoid phantom phenomenon
– Split: Create the new node (split); change the right-sibling pointers accordingly; release X-lock on the original node (if it is non-leaf; leaf nodes are locked in 2PL to avoid phantom phenomenon)
– Coalescing: Node into which coalescing will be done should be locked in X mode; once coalescing has been done, parent node is locked in X mode to remove the deleted node; then, xact releases the locks on the coalesced nodes, if parent is not to be coalesced, lock on parent can be released
– Note: An insertion or deletion may lock a node, ulock it, and subsequently relock it. Furthermore, a lookup that runs concur-rently with a split or coalescence operation may find that the desired value has shifted to the right-sibling node by the split or coalescence operation; this can be accessed by following the right-sibling pointer.
– Coalescing of nodes can cause inconsistencies; lookups may have to restart
– Instead of 2PL on leaf nodes, key-value locking on individ-ual key-values can be done. However, must be done carefully; else, phantom phenomenon can occur for range lookups; can be taken by locking one more key value than the range (next-key value).
Recovery System
• Fail-stop assumption: Hardware errors and bugs in software bring the system to a halt, but do not corrupt the nonvolatile storage con-tents.
• Stable Storage Implementation: Keep 2 physical blocks for each logical database block. Write the info onto the first physical block.
When the first write completes successfully, write the same info onto the second physical block. The o/p is completed only after the second write completes successfully.
During recovery, the system examines each pair of physical blocks. If contents same, nothing to be done. If error in one, replace with the other. If contents differ, replace first’s contents with the contents of the second block.
Number of blocks to compare can be reduced by keeping list of on-going writes in NVRAM (so that only these need to be compared).
Log-Based Recovery
Recovery is used for rolling back transactions as well as for crash recovery. Update log record has: Xact Id, Data-item id, Old Value, New Value
• <T Start>, <T commit> or <T abort> records are written at start, commit or abort of a transaction
• Deferred Database Modification: Only new values need to be stored (for redo; no need for undo)
• Immediate Database Modification: Requires to store both old and new values (for undo and redo)
– Undo is performed before redo
• Checkpoints: Helps in reducing scanning the log after a crash to lo-cate the transactions to be undone and redone; also helps in reducing the time to redo (since the changes before the checkpoint would have been applied already).
Transactions are not allowed to perform any update actions, such as writing to a buffer block or writing a log record, while a checkpoint is in progress.
– Output all log records to stable storage
– Output all modified buffer blocks to stable storage – Write the checkpoint record to stable storage
For recovery, the log must be scanned backward to find the most re-cent checkpoint record. It needs to further continue searching back-ward until it finds all the transactions that have some record after the most recent checkpoint record. Only these transactions need to be redone / undone. No commit record, do undo; else do redo.
• Recovery with Concurrency Control:
– List of active transactions are stored as part of the checkpoint record
– The log can be used to rollback even failed xacts.
– If strict 2PL is used (that is, excl. locks till end of xact), the locks held by an xact may be released only after the xact has been rolled back. So, when an xact is being rolled back, no other xact may have updated the same data item (the xact should have locked the data item since it was to update it in the first place). Therefore, restoring the old value of a data item will not erase the effects of any other xact.
– Undo must be done by processing the log backward – Redo must be done by processing the log forward
– For recovery: Scan the log backward until it finds the ¡check-point L¿ record performing the following steps as it reads each record while scanning backward:
∗ If ¡Ticommit¿ record found, add Tito the redo list
∗ If ¡Tistart¿ record found and Tiis not on the redo list,
add to the undo list
∗ Finally, for all Tiin the checkpoint record list, that does
not appear in the redo list, add to the undo list. This is to take care of long running xacts that may not have updated anything since the checkpoint record was written. ∗ Undo must be done prior to redo
• WAL (Write-ahead logging): Before a block of data in main memory can be output to the database (in non-volatile storage), all log records pertaining to data in that block must have been output to stable stor-age.
Strictly speaking, the WAL rule requires only that the undo info in the log have been output to stable storage, and permits the redo info to be written later. This is relevant only in systems where undo and redo info are stored in separate log records.
• Dump database procedure: Output all log records to stable, then the buffer blocks, copy the contents of the db to stable storage, then out-put a log record ¡dump¿ onto the stable storage. To recover, only records after the ¡dump¿ record must be redone. But copying of en-tire db is impractical and the xact processing must be halted during the dump. Fuzzy dumps can be used to allows xacts to be active while the dump is in progress.
Advanced Recovery Techniques
Using logical logging for undo process for achieving more concurrency (faster release of locks on certain structures such as B+-tree index pages)
Fuzzy Checkpointing:
• Normal checkpointing may halt the xact processing for a long time if the number of pages to be written is large
• Allows xacts to modify buffer blocks once the checkpoint record has been written
• While performing fuzzy checkpointing, the xact processing is halted only briefly to make a list of buffers modified. The checkpoint is record before the buffers are written out.
• The locks are released and xacts can modify the buffer blocks; the checkpointing process proceeds to output the modified blocks in its list in parallel. However, the block being written out by the check-pointing process still needs to be locked; other blocks need not be. • Concept of “last-checkpoint” record at a fixed position on disk can
be used to guard against failures. This record should be updated only after ALL the buffers in the checkpoint’s list have been written to stable storage.
ARIES
• Features:
– Uses LSN (log sequence number) – Physiological Redo
– Dirty Page Table
– Fuzzy Checkpointing (allows dirty pages to be written contin-uously in the background, removing in bottle necks when all pages need to be written at once)
• LSN:
– Every log record has a unique LSN that uniquely identifies the log record
– LSN is most often file number and an offset within that file – Each page has an LSN that indicates the LSN of the last record
that modified that page.
– PageLSN is essential to ensure idempotence in the presence of phsyiological redo operations
– Physiological redo cannot be reapplied to a page since it would result in incorrect changes on the page
– Each log record contains a field called ”PrevLSN” that points to the previous log record for this transactioon (helps in locating transaction log records easily without reading the whole log) – CLRs (Compenstation Log Records) have an additional field
UndoNextLSN that is used in the case of the operation-abort log record to point to the log record that is to be undone next • Dirty Page Table:
– Stores the list of pages that have been updated in the buffer – For each page, the PageLSN and the RecLSN is also stored – RecLSN indicates which log records have already been applied
to the disk version of the page
– Intially, when the page is brought in from the disk, the RecLSN is set to the current end of the log
• Checkpointing:
– A checkpoint log record contains the Dirty Page Table and the list of active transactions
– For each transaction, the checkpoint record also stores the last LSN for that transaction
– A fixed position on the disk notes the LSN of the last complete checkpoint log record
– Analysis Pass: Determines which xacts to undo, which pages were dirty at the time of the crash, and the LSN from which the redo pass should start.
– Redo Pass: Starts from a position determined during the anal-ysis phase, and performs a redo, repeating history, to bring the database to a state it was in before the crash.
– Undo Pass: Rolls back all xacts that were incomplete at the time of the crash. Need to elaborate here about CLRs, etc. While undoing, if a CLR is found, it uses the UndoNextLSN to locate the next record to be undone; else it undoes the record whose number is found in the PrevLSN field
• Advantages of ARIES:
– Recovery is faster (no need to reapply already redone records; pages need not even be fetched if the changes are already ap-plied)
– Lesser data needs to be stored in the log – More concurrency is possible
– Recovery Independence (e.g., for pages that are in error, etc.) – Savepoints (e.g., rolling back to a point where deadlock can be
broken)
– Allows fine-grained locking
– Recovery optimizations (fetch-ahead of pages, out-of-order redo)
Remote Backup Systems
Several issues must be addressed:
• Detection of failure: Using “heartbeat” messages and multiple links of communication
• Transfer of control: When original comes back up, it must update itself (by receiving the redo logs from the old backup site and replay-ing them locally). The old backup can then fail itself to allow the recovered primary to take over.
• Time to recover: Hot-spare configuration can be used. • Time to commit:
– One-safe: Commit as soon as commit log record is written to stable storage at primary
– Two-very safe: Commit only when both primary and secondary have written the log records to stable storage (problem is when secondary is down)
– Two-safe: Same as Two-very safe when both primary and sec-ondary are up; when secsec-ondary is down, proceed as One-safe)
Database System Architectures
Main Types: Client-Server, Parallel, Distributed
Centralized Systems:
• Coarse-grained parallelism:
A single query is not partitioned among multiple processors. Such systems support a higher throughput; that is, they allow a greater number of transactions to run per second, although individ-ual transactions do not run any faster.
• Fine-grained parallelism: Single tasks are parallelized (split) among multiple processors
Client-Server Systems:
Clients access functionality through API (JDBC, ODBC, etc.) or transac-tional remote procedure calls
Server System Architectures:
2 types: Transaction-server v/s Data-server systems
• Transaction-server systems (aka query-server systems): – Components of a Transaction-server system include:
∗ Server Processes ∗ Lock Manager Process ∗ Log Writer Process
∗ Database Writer Process ∗ Process Monitoring Process ∗ Checkpoint Process
– The shared memory contains all the shared data: ∗ Buffer Pool
∗ Lock Table ∗ Log Buffer ∗ Cached Query Plans
– Semaphores or ”Test and Set” atomic operations must be used to ensure concurrent access to the shared memory
– Even if the system handles lock requests through shared mem-ory, it still uses the lock manager process for deadlock detection • Data-server systems (aka query-server systems):
– This architecture is used typically when:
∗ High-speed connection between clients and servers ∗ Client systems have comparable computational power as
those of servers
∗ Tasks to be executed are computationally intensive – The client needs to have full backend functionality
Parallel Systems
• Speedup v/s Scaleup
• Factors affecting Scaleup / Speedup • Interconnection Networks:
– Bus
– Mesh (max. distance is 2(√n − 1) or√n, if wrapping is al-lowed from the ends)
– Hypercube (max. distance is log n)
• Parallel System Architectures: Shared-memory, Shared-disks, Shared nothing, Hierarchical
– Hierarchical: Share nothing at the top-level(???), but internally each node has either shared-memory or shared-disk architec-ture)
Distributed Systems
• Reasons: Sharing data, Autonomy, Availability
• Multidatabase or heterogeneous distributed database systems • Issues in distributed database systems: Software development cost,
Greater potential for bugs, Increased processing overhead • Local-Area Networks, Storage Area Networks (SAN)
• Wide-Area Networks: Disconintuous Connection WANs v/s Contin-uous Connection WANs
Distributed DB
• Each site may participate in the execution of transactions that access data at one site, or several sites.
• The difference between centralized and distributed databases is that, in the centralized case, the data reside in one location, whereas in the distributed case, the data reside in several locations.
• Homogeneous Distributed DB: All sites have identical dbms soft-ware, are aware of one another, and agree to cooperate in processing users’ requests
• Heterogeneous Distributed DB: Different sites may use different schemas and different dbms software, and may provide only limited facilities for cooperation in transaction processing
Distributed Data Storage
Two approaches to storing a relation in a distributed db:
• Replication: Several identical copies of a relation are stored; each replica at a different site. Full replication: a copy is stored at every site
– Advantages: Availability, Increased parallelism (minimizes movement of data between sites)
– Disadvantages: Increased overhead on update
– In general, replication increases the performance of and the availability of data for read operations; but update transactions incur greater overhead
– Concept or primary copy of a relation
• Fragmentation: The relation is partitioned into several fragments, and each fragment is stored at a different site
– Horizontal Fragmentation: Each tuple to one or more sites ri= σPi(r). r is reconstructed using: r = r1∪r2∪r3. . .∪rn
– Vertical Fragmentation: Decomposition of scheme of relation (so that columns are at one or more sites) ri = πRi(r). The
original relation can be obtained by taking the natrual join of all the fragmented relations. Primary key (e.g., tuple id) needs to exist in each fragment.
– For privacy reasons, vertical fragmentation can be used for hid-ing columns.
• Fragmentation and Replication can be combined
• Transparency: Users should get: Fragmentation transparency, Repli-cation transparency, LoRepli-cation transparency
• To prevent name clashes: a name server can be used (single point-of-failure) or site id prepended to each relation name. Aliases can be used to map aliases to real names stored at each site. This helps when the administrator decides to move a data item from one site to another.
Distributed Transactions
• Need to worry about failure of a site or failure of communication link while participating in a transaction
• Transaction Manager (handles ACID for local) and Transaction Co-ordinator (coordinates the execution of both local and global transac-tions initiated at its site)
• Transaction Coordinator Responsibilities: Start execution of a trans-action, Break a xact into sub-parts and distribute to various sites, co-ordinate termination of xact (abort or commit)
• Failures: failure of a link, loss of messages, network partition, failure of a site
2PC
• Protocol: When all sites inform the coordinator that the transaction is complete:
– P1: Coord sends all sites ¡prepare T¿, Sites reply with ¡ready T¿ or ¡no T¿
– P2: Coord sends ¡commit T¿ or ¡abort T¿ (based on whether all sites were ready to commit or not)
– All such comm. must be logged to stable storage before it sends the msg. out so that recovery is possible
– In some implementations, each site sends ¡ack T¿ msg to the coord. The coord records ¡complete T¿ after it receives ¡ack T¿ from all the sites
• Handling of failures:
– Failure of site: Handling by coordinator:
∗ If site failed before replying ¡ready¿, the coord treats it similar to a reply of ¡abort¿
∗ If site failed after replying ¡ready¿, the coord ignores the failure and proceeds normally (the site will take care after it comes back up)
Handling by site: When the site comes back up, it checks it log: ∗ If no control records in log, execute undo
∗ If commit in log, commit the xact ∗ If abort in log, execute undo
∗ If ¡ready¿ is present in log, it needs to find out from the co-ord about the status. If coco-ord is down, it can ask the other sites. If this info is not available, then the site can neither commit nor abort T. It needs to postpone the decision for T until it gets the needed info.
– Failure of coord: When coord fails, then the participating sites must try to determine the outcome (but cannot be done in all cases)
∗ If site has ¡commit T¿, then it needs to commit the xact ∗ If site has ¡abort T¿, then it needs to undo
∗ If site does not have ¡ready T¿, then it can undo
∗ Otherwise, site has a ¡ready T¿. In this case, it must wait for the coord to recover. This is the “blocking problem”. If locking is used, other transactions may be forced to wait.
– Network Partition:
∗ If the coord and all participants are in the same partition, then no effect.
∗ Otherwise, the sites that are in the partition other than the coord, treat the failure as if the coord failed. Similarly, for the sites in the same partition as the coord and the coord, they treat the failure as if the sites in the other partition had failed.
• To allow the recovered site to proceed, the list of items locked can also be recorded with the ¡ready T¿ message in the log. The recovery proceeds to relock those items, whereas other xact can proceed.
3PC
• Tries to avoid blocking in certain cases by informing at least “k” other sites of its decision
• It is assumed that no network partition occurs and not more than “k” sites fail, where “k” is a predetermined number
• If the coord fails, then the sites elect a new coord. The new coord tries to find out if any site knows about the old coord’s intentions. If it finds any one site, then it starts the third phase (to commit or abort). If it cannot, the new coord aborts the xact.
• If a n/w partition occurs, it may appear to be the same as “k” sites failing and blocking may occur.
• 3PC has overheads; so it is not used widely.
• Also, it should be implemented carefully; else the same xact may be committed in one partition and may be aborted in another
Alternative methods of xact processing
Using persistent messaging; this requires complicated error handling (e.g., by using compensating xacts). Persistent messaging can be used for xacts that cross organizational boundaries.
Implementation of persistent messaging:
• Sending site protocol: Messages must be logged to persistent stor-age within the context of the same xact as the originating xact be-fore sending it out; On receiving an ack from the receiver, this can be deleted. If no ack is recd., the site tries repeatedly. After pre-determined number of failures, error is reported to the application (compensating xact must be applied).
• Receiving site protocol: On receipt, the receiver must first log into persistent storage; Duplicates must be rejected; After the xact for logging the message to the log relation commits, the receiver send an ack. Ack is also sent for duplicates. Deleting received messages from the receiver must be done carefully, since the ack may not have reached the sender and a duplicate may be sent. Each message can be given a timestamp to deal with this problem. If the ts of a recd. msg. is older than some predetermined cutoff, then that msg is discarded and all other messages recorded that have ts older than the cutoff can be deleted.
Concurrency Control in Dist. DB
• Each site participates in the execution of a commit protocol to ensure global transaction atomicity.
Locking Protocols:
• Single Lock-Manager Approach:
– A single lock manager (residing at a single site) for the entire system
– Request for lock is delayed until it can be granted; message is sent to the site from which the lock request was initiated. – The xact can read from any of the site where the replica is
avail-able; but all sites where a replica of the data item exists must be involved in the writing.