Query Processing - Modern Database Management Solution Book

Review Questions

18.1 What are the objectives of query processing?

The aims of query processing are to transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy expressed in a low-level language (implementing relational algebra), and to execute the strategy to retrieve the required data (see Section 18.1).

18.2 How does query processing in relational systems differ from the processing of low-level query

languages for network and hierarchical systems?

In first generation network and hierarchical database systems, the low-level procedural query language is generally embedded in a high-level programming language such as COBOL, and it is the programmer’s responsibility to select the most appropriate execution strategy. In contrast, with declarative languages such as SQL, the user specifies what data is required rather than how it is to be retrieved. This relieves the user of the responsibility of determining, or even knowing, what constitutes a good execution strategy and makes the language more universally usable (see start of chapter).

18.4 What are the typical stages of query decomposition?

The typical stages of query decomposition are analysis, normalization, semantic analysis, simplification, and query restructuring (see Section 18.2).

18.5 What is the difference between conjunctive and disjunctive normal form?

Conjunctive normal form: A sequence of conjuncts that are connected with the ∧ (AND) operator. Each conjunct contains one or more terms connected by the ∨ (OR) operator. A conjunctive selection contains only those tuples that satisfy all conjuncts.

Disjunctive normal form: A sequence of disjuncts that are connected with the ∨ (OR) operator. Each disjunct contains one or more terms connected by the ∧ (AND) operator. A disjunctive selection contains those tuples formed by the union of all tuples that satisfy the disjuncts.

See Section 18.2 under normalization.

18.6 How would you check the semantic correctness of a query?

See Section 18.3 under semantic analysis.

18.7 State the transformation rules that apply to:

(a) Selection operations. See Section 18.3.1: Rules 1, 2, 4, 6, 9.

(b) Projection operations. See Section 18.3.1: Rules 3, 4, 7, 10.

(c) Theta-join operations. See Section 18.3.1: Rules 5, 6, 7, 11.

18.8 State the heuristics that we should apply to improve the processing of a query.

See Section 18.3.2.

18.9 What type of statistics should a DBMS hold to be able to derive estimates of relational algebra

operations?

See Section 18.4.1.

18.10 Under what circumstances would the system have to resort to a linear search when implementing a selection operation?

(1) Unordered file, with no indexes.

(2) Composite predicate where one of the terms contains an OR condition and the term requires a linear search.

(3) Composite predicate where no attribute can be used for efficient retrieval. 18.11 What are the main strategies for implementing the join operation?

Main strategies for implementing join are: block nested loop join, indexed nested loop join, sort-merge join, hash join. See Section 18.4.3.

18.12 What is the difference between materialization and pipelining?

Materialization - output of one operation is stored in a temporary relation for processing by next operation.

Pipelining - pipeline results of one operation to another operation without creating a temporary relation to hold the intermediate result.

See Section 18.5.

18.13 Discuss the difference between linear and non-linear relational algebra trees. Give examples to illustrate your answer.

Linear tree - relation on one side of each operator is always a base relation. See Section 18.5 under Linear Trees.

18.14 What are the advantages and disadvantages of left-deep trees?

The inner relation of a join needs to be examined for each tuple of the outer relation, so inner relations must always be materialized. With a left-deep tree, the inner relation is always a base relation, and so already materialized.

Exercises

18.15 Calculate the cost of the three strategies cited in Example 18.1 if the Staff relation has 10,000 tuples, Branch has 500 tuples, there are 500 Managers (one for each Branch), and there are 10 London branches.

Costs: (1) 10,010,500 (2) 30,500 (3) 11,520.

18.16 Using the Hotel schema given in the Exercises of Chapter 13, determine whether the following queries are semantically correct:

(a) SELECT r.type, r.price

FROM room r, hotel h

WHERE r.hotel_number = h.hotel_number AND h.hotel_name = ‘Grosvenor Hotel’ AND r.type > 100;

Not semantically correct: hotel_number and hotel_name not in schema; type is character string and so cannot be compared with an integer value (100).

(b) SELECT g.guest_no, g.name

FROM hotel h, booking b, guest g

WHERE h.hotel_no = b.hotel_no AND h.hotel_name = ‘Grosvenor Hotel’;

Not semantically correct: hotel_name not in schema; Guest table not connected to remainder of query.

FROM hotel h, booking b, room r

WHERE h.hotel_no = b.hotel_no AND h.hotel_no = ‘H21’ AND b.room_no = r.room_no AND type = ‘S’ AND b.hotel_no = ‘H22’;

Not semantically correct: hotel_no cannot be both H21 in Hotel and H22 in Booking.

18.17 Again, using the Hotel schema given in the Exercises of Chapter 13, draw a relational algebra tree for each of the following queries and use the heuristic rules given in Section 18.3.2 to transform the queries into a more efficient form:

(a) SELECT r.rno, r.type, r.price

FROM room r, booking b, hotel h

WHERE r.room_no = b.room_no AND b.hotel_no = h.hotel_no AND h.hotel_name = ‘Grosvenor Hotel’ AND r.price > 100;

(b) SELECT g.guest_no, g.name

FROM room r, hotel h, booking b, guest g

WHERE h.hotel_no = b.hotel_no AND g.guest_no = b.guest_no AND

h.hotel_no = r.hotel_no AND h.hotel_name = ‘Grosvenor Hotel’ AND date_from >= ‘1-Jan-98’ AND date_to <= ‘31-Dec-98’;

Discuss each step and state any transformation rules used in the process.

See Figures 18.1 and 18.2 overleaf.

18.18 Using the Hotel schema, assume the following:

There is a hash index with no overflow on the primary key attributes, Room_No/Hotel_No in Room.

There is a clustering index on the foreign key attribute Hotel_No in Room.

There is a B+-tree index on the Price attribute in Room.

A secondary index on the attribute Type in Room.

ntuples(Room) = 10000 bfactor(Room) = 200

ntuples(Hotel) = 50 bfactor(Hotel) = 40

ntuples(Booking) = 100000 bfactor(Booking) = 60

ndistincthotel_no(Room) = 50 ndistincttype(Room) = 10 ndistinctprice(Room) = 500

minprice(Room) = 200 maxprice(Room) = 50 nlevelshotel_no(I) = 2

nlevelstype(I) = 2

nlevelsprice(I) = 2 nlfblocksprice(I) = 50

(a) Calculate the cardinality and minimum cost for each of the following selection operations:

S1: σroom_no=1 ∧ hotel_no=1(Room)

Use Hash: cost = 1; Linear search cost is: [10000/200]/2 = 25

S2: σtype='D'(Room)

Use equality condition on clustering index: SCtype(Room) = [10000/10] = 1000

Cost: 2+ [1000/200] = 5 Linear search cost is: 50

S3: σhotel_no=2(Room)

Use equality on clustering index: SChotel_no(Room) = [10000/50] = 200

Linear search cost is: 50

S4: σprice>100(Room)

Use inequality on a secondary B+-tree: Cost: 2 + [50/2 + 10000/2] = 5027 Linear search cost is: 50

S5: σ_{type='S' ∧ hotel_no=3}(Room)

If use secondary index searches on each of the two components, costs are 5 and 3, respectively (from above). Optimizer would then choose search via clustering index on hotel_no, and check the remaining predicate in memory.

Linear search cost is: 50

S6: σtype='S' ∨ price < 100(Room)

Use the secondary indexes on Type and Price, then take the union of the two sets. Linear search cost is: 50

(b) Calculate the cardinality and minimum cost for each of the following join operations: J1: Hotel hotel_no Room

Assume nbuffer = 100

Block Nested Loop 102, buffer has only 1 block for Hotel and Room. 52, all of Hotel fits into buffer

Indexed Nested Loop 152, using primary key index

Sort-Merge 302 unsorted

52 sorted

Hash 156 if hash index fits in memory

J2: Hotel hotel_no Booking

Block Nested Loop 33336, buffer has only 1 block for Hotel and Booking.

16669, all of Hotel fits into buffer Indexed Nested Loop 152, using primary key index

Sort-Merge 250007 unsorted

16669 sorted

Hash 50007 if hash index fits in memory

J3: Room room_no Booking

Block Nested Loop 833400, buffer has only 1 block for Room and Booking.

16717, all of Room fits into buffer Indexed Nested Loop 30050, using clustering index

Sort-Merge 250305 unsorted

16717 sorted

Hash 50151 if hash index fits in memory

J4: Room hotel_no Hotel

Block Nested Loop 150, buffer has only 1 block for Room and Hotel. 52, all of Room fits into buffer

Indexed Nested Loop 30050, using clustering index

Sort-Merge 302 unsorted

52 sorted

Hash 156 if hash index fits in memory

J5: Booking hotel_no Hotel

Block Nested Loop 50001, buffer has only 1 block for Hotel and Booking.

17008, if (nbuffer-2) blocks for Booking 16669, if all of Booking fits into buffer Indexed Nested Loop 1666716667, using linear search

Sort-Merge 250007 unsorted

16669 sorted

Hash 50007 if hash index fits in memory

Block Nested Loop 850017, buffer has only 1 block for Booking and Room.

25171, if (nbuffer-2) blocks for Booking 16717, all of Booking fits into buffer Indexed Nested Loop 1666716667, using clustering index

Sort-Merge 250305 unsorted

16717 sorted

Hash 50151 if hash index fits in memory

(c) Calculate the cardinality and minimum cost for each of the following projection operations:

P1: Πhotel_no(Hotel)

Duplicate elimination using sorting: 4

Duplicate elimination using hashing: 4 (cardinality of result stays same because Hotel_No is key attribute)

P2: Πhotel_no(Room)

Duplicate elimination using sorting: 350

Duplicate elimination using hashing: 51 (cardinality of result estimated as SChotel_no(Room) = 200, occupying 1 block)

P3: Πprice(Room)

Duplicate elimination using sorting: 350

Duplicate elimination using hashing: 51 (cardinality of result estimated as SCprice(Room) = 20, occupying 1 block)

P4: Πtype(Room)

Duplicate elimination using sorting: 350

Duplicate elimination using hashing: 55 (cardinality of result estimated as SChotel_no(Room) = 1000, occupying 5 blocks)

P5: Πhotel_no, price(Room)

Duplicate elimination using sorting: 350

Duplicate elimination using hashing: 51 (cannot be any more than cost of P2, P3)

18.19 Modify the block nested loop join and the indexed nested loop join algorithms presented in Section 18.4.3 to read (nbuffer - 2) blocks of the outer relation R at a time, rather than one block at a time.

Hint: Add an extra outer loop to each algorithm to read (nbuffer-2) blocks of R in each time rather than process each block of R individually.

Πr.room_no, r.type, r.price

B R

σr.room_no=b.room_no ∧ b.hotel_no=h.hotel_no ∧

h.hotel_name=’Grosvenor Hotel’ ∧ r.price >

× ×

Πr.room_no, r.type, r.price

× σb.hotel_no=h.hotel_no B R × H σr.price>100

σr.room_no=b.room σh.hotel_name='Grosvenor Hotel'

Πr.room_no, r.type, r.price

B R H σr.price>100 σ h.hotel_name='Grosvenor Hotel' b.hotel_no=h.hotel_no r.room_no=b.room_no

Πr.room_no, r.type, r.price

R H

σr.price>100 σh.hotel_name='Grosvenor Hotel'

b.hotel_no=h.hotel_no

r.room_no=b.room

Πb.room_no,b.hotel_no

Πg.guest_no,g.name

B ×

σ b.hotel_no=h.hotel_no ∧ r.room_no=b.room_no ∧ h.hotel_no=r.hotel_no ∧

h.hotel_name=’Grosvenor Hotel’ ∧ date_from>= '1-Jan-98' ∧ date_to <='31-Dec-98'

× × G Πg.guest_no,g.name × σg.guest_no=b.guest_no B × G σh.hotel_no=r.hotel_no σh.hotel_no=b.hotel_no σh.hotel_name='Grosvenor Hotel' Πg.guest_no,g.name B H B σ h.hotel_name='Grosvenor Hotel' g.guest_no=b.gues_no h.hotel_no=b.hotel_no H R × H R

σ date_from>= '1-Jan-98' ∧ date_to <='31-Dec-98'

h.hotel_no=r.hotel_no

Πg.guest_no,g.name R B G σh.hotel_name='Grosvenor Hotel' g.guest_no=b.guest_no hotel_no=r.hotel_no Πr.hotel_no Πg.guest_no,g.name h.hotel_no=b.hotel_no

σ date_from>= '1-Jan-98' ∧ date_to <='31-Dec-98'

Πb.hotel_no

Πh.hotel_no

Part Five Current Trends

Chapter 19 Distributed DBMSs - Concepts and Design

Review Questions

19.1 Explain what is meant by a DDBMS and discuss the motivation in providing such a system.

See Section 19.1.1; motivation given at start of Section 19.1.

19.2 Compare and contrast a DDBMS with distributed processing. Under what circumstances would

you choose a DDBMS over distributed processing?

Distributed processing defined at end of Section 19.1.1. Would choose a DDBMS, for example, if each site needed control over its own data, sites had their own existing DBMSs, communication costs would be significantly reduced, and so on.

19.3 Compare and contrast a DDBMS with a parallel DBMS. Under what circumstances would you

choose a DDBMS over a parallel DBMS?

Parallel DBMS defined at end of Section 19.1.1.

19.4 Discuss the advantages and disadvantages of a DDBMS.

See Section 19.1.2.

19.5 One problem area with DDBMSs is that of distributed database design. Discuss the issues that

have to be addressed with distributed database design. Discuss how these issues apply to the global system catalog.

The question that is being addressed is how the database and the applications that run against it should be placed across the sites. Two basic alternatives: partitioned or replicated. In partitioned scheme database is divided into a number of disjoint partitions each of which is placed at a different site. Replicated designs can be fully or partially replicated.

Two fundamental design issues are fragmentation and distribution. Mostly involves mathematical programming to minimize combined cost of storing the database, processing transactions against it, and communication. Problem is NP-hard; therefore proposed solutions are based on heuristics. The global system catalog (GSC) is only relevant if we talk about a distributed DBMS or multi- DBMS that uses a global conceptual schema. Problems are similar to above. Briefly, a GSC may be either global to entire database or local; it may be maintained centrally at one site, or in a distributed fashion over a number of sites; finally, replication - there may be a single copy of the directory or multiple copies. These three dimensions are orthogonal to one another.

19.6 What are the strategic objectives for the definition and allocation of fragments?

See start of Section 19.4.

19.7 Define and contrast alternative schemes for fragmenting a global relation. State how you would

check for correctness to ensure that the database does not undergo semantic change during fragmentation.

Alternative schemes are: primary horizontal, vertical, mixed, and derived horizontal fragmentation (see Section 19.4).

Correctness rules are: completeness, reconstruction, and disjointness (see Section 19.4 again).

19.8 What layers of transparency should be provided with a DDBMS? Give justification for your

answer.

19.9 A DDBMS must ensure that no two sites create a database object with the same name. One solution to this problem is to create a central name server. What are the disadvantages with this approach? Propose an alternative approach that overcomes these disadvantages.

See Section 19.5.1 - Naming Transparency.

Problems with the central name server, which has the responsibility for ensuring uniqueness of all names in the system, are:

• loss of some local autonomy

• performance problems, if the central site becomes a bottleneck

• low availability, if the central site fails, the remaining sites cannot create any new database objects.

An alternative solution, is to prefix an object with the identifier of the site that created it. For example, a relation BRANCH created at site S1 might be named S1.BRANCH. Similarly, we would need to be able to identify each fragment and each of its copies. Thus, copy 2 of fragment 3 of the branch relation created at site S1 might be referred to as S1.BRANCH.F3.C2. However, this results in loss of distribution transparency.

An approach which resolves the problems with both these solutions uses aliases for each database object. Thus, S1.BRANCH.F3.C2 might be known as local_branch by the user at site S1. The DDBMS has the task of mapping aliases to the appropriate database object.

Exercises

A multinational engineering company has decided to distribute its project management information at the regional level in mainland Britain. The current centralised relational schema is as follows:-

Employee (NIN, First_Name, Last_Name, Address, Birth_Date, Sex, Salary, Tax_Code,

Dept_No)

Department (Dept_No, Dept_Name, Manager_NIN, Business_Area_No, Region_No)

Projects (Proj_No, Proj_Name, Contract_Price, Project_Manager_NIN, Dept_No)

Works_On (NIN, Proj_No, Hours_Worked)

Business (Business_Area_no, Business_Area_Name)

Region (Region_No, Region_Name)

where Employee contains employee details and the national insurance number NIN is the key.

Department contains department details and Dept_No is the key. Manager_NIN identifies

the employee who is the manager of the department. There is only one manager for each department.

Projects contains details of the projects in the company and the key is Proj_No. The

project manager is identified by the Project_Manager_NIN, and the department responsible for the project by Dept_No.

Works_on contains details of the hours worked by employees on each project and (NIN,

Proj_No) forms the key.

Business contains names of the business areas and the key is Business_Area_No.

and Region contains names of the regions and the key is Region_No.

Departments are grouped regionally as follows:

Region 1: Scotland; Region 2: Wales; Region 3: England

Information is required by business area which covers: Software Engineering, Mechanical Engineering and Electrical Engineering. There is no Software Engineering in Wales and all Electrical Engineering departments are in England. Projects are staffed by local department offices.

As well as distributing the data regionally, there is an additional requirement to access the employee data either by personal information (by Personnel) or by work related information (by Payroll).

19.10 Draw an Entity-Relationship (ER) diagram to represent this system.

For simplicity use crow’s foot notation:

19.11 Produce a distributed database design for the above system and include:

(a) a suitable fragmentation schema for the system;

(b) in the case of primary horizontal fragmentation, a minimal set of predicates;

(c) the reconstruction of global relations from fragments.

State any assumptions necessary to support your design.

Possible solution as follows:

Don't fragment Business/Region - replicate relations at all sites - only contain a small number of records.

Department

Use primary horizontal fragmentation for Department with minterm predicates : D1 Region = 'Scotland' and Business_area = 'SE'

D2 Region = 'Scotland' and Business_area = 'ME' D3 Region = 'Wales' and Business_area = 'ME' D4 Region = 'England' and Business_area = 'SE' D5 Region = 'England' and Business_area = 'ME' D6 Region = 'England' and Business_area = 'EL' Reconstruction: D1 ∪ D2 ∪ D3 ∪ D4 ∪ D5 ∪ D6

Employee

Use vertical fragmentation for Employee:

E1: Πnin,first_name,last_name,address,birth_date,sex,dept_noEmployee E2: Πnin,salary,tax_codeEmployee

Then used derived fragmentation on fragment E1: Eii: E1 dept_no Di 1 ≤ i ≤ 6

Reconstruction: (E11 ∪ E12 ∪ E13 ∪ E14 ∪ E15 ∪ E16 ) nin E2

Projects

Use derived fragmentation for Projects:

Pi: Projects dept_no Di 1 ≤ i ≤ 6 Employee Works_On Project Business Department Region

Reconstruction: (P1 ∪ P2 ∪ P3 ∪ P4 ∪ P5 ∪ P6 )

Works_on

Use derived fragmentation for Works_on:

Wi: Works_on ninE1i 1 ≤ i ≤ 6

19.12 Repeat Exercise 19.11 for the DreamHome case study presented in Section 1.7.

Possible solution as follows:

Don't fragment Branch - replicate relations at all sites - only contain a small number of records. Property_for_Rent

Use primary horizontal fragmentation for Property_for_Rent with minterm predicates (for example):

P1j asking_price ≤ 39999 AND bno = j

P2j 40000 ≤ asking_price ≤ 69999 AND bno = j P3j asking_price ≥ 70000 AND bno = j

1 ≤ j ≤ maximum number of branches Reconstruction: ∪ (P1i∪ P2i∪ P3i)

i=1

Staff

Assume salaries paid by head office (branch 1 say), so use vertical fragmentation first: S1: Πsno, fname, lname, address, tel_no, bnoStaff

S2: Πsno, position, sex, dob, salary, ninStaff

Then use horizontal fragmentation on fragment S1: S1i: σbno= i S1 1 ≤ i ≤ j

Reconstruction: (S11 ∪ S12 ∪ S13 … ∪ S1j ) nin S2

Renter

Use horizontal fragmentation:

Ri: σbno= i Renter 1 ≤ i ≤ j Reconstruction: (R1 ∪ R2 ∪ R3 … ∪ Rj )

Viewing, Owner

Use derived fragmentation for Viewing and Owner:

Vik: Viewing SJ Pik 1 ≤ i ≤ 3, 1 ≤ k ≤ j Oik: Owner SJ Pik 1 ≤ i ≤ 3, 1 ≤ k ≤ j Reconstruction: as for Property_for_Rent

19.13 In Section 19.5.1 when discussing naming transparency, we proposed the use of aliases to uniquely identify each replica of each fragment. Provide an outline design for the implementation of this approach to naming transparency.

FUNCTION map(name) {

66 THEN

result = name of replica of name; IF name appears in the fragment table THEN {

result = expression to construct fragment; FOR each iname IN result {

replace iname in result with map(iname); }

} RETURN result; }

IF name appears in the alias table THEN

expression = map(name); ELSE

In document Modern Database Management Solution Book (Page 54-67)