Benefits of Normalisation in a Data Base - Part 1

(1)

Denormalisation

(But not hacking it)

“Denormalisation: Why, What, and How?” Rodgers

Oracle Performance Tuning Corrigan/Gurry

(2)

2

Overview

• Purpose of normalisation

• Methods of improving database performance

• Denormalisation

– Definition

– Part of database design

– What can be denormalised

• Examples

• Applying Denormalisation Safely

(3)

3

Purpose of Normalisation

• Two views

– Design process

• Steps 1NF, 2NF, 3NF, BCNF • Complex to apply • This is how we teach it

– Analysis process

• Understanding dependencies • Correctness check

– Levels of correctness

• e.g. 1NF, 2NF, 3NF, BCNF

• Normalisation is designed to limit the amount of redundant data stored in a database.

• Normalisation is based on identifying the functional dependencies in the database and using the dependencies to remove potential redundancy. • Redundant data introduces potential update anomalies when information in

the database is changed.

• Normalisation also helps to clarify the designer’s understanding of the data. • Therefore, an unnormalised set of data will potentially contain redundant

data, encourage anomalies and make integrity checking difficult. • It is never acceptable to design a database that is unnormalised.

(4)

4

Why is normalisation good?

• Better analysis

– Understand the dependencies

– Clarify integrity constraints

– Scientific, mathematical, etc.

• Provides future flexibility

– Reduce data redundancy

– Avoid anomalies during updates, deletes and inserts

– Robust designs

(5)

5

Problems with Normalisation

• Can lead to many relations

– Each relation is atomic

– Requires many join queries

– Affects performance

• Unnecessary relations

– eg

area( area_code, area_description )

(6)

6

What to do when the database is slow?

• Better design

• Index the database

• Use clustering, partitioning, etc.

• Accept poorer performance

If all else fails, denormalise.

• When a normalised set of relations is created in a DBMS it is important to ensure that the resulting database answers queries efficiently.

• A relational DBMS provides a number of options for improving the efficiency of a slow database:

1. Improve the design by looking for mistakes and errors in the current design. If the design does not correspond to the actual data usage the database will be slow and inefficient.

2. Create indexes on the most frequently queried attributes. Ensure that existing indexes are correct.

3. Use clustering to improve the performance of joins between relations. Clustering is good because it is hidden from the user. 4. Accept the poorer performance. If poor performance does not

interfere with the user’s requirements then accept the current performance.

• If the performance of the database is still poor, then it may be necessary to

(7)

7

Definition

“Denormalisation is the design process of taking

normalised data and producing a physical design

in which normalised data is rearranged so that

optimal access and manipulation of data can be

achieved.”

Inmon

Also called “Consolidation”

• A set of relations may be:

• Unnormalised When no systematic analysis of the data has been carried out and there may be hidden redundancy in the data.

• Normalised When a systematic analysis of the data has been carried out and the set of data has been correctly normalised to remove data redundancy.

• Denormalised When a normalised set of data has been

systematically analysed and known redundancy has been introduced into the database to improve the database’s efficiency.

• Poor performance is normally a result of joining many relations or performing complex calculations.

• Poor performance affects all types of database.

• The problems that occur in relational databases have been more extensively studied than the problems that occur in other databases. • Relational DBMSs are optimised to performed three-way joins. When more

complex joins are required, the database structure may not be efficient. • The normalisation and denormalisation processes provide a systematic

(8)

• The denormalisation process is performed during the physical design stage. • Therefore, denormalisation can only be performed after the data has been

normalised.

• Rodgers describes the main purpose of denormalisation as being “to reduce the number of tables that need to be joined for specific access needs”. • It is important to understand which entities are accessed by the application

programs and how these entities relate to other entities.

• Rodgers suggests using entity-relationship diagrams, data flow diagrams and function/entity cross-reference matrices to identify database usage.

8

The Database Design Process

Conceptual Model Logical Model Physical Model Entity-Relationship Model Relational Model Denormalisation is part of the physical database design.

(9)

9

What can be denormalised?

• One-to-one relationships

• Many-to-many relationships

• Splitting tables

• Report tables

• Reference data

• Low-level detail data

• Derived data

(10)

10

One-to-One Relationships

Lift Contract

has a for a

Merge two entities related by a one-to-one entity

Lift Contract

Data about both the lift and contract are stored in Lift Contract.

• When two entities are related by a one-to-one relationship, there is always a one-to-one correspondence between the entities.

• It is possible to merge the entities into a single entity and implement them as a single relation in the database.

• Implementing the entities as a single relation avoids a join between two relations.

• However, if both the entities are not created and deleted at the same time, then null values will have to be used to represent the missing entity values.

(11)

11

Example

LiftNo LiftLocation MainDate Checked 100 Poole House 01/03/2001 SMcK 200 Dorset House 01/04/2001 JC 300 Studland House 01/05/2001 JC

ContractNo LiftNo SignDate Auth 123 100 01/03/2000 PB 345 200 01/04/2000 OD 567 300 01/05/2000 PB

LiftNo LiftLocation MainDate Checked ContractNoLiftNo SignDate Auth 100 Poole House 01/03/2001 SMcK 123 100 01/03/2000 PB 200 Dorset House 01/04/2001 JC 345 200 01/04/2000 OD 300 Studland House 01/05/2001 JC 567 300 01/05/2000 PB

Lift Contract

(12)

12

Many-to-Many Relationships

Employee Project Employee Project Works on Employee Project

Many-to-Many Resolved Many-to-Many

Storing as a one-to-many relationship

Project must be duplicated

• In the relational data model many-to-many relationships are replaced with a new entity and two one-to-many relationships.

• This means that all queries which involve many-to-many relationships require a join between three relations.

• It is possible to store a many-to-many relationship in two relations if one of the relations contains duplicate data.

• In the example above, the many-to-many relationship between the employee and project entities is stored as a one-to-many relationship.

• This is achieved by duplicating the project data. For example, employee ‘Smith’ will work on project P1 (first copy) and employee ‘Jones ‘ will work on project P1 (second copy).

• Update anomalies will occur when it is possible for the entities to exist without taking part in the relationship. For example, in the example above, the project must contain a foreign key to the employee entity which will be null if projects can exist without employees working on them.

(13)

13

Example

EmpNo Ename Dept Sal 100 Stephen DEC RW 200 Jim DEC RW 300 Peter BS KL

Employee Project

Pcode Pname Budget

A DBS1 10 B DBS2 20 C ADB 15 D Prog 13 E SSADM 12 Empno Pcode 100 A 100 C 200 B 200 D 200 E 200 C EmployeeOnProject

EmpNo Pcode Pname Budget

100 A DBS1 10 200 B DBS2 20 100 C ADB 15 200 C ADB 15 200 D Prog 13 200 E SSADM 12 EmpProject

(14)

14

Splitting Tables

Purchases Purchases2000-2003 Purchases2004 Horizontal Split Purchases Purchases Product Purchases Customer Vertical Split

(15)

15

Reporting Tables

Reporting tools cannot manipulate complex relational models.

Therefore, simplify the structure by combining tables into simple tables.

(16)

16

Summary Tables

Purchases by product, brand, shop and area

Purchases by shop and area

(17)

17

Reference Data

• Reference data consists of descriptions and

codes.

– e.g. Telephone numbers and people.

• Reference data is stored in lookup tables.

• Store descriptions in the entities they describe.

– Justified because codes are artificial shorthands for

descriptions.

595015, Computer Centre 595205, S Mc Kearney

• A lot of information in a database is stored using codes that correspond to descriptions.

• To access the code corresponding to a description it is necessary to perform a join between an entity and the table of codes.

• The join can be avoided if the descriptions are actually stored in the entity. • This process replaces transitive dependencies that were removed to

(18)

18

Example

Pcode Pname Budget

A DBS1 10

B DBS2 20

C ADB 15

D Prog 13

E SSADM 12

Empno Pcode Pname

100 A DBS1 100 C ADB 200 B DBS2 200 D Prog 200 E SSADM 200 C ADB EmployeeOnProject Project Empno Pcode 100 A 100 C 200 B 200 D 200 E 200 C EmployeeOnProject

(19)

19

Detail Data

Project Job Time-Booking time-booking(projno,jobno,month,booking)

One row per booking

job(projno,jobno,m1,m2,m3,…,m12)

Merge job and time-booking

• In the example above, the time-booking entity contains one row for every month booked on every job in every project.

• The data stored in the attributes projno, jobno and month will be duplicated often. For example:

10, 100, Jan, 345 10, 100, Feb, 232 10, 100, Mar, 342 10, 101, Jan, 876 …

• This version of time-booking is easy to process with SQL but requires a lot of storage space.

• When the job and the time-booking entities are merged each project and job has one tuple containing all the bookings for the year.

10, 100, 345, 232, 342, … 10, 101, 876, …

• Merging job and time-booking saves space but is difficult to process with SQL. For example, aggregate functions (MAX, AVERAGE) cannot be used. • The merged relation also restricts the number of booking months to 12.

(20)

20

Derived Data

• Derived data is normally not stored in the

database.

– It is calculated from the contents of the database.

• Calculations can take a long time.

• Therefore, store derived data in the database.

• Derived data that is complex to calculate can be pre-calculated and stored in the database.

• For example, total sales figures may be stored in the database rather than requiring them to be calculated from the individual sales figures.

• This approach is widely used in information warehouses that store large numbers of pre-calculated summary tables. The summary tables are normally updated as a batch process.

• The main problem with storing derived data is that it must be re-calculated when the underlying data is changed.

(21)

21

Applying Denormalisation Safely

• DBMS features

– Integrity checks

– Triggers

• Automatic duplicated updates

– Views

• Hide the denormalisation relations • Maintain the normalised model

– Materialised views

• Pre-executed queries – e.g. summaries, joins, etc

(22)

22

Materialised Views

• Snapshots of a query

– Unlike views, materialized views are executed and

the results stored in the database

• Example (simplified!)

create materialized view sum_sales as select product_no, sum(qty) sum_qty from sales group by product_no; product_no sum_qty … … … … … … … …

(23)

23

Materialised Views

• Query Rewriting

– Optimiser can identify queries that can be answered

more efficiently using the MV

– Example

• Query

select sum(qty) from sales • Rewritten and executed as

select sum(sum_qty) from sum_sales • MV sum_sales is a much smaller table

– Queried more quickly

(24)

24

Materialised Views

• Advantages

– Pre-calculated queries much faster

• Similar to clustering and partitioning

– Unlike clustering, can have many MVs

– Oracle can rewrite queries to use a MV

automatically

• Users query a normalised data model (logical model) • Optimiser uses denormalised model (physical model)

– MVs are updated automatically

(25)

25

Materialised Views

• Disadvantages

– Complex

• Difficult to set up correctly

– Performance penalty

• Updating can be difficult and slow

– Not all DBMSs support MVs

• Oracle • SQL Server