Denormalisation
(But not hacking it)
“Denormalisation: Why, What, and How?” Rodgers
Oracle Performance Tuning Corrigan/Gurry
2
Overview
• Purpose of normalisation
• Methods of improving database performance
• Denormalisation
– Definition
– Part of database design
– What can be denormalised
• Examples
• Applying Denormalisation Safely
3
Purpose of Normalisation
• Two views
– Design process
• Steps 1NF, 2NF, 3NF, BCNF • Complex to apply • This is how we teach it– Analysis process
• Understanding dependencies • Correctness check
– Levels of correctness
• e.g. 1NF, 2NF, 3NF, BCNF
• Normalisation is designed to limit the amount of redundant data stored in a database.
• Normalisation is based on identifying the functional dependencies in the database and using the dependencies to remove potential redundancy. • Redundant data introduces potential update anomalies when information in
the database is changed.
• Normalisation also helps to clarify the designer’s understanding of the data. • Therefore, an unnormalised set of data will potentially contain redundant
data, encourage anomalies and make integrity checking difficult. • It is never acceptable to design a database that is unnormalised.
4
Why is normalisation good?
• Better analysis
– Understand the dependencies
– Clarify integrity constraints
– Scientific, mathematical, etc.
• Provides future flexibility
– Reduce data redundancy
– Avoid anomalies during updates, deletes and inserts
– Robust designs
5
Problems with Normalisation
• Can lead to many relations
– Each relation is atomic
– Requires many join queries
– Affects performance
• Unnecessary relations
– eg
area( area_code, area_description )
6
What to do when the database is slow?
• Better design
• Index the database
• Use clustering, partitioning, etc.
• Accept poorer performance
If all else fails, denormalise.
• When a normalised set of relations is created in a DBMS it is important to ensure that the resulting database answers queries efficiently.
• A relational DBMS provides a number of options for improving the efficiency of a slow database:
1. Improve the design by looking for mistakes and errors in the current design. If the design does not correspond to the actual data usage the database will be slow and inefficient.
2. Create indexes on the most frequently queried attributes. Ensure that existing indexes are correct.
3. Use clustering to improve the performance of joins between relations. Clustering is good because it is hidden from the user. 4. Accept the poorer performance. If poor performance does not
interfere with the user’s requirements then accept the current performance.
• If the performance of the database is still poor, then it may be necessary to
7
Definition
“Denormalisation is the design process of taking
normalised data and producing a physical design
in which normalised data is rearranged so that
optimal access and manipulation of data can be
achieved.”
Inmon
Also called “Consolidation”
• A set of relations may be:
• Unnormalised When no systematic analysis of the data has been carried out and there may be hidden redundancy in the data.
• Normalised When a systematic analysis of the data has been carried out and the set of data has been correctly normalised to remove data redundancy.
• Denormalised When a normalised set of data has been
systematically analysed and known redundancy has been introduced into the database to improve the database’s efficiency.
• Poor performance is normally a result of joining many relations or performing complex calculations.
• Poor performance affects all types of database.
• The problems that occur in relational databases have been more extensively studied than the problems that occur in other databases. • Relational DBMSs are optimised to performed three-way joins. When more
complex joins are required, the database structure may not be efficient. • The normalisation and denormalisation processes provide a systematic
• The denormalisation process is performed during the physical design stage. • Therefore, denormalisation can only be performed after the data has been
normalised.
• Rodgers describes the main purpose of denormalisation as being “to reduce the number of tables that need to be joined for specific access needs”. • It is important to understand which entities are accessed by the application
programs and how these entities relate to other entities.
• Rodgers suggests using entity-relationship diagrams, data flow diagrams and function/entity cross-reference matrices to identify database usage.
8
The Database Design Process
Conceptual Model Logical Model Physical Model Entity-Relationship Model Relational Model Denormalisation is part of the physical database design.
9
What can be denormalised?
• One-to-one relationships
• Many-to-many relationships
• Splitting tables
• Report tables
• Reference data
• Low-level detail data
• Derived data
10
One-to-One Relationships
Lift Contract
has a for a
Merge two entities related by a one-to-one entity
Lift Contract
Data about both the lift and contract are stored in Lift Contract.
• When two entities are related by a one-to-one relationship, there is always a one-to-one correspondence between the entities.
• It is possible to merge the entities into a single entity and implement them as a single relation in the database.
• Implementing the entities as a single relation avoids a join between two relations.
• However, if both the entities are not created and deleted at the same time, then null values will have to be used to represent the missing entity values.
11
Example
LiftNo LiftLocation MainDate Checked 100 Poole House 01/03/2001 SMcK 200 Dorset House 01/04/2001 JC 300 Studland House 01/05/2001 JC
ContractNo LiftNo SignDate Auth 123 100 01/03/2000 PB 345 200 01/04/2000 OD 567 300 01/05/2000 PB
LiftNo LiftLocation MainDate Checked ContractNoLiftNo SignDate Auth 100 Poole House 01/03/2001 SMcK 123 100 01/03/2000 PB 200 Dorset House 01/04/2001 JC 345 200 01/04/2000 OD 300 Studland House 01/05/2001 JC 567 300 01/05/2000 PB
Lift Contract
12
Many-to-Many Relationships
Employee Project Employee Project Works on Employee ProjectMany-to-Many Resolved Many-to-Many
Storing as a one-to-many relationship
Project must be duplicated
• In the relational data model many-to-many relationships are replaced with a new entity and two one-to-many relationships.
• This means that all queries which involve many-to-many relationships require a join between three relations.
• It is possible to store a many-to-many relationship in two relations if one of the relations contains duplicate data.
• In the example above, the many-to-many relationship between the employee and project entities is stored as a one-to-many relationship.
• This is achieved by duplicating the project data. For example, employee ‘Smith’ will work on project P1 (first copy) and employee ‘Jones ‘ will work on project P1 (second copy).
• Update anomalies will occur when it is possible for the entities to exist without taking part in the relationship. For example, in the example above, the project must contain a foreign key to the employee entity which will be null if projects can exist without employees working on them.
13
Example
EmpNo Ename Dept Sal 100 Stephen DEC RW 200 Jim DEC RW 300 Peter BS KL
Employee Project
Pcode Pname Budget
A DBS1 10 B DBS2 20 C ADB 15 D Prog 13 E SSADM 12 Empno Pcode 100 A 100 C 200 B 200 D 200 E 200 C EmployeeOnProject
EmpNo Pcode Pname Budget
100 A DBS1 10 200 B DBS2 20 100 C ADB 15 200 C ADB 15 200 D Prog 13 200 E SSADM 12 EmpProject
14
Splitting Tables
Purchases Purchases2000-2003 Purchases2004 Horizontal Split Purchases Purchases Product Purchases Customer Vertical Split15
Reporting Tables
Reporting tools cannot manipulate complex relational models.
Therefore, simplify the structure by combining tables into simple tables.
16
Summary Tables
Purchases by product, brand, shop and area
Purchases by shop and area
17
Reference Data
• Reference data consists of descriptions and
codes.
– e.g. Telephone numbers and people.
• Reference data is stored in lookup tables.
• Store descriptions in the entities they describe.
– Justified because codes are artificial shorthands for
descriptions.
595015, Computer Centre 595205, S Mc Kearney
• A lot of information in a database is stored using codes that correspond to descriptions.
• To access the code corresponding to a description it is necessary to perform a join between an entity and the table of codes.
• The join can be avoided if the descriptions are actually stored in the entity. • This process replaces transitive dependencies that were removed to
18
Example
Pcode Pname Budget
A DBS1 10
B DBS2 20
C ADB 15
D Prog 13
E SSADM 12
Empno Pcode Pname
100 A DBS1 100 C ADB 200 B DBS2 200 D Prog 200 E SSADM 200 C ADB EmployeeOnProject Project Empno Pcode 100 A 100 C 200 B 200 D 200 E 200 C EmployeeOnProject
19
Detail Data
Project Job Time-Booking time-booking(projno,jobno,month,booking)One row per booking
job(projno,jobno,m1,m2,m3,…,m12)
Merge job and time-booking
• In the example above, the time-booking entity contains one row for every month booked on every job in every project.
• The data stored in the attributes projno, jobno and month will be duplicated often. For example:
10, 100, Jan, 345 10, 100, Feb, 232 10, 100, Mar, 342 10, 101, Jan, 876 …
• This version of time-booking is easy to process with SQL but requires a lot of storage space.
• When the job and the time-booking entities are merged each project and job has one tuple containing all the bookings for the year.
10, 100, 345, 232, 342, … 10, 101, 876, …
• Merging job and time-booking saves space but is difficult to process with SQL. For example, aggregate functions (MAX, AVERAGE) cannot be used. • The merged relation also restricts the number of booking months to 12.
20
Derived Data
• Derived data is normally not stored in the
database.
– It is calculated from the contents of the database.
• Calculations can take a long time.
• Therefore, store derived data in the database.
• Derived data that is complex to calculate can be pre-calculated and stored in the database.
• For example, total sales figures may be stored in the database rather than requiring them to be calculated from the individual sales figures.
• This approach is widely used in information warehouses that store large numbers of pre-calculated summary tables. The summary tables are normally updated as a batch process.
• The main problem with storing derived data is that it must be re-calculated when the underlying data is changed.
21
Applying Denormalisation Safely
• DBMS features
– Integrity checks
– Triggers
• Automatic duplicated updates
– Views
• Hide the denormalisation relations • Maintain the normalised model
– Materialised views
• Pre-executed queries – e.g. summaries, joins, etc
22
Materialised Views
• Snapshots of a query
– Unlike views, materialized views are executed and
the results stored in the database
• Example (simplified!)
create materialized view sum_sales as select product_no, sum(qty) sum_qty from sales group by product_no; product_no sum_qty … … … … … … … …
23
Materialised Views
• Query Rewriting
– Optimiser can identify queries that can be answered
more efficiently using the MV
– Example
• Queryselect sum(qty) from sales • Rewritten and executed as
select sum(sum_qty) from sum_sales • MV sum_sales is a much smaller table
– Queried more quickly
24
Materialised Views
• Advantages
– Pre-calculated queries much faster
• Similar to clustering and partitioning– Unlike clustering, can have many MVs
– Oracle can rewrite queries to use a MV
automatically
• Users query a normalised data model (logical model) • Optimiser uses denormalised model (physical model)
– MVs are updated automatically
25
Materialised Views
• Disadvantages
– Complex
• Difficult to set up correctly
– Performance penalty
• Updating can be difficult and slow
– Not all DBMSs support MVs
• Oracle • SQL Server