14.1.1 What is This?
This chapter discusses how to check the quality of a data model. It builds through a series of structured steps.
These steps reflect the theory that underpins relational data model. It concludes with a checklist for assessment of an overall quality.
14.1.2 Why is it Important?
A data model often plays a fundamental part in the clarification of some key activities, such as the sources of data or the design of a database.
This makes it very valuable to be able to make an assessment of the quality of a data model.
14.1.3 What Will I Learn?
You will learn a series of steps that follow a structured path to the formal assessment of whether a data model is fit for purpose.
14.2 Create a Top-Level Business Data Model
14.2.1 Types of Data Models
All the data models that we will be discussing can be described as Entity-
Relationship Diagrams, or ‘ERDs’. They all show relationships between entities or tables.
At the conceptual level, the ‘things of interest,’ such as ‘customers,’ are called
entities and at the logical or physical level they are called tables, because they often appear as tables in databases.
At the physical level, tables are given names in the plural, such as Customers, whereas at the conceptual level they often appear in the singular, that is Customer. At the logical level they might be either singular or plural.
A top-level business data model can be created using Microsoft Word and is intended for business users and a non-technical audience.
The other models referred to in this document will always be created by a data modeling tool such as ERWin or IBM’s Rational Rose.
They could be described as conceptual, logical or physical models.
Conceptual models show the ‘things of interest’ that are in scope, for example, customers and materiel. They may or may not include keys and will certainly not include physical data types, such as the length of character strings.
Logical models will include primary and foreign keys and often the modeling tool will provide a facility to generate a physical model from a logical one.
Physical models are often close to the actual design of an operational database. They will always show data types and field lengths.
14.2.2 Example of a Simple Business Data Model
This model was created in Word and shows customers, orders and products. The flow of logic in a data model should go from top-left to bottom-right. This means that the more fundamental things are on the top and to the left.
This diagram is a good example:
This version shows that customers and products each have a hierarchy so that a customer is part of a higher customer.
Similarly, a product can be part of a more complex product.
Which of these two you choose to use will depend on the audience. In general, it is better to choose the simple option.
14.3 Draft the Business Rules
Business rules are valuable because they define in plain English with business terminology the underlying relationships between the terms that appear in a data model.
The user ‘commCustomery’ will then be able to agree and sign off the rules. Here is a small example:
14.4 Draft a Glossary of Terms
It is very important to establish agreed definitions of terms and words in common use.
This is a small example:
14.5 Check that the Data Model is Correct
There may be errors that have a simple explanation. For example, the incorrect use of the modeling tool. Any errors should be discussed and resolved with the modeler and the users.
This is where the glossary and business rules are very valuable.
14.6 Review with Users
At this point, review the business rules and the glossary with users and aim to get sign off.
Make any necessary changes to format and contents.
14.7 Check Normalized Design
14.7.1 Normalized Design
This discussion applies to Entity-Relationship Diagrams (ERDs) and not to data warehouses.
We will start by defining the rules for normalization so that we can recognize cases where they have been broken.
A little background is appropriate at this point.
The theory that provides the foundation for data models and ERDs was developed in 1970 by an Englishman called Ted Codd, who was a research scientist with IBM in California at the time.
Rule 1:
One of Codd’s rules can be summarized as:
“The data in a table must belong to the key, the whole key and nothing but the key, so help me Codd”.
This means, for example, that a record in a Customers Table must contain data only about
the customer, and nothing about people in the customer, or activities of the customer.
It might include things like the name of the customer and when the customer was founded.
Check 1: Can the values of every data item in a table be derived only from the primary key?
Rule 2:
Another of Codd’s rules stated that derived data must not be included.
For example, the headcount for a customer would not be included in the Customers Table because it can be derived by counting the records of members in the
customer.
Check 2: Can any data item be derived from other items?
Rule 3:
There must be no repeating groups in a table.
The one uncomfortable exception is addresses. They are very often stored as a number of repeated lines called ‘Address_Line_1,’ ‘Address_Line_2,’ and so on.
Check 3: Do any column names repeat in the same table?
An item of data must only be in one table.
For example, the name of a customer would appear only in the Customers Table.
Check 4: Does the same item of data in appear in more than one table?
14.8 Reference Data
14.8.1 Background
A list should be made of the reference data referred to in a data model.
When the list is complete it should be analyzed for consistency. For example, there will not usually be any relationships between the reference data. However, if there are any, then they should be sensible and consistent. For example, a town might be in a county which would be in a country. These could all be classified a reference data that has relationships that should be validated.
Typical reference data could include ranks, and types of materiel or equipment. In passing, we should note that customers, ranks and materiel are all examples of hierarchical structures. Ranks will change only very, very rarely. However, when they are stored in a table that is joined to itself then the table will have a recursive relationship to itself. Therefore, wherever these occur, we would expect to find compact data models that include a great deal with compact and powerful structures.
14.8.2 Standards
Any appropriate national, international standards must be considered when values for reference data are decided. These include MOD, NATO and ISO standards. For example, NATO maintains standards for product classification and this is already in use within the MOD. Therefore any data model relating to products should consider this standard and where appropriate the necessary tables should be added to the model.
14.9 Slowly Changing Data
The classic example of reference data that never changes is a calendar. The values are predictable for hundreds of years ahead. There is a category in between that is usually called slowly changing data. This applies where the values of the data changes on roughly a six-monthly basis.
Data about categories and types is often fixed values but some can change infrequently.
For example, a new aircraft type was introduced with unmanned aircraft. The
values then became fixed-wing, rotary and unmanned. This would be an example of slowly changing data.
This highlights the fact that what constitutes reference data can be subjective and may be defined differently in data models created by different people or
14.11 Look for Design Patterns
14.11.1 Some Examples
This data model shows examples of design patterns for one-to-many and many-to- many relationships, reflexive associations and reference data.
PK stands for ‘Primary Key’ and FK stands for ‘Foreign Key’.
PF, which is shown in the Products_in_an_Order Table, stands for Primary and
Foreign key. This is a primary key in one table that is also a link to another table, where it is also a primary key.
14.11.2 Inheritance
More details are provided in Chapter 3. Concepts in the document entitled “How to Understand a Data Model”.
We use the concept of inheritance where have super-types and sub-types.
Inheritance in data modeling is just the same as the general meaning of the word. It means that at a high level, we identify the general name of the ‘thing of interest’ and the characteristics that all of these things share. For example, an aircraft will
have a name for the type of aircraft, such as Tornado and it will be of a certain type, such as fixed-wing or rotary.
At the lower level of fixed-wing aircraft, an aircraft will have a minimum length for the runway that the aircraft needs in order to take off.
This situation is shown in the following diagram:
14.11.3 One-to-One Relationships
We can remind ourselves that Rule 1 above states:
“The data in a table must belong to the key, the whole key and nothing but the key, so help me Codd”.
One implication is that there should not be a one-to-one relationship between two tables in a model because the data can be combined into one table with the same primary key. However, there is an exception to this which is when a one-off event can occur which involves a substantial amount of data. In that case, it would not be
good to create a large number of fields which will be blank in the large majority of cases.
For example, when a soldier joins the army there might be data that is involved only with the joining details. The basic data for the soldier will be part of his or her basic records – such as date of birth and place of birth. If a separate table exists for ‘Joining Details’ then it would contain such things as date and place of joining. Then the Soldiers Table would have a one-to-one relationship with the Joining Details Table.
In other words, it can sometimes be acceptable to see a one-to-one in a data model. If that happens, it is necessary to establish the associated business rules to clarify the conditions.
14.12 Review Data Warehouses Designs
This section is relevant if the data model includes a data warehouse or data mart. A data warehouse can be a star or a snowflake design.
This diagram shows a typical data warehouse. It is a star structure with only one dimension for the related dimension tables. The arrows point from children to parents. This is a simple data warehouse for customers, orders and products.
14.13 Check Naming Standards
At this step, we check for compliance with naming standards. For example, a typical standard might state that field names should be specified with underscores linking
related words and first letters in capitals, such as Customer_ID. In the absence of any explicit standard, this should be the default.
This is shown in the model in Section 9.1 and also in this one.
We might say that naming standards are nice to have. In other words, they are not essential but they reflect best practice.
14.14 Check for Consistent Data Types
There are two reasons why it is important to check for consistent data types and lengths:
It avoids nasty surprises when a physical database is generated from the data model.
It is an indication of the professionalism of the manner in which the data model was produced, unless it has been reverse engineered from a database, in which case these design considerations do not apply.
For example, names should always be the same, or should be handled in a way that handles any differences in a way that ensures consistency.
Typically, a longer name should be explicitly truncated to a shorter value where appropriate.
In the absence of any explicit standard, the default for names or address lines should be VARCHAR(255) or VARCHAR2(255) for Oracle.
For other character strings they should default to Memo or Text.
14.15 Check for Defaults
We would like to see Default Values used wherever possible because they increase the discipline enforced by the model and they indicate that a thorough analysis was carried out during the creation of the data model.
For example, a ‘Start Date’ could default to the current day, or the ‘System Date’.
14.16 Determine the Assurance Level The assurance level could be:
Acceptable
Acceptable with reservations Not acceptable
14.17. Checklist for Quality Assurance
This Checklist extends the basic concept of the data model scorecard that was originated by Steve Hoberman.
If the answer to all the essential features is ‘Yes’ then the model is acceptable. If any of the essential questions have a ‘No’ answer, the model is not acceptable. Any ‘No’ answers to ‘Desirable’ or ‘Not critical’ questions do not affect the
14.17.1 Typical Summary
The results of a typical model might result in this summary:
“Reservations are that the documentation does not demonstrate that the data model meets the user requirements. The data model shows some weaknesses that the supplier has agreed to address.”
14.17.2 Follow-Up Remedial Action
A reasonable result of a QA analysis would be the identification of some problems that could be rectified fairly easily and quickly. This applies to things like
documentation and naming standards. The appropriate remedial action will depend on the context and scope of the data model.
14.17.3 For a Health Check:
No action is required beyond the presentation of a report because the QA is simply to establish the ‘as-is’ situation.
14.17.4 For a Proposed Application:
It is essential that the model accurately meets the user requirements. If it does not, then it must be corrected in discussion with the users and the modeler.
14.17.5 For Data Migration:
It is essential that the model is correct at the detailed level of tables, fields and data types.
14.18 What Have We Learned?
In this chapter we have learned a structured approach to checking the quality of a data model produced by somebody else.
We have learned to determine whether it is fit for purpose.
Our approach is based on the sound theoretical foundation that is a very strong part of data models for relational databases.
15. Enterprise Data Models