Chapter 8. Database Design II: Relational Normalization Theory

(1)

Chapter 8

Database Design II: Relational Normalization Theory

The E-R approach is a good way to start dealing with the complexity of modeling a real-world enterprise. However, it is only a set of guidelines that requires consid-erable expertise and intuition, and it can lead to several alternative designs for the same enterprise. Unfortunately, the E-R approach does not include the criteria or tools to help evaluate alternative designs and suggest improvements. In this chap-ter, we present the relational normalization theory, which includes a set of concepts and algorithms that can help with the evaluation and refinement of the designs obtained through the E-R approach.

The main tool used in normalization theory is the notion of functional dependency (and, to a lesser degree, join dependency). Functional dependency is a generalization of the key dependencies in the E-R approach whereas join dependencies do not have an E-R counterpart. Both types of dependency are used by designers to spot situations in which the E-R design unnaturally places attributes of two distinct entity types into the same relation schema. These situations are characterized in terms of normal forms, whence comes the term normalization theory. Normalization theory forces relations into an appropriate normal form using decompositions, which break up schemas involving unhappy unions of attributes of unrelated entity types. Because of the central role that decompositions play in relational design, the techniques that we are about to discuss are sometimes also called relational

decomposition theory.

8.1 THE PROBLEM OF REDUNDANCY

Example is the best way to understand the potential problems with relational de-signs based on the E-R approach. Consider theCREATE TABLEPerson statement (5.1) on page 93. Recall that this relation schema was obtained by direct translation from the E-R diagram in Figure 5.1. The first indication of something wrong with this translation was the realization that SSN is not a key of the resulting Person re-lation. Instead, the key is a combination (SSN, Hobby). In other words, the attribute SSNdoes not uniquely identify the tuples in the Person relation even though it

(2)

does uniquely identify the entities in the Person entity set. Not only is this coun-terintuitive, but it also has a number of undesirable effects on the instances of the Person relation schema.

To see this, we take a closer look at the relation instance shown in Figure 5.2, page 93. Notice that John Doe and Mary Doe are both represented by multiple tuples and that their addresses, names, and Ids occur multiple times as well. Redundant storage of the same information is apparent here. However, wasted space is the least of the problems. The real issue is that when database updates occur, we must keep all the redundant copies of the same data consistent with each other, and we must do it efficiently. Specifically, we can identify the following problems:

Update anomaly. If John Doe moves to 1 Hill Top Dr., updating the relation in Figure 5.2 requires changing the address in both tuples that describe the John Doe entity.

Insertion anomaly. Suppose that we decide to add Homer Simpson to the Per-son relation, but Homer’s information sheet does not specify any hobbies. One way around this problem might be to add the tuple023456789, Homer Simp-son, Fox 5 TV, NULL—that is, to fill in the missing field with NULL. However, Hobbyis part of the primary key, and most DBMSs do not allow null values in primary keys. Why? For one thing, DBMSs generally maintain an index on the primary key, and it is not clear how the index should refer to the null value. Assuming that this problem can be solved, suppose that a request is made to insert023456789, Homer Simpson, Fox 5 TV, acting. Should this new tuple just be added, or should it replace the existing tuple023456789, Homer Simp-son, Fox 5 TV, NULL? A human will most likely choose to replace it, because humans do not normally think of hobbies as a defining characteristic of a per-son. However, how does a computer know that the tuples with primary key 111111111, NULL and 111111111, acting refer to the same entity? (Recall that the information about which tuple came from which entity is lost in the translation!) Redundancy is at the root of this ambiguity: If Homer were de-scribed by at most one tuple, only one course of action would be possible. Deletion anomaly. Suppose that Homer Simpson is no longer interested in act-ing. How are we to delete this hobby? We can, of course, delete the tuple that talks about Homer’s acting hobby. However, since there is only one tuple that refers to Homer (see Figure 5.2), this throws out perfectly good information about Homer’s Id and address. To avoid this loss of information, we can try to replace acting with NULL. Unfortunately, this again raises the issue of nulls in primary key attributes. Once again, redundancy is the culprit. If only one tuple could possibly describe Homer, the attribute Hobby would not be part of the key.

For convenience, we sometimes use the term “update anomalies” to refer to all of the above anomaly types.

(3)

8.2 Decompositions 213

Figure 8.1 Decomposition of the Person relation shown in Figure 5.2.

SSN Name Address

111111111 John Doe 123 Main St.

555666777 Mary Doe 7 Lake Dr.

987654321 Bart Simpson Fox 5 TV

(a) Person1 SSN Hobby 111111111 stamps 111111111 hiking 111111111 coins 555666777 hiking 555666777 skating 987654321 acting (b) Hobby 8.2 DECOMPOSITIONS

The problems caused by redundancy—wasted storage and anomalies—can be easily fixed using the following simple technique. Instead of having one relation describe all that is known about persons, we can use two separate relation schemas.

Person1(SSN, Name, Address)

Hobby(SSN, Hobby) (8.1)

Projecting the relation in Figure 5.2 on each of these schemas yields the result shown in Figure 8.1. The new design has the following important properties:

1. The original relation of Figure 5.2 is exactly the natural join of the two relations in Figure 8.1. In fact, one can prove that this property is not an artifact of our example—that is, that under certain natural assumptions1any relation, r, over the schema of Figure 5.2 equals the natural join of the projections of r on the two relational schemas in (8.1). This property, called losslessness, will be discussed in Section 8.6.1. This means that our decomposition preserves the original information represented by the Person relation.

2. The redundancy present in the original relation of Figure 5.2 is gone and so are the update anomalies. The only items that are stored more than once are SSNs, which are identifiers of entities of type Person. Thus, changes to addresses, names, or hobbies now affect only a single tuple. Likewise, the removal of Bart Simpson’s hobbies from the database does not delete the information about his address and does not require us to rely on null values. The insertion anomaly is also gone because we can now add people and hobbies independently. Observe that the new design still has a certain amount of redundancy and that we might still need to use null values in certain cases. First, since we use SSNs as tuple

(4)

Figure 8.2 The ultimate decomposition. SSN 111111111 555666777 987654321 Name John Doe Mary Doe Bart Simpson Address 123 Main St. 7 Lake Dr. Fox 5 TV Hobby stamps hiking coins skating acting

identifiers, each SSN can occur multiple times and all of these occurrences must be kept consistent across the database. So consistency maintenance has not been eliminated completely. However, if the identifiers are not dynamic (for instance, SSNs, which do not change frequently), consistency maintenance is considerably simplified. Second, imagine a situation in which we add a person to our Person1 relation and the address is not known. Clearly, even with the new design, we have to insert NULL in the Address field for the corresponding tuple. However, Address is not part of a primary key, so the use of NULL here is not that bad (we will still have difficulties joining Person1 on the Address attribute).

It is important to understand that not all decompositions are created equal. In fact, most of them do not make any sense even though they might be doing a good job at eliminating redundancy. The decomposition

Ssn(SSN) Name(Name)

Address(Address) (8.2)

Hobby(Hobby)

is the ultimate “redundancy eliminator.” Projecting on the relation of Figure 5.2 yields a database where each value appears exactly once, as shown in Figure 8.2. Unfortunately, this new database is completely devoid of any useful information; for instance, it is no longer possible to tell where John Doe lives or who collects stamps as a hobby. This situation is in sharp contrast with the decomposition of Figure 8.1, where we were able to completely restore the information represented by the original relation using a natural join.

The need for schema refinement. Translation of the Person entity type into the relational model indicates that one cannot rely solely on the E-R approach for designing database schemas. Furthermore, the problems exhibited by the Person example are by no means rare or unique. Consider the relationship HasAccount of Figure 5.10. A typical translation of HasAccount into the relational model might be

CREATE TABLE HasAccount ( AccountNumberINTEGER NOT NULL,

ClientId CHAR(20), (8.3)

(5)

8.3 Functional Dependencies 215

PRIMARY KEY (ClientId, OfficeId),

FOREIGN KEY (OfficeId) REFERENCES Office ... ... ... )

Recall that a client can have at most one account in an office and hence (ClientId, OfficeId) is a key. Also, an account must be assigned to exactly one office. Careful analysis shows that this requirement leads to some of the same problems that we saw in the Person example. For example, a tuple that records the fact that a particular account is managed by a particular office cannot be added without also recording client information (since ClientId is part of the primary key) which is an insertion anomaly. This (and the dual deletion anomaly) is perhaps not a serious problem, because of the specifics of this particular application, but the update anomaly could present maintenance issues. Moving an account from one office to another involves changing OfficeId in every tuple corresponding to that account. If the account has multiple clients, this might be a problem.

We return to this example later in this chapter because HasAccount exhibits certain interesting properties not found in the Person example. For instance, even though a decomposition of HasAccount might still be desirable, it incurs addi-tional maintenance overhead that the decomposition of Person does not.

The above discussion brings out two key points: (1) Decomposition of relation schemas can serve as a useful tool that complements the E-R approach by eliminat-ing redundancy problems; (2) The criteria for chooseliminat-ing the right decomposition are not immediately obvious, especially when we have to deal with schemas that con-tain many attributes. For these reasons, the purpose of Sections 8.3 through 8.6 is to develop techniques and criteria for identifying relation schemas that are in need of decomposition, as well as to understand what it means for a decomposition not to lose information.

The central tool in developing much of decomposition theory is functional dependency, which is a generalization of the idea of key constraints. Functional dependencies are used to define normal forms—a set of requirements on relational schemas that are desirable in update-intensive transaction systems. This is why the theory of decompositions is often also called normalization theory. Sections 8.5 through 8.9 develop algorithms for carrying out the normalization process.

8.3 FUNCTIONAL DEPENDENCIES

For the remainder of this chapter, we use a special notation for representing at-tributes, which is common in relational normalization theory, as follows: Capital letters from the beginning of the alphabet (A, B, C, D) represent individual at-tributes; capital letters from the middle to the end of the alphabet with bars over them (P, V, W, X, Y, Z) represent sets of attributes. Also, strings of letters, such as

ABCD, denote sets of the respective attributes ({A, B, C, D}); strings of letters with

bars over them, (X Y Z), stand for unions of these sets (X∪ Y ∪ Z). Although this notation requires some getting used to, it is very convenient and provides a succinct language, which we use in examples and definitions.

(6)

A functional dependency (FD) on a relation schema, R, is a constraint of the form

X→ Y, where X and Y are sets of attributes used in R. If r is a relation instance of R, it is said to satisfy this functional dependency if

For every pair of tuples, t and s, in r, if t and s agree on all attributes in

X, then t and s agree on all attributes in Y.

Put another way, there must not be a pair of tuples in r such that they have the same values for every attribute in X but different values for some attribute in Y.

Note that the key constraints introduced in Chapter 4 are a special kind of FD. Suppose that key(K) is a key constraint on the relational schema R and that r is a relational instance over R. By definition, r satisfies key(K) if and only if there is no pair of distinct tuples, t, s∈ r, such that t and s agree on every attribute in key(K). Therefore, this key constraint is equivalent to the FD K→ R, where K is the

set of attributes in the key constraint and R denotes the set of all attributes in the schema R.

Keep in mind that functional dependencies are associated with relation schemas, but when we consider whether or not a functional dependency is satisfied we must consider relation instances over those schemas. This is because FDs are integrity

constraints on the schema (much like key constraints), which restrict the set of

allowable relation instances to those that satisfy the given FDs. Thus, given a schema, R= (R; Constraints), where R is a set of attributes and Constraints is a set of FDs, we are looking for a set of all relation instances over R that satisfies every FD in Constraints. Such relational instances are called legal instances of R.

Functional dependencies and update anomalies. Certain functional dependen-cies that exist in a relational schema can lead to redundancy in the corresponding relation instances. Consider the two examples discussed in Sections 8.1 and 8.3: the schemas Person and HasAccount. Each has a primary key, as illustrated by the corresponding CREATE TABLEcommands (5.1), page 93, and (8.3), page 214. Correspondingly, there are the following functional dependencies:

Person: SSN,Hobby → SSN,Name,Address,Hobby

HasAccount: ClientId,OfficeId → AccountNumber, (8.4)

ClientId,OfficeId

These are not the only FDs implied by the original specifications, however. For instance, both Name and Address are defined as single-valued attributes in the E-R diagram of Figure 5.1. This clearly implies that one Person entity (identified by its attribute SSN) can have at most one name and one address. Similarly, the business rules of PSSC (the brokerage firm discussed in Section 5.5) require that every account be assigned to exactly one office, which means that the following FDs must also hold for the corresponding relation schemas:

Person: SSN → Name, Address

(7)

8.4 Properties of Functional Dependencies 217

It is easy to see that the syntactic structure of the dependencies in (8.4) closely corresponds to the update anomalies that we identified for the corresponding relations. For instance, the problem with Person was that for any given SSN we could not change the values for the attributes Name and Address independently of whether the corresponding person had hobbies: If the person had multiple hobbies, the change had to occur in multiple rows. Likewise with HasAccount we cannot change the value of OfficeId (i.e., transfer an account to a different office) without having to look for all clients associated with this account. Since a number of clients might share the same account, there can be multiple rows in HasAccount that refer to the same account; hence, multiple rows in which OfficeId must be changed.

Note that in both cases, the attributes involved in the update anomalies appear on the left-hand sides of an FD. We can see that update anomalies are associated with certain kinds of functional dependencies. Which dependencies are bad? At the risk of giving away the secret, we draw your attention to one major difference between the dependencies in (8.4) and (8.5): The former specify key constraints for their corresponding relations whereas the latter do not.

However, simply knowing which dependencies cause the anomalies is not enough—we must do something about them. We cannot just abolish the offending FDs, because they are part of the semantics of the enterprise being modeled by the databases. They are implicitly or explicitly part of the Requirements Document and cannot be changed without an agreement with the customer. On the other hand, we saw that schema decomposition can be a useful tool. Even though a de-composition cannot abolish a functional dependency, it can make it behave. For instance, the decomposition shown in (8.1) yields schemas in which the offending FD, SSN→ Name, Address, becomes a well-behaved key constraint.

8.4 PROPERTIES OF FUNCTIONAL DEPENDENCIES

Before going any further, we need to learn some mathematical properties of func-tional dependencies and develop algorithms to test them. Since these properties and algorithms rely heavily on the notational conventions introduced at the begin-ning of Section 8.3, it might be a good idea to revisit these conventions.

The properties of FDs that we are going to study are based on entailment. Con-sider a set of attributes R, a set, F, of FDs over R, and another FD, f , on R. We say thatF entails f if every relation r over the set of attributes R has the following property:

If r satisfies every FD in F, then r satisfies the FD f .

Given a set of FDs,F, the closure of F, denoted F+, is the set of all FDs entailed by F. Clearly, F+_contains_{F as a subset.}2

2_{If f}_{∈ F, then every relation that satisfies every FD in F obviously satisfies f . Therefore, by the definition}

(8)

IfF and G are sets of FDs, we say that F entails G if F entails every individual FD inG. F and G are said to be equivalent if F entails G and G entails F.

We now present several simple but important properties of entailment.

Reflexivity. Some FDs are satisfied by every relation no matter what. These de-pendencies all have the form X→ Y, where Y ⊆ X, and are called trivial FDs.

The reflexivity property states that, if Y⊆ X, then X → Y.

To see why trivial FDs are always satisfied, consider a relation, r, whose set of attributes includes all of the attributes mentioned in X Y. Suppose that t, s∈ r are tuples that agree on X. But, since Y⊆ X, this means that t and s agree on Y as well. Thus, r satisfies X→ Y.

We can now relate trivial FDs and entailment. Because a trivial FD is satisfied by every relation, it is entailed by every set of FDs! In particular,F+contains every trivial FD.

Augmentation. Consider an FD, X→ Y, and another set of attributes, Z. Let R contain X∪ Y ∪ Z. Then X → Y entails X Z → Y Z. In other words, every relation r over R that satisfies X→ Y must also satisfy the FD X Z → Y Z.

The augmentation property states that, if X→ Y, then X Z → Y Z.

To see why this is true, if tuples t, s∈ r agree on every attribute of X Z then in particular they agree on X. Since r satisfies X→ Y, t and s must also agree on Y. They also agree on Z, since we have assumed that they agree on a bigger set of attributes, X Z. Thus, if s, t agree on every attribute in X Z, they must agree on Y Z. As this is an arbitrarily chosen pair of tuples in r, it follows that r satisfies X Z→ Y Z. Transitivity. The set of FDs {X→ Y, Y → Z} entails the FD X → Z.

The transitivity property states that, if X→ Y and Y → Z , then X → Z. This property can be established similarly to the previous two (see the exercises).

These three properties of FDs are known as Armstrong’s Axioms,3_{and their}

main use is in the proofs of correctness of various database design algorithms. However, they are also a powerful tool used by (real, breathing human) database designers because they can help them quickly spot problematic FDs in relational schemas. We now show how Armstrong’s axioms are used to derive new FDs. Union of FDs. Any relation, r, that satisfies X→ Y and X → Z must also satisfy

X→ Y Z. To show this, we can derive X → Y Z from X → Y and X → Z using simple

syntactic manipulations defined by Armstrong’s axioms. Such manipulations can

3_{Strictly speaking, these are inference rules, because they derive new rules out of old ones. Only the}

(9)

be easily programmed on a computer, unlike the tuple-based considerations we used to establish the axioms themselves. Here is how it is done:

(a) X→ Y Given

(b) X→ Z Given

(c) X→ Y X Adding X to both sides of (a): Armstrong’s augmentation rule (d) Y X→ Y Z Adding Y to both sides of (b): Armstrong’s augmentation rule (e) X→ Y Z By Armstrong’s transitivity rule applied to (c) and (d)

Decomposition of FDs. In a similar way, we can prove the following rule: Every relation that satisfies X→ Y Z must also satisfy the FDs X → Y and X → Z. This is accomplished by the following simple steps:

(a) X→ Y Z Given

(b) Y Z→ Y By Armstrong’s reflexivity rule, since Y ⊆ Y Z (c) X→ Y By transitivity from (a) and (b)

Derivation of X→ Z is similar.

Armstrong’s axioms are obviously sound. By sound we mean that any expression of the form X→ Y derived using the axioms is actually a functional dependency. Soundness follows from the fact that we have proved that these inference rules are valid for every relation. It is much less obvious, however, that they are also complete—that is, if a set of FDs,F, entails another FD, f , then f can be derived fromF by a sequence of steps similar to the ones above that rely solely on Arm-strong’s axioms! We do not prove this fact here, but a proof can be found in [Ullman 1988]. Soundness and completeness of Armstrong’s axioms is not just a theoretical curiosity—this result has considerable practical value because it guarantees that en-tailment of FDs (i.e., the question of whether f∈ F+) can be verified by a computer program. We are now going to develop one such algorithm.

An obvious way to verify entailment of an FD, f , by a set of FDs, F, is to instruct the computer to apply Armstrong’s axioms toF in all possible ways. Since the number of attributes mentioned in F and f is finite, this derivation process cannot go on forever. When we are satisfied that all possible derivations have been made, we can simply check whether f is among the FDs derived by this process. Completeness of Armstrong’s axioms guarantees that f ∈ F+if and only if f is one of the FDs thus derived.

To see how this process works, consider the following sets of FDs:F = {AC →

B, A→ C, D → A} and G = {A → B, A → C, D → A, D → B}. We can use Armstrong’s

axioms to prove that these two sets are equivalent, i.e., every FD inG is entailed byF, and vice versa. For instance, to prove that A → B is implied by F, we can apply Armstrong’s axioms in all possible ways. Most of these attempts will not lead anywhere, but a few will. For instance, the following derivation establishes the desired entailment:

(10)

(a) A→ C An FD inF

(b) A→ AC From (a) and Armstrong’s augmentation axiom

(c) A→ B From (b), AC→ B ∈ F, and Armstrong’s transitivity axiom

The FDs A→ C and D → A belong to both F and G, so the derivation is trivial. For D→ B in G, the computer can try to apply Armstrong’s axioms until this FD is derived. After a while, it will stumble upon this valid derivation:

(a) D→ A an FD in F

(b) A→ B derived previously

(c) D→ B from (a), (b), and Armstrong’s transitivity axiom This shows that every FD inF entailed by G is done similarly.

Although the simplicity of checking entailment by blindly applying Armstrong’s axioms is attractive, it is not very efficient. In fact, the size ofF+can be exponential in the size ofF, so for large database schemas it can take a very long time before the designer ever sees the result. We are therefore going to develop a more efficient algorithm, which is also based on Armstrong’s axioms but which applies them much more judiciously.

Checking entailment of FDs. The idea of the new algorithm for verifying entail-ment is based on attribute closure.

Given a set of FDs,F, and a set of attributes, X, we define the attribute closure of X with respect toF, denoted X+_F, as follows:

X+_F = {A | X → A ∈ F+}

In other words, X+_F is a set of all those attributes, A, such that X→ A is entailed byF. Note that X ⊆ X+_F because, if A∈ X, then, by Armstrong’s reflexivity axiom,

X→ A is a trivial FD that is entailed by every set of FDs, including F.

It is important to understand that the closure ofF (i.e., F+) and the closure of X (i.e., X+_F) are related but different notions:F+is a set of functional dependencies, whereas X+_F is a set of attributes.

The more efficient algorithm for checking entailment of FDs now works as follows: Given a set of FDs,F, and an FD, X → Y, check whether Y ⊆ X+_F. If this is so, thenF entails X → Y. Otherwise, if Y ⊆ X+_F, thenF does not entail X → Y.

The correctness of this algorithm follows from Armstrong’s axioms. If Y⊆ X+_F, then X→ A ∈ F+for every A∈ Y (by the definition of X+_F). By the union rule for FDs,F thus entails X → Y. Conversely, if Y ⊆ X+_F, then there is B∈ Y such that

B∈ X+_F. Hence, X→ B is not entailed by F. But then F cannot entail X → Y; if it did, it would have to entail X→ B as well, by the decomposition rule for FDs.

The heart of the above algorithm is a check of whether a set of attributes belongs to X+_F. Therefore, we are not done yet: We need an algorithm for computing the

(11)

Figure 8.3 Computation of attribute closure X+_F.

closure := X repeat

old := closure

if there is an FD Z→ V ∈ F such that Z ⊆ closure then

closure := closure∪ V until old= closure

returnclosure

closure of X, which we present in Figure 8.3. The idea behind the algorithm is enlarging the set of attributes known to belong to X+_F by applying the FDs inF. The closure is initialized to X, since we know that X is always a subset of X+_F.

The soundness of the algorithm can be proved by induction. Initially, closure is X, so X→ closure is in F+. Then, assuming that X→ closure ∈ F+at some intermediate step in the repeat loop of Figure 8.3, and given an FD Z→ V ∈ F such that Z ⊂

closure, we can use the generalized transitivity rule (see exercise 8.7) to infer thatF

entails X→ closure ∪ V. Thus, if A ∈ closure at the end of the computation, then

A∈ X+_F. The converse is also true: If A∈ X+_F, then at the end of the computation

A∈ closure (see exercise 8.8).

Unlike the simple-minded algorithm that uses Armstrong’s axioms indiscrim-inately, the run-time complexity of the algorithm in Figure 8.3 is quadratic in the size ofF. In fact, an algorithm for computing X+_F that is linear in the size of F is given in [Beeri and Bernstein 1979]. This algorithm is better suited for a computer program, but its inner workings are more complex.

Example 8.4.1 (Checking Entailment) Consider a relational schema, R = (R; F), where R= ABCDEFGHIJ, and the set of FDs, F, which contains the following FDs:

AB→ C, D → E, AE → G, GD → H, ID → J. We wish to check whether F entails ABD→ GH and ABD → HJ.

First, let us compute ABD+_F. We begin with closure = ABD. Two FDs can be used in the first iteration of the loop in Figure 8.3. For definiteness, let us use AB→ C, which makes closure = ABDC. In the second iteration, we can use D→ E, which makes closure = ABDCE. Now it becomes possible to use the FD AE→ G in the third iteration, yielding closure = ABDCEG. This in turn allows GD→ H to be applied in the fourth iteration, which results in closure = ABDCEGH. In the fifth iteration, we cannot apply any new FDs, so closure does not change and the loop terminates. Thus, ABD+_F= ABDCEGH.

Since GH⊆ ABDCEGH, we conclude that F entails ABD → GH. On the other hand, HJ⊆ ABDCEGH, so we conclude that ABD → HJ is not entailed by F. Note, however, thatF does entail ABD → H.

(12)

Figure 8.4 Testing equivalence of sets of FDs. Input:F, G – FD sets

Output:true, ifF is equivalent to G; false otherwise for eachf ∈ F do

ifG does not entail f then return false for eachg∈ G do

ifF does not entail g then return false returntrue

The above algorithm for testing entailment leads to a simple test for equivalence between a pair of sets of FDs. LetF and G be such sets. To check that they are equivalent, we must check that every FD inG is entailed by F, and vice versa. The algorithm is depicted in Figure 8.4.

8.5 NORMAL FORMS

To eliminate redundancy and potential update anomalies, database theory identi-fies several normal forms for schemas such that, if a schema is in one of the normal forms, it has certain predictable properties. Originally, [Codd 1970] proposed three normal forms, each eliminating more anomalies than the previous one.

The first normal form (1NF), as introduced by Codd, is equivalent to the defini-tion of the reladefini-tional data model. The second normal form (2NF) was an attempt to eliminate some potential anomalies, but it has turned out to be of no practical use, so we do not discuss it.

The third normal form (3NF) was initially thought to be the “ultimate” normal form. However, Boyce and Codd soon realized that 3NF can still harbor undesirable combinations of functional dependencies, so they introduced the so-called Boyce-Codd normal form(BCNF). Unfortunately, the sobering reality of computational sciences is that there is no free lunch. Even though BCNF is more desirable, it is not always achievable without paying a price elsewhere. In this section, we define both BCNF and 3NF. Subsequent sections develop algorithms for automatically converting relational schemas that possess various bad properties into sets of schemas in 3NF and BCNF. We also study the trade-offs associated with such conversions.

Toward the end of this chapter, we show that certain types of redundancy are caused not by FDs but by other dependencies. To deal with this problem, we introduce the fourth normal form (4NF), which further extends BCNF.

The Boyce-Codd Normal Form. A relational schema, R= (R; F), where R is the set of attributes of R andF is the set of functional dependencies associated with R, is in Boyce-Codd normal form if, for every FD X→ Y ∈ F, either of the following is true:

Y⊆ X (i.e., this is a trivial FD). X is a superkey of R.

(13)

8.5 Normal Forms 223

In other words, the only nontrivial FDs are those in which a key functionally determines one or more attributes.

It is easy to see that Person1 and Hobby, the relational schemas given in (8.1), are in BCNF, because the only nontrivial FD is SSN→ Name, Address. It applies to Person1 which has SSN as a key.

On the other hand, consider the schema Person defined by theCREATE TABLE statement (5.1), page 93, and the schema HasAccount defined by the SQL state-ment (8.3), page 214. As discussed earlier, these statestate-ments fail to capture some important relationships, which are represented by the FDs in (8.5), page 216. Each of these FDs is in violation of the requirement to be in BCNF: They are not trivial, and their left-hand sides—SSN and AccountNumber—are not keys of their respective schemas.

Note that a BCNF schema can have more than one key. For instance, R =

(ABCD; F), where F = {AB → CD, AC → BD} has two keys, AB and AC. And

yet it is in BCNF because the left-hand side of each of the two FDs inF is a key. An important property of BCNF schemas is that their instances do not contain redundant information. Since we have been illustrating redundancy problems only through concrete examples, the above statement might seem vague. Exactly what is redundant information? For instance, does the abstract relation

A B C D

1 1 3 4

2 1 3 4

over the above mentioned BCNF schema R store redundant information?

Superficially it might seem so, because the two tuples agree on all but one attribute. However, having identical values in some attributes of different tuples does not necessarily imply that the tuples are storing redundant information. Redundancy arises when the values of some set of attributes, X, necessarily implies the value that must exist in another attribute, A—a functional dependency. If two tuples have the same values in X, they must have the same value of A. Redundancy is eliminated if we store the association between X and A only once (in a separate relation) instead of repeating it in all tuples of an instance of the schema R that agree on X. Since R does not have FDs over the attributes BCD, no redundant information is stored. The fact that the tuples in the relation coincide over BCD is coincidental. For instance, the value of attribute D in the first tuple can be changed from 4 to 5 without regard for the second tuple.

A DBMS automatically eliminates one type of redundancy: Two tuples with the same values in the key fields are prohibited in any instance of a schema. This is a special case: The key identifies an entity and so determines the values of all attributes describing that entity. As the definition of BCNF precludes associations that do not contain keys, the relations over BCNF schemas do not store redundant information. As a result, deletion and update anomalies do not arise in BCNF relations.

(14)

Relations with more than one key still can have insertion anomalies. To see this, suppose that associations over ABD and over ACD are added to our relation as shown: A B C D 1 1 3 4 2 1 3 4 3 4 NULL 5 3 NULL 2 5

Because the value over the attribute C in the first association and over B in the second is unknown, we fill in the missing information with NULL. However, now we cannot tell if the two newly added tuples are the same—it all depends on the real values for the nulls. A practical solution to this problem, as adopted by the SQL standard, is to designate one key as primary and to prohibit null values in its attributes.

The Third Normal Form. A relational schema, R= (R; F), where R is the set of attributes of R andF is the set of functional dependencies associated with R, is in third normal formif, for every FD X→ A ∈ F, any one of the following is true:

A∈ X, (i.e., this is a trivial FD). X is a superkey of R.

A∈ K for some key K of R.

Observe that the first two conditions in the definition of 3NF are identical to the conditions that define BCNF. Thus, 3NF is a relaxation of BCNF’s requirements. Every schema in BCNF must also be in 3NF, but the converse is not true in general. For instance, the relation HasAccount (8.3) on page 214 is in 3NF because the only FD that is not based on a key constraint is AccountNumber→ OfficeId, and OfficeId is part of the key. However, this relation is not in BCNF, as shown previously.4

If you are wondering about the intrinsic merit of the third condition in the definition of 3NF, the answer is that there is none. In a way, 3NF was discovered by mistake—in the search for what we now call BCNF! The reason for its remarkable survival is that it was found later to have some desirable algorithmic properties that BCNF does not have. We discuss these issues in subsequent sections.

Recall from Section 8.2 that relation instances over HasAccount might store redundant information. Now we can see that this redundancy arises because of the functional dependency that relates AccountNumber and OfficeId and that is not implied by key constraints.

4_{In fact, HasAccount is the smallest possible example of a 3NF relation that is not in BCNF (see}

(15)

8.6 Properties of Decompositions 225

For another example, consider the schema Person discussed earlier. This schema violates the 3NF requirements because, for example, the FD SSN→ Name is not based on a key constraint (SSN is not a superkey) and Name does not belong to a key of Person. However, the decomposition of this schema into Person1 and Hobby in (8.1), page 213, yields a pair of schemas that are in both 3NF and BCNF. 8.6 PROPERTIES OF DECOMPOSITIONS

Since there is no redundancy in BCNF schemas and redundancy in 3NF is limited, we are interested in decomposing a given schema into a collection of schemas, each of which is in one of these normal forms.

The main thrust of the discussion in the previous section was that 3NF does not completely solve the redundancy problem. Therefore, at first glance, there appears to be no justification to consider 3NF as a goal for database design. It turns out, however, that the maintenance problems associated with redundancy do not show the whole picture. As we will see, maintenance is also associated with integrity constraints,5 _{and 3NF decompositions sometimes have better properties in this}

regard than do BCNF decompositions. Our first step is to define these properties. Recall from Section 8.2 that not all decompositions are created equal. For in-stance, the decomposition of Person shown in (8.1) is considered good while the one in (8.2) makes no sense. Is there an objective way to tell which decompositions make sense and which do not, and can this objective way be explained to a com-puter? The answer to both questions is yes. The decompositions that make sense are called lossless. Before tackling this notion, we need to be more precise about what we mean by a “decomposition.”

A decomposition of a schema, R= (R; F), where R is a set of attributes of the schema andF is its set of functional dependencies, is a collection of schemas

R1= (R1;F1), R2= (R2;F2), . . . , Rn= (Rn;Fn) such that the following conditions hold

1. R = ∪n_i₌₁Ri

2. F entails Fifor every i= 1, . . . , n.

The first part of the definition is clear: A decomposition should not introduce new attributes, and it should not drop attributes found in the original schema. The second part of the definition says that a decomposition should not introduce new functional dependencies (but may drop some). We discuss the second requirement in more detail later.

The decomposition of a schema naturally leads to decomposition of relations over it. A decomposition of a relation, r, over schema R is a set of relations

r1= πR₁(r), r2= πR₂(r), . . . , rn= πR_n(r)

5_{For example, a particular integrity constraint in the original table might be checkable in the}

(16)

where π is the projection operator. It can be shown (see exercise 8.9) that, if r is a valid instance of R, then each ri satisfies all FDs inFi and thus each ri is a valid relation instance over the schema Ri. The purpose of a decomposition is to replace the original relation, r, with a set of relations r1, . . . , rnover the schemas that constitute the decomposed schema.

In view of the above definitions, it is important to realize that schema

decomposi-tion and reladecomposi-tion instance decomposidecomposi-tion are two different but related nodecomposi-tions.

Applying the above definitions to our running example, we see that splitting Person into Person1 and Hobby (see (8.1) and Figure 8.1) yields a decomposition. Splitting Person as shown in (8.2) and Figure 8.2 is also a decomposition in the above sense. It clearly satisfies the first requirement for being a decomposition. It also satisfies the second, since only trivial FDs hold in (8.2), and these are entailed by every set of dependencies.

This last example shows that the above definition of a decomposition does not capture all of the desirable properties of a decomposition because, as you may recall from Section 8.2, the decomposition in (8.2) makes no sense. In the following, we introduce additional desirable properties of decompositions.

8.6.1 Lossless and Lossy Decompositions

Consider a relation, r, and its decomposition, r₁, . . . , r_n, as defined above. Since after the decomposition the database no longer stores the relation r and instead maintains its projections r1, . . . , rn, the database must be able to reconstruct the original relation r from these projections. Not being able to reconstruct r means that the decomposition does not represent the same information as does the original database (imagine a bank losing the information about who owns which account or, worse, giving those accounts to the wrong owners!).

In principle, one can use any computational method that guarantees reconstruc-tion of r from its projecreconstruc-tions. However, the natural and, in most cases, practical method is the natural join. We thus assume that r is reconstructible if and only if

r= r1 r2 · · · rn

Reconstructibility must be a property of schema decomposition and not of a par-ticular instance over this schema. At the database design stage, the designer ma-nipulates schemas, not relations, and any transformation performed on a schema must guarantee that reconstructibility holds for all of its valid relation instances.

This discussion leads to the following notion. A decomposition of schema R=

(R; F) into a collection of schemas

R1= (R1;F1), R2= (R2;F2), . . . , Rn= (Rn;Fn) is lossless if, for every valid instance r of schema R,

r = r1 r2 · · · rn where

(17)

A decomposition is lossy otherwise.

In plain terms, a lossless schema decomposition is one that guarantees that any valid instance of the original schema can be reconstructed from its projections on the individual schemas of the decomposition. Note that

r ⊆ r1 r2 · · · rn

holds for any decomposition (exercise 8.10), so losslessness really just states the opposite inclusion.

r ⊇ r1 r1 · · · rn The fact that

r ⊆ r1 r2 · · · rn

holds, no matter what, may seem confusing at first. If we can get more tuples by joining the projections of r, why is such a decomposition called “lossy?” After all, we gained more tuples, not less! To clarify this issue, observe that what we might lose here are not tuples but rather information about which tuples are the right

ones. Consider, for instance, the decomposition (8.2) of schema Person. Figure 8.2

presents the corresponding decomposition of a valid relation instance over Person shown in Figure 5.2. However, if we now compute a natural join of the relations in the decomposition (which becomes a Cartesian product since these relations do not share attributes), we will not be able to tell who lives where and who has what hobbies. The relationship among names and SSNs is also lost. In other words, when reconstructing the original relation getting more tuples is as bad as getting fewer— we must get exactly the set of tuples in the original relation.

Now that we are convinced of the importance of losslessness, we need an algo-rithm that a computer can use to verify this property, since the definition of lossless joins does not provide an effective test but only tells us to try every possible relation. This is not feasible or efficient.

A general test of whether a decomposition into n schemas is lossless exists, but is somewhat complex. It can be found in [Beeri et al. 1981]. However, there is a much simpler test that works for binary decompositions, that is, decompositions into a pair of schemas. This test can establish losslessness of a decomposition into more than two schemas, provided that this decomposition was obtained by a series of binary decompositions. Since most decompositions are obtained in this way, the simple binary test introduced below is sufficient for most practical purposes. Testing the losslessness of a binary decomposition. Let R= (R; F) be a schema and R1= (R1; F1), R2= (R2; F2)be a binary decomposition of R. This

decomposi-tion is lossless if and only if either of the following is true:

(R₁∩ R₂)→ R₁∈ F+.

(18)

Figure 8.5 Tuple structure in a lossless binary decomposition: A row of r1combines

with exactly one row of r₂. R1 r1 r2 R2 R2 R1

∪

R2 – – – – –

To see why this is so, suppose that (R1∩ R2)→ R2∈ F+. Then R1is a superkey

of R since the values in a subset of attributes of R1determine the values of all

attributes of R2(and, obviously, R1functionally determines R1). Let r be a valid

relation instance for R. Since R1is a superkey of R every tuple in r1= π_R₁(r)extends

to exactly one tuple in r. Thus, as depicted in Figure 8.5, the cardinality (the number of tuples) in r1equals the cardinality of r, and every tuple in r1joins with exactly

one tuple in r2= π_R₂(r) (if more than one tuple in r2 joined with a tuple in r1, (R1∩ R2)→ R2would not be an FD). Therefore, the cardinality of r1 r2equals the

cardinality of r1, which in turn equals the cardinality of r. Since r must be a subset

of r1 r2, it follows that r= r1 r2. Conversely, if neither of the above FDs holds,

it is easy to construct a relation r such that r⊂ r1 r2. Details of this construction

are left to exercise 8.11.

The above test can now be used to substantiate our intuition that the decompo-sition (8.1) of Person into Person1 and Hobby is a good one. The intersection of the attributes of Hobby and Person1 is{SSN}, and SSN is a key of Person1. Thus, this decomposition is lossless.

8.6.2 Dependency-Preserving Decompositions

Consider the schema HasAccount once again. Recall that it has attributes Ac-countNumber, ClientId, OfficeId, and its FDs are

ClientId, OfficeId → AccountNumber (8.6)

AccountNumber → OfficeId (8.7)

(19)

AcctOffice = (AccountNumber, OfficeId;

{AccountNumber → OfficeId}) (8.8)

AcctClient = (AccountNumber, ClientId; { })

because AccountNumber (the intersection of the two attribute sets) is a key of the first schema, AcctOffice.

Even though decomposition (8.8) is lossless, something seems to have fallen through the cracks. AcctOffice hosts the FD (8.7), but AcctClient’s associated set of FDs is empty. This leaves the FD (8.6), which exists in the original schema, homeless. Neither of the two schemas in the decomposition has all of the attributes needed to house this FD; furthermore, the FD cannot be derived from the FDs that belong to the schemas AcctOffice and AcctClient.

In practical terms, this means that even though decomposing relations over the schema HasAccount into relations over AcctOffice and AcctClient does not lead to information loss, it might incur a cost for maintaining the integrity constraint that corresponds to the lost FD. Unlike the FD (8.7), which can be checked locally, in the relation AcctOffice, verification of the FD (8.6) requires computing the join of AcctOffice and AcctClient before checking of the FD can begin. In such cases, we say that the decomposition does not preserve the dependencies in the original schema. We now define what this means.

Consider a schema, R= (R; F), and suppose that R1= (R1;F1), R2= (R2;F2), . . . , Rn= (Rn;Fn)

is a decomposition.F entails each Fiby definition, soF entails ∪ni=1Fi. However, this definition does not require that the two sets of dependencies be equivalent—that is, that∪n_i₌₁Fimust also entailF. This reverse entailment is what is missing in the above example. The FD set of HasAccount, which consists of dependencies (8.6) and (8.7), is not entailed by the union of the dependencies in decomposition (8.8), which consists of only a single dependency AccountNumber→ OfficeId. Because of this, (8.8) is not a dependency-preserving decomposition.

Formally,

R1= (R1; F1), R2= (R2; F2), . . . , Rn= (Rn; Fn)

is said to be a dependency-preserving decomposition of R= (R; F) if and only if it is a decomposition and the sets of FDsF and ∪n_i₌₁Fiare equivalent.

The fact that an FD, f , inF is not in any Fidoes not mean that the decomposition is not dependency preserving, since f might be entailed by ∪n_i₌₁Fi. In this case, maintaining f as a functional dependency requires no extra effort: If the FDs in ∪n

i=1Fiare maintained, f will be also. It is only when f is not entailed by∪ni=1Fithat the decomposition is not dependency preserving and so maintenance of f requires a join.

In view of this definition, the above decomposition of HasAccount is not dependency preserving, and this is precisely what is wrong. The dependencies that exist in the original schema but are lost in the decomposition become inter-relational constraints that cannot be maintained locally. Each time a relation in

(20)

Figure 8.6 Decomposition of the HasAccount relation.

AccountNumber ClientId OfficeId

B123 111111111 SB01 A908 123456789 MN08 HasAccount AccountNumber OfficeId B123 SB01 A908 MN08 AcctOffice AccountNumber ClientId B123 111111111 A908 123456789 AcctClient

Figure 8.7 HasAccount and its decomposition after the insertion of several rows.

AccountNumber ClientId OfficeId

B123 111111111 SB01 B567 111111111 SB01 A908 123456789 MN08 HasAccount AccountNumber OfficeId B123 SB01 B567 SB01 A908 MN08 AcctOffice AccountNumber ClientId B123 111111111 B567 111111111 A908 123456789 AcctClient

the decomposition is changed, satisfaction of the inter-relational constraints can be checked only after the reconstruction of the original relation. To illustrate, consider the decomposition of HasAccount in Figure 8.6.

If we now add the tupleB567, SB01 to the relation AcctOffice and the tuple B567, 111111111 to AcctClient, the two relations will still satisfy their local FDs (in fact, we see from (8.8) that only AcctOffice has a dependency to satisfy). In contrast, the inter-relational FD (8.6) is not satisfied after these updates, but this is not immediately apparent. To verify this, we must join the two relations, as depicted in Figure 8.7. We now see that constraint (8.6) is violated by the first two tuples in the updated HasAccount relation.

(21)

The next question to ask is how hard it is to check whether a decomposition is dependency preserving. If we already have a decomposition complete with sets of functional dependencies attached to the local schemas, dependency preservation can be checked in polynomial time. We simply need to check that each FD in the original set is entailed by the union of FDs in the local schemas. For each such test, we can use the quadratic attribute closure algorithm discussed in Section 8.4.

In practice, the situation is more involved. Typically, we (and computer algo-rithms) must first decide how to split the attribute set in order to form the decom-position, and only then attach FDs to those attribute sets. We can attach the FD

X→ Y to an attribute set S if X ∪ Y ⊆ S. We state this more formally as follows.

Consider a schema, R= (R; F), a relation, r, over R, and a set of attributes, S, such that S⊆ R. If S is one of the schemas in a decomposition of R, the only FDs that are guaranteed to hold over π_S(r)are X→ Y ∈ F+such that X Y⊆ S. This leads us to the following notion:

π_S(F) = {X → Y | X → Y ∈ F+and X∪ Y ⊆ S}

which is called the projection of the setF of FDs onto the set of attributes S. The notion of projection for FDs opens a way to construct decompositions know-ing only how to split the attribute set of the original schema. If R= (R; F) is a schema and R1, . . . , Rnare subsets of attributes such that R= ∪ni=1Ri, then the col-lection of schemas (R1; π_R₁(F)), . . . , (Rn; π_R_n(F)) is a decomposition. Since, given an attribute set in a decomposition, we can always determine the corresponding set of FDs by taking a projection, it is customary to omit the FDs when specifying schema decompositions.

Constructing decompositions thus involves computing projections of FDs. This computation involves calculating the closure ofF,6which can be exponential in the size ofF. If its cost is factored into the cost of checking for dependency preservation, this checking is exponential as well. In this regard, it is interesting that, in order to test a decomposition for losslessness, one does not need to compute the FDs that belong to the local schemas. The test presented earlier used the original setF of FDs and is polynomial in the size ofF.

To summarize, we considered two important properties of schema decompo-sition: losslessness and dependency preservation. We also saw an example of a relation (HasAccount) that has a lossless but not dependency-preserving decom-position into BCNF (and which, as we shall see, does not have a BCNF decomposi-tion that has both properties). Which of these two properties is more important? The answer is that losslessness is mandatory while dependency preservation, though very desirable, is optional. The reason is that lossy decompositions lose information

(22)

contained in the original database, and this is not acceptable. In contrast, decom-positions that do not preserve FDs only lead to computational overhead when the database is changed and inter-relational constraints need to be checked.

8.7 AN ALGORITHM FOR BCNF DECOMPOSITION

We are now ready to present our first decomposition algorithm. Let R= (R; F) be a relational schema that is not in BCNF. The algorithm in Figure 8.8 constructs a new decomposition by repeatedly splitting R into smaller subschemas, so that at each step the new database schema has strictly fewer FDs that violate BCNF than does the schema in the previous iteration (see exercise 8.13). Thus, the algorithm always terminates and all schemas in the result are in BCNF.

To see how the BCNF decomposition algorithm works, consider the Hascount example once again. This schema is not in BCNF, because of the FD Ac-countNumber→ OfficeId whose left-hand side is not a superkey. Therefore, we can use this FD to split HasAccount in the while loop of Figure 8.8. The result is, not surprisingly, the decomposition we saw in (8.8).

The next example is more complex and also much more abstract. Consider a relation schema, R=(R; F), where R = ABCDEFGH (recall that A, B, etc., denote attribute names), and let the setF of FDs be

ABH→ C A→ DE BGH→ F

F→ ADH

BH→ GE

To apply the BCNF decomposition algorithm, we first need to identify the FDs that violate BCNF—those whose left-hand side is not a superkey. We can see that the first FD is not one of these, because the attribute closure (ABH)+(computed with the algorithm in Figure 8.3) contains all schema attributes and so ABH is a superkey. Figure 8.8 Lossless decomposition into BCNF.

Input: R = (R; F) Decomposition := R

whilethere is a schema S= (S; F)in Decomposition that is not in BCNF do /* Let X→ Y be an FD in F such that X Y ⊆ S

and it violates BCNF. Decompose using this FD */ Replace S in Decomposition with schemas S1= (X Y; F1)

and S2 = (S − Y) ∪ X; F2, whereF1andF2are all the FDs

fromFthat involve only attributes in their respective schemas end

(23)

8.7 An Algorithm for BCNF Decomposition 233

However, the second FD, A→ DE, does violate BCNF: The attribute closure of A is ADE, and so A is not a superkey. We can thus split R using this FD:

R1= (ADE; {A → DE})

R2= (ABCFGH; {ABH → C, BGH → F, F → AH, BH → G})

Notice that we split F→ ADH into {F → AH, F → D} and BH → GE into {BH →

G, BH→ E} and that some FDs fell by the wayside: F → D and BH → E no longer

have a home, since none of the new schemas contains all the attributes used by these FDs. However, things are still looking bright, since the FD F→ D can be derived from other FDs embedded in the new schemas R1and R2: F→ AH and A → DE.

Thus, if the validity of these two dependencies is maintained, the validity of F→ D is necessarily maintained as well. Checking the two dependencies is easy because the attributes of each are confined to a single relation. Similarly, BH→ E can still be derived because the attribute closure of BH, restricted to the FDs embedded in R1and R2, contains E. (Verify using the algorithm in Figure 8.3!) The above

decomposition is therefore dependency preserving.

It is easy to see that R1 is in BCNF, so the BCNF decomposition algorithm

must concentrate on R2. The FDs ABH→ C and BGH → F did not violate BCNF

in R since both ABH and BGH are superkeys. As a result, they do not violate BCNF in R2(which has only a subset of the attributes of R). The FD that clearly

violates BCNF here is F→ AH, so the algorithm might pick it up and split R2

accordingly.

R21= (FAH; {F → AH}) R22= (FBCG; { })

Now both schemas, R21and R22, are in BCNF (R22trivially so because it has no FDs).

However, the price is that the FDs ABH→ C, BGH → F, and BH → G that were present in R2are now homeless. Furthermore, none of these FDs can be derived

using the FDs that are still embedded in R1, R21, and R22. For instance,

comput-ing (ABH)+with respect to this set of FDs yields ABHDE, which does not contain

C, so ABH→ C is not derivable. Thus, we obtain a non-dependency–preserving

decomposition of R into three BCNF schemas: R1, R21, and R22.

This decomposition is by no means unique. For instance, if our algorithm had picked up F→ ADH at the very first iteration, the first decomposition would have been

R₁= (FADH; {F → ADH}) R₂= (FBCEG; { })

This is not the end of the differences: While both of these schemas are in BCNF, some FDs that were present in the decomposition into R1, R21, R22are now lost. Properties of the BCNF decomposition algorithm. First and foremost, the BCNF decomposition algorithm in Figure 8.8 always yields a lossless decomposition. To see this, consider the two schemas involving attribute sets X Y and (S− Y) ∪ X that

(24)

replace the schema S= (S, F)in the algorithm. Notice that X Y∩ ((S − Y) ∪ X) = X and thus X Y∩ ((S − Y) ∪ X) → X Y since X → Y ∈ F. Therefore, according to the losslessness test for binary decompositions on page 227,{X Y, (S − Y) ∪ X} is a lossless decomposition of S. This means that at every step in our algorithm we replace one schema by its lossless decomposition. Thus, by Exercise 8.12, the final decomposition produced by this algorithm is also lossless.

Are the decompositions produced by the BCNF algorithm always dependency preserving? We have seen that this is not the case. Decomposition (8.8) of HasAc-count is not dependency preserving. Moreover, it is easy to see that no BCNF decomposition of HasAccount (not only those produced by this particular al-gorithm) is both lossless and dependency preserving. Indeed, there are just three decompositions to try, and we can simply check them all.

Finally, we have seen that the BCNF decomposition algorithm is nondetermin-istic. The final result depends on the order in which FDs are selected in the while loop. The decomposition chosen by the database designer might be a matter of taste, or it might be based on objective criteria. For instance, some decompositions might be dependency preserving, others not; some might lead to fewer FDs left out as inter-relational constraints (e.g., the decomposition R1, R21, R22is better in

this sense than the decomposition R₁, R₂). Some attribute sets might be more likely to be queried together so they better not be separated in the decomposition. The next section describes one common approach that can help in choosing one BCNF decomposition over another.

8.8 SYNTHESIS OF 3NF SCHEMAS

We have seen that some schemas (such as HasAccount, in Figure 8.7) cannot be decomposed into BCNF so that the result is also dependency preserving. However, if we agree to settle for 3NF instead of BCNF, dependency-preserving decomposi-tions are always possible (but recall that 3NF schemas might contain redundancies). Before we present a 3NF decomposition algorithm, we introduce the concept of

minimal cover, which is very simple. We know that sets of FDs might look completely

different but nonetheless be logically equivalent. Figure 8.4 presented one fairly straightforward way of testing equivalence. Since there might be many sets of FDs equivalent to any given set, we question whether there is a set of FDs that can be viewed as “canonical.” It turns out that defining a unique canonical set is not an easy task, but the notion of minimal cover comes close.

8.8.1 Minimal Cover

LetF be a set of FDs. A minimal cover of F is a set of FDs, G, that has the following properties:

1. G is equivalent to F (but, possibly, different from F).

(25)

8.8 Synthesis of 3NF Schemas 235

3. It is not possible to makeG “smaller” (and still satisfy the first two properties) by either of the following:

a. Deleting an FD

b. Deleting an attribute from an FD

Clearly, because of the rule of decomposition for functional dependencies, it is easy to convertF into an equivalent set of FDs where the right-hand sides are singleton attributes. However, property 3 is more subtle. Before presenting an algorithm for computing minimal covers, we illustrate it with a concrete example.

Consider the attribute set ABCDEFGH and the following set,F, of FDs:

ABH→ C A→ D C→ E BGH→ F F→ AD E→ F BH→ E

Since not all right-hand sides are single attributes, we can use the decomposition rule to obtain an FD set that satisfies the first two properties of minimal covers.

ABH→ C A→ D C→ E BGH→ F (8.9) F→ A F→ D E→ F BH→ E

We can see that BGH→ F is entailed by BH → E and E → F, and that F → D is entailed by F→ A and A → D. Thus, we are left with

ABH→ C A→ D C→ E (8.10) F→ A E→ F BH→ E

It is easy to check by computing attribute closures that none of these FDs is redun-dant; that is, one cannot simply throw out an FD from this set without sacrificing equivalence to the original setF. However, is the resulting set a minimal cover of F? The answer turns out to be no because it is possible to delete the attribute A from

(26)

Figure 8.9 Computation of a minimal cover. Input:a set of FDsF

Output:G, a minimal cover of F

Step 1:G := F, where all FDs are converted to use singleton-attributes on

the right-hand side

Step 2: Remove all redundant attributes from the left-hand sides

of FDs inG

Step 3: Remove all redundant FDs fromG returnG

the first FD. ABH→ C is entailed by the set

BH→ C A→ D C→ E F→ A E→ F BH→ E (8.11)

and BH→ C is entailed by set (8.10). The first fact can be verified by computing the attribute closure of ABH with respect to (8.11) and the second by computing the attribute closure of BH with respect to (8.10). It then follows easily that the two sets are equivalent, and as a result (8.10) is not a minimal cover. Interestingly, after the removal of the redundant attribute A, the set (8.11) is still nonminimal because

BH→ E is redundant. Removing this FD yields a minimal cover at last.

The algorithm for computing minimal covers is presented in Figure 8.9. Step 1 is performed by a simple splitting of the FDs according to their right-hand sides. For instance, X→ AB turns into X → A and X → B.

Step 2 is performed by checking every left-hand attribute inG for redundancy. That is, for every FD X→ A ∈ G and every attribute B ∈ X, we have to check if

(X− B) → A is entailed by G—very tedious work if done manually. In the above

example, we performed this step when we checked that BH→ C is entailed by the FD set (8.10), which allowed us to get rid of the redundant attribute A in ABH→ C. Step 3 is accomplished by another tedious algorithm: For every g∈ G, check that the FD g is entailed byG − g.

An important observation about the algorithm in Figure 8.9 is that steps 2 and 3 cannot be done in a different order: Performing step 3 before 2 will not always return a minimal cover. In fact, we have already seen this phenomenon: We obtained set (8.10) by removing redundant FDs from (8.9); then we obtained set (8.11) by deleting redundant attributes. Nevertheless, the result still had a redundant FD (BH→ E). On the other hand, if we first remove the redundant attributes from (8.9), we get

(27)

8.8 Synthesis of 3NF Schemas 237 BH→ C A→ D C→ E BH→ F F→ A F→ D E→ F BH→ E

Then removing the redundant FDs BH→ F, F → D, and BH → E yields the follow-ing minimal cover:

BH→ C A→ D C→ E F→ A E→ F (8.12)

8.8.2 3NF Decomposition through Schema Synthesis

The algorithm for constructing dependency-preserving 3NF decompositions works very differently from its BCNF counterpart. Instead of starting with one big schema and successively splitting it, the 3NF algorithm starts with individual attributes and groups them into schemas. For this reason, it is called 3NF synthesis. Given a schema, R= (R; F), where R is a superset of attributes and F is a set of FDs, the algorithm carries out four steps.

1. Find a minimal cover,G, for F.

2. PartitionG into FD sets G1, . . . ,Gn, such that eachGiconsists of FDs that share the same left-hand side. (It is not necessary to assume that differentG_is have different hand sides, but it is usually a good idea to merge sets whose left-hand sides are the same.)

3. For each Gi, form a relation schema, Ri= (Ri;Gi), where Ri is the set of all attributes mentioned inGi.

4. If one of the Ris, is a superkey of R (i.e., (Ri)+_F = R), we are done — R1, . . . , Rn is the desired decomposition. If no Ri is a superkey of R, let R0be a set of

attributes that constitutes a key of R, and let R0= (R0; { }) be a new schema.

Then R0, R1, . . . , Rnis the desired decomposition.

Note that the collection of schemas obtained after step 3 might not even be a decomposition because some attributes of R might be missing (see example below). However, any such missing attributes will be recaptured in step 4, because these attributes must be part of the key of R (exercise 8.25).

It is easy to see that the result of the above algorithm is a dependency-preserving decomposition. By construction, every FD inG has a home, so G = ∪Gi. Since, by definition of minimal covers,G+= F+,F is preserved.