THE FOURTH NORMAL FORM - Chapter 8. Database Design II: Relational Normalization Theory

This decomposition is lossless because we first obtained a lossless 3NF decomposition and then applied the BCNF algorithm, which preserves losslessness. It is not dependency preserving, however, since St C Sem→ P and C Sem T → P are not represented in the above schemas.

8.9 THE FOURTH NORMAL FORM

Not all of the world’s problems are due to bad FDs. Consider the following schema:

Person(SSN, PhoneN, ChildSSN) (8.13)

where we assume that a person can have several phone numbers and several children. Here is one possible relation instance.

SSN PhoneN ChildSSN 111-22-3333 516-123-4567 222-33-4444 111-22-3333 516-345-6789 222-33-4444 111-22-3333 516-123-4567 333-44-5555 111-22-3333 516-345-6789 333-44-5555 (8.14) 222-33-4444 212-987-6543 444-55-6666 222-33-4444 212-987-1111 555-66-7777 222-33-4444 212-987-6543 555-66-7777 222-33-4444 212-987-1111 444-55-6666

As there are no nontrivial functional dependencies, this schema is in 3NF and even BCNF. Nonetheless, it is clearly not a good design as it exhibits a great deal of redundancy. There is no particular association between phone numbers and children, except through the SSN, so every child item related to a given SSN must occur in one tuple with every PhoneN related to the same SSN. Thus, whenever a phone number is added or deleted, several tuples might need to be added or deleted as well. If a person gives up all phone numbers, the information about her children will be lost (orNULLvalues will have to be used).

It might seem that a compression technique can help here. For instance, we might decide to store only some tuples as long as there is a way to reconstruct the original information. SSN PhoneN ChildSSN 111-22-3333 516-123-4567 222-33-4444 111-22-3333 516-345-6789 333-44-5555 222-33-4444 212-987-6543 444-55-6666 222-33-4444 212-987-1111 555-66-7777

Still, although this is more efficient, it solves none of the aforesaid anomalies. Also, it imposes an additional burden on the applications, which now must be aware of the compression schema.

In our discussion of BCNF, we concluded that redundancy arises when a particular semantic relationship among attribute values is stored more than once. In Figure 5.2 on page 93, the fact that the person with SSN 111111111 lives at 123 Main St. is an example of that—it is stored three times. In that case, the problem was traced back to the functional dependency that relates SSN and Address and the fact that SSN is not a key (and hence there can be several rows with the same SSNvalue). The redundant storage of a semantic relationship, however, is not lim- ited to this situation. In the relation r shown in (8.14), the relationships SSN-PhoneN and SSN-ChildSSN are stored multiple times and there are no FDs involved. The problem arises here because there are several attributes—in this case PhoneN and ChildSSN—that have the property that their sets of values are associated with a single value of another attribute—in this case SSN. A relationship between a particular SSN value and a particular PhoneN value is stored as many times as there are children of the person with that SSN. Note that the relation satisfies the following property:

r= πSSN,PhoneN(r) πSSN,ChildSSN(r) (8.15)

When (8.15) is required of all legal instances of a schema, this property is known as join dependency. A join dependency can arise when characteristics of an enter- prise are described by sets of values. With the E-R approach to database design, we saw that such characteristics are represented as set-valued attributes and that translating them into attributes in the relational model is awkward. In particular, when an entity type or a relationship type has several set-valued attributes, a join dependency results.

Condition (8.15) should look familiar to you. It guarantees that a decomposition of r into the two tables π_SSN,PhoneN(r)and π_SSN,ChildSSN(r)will be lossless. That is certainly true in this case, but it is not our immediate concern. The condition also tells us something about r: A join dependency indicates that semantic relationships can be stored redundantly in an instance of r.

Formally, let R be a set of attributes. A join dependency (JD) is a constraint of the form

R= R1 · · · Rn

where R1, . . . , Rnis a decomposition of R. Recall that earlier (page 216) we defined the notion of satisfaction of FDs by relational instances. We now define the same notion for JDs: A relation instance, r, over R satisfies the above join dependency if

r= π_R

1(r) · · · πRn(r)

Let R= (R; Constraints) be a relational schema, where R is a set of attributes and

8.9 The Fourth Normal Form 243

set of attributes R is a legal instance of R if and only if it satisfies all constraints in

Constraints.

Of particular interest are so-called binary join dependencies, also known as multivalued dependencies (MVD). These are JDs of the form R= R1 R2. The

redundancy exhibited by the relation schema Person (8.13) was caused by this particular type of join dependency. The fourth normal form, introduced in [Fagin 1977], is designed to prevent redundancies of this type.

MVDs constrain instances of a relation in the same way that FDs do, so a description of a relation schema must include both. As a result, we describe a relation schema, R, as (R;D), where D is now a set of FDs and MVDs. Entailment of JDs is defined in the same way as entailment of FDs. LetS be a set of JDs (and possibly FDs) and d be a JD (or an FD). ThenS entails d if every relation instance r that satisfies all dependencies inS also satisfies d. In the next section, we show how an MVD might be entailed by a set of MVDs. With this in mind, a relation schema, R= (R; D), is said to be in fourth normal form (4NF) if, for every MVD R = X Y that is entailed byD, either of the following is true:

X⊆ Y or Y ⊆ X (the MVD is trivial).

X∩ Y is a superkey of R ((X ∩ Y) → R is entailed by D).

It is easy to see that Person is not a 4NF schema, because the MVD Person= {SSN PhoneN} {SSN ChildSSN} holds whereas SSN = {SSN PhoneN} ∩ {SSN ChildSSN}is not a superkey. What is the intuition here? If SSN were a superkey, then for each value of SSN there would be at most one value of PhoneN and one value of ChildSSN and hence no redundancy.

4NF and BCNF. As it turns out, 4NF schemas are also BCNF schemas (i.e., 4NF closes the loopholes that BCNF leaves behind). To see this, suppose that R= (R; D) is a 4NF schema and X→ Y is a nontrivial functional dependency that holds in R. To show that 4NF schemas are also BCNF schemas we must demonstrate that X is a superkey of R. For simplicity, assume that X and Y are disjoint. Then R1= X Y, R2= R − Y is a lossless decomposition of R, which follows directly from the test

for losslessness of binary schema decompositions presented in Section 8.6.1.Thus,

R= R1 R2is a binary join dependency, i.e., an MVD that holds in R. But by the

definition of 4NF it follows that either X Y⊆ R − Y (an impossibility) or R − Y ⊆ X Y (which means that R= X Y and X is a superkey) or R1∩ R2(= X) is a superkey. This

means that every nontrivial FD in R satisfies the BCNF requirements.

It can also be shown (but it is harder to do so) that if R= (R; D) is such that D consists only of FDs, then R is in 4NF if and only if it is in BCNF (see [Fagin 1977]). In other words, 4NF is an extension of the requirements for BCNF to design environments where MVDs, in addition to FDs, must be specified.

Designing 4NF schemas. Because 4NF implies BCNF, we cannot hope to find a general algorithm for constructing a dependency-preserving and lossless decomposition of an arbitrary relation into relations in 4NF. However, as with BCNF, a lossless decomposition into 4NF can always be achieved. Such an algorithm is very similar to that for BCNF. It is an iterative process that starts with the original schema and at each stage yields decompositions that have fewer MVDs that violate 4NF: If Ri= (Ri; Di)is such an intermediate schema andDientails an MVD, Ri= X Y, that violates 4NF, the algorithm replaces Ri with a pair of schemas (X; Di,1)and (Y; Di,2)—neither of which have that MVD. Eventually, there will be no MVDs left that violate the requirements for 4NF.

Two important points regarding this algorithm need to be emphasized. First, if R_i= (R_i; D_i) is a schema and S→ T ∈ D (for simplicity, assume that S and T are disjoint), then this FD implies the MVD R_i= ST (R_i− T). Thus, the 4NF de- composition algorithm can treat FDs as MVDs. The other non-obvious issue in the 4NF decomposition algorithm has to do with determining the set of dependencies that hold in the decomposition. That is, if Ri= (Ri;Di) is decomposed with respect to the MVD Ri= X Y, what is the set of dependencies that is expected to hold over the attributes X and Y in the resulting decomposition? The answer is π_X(D+_i )

and π_Y(D_i+)—the projections ofD_i+on X and Y. HereD_i+is the closure ofDi, i.e., the set of all FDs and MVDs entailed byDi(the optional Section 8.10 provides a set of inference rules for MVD entailment). Projection of FDs on a set of attributes has been defined in Section 8.6.2. For an MVD, Ri= V W in Di+, its projection π_X(Ri= V W) is defined as X = (X ∩ V) (X ∩ W), if V ∩ W ⊆ X, and it is unde- fined otherwise. It can easily be verified that the projection rule for MVDs is sound (see exercise 8.22).

To illustrate the algorithm, consider a schema with attributes ABCD and the MVDs ABCD= AB BCD, ABCD = ACD BD, and ABCD = ABC BCD. Ap- plying the first MVD, we obtain the following decomposition: AB, BCD. Projection of the remaining MVDs on AB is undefined. Projection of the third MVD on BCD is BCD= BC BCD, which is a trivial MVD, and projection of the second MVD on BCD is BCD= CD BD. Thus, we can decompose BCD with respect to this last MVD, which yields the following final result: AB, BD, CD. Note that if we first de- composed ABCD with respect to the third MVD, the final result would be different:

AB, BC, BD, CD.

The design theory for 4NF is not as well developed as those for 3NF and BCNF, and very few algorithms are known. The basic recommendation is to start with a decomposition into 3NF and then proceed with the above algorithm and further decompose the offending (non-4NF) schemas. On a more sophisticated level, the work reported in [Beeri and Kifer 1986a, 1986b, 1987], among others, develop a design theory and the corresponding algorithms that can rectify design problems by synthesizing new (!) attributes. These advanced issues are briefly surveyed in the next section.

In document Chapter 8. Database Design II: Relational Normalization Theory (Page 31-35)