Testing essentiality - Essential FDs - On fast and space efficient database normalization : a d

3.5 Essential FDs

3.5.2 Testing essentiality

Theorems 3.75 and 3.78 show that all essential FDs can be derived while avoiding inessential ones. However, to do this, we require a test to recognize inessential FDs.

We will develop such a test next, and start by providing an efficient criteria for testing whether a FD is forced.

Theorem 3.79. An atomic FD X →A∈Σ∗a _{is not forced iff there exists a} _B _∈_X _with

X∗_\_AB _→_A_∈_Σ∗_.

Proof. Let X → A be not forced, i.e., Σ∗a _{\ {}_X _→ _A_} _² _X _→ _A_{. By looking at the}

closure algorithm we see that there exists some Y →A ∈Σ∗a_{\ {}_X _→_A_} _with _Y _⊆ _X∗_.

As Y is a minimal LHS forA it cannot includeX (else it were equal toX), so there must be some B ∈X\Y. Thus Y ⊆X∗_\_AB _{which shows}_X∗ _\_AB _→_A_∈_Σ∗_.

Now let X∗ _\_AB _→_A _∈ _Σ∗ _{for some} _B _∈ _X_{. For every} _A

i ∈ X∗\XA there exists

a minimal set Ui ⊆ X with Ui → Ai ∈ Σ∗a\ {X →A}. As well, there exists a minimal

U ⊆X∗_\_AB _with _U _→_A_∈_Σ∗a_{\ {}_X _→_A_{}. Together they imply} _X_→_A_{, thus} _X _→_A

is not forced.

As all forced FDs appear in every canonical cover of Σ, computing the set of all forced FDs in Σ∗a _{is easy. What we really need though is a criterion for testing whether a FD}

is essential. This, however, is an NP-complete problem, which we will show by reducing the following problem to it, which is known to be NP-complete [31]:

Problem “prime attribute”

Given a set Σ of FDs on schemaR and an attribute A∈R, is A a prime attribute, i.e., does A lie in a minimal key of R?

Theorem 3.80. Given a set Σof FDs, the problem of deciding whether a FD is essential is NP-complete.

Proof. To verify that a FD is essential, we only need to guess the canonical cover con- taining it. By Theorem 3.55 the size of any canonical cover is polynomial in Σ, so the problem lies in NP.

We prove completeness by reducing the “prime attribute” problem to it. Let G be a set of FDs on R, and A /∈R an additional attribute. We construct Σ as

Σ :=G∪ {A →Ai |Ai ∈R}

We claim that Ai is a prime attribute w.r.t. G iff A → Ai is essential w.r.t. Σ. Since A

does not appear in any FD in G we have

Σ∗a ₌_G∗a_{∪ {}_A_→_A

i |Ai ∈R}

Now let Σ0 _{be any canonical cover of Σ, and let Σ}0

A :={A → Ai ∈ Σ0}. Clearly the FD

A →R ∈ Σ∗ _{is implied by Σ}0 _iff _A _→ _K _{is implied by Σ}0

A for some minimal key K of R

(w.r.t. G). This is the case iff Σ0

A consists of exactly (since Σ0 is non-redundant) those

FDs A→Ai for which Ai ∈K. Thus A→Ai is essential iffAi is prime.

From this, we can deduce that identifying the (minimal) autonomous sets of CC(Σ) is difficult as well.

Theorem 3.81. Given a set Σ of FDs and an atomic FD X →A∈Σ∗a_{, the problem of}

deciding whether the set {X →A} is autonomous for CC(Σ) is co-NP-complete.

Proof. {X → A} is autonomous iff X → A appears in all or no canonical covers of Σ. Thus, if {X → A} is not autonomous, we only need to guess one canonical cover which contains X → A, and one which does not. By Theorem 3.55 the size of these canonical covers is polynomial in Σ. This shows that the problem lies in co-NP.

To show co-NP-hardness, we use that it is NP-hard to decide whether a FDX →A is essential. Let Σ0 _{be any canonical cover of Σ. Such a canonical cover Σ}0 _{can be computed}

in polynomial time. By definition, X → A is essential iff it appears in some, but not necessarily all canonical covers of Σ. We distinguish two cases.

(1) If X →A∈Σ0 _then _X _→_A _{is essential.}

(2) If X →A /∈Σ0 _then _X _→_A _{is essential iff it appears in some but not all canonical}

covers of Σ, i.e., iff {X →A} is not autonomous.

We thus reduced the NP-hard problem of deciding whether X → A is essential to the problem of deciding whether {X →A} is not autonomous.

The last theorem shows that the autonomous set problem is hard when given Σ. However, when we try to find autonomous sets of CC(Σ), we first compute Σ∗a_{, which}

can be exponential in the size of Σ. Thus it might be possible to decide whether a given set is autonomous in time polynomial in the size of Σ∗a_{. We show next that this is not}

the case.

Theorem 3.82. Given a set Σ of FDs, its atomic closure Σ∗a _{and an atomic FD} _X _→

A ∈Σ∗a_{, the problem of deciding whether the set} _{_X _→_A_} _{is autonomous for} _CC_(Σ) _is

co-NP-complete.

Proof (Sketch). In [31] the NP-hardness of the “prime attribute” problem is shown by first reducing the “vertex cover” problem to the “key of cardinality m” problem, and then in turn to “prime attribute”. We will describe a slightly modified version of this reduction.

LetGbe the graph for the vertex cover, Σcard the set of FDs for the “key of cardinality

m” problem (denoted D[0]0 _{in [31]), and Σ}

prime the set of FDs for the “prime attribute”

problem (denoted D[0] in [31]). For the first reduction, Σcard is constructed as follows:

Σcard:={N(v)→v |v is vertex in G}

where N(v) is the neighborhood of v inG.

The second reduction is more complicated. We give a modified version next (the modification occurs in condition (ii) below), which can be shown to be correct as in [31]. Let A0 _{be the set of attributes occurring in Σ}

card,A00 of cardinality m <|A0| and b a new

attribute. The attribute set for the “prime attribute” problem is thenA =A0_∪_[_A00_×_A0_],

and Σprime is constructed as follows:

(i) for E →F ∈Σcard add {b} ∪E →F to Σprime

(ii) for e∈A0 _add _{_e_{} →}_A00_{× {}_e_} _{to Σ}

prime

(iii) for i∈A00 _and _e_∈_A0 _add _{_b,₍_{i, e}_{)} → {}_e_} _{to Σ}

(iv) fori∈A00 _and _{e, f} _∈_A0 _{distinct add} _{(_{i, e}₎_,₍_{i, f}_{)} → {}_b_} _{to Σ}

prime

(v) for e∈A0 _add _{_e_{} → {}_b_} _{to Σ}

prime

Using these constructions, we want to establish some bounds on the size of the LHSs of FDs. This can be done by verifying the following statements.

• The size of the LHS of a FD in Σcard is the degree of the corresponding vertex in G

• Σ∗a

card= Σcard

• The LHS of a FD in Σ∗a

prime is at most one larger that the LHS of a FD in Σ∗carda

The reductions from the “prime attribute” problem to the problems “essential FD” and then “autonomous set” do not increase the size of the LHSs of atomic FDs. Thus the maximal size of the LHSs of FDs in Σ∗a _{is at most one larger that the maximal degree}

of vertices in G. It has been shown in [17] that the vertex cover problem is still NP-hard for graphs of maximal degree 3. These cases reduce to instances of the “autonomous set” problem for which all FDs in Σ∗a _{contain at most 4 attributes in their LHS. But there}

exist less than k5 _{such FDs, where} _k _{is the number of different attributes occuring in}

Σ, so the size of Σ∗a _{is polynomial in the size of Σ, and can therefore be computed in}

polynomial time by Lemma 2.13. This shows the theorem.

In the following we will construct some sufficient (but not necessary) criteria that a FD is inessential. Recall thatN(X →A) denotes the neighborhood ofX →AinT r(CC(Σ)).

Lemma 3.83. X →A∈Σ∗a _{is essential iff} _Σ∗a_\_N₍_X _→_A₎₂_X _→_A_.

Proof. If X → A is essential then Σ∗a_\_N₍_X _→ _A₎_{∪ {}_X _→ _A_} _{intersects with every}

minimal transversal and thus is a cover of Σ. As Σ∗a_\_N₍_X _→_A_{) is not a cover of Σ by}

Lemma 3.31, Σ∗a_\_N₍_X _→_A_{) cannot imply} _X _→_A_.

If X → A is inessential then N(X → A) = {X → A}, and Σ∗a_{\ {}_X _→ _A_} _implies

X →A.

Thus, if we can find a set S ⊆Σ∗a_\_N₍_X _→_A_{) with} _S _²_X _→ _A _then _X _→ _A _must

be inessential. Instead of Σ∗a _{we use only Σ (making Σ canonical if needed). Determining}

which FDs in Σ are adjacent to X →A is no easier than testing essentiality (since every essential FD has an adjacent FD in Σ). However, we do have some criteria that assure that a FD Y →B is not adjacent to X →A: Clearly all forced FDs are adjacent only to themselves, and adjacent FDs have equivalent LHSs by Lemma 3.33. Thus we construct

S from all forced FDs and all FDs in Σ with LHS smaller4 _than_X_{, as FDs with LHS not}

implied by X cannot be used in deriving X →A. Note that we do not lose anything by using Σ instead of Σ∗a_{: The FDs in Σ with LHS smaller than} _X _{form a cover for the set}

of all FDs in Σ∗a _{with LHS smaller than} _X _{by Lemma 2.35.}

We can adapt the basic “linear resolution” algorithm to compute all essential FDs (and possibly more due to the inaccuracy of our essentiality test) as follows:

Algorithm “essential linear resolution” INPUT: set of FDs Σ

OUTPUT: superset of Σ∗e

compute a canonical cover Σ0 _{of Σ}

FΣ:=∅ for all X →A∈Σ0 _do if is forced(X →A) then add X →A to FΣ end Σ∗e _{:= Σ}0 for all Y →B ∈Σ∗e _do

for all X →A∈Σ0 _with _A_∈_{Y, B /}_∈_X _do

// derive (XY \A)→B by rule (2.1) find U ⊆(XY \A) withU →B atomic if U →B /∈Σ∗e _{and is essential(}_U _→_B_{) then}

add U →B to Σ∗e

end end

// compute essentially cartesian closure for all X ∈LHS(Σ∗e_{) do} RHSX :=X∗\X for all A∈X do remove (X\A)∗ fromRHSX end for all B ∈RHSX do

if X →B /∈Σ∗e _{and is essential(}_X _→_B_{) then}

add X →B to Σ∗e

end end

function is essential(X →A)

smaller(X →A) :={Y →B ∈Σ|Y ⊆X∗_∧_X _*_Y∗_}

if FΣ∪smaller(X →A) implies X →A then

return false else return true function is forced(X →A) for all B ∈X do if Σ²X∗_\_AB _→_A _then return false end return true

Other optimizations such as those used in the revised linear resolution algorithm can be used in computing the essential closure (or a superset thereof) as well. Also, the test

is essential(X →B)

for computing the essentially cartesian closure can be performed for all values B at once, by computing the closure of X under FΣ∪smaller(X →B).

Chapter 4 Domination Normal Form

A common approach in designing relational databases is to start with a relation schema, which is then decomposed into multiple subschemas [10, 34, 38, 42]. A good choice of subschemas can be determined using integrity constraints defined on the schema, such as functional, multivalued or join dependencies.

Many normal forms proposed so far, such as BCNF, 4NF or KCNF, characterize the absence of redundancy [1, 44]. This is desirable for several reasons, foremost the avoidance of update anomalies [33] and minimization of storage space [9]. However, these normal forms have significant drawbacks. First, they only consider a single relation schema, instead of considering all schemas in a decomposition together. While this is not strictly true for the normal form proposed by Topor and Wang in [45], their approach only consid- ers pairs of schemas at a time, and thus cannot capture redundancy across larger schema sets.

The common generalization to multiple schemas is that the whole schema collection is in that normal form, if every schema taken individually is. But this means that those normal forms cannot capture redundancy which exists across multiple relations. As a trivial example, we can duplicate a schema. Clearly the extra schema is then superfluous in the whole schema collection, but each schema taken individually may still be redundancy free. But even if no schemas or attributes in a schema collection are superfluous, the design may not be desirable.

Example 4.1. Let R = ABCD and Σ = {AB → CD, CD → B}. Then R is not in BCNF, but has a dependency preserving BCNF decomposition into the subschemas

ABC, ABD, BCD.

For any instancerofR, the projections ofronto the schemasABCandABDtogether already take up more space than the original relationr: no tuples are lost in the projection since AB is a key, and the attributes A and B are stored twice. While we have not defined what redundancy means for multiple schemas, it seems intuitively clear that this decomposition should not be called “redundancy free”. From a storage space point-of- view, it is clearly less desirable than the original schema R.

The second big problem is that dependency preserving decompositions into these normal forms do not always exist [5]. Thus, when faced with such a case, a designer must either accept the loss of some dependencies, or cannot achieve the normal form in question.

We believe that what a normal form should do, is the following:

Characterize “good” representations (i.e., decompositions) of a schema, in such a way that a “good” representation does always exist.

In the following we will propose a normal form which meets this criterion.

4.1 Minimization as Normal Form

The approach we suggest is the following: among a set of suitable decompositions (e.g. the set of all lossless, or lossless and dependency preserving decompositions), characterize the “best” ones. We do so by defining an order on the decompositions, such that the “best” decompositions are the minimal ones with respect to that order.

This leaves the question of when to call one decomposition better than another one. The motivation for many normal forms proposed so far has been the elimination of redundancy (and with it, the absence of update-anomalies). This may suggest to define a quantitative measure of redundancy over multiple schemas, as has been done by Arenas and Libkin in [1].

We take a different approach here: instead of trying to minimize redundancy, we try to minimize the size of instances. Intuitively this should lead to similar results, an assumption which is supported by the findings of Biskup in [9], but measures for size appear easier to construct than measures for redundancy (cf. [1]). In the following we will define and motivate three different orders on decompositions, all of which intuitively compare decompositions by the size of instances for them. By proving them to be equivalent, we will establish a syntactic characterization for a semantically motivated definition.

In document On fast and space efficient database normalization : a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems at Massey University, Palmerston North, New Zealand (Page 93-99)