L(A, K[M(B, C, D)], P[Q[R(E, F)]])
    L(A, K[M(B, C, λ)], P[λ])
    L(A, K[M(λ, λ, D)], P[Q[R(E, F)]])
        L(λ, K[M(λ, λ, D)], P[Q[R(E, λ)]])
        L(A, K[M(λ, λ, D)], P[Q[R(λ, F)]])

Fig. 3.1. NLNF decomposition tree of Example 3.22.

EXAMPLE 3.23. Example 3.18 has shown that the nested attribute N =

DNA(Origin[Base],Count(A,C,G,T),Gene(Start,End,Sub[Nucleo],Translation[Amino]))

is not in NLNF with respect to the set E of FDs from Example 3.3. Since

DNA(Origin[Base]) → DNA(Count(A,C,G,T))

is neither inevitable nor is DNA(Origin[Base]) a superkey for N with respect to E, we decompose N into $N'_1$ = DNA(Origin[Base],Count(A,C,G,T)) and $N'_2$ = DNA(Origin[Base],Gene(Start,End,Sub[Nucleo],Translation[Amino])). The FD

DNA(Origin[λ],Count(A,C,G)) → DNA(Count(T))

is in $\pi_{N'_1}(E^+)$, but is neither inevitable nor is DNA(Origin[λ],Count(A,C,G)) a superkey for $N'_1$ with respect to $\pi_{N'_1}(E^+)$. Therefore, we decompose $N'_1$ into $N_1$ = DNA(Origin[λ],Count(A,C,G,T)) and $N_2$ = DNA(Origin[Base],Count(A,C,G)). The projected FDs $\pi_{N_1}(E^+)$ are covered by the set $E_1$ with the following FDs:

- DNA(Origin[λ],Count(A,C,G)) → DNA(Count(T)),

- DNA(Origin[λ],Count(A,C,T)) → DNA(Count(G)),

- DNA(Origin[λ],Count(C,G,T)) → DNA(Count(A)), and

- DNA(Count(A,C,G,T)) → DNA(Origin[λ]).

The projected FDs $\pi_{N_2}(E^+)$ are covered by $E_2$ = {DNA(Origin[Base]) → DNA(Count(A,C,G))}. $N_1$ is in NLNF with respect to $E_1$ and $N_2$ is in NLNF with respect to $E_2$. The FD

DNA(Gene(Sub[Nucleo])) → DNA(Gene(Translation[Amino]))

is an element of $\pi_{N'_2}(E^+)$, but is neither inevitable nor is DNA(Gene(Sub[Nucleo])) a superkey for $N'_2$ with respect to $\pi_{N'_2}(E^+)$. Therefore, we decompose $N'_2$ into $N_3$ = DNA(Gene(Sub[Nucleo],Translation[Amino])) and $N'_3$ = DNA(Origin[Base],Gene(Start,End,Sub[Nucleo])). The projected FDs $\pi_{N_3}(E^+)$ are covered by the set $E_3$ with the following FDs:

- DNA(Gene(Sub[Nucleo])) → DNA(Gene(Translation[Amino])),

- DNA(Gene(Sub[λ])) → DNA(Gene(Translation[λ])), and

- DNA(Gene(Translation[λ])) → DNA(Gene(Sub[λ])).

$N_3$ is again in NLNF with respect to $E_3$. The FD

DNA(Gene(Start,Sub[λ])) → DNA(Gene(End))

is in $\pi_{N'_3}(E^+)$, but is neither inevitable nor is DNA(Gene(Start,Sub[λ])) a superkey for $N'_3$ with respect to $\pi_{N'_3}(E^+)$. We decompose $N'_3$ into $N_4$ = DNA(Gene(Start,End,Sub[λ])) and $N_5$ = DNA(Origin[Base],Gene(Start,Sub[Nucleo])). The projected FDs $\pi_{N_4}(E^+)$ are covered by the set $E_4$ with the following FDs:

- DNA(Gene(Start,Sub[λ])) → DNA(Gene(End)),

- DNA(Gene(End,Sub[λ])) → DNA(Gene(Start)), and

- DNA(Gene(Start,End)) → DNA(Gene(Sub[λ])).

$N_4$ is in NLNF with respect to $E_4$. The projected FDs $\pi_{N_5}(E^+)$ are covered by the set $E_5$ = {DNA(Origin[Base],Gene(Start,Sub[λ])) → DNA(Gene(Sub[Nucleo]))}. $N_5$ is in NLNF with respect to $E_5$. The output of Algorithm 3.4.1 is therefore {($N_1$, $E_1$), ($N_2$, $E_2$), ($N_3$, $E_3$), ($N_4$, $E_4$), ($N_5$, $E_5$)}. See Figure 3.2 for an illustration. □
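The repeated "find a violating FD, then split" steps traced in this example can be mimicked on a flat relation schema, where NLNF coincides with BCNF in the absence of lists. The following is only a minimal sketch under that assumption: attributes are single letters, and the names `fd`, `closure`, `find_violation` and `bcnf_decompose` are hypothetical helpers, not part of Algorithm 3.4.1. The violation search enumerates subsets and is therefore exponential.

```python
from itertools import combinations

def fd(lhs, rhs):
    """Build an FD lhs -> rhs from two strings of single-letter attribute names."""
    return frozenset(lhs), frozenset(rhs)

def closure(attrs, fds):
    """Attribute closure of attrs under fds (standard fixpoint iteration)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def find_violation(schema, fds):
    """Return a BCNF violation (X, Y) on schema, or None if there is none.

    X -> Y is non-trivial, implied by fds, lies inside schema, and X is not a
    superkey of schema. All non-empty proper subsets are tried (exponential).
    """
    attrs = sorted(schema)
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            x = frozenset(combo)
            implied = closure(x, fds) & schema
            if implied != x and implied != schema:
                return x, implied - x
    return None

def bcnf_decompose(schema, fds):
    """Split schema along violations until every component is in BCNF."""
    schema = frozenset(schema)
    violation = find_violation(schema, fds)
    if violation is None:
        return [schema]
    x, y = violation
    # Lossless split: X with everything it determines, and schema without Y.
    return bcnf_decompose(x | y, fds) + bcnf_decompose(schema - y, fds)

E = [fd("A", "B"), fd("B", "C")]
parts = bcnf_decompose("ABCD", E)
print(sorted("".join(sorted(p)) for p in parts))  # ['AB', 'AD', 'BC']
```

Closures are always taken with respect to the original FD set and then intersected with the current component, which corresponds to working with the projection of the closure rather than with the projection of the given FDs.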

For relational databases it is well-known that any relation schema with any set of FDs defined on it can be decomposed into subschemata that are all in BCNF with respect to the projected sets of FDs. In the presence of lists, however, the situation is different. Of course, one can modify Definition 3.47 of lossless NLNF decomposition to lossless BCNF decomposition for any nested attributes. Consider the nested attribute N = L[A] where the set E of FDs on N simply consists of the single FD λ → L[λ]. The FD is not trivial and λ is not a superkey for N with respect to E. Consequently, N is not in BCNF with respect to E. However, any decomposition of L[A] must contain the nested attribute L[A] itself. Therefore, no lossless BCNF decomposition of L[A] with respect to E exists.

3.4. DECOMPOSITION INTO NLNF Sebastian Link

DNA(Origin[Base],Count(A,C,G,T),Gene(Start,End,Sub[Nucleo],Translation[Amino]))
    DNA(Origin[Base],Count(A,C,G,T))
        DNA(Origin[λ],Count(A,C,G,T))
        DNA(Origin[Base],Count(A,C,G))
    DNA(Origin[Base],Gene(Start,End,Sub[Nucleo],Translation[Amino]))
        DNA(Gene(Sub[Nucleo],Translation[Amino]))
        DNA(Origin[Base],Gene(Start,End,Sub[Nucleo]))
            DNA(Gene(Start,End,Sub[λ]))
            DNA(Origin[Base],Gene(Start,Sub[Nucleo]))

Fig. 3.2. NLNF decomposition tree of Example 3.23.

3.4.3 Problems with NLNF decomposition

Algorithm 3.4.1 generalises the well-known BCNF decomposition algorithm for relational databases, see for instance [181, p. 270]. It follows that the NLNF decomposition algorithm causes at least as many problems as its relational counterpart. The first problem is that Algorithm 3.4.1 does not execute in time polynomial in the sizes of N and E since computing a cover of $\pi_{N_i}(E^+)$ is intractable [33]. Changing the computations of $\pi_{N_1}(E^+)$ and $\pi_{N_2}(E^+)$ in lines (6) and (8) of Algorithm 3.4.1, respectively, to polynomial-time computations of $\pi_{N_1}(E)$ and $\pi_{N_2}(E)$ in the size of E leads to an algorithm which may not always output an NLNF decomposition. For example, let E = {L(A) → L(B), L(B) → L(C)} be a set of FDs defined on N = L(A, B, C, D). Then, $\pi_M(E)$ contains only trivial FDs for M = L(A, C, D), but the FDs in $\pi_M(E^+)$ are covered by {L(A) → L(C)}. It follows that if the FD L(A) → L(B) is chosen at line (4) of Algorithm 3.4.1, then M, which is not in NLNF with respect to $\pi_M(E^+)$, is in the output decomposition.

Furthermore, the cardinality of the decomposition returned by Algorithm 3.4.1 may be exponential in the cardinality of N [181, p. 271]. While checking whether N itself is in NLNF with respect to E can be done in polynomial time in the size of N and E (Theorem 3.37 and Lemma 3.30), checking whether a proper subattribute $N_i \in \mathit{Sub}(N)$ is in NLNF with respect to $\pi_{N_i}(E^+)$ is harder. The following theorem follows from Corollary 3 in [29].

Theorem 3.49. Let $N \in \mathit{NA}$ and E a set of FDs on N. The problem of deciding whether an arbitrary $N_i \in \mathit{Sub}(N)$ is in NLNF with respect to $\pi_{N_i}(E^+)$ is coNP-complete.

Proof (Sketch). The problem of deciding whether an arbitrary $N_i \in \mathit{Sub}(N)$ is not in NLNF with respect to $\pi_{N_i}(E^+)$ is in NP. According to Theorem 3.37 one guesses non-deterministically an FD $X \to Y \in \pi_{N_i}(E^+)$ and verifies in polynomial time that $X \to Y$ is not inevitable on $N_i$ with respect to $\pi_{N_i}(E^+)$ and that X is not a superkey for $N_i$ with respect to $\pi_{N_i}(E^+)$. Following Lemma 3.30, $X \to Y$ is not inevitable, if there is some $Y' \in \mathit{MaxB}(N_i)$ with $Y' \le Y$ and $Y' \not\le X$.

It remains to show that the problem of deciding whether an arbitrary $N_i \in \mathit{Sub}(N)$ is not in NLNF with respect to $\pi_{N_i}(E^+)$ is NP-hard. One can use the polynomial-time reduction of the hitting set problem [122] in [29, pp. 55-57] to the decision problem whether an arbitrary subschema of a relation schema is not in BCNF with respect to the corresponding projected set of given FDs. This is possible since every relational subschema can be represented using only null, flat and record-valued attributes, and NLNF and BCNF are equivalent in the absence of lists. Note that in this case inevitable FDs are simply trivial FDs. □
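The gap between projecting E and projecting its closure, shown before Theorem 3.49 with E = {L(A) → L(B), L(B) → L(C)} and M = L(A, C, D), can be reproduced on a flat schema. This is a sketch under the assumption that record-valued attributes over flat attributes behave like a relation schema; `project_given` and `project_closure` are hypothetical names, and the second function enumerates all subsets of the subschema, in line with the intractability discussed above.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds (standard fixpoint iteration)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def project_given(fds, sub):
    """Cheap projection: keep only the given FDs that lie entirely inside sub."""
    return [(lhs, rhs) for lhs, rhs in fds if lhs <= sub and rhs <= sub]

def project_closure(fds, sub):
    """Cover of the projection of the closure onto sub.

    For every non-empty X inside sub, record X -> (closure of X within sub,
    minus X) whenever that right-hand side is non-empty.
    """
    cover = []
    attrs = sorted(sub)
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            x = frozenset(combo)
            rhs = (closure(x, fds) & sub) - x
            if rhs:
                cover.append((x, rhs))
    return cover

E = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
M = frozenset("ACD")

cheap = project_given(E, M)    # no FD of E lies inside M
full = project_closure(E, M)   # contains A -> C, obtained via B
a_is_superkey = closure(frozenset("A"), E) >= M  # False: D is not determined
```

With the cheap projection M looks violation-free, while the full projection exposes A → C with A not a superkey for M, which is the situation described for line (4) of Algorithm 3.4.1.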

For relational databases a polynomial-time algorithm in the sizes of a relation schema R and E that outputs a lossless BCNF decomposition with respect to E has been proposed in [270] . It is the subject of future research to generalise this algorithm to the context of lists.

We have seen that it is always possible to achieve a lossless NLNF decomposition. Unfortunately, losslessness is not the only desirable property of a decomposition. An output {($N_1$, $E_1$), ..., ($N_k$, $E_k$)} should only be considered equivalent to (N, E) in case the semantic information in $\bigcup_{i=1}^{k} E_i$ is equivalent to the semantic information in E. This means that it is not only necessary not to lose any information regarding the database itself, but also not to lose any information regarding the semantic properties that this database carries. In other words, the dependencies must have been preserved at the end of the decomposition process.

Definition 3.50. Let $N \in \mathit{NA}$ and E a set of FDs on N. A lossless join decomposition {$N_1$, ..., $N_k$} of N is called dependency-preserving with respect to E if and only if

$$E^* = \Bigl(\bigcup_{i=1}^{k} \pi_{N_i}(E^*)\Bigr)^*$$ □
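For flat relation schemas the condition of Definition 3.50 can be tested without materialising the projections, using the textbook procedure that grows the left-hand side of each FD with closures restricted to the components. The sketch below assumes single-letter flat attributes; `is_preserved` and `is_dependency_preserving` are hypothetical names and do not handle lists or other nested attributes.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (standard fixpoint iteration)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def is_preserved(fd, parts, fds):
    """Check one FD using only closures taken inside the components.

    Z starts as the left-hand side and repeatedly absorbs the closure of
    (Z restricted to a part), again restricted to that part.
    """
    lhs, rhs = fd
    z = set(lhs)
    changed = True
    while changed:
        changed = False
        for part in parts:
            gained = (closure(z & part, fds) & part) - z
            if gained:
                z |= gained
                changed = True
    return rhs <= z

def is_dependency_preserving(parts, fds):
    """True if every given FD follows from the projected FDs of the parts."""
    return all(is_preserved(fd, parts, fds) for fd in fds)

E = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
good = [frozenset("AB"), frozenset("BC")]  # lossless, keeps A -> B and B -> C
bad = [frozenset("AB"), frozenset("AC")]   # lossless, but B -> C is lost
print(is_dependency_preserving(good, E))  # True
print(is_dependency_preserving(bad, E))   # False
```

Both decompositions in the usage example are lossless, so the difference between them is exactly the property the definition isolates.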

We can see that the lossless NLNF decomposition in Example 3.22 is indeed dependency-preserving. What about the decomposition of our GenBank example?

EXAMPLE 3.24. Consider the decomposition of the GenBank database from Example 3.23. Define $\Theta$ as $\bigcup_{i=1}^{5} \pi_{N_i}(E^+)$. The decomposition is dependency-preserving if and only if $E^+ \subseteq \Theta^+$. All FDs in E are also in $\Theta$ except

DNA(Origin[Base]) → DNA(Count(A,C,G,T))

and a second FD with left-hand side DNA(Origin[Base],Gene(Start,End)).

The closure of DNA(Origin[Base]) with respect to $\Theta^+$ is DNA(Origin[Base],Count(A,C,G,T)), i.e., the first FD is also in $\Theta^+$. The closure of DNA(Origin[Base],Gene(Start,End)) with respect to $\Theta^+$ is again N, i.e., the second FD is in $\Theta^+$, too. Consequently, $E^+ \subseteq \Theta^+$ holds and the decomposition of the GenBank example is dependency-preserving. □

Unfortunately, our examples are exceptions. For relational databases it has been shown in [26, 29, 273] that there may be no decomposition of a relation schema into BCNF that is dependency-preserving. This negative result carries immediately over to the framework of lists, see Theorem 3 of [29] .

Theorem 3.51. There are nested attributes N and sets E of FDs on N for which no dependency-preserving and lossless NLNF decomposition exists.

Proof. Let N = L(A, B, C) and E = {L(A, B) → L(C), L(C) → L(B)}. By a brute force examination of $E^+$ it can be shown that L(A, B) → L(C) is in every non-redundant cover of E. Therefore, in any dependency-preserving and lossless join decomposition of N with respect to E, one of the subattributes of N must be L(A, B, C), but this nested attribute is not in NLNF. □
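The brute force examination mentioned in the proof can be imitated for the flat analogue R(A, B, C) with AB → C and C → B. The sketch assumes single-letter flat attributes and a hypothetical helper `projected_cover`. It collects the FDs projected onto all two-attribute subschemas, which subsume the projections onto every proper subschema, and checks that AB → C cannot be derived from them.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds (standard fixpoint iteration)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def projected_cover(fds, sub):
    """Non-trivial FDs X -> Y with X and Y inside sub that follow from fds."""
    cover = []
    attrs = sorted(sub)
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            x = frozenset(combo)
            rhs = (closure(x, fds) & sub) - x
            if rhs:
                cover.append((x, rhs))
    return cover

schema = frozenset("ABC")
E = [(frozenset("AB"), frozenset("C")), (frozenset("C"), frozenset("B"))]

# Every proper subschema is contained in one of the two-attribute subschemas.
proper = [frozenset(c) for c in combinations(sorted(schema), 2)]
theta = [f for sub in proper for f in projected_cover(E, sub)]

# Only C -> B survives projection, so AB no longer determines C.
ab_closure = closure(frozenset("AB"), theta)
lost = "C" not in ab_closure

# The full schema violates BCNF: C -> B holds, but C is not a superkey.
c_closure = closure(frozenset("C"), E)
violates = c_closure != frozenset("C") and c_closure != schema
```

Here `lost` and `violates` both evaluate to true, matching the argument of the proof: preserving AB → C forces the whole schema into the decomposition, and the whole schema is not in normal form.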

Following [29] , and using the same polynomial-time reduction of the hitting set problem [122] as in the proof of Theorem 3.49 it can be shown that the problem whether there exists a dependency-preserving and lossless NLNF decomposition for an arbitrary nested attribute is NP-hard.

For relational databases, an exponential algorithm in the size of E which decides the problem whether there is a dependency-preserving and lossless BCNF decomposition can be found in [214] . A method of guaranteeing a dependency-preserving decomposition which is in BCNF was proposed in [157] , wherein it was shown that by adding attributes to R and FDs to E it is always possible to obtain a BCNF dependency-preserving decomposition of the augmented schema with respect to the augmented set of FDs.

In summary, obtaining a dependency-preserving and lossless NLNF decomposition is in general an unrealistic goal. In relational database theory, research on the third normal form (3NF) [186, 304] has shown that a lossless join decomposition that preserves dependencies can always be found [41 , 49] . Note, however, that 3NF cannot guarantee the absence of redundancies. It is again subject of future research to extend these results to the framework of lists.

In document Dependencies in complex value databases : a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems at Massey University (Page 95-100)