[Figure 3.1: decomposition tree. Recoverable nodes:
  L(A, K[M(B, C, λ)], P[λ])
  L(A, K[M(λ, λ, D)], P[Q[R(E, F)]])
    L(λ, K[M(λ, λ, D)], P[Q[R(E, λ)]])
    L(A, K[M(λ, λ, D)], P[Q[R(λ, F)]])]
Fig. 3.1. NLNF decomposition tree of Example 3.22.
EXAMPLE 3.23. Example 3.18 has shown that the nested attribute

N = DNA(Origin[Base],Count(A,C,G,T),Gene(Start,End,Sub[Nucleo],Translation[Amino]))

is not in NLNF with respect to the set E of FDs from Example 3.3. The FD

DNA(Origin[Base]) → DNA(Count(A,C,G,T))

is not inevitable, and DNA(Origin[Base]) is not a superkey for N with respect to E. We therefore decompose N into $N_1'$ = DNA(Origin[Base],Count(A,C,G,T)) and $N_2'$ = DNA(Origin[Base],Gene(Start,End,Sub[Nucleo],Translation[Amino])). The FD

DNA(Origin[λ],Count(A,C,G)) → DNA(Count(T))

is in $\pi_{N_1'}(E^+)$, but it is not inevitable, and DNA(Origin[λ],Count(A,C,G)) is not a superkey for $N_1'$ with respect to $\pi_{N_1'}(E^+)$. Therefore, we decompose $N_1'$ into $N_1$ = DNA(Origin[λ],Count(A,C,G,T)) and $N_2$ = DNA(Origin[Base],Count(A,C,G)). The projected FDs $\pi_{N_1}(E^+)$ are covered by the set $E_1$ with the following FDs:
- DNA(Origin[λ],Count(A,C,G)) → DNA(Count(T)),
- DNA(Origin[λ],Count(A,C,T)) → DNA(Count(G)),
- DNA(Origin[λ],Count(C,G,T)) → DNA(Count(A)), and
- DNA(Count(A,C,G,T)) → DNA(Origin[λ]).
The projected FDs $\pi_{N_2}(E^+)$ are covered by $E_2$ = {DNA(Origin[Base]) → DNA(Count(A,C,G))}. $N_1$ is in NLNF with respect to $E_1$ and $N_2$ is in NLNF with respect to $E_2$.

The FD

DNA(Gene(Sub[Nucleo])) → DNA(Gene(Translation[Amino]))

is an element of $\pi_{N_2'}(E^+)$, but it is not inevitable, and DNA(Gene(Sub[Nucleo])) is not a superkey for $N_2'$ with respect to $\pi_{N_2'}(E^+)$. Therefore, we decompose $N_2'$ into $N_3$ = DNA(Gene(Sub[Nucleo],Translation[Amino])) and $N_2''$ = DNA(Origin[Base],Gene(Start,End,Sub[Nucleo])). The projected FDs $\pi_{N_3}(E^+)$ are covered by the set $E_3$ with the following FDs:
- DNA(Gene(Sub[Nucleo])) → DNA(Gene(Translation[Amino])),
- DNA(Gene(Sub[λ])) → DNA(Gene(Translation[λ])), and
- DNA(Gene(Translation[λ])) → DNA(Gene(Sub[λ])).
$N_3$ is again in NLNF with respect to $E_3$.

The FD

DNA(Gene(Start,Sub[λ])) → DNA(Gene(End))

is in $\pi_{N_2''}(E^+)$, but it is not inevitable, and DNA(Gene(Start,Sub[λ])) is not a superkey for $N_2''$ with respect to $\pi_{N_2''}(E^+)$. We decompose $N_2''$ into $N_4$ = DNA(Gene(Start,End,Sub[λ])) and $N_5$ = DNA(Origin[Base],Gene(Start,Sub[Nucleo])). The projected FDs $\pi_{N_4}(E^+)$ are covered by the set $E_4$ with the following FDs:
- DNA(Gene(Start,Sub[λ])) → DNA(Gene(End)),
- DNA(Gene(End,Sub[λ])) → DNA(Gene(Start)), and
- DNA(Gene(Start,End)) → DNA(Gene(Sub[λ])).
$N_4$ is in NLNF with respect to $E_4$. The projected FDs $\pi_{N_5}(E^+)$ are covered by the set $E_5$ = {DNA(Origin[Base],Gene(Start,Sub[λ])) → DNA(Gene(Sub[Nucleo]))}. $N_5$ is in NLNF with respect to $E_5$. The output of Algorithm 3.4.1 is therefore {($N_1$, $E_1$), ($N_2$, $E_2$), ($N_3$, $E_3$), ($N_4$, $E_4$), ($N_5$, $E_5$)}. See Figure 3.2 for an illustration. □
For relational databases it is well known that any relation schema with any set of FDs defined on it can be decomposed into subschemata that are all in BCNF with respect to the projected sets of FDs. In the presence of lists, however, the situation is different. Of course, one can modify Definition 3.47 of lossless NLNF decomposition to lossless BCNF decomposition for arbitrary nested attributes. Consider the nested attribute N = L[A] where the set E of FDs on N consists of the single FD λ → L[λ]. This FD is not trivial and λ is not a superkey for N with respect to E. Consequently, N is not in BCNF with respect to E. However, any decomposition of L[A] must contain the nested attribute L[A] itself. Therefore, no lossless BCNF decomposition of L[A] with respect to E exists.
3.4. DECOMPOSITION INTO NLNF Sebastian Link

DNA(Origin[Base],Count(A,C,G,T),Gene(Start,End,Sub[Nucleo],Translation[Amino]))
  DNA(Origin[Base],Count(A,C,G,T))
    DNA(Origin[λ],Count(A,C,G,T))
    DNA(Origin[Base],Count(A,C,G))
  DNA(Origin[Base],Gene(Start,End,Sub[Nucleo],Translation[Amino]))
    DNA(Gene(Sub[Nucleo],Translation[Amino]))
    DNA(Origin[Base],Gene(Start,End,Sub[Nucleo]))
      DNA(Gene(Start,End,Sub[λ]))
      DNA(Origin[Base],Gene(Start,Sub[Nucleo]))
Fig. 3.2. NLNF decomposition tree of Example 3.23.

3.4.3 Problems with NLNF decomposition
Algorithm 3.4.1 generalises the well-known BCNF decomposition algorithm for relational databases, see for instance [181, p. 270]. It follows that the NLNF decomposition algorithm causes at least as many problems as its relational counterpart. The first problem is that Algorithm 3.4.1 does not execute in time polynomial in the sizes of N and E since computing a cover of $\pi_{N_i}(E^+)$ is intractable [33]. Changing the computations of $\pi_{N_1}(E^+)$ and $\pi_{N_2}(E^+)$ in lines (6) and (8) of Algorithm 3.4.1, respectively, to polynomial-time computations of $\pi_{N_1}(E)$ and $\pi_{N_2}(E)$ in the size of E leads to an algorithm which may not always output an NLNF decomposition. For example, let E = {L(A) → L(B), L(B) → L(C)} be a set of FDs defined on N = L(A, B, C, D). Then $\pi_M(E)$ contains only trivial FDs for M = L(A, C, D), but the FDs in $\pi_M(E^+)$ are covered by {L(A) → L(C)}. It follows that if the FD L(A) → L(B) is chosen at line (4) of Algorithm 3.4.1, then M, which is not in NLNF with respect to $\pi_M(E^+)$, is in the output decomposition.

Furthermore, the cardinality of the decomposition returned by Algorithm 3.4.1 may be exponential in the cardinality of N [181, p. 271]. While checking whether N itself is in NLNF with respect to E can be done in polynomial time in the size of N and E (Theorem 3.37 and Lemma 3.30), checking whether a proper subattribute $N_i$ ∈ Sub(N) is in NLNF with respect to $\pi_{N_i}(E^+)$ is harder. The following theorem follows from Corollary 3 in [29].

Theorem 3.49. Let N ∈ NA and E a set of FDs on N. The problem of deciding whether an arbitrary $N_i$ ∈ Sub(N) is in NLNF with respect to $\pi_{N_i}(E^+)$ is coNP-complete.
Proof (Sketch). The problem of deciding whether an arbitrary $N_i$ ∈ Sub(N) is not in NLNF with respect to $\pi_{N_i}(E^+)$ is in NP. According to Theorem 3.37 one guesses non-deterministically an FD X → Y ∈ $\pi_{N_i}(E^+)$ and verifies in polynomial time that X → Y is not inevitable on $N_i$ with respect to $\pi_{N_i}(E^+)$ and that X is not a superkey for $N_i$ with respect to $\pi_{N_i}(E^+)$. Following Lemma 3.30, X → Y is not inevitable if there is some Y' ∈ MaxB($N_i$) with Y' ≤ Y and Y' ≰ X.

It remains to show that the problem of deciding whether an arbitrary $N_i$ ∈ Sub(N) is not in NLNF with respect to $\pi_{N_i}(E^+)$ is NP-hard. One can use the polynomial-time reduction of the hitting set problem [122] in [29, pp. 55-57] to the decision problem whether an arbitrary subschema of a relation schema is not in BCNF with respect to the corresponding projected set of given FDs. This is possible since every relational subschema can be represented using only null, flat and record-valued attributes, and NLNF and BCNF are equivalent in the absence of lists. Note that in this case inevitable FDs are simply trivial FDs. □
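The example E = {L(A) → L(B), L(B) → L(C)} given before Theorem 3.49 can be replayed mechanically in the relational special case. The following Python sketch is an illustration only and not part of Algorithm 3.4.1: it treats A, B, C, D as flat attributes (lists and the null attribute λ are ignored), it assumes that projecting E itself means keeping the FDs of E that lie entirely inside M, and the helper names `closure`, `project_syntactic`, `project_closure` and `bcnf_violations` are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (flat relational case only)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def project_syntactic(fds, sub):
    """Polynomial shortcut: keep only the given FDs that mention attributes of `sub` alone."""
    return [(l, r) for l, r in fds if l <= sub and r <= sub]


def project_closure(fds, sub):
    """Non-trivial FDs X -> (closure(X) within sub, minus X) for every X inside sub.

    Enumerates all subsets of `sub`, so the cost is exponential in len(sub).
    """
    out = []
    items = sorted(sub)
    for k in range(len(items) + 1):
        for xs in combinations(items, k):
            x = frozenset(xs)
            rhs = (closure(x, fds) & sub) - x
            if rhs:
                out.append((x, rhs))
    return out


def bcnf_violations(sub, fds):
    """FDs of `fds` (assumed non-trivial, over `sub`) whose left side is not a superkey of `sub`."""
    return [(l, r) for l, r in fds if not sub <= closure(l, fds)]


def fd(lhs, rhs):
    return (frozenset(lhs), frozenset(rhs))


E = [fd("A", "B"), fd("B", "C")]
M = frozenset("ACD")

naive = project_syntactic(E, M)   # misses A -> C
exact = project_closure(E, M)     # contains A -> C

print(bcnf_violations(M, naive))
print(bcnf_violations(M, exact))
```

With the shortcut projection no violation is visible on M, whereas the closure-based projection exposes A → C with a left-hand side that is not a superkey of M. The subset enumeration in `project_closure` is also a small reminder of why the exact projection is expensive.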
For relational databases, an algorithm that runs in time polynomial in the sizes of a relation schema R and E and outputs a lossless BCNF decomposition with respect to E has been proposed in [270]. It is the subject of future research to generalise this algorithm to the context of lists.
We have seen that it is always possible to achieve a lossless NLNF decomposition. Unfortunately, losslessness is not the only desirable property of a decomposition. An output {($N_1$, $E_1$), ..., ($N_k$, $E_k$)} should only be considered equivalent to (N, E) in case the semantic information in $\bigcup_{i=1}^{k} E_i$ is equivalent to the semantic information in E. This means that it is necessary to preserve not only the information stored in the database itself, but also the semantic properties that this database carries. In other words, the dependencies must have been preserved at the end of the decomposition process.
Definition 3.50. Let N ∈ NA and E a set of FDs on N. A lossless join decomposition {$N_1$, ..., $N_k$} of N is called dependency-preserving with respect to E if and only if

$E^* = \left( \bigcup_{i=1}^{k} \pi_{N_i}(E^*) \right)^*$. □

We can see that the lossless NLNF decomposition in Example 3.22 is indeed dependency-preserving. What about the decomposition of our GenBank example?
EXAMPLE 3.24. Consider the decomposition of the GenBank database from Example 3.23. Define Θ as $\bigcup_{i=1}^{5} \pi_{N_i}(E^+)$. The decomposition is dependency-preserving if and only if $E^+ \subseteq \Theta^+$. All FDs in E are also in Θ except

DNA(Origin[Base]) → DNA(Count(A,C,G,T))

and

DNA(Origin[Base],Gene(Start,End)) → [...]

The closure of DNA(Origin[Base]) with respect to $\Theta^+$ is DNA(Origin[Base],Count(A,C,G,T)), i.e., the first FD is also in $\Theta^+$. The closure of DNA(Origin[Base],Gene(Start,End)) with respect to $\Theta^+$ is again N, i.e., the second FD is in $\Theta^+$, too. Consequently, $E^+ \subseteq \Theta^+$ holds and the decomposition of the GenBank example is dependency-preserving. □
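The closure argument of Example 3.24 has a well-known relational counterpart that tests preservation of a single FD without materialising the projected FD sets. The following is a minimal Python sketch under the assumption of flat relational attributes rather than nested attributes; the names `closure`, `preserved` and `dependency_preserving`, and the two toy decompositions, are illustrative and not taken from the GenBank example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (flat relational case only)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def preserved(lhs, rhs, fds, parts):
    """Is lhs -> rhs implied by the union of the projections of `fds` onto `parts`?

    Grows Z from lhs by repeatedly adding, for each component, the part of the
    closure of (Z restricted to the component) that lies inside that component.
    """
    z = set(lhs)
    changed = True
    while changed:
        changed = False
        for part in parts:
            gained = (closure(z & part, fds) & part) - z
            if gained:
                z |= gained
                changed = True
    return rhs <= z


def dependency_preserving(fds, parts):
    """A decomposition preserves `fds` if every single FD is preserved."""
    return all(preserved(l, r, fds, parts) for l, r in fds)


E = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
good_parts = [frozenset("AB"), frozenset("BC")]
bad_parts = [frozenset("AB"), frozenset("AC")]

print(dependency_preserving(E, good_parts))  # True
print(dependency_preserving(E, bad_parts))   # False: B -> C is lost
```

Only closures with respect to the original FDs are computed, so the test avoids the expensive step of computing a cover of each projection.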
Unfortunately, our examples are exceptions. For relational databases it has been shown in [26, 29, 273] that there may be no decomposition of a relation schema into BCNF that is dependency-preserving. This negative result carries over immediately to the framework of lists; see Theorem 3 of [29].
Theorem 3.51. There are nested attributes N and sets E of FDs on N for which no dependency-preserving and lossless NLNF decomposition exists.
Proof. Let N = L(A, B, C) and E = {L(A, B) → L(C), L(C) → L(B)}. By a brute-force examination of $E^+$ it can be shown that L(A, B) → L(C) is in every non-redundant cover of E. Therefore, in any dependency-preserving and lossless join decomposition of N with respect to E, one of the subattributes of N must be L(A, B, C), but this nested attribute is not in NLNF. □
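The brute-force argument in this proof can be mimicked for the flat relational counterpart R = {A, B, C} with the FDs AB → C and C → B. The sketch below is an assumption-laden illustration, not the proof itself: it works with BCNF instead of NLNF, ignores lists, and the helper names (`closure`, `powerset`, `projection`, `in_bcnf`, `preserves`) are hypothetical. Losslessness is not checked, because the search already finds no dependency-preserving collection of BCNF subschemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (flat relational case only)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def powerset(attrs):
    items = sorted(attrs)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def projection(fds, sub):
    """Non-trivial FDs over `sub` that are implied by `fds`."""
    out = []
    for x in powerset(sub):
        rhs = (closure(x, fds) & sub) - x
        if rhs:
            out.append((x, rhs))
    return out


def in_bcnf(sub, fds):
    """Every non-trivial projected FD needs a superkey of `sub` on its left side."""
    return all(sub <= closure(x, fds) for x, _ in projection(fds, sub))


def preserves(fds, parts):
    """True if the union of the projections implies every FD in `fds`."""
    union = [f for p in parts for f in projection(fds, p)]
    return all(rhs <= closure(lhs, union) for lhs, rhs in fds)


R = frozenset("ABC")
E = [(frozenset("AB"), frozenset("C")), (frozenset("C"), frozenset("B"))]

candidates = [s for s in powerset(R) if s]  # all non-empty subschemas, R included
good = []
for k in range(1, len(candidates) + 1):
    for parts in combinations(candidates, k):
        covers = frozenset().union(*parts) == R
        if covers and all(in_bcnf(p, E) for p in parts) and preserves(E, parts):
            good.append(parts)

print(good)  # []
```

The search is exhaustive over all 127 non-empty collections of subschemas, which is feasible only because R has three attributes. Every collection fails either because it contains R itself, which violates BCNF through C → B, or because AB → C cannot be recovered from the projections.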
Following [29], and using the same polynomial-time reduction of the hitting set problem [122] as in the proof of Theorem 3.49, it can be shown that the problem of deciding whether there exists a dependency-preserving and lossless NLNF decomposition for an arbitrary nested attribute is NP-hard.
For relational databases, an algorithm that is exponential in the size of E and decides whether there is a dependency-preserving and lossless BCNF decomposition can be found in [214]. A method of guaranteeing a dependency-preserving decomposition which is in BCNF was proposed in [157], wherein it was shown that by adding attributes to R and FDs to E it is always possible to obtain a BCNF dependency-preserving decomposition of the augmented schema with respect to the augmented set of FDs.
In summary, obtaining a dependency-preserving and lossless NLNF decomposition is in general an unrealistic goal. In relational database theory, research on the third normal form (3NF) [186, 304] has shown that a lossless join decomposition that preserves dependencies can always be found [41, 49]. Note, however, that 3NF cannot guarantee the absence of redundancies. It is again the subject of future research to extend these results to the framework of lists.