Linear ordering query

The query example considered in this section is mainly of theoretical interest, although it might be useful in practice. The point is that we can define in ∆ a linear ordering on the transitive closure of any hyperset by using the lexicographical linear ordering we have on labels. In fact, the resulting linear ordering on hypersets is itself, in a sense, lexicographical. Having defined a linear ordering, we can further define any (“generic” polynomial-time) computable operation over hypersets by simulating any given Turing Machine (as shown in descriptive complexity theory [34, 37, 55, 74]). This is the key point of the main result in [57] (for well-founded sets) and in [58, 41, 43] (for hypersets) on the expressive power of ∆ coinciding with polynomial-time computability over (hyper)sets. (We omit the precise formulation, which is more subtle in the case of hypersets having labelled elements; see [57, 41].)

Let us consider the set query declaration StrictLinOrder_on_TC(set z) (and other associated declarations), which can be found in Appendix A.3 (footnote 14: it is based on formula (22) and Theorem 2 in [43]; we leave this for the reader to realise how this query below ...). In fact, the rather complicated query StrictLinOrder_on_TC serves as an additional witness demonstrating that everything is implemented correctly, and helps to check whether and where any optimisation of the implementation is required. Note that StrictLinOrder_on_TC invokes Can, and without this canonisation the transitive closure TCPure(BibDB) participating in the query below (according to Appendix A.3) would have too many repetitions and, hence, Square would have even more repetitions, so that the recursion in the set query StrictLinOrder_on_TC over this Square would take many hours. Now let us run

    set query let
        set constant BibDB = http://www.csc.liv.ac.uk/~molyneux/t/BibDB-f1.xml#BibDB
    in call SuccessorPairs( call StrictLinOrder_on_TC(BibDB) )
    endlet;

Note that SuccessorPairs (defined in Appendix A.3) makes the result more concise. We see that our database BibDB becomes linearly ordered (with the corresponding simple set names from the bibliographic database substituted in place of the new set names generated by the query system):

Query is well-formed, well-typed and executable

    Result = {
      'null':{'fst':{}, 'snd':"Databases"},
      'null':{'fst':"Databases", 'snd':"Jones"},
      'null':{'fst':"Jones", 'snd':"Smith"},
      'null':{'fst':"Smith", 'snd':BibDB},
      'null':{'fst':BibDB, 'snd':p1},
      'null':{'fst':p1, 'snd':b1},
      'null':{'fst':b1, 'snd':b2/p3},
      'null':{'fst':b2/p3, 'snd':p2}
    }
    p2 = {'author':"Smith", 'title':"Databases", 'refers-to':b2/p3}
    b2/p3 = {'author':"Jones", 'title':"Databases"}
    p1 = {'refers-to':p2}
    b1 = {'refers-to':b2/p3, 'refers-to':p1}
    BibDB = {'paper':p1, 'paper':p2, 'paper':b2/p3, 'book':b1, 'book':b2/p3}

Finished in: 270500 ms (~ 4 minutes and 30 seconds)

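The correspondence of set names with the nodes of the graph in Figure 3.1 is explicitly shown in the above result. Thus, reading off the successor pairs, the resulting linear ordering on the transitive closure of BibDB is:

    {} < "Databases" < "Jones" < "Smith" < BibDB < p1 < b1 < b2/p3 < p2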

Here it is important that the recursion in StrictLinOrder_on_TC does not use bisimulation for comparing iteration steps (see Chapter 4). This crucially optimises recursion, and in particular the query StrictLinOrder_on_TC, which also uses Can in its library declaration. Without the first optimisation this query would take about 30 minutes, and without also using Can even hours. Of course, several minutes for such a small database (with TC(BibDB) containing 9 sets) is also quite long, and thus the query system implementation needs to be further optimised. But the query is rather complicated (see Appendix A.3), and the recursion actually uses 81 = 9^2 steps of iteration if Can is involved. This means on average about 3.3 seconds per iteration step.
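The way the successor pairs determine the whole ordering, and the timing arithmetic quoted above, can be checked with a short script. This is only an illustrative sketch in Python, not part of the ∆ query system: the pairs are copied from the result above, while the helper name `chain` and the string encoding of set names are our own assumptions.

```python
# Successor pairs copied from the query result; "{}" stands for the empty set.
# Encoding set names as plain strings is an assumption made for this sketch.
pairs = [
    ("{}", '"Databases"'),
    ('"Databases"', '"Jones"'),
    ('"Jones"', '"Smith"'),
    ('"Smith"', "BibDB"),
    ("BibDB", "p1"),
    ("p1", "b1"),
    ("b1", "b2/p3"),
    ("b2/p3", "p2"),
]


def chain(successor_pairs):
    """Rebuild a linear order from its (fst, snd) successor pairs."""
    succ = dict(successor_pairs)
    # The least element is the only one that never occurs as a successor.
    (current,) = set(succ) - set(succ.values())
    order = [current]
    while current in succ:
        current = succ[current]
        order.append(current)
    return order


order = chain(pairs)
print(" < ".join(order))

# Timing figures quoted in the text: 270500 ms spread over 9^2 iteration steps.
steps = len(order) ** 2
seconds_per_step = 270500 / 1000 / steps
print(steps, round(seconds_per_step, 2))
```

The script recovers nine elements from the eight pairs, which agrees with TC(BibDB) containing 9 sets.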

Bisimulation

Before discussing the theoretical and practical issues surrounding bisimulation, let us recall some relevant details of the hyperset approach to WDB. As previously described in Chapter 2, a WDB is represented as a system of set equations $\bar{x} = \bar{b}(\bar{x})$, where $\bar{x}$ is a list of set names $x_1, \ldots, x_k$ and $\bar{b}(\bar{x})$ is the corresponding list of bracket expressions (for simplicity, “flat” ones). A visually equivalent representation can be given in the form of a labelled directed graph, where labelled edges $x_i \xrightarrow{\mathit{label}} x_j$ correspond to the set memberships $\mathit{label} : x_j \in x_i$, meaning that the equation for $x_i$ has the form $x_i = \{\ldots, \mathit{label} : x_j, \ldots\}$. In this case we also call $x_j$ a child of $x_i$. Note that our usage of the membership symbol ($\in$) as a relation between set names or graph nodes is non-traditional but very close to the traditional set-theoretic membership relation between abstract (hyper)sets. Of course, this analogy is very important for us and it is indeed highly natural; hence we decided not to introduce a new kind of membership symbol here. For the purposes of our description below labels can be ignored, as the inclusion of labels will not affect the nature of our discussion. We will also apply the transitive closure operator TC(x) to a set name x. The essential point is that in this context TC(x) is understood as a set of set names (or graph nodes) rather than of abstract sets denoted by these names. Again, we do not bother with introducing a new notation for such TC.
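To make the reading of a WDB as a graph of set names concrete, here is a minimal sketch in Python. It is not the implementation described in this work: the dictionary encoding, the helper names `children` and `tc`, and the choice to include the starting name in TC(x) are all assumptions made for illustration, and the atomic members (authors, titles) of the BibDB example are deliberately left out.

```python
from collections import deque

# A WDB as a system of flat set equations: every set name maps to a list of
# (label, child name) pairs, i.e. one labelled edge per membership.
# Names and labels follow the BibDB example; atomic members are omitted.
wdb = {
    "BibDB": [("paper", "p1"), ("paper", "p2"), ("paper", "b2/p3"),
              ("book", "b1"), ("book", "b2/p3")],
    "p1": [("refers-to", "p2")],
    "p2": [("refers-to", "b2/p3")],
    "b1": [("refers-to", "b2/p3"), ("refers-to", "p1")],
    "b2/p3": [],
}


def children(wdb, name):
    """Set names y with some label:y in name (labels are ignored here)."""
    return {child for _label, child in wdb[name]}


def tc(wdb, name):
    """Set names reachable from `name`; `name` itself is included (assumption)."""
    seen = {name}
    queue = deque([name])
    while queue:
        for child in children(wdb, queue.popleft()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


reachable = tc(wdb, "p1")
print(sorted(reachable))
```

Note that `tc` returns names, not abstract sets: two different names in the result may still denote the same hyperset.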

4.1 Hyperset equality and the problem of efficiency

One of the key points of our approach is the interpretation of WDB-graph nodes as set names $x_1, \ldots, x_k$, where different nodes $x_i$ and $x_j$ can, in principle, denote the same (hyper)set, $x_i = x_j$. This notion of equality between nodes is defined by the bisimulation relation, denoted also as $x_i \approx x_j$ (to emphasise that set names can be syntactically different but denote the same set), which can be computed by the appropriate recursive comparison of child nodes or set names. Thus, in outline, to check bisimulation of two nodes we need to check bisimulation between some children, grandchildren, and so on, of the given nodes, i.e. many nodes could be involved. If the WDB is distributed amongst many WDB files and remote sites, downloading the relevant WDB files might be necessary in this process and will take significant time. There is also the analogous problem with the related transitive closure operator (TC), whose efficient implementation in the distributed case requires additional considerations not discussed here. So, in practice the equality relation for hypersets seems intractable, although theoretically it takes polynomial time with respect to the size of the WDB. Nevertheless, we consider that the hyperset approach to WDB based on the bisimulation relation is worth implementing because it suggests a very clear and mathematically well-understood view of semi-structured data and the querying of such data. Thus, the crucial question is whether the problem of bisimulation can be resolved in any reasonable and practical way. Some possible approaches and strategies, related to the possibly distributed nature of WDB and showing that the situation is manageable in principle, are outlined below.

Although from the general database perspective we should consider graphs with labels on edges and hypersets with labelled elements, the majority of our considerations in this chapter will be devoted to the pure case without any labels. The extension to the labelled case is quite straightforward and is not explicitly considered, except in Definition 2 (b). Of course, our implementation of the bisimulation relation considers the labelled case.

4.1.1 Bisimulation relation

Equality between set names (or graph nodes) of any WDB is determined by the bisimulation relation, defined according to [3] (see also [48, 53]).

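Definition 2. (a) The bisimulation relation $\approx$ (or $\approx_{WDB}$) on a WDB without labels (the pure case) is the largest relation such that for all set names $x, y$ the following implication holds:

$$x \approx y \;\Rightarrow\; \forall x' \in x\; \exists y' \in y\, (x' \approx y') \;\&\; \forall y' \in y\; \exists x' \in x\, (x' \approx y'). \tag{4.1}$$

(b) In the general labelled case, it should satisfy the implication

$$x \approx y \;\Rightarrow\; \forall\, l : x' \in x\; \exists\, m : y' \in y\, (l = m \wedge x' \approx y') \;\&\; \forall\, m : y' \in y\; \exists\, l : x' \in x\, (l = m \wedge x' \approx y'). \tag{4.2}$$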

It is well known that the largest such relation does exist. Indeed, the class $\mathcal{R}$ of relations $R$ satisfying any of the above formulas (in place of $\approx$) is evidently closed under taking unions, so the union of all of them is the required largest one, $\approx$. In fact, for $\approx$ the implication $\Rightarrow$ above can be replaced by $\Leftrightarrow$. Moreover, the class $\mathcal{R}$ evidently contains the identity relation $=$ and is closed under taking compositions $R \circ S$ and inverse relations $R^{-1}$. It follows that the largest such relation $\approx$ is reflexive, transitive and symmetric, that is, an equivalence relation.
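On a finite WDB the largest relation of Definition 2 can be obtained by starting from the full relation on set names and repeatedly discarding pairs that violate the implication (4.2) until nothing changes. The following Python sketch illustrates this naive refinement. It is not the implementation discussed in this work and makes no attempt at efficiency: the dictionary encoding of the WDB, the function name `bisimulation` and the toy set names are assumptions made for the example.

```python
def bisimulation(wdb):
    """Largest relation satisfying (4.2), returned as a set of name pairs.

    `wdb` maps each set name to a list of (label, child name) pairs.
    """
    names = list(wdb)
    rel = {(x, y) for x in names for y in names}  # start from the full relation

    def ok(x, y):
        # Every l:x' in x has a partner m:y' in y with l = m and x' ~ y' ...
        forth = all(any(l == m and (x1, y1) in rel for m, y1 in wdb[y])
                    for l, x1 in wdb[x])
        # ... and every m:y' in y has such a partner l:x' in x.
        back = all(any(l == m and (x1, y1) in rel for l, x1 in wdb[x])
                   for m, y1 in wdb[y])
        return forth and back

    changed = True
    while changed:  # discard violating pairs until the relation is stable
        changed = False
        for x, y in list(rel):
            if not ok(x, y):
                rel.discard((x, y))
                changed = True
    return rel


# Toy WDB (hypothetical names): x, y and z all denote the same cyclic hyperset.
toy = {
    "x": [("a", "x")],
    "y": [("a", "y")],
    "z": [("a", "x"), ("a", "z")],
    "e": [],
    "e2": [],
    "u": [("b", "e")],
    "v": [("b", "e"), ("b", "e2")],
}
rel = bisimulation(toy)
print(("x", "y") in rel, ("u", "v") in rel, ("x", "e") in rel)
```

Discarding pairs never removes a pair of the largest relation, because the condition in (4.2) is monotone in the relation; hence the stable relation reached at the end is the largest one.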

The bisimulation relation is completely coherent with hyperset theory as it is fully described in the books of Aczel [3], and Barwise and Moss [5], for the pure case, and this fact extends easily to the labelled case. It is for this reason that the bisimulation relation $\approx$ between set names can be considered as the equality relation $=$ between the corresponding abstract hypersets. So, we will not go into further general theoretical details concerning the bisimulation relation (except for the concept of local bisimulation in Chapter 6 below), paying the main attention to implementation aspects.

In document Hyperset approach to semi-structured databases and the experimental implementation of the query language Delta (Page 74-79)