12.3 Experimental results
12.3.3 Conclusions
We distinguish between conclusions drawn from experiments on data mapping performance and conclusions drawn from experiments on query evaluation performance.
Data mapping – flesh & skeleton mapping The execution time of ffl scales linearly with the number of ordinary nodes that are mapped. Analogously, the execution time of fsk increases linearly with the number of possibility nodes. Additionally, the execution time to create skeleton path tables is constant for 0.1 ≤ pδ ≤ 0.6. For larger values of pδ, execution time increases linearly.
Data mapping – best DO glue process The PB.PPR glue process is the best candidate to incorporate in a data mapping. The execution time of PB.PPR increases linearly with the number of ordinary nodes. The number of possibility nodes in a data set does not influence the execution time of PB.PPR.
Query evaluation – unmaterialized views vs. materialized views For the test set of increasing data set size and the test set of increasing amounts of uncertainty, performance results indicate that materialized views perform better than unmaterialized views in many scenarios. For category 1 (complex predicate) and category 3 (par/child), this was the rule rather than the exception. We interpret this behaviour as an indication that the query planning process was too complex for the query planner.
Query evaluation – same complexity In general, the normalized speedup of most tg-queries compared to t-queries is constant. This indicates that the complexity of the evaluation of tg-queries and t-queries is the same.
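The reasoning behind this conclusion can be illustrated with a minimal sketch. It assumes that normalized speedup is the execution time of the baseline t-query divided by the execution time of the corresponding tg-query; the function names, the 10% tolerance and the timings are hypothetical and are not taken from the experiments.

```python
def normalized_speedup(t_times, tg_times):
    """Speedup of tg-queries relative to the t-query baseline, per data set size.

    Assumption: speedup = t-query time / tg-query time.
    """
    return [t / tg for t, tg in zip(t_times, tg_times)]


def is_roughly_constant(values, tolerance=0.1):
    """True if every value lies within `tolerance` (relative) of the mean."""
    mean = sum(values) / len(values)
    return all(abs(v - mean) <= tolerance * mean for v in values)


# Made-up timings (ms) for growing data set sizes, for illustration only.
t_times = [10.0, 20.0, 40.0, 80.0]
tg_times = [5.0, 10.0, 20.0, 40.0]

speedups = normalized_speedup(t_times, tg_times)
# A constant ratio means both execution times grow at the same rate,
# so they differ only by a constant factor.
assert is_roughly_constant(speedups)
```

If the ratio instead grew or shrank with the data set size, the two query types would scale differently, and the conclusion of equal complexity would not follow.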
Query evaluation – performance of tg-queries compared to t-queries In general, tg-queries perform best with an incorporated PC glue process. In particular, tg-queries that incorporate PC.PPR outperform t-queries in most scenarios. A short summary of the performance results:
• For varying data set sizes, tg-queries that incorporate PC.PPR perform better than or similar to t-queries in categories 1 (complex predicate), 2 (foll/prec sibling), 4 (par/child), and 5 (simple predicate). For category 3 (anc/desc), t-queries perform better.
• For varying amounts of uncertainty, tg-queries that incorporate PC.PPR perform better than or similar to t-queries in all categories.
[Figure: (a) Total over varying movie title thresholds: normalized speedup (baseline: t-queries) per query category, for PC.PPR (v), PC.CD (v), PC.CD TI (v), BB.PPR (v) and BB.CD (v). (b) Total over varying movie title thresholds: normalized speedup (baseline: t-queries) per QRO glue process, for Cat.1 (Complex), Cat.2 (Foll/Prec sib), Cat.3 (Anc/Desc), Cat.4 (Par/Child) and Cat.5 (Simple). (c) Average query execution time per query (ms), categorized per query category; t-queries, rest analogous to Fig. 12.8a.]
• For the real world uncertain test set, tg-queries that incorporate PC.PPR perform better than or similar to t-queries in categories 1 (complex predicate), 2 (foll/prec sibling), 4 (par/child), and 5 (simple predicate). For category 3 (anc/desc), t-queries perform better.
Note that skeleton path tables have to be created as part of a data mapping in order to use PC. As a consequence, the data mapping itself becomes less efficient.
Query evaluation – advantages of inlining Our experimental results show that queries from category 3 (anc/desc) and category 4 (par/child) benefit from loosening the inlining rules. More specifically, we obtained a linear reduction in complexity for increasing data set sizes and a constant increase of 8 in normalized speedup.
Unfortunately, experimental results on the real world uncertain test set show that in practice, the performance advantage gained from inlining is minimal. The reason is straightforward. The traditional inlining rules (described in Section 2.3.1) apply to the flesh of a p-document instead of the individual possible worlds represented by that p-document. Our approach to add uncertainty to an XML-document basically makes edges between nodes uncertain. Hence, the multiplicity between element types is not affected. However, in practice, nodes of a p-document are uncertain. As a consequence, element types N and N′ that have a multiplicity of 1..1 in each possible document represented by a p-document have a multiplicity of 1..∗ in the flesh of that p-document when one occurrence of N′ is uncertain. Hence, the traditional inlining rules prohibit the inlining of N′ with N.
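The multiplicity argument can be made concrete with a small sketch. It assumes a toy tree model in which a single kind of "choice" node stands in for the distributional nodes of P-XML and selects exactly one of its alternatives in each possible world; probabilities are omitted. The helpers `elem`, `choice`, `expand`, `possible_worlds` and `flesh_child_count` are hypothetical names introduced only for this illustration.

```python
from itertools import product


def elem(tag, text="", children=()):
    """Ordinary node (part of the flesh)."""
    return ("elem", tag, text, list(children))


def choice(*alternatives):
    """Stand-in for a distributional node: exactly one alternative holds per world."""
    return ("choice", [list(alt) for alt in alternatives])


def expand(node):
    """Return all alternatives for a node; each alternative is a list of certain subtrees."""
    if node[0] == "elem":
        _, tag, text, children = node
        per_child = [expand(c) for c in children]
        return [[(tag, text, [s for part in combo for s in part])]
                for combo in product(*per_child)]
    result = []
    for alt in node[1]:
        per_node = [expand(n) for n in alt]
        for combo in product(*per_node):
            result.append([s for part in combo for s in part])
    return result


def possible_worlds(root):
    """All certain documents represented by a p-document rooted at an ordinary node."""
    return [alt[0] for alt in expand(root)]


def flesh_child_count(node, child_tag):
    """Count child_tag children of an ordinary node in the flesh (looking through choices)."""
    count, stack = 0, list(node[3])
    while stack:
        n = stack.pop()
        if n[0] == "elem":
            count += n[1] == child_tag
        else:
            for alt in n[1]:
                stack.extend(alt)
    return count


# One movie whose title is uncertain between two spellings (made-up data).
movie = elem("movie", children=[choice([elem("title", "Jaws")],
                                       [elem("title", "Jaws 2")])])

titles_per_world = [sum(1 for c in world[2] if c[0] == "title")
                    for world in possible_worlds(movie)]
assert titles_per_world == [1, 1]              # multiplicity 1..1 in every possible world
assert flesh_child_count(movie, "title") == 2  # multiplicity 1..* in the flesh
```

Because the inlining rules inspect the flesh, they see two title nodes under one movie and therefore refuse to inline title with movie, even though every possible world contains exactly one title.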
Chapter 13
Related Work
This work is a continuation of previous work [48] where we designed a P-XML into URDBMS data mapping based on the XPath Accelerator (XA) approach of Grust et al. [20] combined with the PB.D glue process. We used this data mapping to perform an XPath performance study on MayBMS. Results indicated that query evaluation was inefficient. In this thesis, we managed to resolve this issue with a change of XA element encoding and a change of XML into RDBMS data mapping from XA to ASI[XA].
The work of Van Keulen et al. [50] motivated the investigation of XPath processing with a URDBMS. They propose two approaches to build an XPath processor for P-XML data. The first approach is to instruct a URDBMS to cope with XML. The second approach is to instruct an XML RDBMS to cope with uncertainty. They implemented the second approach. No results on scalability are reported.
Besides our work, the work of Hollander et al. [26] also investigated the possibility to transform a URDBMS to an XPath processor. They propose two different P-XML into URDBMS data mappings. Their first mapping extends the XA approach and their second mapping extends the Shared Inlining (SI) approach, both in combination with ‘Trio’, a URDBMS. In summary, their approach consists of three steps: (1) create an event table that represents the skeleton of a p-document, (2) map the flesh of a p-document to relation tables, and (3) create uncertain relation tables per element type by joining the flesh with the skeleton using an inheritance driven glue method (discussed in Section 10.2). The event table that Hollander uses in the first step is similar to the result of the flesh mapping introduced in this research.
The performance study of Hollander et al. [26] shows that their SI approach is more efficient than their XA approach for query evaluation. Unfortunately, Trio was unable to import large documents. As a consequence, their performance study is of a different order of magnitude than the performance study of this work.
Most work on query evaluation on P-XML data is by Kimelfeld et al. [34, 32, 33]. They propose a variety of algorithms for which they give a complexity analysis. Their work is not described at the same level of detail as our work. Also, they have not shown that their algorithms scale up in practice. In contrast, our work does not yet include a complexity analysis. For now, it remains an open question how the work of Kimelfeld et al. fits with our work.
Nierman et al. [41] have implemented a query evaluation mechanism for a probabilistic XML model that is a member of the PrXML{exp} family. The work of Nierman et al. is not comparable with our approach, since we investigated a P-XML model of a different probabilistic XML family.
Chapter 14
Conclusions & Future Work
14.1 Summary
In this thesis, we identified the following problem:
There does not exist an efficient query evaluation mechanism for P-XML that can be used in practice.
— Section 1.2
In order to resolve this problem, we take the approach to instruct an uncertain relational database management system (URDBMS) to cope with P-XML. We accomplish this with a specification and design of a P-XML into URDBMS database mapping (f,g), where f is a data mapping that maps p-documents to U-Relations and g is a query mapping that maps XPath to SQL queries.
A p-document constructed of flesh and skeleton Document instances of P-XML are built up from ordinary nodes and distributional nodes. We introduce the terminology flesh to refer to all ordinary nodes in a p-document and skeleton to refer to all distributional nodes in a p-document. The flesh defines the content of a p-document and the skeleton defines the uncertainty distribution of a p-document as a set of choices.
First design of (f,g) Our first design constructs f as (1) a flesh mapping (ffl) to map tree-structured content to table-structured content and (2) a skeleton mapping (fsk) to map a tree-structured uncertainty distribution to a table-structured uncertainty distribution. Additionally, we use (3) a glue process to merge the results of ffl and fsk. The result of such a glue process is a set of U-Relations that represents the same set of possible worlds as the original p-document. We refer to a glue process that is part of f as document oriented gluing (DO).
In correspondence with f, we construct g as an ordinary XPath to SQL mapping. According to the possible worlds semantics (described in Section 2.4.1), a traditional query evaluated on a set of possible worlds results in a set of possible answers, each provided by one of the possible worlds. Therefore, the result of a traditional XPath query evaluated on a tree-structured representation of a set of possible worlds is similar to the result of a traditional SQL query (derived from g) evaluated on a table-structured representation of possible worlds (derived from f). We refer to queries derived from an ordinary XPath to SQL mapping as t-queries.
Second design of (f,g) A P-XML into URDBMS data mapping is not obliged to incorporate a glue process. Our second design constructs f as ffl and fsk only. Query mapping g is altered such that a glue process is incorporated in the query evaluation process. We refer to a glue process that is part of a query evaluation process as query result oriented gluing (QRO). Queries that include a glue process are referred to as tg-queries.
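The difference between the two designs can be sketched in a few lines. This is a heavily simplified illustration, not one of the glue processes evaluated in this work: it assumes a flesh table with one row per ordinary node, a skeleton table that attaches a single (variable, value) condition to a node, and a glue step that is a plain lookup. Inheritance of conditions from ancestors is ignored, and all table contents and function names are hypothetical.

```python
# Result of the flesh mapping: one row per ordinary node (made-up data).
flesh = [
    {"id": 1, "tag": "movie", "value": None},
    {"id": 2, "tag": "title", "value": "Jaws"},
    {"id": 3, "tag": "title", "value": "Jaws 2"},
    {"id": 4, "tag": "year", "value": "1975"},
]

# Result of the skeleton mapping: node id -> condition under which the node exists.
# Nodes without an entry are treated as certain.
skeleton = {2: ("x", 1), 3: ("x", 2)}


def glue(rows, skeleton):
    """Attach the uncertainty condition to each row (toy stand-in for a glue process)."""
    return [{**row, "cond": skeleton.get(row["id"])} for row in rows]


def select_tag(rows, tag):
    """Toy stand-in for the SQL produced by the query mapping g."""
    return [row for row in rows if row["tag"] == tag]


# First design (DO): glue once as part of the data mapping, then run a t-query.
u_relation = glue(flesh, skeleton)
t_result = select_tag(u_relation, "title")

# Second design (QRO): query the flesh, then glue only the query result (tg-query).
tg_result = glue(select_tag(flesh, "title"), skeleton)

assert t_result == tg_result
```

In this toy setting both designs return the same uncertain answer; the difference lies in where the cost of gluing is paid, which is what the experiments compare.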
Experimental validation We conducted an extensive experimental evaluation of both designs on synthetically generated data sets and real-world data sets.
Experimental results on the mapping of P-XML data indicate that the execution time of ffl increases linearly with the number of ordinary nodes and the execution time of fsk increases linearly with the number of distributional nodes. For the first design, PB.PPR is the most efficient glue process to add to the data mapping. The execution time of this glue process increases linearly with the number of ordinary nodes and is unaffected by the number of distributional nodes.
Experimental results on the evaluation of XPath queries indicate that tg-queries and t-queries have the same complexity. In most scenarios, XPath queries evaluated as tg-queries that incorporate PC.PPR as glue process are most efficient.