
In this section, we present further techniques from probabilistic databases (Section 2.4.1) and related approaches, such as probabilistic XML, statistical relational learning, and probabilistic programming.

2.4.1 Probabilistic Databases

Data Model A number of representation mechanisms for probabilistic databases have been investigated in recent years. Most closely related to our data model are x-tuples [14], also called block-independent disjoint databases [30], which allow non-overlapping blocks of tuples to be mutually exclusive and hence extend tuple-independent databases. More expressive are pc-tables [60], which annotate each tuple with a propositional formula over random variables. Under specific restrictions, pc-tables can be stored in a relational database, in which case they are called U-relations [10].

Moreover, pvc-tables [46] extend pc-tables with further support for aggregations. An alternative way to represent dependencies across different tuples are decomposition trees [47]. Their inner nodes represent independent-and, independent-or, or mutual exclusion operators over the leaves, which correspond to database tuples. Storing and querying correlated tuples in a probabilistic database is considered in [76, 128]. The authors express correlations among database tuples by junction trees and additionally allow lineage formulas on top. Likewise, in [119] factor graphs are represented by arithmetic circuits to support correlations. Finally, based on c-tables, [79] features continuous probabilities, which are not supported by most works in the field.
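As an illustration of how the three operator types of a decomposition tree determine a probability, the following minimal sketch evaluates a tree bottom-up. The class names `Leaf` and `Node`, the operator labels, and the example probabilities are assumptions made for this illustration and are not taken from [47].

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    prob: float  # marginal probability of one database tuple


@dataclass
class Node:
    op: str  # "and" / "or" (independent children) or "mutex" (exclusive children)
    children: List["Tree"]


Tree = Union[Leaf, Node]


def probability(tree: Tree) -> float:
    """Evaluate a decomposition tree bottom-up."""
    if isinstance(tree, Leaf):
        return tree.prob
    probs = [probability(child) for child in tree.children]
    if tree.op == "and":    # independent-and: product of child probabilities
        result = 1.0
        for p in probs:
            result *= p
        return result
    if tree.op == "or":     # independent-or: 1 - P(no child holds)
        none_holds = 1.0
        for p in probs:
            none_holds *= 1.0 - p
        return 1.0 - none_holds
    if tree.op == "mutex":  # mutual exclusion: child probabilities add up
        return sum(probs)
    raise ValueError(f"unknown operator: {tree.op}")


# Hypothetical tree: (t1 AND t2) OR (t3 XOR-exclusive t4)
example = Node("or", [
    Node("and", [Leaf(0.5), Leaf(0.4)]),
    Node("mutex", [Leaf(0.2), Leaf(0.3)]),
])
example_prob = probability(example)
```

Each node is visited once, so the evaluation is linear in the size of the tree; the cost lies in finding such a tree for a given lineage formula, not in evaluating it.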

Probability Computations Motivated by the existence of queries exhibiting #P-hard probability computations [58], one major focus of research on probabilistic databases has been on efficient probability computations for query answers. One solution is safe query plans [29], which alter the query plan to carry out efficient probability computations alongside the data computations. In [31] a dichotomy result is shown: for every conjunctive query, probability computation is either in polynomial time or #P-hard. In addition, the difference of two safe conjunctive queries can be unsafe, as shown in [81]. For a larger set of queries, namely unions of conjunctive queries, [28, 32] establish an efficient algorithm which either computes the probabilities in polynomial time or characterizes the query as computationally hard. Another approach is taken by works that do not act directly on queries, but rely on intensional semantics, e.g. on lineage formulas. There, read-once formulas [129] are a class of formulas which can both be detected efficiently and allow efficient probability computations. Another line of work is knowledge compilation [74, 105], which transforms lineage formulas into various forms of binary decision diagrams. In particular, for unions of conjunctive queries, [74] introduced a hierarchy of compilation target languages while rating their expressiveness. Finally, if exact probabilities for query answers are not required, one can employ approximation techniques as in [47, 107]. These works incrementally create decomposition trees whose probability serves as a bound on the probability of the full lineage formula. Supporting aggregations over probabilistic databases can yield exponentially sized results, which is studied in [117]. To avoid this, the authors of [100] maintain only bounds on the aggregates. Furthermore, in [46] arithmetic expressions representing aggregations are compiled into decomposition trees, which also limits the blow-up.
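To make the contrast between general and read-once lineage formulas concrete, the following sketch computes the probability of a small lineage formula in three ways. The tuple names `a`, `b`, `c`, their probabilities, and the helper `prob_by_enumeration` are hypothetical and chosen only for illustration; the enumeration is the naive possible-worlds semantics, not an algorithm from the cited works.

```python
from itertools import product


def prob_by_enumeration(formula, probs):
    """Sum the probabilities of all possible worlds that satisfy `formula`.

    Exponential in the number of variables; tuples are assumed independent.
    """
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


# Hypothetical independent tuples and the lineage (a AND b) OR (a AND c).
probs = {"a": 0.5, "b": 0.4, "c": 0.2}
lineage = lambda w: (w["a"] and w["b"]) or (w["a"] and w["c"])

exact = prob_by_enumeration(lineage, probs)

# The same formula in read-once form, a AND (b OR c): every variable occurs
# once, so the independent-and / independent-or rules apply directly.
read_once = probs["a"] * (1 - (1 - probs["b"]) * (1 - probs["c"]))

# Treating the two conjuncts as independent is wrong, because both share a.
naive = 1 - (1 - probs["a"] * probs["b"]) * (1 - probs["a"] * probs["c"])
```

Here the read-once evaluation agrees with the enumeration, while the naive independent-or over the two conjuncts overestimates the probability because the shared variable `a` is counted twice.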

Probabilistic Database Systems Most of the above approaches are implemented in existing probabilistic database systems. Many of these have been released as open-source prototypes in recent years, including MystiQ [18], MayBMS [10], SPROUT [106], Trio [14], PrDB [128], and ERACER [97].

2.4.2 Probabilistic XML

The eXtensible Markup Language (XML) is a document encoding yielding tree-shaped parses. If different branches of a node in a parse tree are conditioned by independent or mutually exclusive probabilities, or even by logical formulas, then the notion of probabilistic XML arises. Many results in this field are shared with probabilistic databases, since probabilistic databases and probabilistic XML can be encoded into each other [8]. Most results can thus be transferred between the two fields, with the differences arising from the tree shape of XML documents. The complexity of the query evaluation problem for the various probabilistic XML models is characterized in [82]. Additionally, [4] studies the expressiveness of these models.

2.4.3 Statistical Relational Learning

Emerging from a mixture of artificial intelligence [123] and machine learning [99], the subfield of statistical relational learning [56] has many similarities with probabilistic databases. Both tackle uncertainty in relational data which is further described by logical statements.

Markov Logic Networks The most well-known statistical relational learning approach is Markov Logic Networks (MLNs) [120], which have been subject to many improvements, e.g. [102, 121, 132]. Their grounding mechanism proceeds by instantiating literals with all possible constants based on the literals’ type signatures. This is commonly referred to as the open-world assumption. In contrast, probabilistic databases rely on the closed-world assumption with more selective grounding via deduction rules, which usually results in far fewer grounded tuples, and hence in much better scalability.

In Markov logic networks, dependencies between grounded literals are induced by weighted logical clauses where all literals have an equal impact on the truth value of the clause. All these clauses are kept in a large conjunctive formula resembling an undirected graphical model. Conversely, probabilistic databases use unweighted lineage formulas, where the input tuples fully determine the truth value of the output tuple. Hence, in this case the lineage formulas take acyclic directed structures, as opposed to one conjunctive formula in Markov logic networks. A step towards merging both worlds was taken in MarkoViews [73], where probabilistic databases are extended by uncertain views inspired by Markov logic networks. Finally, we compete against Markov logic network implementations throughout the experiments in this thesis.
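The effect of exhaustive grounding and weighted clauses can be sketched on a toy example. The code below assumes the standard log-linear semantics of Markov logic networks, in which a world's unnormalized score is the exponential of the summed weights of its satisfied clause groundings. The predicates `Smokes` and `Cancer`, the constants, and the weight are hypothetical, and the brute-force enumeration of worlds is for illustration only and is not how MLN systems perform inference.

```python
import math
from itertools import product

constants = ["anna", "bob"]  # hypothetical domain of one type
weight = 1.5                 # hypothetical weight of Smokes(x) => Cancer(x)

# Grounding: every predicate is instantiated with every constant of its type.
atoms = [(pred, c) for pred in ("Smokes", "Cancer") for c in constants]


def satisfied_groundings(world):
    """Count the groundings of Smokes(x) => Cancer(x) that hold in `world`."""
    return sum(
        1 for c in constants
        if (not world[("Smokes", c)]) or world[("Cancer", c)]
    )


# One world per truth assignment to the ground atoms (2^4 = 16 here).
worlds = [dict(zip(atoms, values))
          for values in product([False, True], repeat=len(atoms))]
scores = [math.exp(weight * satisfied_groundings(w)) for w in worlds]
Z = sum(scores)  # normalization constant (partition function)


def marginal(atom):
    """Marginal probability of a ground atom under the toy network."""
    return sum(s for w, s in zip(worlds, scores) if w[atom]) / Z


p_cancer_anna = marginal(("Cancer", "anna"))
```

Even in this toy setting the number of ground atoms is the number of predicates times the number of constants, and the number of worlds is exponential in the number of ground atoms, which illustrates why exhaustive grounding limits scalability.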

2.4.4 Probabilistic Programming

In probabilistic programming, known programming frameworks, such as Prolog [90], are extended by probabilities.

ProbLog Most closely related to our data model, tuple-independent probabilistic databases with lineage, is ProbLog [33]; the similarities are discussed in [143]. Its input facts are annotated with independent probabilities and are correlated by rules known from Prolog [90]. After executing the rules, ProbLog keeps the SLD proofs to perform probability computations, which can be viewed as lineage tracing in probabilistic databases. The major difference is that probabilistic databases return all answers to a query with their respective probabilities, whereas ProbLog queries rather ask whether an answer exists at all. Because of these similarities, ProbLog will be a competitor in our experiments.