The core concept of statistical relational learning (SRL) or probabilistic logic learning (PLL) is the combination of machine learning, statistical techniques and
PROBABILISTIC LOGIC LEARNING 17
reasoning in first order logic. Many variants of this theme have been studied, differing both in their logical and in their probabilistic language. Here, we will distinguish two main streams by means of the basic probabilistic framework they employ. We start with the framework we will use throughout this thesis, namely the addition of independent probabilistic alternatives to relational languages, and afterwards discuss relational extensions of graphical models, which encode dependencies between random variables by means of their underlying graphical structure.
2.3.1 Using Independent Probabilistic Alternatives
A probabilistic alternative is a basic random event with a finite number of different outcomes, such as tossing a coin or rolling a die. Sets of mutually independent probabilistic alternatives are commonly used to define joint distributions over such events. Probabilistic context-free grammars (PCFGs) [Manning and Schütze, 1999] are a simple probabilistic model following this idea. Formally, a PCFG is a
tuple (Σ, N, S, R) where Σ, the alphabet of the language defined by the grammar, is a finite set of symbols called terminal symbols, N is a finite set of so-called
nonterminal symbols, S ∈ N is the designated start symbol, and R is a set of rules of the form P : A → β with left-hand side A ∈ N and right-hand side β ∈ (Σ ∪ N)∗, that is, a finite sequence of symbols from Σ ∪ N, where ε denotes the empty sequence, and P ∈ [0, 1] such that the probabilities of all rules in R with the same left-hand side A sum to 1. As is common for such grammars, we denote terminal and nonterminal symbols by lower and upper case letters, respectively, and simply write a PCFG as the set of rules R with start symbol S, leaving Σ and N implicit. Sentences are derived starting from S by replacing the leftmost nonterminal symbol A in the current intermediate sentence by some β with P : A → β ∈ R, until no more replacements are possible, where replacement by ε corresponds to simply deleting A. The choice of rule for a given A is governed by the probability distribution over A’s rules given by their labels P, and is independent of everything else, including replacements of further occurrences of A. Thus, the independent probabilistic alternatives of PCFGs are the choices of rules during derivations, and the probability of a derivation is the product of the probabilities of all its rule applications. Furthermore, the probability of a sentence ω ∈ (Σ ∪ N)∗ is the sum of the probabilities of all derivations ending in ω.
18 FOUNDATIONS
Example 2.8 The following grammar defines a probability distribution over all
finite non-empty strings over the alphabet {a, b}.
$$
\begin{array}{lll}
0.3 : S \to aX & 0.5 : X \to aX & 0.6 : Y \to aX \\
0.7 : S \to bY & 0.1 : X \to bY & 0.2 : Y \to bY \\
               & 0.4 : X \to \varepsilon & 0.2 : Y \to \varepsilon
\end{array}
\tag{2.3}
$$
Note that the grammar does not contain ambiguities, that is, each sentence can be obtained by a single derivation only. For instance, the sentence aab is generated by the derivation
$$S \xrightarrow{0.3} aX \xrightarrow{0.5} aaX \xrightarrow{0.1} aabY \xrightarrow{0.2} aab$$
and thus has probability
$$0.3 \cdot 0.5 \cdot 0.1 \cdot 0.2 = 0.003.$$
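The computation in Example 2.8 can be retraced with a minimal Python sketch. It is an illustration only, not part of any PCFG toolkit: the dictionary `GRAMMAR` and the functions `sentence_probability` and `sample_sentence` are hypothetical names introduced here, and the empty Python string stands in for the empty sequence. The recursion sums over leftmost derivations and relies on the fact that every non-empty rule of this particular grammar emits a terminal first, so mismatching prefixes can be pruned; it is not a general PCFG parser.

```python
import random

# Grammar (2.3): nonterminal -> list of (probability, right-hand side).
# Upper-case letters are nonterminals, lower-case letters are terminals,
# and the empty string "" plays the role of the empty sequence.
GRAMMAR = {
    "S": [(0.3, "aX"), (0.7, "bY")],
    "X": [(0.5, "aX"), (0.1, "bY"), (0.4, "")],
    "Y": [(0.6, "aX"), (0.2, "bY"), (0.2, "")],
}


def leftmost_nonterminal(form):
    """Return the index of the leftmost nonterminal in `form`, or None."""
    for i, symbol in enumerate(form):
        if symbol in GRAMMAR:
            return i
    return None


def sentence_probability(sentence, form="S"):
    """Sum the probabilities of all leftmost derivations of `sentence` from `form`."""
    i = leftmost_nonterminal(form)
    if i is None:
        # Derivation finished: it counts only if it produced the sentence.
        return 1.0 if form == sentence else 0.0
    prefix = form[:i]
    if not sentence.startswith(prefix):
        # The terminals derived so far already disagree with the sentence.
        return 0.0
    total = 0.0
    for prob, rhs in GRAMMAR[form[i]]:
        # Independent alternative: multiply by the probability of the chosen rule.
        total += prob * sentence_probability(sentence, prefix + rhs + form[i + 1:])
    return total


def sample_sentence(rng, start="S"):
    """Sample a sentence by repeatedly rewriting the leftmost nonterminal."""
    form = start
    while True:
        i = leftmost_nonterminal(form)
        if i is None:
            return form
        probs, rhss = zip(*GRAMMAR[form[i]])
        rhs = rng.choices(rhss, weights=probs)[0]
        form = form[:i] + rhs + form[i + 1:]


print(round(sentence_probability("aab"), 6))
print(sample_sentence(random.Random(0)))
```

Because the grammar is unambiguous, the sum in `sentence_probability` has at most one non-zero term per sentence; for an ambiguous grammar the same recursion would add up several derivations, provided it terminates.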
Stochastic Logic Programs (SLPs) [Muggleton, 1995] directly upgrade the idea of PCFGs to definite clauses, that is, instead of probability distributions over all rules with the same left hand side, they use probability distributions over all definite clauses with the same head predicate. Further probabilistic logic languages using independent probabilistic alternatives include the probabilistic logic programs of Dantsin [1991], PHA and ICL [Poole, 1993b, 2000], probabilistic Datalog [Fuhr, 2000], PRISM [Sato and Kameya, 2001], LPADs and CP-logic [Vennekens et al., 2004; Vennekens, 2007] and ProbLog as presented in Chapter 3 of this thesis; we will discuss this group of languages in more detail in Section 3.4. While most other formalisms use rule-based logical languages, FOProbLog [Bruynooghe et al., 2010] combines arbitrary first order formulae with independent probabilistic alternatives.
2.3.2 Using Graphical Models
While the probabilistic languages discussed in the previous section define joint distributions in terms of mutually independent random variables, in Bayesian
Networks (BNs) [Pearl, 1988], a joint probability distribution over a finite set of
random variables with finite domains is defined in terms of a conditional distribution for each variable given a subset of the others. More specifically, a BN is a directed acyclic graph whose nodes correspond to the random variables and whose edges represent direct dependencies between random variables. Each node in the network has an associated probability distribution over its values given the values of its
parents, the starting nodes of the node’s incoming edges. The full joint distribution is the product of these conditional distributions over all nodes.
[Figure 2.2: network with nodes Earthquake (E), Burglary (B), Alarm (A), JohnCalls (J), MaryCalls (M) and edges E → A, B → A, A → J, A → M]

P(E) = 0.002    P(B) = 0.001

        E, B    E, ¬B   ¬E, B   ¬E, ¬B
P(A)    0.95    0.29    0.94    0.001

        A       ¬A
P(J)    0.9     0.05
P(M)    0.7     0.01

Figure 2.2: Bayesian network
Example 2.9 Figure 2.2 shows the well-known alarm Bayesian network [Pearl,
1988; Russell and Norvig, 2004], where all random variables have domain {0, 1}. It defines the joint distribution
$$P(E, B, A, J, M) = P(E) \cdot P(B) \cdot P(A \mid E, B) \cdot P(J \mid A) \cdot P(M \mid A) \tag{2.4}$$
For instance, the probability of {E = 1, B = 0, A = 1, J = 1, M = 1} is
$$P(1, 0, 1, 1, 1) = 0.002 \cdot (1 - 0.001) \cdot 0.29 \cdot 0.9 \cdot 0.7 \approx 0.000365.$$
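The factorization (2.4) can be checked numerically with a short Python sketch. The conditional probability tables are the values read from Figure 2.2; the names `P_E`, `P_B`, `P_A`, `P_J`, `P_M`, `bernoulli` and `joint` are hypothetical and introduced only for this illustration.

```python
from itertools import product

# Probability that each variable takes value 1, given its parents (Figure 2.2).
P_E = 0.002
P_B = 0.001
P_A = {(1, 1): 0.95, (1, 0): 0.29, (0, 1): 0.94, (0, 0): 0.001}  # keyed by (E, B)
P_J = {1: 0.9, 0: 0.05}  # keyed by A
P_M = {1: 0.7, 0: 0.01}  # keyed by A


def bernoulli(p_one, value):
    """Probability of `value` for a binary variable that is 1 with probability p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(e, b, a, j, m):
    """Joint probability according to the factorization (2.4)."""
    return (
        bernoulli(P_E, e)
        * bernoulli(P_B, b)
        * bernoulli(P_A[(e, b)], a)
        * bernoulli(P_J[a], j)
        * bernoulli(P_M[a], m)
    )


# The assignment from Example 2.9.
print(round(joint(1, 0, 1, 1, 1), 6))

# Sanity check: the 32 joint probabilities sum to 1 up to rounding error.
print(round(sum(joint(*values) for values in product((0, 1), repeat=5)), 6))
```

Storing only the probability of value 1 per parent configuration suffices here because all variables are binary; variables with larger domains would need one entry per value.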
While Bayesian networks can be mirrored in terms of independent alternatives, as we will see in Section 3.4.1, inference for special purpose languages can directly exploit the underlying independencies.
Relational extensions of Bayesian networks typically specify the graph structure at an abstract level in some relational language and use this specification as a kind of template, from which concrete instances of Bayesian networks can be obtained by grounding out logical variables. Prominent examples of such extensions include Relational Bayesian Networks [Jäger, 1997], Probabilistic Relational Models [Friedman et al., 1999], CLP(BN) [Santos Costa et al., 2003], Logical Bayesian Networks [Fierens et al., 2005], Bayesian Logic Programs [Kersting and De Raedt, 2008], and P-log [Baral et al., 2009]. In contrast to these languages, Markov Logic Networks [Richardson and Domingos, 2006] are a first-order variant of undirected graphical models, using weighted first-order logic formulae as templates to construct Markov Networks, thereby defining probability distributions over possible worlds.
Figure 2.3: Full binary decision tree and corresponding BDD for formula x ∨ (y ∧ z); dotted edges correspond to value 0, solid ones to 1.
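The relation between the two diagrams in Figure 2.3 can be illustrated with a small Python sketch. The node table `BDD` is an assumption: it encodes one reduced diagram for x ∨ (y ∧ z) with variable order x, y, z, which may differ in layout from the figure, and the names `formula`, `leaves` and `evaluate_bdd` are hypothetical.

```python
from itertools import product


def formula(x, y, z):
    """The formula x OR (y AND z) over 0/1 values."""
    return int(x or (y and z))


# Leaves of the full binary decision tree, read left to right,
# taking the 0-branch before the 1-branch at every level (order x, y, z).
leaves = [formula(x, y, z) for x, y, z in product((0, 1), repeat=3)]
print(leaves)

# Assumed reduced diagram: node -> (variable, child for value 0, child for value 1).
# The integers 0 and 1 are the terminal nodes.
BDD = {
    "x": ("x", "y", 1),
    "y": ("y", 0, "z"),
    "z": ("z", 0, 1),
}


def evaluate_bdd(assignment, node="x"):
    """Follow the edges selected by `assignment` until a terminal is reached."""
    while node not in (0, 1):
        variable, low, high = BDD[node]
        node = high if assignment[variable] else low
    return node


# The three-node diagram agrees with the eight-leaf tree on every assignment.
for x, y, z in product((0, 1), repeat=3):
    assert evaluate_bdd({"x": x, "y": y, "z": z}) == formula(x, y, z)
```

The diagram needs three inner nodes where the full tree needs seven, because subtrees with identical outcomes are shared or skipped.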