Bayesian Logic Programs (BLPs) - Directed Models For Statistical Relational Learning

Bayesian logic programs (BLPs) [46] are a model based on Bayesian networks. BLPs unify Bayesian networks with logic programming [72] to overcome both the propositional character of Bayesian networks and the purely logical nature of logic programs. BLPs use Bayesian clauses with conditional probability tables to represent the distribution of the head of a clause conditional on its body, and use combining rules to merge the information about a single literal that is the head of several clauses. BLPs are implemented in a package called BALIOS [46] and are considered one of the successful models of statistical relational learning.

2.5.1 Representation and Learning in BLPs

BLPs are built from logic programs. A logic program is a set of clauses of the form A :- B1, B2, ..., Bn, where A and the Bi are universally quantified atoms. We call A the head and the Bi the body of the clause. The head of a clause is considered true in the model if the body of the clause is true. BLPs use Bayesian clauses, which differ from logical clauses: a Bayesian clause uses a conditional probability table to store the probability of the head conditioned on its body, whereas a logical clause has a deterministic value. Several clauses may have the same variable in the head. Since each clause has its own local probability distribution, a variable may have several local probability distributions with possibly different sets of parents. Here are two clauses for defining an intelligent person.

intelligent(X) :- highrank(X)

intelligent(X) :- friend(X, Y), intelligent(Y)

To obtain a single conditional probability distribution for the variable that includes the union of all parents, both combining rules and aggregation functions are used. A combining rule is a function that maps a finite set of conditional probability distributions $\{ P(A \mid A_{i1}, \ldots, A_{in_i}) \mid i = 1, \ldots, m \}$ onto one combined conditional probability distribution $P(A \mid B_1, \ldots, B_k)$ with $\{B_1, \ldots, B_k\} \subseteq \bigcup_{i=1}^{m} \{A_{i1}, \ldots, A_{in_i}\}$. BLPs can also use aggregation functions, as in PRMs. In some domains aggregation functions are more natural than combining rules. For example, to summarize the grades of a student it is more appropriate to use an aggregation function than a combining rule.
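The difference between the two mechanisms can be made concrete with a minimal sketch. It assumes an averaging combining rule, which is only one possible choice and is not prescribed by the text above; the probabilities and grades are hypothetical values chosen for illustration, and the function names are ours.

```python
from statistics import mean

def average_combining_rule(distributions):
    """Combine per-clause distributions over the same head by averaging.

    Each distribution is a dict mapping a value of the head to its
    probability, already conditioned on that clause's body.
    """
    values = distributions[0].keys()
    return {v: mean(d[v] for d in distributions) for v in values}

def average_grade(grades):
    """Aggregation function: summarize many parent values into one value."""
    return mean(grades)

# Hypothetical distributions for intelligent(X), one per clause.
from_highrank = {True: 0.9, False: 0.1}   # intelligent(X) :- highrank(X)
from_friend = {True: 0.6, False: 0.4}     # intelligent(X) :- friend(X, Y), intelligent(Y)

combined = average_combining_rule([from_highrank, from_friend])
```

The combining rule takes distributions in and returns a distribution, whereas the aggregation function takes parent values in and returns a single summary value on which a distribution can then be conditioned.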

BLPs use a distinctive and fairly complex graphical model, which extends Bayesian networks. The model is a bipartite directed acyclic graph with two sets of nodes: gray ovals denote random variables and black boxes denote local probability models. The incoming edges of a black box come from the parents of the variable x_i, which is connected to the box through its single outgoing edge. The black box specifies the conditional probability distribution p(x_i | pa(x_i)). Variables are drawn as white ovals and constants as white boxes. Arguments of variables are represented as white circles on the boundary of the ovals. Consider Figure 2.11, which is the BLP for part of our university example. BLPs add aggregation functions to their model using octagonal nodes. R5 deterministically computes the average grade of a student over the set of subjects in which the student is registered, and R6 shows that the rank of a student depends probabilistically on that average.

2.5.2 Learning in BLPs

Learning in BLPs is a probabilistic extension of learning in inductive logic programming [72], as in MLNs [15], and is formulated as follows: given a set of Bayesian logic programs G_M, G_D, and a scoring function f(.), find an acyclic candidate G*_M such that G*_M matches G_D best according to f(.). The scoring function f(.) is used to evaluate how good the clauses are.

Figure 2.11: A Bayesian logic program for part of the university domain. Octagonal nodes show aggregate predicates. The figure is taken from [46].

To adapt traditional parameter estimation techniques for Bayesian networks, such as the expectation maximization algorithm [14], combining rules are required to be decomposable [40]; the most common combining rules for Bayesian networks, such as “noisy or” [102], are decomposable. The best match refers to those parameters of the associated conditional probability distributions that maximize the scoring function, where the scoring function is based on maximum likelihood [17].

Structure learning in BLPs follows the procedure of rule learning in ILP systems [82], which have operators such as adding and deleting literals, flipping, instantiating variables, and unifying variables on literals or clauses. BLPs speed up the learning procedure by executing several operations simultaneously. A simple hill climbing algorithm for BLP learning can be sketched as follows: (1) start with a Bayesian logic program G_M, (2) compute f(G_M) based on G_D, (3) carry out all operators to find all the neighbors G'_M of G_M that do not introduce cycles, (4) compute f(G'_M) based on G_D, (5) if f(G_M) < f(G'_M) then G_M = G'_M,
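The steps listed above can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the text: a structure is reduced to a set of (parent, child) edges rather than a full set of Bayesian clauses, the only operators are adding or deleting a single edge, the scoring function is supplied by the caller, and, since the step list breaks off after step (5), the loop is assumed to repeat until no neighbor improves the score.

```python
from itertools import permutations

def is_acyclic(edges):
    """Return True if the directed graph of (parent, child) pairs has no cycle."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in visiting:
            return False  # reached a node on the current path: cycle
        visiting.add(node)
        ok = all(visit(c) for c in children.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return ok

    return all(visit(n) for n in list(children))

def neighbours(edges, nodes):
    """Step (3): structures one edge addition or deletion away that stay acyclic."""
    for pair in permutations(nodes, 2):
        candidate = edges ^ {pair}  # toggle a single edge
        if is_acyclic(candidate):
            yield frozenset(candidate)

def hill_climb(start, nodes, score):
    """Greedy search: move to the best-scoring neighbour until none is better."""
    current = frozenset(start)
    current_score = score(current)          # step (2)
    while True:
        best, best_score = current, current_score
        for cand in neighbours(current, nodes):
            s = score(cand)                 # step (4)
            if s > best_score:              # step (5)
                best, best_score = cand, s
        if best == current:
            return current, current_score
        current, current_score = best, best_score

# Toy usage: the score rewards closeness to a hypothetical target structure.
nodes = ["highrank", "friend", "intelligent"]
target = {("highrank", "intelligent"), ("friend", "intelligent")}
result, result_score = hill_climb(set(), nodes, lambda g: -len(g ^ target))
```

In a real BLP learner the score would be a likelihood-based function of the data, and each neighbor would require re-estimating the clause parameters before it can be scored.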

2.5.3 Inference in BLPs

Inference in BLPs, as in other SRL methods, is intractable and proceeds by grounding the clauses of the Bayesian logic program. Each Bayesian logic program specifies a propositional Bayes net, on which inference is done using standard Bayes net inference algorithms. The set of variables in G_I of BLPs is similar to that of MLNs: a variable is required for each grounding of each predicate. The ground predicates of a clause are connected to each other; the literals in the body of the clause are the parents of the literal in the head of the clause. Figure 2.12 is the G_I for the first order formulas in Table 2.3 with constants k and j.

The second formula is illegal for BLPs because it introduces cycles. The formula is changed into ∀x∀y(intelligent(x) ∧ friend(x, y) ⇒ highrank(y)).

Figure 2.12: An example of an inference graph for the two formulas (intelligent(x) ⇒ highrank(x)) and (intelligent(x) ∧ friend(x, y) ⇒ highrank(y)). The set of predicates grounding a clause are connected to each other; the literals of the body of the clause are the parents of the literal in the head of the clause.
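The grounding step behind such an inference graph can be sketched as follows. The clause encoding and the function names are our own illustrative choices, not the BALIOS representation, and the sketch simply unions the parents contributed by all clauses; a full implementation would keep the parents of each clause separate so that a combining rule can be applied.

```python
from itertools import product

# Each clause is (head, body); an atom is (predicate, tuple of variable names).
CLAUSES = [
    (("highrank", ("X",)), [("intelligent", ("X",))]),
    (("highrank", ("Y",)), [("intelligent", ("X",)), ("friend", ("X", "Y"))]),
]

def ground_atom(atom, binding):
    """Substitute constants for variables and render the atom as a string."""
    predicate, args = atom
    return f"{predicate}({','.join(binding[a] for a in args)})"

def ground(clauses, constants):
    """Return {ground head: set of ground parents} over all substitutions."""
    parents = {}
    for head, body in clauses:
        variables = sorted({v for _, args in [head, *body] for v in args})
        for values in product(constants, repeat=len(variables)):
            binding = dict(zip(variables, values))
            node = ground_atom(head, binding)
            parents.setdefault(node, set()).update(
                ground_atom(atom, binding) for atom in body
            )
    return parents

# Ground the two formulas with the constants k and j.
graph = ground(CLAUSES, ["k", "j"])
```

With two constants, each ground atom highrank(.) receives one parent from the first clause and up to four from the second, one pair of intelligent and friend atoms per choice of x.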

Combining rules are required for predicates that are the head of several clauses. The parameters of the two clauses need to be combined to obtain the correct conditional distribution for highrank. A well known and frequently used combining rule is “noisy or”.
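As an illustration, the standard noisy-or computation can be sketched in a few lines. The per-clause probabilities below are hypothetical, each clause is treated as a single independent cause of the head, and no leak term is modeled.

```python
def noisy_or(causes):
    """Noisy-or: the head is false only if every active cause fails to fire.

    causes is a list of (is_active, p) pairs, where p is the probability
    that this cause alone makes the head true.
    """
    prob_all_fail = 1.0
    for is_active, p in causes:
        if is_active:
            prob_all_fail *= 1.0 - p
    return 1.0 - prob_all_fail

# Hypothetical parameters for the two clauses with head highrank.
p_highrank = noisy_or([
    (True, 0.8),   # body of the first clause holds
    (True, 0.5),   # body of the second clause holds
])
```

With both bodies satisfied, the probability that both causes fail is 0.2 × 0.5 = 0.1, so the combined probability of highrank is approximately 0.9.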
