
6.6 Probabilistic Inference

6.6.2 Weight Learning

In our approach, we learn the weights of the formulas discriminatively, that is, by maximizing the conditional likelihood of the query predicates given the evidence predicates. Weight learning takes as input a training dataset and an MLN program without weights, and computes the optimal weights of the MLN rules by maximizing the likelihood of the training data.

We use the Diagonal Newton discriminative learner [LOWD and DOMINGOS 2007] as implemented in Tuffy. We learn the weights for each user network separately and then use their mean as the weight of each formula in our final set.
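The averaging step can be sketched as follows. This is a minimal illustration, assuming that each user network yields a dictionary from a formula identifier to its learned weight. The function name `average_weights`, the identifiers `r1` and `r2`, and the toy values are hypothetical. The sketch averages each formula over the networks in which it occurs, which is one possible reading of the procedure.

```python
from collections import defaultdict
from statistics import mean


def average_weights(per_network_weights):
    """Average the learned weight of each formula across user networks.

    per_network_weights: list of dicts, one per user network, mapping a
    formula id to its learned weight (hypothetical representation).
    """
    collected = defaultdict(list)
    for weights in per_network_weights:
        for formula_id, weight in weights.items():
            collected[formula_id].append(weight)
    # Assumption: a formula missing from some network is averaged only
    # over the networks in which it occurs.
    return {fid: mean(ws) for fid, ws in collected.items()}


# Made-up weights learned on two user networks.
networks = [{"r1": 1.0, "r2": 0.5}, {"r1": 2.0, "r2": 1.5}]
final_weights = average_weights(networks)
```

Here `final_weights` holds one weight per formula, which would then be attached to the final rule set.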


6.7 Transfer Process

6.7.1 Rule Mapping

The set of predicates used to describe data in the source and target domains may be partially or completely distinct. The first task in the transfer process is to establish a mapping from predicates in the source domain to predicates in the target domain. At this stage, we do not revise the weights of the rules, but focus on their structure. We deploy a global mapping approach: establish a mapping for each source predicate to a target predicate and then use it to translate the entire source MLN. While specific techniques can be used to discover mappings automatically, we assume here that the global mappings are already given.

For clarity, Table 6.2 illustrates an example of the rule mapping between the source and target domains. In this example, the source domain has the predicates User(person), Book(object), Category(cat), shareCategory(object, object), and shareAge(person, person), and three rules are defined over them. The target domain has the predicates User(person), Movie(object), Genre(genre), and shareGenre(object, object). We have at our disposal a global mapping, which in this case maps the predicate User of the first domain to the predicate User of the second domain, Book to Movie, Category to Genre, and finally shareCategory to shareGenre.

When the three rules of the source domain are transferred to the target domain, the predicate shareAge is marked as invalid, because it is missing from the global mapping and is not contained in the schema of the target domain. Consequently, the rule containing this predicate is also invalidated and is not used in the target domain. For the other two rules, we replace the predicates and variables according to the global mapping.

Table 6.2: Example of the rule mapping between the source and target domains

Source domain predicates:
User(person), Book(object), Category(cat), shareCategory(object, object), shareAge(person, person)

Target domain predicates:
User(person), Movie(object), Genre(genre), shareGenre(object, object)

Source rules:
(1) hasRating(u1, b1, r) ∧ shareCategory(b1, b2) => rates(u1, b2, r)
(2) hasRating(u1, b, r) ∧ shareAge(u1, u2) => rates(u2, b, r)
(3) hasRating(u1, b1, r1) ∧ hasRating(u2, b1, r1) ∧ hasRating(u1, b2, r2) ∧ ¬ sameUser(u1, u2) ∧ ¬ sameBook(b1, b2) => rates(u2, b2, r2)

Mapping:
User(person) → User(person) √
Book(object) → Movie(object) √
Category(cat) → Genre(genre) √
shareCategory(object, object) → shareGenre(object, object) √
shareAge(person, person) ×

Transferred rules:
(1) hasRating(u1, b1, r) ∧ shareGenre(b1, b2) => rates(u1, b2, r) √
(2) hasRating(u1, b, r) ∧ shareAge(u1, u2) => rates(u2, b, r) ×
(3) hasRating(u1, b1, r1) ∧ hasRating(u2, b1, r1) ∧ hasRating(u1, b2, r2) ∧ ¬ sameUser(u1, u2) ∧ ¬ sameMovie(b1, b2) => rates(u2, b2, r2) √
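The mapping step illustrated in Table 6.2 can be sketched in a few lines. This is only an illustration under stated assumptions: rules are represented as lists of `(negated, predicate, args)` literals with the head as the last literal, the names `MAPPING`, `TARGET_SCHEMA`, and `map_rule` are hypothetical, the entry sameBook → sameMovie is inferred from the transferred rule (3), and the predicates hasRating, rates, and sameUser are assumed to exist unchanged in the target schema.

```python
# Global predicate mapping from Table 6.2. The sameBook entry is an
# assumption inferred from the transferred rule (3).
MAPPING = {
    "User": "User",
    "Book": "Movie",
    "Category": "Genre",
    "shareCategory": "shareGenre",
    "sameBook": "sameMovie",
}

# Predicates assumed to be available in the target schema.
TARGET_SCHEMA = {
    "User", "Movie", "Genre", "shareGenre",
    "hasRating", "rates", "sameUser", "sameMovie",
}


def map_rule(rule):
    """Translate a source rule, or return None if it is invalid.

    A rule is a list of (negated, predicate, args) literals; the last
    literal is the head of the rule.
    """
    mapped = []
    for negated, predicate, args in rule:
        target = MAPPING.get(predicate, predicate)
        if target not in TARGET_SCHEMA:
            return None  # e.g. shareAge has no counterpart in the target
        mapped.append((negated, target, args))
    return mapped


rule1 = [(False, "hasRating", ("u1", "b1", "r")),
         (False, "shareCategory", ("b1", "b2")),
         (False, "rates", ("u1", "b2", "r"))]
rule2 = [(False, "hasRating", ("u1", "b", "r")),
         (False, "shareAge", ("u1", "u2")),
         (False, "rates", ("u2", "b", "r"))]

transferred = [map_rule(rule) for rule in (rule1, rule2)]
```

Rule (1) is translated literal by literal, while rule (2) is rejected as a whole because shareAge can neither be mapped nor found in the target schema.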


6.7.2 Rule Revision

The second step of the knowledge transfer process is to revise the source rules in order to improve their fit to the target data. The revision procedure focuses particularly on learning appropriate weights for the rules in the target domain. We base our approach on previous work on first-order theory revision [MIHALKOVA et al. 2007, PAES et al. 2005, RICHARDS and MOONEY 1995]. The basic idea behind theory revision is that one can start with a domain theory that may be approximate and incomplete, and then correct its inaccuracies and incompleteness by training on examples. If the domain theory is at least approximately correct, we can learn faster with it than without it. Ideally, this should also result in more accurate theories.

The problem tackled in our case is that perhaps not all the source rules are useful in the target domain, and not all the target theory can be explained/learned from the available source rules. Our work aims to address the following questions:

• If not all the source rules are related to the target task, how do we select the most relevant subset from the source domain rules?

• If not all the theory of the target domain can be explained or learned from the source rules, how do we identify the subset from the target domain that can benefit the most from the knowledge transfer?

Adhering to the definitions of Paes et al. [PAES et al. 2005], we formulate our revision task as generalization, which involves improving the inferential capabilities of a given probabilistic first-order theory by adding previously missing answers. The revision approach starts from an initial theory that is then minimally modified to become consistent with a set of given examples. In our case, we deal with positive examples only. This initial theory is divided into two parts: (i) background knowledge, which is the predicate schema that is assumed to be correct, and (ii) the knowledge that can be modified by the revision, in our case the rules.

Definition 14. (Revision State). A revision state is defined as a tuple $(T, R, C^{+}, F)$ consisting of a fixed probabilistic first-order theory $T$, the set of probabilistic first-order rules $R$, a set of positive examples $C^{+}$, and a probabilistic evaluation function $F$.

We introduce the notion of consistency of revision states to express the condition that the revised theory logically implies all the examples and maximizes a given evaluation function.

Definition 15. (Revision State Consistency). A revision state is consistent, denoted as $(T, R_{\models}, C^{+}, F)$, if its background theory and rules logically imply all the examples, $T \cup R_{\models} \models C^{+}$, and maximize the probabilistic evaluation function $F$.

The theory in our case is a Markov logic program. The dataset $C^{+}$ of examples consists of the literals obtained after grounding. Rule revision, presented in Algorithm 7, consists in using an initial probabilistic first-order theory consisting of fixed background knowledge $T$, rules $R$, a set of positive examples $C^{+}$, and a probabilistic evaluation function $F$, in order to find a consistent revision state $(T, R_{\models}, C^{+}, F)$. We achieve this by performing probabilistic revision of the theory in the target domain. In probabilistic revision, the current structure is retained and the probability distributions that maximize the given probabilistic evaluation function are searched, resulting in a consistent revision state (according to Definition 15). This process boils down to parameter revision of the theory, which in our case is the task of learning the weights of the rules.

Algorithm 7: Rule Revision Algorithm

Input: theory $T$, rules $R$, evidence $C^{+}$, and evaluation function $F$
Output: $(T, R_{\models}, C^{+}, F)$, a consistent revision state

repeat
    for each rule $\alpha_i \in R$ do
        perform probabilistic revision
    end for
until state $(T, R_{\models}, C^{+}, F)$ is consistent

Various algorithms exist for parameter learning in MLNs. In our approach, we perform discriminative weight learning with the Diagonal Newton discriminative learner presented by Lowd and Domingos [LOWD and DOMINGOS 2007]. This is a gradient descent-style algorithm, which deploys a preconditioned scaled conjugate gradient. The discriminative training method minimizes the negative conditional log-likelihood of the query predicates given the evidence predicates.
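To make the weight update more concrete, the following is a minimal sketch of a diagonal Newton step on a toy log-linear model: each weight is moved against its gradient, scaled by the inverse of the corresponding diagonal Hessian entry, which is the variance of the rule's true-grounding count. It is not the implementation used in Tuffy. As simplifying assumptions, the sketch enumerates a handful of possible worlds exactly, whereas practical learners estimate the expectations by sampling, and it uses a fixed step size instead of the step-size selection and conjugate gradient machinery of the cited learner. All names and the toy counts are hypothetical.

```python
import math


def distribution(weights, worlds):
    """P(world) proportional to exp(sum_i w_i * n_i(world))."""
    scores = [math.exp(sum(w * n for w, n in zip(weights, counts)))
              for counts in worlds]
    z = sum(scores)
    return [s / z for s in scores]


def ncll(weights, worlds, observed):
    """Negative log-likelihood of the observed world."""
    return -math.log(distribution(weights, worlds)[observed])


def diagonal_newton_step(weights, worlds, observed, step=1.0, eps=1e-6):
    """One update w_i <- w_i - step * g_i / H_ii for every weight."""
    probs = distribution(weights, worlds)
    new_weights = []
    for i, w in enumerate(weights):
        expected = sum(p * c[i] for p, c in zip(probs, worlds))
        second = sum(p * c[i] ** 2 for p, c in zip(probs, worlds))
        variance = second - expected ** 2           # diagonal Hessian entry
        gradient = expected - worlds[observed][i]   # d NCLL / d w_i
        new_weights.append(w - step * gradient / (variance + eps))
    return new_weights


# Each tuple holds the true-grounding counts of two rules in one world.
worlds = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
observed = 4                # index of the world seen in the training data
weights = [1.0, 0.5]        # arbitrary starting weights
start = ncll(weights, worlds, observed)
for _ in range(10):
    weights = diagonal_newton_step(weights, worlds, observed)
end = ncll(weights, worlds, observed)
```

On this toy problem the expected counts match the observed counts when both weights are zero, so the loop drives the weights towards zero and the objective towards log 5.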

Thus, the evaluation function $F$ is the negative conditional log-likelihood (NCLL), which is defined as $\mathrm{NCLL}(T \mid B) = -\mathrm{CLL}(T \mid B)$, with CLL being the conditional log-likelihood function [FRIEDMAN et al. 1997]:

$$\mathrm{CLL}(T \mid B) = \sum_{i=1} \log P(y_i \mid x_{i,1}, \ldots, x_{i,v-1}) \tag{6.2}$$

where the sum runs over the examples $B_i = \{y_i, x_{i,1}, \ldots, x_{i,v-1}\}$ and $y_i$ represents the class in example $i$.

Maximizing the conditional likelihood of the class is equivalent to minimizing the classification error. Conditional likelihood is preferable in classification problems, where a theory with the smallest classification error needs to be found.
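As a small numerical illustration of Equation 6.2, the sketch below computes the CLL and the NCLL from the probabilities that a model assigns to the true class of each example. The function names and the three probabilities are made up for illustration.

```python
import math


def conditional_log_likelihood(class_probabilities):
    """Sum of log P(y_i | x_i) over the examples."""
    return sum(math.log(p) for p in class_probabilities)


def negative_cll(class_probabilities):
    """NCLL = -CLL, the quantity minimized during training."""
    return -conditional_log_likelihood(class_probabilities)


# Probability assigned to the true class of three hypothetical examples.
probs = [0.5, 0.5, 0.25]
cll = conditional_log_likelihood(probs)
ncll_value = negative_cll(probs)
```

A model that assigns higher probability to the true classes obtains a CLL closer to zero and hence a smaller NCLL.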

For the rules transferred from the source domain, we keep their original weights if they are positive, and otherwise assign a weight of 1. In our trials, we observed that performing parameter revision with negative weights leads to intractable processes. The rules of the target domain are likewise assigned an initial weight of 1.
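The initialization described above can be written as a short sketch. The function name `initial_weights`, the rule identifiers, and the example weights are hypothetical.

```python
def initial_weights(source_weights, target_rule_ids, default=1.0):
    """Initial rule weights before revision in the target domain.

    source_weights: dict mapping a transferred rule id to its source weight.
    target_rule_ids: ids of the rules defined in the target domain.
    """
    weights = {}
    for rule_id, weight in source_weights.items():
        # A transferred rule keeps a positive weight, otherwise it gets 1.
        weights[rule_id] = weight if weight > 0 else default
    for rule_id in target_rule_ids:
        # Rules of the target domain always start with weight 1.
        weights[rule_id] = default
    return weights


init = initial_weights({"s1": 2.3, "s3": -0.7}, ["t1"])
```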

6.8 Experimental Results

Our experiments are organized in two parts. The first part consists of experiments conducted in the single-domain case, where we need to test the accuracy of our approach within one domain. The second part consists of the cross-domain case, where we evaluate the mechanism of transferring knowledge across domains and test the accuracy of predictions in the target domain.

6.8.1 Datasets

The experiments are conducted on the following three publicly available datasets:

• MovieLens⁵: The original MovieLens dataset contains 10 million ratings (on a scale of 1-5) from 71576 users and 10681 movies. For a better comparison with existing approaches, we follow the evaluation procedure of Shi et al. [SHI et al. 2013] and select a subset with the first 5000 users and 5000 movies according to the identifiers in the original dataset. In the following, this dataset is denoted as ML.

• LibraryThing⁶: The original LibraryThing dataset contains ca. 750 thousand ratings from 7279 users and 37232 books. For the subset, we also select the first 5000 users and 5000 books [CLEMENTS et al. 2010]. This dataset is denoted as LT. The statistics of the ML and LT datasets are summarized in Table 6.3.

• BookCrossing⁷: This is a dataset of ratings from an online book club where users can rate books. In prior work [ZIEGLER et al. 2005], book ratings were collected from this site⁸. We use this dataset for one specific type of experiment: evaluating recommendation utility in the single-domain case. This dataset is also used in recent studies on information heterogeneity in recommender systems [CANTADOR et al. 2011]. We performed a cleanup of the data, since it is quite noisy: there are invalid ISBNs, and some of the ISBNs in the rating file cannot be found in the book description file. Statistics of this dataset, denoted as BX, are displayed in Table 6.4. We test with various subsets obtained by filtering users based on different minimum numbers of ratings.
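The subset selection used for ML and LT can be sketched as follows. This is one possible reading of "the first 5000 users and 5000 items according to the identifiers": the smallest identifiers are kept and only ratings between retained users and retained items survive. The function name and the toy ratings are hypothetical.

```python
def first_n_subset(ratings, n_users=5000, n_items=5000):
    """Keep ratings among the first n users and items by identifier.

    ratings: iterable of (user_id, item_id, rating) tuples.
    """
    ratings = list(ratings)
    users = set(sorted({u for u, _, _ in ratings})[:n_users])
    items = set(sorted({i for _, i, _ in ratings})[:n_items])
    return [(u, i, r) for u, i, r in ratings if u in users and i in items]


# Toy data: three users and three items, reduced to the first two of each.
toy = [(1, 10, 5), (2, 11, 3), (3, 10, 4), (1, 12, 2)]
subset = first_n_subset(toy, n_users=2, n_items=2)
```

Because the selection is deterministic, repeating it on the same original dataset yields the same subset.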

Dataset   Nr. users   Nr. items   Nr. ratings   Sparseness
ML        5000        5000        584628        97.70%
LT        5000        5000        179419        99.30%

Table 6.3: Statistics of the datasets ML and LT
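The sparseness values in Table 6.3 are consistent with the usual definition as the share of empty cells in the user-item rating matrix. The sketch below assumes this definition, which is not spelled out in the text, and recomputes the values from the reported counts.

```python
def sparseness(n_users, n_items, n_ratings):
    """Percentage of empty cells in the user-item rating matrix."""
    return 100.0 * (1.0 - n_ratings / (n_users * n_items))


ml = sparseness(5000, 5000, 584628)
lt = sparseness(5000, 5000, 179419)
```

Rounded to one decimal place, both values agree with the table.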

As in the related work of Shi et al. [SHI et al. 2013], we select the subsets by identifier rather than at random, in order to ensure an accurate performance comparison and the reproducibility of future experiments.

⁶ http://ir.ii.uam.es/hetrec2011/datasets.html

http://www.bookcrossing.com


Min. ratings   Nr. users   Nr. books   Nr. ratings
5              5,628       57,324      136,284
10             3,056       52,528      119,563
30             1,053       42,340      86,928
50             568         36,194      68,361

Table 6.4: Statistics of the BookCrossing dataset (BX)
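The BX subsets in Table 6.4 are obtained by keeping only users with a minimum number of ratings. A minimal sketch of such a filter, with a hypothetical function name and toy data, could look as follows.

```python
from collections import Counter


def filter_by_min_ratings(ratings, min_ratings):
    """Keep only ratings from users with at least min_ratings ratings.

    ratings: iterable of (user_id, isbn, rating) tuples.
    """
    ratings = list(ratings)
    per_user = Counter(user for user, _, _ in ratings)
    return [(u, b, r) for u, b, r in ratings if per_user[u] >= min_ratings]


# Toy data: user "a" has two ratings, user "b" has one.
toy_bx = [("a", "isbn1", 7), ("a", "isbn2", 9), ("b", "isbn1", 4)]
kept = filter_by_min_ratings(toy_bx, min_ratings=2)
```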

In document Cross-domain Recommendations based on semantically-enhanced User Web Behavior (Page 168-175)