Conclusions and Limitations - Explaining Data Patterns using Knowledge from the Web of Data

with the #KMi.H results, whose WRacc scores significantly increase when the background knowledge is extended.

While this suggests that better explanations can be obtained by adding even more information, it also reveals that hypotheses for explanation have to be made more understandable. In #Red, we need additional information to reveal the relation between the readers and their topic (what connects Jane Austen with Anglican women? Is Jane Austen Anglican too?). In the case of #KMi.H, this relation is already more visible to our eyes (the faculty of Engineering and the Engineering books are in fact about the same topic), but an automatic approach should be able to express it anyway, no matter how trivial such a relation is.

Finally, while WRacc has proven to be a good starting point, the F-Measure should be preferred because it can be more effective in “unclear situations”, i.e. with smaller clusters or when no obvious explanation is visible because of a lack of information.
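To make the comparison between the two measures concrete, the following minimal sketch computes both for a single hypothesis. It assumes the standard definitions: WRacc as the coverage of the hypothesis multiplied by its gain in precision over the base rate of the pattern, and the F-Measure as the harmonic mean of precision and recall. The function names and the toy item sets are hypothetical, and the exact formulation used in the evaluation is the one defined earlier in the chapter.

```python
def wracc(covered, positives, universe):
    """Weighted Relative Accuracy: coverage * (precision - base rate).

    covered   -- items the hypothesis applies to
    positives -- items belonging to the pattern (cluster) to explain
    universe  -- all items in the dataset
    """
    if not covered:
        return 0.0
    coverage = len(covered) / len(universe)
    precision = len(covered & positives) / len(covered)
    base_rate = len(positives) / len(universe)
    return coverage * (precision - base_rate)


def f_measure(covered, positives):
    """Harmonic mean of precision and recall of the hypothesis."""
    true_positives = len(covered & positives)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(covered)
    recall = true_positives / len(positives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 20 items, a small cluster of 2, and a hypothesis
# that covers exactly that cluster.
universe = set(range(20))
small_cluster = {0, 1}
perfect_hypothesis = {0, 1}

wracc_score = wracc(perfect_hypothesis, small_cluster, universe)
f_score = f_measure(perfect_hypothesis, small_cluster)
```

In this toy case the hypothesis matches the cluster exactly, so the F-Measure is 1.0, whereas WRacc is only about 0.09 because it is weighted by the small coverage of the hypothesis. This is one way to read the remark about smaller clusters, not a reproduction of the reported experiments.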

3.5 Conclusions and Limitations

In this chapter, we have presented a preliminary idea for building a process that automatically generates explanations using background knowledge found in the Web of Data. We have shown in Section 3.1 how we identified induction as the reasoning process to obtain explanations for patterns, and how the framework of Inductive Logic Programming was a good candidate for this purpose. We continued with Section 3.2, where we introduced the general framework of ILP, and with Section 3.3, in which we designed our problem as an ILP one. Finally, we tested the approach in Section 3.4 on two use-cases, #Red and #KMi.H, presenting and discussing the obtained results.

The results and final discussion have revealed the limitations of the current approach and consequently the next steps that we need to undertake. We present them below.

Data is manually selected. Since our focus was on proving the validity of the Inductive Logic Programming approach, the background knowledge in this chapter has been semi-manually selected. More precisely, when expanding the background knowledge with information from other Linked Data datasets, we intentionally chose the properties that we believed would form interesting hypotheses. This mostly means that we drove the ILP process to induce the explanations that we wanted (for instance, we believed that religion could be influencing a reader’s book choice). With that said, our objective is to make this selection of background knowledge an automatic process, where knowledge in Linked Data is agnostically accessed. In order to detect what strongly connects the data in a pattern, the next step is to find a good way to automatically detect the useful background knowledge in Linked Data.

Aleph is not scalable. Inductive Logic Programming has been shown to scale poorly when the background knowledge is large. This means that designing a process in which we first find the valid Linked Data information and then build the background knowledge to use Aleph is risky, since the computational complexity might be too high and the hypotheses might never be obtained. However, we have shown that by adding information to the background knowledge little by little, we were able to noticeably improve the hypothesis accuracy. The next step to take in this direction is therefore to design an inductive process in which information from Linked Data is iteratively added until a satisfying explanation is found.

Explanations are not complete. Finally, we have seen that hypotheses are still unclear to the reader, e.g. we do not really know what the connection between Jane Austen and Anglican women is. Returning to the Explanation Ontology of Section 2.1.2, we can now say that the ILP process allows us to find some plausible anterior events for a pattern. However, we are missing the context that relates an anterior and a posterior event, as well as the theory behind them. This means that our system has to include some Linked Data-based processes to automatically detect the contextual relation of two events, as well as the theory governing their occurrence.

Through the next chapters, we will see how we propose to solve the issues of manual selection, of scalability and of context definition. Regarding the automatic definition of a theory behind the explanations, we will comment on its feasibility in the third part of this work.

Chapter 4 Generating Explanations through Automatically Extracted Background Knowledge

In this chapter we show how we extended the Inductive Logic Programming idea into the framework we propose, which we call Dedalo. We designed it as a process to automatically find pattern explanations by iteratively building the required background knowledge from Linked Data. Dedalo is inspired by Inductive Logic Programming and integrates new features such as a heuristic greedy search and Linked Data Traversal. These allow us to answer our second research question (Section 1.3.2), i.e. how to find in the Web of Data the background knowledge that we need to generate explanations. After the introduction in Section 4.1 of the problems tackled in this chapter, we present Dedalo’s foundations in Section 4.2 and the designed approach in Section 4.3. In Section 4.4, we evaluate the performance of the process according to different criteria while, in Section 4.5, we conclude by discussing some of the limitations of the approach.

4.1 Introduction

In the previous chapter we have seen how Inductive Logic Programming is a promising solution if we aim to automatically explain patterns. With that said, ILP has shown two main drawbacks: first, it requires the background knowledge about the evidence to be manually selected, which means introducing a priori knowledge of the problem; second, it incurs considerable computational issues when the background knowledge is too large and rich. This means that adding the entire knowledge from Linked Data to the process is not feasible; in addition, most of this knowledge would certainly be irrelevant. Thus, it is necessary that we detect and select only the salient information that we need to build the background knowledge for the induction process.

The solution we present here is to redesign the inductive process to make it not only more suitable for the Web of Data, but also able to automatically detect the relevant background knowledge. Our idea is based on two key aspects. First, we avoid scalability issues by increasing the background knowledge from Linked Data iteratively. This is achieved using a Link Traversal strategy, which uses the links between entities to blindly explore the graph of Linked Data. By “blindly”, we mean that we use URI dereferencing to discover new resources on-the-fly (which possibly belong to unknown data sources) that can serendipitously reveal new knowledge. Second, in order to know which piece of information has to be extracted, such Link Traversal is driven by a greedy search whose ultimate goal is to find relevant information about the items of a pattern that we want to explain. These two aspects are finally combined with an inductive reasoning process to find hypotheses that potentially explain a pattern.
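The combination of the two aspects can be illustrated with a minimal sketch, given here under strong assumptions and not as Dedalo's actual algorithm, which is presented in Section 4.3. URI dereferencing is simulated by a lookup in a small in-memory dictionary, the `ex:` URIs are hypothetical, a hypothesis is taken to be a pair (path of properties, value reached), and candidate paths are ranked greedily by the best F-Measure of the hypotheses they produce.

```python
import heapq

# Hypothetical stand-in for the Web of Data: URI -> list of (property, object).
# In a real setting, dereference() would issue an HTTP request for the URI.
TOY_GRAPH = {
    "ex:r1": [("ex:read", "ex:emma")],
    "ex:r2": [("ex:read", "ex:persuasion")],
    "ex:r3": [("ex:read", "ex:dracula")],
    "ex:emma": [("ex:author", "ex:austen")],
    "ex:persuasion": [("ex:author", "ex:austen")],
    "ex:dracula": [("ex:author", "ex:stoker")],
}


def dereference(uri):
    """Simulated URI dereferencing: return the outgoing links of a URI."""
    return TOY_GRAPH.get(uri, [])


def f_measure(covered, positives):
    """Harmonic mean of precision and recall of a hypothesis."""
    tp = len(covered & positives)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(covered), tp / len(positives)
    return 2 * precision * recall / (precision + recall)


def follow(entity, path):
    """Return the nodes reached from entity by following the property path."""
    frontier = {entity}
    for prop in path:
        frontier = {o for n in frontier for p, o in dereference(n) if p == prop}
    return frontier


def explain(entities, pattern, max_steps=10):
    """Greedy best-first traversal: always extend the most promising path."""
    best = (0.0, None)   # (score, (path, value))
    queue = [(0.0, ())]  # min-heap on negated score; start from the empty path
    for _ in range(max_steps):
        if not queue:
            break
        _, path = heapq.heappop(queue)
        # Discover, on the fly, which properties extend the current path.
        ends = {n for e in entities for n in follow(e, path)}
        props = {p for n in ends for p, _ in dereference(n)}
        for prop in sorted(props):
            new_path = path + (prop,)
            # Group the entities by the value they reach through new_path.
            by_value = {}
            for e in entities:
                for value in follow(e, new_path):
                    by_value.setdefault(value, set()).add(e)
            path_score = 0.0
            for value, covered in by_value.items():
                score = f_measure(covered, pattern)
                path_score = max(path_score, score)
                if score > best[0]:
                    best = (score, (new_path, value))
            heapq.heappush(queue, (-path_score, new_path))
    return best


score, (path, value) = explain(["ex:r1", "ex:r2", "ex:r3"], {"ex:r1", "ex:r2"})
```

On this toy graph, the traversal first follows `ex:read`, where no single book covers both readers of the pattern, and then extends that path with `ex:author`, which yields a hypothesis covering exactly the pattern. Only the links along the expanded paths are ever looked up, which is the point of building the background knowledge iteratively rather than loading it all upfront.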

The resulting process, Dedalo, is able not only to automatically navigate Linked Data, without knowing it in advance or in its entirety, but also to cleverly use this information to explain patterns of data. In terms of the Explanation Ontology that we previously presented, achieving this task allows Dedalo to find candidate anterior events for a specific observation (the pattern).
