The goal of this thesis was to show that it is possible to automatically explain data patterns using background knowledge from the Web of Data. The main motivation behind that was to show how the interconnected and cross-domain knowledge of the Web of Data could be exploited to assist experts (and non-experts) in the process of explaining their results, and therefore to improve the process of discovering knowledge.
Based on this main research hypothesis, i.e.
the process of pattern explanation can be improved by using background knowledge from the Web of (Linked) Data
our work focused on setting up a process in which we could obtain sound explanations for a pattern under observation, given some background knowledge derived from the Web of Data. This process, of course, had to be completely human-independent, both in producing the explanations and in finding the background knowledge it required. Multiple research questions arose from setting such an objective, namely:
• what do we mean by explanation? (Section 1.3.1)
• how do we automatically get the relevant knowledge from the Web of Data? (Section 1.3.2)
• how do we know which knowledge is required for an explanation? (Section 1.3.2)
• how do we automatically generate explanations? (Section 1.3.3)
• how do we know that our explanations are good? (Section 1.3.4)
• how do we know that our process is good? (Section 1.3.4)
On the basis of those research questions, we now analyse the answers and contributions we brought.
8.2 Summary, Answers and Contributions | 165
8.2.1 Definition of an Explanation
Did we define what an explanation is? Since the concept of explanation was at the core of our research hypothesis, the first question we focused on was how to define it formally. We needed to identify the elements acting in an explanation, so that the process we would build could know what to look for.
The challenge here was to prove that we were neither making assumptions about why a phenomenon happens, nor identifying potentially correlated events whose validity could not be proved. Identifying those elements and how they interact in an explanation was therefore the key to stating that we were able to automatically produce meaningful and complete explanations for some phenomena. The challenge, then, became to find those elements and, once they were found, to use them to formally define an explanation.
The methodology we chose to answer this question was to conduct a survey in Cognitive Science (Section 2.1), where we studied what is important in an explanation according to experts from different disciplines (philosophy, psychology, neuroscience, computer science, sociology and linguistics). The result of the study was that the disciplines in Cognitive Science see explanations as structurally the same. Beyond their contextual or lexical differences, explanations are always composed of the same four components: an event (that we called anterior event or explanans) that happens first, a second event (posterior event or explanandum) which follows, a set of circumstances (or context) that links the two events, and a law (or theory) that governs their occurrence. We modelled those components in a small ontology that we called the Explanation Ontology, presented in detail in Section 2.1.2. This ontology, as well as the survey in Cognitive Science, represented our main contribution to this research question.
The formal model of an explanation revealed exactly which tasks we needed to complete in our process. Given a posterior phenomenon encoded in a pattern, our ultimate target was to create an automatic process that would use knowledge from the Web of Data to derive the three other components: the anterior event, the context and the theory.
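As an illustration only, the four components can be pictured as a small data structure in which the posterior event is given and the remaining three are to be filled in by the process. The class and field names below are hypothetical and are not taken from the Explanation Ontology itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Explanation:
    """Illustrative container for the four components of an explanation."""

    posterior_event: str                  # the pattern to explain (given as input)
    anterior_event: Optional[str] = None  # to be derived from background knowledge
    context: List[str] = field(default_factory=list)  # circumstances linking the two events
    theory: Optional[str] = None          # law governing the occurrence of the events

    def missing_components(self) -> List[str]:
        """Names of the components that still have to be derived."""
        missing = []
        if self.anterior_event is None:
            missing.append("anterior_event")
        if not self.context:
            missing.append("context")
        if self.theory is None:
            missing.append("theory")
        return missing

    def is_complete(self) -> bool:
        return not self.missing_components()


# At the start of the process only the posterior event is known.
explanation = Explanation(posterior_event="ex:observedPattern")
todo = explanation.missing_components()
```

Starting from a pattern alone, `todo` lists exactly the three components that the rest of the process has to derive.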
8.2.2 Detection of the Background Knowledge
Did we find a way to detect the background knowledge in the Web of Data? The next question we targeted was how to detect, in the Web of Data, the background knowledge needed to derive the components of an explanation. This question required us to investigate how to use the structured and cross-domain nature of Linked Data to reveal such knowledge.
Having committed to Linked Data, the major challenge for us was to avoid the computational issues caused by the large space of analysis. What was required was to focus on how to strategically find in Linked Data only the right piece of background knowledge. This research question raised two problems at once: on the one hand, automatically finding and choosing the right datasets among the many offered by Linked Data (the inter-dataset problem); on the other, finding in a selected dataset the piece of information required for the explanation (the intra-dataset problem).
The first contribution we brought here was to show that a combination of Linked Data Traversal and graph search strategies was the key to discovering knowledge with little computational effort. In both Chapter 4, focused on the detection of the anterior events participating in an explanation, and Chapter 6, focused on the detection of the context, we showed that using URI dereferencing and the native links between entities was the key to agnostically accessing datasets on-the-fly, to efficiently accessing only the necessary portion of data, and to avoiding data crawling and indexing, which would have implied both introducing a priori knowledge about the problem and increasing the computational costs.
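The traversal idea can be sketched as follows. This is a minimal illustration under stated assumptions, not Dedalo's implementation: `TOY_WEB` is a hypothetical in-memory stand-in for the Web of Data, `dereference` replaces what would be an HTTP lookup of a URI, and the search is a plain breadth-first one bounded by `max_depth`.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy data: each URI maps to the (property, object) pairs
# that dereferencing it would return.
TOY_WEB: Dict[str, List[Tuple[str, str]]] = {
    "ex:item1": [("ex:topic", "ex:Physics")],
    "ex:item2": [("ex:topic", "ex:Physics"), ("ex:author", "ex:alice")],
    "ex:Physics": [("ex:broader", "ex:Science")],
    "ex:alice": [("ex:memberOf", "ex:LabA")],
    "ex:Science": [],
    "ex:LabA": [],
    "ex:unrelated": [("ex:topic", "ex:Cooking")],
}


def dereference(uri: str) -> List[Triple]:
    """Stand-in for an on-the-fly URI lookup; returns the triples about `uri`."""
    return [(uri, prop, obj) for prop, obj in TOY_WEB.get(uri, [])]


def traverse(seeds: List[str], max_depth: int) -> Tuple[List[Triple], Set[str]]:
    """Follow links outwards from the seeds, dereferencing only reached URIs."""
    seen: Set[str] = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    triples: List[Triple] = []
    while queue:
        uri, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand entities beyond the depth bound
        for triple in dereference(uri):
            triples.append(triple)
            obj = triple[2]
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return triples, seen


collected, visited = traverse(["ex:item1", "ex:item2"], max_depth=2)
```

Only entities reachable from the pattern's items are ever looked up, so `ex:unrelated` is never touched and no crawl or index of the whole dataset is needed.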
With that said, our second contribution to this question was to show which features of Linked Data help in finding the right piece of information. In Chapter 4, we showed that an ad-hoc adaptation of Shannon’s Entropy measure was a successful solution for detecting the right candidate anterior events. This was achieved after a comparative study of different graph search heuristics, where we showed that Entropy was the best measure for accessing the required information efficiently (in time and search space). Analogously, in Chapter 6, we showed that specific, low-level, richly described entities were the important structural characteristics of Linked Data to take into account in order to efficiently identify the context in which the pattern and its candidate explanans happen.
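For reference, the plain form of Shannon's Entropy over the values that a property takes can be computed as below. This is the textbook measure only; the ad-hoc adaptation used in Chapter 4 is not reproduced here, and the property names and values are hypothetical. Whether a low or a high score should be expanded first is a design decision of the search strategy and is deliberately left open in this sketch.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def shannon_entropy(values: Iterable[str]) -> float:
    """H = -sum(p * log2(p)) over the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical values observed for two properties across four items.
values_by_property: Dict[str, List[str]] = {
    "ex:topic": ["ex:Physics", "ex:Physics", "ex:Biology", "ex:Biology"],
    "ex:language": ["ex:en", "ex:en", "ex:en", "ex:en"],
}

entropy_by_property = {
    prop: shannon_entropy(vals) for prop, vals in values_by_property.items()
}
```

A property whose values are evenly split scores higher than one whose value never changes, which is the kind of signal a graph search heuristic can rank on.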
8.2.3 Generation of the Explanations
Did we find a way to automatically generate explanations using background knowledge from the Web of Data? Once the right background knowledge had been detected, the next problem was how to use it to automatically generate complete explanations. The particular challenge in this context was to emulate the human process of “generating understandable explanations using one’s own background knowledge”, without running into the computational issues that the data deluge would bring.
Following the idea of using logical inference and reasoning, we showed that Inductive Logic Programming (ILP) was an interesting solution to the problem. The main insight here was to recognise that ILP frameworks, which combine features from Machine Learning and Logic Programming, are able to automatically derive (first-order clausal) hypotheses by reasoning upon a set of positive and negative examples and some background knowledge about them. The first contribution we brought was to show, in Chapter 3, how we could use ILP to generate pattern explanations with background knowledge from the Web of Data. With this aim in mind, we recast our problem as an ILP one, where: (1) the items in the pattern to explain played the role of the positive examples from which we would learn; (2) the hypotheses to be derived were the anterior events explaining the pattern; and (3) the background knowledge upon which we based our reasoning was composed of Linked Data statements.
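The three roles above can be pictured with a deliberately simplified, propositional sketch. It is not an ILP framework: real ILP systems induce first-order clauses, whereas here a hypothesis is just a (property, value) pair. Treating items outside the pattern as negative examples is an assumption of this sketch, and all identifiers and data are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Hypothesis = Tuple[str, str]  # (property, value) shared by the covered items


def candidate_hypotheses(positives: Set[str], background: Set[Triple]) -> Set[Hypothesis]:
    """Every (property, value) pair stated about at least one positive example."""
    return {(p, o) for s, p, o in background if s in positives}


def coverage(hyp: Hypothesis, examples: Set[str], background: Set[Triple]) -> Set[str]:
    """Examples for which the background knowledge asserts the hypothesis."""
    prop, value = hyp
    return {s for s in examples if (s, prop, value) in background}


def consistent_hypotheses(
    positives: Set[str], negatives: Set[str], background: Set[Triple]
) -> List[Tuple[Hypothesis, int]]:
    """Hypotheses covering no negative example, ranked by positives covered."""
    results = []
    for hyp in candidate_hypotheses(positives, background):
        if coverage(hyp, negatives, background):
            continue  # rejected: it also holds for items outside the pattern
        results.append((hyp, len(coverage(hyp, positives, background))))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


positives = {"ex:a", "ex:b", "ex:c"}   # (1) items in the pattern
negatives = {"ex:d"}                   # assumption: items outside the pattern
background = {                         # (3) Linked Data statements
    ("ex:a", "ex:topic", "ex:Physics"),
    ("ex:b", "ex:topic", "ex:Physics"),
    ("ex:c", "ex:topic", "ex:Biology"),
    ("ex:d", "ex:topic", "ex:Biology"),
    ("ex:a", "ex:language", "ex:en"),
    ("ex:d", "ex:language", "ex:en"),
}

ranked = consistent_hypotheses(positives, negatives, background)  # (2) candidate anterior events
```

In this toy setting the only surviving hypothesis is that the items share the topic `ex:Physics`, since the other two candidates also hold for the negative example.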
Having shown the potential of ILP in generating potentially useful anterior events for a pattern, we proposed in Chapter 4 to extend and adapt the process to make it more suitable for Linked Data, whose size and complexity could not be handled by ILP frameworks, which are known for not being efficient or scalable. The approach we presented, defining Dedalo’s initial design, was considered more adequate because it used an anytime process in which the background knowledge was iteratively enlarged using the above-mentioned Link Traversal-based graph search, which made it possible to constantly find new and possibly better anterior events for a pattern. The design and implementation of this process represent our second contribution to this question.
8.2.4 Evaluation of the Explanations
Did we find a way to evaluate the explanations, and to evaluate the approach we propose? The final question was how to make sure that the induced explanations could be validated and evaluated as valid for a pattern. The major challenge consisted in defining what makes an explanation valid and how to measure it.
Here, the first part of our work focused on evaluating the generated explanations, especially by finding the criteria to assess their interestingness. We began by using standard accuracy measures, i.e. the F-Measure and the Weighted Relative Accuracy, to statistically determine how much an induced anterior event was correlated to a pattern. To do so, we used the anterior event as a classifier that was tested on the ground truth of the pattern. Inspired by cluster analysis approaches, we then extended the F-Measure to a weighted variant, which we called the “fuzzy” F-Measure. This choice, made intentionally to prioritise the most influential data within the pattern, gave a first contribution to the problem: we demonstrated that better explanations could be found when taking into account that items influence patterns in different ways. As a second contribution to the problem, we proposed in Chapter 5 to improve the accuracy of explanations by aggregating different explanantia, and used a trained Neural Network to predict which aggregations were more convenient. Aiming to produce explanations that were more complete with respect to the Explanation Ontology, we finally focused on showing that a pattern and an induced Linked Data explanation were in the right context. We measured the pertinence of this context using the evaluation function learnt in Chapter 6. Our major contribution here was to reveal which features of the Web of Data topology should be taken into account when measuring entity relatedness.
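The two standard measures mentioned above can be written down directly when the anterior event is treated as a classifier over the items. The first two functions below follow the usual textbook definitions. The third is only one possible weighted variant, given as an assumption to illustrate the idea of item influence: it is not the definition of the “fuzzy” F-Measure used in this work, and the weights and item identifiers are hypothetical.

```python
from typing import Dict, Set


def f_measure(covered: Set[int], pattern: Set[int]) -> float:
    """Harmonic mean of precision and recall of `covered` against `pattern`."""
    tp = len(covered & pattern)
    if tp == 0:
        return 0.0
    precision = tp / len(covered)
    recall = tp / len(pattern)
    return 2 * precision * recall / (precision + recall)


def wracc(covered: Set[int], pattern: Set[int], universe: Set[int]) -> float:
    """Weighted Relative Accuracy: p(Cond) * (p(Class | Cond) - p(Class))."""
    if not universe or not covered:
        return 0.0
    p_cond = len(covered) / len(universe)
    p_class = len(pattern) / len(universe)
    p_class_given_cond = len(covered & pattern) / len(covered)
    return p_cond * (p_class_given_cond - p_class)


def weighted_f_measure(covered: Set[int], pattern: Set[int], weights: Dict[int, float]) -> float:
    """Assumed variant: each item counts by its weight (default 1.0) instead of 1."""
    def weight(item: int) -> float:
        return weights.get(item, 1.0)

    tp = sum(weight(i) for i in covered & pattern)
    if tp == 0:
        return 0.0
    fp = sum(weight(i) for i in covered - pattern)
    fn = sum(weight(i) for i in pattern - covered)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


universe = set(range(1, 11))   # all items
pattern = {1, 2, 3, 4}         # ground truth: items in the pattern
covered = {1, 2, 3, 5}         # items for which the anterior event holds
weights = {4: 0.2}             # item 4 is assumed to influence the pattern weakly

plain = f_measure(covered, pattern)
relative = wracc(covered, pattern, universe)
weighted = weighted_f_measure(covered, pattern, weights)
```

In this toy case the weighted score is higher than the plain one, because the only pattern item the candidate misses is the one assumed to matter least.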
The second part of this problem concerned how to evaluate our approach against domain experts and users. For that, we designed an empirical user study based on a real-world scenario (the interpretation of Google Trends), where we evaluated Dedalo’s results against human judgment. Chapter 7 presented how we articulated this study, in which users were required to provide their own explanations for a trend’s popularity and to rank contexts according to their strength; we then used these answers to assess the validity of Dedalo’s results.
Our results were finally measured in terms of how much knowledge we could automatically find, and how helpful an automatic process for producing explanations could be for non-experts in the relevant domain. The general evaluation showed that our approach was in fact able to identify explanations that could be as interesting as those of the human evaluators, and possibly to bring new knowledge beyond what non-expert users would already know.
The work we presented demonstrated that it is possible to use the Web of Data as a source of background knowledge to explain data patterns, and it showed some of the possible solutions to achieve it. Of course, this does not come without limitations and possibilities for extension, which we will discuss in the following section.