
3. Large-Scale Relational Learning and Application on Linked Data

3.3. Machine Learning Tasks on Linked Data

Once the factorization of an adjacency tensor has been computed with R, it can be used for various learning tasks that are important to Linked Data and the Semantic Web. In the following we will briefly describe some of these tasks and discuss the benefits of R as well as the advantages of large-scale relational learning in these application scenarios.

Prediction of Triples The prediction of the truth value of triples is an important task in many fields of application and can also be used to improve the data quality in automatically created knowledge bases. This task corresponds to link prediction in relational learning and can be approached with R as described in section 2.3. A very interesting property of R is that the state of a single relationship $x_{ijk}$ is conditionally independent of all other variables given the expression $a_i^T R_k a_j$. Once a factorization of a knowledge base has been computed, which can be done “offline”, this property enables fast query answering, as the computational complexity of the corresponding matrix-vector multiplications depends only on the dimensionality of the latent space $A$ and is independent of the size of the knowledge base. This is an important feature of R compared to other relational learning approaches, where exact inference is often intractable and even approximate inference remains very time-consuming.
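As a minimal sketch of why query answering is cheap, the following plain-Python snippet evaluates the bilinear expression $a_i^T R_k a_j$ for a single candidate triple. The factor values and the helper name `score` are hypothetical and chosen only for illustration; they are not taken from a fitted model.

```python
def score(a_i, R_k, a_j):
    """Bilinear score a_i^T R_k a_j for one candidate triple (i, k, j)."""
    # Matrix-vector product R_k a_j, followed by a dot product with a_i.
    Rk_aj = [sum(r * a for r, a in zip(row, a_j)) for row in R_k]
    return sum(x * y for x, y in zip(a_i, Rk_aj))


# Hypothetical toy factors with r = 2 latent components (illustrative values).
A = [
    [1.0, 0.0],  # latent representation of entity 0
    [0.0, 1.0],  # latent representation of entity 1
    [0.5, 0.5],  # latent representation of entity 2
]
R_k = [
    [0.0, 1.0],
    [0.0, 0.0],
]

s01 = score(A[0], R_k, A[1])  # candidate triple (entity 0, relation k, entity 1)
s10 = score(A[1], R_k, A[0])  # reversed direction of the same relation
s02 = score(A[0], R_k, A[2])
```

Each call touches only two rows of $A$ and one $r \times r$ slice, so its cost grows with $r$ but not with the number of entities. Since $R_k$ need not be symmetric, the two directions of a relation can receive different scores.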

Instance Matching In instance matching the task is to determine which entities from heterogeneous data sources refer to the same underlying entity, a task that is considered critical for Linked Data (Ferrara et al., 2008; Ferrara et al., 2011). Instance matching is essentially equivalent to entity resolution in relational learning and can be approached with R by creating a joint adjacency tensor for all data sources whose instances should be matched and by applying the entity resolution methods described in chapter 2.
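The construction of a joint adjacency tensor can be sketched as follows. The function name, the two toy sources, and the prefixed entity names are hypothetical. The sketch also assumes that predicates are already aligned across sources, and it uses dense nested lists only because the example is tiny; a knowledge-base-sized tensor would need a sparse representation.

```python
def joint_adjacency_tensor(*sources):
    """Build one adjacency tensor over the union of entities and relations.

    Each source is an iterable of (subject, predicate, object) triples.
    Returns (entities, relations, X), where X[k][i][j] == 1 iff the triple
    (entities[i], relations[k], entities[j]) occurs in at least one source.
    """
    triples = [t for source in sources for t in source]
    entities = sorted({e for s, _, o in triples for e in (s, o)})
    relations = sorted({p for _, p, _ in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    r_idx = {p: k for k, p in enumerate(relations)}
    n = len(entities)
    # One n x n frontal slice per relation, initialised with zeros.
    X = [[[0] * n for _ in range(n)] for _ in relations]
    for s, p, o in triples:
        X[r_idx[p]][e_idx[s]][e_idx[o]] = 1
    return entities, relations, X


# Hypothetical sources that describe the same river under different names.
source_a = [("a:Rhone", "flowsThrough", "France")]
source_b = [("b:Rhone_River", "flowsThrough", "France")]

entities, relations, X = joint_adjacency_tensor(source_a, source_b)
```

In the joint tensor both candidate instances share the neighbour `France` in the same relation. This is the kind of overlap that a factorization can pick up when it places entities in a common latent space.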

Retrieval of Similar Entities A particular strength of the R factorization is that it computes a global latent representation of entities, i.e. the factor matrix $A$. Analogous to the retrieval of documents via latent-variable models, the latent representation of entities can be used to retrieve entities that are similar to a queried entity. As discussed in section 2.2, the matrix $A$ can be interpreted as an embedding of entities into a latent space that reflects their similarity over all relations in the domain of discourse. Therefore, in order to retrieve entities that are similar to a particular entity $e$ with respect to all relations in the data, it is sufficient to compute a ranking of entities by their similarity to $e$ in $A$. This can be done efficiently, since $A$ is only an $n \times r$ matrix.
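A ranking of this kind might look as follows. The text does not fix a similarity measure, so cosine similarity is used here purely as an assumption; the matrix values and the function names `cosine` and `most_similar` are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two latent vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def most_similar(A, e, top_k=3):
    """Return the indices of the top_k rows of A most similar to row e."""
    scored = [(cosine(A[e], row), idx) for idx, row in enumerate(A) if idx != e]
    # Sort by descending similarity; break ties by entity index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [idx for _, idx in scored[:top_k]]


# Hypothetical factor matrix with n = 4 entities and r = 2 latent components.
A = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.0, 1.0],
    [0.1, 0.8],
]

ranking = most_similar(A, 0)
```

The loop compares the query row against $n - 1$ vectors of length $r$ and never touches the original triples.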

Decision Support for Knowledge Engineers Another important application of R is the automatic creation of taxonomies from instance data. Recently, it has been proposed that machine learning methods should assist knowledge engineers in the creation of ontologies, such that an automated system suggests new axioms for an ontology, which are added under the supervision of an engineer (Auer and Lehmann, 2010). Here, we consider the simpler task of learning a taxonomy from instance data, which can be interpreted as a hierarchical grouping of entities. Consequently, a natural approach to learning a taxonomy for a particular domain is to compute a hierarchical clustering of the entities in this domain and to interpret the resulting clusters according to their members.

However, there are only very few approaches that are able to compute a hierarchical clustering for multi-relational data (Roy et al., 2007), and even fewer approaches that could be applied to complete knowledge bases. To compute such a clustering with R, we again exploit the fact that $A$ reflects the similarity of entities in the relational domain, and simply compute a hierarchical clustering in this latent-component space. This approach has the advantage that any feature-based hierarchical clustering algorithm can readily be applied to this task. The clustering, however, will still be determined by the entities’ similarities in the relational domain. While this approach differs in some important aspects from the system envisioned by Auer and Lehmann (2010), it can be used to address some of the discussed challenges, in particular scalability.
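To illustrate clustering in the latent-component space, the sketch below runs a naive single-linkage agglomerative clustering over the rows of a factor matrix. Single linkage with Euclidean distance is an arbitrary stand-in for "any feature-based hierarchical clustering algorithm"; it is not a method prescribed by the text, and the matrix values are hypothetical. The function assumes that the matrix has at least one row.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def single_linkage(A):
    """Agglomerative single-linkage clustering of the rows of A.

    Returns the merge tree as nested tuples; leaves are row indices.
    """
    # Each cluster is a pair (subtree, list of member row indices).
    clusters = [(i, [i]) for i in range(len(A))]
    while len(clusters) > 1:
        best = None
        for p, q in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(euclidean(A[i], A[j])
                    for i in clusters[p][1] for j in clusters[q][1])
            if best is None or d < best[0]:
                best = (d, p, q)
        _, p, q = best
        merged = ((clusters[p][0], clusters[q][0]),
                  clusters[p][1] + clusters[q][1])
        # Delete the larger index first so that index p stays valid.
        del clusters[q]
        del clusters[p]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical factor matrix with two visibly separated groups of entities.
A = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.0, 1.0],
    [0.1, 0.8],
]

tree = single_linkage(A)
```

The resulting tree groups entities 0 and 1 and entities 2 and 3 before joining the two groups. Each inner node is a candidate class that a knowledge engineer could inspect and name. This naive loop is far too slow for full knowledge bases and would be replaced by a scalable clustering routine in practice.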

For all of these tasks the quality of the model can be improved by applying R to complete knowledge bases. For tasks like instance matching and taxonomy learning, scalability to one or multiple knowledge bases is even required, as these tasks are defined over complete knowledge bases. Scaling R to data sets of this size can therefore be an important step towards relational learning from complete knowledge bases in the Semantic Web, which is one of the main motivations for the work in this chapter.

Another noteworthy aspect of learning on Linked Data is the importance of collective learning, i.e. the inclusion of information that might be more distant in the relational graph in learning and prediction tasks. This ability of a learning method is important not only because of the relational nature of Linked Data, as discussed in section 1.2.1, but also because of the way information is modeled in Linked Data. Since RDF is restricted to dyadic relations, higher-order relations are often modeled via intermediary nodes such as blank nodes or abstract entities, such that the actual fact is not included in a single triple but is spread over a chain of triples. For instance, in version 3.7 of the DBpedia ontology, many geographical locations such as river mouth locations are modeled via chains of triples such as

(Rhone, mouthPosition, Rhone-mouthPosition),

(Rhone-mouthPosition, longitude, 4.845555782318115),

(Rhone-mouthPosition, latitude, 43.330799).

To access, for instance, the longitude of the mouth of the river Rhône, a learning method therefore requires the ability to include information that is “past” the abstract entity Rhone-mouthPosition, which is enabled through collective learning.
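The structure of this example can be made concrete with a small lookup over the three triples. This is a plain graph traversal, not collective learning itself; it only shows that the longitude is two hops away from the entity Rhone. The helper names `objects` and `follow` are hypothetical.

```python
# The three triples from the DBpedia example in the text.
triples = [
    ("Rhone", "mouthPosition", "Rhone-mouthPosition"),
    ("Rhone-mouthPosition", "longitude", 4.845555782318115),
    ("Rhone-mouthPosition", "latitude", 43.330799),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is a stored triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(triples, start, path):
    """Follow a chain of predicates from start and return the end values."""
    frontier = [start]
    for predicate in path:
        frontier = [o for node in frontier
                    for o in objects(triples, node, predicate)]
    return frontier


# No single triple links Rhone to a longitude ...
direct = objects(triples, "Rhone", "longitude")
# ... the value is only reachable through the abstract intermediary entity.
via_mouth = follow(triples, "Rhone", ["mouthPosition", "longitude"])
```

A method that considers only the triples in which Rhone itself occurs sees the first triple and nothing else, which is why information must be propagated across the intermediary node.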