Charles University in Prague
Faculty of Mathematics and Physics
MASTER THESIS
Jan Michelfeit
Linked Data Integration
Department of Software Engineering
Supervisor of the master thesis: RNDr. Tomáš Knap, Ph.D.
Study programme: Computer Science
Specialization: I2 Software systems
I wish to record my gratitude to my supervisor, Tomáš Knap, Ph.D., for the guidance he gave me throughout the work on this thesis. I would like to thank my mother for her support while I was working on this thesis but, most importantly, for making my studies possible and guiding me the whole time.
I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources.
I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that the Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.
Název práce: Linked Data Integration Autor: Jan Michelfeit
Katedra: Katedra softwarového inženýrství
Vedoucí diplomové práce: RNDr. Tomáš Knap, Ph.D.
Abstrakt: Linked Data je úspěšná forma publikování strukturovaných dat, která by mohla znamenat pro data to samé, co dokázal web pro dokumenty. Silná stránka Linked Data je jejich vhodnost pro integraci dat z více zdrojů. Integrace Linked Data otevírá dveře novým příležitostem, ale zároveň přináší nové výzvy. Je třeba vyvinout nové algoritmy a nástroje pokrývající všechny kroky datové integrace.
Tato práce se zabývá tradičním postupem integrace dat a jeho aplikací na Linked Data, se zaměřením na řešení konfliktů, které se mohou objevit. Práce navrhuje nový algoritmus pro řešení konfliktů, který zároveň podporuje důvěru s pomocí informací o původu a analýzy kvality. Navržené algoritmy jsou implementované v rámci frameworku ODCleanStore pro integraci Linked Data.
Klíčová slova: Linked Data, datová integrace, datová kvalita, datové konflikty
Title: Linked Data Integration Author: Jan Michelfeit
Department: Department of Software Engineering Supervisor: RNDr. Tomáš Knap, Ph.D.
Abstract: Linked Data have emerged as a successful publication format which could mean to structured data what the Web meant to documents. The strength of Linked Data lies in their fitness for integration of data from multiple sources. Linked Data integration opens the door to new opportunities but also poses new challenges. New algorithms and tools need to be developed to cover all steps of data integration.
This thesis examines the established data integration processes and how they can be applied to Linked Data, with a focus on data fusion and conflict resolution. Novel algorithms for Linked Data fusion are proposed, and the task of supporting trust with provenance information and quality assessment of fused data is addressed. The proposed algorithms are implemented as part of the Linked Data integration framework ODCleanStore.
Contents
1 Introduction
1.1 Introduction to Linked Data
1.2 Technical Preliminaries
1.2.1 RDF
1.2.2 Named Graphs
1.2.3 Ontologies
1.3 Motivating Example
1.3.1 Challenges
1.3.2 Benefits of Linked Data Integration
1.3.3 Contributions of the Thesis
2 Data Integration
2.1 Basic Terminology
2.1.1 Entities
2.1.2 Identifiers
2.1.3 Conflicts
2.1.4 Materialized vs. Virtual Integration
2.2 The Data Integration Process
2.3 Data Integration Steps in Detail
2.3.1 Schema Mapping
2.3.2 Data Retrieval
2.3.3 Data Preparation
2.3.4 Schema Translation
2.3.5 Quality Assessment
2.3.6 Duplicate Detection
2.3.7 Data Fusion
2.3.8 Conflict Resolution
3 Data Fusion & Conflict Resolution
3.1 Data Conflicts
3.2 Conflict Resolution Functions
3.3 Overview of Concrete Resolution Functions
3.4 Conflict Resolution Strategies
3.5 Classification of Conflict Resolution Functions
3.6 Properties of Conflict Resolution Functions
3.7.1 Resource Descriptions
3.7.2 Query Execution
3.7.3 Data Schemata
3.7.4 NULL values
4 Linked Data Fusion and Conflict Resolution Algorithm
4.1 Formalism
4.1.1 RDF Data Model
4.1.2 Conflict Resolution Terminology
4.2 Input & Output
4.3 High-level Overview
4.4 Algorithm Description
4.4.1 Construction of Canonical URI Mapping
4.4.2 URI Resolution
4.4.3 Application of Resolution Function
4.5 Time & Memory Complexity
5 Fused Linked Data Quality Algorithm
5.1 Data Quality and F-Quality
5.2 Methodology
5.3 Terminology
5.4 Requirements on F-Quality Assessment
5.5 Assessment Algorithm Description
5.5.1 Factor 1: Source Quality
5.5.2 Factor 2: Conflicting Values
5.5.3 Factor 3: Confirmation by Multiple Sources
5.6 Complexity & Relation to Resolution Functions
5.7 Discussion
6 Implementation
6.1 ODCleanStore
6.1.1 Transformers
6.1.2 Output Web Service
6.2 ODCS-FusionTool
6.3 Implementation Details
6.3.1 Interfaces
6.3.2 Resolution Functions
7 Evaluation
7.1 Experience from OpenData.cz initiative
7.2 Integration of Multiple Sources from the Web
7.2.1 Performance
7.2.2 Completeness, Conciseness, Consistency
7.2.3 Quality
7.3 Lessons Learned
8 Related Work
8.1 LDIF
8.1.1 Sieve
8.1.2 Conflict Resolution in Sieve vs. ODCleanStore
8.2 Conflict-Based Quality
8.3 SPARQL Queries over Multiple Data Sources
8.3.1 Query Rewriting
8.3.2 Networked Graphs
8.3.3 Federated SPARQL Queries
8.4 Traditional Data Integration Systems
9 Conclusion
9.1 Future Work
9.2 Summary
Bibliography
List of Figures
List of Listings
A Contents of CD-ROM
B ODCS-FusionTool Manual
1. Introduction
Linked Data and RDF have emerged as a successful publication format for structured data. The volume of published Linked Data grows continuously, available datasets cover many areas from geographic data to life sciences, and we can see the Web evolve into an exponentially growing information space with over 60 billion facts represented as RDF triples. As the word Linked implies, a crucial aspect is linking across datasets – the drive for Linked Data adoption is their outstanding fitness for integration. This brings an exciting opportunity for new applications exploiting data gathered from multiple sources, in addition to traditional data integration scenarios which can also benefit from the power and flexibility of the RDF data model.
The new opportunities for data consumption also present new challenges. One aspect is information quality. Usage of information from the open Web environment comes with inherent problems such as the provision of poor-quality, inaccurate or outdated data. The end data consumer needs support in deciding which data are worth using, especially when contradicting information is presented. Effective integration requires resolution of emerging conflicts and uncertainties. Quality assessment can act both as a deciding factor for conflict resolution and as an indicator of poor data which need the attention of an expert.
Technical challenges are another aspect to face. Integration of large and frequently updated datasets should be efficient and a pay-as-you-go approach may be essential for adoption in practice. In addition, the majority of data integration tools are designed for traditional databases and new tools need to be developed for Linked Data integration.
In this thesis, we address some of the key challenges of Linked Data integration, namely the aspects of Linked Data fusion, conflict resolution and quality. We also analyse the specifics of Linked Data integration compared to relational databases.
The main contributions of this thesis are (1) a Linked Data fusion and conflict resolution algorithm, (2) an algorithm for quality assessment of fused data with a new concept of F-quality, (3) examination of the general data integration process in its entirety and comparison to specifics of Linked Data integration, with focus on conflict resolution.
This thesis contributes to the technology stack required for adoption of Linked Data with implementation of a data fusion component in ODCleanStore, a Linked Data management framework covering all important aspects of data integration from data cleansing to provision of integrated views on the data. We also provide a standalone tool for integration of RDF data from multiple sources. Results are evaluated on integration of several Linked Data sources from the Web.
Structure of the thesis. The first chapter gives an introduction to basic concepts of Linked Data and demonstrates their advantages on a motivating example. Chapter 2 covers aspects of data integration with focus on Linked Data, while Chapter 3 elaborates more on the data fusion and conflict resolution step of integration. The proposed conflict resolution algorithm is described in Chapter 4 and the quality assessment algorithm for fused data is described in Chapter 5. More details about the implementation in ODCleanStore are in Chapter 6. Chapter 7 is devoted to evaluation of the algorithms. Our approach is compared to related work in Chapter 8. Finally, Chapter 9 summarizes the thesis with an outline of future work.
Contents of the CD attached to this thesis are listed in Appendix A. User instructions for the standalone integration tool ODCS-FusionTool are in Appendix B. Appendix C explains how to build the source code on the attached CD.
1.1 Introduction to Linked Data
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web [2, 5]. These practices can be summarized in the following four principles:
• Things are identified with Uniform Resource Identifiers (URIs).
• The URIs used are dereferenceable HTTP URIs.
• When a URI identifier is looked up, useful information is provided using standard formats (e.g. RDF/XML).
• Links to other URIs are provided, so that related information can be discovered.
Linked Data are built on well-known technologies and W3C standards like URIs, HTTP (HyperText Transfer Protocol), or RDF and follow in the Web’s footsteps. Simply put, Linked Data attempt to do for data what the Web did for documents.
RDF (Resource Description Framework) is the standard framework for representation of Linked Data. Data are modelled as typed statements that can represent a link between arbitrary entities identified by URIs or assert a literal value of an entity’s property. Typed statements compose a navigable graph of interconnected data. Links may span between datasets and thus compose the Web of Data capable of describing all things in the world.
Open (and Closed) Data
Open Data are founded on the idea that data should be freely available, with a license enabling anyone to access and re-use them for any purpose. It is said that the best use of your data will be invented by someone else. Open Data enable agile use of publicly available information. Useful applications supporting government transparency, traffic improvements, or tourism are but a few examples of how they can be exploited.
Currently, the main producers of Open Data are government authorities and scientists. Data Hub, a large community-run catalogue of useful available datasets, lists over 8,000 governmental datasets.¹ Statistics about RDF datasets are maintained by LODStats, which lists nearly 2300 datasets with over 62 billion RDF triples.²
Open Data can be published in an arbitrary format, and incompatibilities between formats raise barriers to their re-use. Adoption of RDF and Linked Data principles helps remove such barriers and promotes simple data integration. The principles of Linked Open Data have been adopted by the Linking Open Data project, and the community has published many interlinked open datasets, some of which are shown in Figure 1.1.
¹ http://datahub.io, as of May 2013
² http://stats.lod2.eu/, retrieved on 27th May 2013
Figure 1.1: Linking Open Data cloud visualization, by Richard Cyganiak and Anja Jentzsch. The latest version can be found at http://lod-cloud.net/.
Benefits of open sharing and transparency are more and more recognized in the enterprise environment. Complete openness of datasets may not always be an option, however. The term Linked Closed Data refers to datasets which adhere to the Linked Data principles but whose openness is limited due to legal, business or privacy reasons [12]. Corporations can still benefit from the application of Linked Data principles and abide by imposed restrictions on data publication at the same time.
1.2 Technical Preliminaries
1.2.1 RDF
Resource Description Framework (RDF) is a language for representing information about resources that can be identified on the Web. The RDF specification is maintained as a W3C recommendation [31].
URIs (Uniform Resource Identifiers) are used to identify RDF resources. Each resource can be described by a collection of typed statements in the form subject-predicate-object. Every such triple describes a property of an RDF resource – subject determines the resource being described, predicate acts as the property (terms property and predicate are interchangeable), and object is the value of the property. A subject can be either a URI resource or an auxiliary blank node. A blank node differs from an RDF resource in that it does not have any URI assigned. An object can be a URI resource, a blank node, or it can be a (typed) literal value.
RDF triples compose a directed graph where subjects and objects are the nodes and predicates represent directed labeled edges from the respective subject node to the object node. The graph nature of the RDF data model can be exploited for rich queries with the SPARQL query language [39].
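The rules for what may appear in each position of a triple can be made concrete with a small sketch. The Python classes and the helper names make_triple and nodes below are illustrative assumptions, not part of any RDF library, and the http://example.com/ URIs are placeholders. The sketch only checks the positional constraints described above and collects the nodes of the resulting directed graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str  # local identifier only, no URI assigned


@dataclass(frozen=True)
class Literal:
    value: str
    datatype: str = ""  # optional datatype URI for typed literals


def make_triple(subject, predicate, obj):
    """Build a (subject, predicate, object) tuple, enforcing term positions."""
    if not isinstance(subject, (URI, BlankNode)):
        raise TypeError("subject must be a URI or a blank node")
    if not isinstance(predicate, URI):
        raise TypeError("predicate must be a URI")
    if not isinstance(obj, (URI, BlankNode, Literal)):
        raise TypeError("object must be a URI, a blank node or a literal")
    return (subject, predicate, obj)


def nodes(triples):
    """Subjects and objects are the graph nodes; predicates label the edges."""
    result = set()
    for s, _, o in triples:
        result.add(s)
        result.add(o)
    return result


EX = "http://example.com/"  # placeholder namespace for the example
office = BlankNode("b1")
triples = [
    make_triple(URI(EX + "Prague"), URI(EX + "country"), URI(EX + "CzechRepublic")),
    make_triple(URI(EX + "Prague"), URI(EX + "label"), Literal("Prague")),
    make_triple(URI(EX + "Prague"), URI(EX + "cityHall"), office),
    make_triple(office, URI(EX + "street"), Literal("Example Street 1")),
]
print(len(nodes(triples)), "nodes,", len(triples), "edges")
```

A literal in the subject position is rejected with a TypeError, which mirrors the restriction that only URI resources and blank nodes can be described by statements.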
1.2.2 Named Graphs
A collection of related triples (e.g. from one RDF document) makes up an RDF graph. A Named Graph is an RDF graph that can be named by a URI [10]. Named Graphs provide a mechanism for expressing metadata about RDF graphs, which is necessary for keeping provenance and quality-related information, security (signing, access control), etc.
A triple from a Named Graph may also be regarded as a quad with four components: subject, predicate, object and graph name.
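As a minimal sketch of the quad view, the following Python snippet stores data quads together with metadata statements about their named graphs and looks up the source of each value. The Quad type, the ex:metadata graph, the ex:source property and the population figures are all invented for illustration; a real store may keep metadata differently.

```python
from collections import namedtuple

Quad = namedtuple("Quad", ["subject", "predicate", "object", "graph"])

# Hypothetical data: two named graphs disagree on a value, and a third
# graph (ex:metadata) holds statements about the named graphs themselves.
quads = [
    Quad("ex:Prague", "ex:population", "1250000", "ex:graphA"),
    Quad("ex:Prague", "ex:population", "1240000", "ex:graphB"),
    Quad("ex:graphA", "ex:source", "ex:statisticalOffice", "ex:metadata"),
    Quad("ex:graphB", "ex:source", "ex:someWebPage", "ex:metadata"),
]


def values_with_provenance(quads, subject, predicate):
    """Return (value, source) pairs for one property of one resource."""
    source_of = {
        q.subject: q.object
        for q in quads
        if q.graph == "ex:metadata" and q.predicate == "ex:source"
    }
    return [
        (q.object, source_of.get(q.graph))
        for q in quads
        if q.subject == subject and q.predicate == predicate
    ]


print(values_with_provenance(quads, "ex:Prague", "ex:population"))
```

Because the graph name is an ordinary URI, the metadata are expressed with the same triple model as the data they describe.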
1.2.3 Ontologies
An ontology describes the meaning of things from a specific domain and relations among these things in a machine-understandable form. This description consists of a collection of classes and properties. The term vocabulary is also used for a collection of classes and properties.
Linked Data ontologies are expressed in RDF using definitions from RDF Schema (RDFS) [9] and Web Ontology Language (OWL) [32]. While RDF Schema provides basic means for description of classes and properties, OWL adds more expressive power and semantics (e.g. relations between classes like disjointness or cardinality of properties). OWL property owl:sameAs indicates that the two resources it connects have the same identity (refer to the same thing).
1.3 Motivating Example
Scenarios in which data from multiple sources need to be integrated emerge all the time. Enterprise data integration is crucial e.g. after mergers, or transitions from legacy systems; information needs to be shared in case of disasters; product catalogues list products from various suppliers; new useful mashup applications for the Web or mobile platforms appear.
Let us demonstrate the benefits of Linked Data for integration purposes on an example of a fictional news company News-O-Rama. The news company is developing a new system that would serve as a knowledge base for its journalists. Requirements on the system include the following:
1. All information about an entity (person, organization) is accessible in one place in a single system.
2. Journalists can add new facts they obtain (e.g. business relations, statistical data).
3. Journalists are able to locate sources of information and verify them at their origin.
4. The knowledge base needs to be regularly updated from publicly available sources (e.g. public contracts published by authorities, feeds about cultural events).
5. Data from legacy systems will be gradually transferred to the new knowledge base.
6. Some information needs to be shared with partner organizations and, conversely, some information is available from partner knowledge bases.
7. Missing information about an entity can be obtained from publicly available sources (e.g. Wikipedia) and displayed on demand.
8. The knowledge base supports full-text search across all stored data. Advanced queries such as looking up all filmmakers satisfying a certain condition should be supported as well.
1.3.1 Challenges
Implementing the knowledge base for News-O-Rama is a very complex task if we are limited to traditional technologies such as relational databases. The news company may run into the following problems:
• Data sources are technically heterogeneous. There are different incompatible and proprietary interfaces to the new knowledge base, legacy systems, partner knowledge bases and external sources.
• There are structural conflicts. Each source uses a different database schema. One models an address as an opaque string while another models it as a structured attribute. One system classifies persons according to their region (e.g. domestic, EU) while another stores records according to their occupation (e.g. politician, artist).
• Data conflicts emerge as the result of combining data from different sources (e.g. different name spellings, years of birth). Some values are outdated, some are mistyped, some are outright incorrect.
• Not every piece of information has the same quality. Journalists need to be able to distinguish data from trustworthy sources and data obtained from the open Web. The recency of data may also be relevant. Quality of data needs to be visualized so that the journalist is able to efficiently skim through volumes of data without putting too much effort into analysis of data sources.
• Data coming from different sources lack common identifiers that could be used to match data about a single entity in order to provide its complete profile.
• A method for storing and maintaining provenance metadata for each piece of data needs to be developed so that the origin of information can be tracked down.
• Access from partner organizations must be restricted to designated portions of data. Access control settings need to be bound to pieces of data.
• Addition of a new source requires a lot of manual work to adapt to its interface and convert data to the target data model and schema.
• Full-text search is done across many attributes in many tables. Addition of a new table requires an update of the full-text search feature. Advanced search for all filmmakers is problematic because people are assigned an occupation such as “actor” or “director” but not “filmmaker” directly.
1.3.2 Benefits of Linked Data Integration
Challenges outlined in the previous section can all be solved eventually, given a sufficient amount of effort, time and money. Let us look, however, at the benefits of Linked Data in this scenario. We will see that many of the tasks can be simplified, initial investments lowered and extensibility and interoperability improved for the future.
Single standard. RDF is the de facto standard format for representation of Linked Data. W3C has standardized its XML serialization, SPARQL as a query language, and SPARQL Protocol for remote communication. This helps overcome the first problem of technical heterogeneity. It is an established standard that News-O-Rama can use for interoperability both in-house and with its partners. Publicly available data published on the Web as RDF/SPARQL endpoints can be utilized with minimum cost (e.g. DBpedia). In addition, an RDF wrapper can be created over legacy databases, providing on-the-fly translation to the new format.
Lower entry barriers. Developing a new system on top of a relational database needs a substantial initial investment. Database schema must be designed (and is difficult to change later on) and tables populated with data using ETL tools before initial operating capability (IOC) is achieved. Linked Data, on the other hand, replace a rigid schema with flexible ontologies. A proof of concept may be created before the final ontology is approved. Ad-hoc mappings from existing sources to RDF can be created and changed later as needed.
Flexibility of schema and mappings. We can take the previous point one step further and let the system evolve gradually, endorsing the pay-as-you-go philosophy [38]. Data described using different ontologies may co-exist side by side. Even though ontologies are not perfectly aligned initially, we can improve this over time by refining mappings between properties in corresponding ontologies.
The same goes for matching of entities over time – when a journalist discovers a duplicate record for a person, a single owl:sameAs statement can be added and the system will instantly reflect the change.
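To illustrate why one added owl:sameAs statement is enough, the sketch below groups URIs connected by owl:sameAs links and picks one representative per group. It uses a generic union-find structure; this is an assumption made for illustration and not necessarily how ODCleanStore or the algorithm in Chapter 4 constructs its mapping. The function name canonical_uris and the ex: URIs are hypothetical.

```python
def canonical_uris(same_as_links):
    """Map every URI to one representative of its owl:sameAs group.

    same_as_links is an iterable of (uri_a, uri_b) pairs. The representative
    is the lexicographically smallest URI of the group, an arbitrary but
    deterministic choice made only for this sketch.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a  # the smaller root stays the root

    return {uri: find(uri) for uri in list(parent)}


links = [("ex:JohnSmith", "ex:J_Smith"), ("ex:Smith_John", "ex:JohnSmith")]
mapping = canonical_uris(links)
print(mapping["ex:Smith_John"])

# A journalist discovers one more duplicate: a single extra link suffices.
mapping = canonical_uris(links + [("ex:jsmith42", "ex:Smith_John")])
print(mapping["ex:jsmith42"])
```

Both lookups resolve to the same representative, so data attached to any of the four identifiers can be presented as one entity.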
One more advantage of a flexible schema is its expressive power. One of the requirements was that the journalists can enter facts such as business relationships. With a rigid schema, what if we need to enter a type of relationship that wasn’t thought of at the time of schema design? In Linked Data, we can connect any two entities with any kind of relationship at no cost.
Identification with URIs. Identification of things with URIs in Linked Data offers a natural instrument for matching of different representations of the same real-world object. It encourages reuse of identifiers. Even when partner organizations of News-O-Rama use different URIs, relationships may be discovered by providing links to a trusted third-party authority such as DBpedia. In addition, entities can easily be referenced in communication between journalists.
Metadata and named graphs. RDF statements can be grouped into a named graph identified with a URI. This provides a natural mechanism for attaching metadata. Statements about a named graph may capture e.g. provenance, or access control settings.
Named graphs also offer a natural level for quality assessment. With an appropriate granularity of named graphs, we can even model different levels of trust to a source in different areas of expertise.
Search capabilities. Data are not distributed across tables but share a single data space in a typical RDF data store. Therefore the knowledge base can be searched regardless of where data reside. Storing values in simple statements rather than in unstructured documents also helps improve the precision of search results.
The RDF model allows reasoning based on description logic. In the example with a “filmmaker”, we can introduce a filmmaker class and define its subclasses such as actor and director. Reasoning over the hierarchy of classes will allow us to find all “filmmaker” entities even if they are designated as an instance of one of the subclasses.
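The filmmaker lookup can be sketched in a few lines of Python. The class names, the example persons and the restriction to a single superclass per class are simplifying assumptions for illustration; RDFS and OWL reasoners handle multiple superclasses and much richer semantics.

```python
# Hypothetical class hierarchy: each class has at most one direct superclass.
SUBCLASS_OF = {
    "ex:Actor": "ex:Filmmaker",
    "ex:Director": "ex:Filmmaker",
    "ex:Filmmaker": "ex:Person",
    "ex:Journalist": "ex:Person",
}

# Hypothetical instance data: entity -> its asserted class.
TYPE_OF = {
    "ex:alice": "ex:Actor",
    "ex:bob": "ex:Director",
    "ex:carol": "ex:Journalist",
}


def is_subclass_or_same(cls, target):
    """Walk up the hierarchy from cls and report whether target is reached."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == target:
            return True
        seen.add(cls)  # guards against accidental cycles in the data
        cls = SUBCLASS_OF.get(cls)
    return False


def instances_of(target):
    """All entities whose asserted class is target or one of its subclasses."""
    return sorted(e for e, c in TYPE_OF.items() if is_subclass_or_same(c, target))


print(instances_of("ex:Filmmaker"))
```

Nobody is typed as ex:Filmmaker directly, yet the query returns the actor and the director because the subclass relation is followed upwards.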
1.3.3 Contributions of the Thesis
This thesis proposes algorithms for Linked Data fusion and for quality assessment of fused data, which are implemented in the Conflict Resolution component of the Linked Data integration tool ODCleanStore.
Say that a journalist requests information about a well-known politician. Ideally, she should be presented with a complete, concise, and consistent description. Completeness is achieved by using data from all available sources. Each source, however, uses different URIs, a different ontology to describe data, and data conflicts emerge where sources overlap. Before the data are presented to the journalist, they are processed by the Conflict Resolution component. Different identifiers of the same resource or property are collapsed to a single one using mappings in the knowledge base, thus improving conciseness. Data conflicts are resolved based on settings provided by the user according to her current needs. The journalist may choose to select only the latest contact information, the most trusted relations to lobbyists, and the average values of geographical latitude and longitude of her office, for instance.
The Conflict Resolution component also addresses the problem of verifiability and quality estimation. For each statement in the result, a list of named graphs it was obtained from is provided so that the journalist can look up the original sources. A quality score of each statement is computed which helps the journalist select the data that can be trusted and are worth using.
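The user-selected resolution of conflicts described above can be illustrated with a small Python sketch. The three strategies mirror the journalist's choices (latest, most trusted, average); the record layout, the score field and all values are assumptions made for the example and do not reproduce the interfaces of ODCleanStore or the quality computation of Chapter 5. Each result keeps the named graphs it was derived from so that the origin can be looked up.

```python
from datetime import date
from statistics import mean

# Hypothetical conflicting values for one property of one entity.
candidates = [
    {"value": 50.08, "graph": "ex:graphA", "score": 0.9, "updated": date(2012, 5, 1)},
    {"value": 50.10, "graph": "ex:graphB", "score": 0.6, "updated": date(2013, 3, 15)},
]


def latest(cands):
    """Keep the most recently updated value."""
    best = max(cands, key=lambda c: c["updated"])
    return {"value": best["value"], "sources": [best["graph"]]}


def most_trusted(cands):
    """Keep the value from the graph with the highest (assumed) score."""
    best = max(cands, key=lambda c: c["score"])
    return {"value": best["value"], "sources": [best["graph"]]}


def average(cands):
    """Combine numeric values; every contributing graph is a source."""
    return {
        "value": mean(c["value"] for c in cands),
        "sources": [c["graph"] for c in cands],
    }


STRATEGIES = {"latest": latest, "most_trusted": most_trusted, "average": average}


def resolve(cands, strategy):
    """Apply the resolution strategy chosen by the user."""
    return STRATEGIES[strategy](cands)


for name in STRATEGIES:
    print(name, resolve(candidates, name))
```

Choosing the strategy per property, for example "latest" for contact details and "average" for coordinates, corresponds to the settings the journalist provides according to her current needs.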
Technical Note
Throughout this thesis, we will use the shorthand URI notation known from XML namespaces to write URI resources. A URI is written as a prefix followed by a colon and a local part, e.g. rdf:type. Prefixes used in this thesis can be expanded to the corresponding URIs using prefix.cc; ex: will be used as a special prefix for URIs used in examples with no special meaning.
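Expansion of the shorthand notation amounts to a dictionary lookup, as the following sketch shows. The rdf: and owl: namespace URIs are the standard W3C ones; the URI chosen for ex: and the function name expand are assumptions for the example.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "ex": "http://example.com/",  # placeholder for example URIs
}


def expand(shorthand):
    """Expand 'prefix:local' into a full URI using the PREFIXES table."""
    prefix, separator, local = shorthand.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {shorthand!r}")
    return PREFIXES[prefix] + local


print(expand("rdf:type"))
print(expand("owl:sameAs"))
```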
2. Data Integration
Data integration is the problem of “combining data residing at different sources and provid-ing the user with a unified view of these data” [28]. The whole process of data integration is a complex task with many subtasks, each a research area on its own. The challenges that they must address can be roughly summarized as:
• Technical and semantical heterogeneity of data sources.
• Schema, identity and data conflicts.
• Cleansing of incorrect or otherwise flawed data.
• Identification of the target schema and translation from source to target schemata.
• Presentation of the result.
The process of data integration has been studied mostly in relation to relational databases [7, 8, 28]. We adopt these results and contribute with new findings from analysis of existing integration systems with focus on Linked Data. In the following sections, we start by looking at these challenges in more detail and introduce related concepts and terminology. Then we look at individual steps of data integration. Each step is introduced generally and then followed by more details from the Linked Data point of view.
2.1 Basic Terminology
First, let us establish some of the concepts needed when talking about data integration.
2.1.1 Entities
An entity is a representation of a real-world object. An entity may represent, for example, a person or a city, but also an address if we decide to model it as an entity.
The form of entity representation depends on the underlying data model. In the relational database world, we talk about tuples, records, or table rows. In the Linked Data world, we talk about resources. For the purposes of Linked Data integration presented in this thesis, we interpret an entity as a resource description composed of triples having the RDF node representing the resource as their subject. Other options are possible, such as inclusion of other triples in the neighbourhood (e.g. blank nodes can be regarded as structured attributes).
Entities have attribute (or property) values. In terms of RDF, a triple (s, p, o) is interpreted as entity s having value o of property p.
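Under this interpretation, entity descriptions can be obtained by grouping triples by subject and then by property. The sketch below assumes triples are plain tuples of strings in the shorthand notation; the data and the function name entity_descriptions are illustrative only.

```python
from collections import defaultdict

# Hypothetical triples (s, p, o) in shorthand notation.
triples = [
    ("ex:Prague", "rdf:type", "ex:City"),
    ("ex:Prague", "ex:label", "Prague"),
    ("ex:Prague", "ex:label", "Praha"),
    ("ex:Brno", "rdf:type", "ex:City"),
]


def entity_descriptions(triples):
    """Group triples by subject: entity -> property -> list of values."""
    descriptions = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        descriptions[s][p].append(o)  # entity s has value o of property p
    return {s: dict(props) for s, props in descriptions.items()}


entities = entity_descriptions(triples)
print(entities["ex:Prague"])
```

A property with more than one value, such as ex:label here, is exactly the place where data conflicts can later show up.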
2.1.2 Identifiers
Each entity needs an identifier. It is the primary key in databases and typically a URI in the RDF data model. A special case are blank nodes, which are identified only locally with a blank node ID. They can be regarded either as entities with only a local identifier or as a structured attribute of an entity. We use the former interpretation in our work since, in the latter interpretation, a blank node could be part of several entity descriptions and have more complex semantics.
Note that a single unique real-world object may have multiple entity representations having distinct identifiers.
2.1.3 Conflicts
Conflicts occurring in data integration are caused by several factors including different decisions in data modelling, intra- and inter-source duplicates and provision of inaccurate or incomplete data. The resulting conflicts can be classified into three categories: schema conflicts, identity conflicts and data conflicts.
Schema Conflicts
As the name implies, schema conflicts are determined by schemata used to describe the data. Different schemata may use different attribute/property names (e.g. “telephone” vs. “phoneNumber”) and different data representations (e.g. using one or two attributes for name and surname). Design choices also cause structural conflicts such as choosing between an attribute and an entity to represent a value (e.g. address). This would correspond to the choice between a literal or an RDF resource with its own description in RDF.
The conflicts may also be in value semantics, such as usage of different scales and measurement units or different data types. These incompatibilities may not be visible directly from the schemata and would propagate as data conflicts without proper transformation.
Identity Conflicts
Identity conflicts are a result of multiple representations with different identifiers for the same real-world object. Identity conflicts are also known as key conflicts in databases.
In RDF, this means usage of different URIs for the same real-world object. The city of Prague is identified by http://sws.geonames.org/3067696/ on Geonames¹ while DBpedia uses http://dbpedia.org/page/Prague, for example.
¹ Geographical database, http://www.geonames.org.
Some schema conflicts in Linked Data can also be regarded as a special case of identity conflicts. Different URIs may be used for the same concept in two ontologies, such as geo:lat and freebase:location.geo.latitude for geographic latitude – the only difference is the property identifier.
Data Conflicts
Data conflicts occur when multiple different values exist for an attribute of the same entity. This kind of conflict is also referred to as an attribute conflict or instance-level conflict.
Inter-source conflicts manifest themselves only after two or more sources are merged. Their resolution is a task for the data fusion step of integration. Examples include different geographic coordinates for a city or different entity labels. Intra-source data conflicts (e.g. typing errors or different measurements such as a city population measured in different years) are a task for data cleansing techniques but they can also be deferred to the data fusion phase (e.g. in the city population case).
Another classification distinguishes uncertainty and contradictions. Contradiction is defined as the occurrence of multiple different values for an attribute of an entity. Uncertainty, on the other hand, is a situation where the value is missing (NULL) in one source and present in another.
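The distinction between contradiction and uncertainty can be stated as a short decision rule. In the sketch below, a missing (NULL) value is represented by None; the function name classify_conflict, the return labels and the sample values are assumptions made for illustration.

```python
def classify_conflict(values_by_source):
    """Classify the values of one attribute of one entity across sources.

    values_by_source maps a source name to its value, with None for NULL.
    """
    present = {v for v in values_by_source.values() if v is not None}
    has_missing = any(v is None for v in values_by_source.values())
    if len(present) > 1:
        return "contradiction"  # multiple different values
    if len(present) == 1 and has_missing:
        return "uncertainty"  # value missing in one source, present in another
    return "no conflict"


print(classify_conflict({"sourceA": 1964, "sourceB": 1946}))
print(classify_conflict({"sourceA": 1964, "sourceB": None}))
print(classify_conflict({"sourceA": 1964, "sourceB": 1964}))
```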
Other Types of Conflicts
In the presence of a trusted knowledge base, even a single value can represent a conflict if it is in contradiction with the knowledge base. This can be seen as a reasonableness conflict [52].
With respect to RDF, we have made the assumption that a triple subject represents an entity, triple objects represent attribute values and conflicts occur in place of objects. Conflicts in place of triple subjects or predicates can also be considered, however. Predicate conflicts may be a special case of relationship conflicts [29].
We have also assumed that data conflicts are resolved at the entity granularity. Other granularities are also possible [7] but not relevant for our purposes further in this text.
2.1.4 Materialized vs. Virtual Integration
There are two basic approaches to data integration. One is materialized data integration which materializes the unified view of the data in a persistent store such as a data warehouse. The other is virtual data integration where the view on data is virtual and created at query time; the actual data are stored only at the sources. The difference is thus in the place where queried data are stored. Materialized views are created using ETL (Extract-Transform-Load) and similar tools, virtual integration builds on mediator-wrapper architectures and distributed query processing.
The choice between the two depends on many factors such as how often the sources are updated, query frequency, lifetime of the result or volume of data. Materialized views are less technically demanding in data retrieval and more time can be devoted to data preparation. For this reason, they are more convenient for resolution of identity and data conflicts. The actual data fusion and conflict resolution may or may not be deferred to query time, however, so that the unified view is materialized only partially [25]. On the other hand, virtual views are suitable when data sources are not known in advance or for quality-driven query processing [1].
Existing Linked Data integration solutions dealing with identity and instance conflicts use materialized views [25, 33]. Virtual integration solutions are based on a mediator-wrapper architecture [27] or SPARQL federation [18, 40, 46].
2.2
The Data Integration Process
Data integration is driven by the desire for new knowledge. That is why data to be combined come from new sources which may have been produced by different people, for different purposes and at different points in time. Therefore, various forms of heterogeneity must be dealt with in order to provide a unified view on the data.
The diagram in Figure 2.1 shows the steps a typical data integration system must perform:
• Schema mapping tackles schema heterogeneity and maps source schemas to the target schema.
• Data retrieval handles data extraction from sources and related technical differences. Source selection may filter relevant sources prior to retrieval.
• Data preparation transforms the data to a form appropriate for integration.
• Quality assessment uses data and metadata to estimate the quality of data.
• Schema translation uses schema mappings to resolve schema heterogeneities.
• Duplicate detection resolves identity conflicts.
• Data fusion produces the integrated result with data conflicts resolved by conflict resolution.
Figure 2.1 is very general. The dashed arrows indicate optional flows and in addition, a concrete data integration system may execute the steps in different order or omit some of the steps entirely. Systems are also diverse in where data are stored during the process. Data may be kept in-memory only, loaded to a staging database or some combination may be used.
Figure 2.1: Steps in a general data integration system
Let us look at three examples of concrete integration frameworks shown in Figure 2.2. Note that the terminology has been modified to match terms introduced in this thesis.
The Linked Data integration framework ODCleanStore [25] is shown in Figure 2.2a. Data are pushed through a web service and processed by a pipeline of transformers. The composition and order of transformers can be customized depending on input data. The processed data are stored to a persistent store. The Data Retrieval module obtains relevant data when a user query is issued and conflicts are resolved at runtime. The ODCleanStore integration process can be classified as partially materialized – pre-processed data and mappings are stored but identity and data conflicts are resolved at runtime.
Figure 2.2b shows the Linked Data integration framework LDIF [45]. LDIF has a typical architecture of a materialized integration system. All the integration steps are executed in a single pipeline. The quality assessment and data fusion steps are coupled together in a component called Sieve [33].
Fusionplex [35], shown in Figure 2.2c, is a typical representative of a virtual integration system based on wrappers. The focus here is on efficient query planning and data retrieval, while duplicate detection and quality assessment receive less attention. Fusionplex is one of the few systems which resolve conflicts. It integrates data from very heterogeneous sources adapted by a wrapper. The integration process is executed whenever a query is issued. The query translator determines relevant sources and calls the data retrieval component. Data retrieval obtains data from sources with regard to schema mappings – it translates the queries and the responses between source and target schemas. The rest of the components fuse the data and return the result directly without storing it.
[Figure 2.2 panels: (a) ODCleanStore, (b) LDIF, (c) Fusionplex]
2.3
Data Integration Steps in Detail
2.3.1
Schema Mapping
The purpose of the schema mapping step is the creation of mappings between source and target schemata. The mappings may be created automatically [3], semi-automatically, or manually [25, 45]. In the latter case, mappings are usually created beforehand and used for all executions of the integration process.
Three basic approaches have been proposed for databases [1, 28] – global-as-view (GAV, global schema is expressed as queries over sources), local-as-view (LAV, data sources expressed as queries over the global schema) and their combination global-local-as-view (GLAV).
In the context of RDF, schema mapping consists of the creation of links between classes and properties. Simple owl:sameAs links or more complex relations like rdfs:subPropertyOf may be produced. A language designed to represent mappings in RDF has been proposed [6].
2.3.2
Data Retrieval
The data retrieval step handles technical heterogeneities. Sources may use different data models (RDF, relational databases, XML) or incompatible (proprietary) technologies. Data retrieval may also include a source selection substep which filters relevant sources prior to retrieval.
This step is most relevant for virtual integration where it is often combined with data preparation realized by query reformulation. Materialized integration can replace this step with the preparation of the data warehouse where data are integrated.
2.3.3
Data Preparation
Data preparation is a broad term for data transformations and cleansing executed prior to the fusion of sources. It helps improve the results of data fusion. Data preparation may include cleansing of flawed data, normalization of values but also resolution of schematic conflicts (e.g. division of a person’s name into separate attributes for name and surname). The transformation of values may be executed once (e.g. as part of an ETL process) or online (in virtual integration).
2.3.4
Schema Translation
Schema translation is an optional step which applies mappings from the schema mapping step to transform data to a single target vocabulary.
Schema translation may be a separate integration step (e.g. in LDIF), executed as part of data retrieval query processing (Fusionplex), or even be deferred to the data fusion step (ODCleanStore).
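As an illustration, schema translation over simple one-to-one property mappings can be sketched in Python as follows. The triple representation, the function name translate_schema and the example property URIs are assumptions of this sketch; real systems such as LDIF support considerably richer mappings.

```python
from typing import Dict, List, Tuple

# A triple is modelled as (subject, predicate, object); an assumption of this sketch.
Triple = Tuple[str, str, str]


def translate_schema(triples: List[Triple], mapping: Dict[str, str]) -> List[Triple]:
    """Rewrite predicates to the target vocabulary using one-to-one mappings.

    Predicates without a mapping are kept unchanged.
    """
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


# Hypothetical source data and mapping to a single target vocabulary.
source_triples = [
    ("ex:alice", "src:fullName", "Alice"),
    ("ex:alice", "src:homepage", "http://example.org/alice"),
]
property_mapping = {"src:fullName": "foaf:name"}
translated = translate_schema(source_triples, property_mapping)
```

Only the mapped predicate is rewritten here; the unmapped src:homepage predicate passes through untouched, which mirrors the optional nature of the step.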
2.3.5
Quality Assessment
Quality assessment estimates the fitness for use of data for integration purposes. The quality is usually calculated from metadata and source statistics (such as timestamp, source preference or query cost) and/or the data itself (e.g. conformance to constraints, completeness). The quality can be assessed on the source or data level. The assessment granularity for Linked Data is often the named graph level (ODCleanStore, Sieve). Another aspect is the time and purpose of quality assessment. It may be used for query execution planning at query time [36], before data fusion to inform fusion and conflict resolution decisions (ODCleanStore, Sieve, Fusionplex), or even during conflict resolution as proposed in Chapter 5.
Examples of Linked Data quality assessment methods can be found in [24, 33].
2.3.6
Duplicate Detection
Duplicate detection deals with identity conflicts resulting from multiple representations with different identifiers of the same real-world object. Duplicate detection itself has many alternative names including object identification, record linkage, entity (co-)reference reconciliation, identity resolution and others. The result of this step is either renaming of identifiers or a collection of mappings (e.g. owl:sameAs4 links in RDF).
Entities have to be linked based on their attributes in the absence of unique identification. The challenge for duplicate detection is how to perform it efficiently and effectively. Effectiveness depends on the similarity measures used and on the tuning of similarity thresholds. Similarity measures are often application-dependent but generic string measures such as the Levenshtein distance can be applied [23, 43]. In any case, duplicate detection is a heuristic process.
4 The owl:sameAs property indicates that the linked resources refer to the same thing, have the same
Comparison of all pairs of entities is not usually feasible for efficiency’s sake. Duplicate detection methods usually employ clustering methods to improve overall performance. The implication for data integration is that duplicate detection is often not feasible at query time in virtual integration.
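The following Python sketch illustrates threshold-based duplicate detection with the Levenshtein distance. Grouping entities by the first letter of their label is a deliberately crude stand-in for the clustering methods mentioned above; the function names, the normalized similarity and the default threshold of 0.8 are assumptions of this sketch and do not describe Silk or any other cited tool.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed by dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def find_duplicates(labels: Dict[str, str], threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Return pairs of entity URIs whose labels are similar enough."""
    # Group by first letter so that only entities in the same group are compared.
    blocks: Dict[str, List[str]] = defaultdict(list)
    for uri, label in labels.items():
        blocks[label[:1].lower()].append(uri)
    links = []
    for uris in blocks.values():
        for u, v in combinations(sorted(uris), 2):
            if similarity(labels[u].lower(), labels[v].lower()) >= threshold:
                links.append((u, v))
    return sorted(links)
```

The returned pairs correspond to candidate owl:sameAs links. Restricting comparisons to groups trades recall for speed: two duplicates whose labels start with different letters are never compared in this sketch.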
To name two examples of duplicate detection frameworks for Linked Data: Silk [50] (part of LDIF, used also in ODCleanStore) features a declarative language for specifying what types of links should be produced under what conditions. iSPARQL [23] extends the SPARQL query language with similarity matching.
2.3.7
Data Fusion
Data fusion is the step in the integration process where the actual merging of data is performed. It can be defined as “the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation” [8].
Let us look at the individual parts of this definition. Data fusion deals with records representing the same real-world object. Our understanding of a record determines the granularity on which data fusion works. In relational databases, we understand a record to be a tuple (a table row). Fusion is usually done at the tuple level [7, 41] although some tools focus on attribute conflict resolution [3].
The situation is quite different in the RDF data model. There is no concept of a record. In Section 2.1.1, we introduced resource description as a collection of triples having the resource in place of the subject as an alternative for records. We use this perspective in the fusion algorithm proposed here, although we note this problem is more complex (see Section 3.7). Schemata for Linked Data are used rather loosely and properties may freely be added or removed. For this reason, existing RDF fusion tools operate on the property value level rather than on whole resource descriptions.
Another keyword from the definition is single. This doesn’t necessarily mean outputting a single record or RDF triple. Depending on the conflict resolution strategy, the output may include even all values from all sources. It should be clear, however, that they all belong to a single record. Identity conflicts should be resolved (if not done in previous steps), and a single identifier (URI) used for the target resource representation.
Finally, representation should be consistent and clean. This means that conflicts which emerge during data fusion should be resolved. Uncertain or low-quality data may be purged.
2.3.8
Conflict Resolution
The purpose of the conflict resolution step is the resolution of data conflicts (see Section 2.1.3). A data conflict occurs when multiple different values exist for an attribute of the same entity. Conflicts become apparent after entities with conflicting values are fused together but should be resolved before the final entity representation is produced. For this reason, we list conflict resolution as a substep of data fusion.
Conflicts are resolved by application of a conflict resolution function which usually operates on entity or attribute level. In the former case, the resolution function takes several entity representations as its input and produces one or more result entities [35, 41]. Attribute conflicts are resolved en bloc, which allows decisions based on values of multiple attributes. The input of the resolution function in the latter case – resolution on the attribute level – is a collection of attribute values and the output is one or more values [3, 53], which do not necessarily have to be from the same domain as the original values (e.g. the Concat function). This approach is more convenient when the output consists of individual projected attributes rather than whole entities. It is also more natural for Linked Data [25, 33], where a strict notion of record is missing.
The issue of data fusion and conflict resolution is discussed in more detail in the following chapter.
3. Data Fusion & Conflict Resolution
Of all steps of data integration listed in the previous chapter, the focus of this thesis is on data fusion and conflict resolution. Therefore, we describe these two steps in more detail before proposing a concrete Linked Data fusion algorithm in Chapter 4. This chapter formally defines data conflicts and lists concrete conflict resolution functions which can be applied to resolve these conflicts. We propose an implementation-centric classification of resolution functions based on an existing classification of resolution strategies, present a comprehensive list of resolution functions in the literature and analyze their properties. Finally, this chapter discusses some specifics of Linked Data with regard to data fusion.
3.1
Data Conflicts
Different kinds of conflicts (namely schema, identity and data) were described in Section 2.1.3. The purpose of the conflict resolution step is resolution of data conflicts. Definition 3.1 formalizes what we understand by a data conflict.
Definition 3.1 (Data Conflict). Let X be a set of representations of the same real-world object. Let A(x) denote the value of attribute A for a representation x. A data conflict occurs for attribute A with cardinality 1 or [0,1] (“exactly one” or “at most one value”) when the set {A(x) | x ∈ X} contains more than one distinct value.
Note that we limit this definition to attributes which are constrained to have at most one value. This restriction is often omitted because in the relational model, each tuple does have at most one value for each attribute (or NULL). This is not the case in RDF, however, where it is perfectly valid to have multiple values for the same property. A resource can belong to multiple classes which is denoted by multiple values for the rdf:type property, for example.
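Definition 3.1 can be turned into a small check, sketched here in Python. Representations are modelled as dictionaries from attribute names to a single value, and the set of attributes constrained to at most one value is passed in explicitly; both choices, the function name has_data_conflict, and the decision to ignore missing values are assumptions of this sketch.

```python
from typing import Iterable, Mapping, Optional, Set


def has_data_conflict(representations: Iterable[Mapping[str, Optional[str]]],
                      attribute: str,
                      single_valued: Set[str]) -> bool:
    """Check Definition 3.1 for one attribute over a set of representations."""
    if attribute not in single_valued:
        # The definition only covers attributes with cardinality 1 or [0,1].
        return False
    # Missing values (None) are ignored; this is an assumption of the sketch.
    distinct = {x[attribute] for x in representations if x.get(attribute) is not None}
    return len(distinct) > 1


# Hypothetical representations of the same city from two sources.
prague_a = {"ex:population": "1300000", "rdf:type": "ex:City"}
prague_b = {"ex:population": "1250000", "rdf:type": "ex:Capital"}
single_valued_attributes = {"ex:population"}
```

With these inputs, ex:population is reported as conflicting, while the two different rdf:type values are not, because rdf:type is not constrained to a single value.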
3.2
Conflict Resolution Functions
Conflicts can be resolved on entity or attribute level (although other levels are also conceivable [7]). In either case, the conflicts are resolved independently by applying a conflict resolution function (or resolution function for short) on the respective level.
An entity-level or attribute-level resolution function takes a collection of entities or attribute values, depending on the level it operates on, and produces one or more entities or attributes, respectively, with resolved conflicts.
The exact input and output of a conflict resolution function are application-dependent in practice. For instance, provenance information may be present in both input and output, and the function may access additional metadata or source statistics. For this reason, we do not provide a general-purpose definition; a definition specific to the conflict resolution algorithm proposed in this thesis is given in Definition 4.8 further in this text. The defined resolution function formally operates on sets of RDF quads but conceptually resolves attribute conflicts. We will assume this view on resolution functions in this thesis from now on.
3.3
Overview of Concrete Resolution Functions
A number of functions for resolution of conflicting values has been proposed in the literature [3, 7, 25, 35, 37, 48, 53]. The following list contains a comprehensive overview of proposed functions with a few new additions. Some systems also allow custom user-defined resolution functions [41]. The terminology differs between authors and functions sometimes overlap – we prefer more general functions and terminology used in ODCleanStore where applicable.
All Returns all values.
Any Returns an arbitrary (non-NULL) value.
First, Last Returns the first or the last (non-NULL) value, respectively. Requires ordering of the values on input. First is an equivalent of COALESCE() in SQL.
Random Returns a random (non-NULL) value. The chosen value differs among calls on the same input.
Certain If the input values contain only one distinct (non-NULL) value, returns the value. Otherwise returns NULL or empty output (depending on the underlying data model).
Best Returns the value with the highest data quality value. The quality measure is application-specific.
TopN Returns n best values (see Best). n is given as a parameter.
Threshold Returns values with data quality higher than a given threshold. The threshold is given as a parameter. The quality measure is application-specific.
BestSource Returns a value from the most preferred source. The preference of source may be explicit (given preferred order of sources) or based on an underlying data quality model.
MaxSourceMetadata Returns a value from the source with a maximal source metadata value. The metadata value may be e.g. timestamp of the source, access cost or a data quality indicator. The used type of source metadata is either given as a parameter or fixed.
MinSourceMetadata Returns a value from the source with the minimal source metadata value (see MaxSourceMetadata).
Latest Returns the most recent (non-NULL) value. Recency may be available from another attribute, value/entity metadata or source metadata (the last case is a special case of MaxSourceMetadata).
ChooseSource Returns a value originating from the source given as a parameter. Returned value may be NULL (or equivalent).
Vote Returns the most frequently occurring (non-NULL) value. Different strategies may be employed in case of a tie, e.g. choosing the first or a random value.
WeightedVote Same as Vote but each occurrence of a value is weighted by the quality of its source.
Longest, Shortest Returns the longest/shortest (non-NULL) value.
Max, Min Returns the maximal/minimal (non-NULL) value according to an ordering of input values.
Filter Returns values within a given range. The minimum and/or maximum are given as parameters. Applicable only to values from an ordered domain.
MostGeneral Returns the most general value according to a taxonomy or ontology.
MostSpecific Returns the most specific value, according to a taxonomy or ontology (if the values are on a common path in the taxonomy).
Concat Returns a concatenation of all values. The separator of values may be given as a parameter. Annotations such as source identifiers may be added to the result.
Constant Returns a constant value. The constant may be given as a parameter or be fixed (e.g. NULL).
CommonBeginning Returns the common substring at the beginning of conflicting values.
CommonEnding Returns the common substring at the end of conflicting values.
TokenUnion Tokenizes the conflicting values and returns the union of the tokens.
TokenIntersection Tokenizes the conflicting values and returns the intersection of the tokens.
Avg Returns the average of all (non-NULL) input values.
Median Returns the median of all (non-NULL) input values.
Count Returns the number of distinct (non-NULL) input values.
Variance, StdDev Returns the variance or standard deviation of values, respectively.
ChooseCorresponding Returns the value that belongs to an entity (tuple) whose value has already been chosen for an attribute A, where A is given as a parameter. Applicable only for resolution on entity level.
ChooseDepending Returns the value that belongs to an entity (tuple) which has a value v of an attribute A, where v and A are given as parameters. Applicable only for resolution on entity level.
MostComplete Returns the (non-NULL) value from the source having fewest NULLs for the respective attribute across all entities.
MostDistinguishing Returns the value that is the most distinguishing among all present values for the respective attribute.
Lookup Returns a value by doing a lookup into the source given as a parameter, using the input values.
MostActive Returns the most often accessed or used value.
GlobalVote Returns the most frequently occurring (non-NULL) value for the respective attribute among all entities in the data source.
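To make the overview more concrete, a few of the listed functions are sketched below in Python on the attribute level. NULL is modelled as None and source quality as a plain number attached to each value; these encodings, the function names and the tie-breaking behaviour (the first of the tied values wins) are assumptions of this sketch rather than the behaviour of ODCleanStore or any other cited system.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple


def non_null(values: Sequence) -> List:
    """Drop missing (NULL) values, modelled here as None."""
    return [v for v in values if v is not None]


def resolve_certain(values: Sequence[Optional[str]]) -> Optional[str]:
    """Certain: return the value only if all non-NULL inputs agree."""
    distinct = set(non_null(values))
    return distinct.pop() if len(distinct) == 1 else None


def resolve_vote(values: Sequence[Optional[str]]) -> Optional[str]:
    """Vote: most frequent non-NULL value; ties go to the value seen first."""
    counts = Counter(non_null(values))
    return counts.most_common(1)[0][0] if counts else None


def resolve_weighted_vote(values: Sequence[Tuple[Optional[str], float]]) -> Optional[str]:
    """WeightedVote: each occurrence counts with the quality of its source."""
    weights: Dict[str, float] = defaultdict(float)
    for value, source_quality in values:
        if value is not None:
            weights[value] += source_quality
    return max(weights, key=weights.get) if weights else None


def resolve_avg(values: Sequence[Optional[float]]) -> Optional[float]:
    """Avg: average of all non-NULL numeric inputs."""
    numbers = non_null(values)
    return mean(numbers) if numbers else None


def resolve_concat(values: Sequence[Optional[str]], separator: str = "; ") -> str:
    """Concat: concatenation of all non-NULL values with a separator."""
    return separator.join(non_null(values))
```

Note how Vote and WeightedVote can disagree on the same input: a value reported twice by low-quality sources may lose to a value reported once by a highly rated source.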
3.4
Conflict Resolution Strategies
The most thorough classification of conflict handling strategies was published by Bleiholder & Naumann [8]. We adopt their scheme and suggest a simplified classification based on experience from implementation in ODCleanStore hereafter.
Bleiholder & Naumann identify three major groups of conflict handling strategies: conflict-ignoring, conflict-avoiding and conflict-resolving (Figure 3.1).
Conflict-ignoring strategies do not make decisions between conflicting values. Such a strategy may simply return all input values (All) or consider all possibilities, giving the user the maximum completeness of all “possible worlds”.
Conflict-avoiding strategies acknowledge conflicts but choose between alternatives regardless of the actual values. One option is to decide based on metadata; for example, one source can be absolutely preferred over another based on timestamp, collective rating or explicit user preference (e.g. ChooseSource, BestSource). Another option is taking the first or an arbitrary non-null value. Consistent query answering approaches return only sure facts and can also be classified as conflict-avoiding strategies (Certain).
The last group are conflict-resolving strategies, which regard the data and metadata before decision making. These strategies can further be divided into deciding and mediating. Deciding strategies only choose value(s) already present in their input (e.g. Best, Max).
Figure 3.1: Classification of conflict handling strategies according to Bleiholder & Naumann [8]
Mediating strategies, on the other hand, may also produce new values, such as average or sum of values (Avg, Sum).
Another classification concerns the choice of a concrete resolution strategy [1]. The choice can be made at design time or query time. While the former group specifies the strategy when the integration system is designed, the latter group makes the choice when the query is formulated. Query time techniques give more flexibility and enable the data consumer to specify different resolutions in different contexts.
3.5
Classification of Conflict Resolution Functions
The concrete resolution functions listed above include both functions returning a single value and functions returning multiple values. While returning a single value is more common in the database world (analogous to aggregation functions), functions with more output values are used in Linked Data integration [25, 33].
Classification according to Bleiholder & Naumann recognizes deciding and mediating conflict resolution functions, and returning all values is considered a conflict-ignoring strategy rather than a resolution function. If we allow resolution functions to return more than one value, however, there is little difference from the implementation point of view between Filter returning values from a certain range and All returning all values, for example. On the other hand, the mediating function Avg needs to be treated differently than e.g. Sum when measuring the quality of fused data (as shown in Section 5.5). For this reason, a finer distinction of mediating functions would be useful.
We propose a classification of conflict resolution functions which is based on experience gained when implementing resolution functions in ODCleanStore. The classification is shown in Figure 3.2.
Figure 3.2: Classification of conflict resolution functions based on experience from ODCleanStore implementation.
Since we allow resolution functions to return an arbitrary number of values, returning all values or deciding solely based on the data source is classified as a resolution function, which avoids the need to distinguish conflict avoidance from conflict ignorance.
Deciding functions choose only values received in the input. Many-valued deciding functions may return an arbitrary number of output values, as opposed to single-valued functions returning one value (or NULL or zero values if there is no acceptable input value), which affects their possible usage.
Mediating functions may produce completely new values. Moderating mediating functions attempt to produce a value similar to the input values and/or minimize the error of the output. Moderating functions are generally idempotent and monotonic (if applicable). Scattering functions may produce a result completely different from the input values, even from a different domain (as is the case of Count, for instance). The implication for measuring quality of resolved data is that, for scattering functions, it doesn’t make sense to measure the error of the produced value compared to inputs. While it would also make sense to distinguish mediating resolution functions producing single and multiple values, many-valued mediating functions do not appear in practice.
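The difference between moderating and scattering functions can be observed on a small Python sketch. We assume here that idempotency means that resolving several copies of the same value yields that value again; the helper names is_idempotent_on and max_error are hypothetical and serve only this illustration.

```python
from statistics import mean
from typing import Callable, Sequence


def avg(values: Sequence[float]) -> float:
    """Avg: a moderating function, its output stays close to the inputs."""
    return mean(values)


def count(values: Sequence[float]) -> int:
    """Count: a scattering function, its output lies in a different domain."""
    return len(set(values))


def is_idempotent_on(resolve: Callable[[Sequence[float]], float],
                     value: float, copies: int = 3) -> bool:
    """Check whether resolving identical inputs returns that same value."""
    return resolve([value] * copies) == value


def max_error(output: float, inputs: Sequence[float]) -> float:
    """Largest absolute difference between the output and any input value."""
    return max(abs(output - v) for v in inputs)
```

For the inputs 10, 12 and 14, Avg yields 12 with a maximal error of 2, which is a meaningful quantity. Count yields 3, and comparing 3 with the inputs says nothing about the quality of the result.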
Set Resolution Functions
Allowing resolution functions to both accept and produce multiple values enables implementation of set operations by resolution functions for attributes with multiplicity larger than one.
Input values originating from each source would be regarded as one input set and the resolution function could return e.g. their intersection. Performing the union of values would be equivalent to using All without duplicates.
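A minimal Python sketch of such set resolution functions follows. Grouping the input values by source in a dictionary, the function names and the example sources are assumptions made for illustration.

```python
from typing import Dict, Set


def intersect_sources(values_by_source: Dict[str, Set[str]]) -> Set[str]:
    """Return only the values that every source agrees on."""
    value_sets = list(values_by_source.values())
    return set.intersection(*value_sets) if value_sets else set()


def union_sources(values_by_source: Dict[str, Set[str]]) -> Set[str]:
    """Return all values from all sources without duplicates."""
    return set().union(*values_by_source.values())


# Hypothetical rdf:type values for one resource as reported by two sources.
types_by_source = {
    "source1": {"ex:City", "ex:Capital"},
    "source2": {"ex:City", "ex:Place"},
}
```

Here the intersection keeps only ex:City, the single type both sources assert, while the union keeps all three types.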
3.6
Properties of Conflict Resolution Functions
Tables 3.1 and 3.2 summarize concrete resolution functions and their basic properties. Column “Type” classifies functions according to the schema from Section 3.5. Many-valued deciding functions may return an arbitrary number of output values while the others produce only one value (or NULL/zero values if there is no acceptable result). Column “Type B&N” contains the classification according to Bleiholder & Naumann [8]. Column “In ODCleanStore” indicates whether the function has an implementation in the Conflict Resolution component of ODCleanStore. “Applicable domain” shows the types of data the resolution function can be applied to. Finally, the last column indicates what inputs a function needs – some need additional metadata (e.g. for computing data quality), some require access to other attributes of the currently resolved entity or even all data from relevant sources (source statistics), some are parametrized.
Other properties such as idempotency or associativity of many of the functions are further analyzed in [7].
Function            Type (a)   Type B&N (b)   In ODCleanStore   Applicable domain (c)   Input (d)
Avg                 MM         M              ✓                 N                       D
Median              MM         M              ✓                 N                       D
CommonBeginning     MM         M              ×                 S                       D
CommonEnding        MM         M              ×                 S                       D
MostGeneral         MM         M              ×                 T                       DM
MostSpecific        MM         M              ×                 T                       DM
TokenUnion          MM         M              ×                 S                       D
TokenIntersection   MM         M              ×                 S                       D
Concat              MS         M              ✓                 *                       D
Sum                 MS         M              ✓                 N                       D
Constant            MS         M              ×                 *                       -
Count               MS         M              ×                 *                       D
Variance            MS         M              ×                 N                       D
StdDev              MS         M              ×                 N                       D
Lookup              MS         M              ×                 *                       DS
(a) Mediating Moderating (MM), Mediating Scattering (MS)
(b) Conflict avoiding (CA), Conflict Ignoring (CI), Mediating (M), Deciding (D)
(c) String (S), Numeric (N), Date/Time (D), Taxonomical (T), Any (*)
(d) Conflicting data (D), metadata (M), source statistics (S), other attributes (A), parameters (P)
Table 3.1: Mediating conflict resolution functions and their properties
Function              Type (a)   Type B&N (b)   In ODCleanStore   Applicable domain (c)   Input (d)
Any                   DS         D              ✓                 *                       D
First, Last           DS         D              ×                 *                       D
Random                DS         D              ×                 *                       D
Best                  DS         D              ✓                 *                       DM
Certain               DS         CA             ✓                 *                       D
BestSource            DS         CA             ✓                 *                       DM
MaxSourceMetadata     DS         CA             ✓                 *                       DM
MinSourceMetadata     DS         CA             ✓                 *                       DM
Latest                DS         D              ✓                 *                       DM
ChooseSource          DS         D              ✓                 *                       DP
Vote                  DS         D              ✓                 *                       D
WeightedVote          DS         D              ✓                 *                       DM
Longest, Shortest     DS         D              ✓                 ST                      D
Max, Min              DS         D              ✓                 SN                      D
ChooseCorresponding   DS         D              ×                 *                       DAP
ChooseDepending       DS         D              ×                 *                       DAP
MostComplete          DS         D              ×                 *                       DS
MostDistinguishing    DS         D              ×                 *                       DS
MostActive            DS         D              ×                 *                       DM
GlobalVote            DS         D              ×                 *                       DS
All                   DM         CI             ✓                 *                       D
TopN                  DM         D              ✓                 *                       DM
Filter                DM         D              ✓                 *                       D
Threshold             DM         D              ✓                 *                       DM
(a) Deciding Single-Valued (DS), Deciding Many-Valued (DM)
(b) Conflict avoiding (CA), Conflict Ignoring (CI), Mediating (M), Deciding (D)
(c) String (S), Numeric (N), Date/Time (D), Taxonomical (T), Any (*)
(d) Conflicting data (D), metadata (M), source statistics (S), other attributes (A), parameters (P)
Table 3.2: Deciding conflict resolution functions and their properties
3.7
Specifics of Linked Data for Data Fusion
The majority of the literature on data integration aims at relational database systems or systems providing a uniform access to heterogeneous sources using a relational database-like data model. The issue of Linked Data integration has its specific features and challenges, however. Some of them were already discussed when talking about particular data integration steps. This section attempts to summarize the specifics relevant for data fusion.
3.7.1
Resource Descriptions
We consider a resource description composed of triples having the same subject as an equivalent of database records for Linked Data. This approach is sufficient in most scenarios and enables straightforward application of traditional database data fusion techniques.
We could also include statements having the resource as their object as part of the description, however – e.g. symmetric properties1 establish a relationship in both directions.
Blank nodes present yet another problem. They cannot be identified across datasets and therefore they wouldn’t belong to any resource description. They are also often used for modelling of structured attributes. This problem may be addressed by the introduction of the concise bounded description [47] which includes statements reachable transitively via blank nodes into the description.
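The idea of following blank nodes can be sketched in Python as shown below. This is a simplification of the concise bounded description: it only follows blank nodes in the object position and ignores reification. Representing triples as tuples of strings and recognizing blank nodes by the "_:" prefix are assumptions of this sketch.

```python
from typing import List, Tuple

# A triple is modelled as (subject, predicate, object); an assumption of this sketch.
Triple = Tuple[str, str, str]


def is_blank(node: str) -> bool:
    """Blank nodes are assumed to be written with the '_:' prefix."""
    return node.startswith("_:")


def bounded_description(triples: List[Triple], resource: str) -> List[Triple]:
    """Collect triples about a resource, following blank nodes transitively."""
    description = []
    pending = [resource]
    visited = set()
    while pending:
        subject = pending.pop()
        if subject in visited:
            continue  # guards against cycles between blank nodes
        visited.add(subject)
        for triple in triples:
            s, _, o = triple
            if s == subject:
                description.append(triple)
                if is_blank(o):
                    pending.append(o)
    return description


# Hypothetical data with a structured address modelled by a blank node.
data = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:address", "_:b1"),
    ("_:b1", "ex:city", "Prague"),
    ("ex:bob", "ex:name", "Bob"),
    ("_:b2", "ex:city", "Brno"),
]
```

The description of ex:alice then contains her name, the link to the blank node and the city attached to that blank node, but not the unrelated blank node _:b2.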
3.7.2
Query Execution
Implementations of the fusion step in relational database integration systems build on join- and union-based operations [8]. Complex algorithms and non-trivial queries need to be devised in order to convert data from different sources to a common schema and correctly join them using only the resources of SQL. Some systems extend the query power of SQL at the expense of portability.
RDF simplifies this aspect of data fusion – the task boils down to simply selecting the relevant triples. The developer has the rich expressive power of SPARQL at hand. Some systems can even do basic inferencing during query execution [16]. This feature can be used to join resources based on owl:sameAs transparently under the hood, for example.
The downside is performance. Special RDF databases often do not match database engines continuously perfected for decades. The lower performance is especially apparent when entire sources are being integrated in a batch (as opposed to ad-hoc queries) where join/union algorithms are more efficient.
3.7.3
Data Schemata
A schema in relational databases defines tables and explicitly lists attributes each record in the table must have (or have NULL in place of the attribute value). A schema also defines additional integrity constraints.
The equivalent of schema for RDF is an ontology (Section 1.2.3). While an RDF ontology can also define restrictions on data, they are often not applied strictly [21], not checked when storing the data and can only be detected by an ontology reasoner. In addition, properties not described in any ontology appear in data, values for defined properties are missing and even the class a resource belongs to often cannot be determined.
The loose application of schemata makes automatic generation of rules (e.g. conflict resolution settings) more difficult. A preliminary survey of most commonly used ontologies was conducted as part of analysis for this thesis. The results suggest that the ontologies do not contain enough information to suggest conflict resolution settings reliably. Another implication is that satisfaction of integrity constraints cannot be used to detect low quality data reliably. On the other hand, the RDF approach is more flexible for ad-hoc queries based on settings entered by the user.
Another difference between Linked Data and traditional databases is in schema mapping. The mapping between properties does not need to be a simple one-to-one owl:sameAs link; the relationships can be richer. Enterprise taxonomies based on open standards (e.g. SKOS2) can be employed, and much more. This is both a challenge and an opportunity for further research.
3.7.4
NULL values
The special value NULL has no equivalent in RDF other than a value simply not being there. This difference is not significant as far as the common resolution functions are concerned. The only case when it may be relevant is when the various interpretations NULL can have need to be distinguished. If one source has a value for a reviewer of a publication and another source does not, for example, it may be interpreted either as an uncertainty or a contradiction (the book wasn’t reviewed). The additional information needed to distinguish these cases may be expressed by RDF collection constructs,³ blank nodes or contained in the related ontology (e.g. property cardinalities).
2www.w3.org/2004/02/skos/ 3
4. Linked Data Fusion and Conflict Resolution Algorithm
Integration is a well-explored field in relational databases. While Linked Data represented in RDF are convenient for integration by design, tools and algorithms implementing integration subtasks still need to be devised. We contribute to this effort by introducing the general Linked Data fusion and conflict resolution algorithm presented in this chapter.
Let us recapitulate the main challenges of Linked Data fusion that must be addressed by a data fusion algorithm:
1. Different URIs can be used to represent the same real-world entities.
2. Different vocabularies can be used to describe data, i.e. different predicate URIs can be used for equivalent or overlapping properties.
3. Data conflicts emerge when quads sharing the same subject and predicate have inconsistent values in place of the object.
The context that the algorithm can leverage in its task consists of the actual data being integrated, metadata (which we suppose to be modelled on the Named Graph level) and mappings between resource URIs and property URIs from the used vocabularies. The input also includes conflict resolution settings and the algorithm can use available conflict resolution functions.
First, we introduce the necessary terminology. Then we describe how the introduced algorithm works, and finally we analyze its time and memory complexity.
4.1
Formalism
4.1.1
RDF Data Model
Definition 4.1 (RDF nodes). Let U, B and L be sets of all URI references, blank nodes and RDF literals, respectively. The sets U, B and L are pairwise disjoint. An RDF node is an element of their union N = U ∪ B ∪ L.
Definition 4.2 (RDF triple). An RDF triple is a statement expressing that a resource has a property with a certain value. Formally, the set of all triples in the RDF model is
Triples = (U ∪ B) × U × (U ∪ B ∪ L).
We denote the elements of each triple as subject (representing the resource of interest), predicate (the property) and object (the value of the property), respectively.
Definition 4.3 (RDF graph, Named Graph). A subset G of Triples can be represented as a directed labeled graph and we refer to it as an RDF graph. Subjects and objects are vertices and a directed edge labeled p exists for each triple (s, p, o) ∈ G.
A Named Graph is a pair (G, n) where G ⊂ Triples is an RDF graph and n ∈ U [10]. We say that graph G is named n. Any set of Named Graphs can be thought of as a partial function N : U → P(Triples) (P denotes the power set). An additional restriction on Named Graphs is that blank nodes cannot be shared between them; more formally, blank nodes used in triples from N(g1) are distinct from those used in triples from N(g2) for N(g1) ≠ N(g2).
Definition 4.4 (Quad). Let quad denote a quadruple (s, p, o, g) such that there is a triple (s, p, o) in RDF graph G named g. Elements s, p, o, g are referred to as the quad’s subject, predicate, object, and graph name, respectively. The space of all quads is denoted Quads.
Quads = (U ∪ B) × U × (U ∪ B ∪ L) × U
Definition 4.5. Let G be an RDF graph. We define functions subjects(G), predicates(G) and objects(G) by the following formulas:
subjects(G) = {s | (s, p, o)∈G}
predicates(G) = {p| (s, p, o)∈G}
objects(G) = {o | (s, p, o)∈G}
We define these functions on a set of quads Q ⊂ Quads analogously, and in addition we define
graphs(Q) = {g | (s, p, o, g) ∈ Q}.
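Definitions 4.4 and 4.5 translate almost directly into code. The following Python sketch is only an illustration: it assumes that RDF nodes are represented as plain strings, and the `Quad` type and the `ex:` identifiers are hypothetical names introduced for the example.

```python
from typing import Iterable, NamedTuple, Set

class Quad(NamedTuple):
    """A quad (s, p, o, g); nodes are simplified to strings in this sketch."""
    s: str  # subject
    p: str  # predicate
    o: str  # object
    g: str  # graph name

def subjects(quads: Iterable[Quad]) -> Set[str]:
    return {q.s for q in quads}

def predicates(quads: Iterable[Quad]) -> Set[str]:
    return {q.p for q in quads}

def objects(quads: Iterable[Quad]) -> Set[str]:
    return {q.o for q in quads}

def graphs(quads: Iterable[Quad]) -> Set[str]:
    return {q.g for q in quads}

# Assumed example data: two graphs disagree on the year of one book.
Q = {
    Quad("ex:book", "ex:year", "2010", "ex:g1"),
    Quad("ex:book", "ex:year", "2011", "ex:g2"),
    Quad("ex:book", "ex:title", "Linked Data", "ex:g1"),
}
```

Because each function returns a set, duplicates collapse automatically, mirroring the set-builder notation of the definitions.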
4.1.2
Conflict Resolution Terminology
The result of the conflict resolution algorithm implemented in ODCleanStore contains not only the RDF triples with conflicts resolved according to the given conflict resolution strategy but also:
• Information about the Named Graphs each triple was selected from or derived from.
• A quality value for each triple calculated based on conflicting values and input metadata; quality is expressed as a value from an ordered space C. We use C = [0; 1] in the context of ODCleanStore (see Section 5.1).
In order to convey this information in the result, we introduce resolved quads as the output of the conflict resolution algorithm.
Definition 4.6 (Resolved quad). A resolved quad is a triple (q, S, c) from the space of all result quads, denoted ResolvedQuads and defined as
ResolvedQuads = Quads × P(U) × C,
where q is the result quad, S is the set of names of the Named Graphs q was selected or derived from, and c is the quality.
The Conflict Resolution component in ODCleanStore deals with conflicting property values, i.e. possible conflicts in place of objects of quads. Therefore quads which are in conflict must share the same subject and predicate. This gives motivation for the following definition.
Definition 4.7 (Object conflict cluster). An object conflict cluster in a set of quads Q ⊆ Quads is a maximal subset CC ⊆ Q such that CC ≠ ∅ and
∀(s1, p1, o1, g1) ∈ CC ∀(s2, p2, o2, g2) ∈ CC : (s1 = s2 ∧ p1 = p2).
In other words, all quads in a conflict cluster share the same subject and predicate. We will denote the conflict cluster corresponding to subject s and predicate p as CC_{s,p}.
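Since a conflict cluster is a maximal set of quads sharing one subject and predicate, the clusters of Q are exactly the groups obtained by grouping on the pair (s, p). A minimal Python sketch, assuming quads are plain 4-tuples of strings (the function name and the `ex:` data are hypothetical):

```python
from collections import defaultdict

def conflict_clusters(quads):
    """Partition quads into object conflict clusters keyed by (subject, predicate)."""
    clusters = defaultdict(set)
    for s, p, o, g in quads:
        clusters[(s, p)].add((s, p, o, g))
    return dict(clusters)

# Assumed example data.
Q = {
    ("ex:book", "ex:year", "2010", "ex:g1"),
    ("ex:book", "ex:year", "2011", "ex:g2"),
    ("ex:book", "ex:title", "Linked Data", "ex:g1"),
}
clusters = conflict_clusters(Q)
```

Here `clusters[("ex:book", "ex:year")]` plays the role of CC_{s,p} for s = ex:book and p = ex:year. Every cluster is non-empty by construction, and the clusters are pairwise disjoint and together cover Q.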
Conflict resolution functions in the context of relational databases typically operate on attribute values from the attribute’s domain or on a set of tuples representing a database record. In the Linked Data world, the former approach would lead to loss of provenance information. Whole quads must therefore be given to a conflict resolution function and not only the conflicting (object) values.
Conflict resolution functions also need additional metadata and context information in order to have enough expressive power to implement all desired resolution strategies listed in Section 3.3. These metadata and context data can also be modeled in RDF as quads.
Definition 4.8 (Conflict resolution function). Conflict resolution function f is a function
f : P(Quads) × P(Quads) → P(ResolvedQuads).
Object conflict resolution function f0 is a partial function
f0 : P(Quads) × P(Quads) → P(ResolvedQuads)
such that the following holds for f0(CC, M):
1. If ∃s, p : subjects(CC) = {s} ∧ predicates(CC) = {p} (i.e. CC is an object conflict cluster), then subjects(f0(CC, M)) = {s} and predicates(f0(CC, M)) = {p} (i.e.
2. If CC ≠ ∅ and CC is not an object conflict cluster in Quads, then f0(CC, M) is undefined.
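To make the signature of Definition 4.8 concrete, the following Python sketch shows one possible object conflict resolution function. It is an illustration under stated assumptions, not the quality computation used in ODCleanStore: quads are 4-tuples of strings, the metadata M carries a per-graph score under the hypothetical predicate `ex:score`, the value backed by the highest-scored graph wins, and the output is placed in a hypothetical graph `ex:result`.

```python
SCORE = "ex:score"          # hypothetical metadata predicate: (graph, ex:score, value, g)
RESULT_GRAPH = "ex:result"  # hypothetical graph name for the resolved output

def resolve_best(cc, metadata):
    """Sketch of f0(CC, M): keep the object value backed by the best-scored graph."""
    cc = set(cc)
    if not cc:
        return set()
    if len({(s, p) for s, p, _, _ in cc}) != 1:
        # Partial function: undefined for non-empty input that is not a cluster.
        raise ValueError("input is not an object conflict cluster")

    # Graph scores read from the metadata quads; unknown graphs score 0.0.
    score = {s: float(o) for s, p, o, _ in metadata if p == SCORE}

    # For each candidate object value, collect the graphs asserting it.
    sources = {}
    for _, _, o, g in cc:
        sources.setdefault(o, set()).add(g)

    def quality(o):
        return max(score.get(g, 0.0) for g in sources[o])

    # Iterating in sorted order makes ties deterministic.
    best = max(sorted(sources), key=quality)
    s, p = next(iter(cc))[:2]
    # A resolved quad (q, S, c): result quad, source graph names, quality.
    return {((s, p, best, RESULT_GRAPH), frozenset(sources[best]), quality(best))}

# Assumed example data.
CC = {
    ("ex:book", "ex:year", "2010", "ex:g1"),
    ("ex:book", "ex:year", "2011", "ex:g2"),
    ("ex:book", "ex:year", "2011", "ex:g3"),
}
M = {
    ("ex:g1", SCORE, "0.9", "ex:meta"),
    ("ex:g2", SCORE, "0.4", "ex:meta"),
    ("ex:g3", SCORE, "0.5", "ex:meta"),
}
resolved = resolve_best(CC, M)
```

The result keeps the subject and predicate of the input cluster, as the first condition requires, and the raised exception models the undefined case of the second condition.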
The first argument CC of an (object) conflict resolution function represents the (possibly) conflicting quads to be resolve