Developing a Semantic Framework for Healthcare Information Interoperability

(1)

DEVELOPING A SEMANTIC FRAMEWORK FOR HEALTHCARE INFORMATION INTEROPERABILITY

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the

degree of Doctor of Philosophy

by Mehmet Aydar December 2015

(2)

Dissertation written by Mehmet Aydar

B.E., Bahcesehir University, 2005 MTEC, Kent State University, 2008

Ph.D., Kent State University, 2015

Approved by

, Chair, Doctoral Dissertation Committee Austin Melton

, Members, Doctoral Dissertation Committee Angela Guercio

Ye Zhao

Alan Brandyberry Helen Piontkivska

Accepted by

, Chair, Department of Computer Science Javed I. Khan

, Dean, College of Arts and Sciences James L. Blank

(3)

TABLE OF CONTENTS

LIST OF FIGURES . . . vii

LIST OF TABLES . . . viii

Acknowledgements . . . ix

Dedication . . . x

1 Introduction . . . 2

1.1 Interoperability in Healthcare . . . 2

1.1.1 Standards vs. Translations . . . 4

1.1.2 RDF as a Common Healthcare Information Representation . . . . 4

1.1.3 Summary Graph . . . 5

1.1.4 Graph Node Similarity Metric . . . 6

1.1.5 Instance Matching . . . 7

1.2 Dissertation Overview and Contributions . . . 8

2 Background and Related Work . . . 11

2.1 Data, Information and Knowledge . . . 11

2.2 Semantic Web . . . 12

2.2.1 Resource Description Framework (RDF) . . . 13

(4)

2.2.2 RDF Schema . . . 15

2.2.3 Ontology . . . 15

Web Ontology Language (OWL) . . . 16

2.2.4 Linked Data . . . 16

2.3 Data Mapping . . . 18

2.4 Data Translation . . . 23

2.5 Entity Similarity . . . 24

2.5.1 Jaccard Similarity Measure . . . 24

2.5.2 RoleSim Similarity Measure . . . 25

2.6 Graph Summarization . . . 28

2.6.1 Definition (Summary Graph) . . . 28

2.6.2 Summary Graph Generation Methods . . . 28

2.7 Healthcare Data Interoperability Initiatives . . . 30

3 Translation of Instance Data using RDF and Structured Mapping Definitions . 32 3.1 Translation of Instance Data . . . 34

3.2 Metadata Repository . . . 34

3.3 Mapping Schema . . . 36

3.4 Mapping Interface . . . 39

3.5 Translation Engine . . . 40

3.6 Conclusion . . . 41

4 Pairwise Nodes Similarity in RDF Graphs . . . 44

(5)

4.2 Input Graph Data . . . 46

4.3 Methods . . . 46

4.4 Computation of Entity Similarity . . . 47

4.4.1 IRI Nodes Similarity . . . 47

4.4.2 Literal Node Similarity . . . 51

4.4.3 Descriptor Importance and Automatic Detection of Noise Labels . 52 Detection of Noise Descriptors . . . 54

4.5 The Algorithm . . . 54

4.6 Related Work . . . 57

5 Building Summary Graphs of RDF Data in the Semantic Web . . . 61

5.1 Contributions and Outline . . . 62

5.2 Methods . . . 63

5.3 The Algorithm . . . 64

5.4 Generation of the Classes and Properties . . . 64

5.5 Class Relation Stability Metric . . . 65

5.6 Data Dictionary Extraction . . . 67

5.7 Evaluation . . . 68

5.8 Related Work . . . 75

6 RinsMatch: a suggestion-based instance matching system in RDF Graphs . . . 77

(6)

6.2 Instance Matching in Healthcare Data . . . 79

6.3 RDF Entity Similarity for Instance Matching . . . 80

6.4 User Interaction . . . 80 6.5 Evaluation . . . 83 6.6 Related Work . . . 83 6.7 Conclusion . . . 86 7 Conclusion . . . 88 BIBLIOGRAPHY . . . 92

(7)

LIST OF FIGURES

1 An example of RDF triples illustrating resources from diverse domains [87] 14

2 Example Ontologies and their Mappings [37] . . . 22

3 RoleSim(u,v) based on similarity of their neighbors [54] . . . 27

4 Translation of instance data . . . 35

5 An example of data dictionary elements stored in the metadata server . . 36

6 A mapping example showing how selected data dictionary elements are mapped . . . 37

7 A figure showing the RDF representation of mapping between sample data dictionary elements . . . 38

8 A screen shot of the mapping interface. . . 40

9 Example Graph for Graph Equivalence matching . . . 49

10 Sample graph demonstrating two nodes . . . 58

11 A figure consisting of different types of entities and elements belonging to the class types. . . 74

12 An excerpt from the generated summary graph. . . 75

(8)

LIST OF TABLES

1 A Sample of RDF Triples from Each Dataset . . . 70 2 An Excerpt from Dynamically Assigned Weights of Descriptors . . . 71 3 Evaluation Results . . . 73

(9)

Acknowledgements

I wish to thank my committee members who were more than generous with their expertise and precious time. A special thanks to Dr. Austin Melton, my advisor for his countless hours of reflecting, reading, encouraging, and most of all patience throughout the entire process. Thank you Dr. Ye Zhao, Dr. Angela Guercio, Dr. Alan Brandyberry, and Dr. Dr. Helen Piontkivska for agreeing to serve on my committee.

I would like to acknowledge and thank the Kent State University Semantic Web Research Group (SWRG) members, the Cleveland Clinic Cardiology Application Group members, and the Semantic Web Health Care and Life Sciences (HCLS) Interest group members for their helpful feedback.

Finally I would like to thank Dr. Ruoming Jin for his help and guidance during the study and Dr. Victor E. Lee for sharing RoleSim similarity measure.

(10)

I dedicate my dissertation work to the hero of my life, my dear and loving father, Abdullah Aydar for his encouragement, inspiration and endless support. I will always

(11)

(12)

CHAPTER 1

Introduction

It is estimated that up to 7000 unique dialects are spoken around the world [14]. Despite the fact that individuals speak distinctive dialects, they discover routes to com-municate by either agreeing on the same dialect, or by utilizing a translator. This aligns with the definition of “interoperability.” Broadly speaking, interoperability is a measure of the degree to which diverse systems, organizations, and/or individuals are able to work together to achieve a common goal [46]. The term was initially defined for infor-mation technology or systems engineering services to allow for inforinfor-mation exchange [2]. The concept of interoperability plays a pivotal role in our daily lives. Individuals, or-ganizations and governments have a tendency to agree on standards to make products more understandable and usable by diverse communities. For instance, the World Wide Web(www) [101] is a large interoperable network of documents whose standards are de-fined by the World Wide Web Consortium(W3C) [88].

1.1 Interoperability in Healthcare

Interoperability in healthcare is stated as the ability of health information systems to work together within and across organizational boundaries in order to advance the effective delivery of healthcare for individuals and communities [47]. In 2009 The Health Information Technology for Economic and Clinical Health Act(HITECH) authorized in-centive payments to clinicians and hospitals when they implement electronic health record

(13)

(EHR) [45] systems to achieve Meaningful Use of healthcare data. [19]. Meaningful Use is specified by the following 3 components [41]: (1) use of certified EHR, (2) use of certified EHR technology for electronic exchange of health information and (3) use of certified EHR technology to submit clinical quality measures. The components 2 and 3 require interoperability between EHRs and that is where the Meaningful Use is delayed because different EHRs do not talk to each other. According to a review [64] published in 2014, only 10% of ambulatory practices and 30% of hospitals are participating in operational health information exchange efforts. The lack of interoperability has a significant eco-nomic impact on the healthcare industry. The West Health Institute(WHI) testified to U.S. Congress, and released an estimate that system and device interoperability could save over $30 billion a year in the U.S. healthcare system alone [44].

Achieving interoperability in healthcare is a significantly challenging process. The current healthcare information technology (IT) environment breeds incredibly complex data ecosystems. Clinical data exist in multiple layers and forms ranging from hetero-geneous structured, semi-structured, and unstructured data captured in enterprise-wide electronic medical record and billing systems, through a wide variety of departmental, study, and lab-based registries and databases, to the massive quantities of device-specific data generated by an ever increasing number of instruments used to evaluate, monitor, and treat patients. In many cases pertinent patient records are collected in multiple sys-tems, often supplied by competing manufacturers with diverse data formats. This causes inefficiencies in data interoperability, as data retrieval from disparate data sources is time consuming and costly. Also different formats of data create barriers in exchanging health information, because physicians are not able to interpret the exchanged data if they are

(14)

not well acquainted with the source data. The same holds true for the computer software systems which may focus on different algorithmic purposes including but not limited to mapping, translation, validation and search.

1.1.1 Standards vs. Translations

Interoperability can be accomplished by adopting standards and/or translations be-tween different standards. There exist many different standards in healthcare, each is developed to fulfill different purposes. In [22], the authors express the barriers in imple-menting universal standards: complexity of the standards, diverse use cases and evolu-tion of the standards. These hurdles cause the standardizaevolu-tion efforts to take significant amounts of time, and the systems relying on the standards also need to be updated along with the standardization. Consequently a proficient route for translation between differ-ent data models is more practical for information interoperability. Given the fact that it is unrealistic to have one universal standard that fits all the use cases, is it conceivable to implement standards in information translation? We think the answer is “yes.” Our work is based on the technical side of healthcare information translation. We propose trans-lation oriented methods, metrics, algorithms and frameworks that assist in healthcare information interoperability.

1.1.2 RDF as a Common Healthcare Information Representation

Is it possible to depict the healthcare data entities having diverse formats in a single flexible form? In this sense, RDF (Resource Description Framework) [57] gained notoriety in recent years, because it is schemaless and is a commonly acknowledged framework by the Semantic Web [17] community. It is also proposed as a universal healthcare

(15)

information representation in [23]. RDF makes statements about resources in the form of (subject, predicate, object) expressions, which are known as triples. In this work we utilize RDF representation of healthcare data. Both instance data and data models which are used to organize the instance data are represented in RDF. Healthcare data in different systems are linked through the relations in RDF triples, ending up with an extensive connected graph.

1.1.3 Summary Graph

An intermediate data structure, that describes the metadata or the summary of the dataset, is often needed for faster computational processing of the graph representation of healthcare data. The summary data structure comprises of the fundamental type classes and their associations, with each type class representing the type of a collection of data entities. Such a data structure is also represented in RDF that forms a summary graph. The flexibility of RDF allows the metadata of the related data entities to be linked. In this case, the summary graph is linked to the RDF representation of the dataset and utilized for data interoperability. For instance, if a physician wants to retrieve the lab records of a cohort [94], the entities that link to the lab type class are filtered, making use of the type class associations. A summary graph also reduces the size of the input dataset for various computer algorithms. For instance, an instance matching algorithm utilizes the higher level type classes, and compares the instances only belonging to the same type class. Also the associations of the type classes are leveraged for intelligent exploration of graph representation of the healthcare dataset, such that a filtering algorithm retrieves the data elements having a specific relation to a specific type class. In this work, we

(16)

present an effective algorithm for generating a summary graph from an RDF dataset and we test the algorithm on a healthcare dataset.

The relations between different type classes are represented as predicates in the sum-mary graph. However since in this work, the sumsum-mary graph is auto-generated, it is error prone. The degree of confidence of type class relations in the summary graph needs to be known by the physician or by the interpreter algorithm for more precise results. For instance, in a cardiac surgery related data set, if an “Event” type class has relations to “Vascular Procedure” and “Cardiac Valve Replacement” type classes, it makes more sense for the interpreter to know how likely an instance in the “Event” class has a relation to the “Vascular Procedure” or to the “Cardiac Valve Replacement” class. Therefore in this work, the relations between the type classes are generated along with a degree of confidence, which we call the stability measure. The stability measure is computed based on the actual number of the data entities in the relations between the type classes in the dataset.

1.1.4 Graph Node Similarity Metric

Summarizing RDF data requires grouping similar entities in the same type class. But how should we characterize the similarity of healthcare data elements? When healthcare data elements are represented in RDF, the similarity can be examined as graph node similarity. Therefore, the challenge is to define the factors that affect the similarity of a pair of RDF nodes. There have been numerous studies investigating the similarity measures in various fields. Some studies concentrated on graph neighborhood-based similarity measures as in [53, 59], while others focused on comparing instances based on

(17)

properties and roles [49, 90]. An RDF entity is described through its relations to other entities and the collection of literals associated with it in the lexical form. The descriptors are represented as predicates and literal nodes in RDF. In this work we introduce a graph node pairs similarity metric, that utilizes the graph nodes’ descriptors, namely, common predicates of the RDF nodes and common string literal words referenced by the nodes, augmented with the node neighborhood similarity.

Each descriptor of a data entity may have a different impact in similarity calculation. For instance in a dataset related to heart surgery, the word “cardiac” is less important than the word “valve,” since “valve” describes more about the nature of the surgery while “cardiac” is a more general term in the dataset. On the other hand the importance of “valve” is not the same for all the entities, such that it is more important for a surgi-cal procedure related data entity while it is less important in a cardiac valve anatomy pathology related data entity. As a consequence; each descriptor has a different impor-tance weight, and the weight differs per each data entity. So how should we identify appropriate metrics for generating the descriptor weights in the RDF representation of healthcare data entities? In this work we propose an importance weighting metric which has a similar notion to the term frequency-inverse document frequency(tf-idf) [61, 80], based on the frequency of the descriptor in the referenced data entity and in the corpus.

1.1.5 Instance Matching

Healthcare data exist in different systems with diverse data models. RDF represen-tation of healthcare data contains both the instance data and the concepts belonging to the data models of the instance data. Since different systems may have substantial

(18)

amounts of redundancy, the RDF graph may have duplicate nodes both in data instances level and data model concepts level. Therefore, an efficient instance matching technique utilized for RDF graphs covers both de-duplication and data concepts matching. Once detected, the redundant nodes can be merged, reducing the size of the input graph. Also matching the data concepts belonging to separate data models assists in the information translation process, because translation requires data models level mapping. The in-stance matching technique needs to be automated due to the size of the healthcare data. Nevertheless automated instance matching techniques are error prone. Consequently, a technique that permits user interactions yields more precise results. In this work we present a semi-automatic instance matching technique which is based on a graph node similarity measure with the inputs from the users and utilized for the RDF representation of healthcare data.

1.2 Dissertation Overview and Contributions

This dissertation makes the following contributions. First, we propose a system for translation of healthcare instance data, based on structured mapping definitions and using RDF as a common information representation to achieve semantic interoperability between different data models. The main components of the framework are a metadata repository that stores the data dictionary of the data sources, a user-friendly interface to view and manage the metadata and the mappings between different data elements, and a translation engine that translates the source RDF data with regards to the target data model. The framework allows the domain experts to define mappings between different data elements and utilizes the mappings to translate the data models from one format

(19)

to another.

Second, we introduce a pairwise graph nodes similarity metric that utilizes the Jaccard index with the common relations of the healthcare data entities and common string literal words referenced by the healthcare data entities and augmented with data entity neighbors similarity. The precision of the similarity metric is enhanced by incorporating the auto-generated importance weights of the healthcare data entity relations and the string literals referenced by the healthcare data entities in the RDF representation of the dataset.

Third, we provide a novel machine learning algorithm that automatically generates a summary graph structure from the RDF representation of healthcare dataset, and we propose that the summary graph can further be utilized for interoperability purposes. In the summary graph, similar entities are classified as the same type class, and the relations between the type classes are created with regards to their members’ relations. Furthermore, a degree of confidence measure is added for the relations between the type classes in the summary graph.

Fourth, we present a suggestion based semi-automatic instance matching system and we test it on the RDF representation of a healthcare dataset. The system utilizes the graph node similarity metric introduced in our second contribution, and it presents sim-ilar node pairs to the user for possible instance matching. Based on the user feedback, it merges the matched nodes and suggests more matching pairs depending on the common relations and neighbors of the already matched nodes. The process runs in iterations until the similarity algorithm infers no new matching candidate pairs and there is no more feedback from the user. We propose that the instance matching technique could be

(20)

leveraged for mapping between separate healthcare data models, as the data model con-cepts are also incorporated in RDF. Mappings contribute in the information translation process, because translation requires data models level mapping.

The dissertation is organized as follows. Chapter 2 gives background information on Semantic Web and existing work. Chapter 3 introduces a framework for translation of healthcare instance data, based on structural mapping definitions and RDF representa-tion of healthcare data. Chapter 4 introduces a pairwise graph node similarity metric and its algorithm. Chapter 5 introduces an automatic summary graph structure generation approach utilizing the similarity metric introduced in chapter 4. Chapter 6 introduces RinsMatch, a suggestion-based instance matching system in RDF Graphs. We offer concluding remarks in Chapter 7.

(21)

CHAPTER 2

Background and Related Work

This chapter provides background information regarding Semantic Web technologies and prior work relevant to our work in healthcare data interoperability.

2.1 Data, Information and Knowledge

In [48] the author described data, information and knowledge as follows: data is the facts and a description of the World, information is captured data and knowledge, and knowledge is a personal map/model of the World. In this sense, when the facts are captured in structured format they become information. The interconnection between the data and information elements constitutes the knowledge.

In a clinical domain, data are the facts about the patients, diseases, results of received care, etc. For instance, patient demographic information such as name, race and date of birth is data. They are true whether this is recorded somewhere or not. Information is the way that the data is captured in some format in order that people or machines can understand and access it at different times. Therefore any sort of database is information as long as it can be accessible and understood. For instance, when a nurse captures the demographics data of a patient on a hospital visit they become information as they are accessible by other people or machines. In this case the birthday record of a patient is information but the actual birth date of the patient is data. The patient record (information) could be transferred to other data sources or it could be modified. But

(22)

modifying the patient information does not change the data (actual birth day of the patient).

The knowledge is based on the data, information and other knowledge. A processing engine bases its decisions from the knowledge that it has. The human processing engine, the brain, stores the knowledge inter-connectedly and makes decisions based on its own knowledge.

2.2 Semantic Web

“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling com-puters and people to work in cooperation. [17]”

– Tim Berners-Lee

Following the definition proposed by Tim Berners-Lee, Semantic Web [99] is not a new Web but an extension of the current Web. The fundamental thought is to describe the data on the Web with machine interpretable metadata in order that machines can process the information available on the Web. Semantic Web technologies are heavily influenced by Artificial Intelligence, by languages that developed from efforts to represent documents in a universal way, and by the network protocols that the World Wide Web is built upon.

Each object on Semantic Web is identified with an identifier called URI (Uniform Resource Identifier) [16]. A uniform resource locator (URL) [89] is a subclass of URI that identifies a Web page. URI and URL support only ASCII encoding. On the other hand, IRI (International Resource Identifier) [31] is a generalization of the URI that fully

(23)

supports international characters. In this regard, every absolute URI and URL is also an IRI, however not every IRI is a URI.

Semantic Web does not only contain Web pages; in fact, the objects on Semantic Web can be any type of resource such as Web pages, images, cities, people, events, etc. The combination of identifiers that can refer to resources or concepts on distributed machines, a universal way to retrieve data over a network, and the unambiguous interpretation of content are the primary reasons why Semantic Web technologies address a variety of problems associated with traditional databases: integration, translation, querying, and interpretation.

2.2.1 Resource Description Framework (RDF)

Central to Semantic Web technologies is a general purpose language, the Resource Description Framework (RDF) [57], for representing information in Semantic Web in a way that the meaning (or semantics) is unambiguous to a machine or software process. RDF describes resources through statements in the form of (subject, predicate, object) expressions which are known as triples. Every triple describes an entity or a relationship between two entities [31].

The subject in an RDF triple is either an Internationalized Resource Identifier(IRI) or a blank node, the predicate is an IRI, and the object is either an IRI, a literal or a blank node. The subjects and objects of triples in the RDF graph form RDF nodes. Each RDF node that corresponds to a unique RDF entity is represented with a unique IRI, and the values such as strings, numbers and dates are represented by literal nodes. A literal node can consist of two or three elements: a lexical form, a datatype IRI and a

(24)

Figure 1: An example of RDF triples illustrating resources from diverse domains [87] language tag. The language tag in a literal node is included if and only if the datatype IRI of the literal node corresponds to rdf:langString [24]. A predicate in an RDF triple is also called a property of the RDF subject node. A predicate can be one of two types: a DatatypeProperty where the subject of the triple is an IRI and the object of the triple is a literal, or an ObjectProperty where both the subject and object of the triple are IRIs. Each object of a subject node is called a neighbor of that subject node.

Figure1 (copied from [87]) shows an example of RDF triples, illustrating the resources from diverse domains and the way they are linked through named relations.

The flexibility of RDF data model enables data interchange, merging datasets with different data models and evolution of the data models over time. The linked structure

(25)

of RDF triples allows resources belonging to different datasets to be linked with a named relation. The collection of triples with the linking structures constitutes a directed and labeled graph, where the edges represent the named relations that have semantically unambiguous meanings that can be utilized by the software algorithms for intelligent exploration of the graph-structured data.

Definition (RDF Graph Data)

Let a directed labeled graphG(V, L, E) be the RDF data such thatV is the finite set of vertices; L denotes the finite set of edge labels; and E is the set of edges of the form

e = l(u, v), with u, v ∈ V, l ∈ L and e ∈ E . Note that an edge l(u, v) represents the RDF triple (u, l, v).

2.2.2 RDF Schema

RDF is schemaless which means that the RDF data is not tied to a particular schema. However, schemas for RDF can exist and the RDF resources represented in RDF can have a structure given by the schema. As a matter of fact, the W3C specified RDF Schema (RDFS) [24] standard to provide a data-modelling vocabulary for RDF resources is an extension of the basic RDF vocabulary. RDF schema provides features to describe RDF classes and properties; define hierarchies; constraint the values for the domain and range of properties, etc.

2.2.3 Ontology

The meaning of ontology slightly differs in different contexts. In the context of Com-puter Science, a frequently cited explanation of an ontology is given by Gruber [42]: “an

(26)

ontology defines a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members).” RDF Schema(RDFS) is a formal language to define an ontology to be used for a specific do-main.

A formal definition of ontology is given in [55] as a pair O = (S, A), where S is the ontology signature which describes the vocabulary (the named terms) of an ontology, and A is a set of ontological axioms (statements that say what is true in the domain) specifying the interpretation of the vocabulary in some domain of discourse.

Web Ontology Language (OWL)

The Web Ontology Language (OWL) is another formal language for authoring on-tologies. It extends RDFS with more complex features, enabling applications to process the content of the information and facilitating greater data interoperability. OWL is compatible with RDFS. In addition, it provides additional features like property char-acteristics, property restrictions, ontology mapping(such as equivalence between classes, properties and individuals) and complex class definitions [27].

2.2.4 Linked Data

As the Web has grown exponentially, it has not only become the universal platform of interconnected documents but also the linked information space for data. Data published in the Web have been traditionally in the forms of HTML tables, CSV or XML. Due to the lack of expressivity of entity relations in HTML documents and the lack of explicit semantics and relational structures in the published data, the actual semantics of data

(27)

in the Web is currently being underutilized for cognitive computation purposes.

With continuing development of Semantic Web technologies, there has in recent years been significant progress including explicit semantics with data in the Web as an ever-growing number of organizations adopt Semantic Web technologies. Publishing data on the Web in a standard model by using a standard methodology and interlinking the data available on the Web using Semantic Web technologies provides a Web of data that applications can access to and utilize through semantic queries.

The collection of inter-linked datasets on the Web is also referred to as Linked Data. In [18] the authors defined Linked Data as the: “data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and can in turn be linked to from external data sets.” Historically, the term “Linked Data” was used in 2006 by Tim Berners-Lee, the inventor of the World Wide Web, who outlined the rules for publishing data on the Web to make the Web of data a reality, as follows [15]:

• Use URIs as names for things.

• Use HTTP URIs so that people can look up those names.

• When someone looks up a URI, provide useful information, using the standards. • Include links to other URIs, so that they can discover more things.

Consequently, in short amount of time many organizations started to publish open data using the data standards of the World Wide Web Consortium. As of 2015, the structured data available in the Semantic Web have been rapidly increasing with the

(28)

contribution of Linked Open Data along with several other Semantic Web projects which incorporate many datasets associated with different domains such as publications, life sciences, media, social Web, geography, etc. For instance, DBpedia [9] and FreeBase [21] made available the content of Wikipedia in RDF and incorporated links to other open datasets, e.g., to GeoNames [3], a geographical database which covers over eight million place names in RDF model. Concurrently, the number of Web pages published in RDFa (Resource Description Framework in Attributes) [5], a W3C recommendation that adds semantics to web page content by embedding a set of a attribute-level extensions, and microformats [56], conventions for embedding semantic markup to identify specific resources in web pages, has increased. The applications that consumes a dataset on the Linked Open Data can exploit the extra knowledge available through links to the other datasets. While publishing open data initiatives have made available thousands of general purpose datasets and many domain-specific data sources in the Resource Description Framework (RDF) data model, there is still a long way to go as Linked Open Data is still a small portion of the information available on the Web.

2.3 Data Mapping

Data mapping [96] is the process of creating the linkages and relations between data elements of distinct data models. Data mapping creates connections between different data elements. The connectivity of the data elements increases data interoperability and data reusability while reducing the redundancy. Also it is a key step for data integration tasks such as data transformation, data lineage analysis, discovery of new data details within connected data sources, consolidation of multiple data sources into a single data

(29)

source, etc. Furthermore data mapping is needed for standardization of the data. For instance, healthcare institutions often need to map their local data to an accepted medical standard such as ICD-9 [40] or SNOMED CT [71] to be able to share their data with other medical facilities. There are different techniques for data mapping.

In manual data mapping the mapping between different data elements is being done manually by the users. The users can document the mapping in a variety of ways us-ing hard programmus-ing codes, XSLT [28] transforms or by usus-ing graphical mappus-ing tools that automatically generate data transformation programs. There is a variety of graph-ical tools for data mappings. Some tools work by drawing a line between different data elements while others use some kind of string similarity algorithms to automatically con-nect the source and destination. These kinds of graphical tools are common for ETL (Extract-Transform-Load) [1] based applications. They create automatic data transfor-mation programs in different languages such as SQL, XSLT, Java or C++ depending on the application environment to support transforming data from one form to another.

A data-driven mapping technique uses heuristics and statistics to automatically dis-cover mappings and transformations between different data sets by evaluating the ac-tual data values in the data sources [96]. The discovered transformation logic could be arithmetic, concatenations, case statements, etc. The main advantage of a data-driven technique over the manual data mapping is that it reduces the human labor. But on the other hand the results may not be as accurate as manual data mapping.

Synonyms-driven mapping relies on a metadata registry to auto map the data elements based on data element names and their synonyms. The metadata registry contains the synonyms of the data element names. For example auto connecting a data element

(30)

named “gender” to another data element named “sex” is possible if both these data element names are listed as synonyms in the metadata registry. A synonyms-driven mapping can generate matches between data schema elements but will not auto generate the data transformation logic.

The Semantic Web technology is not in its mature phase yet. However its ambitious vision aims to take the current state of the data mapping techniques to a more auto-matic process based on Semantic Web languages such as Resource Description Framework (RDF), the Web Ontology Language (OWL) [8], etc. The Semantic Web technology is based on connected data elements. Ontology which is a part of the Semantic Web tech-nology consists of mapped data concepts which are mapped to actual data elements in RDF and give meaning to the data that can be leveraged for the mapping process. The idea of “Linked Data” of Semantic Web builds on a standardized metadata publishing mechanism which will accelerate the development of more automatic data mapping tools. In essence, the task of linking is connecting semantically related concepts from multi-ple data sources. Over the years, this problem has been studied by the research commu-nity as the task of ontology matching, ontology alignment or more commonly ontology mapping. Although these concepts are frequently used interchangeably, they have slight differences. Mapping takes place after matching the related concepts first, and the align-ment process refers to identifying semantically equivalent classes from the data. For our purposes, we will be using the term ontology mapping as finding the matching concepts and aligning them with other semantically relevant concepts and classes of concepts. The matching concepts do not necessarily have to be identical or equivalent. They might also be hierarchically related with subset or superset relations.

(31)

Ontology mapping is an essential task in achieving semantic interoperability on the Semantic Web. As the amount of publicly available heterogenic data on the Semantic Web grows continually, the applications require using more and more ontologies every day. Considering the scale and distribution of the Semantic Web resources, it is no longer possible to maintain one ontology that covers the whole spectrum. Hence, many ontologies are being created and used by many different applications.

In this context, ontology mapping can also be thought of as a semantic interoper-ability layer which allows applications to query information across multiple resources. Over the years, the ontology mapping problem has been researched by the many di-verse communities for various reasons including data translation, ontology merging, data integration, query answering, etc.

In [83], the problem of ontology mapping is described as: “given two ontologiesO1 and

O2, mapping one ontology with another means that for each concept (node) in ontology

O1, try to find a corresponding concept (node), which has same or similar semantics,

in ontology O2 and vice versa.” In other words, the process of ontology mapping is

identifying the semantic relationships between two related concepts belonging to different ontologies.

A mathematical definition of ontology mapping is given in [55] as morphism (a structure-preserving map from one mathematical structure to another [97]) of ontologi-cal signatures. Given two ontologies O1 = (S1, A1) and O2 = (S2, A2), a total ontology

mapping from O1 to O2 is a morphism function f : S1 → S2 of ontological signatures,

such that, A2 |= f(A1), i.e., all interpretations that satisfy O2 axioms also satisfy O1’s

(32)

Figure 2: Example Ontologies and their Mappings [37]

exists. Therefore, a weaker notion of ontology mapping is introduced as a partial ontol-ogy mapping from O1 to O2 if there exists a sub-ontology O10 = (S

0 1, A 0 1), i.e., S 0 1 ∈ S1

and A0₁ ∈A1, such that there is a total mapping from O10 to O2.

Figure 2 (taken from [37]) illustrates a mapping between two given ontologies de-scribing the domain of car retailing. A reasonable mapping between the two ontologies is given by the dashed lines in the figure.

In the literature, there has been a lot of research done in the subject of ontology map-ping in diverse communities from different perspectives. As a result, the terminology used for defining the problem varied such as matching, alignment, merging, articulation, fu-sion, integration, morphism, etc. Consequently, many tools and methods have developed in the field. Some surveys on the current state-of-the-art have tried to categorize the approaches and describe the terminologies [39] [55].

(33)

In some other studies, the authors have addressed the problem in relation to ontol-ogy design and integration i.e., [66, 69, 70]. However, they focus on heuristics to match ontology elements and do not utilize the information in the data instance level. These approaches do not use machine learning. Still others advocated implementing machine learning techniques [20, 91, 102] in their frameworks for learning to combine matching measures and selection strategies.

Moving forward a few years, we see a large number of related works under the topic of Ontology Alignment [33, 35, 51, 52, 74, 75]. Some of them are focusing on mapping using an upper ontology [33, 51, 52]. An upper ontology is a top-level ontology [100] that describes very general concepts that are the same across all knowledge domains. Other related works are concentrating on aligning ontologies based on set theory approaches that utilize existing links on instances [74, 75].

Parundekar, et al. [74, 75] proposed an approach for aligning ontologies in Linked Open Data by discovering concept coverings and linkage using set theory ideas. They consider the union of disjoint concepts for alignment process. Although our approach has some commonalities, we don’t rely on existence of owl:sameAs links between instances as it is the case in their approach. Another difference is that we use machine learning algorithms for finding concepts and their hierarchy.

2.4 Data Translation

Data translation [93] is converting a set of data values from the data format of a source data system into the data format of a destination data system. Data mapping from the source data system to the destination data system is a key step for data translation. The

(34)

programming code of the actual data translation program is based on the data mapping between the source and destination data systems.

2.5 Entity Similarity

2.5.1 Jaccard Similarity Measure

The Jaccard similarity coefficient also known as the Jaccard index is a statistical measure that is used for comparing similarity and diversity of sample sets, and it is defined as the size of the intersection divided by the size of the union of the sample sets [50]. According to the Jaccard index ifA andB are subject nodes in a dataset, then their similarity,J(A, B), is formulated as:

J(A, B) = |A∩B|

|A∪B| (1)

(35)

example as follows. Given three sets A, B, C such that:

A={apple, banana, orange},

B ={banana, orange, peach},

C ={peach, f ig, cherry}

The Jaccard indexJ(set1, set2) between two sets,set1 andset2, is calculated as follows:

J(A, B) = 2/4 = 0.5,

J(A, C) = 0/6 = 0,

J(B, C) = 1/5 = 0.2

Therefore,A and B are the most similar sets whileA and C are the least similar sets.

2.5.2 RoleSim Similarity Measure

RoleSim is a similarity metric for graph nodes based on the maximal matching of neighborhood pairs and a simple iterative computational method.

Given a graph G = (V, E), RoleSim measures the similarity of each node pair in V based on their neighborhood similarities with the intuition that “two nodes are similar if they relate to similar objects” [54] :

(36)

RoleSim(u, v)= (1−β) (2) ×maxM∈M m(u,v) P (x,y)∈M RoleSim(x, y) Nu+Nv− |M| +β

RoleSim(u, v) denotes the similarity of the nodesu, v ∈V.The definition of RoleSim is recursive; i.e., RoleSim(x, y) is calculated the same way as RoleSim(u, v). N(u) and

N(v) denote their respective sets of neighborhoods andNuandNv denote their respective

degrees, i.e., Nu =|N(u)| and Nv =|N(v)|.

We define M to be a set of ordered pairs (x, y) where x ∈ N(u) and y ∈ N(v) such that there does not exist (x0, y0) ∈ M, s.t. x = x0 or y = y0, and furthermore, M is maximal in that no more ordered pairs may be added to M and keep the constraint above. M m(u, v) is the set of all such M’s. M m(u, v) is a set of sets.

As shown in the formula, the RoleSim(u, v) depends on their neighborhood similar-ities. Instead of using the mean of neighborhood similarities, the maximal matching is used. Figure 3 (taken from [54]) shows how the maximal matching works to calculate the similarity measure between the nodes u and v. The nodes (x1, x2, x3) are the neighbors of uand the nodes (y1, y2, y3, y4) are the neighbors of v. The cell values are the similar-ities of each (x, y) pair. M is a maximal subset ofN(u)×N(v) such that no element of

N(u) appears more than once as a first coordinate and no element ofN(v) appears more than once as a second coordinate of an ordered pair in M. Thus, |M| = min(Nu, Nv). The maximal matching ensures that the total value of selected cells has the maximum

(37)

Figure 3: RoleSim(u,v) based on similarity of their neighbors [54] possible value. The maximal matching value, M(u, v), is calculated as

M(u, v) = maxM∈M m(u,v) P (x,y)∈M RoleSim(x, y) M ax(Nu, Nv) (3)

The parameter β is a decay (damping) factor [98], 0 < β <1. It is similar to the one used in [72], so that the influence of neighbors decreases with distance which dampens the recursive effect. In [72], the value of β was originally set to 0.15. We provide more details about how the value of β is chosen in section 5.7.

The input graph in RoleSim is not a labeled graph. This makes it hard to directly apply the RoleSim measure in an RDF data set since the RDF triples contain labeled edges. In our work we utilize the RoleSim measure in conjunction with the Jaccard index to calculate the similarity of two graph nodes. TheRoleSim(u, v) measure is used when there may be multiple neighbors for a common property reached from the node pairs u

(38)

2.6 Graph Summarization

A summary graph of an RDF data graph is the RDF graph such that each node in the summary graph is a subset of the original graph nodes of the same type and each property in the summary graph is a subset of the original graph nodes properties.

2.6.1 Definition (Summary Graph)

LetG= (V, L, E) be an RDF graph as defined in section 2.2.1. We define a summary graph asG0 = (V0, L0, E0), such thatV0 contains equivalence classes of V. E0 and L0 are, respectively, the sets of edges and labels in the graphG0. As we will see,L0 ⊂L, and the elements ofE0 are defined by the elements in the equivalence classes in V0 and the edges inE.

2.6.2 Summary Graph Generation Methods

There exist several methods to obtain a summary graph: (1) A summary graph can be obtained from the dataset ontology, if the dataset is already tied to an ontology. (2) Another way to obtain the summary graph is to locate the type triples in the dataset and to organize the type classes and relations accordingly, if the data set is published using a standard vocabulary [36]. (3) Or the summary graph can be built automatically by infer-ring the class types based on the similarity of the RDF nodes. Our graph summarization approach is based on method 3 as explained in details in chapter 5.

The summary graph data structure has been used for various purposes in different studies. For instance in [86] the authors utilized the summary graph for keyword search on RDF graphs. However they assumed that there is already a summary graph data

(39)

structure in place, which may not always be correct in real world datasets.

In [36] the authors used an index structure similar to a summary graph (called type system in their study) to generate benchmark data from a real world dataset with com-plete control over both the structuredness and the size of the generated data. In their study, the level of structuredness of a datasetDwith respect to a typeT in the summary graph is determined by how well the instance data inDconform to typeT. In the study, instead of assuming that there is already a summary graph in place they propose an algo-rithm to auto generate the summary graph with the assumption that a standardized and known vocabulary are used in the input RDF data set. Their summary graph generation algorithm uses the following steps:

• Scan the input RDF dataset and look for the triples whose property is rdf:type [24], extract t as the object of the triple.

• Then extract the subjects that has a typeT.

• P(T) = union of all the properties of the subjects above, where the type T refers to a class in the summary graph and P(T) is the union of all the properties used inT.

In this sense, their summary graph generation algorithm aligns with the method 2, as opposed to our work in which we infer the summary graph classes based on the similarity of the RDF nodes.

(40)

2.7 Healthcare Data Interoperability Initiatives

Healthcare information interoperability requires the necessary tools, an efficient roadmap and a strong commitment from relevant communities. Many studies have been suggested for this purpose. In this section we review the selected studies regarding healthcare data interoperability.

The Yosemite Project [23] suggests an ambitious roadmap for healthcare information interoperability on a global scale, by using RDF as a universal information representation and creating a hub for crowd-sourcing translation rules. They provide general transla-tion rules, metadata including source and target data models, translatransla-tion rules language (such as SPARQL/SPIN N3, Java, Python, etc.), dependencies, test data with valida-tion, license, maintainer and usage metrics. They envision a translation hub that allows downloading the translation rules as needed. Comparing to the Yosemite Project, our work is more specifically focused on leveraging entity similarities for auto-classification, generating mapping between entities and exploiting the structured mapping definitions in healthcare information translation.

3M Health Information Systems [4] created a public use version of the 3M Health-care Data Dictionary (HDD), a medical terminology server that has been in operational use at the US Department of Defense (DoD) and other health care organizations. The project is named as HDD Access [58]. It is an application that contains a controlled medical vocabulary stored in a relational database with application programming inter-face (API) runtime services that other applications can call. In this sense their work is a standardization effort, as opposed to our work in which we concentrate on translation.

(41)

In [65], the authors presented FURTHeR, a semantic framework intended to run federated queries across disparate sources. The infrastructure incorporates a terminology server, a metadata repository and translation services for translating the semantic queries to run on disparate data sources. The infrastructure enables semantic interoperability via semantic search. Our work does not utilize a standardized terminology server. In our work, the information interoperability is achieved by translation of instance data with structured mapping definitions.

SemanticDB is a semi-structured patient data repository system developed within the Cleveland Clinic that leverages several advances in XML processing and Semantic Web technology standards for the purpose of outcomes research [32]. The SemanticDB patient data registry stores the current and historical data for a patient population which is specifically selected for clinical research in cardiology domain. SemanticDB contains a data production pipeline that translates the pertinent patient data to generate customized views of the repository content for statistical analysis and reporting. The translation in SemanticDB relies on the hard coded mapping definitions implemented in the transfor-mation code, and in contrast to our work their mappings are not stored in structured format. However, in our work we utilize the SemanticDB patient data registry as a test dataset for testing various algorithms presented in this work.

(42)

CHAPTER 3

Translation of Instance Data using RDF and Structured Mapping Definitions In this chapter, we present a translation oriented framework for healthcare informa-tion interoperability. The main ideas and secinforma-tions explained in this chapter are part of a published research study [12]. The idea behind the system is achieving semantic interop-erability between diverse healthcare data models, by empowering translation of instance data based on structured mapping definitions, and using RDF as a common information representation. We have developed a framework that allows the domain experts to define linkages between different data elements and utilizes the mappings to translate the data models from one format to another.

The clinical data in a healthcare IT environment can be incredibly complex. The data is generated in different systems through diverse work flows. Therefore, often the related patient data is stored in multiple systems, supplied by different manufacturers with diverse data formats. This not only causes inefficiencies for the physicians as data retrieval from disparate data sources is time consuming, wasteful and costly, but also creates barriers in exchanging health information. Users may not be able to understand the exchanged data if they are not competent with the data format. The same holds true for the machines to process the data for various algorithmic purposes including but not limited to mapping, translation, validation, search, etc.

Interoperability can be achieved by adopting standards and translations between dif-ferent standards. There exist many difdif-ferent standards in healthcare, each having its

(43)

own uses and advantages. It is unrealistic to have one universal standard that fits all the use cases since there are significant barriers in implementing universal standards, as explained in [22]: complexity of the standards, diverse use cases and evolution of the standards cause the standardization efforts to take significant amounts of time, and the systems relying on the standards also need to be updated along with the standardization. Therefore an efficient way of translation between different data models is needed. This work focuses on the technical side of healthcare information translation.

Since each data model can be in a different format, adopting a common informa-tion representainforma-tion model is unavoidable for the translainforma-tion. RDF (Resource Descripinforma-tion Framework) [57] gained popularity in recent years, since it is schemaless and is a com-monly accepted information representation model by the Semantic Web community. RDF is also proposed as a universal healthcare information representation in [23]. In our work, we represent healthcare data in RDF model. Instance data along with its data model is represented in RDF. And different datasets are linked through the predicates in RDF triples which form an extensive and connected graph.

Translation requires mappings between the sets of data elements in the different data models. In our work, we present a mapping schema to enable capturing the mappings be-tween different data elements in structured and machine readable format. Capturing the mappings between the data models in a machine readable format enables auto-generation of the translation code up to some degree.

(44)

3.1 Translation of Instance Data

Figure 4 illustrates the main components of the translation framework and their interactions as proposed in this work. The red labels help show how the translation components work together. The metadata for each data source (1) are captured and saved in the metadata repository or database (2). Further, the data managers define the linkages (mappings) between the data elements using a user-friendly interface, and these linkages are also saved in the metadata repository (3). In preparation for a translation, the defined mappings from the source data model to the target data model are retrieved (4). The source instance data are represented or lifted to RDF model (5). Then the translation engine uses the source RDF data, the target data model metadata, and the mappings from the source data model to the target data model to translate the source instance data into target data (6). The translation engine produces the results, and these results are conveyed to the RDF representation of the target data model (7). More details about the components are given in the following sections.

3.2 Metadata Repository

For each of the data sources, a data dictionary is captured and stored in a metadata repository [95]. In our work a data dictionary incorporates data model details for classes, concepts, concept values, and their associations to each other. For instance, the metadata of a dataset in relational schema [34] includes tables, table fields, pre-defined possible field values for the fields, as well as the relations between different metadata elements. And when represented in the metadata repository, the tables are represented as classes, table fields are represented as concepts and pre-defined possible field values are represented as

(45)

Data source 1 Data source 2 Data source 3 Data source N

Data Models Capture

Data

Models Metadata & Mappings

Database

Target Data Source X Read Mappings for: 1 2 3 4 5

UI for Metadata & Mappings

Translation Engine Convert to RDF 6 7 Output

(46)

Metadata

DataModel1 (For Admissions Database) Class 1 (Demographics) Concepts of Class1: -MRN -FullName - Gender - hx_smoke Gender Concept Values: - ‘Male’ - ‘Female’ Class 2 (Encounters) DataModel2 (For Surgery Database) Class 3 (Demographics) Concepts of Class3: - MedicalRecordNumber - FirstName - LastName - Sex - CigSmoker Sex Concept Values: - ‘M’ - ‘F’ Class 4 (Events) Concepts of Class4: - EventID - EventDate - EventType EventType Concept Values: - ‘Cath’ - ‘Aortic’ - ‘Mitral’

Figure 5: An example of data dictionary elements stored in the metadata server concept values.

Figure 5 shows the structure of the main data dictionary elements stored in the metadata server with sample data models, classes, concepts and value sets.

3.3 Mapping Schema

The metadata repository also includes a mapping schema. The goal is to enable storing the mappings between different data models in a structured format. The mapping schema is designed so that a target field can be mapped to from multiple candidate source fields that belong to one or more candidate source data models with field-level

(47)

Figure 6: A mapping example showing how selected data dictionary elements are mapped priorities, and a value set translation from source values to target values, and a pre-defined derivation (transformation) logic from source fields to the target field.

Figure 6 illustrates how the mappings are characterized between concepts. The figure explicitly shows the mappings from the concepts of “Class 3” to the concepts of “Class 1” of figure 5. The concepts and classes shown in the figure belong to the example data models defined in figure 5. As it seen in figure 6, the concept of “MRN” is mapped to from the concept of “MedicalRecordNumber” and the concept of “hx smoke” is mapped to from the concept of “CigSmoker.” The mapping to the concept of “FullName” involves a derivation logic, as it is mapped to from 2 different concepts (“FirstName,” “LastName”) and the translation engine needs to know the derivation logic. In this case the derivation logic is a string concatenation method. The “Gender” concept is mapped to from the

(48)

Figure 7: A figure showing the RDF representation of mapping between sample data dictionary elements

“Sex” concept. However the pre-defined values for “Gender” concept and “Sex” concept are different. Therefore the mappings additionally need to be defined on the concept values level. As seen in the figure, the value “M” mapped to “Male” while the value “F” is mapped to “Female.”

In this work, data dictionary and mappings between the data dictionary elements are represented in RDF model. By doing this we are able create linkages between isolated data models. Figure 7 depicts a visualization of an excerpt from the RDF graph, that

(49)

includes the mappings between the “hx smoke” and “CigSmoker” concepts along with the concepts’ data models details. The data dictionary elements shown in the figure belong to the example data models defined in figure 5.

3.4 Mapping Interface

In a clinical domain, data concept knowledge and concept relations are known by the data domain experts. But the data is stored in physical data sources which are managed by the IT specialists. The IT specialists often do not know the meaning of the data stored in the data sources. They refer to data on the underlying data source schema level while the data domain experts or the end users refer to the data on data concept level. For example a doctor performs a procedure for a patient to cure a specific disease. The disease itself is a data concept. The doctor has the knowledge of the disease concept. But he/she does not necessarily have the knowledge of the data schema field where the disease concept is being stored. IT specialists have the knowledge of the metadata details of the data schema field but they often do not know the actual meaning of the data concept being stored in the data schema field. This causes a communication gap between the domain experts and the IT specialists in defining the mappings between different data concepts. Traditionally the mappings between different data concepts are defined by the data managers. However the data managers rely on the IT specialists to locate the data concepts in the physical data sources and create the linkages between the concepts. Therefore in this work, a user-friendly interface is presented to be utilized by the data managers. The interface allows viewing the data dictionaries, managing the concept definitions and defining mappings between different data elements. It uses

(50)

Figure 8: A screen shot of the mapping interface.

the metadata repository as the backend database. The mappings defined by the data managers are stored back in the metadata repository within the mapping schema which is a machine readable format. In addition, the interface also has a test module that lets the data managers write and execute tests to validate their mappings.

Figure 8 shows a screen shot of the interface. In the figure the left side contains form elements to select the target and source concepts while the right side contains the form elements for identifying the value set conversion.

3.5 Translation Engine

The flexibility of RDF allows different schemata and models to be converted, repre-sented and connected via RDF. In this work, the translation engine performs the trans-lation on the RDF representation of the source data elements. The conversion from a source data format to the RDF representation differs for each data model, i.e., a different RDF data conversion routine is executed for each data format. The conversion from

(51)

source instance data to the RDF representation happens on the fly during the transla-tion process; only the specific source instance data, which is required by the mapping, is converted.

The usage of a metadata repository provides valuable benefits in data interoperability. Capturing the mapping knowledge in a structured format provides connectivity between disparate data elements, and it enables an automatic generation of the information trans-lation code to some extent. This leads to more transparency in the transtrans-lation process; e.g., in case of any data error it helps to track the source of the problem by checking the documented data mappings.

Data value translation enables the translation of a concept term in one data code system into an equivalent concept term in another data code system. The associations in the documented value mappings are used to discover the concept term conversion. Also, in many cases a single target data field is mapped to from more than one source data field. In this case the translation engine needs to know the priority of the source fields and/or a derivation logic to translate from multiple source fields. The priorities and the derivation logic can be defined by the data managers using the mapping interface. The translation engine is then able to generate the translation code by utilizing the defined priorities and the derivation logic.

3.6 Conclusion

In this chapter we presented a translation oriented framework to achieve interoper-ability between different healthcare data models. It alleviates the lack of interoperinteroper-ability solutions based on translation of RDF representation of instance data and structured

(52)

mapping definitions between the data concepts belonging to different data models. We utilized a metadata repository to store the data dictionaries, which incorporate data model details for classes, concepts, concept values, and their associations to each other. We also introduced a user-friendly interface that permits the data managers to view and manage the data dictionary elements and the mappings between them.

In this chapter we assumed that there is already a defined data dictionary in place for each data source. This assumption is correct for some data sources, e.g., the data dictio-nary of a relational database can be captured automatically using tools like SchemaSpy [30]. However, capturing the data dictionary from a dataset that is already represented in RDF model can be a challenging task due to the flexibility of the RDF data model not imposing constraints on the schema. In chapter 4 we introduce a proficient pairwise graph node similarity metric. Following with chapter 5 we present a summary graph generation method based on the pairwise graph node similarities introduced in chapter 4 and we clarify how the summary graph can be exploited to capture the data dictionary elements from RDF representation of the instance data.

Translation of instance data requires data models level mapping, which means the data concepts belonging to the source and target data sources need to be mapped ac-cordingly. The interface introduced in this chapter lets the data manager characterize the mappings between healthcare data concepts belonging to different data models. However manually defining the mappings requires significant amounts of time due to the size of the healthcare data. In chapter 6, we introduce a semi-automatic instance matching technique in RDF graphs that leverages the pairwise graph node similarities introduced in chapter 4 and we describe how the instance matching technique introduced in chapter

(53)

6 can be utilized for semi-automatically generating the mappings between different data models.

(54)

CHAPTER 4

Pairwise Nodes Similarity in RDF Graphs

In this chapter, we introduce an efficient algorithm to compute the similarity between different healthcare data elements. The main ideas and sections explained in this chapter are part of the published research studies [10, 13]. Since we represent the healthcare data in an RDF model, the computation of entity similarity is studied as a pairwise RDF graph node similarity problem. Computation of entity similarity is essential for numerous interoperability tasks. For instance, automatic classification algorithms group similar entities together which requires an entity similarity metric. Also, instance matching and concept mapping techniques utilize entity similarities to link the same or similar real-world objects.

The algorithm is based on an efficient graph node similarity metric which is called RNodeSim (RDF Node Similarity). An RDF entity is described through a set of predi-cates, the collection of literal neighboring nodes that it references and the neighbor nodes with which it interacts. We utilize the common descriptors within the Jaccard measure context when calculating the similarity of an RDF node pair, along with the similarities of their neighbors.

Each descriptor of an RDF node may have a different impact in the similarity calcula-tion, in other words each descriptor has a different importance weight. Therefore in this work, we present an importance weighting metric for the descriptors of the RDF nodes and we enhance the similarity metric by incorporating the auto-generated importance

(55)

weights of the descriptors. The algorithm runs in multiple iterations as the similarity of the neighbor nodes is not known in advance. The iterations continue until the pair similarity results converge.

4.1 Contributions and Outline

In this chapter, we investigate the problem of calculating entity similarity in RDF graphs. Our approach is to use graph locality and neighborhood similarity within the Jaccard measure context. We treat the properties of the entities as the dimensions when measuring the similarity of two subject entities. Our main contributions:

• We provide a novel RDF node pair similarity metric that utilizes common descrip-tors.

• We auto-generate the importance weight of each descriptor, and we apply the weights in the pairwise similarity calculation.

• We add a string similarity measure when two graph nodes have literal type neigh-bors.

• We enhance the RDF node pairs’ similarity by taking into account the similarity of neighbors in the RDF graph.

The rest of the chapter is organized as follows. We define the input data graph. Then, we discuss the proposed methods. The subsequent section presents our approach for computation of entity similarity and the algorithm in detail. Finally, we review the related work, and follow with our conclusion.

(56)

4.2 Input Graph Data

The input data is an RDF graph G(V, L, E) as defined in section 2.2.1.

4.3 Methods

The intuition in our graph node pair similarity computation is that the nodes that have similar edges to similar neighbors tend themselves to be similar nodes. To achieve our goal, we define the similarity of two nodes in the original graph using a function similar to the Jaccard index [50] in conjunction with RoleSim Similarity [54].

The Jaccard similarity measure can be applied to graph data to find the similarity of the entities by considering the graph nodes as set names or labels and the predicates between the nodes as set elements. However, it is clear that the Jaccard similarity only considers the properties of the entities when determining the similarity of the entities; it does not take into account the similarity of neighboring entities. By a neighbor of an entity, we mean another entity which is “connected” by a property. In our work, given an input RDF graphG(V, L, E), entityv is connected to entityu, i.e., v is a neighbor of

u, if there is a label l ∈ L such that l(u, v) ∈ E. In other words, objects are neighbors of their subjects for a collection of RDF triples. Thus, an entity and its neighbor are connected and related by a property.

Intuitively, when we are trying to decide how similar two entities are, it is, of course, important to compare the properties of the two entities. However, the properties by themselves tell only part of the story. More can be told by comparing the neighboring entities which are related by similar properties. Thus, by utilizing the neighborhood similarity, we may produce better results than by using only the Jaccard similarity.

(57)

For this reason, we use the RoleSim similarity measure to determine the similarity of the interacting references based on the way they interact with their neighborhood nodes. In other words, two nodes or entities tend to have the same role when they interact with equivalent sets of neighbors.

4.4 Computation of Entity Similarity

Our premise is that similar nodes tend to have similar properties and interact with similar neighbor nodes, which are either IRIs or literals.

4.4.1 IRI Nodes Similarity

In this work, the subjects of the triples determine the sets in which we are interested. We think of the subjects as being the names or labels for the sets. More exactly, the subject of each triple determines a node of a new graph which is itself a set, and the elements of each set are the predicates of the triples whose subject is the name or label of the set. The objects of the triples, which may themselves be names or labels of sets, determine the neighbors of the subject sets. Hence, we use the subject nodes in the original graph to designate sets and their properties in the original graph are the elements in these sets. We calculate the Jaccard similarity between two nodes u and v

by noting that |u∩v| is the number of properties that the subject nodes u and v have in common while|u∪v| is the number of properties in the union of the subject nodes u

and v.

While the Jaccard index gives a good initial similarity between two nodes, it can be improved since it does not take the importance of neighborhood similarities into consideration. To improve our similarity calculations, the Jaccard similarity of the two