Semantic Web in data mining and knowledge discovery: a comprehensive survey


Contents lists available at ScienceDirect

Web Semantics: Science, Services and Agents on the World Wide Web

journal homepage: www.elsevier.com/locate/websem

Review article


Petar Ristoski∗, Heiko Paulheim

Data and Web Science Group, University of Mannheim, B6, 26, 68159 Mannheim, Germany

Article info

Article history: Received 28 March 2015; received in revised form 5 November 2015; accepted 1 January 2016; available online 8 January 2016

Keywords: Linked Open Data; Semantic Web; Data mining; Knowledge discovery

Abstract

Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. This survey article gives a comprehensive overview of those approaches in different stages of the knowledge discovery process. As an example, we show how Linked Open Data can be used at various stages for building content-based recommender systems. The survey shows that, while there are numerous interesting research works performed, the full potential of the Semantic Web and Linked Open Data for data mining and KDD is still to be unlocked.

Contents

1. Introduction
2. Scope of this survey
3. The knowledge discovery process
4. Data mining using linked open data
5. Selection
   5.1. Using LOD to interpret relational databases
   5.2. Using LOD to interpret semi-structured data
   5.3. Using LOD to interpret unstructured data
6. Preprocessing
   6.1. Domain-independent approaches
   6.2. Domain-specific approaches
7. Transformation
   7.1. Feature generation
   7.2. Feature selection
   7.3. Other
8. Data mining
   8.1. Domain-independent approaches
   8.2. Domain-specific approaches
9. Interpretation
10. Example use case
   10.1. Linking local data to LOD
   10.2. Combining multiple LOD datasets
   10.3. Building LOD-based recommender system
   10.4. Recommender results interpretation
11. Discussion
12. Conclusion and outlook
Acknowledgment
References

∗ Corresponding author.

E-mail addresses: [email protected] (P. Ristoski), [email protected] (H. Paulheim).

http://dx.doi.org/10.1016/j.websem.2016.01.001

1. Introduction

Data mining is defined as “a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [1], or “the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner” [2]. As such, data mining and knowledge discovery are typically considered knowledge-intensive tasks, in which knowledge plays a crucial role. Knowledge can be (a) in the primary data itself, from where it is discovered using appropriate algorithms and tools, (b) in external data, which has to be included in the problem first (such as background statistics or master file data not yet linked to the primary data), or (c) in the data analyst’s mind only.

The latter two cases are interesting opportunities to enhance the value of the knowledge discovery processes. Consider the following case: a dataset consists of countries in Europe and some economic and social indicators. There are, for sure, some interesting patterns that can be discovered in the data. However, an analyst dealing with such data on a regular basis will know that some of the countries are part of the European Union, while others are not. Thus, she may add an additional variable EU_Member to the dataset, which may lead to new insights (e.g., certain patterns holding for EU member states only).

In that example, knowledge has been added to the data from the analyst’s mind, but it might equally well have been contained in some exterior source of knowledge, such as Linked Open Data.
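The EU_Member example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the indicator values are placeholders rather than real statistics, and the names `countries`, `EU_MEMBERS`, `add_eu_member`, and `mean_by_membership` are hypothetical. The membership set is typed in by hand here, which corresponds to knowledge from the analyst's mind; in an LOD setting it would be obtained from an external source instead.

```python
# Toy dataset: the indicator values are illustrative placeholders, not real statistics.
countries = [
    {"country": "Germany", "indicator": 1.0},
    {"country": "France", "indicator": 2.0},
    {"country": "Norway", "indicator": 3.0},
    {"country": "Switzerland", "indicator": 4.0},
]

# Background knowledge held outside the primary data (assumption: typed in by
# the analyst; it could equally be retrieved from an external knowledge source).
EU_MEMBERS = {"Germany", "France"}


def add_eu_member(rows, eu_members):
    """Return copies of the rows extended with a boolean EU_Member variable."""
    return [{**row, "EU_Member": row["country"] in eu_members} for row in rows]


def mean_by_membership(rows):
    """Average the toy indicator separately for EU members and non-members."""
    groups = {True: [], False: []}
    for row in rows:
        groups[row["EU_Member"]].append(row["indicator"])
    return {flag: sum(vals) / len(vals) for flag, vals in groups.items() if vals}


enriched = add_eu_member(countries, EU_MEMBERS)
print(mean_by_membership(enriched))
```

With the added variable, patterns can be examined separately for members and non-members, which is not possible on the original dataset alone.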

Linked Open Data (LOD) is an open, interlinked collection of datasets in machine-interpretable form, covering multiple domains from life sciences to government data [3,4]. Thus, it should be possible to make use of that vault of knowledge in a given data mining task, at various steps of the knowledge discovery process.

Many approaches have been proposed in the recent past for using LOD in data mining processes, for various purposes, such as the creation of additional variables, as in the example above. With this paper, we provide a structured survey of such approaches. Following the well-known data mining process model proposed by Fayyad et al. [1], we discuss how semantic data is exploited at the different stages of the data mining model. Furthermore, we analyze how different characteristics of Linked Open Data, such as the presence of interlinks between datasets and the usage of ontologies as schemas for the data, are exploited by the different approaches.

The rest of this paper is structured as follows. Section 2 sets the scope of this survey, and puts it in the context of other surveys in similar areas. Section 3 describes the knowledge discovery process according to Fayyad et al. In Section 4, we introduce a general model for data mining using Linked Open Data, followed by a description of approaches using Semantic Web data in the different stages of the knowledge discovery process in Sections 5 through 9. In Section 10, we give an example use case of an LOD-enabled KDD process in the domain of recommender systems. We conclude with a summary of our findings, and identify a number of promising directions for future research.

2. Scope of this survey

In the last decade, a vast number of approaches have been proposed which combine methods from data mining and knowledge discovery with Semantic Web data. The goal of those approaches is to support different data mining tasks, or to improve the Semantic Web itself. All those approaches can be divided into three broader categories:

• Using Semantic Web based approaches, Semantic Web Technologies, and Linked Open Data to support the process of knowledge discovery.

• Using data mining techniques to mine the Semantic Web, also called Semantic Web Mining.

• Using machine learning techniques to create and improve Semantic Web data.

Stumme et al. [5] have provided an initial survey of all three categories, later focusing more on the second category. Dating back to 2006, this survey does not reflect recent research works and trends, such as the advent and growth of Linked Open Data. More recent surveys on the second category, i.e., Semantic Web Mining, have been published by Sridevi et al. [6], Quboa et al. [7], Sivakumar et al. [8], and Dou et al. [9].

Tresp et al. [10] give an overview of the challenges and opportunities for the third category, i.e., machine learning on the Semantic Web, and using machine learning approaches to support the Semantic Web. The work has been extended in [11].

In contrast to those surveys, the first category, i.e., the usage of Semantic Web and Linked Open Data to support and improve data mining and knowledge discovery, has not been the subject of a recent survey. Thus, in this survey, we focus on that area.

The aim of this survey is to give as broad an overview of the field as possible, i.e., to capture as many different research directions as possible. As a consequence, a direct comparison of approaches is not always possible, since they may have been developed with slightly different goals, tailored towards particular use cases and/or datasets, etc. Nevertheless, we try to formulate at least coarse-grained comparisons and recommendations, wherever possible.

3. The knowledge discovery process

In their seminal paper from 1996, Fayyad et al. introduced a process model for knowledge discovery processes. The model comprises five steps, which lead from raw data to actionable knowledge and insights which are of immediate value to the user. The whole process is shown in Fig. 1. It comprises five steps:

1. Selection. The first step is developing an understanding of the application domain, capturing relevant prior knowledge, and identifying the data mining goal from the end user’s perspective. Based on that understanding, the target data used in the knowledge discovery process can be chosen, i.e., selecting proper data samples and a relevant subset of variables.

2. Preprocessing. In this step, the selected data is processed in a way that allows for a subsequent analysis. Typical actions taken in this step include the handling of missing values, the identification (and potentially correction) of noise and errors in the data, the elimination of duplicates, as well as the matching, fusion, and conflict resolution for data taken from different sources.

3. Transformation. The third step produces a projection of the data to a form that data mining algorithms can work on; in most cases, this means turning the data into a propositional form, where each instance is represented by a feature vector. To improve the performance of subsequent data mining algorithms, dimensionality reduction methods can also be applied in this step to reduce the effective number of variables under consideration.

4. Data mining. Once the data is present in a useful format, the initial goal of the process is matched to a particular method, such as classification, regression, or clustering. This step includes deciding which models and parameters might be appropriate (for example, models for categorical data are different than models for numerical data), and matching a particular data mining method with the overall criteria of the KDD process (for example, the end user might be more interested in an interpretable, but less accurate model than a very accurate, but hard to interpret model). Once the data mining method and algorithm are selected, the data mining takes place: searching for patterns of interest in a particular representational form or a set of such representations, such as rule sets or trees.

5. Evaluation and interpretation. In the last step, the patterns and models derived by the data mining algorithm(s) are examined with respect to their validity. Furthermore, the user assesses the usefulness of the found knowledge for the given application. This step can also involve visualization of the extracted patterns and models, or visualization of the data using the extracted models.

Fig. 1. An overview of the steps that compose the KDD process.

The quality of the found patterns depends on the methods being employed in each of these steps, as well as their interdependencies. Thus, the process model foresees the possibility to go back to each previous step and revise decisions taken at that step, as depicted in Fig. 1. This means that the overall process is usually repeated after adjusting the parametrization or even exchanging the methods in any of these steps until the quality of the results is sufficient.
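The five steps and the feedback loop can be illustrated with a deliberately small Python sketch. Everything in it is an assumption made for illustration: the records in `RAW` are made up, the "data mining method" is a one-rule threshold classifier, and the function names are hypothetical. It is not taken from Fayyad et al.; it only mirrors the structure of the process model.

```python
from statistics import mean

# Hypothetical raw records: (feature value or None, class label).
RAW = [(1.0, "low"), (1.2, "low"), (None, "low"),
       (3.9, "high"), (4.1, "high"), (4.1, "high")]


def select(records):
    """1. Selection: choose the target data (here: every record)."""
    return list(records)


def preprocess(records):
    """2. Preprocessing: drop missing values and exact duplicates."""
    seen, clean = set(), []
    for rec in records:
        if rec[0] is not None and rec not in seen:
            seen.add(rec)
            clean.append(rec)
    return clean


def transform(records):
    """3. Transformation: propositional form, one feature vector per instance."""
    return [([x], y) for x, y in records]


def mine(threshold):
    """4. Data mining: a one-rule threshold classifier (deliberately simple)."""
    return lambda vec: "high" if vec[0] > threshold else "low"


def evaluate(model, instances):
    """5. Evaluation: accuracy on the same toy data (no hold-out set here)."""
    return mean(1.0 if model(vec) == label else 0.0 for vec, label in instances)


def run_kdd(raw, thresholds=(5.0, 2.5), target=0.9):
    """Revise the mining parameter and repeat until the quality is sufficient."""
    instances = transform(preprocess(select(raw)))
    for threshold in thresholds:
        accuracy = evaluate(mine(threshold), instances)
        if accuracy >= target:
            break
    return threshold, accuracy


print(run_kdd(RAW))  # (2.5, 1.0)
```

The first parametrization fails the quality target, so the loop goes back to the data mining step with a revised parameter, which is the kind of iteration the process model foresees.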

4. Data mining using linked open data

As a means to express knowledge about a domain in the Semantic Web, ontologies were introduced in the early 1990s as “explicit formal specifications of the concepts and relations among them that can exist in a given domain” [12]. For the area of knowledge discovery and data mining, Nigro et al. [13] divide ontologies used in this area into three categories:

• Domain ontologies: Express background knowledge about the application domain, i.e., the domain of the data at hand on which KDD and data mining are performed.

• Ontologies for data mining process: Define knowledge about the data mining process, its steps and algorithms and their possible parameters.

• Metadata ontologies: Describe meta knowledge about the data at hand, such as provenance information, e.g., the processes used to construct certain datasets.

It has already been shown that ontologies for the data mining process and metadata ontologies can be used in each step of the KDD process. However, we want to put a stronger focus on the usage of Linked Open Data (LOD) in the process of knowledge discovery. LOD is a publicly available, interlinked collection of datasets from various topical domains [3,4].

Fig. 2 gives an overview of the Linked Open Data enabled knowledge discovery pipeline. Given a set of local data (such as a relational database), the first step is to link the data to the corresponding LOD concepts from the chosen LOD dataset (cf. Section 5).¹ Once the local data is linked to a LOD dataset, we can explore the existing links in the dataset pointing to the related entities in other LOD datasets. In the next step, various techniques for data consolidation, preprocessing and cleaning are applied, e.g., schema matching, data fusion, value normalization, treatment of missing values and outliers, etc. (cf. Section 6). Next, some transformations on the collected data need to be performed in order to represent the data in a way that it can be processed with arbitrary data analysis algorithms (cf. Section 7). Since most algorithms demand a propositional form of the input data, this usually includes a transformation of the graph-based LOD data to a canonical propositional form. After the data transformation is done, a suitable data mining algorithm is selected and applied on the data (cf. Section 8). In the final step, the results of the data mining process are presented to the user. Here, to ease the interpretation and evaluation of the results of the data mining process, Semantic Web and LOD can be used as well (cf. Section 9).

Fig. 2. An overview of the steps of the linked open data enabled KDD pipeline.

For the survey presented in the following sections, we have compiled a list of approaches that fulfill the following criteria:

1. They are designed and suitable for improving the KDD process in at least one step.

2. They make use of one or more datasets on the Semantic Web.

Each of the approaches is assessed using a number of criteria:

1. Is the approach domain-independent or tailored to a specific domain?

2. Is the approach tailored towards a specific data mining technique (e.g., rule induction)?

3. Does it use a complex ontology or only a weakly axiomatized one (such as a hierarchy)?

4. Is any reasoning involved?

5. Are links to other datasets (a core ingredient of Linked Open Data) used?

6. Are the semantics of the data (i.e., the ontology) exploited?

Furthermore, we analyze which Semantic Web datasets are used in the papers, to get a grasp of which are the most prominently used ones.

¹ We should note that the data can be linked to the LOD datasets in different stages of the KDD process. For example, in some approaches only the results and the discovered patterns from the data mining process are linked to a given LOD dataset in order to ease their interpretation. For simplicity’s sake we describe the process of linking as the first step, which is also depicted in Fig. 2.

In the following sections, the survey introduces and discusses the individual approaches.² A small box at the end of each section gives a brief summary, a coarse-grained comparison, and some guidelines for data mining practitioners who want to use the approaches in actual projects.

5. Selection

To develop a good understanding of the application domain, and the data mining methods that are appropriate for the given data, a deeper understanding of the data is needed. First, the user needs to understand what is the domain of the data, what knowledge is captured in the data, and what is the possible additional knowledge that could be extracted from the data. Then, the user can identify the data mining goal more easily, and select a sample of the data that would be appropriate for reaching that goal.

However, the step of understanding the data is often not trivial. In many cases, the user needs to have domain specific knowledge in order to successfully understand the data. Furthermore, the data at hand is often represented in a rather complex structure that contains hidden relations.

² We should note that some of the approaches might be applicable in several steps of the LOD-enabled KDD pipeline. However, in almost all cases, there is one step which is particularly in the focus of that work, and we categorize those works under that step.

To overcome this problem, several approaches propose using Semantic Web techniques for better representation and exploration of the data, by exploiting domain specific ontologies and Linked Open Data. This is the first step of the Semantic Web enhanced KDD pipeline, called linking. In this step, a linkage, or mapping, to existing ontologies and LOD datasets is performed on the local data.

Once the linking is done, additional background knowledge for the local data can be automatically extracted. This allows the domain concepts and information about the data to be formally structured, by setting formal types and relations between concepts. With such background knowledge, users can in many cases understand the data domain easily, without the need to employ domain experts.

Furthermore, many tools for visualization and exploration of LOD data exist that would allow an easier and deeper understanding of the data. An overview of tools and approaches for visualization and exploration of LOD is given in the survey by Dadzie et al. [14]. The authors first set out what is expected of tools for visualizing or browsing LOD: (i) the ability to generate an overview of the underlying data, (ii) support for filtering out less important data in order to focus on selected regions of interest (ROI), and (iii) support for visualizing the detail in ROIs. Furthermore, all these tools should allow the user to intuitively navigate through the data, explore entities and relations between them, explore anomalies within the data, perform advanced querying, and extract data for reuse. They divided the analyzed browsers into those offering a text-based presentation, like Disco,³ Sig.ma [15], and Piggy Bank [16], and those with visualization options, like Fenfire [17], IsaViz,⁴ and RelFinder.⁵ The analysis of the approaches shows that most of the text-based browsers provide functionalities to support tech users, while the visualization-based browsers are mostly focused on non-tech users. Even though the authors conclude that there is only a limited number of SW browsers available, we can still make use of them to understand the data better and select the data that fits the data analyst’s needs. The categorization of approaches in the survey by Dadzie et al. has been extended by Peña et al. [18], based on the datatypes that are visualized and the functionality needed by the analysts. The authors list some more recent approaches for advanced LOD visualization and exploration, like CODE [19], LDVizWiz [20], LODVisualization [21], and Payola [22].

The approaches for linking local data to LOD can be divided into three broader categories, based on the initial structural representation of the local data:

5.1. Using LOD to interpret relational databases

Relational databases are considered one of the most popular storage solutions for various kinds of data, and are widely used. The data represented in relational databases is usually backed by a schema, which formally defines the entities and relations between them. In most cases, the schema is specific to each database, which does not allow for automatic data integration from multiple databases. For easier and automatic data integration and extension, a global shared schema definition should be used across databases.

To overcome this problem, many approaches for mapping relational databases to global ontologies and LOD datasets have been proposed. In recent surveys [23–25] the approaches have been categorized in several broader categories, based on three criteria: existence of an ontology, domain of the generated ontology, and application of database reverse engineering. Additionally, [25] provides a list of the existing tools and frameworks for mapping relational databases to LOD, of which the most popular and most used is the D2RQ tool [26]. D2RQ is a declarative language to describe mappings between application-specific relational database schemata and RDF-S/OWL ontologies. Using D2RQ, Semantic Web applications can query a non-RDF database using RDQL, publish the content of a non-RDF database on the Semantic Web using the RDF Net API,⁶ do RDFS and OWL inferencing over the content of a non-RDF database using the Jena ontology API,⁷ and access information in a non-RDF database using the Jena model API.⁸ D2RQ is implemented as a Jena graph, the basic information representation object within the Jena framework. A D2RQ graph wraps one or more local relational databases into a virtual, read-only RDF graph. D2RQ rewrites RDQL queries and Jena API calls into application-datamodel-specific SQL queries. The result sets of these SQL queries are transformed into RDF triples which are passed up to the higher layers of the Jena framework.
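The last step described above, turning SQL result sets into RDF triples, can be illustrated with a minimal Python sketch. It is not D2RQ and does not use the D2RQ mapping language: it materializes triples instead of exposing a virtual graph, performs no query rewriting, and the `http://example.org/` namespace, the `city` table, and the function `rows_to_triples` are assumptions made purely for illustration.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace for the generated URIs


def rows_to_triples(conn, table, key, base=BASE):
    """Expose each row of `table` as (subject, predicate, object) triples.

    The row's key column becomes part of the subject URI; every other
    non-NULL column becomes one triple with a plain literal as object.
    Sketch only: the table name is interpolated into the SQL string and
    literals are not escaped, so this is not safe for untrusted input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{base}{table}/{record[key]}>"
        for column, value in record.items():
            if column == key or value is None:
                continue
            predicate = f"<{base}{table}#{column}>"
            triples.append((subject, predicate, f'"{value}"'))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("INSERT INTO city VALUES (1, 'Mannheim', 'Germany')")

for s, p, o in rows_to_triples(conn, "city", "id"):
    print(s, p, o, ".")
```

A declarative mapping, as used by tools such as D2RQ, would additionally let the user choose which ontology classes and properties the tables and columns are mapped to, instead of deriving predicate URIs mechanically from column names as done here.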

5.2. Using LOD to interpret semi-structured data

In many cases, the data at hand is represented in a semi-structured representation, meaning that the data can be easily understood by humans, but it cannot be automatically processed by machines, because it is not backed by a schema or any other formal representation. One of the most used semi-structured representations of data is the tabular representation, found in documents, spreadsheets, on the Web or in databases. Such a representation often follows a simple structure, and unlike relational databases, there is no explicit representation of a schema.

⁴ http://www.w3.org/2001/11/IsaViz/.
⁵ http://www.visualdataweb.org/relfinder.php.
⁶ http://wifo5-03.informatik.uni-mannheim.de/bizer/rdfapi/tutorial/netapi.html.
⁷ https://jena.apache.org/documentation/ontology/.
⁸ https://jena.apache.org/tutorials/rdf_api.html.

Evidence for the semantics of semi-structured data can be found, e.g., in its column headers, cell values, implicit relations between columns, as well as caption and surrounding text. However, general and domain-specific background knowledge is needed to interpret the meaning of the table.

Many approaches have been proposed for extracting the schema of the tables, and mapping it to existing ontologies and LOD. Mulwad et al. have made significant contributions to interpreting tabular data from independent domains using LOD [27–32]. They have proposed several approaches that use background knowledge from the Linked Open Data cloud, like Wikitology [33], DBpedia [34], YAGO [35], Freebase [36] and WordNet [37], to infer the semantics of column headers, table cell values and relations between columns, and to represent the inferred meaning as a graph of RDF triples. A table’s meaning is thus captured by mapping columns to classes in an appropriate ontology, linking cell values to literal constants, implied measurements, or entities in the LOD cloud, and identifying relations between columns. Their methods range from simple index lookup from a LOD source to techniques grounded in graphical models and probabilistic reasoning to infer the meaning associated with a table [32], and are applicable to different types of tables, i.e., relational tables, quasi-relational (Web) tables and spreadsheet tables.

Liu et al. [38] propose a learning-based semantic search algorithm to suggest appropriate Semantic Web terms and ontologies for the given data. The approach combines various measures for semantic similarity of documents to build a weighted feature-based semantic search model, which is then able to find the most suitable ontologies. The weights are learned from training data, using the subgradient descent method and logistic regression.

Limaye et al. [39] propose a new probabilistic graphical model for simultaneously choosing entities for cells, types for columns and relations for column pairs, using YAGO as a background knowledge base. For building the graphical models, several types of features were used, i.e., cell text and entity label, column type and type label, column type and cell entity, relation and pair of column types, relation and entity pairs. The experiments showed that approaching the three sub-problems collectively and in a unified graphical inference framework leads to higher accuracy compared to making local decisions.

Venetis et al. [40] associate multiple class labels (or concepts) with columns in a table and identify relations between the “subject” column and the rest of the columns in the table. Both the concept identification for columns and the relation identification are based on a maximum likelihood hypothesis, i.e., the best class label (or relation) is the one that maximizes the probability of the values given the class label (or relation) for the column. The evidence for the relations and for the classes is retrieved from a previously extracted isA database, describing the classes of the entities, and a relations database, which contains relations between the entities. The experiments show that the approach can obtain meaningful labels for tables, labels that rarely exist in the tables themselves, and that considering the recovered semantics leads to high-precision search with little loss of recall of tables in comparison to document-based approaches.
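The maximum likelihood idea can be made concrete with a small Python sketch. It is a simplification under stated assumptions and does not reproduce the estimator or the isA database of Venetis et al.: the counts in `ISA_COUNTS` are invented, cell values are assumed to be independent given the label, and add-alpha smoothing is used so that unseen values do not yield a zero probability.

```python
import math

# Hypothetical isA evidence: how often a value was observed with a class label.
ISA_COUNTS = {
    "city": {"Paris": 50, "Berlin": 40, "Rome": 30},
    "person": {"Paris": 5, "Berlin": 1, "Albert Einstein": 60},
}


def log_likelihood(values, label, counts, vocab_size, alpha=1.0):
    """log P(values | label), assuming independent values and add-alpha smoothing."""
    label_counts = counts[label]
    total = sum(label_counts.values())
    return sum(
        math.log((label_counts.get(v, 0) + alpha) / (total + alpha * vocab_size))
        for v in values
    )


def best_label(values, counts):
    """Pick the class label that maximizes the likelihood of the column values."""
    vocab = {v for per_label in counts.values() for v in per_label} | set(values)
    return max(
        counts,
        key=lambda label: log_likelihood(values, label, counts, len(vocab)),
    )


print(best_label(["Paris", "Berlin", "Rome"], ISA_COUNTS))  # city
```

The same scheme carries over to relation identification by scoring candidate relations against pairs of values from two columns instead of scoring class labels against single values.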

Wang et al. [41] propose a multi-phase algorithm that, using the universal probabilistic taxonomy called Probase [42], is capable of understanding the entities, attributes and values in many tables on the Web. The approach begins by identifying a single “entity column” in a table and, based on its values and the rest of the column headers, associates a concept from the Probase knowledge base with the table.

Table 1
Summary of approaches used in the selection step.

| Approach | Domain: Problem | Domain: Data mining | Ontology: Complexity | Ontology: Reasoning | LOD: Links | LOD: Semantics | Used datasets: LOD | Used datasets: Ontology |
|---|---|---|---|---|---|---|---|---|
| [27–32] | Persons, places, organizations | / | H | Yes | No | Yes | DBpedia, YAGO, Freebase, WordNet | Wikitology |
| [50] | Biology | / | L | No | No | No | / | / |
| [51] | Commerce | / | H | Yes | No | No | DBpedia | / |
| [53] | Medicine, publications | / | H | No | Yes | Yes | ClinicTrials.gov, BibBase (a) | / |
| [39] | Persons, places, organizations | / | H | No | No | Yes | DBpedia, YAGO, WordNet | / |
| [40] | Geography | / | H | No | No | No | YAGO, Freebase | / |
| [41] | Persons, places, organizations | / | H | No | Yes | Yes | Probase, DBpedia | / |
| [43–45] | Persons, places, organizations, music, movies | / | H | Yes | Yes | Yes | DBpedia, YAGO, MusicBrainz (b) | / |
| [46] | Books | / | H | No | No | Yes | YAGO | / |

(a) http://data.bibbase.org/.
(b) http://linkedbrainz.org/.

Zhang et al. [43,44] propose an incremental, bootstrapping approach that learns to label table columns using partial data in the column, and uses a generic feature model able to use various types of table context in learning. The work has been extended in [45], where the author shows that using sample selection techniques, it is possible to semantically annotate Web tables in a more efficient way.

Similarly, an approach for interpreting data from Web forms using LOD has been proposed [46]. The approach starts by extracting the attribute–value pairs of the form, which is done using probing methods. Then, the data extracted from the Web forms are represented as RDF triples, or as a complete RDF graph. To enrich the graph with semantics, it is aligned with a large reference ontology, like YAGO, using ontology alignment approaches.

A particular case is that of tables in Wikipedia, which follow a certain structure and, with links to other Wikipedia pages, can be more easily linked to existing LOD sources such as DBpedia. Therefore, several approaches for interpreting tables from Wikipedia with LOD have been proposed. Munoz et al. [47,48] propose methods for triplifying Wikipedia tables, called WikiTables, using existing LOD knowledge bases, like DBpedia and YAGO. Following the idea of the previous approaches, this approach starts by extracting entities from the tables, and then discovers existing relations between them. Similarly, a machine learning approach has been proposed by Bhagavatula et al. [49], where no LOD knowledge base is used, and only metadata about the entity types and the relations between them is added.

Similarly, approaches have been proposed for interpreting tabular data in spreadsheets [50,51], CSV [52], and XML [53].

5.3. Using LOD to interpret unstructured data

Text mining is the process of analyzing unstructured information, usually contained in natural language text, in order to discover new patterns. The most common text mining tasks include text categorization, text clustering, sentiment analysis and others. In most cases, text documents contain named entities that can be identified in the real world, and further information can be extracted about them. Several approaches and APIs have been proposed for extracting named entities from text documents and linking them to LOD. One of the most used APIs is DBpedia Spotlight [54,55], which allows for automatically annotating text documents with DBpedia URIs. This tool is used in several LOD-enabled data mining approaches, e.g., [56–59]. Several APIs for extracting semantic richness from text exist, like Alchemy API,⁹ OpenCalais API,¹⁰ and Textwise SemanticHacker API.¹¹ All these APIs are able to annotate named entities with concepts from several knowledge bases, like DBpedia, YAGO, and Freebase. These tools and APIs have been evaluated in the NERD framework, implemented by Rizzo et al. [60].

⁹ http://www.alchemyapi.com/api/.
¹⁰ http://www.opencalais.com/documentation/opencalais-documentation.

Furthermore, Linked Open Data is also heavily used for better understanding of social media. Unlike authored news and other textual Web content, social media data pose a number of new challenges for semantic technologies, due to their large-scale, noisy, irregular, and social nature. An overview of tools and approaches for semantic representation of social media streams is given in [61]. This survey discusses five key research questions: (i) What ontologies and Web of Data resources can be used to represent and reason about the semantics of social media streams? For example, FOAF¹² and the GUMO ontology [62] for describing people and social networks, SIOC¹³ and the DLPO ontology [63] for modeling and interlinking social media, and the MOAT ontology [64] for modeling tag semantics. (ii) How can semantic annotation methods capture the rich semantics implicit in social media? For example, keyphrase extraction [65,66], ontology-based entity recognition, event detection [67] and sentiment detection [gangemi2014frame, sentilo]. (iii) How can we extract reliable information from these noisy, dynamic content streams? (iv) How can we model users’ digital identity and social media activities? For example, discovering user demographics [68], deriving user interests [69] and capturing user behavior [70]. (v) What semantic-based information access methods can help address the complex information seeking behavior in social media? For example, semantic search in social media [71] and social media streams recommendation [72].

Once the user has developed a sufficient understanding of the domain, and the data mining task is defined, they need to select an appropriate data sample. If the data have already been mapped to appropriate domain specific ontologies or linked to external Linked Open Data, the users can more easily select a representative sample and/or meaningful subpopulation of the data for the given data mining task. For example, once a collection of texts has been linked to the Semantic Web, the user may decide to select those texts which mention a politician, a selection that only becomes possible through the linking.
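The politician example can be sketched as follows. This is a toy stand-in for entity linking tools such as DBpedia Spotlight, not a use of their APIs: the gazetteer, the exact string matching, and the type labels attached to the two DBpedia-style URIs are assumptions made for illustration, and the names `GAZETTEER`, `link_entities`, and `select_by_type` are hypothetical.

```python
# Hypothetical gazetteer: surface form -> (DBpedia-style URI, assumed entity type).
GAZETTEER = {
    "Angela Merkel": ("http://dbpedia.org/resource/Angela_Merkel", "Politician"),
    "Mannheim": ("http://dbpedia.org/resource/Mannheim", "City"),
}


def link_entities(text, gazetteer=GAZETTEER):
    """Return (surface form, URI, type) for every gazetteer entry found in the text."""
    return [
        (form, uri, etype)
        for form, (uri, etype) in gazetteer.items()
        if form in text  # exact substring match; real linkers also disambiguate
    ]


def select_by_type(texts, wanted_type):
    """Keep only the texts in which at least one linked entity has the wanted type."""
    return [
        text
        for text in texts
        if any(etype == wanted_type for _, _, etype in link_entities(text))
    ]


texts = [
    "Angela Merkel gave a speech in Berlin.",
    "The university is located in Mannheim.",
]
print(select_by_type(texts, "Politician"))
```

The selection criterion is expressed over entity types rather than over words, which is what the linking step adds: neither text contains the word "politician".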

Table 1 gives an overview of the discussed approaches in this section.¹⁴ It can be observed that at the selection step, links between datasets play only a minor role, and reasoning is scarcely used. In most cases, general-purpose knowledge bases, such as DBpedia or YAGO, are used as sources of knowledge.

11 http://textwise.com/api.

12 http://xmlns.com/foaf/spec/.

13 http://sioc-project.org/.

14 The tables used for summarizing approaches at the end of each section are structured as follows: The second column of the table states the problem domain on which the approach is applied. The third column states the data mining task/domain that was used in the approach. The next two columns capture the characteristics of

The selection of relevant semantic web datasets is usually done by interlinking a dataset at hand with data from Linked Open Data. There are strategies and tools for different kinds of data: relational databases are typically mapped to the semantic web using mapping rules and tools such as D2R. In those cases, mapping rules are typically written manually, which is easily possible because the schema of a relational database is usually explicitly defined.

Semi-structured data, such as Web tables, usually comes without explicit semantics, and in large quantities. Here, different heuristics and machine learning approaches are often applied to link them to LOD sources. For that case, it has been shown that approaches which perform schema and instance matching in a holistic way typically outperform approaches that handle both tasks in isolation.

For unstructured data, i.e., textual contents, the interlinking is typically done by linking named entities in the text to LOD sources with tools such as DBpedia Spotlight.

Once the interlinking is done, data visualization and summarization techniques can benefit from the additional knowledge contained in the interlinked datasets.

6. Preprocessing

Once the data is mapped to domain specific knowledge, the constraints expressed in the ontologies can be used to perform data validity checks and data cleaning. Ontologies can be used for detecting outliers and noise, as well as for handling missing values and data range and constraint violations, and guiding the users through custom preprocessing steps.

Ontologies are used in many research approaches for data cleaning and data preprocessing. Two kinds of ontologies are applied in this stage: domain-independent ontologies used for data quality management, and domain ontologies. The first category of ontologies usually contains specifications for performing cleaning and preprocessing operations. In these approaches, the ontology is usually used to guide the user through the process of data cleaning and validation, by suggesting possible operations to be executed over the data. The second category of ontologies provides domain specific knowledge needed to validate and clean data, usually in an automatic manner.

6.1. Domain-independent approaches

One of the first approaches that uses a data quality ontology is proposed by Wang et al. [74]. They propose a framework called OntoClean¹⁵ for ontology-based data cleaning. The core component of the framework is the data cleaning ontology component, which is used when identifying the cleaning problem and the relevant data. Within this component, the task ontology specifies the potential methods that may be suitable for meeting the user’s goals, and the domain ontology includes all classes, instances, and axioms in a specific domain, which provides domain knowledge such as attribute constraints for checking invalid values while performing the cleaning tasks.

14 (continued) the ontologies used in the approach, i.e., the complexity level of the ontology, and if reasoning is applied on the ontology. Based on a prior categorization of ontologies presented in [73], we distinguish two degrees of ontology complexity: ontologies of low complexity that consist of class hierarchies and subclass relations (marked with L), and ontologies with high complexity that also contain relations other than the subclass relations, and further constraints, rules and so on (marked with H). The sixth column indicates if links (such as owl:sameAs) to other LOD sources were followed to extract additional information. The next column states whether explicit semantic information was used from a given LOD source. The final two columns list the used LOD sources and shared ontologies, respectively. If a LOD source is used, the respective ontology is used as well, without explicitly stating that in the table.

15 Not to be confused with the ontology engineering method by Guarino and Welty.

A similar approach is proposed by Perez et al. [75] with the OntoDataClean framework, which is able to guide the data cleaning process in a distributed environment. The framework uses a preprocessing ontology to store the information about the required transformations. First, the process of identifying and storing the required preprocessing steps has to be carried out by a domain expert. These transformations are needed to homogenize and integrate the records so they can be correctly analyzed or unified with other sources. Finally, the required information is stored in the preprocessing ontology, and the data transformations can be accomplished automatically. The approach has been tested on four databases in the domain of bio-medicine, showing that using the ontology the data can be correctly preprocessed and transformed according to the needs.

6.2. Domain-specific approaches

One of the first approaches to use a domain specific ontology is proposed by Philips et al. [76]. The approach uses ontologies to organize and represent knowledge about attributes and their constraints from relational databases. The approach is able to identify, automatically or semi-automatically with the assistance of the user, the domains of the attributes, relations between the attributes, duplicate attributes and duplicate entries in the database.

Kedad et al. [77] propose a method for dealing with semantic heterogeneity, i.e., differences in terminology, during the process of data cleaning when integrating data from multiple sources. The proposed solution is based on linguistic knowledge provided by a domain is-a ontology. The main idea is to automatically generate correspondence assertions between instances of objects based on the is-a hierarchy, where the user can specify the level of accuracy expressed using the domain ontology. Once the user has specified the level of accuracy, two concepts will be considered the same if there is a subsumption relation between them, or both belong to the same class. Using this approach, the number of results might be increased when querying the data, e.g., for the query ‘‘Do red cars have more accidents than others?’’ the system will not only look for red cars, but also for cars with color ruby, vermilion, and seville, which are subclasses of the red color.
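The red-car example can be illustrated with a minimal sketch. The hierarchy, the car records and the function names below are hypothetical and only cover the subsumption case (a value matches a query concept if it is the concept itself or one of its subclasses); the same-class case and the user-defined level of accuracy described above are not modeled.

```python
# Hypothetical is-a hierarchy of colors: child -> parent.
IS_A = {
    "ruby": "red",
    "vermilion": "red",
    "seville": "red",
    "navy": "blue",
    "red": "color",
    "blue": "color",
}


def ancestors(concept):
    """Return all concepts that subsume `concept`, nearest first."""
    result = []
    while concept in IS_A:
        concept = IS_A[concept]
        result.append(concept)
    return result


def matches(value, query_concept):
    """A value matches if it equals the query concept or is subsumed by it."""
    return value == query_concept or query_concept in ancestors(value)


# Hypothetical records to be queried.
cars = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "ruby"},
    {"id": 3, "color": "navy"},
    {"id": 4, "color": "seville"},
]

# Querying for "red" also returns cars whose color is a subclass of red.
red_cars = [car["id"] for car in cars if matches(car["color"], "red")]
```

Without the hierarchy, a plain equality test on the color attribute would only return the first car.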

Milano et al. introduce the OXC framework [78] that allows data cleaning on XML documents based on a uniform representation of domain knowledge through an ontology, which is gathered from domain analysis activities and from the DTDs of the documents. The framework comprises a methodology for data quality assessment and cleaning based on the reference ontology, and an architecture for XML data cleaning based on such methodology. Given a domain ontology, a mapping relation between the DTD and the ontology is defined, which is used to define quality dimensions (accuracy, completeness, consistency and currency), and perform data quality improvement by relying on the semantics encoded by the ontology.

Brüggemann et al. [79] propose a combination of domain specific ontologies and data quality management ontologies, by annotating domain ontologies with data quality management specific metadata. The authors have shown that such a hybrid approach is suitable for consistency checking, duplicate detection, and metadata management. The approach has been extended in [80], where correction suggestions are generated for each detected inconsistency. The approach uses the hierarchical structure of the ontology to offer the user semantically related context-aware correction suggestions. Moreover, the framework uses several measurements of semantic distances in ontologies to find the most suitable corrections for the identified inconsistencies. Based on those metrics the system can offer several suggestions for value corrections, i.e., value of next-sibling, first-child and parent. The approach has been applied on data from the cancer registry of Lower Saxony,¹⁶ showing that it can successfully support domain experts.
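As a rough illustration of hierarchy-based correction suggestions, the following sketch looks up the next sibling, first child and parent of a value in an ordered hierarchy. The hierarchy and the function names are hypothetical, and the ranking of candidates by semantic distance described above is not implemented.

```python
# Hypothetical ordered hierarchy: each node maps to its children, in order.
CHILDREN = {
    "Germany": ["Lower Saxony", "Bavaria"],
    "Lower Saxony": ["Hannover", "Oldenburg"],
    "Bavaria": ["Munich"],
}


def parent_of(node):
    """Return the parent of `node`, or None for the root or unknown nodes."""
    for parent, children in CHILDREN.items():
        if node in children:
            return parent
    return None


def correction_candidates(node):
    """Collect the three structural candidates named in the text."""
    parent = parent_of(node)
    siblings = CHILDREN.get(parent, [])
    next_sibling = None
    if node in siblings:
        position = siblings.index(node)
        if position + 1 < len(siblings):
            next_sibling = siblings[position + 1]
    children = CHILDREN.get(node, [])
    return {
        "next_sibling": next_sibling,
        "first_child": children[0] if children else None,
        "parent": parent,
    }


candidates = correction_candidates("Lower Saxony")
```

A real system would additionally score these candidates, so that the user sees the semantically closest correction first.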

Wang et al. [81] present a density-based outlier detection method using a domain ontology, named ODSDDO (Outlier Detecting for Short Documents using Domain Ontology). The algorithm is based on the local outlier factor algorithm, and uses the domain ontology to calculate the semantic distance between short documents, which improves the outlier detection precision. To calculate the semantic similarity between two documents, first each word from each document is mapped to the corresponding concept in the ontology. Then, using the ontology concept tree, the similarity between each pair of concepts is calculated. The similarity between two documents is then simply calculated as the average of the maximum similarities between the pairs of concepts. The documents that have small or zero semantic similarity to other documents in the dataset are considered to be outliers.
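The document similarity step can be sketched as follows. The concept tree, the word-to-concept mapping and the path-based concept similarity 1 / (1 + number of edges) are assumptions made for illustration; the source does not state which tree-based similarity ODSDDO uses, and the local outlier factor step is omitted.

```python
# Hypothetical concept tree: child -> parent.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "animal": "thing",
    "car": "vehicle",
    "vehicle": "thing",
}

# Hypothetical mapping from words to ontology concepts.
WORD_TO_CONCEPT = {
    "dog": "dog", "puppy": "dog",
    "cat": "cat", "kitten": "cat",
    "car": "car", "sedan": "car",
}


def path_to_root(concept):
    """Return the concept followed by all its ancestors up to the root."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def concept_similarity(a, b):
    """Assumed measure: 1 / (1 + edges on the tree path between a and b)."""
    steps_in_b = {c: i for i, c in enumerate(path_to_root(b))}
    for steps_in_a, c in enumerate(path_to_root(a)):
        if c in steps_in_b:  # lowest common ancestor reached
            return 1.0 / (1 + steps_in_a + steps_in_b[c])
    return 0.0


def document_similarity(doc_a, doc_b):
    """Average, over the concepts of doc_a, of the best match in doc_b."""
    concepts_a = [WORD_TO_CONCEPT[w] for w in doc_a if w in WORD_TO_CONCEPT]
    concepts_b = [WORD_TO_CONCEPT[w] for w in doc_b if w in WORD_TO_CONCEPT]
    if not concepts_a or not concepts_b:
        return 0.0
    best = [max(concept_similarity(a, b) for b in concepts_b) for a in concepts_a]
    return sum(best) / len(best)
```

Note that this variant is not symmetric in its two arguments; a symmetric version could average the two directions.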

The approach of Lukaszewski [82] admits and utilizes noisy data by making it possible to model different levels of knowledge granularity in both training and testing examples. The authors argue that erroneous or missing attribute values may be introduced by users of a system who are required to provide very specific values, but whose knowledge of the domain is too general to precisely describe the observation by the appropriate value of an attribute. Therefore, they propose a knowledge representation that uses hierarchies of sets of attribute values, derived from subsumption hierarchies of concepts from an ontology, which decreases the level of attribute noise in the data.

Fürber and Hepp [83–86] propose approaches for using Semantic Web technologies and Linked Open Data to reduce the effort for data quality management in relational databases. They show that using LOD reference data can help identify missing values, illegal values, and functional dependency violations. In their first work [83], the authors describe how to identify and classify data quality problems in relational databases, through the use of the SPARQL Inferencing Notation (SPIN).¹⁷ SPIN is a Semantic Web vocabulary and processing framework that facilitates the representation of rules based on the syntax of the SPARQL protocol and RDF query language. To apply the approach on relational databases, the D2RQ tool [26] is used to extract data from relational databases into an RDF representation. The framework allows domain experts to define data requirements for their data based on forms as part of the data quality management process. The SPIN framework then automatically identifies requirement violations in data instances, i.e., syntactic errors, missing values, unique value violations, out of range values, and functional dependency violations. This approach is extended in [85] to assess the quality state of data in additional dimensions.
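To make the listed violation types concrete, the following sketch checks four of them on a small table. It is written in plain Python rather than SPIN or SPARQL, and the records, attribute names and function names are hypothetical; it only illustrates what each check tests, not how the cited framework implements it.

```python
# Hypothetical records, as they might be extracted from a relational table.
records = [
    {"id": 1, "city": "Mannheim", "country": "Germany", "population": 300000},
    {"id": 2, "city": "Mannheim", "country": "France", "population": 310000},
    {"id": 3, "city": "Paris", "country": None, "population": -5},
    {"id": 3, "city": "Lyon", "country": "France", "population": 500000},
]


def missing_values(rows, attribute):
    """Row indices where the attribute is absent or empty."""
    return [i for i, r in enumerate(rows) if r.get(attribute) in (None, "")]


def out_of_range(rows, attribute, low, high):
    """Row indices where a present value lies outside [low, high]."""
    return [
        i for i, r in enumerate(rows)
        if r.get(attribute) is not None and not (low <= r[attribute] <= high)
    ]


def unique_violations(rows, attribute):
    """Row indices that share a value of an attribute that must be unique."""
    seen = {}
    for i, r in enumerate(rows):
        seen.setdefault(r.get(attribute), []).append(i)
    return sorted(i for idx in seen.values() if len(idx) > 1 for i in idx)


def functional_dependency_violations(rows, determinant, dependent):
    """Row indices whose determinant value maps to several dependent values."""
    values = {}
    for r in rows:
        if r.get(determinant) is not None and r.get(dependent) is not None:
            values.setdefault(r[determinant], set()).add(r[dependent])
    return [
        i for i, r in enumerate(rows)
        if len(values.get(r.get(determinant), ())) > 1
    ]
```

In the cited work such requirements are captured through forms and evaluated as rules over RDF data; here each requirement is simply a function call with the constraint as argument.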

In a further work [84], instead of manually defining the data validation rules, the authors propose using Linked Open Data as a trusted knowledge base that already contains information on the data dependencies. This approach has been shown to significantly reduce the effort for data quality management, when reference data is available in the LOD cloud. The approach was evaluated against a local knowledge base that contained manually created address data. Using GeoNames as a reference LOD dataset, the approach was able to identify invalid city entries, and invalid city–country relations.

16 http://www.krebsregister-niedersachsen.de.

17 http://spinrdf.org/.

A similar approach using SPIN has been developed by Moss et al. [87] for assessing medical data. The system comprises a set of ontologies that support reasoning in a medical domain, such as human psychology, medical domain, and patient data. To perform the data cleaning, several rules for checking missing data points and value checking were used. The approach is evaluated on data from the Brain-IT network,¹⁸ showing that it is able to identify invalid values in the data. Ontologies are often used in the healthcare domain for data quality management and data cleaning. A literature review of such papers is presented in [88].

In [89] we have developed an approach for filling missing values in a local table using LOD, which is implemented in a system named Mannheim Search Joins Engine.¹⁹ The system relies on a large data corpus, crawled from over one million different websites. Besides two large quasi-relational datasets, the data corpus includes the Billion Triples Challenge 2014 Dataset²⁰ [90], and the WebDataCommons Microdata Dataset²¹ [91]. For a given local table, the engine searches the data corpus for additional data for the attributes of the entities in the input table. To perform the search, the engine uses the existing information in the table, i.e., the entities’ labels, the attributes’ headers, and the attributes’ data types. The discovered data is usually retrieved from multiple sources, therefore the new data is first consolidated using schema matching and data fusion methods. Then, the discovered data is used to fill the missing values in the local table. The same approach can also be used for validating the existing data in the given table, i.e., for outlier detection, noise detection and correction.
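The fill step can be sketched under strong simplifying assumptions: the sources are taken to be already schema-matched and keyed by entity label, and conflicting values are fused by majority vote. The tables, the fusion rule and the function names are hypothetical and do not reproduce the engine described in [89].

```python
from collections import Counter

# Hypothetical local table with missing values (None).
local_table = [
    {"label": "Germany", "capital": None},
    {"label": "France", "capital": "Paris"},
    {"label": "Italy", "capital": None},
]

# Hypothetical external sources, already matched to the local schema.
sources = [
    {"Germany": {"capital": "Berlin"}, "Italy": {"capital": "Rome"}},
    {"Germany": {"capital": "Berlin"}},
    {"Germany": {"capital": "Bonn"}},
]


def fuse(values):
    """Assumed fusion rule: the most frequent candidate value wins."""
    return Counter(values).most_common(1)[0][0] if values else None


def fill_missing(table, external, attribute):
    """Return a copy of the table with missing attribute values filled."""
    filled = []
    for row in table:
        row = dict(row)
        if row[attribute] is None:
            candidates = [
                source[row["label"]][attribute]
                for source in external
                if attribute in source.get(row["label"], {})
            ]
            row[attribute] = fuse(candidates)
        filled.append(row)
    return filled


completed = fill_missing(local_table, sources, "capital")
```

Comparing an existing local value against the fused candidate, instead of only filling gaps, would give a simple form of the validation use mentioned above.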

Table 2 gives an overview of the discussed approaches in this section. We can observe that, while ontologies are frequently used for data cleaning, well-known LOD datasets like DBpedia are scarcely exploited. Furthermore, many approaches have been tailored to and evaluated in the medical domain, likely because quite a few sophisticated ontologies exist in that domain.

Ontologies and Semantic Web data help with preprocessing the data, mostly by increasing the data quality. There are various data quality dimensions that can be addressed. Outliers and false values may be found by identifying data points and values that violate constraints defined in those ontologies. Subsumption hierarchies and semantic relations help unify synonyms and detect interrelations between attributes. Finally, missing values can be inferred and/or filled from LOD datasets.

7. Transformation

At this stage, the data is transformed into a representation that is better suited to the data mining process. The transformation step includes dimensionality reduction, feature generation and feature selection, instance sampling, and attribute transformation, such as discretization of numerical data, aggregation, functional transformations, etc. In the context of Semantic Web enabled data mining, feature generation and feature selection are particularly relevant.

18 http://www.brain-it.eu/.

19 http://searchjoins.webdatacommons.org/.

20 http://km.aifb.kit.edu/projects/btc-2014/.

21 http://webdatacommons.org/structureddata/.

Table 2

Summary of approaches used in the preprocessing step.

| Approach | Problem | Data mining | Complexity | Reasoning | Links | Semantics | LOD | Ontology |
|---|---|---|---|---|---|---|---|---|
| [74] | Geography | / | H | No | No | No | / | OntoClean ontology |
| [75] | Biomedicine | / | H | No | No | No | / | OntoDataClean ontology |
| [77] | Medicine | / | H | No | No | No | / | Custom ontology |
| [78] | / | / | H | No | No | No | / | Custom ontology |
| [79,80] | Medicine | / | H | Yes | No | No | / | Custom ontology |
| [81] | Social media | Outlier detection | H | No | No | No | / | Custom ontology |
| [83–86] | Geography | / | H | Yes | No | Yes | DBpedia, GeoNamesᵃ | / |
| [89] | Geography, companies, movies, books, music, persons, drugs | / | H | No | Yes | Yes | BTC 2014, WebDataCommons Microdata Dataset | / |
| [87] | Medicine | / | H | No | No | No | / | Custom ontology |

a http://sws.geonames.org/.

7.1. Feature generation

Linked Open Data has been recognized as a valuable source of background knowledge in many data mining tasks. Augmenting a dataset with features taken from Linked Open Data can, in many cases, improve the results of a data mining problem at hand, while externalizing the cost of creating and maintaining that background knowledge [92].

Most data mining algorithms work with a propositional feature vector representation of the data, i.e., each instance is represented as a vector of features $\langle f_1, f_2, \ldots, f_n \rangle$, where the features are either binary (i.e., $f_i \in \{\mathit{true}, \mathit{false}\}$), numerical (i.e., $f_i \in \mathbb{R}$), or nominal (i.e., $f_i \in S$, where $S$ is a finite set of symbols) [93]. Linked Open Data, however, comes in the form of graphs, connecting resources with types and relations, backed by a schema or ontology.

Thus, for accessing Linked Open Data with existing data mining tools, transformations have to be performed, which create propositional features from the graphs in Linked Open Data, i.e., a process called propositionalization [94]. Usually, binary features (e.g., true if a type or relation exists, false otherwise) or numerical features (e.g., counting the number of relations of a certain type) are used. Furthermore, elementary numerical or nominal features (such as the population of a city or the production studio of a movie) can be added [95]. Other variants, e.g., computing the fraction of relations of a certain type, are possible, but rarely used.
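The three variants mentioned above (binary, count, and fraction of relations of a certain type) can be sketched on a toy graph. The triples and the function names are hypothetical; the sketch only considers outgoing relations of each instance.

```python
from collections import Counter

# Hypothetical RDF-like triples (subject, predicate, object).
TRIPLES = [
    ("Mannheim", "rdf:type", "City"),
    ("Mannheim", "locatedIn", "Germany"),
    ("Mannheim", "twinnedWith", "Toulon"),
    ("Mannheim", "twinnedWith", "Swansea"),
    ("Toulon", "rdf:type", "City"),
    ("Toulon", "locatedIn", "France"),
]


def relation_counts(instance):
    """Count the outgoing relations of an instance per predicate."""
    return Counter(p for s, p, o in TRIPLES if s == instance)


def propositionalize(instances, strategy="binary"):
    """Turn graph neighborhoods into one feature vector per instance."""
    counts = {inst: relation_counts(inst) for inst in instances}
    features = sorted({p for c in counts.values() for p in c})
    rows = {}
    for inst in instances:
        total = sum(counts[inst].values())
        row = []
        for p in features:
            n = counts[inst][p]
            if strategy == "binary":
                row.append(n > 0)
            elif strategy == "count":
                row.append(n)
            elif strategy == "relative":
                row.append(n / total if total else 0.0)
            else:
                raise ValueError("unknown strategy: " + strategy)
        rows[inst] = row
    return features, rows


features, binary_rows = propositionalize(["Mannheim", "Toulon"], "binary")
_, count_rows = propositionalize(["Mannheim", "Toulon"], "count")
_, relative_rows = propositionalize(["Mannheim", "Toulon"], "relative")
```

The resulting rows can be passed to any learner that expects a fixed-length feature vector, which is the point of propositionalization.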

In the recent past, a few approaches for propositionalizing Linked Open Data for data mining purposes have been proposed. Many of those approaches are supervised, i.e., they let the user formulate SPARQL queries, which means that they leave the propositionalization strategy up to the user, and a fully automatic feature generation is not possible. Usually, the resulting features are binary, or numerical aggregates using SPARQL COUNT constructs.

LiDDM [96] is an integrated system for data mining on the Semantic Web. The tool allows the users to declare SPARQL queries for retrieving features from LOD that can be used in different machine learning techniques, such as clustering and classification. Furthermore, the tool offers operators for integrating data from multiple sources, data filtering and data segmentation, which are carried out manually by the user. The usefulness of the tool has been presented through two use cases, using DBpedia, World FactBook²² and LinkedMDB,²³ in the application of correlation analysis and rule learning.

A similar approach has been used in the RapidMiner²⁴ semweb plugin [97], which preprocesses RDF data in a way that it can be further processed by a data mining tool, RapidMiner in that case. Again, the user has to specify a SPARQL query to select the data of interest, which is then converted into feature vectors. The authors propose two methods for handling set-valued data, by mapping them into an N-dimensional vector space. The first one is FastMap, which embeds points in an N-dimensional space based on a distance metric, much like Multidimensional Scaling (MDS). The second one is Correspondence Analysis (CA), which maps values to a new space based on their cooccurrence with values of other attributes. The approaches were evaluated on IMDB data,²⁵ showing that the mapping functions can improve the results over the baseline.

22 http://wifo5-03.informatik.uni-mannheim.de/factbook/.

23 http://www.linkedmdb.org/.

24 http://www.rapidminer.com/.

Cheng et al. [98] propose an approach for automated feature generation after the user has specified the type of features. To do so, the users have to specify the SPARQL query, which makes this approach supervised. The approach has been evaluated in the domain of recommender systems (movies domain) and text classification (tweets classification). The results show that using semantic features can improve the results of the learning models compared to using only standard features.

Mynarz et al. [99] have considered using user specified SPARQL queries in combination with SPARQL aggregates, including COUNT, SUM, MIN, MAX. Kauppinen et al. have developed the SPARQL package for R²⁶ [100,101], which allows importing LOD data into R, the well-known environment for statistical computing and graphics. In their research they use the tool to perform statistical analysis and visualization of the linked Brazilian Amazon rainforest data. The same tool has been used in [102] for statistical analysis of piracy attack report data. Moreover, they use the tool to import RDF data from multiple LOD sources into the R environment, which allows them to easily analyze, interpret and visualize the discovered patterns in the data.

FeGeLOD [95] was the first fully automatic approach for enriching data with features that are derived from LOD. In that work, we have proposed six different feature generation strategies, allowing for both binary features and simple numerical aggregates. The first two strategies are only concerned with the instances themselves, i.e., retrieving the data properties of each entity, and the types of the entity. The four other strategies consider the relation of the instances to other resources in the graph, i.e., incoming and outgoing relations, and qualified relations, i.e., aggregates over the type of both the relation and the related entity. The work has been continued into the RapidMiner Linked Open Data extension²⁷ [103,104]. Currently, the RapidMiner LOD extension supports the user in all steps of the LOD enabled knowledge discovery process, i.e., linking, combining data from multiple LOD sources, preprocessing and cleaning, transformation, data analysis, and interpretation of data mining findings.

25 http://www.imdb.com/.

26 http://linkedscience.org/tools/sparql-package-for-r/.

27 http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/.

FeGeLOD and the RapidMiner LOD extension have been used in different data mining applications, i.e., text classification [58,57,56,105], explaining statistics [106–108], linkage error detection [109], and recommender systems [110,111]. Besides using simple binary and numerical representation of the features, we have proposed using adapted versions of TF–IDF based measures. In [112] we have performed an initial comparison of different propositionalization strategies (i.e., binary, count, relative count and TF–IDF) for generating features from types and relations from Linked Open Data.

A problem similar to feature generation is addressed by kernel functions, which compute the similarity between two data instances. The similarity is calculated by counting common substructures in the graphs of the instances, e.g., walks, paths and trees. Graph kernels are used in kernel-based data mining and machine learning algorithms, most commonly support vector machines (SVMs), but can also be exploited for tasks such as clustering. In the past, many graph kernels have been proposed that are tailored towards specific applications [113–115], or towards specific semantic representations [116–119]. However, only a few approaches are general enough to be applied to any given RDF data, regardless of the data mining task. Lösch et al. [120] introduce two general RDF graph kernels, based on intersection graphs and intersection trees. First, they propose the use of walk and path kernels, which count the number of walks and paths in the intersected graphs. Then, they propose the full subtree kernel, which counts the number of full subtrees of the intersection tree.

The intersection tree path kernel introduced by Lösch et al. has been modified and simplified by Vries et al. [121–124]; the simplified version also allows for explicit calculation of the instances’ feature vectors, instead of pairwise similarities. Computing the feature vectors significantly improves the computation time, and allows using arbitrary machine learning methods. They have developed two types of kernels over RDF data, the RDF walk count kernel and the RDF WL sub tree kernel. The RDF walk count kernel counts the different walks in the sub-graphs (up to the provided graph depth) around the instance nodes. The RDF WL sub tree kernel counts the different full sub-trees in the sub-graphs (up to the provided graph depth) around the instance nodes, using the Weisfeiler–Lehman algorithm [125]. The approaches developed by Lösch et al. and by Vries et al. have been evaluated on two common relational learning tasks: entity classification and link prediction.
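The idea of explicit walk count features can be sketched as follows. This is a simplified illustration, not the kernel from [121–124]: the graph is hypothetical, walks are represented as label sequences that omit the label of the instance node itself (so that walks from different instances can coincide), and the kernel value is the dot product of the resulting count vectors.

```python
from collections import Counter

# Hypothetical labeled graph: node -> list of (predicate, target node).
EDGES = {
    "paper1": [("author", "alice"), ("topic", "kernels")],
    "paper2": [("author", "bob"), ("topic", "kernels")],
    "alice": [("affiliation", "uniA")],
    "bob": [("affiliation", "uniA")],
}


def walk_features(root, depth):
    """Count the label sequences of all walks of 1..depth edges from root."""
    features = Counter()
    frontier = [(root, ())]
    for _ in range(depth):
        next_frontier = []
        for node, walk in frontier:
            for predicate, target in EDGES.get(node, []):
                extended = walk + (predicate, target)
                features[extended] += 1
                next_frontier.append((target, extended))
        frontier = next_frontier
    return features


def kernel(features_a, features_b):
    """Dot product of two sparse walk count vectors."""
    shared = features_a.keys() & features_b.keys()
    return sum(features_a[w] * features_b[w] for w in shared)


f1 = walk_features("paper1", depth=2)
f2 = walk_features("paper2", depth=2)
```

Because each instance gets its own count vector, the vectors can be computed once and reused, instead of evaluating the kernel for every pair of instances.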

7.2. Feature selection

We have shown that there are several approaches that generate propositional feature vectors from Linked Open Data. Often, the resulting feature spaces can have a very high dimensionality, which leads to problems both with respect to the performance as well as the accuracy of learning algorithms. Thus, it is necessary to apply some feature selection approaches to reduce the feature space. Additionally, for datasets that already have a high dimensionality, background knowledge from LOD or linguistic resources such as WordNet may help reducing the feature space better than standard techniques which do not exploit such background knowledge.

Feature selection is a very important and well studied problem in the literature. The objective is to identify features that are correlated with or predictive of the class label. Generally, all feature selection methods can be divided into two broader categories: wrapper methods and filter methods (John et al. [126] and Blum et al. [127]).

In feature vectors generated from external knowledge we can often observe relations between the features. In many cases those relations are hierarchical relations, or we can say that the features subsume each other, and carry similar semantic information. Those hierarchical relations can be easily retrieved from the ontology or schema used for publishing the LOD, and can be used to perform better feature selection.

We have introduced an approach [128] that exploits hierarchies for feature selection in combination with standard metrics, such as information gain or correlation. The core idea of the approach is to identify features with similar relevance, and select the most valuable abstract features, i.e., features from as high as possible levels of the hierarchy, without losing predictive power, and thus find an optimal trade-off between the predictive power and the generality of a feature in order to avoid over-fitting. To measure the similarity of relevance between two nodes, we use the standard correlation and information gain measure. The approach works in two steps, i.e., an initial selection and an additional pruning step.

Jeong et al. [129] propose the TSEL method using a semantic hierarchy of features based on WordNet relations. The presented algorithm tries to find the most representative and most effective features from the complete feature space. To do so, they select one representative feature from each path in the tree, where a path is the set of nodes between each leaf node and the root, based on the lift measure, and use $\chi^2$ to select the most effective features from the reduced feature space.

Wang et al. [130] propose a bottom-up hill climbing search algorithm to find an optimal subset of concepts for document representation. For each feature in the initial feature space, they use a kNN classifier to detect the k nearest neighbors of each instance in the training dataset, and then use the purity of those instances to assign scores to features.

Lu et al. [131] describe a greedy top-down search strategy for feature selection in a hierarchical feature space. The algorithm starts with defining all possible paths from each leaf node to the root node of the hierarchy. The nodes of each path are sorted in descending order based on the nodes’ information gain ratio. Then, a greedy strategy is used to prune the sorted lists. Specifically, it iteratively removes the first element in the list and adds it to the list of selected features. Then, it removes all ancestors and descendants of this element from the sorted list. Therefore, the selected features list can be interpreted as a mixture of concepts from different levels of the hierarchy.
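The greedy pruning can be sketched as follows. The description above leaves open how the per-path lists interact, so this sketch assumes that all nodes are merged into one list ranked by score; the hierarchy, the scores (standing in for information gain ratios) and the function names are hypothetical.

```python
# Hypothetical feature hierarchy: child -> parent ("a" is the root).
PARENT = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c"}

# Hypothetical relevance scores, standing in for information gain ratios.
SCORE = {"a": 0.10, "b": 0.50, "c": 0.20, "d": 0.30, "e": 0.40, "f": 0.25}


def ancestors(node, parent):
    """Return the set of all ancestors of a node."""
    result = set()
    while node in parent:
        node = parent[node]
        result.add(node)
    return result


def select_features(parent, score):
    """Greedily pick the best remaining node, then drop its relatives."""
    nodes = set(parent) | set(parent.values())
    ranked = sorted(nodes, key=lambda n: (-score[n], n))
    selected = []
    while ranked:
        best = ranked.pop(0)
        selected.append(best)
        related = ancestors(best, parent) | {
            n for n in ranked if best in ancestors(n, parent)
        }
        ranked = [n for n in ranked if n not in related]
    return selected


selected = select_features(PARENT, SCORE)
```

With these scores the inner node b is preferred over its children d and e, while the leaf f is preferred over its parent c, so the selection mixes levels of the hierarchy.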

When creating features from multiple LOD sources, a single semantic feature can often be found in multiple LOD sources, represented with different properties. For example, the area of a country is represented with db:areaTotal in DBpedia, and with yago:hasArea in YAGO. The problem of aligning properties, as well as instances and classes, in ontologies is addressed by ontology matching techniques [132]. Even though there exists a vast amount of work in the area of ontology matching, most of the approaches for generating features from Linked Open Data do not explicitly address this problem. The RapidMiner LOD extension offers an operator for matching properties extracted from multiple LOD sources, which are later fused into a single feature. The operator is based on the probabilistic algorithm for ontology matching PARIS [133]. Unlike most other systems, PARIS is able to align both entities and relations. It does so by bootstrapping an alignment from the matching literals and propagating evidence based on relation functionalities. In [104] we have shown that, for example, the value for the population of a country can be found in 10 different sources within the LOD cloud, which were merged into a single feature using the RapidMiner LOD extension matching and fusion operator. Such a fusion can provide a feature that mitigates missing values and single errors for individual sources, leading to only one high-value feature.

Table 3

Summary of approaches used in the transformation step.

| Approach | Problem | Data mining | Complexity | Reasoning | Links | Semantics | LOD | Ontology |
|---|---|---|---|---|---|---|---|---|
| [96] | Government, economy, movies | Correlation, association mining | H | No | Yes | No | DBpedia, World FactBook,ᵃ LinkedMDB | / |
| [97] | Movies | Classification | H | No | Yes | No | DBpedia, LinkedMDB | / |
| [98] | Movies, social media | Recommender systems, classification | H | No | Yes | No | YAGO | / |
| [100,101] | Geography | Correlations | L | No | Yes | Yes | Linked Brazilian Amazon Rainforest, DBpedia | / |
| [95] | Biology, sociology, economy | Classification, regression | H | No | Yes | Yes | DBpedia | / |
| [104] | Economy, publications | Correlations | H | No | Yes | Yes | DBpedia, YAGO, LinkedGeoData,ᵇ Eurostat,ᶜ GeoNames, WHO,ᵈ Linked Energy Data,ᵉ OpenCyc,ᶠ World Factbook | / |
| [56] | News | Sentiment analysis | H | No | Yes | No | DBpedia | / |
| [109] | Music, movies, books | Linkage error detection | H | No | Yes | Yes | DBpedia, DBTropes,ᵍ Peel Sessionsʰ | / |
| [121,122] | Publications, geology | Property value prediction, link prediction | H | No | No | No | Custom dataset, British Geological Surveyⁱ | SWRCʲ |
| [123] | Bio-medicine, publications | Classification | H | No | Yes | Yes | MUTAG, ENZYMES, Semantic Web Conference Corpus,ᵏ British Geological Survey | / |
| [129] | News | Text classification | H | No | No | No | WordNet | / |
| [130] | Biomedicine | Text classification | H | No | No | No | / | UMLS |
| [131] | Pharmacology | Classification | H | No | No | No | / | NDF-RTˡ |
| [134] | Commerce | Rule learning | H | No | No | No | / | Products ontology |
| [135,136] | Movies | Association rules | H | No | No | No | / | Custom ontology |
| [139] | Medicine | Association mining | H | Yes | No | No | / | UMLSᵐ |
| [137,138] | Accident reports | Text mining, rule learning | H | No | No | No | / | Custom ontology |

a http://wifo5-03.informatik.uni-mannheim.de/factbook/.

b http://linkedgeodata.org.

c http://eurostat.linked-statistics.org/ and http://wifo5-03.informatik.uni-mannheim.de/eurostat/.

d http://gho.aksw.org/.

e http://en.openei.org/lod/.

f http://sw.opencyc.org/.

g http://skipforward.opendfki.de/wiki/DBTropes.

h http://dbtune.org/bbc/peel/.

i http://www.bgs.ac.uk/opengeoscience/.

j http://ontoware.org/swrc/.

k http://data.semanticweb.org/.

l http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NDFRT/.

m http://www.nlm.nih.gov/research/umls/.

In pattern mining and association rule mining, domain ontologies are often used to reduce the feature space in order to get more meaningful and interesting patterns. In the approach proposed by Bellandi et al. [134] several domain-specific and user-defined constraints are used, i.e., pruning constraints, used to filter uninteresting items, and abstraction constraints permitting the generalization of items towards ontology concepts. The data is first preprocessed according to the constraints extracted from the ontology, and then, the data mining step takes place. Applying the pruning constraints excludes the information that the user is not interested in, before applying the data mining approach.
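The two kinds of constraints can be illustrated with a minimal sketch. It is not the implementation of Bellandi et al. [134]; the taxonomy, the item names, and the helper functions `generalize` and `preprocess` are hypothetical and serve only to show pruning and abstraction being applied before the mining step.

```python
from typing import Dict, List, Set

# Hypothetical toy taxonomy (item -> parent concept); not taken from the cited work.
TAXONOMY: Dict[str, str] = {
    "espresso": "coffee",
    "latte": "coffee",
    "green_tea": "tea",
    "coffee": "beverage",
    "tea": "beverage",
}


def generalize(item: str, targets: Set[str]) -> str:
    """Abstraction constraint: climb the taxonomy until a target concept is reached."""
    current = item
    while current not in targets and current in TAXONOMY:
        current = TAXONOMY[current]
    # If no target concept is an ancestor, keep the item unchanged.
    return current if current in targets else item


def preprocess(transactions: List[Set[str]], pruned: Set[str],
               targets: Set[str]) -> List[Set[str]]:
    """Apply pruning constraints first, then abstraction constraints."""
    result = []
    for transaction in transactions:
        kept = {generalize(i, targets) for i in transaction if i not in pruned}
        if kept:  # drop transactions that became empty after pruning
            result.append(kept)
    return result


transactions = [
    {"espresso", "napkin", "green_tea"},
    {"latte", "napkin"},
    {"napkin"},
]
prepared = preprocess(transactions, pruned={"napkin"}, targets={"coffee", "tea"})
print(prepared)
```

The prepared transactions could then be handed to any frequent itemset or association rule miner; the mining algorithm itself is unchanged.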

Onto4AR is a constraint-based algorithm for association mining proposed by Antunes [135] and later revised in [136], in which taxonomical and non-taxonomical constraints are defined over an item ontology. The approach is interesting in that the ontology offers a high level of expressiveness for the constraints, which allows knowledge discovery to be performed at the optimal level of abstraction without the need for user input. Garcia et al. developed a technique called Knowledge Cohesion [137,138] to extract more meaningful association rules. The proposed metric is based on semantic distance, which measures how semantically close two items are within the ontology, where each type of relation is weighted differently.
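The text does not spell out how the semantic distance is computed. One plausible reading, sketched below purely as an assumption, is a shortest path over the ontology graph in which each relation type contributes its own weight. The relation weights, the toy edges, and the function names are hypothetical and do not reproduce the Knowledge Cohesion metric of [137,138].

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical weights: taxonomic links count as "closer" than other relations.
RELATION_WEIGHTS = {"is_a": 1.0, "part_of": 2.0, "related_to": 3.0}

# Hypothetical toy ontology given as (subject, relation, object) edges.
EDGES = [
    ("helmet", "is_a", "protective_gear"),
    ("gloves", "is_a", "protective_gear"),
    ("protective_gear", "related_to", "injury"),
    ("fall", "related_to", "injury"),
]

Graph = Dict[str, List[Tuple[str, float]]]


def build_graph(edges) -> Graph:
    graph: Graph = {}
    for subject, relation, obj in edges:
        weight = RELATION_WEIGHTS[relation]
        # Edges are treated as undirected for the purpose of measuring distance.
        graph.setdefault(subject, []).append((obj, weight))
        graph.setdefault(obj, []).append((subject, weight))
    return graph


def semantic_distance(graph: Graph, source: str, target: str) -> float:
    """Dijkstra shortest path; the sum of relation weights is the distance."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")  # the two items are not connected


graph = build_graph(EDGES)
print(semantic_distance(graph, "helmet", "gloves"))
print(semantic_distance(graph, "helmet", "fall"))
```

Under these assumed weights, two items sharing a direct superclass end up closer than items that are only linked through non-taxonomic relations.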

7.3. Other

Zeman et al. [139] present the Ferda DataMiner tool, which is focused on the data transformation step. In this approach, ontologies are used for two purposes: construction of adequate attribute categorization, and identification and exploitation of semantically related attributes. The authors claim that ontologies can be efficiently used for categorization of attributes, as higher-level semantics can be assigned to individual values. For example, for blood pressure there are predefined values that divide the domain in a meaningful way: blood pressure above 140/90 mm Hg, say, is considered hypertension. For the second purpose, ontologies are used to discover the relatedness between the attributes, which can be exploited to meaningfully arrange the corresponding data attributes in the data transformation phase.
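The first purpose, categorization with domain-defined cut points, can be sketched as follows. This is only an illustration and not code from Ferda DataMiner. The 140/90 mm Hg threshold comes from the example above, while the label `not_hypertension` and the reading that either value exceeding its limit suffices are assumptions.

```python
from typing import NamedTuple


class BloodPressure(NamedTuple):
    systolic: int   # mm Hg
    diastolic: int  # mm Hg


# Cut point from the text; in an ontology-driven tool it would be read from
# the domain ontology instead of being hard-coded.
HYPERTENSION_LIMIT = BloodPressure(systolic=140, diastolic=90)


def categorize(reading: BloodPressure) -> str:
    """Map a raw reading to a higher-level category.

    Assumption: the reading counts as hypertension if either value lies
    strictly above its limit.
    """
    if (reading.systolic > HYPERTENSION_LIMIT.systolic
            or reading.diastolic > HYPERTENSION_LIMIT.diastolic):
        return "hypertension"
    return "not_hypertension"


print(categorize(BloodPressure(150, 85)))
print(categorize(BloodPressure(120, 80)))
```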

section. It can be observed that at this stage of the data mining process, many approaches also exploit links between LOD datasets to identify more features. On the other hand, the features are most often generated without regard to the schema of the data, which is, in most cases, used instead for post-processing of the features, e.g., for feature selection. Likewise, reasoning is only scarcely used.

Most data mining algorithms and tools require a propositional representation, i.e., feature vectors for instances. Typical approaches for propositionalization are, e.g., adding all numerical datatype properties as numerical features, or adding all direct types as binary features. There are unsupervised and supervised methods, where for the latter, the user specifies a query for features to generate; those are useful if the user knows the LOD dataset at hand and/or has an idea which features could be valuable. While such classic propositionalization methods create human-interpretable features and thus are also applicable for descriptive data mining, kernel methods often deliver better predictive results, but at the price of losing the interpretability of those results.
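The two typical propositionalization strategies named above can be sketched in a few lines. The triples, the `ex:` identifiers, and the numeric values are made up for illustration, and `propositionalize` is a hypothetical helper rather than the API of any surveyed tool.

```python
from typing import Dict, List, Optional, Tuple, Union

Value = Union[str, int, float]
Triple = Tuple[str, str, Value]

RDF_TYPE = "rdf:type"

# Hypothetical triples in the style of a LOD dataset; the values are invented.
TRIPLES: List[Triple] = [
    ("ex:CityA", RDF_TYPE, "ex:City"),
    ("ex:CityA", "ex:population", 100),
    ("ex:CityA", "ex:label", "City A"),   # string literal: ignored
    ("ex:RiverB", RDF_TYPE, "ex:River"),
    ("ex:RiverB", "ex:length", 50.0),
]


def propositionalize(triples: List[Triple], instances: List[str]):
    """Build one feature vector per instance.

    Numerical datatype properties become numerical features; direct types
    become binary features. Missing numerical values are left as None.
    """
    rows: Dict[str, Dict[str, float]] = {i: {} for i in instances}
    for subject, predicate, obj in triples:
        if subject not in rows:
            continue
        if predicate == RDF_TYPE:
            rows[subject][f"type={obj}"] = 1.0
        elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
            rows[subject][predicate] = float(obj)
    columns = sorted({c for row in rows.values() for c in row})
    matrix: List[List[Optional[float]]] = []
    for instance in instances:
        vector: List[Optional[float]] = []
        for column in columns:
            default = 0.0 if column.startswith("type=") else None
            vector.append(rows[instance].get(column, default))
        matrix.append(vector)
    return columns, matrix


columns, matrix = propositionalize(TRIPLES, ["ex:CityA", "ex:RiverB"])
print(columns)
print(matrix)
```

Every generated column keeps the name of the property or type it came from, which is what makes such features human-interpretable.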

A crucial problem when creating explicit features from Linked Open Data is the scalability and the number of features generated. Since only a few approaches focus on identifying high-value features already at the generation step, combining feature generation with feature subset selection is clearly advised. The schema information for the LOD sources, such as type hierarchies, can be exploited for feature space reduction. There are a few algorithms exploiting the schema, which often provide a better trade-off between feature space reduction and predictive performance than schema-agnostic approaches.
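How a type hierarchy might be exploited for feature space reduction can be illustrated with a simple heuristic: drop a type feature when it takes (almost) the same values as its direct supertype. This is an assumed simplification for illustration, not a specific algorithm from the surveyed works; the feature names, the hierarchy, and the threshold are hypothetical.

```python
from typing import Dict, List


def agreement(a: List[int], b: List[int]) -> float:
    """Fraction of instances on which two binary features take the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def prune_by_hierarchy(features: Dict[str, List[int]],
                       parent_of: Dict[str, str],
                       threshold: float = 0.9) -> Dict[str, List[int]]:
    """Keep a feature unless it is nearly identical to its direct supertype."""
    kept: Dict[str, List[int]] = {}
    for name, values in features.items():
        parent = parent_of.get(name)
        if parent in features and agreement(values, features[parent]) >= threshold:
            continue  # the more specific type adds almost no information
        kept[name] = values
    return kept


# Hypothetical binary type features for four instances.
features = {
    "Thing":   [1, 1, 1, 1],
    "Person":  [1, 1, 0, 0],
    "Athlete": [1, 1, 0, 0],
    "Place":   [0, 0, 1, 1],
}
parent_of = {"Person": "Thing", "Athlete": "Person", "Place": "Thing"}

reduced = prune_by_hierarchy(features, parent_of)
print(list(reduced))
```

A schema-agnostic filter would have to compare all feature pairs to find the same redundancy, whereas the hierarchy restricts the comparison to parent and child.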

8. Data mining

After the data is selected, preprocessed and transformed into the most suitable representation, the next step is choosing the appropriate data mining task and data mining algorithm. Depending on the KDD goals and the previous steps of the process, the users need to decide which type of data mining to use, i.e., classification, regression, clustering, summarization, or outlier detection. Understanding the domain will assist in determining what kind of information is needed from the KDD process, which makes it easier for the users to make a decision. There are two broader categories of goals in data mining: prediction and description. Prediction is often referred to as supervised data mining, which attempts to forecast the possible future or unknown values of data elements. On the other hand, descriptive data mining is referred to as unsupervised data mining, which seeks to discover interpretable patterns in the data. After the strategy is selected, the most appropriate data mining algorithm should be selected. This step includes selecting methods to search for patterns in the data, and deciding on specific models and parameters of the methods.

Once the data mining method and algorithm are selected, the data mining takes place.

To the best of our knowledge, there are hardly any approaches in the literature that incorporate data published as Linked Open Data into the data mining algorithms themselves. However, many approaches use ontologies for the data mining process, not only to support the user in the stage of selecting the data mining methods, but also to guide the users through the whole process of knowledge discovery.

8.1. Domain-independent approaches

While there is no universally established data mining ontology yet, there are several data mining ontologies currently under development, such as the Knowledge Discovery (KD) Ontology [140], the KDDONTO Ontology [141], the Data Mining Workflow (DMWF) Ontology²⁸ [142], the Data Mining Optimization (DMOP) Ontology²⁹ by Hilario [143,144], OntoDM³⁰ [145,146], and its sub-ontology modules OntoDT,³¹ OntoDM-core³² [147] and OntoDM-KDD³³ [148].

An overview of existing intelligent assistants for data analysis (IDAs) that use ontologies is given in [149]. That survey categorizes all approaches by several criteria. First, it considers which types of support the intelligent assistants offer to the data analyst. Second, it surveys the kinds of background knowledge that the IDAs rely on in order to provide that support. Finally, it thoroughly compares the IDAs along the defined dimensions and identifies limitations and missing features.

One of the earliest approaches, CAMLET, proposed by Suyama et al. [150], uses two light-weight ontologies of machine learning entities to support the automatic composition of inductive learning systems.

Among the first prototypes is the Intelligent Discovery Assistant proposed by Bernstein et al. [151], which provides users with systematic enumerations of valid sequences of data mining operators. The tool is able to determine the characteristics of the data and of the desired mining result, and uses an ontology to search for and enumerate the KDD processes that are valid for producing the desired result from the given data. The tool also assists the user in selecting the processes to execute, by ranking the processes according to what is important to the user. A light-weight ontology is used that contains only a hierarchy of data mining operators divided into three main classes: preprocessing operators, induction algorithms and post-processing operators.
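The idea of enumerating and ranking valid operator sequences can be made concrete with a small sketch. It does not reproduce the ontology or the search procedure of Bernstein et al. [151]. The operators, their preconditions and effects, the data properties, and the cost used for ranking are all hypothetical; only the three operator classes are taken from the description above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterator, Tuple

fs = frozenset


@dataclass(frozen=True)
class Operator:
    name: str
    kind: str                        # "pre", "induction" or "post"
    requires: FrozenSet[str] = fs()  # properties that must hold
    forbids: FrozenSet[str] = fs()   # properties that must not hold
    adds: FrozenSet[str] = fs()
    removes: FrozenSet[str] = fs()
    cost: float = 1.0                # hypothetical user-defined ranking criterion


STAGES = ("pre", "induction", "post")

OPERATORS = (
    Operator("impute_missing", "pre", requires=fs({"has_missing"}),
             removes=fs({"has_missing"})),
    Operator("discretize", "pre", requires=fs({"numeric"}),
             adds=fs({"nominal"}), removes=fs({"numeric"})),
    Operator("decision_tree", "induction", forbids=fs({"has_missing", "model"}),
             adds=fs({"model"}), cost=4.0),
    Operator("naive_bayes", "induction", requires=fs({"nominal"}),
             forbids=fs({"has_missing", "model"}), adds=fs({"model"}), cost=2.0),
    Operator("prune_model", "post", requires=fs({"model"}), adds=fs({"pruned"})),
)

Plan = Tuple[Operator, ...]


def enumerate_plans(state: FrozenSet[str], plan: Plan = (),
                    stage: int = 0) -> Iterator[Plan]:
    """Yield every valid operator sequence that ends with a trained model."""
    if "model" in state:
        yield plan
    for op in OPERATORS:
        op_stage = STAGES.index(op.kind)
        applicable = op.requires <= state and not (op.forbids & state)
        # Each operator is used at most once and stages never go backwards.
        if op in plan or op_stage < stage or not applicable:
            continue
        yield from enumerate_plans((state - op.removes) | op.adds,
                                   plan + (op,), op_stage)


def total_cost(plan: Plan) -> float:
    return sum(op.cost for op in plan)


plans = list(enumerate_plans(fs({"numeric", "has_missing"})))
ranked = sorted(plans, key=total_cost)
for plan in ranked[:3]:
    print(total_cost(plan), [op.name for op in plan])
```

Replacing `total_cost` with another key function changes the ranking without touching the enumeration, which mirrors the separation between finding valid processes and ordering them by user preference.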

Many approaches use Semantic Web technologies to assist the user in building complex data mining workflows. Žáková et al. [152,140] proposed an approach for semiautomatic workflow generation that requires only the user's input and desired output to generate complete data mining workflows. To implement the approach, the authors first developed the knowledge discovery ontology, which gives a formal representation of knowledge types and data mining algorithms. Second, a planning algorithm is implemented that assembles workflows based on the planning task descriptions extracted from the knowledge discovery ontology and the given user's input–output task requirements. In such a semiautomatic environment the user is not required to be aware of the numerous properties of the wide range of relevant data mining algorithms. In their later work, the methodology is implemented in the Orange4WS environment for service-oriented data mining [153,154].

Diamantini et al. [155] introduce a semantics-based, service-oriented framework for tool sharing and reuse, giving advanced support for semantic enrichment through semantic annotation

²⁸ http://www.e-lico.eu/dmwf.html.
²⁹ http://www.e-lico.eu/DMOP.html.
³⁰ http://www.ontodm.com/doku.php.
³¹ http://www.ontodm.com/doku.php?id=ontodt.
³² http://www.ontodm.com/doku.php?id=ontodm-core.
³³ http://www.ontodm.com/doku.php?id=ontodm-kdd.