specifically to study methods for the preservation of Linked (Open) Data, taking as its starting point the consolidated results on the preservation of databases and of web content. Through the analysis of two specific cases (DBpedia and Europeana), the PreLiDa team was able to identify the main preservation problems of Linked Open Data, pinpointing the technical, organizational and economic challenges that must be addressed in order to preserve this particular kind of digital object. In particular, DBpedia aims to extract information from Wikipedia and make it freely available on the web as Linked Data: the data are extracted in RDF format and can be retrieved as web pages or by querying the SPARQL endpoint. DBpedia also collaborates with Wikipedia on the preservation of the extracted datasets, handling updates and changes. The different versions of the data are preserved through dumps, with a version-control mechanism that tracks the changes made to Wikipedia pages in order to extract them and transform them into RDF, updating the dataset and preserving the relevant metadata (for example, the creation or modification date of the data, or the name of the responsible user). External links are also preserved, but not the content of those linked resources. The preservation formats are Turtle, Quad-Turtle and CSV, but the rendering and query software is not preserved: adopting open formats for preservation has the advantage of reducing future data-access problems, but the decision to preserve neither the content of external links nor the software means that users must rely on their own SPARQL endpoint to query the dataset and cannot extend queries to directly or indirectly linked datasets.
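The excerpt notes that DBpedia data can be retrieved by querying the SPARQL endpoint. As a minimal sketch of what such a request looks like, the following builds a GET URL for the public DBpedia endpoint using only the Python standard library; the helper name and the example query are illustrative assumptions, not part of PreLiDa or DBpedia.

```python
from urllib.parse import urlencode

# Hypothetical helper: builds the GET request URL a client would send to a
# SPARQL endpoint. The endpoint address is DBpedia's public one; the query
# below is only an example.
def sparql_request_url(endpoint, query, fmt="application/sparql-results+json"):
    return endpoint + "?" + urlencode({"query": query, "format": fmt})

query = """PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Wikipedia> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}"""

url = sparql_request_url("https://dbpedia.org/sparql", query)
```

Sending this URL over HTTP would return the query results; because DBpedia does not preserve the query software itself, this kind of client-side access is exactly what a future user would have to reconstruct.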
Abstract. The production of machine-readable data in the form of RDF datasets belonging to the Linked Open Data (LOD) Cloud is growing very fast. However, selecting relevant knowledge sources from the Cloud, assessing their quality and extracting synthetic information from a LOD source are all tasks that require a strong human effort. This paper proposes an approach for the automatic extraction of the most representative information from a LOD source and the creation of a set of indexes that enhance the description of the dataset. These indexes collect statistical information regarding the size and the complexity of the dataset (e.g. the number of instances), but also depict all the instantiated classes and the properties among them, supplying users with a synthetic view of the LOD source. The technique is fully implemented in LODeX, a tool able to deal with the performance issues of systems that expose SPARQL endpoints and to cope with the heterogeneity in the knowledge representation of RDF data. An evaluation of LODeX on a large number of endpoints (244) belonging to the LOD Cloud has been performed, and the effectiveness of the index extraction process is presented.
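The indexes described above collect counts of instances, instantiated classes, and the properties linking them. A minimal sketch of that extraction over an in-memory set of triples (toy data; LODeX itself works against live SPARQL endpoints, and all names here are illustrative):

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Toy triples standing in for a LOD dataset: (subject, predicate, object).
triples = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob",   RDF_TYPE, "ex:Person"),
    ("ex:acme",  RDF_TYPE, "ex:Company"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob",   "ex:knows",    "ex:alice"),
]

def dataset_index(triples):
    """Collect the kind of statistics the indexes describe: dataset size,
    instantiated classes, and the properties between those classes."""
    type_of = {s: o for s, p, o in triples if p == RDF_TYPE}
    classes = Counter(type_of.values())   # class -> number of instances
    # schema-level summary: which properties connect which classes
    links = {
        (type_of[s], p, type_of[o])
        for s, p, o in triples
        if p != RDF_TYPE and s in type_of and o in type_of
    }
    return {"triples": len(triples), "classes": dict(classes), "links": links}

idx = dataset_index(triples)
```

Such a summary gives a user a synthetic view of an unfamiliar source without having to browse its instances.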
A pilot study is reported on developing the basic Linguistic Linked Open Data (LLOD) infrastructure for hashtags from social media posts. Our goal is the encoding of linguistically and semantically enriched hashtags in a formally compact way using the machine-readable OntoLex model. Initial hashtag processing consists of data-driven decomposition of multi-element hashtags, the linking of spelling variants, and part-of-speech analysis of the elements. Then we explain how the OntoLex model is used both to encode and to enrich the hashtags and their elements by linking them to existing semantic and lexical LOD resources: DBpedia and Wiktionary.
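The initial processing step, decomposition of multi-element hashtags and the linking of spelling variants, can be illustrated with a small sketch. The paper's decomposition is data-driven; this toy version only handles the easy case of CamelCase segmentation, and both function names are assumptions for illustration:

```python
import re

def decompose_hashtag(tag):
    """Split a CamelCase hashtag into its elements (toy decomposition)."""
    body = tag.lstrip("#")
    # uppercase runs, capitalized/lowercase words, and digit runs
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", body)
    return parts or [body]

def canonical_form(tag):
    """Map spelling variants (#SemanticWeb, #semanticweb) to one key,
    so variant hashtags can be linked to the same lexical entry."""
    return tag.lstrip("#").lower()
```

The canonical form gives a joint key under which variants of one hashtag can be collected before OntoLex encoding.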
Linked Open Data opens up a promising opportunity for machine learning in terms of feature learning from large-scale and ever-growing graph-based knowledge sources. In this paper, we present a hybrid approach for automatic entity typing and type alignment. We experimented with three different strategies for type alignment. The evaluation results suggest that LOD can contribute extremely rich semantic information compared with WordNet, particularly for complex multiword schema terms. Even though the type alignment directly suggested by LOD suffers from low quality, the corresponding concept hierarchies from the multiple community-driven classification schemes can contribute very effective semantic evidence for facilitating the alignment task with respect to similarity and relatedness measurement.
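The observation that community-driven concept hierarchies supply useful similarity evidence can be illustrated with a Wu-Palmer-style similarity over a toy class hierarchy. The hierarchy and the function below are illustrative assumptions, not the paper's actual alignment strategies:

```python
# Toy class hierarchy (child -> parent), standing in for a community scheme.
PARENT = {
    "SoccerPlayer": "Athlete",
    "Athlete": "Person",
    "Politician": "Person",
    "Person": "Thing",
    "Thing": None,
}

def ancestors(c):
    """Chain from a class up to the root, including the class itself."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT.get(c)
    return chain

def hierarchy_similarity(a, b):
    """Wu-Palmer-style score: depth of the lowest common ancestor
    relative to the depths of the two classes."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    common = next(c for c in anc_a if c in anc_b)  # lowest common ancestor
    depth = lambda c: len(ancestors(c))
    return 2 * depth(common) / (depth(a) + depth(b))
```

Scores like these let an aligner prefer a type whose hierarchy position is close to the candidate's, even when the directly suggested alignment is noisy.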
The World Wide Web has made it easier than ever to share knowledge with others. Web pages are connected through hyperlinks; together they form a giant linked collection of documents. Consequently, there is now an abundance of data freely available to us. Public bodies such as governments and research initiatives offer many different data sets on a variety of topics. Despite the abundance of data, interoperability is lacking. It is still a difficult task to combine data from many different sources. Most databases require distinct ways to access the data and have their data structured according to different standards. The implicit relationship between two data sets cannot be interpreted by machines. By applying the same principles the Web uses to link documents, the concept of linked (open) data aims to solve the problem of separated data and define explicit relationships to make the data.
In this paper we present work done towards populating a domain ontology using a public knowledge base like DBpedia. Using an academic ontology as our target, we identify mappings between a subset of its predicates and those in DBpedia and other linked datasets. In the Semantic Web context, ontology mapping allows the linking of independently developed ontologies and the interoperation of heterogeneous resources. Linked Open Data is an initiative in this direction. We populate our ontology by querying the linked open datasets to extract instances from these resources. We show how these, along with Semantic Web standards and tools, enable us to populate the academic ontology. The resulting instances could then be used as seeds in the spirit of the typical bootstrapping paradigm.
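The predicate mappings described above can be pictured as a simple translation table from the academic ontology's vocabulary to DBpedia's. The sketch below uses hypothetical predicate names on both sides, purely for illustration:

```python
# Hypothetical mapping from an academic ontology's predicates to DBpedia's.
PREDICATE_MAP = {
    "acad:advisor":   "dbo:doctoralAdvisor",
    "acad:almaMater": "dbo:almaMater",
    "acad:memberOf":  "dbo:institution",
}

def translate_triples(triples, mapping):
    """Rewrite source-ontology predicates into the target vocabulary;
    triples whose predicate has no mapping are kept unchanged."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]
```

With such a table, queries against the linked datasets can be phrased in the target vocabulary and their results carried back as instances of the academic ontology.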
several important Linked Open Data datasets. It enables users to easily identify resources in the LOD cloud by providing a general unified method for querying a whole group of datasets. FactForge is also designed as a use case for large-scale reasoning and data integration. In brief, the datasets are unified via a common ontology, PROTON, whose concepts are mapped to the concepts of the involved LOD datasets. We do this with a set of rules, each of which maps a PROTON class or property to the corresponding class or property of the other ontologies. This mechanism of constructing a reason-able view over selected LOD datasets ensures that redundant instance representations (classes and properties) are cleaned up as much as possible. The instances are grouped into equivalence classes of instances. Finally, the instances in these datasets are linked via owl:sameAs statements. FactForge development can be divided into six main steps: 1. Selecting the LOD datasets
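Grouping instances linked by owl:sameAs statements into equivalence classes, as described above, is essentially a union-find computation. A minimal sketch over toy sameAs links (the URIs are illustrative, not FactForge's actual implementation):

```python
class UnionFind:
    """Union-find over instance URIs, for owl:sameAs equivalence classes."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Toy owl:sameAs statements linking instance URIs across datasets.
same_as = [
    ("dbpedia:Berlin", "geonames:2950159"),
    ("geonames:2950159", "freebase:m.03hrz"),
]

uf = UnionFind()
for a, b in same_as:
    uf.union(a, b)
```

After the unions, every member of an equivalence class resolves to the same representative, so queries over the unified view can treat the linked instances as one resource.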
Mass adoption of the Semantic Web's vision will not become a reality unless the benefits provided by data published under the Linked Open Data principles are understood by the majority of users. As technical and implementation details are far from interesting for lay users, the ability of machines and algorithms to understand what the data is about should provide smarter summarisations of the available data. Visualization of Linked Open Data presents itself as an ideal strategy to ease access to information for all users, saving them the time of learning what the dataset is about and requiring no knowledge of semantics. This article collects previous studies from the Information Visualization and Exploratory Data Analysis fields in order to apply the lessons learned to Linked Open Data visualization. Datatype analysis and the visualization tasks proposed by Ben Shneiderman are also incorporated into the research to cover different visualization features.
The recent developments in the field of Linked Open Data (LOD) have further added value to the already vast amount of structured information that is available through the use of LOD [Bizer et al., 2008]. The availability of such an amount of structured information suggests applying LOD to the semi-automatic generation of knowledge. In the context of our work, knowledge represents the data that is organized in the knowledge containers of a Case-Based Reasoning (CBR) system. Each CBR system uses the four knowledge containers: vocabulary, similarity measures, transformational knowledge and cases. The work presented in this paper focuses on filling the vocabulary and similarity measure containers as well as generating cases from LOD. Therefore, we demonstrate some possibilities of filling the aforementioned knowledge containers using simplified examples. These examples are preliminary and incomplete, and thus cannot be used in a real-world CBR system.
The OWLG represents an open forum for interested individuals to address these and related issues. At the time of writing, the group consists of about 100 people from 20 different countries. Our group is relatively small, but continuously growing and sufficiently heterogeneous. It includes people from library science, typology, historical linguistics, cognitive science, computational linguistics, and information technology; the ground for fruitful interdisciplinary discussions has been laid out. One concrete result emerging out of collaborations between a large number of OWLG members is the LLOD cloud as already sketched above. The emergence of the LLOD cloud out of a set of isolated resources was accompanied and facilitated by a series of workshops and publications organized under the umbrella of the OWLG, including the Open Linguistics track at the Open Knowledge Conference (OKCon-2010, July 2010, Berlin, Germany), the First Workshop on Linked Data in Linguistics (LDL-2012, March 2012, Frankfurt am Main, Germany), the Workshop on Multilingual Linked Open Data for Enterprises (MLODE-2012, September 2012, Leipzig, Germany), and the Linked Data for Linguistic Typology track at ALT-2012 (September 2013, Leipzig, Germany). Plans to create an LLOD cloud were first publicly announced at LDL-2012, and subsequently a first instance of the LLOD cloud materialized as a result of the MLODE-2012 workshop, its accompanying hackathon and the data postproceedings that will appear as a special issue of the Semantic Web Journal (SWJ). The Second Workshop on Linked Data in Linguistics (LDL-2013) continues this series of workshops. In order to further contribute to the integration of the field, it is organized as a joint event of the OWLG and the W3C Ontology-Lexica Community Group.
But that would be too easy. Unfortunately, ambiguity comes into play. For example, Mary Washington College leads to two competing interpretations, Mary Washington College and Mary Washington College. The complexity is therefore not much reduced, as we now work with competing entity names (rather than competing strings), each of which could in turn lead to multiple entities. We differentiate entity names from entities. Entity names are surface forms that exist in DBpedia, but they can lead to many different entities (word senses or actual named entities). There are usually disambiguation markers (e.g. New York (disambiguation)) to show links between entity names and entities. There are also "redirects" links in DBpedia (and Wikipedia), which can be tricky to use, as some of them are true synonyms (e.g. automobile and car) but others are just related items (e.g. video and Audio-visual). Using a structured linked open data resource brings a completely new dimension, as we now work with entities and entity names instead of surface strings as for the frequency-based resources. Table 4 shows all existing entity names in DBpedia with their number of word senses for the complex compound New York Stock Exchange Composite Trading. Examples of the entities are also shown, to illustrate the different relations between entity names and entities. Entity names can be abbreviations (New - Net economic welfare), shorter forms (Exchange - Heat exchange), or domain-specific terms (Composite -
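The distinction between entity names and entities, with redirects and disambiguation markers mediating between them, can be sketched as a two-step lookup. The tables below are toy stand-ins for DBpedia's redirect and disambiguation links:

```python
# Toy stand-ins for DBpedia's redirect and disambiguation link tables.
REDIRECTS = {"Automobile": "Car"}            # a true-synonym redirect
DISAMBIG = {
    "New York": ["New York City", "New York (state)"],
}

def candidate_entities(surface):
    """Resolve an entity name to its candidate entities: follow a redirect
    first, then expand a disambiguation page if one exists."""
    name = REDIRECTS.get(surface, surface)
    return DISAMBIG.get(name, [name])
```

This makes the ambiguity problem explicit: one surface form can expand into several candidate entities, which a downstream step must still disambiguate.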
Abstract. Extracting structured information from text plays a crucial role in automatic knowledge acquisition and is at the core of any knowledge representation and reasoning system. Traditional methods rely on hand-crafted rules and are restricted by the performance of various linguistic pre-processing tools. More recent approaches rely on supervised learning of relations trained on labelled examples, which can be manually created or sometimes automatically generated (referred to as distant supervision). We propose a supervised method for entity typing and alignment. We argue that a rich feature space can improve extraction accuracy, and we propose to exploit Linked Open Data (LOD) for feature enrichment. Our approach is tested on task 2 of the Open Knowledge Extraction challenge, covering automatic entity typing and alignment. We demonstrate that combining evidence derived from LOD (e.g. DBpedia) and conventional lexical resources (e.g. WordNet) (i) improves the accuracy of the supervised induction method and (ii) enables easy matching with the Dolce+DnS Ultra Lite ontology classes.
In recent years, many linguistic resources have been released as Linked Data (Chiarcos et al., 2011). Most of the datasets that are part of the so-called Linguistic Linked Open Data (LLOD) cloud consist of dictionaries, written corpora or lexica. However, multimodal datasets are currently heavily underrepresented. In order to address this gap, we describe a framework supporting the easy publication of multimodal data as RDF / Linked Data, which is based on an existing multimodal data model and on the Rails framework. In this paper we describe our approach and summarize our experiences. In particular, we describe our experiences in releasing a multimodal corpus based on an online chat game as Linked Data. The corpus consists of chats and related actions in an object arrangement game using a computer-mediated setting. It contains multiple forms of annotation, including primary material such as text transcripts and information about object movements as
A second resource in the linked open data network into which ChEMBL-RDF was integrated is Chem2Bio2RDF. Chem2Bio2RDF is a single RDF repository covering over twenty public data resources pertaining to drugs, chemical compounds, carcinogens, protein targets, genes, diseases, side effects, pathways and their relations. The entities and their relations were further annotated with the Chem2Bio2OWL ontology, making it a rich semantic resource for integrative searches and data mining. The ChEMBL-RDF set was uploaded into the Chem2Bio2RDF triple store, enabling queries linking ChEMBL with other entities in Chem2Bio2RDF. Since both ChEMBL-RDF and Chem2Bio2RDF use InChI keys to represent chemical compounds and adopt Bio2RDF protein identifiers to represent targets, queries can easily be constructed to link the two. For instance, to investigate the relations between drug side effects and their targets (usually off-target effects), a query was created to link the targets in ChEMBL to side effects (e.g., heart disease) in Chem2Bio2RDF via side-effect-related drugs and their bioassay activities. In this case, 36 drugs causing heart disease were linked to 87 unique protein targets (IC50 < 10 μM). The top two most com-
Abstract. The Linked Open Data cloud (LOD) is essentially read-only, restraining the possibility of collaborative knowledge construction. To support collaboration, we need to make the LOD writable. In this paper, we propose a vision for a writable linked data cloud where each LOD participant can define updatable materialized views over data hosted by other participants. Consequently, building a writable LOD can be reduced to the problem of SPARQL self-maintenance of Select-Union recursive materialized views. We propose TM-Graph, an RDF graph annotated with elements of a specialized provenance semiring to maintain the consistency of these views, and we analyze complexity in space and traffic.
It is one of the reasons why, in recent years, researchers have started to explore ways to enhance available data sources with additional information from openly accessible and interlinked datasets (Sect. 5.2.2). The entirety of these public data collections is also known as the Linked Open Data cloud. It consists of a large number of repositories, which can be accessed over the web. The LOD cloud provides metadata descriptions for many real-world entities (e.g., multimedia items or travel destinations) that link to each other by utilizing the Resource Description Framework (RDF). RDF is a vocabulary language for structuring and publishing data sources in a graph-like fashion. The result is a network in which the nodes represent entities and the edges are the relations between these entities. Another advantage of this kind of markup is that it describes data semantically, i.e., links have a meaning. RDF graphs are machine-readable and can be freely used by software applications. Hence the data web is similar to the World Wide Web (WWW) because of its open accessibility and decentralized nature. Additionally, it can be processed without any manual intervention (Sect. 3.1). Over the last years, the LOD cloud has grown into a huge knowledge graph (Sect. 3.2), containing valuable information about items from many domains and data collections. It is a manifestation of the vision of a Semantic Web that was introduced by Tim Berners-Lee in 2001. The most prominent data collection in the LOD cloud is DBpedia. It contains structured information from Wikipedia articles. Hence, any third party can access data from the prominent online reference repository and make use of it. Another positive feature of the data cloud is that it contains connections between data collections. For instance, some real-world objects occur in more than one LOD repository. Correspondence links between the collections identify matching objects.
In this way, data can be queried and aggregated across repositories, such that software agents can discover new connections and interesting information (Sect. 3.2).
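Cross-repository aggregation via correspondence links can be sketched as merging the property sets of linked entity descriptions. The repositories, identifiers and property values below are toy data, not real LOD content:

```python
# Toy per-repository descriptions of one real-world object, plus a
# correspondence (owl:sameAs-style) link between the two repositories.
descriptions = {
    ("dbpedia", "Berlin"):   {"population": 3645000},
    ("geonames", "2950159"): {"lat": 52.52, "lon": 13.40},
}
links = [(("dbpedia", "Berlin"), ("geonames", "2950159"))]

def aggregate(descriptions, links):
    """Merge the property sets of entities that correspondence links
    identify as the same object, as a software agent might do."""
    merged = {}
    for a, b in links:
        merged[a] = {**descriptions[a], **descriptions[b]}
    return merged

view = aggregate(descriptions, links)
```

The merged record combines facts that no single repository held on its own, which is precisely the payoff of the correspondence links.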
An interview, undertaken with the responsible functionary of the initiating government department, clarified their motivations in relation to Open Government Data. To achieve the desired objectives, a Linked Open Data environment for research information was created, based on the principle that data which is publicly financed should be properly accessible to the public. However, its capacities were still limited and the information was not as complete as it could be. The data was published in an infrastructure designed to facilitate exploration and ultimately to add value, but accessibility remains curtailed, in particular for non-expert users. Therefore, the development of visual exploration workflows on top of the LOD was encouraged, providing access to complex data and an understanding of the meaningful correlations emerging from the links.
then provided the hosting, and a community evolved which created links and applications. Although it is difficult to determine whether open licenses are a necessary or sufficient condition for the collaborative evolution of a data set, the opposite is quite obvious: closed licenses or unclearly licensed data are an impediment to an architecture that is focused on (re-)publishing and linking of data. Several data sets which were converted to RDF by members of the OWLG could not be re-published due to licensing issues. In particular, these include the Leipzig Corpora Collection (LCC, (Quasthoff et al., 2009)) and the RDF data used in the TIGER Corpus Navigator (Hellmann et al., 2010). Very often (as is the case in the previous two examples), the reason for closed licenses is the strict copyright of the primary data (such as newspaper texts), and researchers are unable to publish their resulting data. The open part of the American National Corpus (OANC), on the other hand, has been converted to RDF and was re-published successfully using
Results in all experiments are computed using 10-fold cross-validation over 30 runs with different random splits of the data to test their significance. The null hypothesis to be tested is that, for a given dataset, the baseline lexicons and the lexicons adapted by our models will have the same performance. We test this hypothesis using the paired t-test, since it examines the mean of the changes in performance and reports whether this mean of the differences is statistically significant. Note that all the F1-measure results reported in this section are statistically significant with p < 0.001.
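The paired t-test statistic used here is simply the mean of the per-run performance differences divided by its standard error. A minimal sketch follows; the F1 scores are made-up illustrative numbers, and turning t into a p-value additionally requires the t distribution with n-1 degrees of freedom:

```python
import math

def paired_t_statistic(baseline, adapted):
    """t statistic of the paired t-test: mean of the per-run differences
    divided by its standard error."""
    d = [a - b for a, b in zip(adapted, baseline)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Illustrative F1 scores over four random splits (made-up numbers).
baseline = [0.80, 0.80, 0.83, 0.81]
adapted  = [0.81, 0.83, 0.85, 0.82]
t = paired_t_statistic(baseline, adapted)
```

A paired test is the right choice here because both lexicons are evaluated on the same random splits, so the per-split differences cancel out split-to-split variance.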
Linguistics, natural language processing, and related disciplines share a fundamental interest in language resources and their availability beyond individual research groups. This is necessary not only to fulfill fundamental principles of science (replicability), but also to facilitate subsequent re-use of resources created from public funding, e.g., as training data for novel tools, as a basis to increase the amount of data available for quantitative analyses, or as a component of innovative applications. The latter may include quite unforeseen uses, as in the case of the psycholinguistic resource WordNet (Fellbaum, 1998) turning into a significant component in numerous information technology systems.