Conclusions and Future Work - Efficient Extraction and Query Benchmarking of Wikipedia Data

This chapter summarizes our research, highlights our main contributions, and draws overall conclusions. It then outlines the directions in which the work in each area can be extended.

8.1. Conclusions

Each direction of our research has its own value and benefits. The following subsections discuss the significance of each direction in detail.

8.1.1. DBpedia Live Extraction

Because Wikipedia articles are updated continuously, DBpedia needs to be updated accordingly. We proposed a framework that retrieves updates from Wikipedia as they occur, extracts RDF data from them, and stores this data in a triplestore. Our new revision of the DBpedia Live extraction framework adds a number of features and, in particular, solves the following issues:

MediaWiki templates in article abstracts are now rendered properly.

Changes of mappings in the DBpedia mappings wiki are now retrospectively applied to all potentially affected articles.

Updates can now be propagated easily, i.e. DBpedia Live mirrors can now get recent updates from our framework, in order to be kept in sync.
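The mirror synchronization described in the last item can be sketched as follows. This is a minimal illustration, assuming that each update is delivered as a pair of triple sets (removed and added); the names `apply_changeset` and `sync_mirror` and the in-memory set that stands in for a triplestore are hypothetical and are not part of the framework.

```python
from typing import Iterable, List, Set, Tuple

# A triple is modelled as (subject, predicate, object) strings (assumption).
Triple = Tuple[str, str, str]
Changeset = Tuple[Iterable[Triple], Iterable[Triple]]  # (removed, added)


def apply_changeset(store: Set[Triple],
                    removed: Iterable[Triple],
                    added: Iterable[Triple]) -> Set[Triple]:
    """Return a new triple set with one changeset applied.

    Removals are applied before additions, so a triple that is deleted
    and re-added within the same changeset is still present afterwards.
    """
    updated = set(store)
    updated.difference_update(removed)
    updated.update(added)
    return updated


def sync_mirror(store: Set[Triple], changesets: List[Changeset]) -> Set[Triple]:
    """Apply changesets in the order in which they were published."""
    for removed, added in changesets:
        store = apply_changeset(store, removed, added)
    return store
```

Applying changesets strictly in publication order is the design choice that matters here: a mirror that skips or reorders a changeset may end up with triples that the main endpoint has already replaced.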

Many users can benefit from DBpedia, not only computer scientists. We have pointed out some ways in which librarians and libraries can make use of DBpedia and become part of the emerging Web of Data. We see great potential for libraries to become centers of excellence in knowledge management on the Web of Data. Just as libraries supported knowledge exchange through books in previous centuries, they now have the opportunity to extend their scope towards supporting knowledge exchange through structured data and ontologies on the Web of Data. Given the wealth and diversity of structured knowledge already available in DBpedia and other datasets on the Web of Data, many other scientists, e.g. in the life sciences, humanities, or engineering, would benefit considerably from such a development.

8.1.2. DBPSB

We proposed the DBPSB benchmark for evaluating the performance of triplestores based on non-artificial data and queries. Our solution was implemented for the DBpedia dataset and tested with four different triplestores, namely Virtuoso, Sesame, Jena-TDB, and BigOWLIM. The main advantage of our benchmark over previous work is that it uses real RDF data with typical graph characteristics, including a large and heterogeneous schema part. Furthermore, by basing the benchmark on queries posed to DBpedia, we intend to spur innovation in triplestore performance optimization towards scenarios that are actually important for end users and applications. We applied query analysis and clustering techniques to obtain a diverse set of queries corresponding to feature combinations of SPARQL queries. Query variability was introduced to render simple caching techniques of triplestores ineffective.
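The idea of query variability can be illustrated with a small sketch: a query template contains a placeholder that is replaced by a different value for each run, so that a store cannot answer repeated runs from a cache of identical query strings. The placeholder syntax `%%v%%`, the function name `instantiate`, and the example values are assumptions made for illustration only.

```python
import random
from typing import List, Sequence

PLACEHOLDER = "%%v%%"  # assumed placeholder syntax, for illustration only


def instantiate(template: str, values: Sequence[str],
                n: int, seed: int = 0) -> List[str]:
    """Create up to n distinct query strings from one template.

    Values are drawn without replacement, so no two generated queries
    are textually identical as long as the values themselves differ.
    """
    rng = random.Random(seed)  # fixed seed keeps benchmark runs repeatable
    chosen = rng.sample(list(values), k=min(n, len(values)))
    return [template.replace(PLACEHOLDER, value) for value in chosen]


template = "SELECT ?p ?o WHERE { %%v%% ?p ?o } LIMIT 10"
resources = ["<http://dbpedia.org/resource/Leipzig>",
             "<http://dbpedia.org/resource/Berlin>",
             "<http://dbpedia.org/resource/Dresden>"]
queries = instantiate(template, resources, n=3)
```

Seeding the random generator is a deliberate choice in this sketch: it keeps the generated query mix identical across the stores under test, so that differences in runtime are not caused by different query instances.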

The benchmarking results we obtained reveal that real-world usage scenarios can have substantially different characteristics than the scenarios assumed by prior RDF benchmarks. Our results are more diverse and indicate less homogeneity than what is suggested by other benchmarks. DBPSB reflects the creativity and inaptness of real users when constructing SPARQL queries and reveals, for a given triplestore and dataset size, the most costly SPARQL feature combinations.

8.1.3. DeFacto

DeFacto enables checking the validity of a given RDF triple. Given a test statement, it returns a confidence value for it as well as possible evidence for that statement. The evidence consists of a set of webpages, textual excerpts from those pages, and meta-information on the pages. These excerpts and the associated meta-information allow the user to quickly get an overview of possible credible sources for the input statement. Instead of having to use search engines, browse several webpages, and look for the relevant pieces of information in each of them, the user can review the information presented by DeFacto more efficiently.
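The shape of such a result can be sketched as a small data structure. This is only an illustration of the description above; the class names `Evidence` and `ValidationResult`, their fields, and the example values are hypothetical and do not reflect DeFacto's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Evidence:
    """One webpage that may support the statement (hypothetical model)."""
    url: str
    excerpts: List[str]
    meta: Dict[str, str] = field(default_factory=dict)


@dataclass
class ValidationResult:
    """Confidence value plus supporting evidence for one RDF triple."""
    triple: Tuple[str, str, str]
    confidence: float
    evidence: List[Evidence] = field(default_factory=list)

    def excerpts_by_source(self) -> List[Tuple[str, str]]:
        """Flatten the evidence into (url, excerpt) pairs for quick review."""
        return [(e.url, text) for e in self.evidence for text in e.excerpts]


result = ValidationResult(
    triple=("dbr:Example_Person", "dbo:birthPlace", "dbr:Example_City"),
    confidence=0.5,  # illustrative value, not a measured result
    evidence=[Evidence(url="http://example.org/page",
                       excerpts=["Example Person was born in Example City."],
                       meta={"language": "en"})],
)
```

Flattening the evidence into (url, excerpt) pairs mirrors the review workflow described above, in which the user scans excerpts together with their sources instead of opening each page.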

8.2. Future Work

Each research area has its own directions in which the work can be extended.

8.2.1. DBpedia Live Extraction

There are several directions in which we aim to extend the DBpedia Live framework:

Support of other languages: Currently, the framework supports only the English Wikipedia edition, and recently the Dutch DBpedia Live has also been developed. We plan to extend our framework to include other languages as well. The main advantage of such a multi-lingual extension is that infoboxes within different Wikipedia editions cover different aspects of an entity at varying degrees of completeness. For instance, the Italian Wikipedia contains more knowledge about Italian cities and villages than the English one, while the German Wikipedia contains more structured information about people than the English edition. This leads to an increase in the quality of extracted data compared to knowledge bases that are derived from single Wikipedia editions. Moreover, this also helps in detecting inconsistencies across different Wikipedia and DBpedia editions.

Wikipedia article augmentation: Interlinking DBpedia with other data sources makes it possible to develop a MediaWiki extension that augments Wikipedia articles with additional information as well as media items (e.g. pictures and audio) from these sources. For instance, a Wikipedia page about a geographic location such as a city or a monument can be augmented with additional pictures from Web data sources such as Flickr or with additional facts from statistical data sources such as Eurostat or the CIA Factbook.

Wikipedia consistency checking: The extraction of different Wikipedia editions along with interlinking DBpedia with external Web knowledge builds the base for detecting inconsistencies in Wikipedia content. For instance, whenever a Wikipedia author edits an infobox within a Wikipedia article, the new content of the infobox could be checked against external data sources and information extracted from other language editions. Inconsistencies could be pointed out along with proposals on how to solve these inconsistencies. In this way, DBpedia can provide feedback to Wikipedia maintainers in order to keep Wikipedia data more consistent, which eventually may also lead to an increase of the quality of data in Wikipedia.
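A consistency check of this kind could, in its simplest form, compare the value of one infobox property across language editions and report the editions that disagree. The following sketch assumes that the values have already been extracted and normalized to strings; the function names and the example values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_value(values_by_edition: Dict[str, str]) -> Dict[str, List[str]]:
    """Group language editions by the value they state for one property."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for edition, value in sorted(values_by_edition.items()):
        groups[value.strip()].append(edition)
    return dict(groups)


def is_consistent(values_by_edition: Dict[str, str]) -> bool:
    """True if all editions agree on a single value (or none state one)."""
    return len(group_by_value(values_by_edition)) <= 1


# Illustrative values for one property of one entity, not real data.
founding_year = {"en": "1409", "de": "1409", "it": "1408"}
groups = group_by_value(founding_year)
```

The sketch only detects disagreement. Deciding which value is correct, or proposing a fix to the Wikipedia author, would require additional evidence such as the external data sources mentioned above.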

8.2.2. DBPSB

Several improvements can be envisioned in future work to cover a wider spectrum of features in DBPSB:

Coverage of more SPARQL 1.1 features, e.g. reasoning and subqueries.

Inclusion of further triplestores and continuous usage of the most recent DBpedia query logs.

Testing of SPARQL update performance via DBpedia Live, which is modified several thousand times each day. In particular, an analysis of the dependency of query performance on the dataset update rate could be performed.

8.2.3. DeFacto

DeFacto can be extended in manifold ways. First, BOA is able to detect natural-language representations of predicates in several languages. Thus, we could let users choose the languages they understand and provide facts in several languages, thereby also increasing the portion of the Web that we search through. Furthermore, we could extend our approach to support datatype properties. Moreover, DeFacto can be extended with a temporal dimension, i.e. it could be used to check whether a statement was true during a certain time span. On a grander scale, we aim to provide even lay users of knowledge bases with the means to check the quality of their data by using natural language input. This would support the transition from the Document Web to the Semantic Web by providing a further means to connect data and documents.
