4.4 Extracting Data from Partially Structured Documents
4.5.7 Spreadsheet Data Management
While the last section dealt with work related specifically to our attribute equivalence algo- rithms as presented in Section 4.3, we will now discuss related work that is specific to relation extraction from partially structured documents as discussed in Section 4.4.
There are two classes of related work: first, there is a number of tools build to enable a user to extract and clean relational data from source documents, for example Wrangler (Kandel et al., 2011) and OpenRefine (Verborgh and Wilde, 2013), the successor to Google Refine. Both systems offer tooling to help a user working manually on a document cleaning project. In contrast, the DeExcelerator (Section 4.4) is a predefined pipeline that can be applied to large heterogeneous documents collections automatically. The user can be involved by creating new implementations of the extraction steps.
The second category of related work covers table recognition algorithms. They exist for various types of input, e.g. Web tables, Web lists, PDF files and even images, and use a large variety of heuristic, learning-based and even many visual techniques. An extensive overview of these approaches is given in (Zanibbi et al., 2004). There is also more recent work that special- izes on data extraction from spreadsheets. (Cunha et al., 2009) describes the transformation of spreadsheets into relational database. They focus much stronger on the intricacies of relational schema design, such as functional dependencies and normal forms. They use the normalized database schema they create to make the spreadsheet queryable, and even to export the data back to a spreadsheet format in third normal form. However, their approach assumes “clean” spreadsheets that are basically in first-normal form and contain none of the denormalizations we identified in Section 4.4.1. The approaches could therefore be combined to allow complete normalization of data even from very dirty input.
In (Chen and Cafarella, 2013), an automatic method for extracting relations from spread- sheets is proposed that is most similar to ours. Similarly to our work, a large corpus of spread- sheets are analyzed to gain an understanding of spreadsheet usage on the Web. Their survey confirms many of our findings, for example the fact that a large percentage of spreadsheets has a multi-dimensional or hierarchical attributes, and thus the need for transformation tools to import the data into relational systems. However, in contrast to our work, it focuses heavily on the detection and extraction of hierarchical or multi-dimensional attribute headers, and do not include other transformation that are part of the DeExcelerator, such as layout cell removal or data type and Null value detection. Still, the approaches presented for attribute detection and hierarchy extraction could be easily included into the corresponding transformation steps of the DeExcelerator due to its framework character. In (Chen and Cafarella, 2014) the same authors extend their techniques with a more sophisticated graphical model for the hierarchy extraction, as well as a semi-automatic component for user-interactive repair, and finally a method for utilizing attribute co-occurrence statistics from a corpus of spreadsheets to further improve the precision of the extraction. These advanced methods lend themselves well to in- clusion within the PTDI paradigm, both in using the statistics given on a data curation system, and in including the user in the cleaning process, similarly to our methods in Section 4.3.
As a final note, research aimed at understanding Web tables, as presented in the previous section, is also related to the problem of extracting data from spreadsheets. For example, ap- proaches that map encountered entities into knowledge bases such as (Venetis et al., 2011) are
applicable to spreadsheets as well. While these approaches focus more on specifics of identify- ing relational tables in large corpora, and semantic annotation, respectively, the DeExcelerator focuses on syntactic artifacts that occur in tables meant for human consumption. Therefore, the DeExcelerator could be seen as a preprocessing step to these techniques.
4.6
SUMMARY AND DISCUSSION
In this chapter, we have studied data curation systems, which we defined as schema-less, collab- orative, multi-domain dataset repositories. We discussed examples for such systems, including Open Data platforms, scientific data repositories, and enterprise data lakes. These platforms are characterized by the fact that data is stored there mostly in its raw form, as provided by source systems or human publishers, and is not organized with a central schema. The motiva- tion is to store all incoming data and make it available, even if the future use case is not always known.
We first surveyed a large set of public Open Data platforms to gain real-world usage in- sights into usage patterns and problems of this new form of data management. The survey results show that those platforms, while containing vast amounts of potentially reusable data, do not support this re-use as well as they could. Especially automated ad-hoc discovery and in- tegration of Open Data, as we aim for in this thesis, is hindered by the lack of unified metadata or even standardized data and file formats. Since traditional full integration is not always appli- cable for lack of resources on such platforms, the full effort of searching through, cleaning and integrating the data falls to the future reuser. To alleviate these problems, we first introduced requirements for a data curation system that facilitates reuse. To fulfill these requirements, and enable light-weight integration before reuse, we proposed the Publish-time Data Integration (PTDI) paradigm. Under this paradigm, the publisher of a dataset is encouraged to optimize the reusability of newly published dataset through automatically generated recommendations, which just have to be accepted or declined by the user.
We introduced PTDI operators, components which are triggered when a new dataset is published, and which generate various types of recommendations to improve the dataset’s reusability. We then introduced two specific PTDI operators, one for generating attribute name recommendationsand one for extracting relational data from partially structured documents such as spreadsheets or HTML.
The idea of the first operator is to use statistics on attribute names usage with the datasets already existing in the data curation system, to recommend attribute names that fit well with the vocabulary in use in the system. The motivation for this operator is to constrain hetero- geneity, without using a global schema. The generated recommendations are based on instance set overlap as well as a preceding domain classification, in which datasets in the data curation system are automatically clustered according to their schema to identify related datasets. We further presented several versions of the method that use different preprocessing steps to re- duce the run-time requirements of the method while keeping precision on the same level as the base method. We evaluated the method concerning run-time, number of recommenda- tions and their precision on real-world Open Data, and used crowdsourcing to evaluate the quality of our recommendations.
The second operator, called DeExcelerator, is aimed at transforming partially structured documents, i.e., document in which structured data is freely intermingled with textual or lay- out elements, into pure relational data with accompanying metadata. We analyzed a collection of real-world Open Data spreadsheets to identify a set of typical spreadsheet denormalizations, which are usage patterns that hinder automatic reuse of the data contained in the document. Based on these, we proposed a pipeline of abstract operators that successively remove these denormalizations and transform the spreadsheet. We evaluated the relevance of the denormal- izations we identified as well as the correctness of the generated transformations on real world spreadsheet by means of a user study.
To conclude this chapter, we will review the requirements for a system that facilitates dataset reuse in a data curation system, from Section 4.2.1. Firstly, we enable Standardiza- tion without Schemaby generating recommendations that encourage users to publish in a way that facilitates reuse, without forcing them to adhere to a specific schema or strict regulation. In other words, all data is welcome, but the user is supported in increasing data quality. In addi- tion, by basing the generated recommendations on existing content in the system, lightweight integration can be performed without a defined global schema. Furthermore, through encour- aging publishers to enrich their data at publish-time, when full knowledge of the dataset is still available, we avoid burdening the reuser with all cleaning and integration effort, thus fulfilling the Integration before Reuse requirement. However, the publisher is only likely to accept recom- mendations and make changes to the data if there are only Minimal Requirements for the User. This requirement is reached through fully automated generation of a ranked list of recommen- dations, involving the user with selection and verification while minimizing recommendation latency. Concerning Adaptation to Evolving Usage, we discussed two approaches. The first is a side-effect of basing recommendations on system content statistics. In this way, if the plat- form’s content gradually changes, the recommendations produced by the operators will change as well. Furthermore, by keeping both the library of operators and the operators themselves extendable, PTDI systems can be adapted to changing use cases. Finally, Scalable Integration with respect to user effort is reached through using a loosely coupled library of operators and preprocessors, that produce independently usable recommendations, instead of a monolithic integration process that each dataset has to go through. This enables users to pick and choose which recommendations to follow up on, and invest as much effort into the data’s reuseability as they are willing to.
Having reviewed our initial requirements, we can conclude that the methods introduced in this chapter allow to greatly increase reusability of data published in data curation systems, without changing their free-for-all nature, and while requiring minimal user effort.
5
CONCLUSION AND FUTURE WORK
5.1 Conclusion