Semi-structured data are “schema-less” [Buneman et al., 2001]: they have no rigid, predetermined schema, which is one of the main advantages that make them so popular. They are also “self-describing” [Buneman et al., 2001], meaning that the structure and the values are embedded in the same file. These characteristics have made semi-structured data the most suitable and natural data model for accommodating heterogeneity [Chung and Jesurajaiah, 2005] and an important feature of the Web [Ya-qin and Wen-yong, 2010]. XML and JSON26 are the main data models representing semi-structured data. Both are hierarchical and can be parsed easily [McMullen and Hawick, 2013; Ray, 2003] thanks to the wide availability of tools and libraries.
This subchapter begins by introducing RESTful Web APIs. The two most widely used semi-structured technologies on the Web, XML and JSON, are then presented.
2.5.1 RESTful Web APIs
Web APIs form a broad class of Web services and interfaces. In the context of this thesis, a Web API is defined as the interface of a service, consisting of a set of HTTP request messages along with a definition of the structure of the response messages [Cao et al., 2013]. The technologies most often used to represent the output messages are JSON and XML (see MQ2 in Section 1.2). This interface is a standards-based application-to-application programming interface, meaning that it can be called from other programs [Burghardt et al., 2005].
A Web API is considered a RESTful service [Richardson and Ruby, 2008] when it conforms to the REST27 architectural principles [Fielding and Taylor, 2000], namely client-server communication, stateless requests and the use of a uniform interface. RESTful Web services are commonly implemented over HTTP [Maleshkova et al., 2010].
RESTful28 Web APIs are one of the major technologies that exchange semi-structured data. Many formats are used for this purpose, but XML and JSON are the most frequent; they are the preferred representations for machine-readable data [Trifa et al., 2010].
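As an illustration of how one resource can be exposed in either representation, the following minimal sketch selects the response format from the value of the HTTP Accept header. The function name render, the wrapper element resource and the sample record are hypothetical and chosen only for this example; a real Web API would additionally handle nested data, quality values in the header and error cases.

```python
import json
import xml.etree.ElementTree as ET


def render(resource, accept):
    """Return (content_type, body) for a flat dict, chosen from the Accept header."""
    if "application/xml" in accept:
        # Each key becomes a child element of a hypothetical <resource> root.
        root = ET.Element("resource")
        for key, value in resource.items():
            child = ET.SubElement(root, key)
            child.text = str(value)
        return "application/xml", ET.tostring(root, encoding="unicode")
    # Any other Accept value falls back to JSON in this sketch.
    return "application/json", json.dumps(resource)


sensor = {"id": 42, "name": "sensor-a"}  # hypothetical resource
print(render(sensor, "application/json"))
print(render(sensor, "application/xml"))
```

The same record is thus delivered either as a JSON object or as a small XML tree, without any schema being agreed upfront between client and server.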
The steady increase in the number of Web APIs, and of RESTful Web services in general [Wu et al., in press], suggests that the amount of semi-structured data on the Web is constantly growing. More precisely, the number of Web APIs continued to increase even in the post-Linked Data era (after 2006) (see Figure 1.1 in Section 1.2). Several explanations can be put forward: some use cases are better suited to a Web API architecture, implementing a Linked Data source is less accessible, or the Linked Data paradigm has not been a success and does not meet developers' needs. One conclusion that can be drawn is that, relative to Linked Data, the older semi-structured data sources are still growing.
26 JavaScript Object Notation
27 Representational state transfer
2.5.2 XML
XML is a mark-up language that allows users to define a set of tags describing an arbitrary document structure [Bray et al., 1997]. It is designed to be “eXtensible”: users can create their own forms by defining various entities, tags or elements [Van der Aalst and Kumar, 2003]. An XML document can be viewed as a labelled tree, where each tag corresponds to a labelled node in the data model and each nested sub-tag is a child in the tree [Decker et al., 2000].
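The labelled-tree view can be illustrated with a short sketch based on Python's standard xml.etree.ElementTree module. The document DOC and the helper walk are hypothetical and serve only as an example; attributes, namespaces and mixed content are ignored here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, used only for illustration.
DOC = """<library>
  <book id="b1">
    <title>Example Title</title>
    <author>Example Author</author>
  </book>
</library>"""


def walk(node, depth=0):
    """Yield (depth, label, text) for every node, in document order."""
    text = (node.text or "").strip()
    yield depth, node.tag, text
    # Each nested sub-tag is a child of the current node in the tree.
    for child in node:
        yield from walk(child, depth + 1)


root = ET.fromstring(DOC)
for depth, label, text in walk(root):
    print("  " * depth + label + (f": {text}" if text else ""))
```

The indentation of the printed labels mirrors the nesting of the tags, so library is the root, book is its child, and title and author are the leaves carrying the values.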
The flexibility and simplicity of defining an XML structure, along with the availability of tools for manipulating it, have made XML an effective and popular mechanism for cross-application communication and information exchange.
2.5.3 JSON
JSON is a popular, lightweight, text-based and language-independent format for data serialisation and interchange. It is widely used as an alternative to XML [Guinard et al., 2010], not as a mark-up language but as a data exchange format, particularly when dealing with existing Web services [Sumaray and Makki, 2012]. Soon after its creation, JSON was adopted by many well-known companies, such as Google29 and Yahoo30 [Robal and Kalja, 2009], owing to its efficiency and simplicity in representing semi-structured data.
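The use of JSON for serialisation and interchange can be sketched with Python's standard json module. The record below is hypothetical; it simply combines nested objects, an array and scalar values to show that structure and values travel together in the same text.

```python
import json

# Hypothetical record mixing a nested object, an array and scalar values.
record = {
    "name": "sensor-a",
    "location": {"lat": 48.85, "lon": 2.35},
    "readings": [21.5, 22.0, 21.8],
    "active": True,
}

text = json.dumps(record, indent=2)  # serialise the structure to JSON text
restored = json.loads(text)          # parse the text back into Python objects

print(text)
print(restored == record)
```

The round trip returns an equal structure, which is what makes the format convenient for exchanging data between programs written in different languages.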
29 https://www.google.com/
30 https://www.yahoo.com
2.5.4 Structuring Semi-structured Data
This section gives a brief overview of the tools and methods that support the transition from semi-structured data models, particularly JSON and XML, to the RDF data model.
The problem of converting hierarchical, or tree-based, data models into graph-based data models has existed for more than a decade. Various solutions have been proposed [Bohring et al., 2005; Cruz et al., 2004; Johnson, 2013; Van Deursen et al., 2008]; they can be classified into two categories: fixed RDF transformation and ontology-dependent RDF transformation.
Systems in the first category perform generic, syntactic conversions from one data model and format to another. The transition consists mainly of restructuring and reorganising the components of the semi-structured data (namespace, root, tags, attributes and values) into RDF subject, predicate and object structures. This operation is not considered challenging, as an XSLT31 script, or a JSON/XML parser combined with the Jena framework, can achieve an acceptable result. Its disadvantage is that no meaning is associated with the resulting RDF file. Tools in this class include [Breitling, 2009] and the Java library XmlToRdf32.
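A fixed transformation of this kind can be sketched in a few lines. The example below is not the algorithm of any of the cited tools; it is a minimal illustration that uses only the Python standard library instead of XSLT or Jena. The namespace BASE, the function json_to_triples and the sample input are assumptions made for the example. Every literal is emitted as a plain string, escaping is omitted, keys are assumed to be valid IRI fragments and arrays nested directly inside arrays are not handled.

```python
import json
from itertools import count

BASE = "http://example.org/"  # assumed namespace, chosen only for this sketch


def json_to_triples(obj, subject, ids=None):
    """Map a parsed JSON object to (subject, predicate, object) triples.

    Keys become predicates under BASE, nested objects become blank nodes,
    and scalar values become plain string literals.
    """
    ids = ids if ids is not None else count(1)
    triples = []
    for key, value in obj.items():
        predicate = f"<{BASE}{key}>"
        # An array simply repeats the predicate once per member.
        members = value if isinstance(value, list) else [value]
        for member in members:
            if isinstance(member, dict):
                node = f"_:b{next(ids)}"
                triples.append((subject, predicate, node))
                triples.extend(json_to_triples(member, node, ids))
            else:
                triples.append((subject, predicate, f'"{member}"'))
    return triples


data = json.loads('{"name": "sensor-a", "location": {"lat": 48.85}, "tags": ["x", "y"]}')
for s, p, o in json_to_triples(data, f"<{BASE}root>"):
    print(s, p, o, ".")
```

The predicates are nothing more than the original keys placed under an arbitrary namespace, which illustrates why the output of such a conversion carries no additional meaning.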
Systems in the second category rely on ontologies when converting a semi-structured data schema, frequently XML, to an RDF schema. Projecting the concepts of a given ontology, and the relationships between them, onto the data while converting from one data model to another is a challenging task. This is what Van Deursen et al. [2008] attempted to achieve, for instance. The system they proposed takes as input an XML file, an OWL ontology and a mapping document describing the link between the XML file and the ontology, and outputs RDF instances conforming to the OWL ontology.
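The general idea of a mapping-driven conversion can be sketched as follows. This is not the system of Van Deursen et al.: the mapping format, the function xml_to_rdf, the base namespace and the sample XML are all hypothetical, and the ontology is not loaded or validated but only referenced through the IRIs of existing FOAF terms. The sketch also assumes that every mapped element carries the identifier attribute named in the mapping.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical mapping document: XML element name -> ontology class,
# child element name -> ontology property.
MAPPING = {
    "person": {
        "class": FOAF + "Person",
        "id_attribute": "id",
        "properties": {"fullname": FOAF + "name", "alias": FOAF + "nick"},
    }
}

# Hypothetical input; <shoe> has no counterpart in the mapping.
XML = (
    '<people><person id="p1">'
    "<fullname>Ada Example</fullname><alias>ada</alias><shoe>38</shoe>"
    "</person></people>"
)


def xml_to_rdf(xml_text, mapping, base="http://example.org/"):
    """Return triples whose classes and properties come from the mapping."""
    triples = []
    root = ET.fromstring(xml_text)
    for tag, rule in mapping.items():
        for element in root.iter(tag):
            subject = base + element.get(rule["id_attribute"])
            triples.append((subject, RDF_TYPE, rule["class"]))
            for child in element:
                prop = rule["properties"].get(child.tag)
                if prop is not None:  # elements without a mapping are skipped
                    triples.append((subject, prop, child.text))
    return triples


for triple in xml_to_rdf(XML, MAPPING):
    print(triple)
```

In contrast to a fixed transformation, the resulting instances are typed with an ontology class and described with ontology properties, while XML elements that the mapping does not cover are left out.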
31 Extensible Stylesheet Language Transformations
32 https://github.com/AcandoNorway/XmlToRdf