
2.2 The 3XL System

2.2.3 Addition of Triples

We now describe how 3XL handles triples that are inserted into a specific model M, which is a database. M has the database schema D, which has been generated as described above from the ontology O_S containing schematic data. We assume that the triples to insert are taken from an ontology O_I which contains only data about instances, not schematic data about classes etc. Note that O_I can be split into several smaller sets such that O_I = O_I1 ∪ · · · ∪ O_In, where each O_Ii, i = 1, . . . , n, is added at a different time. In other words, unlike schema generation, which happens only once, addition of triples can happen many times.

First, we focus on the state of M after the addition of the triples in O_I to give an intuition for the algorithms that handle this. Then, we present pseudocode in Algorithms 1–3 and explain the handling of triple additions in more detail.

If the subject of a triple is an instance of a class that is not described by O_S, the triple is represented in the overflow table. Assume in the following that the subjects of the triples to insert are instances of classes described by O_S.

When a triple (s, p, o) is added to M, 3XL has to decide in which class table and/or multiproperty table to put the data from the triple. Typically, the data in a triple becomes part of a row to be inserted into M. For each different s for which a triple (s, rdf:type, t) exists² in O_I, the row R_s that is made from the triples with the common subject s is inserted into C_t.

We now consider the effects of adding a triple (s, p, o) where p is a property defined in O_S. First, assume that p is declared to have owl:maxCardinality 1. Then R_s's column for p in C_t gets the value ν(p, o), which equals o if p is an owl:DataProperty or equals the value of the ID attribute in R_o if p is an owl:ObjectProperty. In other words, the value of a data property is stored directly, whereas the value of an object property is not stored as a URI but as the (more efficient) integer ID of the referenced object.

Now assume that no owl:maxCardinality is given for p. As previously mentioned, such properties can be handled in two ways. If array columns are used, the situation resembles that of a property with a maximal cardinality of 1. The only difference is that the column for p in R_s does not get its value set to ν(p, o). Instead, the value of ν(p, o) is added to the array in the column for p in R_s. If multiproperty tables are used, the row (ι, ν(p, o)), where ι is the value of the ID attribute in R_s, is added to the multiproperty table for p. In other words, the row that is inserted into the multiproperty table has a reference (by means of a loose foreign key) to the row R_s. Further, it has a reference to the row for the referenced object if p is an owl:ObjectProperty, and otherwise the value of the property.

So for properties defined in O_S, the values they take in O_I are stored explicitly in columns in class tables and multiproperty tables. For other triples, information is not stored explicitly by adding a row. If the predicate p of a triple (s, p, o) is rdf:type, this information is stored implicitly, since this triple does not result in a row being added to M but decides in which class table R_s is put.
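The three storage cases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain dictionaries stand in for a class-table row and for the multiproperty tables, and the names `nu`, `store`, `ids`, and `mp_tables` are hypothetical and not part of 3XL.

```python
def nu(p_is_object_property, o, id_of):
    """nu(p, o): the literal o for a data property, the integer ID of o for an object property."""
    return id_of[o] if p_is_object_property else o


def store(row, multiproperty_tables, p, value, max_card_is_one, use_array_columns):
    """Place one property value for the instance whose class-table row is `row`."""
    if max_card_is_one:
        row[p] = value                           # scalar column in the class table
    elif use_array_columns:
        row.setdefault(p, []).append(value)      # array column in the class table
    else:
        # multiproperty table row (iota, nu(p, o)), iota being the ID in R_s
        multiproperty_tables.setdefault(p, []).append((row["id"], value))


ids = {"http://example.org/HTML-4.0": 1}         # hypothetical URI-to-ID lookup
row, mp_tables = {"id": 2}, {}
store(row, mp_tables, "title", nu(False, "How to Code?", ids), True, False)
store(row, mp_tables, "usedVersion", nu(True, "http://example.org/HTML-4.0", ids), True, False)
store(row, mp_tables, "keyword", nu(False, "Java", ids), False, False)
```

With multiproperty tables enabled, the keyword value ends up outside the class-table row and refers back to it only through the ID, which is the loose foreign key mentioned in the text.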

The pseudocode listed in Algorithms 1–3 shows how addition of triples is handled. For a so-called value holder vh (we will explain it next), we denote by vh[x] the value that vh holds for x. We let the value holders hold lists for multiproperties and denote by ◦ the concatenation operator for a list.

When triples are being added to M, 3XL may not immediately be able to figure out which table to place the data of the triple in. For this reason, and to exploit the speed of bulk loading, data to add is temporarily held in a data buffer. Data from the data buffer is then, when needed, flushed into the database. This is illustrated in Figure 2.2.

² Recall that the type must be explicitly given.

Algorithm 1 AddTriple
Input: A triple (s, p, o)
 1: vh ← GetValueHolder(s)
 2: if p is defined in O_S then          ▷ O_S is the ontology describing triple data
 3:     if domain(p) is more specific than vh[rdf:type] then
 4:         vh[rdf:type] ← domain(p)
 5:     if maxCardinality(p) = 1 then
 6:         vh[p] ← Value(p, o)
 7:     else
 8:         vh[p] ← vh[p] ◦ Value(p, o)
 9: else if p = rdf:type and o is more specific than vh[rdf:type] and o is described by O_S then
10:     vh[rdf:type] ← o
11: else
12:     Insert the triple into overflow

The data buffer does not hold triples. Instead it holds value holders (see Algorithm 1, line 1 and Algorithm 2). So for each subject s of triples that have data in the data buffer, there is a value holder associated with it. In this value holder, an associative array maps between property names and values for these properties. In other words, the associative array for s reflects the mapping p ↦ ν(p, o). Note that if the predicate p of a triple (s, p, o) is rdf:type, p ↦ o is also inserted into the associative array in the value holder for s unless the associative array already maps rdf:type to a more specific type than o. Actually, 3XL infers triples of the form (s, rdf:type, o) based on predicate names, but only the most specialized type is stored (Algorithm 1, lines 3–4). This type information is later used to determine where to place the values held by the value holder. For a multiproperty p, the associative array maps p to a list of values (Algorithm 1, line 8), but for a property q with a maximal cardinality of 1, the associative array maps q to a scalar value (Algorithm 1, line 6). Further, 3XL assigns a unique ID to each subject, which is also held by the value holder (Algorithm 2, line 19, when the value holder is created).
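The control flow of Algorithm 1 can be made concrete with a small Python sketch. Several things here are assumptions made only for illustration: the schema tables `SUPERCLASS`, `DOMAIN`, `MAX_CARD_ONE`, and `OBJECT_PROPERTIES` are a hand-written slice of the running example rather than something derived from O_S, value holders are plain dictionaries, a fresh value holder starts with the type owl:Thing, IDs are handed out sequentially, and `get_value_holder` consults only the data buffer and never the database.

```python
# Hypothetical, hard-coded slice of the running example's schema.
SUPERCLASS = {"HTMLDocument": "Document", "Document": "owl:Thing", "owl:Thing": None}
DOMAIN = {"title": "Document", "keyword": "Document", "usedVersion": "HTMLDocument"}
MAX_CARD_ONE = {"title", "usedVersion"}
OBJECT_PROPERTIES = {"usedVersion"}

data_buffer, overflow = {}, []


def more_specific(a, b):
    """True if class a is a proper descendant of class b."""
    a = SUPERCLASS.get(a)
    while a is not None:
        if a == b:
            return True
        a = SUPERCLASS.get(a)
    return False


def get_value_holder(uri):
    # Simplified stand-in for Algorithm 2: buffer lookup or creation only.
    if uri not in data_buffer:
        data_buffer[uri] = {"ID": len(data_buffer) + 1, "rdf:type": "owl:Thing"}
    return data_buffer[uri]


def value(p, o):
    # Simplified stand-in for Algorithm 3: no map table is consulted.
    return get_value_holder(o)["ID"] if p in OBJECT_PROPERTIES else o


def add_triple(s, p, o):
    vh = get_value_holder(s)
    if p in DOMAIN:                                   # p is defined in O_S
        if more_specific(DOMAIN[p], vh["rdf:type"]):  # type inferred from the predicate
            vh["rdf:type"] = DOMAIN[p]
        if p in MAX_CARD_ONE:
            vh[p] = value(p, o)                       # scalar value
        else:
            vh[p] = vh.get(p, []) + [value(p, o)]     # list concatenation
    elif p == "rdf:type" and o in SUPERCLASS and more_specific(o, vh["rdf:type"]):
        vh["rdf:type"] = o
    else:
        overflow.append((s, p, o))


html = "http://example.org/HTML-4.0"
doc = "http://example.org/programming.html"
get_value_holder(html)                 # created first, so it receives ID 1
add_triple(doc, "title", "How to Code?")
add_triple(doc, "keyword", "Java")
add_triple(doc, "keyword", "programming")
type_before = data_buffer[doc]["rdf:type"]
add_triple(doc, "usedVersion", html)
type_after = data_buffer[doc]["rdf:type"]
```

In this sketch the type held for the document is refined only when the usedVersion triple arrives, because that predicate is the first one whose domain is more specific than the type recorded so far.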

Example 2 (Data buffer) Assume that the following triples are added to an empty 3XL model M for the running example:

- (http://example.org/HTML-4.0, version, "4.0")
- (http://example.org/HTML-4.0, approvalDate, "1997-12-18")
- (http://example.org/programming.html, title, "How to Code?")
- (http://example.org/programming.html, keyword, "Java")
- (http://example.org/programming.html, keyword, "programming")

Before the triples are inserted into the underlying database by 3XL, the data buffer has the following state.


Algorithm 2 GetValueHolder
Input: A URI u for an instance
 1: if the data buffer holds a value holder vh for u then
 2:     return vh
 3: else
 4:     table ← The class table holding u (found from map)
 5:     if table is not NULL then
 6:         /* Read values from the database */
 7:         vh ← new ValueHolder()
 8:         Read all values for u from table and assign them to vh
 9:         Delete the row with URI u from table
10:         for all multiproperty tables mp referencing table do
11:             Read all property values in rows referencing the row for u in table and assign these values to vh
12:             Delete from mp the rows referencing the row with URI u in table
13:         Add vh to the data buffer
14:         return vh
        ...
21:         Add vh to the data buffer
22:         return vh

Algorithm 3 Value
Input: A property p and an object o
1: if p is an owl:ObjectProperty then
2:     res ← the ID of the instance with URI o (found from map)
3:     if res is NULL then
4:         res ← (GetValueHolder(o))[ID]
5:     return res
6: else
7:     /* It is an owl:DataProperty */
8:     return o

Here the top row of a table shows which subject the value holder holds values for. The following rows show the associative array. Note that the type for http://example.org/programming.html is assumed to be Document since this is the most general class in the domains of title and keyword.

[Figure 2.2: Data flow in 3XL]

Now assume that the triple (http://example.org/programming.html, usedVersion, http://example.org/HTML-4.0) is added to M. Then the type detection finds that http://example.org/programming.html must be of type HTMLDocument, so its value holder gets the following state.

http://example.org/programming.html
ID ↦ 2
rdf:type ↦ HTMLDocument
title ↦ How to Code?
keyword ↦ [programming, Java]
usedVersion ↦ 1

Note how the value holder maps usedVersion to the ID value for http://example.org/HTML-4.0, not to the URI directly. If the required rdf:type triples are now inserted, this does not change anything since the type detection has already deduced the types.

Due to the definition of ν described above, the value holders and eventually the columns in the database hold IDs of the referenced instances for object properties. But when triples are added, the instances are referred to by URIs. So on the addition of the triple (s, p, o) where p is an object property, 3XL has to find an ID for o, i.e., ν(p, o). If o is not already represented in M, a new value holder for o is created (Algorithm 3, line 4). Depending on the range of p, type information about o may be inferred. If o, on the other hand, is already represented in M, its existing ID should of course be used. It is possible to search for the ID by using the query SELECT id FROM C_owl:Thing WHERE uri = o. However, for a large model with many class tables and many rows (i.e., data about many instances) this can be an expensive query. To make this faster, 3XL maintains a table map(uri, id, ct) where uri and id are self-descriptive and ct is a reference to the class table where the instance is represented. Whenever an instance is inserted into a class table C_X, the instance's URI and ID and a reference to C_X are inserted into map. By searching the data buffer and the map table, it is fast to look up whether an instance is already represented and to get its ID if it is. The map table exists in the PostgreSQL database, but for performance reasons 3XL does not query/update the map table in the database while adding triples. Instead, 3XL only extracts all rows in the table once when starting a load of triples and places them in a temporary BerkeleyDB database [12] which acts like a cache. With BerkeleyDB it is possible to keep a configurable amount of the data in memory and efficiently and transparently write the rest to disk-based storage.
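The lookup order described above (data buffer, then the cached copy of map, then creation of a new value holder) can be sketched as follows. This is a minimal illustration, not 3XL's implementation: a plain Python dictionary stands in for the BerkeleyDB cache, the class name `IdResolver` is hypothetical, and the rule for choosing the next free ID is an assumption.

```python
class IdResolver:
    """Resolve a URI to an integer ID without querying the database."""

    def __init__(self, map_rows):
        # map_rows: rows (uri, id, ct) read once from the map table at load start.
        self.cache = {uri: (id_, ct) for uri, id_, ct in map_rows}
        self.data_buffer = {}
        # Assumed ID scheme: continue after the largest ID already in use.
        self.next_id = max((id_ for id_, _ in self.cache.values()), default=0) + 1

    def resolve(self, uri):
        if uri in self.data_buffer:              # instance currently buffered
            return self.data_buffer[uri]["ID"]
        if uri in self.cache:                    # instance already stored in a class table
            return self.cache[uri][0]
        vh = {"ID": self.next_id}                # unknown instance: new value holder
        self.next_id += 1
        self.data_buffer[uri] = vh
        return vh["ID"]


resolver = IdResolver([("http://example.org/programming.html", 2, "C_HTMLDocument")])
known_id = resolver.resolve("http://example.org/programming.html")
new_id = resolver.resolve("http://example.org/new.html")
same_id = resolver.resolve("http://example.org/new.html")
```

Both lookups are dictionary accesses, so resolving an object URI does not require scanning any class table.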

Similarly, 3XL also needs to determine if the instance s is already represented when adding a triple (s, p, o). Again the map table is used. If s is not already represented, a new value holder is created and added to the data buffer. If s, on the other hand, is represented, a value holder is created in the data buffer and given the values that can be read from the class table referenced from map, and then R_s and all rows referencing it from multiproperty tables are deleted. In this way, it is easy to get the new and old data for s written to the database, as data for s is just written as if it was all newly inserted. This also helps if, due to newly added data, it becomes evident that s has a more specialized type than known before. In our implementation, the deletions are not done immediately as shown in the pseudocode. For better performance, we invoke one operation deleting several rows before inserting new data.

When the data buffer gets full, a part of the data in the data buffer is inserted into the database. This is done in a bulk operation where PostgreSQL's very efficient COPY mechanism is used instead of INSERT SQL statements. So the data gets dumped from the data buffer to temporary files in comma-separated values (CSV) format, and the temporary files are then read by PostgreSQL. The rdf:types read from the value holders are used to decide which tables to insert the data into. In case no type is known, owl:Thing is assumed. For unknown property values, NULL is inserted. If multiproperty tables are used, values from a multiproperty are inserted into these instead of a class table.
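The dump step can be illustrated with Python's csv module. The sketch below is an assumption-laden illustration rather than 3XL's code: value holders are dictionaries, `scalar_columns` is a hypothetical description of each class table's columns, the `mp_` table-name prefix is invented, and the CSV text is returned as strings instead of being written to temporary files.

```python
import csv
import io
from collections import defaultdict


def dump_for_copy(value_holders, scalar_columns):
    """Return {table name: CSV text} for a set of value holders."""
    files = defaultdict(io.StringIO)
    for uri, vh in value_holders.items():
        table = vh.get("rdf:type", "owl:Thing")   # no known type: owl:Thing
        # Missing properties become None, which csv writes as an empty field;
        # PostgreSQL's CSV format reads an unquoted empty field as NULL by default.
        row = [vh["ID"], uri] + [vh.get(c) for c in scalar_columns[table]]
        csv.writer(files[table], lineterminator="\n").writerow(row)
        for p, values in vh.items():
            if isinstance(values, list):          # multiproperty: one row per value
                writer = csv.writer(files["mp_" + p], lineterminator="\n")
                writer.writerows((vh["ID"], v) for v in values)
    return {table: f.getvalue() for table, f in files.items()}


holders = {
    "http://example.org/programming.html": {
        "ID": 2,
        "rdf:type": "HTMLDocument",
        "title": "How to Code?",
        "keyword": ["Java", "programming"],
    }
}
dumps = dump_for_copy(holders, {"HTMLDocument": ["title", "usedVersion"]})
```

Each resulting text could then be written to a temporary file and loaded with a single COPY statement in CSV format, so one statement per table replaces many INSERT statements.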

To exploit that the data might have locality such that triples describing the same instance appear close to each other, a partial-commit mechanism is employed, in which the least recently used m% of the data buffer's content is moved to the database when the data buffer gets full (the percentage m is user-configurable). This is illustrated in Figure 2.3. In this way, the system can in many cases avoid reading in the data just written out to the database.
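A least-recently-used partial commit can be sketched with an ordered dictionary. The following is a minimal illustration under assumptions: the buffer capacity is counted in value holders, `flush` is a hypothetical callback that would write the evicted value holders to the database, and the class name `DataBuffer` is chosen only for this example.

```python
from collections import OrderedDict


class DataBuffer:
    """Holds value holders in least-recently-used order."""

    def __init__(self, capacity, m_percent, flush):
        self.holders = OrderedDict()
        self.capacity = capacity        # assumed unit: number of value holders
        self.m_percent = m_percent      # share of the buffer to move out when full
        self.flush = flush              # callback receiving [(uri, value holder), ...]

    def get(self, uri):
        if uri in self.holders:
            self.holders.move_to_end(uri)          # mark as most recently used
        else:
            if len(self.holders) >= self.capacity:
                self._partial_commit()
            self.holders[uri] = {}
        return self.holders[uri]

    def _partial_commit(self):
        count = max(1, len(self.holders) * self.m_percent // 100)
        # popitem(last=False) removes the least recently used entry first.
        evicted = [self.holders.popitem(last=False) for _ in range(count)]
        self.flush(evicted)


flushed = []
buf = DataBuffer(capacity=4, m_percent=50, flush=flushed.extend)
for uri in ["a", "b", "c", "d"]:
    buf.get(uri)
buf.get("a")      # "a" becomes the most recently used entry
buf.get("e")      # buffer is full: the two least recently used entries are flushed
```

Because "a" was touched again before the buffer filled up, it stays in memory, and later triples about it need no read-back from the database.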

In document Data warehousing technologies for large-scale and right-time data Xiufeng, Liu (Page 31-36)