Conclusion and Future Work - Data warehousing technologies for large-scale and right-time data

In this chapter, we present the 3XL triple-store. Unlike most current triple-stores, 3XL is specifically designed to support easy integration with non-RDF data and at the same time support efficient data management operations (load and retrieval) on very large OWL Lite triple-stores. 3XL’s approach has a number of notable characteristics. First, 3XL is DBMS-based and uses a specialized data-dependent schema derived from an OWL Lite ontology. In other words, 3XL performs an “intelligent partitioning” of the data, which the system uses efficiently when answering triple queries and which is at the same time intuitive when the user queries the data directly in SQL. Second, 3XL uses advanced object-relational features of the underlying ORDBMS (in this case PostgreSQL), such as table inheritance and arrays as “in-lined” attribute values. The table inheritance represents subclass relationships in a way that is natural to a user. Third, 3XL is designed to be efficient for bulk insertions. It makes extensive use of a number of bulk-loading techniques that speed up bulk operations significantly and is designed to use the available main memory very efficiently, using specialized caching schemes for triples and the map table. Fourth, 3XL supports very efficient bulk retrieval for point-wise queries where the subject and/or the predicate is known, as we have found such queries to be the most important for most bulk data management applications. 3XL also supports efficient retrieval for composite queries.

3XL is motivated by our own experiences from a project using very large amounts of triples. Extensive experiments based on the real-world EIAO dataset and the industry-standard LUBM benchmark show that 3XL has loading and query performance comparable to the best file-based solutions, and outperforms other DBMS-based solutions. At the same time, 3XL provides flexibility as it is DBMS-based and uses a specialized and intuitive schema to represent the data. 3XL thus bridges the gap between efficient representations and flexible and intuitive representations of the data, and places itself in a unique spot in the design space for triple-stores.
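The idea of answering point-wise queries through a class-partitioned schema can be illustrated with a small sketch. It is not 3XL's implementation: it uses plain Python dictionaries instead of PostgreSQL tables, and the names `TinyClassTableStore`, `add_instance`, `add_triple`, and `query` are hypothetical. It only shows how a map table (subject to class) lets a query with a known subject touch a single class table, with multi-valued properties kept in-line as lists.

```python
from collections import defaultdict


class TinyClassTableStore:
    """Toy model: one 'class table' per class plus a map table (assumed layout)."""

    def __init__(self):
        # class name -> {subject -> {predicate -> list of values}}
        self.class_tables = defaultdict(dict)
        # map table: subject -> name of the class table holding its row
        self.map_table = {}

    def add_instance(self, subject, class_name):
        """Register a subject as an instance of a class (creates its row)."""
        self.map_table[subject] = class_name
        self.class_tables[class_name][subject] = defaultdict(list)

    def add_triple(self, subject, predicate, obj):
        """Store a property value in the row of the subject's class table."""
        class_name = self.map_table[subject]
        self.class_tables[class_name][subject][predicate].append(obj)

    def query(self, subject, predicate=None):
        """Answer (s, p, *) or (s, *, *) by reading one class table only."""
        class_name = self.map_table.get(subject)
        if class_name is None:
            return []
        row = self.class_tables[class_name][subject]
        predicates = [predicate] if predicate is not None else list(row)
        return [(subject, p, o) for p in predicates for o in row.get(p, [])]


store = TinyClassTableStore()
store.add_instance("ex:doc1", "Document")
store.add_triple("ex:doc1", "ex:title", "Annual report")
store.add_triple("ex:doc1", "ex:keyword", "finance")
store.add_triple("ex:doc1", "ex:keyword", "audit")
print(store.query("ex:doc1", "ex:keyword"))
```

In this toy model the map table plays the role described above: it is consulted on every lookup, which is why caching it efficiently matters.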

The overall lessons learnt can be summarized as follows:

1) Using a specialized schema generated from an OWL ontology is very effective. With this schema, it is fast to find the relevant data which is intelligently partitioned into class tables (and possibly also multiproperty tables). This results in very good query performance.

2) An ORDBMS is a very strong foundation for building such a specialized schema for an OWL ontology. It provides the needed functionality (i.e., table inheritance) to represent class relationships and efficient storage of multi-valued variables in arrays. It also provides a query optimizer, index support, etc. for free. Finally, it provides the user flexibility as it is easy to combine the data with non-RDF data.

3) The right choice of caching mechanisms is very important for the performance. In particular, the often used map table should be cached outside the DBMS. For a map table too big to fit in memory, an external caching system based on BerkeleyDB is better than using the DBMS.

4) When loading OWL data, using bulk-loading in clever ways has a huge effect. In particular, instead of bulk-loading all available data, only the oldest parts should be loaded, keeping the freshest data in memory.
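The fourth lesson can be sketched as a buffer that, when full, hands only its oldest entries to the bulk loader and keeps the rest in memory. This is a minimal illustration under stated assumptions, not 3XL's code: "oldest" is interpreted here as least recently touched, the list `bulk_loaded` stands in for a real bulk load into the DBMS, and the names `PartialFlushBuffer`, `capacity`, and `flush_fraction` are hypothetical.

```python
from collections import OrderedDict


class PartialFlushBuffer:
    """In-memory buffer that bulk-loads only its oldest entries when full."""

    def __init__(self, capacity, flush_fraction=0.5):
        self.capacity = capacity              # max rows held in memory
        self.flush_fraction = flush_fraction  # share of rows flushed at once
        self.buffer = OrderedDict()           # least recently touched first
        self.bulk_loaded = []                 # stand-in for the DBMS bulk load

    def put(self, key, row):
        """Insert or touch a row; flush the oldest part if over capacity."""
        self.buffer[key] = row
        self.buffer.move_to_end(key)          # mark as freshest
        if len(self.buffer) > self.capacity:
            self._flush_oldest()

    def _flush_oldest(self):
        """Move the oldest fraction of rows out of memory in one batch."""
        n = max(1, int(len(self.buffer) * self.flush_fraction))
        batch = [self.buffer.popitem(last=False) for _ in range(n)]
        self.bulk_loaded.extend(batch)


buf = PartialFlushBuffer(capacity=4)
for i in range(5):
    buf.put(f"k{i}", {"value": i})
print([key for key, _ in buf.bulk_loaded])  # keys that left memory
print(list(buf.buffer))                     # keys still cached
```

The design choice illustrated is that rows touched recently are the ones most likely to be touched again, so keeping them in memory avoids reading them back from the database.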

There are a number of interesting directions for future work. First of all, optimization will be continued, focusing on improving the performance of transferring data from memory to the database and on better support for queries of the form (∗, ∗, o). Further, 3XL will be extended to support more of the OWL features, as a first step all of OWL Lite. Finally, 3XL will be integrated with a reasoner running on top of 3XL to allow more reasoning than the class-subclass reasoning on instances currently supported.

Chapter 3

All-RiTE: Right-Time ETL for Live DW Data

Data warehousing traditionally extracts, transforms, and loads (ETL) the data from different source systems into a central data warehouse (DW) at regular intervals, e.g., daily. Data warehousing technologies face the challenge of how to deal with so-called live DW data, such as accumulating facts and early-arriving facts, which will eventually be updated but are also queried in an online fashion. For this type of data, traditional SQL INSERTs are used, typically followed by updates and possibly deletions. This is, however, not an efficient way to modify the data in the DW. This chapter presents the ETL middleware system All-RiTE, which enables efficient processing of live DW data with support for SELECTs, INSERTs, UPDATEs and DELETEs. All-RiTE makes use of a novel main memory-based intermediate data store between the data producer and the DW to accumulate the live DW data, and does on-the-fly data modifications when the data is materialized to the DW or queried. A number of policies are proposed to control when to move data from source systems towards the DW. The data in the intermediate data store can be read by data consumers with specified time accuracies. Our experimental studies show that All-RiTE provides a new “sweet spot” combining the best of standard JDBC and bulk load: data availability as with INSERTs, loading speeds even faster than bulk load for short transactions, and very efficient UPDATEs, DELETEs and SELECTs of live DW data.

3.1 Introduction

In data warehousing, the data from various source systems is extracted, transformed, and loaded into the DW. The data is typically refreshed at a regular time interval, e.g., daily. In recent years, there has been an increasing demand for fresher data, i.e., a shorter refreshment period. In some cases, it is even required to have DWs refreshed with the data within minutes or seconds after the triggering event in the real world (e.g., the placement of an order). This is often referred to as “real-time” or “near real-time” data warehousing [10, 17]. Besides, a recent and more sophisticated requirement is to make the data available in the DW when the users need it, but not necessarily before. This is referred to as “right-time” data warehousing [85]. Users can specify a certain freshness for the data they see. For example, a user may ask to read the data with a freshness of at least five minutes. Then, the data committed five minutes ago or earlier should be available in the DW, but data committed after that, e.g., one minute ago, is not necessarily available. The short intervals lead to relatively few insertions in each refreshment compared to, e.g., daily refreshments.
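The freshness requirement in the example above can be made concrete with a small sketch. It is an illustration only, not All-RiTE's API: the names `Row` and `split_by_freshness` are hypothetical, and times are plain seconds. Given the rows that have been committed but not yet made visible, it separates those that must be visible to satisfy a requested freshness from those that may still wait.

```python
from dataclasses import dataclass


@dataclass
class Row:
    row_id: int
    commit_time: float  # seconds on an arbitrary clock (assumed unit)


def split_by_freshness(pending, now, freshness_seconds):
    """Split pending rows into (must_be_visible, may_wait).

    A row must be visible if it was committed at least
    `freshness_seconds` before `now`; newer rows may still wait.
    """
    cutoff = now - freshness_seconds
    must_be_visible = [r for r in pending if r.commit_time <= cutoff]
    may_wait = [r for r in pending if r.commit_time > cutoff]
    return must_be_visible, may_wait


# A read at t = 600 s asking for a freshness of five minutes (300 s):
pending = [Row(1, 0.0), Row(2, 200.0), Row(3, 540.0)]
must, wait = split_by_freshness(pending, now=600.0, freshness_seconds=300.0)
print([r.row_id for r in must], [r.row_id for r in wait])
```

Here the row committed one minute before the read falls in the second group, matching the example in the text.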

As shorter refresh intervals are used, it is also more likely that the just-inserted data needs to be updated. That is, the data will be updated if some information is not available when the data is inserted. In this chapter, we use the term live DW data for the DW data that can be updated and is available for reads in an online fashion before it is loaded into the DW. A typical scenario illustrating this involves accumulating facts [42], which are updated when events in the modeled world happen. A fact is, e.g., inserted when (or shortly after) an order is placed. This fact gets updated in the DW whenever the order becomes accepted, packed, sent, and paid. Another example involving live DW data is early-arriving facts [42], where some of the referenced dimension values are unknown at load time. The inserted data then gets updated later when more information is available. For both scenarios, it is not efficient to insert data into the DW and then update it shortly after. Better performance can be achieved if the data is updated before it physically gets into the DW. We believe this will often be possible, assuming that most updates to live DW data happen within a short time after the data was created, e.g., within minutes or hours, while changes are rarely made to data created a long time ago. Supporting such a scenario efficiently has, however, traditionally involved a complex setup of the ETL program with much hand-coding.
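The benefit of updating data before it physically reaches the DW can be sketched with a toy in-memory store for an accumulating fact. This is a simplified illustration, not All-RiTE's design or interface: the class `IntermediateStore` and its methods are hypothetical, and `materialize` merely returns the rows that a real system would load into the DW.

```python
class IntermediateStore:
    """Toy main-memory store that absorbs changes before materialization."""

    def __init__(self):
        self.rows = {}  # key -> column values of rows not yet in the DW

    def insert(self, key, row):
        self.rows[key] = dict(row)

    def update(self, key, changes):
        """Apply an update in memory; False means the row is not held here."""
        if key not in self.rows:
            return False  # the caller would have to update the DW instead
        self.rows[key].update(changes)
        return True

    def delete(self, key):
        """Drop a row that never needs to reach the DW."""
        return self.rows.pop(key, None) is not None

    def materialize(self):
        """Hand over the final row versions and empty the store."""
        batch = list(self.rows.values())
        self.rows.clear()
        return batch


store = IntermediateStore()
store.insert(1, {"order_id": 1, "status": "placed"})
store.update(1, {"status": "packed"})
store.update(1, {"status": "sent"})
store.insert(2, {"order_id": 2, "status": "placed"})
store.delete(2)
print(store.materialize())
```

In this sketch, two inserts, two updates, and one delete collapse into a single row to be loaded, which is the effect the paragraph above argues for.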

A solution that enables efficient near real-time and/or right-time ETL for live DW data is thus strongly needed. This chapter presents exactly such a solution: the middleware All-RiTE. This chapter extends the previous solution, RiTE [85], which only supported insertions, but not updates and deletions. The current solution is named All-RiTE because it supports all the traditional operations and provides Right Time ETL. With All-RiTE, the user uses JDBC INSERTs to add data to the

In document Data warehousing technologies for large-scale and right-time data Xiufeng, Liu (Page 55-60)