An object-oriented wrapper to relational databases with query optimisation

(1)

INFORMATYKI I AUTOMATYKI

KATEDRA INFORMATYKI STOSOWANEJ

mgr in

Ŝ

. Jacek Wi

ś

licki

Ph.D. Thesis

An object-oriented wrapper

to relational databases

with query optimisation

praca doktorska

Obiektowa osłona

do relacyjnych baz danych

z uwzgl

ę

dnieniem optymalizacji

zapyta

ń

Advisor:

prof. dr hab. in

Ŝ

. Kazimierz Subieta

(2)

(3)

Index of Contents

ABSTRACT... 5

ROZSZERZONE STRESZCZENIE... 7

CHAPTER 1INTRODUCTION... 13

1.1 Motivation... 13

1.2 Theses and Objectives... 14

1.3 History and Related Works ... 16

1.4 Thesis Outline ... 18

CHAPTER 2THE STATE OF THE ART AND RELATED WORKS... 20

2.1 The Impedance Mismatch ... 21

2.2 Related Works... 23

2.2.1 Wrappers and Mediators ... 23

2.2.2 ORM and DAO ... 32

2.2.3 XML Views over Relational Data ... 37

2.2.4 Applications of RDF ... 38

2.2.5 Other Approaches... 40

2.3 The eGov-Bus Virtual Repository... 42

2.4 Conclusions ... 44

CHAPTER 3RELATIONAL DATABASES... 45

3.1 Relational Optimisation Constraints ... 45

3.2 Relational Calculus and Relational Algebra ... 47

3.3 Relational Query Processing and Optimisation Architecture... 48

3.3.1 Space Search Reduction ... 51

3.3.2 Planning... 55

3.3.3 Size-Distribution Estimator... 58

3.4 Relational Query Optimisation Milestones ... 60

3.4.1 System-R ... 60

3.4.2 Starburst ... 61

3.4.3 Volcano/Cascades ... 62

CHAPTER 4THE STACK-BASED APPROACH... 64

4.1 SBA Object Store Models ... 65

4.2 SBQL ... 65

4.2.1 SBQL Semantics ... 66

4.2.2 Sample Queries ... 68

4.3 Updateable Object-Oriented Views... 75

4.4 SBQL Query Optimisation... 77

4.4.1 Independent Subqueries ... 79

4.4.2 Rewriting Views and Query Modification ... 82

4.4.3 Removing Dead Subqueries ... 83

4.4.4 Removing Auxiliary Names ... 83

4.4.5 Low Level Techniques ... 84

CHAPTER 5OBJECT-RELATIONAL INTEGRATION METHODOLOGY... 86

5.1 General Architecture and Assumptions... 86

5.2 Query Processing and Optimisation ... 88

5.2.1 Naive Approach vs. Optimisation ... 90

(4)

CHAPTER 6QUERY ANALYSIS,OPTIMISATION AND PROCESSING... 100

6.1 Proposed Algorithms... 100

6.1.1 Selecting Queries... 101

6.1.2 Deleting Queries... 103

6.1.3 Updating Queries... 104

6.1.4 SQL Query String Generation ... 106

6.1.5 Stack-Based Query Evaluation and Result Reconstruction... 108

6.2 Query Analysis and Optimisation Examples... 108

6.2.1 Relational Test Schemata ... 109

6.2.2 Selecting Queries... 117

6.2.3 Imperative Constructs ... 130

6.2.4 Multi-Wrapper and Mixed Queries ... 137

6.2.5 SBQL Optimisation over Multi-Wrapper Queries ... 144

6.3 Sample Use Cases ... 151

6.3.1 Rich Employees... 151

6.3.2 Employees with Departments... 152

6.3.3 Employees with Cars... 153

6.3.4 Rich Employees with White Cars ... 154

CHAPTER 7WRAPPER OPTIMISATION RESULTS... 156

7.1 Relational Test Data... 157

7.2 Optimisation vs. Simple Rewriting ... 159

7.3 Application of SBQL optimisers... 181

CHAPTER 8SUMMARY AND CONCLUSIONS... 183

8.1 Prototype Limitations and Further Works... 184

8.2 Additional Wrapper Functionalities ... 185

APPENDIX ATHE EGOV-BUS PROJECT... 187

APPENDIX BTHE ODRAPLATFORM... 190

B.1 ODRA Optimisation Framework... 193

APPENDIX CTHE PROTOTYPE IMPLEMENTATION... 195

C.1 Architecture ... 195

C.1.1 Communication protocol ... 196

C.2 Relational Schema Wrapping ... 198

C.2.1 Example ... 199

C.2.2 Relational Schema Models ... 201

C.2.3 Result Retrieval and Reconstruction... 206

C.3 Installation and Launching ... 209

C.3.1 CD Contents ... 209

C.3.2 Test Schemata Generation ... 210

C.3.3 Connection Configuration ... 210

C.3.4 Test Data Population ... 211

C.3.5 Schema Description Generation ... 212

C.3.6 Server... 213

C.3.7 Client ... 217

C.4 Prototype Testing ... 217

C.4.1 Optimisation Testing ... 218

C.4.2 Sample batch files... 219

INDEX OF FIGURES... 221

INDEX OF LISTINGS... 224

INDEX OF TABLES... 225

(5)

Abstract

This Ph.D. thesis is focused on transparent and efficient integration of relational databases to an object-oriented distributed database system available to top-level users as a virtual repository. The core of the presented solution is to provide a wrapper – a dedicated generic piece of software capable of interfacing between the virtual repository structures (in the most common case – object-oriented updateable views) and the wrapped relational database, enabling bidirectional data exchange (e.g. retrieval and updates) with optimal query evaluation.

The idea of integration of distributed, heterogeneous, fragmented and redundant databases can be dated to the eighties of the last century and the concept of federated databases, nevertheless the virtual repository approach is closer to the distributed mediation concept from the early nineties. Regardless of the origin, the goal of such systems is to present final and complete business information transparently combined of data stored in bottom-level resources, exactly matching top-level user demands and requirements. In the most general case, the resources mean any data sources and feeds, including the most common relational databases (the focus of the thesis), object-oriented and object-relational databases, XML and RDF data stores, Web Services, etc. The need for integration of various resources representing completely different paradigms and models is caused by the characteristics of the nowadays data and information management systems (besides the resources’ distribution, fragmentation, replication, redundancy and heterogeneity) – the software is usually written is some high-level object-oriented programming language that is completely unadjacent to the resource-level query language (the phenomenon is often referred to as the impedance

mismatch).

This integration process must be completely transparent so that an end user (a human or an application) is not aware of an actual data source model and structure.

(6)

On the other side, an object-oriented database communicating directly with such a wrapper must be opaque – it must work as a black box and its underlying object-relational interface cannot be available directly from any system element located in upper layers. Another feature of a wrapper is its genericity – its action and reliability must be completely independent of a relational resource it wraps. Also neither data materialisation nor replication are allowed at the virtual repository side (or any other intermediate system module), of course except for current query results that must be somehow returned to users querying the system. Similarly, updating the wrapped resource data from the virtual repository level and its object-oriented query language must me assured. Besides the transparency aspects, the most effort has been devoted to efficient optimisation procedures enabling action of powerful native relational resource query optimisers together with the object-oriented optimisation methods applicable in the virtual repository.

The virtual repository for which the wrapper was designed relies on the stack-based approach (SBA), the corresponding stack-stack-based query language (SBQL) and the updateable object-oriented views. Therefore the wrapper must be capable of transforming object-oriented queries referring to the global schema expressed in SBQL into SQL-optimiseable relational queries, whose results are returned to the virtual repository in the same way as actual object-oriented results.

The thesis has been developed under the eGov-Bus (Advanced eGovernment Information Service Bus) project supported by the European Community under “Information Society Technologies” priority of the Sixth Framework Programme (contract number: FP6-IST-4-026727-STP). The idea with its different aspects (including the virtual repository it is a part of and data fragmentation and integration issues) has been presented in over 20 research papers, e.g. [1, 2, 3, 4, 5, 6, 7].

Keywords: database, object-oriented, relational, query optimisation, virtual repository,

(7)

Rozszerzone streszczenie

Liczba dostępnych obecnie źródeł danych jest ogromna. Wiele z nich jest dostępne poprzez Internet, jakkolwiek mogą one nie być publiczne lub ograniczać dostęp do ściśle określonej grupy uŜytkowników. Takie zasoby są rozproszone, niejednorodne, podzielone i nadmiarowe. Idea, której część została opracowana i zaimplementowana w poniŜszej pracy doktorskiej polega na wirtualnej integracji takich zasobów w zcentralizowaną, jednorodną, spójną i pozbawioną fragmentacji oraz nadmiarowości całość tworzącą wirtualne repozytorium zapewniające pewne powszechne funkcjonalności i usługi, włączając w to infrastrukturę zaufania (bezpieczeństwo, prywatność, licencjonowanie, płatności, itp.), Web Services, rozproszone transakcje, zarządzanie procesami (workflow management), itd.

Głównym celem przedstawionych prac jest integracja heterogenicznych relacyjnych baz danych (zasobów) do obiektowego systemu bazodanowego, w którym takie zasoby są widziane jako czysto obiektowe modele i składy, które mogą być w przeźroczysty sposób odpytywane za pomocą obiektowego języka zapytań (innymi słowy, muszą być one nieodróŜnialne od rzeczywistych zasobów obiektowych). W związku z tym schematy relacyjne muszą zostać „osłonięte” wyspecjalizowanym oprogramowaniem zdolnym do dwukierunkowej wymiany danych pomiędzy znajdującym się na samym dole SZRBD i innymi elementami systemu. Ten proces musi być przeźroczysty, aby końcowy uŜytkownik (człowiek lub aplikacja) nie był w Ŝaden sposób świadomy faktycznego modelu zasobu. Z drugiej strony, obiektowa baza danych komunikująca się bezpośrednio z taką osłoną musi być całkowicie nieprzejrzysta (czarna skrzynka), aby osłona nie była w Ŝaden sposób bezpośrednio dostępna z Ŝadnego elementu systemu znajdującego się w górnych warstwach architektury. Kolejną cechą takiej osłony jest pełna generyczność – jej działanie, niezawodność i własności wydajnościowe powinny być całkowicie uniezaleŜnione od zasobu, który kryje. Z załoŜenia niedozwolone są takŜe materializacja i replikacja danych relacyjnych

(8)

po stronie wirtualnego repozytorium (ani Ŝadnego innego pośredniego modułu systemu), oczywiście poza bieŜącymi wynikami zapytań, które muszą zostać zwrócone uŜytkownikom odpytującym system.

Tezy pracy zostają zdefiniowane jak poniŜej:

1. Spadkowe bazy relacyjne mogą zostać w przeźroczysty sposób włączone do

obiektowego wirtualnego repozytorium, a znajdujące się w nich dane przetwarzane i aktualizowane przez obiektowy język zapytań – nieodróŜnialnie od danych czysto obiektowych, bez konieczności materializacji i replikacji.

2. Dla takiego systemu mogą zostać opracowane i zaimplementowane mechanizmy

umoŜliwiające współdziałanie obiektowej optymalizacji wirtualnego

repozytorium z natywnymi optymalizatorami zasobu relacyjnego.

Sztuka budowania obiektowych osłon przeznaczonych do integracji heterogenicznych zasobów relacyjnych do homogenicznych obiektowych systemów bazodanowych jest rozwijana od około piętnastu lat. Prowadzone prace mają na celu połączenie znacznie starszej teorii baz relacyjnych (podstawy zostały zebrane i zaprezentowane przez Edgara F. Codda prawie czterdzieści lat temu) z względnie młodym paradygmatem obiektowym (pojęcie „obiektowego systemu bazodanowego” pojawiło się w połowie lat osiemdziesiątych ubiegłego stulecia, a podstawowe koncepcje zamanifestowane w [8], jakkolwiek odpowiednie prace rozpoczęły się ponad dziesięć lat wcześniej).

Prezentowana praca doktorska jest skoncentrowana na nowatorskim podejściu do integracji heterogenicznych zasobów relacyjnych do rozproszonego obiektowego systemu bazodanowego. Zasoby te muszą być dostępne dla globalnych uŜytkowników poprzez globalny model obiektowy i obiektowy język zapytań, tak aby ci uŜytkownicy nie byli w Ŝaden sposób świadomi faktycznego modelu i składu zasobu. Opracowany i zaimplementowany proces integracji jest całkowicie przeźroczysty i umoŜliwia dwukierunkową wymianę danych, tj. odpytywanie zasobu relacyjnego (pobieranie danych zgodnych z kryteriami zapytań) i aktualizację danych relacyjnych. Ponadto, znajdujące się tuŜ nad zasobem relacyjnym struktury obiektowe (utworzone bezpośrednio w oparciu o ten zasób) mogą zostać bezproblemowo przekształcane i filtrowane (poprzez kaskadowo nabudowane aktualizowalne perspektywy obiektowe) w taki sposób, aby odpowiadały modelowi biznesowemu i ogólnemu schematowi, którego część mają stanowić. Poza aspektami przeźroczystości, największy wysiłek

(9)

został poświęcony procedurom wydajnej optymalizacji zapytań umoŜliwiającej działanie natywnych optymalizatorów zasobu relacyjnego. Te funkcjonalności zostały osiągnięte poprzez wyspecjalizowany moduł stanowiący osłonę obiektowo-relacyjną, nazywany dalej w skrócie osłoną.

Zaprezentowane w pracy doktorskiej prototypowe rozwiązanie zostało oparte o podejście stosowe do języków zapytań i baz danych (SBA, Stack-Based Approach), wynikające z niego język zapytań (SBQL, Stack-Based Query Lanuage) oraz aktualizowalne obiektowe perspektywy, interfejs JDBC, protokół TCP/IP oraz język SQL. Implementacja została wykonana w języku JavaTM. Koncepcja z jej róŜnymi aspektami (włącznie z wirtualnym repozytorium, którego jest częścią oraz kwestiami fragmentacji i integracji danych) przedstawiona została w ponad 20 artykułach, np. [1, 2, 3, 4, 5, 6, 7].

Praca doktorska została wykonana w ramach projektu eGov-Bus (Advanced

eGovernment Information Service Bus) wspieranego przez Wspólnotę Europejską

w ramach priorytetu „Information Society Technologies” Szóstego Programu Ramowego (nr kontraktu: FP6-IST-4-026727-STP).

Tekst pracy został podzielny na następujące rozdziały, których zwięzłe streszczenia znajdują się poniŜej:

Chapter 1 Introduction

Wstęp

Rozdział zawiera motywację do podjęcia tematyki, tezy, cele i załoŜenia pracy doktorskiej, opis wykorzystanych rozwiązań technicznych oraz zwięzły opis stanu wiedzy i prac powiązanych z tematyką.

Chapter 2 The State of the Art and Related Works

Stan wiedzy i prace pokrewne

W tej części pracy omówione zostały stan wiedzy oraz inne prowadzone na świecie prace mające na celu powiązanie baz relacyjnych z obiektowymi systemami i językami programowania. Jako punkt wyjściowy zostało przyjęte zjawisko „niedopasowania impedancji” pomiędzy systemami językami zapytań i językami programowania, jego konsekwencje i moŜliwe rozwiązania problemu. W szczególności uwzględniona została architektura mediatorów i wrapperów zaproponowana przez G. Wiederholda w 1992 oraz szereg rozwiązań opartych na tym właśnie modelu. W dalszej kolejności opisane zostały róŜnorodne rozwiązania ORM (Object-Relational

(10)

Mapping) i DAO (Data Access Objects) mające na celu mapowanie istniejących danych

zgromadzonych w systemach relacyjnych na struktury obiektowe dostępne z poziomu obiektowych języków programowania oraz zapewnianie trwałości obiektów na poziomie obiektowych języków programowania w (relacyjnych) bazach danych, odpowiednio. Część rozdziału została poświęcona takŜe technikom udostępniania danych relacyjnych poprzez perspektywy XML oraz RDF. Oddzielny podrozdział stanowi krótki opis roli osłony w ogólnej architekturze wirtualnego repozytorium opracowywanego w ramach projektu eGov-Bus.

Chapter 3 Relational Databases

Relacyjne bazy danych

Znajdują się tutaj podstawy systemów relacyjnych oraz optymalizacji zapytań relacyjnych z istniejącymi wyzwaniami i opracowanymi metodami, głównie w odniesieniu do najpopularniejszego obecnie języka SQL. Główna część rozdziału dotyczy architektury optymalizatora zapytań wyraŜonych w języku SQL działającego w oparciu o rachunek relacyjny i algebrę relacyjną oraz najpopularniejszych technik stosowanych we współczesnych systemach relacyjnych. Wspomniane zostały równieŜ „kamienie milowe” w rozwoju optymalizacji zapytań relacyjnych takie jak System-R, Starburst i Volcano.

Chapter 4 The Stack-Based Approach

Podejście stosowe

Rozdział koncentruje się na podstawach podejścia stosowego (SBA) oraz obecnie zaimplementowanych w wirtualnym repozytorium metodach optymalizacji zapytań obiektowych, głównie w odniesieniu do języka SBQL, który został wykorzystany podczas prac nad prototypem osłony. W rozdziale ujęte zostały metody optymalizacji SBQL oparte na przepisywaniu zapytań (optymalizacja statyczna) związane z jak najwcześniejszym wykonywaniem selekcji, przepisywaniem perspektyw (query modification), usuwaniem martwych podzapytań i pomocniczych nazw, jak równieŜ wykorzystaniem indeksów.

Chapter 5 Object-Relational Integration Methodology

Metodologia integracji obiektowo-relacyjnej

Ta część pracy stanowi koncepcyjny opis opracowanej i zaimplementowanej metodologii integracji zasobów relacyjnych do obiektu wirtualnego repozytorium. Omówione zostały załoŜenia systemu, sposób odwzorowania schematów z zastosowaniem aktualizowanych perspektyw oraz przetwarzania zapytań obiektowych

(11)

na relacyjne. Wprowadzone zostały takŜe podstawy przetwarzania zapytania mającego na celu umoŜliwienia działania optymalizacji obiektowej oraz relacyjnej. Rozdział zakończony jest abstrakcyjnym (niezaleŜnym od implementacji) przykładem mapowania schematów oraz optymalizacji zapytania.

Chapter 6 Query Analysis, Optimisation and Processing

Analiza, optymalizacja i przetwarzanie zapytań

W rozdziale znajduje się szczegółowe omówienie opracowanych i zaimplementowanych algorytmów analizy i transformacji zapytań, które mają na celu uzyskanie maksymalnej wydajności i niezawodności systemu. Zaprezentowane zostają przykładowe schematy relacyjne oraz proces ich odwzorowania (za pomocą aktualizowanych perspektyw obiektowych) na schemat obiektowy dostępny dla wirtualnego repozytorium. Omówione metody poparte są rzeczywistymi przykładami przekształceń zapytań odnoszących się do wspomnianych schematów testowych, w których szczegółowo zademonstrowane są poszczególne etapy analizy i optymalizacji. Na końcu rozdziału znajdują się przykłady dalszych transformacji utworzonego schematu obiektowego, które mogą zostać zastosowane w górnych warstwach wirtualnego repozytorium.

Chapter 7 Wrapper Optimisation Results

Wyniki optymalizacji osłony

Rozdział stanowi prezentację i dyskusję wyników optymalizacji przeprowadzanej przez osłonę w odniesieniu do niezoptymalizowanych zapytań oraz do zapytań optymalizowanych za pomocą mechanizmów SBQL. Działanie osłony przetestowane zostało na względnie duŜym zbiorze testowych zapytań (odnoszących się to przykładowych schematów wprowadzonych w poprzednim rozdziale) wykorzystujących róŜnorodne mechanizmy optymalizacyjne języka SQL, którego zapytania wykonywane są bezpośrednio na zasobie relacyjnym. Przeprowadzone zostały takŜe testy dla róŜnych wielkości baz danych, w celu obserwacji zaleŜności wyniku optymalizacji od liczności pobieranych rekordów oraz liczby wykonywanych złączeń.

Chapter 8 Summary and Conclusions

Podsumowanie i wnioski

Zawarte zostały tutaj doświadczenia i wnioski zdobyte podczas opracowywania osłony i testowania prototypu. Podsumowanie wyników optymalizacji przeprowadzanej przez prototyp jednoznacznie dowodzi słuszności tez pracy doktorskiej. Osobny

(12)

podrozdział poświęcony jest dalszym pracom, które mogą zostać wykonanie w celu rozwoju prototypu i rozszerzenia jego funkcjonalności.

Tekst pracy został rozszerzony o trzy załączniki omawiające kolejno projekt eGov-Bus, obiektowy system bazodanowy ODRA odpowiedzialny za obsługę wirtualnego repozytorium, w ramach którego został zaimplementowany prototyp osłony oraz szczegóły implementacji prototypu osłony.

(13)

Chapter 1 Introduction

1.1 Motivation

Nowadays, we observe an enormous number of data sources. Many of them are available via Internet (although they may be not public or restrict access to a limited number of users). Such resources are distributed, heterogeneous, fragmented and redundant. Therefore, information available to users (including governmental and administrative offices and agencies as well as companies and industry) is incomplete and requires much effort and usually manual (or human-interactive computer-aided) analysis to reveal its desired aspects. Even after such time and resource consuming process actually there is no guarantee that some important pieces of information are not lost or overlooked.

The main idea partially developed and implemented in the thesis is to virtually integrate such resources into a centralised, homogeneous, integrated, consistent and non-redundant whole constituting a virtual repository providing some common functionalities and services, including a trust infrastructure (security, privacy, licensing, payments, etc.), web services, distributed transactions, workflow management, etc. The proposed solution is focused on relational databases that are the most common in currently utilised (and still developed) data and information management systems. Contemporary system designers still choose them since they are mature, predictable, stable and widely available as either commercial or free products with professional maintenance and technical support. On the other hand programmers tend to use high-level object-oriented programming languages that collide with relational paradigms and philosophy (the object-relational impedance mismatch phenomenon). Global migration to object-oriented database systems is unlike in reasonable future due to programmers’ habits, its unpredictably high costs (and time required) and lack of commonly accepted

(14)

object-oriented standards and mature database systems (most of them are functional, however still experimental prototypes). These factors reveal strong need for providing a smooth bridge between relational and object-oriented technologies and systems enabling the effective and seamless manipulating relational data in the object-oriented manner together with combining them with purely object-oriented data.

1.2 Theses and Objectives

The Ph.D. dissertation is focused on a novel approach to integration of heterogeneous relational database resources in an object-oriented distributed database system. Such resources must be available to global (top level) users via a common object data model (provided by the virtual repository) and an object-oriented query language so that these users are not aware of an actual resource data model and storage. Relational databases are mapped as purely object-oriented models and stores that can be transparently queried with an object-oriented query language. In other words, they are to be compatible with actual object-oriented resource. Therefore, relational schemata must be

wrapped (enveloped) with some dedicated pieces of software – interfaces capable of

bidirectional exchanging data between a bottom-level RDBMS and other elements of the system. Such an interface is referred to as a relational-to-object data wrapper, or simply – a wrapper.

This integration process must be completely transparent so that an end user (a human or an application) is not aware of an actual data source model and structure. On the other side, an object-oriented database communicating directly with such a wrapper must be opaque – it must work as a black box and its underlying object-relational interface cannot be available directly from any system element located in upper layers. Another feature of a wrapper is its genericity: its action and reliability must be completely independent of a resource it wraps. Also neither data

materialisation nor replication are allowed at the virtual repository side (or any other

intermediate system module), of course except for current query results that must be somehow returned to users querying the system.

(15)

The summarized theses are:

1. Legacy relational databases can be transparently integrated to an

object-oriented virtual repository and their data can be processed and updated with an oriented query language indistinguishably from purely object-oriented data without materialisation or replication.

2. Appropriate optimisation mechanisms can be developed and implemented for

such a system in order to enable coaction of the object-oriented virtual repository optimisation together with native relational resource optimisers.

The prototype solution accomplishing, verifying and proving the theses has been developed and implemented according to the modular reusable software development methodology, where subsequent components are developed independently and combined according to the predefined (primarily assumed) interfaces.

First, the state of the art in the field of integration of heterogeneous database and information systems was analysed. A set of related solutions has been isolated with their assumptions, strengths and weaknesses, considering possibility of their adaptation to the designed solution (Chapter 2). The experience of previous works together with requirements of the virtual repository and the author’s experience from the previously participated commercial projects allowed designing and implementing the general modular wrapper architecture (subchapter 5.1 General Architecture and Assumptions). The architecture assures genericity, flexibility and reliability in terms of communication and data transfer and it was not subjected to any substantial verification and experiments. Basing on the above, the first working prototype (without optimisation issues, but allowing transparent and reliable accessing the wrapped resource, referred to as the naive approach) has been implemented and experimentally tested. Since this part of the solution is mutually independent of the virtual repository and the objectives were clear and well defined, the fast cascaded development model has been used.

The next stage concerned thorough analysis of various optimisation methods used in relational database systems (Chapter 3) and object-oriented database systems, especially based on the stack-based approach (Chapter 4) implemented in the virtual repository (subchapter 2.3 The eGov-Bus Virtual Repository). The goal was to establish SBQL syntax patterns transformable to optimiseable SQL queries. Subsequently improved prototypes verified with experiments have allowed achieving a set of reliable query rewriting rules and designing the desired query optimiser. The evolutionary/spiral

(16)

development model used at this stage was implied by integration of the wrapper with the independently developed virtual repository and its continuously changing implementation. Also, the experiments’ results concerning the query evaluation efficiency and reliability forced further improvement of the rewriting and optimisation methods.

The described shortly development process consists mainly of analysis of related solutions in terms of the thesis objectives and integration requirements with the rest of the virtual repository. The conclusions allowed stating primary assumptions realised in subsequent prototypes and continuously verified with experiments. The resulting integration process is completely transparent and it enables bidirectional data transfer, i.e. querying a relational resource (retrieving data according to query conditions) and updating relational data. Moreover, bottom-level object-oriented data structures basing directly on a relational resource can be easily transformed and filtered (with multiply cascaded object-oriented updateable views, subchapter 6.3 Sample Use Cases) so that they comply with the business model and goals of the general schema they are a part of. Besides the transparency aspects, the most effort has been devoted to efficient optimisation procedures enabling action of powerful native relational resource query optimisers. The prototype solution realising the above goals and functionalities is implemented with JavaTM language. It bases on:

• The Stack-Based Approach (SBA), providing SBQL (Stack-Based Query Language) being a query language with a complete computational power of regular programming languages and updateable object-oriented views,

• JDBC used for connecting to relational databases and manipulating their data; the technology is supported by almost all currently used relational systems,

• TCP/IP sockets allowing communication between distributed system components,

• SQL used for low-level communication with wrapped databases.

1.3 History and Related Works

The art of building object-relational wrappers for integrating heterogeneous relational database resources into homogeneous object-oriented database systems has been developed for about fifteen years, struggling to join the much older theory of relational databases (almost forty years since its fundamentals were presented by Edgar F. Codd) with the relatively young object-oriented database paradigm. The term of an

(17)

“object-oriented database system” first appeared in the mid-eighties of the last century and the concepts were manifested in [8], however the corresponding works started more than ten years earlier.

The concept of mediators and wrappers was first formulated by Wiederhold in [9] as a set of indications for developing future information systems. The inspiration was the growing common demand for information induced by broadband networks and rapid Internet development, simultaneously obstructed by inefficient, fragmented, heterogeneous and distributed data sources. The main idea was to support the decision-making software with some flexible infrastructure (expressed in terms of mediators and wrappers) capable of retrieving complete information without human interference. A mediator was defined as an autonomous piece of software capable of processing data from am underlying datasource according to the general system requirements. Since mediators were considered resource-independent, they were supplied with wrappers – another software modules transparently interfacing between the resource and the mediator itself. Another very important concept stated was that mediators are not to be accessible in any user-friendly language, but a communication-friendly language – the mediation process (i.e. creating the actual information basing on various mediators and their datasources) is realised in the background and it must be reliable and efficient. The user interface is realised by some top-level applications and they should be user-friendly instead.

The approach and the methodology presented in [9] had several implementations, e.g. Pegasus (1993), Amos (1994) and Amos II (developed since 2002), DISCO (1997) described shortly in the subchapter 2.2.1 Wrappers and

Mediators. In the virtual repository the described wrapper solution is a part of, the

mediation process relies on updateable object-oriented views based on the Stack-Based Approach.

There exist also many other approaches aiming to “objectify” relational data – they are described in subchapters 2.2.2 ORM and DAO, 2.2.3 XML Views over

Relational Data, 2.2.4 Applications of RDF. Nevertheless, their application in the thesis

(18)

1.4 Thesis Outline

The thesis is subdivided into the following chapters:

Chapter 1 Introduction

The chapter presents the motivation for the thesis subject, the theses and the objectives, the description of solutions employed and the short description of the state of the art and the related works.

Chapter 2 The State of the Art and Related Works

The state of the art and the related works aiming to combine relational and object-oriented databases and programming languages (including mediator/wrapper approaches, ORM/DAO and XML) are briefly discussed here. Further, the wrapper location in the overall virtual repository architecture is presented.

Chapter 3 Relational Databases

The fundamentals of relational systems and their query optimisation methods and challenges are given, focused mainly on SQL, the most common relational query language.

Chapter 4 The Stack-Based Approach

The stack-based approach and SBQL, the corresponding query language, are presented, with the concept fundamentals and meaningful query examples. Also, the SBQL optimisation methods are introduced.

Chapter 5 Object-Relational Integration Methodology

The chapter presents the concept and the assumptions of the developed and implemented methodology for integrating relational resources into virtual repository structures. The schema transformation procedures are given, followed by query analysis, optimisation and processing steps. The chapter is concluded with a conceptual example of the query processing methods applied.

Chapter 6 Query Analysis, Optimisation and Processing

The detailed description of the developed and implemented query analysis and transformation methods is given, followed by demonstrative examples based on two relational test schemata introduced. The chapter ends with more complex examples of

(19)

further schema transformations with object-oriented updateable views applicable in upper levels of the virtual repository.

Chapter 7 Wrapper Optimisation Results

The chapter provides the discussion of the optimisation results for the prototype based on a relatively large set of sample queries and various database sizes (for the schemata presented in the previous chapter).

Chapter 8 Summary and Conclusions

The conclusions and future works that can be performed for the further wrapper prototype development.

The thesis text is extended with three appendices describing the eGov-Bus project, the ODRA platform responsible for the virtual repository environment and the wrapper prototype implementation issues.

(20)

Chapter 2 The State of the Art and Related

Works

The art of object-oriented wrappers built on top of relational database systems has been developed for years – first papers on the topic are dated to late 80s and were devoted to federated databases. The motivation for the wrappers is reducing the technical and cultural difference between traditional relational databases and novel technologies based on object-oriented paradigms, including analysis and design methodologies (e.g., based on UML), object-oriented programming languages (e.g. C++, Java, C#), object-oriented middleware (e.g., based on CORBA), relational databases and pure object-oriented databases. Recently, Web technologies based on XML/RDF also require similar wrappers. Despite the big pressure on object-oriented and XML-oriented technologies, people are quite happy with relational databases and there is a little probability that the market will massively change soon to other data store paradigms (costs and time required for such migration are unpredictably huge).

Unfortunately, the object-orientation has as many faces as existing systems, languages and technologies. Thus, the number of combinations of object-oriented options with relational systems and applications is very large. Additionally, wrappers can have different properties, in particular, can be proprietary to applications or generic, they can deal with updates or be read-only, can materialise objects on the wrapper side or deliver purely virtual objects, can deal with object-oriented query language or provide some iterative “one-object-in-a-time” API, etc. [10]. This causes an extremely huge number of various ideas and technologies. For instance, Google reports more than 500 000 results as a response to the query “object relational wrapper”.

(21)

2.1 The Impedance Mismatch

The problem of the impedance mismatch between query languages and programming

languages is very well realised and documented, for example see [11]. The impedance

mismatch term is used here as a mere analogy to the electrical engineering phenomenon. It refers to a set of conceptual and technical difficulties which are often encountered when a RDBMS is used by a program written in an object-oriented programming language or style, particularly when objects and/or class definitions are mapped in a straightforward way to database tables and/or relational schemata.

The basic technical differences and programming difficulties between programming and query (mainly the most popular SQL) languages refer to [11]:

• Syntax – a programmer is obliged to use two different language styles and obey

two different grammars,

• Typology – a query language operates on types defined in a relational database

schema, while a programming language is usually based on a completely different typological system, where nothing like relation exists. Most programming languages are supplied with embedded static (compile time) typological control, while SQL does not provide it (dynamic binding),

• Semantics and language paradigms –concepts of language semantics are

completely different. A query language is declarative (what to retrieve, not how to retrieve), while programming languages are imperative (how to do, instead of

what to do),

• Pragmatics – a query language frees a programmer from many data organisation

and implementation details (collection organisation, presence of indices, etc.), while in programming languages these details must be coded explicitly,

• Binding phases and mechanisms – query languages are interpreted (late binding),

while programming languages assume early binding during compilation and linking phases. This causes many problems, e.g. for debuggers,

• Namespaces and scoping rules – query and programming languages have their

own namespaces that can contain the same names with different meanings. Mapping between these namespaces requires additional syntactical and semantic tools. Programming language namespace is hierarchical and obeys stack-based scoping rules – these rules are ignored by a query language, which causes many inconveniences,

(22)

• Null values – databases and query languages are supplied with dedicated tools for

storing and processing null values, these means are not available to programming languages (or null is regarded completely different from the one in a database),

• Iterative schemata – in a query language iterations are embedded in its operators’

semantics (e.g. selection, projection, join), while in a programming language iterations must be realised explicitly with some loops (e.g. for, while, repeat). Query result processing with a programming language requires dedicated facilities like cursors and iterators,

• Data persistence – query languages process only persistent (physically stored)

data, while programming languages operate only on volatile (located in operating memory) data. Combination of these languages requires using dedicated language constructs to parameterize queries with language variables and other language and architectural means for transmitting stored data to memory and reversely,

• Generic programming means – these means in a query language are based on

reflection (e.g. in dynamic SQL). Using something similar in a programming language is usually impossible because of early binding. Other means are used instead, e.g. higher-level functions, casting, switching to lower language level, polymorphism or templates.

SBQL, the query language with a complete functionality of an object-oriented programming language (described in Chapter 4) applied in the virtual repository does not involve the impedance mismatch unless it is embedded in some programming language.

The impedance mismatch is also very often defined in terms of relational and

object-oriented database systems and their query and programming languages,

respectively, regarding mainly their diverse philosophical and cultural aspects [12, 13]. The most important and pointed differences are found in interfaces (relational data only vs. object-oriented “behaviour”), schema binding (much more freedom to an object structure than in strict relational tables) and access rules (a set of predefined relational operators vs. object-specific interfaces). To solve this mismatch problem supporters of each paradigm argue that the other one should be abandoned. But this does not seem to be a good choice, as relational systems are still extremely popular and useful in many commercial, business, industrial, scientific and administrative systems (the situation does not seem to change soon), while programmers are willing to use object-oriented

(23)

languages and methodologies. There are proposed, however, some reasonable solutions that can minimise the impact of the impedance mismatch on further development on information systems. Preferably, object-oriented programmers should realise that a very close mapping between relational data and object-oriented is erroneous and leads to many complications. Their object-oriented code should be closer to relational model, as with application of JDBC for Java (applied in the wrapper prototype implementation) or ADO.NET [14] for C#.

2.2 Related Works

Below there are presented some of the most important solutions aiming to access (mainly) relational databases with object-oriented tools (preferably in distributed environments) over last fifteen years roughly grouped and classified by their methodologies and main approaches. Many of them are not strictly related to the subject of the thesis, nevertheless their importance in the field cannot be neglected, as well as some experiences and concepts became useful for the prototype implementation.

2.2.1 Wrappers and Mediators

The concept of wrappers and mediators to heterogeneous distributed data sources was introduced in [9] as vision and indication for developing future information systems for next years (appropriate solutions were usually unavailable yet and were assumed to appear within next ten years). The idea appeared as a response to a growing demand for data and information triggered by the evolution of fast broadband networks and still limited by distributed heterogeneous and inefficient data sources. A knowledge that information exists and is accessible causes users’ expectations, while real-life experience showing that this information is not available in a useful form and it cannot be combined with other pieces of information makes much confusion and frustration. The main reasons for designing and implementing proposed mediation systems are heterogeneities of single databases, where a complete lack of data abstractions exits and no common data representation is available. This issue makes combining information from various database systems very difficult and awkward.

The main goal of this architecture was to design mediators as relatively simple software modules transparently keeping specific data information and sharing those data abstractions with higher level mediators or applications. Therefore for large networks

(24)

mediators should be defined for primitive (bottom-level) mediators that can still mediate for lower modules and data sources. Mediators described in [9] were provided with some technical and administrative “knowledge” concerning how to process data behind them in order to achieve effective decision process supported by distributed, modular and composable architecture. They must work completely automatically, without any interference of human experts, producing complete information from available data.

The described information processing model of the proposed mediations system was based on existing human-based or human-interactive solutions (without any copying or mimicking of them, however). The mediation process is introduced in order to overcome the feature of a database-application interface, where only a communication protocol and formats are defined – the interface does not resolve abstraction and representation problems. A proper interface must be active, which functionality is defined as mediation. Mediation includes all the processing needed to make an interface work, knowledge structures driving data transformations and any intermediate storage (if required). Mediators defined as above should be realised as explicit modules acting as an intermediate active layer between user applications and data resources in sharable architecture, independent of data resources. There are also defined meta-mediators responsible for managing and enabling allocating and accessing actual mediators and data sources.

The concept of splitting a database-application interface with a mediator layer makes two additional interfaces appear: between a database server and a mediator and between a mediator and a user application.

Since programmers tend to stick to one interface language and they are unwilling to switch to any other, unless the current one becomes somehow inconvenient or ineffective, access to a mediator must be defined in a universal high-level extensible and flexible language. The proposal does not suggest any particular language, rather than its recommendation and features. A strong assumption is made that such a language does not have to be user-friendly, but machine- and communication-friendly, as it works in the background – any user interaction is defined on an application level and it is completely separated from a mediation process. This statement allows omitting problems and inadequacies introduced to SQL [15].

On the other side of a mediator, an interaction with a data source takes place. A communication can be realised with any common interface language supported by

(25)

a resource, e.g., SQL for relational databases. A mediator covering several data sources can be also equipped with appropriate knowledge on how to combine, retrieve and filter data.

The most effective mediation can be achieved if a mediator serves a variety of application, i.e. it is sharable. On the other hand, applications can compose their tasks basing on a set of underlying mediators (unavailable information may motivate creating new mediators).

The simplified mediation architecture is presented in Fig. 1.

Resource Wrapper Mediator Wrapper Mediator Wrapper Mediator Mediator Wrapper Mediator Wrapper Mediator Mediator Mediator Wrapper Mediator User application 1 User application 2 Wrapper Mediator Wrapper Resource Resource

Resource Resource Resource

Resource Resource Communication-friendly internal language User-friendly external language Resource language

Fig. 1 Mediation system architecture

The general mediation concept presented briefly above deserved several implementations; the most important are described shortly in the following subsections. Its intentions and architectural aspects are also reflected in the virtual repository idea (subsection 2.3 The eGov-Bus Virtual Repository).

2.2.1.1 Pegasus

Pegasus [16, 17] was a prototype multidatabase (federated) system. Its aim was to integrate heterogeneous resources into a common object-oriented data model. Pegasus accessed foreign systems with mapper and translator modules. Mappers were responsible for mapping a foreign schema into a Pegasus schema, while translators processed top-level queries expressed with an object-oriented query language

(26)

(HOSQL1, an Iris query language [18, 19]) to native queries applicable to an integrated resource.

A mapping procedure generated a Pegasus schema covering a resource so that it was compatible with a common database model (a function realised by object-oriented updateable views in the thesis) and defined data and operations available from a resource. Dedicated mappers and translators were designed for different resources and data models, including relational databases, and available as separate modules (mapper/translator pairs).

Mapping of relational databases served three different cases: with both primary and foreign keys given, without foreign keys and without primary keys [20]. In the first case a user-defined Pegasus type was created for each relation that has a simple primary key. Then for each user-defined type and for every non-key attribute in the associated relation, a function was created with the type as argument and the attribute as result. For relations with composite primary key, multiargument functions were created with the primary key fields as arguments and the attribute as result. No type creation was needed in the mapping. If an attribute in the mapped relation was a foreign key, it was replaced with the type corresponding to that key. In case of no foreign keys available, user-defined types could not be created, since the attributes that refer to a primary key field cannot be substituted by its associated type. However, a functional view of the underlying relations was created in a manner similar to the previous case. In the last case (no primary keys’ information), as functional dependencies of any kind were not known, only a predicate function was created for each relation, where all the attributes of the relations formed an argument of the function. A predicate function contained one or more arguments returned a boolean result.

An input query expressed in terms of Pegasus schema with HOSQL was parsed into an intermediate syntax tree (F-tree) with functional expressions (calls, variables, literals) of HOSQL as nodes. The F-tree was then transformed into a corresponding B-tree with catalogue information – for foreign schema queries (e.g., targeting a relational database) this information included all necessary connection parameters (a database name, a network address, etc.). If such a query aimed at a single foreign system, it was optimised by Pegasus and sent directly to the resource (a minimum of processing at

1

(27)

a Pegasus side). In case of a query aiming at multiple foreign schemata, its parts were grouped by a destination in order to reduce a number of resource invocations (optimisation issues), which process resulted in D-tree structure. Basing on statistical information Pegasus performed a cost-based optimisation (join order, join methods, join sites, intermediate data routing and buffering, etc.) within such a D-tree. Once optimised, a query was translated into a foreign resource query language and executed.

2.2.1.2 Amos

Amos2, another early work devoted to building object-oriented views over relational databases and querying these databases with object-oriented query languages (basing of the Pegasus experience and the mediator approach [9]) has been described in [21]. In the system a user interacts with a set of object-oriented views that translate input queries to a relational query (a set of queries) that are executed on a covered resource. A returned result is again converted (possibly composed from a set of partial results of single relational queries) to an object-oriented form and presented to a user. Such a view besides of covering a relational resource was also capable of storing its own purely object-oriented data and methods. Therefore input queries can transparently combine relational and object references. One of strengths of these views was a query optimisation based on an object and relational algebra and calculus. The most important weakness was a lack of support for relational data updates. Object-oriented views were regarded as means to integrate a relational resource into multidatabase (federated) systems based on an object-oriented common data model (CDM) [22]. This approach assumed also applying views for a general case of mapping any heterogeneous component database into such a system, which approach is also reflected in the thesis virtual repository architecture. The prototype has been implemented with Amos [23] object-oriented database system over Sybase RDBMS.

An object-oriented view was designed over an existing relational database to express its semantics in terms of an object-oriented data model. At this mapping stage there was an assumption that a relational database was directly mappable to an object-oriented model, a set of relational views (or other modifications) was required for a relational resource to enable such mapping otherwise. After this mapping procedure a system was ready for querying. A query evaluation methodology for the described

2

(28)

system was developed as an extension to conventional methodology [24] where a query cost plans are compared with a translator introduced. A translator was a middleware piece of software responsible for replacing predicates during views’ expansion with their base relational expressions (in general: one-to-one mapping between relational tuples and Amos objects.

2.2.1.3 Amos II

Amos II [25, 26] is a current continuation of the Amos project shortly described above being a prototype object-oriented peer mediator system with a functional data model and a functional query language – AmosQL based on a wrapper/mediator concept introduced in [9]. The solution enables interoperations between heterogeneous autonomous distributed database systems. Transparent views used in Amos II mediators enable covering other mediators, wrapping data sources and native Amos objects in a modular and composable way. The Amos II data manager and its query processor enable defining additional data types and operators with some programming languages (e.g., Java or C).

Amos II is based on a mediator/wrapper approach [9] with mediator peers communicating over Internet rejecting a centralised architecture with a single server responsible for a resource integration process and translating data into a CDM. Each mediator peer works as a virtual database with data abstraction and a query language. Applications access data from distributed data sources through queries to views in mediator peers. Logical composition of mediators is achieved when multidatabase views in mediators are defined in terms of views, tables, and functions in other mediators or data sources.

Due to an assumed heterogeneity of integrated data sources, Amos II mediators can be supplied with one or more wrappers processing data from covered resources, e.g., ODBC relational databases, XML repositories, CAD systems, Internet search engines. In terms of Amos II, a wrapper is a specialised facility for query processing and data translations from between an external resource and the rest of the system. It contains an interface to a resource and information on how to efficiently translate and process queries targeting at its resource. In Amos II architecture its external peers are also regarded as external resources with a dedicated wrapper and special query

(29)

optimisation methods based on a distribution, capabilities, costs, etc. of the different peers [27].

Amos II assumes an optimisation of AmosQL queries prior to their execution. This procedure is based on an object calculus and an object algebra. First, a query is compiled and decomposed into algebra expressions. The object calculus is expressed in an internal simple logic-based language called ObjectLog [28], which is an object-oriented dialect of Datalog [29, 30]. AmosQL optimisation rules rely on its functional and multidatabase properties. Distributed multidatabase queries are decomposed into local queries executed on an appropriate peer (load balancing is also taken into account). At each peer, a cost optimiser based on statistical estimates is further applied. Finally, an optimised algebra is reinterpreted to produce a final result of a query.

Fig. 2 Distributed mediation in Amos II [26]

Since Amos II is a distributed system, it introduces a distributed mediation mechanism, where peers communicate and interoperate with TCP/IP protocol. The mediation process is illustrated in Fig. 2 where an application accesses data from two different mediators over heterogeneous distributed resources. Peer communication is depicted with thick lines (the arrows indicate peers acting as servers), and the dotted lines show a communication process with a name server to register a peer and to obtain information on a peer group. The name server stands for a mediator server keeping a simple meta-schema of a peer group (e.g., names, locations, etc.; a schema of each peer is kept and maintained at its side locally). The information in the name server is managed without explicit operator intervention; its content is managed through messages from autonomous mediator peers. Mediator peers usually communicate directly without involving a name server that is used only when a new mediator peer connection is established. A peer can communicate directly with any other peer within its group (Fig. 2 shows communication between different topological peer levels,

(30)

however). An optimal communication topology is established by the optimisation process for a particular query (individual peer optimiser can also exchange data and schema information to produce an optimised execution plan of a query).

2.2.1.4 DISCO

DISCO3 [31,32] was designed as a prototype heterogeneous distributed database based on underlying data sources according to a mediator/wrapper paradigm [9]. It realised common problems concerning such environments, i.e. inconsistencies and incompatibilities of data sources and unavailability and disappearance of resources. Moreover, much attention was put to cost-based query optimisation among wrappers. Besides regular databases, file systems and information retrieval systems (multimedia and search engines), DISCO was oriented on WWW resources (e.g., HTML documents). Its main application was search and integration of data stored in distributed heterogeneous resources, which solution should provide uniform and optimised information access based on a common declarative query language.

In a mediator-based architecture, end users interacted with an application which in turn accessed a uniform representation of underlying resources with a SQL-like declarative query language. These resources were covered by mediators encapsulating a representation of multiple data sources for this query language. Mediators could be developed independently or might be combined, which enabled to deal with complexities of distributed data sources. Queries entering mediators, expressed with an algebraic language supporting relational operations, were transformed into subqueries distributed to appropriate data sources. Wrappers interfacing between mediators and actual data sources represented resources as structured views. They accepted (sub)queries from mediators and translated them into an appropriate query language of a particular wrapped resource. On the other hand, query results were reformatted so that they were acceptable by wrappers’ mediators.

Each type of a data source required implementing a dedicated wrapper. A wrapper had to contain a precise definition of data source’s capabilities so that a subset of the algebraic language supported by a wrapper could be chosen (resource heterogeneity). On registration of a wrapper by a mediator, a wrapper transmitted this

3

(31)

subset description that could be supported in a mediator. This information was automatically incorporated by a mediator in a query transformation process.

As mentioned above, cost-based optimisation was one of the most important issues in DISCO. As in the case of defining a language subset available via a wrapper, optional cost information was also included at a wrapper implementation stage. This information concerned selected or all of the algebraic operations supported by a wrapper. Again, cost information was transmitted to a mediator when registering a wrapper. This wrapper-specific cost information overrode a general cost information used by a mediator so that a more accurate cost model could be used. The DISCO’s cost-based optimiser was used to produce the best possible query evaluation plan.

The problem of resource unavailability/disappearance was solved on a level of partial query results’ composition. A partial answer to a query being a part of a final result was produces by currently available data sources. It contained also a query representing finished or unfinished parts of the answer. When a previously unavailable data source became accessible, a partial answer could be resubmitted as a new query to obtain a final answer to the original query.

Wrapper-mediator interaction occurred in two phases: first – when a wrapper was registered by a mediator (a mediator could register various wrappers), second – during query processing. During registration a wrapper's local schema its support for query processing capabilities and specific cost information were supplied to a mediator. A mediator contained a global schema defined by its administrator (this global schema is seen by applications) and a set of views defining how to connect a global schema to local schemata. During a query processing phase a query from an application was passed to a mediator transforming it into a plan consisting of subqueries and a composition query (information how to produce a final result for a mediator). This plan was optimised with respect to information provided by wrappers in the integration stage. Then, a resulting plan was executed by issuing subqueries to appropriate available wrappers that evaluated them on their underlying resources and returned partial results. In an ideal case, when all the wrappers were available, a mediator combined their partial answers according to the composition query and returned a final result to the applications. If some wrappers were unavailable, however, a mediator returned only a partial answer. The application could extract from it some information depending on which wrappers were accessible.

(32)

2.2.1.5 ORDAWA

ORDAWA4 [33] was an international project whose aim was to develop techniques for integration and consolidation of different external data sources in an object–relational data warehouse. The detailed issues consisted of construction and maintenance of materialised relational and object-oriented views, index structures, query transformations and optimisations, and techniques of data mining.

The ORDAWA architecture [33] is realised as a customised wrapper/mediator approach. The object-oriented views with wrappers constitute mediators transforming data from bottom-level resources. The integrator works also as a top-level mediator basing on underlying mediators.

Any further description of ORDAWA is out of the scope of a thesis, since it assumes data materialisation, which approach has been strictly rejected here – it was mentioned just as another implementation of the wrapper/mediator approach.

2.2.2 ORM and DAO

ORM5 [34, 35, 36, 37] is a programming technique for converting data between incompatible type systems in databases and object-oriented programming languages. In effect, this creates a “virtual object database” which can be used from within the programming language.

In common relational DBMS, stored values are usually primitive scalars, e.g., strings or numeric values. Their structure (tables and columns) creates some logic that in object-oriented programming languages is commonly realised with complex variables. ORM aims to map these primitive relational data onto complex constructs accessed and processed from an object-oriented programming language. The core of the approach is to enable bidirectional transparent transformations between persistent data (stored in a database) with their semantics and transient objects of a programming language, so that data can be retrieved from a database and stored there as results of a programme executions. An object-relational mapping implementation needs to systematically and predictably choose which relational tables to use and generate the necessary SQL code.

4_{Object–Relational Data Warehousing System} 5

(33)

Unfortunately, most of ORM implementations are not very efficient. The reason for this are usually are extra mapping operations and memory consumption occurring in an additional middleware layer introduced between a DBMS and an end-user application.

On the other hand, DAO6 [38, 39] indents to map programming language objects to database structures and persist them there. Originally the concept was developed for Java and most applications are created for this language. However. an early attempt to build DAO architecture referred to as an “object wrapper” (the term of DAO appeared later) was shown in [40], while a general architecture and constraints for building object-oriented business applications over relational data stores with persistence issues was extensively described and discussed in [41, 42, 43].

The most important differences between ORM and DAO with their advantages and disadvantages are very well depicted in [44], the crucial distinction appears at the very beginning of design – in ORM a database schema already exists and an application code is created according to it (objects-to-tables mapping), for DAO application logic is established and then appropriate database persistence structures are developed (tables-to-objects mapping). However, in real-life implementations both approaches are very often mixed in favour of API or application usability and features.

Below there are shortly described the most common and important approaches to ORM and DAO, mainly for Java (due to the prototype implementation technology), a more complete list of ORM solutions for different platforms and languages can be found at [45].

2.2.2.1 DBPL Gateway

The early project concerning the ORM approach was DBPL7 [46] being a Modula-2 [47] extension provided with a gateway to relational databases (Ingres [48] and Oracle [49]). The API extension concerned a new bulk data type constructor “relation”, persistence and high-level relational expressions (queries) based on the nested relational calculus, maintaining strong typing and orthogonality. The gateway itself was a mechanism enabling regular DBPL programmes accessing relational databases, i.e. no

6_{Data Access Object} 7