Transparent Indexing in Distributed Object-Oriented Databases

(1)

TECHNICAL UNIVERSITY OF LODZ

Faculty of Electrical, Electronic,

Computer and Control Engineering

Computer Engineering Department

mgr in

Ŝ

. Tomasz Marek Kowalski

Ph.D. Thesis

Transparent Indexing in

Distributed Object-Oriented Databases

Supervisor:

prof. dr hab. in

Ŝ

_{. Kazimierz Subieta}

(2)

(3)

INDEX OF CONTENTS

SUMMARY... 6

ROZSZERZONE STRESZCZENIE... 8

CHAPTER 1 INTRODUCTION ... 13

1.1 Context ... 14

1.2 Short State of the Art of Indexing in Databases ... 14

1.3 Research Problem Formulation ... 15

1.4 Proposed Solution ... 16

1.5 Main Theses of the PhD Dissertation ... 17

1.6 Thesis Outline ... 18

CHAPTER 2 INDEXING IN DATABASES - STATE OF THE ART ... 20

2.1 Database Index Properties... 21

2.1.1 Transparency ... 21

2.1.2 Indices Classification... 22

2.2 Index Data-Structures ... 23

2.2.1 Linear-Hashing ... 26

2.2.2 Scalable Distributed Data Structure (SDDS)... 28

2.3 Relational Systems ... 30 2.4 OODBMSs ... 32 2.4.1 db4o Database ... 35 2.4.2 Objectivity/DB ... 35 2.4.3 ObjectStore... 37 2.4.4 Versant ... 38 2.4.5 GemStone’s Products ... 39

2.5 Advanced Solutions in Object-Relational Databases ... 40

2.5.1 Oracle’s Function-based Index Maintenance ... 42

2.6 Global Indexing Strategies in Parallel Systems ... 44

2.6.1 Central Indexing ... 46

2.6.2 Strategies Involving Decentralised Indexing... 47

2.7 Distributed DBMSs ... 48

CHAPTER 3 THE STACK-BASED APPROACH ... 50

3.1 Abstract Data Store Models ... 50

3.1.1 AS0 Model ... 50

3.1.2 Abstract Store Models Supporting Inheritance... 51

3.1.3 Example Database Schema... 52

3.1.4 Example Store with Static Inheritance of Objects ... 53

3.2 Environment and Result Stacks... 54

3.2.1 Bind Operation ... 55

3.2.2 Nested Function... 56

3.3 SBQL Query Language ... 56

3.3.1 Expressions Evaluation ... 57

3.3.2 Imperative Statements Evaluation ... 61

3.4 Static Query Evaluation and Metabase... 62

3.4.1 Type Checking ... 64

3.5 Updateable Object-Oriented Views ... 65

CHAPTER 4 ORGANISATION OF INDEXING IN OODBMS ... 67

4.1 Implementation of a Linear Hashing Based Index... 67

(4)

4.1.2 Example Indices ... 69

4.2 Index Management... 70

4.2.1 Index Creating Rules and Assumed Limitations ... 73

4.3 Automatic Index Updating ... 75

4.3.1 Index Update Triggers ... 75

4.3.2 The Architectural View of the Index Update Process ... 78

4.3.3 SBQL Interpreter and Binding Extension... 80

4.3.4 Example of Update Scenarios... 81

4.3.4.1 Conceptual Example ... 81

4.3.4.2 Path Modification ... 83

4.3.4.3 Keys with Optional Attributes ... 85

4.3.4.4 Polymorphic Keys... 87

4.3.5 Optimising Index Updating ... 90

4.3.6 Properties of the Solution ... 93

4.3.7 Comparison of Index Maintenance Approaches... 94

4.4 Indexing Architecture for Distributed Environment ... 96

4.4.1 Global Indexing Management and Maintenance ... 97

4.4.2 Example on Distributed Homogeneous Data Schema ... 99

CHAPTER 5 QUERY OPTIMISATION AND INDEX OPTIMISER ... 101

5.1 Query Optimisation in the ODRA Prototype ... 102

5.2 Index Optimiser Overview ... 103

5.2.1 General Algorithm... 106

5.3 Selection Predicates Analysis ... 107

5.3.1 Incommutable Predicates... 109

5.3.2 Matching Index Key Values Criteria... 111

5.3.3 Processing Inclusion Operator... 112

5.4 Role of a Cost Model ... 113

5.4.1 Estimation of Selectivity ... 115

5.5 Query Transformation – Applying Indices... 118

5.5.1 Index Invocation Syntax... 118

5.5.2 Rewriting Routines... 120

5.5.3 Processing Disjunction of Predicates ... 122

5.5.4 Optimising Existential Quantifier... 123

5.5.5 Reuse of Indices through Inheritance ... 124

5.6 Secondary Methods... 126

5.6.1 Factoring Out Independent Subqueries ... 127

5.6.2 Pushing Selection ... 128

5.6.3 Methods Assisting Invoking Views... 129

5.6.4 Syntax Tree Normalisation... 130

5.6.5 Harmful Methods ... 131

5.7 Optimisations involving Distributed Index ... 132

5.7.1 Rank Queries Optimisation ... 134

5.7.1.1 Hoare’s Algorithm in Distributed Environment ... 136

5.7.1.2 Modification of Hoare’s Algorithm ... 138

5.8 Increasing Query Flexibility with Respect to Indices Management ... 140

CHAPTER 6 INDEXING OPTIMISATION RESULTS ... 142

6.1 Test Data Distribution ... 142

6.2 Sample Index Optimisation Test... 144

6.3 Omitting Key in an Index Call Test – enum Key Types ... 146

6.4 Multiple Index Invocation Test ... 148

6.5 Complex Expression Based Index Test ... 150

6.6 Disjunction of Predicates Test... 151

CHAPTER 7 INDEXING FOR OPTIMISING PROCESSING OF HETEROGENEOUS RESOURCES... 153

7.1 Volatile Indexing ... 153

(5)

7.1.2 Index Materialisation... 154

7.1.3 Solution Properties ... 155

7.1.4 Prove of Concept Test ... 155

7.2 Optimising Queries Addressing Heterogeneous Resources... 157

7.2.1 Overview of a Wrapper to RDBMS ... 158

7.2.2 Volatile Indexing Technique Test ... 160

CHAPTER 8 CONCLUSIONS ... 165

8.1 Future Work... 167

INDEX OF FIGURES ... 168

INDEX OF TABLES ... 170

(6)

SUMMARY

The Ph.D. thesis focuses on the development of robust transparent indexing architecture for distributed object-oriented database. The solution comprises management facilities, automatic index updating mechanism and index optimiser. From the conceptual point of view transparency is the most essential property of a database index. It implies that programmers need not to involve explicit operations on indices into an application program. Usually a query optimiser automatically inserts references to indices when necessary into a query execution plan. The second aspect of the transparency concerns a mechanism maintaining the cohesion between existing indices and indexed data. So-called automatic index updating detects data modifications and reflects them in indices accordingly.

The thesis has been developed in the context of the Stack-Based Architecture (SBA) [1, 117] a theoretical and methodological framework for developing object-oriented query and programming languages. The developed query optimisation methods are based on the corresponding Stack-Based Query Language (SBQL).

The orthogonality of SBQL constructs enables simple defining of complex selection predicates accessing arbitrary data. The main goal of the work is designing the indexing architecture facilitating processing of a possibly wide family of predicates. This requires generic and complete approach to the problem of index transparency.

The solution presented in the thesis provides transparent indexing employing single or multiple-key indices in a distributed homogeneous object-oriented environment. The selection of an index structure, either centralised or distributed, is not restricted. The work extensively describes optimisation methods facilitating processing in the context of a where operator, i.e. selection, considering the role of a cost model, conjunction and disjunction of predicates, and the class inheritance.

The author proposes a robust approach to automatic index updating capable of dealing with index keys based on arbitrary, deterministic and side effects free expressions. Consequently, optimised selection predicates can be freely composed of various SBQL constructs, in particular, algebraic and non-algebraic operators, path expressions, aggregate functions and class methods invocations. The solution also takes into consideration inheritance and polymorphism.

(7)

A part of the thesis concerns optimisation methods devoted to distributed object-oriented databases enabling efficient parallel processing of queries. In particular, one of the designed methods concerns optimisation of rank queries. It enables taking advantage of distributed and scalable index structures.

A particularly difficult query optimisation domain concerns processing queries addressing heterogeneous resources. A volatile indexing technique proposed by the author is a significant step in this matter. This solution relies on the developed indexing architecture. Additionally, it can be applied to data virtually accessible using SBQL views. In contrast to regular indices, a volatile index is materialised only during a query evaluation. Therefore, efficacy of this technique shows when the index is invoked multiple times which mainly concerns processing of complex and laborious queries.

A key aspect concerning the development of database query optimisation methods is preservation of original query semantics. Consequently, for the designed optimisation methods the author has determined rules in a context of the assumed object data model and the SBQL query language. With this knowledge a database programmer can be assisted and advised, e.g. by compiler, on how to design safe and optimisable queries. Moreover, the conducted research can facilitate also database designers. Among the other, the potential influence of other optimisation methods on indexing has been verified.

A significant part of algorithms and solutions developed in the thesis have been verified and confirmed in the prototype ODRA OODBMS implementation [58, 59].

Keywords: indexing, database, distributed database, object-oriented, query

(8)

POLITECHNIKA ŁÓDZKA

Wydział Elektrotechniki, Elektroniki, Informatyki i Automatyki

Katedra Informatyki Stosowanej

Praca doktorska pt.:

Przezroczyste indeksowanie w rozproszonych obiektowych bazach danych

ROZSZERZONE STRESZCZENIE

Bazy danych stanowią podstawę wielu rozległych i w dzisiejszych czasach często rozproszonych systemów komputerowych. Zarządzanie systemami o takim rozmiarze i złoŜoności ułatwiają technologie obiektowe. Przemysł jednak skłania się ku rozwiązaniom relacyjnym, gdy kwestią kluczową jest wydajność. Ten aspekt jest wciąŜ zaniedbany w opartych o obiektowe paradygmaty bazach danych z uwagi na niedostatek zaawansowanych procedur optymalizacyjnych.

Indeksowanie jest najwaŜniejszą metodą optymalizacyjną w bazach danych. Zasadnicza koncepcja indeksowania w obiektowych bazach danych nie róŜni się od indeksowania w systemach relacyjnych [15, 20, 29, 54, 55, 65]. Z koncepcyjnego punktu widzenia najistotniejszą własnością indeksu w bazie danych jest przezroczystość. Oznacza ona, Ŝe programista aplikacji z bazą danych nie musi być

świadomy istnienia indeksów. Najczęściej optymalizator zapytań jest odpowiedzialny za automatyczne wykorzystanie indeksów. Drugi waŜny aspekt przezroczystości jest związany z utrzymywaniem spójności między indeksami a indeksowanymi danymi. Jest to problem tzw. automatycznej aktualizacji indeksu. Modyfikacje w bazie powinny być automatycznie wykrywane i odzwierciedlane w odpowiednich indeksach.

W rozproszonych bazach danych najbardziej zaawansowane rozwiązania opierają się o statyczne partycjonowanie indeksu. Są one zaimplementowane w czołowych produktach obiektowo-relacyjnych. Pozwalają one jedynie na definiowanie kluczy indeksu uŜywając prostych wyraŜeń korzystających z danych znajdujących się w

(9)

jednej tabeli. W kwestii globalnej optymalizacji zapytań odnoszących się do heterogonicznych zasobów, autor nie znalazł w literaturze naukowej Ŝadnych sformalizowanych metod opartych o indeksowanie. Analiza stanu wiedzy jednoznacznie wskazuje na potrzebę rozwijania metod i architektury indeksowania dla rozproszonych obiektowych baz danych.

Ortogonalność języka SBQL pozwala na wyjątkowo łatwe definiowanie złoŜonych predykatów selekcji odnoszących się do dowolnych danych. Głównym celem pracy jest opracowanie architektury indeksowania, która wspomagałaby przetwarzanie moŜliwie szerokiej rodziny predykatów. Wymaga to generycznego i kompletnego podejścia do problemu przezroczystości. PoniewaŜ praca dotyczy rozproszonych obiektowych baz danych kolejnym waŜnym celem jest opracowanie metod optymalizacyjnych, które będą umoŜliwiały zrównoleglenie obliczeń w szczególności poprzez wykorzystanie rozproszonych skalowalnych struktur indeksu.

Szczególnie trudną dziedziną w kontekście optymalizacji jest przetwarzanie zapytań odnoszących się do rozproszonych heterogenicznych zasobów. Z tego powodu jako kolejny cel pracy autor postawił sobie identyfikacje przezroczystej i wydajnej strategii indeksowania, którą moŜna by stosować na poziomie globalnego schematu bazy.

Kluczowym aspektem pracy nad wszystkimi metodami optymalizacyjnymi jest zachowanie oryginalnej semantyki zapytań. W tym celu autor określił reguły, które odnoszą się do opracowanych metod w kontekście przyjętego obiektowego modelu danych i języka zapytań SBQL. Znajomość tych reguł moŜe być przydatna programistom w budowaniu zapytań, których postać umoŜliwia automatyczną optymalizację. Dodatkowo, przeprowadzone badania mogą być równieŜ pomocne projektantom baz danych. Między innymi, określono potencjalny wpływ innych metod optymalizacyjnych na pracę optymalizatora wykorzystującego indeksy.

Zaproponowane przez autora w pracy doktorskiej rozwiązania są przedstawione w kontekście stosowej architektury (SBA, Stack-Based Architecture) [1, 117] i wynikającego z niej języka zapytań (SBQL, Stack-Based Query Language). Architektura stosowa jest to formalna metodologia dotycząca obiektowych języków zapytań i programowania w bazach danych.

(10)

Tezy będące przedmiotem dysertacji są następujące:

1. Przetwarzanie predykatów selekcji opartych o dowolne wyraŜenia kluczowe korzystające z danych w rozproszonej obiektowej bazie danych moŜe być zoptymalizowane przez scentralizowane lub rozproszone przezroczyste indeksowanie.

2. Wykonywanie złoŜonych zapytań odnoszących się do rozproszonych heterogenicznych zasobów moŜe być wspomagane przez techniki wykorzystujące przezroczystą optymalizację opartą o indeksowanie.

W udokumentowaniu w/w tez wykorzystano zaprojektowane przez autora system zarządzania indeksami i optymalizator zapytań stosujący indeksy.

Dodatkowym elementem ściśle związanym z pierwszą tezę jest autorskie podejście do problemu automatycznej aktualizacji indeksu. Przedstawione rozwiązanie zapewnia przezroczyste indeksowanie wykorzystujące indeksy oparte o jeden lub wiele kluczy. Optymalizacja dotyczy przetwarzania predykatów selekcji opartych o dowolne, deterministyczne i pozbawione efektów ubocznych wyraŜenia, na które mogą się składać: np. wyraŜenia ścieŜkowe, funkcje agregujące i wywołania metod klas (z uwzględnieniem dziedziczenia i polimorfizmu). Zaproponowana architektura indeksowania moŜe być zastosowana do rozproszonych homogenicznych źródeł danych. Wybór struktury indeksu, scentralizowanej czy rozproszonej, nie jest w Ŝaden sposób ograniczony. Autor zaproponował równieŜ metodę optymalizacji rankingowych zapytań, która umoŜliwia wykorzystanie zarówno istniejących lokalnych indeksów, jak i rozproszonego, skalowalnego globalnego indeksu.

Rozwiązaniem zaproponowanym przez autora w celu udowodnienia drugiej tezy pracy jest technika ulotnego indeksowania. Polega ona na tej samej architekturze indeksowania, ale dodatkowo moŜe być stosowana do przetwarzania danych heterogenicznych wirtualnie dostępnych poprzez perspektywy SBQL. W odróŜnieniu od normalnych indeksów ulotny indeks jest materializowany tylko podczas wykonywania zapytania. Przedstawiona technika jest skuteczna w przetwarzaniu złoŜonych zapytań, w których indeks jest wywoływany więcej niŜ jeden raz.

Opracowane algorytmy i rozwiązania związane z tezami pracy zostały w znaczącym zakresie zweryfikowane i potwierdzone na prototypowej implementacji w obiektowej bazie danych ODRA [58, 59].

(11)

Dysertacja została podzielona na osiem rozdziałów, których zwięzły opis znajduje się poniŜej:

Chapter 1 Introduction

Wstęp

Pierwszy rozdział wprowadza w tematykę pracy, przedstawia jej kontekst, zwięzły opis stanu wiedzy w dziedzinie i motywacje autora. Sformułowano cele pracy oraz zidentyfikowano związane z nimi problemy. W tym kontekście omówiono szczegółowo tezy dysertacji oraz zarysowano opracowane przez autora rozwiązania.

Chapter 2 Indexing In Databases - State of the Art

Indeksowanie w Bazach Danych – Stan Wiedzy

W opisie stanu wiedzy przedstawiono podstawowe pojęcia związane z indeksowaniem w bazach danych. Przytoczono reprezentatywne przykłady istniejących rozwiązań w przemyśle i literaturze naukowej. Rozdział zawiera przegląd róŜnych struktur indeksujących ze szczególnym uwzględnieniem liniowego haszingu, który został wykorzystany w autorskim rozwiązaniu. Dodatkowo, zbadano zcentralizowane i rozproszone strategie indeksowania w róŜnych systemach rozproszonych.

Chapter 3 The Stack-based Approach

Podejście Stosowe

Rozdział dotyczy teoretycznych podstaw tez pracy, tj. architektury stosowej (SBA) i wynikającego z niej języka zapytań SBQL. Przytoczono opisy podstawowych pojęć: stosu środowiskowego, stosu rezultatów, wiązania nazw, statycznej ewaluacji zapytań i aktualizowalnych obiektowych perspektyw.

Chapter 4 Organisation of Indexing in OODBMS

Organizacja Indeksowania w Obiektowych Bazach Danych

Ta część pracy przedstawia zaprojektowaną i w znaczącym zakresie zaimplementowaną architekturę indeksowania w obiektowej bazie danych ODRA. Opisano podstawowe własności zastosowanej struktury indeksu i modułu zarządzania indeksami. Przedstawiono równieŜ autorski mechanizm zapewniający przezroczystą, automatyczną aktualizację indeksów, który opiera się o ideę wyzwalaczy aktualizacji indeksu (index update triggers). Przedstawiona koncepcja jest rozszerzona na potrzeby globalnego indeksowania w kontekście rozwijanej w projekcie ODRA rozproszonej architektury.

(12)

Chapter 5 Query Optimisation and Index Optimiser

Optymalizacja Zapytań z Wykorzystaniem Indeksów

Rozdział koncentruje się na rozwijanych przez autora metodach przezroczystego wykorzystania indeksów w optymalizacji zapytań. Przedstawiono algorytmy dotyczące transformacji pośredniego drzewa zapytania oraz związane z nimi reguły. Szczególny nacisk został połoŜony na zachowanie w procesie optymalizacji pierwotnej semantyki zapytania. Opracowane metody zostały poparte odpowiednimi rzeczywistymi przykładami przekształceń w języku SBQL. Autor przedstawił równieŜ dyskusję na temat wpływu innych metod optymalizacji zapytań na indeksowanie. W rozdziale omówiono równieŜ metody optymalizacji dedykowane przetwarzaniu globalnych zapytań w rozproszonym środowisku. W tym zakresie przedstawiono autorskie podejście do optymalizacji zapytań rankingowych w rozproszonej architekturze bazy danych oparte o zmodyfikowany algorytm Hoare’a.

Chapter 6 Indexing Optimisation Results

Wyniki Optymalizacji przez Indeksowanie

W rozdziale zaprezentowano rezultaty testów zaimplementowanego systemu indeksowania. Wyniki potwierdzają skuteczność i wydajność opracowanej metodologii. Testy empirycznie potwierdzają poprawność zastosowanych rozwiązań opisanych w rozdziałach 4-tym i 5-tym. Całość stanowi dowód pierwszej tezy dysertacji.

Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources

Indeksowanie w Optymalizacji Przetwarzania Heterogenicznych Zasobów Ta część pracy dowodzi drugiej tezy dysertacji. W rozdziale przedstawiono technikę tzw. ulotnego indeksowania (volatile indexing technique) oraz jej zastosowanie w optymalizacji zapytań odnoszących się do rozproszonych heterogonicznych danych. Skuteczność zaproponowanej techniki jest potwierdzona testem, w którym optymalizowane jest zapytanie SBQL odnoszące się do zasobów obiektowej bazy danych i zasobów znajdujących się w relacyjnej bazie danych.

Chapter 8 Conclusions

Podsumowanie

Ostatni rozdział podsumowuje pracę nad architekturą systemu indeksowania dla rozproszonej obiektowej bazy danych. Wymieniono opracowane rozwiązania i wyniki badań jednoznacznie potwierdzające słuszność tez pracy doktorskiej. Na koniec wskazano kierunki dalszych badań w tej dziedzinie.

(13)

Chapter 1 Introduction

Databases are a fundamental feature of many large computer applications. In many cases databases are to be geographically distributed. The size and complexity of such systems require the developers to take advantage of modern software engineering methods which as a rule are based on the object-oriented approach (cf. UML notation). In contrast, the industry still widely uses relational databases. While the efficiency of them in majority of applications cannot be questioned, many professionals point out their drawbacks. One of the major drawbacks is so-called impedance mismatch. The mismatch concerns many incompatibilities between object-oriented design and relational implementation. The mismatch concerns also incompatibilities between object-oriented programming (in languages such as C++, Java and C#) and SQL, the primary programming interface to relational databases. For this reason in the last two decades new and new object-oriented database management systems are proposed. Some of them are well recognized on the market (e.g. ObjectStore, Objectivity/DB, Versant, db4o, and others), however the scale of applications of them is at least the order of magnitude lower than applications of relational systems (some of them extended by object-oriented features).

One of the reasons of relatively low acceptance of commercial object-oriented databases concerns their query languages that are considered very limited and treated as secondary in the development of applications. This is in sharp contrast to relational systems, where SQL is considered the primary factor stimulating their successes.

In this research we focus on equipping object-oriented database systems with a powerful and efficient query language. The power of such a language should not be lower than the power of SQL. The performance efficiency of such a language requires powerful query optimization methods. Query optimisation in object-oriented database management systems has been deeply investigated over last two decades. Unfortunately, this research remains mostly not implemented in nowadays OODBMSs because of many reasons: limited query languages, non-implementable methods that were proposed, lack of interest of commercial companies, etc.

In this thesis we investigate a well-known and the most important method of performance improvement known as indexing. The research addresses this subject in the

(14)

Introduction

context of the Stack-Based Architecture (SBA), which is a theoretical and methodological framework for developing object-oriented query and programming languages. The solutions that we have developed are implemented and tested in the ODRA OODBMS prototype [58, 59] that is based on SBA and its own query language SBQL (Stack-Based Query Language).

1.1 Context

The Stack-Based Architecture (SBA) is a formal methodology addressing object-oriented database query and programming languages [1, 117]. It assumes the object relativism principle that claims no conceptual difference between the objects of different kinds or stored on different object hierarchy levels. Everything (e.g. a Person object, a salary attribute, a procedure returning the age of a person and a view returning well-paid employees) is considered an object with an own unique identifier. SBA reconstructs query languages’ concepts from the point of view of programming languages (PLs) introducing notions and methods developed in the domain of programming languages (e.g. environment stack, result stack, nesting and binding names).

ODRA (Object Database for Rapid Application development) is a prototype object-oriented database management system based on the Stack-Based Architecture (SBA) [2, 119]. ODRA introduces its own query language SBQL that is integrated with programming capabilities and abstractions, including database abstractions: updatable views, stored procedures and transactions. The main goal of the ODRA project is to develop new paradigms of database application development together with a distributed database-oriented and object-oriented execution environment.

1.2 Short State of the Art of Indexing in Databases

The general idea of indices in object-oriented databases does not differ from indexing in relational databases [15, 20, 29, 54, 55, 65]. The most characteristic property of the database indexing is transparency. A programmer of database applications does not need to be aware of the indices existence as they are utilised by the database engine automatically. This is usually accomplished by a query optimiser that automatically inserts references to indices into a query execution plan when necessary. The second important aspect of transparency concerns maintaining cohesion between existing indices and the data that is indexed. Data modifications are

(15)

automatically detected and corresponding changes are reflected in indices. This process is called automatic index updating.

Many indexing methods can be adopted from relational database systems and even their applicability can be significantly extended. There are also situations where indexing methods from RDBMSs become outdated in object-oriented databases. In particular, join operations do not need to be supported because in object databases the necessity for joins is much lower due to object identifiers and explicit pointer links in the database.

In the object-oriented database domain the research into indexing has been mainly focused on path expression processing and inheritance hierarchy inside indexed collections [10, 11, 12, 21, 67, 77, 81, 111]. Some papers propose generic approaches to provide automatic index maintenance transparency [43, 46]. However, there is no information that these proposals have been actually incorporated in commercial or open source database products.

Indexing is also an important subject in a distributed environment. The most of research concerns development of various distributed index structures and global indexing strategies. Many works are conducted in the context of data exchange in p2p networks. In databases, the most advanced solutions are based on static index partitioning. They are implemented in leading object-relational products. Nevertheless, an index key definition is limited to expressions accessing data from only one table. The Author have not found in the research literature any formalised global optimisation methods based on indexing for processing queries involving heterogeneous resources.

The analysis of the state of art unambiguously indicates that the development of indexing methods and architectures that are dedicated to distributed object-oriented databases is still a valid and challenging subject.

1.3 Research Problem Formulation

The orthogonality of SBQL language constructs allows defining selection predicates using complex and robust expressions accessing arbitrary data. The transparent indexing of objects to facilitate processing queries involving such predicates requires development of a generic and complete solution. Particularly, achieving automatic index updating transparency is simple only in case of indices defined on simple keys, i.e. direct attributes and table columns. Inheritance, methods

(16)

Introduction

polymorphism, data distribution, etc. make difficult identifying objects influencing a value of an index key.

Data processing in a distributed environment enables parallel processing of queries and may take advantage of distributed and scalable index structures. This creates a demand for introducing an appropriate indexing architecture and specific optimisation methods.

An even more complex task concerns evaluation of queries addressing a heterogeneous distributed environment. From the point of view of performance it is vital to exploit local resources optimisation methods and to develop robust techniques improving query processing on a global schema level. Identifying effective transparent global indexing strategies is in this context a significant, but particularly challenging subject.

Finally, each optimisation method improving query performance must ensure preservation of query semantics. Therefore, in the context of a query language and an object model the appropriate rules for exploiting such methods must be determined. With this knowledge a database programmer can be assisted and advised, e.g. by a compiler concerning how to design proper optimisable queries.

1.4 Proposed Solution

In order to provide transparent indexing in distributed object-oriented databases, the author of this thesis proposes the following tenets:

• precisely defined indices management facilities and convenient syntax for an index call to be used in query optimisation,

• set of algorithms, optimisation methods and rules composing the index optimiser, i.e. the module responsible for detecting parts of a query that can be substituted with an index call and performing appropriate query transformations,

• the generic automatic index maintenance solution based on index update definitions assigned to indices and associated with them index update triggers assigned to objects participating in indexing,

• volatile indexing technique enabling taking advantage of the developed indexing architecture and omitting troublesome issue of the automatic index maintenance in processing specific family of queries addressing heterogeneous resources.

(17)

The most important properties necessary to provide desired indices behaviour have been implemented in ODRA OODBMS prototype and are operating [59].

1.5 Main Theses of the PhD Dissertation

The summarised theses are:

1. Processing of selection predicates based on arbitrary key expressions accessing data in a distributed object-oriented database can be optimised by centralised or distributed transparent indexing.

2. Evaluation of complex queries involving distributed heterogeneous resources can be facilitated by techniques taking advantage of transparent index optimisation.

The common basis for accomplishing the theses are developed indexing management facilities and the index optimiser.

The first thesis is additionally supported by the author’s generic approach to automatic index maintenance. The proposed approach provides transparent indexing using single or multiple-key indices. It applies to selection predicates based on arbitrary, deterministic and side effects free expressions consisting of e.g. path expressions, aggregate functions and class methods invocations (addressing inheritance and polymorphism). An extensive part of the work comprises optimisation methods facilitating processing in the context of a where operator (i.e. selection), considering the role of a cost model, conjunction and disjunction of predicates, and class inheritance. The proposed architecture can handle homogeneous data distribution and distributed index structures. The selection of an index structure, either centralised or distributed, is not restricted. The author also introduces an efficient method for optimisation of rank queries taking advantage of indexing in a distributed environment.

The solution proposed by the author addressing the second thesis is the volatile indexing technique. It relies on the same indexing architecture, but addresses as well data virtually accessible through SBQL views. A volatile index differs from a regular index since it is materialised only during a query evaluation. Therefore, efficacy of this technique can be seen in processing of laborious queries when the index is invoked more than once.

(18)

Introduction

implementation in the ODRA OODBMS. The only important aspect to be implemented and validated in the future concerns data and index distribution in the context of the first thesis. This element is planned to be finished together with the development of a distributed infrastructure in the ODRA prototype.

1.6 Thesis Outline

The thesis is organised as follows:

Chapter 1 Introduction

The chapter presents a general overview of the thesis subject, context, the author’s motivation, formulation of the problem and objectives of the research, the theses and the description of developed solutions.

Chapter 2 Indexing In Databases - State of the Art

The state of the art chapter introduces basic concepts concerning indexing in databases together with an overview of solutions existing in commercial products and in the research literature. Additionally, the inspection of varieties of index-structures and indexing strategies applying to centralised and distributed environment is provided.

Chapter 3 The Stack-based Approach

The theoretical fundament for the thesis is the Stack-Based Architecture (SBA) and the corresponding query language SBQL. The chapter introduces basic notions relevant to the work including environment and result stacks, static query evaluation and updateable object-oriented views.

Chapter 4 Organisation of Indexing in OODBMS

The chapter presents the designed and implemented indexing architecture in the ODRA OODBMS. It focuses particularly on basic properties of the employed index structure, the designed indexing management facilities and module providing automatic index updating transparency (based on the author’s index update triggers concept). Finally, extending the architecture to distributed databases is discussed.

Chapter 5 Query Optimisation and Index Optimiser

The algorithms and rules responsible for taking advantage of indices in transparent optimisation of queries with respect to query semantics are presented and explained on examples. The chapter includes description of indexing methods designed

(19)

for a distributed environment and discussion about influence of secondary methods on indexing.

Chapter 6 Indexing Optimisation Results

The chapter presents results of tests confirming efficiency of the methods presented in the thesis.

Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources

The chapter focuses on the volatile indexing technique and presents its application in optimisation of queries addressing heterogeneous resources. The description is supported by an appropriate test proving the efficacy of this technique.

Chapter 8 Conclusions

The chapter gives conclusions concerning achieved objectives and depicts the area of future works.

(20)

Indexing In Databases - State of the Art

Chapter 2 Indexing In Databases - State of the Art

Indices are auxiliary (redundant) data structures stored at a server. A database administrator manages a pool of indices generating a new or removing an existing one depending on the current need. As indices at the end of a book are used for quick page finding, a database index makes quick retrieving objects (or records) matching given criteria possible. Because indices have relatively small size (comparing to a whole database) the gain in performance is fully justified by some extra storage space. Due to single aspect search, which allows one for very efficient physical organisation, the gain in performance can be even several orders of magnitude.

In general, an index can be considered a two-column table where the first column consists of unique key values and the other one holds non-key values, usually references to objects or database table rows. Key values are used as an input for an index search procedure. As the result, the procedure returns corresponding non-key values from the same table row. In query optimisation indices are usually used in the context of a where operator when the left operand refers to a collection indexed by key values composing the right operand. [29, 33, 118]

Syntactic analysis and validation High-level language query

Intermediate query representation

Query optimiser Query evaluation plan

Query code generator Query code

Runtime database processor Query result

Fig. 2.1 A typical stages of high-level language query optimisation [29]

A database query is expressed in a high-level query language (e.g. SQL, OQL). Fig. 2.1 presents general steps of processing a query. First, it has to be the subject of syntactic analysis (parsing). Next, it is validated for semantic correctness and accordance with a present database schema. The database uses internal query

(21)

representation usually organised into a tree or graph structure. There might be many execution strategies that a DBMS can follow to obtain an answer to a query. In terms of query results all execution plans are equivalent but the cost difference between alternative plans can be enormous. The cost is usually measured as the time needed to complete query execution. A database query optimiser should efficiently estimate the cost of a plan. Final steps of query processing consist of code generation according to the designed execution strategy and eventually its execution [29, 54, 118].

An important part of designing an execution plan is the analysis of database indices. The query optimiser should be capable of identifying parts of the query which evaluation can be assisted with indexing. Next, with the help of a database cost model it has to decide which combination of indices would minimise the cost of query execution.

An important task of database administrators is to manage a pool of indices, which is a part of processes of physical designing and tuning of a database. From the physical and conceptual properties of a database index follow its obvious advantages. However, when the design is improper, processing queries through indices may cause disadvantages concerning the global processing time. The disadvantages are usually caused by frequent database updates, which may totally undermine the gain in query processing due to an index, because the updating cost of the index exceeds the gain due to faster query processing.

2.1 Database Index Properties

Indices are an essential constituent of a database’s architecture. Obviously, their central feature is a data-structure that can be efficiently organised, searched and maintained. Nonetheless, their actual strength lies in unique properties and versatile utilisation of a database’s index. The significant advantage and a partial cause of a success of large database systems is an indexing transparency.

2.1.1 Transparency

In a common approach, the programmer should not involve explicit operations on indices into an application program. To make indexing transparent from the point of view of a database application programmer, it ought to ensure two important functionalities: index optimisation and automatic index updating.

(22)

evaluation. Therefore, the administrator of a database can freely establish new indices and remove them without changing codes of applications. The responsibility for ensuring such transparency lies in query optimisation and particularly in the index optimiser.

The second functionality, i.e. automatic index updating, in research literature is also referred as index maintenance or dynamic index adaptation. It is a response to changes in a database. Indices, like all redundant structures, can lose cohesion with data if the database is updated. An automatic mechanism should alter, remove or rebuild an index in case of database updates that affect topicality of its contents. Consequently, the gain in queries performance coming from indexing compromises insertion, deletion and data modifications speed, since such operations require suitable updates to indices. Thus, it is an administrator’s responsibility to manage indices judiciously and not to cause an overall database’s performance deterioration, particularly in update-intensive systems.

In general, databases provide the user transparency fully. Nevertheless, some approaches let administrators and application designers decide about the degree of the index transparency and explicitly control indices state depending on a need. Occasionally, the transparency is supported only to a limited extent burdening a database’s user.

2.1.2 Indices Classification

According to [29] there are three essential kinds of a database’s index:

• primary index – physically ordering data on a disk or in a memory according to some unique property field (each record must contain a unique value for such a field – so-called primary key),

• clustering index – introducing physical data order according to a non-unique property (i.e. when several data records can have equal value of ordering fields),

• secondary index – providing an alternative access to data according to designated criteria without affecting their actual location (called also secondary access paths or methods).

Since one physical ordering is possible, a data table or collection can have only one primary index or clustering index not both. The limit concerning the number of

(23)

secondary indices depends on a database. In reality often occur some departures from definitions above. For example, in some databases data and indices are stored separately and even primary or clustering indices contain only references to actual data, which are stored physically e.g. in a linked list.

Indices can be also classified according to a relation between keys and indexed data. Usually the division is the following:

• dense index – contains an entry for each key value occurring in a database,

• sparse index – associates blocks of ordered indexed data only with a single key value (e.g. the lowest one).

Primary indices are usually sparse since physically data are often divided into blocks. Additionally to dense and sparse, a range index can be considered since index can be split into slots representing specified ranges of key values.

Other obvious classification of indices concerns their data-structure, e.g. hash table or a B-tree. In databases sometimes many index kinds are combined in one so-called multilevel index. The next subchapter describes the most popular kinds of data-structures employed in databases indexing systems.

2.2 Index Data-Structures

The most popular data structures used for index organisation are various kinds of B-trees proposed by Bayer and McCreight [8] and hash tables invented by H. P. Luhn [14]. Improving efficiency of selecting or sorting data queries vary with the choice of a proper data structure for indexing certain data. However, each index consumes some amount of database store space and needs some additional overhead for time of inserting, modifying or removing indexed data. Individual properties of different structures have been presented in thousands of papers and books devoted to databases and algorithms, e.g. [14, 23, 29]. In the context of this dissertation the kind of an exploited physical index organisation is generally insignificant. Only some properties of the index structure are important, in particular:

• key order preservation, i.e. range queries support,

• support for indexing using multiple keys,

(24)

The same index interface from the point of the database can be used for a variety of index structures. Therefore, this work omits detailed discussion of this subject focusing mainly on an index structure used in the author’s implementation, i.e. linear hashing.

The hash table uses the hash coding method based on a hash function which maps key values into a limited set of integer values. A calculated hash value points to a table in a memory (called a bucket) holding corresponding non-key values. This method allows indexed values to be looked up or updated in a very short constant time, particularly when a hash function distributes key values equally. A disadvantage of this technique is a necessity of specifying a size of the index table. However, dynamic hashing and linear hashing (described in the next section) deal with this issue. Another problem appears when two or more keys are mapped to the same location in the table. Similarly, it may happen that two or more objects have the same key value of an attribute. Resolving these so-called collisions leads to deterioration of index performance. There are many techniques allowing to put such items in a hash table and query them in a fairly fast way: a rehash function, a linked list approach (separate chaining), a linked list inside a table (coalesced hashing) and buckets. Methods involving linear or dynamic hashing use load control algorithms automatically forcing a hash table to expand to prevent performance loss.

Another very popular indexing technique is based on B-trees. A B-tree is slightly worse than hash table from the point of view of search time and frequent data updates, which often involve the tree reorganisation. However, its advantage is simplicity of an algorithm and the economical memory space consumption. B-trees store keys in a non-descending order, so they can be very helpful in laborious queries involving sorting or ranking data. Many different kinds of tree-structures are proposed in the literature and incorporated in commercial products, e.g.

• B+ tree, B# tree, B* tree – varieties of B-tree,

• AVL tree, splay tree – balanced binary search trees,

• radix tree – optimised to store a set of strings in lexicographical order,

• and many more [14].

Indexing techniques used in data warehousing applications are a bit different from the techniques used in on-line transaction processing. Bitmap indices are stored as bitmaps (often compressed) [17, 66]. Consequently, answer to the most of queries can

(25)

be obtained by performing bitwise logical operations. They are the most effective on keys with a limited set of values (e.g. a gender field) and often use a combination of such keys (i.e. multiple keys index). When these conditions are met, the bitmap indices prove reduced storage requirements and greater efficiency than regular indices. On the other hand the performance of the index maintenance is their serious drawback. Bitmap indexes are primarily intended for non-volatile systems since the method is very sensitive to updating indexed data. It causes the necessity to keep locks on segments storing a bitmap index to reflect a change, which is very time consuming. In typical cases bitmap indices are easier to destroy and re-create than to maintain.

Other variants of indices for data warehousing have also been developed [84]:

• projection index - quite useful in cases where column values must be retrieved for all selected rows, because they probably would be found on the same index page,

• bit-sliced index – based on processing bitmaps; provide an efficient means of calculating aggregates.

Many other index structures evolve to facilitate various indices applications:

• inverted files, signature-based files – two principal indexing methods for text documents databases purposes [134] and for indexing according to set-valued attributes with low cardinality [41],

• multi-index, path index, access support relations, T-Index, path dictionary index – for path expression processing in OODBMSs [10, 11, 12, 67, 81],

• inherited multi-index, nested-inherited index, triple-node hierarchies, H-tree, CH-tree, hcC-tree (hierarchy class Chain tree), signature file hierarchies, signature graphs – oriented on facilitating processing collections organized in class hierarchies in OODBMSs [10, 11, 12, 21, 77, 111],

• R-Tree, UB-tree, kd-tree, X-Tree, Parametric R-Tree, TPR-Tree (Time

parameterized R-Tree), TPR*-Tree, grid file - for spatial (i.e. multidimensional) and spatio-temporal data, e.g. Geographic Information Systems, [30, 32, 40, 125].

• etc.

(26)

domain. In general, together with a growth of an indexed dataset an index can be split into small parts maintained on independent servers; hence, utilising their storage (e.g. main memory or disks) and processing power. In contrast to local indices such an index:

• enables exploiting a parallel computing power (therefore, they are usually referred as parallel or distributed indices),

• can be scalable, freely spreading its parts between network nodes, without compromising its primary efficiency,

• provides a higher level of concurrency.

An overview of properties of a distributed index structure based on the idea of linear hashing is given in section 2.2.2. Similarly to local indices, parallel indices have been developed in many variants for various applications and systems, e.g.:

• scalable distributed data structure variants (cf. section 2.2.2),

• distributed hash table (DHT) [34] for data indexing in peer-to-peer (P2P) networks, e.g. Chord [113],

• scalable distributed B-tree [3],

• a combination of a bit vector, a graph structure and a grid file database multi-key distributed index [45],

• hierarchical distributed index psiX for XML documents in p2p networks [105],

• DiST, PN-tree – structures for indexing multidimensional (spatial) datasets [4, 16].

All index structures mentioned in this subchapter are only a small fraction of existing solutions, which are described in thousands of research papers. The next section concerns a linear hashing index which is an important part of the author’s prototype implementation verifying the thesis.

2.2.1 Linear-Hashing

Linear hashing is a dynamic indexing structure invented by Witold Litwin [72]. Similarly as a regular hash table, it comprises the buckets, which store index entries according to some hash function. The linear hashing strives to keep a relation between the number of index entries and the number of buckets in order to ensure constant

(27)

search, insertion and deletion efficiency and to minimise the buckets capacity overflowing. Buckets are added (through splitting) and removed (through merging) one at a time which is possible by taking advantage of a dynamic hashing functions family.

At the start, a linear hashing structure consists of N0 empty buckets numbered starting from 0 to N0-1. Three important parameters describe an index state:

• n – the number of a bucket to be split next if necessary (initially equal to 0),

• j – a current lowest index buckets level (initially equal to 0).

• N – the number of buckets equal to N0·2j+n (consequently, initially equal to N0). The buckets from nTH to N0·2j -1 belong to the jTH level while the rest of the buckets, i.e. from 0 to nTH-1 and all starting from N0·2j bucket to NTH-1, belong to the

jTH+1 level. Index entries are spread over the index according to hash functions hj(key) depending on the level of an index bucket. The target bucket T for a key is determined according to the following formulas:

[

   ⇒ ⋅ ∈ + ⇒ ∈ = ) , ( ) 2 , ) , ( ( if ) , 1 ( ) , 0 ) , ( ( if ) , ( 0 h j key N n key j h key j h n key j h key j T _TH _j TH , where:

• h(j, key) := hash(key) mod (N0·2j),

• hash(key) is the basic key hashing function,

• [minvalue – stands for the inclusive left limit of the defined values range,

• maxvalue[ – stands for the exclusive right limit of the defined values range. The most crucial operation, i.e. splitting, is triggered after an insertion when the index load becomes too high. A new bucket is appended to the buckets table and elements from nTH bucket are divided between nTH and the new bucket nTH+N0·2j according to the h(j+1, key) function. It is worth noticing that:

h(j+1, key)∈{h(j, key), h(j, key) + N0·2j}

Next, parameters n and N are incremented by one. Eventually, when n reaches N0·2(j+1) indicating that the hash table has doubled its size from N0·2j the n parameter is set to 0 and the index level j is incremented by one.

(28)

threshold then merging of buckets is performed.

An example bucket split procedure is presented in Fig. 2.2. Buckets entries are represented by values of their hash(key) functions. The state before the split is presented in Fig. 2.2a. The parameters of an index were the following: n = 0, N0 = N =100, j = 0, so all buckets are addressed using the h(0, key) function.

Fig. 2.2 Example of a bucket split operation [72]

The split is performed on the nTH bucket, which was already overflowed. A new bucket at the end of the buckets table is allocated and it is filled with entries moved from the 0 bucket for which h(1, key), i.e. hash(key) mod 200, is equal to 100. Finally, n and N are incremented. An index state after the split is shown in Fig. 2.2b. As it is shown dynamic expansion of a linear hashing table helps to minimise the buckets overflow.

An overview of SDDS based on the idea of linear hashing, an efficient structure for distributed indexing, is presented in the following section.

2.2.2 Scalable Distributed Data Structure (SDDS)

SDDS is a scalable distributed data structure introduced by W. Litwin [73, 74] which deals with storing index positions in a file distributed over a given network. Its properties make it a good candidate for indexing global data in a distributed

(29)

infrastructure (e.g. grid). SDDS uses LH* which generalises a linear hashing method described in section 2.2.1 to a distributed memory or disk files.

In contrast to the linear hashing, SDDS buckets can be located on different sites. LH* structure does not require a central directory, and it grows gracefully, through splits of one bucket at a time, to virtually any number of servers. The SDDS strategies differ with an approach to buckets splitting which can be managed by a coordinator site, triggered by a bucket overflow or by controlling an index load factor.

An application of the SDDS significantly extends features of the linear hashing. The major advantages of an SDDS index concerning distributed indexing are the following:

• avoiding of a central address calculation spot,

• parallel and distributed query evaluation support,

• concurrency transparency,

• scalability – it does not assume any constraints in size or capacity,

• an SDDS file expands over new servers when current servers are overloaded,

• index updating does not demand a global refresh on servers or clients,

• over 65% of an SDDS file is used,

• in general, the small number of messages between servers (1 per random insert; 2 per key search),

• parallel operations on SDDS M buckets require at most 2·M+1 messages and between 1 and O(log(M)) rounds of messages.

The characteristics of SDDS outperform in efficiency the centralised index directory approach (described in detail in section 2.6.1) or any static data structures.

Variants of SDDS index include implementations:

• preserving key order and supporting range queries (e.g. a RP* family of SDDS structures [75]),

• providing a high-availability, i.e. toleration for unavailability of some servers sites composing SDDS (e.g. LH*RS [76])

(30)

2.3 Relational Systems

The System-R, developed by IBM1 Research between 1972 and 1981, is the first database management system implementing the relational model [6]. Innovative solutions developed within the system included query optimiser utilising indices [15, 65]. The overview of the relational query optimisation including the fundamentals of an approach to indexing has been collected in [20, 54, 55]. Almost 40 years of research on relational systems resulted in development of various indexing aspects. Numerous indexing based solutions are incorporated in available commercial products. The major RDBMSs currently are SQL Server by Microsoft [109], DB2 by IBM [24], Informix by IBM [53] and Oracle by Oracle Corporation [91]. The most popular open-source relational systems are PostgreSQL by PostgreSQL Global Development Group [103], MySQL by SUN Microsystems [83] and Firebird by Firebird Foundation [31].

The well-known indexing solutions designed for RDBMSs are the following:

• primary index, clustering index, secondary access paths (cf. section 2.1.2),

• multi-key index – enables indexing using combination of multiple fields,

• derived key index (iSystem DB2/400 by IBM) [51], function-based index (Oracle) [115], functional indexes (Informix) [49] – indices on expressions, built-in or user functions that exactly match selection predicates within an SQL

where clause,

• computed-column indices (MSSQL Sever) – solution similar to the previous one but relying on an additional table column (computed-column), which can define indexable expression using derived attributes and user functions (the index maintenance relies on maintenance of the computed column) [110],

• temporary index – transient internal structure created automatically by DB engine or defined manually (it is described below in this subchapter) [51, 110],

• development of diverse index structures (it is the topic of subchapter 2.2),

• other product specific solutions.

In RDBMSs the keys that are used for defining an index on a table are usually

(31)

simple values stored in columns. Developers of such an index can use various index structures and mechanisms for assuring index transparency. A query optimiser can easily identify where clauses addressing indexed selection predicates. Modifications to an indexed table are also easy to detect by the DB engine during run-time or even earlier through the analysis of an intermediate form of DML2 statements. Insertion or deletion of table rows transparently triggers addition or removal of an appropriate index entry. Analogously, modification to any value in a key column results in changes inside an index. Therefore, details of automatic index updating in RDBMSs are usually omitted in technical RDBMS specifications and considered rather as an implementation issue.

Function-based indices and other similar solutions enabling defining keys using expressions addressing more than one table column and internal functions or user-written functions generally do not introduce conceptual difficulties. The functions supporting such indices can be written in a native database language (e.g. PL/SQL) or an external programming language (C++, Java, etc.). Furthermore, they must be deterministic (i.e. depend only on the state of a database store) and side effects free (i.e. do not introduce any changes to data). The idea of function-based indices is derived from optimisation through method (or function) pre-computation or materialisation. It is widely discussed in the research literature [5, 9, 13, 27, 57, 80]. The optimisation gain relies on pre-calculating the result of a given function or a derived attribute for all objects of a collection. The obtained results are used as keys to index objects and are stored inside the index. Thus, when queries are evaluated, the optimiser strives to use the result computed earlier in order to avoid laborious execution of functions and derived attributes.

The automatic index maintenance of function-based indices requires simply considering modifications to any value stored in all columns used in a key definition. Nevertheless, this aspect of indexing becomes complex when object-oriented model and language extensions are considered. In extreme cases, it may even lead to serious errors (see section 2.5.1).

If appropriate indices do not exist then the optimiser can try to facilitate query

(32)

processing using temporary indices instead. Their applications are described in detail for iSystem DB2/400 by IBM [51]. A temporary index can be created solely for performing joins (e.g. nested loop join), ordering, grouping, distinct and record selection. It is applied by the optimiser to satisfy a specific query request. Such an index can be built as a part of a query plan. After query execution it is destroyed. In effect, it would not be reused and shared across jobs and queries. Sometimes a temporary index can be created for a longer period. Such a decision can be made by the DB engine basing on the analysis of query requests over time. In order to reuse and share such an index it has to be altered if the underlying table changes. The advantage of a temporary index is the shorter access time as it is stored only in main memory.

2.4 OODBMSs

Index organisation and optimisation in object-oriented database management systems have been deeply researched, see [7, 10, 11, 12, 21, 43, 46, 77, 79, 81, 93, 111]. Experimental database prototypes are among other the following: IRIS by Hewlett Packard, ORION by MCC3, OPENOODB by Texas Instruments and project ENCORE/ObServer by Brown University. Few former commercial OODBMSs are: ONTOS by Ontos, ARDENT by ARDENT Software4, ODE by AT&T Bell Labs, POET by POET Software [29].

OODBMSs base on a hierarchical object-oriented data model. One of important notions of the object-oriented model is a reference, i.e. a pointer link to an object. Pointer links express relationships (associations) between objects. In the result of attempts to standardise object-oriented database management systems the ODMG5 [18] purposed OQL6 [29, 33] which to some extent influenced the development of object-oriented query languages. Differences in data models and query languages imply that some indexing techniques are specialised to relational or object-oriented approaches only.

OQL involves path expressions composed of object names separated by dots in

3_{Microelectronics and Computer Technology Corporation, Austin, Texas}

4_{Formerly O2 by O2 Technology}

5_{Object Data Management Group}

(33)

order to navigate via pointers to objects easily. Navigation to a pointed object in OODBMSs can be fast as it is usually resolved at a low-level with a direct link. In the relational model such relationships (i.e. primary-foreign key dependencies) require performing joins and for efficient query evaluation require indices. Nevertheless, some object-oriented systems may implicitly rely on a flat, relational-like data model. In such a case, navigation along a pointer link still requires performing an implicit join among objects. Thus the assumption limiting OQL path expressions is that the operand before a dot operator should not deliver a collection.

Much work has been dedicated in the OODBMSs research to cope with improving the efficiency of processing nested predicates, i.e. based on derived attributes defined using path expressions. These works additionally extend path expression indexing with consideration of inheritance issues. The most important proposed solutions are Multi-Index, Inherited Multi-Index, Nested-Inherited Index, Path Index, Access Support Relations [10, 11, 12], Triple-node hierarchies [77] and T-Index (focused on semi-structured data) [81]. The efficiency of these methods was deeply studied, described through appropriate cost models and verified by prototype implementations. The solutions are focused on various criteria, such as the cost of retrieval, cost of updates operations or cost of storage. However, the transparency aspect of automatic index updating is not always precisely explained. Generally, it is assumed that each modification of an attribute of a class instance and creation or deletion of an instance should cause appropriate index updating actions. However, instances of one of classes accessed by an indexed path expression can be located in different collections. Moreover, these collections can contain the arbitrary number of objects not associated with indexed objects. These circumstances can make automatic index updating routines inapplicable or seriously affect database’s performance. Let us consider an example of an OQL query returning data concerning departments who are supervised by an employee John Doe:

SELECT * FROM Departments d

WHERE d.supervisedBy.name = “JOHN DOE”

A path expression based index supporting query evaluation concerns only a number of employees who are department supervisors. Unfortunately, modifying a name of each employee would be burdened by the index maintenance mechanisms. This inconvenience is however justified. In the approach to automatic index updating