Mälardalen University The Department of Computer Science and Engineering CD5130 Object oriented programming, advanced cource

(1)

Object databases

Group 21 Thomas Bengtsson 780714 Per Bengtsson 710313

(2)

Summary

Object databases can be useful when storing complex data structures. The use of a consistent object model throughout the entire system development process can greatly simplify and speed up the development of new systems. The database can be queried using OQL, a query

language based on SQL-92 with extensions for handling complex objects among other things. Using OQL the programmer can make use of object methods including polymorphism directly in the queries making OQL a very powerful tool.

(3)

Introduction

This essay is to a large degree based on facts taken from the standard set by the ODMG (Object Data Management Group). As we have no experience with object databases we have no way of knowing to what extent these standards are followed by the current ODBMS vendors. The text is meant to be an introduction to object databases not a tutorial and as such it is far from being comprehensive.

We will begin by introducing the relational database concepts as the reader will have to be somewhat knowledgeable about relational databases to be able to understand comparisons made between RDBMS's and ODBMS's. We will then proceed with the presentation of the ODBMS's, describing the advantages of using object databases, the object model the databases are based on and OQL, the query language used to query the databases.

Short introduction to RDBMS

The most widely used form of database today is the relational database (RDBMS). The

RDBMS is based on tables of data using unique keys (such as social security numbers) to form relationships between these tables. This is meant as an introduction to relational databases and SQL for those who have not encountered them before.

Table_PERSON pk_person_id 1111111111 first_name John last_name Doe Shoe_size 9 Favorite_color blue

Each table has a primary key (several columns can be combined to form a primary key) to uniquely identify each record. In the table above the most probable key would be

pk_person_id. This table can be related (connected) to other tables using the primary key of the PERSON table and a foreign key in another table. A foreign key has only one purpose, to relate to the primary key of another table. In the table below the value of the foreign key is the same as the primary key value of the table above.

Table_CONTACT_INFO pk_info_id 1 tel_number 121212 cell_number 121213 e-mail john.doe@work contact_place work fk_person_id 1111111111

If there can only be one CONTACT_INFO for each PERSON then that relationship would be referred to as a 1:1 relationship. If it is possible for each PERSON to have more then one CONTACT _INFO then that relationship would be called a 1:N relationship. This is called the

(4)

relationships cardinality. When designing a relational database it is important to avoid N:N relationships (many to many).

A relational database can consist of hundreds of tables or more. The image below is not nearly that large but should give you an idea of how a relational database looks.

There is a lot to know about relational databases and we cannot possibly cover everything here. We will attempt to explain the different aspects of database development as we approach them from an object database perspective later in this report

SQL

To manipulate a database you need a tool, one way is to implement a dml (data manipulation language). Dml includes functionality for

•Insert new data into database •Modification of data in database •Retrieval of data from database •Removal of data from database

There are two dml types, procedural mdl and non-procedural mdl. In procedural mdl, all data manipulating is done in record-to-record manner and the user knows exactly how to retrieve a

(5)

record directly from the database. This means that you take a record and process it and then take the next record etc, this process repeats until the goal of the database operation is completed. In a non-procedural dml the DBMS lets the user state what data needs manipulating and how it should be manipulated.

To be able to query a relational database a standardized query language has been developed called SQL (structured query language). SQL is one of the more common procedural mdl's to date. SQL has a long list of key words but this text will only explain a small subset of the keywords used in SQL.

Select is a keyword that means we want to retrieve data from the database it states what is to be retrieved using parameters.

Syntax: select xxx (where xxx is what we want to retrieve) Example: select car_id ;

Or to put it in plain English: retrieve all car_id

We cannot select anything if we do not specify from where the data is to be retrieved. That is where the next keyword comes in handy.

From is another keyword used in SQL. From specifies from what set of records we would like to receive data.

Syntax: from xxx (where xxx is a set of records) Example: select car_id from car;

Or to put it in plain English: retrieve all car_id values from the table with the name “car”.

Where is a keyword that limits the number of records the user retrieves by setting up a condition on every record, which must be met.

Syntax: where xxx=yyy

Example: select car_id from car where model=”Volvo”

Or to it put it in plain English: retrieve all car_id values from the record set with the name “car” but only retrieve records that have the attribute “model” is set to “Volvo”.

It is possible to join several sets of records to retrieve data from them. This also done using the "where" keyword.

Syntax: where x.qqq = y.zzz Example:

Select car_id, name from car c, person p

where c.color = ”blue” and p.favorite_color = ”blue”

Or to put it in plain English: This query lists car id's coupled with a persons name when the car's color matches the person's favorite color.

(6)

What is an ODBMS?

An ODBMS (Object Database Management System) does not use tables to store data like a relational database. Instead the objects themselves are stored in the database. The relationships are not stored using primary and foreign keys but rather by maintaining the references objects have to each other. The remaining text in this paper will describe in a general fashion what an ODBMS is and to what use it can be put.

Benefits of ODBMS's

Mapping

When a programmer using an object oriented language such as C++ needs to save his objects to a relational database (to make his objects persistent) he has to map the objects to fit the tables of the relational database. It takes significant time to write the code necessary for this mapping, and that additional code means that more bugs might be written into the code. Therefore using an object database would mean less code would have to be written, there would probably be fewer bugs and performance would increase as there would be no need to transform the data from tables into objects and back again. When an object database is being used the same object model is used to describe the transient objects in the computers memory (RAM) as the persistent objects stored in the ODBMS

Scalability

It is worth noting that a program written as a single user program could have a lot to gain from using an ODBMS rather than saving objects to a file. Many single user programs often later become multi user programs. If an ODBMS has been used to store the persistent objects then the data can easily be accessed by multiple users. The scalability of software is therefore much better when an ODBMS is used to manage the persistent data.

Relations

In an SQL database the data is stored in tables. When data is to be recovered from these tables it is often necessary to recreate the relationships between tables using joins in runtime. An ODBMS on the other hand can traverse the relationships between objects without the joins that are so detrimental to efficiency as the relational paths are stored in the objects. Likewise relationships between objects in the form of inheritance have to be restored should the data be stored in tables. In an ODBMS the persistent objects are stored in the same state as the transient objects eliminating the need for code to transform the data between the different states.

Sequences

Sequences (of events) can be stored in tables, but that would require a row to describe the time index of the events. If a RDBMS was used that would mean that the table would have to be

(7)

sorted to be of any use. Even if there are efficient sorting algorithms that can be implemented an object database has a better solution. An object that stores a sequence can store the data in a linked list thus providing good access to the data in a pre-sorted fashion. Handling sequences of this kind is important in several different applications such as storing medical data (EKG), Financial information (stock prices) etc.

Storing complex data structures

The performance of an ODBMS is far superior to a RDBMS in some cases. One way to illustrate this is to look at the part-subpart (possibly recursive) relation (a [1:1] or has-many [1:N]).

The part-subpart structure, just like any other tree-like data structure, is hard to represent in an effective way in a RDBMS. In a RDBMS this could be done using a recursive structure. The table “part” would contain a primary key “part_id” and a few other values to tie other parts to the part in question. (e.g. if each instance links to (up to) two other instances, the table “part” will have to use two columns to store the links, for example “contains_part_id1” and

“contains_part_id2”). For a piece of software to read the entire hierarchy into memory an extensive query would have to me made to traverse these relationships until the

“contains_part_id” values are NULL. In an object database no queries would have to be made as the tree structure is preserved in the database. Since there is no need to recreate the original data structure performance is increased substantially.

Storing objects – storing methods

When an ODBMS stores an object it stores the entire object including its data and its methods. This means that the objects can be used while still in the database. Thus operations can be executed in the database engine rather then in the application. This means that development of applications is easier because less code has to be written which makes bugs less likely and testing and future modification of the code easier. This is described in more detail under the heading OQL – methods later in this paper.

A single consistent object model

In traditional system development different models and terms are used to describe the representation of data. In the system design phase flow charts might be used, the database designers might use ER-diagrams and the programmers might use pseudo code. When the

(8)

object-oriented approach is used a single object model can be used. UML could be used to describe the objects, the methods of the objects and the communication between objects. Programmers can use these UML diagrams directly when coding and the database could store the objects without the need to change them into another form such as tables.

This simplification makes system development quicker as fewer transitions need to be made between ”models” the OO approach also means less code as objects do not need to be transformed into tables and back. Less code means less bugs and speedier development.

(9)

The inner workings of an ODBMS

The following description of the inner workings of ODBMS's is most probably not

representative of all ODBMS's. This section is only meant to give the reader a general idea of how an ODBMS works.

Accessing objects in the database

An object database is treated much like a huge virtual memory. The application and the DB share a memory space called an object cache. The ODBMS is responsible for mapping pages from the database's secondary storage onto the virtual memory of the operating system (the object cache). When the application tries to access an object one of two things can happen.

The object is in the object cache and is promptly accessed by the application. If the requested object is not in the cache the database will detect this and map a page from its secondary storage into a page in the object cache. If the object has references to other objects the addresses of these objects will be converted into virtual memory addresses so that the

application can use them as programming language pointers. This process is called swizzeling. Swizzleing can be lazy, not loading pages into the object cache until they are requested, or eager, when all objects referenced by the originally fetched object are loaded into the object cache. The size of the object cache should be modified depending on which kind of swizzeling is being used, eager swizzeling requiring a larger cache size than lazy swizzeling.

When the program terminates or the database needs to use the space in the object cache the DB copies the updated page back to secondary storage replacing the earlier version of it. The database can refer to objects on secondary storage using their physical address or using logical addresses (this way of introducing a level of indirection is called object-descriptor architecture). Physical addressing is practical when the database is small and updating and writing to the database is not frequent. On larger databases with frequent updates it is better to use logical addressing in stead. Using logical addressing introduces a layer of indirection, which causes a decrease in database performance. However when logical addresses are used it is a lot easier to move data to different parts of the database or network. This means that scalability is greatly improved. Also the use of logical addressing means that swizzeling is not necessary as relationships between objects can be traversed using the logical addresses. The database runtime is linked with the application rather then the two being totally separate processes. This coupled with the fact that the application and the DB share the object cache means that a significant increase in performance is possible as inter-process communication can be avoided. This integration is not possible if multiple clients need to be able to access the database. In such cases a client server approach is necessary. ODBMS use stubs that act as a proxy to the actual database. However when stubs are linked to the application, just as in the integration mentioned before, much of the performance benefits are lost as communication over the network is necessary between the stubs and the actual DB server.

(10)

Increasing performance

The performance of object databases can be increased using methods common to RDBMS's. Most of these methods are implemented similarly to the way they would be in a RDBMS, such as tweaking of cache sizes, replication and the use of indexes. The implementation of

clustering in ODBMS's deserve a few lines however.

When an object not present in the object cache is requested the result is a page fault. The database detects this and loads the page with the requested object into to the object cache. It is possible to minimize the amount of page faults if objects are clustered intelligently, i.e. if objects that are likely to be used together are stored on the same page in secondary storage. This would mean less page faults resulting in less disc access. Clustering can be controlled automatically by the database based on access patterns or manually by the developer.

Concurrency

If several people are to share a single database it is important that they do not interfere with each other. If two people can access the same data at the same time then it is possible for inconstancies to creep into the database. Concurrency control means that the database makes sure that while the database is inconsistent, i.e. during a transaction, no one other then the current user can access the affected objects.

Transactions

Transactions are the basis for most database access. The point of a transaction is to be sure that either everything or nothing of what you want to do is carried out. Thus nothing is done until the transaction is completed by the commit command, whereby all changes are written to the persistent objects on secondary storage. Without the use of transactions a bank account could be debited for a transferal of funds without the receiving account ever being credited. There are several concepts in transaction handling that need to be described:

Atomicity is the transaction concept stating that either everything in the transaction is done or nothing in the transaction is done.

Constancy means that the database is carried from one consistent state to another. This means that although the database is by necessity inconsistent during a transaction it should always remain consistent before and after any transaction.

Isolation means that all changes to a database remain invisible to other users until the transaction commits.

Durability means that the results of any transaction should be preserved regardless of if processors, secondary storage devices or other hardware devices should fail. This can be accomplished trough the use of redo and undo log-files usually situated on a disk separate from the one(s) storing the database.

Distributed transactions are transactions that cut across more then one database or process. According to the ODMG object databases can support distributed transactions. If they do

(11)

these transactions are transparent to the user and should be fully compliant with the ISO XA standard.

Transactions within ODBMS's are handled by programming languages using the two types transactionfactory and transaction. Transactionfactory is used to create transaction objects. The transactionfactory has the following interface:

interface TransactionFactory{ Transaction new(); Transaction current(); };

The transactionfactory creates transaction objects whose interface looks like this: interface Transaction{

void begin() raises(TransactionInProgress, DatabaseClosed); void commit() raises(TransactionNotInProgress);

void abort() raises(TransactionNotInProgress); void checkpoint() raises(TransactionNotInProgress); void join() raises(TransactionNotInProgress);

void leave() raises(TransactionNotInProgress);

boolean isOpen();

};

Locking

To allow transactions to work properly access to object need to be tightly controlled. This is done by applying locks to objects to prevent others from modifying them in the middle of a transaction. The following locks can be acquired on objects in an object database:

A read lock does not hinder anyone else from obtaining a read lock on an object but it does hinder others from writing to the object.

A write lock grants exclusive access to the object with the purpose of editing the object. An upgrade lock does not prevent others from establishing a read lock but hinder others from obtaining a write lock. It is therefore a kind of lock useful for locking object you intend to write to later on. This should eliminate the risk of causing a deadlock where two transactions each wait for objects locked by the other transaction. This is of course dependant on that programmers always use update locks if they later intend to write to an object.

(12)

Object model

In a heterogeneous environment a common object model is required to facilitate interoperability. Therefore regardless of what programming language is being used the

programmer must conform to the ODMG object model if s/he wants to be able to store objects in the database. The object model exists to “...determine the characteristics of objects, how objects can be related to each other, and how objects can be named and identified” (Cattell & Barry, 2000)

Object creation

Objects are created using an object factory. Separate object factories are supplied to

programmers for each object type (collection objects etc). All object factories are derived from a “super” interface as seen in the image below.

(13)

Objects in the ODMG object model are subject to a hierarchy visualized by the image below.

The object in the top of the hierarchy has an ODL (Object Definition Language) interface, which is passed down to all other objects

interface Object{

enum Lock_Type{read, write, upgrade};

void lock(in Lock_Type mode) raises(LockNitGranted); boolean try_lock(in Lock_Type mode);

boolean same_as(in Object anObject); Object copy();

void delete(); };

copy(); makes a copy of an object. However the copy does not retain the object id of the original object.

same_as(); is used to compare objects. If an object was copied using copy(); and then compared to the original using same_as();, same_as(); would not return TRUE because the copy is just a copy and has a different object identifier then the original.1

Delete(); is used for deleting objects from the database. In addition to removing the object from the database delete(); will also purge the object from memory.

1

All objects have unique object identifiers which never change during the lifetime of an object. These identifiers are handled by the ODBMS, not the application. The way which object identifiers are implemented may vary between ODBMS vendors and is not something covered by the ODMG standard. Objects can also be given one or more names that are mapped onto the object. It is expected that names will be given to rot objects that are to be used as entry points into the database.

(14)

Lock(); attempts to get a lock on an object (read, write or update). The difference between lock and try lock is that lock causes an exception if the lock fails. Try_lock receives a boolean as a response as to whether or not the lock request was granted.

Literals

Literals are not objects but rather embedded in objects. There are several different types of literals namely atomic, collection, and structured literals.

The atomic literals are numbers and characters. The atomic literals supported by the ODMG object model are as follows:

long long long short unsigned long unsigned short float double boolean octet char string enum

If the specific programming language used does not support one or more of these types then a class library shall be supplied as a part of the programming language binding. This class library is to define the implementation of the unsupported type(s).

Collection literals are like collection objects. The collection literals however lack object identifiers. These collections can contain objects as well as literals. The supported collection literals are: •set<t> •bag<t> •list<t> •array<t> •dictionary<t>

Structured literals also referred to simply as structure(s) is a type that contains a predefined number of elements with predefined names. Each one of these elements can contain either a literal value or an object. The object model supports:

•date •interval •time •timestamp

(15)

Developers can define structures of their own as well. This could be useful for example when an address is to be stored. Struct Address{ string street_name short street_number char entrance long zip_code string City }:

Language bindings

Each programming language has a language binding that “defines a mapping between abstract classes [in the ODMG object model] and its language's implementation classes.” In other words, there are separate sets of language binding for C++, Java, and Smalltalk. Each of these language bindings describe how that particular programming language should access object databases. There are also class libraries supplied to these bindings to facilitate programming. In this text we have chosen not to describe any of these bindings and instead focus on the

(16)

OQL – Object Query Language

OQL is the object database answer to SQL. It retains much of the look and feel of SQL as it is based on the SQL-92 standard extended to take into account thing such as complex objects, traversing of object relationships, polymorphism etc. It is therefore fairly easy for someone with experience of object oriented programming and SQL to learn the basics of OQL. Likewise for someone totally unfamiliar with SQL the OQL syntax can seem quite hard to understand. SQL queries are performed by selecting attributes from tables. OQL queries on the other hand are performed using names of objects or collections of objects as the entry point into the database. In the following example “Students” is the name of a collection referring to all the “student” objects in the database. OQL queries can return not only single objects but also collections of objects and more as described in the below examples.

Query returning a collection of objects:

Select x from Students x where x.age = 21

This query would return a collection of objects consisting of all student objects where the age attribute is 21.

Query returning a single object:

Select x from Students x where x.soc_security_nr = 123456

This query would return a single object - the student object with the stated social security number.

Query returning a collection of literals:

Select x.student_id_number from Students x where x.year_of_admition = 2003

This query would return a collection of literals listing the student id numbers of all students admitted during 2003.

Query returning a single literal:

An individual object can be given a name and as such literals contained within such an object may be accessed in the following way:

Principal.emplyed_year

The above query would return a literal, namely the value of the “employed_year” attribute of the object named principal.

Navigating object relationships

As previously stated it is common to use a named object as an entry point into the database, but this is of course not enough. It must be possible to traverse the relationships between objects in order to access the desired data. In OQL this is accomplished using the “.” or “->” operator. In the below example the desired data is the network socket number in the apartment inhabited by student s.

(17)

If we want to know what courses student s has completed then we could write the following query:

select c.name from s.courses c

This would return a collection of text strings with the names of the courses associated with student s.

Using the “where” clause.

The where clause can be used to restrict the result of a query to only the data that matches the condition imposed. In the below example only the e-mail addresses to students named Adam will be returned.

Select s.e-mail from students s

where s.name = “Adam”

Boolean operators

Boolean operators can be used in an OQL query to restrict the query. In addition to the usual AND, OR and NOT (can also be written !=), OQL also supports the use of ORELSE and ANDTHEN

X ORELSE Y means that y will only be evaluated if x is false.

X ANDTHEN Y means that x is evaluated and then Y. If X is false then Y is never evaluated.

Joins

Joins exist in OQL just like in SQL. With the aid of joins objects in different collections can be compared.

Select x from Students s, Faculty f where s.date_of_birth = f.date_of_birth

In the above example the students who share a birthday with someone in the faculty are selected.

Undefined values

Sometimes an attribute or a relation in an object is set to NULL. In the cases where that value is requested or that relationship is to be traversed the ODBMS returns the value

UNDEFINED. To determine if a value is undefined the following two methods can be used: is_undefined(UNDEFINED) – will return a value of TRUE

(18)

The following is an example of how to avoid getting a lot of UNDEFINED mixed up with your query results:

Select s.shoe_size from students s

where is_defined(s.shoe_size)

This will return the shoe sizes of the students, but only those students for which shoe size has been defined.

Methods

Methods can be invoked using the same syntax as when accessing an attribute if the method does not have any parameters. If the method has parameters they are written in parentheses. Methods can only be invoked if they return the same type expected by the query.

Select s.get_current_course.course_name from students s

where s.student_id = 123456

The above example uses the method get_current_course() to return the current course for a particular student, that object is then traversed to find the name of that course.

Polymorphism

A powerful tool when programming in an object-oriented language is polymorphism. It allows the same message to be sent to several different objects that share a common super-object causing them to execute different methods (they each share the same method name and

interface, but the implementation differs). In an object database polymorphism can be used the same way that methods are invoked.

Other operators

Other operators include sort by, group by, intersect, union, except, count, min, max, avg and more.

(19)

Concluding thoughts

Object databases are not well known. They do seem to have valid uses and advantages to relational databases in certain situations. We believe that object databases will be more common in the future, but that they are currently held back mainly because computer professionals know very little about object databases and are very familiar with relational databases. Object databases might have a problem gaining acceptance and popularity if they cannot become more widely known. Unfortunately object databases seem to be caught in somewhat of a catch 22. They are not used because few people know about them or how to use them. And they will not be used unless people know about them and know how to use them (and when to use them). However, if ODBMS's become more widely known and used we believe that they can take significant market shares away from RDBMS's.

Some of the literature used to research this subject in preparation for this paper is rather old (1995). Therefore some of the information taken from these older sources may not be totally accurate today. The purpose of this paper had been to give the reader a general idea of object databases, and should the reader's interest have been peaked then we strongly encourage the reader to read the ODMG 3.0 standard to get a more extensive view of object databases.

(20)

References

Cattell, R. G. G, & Barry, D. K. (2000). The Object Data Standard: OMDG 3.0. Academic Press

Loomis, Mary E. S. (1995) Object Databases: The Essentials. Addison-Wesley

Connoly, T. & Begg, C. & Strachan, A. (1998) Database Systems A Practical Approach to Design Implementation, and Management. Addison Wesley