Scaling Distributed Database Management Systems by using a Grid-based Storage Service

(1)

Management Systems by using a

Grid-based Storage Service

Master Thesis

Silviu-Marius Moldovan [email protected]

Supervisors:Gabriel Antoniu, Luc Bougé {Gabriel.Antoniu,Luc.Bouge}@irisa.fr

Keywords:databases, grids, scalability, performance, fault tolerance, consistency.

Abstract

This report deals with databases with distributed storage, focusing on the insuffi-cient storage space issue. In order to make these systems more scalable, the advantages offered by grids can be taken into consideration. Thus, an approach to create an interface between a database system and a grid-based data storage service is presented.

Research Master Degree in Computer Science Rennes, June 2008

(2)

1 Introduction 2

1.1 Motivation . . . 2

1.2 Current scenario . . . 2

1.3 Proposal . . . 2

2 Storing data in databases: state of the art 4 2.1 Efficiency:main-memorydatabases . . . 4

2.1.1 Main-memory databases: a few new issues . . . 4

2.1.2 Case study: The DERBY data storage system . . . 5

2.2 Scalability: distributedmain-memory databases . . . 7

2.2.1 Motivation . . . 7

2.2.2 Case study: Sprint [2007] . . . 7

2.3 Going larger: using grids . . . 8

2.3.1 Context . . . 8

2.3.2 Grid data sharing services . . . 9

2.3.3 Grid-based databases in practice . . . 10

3 Contribution: interfacing a database system with a grid-based data storage service 12 3.1 The database system: Berkeley DB . . . 12

3.2 The grid-based data storage service: BlobTamer . . . 14

3.3 Storing the data of Berkeley DB using BlobTamer . . . 17

3.3.1 Selecting the level to interface . . . 17

3.3.2 Methodology applied . . . 19

3.3.3 How Berkeley DB reads and writes . . . 20

4 Implementation details 23 4.1 Mapping files into memory . . . 23

4.2 Re-implementing the read and write operations . . . 23

4.3 Re-implementing collateral operations . . . 25

5 Conclusion 26 5.1 Contribution . . . 26

5.2 Other approaches . . . 26

(3)

1 Introduction

1.1 Motivation

One of the most important aspect related to data is the way they are stored. The most com-mon way to store data nowadays relies on databases. Almost every type of application use them, ranging from those performing intense scientific computation to those storing per-sonal data of employees in a factory. Initially, databases stored their data on the disk space of one machine. The evolution in the database database domain has occured from the need of greater efficiency and larger storage space.

In-memory databases are one way to increase the access perfomance. In this approach, the data is stored in the main-memory and a backup copy may be kept on the disk. Since memory access time is smaller than disk access time, the access time can be reduced. But the storage capacity of the database will still be limited to the memory of a single machine.

A simple approach to extend databases and increase the storage capacity is to distribute their data on more machines. Thus, one could use the storage capacities and the computing power of different nodes in the same cluster. Of course, supplementary problems might arise in the management of such systems. For example, in order to assure data availability, a piece of data is replicated over several, different nodes. Thus, the coherence of the multiple copies must be assured. Databases with distributed data represent the subject of this study.

1.2 Current scenario

Nowadays, the amount of data to be stored is getting larger. Scientific applications, for example, perform numerical simulations and need big storage capacities and computing power. A larger space than one offered by a database with one node is required. Distributing the data of databases over a cluster of computers is one solution. But clusters can only have a few tens of nodes. If the data should be stored in the memory of the machines in the cluster, for the sake of efficiency, then the space might not be sufficient. For example, if each node has 2 Gigabytes of RAM memory and there are 30 nodes in the cluster, only 60 Gigabytes of storage will be available. Grids might represent the solution for further extension of the storage capacity of databases. But resources in grids are heterogeneous and new problems related to data management might arise.

1.3 Proposal

The most convenient approach of using grids relies on grid data-sharing services [2]. Besides transparent access to data, these services also provide persistence and consistency of data, in a fault tolerant way. Applications can, thus, benefit from their properties and leave them to intermediate all operations between them and the grids that store their data.

An approach of enabling databases to use grids for storing their data already exists [1]. It uses the JuxMem ([3], [2]) grid data-sharing service and allows performing some basic operations like table creation, record insertion or simple select queries. But this approach assumes extending the grid data-sharing service, by adding a new API on top of it. It does not take into consideration the database management system that will use this API. This problem has not been studied from the perspective of a database system.

It would be interesting to study an open-source database system and try to find an ap-propriate level where to redirect the data onto a grid-based data storage service. Thus, an

(4)

interface could be created between the database engine and the above-mentioned service, at a level which is convenient for the database engine.

The rest of the report is structured as follows: Section 2 presents the newest concepts in database design and analyzes the possibilities of using grids for storing the data of databases; Section 3 presents a database engine, a grid-based storage service and analyzes the possibil-ity of creating an interface between the two systems; Section 4 presents the most important implementation details of the interface; finally, Section 5 concludes on the contributions of this work and on the possibilities of continuing what has been done.

(5)

2 Storing data in databases: state of the art

Old-fashioned databases store their data on the disk space of one machine. But that is not practical any more. Nowadays databases must be more efficient and the amounts of data to be stored are increasing. This has led to a revolution in the concepts of database design. The main ideas and concepts that have appeared lately are presented in this section.

2.1 Efficiency:main-memorydatabases

One of the first changes that occured was using the main-memory as storage medium. In main-memory database systems data is stored in the main physical memory, as opposed to the conventional approach in which data is stored on the disk. Even if the data may have a copy on disk, the primary copy lives permanently in memory, and this has deep influence on the design and performance of this type of database system. Storing large databases in memory has become a reality nowadays, since memory is getting cheaper and chip capacity increases.

2.1.1 Main-memory databases: a few new issues

The properties that distinguish main memory from magnetic disks lead to optimizations in most of the aspects of database management. Using memory-resident data has impact on several functional components of database management systems, as shown in [6]. These new issues are illustrated below.

Concurrency Control The most commonly used concurrency control methods in practice are based. In conventional database systems, where data are disk-resident, small lock-ing granules (fields or records) are chosen, to reduce contention. In main-memory database systems, on the other hand, contention is already low because data are memory-resident. Thus, for these systems it has been suggested that very large lock grains (e.g., relations) are more suitable for the data. The lock granule can be chosen to be the entire database, in the extreme case, which leads to a desirable, serial execution of transactions. In a con-ventional system, locks are implemented through a hash table that contains entries for the locked objects. The objects themselves, located on the disk, contain no lock information. In main-memory systems, in contrast, a few bits in the objects can be used to represent lock status. The overhead of a small number of bits is not significant, for records with reasonable length, and lookups through the entire hash table are avoided.

Persistence There are several steps to achieve persistence in a database system

Logging First of all, a log of transaction activity is kept. This log must reside on stable stor-age. In conventional systems, it is copied directly to the disk. But logging can affect response time, as well as throughput, if the log becomes a bottleneck. These two prob-lems exist in main-memory systems, too, but their impact on performance is bigger, since logging is the only disk operation required by each transaction. If logging takes too much time, for example, the overall performance of the system will decrease. To eliminate the response time problem, one solution is to use a small amount of main memory to hold a portion of the log (the log tail); after its log information is written

(6)

into memory, a transaction is considered comitted. If there is not enough memory for the log tail, the solution is to pre-commit transactions. This implies releasing the transaction’s locks as soon as its log record is placed in the log, without waiting for the information to be propagated to the disk. To relieve a log bottleneck, group com-mits can be used: the log records of several transactions are allowed to accumulate in memory, and they are flushed to the disk in a single disk operation.

Checkpointing The second step in achieving persistence is to keep the disk-resident copy of the database up-to-date. This is done by checkpointing, which is one of the few reasons to access the disk resident copy of the database, in a main-memory database system. Thus, disk access can be tailored to the needs of the checkpointer: disk I/O is performed using very large blocks, which are more efficiently written.

Recovery The third step to achieve persistence is recovery from failures, which implies restoring the data from its disk-resident backup. To do that in main-memory systems, one solution that increases performance is to load blocks of the database on demand until all data has been loaded. Another solution is to use disk stripping: the database is spread across multiple disks and read in parallel.

Access methods In conventional systems, index structures like B-Trees are used for data access. They have a short, bushy structure and the data values on which the index is built need to be stored in the index itself. In main-memory systems, hashing is one solution to access data. It provides fast lookup and update, but is not as space-efficient as a tree, ac-cording to [6]. Thus, tree structures such as the T-Tree have been designed explicitly for memory-resident databases [8]. As opposed to B-Trees, these trees can have a deeper, less-complicated structure, so that traversing them is much faster. Also, because random access is fast in main memory, pointers can be followed efficiently. Therefore, the index structures can store pointers to the indexed data, rather than the data itself or block identifiers, which is more efficient [6].

Data representation In main-memory databases, pointers are used for data representation: relational tuples can be represented as a set of pointers to data values. Using pointers has two main advantages, as stated in [6]: it is space efficient (if large values appear more than once in the database) and it simplifies the handling of variable length fields.

2.1.2 Case study: The DERBY data storage system

The DERBY data storage system, described in [7], is used to support a distributed, memory-resident database system. DERBY consists of general-purpose workstations, connected through a high-speed network, and makes use of them to create a massive, cost-effective main-memory storage system.

Workstations in DERBY are classified into servers and clients. Servers run on machine with idle resources and provide the storage space for the DERBY memory storage system. They have the lowest priority and are "guests" of the machines they borrow resources from. One important advantage of DERBY is that, at any moment, each workstation can be oper-ating as a client, a server, both or neither (as it can be seen in Figure 1) and this may change

(7)

Figure 1: The DERBY architecture (node 5 serves as the UPS for nodes 1, 2, 3 and 4)

over time. Another advantage is that the system configuration models a dynamic, realistic data processing environment where workstations come and go over time.

The primary role of a DERBY client is to forward read/write requests from the database application to the DERBY server where the record is located. DERBY’s basic data storage abstraction assumes there is exactly one server responsible for each record. A server where the record is stored is called the primary location of the record; this location may change over time, due to the dynamic nature of the system.

The primary role of a server in DERBY is to keep all records in memory and to avoid disk accesses when satisfying client requests. Servers guarantee long-term data persistence (they eventually propagate modified records to the disk), but also short-term persistence without disk storage. The latter is achieved through a new and interesting approach: a part of the nodes in the network are supplied with Uninterrupted Power Supplies (UPS) which temporarily hold data. This a cost-effective way to achieve reasonably large amounts of persistent storage. To provide high-speed persistent storage, each server is associated with a number of workstations equipped with these UPS-s. Also, each workstation with UPS can provide service to more servers. In Figure 1, for example, node 5 is an UPS for the other nodes. In the case of failures, the lost data is recovered from the logs kept on disks, or from recently modified log buffers stored in the UPS workstations.

To guarantee data consistency and regulate concurrent access, DERBY provides a basic locking mechanism , with a lock granularity of a single record. This is quite unusual, since large granularities are most appropriate for memory-resident systems. Every client must hold locks for the records it operates on. The server maintains a list with each record of the clients caching the record as read-only and as write-locked.

One important contribution of DERBY is that, during initial testing, it has been shown that load balancing has little effect on the performance of memory-based systems, in contrast to disk-based storage systems. Memory-based systems need to consider migrating load only when a limited resource is close to saturation. Thus, DERBY dynamically redistributes the load via a saturation prevention algorithm that is different from the conventional approach to load balancing. The algorithm attempts to optimize processing load constrained by mem-ory space availability, more exactly to find the first acceptable distribution of load that does not exceed a predefined threshold of available memory space on any machine. This way, the servers are kept away from reaching their saturation points.

(8)

2.2 Scalability: distributedmain-memory databases

2.2.1 Motivation

Applications accessing an in-memory database are usually limited by the memory capac-ity of the machine hosting the database. Having a database whose data is distributed on a cluster of workstations brings several advantages. To begin with, the application can take advantage of the aggregated memory of all the machines in the cluster, and, thus, the avail-able storage capacity will be larger. Then, by distributing the data over more nodes, fault tolerance can be achieved. Copies of the same piece of data can be replicated over multiple nodes and, so, if one node fails, the data can be recovered from these nodes. Finally, the fail-ure recovery will also be performant, since the backup copies of the data which are closest to the node will be used for restoring a failed node.

2.2.2 Case study: Sprint [2007]

Sprint is a middleware infrastructure for high-performance and high-availability data man-agement. It manages commodity in-memory databases running in a cluster of shared-nothing servers, as stated in [5]. Applications will then be limited by the aggregated capacity of the servers in the cluster.

The hardware infrastructure of Sprint is represented by its physical servers, while the software one by the logical servers: edge servers, data servers anddurability servers. The ad-vantage of this decoupled design is that each type of server can handle a different aspect of database management. Edge servers receive client queries and execute them against data servers. Data servers run a local, in-memory, off-the-shelf database management system and execute transactions without accessing the disk. Durability servers ensure transaction persistency and handle recovery. The server types can be identified in Sprint’s architecture, illustrated in Figure 2.

Physical servers communicate by message-passing only, while logical ones can use point-to-point or total order multicast communication. Physical servers can fail by crashing and may recover after the failure but lose all information stored in the main-memory before the crash. The failure of a physical server implies the failure of all the logical servers it hosts.

Sprint tolerates unreliable failure detection, which relies on an entity providing a list of nodes that may have failed, but not all of them having actually failed. The advantages of this are fast reaction to failures and the certainty that failed servers are eventually detected by operational servers. The limitation of this approach is illustrated by the case of an opera-tional server that may be mistakenly suspected to have failed. But even if that happens, the system remains consistent: the falsely suspected server is replaced by another one and they exist simultaneously for a certain time.

All permanent state is stored by thedurability servers, which periodically creates an image on disk of the current database state. In the case of a failure, new instances ofedge serversand

data serversare created on operational physical servers, using the state stored by the durabil-ity servers. If adurability serverfails, the information needed for recovery are retrieved from operationaldurability servers.

According to [5], the usual ACID (atomicity, consistency, isolation, durability) properties of a transaction are guaranteed by Sprint. The system distinguishes between two types of transactions: local transactions that only access data stored on a single data server, and global

(9)

Figure 2: The Sprint architecture

transactions that access data on multiple servers. Both types of transactions are supported and respect the ACID properties.

Database tables are partitioned over the data servers. Data items can be replicated on multiple data servers, which brings several benefits. First of all, the failure recovery mecha-nism is more efficient. Then, this allows parallel execution of read operations, even though it would make write operations more difficult, since all replicas of a data item must be mod-ified.

One important contribution of Sprint is its approach to distributed query processing. In most distributed database architectures, high-level client queries are translated into lower-level internal requests. In Sprint, however, a middleware solution is adopted: queries are decomposed into internal ones, according to the way the database is fragmented and repli-cated. Also, the distributed query decomposition and merging are simple, since Sprint was designed for multi-tier architectures.

The experiments conducted proved that Sprint shows good performance and scalability on clusters of up to 64 nodes (among which 32 data servers). Experiments at larger scales (e.g., at grid scale) have not been performed, however.

2.3 Going larger: using grids

2.3.1 Context

Sometimes, the amount of data in a database can be too large to store, even for a single cluster of computers. Grid computing has emerged as a response to the growing demand for resources. A grid is composed by a federation of clusters. A grid system seems to offer

(10)

the necessary infrastructure for storing databases: they offer much larger amounts of storing capacities, on many nodes. This could allow to store larger volumes of data.

But new problems arise, related to grid infrastructures. To begin with, since there are more clusters in the grid, there will be a hierarchy of latences in the system: some latences will be greater than others. When a message is sent between two nodes from the same clus-ter, it will cost 100 or 1000 times less than if the two nodes were on different clusters. A possible optimization for this problem is to use communication protocols on a hierarchical level. Then, a grid system is composed of many hosts, from many administrative domains, with heterogeneous resource capabilities (computing power, storage, operating system). Fi-nally, together with the number of nodes, the number of failures will increase. Thus, special care must be taken when managing data in such a system.

2.3.2 Grid data sharing services

A data sharing service for grid computing opens an alternative approach to the problem of grid data management. This concept decouples data management from computation. Main features The main goal of such a system, as stated in [2], is to provide transparent access to data. The most widely-used approach to data management in grid environments is by explicit data transfers: the user has to localize the data he is interested in and to perform the desired transfer. Thanks to transparent access to remote data through an external data-sharing service, the client does not have to handle data transfers and does not have to care where the data are.

There are another three properties that a data-sharing service provides.

Persistence First of all, it provides persistent data storage, to save data transfers. Since large masses of data are to be handled, data transfers between different grid components can be costly and is desirable to avoid repeating them.

Fault-tolerance Second of all, the service is fault tolerant. Data is available despite events that can occur because of the dynamic character of the grid infrastructure, like re-sources joining and leaving, or unexpected failures. Thereby, replication techniques and failure detection mechanisms are provided.

Consistency Finally, the consistency of replicated data is guaranteed. Since data manipu-lated by grid applications are mutable, and data are often replicated to enhance access locality, the service must ensure the consistency of the different replicas. To achieve this, the service relies on consistency models, implemented by consistency protocols. According to [3], a data sharing service for grid computing can be thought of as a hybrid system between distributed shared memory (DSM) systems and peer-to-peer (P2P) systems. That is because it takes benefit from the advantages provided by both types of systems. It provides transparent data sharing and consistency protocols and models, just like a DSM system. Meanwhile, it provides fault-tolerance mechanisms and it manages heterogeneous resources in a very scalable environment, just like a P2P system.

(11)

An example: JuxMem An architecture was proposed ([3], [2]) for a data-sharing service, based on the observations above. The software architecture of JuxMem (forJuxtaposed Mem-ory) reflects the hardware architecture of a grid: a hierarchical model consisting of a federa-tion of distributed clusters of computers. This architecture is made up of a network of peer groups, which can correspond to clusters at the physical level, to a subset of the same phys-ical cluster, or to nodes spread over several physphys-ical clusters. In each such group there are nodes that provide memory for data storage, nodes that simply use the service to allocate and to access data blocks and one node that manages the available memory.

In order to allocate memory, the client must specify on how many clusters the data should be replicated, and on how many nodes in each cluster. In response, a set of data replicas are instantiated and an ID is returned. To read or write a data block, the client only needs to specify this ID, and JuxMem transparently locates the corresponding data block and per-forms the necessary data transfers. Each block of data stored in the system is replicated and associated to a group of peers, each peer in the group hosting a copy of the same data block. These peers could belong to different clusters and, thus, the data could be replicated on several physical clusters.

JuxMem’s approach to maintain the consistency between the different copies of a same piece of data is based on home-based protocols. For each piece of data there is ahome entityin charge of maintaining a reference data copy. A protocol implementing theentry consistency modelin a fault-tolerant way has been developed. To limit inter-cluster communications, the home entity is organized in a hierarchical way: local homes, at cluster level, are the clients of a global home, at grid level ([4], [9]).

2.3.3 Grid-based databases in practice

Grids have been developped in the context of high-performance computing (HPC) appli-cations, like numerical simulations. The use of these infrastructures has been very little explored in the context of databases. Two examples are presented below.

DB/JuxMem This approach [1] extends the JuxMem grid data-sharing service with a database-oriented API, that allows to perform basic operations like table creation, record insertion and simple select queries. In order to achieve this a layered software architecture is added on top of JuxMem, each layer having a precise role in database management: data storage, indexing, table fragmentation, etc. The highest one is the database-oriented API. These high-level layers over JuxMem are necessary, because JuxMem initially manipulated data only as byte sequences. One advantage of this approach is that it provides structured data management to applications running over JuxMem. Another advantage is that database management systems can benefit from the properties of JuxMem. The data and metadata of databases are handled using JuxMem, which transparently allocates, localizes, replicates and transfers data, in a fault-tolerant and consistent way. Finally, the approach takes bene-fit of a grid-scale computing infrastructure, as opposed to previous efforts, which provided a distributed main-memory database management system that relied only on a cluster of computers.

Oracle 11g The Oracle Database [14] was designed for enterprise grid computing. The Or-acle grid architecture creates large resource pools, which are shared by different applications.

(12)

Data processing and storage capacity can then be dynamically provisioned to apllications as needed. One of the most important features for providing resource provisioning is repre-sented by Real Application Clusters (RAC). A RAC is a cluster database with a shared cache architecture that runs on multiple machines. These machines are attached through a cluster interconnect and a shared storage subsystem. A RAC database appears like a single database to users and the same maintenance tools and approaches used for a single database can be used on the entire cluster. One important role of RAC is its ability to manage workload: it can add or remove nodes on demand, based on the processing requirements. RAC also plays an important role in assuring data availability. All the data in the database are replicated on all the nodes in the cluster. RAC exploits this redundancy: users have access to all data as long as there is one available node in the cluster, even if all the other nodes have failed. Even though the Oracle database is self-managing and provides automatic resource allocation, as mentioned above, administrators are allowed to influence how the database resources are allocated to users. This is done through another feature, calledResource Manager. The sys-tem also provides capabilities to schedule and perform jobs in the grid, through theScheduler

(13)

3 Contribution: interfacing a database system with a grid-based

data storage service

3.1 The database system: Berkeley DB

Berkeley DB [11] is a database library style toolkit written completely in the C programming language. It contains almost 1,800,000 lines of code, structured in many APIs. The library provides a broad base of functionality to application developers. An overview of its features and provided services is illustrated in Figure 3.

Main features The system uses asimple function-call interfacefor all operations, there is no query language to parse and no execution plan to produce. One big advantage of this li-brary is that it is open-source, so the complete source code is available and can be modified according to one’s needs. The library isembedded, since it runs in the address space of the application that uses it, inside the same process. As a result, the database operations hap-pen inside the library and require no inter-process communication. Another advantage of Berkeley DB is that it isscalable. Firstly, even though the library is quite small, it can manage databases of up to 256 Terabytes and records of up to 4 Gigabytes. Secondly, it supports high concurrency, thousands of users being able to perform operations on the same database in the same time. Another interesting feature of Berkeley DB is itsconfigurability: the applica-tions can select the storage structure that provides the fastest acces to their data as well as the database services they need (eg, the degree of logging, locking, concurrency or recov-erability). Moreover, applications can choose whether to store database pages on the hard disk, or in Berkeley DB’s page cache.

Record structure and storage Records in Berkeley DB are (key,value) pairs. Some simple operations on records are supported: inserting records in tables, deleting them from tables, searching records by their key and updating found records. Values of any data type can be stored in a Berkeley DB database, no matter how complex they are. Berkeley DB does not operate on thevaluepart of a record. The system cannot decompose thevalueinto constituent parts that it could further use and analyze. Thus, it can provide no information about the contents or structure of the stored value. The application must know the structure of the keys and values that it uses. The data of Berkeley DB databases are stored on the disk. For the sake of efficiency, Berkeley DB uses an in-memory cache which allows for grouped flushing onto disk.

Access methods Berkeley DB supports 4 types of storage structures.Hash tablesare suitable for very large databases where the time necessary to do a search or an update operation can easily be predicted. It helps fetching records for which the exact key is provided, but not records with similar keys. Btrees are the structures suitable for range searches, when the application needs to find all the records with the keys between two known values. Since in this structure similar keys are stored closely one to the other, it is very convenient to fetch the values related to keys which are nearby. This type of structure is the default one in Berkeley DB. For applications that need to store and fetch records, but cannot easily generate keys by themselves, the best choice is record-number-based storage. In this approach, the record numbers, generated automatically by Berkeley DB, represent the keys for the records.Queues

(14)

Figure 3: Berkeley DB features

are suitable for applications that create a lot of records and then must process them in the creation order. These structures store fixed-length records and they use record numbers as keys, too. They are designed for fast record insertions at the tail of the queue and retrieval at the head. In this access method, locking at record level is used.

Data management services provided Berkeley DB offers several data management ser-vices which work with all available storage structures. To begin with, the system supports

concurrency, allowing more users to work on the same record without interfering with one another. Simultaneous readers and writers are supported due to the locking system, which is used by the access methods to acquire the right to read or write database pages. Transac-tionsare also supported in Berkeley DB. The system uses two-phase locking to assure that concurrent transactions are isolated from one another. The transaction system uses write-ahead logging protocols to guarantee therecoverabilityof the changes performed, in the case of a failure. Thus, when an application starts, it can ask Berkeley DB to run recovery, which will restore the database to a clean, consistent state.

Difference to relational databases Berkeley DB is not a relational database. One impor-tant difference to these latter systems is that Berkeley DB does not support SQL queries. The access to data is done through the API provided. The advantage of a relational database is that it knows everything about the data and can execute queries in a high-level language, without any programming being required. In Berkeley DB, on the other hand, the applica-tion developer must understand how the data is represented and accessed and must write the code that will get and store records. The advantage of systems like Berkeley DB is that the overhead of query parsing, optimization and execution is eliminated. Thus, a low-level written program can be very fast. Another difference to relational databases is that Berke-ley DB has no notion of schema (i.e., structure of records in tables, relationships among the tables of a database, etc.) and data types. An interesting issue is that relational databases can be built on top of Berkeley DB. For example, the MySQL relational database system [13] does the SQL parsing and execution by itself, but relies on Berkeley DB for the storage level.

(15)

Figure 4: Relationships between the roles of BlobTamer

3.2 The grid-based data storage service: BlobTamer

BlobTamer is a system developped inside the Paris Team, at IRISA. It is written in the C++ programming language, having approximately 23,000 lines of code. The name reflects the fact that it manages efficiently blobs (binary large objects) and it makes them more user-friendly.

Managing massive data in large-scale distributed environments In this model, the data considered are strings of size in the order of Terabytes, which cannot fit in the memory of a single node. The storage of such large data naturally requires the use of data fragmenta-tion and of distributed storage, which are offered by grids. It is assumed that the access to data is fine-grain, each individual read or write operation concerning only a segment of the string of the order of Megabytes, microscopic with respect to the whole string. Also, the en-vironment considered is highly concurrent: the writing and reading accesses are concurrent, unpredictable and very frequent.

The strings are fragmented into small, equally-sizedpages, which are distributed in the local memory of a large number of nodes. Upon creation, a page is labeled with the version number at which it has been created. A concatenation of consecutive pages is calledsegment. There is a set of metadata which makes the connection between an access request and the list of pages that store the corresponding data.

The roles in the system The system consists of distributed processes, that communicate through remote procedure calls (RPCs). A physical node can run one or more processes and, in the same time, may play multiple roles from the ones mentioned below.

There are 5 types of processes in the system. There may be one or more concurrentclients

that issue READ and WRITE requests. The system is not aware of their number, which may vary in time. The pages created by the WRITE operations are physically stored in the local memory ofdata providers. On entering the system, eachdata provider registers with the

provider manager. This entity is responsible for providing a list of available data providers

toclientswho issue WRITE requests. For each request, theprovider manager decides which

data providersshould be used based on a strategy that assures load balancing. It periodically receives updates from thedata providersregarding their available space, so the list returned

(16)

Figure 5: Interactions between the actors: reads (left) and writes (right)

will contain providers with larger available space and lower load. Also, as many distinct providers as possible are enlisted, which allows an efficient parallel access to the pages.

The metadata generated upon the creation of new pages by WRITE requests are physi-cally stored by themetadata provider. Its purpose is to helpclientswho issue READ requests to localize the providers that store the pages corresponding to the required segment of the string. To allow concurrent access to metadata, themetadata provider is implemented on top of an off-the-shelf, stable and scalable distributed hash table: BambooDHT [10]. Theversion managerstores the number of the last published version of a data string. It serializes WRITE requests to each string and supplies the latest published string version to READ requests. All operations on theversion managerare atomic, since it is protected by a lock. The relationships between these types of processes are illustrated in Figure 4.

How writes and reads are performed A WRITE request begins with theclientcontacting theprovider managerto obtain a list of providers, one for each page of the segment. Then, the

clientcontacts, in parallel, the providers in the list and requests them to store the pages. After executing the request, each provider sends an acknowledgement to theclient. Only when it has received all the acknowledgements, so when it is sure all the pages are written ondata providers, theclientcontacts theversion manager, requesting a new version number. If an error occurs while writing the pages, theversion manageris not contacted at all. The version num-ber is used by theclientto generate the metadata corresponding to the already written data, which he sends to themetadata provider, in parallel. After receiving the acknowledgement, theclientreports the success to theversion manager.

The typical scenario for a READ request begins when theclientcontacts theversion man-agerto get the last version of the corresponding data string. If the version specified is larger then than the latest available version, the READ will fail. Otherwise, the client contacts themetadata providerand retrieves, in parallel, the metadata describing the pages of the re-quested segment. After gathering all the metadata, it contacts, in parallel, thedata providers

(17)

The function calls provided The service provides three primitives: one for allocating memory and two for manipulating strings. The ALLOC primitive takes two parameters (pagesizeandstringsize) and creates an all-zero string of the provided size. Thepagesize pa-rameter specifies the size of the pages that the string will be fragmented into. The primitive generates an uniqueidfor the string being allocated,idwhich must be specified by clients as input parameter to the other two primitives.

id = ALLOC(pagesize,stringsize)

The WRITE primitive modifies a string given by its idwith the contents of a buffer of lengthsizeat a specifiedoffset, all these parameters being provided by the client. The function call generates a new version number, corresponding to the new version (the modified one) of the data string.

vw = WRITE(id, buffer, offset, size )

A READ primitive takes a segment (specified by anoffsetand asize) from a string (spec-ified by its id) and puts it into abuffer. The versionvof the string from which the segment must be taken is also provided. The READ fails if the specified version of the string is not available, yet.

vr = READ(id, v, buffer, offset , size )

Experimental results Evaluations of the system have been peformed using 100 nodes, taken from 2 sites, of the Grid’5000 [12] testbed. Two experimental settings were used: one in which the client was located in the same cluster as the data and metadata providers, and one in which the client is located in a different, remote, grid cluster. The latency between the client and the data providers is much higher in the second setting (25 ms), compared to the first setting (0.1 ms).

Two experiments were performed. In one of them the purpose was to evaluate how the metadata scheme influences the performance of data accesses. The time required for meta-data to be completely read, respectively written, was measured. In the first setting it was observed that the increase in the number of providers did not impact the time required to perform a READ operation, whereas it improved the time required to perform a WRITE op-eration. In the case of the latter, this advantage is more visible when writing larger segments. In the second setting it has been concluded that the higher latency had a significant impact on the cost of reading the metadata, while this impact is much lower for WRITE operations. Another experiment aimed at evaluating how efficiently the lock-free scheme supports highly-concurrent data accesses. The average bandwidth per client was measured for READ and WRITE requests, when increasing the number of clients. In both settings it was noticed that the bandwidth per client decreases very slowly when the number of concurrent ac-cesses increases significantly. Moreover, the decrease in the read bandwidth is even smaller if client-side caching is used. The two experiments thus showed that the system scales well, without significantly affecting performance, both in terms of storage providers and of con-current clients.

(18)

Discussion One of the most important advantages of the system is that it allows efficient, large-scale concurrent access to the data strings, without locking them. The versioning tech-nique allows that: concurrent writes to the same page can be performed in parallel, because they access different versions of that page. Reading operations can also be performed in parallel, once eachclientreceives the latest version from theversion manager. The system also provides some fault tolerance mechanisms, through the off-the-shelf DHT on top of which themetadata provideris implemented.

3.3 Storing the data of Berkeley DB using BlobTamer

3.3.1 Selecting the level to interface

In order to achieve the interfacing of the two systems described above and, thus, to leave the job of storing Berkeley DB’s data to BlobTamer, the database system’s architecture had to be taken into consideration. The Berkeley DB library has a layered architecture, composed of five major subsystems:

Access Method The Access Method subsystem provides general-purpose support for cre-ating and accessing database files formatted as btrees, hashed files, and fixed- and variable-length records.

Memory (Buffer) Pool The Memory Pool subsystem (orbuffer manager, as known in litera-ture) is the general-purpose shared memory buffer pool used by Berkeley DB. This is the shared memory cache that allows multiple processes and threads within processes to share access to databases.

Transaction The Transaction subsystem allows a group of database modifications to be treated as an atomic unit so that either all of the changes are done, or none of the changes are done. The Transaction subsystem implements the Berkeley DB transaction model.

Locking The Locking subsystem is the general-purpose lock manager used by Berkeley DB. This module is useful outside of the Berkeley DB package for processes that require a portable, fast, configurable lock manager.

Logging The Logging subsystem is the write-ahead logging used to support the Berkeley DB transaction model. It is largely specific to the Berkeley DB package, and unlikely to be useful elsewhere except as a supporting module for the Berkeley DB transaction subsystem.

In addition to the above-mentioned subsystems, there is also a Storage layer, as in any other database management system. In this model, illustrated in Figure 6, the application makes calls to the access methods. When applications require recoverability, their calls to the Access Method subsystem must be wrapped in calls to the Transaction subsystem. The Ac-cess Method and Transaction subsystems in turn make calls into the Memory Pool, Locking and Logging subsystems on behalf of the application.

The underlying subsystems can be used independently by applications. For example, the Memory Pool subsystem can be used apart from the rest of Berkeley DB by applications

(19)

sim-Figure 6: The Berkeley DB architecture

by applications that are doing their own locking outside of Berkeley DB. However, this usage is not common, and most applications will either use only the Access Method subsystem, or the Access Method subsystem wrapped in calls to the Berkeley DB transaction interfaces.

As stated above, the Access Method and Transaction subsystems use the underlying shared memory buffer pool (cache) to hold recently used file pages in main memory. The pages have to be in the main memory, in order for the database management system to op-erate on it. The Memory Pool subsystem receives page requests from the upper layers and provides handles for underlying files. The handles are then used to retrieve pages from these files. When the pages are returned, if the requestor indicates that a page has been modified (i.e., the page isdirty), the page is written to the disk. This memory buffer pool handles all operations related to pages in a transparent way. The upper layers are not aware that not all data is in the memory, at one time. If the cache if full and a new page needs to be inserted, a page is selected and discarded from the pool. The selection is based on aleast-recently-used

algorithm: the page that stayed the longest time in the cache without being accessed will be replaced.

An important aspect at this point is selecting the layer to implement for a successful interfacing with BlobTamer. One natural choice is the Memory Pool, since page management support is provided by BlobTamer directly. An in-depth study not only of the layers, but also of the interactions between layers is necessary in order to be able to provide a correct interfacing. There exists a tight coupling of the Logging layer with the upper layers in order to provide recovery support. Because of that, this layer would have to be implemented as well, if the Memory Pool layer was chosen. On the other hand, the Storage layer acts as the backbone for both the Buffer Pool and the Logging layers. Both these layers use the Storage layer directly to store their data, as it can be seen from Figure 6. Implementing the Storage layer is much more simple, because it implies just a file system functionality on top of BlobTamer. This approach makes debugging easier, because the implementation is at a lower layer. Moreover, it enables the study of access (reads and writes) patterns at page

(20)

level, which might lead to optimizations for read/write operations, in the new version of the Storage layer. The potential introduction of such optimizations justifies the choice of implementing the Storage layer and not using a distributed file system (like NFS).

3.3.2 Methodology applied

Before implementing the Storage layer, a few things needed to be studied. First, it was im-portant to know how and where Berkeley DB stored its data and metadata (if it created any), how many physical files were created for each table, and whether temporary files were cre-ated while writing the data. Testing some applications that use Berkeley DB was required to study these aspects. Second, it was important to see how the system uses some basic system calls related to file operations (read, write, open, close, flush, fsync, etc.), concentrating es-pecially on the parameters used to read and write data (e.g., the offset in the file and the size of the operation). The possibility of changing the code in the wrappers of these system calls was analyzed, too.

The tested application The application needed for the tests was an example written in C provided in the Berkeley DB download toolkit. It concerns some products and vendors that sell the products. The input data were provided in text files (one for the vendors and another for the products), with one record per line. The application consists of two programs. One program creates the database, the tables (files on the hard-disk) and loads the data from the input files into the tables. Another program searches for a specific product, reads the data from the tables and displays information about the product and the vendors that sell it. The first program corresponds to the CREATE DATABASE and INSERT commands in SQL, while the second program corresponds to the SELECT command. In these programs data can be easily manipulated, by means ofputandgetmethods (provided by the Berkeley DB library), as it can be seen from the code fragments below.

// define data structure for the application typedef struct stock_dbs {

DB *inventory_dbp; DB *vendor_dbp; DB *itemname_sdbp; } STOCK_DBS; ... STOCK_DBS my_stock; ...

/* Set page size */

vendor_dbp->set_pagesize(dbp, 65536); ...

/* Put data into the database */

my_stock.vendor_dbp->put(my_stock.inventory_dbp, 0, &key, &data, 0); ...

/* Get a record */

(21)

Programs in a lower-level language need to be written, as mentioned in Section 3.1, since Berkeley DB does not support queries in a high-level language. Unfortunately, the examples provided were not flawless. It was noticed that not all products could be found, after several tests were performed on the program that searched for products. The reason was that the corresponding records were not introduced in the tables by the first program, because of a formatting parameter. This parameter belonged to thesscanf command that read the data from the input files. The value of the paramater was too small and products or vendors with longer names were not introduced in the database tables. A higher value had to be set for this parameter, in order for the programs to work properly.

Preparing the study Since the input files were not very large, the patterns used for writ-ing the pages were not so obvious. Therefore, a larger input file with vendors had to be generated, through a small C program. The new file had 10.000 vendors and a size of 1.3 Megabytes. The pattern for generating the file was very simple, based on adding the num-ber of the record to some base values, for each field. The page size was also modified, inside the application code, from 512 Bytes (initially) to 64 kilobytes, through a simple function call provided by the Berkeley DB interface.

The methods used to study the problems mentioned previously were various. The Linux

grepcommand was used to find the occurrence of the desired system calls or their wrappers in Berkeley DB’s code. Sometimes, this was not very efficient and Google Code Searchwas used instead. Then, the examples were run with theKDbgdebugger program to see exactly which were the functions and parts of code that were really executed. Finally, printf com-mands were used to see the values of the parameters used in the most important system calls and functions (eg, the ones related to reads and writes).

3.3.3 How Berkeley DB reads and writes

The testing of the previously mentioned application led to some interesting conclusions re-lated to how Berkeley DB reads and writes. The system uses not only the read andwrite

system calls to read and write its data and metadata, but also pread andpwrite. The latter are used for handling the data, while the former handle the metadata. Three database files (having a ".db" extension) are created for the application example described in the previuos section. One is used for storing information about the vendors (thevendordatabase), one for storing information about the products (theinventorydatabase) and another one for indexing the product names found in theinventorydatabase (this is theitemnamedatabase). Although these files are referred to as distinct databases in the examples, they would correspond to tables of the same database, in a relational database management system. The structure of the data in the files is similar to the structure of records in tables and the relationships be-tween the data in different files is similar to the relationships bebe-tween different tables of the same database. During the creation operation of these three databases, one temporary file is created for each database. Their purpose is to store the metadata. After metadata is written, they are renamed to the final names of the database files. The patterns in which the 4 system calls mentioned above are used for some simple database operations are described in what follows.

(22)

CREATE DATABASE The pattern for creating a database is quite simple. Twowritesystem calls are used, successively, to write metadata into the temporary file created. This metadata is written into the first two consecutive pages of the file. Then the temporary file is renamed and a preadsystem call is performed on the first of the two previously written pages. The size of all these three input/output operations is the selected page size: 64 kilobytes. These operations are repeated for all the three databases that are created by the application. INSERT Inserting the data into the databases consists of using onlypreadandpwritecalls. The application starts by writing the data for thevendor database. The observed pattern is the following: apreadcall is performed, after which a fewpwritecalls are performed, but to pages different than the one from whichpreadread the data. This series of system calls is then repeated several times, until all the data are inserted. The order of the page numbers involved in thepreadcalls that start each series of the pattern is interesting. The pages are read in ascendent order of their number and, with the exception of the first series, these numbers are consecutive. The read pages are previously written bypwritecalls in previous series. The page that is read in the first series (the second page of the database, i.e., page number one) was written previously, when the database was created. Some interestings remarks can be made about thepwritecalls, also. The number of the calls is approximately the same in all the series, except for the last two series, where it is much bigger. Although there is no particular order in which the pages are written, throughout all the series, the tendency is to write in pages with lower number first. So, it is more likely that pages with lower numbers are written among the first series. The total number ofpwritecalls is bigger than the number of written pages, which means some pages were written more than once. It is also interesting that the last series does not have apreadcall. The data for theinventory

and itemname databases are inserted instead. A pread call to the second page of each of these databases is performed, at first. Then, the second page and the first four pages of theitemnamedatabase, respectively inventory database are written. The size of the all the operations used for inserting the data is the chosen 64-kilobyte page size.

SELECT Searching and fetching data from the databases involves onlypreadandreadcalls. The databases must be open, at first. This assumes areadcall (of size 512 Bytes) followed by apreadcall (of size 64 kilobytes) on the first page of each database. Then, data search begins. For the example used, searching for a given product involves searching for the vendors who sell it, also. Apread call is performed on the second page of theitemname database. After that, two successivepreadcalls are made to the second and, respectively, fourth pages of the

inventorydatabase. The number of pages, as well as the specific pages, that have to be read may vary, depending on the size of theinventorydatabase and on the page size used. The product is found at this point. Twopreadcalls are performed on thevendordatabase, in order to find the vendors of the product. Again, the number of calls may vary here, according to the size of thevendordatabase and the page size used. Morepreadcalls would be needed if the database were bigger or the page size smaller.

Discussion The same tests were run on the modified version of the system, which reimple-ments the file system layer. The tests revealed not only that the results were the same as in the first case, but also that the pattern and the order in which the pages are read and written

(23)

same when the same tests are run indicates that the new file system layer is correctly imple-mented, that it correctly inserts and fetches the data. Moreover, the fact that the pattern of operations and the order of pages are the same proves that the file system layer was correctly isolated. It means that only this layer was changed, because the database engine behaves the same: it writes and reads pages in the same way and in the same order. This way, any other file system could be plugged in instead of the implemented one, and Berkeley DB could use it in the same way. Thus, the implementation provides generality.

When databases are open, a read operation of size 512 Bytes is performed on the first page written, as mention in the paragraph above. These 512 Bytes represent an unique identifier created that is stored at the beginning of the file. The identifier is required because there may exist other files with the same name. This identifier must be taken care of in the new version of the file system layer, too. But the problem is already solved, since BlobTamer provides such an identifier.

(24)

4 Implementation details

In order to achieve the interfacing between Berkeley DB and BlobTamer, several steps re-garding the practical implementation of the Storage layer were taken. First of all, a special data structure that holds details about all the files the database system used in the initial version was created. Then, the wrappers of some system calls used by Berkeley DB on read-ing and writread-ing data were modified in order to facilitate the transfer of data between the database system and the memory of the local machine. The modified code uses the newly created data structure. Finally, the function calls used for allocating memory, reading from and writing to memory were replaced with similar function calls offered by BlobTamer. The aspects above will be presented in what follows.

4.1 Mapping files into memory

The physical files the database system used in the disk storage version will no longer exist in the memory storage version. Thus, there must be some way to preserve the behavior of the database system and to make sure that it does exactly the same operations that it did when having disk storage. For example, the system does several reads and writes to the same file when creating a database file. It should be able to do that with the new Storage layer, too.

The main purpose of thefiles_memdata structure is to establish a one-to-one relationship between files (which are only "virtual" files in the memory storage version) and the memory zone they correspond to. The fname field thus stores the file name and the address field a pointer to memory zone allocated for the file. The mem_wr field indicates the amount of memory actually written from the zone reserved for the file. This is especially important when performing read operations from the memory zone, in order not to allow reads from a part of the zone which was not yet written. Finally, the tagfield indicates whether the memory zone for the file was allocated (set to 1) or not (set to 0). Initially, all memory zones are not allocated (set to 0).

An array of structures of this kind, called fm, is created. Its size is set to 100 entries because it is assumed that the examples deployed will not need more files than that, but it can easily be modified by changing theMAXFMSIZEdefined variable in thefs_layer.hheader file.

4.2 Re-implementing the read and write operations

Berkeley DB version 4.6.21 was used. The system’s code was modified at the level of theos

API, which corresponds to the file system layer. The most significant changes were made into the wrappers of the system calls used to perform read and write operations (read,write,

pread,pwrite), as well as into the wrapper of theopensystem call. Thepreadandpwritecalls had the __os_io function as wrapper. The read system call had the __os_read function as wrapper, while the__os_writefunction was the wrapper ofwrite.

Using local memory as storage All calls that read or wrote data or metadata needed to be redirected to the local RAM memory of the machine. In order to have a better control on the new code, the wrappers of read and write were modified. Their whole code was replaced by a call to__os_io. This way, all the read and write operations were performed

(25)

such that it would write to and read from memory, and no longer use these system calls. The old code in this function was erased, except for the part where the offset is calculated (as the page number multiplied by the page size) and for the switch structure that handles the input/output operations. Before this latter structure anassertcommand must be placed, which checks that the size of the operation, added to the calculated offset, is not bigger than MAXFILESIZE. This is the maximum size established for files, which is a defined variable whose value can be set in thefs_layer.hheader file. The code of the switch structure was then modified. Checking that the size, added to the offset, is larger than the mem_wrfield was neccessary, for both read and write cases. This meant that the program was trying to read or write a memory zone that was not yet written. If that was the case, an error was output for a read and the execution was aborted, because a memory zone cannot be read if it had not been written beforehand. For the write case, on the other hand, themem_wr field was updated for the corresponding file, with the value of the size of the operation, added to the calculated offset. After this check, the effective input/output operations were implemented, throughmemcpycalls. The value situated at the specified offset, added to theaddressfield of the file, is copied into the provider buffer, in the case of a read operation, and vice-versa for a write operation. The file involved in all the operations above was reffered to as the file in thefmarray at the entry having as index the file descriptor provided as input parameter by the__os_iofunction.

Another important change was made in the code of__os_openhandle, the wrapper func-tion of theopensystem call. The new version does not use openorcreatany more to open a file and create a file descriptor for it. Instead, the newly created allocate_file function is used to provide the file descriptor. This function searches in thefm array for the file name provided as input parameter. If it finds an entry with this name, whosetag field is set to 1, then the function returns the index of this entry in the array. The index will be used by

__os_openhandleto initialize the file descriptor. If the file name is not found in the array and the operation for which the file is opened is a write, a new entry is created in the array to add the new file. Thus, the searched file name is inserted in thefnamefield. Then, a memory zone is allocated with themalloccommand, initialized with 0’s, and a pointer to it is inserted in theaddressfield. Finally, thetagis set to 1 and themem_wrfield to 0, because the memory zone was allocated, but nothing was yet written to it. If the file name is not found and the operation is a read,__os_openhandlereturns an error code. That is quite normal, considering that one cannot read from a file that does not exist.

The two programs in the test apllication described in Section 3.3.2 were merged, because otherwise it would have been impossible to make one of them know which memory zones the other one has reserved. Therefore, the new test program did both the insertions and the search.

Using BlobTamer as storage After achieving memory-based storage, the next step was to use BlobTamer’s interface. To do this, a dynamic library had to be created, which would be linked at runtime. Another option would have been to use the calls provided by BlobTamer directly in the code. This was not possible due to the incompatiblity between the codes of Berkeley DB (written in C) and BlobTamer (written in C++). That is the reason for choos-ing to implement a library in order to create an interface between the two systems. This library, called blobtamer.h defines two data structures (blob_t andblob_env_t) and six func-tions. The most important of these functions areblob_init, used to initialize the environment,

(26)

andblob_alloc,blob_read,blob_writewhich are, respectively, wrappers of the ALLOC, READ, WRITE primitives of BlobTamer.

Using BlobTamer instead of the local machine’s memory involved a few changes in the code of the previously described version. To begin with, theaddressfield in thefiles_memdata structure is replaced by a field ofblob_ttype, namelyblob. Then, theallocate_filefunction is modified. At the begining, the environment is initialized, through ablob_initcall. The former

malloccall is replaced by ablob_alloccall that initializes theblobfield of the file. Finally, the

memcpycalls used in the__os_iofunction to read and write are replaced, respectively, with theblob_readandblob_writecalls.

4.3 Re-implementing collateral operations

Some other systems calls (besidesread, write, pread, pwrite andopen) were involved in the reading and writing processes. Thus, their wrappers had to be modified also, in order to suit the implementation of the file system layer with BlobTamer. System calls likeunlink,

fsync,fcntlandftruncatehad no correspondent operation in the new implementation, because BlobTamer does not provide these functionalities yet. Therefore, the code of their wrapper functions ( respectively__os_unlink,__os_fsync,__os_fdlockand__os_truncate) was taken out. All that was left of it was a return instruction with a successful operation code. This way, a successful input/output operation was assured, without having to implement unneccessary functionalities. However, these wrappers could be implemented in the future, if BlobTamer implements similar functionalities.

Other system calls, on the other hand, had correspondent operations in the new version and their wrappers underwent significant changes. The basic approach for most of them consisted in a search on thefmarray of files, based on a file name provided as input param-eter. If the file is found and thetagfield for that entry is set to 1 (i.e., memory was allocated for the file), then the search is stopped and various actions are taken. To begin with, the

__os_exists function (former wrapper of thestatsystem call) simply returns a successful re-turn code, meaning that the file was found. Then, __os_fileid (former wrapper of the stat

system call, also) returns the index of the found entry, as file identifier. Finally,__os_rename

(former wrapper of therenamesystem call) replaces the file name of the found entry with a new one, also provided as input parameter. All these three functions return a corresponding error code if the file is not found.

The approach is different for other wrappers, whose purpose is to provide some informa-tion. The__os_ioinfo function (former wrapper of thefstatsystem call) assigns information like the amount of data written into the file (the value of themem_wr field in the array en-try corresponding to the provided file descriptor) or the established page size to its output parameters. In the original version, these information were provided by a data structure returned byfstat. Most of the code and behaviour of the__os_seek function was preserved. However, thelseeksystem call was replaced by checking that the amount of data written in the file (the value of themem_wrfield) is not bigger than the calculated offset. This offset was calculated with the provided parameters, as the page size, multiplied by the page number, to which the relative offset was added (exactly like in the old version). If the check succeeds, than these parameters are assigned to the corresponding fields of a data structure, and the function exits successfully.

(27)

5 Conclusion

5.1 Contribution

This work is important from several points of view. To begin with, the greatest benefit is, of course, the creation of an interface between the Berkeley DB database engine and the Blob-Tamer grid-based storage service. The storage layer used by Berkeley DB was replaced such that now the data is stored by BlobTamer. In turn, BlobTamer uses a grid to actually store the data in the memory of multiple machines. Thus, the storage capacity of Berkeley DB was considerably extended. Experiments can now be made on a grid testbed like Grid’5000 [12], in order to compare the performance of the new implementation to the performance of the old one, where Berkeley DB stored its data on disk space.

The interface was created at a low layer, which led to two benefits. First of all, this al-lowed the study of the access patterns used by Berkeley DB, which might lead to potential optimizations for the read and write operations, in the future. Second, this implementation provides portability. Any other file system could be plugged in instead of the newly imple-mented storage layer. Berkeley DB would not be aware of that, since the access patterns do not change, with respect to the old storage layer.

Another contribution is represented by the creation of a dynamic library written in C, in order to interface the two systems. It would not have been possible to interface the two systems otherwise, due to the incompatibility between the codes of Berkeley DB (written in C) and BlobTamer (written in C++). This library is important because it represents a C interface for BlobTamer. It can be used by any application written in C, not only by Berkeley DB.

5.2 Other approaches

The presented implementation was not the only possible solution to achieve an interface be-tween Berkeley DB and BlobTamer. The Memory Pool layer of Berkeley DB could have been modified instead of the Storage Layer. This approach would have been perfectly feasible, since page management support is provided by BlobTamer. On the other hand, it would have had several disadvantages. To begin with, it would have implied implementing the Logging Layer of Berkeley DB as well, due to the tight coupling with the higher layers. This layer is used by the same upper layers as the Memory Pool and uses, as well as the Mem-ory Pool, the Storage Layer to store its data. Therefore, one cannot be implemented without the other. Then, the study of access patterns would not have been possible at the Mem-ory Pool layer. Moreover, implementing this layer would have changed these patterns, not necessarily in a better way. Finally, the implementation of a higher layer would have made debugging more difficult and the code of the new implementation more difficult to maintain.

5.3 Future work

Some optimizations to the implemented storage layer are possible in the future. First of all, the wrappers of several system calls have not been implemented, since these system calls and the operations they perform have no correspondent in BlobTamer for the moment. Provided that BlobTamer will implement these functionalities in the future, the Berkeley DB functions could be implemented, as well. For example, if BlobTamer will allow deletion in