• No results found

higher potential to scale than a locking scheme, it comes at a cost: write/write concurrent accesses to overlapping regions of the same file are not supported. Although not explicitly forbidden, the effects of attempting to do so are undefined.

GPFS. The General Parallel File System (GPFS) [140], developed by IBM is a closed-source,

high-performance file system that is in use by many supercomputers around the world. Files written to GPFS are split into blocks of less than 1 MB, which are distributed across multiple file servers that have direct access to several disk arrays. To prevent data loss, such blocks are either replicated on the same server using native RAID or on different servers. Metadata describes the file layout in terms of blocks, and the directory structure is distributed as well and efficiently supports a large number of files in the same directory. The clients can access files through an access interface that implements full POSIX semantics, including locking for exclusive file access thanks to a distributed locking scheme. An interesting feature of GPFS is its ability to be partition aware. More specifically, network failures that cut communication between file servers and partition them into groups, are detected through a heartbeat pro- tocol and measures are taken to reorganize the file system such that it comprises the largest group, effectively enabling a graceful degradation.

Ceph. With the evolution of storage technologies, file system designers have looked into

new architectures that can achieve scalability. The emerging object storage devices (OSDs) [95] couple processors and memory with disks to build storage devices that perform low-level file system management (such as block allocation and I/O scheduling) directly at hardware level. Such “intelligent” devices are leveraged by Ceph [167], a cluster file system specifically designed for dynamic environments that exhibit a wide range of workloads. Ceph decen- tralizes both data and metadata management, by using a flexible distribution function that places data objects in a large cluster of OSDs. This function features uniform distribution of data, consistent replication of objects, protection from device failures and efficient data migration. Clients can mount and access a Ceph file system through a POSIX-compliant interface that is provided by a client-side library.

3.3 Data grids

With the introduction of grid computing, presented in Section 2.2, the need arised to manage large data collections that are distributed worldwide over geographically distant locations. To address this need, data grids [165] emerged as the platform that combines several wide- area management techniques with the purpose of enabling efficient access to the data for the participants of the grid.

3.3.1 Architecture

Data grids are organized in a layered architecture, as proposed in [49, 9]. Each layer builds on the lower level layers and interacts with the components of the same level to build a complete data management system. We briefly introduce these layers, from the lowest to the highest:

26 Chapter3– Data storage in large-scale, distributed systems

Data fabric: consists of the resources that are owned by the grid participants and are in-

volved in data generation and storage, both with respect to the hardware (file servers, storage area networks, storage clusters, instruments like telescopes and sensors, etc.) as well as the software that leverages them (distributed file systems, operating systems, relational database management systems, etc.)

Communication: defines and implements the protocols that are involved in data transfers

among the grid resources of the fabric layer. These protocols are build on several well- known communication protocols, such as TCP/IP, authentication mechanisms, such as Kerberos [72], and secure communication channels, such as Secure Sockets Layer (SSL).

Data Grid Services: provides the end services for user applications to transfer, manage and

process data in the grid. More specifically, this layer is responsible to expose global mechanisms for data discovery, replication management, end-to-end data transfers, user access right management in virtual organizations, etc. Its purpose is to hide the complexity of managing storage resources behind a simple, yet powerful API.

Applications. At this layer are user applications that leverage the computational power of

the grid to process the data stored in the data grid. Several standardized tools, such as visualization applications, aim at presenting the end user with familiar building blocks that speed up application development.

3.3.2 Services

The need to manage storage resources that are dispersed over large distances led to several important design choices. Two important classes of services stand out.

3.3.2.1 Data transport services.

A class of services, called data transport services was designed that departs from data access transparency, enabling applications to explicitly manage data location and transfers, in the hope that application-specific optimizations can be exploited at higher level. The focus of such services is to provide high performance end-to-end transfers using low overhead pro- tocols, but this approach places the burden of ensuring data consistency and scalability on the application.

Data transport is concerned not only with defining a communication protocol that en- ables two end-to-end hosts to communicate among each other with the purpose of trans- ferring data, but also with other higher level aspects such as the mechanisms to route data in the network or to perform caching in order to satisfy particular constraints or speed up future data access. Several representative services are worth mentioning in this context.

Internet Backplane Protocol (IBP). IBP [15] enables applications to optimize data trans-

fers by controlling data transfers explicitly. Each of the nodes that is part of the IBP instance has a fixed-size cache into which data can be stored for a fixed amount of time. When data is routed during an end-to-end data transfer, data is cached at intermediate locations in a manner similar to “store-and-forward”. The application has direct control over the caches

3.3 – Data grids 27

of IBP nodes and can specify what data to cache where, which increases the chance of future requests for the same data to find it in a location that is close to where the data is required. IBP treats data as fixed-size byte arrays, in a similar fashion as the Internet Protocol, which splits data into fixed-size packets. The same way as IP, it provides a global naming scheme that enables any IBP node to be uniquely identified. Using this global naming scheme, ap- plications can move data around without caring about the underlying storage of individual nodes, which is transparently managed by IBP.

GridFTP. GridFTP [2, 20] extends the default FTP protocol with features that target efficient

and fast data transfer in grid environments, where typically large files need to be transferred between end points. Like FTP, GridFTP separates effective data transfers from control mes- sages by using a different communication channel for each of them. This enables third-party file transfers that are initiated and controlled by an entity that is neither the source, nor the destination of the transfer. In order to support large files better, GridFTP provides the the ability to stripe data into chunks that are distributed among the storage resources of the grid. Such chunks can be transferred in parallel to improve bandwidth utilization and speed up transfers. GridFTP can also use multiple TCP sockets over the same channel between a source and a destination in order to improve bandwidth utilization further in wide-area settings.

3.3.2.2 Transparent data sharing services.

Since grid participants share resources that are dispersed over large geographical areas, data storage needs to adapt accordingly. Unlike data transport services where data access is man- aged explicitly at application level, several attempts try to provide transparent access to data, in a manner similar to parallel file systems, but at global grid scale. This approach has the advantage of freeing the application from managing data locations explicitly, but faces sev- eral challenges because resources are heterogeneous and distances between them can vary greatly.

Replication becomes crucial in this context, as it improves locality of data and preserves bandwidth, greatly increasing scalability and access performance. However, on the down side, consistency among replicas that are stored in geographically distant locations becomes a difficult issue that is often solved by choosing a weak consistency model.

Grid Data Farm (Gfarm). Gfarm [155] is a framework that integrates storage resources and

I/O bandwidth with computational resources to enables scalable processing of large data sizes. At the core of Gfarm is the Gfarm file system, which federates local file systems of grid participants to build a unified file addressing space that is POSIX compatible and improves aggregated I/O throughput in large scale settings. Files in Gfarm are split into fragments that can be arbitrarily large and can be stored in any storage node of the grid. Applications may fine-tune the number of replicas and replica locations for each file individually, which has the potential to avoid bottlenecks to frequently accessed files and to improve access locality. Furthermore, the location of fragment replicas is exposed through a special API at application level, which enables to schedule computations close to the data.

28 Chapter3– Data storage in large-scale, distributed systems

JuxMem. Inspired by both both DSM systems and P2P systems, JuxMem [7] is a hybrid

data sharing service that aims to provide location transparency as well as data persistence in highly dynamic and heterogeneous grid environments. Data is considered to be mutable (i.e., it is not only read, but also concurrently updated) and is replicated in order to improve data access locality and fault tolerance. In this context, ensuring replica consistency is a difficult issue. JuxMem proposes an approach based on group communication abstractions to ensure entry-consistency, while guaranteeing high data availability and resilience to both node crashes and communication failures.

XtreemFS. XtreemFS [64] is an open-source distributed file system that is optimized for

wide-area deployments and enables clients to mount and access files through the Internet from anywhere, even by using public insecure networking infrastructure. To this end, it relies on highly secure communication channels built on top of SSL and X.509. XtreemFS exposes a configurable replication management system that enables easy replication of files across data centers to reduce network consumption, latency and increase data availability. Several features aim at dealing with high-latency links that are present in wide-area net- works: metadata caching, read-only replication based on fail-over replica maps, automatic on-close replication, POSIX advisory locks.