
This work proposes a collaborative cache system for operation and control of large volumes of temporary data in grid environments.

While other approaches exist, inspired by distributed filesystems or replication, that can provide the functionality required for this task, a system based on collaborative caches is an alternative way to deal with the dynamic character of temporary data in distributed environments such as a grid. Its main feature is the automatic and autonomous way in which a cache manages data. Additionally, grid environments are based on the sharing of and collaboration between resources, which is the same principle that underlies collaborative cache systems.

Extensibility and Scalability

The extensibility and scalability of the infrastructure are based on the ability to add distributed capabilities by using the individual service-oriented cache operations. This infrastructure allows a collective cache system to be built that distributes the work between many autonomous cache services.

This ability relies on collaboration to exchange information about data utilisation and cache operation. The proposed approach is therefore composed of a reference model, a common information structure, and a set of standard operations to implement the infrastructure.

Functional aggregation

The system is composed of four levels that allow a progressive integration of the elements and aspects of the system. The first level establishes the state and configuration of the available storage resources; a control level supervises and registers the actions performed on data objects in the local installation; the next level implements organisation mechanisms for working together effectively (collaboration layer); and the final level builds a logical view of the data and system operation for efficient global management.

Service oriented infrastructure

We propose a service-oriented infrastructure to support temporary data requests from a distributed community. The infrastructure addresses the related aspects of orchestrating and managing large volumes of temporary data across numerous distributed storage resources. The service-oriented properties that characterise the infrastructure are:

• The cache information is an abstraction that provides a logical view of the system’s components and behaviour, defined in terms of operations providing cache capabilities.

• Cache operations are defined in terms of the messages exchanged between cache capability providers and requester components. The internal structure of a component, including features such as its implementation technology, process structure and database structure, is deliberately abstracted away. This permits the incorporation of existing temporary storage mechanisms (legacy systems).

• The cache service is described by machine-processable metadata; the description exposes the details necessary to use the service (description orientation). The cache service tends to use a small number of operations with relatively complex messages. The messages and descriptions are defined in a platform-neutral and standardised format (XML).
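As a rough illustration of a platform-neutral, machine-processable message, the sketch below builds and parses a hypothetical cache request in XML using Python's standard library. The element and attribute names (`cacheRequest`, `operation`, `dataEntity`, `lifetimeSeconds`) are assumptions made for this example only; they are not the message schema defined in this work.

```python
import xml.etree.ElementTree as ET


def build_request(operation: str, entity_name: str, lifetime_seconds: int) -> str:
    """Serialise a hypothetical cache request as an XML string."""
    root = ET.Element("cacheRequest", {"operation": operation})
    entity = ET.SubElement(root, "dataEntity")
    entity.text = entity_name
    lifetime = ET.SubElement(root, "lifetimeSeconds")
    lifetime.text = str(lifetime_seconds)
    return ET.tostring(root, encoding="unicode")


def parse_request(message: str) -> dict:
    """Recover the request fields from the XML message."""
    root = ET.fromstring(message)
    return {
        "operation": root.get("operation"),
        "entity": root.findtext("dataEntity"),
        "lifetime_seconds": int(root.findtext("lifetimeSeconds")),
    }


# The requester only needs the message format, not the provider's internals.
message = build_request("store", "dataset-part-1", 3600)
request = parse_request(message)
```

Because both sides agree only on the message, the provider behind `parse_request` could be a legacy storage mechanism wrapped by a service.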

Delegate operation of temporary data

The model defines an infrastructure responsible for the operation of temporary data based on specialised cache entities that individually apply sophisticated strategies for determining which data should be kept in a storage resource. At the same time, these entities supervise and register the access to data and storage resources. They control individual storage resources on behalf of users and applications; furthermore, those components implement the mechanisms of interaction in order to work together.
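The text does not fix a particular strategy for deciding which data to keep. As one well-known example, the sketch below uses least-recently-used (LRU) replacement over a storage resource of fixed size. The class name `LRUStore` and its methods are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class LRUStore:
    """Keeps at most `capacity_bytes` of data, evicting least recently used entries."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # entity name -> size in bytes, oldest first

    def access(self, name: str) -> bool:
        """Record a read of `name`; return True on a hit, False on a miss."""
        if name not in self._entries:
            return False
        self._entries.move_to_end(name)  # mark as most recently used
        return True

    def store(self, name: str, size: int) -> list:
        """Store an entity and return the names evicted to make room for it."""
        if size > self.capacity_bytes:
            raise ValueError("entity larger than the whole storage resource")
        if name in self._entries:
            # Replacing an existing entity: release its old space first.
            self.used_bytes -= self._entries.pop(name)
        evicted = []
        while self.used_bytes + size > self.capacity_bytes:
            victim, victim_size = self._entries.popitem(last=False)
            self.used_bytes -= victim_size
            evicted.append(victim)
        self._entries[name] = size
        self.used_bytes += size
        return evicted
```

A cache entity could swap this policy for another (for example size-based or frequency-based) without changing the operations it exposes to clients.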

Wide accessibility

The cache reference model and cache operations seek to provide temporary storage capabilities to a wide range of grid applications, services and users. At each location of the distributed system, applications need to execute access operations to manipulate temporary data. The differences between particular technologies are transparent to clients.

Security

The distribution of data over a grid makes data protection much more difficult than on closed systems. Data on grids may be stored in different locations, but not all storage sites are accredited to receive data. Achieving a high security level can be mandatory for particular applications or virtual communities, but security is always a trade-off between inconvenience for the users and the desired level of protection. In order to permit users to use specific security mechanisms, clients must combine invocations of protection functions with cache access operations. In the grid environment many security functions need to be provided, such as:

• Secure transfer of data from one grid element to another.

• Secure storage of data on a storage resource.

• Access control for resources such as data, storage space or computing power.

• Anonymization of sensitive data entities.

• Tamper-proof logging of operations performed on data entities.

• Traceability.

The features that should protect data [106] while it is being stored on a grid cache are access control and anonymization. Users need to trust the servers on which their data are going to be processed. Our research team proposes an access control mechanism that provides a decentralized permission storage and management system. It also includes an encrypted storage mechanism. All permissions are encoded in certificates, which are stored by their owners and used when required. Permissions can be created on demand, by the owners of the resources or by administrators to whom this responsibility has been delegated, without the need to contact a central permission storage system [105] [107].

Uniform operations and interfaces

The operations and interfaces necessary to handle temporary data are uniform. The components that operate temporary data support a standard set of operations and predefined interfaces. Thus, in each location the requests are interpreted in the same manner; similarly the information for describing the state and behaviour of the data and system components is represented in a common form.

Enabling the control of resources

The infrastructure uses resources gathered from multiple locations. Following the proposed model, each location automatically makes its resources available to partners. The infrastructure permits these capabilities to be shared in the grid environment and permits the operation of the collaborative system for temporary data management.

Coordination for effective operation

The use of distributed resources and components is organised for efficient global utilisation. This requires that components implement mechanisms that organise collaborative actions to get a global effect on the system function. In this way coherent actions regulate the work distribution between components.

The proposed infrastructure provides the ability to measure the performance of data operations and cache efficacy. In this context, it defines and provides the appropriate information to describe performance. This information is detailed and provided in a timely way, so that decisions can be taken to operate the system in the most effective way.
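The passage above does not name the metrics used to describe cache efficacy. As an assumption for illustration, the sketch below computes two measures commonly used for caches: the hit ratio (fraction of requests served from the cache) and the byte hit ratio (fraction of requested bytes served from the cache). The class `CacheStats` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running counters from which efficacy measures are derived."""

    hits: int = 0
    misses: int = 0
    bytes_hit: int = 0
    bytes_requested: int = 0

    def record(self, size: int, hit: bool) -> None:
        """Register one request for an entity of `size` bytes."""
        self.bytes_requested += size
        if hit:
            self.hits += 1
            self.bytes_hit += size
        else:
            self.misses += 1

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def byte_hit_ratio(self) -> float:
        if not self.bytes_requested:
            return 0.0
        return self.bytes_hit / self.bytes_requested


stats = CacheStats()
for size, hit in [(100, True), (100, True), (200, True), (400, False)]:
    stats.record(size, hit)
```

The two ratios can diverge when large entities miss more often than small ones, which is why both are worth reporting for large temporary data sets.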

Chapter 5

Grid Cache Service (GCS)

In this chapter we describe the design and implementation of the base component for collaborative grid caches: the Grid Cache Service (GCS), which implements the model and specifications proposed in the previous chapter.

We first present an overview of the GCS. Section 5.2 presents the design principles of the GCS, followed by the relation between the GCS design and the reference model described in Section 4.2. Section 5.4 describes the implementation of a prototype of the GCS, and Section 5.5 presents the use of the GCS capabilities. We then present some performance tests on the GCS prototype. The chapter finishes with a general discussion.

5.1 Overview

As our main goal is to operate temporary data in grid environments in a dynamic way, we decided to use a group of caches distributed in the grid that collectively provide storage capabilities for temporary data [24]. Figure 5.1 illustrates a scenario in which a group of GCSs is used for this purpose. A client invokes a GCS to store temporary data; if the GCS does not have available storage resources, it can invoke the access operations of remote caches to store the data across multiple grid locations.
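The scenario can be sketched as follows, under simplifying assumptions: each cache is reduced to a counter of free space, and a data set is a list of named parts that are placed on the local cache when possible and otherwise on the first remote cache with room. The names `Cache`, `store` and `store_dataset` are hypothetical and do not correspond to the GCS prototype's API.

```python
class Cache:
    """A toy stand-in for one cache service with a fixed amount of free space."""

    def __init__(self, name: str, free_bytes: int) -> None:
        self.name = name
        self.free_bytes = free_bytes
        self.stored = {}  # part name -> size in bytes

    def store(self, part: str, size: int) -> bool:
        """Access operation: accept the part if there is room for it."""
        if size > self.free_bytes:
            return False
        self.stored[part] = size
        self.free_bytes -= size
        return True


def store_dataset(local: Cache, remotes: list, parts: list) -> dict:
    """Place each (part, size) locally if possible, otherwise on a remote cache.

    Returns a mapping from part name to the name of the cache holding it.
    """
    placement = {}
    for part, size in parts:
        for cache in [local, *remotes]:
            if cache.store(part, size):
                placement[part] = cache.name
                break
        else:
            raise RuntimeError(f"no cache has room for {part}")
    return placement


local = Cache("local", free_bytes=0)
remotes = [Cache("A", 10), Cache("B", 10), Cache("C", 10)]
placement = store_dataset(local, remotes, [("part1", 10), ("part2", 10), ("part3", 10)])
```

With a full local cache and three remote caches of 10 bytes each, the three parts end up on caches A, B and C, mirroring the situation drawn in Figure 5.1.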

In this scenario each cache is a Grid Cache Service (GCS) which individually manages the local storage resources to provide the temporary storage capabilities of the system. In this chapter we describe the individual component of the system: the Grid Cache Service. The collective operation of the group is analysed in Chapter 6. The Grid Cache Service is the active element that supports the infrastructure. It implements the storage, control and collaboration layers of the reference model (Section 4.2). The design of the system implementing the coordination layer is presented in Chapter 6.

[Figure: a user stores a temporary data set in the grid through a Cache Service, which stores parts 1, 2 and 3 of the data on remote caches A, B and C. Each Cache Service manages data and storage within its own administrative domain.]

Figure 5.1: GCS cache extensibility

We have implemented a prototype of the Grid Cache Service (GCS) which is used as a demonstration of the grid caching approach developed in Chapter 4. The prototype provides support for the specified operations and cache information.

The Grid Cache Service Concept

The Grid Cache Service (GCS) is a local administrator of temporary storage and data. It is implemented as a grid service that provides basic cache capabilities following the reference model described in Section 4.2. The GCS therefore implements the interfaces and associated behaviour of the cache operations (access, monitoring and configuration; see Section 4.3) for the operation of temporary data.
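One way to picture the three groups of cache operations is as a single abstract interface that every cache implements. The sketch below is only an illustration: the method names (`put`, `get`, `status`, `configure`) and the in-memory implementation are assumptions, not the operations specified in Section 4.3.

```python
from abc import ABC, abstractmethod


class CacheService(ABC):
    """Hypothetical interface grouping access, monitoring and configuration."""

    # Access operations
    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...

    # Monitoring operation
    @abstractmethod
    def status(self) -> dict: ...

    # Configuration operation
    @abstractmethod
    def configure(self, **settings) -> None: ...


class InMemoryCache(CacheService):
    """Trivial implementation used only to exercise the interface."""

    def __init__(self) -> None:
        self._data = {}
        self._settings = {"capacity_bytes": 1024}

    def put(self, name: str, data: bytes) -> None:
        self._data[name] = data

    def get(self, name: str) -> bytes:
        return self._data[name]

    def status(self) -> dict:
        used = sum(len(value) for value in self._data.values())
        return {"entities": len(self._data), "used_bytes": used, **self._settings}

    def configure(self, **settings) -> None:
        self._settings.update(settings)
```

Clients, including other caches, would depend only on `CacheService`, so different storage technologies can sit behind the same set of operations.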

The GCS Functions

The main function of the GCS is to ensure that a specified data entity exists in a storage resource for a limited period of time and to register detailed information about the data and cache actions. In particular, the GCS:

• manages the underlying storage resources for temporary data based on cache techniques.

• provides temporary storage space on demand to store data entities (files) through defined access operations.

• provides detailed and real-time information about data and cache activity through defined information representation and operations.

• provides basic cache capabilities to monitor actions on the data entities stored in the storage resource.

• supports collaborative cache operations for a wide range of clients (including other caches).
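The main function described above, keeping a data entity for a limited period and registering every action on it, can be sketched as follows. This is a minimal illustration under assumed names (`TemporaryStore`, `lifetime_seconds`, the action labels in the log); the clock is injectable so that expiry can be demonstrated without waiting.

```python
import time


class TemporaryStore:
    """Holds entities until their lifetime ends and logs every action."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._entries = {}  # entity name -> (data, expiry time)
        self.log = []       # (timestamp, action, entity name)

    def _register(self, action: str, name: str) -> None:
        self.log.append((self._clock(), action, name))

    def put(self, name: str, data: bytes, lifetime_seconds: float) -> None:
        self._entries[name] = (data, self._clock() + lifetime_seconds)
        self._register("put", name)

    def get(self, name: str):
        """Return the data, or None if it was never stored or has expired."""
        entry = self._entries.get(name)
        if entry is None:
            self._register("miss", name)
            return None
        data, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # lifetime over: release the space
            self._register("expired", name)
            return None
        self._register("hit", name)
        return data
```

The `log` list plays the role of the registered cache activity: a monitoring operation could expose it, or summaries derived from it, to clients and to other caches.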