Informatyki i Automatyki
Katedra Informatyki Stosowanej
mgr inż. Kamil Kuliberda
Ph.D. Thesis
Transparent Integration of Distributed
Resources within Object-Oriented
Database Grid
Advisor:
prof. dr hab. inż. Kazimierz Subieta
To my wife Lidia, sons Alan and Natan, brother Artur and my parents – for their love, belief and patience...
Index of Contents
ABSTRACT
EXTENDED SUMMARY
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Introduction to the Grid Technology
1.2.1 Technical Computing Grids
1.2.2 Utility Computing Grids
1.2.3 Data Grids
1.3 Research Problem Formulation and Proposed Solution
1.4 Theses and Objectives
1.5 Thesis Outline
CHAPTER 2 STATE OF THE ART AND RELATED WORKS
2.1 Grid History
2.2 Contemporary Grid Related Technologies
2.2.1 SOA and OGSA
2.2.2 CORBA – Distributed objects
2.3 Conclusions
CHAPTER 3 TECHNOLOGICAL BASE OF THE NEW APPROACH TO GRID
3.1 Peer-to-Peer Networks
3.2 JXTA Project
3.2.1 JXTA Architecture
3.2.2 JXTA Protocols
3.2.3 JXTA ID
3.2.4 JXTA Peers
3.2.5 JXTA Groups
3.2.6 JXTA Advertisements
3.2.7 JXTA Modules
3.2.8 JXTA Pipes
3.2.9 JXTA Services
3.2.10 JXTA Security
3.3 EDUTELLA
3.4 Middleware and Federated Databases
3.4.1 Object-oriented database as a middleware
3.4.2 Integration strategies
3.4.3 Distributed objects
3.4.4 Distributed and Federated Databases
3.5 The eGov-Bus Virtual Repository
3.6 The ODRA Database
3.7 Conclusions
CHAPTER 4 AN AUTOMATIC INTEGRATION METHODOLOGY OF DISTRIBUTED DATA, GENERAL CONCEPT AND ASSUMPTIONS
4.1 Object Integration Approach in Data-Grid
4.1.1 Updatable Object-Oriented Views
4.1.2 Three-level Integration Model
4.3 Automated Integration of Distributed Objects
4.4 Approach to Integration using Data Grid Middleware
4.4.1 Virtual Network
4.5 Integration Procedure for the Grid
CHAPTER 5 PROTOTYPE IMPLEMENTATION AND EXAMPLES
5.1 Prototype Architecture
5.2 Virtual Network Architecture Details
5.3 User's Logical Environment in Grid Database
5.4 Running the Virtual Network for the ODRA-GRID
5.5 Running the ODRA-GRID Environment
5.6 Integration of Distributed Resources in ODRA-GRID Prototype
5.7 Real-Life Example of Integration
5.7.1 Creation of local objects representing one health centre's database – example
5.7.2 Creation of views representing health centre's virtual contribution objects – example
5.7.3 Script creating integration views for health centre's databases – example
5.7.4 Script creating global views for health centre's databases – example
5.8 CD Contents for Example of Prototype Implementation
CHAPTER 6 AUTOMATIC INTEGRATION – TESTING RESULTS
6.1 Testing of Employee-Department-Location Schema
6.2 Testing of Modified Employee-Department-Location Schema
6.3 Testing of Modified Employee-Department-Location Schema in Distributed Environment but without Virtual Network and Automated Integration Mechanisms
6.4 Testing of Modified Employee-Department-Location Schema at One Local Host without Virtual Network, Automated Integration Mechanisms
6.5 Testing Summary
CHAPTER 7 SUMMARY AND CONCLUSIONS
7.1 Prototype's Limitations and Further Works
APPENDIX A OBJECT-TO-RELATIONAL WRAPPER AS A DATA RESOURCE FOR ODRA-GRID
A.1 Wrapper Architecture and Assumptions
A.2 Example
INDEX OF FIGURES
INDEX OF LISTINGS
INDEX OF TABLES
Abstract
This Ph.D. dissertation focuses on a novel approach to the transparent integration of distributed and heterogeneous data resources into one common object-oriented database model. The integration process manages three tiers of updatable object-oriented views which, following specific guidelines, accomplish the mappings from local (low-level) data schemas to a global (top-level) data schema available directly to the users. The logical location of the tiers also forces the development of a security mechanism which naturally limits unauthorised access from the grid side to a user's local and private data. The management of views located on particular tiers is supported mainly by a view generator mechanism, which is responsible for generating an intermediate view called the integration view. This view keeps logical information on the dependencies between distributed resources, objects and their locations. Such a process makes a virtual global data store of contributing heterogeneous resources available to the users in such a manner that they do not need to perform explicit integration operations.
The integration process is fully automated and transparent for top-level users: they are not aware of the changes that happen inside the virtual repository and can continue their work without being aware of the integration facilities and processes.
At the back end of the solution a peer-to-peer network is placed, which lifts the physical TCP/IP network infrastructure to a virtual level. Such an approach substantially improves the efficiency of distributed database processing by removing the limitations of physical networks, such as corporate network barriers (firewalls and NATs). Moreover, it moves ordinary database network connection logic into a simpler abstraction layer in which each database is virtually visible as a node with bound data modules. This has been developed from an architectural concept to a prototype implementation using the JXTA framework.
The automated and transparent integration mechanism has been implemented as a part of the virtual repository solution, which relies on the stack-based approach (SBA), the corresponding stack-based query language (SBQL) and updatable object-oriented views.
The thesis has been developed under the eGov-Bus (Advanced eGovernment Information Service Bus) project supported by the European Community under the "Information Society Technologies" priority of the Sixth Framework Programme (contract number: FP6-IST-4-026727-STP). The idea and its different aspects (including the virtual repository it is a part of, and data fragmentation and integration issues) have been presented in over 20 research papers, e.g. [1], [2], [3], [4], [5].
Keywords: grid, data-grid, database, object-oriented, heterogeneous resources integration, automatic object integration, virtual repository, SBA, SBQL, peer-to-peer, P2P, virtual network
Politechnika Łódzka
Wydział Elektrotechniki, Elektroniki, Informatyki i Automatyki
Katedra Informatyki Stosowanej
mgr inż. Kamil Kuliberda
Doctoral dissertation
Transparent Integration of Distributed Resources within Object-Oriented Database Grid
Advisor:
prof. dr hab. inż. Kazimierz Subieta
Extended Summary
The term grid is known as denoting distributed computational networks. However, the rapid evolution of the Internet, the growth of Internet communities and the global increase in information exchange over the Internet have created a need for data processing in a distributed architecture. They have thus driven the development of grid systems towards processing data described by business models, that is, data with a definite structure. Currently popular solutions such as P2P (peer-to-peer) networks, which allow parallel processing of large amounts of data in the form of media or files, do not support the management of structured data. Data of this type is stored in databases.
Processing data that originates from databases currently causes serious problems. They are related to the appropriate physical treatment of the data and to the way the user sees the data, which comprises access to the data and the methods of processing it. The main difficulty in processing is the structure of such data. Business data usually has a complex structure, so it cannot be identified and processed as a plain string of bytes. Its structure may often depend on other structures, so in a distributed system its management is very difficult and in some cases impossible. Processing distributed structured data described by a business model involves building a model capable of parallel processing of distributed data, in which data and services of various types, residing in physically separated locations, can be accessed virtually through their virtual representation.
Distributed systems are required to achieve transparency, i.e. a user working on data should be able to process it regardless of whether the data is local (located on the user's local computer) or retrieved from remote locations. In addition, the individual locations from which data is retrieved are usually heterogeneous systems. This poses an additional challenge for the designers of such systems: implementing a mechanism for integrating such resources.
By data processing we mean not only reading data but also updating it freely. The ability to freely modify distributed data, under the assumption that the user making the modification is not even aware of working on remote data, is the most serious problem left unsolved in other existing distributed systems.
Under the above assumptions, a distributed database system, referred to as a data-intensive grid or data grid, must ensure continuity of operation and easy access to data. To achieve this, both must be ensured already at the level of the grid architecture itself, which therefore has to be very flexible and easily scalable.
In this dissertation the above problems are discussed and a solution is proposed in the context of the data grid architecture. The goal is to build a database grid characterised by:
transparency of the available grid resources during their processing,
automatic integration of resources joining the grid,
a virtual network connecting the cooperating databases.
To achieve this goal, the author proposed the following tenets:
1. Development of a model for integrating distributed, heterogeneous data resources under one common object-oriented data schema. The integration process is based on a three-layer structure of updatable views which, following the relevant guidelines, realise the mapping between the data contributing to the grid (the lowest-level data schema, located on the machine of the user joining the grid) and the global grid schema (the highest level), available to all users working in the grid. The three-layer structure arranges object-oriented updatable views in three layers in such a way that each higher layer depends on the layer below it. The dependency between layers is defined by a contract which states how the virtual objects of a higher layer depend on the objects of a lower layer. The definition of the views must follow the rules for building updatable views described in [6]. We assume that the first layer must provide the mapping and contribution of objects from the local database to a schema corresponding to the global schema; as a result, contribution objects matching the contribution schema are created. The second layer performs the integration mapping in accordance with the integration schema, in which all available contribution objects are mapped onto collections of integration objects. The third layer maps the collections of integration objects onto global objects available directly in the grid.
2. Development of a generic technique for the automatic integration of data resources into the global grid schema. It is a mechanism consisting of a set of methods and rules used in the algorithms that generate views in all three layers of the developed data integration model. The view generation process is carried out on the basis of a specific view definition.
3. Development of a technique for managing and maintaining the automatic integration mechanism, so that individual contributing resources can continuously join and leave the grid. When rebuilding running views, this mechanism uses rules and methods similar to those used for view generation, except that it operates only at the integration level (in the second layer).
4. Construction of a virtual network based on the peer-to-peer architecture, whose task is to create a communication environment for the grid that facilitates access to databases as nodes in the grid and makes communication independent of the TCP/IP layers. In addition, it moves standard network operation to a higher level of abstraction at which typical network operations are not required.
5. Construction of a middleware layer that contains all of the above solutions and encapsulates the proposed database grid solution.
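The three-layer structure from tenet 1 and the join/leave maintenance from tenet 3 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class names (`ContributionView`, `IntegrationView`, `GlobalView`), the employee-like records and the mapping functions are hypothetical and are not part of ODRA, SBQL or the ODRA-GRID prototype, in which the layers are realised as SBQL updatable views.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ContributionView:
    """Layer 1 (hypothetical): maps one local store onto the global schema."""
    site: str
    local_rows: List[dict]              # stands in for a local database
    to_global: Callable[[dict], dict]   # local record -> global-schema object
    to_local: Callable[[dict], dict]    # global-schema object -> local record

    def objects(self) -> List[dict]:
        return [self.to_global(row) for row in self.local_rows]

    def update(self, index: int, changes: dict) -> None:
        # Apply the change in global terms, then write it back in local terms.
        merged = {**self.to_global(self.local_rows[index]), **changes}
        self.local_rows[index] = self.to_local(merged)


@dataclass
class IntegrationView:
    """Layer 2 (hypothetical): knows which contributors are attached."""
    contributors: Dict[str, ContributionView] = field(default_factory=dict)

    def join(self, view: ContributionView) -> None:
        self.contributors[view.site] = view

    def leave(self, site: str) -> None:
        self.contributors.pop(site, None)

    def objects(self) -> List[Tuple[str, int, dict]]:
        # Each integration object remembers where it came from.
        return [(site, i, obj)
                for site, view in self.contributors.items()
                for i, obj in enumerate(view.objects())]


@dataclass
class GlobalView:
    """Layer 3 (hypothetical): what grid users see; locations are hidden."""
    integration: IntegrationView

    def employees(self) -> List[dict]:
        return [obj for _, _, obj in self.integration.objects()]

    def rename(self, old_name: str, new_name: str) -> int:
        updated = 0
        for site, i, obj in self.integration.objects():
            if obj.get("name") == old_name:
                self.integration.contributors[site].update(i, {"name": new_name})
                updated += 1
        return updated


# Two sites with different local schemas contribute to one global schema.
site_a = ContributionView(
    "A", [{"ename": "Smith", "sal": 1000}],
    to_global=lambda r: {"name": r["ename"], "salary": r["sal"]},
    to_local=lambda g: {"ename": g["name"], "sal": g["salary"]},
)
site_b = ContributionView(
    "B", [{"surname": "Jones", "pay": 2000}],
    to_global=lambda r: {"name": r["surname"], "salary": r["pay"]},
    to_local=lambda g: {"surname": g["name"], "pay": g["salary"]},
)
integration = IntegrationView()
integration.join(site_a)
integration.join(site_b)
grid = GlobalView(integration)
```

In this sketch the global layer talks only to the integration layer, so a contributor can join or leave without any change on the user side, and an update issued on a global object is routed back to the local store that owns it.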
The prototype was built using the virtual repository [7] based on ODRA object-oriented database engines [7], the updatable views mechanism [2], [6], which is the native mechanism for implementing views in ODRA databases, and an environment for building peer-to-peer networks, the JXTA framework [8].
1. It is possible to construct a distributed database system working in a parallel processing (grid) architecture using an object-oriented database based on the theory of the stack-based approach and on updatable views.
2. It is possible to integrate local resources into the virtual repository using the mechanism of virtual object-oriented updatable views.
3. It is possible to implement a virtual transport layer for a distributed database system based on the P2P architecture and to ensure transparency of data processing with its use.
The dissertation was carried out within the eGov-Bus (Advanced eGovernment Information Service Bus) project supported by the European Community under the "Information Society Technologies" priority of the Sixth Framework Programme (contract number: FP6-IST-4-026727-STP).
The text of the dissertation is divided into the following chapters, concise summaries of which are given below:
Chapter 1 Introduction
The first chapter introduces the subject of the dissertation. It presents an introduction to grid technology, in which the currently known types of grids are briefly characterised. The author's motivations are presented, the goals of the work are formulated and the related problems are identified. In this context the theses of the dissertation are discussed and the solutions developed by the author are outlined.
Chapter 2 State of the Art and Related Works
The state-of-the-art description presents the basic notions related to distributed systems, data processing and the integration of distributed data. It shows the history of the term grid and examples of early and later grid-related solutions. The basic grid systems from the first to the third generation are discussed. Representative examples of existing distributed architectures used in data processing are given. For the presented solutions, their most frequently occurring problems are discussed.
Chapter 3 Technological Base of the New Approach to Grid
This chapter describes a number of technologies and solutions that directly influenced the solution presented in this dissertation or are part of it. It focuses exclusively on distributed technologies. The peer-to-peer network architecture is characterised together with its implementation in the JXTA project, which was also used in the prototype built as part of this work. The notions of middleware and federated databases are characterised. The chapter also briefly describes the eGov-Bus virtual repository solution, within which the prototype presented in this work was used, and the prototype of the ODRA object-oriented database.
Chapter 4 An Automatic Integration Methodology of Distributed Data, General Concept and Assumptions
This part of the dissertation presents the three-layer model of transparent integration of distributed data sources into one common data schema for the ODRA-GRID solution. The individual layers of the model are discussed, also with respect to the kinds of data fragmentation that occur. The chapter presents the designed and, to a significant extent, implemented architecture of the mechanism for automatic integration of distributed resources into the database grid, based on the ODRA database engine and the updatable views mechanism. It also presents the architecture of the virtual network, the middleware layer in which all the proposed grid mechanisms are embedded. The operation of the mechanism for automatic generation of updatable views, used in the automatic integration of distributed resources, is shown. The developed methods and mechanisms are supported by corresponding real examples of integration in the SBQL language.
Chapter 5 Prototype Implementation and Examples
The chapter gives a detailed description of the developed virtual network, explains how it works and gives guidelines on how to run the ODRA-GRID prototype and how to work in it. The working environment made available to a user of the virtual network is described in detail, together with its logic. Using two examples of object-oriented data schemas, the process of generating the integration model inside the ODRA-GRID prototype is described and the successive steps of its processing are explained in detail.
Chapter 6 Automatic Integration – Testing Results
The chapter presents the results of tests of the implemented ODRA-GRID system and of the mechanism for automatic integration of distributed resources. The results confirm the effectiveness of the developed methodology. The tests empirically confirm the correctness of the applied solutions, and the whole constitutes a proof of the theses of the dissertation stated in Chapter 1.
Chapter 7 Summary and Conclusions
This chapter contains the experience gained and the conclusions drawn while developing the architecture of the ODRA-GRID system and testing the prototype. It lists the developed solutions and the research results, which unambiguously confirm the validity of the theses of the dissertation. A separate section is devoted to further work that can be done to develop the prototype and extend its functionality.
The text of the dissertation is extended with an appendix discussing the use of the object-to-relational wrapper prototype as an additional data source that integrates relational data resources and includes them in the grid. The architecture of the wrapper and the method of its integration with the grid are briefly presented.
Chapter 1
Introduction
1.1 Motivation
One of the current trends in software engineering is to provide technologies and tools for developing applications for data-intensive processing in a distributed environment. The roots of this tendency lie in the business requirements (such as globalization and the wide use of the Internet) that applications must fulfil. Nowadays we can observe the growth of computer technologies connected with the development of distributed applications for processing large amounts of data. Such new solutions must conform to existing systems in such a manner that old and new systems can cooperate with each other. In contemporary business models there are many associated services that could be important for global services. The problem concerns the distribution of data over different locations and the forms of that distribution. Usually such data are heterogeneous, fragmented and redundant, so the information available to the users is less useful for reuse in global models. Therefore a mechanism for the common integration of such data is required. There is also a need for a simple solution aiming at the transparent integration of various data models into a common global data schema. In many such applications the users should be able to process data in both ways: reading and updating.
The thesis deals with the data grid architecture for a data-intensive grid solution which covers the technical aspects of distributed data processing in a virtual repository. Such a virtual repository provides functionalities and services that are common for distributed resources, including a trust infrastructure (security, privacy, licensing, payments, etc.), web services, distributed transactions, workflow management, etc. Subsequently, the thesis proposes how to compose an automatic process that integrates various forms of distributed data into one common data schema through the use of virtual repository software. As a result, this work presents a prototype and technology for the transparent processing of different forms of data. The ideas of the thesis are partially developed and implemented as additional functionality of the virtual repository in the ODRA-GRID prototype.
Currently there are few solutions that address the problems of transporting and integrating distributed data. Most of them deal only with the integration of services and resources for relational database systems. Moreover, such solutions allow data to be processed in one direction only, from a resource to a client; data modification at the global side is forbidden. The proposal presented in this dissertation uses the object-oriented data model of the ODRA database engine and peer-to-peer technology as core mechanisms for service and data integration. The approach allows the users to process the data available at the global side using all options, including data modification.
1.2 Introduction to the Grid Technology
The term grid, in the sense of a specific computer processing architecture, can be dated to the mid-1990s, although the notion was first introduced at the beginning of the 1990s. The term originally concerned meta-computation, i.e. the use of several computers in parallel to perform one computational task divided into many sub-tasks, each sub-task running on a particular machine. In this sense the term grid was simply an analogy to the well-known electric power grid. In this paradigm a user relies on cooperation between machines, which gives him/her the combined power of their computations [9].
The first definition of a contemporary grid was formulated by Ian Foster and Carl Kesselman in the book "The Grid: Blueprint for a New Computing Infrastructure" [9]. The definition concerns mainly a computational grid and states that the computational grid consists of hardware and associated software which together permit reliable, homogeneous, universal and cheap access to huge computational power.
Today the above definition is incomplete because it focuses only on computational grids. Present grids give the users additional functionalities such as resource sharing or data processing. In [10] the grid definition has been refined to consider other aspects of the term. The authors assume that a grid is an open computer system which makes it possible to use potentially distributed resources and has the following properties:
1. it uses standard, open protocols and interfaces that are widely accessible,
2. it assumes that shared resources are located in different physical domains which are separated, e.g. administratively, geographically or by organization,
3. it significantly improves the quality of available services in comparison to the direct usage of particular resources.
Please note that the term resource is understood generally. It may concern computing power, applications, a set of data or a specific device.
The authors of the definition stated that any system conforming to the above definition is a grid system.
Grid computing is a term that arose in the last few years to describe a number of computer architecture approaches based on simple but powerful principles. Since then, the term has been broadened to refer generally to the use of shared (commodity) computer components, both processing and storage, in a distributed networked architecture. In essence, grid computing is an architectural alternative to monolithic, centralised computation and storage architectures.
Summing up, there are at least three common uses of the term grid in the present IT terminology:
Technical Computing Grids employ rack-mount computer systems in scale-out configurations to bring the aggregate processing power of many CPUs to bear on problems of interest;
(Enterprise) Utility Computing Grids provide an agile, on-demand model for application provisioning and migration based on sharing of common infrastructure resources implemented through commodity computer components (CPUs, networking, storage);
Data Grids provide for the distributed capture, management, and sharing of information (and sometimes instrumentation) — typically across multiple authority domains.
1.2.1 Technical Computing Grids
This architectural approach is common in High Performance Computing (HPC) applications [10], which employ large amounts of computing power to solve computationally intense problems. Traditional HPC applications in the scientific computing space include:
Energy research and simulation, and high energy physics research;
Earth, ocean, and atmospheric sciences, global change, and weather prediction;
Complex multi-physics simulations for aerospace design;
Seismic data analysis;
Large scale signal and image processing applications.
These cases represent the types of applications that used to run on large supercomputers, such as those developed in the early 1980s by Cray (The Supercomputer Company) and others. As commodity-based cluster computing (Beowulf clusters) emerged, many of these applications were re-hosted on large (1000+ nodes) compute clusters, often called grids.
Additionally, many new applications have arisen, in part due to the increased availability of these new commodity supercomputers. These applications are increasingly at the core of business critical operations, and include:
Drug discovery (computational chemistry, genomics research);
Circuit design simulations;
Automobile design simulations (aerodynamics and crash analysis);
Risk analysis of financial portfolios;
Digital media applications (animation and rendering).
1.2.2 Utility Computing Grids
Utility computing is all about leveraging modular computing, network, and storage components to improve resource utilization, increase enterprise agility through rapid application provisioning (and re-provisioning), and simplify IT operations. In short, it creates a more nimble, more cost-effective IT organization.
The terms grid, utility computing, and on-demand computing are often used almost interchangeably to describe a wide variety of approaches that are generally aimed at these objectives. These approaches are typically based on two key principles, the foundational pillars of utility computing: consolidation and virtualization [11]. The two principles go hand in hand to support the hosting of multiple applications, either concurrently or in a time-share model, on the same physical resources. For example, server virtualization, as exemplified by VMware, Xen, and Microsoft Virtual Server, supports the hosting of multiple (virtual) servers on a single physical host. The physical resources of the host (CPU, memory, I/O, network connectivity) are shared amongst a number of virtual servers. Each looks like a standalone server (with its own IP address or addresses, its own network and security settings, and its own OS and applications) but shares the underlying physical resources. Server virtualization improves utilization by consolidating multiple applications onto a common physical hardware platform, eliminating the capital expenditures and operating expenses associated with deploying multiple physical servers. This approach is particularly effective in containing the "server sprawl" that has occurred in many IT organizations where every application instance required its own server (and local storage) [10].
Similarly, network virtualization strategies (including virtual LANs) and storage virtualization strategies allow applications to share those infrastructure resources, often employing Quality of Service (QoS) provisions.
Storage virtualization strategies, in particular, aim at delivering logical storage containers (file systems and LUNs or volumes) that transcend the physical nature of storage systems, i.e. disks and controllers. Together with tiered storage strategies (using different classes of storage for different types of data) and transparent data migration, they deliver a storage-as-a-service model in which the services are provided in the storage network and not by individual physical devices or servers. This is a realization of what many have been calling information lifecycle management (ILM).
1.2.3 Data Grids
The notion of a data grid may be the concept closest to the original grid concept developed by Foster et al. and presented in this thesis. It represents a physically distributed set of information resources (services) contributed by multiple authorities under a common set of protocols. In some ways, the World Wide Web represents the first-generation data grid, in which information is "published" by individual sites, indexed by crawlers, and accessed via search engines and explicitly represented links.
The San Diego Supercomputer Center's Storage Resource Broker (SRB) is another data grid model, which presents catalogued data "collections" to a community of interest [12]. SRB "presents the user with a single file hierarchy for data distributed across multiple storage systems. It has features to support the management, collaboration, controlled sharing, publication, replication, transfer, and preservation of distributed data."
In many ways, these efforts represent first-generation data grids, with static or quasi-static data "published" into web-based documents or files and consumed by browsers and file-savvy applications. The next generation of data (or information) grid is one based on Web 2.0 technologies, which afford a richer model for composing individual information sources into a rich set of distributed web services.
1.3 Research Problem Formulation and Proposed Solution
Contemporary, parallels and distributed processing of data coming from sources which are databases causes big problems. These problems are related to physical treatment of such data and the way how a user sees the data, which constitute – an access to the data and its processing methods. The main difficulty in the processing is, in that case, the form of data – the data has a structure which is usually complex. Therefore, it cannot be identified and treated as a simple string of bytes. Often, the data structure may be dependent on other structures, so in a distributed system such data management is very difficult and in some cases impossible.
In distributed systems an achievement of transparency is required, i.e. a user working on a data can process it regardless of whether the data is local (located on the user's local computer), or the data is retrieved from remote locations. In most cases such remote locations are heterogeneous systems. This issues an additional challenge for designers of such systems. The problem is an implementation of the mechanism for an integration of such distributed and heterogeneous resources.
By distributed data processing we mean not only reading the data but also, primarily, updating it without restrictions. The ability to modify distributed data freely, under the assumption that the user who introduces a modification is not even aware that he/she operates on remote data, is the most serious unresolved problem in existing distributed systems.
Under the above assumptions, a distributed database system must guarantee continuity of operation and easy access to data. These properties must be ensured already at the level of the grid architecture, which therefore has to be very flexible and highly scalable.
This dissertation discusses the problems presented above and proposes a solution in the context of a data grid architecture. The goal of the thesis is a prototype implementation of a database grid, which is characterised by:
Transparency of available resources during their processing in the grid;
Automatic integration of resources joining the grid;
A virtual network which connects the cooperating databases.
The author proposes the following tenets to achieve the goals of the dissertation:
Development of a model for integrating distributed heterogeneous data resources into one common object-oriented data schema. The integration process is based on a three-layer structure of object-oriented updatable views which, in accordance with the relevant guidelines, realise the mapping between the data contributed to the grid (the lowest-level data schema, located on the machine of the user plugging into the grid) and the global grid schema (the highest level), which is accessible to all users working in the grid. The views are stacked in three levels, so that each view located on a higher level depends on a view located on a lower level. The dependence between layers is determined by a contract, which states how the virtual objects of a higher layer (the results of a view) depend on the virtual objects coming from the lower layer. The view definitions must be consistent with the guidelines for building updatable views described in [6]. In the three-layer architecture, a view located in the first layer maps the objects of the local database and contributes them to a schema corresponding to the global grid schema; as a result, contribution virtual objects matching the contribution pattern are created. The second layer performs the integration mapping in accordance with an integration schema: the collection of available contribution virtual objects is mapped into a collection of integration virtual objects. The third layer maps the collection of available integration virtual objects into global virtual objects, which are directly available in the grid.
Creation of a generic technique for the automatic integration of data resources into the global schema of the grid. It is a mechanism consisting of a number of methods and rules used in algorithms that generate updatable view definitions. These algorithms are used in all three layers of the developed data integration model. The view definitions are generated on the basis of a specific description of the views.
Creation of management and maintenance techniques for the automatic integration mechanism, so that contributing resources can continuously join and leave the grid. While the views are being rebuilt, this mechanism uses principles and methods similar to those of the view generation mechanism. The difference is that it operates only at the integration level (the second layer).
Building a virtual network architecture based on a peer-to-peer network, whose task is to create a communication environment for the grid. The virtual network will facilitate access to databases as nodes in the grid and make the communication independent of the limitations of the TCP/IP layer. Additionally, it will move ordinary networking to a higher level of abstraction, where typical networking tasks are reduced to a minimum.
Construction of the middleware for the database grid which will contain all the above solutions.
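The three-layer mapping described in the first tenet can be illustrated with a minimal sketch. It is written in plain Java (16 or later, because of records) rather than in SBQL, and it does not use the ODRA API. All class, record and method names, as well as the example schema, are hypothetical. Only the read direction is shown; the updatable views of the thesis also propagate updates back to the local database, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative stand-ins for the three view layers (assumption: not SBQL, not the ODRA API). */
final class ThreeLayerViewSketch {

    /** A local object as stored in one contributing database (hypothetical schema). */
    record LocalEmployee(String fullName, int salaryCents) {}

    /** Contribution layer: a local object mapped to the shape expected by the grid. */
    record ContributionPerson(String site, String name, double salary) {}

    /** Global layer: what every grid user sees; the originating site is hidden. */
    record GlobalPerson(String name, double salary) {}

    /** Layer 1: maps objects of one local database onto the contribution schema. */
    static Function<LocalEmployee, ContributionPerson> contributionView(String site) {
        return e -> new ContributionPerson(site, e.fullName(), e.salaryCents() / 100.0);
    }

    /** Layer 2: merges the contribution objects of all connected sites into one collection. */
    static List<ContributionPerson> integrationView(Map<String, List<LocalEmployee>> sites) {
        List<ContributionPerson> merged = new ArrayList<>();
        sites.forEach((site, employees) ->
                employees.forEach(e -> merged.add(contributionView(site).apply(e))));
        return merged;
    }

    /** Layer 3: exposes the integrated collection under the global schema. */
    static List<GlobalPerson> globalView(Map<String, List<LocalEmployee>> sites) {
        return integrationView(sites).stream()
                .map(p -> new GlobalPerson(p.name(), p.salary()))
                .toList();
    }
}
```

In this sketch, a site joining or leaving the grid corresponds to adding or removing an entry in the `sites` map, so only the second layer sees the change. This mirrors the statement above that the maintenance mechanism operates at the integration level only.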
1.4 Theses and Objectives
The summarised theses are the following:
1. It is possible to develop the construction of an object-oriented database system in a distributed and parallel processing architecture, commonly known as a data-intensive grid. The architecture is supported by an object-oriented database management system and an updatable object views mechanism.
2. Such a database system should be equipped with unique integration facilities based on updatable object views. This mechanism can transform local resources into a common global data schema.
3. To use a database grid more efficiently and to allow transparent data processing, a peer-to-peer virtual transport platform has to be introduced.
The prototype solution accomplishing, verifying and proving the theses has been developed and implemented according to the modular reusable software development methodology, where subsequent components are developed independently and combined according to the predefined (primarily assumed) interfaces.
The dissertation is organised as follows. First, Chapter 2 analyses the state of the art in the field of distributed processing of data from heterogeneous and autonomous resources, and presents the issues that appear when remote data is maintained in various approaches. Then, Chapter 3 investigates a set of solutions related to distributed data integration together with their assumptions, strengths and weaknesses, considering the experience gained with them and the possibility of adapting them to the designed model. The author's interests, together with the opportunity to implement them within the eGov-Bus virtual repository software [7], have led to the design of a general modular data grid architecture (subchapters 4.1 and 4.3) and to the implementation of grid middleware based on general integration assumptions (presented in Chapter 4). The architecture assures genericity and flexibility in terms of integration of distributed data, as well as distributed communication and data transfer. Until now, however, the architecture had not been subjected to any substantial verification or experiments. Therefore, the first working prototype has been implemented and experimentally tested (Chapter 5 and Chapter 6). The tests include the object-to-relational wrapping mechanism [13] presented in the appendix.
The next stage concerns the integration of heterogeneous data based on the virtual repository mechanism (subchapter 3.5) and the ODRA object-oriented database management system (subchapter 3.6), including an automated integration process, which is described in detail in subchapter 5.6. The goal was to achieve a fully automated solution for transparent attaching and detaching of distributed objects in a virtual repository. The approach aims at designing a platform where all clients and data providers are able to access multiple distributed resources without any complications concerning data maintenance, and to build a global schema for the accessible data. The assumption is that it must be a complete mechanism performing transparent integration of remote objects built on top of a database, including a fully operational database engine and a transport platform. Achieving all these features required the creation of a powerful middleware layer.
In short, the development process consists mainly of an analysis of related solutions in terms of the thesis' objectives and of the requirements for integration with the rest of a virtual repository. The resulting automatic integration process is completely transparent to end users and enables complete processing of distributed object data (including reading, updating, creating and removing data) in the grid architecture. The prototype solution accomplishing these goals and functionalities is implemented in the Java™ language. It is based on:
The Stack-Based Approach (SBA), which provides SBQL (Stack-Based Query Language), a query language with the complete computational power of regular programming languages, and updatable object-oriented views;
The JXTA peer-to-peer framework for creation of centralised and decentralised P2P networks that in turn are used for creation of a grid virtual network and a data transport platform.
1.5 Thesis Outline
The thesis is subdivided into the following chapters:
Chapter 1 Introduction
The chapter presents the motivation for the subject of the thesis, the theses and the objectives, the research problem with a description of the solutions employed, and an overview of grid technologies.
Chapter 2 State of the Art and Related Works
The state of the art and the related works aimed at processing distributed data in various forms and services are discussed here. The solutions that have a fundamental influence on this dissertation are briefly described.
Chapter 3 Technological Base of the New Approach to Grid
The fundamentals of distributed data integration methods and their challenges are given on the basis of existing solutions. The chapter also focuses on the main types of data integration in existing distributed database systems and on the issues appearing in these solutions.
Chapter 4 An Automatic Integration Methodology of Distributed Data, General Concept and Assumptions
The chapter presents the concept and the assumptions of the developed and implemented methodology for automatically integrating heterogeneous object-oriented database resources into a virtual repository. A general three-level approach to data integration in a virtual repository is presented. The grid middleware concept is also described.
Chapter 5 Prototype Implementation and Examples
A detailed description of the developed and implemented automatic integration mechanism is given, including the architecture of the grid. Additionally, the assumptions concerning the virtual network platform are explained in detail. The operation of the prototype is illustrated by examples based on the introduced object-oriented schemata. The chapter contains a number of listings which show complex transformations of integration views and present how the view generation mechanism works.
Chapter 6 Automatic Integration – Testing Results
The assumptions and issues of the testing process, as well as the test results, are presented here.
Chapter 7 Summary and Conclusions
The chapter presents the conclusions and the future work that can be conducted to develop the grid and the automatic integration mechanism prototype further.
The thesis text is extended with one appendix describing the integration of relational data within a grid using an object-to-relational wrapper mechanism.
Chapter 2
State of the Art and Related Works
2.1 Grid History
The history of the grid began in the early networking era, before the Internet. Following Foster and Kesselman [9], some sources take the establishment of the ARPANET network, and the network itself, as the starting point. However, in accordance with Ian Foster's second definition of the grid, it can be assumed that the first system close to this definition is DNS (Domain Name System). It was originally developed to support the growth of email communication on the ARPANET, and it now supports the Internet on a global scale. The distributed DNS was established in March 1985 (at the beginning, since 1973, the service was centralised and called Host Names On-line). When Foster's definition is taken into consideration, DNS satisfies all three of its rules:
#1 it uses a standard and open protocol described in RFC 1034 and RFC 1035;
#2 it assumes that shared resources should be located at different physical locations; by distributing the DNS service over 13 root name servers the system complies with this rule;
#3 it improves the quality and reliability of the whole service.
Of course, DNS was introduced long before the term grid appeared, so it is not really regarded as a grid system nowadays.
The evolution of distributed systems has led to a division of grids into generations. A historical outline and the characteristics of grid forerunners, with dates, are given below.
Currently there are three generations of grid systems: the 1st generation, the 2nd generation (also known as the future generation) and the 3rd generation (called the next generation). The 1st generation grid systems assume that the main issue is on-demand access to grid resources. The main components are resource providers, resource requestors and a resource information system. Such a grid works by moving code (and the necessary files) from resource to resource. This results in two major problems: security (how the resource owner can trust the incoming code) and file staging (how to move large files or databases efficiently). In the 2nd generation (future) grid systems the main issue is on-demand access to grid services, and the main components are service providers, service requestors and a service information system. In such a grid the code does not move. Instead, requests (in the form of data) move to service providers, who process them using their own or purchased (and therefore trusted) code. In this way the biggest problem of the 1st generation grid, i.e. the security problem, is solved (similarly to the current web technology approach). The problem of file staging is not completely solved, but in many cases the service and the necessary database are located at the same site of the grid, and therefore moving large files is less of a burden. Large scientific databases should be created as distributed, multi-site grid services [14].
The first project which achieved great success in the area of 1st generation grids is Condor [15], created in 1988. Condor is a queuing system but, in contrast to standard solutions of this kind, it permits and even emphasises the cooperation of systems with administrative autonomy. Condor processes can be installed not only on large cluster systems or dedicated computational servers, but also on ordinary workstations. As an early grid system it has the unique feature of advanced sharing of specific machine resources (for example, only when the machine is not used locally). This feature enables multiple machines to interoperate in a way similar to a grid. Condor does not fulfil rule #1 of Foster's second grid definition.
Another distributed system which gives users the possibility of sharing the computational power of computers is PVM (Parallel Virtual Machine). It is a software package, first published in 1990, that permits a heterogeneous collection of Unix and/or Windows computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost-effectively by using the aggregate power and memory of many computers. PVM is distributed as source code which can be compiled on many operating systems and hardware platforms. It enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world use PVM to solve important scientific, industrial and medical problems, in addition to its use as an educational tool for teaching parallel programming [16].
The processing of distributed objects among parallel resources was first defined by the OMG (Object Management Group) [17] in 1991 as CORBA v1.1. CORBA is an abbreviation of Common Object Request Broker Architecture, and soon after it appeared it became a standard architecture for distributed object systems. CORBA belongs to the 2nd generation of grid systems. It has been implemented in many systems, the most popular being ORBit [18] and omniORB [19] (free), and Visibroker [20] and Orbix [21] (commercial). More details about this approach are presented in subchapter 2.2.2.
Another noteworthy project of the 1st generation grids, a wide-ranging and successful experiment which was actually built, is I-WAY (Information Wide Area Year). In 1995 this project connected about 20 of the most important computational centres in the USA. The resulting computational power was used for virtual reality simulations. I-WAY revealed some technical problems which are still relevant in contemporary grid systems, such as the security of resources and computational data, and the allocation and discovery of available data resources.
The first distributed system which gave rise to the concept of a data grid, i.e. a grid involved in sharing data within the meaning of definition #2, was the Storage Resource Broker, also known as SRB (version 1.0 was released in 1997). This system provides a global and uniform view of data (generally, files) distributed over a number of physical locations. SRB is most commonly used as a distributed logical file system (a synergy of database system concepts and file system concepts) that provides a powerful solution for managing multi-organisational file system namespaces. SRB presents the user with a single file hierarchy for data distributed across multiple storage systems. It has features to support the management, collaboration, controlled sharing, publication, replication, transfer, and preservation of distributed data. In fact it is middleware in the sense that it is built on top of other major software packages (file systems, archives, real-time data sources, relational database management systems, etc.). Its architecture assures sound data processing without problems of security and resource discovery. SRB introduced into grids the concept of metadata (i.e. auxiliary data that describe other data), which often appears in today's grid solutions [12].
Since German scientists founded the UNICORE project (UNiform Interface to COmputing REsources, described in [22]) in 1997, grids have evolved from simply managed applications to multi-layer architectures. UNICORE is still an active project; currently the 6th version of the system is implemented and publicly available. At the beginning the project was oriented towards supporting users who needed distributed resources and computational power to perform large-scale calculations. The essential assumptions of UNICORE are:
An easy-to-use interface for preparing tasks, sending them to servers which are able to process them, and subsequently monitoring them;
The security system allows uniform identification and authorization of users regardless of the specifics of remote server systems;
The format for the notation of complex tasks consists of sequences of operation executions. It is assumed that these operations can be performed on different servers and at the same time can depend on each other;
It separates the user from the complexities of the system; this directly affects a task, which should be accessible on different system installations and runnable in different system environments;
It supports dependent applications which can only process locally available files. UNICORE assumes no support for interactive applications, streaming, etc.
At present UNICORE (in version 6) is designed according to the Service Oriented Architecture (SOA is described in detail in the next subchapter). In UNICORE, SOA is represented by the UNICORE Atomic Services mechanism, which implements simple, indivisible tasks without workflow support. This means that different programming languages can be used to define and process such tasks. In summary, this approach is oriented towards processing computational or streaming tasks, whereas data processing (for example of structured data coming from databases) is not supported. Nevertheless, all this makes UNICORE a flexible system, which constitutes an important part of grid history as a system that is still alive and still evolves according to current trends.
Globus Toolkit 2 (GT2) was created in 1997 on the basis of an earlier grid project called the Globus Project. The formal founder of GT2 (currently GT4) is the Globus Alliance team. From its beginning GT2 turned out to be the standard for grid computations. It is related to OGSA, the Open Grid Services Architecture (see the next subchapter for details). GT2 implements protocols and APIs related to distributed calculations by providing services such as authentication, resource search (in a limited form), access to remote resources, data transfer, running, controlling and queuing of processes, and portability. After a few years it became clear that this system had established a new quality of distributed calculations and of resource sharing between calculation centres. However, it was also criticised for being difficult to install and manage and complex to use [23]. For this reason the system evolved, with many improvements, into version 4, which is based on Web Services. The Globus Toolkit architecture consists of three types of network nodes:
Computational – which execute processes controlled and queued by broker nodes;
Storage – which store the data needed for calculations, or their results;
Brokers – which are responsible for coordinating and queuing tasks and data transfers.
A very important feature introduced in GT2 is the distribution transparency of calculations on data; this feature is required in contemporary grid solutions. In GT2 a user describes a task using a dedicated language, JDL (Job Definition Language), and then sends it to a broker node, which is responsible for finding the best place in the grid to process it, taking into account both the computing power of the computational nodes and the cost of transferring the data needed for the calculations. If the task fails for some reason, the broker tries to start it again. To improve the operation of the system, data can be replicated in multiple storage nodes, and the state of each replica is coordinated by dedicated processes. The results of the calculations and their status can be tracked using the Logging Service. Moreover, data is copied between the nodes using a special protocol called GridFTP. In a way that is transparent to users, the protocol is able to divide the processed data into smaller parts and to reuse existing replicas in order to reduce time and power consumption. The improvements through which GT2 achieved distribution transparency of calculations are the following:
Automatic notification of users about the status and the result of a calculation;
Ensuring a homogeneous platform for access to data and a transport layer for transferring it.
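The broker behaviour described above (choosing a node by computing power and data transfer cost, and restarting a failed task) can be sketched as follows. This is only an illustration under stated assumptions: the additive cost model, the rule of retrying on the next cheapest node, and all names are hypothetical and are not taken from the Globus Toolkit APIs. The sketch requires Java 16 or later.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Illustrative broker sketch; cost model and names are assumptions, not Globus Toolkit APIs. */
final class BrokerSketch {

    /** A computational node with hypothetical estimates of compute time and data transfer time. */
    record Node(String name, double computeSeconds, double transferSeconds) {
        double totalCost() { return computeSeconds + transferSeconds; }
    }

    /** Picks the node with the lowest combined compute and transfer cost. */
    static Optional<Node> chooseNode(List<Node> nodes) {
        return nodes.stream().min(Comparator.comparingDouble(Node::totalCost));
    }

    /**
     * Tries the candidate nodes from cheapest to most expensive and resubmits the task
     * until it succeeds or maxAttempts submissions have been made.
     * Returns the name of the node that completed the task, if any.
     */
    static Optional<String> submitWithRetry(List<Node> nodes, Predicate<Node> runTask, int maxAttempts) {
        List<Node> ranked = nodes.stream()
                .sorted(Comparator.comparingDouble(Node::totalCost))
                .toList();
        if (ranked.isEmpty()) {
            return Optional.empty();
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Assumption: after a failure the broker moves on to the next cheapest node.
            Node candidate = ranked.get(attempt % ranked.size());
            if (runTask.test(candidate)) {
                return Optional.of(candidate.name());
            }
        }
        return Optional.empty();
    }
}
```

The point of the sketch is that a node with slower processors can still be the best choice when the data it needs is already nearby, which is why the broker has to weigh both costs together.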
Unfortunately, some implementation decisions reduce transparency with respect to the data itself and access to it:
Data shared by a consortium under a single agreement is visible to all participants in the group, which can expose others to a loss of information;
The explicit location of data in the network is essential to access it (the name of the node must be given explicitly), which in complex systems is inconvenient and unnecessary;
Data stored at different locations but belonging to the same user is not integrated automatically, so the user must operate on each location separately.
For the above reasons, the system is difficult to use for users who do not know the low-level commands. Globus Toolkit is still evolving, and version 4 of the platform is currently available.
In a reflection on grid history, a commercial grid solution has to be mentioned which was the first to support databases in a grid architecture: the Oracle 10g system. Together with data integration aspects, it introduced certain facilities associated with the fragmentation of data and the processing of queries on distributed resources. The Oracle 10g database management system (the so-called Grid) is one of the forerunners in the field of distributed databases and distributed query processing systems in business. Its manufacturer states that it is the world's first system able to distribute request processing, to realise a transparent integration of distributed resources and to run database applications in a distributed manner [24]. Experience shows that this system does not differ much from its previous version, and that the rhetoric of the grid is somewhat overused.
Oracle 10g offers the possibility of creating and managing a distributed database system, but in reality this can only be a homogeneous database system containing just Oracle databases. Such a system is able to process heterogeneous data coming from different companies, but only in a very limited scope. This is not an open and generic approach to distributed data processing. In general, the architecture of Oracle is a typical client-server architecture. However, in the grid version a node (an entity performing a query and having its own data repositories) can be a client and a server at the same time.
Servers in Oracle 10g can be connected by database links, which allow queries to be performed on other servers in a transparent manner. This means that a query is executed on other distributed servers, not only on the one to which the client is connected. Each database has a unique name that can be used in queries. The assumption is that the user cannot refer to an arbitrary database, but only to one which is administratively connected by a link to the database the client currently uses. Links hide the physical connections among database servers, so users do not need to be aware that a query is processed in a distributed system. However, the capabilities of database links are limited due to their flat architecture and the need to preserve the clarity of calls within a single node. In summary, Oracle 10g provides data location transparency (in a limited form), transparency of distributed query processing, and distributed transactions.
In Oracle 10g the management of a distributed system is based on the following assumptions:
Each server is managed independently by its administrator, and each has its own autonomous repository;
Each server maintains its own independent level of security and user authentication. To get access to the data, a remote user must pass authentication, which can be based on passwords stored locally, remotely or attached to the link;
Remote users linked to a local database are treated like local users and the system does not distinguish them. All their activities are verified by the local system;
Users can invoke procedures stored on remote servers;
A distributed query has a global coordinator, which analyzes the distribution of the system and divides base queries into (optimizable) sub-queries, which are processed on individual servers.
Oracle 10g is equipped with mechanisms for distributed, remote and local transactions. They make it possible to control operations performed on multiple servers, on a remote server and on the local system, respectively. In the distributed case, the system ensures that the operations of a given transaction are either approved or rejected in their entirety on all servers. This is implemented by means of the two-phase commit (2PC) protocol.
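The all-or-nothing guarantee of two-phase commit can be illustrated with a minimal in-memory sketch. It is not Oracle's implementation: the `Participant` interface, the `Server` class and the state names are hypothetical, and timeouts, logging and recovery after a coordinator crash are deliberately omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal in-memory sketch of two-phase commit (assumption: no failures, timeouts or logging). */
final class TwoPhaseCommitSketch {

    /** A participating server; the interface is hypothetical. */
    interface Participant {
        boolean prepare();   // phase 1: vote yes (true) or no (false)
        void commit();       // phase 2: make the changes permanent
        void rollback();     // phase 2: undo the prepared changes
    }

    /** Simple participant that records its final state for inspection. */
    static final class Server implements Participant {
        private final boolean canCommit;
        private String state = "INITIAL";

        Server(boolean canCommit) { this.canCommit = canCommit; }

        @Override public boolean prepare() {
            state = canCommit ? "PREPARED" : "ABORTED";
            return canCommit;
        }
        @Override public void commit() { state = "COMMITTED"; }
        @Override public void rollback() { state = "ABORTED"; }

        String state() { return state; }
    }

    /** Coordinator: commits everywhere only if every participant voted yes; otherwise rolls back. */
    static boolean run(List<? extends Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        for (Participant p : participants) {
            if (p.prepare()) {
                prepared.add(p);
            } else {
                // One "no" vote aborts the whole transaction: undo those already prepared.
                prepared.forEach(Participant::rollback);
                return false;
            }
        }
        prepared.forEach(Participant::commit);
        return true;
    }
}
```

A transaction over two servers that both vote yes ends with both in the committed state, while a single refusal leaves no server committed.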
2.2 Contemporary Grid Related Technologies
Grid systems have to solve a number of specific problems in order to comply with the grid definition. The key issue is communication between grid resources and their users. The term resource has a wide meaning in relation to distributed systems. It is understood as computing nodes (clusters, supercomputers, servers) and data storage devices. Note that a resource is not necessarily a hardware, physical resource. Very often a resource also means an application which is responsible for controlling and sharing specific computer hardware; such a resource is known as a service. A good example is a service which provides access to data from a database, or simply an application which gives access to a file system. Grid communication has to provide a mechanism for taking full advantage of the power of the resources. This is a really difficult task because of the diversity of resources. Even limiting the resources to computational ones still leaves the problem of heterogeneous processor architectures, operating systems and hardware capabilities (e.g. available memory). The typical approach to solving this problem is the creation of an intermediate layer, the so-called middleware. In most cases it implements communication protocols and is responsible for the interactions between these protocols and the higher layers performing grid tasks. This is a very flexible and generic approach, which also gives opportunities to solve other challenges facing grid systems. The layered model of grid building is presented in Fig. 1. It also shows the higher layers based on the middleware layer, i.e. users' applications, service applications and their users. The middleware, as a management carrier, is very generic in comparison with the upper layers, which contain specific software reflecting user needs: users are usually interested in specific applications which provide only particular grid functionalities, but in a user-friendly manner.
To share heterogeneous data in a grid environment between different client platforms (which differ in their local environments, such as operating systems or the type of data used), middleware should be used. Currently three approaches to preparing such middleware are known.
The first concerns the implementation of different middleware versions, each created for a specific client platform. This is a straightforward approach, but it requires a tremendous amount of work and is the most costly. Thus, it is not very popular.
Fig. 1 Widely accepted model of grid building and evolution.
The second approach is to use a portable programming language to implement the middleware. In this case, the middleware can be used in any environment which supports the language used to create it. Currently Java is the most popular platform-independent programming language. Java applications are typically compiled to bytecode that can run on any Java Virtual Machine (JVM) regardless of the computer architecture.
The third approach is to use virtualization on the grid's target platform. This means that other environments (in practice, operating systems) can be run within the existing operating system physically installed on the hardware. This approach also requires middleware, which in most cases is produced according to the second approach, but access between the OS environments is provided over network interfaces. Currently this is the most popular solution, although it is not perfect, because such an implementation of middleware can sometimes be inefficient: an operating system running virtually inside another, physical one may be limited by the host, so its performance decreases and the whole processing works more slowly. However, current and future hardware trends are expected to solve this problem.
Another key issue raised in grid systems is providing widely acceptable security. The issue has to be considered under the requirement to ensure the autonomy of the particular systems which constitute the grid system. The basic problems are:
Privacy of the data (including transmission over the network);
Reliability and integrity, i.e. obtaining unchanged data;
Mutual identification of cooperating partners;
Validity of sent messages (i.e. validation of the sender);
Delegation of authority (e.g. transferring a part of the user's rights to an agent and letting it manage any task on behalf of the user);
Providing detailed authentication for access to resources (user grouping, access roles, checking and management of authentication).
Note that all of the above problems actually have existing, stable and widely accepted solutions. Privacy of transmission on the Internet, as well as the integrity of transmitted data, is ensured by the SSL and TLS protocols which, in conjunction with digital signatures and a public key infrastructure (PKI), can also address the subsequent problems listed above. There are also many authentication systems, starting with the simplest mechanisms built into operating systems (such as file access rights, user groups and access control lists, ACLs). In grid systems these solutions are widely used; however, some issues remain. All of the above security solutions work well together given homogeneous security policies, uniform mechanisms used by the communicating participants, and trust between service providers. In the case of grid systems the problem is much more complicated: if just two nodes in a grid accept different PKI authentication centres, so that a user of both needs to keep several different certificates, this is already enough to cause significant difficulties in establishing mutual, encrypted communication.
The basis of the security architecture for grid systems is the concept of virtual organizations (VO) [10].
VO specifies the virtual group of users, distributed in terms of a work under common goals or characteristics. At the moment this concept is strongly expanded in order of increasing its dynamic creation and increase their functionality which reflects