
Using Databases to Manage State Information for Globally Distributed Data




(1)

Using Databases to Manage State Information for Globally Distributed Data

Storage Resource Broker

Reagan W. Moore

San Diego Supercomputer Center

[email protected]

http://www.

(2)

Abstract

The management of globally distributed data is simplified through the use of data grids, which enable data sharing environments. Data grids provide:

• Interoperability mechanisms needed to interact with legacy storage systems and legacy applications.

• Logical name spaces needed to identify files, resources, and users.

• Consistent management of state information about each file within the distributed environment.

• Access controls, descriptive metadata, and administrative metadata.

These capabilities enable data virtualization, the ability to manage data independently of the chosen storage repositories.

Examples of management of globally distributed data include data grid federation, distributed digital libraries, and distributed persistent archives.

(3)

Research Questions

How do we build global data management systems that rely on database technology to support state information?

Is the current state of the art sufficient, or do we need extensions to current database technology?

(4)

Global Data Management Precursor

1994 - Alternative Architecture Study for NASA Earth Observing Satellite EOSDIS

Mike Stonebraker, Jim Gray, Jeff Dozier, William Farrell, Reagan Moore

Proposed aggressive use of database technology to manage

• 3 TBs data ingestion per day from multiple sites

• Replication of data

• 15 PB archive

• Discovery and manipulation of the collection

(5)

Data Grid Evolution

1995 - DARPA Massive Data Analysis Systems

1997 - DARPA/USPTO Distributed Object Computation Testbed

1998 - NSF National Partnership for Advanced Computational Infrastructure

1998 - DOE Accelerated Strategic Computing Initiative data grid

1999 - NARA persistent archive

2000 - NASA Information Power Grid

2001 - NLM Digital Embryo digital library

2001 - DOE Particle Physics data grid

2001 - NSF Grid Physics Network data grid

2001 - NSF National Virtual Observatory data grid

2002 - NSF National Science Digital Library persistent archive

2003 - NSF Southern California Earthquake Center digital library

2003 - NIH Biomedical Informatics Research Network data grid

2003 - NSF Real-time Observatories, Applications, and Data management Network

2004 - NSF ITR, Constraint based data systems

2005 - LC Digital Preservation Lifecycle Management

(6)

Terminology

1998

Data Grid, a data management system that organizes distributed data into collections

Data Grid - data virtualization

2000

Persistent Archive, a data management system that handles technology evolution

Persistent Archive - infrastructure independence

(7)

Data Grids: First Viewpoint

Create a shared collection which manages state information independently from the storage systems

Build a metadata catalog to store state information

Separate data access mechanisms from storage access mechanisms

(8)

Trust Virtualization

Shared collection owns the data

At each remote storage system, an account ID is created under which the data grid stores files

User authenticates to the data grid

Data grid checks access controls

Data grid server authenticates to a remote data grid server

Remote data grid server authenticates
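
The trust chain on this slide can be illustrated with a minimal Python sketch. It is not SRB code: the class names, the shared-secret check, and the example account and paths are all assumptions made for illustration, and the server-to-server authentication steps are left out. The point it shows is that the storage system only ever sees the data grid's own account, while users and access controls live in the grid.

```python
import hmac
from dataclasses import dataclass, field


@dataclass
class StorageSystem:
    """Remote storage that knows a single account: the data grid's own ID."""
    grid_account: str
    files: dict = field(default_factory=dict)  # physical path -> bytes

    def read(self, account: str, path: str) -> bytes:
        # End users are never known here, only the grid account.
        if account != self.grid_account:
            raise PermissionError("unknown storage account")
        return self.files[path]


@dataclass
class DataGrid:
    """Hypothetical grid front end that owns the shared collection."""
    storage: StorageSystem
    secrets: dict = field(default_factory=dict)   # grid user -> shared secret
    acl: dict = field(default_factory=dict)       # logical name -> set of users
    location: dict = field(default_factory=dict)  # logical name -> physical path

    def authenticate(self, user: str, secret: str) -> bool:
        expected = self.secrets.get(user)
        if expected is None:
            return False
        return hmac.compare_digest(expected.encode(), secret.encode())

    def get(self, user: str, secret: str, logical_name: str) -> bytes:
        if not self.authenticate(user, secret):             # user -> data grid
            raise PermissionError("authentication failed")
        if user not in self.acl.get(logical_name, set()):   # grid checks ACLs
            raise PermissionError("access denied")
        # The grid, not the user, reads storage under its own account ID.
        return self.storage.read(self.storage.grid_account,
                                 self.location[logical_name])


storage = StorageSystem(grid_account="gridacct", files={"/vault/0001": b"payload"})
grid = DataGrid(
    storage,
    secrets={"alice": "s3cret", "bob": "hunter2"},
    acl={"/home/alice/data.txt": {"alice"}},
    location={"/home/alice/data.txt": "/vault/0001"},
)
data = grid.get("alice", "s3cret", "/home/alice/data.txt")
```

Because ownership sits with the collection, adding or removing a user changes only the grid's catalog; no account has to be created on any storage system.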

(9)

Data Grids

Largest single data grids

• ROADnet real-time sensor network, links 90 object ring buffers supporting 24,000 sensors

• BaBar high energy physics, distributes SLAC collections to Lyon, France and Rome, Italy

• Biomedical Informatics Research Network, links data resources across 25 institutions within the US

Largest data grid federations

• KEK high-energy physics federation of 7 data grids between Japan, South Korea, China, Taiwan, Australia, Poland, US

• WUN federation of 5 academic institutions (SDSC, NCSA, U Bergen, U Southampton, U Manchester)

• NARA Research Prototype Persistent Archive (SDSC, U Maryland, NARA, GA Tech)

(10)

Data Grids: Second Viewpoint

Support data management applications

Automate all aspects of data discovery, access, management, analysis, preservation

Security paramount

Distributed data

Provide distributed data support for

Data sharing - data grids

Data publication - digital libraries

Data preservation - persistent archives

(11)

Generic Data Management

Data grids provide capabilities needed by digital libraries and persistent archives

Infrastructure independence to manage a collection distributed across multiple storage systems

Descriptive metadata to describe authenticity context of each file

Administrative metadata to maintain integrity

• Location, replicas, checksums, audit trails, ownership, access controls, versions, locks, pinning, aggregation

Data grids are implemented as middleware

(12)

Federated Server Architecture

Figure (diagram not recoverable from the extraction). Labels:

• Components: Read Application, SRB agent, SRB server, MCAT, R1, R2

• Input: Logical Name Or Attribute Condition

• MCAT functions: 1. Logical-to-Physical mapping, 2. Identification of Replicas, 3. Access & Audit Control

• Interactions (steps 1 to 6): Peer-to-peer Brokering, Server(s) Spawning, Data Access, Parallel Data Access (5/6)

(13)

Storage Resource Broker 3.3.1

Figure (layered architecture diagram). Labels, grouped by layer:

• Application

• Access interfaces: C Library, Java; Unix Shell; Linux I/O; C++ DLL / Python, Perl, Windows; NT Browser, Kepler Actors; OAI, WSDL, (WSRF), GridFTP, HTTP, DSpace, Fedora, OpenDAP

• Federation Management

• Consistency & Metadata Management / Authorization, Authentication, Audit

• Logical Name Space, Latency Management, Data Transport, Metadata Transport

• Storage Repository Abstraction: Archives (Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS); ORB; File Systems (Unix, NT, Mac OSX); Databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix)

• Database Abstraction: Databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix)

(14)

Storage Resource Broker Collections at SDSC (8/2/2005)

Collection | GBs of data stored | Number of files | Users with ACLs
Data Grid
NSF/ITR - National Virtual Observatory | 53,862 | 9,536,751 | 100
NSF - National Partnership for Advanced Computational Infrastructure | 36,149 | 7,539,180 | 380
Static collections - Hayden planetarium | 8,013 | 161,352 | 227
Pzone - public collections | 12,998 | 6,707,952 | 68
NSF/NPACI - Biology and Environmental collections | 40,155 | 76,083 | 67
NSF/NPACI - Joint Center for Structural Genomics | 15,731 | 1,577,260 | 55
NSF - TeraGrid, ENZO Cosmology simulations | 176,730 | 2,125,945 | 3,267
NIH - Biomedical Informatics Research Network | 10,561 | 7,596,888 | 303
Digital Library
NSF/NPACI - Long Term Ecological Reserve | 256 | 9,033 | 36
NSF/NPACI - Grid Portal | 2,620 | 53,048 | 460
NIH - Alliance for Cell Signaling microarray data | 741 | 84,594 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,733 | 1,083,998 | 27
NSF/ITR - Southern California Earthquake Center | 131,010 | 2,702,421 | 73
Persistent Archive
NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota) | 100 | 382,186 | 28
UCSD Libraries archive | 4,147 | 408,050 | 29
NARA - Research Prototype Persistent Archive | 1,478 | 893,434 | 58
NSF - National Science Digital Library persistent archive | 3,600 | 27,034,150 | 136

(15)

Infrastructure Independence

Data virtualization

Management of name spaces independently of the storage repositories

• Global name spaces

• Persistent identifiers

• Collection-based ownership of files

Support for access operations independently of the storage repositories

• Separation of access methods from storage protocols

(16)

Separation of Access Method from Storage Protocols

Figure: Access Method, Access Operations, Data Grid, Storage Operations, Storage Protocol, Storage System

Data Grid: map from the operations used by the access method to a standard set of operations used to interact with the storage system
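
The mapping described on this slide can be sketched in Python as an adapter between two small interfaces. This is a minimal illustration under stated assumptions, not the SRB driver API: `StorageDriver`, `InMemoryDriver`, and `GridFile` are hypothetical names, and the "standard set of operations" is reduced to `exists`, `get`, and `put`.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical standard operations the grid issues to any storage system."""

    @abstractmethod
    def exists(self, path: str) -> bool: ...

    @abstractmethod
    def get(self, path: str) -> bytes: ...

    @abstractmethod
    def put(self, path: str, data: bytes) -> None: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a real repository driver (file system, archive, database)."""

    def __init__(self) -> None:
        self._blobs = {}

    def exists(self, path: str) -> bool:
        return path in self._blobs

    def get(self, path: str) -> bytes:
        return self._blobs[path]

    def put(self, path: str, data: bytes) -> None:
        self._blobs[path] = data


class GridFile:
    """File-style access method (seek/read/write) mapped onto driver operations."""

    def __init__(self, driver: StorageDriver, path: str) -> None:
        self._driver = driver
        self._path = path
        self._pos = 0

    def seek(self, offset: int) -> None:
        self._pos = offset

    def read(self, size: int) -> bytes:
        data = self._driver.get(self._path)
        chunk = data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk

    def write(self, data: bytes) -> None:
        current = self._driver.get(self._path) if self._driver.exists(self._path) else b""
        # Pad with zero bytes when the write starts past the current end.
        current = current.ljust(self._pos, b"\0")
        updated = current[:self._pos] + data + current[self._pos + len(data):]
        self._driver.put(self._path, updated)
        self._pos += len(data)


driver = InMemoryDriver()
handle = GridFile(driver, "/vault/a")
handle.write(b"hello world")
handle.seek(6)
tail = handle.read(5)
handle.seek(0)
handle.write(b"J")
```

Supporting a new storage system then means writing one more `StorageDriver` subclass, and supporting a new access method means writing one more adapter like `GridFile`; neither side needs to know about the other.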

(17)

Data Grid Operations

File access

• Open, close, read, write, seek, stat, synch, …

• Audit, versions, pinning, checksums, synchronize, …

• Parallel I/O and firewall interactions

• Versions, backups, replicas

Latency management

• Bulk operations

• Register, load, unload, delete, …

• Remote procedures

• HDFv5, data filtering, file parsing, replicate, aggregate

Metadata management

• SQL generation, schema extension, XML import and export, browsing, queries,

GGF, “Operations for Access, Management, and Transport at Remote Sites”

(18)

Latency Management - Bulk Operations

Bulk register

• Create a logical name for a file

• Load context (metadata)

Bulk load

• Create a copy of the file on a data grid storage repository

Bulk unload

• Provide containers to hold small files and pointers to each file location

Bulk delete

• Trash can
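
As a rough illustration of why bulk registration helps with latency, the Python sketch below registers many files in the catalog with a single batched statement and a single commit instead of one round trip per file. The `files` table, its columns, and the use of SQLite are assumptions for the example; they do not reflect the MCAT schema.

```python
import sqlite3


def bulk_register(conn, entries):
    """Register many (logical_name, physical_path, size) rows in one transaction."""
    with conn:  # commits once on success, rolls the whole batch back on error
        conn.executemany(
            "INSERT INTO files (logical_name, physical_path, size) VALUES (?, ?, ?)",
            entries,
        )
    return conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    " logical_name TEXT PRIMARY KEY,"
    " physical_path TEXT NOT NULL,"
    " size INTEGER NOT NULL)"
)

# Hypothetical batch: 1000 small files registered with one call.
entries = [
    (f"/collection/run1/file{i:04d}", f"/vault/{i:04d}", 1024)
    for i in range(1000)
]
total = bulk_register(conn, entries)
```

Running the batch inside one transaction also keeps the catalog consistent: if any row in the batch violates a constraint, none of the batch is registered.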
(19)

Examples of Extensibility

The 3 fundamental APIs are C library, shell commands, Java

• Other access mechanisms are ported on top of these interfaces

API evolution

• Initial access through C library, Unix shell command

• Added iNQ Windows browser (C++ library)

• Added mySRB Web browser (C library and shell commands)

• Added Java (Jargon)

• Added Perl/Python load libraries (shell command)

• Added WSDL (Java)

• Added OAI-PMH, OpenDAP, DSpace digital library (Java)

• Added Kepler actors for dataflow access (Java)

(20)

Examples of Extensibility

Storage Repository Driver evolution

• Initially supported Unix file system

• Added archival access - UniTree, HPSS

• Added FTP/HTTP

• Added database blob access

• Added database table interface

• Added Windows file system

• Added project archives - Dcache, Castor, ADS

• Added Object Ring Buffer, Datascope

• Added GridFTP version 3.3

Database management evolution

• Postgres

• DB2

• Oracle

• Informix

• Sybase

(21)

Logical Name Spaces

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Access Methods (C library, Unix, Web Browser)

Data access directly between application and storage repository using names required by the local repository

(22)

Logical Name Spaces

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection

Data Access Methods (C library, Unix, Web Browser)
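
The contrast between the two Logical Name Spaces slides can be sketched as a small catalog that maps logical names to physical locations. This is a toy Python model, not the MCAT: the `Catalog` class, the resource names, and the example hosts and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy metadata catalog holding the logical name spaces."""
    resources: dict = field(default_factory=dict)  # logical resource -> (host, root)
    replicas: dict = field(default_factory=dict)   # logical file -> [(resource, rel path)]
    context: dict = field(default_factory=dict)    # logical file -> metadata

    def add_resource(self, name, host, root):
        self.resources[name] = (host, root)

    def register(self, logical_name, resource, rel_path, **metadata):
        if resource not in self.resources:
            raise KeyError(f"unknown logical resource: {resource}")
        self.replicas.setdefault(logical_name, []).append((resource, rel_path))
        self.context.setdefault(logical_name, {}).update(metadata)

    def resolve(self, logical_name):
        """Return every physical location currently backing a logical file name."""
        locations = []
        for resource, rel_path in self.replicas[logical_name]:
            host, root = self.resources[resource]
            locations.append(f"{host}:{root}/{rel_path}")
        return locations


catalog = Catalog()
catalog.add_resource("disk-cache", "disk.example.org", "/vault")
catalog.add_resource("tape-archive", "tape.example.org", "/archive")
catalog.register("/home/survey/image001.fits", "disk-cache", "a1/0001",
                 created="2005-08-02")
catalog.register("/home/survey/image001.fits", "tape-archive", "b7/0001")
before = catalog.resolve("/home/survey/image001.fits")

# Migrating the disk repository changes one catalog entry, not the logical name.
catalog.add_resource("disk-cache", "newdisk.example.org", "/data")
after = catalog.resolve("/home/survey/image001.fits")
```

Applications keep using `/home/survey/image001.fits` before and after the migration, which is the infrastructure independence that the logical name spaces are meant to provide.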

(23)

Federation Between Data Grids

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection B

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection A

Access controls and consistency constraints on cross registration of digital entities

(24)

Types of Federation

Peer-to-peer grids

• Data grids forward requests for access to public data

Hierarchical grids

• Master - slave

• All files in slave data grid are replicated from the master data grid

• Central archive

• Multiple independent data grids deposit replicas into a central archive

Replication grids

• Two independent data grids serve as back-up sites for each other

(25)

Data Management Systems

Digital Libraries

• DSpace services for ingestion and description of files, ported on top of SRB data grid

• Fedora relationship management services, port on top of SRB data grid in test

• Cheshire integration on top of SRB data grid

• OAI-PMH interface to the SRB data grid

Persistent archives

• Manage authenticity, integrity, and infrastructure independence

• Integrate preservation processes on top of SRB data grid

(26)

Chronopolis Preservation Facility

Federation of Three Independent Data Grids

Figure: three sites (NCAR, U Md, SDSC), each with its own MCAT

• Deep Archive at NARA, no user access but complete copy

• Replicated copy at U Md for improved access, load balancing and disaster recovery

• Active archive at SDSC, user access

Demonstrate preservation environment

• Authenticity

• Integrity

• Management of technology evolution

• Mitigation of risk of data loss

• Replication of data

• Federation of catalogs

• Management of preservation metadata

• Scalability

• 3 collections / year

• Support 100 TBs per site

(27)

Distributed Metadata Management

Database specific replication

Oracle

Master-slave catalogs across vendors

Synchronize an independent metadata catalog with state from primary catalog

SRB version 3.4

Master-slave catalog federation between data grids

How to enforce data grid administration

(28)

Data Grids: Third Viewpoint

Require ability to apply dynamic consistency constraints to state information

When federating data grids

When modifying views of collections

When managing data placement

When asserting global properties (creating consistent state across the collection)

• Synchronization of replicas
(29)

State Information Context

Need to be able to characterize the consistency constraints that are evaluated when updating state information

Example - copying data

• Replica (intent that copy be synchronized)

• Version (intent that copy be labeled)

• Backup (intent that copy represent time snapshot)

• New file (intent that copy be independent of original)

• Could change intent of the copy, changing the required state information
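
The copy example can be made concrete with a short Python sketch in which each intent determines which state information must be recorded. The four intents come from the slide; the attribute names (`source`, `synchronized`, `version_label`, `snapshot_time`) and both helper functions are assumptions chosen for illustration.

```python
from enum import Enum


class CopyIntent(Enum):
    REPLICA = "replica"    # copy is kept synchronized with the original
    VERSION = "version"    # copy carries a label
    BACKUP = "backup"      # copy represents a time snapshot
    NEW_FILE = "new_file"  # copy is independent of the original


# Hypothetical state attributes that each intent requires in the catalog.
REQUIRED_STATE = {
    CopyIntent.REPLICA: {"source", "synchronized"},
    CopyIntent.VERSION: {"source", "version_label"},
    CopyIntent.BACKUP: {"source", "snapshot_time"},
    CopyIntent.NEW_FILE: set(),
}


def record_copy(intent, **state):
    """Build the catalog record for a copy, keeping only what the intent needs."""
    required = REQUIRED_STATE[intent]
    missing = required - set(state)
    if missing:
        raise ValueError(f"{intent.value} requires: {sorted(missing)}")
    record = {"intent": intent.value}
    record.update({name: state[name] for name in required})
    return record


def change_intent(record, new_intent, **extra):
    """Re-derive the record when the intent of an existing copy changes."""
    merged = {k: v for k, v in record.items() if k != "intent"}
    merged.update(extra)
    return record_copy(new_intent, **merged)


replica = record_copy(CopyIntent.REPLICA, source="/collection/a.dat", synchronized=True)
backup = change_intent(replica, CopyIntent.BACKUP, snapshot_time="2005-08-02T00:00:00Z")
independent = change_intent(backup, CopyIntent.NEW_FILE)
```

Changing the intent from replica to backup drops the synchronization flag and demands a snapshot time, which is the sense in which a change of intent changes the required state information.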

(30)

Dynamic Constraints

Need technology that manages

Reification of consistency rules into metadata

State information about the reification

• Version of consistency rules that were evaluated

• Time stamp for when the evaluation was done

• Granularity within the collection for which the reification is valid

Require two versions

Management of procedural execution of rules
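
One possible reading of this slide, sketched in Python: the outcome of evaluating a consistency rule is stored as metadata together with the rule version, a time stamp, and the part of the collection it covers. The record layout, the function names, and the example rule ("at least two replicas") are assumptions, not a description of an existing system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RuleEvaluation:
    """Reified result of one rule evaluation, stored as metadata."""
    rule_name: str
    rule_version: int
    evaluated_at: datetime
    scope: str        # granularity: the sub-collection the result is valid for
    satisfied: bool


def evaluate_rule(rule_name, rule_version, predicate, scope, files, now=None):
    """Apply the rule to every file under `scope` and reify the outcome."""
    prefix = scope.rstrip("/") + "/"
    in_scope = [f for f in files if f["logical_name"].startswith(prefix)]
    return RuleEvaluation(
        rule_name=rule_name,
        rule_version=rule_version,
        evaluated_at=now or datetime.now(timezone.utc),
        scope=scope,
        satisfied=all(predicate(f) for f in in_scope),
    )


def is_stale(evaluation, current_rule_version):
    """A stored result no longer applies once the rule itself has changed."""
    return evaluation.rule_version != current_rule_version


files = [
    {"logical_name": "/archive/2005/a.dat", "replicas": 2},
    {"logical_name": "/archive/2005/b.dat", "replicas": 3},
    {"logical_name": "/archive/2004/c.dat", "replicas": 1},
]
stamp = datetime(2005, 8, 2, tzinfo=timezone.utc)


def at_least_two_replicas(f):
    return f["replicas"] >= 2


narrow = evaluate_rule("min-two-replicas", 1, at_least_two_replicas,
                       "/archive/2005", files, now=stamp)
wide = evaluate_rule("min-two-replicas", 1, at_least_two_replicas,
                     "/archive", files, now=stamp)
```

Here the rule holds for `/archive/2005` but not for `/archive` as a whole, so the stored scope matters when the result is reused, and `is_stale` flags results computed under an older version of the rule.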

(31)

Projects

Monash University / NSF National Science Digital Library / NARA Persistent Archive

Integration of Fedora and SRB data grid

SDSC - NSF ITR on Constraint-based Knowledge Systems for Grids, Digital Libraries, and Persistent Archives

Embedding of dynamic constraint

(32)

For More Information

Reagan W. Moore

San Diego Supercomputer Center

[email protected]
