Using Databases to Manage State Information for Globally Distributed Data
Storage Resource Broker
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.
Abstract
• The management of globally distributed data is simplified through the use of data grids, which enable data sharing environments. Data grids provide:
  • Interoperability mechanisms needed to interact with legacy storage systems and legacy applications
  • Logical name spaces needed to identify files, resources, and users
  • Consistent management of state information about each file within the distributed environment
  • Access controls, descriptive metadata, and administrative metadata
• These capabilities enable data virtualization, the ability to manage data independently of the chosen storage repositories.
• Examples of management of globally distributed data include data grid federation, distributed digital libraries, and distributed persistent archives.
Research Questions
• How do we build global data management systems that rely on database technology to support state information?
• Is the current state of the art sufficient, or do we need extensions to current database technology?
Global Data Management Precursor
• 1994 - Alternative Architecture Study for NASA Earth Observing Satellite EOSDIS
• Mike Stonebraker, Jim Gray, Jeff Dozier, William Farrell, Reagan Moore
• Proposed aggressive use of database technology to manage
  • 3 TB of data ingestion per day from multiple sites
  • Replication of data
  • 15 PB archive
  • Discovery and manipulation of the collection
Data Grid Evolution
• 1995 - DARPA Massive Data Analysis Systems
• 1997 - DARPA/USPTO Distributed Object Computation Testbed
• 1998 - NSF National Partnership for Advanced Computational Infrastructure
• 1998 - DOE Accelerated Strategic Computing Initiative data grid
• 1999 - NARA persistent archive
• 2000 - NASA Information Power Grid
• 2001 - NLM Digital Embryo digital library
• 2001 - DOE Particle Physics data grid
• 2001 - NSF Grid Physics Network data grid
• 2001 - NSF National Virtual Observatory data grid
• 2002 - NSF National Science Digital Library persistent archive
• 2003 - NSF Southern California Earthquake Center digital library
• 2003 - NIH Biomedical Informatics Research Network data grid
• 2003 - NSF Real-time Observatories, Applications, and Data management Network
• 2004 - NSF ITR, Constraint based data systems
• 2005 - LC Digital Preservation Lifecycle Management
Terminology
• 1998
• Data Grid, a data management system that organizes distributed data into collections
• Data Grid - data virtualization
• 2000
• Persistent Archive, a data management system that handles technology evolution
• Persistent Archive - infrastructure independence
Data Grids: First Viewpoint
• Create a shared collection which manages state information independently from the storage systems
• Build a metadata catalog to store state information
• Separate data access mechanisms from storage access mechanisms
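As an illustration of the metadata catalog idea above, the following is a minimal sketch, not the SRB MCAT schema. It assumes an in-memory SQLite database, and the table, column, resource, and path names are all hypothetical. It shows state information (where each copy of a logical file lives) being kept in a database rather than in the storage systems themselves.

```python
import sqlite3


def create_catalog():
    """Create an in-memory catalog. Table and column names are hypothetical."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE replica ("
        " logical_name TEXT NOT NULL,"
        " resource TEXT NOT NULL,"
        " physical_path TEXT NOT NULL,"
        " PRIMARY KEY (logical_name, resource))"
    )
    return db


def register(db, logical_name, resource, physical_path):
    """Record where one copy of a logical file is physically stored."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO replica VALUES (?, ?, ?)",
            (logical_name, resource, physical_path),
        )


def resolve(db, logical_name):
    """Map a logical name to every (resource, physical path) holding a copy."""
    rows = db.execute(
        "SELECT resource, physical_path FROM replica"
        " WHERE logical_name = ? ORDER BY resource",
        (logical_name,),
    )
    return rows.fetchall()


catalog = create_catalog()
register(catalog, "/home/collection/survey.dat", "unix-sdsc", "/data/vault/0001")
register(catalog, "/home/collection/survey.dat", "hpss-sdsc", "/archive/grid/0001")
print(resolve(catalog, "/home/collection/survey.dat"))
```

Because applications only ever use the logical name, a file could be moved between storage systems by updating catalog rows, without changing the name users see.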
Trust Virtualization
• Shared collection owns the data
• At each remote storage system, an account ID is created under which the data grid stores files
• User authenticates to the data grid
• Data grid checks access controls
• Data grid server authenticates to a remote data grid server
• Remote data grid server authenticates
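The trust chain listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not SRB's authentication protocol: passwords stand in for real credentials, server-to-server authentication is omitted, and the class name, the account name `srbgrid`, and all user and file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataGrid:
    # All names and credentials here are hypothetical placeholders.
    users: dict                    # grid user name -> password
    acl: dict                      # logical file name -> set of users allowed to read
    grid_account: str = "srbgrid"  # account ID under which the grid stores files
    log: list = field(default_factory=list)

    def read(self, user, password, logical_name):
        # Step 1: the user authenticates to the data grid, not to the storage system.
        if self.users.get(user) != password:
            raise PermissionError("authentication to data grid failed")
        # Step 2: the data grid checks its own access controls.
        if user not in self.acl.get(logical_name, set()):
            raise PermissionError("access denied by data grid")
        # Step 3: the storage system only ever sees the grid's account ID.
        self.log.append((self.grid_account, logical_name))
        return f"{logical_name} read as {self.grid_account}"


grid = DataGrid(users={"alice": "secret"}, acl={"/coll/run1.dat": {"alice"}})
print(grid.read("alice", "secret", "/coll/run1.dat"))
```

The point of the design is that users never need accounts on the remote storage systems; the collection owns the data and the grid decides who may access it.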
Data Grids
• Largest single data grids
  • ROADnet real-time sensor network, links 90 object ring buffers supporting 24,000 sensors
  • BaBar high energy physics, distributes SLAC collections to Lyon, France and Rome, Italy
  • Biomedical Informatics Research Network, links data resources across 25 institutions within the US
• Largest data grid federations
  • KEK high-energy physics federation of 7 data grids between Japan, South Korea, China, Taiwan, Australia, Poland, US
  • WUN federation of 5 academic institutions (SDSC, NCSA, U Bergen, U Southampton, U Manchester)
  • NARA Research Prototype Persistent Archive (SDSC, U Maryland, NARA, GA Tech)
Data Grids: Second Viewpoint
• Support data management applications
• Automate all aspects of data discovery, access, management, analysis, preservation
• Security paramount
• Distributed data
• Provide distributed data support for
• Data sharing - data grids
• Data publication - digital libraries
• Data preservation - persistent archives
Generic Data Management
• Data grids provide capabilities needed by digital libraries and persistent archives
• Infrastructure independence to manage a collection distributed across multiple storage systems
• Descriptive metadata to describe authenticity context of each file
• Administrative metadata to maintain integrity
  • Location, replicas, checksums, audit trails, ownership, access controls, versions, locks, pinning, aggregation
• Data grids are implemented as middleware
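To illustrate how administrative metadata such as checksums can maintain integrity across replicas, here is a minimal sketch. The choice of SHA-256 is an assumption (the slides do not name a checksum algorithm), and the function and resource names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum recorded as administrative metadata (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_replicas(expected: str, replicas: dict) -> list:
    """Return the resources whose copy no longer matches the recorded checksum."""
    return sorted(
        name for name, data in replicas.items() if checksum(data) != expected
    )


original = b"observation record"
recorded = checksum(original)  # stored in the catalog when the file is registered
replicas = {
    "disk-a": original,
    "tape-b": original,
    "disk-c": b"observation recorc",  # simulated corruption of one copy
}
print(verify_replicas(recorded, replicas))  # ['disk-c']
```

A damaged copy found this way could then be replaced from any replica that still matches the recorded checksum.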
Federated Server Architecture
[Figure: an application issues a read, by logical name or attribute condition, to an SRB server. Components shown: application, SRB servers, SRB agents, MCAT, and replicas R1 and R2. Numbered arrows 1 to 6 mark the sequence of interactions, including peer-to-peer brokering, server(s) spawning, data access, and parallel data access (steps 5/6).]
Labels shown with the MCAT:
1. Logical-to-physical mapping
2. Identification of replicas
3. Access & audit control
Storage Resource Broker 3.3.1
[Figure: layered architecture diagram; labels listed from the application down to the storage systems]
• Application
• Access interfaces: C Library, Java; Unix Shell; NT Browser, Kepler Actors; OAI, WSDL, (WSRF); GridFTP; HTTP, DSpace, Fedora, OpenDAP; Linux I/O; C++; DLL / Python, Perl, Windows
• Federation Management
• Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space, Latency Management, Data Transport, Metadata Transport
• Storage Repository Abstraction
  • Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
  • File Systems - Unix, NT, Mac OSX
  • ORB
• Database Abstraction
  • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage Resource Broker Collections at SDSC (8/2/2005)
Collection | GBs of data stored | Number of files | Users with ACLs
Data Grid
NSF/ITR - National Virtual Observatory | 53,862 | 9,536,751 | 100
NSF - National Partnership for Advanced Computational Infrastructure | 36,149 | 7,539,180 | 380
Static collections - Hayden planetarium | 8,013 | 161,352 | 227
Pzone - public collections | 12,998 | 6,707,952 | 68
NSF/NPACI - Biology and Environmental collections | 40,155 | 76,083 | 67
NSF/NPACI - Joint Center for Structural Genomics | 15,731 | 1,577,260 | 55
NSF - TeraGrid, ENZO Cosmology simulations | 176,730 | 2,125,945 | 3,267
NIH - Biomedical Informatics Research Network | 10,561 | 7,596,888 | 303
Digital Library
NSF/NPACI - Long Term Ecological Reserve | 256 | 9,033 | 36
NSF/NPACI - Grid Portal | 2,620 | 53,048 | 460
NIH - Alliance for Cell Signaling microarray data | 741 | 84,594 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,733 | 1,083,998 | 27
NSF/ITR - Southern California Earthquake Center | 131,010 | 2,702,421 | 73
Persistent Archive
NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota) | 100 | 382,186 | 28
UCSD Libraries archive | 4,147 | 408,050 | 29
NARA - Research Prototype Persistent Archive | 1,478 | 893,434 | 58
NSF - National Science Digital Library persistent archive | 3,600 | 27,034,150 | 136
Infrastructure Independence
• Data virtualization
• Management of name spaces independently of the storage repositories
  • Global name spaces
  • Persistent identifiers
  • Collection-based ownership of files
• Support for access operations independently of the storage repositories
  • Separation of access methods from storage protocols
Separation of Access Method from Storage Protocols
[Figure: Access Method -> Access Operations -> Data Grid -> Storage Operations -> Storage Protocol -> Storage System]
• Data Grid: map from the operations used by the access method to a standard set of operations used to interact with the storage system
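The mapping from access operations to a standard set of storage operations can be sketched with a small driver interface. This is an illustrative sketch under stated assumptions, not the SRB driver API: the class names are hypothetical, only `read` and `write` are modeled, and a Python dictionary stands in for a database blob store.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Standard operations the data grid uses to talk to any storage system."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class UnixFileDriver(StorageDriver):
    """Maps the standard operations onto a local file system directory."""

    def __init__(self, root: str):
        self.root = root

    def write(self, path, data):
        with open(os.path.join(self.root, path), "wb") as f:
            f.write(data)

    def read(self, path):
        with open(os.path.join(self.root, path), "rb") as f:
            return f.read()


class BlobDriver(StorageDriver):
    """Stand-in for a database blob store; a dict replaces the real database."""

    def __init__(self):
        self.blobs = {}

    def write(self, path, data):
        self.blobs[path] = bytes(data)

    def read(self, path):
        return self.blobs[path]


def copy(src: StorageDriver, dst: StorageDriver, path: str) -> None:
    """An access-level operation written once, independent of storage protocol."""
    dst.write(path, src.read(path))


with tempfile.TemporaryDirectory() as root:
    unix = UnixFileDriver(root)
    blobs = BlobDriver()
    unix.write("run1.dat", b"payload")
    copy(unix, blobs, "run1.dat")
    print(blobs.read("run1.dat"))
```

Supporting a new kind of repository then means writing one more driver class, while every access method built on the standard operations keeps working unchanged.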
Data Grid Operations
• File access
• Open, close, read, write, seek, stat, synch, …
• Audit, versions, pinning, checksums, synchronize, …
• Parallel I/O and firewall interactions
• Versions, backups, replicas
• Latency management
• Bulk operations
• Register, load, unload, delete, …
• Remote procedures
• HDFv5, data filtering, file parsing, replicate, aggregate
• Metadata management
• SQL generation, schema extension, XML import and export, browsing, queries, …
• GGF, "Operations for Access, Management, and Transport at Remote Sites"
Latency Management - Bulk Operations
• Bulk register
  • Create a logical name for a file
  • Load context (metadata)
• Bulk load
  • Create a copy of the file on a data grid storage repository
• Bulk unload
  • Provide containers to hold small files and pointers to each file location
• Bulk delete
  • Trash can
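The idea of a container that holds small files together with pointers to each file location can be sketched as follows. This is a minimal illustration, not the SRB container format: the function names and the (offset, length) index layout are assumptions.

```python
def pack(files: dict):
    """Concatenate small files into one container plus an (offset, length) index."""
    container = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(container), len(data))  # pointer to this file's location
        container.extend(data)
    return bytes(container), index


def extract(container: bytes, index: dict, name: str) -> bytes:
    """Recover one file from the container using its recorded pointer."""
    offset, length = index[name]
    return container[offset:offset + length]


small_files = {"a.txt": b"alpha", "b.txt": b"be", "c.txt": b"gamma!"}
container, index = pack(small_files)
print(index)  # {'a.txt': (0, 5), 'b.txt': (5, 2), 'c.txt': (7, 6)}
print(extract(container, index, "b.txt"))  # b'be'
```

Moving one container costs a single round trip to the storage system instead of one per file, which is where the latency saving for many small files comes from.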
Examples of Extensibility
• The three fundamental APIs are the C library, shell commands, and Java
• Other access mechanisms are ported on top of these interfaces
• API evolution
• Initial access through C library, Unix shell command
• Added iNQ Windows browser (C++ library)
• Added mySRB Web browser (C library and shell commands)
• Added Java (Jargon)
• Added Perl/Python load libraries (shell command)
• Added WSDL (Java)
• Added OAI-PMH, OpenDAP, DSpace digital library (Java)
• Added Kepler actors for dataflow access (Java)
Examples of Extensibility
• Storage Repository Driver evolution
• Initially supported Unix file system
• Added archival access - UniTree, HPSS
• Added FTP/HTTP
• Added database blob access
• Added database table interface
• Added Windows file system
• Added project archives - Dcache, Castor, ADS
• Added Object Ring Buffer, Datascope
• Added GridFTP version 3.3
• Database management evolution
• Postgres
• DB2
• Oracle
• Informix
• Sybase
Logical Name Spaces
[Figure: access without a data grid]
• Storage Repository
  • Storage location
  • User name
  • File name
  • File context (creation date, …)
  • Access constraints
• Data Access Methods (C library, Unix, Web Browser)
• Data access directly between application and storage repository using names required by the local repository
Logical Name Spaces
[Figure: access through a data grid]
• Storage Repository
  • Storage location
  • User name
  • File name
  • File context (creation date, …)
  • Access constraints
• Data Grid (Data Collection)
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints
• Data Access Methods (C library, Unix, Web Browser)
Federation Between Data Grids
[Figure: two data grids, each with its own data collection, linked by cross registration]
• Data Grid (Data Collection B)
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints
• Data Access Methods (Web Browser, DSpace, OAI-PMH)
• Data Grid (Data Collection A)
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints
• Access controls and consistency constraints on cross registration of digital entities
Types of Federation
• Peer-to-peer grids
  • Data grids forward requests for access to public data
• Hierarchical grids
  • Master - slave
  • All files in slave data grid are replicated from the master data grid
  • Central archive
  • Multiple independent data grids deposit replicas into a central archive
• Replication grids
  • Two independent data grids serve as back-up sites for each other
Data Management Systems
• Digital Libraries
  • DSpace services for ingestion and description of files, ported on top of the SRB data grid
  • Fedora relationship management services; the port on top of the SRB data grid is in test
  • Cheshire integration on top of the SRB data grid
  • OAI-PMH interface to the SRB data grid
• Persistent archives
  • Manage authenticity, integrity, and infrastructure independence
  • Integrate preservation processes on top of the SRB data grid
Chronopolis Preservation Facility
Federation of Three Independent Data Grids
[Figure: three sites (NCAR, U Md, SDSC), each with its own MCAT]
• Deep Archive at NARA, no user access but complete copy
• Replicated copy at U Md for improved access, load balancing and disaster recovery
• Active archive at SDSC, user access
Demonstrate preservation environment
• Authenticity
• Integrity
• Management of technology evolution
• Mitigation of risk of data loss
• Replication of data
• Federation of catalogs
• Management of preservation metadata
• Scalability
• 3 collections / year
• Support 100 TBs per site
Distributed Metadata Management
• Database specific replication
• Oracle
• Master-slave catalogs across vendors
• Synchronize an independent metadata catalog with state from primary catalog
• SRB version 3.4
• Master-slave catalog federation between data grids
• How to enforce data grid administration
Data Grids: Third Viewpoint
• Require ability to apply dynamic consistency constraints to state information
• When federating data grids
• When modifying views of collections
• When managing data placement
• When asserting global properties (creating consistent state across the collection)
  • Synchronization of replicas
State Information Context
• Need to be able to characterize the consistency constraints that are evaluated when updating state information
• Example - copying data
  • Replica (intent that copy be synchronized)
  • Version (intent that copy be labeled)
  • Backup (intent that copy represent time snapshot)
  • New file (intent that copy be independent of original)
  • Could change intent of the copy, changing the required state information
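The copying example above can be sketched in code: the same physical copy operation requires different state information depending on intent. This is an illustrative sketch only; the enum, function, and field names are hypothetical and are not SRB catalog attributes.

```python
from enum import Enum


class CopyIntent(Enum):
    REPLICA = "replica"    # copy is to be kept synchronized
    VERSION = "version"    # copy is to be labeled
    BACKUP = "backup"      # copy represents a time snapshot
    NEW_FILE = "new file"  # copy is independent of the original


def state_for_copy(intent, original, copy_path, timestamp, version=None):
    """Return the state information a copy needs for the given intent."""
    state = {"path": copy_path, "intent": intent.value}
    if intent is CopyIntent.REPLICA:
        state["replica_of"] = original
        state["synchronized"] = True
    elif intent is CopyIntent.VERSION:
        state["version_of"] = original
        state["version_label"] = version
    elif intent is CopyIntent.BACKUP:
        state["backup_of"] = original
        state["snapshot_time"] = timestamp
    # NEW_FILE: no link back to the original is recorded.
    return state


for intent in CopyIntent:
    print(state_for_copy(intent, "/coll/a.dat", "/coll/a.copy", "2005-08-02", version="v2"))
```

Changing the intent of an existing copy, for example from backup to replica, would mean replacing one set of fields with another, which is why the constraints on state updates need to be characterized explicitly.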
Dynamic Constraints
• Need technology that manages
• Reification of consistency rules into metadata
• State information about the reification
  • Version of consistency rules that were evaluated
  • Time stamp for when the evaluation was done
  • Granularity within the collection for which the reification is valid
• Require two versions
• Management of procedural execution of rules
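One way to picture the state information kept about a reified rule is sketched below. It is a hypothetical illustration, not an existing SRB feature: the record fields mirror the three items listed above (rule version, time stamp, granularity), and the rule itself ("at least two replicas") and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleEvaluation:
    """Stored result of evaluating a consistency rule (hypothetical layout)."""
    rule_name: str
    rule_version: int  # version of the rule that was evaluated
    evaluated_at: str  # time stamp for when the evaluation was done
    scope: str         # granularity: sub-collection the result is valid for
    passed: bool


def evaluate(rule_name, rule_version, rule, scope, files, evaluated_at):
    """Apply the rule to every file under scope and reify the outcome."""
    in_scope = [state for path, state in files.items() if path.startswith(scope)]
    passed = all(rule(state) for state in in_scope)
    return RuleEvaluation(rule_name, rule_version, evaluated_at, scope, passed)


def is_current(evaluation, current_rule_version):
    """A stored result is only trustworthy for the rule version it was made with."""
    return evaluation.rule_version == current_rule_version


def at_least_two_replicas(state):
    return state["replicas"] >= 2


files = {
    "/coll/a/f1": {"replicas": 2},
    "/coll/a/f2": {"replicas": 3},
    "/coll/b/f3": {"replicas": 1},
}
result = evaluate("min-replicas", 1, at_least_two_replicas, "/coll/a/", files, "2005-08-02T00:00:00Z")
print(result.passed, is_current(result, 2))  # True False
```

In this sketch, once the rule moves to version 2 the stored result is flagged as stale, so the sub-collection would need to be re-evaluated before the property can be asserted again.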
Projects
• Monash University / NSF National Science Digital Library / NARA Persistent Archive
• Integration of Fedora and SRB data grid
• SDSC - NSF ITR on Constraint-based Knowledge Systems for Grids, Digital Libraries, and Persistent Archives
• Embedding of dynamic constraint
For More Information
Reagan W. Moore
San Diego Supercomputer Center