Open Source Long-Term Preservation Archives
Richard Matthews
Presentation
• Prepared by:
> Keith Rajecki
> Industry Solutions Architect
> Global Education & Research
• Presented by:
> Rick Matthews
> Sr. Staff Engineer
Agenda
• Sun, Open Source, and Communities
• Preservation Archiving Trends
• Sun Archiving Storage Solutions
Sun Confidential: Internal Only
• Business Presence: 100 Countries
• Java Developers: 5 Million
• Java Devices: 6 Billion
• Annual Revenues: $13+ Billion
• Worldwide Employees: 35,000
• Cash: $4.8 Billion
• U.S. Patents: 5,000+
• Fortune 211 Company
• Solaris 10 Licenses: 7 Million
• Annual R&D: ~$2 Billion
• Annual Storage Petabytes Shipped: 410
• SPARC Embedded Processors: 44+ Million
Sun's Open Source Strategy
Developer Preference
User Preference
Value Proposition
• More core developers
• More deploying developers
• More partners
• Business Deployment
• Sun's target market
• Binary distribution
• Pay for value
• Free to use
• More platform choice
• More suppliers
• Larger user community
Sun and Open Source Software
• OSS has been part of Sun's DNA for a while
“Every software asset we produce is open source. If it isn't today, it will be pretty damn quickly.”
Jonathan Schwartz, CEO, Sun Microsystems, January 2007
• Sun's commitment to OSS Communities
> This includes Open Repository Communities
> More than just a Storage perspective
Sun's Open Stack
Flexible and Heterogeneous with Zero Barrier to Exit
Java Enterprise System Composite Application Platform
Sun xVM Operating System Virtualization Architecture Database Platform Application Infrastructure Partners
• 4300 downloads to date
• 14 million lines of source code
• Community interest: 1 to 1000 core systems
• First derivative design: SimplyRISC S1 core
OpenSolaris + OpenSPARC = Only Truly Open Platform
www.opensparc.net
OpenSPARC
OpenSolaris
Innovation Happens Everywhere
www.opensolaris.org
90,000 Members
64 Community Projects,
BrandZ, DTrace, Solaris ZFS, Zones
53 User Groups Worldwide
260 Code Contributions
• 2006 Codie “Best Open Source Solution”
• 2005 Open Source World Editor's Choice
• 2005 InfoWorld Innovators Award
Sun is Committed to Developer Communities
Building Open and Free Communities
Building a Vibrant Ecosystem: Sun is the Largest Commercial Contributor to Open Source Communities
Java, Infrastructure, Ecosystem, Community, Solaris, SPARC
java.net
The Source for Java Technology Collaboration
Sun Preservation Archiving Community
• PASIG Meeting May 27-30 San Francisco
www.sun-pasig.org
• Comparison of high-level OAIS architectures,
services-oriented architecture, and use cases
• Sharing of best practices and software code
• Cooperation on standard, open, ‘in-a-box’ solutions
around repository technologies
• Review of preservation and archiving storage
architectures and eResearch data set management
• Discussion of the uses of commercial third-party and
Trends: Growth and Preservation
Humans created 161 exabytes of data in 2006,
approximately 3 million times the
information in all the books ever written,
according to IDC.
Source: "An Inconvenient IT Truth," By Michael Vizard, eWeek, 06/22/07
80% of all movies made before 1940 are gone.
Repository Archiving Projects
• Compliance
• Book and Image Digitization and Sharing
• National Heritage Content
• Newspapers
• Replicated, Tiered Repositories for Archived Materials
• Research Data, Applications, and Systems
• Science, Technology, and Medical Journals
What's The Buzz?
• Problem:
> Exponential growth of digital content, now and in the future
> Powerful, flexible infrastructure required to archive:
– Store unstructured, fixed content
– Search that content
– Preserve that content for the long term
• One Proposed Solution: SAM-QFS Infinite Archive System
• SAM-QFS: World’s best policy-based, multi-tiered archive manager
> Application-transparent dynamic data movement
> Four tiers, local and remote
> “Continuous Archive” = CDP (continuous data protection)
> WORM & Retention management
• Infinite Archive System
> Scalable, multi-tiered SAM-QFS platform base
> 10-256TB systems
Sun StorageTek 5800 “Honeycomb”
Content-aware Open Storage
• What It Is
> 'Smart', network-attached, clustered,
racked storage system
• What It Provides
> 64TB (raw) per rack: data objects +
metadata
> WORM Objects
> Metadata awareness built into the
design
> Reliability, persistency, currency
assurances
Sun StorageTek 5800 “Honeycomb”
Content-aware Open Storage
• RAIN architecture
> Symmetric cluster CPU, memory,
SATA Disks
• Each node
> Opteron-based SunFire server
> Solaris 10
> 3 GB RAM
> Dual Gig-E
> 4 x 500GB SATA
• L2 load-spreading switches
• Service processor
Sun StorageTek 5800 “Honeycomb”
Content-aware Open Storage
• 'Half Cell': 8 servers, 16 TB (raw)
• 'Full Cell': 16 servers, 32 TB (raw)
• 'Rack' (2 Cells): 32 servers, 64 TB (raw)
• 'Hive' (N Cells) [Future]
• 'Hot Scaling'
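The raw capacity figures follow from the per-node configuration listed for each server (4 x 500GB SATA disks). A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB); the function and constant names are illustrative only and are not part of any Sun tool:

```python
# Per-node disk configuration taken from the "Each node" slide.
DISKS_PER_NODE = 4
DISK_SIZE_GB = 500


def raw_capacity_tb(servers: int) -> float:
    """Raw (pre-protection) capacity in TB for a given number of servers."""
    return servers * DISKS_PER_NODE * DISK_SIZE_GB / 1000


# A rack holds two full cells of 16 servers each.
configurations = [("Half Cell", 8), ("Full Cell", 16), ("Rack (2 Cells)", 2 * 16)]

for name, servers in configurations:
    print(f"{name}: {servers} servers, {raw_capacity_tb(servers):g} TB raw")
```

Usable capacity would be lower than these raw figures once data protection overhead is accounted for; the deck does not state that overhead, so it is not modeled here.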
Why Honeycomb?
• Architecture optimized to store and retrieve unstructured fixed content
• Object storage, metadata aware
• Extreme data protection via RAID6, data self-healing, bit-rot detection
> Mean Time To Data Loss: more than 2M years
• A commitment to standards
> Dublin Core metadata
> WebDAV
Why Honeycomb?
• Open Source strategy fits with majority of repository/archive software efforts
• Standard Java and C APIs in SDK
• Horizontal Scaling as storage needs grow
• Dublin Core is only the beginning
• Platform-agnostic
• [Near Future] On-board local data services available (“Storage Beans”)
Why Fedora + Honeycomb?
Answer: Archival Storage
Two architectures compared:
• Custom Monolithic Digital Library Application
• Digital Library on a Fedora Archive Server with Sun STK5800 Systems
To seriously address current and future issues of scalability, persistence, and flexibility, the custom monolithic application will likely not meet the requirements, while the Fedora Archive Server with Sun STK5800 Systems clearly will.
Why Fedora + Honeycomb?
Application Clients
Content-Rich Applications: ePublishing, eResearch, HPC, Digital Library
Fedora Archive Server
Sun STK5800 Systems
• Designed with the proper intelligence in the proper places
• Metadata integral to storage
• World-class reliability, scalability and persistency
• End-to-end Open Source Software solution
• Automated wide-area backup option
Storage Beans
• Discrete Services inside Honeycomb
• What will they do? That's up to you!
• Asynchronous (Background)
> Transformations
> Periodic Data Scrubbing
> Duplicate Consolidation
• Synchronous (Real-time)
> Audit logs
> Watermarking
> Encryption
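As an illustration of what a background service such as "Duplicate Consolidation" might do, here is a minimal sketch in plain Python. It does not use any Honeycomb or Storage Beans API (the deck does not describe one); the function name and the in-memory dictionaries are assumptions made purely for illustration. The idea shown is content-addressed storage: identical payloads hash to the same digest and are kept only once.

```python
import hashlib


def consolidate_duplicates(objects: dict[str, bytes]) -> tuple[dict[str, bytes], dict[str, str]]:
    """Keep one copy of each distinct payload, keyed by its SHA-256 digest.

    Returns (store, index):
      store maps digest -> payload (one entry per distinct payload)
      index maps object name -> digest (every original name is preserved)
    """
    store: dict[str, bytes] = {}
    index: dict[str, str] = {}
    for name, payload in objects.items():
        digest = hashlib.sha256(payload).hexdigest()
        store.setdefault(digest, payload)  # first copy wins; later duplicates are skipped
        index[name] = digest
    return store, index


# Hypothetical sample data: two names share identical content.
sample = {"scan-001.tif": b"page one", "scan-001-copy.tif": b"page one", "scan-002.tif": b"page two"}
store, index = consolidate_duplicates(sample)
print(f"{len(sample)} objects, {len(store)} stored payloads")
```

The same digest could also serve periodic data scrubbing: recomputing it later and comparing against the stored value would reveal silent corruption.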
Sun/Fedora Efforts
• Fedora runs now on Solaris/OpenSolaris
> Server + Storage reference configurations
> Inclusion of Fedora 3 in the OpenSolaris 'Indiana' Repository
> Fedora on Solaris
– How does it perform?
– How does it scale?
– What are the advantages to running on Solaris?
– Best Practices on Sun
Proof Point: Fedora/Sun/JHU
A Fedora/Sun Solution for Creation, Management, and Exchange of Durable, Digital e-Research Content at Johns Hopkins University
[Architecture diagram; labels recovered from the slide:]
• User Applications: E-Research, Preservation Archive, Publishing
• Fedora Web Service APIs (SOAP and REST): Manage API, Access API, Registry Search, RDF Search
• Fedora Commons Framework, pluggable core modules: Ingest, Validate, Manage, Policy, Access, RDF Index, Store, Registry, CMABind
• Content Storage: File system (Objects), RDBMS (Registry), Triplestore (Index)
• Abstract Data Mgmt. Layer
• Storage: Fast Disk, Honeycomb, Tape
Honeycomb Virtual Views Example
Photo Application
• Photo demo application on top of StorageTek 5800 or ST5800 Emulator
• Leverages metadata to organize and present the content in logical views
• Photo app extracts, stores, and displays embedded EXIF JPEG metadata
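The idea of organizing content into logical views by metadata can be sketched as follows. This is not the Honeycomb SDK or the photo demo's code; the record layout, field names, and `virtual_view` function are hypothetical, chosen only to show how the same stored objects can be presented under different metadata-driven groupings without moving the data.

```python
from collections import defaultdict


def virtual_view(objects, *fields):
    """Group object IDs by the values of the given metadata fields.

    Each distinct combination of field values becomes one "folder" in the view.
    Objects missing a field are grouped under "unknown" for that field.
    """
    view = defaultdict(list)
    for obj in objects:
        key = tuple(obj["metadata"].get(field, "unknown") for field in fields)
        view[key].append(obj["id"])
    return dict(view)


# Hypothetical photo records with EXIF-style metadata.
photos = [
    {"id": "obj-1", "metadata": {"camera": "A", "year": "2007"}},
    {"id": "obj-2", "metadata": {"camera": "B", "year": "2007"}},
    {"id": "obj-3", "metadata": {"camera": "A", "year": "2008"}},
]

by_camera = virtual_view(photos, "camera")
by_year_and_camera = virtual_view(photos, "year", "camera")
print(by_camera)
print(by_year_and_camera)
```

Each call yields a different view over the same three objects, which is the point of a metadata-aware store: the grouping is computed from metadata rather than fixed by a directory layout.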
Digital Repository References
• New York U. Digital Media Management
• Stanford U. OAIS Digital Repository
• Johns Hopkins U. eResearch (Fedora)
• Purdue U. eResearch (Fedora)
• Oxford University Google Project (Fedora/VITAL)
• National Library of New Zealand Digital Preservation (Ex Libris)
• California Digital Library Large Scale Digitization
• Swedish Archive for Sound/Recording Digital Media Management
• Southampton U. EPrints Repository
• San Diego Supercomputer Large Dataset Storage
• U. Michigan DSpace Repository
For More Information
• Storage Archive Manager
http://www.sun.com/storagetek/management_software/data_management/sam/index.xml/
• Honeycomb
http://www.sun.com/storagetek/disk_systems/enterprise/5800/index.xml
• Honeycomb Architecture Document:
http://www.sun.com/storagetek/disk_systems/enterprise/5800/5800-Arch-final-LR.pdf
• Sun Preservation and Archiving Community
http://www.sun-pasig.org
• Open Source Honeycomb Software