Cloud Storage:
Efficient and Sustainable Solutions for Data Growth
Dalit Naor, PhD
Manager, Storage Research
IBM Haifa Research Lab
Who are we?
IBM Research: “The World is our Lab”
Labs: China, Watson, Almaden, Austin, Tokyo, Haifa, Zurich, India, Dublin, Melbourne, Brazil, Kenya
IBM Research – Haifa
Established in 1972
Largest IBM Research facility outside the US
Spanning all IBM Research strategy areas
Working with IBM business units and IBM clients worldwide
Collaborating with academia and industry
IBM Research – Haifa: What we do
– Cloud Computing
– Big Data Analytics
– Quality
– Storage
– Optimization
– Collaboration & Social Networking
EU projects
• Valuable projects
• Collaborative innovation – partners sharing their knowledge
• Extensive portfolio of collaborations with over 100 European universities, and 130 clients and partners
IBM Storage Research: Major Research Initiatives
IBM Storage Today and Tomorrow:
– Solid State Information Systems
– Digital Preservation & Archival Systems
– Scale Out and Cloud Storage
– Autonomic Storage Management
– Exploratory Storage Systems
– Storage for Analytics
Motivation – why care about Storage?
Or, why is Storage the most relevant topic in IT today?
Amazon S3 Growth: The First Trillion Objects (June 2012)
So what's the problem?
Data is growing faster than the IT infrastructure investments required to support it – the "data deluge gap."
– IT budgets are growing only 5%-7% per year.
– Information is growing faster than the investment required to store, transmit, analyze and manage it.
– Energy consumption
Why?
– More automation and collection of data
– More multi-media data (e.g., online video, surveillance, medical imaging).
– More sharing of data
– Maintaining duplicates
What can be done?
– Store less (expire, compress…)
– Store it “in the right place”
– Optimize existing systems
Impact of Storage Technologies on Energy Efficiency
Capacity optimization
– Compression and de-duplication
Data tiers and technologies
– Disk types
– Flash / SSD
– Tape
Cloud – a disruptive effect on storage
– Low-cost storage – consumer drives
– Data replication
Topics
1. Optimizing Storage through Cloud Technologies
– Object Stores
– Block storage for cloud compute
2. Data Reduction
– Data Compression and Deduplication techniques
– Smart Compression and Deduplication decisions
Different cloud workloads need different classes of storage
High-performance, co-located storage for XaaS
– Blocks/files to support compute
– E.g. Amazon EBS, Openstack NOVA
General purpose data center NAS extension
– Files
Fixed content depot
– Objects
– E.g. Amazon S3, Openstack Swift
Cloud (Online) Storage
Networked online storage
Data is stored in virtualized pools of storage
– may span multiple data centers
Typically hosted by a third party
Customers use it to store files or data objects.
Cloud Object Storage protocols
How is it done? – the Internals
Cloud File Systems
Scalable File Systems
Different design points than traditional file systems
– New architecture
– New, "relaxed" protocols and system operations (I/O and management)
– New solutions for resiliency and high availability based on replication (e.g., not RAID)
– Support for computation
– Designed for new workloads: large streaming, sequential writes, or analytics
Assumptions
– Based on commodity hardware
– Components always fail – need self-monitoring to detect, tolerate, and recover from failures
– Optimized for large files
Cloud Object Storage – Openstack / Swift
Source: Swiftstack documentation http://swiftstack.com/openstack-swift/architecture/
RESTful APIs
Swift storage:
http://swift.company.com/v1/account/container/object
Cloud Object Storage – Openstack / Swift
Building Blocks
– Proxy Servers: handle all incoming API requests
– Rings: map logical names of data to locations on particular disks
– Zones: failure domains
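Since the API is plain HTTP, any HTTP library can drive it. A minimal sketch in Python with the requests library; the endpoint, container name, and token below are placeholders (a real deployment obtains the token from its auth service first):

    import requests

    # Placeholder endpoint and token; in practice the auth service returns both.
    BASE = "http://swift.company.com/v1/account"
    HEADERS = {"X-Auth-Token": "AUTH_tk_example"}

    # Create a container, upload an object into it, then read it back.
    requests.put(BASE + "/photos", headers=HEADERS)
    requests.put(BASE + "/photos/pudong.jpg", headers=HEADERS,
                 data=b"...jpeg bytes...")
    resp = requests.get(BASE + "/photos/pudong.jpg", headers=HEADERS)
    print(resp.status_code, len(resp.content))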
Cloud Object Storage – Openstack / Swift
Source : Swiftstack documentation http://swiftstack.com/openstack-swift/architecture/
Data model
– Accounts: "tenants"
– Containers: sets of objects
– Objects: the data itself, mapped to files on the local file system
– Partitions: manage the locations where data lives in the cluster
Replication
– Everything is stored three times (by default)
– Upon a disk failure, the data is replicated to other zones, ensuring three copies
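To make the ring idea concrete, a toy sketch in Python: hash the object's logical name to a partition, then place the partition's replicas on devices in distinct zones. The partition power, device list, and placement walk are simplified assumptions; only the hash-to-partition step mirrors what Swift actually does (real rings are built offline with device weights):

    import hashlib

    PART_POWER = 8                    # 2**8 partitions (assumed, tiny for the demo)
    REPLICAS = 3                      # Swift's default replica count
    DEVICES = [f"z{z}-disk{d}" for z in range(3) for d in range(4)]

    def partition(account, container, obj):
        # Hash the logical name; the top PART_POWER bits pick the partition.
        h = hashlib.md5(f"/{account}/{container}/{obj}".encode()).digest()
        return int.from_bytes(h[:4], "big") >> (32 - PART_POWER)

    def replica_devices(part):
        # Walk the device list from the partition's offset, skipping zones
        # already used so each copy lands in a different failure domain.
        chosen, zones = [], set()
        for i in range(len(DEVICES)):
            dev = DEVICES[(part + i) % len(DEVICES)]
            zone = dev.split("-")[0]
            if zone not in zones:
                chosen.append(dev)
                zones.add(zone)
            if len(chosen) == REPLICAS:
                break
        return chosen

    print(replica_devices(partition("account", "photos", "pudong.jpg")))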
VISION Cloud: Virtualized Storage Services Foundation for the Future Internet
Architect and build the next generation, scalable,
low-cost and secure storage cloud system
Key Innovations:
– Raise Abstraction Level of Storage
– Computational Storage
– Content-Centric Storage
– Advanced Capabilities
– Data Mobility and Federation
Four use cases to demonstrate data-intensive
services
The evolution of Object Stores Research at IBM
Object Stores – an active area of research at IBM for the past 5 years
Research focus evolved over time: from how to build a basic scalable object store, to advanced capabilities for object stores (built around OpenStack Swift):
1. Rich Metadata Support – Metadata is an integral part of objects; it can describe the content and how it is handled. Provide queries and indexing over metadata, while supporting scalability.
2. Multi-tenancy – Provide secure logical isolation between tenants to enable hosting of many tenants over the same shared infrastructure. A user of one tenant cannot access the storage of another tenant; a security breach in one tenant cannot be leveraged to breach another tenant.
3. Computational Storage – "Stored procedures" for a storage cloud. Provide the ability to run computations (storlets) safely and securely, close to the data in Swift. Enables extending Swift without changing its code.
4. On-boarding – Provide an on-boarding service. Relationships are set with containers on the old provider, and the client starts working with the new provider.
Index and Query of User Metadata is viewed as a critical
feature by our VISION Cloud Partners
A catalog maintains, for each object in a container, a list of its attribute-value pairs
– A content-centric query requires a look-up in the catalog
Example (schematic) – list all red objects
GET /MyContainer/ HTTP/1.1
. . .
Match-md: Attribute='color' x-Value='red'

Response (schematic):
HTTP/1.1 200 OK
Content-Type: application/json
{ "children": [ "Obj 2", "Obj 3" ] }
Attribute | Value    | Object
color     | red      | Obj 3
shape     | square   | Obj 2
shape     | triangle | Obj 1
color     | blue     | Obj 1
color     | red      | Obj 2
shape     | square   | Obj 3
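A minimal sketch of such a catalog as an inverted index, populated with the toy objects from the table above (the real catalog is a distributed, persistent service, not an in-memory dict):

    from collections import defaultdict

    catalog = defaultdict(set)            # (attribute, value) -> set of objects

    def put_object(name, metadata):
        # Index every attribute-value pair of the object.
        for attr, value in metadata.items():
            catalog[(attr, value)].add(name)

    put_object("Obj 1", {"color": "blue", "shape": "triangle"})
    put_object("Obj 2", {"color": "red", "shape": "square"})
    put_object("Obj 3", {"color": "red", "shape": "square"})

    # The content-centric query from the slide: list all red objects.
    print(sorted(catalog[("color", "red")]))   # ['Obj 2', 'Obj 3']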
Computations (storlets) running in the object store save
network bandwidth and increase security
A restricted module executed in the storage, close to the data
– Analogous to database stored procedures
– Dynamically loaded
Execute in a sandbox
Triggered synchronously or asynchronously by events
– Any RESTful request can trigger
– Can run in background or act as “transform”
Benefits
– Locality – avoid network overhead
– Security – avoid transferring data outside of cloud
– Automated execution
(Example: a PUT of "Pudong Feb 2012" with metadata mimetype = jpeg, category = vacation picture, location = Shanghai matches the trigger of a "Thumbnail Creator" storlet (object-type = storlet; put-object trigger: mimetype = jpeg, category = vacation picture), which produces "Pudong Feb 2012 thumbnail".)
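A hedged sketch of that flow: register a trigger plus code, and run the code synchronously on a matching PUT. The registration and trigger-matching interface here is invented for illustration (it is not the actual VISION Cloud storlet API), and make_thumbnail is a stand-in for real image processing:

    objects = {}                          # stand-in for the object store

    def store(name, metadata, data):
        objects[name] = (metadata, data)

    storlets = []                         # (trigger metadata, code) pairs

    def register_storlet(trigger, code):
        storlets.append((trigger, code))

    def make_thumbnail(data):
        return data[:16]                  # stand-in; a real storlet would resize

    def thumbnail_creator(name, metadata, data):
        store(name + " thumbnail", {"derived-from": name}, make_thumbnail(data))

    def on_put(name, metadata, data):
        store(name, metadata, data)                   # the normal PUT path
        for trigger, code in storlets:                # run matching storlets
            if all(metadata.get(k) == v for k, v in trigger.items()):
                code(name, metadata, data)            # executes close to the data

    register_storlet({"mimetype": "jpeg", "category": "vacation picture"},
                     thumbnail_creator)
    on_put("Pudong Feb 2012",
           {"mimetype": "jpeg", "category": "vacation picture",
            "location": "Shanghai"},
           b"...jpeg bytes...")
    print(sorted(objects))    # ['Pudong Feb 2012', 'Pudong Feb 2012 thumbnail']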
On-Boarding-Federation allows a tenant to switch
to a new cloud storage provider on-the-fly
[Diagram: Old Cloud and New Cloud connected through a Federation Admin Module, Federator and Federator Jobs, plus Direct Federation; NAS and other storage shown as additional sources.]
Extract and associate metadata
Group by geo-coordinates
VISION Cloud Telco Use Case (FT, Telenor)
Value: easily develop value-added services
What is the use case?
– Storage capacity for telco users with value-add applications, including transcoding on-the-fly
– Cloud provider
Innovations needed
– Rich metadata, content-centric access, storlets, secure multi-tenancy
Status
Overview of OpenStack: Key components
Components: Horizon, Nova, Swift, Networking, Glance, Oslo
New in Havana: Metering (Ceilometer), Basic Cloud Orchestration & Service Definition (Heat)
Example: Cinder Volume Migration for the Havana Release
Goal: migrate a volume's data from one location to another in a manner that is as transparent as possible to users and workloads.
– Analogous to VMware's Storage vMotion
Use cases targeted for Havana:
– Storage maintenance / decommissioning
– Modifying volume types (enable/disable compression, Easy Tier, etc.)
Design tenet:
– Users today are not aware of a volume's location, and should therefore not be aware of or directly control volume migration.
Topics
1. Optimizing Storage through Cloud Technologies
– Object Stores
– Block storage for cloud compute
2. Data Reduction
– Data Compression and Deduplication techniques
– Smart Compression and Deduplication decisions
Compression and Deduplication
As CPU processing power continues to scale, compression and deduplication functions will continue to move into the storage controllers.
The challenge and the opportunity will be to coordinate effective implementation of the deduplication and/or compression functions with approaches that happen further upstream from the storage.
Real-time Compression is seamlessly integrated with the Storwize family GUI
– Simply select the new volume preset
Straightforward compression of existing volumes using volume mirroring
– Convert to compressed and eliminate unused space during conversion
Typical Compression Rates
– Databases: 50-80%
– Server/Desktop Virtualization: 45-75%
– Collaboration Data: 20-75%
– Engineering Data: 50-80%
Comprestimator: a host-based utility that can be used to evaluate expected compression benefits for existing environments
– About 1 minute of analysis per device
– Very accurate (less than 5% error)
– Supported on a wide range of clients
– Evaluates expected compression on any storage system

Our scope – Real-Time Compression
Compression for primary data in enterprise storage systems
Many benefits to compression:
– Reduces costs, rack space, and cooling
– Can delay the need for additional purchases for existing systems
The challenge: add "seamless" compression to a storage system with little effect on performance
A bit about compression techniques: we focused on zlib, a popular compression engine (zip). It combines LZ77 dictionary matching with Huffman coding (the DEFLATE scheme).
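For a feel of what that costs and buys, a minimal sketch using Python's standard zlib binding (the sample inputs are invented):

    import os
    import zlib

    def compression_ratio(chunk: bytes, level: int = 6) -> float:
        """Compressed size over original size; lower means more compressible."""
        return len(zlib.compress(chunk, level)) / len(chunk)

    print(compression_ratio(b"abcabcabc" * 1000))   # repetitive data: tiny ratio
    print(compression_ratio(os.urandom(8192)))      # random data: ratio near 1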
Estimating compression ratios – some motivation
To Zip or not to Zip?
– Compression does not come for free
– Incurs overheads, sometimes significant
– Not always worth the effort – depends on the actual data
– Goal: avoid compressing "incompressible" data
Other potential benefits: evaluation and sizing
– Compression ratio → number of disks → money!
– Evaluation: should I invest in a storage system with compression?
– Sizing: how many disks should I buy?
– Especially in a storage system with a high disk to …
Existing solutions
Rules of thumb
– Deduce from past experience with similar applications
– By file extension (e.g., .jpg, .doc, .ppt, .vmdk, .zip)
– Not always accurate
– Not always available
Look at the actual data
– Scan and compress everything – takes too long
– Look at a prefix (of a file/chunk) and deduce about the rest – no guarantees on the outcome
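The prefix heuristic from the last bullet is easy to sketch with zlib: compress only the first few KB and skip the chunk if the prefix already looks incompressible. The prefix size and threshold are assumed values, and, as noted above, the answer carries no guarantee:

    import os
    import zlib

    def looks_compressible(chunk: bytes, prefix: int = 4096,
                           threshold: float = 0.9) -> bool:
        """Deduce from a prefix whether compressing the whole chunk is worth it."""
        head = chunk[:prefix]
        return len(zlib.compress(head)) / len(head) < threshold

    data = os.urandom(1 << 20)            # stand-in for a real chunk of data
    if looks_compressible(data):          # random data: almost surely False
        data = zlib.compress(data)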
Part I – The Macro
Input: a large volume of data
– Block volume, file system, etc.
Goal: estimate the overall compression ratio with an accuracy guarantee.
The framework:
– Choose m random locations
– Compute the average of the compression ratios of these locations
Open questions: What is a "location"? What is the "compression ratio of a location"? How do we get a guarantee?
The Macro-scale – Sample size and accuracy
Analysis yields a Hoeffding-style bound: Pr[error > Accuracy] ≤ 2e^(−2m·Accuracy²)
– Accuracy is a bound on the additive error
– Plug the desired confidence and accuracy into the equation to get the required sample size m
– The sample size is independent of the volume size!
– Results of an estimator run are normally distributed around the actual compression ratio
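A sketch of the framework under stated assumptions: fixed-size locations, zlib as the compressor, and per-location ratios clipped to [0, 1] so the Hoeffding-style bound above applies. Location size and the accuracy/confidence targets are example values:

    import math
    import os
    import random
    import zlib

    LOCATION = 8192                       # bytes per sampled location (assumed)

    def sample_size(accuracy: float, failure_prob: float) -> int:
        # Smallest m with 2*exp(-2*m*accuracy**2) <= failure_prob.
        return math.ceil(math.log(2 / failure_prob) / (2 * accuracy ** 2))

    def estimate_ratio(path: str, m: int) -> float:
        size = os.path.getsize(path)
        total = 0.0
        with open(path, "rb") as f:
            for _ in range(m):            # m random locations
                f.seek(random.randrange(max(size - LOCATION, 1)))
                chunk = f.read(LOCATION)
                total += min(1.0, len(zlib.compress(chunk)) / len(chunk))
        return total / m                  # average per-location ratio

    m = sample_size(accuracy=0.01, failure_prob=0.001)
    print(m)                              # ~38,005 samples, whatever the volume size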
The Macro-scale – the Actual Tool
Written in C
Multi-threaded
Two implementations:
1. IBM Real-Time compression
2. Zlib compression on full objects
Tested on real-life data
Example of a run: 73 seconds on a 3.2 TB volume – error ~0.5%
– An exhaustive run took almost 4 hours
IBM Comprestimator – the macro-scale estimator for IBM Real-time Compression on Storwize V7000 and SAN Volume Controller; downloadable at:
Topics
1. Optimizing Storage through Cloud Technologies
– Object Stores
– Block storage for cloud compute
2. Data Reduction
– Data Compression and Deduplication techniques
– Smart Compression and Deduplication decisions
The Green Grid: DCcE – Data Center Compute Efficiency
Primary services
– A server is usually commissioned to provide one or more specific services
Server Compute Efficiency (ScE)
– Represents the percentage of time spent doing primary services
DCcE: Data Center Compute Efficiency
– Calculated by averaging ScE over all servers
The Green Grid Data Center Storage Efficiency Metrics (DCsE): The Application of Storage Systems Efficiency Operational Metrics
– The Green Grid is proposing the use of a new family of metrics – Data Center Storage Efficiency (DCsE)
– Represents efficiency at the data center level
Operational Metrics - Capacity
Data Center Storage Efficiency – Capacity (DCsE_cap)
Data Center User Capacity in Use is defined as the storage space used by
applications and measures, in GB, the total amount of file system space
consumed by all applications, as seen from the application point of view.
Data Center Storage Power Consumption is defined as the average power
consumption of the storage system measured over a long enough period of
time that it is representative of the storage system behavior (for the
measured capacity). It is measured in watts.
DCsE_cap = Data Center User Capacity in Use / Data Center Storage Power Consumption
Operational Metrics - Workload
Data Center Storage Efficiency – Workload (DCsE_io)
Data Center I/O Throughput is defined as the number of I/O operations per
second that the applications execute (i.e., applications’ I/O rates). It is
measured in IOPS.
Data Center Storage Power Consumption is defined as the average power
consumed by the storage system while running the I/O workload over a long
enough period of time. It is measured in watts.
DCsE_io = Data Center I/O Throughput / Data Center Storage Power Consumption
Operational Metrics - Throughput
Data Center Storage Efficiency – Throughput (DCsE_tp)
Data Center Data Transfer Throughput is defined as the amount of data
transferred per second by the applications (i.e., applications’ I/O rates), and it
is measured in MBPS.
Data Center Storage Power Consumption is defined as the average power
consumed by the storage system while running the data transfer workload
over a long enough period of time. It is measured in watts.
DCsE_tp = Data Center Data Transfer Throughput / Data Center Storage Power Consumption
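Taken together, the three metrics have the same shape: a useful quantity divided by the same power figure. A trivial sketch with invented sample numbers:

    def dcse_cap(user_capacity_gb: float, power_w: float) -> float:
        return user_capacity_gb / power_w        # GB per watt

    def dcse_io(iops: float, power_w: float) -> float:
        return iops / power_w                    # IOPS per watt

    def dcse_tp(throughput_mbps: float, power_w: float) -> float:
        return throughput_mbps / power_w         # MBPS per watt

    # Invented example: 50 TB in use, 120k IOPS, 2,400 MBPS, 4 kW storage power.
    print(dcse_cap(50_000, 4_000))   # 12.5 GB/W
    print(dcse_io(120_000, 4_000))   # 30.0 IOPS/W
    print(dcse_tp(2_400, 4_000))     # 0.6 MBPS/W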
Optimizing Operational Metrics using
Energy-Saving Features
Scenario #1 – Capacity Planning and Data Migration
– It is well known that storage systems do not operate at peak throughput and capacity at all times. Moreover, storage workloads change over time, due to changes in the
applications, how the applications are used, and the storage hardware itself. Monitoring the values of the DCsE operational metrics over time provides a perspective on the current state of efficiency and whether the storage system can be optimized.
– The DCsE capacity and workload metrics can be optimized by moving data between storage systems or tiers with different performance and energy consumption
characteristics, based on the workload and application classification (e.g., business-critical, high-performance, high-availability, administrative).
Scenario #2 – Capacity Optimization: Compression and De-duplication
– Implementing storage capacity optimization techniques, such as de-duplication and compression, allows data centers to store more data on their systems, improving the DCsE capacity metric.
Scenario #3 – Advanced RAID Levels and Snapshots
– Different resiliency (e.g., RAID) levels incur different storage capacity overheads. For example, RAID 10 incurs 50% capacity overhead; RAID 5 and RAID 6 incur different levels according to the number of data and parity (or mirrored) disks. However, in addition to capacity overhead, maintaining the parity incurs additional overhead when writing data. Thus, the decision of resiliency level affects DCsE capacity, DCsE
workload, and DCsE throughput. Data centers should take into account that both the application workload and the resiliency level affect the energy efficiency of the storage system.
– In addition, the storage system’s number of defined online snapshots, for rapidly
changing data, and the number of backup copies affect the underlying storage capacity requirements and energy consumption. Adjusting the number of snapshots and backups affects the DCsE capacity metric and allows data centers to store more data on the
same system.
Scenario #4 – Low Power Modes for Storage Systems
– Enabling the storage system to use low power states, including spinning/shutting down components and devices based on application data classification, can yield significant power reductions, but this approach is use-case dependent.