CS-552/452 Introduction to Cloud Computing
20. Cloud Object Storage
When we use object storage
• Whenever we
▪ Check Facebook or Twitter
▪ Read Gmail
▪ Take pictures with Instagram
▪ Store docs on Dropbox
▪ Check SharePoint
Object Storage
• Object storage, also known as object-based storage, is a data storage architecture that manages and manipulates data as objects.
• This is unlike file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks.
• Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier.
[Figure: an object, made up of metadata and data]
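The three parts of an object listed above can be sketched as a small data type. This is only an illustration: the class name `StoredObject`, its field names, and the in-memory `store` dictionary are hypothetical and do not come from any particular object storage product.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Toy model of one object: payload, free-form metadata, unique ID."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    # A random UUID stands in for the globally unique identifier.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# A flat store: one dictionary keyed by object ID, no directories to traverse.
store = {}
photo = StoredObject(b"\x89PNG...", {"content-type": "image/png", "owner": "alice"})
store[photo.object_id] = photo

print(photo.object_id, store[photo.object_id].metadata)
```

The amount of metadata is up to the writer of the object, which is what "a variable amount of metadata" refers to.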
Object Storage is good for
▪ Unstructured data workloads
▪ Large capacity requirement (e.g., > 100s of Terabytes) – easily scaling out by adding more nodes
▪ Data archiving: documents, emails and backups
▪ Storage for photos, videos, virtual machine images
▪ But
▪ Need for granular security and multi-tenancy
▪ Need for automation, management, monitoring, and reporting tools
▪ Not suited to high-performance workloads
Object Storage Overview with architectural examples from Cloudian's
Block vs. Object Storage
Block
• Faster:
• For hot data
• Flash-optimized
• IOPS-centric
• VM optimized
Object
• Bigger:
• For cool/cloud data
• Object-based
• Scale-out (multi-PB)
• Software-centric
Block vs. Object
Block
• Data stored without any concept of data format or type
• The data is simply a series of 0s and 1s
• High-level applications or file systems to keep track of data location, context and meaning
Object
• Object consists of an object identifier (OID), data and metadata
• No object organization system (flat organization)
• Direct access to individual objects, no need to traverse directories
How to build an object storage system
Case 1: Swift
Architecture Overview
[Figure: a client request (PUT /v1/account/container/object) enters through a proxy, which consults the Rings. Three storage nodes are shown, each labeled with Proxy, Object Server, Container Server, and Account Server on top of its disks.]
Basically, two parts
Proxy Server:
• Exposes the Swift public (REST) API to users and streams data to and from the client upon request
Storage Nodes:
• Handle storage, replication, and management of objects, containers, and accounts
Proxy Server
▪ Shared-nothing architecture, can be scaled as needed
▪ Can place load balancer ahead of Proxy servers
▪ Objects are streamed between proxy server and client directly
▪ There is no cache in between
Storage Nodes: Storing & Retrieving data
▪ Flat namespace: accounts, containers and objects
▪ No nested directories
▪ Account: collection of containers
▪ List containers: GET /v1/accountname/
▪ Create container: PUT /v1/accountname/containername/
▪ Containers: collection of objects
▪ List objects: GET /v1/accountname/containername/
▪ Upload object: PUT /v1/accountname/containername/objectname
▪ Retrieve object: GET /v1/accountname/containername/objectname
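The request lines listed above all follow one pattern, which a short helper can make explicit. This is a sketch only: `swift_path` and `request_line` are hypothetical helper names, and the code builds request lines as strings without contacting any server or handling authentication.

```python
from urllib.parse import quote


def swift_path(account, container=None, obj=None):
    """Build the /v1/... path for an account, container, or object."""
    parts = ["v1", account]
    if container is not None:
        parts.append(container)
        if obj is not None:
            parts.append(obj)
    path = "/" + "/".join(quote(part) for part in parts)
    # Account and container paths end with a slash, as in the list above.
    return path if obj is not None else path + "/"


def request_line(action, account, container=None, obj=None):
    """Pair the HTTP method for an action with the matching path."""
    methods = {"list": "GET", "create": "PUT", "upload": "PUT", "retrieve": "GET"}
    return f"{methods[action]} {swift_path(account, container, obj)}"


print(request_line("list", "accountname"))
print(request_line("create", "accountname", "containername"))
print(request_line("upload", "accountname", "containername", "objectname"))
print(request_line("retrieve", "accountname", "containername", "objectname"))
```

The namespace is flat: there are exactly three levels, and the path never nests deeper than account, container, object.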
Object Server
▪ A very simple blob (i.e., binary large object) storage server that can store, retrieve and delete objects stored on local devices.
▪ How to store objects?
▪ Objects are stored as binary files on the filesystem (e.g., ext4)
▪ Where to store objects?
▪ Each object is stored under a path derived from the object name’s hash and the operation’s timestamp.
▪ Writes to objects?
▪ Last write always wins (LWW) and ensures that the latest object version will be served.
[Figure: the proxy talks to the object server, which keeps each object as a file on a local file system (ext3, ext4, btrfs).]
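The storage rules above (a path derived from the name's hash, a file per write named by timestamp, last write wins) can be sketched as a toy object server. This is not Swift's actual on-disk layout: the class `ToyObjectServer`, the directory scheme, and the file naming are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


class ToyObjectServer:
    """Stores each write as a file under <root>/<md5 of name>/<timestamp>.data."""

    def __init__(self, root):
        self.root = root

    def _object_dir(self, name):
        digest = hashlib.md5(name.encode()).hexdigest()
        return os.path.join(self.root, digest)

    def put(self, name, data, timestamp):
        obj_dir = self._object_dir(name)
        os.makedirs(obj_dir, exist_ok=True)
        # Zero-padded timestamps sort lexicographically in time order.
        filename = f"{timestamp:016.5f}.data"
        with open(os.path.join(obj_dir, filename), "wb") as f:
            f.write(data)

    def get(self, name):
        obj_dir = self._object_dir(name)
        newest = max(os.listdir(obj_dir))  # last write wins
        with open(os.path.join(obj_dir, newest), "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as root:
    server = ToyObjectServer(root)
    server.put("account/container/cat.jpg", b"new", timestamp=2.0)
    # An older write arriving late must not replace the newer version.
    server.put("account/container/cat.jpg", b"old", timestamp=1.0)
    latest = server.get("account/container/cat.jpg")

print(latest)
```

Because the winner is chosen by timestamp rather than by arrival order, replicas that receive the same writes in a different order still agree on which version to serve.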
Container Server
▪ The Container Server’s primary job is to handle listings of objects.
▪ It doesn’t know where those objects are, just what objects are in a specific container.
▪ The listings are stored as SQLite database files and replicated across the cluster, similar to how objects are.
▪ The Container Server also tracks statistics, including the total number of objects and the total storage usage for that container.
[Figure: the proxy talks to the container server, which keeps its listings in database files (db1, db2, db3).]
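A container listing kept in SQLite might look roughly like the sketch below. The table name, columns, and function names are hypothetical and are not Swift's real schema; the point is only that the database records which objects exist and how large they are, not where they live.

```python
import sqlite3

# One database per container; in-memory here to keep the sketch self-contained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER)")


def record_put(name, size):
    """Add or update the listing entry after an object upload."""
    db.execute("INSERT OR REPLACE INTO object (name, size) VALUES (?, ?)", (name, size))


def list_objects():
    """Return the names of all objects in this container."""
    return [row[0] for row in db.execute("SELECT name FROM object ORDER BY name")]


def container_stats():
    """Return the object count and total bytes tracked for this container."""
    count, total = db.execute(
        "SELECT COUNT(*), COALESCE(SUM(size), 0) FROM object"
    ).fetchone()
    return {"object_count": count, "bytes_used": total}


record_put("dog.jpg", 1024)
record_put("cat.jpg", 2048)
print(list_objects())
print(container_stats())
```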
Account Server
• The Account Server is very similar to the Container Server, except that it is responsible for listings of containers rather than objects.
[Figure: the proxy talks to the account server, which keeps its listings in database files (db1, db2, db3).]
The Rings
▪ The Rings: mapping data to physical locations in the cluster
▪ Three rings store three kinds of mappings (accounts, containers, and objects)
▪ Each ring works in the same way
▪ For a given account, container, or object name/path, the ring returns information on its physical location (i.e., device within a storage node)
▪ via the following two data structures:
▪ Device Look-up table: to find out which storage device (e.g., HDD or SSD) contains the target object
▪ Device List: to find out which storage node this device belongs to
Proxy GET /v1/account/container/object
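The two data structures named above can be sketched as two plain Python lists. This is a simplified illustration and not Swift's ring code: the partition count, node names, and device names are made up, there is only one replica per partition, and the way the path is hashed to a partition is an assumption.

```python
import hashlib

PART_POWER = 4  # 2**4 = 16 partitions; an illustrative value

# Device list: device id -> which storage node the device belongs to.
device_list = [
    {"id": 0, "node": "storage-node-1", "device": "sdb"},
    {"id": 1, "node": "storage-node-1", "device": "sdc"},
    {"id": 2, "node": "storage-node-2", "device": "sdb"},
    {"id": 3, "node": "storage-node-2", "device": "sdc"},
]

# Device look-up table: partition number -> device id.
lookup_table = [p % len(device_list) for p in range(2 ** PART_POWER)]


def partition_of(path):
    """Hash the path and keep the top PART_POWER bits as the partition."""
    digest = hashlib.md5(path.encode()).digest()
    top32 = int.from_bytes(digest[:4], "big")
    return top32 >> (32 - PART_POWER)


def locate(path):
    """Resolve a path to a device, then to the node that owns the device."""
    device_id = lookup_table[partition_of(path)]
    return device_list[device_id]


print(locate("/account/container/object"))
```

Rebuilding the ring only rewrites `lookup_table`; the hash-to-partition step never changes, so object hashes stay stable when devices come and go.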
Mapping using Basic Hash Functions
MAPPING OF OBJECTS TO DIFFERENT DRIVES
OBJECT   HASH VALUE (HEXADECIMAL)          MAPPING VALUE           DRIVE MAPPED TO
Image 1  b5e7d988cfdb78bc3be1a9c221a8f744  hash(Image 1) % 4 = 2   Drive 2
Image 2  943359f44dc87f6a16973c79827a038c  hash(Image 2) % 4 = 3   Drive 3
Image 3  1213f717f7f754f050d0246fb7d6c43b  hash(Image 3) % 4 = 3   Drive 3
Music 1  4b46f1381a53605fc0f93a93d55bf8be  hash(Music 1) % 4 = 1   Drive 1
Music 2  ecb27b466c32a56730298e55bcace257  hash(Music 2) % 4 = 0   Drive 0
Music 3  508259dfec6b1544f4ad6e4d52964f59  hash(Music 3) % 4 = 0   Drive 0
Movie 1  69db47ace5f026310ab170b02ac8bc58  hash(Movie 1) % 4 = 2   Drive 2
Movie 2  c4abbd49974ba44c169c220dadbdac71  hash(Movie 2) % 4 = 1   Drive 1
Problem?
▪ But what if we have to add/remove drives?
▪ The hash values of all objects stay the same, but we need to recompute the mapping value of every object and then remap the objects to different drives.
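A small experiment shows how disruptive this is. The sketch below uses synthetic object names and MD5 as the hash function (both are assumptions for illustration) and counts how many objects land on a different drive when a fifth drive is added. With a uniform hash, an object keeps its drive only when `hash % 4` equals `hash % 5`, which happens for about one object in five.

```python
import hashlib


def drive_for(name, num_drives):
    """Basic hash mapping: hash the name, then take it modulo the drive count."""
    hash_value = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return hash_value % num_drives


names = [f"object-{i}" for i in range(1000)]
before = {name: drive_for(name, 4) for name in names}
after = {name: drive_for(name, 5) for name in names}

moved = sum(1 for name in names if before[name] != after[name])
print(f"{moved} of {len(names)} objects change drive when going from 4 to 5 drives")
```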
SWIFT -- Consistent Hashing Algorithm
▪ Consistent hashing algorithm achieves a similar goal but does things differently.
▪ Instead of generating the mapping value of each object, each drive will be assigned a range of hash values to store the objects.
RANGE OF HASH VALUES FOR EACH DRIVE
DRIVE RANGE OF HASH VALUES
Drive 0 0000 ~ 3ffe
Drive 1 3fff ~ 7ffe
Drive 2 7fff ~ bffe
MAPPING OF OBJECTS TO DIFFERENT DRIVES
OBJECT HASH VALUE (HEXADECIMAL) DRIVE MAPPED TO
Image 1 b5e7d988cfdb78bc3be1a9c221a8f744 Drive 2
Image 2 943359f44dc87f6a16973c79827a038c Drive 2
Image 3 1213f717f7f754f050d0246fb7d6c43b Drive 0
Music 1 4b46f1381a53605fc0f93a93d55bf8be Drive 1
Music 2 ecb27b466c32a56730298e55bcace257 Drive 3
Music 3 508259dfec6b1544f4ad6e4d52964f59 Drive 1
Movie 1 69db47ace5f026310ab170b02ac8bc58 Drive 1
Movie 2 c4abbd49974ba44c169c220dadbdac71 Drive 3
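The range lookup behind this table can be sketched with a sorted list of upper bounds and a binary search. The bounds for Drives 0 to 2 are taken from the range table; the bound for Drive 3 is an assumption, inferred from the mapping table, which sends the hashes starting with ecb2 and c4ab to Drive 3. Only the first four hex digits of each hash are compared.

```python
from bisect import bisect_left

# Upper bound of each drive's range (first four hex digits of the hash).
# The last entry (Drive 3) is assumed; it is not listed in the range table.
UPPER_BOUNDS = [0x3FFE, 0x7FFE, 0xBFFE, 0xFFFF]


def drive_for_hash(hex_hash):
    """Return the index of the first drive whose upper bound covers the hash."""
    prefix = int(hex_hash[:4], 16)
    return bisect_left(UPPER_BOUNDS, prefix)


hashes = {
    "Image 1": "b5e7d988cfdb78bc3be1a9c221a8f744",
    "Image 2": "943359f44dc87f6a16973c79827a038c",
    "Image 3": "1213f717f7f754f050d0246fb7d6c43b",
    "Music 1": "4b46f1381a53605fc0f93a93d55bf8be",
    "Music 2": "ecb27b466c32a56730298e55bcace257",
    "Music 3": "508259dfec6b1544f4ad6e4d52964f59",
    "Movie 1": "69db47ace5f026310ab170b02ac8bc58",
    "Movie 2": "c4abbd49974ba44c169c220dadbdac71",
}

for name, hex_hash in hashes.items():
    print(name, "-> Drive", drive_for_hash(hex_hash))
```

Under that assumption the lookup reproduces the drive column of the mapping table.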
With New Device
▪ Each drive will get a new range of hash values it is going to store.
▪ Each object’s hash value will still remain the same.
▪ Any object whose hash value is still within the range of its current drive stays where it is.
▪ Any object whose hash value is no longer within the range of its current drive is mapped to another drive.
▪ But far fewer objects have to move with the consistent hashing algorithm than with the basic hash function.
RANGE OF HASH VALUES FOR EACH DRIVE
DRIVE RANGE OF HASH VALUES
Drive 0 0000… ~ 3fff…
Drive 1 3fff… ~ 7ffe…
Drive 2 7fff… ~ bffd…
Drive 3 bffd… ~ ffff…
Problem?
▪ Each drive has a large range of hash values
▪ Multiple objects may map to one drive
▪ Imbalance issue
Multiple Markers in Consistent Hashing Algorithm
▪ Instead of having one big hash range for each drive, multiple markers split that large hash range into smaller chunks
▪ Multiple markers help to distribute the objects evenly across drives, which helps with load balancing
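Consistent hashing with multiple markers can be sketched as a sorted list of marker positions. This is a generic textbook-style ring, not Swift's ring builder: the class `HashRing`, the marker count of 100 per drive, MD5 as the hash, and the synthetic object names are all assumptions.

```python
import hashlib
from bisect import bisect_right


def h(key):
    """Hash a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, drives, markers_per_drive=100):
        # Each drive contributes many markers scattered around the ring.
        self.markers = sorted(
            (h(f"{drive}#{i}"), drive)
            for drive in drives
            for i in range(markers_per_drive)
        )
        self.points = [point for point, _ in self.markers]

    def drive_for(self, name):
        # First marker clockwise from the object's hash, wrapping at the end.
        index = bisect_right(self.points, h(name)) % len(self.points)
        return self.markers[index][1]


names = [f"object-{i}" for i in range(2000)]
old = HashRing(["drive0", "drive1", "drive2", "drive3"])
new = HashRing(["drive0", "drive1", "drive2", "drive3", "drive4"])

moved = [n for n in names if old.drive_for(n) != new.drive_for(n)]
print(f"{len(moved)} of {len(names)} objects move, all of them to the new drive")
```

Adding a drive only inserts new markers, so an object either keeps its drive or moves to the new one; no object is shuffled between the existing drives.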
In Summary: What is Ring doing?
▪ Evenly mapping data to physical locations in the cluster
▪ with consistent hashing algorithm and multiple markers techniques
▪ Build (re-build) Look-up table
▪ from object hash value to device
▪ Maintain device list
▪ to identify the device location – storage node
Data durability
Ensuring your data is still the same for ages
▪ Proxy returns data only if content matches stored checksum
▪ Continuously running background processes
▪ Auditors: ensuring there is no bit-rot
▪ Quarantining replicas if checksum mismatch
▪ Replicators: ensuring every object is stored multiple times on remote nodes (for replication)
▪ Reconstructors: re-computing missing erasure-coding fragments (for erasure coding) or creating a new replica if one replica is compromised (for replication)
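The audit-and-repair cycle described above can be sketched for the replication case. This is a toy model, not Swift's auditor or replicator: replicas are plain dictionaries, and the function names `audit` and `repair`, the node names, and the use of MD5 as the checksum are assumptions for illustration.

```python
import hashlib


def checksum(data):
    return hashlib.md5(data).hexdigest()


def audit(replicas):
    """Recompute each replica's checksum; quarantine the ones that mismatch."""
    healthy, quarantined = [], []
    for replica in replicas:
        if checksum(replica["data"]) == replica["stored_checksum"]:
            healthy.append(replica)
        else:
            quarantined.append(replica)
    return healthy, quarantined


def repair(healthy, quarantined):
    """Replace each quarantined copy with a fresh copy of a healthy replica."""
    source = healthy[0]
    return healthy + [dict(source, node=bad["node"]) for bad in quarantined]


original = b"holiday photo"
replicas = [
    {"node": "node-a", "data": original, "stored_checksum": checksum(original)},
    {"node": "node-b", "data": b"holiday phoXo", "stored_checksum": checksum(original)},
    {"node": "node-c", "data": original, "stored_checksum": checksum(original)},
]

healthy, quarantined = audit(replicas)
replicas = repair(healthy, quarantined)
print([r["node"] for r in quarantined], len(replicas))
```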
Failure domains
Ensuring high availability and durability
[Figure, built up over four slides: six storage nodes, each shown as a proxy with three disks (Disk 0 to Disk 17), hold three replicas of an object. The nodes are grouped into Zone1, Zone2, and Zone3, and the zones into Region 1 and Region 2.]
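Placing three replicas across failure domains can be sketched as a greedy choice that prefers zones not used yet. This is a simplified illustration, not Swift's placement algorithm: the assignment of six disks to each zone and the function `place_replicas` are assumptions, and regions are left out to keep the example short.

```python
# 18 disks, three per storage node; the zone layout (six disks per zone)
# is assumed for this sketch.
disks = [
    {"disk": d, "node": d // 3, "zone": d // 6 + 1}
    for d in range(18)
]


def place_replicas(disks, start, replicas=3):
    """Walk the disks from `start`, preferring zones that hold no replica yet."""
    order = disks[start:] + disks[:start]
    chosen = []
    for candidate in order:
        if candidate["zone"] not in {c["zone"] for c in chosen}:
            chosen.append(candidate)
        if len(chosen) == replicas:
            return chosen
    # Fewer zones than replicas: fill up with any disk not chosen yet.
    for candidate in order:
        if candidate not in chosen:
            chosen.append(candidate)
        if len(chosen) == replicas:
            break
    return chosen


placement = place_replicas(disks, start=4)
print([(p["disk"], p["zone"]) for p in placement])
```

With one replica per zone, losing a whole zone still leaves two copies of every object.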
Re-Balancing
[Figure: the same six nodes (Disk 0 to Disk 17) across Zone1 to Zone3 and Region 1 to Region 2; data is re-balanced to ensure a third replica.]
Explore More
• https://docs.openstack.org/swift/latest/
How to build an object storage system
Case 2: Ceph
Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th symposium
System Overview
Client Operation
▪ Ceph interface
▪ Nearly POSIX
▪ Decoupled data and metadata operation
▪ User space implementation
▪ FUSE or directly linked
▪ Filesystem in Userspace (FUSE) is a software interface for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code.
Key Features
▪ Decoupled data and metadata
▪ CRUSH
▪ Files striped onto predictably named objects
▪ CRUSH maps objects to storage devices
▪ Dynamic Distributed Metadata Management
▪ Dynamic subtree partitioning
▪ Distributes metadata amongst MDSs
▪ Object-based storage
▪ OSDs handle migration, replication, failure detection and recovery
An Example For Accessing Ceph Storage
▪ A Client sends an open request to MDS
▪ MDS returns capability, file inode, file size and stripe information
▪ The client reads/writes directly from/to OSDs
▪ In the end, the client sends a close request, and provides details to MDS
Distributed Metadata
▪ “Metadata operations often make up as much as half of file system workloads…”
▪ Effective metadata management is critical to overall system performance
Dynamic Subtree Partitioning
▪ Lets Ceph dynamically share metadata workload among tens or hundreds of metadata servers (MDSs)
▪ Sharing is dynamic and based on current access patterns
▪ Results in near-linear performance scaling in the number of MDSs
Distributed Object Storage
▪ Files are split across objects
▪ Objects are members of placement groups
▪ Placement groups are distributed across OSDs
▪ Hashing 1: Ceph first maps objects into placement groups (PGs) using a hash function
▪ Hashing 2: Placement groups are then assigned to OSDs using a pseudo-random function (CRUSH)
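The two-step mapping from objects to placement groups to OSDs can be sketched as follows. The second step is deliberately not CRUSH: it uses rendezvous (highest-random-weight) hashing as a simple stand-in for a deterministic pseudo-random placement function. The object naming format, the PG count, and the OSD names are hypothetical.

```python
import hashlib


def stable_hash(text):
    return int(hashlib.sha256(text.encode()).hexdigest(), 16)


def object_to_pg(object_name, pg_count):
    """Step 1: hash the object name into a placement group."""
    return stable_hash(object_name) % pg_count


def pg_to_osds(pg_id, osds, replicas=3):
    """Step 2 (stand-in for CRUSH): rank OSDs by a per-PG pseudo-random score."""
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg_id}:{osd}"), reverse=True)
    return ranked[:replicas]


osds = [f"osd.{i}" for i in range(8)]
# A file is striped over several objects; this naming format is illustrative.
object_names = [f"inode1234.{stripe:08d}" for stripe in range(4)]

for name in object_names:
    pg = object_to_pg(name, pg_count=64)
    print(name, "-> pg", pg, "->", pg_to_osds(pg, osds))
```

Any client can run the same two functions and reach the same answer, so no central table of object locations is needed.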
CRUSH
S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC ’06), Tampa, FL, Nov. 2006. ACM
Replication
• Objects are replicated on OSDs within the same PG
• How writes are performed
▪ Primary forwards updates to other replicas
▪ Sends ACK to client once all replicas have received the update
▪ Slow but safe
▪ Replicas send final commit once they have committed update to disk
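The write path above has two phases: an ACK once every replica has received the update, and a final commit once every replica has it on disk. A toy model makes the ordering explicit. The class `Replica`, the function `write`, and the event strings are hypothetical, and everything runs in one process with no real networking or disks.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.memory = {}  # updates received but not yet on disk
        self.disk = {}    # updates committed to disk

    def receive(self, key, value):
        self.memory[key] = value
        return "ack"

    def commit(self, key):
        self.disk[key] = self.memory.pop(key)
        return "commit"


def write(primary, others, key, value):
    """The primary applies the update and forwards it to the other replicas."""
    events = []
    group = [primary] + others
    acks = [replica.receive(key, value) for replica in group]
    if all(ack == "ack" for ack in acks):
        events.append("ack to client")      # every replica has received the update
    commits = [replica.commit(key) for replica in group]
    if all(commit == "commit" for commit in commits):
        events.append("commit to client")   # every replica has it on disk
    return events


primary = Replica("osd.0")
others = [Replica("osd.1"), Replica("osd.2")]
events = write(primary, others, "obj-1", b"payload")
print(events)
```

Waiting for every replica before acknowledging is what makes the scheme slow but safe.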
Conclusion
▪ Ceph and Swift share some similar concepts, though they are implemented differently
▪ How to identify objects (Rings vs. CRUSH)
▪ How to distribute objects evenly (Rings vs. CRUSH)
Written (presentation) Assignment 3
▪Similar to previous two
▪Especially to those who have never presented before
Sources
• 1. Christian Schwede, Forget everything you knew about Swift Rings, https://www.openstack.org/assets/presentation-media/Rings201.pdf
• 2. Swift 101 https://www.youtube.com/watch?v=vAEU0Ld-GIU&feature=youtu.be
• 3. Ceph 101 https://www.youtube.com/watch?v=OyH1C0C4HzM