CS-552/452 Introduction to Cloud Computing. 20. Cloud Object Storage

(1)


(2)

When we use object storage

• Whenever we

▪ Check Facebook or Twitter

▪ Check Gmail

▪ Take pictures with Instagram

▪ Keep docs on Dropbox

▪ Check SharePoint

(3)

Object Storage

• Object storage, also known as object-based storage, is a storage architecture that manages and manipulates data as objects.

• This differs from file systems, which manage data as a file hierarchy, and from block storage, which manages data as blocks within sectors and tracks.

• Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier.

[Figure: an object, consisting of metadata and data]

(4)

Object Storage is good for

▪ Unstructured data workloads

▪ Large capacity requirements (e.g., hundreds of terabytes or more), met by scaling out: adding more nodes

▪ Data archiving: documents, emails and backups

▪ Storage for photos, videos, virtual machine images

▪ But it comes with

▪ A need for granular security and multi-tenancy

▪ A need for automation, management, monitoring, and reporting tools

▪ Lower performance (not suited to high-performance workloads)

Object Storage Overview with architectural examples from Cloudian's

(5)

Block vs. Object Storage

Block

• Faster:

• For hot data

• Flash-optimized

• IOPS-centric

• VM optimized

Object

• Bigger:

• For cool/cloud data

• Object-based

• Scale-out (multi-PB)

• Software-centric

(6)

Block vs. Object

Block

• Data stored without any concept of data format or type

• The data is simply a series of 0s and 1s

• Relies on higher-level applications or file systems to keep track of data location, context, and meaning

Object

• Object consists of an object identifier (OID), data and metadata

• No object organization system (flat organization)

• Direct access to individual objects, no need to traverse directories
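The object model above (identifier, data, metadata, flat organization) can be sketched in a few lines. This is only an illustration: the names `StoredObject` and `FlatObjectStore` are hypothetical, and the identifier is assumed to be a random UUID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: the data itself, free-form metadata, and a unique ID."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    # Assumption: a random UUID stands in for the globally unique OID.
    oid: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatObjectStore:
    """Flat organization: a single OID -> object map, no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.oid] = obj
        return obj.oid

    def get(self, oid):
        # Direct access by identifier; nothing to traverse.
        return self._objects[oid]


store = FlatObjectStore()
oid = store.put(b"\x89PNG...", content_type="image/png", owner="alice")
photo = store.get(oid)
```

A block device, by contrast, would expose only numbered blocks and leave the metadata to a file system or application.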

(7)

How to build an object storage system

Case 1: Swift

(8)

(9)

Architecture Overview

[Figure: Swift architecture: proxies, the rings, and object, container, and account servers with their disks, handling PUT /v1/account/container/object]

(10)

Basically, two parts

Proxy Server:

•Exposes the Swift public (REST) API to users and streams objects to and from the client on request

Storage Nodes:

•Handle storage, replication, and management of objects, containers, and accounts.

(11)

Proxy Server

▪ Shared-nothing architecture, can be scaled as needed

▪ A load balancer can be placed in front of the proxy servers

▪ Objects are streamed between proxy server and client directly

▪ There is no cache in between


(12)

Storage Nodes: Storing & Retrieving data

▪ Flat namespace: accounts, containers and objects

▪ No nested directories

▪ Account: collection of containers

▪ List containers: GET /v1/accountname/

▪ Create container: PUT /v1/accountname/containername/

▪ Containers: collection of objects

▪ List objects: GET /v1/accountname/containername/

▪ Upload object: PUT /v1/accountname/containername/objectname

▪ Retrieve object: GET /v1/accountname/containername/objectname
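The request paths listed above follow one pattern, which a small helper can make explicit. This is a sketch only: `swift_path` is a hypothetical helper, not part of any Swift client library, and it builds the path string without sending a request.

```python
def swift_path(account, container=None, obj=None):
    """Build a /v1/... path for an account, a container, or an object."""
    if obj is not None and container is None:
        raise ValueError("an object always lives inside a container")
    parts = ["v1", account]
    if container is not None:
        parts.append(container)
    if obj is not None:
        parts.append(obj)
        return "/" + "/".join(parts)
    # Account and container paths end with a slash, as on the slide.
    return "/" + "/".join(parts) + "/"


requests = [
    ("GET", swift_path("accountname")),                          # list containers
    ("PUT", swift_path("accountname", "containername")),         # create container
    ("GET", swift_path("accountname", "containername")),         # list objects
    ("PUT", swift_path("accountname", "containername", "objectname")),  # upload
    ("GET", swift_path("accountname", "containername", "objectname")),  # retrieve
]
```

The namespace has exactly three levels, so the path never grows deeper than account/container/object.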

(13)

Object Server

▪ A very simple blob (i.e., binary large object) storage server that can store, retrieve and delete objects stored on local devices.

▪ How to store objects?

▪ Objects are stored as binary files on the filesystem (e.g., ext4)

▪ Where to store objects?

▪ Each object is stored under a path derived from the object name’s hash and the operation’s timestamp.

▪ Writes to objects?

▪ Last write always wins (LWW) and ensures that the latest object version will be served.

[Figure: the proxy talks to the object server, which stores each object as a file on a local file system (ext3, ext4, btrfs)]
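The "where to store" and "last write wins" points can be sketched as follows. The directory layout, the use of MD5, and the `.data` suffix are assumptions chosen for illustration; Swift's real on-disk layout has more components.

```python
import hashlib


def object_dir(account, container, obj):
    """Directory derived from the hash of the full object name."""
    name_hash = hashlib.md5(f"/{account}/{container}/{obj}".encode()).hexdigest()
    # Hypothetical layout: a short suffix directory, then the full hash.
    return f"objects/{name_hash[-3:]}/{name_hash}"


def data_file(account, container, obj, timestamp):
    """Each write becomes a file named after the operation's timestamp."""
    return f"{object_dir(account, container, obj)}/{timestamp:016.5f}.data"


def latest(files):
    """Last write wins: serve the file with the newest timestamp."""
    def stamp(path):
        filename = path.rsplit("/", 1)[1]
        return float(filename[: -len(".data")])
    return max(files, key=stamp)


first = data_file("acct", "photos", "cat.jpg", 1000.0)
second = data_file("acct", "photos", "cat.jpg", 2000.0)
served = latest([second, first])
```

Both writes land in the same directory because the name hash is identical; only the timestamp in the file name differs.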

(14)

Container Server

▪ The Container Server’s primary job is to handle listings of objects.

▪ It doesn’t know where those objects are, just what objects are in a specific container.

▪ The listings are stored as SQLite database files and replicated across the cluster in the same way objects are.

▪ Statistics are also tracked, including the total number of objects and the total storage used by that container.

[Figure: the proxy talks to the container server, whose listings are kept in replicated databases db1, db2, db3]
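A container listing with its statistics can be sketched with Python's built-in `sqlite3` module. The table name and columns below are hypothetical and are not Swift's actual schema.

```python
import sqlite3

# One database per container; an in-memory database keeps the sketch self-contained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER NOT NULL)")


def record_put(name, size):
    """Record that an object exists; the container does not know where it lives."""
    db.execute("INSERT OR REPLACE INTO object (name, size) VALUES (?, ?)", (name, size))


def listing():
    """Names of all objects in this container, in sorted order."""
    return [row[0] for row in db.execute("SELECT name FROM object ORDER BY name")]


def stats():
    """(object count, total bytes used) for this container."""
    return db.execute("SELECT COUNT(*), COALESCE(SUM(size), 0) FROM object").fetchone()


record_put("dog.jpg", 200)
record_put("cat.jpg", 100)
record_put("cat.jpg", 150)   # overwriting updates the row instead of adding one
```

An account database would look the same, with containers as the rows instead of objects.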

(15)

Account Server

• The Account Server is very similar to the Container Server, except that it is responsible for listings of containers rather than objects.

[Figure: the proxy talks to the account server, whose listings are kept in replicated databases db1, db2, db3]

(16)

Architecture Overview

[Figure: Swift architecture (proxies, the rings, object server, container server, account server) handling PUT /v1/account/container/object]

(17)

The Rings

▪ The Rings: mapping data to physical locations in the cluster

▪ Three rings store three kinds of mappings (accounts, containers, and objects)

▪ Each ring works in the same way

▪ For a given account, container, or object name/path, the ring returns information on its physical location (i.e., device within a storage node)

▪ via the following two data structures:

▪ Device Look-up table: to find out which storage device (e.g., HDD or SSD) contains the target object

▪ Device List: to find out which storage node this device belongs to

[Figure: a proxy handling GET /v1/account/container/object]
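The two data structures can be sketched as a pair of lookups. Several details here are assumptions made for illustration: the hash is MD5, the hash space is cut into 16 partitions, each partition has a single device (no replicas), and the node addresses and device names are invented.

```python
import hashlib

PART_POWER = 4  # assumption: 2**4 = 16 partitions of the hash space

# Device look-up table: partition number -> device id.
DEVICE_LOOKUP = [part % 4 for part in range(2 ** PART_POWER)]

# Device list: device id -> which storage node the device belongs to.
DEVICE_LIST = [
    {"id": 0, "node": "10.0.0.1", "device": "sdb"},
    {"id": 1, "node": "10.0.0.1", "device": "sdc"},
    {"id": 2, "node": "10.0.0.2", "device": "sdb"},
    {"id": 3, "node": "10.0.0.2", "device": "sdc"},
]


def partition_of(path):
    """Top PART_POWER bits of the hash of the account/container/object path."""
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)


def locate(path):
    """Ring lookup: path -> partition -> device -> storage node."""
    device_id = DEVICE_LOOKUP[partition_of(path)]
    device = DEVICE_LIST[device_id]
    return device["node"], device["device"]


node, device = locate("/account/container/object")
```

Because the lookup is a pure function of the path and the two tables, any proxy holding a copy of the ring computes the same location.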

(18)

Mapping using Basic Hash Functions

MAPPING OF OBJECTS TO DIFFERENT DRIVES

OBJECT    HASH VALUE (HEXADECIMAL)           MAPPING VALUE             DRIVE MAPPED TO
Image 1   b5e7d988cfdb78bc3be1a9c221a8f744   hash(Image 1) % 4 = 2     Drive 2
Image 2   943359f44dc87f6a16973c79827a038c   hash(Image 2) % 4 = 3     Drive 3
Image 3   1213f717f7f754f050d0246fb7d6c43b   hash(Image 3) % 4 = 3     Drive 3
Music 1   4b46f1381a53605fc0f93a93d55bf8be   hash(Music 1) % 4 = 1     Drive 1
Music 2   ecb27b466c32a56730298e55bcace257   hash(Music 2) % 4 = 0     Drive 0
Music 3   508259dfec6b1544f4ad6e4d52964f59   hash(Music 3) % 4 = 0     Drive 0
Movie 1   69db47ace5f026310ab170b02ac8bc58   hash(Movie 1) % 4 = 2     Drive 2
Movie 2   c4abbd49974ba44c169c220dadbdac71   hash(Movie 2) % 4 = 1     Drive 1

Problem?

(19)

Problem?

▪ But what if we have to add/remove drives?

▪ The hash values of all objects stay the same, but we must recompute the mapping value (hash % number of drives) for every object and then re-map most of them to different drives.
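A quick experiment shows how costly this is. The sketch below assumes MD5 as the hash function and invented object names; it counts how many objects change drive when a fifth drive is added to four.

```python
import hashlib


def drive_for(name, num_drives):
    """Basic hash mapping: hash(name) % number of drives."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % num_drives


names = [f"object-{i}" for i in range(1000)]

# An object stays put only if hash % 4 == hash % 5.
moved = sum(drive_for(n, 4) != drive_for(n, 5) for n in names)
fraction_moved = moved / len(names)
```

An object keeps its drive only when its hash modulo 20 is 0, 1, 2, or 3, so in expectation about four out of five objects must be moved.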

(20)

SWIFT -- Consistent Hashing Algorithm

▪ The consistent hashing algorithm achieves a similar goal but works differently.

▪ Instead of computing a mapping value for each object, it assigns each drive a range of hash values; a drive stores the objects whose hash values fall within its range.

RANGE OF HASH VALUES FOR EACH DRIVE

DRIVE     RANGE OF HASH VALUES
Drive 0   0000 ~ 3ffe
Drive 1   3fff ~ 7ffe
Drive 2   7fff ~ bffe
Drive 3   bfff ~ ffff

(21)

MAPPING OF OBJECTS TO DIFFERENT DRIVES

OBJECT HASH VALUE (HEXADECIMAL) DRIVE MAPPED TO

Image 1 b5e7d988cfdb78bc3be1a9c221a8f744 Drive 2

Image 2 943359f44dc87f6a16973c79827a038c Drive 2

Image 3 1213f717f7f754f050d0246fb7d6c43b Drive 0

Music 1 4b46f1381a53605fc0f93a93d55bf8be Drive 1

Music 2 ecb27b466c32a56730298e55bcace257 Drive 3

Music 3 508259dfec6b1544f4ad6e4d52964f59 Drive 1

Movie 1 69db47ace5f026310ab170b02ac8bc58 Drive 1

Movie 2 c4abbd49974ba44c169c220dadbdac71 Drive 3
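The range lookup behind this table can be sketched with a binary search over the upper bound of each drive's range. Two assumptions: only the first four hex digits of the hash are compared, and Drive 3 covers bfff ~ ffff, which is inferred from the mapping table.

```python
import bisect

# Upper bound (inclusive) of the hash range owned by Drive 0, 1, 2, 3.
UPPER_BOUNDS = [0x3FFE, 0x7FFE, 0xBFFE, 0xFFFF]

HASHES = {
    "Image 1": "b5e7d988cfdb78bc3be1a9c221a8f744",
    "Image 2": "943359f44dc87f6a16973c79827a038c",
    "Image 3": "1213f717f7f754f050d0246fb7d6c43b",
    "Music 1": "4b46f1381a53605fc0f93a93d55bf8be",
    "Music 2": "ecb27b466c32a56730298e55bcace257",
    "Music 3": "508259dfec6b1544f4ad6e4d52964f59",
    "Movie 1": "69db47ace5f026310ab170b02ac8bc58",
    "Movie 2": "c4abbd49974ba44c169c220dadbdac71",
}


def drive_for_hash(hex_hash):
    """Index of the first range whose upper bound is >= the hash prefix."""
    prefix = int(hex_hash[:4], 16)
    return bisect.bisect_left(UPPER_BOUNDS, prefix)


mapping = {name: drive_for_hash(value) for name, value in HASHES.items()}
```

Running it reproduces the "DRIVE MAPPED TO" column: Image 1 and Image 2 on Drive 2, Image 3 on Drive 0, Music 1, Music 3, and Movie 1 on Drive 1, Music 2 and Movie 2 on Drive 3.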

(22)

With New Device

▪ Each drive will get a new range of hash values it is going to store.

▪ Each object’s hash value will still remain the same.

▪ Any object whose hash value is still within the range of its current drive stays where it is.

▪ Any object whose hash value is no longer within the range of its current drive is mapped to another drive.

▪ The number of objects that must move is far smaller with the consistent hashing algorithm than with the basic hash function.

RANGE OF HASH VALUES FOR EACH DRIVE

DRIVE RANGE OF HASH VALUES

Drive 0 0000… ~ 3fff…

Drive 1 3fff… ~ 7ffe…

Drive 2 7fff… ~ bffd…

Drive 3 bffd… ~ ffff…
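The limited movement can be checked with a small experiment. This sketch uses a common variant of consistent hashing rather than the exact range tables above: each drive places one marker on the hash ring at the hash of its name, and an object belongs to the first marker at or after its own hash. The drive and object names and the use of MD5 are assumptions.

```python
import bisect
import hashlib


def h(text):
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class Ring:
    """Consistent-hash ring with one marker per drive."""

    def __init__(self, drives):
        self._markers = sorted((h(drive), drive) for drive in drives)
        self._points = [point for point, _ in self._markers]

    def drive_for(self, key):
        # First marker at or after the key's hash, wrapping around the ring.
        i = bisect.bisect_left(self._points, h(key)) % len(self._points)
        return self._markers[i][1]


keys = [f"object-{i}" for i in range(2000)]
before = Ring(["drive0", "drive1", "drive2", "drive3"])
after = Ring(["drive0", "drive1", "drive2", "drive3", "drive4"])

moved = [k for k in keys if before.drive_for(k) != after.drive_for(k)]
destinations = {after.drive_for(k) for k in moved}
```

Every object that moves goes to the new drive; objects never shuffle between the existing drives, which is what keeps the re-mapping cost low.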

(23)

Problem?

▪ Each drive has a large range of hash values

▪ Multiple objects may map to one drive

▪ Imbalance issue

(24)

Multiple Markers in Consistent Hashing Algorithm

▪ Instead of one big hash range per drive, multiple markers split that large range into smaller chunks

▪ Multiple markers help distribute the objects evenly across the drives, which helps with load balancing
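The effect of multiple markers can be sketched by giving each drive many points on the hash ring instead of one. The marker naming scheme (`drive#i`), the marker counts, and the use of MD5 are assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter

DRIVES = ["drive0", "drive1", "drive2", "drive3"]
KEYS = [f"object-{i}" for i in range(4000)]


def h(text):
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class MarkerRing:
    """Consistent-hash ring where each drive owns several markers."""

    def __init__(self, drives, markers_per_drive):
        self._markers = sorted(
            (h(f"{drive}#{i}"), drive)
            for drive in drives
            for i in range(markers_per_drive)
        )
        self._points = [point for point, _ in self._markers]

    def drive_for(self, key):
        i = bisect.bisect_left(self._points, h(key)) % len(self._points)
        return self._markers[i][1]


def shares(ring):
    """Fraction of KEYS that each drive ends up storing."""
    counts = Counter(ring.drive_for(k) for k in KEYS)
    return {drive: counts[drive] / len(KEYS) for drive in DRIVES}


one_marker = shares(MarkerRing(DRIVES, 1))
many_markers = shares(MarkerRing(DRIVES, 200))
```

With one marker per drive the shares depend on where four points happen to fall; with 200 markers per drive each drive's share is the sum of many small chunks and settles close to one quarter.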

(25)

In Summary: What is Ring doing?

▪ Evenly mapping data to physical locations in the cluster

▪ using the consistent hashing algorithm and the multiple-markers technique

▪ Build (and re-build) the look-up table

▪ from object hash value to device

▪ Maintain device list

▪ to identify the device location – storage node

(26)

(27)

Data durability

Ensuring your data is still the same for ages

▪ Proxy returns data only if content matches stored checksum

▪ Continuously running background processes

▪ Auditors: ensuring there is no bit-rot

▪ Quarantining replicas if checksum mismatch

▪ Replicators: ensuring every object is stored multiple times on remote nodes (for replication)

▪ Reconstructors: re-computing missing erasure-coding fragments (for erasure coding) or creating a new replica if one replica is compromised (for replication)
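The audit-and-repair cycle above can be sketched in memory. This is a simplification under stated assumptions: MD5 is the checksum, replicas are plain byte strings keyed by an invented location label, and `audit` and `repair` are hypothetical function names.

```python
import hashlib


def checksum(data):
    return hashlib.md5(data).hexdigest()


def audit(replicas, expected):
    """Split replicas into healthy ones and ones to quarantine (checksum mismatch)."""
    healthy, quarantined = {}, {}
    for location, data in replicas.items():
        target = healthy if checksum(data) == expected else quarantined
        target[location] = data
    return healthy, quarantined


def repair(healthy, quarantined):
    """Replace each quarantined replica with a copy of a healthy one."""
    good_copy = next(iter(healthy.values()))
    restored = dict(healthy)
    for location in quarantined:
        restored[location] = good_copy
    return restored


original = b"holiday photo bytes"
stored_checksum = checksum(original)
replicas = {
    "node1/disk0": original,
    "node3/disk7": b"holiday photo bytEs",   # simulated bit-rot
    "node5/disk14": original,
}

healthy, quarantined = audit(replicas, stored_checksum)
replicas = repair(healthy, quarantined)
```

The same checksum comparison is what lets the proxy refuse to return content that no longer matches what was stored.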

(28)

Failure domains

Ensuring high availability and durability

[Figure: six storage nodes, each with a proxy and three disks (Disk 0 to Disk 17), holding three replicas of an object]

(29)

Failure domains

[Figure: the storage nodes and their disks grouped into Zone1 and Zone2]

(30)


(31)

Failure domains

[Figure: the storage nodes and their disks grouped into Zone1, Zone2, and Zone3, which are in turn grouped into Region 1 and Region 2]
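One way to use failure domains is to keep replicas of the same object in different zones, so that losing a zone costs at most one replica. The sketch below is a simplified greedy placement, not Swift's actual ring-building algorithm, and the assignment of disks to nodes, zones, and regions is an assumption made for illustration.

```python
# Assumed layout: 18 disks, 3 per node, 2 nodes per zone; Zone 3 is in Region 2.
DEVICES = [
    {
        "disk": disk,
        "node": disk // 3,
        "zone": disk // 6 + 1,
        "region": 1 if disk // 6 < 2 else 2,
    }
    for disk in range(18)
]


def place_replicas(devices, replicas=3):
    """Pick devices for the replicas, preferring zones not used yet."""
    chosen, used_zones = [], set()
    # First pass: at most one device per zone.
    for dev in devices:
        if len(chosen) == replicas:
            break
        if dev["zone"] not in used_zones:
            chosen.append(dev)
            used_zones.add(dev["zone"])
    # Second pass: if there are fewer zones than replicas, reuse zones.
    for dev in devices:
        if len(chosen) == replicas:
            break
        if dev not in chosen:
            chosen.append(dev)
    return chosen


placement = place_replicas(DEVICES)
zones_used = {dev["zone"] for dev in placement}
```

With three zones available, the three replicas land in three different zones and span both regions.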

(32)

(33)

Re-Balancing

[Figure: the storage nodes and their disks (Disk 0 to Disk 17) in Zone1, Zone2, and Zone3 across Region 1 and Region 2; data is re-balanced to ensure a third replica]

(34)

Explore More

• https://docs.openstack.org/swift/latest/

(35)

How to build an object storage system

Case 2: Ceph

Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th symposium

(36)

System Overview

(37)

Client Operation

▪ Ceph interface

▪ Nearly POSIX

▪ Decoupled data and metadata operation

▪ User space implementation

▪ FUSE or directly linked

▪ Filesystem in Userspace (FUSE) is a software interface for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code.

(38)

Key Features

▪ Decoupled data and metadata

▪ CRUSH

▪ Files striped onto predictably named objects

▪ CRUSH maps objects to storage devices

▪ Dynamic Distributed Metadata Management

▪ Dynamic subtree partitioning

▪ Distributes metadata amongst MDSs

▪ Object-based storage

▪ OSDs handle migration, replication, failure detection and recovery

(39)

An Example For Accessing Ceph Storage

▪ A client sends an open request to the MDS

▪ The MDS returns a capability, the file inode, the file size, and stripe information

▪ The client reads/writes directly from/to the OSDs

▪ Finally, the client sends a close request and reports the details to the MDS
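The open, direct I/O, close sequence can be sketched with toy classes. Everything concrete here is an assumption for illustration: the class names, the 4-byte stripe unit, the `inode.stripe` object naming, and the round-robin choice of OSD (Ceph uses CRUSH for placement).

```python
class MetadataServer:
    """Toy MDS: holds only metadata, never file contents."""

    def __init__(self):
        self._files = {}
        self._next_inode = 1

    def open(self, path):
        if path not in self._files:
            self._files[path] = {"inode": self._next_inode, "size": 0, "stripe_unit": 4}
            self._next_inode += 1
        return {"capability": "read/write", **self._files[path]}

    def close(self, path, size):
        self._files[path]["size"] = size   # the client reports the details


class Client:
    def __init__(self, mds, osds):
        self.mds, self.osds = mds, osds

    def _osd_for(self, stripe_no):
        return self.osds[stripe_no % len(self.osds)]   # stand-in for CRUSH

    def write(self, path, data):
        info = self.mds.open(path)
        unit = info["stripe_unit"]
        for n, start in enumerate(range(0, len(data), unit)):
            # Data goes straight to the OSDs, bypassing the MDS.
            self._osd_for(n)[f"{info['inode']}.{n}"] = data[start:start + unit]
        self.mds.close(path, len(data))

    def read(self, path):
        info = self.mds.open(path)
        stripes = -(-info["size"] // info["stripe_unit"])   # ceiling division
        data = b"".join(self._osd_for(n)[f"{info['inode']}.{n}"] for n in range(stripes))
        self.mds.close(path, info["size"])
        return data


mds = MetadataServer()
osds = [{}, {}, {}]          # each OSD is modeled as a name -> bytes map
client = Client(mds, osds)
client.write("/a.txt", b"hello ceph!")
contents = client.read("/a.txt")
```

The MDS is contacted only at open and close, which is the decoupling of data and metadata operations in miniature.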

(40)

Distributed Metadata

▪ “Metadata operations often make up as much as half of file system workloads…”

▪ Effective metadata management is critical to overall system performance

(41)

Dynamic Subtree Partitioning

▪ Lets Ceph dynamically share metadata workload among tens or hundreds of metadata servers (MDSs)

▪ Sharing is dynamic and based on current access patterns

▪ Results in near-linear performance scaling in the number of MDSs

(42)

Distributed Object Storage

Hashing 1: Ceph first maps objects into placement groups (PGs) using a hash function

Hashing 2: Placement groups are then assigned to OSDs using a pseudo-

▪ Files are split across objects

▪ Objects are members of placement groups

▪ Placement groups are distributed across OSDs.
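The two-step mapping can be sketched as follows. The second step here is highest-random-weight (rendezvous) hashing, used only as a simple deterministic stand-in; it is not the CRUSH algorithm. The PG count, the OSD names, the replica count, and the use of SHA-256 are likewise assumptions.

```python
import hashlib

PG_NUM = 8
OSDS = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
REPLICAS = 3


def _h(text):
    return int(hashlib.sha256(text.encode()).hexdigest(), 16)


def object_to_pg(name):
    """Step 1: hash the object name into a placement group."""
    return _h(name) % PG_NUM


def pg_to_osds(pg):
    """Step 2: deterministically pick OSDs for a placement group."""
    # Rank every OSD by a pseudo-random score for this PG; keep the top few.
    ranked = sorted(OSDS, key=lambda osd: _h(f"{pg}:{osd}"), reverse=True)
    return ranked[:REPLICAS]


def locate(name):
    pg = object_to_pg(name)
    return pg, pg_to_osds(pg)


pg, osds_for_object = locate("1.0")
```

Because both steps are pure functions, any client can compute an object's location without asking a central directory.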

(43)

CRUSH

S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC ’06), Tampa, FL, Nov. 2006. ACM

(44)

Replication

• Objects are replicated on OSDs within the same PG

• How writes are performed

▪ Primary forwards updates to other replicas

▪ Sends ACK to client once all replicas have received the update

▪ Slow but safe

▪ Replicas send final commit once they have committed the update to disk
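The write path above can be sketched with toy OSDs that separate "received" (in memory) from "committed" (on disk). The class and function names are hypothetical, and failures and timeouts are ignored.

```python
class OSD:
    def __init__(self, name):
        self.name = name
        self.memory = {}   # updates received but not yet durable
        self.disk = {}     # updates committed to disk

    def receive(self, key, value):
        self.memory[key] = value
        return "ack"

    def commit(self, key):
        self.disk[key] = self.memory[key]
        return "commit"


def replicated_write(primary, replicas, key, value):
    """Return the notifications the client sees, in order."""
    events = []
    group = [primary] + replicas
    # The primary applies the update and forwards it to the other replicas.
    acks = [osd.receive(key, value) for osd in group]
    if all(ack == "ack" for ack in acks):
        events.append("ack")       # every replica has received the update
    # Later, each replica writes the update to disk.
    commits = [osd.commit(key) for osd in group]
    if all(commit == "commit" for commit in commits):
        events.append("commit")    # the update is durable everywhere
    return events


primary = OSD("osd.1")
others = [OSD("osd.4"), OSD("osd.5")]
events = replicated_write(primary, others, "1.0", b"hell")
```

Waiting for every replica before acknowledging is what makes the scheme slow but safe.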

(45)

Conclusion

▪Ceph and Swift share some similar concepts, though they implement them differently

▪ How to identify an object (Rings vs. CRUSH)

▪ How to distribute objects evenly (Rings vs. CRUSH)

(46)

Written (presentation) Assignment 3

▪Similar to previous two

▪Especially to those who have never presented before

(47)

Sources

• 1. Christian Schwede, Forget everything you knew about Swift Rings, https://www.openstack.org/assets/presentation-media/Rings201.pdf

• 2. Swift 101 https://www.youtube.com/watch?v=vAEU0Ld-GIU&feature=youtu.be

• 3. Ceph 101 https://www.youtube.com/watch?v=OyH1C0C4HzM