
Replicated BLOB storage in distributed environment.


Academic year: 2021


Czech Technical University in Prague

Faculty of Electrical Engineering

Department of Computer Science and Engineering

Master's Thesis

Replicated BLOB storage in distributed environment

Bc. Marek Timr

Supervisor: Ing. Martin Mudra

Study Programme: Open Informatics

Field of Study: Software Engineering


Acknowledgements

I would like to thank my supervisor, Ing. Martin Mudra, for giving me the opportunity to work on this project, for his guidance and advice. I would also like to thank my family for their support and encouragement throughout my whole studies.


Declaration

I declare that I elaborated this thesis on my own and that I mentioned all the information sources and literature that have been used in accordance with the Guideline for adhering to ethical principles in the course of elaborating an academic final thesis.


Abstract

This thesis deals with the storage of data in distributed data stores. The work studies the possibilities of storing blob data and their representation in storage systems. The thesis deals with object storage design, which distributes its data between individual system members. Emphasis is on the horizontal scalability of the system. The design studies synchronization problems and ensuring of a consistent state after a failure of the system. The work contains an implementation of the proposed system and tests its functionality.

Abstrakt

Tato práce se zabývá ukládáním dat do distribuovaných datových úložišť. Studuje možnosti ukládání blobových dat a jejich reprezentaci v úložných systémech. Práce se zabývá návrhem objektového úložiště, které distribuuje svá data mezi jednotlivé členy systému. Důraz je kladen na horizontální škálovatelnost systému. Návrh studuje problémy synchronizace a zajištění konzistentního stavu při selhání systému. Práce obsahuje implementaci navrženého systému a testuje jeho funkčnost.


Contents

1 Introduction

1.1 Motivation

1.2 Goals of the thesis

2 Analysis

2.1 Block and object storage difference

2.1.1 BLOB

2.2 Analysis of requirements

2.2.1 Functional requirements

2.2.2 Non-functional requirements

2.3 Related work

2.3.1 Haystack

2.3.2 Windows Azure Storage

2.3.2.1 Partition Layer

2.3.2.2 Stream Layer

2.3.3 SeaweedFS

2.3.3.1 Volume server

2.3.3.2 Master Server

2.3.3.3 Write of an object

2.3.3.4 Process of failover and election of a leader

2.4 Shared state in distributed environment

2.4.1 Consensus on a value

2.5 Analysis of file systems

2.5.1 Journaling file systems

2.5.2 Ext4

2.5.3 NTFS

2.6 Analysis of software tools

2.6.1 Deploy on multiple platforms

2.6.2 Database access

2.6.3 Serving of HTTP content

2.6.4 Fast file access

3 Design

3.1 System architecture


3.3 Storage Node

3.4 Operations on object

3.4.1 Object write process

3.4.1.1 Client request for object creation

3.4.1.2 Upload of client data to storage nodes

3.4.1.3 Client finishing request

3.4.1.4 Object identity

3.4.2 Object read process

3.4.3 Object deletion process

3.4.4 Object lifecycle

3.5 Object Representation

3.5.1 Maximum object size

3.5.2 Partial Content

3.5.2.1 Expiration of temporary data

3.5.3 States of a monolith file

3.5.4 Selection of a monolith to store an object

3.5.5 Deletion of an object

3.5.6 Compaction of a monolith

3.6 Communication protocols

3.6.0.1 Unidirectional messages

3.6.0.2 Messages requiring a response

3.6.0.3 Joining a new storage node to the existing cluster

3.6.0.4 Communication between peers

3.6.0.5 Data transfer protocol

3.7 Heartbeat

3.8 Replication of data

3.8.1 Storage state check

3.8.2 Assignment of missing replicas

3.8.3 Acquisition priority

3.8.4 Deletion of redundant object replicas

3.8.5 Deletion of expired objects

3.8.5.1 End of storage check

3.9 Scalability

3.9.1 Storage Node scalability

3.9.2 Master Server scalability

3.10 Startup of a node

3.10.0.1 Storage node startup after successful termination

3.10.0.2 Storage node startup after a failure

3.10.0.3 Missing checksum of a monolith file

4 Implementation

4.1 Platform

4.2 Netty I/O framework

4.3 Master server API

4.4 Storage Node API

4.6 Client library

4.7 Console application

4.8 Graphical interface

4.9 Dependency injection

4.10 Configuration

4.11 Logging

5 Testing and results

5.1 Performance testing

5.1.1 Write tests

5.1.2 Read tests

5.2 Functional testing

5.2.1 Create, read and delete scenario

5.2.1.1 Delete after node shutdown scenario

5.2.1.2 Replication balancing scenario

5.2.2 Manual testing

5.2.3 Test results discussion

6 Conclusion

6.1 Future work

6.2 Security

6.3 File compression

6.4 Periodical file checks

Bibliography

A Nomenclature

B Content of the attached CD

B.1 Installation manuals


List of Figures

2.1 Schema of a sequence of requests necessary for retrieval of an image from the Haystack store

2.2 Schema of high-level architecture of Windows Azure Storage

2.3 Structure of a Stream stored on the Stream Layer of Windows Azure Storage

2.4 Diagram of the assignment of a volume and file ids for a write by the current leading master server

3.1 Diagram of designed architecture of the storage cluster with communication channels between components

3.2 Diagram of an example situation of a write of an object with replication factor 2 with failure of one of the nodes

3.3 Diagram of sequence of client requests necessary for retrieval of object data

3.4 Process of asynchronous object deletion after client request

3.5 Diagram of the main states of an object and the transitions between them

3.6 Diagram of the internal structure of a monolith file

3.7 Diagram of the structure of an object stored in a single monolith file

3.8 Process of compaction of a monolith and selection of a new location for the living objects

3.9 Protocol of the communication during the connection of a node to the cluster

3.10 Sequence diagram of heartbeats sent from a storage node to the master server

3.11 Diagram of the strategy for assignment of an under-replicated object to a storage node

4.1 Screenshot of the printed help of the console client application

4.2 Screenshot of GUI with a sample listing of currently stored objects

4.3 Screenshot of GUI showing details about stored object

4.4 Screenshot of GUI with a listing of stored objects on a single node

5.1 Diagram of the network connection configuration of the computers used during the testing


List of Tables

4.1 Table of methods available to client exposed by the master server

4.2 Table of methods available to client exposed by a storage node

5.1 Table of hardware specification of the computers used during the testing

5.2 Results of the first write performance test

5.3 Results of the second write performance test

5.4 Results of the first read performance test

5.5 Results of the second read performance test


Chapter 1

Introduction

1.1 Motivation

Applications today produce increasing amounts of data that have to be stored. The nature of the data may vary and change over time. This variability prevents the use of conventional database technologies, which require a static structure.

We also want to access any of the stored data conveniently and quickly. Loss of data may result in a decrease of performance or even cause serious financial damage. We have to deal with hardware failures. The more data we store, the higher the chance that some data will get corrupted or lost.

A distributed data store might satisfy all these needs. A distributed data store is a network of interconnected computers where data are usually stored on multiple nodes. The data store ensures high availability and consistency of the stored data. We expect the system to be scalable; therefore, adding computational resources should appropriately increase the performance of the whole system.

Several established companies provide paid services to store virtually any amount of data. Although cloud services allow storing user data with ease, this usually comes at a significant monetary cost. The client is charged for the stored volume and the traffic used. Naturally, there are open source variants of such storage systems, each providing slightly different features. The downside is that the user has to manage their own infrastructure of machines. It then depends on the application whether it is worthwhile for the user to acquire and maintain the infrastructure. Since these systems are expected to be deployed in large data centers, the hardware requirements for running them efficiently might be significantly high.

1.2 Goals of the thesis

The goal of the work is to design and implement a simple BLOB (binary large object) storage. This system should run in a distributed network environment. That means it should handle problems associated with the distribution and synchronization of shared state. The user of the system must be able to store and read files. The system should ensure that user data remain available even in the event of a failure of a part of the system.


Great emphasis will be placed on the selection of open source tools used for the implementation. The system should have as few constraints as possible. That means it should be able to operate on common modern operating systems. Additionally, the user should be able to deploy the system on the majority of relevant hardware available to them.


Chapter 2

Analysis

In this chapter, the term object storage is defined. We specify the functional and non-functional requirements placed on a simple BLOB storage.

A significant part of the analysis is dedicated to the study of current object storages. Since there are plenty of different systems, a few are selected for deeper study. The analysis describes the strategies these systems use to store, replicate and serve data to clients. It then describes some principles of file systems and the management of state in a distributed environment. The end of the analysis looks into the capabilities of current computer technologies. This breakdown will be used during the implementation of a proof-of-concept BLOB storage.

2.1 Block and object storage difference

Object storage is a general term that refers to the way data are organized and manipulated. The data are stored in the form of objects, which consist of three components[1, 2]:

Data: The content of the object in an arbitrary format.

Metadata: A list of attributes that describe the stored data. The content of the metadata depends on the application. It may contain information about what is stored, the format of the data, the owner of the data or some security-related information.

Global identification: A unique identifier that allows the data to be found in the distributed system. The location of the object may be transparent to the user of the system.
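As an illustration of the three components, a stored object could be modelled as follows. This is a minimal sketch in Java; the type and accessor names (`StoredObject`, `id`, `metadata`, `data`) are hypothetical and are not taken from any particular storage system.

```java
import java.util.Map;
import java.util.UUID;

/** Hypothetical model of an object: a global identifier, metadata and opaque data. */
final class StoredObject {
    private final UUID id;                      // global identification
    private final Map<String, String> metadata; // application-defined attributes
    private final byte[] data;                  // content in an arbitrary format

    StoredObject(Map<String, String> metadata, byte[] data) {
        this.id = UUID.randomUUID();           // assumed id scheme, chosen for illustration
        this.metadata = Map.copyOf(metadata);  // defensive, immutable copy
        this.data = data.clone();
    }

    UUID id() { return id; }

    Map<String, String> metadata() { return metadata; }

    byte[] data() { return data.clone(); }
}
```

The object exposes no method for changing its content, which mirrors the way object storages treat an object as a unit that is replaced as a whole.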

Block storage divides data into blocks of fixed length without any accompanying metadata. The data are easier to modify than in object storage: modifying a file stored in an object storage requires that the object is modified as a whole and then written back, whereas block storage requires access only to the blocks that need to be modified. Some block storages can be mounted as regular file systems.[2]


Block storage is better suited to applications that require virtual disk storage and whose data are modified frequently. The files are smaller and access to them is random[2]. Object storage is more suitable for archiving, backing up and preserving data[3]. The content of the files is static and they are accessed sequentially. Individual objects may be large. Object storage is cheaper to scale.

In this document the term object will refer to a file that a user deposits into the storage. By contrast, the term file will be used to describe the physical representation of an object on a disk.

2.1.1 BLOB

The abbreviation BLOB stands for Binary Large OBject[4]. A BLOB is an amorphous piece of binary data whose internal structure is not known externally[4]. Another interpretation refers to BLOB as a binary data type[4]. In the context of object storage, BLOB is another term for an object. Therefore the terms BLOB and object are interchangeable in this work.

2.2 Analysis of requirements

This section lists the functional and non-functional requirements placed on the designed system. The requirements are set by the assignment of this thesis.

2.2.1 Functional requirements

1. The system will store user data in the form of BLOB objects.

2. The user will not be able to modify stored data, only to erase them.

3. The system will expose a RESTful (Representational State Transfer) interface for client applications.

4. The system will make the most of the connection bandwidth between the system and the client by writing to and reading from disk in parallel.

5. The system will make it possible to monitor the state of the storage with a web application.

2.2.2 Non-functional requirements

1. The storage must provide protection against loss of data in the event of a technical failure.

2. The storage must be highly horizontally scalable.

3. Adding new storage nodes to the cluster must be easy for the user.

4. The system must be platform independent.

5. The system must be able to recover after an unexpected system crash.

6. The system must be operational on relevant hardware with low performance.

7. The system should prefer the use of open source frameworks and libraries.


2.3 Related work

There are existing object data stores that offer features similar to the requirements of this work and are widely used in cloud environments[5, 6]. The following section studies some of these data stores in order to examine the advantages and drawbacks of their features.

2.3.1 Haystack

Haystack is an object storage optimized for the efficient storage and retrieval of billions of photos, developed by Facebook. Its goal is to provide high throughput and low latency[7]. The system also aims to be fault tolerant, simple and cost-effective. The implementation of Haystack is not open to the public; however, the paper [5] describes the general architecture.

The Haystack consists of several subsystems. The most important are Haystack Directory and Haystack Store. Haystack Directory manages general metadata about photos and their location in the Haystack Store.

The previous system stored each photo separately on NFS (Network File System) in directories containing thousands of files. Reading an image led to an excessive number of disk operations due to how NFS manages directory metadata[5]. Haystack Store appends individual photos sequentially to files of roughly 100GB each. The Store holds the specific location of an image in memory. It maps the image identity to the physical file, the image length and its position in that file. The underlying file system therefore does not have to maintain per-photo metadata that would be ignored anyway. Each image in such a file is stored in a structure called a needle[5]. The needle consists of a header and a footer with metadata related to the file, and of the image data. During a read, Haystack Store retrieves the file location and offset from memory, calculates the length of the whole needle and reads it. Usually, one disk operation is enough to fetch the whole needle.
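The read path described above can be sketched as follows. This is a simplified illustration in Java, not Facebook's implementation: the class names (`NeedleIndex`, `NeedleLocation`) are hypothetical, and the needle header, footer and checksums are omitted, so a needle here is just the raw image bytes appended to one volume file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory index over a single append-only volume file. */
final class NeedleIndex {
    /** Position and length of one stored image inside the volume file. */
    static final class NeedleLocation {
        final long offset;
        final int length;

        NeedleLocation(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final Path volume;
    private final Map<Long, NeedleLocation> index = new HashMap<>();

    NeedleIndex(Path volume) {
        this.volume = volume;
    }

    /** Appends the image at the end of the volume and remembers where it landed. */
    void put(long imageId, byte[] image) throws IOException {
        try (FileChannel ch = FileChannel.open(volume,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long offset = ch.size();
            ByteBuffer buf = ByteBuffer.wrap(image);
            long pos = offset;
            while (buf.hasRemaining()) {
                pos += ch.write(buf, pos);
            }
            index.put(imageId, new NeedleLocation(offset, image.length));
        }
    }

    /** Finds the location in memory, then fetches the bytes with a positioned read. */
    byte[] get(long imageId) throws IOException {
        NeedleLocation loc = index.get(imageId);
        if (loc == null) {
            return null; // unknown image id
        }
        ByteBuffer buf = ByteBuffer.allocate(loc.length);
        try (FileChannel ch = FileChannel.open(volume, StandardOpenOption.READ)) {
            long pos = loc.offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new IOException("volume file is shorter than the index expects");
                }
                pos += n;
            }
        }
        return buf.array();
    }
}
```

Because the offset and length are already known, no directory lookup or per-file metadata access is needed before the data are read.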

To speed up the start of the system, Haystack Store maintains an index file that matches the structure of the volume file but contains only the metadata. The system loads the index file into memory and therefore does not have to traverse the volumes[5].

Haystack Store distinguishes physical and logical volumes. A logical volume consists of several physical volumes on different machines. When Haystack stores an image on a logical volume, the image is written to all of its physical volumes[5].

Figure 2.1 describes the architecture of the system and the sequence of requests made when an image is requested. A client browser requests a photo from the web server over HTTP (Hypertext Transfer Protocol). The web server retrieves the metadata of the file and returns the address of a CDN (Content Delivery Network) and an image identifier. The CDN stores only photos that are popular at that time. If the CDN does not hold the photo, the client approaches the Haystack Store. The Store searches its cache and, if there is a miss, loads the photo from a physical volume[7].


Figure 2.1: Schema of a sequence of requests necessary for retrieval of an image from the Haystack store

2.3.2 Windows Azure Storage

Windows Azure Storage (abbreviated to WAS) is a cloud storage that enables clients to store structured and unstructured data at arbitrary volume[6]. WAS is one of the many services offered on the Azure platform[8]. The client can store files in the form of BLOBs, structured data in a tabular service called Azure Table, and messages in the Azure Queue service. These services provide different RESTful APIs (application programming interfaces) to access the data. However, the services share the same internal data representation. The client is charged for the stored data and the traffic.

WAS is a very complex system that ensures strong consistency of data, their availability and partition tolerance [6], even though this is in contradiction with the CAP theorem (explained in section 2.4).

A Storage Stamp (abbreviated to Stamp) is a single data center of the WAS system. The capacity of one Stamp is up to 30PB. WAS assigns the client an account name that maps to one Storage Stamp. The Stamp encapsulates a front-end serving client requests and components named Partition Layer and Stream Layer. Figure 2.2 shows the general architecture.


Figure 2.2: Schema of high-level architecture of Windows Azure Storage

2.3.2.1 Partition Layer

Partition Layer (abbreviated to PL) provides an abstraction of stored data (BLOB, Table, Queue). It ensures transactional operations on objects and their strong consistency. PL also contains a cache to reduce I/O (input/output) operations on the Stream Layer[6].

PL is responsible for the replication of data on the object level between individual Stamp data centers. This replication takes place asynchronously after an object is written[6].

2.3.2.2 Stream Layer

Stream Layer (abbreviated to SL) stores data in the form of Streams, which are read by PL. SL, unlike PL, does not know the meaning of the stored data. A Stream is a list of pointers to files named extents. In the context of WAS, an extent contains blocks of different lengths. A block is the smallest unit of data that WAS manages. Figure 2.3 describes the structure of a Stream. SL comprises two principal components, the Stream Manager and a series of Extent Nodes. An Extent Node stores extent files, and the Stream Manager monitors their state and location. SL also replicates data, but on the level of extent files and within a single Stamp data center. During a block write, SL synchronously writes the block to three different Extent Nodes. Large objects are broken up over many extents by the Partition Layer. For each object, the PL keeps track of which Stream, extents and byte offsets in these extents contain the object data.


Figure 2.3: Structure of a Stream stored on the Stream Layer of Windows Azure Storage

2.3.3 SeaweedFS

SeaweedFS is an open source, scalable distributed file system. It aims to store billions of files and serve them fast [9]. The design of the system draws inspiration from the Haystack store. The file system is not POSIX (Portable Operating System Interface) compliant, but with the help of additional software, SeaweedFS can be mounted using FUSE (Filesystem in Userspace). The system exposes an HTTP API and serves responses in JSON (JavaScript Object Notation) format [9].

Each member of the cluster takes up to two roles. A member can be a master server, a volume server, or both.

2.3.3.1 Volume server

A volume file is a file with a maximum size of 32GB that contains stored objects in a similar way as in Haystack. Volume servers manage volume files. Each stored object is associated with a unique object id. The volume server stores the mapping from an object id to a volume, an offset in the volume and the length of the object in a key-value database or in memory. The volume server accepts, stores and serves client data. The server replicates the volume files to other volume servers in the system [9].

2.3.3.2 Master Server

The master server only maps unique volume ids to volume servers. This approach differs from the other storage systems mentioned because the server does not know anything about the stored files. Although there may be multiple servers in the cluster with the master role, only one at a time is the leader. Delegating the storage of metadata to volume servers lowers the load on the master [9].

2.3.3.3 Write of an object

The upload of a file consists of two steps. First, a master server must assign a volume id for the storage of the file. The master allocates file ids sequentially. For security reasons, the master server generates a random file cookie for each object. The file cookie prevents unauthorized clients from accessing random files by guessing their ids. The client receives a JSON response from the master containing the location of a volume server, a file id, a file cookie and a volume id.

In the second step of the upload, the client sends the file to the given volume server and attaches all identifiers to the request.
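To illustrate how the identifiers travel together, the SeaweedFS documentation gives the example file id `3,01637037d6`, where `3` is the volume id, `01` is the hexadecimal file key and the last eight hexadecimal digits `637037d6` are the cookie. The following Java sketch splits such a string into its parts. The class name `FileId` and its fields are hypothetical, and the format is assumed to be exactly as in that example.

```java
/** Hypothetical parser for a SeaweedFS-style file id such as "3,01637037d6". */
final class FileId {
    final int volumeId; // volume that holds the file
    final long fileKey; // key allocated sequentially by the master
    final int cookie;   // random value that makes ids hard to guess

    private FileId(int volumeId, long fileKey, int cookie) {
        this.volumeId = volumeId;
        this.fileKey = fileKey;
        this.cookie = cookie;
    }

    static FileId parse(String fid) {
        int comma = fid.indexOf(',');
        // After the comma there must be at least one key digit plus eight cookie digits.
        if (comma < 1 || fid.length() - comma - 1 <= 8) {
            throw new IllegalArgumentException("malformed file id: " + fid);
        }
        String hex = fid.substring(comma + 1);
        String keyHex = hex.substring(0, hex.length() - 8);
        String cookieHex = hex.substring(hex.length() - 8);
        return new FileId(
                Integer.parseUnsignedInt(fid.substring(0, comma)),
                Long.parseUnsignedLong(keyHex, 16),
                Integer.parseUnsignedInt(cookieHex, 16));
    }
}
```

A volume server can use the volume id and file key to locate the data and compare the cookie with the stored one before serving the file.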

SeaweedFS is suitable only for small and medium-sized files such as images for websites. If a client wants to upload a larger file, it must be split into chunks by the master server. Upon client request, the master returns a list of ids, each of which maps a chunk to an object in the storage[9].

2.3.3.4 Process of failover and election of a leader

Despite multiple master servers running in the cluster, only one at a time holds the mapping of volumes. If a client requests an object write on a master server that is not currently the leader, the server will redirect the client to the current one (schema shown in figure 2.4). The current leader of the cluster periodically sends heartbeats to the others. If the leader fails, a new leader is elected. Therefore the master server is not a single point of failure of the storage system.

The election is accomplished using the Raft[10] consensus algorithm. Consensus is achieved by selecting a leader, which then replicates the shared state to the followers. During and after the election, there is a time window in which the system does not accept client requests. Each master must send its volume mapping to the newly elected leader.


Figure 2.4: Diagram of the assignment of a volume and file ids for a write by the current leading master server

1. Client requests assignment of a volume

2. Master redirects the client to the current leader of the cluster

3. Client requests assignment of a volume

4. Master generates file and cookie id and assigns a volume for write

5. Client writes the file to node managing assigned volume


2.4 Shared state in distributed environment

Sharing common state in a distributed network of computers is a significant problem. The system must deal with concurrent updates issued on different members of the system. The data must remain in a consistent state.

The CAP theorem states that a distributed data store can provide only two of the following three guarantees[11]:

Consistency: A read operation is guaranteed to return the most recently written value.

Availability: A non-failing node will return a reasonable non-error response within a reasonable amount of time. The system remains functional even if there are failed nodes.

Partition Tolerance: The system will continue to function when network partitions occur. A network partition is a failure of a network device that causes the network to be split.

Regular relational databases usually ensure strong consistency of data. Strong consistency means that after an update is committed and acknowledged by the system, all clients must see the last committed value. In contrast, a system that provides eventual consistency of data does not guarantee that subsequent accesses to data will return the last updated value[12]. However, eventual consistency guarantees that if no new updates are made to an object, eventually all members of the system will return the most up-to-date value.

2.4.1 Consensus on a value

Clients of a distributed storage system may update a value on different nodes simultaneously. The system must ensure consistency of the data and must propagate the correct, most recently written value to the rest of the nodes.

When distributed systems synchronize data across the cluster, they may use an implementation of Paxos[13]. Paxos is a family of consensus algorithms. Consensus is the process of agreeing on some value among a group of participants. The general principle of the process is as follows[14]:

1. A node called the leader prepares a proposal of a value for a group of acceptors.

2. The acceptors may promise that they will accept the new value and reject all previous updates.

3. If the majority of acceptors accept the proposal, the leader proposes the value.

4. If an acceptor has not received any newer proposal, it accepts the new value.
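The steps above can be illustrated with a small in-process simulation. This is a minimal sketch of single-value Paxos, assuming reliable synchronous method calls instead of a network and no node failures. The class and method names (`PaxosSketch`, `Acceptor`, `Promise`, `prepare`, `accept`, `propose`) are chosen for illustration only. In addition to the steps listed above, the proposer adopts any value that an acceptor reports as already accepted, which is the rule that keeps a chosen value stable.

```java
import java.util.List;

/** Simplified single-value Paxos; networking, persistence and failures are left out. */
final class PaxosSketch {

    /** Reply of an acceptor to a prepare request. */
    static final class Promise {
        final boolean granted;
        final long acceptedNumber;  // -1 when nothing has been accepted yet
        final String acceptedValue; // null when nothing has been accepted yet

        Promise(boolean granted, long acceptedNumber, String acceptedValue) {
            this.granted = granted;
            this.acceptedNumber = acceptedNumber;
            this.acceptedValue = acceptedValue;
        }
    }

    static final class Acceptor {
        private long promised = -1;       // highest proposal number promised so far
        private long acceptedNumber = -1; // number of the last accepted proposal
        private String acceptedValue;     // value of the last accepted proposal

        /** Phase 1: promise to reject proposals numbered lower than n. */
        synchronized Promise prepare(long n) {
            if (n <= promised) {
                return new Promise(false, acceptedNumber, acceptedValue);
            }
            promised = n;
            return new Promise(true, acceptedNumber, acceptedValue);
        }

        /** Phase 2: accept unless a higher-numbered proposal was promised meanwhile. */
        synchronized boolean accept(long n, String value) {
            if (n < promised) {
                return false;
            }
            promised = n;
            acceptedNumber = n;
            acceptedValue = value;
            return true;
        }
    }

    /** Runs both phases; returns the chosen value, or null if no majority was reached. */
    static String propose(List<Acceptor> acceptors, long n, String value) {
        int majority = acceptors.size() / 2 + 1;
        int promises = 0;
        long highestAccepted = -1;
        String candidate = value;
        for (Acceptor a : acceptors) {
            Promise p = a.prepare(n);
            if (!p.granted) {
                continue;
            }
            promises++;
            // Adopt the value of the highest-numbered proposal accepted earlier.
            if (p.acceptedValue != null && p.acceptedNumber > highestAccepted) {
                highestAccepted = p.acceptedNumber;
                candidate = p.acceptedValue;
            }
        }
        if (promises < majority) {
            return null;
        }
        int accepts = 0;
        for (Acceptor a : acceptors) {
            if (a.accept(n, candidate)) {
                accepts++;
            }
        }
        return accepts >= majority ? candidate : null;
    }
}
```

With three acceptors, a first proposal numbered 1 with value "A" is chosen. A later proposal numbered 2 with value "B" still yields "A", and a stale proposal that reuses number 1 is rejected.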

Implementing a consensus protocol is complicated and prone to errors. Systems often rely on good synchronization of clocks, and any clock skew may result in data loss. The system may also end up in a split state. A network partition divides the cluster into two separate networks. Each network creates a quorum and elects a leader. The system remains functional and continues to accept client writes to maintain availability. When the partition is resolved, the state of both networks has to be merged. The storage rolls back the writes on one of the sub-networks. Consistency is sacrificed in favor of availability [15].

It is important to state that this is relevant only to data stores that allow modification of stored data. Storage of immutable data does not suffer from the problems mentioned above.

2.5 Analysis of file systems

The main functionality of the designed system is the storage of data. To build an efficient solution, it is crucial to understand how file systems manage the data.

2.5.1 Journaling file systems

A journaling file system is a file system that records the changes to be made to files [16]. Changes are registered in a journal. The purpose of a journaling file system is to ensure consistency of data in the event of a system crash or power loss. In such situations, the changes recorded in the journal are used for the reconstruction of files. Journaling reduces the chance of file damage.

Operations on file systems may consist of a series of smaller operations. In case of an interruption between sub-operations, the system may end up in an inconsistent state. A journaling FS (File System) wraps these operations in transactions. The transactional approach ensures the atomicity of each operation. After a system failure, data are either reverted to the state before the beginning of the transaction, or the transaction is finished.

The transaction consists of these steps[16]:

1. Writing the planned changes to the file into the journal

2. Making the changes to the file system

3. Recording the successful completion of the operation in the journal

4. Removing the record from the journal
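The four steps can be modelled in a few lines of Java. This is a toy sketch, not how a real file system is implemented: a map stands in for the file system, a queue stands in for the journal file, and the hypothetical flag `crashBeforeApply` simulates a failure between steps 1 and 2. Recovery here finishes the interrupted transaction by replaying the journal, which is one of the two outcomes described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy model of journaled updates; real file systems journal disk blocks, not map entries. */
final class JournalSketch {
    /** One planned change: store the given content under a file name. */
    static final class Record {
        final String file;
        final String content;
        boolean completed; // set once the change has reached the main storage

        Record(String file, String content) {
            this.file = file;
            this.content = content;
        }
    }

    final Map<String, String> storage = new HashMap<>(); // stands in for the file system
    final Deque<Record> journal = new ArrayDeque<>();    // stands in for the journal file

    void write(String file, String content, boolean crashBeforeApply) {
        Record r = new Record(file, content);
        journal.addLast(r);         // 1. write the planned change to the journal
        if (crashBeforeApply) {
            return;                 // simulated crash: the change never reached the storage
        }
        storage.put(file, content); // 2. make the change in the file system
        r.completed = true;         // 3. record the successful completion
        journal.remove(r);          // 4. remove the record from the journal
    }

    /** After a crash, finishes every transaction that is still recorded in the journal. */
    void recover() {
        while (!journal.isEmpty()) {
            Record r = journal.removeFirst();
            storage.put(r.file, r.content);
        }
    }
}
```

The sketch also shows the cost mentioned in the text: every change is written twice, once to the journal and once to its final location.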

Maintaining a journal decreases the performance of the FS during file writes because each piece of data has to be written twice. Some file systems record only metadata (e.g. NTFS [17]). Recording only metadata increases performance, but it involves a risk that metadata and data become inconsistent.

The journal is a special file stored in a designated place in the file system.

2.5.2 Ext4

Ext4 is a journaling file system for Linux. On an ext4 FS, each file is associated with an inode. The inode stores file metadata and a list of data blocks. The metadata are attributes partly standardized by POSIX. The attributes contain, for example, the file size, user ID, group ID, file access mode and timestamps[18].

The blocks in the list are either direct and point to a disk sector, or indirect and point to another list of blocks. For large contiguous files, ext4 can map the file using an extent tree. Allocating a file using indirect mapping requires mapping each block, whereas the leaves of the extent tree record the number of blocks they cover. This approach reduces the number of metadata blocks used and also brings some disk performance improvement.
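A short worked example shows the difference in the amount of mapping information. The sketch below assumes a block size of 4 KiB, a perfectly contiguous file and a limit of 32768 blocks per extent (the ext4 limit for one initialized extent). The class and method names are hypothetical.

```java
/** Compares per-block mapping with extent mapping for a contiguous file. */
final class ExtentMath {
    static final long BLOCK_SIZE = 4096;             // assumed block size in bytes
    static final long MAX_BLOCKS_PER_EXTENT = 32768; // ext4 limit for one initialized extent

    /** Number of blocks needed for the file, which is also the number of per-block mappings. */
    static long blocks(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    /** Smallest number of extents that can describe a fully contiguous file. */
    static long minExtents(long fileSize) {
        long b = blocks(fileSize);
        return (b + MAX_BLOCKS_PER_EXTENT - 1) / MAX_BLOCKS_PER_EXTENT;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(blocks(oneGiB));     // 262144 block mappings
        System.out.println(minExtents(oneGiB)); // 8 extents
    }
}
```

Under these assumptions a 1 GiB file needs 262144 individual block mappings but only 8 extent records.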

The number of inodes is set during file system creation. The number is based on a heuristic estimate of how many files will be stored on the FS. Storing a lot of small files can exhaust the inodes, which prevents storing any further file.

Directories on ext4 are files that map names to inodes[18]. Predecessors of ext4 stored the listings as a linear array, which performed poorly when a directory contained many files. Ext4 stores the mapping in an htree, a special version of a B-tree with a constant depth of one or two levels. The system hashes the file name and uses the hash to traverse the tree to find the file. A leaf node contains a list of files, which are then searched sequentially.

The maximum file size on ext4 is 16TiB.

2.5.3 NTFS

NTFS (New Technology File System) is a proprietary file system developed by Microsoft[19]. Each file stored on NTFS is recorded in a special file, the Master File Table (MFT). The MFT contains metadata about itself, directory listings and may also store file data. There is a mirror copy of the MFT in case the original gets damaged[20].

The file record contains file metadata called resident attributes. If the attributes take too much space, additional clusters outside the MFT are allocated to store nonresident attributes. A cluster is the smallest amount of disk space allocated to store file data, usually 4KB in size[20].

One of the file attributes contains an extent list with the file data. To access data faster, the data of small files may reside in the MFT. The architecture allows a maximum file size of 16 exabytes; however, the implementation permits 16 terabytes[20].

Directory entries are specific records that contain an index of files. Large directories are organized in a B-tree structure that contains pointers to other directory entries stored in clusters outside the MFT[20].

2.6 Analysis of software tools

This section analyses the options for tools for the implementation of the storage system. The most influential choice is the selection of a programming language. Languages come with an ecosystem of available libraries and frameworks of varying stability and performance. The following list captures the functionality needed, based on the requirements of this thesis:

1. Deploy on multiple platforms

2. Database access

3. Serving of HTTP content

4. Fast file access


2.6.1 Deploy on multiple platforms

The system is required to be deployable on the common operating systems, which are Windows, Linux, and macOS. There are two major high-level programming languages with rich ecosystems and good overall performance[21] that are available on all three operating systems: Java and C#. Both allow building an application that is runnable on any of these systems. C++ is omitted from the selection of tools even though it is arguably more performant, because it is compiled to native code rather than to bytecode executed by a runtime. The reason is that building the system would require a separate compilation for every architecture, and it is more complicated to write fully portable code in C++ than in the above languages.

In the beginning, C# was bound exclusively to the Windows platform. That is no longer true thanks to Mono. Mono¹ is a cross-platform and, most importantly, open-source implementation of Microsoft's .NET Framework. It is compliant with the Ecma standard[22] of the Common Language Infrastructure, which describes the executable code and runtime environment of C# and other languages. That allows the code to run on different computers and architectures. The Mono project mirrors the development of the C# language done by Microsoft.

In comparison, there is Java with OpenJDK. OpenJDK is an open-source implementation of the Standard Edition of the Java Platform[23]. It is a common Java runtime environment on many UNIX distributions. OpenJDK is also the reference implementation of the Java platform[24]. In contrast, Mono, developed by Microsoft's subsidiary Xamarin, does not guarantee uniformity with C# on Windows.

Both mentioned platforms require a runtime environment to be installed on the target machine, and both are full-featured languages with many similarities from the programmer's point of view.

2.6.2 Database access

The system will have to store information about user data in some way. It will also have to record the state of the storage persistently. Any loss or corruption of the state data directly affects users. Naturally, such storage will be accessed often, and therefore it must not slow down the system.

Currently, there are several NoSQL databases available that provide high performance, for example Cassandra² or MongoDB³. Each commonly used solution has good client support for both Java and C#. NoSQL databases usually scale very well horizontally and are distributed. They are also schema-less, which means that querying data with some relation between them is harder, or not possible at all in the case of key-value stores. Distributed NoSQL databases may sacrifice consistency for availability. Read and write operations are not necessarily ACID. The lower consistency could manifest itself as a risk of data loss due to a system crash or a network partition.

The problem could be solved as a compromise between both requirements by selecting a suitable relational database. Relational databases provide operations with ACID (Atomicity, Consistency, Isolation, Durability) properties. Some databases can be scaled by sharding. Sharding is a technique that splits a large table row-wise into multiple smaller tables, each of which is placed on a separate server.

¹ http://www.mono-project.com/
² https://cassandra.apache.org/
³ https://www.mongodb.com/

To access the database conveniently, Java offers the Hibernate⁴ ORM framework, which is compliant with the Java Persistence API. For C#, there is the corresponding framework NHibernate⁵.

It will not be necessary for each member of the distributed system to run a large relational database. However, members of the system will have to store at least a minimum amount of file metadata. This would contain a file identification and its location based on the internal representation. It might seem unnecessary to store such data in a database instead of holding it in memory for very fast access. However, there are two reasons for the use of a database. Although each record could be as small as a few tens of bytes, the records could add up to tens of megabytes after surpassing millions of stored files. They would occupy the memory of a machine whose resources could be constrained, as required in 2.2. The storage service will be running in the background and should not limit other applications from running when it is not used.

The second reason is the need to persist the state when the server is shut down. There should be a representation of stored objects from which the system can read the state into memory. Also, the system should anticipate unexpected crashes, which could corrupt the saved state. For that reason, servers could use an embedded journaling database to persist the storage state.

There are several options to choose from in the Java ecosystem. Commonly used embedded databases include SQLite⁶, H2⁷ and Apache Derby⁸. In some cases they can deliver better performance than PostgreSQL[25].

For Mono, SQLite is the only embedded open-source relational database. The drawback of SQLite is its limited support for concurrent transactional access, since only one writer can be active at a time.

2.6.3 Serving of HTTP content

The requirements specify the way the clients will access the data store. The system is required to serve clients using a RESTful API. Most requests are expected to require a significant amount of time to complete due to transfers of large data streams. Also, serving requests from a large number of concurrently connected clients may be problematic[26]. It is important to choose a framework capable of handling such load.

Besides traditional container-based Java servers, there is the low-level I/O framework Netty⁹. The main idea of the framework is to be event-driven and to spend as little time as possible in user space. Despite running on the JVM with a garbage collector, Netty manages pools of native memory. In some cases, that approach allows working with data effectively without loading them into heap memory. There are dozens of large open-source systems based on Netty[27], for example Apache Cassandra, Elasticsearch or Play Framework.

⁴ http://hibernate.org/
⁵ http://nhibernate.info/
⁶ http://www.sqlite.org/
⁷ http://www.h2database.com/html/main.html
⁸ https://db.apache.org/derby/
⁹ http://netty.io/


The counterpart of Netty in C# is Kestrel¹⁰, a cross-platform HTTP server based on the asynchronous I/O library libuv¹¹. Kestrel is used as a performant asynchronous server for ASP.NET Core applications. If more low-level access is needed, libuv can be used directly. Since libuv is written in C, there are C# wrappers¹² ¹³ for it.

The performance of Kestrel and Netty is very high in comparison with other platforms. Of the two discussed frameworks, Netty is somewhat faster according to benchmarks[28].

2.6.4 Fast file access

The system is expected to write and read large portions of user data to and from disks. Not much processing is needed. To increase performance, the system should utilize zero-copy data transfers to save CPU cycles and memory bandwidth[26]. Netty incorporates these principles in its design[29].

The approach for a C# application is either to invoke native library code, which is not very suitable for the designed multiplatform system, or to utilize the class UnmanagedMemoryStream[30].

¹⁰ https://docs.microsoft.com/en-us/aspnet/core/fundamentals/servers/kestrel
¹¹ https://github.com/libuv/libuv
¹² https://github.com/StormHub/NetUV
¹³ https://github.com/txdv/LibuvSharp


Chapter 3

Design

This chapter describes the proposed design of the storage system in detail. The design is based on the knowledge gathered in the analysis. The chapter covers the architecture of the whole system and the competencies of each of its members. It also describes the operations and internal processes of the system and the handling of failures.

3.1 System architecture

The system will consist of two main components: a master server (referred to in the text simply as the master or abbreviated to MS) and a series of storage nodes (simply nodes or abbreviated to SN). Figure 3.1 presents the organization of a storage cluster. The client of the system is meant to be an application communicating with the exposed interface.

3.2 Master Server

The master server will manage the operation of the whole cluster. Its responsibility will be the storage of object metadata and of object locations on different storage nodes. The metadata will be served on demand to clients and individual nodes. All metadata will be stored in a relational database.

The analysis showed that sharing state between nodes of a distributed system is problematic. Therefore, the master server will run as a single instance in each cluster. The server will be a unified access point for all clients.

Responsibilities of the server will include the generation of object identifiers and the assignment of suitable nodes for the storage of a new object.

The master server will keep a connection with storage nodes to monitor their health. Through this communication channel, the nodes will receive commands to replicate or delete object data.


Figure 3.1: Diagram of designed architecture of the storage cluster with communication channels between components

3.3 Storage Node

The storage node will be the fundamental building block of the system. Its main purpose will be to accept, store and serve client data. The node will manage an internal database mapping stored objects. The database will contain only the minimum metadata necessary for storage.

3.4 Operations on object

There are a few required operations that the client will be able to execute on an object. These operations are:

• Object write
• Object read
• Object delete

Each operation will be a continuous process composed of smaller subtasks that the client will have to carry out. The subtasks differ in which component of the storage they are carried out on.


Master Server will offer the following functions to the client:
• Create new object and assign it to nodes
• Get locations of nodes for finishing storing
• Get location of stored object
• Delete stored object

Functions on a Storage Node mainly follow processes started on the Master Server. The following features will be available to a client:
• Upload partial data of a new object
• Finish data upload
• Read stored object

3.4.1 Object write process

Storing a new object will be the most complicated of the general operations. It will consist of three steps. A detailed description of each follows.

1. Client requests creation of a new object on Master Server
2. Client uploads data to storage nodes
3. Client makes a finishing request on a storage node to end the upload

Figure 3.2 demonstrates a sample process of storing an object.

3.4.1.1 Client request for object creation

Client will request the creation of a new object on Master Server. The request will contain some metadata about the stored file. This design proposes to store the original name of the file and its length. MS will record a creation timestamp. The new object will be given a unique identifier described in section 3.4.1.4. The client also specifies a mandatory replication factor in the request. The replication factor is the desired number of nodes on which a replica of the object should be stored. If there are enough connected storage nodes to satisfy the requested replication, MS will select a group of them of the same size as the replication factor. These nodes will be assigned to collect and store the object. The server's response to the client will contain the new id and a list of the assigned storage nodes. The list will contain the IP addresses and corresponding ports at which the nodes will be accessible.


3.4.1.2 Upload of client data to storage nodes

The client will be able to choose any of the given storage nodes to upload the object data to. The request will contain the unique id and the length of the content.

If the object is large, it does not have to be sent in a single HTTP request. Instead, the client will be advised to include the Range header[31] in the request and send only partial content. There are multiple possible usages and formats of the Range header, and the SN will not implement all of them. The design considers a simple case where the value of the header has the form bytes=firstBytePos-lastBytePos. An example of the header for a request containing the first 1000 bytes of a file would be bytes=0-999. The implementation need not be limited to this form.
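The simple header form considered by the design can be parsed with a few lines of Java. The following is a minimal sketch and not the implementation used by the storage node: the class name `ByteRange` and its methods are hypothetical, and every other form of the Range header is deliberately rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: accepts only the simple "bytes=first-last" form. */
final class ByteRange {
    private static final Pattern SIMPLE = Pattern.compile("bytes=(\\d+)-(\\d+)");

    final long first; // position of the first byte, inclusive
    final long last;  // position of the last byte, inclusive

    ByteRange(long first, long last) {
        if (first < 0 || last < first) {
            throw new IllegalArgumentException("invalid range " + first + "-" + last);
        }
        this.first = first;
        this.last = last;
    }

    /** Number of bytes covered by the range. */
    long length() {
        return last - first + 1;
    }

    /** Parses a header value such as "bytes=0-999"; other forms are rejected. */
    static ByteRange parse(String headerValue) {
        Matcher m = SIMPLE.matcher(headerValue.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported Range value: " + headerValue);
        }
        return new ByteRange(Long.parseLong(m.group(1)), Long.parseLong(m.group(2)));
    }
}
```

For the example from the text, `ByteRange.parse("bytes=0-999")` yields a range whose `length()` is 1000.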

There are a couple of reasons to split data into multiple requests:
• Better recovery in case of network failure
• To achieve better throughput
• To allow seeking in the stored data

A network connection should not be considered stable. A large file transfer might take a significant amount of time, and an interruption of the connection is possible. In case of failure, the client therefore does not have to start all over, but only sends the last partial content again.

When dealing with higher latency, the bandwidth-delay product should be taken into consideration. It measures the maximum amount of data that can be in transit over the network before being acknowledged[32]. Opening multiple connections should increase the throughput so that the available bandwidth is fully utilized.
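The effect can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, and the idea that each connection is limited to a fixed window of unacknowledged data (64 KiB in the usage note) is an assumption made for the example, not a property of the designed system.

```java
/** Hypothetical helper for estimating how many parallel connections fill a link. */
final class BandwidthDelay {
    /** Bandwidth-delay product in bytes: bit rate times round-trip time, divided by 8. */
    static long productBytes(long bitsPerSecond, long roundTripMillis) {
        return bitsPerSecond * roundTripMillis / 8000L;
    }

    /**
     * Number of connections needed to keep bdpBytes in flight when each
     * connection may have at most windowBytes unacknowledged (assumed fixed).
     */
    static long connectionsToFill(long bdpBytes, long windowBytes) {
        return Math.max(1L, (bdpBytes + windowBytes - 1) / windowBytes);
    }
}
```

For a 100 Mbit/s link with a round-trip time of 50 ms, `productBytes(100_000_000L, 50)` gives 625000 bytes. With an assumed 64 KiB window per connection, `connectionsToFill(625000, 65536)` gives 10 connections.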

The client will be allowed to send data to any node of the given group of storage nodes. Each storage node will synchronously share the acquired data with the other peers in the group. To prevent buffering of data on a node from leading to a crash, SN will have to put back pressure on the client to lower the incoming traffic.

Storage Node will have to verify the existence and the length of the object presented to it. When SN receives a client request, it will try to fetch the object metadata from its database. If there is no record, either the object does not exist or SN is missing the object metadata. Therefore, before reading the content of the request, SN will make a request on the internal API of the master server to get the metadata and a list of peers. This scenario could lead to wasteful requests to the master server when the client sends multiple requests simultaneously. For each of the requests, SN would connect to the MS. Not only would the same metadata be sent redundantly, but each request would also open a new TCP connection. For that reason, the design in 3.1 proposes an open WebSocket connection between MS and each SN. After the client creates a new object, MS preemptively pushes the object metadata over the open connection to all nodes in the group.

The group of storage nodes will be a measure to prevent loss of the tenant data. In case of a failure of the node to which the client is connected, the client should continue sending data to other members of the group. Because the original node will have forwarded the client data to its peers, the other nodes will most likely have the previously sent parts.


Assuming a uniform chance of failure of any given SN, the larger the node group is, the higher the chance that some member fails. Therefore, when SN cannot synchronize data to its peer because of the peer's failure, the client data transfer should not be interrupted with a server error. The purpose of the group is to maximize the availability of the system for the client for the duration of the write.

When a failed node from the group reconnects, the system will not make any effort to synchronize the missing client content. This would require each SN to monitor the health of its peers for every object that is being assigned to it and to synchronize the state of the object with the other peers. This overhead would significantly reduce the performance and scalability of the storage. The other approach would transfer the responsibility to the master server. Each SN would have to commit the collected parts. This would bring massive overhead to MS and decrease scalability even more, because the architecture takes into account only one instance of MS. The load on MS would grow with the number of SNs connected to the cluster.

3.4.1.3 Client finishing request

Since the client will be allowed to send the object data to any available node from the given group and the nodes will not share a common state of the upload, the client will have to signal that the upload is over. The client will make a finishing request on any node from the group. The node will respond to the client with an acceptance of the request. Then the SN will transform the uploaded data into the internal representation described in section 3.5. After a successful finalization, the node will notify the MS using the WebSocket channel. The master server will record that the object is stored on the node and change the state of the object from writable to readable (depicted in Figure 3.5). The master server then uses the WebSocket connection to notify the other nodes from the group to finalize the object as well. MS will give them the location of the living replica in case any node is missing a part. The client will not be part of this process.

Figure 3.2 shows a situation in which a node crashes during the upload. Storage Node 2 is missing part of the client data. A similar scenario could involve a crashed node that restarted and on which the client is now issuing the finishing request. In both cases the node is missing data and cannot transform the object for permanent storage. It will refuse the client's request and respond with a list of all missing data ranges. The client will be obliged to supply all absent parts and make the finishing request again.
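The list of missing data ranges can be derived from the byte ranges of the parts a node has received. The following sketch shows one possible way to compute it. The class `MissingRanges` and the representation of a range as a two-element array {first, last} with inclusive bounds are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper computing which byte ranges of an object were not received. */
final class MissingRanges {
    /**
     * Returns the gaps, as {first, last} pairs with inclusive bounds, that are
     * not covered by any received part of an object of the given length.
     */
    static List<long[]> find(long objectLength, List<long[]> received) {
        List<long[]> sorted = new ArrayList<>(received);
        sorted.sort((a, b) -> Long.compare(a[0], b[0]));
        List<long[]> missing = new ArrayList<>();
        long next = 0; // first byte position not yet known to be covered
        for (long[] part : sorted) {
            if (part[0] > next) {
                missing.add(new long[] {next, part[0] - 1});
            }
            // Parts may overlap or repeat, so the covered prefix only ever grows.
            next = Math.max(next, part[1] + 1);
        }
        if (next < objectLength) {
            missing.add(new long[] {next, objectLength - 1});
        }
        return missing;
    }
}
```

For an object of 3000 bytes for which only the parts 0-999 and 2000-2999 arrived, the result contains the single range 1000-1999. An empty result means the object is complete and can be finalized.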

In a situation where an SN is missing some parts and MS issues a file finalization, there will be no client to resend the data. The node will have to connect to a peer that already has a complete replica of the object and request all missing data ranges from it. The location of that peer should be requested from the master server. After a successful finalization, the node will notify the MS. The master then records the location of the replica in the database.

3.4.1.4 Object identity

Design of the storage system does not consider any user roles or user permissions. For that reason, each object is given a statistically unique identifier. This approach limits the ability of clients to guess the ids of other objects in an attempt to damage or access them without any authorization. The length of the id will be 64 bits. This provides a large enough namespace for objects and allows the id to be stored internally as the long data type. Clients will be presented with an alphanumerical representation, which will be used in the URI (Uniform Resource Identifier) of client HTTP requests.

Figure 3.2: Diagram of an example situation of a write of an object with replication factor 2 with failure of one of the nodes

Generation of the id will take place on the master server during the client's request to create a new object. The generation will be based on UUID (universally unique identifier)[33].
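A possible way to obtain such an identifier in Java is sketched below. The text does not specify how the 128 bits of a UUID are reduced to 64 bits or which alphanumerical encoding is used, so both choices here are assumptions: the two halves of a random UUID are combined with XOR, and the id is printed as an unsigned number in base 36. The class name `ObjectIds` is hypothetical.

```java
import java.util.UUID;

/** Hypothetical helper for 64-bit object ids and their textual form. */
final class ObjectIds {
    /** Assumed reduction: XOR of both 64-bit halves of a random (version 4) UUID. */
    static long newId() {
        UUID uuid = UUID.randomUUID();
        return uuid.getMostSignificantBits() ^ uuid.getLeastSignificantBits();
    }

    /** Alphanumerical form for use in a URI: unsigned base 36 (digits and a-z). */
    static String toText(long id) {
        return Long.toUnsignedString(id, 36);
    }

    /** Inverse of toText; throws NumberFormatException for malformed input. */
    static long fromText(String text) {
        return Long.parseUnsignedLong(text, 36);
    }
}
```

The textual form never exceeds 13 characters, and `fromText(toText(id))` returns the original id for any 64-bit value.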

3.4.2 Object read process

Reading an object will be less complicated than writing. The procedure will take two steps:
1. Retrieve the object location from the master server
2. Request data from a storage node

The master server will be able to tell which running and connected storage nodes hold the object. The server will return a list of these nodes together with their IP addresses. The client may assume that data do not migrate within the cluster often and may instead directly approach a storage node known from the process of object write. If the node no longer stores the object, it will simply respond with an error status.

In order to speed up the transfer, the client may connect to multiple nodes at once. In that case, each request should contain a Range header specifying which part of the object should be returned by the particular node.

The whole sequence, with the optional request on the master, is shown in diagram 3.3.


3.4.3 Object deletion process

To delete an object, the client will have to make a request on the master server. The server handles the removal of the data from each storage node. The general process is shown in Figure 3.4. Some storage nodes holding the object may not be connected to the cluster at that moment. The master server must ensure that the deletion is executed after these nodes reconnect.

Figure 3.4: Process of asynchronous object deletion after client request

3.4.4 Object lifecycle

Figure 3.5 shows the main states of the object. The states are divided with respect to writability and readability.


Figure 3.5: Diagram of the main states of an object and the transitions between them

3.5 Object Representation

The analysis has shown that storing a large number of files may slow down the file system and increase the number of disk operations needed to access a file. The file system also stores metadata that the system would not utilize. For that reason, the system will store objects in large files. This work will refer to these files as monoliths. Figure 3.6 describes the internal organization of the data.

Figure 3.6: Diagram of the internal structure of a monolith file

The monolith file will start with a header that contains a unique constant identifying the file as a monolith. The purpose of the constant is to prevent reading and interpreting files that are not monoliths. The constant will be a sequence of 8 bytes encoding the word monolith in ASCII. After the constant, a 4-byte flag follows, indicating the state of the monolith. The flag can mark the monolith as deleted. This is described in more detail in section 3.5.5. The rest of the content of the file is a sequence of individual objects. Each object consists of three components:


• Object header
• Binary data content
• Object footer

Figure 3.7 shows the layout of an object. The header will identify the content of the object. It will include a 64-bit identifier, the length of the following data block, a flag indicating the state, and a 32-bit check number.

Figure 3.7: Diagram of the structure of an object stored in a single monolith file

The header will be followed by the data block that contains the whole object. The footer after the data block should contain two numbers: the checksum calculated from the data content and the check number from the header. The system will randomly generate the check number for each object during its write. When the system reads the content of the file or scans the objects stored in the monolith, matching numbers in the header and footer will signify that the reader is not misinterpreting some random data as a header. It will also mean that the monolith contains the whole object and not only a part of it.
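The layout of a stored object can be sketched with `java.nio.ByteBuffer`. The text fixes only the 64-bit identifier and the 32-bit check number, so the remaining choices in this sketch are assumptions: an 8-byte length, a 4-byte flag, and CRC32 as the 4-byte checksum. The class `ObjectRecord` is hypothetical, and the example builds the record in memory, which would only be practical for small objects.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical in-memory encoding of one object inside a monolith. */
final class ObjectRecord {
    static final int HEADER_BYTES = 8 + 8 + 4 + 4; // id, length, flag, check number
    static final int FOOTER_BYTES = 4 + 4;         // checksum, check number

    /** Builds header, data block and footer for one object. */
    static byte[] encode(long id, int flag, int checkNumber, byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + data.length + FOOTER_BYTES);
        buf.putLong(id).putLong(data.length).putInt(flag).putInt(checkNumber);
        buf.put(data);
        buf.putInt((int) crc.getValue()).putInt(checkNumber);
        return buf.array();
    }

    /** True when the check numbers in header and footer match and the checksum fits the data. */
    static boolean isIntact(byte[] record) {
        if (record.length < HEADER_BYTES + FOOTER_BYTES) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(record);
        buf.getLong();                  // object id, not needed for the check
        long length = buf.getLong();
        buf.getInt();                   // state flag, not needed for the check
        int headerCheck = buf.getInt();
        if (length != record.length - HEADER_BYTES - FOOTER_BYTES) {
            return false;               // truncated or misread record
        }
        CRC32 crc = new CRC32();
        crc.update(record, HEADER_BYTES, (int) length);
        buf.position(HEADER_BYTES + (int) length);
        int checksum = buf.getInt();
        int footerCheck = buf.getInt();
        return headerCheck == footerCheck && checksum == (int) crc.getValue();
    }
}
```

A record produced by `encode` passes `isIntact`, while a truncated record or one with a corrupted data byte does not.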

3.5.1 Maximum object size

By design, the system will not split a stored object. The advantage of this approach is that less metadata about the location of an object needs to be stored on an SN. To know the position of an object, only the file and the offset in that file are necessary. If the system stored objects as lists of blocks, it would be much harder to reconstruct the state of an SN after a failure. Reading a split object would require more disk operations and possibly the opening of multiple monolith files. Conventional magnetic disks are more efficient when reading data sequentially.

The downside of this approach is that the system must limit the maximum size of a stored object. The limit is the maximum length of a monolith. This length should be configurable, and our design suggests a size of 100GB. The possible size depends on the file system used. However, modern file systems usually permit this size.


3.5.2 Partial Content

Although the idea of monolith files draws from the design of Haystack volumes (described in section 2.3.1), there is a fundamental difference in use. Since the system allows clients to upload large files, they cannot be synchronously appended to the corresponding monoliths. Clients will also be allowed to send partial content of their data, and they may send it in an arbitrary order.

Therefore, an object is appended to a monolith only after its whole content is stored on the SN. When the client uploads a part of the content, SN stores it in a separate file. The name of the file will contain the object identifier and the position and range of the data in the original uploaded file. When SN processes the client's request to finalize the upload, it will check whether any parts of the object are missing. If not, SN will prepare an appropriate header and footer and append the object to the monolith file. After that, SN will erase the temporary upload data.

3.5.2.1 Expiration of temporary data

Storage Node deletes the temporary data on four events:

Client finished object upload: After an object is fully written to a monolith, SN deletes all temporary files.

Client deletes unfinished object: The delete is issued indirectly by MS.

Temporary files are checked at SN startup: SN checks temporary files during startup and deletes the malformed ones.

Temporary files expire: The client is obliged to finish the upload of an object within 24 hours. If the condition is not satisfied, SN deletes the abandoned temporary files.

3.5.3 States of a monolith file

Storage Nodes will record two indicators for each monolith. The first is the readability of the file. SN should not use non-readable monoliths when serving clients. The indicator comes into use when SN has to manipulate the file in a way that would interrupt readers serving requests.

The second indicator will be the writability of the monolith. A monolith may be sealed when its size approaches the maximum allowed limit. Although no objects will be permitted to be appended to the file, SN will still be able to delete objects from the non-writable monolith.

The combination of these two indicators determines four states of a monolith.

3.5.4 Selection of a monolith to store an object

An appender will be a unit capable of writing data to a monolith file. The storage node should manage an appender for each writable monolith. The appender is the only unit permitted to write data to its monolith. Therefore, objects will be appended sequentially. An appender will hold an open file handle for writing.


A machine hosting an SN may have multiple different disks mounted. On the start of the SN, a configuration will specify which disks should be used for the storage of monoliths. SN will track the amount of used space in memory.

When SN needs to append a new object, an appropriate appender will be selected. The selection consists of these steps:

1. Storage Node will choose a location on the disk least used by the storage
2. SN searches for a writable monolith whose size does not exceed the maximum limit after the object is appended
3. If there is no such file, SN creates a new monolith file
4. SN issues the appropriate appender to write the object
5. If the size of the monolith after the write exceeds 90% of the maximum size, the monolith state will be set to non-writable, and the appender will be closed.
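The steps above can be sketched in Java as follows. This is a simplified in-memory model under stated assumptions, not the implementation: the classes `Disk` and `Monolith` and the method `append` are hypothetical, only sizes are tracked instead of data being written by an appender, and the size of the monolith header is ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of the selection of a monolith for a new object. */
final class MonolithSelection {
    static final class Monolith {
        long size;               // bytes already occupied in the monolith
        boolean writable = true; // false once the monolith is sealed
    }

    static final class Disk {
        final String path;
        final List<Monolith> monoliths = new ArrayList<>();

        Disk(String path) {
            this.path = path;
        }

        long usedBytes() {
            long used = 0;
            for (Monolith m : monoliths) {
                used += m.size;
            }
            return used;
        }
    }

    /** Chooses a monolith for an object of the given size and accounts for the write. */
    static Monolith append(List<Disk> disks, long objectBytes, long maxMonolithBytes) {
        if (disks.isEmpty() || objectBytes > maxMonolithBytes) {
            throw new IllegalArgumentException("object cannot be stored");
        }
        // Step 1: choose the disk least used by the storage.
        Disk target = disks.get(0);
        for (Disk d : disks) {
            if (d.usedBytes() < target.usedBytes()) {
                target = d;
            }
        }
        // Step 2: look for a writable monolith that can still hold the object.
        Monolith chosen = null;
        for (Monolith m : target.monoliths) {
            if (m.writable && m.size + objectBytes <= maxMonolithBytes) {
                chosen = m;
                break;
            }
        }
        // Step 3: create a new monolith when no existing one fits.
        if (chosen == null) {
            chosen = new Monolith();
            target.monoliths.add(chosen);
        }
        // Step 4: a real appender would write the object here; the model only grows the size.
        chosen.size += objectBytes;
        // Step 5: seal the monolith once it exceeds 90% of the maximum size.
        if (chosen.size > maxMonolithBytes / 10 * 9) {
            chosen.writable = false;
        }
        return chosen;
    }
}
```

With two empty disks and a maximum monolith size of 100 bytes, objects of 50 and 30 bytes end up on different disks, because the second disk is the less used one when the second object arrives.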

By spreading files equally across the disks, SN distributes the IO load and makes better use of disk capabilities. This distribution strategy is a heuristic that will fail if there is a disk of significantly lower capacity. This disk will probably be filled first, preventing any storage of new objects.

3.5.5 Deletion of an object

By the nature of object storage, objects are expected not to be changed, or deleted, often. However, when SN is instructed to remove an object, the object must be deleted from the monolith file that holds it. Such an operation would require a significant number of disk operations if the SN had to shift all data stored after the deleted object, and it would be unwise to do so. Instead, SN will use the flag in the header of the object introduced in section 3.5. The flag will be set to indicate that the following object is deleted. After SN sets the flag, it will increase a counter associated with the monolith that records the total size of deleted objects in the file. This operation is trivial and allows new objects to be added to the file at the same time.

3.5.6 Compaction of a monolith

When deleted objects occupy more than 25% of the size of the monolith file, SN should compact the monolith. Monolith compaction is an operation during which living objects are transferred to a new location and the monolith is deleted. An example compaction is depicted in Figure 3.8.

SN should delete such a monolith only if it satisfies three conditions:
1. Monolith is not writable
2. Deleted objects occupy more than 25% of the size of the file
3. There are no readers of the file
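The three conditions can be expressed as a single predicate. The sketch below is illustrative only: the class and parameter names are hypothetical, and the 25% threshold is checked with integer arithmetic to avoid rounding.

```java
/** Hypothetical check of the three conditions for removing a monolith. */
final class CompactionCheck {
    static boolean mayDelete(boolean writable, long deletedBytes, long fileBytes, int activeReaders) {
        // "More than 25%" means deletedBytes / fileBytes > 1/4, i.e. 4 * deletedBytes > fileBytes.
        return !writable && deletedBytes * 4 > fileBytes && activeReaders == 0;
    }
}
```

A sealed monolith of 100 bytes with 26 deleted bytes and no readers satisfies the predicate, while the same monolith with exactly 25 deleted bytes does not.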


It is important that the monolith is in the non-writable state, for two reasons. First, the deletion of a couple of objects while the monolith is still small would trigger unnecessary copy operations and would prevent the monolith from growing. Second, there is no point in deleting the file while new data are still being appended to it.

The situation is more complicated with readers. SN checks the amount of space retained by deleted objects after it removes an object. At that time, there may be several clients reading data from the file. Therefore, the compaction should proceed as follows:

1. SN traverses the file and collects the offset and header of each non-deleted object.
2. SN selects a new location for the living objects. If there is no space in an existing monolith, a new one is created.
3. SN appends the living objects to the selected monolith and makes records in the database.
4. SN sets the state of the source monolith to non-readable to prevent new readers from starting new reads.
5. SN removes the records in the database pointing to the file.
6. SN sets the flag in the header of the monolith to mark the whole file as deleted.
7. After all readers are finished, SN deletes the physical file.

Figure 3.8: Process of compaction of a monolith and selection of a new location for the living objects

Living objects are transferred to a monolith whose current size can accommodate the copied data. Monolith 1 is deleted afterward.

To know when the monolith is no longer being read, SN will have to record the number of readers for each monolith. Storage Node should add the deletion flag to the header of the monolith to cover the case of a system crash. The file will then be skipped during the startup scan and immediately deleted.


3.6 Communication protocols

The master server and the storage nodes cannot operate entirely independently in the cluster. Servers will need a communication channel to exchange information about stored objects and peers in the cluster. The design distinguishes two types of message exchanges: unidirectional and bidirectional messages.

3.6.0.1 Unidirectional messages

The following unidirectional messages will be sent between a master server and a storage node:
• MS sends essential metadata to SN
• MS issues deletion of an object
• MS issues acquisition of a replica
• SN sends notification about successful storage of a file
• SN sends heartbeat message

None of these messages requires the accepting side to respond with a value. Also, the request contained in each message is idempotent, so it is safe to send such a message multiple times. Therefore, these messages will be transferred between MS and nodes over a WebSocket connection. A WebSocket connection provides a full-duplex communication channel over a single TCP connection. This approach brings two advantages: the TCP protocol provides reliable message delivery, and servers will not wastefully open a new connection for each message sent.

3.6.0.2 Messages requiring a response

There is no case in which the master would need a direct response from a node. The following requests will be sent from a storage node to the master:

• SN will need to check the existence and length of an object
• SN will need to acquire a list of running nodes that share the assigned group for an uploaded object

The master server will expose a RESTful API providing this data. This interface will be used only internally and will be accessed solely by the storage nodes.

3.6.0.3 Joining a new storage node to the existing cluster

The master server will have to record the current number of connected storage nodes. For each storage node, the master must know its address and the ports on which the node listens. The server will also track statistics about used disk space to better distribute objects across the whole cluster. A storage node must go through two stages to become fully functional. Figure 3.9 describes the process of a node joining the cluster.



1. The SN makes a join request on the MS using the internal REST API. The request contains the local and public IP addresses and the TCP ports in use. If the node was a part of the cluster in the past, it adds its unique identification.

2. The master may generate a new unique identifier for the new node. The master may also require a list of all objects currently stored on the node.

3. If the node is required to send the list, the procedure continues with a follow-up request on the master containing a list of ids.

4. The master accepts the request and returns the assigned identifier of the node. From that moment, the master considers the node to have joined the cluster but does not yet recognize the node as functional.

5. In the next stage, the connecting storage node must open a WebSocket connection. The node must send a message containing its identity so that the master can pair the connected channel with the node.

6. The master receives the identification message on the newly connected WebSocket channel. If the enclosed id matches a node that is not connected and has already joined the cluster in the previous stage, the master sends a message over the channel containing confirmation of the connection.

7. In this stage, the master acknowledges that the node is connected and functional. It will accept messages sent over the WebSocket and assign newly created objects to this node.

8. After receiving the confirmation message over the WebSocket, the node may start serving client requests as long as there is an open WebSocket connection.
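The two stages above can be sketched as a small state machine kept by the master. This is a minimal sketch under stated assumptions: the `NodeRegistry` class, its method names, the UUID-based identifiers and the choice to forget a node on disconnect are all hypothetical and serve only to illustrate the ordering of the steps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of the master's view of the two-stage join procedure.
class NodeRegistry {
    enum NodeState { JOINED, CONNECTED }

    private final Map<String, NodeState> nodes = new HashMap<>();

    // Stage 1 (steps 1 to 4): REST join request.
    // knownId is null for a node that was never a part of the cluster.
    String join(String knownId) {
        String id = (knownId != null) ? knownId : UUID.randomUUID().toString();
        // A node that is already registered keeps its current state.
        nodes.putIfAbsent(id, NodeState.JOINED);
        return id;
    }

    // Stage 2 (steps 5 to 7): identification message on a new WebSocket channel.
    // Returns true when the master should send the confirmation message.
    boolean identify(String id) {
        if (nodes.get(id) != NodeState.JOINED) {
            return false; // unknown node, or node that is already connected
        }
        nodes.put(id, NodeState.CONNECTED);
        return true;
    }

    // Assumption: a node whose WebSocket closes must repeat the whole procedure.
    void disconnect(String id) {
        nodes.remove(id);
    }

    // Only connected nodes are assigned newly created objects.
    boolean isFunctional(String id) {
        return nodes.get(id) == NodeState.CONNECTED;
    }
}
```

A node that skips the REST join, or that identifies itself twice, is rejected in `identify`, which mirrors the condition in step 6.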

3.6.0.4 Communication between peers

During client uploads and object replications, the storage nodes will need to exchange object data. Each transfer will have a different recipient. It would be unnecessarily complicated for each node to maintain a persistent connection with every other node in the cluster. For that reason, each data transfer will be carried out over the HTTP protocol. HTTP methods are well suited for requesting parts of data and posting data to other peers. Storage nodes will have to expose a RESTful API intended for other peers. The clients will not have access to this interface.
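As an illustration of requesting a part of an object from a peer, the following minimal sketch builds an HTTP GET with a standard `Range` header using the `java.net.http` API available since Java 11. The URL path `/internal/objects/` and the class and method names are assumptions; the design does not prescribe the layout of the peer API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: building a partial read request for a peer storage node.
class PeerRequests {
    // Builds a GET for `length` bytes of an object starting at `offset`.
    // The path "/internal/objects/" is an assumed, illustrative endpoint.
    static HttpRequest partialRead(String peerBaseUrl, String objectId, long offset, long length) {
        long last = offset + length - 1; // HTTP byte ranges are inclusive on both ends
        return HttpRequest.newBuilder(URI.create(peerBaseUrl + "/internal/objects/" + objectId))
                .header("Range", "bytes=" + offset + "-" + last)
                .GET()
                .build();
    }
}
```

The request is only constructed here; sending it would be done with an `HttpClient` and a body handler appropriate for binary data.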

3.6.0.5 Data transfer protocol

Transfer of object data between different nodes will not require any additional protocol on top of HTTP. Nodes will specify the sent data using the HTTP URI and appropriate headers. The content of the message will be plain binary data. However, no object data will be transferred over the master's internal API or the WebSocket connection. The data in the message bodies on these channels will be serialized using Google Protocol Buffers. Protocol Buffers are a platform-neutral data interchange format¹. Protocol Buffers serialize structured data to efficient binary representations suitable for data transfer. Serialization and deserialization speed is comparable to the widely used JSON format [34]. The design favors the use of Protocol Buffers due to good integration with the Java platform and the opportunity to establish a fixed structure of messages.

Figure 3.9: Protocol of the communication during the connection of a node to the cluster

3.7 Heartbeat

The master server will require periodic reports from each connected node to know the current status of the cluster.

The SN will send a heartbeat every five seconds. This period is chosen to keep the master up to date without adding significant overhead on the nodes for preparing the message. The node will transmit the message over the open WebSocket channel. The report in a heartbeat will contain statistics about disk usage: the amount of disk space used by the monolith files and temporary files, and the space occupied by temporary upload data. The node will send these metrics even though the master will know every object stored on the node and could calculate the sum of the taken space, because that sum would not reflect the real disk usage. The storage node will have an internal policy for how many local copies of each object it stores. When a client makes a delete request on the master, the master will not know exactly when the physical data of the object will be removed. As designed in Section 3.5, deleted data may remain stored on the node after the associated object is deleted. The master server will use the disk statistics to better spread stored objects across the cluster.

Figure 3.10 shows a heartbeat check. The master server will set a grace period of one minute for each node to send a heartbeat. If the period elapses without the server receiving a heartbeat, the master server will mark the node as disconnected and close the WebSocket connection. The period is set to tolerate possible peak loads on the node and network instability.
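The grace period check can be sketched as follows. This is a minimal sketch, assuming the master keeps the time of the last heartbeat per node and runs the check periodically; the `HeartbeatMonitor` class and its methods are hypothetical, and closing the WebSocket is left to the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the master-side heartbeat check.
class HeartbeatMonitor {
    // One-minute grace period, as set by the design.
    static final Duration GRACE_PERIOD = Duration.ofMinutes(1);

    private final Map<String, Instant> lastHeartbeat = new HashMap<>();

    // Called whenever a heartbeat arrives over a node's WebSocket channel.
    void onHeartbeat(String nodeId, Instant receivedAt) {
        lastHeartbeat.put(nodeId, receivedAt);
    }

    // Returns the nodes whose last heartbeat is older than the grace period
    // and stops tracking them; the caller would mark them as disconnected
    // and close their WebSocket connections.
    List<String> expire(Instant now) {
        List<String> expired = new ArrayList<>();
        lastHeartbeat.entrySet().removeIf(entry -> {
            boolean late = Duration.between(entry.getValue(), now).compareTo(GRACE_PERIOD) > 0;
            if (late) {
                expired.add(entry.getKey());
            }
            return late;
        });
        return expired;
    }
}
```

With a five-second heartbeat interval, a node has to miss roughly twelve consecutive heartbeats before it is expired, which leaves room for short load peaks and network hiccups.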



Figure 3.10: Sequence diagram of heartbeats sent from a storage node to the master server

3.8 Replication of data

The crucial requirement of the system is to protect the stored data against loss. The system will achieve this by replicating data across different machines in the cluster. The data will remain available to the client even during the failure of a node. During the creation of an object, the client will define the value of the replication factor. The replication factor defines how many times the object is stored in the cluster.

The system will distinguish five states of an object.

Newly assigned object: This is the initial state of an object after the client issues a creation request on the master. No storage node holds a replica and the object is not readable.

Under-replicated object: After finalization of an object or after a failure of a node, the number of replicas will most likely not match the desired replication factor. The master server must command a sufficient number of nodes to acquire a replica.

Fully replicated object: The number of live replicas in the cluster equals the requested replication factor. This is the desired state for each object.

Over-replicated object: There are more live replicas in the cluster than the set replication factor. The master server should delete the excess replicas.

Lost object: No SN holds the object data. Unless a node with an existing replica starts and joins the cluster, the object is lost and not recoverable.
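The five states can be derived from the number of live replicas and the replication factor, as the following minimal sketch shows. The `finalized` flag is an assumption introduced here to tell a newly assigned object, which has no replica yet, from a lost one; the class and method names are likewise hypothetical.

```java
// Hypothetical sketch: classifying an object into one of the five states.
class ReplicationStates {
    enum State { NEWLY_ASSIGNED, UNDER_REPLICATED, FULLY_REPLICATED, OVER_REPLICATED, LOST }

    // finalized: assumed flag saying the client has finished creating the object.
    // liveReplicas: number of replicas held by currently connected nodes.
    static State classify(boolean finalized, int liveReplicas, int replicationFactor) {
        if (!finalized) {
            return State.NEWLY_ASSIGNED; // no node holds a replica yet
        }
        if (liveReplicas == 0) {
            return State.LOST; // recoverable only if a node with a replica rejoins
        }
        if (liveReplicas < replicationFactor) {
            return State.UNDER_REPLICATED; // master commands more nodes to acquire a replica
        }
        if (liveReplicas > replicationFactor) {
            return State.OVER_REPLICATED; // master deletes the excess replicas
        }
        return State.FULLY_REPLICATED;
    }
}
```

For example, an object with a replication factor of three and two live replicas is classified as under-replicated, so the master would command one more node to acquire a replica.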
