Distributed File Systems

(1)

Distributed File Systems

Alemnew Sheferaw Asrese

University of Trento - Italy

December 12, 2012

(2)

Outline

1 Distributed File Systems

Introduction

Most Important DFS

2 The Google File System (GFS)

Introduction Design Overview System Interactions Master Operation

Fault Tolerance and Diagnosis

3 The Hadoop Distributed File System (HDFS)

Introduction Architecture

File I/O Operations and Replica Management Writing and Reading Data

Replica Placement

4 Conclusions

(3)

Introduction

Definition of DFS

File system that allows access to files from multiple hosts via a computer network

Features

Share files and storage resources on multiple machines No direct access to data storage for clients

Clients interact using a protocol Access restrictions to file system

(4)

Distributed File Systems Introduction

DFS Goals

Access transparency Location transparency Concurrent file updates File replication

Hardware and software hetereogenity Fault tolerance

Consistency Security Efficiency

(5)

Most Important DFS

NFS (Network File System), developed by Sun Microsystems in 1984

Small number of clients, very small number of servers Any machine can be a client and/or a server

Stateless file server with few exceptions (file locking) High performance

AFS (Andrew File System), developed by the Carnegie Mellon University

Support sharing on a large scale (up to 10,000 users) Client machines have disks (cache files for long periods) Most files are small

Reads more common than writes Most files read/written by one user

(6)

Distributed File Systems Most Important DFS

Need for Large Streams of Data

Standard DFS not adequate to manage large streams of data: Google : 24 PB/day

Facebook: 130 TB/day for logs, 500+ TB/day new date entry, 300 M photos uploaded per day, 5 040 TB data scanned in Hive per day LHC-CERN: 25 PB/year

Such workloads need different assumptions and goals. Proposed solutions: The Google File System (GFS)

Hadoop Distributed File System (HDFS) by Apache Software Foundation

(7)

The Google File System (GFS) [1]

Proprietary DFS, only key ideas available Developed and implemented by Google in 2003

Shares many goals with standard DFS (performance, scalability, reliability, availability,...)

Needed to manage data-intensive applications Provides fault tolerance

(8)

The Google File System (GFS) Introduction

GFS - Motivations

Component failures are the norm:

A storage cluster is built from hundreds or thousands of inexpensive commodity servers

Constant monitoring, error detection, fault tolerance, and automatic recovery is needed

Huge files (Multi- GB) with many application objects Appending new data rather than overwriting existing

Co-designing the applications and the file system API to increase flexibility

(9)

Assumptions

Composed of inexpensive commodity components that often fail Modest number of huge files (Multi GBs)

Large streaming reads and small random reads

Sequential writes that append to files rather than overwrite Files seldom modified again

Semantics for concurrently append to the same file by multiple clients High bandwidth more important than low latency

(10)

The Google File System (GFS) Design Overview

GFS Interface

Familiar file system interface

Don’t implement Standard API such as POSIX

Supports

Common Operations (create, delete,open, close, read, and write files) Snapshot

(11)

Architecture

(12)

Architecture

Master has file system metadata Master controls:

Chunk lease management

Garbage collection of Orphaned chunks Chunk migration

Periodically HeartBeat messages exchanged between master and chunkservers to check the state of chunkserver

(13)

Single Master

Enables to make sophisticated chunk placement and replication decisions

Must minimize its involvement in reads and writes Clients never read and write file data through the master Client asks the master which chunkservers it should contact

(14)

Single Master Cont...

Problems

Bandwidth bottleneck A single point of failure

Opportunities

Central place to control replication and garbage collection and many other activities

Solution

Around 2009 a Distributed Master is proposed and implemented for the current schema. BigTable and file count are being used. Details are not clearly available :)

(15)

ChunkServer

Large chunk size

64 MB - larger than typical File system

Large chunk size: Pros

Reduces client master interaction

Reduce network overhead by keeping a persistent TCP connection Reduces the size of the metadata stored on the master

Cons

internal fragmentation (solution: lazy space allocation)

large chunks can become hot spots (solutions: higher replication factor, staggering application start time, client-to-client communication)

(16)

Metadata

All metadata is kept in main memory of the master

Namespace

File-to-chunk mapping Chunks replicas location Access control information

Namespace and file-to-chunk mapping are persistently stored Mutations logged to an operation log

No persistent record of replicas locations Operation Log:

Contains critical metadata changes Defines order of concurrent operations Replicated on multiple remote machines Used to recover the master

Checkpoint of master if log beyond certain size:

Master switches to a new log

(17)

Consistency Model

File namespace mutations are atomic

Namespace locking guarantees atomicity and correctness Mutations applied in same order on all replicas

Chunk version numbers used to detect any stale replica (missed mutations while chunkserver was down)

Stale replicas are garbage collected

Checksum for detection of data corruption

A file region is consistent if all the clients will always see the same data regardless of which replicas they read from.

(18)

The Google File System (GFS) System Interactions

Leases and Mutation Order

Master uses leases to maintain consistent mutation order Master grants a chunk lease to a replica, called primary Primary chooses serial order between Mutations

Lease grant order defined by the master Order within a lease defined by the primary

(19)

Control Flow of a Write

(20)

Data Flow

Separated flow of data and control Goals:

1 _{Fully use outgoing bandwidth of each machine:}

Data pushed in a chain of chunkservers

2 _{Avoid network bottlenecks and high-latency links:}

Each machine forwards data to ”closest” machine

3 _{Minimize latency:}

(21)

Atomic Record Appends

Traditional Write

Client specifies the offset

Concurrent writes to the same region are not serializable

Record Append

Client specifies only the data

GFS Appends it to the file at least once atomically at an offset of GFSs choosing and returns that offset to the client

Used by Distributed application in which many clients on different machines append to the same file concurrently

(22)

Snapshot

Used to:

1 _{Quickly create copies of huge data sets} 2 _{Easily checkpoint current state}

The master receives snapshot request:

1 _{Revokes leases on chunks of the files it is about to snapshot} 2 _{Logs the operation to disk}

3 _{Duplicates the corresponding metadata}

New snapshot file points to same chunk C of source file New chunk C’ created for first write after snapshot

(23)

Namespace Management and Locking

No per-directory data structure that lists all files in the directory Doesn’t support aliases for the same file (No hard or symbolic links) Namespace maps full pathnames to metadata

Each master operation acquires a set of locks before it runs Read lock on internal nodes and read/write lock on the leaf Locking allows concurrent mutations in the same directory

Read lock on directory prevents its deletion, renaming or snapshot Write locks on file names serialize attempts to create a file with the same name twice

(24)

The Google File System (GFS) Master Operation

Replica Placement

Hundreds of chunkservers across many racks Chunkservers accessed from hundreds of clients Goals:

Maximize data reliability and availability Maximize network bandwidth utilization

(25)

Creation, Re-replication and Rebalancing

Creation

Place new replicas on chunkservers with below-average disk usage Limit number of recent creations on each chunkservers

Replication

When number of available replicas falls below a user-specified goal Prioritization: based on how far it is from its replication goal, live files than recently deleted files ...

Rebalancing

Periodically, for better disk utilization and load balancing Distribution of replicas is analyzed

(26)

The Google File System (GFS) Master Operation

Garbage Collection

File deletion logged by master

File renamed to a hidden name with deletion timestamp instead of reclaiming the resources immediately

Master regularly scan and removes hidden files older than 3 days (configurable)

Until then, hidden file can be read and undeleted

When a hidden file is removed, its in-memory metadata is erased Orphaned chunks identified removed during a regular scan of the chunk namespace

(27)

Stale Replica Detection

Need to distinguish between up-to-date and stale replicas Chunk version number:

Increased when master grants new lease on the chunk Not increased if replica is unavailable

(28)

The Google File System (GFS) Fault Tolerance and Diagnosis

High Availability

Fast Recovery

Master and chunkservers have to restore their state and start in seconds no matter how they terminated

Chunk Replication

Each chunk is replicated on multiple chunkservers on different racks

Master Replication

Master state replicated for reliability on multiple machines When master fails:

It can restart almost instantly

A new master process is started elsewhere

”Shadow” (not mirror) master provides only read-only access to file system when primary master is down

(29)

Data Integrity

Checksumming to detect corruption of stored data Integrity verified by each chunkserver independently Corruptions not propagated to other chunkservers

(30)

The Google File System (GFS) Fault Tolerance and Diagnosis

Diagnostic Tools

Extensive and detailed diagnostic logging

Diagnostic logs record many significant events(Chunkservers going up and down, RPC requests and replies)

... can be easily deleted without affecting the system

Bibliography

Case Study, GFS: Evolution on Fast-forward, A discussion between Kirk McKusick and Sean Quinlan; ACM Queue, 2009

(31)

What is Hadoop?

Summary

An Apache project

Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce).

Components:

HBase - originally by Powerset, then Microsoft Hive - Facebook

Pig, ZooKeeper, Chukwa - Yahoo!

Avro - Originally Yahoo, and co-developed with cloudera

(32)

The Hadoop Distributed File System (HDFS) Introduction

The Hadoop Distributed File System (HDFS) [2]

Sub-project of Apache Hadoop project

Designed to store very large data sets separately:

Metadata on dedicated server called NameNode Application data on DataNodes

Creates multiple replicas of data blocks Distributes replicas on nodes

Highly fault-tolerant

Designed to run on commodity hardware Suitable for applications for large data sets

(33)

(34)

The Hadoop Distributed File System (HDFS) Architecture

NameNode

Namespace is a hierarchy of files and directories

Files and directories are represented on the NameNode by inodes Maintains the namespace tree and the mapping of file blocks to DataNodes

Authorization and authentication

A multithreaded system and processes requests simultaneously from multiple clients

Keeps the entire namespace in RAM, allowing fast access to the metadata.

(35)

DataNodes

Responsible for serving read and write requests from the file systems clients

Perform block creation, deletion, and replication upon instruction from the NameNode

Performs handshake by connecting with NameNode during startup To verify the Namespace ID and software version of the DataNode Nodes with a different namespace ID will not be able to join the cluster

Incompatible version may cause data corruption or loss

Send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available.

(36)

HDFS Client

(37)

Image and Journal

Image

File system metadata that describes the organization of application data as directories and files

A persistent record of the image written to disk is called a checkpoint

Journal

A write-ahead commit log for changes to the file system that must be persistent

During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal

A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients

(38)

CheckpointNode

Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal

Usually runs on a different host wrt NameNode

Downloads the current checkpoint and journal files from the NameNode,

Merges them locally,and returns the new checkpoint back to the NameNode

(39)

BackupNode

Create periodic checkpoints

As Checkpoint Node, but it applies edits to its own namespace copy Without downloading checkpoint and journal files from the active NameNode

But in addition it maintains an in-memory, up-to-date image of the file system namespace

Always synchronized with the state of the NameNode Can be viewed as a read-only NameNode

(40)

The Hadoop Distributed File System (HDFS) File I/O Operations and Replica Management

File Read and Write, Block Placement

File Read and Write

Implements a single-writer, multiple-reader model

Block Placement

Spread the nodes across multiple racks Replicas placed:

The first on the node where the writer located,

The second and third on different nodes at different racks The others palced on random nodes:

1 _{No Datanode contains more than one replica of any block} 2 _{No rack contains more than two replicas are in the same rack}

(41)

Replication management

Detects that a block has become under- or over-replicated When a block becomes over replicated, the NameNode chooses a replica to remove

When a block becomes under-replicated, it is put in the replication priority queue

A block with only one replica has the highest priority

A block with a number of replicas that is greater than two thirds of its replication factor has the lowest priority

(42)

The Hadoop Distributed File System (HDFS) File I/O Operations and Replica Management

Inter-Cluster Data Copy

A tool called ”DistCp” used for large inter/intra-cluster parallel copying

A MapReduce job; each of the map tasks copies a portion of the source data into the destination file system

(43)

(44)

The Hadoop Distributed File System (HDFS) Writing and Reading Data

(45)

(46)

(47)

(48)

(49)

(50)

(51)

(52)

The Hadoop Distributed File System (HDFS) Replica Placement

Cluster Rebalancing

Balanced cluster: no under-utilized or over-utilized DataNodes Common reason: addition of new DataNodes

Requirements

No block loss

No change of the number of replicas

No reduction of the number of racks a block resides in No saturation of the network

Rebalancing issued by an administrator

Rebalancing decisions made by a Rebalancing Server

Goal of each iteration: reduce imbalance in over/under-utilized DataNodes

(53)

Balancer

A tool that balances disk space usage on an HDFS cluster.

Iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization

Bibliography

(54)

Conclusions

GFS

Proprietary file system

All the details in the paper of 2003

HDFS

Open-source project

Details are available in the paper of 2010 and in Hadoop website Used by many big companies

(55)

Bibliography

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.

The Google File System.

In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 29–43, New York, NY, USA, 2003. ACM.

Shvachko Konstantin, Kuang Hairong, Radia Sanjay, and Robert Chansler.

The Hadoop Distributed File System.

In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10,