Distributed File Systems
Alemnew Sheferaw Asrese
University of Trento - Italy
December 12, 2012
Outline
1 Distributed File Systems
Introduction
Most Important DFS
2 The Google File System (GFS)
Introduction Design Overview System Interactions Master Operation
Fault Tolerance and Diagnosis
3 The Hadoop Distributed File System (HDFS)
Introduction Architecture
File I/O Operations and Replica Management Writing and Reading Data
Replica Placement
4 Conclusions
Introduction
Definition of DFS
File system that allows access to files from multiple hosts via a computer network
Features
Share files and storage resources on multiple machines No direct access to data storage for clients
Clients interact using a protocol Access restrictions to file system
Distributed File Systems Introduction
DFS Goals
Access transparency Location transparency Concurrent file updates File replication
Hardware and software hetereogenity Fault tolerance
Consistency Security Efficiency
Most Important DFS
NFS (Network File System), developed by Sun Microsystems in 1984
Small number of clients, very small number of servers Any machine can be a client and/or a server
Stateless file server with few exceptions (file locking) High performance
AFS (Andrew File System), developed by the Carnegie Mellon University
Support sharing on a large scale (up to 10,000 users) Client machines have disks (cache files for long periods) Most files are small
Reads more common than writes Most files read/written by one user
Distributed File Systems Most Important DFS
Need for Large Streams of Data
Standard DFS not adequate to manage large streams of data: Google : 24 PB/day
Facebook: 130 TB/day for logs, 500+ TB/day new date entry, 300 M photos uploaded per day, 5 040 TB data scanned in Hive per day LHC-CERN: 25 PB/year
Such workloads need different assumptions and goals. Proposed solutions: The Google File System (GFS)
Hadoop Distributed File System (HDFS) by Apache Software Foundation
The Google File System (GFS) [1]
Proprietary DFS, only key ideas available Developed and implemented by Google in 2003
Shares many goals with standard DFS (performance, scalability, reliability, availability,...)
Needed to manage data-intensive applications Provides fault tolerance
The Google File System (GFS) Introduction
GFS - Motivations
Component failures are the norm:
A storage cluster is built from hundreds or thousands of inexpensive commodity servers
Constant monitoring, error detection, fault tolerance, and automatic recovery is needed
Huge files (Multi- GB) with many application objects Appending new data rather than overwriting existing
Co-designing the applications and the file system API to increase flexibility
Assumptions
Composed of inexpensive commodity components that often fail Modest number of huge files (Multi GBs)
Large streaming reads and small random reads
Sequential writes that append to files rather than overwrite Files seldom modified again
Semantics for concurrently append to the same file by multiple clients High bandwidth more important than low latency
The Google File System (GFS) Design Overview
GFS Interface
Familiar file system interface
Don’t implement Standard API such as POSIX
Supports
Common Operations (create, delete,open, close, read, and write files) Snapshot
Architecture
The Google File System (GFS) Design Overview
Architecture
Master has file system metadata Master controls:
Chunk lease management
Garbage collection of Orphaned chunks Chunk migration
Periodically HeartBeat messages exchanged between master and chunkservers to check the state of chunkserver
Single Master
Enables to make sophisticated chunk placement and replication decisions
Must minimize its involvement in reads and writes Clients never read and write file data through the master Client asks the master which chunkservers it should contact
The Google File System (GFS) Design Overview
Single Master Cont...
Problems
Bandwidth bottleneck A single point of failure
Opportunities
Central place to control replication and garbage collection and many other activities
Solution
Around 2009 a Distributed Master is proposed and implemented for the current schema. BigTable and file count are being used. Details are not clearly available :)
ChunkServer
Large chunk size
64 MB - larger than typical File system
Large chunk size: Pros
Reduces client master interaction
Reduce network overhead by keeping a persistent TCP connection Reduces the size of the metadata stored on the master
Cons
internal fragmentation (solution: lazy space allocation)
large chunks can become hot spots (solutions: higher replication factor, staggering application start time, client-to-client communication)
The Google File System (GFS) Design Overview
Metadata
All metadata is kept in main memory of the master
Namespace
File-to-chunk mapping Chunks replicas location Access control information
Namespace and file-to-chunk mapping are persistently stored Mutations logged to an operation log
No persistent record of replicas locations Operation Log:
Contains critical metadata changes Defines order of concurrent operations Replicated on multiple remote machines Used to recover the master
Checkpoint of master if log beyond certain size:
Master switches to a new log
Consistency Model
File namespace mutations are atomic
Namespace locking guarantees atomicity and correctness Mutations applied in same order on all replicas
Chunk version numbers used to detect any stale replica (missed mutations while chunkserver was down)
Stale replicas are garbage collected
Checksum for detection of data corruption
A file region is consistent if all the clients will always see the same data regardless of which replicas they read from.
The Google File System (GFS) System Interactions
Leases and Mutation Order
Master uses leases to maintain consistent mutation order Master grants a chunk lease to a replica, called primary Primary chooses serial order between Mutations
Lease grant order defined by the master Order within a lease defined by the primary
Control Flow of a Write
The Google File System (GFS) System Interactions
Data Flow
Separated flow of data and control Goals:
1 Fully use outgoing bandwidth of each machine:
Data pushed in a chain of chunkservers
2 Avoid network bottlenecks and high-latency links:
Each machine forwards data to ”closest” machine
3 Minimize latency:
Atomic Record Appends
Traditional Write
Client specifies the offset
Concurrent writes to the same region are not serializable
Record Append
Client specifies only the data
GFS Appends it to the file at least once atomically at an offset of GFSs choosing and returns that offset to the client
Used by Distributed application in which many clients on different machines append to the same file concurrently
The Google File System (GFS) System Interactions
Snapshot
Used to:
1 Quickly create copies of huge data sets 2 Easily checkpoint current state
The master receives snapshot request:
1 Revokes leases on chunks of the files it is about to snapshot 2 Logs the operation to disk
3 Duplicates the corresponding metadata
New snapshot file points to same chunk C of source file New chunk C’ created for first write after snapshot
Namespace Management and Locking
No per-directory data structure that lists all files in the directory Doesn’t support aliases for the same file (No hard or symbolic links) Namespace maps full pathnames to metadata
Each master operation acquires a set of locks before it runs Read lock on internal nodes and read/write lock on the leaf Locking allows concurrent mutations in the same directory
Read lock on directory prevents its deletion, renaming or snapshot Write locks on file names serialize attempts to create a file with the same name twice
The Google File System (GFS) Master Operation
Replica Placement
Hundreds of chunkservers across many racks Chunkservers accessed from hundreds of clients Goals:
Maximize data reliability and availability Maximize network bandwidth utilization
Creation, Re-replication and Rebalancing
Creation
Place new replicas on chunkservers with below-average disk usage Limit number of recent creations on each chunkservers
Replication
When number of available replicas falls below a user-specified goal Prioritization: based on how far it is from its replication goal, live files than recently deleted files ...
Rebalancing
Periodically, for better disk utilization and load balancing Distribution of replicas is analyzed
The Google File System (GFS) Master Operation
Garbage Collection
File deletion logged by master
File renamed to a hidden name with deletion timestamp instead of reclaiming the resources immediately
Master regularly scan and removes hidden files older than 3 days (configurable)
Until then, hidden file can be read and undeleted
When a hidden file is removed, its in-memory metadata is erased Orphaned chunks identified removed during a regular scan of the chunk namespace
Stale Replica Detection
Need to distinguish between up-to-date and stale replicas Chunk version number:
Increased when master grants new lease on the chunk Not increased if replica is unavailable
The Google File System (GFS) Fault Tolerance and Diagnosis
High Availability
Fast Recovery
Master and chunkservers have to restore their state and start in seconds no matter how they terminated
Chunk Replication
Each chunk is replicated on multiple chunkservers on different racks
Master Replication
Master state replicated for reliability on multiple machines When master fails:
It can restart almost instantly
A new master process is started elsewhere
”Shadow” (not mirror) master provides only read-only access to file system when primary master is down
Data Integrity
Checksumming to detect corruption of stored data Integrity verified by each chunkserver independently Corruptions not propagated to other chunkservers
The Google File System (GFS) Fault Tolerance and Diagnosis
Diagnostic Tools
Extensive and detailed diagnostic logging
Diagnostic logs record many significant events(Chunkservers going up and down, RPC requests and replies)
... can be easily deleted without affecting the system
Bibliography
Case Study, GFS: Evolution on Fast-forward, A discussion between Kirk McKusick and Sean Quinlan; ACM Queue, 2009
What is Hadoop?
Summary
An Apache project
Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce).
Components:
HBase - originally by Powerset, then Microsoft Hive - Facebook
Pig, ZooKeeper, Chukwa - Yahoo!
Avro - Originally Yahoo, and co-developed with cloudera
The Hadoop Distributed File System (HDFS) Introduction
The Hadoop Distributed File System (HDFS) [2]
Sub-project of Apache Hadoop project
Designed to store very large data sets separately:
Metadata on dedicated server called NameNode Application data on DataNodes
Creates multiple replicas of data blocks Distributes replicas on nodes
Highly fault-tolerant
Designed to run on commodity hardware Suitable for applications for large data sets
The Hadoop Distributed File System (HDFS) Architecture
NameNode
Namespace is a hierarchy of files and directories
Files and directories are represented on the NameNode by inodes Maintains the namespace tree and the mapping of file blocks to DataNodes
Authorization and authentication
A multithreaded system and processes requests simultaneously from multiple clients
Keeps the entire namespace in RAM, allowing fast access to the metadata.
DataNodes
Responsible for serving read and write requests from the file systems clients
Perform block creation, deletion, and replication upon instruction from the NameNode
Performs handshake by connecting with NameNode during startup To verify the Namespace ID and software version of the DataNode Nodes with a different namespace ID will not be able to join the cluster
Incompatible version may cause data corruption or loss
Send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available.
The Hadoop Distributed File System (HDFS) Architecture
HDFS Client
Image and Journal
Image
File system metadata that describes the organization of application data as directories and files
A persistent record of the image written to disk is called a checkpoint
Journal
A write-ahead commit log for changes to the file system that must be persistent
During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal
A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients
The Hadoop Distributed File System (HDFS) Architecture
CheckpointNode
Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal
Usually runs on a different host wrt NameNode
Downloads the current checkpoint and journal files from the NameNode,
Merges them locally,and returns the new checkpoint back to the NameNode
BackupNode
Create periodic checkpoints
As Checkpoint Node, but it applies edits to its own namespace copy Without downloading checkpoint and journal files from the active NameNode
But in addition it maintains an in-memory, up-to-date image of the file system namespace
Always synchronized with the state of the NameNode Can be viewed as a read-only NameNode
The Hadoop Distributed File System (HDFS) File I/O Operations and Replica Management
File Read and Write, Block Placement
File Read and Write
Implements a single-writer, multiple-reader model
Block Placement
Spread the nodes across multiple racks Replicas placed:
The first on the node where the writer located,
The second and third on different nodes at different racks The others palced on random nodes:
1 No Datanode contains more than one replica of any block 2 No rack contains more than two replicas are in the same rack
Replication management
Detects that a block has become under- or over-replicated When a block becomes over replicated, the NameNode chooses a replica to remove
When a block becomes under-replicated, it is put in the replication priority queue
A block with only one replica has the highest priority
A block with a number of replicas that is greater than two thirds of its replication factor has the lowest priority
The Hadoop Distributed File System (HDFS) File I/O Operations and Replica Management
Inter-Cluster Data Copy
A tool called ”DistCp” used for large inter/intra-cluster parallel copying
A MapReduce job; each of the map tasks copies a portion of the source data into the destination file system
The Hadoop Distributed File System (HDFS) Writing and Reading Data
The Hadoop Distributed File System (HDFS) Writing and Reading Data
The Hadoop Distributed File System (HDFS) Writing and Reading Data
The Hadoop Distributed File System (HDFS) Writing and Reading Data
The Hadoop Distributed File System (HDFS) Replica Placement
Cluster Rebalancing
Balanced cluster: no under-utilized or over-utilized DataNodes Common reason: addition of new DataNodes
Requirements
No block loss
No change of the number of replicas
No reduction of the number of racks a block resides in No saturation of the network
Rebalancing issued by an administrator
Rebalancing decisions made by a Rebalancing Server
Goal of each iteration: reduce imbalance in over/under-utilized DataNodes
Balancer
A tool that balances disk space usage on an HDFS cluster.
Iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization
Bibliography
Conclusions
Conclusions
GFS
Proprietary file system
All the details in the paper of 2003
HDFS
Open-source project
Details are available in the paper of 2010 and in Hadoop website Used by many big companies
Bibliography
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.
The Google File System.
In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 29–43, New York, NY, USA, 2003. ACM.
Shvachko Konstantin, Kuang Hairong, Radia Sanjay, and Robert Chansler.
The Hadoop Distributed File System.
In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10,