2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.
Implementing the Hadoop Distributed
File System Protocol on OneFS
Jeff Hughes EMC Isilon
Outline
Hadoop Overview
OneFS Overview
MapReduce + OneFS
Details of isi_hdfs_d
Apache Hadoop Project
Two main components
MapReduce
Distributed File System (DFS)
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using a simple programming model.
Hadoop: MapReduce
MapReduce
Distributed computation framework
Optimized for batch processing
Typical I/O profile:
[Figure: I/O profile diagram; only the label "DFS" survives]
Hadoop: HDFS Semantics
Metadata/data server cluster architecture
Cluster coherent namespace and data
Write-once-read-many access
Single writer only
Can append to existing files
Data mirrored 3x for resiliency
Client exposed to data topology
Block locations as part of file metadata
http://www.snia.org/sites/default/files2/sdc_archives/2010_presentations/wednesday/DhrubaBorthakur-Hadoop_File_Systems.pdf
Hadoop: Why HDFS?
Portability – All user space and OS independent
Purpose-built – Primary workflow is MapReduce
Limited set of operations to implement
Single software package
Fluid client/server protocol development
Exposure of data topology
OneFS: Architecture
[Figure: Client/Application Layer (servers) connected over an Ethernet Layer to the Isilon IQ Storage Layer (servers), which uses InfiniBand for intracluster communication]
OneFS: OS
Built from the ground up on FreeBSD
File system is a loadable kernel module with VFS interface
Supports POSIX syscalls locally
Protocol servers access /ifs paths
FS Built for mixed namespace access
Supports SMB, NFS, HTTP, etc.
OneFS: Semantics
Symmetric cluster architecture
Metadata distributed across all nodes
Tightly coupled group semantics
Globally coherent file system access
Distributed lock manager
Two-phase commit for all write operations
MapReduce + OneFS: Architecture
[Figure: four OneFS nodes, each serving as NameNode and DataNode; a DFSClient on a Hadoop node talks to them]
1) Request(“/file”)
2) Response(block locations)
3) GetBlock(block)
OneFS runs a daemon that speaks NameNode and DataNode natively
MapReduce + OneFS: Benefits
Easier integration to existing workflows
First class multi-protocol access
Reduce ETL stages
Increased disk efficiency
HDFS: 30% usable, OneFS: 80% usable
Reduced data center footprint
More data management options
Snapshots, site replication, etc.
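The disk-efficiency figures above follow from simple overhead arithmetic. The sketch below is a back-of-envelope illustration only. It assumes raw capacity is split either among full copies (mirroring) or among data plus parity units. The 4+1 layout is a hypothetical example chosen to land near 80%, not a OneFS protection level stated in these slides. It ignores temp space and other overhead, which may be why the slide quotes 30% rather than 33% for HDFS.

```python
def usable_fraction_mirrored(copies: int) -> float:
    """Fraction of raw capacity holding unique data when each block is stored `copies` times."""
    return 1.0 / copies


def usable_fraction_parity(data_units: int, parity_units: int) -> float:
    """Fraction of raw capacity holding data in a data+parity stripe (illustrative layout)."""
    return data_units / (data_units + parity_units)


# 3x mirroring, as in the HDFS semantics slide.
print(f"3x mirroring: {usable_fraction_mirrored(3):.0%} usable")
# Hypothetical 4 data + 1 parity stripe (an assumption for illustration).
print(f"4+1 parity:   {usable_fraction_parity(4, 1):.0%} usable")
```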
MapReduce + OneFS: Challenges
Typical data path locality changes
MapReduce ➝ HDFS acts like DAS
MapReduce ➝ OneFS goes over the network
Client/server compatibility and maintenance
OneFS and MapReduce clusters run different
software versions
MapReduce + OneFS: Mitigations
1GbE < SATA controller < 10GbE
Hadoop designed for 1GbE
10GbE prices dropping
Denser storage == fewer nodes, less networking
“Rack” locality limits cross-switch contention
[Figure: Rack A, Rack B, Rack C]
MapReduce + OneFS: Performance
[Figure: DFS → Read → Map Task → Map Output → Shuffle → Reduce Task → Write → DFS]
Typically ~100Mbit per task from HDFS
I/Os against temp vary considerably per job
More variable, still ~100Mbit per task to HDFS
Performance bottleneck likely to be temp space
Terasort example: 75% of I/Os against temp
Latency not much impact over HDFS large
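To put the ~100Mbit-per-task figure next to the link speeds from the mitigations slide, here is a rough calculation. It is a sketch only: it assumes every task streams at exactly 100 Mbit and ignores protocol overhead, temp-space I/O, and contention, so the results are upper bounds rather than measurements.

```python
PER_TASK_MBIT = 100  # ~100 Mbit per task, per the slide


def tasks_per_link(link_mbit: int, per_task_mbit: int = PER_TASK_MBIT) -> int:
    """How many tasks a link could feed if each streams at per_task_mbit."""
    return link_mbit // per_task_mbit


for name, mbit in [("1GbE", 1_000), ("10GbE", 10_000)]:
    print(f"{name}: up to {tasks_per_link(mbit)} concurrent tasks")
```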
HDFS Protocols
Two TCP based protocols
NameNode – metadata operations
DataNode – data transfers
About 26 NameNode RPCs
Mostly use fully qualified paths
POSIX-like file attrs (modebits, user/group)
Only 2 DataNode client operations
NameNode Request Example
Example getFileInfo(“/testfile”) request (think stat):
[Figure: encoded request bytes, annotated with the method name, the parameter type (string), and the value “/testfile”]
NameNode Response Example
isi_hdfs_d
Multi-threaded daemon runs on all nodes
Services both NN and DN protocols
Translates RPCs to POSIX system calls
Stateless, underlying FS handles coherency
[Figure: inside a OneFS node, an isi_hdfs_d thread receives a request, issues a syscall through the VFS into OneFS, and returns the response]
Example NameNode RPCs
Most NameNode RPCs are straightforward
setPermission() → chmod(…)
setTimes() → utimes(…)
create() → open(…, O_CREAT, …)
Other RPCs need some creative interpretation
recoverLease()/renewLease()
abandonBlock()
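The straightforward mappings above can be sketched in a few lines. This is not the isi_hdfs_d source. It is a minimal Python illustration that assumes a temporary directory standing in for /ifs, hypothetical function names, and timestamps in seconds (the real RPCs carry their own encodings). It also includes getFileInfo(), which the slides liken to stat.

```python
import os
import tempfile

# Stand-in for the /ifs root (assumption for this demo only).
IFS_ROOT = tempfile.mkdtemp()


def _local(path: str) -> str:
    """Map a fully qualified HDFS-style path onto the local root."""
    return os.path.join(IFS_ROOT, path.lstrip("/"))


def create(path: str, mode: int = 0o644) -> None:
    """create() -> open(..., O_CREAT, ...)"""
    fd = os.open(_local(path), os.O_CREAT | os.O_WRONLY, mode)
    os.close(fd)


def set_permission(path: str, mode: int) -> None:
    """setPermission() -> chmod(...)"""
    os.chmod(_local(path), mode)


def set_times(path: str, mtime: float, atime: float) -> None:
    """setTimes() -> utimes(...); os.utime takes (atime, mtime)."""
    os.utime(_local(path), (atime, mtime))


def get_file_info(path: str) -> dict:
    """getFileInfo() -> stat(...)"""
    st = os.stat(_local(path))
    return {"length": st.st_size, "mode": st.st_mode & 0o7777, "mtime": st.st_mtime}


create("/testfile")
set_permission("/testfile", 0o600)
set_times("/testfile", mtime=1_000_000_000, atime=1_000_000_100)
print(get_file_info("/testfile"))
```

Because each handler is a single system call against the shared file system, the handler itself keeps no state, which matches the "stateless" point on the isi_hdfs_d slide.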
HDFS Data Path
NN RPC: getBlockLocations(file)/addBlock(file)
Returns list of LocatedBlocks
LocatedBlock
    long offset
    Block
        long blkid
        long genStamp
        long numBytes
    DatanodeInfo[] ...
DFSClient connects to DN
Chooses which DatanodeInfo based on locality
Only Block structure passed to
LocatedBlocks Translation
LocatedBlock
    long offset            Logical byte offset into the file
    Block                  Opaque to client, used by DN
        long blkid         Inode number
        long genStamp      Absolute byte offset
        long numBytes      Size of extent
    DatanodeInfo[] ...     <IP:port> and rack info for
Specific to OneFS and isi_hdfs_d
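One way to picture this translation is to carve a file into fixed-size extents and pack each one into a Block. The sketch below reads the slide's annotations as blkid = inode number, genStamp = absolute byte offset, and numBytes = size of extent. That mapping, the extent size, the node addresses, and all function names are assumptions for illustration. In particular, the sketch sets the LocatedBlock offset and the genStamp to the same value, which the slides do not confirm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    blkid: int      # inode number of the file (per the slide's annotation)
    gen_stamp: int  # absolute byte offset of the extent (assumed reading)
    num_bytes: int  # size of the extent


@dataclass
class LocatedBlock:
    offset: int           # logical byte offset into the file
    block: Block          # opaque to the client, interpreted by the DataNode side
    datanodes: List[str]  # "<IP:port>" strings; rack info omitted in this sketch


def located_blocks(inode: int, file_size: int, extent_size: int,
                   datanodes: List[str]) -> List[LocatedBlock]:
    """Split a file into extents and describe each as a LocatedBlock."""
    result = []
    for start in range(0, file_size, extent_size):
        size = min(extent_size, file_size - start)
        result.append(LocatedBlock(start, Block(inode, start, size), list(datanodes)))
    return result


def read_extent(block: Block) -> Tuple[int, int, int]:
    """What the DataNode side needs to serve a read: (inode, offset, length)."""
    return (block.blkid, block.gen_stamp, block.num_bytes)


# Hypothetical 150-byte file with 64-byte extents on two made-up nodes.
for lb in located_blocks(4242, 150, 64, ["10.0.0.1:50010", "10.0.0.2:50010"]):
    print(lb.offset, read_extent(lb.block), lb.datanodes)
```

Since the Block alone identifies the inode and byte range, any node that receives it can serve the read without consulting per-block state.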
Read Path Example
DFSClient → NameNode: getBlockLocations()
NameNode → DFSClient: LocatedBlocks
DFSClient → DataNode: DN_OP_READ(Block)
DataNode → DFSClient: Data stream
NameNode Connection Routing
NameNode is configured as single URL
Easy configuration:
hdfs://log-server.isilon.com:8020/
DNS round-robin to distribute across nodes
Metadata IOPs get spread out
OneFS maintains cross-node consistency
IP Failover plus client retries for resiliency
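The resiliency point above can be illustrated with a small simulation. This is a simplified sketch, not Hadoop client code. It assumes the single NameNode hostname has already resolved to several node addresses and that the client moves to the next address on a connection error. The addresses, the fake RPC, and the retry count are all hypothetical. With real IP failover the same address may simply start answering from another node.

```python
from itertools import cycle
from typing import Callable, List


def call_with_retries(addresses: List[str], rpc: Callable[[str], str],
                      attempts: int = 3) -> str:
    """Try the RPC against successive addresses until one answers."""
    last_error = None
    pool = cycle(addresses)  # assumes a non-empty address list
    for _ in range(attempts):
        addr = next(pool)
        try:
            return rpc(addr)
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError(f"all {attempts} attempts failed") from last_error


DOWN = {"10.0.0.1"}  # pretend this node is unreachable


def fake_rpc(addr: str) -> str:
    """Stand-in for a NameNode RPC; fails for nodes marked down."""
    if addr in DOWN:
        raise ConnectionError(addr)
    return f"served by {addr}"


print(call_with_retries(["10.0.0.1", "10.0.0.2", "10.0.0.3"], fake_rpc))
```

Any node can answer because the file system, not the daemon, maintains cross-node consistency.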
DataNode References
DataNode references returned by NameNode
All OneFS DataNodes can access same data
Each LocatedBlocks for reads has 3 DN refs
Round-robin across available nodes
Multiple refs let the client try other nodes before
coming back to the NameNode again
Write path only 1 reference per logical block
No need for client to replicate writes
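A minimal sketch of how such references might be handed out, assuming a simple rotating start index over the available nodes. The class name, node names, and the exact rotation policy are hypothetical. The slides state only that reads get 3 references, writes get 1, and selection is round-robin.

```python
from itertools import count
from typing import List


class DataNodeChooser:
    """Hands out DataNode references round-robin across available nodes."""

    def __init__(self, nodes: List[str]):
        self.nodes = list(nodes)
        self._calls = count()

    def _refs(self, n: int) -> List[str]:
        # Rotate the starting node on every call, then take up to n distinct nodes.
        start = next(self._calls) % len(self.nodes)
        k = min(n, len(self.nodes))
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(k)]

    def for_read(self) -> List[str]:
        return self._refs(3)  # client can fall back to the other two refs

    def for_write(self) -> List[str]:
        return self._refs(1)  # no client-side replication pipeline needed


chooser = DataNodeChooser(["node1", "node2", "node3", "node4"])
print(chooser.for_read())
print(chooser.for_read())
print(chooser.for_write())
```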
A Few Quirks
User/group identities are strings
OneFS natively stores UIDs or SIDs only
Requires name resolution on access
Locking!
HDFS uses “leases” to restrict to single writer
Implemented but no cross-protocol contention
Hadoop apps don’t expect files to move
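The identity quirk above means a numeric owner must be turned into a string on every metadata response. The sketch below shows the idea using the local Unix account database through Python's pwd and grp modules as a stand-in. OneFS's own identity mapping (including SIDs) is not shown in the slides and is not modeled here. Falling back to the numeric value as text is an assumption of this sketch.

```python
import grp
import os
import pwd
from typing import Tuple


def owner_name(uid: int) -> str:
    """Resolve a UID to a user name, falling back to the number as text."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid: int) -> str:
    """Resolve a GID to a group name, falling back to the number as text."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def file_identity(path: str) -> Tuple[str, str]:
    """Return the (user, group) strings an HDFS-style client would expect."""
    st = os.stat(path)
    return owner_name(st.st_uid), group_name(st.st_gid)


print(file_identity("."))
```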
It Works!
You see this across NFS:
And the same directory on HDFS:
Conclusions
HDFS protocol can map to POSIX pretty easily
Not all traditional shared storage is bad for
Hadoop workflows
Locality features worth preserving even without
node locality as a possibility