HDFS
Kevin Swingler
Hadoop Distributed File System
• File system designed to store VERY large files
• Streaming data access
• Running across clusters of commodity hardware
• Resilient to node failure
Large files
• Files can be many gigabytes or even terabytes in size
• In fact, HDFS struggles when there are a great many small files, because the name node holds the file system metadata in memory
• There are petabyte stores running Hadoop
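A rough sketch of why many small files hurt: the name node keeps an in-memory object for every file and every block. The figure of about 150 bytes per object is a commonly quoted rule of thumb, used here as an assumption rather than a measured value; the function name is illustrative and directories are ignored.

```python
import math

BYTES_PER_OBJECT = 150        # assumption: rule-of-thumb size of one name node object
BLOCK_SIZE = 128 * 1024**2    # 128 MB blocks

def namenode_bytes(num_files, file_size):
    """Estimate name node memory: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# The same 1 TB of data, stored two ways.
few_large = namenode_bytes(1024, 1024**3)       # 1,024 files of 1 GB each
many_small = namenode_bytes(1024**2, 1024**2)   # 1,048,576 files of 1 MB each

print(few_large, many_small)
```

Under these assumptions the small-file layout needs a few hundred times more name node memory for the same amount of data.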
Streaming Access
• Write once, read many times design
• Files can be constantly added to, but Hadoop works only by appending, not inserting or updating records
• Lots of different analyses may be run on the same data, usually involving the entire file
• Examples: log files, transaction history, machine monitoring, sensor networks, ...
Commodity Hardware
• Designed not to need specialist high cost computers
• Runs over a flexible cluster of nodes
• Normally a rack of x86 machines, but you can run it on pretty much anything
Resilience
• Large clusters of nodes will suffer failures
• Hadoop uses replication to cope with this
• More on that later ...
Key Concepts
• A single name node is responsible for storing the location of every file in the system
– Kept in memory for speed
– Stored on disk for persistence
• Many data nodes contain the data stored in blocks
• Data is stored in fixed size blocks - each file usually made up of many blocks
Blocks
• Files are split into blocks of a fixed size (64MB or 128MB usually)
• A single file's blocks are spread across the network, so every file is distributed
• Each machine stores part of several files
• Replication means each block is written to r different nodes where r is the replication factor (usually 3)
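The block arithmetic above can be sketched as follows. This is a minimal illustration, not HDFS's own code: the round-robin placement is a stand-in for the real rack-aware policy, and the node names and the 300 MB file are made up.

```python
import math

def split_into_blocks(file_size, block_size=128 * 1024**2):
    """Return the size of each block; only the last block may be smaller."""
    n = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(n)]

def place_replicas(num_blocks, nodes, r=3):
    """Assign each block to r distinct nodes (simple round-robin stand-in)."""
    if r > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {b: [nodes[(b + k) % len(nodes)] for k in range(r)]
            for b in range(num_blocks)}

blocks = split_into_blocks(300 * 1024**2)      # a hypothetical 300 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

for block, nodes in placement.items():
    print(block, blocks[block], nodes)
```

With 128 MB blocks the file becomes three blocks (128 MB, 128 MB and 44 MB), and with r = 3 there are nine block replicas in total.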
Data (Slave) Nodes
• Store the blocks of data, plus local information about block location and file identity
• Report back to the name node to list the blocks they are storing
• All nodes contain replicated blocks - there are no primary/secondary data nodes
• So, nodes are not replicated, blocks are
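A toy sketch of the reporting step: each data node lists the blocks it holds, and the name node inverts those lists into a block-to-nodes map. The node names and block IDs are hypothetical, and real block reports carry more detail than this.

```python
from collections import defaultdict

# Hypothetical block reports: each data node lists the block IDs it stores.
block_reports = {
    "node1": ["F1B1", "F2B1"],
    "node2": ["F1B1", "F1B2"],
    "node3": ["F1B2", "F2B1"],
}

def build_block_map(reports):
    """Invert node -> blocks reports into a block -> nodes map."""
    block_map = defaultdict(set)
    for node, blocks in reports.items():
        for block in blocks:
            block_map[block].add(node)
    return dict(block_map)

block_map = build_block_map(block_reports)

# If node2 fails, every block still has a replica on another node.
survivors = {block: nodes - {"node2"} for block, nodes in block_map.items()}
print(survivors)
```

Because blocks rather than nodes are replicated, losing one node leaves every block reachable somewhere else.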
Name (Master) Node
• Manages the namespace of the file system
• Stores the file system tree and file metadata
• Stores the location (which node) of all of the blocks in a file
• Is backed up by a secondary name node, which periodically checkpoints the namespace for resilience (it is a checkpointing helper, not a hot standby)
Client
• The client accesses the data
• The name node tells the client where the data can be found on the data nodes
• The client interacts directly with the data node
• This happens 'under the hood'
• Example client = HDFS command line
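The read path can be sketched with a toy model: the client asks the name node only for metadata, then fetches the bytes from the data nodes itself. The classes, paths and block IDs below are hypothetical and bear no relation to the real Hadoop client API.

```python
class DataNode:
    def __init__(self):
        self.blocks = {}        # block id -> bytes

class NameNode:
    def __init__(self):
        self.files = {}         # path -> ordered list of block ids
        self.locations = {}     # block id -> list of data node names

def read_file(path, name_node, data_nodes):
    """Ask the name node where the blocks are, then read them from data nodes."""
    data = b""
    for block_id in name_node.files[path]:            # metadata from the name node
        node = name_node.locations[block_id][0]       # pick the first replica
        data += data_nodes[node].blocks[block_id]     # bytes come from the data node
    return data

nn = NameNode()
dns = {"node1": DataNode(), "node2": DataNode()}
dns["node1"].blocks["b1"] = b"hello, "
dns["node2"].blocks["b2"] = b"hdfs"
nn.files["/demo.txt"] = ["b1", "b2"]
nn.locations = {"b1": ["node1"], "b2": ["node2"]}

result = read_file("/demo.txt", nn, dns)
print(result)
```

Note that no file data passes through the name node, which is what keeps it from becoming a bottleneck.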
A Nice Picture
[Figure: a client, a name node and a secondary name node, with data nodes spread across Rack 1 and Rack 2. The data nodes hold replicated blocks of Files 1 to 3, with labels such as F1B2R2 (file 1, block 2, replica 2). The client sends metadata operations to the name node and read/write block operations to the data nodes.]
Name Node Federation
• If a cluster is so large that a single name node cannot cover it all, the whole namespace is partitioned and shared among a number of name nodes
• This is called Federation
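One way to picture a partitioned namespace is a table mapping path prefixes to name nodes, similar in spirit to a client-side mount table. This is only a sketch: the prefixes and name node names are made up, and the lookup is a plain longest-prefix match.

```python
# Hypothetical partition of the namespace among three name nodes.
mount_table = {
    "/user": "namenode-a",
    "/logs": "namenode-b",
    "/": "namenode-c",      # everything else
}

def name_node_for(path, table=mount_table):
    """Pick the name node whose prefix is the longest match for the path."""
    matches = [prefix for prefix in table
               if path == prefix or path.startswith(prefix.rstrip("/") + "/")]
    return table[max(matches, key=len)]

print(name_node_for("/user/kms/somedata/data.csv"))
print(name_node_for("/logs"))
print(name_node_for("/tmp/scratch"))
```

Matching on whole path components means /username is not mistaken for something under /user.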
HDFS Command Line
• Simplest way to interact with the files in HDFS is via the command line
• Get a shell (via SSH, for example) on the server and type Unix-like commands to interact with the file system
• Syntax is:
Examples
• -ls List directory contents
• -cat Copy file contents to stdout
• -appendToFile Append local file to dfs file
• -copyToLocal Copy from dfs to local
• -mkdir Make a directory in dfs
• -mv Move file within dfs
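A small sketch that assembles the commands listed above without running them. It assumes the `hdfs dfs` launcher (older installations use `hadoop fs` instead); the helper name and the local file name are hypothetical.

```python
def hdfs_command(operation, *args):
    """Build the argument list for an `hdfs dfs` call without executing it."""
    return ["hdfs", "dfs", f"-{operation}", *args]

commands = [
    hdfs_command("mkdir", "/usr/kms/somedata"),
    hdfs_command("appendToFile", "local.csv", "/usr/kms/somedata/data.csv"),
    hdfs_command("ls", "/usr/kms/somedata"),
    hdfs_command("cat", "/usr/kms/somedata/data.csv"),
]

for cmd in commands:
    print(" ".join(cmd))
```

On a machine with Hadoop installed, each list could be passed to `subprocess.run` to execute the command.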
File Locations
• File locations are specified using POSIX style:
• E.g. /usr/kms/somedata/data.csv
• Permissions work much as in Linux, except that you cannot execute a file in DFS, so only r (read) and w (write) are meaningful for files; on a directory, x controls access to its children.
• You need to copy files (or data) from the local file system to HDFS explicitly - they are separate
File Formats
• Just like any file system, HDFS allows data to be stored in any format
– Text, csv, json etc
– Media: images, sound etc
• Additionally, it offers its own container formats
• Files are split into blocks, so format must support this
Text Files
• CSV files are easy to split and can be processed row by row, so they are a good choice
• JSON and XML are more difficult to split, so special tools are needed
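Why row-based text splits easily can be sketched as follows: block boundaries fall at arbitrary bytes, but a reader can treat each line as belonging to the split in which it starts, and read a little past the end of its split to finish the last line. This is a simplification for illustration, not Hadoop's own record reader, and the sample data is made up.

```python
def read_split(data, start, end):
    """Return the complete lines whose first byte lies in [start, end)."""
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # We landed mid-line: that line belongs to the previous split.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        next_pos = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:next_pos].rstrip(b"\n"))   # may read past `end`
        pos = next_pos
    return lines

data = b"id,val\n1,a\n2,b\n3,c\n"
split_size = 8                                            # tiny "block" for the demo
splits = [(s, min(s + split_size, len(data)))
          for s in range(0, len(data), split_size)]
per_split = [read_split(data, s, e) for s, e in splits]
print(per_split)
```

Every row is read exactly once even though no split boundary coincides with a line break. A JSON or XML document has no such simple record delimiter, which is why it needs special handling.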
Hadoop File Types
• Sequence files
– Binary files containing key/value pairs
• Serialization Formats
– Methods for turning program objects into data streams
– In the Hadoop io library, there are a number of Writable classes that can be used from Java
– Avro is an Apache project that is designed to provide language-independent serialization
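To make "binary key/value pairs" concrete, here is a toy length-prefixed record format. It is not the real SequenceFile layout (which also has a header, sync markers and optional compression); the function names and sample records are hypothetical.

```python
import io
import struct

def write_records(stream, records):
    """Write each (key, value) pair as two length-prefixed byte strings."""
    for key, value in records:
        for field in (key, value):
            stream.write(struct.pack(">I", len(field)))   # 4-byte big-endian length
            stream.write(field)

def read_records(stream):
    """Read back the (key, value) pairs written by write_records."""
    records = []
    while True:
        header = stream.read(4)
        if not header:                                    # end of stream
            return records
        key = stream.read(struct.unpack(">I", header)[0])
        value = stream.read(struct.unpack(">I", stream.read(4))[0])
        records.append((key, value))

pairs = [(b"sensor-1", b"20.5"), (b"sensor-2", b"19.8")]
buffer = io.BytesIO()
write_records(buffer, pairs)
buffer.seek(0)
decoded = read_records(buffer)
print(decoded)
```

Turning program objects into bytes like this, and back again, is what a serialization format does; Writable classes and Avro do it in a more complete, schema-aware way.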
Compression
• Sequence files remain splittable when compressed (unlike, for example, a plain gzip text file)
• Codec stored in header, so any compression method may be used
• Choice between compression speed and degree of compression
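The speed versus size trade-off can be tried out with Python's zlib, which here simply stands in for whichever codec a cluster uses. The sample data is made up and highly repetitive, so real results will differ.

```python
import zlib

data = b"2014-01-01,sensor-1,20.5\n" * 10_000   # made-up, repetitive sample

fast = zlib.compress(data, 1)    # level 1: favours speed
small = zlib.compress(data, 9)   # level 9: favours compression ratio

print(len(data), len(fast), len(small))
```

Timing the two calls on representative data (for example with `timeit`) shows what the extra compression costs in CPU time.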
Java Interface
• Hadoop is written in Java and works most naturally in that language
• There are ways of interacting with it in other languages too, though
• Useful Java packages include
– org.apache.hadoop.fs
– org.apache.hadoop.io
• Access, create and write to files from Java
HDFS and Python
• Hadoopy - Python wrapper for Hadoop
www.hadoopy.com
• Spotify have a nice tool called Snakebite
labs.spotify.com/2013/05/07/snakebite/
• Allows calls to HDFS from Python
• Quicker than the hdfs command line, which launches a JVM for each command
Web Server Interface
• HDFS also runs a web server, which provides information about data and jobs
REST API
• There is also a REST API you can use to interact with HDFS
http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=
• See
hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
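A sketch of building a request URL in the pattern shown above, without contacting a server. The host, port and helper name are placeholders; LISTSTATUS is one of the operations described in the WebHDFS documentation.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a URL of the form http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"

# Placeholder host and port; substitute the values for your own cluster.
url = webhdfs_url("namenode.example.com", 50070, "/usr/kms/somedata", "LISTSTATUS")
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the response is returned as JSON.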
Higher Level Tools
• Pig
– High-level language for defining data flow
• Hive
– SQL-like query language for MapReduce jobs
• ZooKeeper
– Coordination service for distributed systems
• Spark
– High speed data analytics
• Mahout
– Machine learning
• http://hadoopecosystemtable.github.io/