HDFS. Hadoop Distributed File System

(1)

HDFS

Kevin Swingler

Hadoop Distributed File System

• File system designed to store VERY large files

• Streaming data access

• Running across clusters of commodity hardware

• Resilient to node failure

(2)

Large files

• Files can be many gigabytes or even terabytes in size

• In fact, HDFS struggles when there are a great many small files as the file structure data is held in memory

• There are petabyte stores running Hadoop

Streaming Access

• Write once, read many times design

• Files can be constantly added to, but Hadoop works only by appending, not inserting or updating records

• Lots of different analyses may be run on the same data, usually involving the entire file

• Examples: log files, transaction history, machine monitoring, sensor networks, ...

(3)

Commodity Hardware

• Designed not to need specialist high cost computers

• Runs over a flexible cluster of nodes

• Normally a rack of x86 machines, but you can run it on pretty much anything

Resilience

• Large clusters of nodes will suffer failures

• Hadoop uses replication to cope with this

• More on that later ...

(4)

Key Concepts

• A single name node is responsible for storing the location of every file in the system

– Kept in memory for speed – Stored on disk for persistence

• Many data nodes contain the data stored in blocks

• Data is stored in fixed size blocks - each file usually made up of many blocks

Blocks

• Files are split into blocks of a fixed size (64MB or 128MB usually)

• A single file's blocks are spread across the network so every files is distributed

• Each machine stores part of several files

• Replication means each block is written to r different nodes where r is the replication factor (usually 3)

(5)

Data (Slave) Nodes

• Store the blocks of data, plus local informatin about block location and file identity

• Report back to the name node to list the blocks they are storing

• All nodes contain replicated blocks - there are no primary/secondary data nodes

• So, nodes are not replicated, blocks are

Name (Master) Node

• Manages the namespace of the file system

• Stores the files store tree and file metadata

• Stores the location (which node) of all of the blocks in a file

• Is replicated by a secondary name node for resilience

(6)

Client

• The client accesses the data

• The name node tells the client where the data can be found on the data nodes

• The client interacts directly with the data node

• This happens 'under the hood'

• Example client = HDFS command line

A Nice Picture

Client ^{Name Node} _{Name Node}^Secondary

Data Node

File 1

File 2

File 3 F2B1R1

F1B2R2 F1B1R1

F1B1R1 F1B1R1 F3B1R3

Data Node Data Node

F2B1R3 F1B3R3 F1B2R1

F1B3R1 F3B2R3

Rack 1 Rack 2

F2B1R2 F2B2R2

F3B1R1 F3B1R2

F2B3R1 F3B3R3

F2B3R3 Metadata

operations

Read Write Block operations

(7)

Name Node Federation

• If a cluster is so large that a single name node cannot cover it all,

• The whole space is partitioned and shared among a number of name nodes

• This is called Federation

HDFS Command Line

• Simplest way to interact with the files in HDFS is via the command line

• Get a shell (via SSH for example) on the server and type Unix like commands to interact with the file system

• Syntax is:

(8)

Examples

• -ls List directory contents

• -cat Copy file contents to stdout

• -appendToFile Append local file to dfs file

• -copyToLocal Copy from dfs to local

• -mkdir Make a directory in dfs

• -mv Move file within dfs

File Locations

• File locations are specified using POSIX style:

• E.g. /usr/kms/somedata/data.csv

• Permissions work like in Linux, except you cannot execute a file in DFS so there are only r (read) and w (write) permissions for users.

• You need to copy files (or data) from the local file system to HDFS explicitly - they are

separate

(9)

File Formats

• Just like any file system, HDFS allows data to be stored in any format

– Text, csv, json etc

– Media: images, sound etc

• Additionally, it offers its own container formats

• Files are split into blocks, so format must support this

Text Files

• CSV files are easy to split and can be

processed row by row so are a good choice

• JSON and XML are more difficult to split, so special tools are needed

(10)

Hadoop File Types

• Sequence files

– Binary files containing key/value pairs

• Serialization Formats

– Methods for turning program objects into data streams

– In the Hadoop io library, there are a number of writable classes that can be used from Java – Avro is an Apache project that is designed to

provide language independent serialization

Compression

• Compressed files are splittable

• Codec stored in header, so any compression method may be used

• Choice between compression speed and degree of compression

(11)

Java Interface

• Hadoop is written in Java and works most naturally in that language

• There are ways of interacting with it in other languages too, though

• Useful Java classes include – org.apache.hadoop.fs – org.apache.hadoop.io

• Access, create and write to files from Java

HDFS and Python

• Hadoopy - python wrapper for Hadoop

www.hadoopy.com

• Spotify have a nice tool called Snakebite

labs.spotify.com/2013/05/07/snakebite/

• Allows calls to HDFS from Python

• Quicker than hdfs from command line, which launches a JVM for each command

(12)

Web Server Interface

• HDFS also runs a web server, which provides information about data and jobs

REST API

• There is also a REST API you can use to interact with HDFS

http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=

• See

hadoop.apache.org/docs/r2.4.0/hadoop- project-dist/hadoop-hdfs/WebHDFS.html

(13)

Higher Level Tools

• Pig

– High level language for defining data flow

• Hive

– SQL like query language for MapReduce jobs

• Zoo Keeper

– Coordination service for distributed systems

• Spark

– High speed data analytics

• Mahout

– Machine learning

• http://hadoopecosystemtable.github.io/