
HDFS. Hadoop Distributed File System


Academic year: 2021


HDFS

Kevin Swingler

Hadoop Distributed File System

• File system designed to store VERY large files

• Streaming data access

• Running across clusters of commodity hardware

• Resilient to node failure


Large files

• Files can be many gigabytes or even terabytes in size

• In fact, HDFS struggles when there are a great many small files, because the metadata for every file and block is held in the name node's memory

• There are petabyte stores running Hadoop
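A rough estimate shows why many small files hurt. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of name node memory per file entry and per block entry; that figure, the 128 MB block size and the function name are assumptions for illustration, not numbers taken from these slides.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb per file or block entry
BLOCK_SIZE = 128 * 1024 ** 2    # assumed 128 MB blocks

def namenode_bytes(num_files, file_size):
    """Rough name node memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT

one_tb = 1024 ** 4
# The same 1 TB stored as 8 large files or as about a million 1 MB files.
large = namenode_bytes(8, one_tb // 8)
small = namenode_bytes(1024 ** 2, 1024 ** 2)
print(f"8 x 128 GB files: {large / 1024:.0f} KB of metadata")
print(f"1M x 1 MB files:  {small / 1024 ** 2:.0f} MB of metadata")
```

Under these assumptions the same volume of data costs a few hundred times more name node memory when it is stored as small files.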

Streaming Access

• Write once, read many times design

• Files can be constantly added to, but Hadoop works only by appending, not inserting or updating records

• Lots of different analyses may be run on the same data, usually involving the entire file

• Examples: log files, transaction history, machine monitoring, sensor networks, ...


Commodity Hardware

• Designed not to need specialist high cost computers

• Runs over a flexible cluster of nodes

• Normally a rack of x86 machines, but you can run it on pretty much anything

Resilience

• Large clusters of nodes will suffer failures

• Hadoop uses replication to cope with this

• More on that later ...


Key Concepts

• A single name node is responsible for storing the location of every file in the system

– Kept in memory for speed

– Stored on disk for persistence

• Many data nodes contain the data stored in blocks

• Data is stored in fixed size blocks - each file usually made up of many blocks

Blocks

• Files are split into blocks of a fixed size (64MB or 128MB usually)

• A single file's blocks are spread across the network, so every file is distributed

• Each machine stores part of several files

• Replication means each block is written to r different nodes where r is the replication factor (usually 3)
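The block and replication ideas above can be sketched in a few lines. This is a toy model: the round-robin placement is an assumption made for simplicity (real HDFS placement also takes racks into account), and the function names are hypothetical.

```python
import math

MB = 1024 ** 2

def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size of each block; only the last one may be short."""
    n = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(n)]

def place_replicas(num_blocks, nodes, r=3):
    """Assign each block to r distinct nodes, rotating the starting node."""
    if r > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + k) % len(nodes)] for k in range(r)]
    return placement

sizes = split_into_blocks(300 * MB)
placement = place_replicas(len(sizes), ["dn1", "dn2", "dn3", "dn4"])
print([s // MB for s in sizes])
print(placement)
```

Each node ends up holding blocks from different positions in the file, and losing any single node still leaves two copies of every block.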


Data (Slave) Nodes

• Store the blocks of data, plus local information about block location and file identity

• Report back to the name node to list the blocks they are storing

• All nodes contain replicated blocks - there are no primary/secondary data nodes

• So, nodes are not replicated, blocks are

Name (Master) Node

• Manages the namespace of the file system

• Stores the file system tree and file metadata

• Stores the location (which node) of all of the blocks in a file

• Is checkpointed by a secondary name node, which keeps a periodically merged copy of the namespace for recovery (it is not a live standby)
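The division of labour between the two node types can be modelled roughly as follows: the name node keeps the namespace and learns block locations from the reports that data nodes send. The class and method names here are hypothetical and the model ignores persistence, heartbeats and failures.

```python
from collections import defaultdict

class NameNode:
    """Toy name node: namespace plus block locations learned from reports."""

    def __init__(self):
        self.files = {}                     # path -> ordered list of block ids
        self.locations = defaultdict(set)   # block id -> data node names

    def add_file(self, path, block_ids):
        self.files[path] = list(block_ids)

    def block_report(self, node, block_ids):
        """A data node lists every block it currently stores."""
        for held in self.locations.values():
            held.discard(node)              # forget the node's previous report
        for block_id in block_ids:
            self.locations[block_id].add(node)

    def locate(self, path):
        """Return [(block id, sorted node names)] for each block of a file."""
        return [(b, sorted(self.locations[b])) for b in self.files[path]]

nn = NameNode()
nn.add_file("/logs/a.log", ["b1", "b2"])
nn.block_report("dn1", ["b1", "b2"])
nn.block_report("dn2", ["b1"])
print(nn.locate("/logs/a.log"))
```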


Client

• The client accesses the data

• The name node tells the client where the data can be found on the data nodes

• The client interacts directly with the data node

• This happens 'under the hood'

• Example client = HDFS command line
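What happens 'under the hood' on a read can be sketched like this: one metadata lookup at the name node, then block reads straight from the data nodes, falling back to another replica if a node is down. The cluster contents, node names and block ids below are all made up for illustration.

```python
# Toy cluster state (all names hypothetical).
name_node = {
    "/data/f.csv": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}
data_nodes = {
    "dn1": {"blk_1": b"id,value\n"},
    "dn2": {"blk_1": b"id,value\n", "blk_2": b"1,42\n"},
    "dn3": {"blk_2": b"1,42\n"},
}
failed = {"dn1"}  # pretend this node is down

def read_file(path):
    """Look up block locations, then fetch each block from a live data node."""
    content = b""
    for block_id, holders in name_node[path]:   # metadata operation
        for node in holders:                    # block operation, tried in order
            if node not in failed:
                content += data_nodes[node][block_id]
                break
        else:
            raise IOError(f"no live replica of {block_id}")
    return content

print(read_file("/data/f.csv"))
```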

A Nice Picture

[Figure: HDFS architecture. The client sends metadata operations to the name node (backed by a secondary name node) and read/write block operations to the data nodes, which sit in Rack 1 and Rack 2. Each data node holds block replicas from Files 1 to 3, with labels such as F1B2R2 that appear to denote file, block and replica.]


Name Node Federation

• If a cluster is so large that a single name node cannot cover it all, the whole namespace is partitioned and shared among a number of name nodes

• This is called Federation
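One way to picture federation is a table that maps parts of the namespace to the name node responsible for them. The sketch below is only an illustration of that idea: the mount points and name node names are invented, and it does not reproduce how a real cluster is configured.

```python
MOUNTS = {                      # hypothetical partitioning of the namespace
    "/user": "namenode-a",
    "/logs": "namenode-b",
    "/logs/archive": "namenode-c",
}

def name_node_for(path):
    """Pick the name node whose mount point is the longest prefix of path."""
    best = None
    for mount in MOUNTS:
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no name node covers {path}")
    return MOUNTS[best]

print(name_node_for("/user/kms/somedata/data.csv"))
print(name_node_for("/logs/archive/2020.log"))
```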

HDFS Command Line

• Simplest way to interact with the files in HDFS is via the command line

• Get a shell (via SSH for example) on the server and type Unix-like commands to interact with the file system

• Syntax is:


Examples

• -ls List directory contents

• -cat Copy file contents to stdout

• -appendToFile Append local file to dfs file

• -copyToLocal Copy from dfs to local

• -mkdir Make a directory in dfs

• -mv Move file within dfs
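These commands can also be driven from a script. The sketch below assumes the shell is invoked as `hdfs dfs -<command> <arguments>` and that the `hdfs` executable is on the PATH; the helper names are hypothetical, and nothing is executed unless `run` is called.

```python
import shutil
import subprocess

def hdfs_command(command, *args):
    """Build the argument list for one file system shell command."""
    return ["hdfs", "dfs", f"-{command}", *args]

def run(command, *args):
    """Run the command if the hdfs binary is installed; return its stdout."""
    if shutil.which("hdfs") is None:
        raise RuntimeError("hdfs executable not found on PATH")
    result = subprocess.run(hdfs_command(command, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Only build and show the commands here; run() would actually execute them.
print(hdfs_command("ls", "/usr/kms/somedata"))
print(hdfs_command("copyToLocal", "/usr/kms/somedata/data.csv", "data.csv"))
```

Each call to `run` starts a new process, so a loop over many files pays that start-up cost every time.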

File Locations

• File locations are specified using POSIX style:

• E.g. /usr/kms/somedata/data.csv

• Permissions work like in Linux, except that you cannot execute a file in HDFS: users mainly need r (read) and w (write), and the x permission is ignored for files and only controls access to the contents of a directory.

• You need to copy files (or data) from the local file system to HDFS explicitly - they are separate


File Formats

• Just like any file system, HDFS allows data to be stored in any format

– Text, csv, json etc

– Media: images, sound etc

• Additionally, it offers its own container formats

• Files are split into blocks, so format must support this

Text Files

• CSV files are easy to split and can be processed row by row so are a good choice

• JSON and XML are more difficult to split, so special tools are needed
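Why line-oriented text splits so easily can be shown with a small sketch. It assumes one convention for dividing the work: a split owns every row whose first byte falls inside its byte range, and reads past the end of the range to finish its last row. The function name and sample data are hypothetical.

```python
def read_split(data, start, end):
    """Return the complete rows whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Move to the first line that begins inside this split.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    rows = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        rows.append(data[pos:stop].decode())  # may read beyond `end`
        pos = stop + 1
    return rows

data = b"id,value\n1,10\n2,20\n3,30\n"
split_size = 10  # tiny splits so that rows straddle the boundaries
for start in range(0, len(data), split_size):
    print(start, read_split(data, start, start + split_size))
```

Every row is produced by exactly one split, even when a boundary cuts a row in half. A nested JSON or XML document has no such simple record boundary, which is why it needs special handling.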


Hadoop File Types

• Sequence files

– Binary files containing key/value pairs

• Serialization Formats

– Methods for turning program objects into data streams

– In the Hadoop io library, there are a number of writable classes that can be used from Java

– Avro is an Apache project that is designed to provide language independent serialization
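The idea of a binary file of key/value pairs, and of serialization in general, can be illustrated with a toy format in which each key and value is written as a 4-byte length followed by the bytes. This is an invented layout for illustration only; it is not the actual sequence file or Avro format.

```python
import io
import struct

def write_records(stream, records):
    """Write each (key, value) pair as two length-prefixed byte strings."""
    for key, value in records:
        for field in (key, value):
            stream.write(struct.pack(">I", len(field)))  # big-endian length
            stream.write(field)

def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return
        key = stream.read(struct.unpack(">I", header)[0])
        (value_len,) = struct.unpack(">I", stream.read(4))
        yield key, stream.read(value_len)

buffer = io.BytesIO()
write_records(buffer, [(b"sensor-1", b"21.5"), (b"sensor-2", b"19.0")])
buffer.seek(0)
print(list(read_records(buffer)))
```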

Compression

• Whether a compressed file is splittable depends on the format: compressed sequence files are, a gzip-compressed text file is not

• Codec stored in header, so any compression method may be used

• Choice between compression speed and degree of compression
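The speed versus size trade-off is easy to observe with the codecs in the Python standard library. The sample data below is made up and very repetitive, so the ratios are not representative of real files; the point is only that different codecs and levels give different sizes and timings.

```python
import bz2
import time
import zlib

def measure(compress, data):
    """Return (compressed size in bytes, seconds taken) for one codec."""
    started = time.perf_counter()
    size = len(compress(data))
    return size, time.perf_counter() - started

# Repetitive log-like sample data, invented for the comparison.
sample = b"2021-01-01 12:00:00 INFO request served in 12 ms\n" * 20000

codecs = {
    "zlib level 1": lambda d: zlib.compress(d, 1),
    "zlib level 9": lambda d: zlib.compress(d, 9),
    "bz2 level 9": lambda d: bz2.compress(d, 9),
}
for name, compress in codecs.items():
    size, seconds = measure(compress, sample)
    print(f"{name}: {len(sample)} -> {size} bytes in {seconds:.4f} s")
```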


Java Interface

• Hadoop is written in Java and works most naturally in that language

• There are ways of interacting with it in other languages too, though

• Useful Java packages include

– org.apache.hadoop.fs

– org.apache.hadoop.io

• Access, create and write to files from Java

HDFS and Python

• Hadoopy - python wrapper for Hadoop

www.hadoopy.com

• Spotify have a nice tool called Snakebite

labs.spotify.com/2013/05/07/snakebite/

• Allows calls to HDFS from Python

• Quicker than hdfs from command line, which launches a JVM for each command


Web Server Interface

• HDFS also runs a web server, which provides information about data and jobs

REST API

• There is also a REST API you can use to interact with HDFS

http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=

• See hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
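A small helper can assemble URLs of the form shown above. The host name and port in the example are placeholders, and the helper name is hypothetical; `LISTSTATUS` and `OPEN` (with `offset` and `length`) are operations described in the WebHDFS documentation. The sketch only builds the URL and sends no request.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL for an absolute HDFS path and an operation name."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Placeholder host and port; substitute the values for your own cluster.
print(webhdfs_url("namenode.example.com", 50070, "/usr/kms/somedata", "LISTSTATUS"))
print(webhdfs_url("namenode.example.com", 50070,
                  "/usr/kms/somedata/data.csv", "OPEN", offset=0, length=1024))
```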


Higher Level Tools

• Pig

– High level language for defining data flow

• Hive

– SQL like query language for MapReduce jobs

• ZooKeeper

– Coordination service for distributed systems

• Spark

– High speed data analytics

• Mahout

– Machine learning

http://hadoopecosystemtable.github.io/
