Istanbul Şehir University
Big Data Camp ’14
HDFS: Hadoop
Distributed File System
Aslan Bakirov
Agenda
•
Distributed File System
•
HDFS Concepts
•
HDFS Interfaces
•
HDFS Full Picture
•
Read Operation WorkFlow
•
Network Topology and Hadoop
•
Write File WorkFlow
•
Future Concepts
•
Demo
Distributed File System
•
What is distributed systems?
A network of interconnected computers is
distributed systems
A single computer can also be viewed as a
distributed system in which the central
control unit, memory units and IO channels
are separate processes
A system is distributed if the message
transmission delay is not negligible
compared to the time between events in a
single process
Distributed File System
•
File systems that manage the storage
across a network of machines are called
distributed file systems
●
Must be;
Consistent
(all nodes see the same data at
the same time)
Partition tolerant
(the system continues to
operate despite arbitrary message loss or
failure of part of the system)
HDFS Concepts
•
Hadoop Distributed File System:
part of
Apache Hadoop Project
•
Two Types of Nodes
NameNode(Master)
- Holds metadata
and keeps track of block locations in
DataNodes
DataNode
- Slave nodes that store and
retrieve data blocks. DNs periodically
report to namenode about list of
blocks that they are storing
•
Files split into 128mb(default) blocks
•
Replicated to 3 datanodes(default)
HDFS Concepts
HDFS is good for:
● Very Large Files: Hdfs is designed and optimized
for very large files, in size of GBs, TBs, PBs etc.
● Streaming Data Access: Hdfs is good for write
once and read many times pattern. Reading whole dataset is more important than reading particular block.
HDFS Concepts
HDFS is bad for:
● Low-latency data Access: HDFS is designed for
high throughput of data, HBase is better for low latency data Access.
● Lots of small files: Since the namenode holds
metadata in memory, the limit to the number of files in file system is governed by amount of
memory on NN. Each file costs about 150 Bytes of memory.
● Multiple writers: Writes are always done at the
end of file. There are no support for multiple writers at arbitrary offsets in the file.
HDFS Concepts
•
Block
A disk has a block size, minimum amount of
data that it can read or write. (512 bytes by
default).
HDFS has block size of 128MB by default but
it is configurable.
Files in HDFS are broken into block-sized
chunks and stored as independent units.
Unlike local file system, a file in HDFS smaller
than a single block does not occupy full block.
hadoop fsck / -files –blocks: list blocks make
HDFS Interfaces (Most Used Ones)
• C:
C interface uses libhdfs library which uses JNI to
call Java file system client • FUSE:
File system in Userspace (FUSE) allows file system
that are implemented in user space to be mounted as a UNIX file system. FUSE-DFS allows any HDFS to be mounted as a UNIX file system
• WebDAV:
WebDAV is a set of extensions to http to support
editing and retrieving files in HDFS • Java API:
FileSystem, FSDataInputStream and
FSDataOutputStream classes are used to read and write data to HDFS
HDFS Full Picture
Read Operation WorkFlow
Read Operation Workflow
1. HDFS client opens the file it wishes to read by calling open() on DistributedFileSystem for HDFS
2. DistributedFileSystem calls the namenode using RPC, to determine the locations of the blocks
3. Namenode returns the addresses of the datanodes that have copy of that block. (Datanodes are sorted according to the proximity to the client. Nearest one first). If the client is datanode itself, then it will read from local datanode, if it has the copy of block
Read Operation Workflow
4. The DistributedFileSystem returns an
FSDataInputStream to the client to read blocks
5. The client then calls read() method on the stream. FSDataInputStream connects to the first nearest
datanode for the first block. When end of stream reaches, DFSInputStream closes connection to that datanode and connects to the next datanode for the next block. This process are done for all blocks of file 6. DFSInputStream also does checksum over data it receives from datanode. And if a corrupted block
found, it will reported to the namenode before start reading same block from another datanode
Network Topology And Hadoop
•
What we mean “close” about the distance
between two nodes in datacenter?
•
Hadoop takes network as a tree and levels of
tree are datacenter, rack and node
•
Bandwidth available for each of the following
scenarios becomes progressively less
Process on the same node
Different nodes on the same rack
Nodes of different racks in the same
datacenter
Network Topology And Hadoop
Write Operation Workflow
Write Operation Workflow
1. Client creates the file by calling create() on DistributedFileSystem.
2. DistributedFilesystem makes an RPC call to the namenode to create a new file in the file system’s namespace with no blocks associated with it.
3. Namenodes makes some validation checks like, file already exist or permission problems.
4. If checks are passed, namenode makes a record of new file.
5. DistributedFileSystem returns FSDataOutputStream to the client to start writing.
Write Operation Workflow
6. As client writes data, FSDataOutputStream splits it into packets and writes them to an internal queue
called data queue.
7. The data queue is consumed by the DataStreamer, whose responsibility is to ask the namenode to allocate new
blocks by picking list of suitable datanodes to store the replicas.
8. The list of datanodes forms a pipeline – we will assume that replication parameter is three.
9. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
Similarly second node stores and forwards to the third node in the pipeline.
Write Operation Workflow
10. DFSOutputStream (FSDataOutputStream) also maintains the internal queue of packets that are
waiting to be acknowledged by datanodes, called ack queue.
11. The packet is removed from ack queue only when it has been acknowledged by all the datanodes in the pipeline
12. When client finishes writing data, it calls close() on the stream