Lect 5: Hadoop as Distributed System
September 23, 2017
Outline
1. Data strorage in HDFS 2. HDFS
3. Why is HDFS block size 128MB in Hadoop?
4. What are major functions of a filesystem?
5. Local filesystem Vs. HDFS
6. Local filesystem Vs. HDFS- Key differences
Data storage in HDFS
When you store a file in HDFS, the system breaks it down into a set of individual blocks and stores these blocks in various slave nodes in the Hadoop cluster. Blocks are are the smallest unit of data in a filesystem. We (client and admin) do not have any control on the block like block location. Namenode decides all such things.
HDFS stores each file as blocks. However, the block size in HDFS is very large.
The default size of the HDFS block is 128MB which you can configure as per your requirement. All blocks of the file are the same size except the last block, which can be either the same size or smaller. The files are split into 128 MB blocks and then stored into the Hadoop file system. The Hadoop
application is responsible for distributing the data block across multiple nodes.
It is not necessary that in HDFS, each file stored should be in exact multiple of the configured block size 128mb, 256mb etc., so final block for file uses only as much space as is needed
Why is HDFS Block size 128 MB in Hadoop?
For each data block distributed over slaves, metadata is maintained by Namenode. The metadata stores information about each data block. The reasons to 128MB of block size in HDFS are:
1. HDFS have huge data sets, i.e. terabytes and petabytes of data. So like Linux file system which have 4 KB block size, if we had block size 4KB for HDFS, then we would be having too many data blocks in Hadoop HDFS and therefore too much of metadata. So, managing this huge number of blocks and metadata will create huge overhead and traffic which is something which we don?t want.
2. Block size cant be so large that the system is waiting a very long time for one last unit of data processing to finish its work.
What are major functions of a filesystem ?
1. Filesystem controls how the data is stored and retrieved. Basically when you read and write files to your harddisk your request goes through a filesystem.
2. Next, filesystem has the meta data about your files and folders. Metadata like the file name, size, owner, created and modified time etc.
3. Filesystem also takes care of permissions and security
4. Filesystem manages your storage space. when you ask to write a file to harddisk. Filesystem helps figure out where in the hard disk it should write the file. and it should write the file as efficient as possible.
Common file system
1. FAT32, NTFS - Microsoft 2. ext3, ext 4- Linux
These filesystems can handle individual file sizes up to 8 Exabyte or even up to 16 Exabyte. Their capacity of handling big data is huge so, what is the need for HDFS ?
Local Filesystem Vs. HDFS ?
Assume you have a 10 node cluster and you have ext4 as the filesystem on each node. We will refer to ext4 on each node as the local filesystem. So first, when we upload a file to your filesystem we need the filesystem to divide the dataset in to fixed size blocks. Although every filesystem has a concept of block, the concept of blocks in HDFS is very different when compared to the blocks in traditional filesystem.
Local Filesystem Vs. HDFS ?-key differences
1. HDFS has a distributed view of the files or blocks in your cluster which is not possible with your local filesystem which is ext4( in our example from last slide). The local ext4 filesystem on node1 has no idea what is on node 2. Similarly node 2 has no idea what is on node 1, because since the ext4 filesystems in both node 1 and node 2 are local to each node and hence there is no way they can have a global or distributed view. That is why we say the ext4 on individual nodes as local filesystem.
2. Next important thing is replication. Since ext4 in node 1 has no idea about storage in any other node, it does not have the ability to replicate blocks in node 1 to other nodes. So now when you upload a file to HDFS.
it will be automatically split in to 128 MB fixed size blocks in the recent versions of Hadoop. 64 MB blocks in legacy versions. HDFS takes care of placing the blocks in different nodes and also take care of replicating each block in more than one node. By default HDFS replicates a block to 3 nodes.
Local Filesystem Vs. HDFS ?-key differences
Example: Lets say you copy a 700 MB dataset in to HDFS. HDFS will divide the dataset in to 128 MB blocks. So we will have 5 equal sized 128 MB block and one 60 MB block. Since HDFS has a distributed view of the cluster, HDFS will decide which nodes should hold these 6 blocks and also pick nodes to hold the replicated blocks. HDFS will continue to keep track of all the blocks and their node assignments all the time. So when a user ask HDFS about the 700 MB dataset it know how to construct the file from blocks.
Important points
1. HDFS by no means a replacement for your local filesystem. The operating system still rely on the local filesystem. Infact the operating system does not care about the presence of HDFS. HDFS should still go through ext4 to save the blocks in the storage (but its management is different).
2. The true power of HDFS is that it is spread across all the nodes in your cluster and it has a distributed view of the cluster and hence it knows how to construct the 700 MB dataset from blocks. Where as the ext4 does not have a distributed view and only has a local view will only have idea about the blocks in storage it is managing.