The HDFS: Highly Durable Storage Optimized for Analytics

The HDFS, as we hope you’ve guessed, holds the same role within Hadoop that Nanette and her team of elephants do within C&E Corp. It ensures that your data is always available for use, never lost or degraded and organized to support efficient Map/Reduce jobs. Files are stored on the HDFS as blocks of limited size (128 MB is a common choice).

Each block belongs to exactly one file; a file larger than the block size is stored in multiple blocks. The blocks are stored in cooked form as regular files on one of the Datanode’s regular volumes. (Hadoop’s decision to use regular files rather than attempting lower-level access to the disk, as many traditional databases do, helps make it remarkably portable, promotes reliability and plays to the strengths of the operating system’s finely-tuned access mechanisms.)

The HDFS typically stores multiple replicas of each block (three is the universal default, although you can adjust it per file), distributed across the cluster. Blocks within the same file may or may not share a Datanode but replicas never do (or they would not be replicas, would they?). The obvious reason for this replication is availability and durability — you can depend on finding a live Datanode for any block and you can depend that, if a Datanode goes down, a fresh replica can be readily produced.

JT and Nanette’s workflow illustrates the second benefit of replication: being able to

“move the compute to the data, not [expensively] moving the data to the compute.”

Multiple replicas give the Job Tracker enough options that it can dependably assign most tasks to be “Mapper-local.”

Like Nanette, the Namenode holds no data, only a sort of file allocation table (FAT), tracking for every file the checksum responsible Datanodes and other essential char‐

acteristics of each of its blocks. The Namenode depends on the Datanodes to report in regularly. Every three seconds, it sends a heartbeat — a lightweight notification saying, basically, “I’m still here!”. On a longer timescale, each Datanode prepares a listing of the replicas it sees on disk along with a full checksum of each replica’s contents. Having the Datanode contact the Namenode is a good safeguard that it is operating regularly and with good connectivity. Conversely, the Namenode uses the heartbeat response as its opportunity to issue commands dening a struggling Datanode.

If, at any point, the Namenode finds a Datanode has not sent a heartbeat for several minutes, or if a block report shows missing or corrupted files, it will commission new

22 | Chapter 2: Hadoop Basics

copies of the affected blocks by issuing replication commands to other Datanodes as they heartbeat in.

A final prominent role the Namenode serves is to act as the public face of the HDFS.

The ‘put’ and ‘get’ commands you just ran were Java programs that made network calls to the Namenode. There are API methods for the rest of the file system commands you would expect for use by that or any other low-level native client. You can also access its web interface, typically by visiting port 50070 (http://hostname.of.namenode:

50070), which gives you the crude but effective ability to view its capacity, operational status and, for the very patient, inspect the contents of the HDFS.

Sitting behind the scenes is the often-misunderstood secondary Namenode; this is not, as its name implies and as you might hope, a hot standby for the Namenode. Unless you are using the “HA namenode” feature provided in later versions of Hadoop, if your Namenode goes down, your HDFS has gone down. All the secondary Namenode does is perform some essential internal bookkeeping. Apart from ensuring that it, like your Namenode, is always running happily and healthily, you do not need to know anything more about the second Namenode for now.

One last essential to note about the HDFS is that its contents are immutable. On a regular file system, every time you hit “save,” the application modifies the file in place — on Hadoop, no such thing is permitted. This is driven by the necessities of distributed computing at high scale but it is also the right thing to do. Data analysis should proceed by chaining reproducible syntheses of new beliefs from input data. If the actions you are applying change, so should the output. This casual consumption of hard drive re‐

sources can seem disturbing to those used to working within the constraints of a single machine, but the economics of data storage are clear; it costs $0.10 per GB per month at current commodity prices, or one-tenth that for archival storage, and at least $50 an hour for the analysts who will use it.

Possibly the biggest rookie mistake made by those new to Big Data is a tendency to economize on the amount of data they store; we will try to help you break that habit.

You should be far more concerned with the amount of data you send over the network or to your CPU than with the amount of data you store and most of all, with the amount of time you spend deriving insight rather than acting on it. Checkpoint often, denorm‐

alize when reasonable and preserve the full provenance of your results.

We’ll spend the next few chapters introducing these core operations from the ground up. Let’s start by joining JT and Nannette with their next client.

The HDFS: Highly Durable Storage Optimized for Analytics | 23

CHAPTER 3

In document Big Data for Chimps (Page 52-55)