What’s a Hadoop, Anyway?

To Hadoop and Beyond is a series dedicated to exploring the basics of distributed computing as it stands today, and to taking stock of where the state of the art is heading. In this chapter, we take a look at the basis for current-generation distributed computation – Apache Hadoop.

Yet Another Google Project

There are two primary components to Hadoop: the MapReduce programming model, which we’ve covered before, and the Hadoop Distributed File System (HDFS). Like MapReduce, HDFS started out as a Google project built to help manage and analyze the company’s mountains of data during the early days of the internet. To recap, the scale of the data that Google needed to analyze and index was such that even the most powerful supercomputers might struggle to remain performant. By networking clusters of off-the-shelf commodity hardware, however, Google was able to use MapReduce to structure its problems so that they could be parallelized (mapped) and the results recombined (reduced) in a reasonable amount of time, with adequate fault tolerance.
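
The map-then-reduce split is easy to sketch in plain Python. The toy below is a single-machine illustration of the idea only (word counting, the canonical example), not Hadoop’s actual API:

```python
from collections import defaultdict

def map_phase(records):
    # Map step: emit a (key, value) pair for every word in every record.
    # Each record can be handled independently, so this step parallelizes.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce step: recombine all values that share the same key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_phase(map_phase(["big data", "big cluster"]))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

On a real cluster, many machines run the map phase on their own slice of the data, and a shuffle stage routes all pairs sharing a key to a single reducer.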

HDFS developed out of the Google File System (GFS). The central innovation of the Google File System was to expect and design for hardware failure. For readers familiar with storage virtualization using a Redundant Array of Inexpensive Disks (RAID), the concept here is similar. Not only might hard drives fail during the course of processing your data – entire machines might! Previous distributed computing frameworks were designed around the assumption that machines were up and running all the time. GFS (and HDFS after it) did not share that philosophy: if a machine fails while processing part of your job, the control node detects the failure and reassigns the work to another machine, with no data lost in the process.


Hadoop is named after the creator’s son’s toy elephant.

HDFS Data Management

The architecture of HDFS is straightforward: data is stored across multiple networked machines. The machines do not need to use any kind of RAID configuration for their hard drives, nor do they even need to be similar in specification. A master namenode keeps a registry of the data stored on each machine, including how many redundant copies of each piece exist. If the master recognizes that data has become corrupted on one host, or that a particular piece of data is not stored with sufficient redundancy across the cluster, it takes steps to correct the imbalance.
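
Conceptually, you can picture the namenode’s registry as a map from data blocks to the hosts that hold them. The sketch below is a hypothetical, heavily simplified model of the repair idea (the names `REPLICATION_FACTOR` and `hosts_needing_copies`, and all the host labels, are invented for the example), not Hadoop’s real bookkeeping:

```python
REPLICATION_FACTOR = 3  # desired number of copies of every block

# Which hosts currently hold each block (illustrative data)
block_locations = {
    "block-001": {"host-a", "host-b", "host-c"},
    "block-002": {"host-a", "host-d"},  # under-replicated!
}

def hosts_needing_copies(locations, live_hosts, factor=REPLICATION_FACTOR):
    """For each under-replicated block, propose hosts to copy it to."""
    plan = {}
    for block, holders in locations.items():
        holders = holders & live_hosts          # ignore failed hosts
        missing = factor - len(holders)
        if missing > 0:
            # Candidates are live hosts that don't already hold the block
            plan[block] = sorted(live_hosts - holders)[:missing]
    return plan

live = {"host-a", "host-b", "host-c", "host-d", "host-e"}
plan = hosts_needing_copies(block_locations, live)
# plan == {"block-002": ["host-b"]} – block-002 needs one more copy
```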

Given that HDFS is designed to be failure tolerant, there is also support for a failover, or standby, namenode. If the active namenode fails, a standby namenode, having been kept up to date concurrently, can step into action. (Note that Hadoop’s confusingly named “secondary namenode” is a checkpointing helper, not a failover node.) Another interesting namenode feature that HDFS supports is federation. Unlike a standby namenode, which exists to step in on failure, a set of federated namenodes works in concert to parallelize the namenode’s work and increase throughput. This can be necessary under large processing loads, where a single namenode might become a bottleneck.

Similarly, a system of federated namenodes might have its own supporting set of failover namenodes that step in when one or more of the federated namenodes fail.

Another interesting feature of HDFS is that it supports defining the locality of networked machines. That is, it is possible to describe to HDFS the physical configuration of the machines in the cluster. In the case of rack-mounted server hardware, for example, HDFS can be configured to recognize that two machines are on the same physical rack. It may then prefer to store redundant data on a host that is not on the same rack as the machines that already hold that data. That way, if a whole rack fails, the data is not lost.
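
The placement preference can be sketched like so. Again, this is an illustrative toy (`pick_replica_host` and the rack labels are invented for the example), not HDFS’s actual placement policy:

```python
def pick_replica_host(existing_hosts, rack_of, candidates):
    """Prefer a candidate on a rack that holds no replica yet, so a
    whole-rack failure cannot take out every copy of the data."""
    used_racks = {rack_of[h] for h in existing_hosts}
    off_rack = [h for h in candidates if rack_of[h] not in used_racks]
    # Fall back to any host without a copy if every rack is already used
    pool = off_rack or [h for h in candidates if h not in existing_hosts]
    return sorted(pool)[0] if pool else None

rack_of = {"h1": "rack-1", "h2": "rack-1", "h3": "rack-2", "h4": "rack-2"}
choice = pick_replica_host({"h1"}, rack_of, ["h2", "h3", "h4"])
# choice == "h3": h2 is rejected because it shares rack-1 with the
# existing replica on h1
```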

Finally, a quick note that Hadoop, while associated with HDFS, can also be configured to access data on other file system types, including Amazon S3, FTP servers, and MapR FS (a variant of HDFS that supports full random-access reads and writes).

An Example of a Native Hadoop Workflow

With this in mind, let’s visualize what a (simple) native Hadoop workflow looks like, from start to finish. “Native” is an important distinction, because running analytics processes on Hadoop with no additional abstraction (such as Hive, or some kind of analytical front-end) is an increasingly rare occurrence. We’ll talk about those later.

Defining the Problem

Imagine you’re a retail performance analyst at Amazon. You’ve got a mountain of data from your web analytics system about the pages that people view, how long they stay there, what they ultimately end up buying, and so on and so forth. Glancing through the data casually, you notice that a lot of people in Texas are looking at Texas-shaped Waffle Irons. This gets you wondering – do other companies sell waffle irons in the shape of other states? And if so, are people from other states as passionate about them as people from Texas are about waffles in the shape of Texas?

To answer this, you’ll need to look up the most frequently viewed item in the Waffle Iron category (which doesn’t exist publicly, but you just know exists internally within Amazon) for each state to which the viewer can be attributed.

Gathering the Data

You go into your web analytics tool and ask it for all data and metadata relating to views on Waffle Irons going back two years (again, this is Amazon – we run statistically significant experiments here). You press go, it tells you the file is ready… and it’s ten gigabytes in size, with 100 million records. Your finely tuned analyst sense tells you that Excel isn’t going to be able to handle this one. Apparently people are very excited about waffle irons.

You take a deep breath – alright, we’re going to have to do this in Hadoop. The thought sends a chill down your spine as you save the file to a remote host that the analytics team uses for running this kind of analysis. Luckily for you, that host happens to be an entry point into HDFS.

Writing the Map() Job

You crack open your IDE and get to writing. The first step of MapReduce, of course, is mapping your computation into discrete chunks that can be easily parallelized and, more importantly, recombined later on in the workflow.

It seems to you that a logical way to break up the data might be by state. There is a small number of states, and if we assume that the population distribution of the US is reflected in the file, then the most populous state, California, would represent only about 12 million records and 1.2 GB of data. Perhaps some nodes will be able to process multiple states in the time it takes one node to process California, but with 100 million records, this will always be faster than doing it on a single machine.

You tell the map process to take all the records for each state, grouped by state, and tally up the number of views for each webpage. Remember that we already filtered the web page data for Waffle Irons when we created the file.

As a final step, you ask the Map() job to return an associative array (or a dictionary, if you’re using Python) containing the state for which it did the tally, the name of each webpage, and the number of views for that webpage.
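
In Python, such a Map() job might look roughly like the sketch below. The record layout and page names are invented for illustration, and a real Hadoop job would read its input split from HDFS rather than an in-memory list:

```python
from collections import defaultdict

def map_views(records):
    """Map job sketch: each record is a (state, page) pair from the
    already-filtered Waffle Iron export. Tally page views per state."""
    tallies = defaultdict(lambda: defaultdict(int))
    for state, page in records:
        tallies[state][page] += 1
    # Emit one associative array per state, as described above
    return {state: dict(pages) for state, pages in tallies.items()}

sample = [("TX", "texas-waffle-iron"), ("TX", "texas-waffle-iron"),
          ("TX", "classic-waffle-iron"), ("CA", "classic-waffle-iron")]
partial = map_views(sample)
# partial == {"TX": {"texas-waffle-iron": 2, "classic-waffle-iron": 1},
#             "CA": {"classic-waffle-iron": 1}}
```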


Writing the Reduce() Job

The Reduce() job takes the output of the Map() job and recombines it into something logical and usable as requested by the user.

In this case, the Reduce job can be made aware that the Map job will return the webpage views for a given state. The Reduce job then finds the highest value for each state and records it as an item in an associative array. Finally, the Reduce job stores the completed array in a callable object, which you then print to a CSV file.
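
A matching sketch of the Reduce() step: merge the per-state tallies from every mapper, then keep only the top page per state. As with the map sketch, this is illustrative only; a real reducer receives its input via Hadoop’s shuffle rather than a Python list:

```python
def reduce_top_page(partials):
    """Reduce job sketch: combine per-state tallies from all mappers,
    then keep the single most-viewed page for each state."""
    merged = {}
    for partial in partials:
        for state, pages in partial.items():
            bucket = merged.setdefault(state, {})
            for page, views in pages.items():
                bucket[page] = bucket.get(page, 0) + views
    # For each state, the page with the highest merged view count
    return {state: max(pages, key=pages.get)
            for state, pages in merged.items()}

top = reduce_top_page([
    {"TX": {"texas-waffle-iron": 2, "classic-waffle-iron": 1}},
    {"TX": {"texas-waffle-iron": 1}, "CA": {"classic-waffle-iron": 3}},
])
# top == {"TX": "texas-waffle-iron", "CA": "classic-waffle-iron"}
```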

Executing the Job

You save your script, give it a once-over, and press go. To your delight, it compiles correctly the first time (pinch yourself – you might be dreaming)! Let’s take a moment to talk about what happens behind the scenes while you wait for results to be returned.

The host to which you transferred the file, and where you ran the script, serves as the entry point into HDFS (this doesn’t always have to be the case; individual hosts can be configured to send their commands to a remote master namenode).

When you execute the script, it reads through your commands and recognizes that you want to split the file by the state value. Acting as the master namenode, it directs various hosts in the cluster to ingest chunks of the file and store them locally for analysis. It recognizes that some chunks are smaller than others, and so it may apportion multiple chunks to the same host if it believes that host can finish processing quickly.

Now that the data has been distributed among hosts, it’s time for computation to begin. As the data transfer completes, the master namenode instructs each remote host to begin processing, and to report back if something goes wrong.

Such is your bad luck that while the cluster is plugging away, the power supply on one of the racks fails and the machines go down. Not a problem! The master namenode notices this and instructs other machines to pick up the slack. These may be machines that have already finished computation or ones that have particularly small files.
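
The recovery idea can be sketched as a simple reassignment loop. This is purely illustrative of the concept (`reassign` and the node names are made up), not how Hadoop’s scheduler actually works:

```python
def reassign(tasks, failed_hosts):
    """Move chunks from failed hosts to the least-loaded live host."""
    live = {h: chunks for h, chunks in tasks.items()
            if h not in failed_hosts}
    for host in failed_hosts:
        for chunk in tasks.get(host, []):
            # Pick the live host with the fewest chunks still assigned
            least_loaded = min(live, key=lambda h: len(live[h]))
            live[least_loaded].append(chunk)
    return live

tasks = {"node-1": ["TX", "CA"], "node-2": ["NY"], "node-3": ["OH"]}
after = reassign(tasks, {"node-2"})
# node-2's "NY" chunk is handed to node-3, the least-loaded survivor
```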

As individual machines finish their computation, they send the results back to the master namenode, which collects and saves the results for each chunk as it is completed. When all chunks have been completed, the master namenode sends the collated data back to the original host process, which then initiates the reduce process. Using the data produced by the Map() job, the Reduce() process will then find the page with the highest number of views for each state.

Reading the Results

To your surprise (or maybe not, if you’re not from Texas), you find that Texas is more or less the only state that has a waffle maker made in its image. Indeed, the most viewed waffle makers in most other states are just plain old waffle makers, with nothing much else interesting about them.

That’s not to say that the world of waffle makers is boring. There are appliances that will make you a waffle in the shape of Mickey Mouse, the Death Star, or Spider-Man.

This gets you thinking… Perhaps it’s time to brush up on regular expressions and give Hadoop another workout!
