Understanding the Core of Hadoop: the MapReduce Algorithm

To Hadoop and Beyond is a series dedicated to exploring the basics of distributed computing as it stands today, and to taking inventory of where the state of the art is heading. In this first chapter, we explore the most important part of contemporary distributed computing platforms: the MapReduce algorithm.

Why Distributed Computing?

To understand the MapReduce algorithm, it helps to understand the problem it is trying to solve. When you use a computer for computation, several infrastructure constraints come into play. The power of the processor imposes a limit on the computational complexity you can handle, the available storage space limits the size of the dataset you can analyze, and the amount of addressable RAM limits the speed of the computation, especially for read/write-intensive workloads.

With data volumes growing ever more quickly, there is a reasonable probability that the size and complexity of your dataset will exceed your machine's ability to complete the computation in a reasonable amount of time.

To overcome these hurdles, the conventional response has historically been to develop ever faster processors, bigger storage arrays, and bigger and faster RAM modules. IBM and many other organizations have built businesses around selling purpose-built supercomputers that analyze massive amounts of data on a single monolithic machine.

However, even with the heretofore unbroken march of Moore's Law, processing power and storage capacity simply are not keeping up with the glut of data out there that needs to be processed. At some point, the reasonable observer asks: instead of trying to build a bigger and better computer, why not just use two of them?


Enter MapReduce

The concept of processing a dataset in a distributed manner across multiple nodes in a network of computers has been around in some form since at least 1995. The term MapReduce was coined by two Google engineers in a 2004 paper presented at the Sixth Symposium on Operating Systems Design and Implementation. Since Google originated the algorithm, the original use case was to rebuild the web search index for the entire internet in a timely and efficient manner, and to do so using commodity, off-the-shelf hardware rather than specialized and expensive supercomputers.

While Google's internal implementation was never released as open source, the published paper described the approach in enough detail that it has been reimplemented and adapted by any number of organizations seeking to commercialize the technology, most notably in Apache Hadoop, and the underlying concept of MapReduce remains relevant even today. As the name suggests, MapReduce involves a Map() step and a Reduce() step, with data distributed across multiple hosts (machine nodes in a networked cluster of off-the-shelf computers).
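To make the two steps concrete, here is a minimal, single-process Python sketch of the word-count example used in the original paper. The function names and the in-memory grouping stand in for what a real cluster does across many machines; they are illustrative, not an actual MapReduce API.

```python
from collections import defaultdict

# Map step: emit an intermediate (key, value) pair for every word in a document.
def map_words(document):
    for word in document.split():
        yield (word, 1)

# Reduce step: combine all values that share a key into a single result.
def reduce_counts(word, counts):
    return word, sum(counts)

# Stand-in for the shuffle a real cluster performs: group the intermediate
# pairs by key, then run the reduce function on each group.
documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
intermediate = defaultdict(list)
for doc in documents:
    for word, count in map_words(doc):
        intermediate[word].append(count)

word_counts = dict(reduce_counts(w, c) for w, c in intermediate.items())
print(word_counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```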

The Map Step

One interesting wrinkle of hardware progress is that local area network transfer speeds in commodity hardware have improved more slowly than processing power or storage capacity. Even in 2003, gigabit Ethernet was the standard in enterprise data centers. Today, fiber backplanes can exceed that performance, but by and large that is the only major improvement in local area network transfer speeds over the past decade.

Keep that in mind as we discuss MapReduce. Imagine we have a file on a central machine that we want to analyze, and we want to use the processing power of multiple machines to do it. The first step, then, is to split the file into chunks that can be digested by each machine for analysis. This is the first component of the Map step of MapReduce. In the paper introducing the concept, the authors suggest placing each unique chunk on three different machines, in order to have redundancy should one or more machines fail during the calculation (because this is a distributed operation, we cannot be sure of the state of unrelated machines in the network).
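As a rough sketch of that splitting-and-placement step, the Python below breaks a file into fixed-size chunks and assigns each chunk to three hosts. The chunk size, host names, and round-robin placement policy are assumptions made for illustration; a real system makes these placement decisions with far more care.

```python
# Illustrative assumptions: a 64 MB chunk size, ten hypothetical hosts, and a
# simple round-robin placement of three replicas per chunk.
CHUNK_SIZE = 64 * 1024 * 1024
HOSTS = [f"node-{i}" for i in range(10)]
REPLICATION = 3

def split_into_chunks(path, chunk_size=CHUNK_SIZE):
    # Yield fixed-size chunks of a file (not invoked in the demo below,
    # since it needs a real file on disk).
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

def assign_replicas(num_chunks, hosts=HOSTS, replication=REPLICATION):
    # Place each chunk on `replication` distinct hosts, chosen round-robin.
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [hosts[(chunk_id + r) % len(hosts)]
                               for r in range(replication)]
    return placement

print(assign_replicas(5))  # chunk id -> three hosts holding a copy
```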

After splitting the file and transferring it to any number of distributed hosts, the next step is to run the analysis on the data at each host. One host acts as a control node: it assigns pieces of work to the other machines in the network and collates their responses from the Map step (these will be used later, in the Reduce step). As individual hosts finish their assigned portions of the computation, the control node tells them to work on another piece, and tells other machines in the network that it is safe to disregard and dispose of redundant chunks that have already been analyzed by another node.

In terms of the computation itself, the value of MapReduce is its versatility: many different kinds of computation can be expressed in the framework and thereby take advantage of the efficiencies of distributed processing.

Let's take an averaging operation as an example. An average is calculated by taking the sum of all elements in a set and dividing that sum by the number of elements in the set. Using the MapReduce framework, say that we have a set of 100 items, split into four equal sets of 25 items each and distributed across four nodes. Each node calculates the average of the elements in the set it has been assigned and passes that number back to the control node. As additional metadata, it also passes back the number of items in the set it averaged. This way, we could assign a more powerful node a larger set and a less powerful node a smaller set.
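Here is a minimal sketch of that Map step, assuming each node's chunk arrives as a plain Python list of numbers; the function name and the simulated four-way split are illustrative only.

```python
# Map step for the averaging example: each node averages its own chunk and
# reports the result along with how many elements that average covers.
def map_average(chunk):
    return {"average": sum(chunk) / len(chunk), "count": len(chunk)}

# Simulate four nodes, each holding 25 of the 100 items.
data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
partial_results = [map_average(chunk) for chunk in chunks]
print(partial_results)
# [{'average': 13.0, 'count': 25}, {'average': 38.0, 'count': 25}, ...]
```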


The Reduce Step

Continuing with the example of calculating an average across a large set: during the Reduce step, the control node collates the data returned by each Map node and finishes the analysis requested by the user. For an averaging operation, the control node computes a weighted average of the partial averages, weighting each one by the size of the set it came from. In the case of four equally sized sets, the simple mean of the partial averages gives the correct overall average. However, if one node held 30% of the set and another held 70%, the partial averages would have to be weighted by those proportions to produce a correct result.
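Continuing the sketch from the Map step, the Reduce side might combine the partial results like this (the dictionary layout is the same assumption as before):

```python
# Reduce step for the averaging example: weight each node's partial average by
# the share of the elements it covered. With equal-sized chunks the weights are
# all equal; with a 30%/70% split they are not.
def reduce_average(partials):
    total_count = sum(p["count"] for p in partials)
    return sum(p["average"] * p["count"] / total_count for p in partials)

# Two nodes of different sizes: one averaged 30 items, the other 70.
partials = [{"average": 10.0, "count": 30}, {"average": 20.0, "count": 70}]
print(reduce_average(partials))  # 17.0, not the naive mean of 15.0
```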

If this all sounds like an unnecessary complication for simply calculating an average, remember that this process is designed to scale across many machines built from cheap, off-the-shelf hardware. So, if we wanted to calculate the average of 1 billion records and had 1,000 nodes in a networked cluster, we could assign each node a million records to average. Taking the average of a million records is undoubtedly far less demanding than taking the average of 1 billion records.

Another use case for this framework is querying a dataset. Here the Reduce operation is simply the collation of the result set, plus deduplication and error correction where needed. Imagine a database of 1 billion records spread across 1,000 nodes, with a million records at each node. The user asks the system to find all records where a certain field has a certain value. During the Map step, each node searches its own chunk of a million records for matches. Having found them, each node then transfers the applicable records to the control node.
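A sketch of that query pattern, with the Map step doing the per-node filtering and the Reduce step collating and deduplicating the matches. The record layout, field names, and the use of an "id" key are invented for the example.

```python
# Map step: each node filters its own records for the requested field/value.
def map_query(records, field, value):
    return [r for r in records if r.get(field) == value]

# Reduce step: collate the per-node result sets and drop duplicates that can
# arise when redundant copies of the same chunk were scanned by two nodes.
def reduce_query(result_sets):
    seen, merged = set(), []
    for results in result_sets:
        for record in results:
            if record["id"] not in seen:
                seen.add(record["id"])
                merged.append(record)
    return merged

node_a = [{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}]
node_b = [{"id": 1, "color": "red"}, {"id": 3, "color": "red"}]  # id 1 is a redundant copy
matches = [map_query(chunk, "color", "red") for chunk in (node_a, node_b)]
print(reduce_query(matches))  # records 1 and 3, each reported exactly once
```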

One final thing to understand about Reduce jobs is that they do not need to happen synchronously. When deploying a large cluster of machines, it is impractical and unrealistic to expect all of them to complete their assigned tasks at the same time. Instead, the control node conducting the Reduce step accepts output from nodes as they are ready and recombines it in the order specified by the user. If the control node determines that any given Map node has stalled, crashed, or failed, it will task another machine that holds a redundant copy of the failed node's data with the computation, or transfer that data to a node that is ready to compute, perhaps because it has already finished its own initial assignment.
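The toy loop below illustrates that asynchronous, failure-tolerant behavior: the control node accepts outputs in whatever order workers finish and reassigns a task whenever its worker appears to have failed. The task and node names, and the random simulation of failures, are purely illustrative.

```python
import random

# Tasks start out assigned to four hypothetical worker nodes; the control node
# collects results as they arrive, in no particular order.
pending = {task_id: f"node-{task_id % 4}" for task_id in range(8)}  # task -> worker
completed = {}                                                      # task -> output

while pending:
    for task_id, worker in list(pending.items()):
        roll = random.random()
        if roll < 0.1:
            # Worker stalled or crashed: reassign the task to another node that
            # holds a redundant copy of the same chunk.
            pending[task_id] = f"node-{random.randrange(4)}"
        elif roll < 0.6:
            # Worker finished: accept its output whenever it is ready.
            completed[task_id] = f"partial-result-from-{pending.pop(task_id)}"

print(sorted(completed))  # all eight tasks complete, though not in a fixed order
```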

In this way, it is not necessary that all nodes in a cluster have identical specifications, or even identical reliability.

The Legacy of MapReduce

Now that we know the basic mechanism of the MapReduce algorithm, we are ready to tackle the many implementations and flavors of distributed computation available on the market today.
