4 March 2015

In this article we will briefly discuss the computation paradigm MapReduce, and Apache Hadoop as one of its implementations. We won't get into much details, and we even won't implement the Word Count on Hadoop, but it should give some foundation for the future articles about tools for scalable data processing.

MapReduce

Source: @tgrall

MapReduce is a paradigm for parallel computation that comes from Functional Programming, it's mainly used for large scale data processing and it typically runs on distributed systems. It can hide the details of parallel execution, yet it's expressive enough for many data processing algorithms.

In MapReduce the two main primitives are functions map and reduce, and both of them must be provided by a programmer, the rest is done by the executing engine.

A map function takes a key-value pair (in_key, in_value) and produces multiple key-value pairs called “intermediate result”. The output of the map function is then grouped by keys and fed to the reduce function, which typically aggregates its input and produces the final result.

A MapReduce job is an execution of these two function on some data set. A job consists of three stages:

1. the map stage, where the map function is applied to each key-value pair of the input dataset
2. the grouping stages, at which the intermediate results are shuffled and combined, and
3. finally the reduce stage, where the reduce function is applied to each group of the intermediate result and produces the final result.

A job is executed in parallel with many nodes reading from many input sources.

Let's have a look at a classical example: the word count problem. Given some text, we need to calculate how many occurrences of each word there are.

In a python-like pseudo-code the map function is defined like this:

def map(String input_key, String doc):
for each word w in doc:
EmitIntermediate(w, 1)


Here at the map stage, for each each input word $w$ the mapper outputs a tuple $(w, 1)$ by using EmitIntermediate(w, 1).

After this, we group all produced tuples by key. That is, if we have $N$ words $w_1, \ ... \ , w_N$, then it the grouping for each word $w_i$ will be $(w_i, [1, 1, ..., 1])$: a word $w_i$ followed by a sequence of ones.

Finally, we define the reduce function:

def reduce(String output_key, Iterator output_vals):
int res = 0
for each v in output_vals:
res = res + v
Emit(res)


Here we just sum over all the ones in each group and that will be the number of occurrences for each word.

Now let's try to make sense of the sandwich picture from the beginning: for the map phase, each input vegetable (or a loaf) is cut and the output could be (sandwich_id, piece). Then at the reduce phase every tuple is grouped by the key, in our case sandwich_id, so all the needed ingredients are put together, and the reducer can assemble the sandwich.

Apache Hadoop is an open-source implementation of the MapReduce.

However it's not always clear what the term “Hadoop” means, and people may refer to different things when they use it. Most commonly, it's used when talking about the MapReduce component and the Hadoop Distributed File System (HDFS). Also the term sometimes may refer to other Hadoop-related products such as Apache HBase, Apache Hive, Apache Mahout and many others.

Hadoop MapReduce Component is a data processing tool that works on top of HDFS and implements MapReduce.

A Hadoop cluster consists of the main node, called Name Node that orchestrates the process of assigning jobs and monitoring the progress. A MapReduce job is done by data nodes, also called workers. There are two kind of workers: mapper workers and reduce workers.

This picture (from an article by the University of Washington) is an excellent way of describing how Hadoop works:

• The Name Node chooses some idle workers and assigns them a map task
• Before starting the map task, the workers need to do some preparation work:
• The specified input file is loaded to the file system and partitioned into blocks
• Then each block is replicated three times to guarantee fault-tolerance
• The map phase
• Each block of data is assigned to a map worker
• The worker applies the map function to it
• Once the map function finishes its work, the intermediate results are sorted on local disk of mapper
• Also the intermediate results there are partitioned into $R$ groups (partitioning is typically done by hashing) - these groups will be used at the reduce phase
• It needs to wait until all map tasks are completed
• Before the reduce phase starts
• The Name Node assigns reduce task to idle workers (reducers)
• The intermediate results are shuffled and assigned to reducers
• Each reducer pulls its partition from the mapper’s disk (all map results are already partitioned by mappers)
• The reduce phase
• Each reducer applies the reduce function to its own partition of data
• The produced result is stored and replicated three times to ensure fault-tolerance.

Note that fault-tolerance is easily achieved in this scheme as well: if a node fails, the name node just re-assigns a task to another node in the cluster. Also it does not prepare any execution plan beforehand: once a task is done, the node is assigned to another task. This way it also achieves load balancing.

This all looks nice, but there are some disadvantages.

There is no declarative high-level language such as SQL. MapReduce implementation in Hadoop is quite a low-level approach and requires writing programs in Java or Python. It makes queries harder to write and maintain, and this is especially bad for ad-hoc analytical querying. The solutions to this are high-level SQL-like scripting languages like Pig or Hive, and they are compiled to MapReduce jobs.

There are performance issues: both Map and Reduce operations are blocking. The reduce phase cannot begin until the map phase finishes, and this causes performance degradation.

Finally, Hadoop is still a very young technology, and it does not have decades of development behind like Relational Databases.