

MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of MapReduce job run –
classic Map-reduce – YARN – failures in classic Map-reduce and YARN – job scheduling – shuffle and sort – task
execution – MapReduce types – input formats – output formats

MapReduce is a software framework for processing large data sets in a distributed fashion across several machines. The
core idea behind MapReduce is to map your data set into a collection of <key, value> pairs, and then reduce over all
pairs with the same key. The overall concept is simple, but it is quite expressive when you consider that:
1. Almost all data can be mapped into <key, value> pairs, and
2. Your keys and values may be of any type: strings, integers, dummy types... and, of course, <key, value> pairs
themselves.

Applications of the MapReduce framework include:

Distributed sort, distributed search, web link-graph traversal, and machine learning.

A MapReduce Workflow:

When we write a MapReduce workflow, we have to create two scripts: the map script and the reduce script. The rest is
handled by the Amazon Elastic MapReduce (EMR) framework.

When we start a map/reduce workflow, the framework splits the input into segments and passes each segment to a
different machine. Each machine then runs the map script on the portion of data assigned to it.
The map script (which you write) takes some input data and maps it to <key, value> pairs according to your
specifications. For example, if we wanted to count word frequencies in a text, we would use <word, count> as our
<key, value> pairs. Our map script would then emit a <word, 1> pair for each word in the input stream.
The purpose of the map script is to model the data as <key, value> pairs for the reducer to aggregate.

Emitted <key, value> pairs are then “shuffled” (to use the terminology in the diagram below), which basically means
that pairs with the same key are grouped and passed to a single machine, which then runs the reduce script over them.
The reduce script (which you also write) takes a collection of <key, value> pairs and “reduces” them according to the
user-specified logic.
In our word count example, we want to count the number of word occurrences so that we can get frequencies. Thus, we
want our reduce script to simply sum the values of the collection of <key, value> pairs that have the same key.
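
The word count flow described above can be sketched in a few lines of plain Python. This is only a single-process
illustration of the map, shuffle and reduce steps, not EMR or Hadoop code; the function names (map_words, shuffle,
reduce_counts, word_count) are made up for this example, and lower-casing the words is an added assumption.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for a single word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    emitted = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network; here
everything happens in one process so that the data flow is easy to follow.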

The diagram below illustrates the described scenario nicely

Unit tests with MRUnit:

When working with large data sets, unit testing might not be the first thing that comes to mind. However, no matter
how large your cluster is or how much data you have, the same code is pushed out to all nodes to run the MapReduce
job, so Hadoop mappers and reducers lend themselves very well to being unit tested. What is not easy about unit
testing Hadoop is the framework itself. Luckily, there is a library that makes testing Hadoop fairly easy: MRUnit.
MRUnit is based on JUnit and allows for the unit testing of mappers and reducers, and for some limited integration
testing of the mapper-reducer interaction along with combiners, custom counters and partitioners. We are using the
latest release of MRUnit as of this writing, 0.9.0.
All of the code under test comes from the previous post on computing averages using local aggregation.

Hadoop MapReduce jobs have a unique code architecture that follows a specific template with specific constructs. This
architecture raises interesting issues when doing test-driven development (TDD) and writing unit tests.

With MRUnit, you can craft test input, push it through your mapper and/or reducer, and verify its output, all in a
JUnit test. As with other JUnit tests, this allows you to debug your code using the JUnit test as a driver. A
map/reduce pair can be tested using MRUnit’s MapReduceDriver. A combiner can be tested using MapReduceDriver as well.
A PipelineMapReduceDriver allows you to test a workflow of map/reduce jobs. Currently, partitioners do not have a test
driver under MRUnit. MRUnit allows you to do TDD and write lightweight unit tests that accommodate Hadoop’s
specific architecture and constructs.

A multi-threaded environment can be a hostile place to dwell in, and debugging and testing it is not easy. With
map/reduce, things get even more complex. These jobs run in a distributed fashion, across many JVMs in a cluster of
machines. That is why it is important to use all the power of unit testing and to run the tests as isolated as possible.

The logic of the example is the same as in the mentioned post. There are two input paths. One contains user
information: user id, first name, last name, country, city and company. The other holds each user’s awesomeness
rating in the form of a pair: user id, rating value.

Our MR job reads the inputs line by line and joins the information about users with their awesomeness ratings. It
filters out all users with a rating below 150, leaving only awesome people in the results.
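
The join-and-filter logic just described can be sketched in plain Python as follows. This is an illustration of the
idea, not the code under test from the post: the comma-separated line format, the "user"/"rating" tags and all
function names are assumptions made for this example.

```python
def map_user(line):
    """Tag a user line (id, first name, last name, country, city, company) by user id."""
    fields = [f.strip() for f in line.split(",")]
    return fields[0], ("user", fields[1:])


def map_rating(line):
    """Tag a rating line (id, rating value) by user id."""
    user_id, rating = [f.strip() for f in line.split(",")]
    return user_id, ("rating", int(rating))


def join_and_filter(user_lines, rating_lines, min_rating=150):
    """Group tagged records by user id, join them, and drop ratings below min_rating."""
    grouped = {}
    tagged = [map_user(l) for l in user_lines] + [map_rating(l) for l in rating_lines]
    for user_id, record in tagged:
        grouped.setdefault(user_id, []).append(record)

    results = []
    for user_id in sorted(grouped):
        info = [value for tag, value in grouped[user_id] if tag == "user"]
        ratings = [value for tag, value in grouped[user_id] if tag == "rating"]
        # Keep only users that appear in both inputs and are awesome enough.
        if info and ratings and ratings[0] >= min_rating:
            results.append((user_id, info[0], ratings[0]))
    return results
```

Because each step is an ordinary function with plain inputs and outputs, it can be tested in isolation, which is the
same property the MRUnit drivers exploit for real mappers and reducers.
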
There are three main MRUnit classes that drive our tests: MapDriver, ReduceDriver and MapReduceDriver. They are
generic classes whose type parameters depend on the input and output types of the mapper, reducer and whole map/reduce job,
respectively. This is instantiated as

MRUnit provides tools for writing tests in different manners. The first approach is the more traditional one: we
specify the input, run the job (or a part of it), and check whether the output looks as we expect. In other words, we
do the assertions by hand.

The main difference lies in calling the driver’s run() or runTest() method. The first just runs the test without
validating the results. The second also adds validation of the results to the execution flow.

The method List<Pair<K2,V2>> MapDriver#run() returns a list of pairs, which is useful for testing situations in which
the mapper produces key/value pairs for a given input. This is what we used in the approach where we checked the
results of the mapper run.

Both MapDriver and ReduceDriver also have a getContext() method. It returns a Context for further mocking; the online
documentation has some short but clear examples of how to do it.

Counters are the easiest way to measure and track the number of operations that happen in map/reduce programs. There
are some built-in counters such as “Spilled Records”, “Map output records”, “Reduce input records” and “Reduce shuffle
bytes”. MRUnit supports inspecting them through the getCounters() method of each of the drivers.

The TestDriver class provides a facility for setting mock configuration: TestDriver#getConfiguration() allows you to
change only those parts of the configuration that you need to change.

Finally, MapReduceDriver is useful for testing the MR job as a whole, checking whether the map and reduce parts work
when combined.

Anatomy of Map Reduce job run

You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It’s very short, but it conceals a great
deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.

The whole process is illustrated in the figure below. At the highest level, there are four independent entities:
● The client, which submits the MapReduce job.
● The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
● The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose
main class is TaskTracker.
● The distributed filesystem, which is used for sharing job files between the other entities.

Job Submission
The runJob() method on JobClient is a convenience method that creates a new JobClient instance and calls submitJob()
on it. Having submitted the job, runJob() polls the job’s progress once a second and reports the progress to the
console if it has changed since the last report. When the job is complete, if it was successful, the job counters are
displayed. Otherwise, the error that caused the job to fail is logged to the console.

The job submission process implemented by JobClient’s submitJob() method does the following:
● Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
● Checks the output specification of the job. For example, if the output directory has not been specified or it
already exists, the job is not submitted and an error is thrown to the MapReduce program.
● Computes the input splits for the job. If the splits cannot be computed, because the input paths don’t exist, for
example, then the job is not submitted and an error is thrown to the MapReduce program.
● Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed
input splits, to the jobtracker’s filesystem in a directory named after the job ID. The job JAR is copied with a high
replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots
of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
● Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
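
The checks performed during submission can be sketched as follows. This is a simplified model in plain Python, not
the JobClient implementation: the job is a dictionary, the filesystem is a set of existing paths, and the names
submit_job, output_dir, input_paths and staging_dir are all hypothetical. It also assumes one split per input path,
which real input formats do not guarantee.

```python
def submit_job(job, existing_paths, new_job_id):
    """Validate a job and return a description of what would be staged for it."""
    output_dir = job.get("output_dir")
    # Output specification check: the directory must be given and must not exist yet.
    if output_dir is None:
        raise ValueError("output directory has not been specified")
    if output_dir in existing_paths:
        raise ValueError("output directory already exists")

    # Input splits cannot be computed if an input path is missing.
    missing = [p for p in job["input_paths"] if p not in existing_paths]
    if missing:
        raise ValueError(f"input paths do not exist: {missing}")

    splits = list(job["input_paths"])  # simplification: one split per input path
    return {
        "job_id": new_job_id,
        "staging_dir": f"/staging/{new_job_id}",  # directory named after the job ID
        "splits": splits,
    }
```

The point of the sketch is the ordering: both validation steps happen before anything is copied, so a bad job is
rejected without the jobtracker ever being told that it is ready.
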
Job Initialization

When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job
scheduler will pick it up and initialize it. Initialization involves creating an object to represent the job being run, which
encapsulates its tasks, and bookkeeping information to keep track of the tasks’ status and progress (step 5).

To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the
shared filesystem (step 6). It then creates one map task for each split. The number of reduce tasks to create is
determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and the
scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
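
A minimal sketch of this initialization step is shown below: one map task per input split, plus a fixed number of
reduce tasks. The function name create_tasks and the ID format are illustrative assumptions, not Hadoop's actual
naming scheme.

```python
def create_tasks(job_id, input_splits, num_reduce_tasks):
    """Create one map task per split and the configured number of reduce tasks."""
    map_tasks = [
        {"id": f"{job_id}_m_{i:06d}", "type": "map", "split": split}
        for i, split in enumerate(input_splits)
    ]
    reduce_tasks = [
        {"id": f"{job_id}_r_{i:06d}", "type": "reduce"}
        for i in range(num_reduce_tasks)
    ]
    return map_tasks, reduce_tasks
```

Note that the number of map tasks follows from the data (the splits), whereas the number of reduce tasks is purely a
configuration choice.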

Task Assignment:

● Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.

● Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As a part
of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will
allocate it a task, which it communicates to the tasktracker using the heartbeat return value.

● Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from. Having
chosen a job, the jobtracker now chooses a task for the job.

● Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be
able to run two map tasks and two reduce tasks simultaneously.

● The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one
empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.

● To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there
are no data locality considerations.

● For a map task, however, it takes account of the tasktracker’s network location and picks a task whose input split
is as close as possible to the tasktracker.

● In the optimal case, the task is data-local, that is, running on the same node that the split resides on. Alternatively,
the task may be rack-local: on the same rack, but not the same node, as the split.

● Some tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they are
running on. You can tell the proportion of each type of task by looking at a job’s counters.
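
The selection rules above can be sketched as follows. This is a toy model, not the jobtracker's scheduler: tasks are
dictionaries, rack_of is a hypothetical mapping from node name to rack name, and the function names are invented for
this example.

```python
def locality(tracker, split_hosts, rack_of):
    """Return 0 for data-local, 1 for rack-local, 2 for off-rack."""
    if tracker in split_hosts:
        return 0
    if rack_of[tracker] in {rack_of[host] for host in split_hosts}:
        return 1
    return 2


def pick_task(tracker, free_map_slots, free_reduce_slots,
              pending_maps, pending_reduces, rack_of):
    """Choose the next task for a tasktracker that asked for work in its heartbeat."""
    # Map slots are filled first, preferring the closest input split.
    if free_map_slots > 0 and pending_maps:
        best = min(pending_maps, key=lambda t: locality(tracker, t["hosts"], rack_of))
        pending_maps.remove(best)
        return best
    # Reduce tasks have no locality considerations: take the next one in the list.
    if free_reduce_slots > 0 and pending_reduces:
        return pending_reduces.pop(0)
    return None
```

For a tasktracker on node n1, a pending map task whose split lives on n1 is chosen before one on the same rack, which
in turn is chosen before one on another rack.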

Task Execution:

● Now that the tasktracker has been assigned a task, the next step is for it to run the task.
● First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s filesystem. It also
copies any files needed from the distributed cache by the application to the local disk.
● Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory.
● Third, it creates an instance of TaskRunner to run the task.
● TaskRunner launches a new Java Virtual Machine to run each task so that any bugs in the user-defined map and
reduce functions don’t affect the tasktracker.
● The child process communicates with its parent through the umbilical interface. This way it informs the parent of
the task’s progress every few seconds until the task is complete.

Streaming and Pipes

● Both Streaming and Pipes run special map and reduce tasks for the purpose of launching the user-supplied
executable and communicating with it.
● In the case of Streaming, the Streaming task communicates with the process using standard input and output
streams.
● The Pipes task, on the other hand, listens on a socket and passes the C++ process a port number in its
environment, so that on startup, the C++ process can establish a persistent socket connection back to the parent
Java Pipes task.
● In both cases, during execution of the task, the Java process passes input key-value pairs to the external process,
which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the
Java process.
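
As an illustration of the Streaming side, the core of a word count map script could look like the sketch below. It
turns input lines into tab-separated key/value lines, which is the usual text convention for Streaming; the function
name map_lines is an assumption for this example.

```python
def map_lines(lines):
    """Yield one 'word<TAB>1' output line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"
```

In an actual Streaming script, the function would be fed from standard input and its results printed to standard
output, for example with `for out in map_lines(sys.stdin): print(out)`.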

Progress and Status Updates:

● MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run. Because this is a
significant length of time, it’s important for the user to get feedback on how the job is progressing.
● A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running,
successfully completed, failed), the progress of maps and reduces, the values of the job’s counters, and a
status message or description.
● These statuses change over the course of the job, so how do they get communicated back to the client? When a task
is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the
proportion of the input that has been processed.
● For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of the reduce input
processed.
● It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle.
● For example, if the task has run the reducer on half its input, then the task’s progress is ⅚, since it has completed
the copy and sort phases (⅓ each) and is halfway through the reduce phase (⅙).
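
The arithmetic in this example can be checked with a short sketch. The function name reduce_task_progress is made up
for illustration; it simply weights the copy, sort and reduce phases equally, as described above.

```python
from fractions import Fraction


def reduce_task_progress(copy_done, sort_done, reduce_done):
    """Overall reduce task progress; each phase contributes one third."""
    phases = (copy_done, sort_done, reduce_done)
    if any(not 0 <= p <= 1 for p in phases):
        raise ValueError("each phase's completion must be between 0 and 1")
    return sum(Fraction(p) for p in phases) / 3
```

With the copy and sort phases complete and the reduce phase half done, reduce_task_progress(1, 1, Fraction(1, 2))
gives Fraction(5, 6), matching the ⅚ in the text.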

Progress reporting

● Progress reporting is important, as it means Hadoop will not fail a task that’s making progress. All of the
following operations constitute progress:
○ Reading an input record (in a mapper or reducer)
○ Writing an output record (in a mapper or reducer)
○ Setting the status description on a reporter (using Reporter’s setStatus() method)
○ Incrementing a counter (using Reporter’s incrCounter() method)
○ Calling Reporter’s progress() method
● Tasks also have a set of counters that count various events as the task runs, either those built into the framework,
such as the number of map output records written, or ones defined by users.
● If a task reports progress, it sets a flag to indicate that the status change should be sent to the tasktracker.
● The flag is checked in a separate thread every three seconds, and if set it notifies the tasktracker of the current task
status. Meanwhile, the tasktracker is sending heartbeats to the jobtracker every five seconds (this is a minimum,
as the heartbeat interval is actually dependent on the size of the cluster: for larger clusters, the interval is longer),
and the status of all the tasks being run by the tasktracker is sent in the call.
● Counters are sent less frequently than every five seconds, because they can be relatively high-bandwidth.
● The jobtracker combines these updates to produce a global view of the status of all the jobs being run and their
constituent tasks. Finally, as mentioned earlier, the JobClient receives the latest status by polling the jobtracker
every second.
● Clients can also use JobClient’s getJob() method to obtain a RunningJob instance, which contains all of the status
information for the job. The method calls are illustrated below.

Job Completion:

● When the jobtracker receives a notification that the last task for a job is complete, it changes the status for the job
to “successful.”
● Then, when the JobClient polls for status, it learns that the job has completed successfully, so it prints a message
to tell the user and then returns from the runJob() method.
● The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients
wishing to receive callbacks, via the job.end.notification.url property.
● Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same.

In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using
Hadoop is its ability to handle such failures and allow your job to complete.


Hadoop YARN Technology:

● YARN stands for Yet Another Resource Negotiator. It is an open-source cluster management technology for
distributed processing frameworks.
● The main objective of YARN is to build a framework on Hadoop that allows cluster resources to be allocated to
specific applications, with MapReduce being one of these applications.
● It splits the responsibilities of the jobtracker into separate entities.
● The jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring
(keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping such as maintaining counter
totals).
● YARN divides these two roles into two independent daemons: a resource manager, which manages the use of
resources across the cluster, and an application master, which manages the lifecycle of applications running on
the cluster.
● The application master negotiates with the resource manager for cluster resources, expressed as a number of
containers each with a certain memory limit, and then runs application-specific processes in these containers.
● The containers are overseen by node managers running on the cluster nodes, which ensure that the application
does not use more resources than it has been allocated.
● It is a very efficient technology for managing the Hadoop cluster.
● YARN is part of Hadoop 2, developed under the aegis of the Apache Software Foundation.
● YARN introduced a new way of processing data and is now at the center of the Hadoop architecture.
● With this technology it is possible to stream data in real time, use interactive SQL, process data with multiple
engines, and manage batch processing, all on a single platform.
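
The container model described above (a number of containers, each with a memory limit, enforced by node managers)
can be sketched as a toy simulation. This is not the YARN API: the NodeManager class, the allocate function and the
first-fit placement are assumptions for illustration, as is the choice to kill a container that exceeds its limit.

```python
class NodeManager:
    """Hypothetical node manager that tracks the containers on one node."""

    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.limits = {}  # container id -> memory limit in MB

    def free_mb(self):
        return self.total_mb - sum(self.limits.values())

    def launch(self, container_id, limit_mb):
        """Start a container if the node still has enough free memory."""
        if limit_mb > self.free_mb():
            return False
        self.limits[container_id] = limit_mb
        return True

    def enforce(self, container_id, used_mb):
        """Assumed policy: kill a container that uses more than its limit."""
        if used_mb > self.limits[container_id]:
            del self.limits[container_id]
            return "killed"
        return "running"


def allocate(node_managers, num_containers, limit_mb):
    """Place each requested container on the first node with room (first fit)."""
    placements = []
    for i in range(num_containers):
        container_id = f"container_{i}"
        for name, manager in node_managers.items():
            if manager.launch(container_id, limit_mb):
                placements.append((container_id, name))
                break
    return placements
```

Here allocate plays the part of the resource manager answering an application master's request; a real scheduler
also takes queues, user limits and locality into account.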

MapReduce on YARN:

MapReduce on YARN involves more entities than classic MapReduce. They are:

● Client: submits the MapReduce job.
● YARN resource manager: manages the allocation of compute resources on the cluster.
● YARN node managers: launch and monitor the compute containers on machines in the cluster.
● MapReduce application master: coordinates the tasks running the MapReduce job. The application master and the
MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
● Distributed filesystem (normally HDFS): used for sharing job files between the other entities.

How does the YARN technology work?

● YARN lets Hadoop provide enterprise-level solutions, helping organizations achieve better resource management.
It is the main platform for delivering consistent operations, a high level of security and data governance across
the complete spectrum of the Hadoop cluster.
● Various other technologies that reside within the data center can also benefit from YARN.
● It makes it possible to process data and to have linear-scale storage in a very cost-effective way.
● Using YARN helps to come up with applications that can access data and run in a Hadoop ecosystem on a consistent

Some of the features of YARN

● High degree of compatibility: applications created using the MapReduce framework can easily run on YARN.
● Better cluster utilization: YARN allocates all the cluster resources in an efficient and dynamic manner, which leads
to much better utilization compared to the previous version of Hadoop.
● Utmost scalability: as the number of nodes in the Hadoop cluster expands, the YARN Resource Manager ensures
that user requirements are met and that the processing power of the data center keeps up.
● Multi-tenancy: various engines that access data on the Hadoop cluster can work together efficiently, thanks to
YARN being a highly versatile technology.

Key components of YARN:

YARN came into existence because there was an urgent need to separate the two distinct tasks that go on in a Hadoop
ecosystem, which are handled by the TaskTracker and JobTracker entities. The key components of the YARN technology
are listed below.

● Global Resource Manager
● Application Master per application
● Node Manager per slave node
● Container per application that runs on a Node Manager

● Thus the Node Manager and the Resource Manager form the basis on which the new distributed application works.

● The Resource Manager allocates the various resources to the applications in the system. The Application Master
works along with the Node Manager and is framework-specific; it obtains resources from the Resource Manager to
manage the various task components.

● A scheduler works within the Resource Manager (RM) framework to allocate resources correctly, ensuring that
user-limit and queue-capacity constraints are adhered to at all times.

● The scheduler provides the right resources according to the requirements of each application.

● The Application Master coordinates with the scheduler to obtain the right resource containers, keeps an eye on
their status, and tracks the progress of the process.

● The Node Manager manages the application containers and launches them when required. It tracks resource usage
such as memory, processor, network and disk utilization, and reports it in detail to the Resource Manager.

Failures in Classic MapReduce:

○ failure of task

○ failure of TaskTracker

○ failure of JobTracker

Task Failure :

Exception in map or reduce tasks: the TaskTracker marks the task as failed.

● The JVM suddenly exits: the TaskTracker marks the task as failed.

● Hanging tasks stop sending progress updates: the TaskTracker kills the JVM and the task is marked as failed.

Task Failure:

The JobTracker reschedules the failed task on a different TaskTracker if possible.

● If the task has failed 4 times, it is not retried again.

● If the task fails 4 times, the whole job fails.
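
The retry rule can be sketched as a small loop. This is a simplified model rather than JobTracker code: attempt_task
is a hypothetical callback that returns True on success, and the preference for trackers that have not yet been
tried stands in for "a different TaskTracker if possible".

```python
def run_with_retries(attempt_task, trackers, max_attempts=4):
    """Try a task on up to max_attempts trackers; fail the job if all attempts fail."""
    tried = []
    for _ in range(max_attempts):
        # Prefer a tracker the task has not failed on yet, if one is available.
        untried = [t for t in trackers if t not in tried]
        tracker = (untried or trackers)[0]
        tried.append(tracker)
        if attempt_task(tracker):
            return "succeeded", tried
    return "job failed", tried
```

For example, with trackers ["tt1", "tt2"] and a task that only fails on tt1, the second attempt runs on tt2 and the
result is ("succeeded", ["tt1", "tt2"]).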

TaskTracker Failure:

If the TaskTracker crashes or runs very slowly, the JobTracker notices this from missing heartbeats.

● Successful map tasks are rescheduled on a different TaskTracker if they belong to an incomplete job.

● All tasks in progress are also rescheduled.

TaskTracker Failure:

A TaskTracker may be blacklisted by the JobTracker.

● If 4 or more tasks from the same job have failed on a particular TaskTracker, the JobTracker records this as a fault.

● When the minimum threshold of faults is exceeded, the TaskTracker is blacklisted.

● Faults expire over time (one per day), so TaskTrackers get a chance to run jobs again.
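
The blacklisting bookkeeping can be sketched as follows. The class and method names are hypothetical, the fault
threshold is left as a parameter because its value is not given here, and "exceeded" is interpreted as strictly
greater than the threshold.

```python
class TaskTrackerRecord:
    """Hypothetical per-tasktracker fault bookkeeping kept by the JobTracker."""

    def __init__(self, fault_threshold):
        self.fault_threshold = fault_threshold
        self.faults = 0

    def report_job_failures(self, failed_tasks_from_job):
        """Record a fault when 4 or more tasks of one job failed on this tracker."""
        if failed_tasks_from_job >= 4:
            self.faults += 1

    def expire_fault(self):
        """Called once per day: one fault expires."""
        self.faults = max(0, self.faults - 1)

    def is_blacklisted(self):
        return self.faults > self.fault_threshold
```

With a threshold of 2, a tracker becomes blacklisted on its third fault and is usable again after one fault has
expired.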

JobTracker Failure:

Single Point of Failure in Classic MapReduce

● All running jobs fail

● After a restart, all jobs must be resubmitted

Failures in YARN:

Task Failure, same as in Classic MapReduce

● AppMaster Failure

● ResourceManager Failure

Application Master Failure:

The Resource Manager notices the failed AppMaster.

● The Resource Manager starts a new instance of the AppMaster in a new container.

● The client experiences a timeout and gets the new address of the AppMaster from the ResourceManager.

Resource Manager Failure:

The Resource Manager has a checkpointing mechanism that saves its state to persistent storage.

● After a crash, the administrator brings up a new Resource Manager, and it recovers the saved state.

Speculative Execution:

If Hadoop detects that a task is running slower than normal, it launches another, equivalent backup task. Whichever
copy completes first wins, and the other one is killed immediately. Speculative execution is an optimization, not a
feature to make jobs run more reliably.
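
The idea can be sketched in two small functions. How Hadoop actually decides that a task is "slower than normal" is
not described here, so the rule below (progress more than a fixed lag behind the average) is purely an assumption for
illustration, as are the function names.

```python
def pick_speculation_candidates(progress_by_task, lag=0.2):
    """Assumed rule: a task is slow if its progress trails the average by more than lag."""
    average = sum(progress_by_task.values()) / len(progress_by_task)
    return sorted(t for t, p in progress_by_task.items() if p < average - lag)


def resolve(finish_times):
    """Whichever attempt finishes first wins; every other attempt is killed."""
    winner = min(finish_times, key=finish_times.get)
    killed = sorted(a for a in finish_times if a != winner)
    return winner, killed
```

For example, resolve({"original": 40.0, "backup": 25.0}) returns ("backup", ["original"]): the backup finished
first, so the original attempt is killed. Either way the task's output is the same, which is why this is only an
optimization.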