What is Hadoop?
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Several Apache projects in the Hadoop ecosystem extend the core framework:
• Apache Flume: a tool used to collect, aggregate and move huge amounts of streaming
data into HDFS;
• Apache Hive: an SQL-on-Hadoop tool that provides data summarization, query and
analysis;
• Apache Pig: a high-level platform for creating programs that run on Hadoop clusters;
• Apache Sqoop: a tool to help transfer bulk data between Hadoop and structured data
stores, such as relational databases.
5 Advantages of Hadoop
1. Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute
very large data sets across hundreds of inexpensive servers that operate in parallel.
Unlike traditional relational database management systems (RDBMS), which struggle to
scale to process large amounts of data, Hadoop enables businesses to run applications
on thousands of nodes involving thousands of terabytes of data.
2. Cost effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data
sets. The problem with traditional relational database management systems is that
scaling them to process such massive volumes of data is extremely cost-prohibitive.
In an effort to reduce costs, many companies in the past
would have had to down-sample data and classify it based on certain assumptions
as to which data was the most valuable. The raw data would be deleted, as it would
be too cost-prohibitive to keep. While this approach may have worked in the short
term, this meant that when business priorities changed, the complete raw data set
was not available, as it was too expensive to store. Hadoop, on the other hand, is
designed as a scale-out architecture that can affordably store all of a company's data
for later use. The cost savings are staggering: instead of costing thousands to tens
of thousands of pounds per terabyte, Hadoop offers computing and storage
capabilities for hundreds of pounds per terabyte.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different
types of data (both structured and unstructured) to generate value from that data.
This means businesses can use Hadoop to derive valuable business insights from
data sources such as social media, email conversations or clickstream data. In
addition, Hadoop can be used for a wide variety of purposes, such as log
processing, recommendation systems, data warehousing, market campaign analysis
and fraud detection.
4. Fast
Hadoop's unique storage method is based on a distributed file system that basically
'maps' data wherever it is located on a cluster. The tools for data processing are
often on the same servers where the data is located, resulting in much faster data
processing. If you're dealing with large volumes of unstructured data, Hadoop is able
to efficiently process terabytes of data in just minutes, and petabytes in hours.
5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an
individual node, that data is also replicated to other nodes in the cluster, which
means that in the event of failure, there is another copy available for use.
BDT Mini Brahmastra – Unit 5
The MapR distribution goes further by eliminating the NameNode and replacing
it with a distributed "no NameNode" architecture that provides true high availability.
This architecture protects against both single and multiple failures.
When it comes to handling large data sets in a safe and cost-effective manner,
Hadoop has the advantage over relational database management systems, and its
value for any size business will continue to increase as unstructured data continues
to grow.
Hadoop Architecture
At its core, Hadoop has two major layers, namely −
• the processing/computation layer (MapReduce); and
• the storage layer (the Hadoop Distributed File System, HDFS).
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at
Google for efficient processing of large amounts of data (multi-terabyte data-sets) on large
clusters of commodity hardware in a reliable, fault-tolerant manner.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
• Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into uniformly
sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, being on top of the local file system, supervises the processing.
• Hadoop also performs the sort that takes place between the map and reduce stages.
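The block-splitting step above can be sketched in a few lines. This is a purely illustrative simulation, not Hadoop's actual code; the function name and the (offset, length) representation are assumptions for the example.

```python
# Illustrative sketch: splitting a file into fixed-size blocks before the
# blocks are distributed across cluster nodes. Uses the 128 MB preferred
# block size mentioned above.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        # The final block may be smaller than the configured block size.
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file yields two full 128 MB blocks plus one 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block occupies only as much space as it needs, which is why files much smaller than the block size do not waste a full block of storage.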
HDFS
The Hadoop Distributed File System (HDFS) was developed using distributed file system
design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly
fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data,
the files are stored across multiple machines. These files are stored in a redundant fashion
to protect the system from possible data loss in case of failure. HDFS also makes
applications available for parallel processing.
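The redundant-storage idea can be sketched as a simple placement routine. This is an illustrative toy, not HDFS's real placement policy (which also considers racks); the round-robin choice and node names are assumptions for the example.

```python
# Toy sketch of replica placement: each block is copied to several distinct
# nodes, so losing any single node never loses the data.

def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes for one block, round-robin by id."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the replication factor")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(block_id=5, nodes=nodes)
# With 3 replicas, a single node failure still leaves 2 live copies.
```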
Features of HDFS
• The built-in servers of the namenode and datanodes help users to easily check the
status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows a master-slave architecture with the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is software that can be run on commodity hardware. The
system having the namenode acts as the master server, and it performs the following tasks −
• It manages the file system namespace.
• It regulates clients' access to files.
• It also executes file system operations such as renaming, closing, and opening files
and directories.
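The namespace operations above are metadata operations: renaming a file does not move any blocks. A toy sketch (the class and method names are hypothetical, for illustration only):

```python
# Toy sketch of NameNode-style namespace metadata. The real NameNode keeps
# the file-to-block mapping in memory; a rename only rewrites metadata,
# while the blocks on the datanodes stay exactly where they are.

class ToyNameNode:
    def __init__(self):
        self.namespace = {}  # file path -> list of block ids

    def create(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def rename(self, old_path, new_path):
        # Metadata-only operation: no data blocks are touched.
        self.namespace[new_path] = self.namespace.pop(old_path)

nn = ToyNameNode()
nn.create("/logs/app.log", [1, 2, 3])
nn.rename("/logs/app.log", "/logs/app-2024.log")
```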
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and
datanode software. For every node (Commodity hardware/System) in a cluster, there will be
a datanode. These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments, which are stored in individual data nodes. These file segments
are called blocks. In other words, the minimum amount of data that HDFS can read or write
is called a block. The default block size is 128 MB in Hadoop 2.x (64 MB in Hadoop 1.x),
and it can be changed as needed in the HDFS configuration.
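A quick worked example of what the block size means in practice. The figures are illustrative; the point is that larger blocks mean fewer blocks, and therefore less metadata for the namenode to track per file.

```python
import math

# How block size affects the number of blocks a file occupies:
# a 1 GB file needs 16 blocks at 64 MB, but only 8 at 128 MB.

def block_count(file_size_mb, block_size_mb):
    return math.ceil(file_size_mb / block_size_mb)

blocks_64 = block_count(1024, 64)    # 16 blocks
blocks_128 = block_count(1024, 128)  # 8 blocks
```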
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity
hardware components, component failure is frequent. HDFS should therefore have
mechanisms for quick, automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes
place near the data. Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.
______________________________________
Map Reduce
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map and
Reduce. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The Reduce task takes the output
from a map as its input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map task.
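The map and reduce tasks just described can be simulated in a single process with the classic word-count example. Real Hadoop runs many mappers and reducers in parallel across nodes; this sketch only mirrors the data flow.

```python
from collections import defaultdict

def mapper(line):
    # Map: emit a (key, value) tuple for every word in the line.
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    # Reduce: combine all values for one key into a single smaller result.
    return (key, sum(values))

lines = ["big data tools", "big data big clusters"]

# Map phase: run the mapper over every input line.
pairs = [kv for line in lines for kv in mapper(line)]

# Group by key (this grouping is what the shuffle step does in Hadoop).
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Reduce phase: one reducer call per distinct key.
counts = dict(reducer(k, v) for k, v in grouped.items())
# counts == {"big": 3, "data": 2, "tools": 1, "clusters": 1}
```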
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where
the data resides.
• A MapReduce program executes in three stages, namely the map stage, the shuffle
stage, and the reduce stage.
o Map stage − The map or mapper’s job is to process the input data. Generally
the input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small chunks of data.
o Shuffle stage − The framework sorts the mappers’ intermediate output and
transfers it to the reducers, grouping together all values that share a key.
o Reduce stage − The reducer processes the data that comes from the mappers
and produces a new, smaller set of output, which is stored in HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form
an appropriate result, and sends it back to the Hadoop server.
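The three stages listed above can be traced on a tiny dataset. This sketch is illustrative only (the year/temperature data is made up); each input line holds "year temperature", and the job finds the maximum temperature per year.

```python
from itertools import groupby

lines = ["1950 32", "1950 41", "1951 28", "1951 39", "1950 22"]

# Map stage: parse each input line into an intermediate (year, temp) pair.
intermediate = []
for line in lines:
    year, temp = line.split()
    intermediate.append((year, int(temp)))

# Shuffle stage: sort by key so all values for one year sit adjacently,
# mirroring the sort between the map and reduce stages.
intermediate.sort(key=lambda kv: kv[0])

# Reduce stage: collapse each key's values into a single result.
max_temp = {year: max(t for _, t in group)
            for year, group in groupby(intermediate, key=lambda kv: kv[0])}
# max_temp == {"1950": 41, "1951": 39}
```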
Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and form the
core of the job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value
pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing takes
place.
• MasterNode − Node where JobTracker runs and which accepts job requests from
clients.
• JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTracker.