Tanya Chaturvedi MBA(ISM) 500026401

What is Hadoop?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

Large datasets: terabytes or petabytes of data

Large clusters: hundreds or thousands of nodes


Hadoop is written in the Java programming language and requires Java Runtime Environment (JRE) 1.6 or higher.

Innovation

The underlying technology was developed by Google in its early days so that it could usefully index all the textual and structural information it was collecting and then present meaningful results to users. Hadoop is based on a simple data model: any data will fit.

Hadoop Master/Slave Architecture


Hadoop is designed as a master/slave architecture:

One master node

Many slave nodes

Design Principles of Hadoop

Need to process big data.

Need to parallelize computation across thousands of nodes.

Commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem, rather than a small number of high-end, expensive machines.

Fault tolerance and automatic recovery: nodes and tasks will fail, and the system recovers from these failures automatically.

Users of Hadoop

Google: inventors of the MapReduce computing paradigm.

Yahoo: index calculation for the Yahoo search engine.

IBM, Microsoft, Oracle, Apple, HP, Twitter, Facebook, Amazon, AOL, Netflix.

Many other companies, universities, and research labs.

Main Reasons for Using Hadoop

Hadoop Architecture
The Hadoop framework consists of two main layers:

1. Distributed file system (HDFS)

2. Execution engine (MapReduce)

A small Hadoop cluster includes a single master node and multiple slave nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. The JobTracker runs on the master node and receives the users' jobs.
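To make the execution engine concrete, below is a minimal sketch of the classic word-count MapReduce job written against the Hadoop Java API. The class names are illustrative, and the input and output HDFS paths are taken from the command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: splits each input line into words and emits (word, 1) pairs.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums all counts emitted for the same word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Submitted with the hadoop jar command, the job is received by the JobTracker, the input blocks are processed in parallel by map tasks on the TaskTrackers, and the reduce tasks produce the final word counts.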

Hadoop Distributed File System

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. HDFS keeps multiple copies of data in different locations. The goal of HDFS is to reduce the impact of failures such as a power or switch failure, so that even if these occur, the data remains available.

Properties of HDFS

Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data.

Replication: each data block is replicated many times (the default is 3).

Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

A short sketch of reading and writing a file through the HDFS client API follows this list.
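As an illustration of how a client talks to HDFS, the sketch below writes a small file, reads it back, and inspects its replication factor and block size through the org.apache.hadoop.fs.FileSystem API. The NameNode address and the file path are assumptions made up for the example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // NameNode address (assumed)

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");           // illustrative path

        // Write: the NameNode records the metadata, DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeBytes("Hello HDFS\n");
        }

        // Read the file back and print its contents.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
          System.out.println(in.readLine());
        }

        // Inspect the replication factor and block size recorded for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());
        System.out.println("block size  = " + status.getBlockSize());
      }
    }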

Advantages of Using Hadoop

Hadoop is a framework that provides both distributed storage and computational capabilities.

It is extremely scalable.

HDFS uses a large block size, which works best when manipulating large data sets.

HDFS maintains multiple replicas of each file, making it fault tolerant.

Hadoop uses the MapReduce framework, a batch-based, distributed computing framework.

The sketch below shows how the block size and replication factor can be tuned from the Java API.
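The large block size and replication factor mentioned above are ordinary HDFS settings (dfs.blocksize and dfs.replication) that can be tuned per cluster or even per file; the values and the file path in this minimal sketch are examples only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize and dfs.replication are standard HDFS settings;
        // the values here are only examples.
        conf.set("dfs.blocksize", "134217728"); // 128 MB blocks suit large sequential scans
        conf.set("dfs.replication", "3");       // keep three copies of every new block

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of one existing file (path is illustrative).
        fs.setReplication(new Path("/user/demo/important.dat"), (short) 5);
        fs.close();
      }
    }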

Limitations of Hadoop

Security concerns.

Inefficient for handling small files.

Does not offer storage-level or network-level encryption.

Single-master model, which can result in a single point of failure.

Hadoop vs. Other Systems
