


Seminar Topic:
Hadoop and Big Data

Submitted by: Bavani Sankar A B, CS-1, 4th Semester
Submitted to: Asst. Prof. Deepa Chaurse




Contents:

Sampling Big Data
Benefits of using Hadoop
Hadoop Components and Hadoop History
Hadoop Usage Scenarios
Challenges of using Hadoop




Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate. Challenges include analysis, capture, data curation, search, sharing,
storage, transfer, visualization, querying, updating and information privacy. The term often refers
simply to the use of predictive analytics or certain other advanced methods to extract value from data,
and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision
making, and better decisions can result in greater operational efficiency, cost reduction and reduced
risk.

Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat
crime and so on". Scientists, business executives, practitioners of medicine,
advertising and governments alike regularly meet difficulties with large data sets in areas including
Internet search, finance and business informatics. Scientists encounter limitations in e-Science work,
including meteorology, genomics, connectomics, complex physics simulations, biology and
environmental research.

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to
capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a
constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big
data requires a set of techniques and technologies with new forms of integration to reveal insights from
datasets that are diverse, complex, and of a massive scale.

In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined
data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of
data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and
now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner
updated its definition as follows: "Big data is high volume, high velocity, and/or high variety
information assets that require new forms of processing to enable enhanced decision making, insight
discovery and process optimization." Gartner's definition of the 3Vs is still widely used, and in
agreement with a consensual definition that states that "Big Data represents the Information assets
characterized by such a High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value". Additionally, some organizations add a fourth
V, "Veracity", to the model, a revision challenged by some industry authorities.
The 3Vs have been expanded to other complementary characteristics of big data:

Volume: big data doesn't sample; it just observes and tracks what happens
Velocity: big data is often available in real-time
Variety: big data draws from text, images, audio, video; plus it completes missing pieces
through data fusion
Machine Learning: big data often doesn't ask why and simply detects patterns
Digital footprint: big data is often a cost-free byproduct of digital interaction

The growing maturity of the concept more starkly delineates the difference between big data and
Business Intelligence:

Business Intelligence uses descriptive statistics with data with high information density to
measure things, detect trends, etc.
Big data uses inductive statistics and concepts from nonlinear system identification to infer laws
(regressions, nonlinear relationships, and causal effects) from large sets of data with low
information density to reveal relationships and dependencies, or to perform predictions of
outcomes and behaviors.
In a popular tutorial article published in IEEE Access Journal, the authors classified existing definitions
of big data into three categories: Attribute Definition, Comparative Definition and Architectural
Definition. The authors also presented a big-data technology map illustrating its key technologies.


Big data has increased the demand for information management specialists to the point that Software
AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion
on software firms specializing in data management and analytics. In 2010, this industry was worth
more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the
software business as a whole.

The use and adoption of big data within governmental processes is beneficial and allows efficiencies
in terms of cost, productivity, and innovation. Data analysis often requires multiple parts of
government (central and local) to work in collaboration and create new and innovative processes to
deliver the desired outcome.
Big data analysis was in part responsible for the BJP's victory in the Indian General Election 2014. The

Indian government utilises numerous techniques to ascertain how the Indian electorate is responding to
government action, as well as ideas for policy augmentation.
International development
Research on the effective usage of information and communication technologies for development (also
known as ICT4D) suggests that big data technology can make important contributions but also present
unique challenges to International development. Advancements in big data analysis offer cost-effective
opportunities to improve decision-making in critical development areas such as health care,
employment, economic productivity, crime, security, and natural disaster and resource management.

According to the TCS 2013 Global Trend Study, improvements in supply planning and product quality
provide the greatest benefit of big data for manufacturing. Big data provides an infrastructure for
transparency in the manufacturing industry: the ability to unravel uncertainties such as inconsistent
component performance and availability. Predictive manufacturing, an applicable approach toward
near-zero downtime and transparency, requires vast amounts of data and advanced prediction tools to
systematically process data into useful information.

Cyber-physical models
Current PHM implementations mostly use data gathered during actual usage, while analytical
algorithms can perform more accurately when more information from throughout the machine's
lifecycle, such as system configuration, physical knowledge and working principles, is included. There
is a need to systematically integrate, manage and analyze machinery or process data during different
stages of the machine life cycle in order to handle data more efficiently and achieve better
transparency of machine health condition for the manufacturing industry.

Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive
analytics, clinical risk intervention and predictive analytics, waste and care variability reduction,
automated external and internal reporting of patient data, standardized medical terms and patient
registries and fragmented point solutions.

Internet of Things (IoT)
Big Data and the IoT work in conjunction. From a media perspective, data is the key derivative of
device inter-connectivity and allows accurate targeting. The Internet of Things, with the help of big
data, therefore transforms the media industry, companies and even governments, opening up a new era
of economic growth and competitiveness. The intersection of people, data and intelligent algorithms
has far-reaching impacts on media efficiency. The wealth of data generated allows an elaborate layer
on the present targeting mechanisms of the industry.
The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million
times per second. There are nearly 600 million collisions per second. After filtering and refraining from
recording more than 99.99995% of these streams, there are 100 collisions of interest per second.

As a result, only working with less than 0.001% of the sensor stream data, the data flow from all
four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This
becomes nearly 200 petabytes after replication.

If all sensor data were recorded in LHC, the data flow would be extremely hard to work with.
The data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day,
before replication. To put the number in perspective, this is equivalent to 500 quintillion
(5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world.

The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be
operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one
petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.
Science and research

When the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it
amassed more in its first few weeks than all data collected in the history of astronomy
previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140
terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS,
comes online in 2020, its designers expect it to acquire that amount of data every five days.
Decoding the human genome originally took 10 years; now it can be achieved in less than a day.
DNA sequencers have cut the cost of sequencing by a factor of 10,000 in the last ten years, a
reduction 100 times greater than that predicted by Moore's law.
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations
and simulations on the Discover supercomputing cluster.
Google's DNAStack compiles and organizes DNA samples of genetic data from around the
world to identify diseases and other medical defects. These fast and exact calculations eliminate
any 'friction points,' or human errors that could be made by one of the numerous science and
biology experts working with the DNA. DNAStack, a part of Google Genomics, allows
scientists to use the vast sample of resources from Google's search server to scale social
experiments that would usually take years, instantly.

Research activities
Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American
Society of Engineering Education. Gautam Siwach of the MIT Computer Science and Artificial
Intelligence Laboratory and Dr. Amir Esmailpour of the UNH Research Group investigated the key
features of big data, such as the formation of clusters and their interconnections. They focused on the
security of big data and the orientation of the term towards the presence of different types of data in
encrypted form at the cloud interface, providing raw definitions and real-time examples within the
technology. Moreover, they proposed an approach for identifying the encoding technique in order to
advance towards an expedited search over encrypted text, leading to security enhancements in big
data.

In March 2012, The White House announced a national "Big Data Initiative" that consisted of six
Federal departments and agencies committing more than $200 million to big data research projects.
The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million
over 5 years to the AMPLab at the University of California, Berkeley. The AMPLab also received
funds from DARPA, and over a dozen industrial sponsors and uses big data to attack a wide range of
problems from predicting traffic congestion to fighting cancer.

Sampling Big Data

An important research question that can be asked about big data sets is whether you need to look at
the full data to draw certain conclusions about the properties of the data, or whether a sample is good
enough. The name big data itself contains a term related to size, and this is an important characteristic
of big data. But sampling (in the statistical sense) enables the selection of the right data points from
within the larger data set to estimate the characteristics of the whole population. For example, there
are about 600 million tweets produced every day. Is it necessary to look at all of them to determine the
topics that are discussed during the day? Is it necessary to look at all the tweets to determine the
sentiment on each of the topics? In manufacturing, different types of sensory data such as acoustics,
vibration, pressure, current, voltage and controller data are available at short time intervals. To predict
downtime it may not be necessary to look at all the data; a sample may be sufficient.
There has been some work done on sampling algorithms for big data. A theoretical formulation for
sampling Twitter data has also been developed.
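One classic technique for drawing such a sample from a stream is reservoir sampling, sketched below in Python. The tweet stream and sample size here are illustrative stand-ins, not a real Twitter feed:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps every item equally likely to end up in the sample.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: draw a fixed-size sample from a large stream in a single pass.
stream = (f"tweet-{i}" for i in range(1_000_000))
sample = reservoir_sample(stream, 1000, rng=random.Random(42))
print(len(sample))  # 1000
```

The appeal for big data is that the memory cost depends only on the sample size k, never on the size of the stream.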

What makes Big Data useful?

Big Data is an important and exciting field, and it sounds compelling on a conceptual level, but how is
it implemented in terms of programs and utilities so that it can be analysed by scientists and
programmers alike? And how does it help analyse exascale amounts of data quickly?
This is where Hadoop comes into the picture.

Hadoop is an open-source software framework for storing data and running applications on clusters of
commodity hardware. It provides massive storage for any kind of data, enormous processing power
and the ability to handle virtually limitless concurrent tasks or jobs.

Open-source software. Open-source software is created and maintained by a network of developers

from around the globe. It's free to download, use and contribute to, though more and more commercial
versions of Hadoop are becoming available.

Framework. In this case, it means that everything you need to develop and run software applications is
provided: programs, connections, etc.

Massive storage. The Hadoop framework breaks big data into blocks, which are stored on clusters of
commodity hardware.
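The block-and-replica scheme can be sketched in miniature. The toy model below is illustrative only (the block size, node names and round-robin placement are made up; real HDFS defaults to 128 MB blocks and 3 replicas, with rack-aware placement):

```python
import itertools

def split_into_blocks(data: bytes, block_size: int):
    """Split data into fixed-size blocks, as HDFS does with large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3):
    """Assign each block to `replication` distinct nodes (toy round-robin placement)."""
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for b in range(num_blocks):
        start = next(ring)
        placement[b] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 1000, block_size=128)
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks))   # 8 blocks: ceil(1000 / 128)
print(layout[0])     # ['node1', 'node2', 'node3']
```

Storing each block on several nodes is also what makes the fault tolerance described later possible: losing one node loses no data.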

Processing power. Hadoop concurrently processes large amounts of data using multiple low-cost
computers for fast results.

Benefits of using Hadoop

One of the top reasons that organizations turn to Hadoop is its ability to store and process huge
amounts of data, of any kind, quickly. With data volumes and varieties constantly increasing,
especially from social media and the Internet of Things, that's a key consideration. Other benefits
include:

Computing power. Its distributed computing model quickly processes big data. The more
computing nodes you use, the more processing power you have.
Flexibility. Unlike traditional relational databases, you don't have to preprocess data before
storing it. You can store as much data as you want and decide how to use it later. That includes
unstructured data like text, images and videos.
Fault tolerance. Data and application processing are protected against hardware failure. If a
node goes down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. And it automatically stores multiple copies of all data.
Low cost. The open-source framework is free and uses commodity hardware to store large

quantities of data.
Scalability. You can easily grow your system simply by adding more nodes. Little
administration is required.

Hadoop Components
The basic/core components of Hadoop are:

Hadoop Common: the libraries and utilities used by other Hadoop modules.
Hadoop Distributed File System (HDFS): the Java-based scalable system that stores data
across multiple machines without prior organization.
MapReduce: a software programming model for processing large sets of data in parallel.
YARN: a resource management framework for scheduling and handling resource requests from
distributed applications. (YARN is an acronym for Yet Another Resource Negotiator.)
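The MapReduce model can be illustrated with the canonical word-count example, simulated here in plain Python rather than on a real cluster: a map phase emits (word, 1) pairs, a shuffle groups the pairs by key, and a reduce phase sums each group independently, which is what makes the computation parallelizable:

```python
from collections import defaultdict

def map_phase(document: str):
    # Map: emit a (key, value) pair for every word in the document.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently of every other key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big tools", "hadoop stores big data"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

On a real cluster, the map and reduce functions run on many nodes at once and the shuffle moves data between them; the logic, however, is exactly this.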

Hadoop History
It all started with the World Wide Web. As the web grew in the late 1990s and early 2000s, search
engines and indexes were created to help locate relevant information amid the text-based content. In the
early years, search results really were returned by humans. But as the web grew from dozens to
millions of pages, automation was needed. Web crawlers were created, many as university-led research
projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch, the brainchild of Doug Cutting
and Mike Cafarella. They wanted to invent a way to return web search results faster by distributing data
and calculations across different computers so multiple tasks could be accomplished simultaneously.
During this time, another search engine project called Google was in progress. It was based on the
same concept: storing and processing data in a distributed, automated way so that relevant web search
results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google's
early work with automating distributed data storage and processing. The Nutch project was divided.
The web crawler portion remained as Nutch. The distributed computing and processing portion
became Hadoop (named after Cutting's son's toy elephant). In 2008, Yahoo released Hadoop as an
open-source project. Today, Hadoop's framework and ecosystem of technologies are managed and
maintained by the non-profit Apache Software Foundation (ASF), a global community of software
developers and contributors.

Hadoop Usage Scenarios
Going beyond its original goal of searching millions (or billions) of web pages and returning relevant
results, many organizations are looking to Hadoop as their next big data platform. Popular uses today
include:
1. Low-cost storage and active data archive. The modest cost of commodity hardware makes
Hadoop useful for storing and combining data such as transactional, social media, sensor,
machine, scientific, click streams, etc. The low-cost storage lets you keep information that is not
deemed currently critical but that you might want to analyze later.
2. Staging area for a data warehouse and analytics store. One of the most prevalent uses is to
stage large amounts of raw data for loading into an enterprise data warehouse (EDW) or an
analytical store for activities such as advanced analytics, query and reporting, etc. Organizations
are looking at Hadoop to handle new types of data (e.g., unstructured), as well as to offload
some historical data from their enterprise data warehouses.
3. Data lake. Hadoop is often used to store large amounts of data without the constraints
introduced by schemas commonly found in the SQL-based world. It is used as a low-cost
compute-cycle platform that supports processing ETL and data quality jobs in parallel using
hand-coded or commercial data management technologies. Refined results can then be passed to
other systems (e.g., EDWs, analytic marts) as needed.
4. Sandbox for discovery and analysis. Because Hadoop was designed to deal with volumes of
data in a variety of shapes and forms, it can run analytical algorithms. Big data analytics on
Hadoop can help any organization operate more efficiently, uncover new opportunities and
derive next-level competitive advantage. The sandbox approach provides an opportunity to
innovate with minimal investment.
5. Recommendation systems. One of the most popular analytical uses by some of Hadoop's
largest adopters is for web-based recommendation systems. Facebook: people you may know.
LinkedIn: jobs you may be interested in. Netflix, eBay, Hulu: items you may be interested in.
These systems analyze huge amounts of data in real time to quickly predict preferences before
customers leave the web page.
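A "people you may know" style recommender can be sketched as friend-of-friend counting: rank non-friends by how many mutual friends they share with a user. The tiny social graph below is made up for illustration; production systems run this kind of computation over billions of edges in parallel:

```python
from collections import Counter

def suggest_friends(graph: dict, user: str, top_n: int = 3):
    """Rank non-friends of `user` by their number of mutual friends with `user`."""
    friends = graph[user]
    mutual = Counter()
    for friend in friends:
        for fof in graph[friend]:
            # Skip the user themselves and people they already know.
            if fof != user and fof not in friends:
                mutual[fof] += 1  # one more mutual friend in common
    return [name for name, _ in mutual.most_common(top_n)]

# Toy social graph as adjacency sets (hypothetical users).
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}
print(suggest_friends(graph, "alice"))
```

Counting mutual friends is exactly the kind of grouping-and-aggregation job that MapReduce parallelizes well, which is why recommendation systems were among Hadoop's earliest large-scale uses.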

Using Hadoop / Loading data into Hadoop

Data can be loaded onto a Hadoop system using any of the following simple ways:

Load files to the system using simple Java commands. HDFS takes care of making multiple
copies of data blocks and distributing them across multiple nodes.
If you have a large number of files, a shell script that runs multiple put commands in parallel
will speed up the process. You don't have to write MapReduce code.
Create a cron job to scan a directory for new files and put them in HDFS as they show up.
This is useful for things like downloading email at regular intervals.
Mount HDFS as a file system and copy or write files there.

Use Sqoop to import structured data from a relational database to HDFS, Hive and HBase. It
can also extract data from Hadoop and export it to relational databases and data warehouses.
Use Flume to continuously load data from logs into Hadoop.
Use third-party vendor connectors (like SAS/ACCESS or SAS Data Loader for Hadoop).
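The parallel-put approach above can be sketched in Python: scan a local directory, build one `hdfs dfs -put` command per file, and run the uploads in a thread pool. The directory paths are hypothetical, and the `dry_run` flag (which just returns the planned commands) is an illustrative addition so the sketch can be exercised without a cluster:

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def load_directory(local_dir: str, hdfs_dir: str, workers: int = 4, dry_run: bool = False):
    """Upload every file in local_dir to hdfs_dir via parallel `hdfs dfs -put` calls."""
    commands = [
        ["hdfs", "dfs", "-put", str(path), hdfs_dir]
        for path in sorted(Path(local_dir).iterdir())
        if path.is_file()
    ]
    if dry_run:
        return commands  # just report what would run
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Run the uploads concurrently; parallelism hides per-file latency.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return commands

# Dry-run demo on a throwaway local directory (hypothetical HDFS destination):
demo = Path(tempfile.mkdtemp())
for name in ("a.log", "b.log"):
    (demo / name).write_text("sample")
planned = load_directory(str(demo), "/user/etl/incoming", dry_run=True)
print(planned[0])
```

On a real cluster you would drop `dry_run` and point `hdfs_dir` at an actual HDFS path; HDFS itself handles block replication once the files arrive.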

Challenges of using Hadoop

Although Hadoop is a thriving platform for Big Data, these challenges are often faced by users of this
framework:
1. MapReduce programming is not a good match for all problems. It's good for simple
information requests and problems that can be divided into independent units, but it's not
efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the
nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require
multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between
MapReduce phases and is inefficient for advanced analytic computing.
2. There's a widely acknowledged talent gap. It can be difficult to find entry-level programmers
who have sufficient Java skills to be productive with MapReduce. That's one reason distribution
providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to
find programmers with SQL skills than MapReduce skills. And, Hadoop administration seems
part art and part science, requiring low-level knowledge of operating systems, hardware and
Hadoop kernel settings.
3. Data security. Another challenge centers around the fragmented data security issues, though
new tools and technologies are surfacing. The Kerberos authentication protocol is a great step
forward for making Hadoop environments secure.
4. Full-fledged data management and governance. Hadoop does not have easy-to-use, full-feature
tools for data management, data cleansing, governance and metadata. Especially lacking are
tools for data quality and standardization.

Although this platform has challenges of its own, it's safe to assume that they can be overcome,
allowing us as researchers and programmers to reap the benefits it has to offer.

