
Introduction to Oracle Big Data

Electronic Presentation
D75058GC20 | Edition 2.0 | November 2016

Learn more from Oracle University at education.oracle.com


Copyright 2016, Oracle and/or its affiliates. All rights reserved.

Disclaimer

This document contains proprietary information and is protected by copyright and other intellectual property laws. You may copy and print this document solely for your own use in an Oracle training course. The document may not be modified or altered in any way. Except where your use constitutes "fair use" under copyright law, you may not use, share, download, upload, copy, print, display, perform, reproduce, publish, license, post, transmit, or distribute this document in whole or in part without the express authorization of Oracle.

The information contained in this document is subject to change without notice. If you find any problems in the document, please report them in writing to: Oracle University, 500 Oracle Parkway, Redwood Shores, California 94065 USA. This document is not warranted to be error-free.

Restricted Rights Notice

If this documentation is delivered to the United States Government or anyone using the documentation on behalf of the United States Government, the following notice is applicable:

U.S. GOVERNMENT RIGHTS
The U.S. Government's rights to use, modify, reproduce, release, perform, display, or disclose these training materials are restricted by the terms of the applicable Oracle license agreement and/or the applicable U.S. Government contract.

Trademark Notice

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Author
Lauran K. Serhal

Technical Contributors and Reviewers
Martin Gubar
Bill Beauregard
Melliyal Annamalai
Jean Ihm
Bob DiMeo
Jean-Pierre Dijcks
Ben Gelernter
Bob Stanoch
Frederick Kush
Nancy Greenberg
Brian Pottle

Editors
Kavita Saini
Chandrika Kennedy

Graphic Editors
Prakash Dharmalingam
Kavya Bellur

Publishers
Veena Narasimhan
Michael Sebastian Almeida
Raghunath M
1

Introduction



Agenda

Course objectives, course road map, and lesson objectives


Introduction to the Oracle Big Data Lite VM
Introduction to the Oracle MoviePlex case study
Useful resources

Course Objectives

After completing this course, you should be able to:


Define the term big data and discuss Oracle's big data solution
Review the Hadoop ecosystem
Acquire data into the Hadoop Distributed File System (HDFS) and Oracle NoSQL
Database by using Flume and Sqoop
Examine and run MapReduce jobs
Use Oracle Big Data Connectors
Introduce the Oracle Big Data Appliance (BDA)

Course Road Map

Lesson 1: Introduction

Lesson 2: Big Data and Oracle Information Management System

Lesson 3: Data Acquisition and Storage

Lesson 4: Data Access and Processing

Lesson 5: Data Unification

Lesson 6: Data Discovery and Analysis

Lesson 7: Introduction to the Oracle BDA

Lesson 8: Introduction to Oracle Big Data Cloud Service (BDCS)

Questions About You

To ensure that the class can be customized to meet your specific needs and to encourage
interaction, answer the following questions:
Which organization do you work for?
What is your role in your organization?
If you are a DBA, what products have you worked on?
Have you used the Oracle BDA?
Have you used any Hadoop components?
Do you meet the course prerequisites?
What do you hope to learn from this course?

Lesson Objectives

After completing this lesson, you should be able to:


Discuss the course objectives and road map
Download, install, and run the Oracle Big Data Lite Virtual Machine (VM)
Review the Oracle MoviePlex case study
List the available appendixes
Identify the relevant documentation and other resources

Oracle Big Data Lite VM

Oracle Big Data Lite VM provides an integrated environment to help you get started with the Oracle Big Data platform.
Many Oracle big data platform components have been installed and configured, allowing you to begin using the system right away.
You can access the latest Big Data Lite landing page on OTN at:
http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
The landing page contains tutorials, videos, white papers, and more.

Oracle Big Data Lite VM Home Page Sections

Section: Contents
Introduction: Introduction to the Big Data Lite VM and the installed components
Download Oracle Big Data Lite VM: Contains the following:
  Deployment Guide document (important details)
  Links to download the required (13) 7-zip files
  Links to download the required Oracle VM VirtualBox plus its Extension Pack and 7-zip files
Previous Version: You can download the previous versions of the Big Data Lite VM and compare the available components with older versions.
Getting Started: View information (and demos) about the Oracle MoviePlex demo data that is used in several of the big data courses.
Hands-on Lab: Access to several hands-on labs
Web Sites/White Papers/EBook/Blogs: Lists some resources that help you to learn more about the Oracle big data platform.

Components of the Oracle Big Data Lite VM

Downloading, Installing, and Using the Oracle Big Data Lite VM

1. Review the important Deployment Guide document for detailed installation steps.
2. Download and install Oracle VM VirtualBox, its Extension Pack, and 7-zip files.
3. Run the 7-zip extractor on the BigDataLite450.7z.001 file only. This will create the
BigDataLite450.ova VirtualBox appliance file.
4. In VirtualBox, import BigDataLite450.ova.
5. Start BigDataLite-4.5.0.
6. Log in as oracle/welcome1.
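
Steps 3 and 4 can also be scripted. The following is a minimal sketch, assuming that the 7-Zip command-line tool (7z) and VirtualBox's VBoxManage are installed and on the PATH. The helper name build_setup_commands is hypothetical, and the commands are only assembled here, not executed.

```python
import shlex


def build_setup_commands(archive="BigDataLite450.7z.001",
                         appliance="BigDataLite450.ova"):
    """Return the extract and import commands as argument lists (nothing is run)."""
    return [
        # Step 3: extract starting from the first part only.
        ["7z", "x", archive],
        # Step 4: import the resulting appliance file into VirtualBox.
        ["VBoxManage", "import", appliance],
    ]


commands = build_setup_commands()
for cmd in commands:
    print(shlex.join(cmd))
```

The Deployment Guide remains the reference for the actual installation; the sketch only mirrors the file names used in the steps above.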



Reviewing the Deployment Guide



Downloading Oracle VM VirtualBox Plus its Extension Pack



Installing Oracle VM VirtualBox Plus its Extension Pack



Starting the Oracle Big Data Lite VM



Oracle Big Data Lite VM Desktop



Review the Important Start Here Page



Starting and Stopping Services



Available Tools on the Browser's Toolbar



Agenda

Course objectives, course road map, and lesson objectives


Introduction to Oracle Big Data Lite VM
Introduction to the Oracle MoviePlex case study
Useful resources



Oracle MoviePlex Case Study: Introduction

Oracle MoviePlex is an application that is based on a fictitious online movie-streaming rental company.
With this web-based application, you can:
Browse a catalog of movies
Watch movie trailers
Rent movies
Review and rank movies
Like many other online stores, Oracle MoviePlex needed a cost-effective approach to
tackle the big data challenges.
It recently implemented the Oracle Big Data platform to:
Better manage the business
Identify key opportunities
Enhance customer satisfaction



Big Data Challenge
Applications are generating massive volumes of unstructured data that describes user behavior and application performance.
Given the cost and complexity, how can companies fully capitalize on this valuable information?
How do you use this raw data to gain better insights into your customers, enhance their user experience, and ultimately improve profitability?

Capture users' clicks:

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}
{"custId":1067283,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:55","recommended":null,"activity":9}
{"custId":1377537,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:58","recommended":null,"activity":9}
{"custId":1347836,"movieId":null,"genreId":null,"time":"2012-07-01:00:02:03","recommended":null,"activity":8}
{"custId":1137285,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:39","recommended":null,"activity":8}
{"custId":1354924,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:51","recommended":null,"activity":9}
{"custId":1036191,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:55","recommended":null,"activity":8}
{"custId":1143971,"movieId":1017161,"genreId":44,"time":"2012-07-01:00:04:00","recommended":"Y","activity":7}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:04:03","recommended":"Y","activity":5}
{"custId":1273464,"movieId":null,"genreId":null,"time":"2012-07-01:00:04:39","recommended":null,"activity":9}
{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
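
As a first step toward using this raw data, the click records can be parsed and counted. The following is a minimal sketch, assuming one JSON object per line as in the sample above; the function name tally is hypothetical, and only three of the sample records are embedded to keep the example short.

```python
import json
from collections import Counter

# Three records copied from the sample above.
SAMPLE = """\
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
"""


def tally(lines):
    """Count events per activity code and per movie (events without a movie are skipped)."""
    by_activity, by_movie = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)  # JSON null becomes Python None
        by_activity[event["activity"]] += 1
        if event["movieId"] is not None:
            by_movie[event["movieId"]] += 1
    return by_activity, by_movie


by_activity, by_movie = tally(SAMPLE.splitlines())
print(by_activity)
print(by_movie)
```

On a real log the same counting would typically run as a distributed job rather than in a single Python process.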



Derive Value from Big Data

How can you:


Make the right movie offers at the right time?
Better understand the viewing trends of various customer segments?
Optimize marketing spend by targeting customers with optimal promotional offers?
Minimize infrastructure spend by understanding bandwidth usage over time?
Prepare to answer questions that you haven't thought of yet!

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
...



Oracle MoviePlex: Goal

To deliver a personalized movie-watching experience by collecting and storing:


User profiles
Movie listings
Ratings
Users' viewing location within a paused movie
All customer information and session details are fictitious.



Oracle MoviePlex: Big Data Challenges

The application generates a huge volume of unstructured data.


Requests require fast response times (measured in milliseconds).
Data point: Latency matters.



Oracle MoviePlex: Architecture

[Figure: Oracle MoviePlex architecture, with the following labeled components]
Application log: log of all activity on the site, streamed into HDFS using Flume
Oracle NoSQL DB: customer profile (such as recommended movies); capture activity required for the MoviePlex site
Oracle Big Data Appliance: HDFS, Oracle NoSQL DB, and MapReduce jobs (ORCH - CF Recs., Pig - Sessionize, Hive - Activities)
Oracle Big Data Connectors: load session and activity data; load recommendations
Oracle Exadata: Oracle Advanced Analytics (clustering/market basket, mood recommendations)
Oracle Exalytics: Endeca Information Discovery, Oracle Business Intelligence EE



Oracle MoviePlex: Data Generation Format
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
...

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}

Field: Description
custId: The customer's ID
movieId: The ID of the selected movie
genreId: The genre of the selected movie
time: The timestamp when the customer watched the movie
recommended: Whether the selected movie was recommended, Y or N
activity: One of the following codes:
  1: Rate movie
  2: Completed movie
  3: Not completed
  4: Started movie
  5: Browsed movie
  6: List movies
  7: Search
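
The field descriptions can be applied to a single record as follows. This is a minimal sketch: the function name decode and the ACTIVITY_LABELS mapping are hypothetical, the timestamp pattern is inferred from the sample values, and activity codes that are not in the table above (the sample data also contains 8 and 9) are reported as unlisted rather than guessed.

```python
import json
from datetime import datetime

# Activity codes exactly as listed in the field table; nothing beyond 1-7 is assumed.
ACTIVITY_LABELS = {
    1: "Rate movie",
    2: "Completed movie",
    3: "Not completed",
    4: "Started movie",
    5: "Browsed movie",
    6: "List movies",
    7: "Search",
}


def decode(line):
    """Turn one JSON click record into a dictionary with typed, labeled values."""
    record = json.loads(line)
    code = record["activity"]
    return {
        "custId": record["custId"],
        "movieId": record["movieId"],
        "genreId": record["genreId"],
        # Sample timestamps look like 2012-07-01:00:00:22 (date and time joined by a colon).
        "time": datetime.strptime(record["time"], "%Y-%m-%d:%H:%M:%S"),
        # "Y" -> True, "N" -> False, null -> None
        "recommended": {"Y": True, "N": False}.get(record["recommended"]),
        "activity": ACTIVITY_LABELS.get(code, "unlisted code %d" % code),
    }


row = decode('{"custId":1354924,"movieId":1948,"genreId":9,'
             '"time":"2012-07-01:00:00:22","recommended":"N","activity":7}')
print(row)
```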



Oracle MoviePlex Application
[Figure: Oracle MoviePlex application screenshot with the following callouts]
1: Simple profile updates
2: Advanced analytics: movies based on mood
4: Advanced profile attributes
What is a key-value store?



Agenda

Course objectives, course road map, and lesson objectives


Introduction to Oracle Big Data Lite Virtual Machine (VM)
Introduction to the Oracle MoviePlex case study
Useful resources



Oracle MoviePlex Demo on Oracle Big Data Lite Landing Page



Appendixes



Oracle Big Data Appliance Help Center Documentation
https://docs.oracle.com/en/bigdata/



Oracle Big Data Appliance Documentation



Additional Resources: Oracle Big Data
Tutorials on the Oracle Learning Library (OLL)
You can access tutorials, Oracle by Example (OBE) lessons, and other material by visiting the OLL at http://www.oracle.com/goto/oll.
The tutorials:
Provide step-by-step instructions for performing a variety of tasks in big data and related products
Reduce the time that you spend investigating the required steps to perform a task
Use practical real-world situations so that you can gain knowledge through valuable hands-on experience. You can then use the solutions as the foundation for production implementation, dramatically reducing time to deployment.



Oracle Big Data Tutorials on the OLL



Product Libraries



Oracle Big Data Landing Page



Oracle Big Data Discovery Landing Page



Oracle Big Data Administration Series



Oracle University Courses

education.oracle.com



Oracle University Courses Oracle Big Data Fundamentals



Oracle University Courses Oracle Big Data Fundamentals



Summary

In this lesson, you should have learned how to:


Discuss the course objectives and road map
Download, install, and start the Oracle Big Data Lite VM
Use the Oracle MoviePlex case study
Identify the available appendixes
Identify the relevant documentation and other resources



2

Big Data Management System



Objectives

After completing this lesson, you should be able to:


Define the term Big Data
Identify the challenges and opportunities in implementing Big Data
Describe the Oracle Information Management Architecture for Big Data
Describe Oracle's technology approach for Big Data adoption

Big Data: A Strategic Information Management (IM) Perspective

Information Management
Information Management is the means by which an
organization maximizes the efficiency and value with
which it plans, collects, organizes, uses, controls,
stores, and disseminates its Information.

Big Data

Big Data is a term used to describe data sets whose size is beyond the capability of the
software tools commonly used to capture, manage, and process data.
Big Data can be generated from many different sources, including:
Social networks
Banking and financial services
E-commerce services
Web-centric services
Internet search indexes
Scientific and document searches
Medical records
Web logs

Characteristics of Big Data

[Figure: the four characteristics of Big Data (Volume, Velocity, Variety, and Value), illustrated with sources such as social networks, RSS feeds, and microblogs]

Big Data Opportunities: Some Examples

Healthcare
Today's challenge: Expensive office visits
New data: Remote patient monitoring
What's possible: Preventive care, reduced hospitalization, epidemiological studies

Manufacturing
Today's challenge: In-person support
New data: Product sensors
What's possible: Automated and predictive diagnosis and support

Location-based services
Today's challenge: Based on home ZIP code
New data: Real-time location data
What's possible: Geo-advertising, personalized notifications and search

Utilities
Today's challenge: Complex distribution grid
New data: Detailed consumption statistics
What's possible: Increased availability, reduced cost, tiered metering plans

Big Data Challenges

Schematize? Analyze? Processing? Governance?

Information Management Landscape

Information Management (IM) is the means by which an organization seeks to:


Maximize the efficiency with which it plans, collects, organizes, uses, controls, stores,
disseminates, and disposes of its information
Ensure that the value of that information is identified and exploited to the maximum
extent possible



Extending the Boundaries of Information Management

How can Big Data technologies be applied to create additional business value or reduce the costs of delivering Information Management?

Bridge Big Data and traditional relational database worlds by integrating structured, semi-structured, and unstructured information.

Augment Big Data analysis techniques with a portfolio of Oracle Advanced Analytics, Business Intelligence, and Data Warehousing technologies.



A Simple Functional Model for Big Data

Discovery and commercial exploitation for new data

[Figure: Business Value plotted against Time / Effort]



Oracle Information Management Conceptual Architecture

[Figure: Oracle Information Management conceptual architecture]
Inputs: Data Streams, Structured Enterprise Data, Other Data
Execution: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting
Innovation: Discovery Lab (Events & Data, Discovery Output)
Outputs: Actionable Events, Actionable Insights, Actionable Information
The Line of Governance separates Execution from Innovation.



IM Architecture Design Pattern: Discovery Lab

[Figure: conceptual architecture diagram for the Discovery Lab design pattern. Inputs: Data Streams, Structured Enterprise Data, Other Data. Execution: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting. Innovation: Discovery Lab (Events & Data, Discovery Output). Outputs: Actionable Events, Actionable Insights, Actionable Information.]



IM Architecture Design Pattern: Information Platform

[Figure: conceptual architecture diagram for the Information Platform design pattern. Inputs: Data Streams, Structured Enterprise Data, Other Data. Execution: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting. Innovation: Discovery Lab (Events & Data, Discovery Output). Outputs: Actionable Events, Actionable Insights, Actionable Information.]



IM Architecture Design Pattern: Data Application

[Figure: conceptual architecture diagram for the Data Application design pattern. Inputs: Data Streams, Structured Enterprise Data, Other Data. Execution: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting. Innovation: Discovery Lab (Events & Data, Discovery Output). Outputs: Actionable Events, Actionable Insights, Actionable Information.]



IM Architecture Design Pattern: Information Solution

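[Figure: conceptual architecture diagram for the Information Solution design pattern. Inputs: Data Streams, Structured Enterprise Data, Other Data. Execution: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting. Innovation: Discovery Lab (Events & Data, Discovery Output). Outputs: Actionable Events, Actionable Insights, Actionable Information.]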



IM Architecture Design Pattern: Real-Time Events

[Figure: conceptual architecture diagram for the Real-Time Events design pattern. Inputs: Data Streams, Structured Enterprise Data, Other Data. Execution: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting. Innovation: Discovery Lab (Events & Data, Discovery Output). Outputs: Actionable Events, Actionable Insights, Actionable Information.]



Big Data Adoption and Implementation Patterns



Additional Resources

The Oracle Big Data Learning Library:
https://apexapps.oracle.com/pls/apex/f?p=44785:141:0:::141:P141_PAGE_ID,P141_SECTION_ID:27,615
Information Management and Big Data: A Reference Architecture (Oracle White Paper):
http://www.oracle.com/ocom/groups/public/@otn/documents/webcontent/2297765.pdf



Summary

In this lesson, you should have learned how to:


Define the term Big Data
Identify the challenges and opportunities in implementing Big Data
Describe the Oracle Information Management Architecture for Big Data
Describe Oracle's technology approach for Big Data adoption



3

Data Acquisition and Storage



Objectives

After completing this lesson, you should be able to:


Define the Apache Hadoop ecosystem
Describe the architectural components of Hadoop Distributed File System (HDFS)
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using the Oracle NoSQL Database

Agenda

Introduction to Apache Hadoop


Understand the architectural components of HDFS
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database

Apache Hadoop

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of big data on clusters of commodity hardware.
Hadoop is available as open source:
From the Apache Software Foundation
As distributions, such as Cloudera's Distribution Including Apache Hadoop (CDH)
Apache Hadoop has two core components:
Distributed storage with HDFS
Distributed and parallel processing with MapReduce and Spark

Types of Analysis That Use Hadoop

Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborative filtering
Prediction models
Sentiment analysis
Risk assessment

Apache Hadoop Ecosystem

Hadoop: a partial list of related projects

Hadoop Core Components:


HDFS (Storage)
MapReduce (Processing)

Oracle Big Data Appliance (BDA)

The Oracle BDA:


Simplifies the deployment of Apache Hadoop
Optimizes the secure deployment of CDH clusters on an engineered system to manage
and process the data
Is covered in a later lesson in this course

Agenda

Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database

HDFS: Characteristics

Master-slave architecture

Fault-tolerant (HA)

Redundant

Supports MapReduce

Scalable



HDFS Key Definitions

Term: Description
Cluster: A group of servers (nodes) on a network that are configured to work together. A server is either a master node or a slave (worker) node.
Hadoop: A batch-processing infrastructure that stores and distributes files and distributes work across a group of servers (nodes).
Hadoop Cluster: A collection of racks containing master and slave nodes.
Blocks: HDFS breaks down a data file into blocks or "chunks" and stores the data blocks on different slave DataNodes in the Hadoop cluster.
Replication Factor: By default, HDFS makes three copies of each data block and stores them on different DataNodes/racks in the Hadoop cluster.
NameNode (NN): A service (daemon) that maintains a directory of all files in HDFS and tracks where data is stored in the HDFS cluster.
DataNode (DN): Stores the blocks or "chunks" of data for a set of files.



HDFS: Example
Client: movieplex1.log is 350 MB in size and the block size is 128 MB. The client chunks the file into three blocks: A (128 MB), B (128 MB), and C (94 MB).

NameNode (master):
File: movieplex1.log
Blocks: A, B, C
Data Nodes: 1, 2, 3
Replication Factor: 3
A: DN 1, DN 2, DN 3
B: DN 1, DN 2, DN 3
C: DN 1, DN 2, DN 3
. . .

Each DataNode sends a Heartbeat every 3 seconds and a Blockreport every 6 hours.

Data Node 1 (slave): blocks A, C, B
Data Node 2 (slave): blocks B, A, C
...
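
The block arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the function name split_into_blocks is hypothetical, and it only reproduces the figures from the slide (a 350 MB file, a 128 MB block size, and a replication factor of 3) rather than anything that HDFS itself executes.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block in MB; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(350)              # blocks A, B, and C
replication_factor = 3
replica_count = len(blocks) * replication_factor
raw_storage_mb = sum(blocks) * replication_factor

print(blocks)          # sizes of A, B, C in MB
print(replica_count)   # block replicas spread across the DataNodes
print(raw_storage_mb)  # total raw capacity consumed in the cluster
```

With a replication factor of 3, the 350 MB file occupies three times its size in raw cluster storage, which is the cost of the redundancy shown in the diagram.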



Storing and Accessing Data Files in HDFS
Client: movieplex1.log is 350 MB in size and the block size is 128 MB. The client chunks the file into three blocks: A, B, and C.

Active (and Standby) NameNodes (master):
File: movieplex1.log
Blocks: A, B, C
Data Nodes: 1, 2, 3
A: DN1, DN2, DN3
B: DN1, DN2, DN3
C: DN1, DN2, DN3

Ack messages from the pipeline are sent back to the client (blocks are copied).

DataNode 1 (slave): blocks A, C, B
DataNode 2 (slave): blocks B, A, C
DataNode 3 (slave): blocks C, B, A



Oracle Big Data Lite VM

The Oracle Big Data Lite VM provides an integrated environment to help you get started with the Oracle big data platform.
Many Oracle big data platform components have been installed and configured, allowing you to begin using the system right away.
You can access the latest Big Data Lite landing page on OTN at:
http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
The landing page contains tutorials, videos, white papers, and more.



Accessing the NameNode Web UI on the Oracle Big Data Lite VM

The NameNode exposes its Web UI on port 50070

localhost:50070



NameNode Web UI



Agenda

Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Hue
Hadoop client
WebHDFS
HttpFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database



Using Cloudera Hue to Interact with HDFS
http://bda1node03.example.com:8888

Manage HDFS



Using Hadoop Client to Batch Load Data

Advantages:
Enables direct HDFS writes without intermediate file staging on the Linux FS
Easy to scale:
Initiate concurrent puts for multiple files
HDFS will leverage multiple target servers and ingest faster
Disadvantages:
Additional software (the Hadoop client) needs to be installed on the source server

[Figure: the HDFS put command is issued from the Hadoop client on the source server and copies files from the Linux FS on the source server to HDFS on the Big Data Appliance HDFS nodes]



The HDFS File System (FS) Shell Interface

HDFS supports a traditional hierarchical file organization.


You can use the FS shell command-line interface to interact with the data in HDFS.
The syntax of this command set is similar to that of other shells.
You can create, remove, rename, and move directories/files.
You can invoke FS shell as follows:

hadoop fs <args>



FS Shell Commands

Linux command

HDFS command

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html



Sample FS Shell Commands
Command: Description
ls: Lists attributes of files and directories
cat: Copies source paths to stdout
cp: Copies files from source to destination in HDFS
mv: Moves files from source to destination. Moving files across file systems is not permitted.
rm: Deletes the specified files. The -r option deletes the directory and its contents.
put: Copies files from the local file system to HDFS
get: Copies files from HDFS to the local file system
mkdir: Creates one or more HDFS directories
rmdir: Deletes a directory
jar: Runs a jar file. Users can bundle their MapReduce code in a jar file and execute it by using this command (invoked as hadoop jar).
version: Prints the Hadoop version (invoked as hadoop version)
help: Returns usage output (the available commands)



ls Command
hadoop fs -ls

[Figure: sample listing with the directories and the files labeled]

For a file, it returns stat on the file in the following format:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
For a directory, it returns a list of its direct children, as in UNIX. A directory is listed as:
permissions userid groupid modification_date modification_time dirname
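
The file format above maps naturally onto a small parser. The following Python sketch is an illustration only: the class and function names are hypothetical and the sample line is made up. It assumes the eight-column layout that recent Hadoop releases appear to print for every entry, where a directory shows "-" in the replica column and 0 as its size.

```python
from typing import NamedTuple


class LsEntry(NamedTuple):
    permissions: str
    replicas: str      # "-" for directories
    user: str
    group: str
    size: int
    date: str
    time: str
    path: str

    @property
    def is_directory(self) -> bool:
        # The permission string of a directory starts with "d".
        return self.permissions.startswith("d")


def parse_ls_line(line: str) -> LsEntry:
    """Parse one line of `hadoop fs -ls` output into its fields."""
    # Split into at most 8 fields so that a path containing spaces stays intact.
    perms, replicas, user, group, size, date, time, path = line.split(None, 7)
    return LsEntry(perms, replicas, user, group, int(size), date, time, path)


# Made-up sample line in the file format described above.
entry = parse_ls_line(
    "-rw-r--r--   1 oracle oracle       1366 2016-11-01 10:15 /user/oracle/test1.txt"
)
```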



mkdir and copyFromLocal Commands
Create an HDFS directory named curriculum by using the mkdir command:

Copy lab_05_01.txt from the local file system to the curriculum HDFS
directory by using the copyFromLocal command:



rm and cat Commands
Delete the curriculum HDFS directory by using the rm command. Use the -r option
to delete the directory and any content under it recursively:

Display the contents of the part-r-00000 HDFS file by using the cat command:



Load Data With WebHDFS or HttpFS

Advantages:
WebHDFS performance is comparable with that of the Hadoop client
No additional software is required on the client side

Disadvantages:
More complex syntax than the Hadoop client
HttpFS uses a single gateway node, which can become a bottleneck

[Figure: the source server has no Hadoop client; data loading is initiated on the client side with the curl command, copying from the Linux FS to HDFS on the Big Data Appliance HDFS nodes]



hadoop fs -ls and LISTSTATUS

hadoop fs -ls

curl -i "http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle?op=LISTSTATUS"

LISTSTATUS displays the same content as the hadoop fs -ls command, but in JSON format.

Confirm file upload and view its content



Uploading a Local File to an HDFS Directory With hadoop fs

Create an HDFS directory named test11 by using the hadoop fs CLI:

Copy the local test1.txt file to the HDFS directory test11 by using the hadoop fs CLI:

hadoop fs -put test1.txt hdfs://bigdatalite.localdomain:8020/user/oracle/test11

Confirm file upload and view its content



Creating an HDFS Directory With WebHDFS
Create an HDFS directory named test21 by using WebHDFS:
curl -i -X PUT -L -H 'Content-Type:application/octet-stream' \
"http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/test21?op=MKDIRS&user.name=oracle";



Uploading a Local File to HDFS With WebHDFS
Upload the local test1.txt file to the test21 HDFS directory by using WebHDFS:

curl -i -X PUT -L -H 'Content-Type:application/octet-stream' \
"http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/test21/test1.txt?op=CREATE&user.name=oracle" -T test1.txt;



Creating an HDFS Directory and Loading Data by Using HttpFS
Create an HDFS directory named test31 and upload test1.txt to it by using HttpFS:

curl -i -X PUT -L -H 'Content-Type:application/octet-stream' \
"http://bigdatalite.localdomain:14000/webhdfs/v1/user/oracle/test31/test1.txt?op=CREATE&user.name=oracle" -T test1.txt;

HttpFS uses port 14000 by default.
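
The WebHDFS and HttpFS URLs on the preceding slides share one pattern: host, port, the /webhdfs/v1 prefix, the HDFS path, and an op parameter. The Python sketch below only assembles such URLs; the function name is hypothetical, and the host, ports, paths, and user are the ones shown on the slides.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, user=None):
    """Build a WebHDFS or HttpFS REST URL of the form used on the slides."""
    params = {"op": op}
    if user is not None:
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(hdfs_path), urlencode(params)
    )


# WebHDFS on the NameNode HTTP port used on the slides (50070).
mkdir_url = webhdfs_url(
    "bigdatalite.localdomain", 50070, "/user/oracle/test21", "MKDIRS", "oracle"
)

# HttpFS differs only in the port (14000 by default).
create_url = webhdfs_url(
    "bigdatalite.localdomain", 14000, "/user/oracle/test31/test1.txt", "CREATE", "oracle"
)
```

The resulting strings can be passed to curl exactly as in the slide examples; only the port tells WebHDFS and HttpFS apart.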



Agenda

Introduction to Hadoop
Understand the architectural components of HDFS
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using Oracle NoSQL Database



What is Apache Flume?

Is a distributed service for collecting, aggregating, and moving large amounts of data to a centralized data store
Is developed as an Apache open-source project
Has the following features:
  Simple
  Reliable
  Fault tolerant
  Used for online analytic applications



Apache Flume: Architecture

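[Figure: a web server sends events to a Flume agent, where they pass from the source through the channel to the sink, which writes them to HDFS]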
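
The source, channel, and sink roles in the architecture can be mimicked in a few lines of Python. This is a conceptual toy, not the Flume API (Flume agents are Java processes set up through configuration files); the class and method names are hypothetical. It shows only the buffering role of the channel: events stay in the channel until the sink has delivered them.

```python
from collections import deque


class Agent:
    """Toy agent: a source puts events on a channel; a sink drains them."""

    def __init__(self, sink):
        self.channel = deque()   # buffers events between source and sink
        self.sink = sink         # callable that delivers a batch to the store

    def source_receive(self, event):
        """Source side: accept an event and place it on the channel."""
        self.channel.append(event)

    def sink_drain(self, batch_size=2):
        """Deliver up to batch_size events; on failure they stay in the channel."""
        batch = list(self.channel)[:batch_size]
        if not batch:
            return 0
        self.sink(batch)         # may raise; events are not yet removed
        for _ in batch:
            self.channel.popleft()
        return len(batch)


delivered = []                   # stands in for the centralized data store
agent = Agent(delivered.extend)
for line in ["GET /a", "GET /b", "GET /c"]:
    agent.source_receive(line)
sent = agent.sink_drain()
```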



What is NoSQL Database?

Is a key-value (KV) database
Is accessible by using Java APIs
Stores unstructured or semi-structured data as byte arrays

Benefits
Easy to install and configure
Highly reliable
General-purpose database system
Scalable throughput and predictable latency
Configurable consistency and durability

[Figure: KV Store]



What is Oracle NoSQL Database?

Less is more: Simple, Fast, Flexible, Reliable

Advanced key-value database designed as a cost-effective, high-performance solution for simple operations on collections of data with built-in high availability and elastic scale-out.



Where is NoSQL Used?

[Figure: where NoSQL fits in the business]
Business applications: ERP, EAM, CRM, Inventory Control, Accting & Payroll, Process Mgmt, Business Analytics
Driver: Globally Distributed, Always On data; Competitive Advantages of Fast Data; Lower TCO, commodity HW scale-out
NoSQL use cases: Customer Portals, Simple Data Management Application, Real Time Event Processing, Mobile Data Management, Time Series & Sensor Data Mgmt, Online Banking



Where is NoSQL Used?

[Figure: where NoSQL fits in the business]
Run Your Business: ERP, EAM, CRM, Inventory Control, Accting & Payroll, Process Mgmt, Business Analytics
Driver: Globally Distributed, Always On data; Competitive Advantages of Fast Data; Lower TCO, commodity HW scale-out
Expand Your Business: Customer Portals, Simple Data Management Application, Real Time Event Processing, Mobile Data Management, Time Series & Sensor Data Mgmt, Online Banking



Choose the Right Storage Option for the Job

HDFS | Oracle NoSQL Database | Oracle Database
File system | Key-value database | Relational database
No inherent structure | Simple data structure | Complex data structures, rich SQL
High-volume writes | High-volume random reads and writes | High-volume OLTP with 2-PC
Limited functionality, roll-your-own applications | Simple get/put high-speed storage, flex configuration | Security, backup/restore, data life cycle mgmt, XML, etc.
Batch oriented | Real-time, web-scale specialized applications | General purpose SQL platform, multiple applications, ODBC, JDBC



Relational Database Management System Compared to NoSQL

Relational Database Management System | NoSQL Database
High-value, high-density, complex data | Low-value, low-density, simple data
Complex data relationships | Very simple relationships
Joins | Avoids joins
Schema-centric, structured data | Unstructured or semi-structured data
Designed to scale up (and not out) | Distributed storage and processing
Well-defined standards | Standards not yet evolved
Database-centric | Application- and developer-centric
High security | Minimal or no security



HDFS Compared to NoSQL

HDFS | NoSQL Database
File system | Database
No inherent structure | Simple data structure
Batch-oriented | Real-time
Processes data to use | Delivers a service
Bulk storage | Fast access to specific records
Write once, read many | Read, write, delete, update



Available Resources
Component Website
Cloudera Manager http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html
Apache Hadoop http://hadoop.apache.org/
fuse-dfs http://fuse.sourceforge.net/
Cloudera Hue http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/Hue-2-User-Guide/hue2.html
Apache Oozie http://oozie.apache.org/
Apache Hive https://hive.apache.org/
Apache Pig http://pig.apache.org
Apache Flume http://flume.apache.org/
Apache Sqoop http://sqoop.apache.org/
Apache HBase http://hbase.apache.org/
Apache ZooKeeper http://zookeeper.apache.org
Apache Mahout http://mahout.apache.org
Apache Whirr https://whirr.apache.org/



Available Resources: Oracle University Courses

www.education.oracle.com



Available Resources: Oracle Learning Library (OLL)



Summary

In this lesson, you should have learned how to:


Define the Apache Hadoop ecosystem
Describe the architectural components of Hadoop Distributed File System (HDFS)
Interact with data stored in HDFS
Acquire data by using Apache Flume
Acquire and access data by using the Oracle NoSQL Database



4

Data Access and Processing



Objectives

After completing this lesson, you should be able to:


Identify the benefits of MapReduce, run a MapReduce job, and monitor the job
Identify the benefits of Apache Spark, run a Spark job, and monitor the job
Use YARN to monitor jobs and to manage resources in your Hadoop cluster

Agenda

MapReduce
Spark
YARN

Apache Hadoop Core Components

Apache Hadoop is a system for large-scale and distributed data processing. It has two
core components:
Distributed storage with HDFS
Distributed and parallel processing with MapReduce or Spark framework
MapReduce is a batch-oriented software framework that enables you to write
applications that will process large amounts of data in parallel, on large clusters of
commodity hardware, and in a reliable and fault-tolerant manner.
The Apache Spark framework is a cluster-computing platform designed to
be fast and general-purpose. It extends the MapReduce functionality (covered later).
Both MapReduce and Spark are managed by YARN (the acronym for Yet Another
Resource Negotiator).

MapReduce Framework Features

Integrates with HDFS and provides the same benefits for parallel data processing
Parallelizes and distributes computations to where the data is stored
The framework:
Schedules and monitors tasks, and re-executes failed tasks
Hides complex distributed computing tasks from the developer
Enables developers to focus on writing the Map and Reduce functions
MapReduce code can be written in Java, C, and scripting languages. Higher-level
abstractions (such as Hive and Pig) enable easy interaction. Optimizers construct
MapReduce jobs.

MapReduce Job

A MapReduce job is a unit of work requested by a client.


The job consists of:
Input data (stored in HDFS)
A MapReduce program (written by the developer)
Hadoop runs the job by dividing it into the following tasks:
Map tasks (code written by the developer)
Shuffle-and-sort tasks (performed by MapReduce)
Reduce tasks (code written by the developer)

MapReduce Jobs Flow

[Figure: two MapReduce jobs. In each job, input from HDFS is processed by several parallel MAP tasks, passed through Shuffle and Sort to the REDUCE tasks, and written as output to HDFS (Input 1 to Output 1, Input 2 to Output 2).]

Simple Word Count MapReduce: Example
Input Text from HDFS
1: Hello BigData World
2: Bye Bye BigData World

Mapper 1 (input split 1): Hello: 1, BigData: 1, World: 1
Mapper 2 (input split 2): Bye: 1, Bye: 1, BigData: 1, World: 1
Shuffle & Sort (intermediate data): BigData: 1, 1; Bye: 1, 1; Hello: 1; World: 1, 1
Reducers: BigData: 2; Bye: 2; Hello: 1; World: 2
Final Output to HDFS: BigData: 2, Bye: 2, Hello: 1, World: 2
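
The three stages of the word count example can be traced in plain Python. This is a single-process sketch of the data flow only, assuming nothing about the Hadoop API; the function names are hypothetical, and the two input lines are the ones shown on the slide.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group the values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)


# Each input split holds one line of the sample text.
splits = ["Hello BigData World", "Bye Bye BigData World"]
mapped = [pair for line in splits for pair in map_phase(line)]
intermediate = shuffle_and_sort(mapped)
result = [reduce_phase(key, values) for key, values in intermediate]
```

In a real job, the developer supplies only the map and reduce logic; the grouping step in the middle is what the framework performs between the two.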

How Do You Run a MapReduce Job?

hadoop jar WordCount.jar WordCount \
/user/oracle/wordcount/input_directory \
/user/oracle/wordcount/output_directory

Code | Description
hadoop jar | Instructs the client to submit the job to the ResourceManager
WordCount.jar | The jar file that contains the Map and Reduce code
WordCount | The name of the class that contains the main method where processing starts
/user/oracle/wordcount/input_directory | The input directory
/user/oracle/wordcount/output_directory | A single HDFS output path. All final output will be written to this directory.

Submitting a WordCount MapReduce Job:
Review the Input Data Files

cat file01 file02



Submitting the WordCount MapReduce Job

hadoop jar WordCount.jar WordCount \
/user/oracle/wordcount/input \
/user/oracle/wordcount/output



Monitoring MapReduce Jobs by Using the YARN Resource Manager Web UI

http://localhost:8088/cluster

YARN (covered later in this lesson) is used to manage resources on a Hadoop cluster. In this example, YARN is managing a MapReduce job, but it can also be used to monitor Spark jobs.
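
The same information that the web UI shows can also be read programmatically. The Python sketch below assumes the ResourceManager REST endpoint /ws/v1/cluster/apps on the same host and port as the web UI, and the JSON layout it is documented to return; the function names are hypothetical, and the sample payload in the usage note is made up.

```python
import json
from urllib.request import urlopen


def fetch_apps(rm_url="http://localhost:8088"):
    """Fetch the application list from a running ResourceManager."""
    with urlopen(rm_url + "/ws/v1/cluster/apps") as resp:
        return json.load(resp)


def summarize_apps(payload):
    """Return (id, name, state, finalStatus) for each application listed."""
    # "apps" is null when the cluster has no applications to report.
    apps = (payload.get("apps") or {}).get("app", [])
    return [(a["id"], a["name"], a["state"], a["finalStatus"]) for a in apps]


# Made-up payload in the assumed layout, so the sketch runs without a cluster.
sample = {
    "apps": {
        "app": [
            {
                "id": "application_1_0001",
                "name": "word count",
                "state": "FINISHED",
                "finalStatus": "SUCCEEDED",
            }
        ]
    }
}
summary = summarize_apps(sample)
```

On a live cluster, `summarize_apps(fetch_apps())` would list MapReduce and Spark applications alike, because both are submitted through YARN.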



Monitoring MapReduce Jobs by Using the JobHistory Server Web UI

http://bigdatalite.localdomain:19888/jobhistory

The MapReduce JobHistoryServer archives job metrics and can be accessed through the JobHistory Web UI or Hue.



Viewing the WordCount.java Program Output



Agenda

MapReduce
Spark
YARN



What is Apache Spark?

A fast and general engine for large-scale data processing


An open-source, parallel data-processing framework with a proven scalability of up to
2000 nodes
Makes it easy to develop fast, unified big data applications combining batch, streaming,
and interactive analytics on all your data



Benefits of Using Spark

Speed
Ease of use (higher level of abstraction than MapReduce)
Sophisticated analytics
Runs on various cluster managers, such as YARN
Native integration with Java, Python, and Scala

Source: Apache Spark page on http://spark.apache.org/



Scala Language: Overview

The Scala programming language can be used for implementing Spark applications.

Scala:
Takes its name from "scalable language"
Is a pure object-oriented language
Runs on the Java virtual machine
Is reliable for large mission-critical systems



Scala Program: Word Count Example
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    // args(0) is the input path; args(1) is the minimum word count to keep
    val threshold = args(1).toInt
    // split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))
    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)
    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)
    System.out.println(charCounts.collect().mkString(", "))
  }
}
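
To see what the Scala program computes, the same pipeline can be traced in plain Python on a small input. This is an approximate, single-machine sketch rather than Spark code; the function name is hypothetical, and the sample lines are borrowed from the earlier word count slide. Note that the characters are counted once per distinct word that passes the threshold, not once per occurrence.

```python
from collections import Counter


def char_counts(lines, threshold):
    """Count characters across the distinct words that occur at least
    `threshold` times in the input lines."""
    # split each line into words
    words = [word for line in lines for word in line.split(" ")]
    # count the occurrence of each word
    word_counts = Counter(words)
    # keep the words that reach the threshold
    kept = [word for word, n in word_counts.items() if n >= threshold]
    # count the characters of the kept words
    return Counter(ch for word in kept for ch in word)


counts = char_counts(["Hello BigData World", "Bye Bye BigData World"], 2)
```

With a threshold of 2, "Hello" is dropped because it occurs once, and the characters of "BigData", "World", and "Bye" are counted.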



Spark Interactive Shells

There are two interactive Spark shells available to execute the Spark programs:
spark-shell is used for Scala
pyspark is used for Python



Starting Interactive Spark-Shell for Scala



Word Count Example by Using Interactive Scala
// Open the input files as an RDD of lines.
val files = sc.textFile("hdfs://bigdatalite:8020/user/oracle/wordcount/input")

// Define the word count: split lines into words, pair each word with 1, and sum per word.
val counts = files.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Execute the counts and save the output to the sp-out directory.
counts.saveAsTextFile("hdfs://bigdatalite:8020/user/oracle/wordcount/sp-out")

// Print the counts to the console.
counts.collect().foreach(println)



Word Count Example by Using Interactive Scala: Output



Monitoring Spark Jobs by Using the YARN Resource Manager Web UI



Agenda

MapReduce
Spark
YARN



Apache Hadoop YARN: Overview

YARN:
Is a subproject of Hadoop that separates resource-management and processing components
Is a resource-management framework for Hadoop that is independent of execution engines
Runs both MapReduce and non-MapReduce jobs

[Figure: layered stack]
Applications: MapReduce, Spark, ...
Compute: YARN (cluster resource management)
Storage: HDFS



YARN: Features

Scalability
Compatibility with MapReduce
Improved cluster utilization
Support for workloads other than MapReduce



YARN (MRv2) Daemons

Component | Description
ResourceManager (RM) | A dedicated scheduler that allocates resources to the requesting applications. It has two main components: Scheduler and ApplicationsManager. It is a critical component of the cluster and runs on a dedicated master node. For high availability (HA), Active and Standby RMs can be configured with automatic failover.
NodeManager (NM) | Each slave node in the cluster has an NM daemon, which acts as a slave for the RM. Each NM tracks the available data processing resources and usage (CPU, memory, disk, and network) on its slave node and sends regular reports to the RM.
ApplicationMaster (AM) | The per-application AM is responsible for negotiating resources from the RM and working with the NM(s) to execute and monitor the tasks. It runs on a slave node.
Container | A collection of all the resources necessary to run an application: CPU, memory, network bandwidth, and disk space. It runs on a slave node in a cluster.
Job History Server | Archives jobs and metadata



Hadoop Basic Cluster YARN (MRv2): Example
[Figure: example cluster layout]
HDFS Master Nodes (Distributed storage): NameNode (Active & Standby for HA)
Slave Node (Server) 1: DataNode, Containers, NodeManager
Slave Node (Server) 2: DataNode & Application Master, Container, NodeManager
Slave Node (Server) 3: DataNode & Application Master, Container, NodeManager
Master Nodes (Distributed processing): Resource Manager (two instances shown), Job History Server



Job Scheduling in YARN

YARN provides a pluggable model for scheduling policies. The scheduler is responsible for deciding where and when to run tasks. YARN supports the following pluggable schedulers:
FIFO (First In, First Out)
  Allocates resources based on submission time
  Resource requests for the first application in the queue are allocated first; once its requests have been satisfied, the next application in the queue is served, and so on.
Capacity Scheduler
  Allocates resources to pools, with FIFO scheduling within each pool (default in Hadoop)
Fair Scheduler
  Allows YARN applications to share resources in large clusters fairly
  Default in CDH5 (used in Oracle BDA)
  We will focus on the Fair Scheduler in this course.
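
The difference between FIFO and fair sharing can be shown with a toy allocator. The Python sketch below is only an illustration of the two policies, not how YARN implements them: real schedulers also deal with queues, weights, memory, and virtual cores. The function names, the application names, and the numbers are hypothetical, and resources are counted in whole containers.

```python
def fifo_allocate(capacity, requests):
    """FIFO: serve requests in submission order until capacity runs out."""
    grants = {}
    for app, wanted in requests:
        grant = min(wanted, capacity)
        grants[app] = grant
        capacity -= grant
    return grants


def fair_allocate(capacity, requests):
    """Fair share: hand out one container at a time, round-robin, to every
    application whose request is not yet satisfied."""
    demand = dict(requests)
    grants = {app: 0 for app in demand}
    while capacity > 0:
        hungry = [app for app in grants if grants[app] < demand[app]]
        if not hungry:
            break
        for app in hungry:
            if capacity == 0:
                break
            grants[app] += 1
            capacity -= 1
    return grants


# Three applications, in submission order, competing for 10 containers.
requests = [("A", 8), ("B", 3), ("C", 1)]
fifo = fifo_allocate(10, requests)
fair = fair_allocate(10, requests)
```

Under FIFO, the first application takes most of the cluster and the last one waits; under fair sharing, the small requests are satisfied in full and the large one gets the remainder.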



Cloudera Manager Resource Management Features

Cloudera Manager provides the following features to assist you with allocating cluster resources to services:
Static allocation (percentage of cluster resources)
Dynamic allocation

[Figure: service queues ("pools"). Gray boxes represent statically allocated resource pools: HDFS (30%), Impala (15%), YARN (30%). Blue boxes represent dynamically allocated resource pools (Dynamic Resource Pools): hrpool, marketingpool.]



Summary

In this lesson, you should have learned how to:


Identify the benefits of MapReduce, run a MapReduce job, and monitor the job
Identify the benefits of Apache Spark, run a Spark job, and monitor the job
Use YARN to monitor jobs and to manage resources in your Hadoop cluster



5

Data Unification



Objectives

After completing this lesson, you should be able to:


Identify the need to integrate your data
Provide an overview of data unification technologies that are supported by the Oracle
Big Data Management System
Identify the benefits of Oracle Big Data SQL

Introducing Data Unification Options

Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop

Unifying Data: A Typical Requirement

Oracle Big Data Management System

[Figure: Oracle Big Data Management System]
DATA RESERVOIR (Big Data Appliance): Cloudera Hadoop, Oracle Big Data SQL, Oracle NoSQL, Oracle Big Data Discovery, Oracle Big Data Spatial and Graph
DATA WAREHOUSE (Exadata): Oracle Database (In-Memory, Multi-tenant), Oracle Big Data SQL, Oracle Industry Models, Oracle Advanced Analytics, Oracle Spatial and Graph
Between the two: Oracle Big Data Connectors, Oracle Data Integrator
SOURCES feed both sides through Oracle Event Processing, Apache Flume, Oracle GoldenGate, and Oracle Data Integrator

Oracle Big Data Connectors

[Figure: Oracle Big Data Connectors between the DATA LAKE and the DATA WAREHOUSE]
Oracle Datasource for Hadoop: Hive query of Oracle tables
Oracle Loader for Hadoop: Load into Database
Lightweight Big Data SQL: Oracle SQL Connector for HDFS (Oracle SQL query of HDFS data)
Oracle R Advanced Analytics for Hadoop: R Client, R Analytics
Oracle XQuery on Hadoop: XQuery, XML/XQuery
Oracle Data Integrator Knowledge Modules

Agenda

Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop

Oracle Loader for Hadoop (OLH)

Provides fast and efficient data loading from a Hadoop cluster into a table in an Oracle
Database
Pre-partitions the data if necessary and transforms it into a database-ready format
Sorts records by primary key or user-specified columns before loading the data or
creating output files
Uses the parallel processing framework of Hadoop (MapReduce) to perform these
preprocessing operations
Is a Java MapReduce application that balances the data across reducers to help
maximize performance
Reads from sources that have the data already in a record format, or can split the lines
of a text file into fields

Supports JSON, log, text, sequence, compressed, and Parquet input file formats
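
The idea of balancing data across reducers can be illustrated with a simple greedy assignment: place the largest partitions first, each on the reducer that currently holds the least data. The Python sketch below shows that general technique only; it is not the algorithm that Oracle Loader for Hadoop uses, and the function name, partition names, and sizes are hypothetical.

```python
import heapq


def assign_partitions(partition_sizes, num_reducers):
    """Greedy balancing: give each partition, largest first, to the reducer
    with the smallest load so far."""
    loads = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(loads)
    assignment = {reducer: [] for reducer in range(num_reducers)}
    by_size = sorted(partition_sizes.items(), key=lambda item: -item[1])
    for name, size in by_size:
        load, reducer = heapq.heappop(loads)      # least-loaded reducer
        assignment[reducer].append(name)
        heapq.heappush(loads, (load + size, reducer))
    return assignment


# Hypothetical, skewed partition sizes (for example, in megabytes).
sizes = {"p1": 90, "p2": 40, "p3": 30, "p4": 20}
plan = assign_partitions(sizes, 2)
```

Here the one large partition gets a reducer to itself, while the three small ones share the other, so both reducers process the same amount of data.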

Oracle Loader for Hadoop (OLH)

OLH tasks include:


Offloading expensive data processing from the database server to Hadoop
Working with a range of input data formats
Handling skew in input data to maximize performance
Loading using online and offline modes

Supports JSON, log, text, sequence, compressed, and Parquet input file formats

Copy to Hadoop
Fast copy from database to HDFS
Stored as Oracle Data Pump files
Optionally convert to Parquet, ORC, text
Optimized query with Big Data SQL
Optimized query with Hadoop ecosystem tools
Use with SQL Developer and CLI

CLI Example: ohsh> create hive table hive_schema:HIVE_TAB as select * from ORA_TAB where date > 31-JAN-2016 and less than < 29-FEB-2016



Key Benefits

Fast, secure copy of database data, table partitions to Hadoop


Archive in HDFS: Query with Big Data SQL
Analysis in Hadoop: Query with Hive, Spark
Easy to use UI: SQL Developer, command-line interface
Copied data in Oracle binary format (Oracle Data Pump files)
Optimized query via Big Data SQL
No loss of data precision
Query results of archived data identical to when data was in the database



Agenda

Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop



Oracle SQL Connector for HDFS (OSCH)

OSCH provides read access to HDFS from an Oracle database.

Direct access from Oracle Database
SQL access to Hive and HDFS
Automated generation of external tables to access the data
Access or load data in parallel

[Figure: Oracle SQL Connector for HDFS (OSCH) makes HDFS data available to Oracle Database through external tables]



Agenda

Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop



The New Normal at Oracle Customers

Expansion of Data Management Components Expansion of Programming Environments



Database Strategy for Big Data

[Figure: conventional view of Data Management versus emerging view of Data Management, with Oracle Big Data SQL]



Big Data SQL: Expanding Deployments
Engineered Systems Engineered Systems* Oracle Cloud

Commodity Servers Mixed Deployment Mixed Deployment



Big Data SQL: A New Hadoop Processing Engine

Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
Resource Management (YARN, cgroups)
Storage Layer: Filesystem (HDFS), NoSQL Databases (Oracle NoSQL DB, HBase)



Demo Flow

Start by looking at click data in HDFS, which has rich information about customers' behaviors.
Easily access that data from Oracle Database and apply techniques for making complex documents easy to query.
Query any data, including recommendation data in a NoSQL DB.
Problem: Personally Identifiable Information (PII) data must be safeguarded. Apply the same techniques to safeguard data in HDFS and NoSQL that you use with Oracle Database data.
We can now apply any type of analysis by using rich SQL. We will turn those clicks into customer sessions (something that takes hundreds of lines of Java code). Based on customer segmentation, review sessions that convert to sales. Finally, look at recommendations by genre and see how they drive interest and sales.
Applications that use Oracle REST Data Services (ORDS) can take advantage of all of this. Regardless of the interface, applications can leverage the rich analytics, security, and ability to query any data.



Oracle Datasource for Apache Hadoop

Analyze on Hadoop
Direct, parallel, fast, secure access to master data
Access database tables from Hadoop

[Figure: Hive, Spark, and Impala access database tables through HCatalog, the Storage Handler, and the Input Format]



Oracle Datasource for Apache Hadoop

Implementation

[Figure: Hive external table -> Storage Handler -> Oracle table]

Generate database splits.
Rewrite Hive SQL into Oracle SQL for each split.
Process splits in parallel by Hadoop tasks.
Return matching rows to Hadoop query coordinator.
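
The first two steps above (generating splits and producing one query per split) can be sketched as follows. This Python example only illustrates the general idea with a numeric key range; it is not the split logic of Oracle Datasource for Apache Hadoop, and the function names, the id and amount columns, and the predicate are hypothetical. ORA_TAB is borrowed from the earlier Copy to Hadoop example.

```python
def make_splits(low, high, num_splits):
    """Divide the half-open key range [low, high) into contiguous sub-ranges."""
    step, extra = divmod(high - low, num_splits)
    splits, start = [], low
    for i in range(num_splits):
        # The first `extra` splits take one additional key each.
        end = start + step + (1 if i < extra else 0)
        splits.append((start, end))
        start = end
    return splits


def split_query(table, key_column, predicate, split):
    """Build the SQL text that one parallel task would run for its split."""
    lo, hi = split
    return "SELECT * FROM {} WHERE ({}) AND {} >= {} AND {} < {}".format(
        table, predicate, key_column, lo, key_column, hi
    )


splits = make_splits(0, 10, 3)
queries = [split_query("ORA_TAB", "id", "amount > 100", s) for s in splits]
```

Each query covers a disjoint slice of the table, so the tasks can run in parallel and their results can simply be concatenated.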



Agenda

Batch Loading
Oracle Loader for Hadoop (Oracle Big Data Connector)
Copy to Hadoop
Batch and Dynamic Loading
Oracle SQL Connector for HDFS (OSCH)
Dynamic Access
Oracle Big Data SQL
Oracle Datasource for Apache Hadoop
Integration and Synchronization
Oracle Data Integrator for Hadoop
Oracle GoldenGate for Hadoop

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 5 - 46


Oracle Data Integrator with Big Data
Heterogeneous Integration with Hadoop Environments

Oracle Data Integrator

Loads:
IKM File to Hive
Transforms:
IKM Hive (Transform)
IKM Hive (Control Append)
CKM Hive
RKM Hive
Loads:
IKM File-Hive to Oracle (OLH, OSCH)

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 5 - 47


Oracle GoldenGate for Big Data

Performs real-time replication and synchronization of data


Streamlines real-time data delivery into Big Data formats including Apache Hadoop,
Apache HBase, Apache Hive, Apache Flume, and others

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 5 - 48


Available Resources

Oracle Big Data Connectors Latest Documentation:


http://docs.oracle.com/bigdata/bda46/index.htm
Software User's Guide (includes Oracle DataSource for Hadoop, OD4H):
http://docs.oracle.com/bigdata/bda46/BIGUG/toc.htm
Oracle Big Data SQL 3.0.1 User's Guide:
http://docs.oracle.com/bigdata/bda45/BDSUG/toc.htm
Integrating Oracle Database and Apache Hadoop:
https://www.oracle.com/database/big-data-connectors/index.html
Oracle Technology Network:
http://www.oracle.com/technetwork/topics/bigdata/learnmore/index.html
Connecting Hadoop with Oracle:
https://blogs.oracle.com/bigdataconnectors/

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 5 - 49


Summary

In this lesson, you should have learned how to:


Identify the need to integrate your data
Provide an overview of data unification technologies that are supported by the Oracle
Big Data Management System
Identify the benefits of Oracle Big Data SQL

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 5 - 50


6

Data Discovery and Analysis

Copyright 2016, Oracle and/or its affiliates. All rights reserved.


Objectives

After completing this lesson, you should be able to describe the following products:
Oracle Big Data Discovery
Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-2
Oracle Unified Big Data Management and Analytics Strategy

Aggregate: Connect people to the information they need.
    Big Data Preparation, Data Integrator, GoldenGate, IoT
Manage: Collect, secure, and make data available.
    Hadoop Platform, Big Data SQL, NoSQL Database, Oracle Database
Experiment: Innovate through experimentation with data.
    Big Data Discovery, R on Hadoop, Big Data Spatial and Graph
Analyze and act: Transform the workplace with actionable insights.
    Data Visualization, Business Intelligence, Spatial and Graph, Advanced Analytics

In the cloud and on-premises

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-3
Agenda

Oracle Big Data Discovery


Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-4
Big Data Labs: Fueling Enterprise Innovation

Create innovative products and services.
Identify new customer segments.
Predict new product success.
Optimize pricing.
Detect fraudulent activity.
Manage risk efficiently.

The data lab enables organizations to think and act like startups.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-5
Data Lab: Key Principles

Collects new and existing data as raw materials
Provides broad access to a trusted community
Offers a "do anything" sandbox environment
Enables agile experimentation, ability to fail fast

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-6
Opening Up the Lab to a Broad Community: Difficulties

Data lab is currently the realm of the elusive data scientist.
Raw tools and complex languages impair business analysts and SME productivity.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-7
Unlocking the Data Lab: New Approach

A single intuitive and visual user interface to do the following:

find explore transform discover share

Find and explore Big Data to understand its potential.
Quickly transform and enrich Big Data to make it better.
Unlock Big Data for anyone to discover and share new value.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-8
Oracle Big Data Discovery: The Visual Face of Big Data

find explore transform discover share

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6-9
Catalog (Find)

Access a rich,
interactive catalog of all
data in Hadoop.
Use familiar search and
guided navigation for
ease of use.
See data set
summaries, user
annotation, and
recommendations.
Provision personal and
enterprise data to
Hadoop via self service.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 10


Explore

Visualize all attributes by type.
Sort attributes by
information potential.
Assess attribute statistics,
data quality, and outliers.
Use scratch pad to
uncover correlations
between attributes.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 11


Transform

Perform intuitive, user-driven data wrangling.
Access an extensive library
of powerful data
transformations and
enrichments.
Preview results, undo,
commit, and replay
transforms.
Test on sample data and
then apply to full data set in
Hadoop.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 12


Discover

Join and blend data for deeper perspectives.
Compose project pages
via drag-and-drop.
Use powerful search
and guided navigation to
ask questions.
See new patterns in
rich, interactive data
visualizations.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 13


Share

Share projects,
bookmarks, and
snapshots with others.
Build galleries and tell
Big Data stories.
Collaborate and iterate
as a team.
Publish blended data to
HDFS for use in other
tools.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 14


Multiple Deployment Options for Big Data Discovery

Deploy on Commodity Hardware:
    Install into your existing CDH or Hortonworks Data Platform (HDP) cluster.
    Leverage existing infrastructure and technology standards.
    Take Big Data Discovery to the Data Lake.
Deploy as an Engineered System:
    Run on the Oracle Big Data Appliance.
    Avail of high-performance hardware optimized for Big Data.
    Save 21% in costs and time versus commodity.
Subscribe to the Cloud Service:
    Run Big Data Discovery workloads in Oracle Cloud.
    Experience fully automated lifecycle management.
    Avail of the industry's most secure and complete Big Data Cloud Service.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 15


Oracle Big Data Discovery: A Game Changing Platform

Business Benefits:
    Get value faster: Rapidly turn raw data into actionable insights, leveraged across the enterprise.
    Democratize value from the data lab: Increase the size, diversify the skills, and improve the efficiency of Big Data teams.
Technical Benefits:
    Destroy existing technical barriers: Run natively on Apache Spark cluster for maximum scalability and performance.
    Publish, secure, and leverage: Integrate with Hadoop open standards and leverage the unified Oracle Big Data ecosystem.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 16


Resources

Oracle Big Data Discovery documentation:


http://docs.oracle.com/en/bigdata/
Oracle Big Data Discovery Learning Library on OLL:
https://apexapps.oracle.com/pls/apex/f?p=44785:141:7470999189151::::P141_PAGE_ID,P141_SECTION_ID,P141_PREV_PAGE:157,1816,5

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 17


Agenda

Oracle Big Data Discovery


Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 18


Role of Relationships in Spatial and Graph Analysis

Are things in the same location? Who is the nearest? What tax zone is this in? Where
can I deliver in 35 minutes? What is in my sales territory? Is this built in a flood zone?
Which supplier am I most dependent on? Who is the most influential customer? Do my
products appeal to certain communities? What patterns are there in fraudulent behavior?

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 19


Oracle Big Data Spatial and Graph

Big Data often needs an organizing principle: Data Harmonization.
Big Data analysis is often about relationships, not aggregation.
Big Data platform is economically compelling for working with massive data sets found in spatial and graph workflows.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 20


Who Is Most Important?

Answers from Aggregation
Tabular questions: Well-suited to SQL-like tools
Who spends the most?
Who buys the highest margin goods?
Who is most consistently a top contributor?

Answers from Connectivity
Graph questions: Need something different!
Who is most influential?
Which supplier do I depend on the most?
What is the right product mix for millennials?
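
The contrast between the two kinds of questions can be sketched in Python. The sample data and the function names below are invented for this illustration and do not come from any Oracle API. In-degree is used here only as the simplest possible connectivity measure; richer measures such as PageRank need repeated passes over the edges.

```python
from collections import defaultdict

# Hypothetical sample data, invented for this illustration.
purchases = [("ann", 120.0), ("bob", 300.0), ("ann", 80.0), ("cat", 50.0)]
follows = [("bob", "ann"), ("cat", "ann"), ("dan", "ann"), ("ann", "bob")]


def top_spender(rows):
    """Aggregation: sum a column per customer, as a SQL GROUP BY would."""
    totals = defaultdict(float)
    for customer, amount in rows:
        totals[customer] += amount
    return max(totals, key=totals.get)


def most_followed(edges):
    """Connectivity: count incoming edges (in-degree) for each person."""
    indegree = defaultdict(int)
    for _follower, followed in edges:
        indegree[followed] += 1
    return max(indegree, key=indegree.get)
```

With this data the biggest spender and the most connected person are different customers, which is the point of the slide: the two questions look at different structure in the data.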

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 21


Oracle Big Data Spatial and Graph

Spatial Analysis:
Location Data Enrichment
Proximity and containment analysis, clustering
Spatial data preparation (vector, raster)
Interactive visualization
Property Graph for Analysis:
Social media relationships
E-commerce targeted marketing
Cyber security, fraud detection
IoT, industrial engineering
Multimedia Analysis:
Framework for processing video and image data, such as facial recognition

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 22


What Big Data Problems Can Graphs Address?

Product Recommendation: Recommend the most similar item purchased by similar people.
Influencer Identification: Find people that are central in the given network, such as influencer marketing.
Community Detection: Identify groups of people that are close to each other, such as target group marketing.
Graph Pattern Matching: Find all the sets of entities that match a given pattern, such as fraud detection.

[Diagram labels: customer, items, purchase record; communication stream such as tweets]

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 23


Modeling and Analyzing the Internet of Things

Cyber Security:
Critical / Alternate Path
Analysis:
Community Detection
Network Monitoring
Predictive Analysis
Multiple System Impact Analysis:
Transportation
Utilities
Finance

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 25


Property Graph: Big Data Spatial and Graph
Fast Analytics with Horizontally Scalable Storage
Graph Analytics: In-memory Analytic Engine
    Parallel In-Memory-Based Analyst (PGX)
    39 built-in memory-based graph analysis algorithms
    Property Graph Query Language (PGQL)
    Smart filtering of large graphs
Flexible Interfaces: Graph Data Access Layer API
    REST Web Service: Python, Perl, PHP, Ruby, JavaScript
    Java APIs
    Python, Groovy
    Java, Tinkerpop, Blueprints, Gremlin
    Blueprints and SolrCloud / Lucene
    Apache Lucene and SolrCloud
Massively-Scalable Graph Database: Scalable and Persistent Storage
    Multiple back-ends: NoSQL, HBase, Oracle Database
    Property Graph Storage on Apache HBase and Oracle NoSQL
    Scales securely to tens of billions of nodes/edges

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 26


More Than 35 Graph Functions

Detecting Components and Communities:
    Tarjan's, Kosaraju's, Weakly Connected Components, Label Propagation (with variants), Soman and Narang's
Ranking and Walking:
    PageRank, Personalized PageRank, Betweenness Centrality (with variants), Closeness Centrality, Degree Centrality, Eigenvector Centrality, HITS, Random walking and sampling (with variants)
Evaluating Community Structures:
    Conductance, Modularity, Clustering Coefficient (Triangle Counting), Adamic-Adar
Path-Finding:
    Hop-Distance (BFS), Dijkstra's, Bi-directional Dijkstra's, Bellman-Ford's
Link Prediction:
    SALSA (Twitter's Who-to-follow)
Other Classics:
    Vertex Cover, Minimum Spanning-Tree (Prim's)
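
To give a feel for the simplest entry in the path-finding group, here is a minimal sketch of hop distance computed with breadth-first search. It is plain Python on an adjacency list, not the PGX API; the function name and the sample graph are assumptions made for this example.

```python
from collections import deque


def hop_distance(graph, source):
    """Return the number of hops from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph.get(vertex, ()):
            if neighbor not in dist:  # first visit is the shortest hop count
                dist[neighbor] = dist[vertex] + 1
                queue.append(neighbor)
    return dist


# Hypothetical adjacency list, invented for this example.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances = hop_distance(graph, "a")
```

Vertices that cannot be reached from the source are simply absent from the returned dictionary.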

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 27


Multiple Interfaces for Many Kinds of Users

In-Memory Analyst
35 Built-in Analytics

Graph Database

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 28


Differentiators: Graph

Commercial, supported software


Best of Both Worlds Graph DB
In-memory graph analysis algorithms: Like Neo4J
Distributed graph database model: Like Datastax Titan
Dozens of pre-built in-memory graph analysis algorithms
10 to 50 times faster analytics than competitors' offerings
Analysis of 20 to 30 billion edge graphs in memory on a single BDA node

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 29


Oracle Big Data Spatial and Graph: Spatial Features
Insurance Industry Use Case

[Diagram: Actuarial and Demographic data, Accident data, Call data, and Customer data -> Enrich with postal code -> Categorize by region -> Data Products for Rate Structures, Underwriting/Risk Analysis]

86% of insurance companies agree that analyzing multiple data sources together is crucial to making accurate predictions.
88% agree that linking information by location is key to combining disparate sources of Big Data.
Source: The big data: How data analytics can yield underwriting gold.
Survey conducted by Ordnance Survey and Chartered Insurance Institute, 25 April 2013.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 30


Oracle Big Data Spatial and Graph: Spatial Features

Make developers and data scientists more productive with pre-built componentry and
templates for applications.
Pre-built parallel MapReduce and Spark spatial
algorithms
Raster and vector processing frameworks
Comparison with Big Data Discovery:
Big Data Discovery: An interactive tool
Big Data Spatial and Graph:
A developer-centric framework

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 31


Problems Addressed by Spatial Analysis

Data Harmonization using any location attribute such as address, postal code, latitude/longitude, place name, and so on
Categorization and filtering based on location and proximity
Visualizing and displaying results on a map
Spatial querying and analysis of Hadoop data with SQL
Preparation, validation, and cleansing of spatial and raster data

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 32


Spatial Features

Data enrichment service API (gazetteer)
MapReduce routines for spatial analysis such as distance/proximity, clustering
HTML5 Map Visualization API
Hive SQL API
Spatial processing of data stored in HDFS or NoSQL, raster processing operations and geodetic and Cartesian data
Query from Oracle DB with Big Data SQL and Oracle SQL Connectors for Hadoop
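
As a rough idea of what a distance/proximity routine computes on geodetic (latitude/longitude) data, the following sketch uses the haversine great-circle formula. It is a standalone Python illustration, not the Big Data Spatial and Graph API; the function names and the fixed mean Earth radius are assumptions for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(points, center, radius_km):
    """Proximity filter: names of points no farther than radius_km from center."""
    clat, clon = center
    return [name for name, lat, lon in points
            if haversine_km(clat, clon, lat, lon) <= radius_km]
```

For Cartesian (projected) data the same filter would use ordinary Euclidean distance instead of the haversine formula.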

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 33


Oracle Big Data Spatial and Graph: Benefits

Location-based insights, streamlining of Big Data development


Commercial, supported software
Componentry to boost efficiency of data scientists and developers,
time saving on custom development
New insight into location-based patterns and trends across your
entire Big Data volumes through spatial analytics
Access to the inherent location relationships across disparate Big
Data sources through harmonization and enrichment
Extension of existing Big Data applications to complement
relational environments
Multiple deployment options: Runs on commodity hardware or
BDA, both on-premises or in the cloud

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 35


Multiple Deployment Options for Big Data Spatial and Graph

Deploy on Commodity Hardware:
    Install into your existing CDH or Hortonworks Data Platform (HDP) cluster.
    Leverage existing infrastructure and technology standards.
    Take graph, spatial, and multimedia analysis to the Data Lake.
Deploy as an Engineered System:
    Run on the Oracle Big Data Appliance.
    Avail of high-performance hardware optimized for Big Data.
    Save 21% in costs and time versus commodity.
Subscribe to the Cloud Service:
    Run Big Data Spatial and Graph workloads in Oracle Cloud.
    Experience fully automated lifecycle management.
    Avail of the industry's most secure and complete Big Data Cloud Service.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 36


Resources

Oracle Big Data Spatial and Graph documentation:


http://docs.oracle.com/bigdata/bda46/index.htm
Oracle Learning Library (OLL): Oracle Big Data Spatial and Graph - OBE Series
https://apexapps.oracle.com/pls/apex/f?p=44785:24:11930221486867::NO:RP,24:P24_CONTENT_ID,P24_PREV_PAGE:13442,2
Oracle Big Data Spatial and Graph on Oracle.com:
www.oracle.com/database/big-data-spatial-and-graph
Oracle Technology Network (OTN):
http://www.oracle.com/technetwork/database/database-technologies/bigdata-spatialandgraph

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 37


Resources

Hands On Lab for Big Data Spatial and Graph Property HOL/Demo:
http://www.oracle.com/technetwork/database/options/spatialandgraph/learnmore/biwa-2016-more-session-information-2889878.html
Blog (technical examples and tips):
https://blogs.oracle.com/bigdataspatialgraph/
Oracle Big Data Lite Virtual Machine (a free sandbox to get started):
www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 38


Agenda

Oracle Big Data Discovery


Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 39


Oracle R Advanced Analytics for Hadoop
[Diagram: R Client -> Hadoop]

R algorithms: Neural*, GLM*, LM, kMeans, NMF, LMF, data movement, sampling, statistics
Parallel jobs on Hadoop
Pre-packaged predictive analytics algorithms
Familiar interface: R (to data scientists)
Customer: Credit behavior evaluation
Faster analytics, simpler solution, and better behavior model

* enabled for over 200x speedup for some queries

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 40


Oracle XQuery for Hadoop
Massively Scalable XQuery Processing in Hadoop

for $ln in text:collection()
let $f := tokenize($ln, ",")
where $f[1] = "x"
return text:put($f[2])
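
Read as a filter over delimited text, the query above keeps each line whose first field equals x and writes out its second field. The same logic can be sketched in plain Python for comparison. This assumes a comma separator and a literal key of x, both reconstructed from the slide, and the function name is invented for this example. Note that XQuery sequences are 1-based, so `$f[1]` and `$f[2]` correspond to Python indexes 0 and 1.

```python
def select_second_field(lines, key="x", sep=","):
    """Yield field 2 of every line whose field 1 equals key."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 2 and fields[0] == key:
            yield fields[1]
```

In Oracle XQuery for Hadoop the equivalent work is distributed across the cluster; this single-process version only shows what each line goes through.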

Oracle Database

Oracle NoSQL Database

Text Avro
JSON XML

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 41


Resources

Oracle Big Data Connectors documentation:


http://docs.oracle.com/bigdata/bda46/index.htm
Oracle Learning Library (OLL): Oracle Big Data Administration Series
https://apexapps.oracle.com/pls/apex/f?p=44785:24:101837840220695::NO:24:P24_CONTENT_ID,P24_PREV_PAGE:12361,29

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 42


Summary

In this lesson, you should have learned about the following products:
Oracle Big Data Discovery
Oracle Big Data Spatial and Graph
Oracle Advanced Analytics (OAA)
Oracle XQuery for Hadoop

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 6 - 43


7
Introduction to the Oracle Big Data
Appliance

Copyright 2016, Oracle and/or its affiliates. All rights reserved.


Objectives

After completing this lesson, you should be able to identify the benefits of the Oracle Big
Data Appliance (BDA), such as:
Simplified deployment of big data production clusters
High performance
Secure
Manageable
Open

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-2
Agenda

Learn about the Oracle BDA


Configure and install the Oracle BDA
Complete the Oracle BDA Site Checklists
Run the Oracle BDA Configuration Generation Utility
Deploy the Oracle BDA software using the mammoth utility
Manage the Oracle BDA
Secure the Oracle BDA

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-3
Big Data Management System

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-4
Oracle BDA

An engineered system of hardware and software that delivers:


A complete and optimized Hadoop/NoSQL platform
Single-vendor support for both hardware and software
An easy-to-deploy solution
Tight integration with the Oracle Database

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-5
Core Design Principles for BDA

Simplify access to all data, which provides the following benefits:


No bottlenecks
Full stack install and upgrades
Simplified management
Cluster growth
Critical node migration
Always highly available
Always secure
Very competitive price point

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-6
Configuring and Installing the Oracle BDA: Road Map

Pre-Delivery of BDA Procedures:
    Review the Oracle BDA site requirements.
    Complete the Oracle BDA Site Checklists.
    Run the Oracle BDA Configuration Generation Utility.
Hardware Installation and Base Configuration:
    Review the Safety Guidelines.
    Unpack the Oracle BDA.
    Place the BDA in its allocated space.
    Power on the system.
    Configure an Oracle BDA full or starter rack.
Software Installation and Configuration:
    Run the mammoth utility to deploy the software on your Oracle BDA by using the Oracle BDA Configuration Generation Utility-generated files.
Upgrades and Expansions:
    Upgrade BDA Mammoth.
    Expand an Oracle BDA Starter Rack.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-7
Configuring and Installing the Oracle BDA: Key Players

Pre-Delivery of BDA Procedures: Customer
Hardware Installation and Base Configuration: Oracle Field Engineer
Installation and Configuration: Oracle BDA Install Coordinator, Oracle ACS
Upgrades and Expansions: Customer and Oracle ACS

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-8
Key Definitions

Term: Description
Oracle BDA Site Checklists: Checklists that the customer must complete before the installation of Oracle BDA.
Oracle BDA Configuration Generation Utility: Enables you to provide the information, such as IP addresses and software preferences, that is required for deploying Oracle BDA. After guiding you through a series of pages, the utility generates a set of configuration files. These files help automate the deployment process and ensure that Oracle BDA is configured to your exact specifications.
Base Image: Includes the operating system, drivers, firmware, and so on.
Mammoth Software Deployment Bundle: Contains the installation files and the Base Image. Before you install the software, you must use Oracle BDA Configuration Generation Utility to generate the configuration files.
Mammoth Utility: A command-line utility for installing and configuring the Oracle BDA software.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7-9
Accessing the Big Data Documentation Landing Page
https://docs.oracle.com/en/bigdata/

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 10


Completing the Oracle BDA Site Checklists
http://docs.oracle.com/bigdata/bda45/index.htm

Download and complete the 16-page Site Checklists PDF file.
Complete the Pre-installation Site
Evaluation form (page 1).
Complete the rest of the
checklists.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 11


Completing the Oracle BDA Site Checklists Prior to Configuring and
Installing the Software
Complete the following checklists to ensure that the site is prepared for Oracle BDA:
System Components
Data Center Room
Data Center Environment
Access Route
Facility Power
Safety
Logistics
Network Configuration
Auto Service Request
Oracle Enterprise Manager
Reracking

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 12


Using the Oracle BDA Configuration Generation Utility

Collects all of the information required to install and configure the Oracle BDA Software.
Acquires information from you, such as IP addresses, security information, software
preferences, etc.
After guiding you through a series of pages, the utility generates a set of configuration
files.
The generated files help automate the deployment process and ensure that Oracle BDA
is configured to your specifications.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 13


The Oracle BDA Configuration Generation Utility Pages

Welcome
Customer Details
Hardware Selection
Rack Details
Networking
Administration Network
General Network Properties
Review and Edit Details
Define Clusters
Cluster 1, 2,
Client and InfiniBand Network
Complete

[Callouts: general information, operating system, and network configuration setup; Mammoth (software) setup]

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 14


Downloading and Extracting the Configuration Generation Utility
1 http://www.oracle.com/technetwork/database/bigdata-appliance/downloads/index.html

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 15


Running the Oracle BDA Configuration Generation Utility

Run in MS-Windows: (double-click)
Run in Linux: $ ./bdaconf.sh

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 16


The Welcome Screen

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 17


Selecting the Last Option in the Welcome Screen

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 18


Customer Details

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 19


Hardware Selection

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 20


Rack Details and Networking Pages

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 21


Define Clusters (Cluster 1: CDH Cluster)


Configuring cluster 1
as a CDH cluster.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 22


Define Clusters (Cluster 2: NoSQL Cluster)


Configuring cluster 2
as a NoSQL cluster.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 23


Cluster 1

Use the Cluster 1 (bda1h1) page to enter and select from the following regions:
User/Groups
Client Network
Connectors on Rack 1
InfiniBand Network
Big Data SQL
Big Data Connectors
Oracle Spatial and Graph
MIT Kerberos
Active Directory Kerberos
HDFS Transparent Encryption
Audit Vault
Auto Service Request
Enterprise Management Cloud Control
Email Alerting

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 24


Cluster 2

Use the Cluster 2 (bda1h2) page to enter and select from the following regions:
User/Groups
Client Network
Connectors on Rack 1
InfiniBand Network
Auto Service Request
Installed Components:
Enterprise Edition, or
Community Edition
Oracle NoSQL Configuration

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 25


Client and InfiniBand Network and Complete Pages

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 26


Complete Deployment Assistant and Generate Files

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 27


Viewing the Generated Directories and Files

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 28


Viewing the Generated Directories and Files

bda1 folder:
    bda1-BdaDeploy.json (obsolete in Oracle BDA V4.2 and later versions)
    bda1-network.json
bda1h1 folder:
    bda1h1-config (used by Mammoth during the software installation)
bda1h2 folder:
    bda1h2-config
bda-20160726-192435.zip

Run bda-preinstall-checkip.sh to perform validation checks on the network configuration values.

Send the .zip file to your Oracle contact.
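
To illustrate the kind of check that validating network configuration values can involve, here is a small Python sketch. It is not the bda-preinstall-checkip.sh script and does not claim to reproduce its checks; the `check_ips` function, the host-to-address mapping, and the three rules (valid address, inside the expected subnet, no duplicates) are assumptions chosen for this example.

```python
import ipaddress


def check_ips(hosts, subnet):
    """Return a list of problems found in a {hostname: ip_string} mapping."""
    network = ipaddress.ip_network(subnet)
    problems, seen = [], {}
    for host, raw in hosts.items():
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            problems.append(f"{host}: {raw!r} is not a valid IP address")
            continue
        if ip not in network:
            problems.append(f"{host}: {ip} is outside {network}")
        if ip in seen:
            problems.append(f"{host}: {ip} is already assigned to {seen[ip]}")
        seen.setdefault(ip, host)  # remember the first host that used this IP
    return problems
```

An empty list means that none of the three illustrative rules was violated; catching such problems before the hardware arrives is the reason the generated values are validated up front.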

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 29


Agenda

Learn about the BDA


Configure and install the Oracle BDA
Complete the Oracle BDA Site Checklists
Run the Oracle BDA Configuration Generation Utility
Deploy the Oracle BDA software using the mammoth utility
Manage the Oracle BDA
Secure the Oracle BDA

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 30


Oracle BDA Provides Flexible Cluster Configurations

A CDH cluster has a minimum of 3 nodes, which is ideal for development, or 6 nodes for
production (recommended).
A starter rack contains 6 nodes.
A full rack contains 18 nodes.
The elastic configuration enables you to expand your system in 1-node increments by
adding a BDA X6-2 High Capacity (HC) node into a 6-node starter rack.
A rack can be multi-tenant. For example, you can have multiple clusters on a single
rack.
You can also have a single cluster spanning multiple racks.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 31


BDA X6-2 Hardware

[Diagram labels: Full Rack; Starter Rack; HC Node plus InfiniBand Infrastructure]

BDA Node:
96 TB HDD per node
256 GB memory per node (expandable to a maximum of 768 GB)

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 32


BDA X6-2 Integrated and Optional Software
Integrated Software Optional Software

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 33


Oracle BDA Mammoth Software Deployment Bundle

The Mammoth software deployment bundle contains the installation files and the OS
base image.
You use the same Oracle BDA Mammoth Software Deployment Bundle to do the
following:
Install the software on a new rack
Add servers to a cluster
Upgrade the software on the Oracle BDA
Change the configuration of optional software
Reinstall the base image
Install a one-off patch

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 34


Using the Oracle BDA mammoth Utility

mammoth is the command-line utility that deploys software on Oracle BDA (across all
servers in the rack) by using the files generated by the BDA Configuration Generation Utility.
You can use Mammoth to:
Set up the cluster by using the generated configuration files
Create a cluster on one or more racks
Create multiple clusters on an Oracle BDA rack
Extend a cluster to new servers
Update a cluster with new software

./mammoth -i bda1h1

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 35


Using the Oracle BDA mammoth Utility

In addition to installing the software across all servers in the rack, the mammoth Utility:
Creates the required user accounts
Starts the correct services
Sets the appropriate configuration parameters
You must run the mammoth utility once for each rack.
For additional information on installing the software, see the Oracle BDA Owner's
Guide.

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 36


Downloading the Mammoth Software Deployment Bundle: Overview

1. Log in to MOS at support.oracle.com by using your SSO credentials.
2. Access and bookmark the Install/Upgrade/Configure Oracle Big Data Appliance (Doc ID 1445745.2) from MOS.
3. Access and bookmark the Oracle Big Data Appliance Patch Set Master Note (Doc ID 1485745.1) from MOS.
4. Access the Mammoth Bundle software for new installs (Oracle Linux 6), and upgrades (Oracle Linux 5) from MOS.

[Callouts: Download Mammoth bundle patch # 21109091 Oracle BDA Base Image for Oracle Linux 6. Download Oracle BDA Mammoth (Software Deployment Bundle) for New CDH Installations with OL6 Base Image. Table 1. Installation Doc.]

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 37


Searching for Document ID 1445745.2 on MOS

Bookmark this landing page in


your web browser

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 38


Oracle BDA Patch Set Master Note Page

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 39


Oracle BDA Mammoth Software Deployment Bundle Installation
Document (Doc ID 2011898.1)

Copyright 2016, Oracle and/or its affiliates. All rights reserved. 7 - 40


BDA CDH Cluster Service Layout After Deployment

Active NameNode
Stand-by NameNode

Active ResourceManager
Stand-by ResourceManager

Runs all mammoth steps (bda1h1 is the cluster name):

# ./mammoth -i bda1h1



Example: Successful Big Data Systems Grow

12-node BDA for Production

Day 1
Hadoop HA and Security Set-up
Ready-to-Load Data

Runs all mammoth steps (bda1h1 is the cluster name):

# ./mammoth -i bda1h1

[Diagram: Rack 1 (name: bda1) with node01 through node12. Cluster name: bda1h1]



BDA Service Layout CDH Cluster Only

Legend:
Active NameNode
Stand-by NameNode
Active ResourceManager
Stand-by ResourceManager

[Diagram: Rack 1 (name: bda1) with node01 through node12. NN1 is on node01; NN2, RM1, and RM2 are placed on other nodes in the lower part of the rack. Cluster name: bda1h1]



Successful Big Data Systems Grow
Day 90
Add 12 New Nodes across two racks
Cluster expansion with a single command:

./mammoth -e bda1node13, bda1node14, ..., bda2node06

This expansion automatically optimizes HA setup across multiple racks.
Because of uniform nodes and IB networking, no data is moved.

[Diagram: Rack 1 (name: bda1) with node01 through node18, NN1 on node01; Rack 2 (name: bda2) with node01 through node06, NN2 on node01. Cluster name: bda1h1]
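
The server names in the expansion follow a rack-plus-node pattern. The following Python sketch generates such a list. It assumes (this is not stated on the slide) that six servers are added to rack bda1 and six in the new rack bda2, and that names are formed as the rack name followed by "node" and a two-digit number. The helper name expansion_hosts is hypothetical. The sketch only builds names; it does not run mammoth.

```python
def expansion_hosts(racks):
    """Return host names for the new servers, rack by rack.

    `racks` maps a rack name to an inclusive (first, last) node range.
    The <rack>node<NN> pattern is inferred from the names on the slide.
    """
    hosts = []
    for rack, (first, last) in racks.items():
        for n in range(first, last + 1):
            hosts.append(f"{rack}node{n:02d}")
    return hosts


# Assumed split: six servers added to bda1 and six in the new rack bda2.
new_hosts = expansion_hosts({"bda1": (13, 18), "bda2": (1, 6)})
print(len(new_hosts))
print(new_hosts[0], new_hosts[-1])
```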



Automatic Failover of the NameNode
Critical Node Failure => Active NameNode
Automatic Failover to other (Standby) NameNode
Automatic Service Request to Oracle for HW Failure

[Diagram: Rack 1 with node01 through node18, NN1 on node01; Rack 2 with node01 through node06, NN2 on node01. Cluster name: bda1h1]
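
The failover behavior can be pictured with a toy model. The following Python sketch is illustrative only: the NameNodePair class is hypothetical and is not an HDFS or BDA interface. It shows the standby being promoted when the active host fails, which leaves the cluster without a standby until HA is restored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameNodePair:
    """Toy model of an active/standby NameNode pair (not an HDFS API)."""
    active: Optional[str]
    standby: Optional[str]

    def fail(self, host):
        """Handle a host failure; return True if a failover happened."""
        if host == self.active:
            # The standby is promoted; HA is lost until a new standby exists.
            self.active, self.standby = self.standby, None
            return True
        if host == self.standby:
            self.standby = None
        return False


pair = NameNodePair(active="bda1node01", standby="bda2node01")
failed_over = pair.fail("bda1node01")
print(failed_over, pair.active, pair.standby)
```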



Restoring HA and Reinstating a Node
Restore HA with a single command:

bdacli admin_cluster migrate bda1node01

Reinstate the repaired node with a single command:

bdacli admin_cluster reprovision bda1node01

[Diagram: Rack bda1 with node01 through node18, NN1 on node01; Rack bda2 with node01 through node06, NN2 on node01]



Agenda

Learn about the Oracle BDA
Configure and install the Oracle BDA
Complete the Oracle BDA Site Checklists
Run the Oracle BDA Configuration Generation Utility
Deploy the Oracle BDA software using the mammoth utility
Manage the Oracle BDA
Cloudera Manager
Oracle Enterprise Manager Cloud Control BDA Plug-in
Secure the Oracle BDA



Administering the Oracle BDA: Overview

Once your CDH cluster is set up, you may want to update the configuration based on
specific needs, such as adding a new Impala or HBase service.
You can use Cloudera Manager to manage the Hadoop cluster.
You can use Enterprise Manager to monitor the BDA, similar to how you would manage
your other Oracle products.



What is Cloudera Manager?

You can use Cloudera Manager to perform the following administrative tasks:
Manage cluster configuration
Monitor hosts, jobs, events, and services
Start and stop services
View detailed performance metrics
Monitor the health of the system
Set up resource management
Track resource usage
Manage Hadoop security
Generate alerts



Cloudera Manager Web Console (Main Page)
http://scaj31bda03:7180

CDH Cluster name



Oracle Enterprise Manager BDA Plug-in

The Enterprise Manager BDA plug-in:
Enables organizations to manage the BDA using a consistent management framework
Monitors both hardware and software performance and manages incidents
Integrates with and complements Cloudera Manager, which provides detailed cluster
configurations



The BDA Network Page: Software Overview

The BDA Network 1 page contains the Hardware Overview and the Software Overview
sections. Here is the Software Overview across clusters. You can view the services
that are running across the nodes and their status.



The BDA Network Page: Schematic Hardware Overview

The Hardware Schematic section provides you with a quick view of the status of each
HW component on the BDA.



Viewing the BDA Hardware and Software Components



Deployment of Secured Clusters on Oracle BDA

1. Authentication
2. Authorization
3. Auditing
4. Encryption (at rest and over the network)



BDA Secure Installation

Mammoth automates the setup of a secure cluster as follows:
Installs and configures Kerberos for strong authentication
Installs and configures Sentry to manage authorization
Configures auditing with Cloudera Navigator
Configures both data-at-rest and network encryption



Authorization

With strong authentication, authorization rules can be meaningful.
HDFS ACLs prevent unauthorized file access.
Access to Hive metadata is controlled by Sentry.



Sentry Authorization Features

Secure Authorization
Ability to control access to data and/or privileges on data for authenticated users
Fine-grained Authorization
Ability to give users access to a subset of data
Includes access to a database, URI, table, or view
Role-based Authorization
Ability to create or apply template-based privileges based on functional roles
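
The role-based and fine-grained ideas above can be sketched as a small lookup. The following Python example is a toy model only: every role, group, and object name is hypothetical, and the code does not call the Sentry API. It assumes that privileges are attached to roles and that roles are attached to groups of authenticated users.

```python
# Toy role-based authorization model. All role, group, and object names are
# hypothetical; this is not the Sentry API.
ROLE_PRIVILEGES = {
    "analyst": {("SELECT", "table", "sales.orders")},
    "etl": {
        ("SELECT", "table", "sales.orders"),
        ("INSERT", "table", "sales.orders"),
        ("ALL", "uri", "hdfs:///landing/sales"),
    },
}
GROUP_ROLES = {"bi_users": {"analyst"}, "loaders": {"etl"}}


def is_allowed(groups, action, obj_type, obj_name):
    """Return True if any role granted to the user's groups permits the action."""
    for group in groups:
        for role in GROUP_ROLES.get(group, ()):
            for priv_action, priv_type, priv_name in ROLE_PRIVILEGES[role]:
                same_object = (priv_type, priv_name) == (obj_type, obj_name)
                if same_object and priv_action in (action, "ALL"):
                    return True
    return False


print(is_allowed(["bi_users"], "SELECT", "table", "sales.orders"))
print(is_allowed(["bi_users"], "INSERT", "table", "sales.orders"))
```

A user in the hypothetical bi_users group can read the table but cannot insert into it, which is the "subset of data" behavior that fine-grained authorization describes.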



Auditing: Overview

Security requires auditing.
Who did what, and was it allowed?
Authentication is a prerequisite: it establishes who acted.
Authorization is a prerequisite: it establishes whether the action was allowed.



Auditing Using Cloudera Navigator

Cloudera Navigator:
Provides a deep level of auditing in Hadoop
Does not include auditing data from other sources
Includes lineage analysis
Cloudera Navigator audits the following activities:
HDFS data accessed through HDFS, Hive, HBase, and Impala services
Hive, HBase, and Impala operations
Hive metadata definition
Sentry access



Cloudera Navigator Reporting

Captures details of Hadoop activities across services
Easily identifies improper access attempts
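
As a rough illustration of spotting improper access attempts, the following Python sketch counts denied operations per user. The records and their field names are hypothetical and do not reflect the Cloudera Navigator schema or its reporting interface.

```python
from collections import Counter

# Hypothetical audit records; the field names are illustrative and are not
# the Cloudera Navigator schema.
events = [
    {"user": "alice", "service": "hive", "operation": "QUERY", "allowed": True},
    {"user": "bob", "service": "hdfs", "operation": "open", "allowed": False},
    {"user": "bob", "service": "hdfs", "operation": "delete", "allowed": False},
    {"user": "carol", "service": "impala", "operation": "QUERY", "allowed": True},
]


def denied_by_user(records):
    """Count denied operations per user."""
    return Counter(r["user"] for r in records if not r["allowed"])


denied = denied_by_user(events)
print(denied.most_common())
```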



Encryption

Oracle BDA automatically configures two types of encryption:
Network encryption
Data-at-rest encryption



Network Encryption

Oracle BDA supports network encryption for key activities, thereby preventing network
sniffing between computers. Mammoth automatically configures:
Cloudera Manager Server communicating with Agents
Hadoop HDFS data transfers
Hadoop internal RPC communications
Cloudera Manager web interface
Hadoop web UIs and web services
Hadoop YARN/MapReduce shuffle transfers



Data-at-Rest Encryption

On-disk encryption protects data at rest. HDFS Transparent Encryption enables:
Configuration of selected HDFS folders to transparently encrypt their files and subdirectories
Protection of data from being viewed or copied outside of the Hadoop file system
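
Because encryption applies to selected folders and everything beneath them, a path is covered whenever one of its ancestors is a selected folder. The following Python sketch shows that containment check. The folder names and the is_encrypted helper are hypothetical, and the code does not talk to HDFS.

```python
from pathlib import PurePosixPath

# Hypothetical folders selected for encryption.
ENCRYPTED_FOLDERS = [PurePosixPath("/data/secure"), PurePosixPath("/user/hr")]


def is_encrypted(path):
    """Return True if `path` is a selected folder or lies beneath one."""
    p = PurePosixPath(path)
    return any(p == zone or zone in p.parents for zone in ENCRYPTED_FOLDERS)


print(is_encrypted("/data/secure/2016/part-00000"))
print(is_encrypted("/data/public/readme.txt"))
```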



Resources

Oracle Big Data documentation:
http://docs.oracle.com/en/bigdata/
Oracle Big Data Appliance Online Documentation Library:
http://docs.oracle.com/bigdata/bda46/index.htm
Oracle Big Data Administration Series on OLL:
https://apexapps.oracle.com/pls/apex/f?p=44785:24:101837840220695::NO:24:P24_CONTENT_ID,P24_PREV_PAGE:12361,29
Oracle Big Data Learning Library:
https://apexapps.oracle.com/pls/apex/f?p=44785:141:111624808872538::::P141_PAGE_ID,P141_SECTION_ID,P141_PREV_PAGE:27,615,5



Summary

In this lesson, you should have learned how to use the Oracle BDA to take full advantage
of the following benefits:
Simplified deployment of big data production clusters
High performance
Secure
Manageable
Open



8
Introduction to Oracle Big Data Cloud Service (BDCS)



Objectives

After completing this lesson, you should be able to:
Define Oracle Big Data Cloud Service (BDCS)
Identify the key features and benefits of the Oracle BDCS
Create and modify an Oracle BDCS instance (Hadoop Cluster)
Use Cloudera Manager to monitor the Oracle BDCS instance

Oracle Big Data Cloud Service (BDCS)

Oracle BDCS delivers the power of Hadoop as a secure, automated, elastic service,
which can also be fully integrated with existing enterprise data in Oracle Database.
A subscription to Oracle BDCS gives you access to the resources of a pre-configured
Oracle Big Data environment including CDH and Apache Spark.
Use BDCS to capture and analyze the massive volumes of data generated by social
media feeds, email, web logs, photographs, smart meters, sensors, and similar devices.
When you subscribe to Big Data Cloud Service:
You can select between 3 and 60 nodes in one-node increments
You can also burst by adding or removing up to 192 OCPUs (32 compute-only
nodes)
Oracle manages the whole hardware and networking infrastructure as well as the
initial setup, while you have complete administrator's control of the software
All servers in a BDCS instance form a cluster.

Oracle Big Data Cloud Service: Key Features

Dedicated Compute Shapes with Direct Attached Storage:
Start as small as 3 nodes and grow seamlessly.
Hadoop, Spark delivered as an automated Cloud Service:
Cloudera Enterprise Data Hub Edition 5.x
Oracle Big Data Connectors
Oracle Big Data Spatial and Graph
Oracle Data Integrator Enterprise Edition
Platform for new Big Data Services:
Big Data Discovery
Big Data SQL (coming soon)

Software included

Oracle Big Data Cloud Service: Benefits

Dedicated:
Dedicated instances that deliver high performance
Elastic:
Scale up and down as needed.
Secure:
Secure Hadoop Cluster out of the box
Comprehensive Analytic software toolset:
Use the latest advances in Big Data processing.
Unify data processing with Big Data SQL.

Elasticity: Dedicated Compute Bursting

Key Features
Cluster Burst nodes provide:
Self service, on-demand from Cluster Service Manager
Large expansion with 32 OCPUs and 256 GB of memory
Expansion nodes automatically instantiated as cluster nodes
Bursting nodes that share InfiniBand fabric
Hourly billing rates
Retain Dedicated Compute for performance
Key Benefits
Flexibility
Consistent high performance
[Diagram labels: Burst Nodes; Permanent Nodes; 32 OCPU, 256 GB RAM, 48 TB Storage]
Automated Service

Ease-of-use of a Managed Service:
Full stack (OS to Hadoop) management
Single command patching, upgrading
Complete Cloud dashboards
Usage, uptime, users, and so on
VPN networking

Flexibility of an Un-Managed Service:
Full access to the system
High performance
Runs same workloads as On-Prem
Keep On-Prem and Cloud in sync.
Runs any third-party software on BDCS
Use your favorite developer tools.

Security Made Easy

Key Features
Kerberos-ready cluster out of the box:
Apache Sentry enabled on secure clusters
Built-in data encryption:
At-rest through HDFS encryption
In-flight for all phases within Hadoop and Spark
Encrypted traffic to all client tools
VPN service
Key Benefits
Reduced risk
Faster time-to-value

Comprehensive Analytics Toolset Included

Key Features
Base platform with all Cloudera and Spark tools
Big Data Connectors provides:
Scale-out R capabilities
Scale-out Complex Document Parsing
Big Data Spatial and Graph provides:
Property Graph In-memory Engine
Property Graph Pre-built Analytics
Key Benefits
Faster time-to-value
Lower overall cost

Comprehensive Data Integration Toolset Included

Key Features
Base platform with all Cloudera and Spark tools
Big Data Connectors provides:
Oracle Data Integrator Enterprise Edition
Specific Loaders for Oracle Database integration
Oracle Loader for Apache Hadoop
Oracle SQL Connector for Apache Hadoop
Additional licensing can provide ODI EE Big Data
capabilities
Key Benefits
Faster time-to-value
Lower overall cost



Oracle Cloud at Customer

Helps conform to business and government security and compliance requirements
Same PaaS and IaaS hardware and software as Oracle Public Cloud
Managed by Oracle and delivered as a service behind your firewall
Same cost-effective subscription pricing model as Oracle Cloud
[Diagram labels: IaaS, Database, Big Data]



Big Data Deployment Models: Choices

Big Data Appliance: Customer Data Center, Purchased, Customer Managed
Big Data Cloud Machine: Customer Data Center, Subscription, Oracle Managed
Big Data Cloud Service: Oracle Cloud, Subscription, Oracle Managed



Creating a New Instance and a New CDH Cluster: Demo

1. Purchase an Oracle BDCS subscription from an Oracle Sales Representative.
2. Log in to an account to create the Big Data instance for your subscription. This displays
the Oracle Cloud My Services dashboard page.
3. Click Create Instance > Big Data to create a Big Data service instance. This step
allocates the resources for your cluster, including the number of nodes in your
subscription. Important: This step does not create a cluster.
4. In the Create New Oracle Big Data Cloud Service Instance (Create Service Instance
wizard), specify the Instance Details, and then click Create Service Instance.
5. Use the BDCS Service Console to create the CDH cluster.

Note: The screens used in this lesson were current when the lesson was developed;
however, they might not match the screens that you see.



Purchasing an Oracle BDCS Subscription



Signing In to the Oracle Cloud



Signing In to the Oracle Cloud



Creating a Service Instance

Account Name Cloud Service Type



Creating a Service Instance: Instance Details Page

Cloud Service Instance Name: bursting1



Creating a Service Instance: Create Service Instance Page



Creating a Service Instance: Overview Tab



Viewing the New Service Details



Creating a New Cluster



Create Instance Wizard: Service Details



Create Instance Wizard: Security



Create Instance Wizard: Storage



Create Instance Wizard: Confirmation



Service Instances Page

Started clusters
Each node has 32 OCPUs, 256 GB RAM, and 48 TB storage; therefore, this new
three-node cluster has the following resources:
96 OCPUs (32 x 3 nodes)
768 GB RAM (256 x 3 nodes)
144 TB HDFS storage (48 x 3 nodes)
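
The totals above are the per-node figures multiplied by the node count. The following Python sketch reproduces that arithmetic. The function and constant names are hypothetical, and the 3-to-60 bound is taken from the subscription limits stated earlier in this lesson.

```python
# Per-node shape as stated on the slide.
OCPUS_PER_NODE = 32
RAM_GB_PER_NODE = 256
STORAGE_TB_PER_NODE = 48


def cluster_resources(nodes):
    """Return total OCPUs, RAM (GB), and storage (TB) for a cluster."""
    if not 3 <= nodes <= 60:
        raise ValueError("node count must be between 3 and 60")
    return {
        "ocpus": nodes * OCPUS_PER_NODE,
        "ram_gb": nodes * RAM_GB_PER_NODE,
        "storage_tb": nodes * STORAGE_TB_PER_NODE,
    }


print(cluster_resources(3))
```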



Using Cloudera Manager (CM) to Manage the BDCS Instance
You can use CM to manage the BDCS instance (Hadoop Cluster) similar to how it was
used for managing the Oracle BDA.



Resources

Oracle Big Data Cloud Service documentation:
http://docs.oracle.com/cloud/latest/bigdata-cloud/index.html
Oracle Big Data Cloud Service Tutorials:
http://docs.oracle.com/cloud/latest/bigdata-cloud/bigdata-cloud-tutorials.html
Oracle Cloud Compute Site:
https://cloud.oracle.com/opc/iaas/ebooks/Oracle_Compute_Cloud_Service.pdf
Oracle Cloud Home page:
https://cloud.oracle.com/home



Summary

In this lesson, you should have learned how to:
Define Oracle Big Data Cloud Service (BDCS)
Identify the key features and benefits of the Oracle BDCS
Create and modify an Oracle BDCS instance (Hadoop Cluster)
Use Cloudera Manager to monitor the Oracle BDCS instance
