Apache Hadoop: A Course for Undergraduates

Lecture 8

Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hadoop Tools for Data Acquisition

Chapter 8.1

Data Acquisition with Hadoop

- How to load data from an existing RDBMS into HDFS using Sqoop
- How to manage real-time data such as log files using Flume

Chapter Topics

Hadoop Tools for Data Acquisition
- Loading Data into HDFS from an RDBMS Using Sqoop
- Managing Real-Time Data Using Flume

Importing Data from an RDBMS to HDFS

- Typical scenario: data stored in an RDBMS is needed in a MapReduce job
  - Lookup tables
  - Legacy data
- Possible to read directly from an RDBMS in your Mapper
  - Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
  - In practice: don't do it!
- Better idea: use Sqoop to import the data into HDFS beforehand

(Diagram: data flows from the RDBMS through sqoop into HDFS)

Sqoop: SQL to Hadoop (1)

- Sqoop: open source tool originally written at Cloudera
  - Now a top-level Apache Software Foundation project
- Imports tables from an RDBMS into HDFS
  - Just one table
  - All tables in a database
  - Just portions of a table
    - Sqoop supports a WHERE clause
- Uses MapReduce to actually import the data
  - Throttles the number of Mappers to avoid DDoS scenarios
    - Uses four Mappers by default
    - Value is configurable
- Uses a JDBC interface
  - Should work with virtually any JDBC-compatible database

Sqoop: SQL to Hadoop (2)

- Imports data to HDFS as delimited text files or SequenceFiles
  - Default is a comma-delimited text file
- Can be used for incremental data imports
  - First import retrieves all rows in a table
  - Subsequent imports retrieve just rows created since the last import
- Generates a class file which can encapsulate a row of the imported data
  - Useful for serializing and deserializing data in subsequent MapReduce jobs

Custom Sqoop Connectors

- Cloudera has partnered with other organizations to create custom Sqoop connectors
  - Use a database's native protocols rather than JDBC
  - Provide much faster performance
- Current systems supported by custom connectors include:
  - Netezza
  - Teradata
  - Oracle Database (connector developed with Quest Software)
- Others are in development
- Custom connectors are not open source, but are free
  - Available from the Cloudera Web site

Sqoop: Basic Syntax

- Standard syntax:

sqoop tool-name [tool-options]

- Tools include:
  - import
  - import-all-tables
  - list-tables
- Options include:
  - --connect
  - --username
  - --password

Sqoop: Example

- Example: import a table called employees from a database called personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \
  --connect jdbc:mysql://database.example.com/personnel \
  --table employees

- Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \
  --connect jdbc:mysql://database.example.com/personnel \
  --table employees \
  --where "id > 1000"

Importing an Entire Database with Sqoop

- Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --fields-terminated-by '\t' \
  --warehouse-dir /mydata

Incremental Imports with Sqoop

- What if new records are added to the database?
  - Could re-import all records, but this is inefficient
- Sqoop's incremental append mode imports only new records
  - Based on the value of the last record in the specified column

$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 6713821
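Sqoop also offers a second incremental mode, lastmodified, for tables whose existing rows may be updated in place. A minimal sketch follows; the timestamp column name and last-value are illustrative assumptions, not values from the course:

$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --table orders \
  --incremental lastmodified \
  --check-column last_update \
  --last-value "2014-01-01 00:00:00"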

Sqoop: Other Options

- Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS, with the command

$ sqoop export [options]

- For general Sqoop help:

$ sqoop help

- For help on a particular command:

$ sqoop help command
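For example, a minimal sqoop export invocation might look like the sketch below; the table name and HDFS directory are illustrative assumptions (--export-dir names the HDFS directory whose files are read and inserted into the target table):

$ sqoop export \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --table order_summaries \
  --export-dir /mydata/order_summaries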

Chapter Topics

Hadoop Tools for Data Acquisition
- Loading Data into HDFS from an RDBMS Using Sqoop
- Managing Real-Time Data Using Flume

Flume: Basics

- Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  - Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
- Flume is open source
  - Initially developed by Cloudera
- Flume's design goals:
  - Reliability
  - Scalability
  - Extensibility

Flume: High-Level Overview

(Diagram: many agents, optionally compressing, batching, and encrypting data, feed collector agents that write into HDFS)

- Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
- Parallelized writes across many collectors: as much write throughput as required
- Optionally processes incoming data: transformations, suppressions, metadata enrichment
- Each agent can be configured with an in-memory or durable channel

Flume Agent Characteristics

- Each Flume agent has a source, a sink, and a channel (see the configuration sketch below)
- Source
  - Tells the node where to receive data from
- Sink
  - Tells the node where to send data to
- Channel
  - A queue between the source and sink
  - Can be in-memory only or durable
  - Durable channels will not lose data if power is lost
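To make the source/channel/sink wiring concrete, here is a minimal agent configuration sketch in Flume's properties-file format; the agent name, source command, log path, channel settings, and HDFS directory are all illustrative assumptions rather than values from the course:

# flume.conf: one agent wiring a source to a sink through a channel
# (the agent name 'agent1' and all paths here are hypothetical)
agent1.sources  = logsrc
agent1.channels = memch
agent1.sinks    = hdfssink

# Source: tail a log file as new lines are appended
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/app/access.log
agent1.sources.logsrc.channels = memch

# Channel: in-memory queue (fast, but data is lost on power failure)
agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

# Sink: write events into an HDFS directory
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /flume/weblogs
agent1.sinks.hdfssink.channel = memch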

Flume's Design Goals: Reliability

- Channels provide Flume's reliability
  - Memory channel: data will be lost if power is lost
  - File channel: data stored on disk; guarantees durability of data in the face of a power loss
- Data transfer between agents and channels is transactional
  - A failed data transfer to a downstream agent rolls back and retries
- Can configure multiple agents with the same task
  - e.g., two agents doing the job of one collector: if one agent fails, then upstream agents would fail over

Flume's Design Goals: Scalability

- Scalability
  - The ability to increase system performance linearly by adding more resources to the system
- Flume scales horizontally
  - As load increases, more machines can be added to the configuration

Flume's Design Goals: Extensibility

- Extensibility
  - The ability to add new functionality to a system
- Flume can be extended by adding sources and sinks to existing storage layers or data platforms
  - General sources include data from files, syslog, and standard output from a process
  - General sinks include files on the local filesystem or HDFS
- Developers can write their own sources or sinks

Flume: Usage Patterns

- Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls, and mail servers into HDFS
- Currently in use in many large organizations, ingesting millions of events per day
  - At least one organization is using Flume to ingest over 200 million events per day
- Flume is typically installed and configured by a system administrator
  - Check the Flume documentation if you intend to install it yourself

Key Points

- Sqoop is a tool to load data from a database into HDFS
- Flume is a tool for managing real-time data
  - e.g., importing data from log files into HDFS

Bibliography

The following offers more information on topics discussed in this chapter:
- Incremental importing is described in the Sqoop documentation:
  http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports

An Introduction to Oozie

Chapter 8.2

An Introduction to Oozie

- What is Oozie?
- Creating Oozie workflows

Chapter Topics

An Introduction to Oozie
- Introduction to Oozie
- Creating Oozie workflows

The Motivation for Oozie (1)

- Many problems cannot be solved with a single MapReduce job
- Instead, a workflow of jobs must be created
- Simple workflow:
  - Run Job A
  - Use output of Job A as input to Job B
  - Use output of Job B as input to Job C
  - Output of Job C is the final required output
- Easy if the workflow is linear like this
  - Can be created as standard Driver code

(Diagram: Start Data, then Job A, Job B, Job C in sequence, producing the Final Result)

The Motivation for Oozie (2)

- If the workflow is more complex, Driver code becomes much more difficult to maintain
  - Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
  - Example: including Hive or Pig jobs as part of the workflow

What is Oozie?

- Oozie is a workflow engine
  - Runs on a server, typically outside the cluster
- Runs workflows of Hadoop jobs
  - Including Pig, Hive, and Sqoop jobs
  - Submits those jobs to the cluster based on a workflow definition
- Workflow definitions are submitted via HTTP
- Jobs can be run at specific times
  - One-off or recurring jobs
- Jobs can be run when data is present in a directory

Chapter Topics

An Introduction to Oozie
- Introduction to Oozie
- Creating Oozie workflows

Oozie Workflow Basics

- Oozie workflows are written in XML
- A workflow is a collection of actions
  - MapReduce jobs, Pig jobs, Hive jobs, etc.
- A workflow consists of control flow nodes and action nodes
- Control flow nodes define the beginning and end of a workflow
  - They provide methods to determine the workflow execution path
  - Example: run multiple jobs simultaneously
- Action nodes trigger the execution of a processing task, such as
  - A MapReduce job
  - A Hive query
  - A Sqoop data import job

Simple Oozie Example (1)

- Simple example workflow for WordCount:

(Diagram: the WordCount workflow; its XML definition is shown on the following pages)

Simple Oozie Example (2)

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>

Simple Oozie Example (3)

- A workflow is wrapped in the workflow-app entity.

Simple Oozie Example (4)

- The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by transitioning to the wordcount workflow node.

Simple Oozie Example (5)

- The wordcount action node defines a map-reduce action: a standard Java MapReduce job.

Simple Oozie Example (6)

- Within the action, we define the job's properties.

Simple Oozie Example (7)

- We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.

Simple Oozie Example (8)

- If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.

Simple Oozie Example (9)

- Every workflow must have an end node. This indicates that the workflow has completed successfully.

Other Oozie Control Nodes

- A decision control node allows Oozie to determine the workflow execution path based on some criteria
  - Similar to a switch-case statement
- fork and join control nodes split one execution path into multiple execution paths which run concurrently
  - fork splits the execution path
  - join waits for all concurrent execution paths to complete before proceeding
  - fork and join are used in pairs (see the sketch below)
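As an illustration, a fork/join pair might look like the following sketch; the node and action names (parallel-start, job-a, job-b, merge, next-action) are hypothetical:

<fork name='parallel-start'>
    <path start='job-a'/>
    <path start='job-b'/>
</fork>

<!-- definitions of the 'job-a' and 'job-b' actions go here;
     each transitions with <ok to='merge'/> on success -->

<join name='merge' to='next-action'/>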

Oozie Workflow Action Nodes

- map-reduce: runs either a Java MapReduce or Streaming job
- fs: creates directories, moves or deletes files or directories
- java: runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
- pig: runs a Pig script
- hive: runs a Hive query
- sqoop: runs a Sqoop job
- email: sends an e-mail message
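For example, a pig action node in a workflow might look like the sketch below; the action name and script file name are illustrative assumptions:

<action name='sales-report'>
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>salesreport.pig</script>
    </pig>
    <ok to='end'/>
    <error to='kill'/>
</action>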

Submitting an Oozie Workflow

- To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie_server>/oozie \
    -config config_file -run

- Oozie can also be called from within a Java program
  - Via the Oozie client API
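The file passed via -config is a Java properties file that supplies the parameters referenced in the workflow (such as ${nameNode} and ${inputDir}) and tells Oozie where the workflow lives in HDFS. A minimal sketch, with illustrative host names and paths:

# job.properties (values here are illustrative)
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8021
inputDir=${nameNode}/user/fred/input
outputDir=${nameNode}/user/fred/output
# HDFS directory containing workflow.xml
oozie.wf.application.path=${nameNode}/user/fred/wordcount-wf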

More on Oozie

- Oozie installation and configuration: CDH Installation Guide, http://docs.cloudera.com
- Oozie workflows and actions: https://oozie.apache.org
- The procedure for running a MapReduce job using Oozie:
  https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
- Oozie examples: included in the Oozie distribution; instructions for running them at
  http://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html

Key Points

- Oozie is a workflow engine for Hadoop
- Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation

Bibliography

The following offers more information on topics discussed in this chapter:
- Introduction to Oozie article:
  http://www.infoq.com/articles/introductionOozie

Introduction to Pig

Chapter 8.3

Introduction to Pig

- The key features Pig offers
- How to use Pig for data processing and analysis
- How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

Apache Pig Overview

- Apache Pig is a platform for data analysis and processing on Hadoop
  - It offers an alternative to writing MapReduce code directly
- Originally developed as a research project at Yahoo
  - Goals: flexibility, productivity, and maintainability
- Now an open-source Apache project

The Anatomy of Pig

- Main components of Pig
  - The data flow language (Pig Latin)
  - The interactive shell where you can type Pig Latin statements (Grunt)
  - The Pig interpreter and execution engine
- Example Pig Latin script:

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

- The interpreter / execution engine turns such a script into MapReduce jobs:
  - Preprocess and parse Pig Latin
  - Check data types
  - Make optimizations
  - Plan execution
  - Generate MapReduce jobs
  - Submit job(s) to Hadoop
  - Monitor progress

Where to Get Pig

- CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  - A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  - Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  - Simple installation
  - 100% free and open source
- Installation is outside the scope of this course
  - Cloudera offers a training course for system administrators, Cloudera Administrator Training for Apache Hadoop

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

Pig Features

- Pig is an alternative to writing low-level MapReduce code
- Many features enable sophisticated analysis and processing:
  - HDFS manipulation
  - UNIX shell commands
  - Relational operations
  - Positional references for fields
  - Common mathematical functions
  - Support for custom functions and data formats
  - Complex data structures

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

How Are Organizations Using Pig?

- Many organizations use Pig for data analysis
  - Finding relevant records in a massive data set
  - Querying multiple data sets
  - Calculating values from input data
- Pig is also frequently used for data processing
  - Reorganizing an existing data set
  - Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

- Pig can help you extract valuable information from Web server log files

Web Server Log Data:

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"

(Diagram: processing the logs yields clickstream data for user sessions, e.g. "Recent Activity for John Smith" on May 3, 2013: search for 'Widget', widget results, details for Widget X, order Widget X; and on May 12, 2013: track order, contact us, send complaint)

Use Case: Data Sampling

- Sampling can help you explore a representative portion of a large data set
  - Allows you to examine this portion with tools that do not scale well
  - Supports faster iterations during development of analysis jobs

(Diagram: random sampling reduces a 100 TB data set to a 50 MB sample)

Use Case: ETL Processing

- Pig is also widely used for Extract, Transform, and Load (ETL) processing

(Diagram: data from operations, accounting, and call center systems flows through Pig jobs running on the Hadoop cluster, which validate data, fix errors, remove duplicates, and encode values before loading the results into a data warehouse)

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

Using Pig Interactively

- You can use Pig interactively, via the Grunt shell
  - Pig interprets each Pig Latin statement as you type it
  - Execution is delayed until output is required
  - Very useful for ad hoc data inspection
- Example of how to start, use, and exit Grunt:

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

- Can also execute a Pig Latin statement from the UNIX shell via the -e option
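For instance, a single statement might be run from the UNIX shell like this sketch (the directory name is illustrative):

$ pig -e "fs -ls sales/"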

Interacting with HDFS

- You can manipulate HDFS with Pig, via the fs command:

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX

- The sh command lets you run UNIX programs from Pig:

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls;  -- lists HDFS files
grunt> sh ls;   -- lists local files

Running Pig Scripts

- A Pig script is simply Pig Latin code stored in a text file
  - By convention, these files have the .pig extension
- You can run a Pig script from within the Grunt shell via the run command
  - This is useful for automation and batch execution

grunt> run salesreport.pig;

- It is common to run a Pig script directly from the UNIX shell:

$ pig salesreport.pig
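For reference, a script such as salesreport.pig might simply contain the statements used interactively earlier; this is a sketch of plausible contents, not the actual course file:

-- salesreport.pig (illustrative contents)
allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100;
STORE bigsales INTO 'myreport';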

MapReduce and Local Modes

- As described earlier, Pig turns Pig Latin into MapReduce jobs
  - Pig submits those jobs for execution on the Hadoop cluster
- It is also possible to run Pig in local mode, using the -x flag
  - This runs MapReduce jobs on the local machine instead of the cluster
  - Local mode uses the local filesystem instead of HDFS
  - Can be helpful for testing before deploying a job to production

$ pig -x local                      -- interactive
$ pig -x local salesreport.pig      -- batch

Client-Side Log Files

- If a job fails, Pig may produce a log file to explain why
- These log files are typically produced in your current working directory
  - On the local (client) machine

Essential Points

- Pig offers an alternative to writing MapReduce code directly
  - Pig interprets Pig Latin code in order to create MapReduce jobs
  - It then submits these MapReduce jobs to the Hadoop cluster
- You can execute Pig Latin code interactively through Grunt
  - Pig delays job execution until output is required
- It is also common to store Pig Latin code in a script for batch execution
  - Allows for automation and code reuse

Bibliography

The following offer more information on topics discussed in this chapter:
- Apache Pig Web Site: http://pig.apache.org/
- Process a Million Songs with Apache Pig: http://tiny.cloudera.com/dac03a
- Powered By Pig: http://tiny.cloudera.com/dac03b
- LinkedIn: User Engagement Powered By Apache Pig and Hadoop: http://tiny.cloudera.com/dac03c
- Programming Pig (book): http://tiny.cloudera.com/dac03d

Bibliography (cont'd)

- The original paper on Pig, published by Yahoo in 2008:
  http://www.research.yahoo.com/files/sigmod08.pdf
