Apache Hadoop: A Course for Undergraduates

Lecture 8

Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hadoop Tools for Data Acquisition

Chapter 8.1

Data Acquisition with Hadoop

- How to load data from an existing RDBMS into HDFS using Sqoop
- How to manage real-time data such as log files using Flume

Chapter Topics

Hadoop Tools for Data Acquisition
- Loading Data into HDFS from an RDBMS Using Sqoop
- Managing Real-Time Data Using Flume

Importing Data from an RDBMS to HDFS

- Typical scenario: data stored in an RDBMS is needed in a MapReduce job
  - Lookup tables
  - Legacy data
- Possible to read directly from an RDBMS in your Mapper
  - Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
  - In practice: don't do it!
- Better idea: use Sqoop to import the data into HDFS beforehand

(Diagram: data flows from the RDBMS through sqoop into HDFS)

Sqoop: SQL to Hadoop (1)

- Sqoop: open source tool originally written at Cloudera
  - Now a top-level Apache Software Foundation project
- Imports tables from an RDBMS into HDFS
  - Just one table
  - All tables in a database
  - Just portions of a table
    - Sqoop supports a WHERE clause
- Uses MapReduce to actually import the data
  - Throttles the number of Mappers to avoid DDoS scenarios
    - Uses four Mappers by default
    - Value is configurable
- Uses a JDBC interface
  - Should work with virtually any JDBC-compatible database

Sqoop: SQL to Hadoop (2)

- Imports data to HDFS as delimited text files or SequenceFiles
  - Default is a comma-delimited text file
- Can be used for incremental data imports
  - First import retrieves all rows in a table
  - Subsequent imports retrieve just rows created since the last import
- Generates a class file which can encapsulate a row of the imported data
  - Useful for serializing and deserializing data in subsequent MapReduce jobs

Custom Sqoop Connectors

- Cloudera has partnered with other organizations to create custom Sqoop connectors
  - Use a database's native protocols rather than JDBC
  - Provide much faster performance
- Current systems supported by custom connectors include:
  - Netezza
  - Teradata
  - Oracle Database (connector developed with Quest Software)
- Others are in development
- Custom connectors are not open source, but are free
  - Available from the Cloudera Web site

Sqoop: Basic Syntax

- Standard syntax:

sqoop tool-name [tool-options]

- Tools include:
  - import
  - import-all-tables
  - list-tables
- Options include:
  - --connect
  - --username
  - --password

Sqoop: Example

- Example: import a table called employees from a database called personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \
  --connect jdbc:mysql://database.example.com/personnel \
  --table employees

- Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \
  --connect jdbc:mysql://database.example.com/personnel \
  --table employees \
  --where "id > 1000"

Importing an Entire Database with Sqoop

- Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --fields-terminated-by '\t' \
  --warehouse-dir /mydata

Incremental Imports with Sqoop

- What if new records are added to the database?
  - Could re-import all records, but this is inefficient
- Sqoop's incremental append mode imports only new records
  - Based on the value of the last record in the specified column

$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --warehouse-dir /mydata \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 6713821
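Sqoop also offers a second incremental mode, lastmodified, for tables whose existing rows may be updated in place. A minimal sketch follows; the timestamp column name and last-value are illustrative assumptions, not values from the course:

$ sqoop import \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --table orders \
  --incremental lastmodified \
  --check-column last_update \
  --last-value "2014-01-01 00:00:00"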

Sqoop: Other Options

- Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS, with the command

$ sqoop export [options]

- For general Sqoop help:

$ sqoop help

- For help on a particular command:

$ sqoop help command
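For example, a minimal sqoop export invocation might look like the sketch below; the table name and HDFS directory are illustrative assumptions (--export-dir names the HDFS directory whose files are read and inserted into the target table):

$ sqoop export \
  --connect jdbc:mysql://localhost/company \
  --username twheeler --password bigsecret \
  --table order_summaries \
  --export-dir /mydata/order_summaries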

Chapter Topics

Hadoop Tools for Data Acquisition
- Loading Data into HDFS from an RDBMS Using Sqoop
- Managing Real-Time Data Using Flume

Flume: Basics

- Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  - Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
- Flume is open source
  - Initially developed by Cloudera
- Flume's design goals:
  - Reliability
  - Scalability
  - Extensibility

Flume: High-Level Overview

(Diagram: many agents, optionally compressing, batching, and encrypting data, feed collector agents that write into HDFS)

- Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
- Parallelized writes across many collectors: as much write throughput as required
- Optionally processes incoming data: transformations, suppressions, metadata enrichment
- Each agent can be configured with an in-memory or durable channel

Flume Agent Characteristics

- Each Flume agent has a source, a sink, and a channel (see the configuration sketch below)
- Source
  - Tells the node where to receive data from
- Sink
  - Tells the node where to send data to
- Channel
  - A queue between the source and sink
  - Can be in-memory only or durable
  - Durable channels will not lose data if power is lost
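To make the source/channel/sink wiring concrete, here is a minimal agent configuration sketch in Flume's properties-file format; the agent name, source command, log path, channel settings, and HDFS directory are all illustrative assumptions rather than values from the course:

# flume.conf: one agent wiring a source to a sink through a channel
# (the agent name 'agent1' and all paths here are hypothetical)
agent1.sources  = logsrc
agent1.channels = memch
agent1.sinks    = hdfssink

# Source: tail a log file as new lines are appended
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/app/access.log
agent1.sources.logsrc.channels = memch

# Channel: in-memory queue (fast, but data is lost on power failure)
agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

# Sink: write events into an HDFS directory
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /flume/weblogs
agent1.sinks.hdfssink.channel = memch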

Flume's Design Goals: Reliability

- Channels provide Flume's reliability
  - Memory channel: data will be lost if power is lost
  - File channel: data stored on disk; guarantees durability of data in the face of a power loss
- Data transfer between agents and channels is transactional
  - A failed data transfer to a downstream agent rolls back and retries
- Can configure multiple agents with the same task
  - e.g., two agents doing the job of one collector: if one agent fails, then upstream agents would fail over

Flume's Design Goals: Scalability

- Scalability
  - The ability to increase system performance linearly by adding more resources to the system
- Flume scales horizontally
  - As load increases, more machines can be added to the configuration

Flume's Design Goals: Extensibility

- Extensibility
  - The ability to add new functionality to a system
- Flume can be extended by adding sources and sinks to existing storage layers or data platforms
  - General sources include data from files, syslog, and standard output from a process
  - General sinks include files on the local filesystem or HDFS
- Developers can write their own sources or sinks

Flume: Usage Patterns

- Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls, and mail servers into HDFS
- Currently in use in many large organizations, ingesting millions of events per day
  - At least one organization is using Flume to ingest over 200 million events per day
- Flume is typically installed and configured by a system administrator
  - Check the Flume documentation if you intend to install it yourself

Key Points

- Sqoop is a tool to load data from a database into HDFS
- Flume is a tool for managing real-time data
  - e.g., importing data from log files into HDFS

Bibliography

The following offers more information on topics discussed in this chapter:
- Incremental importing is described in the Sqoop documentation:
  http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports

An Introduction to Oozie

Chapter 8.2

An Introduction to Oozie

- What is Oozie?
- Creating Oozie workflows

Chapter Topics

An Introduction to Oozie
- Introduction to Oozie
- Creating Oozie workflows

The Motivation for Oozie (1)

- Many problems cannot be solved with a single MapReduce job
- Instead, a workflow of jobs must be created
- Simple workflow:
  - Run Job A
  - Use output of Job A as input to Job B
  - Use output of Job B as input to Job C
  - Output of Job C is the final required output
- Easy if the workflow is linear like this
  - Can be created as standard Driver code

(Diagram: Start Data, then Job A, Job B, Job C in sequence, producing the Final Result)

The Motivation for Oozie (2)

- If the workflow is more complex, Driver code becomes much more difficult to maintain
  - Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
  - Example: including Hive or Pig jobs as part of the workflow

What is Oozie?

- Oozie is a workflow engine
  - Runs on a server, typically outside the cluster
- Runs workflows of Hadoop jobs
  - Including Pig, Hive, and Sqoop jobs
  - Submits those jobs to the cluster based on a workflow definition
- Workflow definitions are submitted via HTTP
- Jobs can be run at specific times
  - One-off or recurring jobs
- Jobs can be run when data is present in a directory

Chapter Topics

An Introduction to Oozie
- Introduction to Oozie
- Creating Oozie workflows

Oozie Workflow Basics

- Oozie workflows are written in XML
- A workflow is a collection of actions
  - MapReduce jobs, Pig jobs, Hive jobs, etc.
- A workflow consists of control flow nodes and action nodes
- Control flow nodes define the beginning and end of a workflow
  - They provide methods to determine the workflow execution path
  - Example: run multiple jobs simultaneously
- Action nodes trigger the execution of a processing task, such as
  - A MapReduce job
  - A Hive query
  - A Sqoop data import job

Simple Oozie Example (1)

- Simple example workflow for WordCount:

(Diagram: the WordCount workflow; its XML definition is shown on the following pages)

Simple Oozie Example (2)

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>

Simple Oozie Example (3)

- A workflow is wrapped in the workflow-app entity.

Simple Oozie Example (4)

- The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by transitioning to the wordcount workflow node.

Simple Oozie Example (5)

- The wordcount action node defines a map-reduce action: a standard Java MapReduce job.

Simple Oozie Example (6)

- Within the action, we define the job's properties.

Simple Oozie Example (7)

- We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.

Simple Oozie Example (8)

- If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.

Simple Oozie Example (9)

- Every workflow must have an end node. This indicates that the workflow has completed successfully.

Other Oozie Control Nodes

- A decision control node allows Oozie to determine the workflow execution path based on some criteria
  - Similar to a switch-case statement
- fork and join control nodes split one execution path into multiple execution paths which run concurrently
  - fork splits the execution path
  - join waits for all concurrent execution paths to complete before proceeding
  - fork and join are used in pairs (see the sketch below)
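As an illustration, a fork/join pair might look like the following sketch; the node and action names (parallel-start, job-a, job-b, merge, next-action) are hypothetical:

<fork name='parallel-start'>
    <path start='job-a'/>
    <path start='job-b'/>
</fork>

<!-- definitions of the 'job-a' and 'job-b' actions go here;
     each transitions with <ok to='merge'/> on success -->

<join name='merge' to='next-action'/>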

Oozie Workflow Action Nodes

- map-reduce: runs either a Java MapReduce or Streaming job
- fs: creates directories, moves or deletes files or directories
- java: runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
- pig: runs a Pig script
- hive: runs a Hive query
- sqoop: runs a Sqoop job
- email: sends an e-mail message
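For example, a pig action node in a workflow might look like the sketch below; the action name and script file name are illustrative assumptions:

<action name='sales-report'>
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>salesreport.pig</script>
    </pig>
    <ok to='end'/>
    <error to='kill'/>
</action>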

Submitting an Oozie Workflow

- To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie_server>/oozie \
    -config config_file -run

- Oozie can also be called from within a Java program
  - Via the Oozie client API
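The file passed via -config is a Java properties file that supplies the parameters referenced in the workflow (such as ${nameNode} and ${inputDir}) and tells Oozie where the workflow lives in HDFS. A minimal sketch, with illustrative host names and paths:

# job.properties (values here are illustrative)
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8021
inputDir=${nameNode}/user/fred/input
outputDir=${nameNode}/user/fred/output
# HDFS directory containing workflow.xml
oozie.wf.application.path=${nameNode}/user/fred/wordcount-wf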

More on Oozie

- Oozie installation and configuration: CDH Installation Guide, http://docs.cloudera.com
- Oozie workflows and actions: https://oozie.apache.org
- The procedure for running a MapReduce job using Oozie:
  https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
- Oozie examples: included in the Oozie distribution; instructions for running them at
  http://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html

Key Points

- Oozie is a workflow engine for Hadoop
- Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation

Bibliography

The following offers more information on topics discussed in this chapter:
- Introduction to Oozie article:
  http://www.infoq.com/articles/introductionOozie

Introduction to Pig

Chapter 8.3

Introduction to Pig

- The key features Pig offers
- How to use Pig for data processing and analysis
- How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

Apache Pig Overview

- Apache Pig is a platform for data analysis and processing on Hadoop
  - It offers an alternative to writing MapReduce code directly
- Originally developed as a research project at Yahoo
  - Goals: flexibility, productivity, and maintainability
- Now an open-source Apache project

The Anatomy of Pig

- Main components of Pig
  - The data flow language (Pig Latin)
  - The interactive shell where you can type Pig Latin statements (Grunt)
  - The Pig interpreter and execution engine
- Example Pig Latin script:

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

- The interpreter / execution engine turns such a script into MapReduce jobs:
  - Preprocess and parse Pig Latin
  - Check data types
  - Make optimizations
  - Plan execution
  - Generate MapReduce jobs
  - Submit job(s) to Hadoop
  - Monitor progress

Where to Get Pig

- CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  - A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  - Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  - Simple installation
  - 100% free and open source
- Installation is outside the scope of this course
  - Cloudera offers a training course for system administrators, Cloudera Administrator Training for Apache Hadoop

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

Pig Features

- Pig is an alternative to writing low-level MapReduce code
- Many features enable sophisticated analysis and processing:
  - HDFS manipulation
  - UNIX shell commands
  - Relational operations
  - Positional references for fields
  - Common mathematical functions
  - Support for custom functions and data formats
  - Complex data structures

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

How Are Organizations Using Pig?

- Many organizations use Pig for data analysis
  - Finding relevant records in a massive data set
  - Querying multiple data sets
  - Calculating values from input data
- Pig is also frequently used for data processing
  - Reorganizing an existing data set
  - Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

- Pig can help you extract valuable information from Web server log files

Web Server Log Data:

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"

(Diagram: processing the logs yields clickstream data for user sessions, e.g. "Recent Activity for John Smith" on May 3, 2013: search for 'Widget', widget results, details for Widget X, order Widget X; and on May 12, 2013: track order, contact us, send complaint)

Use Case: Data Sampling

- Sampling can help you explore a representative portion of a large data set
  - Allows you to examine this portion with tools that do not scale well
  - Supports faster iterations during development of analysis jobs

(Diagram: random sampling reduces a 100 TB data set to a 50 MB sample)

Use Case: ETL Processing

- Pig is also widely used for Extract, Transform, and Load (ETL) processing

(Diagram: data from operations, accounting, and call center systems flows through Pig jobs running on the Hadoop cluster, which validate data, fix errors, remove duplicates, and encode values before loading the results into a data warehouse)

Chapter Topics

Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig

Using Pig Interactively

- You can use Pig interactively, via the Grunt shell
  - Pig interprets each Pig Latin statement as you type it
  - Execution is delayed until output is required
  - Very useful for ad hoc data inspection
- Example of how to start, use, and exit Grunt:

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

- Can also execute a Pig Latin statement from the UNIX shell via the -e option
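For instance, a single statement might be run from the UNIX shell like this sketch (the directory name is illustrative):

$ pig -e "fs -ls sales/"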

Interacting with HDFS

- You can manipulate HDFS with Pig, via the fs command:

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX

- The sh command lets you run UNIX programs from Pig:

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls;  -- lists HDFS files
grunt> sh ls;   -- lists local files

Running Pig Scripts

- A Pig script is simply Pig Latin code stored in a text file
  - By convention, these files have the .pig extension
- You can run a Pig script from within the Grunt shell via the run command
  - This is useful for automation and batch execution

grunt> run salesreport.pig;

- It is common to run a Pig script directly from the UNIX shell:

$ pig salesreport.pig
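For reference, a script such as salesreport.pig might simply contain the statements used interactively earlier; this is a sketch of plausible contents, not the actual course file:

-- salesreport.pig (illustrative contents)
allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100;
STORE bigsales INTO 'myreport';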

MapReduce and Local Modes

- As described earlier, Pig turns Pig Latin into MapReduce jobs
  - Pig submits those jobs for execution on the Hadoop cluster
- It is also possible to run Pig in local mode, using the -x flag
  - This runs MapReduce jobs on the local machine instead of the cluster
  - Local mode uses the local filesystem instead of HDFS
  - Can be helpful for testing before deploying a job to production

$ pig -x local                      -- interactive
$ pig -x local salesreport.pig      -- batch

Client-Side Log Files

- If a job fails, Pig may produce a log file to explain why
- These log files are typically produced in your current working directory
  - On the local (client) machine

Essential Points

- Pig offers an alternative to writing MapReduce code directly
  - Pig interprets Pig Latin code in order to create MapReduce jobs
  - It then submits these MapReduce jobs to the Hadoop cluster
- You can execute Pig Latin code interactively through Grunt
  - Pig delays job execution until output is required
- It is also common to store Pig Latin code in a script for batch execution
  - Allows for automation and code reuse

Bibliography

The following offer more information on topics discussed in this chapter:
- Apache Pig Web Site: http://pig.apache.org/
- Process a Million Songs with Apache Pig: http://tiny.cloudera.com/dac03a
- Powered By Pig: http://tiny.cloudera.com/dac03b
- LinkedIn: User Engagement Powered By Apache Pig and Hadoop: http://tiny.cloudera.com/dac03c
- Programming Pig (book): http://tiny.cloudera.com/dac03d

Bibliography (cont'd)

- The original paper on Pig, published by Yahoo in 2008:
  http://www.research.yahoo.com/files/sigmod08.pdf
