Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Chapter Topics
Hadoop Tools for Data Acquisition
- Loading Data into HDFS from an RDBMS Using Sqoop
- Managing Real-time Data Using Flume
[Figure: Sqoop between an RDBMS and HDFS]
- Tools include: import, import-all-tables, list-tables
- Options include: --connect, --username, --password
Sqoop: Example
- Example: import a table called employees from a database called personnel in a MySQL RDBMS
$ sqoop import --username fred --password derf \
--connect jdbc:mysql://database.example.com/personnel \
--table employees
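The three tools named earlier take the same connection options. As a rough sketch, the Python below assembles the argument list for each tool from the values in the example above. The helper name sqoop_args is hypothetical, and the code only builds strings; it does not run Sqoop.

```python
# Hypothetical helper: assembles the argument list for a Sqoop tool.
# Nothing is executed here; on a machine with Sqoop installed the list
# could be handed to subprocess.run().

def sqoop_args(tool, connect, username, password, table=None):
    """Return the argv for `sqoop <tool>` with the common connection options."""
    args = ["sqoop", tool,
            "--username", username,
            "--password", password,
            "--connect", connect]
    if table is not None:
        # Only the single-table import in the example names a table.
        args += ["--table", table]
    return args

JDBC_URL = "jdbc:mysql://database.example.com/personnel"

import_one = sqoop_args("import", JDBC_URL, "fred", "derf", table="employees")
import_all = sqoop_args("import-all-tables", JDBC_URL, "fred", "derf")
list_tabs = sqoop_args("list-tables", JDBC_URL, "fred", "derf")

print(" ".join(import_one))
```

Building the list once keeps the connection details in one place, so switching from a single-table import to import-all-tables or list-tables only changes the tool name.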
Flume: Basics
- Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  - Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
- Flume is Open Source
  - Initially developed by Cloudera
- Flume's design goals:
  - Reliability
  - Scalability
  - Extensibility
[Figure: data flowing through multiple Flume agents, with steps labeled encrypt, compress, and batch]
Key Points
- Sqoop is a tool to load data from a database into HDFS
- Flume is a tool for managing real-time data
  - e.g. importing data from log files into HDFS
Bibliography
The following offer more information on topics discussed in this chapter
- Incremental importing is described in the Sqoop documentation:
  http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports
An Introduction to Oozie
Chapter 8.2
An Introduction to Oozie
- What is Oozie?
- Creating Oozie workflows
Chapter Topics
An Introduction to Oozie
- Introduction to Oozie
- Creating Oozie workflows
[Figure: workflow diagram with nodes Start Data, Job A, Job B, Job C, and Final Result]
What is Oozie?
- Oozie is a 'workflow engine'
  - Runs on a server
  - Typically outside the cluster
- Runs workflows of Hadoop jobs
  - Including Pig, Hive, Sqoop jobs
  - Submits those jobs to the cluster based on a workflow definition
- Workflow definitions are submitted via HTTP
- Jobs can be run at specific times
  - One-off or recurring jobs
- Jobs can be run when data is present in a directory
Creating Oozie workflows
The wordcount action node defines a map-reduce action: a standard Java MapReduce job.
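The workflow definition that the sentence above describes did not survive in this copy, so the XML below is an illustrative stand-in, not the original slide. It follows the general shape of an Oozie workflow (a start node, one map-reduce action with ok and error transitions, a kill node, and an end node). The workflow name, the class name org.example.WordMapper, and the directory values are made up. The Python around it only parses the XML to show how the nodes link together.

```python
import xml.etree.ElementTree as ET

# Illustrative workflow definition (assumed, not taken from the slides).
WORKFLOW_XML = """\
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.example.WordMapper</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>Word count failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
"""

NS = {"wf": "uri:oozie:workflow:0.1"}


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]


root = ET.fromstring(WORKFLOW_XML)
start = root.find("wf:start", NS).get("to")

# Map each action name to its type and its ok/error transitions.
actions = {}
for action in root.findall("wf:action", NS):
    actions[action.get("name")] = {
        "type": local(action[0].tag),   # first child names the action type
        "ok": action.find("wf:ok", NS).get("to"),
        "error": action.find("wf:error", NS).get("to"),
    }

print(start, actions)
```

Each action names the node to go to on success (ok) and on failure (error), which is how a workflow chains jobs and routes failures to a kill node.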
[Table: Oozie action names with a Description column; names listed: map-reduce, fs, java, pig, hive, sqoop]
More on Oozie
Information and resources:
- https://oozie.apache.org
- https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
- Oozie examples
Key Points
- Oozie is a workflow engine for Hadoop
- Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation
Bibliography
The following offer more information on topics discussed in this chapter
- Introduction to Oozie article:
  http://www.infoq.com/articles/introductionOozie
Introduction to Pig
Chapter 8.3
Introduction to Pig
- The key features Pig offers
- How to use Pig for data processing and analysis
- How to use Pig interactively and in batch mode
Chapter Topics
Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
[Figure label: MapReduce Jobs]
- Preprocess and parse Pig Latin
- Check data types
- Make optimizations
- Plan execution
- Generate MapReduce jobs
- Submit job(s) to Hadoop
- Monitor progress
Pig Features
- Pig is an alternative to writing low-level MapReduce code
- Many features enable sophisticated analysis and processing
  - HDFS manipulation
  - UNIX shell commands
  - Relational operations
  - Positional references for fields
  - Common mathematical functions
  - Support for custom functions and data formats
  - Complex data structures
Pig Use Cases
10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
...
[Figure: web pages labeled Track Order, Widget Results, Contact Us, Send Complaint, and Order Widget X]
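The log excerpt above can be grouped by visitor to recover the sequence of pages each one requested. The sketch below does that in plain Python purely for illustration; it is not Pig Latin and not the course's solution. The regular expression and variable names are assumptions, and the log lines are copied from the excerpt.

```python
import re
from collections import defaultdict

# Log lines copied from the excerpt above.
LOG_LINES = [
    '10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"',
    '10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"',
    '10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"',
    '10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"',
    '10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"',
    '10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"',
    '10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"',
    '10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"',
]

# Assumed pattern: IP, two ignored fields, [timestamp], "METHOD path protocol",
# status code, response size. Trailing fields are left unparsed.
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'
)

paths_by_ip = defaultdict(list)   # pages requested by each client, in log order
not_found = []                    # paths that returned HTTP 404

for line in LOG_LINES:
    match = LINE_RE.match(line)
    if match is None:
        continue                  # skip lines that do not fit the pattern
    ip, timestamp, method, path, status, size = match.groups()
    paths_by_ip[ip].append(path)
    if status == "404":
        not_found.append(path)

print(paths_by_ip["10.174.57.241"])
print(not_found)
```

On real data sizes the same grouping and filtering would be expressed as Pig relational operations and run as MapReduce jobs across the cluster.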
[Figure: Random Sampling, 100 TB reduced to 50 MB]
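To make the sizes in the figure concrete: going from 100 TB to 50 MB means keeping roughly one record in two million. The sketch below works out that fraction (assuming decimal units) and shows per-record random sampling in plain Python. The function name and the toy data are made up for illustration; Pig Latin provides a SAMPLE operator for this purpose.

```python
import random

TOTAL_BYTES = 100 * 10**12    # 100 TB, decimal units assumed
SAMPLE_BYTES = 50 * 10**6     # 50 MB

# Fraction of the data to keep: 5e-07, i.e. about 1 record in 2,000,000.
fraction = SAMPLE_BYTES / TOTAL_BYTES


def sample(records, keep_fraction, seed=None):
    """Keep each record independently with probability keep_fraction."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < keep_fraction]


# Toy demonstration with a much larger fraction so the result is visible.
records = range(1_000_000)
subset = sample(records, 0.01, seed=42)

print(fraction, len(subset))
```

Because each record is kept or dropped independently, the sample size is only approximately the requested fraction of the input.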
[Figure: systems labeled Operations, Accounting, Call Center, and Data Warehouse, with processing steps: validate data, fix errors, remove duplicates, encode values]
Interacting with Pig
- Can also execute a Pig Latin statement from the UNIX shell via the -e option
grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;
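For readers new to Pig Latin, the LOAD, FILTER, and STORE statements above compute roughly what the Python sketch below does on a single machine. It assumes tab-delimited input with two fields (tab is the default delimiter for Pig's LOAD), and the sample rows are made up. Pig would run the same logic as MapReduce jobs over files in HDFS.

```python
import io


def load(lines):
    """Parse tab-delimited lines into (name, price) tuples, like LOAD ... AS."""
    for line in lines:
        name, price = line.rstrip("\n").split("\t")
        yield name, int(price)


# Made-up stand-in for the contents of the 'sales' directory.
sales_file = io.StringIO("Alice\t2999\nBob\t50\nCarlos\t144\n")

allsales = load(sales_file)                              # LOAD 'sales' AS (name, price)
bigsales = [(n, p) for n, p in allsales if p > 100]      # FILTER allsales BY price > 100

# STORE bigsales INTO 'myreport' would write tab-delimited lines like these.
report = "".join(f"{n}\t{p}\n" for n, p in bigsales)
print(report, end="")
```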
grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls;
grunt> sh ls;
-- interactive
-- batch
Essential Points
- Pig offers an alternative to writing MapReduce code directly
  - Pig interprets Pig Latin code in order to create MapReduce jobs
  - It then submits these MapReduce jobs to the Hadoop cluster
- You can execute Pig Latin code interactively through Grunt
  - Pig delays job execution until output is required
- It is also common to store Pig Latin code in a script for batch execution
  - Allows for automation and code reuse
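The point about delayed execution can be illustrated with Python generators, which also do no work until a result is requested. This is an analogy only, with made-up names and data. In Pig, defining relations with LOAD or FILTER builds up a plan, and jobs run once a statement such as STORE or DUMP asks for output.

```python
executed = []   # records each row at the moment it is actually read


def load(rows):
    """Lazily yield rows, noting when each one is really processed."""
    for row in rows:
        executed.append(row)
        yield row


def keep_big(rows, threshold):
    """Lazily keep rows whose price exceeds the threshold."""
    for name, price in rows:
        if price > threshold:
            yield name, price


rows = [("a", 50), ("b", 150)]

# Defining the pipeline does no work yet (like LOAD and FILTER in Grunt).
allsales = load(rows)
bigsales = keep_big(allsales, 100)
work_before = len(executed)

# Asking for the output triggers the whole pipeline (like STORE or DUMP).
result = list(bigsales)
work_after = len(executed)

print(work_before, result, work_after)
```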
Bibliography
The following offer more information on topics discussed in this chapter
- Apache Pig Web Site
  http://pig.apache.org/
- Process a Million Songs with Apache Pig
  http://tiny.cloudera.com/dac03a
- Powered By Pig
  http://tiny.cloudera.com/dac03b
- LinkedIn: User Engagement Powered By Apache Pig and Hadoop
  http://tiny.cloudera.com/dac03c
- Programming Pig (book)
  http://tiny.cloudera.com/dac03d
Bibliography (cont'd)
The following offer more information on topics discussed in this chapter
- Programming Pig (book)
  http://tiny.cloudera.com/dac03d
- The original paper on Pig published by Yahoo in 2008:
  http://www.research.yahoo.com/files/sigmod08.pdf