
# Cloudera Data Analyst Training (201403)

Introduction
Chapter 1

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Chapter Topics

Introduction

- Course Logistics
- Introductions

Course Objectives (1)

During this course, you will learn
- The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
- How to identify typical use cases for large-scale data analysis
- How to manage data in HDFS and export it for use with other systems
- The language syntax and data formats supported by these tools

Course Objectives (2)

- How to design and execute queries on data stored in HDFS
- How to analyze structured, semi-structured, and unstructured data
- How Hive and Pig can be extended with custom functions and scripts
- How to store and query data for better performance

Chapter Topics

Introduction

- Course Logistics
- Introductions

- Partners include companies such as Oracle
- Authors on staff: Tom White, Eric Sammer, Lars George, and others
- Customers include Macys.com, the National Cancer Institute, Orbitz, and the Social Security Administration
- Cloudera public training:
  - Cloudera Training for Apache HBase
  - Introduction to Data Science: Building Recommender Systems
- Onsite and custom training is also available

CDH

- 100% open source
- The most complete, tested, and widely deployed distribution
- Integrates all the key related projects
- Available as RPMs and Ubuntu/Debian/SuSE packages, or as a tarball

Cloudera Express

- Cloudera Express is the best way to get started
- Includes CDH
- Includes Cloudera Manager
  - End-to-end administration: deploy, manage, and monitor your cluster

Cloudera Enterprise

- Cloudera Enterprise
  - Subscription product including CDH and Cloudera Manager
- Includes support
- Includes extra Cloudera Manager features
  - Configuration history and rollbacks
  - LDAP integration
  - SNMP support
  - Automated disaster recovery
  - Etc.

Chapter Topics

Introduction

- Course Logistics
- Introductions

Logistics

- Class start and finish times
- Lunch
- Breaks
- Restrooms
- Wi-Fi access
- Virtual machines
- Can I come in early/stay late?

Your instructor will give you details on how to access the course materials and exercise instructions for the class

Chapter Topics

Introduction

- Course Logistics
- Introductions

Introductions

- Where do you work and what do you do there?
- Which database(s) and platform(s) do you use?
- Any experience as a developer?
- What programming languages do you use?
- What are your expectations for this course?

Chapter 2

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Extending Pig
- Introduction to Hive
- Hive Data Management
- Text Processing With Hive
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

In this chapter, you will learn
- Which factors led to the era of Big Data
- How Hadoop offers reliable storage for massive amounts of data with HDFS
- How Hadoop supports large-scale data processing through MapReduce

- At the end of this chapter you will work on the first Hands-On Exercise
- The exercises are performed in a Virtual Machine (VM)
- The first time the VM is launched, it takes several minutes to boot
  - It is configuring the class environment
  - Subsequent boots are much faster
- You may wish to start your VM now so that it is ready by the time we get to the first Hands-On Exercise
  - Your instructor will tell you how to do this

Chapter Topics

- HDFS
- MapReduce
- Conclusion

Velocity

- We are generating data faster than ever
  - Processes are increasingly automated
  - Systems are increasingly interconnected
  - People are increasingly interacting online

Variety

- We are producing a wide variety of data
  - Social network connections
  - Electronic medical records
  - Images, audio, and video
  - RFID and wireless sensor network events
  - And much more
- Not all of this maps cleanly to the relational model

Volume

- Every day
  - Exchange
- Every minute
  - Foursquare handles more than 2,000 check-ins
- And every second
  - Banks process more than 10,000 credit card transactions

Data Has Value

- This data has many valuable applications
  - Predicting demand
  - Marketing analysis
  - Fraud detection
  - And many, many more
- We must process it to extract that value
  - And processing all the data can yield more accurate results

We Need a System that Scales

- How can we reliably store large amounts of data at a reasonable cost?
- How can we analyze all the data we have stored?

Chapter Topics

- HDFS
- MapReduce
- Conclusion

- Scalable and economical data storage and processing
  - Distributed and fault-tolerant
  - Harnesses the power of industry-standard hardware
- Storage: HDFS
- Processing: MapReduce
- Plus the infrastructure needed to make them work, including
  - Job scheduling and monitoring

Scalability

- Individual servers within a cluster are called nodes
  - Typically standard rackmount servers running Linux
  - Each node both stores and processes data
- A cluster may contain up to several thousand nodes
  - You can scale out incrementally as required

Fault Tolerance

- Data in HDFS is replicated across multiple nodes
- If a node fails, its data is re-replicated using one of the other copies
- Routine failures are handled automatically without any loss of data

Chapter Topics

- HDFS
- MapReduce
- Conclusion

- Provides inexpensive and reliable storage for massive amounts of data
- Optimized for sequential access to a relatively small number of large files
  - Each file is likely to be 100MB or larger
  - Multi-gigabyte files are typical
- In some ways, HDFS is similar to a UNIX filesystem
  - Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
  - UNIX-style file ownership and permissions
- There are also some major deviations from UNIX
  - No concept of a current directory
  - Cannot modify files once written

HDFS Architecture

- HDFS has a master/slave architecture
- HDFS master daemon: NameNode
  - Monitors slave nodes
- HDFS slave daemon: DataNode
  - Runs on the slave nodes

Accessing HDFS via the Command Line

- HDFS is not a general-purpose filesystem
  - Not built into the OS, so only specialized tools can access it
- Example: display the contents of the /user/fred/sales.txt file

$ hadoop fs -cat /user/fred/sales.txt

- Example: create a directory (below the root) called reports

$ hadoop fs -mkdir /reports

Copying Local Data To and From HDFS

- Remember that HDFS is distinct from your local filesystem
- Copy the file /reports/sales.txt from HDFS to the local client machine

$ hadoop fs -get /reports/sales.txt

- Copy the file input.txt from local disk to the user's directory in HDFS

$ hadoop fs -put input.txt input.txt

- Get a directory listing of the HDFS root directory

$ hadoop fs -ls /

- Delete the file /reports/sales.txt

$ hadoop fs -rm /reports/sales.txt

Chapter Topics

- HDFS
- MapReduce
- Conclusion

Introducing MapReduce

- MapReduce is not a language, it's a programming model
- Benefits of MapReduce
  - Simplicity
  - Flexibility
  - Scalability

Understanding Map and Reduce

- MapReduce consists of two functions: map and reduce
  - The output from map becomes the input to reduce
- The map function always runs first
  - Typically used to filter, transform, or parse data
- The reduce function is optional
  - Not always needed; you can run map-only jobs
- Each piece is simple, but can be powerful when combined

MapReduce Example

- The following slides will explain an entire MapReduce job
  - Input: text file containing order ID, employee name, and sale amount
  - Output: sum of all sales per employee

Job Input           Job Output
0 Alice  3625       Alice  12491
1 Bob    5174       Bob     9997
2 Alice   893       Carlos  1431
3 Alice  2139       Diana   5385
4 Diana  3581
5 Carlos 1039
6 Bob    4823
7 Alice  5834
8 Carlos  392
9 Diana  1804

The Map Phase

- Mappers process one input record at a time
  - For each input record, they emit zero or more records as output
- Our map function parses each record and then emits the name and price fields for each as output

Job Input           Output from Map
0 Alice  3625       Alice  3625
1 Bob    5174       Bob    5174
2 Alice   893       Alice   893
3 Alice  2139       Alice  2139
4 Diana  3581       Diana  3581
5 Carlos 1039       Carlos 1039
6 Bob    4823       Bob    4823
7 Alice  5834       Alice  5834
8 Carlos  392       Carlos  392
9 Diana  1804       Diana  1804

Shuffle and Sort

- Between the map and reduce phases, Hadoop merges and sorts the map output by key
- This intermediate process is known as the shuffle and sort

Output from Map Tasks       Input to Reduce Task #1
Alice  3625                 Alice  3625
Bob    5174                 Alice   893
Alice   893                 Alice  2139
Alice  2139                 Alice  5834
Diana  3581                 Carlos 1039
Carlos 1039                 Carlos  392
Bob    4823
Alice  5834                 Input to Reduce Task #2
Carlos  392                 Bob    5174
Diana  1804                 Bob    4823
                            Diana  3581
                            Diana  1804

The Reduce Phase

- Reducer input comes from the shuffle and sort process
  - For each input record, reduce can emit zero or more output records
- Our reduce function simply sums the total per person
  - And emits employee name (key) and total (value) as output

Input to Reduce Task #1     Job Output (Output of Reduce Tasks)
Alice  3625                 Alice  12491
Alice   893                 Carlos  1431
Alice  2139                 Bob     9997
Alice  5834                 Diana   5385
Carlos 1039
Carlos  392

Input to Reduce Task #2
Bob    5174
Bob    4823
Diana  3581
Diana  1804

Putting It All Together

- Here is the data flow for the entire MapReduce job: the ten input records pass through the map phase, the shuffle and sort, and the reduce phase to produce the per-employee totals (Alice 12491, Bob 9997, Carlos 1431, Diana 5385)
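The whole job can be simulated in a few lines of Python. This is only a sketch of the data flow (map, then shuffle and sort, then reduce), not how Hadoop itself is invoked:

```python
from itertools import groupby
from operator import itemgetter

# The job input from the example: "order_id name price" records
records = ["0 Alice 3625", "1 Bob 5174", "2 Alice 893", "3 Alice 2139",
           "4 Diana 3581", "5 Carlos 1039", "6 Bob 4823", "7 Alice 5834",
           "8 Carlos 392", "9 Diana 1804"]

# Map: parse each record and emit a (name, price) pair
mapped = []
for record in records:
    _, name, price = record.split()
    mapped.append((name, int(price)))

# Shuffle and sort: group the map output by key
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key
totals = {name: sum(price for _, price in pairs)
          for name, pairs in groupby(mapped, key=itemgetter(0))}
print(totals)  # {'Alice': 12491, 'Bob': 9997, 'Carlos': 1431, 'Diana': 5385}
```

In a real cluster the same three steps run in parallel across many machines; the simulation just makes the per-phase data movement visible.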

MapReduce Architecture

- MapReduce version 1 and version 2 (YARN)
  - Similar master/slave architecture
  - Details differ slightly
- Master nodes
  - Run master daemons to accept jobs, and monitor and distribute work
- Slave nodes
  - Do the actual work
  - Report status back to master daemons
- HDFS and MapReduce are collocated
  - Slave nodes run both HDFS and MapReduce slave daemons on the same machines

MapReduce Version 1 Architecture

- MRv1 master daemon: JobTracker
- MRv1 slave daemon: TaskTracker
  - Reports status back to the JobTracker

MapReduce Version 2 Architecture

- MRv2 uses the YARN cluster management framework
- MRv2 master daemon: ResourceManager
  - Allocates cluster resources for a job
- MRv2 slave daemons: ApplicationMaster and NodeManager
  - NodeManager runs on all slave nodes
  - ApplicationMaster starts and monitors the actual tasks
Chapter Topics

- HDFS
- MapReduce
- Conclusion

- Many related tools are built on top of Hadoop, including ones for
  - Data analysis
  - Workflow management
- Many are also open source Apache projects

Apache Pig

- Pig is especially good at joining and transforming data

people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

- The Pig interpreter runs on the client machine
  - Turns Pig Latin scripts into MapReduce jobs
  - Submits those jobs to the cluster
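As a rough mental model, the script above groups the orders, sums each group, and joins the totals back to the customer names. Here is a Python sketch of the same dataflow, using made-up customer and order records (Pig would actually run this as MapReduce jobs on the cluster):

```python
from collections import defaultdict

# Hypothetical sample data mirroring the two LOAD statements
people = [(1, "Alice"), (2, "Bob")]                  # (cust_id, name)
orders = [(10, 1, 300), (11, 1, 200), (12, 2, 150)]  # (ord_id, cust_id, cost)

# GROUP orders BY cust_id, then SUM(orders.cost) per group
totals = defaultdict(int)
for _, cust_id, cost in orders:
    totals[cust_id] += cost

# JOIN totals BY group, people BY cust_id
result = [(cust_id, total, cust_id, name)
          for cust_id, total in totals.items()
          for pid, name in people if pid == cust_id]
print(result)  # [(1, 500, 1, 'Alice'), (2, 150, 2, 'Bob')]
```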

Apache Hive

- Hive is another abstraction on top of MapReduce
  - Like Pig, it also reduces development time
  - Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC;

- The Hive interpreter runs on a client machine
  - Turns HiveQL queries into MapReduce jobs
  - Submits those jobs to the cluster
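Because HiveQL closely mirrors standard SQL, the query above can be tried against almost any SQL engine. Here it runs essentially unchanged in SQLite, with hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost INTEGER)")
# Made-up rows, standing in for tables stored in HDFS
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 300), (11, 1, 200), (12, 2, 150)])

# The HiveQL query from the slide, run as plain SQL
rows = cur.execute("""
    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC
""").fetchall()
print(rows)  # [(1, 500), (2, 150)]
```

The difference is in execution, not syntax: Hive would compile this into MapReduce jobs that scan files in HDFS rather than pages in a local database.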

Apache HBase

- Can store massive amounts of data
  - Gigabytes, terabytes, and even petabytes of data in a table
  - Tables can have many thousands of columns
- Scales to provide very high write throughput
  - Hundreds of thousands of inserts per second
- Fairly primitive when compared to an RDBMS
  - "NoSQL": there is no high-level query language
  - Use the API to scan / get / put values based on keys

Cloudera Impala

- Can query data stored in HDFS or HBase tables
- High performance
  - Typically at least 10 times faster than Hive or MapReduce
  - High-level query language (subset of SQL-92)

Apache Sqoop

- Sqoop can import all tables, a single table, or a portion of a table into HDFS
  - Does this very efficiently via a map-only MapReduce job
  - Result is a directory in HDFS containing comma-delimited text files
- Sqoop can also export data from HDFS back to the database

Importing Tables with Sqoop

- This example imports the customers table from a MySQL database
  - Will create a /mydata/customers directory in HDFS
  - Directory will contain comma-delimited text files

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table customers

- Cloudera offers high-performance custom connectors for many databases

Importing an Entire Database with Sqoop

- Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --fields-terminated-by '\t' \
    --warehouse-dir /mydata

Importing Partial Tables with Sqoop

- Import only specified columns from the products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table products \
    --columns "prod_id,name,price"

- Import only matching rows from the products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table products \
    --where "price >= 1000"

Incremental Imports with Sqoop

- What about records added to the table since the last import?
  - Could re-import all records, but this is inefficient
- Sqoop's incremental append mode imports only new records
  - Based on the value of the last record in the specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821

- What if existing records are also modified in the database?
  - Incremental append mode doesn't handle this
  - Sqoop's incremental lastmodified mode imports both new and modified records
  - Caveat: you must maintain a timestamp column in your table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table shipments \
    --incremental lastmodified \
    --check-column last_update_date \
    --last-value "2013-06-12 03:15:59"

- Sqoop supports moving data from HDFS back to a database via export

$ sqoop export \
    --connect jdbc:mysql://localhost/company \
    --export-dir /mydata/recommender_output \
    --table product_recommendations

Apache Flume

- Flume imports data into HDFS as it is being generated by various sources
  - Sources include log files from Web and application servers, UNIX syslog, file servers, and custom sources
  - (Diagram: Flume streams this data into Hadoop, while Sqoop moves data between Hadoop, a relational database (OLTP), and a data warehouse (OLAP))
Apache Oozie

- Oozie allows developers to manage processing workflows
  - It coordinates execution and control of individual jobs
- Oozie supports many workflow actions, including
  - Executing MapReduce jobs
  - Running Pig or Hive scripts
  - Executing standard Java or shell programs
  - Running remote commands with SSH
  - Sending e-mail messages

Chapter Topics

- HDFS
- MapReduce
- Exercise Scenario Explanation
- Conclusion

- Hands-On Exercises throughout the course will reinforce the topics being discussed
  - Most exercises depend on data generated in earlier exercises
- The exercises are based on Dualcore, a fictional retailer
  - More than 1,000 brick-and-mortar stores
  - Dualcore also has a thriving e-commerce Web site
- Dualcore has hired you to help find value in their data
  - You will process and analyze data from internal and external sources
  - Identify opportunities to increase revenue
  - Find new ways to reduce costs
  - Help other departments achieve their goals

Chapter Topics

- HDFS
- MapReduce
- Conclusion

- During this course, you will perform numerous hands-on exercises using the Cloudera Training Virtual Machine (VM)
- The VM runs Hadoop in pseudo-distributed mode
  - Simply a cluster comprised of a single node
  - Typically used for testing code before deploying to a large cluster
- In the first exercise, you will import data from the local filesystem and a relational database server to HDFS
  - You will analyze this data in subsequent exercises

Chapter Topics

- HDFS
- MapReduce
- Conclusion

Essential Points

- We are generating more data, and faster, than ever before
- Most of this data maps poorly to structured relational tables
- The ability to store and process this data can yield valuable insight

Bibliography

The following offer more information on topics discussed in this chapter

- http://tiny.cloudera.com/dac02a
- Introduction to Apache MapReduce and HDFS (recorded presentation)
  http://tiny.cloudera.com/dac02b
- Guide to HDFS Commands
  http://tiny.cloudera.com/dac02c
- http://tiny.cloudera.com/dac02d
- Sqoop User Guide
  http://tiny.cloudera.com/dac02e

Introduction to Pig
Chapter 3

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Introduction to Pig

In this chapter, you will learn
- The key features Pig offers
- How organizations use Pig for data processing and analysis
- How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Apache Pig Overview

- Pig offers an alternative to writing MapReduce code directly
- Originally developed as a research project at Yahoo
  - Goals: flexibility, productivity, and maintainability
  - Now an open-source Apache project

The Anatomy of Pig

- Main components of Pig
  - The data flow language (Pig Latin)
  - The interactive shell where you can type Pig Latin statements (Grunt)
  - The Pig interpreter and execution engine
- The interpreter and execution engine turn a Pig Latin script, such as

    ... AS (cust, price);
    BigSales = FILTER AllSales
        BY price > 100;
    STORE BigSales INTO 'myreport';

  into MapReduce jobs. Along the way, they
  - Preprocess and parse Pig Latin
  - Check data types
  - Make optimizations
  - Plan execution
  - Generate MapReduce jobs
  - Monitor progress

Where to Get Pig

- Pig ships as part of CDH, alongside Hive, HBase, Oozie, and other ecosystem components
- Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  - Simple installation
  - 100% free and open source
- Installation is outside the scope of this course

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Pig Features

- Pig is an alternative to writing low-level MapReduce code
- Many features enable sophisticated analysis and processing
  - HDFS manipulation
  - UNIX shell commands
  - Relational operations
  - Positional references for fields
  - Common mathematical functions
  - Support for custom functions and data formats
  - Complex data structures

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

How Are Organizations Using Pig?

- Many organizations use Pig for data analysis
  - Finding relevant records in a massive data set
  - Querying multiple data sets
  - Calculating values from input data
- Pig is also frequently used for data processing
  - Reorganizing an existing data set
  - Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

- Pig can help you extract valuable information from Web server log files

Web Server Log Data (excerpt):

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
...

Processing these logs yields clickstream data for user sessions, for example:

Recent Activity for John Smith
  May 3, 2013:  Search for 'Widget', Details for Widget X, Order Widget X
  May 12, 2013: Track Order, Send Complaint

Use Case: Data Sampling

- Sampling can help you explore a representative portion of a large data set
  - Allows you to examine this portion with tools that do not scale well
  - Supports faster iterations during development of analysis jobs
  - (Diagram: random sampling reduces a 100 TB data set to a 50 MB sample)

Use Case: ETL Processing

- (Diagram: data from sources such as accounting systems and a call center flows through a pipeline that validates data, fixes errors, removes duplicates, and encodes values before loading into a data warehouse)

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Using Pig Interactively

- You can use Pig interactively, via the Grunt shell
  - Pig interprets each Pig Latin statement as you type it
  - Execution is delayed until output is required
- Example of how to start, use, and exit Grunt

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

- Can also execute a Pig Latin statement from the UNIX shell via the -e option

Interacting with HDFS

- You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX

- The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls; -- lists HDFS files
grunt> sh ls; -- lists local files

Running Pig Scripts

- A Pig script is simply Pig Latin code stored in a text file
  - By convention, these files have the .pig extension
- You can run a Pig script from within the Grunt shell via the run command

grunt> run salesreport.pig;

- It is common to run a Pig script directly from the UNIX shell
  - This is useful for automation and batch execution

$ pig salesreport.pig

MapReduce and Local Modes

- As described earlier, Pig turns Pig Latin into MapReduce jobs
- It is also possible to run Pig in local mode using the -x flag

$ pig -x local salesreport.pig -- batch

Client-Side Log Files

- If a job fails, Pig may produce a log file to explain why
  - These log files are typically produced in your current working directory
  - On the local (client) machine

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Essential Points

- Pig offers an alternative to writing MapReduce code directly
  - Pig interprets Pig Latin code in order to create MapReduce jobs
- You can execute Pig Latin code interactively through Grunt
  - Pig delays job execution until output is required
- It is also common to store Pig Latin code in a script for batch execution
  - Allows for automation and code reuse

Bibliography

The following offer more information on topics discussed in this chapter

- Apache Pig Web Site
  http://pig.apache.org/
- Process a Million Songs with Apache Pig
  http://tiny.cloudera.com/dac03a
- Powered By Pig
  http://tiny.cloudera.com/dac03b
- http://tiny.cloudera.com/dac03c
- Programming Pig (book)
  http://tiny.cloudera.com/dac03d

Basic Data Analysis with Pig
Chapter 4

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Extending Pig
- Introduction to Hive
- Hive Data Management
- Text Processing With Hive
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Basic Data Analysis with Pig

In this chapter, you will learn
- The basic syntax of Pig Latin
- Which simple data types Pig uses to represent data
- How to sort and filter data in Pig
- How to use many of Pig's built-in functions for data processing

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

- Pig Latin is a data flow language
  - The flow of data is expressed as a sequence of statements

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

- Pig Latin keywords (such as FILTER, BY, STORE, and INTO above) are reserved; you cannot use them to name things
- Identifiers are the names assigned to fields and other data structures (such as bigsales, allsales, and price above)
- Identifiers must conform to Pig's naming rules
- An identifier must always begin with a letter
  - This may only be followed by letters, numbers, or underscores

    Valid:   x  q1  q1_2013  MyData
    Invalid: 4  price$  profit%  _sale

- Whether case is significant in Pig Latin depends on context
- Keywords are not case-sensitive
  - Neither are operators (such as AND, OR, or IS NULL)
- Identifiers and paths are case-sensitive
  - So are function names (such as SUM or COUNT) and constants
- Many commonly-used operators in Pig Latin are familiar to SQL users

Arithmetic:  +  -  *  /  %
Comparison:  ==  !=  <  >  <=  >=
Null:        IS NULL  IS NOT NULL
Boolean:     AND  OR  NOT

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

- Pig's default load function, PigStorage, assumes text format with tab-separated columns
- Consider the following file in HDFS called sales
  - The two fields are separated by tab characters

Alice   2999
Bob     3625
Carlos  2764

allsales = LOAD 'sales' AS (name, price);

Data Sources: Files and Directories

allsales = LOAD 'sales' AS (name, price);

- Since this is not an absolute path, it is relative to your home directory
  - Your home directory in HDFS is typically /user/youruserid/
  - Can also specify an absolute path (e.g., /dept/sales/2012/q4)
- The path can also refer to a directory
  - In that case, all files in that directory are loaded
- File patterns (globs) are also supported

allsales = LOAD 'sales_200[5-9]' AS (name, price);

- The previous example also assigns names to each column

allsales = LOAD 'sales' AS (name, price);

- Assigning column names is not required
  - This can be useful when exploring a new dataset
  - Refer to fields by position ($0 is first, $1 is second, $53 is 54th, etc.)
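Positional references behave much like indexing into a tuple. A quick Python analogy, with made-up rows:

```python
# Each row is an ordered tuple of fields, as in a Pig relation
rows = [("Alice", 2999), ("Bob", 3625), ("Carlos", 2764)]

# Equivalent of referring to the second field as $1 when no names were assigned
prices = [row[1] for row in rows]
print(prices)  # [2999, 3625, 2764]
```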

Using Alternate Column Delimiters

- You can specify an alternate delimiter as an argument to PigStorage
  - Note that this is a single statement

allsales = LOAD 'sales.txt' USING PigStorage('|')
    AS (name, price);

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

Simple Data Types in Pig

- Pig supports several basic data types
  - Similar to those in most databases and programming languages
- Pig treats fields of unspecified type as an array of bytes
  - Called the bytearray type in Pig

allsales = LOAD 'sales' AS (name, price);

List of Simple Data Types

- There are eight data types in Pig for simple values

Name        Description                  Example Value
int         Whole numbers                2013
long        Large whole numbers          5,365,214,142L
float       Decimals                     3.14159F
double      Very precise decimals        3.14159265358979323846
boolean*    True or false values         true
datetime*   Date and time                2013-05-30T14:52:39.000-04:00
chararray   Text strings                 Alice
bytearray   Raw bytes (e.g., any data)   N/A

* Not available in older versions of Pig

Specifying Data Types in Pig

- Pig will do its best to determine data types based on context
  - For example, you can calculate sales commission as price * 0.1
  - In this case, Pig will assume that this value is of type double
- However, it is better to specify data types explicitly when possible

allsales = LOAD 'sales' AS (name:chararray, price:int);

- Choosing the right data type is important to avoid loss of precision
- Important: avoid using floating-point numbers to represent money!

HCatalog

- A new project called HCatalog can store this information permanently
  - So it need not be specified each time
- However, HCatalog is still in early development and is not yet widely used

How Pig Handles Invalid Data

- When encountering invalid data, Pig substitutes NULL for the value
  - For example, an int field containing the value Q4
- The IS NULL and IS NOT NULL operators test for null values
  - Note that NULL is not the same as the empty string ''

hasprices = FILTER Records BY price IS NOT NULL;
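A minimal Python sketch of this behavior, with hypothetical rows, where None stands in for Pig's NULL:

```python
def parse_int(raw):
    """Mimic Pig: substitute NULL (None) when a value cannot be parsed."""
    try:
        return int(raw)
    except ValueError:
        return None

rows = [("Alice", "2999"), ("Bob", "Q4"), ("Carlos", "2764")]
records = [(name, parse_int(price)) for name, price in rows]

# Equivalent of: hasprices = FILTER records BY price IS NOT NULL;
hasprices = [r for r in records if r[1] is not None]
print(hasprices)  # [('Alice', 2999), ('Carlos', 2764)]
```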

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

Key Data Concepts in Pig

- Relational databases have tables, rows, columns, and fields
- We will use the following data to illustrate Pig's equivalents

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

Pig Data Concepts: Fields

- A single element of data is called a field
  - It corresponds to one of the eight data types seen earlier

Pig Data Concepts: Tuples

- A collection of values is called a tuple
  - Fields within a tuple are ordered, but need not all be of the same type

Pig Data Concepts: Bags

- A collection of tuples is called a bag
- Tuples within a bag are unordered by default
  - The field count and types may vary between tuples in a bag

- A relation is simply a bag with an assigned name (alias)
  - Below, allsales and bigsales are both relations

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
STORE bigsales INTO 'myreport';

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

Data Output in Pig

- The command used to handle output depends on its destination
  - DUMP: sends output to the screen
  - STORE: sends output to disk (HDFS)
- Example of DUMP output, using data from the file shown earlier
  - The parentheses and commas indicate tuples with multiple fields

(Alice,2999,us)
(Bob,3625,ca)
(Carlos,2764,mx)
(Dieter,1749,de)
(Étienne,2368,fr)
(Fredo,5637,it)

Storing Data with Pig

- The STORE command is used to store data to HDFS
  - The output path is the name of a directory
  - The directory must not yet exist
  - The field delimiter also has a default value (tab)

STORE bigsales INTO 'myreport';

- You may also specify an alternate delimiter

STORE bigsales INTO 'myreport' USING PigStorage(',');

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing%the%Schema"
!! Filtering"and"SorDng"Data"""
!! Commonly/used"FuncDons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"Processing"
!! Conclusion"

Viewing"the"Schema"with"DESCRIBE

!The%DESCRIBE%command%shows%the%structure%of%the%data,%including%
names%and%types%
!The%following%Grunt%session%shows%an%example%

## grunt> allsales = LOAD 'sales' AS (name:chararray,

% price:int);
grunt> DESCRIBE allsales;

## allsales: {name: chararray,price: int}

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering%and%SorCng%Data"
!! Commonly/used"FuncDons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"Processing"
!! Conclusion"

! The FILTER keyword extracts tuples matching the specified criteria

bigsales = FILTER allsales BY price > 3000;

allsales                      bigsales (price > 3000)
name     price  country      name   price  country
Alice    2999   us           Bob    3625   ca
Bob      3625   ca           Fredo  5637   it
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

Filtering"by"MulDple"Criteria"

!You%can%combine%criteria%with%AND%and%OR

## somesales = FILTER allsales BY name == 'Dieter' OR (price >

3500 AND price < 4000);

allsales somesales

## name% price% country% name% price% country%

Alice 2999 us Bob 3625 ca
Bob 3625 ca Dieter 1749 de
Carlos 2764 mx
Dieter 1749 de Name%is%Dieter,%or%price%is%greater%%
tienne 2368 fr than%3500%and%less%than%4000"
Fredo 5637 it

! The == operator is supported for any type in Pig Latin
  - This operator is used for exact comparisons

alices = FILTER allsales BY name == 'Alice';

! Pig Latin supports pattern matching through Java's regular expressions
  - This is done with the MATCHES operator

spammers = FILTER senders BY email_addr
           MATCHES '.*@example\\.com$';
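Note that, like Java's `String.matches()`, Pig's MATCHES must match the entire field, not just a substring. As a rough illustration of those whole-string semantics (not of how Pig itself runs), the same pattern behaves like `re.fullmatch` in Python; the sample addresses here are made up:

```python
import re

# Pig's MATCHES requires the pattern to cover the ENTIRE field value,
# so Python's re.fullmatch (not re.search) mirrors its behavior.
pattern = r'.*@example\.com$'

senders = ['alice@example.com', 'bob@other.org', 'eve@example.com.evil']
spammers = [s for s in senders if re.fullmatch(pattern, s)]
print(spammers)  # ['alice@example.com']
```

The third address is excluded because the string does not end at `.com`, even though it contains `@example.com`.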

! Filtering extracts rows, but sometimes we need to extract columns

twofields = FOREACH allsales GENERATE amount, trans_id;

allsales                             twofields
salesperson  amount  trans_id       amount  trans_id
Alice        2999    107546         2999    107546
Bob          3625    107547         3625    107547
Carlos       2764    107548         2764    107548
Dieter       1749    107549         1749    107549
Étienne      2368    107550         2368    107550
Fredo        5637    107550         5637    107550

! The FOREACH and GENERATE keywords can also be used to create fields
  - For example, you could create a new field based on price

t = FOREACH allsales GENERATE price * 0.07;

! It is possible to name such fields

t = FOREACH allsales GENERATE price * 0.07 AS tax;

! And you can also specify the data type

t = FOREACH allsales GENERATE price * 0.07 AS tax:float;

! DISTINCT eliminates duplicate records in a bag
  - All fields must be equal for a record to be considered a duplicate

unique_records = DISTINCT all_alices;

all_alices                       unique_records
firstname  lastname  country     firstname  lastname  country
Alice      Smith     us          Alice      Smith     us
Alice      Jones     us          Alice      Jones     us
Alice      Brown     us          Alice      Brown     us
Alice      Brown     us          Alice      Brown     ca
Alice      Brown     ca

Controlling"Sort"Order

!Use%ORDER...BY%to%sort%the%records%in%a%bag%in%ascending%order%
Take"care"to"specify"a"schema""data"type"aects"how"data"is"sorted!"

## sortedsales = ORDER allsales BY country DESC;

allsales sortedsales
name% price% country% name% price% country%
Alice 29.99 us Alice 29.99 us
Bob 36.25 ca Carlos 27.64 mx
Carlos 27.64 mx Fredo 56.37 it
Dieter 17.49 de tienne 23.68 fr
tienne 23.68 fr Dieter 17.49 de
Fredo 56.37 it Bob 36.25 ca

LimiDng"Results"

!As%in%SQL,%you%can%use%LIMIT%to%reduce%the%number%of%output%records%

## somesales = LIMIT allsales 10;

!Beware!%Record%ordering%is%random%unless%specied%with%ORDER BY
Use"ORDER BY"and"LIMIT"together"to"nd"top/N"results"

## sortedsales = ORDER allsales BY price DESC;

top_five = LIMIT sortedsales 5;
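Pig executes this on the cluster, but the sort-then-slice logic is easy to picture. As a small in-memory illustration only (using the sample sales data from earlier slides), the ORDER BY ... DESC plus LIMIT pattern behaves like:

```python
# Sample data from the earlier slides: (name, price) tuples
allsales = [('Alice', 2999), ('Bob', 3625), ('Carlos', 2764),
            ('Dieter', 1749), ('Etienne', 2368), ('Fredo', 5637)]

# sortedsales = ORDER allsales BY price DESC;
sortedsales = sorted(allsales, key=lambda t: t[1], reverse=True)

# top_five = LIMIT sortedsales 5;
top_five = sortedsales[:5]
print(top_five[0])  # ('Fredo', 5637) -- the highest price comes first
```

Without the explicit sort, slicing would return five arbitrary records, just as LIMIT alone returns records in no guaranteed order.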

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering"and"SorDng"Data"
!! Commonly#used%FuncCons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"processing"
!! Conclusion"

Built/in"FuncDons"

!These%are%just%a%sampling%of%Pigs%many%built#in%funcCons%
%
FuncCon%DescripCon% Example%InvocaCon% Input% Output%
Convert"to"uppercase" UPPER(country) uk UK

## Return"a"random"number" RANDOM() 0.4816132

6652569
Round"to"closest"whole"number" ROUND(price) 37.19 37

## Return"chars"between"two"posiDons" SUBSTRING(name, 0, 2) Alice Al

!You%can%use%these%with%the%FOREACH..GENERATE%keywords%

## rounded = FOREACH allsales GENERATE ROUND(price);

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering"and"SorDng"Data"
!! Commonly/used"FuncDons"
!! Hands#On%Exercise:%Using%Pig%for%ETL%processing"
!! Conclusion"

Hands/On"Exercise:"Using"Pig"for"ETL"processing"

!In%this%Hands#On%Exercise,%you%will%write%%Pig%LaCn%code%to%perform%basic%ETL%

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering"and"SorDng"Data"
!! Commonly/used"FuncDons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"processing"
!! Conclusion"

EssenDal"Points"

!Pig%LaCn%supports%many%of%the%same%operaCons%as%SQL%
Though"Pigs"approach"is"quite"dierent"
!The%default%delimiter%for%both%input%and%output%is%the%tab%character%
You"can"specify"an"alternate"delimiter"as"an"argument"to"PigStorage
!Specifying%the%names%and%types%of%elds%is%not%required%

Bibliography"

The%following%oer%more%informaCon%on%topics%discussed%in%this%chapter%
!Pig%LaCn%Basics%
http://tiny.cloudera.com/dac04a
!Pig%LaCn%Built#In%FuncCons%
http://tiny.cloudera.com/dac04b
!DocumentaCon%for%Java%Regular%Expression%PaPerns%
http://tiny.cloudera.com/dac04c
!Installing%and%Using%HCatalog%
http://tiny.cloudera.com/dac04d

Processing"Complex"Data"with"Pig"
Chapter"5"

Course"Chapters"

!! IntroducFon"
!! IntroducFon"to"Pig"
!! Basic"Data"Analysis"with"Pig"
!! Processing%Complex%Data%with%Pig%
!! MulF/Dataset"OperaFons"with"Pig"
!! Extending"Pig"
!! Pig"TroubleshooFng"and"OpFmizaFon"
!! IntroducFon"to"Hive"
!! RelaFonal"Data"Analysis"with"Hive"
!! Hive"Data"Management"
!! Text"Processing"With"Hive"
!! Hive"OpFmizaFon"
!! Extending"Hive"
!! IntroducFon"to"Impala"
!! Analyzing"Data"with"Impala"
!! Choosing"the"Best"Tool"for"the"Job"
!! Conclusion"

Processing"Complex"Data"with"Pig"

In%this%chapter,%you%will%learn%
!How%Pig%uses%bags,%tuples,%and%maps%to%represent%complex%data%
!The%techniques%Pig%provides%for%grouping%and%ungrouping%data%
!How%to%use%aggregate%funcFons%in%Pig%LaFn%
!How%to%iterate%through%records%in%complex%data%structures%

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage%Formats%
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"

Storage"Formats"

Uses"a"delimited"text"le"format""

## allsales = LOAD 'sales' AS (name, price);

The"default"delimiter"(tab)"can"be"easily"changed"

## allsales = LOAD 'sales' USING PigStorage(',')

AS (name, price) ;

Other"Supported"Formats"

BinStorage Files"containing"binary"data"

PigStorage PigStorage
BinStorage BinStorage"

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested%Data%Types%
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"

Pigs"Complex"Data"Types:"Tuple"and"Bag"

A"tuple"is"a"collecFon"of"values""
A"bag"is"a"collecFon"of"tuples"

## trans_id% total% salesperson%

107546 2999 Alice
107547 3625 Bob
107548 2764 Carlos tuple"
107549 1749 Dieter
107550 2368 tienne
107551 5637 Fredo bag"

Pigs"Complex"Data"Types:"Map"

!Pig%also%supports%another%complex%type:%Map%
A"map"associates"a"chararray"(key)"to"another"data"element"(value)"

## trans_id% amount% salesperson% sales_details%

107546 2498 Alice date 12-02-2013
SKU 40155
store MIA01
107547 3625 Bob date 12-02-2013
SKU 3720
store STL04
coupon DEC13
107548 2764 Carlos date 12-03-2013
SKU 76102
store NYC15

RepresenFng"Complex"Types"in"Pig"

!It%is%important%to%know%how%to%dene%and%recognize%these%types%in%Pig%

Type% DeniFon%
Tuple% Comma/delimited"list"inside"parentheses:"
"
"""('107546', 2498, 'Alice')

Bag% Braces"surround"comma/delimited"list"of"tuples:"
"
"""{('107546', 2498, 'Alice'), ('107547', 3625, 'Bob')}

Map% Brackets"surround"comma/delimited"list"of"pairs;"keys"and"values"separated"by"#:"
"
"""['store'#'MIA01','location'#'Coral Gables']

! Complex data types can be used in any Pig field
! The following example shows how a bag is stored in a text file
  - Example: transaction ID, amount, items sold (a bag of tuples)

107550 <TAB> 2498 <TAB> {('40120', 1999), ('37001', 499)}
(Field 1)    (Field 2)  (Field 3)

details = LOAD 'salesdetail' AS (
    trans_id:chararray, amount:int,
    items_sold:bag
        {item:tuple (SKU:chararray, price:int)});

! The following example shows how a map is stored in a text file
  - Example: customer name, credit account details (map), year account opened

Eva <TAB> [creditlimit#5000,creditused#800] <TAB> 2012
(Field 1) (Field 2)                              (Field 3)

credit = LOAD 'customer_accounts' AS (
    name:chararray, account:map[], year:int);

Referencing"Map"Data"

!Consider%a%le%with%the%following%data%

Bob [salary#52000,age#52]
%

## details = LOAD 'data' AS (name:chararray, info:map[]);

%
!Here%is%the%syntax%for%referencing%data%within%the%map%and%bag%

## salaries = FOREACH details GENERATE info#'salary';

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping%
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"

Grouping"Records"By"a"Field"(1)"

!SomeFmes%you%need%to%group%records%by%a%given%eld%
For"example,"so"you"can"calculate"commissions"for"each"employee"

Alice 729
Bob 3999
Alice 27999
Carol 32999
Carol 4999
"
! Use%GROUP BY%to%do%this%in%Pig%LaFn%
The"new"relaFon"has"one"record"per"unique"value"in"the"specied"eld

## grunt> byname = GROUP sales BY name;

Grouping"Records"By"a"Field"(2)"

!The%new%relaFon%always%contains%two%elds%

## grunt> byname = GROUP sales BY name;

grunt> DESCRIBE byname;
byname: {group: chararray,sales: {(name:
chararray,price: int)}}

!The%rst%eld%is%literally%named%group%in%all%cases
Contains"the"value"from"the"eld"specied"in"GROUP BY
!The%second%eld%is%named%a^er%the%relaFon%specied%in%GROUP BY
Its"a"bag"containing"one"tuple"for"each"corresponding"value"

Grouping"Records"By"a"Field"(3)"

!The%example%below%shows%the%data%a^er%grouping%
Input"Data"(sales)"

## grunt> byname = GROUP sales BY name; Alice 729

grunt> DUMP byname; Bob 3999
(Bob,%{(Bob,3999)}) Alice 27999
(Alice,{(Alice,729),(Alice,27999)}) Carol 32999
Carol 4999
(Carol,{(Carol,32999),(Carol,4999)})

group sales
eld% eld%

Using"GROUP BY"to"Aggregate"Data"

!Aggregate%funcFons%create%one%output%value%from%mulFple%input%values%
For"example,"to"calculate"total"sales"by"employee"
Usually"applied"to"grouped"data"

## grunt> byname = GROUP sales BY name;

grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

## grunt> totals = FOREACH byname GENERATE

!We%can%use%the%SUM%funcFon%to%XXX%
group, SUM(sales.price);
grunt> dump totals;
(Bob,3999)
(Alice,28728)
(Carol,37998)
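The GROUP-then-SUM pattern above is the core of most Pig aggregations. Purely as an in-memory illustration of the semantics (Pig itself runs this as a distributed job), the same grouping and summing can be sketched in Python with the sample sales data:

```python
from collections import defaultdict

# Sample data from the slide: (name, price) tuples
sales = [('Alice', 729), ('Bob', 3999), ('Alice', 27999),
         ('Carol', 32999), ('Carol', 4999)]

# byname = GROUP sales BY name; -- one bag of tuples per unique name
byname = defaultdict(list)
for record in sales:
    byname[record[0]].append(record)

# totals = FOREACH byname GENERATE group, SUM(sales.price);
totals = {name: sum(price for _, price in bag)
          for name, bag in byname.items()}
print(totals)  # {'Alice': 28728, 'Bob': 3999, 'Carol': 37998}
```

Note that SUM operates on the bag inside each grouped record, which is why the Pig expression references `sales.price` rather than just `price`.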

Grouping"Everything"Into"a"Single"Record

! We%just%saw%that%GROUP BY%creates%one%record%for%each%unique%value%
! GROUP ALL%puts%all%data%into%one%record

## grunt> grouped = GROUP sales ALL;

grunt> DUMP grouped;
(all,{(Alice,729),(Bob,3999),(Alice,27999),
(Carol,32999),(Carol,4999)})

Using"GROUP ALL"to"Aggregate"Data"

!Use%GROUP ALL%when%you%need%to%aggregate%one%or%more%columns%
For"example,"to"calculate"total"sales"for"all"employees"

## grunt> grouped = GROUP sales ALL;

grunt> DUMP grouped;
(all,{(Alice,729),(Bob,3999),(Alice,27999),(Carol,32999),
(Carol,4999)})
!We%can%use%the%SUM%funcFon%to%XXX%
grunt> totals = FOREACH grouped GENERATE SUM(sales.price);
grunt> dump totals;
(70725)

Removing"NesFng"in"Data

! Some%operaFons%in%Pig,%like%grouping,%produce%nested%data%structures%

## grunt> byname = GROUP sales BY name;

grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

!Grouping%can%be%useful%to%supply%data%to%aggregate%funcFons"
!However,%someFmes%you%want%to%work%with%a%at%data%structure%
The"FLATTEN"operator"removes"a"level"of"nesFng"in"data"

An"Example"of"FLATTEN

! The%following%shows%the%nested%data%and%what%FLATTEN%does%to%it%
% grunt> byname = GROUP sales BY name;
grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

## grunt> flat = FOREACH byname GENERATE group,

FLATTEN(sales.price);
grunt> DUMP flat;
(Bob,3999)
(Alice,729)
(Alice,27999)
(Carol,32999)
(Carol,4999)
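Conceptually, FLATTEN pairs the group key with each tuple in the bag, producing one output record per bag element. As a small Python sketch of that un-nesting (an illustration of the semantics only, using the grouped data from the slide):

```python
# Grouped data as on the slide: group key -> bag of (name, price) tuples
byname = {
    'Bob':   [('Bob', 3999)],
    'Alice': [('Alice', 729), ('Alice', 27999)],
    'Carol': [('Carol', 32999), ('Carol', 4999)],
}

# flat = FOREACH byname GENERATE group, FLATTEN(sales.price);
# FLATTEN emits one (group, price) record per tuple in each bag.
flat = [(group, price)
        for group, bag in byname.items()
        for _, price in bag]
print(flat)
```

A group whose bag holds two tuples contributes two flat records, which is exactly why Alice and Carol each appear twice in the DUMP output above.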

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built#in%FuncFons%for%Complex%Data%
!! IteraFng"Grouped"Data"
!! Conclusion"

Pigs"Built/In"Aggregate"FuncFons"

!Pig%has%built#in%support%for%other%aggregate%funcFons%besides%SUM
!Examples:%
AVG:""Calculates"the"average"(mean)"of"all"values"
MIN:""Returns"the"smallest"value"
MAX:""Returns"the"largest"value"
!Pig%has%two%built#in%funcFons%for%counFng%records%
COUNT:""Returns"the"number"of"non#null"elements"in"the"bag"
COUNT_STAR:""Returns"the"number"of"all"elements"in"the"bag"

Other"Notable"Built/in"FuncFons"

!Here%are%a%some%other%useful%Pig%funcFons%
See"the"Pig"documentaFon"for"a"complete"list"

FuncFon% DescripFon%
DIFF Finds"tuples"that"appear"in"only"one"of"two"supplied"bags
IsEmpty Used"with"FILTER"to"match"bags"or"maps"that"contain"no"data"
SIZE Returns"the"size"of"the"eld"(deniFon"of"size"varies"by"data"type)
TOKENIZE Splits"a"text"string"(chararray)"into"a"bag"of"individual"words"

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng%Grouped%Data%
!! Conclusion"

Record"IteraFon

!We%have%seen%that%FOREACH...GENERATE%iterates%through%records%
!The%goal%is%to%transform%records%to%produce%a%new%relaFon%
SomeFmes"to"select"only"certain"columns"

## price_column_only = FOREACH sales GENERATE price;

SomeFmes"to"create"new"columns"

## taxes = FOREACH sales GENERATE price * 0.07;

SomeFmes"to"invoke"a"funcFon"on"the"data"

## totals = FOREACH grouped GENERATE SUM(sales.price);

NesFng"the"FOREACH"Keyword"

!A%variaFon%on%FOREACH%applies%a%set%of%operaFons%to%each%record%
This"is"oeen"used"to"apply"a"series"of"transformaFons"in"a"group"
!This%is%called%a%nested%FOREACH%
Allows"only"relaFonal"operaFons"(e.g., LIMIT,"FILTER,"ORDER BY)"
GENERATE"must"be"the"last"line"in"the"block"

Nested"FOREACH"Example"(1)"
Input"Data"
!Our%input%data%contains%a%list%of%employee%%
job%Ftles%and%corresponding%salaries% President 192000
Director 152500
!Goal:%idenFfy%the%three%highest%salaries% Director 161000
within%each%Ftle% Director 167000
Director 165000
Director 147000
Engineer 92300
Engineer 85000
Engineer 83000
Engineer 81650
Engineer 82100
Engineer 87300
Engineer 76000
Manager 87000
Manager 81000
Manager 75000
Manager 79000
Manager 67500

Nested"FOREACH"Example"(2)"
Input"Data"(excerpt)"
President 192000
!Next,%group%employees%by%Ftle% Director 152500
Assigned"to"new"relaFon"Ftle_group" Director 161000
...
Engineer 92300
...
Manager 67500

## employees = LOAD 'data' AS (title:chararray, salary:int);

title_group = GROUP employees BY title;

## top_salaries = FOREACH title_group {

sorted = ORDER employees BY salary DESC;
highest_paid = LIMIT sorted 3;
GENERATE group, highest_paid;
};

Nested"FOREACH"Example"(3)"
Input"Data"(excerpt)"
!The%nested%FOREACH%iterates%through%every%
record%in%the%group%(i.e.,%each%job%Ftle)% President 192000
It"sorts"each"record"in"that"group"in"" Director 152500
Director 161000
descending"order"of"salary" ...
It"then"selects"the"top"three" Engineer 92300
...
GENERATE"outputs"the"Ftle"and"salaries"
Manager 67500

## employees = LOAD 'data' AS (title:chararray, salary:int);

title_group = GROUP employees BY title;

## top_salaries = FOREACH title_group {

sorted = ORDER employees BY salary DESC;
highest_paid = LIMIT sorted 3;
GENERATE group, highest_paid;
};

Nested"FOREACH"Example"(4)
%%
title_group = GROUP employees BY title; President 192000
Director 152500
top_salaries = FOREACH title_group { Director 161000
sorted = ORDER employees BY salary DESC; ...
highest_paid = LIMIT sorted 3; Engineer 92300
GENERATE group, highest_paid; ...
}; Manager 67500

Output"produced"by"DUMP top_salaries

(Director,{(Director,167000),(Director,165000),(Director,161000)})
(Engineer,{(Engineer,92300),(Engineer,87300),(Engineer,85000)})
(Manager,{(Manager,87000),(Manager,81000),(Manager,79000)})
(President,{(President,192000)})
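The "top-N per group" idiom shown in the nested FOREACH is common enough to be worth internalizing. As an in-memory Python sketch of the same logic (for intuition only; the figures are a subset of the slide's input data):

```python
from collections import defaultdict

# Subset of the slide's input data: (title, salary) records
employees = [('President', 192000), ('Director', 152500),
             ('Director', 161000), ('Director', 167000),
             ('Director', 165000), ('Director', 147000),
             ('Engineer', 92300), ('Engineer', 87300),
             ('Engineer', 85000), ('Engineer', 83000),
             ('Manager', 87000), ('Manager', 81000),
             ('Manager', 79000), ('Manager', 75000)]

# title_group = GROUP employees BY title;
title_group = defaultdict(list)
for title, salary in employees:
    title_group[title].append(salary)

# Nested FOREACH: within each group, ORDER ... DESC then LIMIT 3
top_salaries = {title: sorted(salaries, reverse=True)[:3]
                for title, salaries in title_group.items()}
print(top_salaries['Director'])  # [167000, 165000, 161000]
```

Groups smaller than N simply produce fewer records, just as President yields a single tuple in the Pig output above.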

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"


Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion%

EssenFal"Points"

!Pig%has%three%complex%data%types:%tuple,%bag,%and%map
A"map"is"simply"a"collecFon"of"key/value"pairs"
!These%structures%can%contain%simple%types%like%int%or%chararray
But"they"can"also"contain"complex"data"types"
Nested"data"structures"are"common"in"Pig"
!Pig%provides%methods%for%grouping%and%ungrouping%data%
You"can"remove"a"level"of"nesFng"using"the"FLATTEN"operator"
!Pig%oers%several%built#in%aggregate%funcFons%

MulA/Dataset"OperaAons"with"Pig"
Chapter"6"

Course"Chapters"

!! IntroducAon"
!! IntroducAon"to"Pig"
!! Basic"Data"Analysis"with"Pig"
!! Processing"Complex"Data"with"Pig"
!! Mul*#Dataset%Opera*ons%with%Pig%
!! Extending"Pig"
!! Pig"TroubleshooAng"and"OpAmizaAon"
!! IntroducAon"to"Hive"
!! RelaAonal"Data"Analysis"with"Hive"
!! Hive"Data"Management"
!! Text"Processing"With"Hive"
!! Hive"OpAmizaAon"
!! Extending"Hive"
!! IntroducAon"to"Impala"
!! Analyzing"Data"with"Impala"
!! Choosing"the"Best"Tool"for"the"Job"
!! Conclusion"

MulA/Dataset"OperaAons"with"Pig"

In%this%chapter,%you%will%learn%
!How%we%can%use%grouping%to%combine%data%from%mul*ple%sources%
!What%types%of%join%opera*ons%Pig%supports%and%how%to%use%them%
!How%to%concatenate%records%to%produce%a%single%data%set%
!How%to%split%a%single%data%set%into%mul*ple%rela*ons%

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques%for%Combining%Data%Sets%
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

Overview"of"Combining"Data"Sets"

!So%far,%we%have%concentrated%on%processing%single%data%sets%
Valuable"insight"oXen"results"from"combining"mulAple"data"sets"
!Pig%oers%several%techniques%for%achieving%this%
Using"the"GROUP"operator"with"mulAple"relaAons"
Joining"the"data"as"you"would"in"SQL"
Performing"set"operaAons"like"CROSS"and"UNION
!We%will%cover%each%of%these%in%this%chapter%

Example"Data"Sets"(1)"
Stores"
!Most%examples%in%this%chapter%will%involve%the%%
same%two%data%sets% A Anchorage
B Boston
D Dallas
Dualcores%stores% E Edmonton
F Fargo
!There%are%two%elds%in%this%rela*on%
1. store_id:chararray"(unique"key)"
2. name:chararray"(name"of"the"city"in"which"the"store"is"located)"

Example"Data"Sets"(2)"
Stores"
!Our%other%data%set%is%a%le%containing%
B Boston
!This%rela*on%contains%three%elds% C Chicago
D Dallas
1. person_id:int"(unique"key)" E Edmonton
2. name:chararray"(salesperson"name)" F Fargo
3. store_id:chararray"(refers"to"store)" Salespeople"

1 Alice B
2 Bob D
3 Carlos F
4 Dieter A
5 tienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack

Grouping"MulAple"RelaAons"

Groups"values"in"a"relaAon"based"on"the"specied"eld(s)"
!The%GROUP%operator%can%also%group%mul\$ple%rela*ons%
In"this"case,"using"the"synonymous"COGROUP"operator"is"preferred"

## grouped = COGROUP stores BY store_id, salespeople BY store_id;

!This%collects%values%from%both%data%sets%into%a%new%rela*on%
As"before,"the"new"relaAon"is"keyed"by"a"eld"named"group
This"group"eld"is"associated"with"one"bag"for"each"input"

## store_id records from stores records from salespeople

Example"of"COGROUP
Stores"
A Anchorage grunt> grouped = COGROUP stores BY store_id,
B Boston salespeople BY store_id;
C Chicago
D Dallas grunt> DUMP grouped;
E Edmonton (A,{(A,Anchorage)},{(4,Dieter,A)})
F Fargo (B,{(B,Boston)},{(1,Alice,B),(8,Hannah,B)})
(C,{(C,Chicago)},{(6,Fredo,C),(9,Irina,C)})
Salespeople" (D,{(D,Dallas)},{(2,Bob,D),(7,George,D)})
(E,{(E,Edmonton)},{})
1 Alice B (F,{(F,Fargo)},{(3,Carlos,F),(5,tienne,F)})
2 Bob D (,{},{(10,Jack,)})
3 Carlos F
4 Dieter A
5 tienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining%Data%Sets%in%Pig%
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

Join"Overview"

!The%COGROUP%operator%creates%a%nested%data%structure%
!Pig%La*ns%JOIN%operator%creates%a%at%data%structure%
Similar"to"joins"in"a"relaAonal"database"
!A%JOIN%is%similar%to%doing%a%COGROUP%followed%by%a%FLATTEN
Though"they"handle"null"values"dierently"

Key"Fields"

!Like%COGROUP,%joins%rely%on%a%eld%shared%by%each%rela*on%

## joined = JOIN stores BY store_id, salespeople BY store_id;

!Joins%can%also%use%mul*ple%elds%as%the%key%

## joined = JOIN customers BY (name, phone_number),

accounts BY (name, phone_number);

Inner"Joins"

!The%default%JOIN%in%Pig%La*n%is%an%inner%join%

## joined = JOIN stores BY store_id, salespeople BY store_id;

!An%inner%join%outputs%records%only%when%the%key%is%found%in%all%inputs%
In"the"above"example,"stores"that"have"at"least"one"salesperson"
!You%can%do%an%inner%join%on%mul*ple%rela*ons%in%a%single%statement%
But"you"must"use"the"same"key"to"join"them"

Inner"Join"Example"
stores"
A Anchorage grunt> joined = JOIN stores BY store_id,
B Boston salespeople BY store_id;
C Chicago
D Dallas grunt> DUMP joined;
E Edmonton (A,Anchorage,4,Dieter,A)
F Fargo (B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
salespeople" (C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
1 Alice B (D,Dallas,2,Bob,D)
2 Bob D (D,Dallas,7,George,D)
3 Carlos F (F,Fargo,3,Carlos,F)
4 Dieter A (F,Fargo,5,tienne,F)
5 tienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack
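Notice what the inner join drops: Edmonton (a store with no salesperson) and Jack (a salesperson with no store). As a small in-memory Python illustration of those semantics only (using a hand-picked subset of the two relations), the logic amounts to:

```python
# Subsets of the two relations: (store_id, city) and (id, name, store_id)
stores = [('A', 'Anchorage'), ('B', 'Boston'), ('C', 'Chicago')]
salespeople = [(1, 'Alice', 'B'), (4, 'Dieter', 'A'),
               (8, 'Hannah', 'B'), (10, 'Jack', None)]

# joined = JOIN stores BY store_id, salespeople BY store_id;
# An inner join keeps only keys present in BOTH inputs, so Chicago
# (no matching salesperson) and Jack (no store) drop out.
joined = [s + p for s in stores for p in salespeople if s[0] == p[2]]
for record in joined:
    print(record)
```

Each output record concatenates the fields from both inputs, which is why the joined relation still carries the store_id twice, a duplication the next slides show how to clean up.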

EliminaAng"Duplicate"Fields"(1)"

!As%with%COGROUP,%the%new%rela*on%s*ll%contains%duplicate%elds%

## grunt> joined = JOIN stores BY store_id,

salespeople BY store_id;

## grunt> DUMP joined;

(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,tienne,F)

EliminaAng"Duplicate"Fields"(2)"

!We%can%use%FOREACH...GENERATE%to%retain%just%the%elds%we%need%
However,"it"is"now"slightly"more"complex"to"reference"elds"
We"must"fully/qualify"any"elds"with"names"that"are"not"unique"

## grunt> DESCRIBE joined;

joined: {stores::store_id: chararray,stores::name:
chararray,salespeople::person_id: int,salespeople::name:
chararray,salespeople::store_id: chararray}

## grunt> cleaned = FOREACH joined GENERATE stores::store_id,

stores::name, person_id, salespeople::name;

## grunt> DUMP cleaned;

(A,Anchorage,4,Dieter)
(B,Boston,1,Alice)
(B,Boston,8,Hannah)
... (additional records omitted for brevity) ...

Outer"Joins"

!Pig%La*n%allows%you%to%specify%the%type%of%join%following%the%eld%name%
Inner"joins"do"not"specify"a"join"type"

## joined = JOIN relation1 BY field [LEFT|RIGHT|FULL] OUTER,

relation2 BY field;

!An%outer%join%does%not%require%the%key%to%be%found%in%both%inputs%
!Outer%joins%require%Pig%to%know%the%schema%for%at%least%one%rela*on%
Which"relaAon"requires"schema"depends"on"the"join"type"
Full"outer"joins"require"schema"for"both"relaAons"

LeX"Outer"Join"Example"
stores"
!Result%contains%all%records%from%the%rela*on%
A Anchorage specied%on%the%le[,%but%only%matching%records%
B Boston
C Chicago from%the%one%specied%on%the%right%
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo LEFT OUTER, salespeople BY store_id;

## grunt> DUMP joined;

salespeople"
(A,Anchorage,4,Dieter,A)
1 Alice B (B,Boston,1,Alice,B)
2 Bob D (B,Boston,8,Hannah,B)
3 Carlos F (C,Chicago,6,Fredo,C)
4 Dieter A (C,Chicago,9,Irina,C)
5 tienne F (D,Dallas,2,Bob,D)
6 Fredo C (D,Dallas,7,George,D)
7 George D (E,Edmonton,,,)
8 Hannah B (F,Fargo,3,Carlos,F)
9 Irina C (F,Fargo,5,tienne,F)
10 Jack

Right"Outer"Join"Example"
stores"
!Result%contains%all%records%from%the%rela*on%
A Anchorage specied%on%the%right,%but%only%matching%records%
B Boston
C Chicago from%the%one%specied%on%the%le[%
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo RIGHT OUTER, salespeople BY store_id;

## grunt> DUMP joined;

salespeople"
(A,Anchorage,4,Dieter,A)
1 Alice B (B,Boston,1,Alice,B)
2 Bob D (B,Boston,8,Hannah,B)
3 Carlos F (C,Chicago,6,Fredo,C)
4 Dieter A (C,Chicago,9,Irina,C)
5 tienne F (D,Dallas,2,Bob,D)
6 Fredo C (D,Dallas,7,George,D)
7 George D (F,Fargo,3,Carlos,F)
8 Hannah B (F,Fargo,5,tienne,F)
9 Irina C (,,10,Jack,)
10 Jack

Full"Outer"Join"Example"
stores"
!Result%contains%all%records%where%there%is%a%match%
A Anchorage in%either%rela*on%
B Boston
C Chicago
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo FULL OUTER, salespeople BY store_id;

## grunt> DUMP joined;

salespeople"
(A,Anchorage,4,Dieter,A)
1 Alice B (B,Boston,1,Alice,B)
2 Bob D (B,Boston,8,Hannah,B)
3 Carlos F (C,Chicago,6,Fredo,C)
4 Dieter A (C,Chicago,9,Irina,C)
5 tienne F (D,Dallas,2,Bob,D)
6 Fredo C (D,Dallas,7,George,D)
7 George D (E,Edmonton,,,)
8 Hannah B (F,Fargo,3,Carlos,F)
9 Irina C (F,Fargo,5,tienne,F)
10 Jack (,,10,Jack,)

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set%Opera*ons%
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

Crossing"Data"Sets"

! JOIN%nds%records%in%one%rela*on%that%match%records%in%another%
!Pigs%CROSS%operator%creates%the%cross%product%of%both%rela*ons%
Combines"all"records"in"both"tables"regardless"of"matching"
In"other"words,"all"possible"combinaAons"of"records"

## crossed = CROSS stores, salespeople;

!Careful:%This%can%generate%huge%amounts%of%data!%

Cross"Product"Example"
stores"
!Generates%every%possible%combina*on%of%records%
A Anchorage in%the%stores%and%salespeople%rela*ons%
B Boston
D Dallas
grunt> crossed = CROSS stores, salespeople;
salespeople"
grunt> DUMP crossed;
1 Alice B (A,Anchorage,1,Alice,B)
2 Bob D (A,Anchorage,2,Bob,D)
8 Hannah B (A,Anchorage,8,Hannah,B)
10 Jack (A,Anchorage,10,Jack,)
(B,Boston,1,Alice,B)
(B,Boston,2,Bob,D)
(B,Boston,8,Hannah,B)
(B,Boston,10,Jack,)
(D,Dallas,1,Alice,B)
(D,Dallas,2,Bob,D)
(D,Dallas,8,Hannah,B)
(D,Dallas,10,Jack,)
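The output size here is the product of the input sizes, which is why CROSS gets expensive so quickly. As a tiny Python illustration of the semantics (and of the multiplicative growth), using the same small relations as the slide:

```python
from itertools import product

stores = [('A', 'Anchorage'), ('B', 'Boston'), ('D', 'Dallas')]
salespeople = [(1, 'Alice', 'B'), (2, 'Bob', 'D'),
               (8, 'Hannah', 'B'), (10, 'Jack', None)]

# crossed = CROSS stores, salespeople; -- every pairing, no key matching
crossed = [s + p for s, p in product(stores, salespeople)]
print(len(crossed))  # 3 stores x 4 salespeople = 12 records
```

With two million-record inputs the same operation would emit a trillion records, which is the practical reason for the "use it carefully" warning.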

ConcatenaAng"Data"Sets"

!We%have%explored%several%techniques%for%combining%data%sets%
!The%UNION%operator%combines%records%ver*cally%
Pig"does"not"require"these"inputs"to"have"the"same"schema"
It"does"not"eliminate"duplicate"records"nor"preserve"order"

## both = UNION june_items, july_items;

UNION"Example"
june"
!Concatenates%all%records%from%june%and%july%
Battery 349
Cable 799 grunt> both = UNION june_items, july_items;
DVD 1999
HDTV 79999 grunt> DUMP both;
(Fax,17999)
(GPS,24999)
july" (HDTV,65999)
Fax 17999 (Ink,3999)
HDTV 65999 (Battery,349)
Ink 3999 (Cable,799)
(DVD, 1999)
(HDTV,79999)

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! Spliang%Data%Sets%
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

SpliUng"Data"Sets"

!You%have%learned%several%ways%to%combine%data%sets%into%a%single%rela*on%
!Some*mes%you%need%to%split%a%data%set%into%mul*ple%rela*ons%
Server"logs"by"date"range"
Customer"lists"by"region"
Product"lists"by"vendor"
!Pig%La*n%supports%this%with%the%SPLIT%operator%

## SPLIT relation INTO relationA IF expression1,

relationB IF expression2,
relationC IF expression3...;

Expressions"need"not"be"mutually"exclusive"

SPLIT"Example"

!Split%customers%into%groups%for%rewards%program,%based%on%life*me%value%

customers"
grunt> SPLIT customers INTO
Annette 9700 gold_program IF ltv >= 25000,
Bruce 23500 silver_program IF ltv >= 10000
Charles 17800 AND ltv < 25000;
Dustin 21250
Eva 8500 grunt> DUMP gold_program;
Felix 9300 (Glynn,27800)
Glynn 27800 (Ian,43800)
Henry 8900 (Jeff,29100)
Ian 43800 (Kai,34000)
Jeff 29100
Kai 34000 grunt> DUMP silver_program;
Laura 7800 (Bruce,23500)
Mirko 24200 (Charles,17800)
(Dustin,21250)
(Mirko,24200)
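Each SPLIT condition is evaluated independently against every record, so a record can land in several outputs, or in none. As a Python sketch of the semantics only, using a subset of the customers shown above:

```python
# Subset of the customers relation: (name, lifetime value)
customers = [('Annette', 9700), ('Bruce', 23500), ('Glynn', 27800),
             ('Ian', 43800), ('Laura', 7800)]

# SPLIT customers INTO gold_program IF ltv >= 25000,
#                      silver_program IF ltv >= 10000 AND ltv < 25000;
# Records matching no condition (Annette, Laura) land in neither output.
gold_program = [c for c in customers if c[1] >= 25000]
silver_program = [c for c in customers if 10000 <= c[1] < 25000]
print(gold_program)    # [('Glynn', 27800), ('Ian', 43800)]
print(silver_program)  # [('Bruce', 23500)]
```

Because the conditions here are mutually exclusive, no customer appears twice; overlapping conditions would copy a record into every relation it matches.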

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands#On%Exercise:%Analyzing%Disparate%Data%Sets%with%Pig%
!! Conclusion"

Hands/on"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"

!In%this%Hands#On%Exercise,%you%will%analyze%mul*ple%data%sets%with%Pig.%

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion%

EssenAal"Points"

!You%can%use%COGROUP%to%group%mul*ple%rela*ons%
This"creates"a"nested"data"structure"
!Pig%supports%common%SQL%join%types%
Inner,"leX"outer,"right"outer,"and"full"outer"
You"may"need"to"fully/qualify"eld"names"when"using"joined"data"
!Pigs%CROSS%operator%creates%every%possible%combina*on%of%input%data%
This"can"create"huge"amounts"of"data""use"it"carefully!"
!You%can%use%a%UNION%to%concatenate%data%sets%

Extending"Pig"
Chapter"7"

Course"Chapters"

!! IntroducEon"
!! IntroducEon"to"Pig"
!! Basic"Data"Analysis"with"Pig"
!! Processing"Complex"Data"with"Pig"
!! MulE/Dataset"OperaEons"with"Pig"
!! Extending%Pig%
!! Pig"TroubleshooEng"and"OpEmizaEon"
!! IntroducEon"to"Hive"
!! RelaEonal"Data"Analysis"with"Hive"
!! Hive"Data"Management"
!! Text"Processing"With"Hive"
!! Hive"OpEmizaEon"
!! Extending"Hive"
!! IntroducEon"to"Impala"
!! Analyzing"Data"with"Impala"
!! Choosing"the"Best"Tool"for"the"Job"
!! Conclusion"

Extending"Pig"

In%this%chapter,%you%will%learn%
!How%to%use%parameters%in%your%Pig%LaAn%to%increase%its%exibility%
!How%to%dene%and%invoke%macros%to%improve%the%reusability%of%your%code%
!How%to%call%user#dened%funcAons%from%your%code%
!How%to%write%user#dened%funcAons%in%Python%
!How%to%process%data%with%external%scripts%

Chapter"Topics"

Extending%Pig%

!! Macros"and"Imports"
!! UDFs"
!! Contributed"FuncEons"
!! Using"Other"Languages"to"Process"Data"with"Pig"
!! Hands/On"Exercise:"Extending"Pig"with"Streaming"and"UDFs"
!! Conclusion"

The"Need"for"Parameters"(1)"

!Some%processing%is%very%repeAAve%
For"example,"creaEng"sales"reports"

## allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999;

## bigsales_alice = FILTER bigsales BY name == 'Alice';

STORE bigsales_alice INTO 'Alice';

The"Need"for"Parameters"(2)"

!You%may%need%to%change%the%script%slightly%for%each%run%
For"example,"to"modify"the"paths"or"lter"criteria"

## allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999;

## bigsales_alice = FILTER bigsales BY name == 'Alice';

STORE bigsales_alice INTO 'Alice';

Making the Script More Flexible with Parameters

These are replaced with specified values at runtime

allsales = LOAD '$INPUT' AS (name, price);
bigsales = FILTER allsales BY price > $MINPRICE;
bigsales_name = FILTER bigsales BY name == '$NAME';
STORE bigsales_name INTO '$NAME';

Then specify the values on the command line

$ pig -p INPUT=sales -p MINPRICE=999 \
      -p NAME='Jo Anne' reporter.pig

Two Tricks for Specifying Parameter Values

!You can also specify parameter values in a text file
An alternative to typing each one on the command line

INPUT=sales
MINPRICE=999
# comments look like this
NAME='Alice'

Use the -m filename option to tell Pig which file contains the values
!Parameter values can be defined with the output of a shell command
For example, to set MONTH to the current month:

MONTH=`date +'%m'` # returns 03 for March, 05 for May

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

The Need for Macros

!Parameters simplify repetitive code by allowing you to pass in values
But sometimes you would like to reuse the actual code too

allsales = LOAD 'sales' AS (name, price);
byperson = FILTER allsales BY name == 'Alice';
SPLIT byperson INTO low IF price < 1000,
                    high IF price >= 1000;
amt1 = FOREACH low GENERATE name, price * 0.07 AS amount;
amt2 = FOREACH high GENERATE name, price * 0.12 AS amount;
commissions = UNION amt1, amt2;
grpd = GROUP commissions BY name;
out = FOREACH grpd GENERATE SUM(commissions.amount) AS total;

Defining a Macro in Pig Latin

!Macros allow you to define a block of code to reuse easily
Similar (but not identical) to a function in a programming language

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
returns result {
    allsales = LOAD 'sales' AS (name, price);
    byperson = FILTER allsales BY name == '$NAME';
    SPLIT byperson INTO low IF price < $SPLIT_AMT,
                        high IF price >= $SPLIT_AMT;
    amt1 = FOREACH low GENERATE name, price * $LOW_PCT AS amount;
    amt2 = FOREACH high GENERATE name, price * $HIGH_PCT AS amount;
    commissions = UNION amt1, amt2;
    grouped = GROUP commissions BY name;
    $result = FOREACH grouped GENERATE SUM(commissions.amount);
};

Invoking Macros

!To invoke a macro, call it by name and supply values in the correct order

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
returns result {
    allsales = LOAD 'sales' AS (name, price);

};

alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);
carlos_comm = calc_commission('Carlos', 2000, 0.08, 0.14);

Reusing Code with Imports

!After defining a macro, you may wish to use it in multiple scripts
!You can include one script within another, starting with Pig 0.9
This is done with the import keyword and the path to the file being imported

-- We saved the macro to a file named commission_calc.pig
import 'commission_calc.pig';

alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

User-Defined Functions (UDFs)

!It is also possible to define your own functions
Pig allows writing UDFs in several languages

Language                    Supported in Pig Versions
Java                        All
Python                      0.8 and later
JavaScript (experimental)   0.9 and later
Ruby (experimental)         0.10 and later
Groovy (experimental)       0.11 and later

!In the next few slides, you will see how to use UDFs in Java, and how to
write and use UDFs in Python

Using UDFs Written in Java

!UDFs are packaged into Java Archive (JAR) files
!There are only two required steps for using them
Register the JAR file(s) containing the UDF and its dependencies
Invoke the UDF using the fully-qualified classname

REGISTER '/path/to/myudf.jar';
...
data = FOREACH allsales GENERATE com.example.MYFUNC(name);

!You can optionally define an alias for the function

REGISTER '/path/to/myudf.jar';
DEFINE FOO com.example.MYFUNC;
...
data = FOREACH allsales GENERATE FOO(name);

Writing UDFs in Python (1)

!Now we will see how to write a UDF in Python
!The data we want to process has inconsistent phone number formats

Alice   (314) 555-1212
Bob     212.555.9753
Carlos  405-555-3912
David   (202) 555.8471

!We will write a Python UDF that can consistently extract the area code

Writing UDFs in Python (2)

!Our Python code is straightforward
!The only unusual thing is the optional @outputSchema decorator
This tells Pig what data type we are returning
If not specified, Pig will assume bytearray

@outputSchema("areacode:chararray")
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats

    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]

    return areacode
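Before registering a UDF with Pig, it is worth sanity-checking the logic with plain Python on a few sample values. The sketch below repeats the same length-based logic locally (the @outputSchema decorator is Pig-specific, so it is omitted here; the sample numbers are the ones from the previous slide):

```python
# Local sanity check of the area-code logic, run outside Pig.
def get_area_code(phone):
    areacode = "???"  # returned for unknown formats
    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]
    return areacode

for phone in ["(314) 555-1212", "212.555.9753",
              "405-555-3912", "555-1212"]:
    print(phone, "->", get_area_code(phone))
```

The last value deliberately matches neither format, so it falls through to the "???" placeholder.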

Invoking Python UDFs from Pig Latin

!Using this UDF from our Pig Latin is also easy
We saved our Python code as phonenumber.py
This Python file is in our current directory
Register it and give it an alias (here, phoneudf)

REGISTER 'phonenumber.py' USING jython AS phoneudf;

areacodes = FOREACH names GENERATE
            phoneudf.get_area_code(phone) AS ac;

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Open Source UDFs

!Pig ships with a set of community-contributed UDFs called Piggy Bank
!Another popular package of UDFs, called DataFu, has been open-sourced

Piggy Bank

!Piggy Bank ships with Pig
You will need to register the piggybank.jar file
The location may vary depending on source and version
In CDH on our VMs, it is at /usr/lib/pig/piggybank.jar
!Some UDFs in Piggy Bank include (package names omitted for brevity)

Class Name      Description
ISOToUnix       Converts an ISO 8601 date/time format to UNIX format
UnixToISO       Converts a UNIX date/time format to ISO 8601 format
LENGTH          Returns the number of characters in the supplied string
HostExtractor   Returns the host name from a URL
DiffDate        Returns number of days between two dates
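The calculation behind a UDF like DiffDate is simple to reason about. Here is a hedged Python sketch of the same idea, number of days between two dates, using the standard library (this is an illustration of the concept, not the Piggy Bank Java source; the function name and date format are our own choices):

```python
from datetime import datetime

def diff_date(date1, date2, fmt="%Y-%m-%d"):
    """Days between two dates (date1 - date2), in the spirit
    of Piggy Bank's DiffDate."""
    d1 = datetime.strptime(date1, fmt)
    d2 = datetime.strptime(date2, fmt)
    return (d1 - d2).days

print(diff_date("2013-03-01", "2013-01-01"))  # 59
```

A negative result simply means the first date is earlier than the second.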

DataFu

!DataFu does not ship with Pig, but is part of CDH 4.1.0 and later
You will need to register the DataFu JAR file
In the VM, it is at /usr/lib/pig/datafu-0.0.4-cdh4.2.0.jar
!Some UDFs in DataFu include (package names omitted for brevity)

Class Name             Description
Quantile               Calculates quantiles for a data set
Median                 Calculates the median for a data set
Sessionize             Groups data into sessions based on a
                       specified time window
HaversineDistInMiles   Calculates distance in miles between two
                       points, given latitude and longitude

Using a Contributed UDF

!Here is an example of using a UDF from DataFu to calculate distance

Input data
37.789336 -122.401385 40.707555 -74.011679

Pig Latin
REGISTER '/usr/lib/pig/datafu-*.jar';
DEFINE DIST datafu.pig.geo.HaversineDistInMiles;
places = LOAD 'data' AS (lat1:double, lon1:double,
                         lat2:double, lon2:double);
dist = FOREACH places GENERATE DIST(lat1, lon1, lat2, lon2);
DUMP dist;

Output data
(2564.207116295711)
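As a cross-check on the output above, the haversine great-circle formula itself is easy to sketch in Python. This is a hand-rolled version, not the DataFu implementation, and the Earth-radius constant is an assumption (mean radius in miles), so small numerical differences from the UDF's result are expected:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.76  # assumed mean Earth radius in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Same coordinates as the input data above (San Francisco to New York)
print(haversine_miles(37.789336, -122.401385, 40.707555, -74.011679))
```

The result lands within a few miles of the 2564.2 that DataFu printed, which is what we would expect from two haversine implementations with slightly different radius constants.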

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Processing Data with an External Script

!Pig allows you to stream data through another language for processing
This is done using the STREAM keyword
Data is supplied to the script on standard input as tab-delimited fields
The script writes results to standard output as tab-delimited fields

STREAM Example in Python (1)

!Our example will calculate a user's age given that user's birthdate
This calculation is done in a Python script named agecalc.py
!Here is the corresponding Pig Latin code
Backticks are used to quote the script name following the alias
Single quotes are used for quoting the script name within SHIP
The schema for the data produced by the script follows the AS keyword

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;

STREAM Example in Python (2)

!Python code for agecalc.py

#!/usr/bin/env python

import sys
from datetime import datetime

for line in sys.stdin:
    line = line.strip()
    (name, birthdate) = line.split("\t")

    d1 = datetime.strptime(birthdate, '%Y-%m-%d')
    d2 = datetime.now()
    age = (d2 - d1).days / 365  # approximate age in whole years

    print "%s\t%i" % (name, age)

STREAM Example in Python (3)

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;

Input data              Output data
andy    1963-11-15      (andy,49)
betty   1985-12-30      (betty,27)
chuck   1979-02-23      (chuck,34)
debbie  1982-09-19      (debbie,30)

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Hands-On Exercise: Extending Pig with Streaming and UDFs

!In this Hands-On Exercise, you will process data with an external script and a
user-defined function.

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Essential Points

!Pig supports several extension mechanisms
!Parameters and macros can help make your code more reusable
And easier to maintain and share with others
!Piggy Bank and DataFu are two examples of open source UDF collections
You can also write your own UDFs
!It is also possible to embed Pig within another language

Bibliography

The following offer more information on topics discussed in this chapter
!Documentation on Parameter Substitution in Pig
http://tiny.cloudera.com/dac07a
!Documentation on Macros in Pig
http://tiny.cloudera.com/dac07b
!Documentation on User-Defined Functions in Pig
http://tiny.cloudera.com/dac07c
!Documentation on Piggy Bank
http://tiny.cloudera.com/dac07d
!Introducing DataFu
http://tiny.cloudera.com/dac07e

Pig Troubleshooting and Optimization
Chapter 8

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Pig Troubleshooting and Optimization

In this chapter, you will learn
!How to use SAMPLE and ILLUSTRATE to test and debug Pig jobs
!How Pig creates MapReduce jobs from your Pig Latin code
!How several simple changes to your Pig Latin code can make it run faster

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Troubleshooting Overview

!We have now covered how to use Pig for data analysis
Unfortunately, sometimes your code may not work as you expect
!Here we will cover some techniques for isolating and resolving problems
We will start with a few options to the pig command

Helping Yourself

!We will discuss some options for the pig command in this chapter
You can view all of them by using the -h (help) option
!One useful option is -c (check), which validates the syntax of your code

$ pig -c myscript.pig
myscript.pig syntax OK

!Another is -dryrun, which performs parameter and macro substitution
without actually running the script

$ pig -p INPUT=demodata -dryrun myscript.pig

Creates a myscript.pig.substituted file in the current directory

Getting Help from Others

!Sometimes you may need help from others
Mailing lists or newsgroups
Forums and bulletin board sites
Support services
!When asking for help, mention which version of Pig you are using

$ pig -version
Apache Pig version 0.10.0-cdh4.2.0

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Customizing Log Messages

!You may wish to change how much information is logged
!Edit the /etc/pig/conf/log4j.properties file to include:

log4j.logger.org.apache.pig=ERROR

!Edit the /etc/pig/conf/pig.properties file to set this property:

log4jconf=/etc/pig/conf/log4j.properties

Customizing Log Messages on a Per-Job Basis

!Often you just want to temporarily change the log level
Especially while trying to troubleshoot a problem with your script
!You can specify a Log4J properties file to use when you invoke Pig
This overrides the default Log4J configuration
!Create a customlog.properties file to include:

log4j.logger.org.apache.pig=DEBUG

!Specify this file via the -log4jconf argument to Pig

$ pig -log4jconf customlog.properties

Controlling Client-Side Log Files

!When a job fails, Pig may produce a log file to explain why
These are typically produced in your current directory
!To use a different location, use the -l (log) option when starting Pig

$ pig -l /tmp

!Or set it permanently by editing /etc/pig/conf/pig.properties
Specify a different directory using the log.file property

log.file=/tmp

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

This allows us to easily see cluster and job status with a browser
In pseudo-distributed mode, the hostname is localhost

HDFS   NameNode          http://hostname:50070/
       DataNode          http://hostname:50075/
MR1    JobTracker        http://hostname:50030/
MR2    ResourceManager   http://hostname:8088/
       NodeManager       http://hostname:8042/

The JobTracker Web UI (1)

The JobTracker Web UI (2)

!The JobTracker Web UI also shows historical information

The JobTracker Web UI (3)

!The job detail page can help you troubleshoot a problem

Naming Your Job

!Your cluster is likely shared with other users
There might be dozens or hundreds of others using it
As a result, sometimes it is hard to find your job in the Web UI
!We recommend setting a name in your scripts to help identify your jobs
Set the job.name property, either in Grunt or your script

grunt> set job.name 'Q2 2013 Sales Reporter';

Killing a Job

!A job that processes a lot of data can take hours to complete
Sometimes you spot an error in your code just after submitting a job
Rather than wait for the job to complete, you can kill it
!First, find the Job ID on the front page of the JobTracker Web UI

!Then, use the kill command in Pig along with that Job ID

grunt> kill job_201303151454_0028

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Optional Demo: Overview

!Time permitting, your instructor will now demonstrate how to use the
JobTracker's Web UI to isolate a bug in our code that causes a job to fail

$ cd ~/training_materials/analyst/webuidemo

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Using SAMPLE to Create a Smaller Data Set

!Your code might process terabytes of data in production
However, it is convenient to test with smaller amounts during
development
!Use SAMPLE to choose a random set of records from a data set
The STORE statement below saves them in a new directory called mysample

subset = SAMPLE everything 0.05;
STORE subset INTO 'mysample';

Intelligent Sampling with ILLUSTRATE

!Sometimes a random sample may lack data needed for testing
For example, matching records in two data sets for a JOIN operation
!Pig's ILLUSTRATE keyword can do more intelligent sampling
Pig will examine the code to determine what data is needed
It picks a few records that properly exercise the code
!You should specify a schema when using ILLUSTRATE
Pig will generate records when yours don't suffice
Using ILLUSTRATE Helps You to Understand Data Flow

!Like DUMP and DESCRIBE, ILLUSTRATE aids in debugging
The syntax is the same for all three

grunt> allsales = LOAD 'sales' AS (name:chararray, price:int);
grunt> bigsales = FILTER allsales BY price > 999;
grunt> ILLUSTRATE bigsales;
(Bob,3625)
--------------------------------------------------
| allsales | name:chararray | price:int |
--------------------------------------------------
| | Bob | 3625 |
| | Bob | 998 |
--------------------------------------------------
--------------------------------------------------
| bigsales | name:chararray | price:int |
--------------------------------------------------
| | Bob | 3625 |
--------------------------------------------------

General Debugging Strategies

!Use DUMP, DESCRIBE, and ILLUSTRATE often
The data might not be what you think it is
!Look at a sample of the data
Use -dryrun to see the script after parameters and macros are
processed
Test external scripts (STREAM) by passing some data from a local file

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Performance Overview

!We have discussed several techniques for finding errors in Pig Latin code
Once you get your code working, you'll often want it to work faster
!Most performance-tuning topics are beyond the scope of this course
We'll cover the basics and offer several performance improvement tips
See Programming Pig (chapters 7 and 8) for detailed coverage

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

How Pig Latin Becomes a MapReduce Job

!However, Pig does not translate your code into Java MapReduce
Much like relational databases don't translate SQL to C language code
Like a database, Pig interprets the Pig Latin to develop execution plans
!The EXPLAIN keyword details Pig's three execution plans
Logical
Physical
MapReduce
!Seeing an example job will help us better understand EXPLAIN's output

Description of Our Example Code and Data

!Our goal is to produce a list of per-store sales

stores:
A Anchorage
B Boston
C Chicago
D Dallas
E Edmonton
F Fargo

sales:
A 1999
D 2399
A 4579
B 6139
A 2489
B 3699
E 2479
D 5799

grunt> stores = LOAD 'stores'
                AS (store_id:chararray, name:chararray);
grunt> sales = LOAD 'sales'
               AS (store_id:chararray, price:int);
grunt> groups = GROUP sales BY store_id;
grunt> totals = FOREACH groups GENERATE group,
                SUM(sales.price) AS amount;
grunt> joined = JOIN totals BY group,
                stores BY store_id;
grunt> result = FOREACH joined
                GENERATE name, amount;
grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)

Using the EXPLAIN Keyword

!Using EXPLAIN rather than DUMP will show the execution plans

(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)

grunt> EXPLAIN result;

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
result: (Name: LOStore Schema:
stores::name#49:chararray,totals::amount#70:long)
|
|---result: (Name: LOForEach Schema:
stores::name#49:chararray,totals::amount#70:long)

(other lines, including physical and MapReduce plans, would follow)

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Pig's Runtime Optimizations

!Pig does not necessarily run your statements exactly as you wrote them
!It may remove operations for efficiency

sales = LOAD 'sales' AS (store_id:chararray, price:int);
unused = FILTER sales BY price > 789;
DUMP sales;

!It may also rearrange operations for efficiency

grouped = GROUP sales BY store_id;
totals = FOREACH grouped GENERATE group, SUM(sales.price);
joined = JOIN totals BY group, stores BY store_id;
only_a = FILTER joined BY store_id == 'A';
DUMP only_a;

Optimizations You Can Make in Your Pig Latin Code

!Pig's optimizer does what it can to improve performance
But you know your own code and data better than it does
!On the next few slides, we will rewrite this Pig code for performance

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Don't Produce Output You Don't Really Need

!In this case, we forgot to remove the DUMP statement
This sometimes happens when moving from development to production
And it might go unnoticed if you're not watching the terminal

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Specify a Schema Whenever Possible

!The postcode and phone fields in the stores data set were also never used
Eliminating them in our schema ensures they'll be omitted in joined

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
joined = JOIN sales BY store_id, stores BY store_id;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Filter Unwanted Data As Early As Possible

!We previously did our JOIN before our FILTER
Moving the FILTER operation up makes our script more efficient
Caveat: We now have to filter by store ID rather than store name

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

However, it is often beneficial to set a value explicitly in your script

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Specify the Smaller Data Set First in a Join

!We can optimize joins by specifying the larger data set last
In our case, we have far more records in sales than in stores
Changing the order in the JOIN statement can boost performance

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Try Using Compression on Intermediate Data

!Pig scripts often yield jobs with both a Map and a Reduce phase
Remember that Mapper output becomes Reducer input
Compressing this intermediate data is easy and can boost performance

set mapred.compress.map.output true;
set mapred.map.output.compression.codec
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
... (other lines unchanged, but removed for brevity) ...

A Few More Tips for Improving Performance

!Main theme: Eliminate unnecessary data as early as possible
Use FOREACH ... GENERATE to select just those fields you need
Use ORDER BY and LIMIT when you only need a few records
Use DISTINCT when you don't need duplicate records
!Dropping records with NULL keys before a join can boost performance
These records will be eliminated in the final output anyway
Use FILTER to remove records with null keys before the join

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
nonnull_stores = FILTER stores BY store_id IS NOT NULL;
nonnull_sales = FILTER sales BY store_id IS NOT NULL;
joined = JOIN nonnull_stores BY store_id, nonnull_sales BY store_id;

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Essential Points

!You can boost performance by eliminating unneeded data during
processing
!Pig's error messages don't always clearly identify the source of a problem
We recommend testing your scripts with a small data sample
!The resources listed on the upcoming bibliography slide may further assist
you in solving problems

Bibliography

The following offer more information on topics discussed in this chapter
http://tiny.cloudera.com/dac08a
!Pig Testing and Diagnostics
http://tiny.cloudera.com/dac08b
!Mailing List for Pig Users
http://tiny.cloudera.com/dac08c
!Questions Tagged with Pig on StackOverflow
http://tiny.cloudera.com/dac08d
!Questions Tagged with PigLatin on StackOverflow
http://tiny.cloudera.com/dac08e

Introduction to Hive
Chapter 9

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Introduction to Hive

In this chapter, you will learn
!What Hive is
!How Hive differs from a relational database
!Ways in which organizations use Hive
!How to invoke and interact with Hive

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Overview of Apache Hive

!Apache Hive is a high-level abstraction on top of MapReduce
Uses a SQL-like language called HiveQL
Now an open-source Apache project

SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON customers.cust_id = orders.cust_id
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;
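Because HiveQL is deliberately close to SQL, the same query shape runs nearly unchanged against an ordinary relational database. The sketch below demonstrates this with Python's built-in sqlite3 module; the table and column names mirror the example above, but the rows are invented sample data, not anything from the course datasets:

```python
import sqlite3

# Build two tiny in-memory tables mirroring the example query's schema.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE customers (cust_id INTEGER, zipcode TEXT)")
c.execute("CREATE TABLE orders (cust_id INTEGER, cost INTEGER)")
c.executemany("INSERT INTO customers VALUES (?, ?)",
              [(1, "63101"), (2, "63102"), (3, "90210")])
c.executemany("INSERT INTO orders VALUES (?, ?)",
              [(1, 50), (1, 25), (2, 40), (3, 99)])

# The same join/filter/aggregate/sort shape as the HiveQL example.
c.execute("""
    SELECT zipcode, SUM(cost) AS total
    FROM customers
    JOIN orders ON customers.cust_id = orders.cust_id
    WHERE zipcode LIKE '63%'
    GROUP BY zipcode
    ORDER BY total DESC
""")
rows = c.fetchall()
print(rows)  # [('63101', 75), ('63102', 40)]
```

The difference is not the language but the execution engine: Hive turns this statement into MapReduce jobs over HDFS files instead of executing it against local B-tree storage.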

High-Level Overview for Hive Users

!Hive runs on the client machine
Turns HiveQL queries into MapReduce jobs
Submits those jobs to the cluster

SELECT zipcode, SUM(cost) AS total
FROM customers JOIN orders
ON customers.id = orders.cid
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

Parse HiveQL
Make optimizations
Plan execution
Generate MapReduce jobs
Monitor progress

Why Use Apache Hive?

!More productive than writing MapReduce directly
Five lines of HiveQL might be equivalent to 100 lines or more of Java
No software development experience required
Leverage existing knowledge of SQL
!Offers interoperability with other systems
Extensible through Java and external scripts

Where to Get Hive

!Hive is included in CDH, alongside HBase, Oozie, and other ecosystem
components
Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
Simple installation
100% free and open source
!Installation is outside the scope of this course

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

!Hive's queries operate on tables, just like in an RDBMS
A table is simply an HDFS directory containing one or more files
Default path: /user/hive/warehouse/<table_name>
Hive supports many formats for data storage and retrieval
!How does Hive know the structure and location of tables?
These are specified when tables are created
This metadata is kept in Hive's metastore, contained in an RDBMS such as MySQL
!Hive consults the metastore to determine data format and location
The query itself operates on data stored on a filesystem (typically HDFS)

(Diagram: step 1, the query asks the metastore for the table structure and
the location of the data; step 2, the query reads the actual data from the
Hive tables in HDFS)
Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Your Cluster is Not a Database Server

!Client-server database management systems have many strengths
Very fast response time
Support for transactions
Allows modification of existing records
Can serve thousands of simultaneous clients
!Hive offers none of these capabilities
It simply produces MapReduce jobs from HiveQL queries
Limitations of HDFS and MapReduce still apply

Comparing Hive to a Relational Database

                            Relational Database   Hive
Query language              SQL                   HiveQL
Update individual records   Yes                   No
Delete individual records   Yes                   No
Transactions                Yes                   No
Index support               Extensive             Limited
Latency                     Very low              High
Data size                   Terabytes             Petabytes

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Use Case: Log File Analytics

!Server log files are an important source of data
!Hive allows you to treat a directory of log files like a table
Allows SQL-like queries against raw data

Dualcore Inc. Public Web Site (June 1 - 8)

Product    Unique Visitors   Page Views   Average Time on Page   Bounce Rate   Conversion Rate
Tablet     5,278             5,894        17 seconds             23%           65%
Notebook   4,139             4,375        23 seconds             47%           31%
Stereo     2,873             2,981        42 seconds             61%           12%
Monitor    1,749             1,862        26 seconds             74%           19%
Router     987               1,139        37 seconds             56%           17%
Server     314               504          53 seconds             48%           28%
Printer    86                97           34 seconds             27%           64%

Use Case: Sentiment Analysis

!Many organizations use Hive to analyze social media coverage

[Chart: Mentions of Dualcore on Social Media (by Hour), broken down into Negative, Neutral, and Positive mentions from hour 07 through 18]

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Using the Hive Shell

!You can execute HiveQL statements in the Hive shell
  This interactive tool is similar to the MySQL shell
!Run the hive command to start the Hive shell
  The Hive shell will display its hive> prompt
  Each statement must be terminated with a semicolon
  Use the quit command to exit the Hive shell

\$ hive
hive> SELECT cust_id, fname, lname
FROM customers WHERE zipcode=20525;
1000000 Quentin Shepard
1000001 Brandon Louis
1000002 Marilyn Ham
hive> quit;
\$

Accessing Hive from the Command Line

!You can also execute a file containing HiveQL code using the -f option

$ hive -f myquery.hql

!Or use HiveQL directly from the command line using the -e option

$ hive -e 'SELECT * FROM users'

!Use the -S (silent) option to suppress informational messages
  Can also be used with the -e or -f options

$ hive -S

Hive Properties

!Many aspects of Hive's behavior are configured through properties
  Use set -v in Hive to see current values

hive> set -v;

!You can also use set to specify property values

!Hive runs the .hiverc file in your home directory at startup
  Useful for specifying per-user defaults

Interacting with the Operating System and HDFS

!Use ! to execute system commands from within Hive
  Neither pipes nor globs (wildcards) are supported

hive> ! date;
Mon May 20 16:44:35 PDT 2013

!Prefix HDFS commands with dfs to use them from within Hive

hive> dfs -mkdir /reports/sales/2013;

Accessing Hive with Hue

!Alternatively, you can access Hive through Hue
!To use Hue, browse to http://hue_server:8888/
  May need to start the Hue service first (sudo service hue start)
!Hue's Hive interface is called Beeswax
  Launch by clicking its icon
  [Screenshot: Beeswax icon in Hue]
!Beeswax features include:
  Creating tables
  Running queries
  Browsing tables
  Saving queries for later execution

Hue's Beeswax Query Editor

[Screenshot: the Hue Hive Query editor, showing the query SELECT * FROM employees WHERE state = 'CA';]

Query Execution in Beeswax

Query Results in Beeswax

Interacting with HiveServer2: Hive as a Service (1)

!HiveServer2 can be deployed to provide a centralized Hive service
  Uses a JDBC or ODBC connection
  Supports Kerberos authentication

[Diagram: the Beeline CLI and JDBC/ODBC applications connect to the HiveServer2 server, which uses a shared metastore and submits MapReduce jobs (mappers and reducers) to the cluster]

Interacting with HiveServer2: Hive as a Service (2)

!To connect to HiveServer2, use Hue or the Beeline CLI
  You cannot use the Hive shell
!Example: starting Beeline and connecting to HiveServer2

[training@localhost analyst]$ beeline
Beeline version 0.10.0-cdh4.2.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
         training mypwd org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)
beeline>

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Essential Points

!Hive is a high-level abstraction on top of MapReduce
!HiveQL is very similar to SQL
  Easy to learn for those with relational database experience
  However, Hive does not replace your RDBMS
!Hive tables are really directories of files in HDFS
!Hive and Pig are similar in many ways

Bibliography

The following offer more information on topics discussed in this chapter
!Hive Web Site
  http://hive.apache.org/
!Beeline CLI Reference
  http://sqlline.sourceforge.net/
!Programming Hive (book)
  http://tiny.cloudera.com/dac09a
  http://tiny.cloudera.com/dac09b
!Sentiment Analysis Using Apache Hive
  http://tiny.cloudera.com/dac09c

Relational Data Analysis with Hive
Chapter 10

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Relational Data Analysis with Hive

In this chapter, you will learn
!How to explore databases and tables in Hive
!How HiveQL syntax compares to SQL
!Which data types Hive supports
!Which types of join operations Hive supports and how to use them
!How to use many of Hive's built-in functions

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hive Tables

!Data for Hive's tables is stored on the filesystem (typically HDFS)
  Each table maps to a single directory
!A table's directory may contain multiple files
  Typically delimited text files, but Hive supports many formats
  Subdirectories are not allowed
!Hive uses the metastore to give context to this data
  Helps map raw data in HDFS to named columns of specific types

Hive Databases

!Each Hive table belongs to a specific database
!Early versions of Hive supported only a single database
  They placed all tables in the same database (named default)
  This is still the default behavior
!Hive supports multiple databases as of release 0.7.0

Exploring Hive Databases and Tables (1)

!Which databases are available?

hive> SHOW DATABASES;
accounting
default
sales

!Switch between databases with the USE command

$ hive
hive> SELECT * FROM customers;   -- queries the table in the default database
hive> USE sales;
hive> SELECT * FROM customers;   -- queries the table in the sales database

Exploring Hive Databases and Tables (2)

!Which tables does the current database contain?

hive> USE accounting;
hive> SHOW TABLES;
invoices
taxes

!Which tables are contained in a different database?

hive> SHOW TABLES IN sales;
customers
prospects

Exploring Hive Databases and Tables (3)

!The DESCRIBE command displays the basic structure for a table

hive> DESCRIBE orders;
order_id     int
cust_id      int
order_date   timestamp

!DESCRIBE FORMATTED shows more detailed information

hive> DESCRIBE FORMATTED orders;
# col_name     data_type   comment
order_id       int         None
cust_id        int         None
order_date     timestamp   None
# Detailed Table Information ... (more follows)

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

An Introduction to HiveQL

!HiveQL is Hive's query language
  Based on a subset of SQL-92, plus Hive-specific extensions
!Some limitations compared to standard SQL
  Some features are not supported
  Others are only partially implemented
!HiveQL also has some features not offered in SQL

HiveQL Basics

!Hive keywords are not case-sensitive
  Though they are often capitalized by convention
!Statements are terminated by a semicolon
  A statement may span multiple lines
!Comments begin with two dashes (--)
  Only supported in Hive scripts

$ cat myscript.hql
SELECT cust_id, fname, lname
FROM customers
WHERE zipcode='60601'; -- downtown Chicago

Selecting Data from Hive Tables

!The SELECT statement retrieves data from Hive tables
  Can specify an ordered list of individual columns

hive> SELECT cust_id, fname, lname FROM customers;

  An asterisk matches all columns in the table

hive> SELECT * FROM customers;

Limiting and Sorting Query Results

!The LIMIT clause sets the maximum number of rows returned

hive> SELECT fname, lname FROM customers LIMIT 10;

!Caution: no guarantee regarding which 10 results are returned
  Use ORDER BY for top-N queries
  The field(s) you ORDER BY must be selected

hive> SELECT cust_id, fname, lname FROM customers
      ORDER BY cust_id DESC LIMIT 10;

Using a WHERE Clause to Restrict Results

!WHERE clauses restrict rows to those matching specified criteria
  String comparisons are case-sensitive

hive> SELECT * FROM customers WHERE state
      IN ('CA', 'OR', 'WA', 'NV', 'AZ');

!You can combine expressions using AND or OR

hive> SELECT * FROM customers
      WHERE fname LIKE 'Ann%'
      AND (city='Seattle' OR city='Portland');

Table Aliases

!Table aliases can help simplify complex queries

hive> SELECT o.order_date, c.fname, c.lname
      FROM customers c JOIN orders o
      ON c.cust_id = o.cust_id
      WHERE c.zipcode='94306';

  Note: Using AS to specify table aliases is not supported

Combining Query Results with UNION ALL

!Unifies output from SELECTs into a single result set
  The name, order, and types of columns in each query must match
  Hive only supports UNION ALL

SELECT emp_id, fname, lname, salary
FROM employees
WHERE state='CA' AND salary > 75000
UNION ALL
SELECT emp_id, fname, lname, salary
FROM employees
WHERE state != 'CA' AND salary > 50000;

!UNION ALL can also be used with subqueries

Subqueries in Hive

!Hive only supports subqueries in the FROM clause of the SELECT statement
  It does not support correlated subqueries

SELECT prod_id, brand, name
FROM (SELECT *
      FROM products
      WHERE (price - cost) / price > 0.65
      ORDER BY price DESC
      LIMIT 10) high_profits
WHERE price > 1000
ORDER BY brand, name;

!Hive allows arbitrary levels of subqueries
  Each subquery must be named (like high_profits above)

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hive's Data Types

!Each column in Hive is associated with a data type
!Hive supports more than a dozen types
  Most are similar to ones found in relational databases
  Hive also supports three complex types
!Use the DESCRIBE command to see a table's column types

hive> DESCRIBE products;
prod_id       int
brand         string
name          string
price         int
cost          int
shipping_wt   int

Hive's Integer Types

!Integer types are appropriate for whole numbers
  Both positive and negative values allowed

Name       Description                                      Example Value
TINYINT    Range: -128 to 127                               17
SMALLINT   Range: -32,768 to 32,767                         5842
INT        Range: -2,147,483,648 to 2,147,483,647           84127213
BIGINT     Range: ~ -9.2 quintillion to ~ 9.2 quintillion   632197432180964

Hive's Decimal Types

!Decimal types are appropriate for floating point numbers
  Both positive and negative values allowed
  Caution: avoid using when exact values are required!

Name     Description             Example Value
FLOAT    Decimals                3.14159
DOUBLE   Very precise decimals   3.14159265358979323846
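The caution above applies to IEEE-754 floating point in general, not just Hive. A quick check in Python (whose float is a 64-bit double, the same representation as Hive's DOUBLE) shows why these types are a poor fit for values such as currency amounts:

```python
# Python floats are IEEE-754 doubles, the same representation as Hive's DOUBLE.
# Many decimal fractions cannot be stored exactly, so equality checks fail:
total = 0.1 + 0.2
print(total)           # 0.30000000000000004, not 0.3
print(total == 0.3)    # False

# Storing money as an integer number of cents avoids the problem entirely:
cents = 10 + 20
print(cents == 30)     # True
```

This is why the slide recommends integer types whenever exact values are required.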

Other Simple Types in Hive

!Hive can also store several other types of information
  Only one character type (variable length)

Name         Description          Example Value
STRING       Character sequence   Betty F. Smith
BOOLEAN      True or False        TRUE
TIMESTAMP*   Instant in time      2013-06-14 16:51:05
BINARY*      Raw bytes            N/A

* Not available in older versions of Hive

Complex Column Types in Hive

!Hive also has a few complex data types
  These are capable of holding multiple values

Column Type   Description                                    Usage in Query
ARRAY         Ordered list of values, all of the same type   departments[0]
MAP           Key-value pairs, each of the same type         employees['BF5314']
STRUCT        Named fields, of possibly mixed types          address.city

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Joins in Hive

!Joining disparate data sets is a common operation in Hive
!Hive supports several types of joins
  Inner joins
  Outer joins (left, right, and full)
  Cross joins (supported in Hive 0.10 and later)
  Left semi joins
!Only equality conditions are allowed in joins
  Valid:   customers.cust_id = orders.cust_id
  Invalid: customers.cust_id <> orders.cust_id
!The inner join is the most basic type
  Outputs records where the specified key is found in each table
!For best performance, list the largest table last in your query

Join Syntax

!Hive requires the following syntax for joins

SELECT c.cust_id, name, total
FROM customers c
JOIN orders o ON (c.cust_id = o.cust_id);

!The above example is an inner join
  Can replace JOIN with another type (e.g. RIGHT OUTER JOIN)
!The following join syntax is not valid in Hive

SELECT c.cust_id, name, total
FROM customers c, orders o
WHERE (c.cust_id = o.cust_id);

Inner Join Example

SELECT c.cust_id, name, total
FROM customers c
JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871

Left Outer Join Example

SELECT c.cust_id, name, total
FROM customers c
LEFT OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871
d        Dieter  NULL

Right Outer Join Example

SELECT c.cust_id, name, total
FROM customers c
RIGHT OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871
NULL     NULL    2137

Full Outer Join Example

SELECT c.cust_id, name, total
FROM customers c
FULL OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871
d        Dieter  NULL
NULL     NULL    2137

Using an Outer Join to Find Unmatched Entries

SELECT c.cust_id, name, total
FROM customers c
FULL OUTER JOIN orders o
ON (c.cust_id = o.cust_id)
WHERE c.cust_id IS NULL
OR o.total IS NULL;

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
d        Dieter  NULL
NULL     NULL    2137
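The unmatched-entries logic above can be sketched in Python using the same sample data. This is an illustration of the join semantics only, not how Hive executes the query:

```python
# Sample data from the slide: customers keyed by cust_id, orders as tuples
customers = {'a': 'Alice', 'b': 'Bob', 'c': 'Carlos', 'd': 'Dieter'}
orders = [(1, 'a', 1539), (2, 'c', 1871), (3, 'a', 6352),
          (4, 'b', 1456), (5, 'z', 2137)]

# A full outer join keeps rows from both sides; filtering for NULLs on
# either side leaves only the unmatched entries.
matched_cust = {cust_id for _, cust_id, _ in orders}
unmatched = []
for cust_id, name in customers.items():
    if cust_id not in matched_cust:        # customer with no orders
        unmatched.append((cust_id, name, None))
for _, cust_id, total in orders:
    if cust_id not in customers:           # order with no matching customer
        unmatched.append((None, None, total))

print(unmatched)   # [('d', 'Dieter', None), (None, None, 2137)]
```

The two rows printed match the slide's result: Dieter has no orders, and order 5 references an unknown customer.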

Cross Join Example

SELECT * FROM disks
CROSS JOIN sizes;

disks table            sizes table
name                   size
Internal hard disk     1.0 terabytes
External hard disk     2.0 terabytes
                       3.0 terabytes
                       4.0 terabytes

Result of query
name                 size
Internal hard disk   1.0 terabytes
Internal hard disk   2.0 terabytes
Internal hard disk   3.0 terabytes
Internal hard disk   4.0 terabytes
External hard disk   1.0 terabytes
External hard disk   2.0 terabytes
External hard disk   3.0 terabytes
External hard disk   4.0 terabytes
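A cross join is simply a Cartesian product: every row of one table paired with every row of the other. The same result can be sketched in Python with itertools.product:

```python
from itertools import product

disks = ['Internal hard disk', 'External hard disk']
sizes = ['1.0 terabytes', '2.0 terabytes', '3.0 terabytes', '4.0 terabytes']

# CROSS JOIN pairs every row of one table with every row of the other
result = list(product(disks, sizes))
print(len(result))   # 8 rows: 2 disks x 4 sizes
print(result[0])     # ('Internal hard disk', '1.0 terabytes')
```

Because the result grows multiplicatively with table sizes, cross joins should be used with care on large tables.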

Left Semi Joins (1)

!A less common type of join is the LEFT SEMI JOIN
  They are a special (and efficient) type of inner join
  They behave more like a filter than a join
  Only unique records that match these criteria are returned
  Fields listed in SELECT are limited to the left-side table

SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id
AND YEAR(o.order_date) = '2012');

Left Semi Joins (2)

!Hive does not support IN/EXISTS subqueries, so the following is invalid

SELECT c.cust_id FROM customers c
WHERE c.cust_id IN
(SELECT o.cust_id FROM orders o
WHERE YEAR(o.order_date) = '2012');

!Using a LEFT SEMI JOIN is a common workaround

SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id
AND YEAR(o.order_date) = '2012');
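To see how a left semi join differs from an inner join, here is a small Python sketch of its filter-like semantics. The sample data is hypothetical; note that customer 'a' has two matching 2012 orders but still appears only once:

```python
customers = {'a': 'Alice', 'b': 'Bob', 'c': 'Carlos', 'd': 'Dieter'}
# Hypothetical orders: (order_id, cust_id, order year)
orders = [(1, 'a', 2012), (2, 'c', 2011), (3, 'a', 2012), (4, 'b', 2012)]

# A LEFT SEMI JOIN acts as a filter: each left-side row appears at most
# once, no matter how many right-side rows match (an inner join would
# repeat customer 'a' once per matching 2012 order).
match_keys = {cust_id for _, cust_id, year in orders if year == 2012}
semi_join = [cust_id for cust_id in customers if cust_id in match_keys]
print(semi_join)   # ['a', 'b'] -- each matching customer listed once
```

This is also why only left-side columns may appear in the SELECT list: no particular right-side row is chosen for the output.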

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hive Functions

!Hive offers dozens of built-in functions
  Many are identical to those found in SQL
  Others are Hive-specific
!Example function invocation
  Function names are not case-sensitive

hive> SELECT UPPER(fname) FROM customers;

!Use DESCRIBE FUNCTION to see a description of a function

hive> DESCRIBE FUNCTION UPPER;
UPPER(str) - Returns str with all characters
changed to uppercase

Example Built-in Functions (1)

!These functions operate on numeric values

Function Description                Example Invocation      Input    Output
Rounds to specified # of decimals   ROUND(total_price, 2)   23.492   23.49
Returns absolute value              ABS(temperature)        -49      49
Returns square root                 SQRT(area)              64       8
Returns a random number             RAND()                  N/A      0.584977

Example Built-in Functions (2)

!These functions operate on timestamp values

Function Description              Example Invocation            Input                   Output
Convert to UNIX format            UNIX_TIMESTAMP(order_dt)      2013-06-14 16:51:05     1371243065
Convert to string format          FROM_UNIXTIME(mod_time)       1371243065              2013-06-14 16:51:05
Extract date portion              TO_DATE(order_dt)             2013-06-14 16:51:05     2013-06-14
Extract year portion              YEAR(order_dt)                2013-06-14 16:51:05     2013
Returns # of days between dates   DATEDIFF(ship_dt, order_dt)   2013-06-17, 2013-06-14  3
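For readers more familiar with Python, the date-extraction functions above map closely onto the standard datetime module. This is a rough sketch of the semantics, not Hive's implementation (the UNIX timestamp conversion is omitted here because its result depends on the session time zone):

```python
from datetime import datetime, date

order_dt = datetime(2013, 6, 14, 16, 51, 5)

# TO_DATE(order_dt): keep only the date portion of the timestamp
print(order_dt.date())                    # 2013-06-14

# YEAR(order_dt): extract the year portion
print(order_dt.year)                      # 2013

# DATEDIFF(ship_dt, order_dt): whole days between two dates
ship_dt = date(2013, 6, 17)
print((ship_dt - order_dt.date()).days)   # 3
```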

Example Built-in Functions (3)

!Here are some other interesting functions

Function Description           Example Invocation   Input   Output
Converts to uppercase          UPPER(fname)         Bob     BOB
Returns size of array or map   SIZE(array_field)    N/A     6

Record Grouping and Aggregate Functions

!GROUP BY groups selected data by one or more columns
  Caution: Columns not part of the aggregation must be listed in GROUP BY

SELECT region, state,
       COUNT(id) AS num
FROM stores
GROUP BY region, state;

stores table
id   city      state   region
a    Albany    NY      EAST
b    Boston    MA      EAST
c    Chicago   IL      NORTH
d    Detroit   MI      NORTH
e    Elgin     IL      NORTH

Result of query
region   state   num
EAST     MA      1
EAST     NY      1
NORTH    IL      2
NORTH    MI      1
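The grouping above can be sketched in Python with collections.Counter, using the stores rows from the slide:

```python
from collections import Counter

stores = [('a', 'Albany', 'NY', 'EAST'), ('b', 'Boston', 'MA', 'EAST'),
          ('c', 'Chicago', 'IL', 'NORTH'), ('d', 'Detroit', 'MI', 'NORTH'),
          ('e', 'Elgin', 'IL', 'NORTH')]

# GROUP BY region, state with COUNT(id): count rows per (region, state) key
counts = Counter((region, state) for _, _, state, region in stores)
for (region, state), num in sorted(counts.items()):
    print(region, state, num)
```

Each distinct (region, state) pair becomes one output row, exactly as in the query result, with Illinois counted twice.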

Built-in Aggregate Functions

!Hive offers many aggregate functions, including

Function Description                                Example Invocation
Count all rows                                      COUNT(*)
Count all rows where field is not null              COUNT(fname)
Count all rows where field is unique and not null   COUNT(DISTINCT fname)
Returns the largest value                           MAX(salary)
Returns the smallest value                          MIN(salary)
Returns the average of all supplied values          AVG(salary)

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hands-On Exercise: Running Hive Queries

!In this Hands-On Exercise, you will run Hive queries from the Hive shell, the command line, Hive scripts, and Hue.

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Essential Points

!Every Hive table belongs to exactly one database
  The SHOW DATABASES command lists databases
  The USE command switches the active database
  The SHOW TABLES command lists all tables in a database
!Every column in a Hive table has an associated data type
  Most simple column types are similar to SQL
  Hive also supports a few complex types
!HiveQL syntax is familiar to those who know SQL
  A subset of SQL-92, plus Hive-specific extensions
  Supports inner, outer, and left semi joins
  Many SQL functions are built into Hive

Bibliography

The following offer more information on topics discussed in this chapter
!HiveQL Language Manual
  http://tiny.cloudera.com/dac10a
!Hive Built-In Functions
  http://tiny.cloudera.com/dac10b

Hive Data Management
Chapter 11

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Hive Data Management

In this chapter you will learn
!How Hive encodes and stores data
!How to create Hive databases, tables, and views
!How to alter and remove tables
!How to save query results into tables and files
!How to control access to data in Hive

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Hive Text File Format

!Each Hive table maps to a directory of data, typically in HDFS
  Data stored in one or more files
!Default data format is plain text
  One record per line (record separator is \n)
  Columns are delimited by ^A (field separator is Control-A)
  Complex type elements are separated by ^B (Control-B)
  Map keys/values are separated by ^C (Control-C)
!Hive allows you to override these delimiters
  Specify alternate values during table creation
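A short Python sketch of how a record in this default format breaks apart. The sample line and its field values are hypothetical, but the delimiters are Hive's defaults:

```python
# One record in Hive's default text format: fields separated by Control-A
# (\x01), elements of a complex type by Control-B (\x02), and map
# keys/values by Control-C (\x03). Sample values are illustrative.
line = "1\x01Data Analyst\x01Audio\x02Video\x01mgr\x03Bob"

fields = line.split("\x01")
print(fields[1])                       # Data Analyst
array_col = fields[2].split("\x02")    # an ARRAY<STRING> column
print(array_col)                       # ['Audio', 'Video']
k, v = fields[3].split("\x03")         # one entry of a MAP column
print({k: v})                          # {'mgr': 'Bob'}
```

Control characters rarely occur in real field values, which is why they make safer default delimiters than commas or tabs.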

Hive Binary File Formats

!Hive also supports a few binary formats
  Pro: may offer better performance than text files
  Avoids the need to convert all data to/from strings
  Supports block- and record-level data compression
!RCFile is a binary format created especially for Hive
  RC stands for Record Columnar
  Column-oriented format efficient for some queries
  Otherwise, very similar to sequence files

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Creating Databases in Hive

!Hive databases are simply namespaces
  Helps to organize your tables
!To create a new database

hive> CREATE DATABASE dualcore;

!To conditionally create a new database

hive> CREATE DATABASE IF NOT EXISTS dualcore;

Creating a Table in Hive (1)

!Basic syntax for creating a table:

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

!Creates a subdirectory under /user/hive/warehouse in HDFS
  This is Hive's warehouse directory

Creating a Table in Hive (2)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

Specify a name for the table, and list the column names and datatypes (see later)

Creating a Table in Hive (3)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

This line states that fields in each file in the table's directory are delimited by some character. Hive's default delimiter is Control-A, but you may specify an alternate delimiter...

Creating a Table in Hive (4)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

...for example, tab-delimited data would require that you specify FIELDS TERMINATED BY '\t'

Creating a Table in Hive (5)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

Finally, you may declare the file format. STORED AS TEXTFILE is the default and does not need to be specified

Example Table Definition

!The following example creates a new table named jobs
  Data stored as text with four comma-separated fields per line

CREATE TABLE jobs
(id INT,
title STRING,
salary INT,
posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Example of a corresponding record for the table above

1,Data Analyst,100000,2013-06-21 15:52:03

Complex Column Types

!Recap: Hive has support for complex data types

Column Type   Description
ARRAY         Ordered list of values, all of the same type
MAP           Key-value pairs, each of the same type
STRUCT        Named fields, of possibly mixed types

Creating Tables with Complex Column Types

!Complex columns are typed
  Arrays have a single type
  Maps have a key type and a value type
  Structs have a type for each attribute
!These types are specified with angle brackets

CREATE TABLE stores
(store_id SMALLINT,
departments ARRAY<STRING>,
staff MAP<STRING, STRING>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zipcode:STRING>
);

Customizing Complex Type Storage

!We can control which delimiters are used for complex types

CREATE TABLE stores
(store_id SMALLINT,
departments ARRAY<STRING>,
staff MAP<STRING, STRING>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zipcode:STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

Row Format Example for Complex Types

!The following is an example of a record in that table

[Table rendering lost: a sample record with fields id, departments, staff, street, city, state, and zipcode; the departments array includes 'Audio', the staff map includes 'Bob', and the city is 'Ely']

!Example queries on this record

hive> SELECT address.city FROM stores;
Ely
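A Python sketch of how such a record splits under the declared delimiters ('|' between fields, ',' between collection items, ':' between map keys and values). The sample line is hypothetical, constructed around the values (Audio, Bob, Ely) that survive from the original slide:

```python
# Hypothetical record for the stores table, using the delimiters declared
# in the DDL above: '|' between fields, ',' between collection items,
# ':' between map keys and values.
line = "95|Audio,Video|b7:Bob|123 Main St,Ely,NV,89301"

store_id, departments, staff, address = line.split("|")
departments = departments.split(",")                      # ARRAY<STRING>
staff = dict(kv.split(":") for kv in staff.split(","))    # MAP<STRING,STRING>
street, city, state, zipcode = address.split(",")         # STRUCT fields

print(departments)    # ['Audio', 'Video']
print(staff['b7'])    # Bob
print(city)           # Ely
```

Note that struct attributes, like array elements, are separated by the collection item delimiter.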

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Data Validation in Hive

!Unlike an RDBMS, Hive does not validate data on insert
  Files are simply moved into place
  Errors in file format will be discovered when queries are performed
!Missing or invalid data in Hive will be represented as NULL

Loading Data into Hive

!One way to add data to a table is to move files into its directory in HDFS

$ hadoop fs -mv sales.txt /user/hive/warehouse/sales/

!The LOAD DATA INPATH command does the same thing
  Done from within the Hive shell (or a Hive script)
  This moves data within HDFS, just like the command above
  Source can be either a file or directory

hive> LOAD DATA INPATH 'sales.txt' INTO TABLE sales;

!Add the LOCAL keyword to load data from the local filesystem
  Copies a local file (or directory) to the table's directory in HDFS

hive> LOAD DATA LOCAL INPATH '/home/bob/sales.txt'
      INTO TABLE sales;

  Equivalent to:

$ hadoop fs -put /home/bob/sales.txt \
    /user/hive/warehouse/sales

!Add the OVERWRITE keyword to replace existing data
  Removes all files within the table's directory
  Then moves the new files into that directory

hive> LOAD DATA INPATH '/depts/finance/salesdata'
      OVERWRITE INTO TABLE sales;

Importing Data with Sqoop

!Sqoop has built-in support for importing data into Hive
  Creates the table in Hive (metastore)
  Imports data from the RDBMS to the table's directory in HDFS

$ sqoop import \
    --connect jdbc:mysql://localhost/dualcore \
    --fields-terminated-by '\t' \
    --table employees \
    --hive-import

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Removing a Database

!Removing a database is similar to creating it
  Just replace CREATE with DROP

hive> DROP DATABASE IF EXISTS dualcore;

!These commands will fail if the database contains tables
  Add the CASCADE keyword to drop the database's tables first
  Caution: this command removes data in HDFS!

hive> DROP DATABASE dualcore CASCADE;

Removing a Table

!Syntax is similar to database removal

hive> DROP TABLE IF EXISTS customers;

!Caution: These commands can remove data in HDFS!
  Hive does not have a rollback or undo feature

Renaming Tables and Columns

!Rename an existing table
  This command does not change data in HDFS

hive> ALTER TABLE customers RENAME TO clients;

!Rename a column by specifying its old and new names
  Type must be specified even though it is unchanged

hive> ALTER TABLE clients
      CHANGE fname first_name STRING;
          -- old   new        type

!You can also modify a column's type
  The old and new column names will be the same
  You must ensure data in HDFS conforms to the new type

Adding Columns

!Use ADD COLUMNS to add new columns to a table
  Appended to the end of any existing columns
  Existing data will have NULL values for new columns

hive> ALTER TABLE jobs
      ADD COLUMNS (city STRING, bonus INT);

Reordering and Removing Columns

!Use AFTER or FIRST to reorder columns
  Ensure that your data in HDFS matches the new order

hive> ALTER TABLE jobs
      CHANGE bonus bonus INT AFTER salary;

hive> ALTER TABLE jobs
      CHANGE bonus bonus INT FIRST;

!Use REPLACE COLUMNS to remove columns

hive> ALTER TABLE jobs REPLACE COLUMNS
      (id INT,
       title STRING,
       salary INT);

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Controlling Table Data Location

!Hive stores data associated with a table in its warehouse directory
!Storing data below Hive's warehouse is not always ideal
  Data might be shared by several users
!Use LOCATION to specify the directory where the table's data resides

CREATE TABLE tablename
(campaign_id STRING,
when TIMESTAMP,
keyword STRING,
site STRING,
placement STRING,
was_clicked BOOLEAN,
cost SMALLINT)
LOCATION '/path/to/data';

Self-Managed (External) Tables

!Recall that dropping a table removes its data in HDFS
!Using EXTERNAL when creating the table avoids this behavior
  Dropping an external table removes only its metadata

CREATE EXTERNAL TABLE tablename
(campaign_id STRING,
click_time TIMESTAMP,
keyword STRING,
site STRING,
placement STRING,
was_clicked BOOLEAN,
cost SMALLINT)
LOCATION '/path/to/data';

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Simplifying Complex Queries

!Complex queries can become cumbersome
  Imagine typing this several times for different orders

SELECT o.order_id, order_date, p.prod_id, brand, name
FROM orders o
JOIN order_details d
ON (o.order_id = d.order_id)
JOIN products p
ON (d.prod_id = p.prod_id)
WHERE o.order_id=6584288;

Creating Views

!Views in Hive are conceptually like a table, but backed by a query

CREATE VIEW order_info AS
SELECT o.order_id, order_date, p.prod_id, brand, name
FROM orders o
JOIN order_details d
ON (o.order_id = d.order_id)
JOIN products p
ON (d.prod_id = p.prod_id);

!Our query is now greatly simplified

hive> SELECT * FROM order_info WHERE order_id=6584288;

Inspecting and Removing Views

!Use DESCRIBE FORMATTED to see the underlying query

hive> DESCRIBE FORMATTED order_info;

!Use DROP VIEW to remove a view

hive> DROP VIEW order_info;

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Saving Query Output to a Table

! SELECT statements display their results on screen
!To send results to a Hive table, use INSERT OVERWRITE TABLE
Existing contents will be deleted

## hive> INSERT OVERWRITE TABLE ny_customers

SELECT * FROM customers
WHERE state = 'NY';

!Use INSERT INTO TABLE to append results instead of overwriting

## hive> INSERT INTO TABLE ny_customers

SELECT * FROM customers
WHERE state = 'NJ' OR state = 'CT';

Creating Tables Based on Existing Data

!Hive supports creating a table based on a SELECT statement
Often known as Create Table As Select (CTAS)

## CREATE TABLE ny_customers AS

# SELECT cust_id, fname, lname FROM customers
WHERE state = 'NY';

!Column definitions are derived from the existing table
!Column names are inherited from the existing names
Use aliases in the SELECT statement to specify new names

Writing Output To a Filesystem

!You can save output to a file in HDFS

## hive> INSERT OVERWRITE DIRECTORY '/dualcore/ny/'

SELECT * FROM customers
WHERE state = 'NY';

!Use LOCAL DIRECTORY to write to the local filesystem instead

## hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/bob/ny/'

SELECT * FROM customers
WHERE state = 'NY';

!Both produce text files delimited by Ctrl-A characters
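Downstream scripts can consume this output by splitting on the Ctrl-A byte. A minimal Python sketch (the sample row is illustrative):

```python
# Hive's default output delimiter is the non-printing Ctrl-A character (\x01).
def parse_hive_row(line):
    """Split one line of Hive text output into its fields."""
    return line.rstrip("\n").split("\x01")

row = parse_hive_row("1000001\x01Alice\x01Smith\x01NY\n")
print(row)  # ['1000001', 'Alice', 'Smith', 'NY']
```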

Writing Output To HDFS, Specifying Format

!To write the files to HDFS with a user-specified format:
Create an external table in the required format
Use INSERT OVERWRITE TABLE

## hive> CREATE EXTERNAL TABLE ny_customers

(cust_id INT,
fname STRING,
lname STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dualcore/nydata';

## hive> INSERT OVERWRITE TABLE ny_customers

SELECT cust_id, fname, lname
FROM customers WHERE state = 'NY';

Multi-Table Insert (1)

!We just saw that you can save output to an HDFS file

## INSERT OVERWRITE DIRECTORY 'ny_customers'

SELECT cust_id, fname, lname
FROM customers WHERE state = 'NY';

!This query could also be written as follows

FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_customers'
SELECT cust_id, fname, lname WHERE state='NY';

Multi-Table Insert (2)

!We sometimes need to extract data to multiple tables
Hive SELECT queries can take a long time to complete
!Hive allows us to do this with a single query
Much more efficient than using multiple queries
!The following example demonstrates multi-table insert
Result is two directories in HDFS

FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_names'
SELECT fname, lname WHERE state = 'NY'
INSERT OVERWRITE DIRECTORY 'ny_count'
SELECT count(DISTINCT cust_id)
WHERE state = 'NY';

Multi-Table Insert (3)

!The following query produces the same result

## FROM (SELECT * FROM customers WHERE state='NY') nycust

INSERT OVERWRITE DIRECTORY 'ny_names'
SELECT fname, lname
INSERT OVERWRITE DIRECTORY 'ny_count'
SELECT count(DISTINCT cust_id);

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Hive Authentication

!HiveServer2 can be configured for Kerberos or LDAP authentication
!The example below connects to HiveServer2 from Beeline

## [training@localhost analyst]\$ beeline

Beeline version 0.10.0-cdh4.2.1 by Apache Hive

## beeline> !connect jdbc:hive2://localhost:10000

training mypwd org.apache.hive.jdbc.HiveDriver

Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)

beeline>

Hive Authorization with Apache Sentry

!Sentry is a plug-in for enforcing authorization
!Sentry can be enabled in HiveServer2
And in Impala
!Sentry provides fine-grained, role-based authorization for Hive data and metadata
!Sentry was developed at Cloudera
Available starting with CDH 4.3
Project is in incubation status at Apache
http://incubator.apache.org/projects/sentry.html

Sentry Access Control Model

!What does Sentry control access to?
Server
Database
Table
View
!Who can access Sentry-controlled objects?
Users in a Sentry role
Sentry roles = one or more groups
!How can role members access Sentry-controlled objects?
Read operations (SELECT privilege)
Write operations (INSERT privilege)

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Hands-On Exercise: Data Management with Hive

In this exercise, you will practice managing data with Hive.

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Essential Points

!Each Hive table maps to a directory in HDFS
Table data is stored as one or more files
Default format: plain text with delimited fields
Binary formats may offer better performance but limit compatibility
!Dropping a Hive-managed table deletes its data in HDFS
External tables require manual data deletion
!Views can help to simplify complex and repetitive queries
!In a secure environment, Sentry provides Hive authorization

Text Processing with Hive
Chapter 12

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Text Processing with Hive

In this chapter, you will learn
!How to use Hive's most important string functions
!How to format numeric values
!How to use regular expressions in Hive
!What n-grams are and why they are useful
!How to estimate how often words or phrases occur in text

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Text Processing Overview

!What types of data have we traditionally analyzed?
Carefully curated information in rows and columns
!What types of data are we producing today?
Free-form notes in electronic medical records
Application and server log files
Social network connections
Electronic messages
Product ratings
!These types of data also contain great value
But extracting it requires a different approach

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Basic String Functions

!Hive supports many string functions often found in RDBMSs

Function Description              Example Invocation      Input      Output
Convert to uppercase              UPPER(name)             Bob        BOB
Convert to lowercase              LOWER(name)             Bob        bob
Remove whitespace at start/end    TRIM(name)              " Bob "    Bob
Remove only whitespace at start   LTRIM(name)             " Bob"     Bob
Remove only whitespace at end     RTRIM(name)             "Bob "     Bob
Extract portion of string *       SUBSTRING(name, 0, 3)   Samuel     Sam

* Unlike SQL, the starting position is zero-based in Hive
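For readers who know Python, the table above maps onto familiar string operations (a rough sketch; note that Hive's SUBSTRING takes a start position and a length, while the Python slice below uses start and end indices):

```python
name = " Bob "
print(name.upper())    # UPPER  -> " BOB "
print(name.lower())    # LOWER  -> " bob "
print(name.strip())    # TRIM   -> "Bob"
print(name.lstrip())   # LTRIM  -> "Bob "
print(name.rstrip())   # RTRIM  -> " Bob"
print("Samuel"[0:3])   # SUBSTRING(name, 0, 3) -> "Sam"
```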

Parsing URLs with Hive

!The following examples assume the following URL as input
http://www.example.com/click.php?A=42&Z=105

Example Invocation              Output
PARSE_URL(url, 'PROTOCOL') http
PARSE_URL(url, 'HOST') www.example.com
PARSE_URL(url, 'PATH') /click.php
PARSE_URL(url, 'QUERY') A=42&Z=105
PARSE_URL(url, 'QUERY', 'A') 42
PARSE_URL(url, 'QUERY', 'Z') 105
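The same decomposition can be reproduced with Python's standard urllib.parse module, which is a handy way to check what PARSE_URL should return:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.example.com/click.php?A=42&Z=105"
parts = urlparse(url)
print(parts.scheme)                    # PROTOCOL -> http
print(parts.netloc)                    # HOST     -> www.example.com
print(parts.path)                      # PATH     -> /click.php
print(parts.query)                     # QUERY    -> A=42&Z=105
print(parse_qs(parts.query)["A"][0])   # QUERY, 'A' -> 42
```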

Numeric Format Functions

!Hive offers two functions for formatting a number
Simple: FORMAT_NUMBER (0.10.0 and later)
Versatile: PRINTF (0.9.0 and later)

Example Invocation                     Input         Output
FORMAT_NUMBER(commission, 2)           2345.519728   2,345.52
PRINTF("$%1.2f", total_price)          356.9752      $356.98
PRINTF("%s owes $%1.2f", name, amt)    Bob, 3.9      Bob owes $3.90
PRINTF("%.2f%%", taxrate * 100)        0.47314       47.31%

!Caution: avoid storing precise values as floating-point numbers!
These functions are best used for formatting results
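The equivalent printf-style formatting in Python makes the rounding behavior concrete:

```python
print(format(2345.519728, ",.2f"))       # FORMAT_NUMBER(commission, 2) -> 2,345.52
print("$%1.2f" % 356.9752)               # PRINTF("$%1.2f", total_price) -> $356.98
print("%s owes $%1.2f" % ("Bob", 3.9))   # -> Bob owes $3.90
print("%.2f%%" % (0.47314 * 100))        # -> 47.31%
```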

Splitting and Combining Strings

! CONCAT combines one or more strings
The CONCAT_WS variation joins them with a separator
! SPLIT does nearly the opposite
Difference: return value is ARRAY<STRING>

Example Invocation                 Output
CONCAT('alice', '@example.com')    alice@example.com
SPLIT('Amy/Sam/Ted', '/')          ["Amy","Sam","Ted"] *

* Representation of ARRAY<STRING>

Converting Arrays to Records with EXPLODE

!The EXPLODE function creates a record for each element in an array
An example of a table-generating function
The alias is required when invoking table-generating functions

Amy,Sam,Ted

## hive> SELECT SPLIT(people, ',') FROM example;

["Amy","Sam","Ted"]

## hive> SELECT EXPLODE(SPLIT(people, ',')) AS x FROM example;

Amy
Sam
Ted
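The SPLIT-then-EXPLODE pattern is easy to mirror in Python: splitting yields the array, and iterating over it yields one row per element:

```python
people = "Amy,Sam,Ted"
array = people.split(",")   # SPLIT(people, ',') -> ["Amy", "Sam", "Ted"]
print(array)
for person in array:        # EXPLODE(...) AS x -> one row per element
    print(person)
```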

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Regular Expressions

!A regular expression (regex) matches a pattern in text
Useful when exact matching is not practical

Regular Expression   String (matched portion in bold)
Dualcore I wish Dualcore had 2 stores in 90210.
\\d I wish Dualcore had 2 stores in 90210.
\\d{5} I wish Dualcore had 2 stores in 90210.
\\d\\s\\w+ I wish Dualcore had 2 stores in 90210.
\\w{5,9} I wish Dualcore had 2 stores in 90210.
.?\\. I wish Dualcore had 2 stores in 90210.
.*\\. I wish Dualcore had 2 stores in 90210.
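These patterns behave the same way in any regex engine; a quick check of several of them in Python:

```python
import re

s = "I wish Dualcore had 2 stores in 90210."
print(re.search(r"Dualcore", s).group())   # Dualcore
print(re.search(r"\d", s).group())         # 2
print(re.search(r"\d{5}", s).group())      # 90210
print(re.search(r"\d\s\w+", s).group())    # 2 stores
print(re.search(r"\w{5,9}", s).group())    # Dualcore
```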

Hive's Regular Expression Functions

!Hive has two important functions that use regular expressions
REGEXP_EXTRACT returns the matched text
REGEXP_REPLACE substitutes another value for the matched text
!These examples assume that txt has the following value
It's on Oak St. or Maple St in 90210

## hive> SELECT REGEXP_EXTRACT(txt, '\\d{5}', 0)

FROM message;
90210

## hive> SELECT REGEXP_REPLACE(txt, 'St.?\\s+', 'Street ')

FROM message;
It's on Oak Street or Maple Street in 90210
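The two Hive calls correspond to re.search and re.sub in Python; the pattern strings are the same apart from Hive's doubled backslashes:

```python
import re

txt = "It's on Oak St. or Maple St in 90210"
# REGEXP_EXTRACT: pull out the first five-digit sequence
print(re.search(r"\d{5}", txt).group())    # 90210
# REGEXP_REPLACE: normalize "St." / "St " to "Street "
print(re.sub(r"St.?\s+", "Street ", txt))  # It's on Oak Street or Maple Street in 90210
```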

Regex SerDe

!We sometimes need to analyze data that lacks consistent delimiters
Log files are a common example of this

## 05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""

05/23/2013 19:45:23 312-555-7834 OPTION_SELECTED "Shipping"
05/23/2013 19:46:23 312-555-7834 ON_HOLD ""
05/23/2013 19:47:51 312-555-7834 AGENT_ANSWER "Agent ID N7501"
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"
05/23/2013 19:48:41 312-555-7834 CALL_END "Duration: 3:22"

!The Regex SerDe parses each line using a regular expression
Allows us to create a table from this log file

Creating a Table with Regex SerDe (1)

Log excerpt
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

RegexSerDe
CREATE TABLE calls (
event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

!Each pair of parentheses denotes a field
Field value is the text matched by the pattern within the parentheses

Creating a Table with Regex SerDe (2)

Log excerpt
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

RegexSerDe
CREATE TABLE calls (
event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Table excerpt
event_date   event_time   phone_num      event_type   details
05/23/2013   19:48:37     312-555-7834   COMPLAINT    Item not received
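You can verify an input.regex against a sample line before creating the table; here is the same pattern applied with Python's re module:

```python
import re

# The value of "input.regex" above, written as a raw Python string
pattern = r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) "([^"]*)"'
line = '05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"'
fields = re.match(pattern, line).groups()
print(fields)
# ('05/23/2013', '19:48:37', '312-555-7834', 'COMPLAINT', 'Item not received')
```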

Regex SerDe in Older Versions of Hive

!The Regex SerDe wasn't formally part of Hive prior to 0.10.0
It shipped with Hive, but was part of the hive-contrib library
!To use Regex SerDe in 0.9.x and earlier versions of Hive
Change the SerDe's package name, as shown below

## CREATE TABLE calls (

event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Fixed-Width Formats in Hive

!Many older applications produce data in fixed-width formats

1030929610759620120829012215Oakland CA94618
cust_id order_id date time city state zipcode

!Unfortunately, Hive doesn't directly support these
But you can overcome this limitation by using RegexSerDe
!Caveat: all fields in RegexSerDe are of type STRING
May need to cast numeric values in your queries

Fixed-Width Format Example

Input data
1030929610759620120829012215Oakland CA94618

RegexSerDe
CREATE TABLE fixed (
cust_id STRING,
order_id STRING,
order_dt STRING,
order_tm STRING,
city STRING,
state STRING,
zip STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"(\\d{7})(\\d{7})(\\d{8})(\\d{6})(.{20})(\\w{2})(\\d{5})");

cust_id   order_id   order_dt   order_tm   city      state   zipcode
1030929   6107596    20120829   012215     Oakland   CA      94618
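The fixed-width pattern can likewise be tested in Python; the sample record below pads the city field to its assumed 20-character width:

```python
import re

pattern = r"(\d{7})(\d{7})(\d{8})(\d{6})(.{20})(\w{2})(\d{5})"
# Assemble a sample record; the city column is padded to 20 characters
line = "1030929" + "6107596" + "20120829" + "012215" + "Oakland".ljust(20) + "CA" + "94618"
m = re.match(pattern, line)
print(m.group(1))           # 1030929
print(m.group(5).strip())   # Oakland (fixed-width padding stripped)
print(m.group(7))           # 94618
```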

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Parsing Sentences into Words

!Hive's SENTENCES function parses supplied text into words
!Input is a string containing one or more sentences
!Output is a two-dimensional array of strings
Outer array contains one element per sentence
Inner array contains one element per word in that sentence
hive> SELECT txt FROM phrases WHERE id=12345;
I bought this computer and really love it! It's very fast and
does not crash.

## hive> SELECT SENTENCES(txt) FROM phrases WHERE id=12345;

[["I","bought","this","computer","and","really","love","it"],
["It's","very","fast","and","does","not","crash"]]
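A rough Python approximation of SENTENCES shows the two-level structure (real sentence segmentation is more sophisticated; this sketch just splits on terminal punctuation):

```python
import re

def sentences(txt):
    """Split text into sentences, then each sentence into words."""
    parts = re.split(r"[.!?]+\s*", txt.strip())
    return [re.findall(r"[\w']+", p) for p in parts if p]

txt = "I bought this computer and really love it! It's very fast and does not crash."
print(sentences(txt))
```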

Sentiment Analysis

!Sentiment analysis is an application of text analytics
Classification and measurement of opinions
Frequently used for social media analysis
!Context is essential for human languages
Which word combinations appear together?
How frequently do these combinations appear?

n-grams

!An n-gram is a word combination (n = number of words)
A bigram is a sequence of two words (n=2)
!n-gram frequency analysis is an important step in many applications
Suggesting spelling corrections in search results
Finding the most important topics in a body of text
Identifying trending topics in social media messages

Calculating n-grams in Hive (1)

!Hive offers the NGRAMS function for calculating n-grams
!The function requires three input parameters
Array of strings (sentences), each containing an array (words)
Number of words in each n-gram
Desired number of results (top-N, based on frequency)
!Output is an array of STRUCT with two attributes
ngram: the n-gram itself (an array of words)
estfrequency: estimated frequency at which this n-gram appears

Calculating n-grams in Hive (2)

!The NGRAMS function is often used with the SENTENCES function
We also used LOWER to normalize case
And EXPLODE to convert the resulting array to a series of rows

## hive> SELECT txt FROM phrases WHERE id=56789;

This tablet is great. The size is great. The screen is
great. The audio is great. I love this tablet! I love

## hive> SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(txt)), 2, 5))

AS bigrams FROM phrases WHERE id=56789;
{"ngram":["is","great"],"estfrequency":4.0}
{"ngram":["great","the"],"estfrequency":3.0}
{"ngram":["this","tablet"],"estfrequency":3.0}
{"ngram":["i","love"],"estfrequency":2.0}
{"ngram":["tablet","i"],"estfrequency":1.0}
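Conceptually, NGRAMS counts adjacent word pairs and keeps the most frequent; a small Python sketch of the same computation (sample sentences are illustrative):

```python
from collections import Counter

def top_bigrams(sentence_words, n=5):
    """Count adjacent word pairs across sentences; return the top n."""
    counts = Counter()
    for words in sentence_words:
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)

sents = [["this", "tablet", "is", "great"],
         ["the", "screen", "is", "great"],
         ["i", "love", "this", "tablet"]]
print(top_bigrams(sents))
```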

Finding Specific n-grams in Text

! CONTEXT_NGRAMS is similar, but considers only specific combinations
Any NULL values in the array are treated as placeholders

## hive> SELECT txt FROM phrases

WHERE txt LIKE '%new computer%';
My new computer is fast! I wish I'd upgraded sooner.
This new computer is expensive, but I need it now.
I can't believe her new computer failed already.

hive> SELECT EXPLODE(CONTEXT_NGRAMS(SENTENCES(LOWER(txt)),
ARRAY("new", "computer", NULL, NULL), 4, 3)) AS ngrams
FROM phrases;
{"ngram":["is","expensive"],"estfrequency":1.0}
{"ngram":["is","fast"],"estfrequency":1.0}

Histograms

!Histograms illustrate how values in the data are distributed
This helps us estimate the overall shape of the data distribution

Calculating Data for Histograms

! HISTOGRAM_NUMERIC creates the data needed for histograms
Input: column name and number of bins in the histogram
Output: coordinates representing bin centers and heights

## hive> SELECT EXPLODE(HISTOGRAM_NUMERIC(

total_price, 10)) AS dist FROM cart_orders;
{"x":25417.336745023003,"y":8891.0}
{"x":74401.5041469194,"y":3376.0}
{"x":123550.04418985262,"y":611.0}
{"x":197421.12500000006,"y":24.0}
{"x":267267.53846153844,"y":26.0}
{"x":425324.0,"y":4.0}
{"x":479226.38461538474,"y":13.0}
{"x":524548.0,"y":6.0}
{"x":598463.5,"y":2.0}
{"x":975149.0,"y":2.0}

Import this data into charting software to produce a histogram

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Hands-On Exercise: Gaining Insight with Sentiment Analysis

In this exercise, you will practice sentiment analysis with Hive.

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Essential Points

!Most data produced these days lacks rigid structure
Text processing can help us analyze loosely-structured data
!The SPLIT function creates an array from a string
EXPLODE creates individual records from an array
!Hive has extensive support for regular expressions
You can extract or substitute values based on patterns
You can even create a table based on regular expressions
!An n-gram is a sequence of words
Use NGRAMS and CONTEXT_NGRAMS to find their frequency

Hive Optimization
Chapter 13

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Hive Optimization

In this chapter, you will learn
!Which factors help determine the performance of Hive queries
!What command displays Hive's execution plan for a query
!How to enable several useful Hive performance features
!How to use table bucketing to sample data
!How to create and rebuild indexes in Hive

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

How Hive Processes Data

SELECT * FROM customers
WHERE zipcode=94305;

Steps Run Locally:
Parse HiveQL Statements
Build the Execution Plan

Steps Run on Cluster:
Submit MapReduce Jobs
Map Processing Phase
Reduce Processing Phase
Write Output Data

Hive Query Performance Patterns (1)

!The fastest queries require no MapReduce job at all

DESCRIBE customers;

## SELECT * FROM customers;

!Then a query that requires a map-only job

## SELECT * FROM customers WHERE zipcode = 94305;

Hive Query Performance Patterns (2)

!The next slowest type of query requires both Map and Reduce phases

## SELECT COUNT(cust_id) FROM customers

WHERE zipcode=94305;

!The slowest queries require multiple MapReduce jobs

## SELECT zipcode, COUNT(cust_id) AS num FROM customers

GROUP BY zipcode
ORDER BY num DESC
LIMIT 10;

Viewing the Execution Plan

!How can you tell how Hive will execute a query?
Can it return data directly from HDFS?
Will it require a reduce phase or multiple MapReduce jobs?
!Prefix your query with EXPLAIN to view Hive's execution plan

## hive> EXPLAIN SELECT * FROM customers;

!The output of EXPLAIN can be very long and complex
Fully understanding it requires in-depth knowledge of MapReduce
We will cover the basics here

Viewing a Query Plan with EXPLAIN (1)

!The query plan contains three main sections
The abstract syntax tree details how Hive parsed the query (excerpt below)
The stage dependencies and plans are more useful to most users

## hive> EXPLAIN CREATE TABLE cust_by_zip AS

SELECT zipcode, COUNT(cust_id) AS num
FROM customers GROUP BY zipcode;

## ABSTRACT SYNTAX TREE:

(TOK_CREATETABLE (TOK_TABNAME cust_by_zip) ...

STAGE DEPENDENCIES:
... (excerpt shown on next slide)

STAGE PLANS:
... (excerpt shown on upcoming slide)

Viewing a Query Plan with EXPLAIN (2)

!Our query has three stages
!Dependencies define the order
Stage-1 runs first
Stage-0 runs next
Stage-2 runs last

ABSTRACT SYNTAX TREE:
... (shown on previous slide)

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0

STAGE PLANS:
... (shown on next slide)

Viewing a Query Plan with EXPLAIN (3)

!Stage-1: MapReduce job
!Map phase
Selects zipcode and cust_id columns
!Reduce phase
Group by zipcode
Count cust_id

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        TableScan
          Select Operator
            zipcode, cust_id
      Reduce Operator Tree:
        Group By Operator
          aggregations:
            expr: count(cust_id)
          keys:
            expr: zipcode
Viewing a Query Plan with EXPLAIN (4)

!Stage-0: HDFS action
Move the previous stage's output to Hive's warehouse directory
!Stage-2: Metastore action
Create new table
Has two columns

STAGE PLANS:
  Stage: Stage-1 (covered earlier)
  ...

  Stage: Stage-0
    Move Operator
      files:
        hdfs directory: true
        destination: (HDFS path...)

  Stage: Stage-2
    Create Table Operator:
      Create Table
        columns: zipcode string, num bigint
        name: cust_by_zip

Sorting Results

!As in SQL, ORDER BY sorts specified fields in HiveQL
Consider the result from the following query

## hive> SELECT name, SUM(total)

FROM order_info GROUP BY name
ORDER BY name;

!All mapper output is processed by a single reducer
The final output is globally sorted:

Alice 12491
Bob 9997
Carlos 1431
Diana 5385

Using SORT BY for Partial Ordering

!HiveQL also supports partial ordering via SORT BY
Offers much better performance if global order isn't required

## hive> SELECT name, SUM(total)

FROM order_info GROUP BY name
SORT BY name;

!Only the output from each reducer is sorted
With two reducers, for example:

Reducer 1 output:   Reducer 2 output:
Alice 12491         Bob 9997
Carlos 1431         Diana 5385

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Parallel Execution

!Stages in Hive's execution plan often lack dependencies
This means they can be run in parallel
!Hive supports parallel execution in such cases
However, this feature is disabled by default
!Enable this by setting the hive.exec.parallel property to true
Set it only for yourself in $HOME/.hiverc
Set it for all users in /etc/hive/conf/hive-site.xml

Reducing Latency Through Local Execution

!MapReduce imposes overhead
Necessary to process large amounts of data in Hive
Possibly inefficient with small amounts of data
!Processing data locally can substantially speed up smaller jobs
Local execution can substantially improve turnaround for small jobs

## hive> SET mapred.job.tracker=local;

hive> SET mapred.local.dir=/home/training/tmpdata;

## hive> SELECT zipcode, COUNT(cust_id) AS num

FROM customers GROUP BY zipcode;

Automatic Selection of Local Execution Mode

!Hive can now select local execution mode automatically
It does this on a case-by-case basis using heuristics
!Like parallel execution, this feature is disabled by default
Enable it by setting hive.exec.mode.local.auto to true

Job Control

!You will see many log messages when you run a query in Hive's shell
One of these messages will identify the MapReduce job ID

## hive> SELECT * FROM customers WHERE zipcode=94305;

Total MapReduce jobs = 1
Launching Job 1 out of 1
... (other messages omitted) ...

!Use the job ID to check status or kill the job from the command line

## $ mapred job -status job_201306022351_0025

$ mapred job -kill job_201306022351_0025

Viewing a Job in the Web UI (1)

## hive> SELECT * FROM customers WHERE zipcode=94305;

Total MapReduce jobs = 1
Launching Job 1 out of 1
... (other messages omitted) ...

## Starting Job = job_201306022351_0029, Tracking URL =

http://jobtracker.example.com:50030/jobdetails.jsp?
jobid=job_201306022351_0029

Viewing a Job in the Web UI (2)

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Recap: Data Storage in Hive

!Recall how Hive stores data
A table simply points to a directory in HDFS
The table's data are files within that directory

call_logs table = a directory in HDFS containing one file per day

call-20130603.log:
2013-06-03 07:55:36 312-555-3453 ...
2013-06-03 07:55:39 312-555-3453 OPTION_SELECTED "Billing"

call-20130604.log:
2013-06-04 07:05:21 212-555-3982 ...
2013-06-04 07:08:36 212-555-3982 HUNG_UP "Busy watching Hadoop Webinar"
2013-06-04 07:14:29 314-555-5741 ...

call-20130605.log:
2013-06-05 07:21:57 808-555-3453 ...
2013-06-05 07:23:12 808-555-3453 OPTION_SELECTED "Billing"

Our analysts use this data to summarize the previous day's calls

## hive> SELECT event_type, COUNT(event_type)

FROM call_logs
WHERE call_date = '2013-06-03'
GROUP BY event_type;

!These queries always filter by a value in the call_date field

Table Partitioning

!It is possible to create a table that partitions the data
Does not prevent you from running queries that span partitions

call_logs directory in HDFS, with one subdirectory per partition:

date=2013-06-03:
07:55:39 312-555-3453 OPTION_SELECTED "Billing"
07:55:41 312-555-3453 CALL_ENDED ""

date=2013-06-04:
07:05:21 212-555-3982 CALL_RECEIVED ""
07:08:36 212-555-3982 HUNG_UP "Busy watching Hadoop Webinar"
07:14:29 314-555-5741 AGENT_ANSWER "Agent ID N5150"
07:14:51 314-555-5741 CALL_TRANSFER "Billing"

date=2013-06-05:
07:23:12 808-555-3453 OPTION_SELECTED "Billing"
07:23:14 213-555-8752 CALL_ENDED ""

Creating a Partitioned Table in Hive (1)

!Specify the partition column in the PARTITIONED BY clause

## CREATE TABLE call_logs (

call_time STRING,
event_type STRING,
phone STRING,
details STRING)
PARTITIONED BY (call_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

!You can also create nested partitions

## PARTITIONED BY (call_date STRING, event_type STRING)

Creating a Partitioned Table in Hive (2)

!The partition column is displayed if you DESCRIBE the table:

## hive> DESCRIBE call_logs;

OK
call_time string
event_type string
phone string
details string
call_date string

!However, the partition is a virtual column
It does not exist in the incoming data itself

## hive> LOAD DATA INPATH 'call-20130603.log'

INTO TABLE call_logs
PARTITION(call_date='2013-06-03');

## hive> LOAD DATA INPATH 'call-20130604.log'

INTO TABLE call_logs
PARTITION(call_date='2013-06-04');

!The\$above\$example\$would\$create\$two\$subdirectories:\$
/user/hive/warehouse/call_logs/call_date=2013-06-03
/user/hive/warehouse/call_logs/call_date=2013-06-04

Dynamic Partition Inserts (1)

!Hive can dynamically insert data into specific partitions for you
!Syntax:

FROM customers
INSERT OVERWRITE TABLE custs_part PARTITION(state)
SELECT cust_id, fname, lname, address, city,
zipcode, state;

!Partitions are automatically created based on the value of the last column
If the partition does exist, it will be overwritten

Dynamic Partition Inserts (2)

!Dynamic partitioning is not enabled by default
Enable it by setting these two properties

Property Name                        Value
hive.exec.dynamic.partition          true
hive.exec.dynamic.partition.mode     nonstrict

!Caution: avoid creating an excessive number of partitions
This can happen if your data contains many unique values

Dynamic Partition Inserts (3)

!Caution: if the partition column has many different values, many partitions will be created
!Three Hive configuration properties exist to limit this
hive.exec.max.dynamic.partitions.pernode
Maximum number of dynamic partitions that can be created by any given Mapper or Reducer
Default: 100
hive.exec.max.dynamic.partitions
Total number of dynamic partitions that can be created by one HiveQL statement
Default: 1000
hive.exec.max.created.files
Maximum total files created by Mappers and Reducers
Default: 100000

!To view the current partitions in a table

## hive> SHOW PARTITIONS call_logs;

!To add a partition manually

## ALTER TABLE call_logs

ADD PARTITION (call_date='2013-06-05')
LOCATION '/dualcore/call_logs/call_date=2013-06-05';

!To remove a partition

## ALTER TABLE call_logs

DROP PARTITION (call_date='2013-06-06');

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

What Is Bucketing?

!Partitioning subdivides data by values in the partitioned columns
!Bucketing data is another way of subdividing data
Calculates a hash code for values inserted into bucketed columns
The hash code is used to assign new records to a bucket
!Goal: distribute rows across a predefined number of buckets
Useful for jobs which need random samples of data
Joins may be faster if all tables are bucketed on the join column

Creating a Bucketed Table

!Example of creating a table that supports bucketing
Creates a table supporting 20 buckets based on the order_id column
Each bucket should contain roughly 5% of the table's data

## CREATE TABLE orders_bucketed

(order_id INT,
cust_id INT,
order_date TIMESTAMP)
CLUSTERED BY (order_id) INTO 20 BUCKETS;

!The column selected for bucketing should have well-distributed values
Identifier columns are often a good choice
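The mechanics of bucket assignment can be sketched in Python (Hive uses its own hash function; for integer columns the hash is the value itself, which this sketch assumes):

```python
NUM_BUCKETS = 20

def bucket_for(order_id):
    """Assign a row to one of NUM_BUCKETS buckets by hashing the column value."""
    return order_id % NUM_BUCKETS

print(bucket_for(6584288))  # -> 8
print(bucket_for(6584290))  # -> 10
```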

Inserting Data Into a Bucketed Table

!Bucketing isn't automatically enforced when inserting data
!Set the hive.enforce.bucketing property to true
This sets the number of reducers to the number of buckets in the table definition

## hive> SET hive.enforce.bucketing=true;

hive> INSERT OVERWRITE TABLE orders_bucketed
SELECT * FROM orders;

Sampling Data From a Bucketed Table

!Use the following syntax to sample data from a bucketed table:
This example selects one of every ten records (10%)

## hive> SELECT * FROM orders_bucketed

TABLESAMPLE (BUCKET 1 OUT OF 10 ON order_id);

!It\$is\$possible\$to\$use\$TABLESAMPLE\$on\$a\$non#bucketed\$table\$
However,"this"requires"a"full"scan"of"the"enBre"table"

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Indexes in Hive

!Tables in Hive also now support indexes
Similar to indexes in RDBMSs, but much more limited
!May improve performance for certain types of queries
But maintaining them costs disk space and CPU time
!Syntax to create an index:

hive> CREATE INDEX idx_orders_cust_id
      ON TABLE orders(cust_id)
      AS 'handler_class'
      WITH DEFERRED REBUILD;
!The handler class is the fully-qualified name of a Java class, such as
Hive's built-in org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler
Viewing and Building Indexes in Hive

!This command lists the indexes associated with the orders table

hive> SHOW FORMATTED INDEX ON orders;

!Hive indexes are initially empty
Building (and later rebuilding) indexes is a manual process
Use the ALTER INDEX command to rebuild an index
Caution: this can be a lengthy operation!

hive> ALTER INDEX idx_orders_cust_id ON orders REBUILD;

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Essential Points

!ORDER BY sorts results globally, just as in SQL
HiveQL also supports SORT BY for partial ordering
!Local execution mode can significantly reduce query latency
But it is only appropriate for small amounts of data
!Partitioning and bucketing can both subdivide a table's data
Bucketing is used to support random sampling
!Hive's indexing feature can boost performance for certain queries
But it comes at the cost of increased disk and CPU usage

Bibliography

The following offer more information on topics discussed in this chapter
!Hive Manual for the EXPLAIN Command
http://tiny.cloudera.com/dac13a
!Hive Manual for Bucketed Tables
http://tiny.cloudera.com/dac13b
!Hive Manual for Indexes
http://tiny.cloudera.com/dac13c

Extending Hive
Chapter 14

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Extending Hive

In this chapter, you will learn
!What role SerDes play in Hive
!How to use a custom SerDe
!How to use TRANSFORM for custom record processing
!How to use variable substitution

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Hive SerDes

!SerDe stands for Serializer/Deserializer
SerDes control the row format of a table
Specified, sometimes implicitly, when the table is created
!Hive ships with many SerDes, including:

LazySimpleSerDe  Uses specified field delimiters (default)
RegexSerDe       Based on supplied patterns
ColumnarSerDe    Uses the columnar format needed by RCFile
HBaseSerDe       Uses an HBase table

Example: Using the RegexSerDe

Input Data
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

CREATE TABLE calls (
  event_date STRING,
  event_time STRING,
  phone_num STRING,
  event_type STRING,
  details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
  "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Resulting Table
event_date  event_time  phone_num     event_type  details
05/23/2013  19:48:37    312-555-7834  COMPLAINT   Item not received
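The regular expression in SERDEPROPERTIES above (shown there with extra backslashes because it is embedded in a quoted HiveQL string) can be checked with any ordinary regex engine. A Python sketch:

```python
import re

# The pattern RegexSerDe applies to each input line; every capture
# group becomes one column of the table
pattern = r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) "([^"]*)"'
line = '05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"'

fields = re.match(pattern, line).groups()
print(fields)
```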

!Hive also allows writing custom SerDes using its Java API
There are many open source SerDes on the Web
Writing your own is seldom necessary

Registering a SerDe Library

!The next example uses the CSVSerde, packaged as a JAR file
available from http://tiny.cloudera.com/dac14a
!You must register external libraries before using them
Ensures Hive can find the library (JAR file) at runtime
Use ADD JAR followed by the path to the JAR file
!Registration remains in effect only during the current Hive session

Using the SerDe in Hive

Input Data
1,Gigabux,gigabux@example.com
2,"ACME Distribution Co.",acme@example.com
3,"Bitmonkey, Inc.",bmi@example.com

Specify SerDe
CREATE TABLE vendors
  (id INT,
   name STRING,
   email STRING)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde';

Resulting Table
id  name                   email
1   Gigabux                gigabux@example.com
2   ACME Distribution Co.  acme@example.com
3   Bitmonkey, Inc.        bmi@example.com

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Using TRANSFORM to Process Data Using External Scripts

!You are not limited to manipulating data exclusively in HiveQL
Hive allows you to transform data through external scripts or programs
These can be written in nearly any language
!This is done with HiveQL's TRANSFORM ... USING construct
One or more fields are supplied as arguments to TRANSFORM()
The external script is identified by USING

hive> SELECT TRANSFORM(*) USING 'myscript.pl'
      FROM employees;

Data Input and Output with TRANSFORM

!Hive supplies records to your program on standard input
Each field in the supplied record will be a tab-separated string
NULL values are converted to the literal string "\N"
!You may need to convert values to appropriate types within your program
!Your program must return tab-delimited fields on standard output
Output fields can optionally be named and cast using the syntax below

SELECT TRANSFORM(product_name, price)
USING 'tax_calculator.py'
AS (item_name STRING, tax INT)
FROM products;
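A TRANSFORM script in Python follows the same contract. The sketch below (the helper names are ours) parses tab-separated input fields, treating the literal \N as NULL, and emits tab-separated output:

```python
import sys

def parse_record(line):
    """Split one TRANSFORM input line into fields; Hive separates
    fields with tabs and encodes NULL as the literal string \\N."""
    fields = line.rstrip('\n').split('\t')
    return [None if f == '\\N' else f for f in fields]

def emit_record(fields):
    """Write one output record: tab-delimited fields on stdout."""
    print('\t'.join('\\N' if f is None else str(f) for f in fields))

if __name__ == '__main__':
    for line in sys.stdin:
        emit_record(parse_record(line))  # identity transform, for illustration
```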

Hive TRANSFORM Example (1)

!Here is a complete example of using TRANSFORM in Hive
The script looks up the country to which each e-mail address
corresponds, and then returns an appropriate greeting
Here's a sample of the input data

hive> SELECT name, email_address FROM employees;

Antoine antoine@example.fr
Kai     kai@example.de
Pedro   pedro@example.mx

Here's the corresponding HiveQL code

hive> SELECT TRANSFORM(name, email_address)
      USING 'greeting.pl' AS greeting
      FROM employees;

Hive TRANSFORM Example (2)

!The Perl script for this example is shown below

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

Hive TRANSFORM Example (3)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

The first line tells the system to use the Perl interpreter when
running this script. We define our greetings in the next line using
an associative array keyed by the country code we'll extract from
each e-mail address.

Hive TRANSFORM Example (4)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

We read each record from standard input within the loop, and then
split them into fields based on tab characters.

Hive TRANSFORM Example (5)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

We extract the country code from the e-mail address (the pattern
matches any letters following the final dot). We use that to look up
a greeting, but default to "Hello" if we didn't find one.

Hive TRANSFORM Example (6)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

Finally, we return our greeting as a single field by printing this
value to standard output. If we had multiple fields, we'd simply
separate each by tab characters when printing them here.

Hive TRANSFORM Example (7)

!Finally, here's the result of our transformation

hive> SELECT TRANSFORM(name, email_address)
      USING 'greeting.pl' AS greeting
      FROM employees;

Bonjour Antoine
Hallo Kai
Hola Pedro
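Since TRANSFORM scripts can be written in nearly any language, the same logic could be a Python script instead of Perl; this sketch reproduces the greeting lookup (the function name here is ours):

```python
import re

GREETINGS = {'de': 'Hallo', 'fr': 'Bonjour', 'mx': 'Hola'}

def greet(name, email):
    """Look up a greeting by the letters after the final dot in the
    e-mail address; default to 'Hello' for unknown suffixes."""
    match = re.search(r'\.([a-z]+)$', email)
    suffix = match.group(1) if match else None
    return '{} {}'.format(GREETINGS.get(suffix, 'Hello'), name)

print(greet('Antoine', 'antoine@example.fr'))  # Bonjour Antoine
```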

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Overview of User-Defined Functions (UDFs)

!User-Defined Functions (UDFs) are custom functions
Invoked with the same syntax as built-in functions

hive> SELECT CALC_SHIPPING_COST(order_id, 'OVERNIGHT')
      FROM orders WHERE order_id = 5742354;

!There are three types of UDFs in Hive
Standard UDFs
User-Defined Aggregate Functions (UDAFs)
User-Defined Table Functions (UDTFs)

Developing Hive UDFs

!Hive UDFs are written in Java
Currently no support for writing UDFs in other languages
!Open source UDFs are plentiful on the Web
!There are three steps for using a UDF in Hive
1. Register the JAR file containing the function
2. Register the function itself
3. Use the function in your query

Attention Java Developers
Cloudera now offers a free e-learning module, "Writing UDFs for Hive"
http://tiny.cloudera.com/dac14b

Example: Using a UDF in Hive (1)

!Our example UDF was compiled from sources found on GitHub
A popular Web site for many open source software projects
Project URL: http://tiny.cloudera.com/dac14e
!We compiled the source and packaged it into a JAR file
We have included a copy of it on your VM
!Our example shows the DATE_FORMAT UDF in that JAR file
Allows great flexibility in formatting date fields in output

Example: Using a UDF in Hive (2)

!First, register the JAR with Hive
Same step as with a custom SerDe

!Next, register the function and assign an alias
The quoted value is the fully-qualified Java class for the UDF

hive> CREATE TEMPORARY FUNCTION DATE_FORMAT
      AS 'com.nexr.platform.hive.udf.UDFDateFormat';

Example: Using a UDF in Hive (3)

!You may then use the function in your query

hive> SELECT order_date FROM orders LIMIT 1;
2011-12-06 10:03:35

hive> SELECT DATE_FORMAT(order_date, 'dd-MMM-yyyy')
      FROM orders LIMIT 1;
06-Dec-2011

hive> SELECT DATE_FORMAT(order_date, 'dd/MM/yy')
      FROM orders LIMIT 1;
06/12/11

hive> SELECT DATE_FORMAT(order_date, 'EEEE, MMM d, yyyy')
      FROM orders LIMIT 1;
Tuesday, Dec 6, 2011
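The patterns above are Java SimpleDateFormat patterns, so MM means month while mm means minutes. For comparison, the equivalent conversions in Python's strftime notation:

```python
from datetime import datetime

ts = datetime.strptime('2011-12-06 10:03:35', '%Y-%m-%d %H:%M:%S')

# Java 'dd-MMM-yyyy' roughly corresponds to Python '%d-%b-%Y'
print(ts.strftime('%d-%b-%Y'))   # 06-Dec-2011 (in an English locale)

# Java 'dd/MM/yy' roughly corresponds to Python '%d/%m/%y'
print(ts.strftime('%d/%m/%y'))   # 06/12/11
```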

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Hive Variables (1)

!Hive supports variable substitution
Swaps a placeholder with a variable's literal value at run time
Variable names are case-sensitive
!Within the Hive shell, set a named variable equal to some value:

hive> SET state=CA;

!To use the variable's value in a HiveQL query:

hive> SELECT * FROM employees
      WHERE state = '${hiveconf:state}';

Hive Variables (2)

!You can set variables when you invoke Hive from the command line
Eases repetitive queries by reducing the need to modify HiveQL
!For example, imagine that we have the following in state.hql

SELECT COUNT(DISTINCT emp_id) FROM employees
WHERE state = '${hiveconf:state}';

!This makes creating per-state reports easy:

$ hive -hiveconf state=CA -f state.hql > ca_count.txt
$ hive -hiveconf state=NY -f state.hql > ny_count.txt
$ hive -hiveconf state=TX -f state.hql > tx_count.txt
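Hive's variable substitution is plain text replacement performed before the query runs; the effect can be sketched with Python's string.Template:

```python
from string import Template

# ${state} plays the role of Hive's ${hiveconf:state} placeholder
query = Template("SELECT COUNT(DISTINCT emp_id) FROM employees "
                 "WHERE state = '${state}';")

# Passing -hiveconf state=CA amounts to substituting the literal value
print(query.substitute(state='CA'))
```

Because the substitution is purely textual, the same script file can serve any number of states without modification.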

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Hands-On Exercise: Data Transformation with Hive
!! Conclusion

!In this Hands-On Exercise, you will transform data with Hive

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Hands-On Exercise: Data Transformation with Hive
!! Conclusion

Essential Points

!TRANSFORM processes records using an external program
This can be written in nearly any language
!UDFs are User-Defined Functions
Custom logic that can be invoked just like built-in functions
!Hive substitutes variable placeholders with the literal values you assign
This is done when you execute the query

Introduction to Impala
Chapter 15

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Introduction to Impala

In this chapter, you will learn
!What Impala is and how it compares to Hive, Pig, and RDBMSs
!How Impala executes queries
!Where Impala fits into the data center
!What notable differences exist between Impala and Hive
!How to run queries from the shell or browser

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

What is Impala?

!A high-performance SQL engine for vast amounts of data
Massively-parallel processing (MPP)
Query latency measured in milliseconds
Can query data stored in HDFS or HBase tables
!Developed by Cloudera
100% open source, released under the Apache software license
InteracAng"with"Impala"

!Impala\$supports\$a\$subset\$of\$SQL#92\$
Plus"a"few"extensions"found"in"MySQL"and"Oracle"SQL"dialects"
Almost"idenAcal"to"HiveQL"
!Impala\$oers\$many\$interfaces\$for\$running\$queries\$
Command/line"shell"
Hue"Web"applicaAon"
ODBC"/"JDBC""

Why"Use"Impala?"

!Many\$benets\$are\$the\$same\$as\$with\$Hive\$or\$Pig\$
More"producAve"than"wriAng"MapReduce"code"
No"sobware"development"experience"required"
Leverage"exisAng"knowledge"of"SQL"
!One\$benet\$exclusive\$to\$Impala\$is\$speed\$
Highly/opAmized"for"queries"
Almost"always"at"least"ve"Ames"faster"than"either"Hive"or"Pig"
Oben"20"Ames"faster"or"more"

Dualcore Inc. Dashboard

[Screenshot: a BI dashboard with panels for "Top States for In-Store
Sales" and "Suppliers by Region" (e.g., Japan: 31 suppliers)]

Where Impala Fits Into the Data Center

[Diagram: transaction records from the application database, log data
from Web servers, and documents from a file server all flow into a
Hadoop cluster with Impala; one analyst uses the Impala shell for ad
hoc queries while another accesses Impala via a BI tool]

Where to Get Impala

!We strongly recommend running Impala on CDH 4.2 or higher
Requires a 64-bit Linux platform
!Installation and configuration are outside the scope of this course
Your virtual machine includes a working installation of Impala

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Comparing Impala to Hive and Pig

!Let's first look at similarities between Hive, Pig, and Impala
Queries expressed in high-level languages
Alternatives to writing MapReduce code
!Impala shares the metastore with Hive
Tables created in Hive are visible in Impala (and vice versa)

Contrasting Impala to Hive and Pig (1)

!Hive and Pig execute queries as MapReduce jobs
MapReduce is a general-purpose computation framework
Not optimized for executing interactive SQL queries
Even a trivial query takes 10 seconds or more
!Impala does not use MapReduce
Uses a custom execution engine built specifically for Impala
Queries can complete in a fraction of a second

Contrasting Impala to Hive and Pig (2)

!Hive, Pig, and Impala all also support
Executing queries via an interactive shell or the command line
Grouping, joining, and filtering data
!Impala currently lacks some Hive and Pig features
More details later in this chapter
!Hive and Pig are best suited to long-running batch processes

How Impala Executes a Query

!Each slave node in the cluster runs an Impala daemon
Co-located with the HDFS DataNode
A client issues a query to an Impala daemon
!The Impala daemon plans the query
Checks the local metastore cache
Distributes the query across the other Impala daemons in the cluster
Streams results to the client
!Two other daemons running on master nodes support query execution
The State Store daemon
Provides a lookup service for Impala daemons
Periodically checks the status of the Impala daemons
The Catalog daemon
Relays metadata changes to all the Impala daemons in a cluster
Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Comparing Impala To A Relational Database

                           Relational Database   Impala
Query language             SQL                   SQL-92 subset
Update individual records  Yes                   No
Delete individual records  Yes                   No
Transactions               Yes                   No
Indexing                   Yes                   No
Latency                    Very low              Low
Data size                  Terabytes             Petabytes
ODBC / JDBC support        Yes                   Yes

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Hive Features Currently Unsupported in Impala

!Impala does not currently support some features found in Hive
Many of these are being considered for future releases
!Complex data types (ARRAY, MAP, or STRUCT)
!BINARY data type
!External transformations
!Custom SerDes
!Indexing
!Bucketing and table sampling

Other Notable Differences Between Impala and Hive

!Only one DISTINCT clause is allowed per query in Impala
The typical workaround is to use a subselect and UNION
!Impala requires that queries with ORDER BY also specify a LIMIT
This sets an outer bound on the result set
The size of the LIMIT can be arbitrarily large
!Impala and Hive handle out-of-range values differently
Hive returns NULL
Impala returns the maximum value for that type
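The out-of-range behavior above can be illustrated with a small sketch (illustrative code, not Impala's implementation):

```python
TINYINT_MAX = 127

def tinyint_overflow_impala_style(value):
    """Mimic the behavior described above for a value too large for
    TINYINT: Impala returns the type's maximum, while Hive returns NULL."""
    return TINYINT_MAX if value > TINYINT_MAX else value

print(tinyint_overflow_impala_style(300))  # 127
```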

Query Fault Tolerance in Impala

!Queries in both Hive and Impala are distributed across nodes
!Impala has its own execution engine
Currently lacks fault tolerance
If a node fails during a query, the query will fail
Workaround: just re-run the query

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Starting the Impala Shell

!You can execute statements in the Impala shell
This interactive tool is similar to the shell in MySQL or Hive
!Execute the impala-shell command to start the shell
Some log messages truncated to better fit the slide

$ impala-shell
Connected to localhost.localdomain:21000
Welcome to the Impala shell.
[localhost.localdomain:21000] >

!Use the -i hostname:port option to connect to another server

$ impala-shell -i myserver.example.com:21000
[myserver.example.com:21000] >

Using the Impala Shell

!Enter semicolon-terminated statements at the prompt
Hit Enter to execute the query
Impala pretty-prints the output
Use the quit command to exit the Impala shell

$ impala-shell
> SELECT cust_id, fname, lname FROM customers
  WHERE zipcode='20525';
+---------+--------+-----------+
| cust_id | fname  | lname     |
+---------+--------+-----------+
| 1133567 | Steven | Robertson |
| 1171826 | Robert | Gillis    |
+---------+--------+-----------+
> quit;
$

Note: shell prompt abbreviated as >

Running Queries from the Command Line

!You can execute a file containing queries using the -f option

$ impala-shell -f myquery.hql

!Run queries directly from the command line with the -q option

$ impala-shell -q 'SELECT * FROM users'

!Use -o (and optionally specify a delimiter) to capture output to a file

$ impala-shell -f myquery.hql \
    -o results.txt \
    --output_file_field_delim='\t'

Interacting with the Operating System

!Use "shell" to execute system commands from within the Impala shell

> shell date;
Mon May 20 16:44:35 PDT 2013

!There is no direct support for HDFS commands, but shell works for these too

> shell hadoop fs -mkdir /reports/sales/2013;

Accessing Impala with Hue (1)

!Alternatively, you can access Impala through Hue
!To use Hue, browse to http://hue_server:8888/
You may need to start the Hue service first (sudo service hue start)
!Launch Hue's Impala interface by clicking its icon

[Screenshot: the Impala icon in Hue]

Accessing Impala with Hue (2)

!Hue allows you to run Impala queries from your Web browser

[Screenshot: the Hue Impala Query editor containing the query below]

SELECT zipcode, COUNT(order_id) AS total
FROM customers JOIN orders
ON customers.cust_id = orders.cust_id
WHERE zipcode LIKE '6%'
GROUP BY zipcode
ORDER BY total DESC
LIMIT 100;

Accessing Impala with Hue (3)

!Hue displays the results in a sortable table

[Screenshot: the Hue Impala Query results table]

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Essential Points

!Impala is a high-performance SQL engine
!Queries are expressed in a SQL dialect similar to HiveQL
!The primary difference compared to Hive and Pig is speed
Hive and Pig are better for long-running batch processes
Impala does not currently support all features of Hive

Bibliography (1)

The following offer more information on topics discussed in this chapter
!Free O'Reilly Cloudera Impala book
http://tiny.cloudera.com/dac15f
http://tiny.cloudera.com/dac15a
!Wired Article on Impala
http://tiny.cloudera.com/dac15b
!Cloudera Blog Detailing Impala Features and Performance
http://tiny.cloudera.com/dac15c
!Impala Documentation at the Cloudera Web site
http://tiny.cloudera.com/dac15e

Bibliography (2)

!37signals Blog Comparing Performance of Impala, Hive, and MySQL
http://tiny.cloudera.com/dac15d

Analyzing Data with Impala
Chapter 16

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Analyzing Data with Impala

In this chapter, you will learn
!How Impala's query syntax compares to HiveQL
!How to create databases and tables in Impala
!Which data types Impala supports
!How to structure your query for better performance
!What other factors influence Impala performance

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Overview of Impala Query Syntax

!Impala's query language is a subset of SQL-92
Plus a few extensions from the Oracle and MySQL dialects
!Syntax almost identical to HiveQL
Differences mainly related to features unsupported in Impala
Most Hive queries can be executed verbatim in Impala
!Impala may also support statements that Hive does not
Such as the ability to insert individual rows *

> INSERT INTO customers VALUES (1234567, 'Abe',
  'Froman', '123 Oak St.', 'Chicago', 'IL', '60601');

!Keywords are not case-sensitive, but are often capitalized by convention
!Comments are allowed in scripts, the shell, and on the command line

$ cat find_customers.sql

/* This script will query the customers table
 * and find all customers in a given ZIP code
 */
SELECT cust_id, fname, lname
FROM customers
WHERE zipcode='60601'; -- downtown Chicago

Databases and Tables in Impala

!Every Impala table belongs to a database
Impala and Hive share a metastore
The same databases and tables are visible in both Hive and Impala
!The default database is selected at startup
The USE command switches to another database
List the tables in a database with the SHOW TABLES command

> USE accounting;
> SHOW TABLES;
+----------+
| name     |
+----------+
| accounts |
+----------+

Creating Databases and Tables in Impala

!Data definition is generally identical to Hive
Custom SerDes and bucketing are unsupported

> CREATE DATABASE sales;
> USE sales;
> CREATE EXTERNAL TABLE prospects
  (id INT,
   name STRING COMMENT 'Include surname',
   email STRING,
   active BOOLEAN COMMENT 'True, if on mailing list',
   last_contact TIMESTAMP)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LOCATION '/dept/sales/prospects';

Displaying Table Structure

!Use DESCRIBE to display a table's structure
DESCRIBE EXTENDED is unsupported

> DESCRIBE prospects;
+--------------+-----------+--------------------------+
| name         | type      | comment                  |
+--------------+-----------+--------------------------+
| id           | int       |                          |
| name         | string    | Include surname          |
| email        | string    |                          |
| active       | boolean   | True, if on mailing list |
| last_contact | timestamp |                          |
+--------------+-----------+--------------------------+

Impala and Table Metadata

!Impala shares the metastore with Hive
Tables created in Hive are visible in Impala (and vice versa)
!Impala daemons cache table metadata across the cluster, including:
The tables' schema definitions
The locations of the tables' HDFS blocks
!When tables or data are modified in Hive, refresh Impala's cache
REFRESH <table> updates the metadata for one table immediately,
retrieving block locations for new data files only
INVALIDATE METADATA marks the metadata for all tables (or a single
table) as stale; when the table is next queried, all HDFS block
locations are retrieved
Use INVALIDATE METADATA when a table has been created or extensively
modified in Hive, or after HDFS balancing

Selecting Data in Impala

!Use SELECT to retrieve data from tables
Results are formatted for display

> SELECT fname, lname, city, state FROM customers
  WHERE cust_id = 1234567;
+-------+--------+---------+-------+
| fname | lname  | city    | state |
+-------+--------+---------+-------+
| Abe   | Froman | Chicago | IL    |
+-------+--------+---------+-------+

!Impala does not require a FROM clause

> SELECT SQRT(64) AS square_root;

Using Impala Built-in Functions

!Invoke built-in functions as you would in SQL or HiveQL

> SELECT CONCAT_WS(', ', lname, fname) AS fullname
  FROM customers WHERE cust_id=1234567;
+-------------+
| fullname    |
+-------------+
| Froman, Abe |
+-------------+

!Impala supports many of the same built-in functions as Hive
It lacks some others, including many formatting and text
processing functions

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Data Types in Impala

!Each column in a table is associated with a data type
Impala supports most types available in Hive
Most are similar to those found in relational databases

> DESCRIBE prospects;
+--------------+-----------+--------------------------+
| name         | type      | comment                  |
+--------------+-----------+--------------------------+
| id           | int       |                          |
| name         | string    | Include surname          |
| email        | string    |                          |
| active       | boolean   | True, if on mailing list |
| last_contact | timestamp |                          |
+--------------+-----------+--------------------------+

Impala's Integer Types

!Integer types are appropriate for whole numbers
Both positive and negative values allowed

Name      Description                                      Example Value
TINYINT   Range: -128 to 127                               17
SMALLINT  Range: -32,768 to 32,767                         5842
INT       Range: -2,147,483,648 to 2,147,483,647           84127213
BIGINT    Range: ~ -9.2 quintillion to ~ 9.2 quintillion   632197432180964
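These ranges follow directly from each type's storage width: an n-bit signed integer spans -2^(n-1) through 2^(n-1) - 1. A quick check in Python:

```python
def signed_range(bits):
    """Return (min, max) for a signed two's-complement integer of the
    given width; TINYINT/SMALLINT/INT/BIGINT are 8/16/32/64 bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

for name, bits in [('TINYINT', 8), ('SMALLINT', 16), ('INT', 32), ('BIGINT', 64)]:
    print(name, signed_range(bits))
```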

Impala's Decimal Types

!Decimal types are appropriate for floating point numbers
Both positive and negative values allowed
Caution: avoid using them when exact values are required!

Name    Description            Example Value
FLOAT   Decimals               3.14159
DOUBLE  Very precise decimals  3.14159265358979323846

Other Types in Impala

!Impala can also store a few other types of information
Only one character type (variable length)

Name       Description         Example Value
STRING     Character sequence  Betty F. Smith
TIMESTAMP  Instant in time     2013-06-14 16:51:05

!Impala does not support BINARY or complex types

Data Type Conversion

!Hive auto-converts a STRING column used in a numeric context

hive> SELECT zipcode FROM customers LIMIT 1;
60601
hive> SELECT zipcode + 1.5 FROM customers LIMIT 1;
60602.5

!Impala requires an explicit CAST operation for this

> SELECT zipcode + 1.5 FROM customers LIMIT 1;
ERROR: AnalysisException: Arithmetic operation ...
> SELECT CAST(zipcode AS FLOAT) + 1.5
  FROM customers LIMIT 1;
60602.5
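Python draws the same line that Impala does here: mixing a string and a number is an error unless you convert explicitly:

```python
zipcode = '60601'  # a STRING value, as in the customers table

try:
    zipcode + 1.5            # implicit conversion: rejected
except TypeError as err:
    print('error:', err)     # analogous to Impala's AnalysisException

print(float(zipcode) + 1.5)  # explicit cast succeeds: 60602.5
```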

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Limiting and Sorting Query Results

!The LIMIT clause sets the maximum number of rows returned

> SELECT fname, lname FROM customers LIMIT 10;

!Caution: there is no guarantee regarding which 10 results are returned
Use ORDER BY for top-N queries
The field(s) you ORDER BY must be selected
!When using ORDER BY, the LIMIT clause is mandatory in Impala

> SELECT cust_id, fname, lname FROM customers
  ORDER BY cust_id DESC LIMIT 10;

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Joins in Impala

!Like Hive, Impala can join multiple data sets
!Impala supports the same types of joins that Hive does
Inner joins
Outer joins (left, right, and full)
Left semi joins
Cross joins

Record Grouping and Aggregate Functions

!GROUP BY groups selected data by one or more columns
Caution: columns not part of an aggregation must be listed in GROUP BY

Query:
SELECT region, state,
       COUNT(id) AS num
FROM stores
GROUP BY region, state;

stores table:
id  city     state  region
a   Albany   NY     EAST
b   Boston   MA     EAST
c   Chicago  IL     NORTH
d   Detroit  MI     NORTH
e   Elgin    IL     NORTH

Result of query:
region  state  num
EAST    MA     1
EAST    NY     1
NORTH   IL     2
NORTH   MI     1
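The aggregation in this example amounts to counting rows per (region, state) pair, which is easy to verify with a few lines of Python:

```python
from collections import Counter

# The stores table from the slide, reduced to the grouped columns
stores = [('EAST', 'NY'), ('EAST', 'MA'), ('NORTH', 'IL'),
          ('NORTH', 'MI'), ('NORTH', 'IL')]

# GROUP BY region, state with COUNT(id) is a per-key row count
counts = Counter(stores)
for (region, state), num in sorted(counts.items()):
    print(region, state, num)
```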

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Overview of Impala User-Defined Functions (UDFs)

!Like Hive, Impala supports User-Defined Functions (UDFs)
!Hive UDFs can be used in Impala with no changes
With a few exceptions
!There are two types of UDFs in Impala
Standard UDFs
User-Defined Aggregate Functions (UDAFs)
!Impala UDFs can be written in Java or C++
C++ UDFs are implemented as shared objects
!Impala C++ UDFs cannot be used in Hive
Using a Java UDF in Impala (1)

!Register the function with Impala
Specify data types that correspond to the method signature of the UDF class's evaluate method after the function name
Specify data types that correspond to the return type of the UDF class's evaluate method in the RETURNS clause
Identify the jar file containing the UDF class in the LOCATION clause
Specify the UDF class name in the SYMBOL clause

CREATE FUNCTION STRIP(STRING) RETURNS STRING
LOCATION '/user/hive/udfs/MyUDFs.jar'
SYMBOL='com.example.hive.ql.udf.UDFStrip';

Using a Java UDF in Impala (2)

!You may then use the function in Impala queries
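
The slide's usage example did not survive conversion; below is a minimal sketch, reusing the STRIP function registered on the previous slide and the customers table from earlier examples (the query itself is an assumption, not from the slide):

```sql
-- Apply the Java UDF exactly like a built-in function
SELECT cust_id, STRIP(fname) FROM customers;
```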

Using a C++ UDF in Impala

!Register the function with Impala

CREATE FUNCTION COUNT_VOWELS(STRING)
RETURNS INT
LOCATION '/user/hive/udfs/sampleudfs.so'
SYMBOL='CountVowels';

!You may then use the function in your query
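
A minimal usage sketch (the fname column is borrowed from the earlier customers examples; the query itself is not from the slide):

```sql
-- Apply the C++ UDF exactly like a built-in function
SELECT fname, COUNT_VOWELS(fname) FROM customers;
```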

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Impala Performance Overview

!Several factors affect Impala performance
Computing statistics on tables before running joins
The format and type of data being queried
The hardware and configuration of your cluster

Join Performance Optimization

!You should compute statistics for tables with COMPUTE STATS
When the amount of data in a table changes substantially

COMPUTE STATS orders;
COMPUTE STATS order_details;

SELECT COUNT(o.order_id) FROM orders o
JOIN order_details d ON (o.order_id = d.order_id)
WHERE YEAR(o.order_date) = 2008;

Data Formats Supported by Impala

!Impala can query data in several formats
Table below summarizes compatibility

File Type     Description of File Type                 Read  Create  Insert
Parquet       High-performance columnar format         Yes   Yes     Yes
Text          Plaintext delimited flat file format     Yes   Yes     Yes
Avro          Structured cross-platform binary format  Yes   No      No
RCFile        Columnar format compatible with Hive     Yes   Yes     No
SequenceFile  Binary flat file format                  Yes   Yes     No

Data Formats and Optimization

!The limiting factor in most queries is I/O
Disk speed is the most common bottleneck
Columnar formats reduce I/O when few columns are selected
!If performance is a key concern, choose Parquet
This may limit compatibility with other tools

> CREATE TABLE order_details
  (order_id INT,
   prod_id INT)
  STORED AS PARQUETFILE;

Impala Query Size

!Query size is based on a query's working set size
The working set of a query contains all records after
Filtering rows
Pruning unused columns
Performing aggregation, if applicable
!For aggregations, the query size is the working set size for all the tables in the query
!For joins, the query size is the working set size for all the tables in the join, excluding the largest table
!Impala queries must fit into the cluster's aggregate memory
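
To make the join rule concrete, here is a hypothetical sizing; all table names and numbers are invented for illustration:

```sql
-- fact_table: working set ~2 TB  (the largest table, so excluded)
-- dim_table:  working set ~20 GB after row filtering and column pruning
--
-- Query size = working sets of all joined tables except the largest
--            = ~20 GB
-- This ~20 GB must fit into the cluster's aggregate memory.
```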

Cluster Hardware

!Memory: 32GB minimum, 64GB is better
!Disks: more is better
Impala tries to maximize throughput across disks
Servers with 12 or more disks are ideal

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Hands-On Exercise: Interactive Analysis with Impala

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Essential Points

!Impala's query syntax is nearly identical to HiveQL
Most Hive queries can be executed verbatim in Impala
A metadata refresh is needed following external changes
!Impala supports most simple data types from Hive
!Query structure and file format can affect performance
!Your cluster's hardware also affects performance

Bibliography

The following offer more information on topics discussed in this chapter
http://tiny.cloudera.com/dac16a
!Impala Language Reference
http://tiny.cloudera.com/dac16b

Choosing the Best Tool for the Job
Chapter 17

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Choosing the Best Tool for the Job

In this chapter, you will learn
!How MapReduce, Pig, Hive, Impala, and RDBMSs compare to one another
!Why a workflow might involve several different tools
!How to select the best tool for a given job

Chapter Topics

Choosing the Best Tool for the Job

!! Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
!! Which to Choose?
!! Conclusion

Recap of Data Analysis/Processing Tools

!MapReduce
Low-level processing and analysis
!Pig
Procedural data flow language executed using MapReduce
!Hive
SQL-based queries executed using MapReduce
!Impala
High-performance SQL-based queries using a custom execution engine

Comparing Pig, Hive, and Impala

Description of Feature               Pig   Hive  Impala
SQL-based query language             No    Yes   Yes
Optional schema and metastore        Yes   No    No
User-defined functions (UDFs)        Yes   Yes   Yes
Process data with external scripts   Yes   Yes   No
Extensible file format support       Yes   Yes   No
Complex data types                   Yes   Yes   No
Query latency                        High  High  Low
Built-in data partitioning           No    Yes   Yes
Accessible via ODBC / JDBC           No    Yes   Yes

Do These Replace an RDBMS?

!Probably not if the RDBMS is used for its intended purpose
!Relational databases are optimized for
Relatively small amounts of data
Immediate results
In-place modification of data (UPDATE and DELETE)
!Pig, Hive, and Impala are optimized for
Extensive scalability at low cost
!Pig and Hive are better suited for batch processing
Impala and RDBMSs are better for interactive use

Comparing RDBMS to Hive and Impala

                           RDBMS      Hive       Impala
Insert individual records  Yes        No         Yes
Update and delete records  Yes        No         No
Transactions               Yes        No         No
Role-based authorization   Yes        Yes        No
Stored procedures          Yes        No         No
Index support              Extensive  Limited    None
Latency                    Very low   High       Low
Data size                  Terabytes  Petabytes  Petabytes
Complex data types         No         Yes        No
Storage cost               Very high  Very low   Very low

Recap: Apache Sqoop

!Sqoop transfers data between an RDBMS and HDFS
Can import all tables, a single table, or a portion of a table into HDFS
Supports incremental imports
Can also export data from HDFS back to the database
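
The Sqoop commands below sketch the capabilities listed above; the hostname, database, table names, and paths are hypothetical and not from the slide:

```shell
# Import a single table into HDFS
sqoop import --connect jdbc:mysql://dbhost/salesdb \
  --username analyst --table customers \
  --target-dir /user/analyst/customers

# Incremental import: append only rows whose cust_id exceeds the last value
sqoop import --connect jdbc:mysql://dbhost/salesdb \
  --username analyst --table customers \
  --incremental append --check-column cust_id --last-value 12345

# Export data from HDFS back to the database
sqoop export --connect jdbc:mysql://dbhost/salesdb \
  --username analyst --table report_results \
  --export-dir /user/analyst/results
```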

Chapter Topics

Choosing the Best Tool for the Job

!! Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
!! Which to Choose?
!! Conclusion

Which to Choose?

Mix and match them as needed
!MapReduce
Low-level approach offers great flexibility
More time-consuming and error-prone to write
Best when control matters more than productivity
!Pig, Hive, and Impala offer more productivity
Faster to write, test, and deploy than MapReduce

Analysis Workflow Example

[Workflow diagram with the following stages:]
Import Transaction Data from RDBMS
Sessionize Web Log Data with Pig
Sentiment Analysis on Social Media with Hive
… with Impala
Analyst using Impala shell for ad hoc queries
Analyst using Impala via BI tool
Generate Nightly Reports using Pig, Hive, or Impala

Chapter Topics

Choosing the Best Tool for the Job

!! Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
!! Which to Choose?
!! Conclusion

Essential Points

Choose the best one for a given job
Workflows may involve exchanging data between them
!Selection criteria include scale, speed, control, and productivity
MapReduce offers control at the cost of productivity
Pig and Hive offer productivity but not necessarily speed
Relational databases offer speed but not scalability
Impala offers scalability and speed but less control

Conclusion
Chapter 18

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Interoperability and Workflows
!! Conclusion

Course Objectives (1)

During this course, you have learned
!The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
!How to identify typical use cases for large-scale data analysis
!How to manage data in HDFS and export it for use with other systems
!The language syntax and data formats supported by these tools

Course Objectives (2)

!How to design and execute queries on data stored in HDFS
!How to analyze structured, semi-structured, and unstructured data
!How Hive and Pig can be extended with custom functions and scripts
!How to store and query data for better performance

Which Course to Take Next?

Cloudera offers a range of training courses for you and your team
!For developers