Cloudera Data Analyst Training:

Using Pig, Hive, and Impala with Hadoop

Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201403
Introduction
Chapter 1

Course Chapters

- Introduction
- Hadoop Fundamentals
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Chapter Topics

Introduction

- About this course
- About Cloudera
- Course Logistics
- Introductions

Course Objectives (1)

During this course, you will learn
- The purpose of Hadoop and its related tools
- The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
- How to identify typical use cases for large-scale data analysis
- How to load data from relational databases and other sources
- How to manage data in HDFS and export it for use with other systems
- How Pig, Hive, and Impala improve productivity for typical analysis tasks
- The language syntax and data formats supported by these tools

Course Objectives (2)

- How to design and execute queries on data stored in HDFS
- How to join diverse datasets to gain valuable business insight
- How to analyze structured, semi-structured, and unstructured data
- How Hive and Pig can be extended with custom functions and scripts
- How to store and query data for better performance
- How to determine which tool is the best choice for a given task

Chapter Topics: About Cloudera

About Cloudera (1)

- The leader in Apache Hadoop-based software and services
- Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
- Provides support, consulting, training, and certification for Hadoop users
- Staff includes committers to virtually all Hadoop projects
- Many authors of industry standard books on Apache Hadoop projects
  - Tom White, Eric Sammer, Lars George, etc.

About Cloudera (2)

- Customers include many key users of Hadoop
  - Allstate, AOL Advertising, Box, CBS Interactive, eBay, Experian, Groupon, Macys.com, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, ...
- Cloudera public training:
  - Cloudera Developer Training for Apache Hadoop
  - Cloudera Administrator Training for Apache Hadoop
  - Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  - Cloudera Training for Apache HBase
  - Introduction to Data Science: Building Recommender Systems
  - Cloudera Essentials for Apache Hadoop
- Onsite and custom training is also available

CDH

- CDH (Cloudera's Distribution including Apache Hadoop)
  - 100% open source, enterprise-ready distribution of Hadoop and related projects
  - The most complete, tested, and widely-deployed distribution of Hadoop
  - Integrates all the key Hadoop ecosystem projects
  - Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball

Cloudera Express

- Cloudera Express
  - Free download
- The best way to get started with Hadoop
- Includes CDH
- Includes Cloudera Manager
  - End-to-end administration for Hadoop
  - Deploy, manage, and monitor your cluster

Cloudera Enterprise

- Cloudera Enterprise
  - Subscription product including CDH and Cloudera Manager
- Includes support
- Includes extra Cloudera Manager features
  - Configuration history and rollbacks
  - Rolling updates
  - LDAP integration
  - SNMP support
  - Automated disaster recovery
  - Etc.

Chapter Topics: Course Logistics

Logistics

- Class start and finish times
- Lunch
- Breaks
- Restrooms
- Wi-Fi access
- Virtual machines
- Can I come in early/stay late?

Your instructor will give you details on how to access the course materials and exercise instructions for the class

Chapter Topics: Introductions

Introductions

- About your instructor
- About you
  - Where do you work and what do you do there?
  - Which database(s) and platform(s) do you use?
  - Have you worked with Apache Hadoop or related tools?
  - Any experience as a developer?
  - What programming languages do you use?
  - What are your expectations for this course?

Hadoop Fundamentals
Chapter 2

Hadoop Fundamentals

In this chapter, you will learn
- Which factors led to the era of Big Data
- What Hadoop is and what significant features it offers
- How it offers reliable storage for massive amounts of data with HDFS
- How it supports large-scale data processing through MapReduce
- How Hadoop Ecosystem tools can boost an analyst's productivity
- Several ways to integrate Hadoop into the modern data center

Aside: Please Launch the Virtual Machine

- At the end of this chapter you will work on the first Hands-On Exercise
- The exercises are performed in a Virtual Machine (VM)
- The first time the VM is launched, it takes several minutes to boot
  - It is configuring the class environment
  - Subsequent boots are much faster
- To save time, please launch the VM now so that it will be ready by the time we get to the first Hands-On Exercise
  - Your instructor will tell you how to do this

Chapter Topics

Hadoop Fundamentals

- The Motivation for Hadoop
- Hadoop Overview
- HDFS
- MapReduce
- The Hadoop Ecosystem
- Exercise Scenario Explanation
- Hands-On Exercise: Data Ingest with Hadoop Tools
- Conclusion

Velocity

- We are generating data faster than ever
  - Processes are increasingly automated
  - Systems are increasingly interconnected
  - People are increasingly interacting online

Variety

- We are producing a wide variety of data
  - Social network connections
  - Server and application log files
  - Electronic medical records
  - Images, audio, and video
  - RFID and wireless sensor network events
  - Product ratings on shopping and review Web sites
  - And much more
- Not all of this maps cleanly to the relational model

Volume

- Every day
  - More than 1.5 billion shares are traded on the New York Stock Exchange
  - Facebook stores 2.7 billion comments and Likes
  - Google processes about 24 petabytes of data
- Every minute
  - Foursquare handles more than 2,000 check-ins
  - TransUnion makes nearly 70,000 updates to credit files
- And every second
  - Banks process more than 10,000 credit card transactions

Data Has Value

- This data has many valuable applications
  - Product recommendations
  - Predicting demand
  - Marketing analysis
  - Fraud detection
  - And many, many more
- We must process it to extract that value
  - And processing all the data can yield more accurate results

We Need a System that Scales

- We're generating too much data to process with traditional tools
- Two key problems to address
  - How can we reliably store large amounts of data at a reasonable cost?
  - How can we analyze all the data we have stored?

Chapter Topics: Hadoop Overview

What is Apache Hadoop?

- Scalable and economical data storage and processing
  - Distributed and fault-tolerant
  - Harnesses the power of industry standard hardware
- Heavily inspired by technical documents published by Google
- Core Hadoop consists of two main components
  - Storage: the Hadoop Distributed File System (HDFS)
  - Processing: MapReduce
  - Plus the infrastructure needed to make them work, including
    - Filesystem and administration utilities
    - Job scheduling and monitoring

Scalability

- Hadoop is a distributed system
  - A collection of servers running Hadoop software is called a cluster
- Individual servers within a cluster are called nodes
  - Typically standard rackmount servers running Linux
  - Each node both stores and processes data
- Add more nodes to the cluster to increase scalability
  - A cluster may contain up to several thousand nodes
  - You can scale out incrementally as required

Fault Tolerance

- Paradox: Adding nodes increases chances that any one of them will fail
  - Solution: build redundancy into the system and handle it automatically
- Files loaded into HDFS are replicated across nodes in the cluster
  - If a node fails, its data is re-replicated using one of the other copies
- Data processing jobs are broken into individual tasks
  - Each task takes a small amount of data as input
  - Thousands of tasks (or more) often run in parallel
  - If a node fails during processing, its tasks are rescheduled elsewhere
- Routine failures are handled automatically without any loss of data
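
The paradox in the first bullet can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every node fails independently with the same probability on a given day. The 0.1% figure and the function name are hypothetical and do not come from the course.

```python
def prob_any_node_fails(num_nodes: int, p_node_failure: float) -> float:
    """Probability that at least one of num_nodes independent nodes fails.

    Computed as 1 minus the probability that every node survives.
    """
    return 1.0 - (1.0 - p_node_failure) ** num_nodes


if __name__ == "__main__":
    # Hypothetical per-node daily failure probability (an assumption).
    p = 0.001
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} nodes: {prob_any_node_fails(n, p):.3f}")
```

Under these assumptions a single node almost never fails on a given day, while a cluster of a thousand nodes sees at least one failure on most days, which is why redundancy has to be automatic.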

Chapter Topics: HDFS

HDFS: Hadoop Distributed File System

- Provides inexpensive and reliable storage for massive amounts of data
- Optimized for sequential access to a relatively small number of large files
  - Each file is likely to be 100MB or larger
  - Multi-gigabyte files are typical
- In some ways, HDFS is similar to a UNIX filesystem
  - Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
  - UNIX-style file ownership and permissions
- There are also some major deviations from UNIX
  - No concept of a current directory
  - Cannot modify files once written
  - Must use Hadoop-specific utilities or custom code to access HDFS

HDFS Architecture

- Hadoop has a master/slave architecture
- HDFS master daemon: NameNode
  - Manages namespace and metadata
  - Monitors slave nodes
- HDFS slave daemon: DataNode
  - Reads and writes the actual data

[Figure: A Small Hadoop Cluster, showing a master (NameNode) and slaves (DataNode)]

Accessing HDFS via the Command Line

- HDFS is not a general purpose filesystem
  - Not built into the OS, so only specialized tools can access it
  - End users typically access HDFS via the hadoop fs command
- Example: display the contents of the /user/fred/sales.txt file

$ hadoop fs -cat /user/fred/sales.txt

- Example: Create a directory (below the root) called reports

$ hadoop fs -mkdir /reports

Copying Local Data To and From HDFS

- Remember that HDFS is distinct from your local filesystem
  - Use hadoop fs -put to copy local files to HDFS
  - Use hadoop fs -get to fetch a local copy of a file from HDFS

From the client machine to the Hadoop cluster:

$ hadoop fs -put sales.txt /reports

From the Hadoop cluster to the client machine:

$ hadoop fs -get /reports/sales.txt

More hadoop fs Command Examples

- Copy file input.txt from local disk to the user's directory in HDFS

$ hadoop fs -put input.txt input.txt

  - This will copy the file to /user/username/input.txt
- Get a directory listing of the HDFS root directory

$ hadoop fs -ls /

- Delete the file /reports/sales.txt

$ hadoop fs -rm /reports/sales.txt
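
The first example on this slide relies on a path rule that is easy to miss: a destination without a leading slash ends up under /user/username. The sketch below illustrates that rule only. The function name is hypothetical, and this is not how the Hadoop client itself is implemented.

```python
import posixpath


def resolve_hdfs_path(path: str, username: str) -> str:
    """Illustrate how a path given to hadoop fs is interpreted.

    Absolute paths are used as-is; relative paths are taken to be
    relative to the user's HDFS home directory, /user/<username>.
    """
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", username, path))


if __name__ == "__main__":
    print(resolve_hdfs_path("input.txt", "fred"))
    print(resolve_hdfs_path("/reports/sales.txt", "fred"))
```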

Chapter Topics: MapReduce

Introducing MapReduce

- We typically process data in Hadoop using MapReduce
- MapReduce is not a language, it's a programming model
  - A style of processing data popularized by Google
- Benefits of MapReduce
  - Simplicity
  - Flexibility
  - Scalability

Understanding Map and Reduce

- MapReduce consists of two functions: map and reduce
  - The output from map becomes the input to reduce
- The map function always runs first
  - Typically used to filter, transform, or parse data
- The reduce function is optional
  - Normally used to summarize data from the map function (aggregation)
  - Not always needed: you can run map-only jobs
- Each piece is simple, but can be powerful when combined

MapReduce Example

- MapReduce code for Hadoop is typically written in Java
  - But possible to use nearly any language with Hadoop Streaming
- The following slides will explain an entire MapReduce job
  - Input: text file containing order ID, employee name, and sale amount
  - Output: sum of all sales per employee

Job Input
0 Alice 3625
1 Bob 5174
2 Alice 893
3 Alice 2139
4 Diana 3581
5 Carlos 1039
6 Bob 4823
7 Alice 5834
8 Carlos 392
9 Diana 1804

Job Output
Alice 12491
Bob 9997
Carlos 1431
Diana 5385

Explanation of the Map Function

- Hadoop splits job into many individual map tasks
  - Number of map tasks is determined by the amount of input data
  - Each map task receives a portion of the overall job input to process
  - Mappers process one input record at a time
  - For each input record, they emit zero or more records as output
- In this case, the map task simply parses the input record
  - And then emits the name and price fields for each as output

Job Input        Output from Map Tasks
0 Alice 3625     Alice 3625
1 Bob 5174       Bob 5174
2 Alice 893      Alice 893
3 Alice 2139     Alice 2139
4 Diana 3581     Diana 3581
5 Carlos 1039    Carlos 1039
6 Bob 4823       Bob 4823
7 Alice 5834     Alice 5834
8 Carlos 392     Carlos 392
9 Diana 1804     Diana 1804
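
The map step described above can be sketched in a few lines. This is a plain Python illustration of the idea, not Hadoop API code. The function name map_record and the choice to skip malformed lines are assumptions made for the example.

```python
from typing import Iterator, Tuple


def map_record(line: str) -> Iterator[Tuple[str, int]]:
    """Parse one 'order_id name amount' record and emit (name, amount).

    A mapper may emit zero or more records per input record; here a
    malformed line produces no output at all.
    """
    fields = line.split()
    if len(fields) != 3:
        return
    _order_id, name, amount = fields
    try:
        value = int(amount)
    except ValueError:
        return
    yield name, value


if __name__ == "__main__":
    for record in ("0 Alice 3625", "1 Bob 5174", "not a valid record here"):
        print(record, "->", list(map_record(record)))
```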

Shuffle and Sort

- Hadoop automatically sorts and merges output from all map tasks
  - This intermediate process is known as the shuffle and sort
  - The result is supplied to reduce tasks

Map Task #1 Output: Alice 3625, Bob 5174
Map Task #2 Output: Alice 893, Alice 2139
Map Task #3 Output: Diana 3581, Carlos 1039
Map Task #4 Output: Bob 4823, Alice 5834
Map Task #5 Output: Carlos 392, Diana 1804

Input to Reduce Task #1: Alice 3625, Alice 893, Alice 2139, Alice 5834, Carlos 1039, Carlos 392
Input to Reduce Task #2: Bob 5174, Bob 4823, Diana 3581, Diana 1804
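
As a rough illustration of the shuffle and sort, the sketch below merges the output of the five map tasks and groups the values by key, in key order. It runs in a single process and does not model how keys are assigned to particular reduce tasks. The function name is hypothetical, and the order of values within a key should not be relied on in a real cluster.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def shuffle_and_sort(
    map_outputs: Iterable[Iterable[Tuple[str, int]]]
) -> List[Tuple[str, List[int]]]:
    """Merge the output of all map tasks and group values by key, sorted by key."""
    grouped = defaultdict(list)
    for task_output in map_outputs:
        for key, value in task_output:
            grouped[key].append(value)
    return sorted(grouped.items())


if __name__ == "__main__":
    map_outputs = [
        [("Alice", 3625), ("Bob", 5174)],     # Map Task #1
        [("Alice", 893), ("Alice", 2139)],    # Map Task #2
        [("Diana", 3581), ("Carlos", 1039)],  # Map Task #3
        [("Bob", 4823), ("Alice", 5834)],     # Map Task #4
        [("Carlos", 392), ("Diana", 1804)],   # Map Task #5
    ]
    for name, amounts in shuffle_and_sort(map_outputs):
        print(name, amounts)
```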

Explanation of Reduce Function

- Reducer input comes from the shuffle and sort process
  - As with map, the reduce function receives one record at a time
  - A given reducer receives all records for a given key
  - For each input record, reduce can emit zero or more output records
- Our reduce function simply sums total per person
  - And emits employee name (key) and total (value) as output

Input to Reduce Task #1: Alice 3625, Alice 893, Alice 2139, Alice 5834, Carlos 1039, Carlos 392
Output of Reduce Task #1: Alice 12491, Carlos 1431

Input to Reduce Task #2: Bob 5174, Bob 4823, Diana 3581, Diana 1804
Output of Reduce Task #2: Bob 9997, Diana 5385
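
The reduce step can be sketched the same way. The version below assumes the function is called once per key with all of that key's values, and the name reduce_sales is hypothetical. It is an illustration in plain Python rather than Hadoop API code.

```python
from typing import Iterable, Iterator, Tuple


def reduce_sales(name: str, amounts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Sum all sale amounts for one employee and emit (name, total)."""
    yield name, sum(amounts)


if __name__ == "__main__":
    # Input to Reduce Task #1, grouped by key.
    reduce_task_1 = [
        ("Alice", [3625, 893, 2139, 5834]),
        ("Carlos", [1039, 392]),
    ]
    for name, amounts in reduce_task_1:
        for key, total in reduce_sales(name, amounts):
            print(key, total)
```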

Putting it All Together

- Here's the data flow for the entire MapReduce job

[Figure: the job input records pass through the map tasks, then the shuffle and sort, then the reduce tasks, producing the job output: Alice 12491, Carlos 1431, Bob 9997, Diana 5385]

MapReduce Architecture

- MapReduce version 1 and version 2 (YARN)
  - Similar master/slave architecture
  - Details differ slightly
- Master nodes
  - Run master daemons to accept jobs, and monitor and distribute work
- Slave nodes
  - Run slave daemons to start tasks
  - Do the actual work
  - Report status back to master daemons
- HDFS and MapReduce are collocated
  - Slave nodes run both HDFS and MR slave daemons on the same machines

MapReduce Version 1 Architecture

- MRv1 master daemon: JobTracker
  - Divides jobs into individual tasks
  - Assigns/monitors tasks on slave nodes
- MRv1 slave daemon: TaskTracker
  - Starts tasks
  - Reports status back to JobTracker

MapReduce Version 2 Architecture

- MRv2 uses the YARN cluster management framework
- MRv2 master daemon: ResourceManager
  - Allocates cluster resources for a job
  - Starts an Application Master for a job
- MRv2 slave daemons:
  - ApplicationMaster
    - One per application
    - Divides jobs into individual tasks
    - Assigns tasks to NodeManagers
  - NodeManager
    - Runs on all slave nodes
    - Starts and monitors the actual processing task

Chapter Topics: The Hadoop Ecosystem

The Hadoop Ecosystem

- Many related tools integrate with Hadoop
  - Data analysis
  - Database integration
  - Workflow management
- These are not considered core Hadoop
  - Rather, they are part of the Hadoop ecosystem
  - Many are also open source Apache projects

Apache Pig

- Apache Pig builds on Hadoop to offer high-level data processing
  - This is an alternative to writing low-level MapReduce code
  - Pig is especially good at joining and transforming data

people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

- The Pig interpreter runs on the client machine
  - Turns Pig Latin scripts into MapReduce jobs
  - Submits those jobs to the cluster
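
For readers who have not seen Pig Latin yet, the sketch below computes the same result as the script above in plain Python: total order cost per customer, joined to the customer's name. It only mirrors the logic and says nothing about how Pig actually executes the script. The function name and sample rows are hypothetical, and customer IDs are assumed to be unique.

```python
from collections import defaultdict


def total_cost_per_customer(people, orders):
    """people: iterable of (cust_id, name); orders: iterable of (ord_id, cust_id, cost).

    Returns rows shaped like the joined relation: (group, t, cust_id, name).
    """
    # GROUP orders BY cust_id, then SUM(orders.cost) for each group
    totals = defaultdict(int)
    for _ord_id, cust_id, cost in orders:
        totals[cust_id] += cost

    # JOIN totals BY group, people BY cust_id (inner join: unmatched rows are dropped)
    names = dict(people)
    return [
        (cust_id, total, cust_id, names[cust_id])
        for cust_id, total in sorted(totals.items())
        if cust_id in names
    ]


if __name__ == "__main__":
    people = [(1, "Alice"), (2, "Bob")]                        # made-up sample data
    orders = [(10, 1, 5), (11, 1, 7), (12, 2, 3), (13, 3, 9)]  # made-up sample data
    for row in total_cost_per_customer(people, orders):
        print(row)
```

The rows are sorted here only to make the output repeatable; the Pig script itself does not request any ordering.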

Apache Hive

- Hive is another abstraction on top of MapReduce
  - Like Pig, it also reduces development time
  - Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders
ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC;

- The Hive interpreter runs on a client machine
  - Turns HiveQL queries into MapReduce jobs
  - Submits those jobs to the cluster

Apache HBase

- HBase is the Hadoop database
- Can store massive amounts of data
  - Gigabytes, terabytes, and even petabytes of data in a table
  - Tables can have many thousands of columns
- Scales to provide very high write throughput
  - Hundreds of thousands of inserts per second
- Fairly primitive when compared to RDBMS
  - NoSQL: There is no high-level query language
  - Use API to scan / get / put values based on keys

Cloudera Impala

- Massively parallel SQL engine which runs on a Hadoop cluster
  - Inspired by Google's Dremel project
  - Can query data stored in HDFS or HBase tables
- High performance
  - Typically at least 10 times faster than Hive or MapReduce
  - High-level query language (subset of SQL-92)
- Impala is 100% Apache-licensed open source

Apache Sqoop

- Sqoop exchanges data between an RDBMS and Hadoop
- It can import all tables, a single table, or a portion of a table into HDFS
  - Does this very efficiently via a Map-only MapReduce job
  - Result is a directory in HDFS containing comma-delimited text files
- Sqoop can also export data from HDFS back to the database

[Figure: data moving between a database and a Hadoop cluster]

Importing Tables with Sqoop

- This example imports the customers table from a MySQL database
  - Will create /mydata/customers directory in HDFS
  - Directory will contain comma-delimited text files

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table customers

- Adding the --direct option may offer better performance
  - Uses database-specific tools instead of Java
  - This option is not compatible with all databases
- Cloudera offers high-performance custom connectors for many databases

Importing An Entire Database with Sqoop

- Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --fields-terminated-by '\t' \
    --warehouse-dir /mydata

Importing Partial Tables with Sqoop

- Import only specified columns from products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table products \
    --columns "prod_id,name,price"

- Import only matching rows from products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table products \
    --where "price >= 1000"

Incremental Imports with Sqoop

- What if new records are added to the database?
  - Could re-import all records, but this is inefficient
- Sqoop's incremental append mode imports only new records
  - Based on value of last record in specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821
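
The selection rule behind append mode can be illustrated with a short sketch: only rows whose check column is greater than the last imported value are picked up, and the highest value seen becomes the starting point for the next run. This is a conceptual model in plain Python, not Sqoop's implementation, and the function name and sample rows are hypothetical.

```python
def rows_to_append(rows, check_column, last_value):
    """Return (new_rows, new_last_value) for an append-style incremental import.

    new_rows are the rows whose check_column value is greater than last_value.
    new_last_value is the highest value seen, or last_value if nothing is new.
    """
    new_rows = [row for row in rows if row[check_column] > last_value]
    new_last_value = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last_value


if __name__ == "__main__":
    orders = [  # made-up sample data
        {"order_id": 6713820},
        {"order_id": 6713821},
        {"order_id": 6713822},
        {"order_id": 6713825},
    ]
    new_rows, next_last_value = rows_to_append(orders, "order_id", 6713821)
    print(len(new_rows), "new rows; next --last-value would be", next_last_value)
```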

Handling Modifications with Incremental Imports

- What if existing records are also modified in the database?
  - Incremental append mode doesn't handle this
- Sqoop's lastmodified incremental mode adds and updates records
  - Caveat: You must maintain a timestamp column in your table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table shipments \
    --incremental lastmodified \
    --check-column last_update_date \
    --last-value "2013-06-12 03:15:59"
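
The effect of "adds and updates" can be pictured as an upsert driven by the timestamp column. The sketch below is a conceptual model only, not Sqoop's implementation. The key column shipment_id, the status field, and the sample rows are hypothetical. Only last_update_date and the last value come from the command above.

```python
from datetime import datetime


def apply_modified_rows(existing, source_rows, key, check_column, last_value):
    """Add or replace rows whose check_column timestamp is newer than last_value."""
    merged = {row[key]: row for row in existing}
    for row in source_rows:
        if row[check_column] > last_value:
            merged[row[key]] = row  # replaces an older version or adds a new row
    return [merged[k] for k in sorted(merged)]


if __name__ == "__main__":
    last_value = datetime(2013, 6, 12, 3, 15, 59)
    existing = [
        {"shipment_id": 1, "status": "packed",
         "last_update_date": datetime(2013, 6, 11, 9, 0, 0)},
    ]
    source = [
        {"shipment_id": 1, "status": "shipped",
         "last_update_date": datetime(2013, 6, 12, 8, 30, 0)},
        {"shipment_id": 2, "status": "packed",
         "last_update_date": datetime(2013, 6, 12, 10, 0, 0)},
    ]
    for row in apply_modified_rows(existing, source, "shipment_id",
                                   "last_update_date", last_value):
        print(row["shipment_id"], row["status"])
```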

Exporting Data from Hadoop to RDBMS with Sqoop

- We've seen several ways to pull records from an RDBMS into Hadoop
  - It is sometimes also helpful to push data in Hadoop back to an RDBMS
- Sqoop supports this via export

$ sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --export-dir /mydata/recommender_output \
    --table product_recommendations

Apache Flume

- Flume imports data into HDFS as it is being generated by various sources

[Figure: log files, UNIX syslog, program output, custom sources, and many more sources feeding into a Hadoop cluster]

Recap: Data Center Integration

[Figure: a Hadoop cluster at the center, connected to web/app servers via Flume, to a file server via hadoop fs, and to a relational database (OLTP) and a data warehouse (OLAP) via Sqoop]

Apache Oozie

- Oozie allows developers to manage processing workflows
  - It coordinates execution and control of individual jobs
- Oozie supports many workflow actions, including
  - Executing MapReduce jobs
  - Running Pig or Hive scripts
  - Executing standard Java or shell programs
  - Manipulating data via HDFS commands
  - Running remote commands with SSH
  - Sending e-mail messages

Chapter Topics: Exercise Scenario Explanation

Hands-On Exercises: Scenario Explanation

- Hands-On Exercises throughout the course will reinforce the topics being discussed
  - Exercises simulate the kind of tasks often performed using the tools you will learn about in class
  - Most exercises depend on data generated in earlier exercises
- Scenario: Dualcore Inc. is a leading electronics retailer
  - More than 1,000 brick-and-mortar stores
  - Dualcore also has a thriving e-commerce Web site
- Dualcore has hired you to help find value in their data
  - You will process and analyze data from internal and external sources
  - Identify opportunities to increase revenue
  - Find new ways to reduce costs
  - Help other departments achieve their goals

Chapter Topics: Hands-On Exercise: Data Ingest with Hadoop Tools

About the Training Virtual Machine

- During this course, you will perform numerous hands-on exercises using the Cloudera Training Virtual Machine (VM)
- The VM has Hadoop installed in pseudo-distributed mode
  - Simply a cluster comprised of a single node
  - Typically used for testing code before deploying to a large cluster

Hands-On Exercise: Data Ingest with Hadoop Tools

- In this Hands-On Exercise, you'll gain practice adding data from the local filesystem and a relational database server to HDFS
  - You will analyze this data in subsequent exercises
- Please refer to the Hands-On Exercise Manual for instructions

Chapter Topics: Conclusion

Essential Points

- We are generating more data, and faster, than ever before
- Most of this data maps poorly to structured relational tables
- The ability to store and process this data can yield valuable insight
- Hadoop offers scalable data storage (HDFS) and processing (MapReduce)
- There are lots of tools in the Hadoop ecosystem that help you to integrate Hadoop with other systems, manage complex jobs, and ease analysis

Introduction to Pig
Chapter 3

Introduction to Pig

In this chapter, you will learn
- The key features Pig offers
- How organizations use Pig for data processing and analysis
- How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Apache Pig Overview

- Apache Pig is a platform for data analysis and processing on Hadoop
  - It offers an alternative to writing MapReduce code directly
- Originally developed as a research project at Yahoo
  - Goals: flexibility, productivity, and maintainability
  - Now an open-source Apache project

The Anatomy of Pig

- Main components of Pig
  - The data flow language (Pig Latin)
  - The interactive shell where you can type Pig Latin statements (Grunt)
  - The Pig interpreter and execution engine

Pig Latin script:

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

The Pig interpreter and execution engine turn the script into MapReduce jobs:

- Preprocess and parse Pig Latin
- Check data types
- Make optimizations
- Plan execution
- Generate MapReduce jobs
- Submit job(s) to Hadoop
- Monitor progress

Where to Get Pig

- CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  - A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  - Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  - Simple installation
  - 100% free and open source
- Installation is outside the scope of this course
  - Cloudera offers a training course for system administrators, Cloudera Administrator Training for Apache Hadoop

Pig Features

- Pig is an alternative to writing low-level MapReduce code
- Many features enable sophisticated analysis and processing
  - HDFS manipulation
  - UNIX shell commands
  - Relational operations
  - Positional references for fields
  - Common mathematical functions
  - Support for custom functions and data formats
  - Complex data structures

How Are Organizations Using Pig?

- Many organizations use Pig for data analysis
  - Finding relevant records in a massive data set
  - Querying multiple data sets
  - Calculating values from input data
- Pig is also frequently used for data processing
  - Reorganizing an existing data set
  - Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

- Pig can help you extract valuable information from Web server log files

Web Server Log Data:
...


10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
...

[Figure: a "Process Logs" step turns the Web server log data into clickstream data for user sessions. Example output, "Recent Activity for John Smith": on May 3, 2013: Search for 'Widget', Widget Results, Details for Widget X, Order Widget X; on May 12, 2013: Track Order, Contact Us, Send Complaint.]

Use Case: Data Sampling

- Sampling can help you explore a representative portion of a large data set
  - Allows you to examine this portion with tools that do not scale well
  - Supports faster iterations during development of analysis jobs

[Figure: random sampling reduces a 100 TB data set to a 50 MB sample]
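
The idea of random sampling can be illustrated with a small sketch. This is plain Python rather than Pig Latin, and the function name `sample_records`, the sampling fraction, and the seed are hypothetical choices made only for illustration. Each record is kept independently with the given probability, so the sample size is only approximately `fraction` times the input size.

```python
import random

def sample_records(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# Hypothetical data set: 100,000 numbered records.
records = list(range(100_000))
sample = sample_records(records, fraction=0.01, seed=42)

# The sample is a much smaller subset of the original records.
print(len(sample), "of", len(records), "records kept")
```

Because the sample is small, it can be inspected with tools that would not cope with the full data set.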
Use Case: ETL Processing

- Pig is also widely used for Extract, Transform, and Load (ETL) processing

[Figure: data from Operations, Accounting, and the Call Center passes through Pig jobs running on a Hadoop cluster (validate data, fix errors, remove duplicates, encode values) and is loaded into a data warehouse]

Using Pig Interactively

- You can use Pig interactively, via the Grunt shell
  - Pig interprets each Pig Latin statement as you type it
  - Execution is delayed until output is required
  - Very useful for ad hoc data inspection
- Example of how to start, use, and exit Grunt

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

- You can also execute a Pig Latin statement from the UNIX shell via the -e option

Interacting with HDFS

- You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX

- The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls; -- lists HDFS files
grunt> sh ls; -- lists local files

Running Pig Scripts

- A Pig script is simply Pig Latin code stored in a text file
  - By convention, these files have the .pig extension
- You can run a Pig script from within the Grunt shell via the run command
  - This is useful for automation and batch execution

grunt> run salesreport.pig;

- It is common to run a Pig script directly from the UNIX shell

$ pig salesreport.pig

MapReduce and Local Modes

- As described earlier, Pig turns Pig Latin into MapReduce jobs
  - Pig submits those jobs for execution on the Hadoop cluster
- It is also possible to run Pig in local mode using the -x flag
  - This runs MapReduce jobs on the local machine instead of the cluster
  - Local mode uses the local filesystem instead of HDFS
  - Can be helpful for testing before deploying a job to production

$ pig -x local                   # interactive

$ pig -x local salesreport.pig   # batch

Client-Side Log Files

- If a job fails, Pig may produce a log file to explain why
  - These log files are typically produced in your current working directory
  - On the local (client) machine

Essential Points

- Pig offers an alternative to writing MapReduce code directly
  - Pig interprets Pig Latin code in order to create MapReduce jobs
  - It then submits these MapReduce jobs to the Hadoop cluster
- You can execute Pig Latin code interactively through Grunt
  - Pig delays job execution until output is required
- It is also common to store Pig Latin code in a script for batch execution
  - Allows for automation and code reuse

Basic Data Analysis with Pig
Chapter 4

Basic Data Analysis with Pig

In this chapter, you will learn
- The basic syntax of Pig Latin
- How to load and store data using Pig
- Which simple data types Pig uses to represent data
- How to sort and filter data in Pig
- How to use many of Pig's built-in functions for data processing

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Loading Data
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

Pig Latin Overview

- Pig Latin is a data flow language
  - The flow of data is expressed as a sequence of statements
- The following is a simple Pig Latin script to load, filter, and store data

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Keywords

- The Pig Latin keywords in this script are LOAD, AS, FILTER, BY, STORE, and INTO
  - Keywords are reserved: you cannot use them to name things

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Identifiers (1)

- Identifiers are the names assigned to fields and other data structures
  - In this script: allsales, bigsales, name, and price

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Identifiers (2)

- Identifiers must conform to Pig's naming rules
- An identifier must always begin with a letter
  - This may only be followed by letters, numbers, or underscores

Valid:     x    q1    q1_2013    MyData
Invalid:   4    price$    profit%    _sale
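
The naming rule above can be written down as a regular expression. The following is a plain Python sketch, not part of Pig; the name `is_valid_identifier` is hypothetical, and the pattern encodes only the rule stated on this slide (it does not check for reserved keywords).

```python
import re

# Assumed rule, taken from the slide: a letter first,
# followed only by letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_identifier(name):
    """Return True if `name` follows the naming rule stated above."""
    return IDENTIFIER.fullmatch(name) is not None

# The examples from the slide: the first four pass, the last four fail.
for name in ["x", "q1", "q1_2013", "MyData", "4", "price$", "profit%", "_sale"]:
    print(name, is_valid_identifier(name))
```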
Pig Latin Grammar: Comments

- Pig Latin supports two types of comments
  - Single-line comments begin with --
  - Multi-line comments begin with /* and end with */

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Case-Sensitivity in Pig Latin

- Whether case is significant in Pig Latin depends on context
- Keywords (LOAD, AS, FILTER, BY, STORE, and INTO in this script) are not case-sensitive
  - Neither are operators (such as AND, OR, or IS NULL)
- Identifiers and paths (allsales, bigsales, name, price, 'sales', and 'myreport' in this script) are case-sensitive
  - So are function names (such as SUM or COUNT) and constants

allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999;

STORE bigsales INTO 'myreport';

Common Operators in Pig Latin

- Many commonly-used operators in Pig Latin are familiar to SQL users
  - Notable difference: Pig Latin uses == and != for comparison

Arithmetic   Comparison   Null          Boolean
+            ==           IS NULL       AND
-            !=           IS NOT NULL   OR
*            <                          NOT
/            >
%            <=
             >=

Basic Data Loading in Pig

- Pig's default loading function is called PigStorage
  - The name of the function is implicit when calling LOAD
  - PigStorage assumes text format with tab-separated columns
- Consider the following file in HDFS called sales
  - The two fields are separated by tab characters

Alice   2999
Bob     3625
Carlos  2764

- This example loads data from the above file

allsales = LOAD 'sales' AS (name, price);
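
To make the tab-separated format concrete, here is a small sketch of how one line of such a file breaks into fields. It is plain Python for illustration only and is not how PigStorage is implemented; the helper name `parse_line` is hypothetical. Note that, without declared types, the fields stay as raw text.

```python
def parse_line(line, delimiter="\t"):
    """Split one line of text into a tuple of fields."""
    return tuple(line.rstrip("\n").split(delimiter))

# Assumed contents of the 'sales' file: two tab-separated fields per line.
sales_text = "Alice\t2999\nBob\t3625\nCarlos\t2764\n"

allsales = [parse_line(line) for line in sales_text.splitlines()]

for row in allsales:
    name, price = row   # access by name, as with AS (name, price)
    print(name, price)
```

Passing a different `delimiter` (for example `","` or `"|"`) mirrors the idea of giving PigStorage an alternate delimiter.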
Data Sources: Files and Directories

- The previous example loads data from a file named sales

allsales = LOAD 'sales' AS (name, price);

- Since this is not an absolute path, it is relative to your home directory
  - Your home directory in HDFS is typically /user/youruserid/
  - You can also specify an absolute path (e.g., /dept/sales/2012/q4)
- The path can also refer to a directory
  - In this case, Pig will recursively load all files in that directory
  - File patterns (globs) are also supported

allsales = LOAD 'sales_200[5-9]' AS (name, price);

Specifying Column Names During Load

- The previous example also assigns names to each column

allsales = LOAD 'sales' AS (name, price);

- Assigning column names is not required
  - Omitting them can be useful when exploring a new dataset
  - Refer to fields by position ($0 is first, $1 is second, $53 is 54th, etc.)

allsales = LOAD 'sales';

Using Alternate Column Delimiters

- You can specify an alternate delimiter as an argument to PigStorage
- This example shows how to load comma-delimited data
  - Note that this is a single statement

allsales = LOAD 'sales.csv' USING PigStorage(',') AS
             (name, price);

- Or to load pipe-delimited data without specifying column names

allsales = LOAD 'sales.txt' USING PigStorage('|');

Simple Data Types in Pig

- Pig supports several basic data types
  - Similar to those in most databases and programming languages
- Pig treats fields of unspecified type as an array of bytes
  - Called the bytearray type in Pig

allsales = LOAD 'sales' AS (name, price);

List of Simple Data Types

- There are eight data types in Pig for simple values

Name        Description                 Example Value
int         Whole numbers               2013
long        Large whole numbers         5,365,214,142L
float       Decimals                    3.14159F
double      Very precise decimals       3.14159265358979323846
boolean*    True or false values        true
datetime*   Date and time               2013-05-30T14:52:39.000-04:00
chararray   Text strings                Alice
bytearray   Raw bytes (e.g. any data)   N/A

* Not available in older versions of Pig

Specifying Data Types in Pig

- Pig will do its best to determine data types based on context
  - For example, you can calculate sales commission as price * 0.1
  - In this case, Pig will assume that this value is of type double
- However, it is better to specify data types explicitly when possible
  - Helps with error checking and optimizations
  - Easiest to do this upon load using the format fieldname:type

allsales = LOAD 'sales' AS (name:chararray, price:int);

- Choosing the right data type is important to avoid loss of precision
- Important: Avoid using floating point numbers to represent money!
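
The warning about money applies to binary floating point in general, not only to Pig. A minimal Python sketch, using hypothetical prices of 0.10 each, shows why storing amounts as whole cents in an integer is safer:

```python
# Three hypothetical prices of 0.10 each, stored as floating point dollars.
float_total = 0.10 + 0.10 + 0.10

# The same prices stored as whole cents in an integer.
cents_total = 10 + 10 + 10

# Binary floating point cannot represent 0.10 exactly, so the sum drifts.
print(float_total == 0.30)

# Integer arithmetic is exact.
print(cents_total == 30)
```

This is consistent with the sample data in this chapter, where prices such as 2999 are expressed in cents.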
HCatalog

- You have seen how to specify names, types, and paths during LOAD
- A new project called HCatalog can store this information permanently
  - So it need not be specified each time
  - Simplifies sharing of metadata between Pig, Hive, and MapReduce
- However, HCatalog is still in early development and is not yet widely used
  - It was first added to CDH in release 4.2.0 (February 2013)
- You can load data easily after first setting it up with HCatalog

allsales = LOAD 'sales'
             USING org.apache.hcatalog.pig.HCatLoader();

How Pig Handles Invalid Data

- When encountering invalid data, Pig substitutes NULL for the value
  - For example, an int field containing the value Q4
- The IS NULL and IS NOT NULL operators test for null values
  - Note that NULL is not the same as the empty string ''
- You can use these operators to filter out bad records

hasprices = FILTER Records BY price IS NOT NULL;
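
The behavior described above can be mimicked in a few lines of plain Python, using None in place of NULL. This is only a sketch of the idea: the helper `to_int_or_none` and the sample records are hypothetical, and Python's integer parsing differs from Pig's in small details such as whitespace handling.

```python
def to_int_or_none(text):
    """Return the integer value of `text`, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None

# Hypothetical raw records; 'Q4' is not a valid price.
raw = [("Alice", "2999"), ("Bob", "Q4"), ("Carlos", "2764")]

# Invalid values become None instead of stopping the whole job.
records = [(name, to_int_or_none(price)) for name, price in raw]

# Equivalent in spirit to filtering with IS NOT NULL.
hasprices = [r for r in records if r[1] is not None]

print(records)
print(hasprices)
```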
Key Data Concepts in Pig

- Relational databases have tables, rows, columns, and fields
- We will use the following data to illustrate Pig's equivalents
  - Assume this data was loaded from a tab-delimited text file as before

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

Pig Data Concepts: Fields

- A single element of data is called a field
  - It corresponds to one of the eight data types seen earlier
  - In the sample data, Alice's price (2999) is one field

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

Pig Data Concepts: Tuples

- A collection of values is called a tuple
  - Fields within a tuple are ordered, but need not all be of the same type
  - In the sample data, each row, such as (Alice, 2999, us), is a tuple

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

Pig Data Concepts: Bags

- A collection of tuples is called a bag
- Tuples within a bag are unordered by default
  - The field count and types may vary between tuples in a bag
  - In the sample data, the six rows together form a bag

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

Pig Data Concepts: Relations

- A relation is simply a bag with an assigned name (alias)
  - Most Pig Latin statements create a new relation
- A typical script loads one or more datasets into relations
  - Processing creates new relations instead of modifying existing ones
  - The final result is usually also a relation, stored as output

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
STORE bigsales INTO 'myreport';

Data Output in Pig

- The command used to handle output depends on its destination
  - DUMP: sends output to the screen
  - STORE: sends output to disk (HDFS)
- Example of DUMP output, using data from the file shown earlier
  - The parentheses and commas indicate tuples with multiple fields

(Alice,2999,us)
(Bob,3625,ca)
(Carlos,2764,mx)
(Dieter,1749,de)
(Étienne,2368,fr)
(Fredo,5637,it)

Storing Data with Pig

- The STORE command is used to store data to HDFS
  - Similar to LOAD, but writes data instead of reading it
  - The output path is the name of a directory
  - The directory must not yet exist
- As with LOAD, the use of PigStorage is implicit
  - The field delimiter also has a default value (tab)

STORE bigsales INTO 'myreport';

  - You may also specify an alternate delimiter

STORE bigsales INTO 'myreport' USING PigStorage(',');

Viewing the Schema with DESCRIBE

- The DESCRIBE command shows the structure of the data, including names and types
- The following Grunt session shows an example

grunt> allsales = LOAD 'sales' AS (name:chararray,
                    price:int);
grunt> DESCRIBE allsales;
allsales: {name: chararray,price: int}

Filtering in Pig Latin

- The FILTER keyword extracts tuples matching the specified criteria

bigsales = FILTER allsales BY price > 3000;

allsales:

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

bigsales (price > 3000):

name      price   country
Bob       3625    ca
Fredo     5637    it

Filtering by Multiple Criteria

- You can combine criteria with AND and OR

somesales = FILTER allsales BY name == 'Dieter' OR (price >
              3500 AND price < 4000);

allsales:

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

somesales (name is Dieter, or price is greater than 3500 and less than 4000):

name      price   country
Bob       3625    ca
Dieter    1749    de
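
The combined condition can be checked against the sample data with a short sketch. This is plain Python, shown only to make the logic of the OR and the parenthesized AND explicit; it is not Pig code.

```python
# The sample data used throughout this chapter.
allsales = [
    ("Alice", 2999, "us"),
    ("Bob", 3625, "ca"),
    ("Carlos", 2764, "mx"),
    ("Dieter", 1749, "de"),
    ("Étienne", 2368, "fr"),
    ("Fredo", 5637, "it"),
]

# Keep a record if the name is Dieter, or if the price lies
# strictly between 3500 and 4000.
somesales = [
    (name, price, country)
    for name, price, country in allsales
    if name == "Dieter" or (price > 3500 and price < 4000)
]

print(somesales)
```

Bob qualifies through the price range and Dieter through the name test, even though Dieter's price is outside the range.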
Aside: String Comparisons in Pig Latin

- The == operator is supported for any type in Pig Latin
  - This operator is used for exact comparisons

alices = FILTER allsales BY name == 'Alice';

- Pig Latin supports pattern matching through Java's regular expressions
  - This is done with the MATCHES operator

a_names = FILTER allsales BY name MATCHES 'A.*';

spammers = FILTER senders BY email_addr
             MATCHES '.*@example\\.com$';
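
One point worth noting is that the pattern has to cover the entire value, which is why the examples above use `.*` to absorb the rest of the string. The sketch below imitates that whole-string behavior in plain Python with `re.fullmatch`; the helper name `matches` and the test strings are hypothetical, and Python's regular expression dialect is not identical to Java's, although these simple patterns behave the same in both.

```python
import re

def matches(value, pattern):
    """True only if the whole of `value` matches `pattern`."""
    return re.fullmatch(pattern, value) is not None

starts_with_a = matches("Alice", r"A.*")      # pattern covers the whole name
only_first_char = matches("Alice", r"A")      # pattern covers just one character
spam_address = matches("bob@example.com", r".*@example\.com$")
other_address = matches("bob@examplexcom", r".*@example\.com$")

print(starts_with_a, only_first_char, spam_address, other_address)
```

The escaped dot matters: without the backslash, `.` would match any character, and the last address would be accepted as well.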
Field Selection in Pig Latin

- Filtering extracts rows, but sometimes we need to extract columns
  - This is done in Pig Latin using the FOREACH and GENERATE keywords

twofields = FOREACH allsales GENERATE amount, trans_id;

allsales:

salesperson   amount   trans_id
Alice         2999     107546
Bob           3625     107547
Carlos        2764     107548
Dieter        1749     107549
Étienne       2368     107550
Fredo         5637     107550

twofields:

amount   trans_id
2999     107546
3625     107547
2764     107548
1749     107549
2368     107550
5637     107550

Generating New Fields in Pig Latin

- The FOREACH and GENERATE keywords can also be used to create fields
  - For example, you could create a new field based on price

t = FOREACH allsales GENERATE price * 0.07;

- It is possible to name such fields

t = FOREACH allsales GENERATE price * 0.07 AS tax;

- And you can also specify the data type

t = FOREACH allsales GENERATE price * 0.07 AS tax:float;

Eliminating Duplicates

- DISTINCT eliminates duplicate records in a bag
  - All fields must be equal to be considered a duplicate

unique_records = DISTINCT all_alices;

all_alices:

firstname   lastname   country
Alice       Smith      us
Alice       Jones      us
Alice       Brown      us
Alice       Brown      us
Alice       Brown      ca

unique_records:

firstname   lastname   country
Alice       Smith      us
Alice       Jones      us
Alice       Brown      us
Alice       Brown      ca
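
The rule that all fields must be equal can be illustrated by treating each record as a whole tuple. This plain Python sketch uses `dict.fromkeys`, which keeps the first occurrence of each distinct tuple; it happens to preserve input order, which is a property of this sketch and not something the slide promises for Pig.

```python
all_alices = [
    ("Alice", "Smith", "us"),
    ("Alice", "Jones", "us"),
    ("Alice", "Brown", "us"),
    ("Alice", "Brown", "us"),   # exact duplicate of the previous record
    ("Alice", "Brown", "ca"),   # differs in country, so it is kept
]

# Two records count as duplicates only when every field is equal.
unique_records = list(dict.fromkeys(all_alices))

for record in unique_records:
    print(record)
```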
Controlling Sort Order

- Use ORDER...BY to sort the records in a bag in ascending order
  - Add DESC to sort in descending order instead
  - Take care to specify a schema: data type affects how data is sorted!

sortedsales = ORDER allsales BY country DESC;

allsales:

name      price   country
Alice     29.99   us
Bob       36.25   ca
Carlos    27.64   mx
Dieter    17.49   de
Étienne   23.68   fr
Fredo     56.37   it

sortedsales:

name      price   country
Alice     29.99   us
Carlos    27.64   mx
Fredo     56.37   it
Étienne   23.68   fr
Dieter    17.49   de
Bob       36.25   ca
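
Why the data type affects sorting can be seen by sorting the same values once as text and once as numbers. The values below are hypothetical, and the sketch is plain Python rather than Pig, but the difference between character-by-character and numeric comparison is the same idea.

```python
# Hypothetical prices read as text, for example when no schema is given.
prices_as_text = ["2999", "36", "100", "9"]

# The same prices converted to integers.
prices_as_int = [int(p) for p in prices_as_text]

text_order = sorted(prices_as_text)    # compared character by character
numeric_order = sorted(prices_as_int)  # compared by numeric value

print(text_order)
print(numeric_order)
```

As text, "100" sorts before "36" because the character "1" comes before "3"; as numbers, 36 comes first.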
Limiting Results

- As in SQL, you can use LIMIT to reduce the number of output records

somesales = LIMIT allsales 10;

- Beware! Record ordering is arbitrary unless specified with ORDER BY
  - Use ORDER BY and LIMIT together to find top-N results

sortedsales = ORDER allsales BY price DESC;
top_five = LIMIT sortedsales 5;
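
The sort-then-limit pattern for top-N results looks like this in a plain Python sketch over the chapter's sample data. It uses N = 3 because the sample has only six records; the variable name `top_three` is hypothetical.

```python
allsales = [
    ("Alice", 2999, "us"),
    ("Bob", 3625, "ca"),
    ("Carlos", 2764, "mx"),
    ("Dieter", 1749, "de"),
    ("Étienne", 2368, "fr"),
    ("Fredo", 5637, "it"),
]

# Step 1: sort by price, highest first.
sortedsales = sorted(allsales, key=lambda record: record[1], reverse=True)

# Step 2: keep only the first three records of the sorted result.
top_three = sortedsales[:3]

print(top_three)
```

Taking the first three records without sorting first would return whichever records happen to come first, not the three highest prices.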
Built-in Functions

- These are just a sampling of Pig's many built-in functions

Function Description                 Example Invocation      Input     Output
Convert to uppercase                 UPPER(country)          uk        UK
Remove leading/trailing spaces       TRIM(name)              " Bob "   Bob
Return a random number               RANDOM()                          0.48161326652569
Round to closest whole number        ROUND(price)            37.19     37
Return chars between two positions   SUBSTRING(name, 0, 2)   Alice     Al

- You can use these with the FOREACH...GENERATE keywords

rounded = FOREACH allsales GENERATE ROUND(price);

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#44%
Hands-On Exercise: Using Pig for ETL processing

- In this Hands-On Exercise, you will write Pig Latin code to perform basic ETL processing tasks on data related to Dualcore's online advertising campaigns
- Please refer to the Hands-On Exercise Manual for instructions

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#46%
Essential Points

- Pig Latin supports many of the same operations as SQL
  - Though Pig's approach is quite different
  - Pig Latin loads, transforms, and stores data in a series of steps
- The default delimiter for both input and output is the tab character
  - You can specify an alternate delimiter as an argument to PigStorage
- Specifying the names and types of fields is not required
  - But it can improve performance and readability of your code

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#48%
Processing Complex Data with Pig
Chapter 5

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#1%
Processing Complex Data with Pig

In this chapter, you will learn
- How Pig uses bags, tuples, and maps to represent complex data
- The techniques Pig provides for grouping and ungrouping data
- How to use aggregate functions in Pig Latin
- How to iterate through records in complex data structures

Chapter Topics

Processing Complex Data with Pig

- Storage Formats
- Complex/Nested Data Types
- Grouping
- Built-in Functions for Complex Data
- Iterating Grouped Data
- Hands-On Exercise: Analyzing Ad Campaign Data with Pig
- Conclusion

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#4%
Storage Formats

- We have seen that PigStorage loads and stores data
  - Uses a delimited text file format

allsales = LOAD 'sales' AS (name, price);

  - The default delimiter (tab) can be easily changed

allsales = LOAD 'sales' USING PigStorage(',') AS (name, price);

- Sometimes you need to load or store data in other formats

Other Supported Formats

- Here are a few of Pig's built-in functions for loading data in other formats
  - It is also possible to implement a custom loader by writing Java code

Name         Loads Data From
TextLoader   Text files (entire line as one field)
JsonLoader   Text files containing JSON-formatted data
BinStorage   Files containing binary data
HBaseLoader  HBase, a scalable NoSQL database built on Hadoop

Load and Store Functions

- Some functions load data, some store data, and some do both

Load Function  Equivalent Store Function
PigStorage     PigStorage
TextLoader     None
JsonLoader     JsonStorage
BinStorage     BinStorage
HBaseLoader    HBaseStorage

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#7%
Pig's Complex Data Types: Tuple and Bag

- We have already seen two of Pig's three complex data types
  - A tuple is a collection of values
  - A bag is a collection of tuples

trans_id  total  salesperson
107546    2999   Alice
107547    3625   Bob
107548    2764   Carlos
107549    1749   Dieter
107550    2368   Étienne
107551    5637   Fredo

Each row (for example, the record for 107548) is a tuple; the rows taken together form a bag

Pig's Complex Data Types: Map

- Pig also supports another complex type: Map
  - A map associates a chararray (key) with another data element (value)

trans_id  amount  salesperson  sales_details
107546    2498    Alice        date    12-02-2013
                               SKU     40155
                               store   MIA01
107547    3625    Bob          date    12-02-2013
                               SKU     3720
                               store   STL04
                               coupon  DEC13
107548    2764    Carlos       date    12-03-2013
                               SKU     76102
                               store   NYC15

Representing Complex Types in Pig

- It is important to know how to define and recognize these types in Pig

Type   Definition
Tuple  Comma-delimited list inside parentheses:
       ('107546', 2498, 'Alice')
Bag    Braces surround comma-delimited list of tuples:
       {('107546', 2498, 'Alice'), ('107547', 3625, 'Bob')}
Map    Brackets surround comma-delimited list of pairs; keys and values separated by #:
       ['store'#'MIA01','location'#'Coral Gables']

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#11%
Loading and Using Complex Types (1)

- Complex data types can be used in any Pig field
- The following example shows how a bag is stored in a text file
  - Example: Transaction ID, amount, items sold (a bag of tuples)

107550	2498	{('40120', 1999), ('37001', 499)}

Field 1, Field 2, and Field 3 are separated by TAB characters

- Here is the corresponding LOAD statement specifying the schema

details = LOAD 'salesdetail' AS (
    trans_id:chararray, amount:int,
    items_sold:bag{item:tuple(SKU:chararray, price:int)});

Loading and Using Complex Types (2)

- The following example shows how a map is stored in a text file
  - Example: Customer name, credit account details (map), year account opened

Eva	[creditlimit#5000,creditused#800]	2012

Field 1, Field 2, and Field 3 are separated by TAB characters

- Here is the corresponding LOAD statement specifying the schema

credit = LOAD 'customer_accounts' AS (
    name:chararray, account:map[], year:int);

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#13%
Referencing Map Data

- Consider a file with the following data

Bob	[salary#52000,age#52]

- And loaded with the following schema

details = LOAD 'data' AS (name:chararray, info:map[]);

- Here is the syntax for referencing data within the map

salaries = FOREACH details GENERATE info#'salary';
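
The `#` operator behaves like a key lookup. As a rough Python analogy (a dict stands in for the Pig map, and the `bonus` key is invented to show a missing key, which is assumed here to yield null):

```python
# One record from the example file: (name, info), with info modeled as a dict.
details = [("Bob", {"salary": 52000, "age": 52})]

# info#'salary' looks up the value stored under the key 'salary'.
salaries = [(info.get("salary"),) for _name, info in details]

# Assumption: a key that is absent from the map yields null (None here).
bonuses = [(info.get("bonus"),) for _name, info in details]

print(salaries)  # [(52000,)]
print(bonuses)   # [(None,)]
```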

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#14%
Grouping Records By a Field (1)

- Sometimes you need to group records by a given field
  - For example, so you can calculate commissions for each employee

Alice  729
Bob    3999
Alice  27999
Carol  32999
Carol  4999

- Use GROUP BY to do this in Pig Latin
  - The new relation has one record per unique value in the specified field

grunt> byname = GROUP sales BY name;

Grouping Records By a Field (2)

- The new relation always contains two fields

grunt> byname = GROUP sales BY name;
grunt> DESCRIBE byname;
byname: {group: chararray,sales: {(name: chararray,price: int)}}

- The first field is literally named group in all cases
  - Contains the value from the field specified in GROUP BY
- The second field is named after the relation specified in GROUP BY
  - It's a bag containing one tuple for each record that has the corresponding value

Grouping Records By a Field (3)

- The example below shows the data after grouping

Input Data (sales)
Alice  729
Bob    3999
Alice  27999
Carol  32999
Carol  4999

grunt> byname = GROUP sales BY name;
grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

In each output record, the first field is group and the second field is sales
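
The shape of the grouped relation may be easier to see in a general-purpose language. This Python sketch models it under simple assumptions: records are tuples, a bag is a list, and `group_by` is a hypothetical helper rather than anything Pig provides.

```python
sales = [("Alice", 729), ("Bob", 3999), ("Alice", 27999),
         ("Carol", 32999), ("Carol", 4999)]

def group_by(relation, key_index):
    """Return (group, bag) pairs: one per distinct key, each bag holding full records."""
    groups = {}
    for record in relation:
        groups.setdefault(record[key_index], []).append(record)
    return list(groups.items())

byname = group_by(sales, 0)
for group, bag in byname:
    print(group, bag)
```

Note that each bag keeps the complete original records, including the name that also appears as the group key.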

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#18%
Using GROUP BY to Aggregate Data

- Aggregate functions create one output value from multiple input values
  - For example, to calculate total sales by employee
  - Usually applied to grouped data

grunt> byname = GROUP sales BY name;
grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

- We can use the SUM function to calculate the total sales for each employee

grunt> totals = FOREACH byname GENERATE group, SUM(sales.price);
grunt> DUMP totals;
(Bob,3999)
(Alice,28728)
(Carol,37998)
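
The per-employee totals can be checked by hand. A short Python sketch of the same aggregation, starting from the grouped data shown in the DUMP output (lists stand in for bags):

```python
byname = [
    ("Bob", [("Bob", 3999)]),
    ("Alice", [("Alice", 729), ("Alice", 27999)]),
    ("Carol", [("Carol", 32999), ("Carol", 4999)]),
]

# sales.price projects the price out of every tuple in the bag; SUM adds them up.
totals = [(group, sum(price for _name, price in sales)) for group, sales in byname]

print(totals)  # [('Bob', 3999), ('Alice', 28728), ('Carol', 37998)]
```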

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#19%
Grouping Everything Into a Single Record

- We just saw that GROUP BY creates one record for each unique value
- GROUP ALL puts all data into one record

grunt> grouped = GROUP sales ALL;
grunt> DUMP grouped;
(all,{(Alice,729),(Bob,3999),(Alice,27999),(Carol,32999),(Carol,4999)})

Using GROUP ALL to Aggregate Data

- Use GROUP ALL when you need to aggregate one or more columns
  - For example, to calculate total sales for all employees

grunt> grouped = GROUP sales ALL;
grunt> DUMP grouped;
(all,{(Alice,729),(Bob,3999),(Alice,27999),(Carol,32999),(Carol,4999)})

- We can use the SUM function to calculate the total sales for all employees

grunt> totals = FOREACH grouped GENERATE SUM(sales.price);
grunt> DUMP totals;
(70725)
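
GROUP ALL can be pictured as a grouping with a single constant key. In this Python sketch, the one-record relation is modeled as a list holding one `("all", bag)` pair:

```python
sales = [("Alice", 729), ("Bob", 3999), ("Alice", 27999),
         ("Carol", 32999), ("Carol", 4999)]

# GROUP sales ALL: one record whose bag contains every input record.
grouped = [("all", list(sales))]

# SUM(sales.price) over that single bag gives one grand total.
totals = [(sum(price for _name, price in bag),) for _group, bag in grouped]

print(totals)  # [(70725,)]
```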

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#21%
Removing Nesting in Data

- Some operations in Pig, like grouping, produce nested data structures

grunt> byname = GROUP sales BY name;
grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

- Grouping can be useful to supply data to aggregate functions
- However, sometimes you want to work with a flat data structure
  - The FLATTEN operator removes a level of nesting in data

An Example of FLATTEN

- The following shows the nested data and what FLATTEN does to it

grunt> byname = GROUP sales BY name;
grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

grunt> flat = FOREACH byname GENERATE group, FLATTEN(sales.price);
grunt> DUMP flat;
(Bob,3999)
(Alice,729)
(Alice,27999)
(Carol,32999)
(Carol,4999)
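
Flattening a bag produces one output record per element of the bag, each paired with the other generated fields. A Python sketch of that expansion, using lists for bags:

```python
byname = [
    ("Bob", [("Bob", 3999)]),
    ("Alice", [("Alice", 729), ("Alice", 27999)]),
    ("Carol", [("Carol", 32999), ("Carol", 4999)]),
]

# GENERATE group, FLATTEN(sales.price): emit (group, price) once per tuple in the bag.
flat = [(group, price) for group, sales in byname for _name, price in sales]

for record in flat:
    print(record)
```

Three nested records become five flat ones, because the Alice and Carol bags each hold two tuples.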

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#23%
Pig's Built-In Aggregate Functions

- Pig has built-in support for other aggregate functions besides SUM
- Examples:
  - AVG: Calculates the average (mean) of all values
  - MIN: Returns the smallest value
  - MAX: Returns the largest value
- Pig has two built-in functions for counting records
  - COUNT: Returns the number of non-null elements in the bag
  - COUNT_STAR: Returns the number of all elements in the bag
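
The difference between the two counting functions only shows up when nulls are present. The Python sketch below models it on an invented bag, assuming that COUNT skips a tuple when its first field is null; the function names `count` and `count_star` are stand-ins for this illustration.

```python
# Hypothetical bag; the second tuple has a null (None) first field.
bag = [("Alice", 729), (None, 100), ("Carol", 4999)]

def count(bag):
    """Model of COUNT: skip tuples whose first field is null."""
    return sum(1 for record in bag if record[0] is not None)

def count_star(bag):
    """Model of COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

print(count(bag), count_star(bag))  # 2 3
```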

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#25%
Other Notable Built-in Functions

- Here are some other useful Pig functions
  - See the Pig documentation for a complete list

Function  Description
DIFF      Finds tuples that appear in only one of two supplied bags
IsEmpty   Used with FILTER to match bags or maps that contain no data
SIZE      Returns the size of the field (definition of size varies by data type)
TOKENIZE  Splits a text string (chararray) into a bag of individual words

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#26%
Record Iteration

- We have seen that FOREACH...GENERATE iterates through records
- The goal is to transform records to produce a new relation
  - Sometimes to select only certain columns

price_column_only = FOREACH sales GENERATE price;

  - Sometimes to create new columns

taxes = FOREACH sales GENERATE price * 0.07;

  - Sometimes to invoke a function on the data

totals = FOREACH grouped GENERATE SUM(sales.price);

Nesting the FOREACH Keyword

- A variation on FOREACH applies a set of operations to each record
  - This is often used to apply a series of transformations in a group
- This is called a nested FOREACH
  - Allows only relational operations (e.g., LIMIT, FILTER, ORDER BY)
  - GENERATE must be the last line in the block

Nested FOREACH Example (1)

- Our input data contains a list of employee job titles and corresponding salaries
- Goal: identify the three highest salaries within each title

Input Data
President  192000
Director   152500
Director   161000
Director   167000
Director   165000
Director   147000
Engineer   92300
Engineer   85000
Engineer   83000
Engineer   81650
Engineer   82100
Engineer   87300
Engineer   76000
Manager    87000
Manager    81000
Manager    75000
Manager    79000
Manager    67500

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#30%
Nested FOREACH Example (2)

- First load the data from the file
- Next, group employees by title
  - Assigned to new relation title_group

Input Data (excerpt)
President  192000
Director   152500
Director   161000
...
Engineer   92300
...
Manager    67500

employees = LOAD 'data' AS (title:chararray, salary:int);
title_group = GROUP employees BY title;

top_salaries = FOREACH title_group {
    sorted = ORDER employees BY salary DESC;
    highest_paid = LIMIT sorted 3;
    GENERATE group, highest_paid;
};

Nested FOREACH Example (3)

- The nested FOREACH iterates through every record in title_group (i.e., each job title)
  - It sorts the records in that group in descending order of salary
  - It then selects the top three
  - GENERATE outputs the title and salaries

employees = LOAD 'data' AS (title:chararray, salary:int);
title_group = GROUP employees BY title;

top_salaries = FOREACH title_group {
    sorted = ORDER employees BY salary DESC;
    highest_paid = LIMIT sorted 3;
    GENERATE group, highest_paid;
};

Nested FOREACH Example (4)

Code (LOAD statement removed for brevity)

title_group = GROUP employees BY title;

top_salaries = FOREACH title_group {
    sorted = ORDER employees BY salary DESC;
    highest_paid = LIMIT sorted 3;
    GENERATE group, highest_paid;
};

Output produced by DUMP top_salaries

(Director,{(Director,167000),(Director,165000),(Director,161000)})
(Engineer,{(Engineer,92300),(Engineer,87300),(Engineer,85000)})
(Manager,{(Manager,87000),(Manager,81000),(Manager,79000)})
(President,{(President,192000)})
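
The nested block amounts to "for each group: sort its bag, keep the first three, emit the result". A Python sketch of that logic over the full input data from the first example slide (dicts and lists stand in for Pig relations and bags):

```python
employees = [
    ("President", 192000),
    ("Director", 152500), ("Director", 161000), ("Director", 167000),
    ("Director", 165000), ("Director", 147000),
    ("Engineer", 92300), ("Engineer", 85000), ("Engineer", 83000),
    ("Engineer", 81650), ("Engineer", 82100), ("Engineer", 87300),
    ("Engineer", 76000),
    ("Manager", 87000), ("Manager", 81000), ("Manager", 75000),
    ("Manager", 79000), ("Manager", 67500),
]

# GROUP employees BY title
title_group = {}
for record in employees:
    title_group.setdefault(record[0], []).append(record)

# Nested FOREACH: the three inner steps run once per group.
top_salaries = []
for group, bag in title_group.items():
    sorted_bag = sorted(bag, key=lambda r: r[1], reverse=True)  # ORDER ... BY salary DESC
    highest_paid = sorted_bag[:3]                               # LIMIT sorted 3
    top_salaries.append((group, highest_paid))                  # GENERATE group, highest_paid

for record in top_salaries:
    print(record)
```

A group with fewer than three records, such as President, simply returns everything it has.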

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#33%
Hands-On Exercise: Analyzing Ad Campaign Data with Pig

- In this Hands-On Exercise, you will analyze data from Dualcore's online ad campaign
- Please refer to the Hands-On Exercise Manual for instructions

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#35%
Essential Points

- Pig has three complex data types: tuple, bag, and map
  - A map is simply a collection of key/value pairs
- These structures can contain simple types like int or chararray
  - But they can also contain complex data types
  - Nested data structures are common in Pig
- Pig provides methods for grouping and ungrouping data
  - You can remove a level of nesting using the FLATTEN operator
- Pig offers several built-in aggregate functions

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#37%
Multi-Dataset Operations with Pig
Chapter 6

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#1%
Multi-Dataset Operations with Pig

In this chapter, you will learn
- How we can use grouping to combine data from multiple sources
- What types of join operations Pig supports and how to use them
- How to concatenate records to produce a single data set
- How to split a single data set into multiple relations

Chapter Topics

Multi-Dataset Operations with Pig

- Techniques for Combining Data Sets
- Joining Data Sets in Pig
- Set Operations
- Splitting Data Sets
- Hands-On Exercise: Analyzing Disparate Data Sets with Pig
- Conclusion

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#4%
Overview of Combining Data Sets

- So far, we have concentrated on processing single data sets
  - Valuable insight often results from combining multiple data sets
- Pig offers several techniques for achieving this
  - Using the GROUP operator with multiple relations
  - Joining the data as you would in SQL
  - Performing set operations like CROSS and UNION
- We will cover each of these in this chapter

Example Data Sets (1)

- Most examples in this chapter will involve the same two data sets
- The first is a file containing information about Dualcore's stores
- There are two fields in this relation
  1. store_id:chararray (unique key)
  2. name:chararray (name of the city in which the store is located)

Stores
A  Anchorage
B  Boston
C  Chicago
D  Dallas
E  Edmonton
F  Fargo

Example Data Sets (2)

- Our other data set is a file containing information about Dualcore's salespeople
- This relation contains three fields
  1. person_id:int (unique key)
  2. name:chararray (salesperson name)
  3. store_id:chararray (refers to store)

Salespeople
1   Alice    B
2   Bob      D
3   Carlos   F
4   Dieter   A
5   Étienne  F
6   Fredo    C
7   George   D
8   Hannah   B
9   Irina    C
10  Jack

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#7%
Grouping Multiple Relations

- We previously learned about the GROUP operator
  - Groups values in a relation based on the specified field(s)
- The GROUP operator can also group multiple relations
  - In this case, using the synonymous COGROUP operator is preferred

grouped = COGROUP stores BY store_id, salespeople BY store_id;

- This collects values from both data sets into a new relation
  - As before, the new relation is keyed by a field named group
  - This group field is associated with one bag for each input

(group,     {bag of records},     {bag of records})
 store_id   records from stores   records from salespeople

Example of COGROUP

(Input: the stores and salespeople example data sets)

grunt> grouped = COGROUP stores BY store_id, salespeople BY store_id;
grunt> DUMP grouped;
(A,{(A,Anchorage)},{(4,Dieter,A)})
(B,{(B,Boston)},{(1,Alice,B),(8,Hannah,B)})
(C,{(C,Chicago)},{(6,Fredo,C),(9,Irina,C)})
(D,{(D,Dallas)},{(2,Bob,D),(7,George,D)})
(E,{(E,Edmonton)},{})
(F,{(F,Fargo)},{(3,Carlos,F),(5,Étienne,F)})
(,{},{(10,Jack,)})
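
The structure COGROUP produces (one key, one bag per input) can be modeled in a few lines. This Python sketch is an approximation under stated assumptions: `cogroup` is a hypothetical helper, lists stand in for bags, and records with a null key are kept in a group of their own for each input, which matches the Jack record in the output above.

```python
stores = [("A", "Anchorage"), ("B", "Boston"), ("C", "Chicago"),
          ("D", "Dallas"), ("E", "Edmonton"), ("F", "Fargo")]
salespeople = [(1, "Alice", "B"), (2, "Bob", "D"), (3, "Carlos", "F"),
               (4, "Dieter", "A"), (5, "Étienne", "F"), (6, "Fredo", "C"),
               (7, "George", "D"), (8, "Hannah", "B"), (9, "Irina", "C"),
               (10, "Jack", None)]

def cogroup(left, left_key, right, right_key):
    """Return (group, bag_from_left, bag_from_right) for every distinct key."""
    groups = {}  # grouping slot -> [bag_from_left, bag_from_right]
    for side, relation, key_index in ((0, left, left_key), (1, right, right_key)):
        for record in relation:
            key = record[key_index]
            # Assumption: null keys never match across inputs, so each input's
            # null-key records get a separate slot.
            slot = (key, None) if key is not None else (None, side)
            groups.setdefault(slot, [[], []])[side].append(record)
    return [(slot[0], bags[0], bags[1]) for slot, bags in groups.items()]

grouped = cogroup(stores, 0, salespeople, 2)
for record in grouped:
    print(record)
```

Edmonton appears with an empty salespeople bag, and Jack appears with an empty stores bag; neither is dropped.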

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#9%
Join Overview

- The COGROUP operator creates a nested data structure
- Pig Latin's JOIN operator creates a flat data structure
  - Similar to joins in a relational database
- A JOIN is similar to doing a COGROUP followed by a FLATTEN
  - Though they handle null values differently

Key Fields

- Like COGROUP, joins rely on a field shared by each relation

joined = JOIN stores BY store_id, salespeople BY store_id;

- Joins can also use multiple fields as the key

joined = JOIN customers BY (name, phone_number),
              accounts BY (name, phone_number);

Inner Joins

- The default JOIN in Pig Latin is an inner join

joined = JOIN stores BY store_id, salespeople BY store_id;

- An inner join outputs records only when the key is found in all inputs
  - In the above example, stores that have at least one salesperson
- You can do an inner join on multiple relations in a single statement
  - But you must use the same key to join them

Inner Join Example

(Input: the stores and salespeople example data sets)

grunt> joined = JOIN stores BY store_id, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
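
An inner join keeps only key values present on both sides and emits one flat record per matching pair. The following Python sketch models that behavior on the example data; `inner_join` is a made-up helper, and the use of a lookup table is a choice of this sketch, not a description of how Pig executes joins.

```python
stores = [("A", "Anchorage"), ("B", "Boston"), ("C", "Chicago"),
          ("D", "Dallas"), ("E", "Edmonton"), ("F", "Fargo")]
salespeople = [(1, "Alice", "B"), (2, "Bob", "D"), (3, "Carlos", "F"),
               (4, "Dieter", "A"), (5, "Étienne", "F"), (6, "Fredo", "C"),
               (7, "George", "D"), (8, "Hannah", "B"), (9, "Irina", "C"),
               (10, "Jack", None)]

def inner_join(left, left_key, right, right_key):
    """Pair each left record with every right record that has the same non-null key."""
    index = {}
    for record in right:
        if record[right_key] is not None:  # null keys never match
            index.setdefault(record[right_key], []).append(record)
    return [l + r for l in left for r in index.get(l[left_key], [])]

joined = inner_join(stores, 0, salespeople, 2)
for record in joined:
    print(record)
```

Edmonton (no salespeople) and Jack (no store) are both absent from the result.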

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#14%
Eliminating Duplicate Fields (1)

- As with COGROUP, the new relation still contains duplicate fields

grunt> joined = JOIN stores BY store_id, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)

Eliminating Duplicate Fields (2)

- We can use FOREACH...GENERATE to retain just the fields we need
  - However, it is now slightly more complex to reference fields
  - We must fully-qualify any fields with names that are not unique

grunt> DESCRIBE joined;
joined: {stores::store_id: chararray,stores::name: chararray,salespeople::person_id: int,salespeople::name: chararray,salespeople::store_id: chararray}

grunt> cleaned = FOREACH joined GENERATE stores::store_id,
           stores::name, person_id, salespeople::name;
grunt> DUMP cleaned;
(A,Anchorage,4,Dieter)
(B,Boston,1,Alice)
(B,Boston,8,Hannah)
... (additional records omitted for brevity) ...

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#16%
Outer Joins

- Pig Latin allows you to specify the type of join following the field name
  - Inner joins do not specify a join type

joined = JOIN relation1 BY field [LEFT|RIGHT|FULL] OUTER,
              relation2 BY field;

- An outer join does not require the key to be found in both inputs
- Outer joins require Pig to know the schema for at least one relation
  - Which relation requires schema depends on the join type
  - Full outer joins require schema for both relations

Left Outer Join Example

(Input: the stores and salespeople example data sets)

- Result contains all records from the relation specified on the left, but only matching records from the one specified on the right

grunt> joined = JOIN stores BY store_id LEFT OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(E,Edmonton,,,)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)

Right Outer Join Example

- Result contains all records from the relation specified on the right, but only matching records from the one specified on the left

grunt> joined = JOIN stores BY store_id RIGHT OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
(,,10,Jack,)

Full Outer Join Example

- Result contains all records from both relations, whether or not the key has a match in the other relation

grunt> joined = JOIN stores BY store_id FULL OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(E,Edmonton,,,)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
(,,10,Jack,)
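
The three outer join types differ only in which unmatched records are kept and padded with nulls. The Python sketch below models all three with one hypothetical helper, `outer_join`; it is an illustration of the results, not of Pig's execution. The padding step needs to know how many fields the other relation has, which is presumably why Pig needs a schema for it.

```python
stores = [("A", "Anchorage"), ("B", "Boston"), ("C", "Chicago"),
          ("D", "Dallas"), ("E", "Edmonton"), ("F", "Fargo")]
salespeople = [(1, "Alice", "B"), (2, "Bob", "D"), (3, "Carlos", "F"),
               (4, "Dieter", "A"), (5, "Étienne", "F"), (6, "Fredo", "C"),
               (7, "George", "D"), (8, "Hannah", "B"), (9, "Irina", "C"),
               (10, "Jack", None)]

def outer_join(left, left_key, right, right_key, how):
    """how is 'left', 'right', or 'full'; unmatched records are padded with None."""
    left_width, right_width = len(left[0]), len(right[0])
    joined, matched_right = [], set()
    for l in left:
        hits = [i for i, r in enumerate(right)
                if l[left_key] is not None and r[right_key] == l[left_key]]
        matched_right.update(hits)
        if hits:
            joined.extend(l + right[i] for i in hits)
        elif how in ("left", "full"):
            joined.append(l + (None,) * right_width)   # keep unmatched left record
    if how in ("right", "full"):
        joined.extend((None,) * left_width + r         # keep unmatched right records
                      for i, r in enumerate(right) if i not in matched_right)
    return joined

left_joined = outer_join(stores, 0, salespeople, 2, "left")
right_joined = outer_join(stores, 0, salespeople, 2, "right")
full_joined = outer_join(stores, 0, salespeople, 2, "full")
print(len(left_joined), len(right_joined), len(full_joined))  # 10 10 11
```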

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#20%
Crossing"Data"Sets"

! JOIN%nds%records%in%one%rela*on%that%match%records%in%another%
!Pigs%CROSS%operator%creates%the%cross%product%of%both%rela*ons%
Combines"all"records"in"both"tables"regardless"of"matching"
In"other"words,"all"possible"combinaAons"of"records"

crossed = CROSS stores, salespeople;

!Careful:%This%can%generate%huge%amounts%of%data!%

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#22%
Cross"Product"Example"
- Generates every possible combination of records in the stores and salespeople relations

stores
    A  Anchorage
    B  Boston
    D  Dallas

salespeople
    1   Alice   B
    2   Bob     D
    8   Hannah  B
    10  Jack

    grunt> crossed = CROSS stores, salespeople;
    grunt> DUMP crossed;
    (A,Anchorage,1,Alice,B)
    (A,Anchorage,2,Bob,D)
    (A,Anchorage,8,Hannah,B)
    (A,Anchorage,10,Jack,)
    (B,Boston,1,Alice,B)
    (B,Boston,2,Bob,D)
    (B,Boston,8,Hannah,B)
    (B,Boston,10,Jack,)
    (D,Dallas,1,Alice,B)
    (D,Dallas,2,Bob,D)
    (D,Dallas,8,Hannah,B)
    (D,Dallas,10,Jack,)
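
To see why the warning about data volume matters, the cross product can be sketched with Python's `itertools.product`. This is only an illustration of the semantics; the helper `cross_size` is a hypothetical name introduced here to show how the output grows.

```python
from itertools import product

# The reduced relations used on this slide. None stands in for an empty field.
stores = [("A", "Anchorage"), ("B", "Boston"), ("D", "Dallas")]
salespeople = [(1, "Alice", "B"), (2, "Bob", "D"),
               (8, "Hannah", "B"), (10, "Jack", None)]

# Every store record is paired with every salesperson record.
crossed = [s + p for s, p in product(stores, salespeople)]
for row in crossed:
    print(row)


def cross_size(*relation_sizes):
    """Number of output records: the product of the input record counts."""
    total = 1
    for n in relation_sizes:
        total *= n
    return total


# Two inputs of a million records each would yield a trillion output records.
print(cross_size(1000000, 1000000))
```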

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#23%
Concatenating Data Sets

- We have explored several techniques for combining data sets
  - They have had one thing in common: they combine horizontally
- The UNION operator combines records vertically
  - It adds data from input relations into a new single relation
  - Pig does not require these inputs to have the same schema
  - It does not eliminate duplicate records nor preserve order
- This is helpful for incorporating new data into your processing

    both = UNION june_items, july_items;

UNION Example
- Concatenates all records from june and july

june
    Adapter  549
    Battery  349
    Cable    799
    DVD      1999
    HDTV     79999

july
    Fax   17999
    GPS   24999
    HDTV  65999
    Ink   3999

    grunt> both = UNION june_items, july_items;
    grunt> DUMP both;
    (Fax,17999)
    (GPS,24999)
    (HDTV,65999)
    (Ink,3999)
    (Adapter,549)
    (Battery,349)
    (Cable,799)
    (DVD,1999)
    (HDTV,79999)
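
A rough Python analogy for UNION is list concatenation: nothing is merged or deduplicated. The sketch below uses the slide's data; the variable `hdtv_rows` is introduced only for illustration, and the order of the concatenated list carries no meaning, since Pig does not promise any particular output order.

```python
june_items = [("Adapter", 549), ("Battery", 349), ("Cable", 799),
              ("DVD", 1999), ("HDTV", 79999)]
july_items = [("Fax", 17999), ("GPS", 24999), ("HDTV", 65999), ("Ink", 3999)]

# Concatenate the two inputs; every input record appears in the result.
both = july_items + june_items

# HDTV appears once per month, so it shows up twice in the combined data.
hdtv_rows = [row for row in both if row[0] == "HDTV"]
print(len(both), hdtv_rows)
```

If you need each item only once, you have to deduplicate or aggregate explicitly after the union.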

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#25%
Splitting Data Sets

- You have learned several ways to combine data sets into a single relation
- Sometimes you need to split a data set into multiple relations
  - Server logs by date range
  - Customer lists by region
  - Product lists by vendor
- Pig Latin supports this with the SPLIT operator

    SPLIT relation INTO relationA IF expression1,
                        relationB IF expression2,
                        relationC IF expression3...;

  - Expressions need not be mutually exclusive

SPLIT Example

- Split customers into groups for rewards program, based on lifetime value

customers
    Annette  9700
    Bruce    23500
    Charles  17800
    Dustin   21250
    Eva      8500
    Felix    9300
    Glynn    27800
    Henry    8900
    Ian      43800
    Jeff     29100
    Kai      34000
    Laura    7800
    Mirko    24200

    grunt> SPLIT customers INTO
               gold_program IF ltv >= 25000,
               silver_program IF ltv >= 10000
                                 AND ltv < 25000;

    grunt> DUMP gold_program;
    (Glynn,27800)
    (Ian,43800)
    (Jeff,29100)
    (Kai,34000)

    grunt> DUMP silver_program;
    (Bruce,23500)
    (Charles,17800)
    (Dustin,21250)
    (Mirko,24200)
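
The behavior of SPLIT can be sketched in Python as testing every record against every condition. This is an illustration only; the function name `split` and the second pair of conditions (`over_20k`, `over_25k`) are hypothetical, added to show what happens when conditions overlap.

```python
customers = [("Annette", 9700), ("Bruce", 23500), ("Charles", 17800),
             ("Dustin", 21250), ("Eva", 8500), ("Felix", 9300),
             ("Glynn", 27800), ("Henry", 8900), ("Ian", 43800),
             ("Jeff", 29100), ("Kai", 34000), ("Laura", 7800),
             ("Mirko", 24200)]


def split(records, **conditions):
    """Route each record to every output whose condition it satisfies."""
    outputs = {name: [] for name in conditions}
    for record in records:
        for name, predicate in conditions.items():
            if predicate(record):
                outputs[name].append(record)
    return outputs


# Same conditions as the slide: records below 10000 land in neither output.
programs = split(customers,
                 gold_program=lambda r: r[1] >= 25000,
                 silver_program=lambda r: 10000 <= r[1] < 25000)
print(programs["gold_program"])
print(programs["silver_program"])

# Overlapping conditions: a record can land in more than one output.
overlap = split(customers,
                over_20k=lambda r: r[1] >= 20000,
                over_25k=lambda r: r[1] >= 25000)
```

Note that the five customers below 10000 appear in neither program, so the outputs together hold fewer records than the input.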

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#28%
Hands-On Exercise: Analyzing Disparate Data Sets with Pig

- In this Hands-On Exercise, you will analyze multiple data sets with Pig
- Please refer to the Hands-On Exercise Manual for instructions

Essential Points

- You can use COGROUP to group multiple relations
  - This creates a nested data structure
- Pig supports common SQL join types
  - Inner, left outer, right outer, and full outer
  - You may need to fully qualify field names when using joined data
- Pig's CROSS operator creates every possible combination of input data
  - This can create huge amounts of data, so use it carefully!
- You can use UNION to concatenate data sets
- In addition to combining data sets, Pig supports splitting them too

Extending Pig
Chapter 7

Extending Pig

In this chapter, you will learn
- How to use parameters in your Pig Latin to increase its flexibility
- How to define and invoke macros to improve the reusability of your code
- How to call user-defined functions from your code
- How to write user-defined functions in Python
- How to process data with external scripts

Chapter Topics

Extending Pig

- Adding Flexibility with Parameters
- Macros and Imports
- UDFs
- Contributed Functions
- Using Other Languages to Process Data with Pig
- Hands-On Exercise: Extending Pig with Streaming and UDFs
- Conclusion

The Need for Parameters (1)

- Some processing is very repetitive
  - For example, creating sales reports

    allsales = LOAD 'sales' AS (name, price);
    bigsales = FILTER allsales BY price > 999;
    bigsales_alice = FILTER bigsales BY name == 'Alice';
    STORE bigsales_alice INTO 'Alice';

The Need for Parameters (2)

- You may need to change the script slightly for each run
  - For example, to modify the paths or filter criteria

    allsales = LOAD 'sales' AS (name, price);
    bigsales = FILTER allsales BY price > 999;
    bigsales_alice = FILTER bigsales BY name == 'Alice';
    STORE bigsales_alice INTO 'Alice';

Making the Script More Flexible with Parameters

- Instead of hardcoding values, Pig allows you to use parameters
  - These are replaced with specified values at runtime

    allsales = LOAD '$INPUT' AS (name, price);
    bigsales = FILTER allsales BY price > $MINPRICE;
    bigsales_name = FILTER bigsales BY name == '$NAME';
    STORE bigsales_name INTO '$NAME';

  - Then specify the values on the command line

    $ pig -p INPUT=sales -p MINPRICE=999 \
          -p NAME='Jo Anne' reporter.pig
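
Parameter substitution is textual: each `$NAME` placeholder is replaced before the script runs. Python's `string.Template` happens to use the same `$NAME` syntax, so it gives a convenient way to preview the effect. This is an analogy only, not Pig's own preprocessor; the variable names `script`, `params`, and `substituted` are made up for the sketch.

```python
from string import Template

# The parameterized script from the slide.
script = Template("""\
allsales = LOAD '$INPUT' AS (name, price);
bigsales = FILTER allsales BY price > $MINPRICE;
bigsales_name = FILTER bigsales BY name == '$NAME';
STORE bigsales_name INTO '$NAME';
""")

# The values passed with -p on the command line.
params = {"INPUT": "sales", "MINPRICE": "999", "NAME": "Jo Anne"}

# substitute() raises KeyError if any placeholder has no value.
substituted = script.substitute(params)
print(substituted)
```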

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#7%
Two Tricks for Specifying Parameter Values

- You can also specify parameter values in a text file
  - An alternative to typing each one on the command line

    INPUT=sales
    MINPRICE=999
    # comments look like this
    NAME='Alice'

  - Use the -m filename option to tell Pig which file contains the values
- Parameter values can be defined with the output of a shell command
  - For example, to set MONTH to the current month:

    MONTH=`date +'%m'`    # returns 03 for March, 05 for May

The Need for Macros

- Parameters simplify repetitive code by allowing you to pass in values
  - But sometimes you would like to reuse the actual code too

    allsales = LOAD 'sales' AS (name, price);
    byperson = FILTER allsales BY name == 'Alice';
    SPLIT byperson INTO low IF price < 1000,
                        high IF price >= 1000;
    amt1 = FOREACH low GENERATE name, price * 0.07 AS amount;
    amt2 = FOREACH high GENERATE name, price * 0.12 AS amount;
    commissions = UNION amt1, amt2;
    grpd = GROUP commissions BY name;
    out = FOREACH grpd GENERATE SUM(commissions.amount) AS total;

Defining a Macro in Pig Latin

- Macros allow you to define a block of code to reuse easily
  - Similar (but not identical) to a function in a programming language

    define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
    returns result {
      allsales = LOAD 'sales' AS (name, price);
      byperson = FILTER allsales BY name == '$NAME';
      SPLIT byperson INTO low IF price < $SPLIT_AMT,
                          high IF price >= $SPLIT_AMT;
      amt1 = FOREACH low GENERATE name, price * $LOW_PCT AS amount;
      amt2 = FOREACH high GENERATE name, price * $HIGH_PCT AS amount;
      commissions = UNION amt1, amt2;
      grouped = GROUP commissions BY name;
      $result = FOREACH grouped GENERATE SUM(commissions.amount);
    };

Invoking Macros

- To invoke a macro, call it by name and supply values in the correct order

    define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
    returns result {
      allsales = LOAD 'sales' AS (name, price);
      ... (other code removed for brevity) ...
      $result = FOREACH grouped GENERATE SUM(commissions.amount);
    };

    alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);
    carlos_comm = calc_commission('Carlos', 2000, 0.08, 0.14);
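
For readers who think in terms of functions, the same calculation can be sketched as a Python function taking the same four arguments. This is an analogy, not an equivalent: a Pig macro is expanded into the calling script, while a Python function is called at runtime. The `sales` records below are invented for illustration and do not come from the course data.

```python
def calc_commission(sales, name, split_amt, low_pct, high_pct):
    """Total commission for one person, with a higher rate for large sales."""
    total = 0.0
    for seller, price in sales:
        if seller != name:
            continue  # corresponds to the FILTER ... BY name step
        # Corresponds to the SPLIT into low and high, then the two FOREACH steps.
        rate = low_pct if price < split_amt else high_pct
        total += price * rate
    return total  # corresponds to GROUP plus SUM


# Hypothetical sales records: (name, price)
sales = [("Alice", 500), ("Alice", 2000), ("Carlos", 1500), ("Carlos", 2500)]

alice_comm = calc_commission(sales, "Alice", 1000, 0.07, 0.12)
carlos_comm = calc_commission(sales, "Carlos", 2000, 0.08, 0.14)
print(alice_comm, carlos_comm)
```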

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#12%
Reusing Code with Imports

- After defining a macro, you may wish to use it in multiple scripts
- You can include one script within another, starting with Pig 0.9
  - This is done with the import keyword and the path to the file being imported

    -- We saved the macro to a file named commission_calc.pig
    import 'commission_calc.pig';

    alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);

User-Defined Functions (UDFs)

- We have covered many of Pig's built-in functions already
- It is also possible to define your own functions
  - Pig allows writing UDFs in several languages

    Language                   Supported in Pig Versions
    Java                       All
    Python                     0.8 and later
    JavaScript (experimental)  0.9 and later
    Ruby (experimental)        0.10 and later
    Groovy (experimental)      0.11 and later

- In the next few slides, you will see how to use UDFs in Java, and how to write and use UDFs in Python

Using UDFs Written in Java

- UDFs are packaged into Java Archive (JAR) files
- There are only two required steps for using them
  - Register the JAR file(s) containing the UDF and its dependencies
  - Invoke the UDF using the fully qualified class name

    REGISTER '/path/to/myudf.jar';
    ...
    data = FOREACH allsales GENERATE com.example.MYFUNC(name);

- You can optionally define an alias for the function

    REGISTER '/path/to/myudf.jar';
    DEFINE FOO com.example.MYFUNC;
    ...
    data = FOREACH allsales GENERATE FOO(name);

Writing UDFs in Python (1)

- Now we will see how to write a UDF in Python
- The data we want to process has inconsistent phone number formats

    Alice   (314) 555-1212
    Bob     212.555.9753
    Carlos  405-555-3912
    David   (202) 555.8471

- We will write a Python UDF that can consistently extract the area code

Writing UDFs in Python (2)

- Our Python code is straightforward
- The only unusual thing is the optional @outputSchema decorator
  - This tells Pig what data type we are returning
  - If not specified, Pig will assume bytearray

    @outputSchema("areacode:chararray")
    def get_area_code(phone):
        areacode = "???"  # return this for unknown formats

        if len(phone) == 12:
            # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
            areacode = phone[0:3]
        elif len(phone) == 14:
            # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
            areacode = phone[1:4]

        return areacode
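
Before registering a UDF like this with Pig, it can be handy to exercise the function in an ordinary Python interpreter. The sketch below assumes that Pig supplies the real `outputSchema` decorator when it runs the script; the stand-in defined here is hypothetical and exists only so the file runs outside Pig.

```python
def outputSchema(schema):
    """Stand-in for local testing: records the schema, returns the function unchanged."""
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap


@outputSchema("areacode:chararray")
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats

    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]

    return areacode


# The four sample numbers from the previous slide, plus one unknown format.
samples = ["(314) 555-1212", "212.555.9753", "405-555-3912",
           "(202) 555.8471", "555-1212"]
for phone in samples:
    print(phone, "->", get_area_code(phone))
```

The function relies only on string length, so any other 12- or 14-character input would also yield an "area code"; stricter validation is left out here, as on the slide.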

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#18%
Invoking Python UDFs from Pig Latin

- Using this UDF from our Pig Latin is also easy
  - We saved our Python code as phonenumber.py
  - This Python file is in our current directory

    REGISTER 'phonenumber.py' USING jython AS phoneudf;

    names = LOAD 'names' AS (name:chararray, phone:chararray);

    areacodes = FOREACH names GENERATE
                    phoneudf.get_area_code(phone) AS ac;

Open Source UDFs

- Pig ships with a set of community-contributed UDFs called Piggy Bank
- Another popular package of UDFs, called DataFu, has been open-sourced by LinkedIn

Piggy Bank

- Piggy Bank ships with Pig
  - You will need to register the piggybank.jar file
  - The location may vary depending on source and version
  - In CDH on our VMs, it is at /usr/lib/pig/piggybank.jar
- Some UDFs in Piggy Bank include (package names omitted for brevity)

    Class Name     Description
    ISOToUnix      Converts an ISO 8601 date/time format to UNIX format
    UnixToISO      Converts a UNIX date/time format to ISO 8601 format
    LENGTH         Returns the number of characters in the supplied string
    HostExtractor  Returns the host name from a URL
    DiffDate       Returns number of days between two dates

DataFu

- DataFu does not ship with Pig, but is part of CDH 4.1.0 and later
  - You will need to register the DataFu JAR file
  - In the VM, it is at /usr/lib/pig/datafu-0.0.4-cdh4.2.0.jar
- Some UDFs in DataFu include (package names omitted for brevity)

    Class Name            Description
    Quantile              Calculates quantiles for a data set
    Median                Calculates the median for a data set
    Sessionize            Groups data into sessions based on a specified time window
    HaversineDistInMiles  Calculates distance in miles between two points, given latitude and longitude

Using a Contributed UDF

- Here is an example of using a UDF from DataFu to calculate distance
  - Input data

    37.789336   -122.401385   40.707555   -74.011679

  - Pig Latin

    REGISTER '/usr/lib/pig/datafu-*.jar';
    DEFINE DIST datafu.pig.geo.HaversineDistInMiles;

    places = LOAD 'data' AS (lat1:double, lon1:double,
                             lat2:double, lon2:double);
    dist = FOREACH places GENERATE DIST(lat1, lon1, lat2, lon2);
    DUMP dist;

  - Output data

    (2564.207116295711)
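
For reference, the haversine formula itself is short enough to sketch in Python. This is not DataFu's implementation: the function name `haversine_miles` and the Earth radius constant are assumptions, so the result should be close to the output shown but need not agree to every decimal place.

```python
import math

# Assumed mean Earth radius in miles; DataFu may use a slightly different value.
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# The input record from the slide.
dist = haversine_miles(37.789336, -122.401385, 40.707555, -74.011679)
print(round(dist, 1))
```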

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#24%
Processing Data with an External Script

- While Pig Latin is powerful, some tasks are easier in another language
- Pig allows you to stream data through another language for processing
  - This is done using the STREAM keyword
- Similar concept to Hadoop Streaming
  - Data is supplied to the script on standard input as tab-delimited fields
  - Script writes results to standard output as tab-delimited fields

STREAM Example in Python (1)

- Our example will calculate a user's age given that user's birthdate
  - This calculation is done in a Python script named agecalc.py
- Here is the corresponding Pig Latin code
  - Backticks used to quote script name following the alias
  - Single quotes used for quoting script name within SHIP
  - The schema for the data produced by the script follows the AS keyword

    DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
    users = LOAD 'data' AS (name:chararray, birthdate:chararray);
    out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
    DUMP out;

STREAM Example in Python (2)

- Python code for agecalc.py

    #!/usr/bin/env python

    import sys
    from datetime import datetime

    for line in sys.stdin:
        line = line.strip()
        (name, birthdate) = line.split("\t")

        d1 = datetime.strptime(birthdate, '%Y-%m-%d')
        d2 = datetime.now()

        # Approximate age in whole years (a year is treated as 365 days)
        age = int((d2 - d1).days / 365)

        # The parenthesized form runs under both Python 2 and Python 3
        print("%s\t%i" % (name, age))

STREAM Example in Python (3)

- The Pig script again, and the data it reads and writes

    DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
    users = LOAD 'data' AS (name:chararray, birthdate:chararray);
    out = STREAM users THROUGH MYSCRIPT AS (name:chararray,
                                           age:int);
    DUMP out;

    Input data              Output data
    andy    1963-11-15      (andy,49)
    betty   1985-12-30      (betty,27)
    chuck   1979-02-23      (chuck,34)
    debbie  1982-09-19      (debbie,30)
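
Because a streaming script only reads tab-delimited lines and writes tab-delimited lines, its logic can be checked without Pig or Hadoop. The sketch below restructures the age calculation into functions that take the current date as an argument, so the result is repeatable. The names `age_in_years` and `transform` and the reference date of 2013-03-15 are assumptions made for this sketch; the date was chosen to be consistent with the ages shown on the slide.

```python
from datetime import datetime


def age_in_years(birthdate, today):
    """Whole years between birthdate (YYYY-MM-DD) and today, using 365-day years."""
    born = datetime.strptime(birthdate, "%Y-%m-%d")
    return (today - born).days // 365


def transform(lines, today):
    """Apply the same per-line logic as agecalc.py to an iterable of input lines."""
    results = []
    for line in lines:
        name, birthdate = line.strip().split("\t")
        results.append("%s\t%i" % (name, age_in_years(birthdate, today)))
    return results


# The input records from the slide, as the script would receive them on stdin.
sample_input = ["andy\t1963-11-15\n", "betty\t1985-12-30\n",
                "chuck\t1979-02-23\n", "debbie\t1982-09-19\n"]

reference_date = datetime(2013, 3, 15)  # assumed date, not taken from the course
for line in transform(sample_input, reference_date):
    print(line)
```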

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#29%
Hands-On Exercise: Extending Pig with Streaming and UDFs

- In this Hands-On Exercise, you will process data with an external script and a user-defined function
- Please refer to the Hands-On Exercise Manual for instructions

Essential Points

- Pig supports several extension mechanisms
- Parameters and macros can help make your code more reusable
  - And easier to maintain and share with others
- Piggy Bank and DataFu are two examples of open source UDFs
  - You can also write your own UDFs
- It is also possible to embed Pig within another language

Bibliography

The following offer more information on topics discussed in this chapter
- Documentation on Parameter Substitution in Pig
  - http://tiny.cloudera.com/dac07a
- Documentation on Macros in Pig
  - http://tiny.cloudera.com/dac07b
- Documentation on User-Defined Functions in Pig
  - http://tiny.cloudera.com/dac07c
- Documentation on Piggy Bank
  - http://tiny.cloudera.com/dac07d
- Introducing DataFu
  - http://tiny.cloudera.com/dac07e

Pig Troubleshooting and Optimization
Chapter 8

Pig Troubleshooting and Optimization

In this chapter, you will learn
- How to control the information that Pig and Hadoop write to log files
- How Hadoop's Web UI can help you troubleshoot failed jobs
- How to use SAMPLE and ILLUSTRATE to test and debug Pig jobs
- How Pig creates MapReduce jobs from your Pig Latin code
- How several simple changes to your Pig Latin code can make it run faster
- Which resources are especially helpful for troubleshooting Pig errors

Chapter Topics

Pig Troubleshooting and Optimization

- Troubleshooting Pig
- Logging
- Using Hadoop's Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Your Pig Jobs
- Conclusion

Troubleshooting Overview

- We have now covered how to use Pig for data analysis
  - Unfortunately, sometimes your code may not work as you expect
  - It is important to remember that Pig and Hadoop are intertwined
- Here we will cover some techniques for isolating and resolving problems
  - We will start with a few options to the pig command

Helping Yourself

- We will discuss some options for the pig command in this chapter
  - You can view all of them by using the -h (help) option
  - Keep in mind that many options are advanced or rarely used
- One useful option is -c (check), which validates the syntax of your code

    $ pig -c myscript.pig
    myscript.pig syntax OK

- The -dryrun option is very helpful if you use parameters or macros

    $ pig -p INPUT=demodata -dryrun myscript.pig

  - Creates a myscript.pig.substituted file in the current directory

Getting Help from Others

- Sometimes you may need help from others
  - Mailing lists or newsgroups
  - Forums and bulletin board sites
  - Support services
- You will probably need to provide the version of Pig and Hadoop you are using

    $ pig -version
    Apache Pig version 0.10.0-cdh4.2.0

    $ hadoop version
    Hadoop 2.0.0-cdh4.2.0

Customizing Log Messages

- You may wish to change how much information is logged
  - A recent change in Hadoop can cause lots of warnings when using Pig
- Pig and Hadoop use the Log4J library, which is easily customized
- Edit the /etc/pig/conf/log4j.properties file to include:

    log4j.logger.org.apache.pig=ERROR
    log4j.logger.org.apache.hadoop.conf.Configuration=ERROR

- Edit the /etc/pig/conf/pig.properties file to set this property:

    log4jconf=/etc/pig/conf/log4j.properties

Customizing Log Messages on a Per-Job Basis

- Often you just want to temporarily change the log level
  - Especially while trying to troubleshoot a problem with your script
- You can specify a Log4J properties file to use when you invoke Pig
  - This overrides the default Log4J configuration
- Create a customlog.properties file to include:

    log4j.logger.org.apache.pig=DEBUG

- Specify this file via the -log4jconf argument to Pig

    $ pig -log4jconf customlog.properties

Controlling Client-Side Log Files

- When a job fails, Pig may produce a log file to explain why
  - These are typically produced in your current directory
- To use a different location, use the -l (log) option when starting Pig

    $ pig -l /tmp

- Or set it permanently by editing /etc/pig/conf/pig.properties
  - Specify a different directory using the log.file property

    log.file=/tmp

The Hadoop Web UI

- Each Hadoop daemon has a corresponding Web application
  - This allows us to easily see cluster and job status with a browser
  - In pseudo-distributed mode, the hostname is localhost

          Daemon Name      Address
    HDFS  NameNode         http://hostname:50070/
    HDFS  DataNode         http://hostname:50075/
    MR1   JobTracker       http://hostname:50030/
    MR1   TaskTracker      http://hostname:50060/
    MR2   ResourceManager  http://hostname:8088/
    MR2   NodeManager      http://hostname:8042/

The JobTracker Web UI (1)

- The JobTracker offers the most useful of Hadoop's Web UIs
  - It displays MapReduce status information for the Hadoop cluster

The JobTracker Web UI (2)

- The JobTracker Web UI also shows historical information
  - You can click one of the links to see details for a particular job

The JobTracker Web UI (3)

- The job detail page can help you troubleshoot a problem

Naming Your Job

- Hadoop clusters are typically shared resources
  - There might be dozens or hundreds of others using it
  - As a result, sometimes it is hard to find your job in the Web UI
- We recommend setting a name in your scripts to help identify your jobs
  - Set the job.name property, either in Grunt or your script

    grunt> set job.name 'Q2 2013 Sales Reporter';

Killing a Job

- A job that processes a lot of data can take hours to complete
  - Sometimes you spot an error in your code just after submitting a job
  - Rather than wait for the job to complete, you can kill it
- First, find the Job ID on the front page of the JobTracker Web UI
- Then, use the kill command in Pig along with that Job ID

    grunt> kill job_201303151454_0028

Optional Demo: Overview

- Time permitting, your instructor will now demonstrate how to use the JobTracker's Web UI to isolate a bug in our code that causes a job to fail
- The code and instructions are already on the VM

    $ cd ~/training_materials/analyst/webuidemo
    $ cat README.txt

Using SAMPLE to Create a Smaller Data Set

- Your code might process terabytes of data in production
  - However, it is convenient to test with smaller amounts during development
- Use SAMPLE to choose a random set of records from a data set
- This example selects about 5% of records from bigdata
  - Stores them in a new directory called mysample

    everything = LOAD 'bigdata';
    subset = SAMPLE everything 0.05;
    STORE subset INTO 'mysample';
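
The word "about" is worth a closer look. The sketch below assumes that sampling keeps each record independently with the given probability, which would explain why the size of the result is approximate rather than exact. The function name `sample` and the `seed` parameter are hypothetical, introduced only to make the sketch repeatable.

```python
import random


def sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < fraction]


# Stand-in for a large data set: 10,000 numbered records.
everything = list(range(10000))
subset = sample(everything, 0.05, seed=42)

# The count is close to 5% of the input (500), but rarely exactly 500.
print(len(subset))
```

Because the selection is random, a record you need for a particular test may be missing from the sample.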

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#22%
Intelligent Sampling with ILLUSTRATE

- Sometimes a random sample may lack data needed for testing
  - For example, matching records in two data sets for a JOIN operation
- Pig's ILLUSTRATE keyword can do more intelligent sampling
  - Pig will examine the code to determine what data is needed
  - It picks a few records that properly exercise the code
- You should specify a schema when using ILLUSTRATE
  - Pig will generate records when yours don't suffice

Using ILLUSTRATE Helps You to Understand Data Flow

- Like DUMP and DESCRIBE, ILLUSTRATE aids in debugging
  - The syntax is the same for all three

    grunt> allsales = LOAD 'sales' AS (name:chararray, price:int);
    grunt> bigsales = FILTER allsales BY price > 999;
    grunt> ILLUSTRATE bigsales;
    (Bob,3625)
    --------------------------------------------------
    | allsales | name:chararray | price:int |
    --------------------------------------------------
    |          | Bob            | 3625      |
    |          | Bob            | 998       |
    --------------------------------------------------
    --------------------------------------------------
    | bigsales | name:chararray | price:int |
    --------------------------------------------------
    |          | Bob            | 3625      |
    --------------------------------------------------

General Debugging Strategies

- Use DUMP, DESCRIBE, and ILLUSTRATE often
    - The data might not be what you think it is
- Look at a sample of the data
    - Verify that it matches the fields in your LOAD specification
- Other helpful steps for tracking down a problem
    - Use -dryrun to see the script after parameters and macros are processed
    - Test external scripts (STREAM) by passing some data from a local file
    - Look at the logs, especially task logs available from the Web UI

Chapter Topics

Pig Troubleshooting and Optimization

- Troubleshooting Pig
- Logging
- Using Hadoop's Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview (current topic)
- Understanding the Execution Plan
- Tips for Improving the Performance of Your Pig Jobs
- Conclusion

Performance Overview

- We have discussed several techniques for finding errors in Pig Latin code
    - Once you get your code working, you'll often want it to work faster
- Performance tuning is a broad and complex subject
    - Requires a deep understanding of Pig, Hadoop, Java, and Linux
    - Typically the domain of engineers and system administrators
- Most of these topics are beyond the scope of this course
    - We'll cover the basics and offer several performance improvement tips
    - See Programming Pig (chapters 7 and 8) for detailed coverage

Chapter Topics

Pig Troubleshooting and Optimization

- Troubleshooting Pig
- Logging
- Using Hadoop's Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan (current topic)
- Tips for Improving the Performance of Your Pig Jobs
- Conclusion

How Pig Latin Becomes a MapReduce Job

- Pig Latin code ultimately runs as MapReduce jobs on the Hadoop cluster
- However, Pig does not translate your code into Java MapReduce
    - Much like relational databases don't translate SQL to C language code
    - Like a database, Pig interprets the Pig Latin to develop execution plans
    - Pig's execution engine uses these to submit MapReduce jobs to Hadoop
- The EXPLAIN keyword details Pig's three execution plans
    - Logical
    - Physical
    - MapReduce
- Seeing an example job will help us better understand EXPLAIN's output

Description of Our Example Code and Data

- Our goal is to produce a list of per-store sales

stores
A   Anchorage
B   Boston
C   Chicago
D   Dallas
E   Edmonton
F   Fargo

sales
A   1999
D   2399
A   4579
B   6139
A   2489
B   3699
E   2479
D   5799

grunt> stores = LOAD 'stores'
           AS (store_id:chararray, name:chararray);
grunt> sales = LOAD 'sales'
           AS (store_id:chararray, price:int);
grunt> groups = GROUP sales BY store_id;
grunt> totals = FOREACH groups GENERATE group,
           SUM(sales.price) AS amount;
grunt> joined = JOIN totals BY group,
           stores BY store_id;
grunt> result = FOREACH joined
           GENERATE name, amount;
grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)
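To check the numbers in the DUMP output, the same GROUP, SUM, and JOIN steps can be reproduced in a few lines of plain Python over the sample data on this slide. This is only a sketch of the logic, not of how Pig executes it; the variable names simply mirror the Pig aliases.

```python
from collections import defaultdict

# Sample data from the slide
stores = {"A": "Anchorage", "B": "Boston", "C": "Chicago",
          "D": "Dallas", "E": "Edmonton", "F": "Fargo"}
sales = [("A", 1999), ("D", 2399), ("A", 4579), ("B", 6139),
         ("A", 2489), ("B", 3699), ("E", 2479), ("D", 5799)]

# GROUP sales BY store_id, then SUM(sales.price) for each group
totals = defaultdict(int)
for store_id, price in sales:
    totals[store_id] += price

# Inner JOIN on store_id: stores without any sales (C and F) drop out
result = sorted((stores[store_id], amount)
                for store_id, amount in totals.items()
                if store_id in stores)

print(result)
# [('Anchorage', 9067), ('Boston', 9838), ('Dallas', 8198), ('Edmonton', 2479)]
```

Chicago and Fargo are missing from the result because an inner join only keeps keys present on both sides.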

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#30%
Using the EXPLAIN Keyword

- Using EXPLAIN rather than DUMP will show the execution plans

grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)

grunt> EXPLAIN result;
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
result: (Name: LOStore Schema:
stores::name#49:chararray,totals::amount#70:long)
|
|---result: (Name: LOForEach Schema:
stores::name#49:chararray,totals::amount#70:long)

(other lines, including physical and MapReduce plans, would follow)

Chapter Topics

Pig Troubleshooting and Optimization

- Troubleshooting Pig
- Logging
- Using Hadoop's Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Your Pig Jobs (current topic)
- Conclusion

Pig's Runtime Optimizations

- Pig does not necessarily run your statements exactly as you wrote them
- It may remove operations for efficiency

sales = LOAD 'sales' AS (store_id:chararray, price:int);
unused = FILTER sales BY price > 789;
DUMP sales;

- It may also rearrange operations for efficiency

grouped = GROUP sales BY store_id;
totals = FOREACH grouped GENERATE group, SUM(sales.price);
joined = JOIN totals BY group, stores BY store_id;
only_a = FILTER joined BY store_id == 'A';
DUMP only_a;

Optimizations You Can Make in Your Pig Latin Code

- Pig's optimizer does what it can to improve performance
    - But you know your own code and data better than it does
    - A few small changes in your code can allow additional optimizations
- On the next few slides, we will rewrite this Pig code for performance

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
    FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Don't Produce Output You Don't Really Need

- In this case, we forgot to remove the DUMP statement
    - Sometimes happens when moving from development to production
    - And it might go unnoticed if you're not watching the terminal

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
    FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Specify Schema Whenever Possible

- Specifying schema when loading data eliminates the need for Pig to guess
    - It may choose a bigger type than you need (e.g., long instead of int)
- The postcode and phone fields in the stores data set were also never used
    - Eliminating them in our schema ensures they'll be omitted in joined

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
joined = JOIN sales BY store_id, stores BY store_id;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
    FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Filter Unwanted Data As Early As Possible

- We previously did our JOIN before our FILTER
    - This produced lots of data we ultimately discarded
    - Moving the FILTER operation up makes our script more efficient
    - Caveat: We now have to filter by store ID rather than store name

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
    FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
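The effect of moving the filter can be seen with a small Python model that uses the sample stores and sales from the earlier example slide. Both orderings give the same rows for stores A and E, but filtering first means the join handles half as many records. This is a plain-Python sketch of the idea, not a measurement of Pig; the `join` helper is hypothetical.

```python
stores = {"A": "Anchorage", "B": "Boston", "C": "Chicago",
          "D": "Dallas", "E": "Edmonton", "F": "Fargo"}
sales = [("A", 1999), ("D", 2399), ("A", 4579), ("B", 6139),
         ("A", 2489), ("B", 3699), ("E", 2479), ("D", 5799)]
wanted = {"A", "E"}


def join(sales_rows, stores_by_id):
    """Inner join of sales rows with store names on store_id."""
    return [(store_id, price, stores_by_id[store_id])
            for store_id, price in sales_rows
            if store_id in stores_by_id]


# Original order: join everything, then filter
late_joined = join(sales, stores)
late = [row for row in late_joined if row[0] in wanted]

# Optimized order: filter first, then join only what is needed
regsales = [row for row in sales if row[0] in wanted]
early_joined = join(regsales, stores)

print(len(late_joined), len(early_joined))  # 8 4
print(late == early_joined)                 # True
```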

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#37%
Consider Adjusting the Parallelization

- Hadoop clusters scale by processing data in parallel
    - Newer Pig releases choose the number of reducers based on input size
    - However, it is often beneficial to set a value explicitly in your script
    - Your system administrator can help you determine the best value

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
    FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Specify the Smaller Data Set First in a Join

- We can optimize joins by specifying the larger data set last
    - Pig will stream the larger data set instead of reading it into memory
    - In our case, we have far more records in sales than in stores
    - Changing the order in the JOIN statement can boost performance

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
    FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
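Why the order matters can be pictured with a simplified, single-process join in Python: the first input is buffered in memory by key, while the last input is read one record at a time and never held as a whole. This is a rough model of the behavior described on this slide, not Pig's actual code; `join_stream_last` is a hypothetical helper and the rows are sample data.

```python
from collections import defaultdict


def join_stream_last(first, last):
    """Inner join of two (key, value) inputs on key.

    `first` is buffered in memory; `last` is consumed once, record by record.
    """
    buffered = defaultdict(list)
    for key, value in first:
        buffered[key].append(value)
    for key, value in last:  # streamed: only one record is held at a time
        for other in buffered.get(key, []):
            yield (key, other, value)


# Small input first, so it is the one that gets buffered
stores = [("A", "Anchorage"), ("E", "Edmonton")]
# An iterator stands in for the large input that should not be loaded whole
regsales = iter([("A", 1999), ("A", 4579), ("A", 2489), ("E", 2479)])

joined = list(join_stream_last(stores, regsales))
print(joined)
```

The memory needed grows with the size of the buffered input only, which is why the smaller data set belongs first.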

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#39%
Try Using Compression on Intermediate Data

- Pig scripts often yield jobs with both a Map and a Reduce phase
    - Remember that Mapper output becomes Reducer input
    - Compressing this intermediate data is easy and can boost performance
    - Your system administrator may need to install a compression library

set mapred.compress.map.output true;
set mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
    FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
... (other lines unchanged, but removed for brevity) ...

A Few More Tips for Improving Performance

- Main theme: Eliminate unnecessary data as early as possible
    - Use FOREACH ... GENERATE to select just those fields you need
    - Use ORDER BY and LIMIT when you only need a few records
    - Use DISTINCT when you don't need duplicate records
- Dropping records with NULL keys before a join can boost performance
    - These records will be eliminated in the final output anyway
    - But Pig doesn't discard them until after the join
    - Use FILTER to remove records with null keys before the join

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);

nonnull_stores = FILTER stores BY store_id IS NOT NULL;
nonnull_sales = FILTER sales BY store_id IS NOT NULL;

joined = JOIN nonnull_stores BY store_id, nonnull_sales BY store_id;
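The claim that null-key records never survive an inner join can be checked with a small Python model, where `None` plays the role of NULL. Filtering them out beforehand changes nothing in the result; it only reduces what the join has to process. The rows and the `inner_join` helper are invented for this sketch.

```python
def inner_join(left, right):
    """Inner join on the first field; a None (null) key never matches."""
    return [(left_key, left_value, right_value)
            for left_key, left_value in left
            for right_key, right_value in right
            if left_key is not None and left_key == right_key]


# Hypothetical rows, including some with null keys
stores = [("A", "Anchorage"), (None, "Unknown"), ("E", "Edmonton")]
sales = [("A", 1999), (None, 500), ("E", 2479), (None, 750)]

# Join without filtering: null-key rows are carried along, then discarded
unfiltered = inner_join(stores, sales)

# Filter first, as in the Pig code above, then join
nonnull_stores = [row for row in stores if row[0] is not None]
nonnull_sales = [row for row in sales if row[0] is not None]
filtered = inner_join(nonnull_stores, nonnull_sales)

print(unfiltered == filtered)                  # True
print(len(sales), "->", len(nonnull_sales))    # 4 -> 2
```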

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#41%
Chapter Topics

Pig Troubleshooting and Optimization

- Troubleshooting Pig
- Logging
- Using Hadoop's Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Your Pig Jobs
- Conclusion (current topic)

Essential Points

- You can boost performance by eliminating unneeded data during processing
- Pig's error messages don't always clearly identify the source of a problem
    - We recommend testing your scripts with a small data sample
    - Looking at the Web UI, and especially the log messages, can be helpful
- The resources listed on the upcoming bibliography slide may further assist you in solving problems

Bibliography

The following offer more information on topics discussed in this chapter
- Hadoop Log Location and Retention
    http://tiny.cloudera.com/dac08a
- Pig Testing and Diagnostics
    http://tiny.cloudera.com/dac08b
- Mailing List for Pig Users
    http://tiny.cloudera.com/dac08c
- Questions Tagged with Pig on Stack Overflow
    http://tiny.cloudera.com/dac08d
- Questions Tagged with PigLatin on Stack Overflow
    http://tiny.cloudera.com/dac08e

Introduction to Hive
Chapter 9

Course Chapters

- Introduction
- Hadoop Fundamentals
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive (current chapter)
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Introduction to Hive

In this chapter, you will learn
- What Hive is
- How Hive differs from a relational database
- Ways in which organizations use Hive
- How to invoke and interact with Hive

Chapter Topics

Introduction to Hive

- What is Hive? (current topic)
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive Use Cases
- Interacting with Hive
- Conclusion

Overview of Apache Hive

- Apache Hive is a high-level abstraction on top of MapReduce
    - Uses a SQL-like language called HiveQL
    - Generates MapReduce jobs that run on the Hadoop cluster
    - Originally developed by Facebook for data warehousing
    - Now an open-source Apache project

SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON customers.cust_id = orders.cust_id
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;
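HiveQL is close enough to SQL that the same query also runs on an ordinary SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so the join, filter, grouping, and sort can be tried without a cluster. The table layouts and all rows are invented for illustration; only the query text comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, just enough to exercise the query
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, zipcode TEXT);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, cost INTEGER);
    INSERT INTO customers VALUES
        (1, '63101'), (2, '63105'), (3, '94105'), (4, '63101');
    INSERT INTO orders VALUES
        (10, 1, 50), (11, 2, 20), (12, 3, 99), (13, 4, 25), (14, 2, 30);
""")

# Total order cost per zip code, for zip codes starting with 63
rows = conn.execute("""
    SELECT zipcode, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    WHERE zipcode LIKE '63%'
    GROUP BY zipcode
    ORDER BY total DESC
""").fetchall()

print(rows)  # [('63101', 75), ('63105', 50)]
```

On Hive the same statement would be turned into MapReduce jobs rather than executed in-process, which is where the latency differences discussed later in this chapter come from.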

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 09#5%
High-Level Overview for Hive Users

- Hive runs on the client machine
    - Turns HiveQL queries into MapReduce jobs
    - Submits those jobs to the cluster

[Diagram: HiveQL Statements -> Hive Interpreter / Execution Engine -> MapReduce Jobs]

HiveQL statement:

SELECT zipcode, SUM(cost) AS total
FROM customers JOIN orders
ON customers.id = orders.cid
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

Hive interpreter / execution engine:
- Parse HiveQL
- Make optimizations
- Plan execution
- Generate MapReduce jobs
- Submit job(s) to Hadoop
- Monitor progress

Why Use Apache Hive?

- More productive than writing MapReduce directly
    - Five lines of HiveQL might be equivalent to 100 lines or more of Java
- Brings large-scale data analysis to a broader audience
    - No software development experience required
    - Leverage existing knowledge of SQL
- Offers interoperability with other systems
    - Extensible through Java and external scripts
    - Many business intelligence (BI) tools support Hive

Where to Get Hive

- CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Hive
    - A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
    - Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
    - Simple installation
    - 100% free and open source
- Installation is outside the scope of this course
    - Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop

Chapter Topics

Introduction to Hive

- What is Hive?
- Hive Schema and Data Storage (current topic)
- Comparing Hive to Traditional Databases
- Hive Use Cases
- Interacting with Hive
- Conclusion

How Hive Loads and Stores Data (1)

- Hive's queries operate on tables, just like in an RDBMS
    - A table is simply an HDFS directory containing one or more files
    - Default path: /user/hive/warehouse/<table_name>
    - Hive supports many formats for data storage and retrieval
- How does Hive know the structure and location of tables?
    - These are specified when tables are created
    - This metadata is stored in Hive's metastore
        - Contained in an RDBMS such as MySQL
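The split between metadata and data can be sketched in Python: a metastore entry records a table's columns and its directory, while the rows themselves live in files under that directory. The dictionary below is a hypothetical stand-in for the metastore, and the column list is invented; only the default warehouse path comes from the slide.

```python
import posixpath

# Default warehouse directory, as given on the slide
WAREHOUSE_DIR = "/user/hive/warehouse"


def default_table_path(table_name):
    """Return the default HDFS directory for a table."""
    return posixpath.join(WAREHOUSE_DIR, table_name)


# Hypothetical metastore content: structure and location, but no rows
metastore = {
    "customers": {
        "location": default_table_path("customers"),
        "columns": [("cust_id", "int"), ("fname", "string"), ("lname", "string")],
    },
}

entry = metastore["customers"]
print(entry["location"])                       # /user/hive/warehouse/customers
print([name for name, _ in entry["columns"]])  # ['cust_id', 'fname', 'lname']
```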

"

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 09#10%
How Hive Loads and Stores Data (2)

- Hive consults the metastore to determine data format and location
    - The query itself operates on data stored on a filesystem (typically HDFS)

[Diagram: 1. Get table structure and location of data from the Metastore (metadata in RDBMS); 2. Query actual data in the Hive Tables (HDFS)]

Chapter Topics

Introduction to Hive

- What is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases (current topic)
- Hive Use Cases
- Interacting with Hive
- Conclusion

Your Cluster is Not a Database Server

- Client-server database management systems have many strengths
    - Very fast response time
    - Support for transactions
    - Allows modification of existing records
    - Can serve thousands of simultaneous clients
- Hive does not turn your Hadoop cluster into an RDBMS
    - It simply produces MapReduce jobs from HiveQL queries
    - Limitations of HDFS and MapReduce still apply

Comparing Hive To A Relational Database

Feature                     Relational Database   Hive
Query language              SQL                   HiveQL
Update individual records   Yes                   No
Delete individual records   Yes                   No
Transactions                Yes                   No
Index support               Extensive             Limited
Latency                     Very low              High
Data size                   Terabytes             Petabytes

Chapter Topics

Introduction to Hive

- What is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive Use Cases (current topic)
- Interacting with Hive
- Conclusion

Use Case: Log File Analytics

- Server log files are an important source of data
- Hive allows you to treat a directory of log files like a table
    - Allows SQL-like queries against raw data

Dualcore Inc. Public Web Site (June 1 - 8)

Product    Unique Visitors   Page Views   Average Time on Page   Bounce Rate   Conversion Rate
Tablet     5,278             5,894        17 seconds             23%           65%
Notebook   4,139             4,375        23 seconds             47%           31%
Stereo     2,873             2,981        42 seconds             61%           12%
Monitor    1,749             1,862        26 seconds             74%           19%
Router     987               1,139        37 seconds             56%           17%
Server     314               504          53 seconds             48%           28%
Printer    86                97           34 seconds             27%           64%

Use Case: Sentiment Analysis

- Many organizations use Hive to analyze social media coverage

[Chart: Mentions of Dualcore on Social Media (by Hour); series: Negative, Neutral, Positive; x-axis: hours 07 through 18]

Chapter Topics

Introduction to Hive

- What is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive Use Cases
- Interacting with Hive (current topic)
- Conclusion

Using the Hive Shell

- You can execute HiveQL statements in the Hive Shell
    - This interactive tool is similar to the MySQL shell
- Run the hive command to start the Hive shell
    - The Hive shell will display its hive> prompt
    - Each statement must be terminated with a semicolon
    - Use the quit command to exit the Hive shell

$ hive
hive> SELECT cust_id, fname, lname
      FROM customers WHERE zipcode=20525;
1000000 Quentin Shepard
1000001 Brandon Louis
1000002 Marilyn Ham
hive> quit;
$

Accessing Hive from the Command Line

- You can also execute a file containing HiveQL code using the -f option

$ hive -f myquery.hql

- Or use HiveQL directly from the command line using the -e option

$ hive -e 'SELECT * FROM users'

- Use the -S (silent) option to suppress informational messages
    - Can also be used with -e or -f options

$ hive -S

Hive Properties

- Many aspects of Hive's behavior are configured through properties
    - Use set -v in Hive to see current values

hive> set -v;

- You can also use set to specify property values
    - The following enables column headers in query results

hive> set hive.cli.print.header=true;

- Hive runs the .hiverc file in your home directory at startup
    - Useful for specifying per-user defaults

Interacting with the Operating System and HDFS

- Use ! to execute system commands from within Hive
    - Neither pipes nor globs (wildcards) are supported

hive> ! date;
Mon May 20 16:44:35 PDT 2013

- Prefix HDFS commands with dfs to use them from within Hive

hive> dfs -mkdir /reports/sales/2013;

Accessing Hive with Hue

- Alternatively, you can access Hive through Hue
    - Web-based UI for many Hadoop-related services, including Hive
- To use Hue, browse to http://hue_server:8888/
    - May need to start Hue service first (sudo service hue start)
- Hue's Hive interface is called Beeswax
    - Launch by clicking its icon
- Beeswax features include:
    - Creating tables
    - Running queries
    - Browsing tables
    - Saving queries for later execution

[Image: Beeswax icon in Hue]

Hue's Beeswax Query Editor
[Screenshot: Hue - Hive Query page at https://hueserver.example.com:8888/beeswax/, with the following query in the editor]

SELECT * FROM employees WHERE state = 'CA';

Query Execution in Beeswax

[Screenshot]

Query Results in Beeswax

[Screenshot]

Interacting With HiveServer2: Hive as a Service (1)

- HiveServer2 can be deployed to provide a centralized Hive service
    - Uses a JDBC or ODBC connection
    - Supports Kerberos authentication

[Diagram: the Beeline CLI and JDBC/ODBC applications connect to the HiveServer2 server, which uses a shared metastore and submits MapReduce jobs (mappers and reducers) to the Hadoop cluster]

Interacting With HiveServer2: Hive as a Service (2)

- To connect to HiveServer2, use Hue or the Beeline CLI
    - You cannot use the Hive shell
    - For secure deployments, supply your user ID and password
- Example: starting Beeline and connecting to HiveServer2

[training@localhost analyst]$ beeline
Beeline version 0.10.0-cdh4.2.1 by Apache Hive

beeline> !connect jdbc:hive2://localhost:10000 training mypwd org.apache.hive.jdbc.HiveDriver

Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ

beeline>

Chapter Topics

Introduction to Hive

- What is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive Use Cases
- Interacting with Hive
- Conclusion (current topic)

Essential Points

- Hive is a high-level abstraction on top of MapReduce
    - Runs MapReduce jobs on Hadoop based on HiveQL statements
    - Originally developed for data warehousing at Facebook
- HiveQL is very similar to SQL
    - Easy to learn for those with relational database experience
    - However, Hive does not replace your RDBMS
- Hive tables are really directories of files in HDFS
    - Information about those tables is kept in Hive's metastore
- Hive and Pig are similar in many ways
    - But each also has its own distinct advantages

Bibliography

The following offer more information on topics discussed in this chapter
- Hive Web Site
    http://hive.apache.org/
- Beeline CLI reference
    http://sqlline.sourceforge.net/
- Programming Hive (book)
    http://tiny.cloudera.com/dac09a
- Data Analysis with Hadoop and Hive at Orbitz
    http://tiny.cloudera.com/dac09b
- Sentiment Analysis Using Apache Hive
    http://tiny.cloudera.com/dac09c

Relational Data Analysis with Hive
Chapter 10

Course Chapters

- Introduction
- Hadoop Fundamentals
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive (current chapter)
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Relational Data Analysis with Hive

In this chapter, you will learn
- How to explore databases and tables in Hive
- How HiveQL syntax compares to SQL
- Which data types Hive supports
- Which types of join operations Hive supports and how to use them
- How to use many of Hive's built-in functions

Chapter Topics

Relational Data Analysis with Hive

- Hive Databases and Tables (current topic)
- Basic HiveQL Syntax
- Data Types
- Joining Datasets
- Common Built-in Functions
- Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
- Conclusion

Hive Tables

- Data for Hive's tables is stored on the filesystem (typically HDFS)
    - Each table maps to a single directory
- A table's directory may contain multiple files
    - Typically delimited text files, but Hive supports many formats
    - Subdirectories are not allowed
- Hive uses the metastore to give context to this data
    - Helps map raw data in HDFS to named columns of specific types

Hive Databases

- Each Hive table belongs to a specific database
- Early versions of Hive supported only a single database
    - It placed all tables in the same database (named default)
    - This is still the default behavior
- Hive supports multiple databases as of release 0.7.0
    - Helpful for organization and authorization

Exploring Hive Databases and Tables (1)

- Which databases are available?

hive> SHOW DATABASES;
accounting
default
sales

- Switch between databases with the USE command

$ hive
hive> SELECT * FROM customers;     (queries table in the default database)
hive> USE sales;
hive> SELECT * FROM customers;     (queries table in the sales database)

Exploring Hive Databases and Tables (2)

- Which tables does the current database contain?

hive> USE accounting;
hive> SHOW TABLES;
invoices
taxes

- Which tables are contained in a different database?

hive> SHOW TABLES IN sales;
customers
prospects

Exploring Hive Databases and Tables (3)

- The DESCRIBE command displays basic structure for a table

hive> DESCRIBE orders;
order_id      int
cust_id       int
order_date    timestamp

- DESCRIBE FORMATTED shows more detailed information

hive> DESCRIBE FORMATTED orders;
# col_name    data_type    comment
order_id      int          None
cust_id       int          None
order_date    timestamp    None
# Detailed Table Information
... More follows ...

Chapter Topics

Rela*onal$Data$Analysis$with$Hive$

!! Hive"Databases"and"Tables"
!! Basic$HiveQL$Syntax$
!! Data"Types"
!! Joining"Datasets"
!! Common"Built/in"FuncAons"
!! Hands/On"Exercise:"Running"Hive"Queries"from"the"Shell,"Scripts,"and"Hue"
!! Conclusion"

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#10$
An Introduction To HiveQL

• HiveQL is Hive's query language
  - Based on a subset of SQL-92, plus Hive-specific extensions
• Some limitations compared to standard SQL
  - Some features are not supported
  - Others are only partially implemented
• HiveQL also has some features not offered in SQL

HiveQL Basics

• Hive keywords are not case-sensitive
  - Though they are often capitalized by convention
• Statements are terminated by a semicolon
  - A statement may span multiple lines
• Comments begin with -- (double hyphen)
  - Only supported in Hive scripts
  - There are no multi-line comments in Hive

    $ cat myscript.hql
    SELECT cust_id, fname, lname
      FROM customers
      WHERE zipcode='60601'; -- downtown Chicago

Selecting Data from Hive Tables

• The SELECT statement retrieves data from Hive tables
  - Can specify an ordered list of individual columns

    hive> SELECT cust_id, fname, lname FROM customers;

  - An asterisk matches all columns in the table

    hive> SELECT * FROM customers;

Limiting and Sorting Query Results

• The LIMIT clause sets the maximum number of rows returned

    hive> SELECT fname, lname FROM customers LIMIT 10;

• Caution: no guarantee regarding which 10 results are returned
  - Use ORDER BY for top-N queries
  - The field(s) you ORDER BY must be selected

    hive> SELECT cust_id, fname, lname FROM customers
              ORDER BY cust_id DESC LIMIT 10;
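
The top-N pattern above can be tried out with a minimal Python sketch. It uses the standard library's in-memory SQLite database purely as a stand-in for Hive (SQLite is not Hive, and behavior may differ in other respects), and the customer rows are invented for illustration.

```python
import sqlite3

# SQLite is only a stand-in for Hive here; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER, fname TEXT, lname TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, "First%d" % i, "Last%d" % i) for i in range(1, 26)],
)

# ORDER BY makes the choice of rows deterministic; LIMIT then keeps the top 3.
top3 = conn.execute(
    "SELECT cust_id, fname, lname FROM customers "
    "ORDER BY cust_id DESC LIMIT 3"
).fetchall()
top_ids = [row[0] for row in top3]
print(top_ids)
```

Without the ORDER BY clause, any three rows would be a valid answer, which is why LIMIT alone is not suitable for top-N queries.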
Using a WHERE Clause to Restrict Results

• WHERE clauses restrict rows to those matching specified criteria
  - String comparisons are case-sensitive

    hive> SELECT * FROM orders WHERE order_id=1287;

    hive> SELECT * FROM customers WHERE state
              IN ('CA', 'OR', 'WA', 'NV', 'AZ');

• You can combine expressions using AND or OR

    hive> SELECT * FROM customers
              WHERE fname LIKE 'Ann%'
              AND (city='Seattle' OR city='Portland');

Table Aliases

• Table aliases can help simplify complex queries

    hive> SELECT o.order_date, c.fname, c.lname
              FROM customers c JOIN orders o
              ON c.cust_id = o.cust_id
              WHERE c.zipcode='94306';

  - Note: Using AS to specify table aliases is not supported

Combining Query Results with UNION ALL

• Unifies output from SELECTs into a single result set
  - The name, order, and types of columns in each query must match
  - Hive only supports UNION ALL

    SELECT emp_id, fname, lname, salary
      FROM employees
      WHERE state='CA' AND salary > 75000
    UNION ALL
    SELECT emp_id, fname, lname, salary
      FROM employees
      WHERE state != 'CA' AND salary > 50000;

• UNION ALL can also be used with subqueries

Subqueries in Hive

• Hive only supports subqueries in the FROM clause of the SELECT statement
  - It does not support correlated subqueries

    SELECT prod_id, brand, name
      FROM (SELECT *
              FROM products
              WHERE (price - cost) / price > 0.65
              ORDER BY price DESC
              LIMIT 10) high_profits
      WHERE price > 1000
      ORDER BY brand, name;

• Hive allows arbitrary levels of subqueries
  - Each subquery must be named (like high_profits above)

Chapter Topics

Relational Data Analysis with Hive

• Data Types (current topic)

Hive's Data Types

• Each column in Hive is associated with a data type
• Hive supports more than a dozen types
  - Most are similar to ones found in relational databases
  - Hive also supports three complex types
• Use the DESCRIBE command to see a table's column types

    hive> DESCRIBE products;
    prod_id        int
    brand          string
    name           string
    price          int
    cost           int
    shipping_wt    int

Hive's Integer Types

• Integer types are appropriate for whole numbers
  - Both positive and negative values allowed

    Name       Description                                      Example Value
    TINYINT    Range: -128 to 127                               17
    SMALLINT   Range: -32,768 to 32,767                         5842
    INT        Range: -2,147,483,648 to 2,147,483,647           84127213
    BIGINT     Range: ~ -9.2 quintillion to ~ 9.2 quintillion   632197432180964
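
The ranges in the table follow from the storage width of each type. The Python sketch below assumes signed two's-complement integers of 1, 2, 4, and 8 bytes and recomputes the limits; the helper names are hypothetical and not part of Hive.

```python
# Assumed storage widths for Hive's signed integer types, in bytes.
WIDTH_IN_BYTES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}

def signed_range(num_bytes):
    """Return (min, max) for a two's-complement integer of the given width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def smallest_type_for(value):
    """Return the narrowest type name whose range contains value, or None."""
    for name, width in WIDTH_IN_BYTES.items():
        low, high = signed_range(width)
        if low <= value <= high:
            return name
    return None

for name, width in WIDTH_IN_BYTES.items():
    print(name, signed_range(width))
```

For instance, smallest_type_for(5842) picks SMALLINT, matching the example value shown for that type in the table.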
Hive's Decimal Types

• Decimal types are appropriate for floating point numbers
  - Both positive and negative values allowed
  - Caution: avoid using when exact values are required!

    Name      Description              Example Value
    FLOAT     Decimals                 3.14159
    DOUBLE    Very precise decimals    3.14159265358979323846
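
The caution about exact values can be illustrated outside Hive. This Python sketch assumes FLOAT is a 4-byte and DOUBLE an 8-byte IEEE 754 value (Python's own float is an 8-byte double); the Decimal comparison at the end is a Python-side illustration of exact arithmetic, not a statement about Hive's types.

```python
import struct
from decimal import Decimal

def as_float32(x):
    """Round-trip x through a 4-byte IEEE 754 float, as a FLOAT would store it."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Binary floating point cannot represent most decimal fractions exactly.
total = 0.1 + 0.2
print(total == 0.3)                      # the comparison is not exact

# A 4-byte float keeps only about 7 significant decimal digits.
print(as_float32(3.14159) == 3.14159)

# An 8-byte double keeps about 15 to 17 digits; extra digits are dropped.
stored_double = float("3.14159265358979323846")
print(repr(stored_double))

# Exact decimal arithmetic (Python's Decimal) does not have this problem.
exact_total = Decimal("0.10") + Decimal("0.20")
print(exact_total == Decimal("0.30"))
```

This is why amounts such as currency are better kept in a form that does not rely on binary floating point when exact results matter.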
Other Simple Types in Hive

• Hive can also store several other types of information
  - Only one character type (variable length)

    Name          Description           Example Value
    STRING        Character sequence    Betty F. Smith
    BOOLEAN       True or False         TRUE
    TIMESTAMP*    Instant in time       2013-06-14 16:51:05
    BINARY*       Raw bytes             N/A

    * Not available in older versions of Hive

Complex Column Types in Hive

• Hive also has a few complex data types
  - These are capable of holding multiple values

    Column Type   Description                                    Usage in Query
    ARRAY         Ordered list of values, all of the same type   departments[0]
    MAP           Key-value pairs, each of the same type         employees['BF5314']
    STRUCT        Named fields, of possibly mixed types          address.street

Chapter Topics

Relational Data Analysis with Hive

• Joining Datasets (current topic)

Joins in Hive

• Joining disparate data sets is a common operation in Hive
• Hive supports several types of joins
  - Inner joins
  - Outer joins (left, right, and full)
  - Cross joins (supported in Hive 0.10 and later)
  - Left semi joins
• Only equality conditions are allowed in joins
  - Valid:   customers.cust_id = orders.cust_id
  - Invalid: customers.cust_id <> orders.cust_id
  - Outputs records where the specified key is found in each table
• For best performance, list the largest table last in your query

Join Syntax

• Hive requires the following syntax for joins

    SELECT c.cust_id, name, total
      FROM customers c
      JOIN orders o ON (c.cust_id = o.cust_id);

• The above example is an inner join
  - Can replace JOIN with another type (e.g. RIGHT OUTER JOIN)
• The following join syntax is not valid in Hive

    SELECT c.cust_id, name, total
      FROM customers c, orders o
      WHERE (c.cust_id = o.cust_id);

Inner Join Example

customers table
    cust_id   name     country
    a         Alice    us
    b         Bob      ca
    c         Carlos   mx
    d         Dieter   de

orders table
    order_id   cust_id   total
    1          a         1539
    2          c         1871
    3          a         6352
    4          b         1456
    5          z         2137

    SELECT c.cust_id, name, total
      FROM customers c
      JOIN orders o
        ON (c.cust_id = o.cust_id);

Result of query
    cust_id   name     total
    a         Alice    1539
    a         Alice    6352
    b         Bob      1456
    c         Carlos   1871
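
The inner join result can be reproduced with a small Python sketch. It loads the same two sample tables into an in-memory SQLite database, which is used here only as a convenient stand-in for Hive; the added ORDER BY is an assumption made so that the row order is predictable.

```python
import sqlite3

# SQLite is only a stand-in for Hive; the data is the sample data from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id TEXT, name TEXT, country TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id TEXT, total INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    ("a", "Alice", "us"), ("b", "Bob", "ca"),
    ("c", "Carlos", "mx"), ("d", "Dieter", "de"),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "a", 1539), (2, "c", 1871), (3, "a", 6352),
    (4, "b", 1456), (5, "z", 2137),
])

# Only rows whose cust_id appears in both tables survive an inner join.
inner_rows = conn.execute(
    "SELECT c.cust_id, name, total "
    "FROM customers c JOIN orders o ON (c.cust_id = o.cust_id) "
    "ORDER BY c.cust_id, total"
).fetchall()
for row in inner_rows:
    print(row)
```

Dieter (no orders) and order 5 (unknown customer z) are both absent from the output, since neither has a match on the other side.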
Left Outer Join Example

customers table
    cust_id   name     country
    a         Alice    us
    b         Bob      ca
    c         Carlos   mx
    d         Dieter   de

orders table
    order_id   cust_id   total
    1          a         1539
    2          c         1871
    3          a         6352
    4          b         1456
    5          z         2137

    SELECT c.cust_id, name, total
      FROM customers c
      LEFT OUTER JOIN orders o
        ON (c.cust_id = o.cust_id);

Result of query
    cust_id   name     total
    a         Alice    1539
    a         Alice    6352
    b         Bob      1456
    c         Carlos   1871
    d         Dieter   NULL

Right Outer Join Example

customers table
    cust_id   name     country
    a         Alice    us
    b         Bob      ca
    c         Carlos   mx
    d         Dieter   de

orders table
    order_id   cust_id   total
    1          a         1539
    2          c         1871
    3          a         6352
    4          b         1456
    5          z         2137

    SELECT c.cust_id, name, total
      FROM customers c
      RIGHT OUTER JOIN orders o
        ON (c.cust_id = o.cust_id);

Result of query
    cust_id   name     total
    a         Alice    1539
    a         Alice    6352
    b         Bob      1456
    c         Carlos   1871
    NULL      NULL     2137

Full Outer Join Example

customers table
    cust_id   name     country
    a         Alice    us
    b         Bob      ca
    c         Carlos   mx
    d         Dieter   de

orders table
    order_id   cust_id   total
    1          a         1539
    2          c         1871
    3          a         6352
    4          b         1456
    5          z         2137

    SELECT c.cust_id, name, total
      FROM customers c
      FULL OUTER JOIN orders o
        ON (c.cust_id = o.cust_id);

Result of query
    cust_id   name     total
    a         Alice    1539
    a         Alice    6352
    b         Bob      1456
    c         Carlos   1871
    d         Dieter   NULL
    NULL      NULL     2137

Using An Outer Join to Find Unmatched Entries

customers table
    cust_id   name     country
    a         Alice    us
    b         Bob      ca
    c         Carlos   mx
    d         Dieter   de

orders table
    order_id   cust_id   total
    1          a         1539
    2          c         1871
    3          a         6352
    4          b         1456
    5          z         2137

    SELECT c.cust_id, name, total
      FROM customers c
      FULL OUTER JOIN orders o
        ON (c.cust_id = o.cust_id)
      WHERE c.cust_id IS NULL
         OR o.total IS NULL;

Result of query
    cust_id   name     total
    d         Dieter   NULL
    NULL      NULL     2137
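
To show where the NULL values come from, here is a plain-Python sketch of a full outer join over the same sample data, followed by the filter for unmatched entries. It is an illustration of the semantics only, not how Hive executes the join; the function name is hypothetical and None plays the role of NULL.

```python
customers = [("a", "Alice", "us"), ("b", "Bob", "ca"),
             ("c", "Carlos", "mx"), ("d", "Dieter", "de")]
orders = [(1, "a", 1539), (2, "c", 1871), (3, "a", 6352),
          (4, "b", 1456), (5, "z", 2137)]

def full_outer_join(customers, orders):
    """Return (c.cust_id, name, total) rows; None plays the role of NULL."""
    rows = []
    matched_order_ids = set()
    for cust_id, name, _country in customers:
        matches = [o for o in orders if o[1] == cust_id]
        if not matches:
            rows.append((cust_id, name, None))     # customer with no orders
        for order_id, _cid, total in matches:
            rows.append((cust_id, name, total))
            matched_order_ids.add(order_id)
    for order_id, _cid, total in orders:
        if order_id not in matched_order_ids:
            rows.append((None, None, total))       # order with no customer
    return rows

joined = full_outer_join(customers, orders)
# Equivalent of: WHERE c.cust_id IS NULL OR o.total IS NULL
unmatched = [row for row in joined if row[0] is None or row[2] is None]
print(unmatched)
```

The filter keeps exactly the rows that the outer join had to pad with NULL, which is what makes this a handy way to find orphaned records on either side.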
Cross Join Example

disks table
    name
    Internal hard disk
    External hard disk

sizes table
    size
    1.0 terabytes
    2.0 terabytes
    3.0 terabytes
    4.0 terabytes

    SELECT * FROM disks
      CROSS JOIN sizes;

Result of query
    name                 size
    Internal hard disk   1.0 terabytes
    Internal hard disk   2.0 terabytes
    Internal hard disk   3.0 terabytes
    Internal hard disk   4.0 terabytes
    External hard disk   1.0 terabytes
    External hard disk   2.0 terabytes
    External hard disk   3.0 terabytes
    External hard disk   4.0 terabytes
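
A cross join is a Cartesian product, which the following Python sketch reproduces with itertools.product on the same two sample tables. It only illustrates the result shape; row order in Hive is not guaranteed to match.

```python
from itertools import product

disks = ["Internal hard disk", "External hard disk"]
sizes = ["1.0 terabytes", "2.0 terabytes", "3.0 terabytes", "4.0 terabytes"]

# A cross join pairs every row of one table with every row of the other.
cross = list(product(disks, sizes))
for name, size in cross:
    print(name, size)
```

The result always has len(disks) * len(sizes) rows, so cross joins on large tables grow very quickly.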
Left Semi Joins (1)

• A less common type of join is the LEFT SEMI JOIN
  - It is a special (and efficient) type of inner join
  - It behaves more like a filter than a join
• Left semi joins include additional criteria in the ON clause
  - Only unique records that match these criteria are returned
  - Fields listed in SELECT are limited to the left-side table

    SELECT c.cust_id
      FROM customers c
      LEFT SEMI JOIN orders o
        ON (c.cust_id = o.cust_id
        AND YEAR(o.order_date) = '2012');

Left Semi Joins (2)

• Hive does not support IN/EXISTS subqueries, so a query like this one fails

    SELECT c.cust_id FROM customers c
      WHERE c.cust_id IN
        (SELECT o.cust_id FROM orders o
          WHERE YEAR(o.order_date) = '2012');

• Using a LEFT SEMI JOIN is a common workaround

    SELECT c.cust_id
      FROM customers c
      LEFT SEMI JOIN orders o
        ON (c.cust_id = o.cust_id
        AND YEAR(o.order_date) = '2012');
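
The filter-like behavior of a left semi join can be sketched in plain Python. The data below is invented (orders are reduced to a customer ID and an order year), and both function names are hypothetical; the inner join is included only for contrast.

```python
customers = ["a", "b", "c", "d"]   # cust_id values (invented sample data)
orders = [("a", 2012), ("a", 2012), ("b", 2011), ("c", 2012), ("z", 2012)]
# each order row is (cust_id, order_year)

def left_semi_join(customers, orders, year):
    """Keep each customer at most once if any order matches the criteria."""
    keys = {cust_id for cust_id, order_year in orders if order_year == year}
    return [cust_id for cust_id in customers if cust_id in keys]

def inner_join(customers, orders, year):
    """For contrast: an inner join emits one row per matching order."""
    return [c for c in customers
            for cust_id, order_year in orders
            if c == cust_id and order_year == year]

semi = left_semi_join(customers, orders, 2012)
inner = inner_join(customers, orders, 2012)
print(semi)
print(inner)
```

Customer a placed two orders in 2012, so it appears twice in the inner join but only once in the semi join, which matches the behavior of an IN subquery.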
Chapter Topics

Relational Data Analysis with Hive

• Common Built-in Functions (current topic)

Hive Functions

• Hive offers dozens of built-in functions
  - Many are identical to those found in SQL
  - Others are Hive-specific
• Example function invocation
  - Function names are not case-sensitive

    hive> SELECT CONCAT(fname, ' ', lname) AS fullname
              FROM customers;

• To see information about a function

    hive> DESCRIBE FUNCTION UPPER;
    UPPER(str) - Returns str with all characters
    changed to uppercase

Example Built-in Functions (1)

• These functions operate on numeric values

    Function Description                Example Invocation      Input     Output
    Rounds to specified # of decimals   ROUND(total_price, 2)   23.492    23.49
    Returns nearest integer above       CEIL(total_price)       23.492    24
    Returns nearest integer below       FLOOR(total_price)      23.492    23
    Returns absolute value              ABS(temperature)        -49       49
    Returns square root                 SQRT(area)              64        8
    Returns a random number             RAND()                  (none)    0.584977

Example Built-in Functions (2)

• These functions operate on timestamp values

    Function Description              Example Invocation            Input                    Output
    Convert to UNIX format            UNIX_TIMESTAMP(order_dt)      2013-06-14 16:51:05      1371243065
    Convert to string format          FROM_UNIXTIME(mod_time)       1371243065               2013-06-14 16:51:05
    Extract date portion              TO_DATE(order_dt)             2013-06-14 16:51:05      2013-06-14
    Extract year portion              YEAR(order_dt)                2013-06-14 16:51:05      2013
    Returns # of days between dates   DATEDIFF(ship_dt, order_dt)   2013-06-17, 2013-06-14   3

  - DATEDIFF takes the end date first; reversing the arguments yields -3
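
The conversions in this table can be checked with a short Python sketch. The functions below are rough analogues written for illustration, not Hive's implementations. One loud assumption: the epoch value 1371243065 lines up with 2013-06-14 16:51:05 only in a UTC-4 time zone, so that offset is hard-coded here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the example values correspond to a UTC-4 session time zone.
TZ = timezone(timedelta(hours=-4))

def unix_timestamp(text):
    """Rough analogue of UNIX_TIMESTAMP for 'YYYY-MM-DD HH:MM:SS' strings."""
    dt = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=TZ)
    return int(dt.timestamp())

def from_unixtime(seconds):
    """Rough analogue of FROM_UNIXTIME, formatted in the assumed time zone."""
    return datetime.fromtimestamp(seconds, TZ).strftime("%Y-%m-%d %H:%M:%S")

def datediff(end_date, start_date):
    """Days from start_date to end_date; note that the end date comes first."""
    fmt = "%Y-%m-%d"
    delta = datetime.strptime(end_date, fmt) - datetime.strptime(start_date, fmt)
    return delta.days

print(unix_timestamp("2013-06-14 16:51:05"))
print(from_unixtime(1371243065))
print(datediff("2013-06-17", "2013-06-14"))
```

Because epoch conversions depend on the time zone in effect, the same timestamp string can map to a different number on a cluster configured for another zone.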
Example Built-in Functions (3)

• Here are some other interesting functions

    Function Description           Example Invocation           Input    Output
    Converts to uppercase          UPPER(fname)                 Bob      BOB
    Extract portion of string      SUBSTRING(name, 0, 2)        Alice    Al
    Selectively return value       IF(price > 1000, 'A', 'B')   1500     A
    Convert to another type        CAST(weight as INT)          3.581    3
    Returns size of array or map   SIZE(array_field)            N/A      6

Record Grouping and Aggregate Functions

• GROUP BY groups selected data by one or more columns
  - Caution: Columns not part of aggregation must be listed in GROUP BY

stores table
    id   city      state   region
    a    Albany    NY      EAST
    b    Boston    MA      EAST
    c    Chicago   IL      NORTH
    d    Detroit   MI      NORTH
    e    Elgin     IL      NORTH

    SELECT region, state,
           COUNT(id) AS num
      FROM stores
      GROUP BY region, state;

Result of query
    region   state   num
    EAST     MA      1
    EAST     NY      1
    NORTH    IL      2
    NORTH    MI      1
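
The grouping above can be reproduced with a small Python sketch that runs the same query shape against an in-memory SQLite database. SQLite serves only as a stand-in for Hive, and the ORDER BY clause is an addition made here so the output order is predictable.

```python
import sqlite3

# SQLite is only a stand-in for Hive; the rows are the slide's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (id TEXT, city TEXT, state TEXT, region TEXT)")
conn.executemany("INSERT INTO stores VALUES (?, ?, ?, ?)", [
    ("a", "Albany", "NY", "EAST"), ("b", "Boston", "MA", "EAST"),
    ("c", "Chicago", "IL", "NORTH"), ("d", "Detroit", "MI", "NORTH"),
    ("e", "Elgin", "IL", "NORTH"),
])

# Both non-aggregated columns (region, state) appear in GROUP BY.
grouped = conn.execute(
    "SELECT region, state, COUNT(id) AS num FROM stores "
    "GROUP BY region, state ORDER BY region, state"
).fetchall()
for row in grouped:
    print(row)
```

Chicago and Elgin share the (NORTH, IL) group, which is why that row has a count of 2.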
Built-in Aggregate Functions

• Hive offers many aggregate functions, including

    Function Description                          Example Invocation
    Count all rows                                COUNT(*)
    Count all rows where field is not null        COUNT(fname)
    Count distinct non-null values of field       COUNT(DISTINCT fname)
    Returns the largest value                     MAX(salary)
    Returns the smallest value                    MIN(salary)
    Adds all supplied values and returns result   SUM(price)
    Returns the average of all supplied values    AVG(salary)
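
The difference between the three COUNT forms is easiest to see on data with a NULL and a duplicate. The Python sketch below uses an in-memory SQLite database as a stand-in for Hive; the employees rows are invented for illustration.

```python
import sqlite3

# SQLite is only a stand-in for Hive; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (fname TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [
    ("Ann", 50000), ("Bob", 70000), ("Ann", 90000), (None, 30000),
])

stats = conn.execute(
    "SELECT COUNT(*), COUNT(fname), COUNT(DISTINCT fname), "
    "MAX(salary), MIN(salary), SUM(salary), AVG(salary) FROM employees"
).fetchone()
print(stats)
```

COUNT(*) sees all four rows, COUNT(fname) skips the row with no name, and COUNT(DISTINCT fname) counts Ann only once.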
Chapter Topics

Relational Data Analysis with Hive

• Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue (current topic)

Hands-On Exercise: Running Hive Queries

• In this Hands-On Exercise, you will run Hive queries from the Hive shell, the command line, Hive scripts, and Hue
• Please refer to the Hands-On Exercise Manual for instructions

Chapter Topics

Relational Data Analysis with Hive

• Conclusion (current topic)

Essential Points

• Every Hive table belongs to exactly one database
  - The SHOW DATABASES command lists databases
  - The USE command switches the active database
  - The SHOW TABLES command lists all tables in a database
• Every column in a Hive table has an associated data type
  - Most simple column types are similar to SQL
  - Hive also supports a few complex types
• HiveQL syntax is familiar to those who know SQL
  - A subset of SQL-92, plus Hive-specific extensions
  - Supports inner, outer, and left semi joins
  - Many SQL functions are built into Hive

Bibliography

The following offer more information on topics discussed in this chapter
• HiveQL Language Manual
  - http://tiny.cloudera.com/dac10a
• Hive Built-In Functions
  - http://tiny.cloudera.com/dac10b

Hive Data Management
Chapter 11

Course Chapters

• Hive Data Management (current chapter)

Hive Data Management

In this chapter you will learn
• How Hive encodes and stores data
• How to create Hive databases, tables, and views
• How to load data into tables
• How to alter and remove tables
• How to save query results into tables and files
• How to control access to data in Hive

Chapter Topics

Hive Data Management

• Hive Data Formats (current topic)
• Creating Databases and Hive-Managed Tables
• Loading Data into Hive
• Altering Databases and Tables
• Self-Managed Tables
• Simplifying Queries with Views
• Storing Query Results
• Controlling Access to Data
• Hands-On Exercise: Data Management with Hive
• Conclusion

Hive Text File Format

• Each Hive table maps to a directory of data, typically in HDFS
  - Data stored in one or more files
• Default data format is plain text
  - One record per line (record separator is \n)
  - Columns are delimited by ^A (field separator is Control-A)
  - Complex type elements are separated by ^B (Control-B)
  - Map keys/values are separated by ^C (Control-C)
• Hive allows you to override these delimiters
  - Specify alternate values during table creation
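
As a concrete picture of the default delimiters, the Python sketch below builds one text line for a row with an integer, an array, and a map. The column layout and the encode_row helper are hypothetical; only the three control characters come from the description above.

```python
FIELD_SEP = "\x01"       # ^A, Control-A: separates columns
ITEM_SEP = "\x02"        # ^B, Control-B: separates complex type elements
KEY_VALUE_SEP = "\x03"   # ^C, Control-C: separates a map key from its value

def encode_row(store_id, departments, staff):
    """Build one text line for hypothetical int, array, and map columns."""
    departments_text = ITEM_SEP.join(departments)
    staff_text = ITEM_SEP.join(
        key + KEY_VALUE_SEP + value for key, value in staff.items()
    )
    return FIELD_SEP.join([str(store_id), departments_text, staff_text]) + "\n"

line = encode_row(1, ["Audio", "Photo"], {"A123": "Abe", "B456": "Bob"})
print(repr(line))
```

Control characters rarely occur in real text, which makes them safe defaults, but they are awkward to type and view, so human-readable delimiters are often chosen at table creation instead.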
Hive Binary File Formats

• Hive also supports a few binary formats
  - Pro: may offer better performance than text files
  - Con: may limit ability to read data from other programs
• SequenceFile is a Hadoop-specific binary format
  - Avoids the need to convert all data to/from strings
  - Supports block- and record-level data compression
• RCFile is a binary format created especially for Hive
  - RC stands for Record Columnar
  - Column-oriented format efficient for some queries
  - Otherwise, very similar to sequence files

Chapter Topics

Hive Data Management

• Creating Databases and Hive-Managed Tables (current topic)

Creating Databases in Hive

• Hive databases are simply namespaces
  - Helps to organize your tables
• To create a new database

    hive> CREATE DATABASE dualcore;

• To conditionally create a new database
  - Avoids error in case database already exists (useful for scripting)

    hive> CREATE DATABASE IF NOT EXISTS dualcore;

Creating a Table In Hive (1)

• Basic syntax for creating a table:

    CREATE TABLE tablename (colname DATATYPE, ...)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY char
      STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

• Creates a subdirectory under /user/hive/warehouse in HDFS
  - This is Hive's warehouse directory

Creating a Table In Hive (2)

    CREATE TABLE tablename (colname DATATYPE, ...)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY char
      STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

• CREATE TABLE line: specify a name for the table, and list the column names and datatypes (see later)

Creating a Table In Hive (3)

    CREATE TABLE tablename (colname DATATYPE, ...)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY char
      STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

• ROW FORMAT DELIMITED line: states that fields in each file in the table's directory are delimited by some character. Hive's default delimiter is Control-A, but you may specify an alternate delimiter...

Creating a Table In Hive (4)

    CREATE TABLE tablename (colname DATATYPE, ...)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY char
      STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

• FIELDS TERMINATED BY line: for example, tab-delimited data would require that you specify FIELDS TERMINATED BY '\t'

Creating a Table In Hive (5)

    CREATE TABLE tablename (colname DATATYPE, ...)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY char
      STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

• STORED AS line: finally, you may declare the file format. STORED AS TEXTFILE is the default and does not need to be specified

Example Table Definition

• The following example creates a new table named jobs
  - Data stored as text with four comma-separated fields per line

    CREATE TABLE jobs
      (id INT,
       title STRING,
       salary INT,
       posted TIMESTAMP
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ',';

  - Example of corresponding record for the table above

    1,Data Analyst,100000,2013-06-21 15:52:03

Complex Column Types

• Recap: Hive has support for complex data types

    Column Type   Description
    ARRAY         Ordered list of values, all of the same type
    MAP           Key-value pairs, each of the same type
    STRUCT        Named fields, of possibly mixed types

Creating Tables with Complex Column Types

• Complex columns are typed
  - Arrays have a single type
  - Maps have a key type and a value type
  - Structs have a type for each attribute
• These types are specified with angle brackets

    CREATE TABLE stores
      (store_id SMALLINT,
       departments ARRAY<STRING>,
       staff MAP<STRING, STRING>,
       address STRUCT<street:STRING,
                      city:STRING,
                      state:STRING,
                      zipcode:STRING>
      );

Customizing Complex Type Storage

• We can control which delimiters are used for complex types

    CREATE TABLE stores
      (store_id SMALLINT,
       departments ARRAY<STRING>,
       staff MAP<STRING, STRING>,
       address STRUCT<street:STRING,
                      city:STRING,
                      state:STRING,
                      zipcode:STRING>
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '|'
      COLLECTION ITEMS TERMINATED BY ','
      MAP KEYS TERMINATED BY ':';

Row Format Example for Complex Types

• The following is an example of a record in that table

    1|Audio,Photo|A123:Abe,B456:Bob|123 Oak St.,Ely,MN,55731

  - Fields in order: id | departments | staff | address (street, city, state, zipcode)

• Example queries on this record

    hive> SELECT departments[0] FROM stores;
    Audio

    hive> SELECT staff['B456'] FROM stores;
    Bob

    hive> SELECT address.city FROM stores;
    Ely
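
To see how the declared delimiters split that record, here is a minimal Python sketch that parses the same line into a list, a dictionary, and a nested dictionary. The parse_store function is hypothetical and only mimics the '|', ',' and ':' delimiters; it is not how Hive reads data internally.

```python
def parse_store(line):
    """Split one record using '|' for fields, ',' for items, ':' for map keys."""
    store_id, departments, staff, address = line.rstrip("\n").split("|")
    street, city, state, zipcode = address.split(",")
    return {
        "store_id": int(store_id),
        # ARRAY<STRING>: ordered list of values
        "departments": departments.split(","),
        # MAP<STRING, STRING>: key-value pairs
        "staff": dict(item.split(":") for item in staff.split(",")),
        # STRUCT: named fields
        "address": {"street": street, "city": city,
                    "state": state, "zipcode": zipcode},
    }

row = parse_store("1|Audio,Photo|A123:Abe,B456:Bob|123 Oak St.,Ely,MN,55731")
print(row["departments"][0])     # analogous to departments[0]
print(row["staff"]["B456"])      # analogous to staff['B456']
print(row["address"]["city"])    # analogous to address.city
```

Note that the chosen delimiters must not occur inside the data itself; a street name containing a comma would break this record.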
Chapter Topics

Hive Data Management

• Loading Data into Hive (current topic)

Data Validation in Hive

• Hadoop and its ecosystem are "schema on read"
  - Unlike an RDBMS, Hive does not validate data on insert
  - Files are simply moved into place
  - Loading data into tables is therefore very fast
  - Errors in file format will be discovered when queries are performed
• Missing or invalid data in Hive will be represented as NULL
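
Schema on read can be sketched in a few lines of Python: the raw text is stored untouched, and types are applied only when a row is read. The example borrows the column layout of the jobs table defined earlier in this chapter; the read_row function and converters are hypothetical, and None stands in for NULL.

```python
from datetime import datetime

def to_int(text):
    return int(text)

def to_timestamp(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

# Column names and converters mirroring the jobs table (id, title, salary, posted).
JOBS_SCHEMA = [("id", to_int), ("title", str),
               ("salary", to_int), ("posted", to_timestamp)]

def read_row(line, schema=JOBS_SCHEMA, sep=","):
    """Apply the schema at read time; bad or missing fields become None."""
    fields = line.rstrip("\n").split(sep)
    row = {}
    for index, (name, convert) in enumerate(schema):
        try:
            row[name] = convert(fields[index])
        except (IndexError, ValueError):
            row[name] = None
    return row

good = read_row("1,Data Analyst,100000,2013-06-21 15:52:03")
bad = read_row("2,Engineer,lots")
print(good)
print(bad)
```

The second line loads without complaint even though its salary is not a number and its posted field is missing; the problems surface only as None values when the row is read.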
Loading Data From Files (1)

• To load data, simply add files to the table's directory in HDFS
  - Can be done directly using hadoop fs commands
  - This example loads data from HDFS into Hive's sales table

    $ hadoop fs -mv sales.txt /user/hive/warehouse/sales/

• Alternatively, use Hive's LOAD DATA INPATH command
  - Done from within the Hive shell (or a Hive script)
  - This moves data within HDFS, just like the command above
  - Source can be either a file or directory

    hive> LOAD DATA INPATH 'sales.txt' INTO TABLE sales;

Loading Data From Files (2)

• Add the LOCAL keyword to load data from the local disk
  - Copies a local file (or directory) to the table's directory in HDFS

    hive> LOAD DATA LOCAL INPATH '/home/bob/sales.txt'
              INTO TABLE sales;

  - This is equivalent to the following hadoop fs command

    $ hadoop fs -put /home/bob/sales.txt \
        /user/hive/warehouse/sales

Loading Data From Files (3)

• Add the OVERWRITE keyword to delete all records before import
  - Removes all files within the table's directory
  - Then moves the new files into that directory

    hive> LOAD DATA INPATH '/depts/finance/salesdata'
              OVERWRITE INTO TABLE sales;

Loading Data From a Relational Database

• Sqoop has built-in support for importing data into Hive
• Just add the --hive-import option to your Sqoop command
  - Creates the table in Hive (metastore)
  - Imports data from RDBMS to table's directory in HDFS

    $ sqoop import \
        --connect jdbc:mysql://localhost/dualcore \
        --username training \
        --password training \
        --fields-terminated-by '\t' \
        --table employees \
        --hive-import

Chapter Topics

Hive Data Management

• Altering Databases and Tables (current topic)

Removing a Database

• Removing a database is similar to creating it
  - Just replace CREATE with DROP

    hive> DROP DATABASE dualcore;

    hive> DROP DATABASE IF EXISTS dualcore;

• These commands will fail if the database contains tables
  - Add the CASCADE keyword to force removal
  - Caution: this command removes data in HDFS!

    hive> DROP DATABASE dualcore CASCADE;

Removing a Table

• Syntax is similar to database removal

    hive> DROP TABLE customers;

    hive> DROP TABLE IF EXISTS customers;

• Caution: These commands can remove data in HDFS!
  - Hive does not have a rollback or undo feature

Renaming Tables and Columns

• Use ALTER TABLE to modify a table's metadata
  - This command does not change data in HDFS
• Rename an existing table

    hive> ALTER TABLE customers RENAME TO clients;

• Rename a column by specifying its old and new names
  - Type must be specified even though it is unchanged

    hive> ALTER TABLE clients
              CHANGE fname first_name STRING;

  - CHANGE arguments, in order: old name (fname), new name (first_name), type (STRING)

Modifying and Adding Columns

• You can also modify a column's type
  - The old and new column names will be the same
  - You must ensure data in HDFS conforms to the new type

    hive> ALTER TABLE jobs CHANGE salary salary BIGINT;

  - CHANGE arguments, in order: old name (salary), new name (salary), type (BIGINT)

• Add new columns to a table
  - Appended to the end of any existing columns
  - Existing data will have NULL values for new columns

    hive> ALTER TABLE jobs
              ADD COLUMNS (city STRING, bonus INT);

Reordering and Removing Columns

• Use AFTER or FIRST to reorder columns
  - Ensure that your data in HDFS matches the new order

    hive> ALTER TABLE jobs
              CHANGE bonus bonus INT AFTER salary;

    hive> ALTER TABLE jobs
              CHANGE bonus bonus INT FIRST;

• Use REPLACE COLUMNS to remove columns
  - Any column not listed will be dropped from metadata

    hive> ALTER TABLE jobs REPLACE COLUMNS
              (id INT,
               title STRING,
               salary INT);

Chapter Topics

Hive Data Management

• Self-Managed Tables (current topic)

Controlling Table Data Location

- By default, Hive stores data associated with a table in its warehouse directory
- Storing data below Hive's warehouse directory is not always ideal
  - Data might be shared by several users
- Use LOCATION to specify the directory where table data resides

CREATE TABLE adclicks
  (campaign_id STRING,
   click_time TIMESTAMP,
   keyword STRING,
   site STRING,
   placement STRING,
   was_clicked BOOLEAN,
   cost SMALLINT)
LOCATION '/dualcore/ad_data';

Self-Managed (External) Tables

- Recall that dropping a table removes its data in HDFS
- Using EXTERNAL when creating the table avoids this behavior
  - Dropping an external table removes only its metadata

CREATE EXTERNAL TABLE adclicks
  (campaign_id STRING,
   click_time TIMESTAMP,
   keyword STRING,
   site STRING,
   placement STRING,
   was_clicked BOOLEAN,
   cost SMALLINT)
LOCATION '/dualcore/ad_data';
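
The effect of EXTERNAL can be checked from the Hive shell. The following is a minimal sketch, assuming the adclicks table above was created as an external table and that data files already exist under /dualcore/ad_data:

```sql
-- The "Table Type" row of the output should read EXTERNAL_TABLE
DESCRIBE FORMATTED adclicks;

-- Removes only the metastore entry for the table
DROP TABLE adclicks;

-- Hive shell command that lists an HDFS path;
-- the data files should still be present after the DROP
dfs -ls /dualcore/ad_data;
```

Re-running the CREATE EXTERNAL TABLE statement afterwards makes the same files queryable again, since the data itself was never touched.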
Simplifying Complex Queries

- Complex queries can become cumbersome
  - Imagine typing this several times for different orders

SELECT o.order_id, order_date, p.prod_id, brand, name
  FROM orders o
  JOIN order_details d
    ON (o.order_id = d.order_id)
  JOIN products p
    ON (d.prod_id = p.prod_id)
  WHERE o.order_id=6584288;

Creating Views

- Views in Hive are conceptually like a table, but backed by a query
  - You cannot directly add data to a view

CREATE VIEW order_info AS
  SELECT o.order_id, order_date, p.prod_id, brand, name
  FROM orders o
  JOIN order_details d
    ON (o.order_id = d.order_id)
  JOIN products p
    ON (d.prod_id = p.prod_id);

- Our query is now greatly simplified

hive> SELECT * FROM order_info WHERE order_id=6584288;

Inspecting and Removing Views

- Use DESCRIBE FORMATTED to see the underlying query

hive> DESCRIBE FORMATTED order_info;

- Use DROP VIEW to remove a view

hive> DROP VIEW order_info;

Saving Query Output to a Table

- SELECT statements display their results on screen
- To send results to a Hive table, use INSERT OVERWRITE TABLE
  - Destination table must already exist
  - Existing contents will be deleted

hive> INSERT OVERWRITE TABLE ny_customers
      SELECT * FROM customers
      WHERE state = 'NY';

- INSERT INTO TABLE adds records without first deleting existing data

hive> INSERT INTO TABLE ny_customers
      SELECT * FROM customers
      WHERE state = 'NJ' OR state = 'CT';

Creating Tables Based On Existing Data

- Hive supports creating a table based on a SELECT statement
  - Often known as Create Table As Select (CTAS)

CREATE TABLE ny_customers AS
  SELECT cust_id, fname, lname FROM customers
  WHERE state = 'NY';

- Column definitions are derived from the existing table
- Column names are inherited from the existing names
  - Use aliases in the SELECT statement to specify new names
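
The aliasing point can be illustrated with a small variation of the CTAS statement. This is only a sketch: the table name ny_customer_names and the three alias names are made up for illustration and do not appear elsewhere in the course.

```sql
-- CTAS with aliases: the new table's columns take the alias names
CREATE TABLE ny_customer_names AS
  SELECT cust_id AS customer_id,
         fname   AS first_name,
         lname   AS last_name
  FROM customers
  WHERE state = 'NY';

-- Should list customer_id, first_name, and last_name,
-- with types carried over from the customers table
DESCRIBE ny_customer_names;
```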
Writing Output To a Filesystem

- You can save query output to files in an HDFS directory

hive> INSERT OVERWRITE DIRECTORY '/dualcore/ny/'
      SELECT * FROM customers
      WHERE state = 'NY';

- Add LOCAL to store results on local disk instead

hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/bob/ny/'
      SELECT * FROM customers
      WHERE state = 'NY';

- Both produce text files delimited by Ctrl-A characters

Writing Output To HDFS, Specifying Format

- To write the files to HDFS with a user-specified format:
  - Create an external table in the required format
  - Use INSERT OVERWRITE TABLE

hive> CREATE EXTERNAL TABLE ny_customers
      (cust_id INT,
       fname STRING,
       lname STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/dualcore/nydata';

hive> INSERT OVERWRITE TABLE ny_customers
      SELECT cust_id, fname, lname
      FROM customers WHERE state = 'NY';

Multi-Table Insert (1)

- We just saw that you can save output to files in HDFS

INSERT OVERWRITE DIRECTORY 'ny_customers'
  SELECT cust_id, fname, lname
  FROM customers WHERE state = 'NY';

- This query could also be written as follows

FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_customers'
  SELECT cust_id, fname, lname WHERE state='NY';

Multi-Table Insert (2)

- We sometimes need to extract data to multiple tables
  - Hive SELECT queries can take a long time to complete
- Hive allows us to do this with a single query
  - Much more efficient than using multiple queries
- The following example demonstrates multi-table insert
  - Result is two directories in HDFS

FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_names'
  SELECT fname, lname WHERE state = 'NY'
INSERT OVERWRITE DIRECTORY 'ny_count'
  SELECT count(DISTINCT cust_id)
  WHERE state = 'NY';

Multi-Table Insert (3)

- The following query produces the same result

FROM (SELECT * FROM customers WHERE state='NY') nycust
INSERT OVERWRITE DIRECTORY 'ny_names'
  SELECT fname, lname
INSERT OVERWRITE DIRECTORY 'ny_count'
  SELECT count(DISTINCT cust_id);
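
The same form also works when the destinations are tables rather than directories. A minimal sketch, assuming that ny_customers and a hypothetical nj_customers table already exist and have the same columns as customers:

```sql
-- One pass over customers feeds both destination tables
FROM customers
INSERT OVERWRITE TABLE ny_customers
  SELECT * WHERE state = 'NY'
INSERT OVERWRITE TABLE nj_customers
  SELECT * WHERE state = 'NJ';
```

Note that only the final INSERT clause is followed by a semicolon; the whole statement is a single query.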
Hive Authentication

- HiveServer2 can be configured for Kerberos or LDAP authentication
  - You must provide a user ID and password when connecting to HiveServer2 from Beeline

[training@localhost analyst]$ beeline
Beeline version 0.10.0-cdh4.2.1 by Apache Hive

beeline> !connect jdbc:hive2://localhost:10000 training mypwd org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ

beeline>

Hive Authorization with Apache Sentry

- Sentry is a plug-in for enforcing authorization
- Sentry can be enabled in HiveServer2
  - And in Impala
- Sentry provides fine-grained, role-based authorization for Hive data and metadata
- Sentry was developed at Cloudera
  - Available starting with CDH 4.3
  - Project is in incubation status at Apache
  - http://incubator.apache.org/projects/sentry.html

Sentry Access Control Model

- What does Sentry control access to?
  - Server
  - Database
  - Table
  - View
- Who can access Sentry-controlled objects?
  - Users in a Sentry role
  - Sentry roles = one or more groups
- How can role members access Sentry-controlled objects?
  - Read operations (SELECT privilege)
  - Write operations (INSERT privilege)
  - Metadata definition operations (ALL privilege)

Hands-On Exercise: Data Management with Hive

- In this Hands-On Exercise, you will create, load, modify, and delete tables in Hive.
- Please refer to the Hands-On Exercise Manual for instructions

Essential Points

- Each Hive table maps to a directory in HDFS
  - Table data stored as one or more files
  - Default format: plain text with delimited fields
  - Binary formats may offer better performance but limit compatibility
- Dropping a Hive-managed table deletes its data in HDFS
  - External tables require manual data deletion
- ALTER TABLE is used to add, modify, and remove columns
- Views can help to simplify complex and repetitive queries
- In a secure environment, Sentry provides Hive authorization

Text Processing with Hive
Chapter 12

Text Processing with Hive

In this chapter, you will learn
- How to use Hive's most important string functions
- How to format numeric values
- How to use regular expressions in Hive
- What n-grams are and why they are useful
- How to estimate how often words or phrases occur in text

Chapter Topics

Text Processing with Hive

- Overview of Text Processing
- Important String Functions
- Using Regular Expressions in Hive
- Sentiment Analysis and n-grams
- Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
- Conclusion

Text Processing Overview

- Traditional data processing relies on highly-structured data
  - Carefully curated information in rows and columns
- What types of data are we producing today?
  - Free-form notes in electronic medical records
  - Application and server log files
  - Social network connections
  - Electronic messages
  - Product ratings
- These types of data also contain great value
  - But extracting it requires a different approach

Basic String Functions

- Hive supports many string functions often found in RDBMSs

Function Description               Example Invocation       Input     Output
Convert to uppercase               UPPER(name)              Bob       BOB
Convert to lowercase               LOWER(name)              Bob       bob
Remove whitespace at start/end     TRIM(name)               ' Bob '   'Bob'
Remove only whitespace at start    LTRIM(name)              ' Bob '   'Bob '
Remove only whitespace at end      RTRIM(name)              ' Bob '   ' Bob'
Extract portion of string *        SUBSTRING(name, 0, 3)    Samuel    Sam

(Quotes in the TRIM rows mark where the whitespace is.)

* Hive accepts a starting position of 0 here and treats it like position 1; positions are otherwise one-based, as in SQL
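
The starting-position behavior of SUBSTRING can be checked directly. A minimal sketch, assuming Hive's standard SUBSTRING semantics; the FROM customers LIMIT 1 clause is only there to supply a single input row, and any table would do:

```sql
SELECT SUBSTRING('Samuel', 0, 3),   -- Sam (position 0 behaves like position 1)
       SUBSTRING('Samuel', 1, 3),   -- Sam
       SUBSTRING('Samuel', 2, 3)    -- amu
FROM customers LIMIT 1;
```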
Parsing URLs with Hive

- Hive offers built-in support for parsing Web addresses (URLs)
- The following examples assume the following URL as input
  - http://www.example.com/click.php?A=42&Z=105

Example Invocation              Output
PARSE_URL(url, 'PROTOCOL')      http
PARSE_URL(url, 'HOST')          www.example.com
PARSE_URL(url, 'PATH')          /click.php
PARSE_URL(url, 'QUERY')         A=42&Z=105
PARSE_URL(url, 'QUERY', 'A')    42
PARSE_URL(url, 'QUERY', 'Z')    105
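
In practice PARSE_URL is usually applied to a column rather than a literal. The sketch below assumes a hypothetical web_clicks table with a STRING column named url; neither name comes from the course data.

```sql
-- Count clicks per host, keeping only URLs that carry an "A" query parameter
SELECT PARSE_URL(url, 'HOST') AS host,
       COUNT(*)               AS num_clicks
FROM web_clicks
WHERE PARSE_URL(url, 'QUERY', 'A') IS NOT NULL
GROUP BY PARSE_URL(url, 'HOST');
```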
Numeric Format Functions

- Hive offers two functions for formatting a number
  - Simple: FORMAT_NUMBER (0.10.0 and later)
  - Versatile: PRINTF (0.9.0 and later)

Example Invocation                     Input          Output
FORMAT_NUMBER(commission, 2)           2345.519728    2,345.52
PRINTF("$%1.2f", total_price)          356.9752       $356.98
PRINTF("%s owes $%1.2f", name, amt)    Bob, 3.9       Bob owes $3.90
PRINTF("%.2f%%", taxrate * 100)        0.47314        47.31%

- Caution: avoid storing precise values as floating point numbers!
  - These functions are best used for formatting results

Splitting and Combining Strings

- CONCAT combines one or more strings
  - The CONCAT_WS variation joins them with a separator
- SPLIT does nearly the opposite
  - Difference: return value is ARRAY<STRING>

Example Invocation                     Output
CONCAT('alice', '@example.com')        alice@example.com
CONCAT_WS(' ', 'Bob', 'Smith')         Bob Smith
CONCAT_WS('/', 'Amy', 'Sam', 'Ted')    Amy/Sam/Ted
SPLIT('Amy/Sam/Ted', '/')              ["Amy","Sam","Ted"] *

* Representation of ARRAY<STRING>

Converting Array to Records with EXPLODE

- The EXPLODE function creates a record for each element in an array
  - An example of a table generating function
  - The alias is required when invoking table generating functions

hive> SELECT people FROM example;
Amy,Sam,Ted

hive> SELECT SPLIT(people, ',') FROM example;
["Amy","Sam","Ted"]

hive> SELECT EXPLODE(SPLIT(people, ',')) AS x FROM example;
Amy
Sam
Ted
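
When other columns are needed next to the exploded values, Hive's LATERAL VIEW clause is the usual companion to EXPLODE. A minimal sketch, assuming the example table also has an id column (an assumption; only the people column is shown above):

```sql
-- One output row per (id, person) pair;
-- p is the lateral view's alias and person names the generated column
SELECT e.id, p.person
FROM example e
LATERAL VIEW EXPLODE(SPLIT(e.people, ',')) p AS person;
```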
Regular Expressions

- A regular expression (regex) matches a pattern in text
  - Useful when exact matching is not practical
- Input string for each example: I wish Dualcore had 2 stores in 90210.

Regular Expression    First match
Dualcore              Dualcore
\\d                   2
\\d{5}                90210
\\d\\s\\w+            2 stores
\\w{5,9}              Dualcore
.?\\.                 0.
.*\\.                 I wish Dualcore had 2 stores in 90210.

Hive's Regular Expression Functions

- Hive has two important functions that use regular expressions
  - REGEXP_EXTRACT returns the matched text
  - REGEXP_REPLACE substitutes another value for the matched text
- These examples assume that txt has the following value
  - It's on Oak St. or Maple St in 90210

hive> SELECT REGEXP_EXTRACT(txt, '(\\d{5})', 1)
      FROM message;
90210

hive> SELECT REGEXP_REPLACE(txt, 'St.?\\s+', 'Street ')
      FROM message;
It's on Oak Street or Maple Street in 90210
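
The third argument of REGEXP_EXTRACT selects which part of the match is returned: 0 for the whole match, 1 for the first parenthesized group, and so on. A short sketch using the same txt value as above; the pattern itself is made up for illustration:

```sql
SELECT REGEXP_EXTRACT(txt, 'on (\\w+) St', 0),   -- whole match: on Oak St
       REGEXP_EXTRACT(txt, 'on (\\w+) St', 1)    -- first group: Oak
FROM message;
```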
Regex SerDe

- We sometimes need to analyze data that lacks consistent delimiters
  - Log files are a common example of this

05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""
05/23/2013 19:45:23 312-555-7834 OPTION_SELECTED "Shipping"
05/23/2013 19:46:23 312-555-7834 ON_HOLD ""
05/23/2013 19:47:51 312-555-7834 AGENT_ANSWER "Agent ID N7501"
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"
05/23/2013 19:48:41 312-555-7834 CALL_END "Duration: 3:22"

- RegexSerDe will read records based on a supplied regular expression
  - Allows us to create a table from this log file

Creating a Table with Regex SerDe (1)

Log excerpt

05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

RegexSerDe

CREATE TABLE calls (
  event_date STRING,
  event_time STRING,
  phone_num STRING,
  event_type STRING,
  details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
  "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

- Each pair of parentheses denotes a field
  - Field value is the text matched by the pattern within the parentheses

Creating a Table with Regex SerDe (2)

Log excerpt

05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

RegexSerDe

CREATE TABLE calls (
  event_date STRING,
  event_time STRING,
  phone_num STRING,
  event_type STRING,
  details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
  "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Table excerpt

event_date    event_time    phone_num       event_type       details
05/23/2013    19:45:19      312-555-7834    CALL_RECEIVED
05/23/2013    19:48:37      312-555-7834    COMPLAINT        Item not received
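
Once the table is defined, the log can be queried like any other Hive table. A minimal sketch, assuming the calls table above has been created over the log file; the date literal is simply the one from the excerpt:

```sql
-- Number of events of each type on one day
SELECT event_type, COUNT(*) AS num_events
FROM calls
WHERE event_date = '05/23/2013'
GROUP BY event_type;
```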
Regex SerDe in Older Versions of Hive

- The Regex SerDe wasn't formally part of Hive prior to 0.10.0
  - It shipped with Hive, but was part of the hive-contrib library
- To use Regex SerDe in 0.9.x and earlier versions of Hive
  - Add this JAR file to Hive
  - Change the SerDe's package name, as noted below

CREATE TABLE calls (
  event_date STRING,
  event_time STRING,
  phone_num STRING,
  event_type STRING,
  details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
  "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

  The package name used in older versions is slightly different:
  org.apache.hadoop.hive.contrib.serde2.RegexSerDe

Fixed-Width Formats in Hive

- Many older applications produce data in fixed-width formats

1030929610759620120829012215Oakland             CA94618

  Fields, in order: cust_id, order_id, date, time, city, state, zipcode

- Unfortunately, Hive doesn't directly support these
  - But you can overcome this limitation by using RegexSerDe
- Caveat: all fields in RegexSerDe are of type STRING
  - May need to cast numeric values in your queries

Fixed-Width Format Example

Input data

1030929610759620120829012215Oakland             CA94618

RegexSerDe

CREATE TABLE fixed (
  cust_id STRING,
  order_id STRING,
  order_dt STRING,
  order_tm STRING,
  city STRING,
  state STRING,
  zip STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
  "(\\d{7})(\\d{7})(\\d{8})(\\d{6})(.{20})(\\w{2})(\\d{5})");

cust_id    order_id    order_dt    order_tm    city       state    zip
1030929    6107596     20120829    012215      Oakland    CA       94618
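
Because every RegexSerDe field is a STRING, numeric comparisons need an explicit cast, and a padded field such as city can be cleaned up with TRIM. A minimal sketch against the fixed table above:

```sql
SELECT CAST(cust_id AS INT) AS cust_id,   -- STRING to INT for numeric use
       TRIM(city)           AS city,      -- drop the fixed-width padding
       state
FROM fixed
WHERE CAST(zip AS INT) = 94618;
```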
Parsing Sentences into Words

- Hive's SENTENCES function parses supplied text into words
- Input is a string containing one or more sentences
- Output is a two-dimensional array of strings
  - Outer array contains one element per sentence
  - Inner array contains one element per word in that sentence

hive> SELECT txt FROM phrases WHERE id=12345;
I bought this computer and really love it! It's very fast and
does not crash.

hive> SELECT SENTENCES(txt) FROM phrases WHERE id=12345;
[["I","bought","this","computer","and","really","love","it"],
["It's","very","fast","and","does","not","crash"]]

Sentiment Analysis

- Sentiment analysis is an application of text analytics
  - Classification and measurement of opinions
  - Frequently used for social media analysis
- Context is essential for human languages
  - Which word combinations appear together?
  - How frequently do these combinations appear?
- Hive offers functions that help answer these questions

n-grams

- An n-gram is a word combination (n = number of words)
  - A bigram is a sequence of two words (n=2)
- n-gram frequency analysis is an important step in many applications
  - Suggesting spelling corrections in search results
  - Finding the most important topics in a body of text
  - Identifying trending topics in social media messages

Calculating n-grams in Hive (1)

- Hive offers the NGRAMS function for calculating n-grams
- The function requires three input parameters
  - Array of sentences, each of which is an array of strings (words)
  - Number of words in each n-gram
  - Desired number of results (top-N, based on frequency)
- Output is an array of STRUCT with two attributes
  - ngram: the n-gram itself (an array of words)
  - estfrequency: estimated frequency at which this n-gram appears

Calculating n-grams in Hive (2)

- The NGRAMS function is often used with the SENTENCES function
  - We also used LOWER to normalize case
  - And EXPLODE to convert the resulting array to a series of rows

hive> SELECT txt FROM phrases WHERE id=56789;
This tablet is great. The size is great. The screen is
great. The audio is great. I love this tablet! I love
everything about this tablet!!!

hive> SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(txt)), 2, 5))
      AS bigrams FROM phrases WHERE id=56789;
{"ngram":["is","great"],"estfrequency":4.0}
{"ngram":["great","the"],"estfrequency":3.0}
{"ngram":["this","tablet"],"estfrequency":3.0}
{"ngram":["i","love"],"estfrequency":2.0}
{"ngram":["tablet","i"],"estfrequency":1.0}
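
To get the words and the frequency as ordinary columns instead of a STRUCT, the NGRAMS result can be computed in a subquery and then unpacked with LATERAL VIEW. This is a sketch under the assumption that the struct fields are addressed as ngram and estfrequency, as in the output above; the aliases t, g, bg, first_word, and second_word are arbitrary.

```sql
SELECT bg.ngram[0]     AS first_word,
       bg.ngram[1]     AS second_word,
       bg.estfrequency AS est_frequency
FROM (SELECT NGRAMS(SENTENCES(LOWER(txt)), 2, 5) AS grams
      FROM phrases
      WHERE id = 56789) t
LATERAL VIEW EXPLODE(t.grams) g AS bg;   -- one row per bigram struct
```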
Finding Specific n-grams in Text

- CONTEXT_NGRAMS is similar, but considers only specific combinations
  - Additional input parameter: array of words used for filtering
  - Any NULL values in the array are treated as placeholders

hive> SELECT txt FROM phrases
      WHERE txt LIKE '%new computer%';
My new computer is fast! I wish I'd upgraded sooner.
This new computer is expensive, but I need it now.
I can't believe her new computer failed already.

hive> SELECT EXPLODE(CONTEXT_NGRAMS(SENTENCES(LOWER(txt)),
      ARRAY("new", "computer", NULL, NULL), 4, 3)) AS ngrams
      FROM phrases;
{"ngram":["is","expensive"],"estfrequency":1.0}
{"ngram":["failed","already"],"estfrequency":1.0}
{"ngram":["is","fast"],"estfrequency":1.0}

Histograms

- Histograms illustrate how values in the data are distributed
  - This helps us estimate the overall shape of the data distribution

Calculating Data for Histograms

- HISTOGRAM_NUMERIC creates data needed for histograms
  - Input: column name and number of bins in the histogram
  - Output: coordinates representing bin centers and heights
- Import this data into charting software to produce a histogram

hive> SELECT EXPLODE(HISTOGRAM_NUMERIC(
      total_price, 10)) AS dist FROM cart_orders;
{"x":25417.336745023003,"y":8891.0}
{"x":74401.5041469194,"y":3376.0}
{"x":123550.04418985262,"y":611.0}
{"x":197421.12500000006,"y":24.0}
{"x":267267.53846153844,"y":26.0}
{"x":425324.0,"y":4.0}
{"x":479226.38461538474,"y":13.0}
{"x":524548.0,"y":6.0}
{"x":598463.5,"y":2.0}
{"x":975149.0,"y":2.0}
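
Charting tools generally expect plain columns rather than structs. The sketch below unpacks the x and y fields with LATERAL VIEW; the aliases t, b, bin, bin_center, and bin_height are illustrative choices, not names from the course.

```sql
SELECT CAST(bin.x AS BIGINT) AS bin_center,
       CAST(bin.y AS BIGINT) AS bin_height
FROM (SELECT HISTOGRAM_NUMERIC(total_price, 10) AS bins
      FROM cart_orders) t
LATERAL VIEW EXPLODE(t.bins) b AS bin;   -- one row per histogram bin
```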
Hands-On Exercise: Gaining Insight with Sentiment Analysis

- In this Hands-On Exercise, you will analyze comments in product rating data with Hive
- Please refer to the Hands-On Exercise Manual for instructions

Essential Points

- Most data produced these days lacks rigid structure
  - Text processing can help us analyze loosely-structured data
- The SPLIT function creates an array from a string
  - EXPLODE creates individual records from an array
- Hive has extensive support for regular expressions
  - You can extract or substitute values based on patterns
  - You can even create a table based on regular expressions
- An n-gram is a sequence of words
  - Use NGRAMS and CONTEXT_NGRAMS to find their frequency

Hive Optimization
Chapter 13

Hive Optimization

In this chapter, you will learn
- Which factors help determine the performance of Hive queries
- What command displays Hive's execution plan for a query
- How to enable several useful Hive performance features
- How to partition tables to reduce the amount of data read for a query
- How to use table bucketing to sample data
- How to create and rebuild indexes in Hive

Chapter Topics

Hive Optimization

- Understanding Query Performance
- Controlling Job Execution
- Partitioning
- Bucketing
- Indexing Data
- Conclusion

How Hive Processes Data

Example query

SELECT COUNT(cust_id)
  FROM customers
  WHERE zipcode=94305;

Steps Run Locally
- Parse HiveQL Statements
- Read Metadata for Tables
- Build the Execution Plan
- Submit MapReduce Jobs to Hadoop Cluster

Steps Run on Cluster
- Divide Job Into Tasks
- Assign Tasks to Nodes
- Start Java Process for Task
- Read Input Data
- Map Processing Phase
- Reduce Processing Phase
- Write Output Data

Hive Query Performance Patterns (1)

- The fastest queries involve only metadata

DESCRIBE customers;

- The next fastest simply read from HDFS

SELECT * FROM customers;

- Then a query that requires a map-only job

SELECT * FROM customers WHERE zipcode = 94305;

Hive Query Performance Patterns (2)

- The next slowest type of query requires both Map and Reduce phases

SELECT COUNT(cust_id) FROM customers
  WHERE zipcode=94305;

- The slowest queries require multiple MapReduce jobs

SELECT zipcode, COUNT(cust_id) AS num FROM customers
  GROUP BY zipcode
  ORDER BY num DESC
  LIMIT 10;

Viewing the Execution Plan

- How can you tell how Hive will execute a query?
  - Does it read only metadata?
  - Can it return data directly from HDFS?
  - Will it require a reduce phase or multiple MapReduce jobs?
- Prefix your query with EXPLAIN to view Hive's execution plan

hive> EXPLAIN SELECT * FROM customers;

- The output of EXPLAIN can be very long and complex
  - Fully understanding it requires in-depth knowledge of MapReduce
  - We will cover the basics here
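
One way to build intuition is to run EXPLAIN on queries from the performance patterns shown earlier and compare the plans. A minimal sketch; the exact plan text varies by Hive version, so no output is shown:

```sql
-- Filter only: expected to need a map phase but no reduce phase
EXPLAIN SELECT * FROM customers WHERE zipcode = 94305;

-- Aggregation: expected to show both map and reduce operator trees
EXPLAIN SELECT COUNT(cust_id) FROM customers WHERE zipcode = 94305;

-- EXPLAIN EXTENDED adds further detail, such as file paths, to the plan
EXPLAIN EXTENDED SELECT COUNT(cust_id) FROM customers WHERE zipcode = 94305;
```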
Viewing a Query Plan with EXPLAIN (1)

- The query plan contains three main sections
  - The abstract syntax tree details how Hive parsed the query (excerpt below)
  - The stage dependencies and plans are more useful to most users

    hive> EXPLAIN CREATE TABLE cust_by_zip AS
          SELECT zipcode, COUNT(cust_id) AS num
          FROM customers GROUP BY zipcode;

    ABSTRACT SYNTAX TREE:
      (TOK_CREATETABLE (TOK_TABNAME cust_by_zip) ...

    STAGE DEPENDENCIES:
      ... (excerpt shown on next slide)

    STAGE PLANS:
      ... (excerpt shown on upcoming slide)

Viewing a Query Plan with EXPLAIN (2)

- Our query has three stages
- Dependencies define order
  - Stage-1 runs first
  - Stage-0 runs next
  - Stage-2 runs last

    ABSTRACT SYNTAX TREE:
      ... (shown on previous slide)

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
      Stage-2 depends on stages: Stage-0

    STAGE PLANS:
      ... (shown on next slide)
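
The way the dependency list determines run order can be sketched in Python. This is only an illustration, not Hive's scheduler: the dictionary is transcribed by hand from the STAGE DEPENDENCIES excerpt, the function name execution_order is hypothetical, and the standard-library graphlib module (Python 3.9 or later) is assumed.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on
# (transcribed by hand from the STAGE DEPENDENCIES excerpt).
STAGE_DEPENDENCIES = {
    "Stage-1": set(),          # root stage
    "Stage-0": {"Stage-1"},
    "Stage-2": {"Stage-0"},
}

def execution_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())

print(execution_order(STAGE_DEPENDENCIES))
```

Stages with no dependency path between them could appear in either order, which is the situation in which they could be run in parallel.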
Viewing a Query Plan with EXPLAIN (3)

- Stage-1: MapReduce job
- Map phase
  - Reads customers table
  - Selects zipcode and cust_id columns
- Reduce phase
  - Group by zipcode
  - Count cust_id

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            TableScan
              alias: customers
            Select Operator
              zipcode, cust_id
          Reduce Operator Tree:
            Group By Operator
              aggregations:
                expr: count(cust_id)
              keys:
                expr: zipcode

Viewing a Query Plan with EXPLAIN (4)

- Stage-0: HDFS action
  - Move previous stage's output to Hive's warehouse directory
- Stage-2: Metastore action
  - Create new table
  - Has two columns

    STAGE PLANS:
      Stage: Stage-1 (covered earlier)
      ...
      Stage: Stage-0
        Move Operator
          files:
            hdfs directory: true
            destination: (HDFS path...)
      Stage: Stage-2
        Create Table Operator:
          Create Table
            columns: zipcode string, num bigint
            name: cust_by_zip

Sorting Results

- As in SQL, ORDER BY sorts specified fields in HiveQL
  - Consider the result from the following query

    hive> SELECT name, SUM(total)
          FROM order_info GROUP BY name
          ORDER BY name;

Input rows (order_info):

    0  Alice   3625
    1  Bob     5174
    2  Alice    893
    3  Alice   2139
    4  Diana   3581
    5  Carlos  1039
    6  Bob     4823
    7  Alice   5834
    8  Carlos   392
    9  Diana   1804

Mapper output is processed by a single reducer, which sorts all records by name and produces one globally ordered result:

    Alice   12491
    Bob      9997
    Carlos   1431
    Diana    5385

Using SORT BY for Partial Ordering

- HiveQL also supports partial ordering via SORT BY
  - Offers much better performance if global order isn't required

    hive> SELECT name, SUM(total)
          FROM order_info GROUP BY name
          SORT BY name;

Input rows (order_info):

    0  Alice   3625
    1  Bob     5174
    2  Alice    893
    3  Alice   2139
    4  Diana   3581
    5  Carlos  1039
    6  Bob     4823
    7  Alice   5834
    8  Carlos   392
    9  Diana   1804

Only the output from each reducer is sorted:

    Reducer 1          Reducer 2
    Alice   12491      Bob     9997
    Carlos   1431      Diana   5385
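
The difference between the two clauses can be mimicked in a few lines of Python. This is a sketch under stated assumptions: reducer_for is a toy stand-in for Hadoop's partitioner (it is not the hash Hive really uses), and the function names are hypothetical. One reducer models ORDER BY; two reducers model SORT BY.

```python
from collections import defaultdict

ROWS = [("Alice", 3625), ("Bob", 5174), ("Alice", 893), ("Alice", 2139),
        ("Diana", 3581), ("Carlos", 1039), ("Bob", 4823), ("Alice", 5834),
        ("Carlos", 392), ("Diana", 1804)]

def reducer_for(name, num_reducers):
    # Toy partitioner (assumption): pick a reducer from the first letter.
    return ord(name[0]) % num_reducers

def sum_and_sort(rows, num_reducers):
    """Return one sorted list of (name, total) pairs per reducer."""
    totals = [defaultdict(int) for _ in range(num_reducers)]
    for name, total in rows:
        totals[reducer_for(name, num_reducers)][name] += total
    return [sorted(t.items()) for t in totals]

order_by = sum_and_sort(ROWS, 1)  # single reducer: globally sorted
sort_by = sum_and_sort(ROWS, 2)   # each reducer sorts only its own share
print(order_by)
print(sort_by)
```

Each list in sort_by is sorted, but concatenating them does not give a globally sorted result.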
Chapter Topics

Hive Optimization

- Understanding Query Performance
- Controlling Job Execution (current topic)
- Partitioning
- Bucketing
- Indexing Data
- Conclusion

Parallel Execution

- Stages in Hive's execution plan often lack dependencies
  - This means they can be run in parallel
- Hive supports parallel execution in such cases
  - However, this feature is disabled by default
- Enable this by setting the hive.exec.parallel property to true
  - Set only for yourself in $HOME/.hiverc
  - Set for all users in /etc/hive/conf/hive-site.xml

Reducing Latency Through Local Execution

- Running MapReduce jobs on the cluster has significant overhead
  - Must divide work, assign tasks, start processes, collect results, etc.
  - Necessary to process large amounts of data in Hive
  - Possibly inefficient with small amounts of data
- Processing data locally can substantially improve turnaround for small jobs

    hive> SET mapred.job.tracker=local;
    hive> SET mapred.local.dir=/home/training/tmpdata;

    hive> SELECT zipcode, COUNT(cust_id) AS num
          FROM customers GROUP BY zipcode;

Automatic Selection of Local Execution Mode

- Hive can now select local execution mode automatically
  - It does this on a case-by-case basis using heuristics
- Like parallel execution, this feature is disabled by default
  - Enable by setting hive.exec.mode.local.auto to true

Job Control

- You will see many log messages when you run a query in Hive's shell
  - One of these messages will identify the MapReduce job ID

    hive> SELECT * FROM customers WHERE zipcode=94305;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ... (other messages omitted) ...
    Starting Job = job_201306022351_0025

- Hadoop's mapred command lets you kill the job or view its status

    $ mapred job -status job_201306022351_0025
    $ mapred job -kill job_201306022351_0025
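
If you capture the shell's log output, the job ID can be pulled out with a regular expression. The sketch below assumes the "Starting Job = ..." message format shown above; the function name find_job_ids and the sample log text are illustrative only.

```python
import re

# Matches the job ID in a "Starting Job = job_..." log message.
JOB_ID = re.compile(r"Starting Job = (job_\d+_\d+)")

def find_job_ids(log_text):
    """Return every MapReduce job ID announced in the log text."""
    return JOB_ID.findall(log_text)

log = """Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201306022351_0025
"""

for job_id in find_job_ids(log):
    # Print the command you could run to check on the job.
    print("mapred job -status", job_id)
```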
Viewing a Job in the Web UI (1)

- The log message will also show a Web address
  - This is a link to the job's detail page in Hadoop's Web UI

    hive> SELECT * FROM customers WHERE zipcode=94305;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ... (other messages omitted) ...
    Starting Job = job_201306022351_0029, Tracking URL =
    http://jobtracker.example.com:50030/jobdetails.jsp?jobid=job_201306022351_0029

Viewing a Job in the Web UI (2)

Chapter Topics

Hive Optimization

- Understanding Query Performance
- Controlling Job Execution
- Partitioning (current topic)
- Bucketing
- Indexing Data
- Conclusion

Recap: Data Storage in Hive

- The metastore maintains information about Hive tables
  - A table simply points to a directory in HDFS
  - The table's data are files within that directory
- All files in the directory are read during a query

call_logs table: call_logs directory in HDFS

    call-20130603.log
        2013-06-03 07:55:36 312-555-3453 CALL_RECEIVED ""
        2013-06-03 07:55:39 312-555-3453 OPTION_SELECTED "Billing"
    call-20130604.log
        2013-06-04 07:05:21 212-555-3982 CALL_RECEIVED ""
        2013-06-04 07:08:36 212-555-3982 HUNG_UP "Busy watching Hadoop Webinar"
        2013-06-04 07:14:29 314-555-5741 AGENT_ANSWER "Agent ID N5150"
    call-20130605.log
        2013-06-05 07:21:57 808-555-3453 CALL_RECEIVED ""
        2013-06-05 07:23:12 808-555-3453 OPTION_SELECTED "Billing"

File Reads in Hive Tables

- Dualcore's phone system creates daily logs detailing calls received
  - Our analysts use this data to summarize the previous day's calls

    hive> SELECT event_type, COUNT(event_type)
          FROM call_logs
          WHERE call_date = '2013-06-03'
          GROUP BY event_type;

- These queries always filter by a value in the call_date field
  - But they still read all files within the table's directory

Table Partitioning

- It is possible to create a table that partitions the data
  - Queries that filter on partitioned fields limit the amount of data read
  - Does not prevent you from running queries that span partitions

call_logs table: call_logs directory in HDFS

    date=2013-06-03
        07:55:36 312-555-3453 CALL_RECEIVED ""
        07:55:39 312-555-3453 OPTION_SELECTED "Billing"
        07:55:41 312-555-3453 CALL_ENDED ""
    date=2013-06-04
        07:05:21 212-555-3982 CALL_RECEIVED ""
        07:08:36 212-555-3982 HUNG_UP "Busy watching Hadoop Webinar"
        07:14:29 314-555-5741 AGENT_ANSWER "Agent ID N5150"
        07:14:51 314-555-5741 CALL_TRANSFER "Billing"
    date=2013-06-05
        07:21:57 808-555-3453 CALL_RECEIVED ""
        07:23:12 808-555-3453 OPTION_SELECTED "Billing"
        07:23:14 213-555-8752 CALL_ENDED ""

Creating A Partitioned Table In Hive (1)

- Specify the partition column in the PARTITIONED BY clause

    CREATE TABLE call_logs (
      call_time STRING,
      event_type STRING,
      phone STRING,
      details STRING)
    PARTITIONED BY (call_date STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';

- You can also create nested partitions

    PARTITIONED BY (call_date STRING, event_type STRING)

Creating A Partitioned Table In Hive (2)

- The partition column is displayed if you DESCRIBE the table:

    hive> DESCRIBE call_logs;
    OK
    call_time    string
    event_type   string
    phone        string
    details      string
    call_date    string

- However, the partition is a virtual column
  - The data does not exist in your incoming data
  - Instead, you specify the partition when loading the data

Loading Data Into Partitions

- You must specify the partition field value when loading data

    hive> LOAD DATA INPATH 'call-20130603.log'
          INTO TABLE call_logs
          PARTITION(call_date='2013-06-03');

    hive> LOAD DATA INPATH 'call-20130604.log'
          INTO TABLE call_logs
          PARTITION(call_date='2013-06-04');

- The above example would create two subdirectories:
  - /user/hive/warehouse/call_logs/call_date=2013-06-03
  - /user/hive/warehouse/call_logs/call_date=2013-06-04
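
The directory naming scheme, and why a filter on the partition column lets a query skip whole directories, can be sketched in Python. This models only the path layout shown above; partition_dir and prune are hypothetical helper names, not part of Hive.

```python
def partition_dir(table_dir, **partition_values):
    """Build the subdirectory for one partition (column=value segments)."""
    parts = [f"{column}={value}" for column, value in partition_values.items()]
    return "/".join([table_dir, *parts])

def prune(partition_dirs, column, wanted):
    """Keep only directories that have a path segment column=wanted."""
    return [d for d in partition_dirs if f"{column}={wanted}" in d.split("/")]

TABLE_DIR = "/user/hive/warehouse/call_logs"
dirs = [partition_dir(TABLE_DIR, call_date=day)
        for day in ("2013-06-03", "2013-06-04")]

# A query filtering on call_date = '2013-06-03' only needs this directory.
print(prune(dirs, "call_date", "2013-06-03"))
```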
Dynamic Partition Inserts (1)

- What if your table already exists, contains data, and lacks partitions?
  - Hive can dynamically insert data into specific partitions for you
- Syntax:

    FROM customers
    INSERT OVERWRITE TABLE custs_part PARTITION(state)
    SELECT cust_id, fname, lname, address, city,
           zipcode, state;

- Partitions are automatically created based on the value of the last column
  - If the partition does not already exist, it will be created
  - If the partition does exist, it will be overwritten
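
The create-or-overwrite behavior described above can be modeled with a dictionary keyed by partition value. This is a simplified sketch, not Hive's implementation: the function name and the sample rows are made up for illustration, and partitions that receive no new rows are assumed to be left untouched.

```python
from collections import defaultdict

def dynamic_partition_insert(rows, existing=None):
    """Route each row to the partition named by its last column."""
    table = dict(existing or {})
    incoming = defaultdict(list)
    for row in rows:
        *data, partition_value = row
        incoming[partition_value].append(tuple(data))
    # Partitions that receive rows are created or overwritten.
    table.update(incoming)
    return table

existing = {"CA": [(9, "Old")], "TX": [(8, "Keep")]}
rows = [(1, "Ann", "CA"), (2, "Bo", "NY"), (3, "Cy", "CA")]
result = dynamic_partition_insert(rows, existing)
print(sorted(result))
```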
Dynamic Partition Inserts (2)

- Dynamic partitioning is not enabled by default
  - Enable it by setting these two properties

    Property Name                       Value
    hive.exec.dynamic.partition         true
    hive.exec.dynamic.partition.mode    nonstrict

- Caution: avoid creating an excessive number of partitions
  - This can happen if your data contains many unique values

Dynamic Partition Inserts (3)

- Caution: if the partition column has many different values, many partitions will be created
- Three Hive configuration properties exist to limit this
  - hive.exec.max.dynamic.partitions.pernode
    - Maximum number of dynamic partitions that can be created by any given Mapper or Reducer
    - Default 100
  - hive.exec.max.dynamic.partitions
    - Total number of dynamic partitions that can be created by one HiveQL statement
    - Default 1000
  - hive.exec.max.created.files
    - Maximum total files created by Mappers and Reducers
    - Default 100000

Viewing, Adding, and Removing Partitions

- To view the current partitions in a table

    hive> SHOW PARTITIONS call_logs;

- Use ALTER TABLE to add or drop partitions

    ALTER TABLE call_logs
    ADD PARTITION (call_date='2013-06-05')
    LOCATION '/dualcore/call_logs/call_date=2013-06-05';

    ALTER TABLE call_logs
    DROP PARTITION (call_date='2013-06-06');

Chapter Topics

Hive Optimization

- Understanding Query Performance
- Controlling Job Execution
- Partitioning
- Bucketing (current topic)
- Indexing Data
- Conclusion

What Is Bucketing?

- Partitioning subdivides data by values in partitioned columns
- Bucketing data is another way of subdividing data
  - Calculates hash code for values inserted into bucketed columns
  - Hash code used to assign new records to a bucket
- Goal: distribute rows across a predefined number of buckets
  - Useful for jobs which need random samples of data
  - Joins may be faster if all tables are bucketed on the join column

Creating A Bucketed Table

- Example of creating a table that supports bucketing
  - Creates a table supporting 20 buckets based on order_id column
  - Each bucket should contain roughly 5% of the table's data

    CREATE TABLE orders_bucketed
      (order_id INT,
       cust_id INT,
       order_date TIMESTAMP)
    CLUSTERED BY (order_id) INTO 20 BUCKETS;

- Column selected for bucketing should have well-distributed values
  - Identifier columns are often a good choice
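
The hash-then-modulo assignment can be sketched in Python. This assumes, as is commonly described for Hive, that the hash code of an INT value is the value itself and that the bucket is the non-negative hash modulo the bucket count; treat both as assumptions of the sketch. The name bucket_for is hypothetical.

```python
from collections import Counter

NUM_BUCKETS = 20

def bucket_for(order_id, num_buckets=NUM_BUCKETS):
    """Return the 0-based bucket for an integer key."""
    # Assumption: an INT hashes to itself; mask keeps the hash non-negative.
    return (order_id & 0x7FFFFFFF) % num_buckets

# Sequential identifiers spread evenly: each bucket gets 5% of the rows.
counts = Counter(bucket_for(order_id) for order_id in range(1, 100001))
print(counts[0], counts[19])
```

A column with few distinct values would leave most buckets empty, which is why a well-distributed column matters.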
Inserting Data Into A Bucketed Table

- Bucketing isn't automatically enforced when inserting data
- Set the hive.enforce.bucketing property to true
  - This sets the number of reducers to the number of buckets in the table definition

    hive> SET hive.enforce.bucketing=true;

    hive> INSERT OVERWRITE TABLE orders_bucketed
          SELECT * FROM orders;

Sampling Data From A Bucketed Table

- Use the following syntax to sample data from a bucketed table:
  - This example selects one of every ten records (10%)

    hive> SELECT * FROM orders_bucketed
          TABLESAMPLE (BUCKET 1 OUT OF 10 ON order_id);

- It is possible to use TABLESAMPLE on a non-bucketed table
  - However, this requires a full scan of the entire table
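
A rough model of BUCKET 1 OUT OF 10 in Python: keep the rows whose hashed key falls into the first of ten groups. The hashing rule (an INT hashes to itself, then modulo) is an assumption of this sketch, and table_sample is a hypothetical helper, not a Hive API.

```python
def table_sample(rows, bucket, out_of, key):
    """Keep rows whose key hashes into the 1-based bucket of out_of groups."""
    return [row for row in rows
            if (key(row) & 0x7FFFFFFF) % out_of == bucket - 1]

orders = [{"order_id": order_id} for order_id in range(1, 1001)]
sample = table_sample(orders, bucket=1, out_of=10,
                      key=lambda row: row["order_id"])
print(len(sample), "of", len(orders), "rows sampled")
```

On a table already bucketed by the same column, this test can be answered by reading only some bucket files, whereas a non-bucketed table has to be scanned in full.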
Chapter Topics

Hive Optimization

- Understanding Query Performance
- Controlling Job Execution
- Partitioning
- Bucketing
- Indexing Data (current topic)
- Conclusion

Indexes in Hive

- Tables in Hive also now support indexes
  - Similar to indexes in RDBMSs, but much more limited
- May improve performance for certain types of queries
  - But maintaining them costs disk space and CPU time
- Syntax to create an index:

    hive> CREATE INDEX idx_orders_cust_id
          ON TABLE orders(cust_id)
          AS 'handler_class'
          WITH DEFERRED REBUILD;

- Handler class is the fully-qualified name of a Java class, such as:
  - org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler

Viewing and Building Indexes in Hive

- This command lists the indexes associated with the orders table

    hive> SHOW FORMATTED INDEX ON orders;

- Hive indexes are initially empty
  - Building (and later rebuilding) indexes is a manual process
  - Use the ALTER INDEX command to rebuild an index
  - Caution: this can be a lengthy operation!

    hive> ALTER INDEX idx_orders_cust_id ON orders REBUILD;

Chapter Topics

Hive Optimization

- Understanding Query Performance
- Controlling Job Execution
- Partitioning
- Bucketing
- Indexing Data
- Conclusion (current topic)

Essential Points

- ORDER BY sorts results globally, just as in SQL
  - HiveQL also supports SORT BY for partial ordering
- Local execution mode can significantly reduce query latency
  - But it is only appropriate for small amounts of data
- Partitioning and bucketing can both subdivide a table's data
  - Partitioning may reduce the amount of data a query must read
  - Bucketing is used to support random sampling
- Hive's indexing feature can boost performance for certain queries
  - But it comes at the cost of increased disk and CPU usage

Bibliography

The following offer more information on topics discussed in this chapter
- Hive Manual for the EXPLAIN Command
  - http://tiny.cloudera.com/dac13a
- Hive Manual for Bucketed Tables
  - http://tiny.cloudera.com/dac13b
- Hive Manual for Indexes
  - http://tiny.cloudera.com/dac13c

Extending Hive
Chapter 14

Course Chapters

- Introduction
- Hadoop Fundamentals
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive (current chapter)
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Extending Hive

In this chapter, you will learn
- What role SerDes play in Hive
- How to use a custom SerDe
- How to use TRANSFORM for custom record processing
- How to add support for a User-Defined Function (UDF)
- How to use variable substitution

Chapter Topics

Extending Hive

- SerDes (current topic)
- Data Transformation with Custom Scripts
- User-Defined Functions
- Parameterized Queries
- Hands-On Exercise: Data Transformation with Hive
- Conclusion

Hive SerDes

- Hive uses a SerDe for reading and writing records
  - Stands for Serializer / Deserializer
  - SerDes control the row format of the table
  - Specified, sometimes implicitly, when table is created
- Hive ships with many SerDes, including:

    Name              Reads and Writes Records
    LazySimpleSerDe   Using specified field delimiters (default)
    RegexSerDe        Based on supplied patterns
    ColumnarSerDe     Using the columnar format needed by RCFile
    HBaseSerDe        Using an HBase table

Recap: Creating a Table with Regex SerDe

Input Data

    05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""
    05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

Using SerDe

    CREATE TABLE calls (
      event_date STRING,
      event_time STRING,
      phone_num STRING,
      event_type STRING,
      details STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" =
      "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Resulting Table

    event_date   event_time   phone_num      event_type      details
    05/23/2013   19:45:19     312-555-7834   CALL_RECEIVED
    05/23/2013   19:48:37     312-555-7834   COMPLAINT       Item not received
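
To see how the five capture groups map onto columns, the same pattern can be tried with Python's re module. This is a sketch, not the SerDe itself: the column names are listed in the order in which the sample data presents its fields (phone number before event type), and returning None for a non-matching line is a choice made for this sketch.

```python
import re

# Same pattern as input.regex, written without the HiveQL string escaping.
INPUT_REGEX = re.compile(r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) "([^"]*)"')
COLUMNS = ("event_date", "event_time", "phone_num", "event_type", "details")

def deserialize(line):
    """Map one input line to a column dict, or None if it does not match."""
    match = INPUT_REGEX.fullmatch(line)
    return dict(zip(COLUMNS, match.groups())) if match else None

lines = ['05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""',
         '05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"']
for line in lines:
    print(deserialize(line))
```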
Adding a Custom SerDe to Hive

- Hive also allows writing custom SerDes using its Java API
  - There are many open source SerDes on the Web
  - Writing your own is seldom necessary
- We will now explain how to add a custom SerDe to Hive
  - It reads and writes records in CSV format
  - Using JAR file from http://tiny.cloudera.com/dac14a

Adding a JAR File to Hive

- You must register external libraries before using them
  - Ensures Hive can find the library (JAR file) at runtime

    hive> ADD JAR /home/training/csv-serde-1.0.jar;

- Remains in effect only during the current Hive session
  - Consider editing your .hiverc file to add frequently-used JARs

Using the SerDe in Hive

Input Data

    1,Gigabux,gigabux@example.com
    2,"ACME Distribution Co.",acme@example.com
    3,"Bitmonkey, Inc.",bmi@example.com

Specify SerDe

    CREATE TABLE vendors
      (id INT,
       name STRING,
       email STRING)
    ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde';

Resulting Table

    id   name                    email
    1    Gigabux                 gigabux@example.com
    2    ACME Distribution Co.   acme@example.com
    3    Bitmonkey, Inc.         bmi@example.com
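
The reason a CSV-aware reader is needed (rather than simply splitting on commas) shows up in the third record, whose quoted name contains a comma. The sketch below uses Python's standard csv module as a stand-in; it is not the CSVSerde implementation, and read_vendors is a hypothetical helper.

```python
import csv
import io

RAW = '''1,Gigabux,gigabux@example.com
2,"ACME Distribution Co.",acme@example.com
3,"Bitmonkey, Inc.",bmi@example.com
'''

def read_vendors(text):
    """Parse CSV text into vendor records, honoring quoted fields."""
    reader = csv.reader(io.StringIO(text))
    return [{"id": int(vendor_id), "name": name, "email": email}
            for vendor_id, name, email in reader]

vendors = read_vendors(RAW)
print(vendors[2]["name"])           # quoted comma stays inside the field
print(len(RAW.splitlines()[2].split(",")))  # a naive split yields 4 pieces
```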
Chapter Topics

Extending Hive

- SerDes
- Data Transformation with Custom Scripts (current topic)
- User-Defined Functions
- Parameterized Queries
- Hands-On Exercise: Data Transformation with Hive
- Conclusion

Using TRANSFORM to Process Data Using External Scripts

- You are not limited to manipulating data exclusively in HiveQL
  - Hive allows you to transform data through external scripts or programs
  - These can be written in nearly any language
- This is done with HiveQL's TRANSFORM ... USING construct
  - One or more fields are supplied as arguments to TRANSFORM()
  - The external script is identified by USING
  - It receives each record, processes it, and returns the result
- Use ADD FILE to distribute the script to nodes in the cluster

    hive> ADD FILE myscript.pl;

    hive> SELECT TRANSFORM(*) USING 'myscript.pl'
          FROM employees;

Data Input and Output with TRANSFORM

- Your external program will receive one record per line on standard input
  - Each field in the supplied record will be a tab-separated string
  - NULL values are converted to the literal string \N
- You may need to convert values to appropriate types within your program
  - For example, converting to numeric types for calculations
- Your program must return tab-delimited fields on standard output
  - Output fields can optionally be named and cast using the syntax below

    SELECT TRANSFORM(product_name, price)
    USING 'tax_calculator.py'
    AS (item_name STRING, tax INT)
    FROM products;
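
The slide does not show tax_calculator.py itself, so the following is only a guess at what such a script might contain. The tax rate, the assumption that price is an integer amount, and the function name transform_line are all hypothetical; what the sketch does illustrate is the tab-separated input, the \N marker for NULL, and the type conversion.

```python
TAX_RATE = 0.08       # hypothetical rate, not from the course material
NULL_MARKER = "\\N"   # how Hive passes NULL values to the script

def transform_line(line):
    """Turn 'product_name<TAB>price' into 'product_name<TAB>tax'."""
    product_name, price = line.rstrip("\n").split("\t")
    if price == NULL_MARKER:
        # No price means no tax can be computed; pass NULL back.
        return f"{product_name}\t{NULL_MARKER}"
    # Fields arrive as strings, so convert before calculating.
    tax = int(round(float(price) * TAX_RATE))
    return f"{product_name}\t{tax}"

sample_input = ["Widget\t1000\n", "Gadget\t\\N\n"]
for record in sample_input:
    print(transform_line(record))
```

In a real TRANSFORM script the loop would iterate over sys.stdin instead of the in-memory sample.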
Hive TRANSFORM Example (1)

- Here is a complete example of using TRANSFORM in Hive
  - Our Perl script parses an e-mail address, determines to which country it corresponds, and then returns an appropriate greeting
  - Here's a sample of the input data

    hive> SELECT name, email_address FROM employees;
    Antoine antoine@example.fr
    Kai     kai@example.de
    Pedro   pedro@example.mx

  - Here's the corresponding HiveQL code

    hive> ADD FILE greeting.pl;

    hive> SELECT TRANSFORM(name, email_address)
          USING 'greeting.pl' AS greeting
          FROM employees;

Hive TRANSFORM Example (2)

- The Perl script for this example is shown below
  - A complete explanation of this script follows on the next few slides

    #!/usr/bin/env perl

    %greetings = ('de' => 'Hallo',
                  'fr' => 'Bonjour',
                  'mx' => 'Hola');

    while (<STDIN>) {
        ($name, $email) = split /\t/;
        ($suffix) = $email =~ /\.([a-z]+)$/;
        $greeting = $greetings{$suffix};
        $greeting = 'Hello' unless defined($greeting);
        print "$greeting $name\n";
    }

Hive TRANSFORM Example (3)

    #!/usr/bin/env perl

- The first line tells the system to use the Perl interpreter when running this script.

    %greetings = ('de' => 'Hallo',
                  'fr' => 'Bonjour',
                  'mx' => 'Hola');

- We define our greetings in the next line using an associative array keyed by the country code we'll extract from the e-mail address.

Hive TRANSFORM Example (4)

    while (<STDIN>) {
        ($name, $email) = split /\t/;

- We read each record from standard input within the loop, and then split it into fields based on tab characters.

Hive TRANSFORM Example (5)

        ($suffix) = $email =~ /\.([a-z]+)$/;
        $greeting = $greetings{$suffix};
        $greeting = 'Hello' unless defined($greeting);

- We extract the country code from the e-mail address (the pattern matches any letters following the final dot). We use that to look up a greeting, but default to Hello if we didn't find one.

Hive TRANSFORM Example (6)

        print "$greeting $name\n";
    }

- Finally, we return our greeting as a single field by printing this value to standard output. If we had multiple fields, we'd simply separate each by tab characters when printing them here.

Hive TRANSFORM Example (7)

- Finally, here's the result of our transformation

    hive> ADD FILE greeting.pl;

    hive> SELECT TRANSFORM(name, email_address)
          USING 'greeting.pl' AS greeting
          FROM employees;
    Bonjour Antoine
    Hallo Kai
    Hola Pedro
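
Because TRANSFORM scripts can be written in nearly any language, the same logic could also be expressed in Python. The version below is an illustrative port of greeting.pl, not part of the course material; the function name greet is hypothetical and the sample records are fed from a list rather than from standard input.

```python
import re

GREETINGS = {"de": "Hallo", "fr": "Bonjour", "mx": "Hola"}
SUFFIX = re.compile(r"\.([a-z]+)$")  # letters after the final dot

def greet(line):
    """Build a greeting from one 'name<TAB>email' record."""
    name, email = line.rstrip("\n").split("\t")
    match = SUFFIX.search(email)
    suffix = match.group(1) if match else None
    # Fall back to an English greeting for unknown country codes.
    return f"{GREETINGS.get(suffix, 'Hello')} {name}"

records = ["Antoine\tantoine@example.fr\n",
           "Kai\tkai@example.de\n",
           "Pedro\tpedro@example.mx\n"]
for record in records:
    print(greet(record))
```

As a TRANSFORM script, the loop would read from sys.stdin and the file would be registered with ADD FILE just like the Perl version.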
Chapter Topics

Extending Hive

- SerDes
- Data Transformation with Custom Scripts
- User-Defined Functions (current topic)
- Parameterized Queries
- Hands-On Exercise: Data Transformation with Hive
- Conclusion

Overview of User-Defined Functions (UDFs)

- User-Defined Functions (UDFs) are custom functions
  - Invoked with the same syntax as built-in functions

    hive> SELECT CALC_SHIPPING_COST(order_id, 'OVERNIGHT')
          FROM orders WHERE order_id = 5742354;

- There are three types of UDFs in Hive
  - Standard UDFs
  - User-Defined Aggregate Functions (UDAFs)
  - User-Defined Table Functions (UDTFs)

Developing Hive UDFs

- Hive UDFs are written in Java
  - Currently no support for writing UDFs in other languages
  - Using TRANSFORM may be an alternative to UDFs
- Open source UDFs are plentiful on the Web
- There are three steps for using a UDF in Hive
  1. Add the function's JAR file to Hive
  2. Register the function itself
  3. Use the function in your query

Attention Java Developers
Cloudera now offers a free e-learning module, "Writing UDFs for Hive":
http://tiny.cloudera.com/dac14b

Example: Using a UDF in Hive (1)

- Our example UDF was compiled from sources found in GitHub
  - Popular Web site for many open source software projects
  - Project URL: http://tiny.cloudera.com/dac14e
- We compiled the source and packaged it into a JAR file
  - We have included a copy of it on your VM
- Our example shows the DATE_FORMAT UDF in that JAR file
  - Allows great flexibility in formatting date fields in output

Example: Using a UDF in Hive (2)

- First, register the JAR with Hive
  - Same step as with a custom SerDe

    hive> ADD JAR date-format-udf.jar;

- Next, register the function and assign an alias
  - The quoted value is the fully-qualified Java class for the UDF

    hive> CREATE TEMPORARY FUNCTION DATE_FORMAT
          AS 'com.nexr.platform.hive.udf.UDFDateFormat';

Example: Using a UDF in Hive (3)

- You may then use the function in your query

    hive> SELECT order_date FROM orders LIMIT 1;
    2011-12-06 10:03:35

    hive> SELECT DATE_FORMAT(order_date, 'dd-MMM-yyyy')
          FROM orders LIMIT 1;
    06-Dec-2011

    hive> SELECT DATE_FORMAT(order_date, 'dd/MM/yy')
          FROM orders LIMIT 1;
    06/12/11

    hive> SELECT DATE_FORMAT(order_date, 'EEEE, MMM d, yyyy')
          FROM orders LIMIT 1;
    Tuesday, Dec 6, 2011
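
For comparison, the same three output formats can be produced with Python's datetime module. This sketch only mirrors the results shown above and says nothing about how the UDF is implemented; it assumes the default English locale, and long_form is a hypothetical helper. It also shows why letter case matters in format codes: in Python, as in Java-style date patterns, the month code and the minute code differ only by case.

```python
from datetime import datetime

order_date = datetime.strptime("2011-12-06 10:03:35", "%Y-%m-%d %H:%M:%S")

def long_form(dt):
    """Format like 'Tuesday, Dec 6, 2011' with an unpadded day number."""
    # strftime has no portable unpadded-day code, so build that part by hand.
    return f"{dt.strftime('%A, %b')} {dt.day}, {dt.year}"

print(order_date.strftime("%d-%b-%Y"))  # day, month abbreviation, year
print(order_date.strftime("%d/%m/%y"))  # lowercase %m is the month
print(order_date.strftime("%d/%M/%y"))  # uppercase %M is the minute
print(long_form(order_date))
```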
Chapter Topics

Extending Hive

- SerDes
- Data Transformation with Custom Scripts
- User-Defined Functions
- Parameterized Queries (current topic)
- Hands-On Exercise: Data Transformation with Hive
- Conclusion

Hive Variables (1)

- Hive supports variable substitution
  - Swaps a placeholder with a variable's literal value at run time
  - Variable names are case-sensitive
- Within the Hive shell, set a named variable equal to some value:

    hive> SET state=CA;

- To use the variable's value in a HiveQL query:

    hive> SELECT * FROM employees
          WHERE STATE = '${hiveconf:state}';

Hive Variables (2)

- You can set variables when you invoke Hive from the command line
  - Eases repetitive queries by reducing need to modify HiveQL
- For example, imagine that we have the following in state.hql

    SELECT COUNT(DISTINCT emp_id) FROM employees
    WHERE state = '${hiveconf:state}';

- This makes creating per-state reports easy:

    $ hive -hiveconf state=CA -f state.hql > ca_count.txt
    $ hive -hiveconf state=NY -f state.hql > ny_count.txt
    $ hive -hiveconf state=TX -f state.hql > tx_count.txt
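
Conceptually, the substitution is a textual replacement performed before the query runs. The Python sketch below imitates that step for ${hiveconf:name} placeholders only; the function name substitute is hypothetical, and leaving unknown placeholders untouched is a choice made for this sketch rather than a statement about Hive's behavior.

```python
import re

PLACEHOLDER = re.compile(r"\$\{hiveconf:([^}]+)\}")

def substitute(query, hiveconf):
    """Replace each ${hiveconf:name} with its value from the dict."""
    # Unknown names are left as-is (a choice made for this sketch).
    return PLACEHOLDER.sub(
        lambda m: hiveconf.get(m.group(1), m.group(0)), query)

TEMPLATE = ("SELECT COUNT(DISTINCT emp_id) FROM employees "
            "WHERE state = '${hiveconf:state}';")

for state in ("CA", "NY", "TX"):
    print(substitute(TEMPLATE, {"state": state}))
```

Because the value is pasted in as literal text, the quotes around the placeholder in the query are what make it a string.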
Chapter Topics

Extending Hive

- SerDes
- Data Transformation with Custom Scripts
- User-Defined Functions
- Parameterized Queries
- Hands-On Exercise: Data Transformation with Hive (current topic)
- Conclusion

Hands-On Exercise: Data Transformation with Hive

- In this Hands-On Exercise, you will transform data with Hive
- Please refer to the Hands-On Exercise Manual for instructions

Essential Points

- SerDes govern how Hive reads and writes a table's records
  - Specified (or defaulted) when creating a table
- TRANSFORM processes records using an external program
  - This can be written in nearly any language
- UDFs are User-Defined Functions
  - Custom logic that can be invoked just like built-in functions
- Hive substitutes variable placeholders with literal values you assign
  - This is done when you execute the query
  - Especially helpful with repetitive queries

Introduction to Impala
Chapter 15

Introduction to Impala

In this chapter, you will learn
- What Impala is and how it compares to Hive, Pig, and RDBMSs
- How Impala executes queries
- Where Impala fits into the data center
- What notable differences exist between Impala and Hive
- How to run queries from the shell or browser

Chapter Topics

Introduction to Impala

- What is Impala?
- How Impala Differs from Hive and Pig
- How Impala Differs from Relational Databases
- Limitations and Future Directions
- Using the Impala Shell
- Conclusion

What is Impala?

- High-performance SQL engine for vast amounts of data
  - Massively parallel processing (MPP)
  - Inspired by Google's Dremel project
  - Query latency measured in milliseconds
- Impala runs on Hadoop clusters
  - Can query data stored in HDFS or HBase tables
  - Reads and writes data in common Hadoop file formats
- Developed by Cloudera
  - 100% open source, released under the Apache software license

Interacting with Impala

- Impala supports a subset of SQL-92
  - Plus a few extensions found in MySQL and Oracle SQL dialects
  - Almost identical to HiveQL
- Impala offers many interfaces for running queries
  - Command-line shell
  - Hue Web application
  - ODBC / JDBC

Why Use Impala?

- Many benefits are the same as with Hive or Pig
  - More productive than writing MapReduce code
  - No software development experience required
  - Leverage existing knowledge of SQL
- One benefit exclusive to Impala is speed
  - Highly optimized for queries
  - Almost always at least five times faster than either Hive or Pig
  - Often 20 times faster or more

Use Case: Business Intelligence

- Many leading business intelligence tools support Impala

[Figure: browser screenshot of the "Dualcore Inc. Dashboard" at https://dashboard.example.com/, with panels "Revenue by Period", "Order Shipments Per Month", "Top States for In-Store Sales", and "Suppliers by Region" (map tooltip: "Japan: 31 suppliers")]

Where Impala Fits Into the Data Center

[Diagram: transaction records from an application database, log data from Web servers, and documents from a file server feed a Hadoop cluster with Impala; one analyst uses the Impala shell for ad hoc queries and another uses Impala via a BI tool]

Where to Get Impala

- Download Impala from http://www.cloudera.com/
- We strongly recommend running Impala on CDH 4.2 or higher
  - Requires a 64-bit Linux platform
- Installation and configuration are outside the scope of this course
  - Your virtual machine includes a working installation of Impala

Comparing Impala to Hive and Pig

- Let's first look at similarities between Hive, Pig, and Impala
  - Queries expressed in high-level languages
  - Alternatives to writing MapReduce code
  - Used to analyze data stored on Hadoop clusters
- Impala shares the metastore with Hive
  - Tables created in Hive are visible in Impala (and vice versa)

Contrasting Impala to Hive and Pig (1)

- Hive and Pig answer queries by running MapReduce jobs
  - MapReduce is a general-purpose computation framework
  - Not optimized for executing interactive SQL queries
- MapReduce overhead results in high latency
  - Even a trivial query takes 10 seconds or more
- Impala does not use MapReduce
  - Uses a custom execution engine built specifically for Impala
  - Queries can complete in a fraction of a second

Contrasting Impala to Hive and Pig (2)

- Hive, Pig, and Impala all support
  - Executing queries via an interactive shell or the command line
  - Grouping, joining, and filtering data
  - Reading and writing data in multiple formats
- Impala currently lacks some Hive and Pig features
  - More details later in this chapter
- Hive and Pig are best suited to long-running batch processes
  - Particularly data transformation tasks
- Impala is best for interactive/ad hoc queries

How Impala Executes a Query

- Each slave node in the cluster runs an Impala daemon
  - Co-located with the HDFS Data Node
  - Client issues query to an Impala daemon
- Impala daemon plans the query
  - Checks the local metastore cache
  - Distributes the query across other Impala daemons in the cluster
  - Daemons read data from HDFS or HBase
  - Streams results to client
- Two other daemons running on master nodes support query execution
  - The State Store daemon
    - Provides lookup service for Impala daemons
    - Periodically checks status of Impala daemons
  - The Catalog daemon relays metadata changes to all the Impala daemons in a cluster

Comparing Impala to a Relational Database

Feature                     Relational Database   Impala
Query language              SQL                   SQL-92 subset
Update individual records   Yes                   No
Delete individual records   Yes                   No
Transactions                Yes                   No
Indexing                    Yes                   No
Latency                     Very low              Low
Data size                   Terabytes             Petabytes
ODBC / JDBC support         Yes                   Yes

Hive Features Currently Unsupported in Impala

- Impala does not currently support some features in Hive
  - Many of these are being considered for future releases
- Complex data types (ARRAY, MAP, or STRUCT)
- BINARY data type
- External transformations
- Custom SerDes
- Indexing
- Bucketing and table sampling

Other Notable Differences Between Impala and Hive

- Only one DISTINCT clause allowed per query in Impala
  - Typical workaround is to use subselect and UNION
- Impala requires that queries with ORDER BY also specify a LIMIT
  - This sets an outer bound on the result set
  - The size of the LIMIT can be arbitrarily large
- Impala and Hive handle out-of-range values differently
  - Hive returns NULL
  - Impala returns the maximum value for that type

Query Fault Tolerance in Impala

- Queries in both Hive and Impala are distributed across nodes
- Hive answers queries by running MapReduce jobs
  - Takes advantage of Hadoop's fault tolerance
  - If a node fails during a query, MapReduce runs the task elsewhere
- Impala has its own execution engine
  - Currently lacks fault tolerance
  - If a node fails during a query, the query will fail
  - Workaround: just re-run the query
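
The re-run workaround can be automated from a script. A minimal sketch follows; `retry` is a hypothetical shell helper, not an impala-shell feature, and it assumes the wrapped command reports failure through a non-zero exit status.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to MAX times until it succeeds.
# Usage: retry MAX command [args...]
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Demonstration with stand-in commands; in practice the command might be
#   retry 3 impala-shell -f myquery.hql -o results.txt
retry 3 true && echo "succeeded"
retry 2 false || echo "gave up after 2 attempts"
```

Retrying blindly suits short interactive queries; for long batch work, the fault tolerance that Hive and Pig inherit from MapReduce may be the better fit.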
Starting the Impala Shell

- You can execute statements in the Impala shell
  - This interactive tool is similar to the shell in MySQL or Hive
- Execute the impala-shell command to start the shell
  - Some log messages truncated to better fit the slide

    $ impala-shell
    Connected to localhost.localdomain:21000
    Server version: impalad version 1.0.1
    Welcome to the Impala shell.
    [localhost.localdomain:21000] >

- Use the -i hostname:port option to connect to another server

    $ impala-shell -i myserver.example.com:21000
    [myserver.example.com:21000] >

Using the Impala Shell

- Enter semicolon-terminated statements at the prompt
  - Hit enter to execute the query
  - Impala pretty-prints the output
  - Use the quit command to exit the Impala shell

    $ impala-shell
    > SELECT cust_id, fname, lname FROM customers
      WHERE zipcode='20525';
    +---------+--------+-----------+
    | cust_id | fname  | lname     |
    +---------+--------+-----------+
    | 1133567 | Steven | Robertson |
    | 1171826 | Robert | Gillis    |
    +---------+--------+-----------+
    > quit;
    $

Note: shell prompt abbreviated as >

Running Queries from the Command Line

- You can execute a file containing queries using the -f option

    $ impala-shell -f myquery.hql

- Run queries directly from the command line with the -q option

    $ impala-shell -q 'SELECT * FROM users'

- Use -o (and optionally specify a delimiter) to capture output to a file

    $ impala-shell -f myquery.hql \
        -o results.txt \
        --output_file_field_delim='\t'

Interacting with the Operating System

- Use shell to execute system commands from within the Impala shell

    > shell date;
    Mon May 20 16:44:35 PDT 2013

- No direct support for HDFS commands
  - But you could run hadoop fs using shell

    > shell hadoop fs -mkdir /reports/sales/2013;

Accessing Impala with Hue (1)

- Alternatively, you can access Impala through Hue
- To use Hue, browse to http://hue_server:8888/
  - May need to start the Hue service first (sudo service hue start)
- Launch Hue's Impala interface by clicking its icon

[Figure: Impala icon in Hue]

Accessing Impala with Hue (2)

- Hue allows you to run Impala queries from your Web browser

[Figure: browser screenshot of the "Hue - Impala Query" page at https://hueserver.example.com:8888/impala/ with the following query entered]

    SELECT zipcode, COUNT(order_id) AS total
    FROM customers JOIN orders
    ON customers.cust_id = orders.cust_id
    WHERE zipcode LIKE '6%'
    GROUP BY zipcode
    ORDER BY total DESC
    LIMIT 100;

Accessing Impala with Hue (3)

- Hue displays the results in a sortable table

[Figure: browser screenshot of the "Hue - Impala Query" results page at https://hueserver.example.com:8888/impala/]

Essential Points

- Impala is a high-performance SQL engine
  - Runs on Hadoop clusters
  - Reads and writes data in HDFS or HBase tables
- Queries are expressed in a SQL dialect similar to HiveQL
- Primary difference compared to Hive/Pig is speed
  - Impala avoids MapReduce latency and overhead
- Impala is best suited to ad hoc/interactive queries
  - Hive and Pig are better for long-running batch processes
  - Impala does not currently support all features of Hive

Bibliography (1)

The following offer more information on topics discussed in this chapter
- Free O'Reilly Cloudera Impala book
  - http://tiny.cloudera.com/dac15f
- Cloudera Impala: Real-Time Queries in Apache Hadoop
  - http://tiny.cloudera.com/dac15a
- Wired Article on Impala
  - http://tiny.cloudera.com/dac15b
- Cloudera Blog Detailing Impala Features and Performance
  - http://tiny.cloudera.com/dac15c
- Impala Documentation at Cloudera Web site
  - http://tiny.cloudera.com/dac15e

Bibliography (2)

- 37Signals Blog Comparing Performance of Impala, Hive, and MySQL
  - http://tiny.cloudera.com/dac15d

Analyzing Data with Impala
Chapter 16

Analyzing Data with Impala

In this chapter, you will learn
- How Impala's query syntax compares to HiveQL
- How to create databases and tables in Impala
- How and when to refresh the metadata cache
- Which data types Impala supports
- How to add support for a User-Defined Function (UDF)
- How to structure your query for better performance
- What other factors influence Impala performance

Chapter Topics

Analyzing Data with Impala

- Basic Syntax
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- User-Defined Functions
- Improving Impala Performance
- Hands-On Exercise: Interactive Analysis with Impala
- Conclusion

Overview of Impala Query Syntax

- Impala's query language is a subset of SQL-92
  - Plus a few extensions from Oracle and MySQL dialects
- Syntax almost identical to HiveQL
  - Differences mainly related to features unsupported in Impala
  - Most Hive queries can be executed verbatim in Impala
- Impala may also support statements that Hive does not
  - Such as the ability to insert individual rows *

    > INSERT INTO customers VALUES (1234567, 'Abe',
      'Froman', '123 Oak St.', 'Chicago', 'IL', '60601');

* Not intended for bulk loads

Case-Sensitivity and Comments

- Keywords are not case-sensitive, but often capitalized by convention
- Supports single- and multi-line comments
  - These are allowed in scripts, the shell, and on the command line

    $ cat find_customers.sql
    /* This script will query the customers table
     * and find all customers in a given ZIP code
     */
    SELECT cust_id, fname, lname
    FROM customers
    WHERE zipcode='60601'; -- downtown Chicago

Databases and Tables in Impala

- Every Impala table belongs to a database
  - Impala and Hive share a metastore
  - The same databases and tables are visible in both Hive and Impala
- The default database is selected at startup
  - The USE command switches to another database
  - List tables in a database with the SHOW TABLES command

    > USE accounting;
    > SHOW TABLES;
    +----------+
    | name     |
    +----------+
    | accounts |
    +----------+

Creating Databases and Tables in Impala

- Data definition is generally identical to Hive
  - Custom SerDes and bucketing are unsupported

    > CREATE DATABASE sales;
    > USE sales;
    > CREATE EXTERNAL TABLE prospects
      (id INT,
       name STRING COMMENT 'Include surname',
       email STRING,
       active BOOLEAN COMMENT 'True, if on mailing list',
       last_contact TIMESTAMP)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LOCATION '/dept/sales/prospects';

Displaying Table Structure

- Use DESCRIBE to display a table's structure
  - DESCRIBE EXTENDED is unsupported

    > DESCRIBE prospects;
    +--------------+-----------+--------------------------+
    | name         | type      | comment                  |
    +--------------+-----------+--------------------------+
    | id           | int       |                          |
    | name         | string    | Include surname          |
    | email        | string    |                          |
    | active       | boolean   | True, if on mailing list |
    | last_contact | timestamp |                          |
    +--------------+-----------+--------------------------+

Metadata Caching in Impala

- Impala shares the metastore with Hive
  - Tables created in Hive are visible in Impala (and vice versa)
- Impala caches metadata
  - The tables' schema definitions
  - The locations of the tables' HDFS blocks
- Metadata updates made from within Impala are broadcast throughout the cluster
  - The Impala metadata cache is updated automatically throughout the Impala cluster
  - No additional actions are required
- Metadata updates made from outside of Impala are not known to Impala
  - For example, changes made in Hive, or changes to the tables in HDFS
  - Additional actions are required to update the metadata cache

Updating the Impala Metadata Cache

Metadata Change                 Required Action             Effect on Cache

New table added to the          INVALIDATE METADATA         Marks the entire cache as stale;
metastore in Hive               (with no table name)        metadata cache is reloaded as needed

Table schema modified in Hive   REFRESH <table>             Reloads the metadata for one table
                                                            immediately. Reloads HDFS block
                                                            locations for new data files only.

New data added to a table       REFRESH <table>             (Same as above)

Data in a table has been        INVALIDATE METADATA         Marks the metadata for a single
extensively altered, such as    <table>                     table as stale. When the metadata
by HDFS balancing                                           is needed, all HDFS block locations
                                                            are retrieved.
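
The table above amounts to a small lookup from the kind of change to the statement to run. A sketch of that lookup as a shell function follows; `metadata_stmt` and its change-type keywords (`new_table`, `schema_changed`, `data_added`, `data_rewritten`) are hypothetical names chosen for illustration, and the function only prints the statement rather than sending it to Impala.

```shell
#!/bin/sh
# Hypothetical helper: print the Impala statement suggested by the table
# above for a given kind of change made outside Impala.
#   $1 = kind of change: new_table | schema_changed | data_added | data_rewritten
#   $2 = table name (not needed for new_table)
metadata_stmt() {
    case $1 in
        new_table)      echo "INVALIDATE METADATA;" ;;
        schema_changed) echo "REFRESH $2;" ;;
        data_added)     echo "REFRESH $2;" ;;
        data_rewritten) echo "INVALIDATE METADATA $2;" ;;
        *) echo "unknown change type: $1" >&2; return 1 ;;
    esac
}

metadata_stmt new_table
metadata_stmt data_added orders
metadata_stmt data_rewritten orders
```

The printed statement could then be pasted into the Impala shell or passed to `impala-shell -q`.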

Selecting Data in Impala

- Use SELECT to retrieve data from tables
  - Results are formatted for display

    > SELECT fname,lname,city,state FROM customers
      WHERE cust_id = 1234567;
    +-------+--------+---------+-------+
    | fname | lname  | city    | state |
    +-------+--------+---------+-------+
    | Abe   | Froman | Chicago | IL    |
    +-------+--------+---------+-------+

- Impala does not require a FROM clause

    > SELECT SQRT(64) AS square_root;

Using Impala Built-in Functions

- Invoke built-in functions as you would in SQL or HiveQL

    > SELECT CONCAT_WS(', ', lname, fname) AS fullname
      FROM customers WHERE cust_id=1234567;
    +-------------+
    | fullname    |
    +-------------+
    | Froman, Abe |
    +-------------+

- Impala supports many of the same built-in functions as Hive
  - Lacks some others, including many formatting and text processing functions

Data Types in Impala

- Each column in a table is associated with a data type
  - Impala supports most types available in Hive
  - Most are similar to those found in relational databases

    > DESCRIBE prospects;
    +--------------+-----------+--------------------------+
    | name         | type      | comment                  |
    +--------------+-----------+--------------------------+
    | id           | int       |                          |
    | name         | string    | Include surname          |
    | email        | string    |                          |
    | active       | boolean   | True, if on mailing list |
    | last_contact | timestamp |                          |
    +--------------+-----------+--------------------------+

Impala's Integer Types

- Integer types are appropriate for whole numbers
  - Both positive and negative values allowed

Name       Description                                       Example Value
TINYINT    Range: -128 to 127                                17
SMALLINT   Range: -32,768 to 32,767                          5842
INT        Range: -2,147,483,648 to 2,147,483,647            84127213
BIGINT     Range: ~ -9.2 quintillion to ~ 9.2 quintillion    632197432180964

Impala's Decimal Types

- Decimal types are appropriate for floating point numbers
  - Both positive and negative values allowed
  - Caution: avoid using when exact values are required!

Name     Description             Example Value
FLOAT    Decimals                3.14159
DOUBLE   Very precise decimals   3.14159265358979323846

Other Types in Impala

- Impala can also store a few other types of information
  - Only one character type (variable length)

Name        Description          Example Value
STRING      Character sequence   Betty F. Smith
BOOLEAN     True or False        TRUE
TIMESTAMP   Instant in time      2013-06-14 16:51:05

- Impala does not support BINARY or complex types

Data Type Conversion

- Hive auto-converts a STRING column used in numeric context

    hive> SELECT zipcode FROM customers LIMIT 1;
    60601
    hive> SELECT zipcode + 1.5 FROM customers LIMIT 1;
    60602.5

- Impala requires an explicit CAST operation for this

    > SELECT zipcode + 1.5 FROM customers LIMIT 1;
    ERROR: AnalysisException: Arithmetic operation ...
    > SELECT CAST(zipcode AS float) + 1.5
      FROM customers LIMIT 1;
    60602.5

Limiting and Sorting Query Results

- The LIMIT clause sets the maximum number of rows returned

    > SELECT fname, lname FROM customers LIMIT 10;

- Caution: no guarantee regarding which 10 results are returned
  - Use ORDER BY for top-N queries
  - The field(s) you ORDER BY must be selected
- When using ORDER BY, the LIMIT clause is mandatory in Impala

    > SELECT cust_id, fname, lname FROM customers
      ORDER BY cust_id DESC LIMIT 10;
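
Since a forgotten LIMIT is an easy mistake when porting Hive queries, a script can flag it before the query reaches Impala. The sketch below is a crude text check, not a SQL parser; `order_by_missing_limit` is a hypothetical helper, and it can be fooled by keywords inside string literals or comments.

```shell
#!/bin/sh
# Crude text check (not a SQL parser): succeed when the query text contains
# ORDER BY but no LIMIT keyword. Matching is case-insensitive.
order_by_missing_limit() {
    # Flatten newlines so a clause split across lines is still found.
    flat=$(printf '%s' "$1" | tr '\n' ' ')
    printf '%s\n' "$flat" | grep -qi 'order[[:space:]]\{1,\}by' || return 1
    printf '%s\n' "$flat" | grep -qiw 'limit' && return 1
    return 0
}

for q in "SELECT cust_id FROM customers ORDER BY cust_id DESC" \
         "SELECT cust_id FROM customers ORDER BY cust_id DESC LIMIT 10"; do
    if order_by_missing_limit "$q"; then
        echo "needs LIMIT: $q"
    else
        echo "ok: $q"
    fi
done
```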
Joins in Impala

- Like Hive, Impala can join multiple data sets
- Impala supports the same types of joins that Hive does
  - Inner joins
  - Outer joins (left, right, and full)
  - Left semi joins
  - Cross joins

Record Grouping and Aggregate Functions

- GROUP BY groups selected data by one or more columns
  - Caution: Columns not part of aggregation must be listed in GROUP BY

stores table

    id   city      state   region
    a    Albany    NY      EAST
    b    Boston    MA      EAST
    c    Chicago   IL      NORTH
    d    Detroit   MI      NORTH
    e    Elgin     IL      NORTH

    SELECT region, state,
           COUNT(id) AS num
    FROM stores
    GROUP BY region, state;

Result of query

    region   state   num
    EAST     MA      1
    EAST     NY      1
    NORTH    IL      2
    NORTH    MI      1
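
To see what the grouping does step by step, the same counts can be reproduced with standard text tools on a copy of the stores rows. This is an emulation for intuition only, not how Impala executes the query; `stores` and `group_counts` are hypothetical function names, and the sorted output order comes from `sort`, whereas a GROUP BY without ORDER BY guarantees no particular order.

```shell
#!/bin/sh
# The stores table from the slide, one row per line: id city state region.
stores() {
    cat <<'EOF'
a Albany NY EAST
b Boston MA EAST
c Chicago IL NORTH
d Detroit MI NORTH
e Elgin IL NORTH
EOF
}

# Emulate: SELECT region, state, COUNT(id) AS num
#          FROM stores GROUP BY region, state;
group_counts() {
    awk '{ print $4, $3 }' |         # keep only the grouping columns
        sort |                       # bring identical pairs together
        uniq -c |                    # count the rows in each group
        awk '{ print $2, $3, $1 }'   # reorder to: region state num
}

stores | group_counts
```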
Overview of Impala User-Defined Functions (UDFs)

- Like Hive, Impala supports User-Defined Functions (UDFs)
- Hive UDFs can be used in Impala with no changes
  - With a few exceptions
- There are two types of UDFs in Impala
  - Standard UDFs
  - User-Defined Aggregate Functions (UDAFs)
- Impala UDFs can be written in Java or C++
  - C++ UDFs are implemented as shared objects
- Impala C++ UDFs cannot be used in Hive

Using a Java UDF in Impala (1)

- Register the function with Impala
  - After the function name, specify the data types that correspond to the signature of the UDF class's evaluate method
  - In the RETURNS clause, specify the data type that corresponds to the return type of the UDF class's evaluate method
  - Identify the jar file containing the UDF class in the LOCATION clause
  - Specify the UDF class name in the SYMBOL clause
  - You do not need to run a separate ADD JAR step

    CREATE FUNCTION STRIP(STRING) RETURNS STRING
    LOCATION '/user/hive/udfs/MyUDFs.jar'
    SYMBOL='com.example.hive.ql.udf.UDFStrip';

Using a Java UDF in Impala (2)

- You may then use the function in Impala queries

    SELECT STRIP(email_address) FROM employees;

Using a C++ UDF in Impala

- Register the function with Impala

    CREATE FUNCTION COUNT_VOWELS(STRING)
    RETURNS INT
    LOCATION '/user/hive/udfs/sampleudfs.so'
    SYMBOL='CountVowels';

- You may then use the function in your query

    SELECT COUNT_VOWELS(email_address) FROM employees;

Impala Performance Overview

- Query performance is affected by three broad categories of factors
  - Computing statistics on tables before running joins
  - The format and type of data being queried
  - The hardware and configuration of your cluster

Join Performance Optimization

- Impala uses statistics about tables to optimize joins
- You should compute statistics for tables with COMPUTE STATS
  - After you load a table initially
  - When the amount of data in a table changes substantially

    COMPUTE STATS orders;
    COMPUTE STATS order_details;
    SELECT COUNT(o.order_id) FROM orders o
    JOIN order_details d ON (o.order_id = d.order_id)
    WHERE YEAR(o.order_date) = 2008;
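
When many tables are loaded in one batch, the COMPUTE STATS statements can be generated rather than written one by one. A minimal sketch follows; `compute_stats_stmts` is a hypothetical helper that only prints the statements, and the table names are the ones used in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one COMPUTE STATS statement per table name
# given as an argument.
compute_stats_stmts() {
    for table in "$@"; do
        printf 'COMPUTE STATS %s;\n' "$table"
    done
}

compute_stats_stmts orders order_details
```

Redirecting the output to a file gives a script that could be run with `impala-shell -f` after each bulk load.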

Data Formats Supported by Impala

- Impala can query data in several formats
  - Table below summarizes compatibility
  - Create/load using Hive if those operations are unsupported in Impala

File Type      Description of File Type                  Read   Create   Insert
Parquet        High-performance columnar format          Yes    Yes      Yes
Text           Plaintext delimited flat file format      Yes    Yes      Yes
Avro           Structured cross-platform binary format   Yes    No       No
RCFile         Columnar format compatible with Hive      Yes    Yes      No
SequenceFile   Binary flat file format                   Yes    Yes      No

Data Formats and Optimization

- The limiting factor in most queries is I/O
  - Disk speed is the most common bottleneck
  - Columnar formats reduce I/O when few columns selected
- If performance is a key concern, choose Parquet
  - This may limit compatibility with other tools

CREATE TABLE order_details
  (order_id INT,
   prod_id INT)
  STORED AS PARQUETFILE;
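
Data that already exists in another format can be converted by creating a Parquet table with the same columns and copying the rows into it. This is a minimal sketch: order_details_text and order_details_parquet are hypothetical table names, and the source table is assumed to hold the same two columns in text format.

```sql
-- Create an empty Parquet table with the same column definitions
-- (order_details_text and order_details_parquet are hypothetical names)
CREATE TABLE order_details_parquet LIKE order_details_text
STORED AS PARQUETFILE;

-- Copy the rows; Impala writes them out in Parquet format
INSERT INTO order_details_parquet
SELECT * FROM order_details_text;

-- A query that references one column only needs to read that column's data
SELECT COUNT(DISTINCT prod_id) FROM order_details_parquet;
```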

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#34$
Impala Query Size

- Query size is based on a query's working set size
  - The working set of a query contains all records after
    - Filtering rows
    - Pruning unused columns
    - Performing aggregation, if applicable
- For aggregations, the query size is the working set size for all the tables in the query
- For joins, the query size is the working set size for all the tables in the join, excluding the largest table
- Impala queries must fit into the cluster's aggregate memory
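
To make the working set idea concrete, compare two queries against the orders and order_details tables used earlier in this chapter. This only illustrates the rules listed above and is not a measurement; actual memory use depends on the data and on the Impala version.

```sql
-- Larger working set: no filter and no column pruning,
-- so every column of every row from both tables is kept
SELECT *
FROM orders o
JOIN order_details d ON (o.order_id = d.order_id);

-- Smaller working set: rows are filtered by year, only the referenced
-- columns are kept, and the result is aggregated to one row per product
SELECT d.prod_id, COUNT(*) AS num_orders
FROM orders o
JOIN order_details d ON (o.order_id = d.order_id)
WHERE YEAR(o.order_date) = 2008
GROUP BY d.prod_id;
```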

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#35$
Cluster Hardware

- CPU: Impala benefits from newer processors
  - Takes advantage of additional optimization when available
- Memory: 32GB minimum, 64GB is better
- Disks: more is better
  - Impala tries to maximize throughput across disks
  - Servers with 12 or more disks are ideal

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#36$
Chapter Topics

Analyzing Data with Impala

- Basic Syntax
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- User-Defined Functions
- Improving Impala Performance
- Hands-On Exercise: Interactive Analysis with Impala (current section)
- Conclusion

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#37$
Hands-On Exercise: Interactive Analysis with Impala

- In this Hands-On Exercise, you will run ad hoc queries with Impala
- Please refer to the Hands-On Exercise Manual for instructions

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#38$
Chapter Topics

Analyzing Data with Impala

- Basic Syntax
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- User-Defined Functions
- Improving Impala Performance
- Hands-On Exercise: Interactive Analysis with Impala
- Conclusion (current section)

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#39$
Essential Points

- Impala's query syntax is nearly identical to HiveQL
  - Most Hive queries can be executed verbatim in Impala
- Impala caches metadata from the metastore it shares with Hive
  - Use INVALIDATE METADATA or REFRESH to update the cache following external changes
- Impala supports most simple data types from Hive
- Query structure and file format can affect performance
- Your cluster's hardware also affects performance
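
The two cache-related statements apply in different situations. As a brief sketch, using the orders table from earlier in this chapter:

```sql
-- A table was created or dropped outside Impala (for example, in Hive):
-- discard the cached metadata so it is reloaded from the metastore
INVALIDATE METADATA;

-- New data files were added to an existing table outside Impala
-- (for example, by a Hive INSERT): reload the metadata for just that table
REFRESH orders;
```

REFRESH is the lighter-weight choice when only the data files of a known table have changed.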

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#40$
Bibliography

The following offer more information on topics discussed in this chapter
- Introducing Parquet: Efficient Columnar Storage for Apache Hadoop
  - http://tiny.cloudera.com/dac16a
- Impala Language Reference
  - http://tiny.cloudera.com/dac16b

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#41$
Choosing the Best Tool for the Job
Chapter 17

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#1$
Choosing the Best Tool for the Job

In this chapter, you will learn
- How MapReduce, Pig, Hive, Impala, and RDBMSs compare to one another
- Why a workflow might involve several different tools
- How to select the best tool for a given job

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#3$
Chapter Topics

Choosing the Best Tool for the Job

- Comparing MapReduce, Pig, Hive, Impala, and Relational Databases (current section)
- Which to Choose?
- Conclusion

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#4$
Recap of Data Analysis/Processing Tools

- MapReduce
  - Low-level processing and analysis
- Pig
  - Procedural data flow language executed using MapReduce
- Hive
  - SQL-based queries executed using MapReduce
- Impala
  - High-performance SQL-based queries using a custom execution engine

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#5$
Comparing Pig, Hive, and Impala

Description of Feature              Pig   Hive  Impala
----------------------------------  ----  ----  ------
SQL-based query language            No    Yes   Yes
Optional schema and metastore       Yes   No    No
User-defined functions (UDFs)       Yes   Yes   Yes
Process data with external scripts  Yes   Yes   No
Extensible file format support      Yes   Yes   No
Complex data types                  Yes   Yes   No
Query latency                       High  High  Low
Built-in data partitioning          No    Yes   Yes
Accessible via ODBC / JDBC          No    Yes   Yes

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#6$
Do These Replace an RDBMS?

- Probably not if RDBMS is used for its intended purpose
- Relational databases are optimized for
  - Relatively small amounts of data
  - Immediate results
  - In-place modification of data (UPDATE and DELETE)
- Pig, Hive, and Impala are optimized for
  - Large amounts of read-only data
  - Extensive scalability at low cost
- Pig and Hive are better suited for batch processing
  - Impala and RDBMSs are better for interactive use

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#7$
Comparing RDBMS to Hive and Impala

Feature                    RDBMS      Hive       Impala
-------------------------  ---------  ---------  ---------
Insert individual records  Yes        No         Yes
Update and delete records  Yes        No         No
Transactions               Yes        No         No
Role-based authorization   Yes        Yes        No
Stored procedures          Yes        No         No
Index support              Extensive  Limited    None
Latency                    Very low   High       Low
Data size                  Terabytes  Petabytes  Petabytes
Complex data types         No         Yes        No
Storage cost               Very high  Very low   Very low
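
The first row of the table can be illustrated with Impala's INSERT ... VALUES form. This is a minimal sketch against the order_details table defined in the previous chapter (columns order_id and prod_id); the values are made up for illustration. Each such statement generally produces its own small data file, so this form tends to suit small tests rather than bulk loading.

```sql
-- Insert a single record (illustrative values)
INSERT INTO order_details VALUES (1000001, 42);

-- Insert several records in one statement
INSERT INTO order_details VALUES (1000002, 42), (1000002, 57);

-- UPDATE and DELETE are listed as unsupported in the table above,
-- so these rows cannot be modified in place afterwards
```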

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#8$
Recap: Apache Sqoop

- Sqoop helps you integrate Hadoop tools with relational databases
- It exchanges data between RDBMS and Hadoop
  - Can import all tables, a single table, or a portion of a table into HDFS
  - Supports incremental imports
  - Can also export data from HDFS back to the database

[Figure: data exchange between Database and Hadoop Cluster]

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#9$
Chapter Topics

Choosing the Best Tool for the Job

- Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
- Which to Choose? (current section)
- Conclusion

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#10$
Which to Choose?

- Choose the best one for a given task
  - Mix and match them as needed
- MapReduce
  - Low-level approach offers great flexibility
  - More time-consuming and error-prone to write
  - Best when control matters more than productivity
- Pig, Hive, and Impala offer more productivity
  - Faster to write, test, and deploy than MapReduce
  - Better choice for most analysis and processing tasks

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#11$
Analysis Workflow Example

[Figure: workflow centered on a Hadoop Cluster with Impala]
- Import Transaction Data from RDBMS
- Sessionize Web Log Data with Pig
- Sentiment Analysis on Social Media with Hive
- Analyst using Impala shell for ad hoc queries
- Analyst using Impala via BI tool
- Generate Nightly Reports using Pig, Hive, or Impala

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#12$
Chapter Topics

Choosing the Best Tool for the Job

- Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
- Which to Choose?
- Conclusion (current section)

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#13$
Essential Points

- You have learned about several tools for data analysis
  - Each is better at some tasks than others
  - Choose the best one for a given job
  - Workflows may involve exchanging data between them
- Selection criteria include scale, speed, control, and productivity
  - MapReduce offers control at the cost of productivity
  - Pig and Hive offer productivity but not necessarily speed
  - Relational databases offer speed but not scalability
  - Impala offers scalability and speed but less control

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 17#14$
Conclusion
Chapter 18

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 18#1$
Course Objectives (1)

During this course, you have learned
- The purpose of Hadoop and its related tools
- The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
- How to identify typical use cases for large-scale data analysis
- How to load data from relational databases and other sources
- How to manage data in HDFS and export it for use with other systems
- How Pig, Hive, and Impala improve productivity for typical analysis tasks
- The language syntax and data formats supported by these tools

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 18#3$
Course Objectives (2)

- How to design and execute queries on data stored in HDFS
- How to join diverse datasets to gain valuable business insight
- How to analyze structured, semi-structured, and unstructured data
- How Hive and Pig can be extended with custom functions and scripts
- How to store and query data for better performance
- How to determine which tool is the best choice for a given task

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 18#4$
Which Course to Take Next?

Cloudera offers a range of training courses for you and your team
- For developers
  - Cloudera Developer Training for Apache Hadoop
  - Cloudera Training for Apache HBase
- For system administrators
  - Cloudera Administrator Training for Apache Hadoop
- For data scientists
  - Introduction to Data Science: Building Recommender Systems
- For architects, managers, CIOs, and CTOs
  - Cloudera Essentials for Apache Hadoop

"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 18#5$
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 18#6$