Using Pig, Hive, and Impala with Hadoop
Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201403
Introduction
Chapter 1
Course Chapters
- Introduction
- Hadoop Fundamentals
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing with Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion
Chapter Topics
Introduction
- About this course
- About Cloudera
- Course Logistics
- Introductions
Course Objectives (1)
During this course, you will learn
- The purpose of Hadoop and its related tools
- The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
- How to identify typical use cases for large-scale data analysis
- How to load data from relational databases and other sources
- How to manage data in HDFS and export it for use with other systems
- How Pig, Hive, and Impala improve productivity for typical analysis tasks
- The language syntax and data formats supported by these tools
Course Objectives (2)
- How to design and execute queries on data stored in HDFS
- How to join diverse datasets to gain valuable business insight
- How to analyze structured, semi-structured, and unstructured data
- How Hive and Pig can be extended with custom functions and scripts
- How to store and query data for better performance
- How to determine which tool is the best choice for a given task
About Cloudera (1)
- The leader in Apache Hadoop-based software and services
- Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
- Provides support, consulting, training, and certification for Hadoop users
- Staff includes committers to virtually all Hadoop projects
- Many authors of industry standard books on Apache Hadoop projects
  - Tom White, Eric Sammer, Lars George, etc.
About Cloudera (2)
- Customers include many key users of Hadoop
  - Allstate, AOL Advertising, Box, CBS Interactive, eBay, Experian, Groupon, Macys.com, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, ...
- Cloudera public training:
  - Cloudera Developer Training for Apache Hadoop
  - Cloudera Administrator Training for Apache Hadoop
  - Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  - Cloudera Training for Apache HBase
  - Introduction to Data Science: Building Recommender Systems
  - Cloudera Essentials for Apache Hadoop
- Onsite and custom training is also available
CDH
- CDH (Cloudera's Distribution including Apache Hadoop)
  - 100% open source, enterprise-ready distribution of Hadoop and related projects
  - The most complete, tested, and widely-deployed distribution of Hadoop
  - Integrates all the key Hadoop ecosystem projects
  - Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball
Cloudera Express
- Cloudera Express
  - Free download
- The best way to get started with Hadoop
- Includes CDH
- Includes Cloudera Manager
  - End-to-end administration for Hadoop
  - Deploy, manage, and monitor your cluster
Cloudera Enterprise
- Cloudera Enterprise
  - Subscription product including CDH and Cloudera Manager
- Includes support
- Includes extra Cloudera Manager features
  - Configuration history and rollbacks
  - Rolling updates
  - LDAP integration
  - SNMP support
  - Automated disaster recovery
  - Etc.
Logistics
- Class start and finish times
- Lunch
- Breaks
- Restrooms
- Wi-Fi access
- Virtual machines
- Can I come in early/stay late?
Your instructor will give you details on how to access the course materials and exercise instructions for the class
Introductions
- About your instructor
- About you
  - Where do you work and what do you do there?
  - Which database(s) and platform(s) do you use?
  - Have you worked with Apache Hadoop or related tools?
  - Any experience as a developer?
  - What programming languages do you use?
  - What are your expectations for this course?
Hadoop Fundamentals
Chapter 2
Hadoop Fundamentals
In this chapter, you will learn
- Which factors led to the era of Big Data
- What Hadoop is and what significant features it offers
- How it offers reliable storage for massive amounts of data with HDFS
- How it supports large-scale data processing through MapReduce
- How Hadoop Ecosystem tools can boost an analyst's productivity
- Several ways to integrate Hadoop into the modern data center
Aside: Please Launch the Virtual Machine
- At the end of this chapter you will work on the first Hands-On Exercise
- The exercises are performed in a Virtual Machine (VM)
- The first time the VM is launched, it takes several minutes to boot
  - It is configuring the class environment
  - Subsequent boots are much faster
- To save time, please launch the VM now so that it will be ready by the time we get to the first Hands-On Exercise
  - Your instructor will tell you how to do this
Chapter Topics
Hadoop Fundamentals
- The Motivation for Hadoop
- Hadoop Overview
- HDFS
- MapReduce
- The Hadoop Ecosystem
- Exercise Scenario Explanation
- Hands-On Exercise: Data Ingest with Hadoop Tools
- Conclusion
Velocity
- We are generating data faster than ever
  - Processes are increasingly automated
  - Systems are increasingly interconnected
  - People are increasingly interacting online
Variety
- We are producing a wide variety of data
  - Social network connections
  - Server and application log files
  - Electronic medical records
  - Images, audio, and video
  - RFID and wireless sensor network events
  - Product ratings on shopping and review Web sites
  - And much more
- Not all of this maps cleanly to the relational model
Volume
- Every day
  - More than 1.5 billion shares are traded on the New York Stock Exchange
  - Facebook stores 2.7 billion comments and Likes
  - Google processes about 24 petabytes of data
- Every minute
  - Foursquare handles more than 2,000 check-ins
  - TransUnion makes nearly 70,000 updates to credit files
- And every second
  - Banks process more than 10,000 credit card transactions
Data Has Value
- This data has many valuable applications
  - Product recommendations
  - Predicting demand
  - Marketing analysis
  - Fraud detection
  - And many, many more
- We must process it to extract that value
  - And processing all the data can yield more accurate results
We Need a System that Scales
- We're generating too much data to process with traditional tools
- Two key problems to address
  - How can we reliably store large amounts of data at a reasonable cost?
  - How can we analyze all the data we have stored?
What is Apache Hadoop?
- Scalable and economical data storage and processing
  - Distributed and fault-tolerant
  - Harnesses the power of industry standard hardware
- Heavily inspired by technical documents published by Google
- Core Hadoop consists of two main components
  - Storage: the Hadoop Distributed File System (HDFS)
  - Processing: MapReduce
  - Plus the infrastructure needed to make them work, including
    - Filesystem and administration utilities
    - Job scheduling and monitoring
Scalability
- Hadoop is a distributed system
  - A collection of servers running Hadoop software is called a cluster
- Individual servers within a cluster are called nodes
  - Typically standard rackmount servers running Linux
  - Each node both stores and processes data
- Add more nodes to the cluster to increase scalability
  - A cluster may contain up to several thousand nodes
  - You can scale out incrementally as required
Fault Tolerance
- Paradox: Adding nodes increases chances that any one of them will fail
  - Solution: build redundancy into the system and handle it automatically
- Files loaded into HDFS are replicated across nodes in the cluster
  - If a node fails, its data is re-replicated using one of the other copies
- Data processing jobs are broken into individual tasks
  - Each task takes a small amount of data as input
  - Thousands of tasks (or more) often run in parallel
  - If a node fails during processing, its tasks are rescheduled elsewhere
- Routine failures are handled automatically without any loss of data
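The paradox on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes node failures are independent, and the per-node failure probability (0.001 for some fixed time window) is a made-up figure, not a number from the course material.

```python
def prob_any_node_fails(node_count: int, per_node_failure_prob: float) -> float:
    """Probability that at least one of node_count nodes fails.

    Assumes failures are independent, which is a simplification.
    """
    return 1.0 - (1.0 - per_node_failure_prob) ** node_count


# Hypothetical per-node failure probability for some fixed time window.
P_FAIL = 0.001

for nodes in (1, 10, 100, 1000):
    chance = prob_any_node_fails(nodes, P_FAIL)
    print(f"{nodes:>5} nodes: {chance:.3f} chance that at least one fails")
```

Under these assumptions, the chance of seeing at least one failure grows quickly with cluster size, which is why redundancy has to be built in and handled automatically rather than treated as an exception.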
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 02#14%
HDFS: Hadoop Distributed File System
- Provides inexpensive and reliable storage for massive amounts of data
- Optimized for sequential access to a relatively small number of large files
  - Each file is likely to be 100MB or larger
  - Multi-gigabyte files are typical
- In some ways, HDFS is similar to a UNIX filesystem
  - Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
  - UNIX-style file ownership and permissions
- There are also some major deviations from UNIX
  - No concept of a current directory
  - Cannot modify files once written
  - Must use Hadoop-specific utilities or custom code to access HDFS
HDFS Architecture
[Figure: HDFS architecture diagram]
Accessing HDFS via the Command Line
- HDFS is not a general purpose filesystem
  - Not built into the OS, so only specialized tools can access it
  - End users typically access HDFS via the hadoop fs command
- Example: display the contents of the /user/fred/sales.txt file
$ hadoop fs -cat /user/fred/sales.txt
- Example: create a directory (below the root) called reports
$ hadoop fs -mkdir /reports
Copying Local Data To and From HDFS
- Remember that HDFS is distinct from your local filesystem
  - Use hadoop fs -put to copy local files to HDFS
  - Use hadoop fs -get to fetch a local copy of a file from HDFS
[Figure: files moving between the local filesystem and the Hadoop cluster]
More hadoop fs Command Examples
- Copy file input.txt from local disk to the user's directory in HDFS
$ hadoop fs -put input.txt input.txt
  - This will copy the file to /user/username/input.txt
- Get a directory listing of the HDFS root directory
$ hadoop fs -ls /
- Delete the file /reports/sales.txt
$ hadoop fs -rm /reports/sales.txt
Introducing MapReduce
- We typically process data in Hadoop using MapReduce
- MapReduce is not a language; it's a programming model
  - A style of processing data popularized by Google
- Benefits of MapReduce
  - Simplicity
  - Flexibility
  - Scalability
Understanding Map and Reduce
- MapReduce consists of two functions: map and reduce
  - The output from map becomes the input to reduce
- The map function always runs first
  - Typically used to filter, transform, or parse data
- The reduce function is optional
  - Normally used to summarize data from the map function (aggregation)
  - Not always needed: you can run map-only jobs
- Each piece is simple, but can be powerful when combined
MapReduce Example
- MapReduce code for Hadoop is typically written in Java
  - But possible to use nearly any language with Hadoop Streaming
- The following slides will explain an entire MapReduce job
  - Input: text file containing order ID, employee name, and sale amount
  - Output: sum of all sales per employee
Job Input
0 Alice 3625
1 Bob 5174
2 Alice 893
3 Alice 2139
4 Diana 3581
5 Carlos 1039
6 Bob 4823
7 Alice 5834
8 Carlos 392
9 Diana 1804
Job Output
Alice 12491
Bob 9997
Carlos 1431
Diana 5385
Explanation of the Map Function
- Hadoop splits job into many individual map tasks
  - Number of map tasks is determined by the amount of input data
  - Each map task receives a portion of the overall job input to process
  - Mappers process one input record at a time
  - For each input record, they emit zero or more records as output
- In this case, the map task simply parses the input record
  - And then emits the name and price fields for each as output
Job Input        Output from Map Tasks
0 Alice 3625     Alice 3625
1 Bob 5174       Bob 5174
2 Alice 893      Alice 893
3 Alice 2139     Alice 2139
4 Diana 3581     Diana 3581
5 Carlos 1039    Carlos 1039
6 Bob 4823       Bob 4823
7 Alice 5834     Alice 5834
8 Carlos 392     Carlos 392
9 Diana 1804     Diana 1804
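The map step described on this slide can be sketched in a few lines of Python. This is not Hadoop API code: the function name map_sale and the assumption that fields are separated by whitespace are illustrative choices, and a real job would typically be written in Java or run through Hadoop Streaming.

```python
from typing import Iterator, Tuple


def map_sale(record: str) -> Iterator[Tuple[str, int]]:
    """Parse one input record and emit (employee name, sale amount).

    Emits nothing for a record that does not have exactly three fields,
    showing that a mapper may produce zero output records for an input.
    """
    fields = record.split()  # assumes whitespace-delimited fields
    if len(fields) != 3:
        return
    _order_id, name, amount = fields
    yield name, int(amount)


# A few records from the example job input.
job_input = ["0 Alice 3625", "1 Bob 5174", "2 Alice 893"]

# The mapper is called once per record; its outputs are collected in order.
map_output = [pair for record in job_input for pair in map_sale(record)]
print(map_output)
```

Writing the mapper as a generator mirrors the "zero or more records" contract: each call handles exactly one input record and yields as many output pairs as it needs.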
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 02#25%
Shuffle and Sort
- Hadoop automatically sorts and merges output from all map tasks
  - This intermediate process is known as the shuffle and sort
  - The result is supplied to reduce tasks
Map Task #1 Output: Alice 3625, Bob 5174
Map Task #2 Output: Alice 893, Alice 2139
Map Task #3 Output: Diana 3581, Carlos 1039
Input to Reduce Task #1: Alice 3625, Alice 893, Alice 2139, Alice 5834, Carlos 1039, Carlos 392
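As a rough illustration of what the shuffle and sort accomplishes, the sketch below groups map output by key and orders the keys. It is a single-process simplification: the function name shuffle_and_sort is hypothetical, the partitioning of keys across several reduce tasks is not modeled, and real Hadoop does not guarantee the order of values within a key.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shuffle_and_sort(
    map_outputs: Iterable[List[Tuple[str, int]]]
) -> Dict[str, List[int]]:
    """Merge the output of all map tasks, grouping values by key.

    Keys in the returned dict are in sorted order.
    """
    grouped = defaultdict(list)
    for task_output in map_outputs:
        for key, value in task_output:
            grouped[key].append(value)
    return {key: grouped[key] for key in sorted(grouped)}


# Output of the first three map tasks from the example.
map_task_outputs = [
    [("Alice", 3625), ("Bob", 5174)],
    [("Alice", 893), ("Alice", 2139)],
    [("Diana", 3581), ("Carlos", 1039)],
]

reduce_input = shuffle_and_sort(map_task_outputs)
for name, amounts in reduce_input.items():
    print(name, amounts)
```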
Explanation of Reduce Function
- Reducer input comes from the shuffle and sort process
  - As with map, the reduce function receives one record at a time
  - A given reducer receives all records for a given key
  - For each input record, reduce can emit zero or more output records
- Our reduce function simply sums total per person
  - And emits employee name (key) and total (value) as output
Input to Reduce Task #1: Alice 3625, Alice 893, Alice 2139, Alice 5834, Carlos 1039, Carlos 392
Input to Reduce Task #2: Bob 5174, Bob 4823, Diana 3581, Diana 1804
Job Output (Output of Reduce Tasks)
Alice 12491
Carlos 1431
Bob 9997
Diana 5385
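The reduce step for this job can be sketched as follows. The function name reduce_sales and the dictionaries standing in for each reduce task's input are illustrative assumptions; they are not part of any Hadoop API.

```python
from typing import Iterable, Iterator, List, Tuple


def reduce_sales(name: str, amounts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Sum all sale amounts for one employee and emit (name, total)."""
    yield name, sum(amounts)


# Grouped input for each reduce task, as produced by the shuffle and sort.
reduce_task_1_input = {"Alice": [3625, 893, 2139, 5834], "Carlos": [1039, 392]}
reduce_task_2_input = {"Bob": [5174, 4823], "Diana": [3581, 1804]}

job_output: List[Tuple[str, int]] = []
for task_input in (reduce_task_1_input, reduce_task_2_input):
    for name, amounts in task_input.items():
        # Each key is handled by exactly one call to the reduce function.
        job_output.extend(reduce_sales(name, amounts))

for name, total in job_output:
    print(name, total)
```

Because a given reducer sees every value for its keys, each total can be computed independently of the other reduce task.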
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 02#27%
Putting it All Together
- Here's the data flow for the entire MapReduce job
[Figure: the example data flowing from job input through the map tasks, the shuffle and sort, and the reduce tasks to the job output]
MapReduce Architecture
- MapReduce version 1 and version 2 (YARN)
  - Similar master/slave architecture
  - Details differ slightly
- Master nodes
  - Run master daemons to accept jobs, and monitor and distribute work
- Slave nodes
  - Run slave daemons to start tasks
  - Do the actual work
  - Report status back to master daemons
- HDFS and MapReduce are collocated
  - Slave nodes run both HDFS and MR slave daemons on the same machines
MapReduce Version 1 Architecture
- MRv1 master daemon: JobTracker
  - Divides jobs into individual tasks
  - Assigns/monitors tasks on slave nodes
- MRv1 slave daemon: TaskTracker
  - Starts tasks
  - Reports status back to JobTracker
MapReduce Version 2 Architecture
- MRv2 uses the YARN cluster management framework
- MRv2 master daemon: ResourceManager
  - Allocates cluster resources for a job
  - Starts an Application Master for a job
- MRv2 slave daemons:
  - ApplicationMaster
    - One per application
    - Divides jobs into individual tasks
    - Assigns tasks to NodeManagers
  - NodeManager
    - Runs on all slave nodes
    - Starts and monitors the actual processing task
The Hadoop Ecosystem
- Many related tools integrate with Hadoop
  - Data analysis
  - Database integration
  - Workflow management
- These are not considered core Hadoop
  - Rather, they are part of the Hadoop ecosystem
  - Many are also open source Apache projects
Apache Pig
- Apache Pig builds on Hadoop to offer high-level data processing
  - This is an alternative to writing low-level MapReduce code
  - Pig is especially good at joining and transforming data
Apache Hive
- Hive is another abstraction on top of MapReduce
  - Like Pig, it also reduces development time
  - Hive uses a SQL-like language called HiveQL
- The Hive interpreter runs on a client machine
  - Turns HiveQL queries into MapReduce jobs
  - Submits those jobs to the cluster
Apache HBase
- HBase is the Hadoop database
- Can store massive amounts of data
  - Gigabytes, terabytes, and even petabytes of data in a table
  - Tables can have many thousands of columns
- Scales to provide very high write throughput
  - Hundreds of thousands of inserts per second
- Fairly primitive when compared to RDBMS
  - "NoSQL": There is no high-level query language
  - Use API to scan / get / put values based on keys
Cloudera Impala
- Massively parallel SQL engine which runs on a Hadoop cluster
  - Inspired by Google's Dremel project
  - Can query data stored in HDFS or HBase tables
- High performance
  - Typically at least 10 times faster than Hive or MapReduce
  - High-level query language (subset of SQL-92)
- Impala is 100% Apache-licensed open source
Apache Sqoop
- Sqoop exchanges data between an RDBMS and Hadoop
- It can import all tables, a single table, or a portion of a table into HDFS
  - Does this very efficiently via a Map-only MapReduce job
  - Result is a directory in HDFS containing comma-delimited text files
- Sqoop can also export data from HDFS back to the database
Importing Tables with Sqoop
- This example imports the customers table from a MySQL database
  - Will create /mydata/customers directory in HDFS
  - Directory will contain comma-delimited text files
$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table customers
- Adding the --direct option may offer better performance
  - Uses database-specific tools instead of Java
  - This option is not compatible with all databases
- Cloudera offers high-performance custom connectors for many databases
Importing An Entire Database with Sqoop
- Import all tables from the database (fields will be tab-delimited)
$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --fields-terminated-by '\t' \
    --warehouse-dir /mydata
Importing Partial Tables with Sqoop
- Import only specified columns from products table
$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table products \
    --columns "prod_id,name,price"
- Import only matching rows from products table
$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table products \
    --where "price >= 1000"
Incremental Imports with Sqoop
- What if new records are added to the database?
  - Could re-import all records, but this is inefficient
- Sqoop's incremental append mode imports only new records
  - Based on value of last record in specified column
$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821
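The selection rule behind incremental append mode can be sketched in plain Python. This models only the idea (keep rows whose check column exceeds the last imported value, then remember the new maximum); it is not how Sqoop is implemented, and the function name incremental_append and the sample rows are made up for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def incremental_append(
    source_rows: List[Row], check_column: str, last_value: int
) -> Tuple[List[Row], int]:
    """Return rows newer than last_value and the value to use next time."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    next_last_value = max(
        (row[check_column] for row in new_rows), default=last_value
    )
    return new_rows, next_last_value


# Hypothetical contents of the orders table at import time.
orders = [
    {"order_id": 6713820, "total": 1200},
    {"order_id": 6713821, "total": 450},
    {"order_id": 6713822, "total": 9900},
    {"order_id": 6713823, "total": 75},
]

imported, next_last_value = incremental_append(orders, "order_id", 6713821)
print([row["order_id"] for row in imported], next_last_value)
```

The returned next_last_value plays the role of the --last-value argument for the following run, so each import picks up where the previous one stopped.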
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 02#42%
Handling Modifications with Incremental Imports
- What if existing records are also modified in the database?
  - Incremental append mode doesn't handle this
- Sqoop's lastmodified incremental mode adds and updates records
  - Caveat: You must maintain a timestamp column in your table
$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table shipments \
    --incremental lastmodified \
    --check-column last_update_date \
    --last-value "2013-06-12 03:15:59"
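The difference from append mode is that a modified row must replace its earlier copy instead of being added alongside it. The sketch below illustrates that idea only: the function name import_lastmodified, the shipment_id key, the status column, and the sample rows are all hypothetical, and the code does not reflect Sqoop's internals.

```python
from datetime import datetime
from typing import Any, Dict, List

Row = Dict[str, Any]


def import_lastmodified(
    existing: Dict[int, Row],
    source_rows: List[Row],
    key: str,
    check_column: str,
    last_value: datetime,
) -> Dict[int, Row]:
    """Merge rows modified after last_value into the existing data by key."""
    merged = dict(existing)
    for row in source_rows:
        if row[check_column] > last_value:
            # A new key is added; an existing key is overwritten.
            merged[row[key]] = row
    return merged


last_import = datetime(2013, 6, 12, 3, 15, 59)

# Hypothetical rows already imported into Hadoop, keyed by shipment_id.
already_imported = {
    1: {"shipment_id": 1, "status": "packed",
        "last_update_date": datetime(2013, 6, 10, 9, 0, 0)},
    2: {"shipment_id": 2, "status": "shipped",
        "last_update_date": datetime(2013, 6, 11, 14, 30, 0)},
}

# Hypothetical current contents of the shipments table.
shipments = [
    {"shipment_id": 1, "status": "shipped",
     "last_update_date": datetime(2013, 6, 12, 8, 0, 0)},    # modified
    {"shipment_id": 2, "status": "shipped",
     "last_update_date": datetime(2013, 6, 11, 14, 30, 0)},  # unchanged
    {"shipment_id": 3, "status": "packed",
     "last_update_date": datetime(2013, 6, 13, 10, 0, 0)},   # new
]

merged = import_lastmodified(
    already_imported, shipments, "shipment_id", "last_update_date", last_import
)
print({k: v["status"] for k, v in merged.items()})
```

This is why the timestamp column matters: without it there is no way to tell which existing rows changed since the previous import.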
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 02#43%
Exporting Data from Hadoop to RDBMS with Sqoop
- We've seen several ways to pull records from an RDBMS into Hadoop
  - It is sometimes also helpful to push data in Hadoop back to an RDBMS
- Sqoop supports this via export
$ sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --export-dir /mydata/recommender_output \
    --table product_recommendations
Apache Flume
- Flume imports data into HDFS as it is being generated by various sources
[Figure: log files, UNIX syslog, and custom sources feeding data into the Hadoop cluster]
Recap: Data Center Integration
[Figure: Sqoop connecting the Hadoop cluster to a relational database (OLTP) and to a data warehouse (OLAP)]
Apache Oozie
- Oozie allows developers to manage processing workflows
  - It coordinates execution and control of individual jobs
- Oozie supports many workflow actions, including
  - Executing MapReduce jobs
  - Running Pig or Hive scripts
  - Executing standard Java or shell programs
  - Manipulating data via HDFS commands
  - Running remote commands with SSH
  - Sending e-mail messages
Hands-On Exercises: Scenario Explanation
- Hands-On Exercises throughout the course will reinforce the topics being discussed
  - Exercises simulate the kind of tasks often performed using the tools you will learn about in class
  - Most exercises depend on data generated in earlier exercises
- Scenario: Dualcore Inc. is a leading electronics retailer
  - More than 1,000 brick-and-mortar stores
  - Dualcore also has a thriving e-commerce Web site
- Dualcore has hired you to help find value in their data
  - You will process and analyze data from internal and external sources
  - Identify opportunities to increase revenue
  - Find new ways to reduce costs
  - Help other departments achieve their goals
About the Training Virtual Machine
- During this course, you will perform numerous hands-on exercises using the Cloudera Training Virtual Machine (VM)
- The VM has Hadoop installed in pseudo-distributed mode
  - Simply a cluster consisting of a single node
  - Typically used for testing code before deploying to a large cluster
Hands-On Exercise: Data Ingest with Hadoop Tools
- In this Hands-On Exercise, you'll gain practice adding data from the local filesystem and a relational database server to HDFS
  - You will analyze this data in subsequent exercises
- Please refer to the Hands-On Exercise Manual for instructions
Essential Points
- We are generating more data, and faster, than ever before
- Most of this data maps poorly to structured relational tables
- The ability to store and process this data can yield valuable insight
- Hadoop offers scalable data storage (HDFS) and processing (MapReduce)
- There are lots of tools in the Hadoop ecosystem that help you to integrate Hadoop with other systems, manage complex jobs, and ease analysis
Bibliography
The following offer more information on topics discussed in this chapter
- 10 Hadoopable Problems (recorded presentation)
  - http://tiny.cloudera.com/dac02a
- Introduction to Apache MapReduce and HDFS (recorded presentation)
  - http://tiny.cloudera.com/dac02b
- Guide to HDFS Commands
  - http://tiny.cloudera.com/dac02c
- Hadoop: The Definitive Guide, 3rd Edition
  - http://tiny.cloudera.com/dac02d
- Sqoop User Guide
  - http://tiny.cloudera.com/dac02e
Introduction to Pig
Chapter 3
Introduction to Pig
In this chapter, you will learn
- The key features Pig offers
- How organizations use Pig for data processing and analysis
- How to use Pig interactively and in batch mode
Chapter Topics
Introduction to Pig
- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion
Apache Pig Overview
- Apache Pig is a platform for data analysis and processing on Hadoop
  - It offers an alternative to writing MapReduce code directly
- Originally developed as a research project at Yahoo
  - Goals: flexibility, productivity, and maintainability
  - Now an open-source Apache project
The Anatomy of Pig
- Main components of Pig
  - The data flow language (Pig Latin)
  - The interactive shell where you can type Pig Latin statements (Grunt)
  - The Pig interpreter and execution engine
- Example Pig Latin script

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

- Steps performed by the Pig interpreter and execution engine
  - Preprocess and parse Pig Latin
  - Check data types
  - Make optimizations
  - Plan execution
  - Generate MapReduce jobs
  - Submit job(s) to Hadoop
  - Monitor progress
Where to Get Pig
- CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  - A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  - Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  - Simple installation
  - 100% free and open source
- Installation is outside the scope of this course
  - Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop
Pig Features
- Pig is an alternative to writing low-level MapReduce code
- Many features enable sophisticated analysis and processing
  - HDFS manipulation
  - UNIX shell commands
  - Relational operations
  - Positional references for fields
  - Common mathematical functions
  - Support for custom functions and data formats
  - Complex data structures
How Are Organizations Using Pig?
- Many organizations use Pig for data analysis
  - Finding relevant records in a massive data set
  - Querying multiple data sets
  - Calculating values from input data
- Pig is also frequently used for data processing
  - Reorganizing an existing data set
  - Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization
- Pig can help you extract valuable information from Web server log files

Use Case: Data Sampling
- Sampling can help you explore a representative portion of a large data set
  - Allows you to examine this portion with tools that do not scale well
  - Supports faster iterations during development of analysis jobs
[Figure: random sampling reduces a 100 TB data set to a 50 MB sample]

Use Case: ETL Processing
- Pig is also widely used for Extract, Transform, and Load (ETL) processing
Using Pig Interactively
- You can use Pig interactively, via the Grunt shell
  - Pig interprets each Pig Latin statement as you type it
  - Execution is delayed until output is required
  - Very useful for ad hoc data inspection
- Example of how to start, use, and exit Grunt

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

- You can also execute a Pig Latin statement from the UNIX shell via the -e option
Interacting with HDFS
- You can manipulate HDFS with Pig, via the fs command
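A short Grunt session sketches how the fs command might be used. The directory and file names (sales_archive, europe.txt) are hypothetical; the options after fs are the same ones accepted by the hadoop fs command.

```pig
grunt> fs -mkdir sales_archive;               -- create a directory in HDFS
grunt> fs -put europe.txt sales_archive/;     -- copy a local file into it
grunt> fs -ls sales_archive;                  -- list the directory contents
```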
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 03#17%
Interacting with UNIX
- The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls; -- lists HDFS files
grunt> sh ls; -- lists local files
Running Pig Scripts
- A Pig script is simply Pig Latin code stored in a text file
  - By convention, these files have the .pig extension
- You can run a Pig script from within the Grunt shell via the run command
  - This is useful for automation and batch execution
- It is common to run a Pig script directly from the UNIX shell

$ pig salesreport.pig

MapReduce and Local Modes
- As described earlier, Pig turns Pig Latin into MapReduce jobs
  - Pig submits those jobs for execution on the Hadoop cluster
- It is also possible to run Pig in local mode using the -x flag
  - This runs MapReduce jobs on the local machine instead of the cluster
  - Local mode uses the local filesystem instead of HDFS
  - Can be helpful for testing before deploying a job to production

Client-Side Log Files
- If a job fails, Pig may produce a log file to explain why
  - These log files are typically produced in your current working directory
  - On the local (client) machine
Essential Points
- Pig offers an alternative to writing MapReduce code directly
  - Pig interprets Pig Latin code in order to create MapReduce jobs
  - It then submits these MapReduce jobs to the Hadoop cluster
- You can execute Pig Latin code interactively through Grunt
  - Pig delays job execution until output is required
- It is also common to store Pig Latin code in a script for batch execution
  - Allows for automation and code reuse
Bibliography
The following offer more information on topics discussed in this chapter
- Apache Pig Web Site
  http://pig.apache.org/
- Process a Million Songs with Apache Pig
  http://tiny.cloudera.com/dac03a
- Powered By Pig
  http://tiny.cloudera.com/dac03b
- LinkedIn: User Engagement Powered By Apache Pig and Hadoop
  http://tiny.cloudera.com/dac03c
- Programming Pig (book)
  http://tiny.cloudera.com/dac03d
Basic Data Analysis with Pig
Chapter 4
Basic Data Analysis with Pig
In this chapter, you will learn
- The basic syntax of Pig Latin
- How to load and store data using Pig
- Which simple data types Pig uses to represent data
- How to sort and filter data in Pig
- How to use many of Pig's built-in functions for data processing
Chapter Topics
Basic Data Analysis with Pig
- Pig Latin Syntax
- Loading Data
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion
Pig Latin Overview
- Pig Latin is a data flow language
  - The flow of data is expressed as a sequence of statements
- The following is a simple Pig Latin script to load, filter, and store data

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 100;
/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

Pig Latin Grammar: Keywords
- In the script above, LOAD, AS, FILTER, BY, STORE, and INTO are Pig Latin keywords
  - Keywords are reserved; you cannot use them to name things

Pig Latin Grammar: Identifiers (1)
- Identifiers are the names assigned to fields and other data structures
  - In the script above, allsales, bigsales, name, and price are identifiers
Pig Latin Grammar: Identifiers (2)
- Identifiers must conform to Pig's naming rules
- An identifier must always begin with a letter
  - This may only be followed by letters, numbers, or underscores
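As a small illustration of these naming rules, the alias q4_sales_2012 below is a hypothetical name chosen only to show a valid identifier; the invalid names appear in comments so the snippet still parses.

```pig
-- Valid: begins with a letter, followed by letters, numbers, and underscores
q4_sales_2012 = LOAD 'sales' AS (name, price);

-- Invalid (would be rejected by the parser):
--   2012_sales   begins with a number
--   q4-sales     contains a hyphen
```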
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#8%
Pig Latin Grammar: Comments
- Pig Latin supports two types of comments
  - Single line comments begin with --
  - Multi-line comments begin with /* and end with */

/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';
Case-Sensitivity in Pig Latin
- Whether case is significant in Pig Latin depends on context
- Keywords are not case-sensitive
  - Neither are operators (such as AND, OR, or IS NULL)
- Identifiers and paths are case-sensitive
  - So are function names (such as SUM or COUNT) and constants
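The following sketch, using the sales file from earlier examples, shows one way these rules play out. The field types are assumptions added for illustration.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Keywords are not case-sensitive, so lowercase filter ... by also works
bigsales = filter allsales by price > 100;

-- Function names are case-sensitive: UPPER is recognized, upper is not
names = FOREACH bigsales GENERATE UPPER(name);

-- Identifiers are case-sensitive: bigsales and BigSales would be different aliases
```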
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#10%
Common Operators in Pig Latin
- Many commonly-used operators in Pig Latin are familiar to SQL users
  - Notable difference: Pig Latin uses == and != for comparison
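A minimal sketch of the comparison operators, assuming the two-field sales file used elsewhere in this chapter:

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Equality uses == (where SQL uses =)
alices = FILTER allsales BY name == 'Alice';

-- Inequality uses != (where SQL commonly uses <>)
others = FILTER allsales BY name != 'Alice';
```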
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#11%
Basic Data Loading in Pig
- Pig's default loading function is called PigStorage
  - The name of the function is implicit when calling LOAD
  - PigStorage assumes text format with tab-separated columns
- Consider the following file in HDFS called sales
  - The two fields are separated by tab characters

Alice	2999
Bob	3625
Carlos	2764

- This example loads data from the above file
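A load statement for this file could look like the following, which reuses the statement shown in the earlier Grunt example:

```pig
-- PigStorage is used implicitly; fields are split on tab characters
allsales = LOAD 'sales' AS (name, price);
```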
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#13%
Data Sources: Files and Directories
- The previous example loads data from a file named sales
- Since this is not an absolute path, it is relative to your home directory
  - Your home directory in HDFS is typically /user/youruserid/
  - You can also specify an absolute path (e.g., /dept/sales/2012/q4)
- The path can also refer to a directory
  - In this case, Pig will recursively load all files in that directory
  - File patterns (globs) are also supported
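The three kinds of path might be used as sketched below. The glob pattern and the west_ file naming are hypothetical, chosen only for illustration.

```pig
-- Relative path: resolved against your HDFS home directory
allsales = LOAD 'sales' AS (name, price);

-- Absolute path to a directory: the files inside it are loaded
q4sales = LOAD '/dept/sales/2012/q4' AS (name, price);

-- Glob: loads only the files whose names match the pattern
westsales = LOAD '/dept/sales/2012/q4/west_*' AS (name, price);
```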
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#14%
Specifying Column Names During Load
- The previous example also assigns names to each column
- Assigning column names is not required
  - Omitting them can be useful when exploring a new dataset
  - Refer to fields by position ($0 is first, $1 is second, $53 is 54th, etc.)
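A brief sketch of positional references, assuming the same two-field sales file:

```pig
-- No AS clause, so the fields have no names
allsales = LOAD 'sales';

-- $1 refers to the second field (the price in the sales file)
bigsales = FILTER allsales BY $1 > 100;

-- $0 refers to the first field (the name)
names = FOREACH bigsales GENERATE $0;
```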
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#15%
Using Alternate Column Delimiters
- You can specify an alternate delimiter as an argument to PigStorage
- This example shows how to load comma-delimited data
  - Note that this is a single statement
- Or to load pipe-delimited data without specifying column names
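Both cases can be sketched as follows. The file names sales.csv and sales.txt are assumptions made for illustration.

```pig
-- Comma-delimited data; one statement, split across lines for readability
csvsales = LOAD 'sales.csv'
           USING PigStorage(',')
           AS (name, price);

-- Pipe-delimited data, without column names
pipesales = LOAD 'sales.txt' USING PigStorage('|');
```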
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#16%
Simple Data Types in Pig
- Pig supports several basic data types
  - Similar to those in most databases and programming languages
- Pig treats fields of unspecified type as an array of bytes
  - Called the bytearray type in Pig
List of Simple Data Types
- There are eight data types in Pig for simple values
  - int, long, float, double, boolean*, datetime*, chararray, bytearray
* Not available in older versions of Pig
Specifying Data Types in Pig
- Pig will do its best to determine data types based on context
  - For example, you can calculate sales commission as price * 0.1
  - In this case, Pig will assume that this value is of type double
- However, it is better to specify data types explicitly when possible
  - Helps with error checking and optimizations
  - Easiest to do this upon load using the format fieldname:type
- Choosing the right data type is important to avoid loss of precision
- Important: Avoid using floating point numbers to represent money!
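For instance, the sales file could be loaded with explicit types as sketched here. Treating price as a whole number of cents is an assumption that fits the sample values and avoids floating point for money.

```pig
-- fieldname:type for each column; price is assumed to be in cents
allsales = LOAD 'sales' AS (name:chararray, price:int);
```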
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#20%
HCatalog
- You have seen how to specify names, types, and paths during LOAD
- A new project called HCatalog can store this information permanently
  - So it need not be specified each time
  - Simplifies sharing of metadata between Pig, Hive, and MapReduce
- However, HCatalog is still in early development and is not yet widely used
  - It was first added to CDH in release 4.2.0 (February, 2013)
- You can load data easily after first setting it up with HCatalog
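A load through HCatalog might look like the sketch below. It assumes that a table named sales has already been registered with HCatalog and that Pig was started with HCatalog support (for example, pig -useHCatalog). The loader's package name shown is the one used by HCatalog releases of this era; later releases moved it to org.apache.hive.hcatalog.pig.

```pig
-- Names, types, and location come from HCatalog, so no AS clause is needed
allsales = LOAD 'sales' USING org.apache.hcatalog.pig.HCatLoader();
```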
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#21%
How Pig Handles Invalid Data
- When encountering invalid data, Pig substitutes NULL for the value
  - For example, an int field containing the value Q4
- The IS NULL and IS NOT NULL operators test for null values
  - Note that NULL is not the same as the empty string ''
- You can use these operators to filter out bad records
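One way to separate good records from bad ones, assuming price is declared as an int so that non-numeric values become NULL:

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Keep only the records whose price could be read as an int
goodsales = FILTER allsales BY price IS NOT NULL;

-- Collect the rejected records for later inspection
badsales = FILTER allsales BY price IS NULL;
```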
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#22%
Key Data Concepts in Pig
- Relational databases have tables, rows, columns, and fields
- We will use the following data to illustrate Pig's equivalents
  - Assume this data was loaded from a tab-delimited text file as before

Pig Data Concepts: Fields
- A single element of data is called a field
  - It corresponds to one of the eight data types seen earlier

Pig Data Concepts: Tuples
- A collection of values is called a tuple
  - Fields within a tuple are ordered, but need not all be of the same type

Pig Data Concepts: Bags
- A collection of tuples is called a bag
- Tuples within a bag are unordered by default
  - The field count and types may vary between tuples in a bag

Pig Data Concepts: Relations
- A relation is simply a bag with an assigned name (alias)
  - Most Pig Latin statements create a new relation
- A typical script loads one or more datasets into relations
  - Processing creates new relations instead of modifying existing ones
  - The final result is usually also a relation, stored as output
Data Output in Pig
- The command used to handle output depends on its destination
  - DUMP: sends output to the screen
  - STORE: sends output to disk (HDFS)
- Example of DUMP output, using data from the file shown earlier
  - The parentheses and commas indicate tuples with multiple fields

(Alice,2999,us)
(Bob,3625,ca)
(Carlos,2764,mx)
(Dieter,1749,de)
(Étienne,2368,fr)
(Franco,5637,it)
Storing Data with Pig
- The STORE command is used to store data to HDFS
  - Similar to LOAD, but writes data instead of reading it
  - The output path is the name of a directory
  - The directory must not yet exist
- As with LOAD, the use of PigStorage is implicit
  - The field delimiter also has a default value (tab)
  - You may also specify an alternate delimiter
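Both forms of STORE are sketched below. The output directory names are hypothetical and must not already exist when the script runs.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);
bigsales = FILTER allsales BY price > 100;

-- Default: tab-delimited output written by PigStorage
STORE bigsales INTO 'myreport';

-- Alternate delimiter: comma-delimited output
STORE bigsales INTO 'myreport_csv' USING PigStorage(',');
```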
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#31%
Viewing the Schema with DESCRIBE
- The DESCRIBE command shows the structure of the data, including names and types
- The following Grunt session shows an example
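A session along these lines, assuming the sales file is loaded with explicit types, would print the schema of the relation:

```pig
grunt> allsales = LOAD 'sales' AS (name:chararray, price:int);
grunt> DESCRIBE allsales;
allsales: {name: chararray,price: int}
```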
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#33%
Filtering in Pig Latin
- The FILTER keyword extracts tuples matching the specified criteria

bigsales = FILTER allsales BY price > 3000;
Filtering by Multiple Criteria
- You can combine criteria with AND and OR
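A filter that combines criteria might be written as follows. The specific name and price thresholds are illustrative values, not taken from the original slide.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Parentheses make the grouping of AND and OR explicit
somesales = FILTER allsales BY name == 'Dieter'
            OR (price > 3500 AND price < 4000);
```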
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#36%
Aside: String Comparisons in Pig Latin
- The == operator is supported for any type in Pig Latin
  - This operator is used for exact comparisons

alices = FILTER allsales BY name == 'Alice';

- Pig Latin supports pattern matching through Java's regular expressions
  - This is done with the MATCHES operator
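A small sketch of MATCHES, assuming the same sales data. Note that the pattern must match the entire field value, as with Java's String.matches.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Names that begin with the letter A (for example, Alice)
a_names = FILTER allsales BY name MATCHES 'A.*';
```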
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#37%
Field Selection in Pig Latin
- Filtering extracts rows, but sometimes we need to extract columns
  - This is done in Pig Latin using the FOREACH and GENERATE keywords

allsales
salesperson  amount  trans_id
Alice        2999    107546
Bob          3625    107547
Carlos       2764    107548
Dieter       1749    107549
Étienne      2368    107550
Fredo        5637    107550

twofields
amount  trans_id
2999    107546
3625    107547
2764    107548
1749    107549
2368    107550
5637    107550
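The twofields relation shown above could be produced with a statement like this one. The field types in the LOAD statement are assumptions.

```pig
allsales = LOAD 'sales' AS (salesperson:chararray, amount:int, trans_id:int);

-- Keep only the amount and trans_id columns
twofields = FOREACH allsales GENERATE amount, trans_id;
```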
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#38%
Generating New Fields in Pig Latin
- The FOREACH and GENERATE keywords can also be used to create fields
  - For example, you could create a new field based on price
- It is possible to name such fields
- And you can also specify the data type
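The three variations can be sketched as follows. The 7% tax rate and the alias names are hypothetical values used only for illustration.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Create a new, unnamed field calculated from price
t1 = FOREACH allsales GENERATE price * 0.07;

-- Give the new field a name
t2 = FOREACH allsales GENERATE price * 0.07 AS tax;

-- Give the new field a name and a data type
t3 = FOREACH allsales GENERATE price * 0.07 AS tax:double;
```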
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#39%
Eliminating Duplicates
- DISTINCT eliminates duplicate records in a bag
  - All fields must be equal to be considered a duplicate
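Using the relation names from the original slide's diagram (all_alices and unique_records), the operation might look like this. The LOAD and FILTER statements are assumptions added to make the sketch complete.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);
all_alices = FILTER allsales BY name == 'Alice';

-- Records that are identical in every field are reduced to a single record
unique_records = DISTINCT all_alices;
```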
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#40%
Controlling Sort Order
- Use ORDER...BY to sort the records in a bag in ascending order
  - Add DESC to sort in descending order instead
  - Take care to specify a schema; data type affects how data is sorted!

allsales
name     price  country
Alice    29.99  us
Bob      36.25  ca
Carlos   27.64  mx
Dieter   17.49  de
Étienne  23.68  fr
Fredo    56.37  it

sortedsales
name     price  country
Alice    29.99  us
Carlos   27.64  mx
Fredo    56.37  it
Étienne  23.68  fr
Dieter   17.49  de
Bob      36.25  ca
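The sortedsales relation above is ordered by country in descending order, so a statement like the following would produce it. The field types in the LOAD statement are assumptions.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:double, country:chararray);

-- Sort by country, from highest (us) to lowest (ca)
sortedsales = ORDER allsales BY country DESC;
```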
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#41%
Limiting Results
- As in SQL, you can use LIMIT to reduce the number of output records
- Beware! Record ordering is random unless specified with ORDER BY
  - Use ORDER BY and LIMIT together to find top-N results
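A top-N query might be sketched like this. The choice of five records and the alias names are illustrative.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:int);

-- Sort first, so that LIMIT returns the highest prices
by_price = ORDER allsales BY price DESC;
top_five = LIMIT by_price 5;
```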
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#42%
Built-in Functions
- These are just a sampling of Pig's many built-in functions

Function Description   Example Invocation   Input   Output
Convert to uppercase   UPPER(country)       uk      UK

- You can use these with the FOREACH..GENERATE keywords
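Built-in functions are typically applied inside FOREACH..GENERATE, as in this sketch. UPPER comes from the table above; ROUND is another built-in, added here for illustration, and the field types are assumptions.

```pig
allsales = LOAD 'sales' AS (name:chararray, price:double, country:chararray);

-- Convert the country code to uppercase and round the price
cleaned = FOREACH allsales GENERATE name, ROUND(price), UPPER(country);
```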
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 04#44%
Hands-On Exercise: Using Pig for ETL Processing
- In this Hands-On Exercise, you will write Pig Latin code to perform basic ETL processing tasks on data related to Dualcore's online advertising campaigns
- Please refer to the Hands-On Exercise Manual for instructions
Essential Points
- Pig Latin supports many of the same operations as SQL
  - Though Pig's approach is quite different
  - Pig Latin loads, transforms, and stores data in a series of steps
- The default delimiter for both input and output is the tab character
  - You can specify an alternate delimiter as an argument to PigStorage
- Specifying the names and types of fields is not required
  - But it can improve performance and readability of your code
Bibliography
The following offer more information on topics discussed in this chapter
- Pig Latin Basics
  http://tiny.cloudera.com/dac04a
- Pig Latin Built-In Functions
  http://tiny.cloudera.com/dac04b
- Documentation for Java Regular Expression Patterns
  http://tiny.cloudera.com/dac04c
- Installing and Using HCatalog
  http://tiny.cloudera.com/dac04d
Processing Complex Data with Pig
Chapter 5
Processing Complex Data with Pig
In this chapter, you will learn
- How Pig uses bags, tuples, and maps to represent complex data
- The techniques Pig provides for grouping and ungrouping data
- How to use aggregate functions in Pig Latin
- How to iterate through records in complex data structures
Chapter Topics
Processing Complex Data with Pig
- Storage Formats
- Complex/Nested Data Types
- Grouping
- Built-in Functions for Complex Data
- Iterating Grouped Data
- Hands-On Exercise: Analyzing Ad Campaign Data with Pig
- Conclusion
Storage Formats
- We have seen that PigStorage loads and stores data
  - Uses a delimited text file format
  - The default delimiter (tab) can be easily changed
- Sometimes you need to load or store data in other formats
Other Supported Formats
- Here are a few of Pig's built-in functions for loading data in other formats
  - It is also possible to implement a custom loader by writing Java code

Name         Loads Data From
TextLoader   Text files (entire line as one field)
JsonLoader   Text files containing JSON-formatted data
BinStorage   Files containing binary data
HBaseLoader  HBase, a scalable NoSQL database built on Hadoop
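Load functions other than PigStorage are named with the USING keyword, as sketched here for two of the loaders in the table. The file names and the JSON schema are hypothetical.

```pig
-- TextLoader: each line of the file becomes a single chararray field
lines = LOAD 'weblog.txt' USING TextLoader() AS (line:chararray);

-- JsonLoader: the schema of each JSON record is passed as a string
people = LOAD 'people.json' USING JsonLoader('name:chararray, age:int');
```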
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#6%
Load and Store Functions
- Some functions load data, some store data, and some do both

Load Function  Equivalent Store Function
PigStorage     PigStorage
TextLoader     None
JsonLoader     JsonStorage
BinStorage     BinStorage
HBaseLoader    HBaseStorage
Pig's Complex Data Types: Tuple and Bag
- We have already seen two of Pig's three complex data types
  - A tuple is a collection of values
  - A bag is a collection of tuples

Pig's Complex Data Types: Map
- Pig also supports another complex type: Map
  - A map associates a chararray (key) to another data element (value)

Representing Complex Types in Pig
- It is important to know how to define and recognize these types in Pig

Type   Definition
Tuple  Comma-delimited list inside parentheses:
       ('107546', 2498, 'Alice')
Bag    Braces surround comma-delimited list of tuples:
       {('107546', 2498, 'Alice'), ('107547', 3625, 'Bob')}
Map    Brackets surround comma-delimited list of pairs; keys and values separated by #:
       ['store'#'MIA01','location'#'Coral Gables']
Loading and Using Complex Types (1)
- Complex data types can be used in any Pig field
- The following example shows how a bag is stored in a text file
  - Example: Transaction ID, amount, items sold (a bag of tuples)
  - The three fields (Field 1, Field 2, Field 3) are separated by tab characters
- Here is the corresponding LOAD statement specifying the schema
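A LOAD statement with a bag in its schema might look like the sketch below. The file name and the field names inside the bag are assumptions.

```pig
-- items_sold is a bag; each tuple in it holds a SKU and a price
details = LOAD 'salesdetail' AS (
    trans_id:chararray,
    amount:int,
    items_sold:bag{item:tuple(sku:chararray, price:int)});
```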
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#12%
Loading and Using Complex Types (2)
- The following example shows how a map is stored in a text file
  - Example: Customer name, credit account details (map), year account opened
  - The three fields (Field 1, Field 2, Field 3) are separated by tab characters
- Here is the corresponding LOAD statement specifying the schema
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#13%
Referencing Map Data
- Consider a file with the following data
    Bob [salary#52000,age#52]
- And loaded with the following schema
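
Whatever name the schema gives the map field (details is assumed here purely for illustration), Pig Latin reads a single value from a map with the # operator and a quoted key, as in details#'salary'. The Python sketch below mimics that lookup outside Pig by parsing the textual map form into a dictionary. The helper parse_pig_map is hypothetical and handles only flat maps with simple values.

```python
def parse_pig_map(text):
    """Parse Pig's textual map form, e.g. "[salary#52000,age#52]", into a dict.

    Sketch only: assumes a flat map whose keys and values contain no
    commas, brackets, or '#' characters.
    """
    inner = text.strip()[1:-1]  # drop the surrounding [ and ]
    if not inner:
        return {}
    result = {}
    for pair in inner.split(","):
        key, value = pair.split("#", 1)
        result[key.strip("'")] = value.strip("'")
    return result


# The record from the example: a name, then a map of details (tab-separated)
name, details_text = "Bob\t[salary#52000,age#52]".split("\t")
details = parse_pig_map(details_text)

# Rough Python counterpart of details#'salary' in Pig Latin
salary = int(details["salary"])
print(name, salary)
```

Values come back as strings here, so the sketch converts the salary explicitly.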
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#14%
Chapter Topics
Processing Complex Data with Pig
- Current topic: Grouping

Grouping Records By a Field (1)
- Sometimes you need to group records by a given field
  - For example, so you can calculate commissions for each employee
    Alice 729
    Bob 3999
    Alice 27999
    Carol 32999
    Carol 4999
- Use GROUP BY to do this in Pig Latin
  - The new relation has one record per unique value in the specified field

Grouping Records By a Field (2)
- The new relation always contains two fields
- The first field is literally named group in all cases
  - Contains the value from the field specified in GROUP BY
- The second field is named after the relation specified in GROUP BY
  - It's a bag containing one tuple for each corresponding value
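
As a rough model of that two-field result, the Python sketch below groups the example sales records by name and returns (group, bag) pairs. It only illustrates the shape of the data and is not how Pig implements grouping; the function name group_by is made up for this sketch.

```python
from collections import defaultdict

# Input relation "sales": (name, price) records, as in the example data
sales = [
    ("Alice", 729),
    ("Bob", 3999),
    ("Alice", 27999),
    ("Carol", 32999),
    ("Carol", 4999),
]


def group_by(relation, key_index):
    """Return (group, bag) records: one per unique key value."""
    bags = defaultdict(list)
    for record in relation:
        bags[record[key_index]].append(record)
    # Pig does not promise a particular output order; sorting here
    # only keeps the sketch deterministic.
    return sorted(bags.items())


byname = group_by(sales, 0)
for group, bag in byname:
    print(group, bag)
```

Each bag still holds the complete original records, which is why the name appears both as the group value and inside every tuple.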
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#17%
Grouping Records By a Field (3)
- The example below shows the data after grouping
  [Figure: input data (sales) alongside the grouped relation, whose two fields are named group and sales]

Using GROUP BY to Aggregate Data
- Aggregate functions create one output value from multiple input values
  - For example, to calculate total sales by employee
  - Usually applied to grouped data

Grouping Everything Into a Single Record
- We just saw that GROUP BY creates one record for each unique value
- GROUP ALL puts all data into one record

Using GROUP ALL to Aggregate Data
- Use GROUP ALL when you need to aggregate one or more columns
  - For example, to calculate total sales for all employees
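
To make the difference concrete, here is a small Python sketch that computes both kinds of total from the example sales data: one sum per employee (the GROUP BY case) and one sum over everything (the GROUP ALL case). The variable names are illustrative only, and this is a model of the result rather than of Pig's execution.

```python
# Input relation "sales": (name, price) records from the earlier example
sales = [
    ("Alice", 729),
    ("Bob", 3999),
    ("Alice", 27999),
    ("Carol", 32999),
    ("Carol", 4999),
]

# GROUP BY name, then SUM: one output value per unique name
totals_by_name = {}
for name, price in sales:
    totals_by_name[name] = totals_by_name.get(name, 0) + price

# GROUP ALL: a single record whose group value is the string "all"
# and whose bag holds every input record
grouped_all = ("all", list(sales))
total_sales = sum(price for _, price in grouped_all[1])

print(totals_by_name)
print(total_sales)
```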
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#21%
Removing Nesting in Data
- Some operations in Pig, like grouping, produce nested data structures
- Grouping can be useful to supply data to aggregate functions
- However, sometimes you want to work with a flat data structure
  - The FLATTEN operator removes a level of nesting in data

An Example of FLATTEN
- The following shows the nested data and what FLATTEN does to it
    grunt> byname = GROUP sales BY name;
    grunt> DUMP byname;
    (Bob,{(Bob,3999)})
    (Alice,{(Alice,729),(Alice,27999)})
    (Carol,{(Carol,32999),(Carol,4999)})
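
A minimal Python sketch of the un-nesting step, assuming the grouped records above are modeled as (group, bag) pairs: each tuple in a bag becomes its own flat record, prefixed by the group value. The function name flatten_bag is hypothetical.

```python
# Grouped relation, as printed by DUMP byname above
byname = [
    ("Bob", [("Bob", 3999)]),
    ("Alice", [("Alice", 729), ("Alice", 27999)]),
    ("Carol", [("Carol", 32999), ("Carol", 4999)]),
]


def flatten_bag(grouped):
    """Emit one flat record per tuple in each group's bag."""
    for group, bag in grouped:
        for item in bag:
            yield (group,) + item


flat = list(flatten_bag(byname))
for record in flat:
    print(record)
```

One level of nesting is removed: three grouped records become five flat ones.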
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#23%
Chapter Topics
Processing Complex Data with Pig
- Current topic: Built-in Functions for Complex Data

Pig's Built-In Aggregate Functions
- Pig has built-in support for other aggregate functions besides SUM
- Examples:
  - AVG: Calculates the average (mean) of all values
  - MIN: Returns the smallest value
  - MAX: Returns the largest value
- Pig has two built-in functions for counting records
  - COUNT: Returns the number of non-null elements in the bag
  - COUNT_STAR: Returns the number of all elements in the bag
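
The difference between the two counting functions only shows up when nulls are present. The Python sketch below models it under one assumption: COUNT skips a tuple when its first field is null, which is how the Pig documentation describes it. None stands in for null, and the function names are illustrative.

```python
# A bag with two tuples whose first field is null (None)
bag = [("Alice", 729), (None, 100), ("Carol", 4999), (None, None)]


def count(bag):
    """Model of COUNT: ignore tuples whose first field is null."""
    return sum(1 for t in bag if t[0] is not None)


def count_star(bag):
    """Model of COUNT_STAR: count every tuple, nulls included."""
    return len(bag)


print(count(bag), count_star(bag))
```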
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#25%
Other Notable Built-in Functions
- Here are some other useful Pig functions
  - See the Pig documentation for a complete list

Function   Description
DIFF       Finds tuples that appear in only one of two supplied bags
IsEmpty    Used with FILTER to match bags or maps that contain no data
SIZE       Returns the size of the field (definition of size varies by data type)
TOKENIZE   Splits a text string (chararray) into a bag of individual words
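
For intuition, here are simplified Python stand-ins for two of these functions. They are sketches, not reimplementations: in particular, this tokenize splits on whitespace only, whereas Pig's TOKENIZE also treats some punctuation as delimiters.

```python
def diff(bag_a, bag_b):
    """Tuples found in exactly one of the two bags (like DIFF)."""
    only_a = [t for t in bag_a if t not in bag_b]
    only_b = [t for t in bag_b if t not in bag_a]
    return only_a + only_b


def tokenize(text):
    """Split a string into a bag of one-field tuples, one per word."""
    return [(word,) for word in text.split()]


print(diff([(1,), (2,)], [(2,), (3,)]))
print(tokenize("big data rocks"))
```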
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#26%
Chapter Topics
Processing Complex Data with Pig
- Current topic: Iterating Grouped Data

Record Iteration
- We have seen that FOREACH...GENERATE iterates through records
- The goal is to transform records to produce a new relation
  - Sometimes to select only certain columns
  - Sometimes to create new columns
  - Sometimes to invoke a function on the data

Nesting the FOREACH Keyword
- A variation on FOREACH applies a set of operations to each record
  - This is often used to apply a series of transformations in a group
- This is called a nested FOREACH
  - Allows only relational operations (e.g., LIMIT, FILTER, ORDER BY)
  - GENERATE must be the last line in the block

Nested FOREACH Example (1)
- Our input data contains a list of employee job titles and corresponding salaries
- Goal: identify the three highest salaries within each title

Input Data
    President 192000
    Director 152500
    Director 161000
    Director 167000
    Director 165000
    Director 147000
    Engineer 92300
    Engineer 85000
    Engineer 83000
    Engineer 81650
    Engineer 82100
    Engineer 87300
    Engineer 76000
    Manager 87000
    Manager 81000
    Manager 75000
    Manager 79000
    Manager 67500

Nested FOREACH Example (2)
- First load the data from the file
- Next, group employees by title
  - Assigned to new relation title_group

Nested FOREACH Example (3)
- The nested FOREACH iterates through every record in the group (i.e., each job title)
  - It sorts each record in that group in descending order of salary
  - It then selects the top three
  - GENERATE outputs the title and salaries

Nested FOREACH Example (4)
Code (LOAD statement removed for brevity)
    title_group = GROUP employees BY title;

    top_salaries = FOREACH title_group {
        sorted = ORDER employees BY salary DESC;
        highest_paid = LIMIT sorted 3;
        GENERATE group, highest_paid;
    };
Output produced by DUMP top_salaries
    (Director,{(Director,167000),(Director,165000),(Director,161000)})
    (Engineer,{(Engineer,92300),(Engineer,87300),(Engineer,85000)})
    (Manager,{(Manager,87000),(Manager,81000),(Manager,79000)})
    (President,{(President,192000)})
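
The same group, sort, limit, and generate sequence can be traced in plain Python. This sketch uses the input data from this example and is meant only to mirror the logic step by step; the function name top_salaries is borrowed from the relation name above for readability.

```python
from itertools import groupby
from operator import itemgetter

employees = [
    ("President", 192000),
    ("Director", 152500), ("Director", 161000), ("Director", 167000),
    ("Director", 165000), ("Director", 147000),
    ("Engineer", 92300), ("Engineer", 85000), ("Engineer", 83000),
    ("Engineer", 81650), ("Engineer", 82100), ("Engineer", 87300),
    ("Engineer", 76000),
    ("Manager", 87000), ("Manager", 81000), ("Manager", 75000),
    ("Manager", 79000), ("Manager", 67500),
]


def top_salaries(records, n=3):
    """For each title, keep the n highest-paid records."""
    result = []
    ordered = sorted(records, key=itemgetter(0))  # groupby needs sorted input
    for title, group in groupby(ordered, key=itemgetter(0)):
        by_salary = sorted(group, key=itemgetter(1), reverse=True)  # ORDER ... DESC
        result.append((title, by_salary[:n]))                       # LIMIT n
    return result


for title, highest_paid in top_salaries(employees):
    print(title, highest_paid)
```

A title with fewer than three records, such as President, simply yields a shorter bag.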
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 05#33%
Chapter Topics
Processing Complex Data with Pig
- Current topic: Hands-On Exercise: Analyzing Ad Campaign Data with Pig

Hands-On Exercise: Analyzing Ad Campaign Data with Pig
- In this Hands-On Exercise, you will analyze data from Dualcore's online ad campaign
- Please refer to the Hands-On Exercise Manual for instructions

Chapter Topics
Processing Complex Data with Pig
- Current topic: Conclusion

Essential Points
- Pig has three complex data types: tuple, bag, and map
  - A map is simply a collection of key/value pairs
- These structures can contain simple types like int or chararray
  - But they can also contain complex data types
  - Nested data structures are common in Pig
- Pig provides methods for grouping and ungrouping data
  - You can remove a level of nesting using the FLATTEN operator
- Pig offers several built-in aggregate functions

Multi-Dataset Operations with Pig
Chapter 6

Multi-Dataset Operations with Pig
In this chapter, you will learn
- How we can use grouping to combine data from multiple sources
- What types of join operations Pig supports and how to use them
- How to concatenate records to produce a single data set
- How to split a single data set into multiple relations

Chapter Topics
Multi-Dataset Operations with Pig
- Techniques for Combining Data Sets (current topic)
- Joining Data Sets in Pig
- Set Operations
- Splitting Data Sets
- Hands-On Exercise: Analyzing Disparate Data Sets with Pig
- Conclusion

Overview of Combining Data Sets
- So far, we have concentrated on processing single data sets
  - Valuable insight often results from combining multiple data sets
- Pig offers several techniques for achieving this
  - Using the GROUP operator with multiple relations
  - Joining the data as you would in SQL
  - Performing set operations like CROSS and UNION
- We will cover each of these in this chapter

Example Data Sets (1)
- Most examples in this chapter will involve the same two data sets
- The first is a file containing information about Dualcore's stores
- There are two fields in this relation
  1. store_id:chararray (unique key)
  2. name:chararray (name of the city in which the store is located)

Stores
    A Anchorage
    B Boston
    C Chicago
    D Dallas
    E Edmonton
    F Fargo

Example Data Sets (2)
- Our other data set is a file containing information about Dualcore's salespeople
- This relation contains three fields
  1. person_id:int (unique key)
  2. name:chararray (salesperson name)
  3. store_id:chararray (refers to store)

Salespeople
    1 Alice B
    2 Bob D
    3 Carlos F
    4 Dieter A
    5 Étienne F
    6 Fredo C
    7 George D
    8 Hannah B
    9 Irina C
    10 Jack

Grouping Multiple Relations
- We previously learned about the GROUP operator
  - Groups values in a relation based on the specified field(s)
- The GROUP operator can also group multiple relations
  - In this case, using the synonymous COGROUP operator is preferred
- This collects values from both data sets into a new relation
  - As before, the new relation is keyed by a field named group
  - This group field is associated with one bag for each input

Example of COGROUP
- Inputs: the Stores and Salespeople data sets from Example Data Sets (1) and (2)
    grunt> grouped = COGROUP stores BY store_id,
                     salespeople BY store_id;
    grunt> DUMP grouped;
    (A,{(A,Anchorage)},{(4,Dieter,A)})
    (B,{(B,Boston)},{(1,Alice,B),(8,Hannah,B)})
    (C,{(C,Chicago)},{(6,Fredo,C),(9,Irina,C)})
    (D,{(D,Dallas)},{(2,Bob,D),(7,George,D)})
    (E,{(E,Edmonton)},{})
    (F,{(F,Fargo)},{(3,Carlos,F),(5,Étienne,F)})
    (,{},{(10,Jack,)})
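
The following Python sketch models the shape of a COGROUP result for the two example data sets: one record per key, carrying one bag per input. It is an illustration only. None stands in for Jack's missing store_id, and the function name cogroup is hypothetical.

```python
stores = [
    ("A", "Anchorage"), ("B", "Boston"), ("C", "Chicago"),
    ("D", "Dallas"), ("E", "Edmonton"), ("F", "Fargo"),
]
salespeople = [
    (1, "Alice", "B"), (2, "Bob", "D"), (3, "Carlos", "F"),
    (4, "Dieter", "A"), (5, "Étienne", "F"), (6, "Fredo", "C"),
    (7, "George", "D"), (8, "Hannah", "B"), (9, "Irina", "C"),
    (10, "Jack", None),  # no store assigned
]


def cogroup(left, left_key, right, right_key):
    """Return (group, left_bag, right_bag) records, one per key value."""
    keys = []
    for key in [r[left_key] for r in left] + [r[right_key] for r in right]:
        if key not in keys:
            keys.append(key)
    return [
        (key,
         [r for r in left if r[left_key] == key],
         [r for r in right if r[right_key] == key])
        for key in keys
    ]


grouped = cogroup(stores, 0, salespeople, 2)
for record in grouped:
    print(record)
```

Edmonton keeps an empty salespeople bag, and Jack lands in a group with an empty stores bag, matching the last two patterns in the DUMP output.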
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#9%
Chapter Topics
Multi-Dataset Operations with Pig
- Current topic: Joining Data Sets in Pig

Join Overview
- The COGROUP operator creates a nested data structure
- Pig Latin's JOIN operator creates a flat data structure
  - Similar to joins in a relational database
- A JOIN is similar to doing a COGROUP followed by a FLATTEN
  - Though they handle null values differently

Key Fields
- Like COGROUP, joins rely on a field shared by each relation
- Joins can also use multiple fields as the key

Inner Joins
- The default JOIN in Pig Latin is an inner join
- An inner join outputs records only when the key is found in all inputs
  - In the above example, stores that have at least one salesperson
- You can do an inner join on multiple relations in a single statement
  - But you must use the same key to join them

Inner Join Example
- Inputs: the stores and salespeople data sets from Example Data Sets (1) and (2)
    grunt> joined = JOIN stores BY store_id,
                    salespeople BY store_id;
    grunt> DUMP joined;
    (A,Anchorage,4,Dieter,A)
    (B,Boston,1,Alice,B)
    (B,Boston,8,Hannah,B)
    (C,Chicago,6,Fredo,C)
    (C,Chicago,9,Irina,C)
    (D,Dallas,2,Bob,D)
    (D,Dallas,7,George,D)
    (F,Fargo,3,Carlos,F)
    (F,Fargo,5,Étienne,F)
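
A naive Python sketch of the same inner join, assuming the two example data sets and using None for a missing key. It compares every pair of records, which is fine for illustration but is not how Pig executes joins.

```python
stores = [
    ("A", "Anchorage"), ("B", "Boston"), ("C", "Chicago"),
    ("D", "Dallas"), ("E", "Edmonton"), ("F", "Fargo"),
]
salespeople = [
    (1, "Alice", "B"), (2, "Bob", "D"), (3, "Carlos", "F"),
    (4, "Dieter", "A"), (5, "Étienne", "F"), (6, "Fredo", "C"),
    (7, "George", "D"), (8, "Hannah", "B"), (9, "Irina", "C"),
    (10, "Jack", None),  # no store assigned
]


def inner_join(left, left_key, right, right_key):
    """Concatenate each pair of records whose (non-null) keys match."""
    return [
        l + r
        for l in left
        for r in right
        if l[left_key] is not None and l[left_key] == r[right_key]
    ]


joined = inner_join(stores, 0, salespeople, 2)
for record in joined:
    print(record)
```

Edmonton (no salespeople) and Jack (no store) both drop out, and the store ID appears twice in each output record.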
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#14%
Eliminating Duplicate Fields (1)
- As with COGROUP, the new relation still contains duplicate fields

Eliminating Duplicate Fields (2)
- We can use FOREACH...GENERATE to retain just the fields we need
  - However, it is now slightly more complex to reference fields
  - We must fully-qualify any fields with names that are not unique

Outer Joins
- Pig Latin allows you to specify the type of join following the field name
  - Inner joins do not specify a join type
- An outer join does not require the key to be found in both inputs
- Outer joins require Pig to know the schema for at least one relation
  - Which relation requires schema depends on the join type
  - Full outer joins require schema for both relations

Left Outer Join Example
- Result contains all records from the relation specified on the left, but only matching records from the one specified on the right
    grunt> joined = JOIN stores BY store_id
                    LEFT OUTER, salespeople BY store_id;

Right Outer Join Example
- Result contains all records from the relation specified on the right, but only matching records from the one specified on the left
    grunt> joined = JOIN stores BY store_id
                    RIGHT OUTER, salespeople BY store_id;

Full Outer Join Example
- Result contains all records where there is a match in either relation
    grunt> joined = JOIN stores BY store_id
                    FULL OUTER, salespeople BY store_id;
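
To see how the three outer join types differ on the example data, here is a Python sketch in which unmatched records are padded with None. Note that the padding step needs to know how many fields the other relation has, which gives some intuition for why Pig asks for a schema. The function outer_join and its how argument are inventions of this sketch.

```python
stores = [
    ("A", "Anchorage"), ("B", "Boston"), ("C", "Chicago"),
    ("D", "Dallas"), ("E", "Edmonton"), ("F", "Fargo"),
]
salespeople = [
    (1, "Alice", "B"), (2, "Bob", "D"), (3, "Carlos", "F"),
    (4, "Dieter", "A"), (5, "Étienne", "F"), (6, "Fredo", "C"),
    (7, "George", "D"), (8, "Hannah", "B"), (9, "Irina", "C"),
    (10, "Jack", None),  # no store assigned
]


def outer_join(left, left_key, right, right_key, how):
    """how is 'left', 'right', or 'full'; unmatched sides are None-padded."""
    left_width, right_width = len(left[0]), len(right[0])
    out, matched_right = [], set()
    for l in left:
        matches = [
            i for i, r in enumerate(right)
            if l[left_key] is not None and r[right_key] == l[left_key]
        ]
        for i in matches:
            out.append(l + right[i])
            matched_right.add(i)
        if not matches and how in ("left", "full"):
            out.append(l + (None,) * right_width)
    if how in ("right", "full"):
        for i, r in enumerate(right):
            if i not in matched_right:
                out.append((None,) * left_width + r)
    return out


left_joined = outer_join(stores, 0, salespeople, 2, "left")
right_joined = outer_join(stores, 0, salespeople, 2, "right")
full_joined = outer_join(stores, 0, salespeople, 2, "full")
print(len(left_joined), len(right_joined), len(full_joined))
```

The left join keeps Edmonton, the right join keeps Jack, and the full join keeps both.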
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#20%
Chapter Topics
Multi-Dataset Operations with Pig
- Current topic: Set Operations

Crossing Data Sets
- JOIN finds records in one relation that match records in another
- Pig's CROSS operator creates the cross product of both relations
  - Combines all records in both tables regardless of matching
  - In other words, all possible combinations of records
- Careful: This can generate huge amounts of data!

Cross Product Example
- Generates every possible combination of records in the stores and salespeople relations

stores
    A Anchorage
    B Boston
    D Dallas
salespeople
    1 Alice B
    2 Bob D
    8 Hannah B
    10 Jack

    grunt> crossed = CROSS stores, salespeople;
    grunt> DUMP crossed;
    (A,Anchorage,1,Alice,B)
    (A,Anchorage,2,Bob,D)
    (A,Anchorage,8,Hannah,B)
    (A,Anchorage,10,Jack,)
    (B,Boston,1,Alice,B)
    (B,Boston,2,Bob,D)
    (B,Boston,8,Hannah,B)
    (B,Boston,10,Jack,)
    (D,Dallas,1,Alice,B)
    (D,Dallas,2,Bob,D)
    (D,Dallas,8,Hannah,B)
    (D,Dallas,10,Jack,)
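
In Python terms, a cross product is what itertools.product computes. The sketch below reproduces the twelve records above from the reduced input data, with None standing in for Jack's missing store_id.

```python
from itertools import product

stores = [("A", "Anchorage"), ("B", "Boston"), ("D", "Dallas")]
salespeople = [(1, "Alice", "B"), (2, "Bob", "D"), (8, "Hannah", "B"), (10, "Jack", None)]

# Every store record paired with every salesperson record
crossed = [s + p for s, p in product(stores, salespeople)]

print(len(crossed))  # output size is len(stores) * len(salespeople)
```

Because the output size is the product of the input sizes, two inputs of a million records each would yield a trillion records.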
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#23%
Concatenating Data Sets
- We have explored several techniques for combining data sets
  - They have had one thing in common: they combine horizontally
- The UNION operator combines records vertically
  - It adds data from input relations into a new single relation
  - Pig does not require these inputs to have the same schema
  - It does not eliminate duplicate records nor preserve order
- This is helpful for incorporating new data into your processing

UNION Example
- Concatenates all records from june and july

june
    Adapter 549
    Battery 349
    Cable 799
    DVD 1999
    HDTV 79999
july
    Fax 17999
    GPS 24999
    HDTV 65999
    Ink 3999

    grunt> both = UNION june_items, july_items;
    grunt> DUMP both;
    (Fax,17999)
    (GPS,24999)
    (HDTV,65999)
    (Ink,3999)
    (Adapter,549)
    (Battery,349)
    (Cable,799)
    (DVD,1999)
    (HDTV,79999)
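
As a simple model, UNION behaves like list concatenation with no guarantee about ordering. The Python sketch below uses the june and july data and checks the two properties called out earlier: nothing is deduplicated, and only the set of records (not their order) is meaningful.

```python
june_items = [("Adapter", 549), ("Battery", 349), ("Cable", 799),
              ("DVD", 1999), ("HDTV", 79999)]
july_items = [("Fax", 17999), ("GPS", 24999), ("HDTV", 65999), ("Ink", 3999)]

# Vertical combination: every record from both inputs, nothing removed
both = june_items + july_items

# HDTV appears once per month; UNION keeps both records
hdtv_records = [rec for rec in both if rec[0] == "HDTV"]
print(len(both), hdtv_records)
```

Python happens to keep june first here, while the DUMP output above lists july first; a script should not rely on either order.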
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#25%
Chapter Topics
Multi-Dataset Operations with Pig
- Current topic: Splitting Data Sets

Splitting Data Sets
- You have learned several ways to combine data sets into a single relation
- Sometimes you need to split a data set into multiple relations
  - Server logs by date range
  - Customer lists by region
  - Product lists by vendor
- Pig Latin supports this with the SPLIT operator
  - Expressions need not be mutually exclusive

SPLIT Example
- Split customers into groups for rewards program, based on lifetime value

customers
    Annette 9700
    Bruce 23500
    Charles 17800
    Dustin 21250
    Eva 8500
    Felix 9300
    Glynn 27800
    Henry 8900
    Ian 43800
    Jeff 29100
    Kai 34000
    Laura 7800
    Mirko 24200

    grunt> SPLIT customers INTO
               gold_program IF ltv >= 25000,
               silver_program IF ltv >= 10000
                                  AND ltv < 25000;
    grunt> DUMP gold_program;
    (Glynn,27800)
    (Ian,43800)
    (Jeff,29100)
    (Kai,34000)
    grunt> DUMP silver_program;
    (Bruce,23500)
    (Charles,17800)
    (Dustin,21250)
    (Mirko,24200)
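
The routing logic can be sketched in Python as a set of named predicates applied to every record. A record goes to every output whose condition it satisfies, and to none if it satisfies no condition. The function name split and the dictionary-of-lambdas interface are choices made for this sketch.

```python
customers = [
    ("Annette", 9700), ("Bruce", 23500), ("Charles", 17800),
    ("Dustin", 21250), ("Eva", 8500), ("Felix", 9300),
    ("Glynn", 27800), ("Henry", 8900), ("Ian", 43800),
    ("Jeff", 29100), ("Kai", 34000), ("Laura", 7800),
    ("Mirko", 24200),
]


def split(relation, conditions):
    """Route each record to every output whose predicate it satisfies."""
    outputs = {name: [] for name in conditions}
    for record in relation:
        for name, predicate in conditions.items():
            if predicate(record):
                outputs[name].append(record)
    return outputs


programs = split(customers, {
    "gold_program": lambda r: r[1] >= 25000,
    "silver_program": lambda r: 10000 <= r[1] < 25000,
})
print(programs["gold_program"])
print(programs["silver_program"])
```

Customers with a lifetime value under 10000, such as Annette, match neither condition and appear in neither output.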
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 06#28%
Chapter Topics
Multi-Dataset Operations with Pig
- Current topic: Hands-On Exercise: Analyzing Disparate Data Sets with Pig

Hands-On Exercise: Analyzing Disparate Data Sets with Pig
- In this Hands-On Exercise, you will analyze multiple data sets with Pig
- Please refer to the Hands-On Exercise Manual for instructions

Chapter Topics
Multi-Dataset Operations with Pig
- Current topic: Conclusion

Essential Points
- You can use COGROUP to group multiple relations
  - This creates a nested data structure
- Pig supports common SQL join types
  - Inner, left outer, right outer, and full outer
  - You may need to fully-qualify field names when using joined data
- Pig's CROSS operator creates every possible combination of input data
  - This can create huge amounts of data, so use it carefully!
- You can use a UNION to concatenate data sets
- In addition to combining data sets, Pig supports splitting them too

Extending Pig
Chapter 7

Extending Pig
In this chapter, you will learn
- How to use parameters in your Pig Latin to increase its flexibility
- How to define and invoke macros to improve the reusability of your code
- How to call user-defined functions from your code
- How to write user-defined functions in Python
- How to process data with external scripts

Chapter Topics
Extending Pig
- Adding Flexibility with Parameters (current topic)
- Macros and Imports
- UDFs
- Contributed Functions
- Using Other Languages to Process Data with Pig
- Hands-On Exercise: Extending Pig with Streaming and UDFs
- Conclusion

The Need for Parameters (1)
- Some processing is very repetitive
  - For example, creating sales reports

The Need for Parameters (2)
- You may need to change the script slightly for each run
  - For example, to modify the paths or filter criteria

Making the Script More Flexible with Parameters
- Instead of hardcoding values, Pig allows you to use parameters
  - These are replaced with specified values at runtime
  - Then specify the values on the command line
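
Parameter substitution is essentially text replacement performed before the script runs. Python's string.Template happens to use the same $NAME placeholder style, so it makes a convenient stand-in for experimenting with the idea. The two-line script and the parameter names INPUT and MINPRICE below are made up for this sketch, and Pig's own preprocessor is not involved.

```python
from string import Template

# A hypothetical Pig Latin script with two $-style placeholders
script = Template(
    "allsales = LOAD '$INPUT' AS (name, price);\n"
    "bigsales = FILTER allsales BY price > $MINPRICE;\n"
)

# Values that would normally be supplied when the script is launched
rendered = script.substitute(INPUT="sales", MINPRICE="999")
print(rendered)
```

Running the same template with different values yields a different script each time, without editing the source file.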
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#7%
Two Tricks for Specifying Parameter Values
- You can also specify parameter values in a text file
  - An alternative to typing each one on the command line
    INPUT=sales
    MINPRICE=999
    # comments look like this
    NAME='Alice'
  - Use -m filename option to tell Pig which file contains the values
- Parameter values can be defined with the output of a shell command
  - For example, to set MONTH to the current month:

Chapter Topics
Extending Pig
- Current topic: Macros and Imports

The Need for Macros
- Parameters simplify repetitive code by allowing you to pass in values
  - But sometimes you would like to reuse the actual code too

Defining a Macro in Pig Latin
- Macros allow you to define a block of code to reuse easily
  - Similar (but not identical) to a function in a programming language

Invoking Macros
- To invoke a macro, call it by name and supply values in the correct order

Reusing Code with Imports
- After defining a macro, you may wish to use it in multiple scripts
- You can include one script within another, starting with Pig 0.9
  - This is done with the import keyword and path to file being imported
    import 'commission_calc.pig';

Chapter Topics
Extending Pig
- Current topic: UDFs

User-Defined Functions (UDFs)
- We have covered many of Pig's built-in functions already
- It is also possible to define your own functions
  - Pig allows writing UDFs in several languages

Language                    Supported in Pig Versions
Java                        All
Python                      0.8 and later
JavaScript (experimental)   0.9 and later
Ruby (experimental)         0.10 and later
Groovy (experimental)       0.11 and later

- In the next few slides, you will see how to use UDFs in Java, and how to write and use UDFs in Python

Using UDFs Written in Java
- UDFs are packaged into Java Archive (JAR) files
- There are only two required steps for using them
  - Register the JAR file(s) containing the UDF and its dependencies
  - Invoke the UDF using the fully-qualified classname
    REGISTER '/path/to/myudf.jar';
    ...
    data = FOREACH allsales GENERATE com.example.MYFUNC(name);
- You can optionally define an alias for the function
    REGISTER '/path/to/myudf.jar';
    DEFINE FOO com.example.MYFUNC;
    ...
    data = FOREACH allsales GENERATE FOO(name);

Writing UDFs in Python (1)
- Now we will see how to write a UDF in Python
- The data we want to process has inconsistent phone number formats

Writing UDFs in Python (2)
- Our Python code is straightforward
- The only unusual thing is the optional @outputSchema decorator
  - This tells Pig what data type we are returning
  - If not specified, Pig will assume bytearray
    @outputSchema("areacode:chararray")
    def get_area_code(phone):
        areacode = "???"  # return this for unknown formats
        if len(phone) == 12:
            # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
            areacode = phone[0:3]
        elif len(phone) == 14:
            # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
            areacode = phone[1:4]
        return areacode
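
Because the UDF is ordinary Python, its logic can be tried out before it is ever registered with Pig. The sketch below supplies a stand-in outputSchema decorator (an assumption made only so the file can run outside Pig, where no such decorator is provided) and exercises the function on the formats mentioned in its comments. The sample phone numbers are made up.

```python
def outputSchema(schema):
    """Stand-in decorator for running outside Pig: records the schema string."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("areacode:chararray")
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats
    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]
    return areacode


for sample in ["415-555-1234", "415.555.1234", "(415) 555-1234", "5551234"]:
    print(sample, get_area_code(sample))
```

Note that the function classifies input purely by length, so any other 12- or 14-character string would also be accepted.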
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#18%
Invoking Python UDFs from Pig Latin
- Using this UDF from our Pig Latin is also easy
  - We saved our Python code as phonenumber.py
  - This Python file is in our current directory

Chapter Topics
Extending Pig
- Current topic: Contributed Functions

Open Source UDFs
- Pig ships with a set of community-contributed UDFs called Piggy Bank
- Another popular package of UDFs, called DataFu, has been open-sourced by LinkedIn

Piggy Bank
- Piggy Bank ships with Pig
  - You will need to register the piggybank.jar file
  - The location may vary depending on source and version
  - In CDH on our VMs, it is at /usr/lib/pig/piggybank.jar
- Some UDFs in Piggy Bank include (package names omitted for brevity)

Class Name      Description
ISOToUnix       Converts an ISO 8601 date/time format to UNIX format
UnixToISO       Converts a UNIX date/time format to ISO 8601 format
LENGTH          Returns the number of characters in the supplied string
HostExtractor   Returns the host name from a URL
DiffDate        Returns number of days between two dates

DataFu
- DataFu does not ship with Pig, but is part of CDH 4.1.0 and later
  - You will need to register the DataFu JAR file
  - In VM, at /usr/lib/pig/datafu-0.0.4-cdh4.2.0.jar
- Some UDFs in DataFu include (package names omitted for brevity)

Class Name             Description
Quantile               Calculates quantiles for a data set
Median                 Calculates the median for a data set
Sessionize             Groups data into sessions based on a specified time window
HaversineDistInMiles   Calculates distance in miles between two points, given latitude and longitude

Using A Contributed UDF
- Here is an example of using a UDF from DataFu to calculate distance
Input data
37.789336 -122.401385 40.707555 -74.011679
Pig Latin
REGISTER '/usr/lib/pig/datafu-*.jar';
DEFINE DIST datafu.pig.geo.HaversineDistInMiles;
Output data
(2564.207116295711)
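
To make the output value above easier to check by hand, here is a minimal Python sketch of the haversine great-circle formula. It is not DataFu's implementation: the function name `haversine_miles` and the Earth radius constant are assumptions chosen for illustration, so the trailing digits may differ from what the UDF prints.

```python
import math

# Assumed mean Earth radius in miles; DataFu may use a slightly different constant
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Same four coordinates as the input data shown on the slide
distance = haversine_miles(37.789336, -122.401385, 40.707555, -74.011679)
```

With these inputs the sketch should land within a few miles of the value reported by the UDF.
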
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#24%
Processing Data with An External Script
- While Pig Latin is powerful, some tasks are easier in another language
- Pig allows you to stream data through another language for processing
  - This is done using the STREAM keyword
- Similar concept to Hadoop Streaming
  - Data is supplied to the script on standard input as tab-delimited fields
  - Script writes results to standard output as tab-delimited fields
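
The input/output contract described above can be sketched in Python as follows. This is only an illustration of the tab-delimited convention: the `transform` logic (upper-casing the first field) and the sample records are hypothetical, and an in-memory buffer stands in for standard input.

```python
import io


def transform(fields):
    """Hypothetical per-record logic: upper-case the first field."""
    return [fields[0].upper()] + fields[1:]


def process(lines):
    """Yield one tab-delimited output line for each tab-delimited input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(transform(fields))


# In a real streaming script you would iterate over sys.stdin and print each result;
# here a small in-memory sample stands in for the data supplied by Pig.
sample = io.StringIO("alice\t1985-03-12\nbob\t1990-11-02\n")
result = list(process(sample))
```

Keeping the per-record logic in its own function makes the script easy to exercise without a cluster.
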
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 07#26%
STREAM Example in Python (1)
- Our example will calculate a user's age given that user's birthdate
  - This calculation is done in a Python script named agecalc.py
- Here is the corresponding Pig Latin code
  - Backticks used to quote script name following the alias
  - Single quotes used for quoting script name within SHIP
  - The schema for the data produced by the script follows the AS keyword
DUMP out;

STREAM Example in Python (2)
- Python code for agecalc.py
#!/usr/bin/env python
import sys
from datetime import datetime
# Assumption: each input line holds a tab-delimited name and birthdate
for line in sys.stdin:
    name, birthdate = line.rstrip('\n').split('\t')
    d1 = datetime.strptime(birthdate, '%Y-%m-%d')
    d2 = datetime.now()
    age = (d2 - d1).days // 365
    print('%s\t%d' % (name, age))

STREAM Example in Python (3)
- The Pig script again, and the data it reads and writes
DUMP out;
Input data    Output data

Hands-On Exercise: Extending Pig with Streaming and UDFs
- In this Hands-On Exercise, you will process data with an external script and a user-defined function.
- Please refer to the Hands-On Exercise Manual for instructions

Essential Points
- Pig supports several extension mechanisms
- Parameters and macros can help make your code more reusable
  - And easier to maintain and share with others
- Piggy Bank and DataFu are two examples of open source UDFs
  - You can also write your own UDFs
- It is also possible to embed Pig within another language

Bibliography
The following offer more information on topics discussed in this chapter
- Documentation on Parameter Substitution in Pig
  http://tiny.cloudera.com/dac07a
- Documentation on Macros in Pig
  http://tiny.cloudera.com/dac07b
- Documentation on User-Defined Functions in Pig
  http://tiny.cloudera.com/dac07c
- Documentation on Piggy Bank
  http://tiny.cloudera.com/dac07d
- Introducing Data Fu
  http://tiny.cloudera.com/dac07e

Pig Troubleshooting and Optimization
Chapter 8

Pig Troubleshooting and Optimization
In this chapter, you will learn
- How to control the information that Pig and Hadoop write to log files
- How Hadoop's Web UI can help you troubleshoot failed jobs
- How to use SAMPLE and ILLUSTRATE to test and debug Pig jobs
- How Pig creates MapReduce jobs from your Pig Latin code
- How several simple changes to your Pig Latin code can make it run faster
- Which resources are especially helpful for troubleshooting Pig errors

Chapter Topics
Pig Troubleshooting and Optimization
- Troubleshooting Pig
- Logging
- Using Hadoop's Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Your Pig Jobs
- Conclusion

Troubleshooting Overview
- We have now covered how to use Pig for data analysis
  - Unfortunately, sometimes your code may not work as you expect
  - It is important to remember that Pig and Hadoop are intertwined
- Here we will cover some techniques for isolating and resolving problems
  - We will start with a few options to the Pig command

Helping Yourself
- We will discuss some options for the pig command in this chapter
  - You can view all of them by using the -h (help) option
  - Keep in mind that many options are advanced or rarely used
- One useful option is -c (check), which validates the syntax of your code
$ pig -c myscript.pig
myscript.pig syntax OK
- The -dryrun option is very helpful if you use parameters or macros
  - Creates a myscript.pig.substituted file in the current directory
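
As a rough analogy for what the substituted file contains, the following Python sketch expands `$name` placeholders in a script string with the standard library's `string.Template`. It is not Pig's preprocessor, which also expands macros and handles declared defaults; the script text and the parameter name `input_dir` are made up for illustration.

```python
from string import Template

# Hypothetical Pig Latin script text containing one parameter
script = "data = LOAD '$input_dir' AS (name:chararray);\nDUMP data;"


def substitute(text, params):
    """Replace every $name placeholder; raises KeyError if a parameter is missing."""
    return Template(text).substitute(params)


expanded = substitute(script, {"input_dir": "/data/2014"})
```

Reading the expanded text is often the quickest way to spot a parameter that was misspelled or never supplied.
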
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#6%
Getting Help from Others
- Sometimes you may need help from others
  - Mailing lists or newsgroups
  - Forums and bulletin board sites
  - Support services
- You will probably need to provide the version of Pig and Hadoop you are using
$ pig -version
Apache Pig version 0.10.0-cdh4.2.0
$ hadoop version
Hadoop 2.0.0-cdh4.2.0

Customizing Log Messages
- You may wish to change how much information is logged
  - A recent change in Hadoop can cause lots of warnings when using Pig
- Pig and Hadoop use the Log4J library, which is easily customized
- Edit the /etc/pig/conf/log4j.properties file to include:
log4j.logger.org.apache.pig=ERROR
log4j.logger.org.apache.hadoop.conf.Configuration=ERROR
- Edit the /etc/pig/conf/pig.properties file to set this property:
log4jconf=/etc/pig/conf/log4j.properties

Customizing Log Messages on a Per-Job Basis
- Often you just want to temporarily change the log level
  - Especially while trying to troubleshoot a problem with your script
- You can specify a Log4J properties file to use when you invoke Pig
  - This overrides the default Log4J configuration
- Create a customlog.properties file to include:
log4j.logger.org.apache.pig=DEBUG
- Specify this file via the -log4jconf argument to Pig
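
The two steps above could be scripted along these lines. This is a sketch under assumptions: the helper names and the temporary directory are illustrative, the script name `myscript.pig` is borrowed from the earlier slides, and the command is only assembled here, not executed.

```python
import os
import tempfile


def write_log_config(directory, level="DEBUG"):
    """Create customlog.properties with the single logger setting from the slide."""
    path = os.path.join(directory, "customlog.properties")
    with open(path, "w") as f:
        f.write("log4j.logger.org.apache.pig=%s\n" % level)
    return path


def pig_command(config_path, script):
    """Build the argument list that passes the properties file to Pig."""
    return ["pig", "-log4jconf", config_path, script]


workdir = tempfile.mkdtemp()
config = write_log_config(workdir)
command = pig_command(config, "myscript.pig")
# subprocess.call(command) would launch Pig on a machine where it is installed
```

Because the file is created per run, the system-wide Log4J configuration is left untouched.
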
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#10%
Controlling Client-Side Log Files
- When a job fails, Pig may produce a log file to explain why
  - These are typically produced in your current directory
- To use a different location, use the -l (log) option when starting Pig
$ pig -l /tmp
- Or set it permanently by editing /etc/pig/conf/pig.properties
  - Specify a different directory using the log.file property
log.file=/tmp

The Hadoop Web UI
- Each Hadoop daemon has a corresponding Web application
  - This allows us to easily see cluster and job status with a browser
  - In pseudo-distributed mode, the hostname is localhost
       Daemon Name       Address
HDFS   NameNode          http://hostname:50070/
HDFS   DataNode          http://hostname:50075/
MR1    JobTracker        http://hostname:50030/
MR1    TaskTracker       http://hostname:50060/
MR2    ResourceManager   http://hostname:8088/
MR2    NodeManager       http://hostname:8042/
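
For quick reference in scripts, the port table above can be captured in a small lookup. This is a convenience sketch only: the dictionary and function names are hypothetical, and the ports are the defaults listed on the slide, which a given cluster may override.

```python
# Default Web UI ports as listed in the table (clusters may be configured differently)
WEB_UI_PORTS = {
    "NameNode": 50070,
    "DataNode": 50075,
    "JobTracker": 50030,
    "TaskTracker": 50060,
    "ResourceManager": 8088,
    "NodeManager": 8042,
}


def web_ui_url(daemon, hostname="localhost"):
    """Return the Web UI address for a daemon; localhost suits pseudo-distributed mode."""
    return "http://%s:%d/" % (hostname, WEB_UI_PORTS[daemon])


jobtracker_url = web_ui_url("JobTracker")
```
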
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#13%
The JobTracker Web UI (1)
- The JobTracker offers the most useful of Hadoop's Web UIs
  - It displays MapReduce status information for the Hadoop cluster

The JobTracker Web UI (2)
- The JobTracker Web UI also shows historical information
  - You can click one of the links to see details for a particular job

The JobTracker Web UI (3)
- The job detail page can help you troubleshoot a problem

Naming Your Job
- Hadoop clusters are typically shared resources
  - There might be dozens or hundreds of others using it
  - As a result, sometimes it is hard to find your job in the Web UI
- We recommend setting a name in your scripts to help identify your jobs
  - Set the job.name property, either in Grunt or your script

Killing a Job
- A job that processes a lot of data can take hours to complete
  - Sometimes you spot an error in your code just after submitting a job
  - Rather than wait for the job to complete, you can kill it
- First, find the Job ID on the front page of the JobTracker Web UI
- Then, use the kill command in Pig along with that Job ID

Optional Demo: Overview
- Time permitting, your instructor will now demonstrate how to use the JobTracker's Web UI to isolate a bug in our code that causes a job to fail
- The code and instructions are already on the VM
$ cd ~/training_materials/analyst/webuidemo
$ cat README.txt

Using SAMPLE to Create a Smaller Data Set
- Your code might process terabytes of data in production
  - However, it is convenient to test with smaller amounts during development
- Use SAMPLE to choose a random set of records from a data set
- This example selects about 5% of records from bigdata
  - Stores them in a new directory called mysample
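
The word "about" matters here: with random sampling the number of records returned varies from run to run. The Python sketch below shows one common way to implement this kind of sampling, keeping each record independently with the given probability. It is an illustration of the idea, not Pig's source code; the function name and the stand-in data are assumptions.

```python
import random


def sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


# Stand-in for the bigdata relation: 10,000 dummy records
bigdata = list(range(10000))
mysample = sample(bigdata, 0.05, seed=42)
```

The resulting list holds roughly 500 records, and a different seed gives a slightly different count.
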
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#22%
Intelligent Sampling with ILLUSTRATE
- Sometimes a random sample may lack data needed for testing
  - For example, matching records in two data sets for a JOIN operation
- Pig's ILLUSTRATE keyword can do more intelligent sampling
  - Pig will examine the code to determine what data is needed
  - It picks a few records that properly exercise the code
- You should specify a schema when using ILLUSTRATE
  - Pig will generate records when yours don't suffice

Using ILLUSTRATE Helps You to Understand Data Flow
- Like DUMP and DESCRIBE, ILLUSTRATE aids in debugging
  - The syntax is the same for all three

General Debugging Strategies
- Use DUMP, DESCRIBE, and ILLUSTRATE often
  - The data might not be what you think it is
- Look at a sample of the data
  - Verify that it matches the fields in your LOAD specification
- Other helpful steps for tracking down a problem
  - Use -dryrun to see the script after parameters and macros are processed
  - Test external scripts (STREAM) by passing some data from a local file
  - Look at the logs, especially task logs available from the Web UI
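
The suggestion to test external scripts locally can be sketched as follows. The helper pipes tab-delimited text through a script on standard input and collects its standard output, which is the same contract STREAM relies on. The script `double.py` and its contents are hypothetical and written to a temporary directory only so the example is self-contained.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical streaming script: doubles the numeric second field of each record
SCRIPT = '''import sys
for line in sys.stdin:
    name, value = line.rstrip("\\n").split("\\t")
    print("%s\\t%d" % (name, int(value) * 2))
'''


def run_locally(script_path, data):
    """Pipe tab-delimited text through the script and return its output lines."""
    proc = subprocess.Popen([sys.executable, script_path],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            universal_newlines=True)
    out, _ = proc.communicate(data)
    return out.splitlines()


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "double.py")
with open(path, "w") as f:
    f.write(SCRIPT)

lines = run_locally(path, "a\t1\nb\t2\n")
```

If the script misbehaves on a few local records, it will also misbehave inside a Pig job, where the failure is much harder to read.
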
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#25%
Performance Overview
- We have discussed several techniques for finding errors in Pig Latin code
  - Once you get your code working, you'll often want it to work faster
- Performance tuning is a broad and complex subject
  - Requires a deep understanding of Pig, Hadoop, Java, and Linux
  - Typically the domain of engineers and system administrators
- Most of these topics are beyond the scope of this course
  - We'll cover the basics and offer several performance improvement tips
  - See Programming Pig (chapters 7 and 8) for detailed coverage

How Pig Latin Becomes a MapReduce Job
- Pig Latin code ultimately runs as MapReduce jobs on the Hadoop cluster
- However, Pig does not translate your code into Java MapReduce
  - Much like relational databases don't translate SQL to C language code
  - Like a database, Pig interprets the Pig Latin to develop execution plans
  - Pig's execution engine uses these to submit MapReduce jobs to Hadoop
- The EXPLAIN keyword details Pig's three execution plans
  - Logical
  - Physical
  - MapReduce
- Seeing an example job will help us better understand EXPLAIN's output

Description of Our Example Code and Data
- Our goal is to produce a list of per-store sales
stores
A Anchorage
B Boston
C Chicago
D Dallas
E Edmonton
F Fargo
sales
A 1999
D 2399
A 4579
B 6139
A 2489
B 3699
E 2479
D 5799
grunt> stores = LOAD 'stores'
         AS (store_id:chararray, name:chararray);
grunt> sales = LOAD 'sales'
         AS (store_id:chararray, price:int);
grunt> groups = GROUP sales BY store_id;
grunt> totals = FOREACH groups GENERATE group,
         SUM(sales.price) AS amount;
grunt> joined = JOIN totals BY group,
         stores BY store_id;
grunt> result = FOREACH joined
         GENERATE name, amount;
grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)
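
To see where the four output tuples come from, the same group, sum, and join steps can be traced in plain Python over the sample data. This is a single-machine illustration of the logic only; it says nothing about how Pig plans or executes the job.

```python
from collections import defaultdict

stores = [("A", "Anchorage"), ("B", "Boston"), ("C", "Chicago"),
          ("D", "Dallas"), ("E", "Edmonton"), ("F", "Fargo")]
sales = [("A", 1999), ("D", 2399), ("A", 4579), ("B", 6139),
         ("A", 2489), ("B", 3699), ("E", 2479), ("D", 5799)]

# GROUP sales BY store_id, then SUM(sales.price) for each group
totals = defaultdict(int)
for store_id, price in sales:
    totals[store_id] += price

# Inner JOIN of totals with stores: stores that have no sales (C and F) drop out
result = [(name, totals[store_id])
          for store_id, name in stores if store_id in totals]
```

Chicago and Fargo are absent from the result because an inner join keeps only keys present on both sides.
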
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#30%
Using the EXPLAIN Keyword
- Using EXPLAIN rather than DUMP will show the execution plans

Pig's Runtime Optimizations
- Pig does not necessarily run your statements exactly as you wrote them
- It may remove operations for efficiency
- It may also rearrange operations for efficiency

Optimizations You Can Make in Your Pig Latin Code
- Pig's optimizer does what it can to improve performance
  - But you know your own code and data better than it does
  - A few small changes in your code can allow additional optimizations
- On the next few slides, we will rewrite this Pig code for performance

Don't Produce Output You Don't Really Need
- In this case, we forgot to remove the DUMP statement
  - Sometimes happens when moving from development to production
  - And it might go unnoticed if you're not watching the terminal

Specify Schema Whenever Possible
- Specifying schema when loading data eliminates the need for Pig to guess
  - It may choose a bigger type than you need (e.g., long instead of int)
- The postcode and phone fields in the stores data set were also never used
  - Eliminating them in our schema ensures they'll be omitted in joined
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
joined = JOIN sales BY store_id, stores BY store_id;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
           FLATTEN(joined.stores::name) AS name,
           SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Filter Unwanted Data As Early As Possible
- We previously did our JOIN before our FILTER
  - This produced lots of data we ultimately discarded
  - Moving the FILTER operation up makes our script more efficient
  - Caveat: We now have to filter by store ID rather than store name
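
The effect of moving the filter can be illustrated with a small Python sketch over the sample sales data. It compares how many records pass through the join in each ordering; the `join` helper is a hypothetical stand-in for Pig's JOIN, and the counts describe this toy data only.

```python
stores = {"A": "Anchorage", "B": "Boston", "D": "Dallas", "E": "Edmonton"}
sales = [("A", 1999), ("D", 2399), ("A", 4579), ("B", 6139),
         ("A", 2489), ("B", 3699), ("E", 2479), ("D", 5799)]
wanted = {"A", "E"}  # store IDs, since the name is not available before the join


def join(sales_rows, store_names):
    """Hypothetical stand-in for JOIN: attach the store name to each sales row."""
    return [(sid, store_names[sid], price)
            for sid, price in sales_rows if sid in store_names]


# Join first, filter afterwards: every sales record goes through the join
joined_all = join(sales, stores)
late = [row for row in joined_all if row[0] in wanted]

# Filter first: only the records we care about reach the join
regsales = [(sid, price) for sid, price in sales if sid in wanted]
early = join(regsales, stores)

rows_joined = (len(joined_all), len(early))
```

Both orderings produce the same final rows, but the early filter halves the join's input in this sample.
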
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#37%
Consider Adjusting the Parallelization
- Hadoop clusters scale by processing data in parallel
  - Newer Pig releases choose the number of reducers based on input size
  - However, it is often beneficial to set a value explicitly in your script
  - Your system administrator can help you determine the best value
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
           FLATTEN(joined.stores::name) AS name,
           SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Specify the Smaller Data Set First in a Join
- We can optimize joins by specifying the larger data set last
  - Pig will stream the larger data set instead of reading it into memory
  - In our case, we have far more records in sales than in stores
  - Changing the order in the JOIN statement can boost performance
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
           FLATTEN(joined.stores::name) AS name,
           SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
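
The idea of holding one side in memory while streaming the other can be sketched in Python. This is a conceptual analogy rather than Pig's implementation, which performs the join inside MapReduce tasks; the function name and the sample records are assumptions for illustration.

```python
def stream_join(small, large):
    """Hold `small` in memory, then make a single pass over `large`."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    for key, value in large:  # `large` may be any iterator, such as a file being read
        for match in lookup.get(key, []):
            yield (key, match, value)


stores = [("A", "Anchorage"), ("E", "Edmonton")]


def sales_records():
    """Generator standing in for a large data set that is never fully loaded."""
    for record in [("A", 1999), ("D", 2399), ("A", 4579), ("E", 2479)]:
        yield record


joined = list(stream_join(stores, sales_records()))
```

Memory use depends on the size of the side held in the lookup table, which is why the smaller data set belongs there.
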
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#39%
Try Using Compression on Intermediate Data
- Pig scripts often yield jobs with both a Map and a Reduce phase
  - Remember that Mapper output becomes Reducer input
  - Compressing this intermediate data is easy and can boost performance
  - Your system administrator may need to install a compression library

A Few More Tips for Improving Performance
- Main theme: Eliminate unnecessary data as early as possible
  - Use FOREACH ... GENERATE to select just those fields you need
  - Use ORDER BY and LIMIT when you only need a few records
  - Use DISTINCT when you don't need duplicate records
- Dropping records with NULL keys before a join can boost performance
  - These records will be eliminated in the final output anyway
  - But Pig doesn't discard them until after the join
  - Use FILTER to remove records with null keys before the join
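
The null-key tip can be illustrated in Python, with `None` standing in for a null key. The sketch shows that dropping such records first does not change the join result and only shrinks the join's input; the helper name and sample records are hypothetical.

```python
def drop_null_keys(records):
    """Remove records whose join key is missing (None stands in for null)."""
    return [(key, value) for key, value in records if key is not None]


sales = [("A", 1999), (None, 2399), ("B", 6139), (None, 100)]
stores = {"A": "Anchorage", "B": "Boston"}

# Filter before the join
clean = drop_null_keys(sales)
joined = [(stores[k], v) for k, v in clean if k in stores]

# Join without filtering: the null-key records are carried along, then discarded
joined_unfiltered = [(stores[k], v) for k, v in sales if k in stores]
```

Half of the sample records never needed to reach the join at all.
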
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 08#41%
Essential Points
- You can boost performance by eliminating unneeded data during processing
- Pig's error messages don't always clearly identify the source of a problem
  - We recommend testing your scripts with a small data sample
  - Looking at the Web UI, and especially the log messages, can be helpful
- The resources listed on the upcoming bibliography slide may further assist you in solving problems

Bibliography
The following offer more information on topics discussed in this chapter
- Hadoop Log Location and Retention
  http://tiny.cloudera.com/dac08a
- Pig Testing and Diagnostics
  http://tiny.cloudera.com/dac08b
- Mailing List for Pig Users
  http://tiny.cloudera.com/dac08c
- Questions Tagged with Pig on Stack Overflow
  http://tiny.cloudera.com/dac08d
- Questions Tagged with PigLatin on Stack Overflow
  http://tiny.cloudera.com/dac08e

Introduction to Hive
Chapter 9

Introduction to Hive
In this chapter, you will learn
- What Hive is
- How Hive differs from a relational database
- Ways in which organizations use Hive
- How to invoke and interact with Hive

Chapter Topics
Introduction to Hive
- What is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive Use Cases
- Interacting with Hive
- Conclusion

Overview of Apache Hive
- Apache Hive is a high-level abstraction on top of MapReduce
  - Uses a SQL-like language called HiveQL
  - Generates MapReduce jobs that run on the Hadoop cluster
  - Originally developed by Facebook for data warehousing
  - Now an open-source Apache project

High-Level Overview for Hive Users
- Hive runs on the client machine
  - Turns HiveQL queries into MapReduce jobs
  - Submits those jobs to the cluster

Why Use Apache Hive?
- More productive than writing MapReduce directly
  - Five lines of HiveQL might be equivalent to 100 lines or more of Java
- Brings large-scale data analysis to a broader audience
  - No software development experience required
  - Leverage existing knowledge of SQL
- Offers interoperability with other systems
  - Extensible through Java and external scripts
  - Many business intelligence (BI) tools support Hive

Where to Get Hive
- CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Hive
  - A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  - Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  - Simple installation
  - 100% free and open source
- Installation is outside the scope of this course
  - Cloudera offers a training course for System Administrators, Cloudera Administrator Training for Apache Hadoop

How Hive Loads and Stores Data (1)
- Hive's queries operate on tables, just like in an RDBMS
  - A table is simply an HDFS directory containing one or more files
  - Default path: /user/hive/warehouse/<table_name>
  - Hive supports many formats for data storage and retrieval
- How does Hive know the structure and location of tables?
  - These are specified when tables are created
  - This metadata is stored in Hive's metastore
  - Contained in an RDBMS such as MySQL

How Hive Loads and Stores Data (2)
- Hive consults the metastore to determine data format and location
  - The query itself operates on data stored on a filesystem (typically HDFS)
[Diagram: (1) get table structure and location of data from the Metastore (metadata in RDBMS); (2) query actual data in Hive Tables (HDFS)]

Your Cluster is Not a Database Server
- Client-server database management systems have many strengths
  - Very fast response time
  - Support for transactions
  - Allows modification of existing records
  - Can serve thousands of simultaneous clients
- Hive does not turn your Hadoop cluster into an RDBMS
  - It simply produces MapReduce jobs from HiveQL queries
  - Limitations of HDFS and MapReduce still apply

Comparing Hive To A Relational Database
                            Relational Database   Hive
Query language              SQL                   HiveQL
Update individual records   Yes                   No
Delete individual records   Yes                   No
Transactions                Yes                   No
Index support               Extensive             Limited
Latency                     Very low              High
Data size                   Terabytes             Petabytes

Use Case: Log File Analytics
- Server log files are an important source of data
- Hive allows you to treat a directory of log files like a table
  - Allows SQL-like queries against raw data

Use Case: Sentiment Analysis
- Many organizations use Hive to analyze social media coverage
[Figure: chart of negative, neutral, and positive sentiment, with the horizontal axis labeled 07 through 18]

Using the Hive Shell
- You can execute HiveQL statements in the Hive shell
  - This interactive tool is similar to the MySQL shell
- Run the hive command to start the Hive shell
  - The Hive shell will display its hive> prompt
  - Each statement must be terminated with a semicolon
  - Use the quit command to exit the Hive shell
$ hive
hive> SELECT cust_id, fname, lname
FROM customers WHERE zipcode=20525;
1000000 Quentin Shepard
1000001 Brandon Louis
1000002 Marilyn Ham
hive> quit;
$

Accessing Hive from the Command Line
- You can also execute a file containing HiveQL code using the -f option
$ hive -f myquery.hql
- Or use HiveQL directly from the command line using the -e option
- Use the -S (silent) option to suppress informational messages
  - Can also be used with the -e or -f options
$ hive -S

Hive Properties
- Many aspects of Hive's behavior are configured through properties
  - Use set -v in Hive to see current values
- Hive runs the .hiverc file in your home directory at startup
  - Useful for specifying per-user defaults
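A minimal sketch of what a per-user .hiverc might contain. The two properties shown are standard Hive CLI settings, but choosing them as defaults is an assumption made here purely for illustration.

```sql
-- Hypothetical contents of ~/.hiverc, run by the Hive shell at startup
-- Print column names above query results
SET hive.cli.print.header=true;
-- Show the current database in the prompt
SET hive.cli.print.current.db=true;
```

Comments are kept on their own lines, starting in the first column, which is the safest placement for the Hive releases this course covers.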
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 09#21%
Interacting with the Operating System and HDFS
- Use ! to execute system commands from within Hive
  - Neither pipes nor globs (wildcards) are supported
hive> ! date;
Mon May 20 16:44:35 PDT 2013
- Prefix HDFS commands with dfs to use them from within Hive

Accessing Hive with Hue
- Alternatively, you can access Hive through Hue
  - Web-based UI for many Hadoop-related services, including Hive
- To use Hue, browse to http://hue_server:8888/
  - May need to start the Hue service first (sudo service hue start)
- Hue's Hive interface is called Beeswax
  - Launch by clicking its icon
- Beeswax features include:
  - Creating tables
  - Running queries
  - Browsing tables
  - Saving queries for later execution
[Figure: Beeswax icon in Hue]

Hue's Beeswax Query Editor
[Screenshot: Hue Hive query editor at https://hueserver.example.com:8888/beeswax/]

Query Execution in Beeswax
[Screenshot: query execution in Beeswax]

Query Results in Beeswax
[Screenshot: query results in Beeswax]

Interacting with HiveServer2: Hive as a Service (1)
- HiveServer2 can be deployed to provide a centralized Hive service
  - Uses a JDBC or ODBC connection
  - Supports Kerberos authentication
[Figure: the Beeline CLI and JDBC/ODBC applications connect to the HiveServer2 server, which uses a shared metastore and submits MapReduce jobs (mappers and reducers) to the Hadoop cluster]

Interacting with HiveServer2: Hive as a Service (2)
- To connect to HiveServer2, use Hue or the Beeline CLI
  - You cannot use the Hive shell
  - For secure deployments, supply your user ID and password
- Example: starting Beeline and connecting to HiveServer2
[training@localhost analyst]$ beeline
Beeline version 0.10.0-cdh4.2.1 by Apache Hive
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
beeline>

Essential Points
- Hive is a high-level abstraction on top of MapReduce
  - Runs MapReduce jobs on Hadoop based on HiveQL statements
  - Originally developed for data warehousing at Facebook
- HiveQL is very similar to SQL
  - Easy to learn for those with relational database experience
  - However, Hive does not replace your RDBMS
- Hive tables are really directories of files in HDFS
  - Information about those tables is kept in Hive's metastore
- Hive and Pig are similar in many ways
  - But each also has its own distinct advantages

Bibliography
The following offer more information on topics discussed in this chapter
- Hive Web Site
  - http://hive.apache.org/
- Beeline CLI reference
  - http://sqlline.sourceforge.net/
- Programming Hive (book)
  - http://tiny.cloudera.com/dac09a
- Data Analysis with Hadoop and Hive at Orbitz
  - http://tiny.cloudera.com/dac09b
- Sentiment Analysis Using Apache Hive
  - http://tiny.cloudera.com/dac09c

Relational Data Analysis with Hive
Chapter 10

Relational Data Analysis with Hive
In this chapter, you will learn
- How to explore databases and tables in Hive
- How HiveQL syntax compares to SQL
- Which data types Hive supports
- Which types of join operations Hive supports and how to use them
- How to use many of Hive's built-in functions

Chapter Topics
Relational Data Analysis with Hive
- Hive Databases and Tables
- Basic HiveQL Syntax
- Data Types
- Joining Datasets
- Common Built-in Functions
- Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
- Conclusion

Hive Tables
- Data for Hive's tables is stored on the filesystem (typically HDFS)
  - Each table maps to a single directory
- A table's directory may contain multiple files
  - Typically delimited text files, but Hive supports many formats
  - Subdirectories are not allowed
- Hive uses the metastore to give context to this data
  - Helps map raw data in HDFS to named columns of specific types

Hive Databases
- Each Hive table belongs to a specific database
- Early versions of Hive supported only a single database
  - It placed all tables in the same database (named default)
  - This is still the default behavior
- Hive supports multiple databases as of release 0.7.0
  - Helpful for organization and authorization

Exploring Hive Databases and Tables (1)
- Which databases are available?
- Switch between databases with the USE command
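The two operations on this slide can be sketched as follows. The database name sales is hypothetical and stands in for whatever databases exist on your cluster.

```sql
-- List the databases defined in the metastore
SHOW DATABASES;
-- Make sales (a hypothetical database name) the current database
USE sales;
```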
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#7$
Exploring Hive Databases and Tables (2)
- Which tables does the current database contain?
- Which tables are contained in a different database?
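A short sketch of both questions, again assuming a hypothetical database named sales.

```sql
-- Tables in the current database
SHOW TABLES;
-- Tables in another database, without switching to it
SHOW TABLES IN sales;
```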
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#8$
Exploring Hive Databases and Tables (3)
- The DESCRIBE command displays basic structure for a table
- DESCRIBE FORMATTED shows more detailed information
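As an illustration, both forms are shown here against the customers table used elsewhere in this course. The exact output layout depends on the Hive version, so none is reproduced.

```sql
-- Column names and types
DESCRIBE customers;
-- Adds details such as the table's location, owner, and storage format
DESCRIBE FORMATTED customers;
```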
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#9$
An Introduction to HiveQL
- HiveQL is Hive's query language
  - Based on a subset of SQL-92, plus Hive-specific extensions
- Some limitations compared to standard SQL
  - Some features are not supported
  - Others are only partially implemented
- HiveQL also has some features not offered in SQL

HiveQL Basics
- Hive keywords are not case-sensitive
  - Though they are often capitalized by convention
- Statements are terminated by a semicolon
  - A statement may span multiple lines
- Comments begin with -- (double hyphen)
  - Only supported in Hive scripts
  - There are no multi-line comments in Hive
$ cat myscript.hql
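A script that follows these rules might look like the sketch below. The contents are hypothetical (the query is borrowed from the Hive shell example earlier in the course) and are not the original file.

```sql
-- Hypothetical contents of myscript.hql
-- Keywords are capitalized by convention only
-- The statement spans two lines and ends with a semicolon
SELECT cust_id, fname, lname
  FROM customers WHERE zipcode=20525;
```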
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#12$
Selecting Data from Hive Tables
- The SELECT statement retrieves data from Hive tables
  - Can specify an ordered list of individual columns
  - An asterisk matches all columns in the table
hive> SELECT * FROM customers;

Limiting and Sorting Query Results
- The LIMIT clause sets the maximum number of rows returned
- Caution: there is no guarantee regarding which rows are returned (for example, which 10 rows with LIMIT 10)
  - Use ORDER BY for top-N queries
  - The field(s) you ORDER BY must be selected
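The difference between a bare LIMIT and a top-N query can be sketched like this, using the customers columns from the earlier shell example. Sorting on cust_id is an arbitrary choice made for illustration.

```sql
-- Returns at most 10 rows, but which 10 is not guaranteed
SELECT cust_id, fname, lname FROM customers LIMIT 10;

-- Top-N: the 10 highest cust_id values
-- cust_id appears in the SELECT list because it is used in ORDER BY
SELECT cust_id, fname, lname FROM customers
ORDER BY cust_id DESC
LIMIT 10;
```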
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#14$
Using a WHERE Clause to Restrict Results
- WHERE clauses restrict rows to those matching specified criteria
  - String comparisons are case-sensitive
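A few hedged examples of WHERE clauses, reusing the customers columns and values seen in the Hive shell example. The specific filters are illustrative only.

```sql
-- Numeric comparison
SELECT * FROM customers WHERE zipcode = 20525;
-- String comparison is case-sensitive: 'shepard' would not match 'Shepard'
SELECT cust_id, fname, lname FROM customers WHERE lname = 'Shepard';
-- Conditions can be combined with AND and OR
SELECT cust_id FROM customers WHERE zipcode = 20525 AND fname LIKE 'Q%';
```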
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#15$
Table Aliases
- Table aliases can help simplify complex queries
  - Note: Using AS to specify table aliases is not supported
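A small sketch of table aliases written without AS. It assumes the customers and orders tables used in this course's join examples, with columns combined here for illustration.

```sql
-- c and o are table aliases, placed directly after the table names
SELECT c.cust_id, c.fname, c.lname, o.total
FROM customers c
JOIN orders o
ON (c.cust_id = o.cust_id);
```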
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#16$
Combining Query Results with UNION ALL
- Unifies output from SELECTs into a single result set
  - The name, order, and types of columns in each query must match
  - Hive only supports UNION ALL
- UNION ALL can also be used with subqueries
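One way this might look, assuming two hypothetical tables, employees and contractors, that share the same column names and types. The UNION ALL is wrapped in a FROM-clause subquery, a form that should work across Hive releases, including older ones.

```sql
-- Both SELECTs list the same columns, in the same order, with the same types
SELECT u.emp_id, u.fname, u.lname
FROM (
  SELECT emp_id, fname, lname FROM employees WHERE state = 'CA'
  UNION ALL
  SELECT emp_id, fname, lname FROM contractors WHERE state = 'CA'
) u;
```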
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#17$
Subqueries in Hive
- Hive only supports subqueries in the FROM clause of the SELECT statement
  - It does not support correlated subqueries
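A minimal sketch of a FROM-clause subquery, assuming the orders table from the join examples. The threshold of 1500 and the alias big_orders are arbitrary. The subquery is given an alias, which Hive expects.

```sql
-- The inner query filters orders; the outer query reads from its result
SELECT big_orders.cust_id, big_orders.total
FROM (
  SELECT cust_id, total FROM orders WHERE total > 1500
) big_orders;
```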
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#18$
Hive's Data Types
- Each column in Hive is associated with a data type
- Hive supports more than a dozen types
  - Most are similar to ones found in relational databases
  - Hive also supports three complex types
- Use the DESCRIBE command to see a table's column types

Hive's Integer Types
- Integer types are appropriate for whole numbers
  - Both positive and negative values allowed

Hive's Decimal Types
- Decimal types are appropriate for floating point numbers
  - Both positive and negative values allowed
  - Caution: avoid using when exact values are required!

Other Simple Types in Hive
- Hive can also store several other types of information
  - Only one character type (variable length)
* Not available in older versions of Hive

Complex Column Types in Hive
- Hive also has a few complex data types
  - These are capable of holding multiple values
Column Type  Usage in Query       Description
ARRAY        departments[0]       Ordered list of values, all of the same type
MAP          employees['BF5314']  Key/value pairs, each of the same type
STRUCT       address.street       Named fields, of possibly mixed types

Joins in Hive
- Joining disparate data sets is a common operation in Hive
- Hive supports several types of joins
  - Inner joins
  - Outer joins (left, right, and full)
  - Cross joins (supported in Hive 0.10 and later)
  - Left semi joins
- Only equality conditions are allowed in joins
  - Valid:   customers.cust_id = orders.cust_id
  - Invalid: customers.cust_id <> orders.cust_id
  - Outputs records where the specified key is found in each table
- For best performance, list the largest table last in your query

Join Syntax
- Hive requires the following syntax for joins
- The above example is an inner join
  - Can replace JOIN with another type (e.g. RIGHT OUTER JOIN)
- The following join syntax is not valid in Hive
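The two forms this slide contrasts can be sketched as follows, using the customers and orders tables from the join examples in this chapter. The second form is an assumption about what the slide marks as invalid: the implicit comma join common in other SQL dialects. It is left commented out.

```sql
-- Explicit JOIN ... ON form, which Hive requires
SELECT c.cust_id, name, total
FROM customers c
JOIN orders o
ON (c.cust_id = o.cust_id);

-- Implicit join with the condition in WHERE (assumed to be the invalid form)
-- SELECT c.cust_id, name, total
-- FROM customers c, orders o
-- WHERE c.cust_id = o.cust_id;
```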
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#27$
Inner Join Example
customers table
cust_id  name    country
a        Alice   us
b        Bob     ca
c        Carlos  mx
d        Dieter  de
SELECT c.cust_id, name, total
FROM customers c
JOIN orders o
ON (c.cust_id = o.cust_id);

Left Outer Join Example
customers table
cust_id  name    country
a        Alice   us
b        Bob     ca
c        Carlos  mx
d        Dieter  de
SELECT c.cust_id, name, total
FROM customers c
LEFT OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

Right Outer Join Example
customers table
cust_id  name    country
a        Alice   us
b        Bob     ca
c        Carlos  mx
d        Dieter  de
SELECT c.cust_id, name, total
FROM customers c
RIGHT OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

Full Outer Join Example
customers table
cust_id  name    country
a        Alice   us
b        Bob     ca
c        Carlos  mx
d        Dieter  de
SELECT c.cust_id, name, total
FROM customers c
FULL OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

Using an Outer Join to Find Unmatched Entries
customers table
cust_id  name    country
a        Alice   us
b        Bob     ca
c        Carlos  mx
d        Dieter  de
orders table
order_id  cust_id  total
1         a        1539
2         c        1871
3         a        6352
4         b        1456
5         z        2137
SELECT c.cust_id, name, total
FROM customers c
FULL OUTER JOIN orders o
ON (c.cust_id = o.cust_id)
WHERE c.cust_id IS NULL
OR o.total IS NULL;
Result of query
cust_id  name    total
d        Dieter  NULL
NULL     NULL    2137

Cross Join Example
disks table
name
Internal hard disk
External hard disk
sizes table
size
1.0 terabytes
2.0 terabytes
3.0 terabytes
4.0 terabytes
SELECT * FROM disks
CROSS JOIN sizes;
Result of query
name                size
Internal hard disk  1.0 terabytes
Internal hard disk  2.0 terabytes
Internal hard disk  3.0 terabytes
Internal hard disk  4.0 terabytes
External hard disk  1.0 terabytes
External hard disk  2.0 terabytes
External hard disk  3.0 terabytes
External hard disk  4.0 terabytes

Left Semi Joins (1)
SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id
AND YEAR(o.order_date) = '2012');

Left Semi Joins (2)
- Hive does not support IN/EXISTS subqueries
SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id
AND YEAR(o.order_date) = '2012');

Hive Functions
- Hive offers dozens of built-in functions
  - Many are identical to those found in SQL
  - Others are Hive-specific
- Example function invocation
  - Function names are not case-sensitive
- To see information about a function
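A brief sketch of invoking a function and looking up its documentation. CONCAT and UPPER are standard Hive built-ins; picking them here, and applying CONCAT to the customers table's name columns, is for illustration only.

```sql
-- Invoke a function in a query; concat(...) would work equally well
SELECT CONCAT(fname, ' ', lname) AS fullname FROM customers;
-- Show a short description of a function
DESCRIBE FUNCTION UPPER;
-- Show a longer description
DESCRIBE FUNCTION EXTENDED UPPER;
```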
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#37$
Example Built-in Functions (1)
- These functions operate on numeric values
  - Example: SQRT(area) returns the square root (input 64, output 8)
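A few more numeric built-ins, sketched against a hypothetical measurements table whose column names are assumptions. All five functions exist in Hive; results are not shown because they depend on the data.

```sql
-- ROUND to 2 decimal places, round up, round down, absolute value, square root
SELECT ROUND(total_price, 2),
       CEIL(total_price),
       FLOOR(total_price),
       ABS(temperature),
       SQRT(area)
FROM measurements;
```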
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#38$
Example Built-in Functions (2)
- These functions operate on timestamp values
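Some commonly used date and time built-ins, sketched with hypothetical columns (order_dt, ship_dt, epoch_secs) on the orders table. The functions themselves are standard Hive built-ins. The string columns are assumed to hold values such as 2013-05-20 16:44:35.

```sql
-- Seconds since the epoch, and back to a formatted string
-- Date portion only, year portion only, and days between two dates
SELECT UNIX_TIMESTAMP(order_dt),
       FROM_UNIXTIME(epoch_secs),
       TO_DATE(order_dt),
       YEAR(order_dt),
       DATEDIFF(ship_dt, order_dt)
FROM orders;
```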
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#39$
Example Built-in Functions (3)
- Here are some other interesting functions

Record Grouping and Aggregate Functions
- GROUP BY groups selected data by one or more columns
  - Caution: Columns not part of aggregation must be listed in GROUP BY
stores table
id  city     state  region
a   Albany   NY     EAST
b   Boston   MA     EAST
c   Chicago  IL     NORTH
d   Detroit  MI     NORTH
e   Elgin    IL     NORTH
SELECT region, state,
COUNT(id) AS num
FROM stores
GROUP BY region, state;
Result of query
region  state  num
EAST    MA     1
EAST    NY     1
NORTH   IL     2
NORTH   MI     1

Built-in Aggregate Functions
- Hive offers many aggregate functions, including
Function Description                         Example Invocation
Count all rows                               COUNT(*)
Count all rows where field is not null       COUNT(fname)
Returns the largest value                    MAX(salary)
Returns the smallest value                   MIN(salary)
Adds all supplied values and returns result  SUM(price)
Returns the average of all supplied values   AVG(salary)
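Several of these aggregates can be combined in one grouped query. The sketch below assumes a hypothetical employees table with state and salary columns.

```sql
-- One output row per state; state is not aggregated, so it is in GROUP BY
SELECT state,
       COUNT(*) AS num_employees,
       MIN(salary) AS lowest,
       MAX(salary) AS highest,
       AVG(salary) AS average
FROM employees
GROUP BY state;
```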
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 10#42$
Hands-On Exercise: Running Hive Queries
- In this Hands-On Exercise, you will run Hive queries from the Hive shell, the command line, Hive scripts, and Hue.
- Please refer to the Hands-On Exercise Manual for instructions

Essential Points
- Every Hive table belongs to exactly one database
  - The SHOW DATABASES command lists databases
  - The USE command switches the active database
  - The SHOW TABLES command lists all tables in a database
- Every column in a Hive table has an associated data type
  - Most simple column types are similar to SQL
  - Hive also supports a few complex types
- HiveQL syntax is familiar to those who know SQL
  - A subset of SQL-92, plus Hive-specific extensions
  - Supports inner, outer, and left semi joins
  - Many SQL functions are built into Hive

Bibliography
The following offer more information on topics discussed in this chapter
- HiveQL Language Manual
  - http://tiny.cloudera.com/dac10a
- Hive Built-In Functions
  - http://tiny.cloudera.com/dac10b

Hive Data Management
Chapter 11

Hive Data Management
In this chapter you will learn
- How Hive encodes and stores data
- How to create Hive databases, tables, and views
- How to load data into tables
- How to alter and remove tables
- How to save query results into tables and files
- How to control access to data in Hive

Chapter Topics
Hive Data Management
- Hive Data Formats
- Creating Databases and Hive-Managed Tables
- Loading Data into Hive
- Altering Databases and Tables
- Self-Managed Tables
- Simplifying Queries with Views
- Storing Query Results
- Controlling Access to Data
- Hands-On Exercise: Data Management with Hive
- Conclusion

Hive Text File Format
- Each Hive table maps to a directory of data, typically in HDFS
  - Data stored in one or more files
- Default data format is plain text
  - One record per line (record separator is \n)
  - Columns are delimited by ^A (field separator is Control-A)
  - Complex type elements are separated by ^B (Control-B)
  - Map keys/values are separated by ^C (Control-C)
- Hive allows you to override these delimiters
  - Specify alternate values during table creation

Hive Binary File Formats
- Hive also supports a few binary formats
  - Pro: may offer better performance than text files
  - Con: may limit ability to read data from other programs
- SequenceFile is a Hadoop-specific binary format
  - Avoids the need to convert all data to/from strings
  - Supports block- and record-level data compression
- RCFile is a binary format created especially for Hive
  - RC stands for Record Columnar
  - Column-oriented format efficient for some queries
  - Otherwise, very similar to sequence files

Creating Databases in Hive
- Hive databases are simply namespaces
  - Helps to organize your tables
- To create a new database
- To conditionally create a new database
  - Avoids error in case database already exists (useful for scripting)
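Both forms can be sketched as follows. The database name dualcore is borrowed from the Sqoop example later in this chapter and is used here only as a placeholder.

```sql
-- Fails if a database named dualcore already exists
CREATE DATABASE dualcore;
-- Succeeds silently if the database already exists
CREATE DATABASE IF NOT EXISTS dualcore;
```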
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"8#
Creating a Table in Hive (1)
- Basic syntax for creating a table:
- Creates a subdirectory under /user/hive/warehouse in HDFS
  - This is Hive's warehouse directory
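As a minimal sketch of a complete CREATE TABLE statement, consider the hypothetical products table below. The table name, columns, and tab delimiter are all assumptions chosen for illustration.

```sql
-- With no LOCATION clause, and created in the default database, this
-- table's data would live in /user/hive/warehouse/products
CREATE TABLE products (
  prod_id INT,
  name STRING,
  price INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```

ROW FORMAT DELIMITED with FIELDS TERMINATED BY overrides the default Control-A field separator, and STORED AS names the file format (TEXTFILE is the default).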
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"9#
CreaEng"a"Table"In"Hive"(2)"
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"10#
CreaEng"a"Table"In"Hive"(3)"
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"11#
CreaEng"a"Table"In"Hive"(4)"
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"12#
CreaEng"a"Table"In"Hive"(5)"
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"13#
Example Table Definition
- The following example creates a new table named jobs
  - Data stored as text with four comma-separated fields per line
  - Example of corresponding record for the table above
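A definition matching that description might look like the sketch below. The four column names and types, and the sample record in the comment, are assumptions; only the table name and the comma delimiter come from the slide.

```sql
-- Four comma-separated fields per line, stored as plain text
-- A matching record could look like: 1,Data Analyst,100000,2013-06-21 15:52:03
CREATE TABLE jobs (
  id INT,
  title STRING,
  salary INT,
  posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```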
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"14#
Complex Column Types
- Recap: Hive has support for complex data types
Column Type  Description
ARRAY        Ordered list of values, all of the same type
MAP          Key/value pairs, each of the same type
STRUCT       Named fields, of possibly mixed types

Creating Tables with Complex Column Types
- Complex columns are typed
  - Arrays have a single type
  - Maps have a key type and a value type
  - Structs have a type for each attribute
- These types are specified with angle brackets
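The angle-bracket notation can be sketched with a hypothetical stores table. The column names echo the usage examples shown earlier (departments, address), but the full definition is an assumption.

```sql
-- ARRAY takes one element type, MAP takes a key type and a value type,
-- STRUCT takes a name:type pair for each attribute
CREATE TABLE stores (
  store_id SMALLINT,
  departments ARRAY<STRING>,
  staff MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zipcode:STRING>
);
```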
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"16#
Customizing Complex Type Storage
- We can control which delimiters are used for complex types
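A sketch of overriding the default Control-A, Control-B, and Control-C delimiters. The stores_text table, its columns, and the delimiter characters chosen are hypothetical.

```sql
-- Fields are split on |, array and struct elements on commas,
-- and map keys are separated from their values by colons
CREATE TABLE stores_text (
  store_id SMALLINT,
  departments ARRAY<STRING>,
  staff MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zipcode:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';
```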
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"17#
Row Format Example for Complex Types
- The following is an example of a record in that table
- Example queries on this record
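To make this concrete, assume a hypothetical stores_text table with an ARRAY column departments, a MAP column staff, and a STRUCT column address, stored with |, comma, and colon delimiters. The record in the comment is invented for illustration.

```sql
-- A record in that layout could look like:
-- 1|Audio,Photo|A123:Abe,B456:Bob|123 Main St,Palo Alto,CA,94301

-- First element of the array (indexes start at zero)
SELECT departments[0] FROM stores_text;
-- Value stored in the map under key A123
SELECT staff['A123'] FROM stores_text;
-- One named attribute of the struct
SELECT address.city FROM stores_text;
```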
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"18#
Data Validation in Hive
- Hadoop and its ecosystem are "schema on read"
  - Unlike an RDBMS, Hive does not validate data on insert
  - Files are simply moved into place
  - Loading data into tables is therefore very fast
  - Errors in file format will be discovered when queries are performed
- Missing or invalid data in Hive will be represented as NULL

Loading Data From Files (1)
- To load data, simply add files to the table's directory in HDFS
  - Can be done directly using hadoop fs commands
  - This example loads data from HDFS into Hive's sales table
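Hive's LOAD DATA statement is one way to do this from within Hive. In the sketch below, the source path is hypothetical, and the shell command in the comment is only a rough equivalent that assumes the default warehouse location.

```sql
-- Moves the file from its current HDFS location into the sales table's directory
-- Roughly: hadoop fs -mv /incoming/etl/sales.txt /user/hive/warehouse/sales/
LOAD DATA INPATH '/incoming/etl/sales.txt' INTO TABLE sales;
```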
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"21#
Loading Data From Files (2)
- Add the LOCAL keyword to load data from the local disk
  - Copies a local file (or directory) to the table's directory in HDFS
  - This is equivalent to the following hadoop fs command
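A sketch of the LOCAL variant. The local path is hypothetical, and the shell command in the comment assumes the table lives under the default warehouse directory.

```sql
-- Copies (rather than moves) a file from the local disk into HDFS
-- Roughly: hadoop fs -put /home/bob/sales.txt /user/hive/warehouse/sales/
LOAD DATA LOCAL INPATH '/home/bob/sales.txt' INTO TABLE sales;
```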
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"22#
Loading Data From Files (3)
- Add the OVERWRITE keyword to delete all records before import
  - Removes all files within the table's directory
  - Then moves the new files into that directory
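The OVERWRITE form might be written as follows, again with a hypothetical source path.

```sql
-- Existing files in the sales table's directory are removed first
LOAD DATA INPATH '/incoming/etl/sales.txt' OVERWRITE INTO TABLE sales;
```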
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"23#
Loading Data From a Relational Database
- Sqoop has built-in support for importing data into Hive
- Just add the --hive-import option to your Sqoop command
  - Creates the table in Hive (metastore)
  - Imports data from RDBMS to the table's directory in HDFS
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training \
--password training \
--fields-terminated-by '\t' \
--table employees \
--hive-import

Removing a Database
- Removing a database is similar to creating it
  - Just replace CREATE with DROP
- These commands will fail if the database contains tables
  - Add the CASCADE keyword to force removal
  - Caution: this command removes data in HDFS!
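The variants described here can be sketched as follows, with dualcore as a placeholder database name.

```sql
-- Fails if the database does not exist or still contains tables
DROP DATABASE dualcore;
-- Avoids an error if the database does not exist
DROP DATABASE IF EXISTS dualcore;
-- Drops the database and the tables in it, including their data in HDFS
DROP DATABASE dualcore CASCADE;
```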
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"26#
Removing"a"Table"
!Syntax#is#similar#to#database#removal#
!CauLon:#These#commands#can#remove#data#in#HDFS!#
Hive"does"not"have"a"rollback"or"undo"feature"
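
A minimal sketch of the removal commands described above. The database and table names are assumptions chosen for illustration.

```sql
-- Hypothetical database and table names.
DROP DATABASE dualcore;                    -- fails if the database still contains tables
DROP DATABASE IF EXISTS dualcore CASCADE;  -- drops the contained tables first

DROP TABLE customers;
DROP TABLE IF EXISTS customers;            -- no error if the table is absent
```
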
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"27#
Renaming Tables and Columns
- Use ALTER TABLE to modify a table's metadata
  - This command does not change data in HDFS
- Rename an existing table
- Rename a column by specifying its old and new names
  - Type must be specified even though it is unchanged
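
Both kinds of rename might look like this. The table and column names are hypothetical.

```sql
-- Hypothetical table and column names.
ALTER TABLE customers RENAME TO clients;

-- CHANGE takes the old name, the new name, and the (unchanged) type.
ALTER TABLE clients CHANGE fname first_name STRING;
```
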
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"28#
Modifying and Adding Columns
- You can also modify a column's type
  - The old and new column names will be the same
  - You must ensure data in HDFS conforms to the new type

Reordering and Removing Columns
- Use AFTER or FIRST to reorder columns
  - Ensure that your data in HDFS matches the new order
- Use REPLACE COLUMNS to remove columns
  - Any column not listed will be dropped from metadata
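
The column operations from the last two slides can be sketched together. The table people and its columns are assumptions made for this example; all statements change metadata only.

```sql
-- Hypothetical table: people (id INT, fname STRING, lname STRING, city STRING)

-- Change a column's type; old and new names are the same.
ALTER TABLE people CHANGE id id BIGINT;

-- Append a new column.
ALTER TABLE people ADD COLUMNS (state STRING);

-- Reorder: move city directly after id (FIRST would move it to the front).
ALTER TABLE people CHANGE city city STRING AFTER id;

-- Remove a column: state is not listed, so it is dropped from the metadata.
ALTER TABLE people REPLACE COLUMNS
  (id BIGINT, city STRING, fname STRING, lname STRING);
```
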
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"30#
Chapter Topics: Self-Managed Tables

Controlling Table Data Location
- By default, Hive stores data associated with a table in its warehouse directory
- Storing data below Hive's warehouse is not always ideal
  - Data might be shared by several users
- Use LOCATION to specify directory where table data resides

Self-Managed (External) Tables
- Recall that dropping a table removes its data in HDFS
- Using EXTERNAL when creating the table avoids this behavior
  - Dropping an external table removes only its metadata
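
LOCATION and EXTERNAL are often used together. The sketch below assumes a hypothetical table, column list, and HDFS path.

```sql
-- Hypothetical table, columns, and HDFS path.
CREATE EXTERNAL TABLE adclicks (
  campaign_id STRING,
  click_time  STRING,
  keyword     STRING,
  site        STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/dualcore/ad_data';

-- Removes only the metastore entry; files under /dualcore/ad_data remain.
DROP TABLE adclicks;
```
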
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"33#
Chapter Topics: Simplifying Queries with Views

Simplifying Complex Queries
- Complex queries can become cumbersome
  - Imagine typing this several times for different orders

Creating Views
- Views in Hive are conceptually like a table, but backed by a query
  - You cannot directly add data to a view
- Our query is now greatly simplified
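
As an illustration of the idea, a view can wrap a multi-table join. Only the view name order_info appears in these slides; the tables, columns, and order ID below are assumptions.

```sql
-- Hypothetical tables and columns; only the name order_info is from the slides.
CREATE VIEW order_info AS
SELECT o.order_id, o.order_date, p.brand, p.name
FROM orders o
JOIN order_details d ON (o.order_id = d.order_id)
JOIN products p ON (d.prod_id = p.prod_id);

-- Queries against the view stay short.
SELECT * FROM order_info WHERE order_id = 6584288;
```
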
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"36#
Inspecting and Removing Views
- Use DESCRIBE FORMATTED to see underlying query
- Use DROP VIEW to remove a view
hive> DROP VIEW order_info;

Chapter Topics: Storing Query Results

Saving Query Output to a Table
- SELECT statements display their results on screen
- To send results to a Hive table, use INSERT OVERWRITE TABLE
  - Destination table must already exist
  - Existing contents will be deleted
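
A short sketch of this form, assuming a destination table named ny_customers has already been created with columns matching customers:

```sql
-- Assumes ny_customers already exists with a compatible schema (hypothetical).
INSERT OVERWRITE TABLE ny_customers
SELECT * FROM customers WHERE state = 'NY';
```
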
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"39#
Creating Tables Based On Existing Data
- Hive supports creating a table based on a SELECT statement
  - Often known as Create Table As Select (CTAS)
- Column definitions are derived from the existing table
- Column names are inherited from the existing names
  - Use aliases in the SELECT statement to specify new names
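
A CTAS statement with aliases might look like the following. The new table name and the aliases are illustrative assumptions.

```sql
-- Hypothetical new table name and aliases.
CREATE TABLE ny_customers AS
SELECT cust_id,
       fname AS first_name,   -- alias becomes the new column name
       lname AS last_name
FROM customers
WHERE state = 'NY';
```
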
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"40#
Writing Output To a Filesystem
- You can save output to a file in HDFS
- Add LOCAL to store results to local disk instead
- Both produce text files delimited by Ctrl-A characters
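
Both variants are sketched below with hypothetical target directories. Note that OVERWRITE replaces whatever the target directory already contains.

```sql
-- Hypothetical directories. Existing contents of the target are replaced.

-- Write the result to a directory in HDFS.
INSERT OVERWRITE DIRECTORY '/dualcore/ny/'
SELECT * FROM customers WHERE state = 'NY';

-- Write the result to a directory on the local disk instead.
INSERT OVERWRITE LOCAL DIRECTORY '/home/training/ny/'
SELECT * FROM customers WHERE state = 'NY';
```
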
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"41#
Writing Output To HDFS, Specifying Format
- To write the files to HDFS with a user-specified format:
  - Create an external table in the required format
  - Use INSERT OVERWRITE TABLE
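
The two steps could be combined as in this sketch, which writes comma-delimited text. The table name, column types, and path are assumptions.

```sql
-- Hypothetical table name, column types, and HDFS path.
CREATE EXTERNAL TABLE ny_customers_csv (
  cust_id INT,
  fname   STRING,
  lname   STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dualcore/ny_customers_csv';

-- Files written under the LOCATION use the table's declared format.
INSERT OVERWRITE TABLE ny_customers_csv
SELECT cust_id, fname, lname FROM customers WHERE state = 'NY';
```
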
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 11"42#
Multi-Table Insert (1)
- We just saw that you can save output to an HDFS file
- This query could also be written as follows
FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_customers'
SELECT cust_id, fname, lname WHERE state='NY';

Multi-Table Insert (2)
- We sometimes need to extract data to multiple tables
  - Hive SELECT queries can take a long time to complete
- Hive allows us to do this with a single query
  - Much more efficient than using multiple queries
- The following example demonstrates multi-table insert
  - Result is two directories in HDFS
FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_names'
SELECT fname, lname WHERE state = 'NY'
INSERT OVERWRITE DIRECTORY 'ny_count'
SELECT count(DISTINCT cust_id)
WHERE state = 'NY';

Multi-Table Insert (3)
- The following query produces the same result

Chapter Topics: Controlling Access to Data

Hive Authentication
- HiveServer2 can be configured for Kerberos or LDAP authentication
  - You must provide a user ID and password when connecting to HiveServer2 from Beeline
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
beeline>

Hive Authorization with Apache Sentry
- Sentry is a plug-in for enforcing authorization
- Sentry can be enabled in HiveServer2
  - And in Impala
- Sentry provides fine-grained, role-based authorization for Hive data and metadata
- Sentry was developed at Cloudera
  - Available starting with CDH 4.3
  - Project is in incubation status at Apache
  - http://incubator.apache.org/projects/sentry.html

Sentry Access Control Model
- What does Sentry control access to?
  - Server
  - Database
  - Table
  - View
- Who can access Sentry-controlled objects?
  - Users in a Sentry role
  - Sentry roles = one or more groups
- How can role members access Sentry-controlled objects?
  - Read operations (SELECT privilege)
  - Write operations (INSERT privilege)
  - Metadata definition operations (ALL privilege)

Hands-On Exercise: Data Management with Hive
- In this Hands-On Exercise, you will create, load, modify, and delete tables in Hive.
- Please refer to the Hands-On Exercise Manual for instructions

Chapter Topics: Conclusion

Essential Points
- Each Hive table maps to a directory in HDFS
  - Table data stored as one or more files
  - Default format: plain text with delimited fields
  - Binary formats may offer better performance but limit compatibility
- Dropping a Hive-managed table deletes data in HDFS
  - External tables require manual data deletion
- ALTER TABLE is used to add, modify, and remove columns
- Views can help to simplify complex and repetitive queries
- In a secure environment, Sentry provides Hive authorization

Text Processing with Hive
Chapter 12

In this chapter, you will learn
- How to use Hive's most important string functions
- How to format numeric values
- How to use regular expressions in Hive
- What n-grams are and why they are useful
- How to estimate how often words or phrases occur in text

Chapter Topics
- Overview of Text Processing
- Important String Functions
- Using Regular Expressions in Hive
- Sentiment Analysis and n-grams
- Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
- Conclusion

Text Processing Overview
- Traditional data processing relies on highly-structured data
  - Carefully curated information in rows and columns
- What types of data are we producing today?
  - Free-form notes in electronic medical records
  - Application and server log files
  - Social network connections
  - Electronic messages
  - Product ratings
- These types of data also contain great value
  - But extracting it requires a different approach

Chapter Topics: Important String Functions

Basic String Functions
- Hive supports many string functions often found in RDBMSs
* String positions in Hive are one-based, as in SQL, but a starting position of 0 is also accepted and treated as 1
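
A few of the common string functions, sketched with literal arguments. The FROM clause is included because Hive releases of this era do not accept a SELECT without one; it assumes the customers table used elsewhere in this course.

```sql
-- Literal arguments for illustration; customers is assumed to exist.
SELECT UPPER('Bob'),            -- BOB
       LOWER('Bob'),            -- bob
       TRIM('  Bob  '),         -- Bob
       LENGTH('Bob'),           -- 3
       SUBSTRING('Bob', 1, 2),  -- Bo
       SUBSTRING('Bob', 0, 2)   -- Bo (position 0 behaves like position 1)
FROM customers LIMIT 1;
```
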
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#7$
Parsing URLs with Hive
- Hive offers built-in support for parsing Web addresses (URLs)
- The following examples assume the following URL as input
http://www.example.com/click.php?A=42&Z=105

Example Invocation                  Output
PARSE_URL(url, 'PROTOCOL')          http
PARSE_URL(url, 'HOST')              www.example.com
PARSE_URL(url, 'PATH')              /click.php
PARSE_URL(url, 'QUERY')             A=42&Z=105
PARSE_URL(url, 'QUERY', 'A')        42
PARSE_URL(url, 'QUERY', 'Z')        105

Numeric Format Functions
- Hive offers two functions for formatting a number
  - Simple: FORMAT_NUMBER (0.10.0 and later)
  - Versatile: PRINTF (0.9.0 and later)

Example Invocation                      Input       Output
PRINTF("%s owes $%1.2f", name, amt)     Bob, 3.9    Bob owes $3.90

- Caution: avoid storing precise values as floating point numbers!
  - These functions are best used for formatting results
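
To complement the PRINTF row, here is a small sketch of FORMAT_NUMBER, which adds thousands separators and rounds to the requested number of decimal places. The literal values are illustrative, and the customers table is assumed to exist.

```sql
-- Literal values for illustration; customers is assumed to exist.
SELECT FORMAT_NUMBER(1234567.891, 2),  -- 1,234,567.89
       PRINTF('%05d', 42)              -- 00042
FROM customers LIMIT 1;
```
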
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#9$
Splitting and Combining Strings
- CONCAT combines one or more strings
  - The CONCAT_WS variation joins them with a separator
- SPLIT does nearly the opposite
  - Difference: return value is ARRAY<STRING>

Example Invocation                  Output
CONCAT('alice', '@example.com')     alice@example.com

* Representation of ARRAY<STRING>
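
CONCAT_WS and SPLIT can be sketched as inverse operations. The literals are illustrative, and the customers table is assumed to exist.

```sql
-- Literal values for illustration; customers is assumed to exist.
SELECT CONCAT_WS(' ', 'Bob', 'Smith'),  -- Bob Smith
       SPLIT('Bob Smith', ' ')          -- ["Bob","Smith"]
FROM customers LIMIT 1;
```

The second argument to SPLIT is a regular expression, so characters such as `.` or `|` need escaping when used as literal delimiters.
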
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#10$
Converting Array to Records with EXPLODE
- The EXPLODE function creates a record for each element in an array
  - An example of a table generating function
  - The alias is required when invoking table generating functions
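
A minimal sketch that combines SPLIT and EXPLODE. The table people_phones and its comma-separated phones column are assumptions made for this example.

```sql
-- Hypothetical table: people_phones (name STRING, phones STRING),
-- where phones holds a comma-separated list.
SELECT EXPLODE(SPLIT(phones, ',')) AS phone
FROM people_phones;
```

Each phone number becomes its own output row. Used this way, the table generating function must be the only expression in the SELECT list.
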
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#11$
Chapter Topics: Using Regular Expressions in Hive

Regular Expressions
- A regular expression (regex) matches a pattern in text
  - Useful when exact matching is not practical
- Each expression below is applied to the following string
I wish Dualcore had 2 stores in 90210.

Regular Expression
Dualcore
\\d
\\d{5}
\\d\\s\\w+
\\w{5,9}
.?\\.
.*\\.
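
One way to see what each expression matches is REGEXP_EXTRACT with group index 0, which returns the first full match. The table notes and its txt column are hypothetical; txt is assumed to hold the sample sentence.

```sql
-- Hypothetical table notes with a STRING column txt containing:
--   I wish Dualcore had 2 stores in 90210.
SELECT REGEXP_EXTRACT(txt, 'Dualcore', 0),    -- Dualcore
       REGEXP_EXTRACT(txt, '\\d', 0),         -- 2
       REGEXP_EXTRACT(txt, '\\d{5}', 0),      -- 90210
       REGEXP_EXTRACT(txt, '\\d\\s\\w+', 0),  -- 2 stores
       REGEXP_EXTRACT(txt, '\\w{5,9}', 0),    -- Dualcore
       REGEXP_EXTRACT(txt, '.?\\.', 0),       -- 0.
       REGEXP_EXTRACT(txt, '.*\\.', 0)        -- the entire sentence
FROM notes;
```
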
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#13$
Hive's Regular Expression Functions
- Hive has two important functions that use regular expressions
  - REGEXP_EXTRACT returns the matched text
  - REGEXP_REPLACE substitutes another value for the matched text
- These examples assume that txt has the following value
It's on Oak St. or Maple St in 90210
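
A sketch of both functions applied to that value. The table name notes is an assumption; the third argument of REGEXP_EXTRACT selects a capture group.

```sql
-- Hypothetical table notes whose txt column holds:
--   It's on Oak St. or Maple St in 90210
SELECT REGEXP_EXTRACT(txt, '(\\d{5})', 1),       -- 90210
       REGEXP_EXTRACT(txt, 'on (\\w+) St', 1),   -- Oak
       REGEXP_REPLACE(txt, 'St\\.?', 'Street')   -- It's on Oak Street or Maple Street in 90210
FROM notes;
```
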
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#14$
Regex SerDe
- We sometimes need to analyze data that lacks consistent delimiters
  - Log files are a common example of this
- RegexSerDe will read records based on supplied regular expression
  - Allows us to create a table from this log file

Creating a Table with Regex SerDe (1)
Log excerpt
05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"
RegexSerDe
CREATE TABLE calls (
event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");
- Each pair of parentheses denotes a field
  - Field value is text matched by pattern within parentheses

Creating a Table with Regex SerDe (2)
Table excerpt
event_date    event_time    phone_num       event_type       details
05/23/2013    19:45:19      312-555-7834    CALL_RECEIVED
05/23/2013    19:48:37      312-555-7834    COMPLAINT        Item not received

Regex SerDe in Older Versions of Hive
- The Regex SerDe wasn't formally part of Hive prior to 0.10.0
  - It shipped with Hive, but was part of the hive-contrib library
- To use Regex SerDe in 0.9.x and earlier versions of Hive
  - Add this JAR file to Hive
  - Change the SerDe's package name, as shown below

Fixed-Width Formats in Hive
- Many older applications produce data in fixed-width formats
1030929610759620120829012215Oakland             CA94618
- Unfortunately, Hive doesn't directly support these
  - But you can overcome this limitation by using RegexSerDe
- Caveat: all fields in RegexSerDe are of type STRING
  - May need to cast numeric values in your queries
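
Casting might look like the following sketch. It assumes the RegexSerDe-backed table named fixed from the fixed-width example in this chapter, with STRING columns cust_id, order_id, city, state, and zip.

```sql
-- Assumes the table "fixed" from the fixed-width example in this chapter.
SELECT CAST(cust_id AS INT)  AS cust_id,   -- STRING to INT
       CAST(order_id AS INT) AS order_id,
       TRIM(city)            AS city,      -- strip the fixed-width padding
       state,
       zip
FROM fixed;
```
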
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#19$
Fixed-Width Format Example
Input data
1030929610759620120829012215Oakland             CA94618
RegexSerDe
CREATE TABLE fixed (
cust_id STRING,
order_id STRING,
order_dt STRING,
order_tm STRING,
city STRING,
state STRING,
zip STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"(\\d{7})(\\d{7})(\\d{8})(\\d{6})(.{20})(\\w{2})(\\d{5})");

Chapter Topics: Sentiment Analysis and n-grams

Parsing Sentences into Words
- Hive's SENTENCES function parses supplied text into words
- Input is a string containing one or more sentences
- Output is a two-dimensional array of strings
  - Outer array contains one element per sentence
  - Inner array contains one element per word in that sentence
hive> SELECT txt FROM phrases WHERE id=12345;
I bought this computer and really love it! It's very fast and
does not crash.
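
Applying SENTENCES to the same record can be sketched as follows, using the phrases table and id from the query above.

```sql
-- Returns an ARRAY<ARRAY<STRING>>: one inner array per sentence
-- (two for the text above), each holding that sentence's words.
SELECT SENTENCES(txt) FROM phrases WHERE id=12345;
```
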
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#22$
Sentiment Analysis
- Sentiment analysis is an application of text analytics
  - Classification and measurement of opinions
  - Frequently used for social media analysis
- Context is essential for human languages
  - Which word combinations appear together?
  - How frequently do these combinations appear?
- Hive offers functions that help answer these questions

n-grams
- An n-gram is a word combination (n=number of words)
  - Bigram is a sequence of two words (n=2)
- n-gram frequency analysis is an important step in many applications
  - Suggesting spelling corrections in search results
  - Finding the most important topics in a body of text
  - Identifying trending topics in social media messages

Calculating n-grams in Hive (1)
- Hive offers the NGRAMS function for calculating n-grams
- The function requires three input parameters
  - Array of sentences, each of which is an array of strings (words)
  - Number of words in each n-gram
  - Desired number of results (top-N, based on frequency)
- Output is an array of STRUCT with two attributes
  - ngram: the n-gram itself (an array of words)
  - estfrequency: estimated frequency at which this n-gram appears

Calculating n-grams in Hive (2)
- The NGRAMS function is often used with the SENTENCES function
  - We also used LOWER to normalize case
  - And EXPLODE to convert the resulting array to a series of rows
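
The combination described above might be written as in this sketch, which asks for the five most frequent bigrams. It assumes the phrases table with a txt column used elsewhere in this chapter; the values 2 and 5 are illustrative.

```sql
-- Top 5 bigrams (n=2) across all rows of phrases; parameter values are illustrative.
SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(txt)), 2, 5)) AS bigrams
FROM phrases;
```

Each output row is a STRUCT with the ngram and estfrequency attributes.
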
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#26$
Finding Specific n-grams in Text
- CONTEXT_NGRAMS is similar, but considers only specific combinations
  - Additional input parameter: array of words used for filtering
  - Any NULL values in the array are treated as placeholders
hive> SELECT EXPLODE(CONTEXT_NGRAMS(SENTENCES(LOWER(txt)),
ARRAY("new", "computer", NULL, NULL), 4, 3)) AS ngrams
FROM phrases;
{"ngram":["is","expensive"],"estfrequency":1.0}
{"ngram":["failed","already"],"estfrequency":1.0}
{"ngram":["is","fast"],"estfrequency":1.0}

Histograms
- Histograms illustrate how values in the data are distributed
  - This helps us estimate the overall shape of the data distribution

Calculating Data for Histograms
- HISTOGRAM_NUMERIC creates data needed for histograms
  - Input: column name and number of bins in the histogram
  - Output: coordinates representing bin centers and heights
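
A minimal sketch of a ten-bin histogram. The table cart_orders and its numeric column total_price are hypothetical.

```sql
-- Hypothetical table and column.
-- Each output row is a STRUCT with x (bin center) and y (bin height).
SELECT EXPLODE(HISTOGRAM_NUMERIC(total_price, 10)) AS dist
FROM cart_orders;
```
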
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 12#29$
Hands-On Exercise: Gaining Insight with Sentiment Analysis
- In this Hands-On Exercise, you will analyze comments in product rating data with Hive
- Please refer to the Hands-On Exercise Manual for instructions

Chapter Topics: Conclusion

Essential Points
- Most data produced these days lacks rigid structure
  - Text processing can help us analyze loosely-structured data
- The SPLIT function creates an array from a string
  - EXPLODE creates individual records from an array
- Hive has extensive support for regular expressions
  - You can extract or substitute values based on patterns
  - You can even create a table based on regular expressions
- An n-gram is a sequence of words
  - Use NGRAMS and CONTEXT_NGRAMS to find their frequency

Hive Optimization
Chapter 13

In this chapter, you will learn
- Which factors help determine the performance of Hive queries
- What command displays Hive's execution plan for a query
- How to enable several useful Hive performance features
- How to partition tables to reduce amount of data read for a query
- How to use table bucketing to sample data
- How to create and rebuild indexes in Hive

Chapter Topics
- Understanding Query Performance
- Controlling Job Execution
- Partitioning
- Bucketing
- Indexing Data
- Conclusion

How Hive Processes Data
[Figure: flow of processing steps: Parse HiveQL Statements, Read Metadata for Tables, Build the Execution Plan, Submit MapReduce Jobs to Hadoop Cluster, Assign Tasks to Nodes, Start Java Process for Task, Read Input Data, Map Processing Phase, Reduce Processing Phase, Write Output Data]

Hive Query Performance Patterns (1)
- The fastest queries involve only metadata
DESCRIBE customers;
- The next fastest simply read from HDFS
- Then a query that requires a map-only job

Hive Query Performance Patterns (2)
- The next slowest type of query requires both Map and Reduce phases
- The slowest queries require multiple MapReduce jobs
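
The remaining patterns can be illustrated with queries like these. They assume the customers table has zipcode and cust_id columns; how Hive actually executes each one depends on its version and configuration, so EXPLAIN is the way to confirm.

```sql
-- Typically read straight from HDFS, with no MapReduce job.
SELECT * FROM customers;

-- Filtering rows typically needs a map-only job.
SELECT * FROM customers WHERE zipcode = '94305';

-- Aggregation typically needs both map and reduce phases.
SELECT COUNT(cust_id) FROM customers WHERE zipcode = '94305';

-- Aggregating and then sorting globally typically needs multiple jobs.
SELECT zipcode, COUNT(cust_id) AS num
FROM customers
GROUP BY zipcode
ORDER BY num DESC;
```
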
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#7$
Viewing the Execution Plan
- How can you tell how Hive will execute a query?
  - Does it read only metadata?
  - Can it return data directly from HDFS?
  - Will it require a reduce phase or multiple MapReduce jobs?
- Prefix your query with EXPLAIN to view Hive's execution plan
- The output of EXPLAIN can be very long and complex
  - Fully understanding it requires in-depth knowledge of MapReduce
  - We will cover the basics here
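
A sketch of a query that fits the plan excerpts in this section, which create a table cust_by_zip with columns zipcode and num. The exact query text is an assumption reconstructed from those excerpts.

```sql
-- Assumed query, reconstructed from the plan excerpts (cust_by_zip, zipcode, num).
EXPLAIN
CREATE TABLE cust_by_zip AS
SELECT zipcode, COUNT(cust_id) AS num
FROM customers
GROUP BY zipcode;
```
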
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#8$
Viewing a Query Plan with EXPLAIN (1)
- The query plan contains three main sections
  - Abstract syntax tree details how Hive parsed query (excerpt below)
  - The stage dependencies and plans are more useful to most users
STAGE DEPENDENCIES:
... (excerpt shown on next slide)
STAGE PLANS:
... (excerpt shown on upcoming slide)

Viewing a Query Plan with EXPLAIN (2)

Viewing a Query Plan with EXPLAIN (3)
- Stage-1: MapReduce job
STAGE PLANS:
  Stage: Stage-1
    Map Reduce

Viewing a Query Plan with EXPLAIN (4)
- Stage-0: HDFS action
  - Move previous stage's output to Hive's warehouse directory
- Stage-2: Metastore action
  - Create new table
  - Has two columns
STAGE PLANS:
  Stage: Stage-1 (covered earlier)
  ...
  Stage: Stage-0
    Move Operator
      files:
        hdfs directory: true
        destination: (HDFS path...)
  Stage: Stage-2
    Create Table Operator:
      Create Table
        columns: zipcode string, num bigint
        name: cust_by_zip

Sorting Results
- As in SQL, ORDER BY sorts specified fields in HiveQL
  - Consider the result from the following query
[Figure: with ORDER BY, mapper output is processed by a single reducer]

Using SORT BY for Partial Ordering
- HiveQL also supports partial ordering via SORT BY
  - Offers much better performance if global order isn't required
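
The difference between the two clauses can be sketched as follows. The table sales and its columns are hypothetical.

```sql
-- Hypothetical table: sales (name STRING, amount INT)

-- Global ordering: all mapper output passes through a single reducer.
SELECT name, amount FROM sales ORDER BY name;

-- Partial ordering: rows are sorted within each reducer's output only.
SELECT name, amount FROM sales SORT BY name;
```
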
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#14$
Chapter Topics: Controlling Job Execution

Parallel Execution
- Stages in Hive's execution plan often lack dependencies
  - This means they can be run in parallel
- Hive supports parallel execution in such cases
  - However, this feature is disabled by default
- Enable this by setting the hive.exec.parallel property to true
  - Set only for yourself in $HOME/.hiverc
  - Set for all users in /etc/hive/conf/hive-site.xml
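
In the Hive shell, or as a line in $HOME/.hiverc, the property can be set with a SET command:

```sql
-- Applies to the current session (or to every session when placed in .hiverc).
SET hive.exec.parallel=true;
```
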
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#16$
Reducing Latency Through Local Execution
- Running MapReduce jobs on the cluster has significant overhead
  - Must divide work, assign tasks, start processes, collect results, etc.
  - Necessary to process large amounts of data in Hive
  - Possibly inefficient with small amount of data
- Processing data locally can substantially improve turnaround for small jobs

Automatic Selection of Local Execution Mode
- Hive can now select local execution mode automatically
  - It does this on a case-by-case basis using heuristics
- Like parallel execution, this feature is disabled by default
  - Enable by setting hive.exec.mode.local.auto to true
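
As with other properties, this one can be set for the current session from the Hive shell:

```sql
-- Let Hive decide per query whether local execution is appropriate.
SET hive.exec.mode.local.auto=true;
```
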
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#18$
Job Control
- You will see many log messages when you run a query in Hive's shell
  - One of these messages will identify the MapReduce job ID
- Hadoop's mapred command lets you kill the job or view its status

Viewing a Job in the Web UI (1)
- The log message will also show a Web address
  - This is a link to the job's detail page in Hadoop's Web UI

Viewing a Job in the Web UI (2)

Chapter Topics: Partitioning

Recap: Data Storage in Hive
- The metastore maintains information about Hive tables
  - A table simply points to a directory in HDFS
  - The table's data are files within that directory
- All files in the directory are read during a query
[Figure: call_logs table]
File Reads in Hive Tables
- Dualcore's phone system creates daily logs detailing calls received
  - Our analysts use this data to summarize the previous day's calls
- These queries always filter by a value in the call_date field
  - But still read all files within the table's directory
Table Partitioning
- It is possible to create a table that partitions the data
  - Queries that filter on partitioned fields limit the amount of data read
  - Does not prevent you from running queries that span partitions
[Figure: call_logs table]
Creating a Partitioned Table in Hive (1)
- Specify the partition column in the PARTITIONED BY clause
- You can also create nested partitions
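The table definitions for this slide are not reproduced in this text. The sketch below shows one way they might look. Only call_logs and call_date come from the chapter; the other column names, the second table, and the tab-delimited row format are assumptions made for illustration.

```sql
-- Partition column is declared in PARTITIONED BY, not in the column list
CREATE TABLE call_logs (
  call_time   STRING,   -- hypothetical column
  phone       STRING,   -- hypothetical column
  event_type  STRING,   -- hypothetical column
  details     STRING)   -- hypothetical column
PARTITIONED BY (call_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Nested partitions: list more than one partition column
-- (call_logs_by_region and region are hypothetical names)
CREATE TABLE call_logs_by_region (
  call_time   STRING,
  phone       STRING,
  event_type  STRING,
  details     STRING)
PARTITIONED BY (call_date STRING, region STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```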
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#26$
Creating a Partitioned Table in Hive (2)
- The partition column is displayed if you DESCRIBE the table
- However, the partition is a virtual column
  - The data does not exist in your incoming data
  - Instead, you specify the partition when loading the data
Loading Data Into Partitions
- You must specify the partition field value when loading data
- Loading data for the dates 2013-06-03 and 2013-06-04 would create two subdirectories:
    /user/hive/warehouse/call_logs/call_date=2013-06-03
    /user/hive/warehouse/call_logs/call_date=2013-06-04
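The load statements that produce these subdirectories are not reproduced in this text. A sketch consistent with the two paths above might look like the following; the source file paths are hypothetical.

```sql
-- Each LOAD names the partition value explicitly; Hive places the file
-- in the matching call_date=... subdirectory of the table's directory
LOAD DATA INPATH '/dualcore/call-20130603.log'   -- hypothetical HDFS path
  INTO TABLE call_logs
  PARTITION (call_date='2013-06-03');

LOAD DATA INPATH '/dualcore/call-20130604.log'   -- hypothetical HDFS path
  INTO TABLE call_logs
  PARTITION (call_date='2013-06-04');
```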
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#28$
Dynamic Partition Inserts (1)
- What if your table already exists, contains data, and lacks partitions?
  - Hive can dynamically insert data into specific partitions for you
- Syntax:
    FROM customers
    INSERT OVERWRITE TABLE custs_part PARTITION(state)
    SELECT cust_id, fname, lname, address, city,
           zipcode, state;
- Partitions are automatically created based on the value of the last column
  - If the partition does not already exist, it will be created
  - If the partition does exist, it will be overwritten
Dynamic Partition Inserts (2)
- Dynamic partitioning is not enabled by default
  - Enable it by setting these two properties

    Property Name                       Value
    hive.exec.dynamic.partition         true
    hive.exec.dynamic.partition.mode    nonstrict

- Caution: avoid creating an excessive number of partitions
  - This can happen if the partition column contains many unique values
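Putting the two properties together with the insert statement from the previous slide gives roughly the following session. The definition of custs_part is an assumption (the slides do not show it), including the column types.

```sql
-- Target table, partitioned by state (column types are assumed)
CREATE TABLE custs_part (
  cust_id  INT,
  fname    STRING,
  lname    STRING,
  address  STRING,
  city     STRING,
  zipcode  STRING)
PARTITIONED BY (state STRING);

-- Dynamic partitioning is off by default, so enable it for this session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last column in the SELECT list (state) determines the partition
FROM customers
INSERT OVERWRITE TABLE custs_part PARTITION(state)
SELECT cust_id, fname, lname, address, city, zipcode, state;
```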
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#30$
Dynamic Partition Inserts (3)
- Caution: if the partition column has many different values, many partitions will be created
- Three Hive configuration properties exist to limit this
  - hive.exec.max.dynamic.partitions.pernode
    - Maximum number of dynamic partitions that can be created by any given Mapper or Reducer
    - Default 100
  - hive.exec.max.dynamic.partitions
    - Total number of dynamic partitions that can be created by one HiveQL statement
    - Default 1000
  - hive.exec.max.created.files
    - Maximum total files created by Mappers and Reducers
    - Default 100000
Viewing, Adding, and Removing Partitions
- Use SHOW PARTITIONS to view the current partitions in a table
- Use ALTER TABLE to add or drop partitions
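The commands for this slide are not reproduced in this text. A brief sketch using the call_logs table from earlier in the chapter follows; the specific dates are illustrative.

```sql
-- List the partitions that currently exist
SHOW PARTITIONS call_logs;

-- Add an (initially empty) partition
ALTER TABLE call_logs ADD PARTITION (call_date='2013-06-05');

-- Remove a partition that is no longer needed
ALTER TABLE call_logs DROP PARTITION (call_date='2013-06-03');
```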
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#32$
What Is Bucketing?
- Partitioning subdivides data by values in partitioned columns
- Bucketing data is another way of subdividing data
  - Calculates hash code for values inserted into bucketed columns
  - Hash code used to assign new records to a bucket
- Goal: distribute rows across a predefined number of buckets
  - Useful for jobs which need random samples of data
  - Joins may be faster if all tables are bucketed on the join column
Creating a Bucketed Table
- Example: a table supporting 20 buckets based on the order_id column
  - Each bucket should contain roughly 5% of the table's data
- Column selected for bucketing should have well-distributed values
  - Identifier columns are often a good choice
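The table definition described on this slide is not reproduced in this text. A sketch of a 20-bucket table keyed on order_id could look like this; the table name and the columns other than order_id are assumptions.

```sql
-- CLUSTERED BY names the bucketing column; INTO ... BUCKETS fixes the count
CREATE TABLE orders_bucketed (   -- hypothetical table name
  order_id    INT,
  cust_id     INT,               -- hypothetical column
  order_date  STRING)            -- hypothetical column
CLUSTERED BY (order_id) INTO 20 BUCKETS;
```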
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#35$
Inserting Data Into a Bucketed Table
- Bucketing isn't automatically enforced when inserting data
- Set the hive.enforce.bucketing property to true
  - This sets the number of reducers to the number of buckets in the table definition
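As an illustration, populating a bucketed table might look like the following. It assumes a bucketed table named orders_bucketed and a source table named orders with matching columns; both sets of names are hypothetical.

```sql
-- Ask Hive to honor the table's bucket definition when writing
SET hive.enforce.bucketing=true;

-- Rows are hashed on the bucketing column and written to the matching bucket
INSERT OVERWRITE TABLE orders_bucketed
SELECT order_id, cust_id, order_date
FROM orders;
```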
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#36$
Sampling Data From a Bucketed Table
- Use the TABLESAMPLE clause to sample data from a bucketed table
  - Sampling one bucket out of every ten selects roughly one of every ten records (10%)
- It is possible to use TABLESAMPLE on a non-bucketed table
  - However, this requires a full scan of the entire table
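The sampling syntax for this slide is not reproduced in this text. The sketch below follows the TABLESAMPLE form documented in the Hive manual. The table names are hypothetical, and the first query assumes the table is bucketed on order_id.

```sql
-- Bucketed table: sample on the bucketing column, about 10% of the rows
SELECT *
FROM orders_bucketed TABLESAMPLE (BUCKET 1 OUT OF 10 ON order_id);

-- Non-bucketed table: sampling on rand() works, but Hive must scan every row
SELECT *
FROM orders TABLESAMPLE (BUCKET 1 OUT OF 10 ON rand()) s;
```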
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#37$
Indexes in Hive
- Tables in Hive also now support indexes
  - Similar to indexes in RDBMSs, but much more limited
- May improve performance for certain types of queries
  - But maintaining them costs disk space and CPU time
- An index is created with the CREATE INDEX statement, which names a handler class
- Handler class is the fully-qualified name of a Java class, such as:
    org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler
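The CREATE INDEX syntax itself is not reproduced in this text. A sketch for Hive versions that support indexes is shown below; the index name and the indexed column are hypothetical, while the orders table and the handler class come from this chapter.

```sql
-- Define an index on one column of the orders table.
-- WITH DEFERRED REBUILD creates the index empty; it is built later.
CREATE INDEX idx_orders_cust_id          -- hypothetical index name
ON TABLE orders (cust_id)                -- hypothetical column
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
```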
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#39$
Viewing and Building Indexes in Hive
- The SHOW INDEX command lists the indexes associated with a table, such as orders
- Hive indexes are initially empty
  - Building (and later rebuilding) indexes is a manual process
  - Use the ALTER INDEX command to rebuild an index
  - Caution: this can be a lengthy operation!
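The commands referred to on this slide are not reproduced in this text. A sketch follows, assuming an index named idx_orders_cust_id (a hypothetical name) has already been defined on the orders table.

```sql
-- List the indexes defined on the orders table
SHOW INDEX ON orders;

-- Build (or rebuild) the index; this can take a long time on large tables
ALTER INDEX idx_orders_cust_id ON orders REBUILD;
```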
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 13#40$
Essential Points
- ORDER BY sorts results globally, just as in SQL
  - HiveQL also supports SORT BY for partial ordering
- Local execution mode can significantly reduce query latency
  - But only appropriate to use with small amounts of data
- Partitioning and bucketing can both subdivide a table's data
  - Partitioning may reduce the amount of data a query must read
  - Bucketing is used to support random sampling
- Hive's indexing feature can boost performance for certain queries
  - But it comes at the cost of increased disk and CPU usage
Bibliography
The following offer more information on topics discussed in this chapter
- Hive Manual for the EXPLAIN Command
  - http://tiny.cloudera.com/dac13a
- Hive Manual for Bucketed Tables
  - http://tiny.cloudera.com/dac13b
- Hive Manual for Indexes
  - http://tiny.cloudera.com/dac13c
Extending Hive
Chapter 14
Extending Hive
In this chapter, you will learn
- What role SerDes play in Hive
- How to use a custom SerDe
- How to use TRANSFORM for custom record processing
- How to add support for a User-Defined Function (UDF)
- How to use variable substitution
Chapter Topics
Extending Hive
- SerDes
- Data Transformation with Custom Scripts
- User-Defined Functions
- Parameterized Queries
- Hands-On Exercise: Data Transformation with Hive
- Conclusion
Hive SerDes
- Hive uses a SerDe for reading and writing records
  - Stands for Serializer / Deserializer
  - SerDes control the row format of the table
  - Specified, sometimes implicitly, when table is created
- Hive ships with many SerDes, including:

    Name              Reads and Writes Records
    LazySimpleSerDe   Using specified field delimiters (default)
    RegexSerDe        Based on supplied patterns
    ColumnarSerDe     Using the columnar format needed by RCFile
    HBaseSerDe        Using an HBase table
Recap: Creating a Table with Regex SerDe
Input Data
    05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""
    05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"
Using SerDe
    CREATE TABLE calls (
      event_date STRING,
      event_time STRING,
      phone_num STRING,
      event_type STRING,
      details STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" =
      "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");
Resulting Table
    event_date  event_time  phone_num     event_type     details
    05/23/2013  19:45:19    312-555-7834  CALL_RECEIVED
    05/23/2013  19:48:37    312-555-7834  COMPLAINT      Item not received
Adding a Custom SerDe to Hive
- Hive also allows writing custom SerDes using its Java API
  - There are many open source SerDes on the Web
  - Writing your own is seldom necessary
- We will now explain how to add a custom SerDe to Hive
  - It reads and writes records in CSV format
  - Using JAR file from http://tiny.cloudera.com/dac14a
Adding a JAR File to Hive
- You must register external libraries before using them
  - Ensures Hive can find the library (JAR file) at runtime
- Remains in effect only during the current Hive session
  - Consider editing your .hiverc file to add frequently-used JARs
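The registration command is not reproduced in this text. In the Hive shell it is typically ADD JAR, as sketched below; the JAR path is hypothetical.

```sql
-- Register the library for the current session only
ADD JAR /home/training/csv-serde.jar;   -- hypothetical path

-- Show which JARs are currently registered
LIST JARS;
```

Adding the same ADD JAR line to your .hiverc file would register the JAR at the start of every session.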
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#8$
Using the SerDe in Hive
Input Data
    1,Gigabux,gigabux@example.com
    2,"ACME Distribution Co.",acme@example.com
    3,"Bitmonkey, Inc.",bmi@example.com
Specify SerDe
    CREATE TABLE vendors
      (id INT,
       name STRING,
       email STRING)
    ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde';
Using TRANSFORM to Process Data Using External Scripts
- You are not limited to manipulating data exclusively in HiveQL
  - Hive allows you to transform data through external scripts or programs
  - These can be written in nearly any language
- This is done with HiveQL's TRANSFORM ... USING construct
  - One or more fields are supplied as arguments to TRANSFORM()
  - The external script is identified by USING
  - It receives each record, processes it, and returns the result
- Use ADD FILE to distribute the script to nodes in the cluster
Data Input and Output with TRANSFORM
- Your external program will receive one record per line on standard input
  - Each field in the supplied record will be a tab-separated string
  - NULL values are converted to the literal string \N
- You may need to convert values to appropriate types within your program
  - For example, converting to numeric types for calculations
- Your program must return tab-delimited fields on standard output
  - Output fields can optionally be named and cast with an AS clause
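The syntax example for this slide is not reproduced in this text. The sketch below shows the general shape of a TRANSFORM query with a typed AS clause. The script name greeting.pl, the table employees, and the output column greeting are all hypothetical.

```sql
-- Ship the script to the cluster nodes that will run it
ADD FILE greeting.pl;                -- hypothetical script name

-- name and email are sent to the script as one tab-separated line per record;
-- the AS clause names the script's output field and casts it to STRING
SELECT TRANSFORM(name, email)
  USING 'greeting.pl'
  AS (greeting STRING)
FROM employees;                      -- hypothetical table
```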
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#12$
Hive TRANSFORM Example (1)
- Here is a complete example of using TRANSFORM in Hive
  - Our Perl script parses an e-mail address, determines to which country it corresponds, and then returns an appropriate greeting
  - The input data consists of names and e-mail addresses
Hive TRANSFORM Example (2)
- The Perl script for this example is shown below
  - A complete explanation of this script follows on the next few slides
    #!/usr/bin/env perl
    %greetings = ('de' => 'Hallo',
                  'fr' => 'Bonjour',
                  'mx' => 'Hola');
    while (<STDIN>) {
      ($name, $email) = split /\t/;
      ($suffix) = $email =~ /\.([a-z]+)$/;
      $greeting = $greetings{$suffix};
      $greeting = 'Hello' unless defined($greeting);
      print "$greeting $name\n";
    }
Hive TRANSFORM Example (3)
    #!/usr/bin/env perl
  The first line tells the system to use the Perl interpreter when running this script.
    %greetings = ('de' => 'Hallo',
                  'fr' => 'Bonjour',
                  'mx' => 'Hola');
  We define our greetings in the next line using an associative array keyed by the country code we'll extract from the e-mail address.
Hive TRANSFORM Example (4)
    while (<STDIN>) {
      ($name, $email) = split /\t/;
  We read each record from standard input within the loop, and then split it into fields based on tab characters.
Hive TRANSFORM Example (5)
      ($suffix) = $email =~ /\.([a-z]+)$/;
      $greeting = $greetings{$suffix};
      $greeting = 'Hello' unless defined($greeting);
  We extract the country code from the e-mail address (the pattern matches any letters following the final dot). We use that to look up a greeting, but default to Hello if we didn't find one.
Hive TRANSFORM Example (6)
      print "$greeting $name\n";
    }
Hive TRANSFORM Example (7)
- Finally, here's the result of our transformation
    Bonjour Antoine
    Hallo Kai
    Hola Pedro
Overview of User-Defined Functions (UDFs)
- User-Defined Functions (UDFs) are custom functions
  - Invoked with the same syntax as built-in functions
- There are three types of UDFs in Hive
  - Standard UDFs
  - User-Defined Aggregate Functions (UDAFs)
  - User-Defined Table Functions (UDTFs)
Developing Hive UDFs
- Hive UDFs are written in Java
  - Currently no support for writing UDFs in other languages
  - Using TRANSFORM may be an alternative to UDFs
- Open source UDFs are plentiful on the Web
- There are three steps for using a UDF in Hive
  1. Add the function's JAR file to Hive
  2. Register the function itself
  3. Use the function in your query
- Attention Java Developers
  - Cloudera now offers a free e-learning module, "Writing UDFs for Hive"
  - http://tiny.cloudera.com/dac14b
Example: Using a UDF in Hive (1)
- Our example UDF was compiled from sources found in GitHub
  - Popular Web site for many open source software projects
  - Project URL: http://tiny.cloudera.com/dac14e
- We compiled the source and packaged it into a JAR file
  - We have included a copy of it on your VM
- Our example shows the DATE_FORMAT UDF in that JAR file
  - Allows great flexibility in formatting date fields in output
Example: Using a UDF in Hive (2)
- First, register the JAR with Hive
  - Same step as with a custom SerDe
- Next, register the function and assign an alias
  - The quoted value is the fully-qualified Java class for the UDF
    hive> CREATE TEMPORARY FUNCTION DATE_FORMAT
          AS 'com.nexr.platform.hive.udf.UDFDateFormat';
Example: Using a UDF in Hive (3)
- You may then use the function in your query
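The query for this slide is not reproduced in this text. The sketch below strings the three steps together. The JAR path, the orders columns, and in particular the argument list of DATE_FORMAT (a date value followed by a Java-style format pattern) are assumptions; check the UDF's own documentation before relying on them.

```sql
-- Step 1: add the JAR containing the UDF (hypothetical path)
ADD JAR /home/training/hive-udfs.jar;

-- Step 2: register the function under an alias
CREATE TEMPORARY FUNCTION DATE_FORMAT
  AS 'com.nexr.platform.hive.udf.UDFDateFormat';

-- Step 3: call it like a built-in function
-- (assumed signature: DATE_FORMAT(date_value, pattern))
SELECT order_id,
       DATE_FORMAT(order_date, 'dd-MMM-yyyy') AS formatted_date
FROM orders
LIMIT 10;
```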
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#25$
Hive Variables (1)
- Hive supports variable substitution
  - Swaps a placeholder with a variable's literal value at run time
  - Variable names are case-sensitive
- Within the Hive shell, use SET to give a named variable a value
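The shell example for this slide is not reproduced in this text. One way to set and use a variable is through Hive's hivevar namespace, as sketched below; the variable name and the query are illustrative.

```sql
-- Define a variable named state in the hivevar namespace
SET hivevar:state=CA;

-- The placeholder is replaced with the literal text CA before the query runs
SELECT COUNT(*)
FROM customers
WHERE state = '${hivevar:state}';
```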
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#27$
Hive Variables (2)
- You can set variables when you invoke Hive from the command line
  - Eases repetitive queries by reducing need to modify HiveQL
- For example, imagine that the file state.hql contains a query that filters on a state variable
- This makes creating per-state reports easy
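The contents of state.hql are not reproduced in this text. A sketch of what such a file could contain follows, with the command-line invocation shown in comments. The query itself is an assumption.

```sql
-- state.hql (illustrative contents)
SELECT COUNT(*) AS num_customers
FROM customers
WHERE state = '${hivevar:state}';

-- Run from the operating system shell, once per state, for example:
--   hive --hivevar state=CA -f state.hql
--   hive --hivevar state=NY -f state.hql
```

Because the state is supplied at invocation time, the HiveQL file itself never needs to change between reports.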
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 14#28$
Hands-On Exercise: Data Transformation with Hive
- In this Hands-On Exercise, you will transform data with Hive
- Please refer to the Hands-On Exercise Manual for instructions
Essential Points
- SerDes govern how Hive reads and writes a table's records
  - Specified (or defaulted) when creating a table
- TRANSFORM processes records using an external program
  - This can be written in nearly any language
- UDFs are User-Defined Functions
  - Custom logic that can be invoked just like built-in functions
- Hive substitutes variable placeholders with literal values you assign
  - This is done when you execute the query
  - Especially helpful with repetitive queries
Introduction to Impala
Chapter 15
Introduction to Impala
In this chapter, you will learn
- What Impala is and how it compares to Hive, Pig, and RDBMSs
- How Impala executes queries
- Where Impala fits into the data center
- What notable differences exist between Impala and Hive
- How to run queries from the shell or browser
Chapter Topics
Introduction to Impala
- What is Impala?
- How Impala Differs from Hive and Pig
- How Impala Differs from Relational Databases
- Limitations and Future Directions
- Using the Impala Shell
- Conclusion
What is Impala?
- High-performance SQL engine for vast amounts of data
  - Massively-parallel processing (MPP)
  - Inspired by Google's Dremel project
  - Query latency measured in milliseconds
- Impala runs on Hadoop clusters
  - Can query data stored in HDFS or HBase tables
  - Reads and writes data in common Hadoop file formats
- Developed by Cloudera
  - 100% open source, released under the Apache software license
Interacting with Impala
- Impala supports a subset of SQL-92
  - Plus a few extensions found in MySQL and Oracle SQL dialects
  - Almost identical to HiveQL
- Impala offers many interfaces for running queries
  - Command-line shell
  - Hue Web application
  - ODBC / JDBC
Why Use Impala?
- Many benefits are the same as with Hive or Pig
  - More productive than writing MapReduce code
  - No software development experience required
  - Leverage existing knowledge of SQL
- One benefit exclusive to Impala is speed
  - Highly-optimized for queries
  - Almost always at least five times faster than either Hive or Pig
  - Often 20 times faster or more
Use Case: Business Intelligence
- Many leading business intelligence tools support Impala
[Figure: Dualcore Inc. dashboard in a Web browser, showing suppliers by region (Japan: 31 suppliers)]
Where Impala Fits Into the Data Center
[Figure: Hadoop cluster with Impala]
Where to Get Impala
- Download Impala from http://www.cloudera.com/
- We strongly recommend running Impala on CDH 4.2 or higher
  - Requires a 64-bit Linux platform
- Installation and configuration are outside the scope of this course
  - Your virtual machine includes a working installation of Impala
Comparing Impala to Hive and Pig
- Let's first look at similarities between Hive, Pig, and Impala
  - Queries expressed in high-level languages
  - Alternatives to writing MapReduce code
  - Used to analyze data stored on Hadoop clusters
- Impala shares the metastore with Hive
  - Tables created in Hive are visible in Impala (and vice versa)
Contrasting Impala to Hive and Pig (1)
- Hive and Pig answer queries by running MapReduce jobs
  - MapReduce is a general-purpose computation framework
  - Not optimized for executing interactive SQL queries
- MapReduce overhead results in high latency
  - Even a trivial query takes 10 seconds or more
- Impala does not use MapReduce
  - Uses a custom execution engine built specifically for Impala
  - Queries can complete in a fraction of a second
Contrasting Impala to Hive and Pig (2)
- Hive, Pig, and Impala all support
  - Executing queries via interactive shell or command line
  - Grouping, joining, and filtering data
  - Reading and writing data in multiple formats
- Impala currently lacks some Hive and Pig features
  - More details later in this chapter
- Hive and Pig are best suited to long-running batch processes
  - Particularly data transformation tasks
- Impala is best for interactive/ad hoc queries
How Impala Executes a Query
- Each slave node in the cluster runs an Impala daemon
  - Co-located with the HDFS Data Node
  - Client issues query to an Impala daemon
- Impala daemon plans the query
  - Checks the local metastore cache
  - Distributes the query across other Impala daemons in the cluster
  - Daemons read data from HDFS or HBase
  - Streams results to client
- Two other daemons running on master nodes support query execution
  - The State Store daemon
    - Provides lookup service for Impala daemons
    - Periodically checks status of Impala daemons
  - The Catalog daemon relays metadata changes to all the Impala daemons in a cluster
Comparing Impala to a Relational Database

                                Relational Database   Impala
    Query language              SQL                   SQL-92 subset
    Update individual records   Yes                   No
    Delete individual records   Yes                   No
    Transactions                Yes                   No
    Indexing                    Yes                   No
    Latency                     Very low              Low
    Data size                   Terabytes             Petabytes
    ODBC / JDBC support         Yes                   Yes
Hive Features Currently Unsupported in Impala
- Impala does not currently support some features in Hive
  - Many of these are being considered for future releases
- Complex data types (ARRAY, MAP, or STRUCT)
- BINARY data type
- External transformations
- Custom SerDes
- Indexing
- Bucketing and table sampling
Other Notable Differences Between Impala and Hive
- Only one DISTINCT clause allowed per query in Impala
  - Typical workaround is to use subselect and UNION
- Impala requires that queries with ORDER BY also specify a LIMIT
  - This sets an outer bound on the result set
  - The size of the LIMIT can be arbitrarily large
- Impala and Hive handle out-of-range values differently
  - Hive returns NULL
  - Impala returns the maximum value for that type
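The first two differences can be illustrated with a short sketch. It applies to the Impala version this course describes; the customers table and its columns are borrowed from other examples in this course, and the result labels are illustrative.

```sql
-- ORDER BY must be paired with LIMIT; the limit may be generously large
SELECT cust_id, fname, lname
FROM customers
ORDER BY lname
LIMIT 1000000;

-- Only one DISTINCT per query: compute each distinct count in its own
-- SELECT and combine the results with UNION ALL
SELECT 'states' AS what, COUNT(DISTINCT state) AS n FROM customers
UNION ALL
SELECT 'zipcodes', COUNT(DISTINCT zipcode) FROM customers;
```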
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#20$
Query Fault Tolerance in Impala
- Queries in both Hive and Impala are distributed across nodes
- Hive answers queries by running MapReduce jobs
  - Takes advantage of Hadoop's fault tolerance
  - If a node fails during a query, MapReduce runs the task elsewhere
- Impala has its own execution engine
  - Currently lacks fault tolerance
  - If a node fails during a query, the query will fail
  - Workaround: just re-run the query
Starting the Impala Shell
- You can execute statements in Impala's shell
  - This interactive tool is similar to the shell in MySQL or Hive
- Execute the impala-shell command to start the shell
  - Some log messages truncated to better fit the slide
    $ impala-shell
    Connected to localhost.localdomain:21000
    Server version: impalad version 1.0.1
    Welcome to the Impala shell.
    [localhost.localdomain:21000] >
- Use the -i hostname:port option to connect to another server
    $ impala-shell -i myserver.example.com:21000
    [myserver.example.com:21000] >
Using the Impala Shell
- Enter semicolon-terminated statements at the prompt
  - Hit enter to execute the query
  - Impala pretty-prints the output
  - Use the quit command to exit the Impala shell
    $ impala-shell
    > SELECT cust_id, fname, lname FROM customers
      WHERE zipcode='20525';
    +---------+--------+-----------+
    | cust_id | fname  | lname     |
    +---------+--------+-----------+
    | 1133567 | Steven | Robertson |
    | 1171826 | Robert | Gillis    |
    +---------+--------+-----------+
    > quit;
    $
  Note: shell prompt abbreviated as >
Running Queries from the Command Line
- You can execute a file containing queries using the -f option

    $ impala-shell -f myquery.hql

- Run queries directly from the command line with the -q option
- Use -o (and optionally specify a delimiter) to capture output to a file

    $ impala-shell -f myquery.hql \
        -o results.txt \
        --output_file_field_delim='\t'
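
When impala-shell is driven from a script, it helps to assemble the argument list in one place. The sketch below uses only the options named on these slides (-i, -f, -q, -o, and --output_file_field_delim); the function name `build_impala_shell_command` is hypothetical, and the delimiter option is taken from the slide as-is, so it may differ in other Impala versions.

```python
import shlex


def build_impala_shell_command(query_file=None, query=None, host_port=None,
                               output_file=None, field_delim=None):
    """Build an impala-shell argument list from the options shown on the slides."""
    if (query_file is None) == (query is None):
        raise ValueError("provide exactly one of query_file or query")
    command = ["impala-shell"]
    if host_port is not None:
        command += ["-i", host_port]      # connect to another server
    if query_file is not None:
        command += ["-f", query_file]     # run the statements in a file
    else:
        command += ["-q", query]          # run a single query string
    if output_file is not None:
        command += ["-o", output_file]    # capture output to a file
        if field_delim is not None:
            # Option name as shown on the slide; treat it as version-specific.
            command.append(f"--output_file_field_delim={field_delim}")
    return command


if __name__ == "__main__":
    command = build_impala_shell_command(
        query_file="myquery.hql", output_file="results.txt", field_delim="\t")
    # Print a shell-quoted form; pass the list to subprocess.run() to execute it.
    print(" ".join(shlex.quote(part) for part in command))
```

Passing the list (rather than a single string) to `subprocess.run` avoids shell quoting problems with the tab delimiter.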
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 15#25$
Interacting with the Operating System
- Use shell to execute system commands from within the Impala shell
- No direct support for HDFS commands
  - But you could run hadoop fs using shell

Accessing Impala with Hue (1)
- Alternatively, you can access Impala through Hue
- To use Hue, browse to http://hue_server:8888/
  - You may need to start the Hue service first (sudo service hue start)
- Launch Hue's Impala interface by clicking its icon
[Figure: Impala icon in Hue]

Accessing Impala with Hue (2)
- Hue allows you to run Impala queries from your Web browser
[Screenshot: Hue Impala Query page at https://hueserver.example.com:8888/impala/]

Accessing Impala with Hue (3)
- Hue displays the results in a sortable table
[Screenshot: Hue Impala Query results at https://hueserver.example.com:8888/impala/]

Chapter Topics
Introduction to Impala
- Current topic: Conclusion

Essential Points
- Impala is a high-performance SQL engine
  - Runs on Hadoop clusters
  - Reads and writes data in HDFS or HBase tables
- Queries are expressed in a SQL dialect similar to HiveQL
- The primary difference compared to Hive/Pig is speed
  - Impala avoids MapReduce latency and overhead
- Impala is best suited to ad hoc/interactive queries
  - Hive and Pig are better for long-running batch processes
  - Impala does not currently support all features of Hive

Bibliography (1)
The following offer more information on topics discussed in this chapter
- Free O'Reilly Cloudera Impala book
  - http://tiny.cloudera.com/dac15f
- Cloudera Impala: Real-Time Queries in Apache Hadoop
  - http://tiny.cloudera.com/dac15a
- Wired Article on Impala
  - http://tiny.cloudera.com/dac15b
- Cloudera Blog Detailing Impala Features and Performance
  - http://tiny.cloudera.com/dac15c
- Impala Documentation at Cloudera Web site
  - http://tiny.cloudera.com/dac15e

Bibliography (2)
- 37Signals Blog Comparing Performance of Impala, Hive, and MySQL
  - http://tiny.cloudera.com/dac15d

Analyzing Data with Impala
Chapter 16

Analyzing Data with Impala
In this chapter, you will learn
- How Impala's query syntax compares to HiveQL
- How to create databases and tables in Impala
- How and when to refresh the metadata cache
- Which data types Impala supports
- How to add support for a User-Defined Function (UDF)
- How to structure your query for better performance
- What other factors influence Impala performance

Chapter Topics
Analyzing Data with Impala
- Basic Syntax (current topic)
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- User-Defined Functions
- Improving Impala Performance
- Hands-On Exercise: Interactive Analysis with Impala
- Conclusion

Overview of Impala Query Syntax
- Impala's query language is a subset of SQL-92
  - Plus a few extensions from Oracle and MySQL dialects
- Syntax is almost identical to HiveQL
  - Differences are mainly related to features unsupported in Impala
  - Most Hive queries can be executed verbatim in Impala
- Impala may also support statements that Hive does not
  - Such as the ability to insert individual rows *
* Not intended for bulk loads

Case-Sensitivity and Comments
- Keywords are not case-sensitive, but are often capitalized by convention
- Supports single- and multi-line comments
  - These are allowed in scripts, shell, and command line

    $ cat find_customers.sql

Databases and Tables in Impala
- Every Impala table belongs to a database
  - Impala and Hive share a metastore
  - The same databases and tables are visible in both Hive and Impala
- The default database is selected at startup
  - The USE command switches to another database
  - List tables in a database with the SHOW TABLES command

Creating Databases and Tables in Impala
- Data definition is generally identical to Hive
  - Custom SerDes and bucketing are unsupported

Displaying Table Structure
- Use DESCRIBE to display a table's structure
  - DESCRIBE EXTENDED is unsupported

Metadata Caching in Impala
- Impala shares the metastore with Hive
  - Tables created in Hive are visible in Impala (and vice versa)
- Impala caches metadata
  - The tables' schema definitions
  - The locations of the tables' HDFS blocks
- Metadata updates made from within Impala are broadcast throughout the cluster
  - The Impala metadata cache is updated automatically throughout the Impala cluster
  - No additional actions are required
- Metadata updates made from outside of Impala are not known to Impala
  - For example, changes made in Hive, or changes to the tables in HDFS
  - Additional actions are required to update the metadata cache

Updating the Impala Metadata Cache
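
The two statements this chapter names for updating the cache are INVALIDATE METADATA and REFRESH. The sketch below shows one way a script might choose between them after an external change. The mapping follows the general guidance in the Impala documentation and should be confirmed for the Impala version in use; the function name and the `change` labels are invented for this illustration.

```python
def metadata_statement(change, table=None):
    """Return the Impala statement to issue after a change made outside Impala.

    change is a label invented for this sketch:
      "new_table" - a table was created outside Impala (for example, in Hive)
      "new_data"  - data files were added to an existing table outside Impala
    """
    if change == "new_table":
        # Assumption: a table Impala has never seen needs a full invalidation.
        return "INVALIDATE METADATA;"
    if change == "new_data":
        if not table:
            raise ValueError("REFRESH needs a table name")
        # Assumption: an already-known table only needs its own entry refreshed.
        return f"REFRESH {table};"
    raise ValueError(f"unknown change type: {change}")


if __name__ == "__main__":
    print(metadata_statement("new_table"))
    print(metadata_statement("new_data", table="orders"))
```

The returned statement would be run in the Impala shell or passed to impala-shell with the -q option.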
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#11$
Selecting Data in Impala
- Use SELECT to retrieve data from tables
  - Results are formatted for display
  - Impala does not require a FROM clause

Using Impala Built-in Functions
- Invoke built-in functions as you would in SQL or HiveQL

    +-------------+
    | fullname    |
    +-------------+
    | Froman, Abe |
    +-------------+

- Impala supports many of the same built-in functions as Hive
  - Lacks some others, including many formatting and text processing functions

Chapter Topics
Analyzing Data with Impala
- Current topic: Data Types

Data Types in Impala
- Each column in a table is associated with a data type
  - Impala supports most types available in Hive
  - Most are similar to those found in relational databases

Impala's Integer Types
- Integer types are appropriate for whole numbers
  - Both positive and negative values allowed

Impala's Decimal Types
- Decimal types are appropriate for floating point numbers
  - Both positive and negative values allowed
  - Caution: avoid using when exact values are required!

Other Types in Impala
- Impala can also store a few other types of information
  - Only one character type (variable length)

Data Type Conversion
- Hive auto-converts a STRING column used in numeric context

Chapter Topics
Analyzing Data with Impala
- Current topic: Filtering, Sorting, and Limiting Results

Limiting and Sorting Query Results
- The LIMIT clause sets the maximum number of rows returned
- Caution: without ORDER BY, there is no guarantee regarding which rows are returned (for example, which 10 rows with LIMIT 10)
  - Use ORDER BY for top-N queries
  - The field(s) you ORDER BY must be selected
- When using ORDER BY, the LIMIT clause is mandatory in Impala
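
A top-N query combines ORDER BY with LIMIT. The sketch below runs such a query against an in-memory SQLite database, which stands in for Impala here purely to show the query shape; the `products` table and its rows are made up for the example.

```python
import sqlite3


def make_products_db():
    """Create an in-memory database with a small, made-up products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (name TEXT, price INTEGER)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [("cable", 5), ("laptop", 900), ("mouse", 20), ("monitor", 250)],
    )
    return conn


def top_n_by_price(conn, n):
    """Return the n most expensive products, highest price first."""
    # The ORDER BY column (price) is also in the SELECT list, and the query
    # carries a LIMIT, matching the two rules stated for Impala above.
    return conn.execute(
        "SELECT name, price FROM products ORDER BY price DESC LIMIT ?", (n,)
    ).fetchall()


if __name__ == "__main__":
    print(top_n_by_price(make_products_db(), 2))
    # prints [('laptop', 900), ('monitor', 250)]
```

Without the ORDER BY, the same LIMIT would return two arbitrary rows rather than the two most expensive ones.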
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#21$
Chapter Topics
Analyzing Data with Impala
- Current topic: Joining and Grouping Data

Joins in Impala
- Like Hive, Impala can join multiple data sets
- Impala supports the same types of joins that Hive does
  - Inner joins
  - Outer joins (left, right, and full)
  - Left semi joins
  - Cross joins
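
The join types listed above differ in which rows they keep. The sketch below contrasts them on two tiny, made-up tables using an in-memory SQLite database as a stand-in for Impala. SQLite has no LEFT SEMI JOIN keyword, so the semi join is expressed with EXISTS, which returns each left-hand row at most once, as a left semi join does.

```python
import sqlite3

# Queries keyed by join type; table and column names are illustrative only.
QUERIES = {
    "inner": """SELECT c.name, o.order_id FROM customers c
                JOIN orders o ON (c.cust_id = o.cust_id)
                ORDER BY c.name, o.order_id""",
    "left_outer": """SELECT c.name, o.order_id FROM customers c
                     LEFT OUTER JOIN orders o ON (c.cust_id = o.cust_id)
                     ORDER BY c.name, o.order_id""",
    "semi": """SELECT c.name FROM customers c
               WHERE EXISTS (SELECT 1 FROM orders o
                             WHERE o.cust_id = c.cust_id)
               ORDER BY c.name""",
    "cross": "SELECT COUNT(*) FROM customers CROSS JOIN orders",
}


def make_db():
    """Three customers; Bob has no orders, Alice has two."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (cust_id INTEGER, name TEXT);
        CREATE TABLE orders (order_id INTEGER, cust_id INTEGER);
        INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
        INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
    """)
    return conn


def run_join(kind):
    """Run one of the example joins and return its rows."""
    conn = make_db()
    try:
        return conn.execute(QUERIES[kind]).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for kind in QUERIES:
        print(kind, run_join(kind))
```

The inner join drops Bob, the left outer join keeps him with a NULL order, the semi join lists Alice once despite her two orders, and the cross join pairs every customer with every order.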
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#23$
Record Grouping and Aggregate Functions
- GROUP BY groups selected data by one or more columns
  - Caution: columns not part of the aggregation must be listed in GROUP BY

stores table

    id  city     state  region
    a   Albany   NY     EAST
    b   Boston   MA     EAST
    c   Chicago  IL     NORTH
    d   Detroit  MI     NORTH
    e   Elgin    IL     NORTH

    SELECT region, state,
           COUNT(id) AS num
    FROM stores
    GROUP BY region, state;

Result of query

    region  state  num
    EAST    MA     1
    EAST    NY     1
    NORTH   IL     2
    NORTH   MI     1
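
The stores example can be reproduced end to end. The sketch below loads the same five rows into an in-memory SQLite database (a stand-in for Impala) and runs the slide's query. An ORDER BY is added, which the slide's query does not have, so that the row order is deterministic.

```python
import sqlite3

# The five rows of the stores table from the slide.
STORES = [
    ("a", "Albany", "NY", "EAST"),
    ("b", "Boston", "MA", "EAST"),
    ("c", "Chicago", "IL", "NORTH"),
    ("d", "Detroit", "MI", "NORTH"),
    ("e", "Elgin", "IL", "NORTH"),
]


def count_stores_by_region_and_state():
    """Group the stores by region and state and count the stores in each group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stores (id TEXT, city TEXT, state TEXT, region TEXT)")
    conn.executemany("INSERT INTO stores VALUES (?, ?, ?, ?)", STORES)
    # Both non-aggregated columns (region, state) appear in GROUP BY.
    rows = conn.execute(
        """SELECT region, state, COUNT(id) AS num
           FROM stores
           GROUP BY region, state
           ORDER BY region, state"""
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in count_stores_by_region_and_state():
        print(row)
```

The two Illinois stores (Chicago and Elgin) collapse into a single NORTH/IL group with a count of 2, matching the result table on the slide.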
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#24$
Chapter Topics
Analyzing Data with Impala
- Current topic: User-Defined Functions

Overview of Impala User-Defined Functions (UDFs)
- Like Hive, Impala supports User-Defined Functions (UDFs)
- Hive UDFs can be used in Impala with no changes
  - With a few exceptions
- There are two types of UDFs in Impala
  - Standard UDFs
  - User-Defined Aggregate Functions (UDAFs)
- Impala UDFs can be written in Java or C++
  - C++ UDFs are implemented as shared objects
- Impala C++ UDFs cannot be used in Hive

Using a Java UDF in Impala (1)
- Register the function with Impala
  - After the function name, specify data types that correspond to the method signature of the UDF class's evaluate method
  - In the RETURNS clause, specify the data type that corresponds to the return type of the UDF class's evaluate method
  - Identify the JAR file containing the UDF class in the LOCATION clause
  - Specify the UDF class name in the SYMBOL clause
  - You do not need to run a separate ADD JAR step

Using a Java UDF in Impala (2)
- You may then use the function in Impala queries

Using a C++ UDF in Impala
- Register the function with Impala

    CREATE FUNCTION COUNT_VOWELS(STRING)
    RETURNS INT
    LOCATION '/user/hive/udfs/sampleudfs.so'
    SYMBOL='CountVowels';

- You may then use the function in your query

    SELECT COUNT_VOWELS(email_address) FROM employees;
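
The register-then-call pattern can be tried locally without a cluster. The sketch below is an analogy only: it uses Python's sqlite3 module, whose `create_function` call plays the role that CREATE FUNCTION plays in Impala. The vowel-counting logic is a guess at what a function named COUNT_VOWELS might do, since the slide does not show the UDF's implementation.

```python
import sqlite3


def count_vowels(text):
    """Return the number of vowels in text, or None for a NULL input."""
    if text is None:
        return None
    return sum(1 for ch in text.lower() if ch in "aeiou")


def vowel_counts(addresses):
    """Register COUNT_VOWELS with SQLite and apply it to each address."""
    conn = sqlite3.connect(":memory:")
    # Registration step: SQL name, number of arguments, implementation.
    conn.create_function("COUNT_VOWELS", 1, count_vowels)
    conn.execute("CREATE TABLE employees (email_address TEXT)")
    conn.executemany(
        "INSERT INTO employees VALUES (?)", [(a,) for a in addresses])
    rows = conn.execute(
        "SELECT email_address, COUNT_VOWELS(email_address) "
        "FROM employees ORDER BY rowid"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(vowel_counts(["abe@example.com"]))
    # prints [('abe@example.com', 6)]
```

As in Impala, the query itself only refers to the registered name; the implementation lives outside the SQL text.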
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#29$
Chapter Topics
Analyzing Data with Impala
- Current topic: Improving Impala Performance

Impala Performance Overview
- Query performance is affected by three broad categories
  - Computing statistics on tables before running joins
  - The format and type of data being queried
  - The hardware and configuration of your cluster

Join Performance Optimization
- Impala uses statistics about tables to optimize joins
- You should compute statistics for tables with COMPUTE STATS
  - After you load a table initially
  - When the amount of data in a table changes substantially

    COMPUTE STATS orders;
    COMPUTE STATS order_details;
    SELECT COUNT(o.order_id) FROM orders o
    JOIN order_details d ON (o.order_id = d.order_id)
    WHERE YEAR(o.order_date) = 2008;

Data Formats Supported by Impala
- Impala can query data in several formats
  - Table below summarizes compatibility
  - Create/load using Hive if those operations are unsupported in Impala

Data Formats and Optimization
- The limiting factor in most queries is I/O
  - Disk speed is the most common bottleneck
  - Columnar formats reduce I/O when few columns are selected
- If performance is a key concern, choose Parquet
  - This may limit compatibility with other tools
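
A rough calculation shows why a columnar format reduces I/O when few columns are selected. The sketch below is a simplified model with made-up column widths: it ignores compression, encoding, and file metadata, and assumes a row-oriented file must read every column of every row.

```python
def bytes_scanned(row_count, column_widths, selected, columnar):
    """Estimate the bytes read from disk for a full-table scan.

    column_widths maps column name -> average bytes per value (assumed
    figures). A row-oriented file reads every column of every row; a
    columnar file reads only the selected columns.
    """
    if columnar:
        per_row = sum(column_widths[name] for name in selected)
    else:
        per_row = sum(column_widths.values())
    return row_count * per_row


if __name__ == "__main__":
    widths = {"id": 8, "name": 32, "notes": 200}  # hypothetical table
    rows = 1000
    print(bytes_scanned(rows, widths, ["id"], columnar=False))  # 240000
    print(bytes_scanned(rows, widths, ["id"], columnar=True))   # 8000
```

In this model, selecting one narrow column out of three reads 30 times less data from a columnar file; selecting every column would remove the difference.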
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#34$
Impala Query Size
- Query size is based on a query's working set size
  - The working set of a query contains all records after
    - Filtering rows
    - Pruning unused columns
    - Performing aggregation, if applicable
- For aggregations, the query size is the working set size for all the tables in the query
- For joins, the query size is the working set size for all the tables in the join, excluding the largest table
- Impala queries must fit into the cluster's aggregate memory
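
The sizing rules above amount to simple arithmetic. The sketch below encodes them directly; the function names are hypothetical, sizes are in gigabytes, and the check treats all physical memory as available to Impala, which is an optimistic simplification.

```python
def aggregation_query_size(working_set_sizes):
    """Query size for an aggregation: the working sets of all tables."""
    return sum(working_set_sizes)


def join_query_size(working_set_sizes):
    """Query size for a join: all working sets except the largest table's."""
    if not working_set_sizes:
        return 0
    return sum(working_set_sizes) - max(working_set_sizes)


def fits_in_cluster(query_size, node_count, memory_per_node):
    """True if the query size is within the cluster's aggregate memory."""
    return query_size <= node_count * memory_per_node


if __name__ == "__main__":
    # Hypothetical working-set sizes (GB) for three tables.
    sizes = [500, 40, 10]
    print(join_query_size(sizes))                          # 50
    print(fits_in_cluster(join_query_size(sizes), 4, 32))  # True (50 <= 128)
    print(fits_in_cluster(aggregation_query_size(sizes), 4, 32))  # False
```

Because the largest table is excluded for joins, the same three tables fit on a four-node cluster when joined but not when the query size is the sum of all working sets.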
"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent." 16#35$
Cluster Hardware
- CPU: Impala benefits from newer processors
  - Takes advantage of additional optimization when available
- Memory: 32GB minimum, 64GB is better
- Disks: more is better
  - Impala tries to maximize throughput across disks
  - Servers with 12 or more disks are ideal

Chapter Topics
Analyzing Data with Impala
- Current topic: Hands-On Exercise: Interactive Analysis with Impala

Hands-On Exercise: Interactive Analysis with Impala
- In this Hands-On Exercise, you will run ad hoc queries with Impala
- Please refer to the Hands-On Exercise Manual for instructions

Chapter Topics
Analyzing Data with Impala
- Current topic: Conclusion

Essential Points
- Impala's query syntax is nearly identical to HiveQL
  - Most Hive queries can be executed verbatim in Impala
- Impala caches metadata from the metastore it shares with Hive
  - Use INVALIDATE METADATA or REFRESH to update the cache following external changes
- Impala supports most simple data types from Hive
- Query structure and file format can affect performance
- Your cluster's hardware also affects performance

Bibliography
The following offer more information on topics discussed in this chapter
- Introducing Parquet: Efficient Columnar Storage for Apache Hadoop
  - http://tiny.cloudera.com/dac16a
- Impala Language Reference
  - http://tiny.cloudera.com/dac16b

Choosing the Best Tool for the Job
Chapter 17

Choosing the Best Tool for the Job
In this chapter, you will learn
- How MapReduce, Pig, Hive, Impala, and RDBMSs compare to one another
- Why a workflow might involve several different tools
- How to select the best tool for a given job

Chapter Topics
Choosing the Best Tool for the Job
- Comparing MapReduce, Pig, Hive, Impala, and Relational Databases (current topic)
- Which to Choose?
- Conclusion

Recap of Data Analysis/Processing Tools
- MapReduce
  - Low-level processing and analysis
- Pig
  - Procedural data flow language executed using MapReduce
- Hive
  - SQL-based queries executed using MapReduce
- Impala
  - High-performance SQL-based queries using a custom execution engine

Comparing Pig, Hive, and Impala

Do These Replace an RDBMS?
- Probably not, if the RDBMS is used for its intended purpose
- Relational databases are optimized for
  - Relatively small amounts of data
  - Immediate results
  - In-place modification of data (UPDATE and DELETE)
- Pig, Hive, and Impala are optimized for
  - Large amounts of read-only data
  - Extensive scalability at low cost
- Pig and Hive are better suited for batch processing
  - Impala and RDBMSs are better for interactive use

Comparing RDBMS to Hive and Impala

Recap: Apache Sqoop
- Sqoop helps you integrate Hadoop tools with relational databases
- It exchanges data between RDBMS and Hadoop
  - Can import all tables, a single table, or a portion of a table into HDFS
  - Supports incremental imports
  - Can also export data from HDFS back to the database

Chapter Topics
Choosing the Best Tool for the Job
- Current topic: Which to Choose?

Which to Choose?
- Choose the best one for a given task
  - Mix and match them as needed
- MapReduce
  - Low-level approach offers great flexibility
  - More time-consuming and error-prone to write
  - Best when control matters more than productivity
- Pig, Hive, and Impala offer more productivity
  - Faster to write, test, and deploy than MapReduce
  - Better choice for most analysis and processing tasks

Analysis Workflow Example
[Diagram: workflow on a Hadoop Cluster with the steps "Import Transaction Data from RDBMS", "Sessionize Web Log Data with Pig", "Sentiment Analysis on Social Media with Hive", and a final step ending "with Impala" whose label is truncated in the source]

Chapter Topics
Choosing the Best Tool for the Job
- Current topic: Conclusion

Essential Points
- You have learned about several tools for data analysis
  - Each is better at some tasks than others
  - Choose the best one for a given job
  - Workflows may involve exchanging data between them
- Selection criteria include scale, speed, control, and productivity
  - MapReduce offers control at the cost of productivity
  - Pig and Hive offer productivity but not necessarily speed
  - Relational databases offer speed but not scalability
  - Impala offers scalability and speed but less control

Conclusion
Chapter 18

Course Objectives (1)
During this course, you have learned
- The purpose of Hadoop and its related tools
- The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
- How to identify typical use cases for large-scale data analysis
- How to load data from relational databases and other sources
- How to manage data in HDFS and export it for use with other systems
- How Pig, Hive, and Impala improve productivity for typical analysis tasks
- The language syntax and data formats supported by these tools

Course Objectives (2)
- How to design and execute queries on data stored in HDFS
- How to join diverse datasets to gain valuable business insight
- How to analyze structured, semi-structured, and unstructured data
- How Hive and Pig can be extended with custom functions and scripts
- How to store and query data for better performance
- How to determine which tool is the best choice for a given task

Which Course to Take Next?
Cloudera offers a range of training courses for you and your team
- For developers
  - Cloudera Developer Training for Apache Hadoop
  - Cloudera Training for Apache HBase
- For system administrators
  - Cloudera Administrator Training for Apache Hadoop
- For data scientists
  - Introduction to Data Science: Building Recommender Systems
- For architects, managers, CIOs, and CTOs
  - Cloudera Essentials for Apache Hadoop