
# Cloudera Data Analyst Training (201403)

Introduction
Chapter 1

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Chapter Topics

Introduction

- Course Logistics
- Introductions

Course Objectives (1)

During this course, you will learn
- The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
- How to identify typical use cases for large-scale data analysis
- How to manage data in HDFS and export it for use with other systems
- The language syntax and data formats supported by these tools

Course Objectives (2)

- How to design and execute queries on data stored in HDFS
- How to analyze structured, semi-structured, and unstructured data
- How Hive and Pig can be extended with custom functions and scripts
- How to store and query data for better performance

Chapter Topics

Introduction

- Course Logistics
- Introductions

- Partners include companies such as Oracle
- Authors on staff: Tom White, Eric Sammer, Lars George, and others
- Customers include Macys.com, the National Cancer Institute, Orbitz, and the Social Security Administration
- Cloudera public training:
  - Cloudera Training for Apache HBase
  - Introduction to Data Science: Building Recommender Systems
- Onsite and custom training is also available

CDH

- 100% open source
- The most complete, tested, and widely deployed distribution
- Integrates all the key related projects
- Available as RPMs and Ubuntu/Debian/SuSE packages, or as a tarball

Cloudera Express

- Cloudera Express is the best way to get started
- Includes CDH
- Includes Cloudera Manager
  - End-to-end administration: deploy, manage, and monitor your cluster

Cloudera Enterprise

- Cloudera Enterprise
  - Subscription product including CDH and Cloudera Manager
- Includes support
- Includes extra Cloudera Manager features
  - Configuration history and rollbacks
  - LDAP integration
  - SNMP support
  - Automated disaster recovery
  - Etc.

Chapter Topics

Introduction

- Course Logistics
- Introductions

Logistics

- Class start and finish times
- Lunch
- Breaks
- Restrooms
- Wi-Fi access
- Virtual machines
- Can I come in early/stay late?

Your instructor will give you details on how to access the course materials and exercise instructions for the class

Chapter Topics

Introduction

- Course Logistics
- Introductions

Introductions

- Where do you work and what do you do there?
- Which database(s) and platform(s) do you use?
- Any experience as a developer?
- What programming languages do you use?
- What are your expectations for this course?

Chapter 2

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Extending Pig
- Introduction to Hive
- Hive Data Management
- Text Processing With Hive
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

In this chapter, you will learn
- Which factors led to the era of Big Data
- How Hadoop offers reliable storage for massive amounts of data with HDFS
- How Hadoop supports large-scale data processing through MapReduce

- At the end of this chapter you will work on the first Hands-On Exercise
- The exercises are performed in a Virtual Machine (VM)
- The first time the VM is launched, it takes several minutes to boot
  - It is configuring the class environment
  - Subsequent boots are much faster
- You may wish to start your VM now so that it is ready by the time we get to the first Hands-On Exercise
  - Your instructor will tell you how to do this

Chapter Topics

- HDFS
- MapReduce
- Conclusion

Velocity

- We are generating data faster than ever
  - Processes are increasingly automated
  - Systems are increasingly interconnected
  - People are increasingly interacting online

Variety

- We are producing a wide variety of data
  - Social network connections
  - Electronic medical records
  - Images, audio, and video
  - RFID and wireless sensor network events
  - And much more
- Not all of this maps cleanly to the relational model

Volume

- Every day
  - Exchange
- Every minute
  - Foursquare handles more than 2,000 check-ins
- And every second
  - Banks process more than 10,000 credit card transactions

Data Has Value

- This data has many valuable applications
  - Predicting demand
  - Marketing analysis
  - Fraud detection
  - And many, many more
- We must process it to extract that value
  - And processing all the data can yield more accurate results

We Need a System that Scales

- How can we reliably store large amounts of data at a reasonable cost?
- How can we analyze all the data we have stored?

Chapter Topics

- HDFS
- MapReduce
- Conclusion

- Scalable and economical data storage and processing
  - Distributed and fault-tolerant
  - Harnesses the power of industry-standard hardware
- Storage: HDFS
- Processing: MapReduce
- Plus the infrastructure needed to make them work, including
  - Job scheduling and monitoring

Scalability

- Individual servers within a cluster are called nodes
  - Typically standard rackmount servers running Linux
  - Each node both stores and processes data
- A cluster may contain up to several thousand nodes
  - You can scale out incrementally as required

Fault Tolerance

- Data in HDFS is replicated across multiple nodes
- If a node fails, its data is re-replicated using one of the other copies
- Routine failures are handled automatically without any loss of data

Chapter Topics

- HDFS
- MapReduce
- Conclusion

- Provides inexpensive and reliable storage for massive amounts of data
- Optimized for sequential access to a relatively small number of large files
  - Each file is likely to be 100MB or larger
  - Multi-gigabyte files are typical
- In some ways, HDFS is similar to a UNIX filesystem
  - Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
  - UNIX-style file ownership and permissions
- There are also some major deviations from UNIX
  - No concept of a current directory
  - Cannot modify files once written

HDFS Architecture

- HDFS has a master/slave architecture
- HDFS master daemon: NameNode
  - Monitors slave nodes
- HDFS slave daemon: DataNode
  - Runs on the slave nodes

Accessing HDFS via the Command Line

- HDFS is not a general-purpose filesystem
  - Not built into the OS, so only specialized tools can access it
- Example: display the contents of the /user/fred/sales.txt file

$ hadoop fs -cat /user/fred/sales.txt

- Example: create a directory (below the root) called reports

$ hadoop fs -mkdir /reports

Copying Local Data To and From HDFS

- Remember that HDFS is distinct from your local filesystem
- Copy the file /reports/sales.txt from HDFS to the local client machine

$ hadoop fs -get /reports/sales.txt

- Copy the file input.txt from local disk to the user's directory in HDFS

$ hadoop fs -put input.txt input.txt

- Get a directory listing of the HDFS root directory

$ hadoop fs -ls /

- Delete the file /reports/sales.txt

$ hadoop fs -rm /reports/sales.txt

Chapter Topics

- HDFS
- MapReduce
- Conclusion

Introducing MapReduce

- MapReduce is not a language, it's a programming model
- Benefits of MapReduce
  - Simplicity
  - Flexibility
  - Scalability

Understanding Map and Reduce

- MapReduce consists of two functions: map and reduce
  - The output from map becomes the input to reduce
- The map function always runs first
  - Typically used to filter, transform, or parse data
- The reduce function is optional
  - Not always needed; you can run map-only jobs
- Each piece is simple, but can be powerful when combined

MapReduce Example

- The following slides will explain an entire MapReduce job
  - Input: text file containing order ID, employee name, and sale amount
  - Output: sum of all sales per employee

Job Input           Job Output
0 Alice  3625       Alice  12491
1 Bob    5174       Bob     9997
2 Alice   893       Carlos  1431
3 Alice  2139       Diana   5385
4 Diana  3581
5 Carlos 1039
6 Bob    4823
7 Alice  5834
8 Carlos  392
9 Diana  1804

The Map Phase

- Mappers process one input record at a time
  - For each input record, they emit zero or more records as output
- Our map function parses each record and then emits the name and price fields for each as output

Job Input           Output from Map
0 Alice  3625       Alice  3625
1 Bob    5174       Bob    5174
2 Alice   893       Alice   893
3 Alice  2139       Alice  2139
4 Diana  3581       Diana  3581
5 Carlos 1039       Carlos 1039
6 Bob    4823       Bob    4823
7 Alice  5834       Alice  5834
8 Carlos  392       Carlos  392
9 Diana  1804       Diana  1804

Shuffle and Sort

- Between the map and reduce phases, Hadoop merges and sorts the map output by key
- This intermediate process is known as the shuffle and sort

Output from Map Tasks       Input to Reduce Task #1
Alice  3625                 Alice  3625
Bob    5174                 Alice   893
Alice   893                 Alice  2139
Alice  2139                 Alice  5834
Diana  3581                 Carlos 1039
Carlos 1039                 Carlos  392
Bob    4823
Alice  5834                 Input to Reduce Task #2
Carlos  392                 Bob    5174
Diana  1804                 Bob    4823
                            Diana  3581
                            Diana  1804

The Reduce Phase

- Reducer input comes from the shuffle and sort process
  - For each input record, reduce can emit zero or more output records
- Our reduce function simply sums the total per person
  - And emits employee name (key) and total (value) as output

Input to Reduce Task #1     Job Output (Output of Reduce Tasks)
Alice  3625                 Alice  12491
Alice   893                 Carlos  1431
Alice  2139                 Bob     9997
Alice  5834                 Diana   5385
Carlos 1039
Carlos  392

Input to Reduce Task #2
Bob    5174
Bob    4823
Diana  3581
Diana  1804

Putting It All Together

- Here is the data flow for the entire MapReduce job: the ten input records pass through the map phase, the shuffle and sort, and the reduce phase to produce the per-employee totals (Alice 12491, Bob 9997, Carlos 1431, Diana 5385)
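The whole job can be simulated in a few lines of Python. This is only a sketch of the data flow (map, then shuffle and sort, then reduce), not how Hadoop itself is invoked:

```python
from itertools import groupby
from operator import itemgetter

# The job input from the example: "order_id name price" records
records = ["0 Alice 3625", "1 Bob 5174", "2 Alice 893", "3 Alice 2139",
           "4 Diana 3581", "5 Carlos 1039", "6 Bob 4823", "7 Alice 5834",
           "8 Carlos 392", "9 Diana 1804"]

# Map: parse each record and emit a (name, price) pair
mapped = []
for record in records:
    _, name, price = record.split()
    mapped.append((name, int(price)))

# Shuffle and sort: group the map output by key
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key
totals = {name: sum(price for _, price in pairs)
          for name, pairs in groupby(mapped, key=itemgetter(0))}
print(totals)  # {'Alice': 12491, 'Bob': 9997, 'Carlos': 1431, 'Diana': 5385}
```

In a real cluster the same three steps run in parallel across many machines; the simulation just makes the per-phase data movement visible.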

MapReduce Architecture

- MapReduce version 1 and version 2 (YARN)
  - Similar master/slave architecture
  - Details differ slightly
- Master nodes
  - Run master daemons to accept jobs, and monitor and distribute work
- Slave nodes
  - Do the actual work
  - Report status back to master daemons
- HDFS and MapReduce are collocated
  - Slave nodes run both HDFS and MapReduce slave daemons on the same machines

MapReduce Version 1 Architecture

- MRv1 master daemon: JobTracker
- MRv1 slave daemon: TaskTracker
  - Reports status back to the JobTracker

MapReduce Version 2 Architecture

- MRv2 uses the YARN cluster management framework
- MRv2 master daemon: ResourceManager
  - Allocates cluster resources for a job
- MRv2 slave daemons: ApplicationMaster and NodeManager
  - NodeManager runs on all slave nodes
  - ApplicationMaster starts and monitors the actual tasks
Chapter Topics

- HDFS
- MapReduce
- Conclusion

- Many related tools are built on top of Hadoop, including ones for
  - Data analysis
  - Workflow management
- Many are also open source Apache projects

Apache Pig

- Pig is especially good at joining and transforming data

people = LOAD '/user/training/customers' AS (cust_id, name);
orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

- The Pig interpreter runs on the client machine
  - Turns Pig Latin scripts into MapReduce jobs
  - Submits those jobs to the cluster
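As a rough mental model, the script above groups the orders, sums each group, and joins the totals back to the customer names. Here is a Python sketch of the same dataflow, using made-up customer and order records (Pig would actually run this as MapReduce jobs on the cluster):

```python
from collections import defaultdict

# Hypothetical sample data mirroring the two LOAD statements
people = [(1, "Alice"), (2, "Bob")]                  # (cust_id, name)
orders = [(10, 1, 300), (11, 1, 200), (12, 2, 150)]  # (ord_id, cust_id, cost)

# GROUP orders BY cust_id, then SUM(orders.cost) per group
totals = defaultdict(int)
for _, cust_id, cost in orders:
    totals[cust_id] += cost

# JOIN totals BY group, people BY cust_id
result = [(cust_id, total, cust_id, name)
          for cust_id, total in totals.items()
          for pid, name in people if pid == cust_id]
print(result)  # [(1, 500, 1, 'Alice'), (2, 150, 2, 'Bob')]
```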

Apache Hive

- Hive is another abstraction on top of MapReduce
  - Like Pig, it also reduces development time
  - Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC;

- The Hive interpreter runs on a client machine
  - Turns HiveQL queries into MapReduce jobs
  - Submits those jobs to the cluster
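Because HiveQL closely mirrors standard SQL, the query above can be tried against almost any SQL engine. Here it runs essentially unchanged in SQLite, with hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost INTEGER)")
# Made-up rows, standing in for tables stored in HDFS
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 300), (11, 1, 200), (12, 2, 150)])

# The HiveQL query from the slide, run as plain SQL
rows = cur.execute("""
    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC
""").fetchall()
print(rows)  # [(1, 500), (2, 150)]
```

The difference is in execution, not syntax: Hive would compile this into MapReduce jobs that scan files in HDFS rather than pages in a local database.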

Apache HBase

- Can store massive amounts of data
  - Gigabytes, terabytes, and even petabytes of data in a table
  - Tables can have many thousands of columns
- Scales to provide very high write throughput
  - Hundreds of thousands of inserts per second
- Fairly primitive when compared to an RDBMS
  - "NoSQL": there is no high-level query language
  - Use the API to scan / get / put values based on keys

Cloudera Impala

- Can query data stored in HDFS or HBase tables
- High performance
  - Typically at least 10 times faster than Hive or MapReduce
  - High-level query language (subset of SQL-92)

Apache Sqoop

- Sqoop can import all tables, a single table, or a portion of a table into HDFS
  - Does this very efficiently via a map-only MapReduce job
  - Result is a directory in HDFS containing comma-delimited text files
- Sqoop can also export data from HDFS back to the database

Importing Tables with Sqoop

- This example imports the customers table from a MySQL database
  - Will create a /mydata/customers directory in HDFS
  - Directory will contain comma-delimited text files

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table customers

- Cloudera offers high-performance custom connectors for many databases

Importing an Entire Database with Sqoop

- Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --fields-terminated-by '\t' \
    --warehouse-dir /mydata

Importing Partial Tables with Sqoop

- Import only specified columns from the products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table products \
    --columns "prod_id,name,price"

- Import only matching rows from the products table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table products \
    --where "price >= 1000"

Incremental Imports with Sqoop

- What about records added to the table since the last import?
  - Could re-import all records, but this is inefficient
- Sqoop's incremental append mode imports only new records
  - Based on the value of the last record in the specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821

- What if existing records are also modified in the database?
  - Incremental append mode doesn't handle this
  - Sqoop's incremental lastmodified mode imports both new and modified records
  - Caveat: you must maintain a timestamp column in your table

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --warehouse-dir /mydata \
    --table shipments \
    --incremental lastmodified \
    --check-column last_update_date \
    --last-value "2013-06-12 03:15:59"

- Sqoop supports moving data from HDFS back to a database via export

$ sqoop export \
    --connect jdbc:mysql://localhost/company \
    --export-dir /mydata/recommender_output \
    --table product_recommendations

Apache Flume

- Flume imports data into HDFS as it is being generated by various sources
  - Sources include log files from Web and application servers, UNIX syslog, file servers, and custom sources
  - (Diagram: Flume streams this data into Hadoop, while Sqoop moves data between Hadoop, a relational database (OLTP), and a data warehouse (OLAP))
Apache Oozie

- Oozie allows developers to manage processing workflows
  - It coordinates execution and control of individual jobs
- Oozie supports many workflow actions, including
  - Executing MapReduce jobs
  - Running Pig or Hive scripts
  - Executing standard Java or shell programs
  - Running remote commands with SSH
  - Sending e-mail messages

Chapter Topics

- HDFS
- MapReduce
- Exercise Scenario Explanation
- Conclusion

- Hands-On Exercises throughout the course will reinforce the topics being discussed
  - Most exercises depend on data generated in earlier exercises
- The exercises are based on Dualcore, a fictional retailer
  - More than 1,000 brick-and-mortar stores
  - Dualcore also has a thriving e-commerce Web site
- Dualcore has hired you to help find value in their data
  - You will process and analyze data from internal and external sources
  - Identify opportunities to increase revenue
  - Find new ways to reduce costs
  - Help other departments achieve their goals

Chapter Topics

- HDFS
- MapReduce
- Conclusion

- During this course, you will perform numerous hands-on exercises using the Cloudera Training Virtual Machine (VM)
- The VM runs Hadoop in pseudo-distributed mode
  - Simply a cluster comprised of a single node
  - Typically used for testing code before deploying to a large cluster
- In the first exercise, you will import data from the local filesystem and a relational database server to HDFS
  - You will analyze this data in subsequent exercises

Chapter Topics

- HDFS
- MapReduce
- Conclusion

Essential Points

- We are generating more data, and faster, than ever before
- Most of this data maps poorly to structured relational tables
- The ability to store and process this data can yield valuable insight

Bibliography

The following offer more information on topics discussed in this chapter

- http://tiny.cloudera.com/dac02a
- Introduction to Apache MapReduce and HDFS (recorded presentation)
  http://tiny.cloudera.com/dac02b
- Guide to HDFS Commands
  http://tiny.cloudera.com/dac02c
- http://tiny.cloudera.com/dac02d
- Sqoop User Guide
  http://tiny.cloudera.com/dac02e

Introduction to Pig
Chapter 3

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Multi-Dataset Operations with Pig
- Extending Pig
- Pig Troubleshooting and Optimization
- Introduction to Hive
- Relational Data Analysis with Hive
- Hive Data Management
- Text Processing With Hive
- Hive Optimization
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Introduction to Pig

In this chapter, you will learn
- The key features Pig offers
- How organizations use Pig for data processing and analysis
- How to use Pig interactively and in batch mode

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Apache Pig Overview

- Pig offers an alternative to writing MapReduce code directly
- Originally developed as a research project at Yahoo
  - Goals: flexibility, productivity, and maintainability
  - Now an open-source Apache project

The Anatomy of Pig

- Main components of Pig
  - The data flow language (Pig Latin)
  - The interactive shell where you can type Pig Latin statements (Grunt)
  - The Pig interpreter and execution engine
- The interpreter and execution engine turn a Pig Latin script, such as

    ... AS (cust, price);
    BigSales = FILTER AllSales
        BY price > 100;
    STORE BigSales INTO 'myreport';

  into MapReduce jobs. Along the way, they
  - Preprocess and parse Pig Latin
  - Check data types
  - Make optimizations
  - Plan execution
  - Generate MapReduce jobs
  - Monitor progress

Where to Get Pig

- Pig ships as part of CDH, alongside Hive, HBase, Oozie, and other ecosystem components
- Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  - Simple installation
  - 100% free and open source
- Installation is outside the scope of this course

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Pig Features

- Pig is an alternative to writing low-level MapReduce code
- Many features enable sophisticated analysis and processing
  - HDFS manipulation
  - UNIX shell commands
  - Relational operations
  - Positional references for fields
  - Common mathematical functions
  - Support for custom functions and data formats
  - Complex data structures

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

How Are Organizations Using Pig?

- Many organizations use Pig for data analysis
  - Finding relevant records in a massive data set
  - Querying multiple data sets
  - Calculating values from input data
- Pig is also frequently used for data processing
  - Reorganizing an existing data set
  - Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization

- Pig can help you extract valuable information from Web server log files

Web Server Log Data (excerpt):

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
...

Processing these logs yields clickstream data for user sessions, for example:

Recent Activity for John Smith
  May 3, 2013:  Search for 'Widget', Details for Widget X, Order Widget X
  May 12, 2013: Track Order, Send Complaint

Use Case: Data Sampling

- Sampling can help you explore a representative portion of a large data set
  - Allows you to examine this portion with tools that do not scale well
  - Supports faster iterations during development of analysis jobs
  - (Diagram: random sampling reduces a 100 TB data set to a 50 MB sample)

Use Case: ETL Processing

- (Diagram: data from sources such as accounting systems and a call center flows through a pipeline that validates data, fixes errors, removes duplicates, and encodes values before loading into a data warehouse)

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Using Pig Interactively

- You can use Pig interactively, via the Grunt shell
  - Pig interprets each Pig Latin statement as you type it
  - Execution is delayed until output is required
- Example of how to start, use, and exit Grunt

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

- Can also execute a Pig Latin statement from the UNIX shell via the -e option

Interacting with HDFS

- You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX

- The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls; -- lists HDFS files
grunt> sh ls; -- lists local files

Running Pig Scripts

- A Pig script is simply Pig Latin code stored in a text file
  - By convention, these files have the .pig extension
- You can run a Pig script from within the Grunt shell via the run command

grunt> run salesreport.pig;

- It is common to run a Pig script directly from the UNIX shell
  - This is useful for automation and batch execution

$ pig salesreport.pig

MapReduce and Local Modes

- As described earlier, Pig turns Pig Latin into MapReduce jobs
- It is also possible to run Pig in local mode using the -x flag

$ pig -x local salesreport.pig -- batch

Client-Side Log Files

- If a job fails, Pig may produce a log file to explain why
  - These log files are typically produced in your current working directory
  - On the local (client) machine

Chapter Topics

Introduction to Pig

- What is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
- Conclusion

Essential Points

- Pig offers an alternative to writing MapReduce code directly
  - Pig interprets Pig Latin code in order to create MapReduce jobs
- You can execute Pig Latin code interactively through Grunt
  - Pig delays job execution until output is required
- It is also common to store Pig Latin code in a script for batch execution
  - Allows for automation and code reuse

Bibliography

The following offer more information on topics discussed in this chapter

- Apache Pig Web Site
  http://pig.apache.org/
- Process a Million Songs with Apache Pig
  http://tiny.cloudera.com/dac03a
- Powered By Pig
  http://tiny.cloudera.com/dac03b
- http://tiny.cloudera.com/dac03c
- Programming Pig (book)
  http://tiny.cloudera.com/dac03d

Basic Data Analysis with Pig
Chapter 4

Course Chapters

- Introduction
- Introduction to Pig
- Basic Data Analysis with Pig
- Processing Complex Data with Pig
- Extending Pig
- Introduction to Hive
- Hive Data Management
- Text Processing With Hive
- Extending Hive
- Introduction to Impala
- Analyzing Data with Impala
- Choosing the Best Tool for the Job
- Conclusion

Basic Data Analysis with Pig

In this chapter, you will learn
- The basic syntax of Pig Latin
- Which simple data types Pig uses to represent data
- How to sort and filter data in Pig
- How to use many of Pig's built-in functions for data processing

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

- Pig Latin is a data flow language
  - The flow of data is expressed as a sequence of statements

bigsales = FILTER allsales BY price > 999; -- in US cents

/*
 * Save the filtered results into a new
 * directory, below my home directory.
 */
STORE bigsales INTO 'myreport';

- Pig Latin keywords (such as FILTER, BY, STORE, and INTO above) are reserved; you cannot use them to name things
- Identifiers are the names assigned to fields and other data structures (such as bigsales, allsales, and price above)
- Identifiers must conform to Pig's naming rules
- An identifier must always begin with a letter
  - This may only be followed by letters, numbers, or underscores

    Valid:   x  q1  q1_2013  MyData
    Invalid: 4  price$  profit%  _sale

- Whether case is significant in Pig Latin depends on context
- Keywords are not case-sensitive
  - Neither are operators (such as AND, OR, or IS NULL)
- Identifiers and paths are case-sensitive
  - So are function names (such as SUM or COUNT) and constants
- Many commonly-used operators in Pig Latin are familiar to SQL users

Arithmetic:  +  -  *  /  %
Comparison:  ==  !=  <  >  <=  >=
Null:        IS NULL  IS NOT NULL
Boolean:     AND  OR  NOT

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

- Pig's default load function, PigStorage, assumes text format with tab-separated columns
- Consider the following file in HDFS called sales
  - The two fields are separated by tab characters

Alice   2999
Bob     3625
Carlos  2764

allsales = LOAD 'sales' AS (name, price);

Data Sources: Files and Directories

allsales = LOAD 'sales' AS (name, price);

- Since this is not an absolute path, it is relative to your home directory
  - Your home directory in HDFS is typically /user/youruserid/
  - Can also specify an absolute path (e.g., /dept/sales/2012/q4)
- The path can also refer to a directory
  - In that case, all files in that directory are loaded
- File patterns (globs) are also supported

allsales = LOAD 'sales_200[5-9]' AS (name, price);

- The previous example also assigns names to each column

allsales = LOAD 'sales' AS (name, price);

- Assigning column names is not required
  - This can be useful when exploring a new dataset
  - Refer to fields by position ($0 is first, $1 is second, $53 is 54th, etc.)
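Positional references behave much like indexing into a tuple. A quick Python analogy, with made-up rows:

```python
# Each row is an ordered tuple of fields, as in a Pig relation
rows = [("Alice", 2999), ("Bob", 3625), ("Carlos", 2764)]

# Equivalent of referring to the second field as $1 when no names were assigned
prices = [row[1] for row in rows]
print(prices)  # [2999, 3625, 2764]
```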

Using Alternate Column Delimiters

- You can specify an alternate delimiter as an argument to PigStorage
  - Note that this is a single statement

allsales = LOAD 'sales.txt' USING PigStorage('|')
    AS (name, price);

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

Simple Data Types in Pig

- Pig supports several basic data types
  - Similar to those in most databases and programming languages
- Pig treats fields of unspecified type as an array of bytes
  - Called the bytearray type in Pig

allsales = LOAD 'sales' AS (name, price);

List of Simple Data Types

- There are eight data types in Pig for simple values

Name        Description                  Example Value
int         Whole numbers                2013
long        Large whole numbers          5,365,214,142L
float       Decimals                     3.14159F
double      Very precise decimals        3.14159265358979323846
boolean*    True or false values         true
datetime*   Date and time                2013-05-30T14:52:39.000-04:00
chararray   Text strings                 Alice
bytearray   Raw bytes (e.g., any data)   N/A

* Not available in older versions of Pig

Specifying Data Types in Pig

- Pig will do its best to determine data types based on context
  - For example, you can calculate sales commission as price * 0.1
  - In this case, Pig will assume that this value is of type double
- However, it is better to specify data types explicitly when possible

allsales = LOAD 'sales' AS (name:chararray, price:int);

- Choosing the right data type is important to avoid loss of precision
- Important: avoid using floating-point numbers to represent money!

HCatalog

- A new project called HCatalog can store this information permanently
  - So it need not be specified each time
- However, HCatalog is still in early development and is not yet widely used

How Pig Handles Invalid Data

- When encountering invalid data, Pig substitutes NULL for the value
  - For example, an int field containing the value Q4
- The IS NULL and IS NOT NULL operators test for null values
  - Note that NULL is not the same as the empty string ''

hasprices = FILTER Records BY price IS NOT NULL;
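A minimal Python sketch of this behavior, with hypothetical rows, where None stands in for Pig's NULL:

```python
def parse_int(raw):
    """Mimic Pig: substitute NULL (None) when a value cannot be parsed."""
    try:
        return int(raw)
    except ValueError:
        return None

rows = [("Alice", "2999"), ("Bob", "Q4"), ("Carlos", "2764")]
records = [(name, parse_int(price)) for name, price in rows]

# Equivalent of: hasprices = FILTER records BY price IS NOT NULL;
hasprices = [r for r in records if r[1] is not None]
print(hasprices)  # [('Alice', 2999), ('Carlos', 2764)]
```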

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

Key Data Concepts in Pig

- Relational databases have tables, rows, columns, and fields
- We will use the following data to illustrate Pig's equivalents

name      price   country
Alice     2999    us
Bob       3625    ca
Carlos    2764    mx
Dieter    1749    de
Étienne   2368    fr
Fredo     5637    it

Pig Data Concepts: Fields

- A single element of data is called a field
  - It corresponds to one of the eight data types seen earlier

Pig Data Concepts: Tuples

- A collection of values is called a tuple
  - Fields within a tuple are ordered, but need not all be of the same type

Pig Data Concepts: Bags

- A collection of tuples is called a bag
- Tuples within a bag are unordered by default
  - The field count and types may vary between tuples in a bag

- A relation is simply a bag with an assigned name (alias)
  - Below, allsales and bigsales are both relations

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
STORE bigsales INTO 'myreport';

Chapter Topics

Basic Data Analysis with Pig

- Pig Latin Syntax
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly-used Functions
- Hands-On Exercise: Using Pig for ETL Processing
- Conclusion

Data Output in Pig

- The command used to handle output depends on its destination
  - DUMP: sends output to the screen
  - STORE: sends output to disk (HDFS)
- Example of DUMP output, using data from the file shown earlier
  - The parentheses and commas indicate tuples with multiple fields

(Alice,2999,us)
(Bob,3625,ca)
(Carlos,2764,mx)
(Dieter,1749,de)
(Étienne,2368,fr)
(Fredo,5637,it)

Storing Data with Pig

- The STORE command is used to store data to HDFS
  - The output path is the name of a directory
  - The directory must not yet exist
  - The field delimiter also has a default value (tab)

STORE bigsales INTO 'myreport';

- You may also specify an alternate delimiter

STORE bigsales INTO 'myreport' USING PigStorage(',');

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing%the%Schema"
!! Filtering"and"SorDng"Data"""
!! Commonly/used"FuncDons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"Processing"
!! Conclusion"

Viewing"the"Schema"with"DESCRIBE

!The%DESCRIBE%command%shows%the%structure%of%the%data,%including%
names%and%types%
!The%following%Grunt%session%shows%an%example%

## grunt> allsales = LOAD 'sales' AS (name:chararray,

% price:int);
grunt> DESCRIBE allsales;

## allsales: {name: chararray,price: int}

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering%and%SorCng%Data"
!! Commonly/used"FuncDons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"Processing"
!! Conclusion"

! The FILTER keyword extracts tuples matching the specified criteria

bigsales = FILTER allsales BY price > 3000;

allsales                      bigsales (price > 3000)
name     price  country      name   price  country
Alice    2999   us           Bob    3625   ca
Bob      3625   ca           Fredo  5637   it
Carlos   2764   mx
Dieter   1749   de
Étienne  2368   fr
Fredo    5637   it

Filtering"by"MulDple"Criteria"

!You%can%combine%criteria%with%AND%and%OR

## somesales = FILTER allsales BY name == 'Dieter' OR (price >

3500 AND price < 4000);

allsales somesales

## name% price% country% name% price% country%

Alice 2999 us Bob 3625 ca
Bob 3625 ca Dieter 1749 de
Carlos 2764 mx
Dieter 1749 de Name%is%Dieter,%or%price%is%greater%%
tienne 2368 fr than%3500%and%less%than%4000"
Fredo 5637 it

! The == operator is supported for any type in Pig Latin
  - This operator is used for exact comparisons

alices = FILTER allsales BY name == 'Alice';

! Pig Latin supports pattern matching through Java's regular expressions
  - This is done with the MATCHES operator

spammers = FILTER senders BY email_addr
           MATCHES '.*@example\\.com$';
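Note that, like Java's `String.matches()`, Pig's MATCHES must match the entire field, not just a substring. As a rough illustration of those whole-string semantics (not of how Pig itself runs), the same pattern behaves like `re.fullmatch` in Python; the sample addresses here are made up:

```python
import re

# Pig's MATCHES requires the pattern to cover the ENTIRE field value,
# so Python's re.fullmatch (not re.search) mirrors its behavior.
pattern = r'.*@example\.com$'

senders = ['alice@example.com', 'bob@other.org', 'eve@example.com.evil']
spammers = [s for s in senders if re.fullmatch(pattern, s)]
print(spammers)  # ['alice@example.com']
```

The third address is excluded because the string does not end at `.com`, even though it contains `@example.com`.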

! Filtering extracts rows, but sometimes we need to extract columns

twofields = FOREACH allsales GENERATE amount, trans_id;

allsales                             twofields
salesperson  amount  trans_id       amount  trans_id
Alice        2999    107546         2999    107546
Bob          3625    107547         3625    107547
Carlos       2764    107548         2764    107548
Dieter       1749    107549         1749    107549
Étienne      2368    107550         2368    107550
Fredo        5637    107550         5637    107550

! The FOREACH and GENERATE keywords can also be used to create fields
  - For example, you could create a new field based on price

t = FOREACH allsales GENERATE price * 0.07;

! It is possible to name such fields

t = FOREACH allsales GENERATE price * 0.07 AS tax;

! And you can also specify the data type

t = FOREACH allsales GENERATE price * 0.07 AS tax:float;

! DISTINCT eliminates duplicate records in a bag
  - All fields must be equal for a record to be considered a duplicate

unique_records = DISTINCT all_alices;

all_alices                       unique_records
firstname  lastname  country     firstname  lastname  country
Alice      Smith     us          Alice      Smith     us
Alice      Jones     us          Alice      Jones     us
Alice      Brown     us          Alice      Brown     us
Alice      Brown     us          Alice      Brown     ca
Alice      Brown     ca

Controlling"Sort"Order

!Use%ORDER...BY%to%sort%the%records%in%a%bag%in%ascending%order%
Take"care"to"specify"a"schema""data"type"aects"how"data"is"sorted!"

## sortedsales = ORDER allsales BY country DESC;

allsales sortedsales
name% price% country% name% price% country%
Alice 29.99 us Alice 29.99 us
Bob 36.25 ca Carlos 27.64 mx
Carlos 27.64 mx Fredo 56.37 it
Dieter 17.49 de tienne 23.68 fr
tienne 23.68 fr Dieter 17.49 de
Fredo 56.37 it Bob 36.25 ca

LimiDng"Results"

!As%in%SQL,%you%can%use%LIMIT%to%reduce%the%number%of%output%records%

## somesales = LIMIT allsales 10;

!Beware!%Record%ordering%is%random%unless%specied%with%ORDER BY
Use"ORDER BY"and"LIMIT"together"to"nd"top/N"results"

## sortedsales = ORDER allsales BY price DESC;

top_five = LIMIT sortedsales 5;
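Pig executes this on the cluster, but the sort-then-slice logic is easy to picture. As a small in-memory illustration only (using the sample sales data from earlier slides), the ORDER BY ... DESC plus LIMIT pattern behaves like:

```python
# Sample data from the earlier slides: (name, price) tuples
allsales = [('Alice', 2999), ('Bob', 3625), ('Carlos', 2764),
            ('Dieter', 1749), ('Etienne', 2368), ('Fredo', 5637)]

# sortedsales = ORDER allsales BY price DESC;
sortedsales = sorted(allsales, key=lambda t: t[1], reverse=True)

# top_five = LIMIT sortedsales 5;
top_five = sortedsales[:5]
print(top_five[0])  # ('Fredo', 5637) -- the highest price comes first
```

Without the explicit sort, slicing would return five arbitrary records, just as LIMIT alone returns records in no guaranteed order.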

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering"and"SorDng"Data"
!! Commonly#used%FuncCons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"processing"
!! Conclusion"

Built/in"FuncDons"

!These%are%just%a%sampling%of%Pigs%many%built#in%funcCons%
%
FuncCon%DescripCon% Example%InvocaCon% Input% Output%
Convert"to"uppercase" UPPER(country) uk UK

## Return"a"random"number" RANDOM() 0.4816132

6652569
Round"to"closest"whole"number" ROUND(price) 37.19 37

## Return"chars"between"two"posiDons" SUBSTRING(name, 0, 2) Alice Al

!You%can%use%these%with%the%FOREACH..GENERATE%keywords%

## rounded = FOREACH allsales GENERATE ROUND(price);

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering"and"SorDng"Data"
!! Commonly/used"FuncDons"
!! Hands#On%Exercise:%Using%Pig%for%ETL%processing"
!! Conclusion"

Hands/On"Exercise:"Using"Pig"for"ETL"processing"

!In%this%Hands#On%Exercise,%you%will%write%%Pig%LaCn%code%to%perform%basic%ETL%

Chapter"Topics"

Basic%Data%Analysis%with%Pig%

!! Simple"Data"Types"
!! Field"DeniDons"
!! Data"Output"
!! Viewing"the"Schema"
!! Filtering"and"SorDng"Data"
!! Commonly/used"FuncDons"
!! Hands/On"Exercise:"Using"Pig"for"ETL"processing"
!! Conclusion"

EssenDal"Points"

!Pig%LaCn%supports%many%of%the%same%operaCons%as%SQL%
Though"Pigs"approach"is"quite"dierent"
!The%default%delimiter%for%both%input%and%output%is%the%tab%character%
You"can"specify"an"alternate"delimiter"as"an"argument"to"PigStorage
!Specifying%the%names%and%types%of%elds%is%not%required%

Bibliography"

The%following%oer%more%informaCon%on%topics%discussed%in%this%chapter%
!Pig%LaCn%Basics%
http://tiny.cloudera.com/dac04a
!Pig%LaCn%Built#In%FuncCons%
http://tiny.cloudera.com/dac04b
!DocumentaCon%for%Java%Regular%Expression%PaPerns%
http://tiny.cloudera.com/dac04c
!Installing%and%Using%HCatalog%
http://tiny.cloudera.com/dac04d

Processing"Complex"Data"with"Pig"
Chapter"5"

Course"Chapters"

!! IntroducFon"
!! IntroducFon"to"Pig"
!! Basic"Data"Analysis"with"Pig"
!! Processing%Complex%Data%with%Pig%
!! MulF/Dataset"OperaFons"with"Pig"
!! Extending"Pig"
!! Pig"TroubleshooFng"and"OpFmizaFon"
!! IntroducFon"to"Hive"
!! RelaFonal"Data"Analysis"with"Hive"
!! Hive"Data"Management"
!! Text"Processing"With"Hive"
!! Hive"OpFmizaFon"
!! Extending"Hive"
!! IntroducFon"to"Impala"
!! Analyzing"Data"with"Impala"
!! Choosing"the"Best"Tool"for"the"Job"
!! Conclusion"

Processing"Complex"Data"with"Pig"

In%this%chapter,%you%will%learn%
!How%Pig%uses%bags,%tuples,%and%maps%to%represent%complex%data%
!The%techniques%Pig%provides%for%grouping%and%ungrouping%data%
!How%to%use%aggregate%funcFons%in%Pig%LaFn%
!How%to%iterate%through%records%in%complex%data%structures%

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage%Formats%
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"

Storage"Formats"

Uses"a"delimited"text"le"format""

## allsales = LOAD 'sales' AS (name, price);

The"default"delimiter"(tab)"can"be"easily"changed"

## allsales = LOAD 'sales' USING PigStorage(',')

AS (name, price) ;

Other"Supported"Formats"

BinStorage Files"containing"binary"data"

PigStorage PigStorage
BinStorage BinStorage"

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested%Data%Types%
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"

Pigs"Complex"Data"Types:"Tuple"and"Bag"

A"tuple"is"a"collecFon"of"values""
A"bag"is"a"collecFon"of"tuples"

## trans_id% total% salesperson%

107546 2999 Alice
107547 3625 Bob
107548 2764 Carlos tuple"
107549 1749 Dieter
107550 2368 tienne
107551 5637 Fredo bag"

Pigs"Complex"Data"Types:"Map"

!Pig%also%supports%another%complex%type:%Map%
A"map"associates"a"chararray"(key)"to"another"data"element"(value)"

## trans_id% amount% salesperson% sales_details%

107546 2498 Alice date 12-02-2013
SKU 40155
store MIA01
107547 3625 Bob date 12-02-2013
SKU 3720
store STL04
coupon DEC13
107548 2764 Carlos date 12-03-2013
SKU 76102
store NYC15

RepresenFng"Complex"Types"in"Pig"

!It%is%important%to%know%how%to%dene%and%recognize%these%types%in%Pig%

Type% DeniFon%
Tuple% Comma/delimited"list"inside"parentheses:"
"
"""('107546', 2498, 'Alice')

Bag% Braces"surround"comma/delimited"list"of"tuples:"
"
"""{('107546', 2498, 'Alice'), ('107547', 3625, 'Bob')}

Map% Brackets"surround"comma/delimited"list"of"pairs;"keys"and"values"separated"by"#:"
"
"""['store'#'MIA01','location'#'Coral Gables']

! Complex data types can be used in any Pig field
! The following example shows how a bag is stored in a text file
  - Example: transaction ID, amount, items sold (a bag of tuples)

107550 <TAB> 2498 <TAB> {('40120', 1999), ('37001', 499)}
(Field 1)    (Field 2)  (Field 3)

details = LOAD 'salesdetail' AS (
    trans_id:chararray, amount:int,
    items_sold:bag
        {item:tuple (SKU:chararray, price:int)});

! The following example shows how a map is stored in a text file
  - Example: customer name, credit account details (map), year account opened

Eva <TAB> [creditlimit#5000,creditused#800] <TAB> 2012
(Field 1) (Field 2)                              (Field 3)

credit = LOAD 'customer_accounts' AS (
    name:chararray, account:map[], year:int);

Referencing"Map"Data"

!Consider%a%le%with%the%following%data%

Bob [salary#52000,age#52]
%

## details = LOAD 'data' AS (name:chararray, info:map[]);

%
!Here%is%the%syntax%for%referencing%data%within%the%map%and%bag%

## salaries = FOREACH details GENERATE info#'salary';

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping%
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"

Grouping"Records"By"a"Field"(1)"

!SomeFmes%you%need%to%group%records%by%a%given%eld%
For"example,"so"you"can"calculate"commissions"for"each"employee"

Alice 729
Bob 3999
Alice 27999
Carol 32999
Carol 4999
"
! Use%GROUP BY%to%do%this%in%Pig%LaFn%
The"new"relaFon"has"one"record"per"unique"value"in"the"specied"eld

## grunt> byname = GROUP sales BY name;

Grouping"Records"By"a"Field"(2)"

!The%new%relaFon%always%contains%two%elds%

## grunt> byname = GROUP sales BY name;

grunt> DESCRIBE byname;
byname: {group: chararray,sales: {(name:
chararray,price: int)}}

!The%rst%eld%is%literally%named%group%in%all%cases
Contains"the"value"from"the"eld"specied"in"GROUP BY
!The%second%eld%is%named%a^er%the%relaFon%specied%in%GROUP BY
Its"a"bag"containing"one"tuple"for"each"corresponding"value"

Grouping"Records"By"a"Field"(3)"

!The%example%below%shows%the%data%a^er%grouping%
Input"Data"(sales)"

## grunt> byname = GROUP sales BY name; Alice 729

grunt> DUMP byname; Bob 3999
(Bob,%{(Bob,3999)}) Alice 27999
(Alice,{(Alice,729),(Alice,27999)}) Carol 32999
Carol 4999
(Carol,{(Carol,32999),(Carol,4999)})

group sales
eld% eld%

Using"GROUP BY"to"Aggregate"Data"

!Aggregate%funcFons%create%one%output%value%from%mulFple%input%values%
For"example,"to"calculate"total"sales"by"employee"
Usually"applied"to"grouped"data"

## grunt> byname = GROUP sales BY name;

grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

## grunt> totals = FOREACH byname GENERATE

!We%can%use%the%SUM%funcFon%to%XXX%
group, SUM(sales.price);
grunt> dump totals;
(Bob,3999)
(Alice,28728)
(Carol,37998)
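The GROUP-then-SUM pattern above is the core of most Pig aggregations. Purely as an in-memory illustration of the semantics (Pig itself runs this as a distributed job), the same grouping and summing can be sketched in Python with the sample sales data:

```python
from collections import defaultdict

# Sample data from the slide: (name, price) tuples
sales = [('Alice', 729), ('Bob', 3999), ('Alice', 27999),
         ('Carol', 32999), ('Carol', 4999)]

# byname = GROUP sales BY name; -- one bag of tuples per unique name
byname = defaultdict(list)
for record in sales:
    byname[record[0]].append(record)

# totals = FOREACH byname GENERATE group, SUM(sales.price);
totals = {name: sum(price for _, price in bag)
          for name, bag in byname.items()}
print(totals)  # {'Alice': 28728, 'Bob': 3999, 'Carol': 37998}
```

Note that SUM operates on the bag inside each grouped record, which is why the Pig expression references `sales.price` rather than just `price`.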

Grouping"Everything"Into"a"Single"Record

! We%just%saw%that%GROUP BY%creates%one%record%for%each%unique%value%
! GROUP ALL%puts%all%data%into%one%record

## grunt> grouped = GROUP sales ALL;

grunt> DUMP grouped;
(all,{(Alice,729),(Bob,3999),(Alice,27999),
(Carol,32999),(Carol,4999)})

Using"GROUP ALL"to"Aggregate"Data"

!Use%GROUP ALL%when%you%need%to%aggregate%one%or%more%columns%
For"example,"to"calculate"total"sales"for"all"employees"

## grunt> grouped = GROUP sales ALL;

grunt> DUMP grouped;
(all,{(Alice,729),(Bob,3999),(Alice,27999),(Carol,32999),
(Carol,4999)})
!We%can%use%the%SUM%funcFon%to%XXX%
grunt> totals = FOREACH grouped GENERATE SUM(sales.price);
grunt> dump totals;
(70725)

Removing"NesFng"in"Data

! Some%operaFons%in%Pig,%like%grouping,%produce%nested%data%structures%

## grunt> byname = GROUP sales BY name;

grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

!Grouping%can%be%useful%to%supply%data%to%aggregate%funcFons"
!However,%someFmes%you%want%to%work%with%a%at%data%structure%
The"FLATTEN"operator"removes"a"level"of"nesFng"in"data"

An"Example"of"FLATTEN

! The%following%shows%the%nested%data%and%what%FLATTEN%does%to%it%
% grunt> byname = GROUP sales BY name;
grunt> DUMP byname;
(Bob,{(Bob,3999)})
(Alice,{(Alice,729),(Alice,27999)})
(Carol,{(Carol,32999),(Carol,4999)})

## grunt> flat = FOREACH byname GENERATE group,

FLATTEN(sales.price);
grunt> DUMP flat;
(Bob,3999)
(Alice,729)
(Alice,27999)
(Carol,32999)
(Carol,4999)
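Conceptually, FLATTEN pairs the group key with each tuple in the bag, producing one output record per bag element. As a small Python sketch of that un-nesting (an illustration of the semantics only, using the grouped data from the slide):

```python
# Grouped data as on the slide: group key -> bag of (name, price) tuples
byname = {
    'Bob':   [('Bob', 3999)],
    'Alice': [('Alice', 729), ('Alice', 27999)],
    'Carol': [('Carol', 32999), ('Carol', 4999)],
}

# flat = FOREACH byname GENERATE group, FLATTEN(sales.price);
# FLATTEN emits one (group, price) record per tuple in each bag.
flat = [(group, price)
        for group, bag in byname.items()
        for _, price in bag]
print(flat)
```

A group whose bag holds two tuples contributes two flat records, which is exactly why Alice and Carol each appear twice in the DUMP output above.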

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built#in%FuncFons%for%Complex%Data%
!! IteraFng"Grouped"Data"
!! Conclusion"

Pigs"Built/In"Aggregate"FuncFons"

!Pig%has%built#in%support%for%other%aggregate%funcFons%besides%SUM
!Examples:%
AVG:""Calculates"the"average"(mean)"of"all"values"
MIN:""Returns"the"smallest"value"
MAX:""Returns"the"largest"value"
!Pig%has%two%built#in%funcFons%for%counFng%records%
COUNT:""Returns"the"number"of"non#null"elements"in"the"bag"
COUNT_STAR:""Returns"the"number"of"all"elements"in"the"bag"

Other"Notable"Built/in"FuncFons"

!Here%are%a%some%other%useful%Pig%funcFons%
See"the"Pig"documentaFon"for"a"complete"list"

FuncFon% DescripFon%
DIFF Finds"tuples"that"appear"in"only"one"of"two"supplied"bags
IsEmpty Used"with"FILTER"to"match"bags"or"maps"that"contain"no"data"
SIZE Returns"the"size"of"the"eld"(deniFon"of"size"varies"by"data"type)
TOKENIZE Splits"a"text"string"(chararray)"into"a"bag"of"individual"words"

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng%Grouped%Data%
!! Conclusion"

Record"IteraFon

!We%have%seen%that%FOREACH...GENERATE%iterates%through%records%
!The%goal%is%to%transform%records%to%produce%a%new%relaFon%
SomeFmes"to"select"only"certain"columns"

## price_column_only = FOREACH sales GENERATE price;

SomeFmes"to"create"new"columns"

## taxes = FOREACH sales GENERATE price * 0.07;

SomeFmes"to"invoke"a"funcFon"on"the"data"

## totals = FOREACH grouped GENERATE SUM(sales.price);

NesFng"the"FOREACH"Keyword"

!A%variaFon%on%FOREACH%applies%a%set%of%operaFons%to%each%record%
This"is"oeen"used"to"apply"a"series"of"transformaFons"in"a"group"
!This%is%called%a%nested%FOREACH%
Allows"only"relaFonal"operaFons"(e.g., LIMIT,"FILTER,"ORDER BY)"
GENERATE"must"be"the"last"line"in"the"block"

Nested"FOREACH"Example"(1)"
Input"Data"
!Our%input%data%contains%a%list%of%employee%%
job%Ftles%and%corresponding%salaries% President 192000
Director 152500
!Goal:%idenFfy%the%three%highest%salaries% Director 161000
within%each%Ftle% Director 167000
Director 165000
Director 147000
Engineer 92300
Engineer 85000
Engineer 83000
Engineer 81650
Engineer 82100
Engineer 87300
Engineer 76000
Manager 87000
Manager 81000
Manager 75000
Manager 79000
Manager 67500

Nested"FOREACH"Example"(2)"
Input"Data"(excerpt)"
President 192000
!Next,%group%employees%by%Ftle% Director 152500
Assigned"to"new"relaFon"Ftle_group" Director 161000
...
Engineer 92300
...
Manager 67500

## employees = LOAD 'data' AS (title:chararray, salary:int);

title_group = GROUP employees BY title;

## top_salaries = FOREACH title_group {

sorted = ORDER employees BY salary DESC;
highest_paid = LIMIT sorted 3;
GENERATE group, highest_paid;
};

Nested"FOREACH"Example"(3)"
Input"Data"(excerpt)"
!The%nested%FOREACH%iterates%through%every%
record%in%the%group%(i.e.,%each%job%Ftle)% President 192000
It"sorts"each"record"in"that"group"in"" Director 152500
Director 161000
descending"order"of"salary" ...
It"then"selects"the"top"three" Engineer 92300
...
GENERATE"outputs"the"Ftle"and"salaries"
Manager 67500

## employees = LOAD 'data' AS (title:chararray, salary:int);

title_group = GROUP employees BY title;

## top_salaries = FOREACH title_group {

sorted = ORDER employees BY salary DESC;
highest_paid = LIMIT sorted 3;
GENERATE group, highest_paid;
};

Nested"FOREACH"Example"(4)
%%
title_group = GROUP employees BY title; President 192000
Director 152500
top_salaries = FOREACH title_group { Director 161000
sorted = ORDER employees BY salary DESC; ...
highest_paid = LIMIT sorted 3; Engineer 92300
GENERATE group, highest_paid; ...
}; Manager 67500

Output"produced"by"DUMP top_salaries

(Director,{(Director,167000),(Director,165000),(Director,161000)})
(Engineer,{(Engineer,92300),(Engineer,87300),(Engineer,85000)})
(Manager,{(Manager,87000),(Manager,81000),(Manager,79000)})
(President,{(President,192000)})
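The "top-N per group" idiom shown in the nested FOREACH is common enough to be worth internalizing. As an in-memory Python sketch of the same logic (for intuition only; the figures are a subset of the slide's input data):

```python
from collections import defaultdict

# Subset of the slide's input data: (title, salary) records
employees = [('President', 192000), ('Director', 152500),
             ('Director', 161000), ('Director', 167000),
             ('Director', 165000), ('Director', 147000),
             ('Engineer', 92300), ('Engineer', 87300),
             ('Engineer', 85000), ('Engineer', 83000),
             ('Manager', 87000), ('Manager', 81000),
             ('Manager', 79000), ('Manager', 75000)]

# title_group = GROUP employees BY title;
title_group = defaultdict(list)
for title, salary in employees:
    title_group[title].append(salary)

# Nested FOREACH: within each group, ORDER ... DESC then LIMIT 3
top_salaries = {title: sorted(salaries, reverse=True)[:3]
                for title, salaries in title_group.items()}
print(top_salaries['Director'])  # [167000, 165000, 161000]
```

Groups smaller than N simply produce fewer records, just as President yields a single tuple in the Pig output above.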

Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion"


Chapter"Topics"

Processing%Complex%Data%with%Pig%

!! Storage"Formats"
!! Complex/Nested"Data"Types"
!! Grouping"
!! Built/in"FuncFons"for"Complex"Data"
!! IteraFng"Grouped"Data"
!! Conclusion%

EssenFal"Points"

!Pig%has%three%complex%data%types:%tuple,%bag,%and%map
A"map"is"simply"a"collecFon"of"key/value"pairs"
!These%structures%can%contain%simple%types%like%int%or%chararray
But"they"can"also"contain"complex"data"types"
Nested"data"structures"are"common"in"Pig"
!Pig%provides%methods%for%grouping%and%ungrouping%data%
You"can"remove"a"level"of"nesFng"using"the"FLATTEN"operator"
!Pig%oers%several%built#in%aggregate%funcFons%

MulA/Dataset"OperaAons"with"Pig"
Chapter"6"

Course"Chapters"

!! IntroducAon"
!! IntroducAon"to"Pig"
!! Basic"Data"Analysis"with"Pig"
!! Processing"Complex"Data"with"Pig"
!! Mul*#Dataset%Opera*ons%with%Pig%
!! Extending"Pig"
!! Pig"TroubleshooAng"and"OpAmizaAon"
!! IntroducAon"to"Hive"
!! RelaAonal"Data"Analysis"with"Hive"
!! Hive"Data"Management"
!! Text"Processing"With"Hive"
!! Hive"OpAmizaAon"
!! Extending"Hive"
!! IntroducAon"to"Impala"
!! Analyzing"Data"with"Impala"
!! Choosing"the"Best"Tool"for"the"Job"
!! Conclusion"

MulA/Dataset"OperaAons"with"Pig"

In%this%chapter,%you%will%learn%
!How%we%can%use%grouping%to%combine%data%from%mul*ple%sources%
!What%types%of%join%opera*ons%Pig%supports%and%how%to%use%them%
!How%to%concatenate%records%to%produce%a%single%data%set%
!How%to%split%a%single%data%set%into%mul*ple%rela*ons%

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques%for%Combining%Data%Sets%
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

Overview"of"Combining"Data"Sets"

!So%far,%we%have%concentrated%on%processing%single%data%sets%
Valuable"insight"oXen"results"from"combining"mulAple"data"sets"
!Pig%oers%several%techniques%for%achieving%this%
Using"the"GROUP"operator"with"mulAple"relaAons"
Joining"the"data"as"you"would"in"SQL"
Performing"set"operaAons"like"CROSS"and"UNION
!We%will%cover%each%of%these%in%this%chapter%

Example"Data"Sets"(1)"
Stores"
!Most%examples%in%this%chapter%will%involve%the%%
same%two%data%sets% A Anchorage
B Boston
D Dallas
Dualcores%stores% E Edmonton
F Fargo
!There%are%two%elds%in%this%rela*on%
1. store_id:chararray"(unique"key)"
2. name:chararray"(name"of"the"city"in"which"the"store"is"located)"

Example"Data"Sets"(2)"
Stores"
!Our%other%data%set%is%a%le%containing%
B Boston
!This%rela*on%contains%three%elds% C Chicago
D Dallas
1. person_id:int"(unique"key)" E Edmonton
2. name:chararray"(salesperson"name)" F Fargo
3. store_id:chararray"(refers"to"store)" Salespeople"

1 Alice B
2 Bob D
3 Carlos F
4 Dieter A
5 tienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack

Grouping"MulAple"RelaAons"

Groups"values"in"a"relaAon"based"on"the"specied"eld(s)"
!The%GROUP%operator%can%also%group%mul\$ple%rela*ons%
In"this"case,"using"the"synonymous"COGROUP"operator"is"preferred"

## grouped = COGROUP stores BY store_id, salespeople BY store_id;

!This%collects%values%from%both%data%sets%into%a%new%rela*on%
As"before,"the"new"relaAon"is"keyed"by"a"eld"named"group
This"group"eld"is"associated"with"one"bag"for"each"input"

## store_id records from stores records from salespeople

Example"of"COGROUP
Stores"
A Anchorage grunt> grouped = COGROUP stores BY store_id,
B Boston salespeople BY store_id;
C Chicago
D Dallas grunt> DUMP grouped;
E Edmonton (A,{(A,Anchorage)},{(4,Dieter,A)})
F Fargo (B,{(B,Boston)},{(1,Alice,B),(8,Hannah,B)})
(C,{(C,Chicago)},{(6,Fredo,C),(9,Irina,C)})
Salespeople" (D,{(D,Dallas)},{(2,Bob,D),(7,George,D)})
(E,{(E,Edmonton)},{})
1 Alice B (F,{(F,Fargo)},{(3,Carlos,F),(5,tienne,F)})
2 Bob D (,{},{(10,Jack,)})
3 Carlos F
4 Dieter A
5 tienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining%Data%Sets%in%Pig%
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

Join"Overview"

!The%COGROUP%operator%creates%a%nested%data%structure%
!Pig%La*ns%JOIN%operator%creates%a%at%data%structure%
Similar"to"joins"in"a"relaAonal"database"
!A%JOIN%is%similar%to%doing%a%COGROUP%followed%by%a%FLATTEN
Though"they"handle"null"values"dierently"

Key"Fields"

!Like%COGROUP,%joins%rely%on%a%eld%shared%by%each%rela*on%

## joined = JOIN stores BY store_id, salespeople BY store_id;

!Joins%can%also%use%mul*ple%elds%as%the%key%

## joined = JOIN customers BY (name, phone_number),

accounts BY (name, phone_number);

Inner"Joins"

!The%default%JOIN%in%Pig%La*n%is%an%inner%join%

## joined = JOIN stores BY store_id, salespeople BY store_id;

!An%inner%join%outputs%records%only%when%the%key%is%found%in%all%inputs%
In"the"above"example,"stores"that"have"at"least"one"salesperson"
!You%can%do%an%inner%join%on%mul*ple%rela*ons%in%a%single%statement%
But"you"must"use"the"same"key"to"join"them"

Inner"Join"Example"
stores"
A Anchorage grunt> joined = JOIN stores BY store_id,
B Boston salespeople BY store_id;
C Chicago
D Dallas grunt> DUMP joined;
E Edmonton (A,Anchorage,4,Dieter,A)
F Fargo (B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
salespeople" (C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
1 Alice B (D,Dallas,2,Bob,D)
2 Bob D (D,Dallas,7,George,D)
3 Carlos F (F,Fargo,3,Carlos,F)
4 Dieter A (F,Fargo,5,tienne,F)
5 tienne F
6 Fredo C
7 George D
8 Hannah B
9 Irina C
10 Jack
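Notice what the inner join drops: Edmonton (a store with no salesperson) and Jack (a salesperson with no store). As a small in-memory Python illustration of those semantics only (using a hand-picked subset of the two relations), the logic amounts to:

```python
# Subsets of the two relations: (store_id, city) and (id, name, store_id)
stores = [('A', 'Anchorage'), ('B', 'Boston'), ('C', 'Chicago')]
salespeople = [(1, 'Alice', 'B'), (4, 'Dieter', 'A'),
               (8, 'Hannah', 'B'), (10, 'Jack', None)]

# joined = JOIN stores BY store_id, salespeople BY store_id;
# An inner join keeps only keys present in BOTH inputs, so Chicago
# (no matching salesperson) and Jack (no store) drop out.
joined = [s + p for s in stores for p in salespeople if s[0] == p[2]]
for record in joined:
    print(record)
```

Each output record concatenates the fields from both inputs, which is why the joined relation still carries the store_id twice, a duplication the next slides show how to clean up.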

EliminaAng"Duplicate"Fields"(1)"

!As%with%COGROUP,%the%new%rela*on%s*ll%contains%duplicate%elds%

## grunt> joined = JOIN stores BY store_id,

salespeople BY store_id;

## grunt> DUMP joined;

(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,tienne,F)

EliminaAng"Duplicate"Fields"(2)"

!We%can%use%FOREACH...GENERATE%to%retain%just%the%elds%we%need%
However,"it"is"now"slightly"more"complex"to"reference"elds"
We"must"fully/qualify"any"elds"with"names"that"are"not"unique"

## grunt> DESCRIBE joined;

joined: {stores::store_id: chararray,stores::name:
chararray,salespeople::person_id: int,salespeople::name:
chararray,salespeople::store_id: chararray}

## grunt> cleaned = FOREACH joined GENERATE stores::store_id,

stores::name, person_id, salespeople::name;

## grunt> DUMP cleaned;

(A,Anchorage,4,Dieter)
(B,Boston,1,Alice)
(B,Boston,8,Hannah)
... (additional records omitted for brevity) ...

Outer"Joins"

!Pig%La*n%allows%you%to%specify%the%type%of%join%following%the%eld%name%
Inner"joins"do"not"specify"a"join"type"

## joined = JOIN relation1 BY field [LEFT|RIGHT|FULL] OUTER,

relation2 BY field;

!An%outer%join%does%not%require%the%key%to%be%found%in%both%inputs%
!Outer%joins%require%Pig%to%know%the%schema%for%at%least%one%rela*on%
Which"relaAon"requires"schema"depends"on"the"join"type"
Full"outer"joins"require"schema"for"both"relaAons"

LeX"Outer"Join"Example"
stores"
!Result%contains%all%records%from%the%rela*on%
A Anchorage specied%on%the%le[,%but%only%matching%records%
B Boston
C Chicago from%the%one%specied%on%the%right%
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo LEFT OUTER, salespeople BY store_id;

## grunt> DUMP joined;

salespeople"
(A,Anchorage,4,Dieter,A)
1 Alice B (B,Boston,1,Alice,B)
2 Bob D (B,Boston,8,Hannah,B)
3 Carlos F (C,Chicago,6,Fredo,C)
4 Dieter A (C,Chicago,9,Irina,C)
5 tienne F (D,Dallas,2,Bob,D)
6 Fredo C (D,Dallas,7,George,D)
7 George D (E,Edmonton,,,)
8 Hannah B (F,Fargo,3,Carlos,F)
9 Irina C (F,Fargo,5,tienne,F)
10 Jack

Right"Outer"Join"Example"
stores"
!Result%contains%all%records%from%the%rela*on%
A Anchorage specied%on%the%right,%but%only%matching%records%
B Boston
C Chicago from%the%one%specied%on%the%le[%
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo RIGHT OUTER, salespeople BY store_id;

## grunt> DUMP joined;

salespeople"
(A,Anchorage,4,Dieter,A)
1 Alice B (B,Boston,1,Alice,B)
2 Bob D (B,Boston,8,Hannah,B)
3 Carlos F (C,Chicago,6,Fredo,C)
4 Dieter A (C,Chicago,9,Irina,C)
5 tienne F (D,Dallas,2,Bob,D)
6 Fredo C (D,Dallas,7,George,D)
7 George D (F,Fargo,3,Carlos,F)
8 Hannah B (F,Fargo,5,tienne,F)
9 Irina C (,,10,Jack,)
10 Jack

Full"Outer"Join"Example"
stores"
!Result%contains%all%records%where%there%is%a%match%
A Anchorage in%either%rela*on%
B Boston
C Chicago
D Dallas
E Edmonton grunt> joined = JOIN stores BY store_id
F Fargo FULL OUTER, salespeople BY store_id;

## grunt> DUMP joined;

salespeople"
(A,Anchorage,4,Dieter,A)
1 Alice B (B,Boston,1,Alice,B)
2 Bob D (B,Boston,8,Hannah,B)
3 Carlos F (C,Chicago,6,Fredo,C)
4 Dieter A (C,Chicago,9,Irina,C)
5 tienne F (D,Dallas,2,Bob,D)
6 Fredo C (D,Dallas,7,George,D)
7 George D (E,Edmonton,,,)
8 Hannah B (F,Fargo,3,Carlos,F)
9 Irina C (F,Fargo,5,tienne,F)
10 Jack (,,10,Jack,)

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set%Opera*ons%
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

Crossing"Data"Sets"

! JOIN%nds%records%in%one%rela*on%that%match%records%in%another%
!Pigs%CROSS%operator%creates%the%cross%product%of%both%rela*ons%
Combines"all"records"in"both"tables"regardless"of"matching"
In"other"words,"all"possible"combinaAons"of"records"

## crossed = CROSS stores, salespeople;

!Careful:%This%can%generate%huge%amounts%of%data!%

Cross"Product"Example"
stores"
!Generates%every%possible%combina*on%of%records%
A Anchorage in%the%stores%and%salespeople%rela*ons%
B Boston
D Dallas
grunt> crossed = CROSS stores, salespeople;
salespeople"
grunt> DUMP crossed;
1 Alice B (A,Anchorage,1,Alice,B)
2 Bob D (A,Anchorage,2,Bob,D)
8 Hannah B (A,Anchorage,8,Hannah,B)
10 Jack (A,Anchorage,10,Jack,)
(B,Boston,1,Alice,B)
(B,Boston,2,Bob,D)
(B,Boston,8,Hannah,B)
(B,Boston,10,Jack,)
(D,Dallas,1,Alice,B)
(D,Dallas,2,Bob,D)
(D,Dallas,8,Hannah,B)
(D,Dallas,10,Jack,)
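The output size here is the product of the input sizes, which is why CROSS gets expensive so quickly. As a tiny Python illustration of the semantics (and of the multiplicative growth), using the same small relations as the slide:

```python
from itertools import product

stores = [('A', 'Anchorage'), ('B', 'Boston'), ('D', 'Dallas')]
salespeople = [(1, 'Alice', 'B'), (2, 'Bob', 'D'),
               (8, 'Hannah', 'B'), (10, 'Jack', None)]

# crossed = CROSS stores, salespeople; -- every pairing, no key matching
crossed = [s + p for s, p in product(stores, salespeople)]
print(len(crossed))  # 3 stores x 4 salespeople = 12 records
```

With two million-record inputs the same operation would emit a trillion records, which is the practical reason for the "use it carefully" warning.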

ConcatenaAng"Data"Sets"

!We%have%explored%several%techniques%for%combining%data%sets%
!The%UNION%operator%combines%records%ver*cally%
Pig"does"not"require"these"inputs"to"have"the"same"schema"
It"does"not"eliminate"duplicate"records"nor"preserve"order"

## both = UNION june_items, july_items;

UNION"Example"
june"
!Concatenates%all%records%from%june%and%july%
Battery 349
Cable 799 grunt> both = UNION june_items, july_items;
DVD 1999
HDTV 79999 grunt> DUMP both;
(Fax,17999)
(GPS,24999)
july" (HDTV,65999)
Fax 17999 (Ink,3999)
HDTV 65999 (Battery,349)
Ink 3999 (Cable,799)
(DVD, 1999)
(HDTV,79999)

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! Spliang%Data%Sets%
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion"

SpliUng"Data"Sets"

!You%have%learned%several%ways%to%combine%data%sets%into%a%single%rela*on%
!Some*mes%you%need%to%split%a%data%set%into%mul*ple%rela*ons%
Server"logs"by"date"range"
Customer"lists"by"region"
Product"lists"by"vendor"
!Pig%La*n%supports%this%with%the%SPLIT%operator%

## SPLIT relation INTO relationA IF expression1,

relationB IF expression2,
relationC IF expression3...;

Expressions"need"not"be"mutually"exclusive"

SPLIT"Example"

!Split%customers%into%groups%for%rewards%program,%based%on%life*me%value%

customers"
grunt> SPLIT customers INTO
Annette 9700 gold_program IF ltv >= 25000,
Bruce 23500 silver_program IF ltv >= 10000
Charles 17800 AND ltv < 25000;
Dustin 21250
Eva 8500 grunt> DUMP gold_program;
Felix 9300 (Glynn,27800)
Glynn 27800 (Ian,43800)
Henry 8900 (Jeff,29100)
Ian 43800 (Kai,34000)
Jeff 29100
Kai 34000 grunt> DUMP silver_program;
Laura 7800 (Bruce,23500)
Mirko 24200 (Charles,17800)
(Dustin,21250)
(Mirko,24200)
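Each SPLIT condition is evaluated independently against every record, so a record can land in several outputs, or in none. As a Python sketch of the semantics only, using a subset of the customers shown above:

```python
# Subset of the customers relation: (name, lifetime value)
customers = [('Annette', 9700), ('Bruce', 23500), ('Glynn', 27800),
             ('Ian', 43800), ('Laura', 7800)]

# SPLIT customers INTO gold_program IF ltv >= 25000,
#                      silver_program IF ltv >= 10000 AND ltv < 25000;
# Records matching no condition (Annette, Laura) land in neither output.
gold_program = [c for c in customers if c[1] >= 25000]
silver_program = [c for c in customers if 10000 <= c[1] < 25000]
print(gold_program)    # [('Glynn', 27800), ('Ian', 43800)]
print(silver_program)  # [('Bruce', 23500)]
```

Because the conditions here are mutually exclusive, no customer appears twice; overlapping conditions would copy a record into every relation it matches.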

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands#On%Exercise:%Analyzing%Disparate%Data%Sets%with%Pig%
!! Conclusion"

Hands/on"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"

!In%this%Hands#On%Exercise,%you%will%analyze%mul*ple%data%sets%with%Pig.%

Chapter"Topics"

Mul*#Dataset%Opera*ons%with%Pig%

!! Techniques"for"Combining"Data"Sets"
!! Joining"Data"Sets"in"Pig"
!! Set"OperaAons"
!! SpliUng"Data"Sets"
!! Hands/On"Exercise:"Analyzing"Disparate"Data"Sets"with"Pig"
!! Conclusion%

EssenAal"Points"

!You%can%use%COGROUP%to%group%mul*ple%rela*ons%
This"creates"a"nested"data"structure"
!Pig%supports%common%SQL%join%types%
Inner,"leX"outer,"right"outer,"and"full"outer"
You"may"need"to"fully/qualify"eld"names"when"using"joined"data"
!Pigs%CROSS%operator%creates%every%possible%combina*on%of%input%data%
This"can"create"huge"amounts"of"data""use"it"carefully!"
!You%can%use%a%UNION%to%concatenate%data%sets%

Extending"Pig"
Chapter"7"

Course"Chapters"

!! IntroducEon"
!! IntroducEon"to"Pig"
!! Basic"Data"Analysis"with"Pig"
!! Processing"Complex"Data"with"Pig"
!! MulE/Dataset"OperaEons"with"Pig"
!! Extending%Pig%
!! Pig"TroubleshooEng"and"OpEmizaEon"
!! IntroducEon"to"Hive"
!! RelaEonal"Data"Analysis"with"Hive"
!! Hive"Data"Management"
!! Text"Processing"With"Hive"
!! Hive"OpEmizaEon"
!! Extending"Hive"
!! IntroducEon"to"Impala"
!! Analyzing"Data"with"Impala"
!! Choosing"the"Best"Tool"for"the"Job"
!! Conclusion"

Extending"Pig"

In%this%chapter,%you%will%learn%
!How%to%use%parameters%in%your%Pig%LaAn%to%increase%its%exibility%
!How%to%dene%and%invoke%macros%to%improve%the%reusability%of%your%code%
!How%to%call%user#dened%funcAons%from%your%code%
!How%to%write%user#dened%funcAons%in%Python%
!How%to%process%data%with%external%scripts%

Chapter"Topics"

Extending%Pig%

!! Macros"and"Imports"
!! UDFs"
!! Contributed"FuncEons"
!! Using"Other"Languages"to"Process"Data"with"Pig"
!! Hands/On"Exercise:"Extending"Pig"with"Streaming"and"UDFs"
!! Conclusion"

The"Need"for"Parameters"(1)"

!Some%processing%is%very%repeAAve%
For"example,"creaEng"sales"reports"

## allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999;

## bigsales_alice = FILTER bigsales BY name == 'Alice';

STORE bigsales_alice INTO 'Alice';

The"Need"for"Parameters"(2)"

!You%may%need%to%change%the%script%slightly%for%each%run%
For"example,"to"modify"the"paths"or"lter"criteria"

## allsales = LOAD 'sales' AS (name, price);

bigsales = FILTER allsales BY price > 999;

## bigsales_alice = FILTER bigsales BY name == 'Alice';

STORE bigsales_alice INTO 'Alice';

Making the Script More Flexible with Parameters

These are replaced with specified values at runtime

allsales = LOAD '$INPUT' AS (name, price);
bigsales = FILTER allsales BY price > $MINPRICE;
bigsales_name = FILTER bigsales BY name == '$NAME';
STORE bigsales_name INTO '$NAME';

Then specify the values on the command line

$ pig -p INPUT=sales -p MINPRICE=999 \
      -p NAME='Jo Anne' reporter.pig

Two Tricks for Specifying Parameter Values

!You can also specify parameter values in a text file
An alternative to typing each one on the command line

INPUT=sales
MINPRICE=999
# comments look like this
NAME='Alice'

Use the -m filename option to tell Pig which file contains the values
!Parameter values can be defined with the output of a shell command
For example, to set MONTH to the current month:

MONTH=`date +'%m'` # returns 03 for March, 05 for May

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

The Need for Macros

!Parameters simplify repetitive code by allowing you to pass in values
But sometimes you would like to reuse the actual code too

allsales = LOAD 'sales' AS (name, price);
byperson = FILTER allsales BY name == 'Alice';
SPLIT byperson INTO low IF price < 1000,
                    high IF price >= 1000;
amt1 = FOREACH low GENERATE name, price * 0.07 AS amount;
amt2 = FOREACH high GENERATE name, price * 0.12 AS amount;
commissions = UNION amt1, amt2;
grpd = GROUP commissions BY name;
out = FOREACH grpd GENERATE SUM(commissions.amount) AS total;

Defining a Macro in Pig Latin

!Macros allow you to define a block of code to reuse easily
Similar (but not identical) to a function in a programming language

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
returns result {
    allsales = LOAD 'sales' AS (name, price);
    byperson = FILTER allsales BY name == '$NAME';
    SPLIT byperson INTO low IF price < $SPLIT_AMT,
                        high IF price >= $SPLIT_AMT;
    amt1 = FOREACH low GENERATE name, price * $LOW_PCT AS amount;
    amt2 = FOREACH high GENERATE name, price * $HIGH_PCT AS amount;
    commissions = UNION amt1, amt2;
    grouped = GROUP commissions BY name;
    $result = FOREACH grouped GENERATE SUM(commissions.amount);
};

Invoking Macros

!To invoke a macro, call it by name and supply values in the correct order

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
returns result {
    allsales = LOAD 'sales' AS (name, price);

};

alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);
carlos_comm = calc_commission('Carlos', 2000, 0.08, 0.14);

Reusing Code with Imports

!After defining a macro, you may wish to use it in multiple scripts
!You can include one script within another, starting with Pig 0.9
This is done with the import keyword and the path to the file being imported

-- We saved the macro to a file named commission_calc.pig
import 'commission_calc.pig';

alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

User-Defined Functions (UDFs)

!It is also possible to define your own functions
Pig allows writing UDFs in several languages

Language                    Supported in Pig Versions
Java                        All
Python                      0.8 and later
JavaScript (experimental)   0.9 and later
Ruby (experimental)         0.10 and later
Groovy (experimental)       0.11 and later

!In the next few slides, you will see how to use UDFs in Java, and how to
write and use UDFs in Python

Using UDFs Written in Java

!UDFs are packaged into Java Archive (JAR) files
!There are only two required steps for using them
Register the JAR file(s) containing the UDF and its dependencies
Invoke the UDF using the fully-qualified classname

REGISTER '/path/to/myudf.jar';
...
data = FOREACH allsales GENERATE com.example.MYFUNC(name);

!You can optionally define an alias for the function

REGISTER '/path/to/myudf.jar';
DEFINE FOO com.example.MYFUNC;
...
data = FOREACH allsales GENERATE FOO(name);

Writing UDFs in Python (1)

!Now we will see how to write a UDF in Python
!The data we want to process has inconsistent phone number formats

Alice   (314) 555-1212
Bob     212.555.9753
Carlos  405-555-3912
David   (202) 555.8471

!We will write a Python UDF that can consistently extract the area code

Writing UDFs in Python (2)

!Our Python code is straightforward
!The only unusual thing is the optional @outputSchema decorator
This tells Pig what data type we are returning
If not specified, Pig will assume bytearray

@outputSchema("areacode:chararray")
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats

    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]

    return areacode
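Before registering a UDF with Pig, it is worth sanity-checking the logic with plain Python on a few sample values. The sketch below repeats the same length-based logic locally (the @outputSchema decorator is Pig-specific, so it is omitted here; the sample numbers are the ones from the previous slide):

```python
# Local sanity check of the area-code logic, run outside Pig.
def get_area_code(phone):
    areacode = "???"  # returned for unknown formats
    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]
    return areacode

for phone in ["(314) 555-1212", "212.555.9753",
              "405-555-3912", "555-1212"]:
    print(phone, "->", get_area_code(phone))
```

The last value deliberately matches neither format, so it falls through to the "???" placeholder.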

Invoking Python UDFs from Pig Latin

!Using this UDF from our Pig Latin is also easy
We saved our Python code as phonenumber.py
This Python file is in our current directory
Register it and give it an alias (here, phoneudf)

REGISTER 'phonenumber.py' USING jython AS phoneudf;

areacodes = FOREACH names GENERATE
            phoneudf.get_area_code(phone) AS ac;

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Open Source UDFs

!Pig ships with a set of community-contributed UDFs called Piggy Bank
!Another popular package of UDFs, called DataFu, has been open-sourced

Piggy Bank

!Piggy Bank ships with Pig
You will need to register the piggybank.jar file
The location may vary depending on source and version
In CDH on our VMs, it is at /usr/lib/pig/piggybank.jar
!Some UDFs in Piggy Bank include (package names omitted for brevity)

Class Name      Description
ISOToUnix       Converts an ISO 8601 date/time format to UNIX format
UnixToISO       Converts a UNIX date/time format to ISO 8601 format
LENGTH          Returns the number of characters in the supplied string
HostExtractor   Returns the host name from a URL
DiffDate        Returns number of days between two dates
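The calculation behind a UDF like DiffDate is simple to reason about. Here is a hedged Python sketch of the same idea, number of days between two dates, using the standard library (this is an illustration of the concept, not the Piggy Bank Java source; the function name and date format are our own choices):

```python
from datetime import datetime

def diff_date(date1, date2, fmt="%Y-%m-%d"):
    """Days between two dates (date1 - date2), in the spirit
    of Piggy Bank's DiffDate."""
    d1 = datetime.strptime(date1, fmt)
    d2 = datetime.strptime(date2, fmt)
    return (d1 - d2).days

print(diff_date("2013-03-01", "2013-01-01"))  # 59
```

A negative result simply means the first date is earlier than the second.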

DataFu

!DataFu does not ship with Pig, but is part of CDH 4.1.0 and later
You will need to register the DataFu JAR file
In the VM, it is at /usr/lib/pig/datafu-0.0.4-cdh4.2.0.jar
!Some UDFs in DataFu include (package names omitted for brevity)

Class Name             Description
Quantile               Calculates quantiles for a data set
Median                 Calculates the median for a data set
Sessionize             Groups data into sessions based on a
                       specified time window
HaversineDistInMiles   Calculates distance in miles between two
                       points, given latitude and longitude

Using a Contributed UDF

!Here is an example of using a UDF from DataFu to calculate distance

Input data
37.789336 -122.401385 40.707555 -74.011679

Pig Latin
REGISTER '/usr/lib/pig/datafu-*.jar';
DEFINE DIST datafu.pig.geo.HaversineDistInMiles;
places = LOAD 'data' AS (lat1:double, lon1:double,
                         lat2:double, lon2:double);
dist = FOREACH places GENERATE DIST(lat1, lon1, lat2, lon2);
DUMP dist;

Output data
(2564.207116295711)
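As a cross-check on the output above, the haversine great-circle formula itself is easy to sketch in Python. This is a hand-rolled version, not the DataFu implementation, and the Earth-radius constant is an assumption (mean radius in miles), so small numerical differences from the UDF's result are expected:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.76  # assumed mean Earth radius in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Same coordinates as the input data above (San Francisco to New York)
print(haversine_miles(37.789336, -122.401385, 40.707555, -74.011679))
```

The result lands within a few miles of the 2564.2 that DataFu printed, which is what we would expect from two haversine implementations with slightly different radius constants.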

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Processing Data with an External Script

!Pig allows you to stream data through another language for processing
This is done using the STREAM keyword
Data is supplied to the script on standard input as tab-delimited fields
The script writes results to standard output as tab-delimited fields

STREAM Example in Python (1)

!Our example will calculate a user's age given that user's birthdate
This calculation is done in a Python script named agecalc.py
!Here is the corresponding Pig Latin code
Backticks are used to quote the script name following the alias
Single quotes are used for quoting the script name within SHIP
The schema for the data produced by the script follows the AS keyword

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;

STREAM Example in Python (2)

!Python code for agecalc.py

#!/usr/bin/env python

import sys
from datetime import datetime

for line in sys.stdin:
    line = line.strip()
    (name, birthdate) = line.split("\t")

    d1 = datetime.strptime(birthdate, '%Y-%m-%d')
    d2 = datetime.now()
    age = (d2 - d1).days / 365  # approximate age in whole years

    print "%s\t%i" % (name, age)

STREAM Example in Python (3)

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;

Input data              Output data
andy    1963-11-15      (andy,49)
betty   1985-12-30      (betty,27)
chuck   1979-02-23      (chuck,34)
debbie  1982-09-19      (debbie,30)

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Hands-On Exercise: Extending Pig with Streaming and UDFs

!In this Hands-On Exercise, you will process data with an external script and a
user-defined function.

Chapter Topics

Extending Pig

!! Macros and Imports
!! UDFs
!! Contributed Functions
!! Using Other Languages to Process Data with Pig
!! Hands-On Exercise: Extending Pig with Streaming and UDFs
!! Conclusion

Essential Points

!Pig supports several extension mechanisms
!Parameters and macros can help make your code more reusable
And easier to maintain and share with others
!Piggy Bank and DataFu are two examples of open source UDF collections
You can also write your own UDFs
!It is also possible to embed Pig within another language

Bibliography

The following offer more information on topics discussed in this chapter
!Documentation on Parameter Substitution in Pig
http://tiny.cloudera.com/dac07a
!Documentation on Macros in Pig
http://tiny.cloudera.com/dac07b
!Documentation on User-Defined Functions in Pig
http://tiny.cloudera.com/dac07c
!Documentation on Piggy Bank
http://tiny.cloudera.com/dac07d
!Introducing DataFu
http://tiny.cloudera.com/dac07e

Pig Troubleshooting and Optimization
Chapter 8

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Pig Troubleshooting and Optimization

In this chapter, you will learn
!How to use SAMPLE and ILLUSTRATE to test and debug Pig jobs
!How Pig creates MapReduce jobs from your Pig Latin code
!How several simple changes to your Pig Latin code can make it run faster

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Troubleshooting Overview

!We have now covered how to use Pig for data analysis
Unfortunately, sometimes your code may not work as you expect
!Here we will cover some techniques for isolating and resolving problems
We will start with a few options to the pig command

Helping Yourself

!We will discuss some options for the pig command in this chapter
You can view all of them by using the -h (help) option
!One useful option is -c (check), which validates the syntax of your code

$ pig -c myscript.pig
myscript.pig syntax OK

!Another is -dryrun, which performs parameter and macro substitution
without actually running the script

$ pig -p INPUT=demodata -dryrun myscript.pig

Creates a myscript.pig.substituted file in the current directory

Getting Help from Others

!Sometimes you may need help from others
Mailing lists or newsgroups
Forums and bulletin board sites
Support services
!When asking for help, mention which version of Pig you are using

$ pig -version
Apache Pig version 0.10.0-cdh4.2.0

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Customizing Log Messages

!You may wish to change how much information is logged
!Edit the /etc/pig/conf/log4j.properties file to include:

log4j.logger.org.apache.pig=ERROR

!Edit the /etc/pig/conf/pig.properties file to set this property:

log4jconf=/etc/pig/conf/log4j.properties

Customizing Log Messages on a Per-Job Basis

!Often you just want to temporarily change the log level
Especially while trying to troubleshoot a problem with your script
!You can specify a Log4J properties file to use when you invoke Pig
This overrides the default Log4J configuration
!Create a customlog.properties file to include:

log4j.logger.org.apache.pig=DEBUG

!Specify this file via the -log4jconf argument to Pig

$ pig -log4jconf customlog.properties

Controlling Client-Side Log Files

!When a job fails, Pig may produce a log file to explain why
These are typically produced in your current directory
!To use a different location, use the -l (log) option when starting Pig

$ pig -l /tmp

!Or set it permanently by editing /etc/pig/conf/pig.properties
Specify a different directory using the log.file property

log.file=/tmp

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

This allows us to easily see cluster and job status with a browser
In pseudo-distributed mode, the hostname is localhost

HDFS   NameNode          http://hostname:50070/
       DataNode          http://hostname:50075/
MR1    JobTracker        http://hostname:50030/
MR2    ResourceManager   http://hostname:8088/
       NodeManager       http://hostname:8042/

The JobTracker Web UI (1)

The JobTracker Web UI (2)

!The JobTracker Web UI also shows historical information

The JobTracker Web UI (3)

!The job detail page can help you troubleshoot a problem

Naming Your Job

!Your cluster is likely shared with other users
There might be dozens or hundreds of others using it
As a result, sometimes it is hard to find your job in the Web UI
!We recommend setting a name in your scripts to help identify your jobs
Set the job.name property, either in Grunt or your script

grunt> set job.name 'Q2 2013 Sales Reporter';

Killing a Job

!A job that processes a lot of data can take hours to complete
Sometimes you spot an error in your code just after submitting a job
Rather than wait for the job to complete, you can kill it
!First, find the Job ID on the front page of the JobTracker Web UI

!Then, use the kill command in Pig along with that Job ID

grunt> kill job_201303151454_0028

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Optional Demo: Overview

!Time permitting, your instructor will now demonstrate how to use the
JobTracker's Web UI to isolate a bug in our code that causes a job to fail

$ cd ~/training_materials/analyst/webuidemo

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Using SAMPLE to Create a Smaller Data Set

!Your code might process terabytes of data in production
However, it is convenient to test with smaller amounts during
development
!Use SAMPLE to choose a random set of records from a data set
The STORE statement below saves them in a new directory called mysample

subset = SAMPLE everything 0.05;
STORE subset INTO 'mysample';

Intelligent Sampling with ILLUSTRATE

!Sometimes a random sample may lack data needed for testing
For example, matching records in two data sets for a JOIN operation
!Pig's ILLUSTRATE keyword can do more intelligent sampling
Pig will examine the code to determine what data is needed
It picks a few records that properly exercise the code
!You should specify a schema when using ILLUSTRATE
Pig will generate records when yours don't suffice
Using ILLUSTRATE Helps You to Understand Data Flow

!Like DUMP and DESCRIBE, ILLUSTRATE aids in debugging
The syntax is the same for all three

grunt> allsales = LOAD 'sales' AS (name:chararray, price:int);
grunt> bigsales = FILTER allsales BY price > 999;
grunt> ILLUSTRATE bigsales;
(Bob,3625)
--------------------------------------------------
| allsales | name:chararray | price:int |
--------------------------------------------------
| | Bob | 3625 |
| | Bob | 998 |
--------------------------------------------------
--------------------------------------------------
| bigsales | name:chararray | price:int |
--------------------------------------------------
| | Bob | 3625 |
--------------------------------------------------

General Debugging Strategies

!Use DUMP, DESCRIBE, and ILLUSTRATE often
The data might not be what you think it is
!Look at a sample of the data
Use -dryrun to see the script after parameters and macros are
processed
Test external scripts (STREAM) by passing some data from a local file

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Performance Overview

!We have discussed several techniques for finding errors in Pig Latin code
Once you get your code working, you'll often want it to work faster
!Most performance-tuning topics are beyond the scope of this course
We'll cover the basics and offer several performance improvement tips
See Programming Pig (chapters 7 and 8) for detailed coverage

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

How Pig Latin Becomes a MapReduce Job

!However, Pig does not translate your code into Java MapReduce
Much like relational databases don't translate SQL to C language code
Like a database, Pig interprets the Pig Latin to develop execution plans
!The EXPLAIN keyword details Pig's three execution plans
Logical
Physical
MapReduce
!Seeing an example job will help us better understand EXPLAIN's output

Description of Our Example Code and Data

!Our goal is to produce a list of per-store sales

stores:
A Anchorage
B Boston
C Chicago
D Dallas
E Edmonton
F Fargo

sales:
A 1999
D 2399
A 4579
B 6139
A 2489
B 3699
E 2479
D 5799

grunt> stores = LOAD 'stores'
                AS (store_id:chararray, name:chararray);
grunt> sales = LOAD 'sales'
               AS (store_id:chararray, price:int);
grunt> groups = GROUP sales BY store_id;
grunt> totals = FOREACH groups GENERATE group,
                SUM(sales.price) AS amount;
grunt> joined = JOIN totals BY group,
                stores BY store_id;
grunt> result = FOREACH joined
                GENERATE name, amount;
grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)

Using the EXPLAIN Keyword

!Using EXPLAIN rather than DUMP will show the execution plans

(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)

grunt> EXPLAIN result;

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
result: (Name: LOStore Schema:
stores::name#49:chararray,totals::amount#70:long)
|
|---result: (Name: LOForEach Schema:
stores::name#49:chararray,totals::amount#70:long)

(other lines, including physical and MapReduce plans, would follow)

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Pig's Runtime Optimizations

!Pig does not necessarily run your statements exactly as you wrote them
!It may remove operations for efficiency

sales = LOAD 'sales' AS (store_id:chararray, price:int);
unused = FILTER sales BY price > 789;
DUMP sales;

!It may also rearrange operations for efficiency

grouped = GROUP sales BY store_id;
totals = FOREACH grouped GENERATE group, SUM(sales.price);
joined = JOIN totals BY group, stores BY store_id;
only_a = FILTER joined BY store_id == 'A';
DUMP only_a;

Optimizations You Can Make in Your Pig Latin Code

!Pig's optimizer does what it can to improve performance
But you know your own code and data better than it does
!On the next few slides, we will rewrite this Pig code for performance

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Don't Produce Output You Don't Really Need

!In this case, we forgot to remove the DUMP statement
This sometimes happens when moving from development to production
And it might go unnoticed if you're not watching the terminal

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Specify a Schema Whenever Possible

!The postcode and phone fields in the stores data set were also never used
Eliminating them in our schema ensures they'll be omitted in joined

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
joined = JOIN sales BY store_id, stores BY store_id;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Filter Unwanted Data As Early As Possible

!We previously did our JOIN before our FILTER
Moving the FILTER operation up makes our script more efficient
Caveat: We now have to filter by store ID rather than store name

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

However, it is often beneficial to set a value explicitly in your script

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Specify the Smaller Data Set First in a Join

!We can optimize joins by specifying the larger data set last
In our case, we have far more records in sales than in stores
Changing the order in the JOIN statement can boost performance

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';

Try Using Compression on Intermediate Data

!Pig scripts often yield jobs with both a Map and a Reduce phase
Remember that Mapper output becomes Reducer input
Compressing this intermediate data is easy and can boost performance

set mapred.compress.map.output true;
set mapred.map.output.compression.codec
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE
         FLATTEN(joined.stores::name) AS name,
         SUM(joined.sales::price) AS amount;
... (other lines unchanged, but removed for brevity) ...

A Few More Tips for Improving Performance

!Main theme: Eliminate unnecessary data as early as possible
Use FOREACH ... GENERATE to select just those fields you need
Use ORDER BY and LIMIT when you only need a few records
Use DISTINCT when you don't need duplicate records
!Dropping records with NULL keys before a join can boost performance
These records will be eliminated in the final output anyway
Use FILTER to remove records with null keys before the join

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
nonnull_stores = FILTER stores BY store_id IS NOT NULL;
nonnull_sales = FILTER sales BY store_id IS NOT NULL;
joined = JOIN nonnull_stores BY store_id, nonnull_sales BY store_id;

Chapter Topics

Pig Troubleshooting and Optimization

!! Troubleshooting Pig
!! Logging
!! Optional Demo: Troubleshooting a Failed Job with the Web UI
!! Data Sampling and Debugging
!! Performance Overview
!! Understanding the Execution Plan
!! Tips for Improving the Performance of Your Pig Jobs
!! Conclusion

Essential Points

!You can boost performance by eliminating unneeded data during
processing
!Pig's error messages don't always clearly identify the source of a problem
We recommend testing your scripts with a small data sample
!The resources listed on the upcoming bibliography slide may further assist
you in solving problems

Bibliography

The following offer more information on topics discussed in this chapter
http://tiny.cloudera.com/dac08a
!Pig Testing and Diagnostics
http://tiny.cloudera.com/dac08b
!Mailing List for Pig Users
http://tiny.cloudera.com/dac08c
!Questions Tagged with Pig on StackOverflow
http://tiny.cloudera.com/dac08d
!Questions Tagged with PigLatin on StackOverflow
http://tiny.cloudera.com/dac08e

Introduction to Hive
Chapter 9

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Introduction to Hive

In this chapter, you will learn
!What Hive is
!How Hive differs from a relational database
!Ways in which organizations use Hive
!How to invoke and interact with Hive

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Overview of Apache Hive

!Apache Hive is a high-level abstraction on top of MapReduce
Uses a SQL-like language called HiveQL
Now an open-source Apache project

SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON customers.cust_id = orders.cust_id
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;
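Because HiveQL is deliberately close to SQL, the same query shape runs nearly unchanged against an ordinary relational database. The sketch below demonstrates this with Python's built-in sqlite3 module; the table and column names mirror the example above, but the rows are invented sample data, not anything from the course datasets:

```python
import sqlite3

# Build two tiny in-memory tables mirroring the example query's schema.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE customers (cust_id INTEGER, zipcode TEXT)")
c.execute("CREATE TABLE orders (cust_id INTEGER, cost INTEGER)")
c.executemany("INSERT INTO customers VALUES (?, ?)",
              [(1, "63101"), (2, "63102"), (3, "90210")])
c.executemany("INSERT INTO orders VALUES (?, ?)",
              [(1, 50), (1, 25), (2, 40), (3, 99)])

# The same join/filter/aggregate/sort shape as the HiveQL example.
c.execute("""
    SELECT zipcode, SUM(cost) AS total
    FROM customers
    JOIN orders ON customers.cust_id = orders.cust_id
    WHERE zipcode LIKE '63%'
    GROUP BY zipcode
    ORDER BY total DESC
""")
rows = c.fetchall()
print(rows)  # [('63101', 75), ('63102', 40)]
```

The difference is not the language but the execution engine: Hive turns this statement into MapReduce jobs over HDFS files instead of executing it against local B-tree storage.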

High-Level Overview for Hive Users

!Hive runs on the client machine
Turns HiveQL queries into MapReduce jobs
Submits those jobs to the cluster

SELECT zipcode, SUM(cost) AS total
FROM customers JOIN orders
ON customers.id = orders.cid
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;

Parse HiveQL
Make optimizations
Plan execution
Generate MapReduce jobs
Monitor progress

Why Use Apache Hive?

!More productive than writing MapReduce directly
Five lines of HiveQL might be equivalent to 100 lines or more of Java
No software development experience required
Leverage existing knowledge of SQL
!Offers interoperability with other systems
Extensible through Java and external scripts

Where to Get Hive

!Hive is included in CDH, alongside HBase, Oozie, and other ecosystem
components
Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
Simple installation
100% free and open source
!Installation is outside the scope of this course

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

!Hive's queries operate on tables, just like in an RDBMS
A table is simply an HDFS directory containing one or more files
Default path: /user/hive/warehouse/<table_name>
Hive supports many formats for data storage and retrieval
!How does Hive know the structure and location of tables?
These are specified when tables are created
This metadata is kept in Hive's metastore, contained in an RDBMS such as MySQL
!Hive consults the metastore to determine data format and location
The query itself operates on data stored on a filesystem (typically HDFS)

(Diagram: step 1, the query asks the metastore for the table structure and
the location of the data; step 2, the query reads the actual data from the
Hive tables in HDFS)
Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Your Cluster is Not a Database Server

!Client-server database management systems have many strengths
Very fast response time
Support for transactions
Allows modification of existing records
Can serve thousands of simultaneous clients
!Hive offers none of these capabilities
It simply produces MapReduce jobs from HiveQL queries
Limitations of HDFS and MapReduce still apply

Comparing Hive to a Relational Database

                            Relational Database   Hive
Query language              SQL                   HiveQL
Update individual records   Yes                   No
Delete individual records   Yes                   No
Transactions                Yes                   No
Index support               Extensive             Limited
Latency                     Very low              High
Data size                   Terabytes             Petabytes

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Use Case: Log File Analytics

!Server log files are an important source of data
!Hive allows you to treat a directory of log files like a table
Allows SQL-like queries against raw data

Dualcore Inc. Public Web Site (June 1 - 8)

Product    Unique Visitors   Page Views   Average Time on Page   Bounce Rate   Conversion Rate
Tablet     5,278             5,894        17 seconds             23%           65%
Notebook   4,139             4,375        23 seconds             47%           31%
Stereo     2,873             2,981        42 seconds             61%           12%
Monitor    1,749             1,862        26 seconds             74%           19%
Router     987               1,139        37 seconds             56%           17%
Server     314               504          53 seconds             48%           28%
Printer    86                97           34 seconds             27%           64%

Use Case: Sentiment Analysis

!Many organizations use Hive to analyze social media coverage

[Chart: Mentions of Dualcore on Social Media (by Hour), broken down into Negative, Neutral, and Positive mentions from hour 07 through 18]

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Using the Hive Shell

!You can execute HiveQL statements in the Hive shell
  This interactive tool is similar to the MySQL shell
!Run the hive command to start the Hive shell
  The Hive shell will display its hive> prompt
  Each statement must be terminated with a semicolon
  Use the quit command to exit the Hive shell

\$ hive
hive> SELECT cust_id, fname, lname
FROM customers WHERE zipcode=20525;
1000000 Quentin Shepard
1000001 Brandon Louis
1000002 Marilyn Ham
hive> quit;
\$

Accessing Hive from the Command Line

!You can also execute a file containing HiveQL code using the -f option

$ hive -f myquery.hql

!Or use HiveQL directly from the command line using the -e option

$ hive -e 'SELECT * FROM users'

!Use the -S (silent) option to suppress informational messages
  Can also be used with the -e or -f options

$ hive -S

Hive Properties

!Many aspects of Hive's behavior are configured through properties
  Use set -v in Hive to see current values

hive> set -v;

!You can also use set to specify property values

!Hive runs the .hiverc file in your home directory at startup
  Useful for specifying per-user defaults

Interacting with the Operating System and HDFS

!Use ! to execute system commands from within Hive
  Neither pipes nor globs (wildcards) are supported

hive> ! date;
Mon May 20 16:44:35 PDT 2013

!Prefix HDFS commands with dfs to use them from within Hive

hive> dfs -mkdir /reports/sales/2013;

Accessing Hive with Hue

!Alternatively, you can access Hive through Hue
!To use Hue, browse to http://hue_server:8888/
  May need to start the Hue service first (sudo service hue start)
!Hue's Hive interface is called Beeswax
  Launch by clicking its icon
  [Screenshot: Beeswax icon in Hue]
!Beeswax features include:
  Creating tables
  Running queries
  Browsing tables
  Saving queries for later execution

Hue's Beeswax Query Editor

[Screenshot: the Hue Hive Query editor, showing the query SELECT * FROM employees WHERE state = 'CA';]

Query Execution in Beeswax

Query Results in Beeswax

Interacting with HiveServer2: Hive as a Service (1)

!HiveServer2 can be deployed to provide a centralized Hive service
  Uses a JDBC or ODBC connection
  Supports Kerberos authentication

[Diagram: the Beeline CLI and JDBC/ODBC applications connect to the HiveServer2 server, which uses a shared metastore and submits MapReduce jobs (mappers and reducers) to the cluster]

Interacting with HiveServer2: Hive as a Service (2)

!To connect to HiveServer2, use Hue or the Beeline CLI
  You cannot use the Hive shell
!Example: starting Beeline and connecting to HiveServer2

[training@localhost analyst]$ beeline
Beeline version 0.10.0-cdh4.2.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
         training mypwd org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)
beeline>

Chapter Topics

Introduction to Hive

!! What is Hive?
!! Hive Schema and Data Storage
!! Hive Use Cases
!! Interacting with Hive
!! Conclusion

Essential Points

!Hive is a high-level abstraction on top of MapReduce
!HiveQL is very similar to SQL
  Easy to learn for those with relational database experience
  However, Hive does not replace your RDBMS
!Hive tables are really directories of files in HDFS
!Hive and Pig are similar in many ways

Bibliography

The following offer more information on topics discussed in this chapter
!Hive Web Site
  http://hive.apache.org/
!Beeline CLI Reference
  http://sqlline.sourceforge.net/
!Programming Hive (book)
  http://tiny.cloudera.com/dac09a
  http://tiny.cloudera.com/dac09b
!Sentiment Analysis Using Apache Hive
  http://tiny.cloudera.com/dac09c

Relational Data Analysis with Hive
Chapter 10

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Relational Data Analysis with Hive

In this chapter, you will learn
!How to explore databases and tables in Hive
!How HiveQL syntax compares to SQL
!Which data types Hive supports
!Which types of join operations Hive supports and how to use them
!How to use many of Hive's built-in functions

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hive Tables

!Data for Hive's tables is stored on the filesystem (typically HDFS)
  Each table maps to a single directory
!A table's directory may contain multiple files
  Typically delimited text files, but Hive supports many formats
  Subdirectories are not allowed
!Hive uses the metastore to give context to this data
  Helps map raw data in HDFS to named columns of specific types

Hive Databases

!Each Hive table belongs to a specific database
!Early versions of Hive supported only a single database
  They placed all tables in the same database (named default)
  This is still the default behavior
!Hive supports multiple databases as of release 0.7.0

Exploring Hive Databases and Tables (1)

!Which databases are available?

hive> SHOW DATABASES;
accounting
default
sales

!Switch between databases with the USE command

$ hive
hive> SELECT * FROM customers;   -- queries the table in the default database
hive> USE sales;
hive> SELECT * FROM customers;   -- queries the table in the sales database

Exploring Hive Databases and Tables (2)

!Which tables does the current database contain?

hive> USE accounting;
hive> SHOW TABLES;
invoices
taxes

!Which tables are contained in a different database?

hive> SHOW TABLES IN sales;
customers
prospects

Exploring Hive Databases and Tables (3)

!The DESCRIBE command displays the basic structure for a table

hive> DESCRIBE orders;
order_id     int
cust_id      int
order_date   timestamp

!DESCRIBE FORMATTED shows more detailed information

hive> DESCRIBE FORMATTED orders;
# col_name     data_type   comment
order_id       int         None
cust_id        int         None
order_date     timestamp   None
# Detailed Table Information ... (more follows)

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

An Introduction to HiveQL

!HiveQL is Hive's query language
  Based on a subset of SQL-92, plus Hive-specific extensions
!Some limitations compared to standard SQL
  Some features are not supported
  Others are only partially implemented
!HiveQL also has some features not offered in SQL

HiveQL Basics

!Hive keywords are not case-sensitive
  Though they are often capitalized by convention
!Statements are terminated by a semicolon
  A statement may span multiple lines
!Comments begin with two dashes (--)
  Only supported in Hive scripts

$ cat myscript.hql
SELECT cust_id, fname, lname
FROM customers
WHERE zipcode='60601'; -- downtown Chicago

Selecting Data from Hive Tables

!The SELECT statement retrieves data from Hive tables
  Can specify an ordered list of individual columns

hive> SELECT cust_id, fname, lname FROM customers;

  An asterisk matches all columns in the table

hive> SELECT * FROM customers;

Limiting and Sorting Query Results

!The LIMIT clause sets the maximum number of rows returned

hive> SELECT fname, lname FROM customers LIMIT 10;

!Caution: no guarantee regarding which 10 results are returned
  Use ORDER BY for top-N queries
  The field(s) you ORDER BY must be selected

hive> SELECT cust_id, fname, lname FROM customers
      ORDER BY cust_id DESC LIMIT 10;

Using a WHERE Clause to Restrict Results

!WHERE clauses restrict rows to those matching specified criteria
  String comparisons are case-sensitive

hive> SELECT * FROM customers WHERE state
      IN ('CA', 'OR', 'WA', 'NV', 'AZ');

!You can combine expressions using AND or OR

hive> SELECT * FROM customers
      WHERE fname LIKE 'Ann%'
      AND (city='Seattle' OR city='Portland');

Table Aliases

!Table aliases can help simplify complex queries

hive> SELECT o.order_date, c.fname, c.lname
      FROM customers c JOIN orders o
      ON c.cust_id = o.cust_id
      WHERE c.zipcode='94306';

  Note: Using AS to specify table aliases is not supported

Combining Query Results with UNION ALL

!Unifies output from SELECTs into a single result set
  The name, order, and types of columns in each query must match
  Hive only supports UNION ALL

SELECT emp_id, fname, lname, salary
FROM employees
WHERE state='CA' AND salary > 75000
UNION ALL
SELECT emp_id, fname, lname, salary
FROM employees
WHERE state != 'CA' AND salary > 50000;

!UNION ALL can also be used with subqueries

Subqueries in Hive

!Hive only supports subqueries in the FROM clause of the SELECT statement
  It does not support correlated subqueries

SELECT prod_id, brand, name
FROM (SELECT *
      FROM products
      WHERE (price - cost) / price > 0.65
      ORDER BY price DESC
      LIMIT 10) high_profits
WHERE price > 1000
ORDER BY brand, name;

!Hive allows arbitrary levels of subqueries
  Each subquery must be named (like high_profits above)

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hive's Data Types

!Each column in Hive is associated with a data type
!Hive supports more than a dozen types
  Most are similar to ones found in relational databases
  Hive also supports three complex types
!Use the DESCRIBE command to see a table's column types

hive> DESCRIBE products;
prod_id       int
brand         string
name          string
price         int
cost          int
shipping_wt   int

Hive's Integer Types

!Integer types are appropriate for whole numbers
  Both positive and negative values allowed

Name       Description                                      Example Value
TINYINT    Range: -128 to 127                               17
SMALLINT   Range: -32,768 to 32,767                         5842
INT        Range: -2,147,483,648 to 2,147,483,647           84127213
BIGINT     Range: ~ -9.2 quintillion to ~ 9.2 quintillion   632197432180964

Hive's Decimal Types

!Decimal types are appropriate for floating point numbers
  Both positive and negative values allowed
  Caution: avoid using when exact values are required!

Name     Description             Example Value
FLOAT    Decimals                3.14159
DOUBLE   Very precise decimals   3.14159265358979323846
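The caution above applies to IEEE-754 floating point in general, not just Hive. A quick check in Python (whose float is a 64-bit double, the same representation as Hive's DOUBLE) shows why these types are a poor fit for values such as currency amounts:

```python
# Python floats are IEEE-754 doubles, the same representation as Hive's DOUBLE.
# Many decimal fractions cannot be stored exactly, so equality checks fail:
total = 0.1 + 0.2
print(total)           # 0.30000000000000004, not 0.3
print(total == 0.3)    # False

# Storing money as an integer number of cents avoids the problem entirely:
cents = 10 + 20
print(cents == 30)     # True
```

This is why the slide recommends integer types whenever exact values are required.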

Other Simple Types in Hive

!Hive can also store several other types of information
  Only one character type (variable length)

Name         Description          Example Value
STRING       Character sequence   Betty F. Smith
BOOLEAN      True or False        TRUE
TIMESTAMP*   Instant in time      2013-06-14 16:51:05
BINARY*      Raw bytes            N/A

* Not available in older versions of Hive

Complex Column Types in Hive

!Hive also has a few complex data types
  These are capable of holding multiple values

Column Type   Description                                    Usage in Query
ARRAY         Ordered list of values, all of the same type   departments[0]
MAP           Key-value pairs, each of the same type         employees['BF5314']
STRUCT        Named fields, of possibly mixed types          address.city

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Joins in Hive

!Joining disparate data sets is a common operation in Hive
!Hive supports several types of joins
  Inner joins
  Outer joins (left, right, and full)
  Cross joins (supported in Hive 0.10 and later)
  Left semi joins
!Only equality conditions are allowed in joins
  Valid:   customers.cust_id = orders.cust_id
  Invalid: customers.cust_id <> orders.cust_id
!The inner join is the most basic type
  Outputs records where the specified key is found in each table
!For best performance, list the largest table last in your query

Join Syntax

!Hive requires the following syntax for joins

SELECT c.cust_id, name, total
FROM customers c
JOIN orders o ON (c.cust_id = o.cust_id);

!The above example is an inner join
  Can replace JOIN with another type (e.g. RIGHT OUTER JOIN)
!The following join syntax is not valid in Hive

SELECT c.cust_id, name, total
FROM customers c, orders o
WHERE (c.cust_id = o.cust_id);

Inner Join Example

SELECT c.cust_id, name, total
FROM customers c
JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871

Left Outer Join Example

SELECT c.cust_id, name, total
FROM customers c
LEFT OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871
d        Dieter  NULL

Right Outer Join Example

SELECT c.cust_id, name, total
FROM customers c
RIGHT OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871
NULL     NULL    2137

Full Outer Join Example

SELECT c.cust_id, name, total
FROM customers c
FULL OUTER JOIN orders o
ON (c.cust_id = o.cust_id);

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
a        Alice   1539
a        Alice   6352
b        Bob     1456
c        Carlos  1871
d        Dieter  NULL
NULL     NULL    2137

Using an Outer Join to Find Unmatched Entries

SELECT c.cust_id, name, total
FROM customers c
FULL OUTER JOIN orders o
ON (c.cust_id = o.cust_id)
WHERE c.cust_id IS NULL
OR o.total IS NULL;

customers table              orders table
cust_id  name    country     order_id  cust_id  total
a        Alice   us          1         a        1539
b        Bob     ca          2         c        1871
c        Carlos  mx          3         a        6352
d        Dieter  de          4         b        1456
                             5         z        2137

Result of query
cust_id  name    total
d        Dieter  NULL
NULL     NULL    2137
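The unmatched-entries logic above can be sketched in Python using the same sample data. This is an illustration of the join semantics only, not how Hive executes the query:

```python
# Sample data from the slide: customers keyed by cust_id, orders as tuples
customers = {'a': 'Alice', 'b': 'Bob', 'c': 'Carlos', 'd': 'Dieter'}
orders = [(1, 'a', 1539), (2, 'c', 1871), (3, 'a', 6352),
          (4, 'b', 1456), (5, 'z', 2137)]

# A full outer join keeps rows from both sides; filtering for NULLs on
# either side leaves only the unmatched entries.
matched_cust = {cust_id for _, cust_id, _ in orders}
unmatched = []
for cust_id, name in customers.items():
    if cust_id not in matched_cust:        # customer with no orders
        unmatched.append((cust_id, name, None))
for _, cust_id, total in orders:
    if cust_id not in customers:           # order with no matching customer
        unmatched.append((None, None, total))

print(unmatched)   # [('d', 'Dieter', None), (None, None, 2137)]
```

The two rows printed match the slide's result: Dieter has no orders, and order 5 references an unknown customer.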

Cross Join Example

SELECT * FROM disks
CROSS JOIN sizes;

disks table            sizes table
name                   size
Internal hard disk     1.0 terabytes
External hard disk     2.0 terabytes
                       3.0 terabytes
                       4.0 terabytes

Result of query
name                 size
Internal hard disk   1.0 terabytes
Internal hard disk   2.0 terabytes
Internal hard disk   3.0 terabytes
Internal hard disk   4.0 terabytes
External hard disk   1.0 terabytes
External hard disk   2.0 terabytes
External hard disk   3.0 terabytes
External hard disk   4.0 terabytes
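A cross join is simply a Cartesian product: every row of one table paired with every row of the other. The same result can be sketched in Python with itertools.product:

```python
from itertools import product

disks = ['Internal hard disk', 'External hard disk']
sizes = ['1.0 terabytes', '2.0 terabytes', '3.0 terabytes', '4.0 terabytes']

# CROSS JOIN pairs every row of one table with every row of the other
result = list(product(disks, sizes))
print(len(result))   # 8 rows: 2 disks x 4 sizes
print(result[0])     # ('Internal hard disk', '1.0 terabytes')
```

Because the result grows multiplicatively with table sizes, cross joins should be used with care on large tables.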

Left Semi Joins (1)

!A less common type of join is the LEFT SEMI JOIN
  They are a special (and efficient) type of inner join
  They behave more like a filter than a join
  Only unique records that match these criteria are returned
  Fields listed in SELECT are limited to the left-side table

SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id
AND YEAR(o.order_date) = '2012');

Left Semi Joins (2)

!Hive does not support IN/EXISTS subqueries, so the following is invalid

SELECT c.cust_id FROM customers c
WHERE c.cust_id IN
(SELECT o.cust_id FROM orders o
WHERE YEAR(o.order_date) = '2012');

!Using a LEFT SEMI JOIN is a common workaround

SELECT c.cust_id
FROM customers c
LEFT SEMI JOIN orders o
ON (c.cust_id = o.cust_id
AND YEAR(o.order_date) = '2012');
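To see how a left semi join differs from an inner join, here is a small Python sketch of its filter-like semantics. The sample data is hypothetical; note that customer 'a' has two matching 2012 orders but still appears only once:

```python
customers = {'a': 'Alice', 'b': 'Bob', 'c': 'Carlos', 'd': 'Dieter'}
# Hypothetical orders: (order_id, cust_id, order year)
orders = [(1, 'a', 2012), (2, 'c', 2011), (3, 'a', 2012), (4, 'b', 2012)]

# A LEFT SEMI JOIN acts as a filter: each left-side row appears at most
# once, no matter how many right-side rows match (an inner join would
# repeat customer 'a' once per matching 2012 order).
match_keys = {cust_id for _, cust_id, year in orders if year == 2012}
semi_join = [cust_id for cust_id in customers if cust_id in match_keys]
print(semi_join)   # ['a', 'b'] -- each matching customer listed once
```

This is also why only left-side columns may appear in the SELECT list: no particular right-side row is chosen for the output.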

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hive Functions

!Hive offers dozens of built-in functions
  Many are identical to those found in SQL
  Others are Hive-specific
!Example function invocation
  Function names are not case-sensitive

hive> SELECT UPPER(fname) FROM customers;

!Use DESCRIBE FUNCTION to see a description of a function

hive> DESCRIBE FUNCTION UPPER;
UPPER(str) - Returns str with all characters
changed to uppercase

Example Built-in Functions (1)

!These functions operate on numeric values

Function Description                Example Invocation      Input    Output
Rounds to specified # of decimals   ROUND(total_price, 2)   23.492   23.49
Returns absolute value              ABS(temperature)        -49      49
Returns square root                 SQRT(area)              64       8
Returns a random number             RAND()                  N/A      0.584977

Example Built-in Functions (2)

!These functions operate on timestamp values

Function Description              Example Invocation            Input                   Output
Convert to UNIX format            UNIX_TIMESTAMP(order_dt)      2013-06-14 16:51:05     1371243065
Convert to string format          FROM_UNIXTIME(mod_time)       1371243065              2013-06-14 16:51:05
Extract date portion              TO_DATE(order_dt)             2013-06-14 16:51:05     2013-06-14
Extract year portion              YEAR(order_dt)                2013-06-14 16:51:05     2013
Returns # of days between dates   DATEDIFF(ship_dt, order_dt)   2013-06-17, 2013-06-14  3
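For readers more familiar with Python, the date-extraction functions above map closely onto the standard datetime module. This is a rough sketch of the semantics, not Hive's implementation (the UNIX timestamp conversion is omitted here because its result depends on the session time zone):

```python
from datetime import datetime, date

order_dt = datetime(2013, 6, 14, 16, 51, 5)

# TO_DATE(order_dt): keep only the date portion of the timestamp
print(order_dt.date())                    # 2013-06-14

# YEAR(order_dt): extract the year portion
print(order_dt.year)                      # 2013

# DATEDIFF(ship_dt, order_dt): whole days between two dates
ship_dt = date(2013, 6, 17)
print((ship_dt - order_dt.date()).days)   # 3
```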

Example Built-in Functions (3)

!Here are some other interesting functions

Function Description           Example Invocation   Input   Output
Converts to uppercase          UPPER(fname)         Bob     BOB
Returns size of array or map   SIZE(array_field)    N/A     6

Record Grouping and Aggregate Functions

!GROUP BY groups selected data by one or more columns
  Caution: Columns not part of the aggregation must be listed in GROUP BY

SELECT region, state,
       COUNT(id) AS num
FROM stores
GROUP BY region, state;

stores table
id   city      state   region
a    Albany    NY      EAST
b    Boston    MA      EAST
c    Chicago   IL      NORTH
d    Detroit   MI      NORTH
e    Elgin     IL      NORTH

Result of query
region   state   num
EAST     MA      1
EAST     NY      1
NORTH    IL      2
NORTH    MI      1
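The grouping above can be sketched in Python with collections.Counter, using the stores rows from the slide:

```python
from collections import Counter

stores = [('a', 'Albany', 'NY', 'EAST'), ('b', 'Boston', 'MA', 'EAST'),
          ('c', 'Chicago', 'IL', 'NORTH'), ('d', 'Detroit', 'MI', 'NORTH'),
          ('e', 'Elgin', 'IL', 'NORTH')]

# GROUP BY region, state with COUNT(id): count rows per (region, state) key
counts = Counter((region, state) for _, _, state, region in stores)
for (region, state), num in sorted(counts.items()):
    print(region, state, num)
```

Each distinct (region, state) pair becomes one output row, exactly as in the query result, with Illinois counted twice.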

Built-in Aggregate Functions

!Hive offers many aggregate functions, including

Function Description                                Example Invocation
Count all rows                                      COUNT(*)
Count all rows where field is not null              COUNT(fname)
Count all rows where field is unique and not null   COUNT(DISTINCT fname)
Returns the largest value                           MAX(salary)
Returns the smallest value                          MIN(salary)
Returns the average of all supplied values          AVG(salary)

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Hands-On Exercise: Running Hive Queries

!In this Hands-On Exercise, you will run Hive queries from the Hive shell, the command line, Hive scripts, and Hue.

Chapter Topics

Relational Data Analysis with Hive

!! Hive Databases and Tables
!! Basic HiveQL Syntax
!! Data Types
!! Joining Datasets
!! Common Built-in Functions
!! Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
!! Conclusion

Essential Points

!Every Hive table belongs to exactly one database
  The SHOW DATABASES command lists databases
  The USE command switches the active database
  The SHOW TABLES command lists all tables in a database
!Every column in a Hive table has an associated data type
  Most simple column types are similar to SQL
  Hive also supports a few complex types
!HiveQL syntax is familiar to those who know SQL
  A subset of SQL-92, plus Hive-specific extensions
  Supports inner, outer, and left semi joins
  Many SQL functions are built into Hive

Bibliography

The following offer more information on topics discussed in this chapter
!HiveQL Language Manual
  http://tiny.cloudera.com/dac10a
!Hive Built-In Functions
  http://tiny.cloudera.com/dac10b

Hive Data Management
Chapter 11

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Hive Data Management

In this chapter you will learn
!How Hive encodes and stores data
!How to create Hive databases, tables, and views
!How to alter and remove tables
!How to save query results into tables and files
!How to control access to data in Hive

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Hive Text File Format

!Each Hive table maps to a directory of data, typically in HDFS
  Data stored in one or more files
!Default data format is plain text
  One record per line (record separator is \n)
  Columns are delimited by ^A (field separator is Control-A)
  Complex type elements are separated by ^B (Control-B)
  Map keys/values are separated by ^C (Control-C)
!Hive allows you to override these delimiters
  Specify alternate values during table creation
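A short Python sketch of how a record in this default format breaks apart. The sample line and its field values are hypothetical, but the delimiters are Hive's defaults:

```python
# One record in Hive's default text format: fields separated by Control-A
# (\x01), elements of a complex type by Control-B (\x02), and map
# keys/values by Control-C (\x03). Sample values are illustrative.
line = "1\x01Data Analyst\x01Audio\x02Video\x01mgr\x03Bob"

fields = line.split("\x01")
print(fields[1])                       # Data Analyst
array_col = fields[2].split("\x02")    # an ARRAY<STRING> column
print(array_col)                       # ['Audio', 'Video']
k, v = fields[3].split("\x03")         # one entry of a MAP column
print({k: v})                          # {'mgr': 'Bob'}
```

Control characters rarely occur in real field values, which is why they make safer default delimiters than commas or tabs.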

Hive Binary File Formats

!Hive also supports a few binary formats
  Pro: may offer better performance than text files
  Avoids the need to convert all data to/from strings
  Supports block- and record-level data compression
!RCFile is a binary format created especially for Hive
  RC stands for Record Columnar
  Column-oriented format efficient for some queries
  Otherwise, very similar to sequence files

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Creating Databases in Hive

!Hive databases are simply namespaces
  Helps to organize your tables
!To create a new database

hive> CREATE DATABASE dualcore;

!To conditionally create a new database

hive> CREATE DATABASE IF NOT EXISTS dualcore;

Creating a Table in Hive (1)

!Basic syntax for creating a table:

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

!Creates a subdirectory under /user/hive/warehouse in HDFS
  This is Hive's warehouse directory

Creating a Table in Hive (2)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

Specify a name for the table, and list the column names and datatypes (see later)

Creating a Table in Hive (3)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

This line states that fields in each file in the table's directory are delimited by some character. Hive's default delimiter is Control-A, but you may specify an alternate delimiter...

Creating a Table in Hive (4)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

...for example, tab-delimited data would require that you specify FIELDS TERMINATED BY '\t'

Creating a Table in Hive (5)

CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE|SEQUENCEFILE|RCFILE}

Finally, you may declare the file format. STORED AS TEXTFILE is the default and does not need to be specified

Example Table Definition

!The following example creates a new table named jobs
  Data stored as text with four comma-separated fields per line

CREATE TABLE jobs
(id INT,
title STRING,
salary INT,
posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Example of a corresponding record for the table above

1,Data Analyst,100000,2013-06-21 15:52:03

Complex Column Types

!Recap: Hive has support for complex data types

Column Type   Description
ARRAY         Ordered list of values, all of the same type
MAP           Key-value pairs, each of the same type
STRUCT        Named fields, of possibly mixed types

Creating Tables with Complex Column Types

!Complex columns are typed
  Arrays have a single type
  Maps have a key type and a value type
  Structs have a type for each attribute
!These types are specified with angle brackets

CREATE TABLE stores
(store_id SMALLINT,
departments ARRAY<STRING>,
staff MAP<STRING, STRING>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zipcode:STRING>
);

Customizing Complex Type Storage

!We can control which delimiters are used for complex types

CREATE TABLE stores
(store_id SMALLINT,
departments ARRAY<STRING>,
staff MAP<STRING, STRING>,
address STRUCT<street:STRING,
city:STRING,
state:STRING,
zipcode:STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

Row Format Example for Complex Types

!The following is an example of a record in that table

[Table rendering lost: a sample record with fields id, departments, staff, street, city, state, and zipcode; the departments array includes 'Audio', the staff map includes 'Bob', and the city is 'Ely']

!Example queries on this record

hive> SELECT address.city FROM stores;
Ely
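A Python sketch of how such a record splits under the declared delimiters ('|' between fields, ',' between collection items, ':' between map keys and values). The sample line is hypothetical, constructed around the values (Audio, Bob, Ely) that survive from the original slide:

```python
# Hypothetical record for the stores table, using the delimiters declared
# in the DDL above: '|' between fields, ',' between collection items,
# ':' between map keys and values.
line = "95|Audio,Video|b7:Bob|123 Main St,Ely,NV,89301"

store_id, departments, staff, address = line.split("|")
departments = departments.split(",")                      # ARRAY<STRING>
staff = dict(kv.split(":") for kv in staff.split(","))    # MAP<STRING,STRING>
street, city, state, zipcode = address.split(",")         # STRUCT fields

print(departments)    # ['Audio', 'Video']
print(staff['b7'])    # Bob
print(city)           # Ely
```

Note that struct attributes, like array elements, are separated by the collection item delimiter.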

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Data Validation in Hive

!Unlike an RDBMS, Hive does not validate data on insert
  Files are simply moved into place
  Errors in file format will be discovered when queries are performed
!Missing or invalid data in Hive will be represented as NULL

Loading Data into Hive

!One way to add data to a table is to move files into its directory in HDFS

$ hadoop fs -mv sales.txt /user/hive/warehouse/sales/

!The LOAD DATA INPATH command does the same thing
  Done from within the Hive shell (or a Hive script)
  This moves data within HDFS, just like the command above
  Source can be either a file or directory

hive> LOAD DATA INPATH 'sales.txt' INTO TABLE sales;

!Add the LOCAL keyword to load data from the local filesystem
  Copies a local file (or directory) to the table's directory in HDFS

hive> LOAD DATA LOCAL INPATH '/home/bob/sales.txt'
      INTO TABLE sales;

  Equivalent to:

$ hadoop fs -put /home/bob/sales.txt \
    /user/hive/warehouse/sales

!Add the OVERWRITE keyword to replace existing data
  Removes all files within the table's directory
  Then moves the new files into that directory

hive> LOAD DATA INPATH '/depts/finance/salesdata'
      OVERWRITE INTO TABLE sales;

Importing Data with Sqoop

!Sqoop has built-in support for importing data into Hive
  Creates the table in Hive (metastore)
  Imports data from the RDBMS to the table's directory in HDFS

$ sqoop import \
    --connect jdbc:mysql://localhost/dualcore \
    --fields-terminated-by '\t' \
    --table employees \
    --hive-import

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Removing a Database

!Removing a database is similar to creating it
  Just replace CREATE with DROP

hive> DROP DATABASE IF EXISTS dualcore;

!These commands will fail if the database contains tables
  Add the CASCADE keyword to drop the database's tables first
  Caution: this command removes data in HDFS!

hive> DROP DATABASE dualcore CASCADE;

Removing a Table

!Syntax is similar to database removal

hive> DROP TABLE IF EXISTS customers;

!Caution: These commands can remove data in HDFS!
  Hive does not have a rollback or undo feature

Renaming Tables and Columns

!Rename an existing table
  This command does not change data in HDFS

hive> ALTER TABLE customers RENAME TO clients;

!Rename a column by specifying its old and new names
  Type must be specified even though it is unchanged

hive> ALTER TABLE clients
      CHANGE fname first_name STRING;
          -- old   new        type

!You can also modify a column's type
  The old and new column names will be the same
  You must ensure data in HDFS conforms to the new type

Adding Columns

!Use ADD COLUMNS to add new columns to a table
  Appended to the end of any existing columns
  Existing data will have NULL values for new columns

hive> ALTER TABLE jobs
      ADD COLUMNS (city STRING, bonus INT);

Reordering and Removing Columns

!Use AFTER or FIRST to reorder columns
  Ensure that your data in HDFS matches the new order

hive> ALTER TABLE jobs
      CHANGE bonus bonus INT AFTER salary;

hive> ALTER TABLE jobs
      CHANGE bonus bonus INT FIRST;

!Use REPLACE COLUMNS to remove columns

hive> ALTER TABLE jobs REPLACE COLUMNS
      (id INT,
       title STRING,
       salary INT);

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Controlling Table Data Location

!Hive stores data associated with a table in its warehouse directory
!Storing data below Hive's warehouse is not always ideal
  Data might be shared by several users
!Use LOCATION to specify the directory where the table's data resides

CREATE TABLE tablename
(campaign_id STRING,
when TIMESTAMP,
keyword STRING,
site STRING,
placement STRING,
was_clicked BOOLEAN,
cost SMALLINT)
LOCATION '/path/to/data';

Self-Managed (External) Tables

!Recall that dropping a table removes its data in HDFS
!Using EXTERNAL when creating the table avoids this behavior
  Dropping an external table removes only its metadata

CREATE EXTERNAL TABLE tablename
(campaign_id STRING,
click_time TIMESTAMP,
keyword STRING,
site STRING,
placement STRING,
was_clicked BOOLEAN,
cost SMALLINT)
LOCATION '/path/to/data';

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Simplifying Complex Queries

!Complex queries can become cumbersome
  Imagine typing this several times for different orders

SELECT o.order_id, order_date, p.prod_id, brand, name
FROM orders o
JOIN order_details d
ON (o.order_id = d.order_id)
JOIN products p
ON (d.prod_id = p.prod_id)
WHERE o.order_id=6584288;

Creating Views

!Views in Hive are conceptually like a table, but backed by a query

CREATE VIEW order_info AS
SELECT o.order_id, order_date, p.prod_id, brand, name
FROM orders o
JOIN order_details d
ON (o.order_id = d.order_id)
JOIN products p
ON (d.prod_id = p.prod_id);

!Our query is now greatly simplified

hive> SELECT * FROM order_info WHERE order_id=6584288;

Inspecting and Removing Views

!Use DESCRIBE FORMATTED to see the underlying query

hive> DESCRIBE FORMATTED order_info;

!Use DROP VIEW to remove a view

hive> DROP VIEW order_info;

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Saving Query Output to a Table

! SELECT statements display their results on screen
!To send results to a Hive table, use INSERT OVERWRITE TABLE
Existing contents will be deleted

## hive> INSERT OVERWRITE TABLE ny_customers

SELECT * FROM customers
WHERE state = 'NY';

!Use INSERT INTO TABLE to append results instead of overwriting

## hive> INSERT INTO TABLE ny_customers

SELECT * FROM customers
WHERE state = 'NJ' OR state = 'CT';

Creating Tables Based on Existing Data

!Hive supports creating a table based on a SELECT statement
Often known as Create Table As Select (CTAS)

## CREATE TABLE ny_customers AS

# SELECT cust_id, fname, lname FROM customers
WHERE state = 'NY';

!Column definitions are derived from the existing table
!Column names are inherited from the existing names
Use aliases in the SELECT statement to specify new names

Writing Output To a Filesystem

!You can save output to a file in HDFS

## hive> INSERT OVERWRITE DIRECTORY '/dualcore/ny/'

SELECT * FROM customers
WHERE state = 'NY';

!Use LOCAL DIRECTORY to write to the local filesystem instead

## hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/bob/ny/'

SELECT * FROM customers
WHERE state = 'NY';

!Both produce text files delimited by Ctrl-A characters
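Downstream scripts can consume this output by splitting on the Ctrl-A byte. A minimal Python sketch (the sample row is illustrative):

```python
# Hive's default output delimiter is the non-printing Ctrl-A character (\x01).
def parse_hive_row(line):
    """Split one line of Hive text output into its fields."""
    return line.rstrip("\n").split("\x01")

row = parse_hive_row("1000001\x01Alice\x01Smith\x01NY\n")
print(row)  # ['1000001', 'Alice', 'Smith', 'NY']
```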

Writing Output To HDFS, Specifying Format

!To write the files to HDFS with a user-specified format:
Create an external table in the required format
Use INSERT OVERWRITE TABLE

## hive> CREATE EXTERNAL TABLE ny_customers

(cust_id INT,
fname STRING,
lname STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dualcore/nydata';

## hive> INSERT OVERWRITE TABLE ny_customers

SELECT cust_id, fname, lname
FROM customers WHERE state = 'NY';

Multi-Table Insert (1)

!We just saw that you can save output to an HDFS file

## INSERT OVERWRITE DIRECTORY 'ny_customers'

SELECT cust_id, fname, lname
FROM customers WHERE state = 'NY';

!This query could also be written as follows

FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_customers'
SELECT cust_id, fname, lname WHERE state='NY';

Multi-Table Insert (2)

!We sometimes need to extract data to multiple tables
Hive SELECT queries can take a long time to complete
!Hive allows us to do this with a single query
Much more efficient than using multiple queries
!The following example demonstrates multi-table insert
Result is two directories in HDFS

FROM customers c
INSERT OVERWRITE DIRECTORY 'ny_names'
SELECT fname, lname WHERE state = 'NY'
INSERT OVERWRITE DIRECTORY 'ny_count'
SELECT count(DISTINCT cust_id)
WHERE state = 'NY';

Multi-Table Insert (3)

!The following query produces the same result

## FROM (SELECT * FROM customers WHERE state='NY') nycust

INSERT OVERWRITE DIRECTORY 'ny_names'
SELECT fname, lname
INSERT OVERWRITE DIRECTORY 'ny_count'
SELECT count(DISTINCT cust_id);

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Hive Authentication

!HiveServer2 can be configured for Kerberos or LDAP authentication
!The example below connects to HiveServer2 from Beeline

## [training@localhost analyst]\$ beeline

Beeline version 0.10.0-cdh4.2.1 by Apache Hive

## beeline> !connect jdbc:hive2://localhost:10000

training mypwd org.apache.hive.jdbc.HiveDriver

Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.2.1)

beeline>

Hive Authorization with Apache Sentry

!Sentry is a plug-in for enforcing authorization
!Sentry can be enabled in HiveServer2
And in Impala
!Sentry provides fine-grained, role-based authorization for Hive data and metadata
!Sentry was developed at Cloudera
Available starting with CDH 4.3
Project is in incubation status at Apache
http://incubator.apache.org/projects/sentry.html

Sentry Access Control Model

!What does Sentry control access to?
Server
Database
Table
View
!Who can access Sentry-controlled objects?
Users in a Sentry role
Sentry roles = one or more groups
!How can role members access Sentry-controlled objects?
Read operations (SELECT privilege)
Write operations (INSERT privilege)

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Hands-On Exercise: Data Management with Hive

In this exercise, you will practice managing data with Hive.

Chapter Topics

Hive Data Management

!! Hive Data Formats
!! Creating Databases and Hive-Managed Tables
!! Altering Databases and Tables
!! Self-Managed Tables
!! Simplifying Queries with Views
!! Storing Query Results
!! Controlling Access to Data
!! Hands-On Exercise: Data Management with Hive
!! Conclusion

Essential Points

!Each Hive table maps to a directory in HDFS
Table data is stored as one or more files
Default format: plain text with delimited fields
Binary formats may offer better performance but limit compatibility
!Dropping a Hive-managed table deletes its data in HDFS
External tables require manual data deletion
!Views can help to simplify complex and repetitive queries
!In a secure environment, Sentry provides Hive authorization

Text Processing with Hive
Chapter 12

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Text Processing with Hive

In this chapter, you will learn
!How to use Hive's most important string functions
!How to format numeric values
!How to use regular expressions in Hive
!What n-grams are and why they are useful
!How to estimate how often words or phrases occur in text

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Text Processing Overview

!What types of data have we traditionally analyzed?
Carefully curated information in rows and columns
!What types of data are we producing today?
Free-form notes in electronic medical records
Application and server log files
Social network connections
Electronic messages
Product ratings
!These types of data also contain great value
But extracting it requires a different approach

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Basic String Functions

!Hive supports many string functions often found in RDBMSs

Function Description              Example Invocation      Input      Output
Convert to uppercase              UPPER(name)             Bob        BOB
Convert to lowercase              LOWER(name)             Bob        bob
Remove whitespace at start/end    TRIM(name)              " Bob "    Bob
Remove only whitespace at start   LTRIM(name)             " Bob"     Bob
Remove only whitespace at end     RTRIM(name)             "Bob "     Bob
Extract portion of string *       SUBSTRING(name, 0, 3)   Samuel     Sam

* Unlike SQL, the starting position is zero-based in Hive
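For readers who know Python, the table above maps onto familiar string operations (a rough sketch; note that Hive's SUBSTRING takes a start position and a length, while the Python slice below uses start and end indices):

```python
name = " Bob "
print(name.upper())    # UPPER  -> " BOB "
print(name.lower())    # LOWER  -> " bob "
print(name.strip())    # TRIM   -> "Bob"
print(name.lstrip())   # LTRIM  -> "Bob "
print(name.rstrip())   # RTRIM  -> " Bob"
print("Samuel"[0:3])   # SUBSTRING(name, 0, 3) -> "Sam"
```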

Parsing URLs with Hive

!The following examples assume the following URL as input
http://www.example.com/click.php?A=42&Z=105

Example Invocation              Output
PARSE_URL(url, 'PROTOCOL') http
PARSE_URL(url, 'HOST') www.example.com
PARSE_URL(url, 'PATH') /click.php
PARSE_URL(url, 'QUERY') A=42&Z=105
PARSE_URL(url, 'QUERY', 'A') 42
PARSE_URL(url, 'QUERY', 'Z') 105
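The same decomposition can be reproduced with Python's standard urllib.parse module, which is a handy way to check what PARSE_URL should return:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.example.com/click.php?A=42&Z=105"
parts = urlparse(url)
print(parts.scheme)                    # PROTOCOL -> http
print(parts.netloc)                    # HOST     -> www.example.com
print(parts.path)                      # PATH     -> /click.php
print(parts.query)                     # QUERY    -> A=42&Z=105
print(parse_qs(parts.query)["A"][0])   # QUERY, 'A' -> 42
```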

Numeric Format Functions

!Hive offers two functions for formatting a number
Simple: FORMAT_NUMBER (0.10.0 and later)
Versatile: PRINTF (0.9.0 and later)

Example Invocation                     Input         Output
FORMAT_NUMBER(commission, 2)           2345.519728   2,345.52
PRINTF("$%1.2f", total_price)          356.9752      $356.98
PRINTF("%s owes $%1.2f", name, amt)    Bob, 3.9      Bob owes $3.90
PRINTF("%.2f%%", taxrate * 100)        0.47314       47.31%

!Caution: avoid storing precise values as floating-point numbers!
These functions are best used for formatting results
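The equivalent printf-style formatting in Python makes the rounding behavior concrete:

```python
print(format(2345.519728, ",.2f"))       # FORMAT_NUMBER(commission, 2) -> 2,345.52
print("$%1.2f" % 356.9752)               # PRINTF("$%1.2f", total_price) -> $356.98
print("%s owes $%1.2f" % ("Bob", 3.9))   # -> Bob owes $3.90
print("%.2f%%" % (0.47314 * 100))        # -> 47.31%
```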

Splitting and Combining Strings

! CONCAT combines one or more strings
The CONCAT_WS variation joins them with a separator
! SPLIT does nearly the opposite
Difference: return value is ARRAY<STRING>

Example Invocation                 Output
CONCAT('alice', '@example.com')    alice@example.com
SPLIT('Amy/Sam/Ted', '/')          ["Amy","Sam","Ted"] *

* Representation of ARRAY<STRING>

Converting Arrays to Records with EXPLODE

!The EXPLODE function creates a record for each element in an array
An example of a table-generating function
The alias is required when invoking table-generating functions

Amy,Sam,Ted

## hive> SELECT SPLIT(people, ',') FROM example;

["Amy","Sam","Ted"]

## hive> SELECT EXPLODE(SPLIT(people, ',')) AS x FROM example;

Amy
Sam
Ted
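The SPLIT-then-EXPLODE pattern is easy to mirror in Python: splitting yields the array, and iterating over it yields one row per element:

```python
people = "Amy,Sam,Ted"
array = people.split(",")   # SPLIT(people, ',') -> ["Amy", "Sam", "Ted"]
print(array)
for person in array:        # EXPLODE(...) AS x -> one row per element
    print(person)
```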

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Regular Expressions

!A regular expression (regex) matches a pattern in text
Useful when exact matching is not practical

Regular Expression   String (matched portion in bold)
Dualcore I wish Dualcore had 2 stores in 90210.
\\d I wish Dualcore had 2 stores in 90210.
\\d{5} I wish Dualcore had 2 stores in 90210.
\\d\\s\\w+ I wish Dualcore had 2 stores in 90210.
\\w{5,9} I wish Dualcore had 2 stores in 90210.
.?\\. I wish Dualcore had 2 stores in 90210.
.*\\. I wish Dualcore had 2 stores in 90210.
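These patterns behave the same way in any regex engine; a quick check of several of them in Python:

```python
import re

s = "I wish Dualcore had 2 stores in 90210."
print(re.search(r"Dualcore", s).group())   # Dualcore
print(re.search(r"\d", s).group())         # 2
print(re.search(r"\d{5}", s).group())      # 90210
print(re.search(r"\d\s\w+", s).group())    # 2 stores
print(re.search(r"\w{5,9}", s).group())    # Dualcore
```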

Hive's Regular Expression Functions

!Hive has two important functions that use regular expressions
REGEXP_EXTRACT returns the matched text
REGEXP_REPLACE substitutes another value for the matched text
!These examples assume that txt has the following value
It's on Oak St. or Maple St in 90210

## hive> SELECT REGEXP_EXTRACT(txt, '\\d{5}', 0)

FROM message;
90210

## hive> SELECT REGEXP_REPLACE(txt, 'St.?\\s+', 'Street ')

FROM message;
It's on Oak Street or Maple Street in 90210
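The two Hive calls correspond to re.search and re.sub in Python; the pattern strings are the same apart from Hive's doubled backslashes:

```python
import re

txt = "It's on Oak St. or Maple St in 90210"
# REGEXP_EXTRACT: pull out the first five-digit sequence
print(re.search(r"\d{5}", txt).group())    # 90210
# REGEXP_REPLACE: normalize "St." / "St " to "Street "
print(re.sub(r"St.?\s+", "Street ", txt))  # It's on Oak Street or Maple Street in 90210
```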

Regex SerDe

!We sometimes need to analyze data that lacks consistent delimiters
Log files are a common example of this

## 05/23/2013 19:45:19 312-555-7834 CALL_RECEIVED ""

05/23/2013 19:45:23 312-555-7834 OPTION_SELECTED "Shipping"
05/23/2013 19:46:23 312-555-7834 ON_HOLD ""
05/23/2013 19:47:51 312-555-7834 AGENT_ANSWER "Agent ID N7501"
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"
05/23/2013 19:48:41 312-555-7834 CALL_END "Duration: 3:22"

!The Regex SerDe parses each line using a regular expression
Allows us to create a table from this log file

Creating a Table with Regex SerDe (1)

Log excerpt
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

RegexSerDe
CREATE TABLE calls (
event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

!Each pair of parentheses denotes a field
Field value is the text matched by the pattern within the parentheses

Creating a Table with Regex SerDe (2)

Log excerpt
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

RegexSerDe
CREATE TABLE calls (
event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Table excerpt
event_date   event_time   phone_num      event_type   details
05/23/2013   19:48:37     312-555-7834   COMPLAINT    Item not received
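You can verify an input.regex against a sample line before creating the table; here is the same pattern applied with Python's re module:

```python
import re

# The value of "input.regex" above, written as a raw Python string
pattern = r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) "([^"]*)"'
line = '05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"'
fields = re.match(pattern, line).groups()
print(fields)
# ('05/23/2013', '19:48:37', '312-555-7834', 'COMPLAINT', 'Item not received')
```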

Regex SerDe in Older Versions of Hive

!The Regex SerDe wasn't formally part of Hive prior to 0.10.0
It shipped with Hive, but was part of the hive-contrib library
!To use Regex SerDe in 0.9.x and earlier versions of Hive
Change the SerDe's package name, as shown below

## CREATE TABLE calls (

event_date STRING,
event_time STRING,
phone_num STRING,
event_type STRING,
details STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Fixed-Width Formats in Hive

!Many older applications produce data in fixed-width formats

1030929610759620120829012215Oakland CA94618
cust_id order_id date time city state zipcode

!Unfortunately, Hive doesn't directly support these
But you can overcome this limitation by using RegexSerDe
!Caveat: all fields in RegexSerDe are of type STRING
May need to cast numeric values in your queries

Fixed-Width Format Example

Input data
1030929610759620120829012215Oakland CA94618

RegexSerDe
CREATE TABLE fixed (
cust_id STRING,
order_id STRING,
order_dt STRING,
order_tm STRING,
city STRING,
state STRING,
zip STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"(\\d{7})(\\d{7})(\\d{8})(\\d{6})(.{20})(\\w{2})(\\d{5})");

cust_id   order_id   order_dt   order_tm   city      state   zipcode
1030929   6107596    20120829   012215     Oakland   CA      94618
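The fixed-width pattern can likewise be tested in Python; the sample record below pads the city field to its assumed 20-character width:

```python
import re

pattern = r"(\d{7})(\d{7})(\d{8})(\d{6})(.{20})(\w{2})(\d{5})"
# Assemble a sample record; the city column is padded to 20 characters
line = "1030929" + "6107596" + "20120829" + "012215" + "Oakland".ljust(20) + "CA" + "94618"
m = re.match(pattern, line)
print(m.group(1))           # 1030929
print(m.group(5).strip())   # Oakland (fixed-width padding stripped)
print(m.group(7))           # 94618
```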

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Parsing Sentences into Words

!Hive's SENTENCES function parses supplied text into words
!Input is a string containing one or more sentences
!Output is a two-dimensional array of strings
Outer array contains one element per sentence
Inner array contains one element per word in that sentence
hive> SELECT txt FROM phrases WHERE id=12345;
I bought this computer and really love it! It's very fast and
does not crash.

## hive> SELECT SENTENCES(txt) FROM phrases WHERE id=12345;

[["I","bought","this","computer","and","really","love","it"],
["It's","very","fast","and","does","not","crash"]]
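A rough Python approximation of SENTENCES shows the two-level structure (real sentence segmentation is more sophisticated; this sketch just splits on terminal punctuation):

```python
import re

def sentences(txt):
    """Split text into sentences, then each sentence into words."""
    parts = re.split(r"[.!?]+\s*", txt.strip())
    return [re.findall(r"[\w']+", p) for p in parts if p]

txt = "I bought this computer and really love it! It's very fast and does not crash."
print(sentences(txt))
```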

Sentiment Analysis

!Sentiment analysis is an application of text analytics
Classification and measurement of opinions
Frequently used for social media analysis
!Context is essential for human languages
Which word combinations appear together?
How frequently do these combinations appear?

n-grams

!An n-gram is a word combination (n = number of words)
A bigram is a sequence of two words (n=2)
!n-gram frequency analysis is an important step in many applications
Suggesting spelling corrections in search results
Finding the most important topics in a body of text
Identifying trending topics in social media messages

Calculating n-grams in Hive (1)

!Hive offers the NGRAMS function for calculating n-grams
!The function requires three input parameters
Array of strings (sentences), each containing an array (words)
Number of words in each n-gram
Desired number of results (top-N, based on frequency)
!Output is an array of STRUCT with two attributes
ngram: the n-gram itself (an array of words)
estfrequency: estimated frequency at which this n-gram appears

Calculating n-grams in Hive (2)

!The NGRAMS function is often used with the SENTENCES function
We also used LOWER to normalize case
And EXPLODE to convert the resulting array to a series of rows

## hive> SELECT txt FROM phrases WHERE id=56789;

This tablet is great. The size is great. The screen is
great. The audio is great. I love this tablet! I love

## hive> SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(txt)), 2, 5))

AS bigrams FROM phrases WHERE id=56789;
{"ngram":["is","great"],"estfrequency":4.0}
{"ngram":["great","the"],"estfrequency":3.0}
{"ngram":["this","tablet"],"estfrequency":3.0}
{"ngram":["i","love"],"estfrequency":2.0}
{"ngram":["tablet","i"],"estfrequency":1.0}
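Conceptually, NGRAMS counts adjacent word pairs and keeps the most frequent; a small Python sketch of the same computation (sample sentences are illustrative):

```python
from collections import Counter

def top_bigrams(sentence_words, n=5):
    """Count adjacent word pairs across sentences; return the top n."""
    counts = Counter()
    for words in sentence_words:
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)

sents = [["this", "tablet", "is", "great"],
         ["the", "screen", "is", "great"],
         ["i", "love", "this", "tablet"]]
print(top_bigrams(sents))
```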

Finding Specific n-grams in Text

! CONTEXT_NGRAMS is similar, but considers only specific combinations
Any NULL values in the array are treated as placeholders

## hive> SELECT txt FROM phrases

WHERE txt LIKE '%new computer%';
My new computer is fast! I wish I'd upgraded sooner.
This new computer is expensive, but I need it now.
I can't believe her new computer failed already.

hive> SELECT EXPLODE(CONTEXT_NGRAMS(SENTENCES(LOWER(txt)),
ARRAY("new", "computer", NULL, NULL), 4, 3)) AS ngrams
FROM phrases;
{"ngram":["is","expensive"],"estfrequency":1.0}
{"ngram":["is","fast"],"estfrequency":1.0}

Histograms

!Histograms illustrate how values in the data are distributed
This helps us estimate the overall shape of the data distribution

Calculating Data for Histograms

! HISTOGRAM_NUMERIC creates the data needed for histograms
Input: column name and number of bins in the histogram
Output: coordinates representing bin centers and heights

## hive> SELECT EXPLODE(HISTOGRAM_NUMERIC(

total_price, 10)) AS dist FROM cart_orders;
{"x":25417.336745023003,"y":8891.0}
{"x":74401.5041469194,"y":3376.0}
{"x":123550.04418985262,"y":611.0}
{"x":197421.12500000006,"y":24.0}
{"x":267267.53846153844,"y":26.0}
{"x":425324.0,"y":4.0}
{"x":479226.38461538474,"y":13.0}
{"x":524548.0,"y":6.0}
{"x":598463.5,"y":2.0}
{"x":975149.0,"y":2.0}

Import this data into charting software to produce a histogram

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Hands-On Exercise: Gaining Insight with Sentiment Analysis

In this exercise, you will practice sentiment analysis with Hive.

Chapter Topics

Text Processing with Hive

!! Overview of Text Processing
!! Important String Functions
!! Using Regular Expressions in Hive
!! Sentiment Analysis and n-grams
!! Optional Hands-On Exercise: Gaining Insight with Sentiment Analysis
!! Conclusion

Essential Points

!Most data produced these days lacks rigid structure
Text processing can help us analyze loosely-structured data
!The SPLIT function creates an array from a string
EXPLODE creates individual records from an array
!Hive has extensive support for regular expressions
You can extract or substitute values based on patterns
You can even create a table based on regular expressions
!An n-gram is a sequence of words
Use NGRAMS and CONTEXT_NGRAMS to find their frequency

Hive Optimization
Chapter 13

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Hive Optimization

In this chapter, you will learn
!Which factors help determine the performance of Hive queries
!What command displays Hive's execution plan for a query
!How to enable several useful Hive performance features
!How to use table bucketing to sample data
!How to create and rebuild indexes in Hive

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

How Hive Processes Data

SELECT * FROM customers
WHERE zipcode=94305;

Steps Run Locally:
Parse HiveQL Statements
Build the Execution Plan

Steps Run on Cluster:
Submit MapReduce Jobs
Map Processing Phase
Reduce Processing Phase
Write Output Data

Hive Query Performance Patterns (1)

!The fastest queries require no MapReduce job at all

DESCRIBE customers;

## SELECT * FROM customers;

!Then a query that requires a map-only job

## SELECT * FROM customers WHERE zipcode = 94305;

Hive Query Performance Patterns (2)

!The next slowest type of query requires both Map and Reduce phases

## SELECT COUNT(cust_id) FROM customers

WHERE zipcode=94305;

!The slowest queries require multiple MapReduce jobs

## SELECT zipcode, COUNT(cust_id) AS num FROM customers

GROUP BY zipcode
ORDER BY num DESC
LIMIT 10;

Viewing the Execution Plan

!How can you tell how Hive will execute a query?
Can it return data directly from HDFS?
Will it require a reduce phase or multiple MapReduce jobs?
!Prefix your query with EXPLAIN to view Hive's execution plan

## hive> EXPLAIN SELECT * FROM customers;

!The output of EXPLAIN can be very long and complex
Fully understanding it requires in-depth knowledge of MapReduce
We will cover the basics here

Viewing a Query Plan with EXPLAIN (1)

!The query plan contains three main sections
The abstract syntax tree details how Hive parsed the query (excerpt below)
The stage dependencies and plans are more useful to most users

## hive> EXPLAIN CREATE TABLE cust_by_zip AS

SELECT zipcode, COUNT(cust_id) AS num
FROM customers GROUP BY zipcode;

## ABSTRACT SYNTAX TREE:

(TOK_CREATETABLE (TOK_TABNAME cust_by_zip) ...

STAGE DEPENDENCIES:
... (excerpt shown on next slide)

STAGE PLANS:
... (excerpt shown on upcoming slide)

Viewing a Query Plan with EXPLAIN (2)

!Our query has three stages
!Dependencies define the order
Stage-1 runs first
Stage-0 runs next
Stage-2 runs last

ABSTRACT SYNTAX TREE:
... (shown on previous slide)

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0

STAGE PLANS:
... (shown on next slide)

Viewing a Query Plan with EXPLAIN (3)

!Stage-1: MapReduce job
!Map phase
Selects zipcode and cust_id columns
!Reduce phase
Group by zipcode
Count cust_id

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        TableScan
          Select Operator
            zipcode, cust_id
      Reduce Operator Tree:
        Group By Operator
          aggregations:
            expr: count(cust_id)
          keys:
            expr: zipcode
Viewing a Query Plan with EXPLAIN (4)

!Stage-0: HDFS action
Move the previous stage's output to Hive's warehouse directory
!Stage-2: Metastore action
Create new table
Has two columns

STAGE PLANS:
  Stage: Stage-1 (covered earlier)
  ...

  Stage: Stage-0
    Move Operator
      files:
        hdfs directory: true
        destination: (HDFS path...)

  Stage: Stage-2
    Create Table Operator:
      Create Table
        columns: zipcode string, num bigint
        name: cust_by_zip

Sorting Results

!As in SQL, ORDER BY sorts specified fields in HiveQL
Consider the result from the following query

## hive> SELECT name, SUM(total)

FROM order_info GROUP BY name
ORDER BY name;

!All mapper output is processed by a single reducer
The final output is globally sorted:

Alice 12491
Bob 9997
Carlos 1431
Diana 5385

Using SORT BY for Partial Ordering

!HiveQL also supports partial ordering via SORT BY
Offers much better performance if global order isn't required

## hive> SELECT name, SUM(total)

FROM order_info GROUP BY name
SORT BY name;

!Only the output from each reducer is sorted
With two reducers, for example:

Reducer 1 output:   Reducer 2 output:
Alice 12491         Bob 9997
Carlos 1431         Diana 5385

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Parallel Execution

!Stages in Hive's execution plan often lack dependencies
This means they can be run in parallel
!Hive supports parallel execution in such cases
However, this feature is disabled by default
!Enable this by setting the hive.exec.parallel property to true
Set it only for yourself in $HOME/.hiverc
Set it for all users in /etc/hive/conf/hive-site.xml

Reducing Latency Through Local Execution

!MapReduce imposes overhead
Necessary to process large amounts of data in Hive
Possibly inefficient with small amounts of data
!Processing data locally can substantially speed up smaller jobs
Local execution can substantially improve turnaround for small jobs

## hive> SET mapred.job.tracker=local;

hive> SET mapred.local.dir=/home/training/tmpdata;

## hive> SELECT zipcode, COUNT(cust_id) AS num

FROM customers GROUP BY zipcode;

Automatic Selection of Local Execution Mode

!Hive can now select local execution mode automatically
It does this on a case-by-case basis using heuristics
!Like parallel execution, this feature is disabled by default
Enable it by setting hive.exec.mode.local.auto to true

Job Control

!You will see many log messages when you run a query in Hive's shell
One of these messages will identify the MapReduce job ID

## hive> SELECT * FROM customers WHERE zipcode=94305;

Total MapReduce jobs = 1
Launching Job 1 out of 1
... (other messages omitted) ...

!Use the job ID to check status or kill the job from the command line

## $ mapred job -status job_201306022351_0025

$ mapred job -kill job_201306022351_0025

Viewing a Job in the Web UI (1)

## hive> SELECT * FROM customers WHERE zipcode=94305;

Total MapReduce jobs = 1
Launching Job 1 out of 1
... (other messages omitted) ...

## Starting Job = job_201306022351_0029, Tracking URL =

http://jobtracker.example.com:50030/jobdetails.jsp?
jobid=job_201306022351_0029

Viewing a Job in the Web UI (2)

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Recap: Data Storage in Hive

!Recall how Hive stores data
A table simply points to a directory in HDFS
The table's data are files within that directory

call_logs table = a directory in HDFS containing one file per day

call-20130603.log:
2013-06-03 07:55:36 312-555-3453 ...
2013-06-03 07:55:39 312-555-3453 OPTION_SELECTED "Billing"

call-20130604.log:
2013-06-04 07:05:21 212-555-3982 ...
2013-06-04 07:08:36 212-555-3982 HUNG_UP "Busy watching Hadoop Webinar"
2013-06-04 07:14:29 314-555-5741 ...

call-20130605.log:
2013-06-05 07:21:57 808-555-3453 ...
2013-06-05 07:23:12 808-555-3453 OPTION_SELECTED "Billing"

Our analysts use this data to summarize the previous day's calls

## hive> SELECT event_type, COUNT(event_type)

FROM call_logs
WHERE call_date = '2013-06-03'
GROUP BY event_type;

!These queries always filter by a value in the call_date field

Table Partitioning

!It is possible to create a table that partitions the data
Does not prevent you from running queries that span partitions

call_logs directory in HDFS, with one subdirectory per partition:

date=2013-06-03:
07:55:39 312-555-3453 OPTION_SELECTED "Billing"
07:55:41 312-555-3453 CALL_ENDED ""

date=2013-06-04:
07:05:21 212-555-3982 CALL_RECEIVED ""
07:08:36 212-555-3982 HUNG_UP "Busy watching Hadoop Webinar"
07:14:29 314-555-5741 AGENT_ANSWER "Agent ID N5150"
07:14:51 314-555-5741 CALL_TRANSFER "Billing"

date=2013-06-05:
07:23:12 808-555-3453 OPTION_SELECTED "Billing"
07:23:14 213-555-8752 CALL_ENDED ""

Creating a Partitioned Table in Hive (1)

!Specify the partition column in the PARTITIONED BY clause

## CREATE TABLE call_logs (

call_time STRING,
event_type STRING,
phone STRING,
details STRING)
PARTITIONED BY (call_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

!You can also create nested partitions

## PARTITIONED BY (call_date STRING, event_type STRING)

Creating a Partitioned Table in Hive (2)

!The partition column is displayed if you DESCRIBE the table:

## hive> DESCRIBE call_logs;

OK
call_time string
event_type string
phone string
details string
call_date string

!However, the partition is a virtual column
It does not exist in the incoming data itself

## hive> LOAD DATA INPATH 'call-20130603.log'

INTO TABLE call_logs
PARTITION(call_date='2013-06-03');

## hive> LOAD DATA INPATH 'call-20130604.log'

INTO TABLE call_logs
PARTITION(call_date='2013-06-04');

!The\$above\$example\$would\$create\$two\$subdirectories:\$
/user/hive/warehouse/call_logs/call_date=2013-06-03
/user/hive/warehouse/call_logs/call_date=2013-06-04

Dynamic Partition Inserts (1)

!Hive can dynamically insert data into specific partitions for you
!Syntax:

FROM customers
INSERT OVERWRITE TABLE custs_part PARTITION(state)
SELECT cust_id, fname, lname, address, city,
zipcode, state;

!Partitions are automatically created based on the value of the last column
If the partition does exist, it will be overwritten

Dynamic Partition Inserts (2)

!Dynamic partitioning is not enabled by default
Enable it by setting these two properties

Property Name                        Value
hive.exec.dynamic.partition          true
hive.exec.dynamic.partition.mode     nonstrict

!Caution: avoid creating an excessive number of partitions
This can happen if your data contains many unique values

Dynamic Partition Inserts (3)

!Caution: if the partition column has many different values, many partitions will be created
!Three Hive configuration properties exist to limit this
hive.exec.max.dynamic.partitions.pernode
Maximum number of dynamic partitions that can be created by any given Mapper or Reducer
Default: 100
hive.exec.max.dynamic.partitions
Total number of dynamic partitions that can be created by one HiveQL statement
Default: 1000
hive.exec.max.created.files
Maximum total files created by Mappers and Reducers
Default: 100000

!To view the current partitions in a table

## hive> SHOW PARTITIONS call_logs;

!To add a partition manually

## ALTER TABLE call_logs

ADD PARTITION (call_date='2013-06-05')
LOCATION '/dualcore/call_logs/call_date=2013-06-05';

!To remove a partition

## ALTER TABLE call_logs

DROP PARTITION (call_date='2013-06-06');

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

What Is Bucketing?

!Partitioning subdivides data by values in the partitioned columns
!Bucketing data is another way of subdividing data
Calculates a hash code for values inserted into bucketed columns
The hash code is used to assign new records to a bucket
!Goal: distribute rows across a predefined number of buckets
Useful for jobs which need random samples of data
Joins may be faster if all tables are bucketed on the join column

Creating a Bucketed Table

!Example of creating a table that supports bucketing
Creates a table supporting 20 buckets based on the order_id column
Each bucket should contain roughly 5% of the table's data

## CREATE TABLE orders_bucketed

(order_id INT,
cust_id INT,
order_date TIMESTAMP)
CLUSTERED BY (order_id) INTO 20 BUCKETS;

!The column selected for bucketing should have well-distributed values
Identifier columns are often a good choice
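The mechanics of bucket assignment can be sketched in Python (Hive uses its own hash function; for integer columns the hash is the value itself, which this sketch assumes):

```python
NUM_BUCKETS = 20

def bucket_for(order_id):
    """Assign a row to one of NUM_BUCKETS buckets by hashing the column value."""
    return order_id % NUM_BUCKETS

print(bucket_for(6584288))  # -> 8
print(bucket_for(6584290))  # -> 10
```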

Inserting Data Into a Bucketed Table

!Bucketing isn't automatically enforced when inserting data
!Set the hive.enforce.bucketing property to true
This sets the number of reducers to the number of buckets in the table definition

## hive> SET hive.enforce.bucketing=true;

hive> INSERT OVERWRITE TABLE orders_bucketed
SELECT * FROM orders;

Sampling Data From a Bucketed Table

!Use the following syntax to sample data from a bucketed table:
This example selects one of every ten records (10%)

## hive> SELECT * FROM orders_bucketed

TABLESAMPLE (BUCKET 1 OUT OF 10 ON order_id);

!It\$is\$possible\$to\$use\$TABLESAMPLE\$on\$a\$non#bucketed\$table\$
However,"this"requires"a"full"scan"of"the"enBre"table"

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Indexes in Hive

!Tables in Hive also now support indexes
Similar to indexes in RDBMSs, but much more limited
!May improve performance for certain types of queries
But maintaining them costs disk space and CPU time
!Syntax to create an index:

hive> CREATE INDEX idx_orders_cust_id
      ON TABLE orders(cust_id)
      AS 'handler_class'
      WITH DEFERRED REBUILD;
!The handler class is the fully-qualified name of a Java class, such as
Hive's built-in org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler
Viewing and Building Indexes in Hive

!This command lists the indexes associated with the orders table

hive> SHOW FORMATTED INDEX ON orders;

!Hive indexes are initially empty
Building (and later rebuilding) indexes is a manual process
Use the ALTER INDEX command to rebuild an index
Caution: this can be a lengthy operation!

hive> ALTER INDEX idx_orders_cust_id ON orders REBUILD;

Chapter Topics

Hive Optimization

!! Understanding Query Performance
!! Controlling Job Execution
!! Partitioning
!! Bucketing
!! Indexing Data
!! Conclusion

Essential Points

!ORDER BY sorts results globally, just as in SQL
HiveQL also supports SORT BY for partial ordering
!Local execution mode can significantly reduce query latency
But it is only appropriate for small amounts of data
!Partitioning and bucketing can both subdivide a table's data
Bucketing is used to support random sampling
!Hive's indexing feature can boost performance for certain queries
But it comes at the cost of increased disk and CPU usage

Bibliography

The following offer more information on topics discussed in this chapter
!Hive Manual for the EXPLAIN Command
http://tiny.cloudera.com/dac13a
!Hive Manual for Bucketed Tables
http://tiny.cloudera.com/dac13b
!Hive Manual for Indexes
http://tiny.cloudera.com/dac13c

Extending Hive
Chapter 14

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Extending Hive

In this chapter, you will learn
!What role SerDes play in Hive
!How to use a custom SerDe
!How to use TRANSFORM for custom record processing
!How to use variable substitution

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Hive SerDes

!SerDe stands for Serializer/Deserializer
SerDes control the row format of a table
Specified, sometimes implicitly, when the table is created
!Hive ships with many SerDes, including:

LazySimpleSerDe  Uses specified field delimiters (default)
RegexSerDe       Based on supplied patterns
ColumnarSerDe    Uses the columnar format needed by RCFile
HBaseSerDe       Uses an HBase table

Example: Using the RegexSerDe

Input Data
05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"

CREATE TABLE calls (
  event_date STRING,
  event_time STRING,
  phone_num STRING,
  event_type STRING,
  details STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
  "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^\"]*)\"");

Resulting Table
event_date  event_time  phone_num     event_type  details
05/23/2013  19:48:37    312-555-7834  COMPLAINT   Item not received
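The regular expression in SERDEPROPERTIES above (shown there with extra backslashes because it is embedded in a quoted HiveQL string) can be checked with any ordinary regex engine. A Python sketch:

```python
import re

# The pattern RegexSerDe applies to each input line; every capture
# group becomes one column of the table
pattern = r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) "([^"]*)"'
line = '05/23/2013 19:48:37 312-555-7834 COMPLAINT "Item not received"'

fields = re.match(pattern, line).groups()
print(fields)
```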

!Hive also allows writing custom SerDes using its Java API
There are many open source SerDes on the Web
Writing your own is seldom necessary

Registering a SerDe Library

!The next example uses the CSVSerde, packaged as a JAR file
available from http://tiny.cloudera.com/dac14a
!You must register external libraries before using them
Ensures Hive can find the library (JAR file) at runtime
Use ADD JAR followed by the path to the JAR file
!Registration remains in effect only during the current Hive session

Using the SerDe in Hive

Input Data
1,Gigabux,gigabux@example.com
2,"ACME Distribution Co.",acme@example.com
3,"Bitmonkey, Inc.",bmi@example.com

Specify SerDe
CREATE TABLE vendors
  (id INT,
   name STRING,
   email STRING)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde';

Resulting Table
id  name                   email
1   Gigabux                gigabux@example.com
2   ACME Distribution Co.  acme@example.com
3   Bitmonkey, Inc.        bmi@example.com

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Using TRANSFORM to Process Data Using External Scripts

!You are not limited to manipulating data exclusively in HiveQL
Hive allows you to transform data through external scripts or programs
These can be written in nearly any language
!This is done with HiveQL's TRANSFORM ... USING construct
One or more fields are supplied as arguments to TRANSFORM()
The external script is identified by USING

hive> SELECT TRANSFORM(*) USING 'myscript.pl'
      FROM employees;

Data Input and Output with TRANSFORM

!Hive supplies records to your program on standard input
Each field in the supplied record will be a tab-separated string
NULL values are converted to the literal string "\N"
!You may need to convert values to appropriate types within your program
!Your program must return tab-delimited fields on standard output
Output fields can optionally be named and cast using the syntax below

SELECT TRANSFORM(product_name, price)
USING 'tax_calculator.py'
AS (item_name STRING, tax INT)
FROM products;
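A TRANSFORM script in Python follows the same contract. The sketch below (the helper names are ours) parses tab-separated input fields, treating the literal \N as NULL, and emits tab-separated output:

```python
import sys

def parse_record(line):
    """Split one TRANSFORM input line into fields; Hive separates
    fields with tabs and encodes NULL as the literal string \\N."""
    fields = line.rstrip('\n').split('\t')
    return [None if f == '\\N' else f for f in fields]

def emit_record(fields):
    """Write one output record: tab-delimited fields on stdout."""
    print('\t'.join('\\N' if f is None else str(f) for f in fields))

if __name__ == '__main__':
    for line in sys.stdin:
        emit_record(parse_record(line))  # identity transform, for illustration
```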

Hive TRANSFORM Example (1)

!Here is a complete example of using TRANSFORM in Hive
The script looks up the country to which each e-mail address
corresponds, and then returns an appropriate greeting
Here's a sample of the input data

hive> SELECT name, email_address FROM employees;

Antoine antoine@example.fr
Kai     kai@example.de
Pedro   pedro@example.mx

Here's the corresponding HiveQL code

hive> SELECT TRANSFORM(name, email_address)
      USING 'greeting.pl' AS greeting
      FROM employees;

Hive TRANSFORM Example (2)

!The Perl script for this example is shown below

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

Hive TRANSFORM Example (3)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

The first line tells the system to use the Perl interpreter when
running this script. We define our greetings in the next line using
an associative array keyed by the country code we'll extract from
each e-mail address.

Hive TRANSFORM Example (4)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

We read each record from standard input within the loop, and then
split them into fields based on tab characters.

Hive TRANSFORM Example (5)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

We extract the country code from the e-mail address (the pattern
matches any letters following the final dot). We use that to look up
a greeting, but default to "Hello" if we didn't find one.

Hive TRANSFORM Example (6)

#!/usr/bin/env perl

%greetings = ('de' => 'Hallo',
              'fr' => 'Bonjour',
              'mx' => 'Hola');

while (<STDIN>) {
  ($name, $email) = split /\t/;
  ($suffix) = $email =~ /\.([a-z]+)$/;
  $greeting = $greetings{$suffix};
  $greeting = 'Hello' unless defined($greeting);
  print "$greeting $name\n";
}

Finally, we return our greeting as a single field by printing this
value to standard output. If we had multiple fields, we'd simply
separate each by tab characters when printing them here.

Hive TRANSFORM Example (7)

!Finally, here's the result of our transformation

hive> SELECT TRANSFORM(name, email_address)
      USING 'greeting.pl' AS greeting
      FROM employees;

Bonjour Antoine
Hallo Kai
Hola Pedro
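Since TRANSFORM scripts can be written in nearly any language, the same logic could be a Python script instead of Perl; this sketch reproduces the greeting lookup (the function name here is ours):

```python
import re

GREETINGS = {'de': 'Hallo', 'fr': 'Bonjour', 'mx': 'Hola'}

def greet(name, email):
    """Look up a greeting by the letters after the final dot in the
    e-mail address; default to 'Hello' for unknown suffixes."""
    match = re.search(r'\.([a-z]+)$', email)
    suffix = match.group(1) if match else None
    return '{} {}'.format(GREETINGS.get(suffix, 'Hello'), name)

print(greet('Antoine', 'antoine@example.fr'))  # Bonjour Antoine
```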

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Overview of User-Defined Functions (UDFs)

!User-Defined Functions (UDFs) are custom functions
Invoked with the same syntax as built-in functions

hive> SELECT CALC_SHIPPING_COST(order_id, 'OVERNIGHT')
      FROM orders WHERE order_id = 5742354;

!There are three types of UDFs in Hive
Standard UDFs
User-Defined Aggregate Functions (UDAFs)
User-Defined Table Functions (UDTFs)

Developing Hive UDFs

!Hive UDFs are written in Java
Currently no support for writing UDFs in other languages
!Open source UDFs are plentiful on the Web
!There are three steps for using a UDF in Hive
1. Register the JAR file containing the function
2. Register the function itself
3. Use the function in your query

Attention Java Developers
Cloudera now offers a free e-learning module, "Writing UDFs for Hive"
http://tiny.cloudera.com/dac14b

Example: Using a UDF in Hive (1)

!Our example UDF was compiled from sources found on GitHub
A popular Web site for many open source software projects
Project URL: http://tiny.cloudera.com/dac14e
!We compiled the source and packaged it into a JAR file
We have included a copy of it on your VM
!Our example shows the DATE_FORMAT UDF in that JAR file
Allows great flexibility in formatting date fields in output

Example: Using a UDF in Hive (2)

!First, register the JAR with Hive
Same step as with a custom SerDe

!Next, register the function and assign an alias
The quoted value is the fully-qualified Java class for the UDF

hive> CREATE TEMPORARY FUNCTION DATE_FORMAT
      AS 'com.nexr.platform.hive.udf.UDFDateFormat';

Example: Using a UDF in Hive (3)

!You may then use the function in your query

hive> SELECT order_date FROM orders LIMIT 1;
2011-12-06 10:03:35

hive> SELECT DATE_FORMAT(order_date, 'dd-MMM-yyyy')
      FROM orders LIMIT 1;
06-Dec-2011

hive> SELECT DATE_FORMAT(order_date, 'dd/MM/yy')
      FROM orders LIMIT 1;
06/12/11

hive> SELECT DATE_FORMAT(order_date, 'EEEE, MMM d, yyyy')
      FROM orders LIMIT 1;
Tuesday, Dec 6, 2011
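The patterns above are Java SimpleDateFormat patterns, so MM means month while mm means minutes. For comparison, the equivalent conversions in Python's strftime notation:

```python
from datetime import datetime

ts = datetime.strptime('2011-12-06 10:03:35', '%Y-%m-%d %H:%M:%S')

# Java 'dd-MMM-yyyy' roughly corresponds to Python '%d-%b-%Y'
print(ts.strftime('%d-%b-%Y'))   # 06-Dec-2011 (in an English locale)

# Java 'dd/MM/yy' roughly corresponds to Python '%d/%m/%y'
print(ts.strftime('%d/%m/%y'))   # 06/12/11
```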

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Conclusion

Hive Variables (1)

!Hive supports variable substitution
Swaps a placeholder with a variable's literal value at run time
Variable names are case-sensitive
!Within the Hive shell, set a named variable equal to some value:

hive> SET state=CA;

!To use the variable's value in a HiveQL query:

hive> SELECT * FROM employees
      WHERE state = '${hiveconf:state}';

Hive Variables (2)

!You can set variables when you invoke Hive from the command line
Eases repetitive queries by reducing the need to modify HiveQL
!For example, imagine that we have the following in state.hql

SELECT COUNT(DISTINCT emp_id) FROM employees
WHERE state = '${hiveconf:state}';

!This makes creating per-state reports easy:

$ hive -hiveconf state=CA -f state.hql > ca_count.txt
$ hive -hiveconf state=NY -f state.hql > ny_count.txt
$ hive -hiveconf state=TX -f state.hql > tx_count.txt
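Hive's variable substitution is plain text replacement performed before the query runs; the effect can be sketched with Python's string.Template:

```python
from string import Template

# ${state} plays the role of Hive's ${hiveconf:state} placeholder
query = Template("SELECT COUNT(DISTINCT emp_id) FROM employees "
                 "WHERE state = '${state}';")

# Passing -hiveconf state=CA amounts to substituting the literal value
print(query.substitute(state='CA'))
```

Because the substitution is purely textual, the same script file can serve any number of states without modification.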

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Hands-On Exercise: Data Transformation with Hive
!! Conclusion

!In this Hands-On Exercise, you will transform data with Hive

Chapter Topics

Extending Hive

!! SerDes
!! Data Transformation with Custom Scripts
!! User-Defined Functions
!! Parameterized Queries
!! Hands-On Exercise: Data Transformation with Hive
!! Conclusion

Essential Points

!TRANSFORM processes records using an external program
This can be written in nearly any language
!UDFs are User-Defined Functions
Custom logic that can be invoked just like built-in functions
!Hive substitutes variable placeholders with the literal values you assign
This is done when you execute the query

Introduction to Impala
Chapter 15

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Introduction to Impala

In this chapter, you will learn
!What Impala is and how it compares to Hive, Pig, and RDBMSs
!How Impala executes queries
!Where Impala fits into the data center
!What notable differences exist between Impala and Hive
!How to run queries from the shell or browser

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

What is Impala?

!A high-performance SQL engine for vast amounts of data
Massively-parallel processing (MPP)
Query latency measured in milliseconds
Can query data stored in HDFS or HBase tables
!Developed by Cloudera
100% open source, released under the Apache software license
InteracAng"with"Impala"

!Impala\$supports\$a\$subset\$of\$SQL#92\$
Plus"a"few"extensions"found"in"MySQL"and"Oracle"SQL"dialects"
Almost"idenAcal"to"HiveQL"
!Impala\$oers\$many\$interfaces\$for\$running\$queries\$
Command/line"shell"
Hue"Web"applicaAon"
ODBC"/"JDBC""

Why"Use"Impala?"

!Many\$benets\$are\$the\$same\$as\$with\$Hive\$or\$Pig\$
More"producAve"than"wriAng"MapReduce"code"
No"sobware"development"experience"required"
Leverage"exisAng"knowledge"of"SQL"
!One\$benet\$exclusive\$to\$Impala\$is\$speed\$
Highly/opAmized"for"queries"
Almost"always"at"least"ve"Ames"faster"than"either"Hive"or"Pig"
Oben"20"Ames"faster"or"more"

Dualcore Inc. Dashboard

[Screenshot: a BI dashboard with panels for "Top States for In-Store
Sales" and "Suppliers by Region" (e.g., Japan: 31 suppliers)]

Where Impala Fits Into the Data Center

[Diagram: transaction records from the application database, log data
from Web servers, and documents from a file server all flow into a
Hadoop cluster with Impala; one analyst uses the Impala shell for ad
hoc queries while another accesses Impala via a BI tool]

Where to Get Impala

!We strongly recommend running Impala on CDH 4.2 or higher
Requires a 64-bit Linux platform
!Installation and configuration are outside the scope of this course
Your virtual machine includes a working installation of Impala

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Comparing Impala to Hive and Pig

!Let's first look at similarities between Hive, Pig, and Impala
Queries expressed in high-level languages
Alternatives to writing MapReduce code
!Impala shares the metastore with Hive
Tables created in Hive are visible in Impala (and vice versa)

Contrasting Impala to Hive and Pig (1)

!Hive and Pig execute queries as MapReduce jobs
MapReduce is a general-purpose computation framework
Not optimized for executing interactive SQL queries
Even a trivial query takes 10 seconds or more
!Impala does not use MapReduce
Uses a custom execution engine built specifically for Impala
Queries can complete in a fraction of a second

Contrasting Impala to Hive and Pig (2)

!Hive, Pig, and Impala all also support
Executing queries via an interactive shell or the command line
Grouping, joining, and filtering data
!Impala currently lacks some Hive and Pig features
More details later in this chapter
!Hive and Pig are best suited to long-running batch processes

How Impala Executes a Query

!Each slave node in the cluster runs an Impala daemon
Co-located with the HDFS DataNode
A client issues a query to an Impala daemon
!The Impala daemon plans the query
Checks the local metastore cache
Distributes the query across the other Impala daemons in the cluster
Streams results to the client
!Two other daemons running on master nodes support query execution
The State Store daemon
Provides a lookup service for Impala daemons
Periodically checks the status of the Impala daemons
The Catalog daemon
Relays metadata changes to all the Impala daemons in a cluster
Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Comparing Impala To A Relational Database

                           Relational Database   Impala
Query language             SQL                   SQL-92 subset
Update individual records  Yes                   No
Delete individual records  Yes                   No
Transactions               Yes                   No
Indexing                   Yes                   No
Latency                    Very low              Low
Data size                  Terabytes             Petabytes
ODBC / JDBC support        Yes                   Yes

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Hive Features Currently Unsupported in Impala

!Impala does not currently support some features found in Hive
Many of these are being considered for future releases
!Complex data types (ARRAY, MAP, or STRUCT)
!BINARY data type
!External transformations
!Custom SerDes
!Indexing
!Bucketing and table sampling

Other Notable Differences Between Impala and Hive

!Only one DISTINCT clause is allowed per query in Impala
The typical workaround is to use a subselect and UNION
!Impala requires that queries with ORDER BY also specify a LIMIT
This sets an outer bound on the result set
The size of the LIMIT can be arbitrarily large
!Impala and Hive handle out-of-range values differently
Hive returns NULL
Impala returns the maximum value for that type
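The out-of-range behavior above can be illustrated with a small sketch (illustrative code, not Impala's implementation):

```python
TINYINT_MAX = 127

def tinyint_overflow_impala_style(value):
    """Mimic the behavior described above for a value too large for
    TINYINT: Impala returns the type's maximum, while Hive returns NULL."""
    return TINYINT_MAX if value > TINYINT_MAX else value

print(tinyint_overflow_impala_style(300))  # 127
```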

Query Fault Tolerance in Impala

!Queries in both Hive and Impala are distributed across nodes
!Impala has its own execution engine
Currently lacks fault tolerance
If a node fails during a query, the query will fail
Workaround: just re-run the query

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Starting the Impala Shell

!You can execute statements in the Impala shell
This interactive tool is similar to the shell in MySQL or Hive
!Execute the impala-shell command to start the shell
Some log messages truncated to better fit the slide

$ impala-shell
Connected to localhost.localdomain:21000
Welcome to the Impala shell.
[localhost.localdomain:21000] >

!Use the -i hostname:port option to connect to another server

$ impala-shell -i myserver.example.com:21000
[myserver.example.com:21000] >

Using the Impala Shell

!Enter semicolon-terminated statements at the prompt
Hit Enter to execute the query
Impala pretty-prints the output
Use the quit command to exit the Impala shell

$ impala-shell
> SELECT cust_id, fname, lname FROM customers
  WHERE zipcode='20525';
+---------+--------+-----------+
| cust_id | fname  | lname     |
+---------+--------+-----------+
| 1133567 | Steven | Robertson |
| 1171826 | Robert | Gillis    |
+---------+--------+-----------+
> quit;
$

Note: shell prompt abbreviated as >

Running Queries from the Command Line

!You can execute a file containing queries using the -f option

$ impala-shell -f myquery.hql

!Run queries directly from the command line with the -q option

$ impala-shell -q 'SELECT * FROM users'

!Use -o (and optionally specify a delimiter) to capture output to a file

$ impala-shell -f myquery.hql \
    -o results.txt \
    --output_file_field_delim='\t'

Interacting with the Operating System

!Use "shell" to execute system commands from within the Impala shell

> shell date;
Mon May 20 16:44:35 PDT 2013

!There is no direct support for HDFS commands, but shell works for these too

> shell hadoop fs -mkdir /reports/sales/2013;

Accessing Impala with Hue (1)

!Alternatively, you can access Impala through Hue
!To use Hue, browse to http://hue_server:8888/
You may need to start the Hue service first (sudo service hue start)
!Launch Hue's Impala interface by clicking its icon

[Screenshot: the Impala icon in Hue]

Accessing Impala with Hue (2)

!Hue allows you to run Impala queries from your Web browser

[Screenshot: the Hue Impala Query editor containing the query below]

SELECT zipcode, COUNT(order_id) AS total
FROM customers JOIN orders
ON customers.cust_id = orders.cust_id
WHERE zipcode LIKE '6%'
GROUP BY zipcode
ORDER BY total DESC
LIMIT 100;

Accessing Impala with Hue (3)

!Hue displays the results in a sortable table

[Screenshot: the Hue Impala Query results table]

Chapter Topics

Introduction to Impala

!! What is Impala?
!! How Impala Differs from Hive and Pig
!! How Impala Differs from Relational Databases
!! Limitations and Future Directions
!! Using the Impala Shell
!! Conclusion

Essential Points

!Impala is a high-performance SQL engine
!Queries are expressed in a SQL dialect similar to HiveQL
!The primary difference compared to Hive and Pig is speed
Hive and Pig are better for long-running batch processes
Impala does not currently support all features of Hive

Bibliography (1)

The following offer more information on topics discussed in this chapter
!Free O'Reilly Cloudera Impala book
http://tiny.cloudera.com/dac15f
http://tiny.cloudera.com/dac15a
!Wired Article on Impala
http://tiny.cloudera.com/dac15b
!Cloudera Blog Detailing Impala Features and Performance
http://tiny.cloudera.com/dac15c
!Impala Documentation at the Cloudera Web site
http://tiny.cloudera.com/dac15e

Bibliography (2)

!37signals Blog Comparing Performance of Impala, Hive, and MySQL
http://tiny.cloudera.com/dac15d

Analyzing Data with Impala
Chapter 16

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Analyzing Data with Impala

In this chapter, you will learn
!How Impala's query syntax compares to HiveQL
!How to create databases and tables in Impala
!Which data types Impala supports
!How to structure your query for better performance
!What other factors influence Impala performance

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Overview of Impala Query Syntax

!Impala's query language is a subset of SQL-92
Plus a few extensions from the Oracle and MySQL dialects
!Syntax almost identical to HiveQL
Differences mainly related to features unsupported in Impala
Most Hive queries can be executed verbatim in Impala
!Impala may also support statements that Hive does not
Such as the ability to insert individual rows *

> INSERT INTO customers VALUES (1234567, 'Abe',
  'Froman', '123 Oak St.', 'Chicago', 'IL', '60601');

!Keywords are not case-sensitive, but are often capitalized by convention
!Comments are allowed in scripts, the shell, and on the command line

$ cat find_customers.sql

/* This script will query the customers table
 * and find all customers in a given ZIP code
 */
SELECT cust_id, fname, lname
FROM customers
WHERE zipcode='60601'; -- downtown Chicago

Databases and Tables in Impala

!Every Impala table belongs to a database
Impala and Hive share a metastore
The same databases and tables are visible in both Hive and Impala
!The default database is selected at startup
The USE command switches to another database
List the tables in a database with the SHOW TABLES command

> USE accounting;
> SHOW TABLES;
+----------+
| name     |
+----------+
| accounts |
+----------+

Creating Databases and Tables in Impala

!Data definition is generally identical to Hive
Custom SerDes and bucketing are unsupported

> CREATE DATABASE sales;
> USE sales;
> CREATE EXTERNAL TABLE prospects
  (id INT,
   name STRING COMMENT 'Include surname',
   email STRING,
   active BOOLEAN COMMENT 'True, if on mailing list',
   last_contact TIMESTAMP)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LOCATION '/dept/sales/prospects';

Displaying Table Structure

!Use DESCRIBE to display a table's structure
DESCRIBE EXTENDED is unsupported

> DESCRIBE prospects;
+--------------+-----------+--------------------------+
| name         | type      | comment                  |
+--------------+-----------+--------------------------+
| id           | int       |                          |
| name         | string    | Include surname          |
| email        | string    |                          |
| active       | boolean   | True, if on mailing list |
| last_contact | timestamp |                          |
+--------------+-----------+--------------------------+

Impala and Table Metadata

!Impala shares the metastore with Hive
Tables created in Hive are visible in Impala (and vice versa)
!Impala daemons cache table metadata across the cluster, including:
The tables' schema definitions
The locations of the tables' HDFS blocks
!When tables or data are modified in Hive, refresh Impala's cache
REFRESH <table> updates the metadata for one table immediately,
retrieving block locations for new data files only
INVALIDATE METADATA marks the metadata for all tables (or a single
table) as stale; when the table is next queried, all HDFS block
locations are retrieved
Use INVALIDATE METADATA when a table has been created or extensively
modified in Hive, or after HDFS balancing

Selecting Data in Impala

!Use SELECT to retrieve data from tables
Results are formatted for display

> SELECT fname, lname, city, state FROM customers
  WHERE cust_id = 1234567;
+-------+--------+---------+-------+
| fname | lname  | city    | state |
+-------+--------+---------+-------+
| Abe   | Froman | Chicago | IL    |
+-------+--------+---------+-------+

!Impala does not require a FROM clause

> SELECT SQRT(64) AS square_root;

Using Impala Built-in Functions

!Invoke built-in functions as you would in SQL or HiveQL

> SELECT CONCAT_WS(', ', lname, fname) AS fullname
  FROM customers WHERE cust_id=1234567;
+-------------+
| fullname    |
+-------------+
| Froman, Abe |
+-------------+

!Impala supports many of the same built-in functions as Hive
It lacks some others, including many formatting and text
processing functions

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Data Types in Impala

!Each column in a table is associated with a data type
Impala supports most types available in Hive
Most are similar to those found in relational databases

> DESCRIBE prospects;
+--------------+-----------+--------------------------+
| name         | type      | comment                  |
+--------------+-----------+--------------------------+
| id           | int       |                          |
| name         | string    | Include surname          |
| email        | string    |                          |
| active       | boolean   | True, if on mailing list |
| last_contact | timestamp |                          |
+--------------+-----------+--------------------------+

Impala's Integer Types

!Integer types are appropriate for whole numbers
Both positive and negative values allowed

Name      Description                                      Example Value
TINYINT   Range: -128 to 127                               17
SMALLINT  Range: -32,768 to 32,767                         5842
INT       Range: -2,147,483,648 to 2,147,483,647           84127213
BIGINT    Range: ~ -9.2 quintillion to ~ 9.2 quintillion   632197432180964
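These ranges follow directly from each type's storage width: an n-bit signed integer spans -2^(n-1) through 2^(n-1) - 1. A quick check in Python:

```python
def signed_range(bits):
    """Return (min, max) for a signed two's-complement integer of the
    given width; TINYINT/SMALLINT/INT/BIGINT are 8/16/32/64 bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

for name, bits in [('TINYINT', 8), ('SMALLINT', 16), ('INT', 32), ('BIGINT', 64)]:
    print(name, signed_range(bits))
```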

Impala's Decimal Types

!Decimal types are appropriate for floating point numbers
Both positive and negative values allowed
Caution: avoid using them when exact values are required!

Name    Description            Example Value
FLOAT   Decimals               3.14159
DOUBLE  Very precise decimals  3.14159265358979323846

Other Types in Impala

!Impala can also store a few other types of information
Only one character type (variable length)

Name       Description         Example Value
STRING     Character sequence  Betty F. Smith
TIMESTAMP  Instant in time     2013-06-14 16:51:05

!Impala does not support BINARY or complex types

Data Type Conversion

!Hive auto-converts a STRING column used in a numeric context

hive> SELECT zipcode FROM customers LIMIT 1;
60601
hive> SELECT zipcode + 1.5 FROM customers LIMIT 1;
60602.5

!Impala requires an explicit CAST operation for this

> SELECT zipcode + 1.5 FROM customers LIMIT 1;
ERROR: AnalysisException: Arithmetic operation ...
> SELECT CAST(zipcode AS FLOAT) + 1.5
  FROM customers LIMIT 1;
60602.5
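Python draws the same line that Impala does here: mixing a string and a number is an error unless you convert explicitly:

```python
zipcode = '60601'  # a STRING value, as in the customers table

try:
    zipcode + 1.5            # implicit conversion: rejected
except TypeError as err:
    print('error:', err)     # analogous to Impala's AnalysisException

print(float(zipcode) + 1.5)  # explicit cast succeeds: 60602.5
```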

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Limiting and Sorting Query Results

!The LIMIT clause sets the maximum number of rows returned

> SELECT fname, lname FROM customers LIMIT 10;

!Caution: there is no guarantee regarding which 10 results are returned
Use ORDER BY for top-N queries
The field(s) you ORDER BY must be selected
!When using ORDER BY, the LIMIT clause is mandatory in Impala

> SELECT cust_id, fname, lname FROM customers
  ORDER BY cust_id DESC LIMIT 10;

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Joins in Impala

!Like Hive, Impala can join multiple data sets
!Impala supports the same types of joins that Hive does
Inner joins
Outer joins (left, right, and full)
Left semi joins
Cross joins

Record Grouping and Aggregate Functions

!GROUP BY groups selected data by one or more columns
Caution: columns not part of an aggregation must be listed in GROUP BY

Query:
SELECT region, state,
       COUNT(id) AS num
FROM stores
GROUP BY region, state;

stores table:
id  city     state  region
a   Albany   NY     EAST
b   Boston   MA     EAST
c   Chicago  IL     NORTH
d   Detroit  MI     NORTH
e   Elgin    IL     NORTH

Result of query:
region  state  num
EAST    MA     1
EAST    NY     1
NORTH   IL     2
NORTH   MI     1
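The aggregation in this example amounts to counting rows per (region, state) pair, which is easy to verify with a few lines of Python:

```python
from collections import Counter

# The stores table from the slide, reduced to the grouped columns
stores = [('EAST', 'NY'), ('EAST', 'MA'), ('NORTH', 'IL'),
          ('NORTH', 'MI'), ('NORTH', 'IL')]

# GROUP BY region, state with COUNT(id) is a per-key row count
counts = Counter(stores)
for (region, state), num in sorted(counts.items()):
    print(region, state, num)
```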

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Overview of Impala User-Defined Functions (UDFs)

!Like Hive, Impala supports User-Defined Functions (UDFs)
!Hive UDFs can be used in Impala with no changes
With a few exceptions
!There are two types of UDFs in Impala
Standard UDFs
User-Defined Aggregate Functions (UDAFs)
!Impala UDFs can be written in Java or C++
C++ UDFs are implemented as shared objects
!Impala C++ UDFs cannot be used in Hive
Using a Java UDF in Impala (1)

!Register the function with Impala
Specify data types that correspond to the method signature of the UDF class's evaluate method after the function name
Specify data types that correspond to the return type of the UDF class's evaluate method in the RETURNS clause
Identify the jar file containing the UDF class in the LOCATION clause
Specify the UDF class name in the SYMBOL clause

CREATE FUNCTION STRIP(STRING) RETURNS STRING
LOCATION '/user/hive/udfs/MyUDFs.jar'
SYMBOL='com.example.hive.ql.udf.UDFStrip';

Using a Java UDF in Impala (2)

!You may then use the function in Impala queries
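
The slide's usage example did not survive conversion; below is a minimal sketch, reusing the STRIP function registered on the previous slide and the customers table from earlier examples (the query itself is an assumption, not from the slide):

```sql
-- Apply the Java UDF exactly like a built-in function
SELECT cust_id, STRIP(fname) FROM customers;
```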

Using a C++ UDF in Impala

!Register the function with Impala

CREATE FUNCTION COUNT_VOWELS(STRING)
RETURNS INT
LOCATION '/user/hive/udfs/sampleudfs.so'
SYMBOL='CountVowels';

!You may then use the function in your query
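
A minimal usage sketch (the fname column is borrowed from the earlier customers examples; the query itself is not from the slide):

```sql
-- Apply the C++ UDF exactly like a built-in function
SELECT fname, COUNT_VOWELS(fname) FROM customers;
```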

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Impala Performance Overview

!Several factors affect Impala performance
Computing statistics on tables before running joins
The format and type of data being queried
The hardware and configuration of your cluster

Join Performance Optimization

!You should compute statistics for tables with COMPUTE STATS
When the amount of data in a table changes substantially

COMPUTE STATS orders;
COMPUTE STATS order_details;

SELECT COUNT(o.order_id) FROM orders o
JOIN order_details d ON (o.order_id = d.order_id)
WHERE YEAR(o.order_date) = 2008;

Data Formats Supported by Impala

!Impala can query data in several formats
Table below summarizes compatibility

File Type     Description of File Type                 Read  Create  Insert
Parquet       High-performance columnar format         Yes   Yes     Yes
Text          Plaintext delimited flat file format     Yes   Yes     Yes
Avro          Structured cross-platform binary format  Yes   No      No
RCFile        Columnar format compatible with Hive     Yes   Yes     No
SequenceFile  Binary flat file format                  Yes   Yes     No

Data Formats and Optimization

!The limiting factor in most queries is I/O
Disk speed is the most common bottleneck
Columnar formats reduce I/O when few columns are selected
!If performance is a key concern, choose Parquet
This may limit compatibility with other tools

> CREATE TABLE order_details
  (order_id INT,
   prod_id INT)
  STORED AS PARQUETFILE;

Impala Query Size

!Query size is based on a query's working set size
The working set of a query contains all records after
Filtering rows
Pruning unused columns
Performing aggregation, if applicable
!For aggregations, the query size is the working set size for all the tables in the query
!For joins, the query size is the working set size for all the tables in the join, excluding the largest table
!Impala queries must fit into the cluster's aggregate memory
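
To make the join rule concrete, here is a hypothetical sizing; all table names and numbers are invented for illustration:

```sql
-- fact_table: working set ~2 TB  (the largest table, so excluded)
-- dim_table:  working set ~20 GB after row filtering and column pruning
--
-- Query size = working sets of all joined tables except the largest
--            = ~20 GB
-- This ~20 GB must fit into the cluster's aggregate memory.
```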

Cluster Hardware

!Memory: 32GB minimum, 64GB is better
!Disks: more is better
Impala tries to maximize throughput across disks
Servers with 12 or more disks are ideal

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Hands-On Exercise: Interactive Analysis with Impala

Chapter Topics

Analyzing Data with Impala

!! Basic Syntax
!! Data Types
!! Filtering, Sorting, and Limiting Results
!! Joining and Grouping Data
!! User-Defined Functions
!! Improving Impala Performance
!! Hands-On Exercise: Interactive Analysis with Impala
!! Conclusion

Essential Points

!Impala's query syntax is nearly identical to HiveQL
Most Hive queries can be executed verbatim in Impala
A metadata refresh is needed following external changes
!Impala supports most simple data types from Hive
!Query structure and file format can affect performance
!Your cluster's hardware also affects performance

Bibliography

The following offer more information on topics discussed in this chapter
http://tiny.cloudera.com/dac16a
!Impala Language Reference
http://tiny.cloudera.com/dac16b

Choosing the Best Tool for the Job
Chapter 17

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Choosing the Best Tool for the Job
!! Conclusion

Choosing the Best Tool for the Job

In this chapter, you will learn
!How MapReduce, Pig, Hive, Impala, and RDBMSs compare to one another
!Why a workflow might involve several different tools
!How to select the best tool for a given job

Chapter Topics

Choosing the Best Tool for the Job

!! Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
!! Which to Choose?
!! Conclusion

Recap of Data Analysis/Processing Tools

!MapReduce
Low-level processing and analysis
!Pig
Procedural data flow language executed using MapReduce
!Hive
SQL-based queries executed using MapReduce
!Impala
High-performance SQL-based queries using a custom execution engine

Comparing Pig, Hive, and Impala

Description of Feature               Pig   Hive  Impala
SQL-based query language             No    Yes   Yes
Optional schema and metastore        Yes   No    No
User-defined functions (UDFs)        Yes   Yes   Yes
Process data with external scripts   Yes   Yes   No
Extensible file format support       Yes   Yes   No
Complex data types                   Yes   Yes   No
Query latency                        High  High  Low
Built-in data partitioning           No    Yes   Yes
Accessible via ODBC / JDBC           No    Yes   Yes

Do These Replace an RDBMS?

!Probably not if the RDBMS is used for its intended purpose
!Relational databases are optimized for
Relatively small amounts of data
Immediate results
In-place modification of data (UPDATE and DELETE)
!Pig, Hive, and Impala are optimized for
Extensive scalability at low cost
!Pig and Hive are better suited for batch processing
Impala and RDBMSs are better for interactive use

Comparing RDBMS to Hive and Impala

                           RDBMS      Hive       Impala
Insert individual records  Yes        No         Yes
Update and delete records  Yes        No         No
Transactions               Yes        No         No
Role-based authorization   Yes        Yes        No
Stored procedures          Yes        No         No
Index support              Extensive  Limited    None
Latency                    Very low   High       Low
Data size                  Terabytes  Petabytes  Petabytes
Complex data types         No         Yes        No
Storage cost               Very high  Very low   Very low

Recap: Apache Sqoop

!Sqoop transfers data between an RDBMS and HDFS
Can import all tables, a single table, or a portion of a table into HDFS
Supports incremental imports
Can also export data from HDFS back to the database
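
The Sqoop commands below sketch the capabilities listed above; the hostname, database, table names, and paths are hypothetical and not from the slide:

```shell
# Import a single table into HDFS
sqoop import --connect jdbc:mysql://dbhost/salesdb \
  --username analyst --table customers \
  --target-dir /user/analyst/customers

# Incremental import: append only rows whose cust_id exceeds the last value
sqoop import --connect jdbc:mysql://dbhost/salesdb \
  --username analyst --table customers \
  --incremental append --check-column cust_id --last-value 12345

# Export data from HDFS back to the database
sqoop export --connect jdbc:mysql://dbhost/salesdb \
  --username analyst --table report_results \
  --export-dir /user/analyst/results
```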

Chapter Topics

Choosing the Best Tool for the Job

!! Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
!! Which to Choose?
!! Conclusion

Which to Choose?

Mix and match them as needed
!MapReduce
Low-level approach offers great flexibility
More time-consuming and error-prone to write
Best when control matters more than productivity
!Pig, Hive, and Impala offer more productivity
Faster to write, test, and deploy than MapReduce

Analysis Workflow Example

[Workflow diagram with the following stages:]
Import Transaction Data from RDBMS
Sessionize Web Log Data with Pig
Sentiment Analysis on Social Media with Hive
… with Impala
Analyst using Impala shell for ad hoc queries
Analyst using Impala via BI tool
Generate Nightly Reports using Pig, Hive, or Impala

Chapter Topics

Choosing the Best Tool for the Job

!! Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
!! Which to Choose?
!! Conclusion

Essential Points

Choose the best one for a given job
Workflows may involve exchanging data between them
!Selection criteria include scale, speed, control, and productivity
MapReduce offers control at the cost of productivity
Pig and Hive offer productivity but not necessarily speed
Relational databases offer speed but not scalability
Impala offers scalability and speed but less control

Conclusion
Chapter 18

Course Chapters

!! Introduction
!! Introduction to Pig
!! Basic Data Analysis with Pig
!! Processing Complex Data with Pig
!! Multi-Dataset Operations with Pig
!! Extending Pig
!! Pig Troubleshooting and Optimization
!! Introduction to Hive
!! Relational Data Analysis with Hive
!! Hive Data Management
!! Text Processing With Hive
!! Hive Optimization
!! Extending Hive
!! Introduction to Impala
!! Analyzing Data with Impala
!! Interoperability and Workflows
!! Conclusion

Course Objectives (1)

During this course, you have learned
!The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
!How to identify typical use cases for large-scale data analysis
!How to manage data in HDFS and export it for use with other systems
!The language syntax and data formats supported by these tools

Course Objectives (2)

!How to design and execute queries on data stored in HDFS
!How to analyze structured, semi-structured, and unstructured data
!How Hive and Pig can be extended with custom functions and scripts
!How to store and query data for better performance

Which Course to Take Next?

Cloudera offers a range of training courses for you and your team
!For developers