
Pig

-Pig is an engine for executing data flows in parallel on Hadoop

-includes the language Pig Latin to express these data flows
-a dataflow language, because it allows users to describe how input
should be read in, processed, and output, all in parallel
-a Pig Latin script is a DAG, or directed acyclic graph, where the edges are
data flows and the nodes are operators that process the data
-pig is not a good choice for writing small groups of records or
looking up records in random order
-pig came out of Yahoo!
CLI and config options
pig -h displays all of these

Type quit; to exit pig


Type pig to enter the Grunt shell in MapReduce (Hadoop) mode

-hadoop fs commands can be accessed from Grunt with fs followed by the
command keyword, e.g. fs -ls

Chapter 4
-pig data types
-divided into scalar and complex types
-all represented in pig interfaces by java.lang classes
-scalar types: int, long, float, double, chararray, and bytearray
Complex types:

All three can be used within one another.


1. Map- a chararray-to-data-element mapping. The element
can be any type (complex types too). The
chararray is the key.
-the value is assumed to be a bytearray if not specified
-values can be of different types
Ex-[name#bob,age#55]
2. Tuple- fixed-length ordered collection of pig data elements
-ex-(bob,55)
-analogous to a row in SQL
3. Bag- unordered collection of tuples
- Can't reference a tuple by position in a bag, since it is
unordered
- Can have a schema
- EX- {(bob,55),(sally,52),(john,25)} is a bag with three tuples,
each with two fields
- the one type not required to fit into memory (large bags can be
spilled to disk)
Nulls- as in SQL, a null data element in pig means the value is unknown
Schemas
-pig is very lax about schemas and guesses when it has to
Example:
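(a minimal sketch of declaring a schema in the as clause; the file and
field names follow the book's NYSE examples and are illustrative)

    divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividend:float);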

-for wide datasets (say, >50 columns) it is easier to have the load
function supply the schema than to spell it out in an as clause


Page 45
-$0, $1, ... refer to fields by position
-even if pig knows the schema at one point, it can lose track of it later
on in the script
Casts

-casts go in parentheses before the value, e.g. (int)
-casts to bytearray are never allowed, but casts from bytearray to anything
else are
-casts to/from complex types are not allowed
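A minimal sketch of a cast (file and field names illustrative):

    raw = load 'data' as (fld:bytearray);
    typed = foreach raw generate (int)fld + 1;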

Chapter 5- introduction to pig latin


Pig is a dataflow language. Each processing step results in a new
dataset (relation)
-comments are either -- for a single line or /* */ for multiline
Input/output
-load
-pig jobs run against your home directory on HDFS, /user/yourlogin, by
default
-the using clause specifies a load function
-if no function is specified, PigStorage() is used
-e.g. using PigStorage(',') reads comma-delimited files
-as clause to specify the schema
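A short sketch of load with using and as (file name illustrative):

    divs = load 'NYSE_dividends' using PigStorage(',')
        as (exchange, symbol, date, dividend);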

PigStorage/TextLoader- support globs.
Globs- read multiple files that are not in the same directory.
-glob syntax is version specific (it depends on the underlying HDFS version)
-escape many of the glob characters to prevent the CLI from expanding them
itself
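A hypothetical glob example (the paths are illustrative):

    logs = load '/data/logs/2010-{01,02,03}-*' using TextLoader();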

Store
-use store to save results to HDFS; the default is a tab-delimited file

-the default function used is PigStorage()

-e.g. using PigStorage(',') stores a comma-delimited file
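Sketch (the output path is illustrative):

    store processed into '/data/examples/processed' using PigStorage(',');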


Dump
-shows output on screen
-use dump <alias>;
-each record is printed as a tuple, and missing (null) fields are just
separated by commas
Relational Operators
-used to operate on the data
1.foreach

-foreach takes a set of expressions and applies them to every record
in the data pipeline
-it is the projection operator, because it selects which fields are sent
to the next operator
Expressions in foreach

-positional references start with $ and count columns from 0

-refer to all fields using the asterisk (*); ranges of fields can be
referred to with ..
-example below
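(a sketch in the book's NYSE style; field names illustrative)

    daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,
        close, volume);
    gain = foreach daily generate close - open;
    beginning = foreach daily generate ..open;  -- exchange through open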

-arithmetic operators are supported as well

-null values propagate: x + null = null
-bincond (ternary conditional) operator
-begins with a boolean expression, then a ?, followed by the
value to return if true, then a :, and then the value to return
if false (both return values must be the same type)
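A small sketch of bincond inside foreach (assumes a daily alias like the
one above):

    updown = foreach daily generate (close > open ? 'up' : 'down');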
Extracting data from complex types

-use the projection operator # for maps, followed by the
name of the key as a string

-use the projection operator . for tuples; the field can be
referenced by name or by position

-use the projection operator . for bags as well

-when you project fields from a bag you create a new bag
with only those fields
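Sketch of all three projections (alias and field names illustrative):

    A = load 'input' as (t:tuple(x:int, y:int), b:bag{r:(x:int, y:int)},
        m:map[]);
    prj = foreach A generate t.x, t.$1, b.x, m#'key';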

UDFs
-used within foreach statements

-each foreach statement produces records with a new
schema
-to assign names to the fields in the output of a foreach, use as
-simple projections keep the names of the input fields
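Sketch using the built-in UPPER UDF together with as:

    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
    upped = foreach divs generate UPPER(symbol) as symbol, dividend;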

Filter

-selects which records are kept in the data pipeline

-takes a predicate built from the comparison operators
(==, !=, >, >=, <, <=). All of them can be used to compare scalars;
== and != can also be used with maps and tuples.
Maps/tuples being compared must either both have the same schema or
both have no schema

-none of the comparison operators can be applied to bags

-for chararrays, use matches with a regular expression

-use and/or to combine multiple predicates

- is null / is not null- keep either the records whose value is null
or the ones whose value is not null
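Sketch (the regex example is the book's; aliases illustrative):

    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
    startswithcm = filter divs by symbol matches 'CM.*';
    notnull = filter divs by dividend is not null;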
Group
-collects together all records with the same value for the provided key
and stores them in a bag

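Sketch of grouping by a single key and by multiple keys (book-style
aliases):

    daily = load 'NYSE_daily' as (exchange, stock);
    grpd = group daily by stock;
    cnt = foreach grpd generate group, COUNT(daily);
    grpd2 = group daily by (exchange, stock);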

Order by

-produces a total order: the data is sorted within each partition
(part file), and the part files themselves are ordered

-sorts the data in ascending order by the key that follows the by

-use desc for descending order

-nulls have the least value, and are thus placed first in ascending order

-pig will run a MapReduce job first to sample the key
distribution and work out a balanced distribution for the reducer
tasks, followed by the job that actually sorts the data
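Sketch:

    daily = load 'NYSE_daily' as (exchange, symbol, date, open, close);
    byclose = order daily by close desc, open;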
Distinct
-removes duplicate records; works only on entire
records, not on individual fields
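Sketch:

    daily = load 'NYSE_daily' as (exchange, symbol);
    uniq = distinct daily;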

Join
-Ex- jnd = join daily by symbol, divs by symbol;

-to do joins based on multiple columns:
-Ex- jnd = join daily by (symbol,date), divs by (symbol,date);
-join preserves the names of the fields of the inputs
passed to it
-in describe <join alias>; you will see the fields named
<input_alias>::<field name> after the join
-outer joins are supported by adding left, right, or full

-with outer joins pig needs to know the respective schema
so it knows how many null values to fill in: for a left outer join pig
must know the schema of the right side, and vice versa; for a full outer
join it must know the schemas of both sides
-in an inner join, records with null values for keys are dropped; in outer
joins, the null keys are kept
-multiple inner joins can be done in one statement (see the sketch below)
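Sketch of a multiway join (inner joins only; aliases illustrative):

    A = load 'input1' as (x, y);
    B = load 'input2' as (x, z);
    C = load 'input3' as (x, w);
    jnd = join A by x, B by x, C by x;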

Limit
-allows you to see a limited # of results

-limit adds another reduce phase, since it needs to collect the
records together to count how many it is returning
-adding an order before the limit guarantees the same output every time,
but otherwise the result may be different on each run
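Sketch:

    divs = load 'NYSE_dividends';
    first10 = limit divs 10;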
Sample

-randomly selects a percentage of the records (see the sketch below); the
0.1 means 10%
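A sketch of the statement the 0.1 refers to:

    divs = load 'NYSE_dividends';
    some = sample divs 0.1;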

Parallelism

-set default_parallel 10 causes a script-wide use of 10 reducers; a
parallel clause on an individual operator sets the reducer count for that
operator only (see the sketch below)
-pig versions >0.8 have a heuristic that tries to allocate 1 reducer
per 1GB of input data
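Sketch of both forms:

    set default_parallel 10;                      -- script-wide
    daily = load 'NYSE_daily' as (exchange, symbol);
    bysymbl = group daily by symbol parallel 10;  -- this operator only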
UDFs
-can be written in Java or Python
-UDFs that are not built into pig have to be registered with register

-Python UDFs
-the python script must be in the current working directory

-must include jython.jar in your classpath, which is done by
setting the PIG_CLASSPATH variable
-as bballudfs defines a namespace for all UDFs from that
file
-all of its UDFs are then invoked as bballudfs.udfname
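The register line these notes refer to looks like this (the script name is
the book's example):

    register 'production.py' using jython as bballudfs;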
Define and UDFs

-here define is used to provide an alias for the
Reverse() UDF
-define can also provide constructor arguments to the
UDF
Example below
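A sketch of define providing an alias and constructor arguments (the
CurrencyConverter class follows the book's illustrative example):

    register 'acme.jar';
    define convert com.acme.financial.CurrencyConverter('dollar', 'euro');
    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
    amt = foreach divs generate convert(dividend);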

Static java functions

-pig has invoker methods that allow certain java
static functions to be treated as UDFs
-can only be used with static functions whose parameters and return types
are int, long, float, double, String, or arrays of these

-must use define to bind the invoker to an alias

-provide first the full package/class/method name and
secondly a space-separated list of the parameter types the java
function expects

-for array parameters, the type name is followed by [] (see the sketch
below)

-throws IllegalArgumentException when handed invalid data, such as null
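Sketch of invoker usage (InvokeForString/InvokeForDouble are pig built-ins;
the Stats class is hypothetical):

    define hex InvokeForString('java.lang.Integer.toHexString', 'int');
    define stdev InvokeForDouble('com.acme.math.Stats.stdev', 'double[]');
    daily = load 'NYSE_daily' as (symbol:chararray, volume:int);
    inhex = foreach daily generate symbol, hex(volume);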
Chapter 6- Advanced pig latin

-flatten- un-nests bags and tuples, promoting their contents into the
top-level record

#go over this later page 58-59


Nested Foreach
-applies a set of relational operations to each record in the data
pipeline, inside the body of the foreach
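Sketch (the book's count-distinct-symbols example; field names
illustrative):

    daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
    grpd = group daily by exchange;
    uniqcnt = foreach grpd {
        sym = daily.symbol;
        uniq_sym = distinct sym;
        generate group, COUNT(uniq_sym);
    };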

-Joins
-pig allows the user to pick the join implementation via the
using clause
Joining a small dataset to a large one
Fragment-replicate join- basically fragment one (large) input and
replicate the other (small) one to every node
-uses only map tasks (no reduce phase)

-'replicated' in the using clause tells pig to use a fragment-replicate
join; it uses a lot of memory
-the second input (and any after it) is stored in memory
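Sketch (using 'replicated' is the actual pig syntax; aliases illustrative):

    daily = load 'NYSE_daily' as (exchange, symbol);
    divs = load 'NYSE_dividends' as (exchange, symbol);
    jnd = join daily by (exchange, symbol), divs by (exchange, symbol)
        using 'replicated';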
