
Pig

-Pig is an engine for executing data flows in parallel on Hadoop

-includes the language Pig Latin to express these data flows
-a dataflow language, because it allows users to describe how input
should be read in, processed, and output, all in parallel
-a Pig Latin script is a DAG, or directed acyclic graph, where the edges are
data flows and the nodes are operators that process the data
-pig is not a good choice for writing small groups of records or
looking up records in random order
-pig came out of Yahoo!
CLI and config options
pig -h displays all of these

Type quit; to exit pig


Type pig to enter the Grunt shell in MapReduce (Hadoop) mode

-hadoop fs commands can be accessed from Grunt with fs followed by the
command keyword, e.g. fs -ls

Chapter 4
-pig data types
-divided into scalar and complex types
-all represented in pig interfaces by java.lang classes
-scalar types: int, long, float, double, chararray, and bytearray
Complex types:

All three can be used within one another.


1. Map- a chararray-to-data-element mapping. The element
can be any type (complex types too). The
chararray is the key.
-the value is assumed to be a bytearray if not specified
-values can be of different types
Ex-[name#bob,age#55]
2. Tuple- fixed-length ordered collection of pig data elements
-ex-(bob,55)
-analogous to a row in SQL
3. Bag- unordered collection of tuples
- Can't reference a tuple by position in a bag, since it is
unordered
- Can have a schema
- EX- {(bob,55),(sally,52),(john,25)} is a bag with three tuples,
each with two fields
- the one type not required to fit into memory (large bags can be
spilled to disk)
Nulls- as in SQL, a null data element in pig means the value is unknown
Schemas
-pig is very lax about schemas and guesses when it has to
Example:
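(a minimal sketch of declaring a schema in the as clause; the file and
field names follow the book's NYSE examples and are illustrative)

    divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividend:float);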

-for wide datasets (say, >50 columns) it is easier to have the load
function supply the schema than to spell it out in an as clause


Page 45
-$0, $1, ... refer to fields by position
-even if pig knows the schema at one point, it can lose track of it later
on in the script
Casts

-casts go in parentheses before the value, e.g. (int)
-casts to bytearray are never allowed, but casts from bytearray to anything
else are
-casts to/from complex types are not allowed
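A minimal sketch of a cast (file and field names illustrative):

    raw = load 'data' as (fld:bytearray);
    typed = foreach raw generate (int)fld + 1;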

Chapter 5- introduction to pig latin


Pig is a dataflow language. Each processing step results in a new
dataset (relation)
-comments are either -- for a single line or /* */ for multiline
Input/output
-load
-pig jobs run against your home directory on HDFS, /user/yourlogin, by
default
-the using clause specifies a load function
-if no function is specified, PigStorage() is used
-e.g. using PigStorage(',') reads comma-delimited files
-as clause to specify the schema
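A short sketch of load with using and as (file name illustrative):

    divs = load 'NYSE_dividends' using PigStorage(',')
        as (exchange, symbol, date, dividend);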

PigStorage/TextLoader- support globs.
Globs- read multiple files that are not in the same directory.
-glob syntax is version specific (it depends on the underlying HDFS version)
-escape many of the glob characters to prevent the CLI from expanding them
itself
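A hypothetical glob example (the paths are illustrative):

    logs = load '/data/logs/2010-{01,02,03}-*' using TextLoader();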

Store
-use store to save results to HDFS; the default is a tab-delimited file

-the default function used is PigStorage()

-e.g. using PigStorage(',') stores a comma-delimited file
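Sketch (the output path is illustrative):

    store processed into '/data/examples/processed' using PigStorage(',');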


Dump
-shows output on screen
-use dump <alias>;
-each record is printed as a tuple, and missing (null) fields are just
separated by commas
Relational Operators
-used to operate on the data
1.foreach

-foreach takes a set of expressions and applies them to every record
in the data pipeline
-it is the projection operator, because it selects which fields are sent
to the next operator
Expressions in foreach

-positional references start with $ and count columns from 0

-refer to all fields using the asterisk (*); ranges of fields can be
referred to with ..
-example below
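(a sketch in the book's NYSE style; field names illustrative)

    daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,
        close, volume);
    gain = foreach daily generate close - open;
    beginning = foreach daily generate ..open;  -- exchange through open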

-arithmetic operators are supported as well

-null values propagate: x + null = null
-bincond (ternary conditional) operator
-begins with a boolean expression, then a ?, followed by the
value to return if true, then a :, and then the value to return
if false (both return values must be the same type)
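A small sketch of bincond inside foreach (assumes a daily alias like the
one above):

    updown = foreach daily generate (close > open ? 'up' : 'down');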
Extracting data from complex types

-use the projection operator # for maps, followed by the
name of the key as a string

-use the projection operator . for tuples; the field can be
referenced by name or by position

-use the projection operator . for bags as well

-when you project fields from a bag you create a new bag
with only those fields
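Sketch of all three projections (alias and field names illustrative):

    A = load 'input' as (t:tuple(x:int, y:int), b:bag{r:(x:int, y:int)},
        m:map[]);
    prj = foreach A generate t.x, t.$1, b.x, m#'key';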

UDFs
-used within foreach statements

-each foreach statement produces records with a new
schema
-to assign names to the fields in the output of a foreach, use as
-simple projections keep the names of the input fields
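Sketch using the built-in UPPER UDF together with as:

    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
    upped = foreach divs generate UPPER(symbol) as symbol, dividend;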

Filter

-selects which records are kept in the data pipeline

-takes a predicate built from the comparison operators
(==, !=, >, >=, <, <=). All of them can be used to compare scalars;
== and != can also be used with maps and tuples.
Maps/tuples being compared must either both have the same schema or
both have no schema

-none of the comparison operators can be applied to bags

-for chararrays, use matches with a regular expression

-use and/or to combine multiple predicates

- is null / is not null- keep either the records whose value is null
or the ones whose value is not null
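Sketch (the regex example is the book's; aliases illustrative):

    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
    startswithcm = filter divs by symbol matches 'CM.*';
    notnull = filter divs by dividend is not null;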
Group
-collects together all records with the same value for the provided key
and stores them in a bag

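Sketch of grouping by a single key and by multiple keys (book-style
aliases):

    daily = load 'NYSE_daily' as (exchange, stock);
    grpd = group daily by stock;
    cnt = foreach grpd generate group, COUNT(daily);
    grpd2 = group daily by (exchange, stock);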

Order by

-produces a total order: the data is sorted within each partition
(part file), and the part files themselves are ordered

-sorts the data in ascending order by the key that follows the by

-use desc for descending order

-nulls have the least value, and are thus placed first in ascending order

-pig will run a MapReduce job first to sample the key
distribution and work out a balanced distribution for the reducer
tasks, followed by the job that actually sorts the data
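Sketch:

    daily = load 'NYSE_daily' as (exchange, symbol, date, open, close);
    byclose = order daily by close desc, open;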
Distinct
-removes duplicate records; works only on entire
records, not on individual fields
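Sketch:

    daily = load 'NYSE_daily' as (exchange, symbol);
    uniq = distinct daily;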

Join
-Ex- jnd = join daily by symbol, divs by symbol;

-to do joins based on multiple columns:
-Ex- jnd = join daily by (symbol,date), divs by (symbol,date);
-join preserves the names of the fields of the inputs
passed to it
-in describe <join alias>; you will see the fields named
<input_alias>::<field name> after the join
-outer joins are supported by adding left, right, or full

-with outer joins pig needs to know the respective schema
so it knows how many null values to fill in: for a left outer join pig
must know the schema of the right side, and vice versa; for a full outer
join it must know the schemas of both sides
-in an inner join, records with null values for keys are dropped; in outer
joins, the null keys are kept
-multiple inner joins can be done in one statement (see the sketch below)
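Sketch of a multiway join (inner joins only; aliases illustrative):

    A = load 'input1' as (x, y);
    B = load 'input2' as (x, z);
    C = load 'input3' as (x, w);
    jnd = join A by x, B by x, C by x;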

Limit
-allows you to see a limited # of results

-limit adds another reduce phase, since it needs to collect the
records together to count how many it is returning
-adding an order before the limit guarantees the same output every time,
but otherwise the result may be different on each run
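Sketch:

    divs = load 'NYSE_dividends';
    first10 = limit divs 10;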
Sample

-randomly selects a percentage of the records (see the sketch below); the
0.1 means 10%
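A sketch of the statement the 0.1 refers to:

    divs = load 'NYSE_dividends';
    some = sample divs 0.1;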

Parallelism

-set default_parallel 10 causes a script-wide use of 10 reducers; a
parallel clause on an individual operator sets the reducer count for that
operator only (see the sketch below)
-pig versions >0.8 have a heuristic that tries to allocate 1 reducer
per 1GB of input data
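Sketch of both forms:

    set default_parallel 10;                      -- script-wide
    daily = load 'NYSE_daily' as (exchange, symbol);
    bysymbl = group daily by symbol parallel 10;  -- this operator only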
UDFs
-can be written in Java or Python
-UDFs that are not built into pig have to be registered with register

-Python UDFs
-the python script must be in the current working directory

-must include jython.jar in your classpath, which is done by
setting the PIG_CLASSPATH variable
-as bballudfs defines a namespace for all UDFs from that
file
-all of its UDFs are then invoked as bballudfs.udfname
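The register line these notes refer to looks like this (the script name is
the book's example):

    register 'production.py' using jython as bballudfs;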
Define and UDFs

-here define is used to provide an alias for the
Reverse() UDF
-define can also provide constructor arguments to the
UDF
Example below
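A sketch of define providing an alias and constructor arguments (the
CurrencyConverter class follows the book's illustrative example):

    register 'acme.jar';
    define convert com.acme.financial.CurrencyConverter('dollar', 'euro');
    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
    amt = foreach divs generate convert(dividend);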

Static java functions

-pig has invoker methods that allow certain java
static functions to be treated as UDFs
-can only be used with static functions whose parameters and return types
are int, long, float, double, String, or arrays of these

-must use define to bind the invoker to an alias

-provide first the full package/class/method name and
secondly a space-separated list of the parameter types the java
function expects

-for array parameters, the type name is followed by [] (see the sketch
below)

-throws IllegalArgumentException when handed invalid data, such as null
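Sketch of invoker usage (InvokeForString/InvokeForDouble are pig built-ins;
the Stats class is hypothetical):

    define hex InvokeForString('java.lang.Integer.toHexString', 'int');
    define stdev InvokeForDouble('com.acme.math.Stats.stdev', 'double[]');
    daily = load 'NYSE_daily' as (symbol:chararray, volume:int);
    inhex = foreach daily generate symbol, hex(volume);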
Chapter 6- Advanced pig latin

-flatten- un-nests bags and tuples, promoting their contents into the
top-level record

#go over this later page 58-59


Nested Foreach
-applies a set of relational operations to each record in the data
pipeline, inside the body of the foreach
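Sketch (the book's count-distinct-symbols example; field names
illustrative):

    daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
    grpd = group daily by exchange;
    uniqcnt = foreach grpd {
        sym = daily.symbol;
        uniq_sym = distinct sym;
        generate group, COUNT(uniq_sym);
    };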

-Joins
-pig allows the user to pick the join implementation via the
using clause
Joining a small dataset to a large one
Fragment-replicate join- basically fragment one (large) input and
replicate the other (small) one to every node
-uses only map tasks (no reduce phase)

-'replicated' in the using clause tells pig to use a fragment-replicate
join; it uses a lot of memory
-the second input (and any after it) is stored in memory
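Sketch (using 'replicated' is the actual pig syntax; aliases illustrative):

    daily = load 'NYSE_daily' as (exchange, symbol);
    divs = load 'NYSE_dividends' as (exchange, symbol);
    jnd = join daily by (exchange, symbol), divs by (exchange, symbol)
        using 'replicated';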
