
Interpreting the Data:

Parallel Analysis with


Sawzall

Tilani Gunawardena

Content
Background
Introduction
Overview
A Simple Example
Google Infrastructure
Sawzall Language Overview
System Model
More Examples
Execution Model
Domain-specific Properties of the language
Performance
Why a new language?
Utility
Related Work
Future Work
Conclusions

Background
When the only tool you own is a
hammer, every problem begins to
resemble a nail. - Abraham Harold
Maslow

AcceptableTime = InputTime + ProcessingTime + OutputTime
InputTime: Chubby, Bigtable, GFS
ProcessingTime: Sawzall, MapReduce
OutputTime: Bigtable, GFS

Introduction(1)
Large data sets (large, dynamic, unwieldy)
Flat but regular structure
Span multiple disks and machines
Bottleneck lies in I/O, not CPUs
Parallelize to improve throughput
Task division and distribution
Keep computation near to data
Tolerance of kinds of failures

Introduction(2)
Process data records that are present
on many machines.
Distribute the calculation across all the
machines to achieve high throughput.
Two phases for calculation
-Analysis Phase
-Aggregation Phase
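The two phases above can be sketched in Python. This is a hypothetical illustration of the model, not Google's code; the function names are invented.

```python
# Illustrative sketch of the two-phase model: an analysis phase that runs
# independently on each record, and an aggregation phase that combines the
# per-record intermediate values.

def analysis_phase(record):
    """Runs on one record at a time; emits an intermediate value."""
    return float(record)          # e.g. parse the number in the record

def aggregation_phase(intermediates):
    """Combines intermediate values produced on many machines."""
    total = 0.0
    for v in intermediates:
        total += v
    return total

records = ["1.5", "2.5", "4.0"]   # stand-in for records spread over machines
intermediates = [analysis_phase(r) for r in records]
result = aggregation_phase(intermediates)
```

Because the analysis phase sees one record at a time, the per-record work can be distributed freely across machines.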

Structure(1)
Five racks of 50-55 working computers each, with four disks per machine. Such a configuration might have a hundred terabytes of data to be processed, distributed across some or all of the machines.

Structure(2)

If the query is commutative, the records can be processed in any order.
If aggregation is commutative and associative, intermediate values can be grouped arbitrarily or even aggregated in stages.
The overall flow is filtering, aggregating, and collating.
Each stage typically involves less data than the previous one.

Overview(1)
Gigabyte to many terabytes of data
Hundreds or even thousands of
machines in parallel
Some executing the query while
others aggregate the results
An analysis may consume months of
CPU time, but with a thousand
machines that will only take a few
hours of real time.

Overview(2)
systems design is influenced by two
observations.
If the querying operations are
commutative across records, the order in
which the records are processed is
unimportant
If the aggregation operations are
commutative, the order in which the
intermediate values are processed is
unimportant
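These two observations can be demonstrated with a small Python sketch (an illustration only; summation stands in for any commutative, associative aggregator):

```python
# If aggregation is commutative and associative (like summation),
# intermediate values can be combined in any order, or in stages,
# and the result is unchanged.
import random

values = [3, 1, 4, 1, 5, 9, 2, 6]

# Aggregate in the original order.
in_order = sum(values)

# Aggregate in a shuffled order.
shuffled = list(values)
random.shuffle(shuffled)
out_of_order = sum(shuffled)

# Aggregate in stages, as intermediate machines would:
# a partial sum per "rack", then a final collation step.
partials = [sum(values[:4]), sum(values[4:])]
staged = sum(partials)
```

All three strategies yield the same total, which is what lets the system group and stage the work arbitrarily.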

Overview(3)-Sawzall

Input is divided into pieces and processed separately
Sawzall interpreter is instantiated for
each piece of data
Sawzall program operates on each input
record individually
The output of the program for each
record is the intermediate value.
The intermediate value is combined with
values from other records

A Simple Example
Input: A set of files containing records, each holding one floating-point number.
Output: The number of records, the sum of the values, and the sum of the squares of the values.
count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sum_of_squares <- x*x;
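A hypothetical Python rendering of the same program may make the per-record model concrete: three sum aggregators, each fed once per record.

```python
# Python sketch of the Sawzall example above (illustration only):
# three "table sum" aggregators updated once per input record.
count = 0
total = 0.0
sum_of_squares = 0.0

def process_record(x):
    """The per-record program body: emit into the three tables."""
    global count, total, sum_of_squares
    count += 1                  # emit count <- 1;
    total += x                  # emit total <- x;
    sum_of_squares += x * x     # emit sum_of_squares <- x*x;

for record in [1.0, 2.0, 3.0]:  # stand-in for the input files
    process_record(record)
```

In the real system each record's emits are intermediate values that the aggregation phase merges across machines; here they simply accumulate in one process.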

Google Infrastructure(1)
Protocol Buffers
Google File System
Workqueue
MapReduce

Google Infrastructure(2)-PB
Protocol Buffers are used
-To define the messages communicated
between servers
-To describe the format of permanent
records stored on disk
A data description language (DDL) describes protocol buffers and
defines the content of the messages
Protocol compiler takes the DDL and
generates code to manipulate the protocol
buffers

Google Infrastructure(3)-PB
The generated code is compiled and
linked with the application
Protocol buffer types are roughly
analogous to C structs but the DDL
has two additional properties
A distinguishing integral tag
An indication of whether a field is
necessary or optional

Google Infrastructure(4)-PB
The following describes a protocol buffer with two required fields. Field x has tag 1, and field y has tag 2.

parsed message Point {
  required int32 x = 1;
  required int32 y = 2;
};

To extend this two-dimensional point, one can add a new, optional field with a new tag. All existing Points stored on disk remain readable; they are compatible with the new definition since the new field is optional.
parsed message Point {
  required int32 x = 1;
  required int32 y = 2;
  optional string label = 3;
};
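The compatibility property can be sketched in Python. This models only the idea of tag-keyed, optional fields, not the real protocol buffer wire format; the names are invented.

```python
# Illustrative sketch: a reader keyed by tag number simply skips tags it
# does not know and tolerates missing optional ones, so old and new
# records are both readable.

def parse_point(fields):
    """fields: list of (tag, value) pairs as found on disk."""
    known = {1: "x", 2: "y", 3: "label"}     # tag -> field name
    point = {}
    for tag, value in fields:
        if tag in known:
            point[known[tag]] = value
        # unknown tags from a newer definition are skipped, not fatal
    return point

old_record = [(1, 10), (2, 20)]              # written before 'label' existed
new_record = [(1, 10), (2, 20), (3, "home")]
```

An old record parses without the optional field; a new record carries it. Either way, parsing never fails on tags it does not expect.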

Google Infrastructure(5)-GFS
The data sets are often stored in GFS
GFS provides a reliable distributed
storage system
It can grow to petabyte scale by
keeping data in 64MB chunks
Each chunk is replicated, usually 3
times, on different machines, with data
stored on disks spread across
thousands of machines
GFS runs as an application-level file
system with a traditional hierarchical
naming scheme
Google Infrastructure(6)-Workqueue

Software that handles the scheduling of a job that runs on a cluster of machines.
It creates a large-scale time-sharing system from an array of computers and their disks.
It schedules jobs, allocates resources, reports status, and collects the results.

Google Infrastructure(7)-MapReduce

MapReduce is a software library for applications that run on the Workqueue.
Primary services:
It provides an execution model for programs that operate on many data items in parallel.
It isolates the application from the details of running a distributed program.
When possible, it schedules the computations so each unit runs on the machine or rack that holds its GFS data.

Google Infrastructure(8)
High-level language: Sawzall
Software libraries: MapReduce
Scheduling software: Workqueue
Application-level file system: GFS

Language Overview(1)
Basic types
Integer (int)
Floating point (float)
Time
Fingerprint (represents an internally
computed hash of another value)
Bytes and String

Compound type
Arrays, Maps and Tuples

Language Overview(2)
A Sawzall program defines the
operations to be performed on a single
record of the data
There is nothing in the language to enable
examining multiple input records
simultaneously

The only output primitive in the language is the emit statement

Language Overview(3)

Given a set of logs of the submissions to our source code management system, this program shows how the rate of submission varies through the week, at one-minute resolution:

proto "p4stat.proto"
submitsthroughweek: table sum[minute: int] of count: int;
log: P4ChangelistStats = input;
t: time = log.time; # microseconds
minute: int = minuteof(t)+60*(hourof(t)+24*(dayofweek(t)-1));
emit submitsthroughweek[minute] <- 1;

Sample output:
submitsthroughweek[0] = 27
submitsthroughweek[1] = 31
submitsthroughweek[2] = 52
submitsthroughweek[3] = 41
...
submitsthroughweek[10079] = 34
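The minute-of-week bucketing can be rendered in Python. This is an illustrative sketch, not the paper's code; the helper name and its arguments are invented stand-ins for the Sawzall time intrinsics.

```python
# Python sketch of the submitsthroughweek example: a sum table indexed
# by minute of the week (0..10079), incremented once per submit.
from collections import defaultdict

submitsthroughweek = defaultdict(int)   # table sum[minute: int] of count: int

def process_log_record(dayofweek, hour, minute_of_hour):
    """dayofweek is 1..7, matching the Sawzall dayofweek() convention."""
    minute = minute_of_hour + 60 * (hour + 24 * (dayofweek - 1))
    submitsthroughweek[minute] += 1     # emit submitsthroughweek[minute] <- 1;

process_log_record(dayofweek=1, hour=0, minute_of_hour=0)    # first bucket
process_log_record(dayofweek=7, hour=23, minute_of_hour=59)  # last bucket
```

There are 7 × 24 × 60 = 10080 buckets, so the last minute of the week maps to index 10079, matching the sample output above.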

Language Overview(4)
Frequency of submits to the source
code repository through the week.

Language Overview(5)
List of Aggregators
Collection
Sample
Sum
Maximum
Quantile
Top
Unique
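Three of the aggregators above can be sketched as small Python classes. These are hypothetical illustrations of the interface, not the runtime's implementations (in particular, Sawzall's top is an approximate estimator; this sketch is exact).

```python
# Illustrative sketches of the sum, maximum, and top aggregators.
from collections import Counter

class SumTable:
    """table sum: arithmetic sum of emitted values."""
    def __init__(self):
        self.value = 0
    def emit(self, v):
        self.value += v

class MaximumTable:
    """table maximum(1): keep the single highest-weighted value."""
    def __init__(self):
        self.best = None          # (value, weight)
    def emit(self, value, weight):
        if self.best is None or weight > self.best[1]:
            self.best = (value, weight)

class TopTable:
    """table top(n): most frequent values (exact here, approximate in Sawzall)."""
    def __init__(self, n):
        self.n = n
        self.counts = Counter()
    def emit(self, v):
        self.counts[v] += 1
    def result(self):
        return [v for v, _ in self.counts.most_common(self.n)]

s = SumTable()
for v in (1, 2, 3):
    s.emit(v)
m = MaximumTable()
m.emit("a.html", 5)
m.emit("b.html", 9)
t = TopTable(1)
for w in ("x", "y", "x"):
    t.emit(w)
```

Because each of these merges cleanly (sums add, maxima compare, counts add), partial tables from many machines can be combined in the aggregation phase.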

System Model(1)

The system operates in a batch-execution style
-User submits a job
-Job runs on a fixed set of files
-User collects the output
The input format and output destination are given as arguments to the command that submits the job. The command is called saw.
The parameters for saw are as follows:

saw --program code.szl \
    --workqueue testing \
    --input_files /gfs/cluster1/2005-02-0[1-7]/submits.* \
    --destination /gfs/cluster2/$USER/output@100

System Model(2)
User collects the data using the following:
dump --source /gfs/cluster2/$USER/output@100 --format csv
The program merges the output data and
prints the final results.

System Model(3)
Implementation
-Sawzall works in the map phase
-Aggregators work in the reduce
phase

System Model(4)
Chaining
-The output from a Sawzall job is sent
as an input to another

More Examples(1)
Task: Process a web document repository to determine, for each web domain, which page has the highest page rank.

proto "document.proto"
max_pagerank_url:
  table maximum(1) [domain: string] of url: string
  weight pagerank: int;
doc: Document = input;
emit max_pagerank_url[domain(doc.url)] <- doc.url
  weight doc.pagerank;
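A hypothetical Python equivalent of this indexed maximum(1) table: per domain, keep only the URL carrying the highest pagerank weight.

```python
# Python sketch of max_pagerank_url (illustration only): an indexed
# maximum(1) table keyed by domain.
max_pagerank_url = {}   # domain -> (url, pagerank)

def emit_max(domain, url, pagerank):
    """emit max_pagerank_url[domain] <- url weight pagerank"""
    best = max_pagerank_url.get(domain)
    if best is None or pagerank > best[1]:
        max_pagerank_url[domain] = (url, pagerank)

docs = [("a.com", "http://a.com/1", 10),   # made-up sample documents
        ("a.com", "http://a.com/2", 30),
        ("b.com", "http://b.com/1", 20)]
for domain, url, rank in docs:
    emit_max(domain, url, rank)
```

Since "take the max" is commutative and associative over weights, per-machine partial winners can be merged the same way in the aggregation phase.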

More Examples(2)
Task: Look at a set of search query
logs and construct a map showing how
the queries are distributed around the
globe
proto "querylog.proto"
queries_per_degree: table sum[lat: int][lon: int] of int;
log_record: QueryLogProto = input;
loc: Location = locationinfo(log_record.ip);
emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

More Examples(3)

Execution Model
It runs on one record at a time
It is routine for a Sawzall job to be
executing on a thousand machines
simultaneously, yet the system
requires no explicit communication
between those machines

Domain-specific Properties of
the Language
Sawzall is statically typed

The main reason is dependability. Sawzall programs can consume hours, even months, of CPU time in a single run, and a late-arising dynamic type error can be expensive.

Handles undefined values
 Ex: a division by zero, conversion errors, I/O problems
 Tested with the def() predicate, or tolerated when a run-time flag is set

Handle logical quantifiers
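The undefined-value idea can be sketched in Python. The sentinel and helper names here are invented; this only illustrates the behavior of deferring errors and testing with a def()-style predicate.

```python
# Illustrative sketch: a failing operation yields an "undefined" result
# instead of aborting the whole run, and a def()-style predicate tests
# for it so bad records can be skipped.
UNDEF = object()   # invented sentinel standing in for Sawzall's undefined

def safe_div(a, b):
    """Division that goes undefined instead of raising on zero."""
    return UNDEF if b == 0 else a / b

def is_def(v):
    """Analogous to Sawzall's def() predicate."""
    return v is not UNDEF

ratios = [safe_div(10, 2), safe_div(1, 0), safe_div(9, 3)]
defined = [r for r in ratios if is_def(r)]   # drop records with errors
```

This matters at scale: one corrupt record among billions should not abort a month of CPU time, which is why the run-time flag lets such records be counted and skipped.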

Characteristic
Similar to C and Pascal
Type-safe scripting language
Code is much shorter than C++
Pure value semantics, no reference
types
Statically typed
No exception processing

Performance(1)
Compares the single-CPU speed of the Sawzall interpreter.
Sawzall is faster than Python, Ruby, and Perl, but slower than interpreted Java, compiled Java, and compiled C++.
The following table shows the microbenchmark results.
            Sawzall   Python   Ruby     Perl
Mandelbrot
  runtime   12.09s    45.42s   73.59s   38.68s
  factor    1.00      3.75     6.09     3.20
Fibonacci
  runtime   11.73s    38.12s   47.18s   75.73s
  factor    1.00      3.24     4.02     6.46

Performance(2)
The main measurement is not single-CPU speed but aggregate system speed as machines are added to process large datasets.
Experiment: 450GB sample of
compressed query log data to count
the occurrences of certain words using
Sawzall program.

Performance(3)

Test run on 50-600 2.4GHz Xeon computers
Scales up: at 600 machines the aggregate throughput was 1.06GB/s of compressed data, or about 3.2GB/s of raw input
Each additional machine adds about 0.98 machines' worth of throughput

Performance(4)
-The solid line is
elapsed time; the
dashed line is the
product of
machines and
elapsed time.

Why a new language?
Why put a language above MapReduce?
MapReduce is very effective; what's missing?
Why not just attach an existing language such as Python to
MapReduce?
To make programs clearer, more compact, and more
expressive
Original motivation :parallelism
Separating out the aggregators
provides a model for distributed processing
Awk or Python users have to write the aggregators themselves
Capture the aggregators in the language(& its environment)
->means that the programmer never has to provide one,
unlike when using MapReduce
Sawzall programs tend to be around 10 to 20 times shorter
than the equivalent MapReduce programs in C++ &
significantly easier to write
Ability to add domain-specific features, custom debugging

Utility
How much data processing does it do? One month, March 2005:
Workqueue cluster with 1500 Xeon CPUs
32,580 Sawzall jobs launched, using an average of 220 machines each
18,636 failures
The jobs read a total of 3.2×10^15 bytes of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB)
The average job therefore processed about 100GB

Related Work

Traditional data processing is done by storing the information in a relational database and processing it with SQL queries.
Sawzall's data sets are usually too large to fit in a relational database.
Sawzall is very different from SQL, combining a fairly traditional procedural language with an interface to efficient aggregators.
SQL is excellent at database join operations, while Sawzall is not.
Brook: another language for data processing.

Related Work
Aurora [4]: a stream-processing system that supports a (potentially large) set of standing queries on streams of data.
Hancock [7]: a stream-processing system that concentrates on efficient operation of a single thread instead of massive parallelism.

Future Work
Some of the larger or more complex analyses would
be helped by more aggressive compilation, perhaps
to native machine code.
Interface to query an external database
More direct support of join operations
Some analyses join data from multiple input sources; joining is supported but requires extra chaining steps
A more radical system model
To eliminate the batch-processing mode entirely

Conclusions
Provides an expressive interface to a novel set of aggregators that capture many common data processing and data reduction problems
Ability to write short, clear programs that are guaranteed to work well on thousands of machines in parallel
The user needs to know nothing about parallel programming
CPU time is rarely the limiting factor; most programs are small and spend most of their time in I/O and native run-time code
Scalability: linear growth in performance as we add machines
Big data sets need lots of machines; it's gratifying that lots of machines can translate into big throughput

References


Thank You!!!

Sum Positive Integers (Shell)


#!/bin/bash
limit=100
sum=0
i=0
while [ "$i" -le $limit ]
do
sum=`expr $sum + $i`
i=`expr $i + 1`
done
echo $sum


Sum Positive Integers
(Multiple processes, compiled language)
#define NUMCALC 25
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{ int myrank, sum, n, partial;
MPI_Status status;
sum = 0;
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{ printf ("starting...\n");
for (n=1; n<=NUMCALC; n++)
{ MPI_Recv (&partial, 1, MPI_INT, n, 0, MPI_COMM_WORLD, &status);
sum += partial;
}
printf ("sum is: %d\n", sum);
}
else
{ for (n=1; n<=4; n++)
sum += ((myrank-1)*4+n);
MPI_Send (&sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

http://digital.cs.usu.edu/~scott/5200_f02/sum.c

