Tilani Gunawardena
Content
Background
Introduction
Overview
A Simple Example
Google Infrastructure
Sawzall Language Overview
System Model
More Examples
Execution Model
Domain-specific Properties of the language
Performance
Why a new language?
Utility
Related Work
Future Work
Conclusions
Background
When the only tool you own is a
hammer, every problem begins to
resemble a nail. - Abraham Harold
Maslow
AcceptableTime = InputTime + ProcessingTime + OutputTime
InputTime: Chubby, Bigtable, GFS
ProcessingTime: Sawzall, MapReduce
OutputTime: Bigtable, GFS
Introduction(1)
Large data sets (large, dynamic,
unwieldy)
Flat but regular structure
Span multiple disks and machines
Bottleneck lies in I/O, not CPUs
Parallelize to improve throughput
Task division and distribution
Keep computation near to data
Tolerance of many kinds of failures
Introduction(2)
Process data records that are present
on many machines.
Distribute the calculation across all the
machines to achieve high throughput.
Two phases for calculation
-Analysis Phase
-Aggregation Phase
Structure(1)
Five racks of 50-55 working computers each, with
four disks per machine. Such a configuration might
have a hundred terabytes of data to be processed,
distributed across some or all of the machines.
Structure(2)
Overview(1)
Gigabytes to many terabytes of data
Hundreds or even thousands of
machines in parallel
Some executing the query while
others aggregate the results
An analysis may consume months of
CPU time, but with a thousand
machines that will only take a few
hours of real time.
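The arithmetic behind this claim can be sketched in a few lines of Python. The figure of three CPU-months, the 30-day month and the assumption of perfect scaling are illustrative choices, not numbers from the talk.

```python
# Back-of-the-envelope estimate, assuming perfect linear scaling.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30      # assumed month length
cpu_months = 3           # assumed size of the analysis
machines = 1000          # "a thousand machines" from the slide

cpu_hours = cpu_months * DAYS_PER_MONTH * HOURS_PER_DAY
wall_clock_hours = cpu_hours / machines

print(f"{cpu_hours} CPU-hours on {machines} machines "
      f"is about {wall_clock_hours:.2f} hours of real time")
```

Real jobs scale less than perfectly, so this is a lower bound on elapsed time rather than a prediction.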
Overview(2)
The system's design is influenced by two
observations.
If the querying operations are
commutative across records, the order in
which the records are processed is
unimportant
If the aggregation operations are
commutative, the order in which the
intermediate values are processed is
unimportant
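A minimal Python sketch of these two observations follows. The functions analyze and aggregate are hypothetical names for the two phases, and integer records are used so that the sums are exact. Merging per-machine partial results additionally relies on the aggregation being associative.

```python
import random

def analyze(record):
    """Analysis phase: look at ONE record and emit intermediate values."""
    return {"count": 1, "total": record}

def aggregate(emitted):
    """Aggregation phase: fold emitted values together with sum."""
    result = {"count": 0, "total": 0}
    for e in emitted:
        result["count"] += e["count"]
        result["total"] += e["total"]
    return result

records = [3, 1, 4, 1, 5, 9, 2, 6]

# Records processed in their original order.
in_order = aggregate([analyze(r) for r in records])

# The same records processed in a different order.
shuffled = records[:]
random.Random(0).shuffle(shuffled)
out_of_order = aggregate([analyze(r) for r in shuffled])

# The same records split across two "machines", then merged.
half = len(records) // 2
partials = [
    aggregate([analyze(r) for r in records[:half]]),
    aggregate([analyze(r) for r in records[half:]]),
]
merged = aggregate(partials)

print(in_order, out_of_order, merged)
```

All three results are identical, which is what allows the system to schedule records and intermediate values in any order.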
Overview(3)-Sawzall
A Simple Example
Input: A set of files containing records, where each
record contains one floating-point number.
Output: Number of records, sum of the values and
sum of the squares of the values.
count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sum_of_squares <- x * x;
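For readers unfamiliar with Sawzall, the same computation can be sketched sequentially in Python. The tables dictionary and the emit helper are hypothetical stand-ins for Sawzall's sum tables and emit statement, and the mean and variance at the end are one plausible use of the three outputs.

```python
# Stand-ins for the three "table sum" declarations.
tables = {"count": 0, "total": 0.0, "sum_of_squares": 0.0}

def emit(table, value):
    """Mimic 'emit table <- value' for a sum table."""
    tables[table] += value

def process(x):
    """Body of the program: runs once per input record."""
    emit("count", 1)
    emit("total", x)
    emit("sum_of_squares", x * x)

for record in [1.0, 2.0, 3.0]:
    process(record)

# Derived statistics computed from the aggregated tables.
mean = tables["total"] / tables["count"]
variance = tables["sum_of_squares"] / tables["count"] - mean ** 2
print(tables, mean, variance)
```

In the real system the loop over records is implicit and distributed; the program only ever sees one record at a time.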
Google Infrastructure(1)
Protocol Buffers
Google File System
Workqueue
MapReduce
Google Infrastructure(2)-PB
Protocol Buffers are used
-To define the messages communicated
between servers
-To describe the format of permanent
records stored on disk
DDL describes protocol buffers and
defines the content of the messages
Protocol compiler takes the DDL and
generates code to manipulate the protocol
buffers
Google Infrastructure(3)-PB
The generated code is compiled and
linked with the application
Protocol buffer types are roughly
analogous to C structs but the DDL
has two additional properties
A distinguishing integral tag
An indication of whether a field is
required or optional
Google Infrastructure(4)-PB
The following describes a protocol buffer with two
required fields. Field x has tag 1, and field y has tag 2.
parsed message Point {
  required int32 x = 1;
  required int32 y = 2;
};
Google Infrastructure(5)-GFS
The data sets are often stored in GFS
GFS provides a reliable distributed
storage system
It can grow to petabyte scale by
keeping data in 64MB chunks
stored on disks spread across
thousands of machines
Each chunk is replicated, usually 3
times, on different machines
GFS runs as an application-level file
system with a traditional hierarchical
naming scheme
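To get a feel for the scale, the chunk arithmetic can be worked through in Python. The 100-terabyte data set is the figure from the Structure slide; treating terabytes and megabytes as binary units is an assumption made here for round numbers.

```python
TB = 2 ** 40                 # assumed binary terabyte
MB = 2 ** 20                 # assumed binary megabyte

data_bytes = 100 * TB        # "a hundred terabytes" of data
chunk_bytes = 64 * MB        # GFS chunk size
replicas = 3                 # usual replication factor

chunks = data_bytes // chunk_bytes
stored_chunks = chunks * replicas
raw_storage_tb = stored_chunks * chunk_bytes / TB

print(f"{chunks:,} chunks, {stored_chunks:,} chunk replicas, "
      f"{raw_storage_tb:.0f} TB of raw disk")
```

Under these assumptions the data set becomes roughly 1.6 million chunks, each of which is an independent unit that can be read by whichever machine holds a replica.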
Google Infrastructure(6)-Workqueue
Google Infrastructure(7)-MapReduce
Google Infrastructure(8)
High level language
Software libraries
Scheduling software
Language Overview(1)
Basic types
Integer (int)
Floating point (float)
Time
Fingerprint (represents an internally
computed hash of another value)
Bytes and String
Compound type
Arrays, Maps and Tuples
Language Overview(2)
A Sawzall program defines the
operations to be performed on a single
record of the data
There is nothing in the language to enable
examining multiple input records
simultaneously
Language Overview(3)
proto "p4stat.proto"
submitsthroughweek: table sum[minute: int] of count: int;
log: P4ChangelistStats = input;
t: time = log.time; # microseconds
minute: int = minuteof(t)+60*(hourof(t)+24*(dayofweek(t)-1));
emit submitsthroughweek[minute] <- 1;
Output:
submitsthroughweek[0] = 27
submitsthroughweek[1] = 31
submitsthroughweek[2] = 52
submitsthroughweek[3] = 41
...
submitsthroughweek[10079] = 34
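The minute-of-week index used above can be checked with a short Python sketch. It assumes that Sawzall's dayofweek numbers Monday as 1, which matches Python's isoweekday, and it ignores time zones; the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime

def minute_of_week(t):
    """Index from 0 (Monday 00:00) to 10079 (Sunday 23:59)."""
    return t.minute + 60 * (t.hour + 24 * (t.isoweekday() - 1))

# Hypothetical submit times standing in for the changelist log.
submit_times = [
    datetime(2005, 3, 7, 0, 0),     # Monday, first minute of the week
    datetime(2005, 3, 7, 0, 0),
    datetime(2005, 3, 8, 1, 30),    # Tuesday 01:30
    datetime(2005, 3, 13, 23, 59),  # Sunday, last minute of the week
]

# Equivalent of 'emit submitsthroughweek[minute] <- 1' for each record.
submitsthroughweek = Counter(minute_of_week(t) for t in submit_times)
print(sorted(submitsthroughweek.items()))
```

The largest index is 59 + 60 * (23 + 24 * 6) = 10079, which is why the output table ends at submitsthroughweek[10079].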
Language Overview(4)
Frequency of submits to the source
code repository through the week.
Language Overview(5)
List of Aggregators
Collection
Sample
Sum
Maximum
Quantile
Top
Unique
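As a rough illustration of what several of these aggregators compute, here are exact, single-machine Python versions. The function names are hypothetical, and the real Top and Unique aggregators are statistical estimators designed for distributed use, so this is a simplification of their behavior rather than their implementation.

```python
import heapq
from collections import Counter

def agg_collection(values):
    """Collection: keep every emitted value."""
    return list(values)

def agg_sum(values):
    """Sum: add up all emitted values."""
    return sum(values)

def agg_maximum(weighted_values, n):
    """Maximum(n): the n values with the highest weights."""
    best = heapq.nlargest(n, weighted_values, key=lambda pair: pair[1])
    return [value for value, _weight in best]

def agg_top(values, n):
    """Top(n): the n most frequent values (exact here, estimated in Sawzall)."""
    return [value for value, _count in Counter(values).most_common(n)]

def agg_unique(values):
    """Unique: number of distinct values (exact here, estimated in Sawzall)."""
    return len(set(values))

print(agg_sum([1, 2, 3]))
print(agg_maximum([("a", 1), ("b", 5), ("c", 3)], 2))
print(agg_top(["x", "y", "x", "z", "x", "y"], 2))
print(agg_unique(["x", "y", "x"]))
```

Sample and Quantile are omitted because their behavior depends on sampling and estimation details not covered in these slides.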
System Model(1)
System Model(2)
User collects the data using the following:
dump --source /gfs/cluster2/$USER/output@100 -format csv
The program merges the output data and
prints the final results.
System Model(3)
Implementation
-Sawzall works in the map phase
-Aggregators work in the reduce
phase
System Model(4)
Chaining
-The output from a Sawzall job is sent
as an input to another
More Examples(1)
Task: Process a web document
repository to find, for each web
domain, which page has the highest
page rank
proto "document.proto"
max_pagerank_url:
table maximum(1) [domain: string] of url: string
weight pagerank: int;
doc: Document = input;
emit max_pagerank_url[domain(doc.url)] <- doc.url
weight doc.pagerank;
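A sequential Python sketch of what this program computes is shown below. The documents are invented, and the domain helper is a hypothetical stand-in for Sawzall's domain() built on urllib's host parsing.

```python
from urllib.parse import urlparse

def domain(url):
    """Hypothetical stand-in for Sawzall's domain(): the host part of a URL."""
    return urlparse(url).netloc

# Made-up (url, pagerank) records standing in for Document protocol buffers.
docs = [
    ("http://example.com/a", 3),
    ("http://example.com/b", 7),
    ("http://example.org/index", 5),
    ("http://example.com/c", 2),
]

# Equivalent of 'table maximum(1) [domain: string] of url weight pagerank'.
max_pagerank_url = {}
for url, pagerank in docs:
    d = domain(url)
    if d not in max_pagerank_url or pagerank > max_pagerank_url[d][1]:
        max_pagerank_url[d] = (url, pagerank)

for d, (url, pagerank) in sorted(max_pagerank_url.items()):
    print(d, url, pagerank)
```

In Sawzall the comparison loop lives inside the maximum aggregator, so the program itself contains only the emit.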
More Examples(2)
Task: To look at a set of search query
logs and construct a map showing how
the queries are distributed around the
globe
proto "querylog.proto"
queries_per_degree: table sum[lat: int][lon: int] of int;
log_record: QueryLogProto = input;
loc: Location = locationinfo(log_record.ip);
emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
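The indexing by whole degrees can be sketched in Python as follows. The IP addresses, coordinates and the locationinfo lookup table are all invented for illustration; the real locationinfo is an intrinsic whose data source is not described in these slides.

```python
from collections import Counter

# Hypothetical IP-to-(lat, lon) table standing in for locationinfo().
FAKE_LOCATIONS = {
    "10.0.0.1": (37.4, -122.1),
    "10.0.0.2": (37.9, -122.8),
    "10.0.0.3": (51.5, -0.1),
}

def locationinfo(ip):
    """Return (lat, lon) for an IP address from the made-up table."""
    return FAKE_LOCATIONS[ip]

# Made-up query log: one IP address per query record.
log_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"]

# Equivalent of 'table sum[lat: int][lon: int] of int'.
queries_per_degree = Counter()
for ip in log_ips:
    lat, lon = locationinfo(ip)
    # int() truncates toward zero, so each cell is one degree on a side.
    queries_per_degree[(int(lat), int(lon))] += 1

print(dict(queries_per_degree))
```

Each cell of the resulting table can then be plotted as one pixel of a world map.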
More Examples(3)
Execution Model
It runs on one record at a time
It is routine for a Sawzall job to be
executing on a thousand machines
simultaneously, yet the system
requires no explicit communication
between those machines
Domain-specific Properties of
the Language
Sawzall is statically typed
Characteristics
Similar to C and Pascal
Type-safe scripting language
Code is much shorter than C++
Pure value semantics, no reference
types
Statically typed
No exception processing
Performance(1)
Compare the single CPU speed of the
Sawzall interpreter.
Sawzall is faster than Python, Ruby
and Perl, but slower than interpreted
Java, compiled Java and compiled C++
The following table shows the
microbenchmark results.
                     Sawzall   Python   Ruby     Perl
Mandelbrot  runtime  12.09s    45.42s   73.59s   38.68s
            factor   1.00      3.75     6.09     3.20
Fibonacci   runtime  11.73s    38.12s   47.18s   75.73s
            factor   1.00      3.24     4.02     6.46
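The factor rows are each language's runtime divided by Sawzall's runtime on the same benchmark. A short Python check using the runtimes from the table:

```python
# Runtimes in seconds, copied from the microbenchmark table.
runtimes = {
    "Mandelbrot": {"Sawzall": 12.09, "Python": 45.42, "Ruby": 73.59, "Perl": 38.68},
    "Fibonacci": {"Sawzall": 11.73, "Python": 38.12, "Ruby": 47.18, "Perl": 75.73},
}

# Factor = runtime relative to Sawzall on the same benchmark.
factors = {
    bench: {lang: secs / times["Sawzall"] for lang, secs in times.items()}
    for bench, times in runtimes.items()
}

for bench, row in factors.items():
    print(bench, {lang: f"{f:.2f}" for lang, f in row.items()})
```

The recomputed values agree with the table to within rounding in the last digit.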
Performance(2)
The main measurement is not single-CPU speed.
The main measurement is aggregate
system speed as machines are added
to process large datasets.
Experiment: a 450GB sample of
compressed query log data, processed to count
the occurrences of certain words using
a Sawzall program.
Performance(3)
Performance(4)
-The solid line is
elapsed time; the
dashed line is the
product of
machines and
elapsed time.
Why a new language?
Why put a language above MapReduce?
MapReduce is very effective; what's missing?
Why not just attach an existing language such as Python to
MapReduce?
To make programs clearer, more compact, and more
expressive
Original motivation: parallelism
Separating out the aggregators
provides a model for distributed processing
Awk or Python users have to write the aggregators themselves
Capture the aggregators in the language(& its environment)
->means that the programmer never has to provide one,
unlike when using MapReduce
Sawzall programs tend to be around 10 to 20 times shorter
than the equivalent MapReduce programs in C++ &
significantly easier to write
Ability to add domain-specific features, custom debugging
Utility
How much data processing does it do?
March 2005.
Workqueue cluster
1500 Xeon CPUs
32,580 Sawzall jobs launched using an average of
220 machines each
18,636 failures
The jobs read a total of 3.2×10^15 bytes
of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB)
The average job therefore processed about 100GB
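The per-job average can be verified with a quick calculation. This sketch assumes the read total is 3.2×10^15 bytes, which is the value consistent with the stated 2.8PB when a petabyte is taken as 2^50 bytes.

```python
bytes_read = 3.2e15      # total bytes read by all jobs (assumed reading of the slide)
jobs = 32_580            # Sawzall jobs launched in March 2005

# Average input per job, in bytes and in decimal gigabytes.
avg_bytes_per_job = bytes_read / jobs
avg_gb_per_job = avg_bytes_per_job / 1e9

# The same total expressed in binary petabytes.
read_pb = bytes_read / 2 ** 50

print(f"average job input: {avg_gb_per_job:.1f} GB")
print(f"total read: {read_pb:.1f} PB")
```

The result is a little under 100GB per job, in line with the figure on the slide.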
Related Work
Aurora [4]: stream processing system that
supports a (potentially large) set of
standing queries on streams of data
Hancock [7]: stream processing system that
concentrates on efficient operation of a
single thread instead of massive
parallelism.
Future Work
Some of the larger or more complex analyses would
be helped by more aggressive compilation, perhaps
to native machine code.
Interface to query an external database
More direct support of join operations
Some analyses join data from multiple input
sources. Joining is supported but requires extra
chaining steps
A more radical system model
To eliminate the batch-processing mode entirely
Conclusions
Provide an expressive interface to a novel set of aggregators
References
Thank
You !!!
http://digital.cs.usu.edu/~scott/5200_f02/sum.c
MPI_Finalize();
return 0;
}