
Joins in Hadoop

Gang and Ronnie


Agenda
Introduction of new types of joins
Experiment results
Join plan generator
Summary and future work
Problem at hand
Map join (fragment-duplicate join)
[Diagram: the large table (fragment) is divided into splits (Split 1 ... Split 4); each map task processes one split together with a full copy of the small table (duplicate)]
Slide taken from project proposal
Too many copies of the small table are shuffled across the network

Partially Solved
Distributed Cache
Doesn't work with too many nodes involved

Size of R | Size of S | Map tasks | Duplicate data
150 MB    | 24 GB     | 352       | 32 GB

Size of R | Size of S | Map tasks | Duplicate data
150 MB    | 24 GB     | 277       | <= 64 * 150 MB

Size of R | Size of S | # of nodes | Duplicate data
150 MB    | 24 TB     | 277 k      | TB? PB?

There are 64 nodes in our cluster, and distributed cache will copy the data no more than that many times.
Slide taken from project proposal II
Memory Limitation
The hash table is not memory-efficient.
The table size is usually larger than the heap memory assigned to a task.
Out Of Memory Exception!
Solving Not-Enough-Memory problem
New Map Joins:
Multi-phase map join (MMJ)
Reversed map join (RMJ)
JDBM-based map join (JMJ)

We refer to the small table as the duplicate
and the large table as the fragment
Multi-phase map join
n-phase map join
[Diagram: the duplicate is split into n parts (Part 1 ... Part n); in each phase, every map task joins its fragment split with one part of the duplicate]
Problem? Reading the large table multiple times!
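The multi-phase idea can be sketched in plain Python (a minimal in-memory illustration, not actual Hadoop code; tables are lists of (key, value) pairs and the partition size is an assumed parameter):

```python
def multi_phase_map_join(fragment, duplicate, max_rows_in_memory):
    """Join fragment with duplicate, holding only one part of the
    duplicate in memory at a time (one 'phase' per part)."""
    results = []
    # Split the duplicate (small table) into parts that fit in memory.
    parts = [duplicate[i:i + max_rows_in_memory]
             for i in range(0, len(duplicate), max_rows_in_memory)]
    for part in parts:
        hash_table = {}
        for key, dval in part:              # build hash table for this part
            hash_table.setdefault(key, []).append(dval)
        for key, fval in fragment:          # re-read the whole fragment!
            for dval in hash_table.get(key, []):
                results.append((key, fval, dval))
    return results
```

Note that the fragment is scanned once per phase, which is exactly the weakness the slide points out.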
Reversed map join
Default map join (in each Map task):
1. read the duplicate into memory, build a hash table
2. for each tuple in the fragment, probe the hash table
Reversed map join (in each Map task):
1. read the fragment into memory, build a hash table
2. for each tuple in the duplicate, probe the hash table
Problem? Not really a Map job!
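The two variants differ only in which side is hashed; a minimal Python sketch (plain lists stand in for HDFS splits, not real Hadoop code):

```python
def default_map_join(fragment_split, duplicate):
    # 1. read the duplicate into memory and build a hash table
    ht = {}
    for key, dval in duplicate:
        ht.setdefault(key, []).append(dval)
    # 2. stream the fragment split and probe
    return [(k, fv, dv) for k, fv in fragment_split for dv in ht.get(k, [])]

def reversed_map_join(fragment_split, duplicate):
    # 1. read the fragment split into memory and build a hash table
    ht = {}
    for key, fval in fragment_split:
        ht.setdefault(key, []).append(fval)
    # 2. stream the duplicate and probe
    return [(k, fv, dv) for k, dv in duplicate for fv in ht.get(k, [])]
```

The point: in the reversed variant, memory is bounded by the size of one fragment split (a fixed-size HDFS block) instead of the whole small table.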
JDBM-based map join
JDBM is a transactional persistence engine for Java.
Using JDBM, we can eliminate the OutOfMemoryException: the size of the hash table is no longer bounded by the heap size.
Problem? Probing a hash table on disk can take much more time!
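JDBM itself is a Java library; as a rough analogue of the idea, Python's standard dbm module gives a disk-backed hash table, so the build side is no longer bounded by process memory (a sketch with a temporary database file, not the authors' implementation):

```python
import dbm
import os
import tempfile

def disk_backed_map_join(fragment, duplicate):
    # Build the "hash table" for the duplicate on disk, not in memory.
    path = os.path.join(tempfile.mkdtemp(), "dup.db")
    db = dbm.open(path, "n")              # on-disk key/value store
    for key, dval in duplicate:
        db[str(key)] = dval               # build phase writes to disk
    out = []
    for key, fval in fragment:
        k = str(key)
        if k in db:                       # probe phase hits the disk
            out.append((key, fval, db[k].decode()))
    db.close()
    return out
```

Every probe is a disk lookup, which mirrors the slide's caveat: correctness at the cost of speed.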
Advanced Joins
Step 1: semi join on the join key only;
Step 2: use the result to filter the table;
Step 3: join the new tables.
Can be applied to both map-side and reduce-side joins.
Problem? Steps 1 and 2 have overhead!
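The three steps can be sketched as follows (a pure-Python illustration of the semi-join-then-join pattern, with lists of (key, value) pairs standing in for the tables):

```python
def advanced_join(small, large):
    # Step 1: semi join on the join key only
    keys = {k for k, _ in small}
    # Step 2: use the key set to filter the large table
    filtered = [(k, v) for k, v in large if k in keys]
    # Step 3: join the (now much smaller) tables
    ht = {}
    for k, sv in small:
        ht.setdefault(k, []).append(sv)
    return [(k, sv, lv) for k, lv in filtered for sv in ht.get(k, [])]
```

Steps 1 and 2 add an extra pass over the large table, so the pattern only pays off when few large-table rows survive the filter.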
The Nine Candidates
AMJ/no dist: advanced map join without DC
AMJ/dist: advanced map join with DC
DMJ/no dist: default map join without DC
DMJ/dist: default map join with DC
MMJ: multi-phase map join
RMJ/dist: reversed map join with DC
JMJ/dist: JDBM-based map join with DC
ARJ/dist: advanced reduce join with DC
DRJ: default reduce join
Experiment Setup
TPC-DS benchmark
Evaluated query:
JOIN customer, web_sales ON cid
Performed on different scales of generated
data, e.g. 10GB, 170GB (not actual table size)

Each combination is performed five (5) times
Results are analyzed with error bars
Hadoop Cluster
128 Hewlett Packard DL160 Compute Building Blocks
Each equipped with:
2 quad-core CPUs
16 GB RAM
2 TB storage
High-speed network connection
Used in the experiment:
Hadoop Cluster (Altocumulus): 64 nodes
Result analysis
[Chart: running times of the nine candidates (AMJ/no dist, AMJ/dist, DMJ/no dist, DMJ/dist, MMJ, RMJ/dist, JMJ/dist, ARJ/dist, DRJ); some results ignored]
One small note
What does 50*200 mean?
TABLE customer: from the 50GB version of TPC-DS
- actual table size: about 100 MB
TABLE web_sales: from the 200GB version of TPC-DS
- actual table size: about 30 GB
Distributed Cache
[Chart: running times of DMJ/no dist vs. DMJ/dist]
Distributed Cache II
Distributed cache introduces an overhead when copying the file from HDFS to local disks.
The following situations favor distributed cache (compared to non-DC):
1. the number of nodes is low
2. the number of map tasks is high

Advanced vs. Default
[Chart: running times of ARJ/dist vs. DRJ across scales 10*10 through 70*70]
Advanced vs. Default II
[Chart: running times of AMJ/dist vs. DMJ/dist]
Advanced vs. Default III
The overhead of semi-join and filtering is heavy.
The following situations favor advanced joins (compared to reduce joins):
1. join selectivity gets lower
2. the network becomes slower (true!)
3. we need to handle skewed data
Map Join vs Reduce Join -- Part I
[Chart: running times of DMJ/no dist, MMJ, JMJ/dist, ARJ/dist, DRJ]
Map Join vs Reduce Join -- Part II
[Chart: running times of DMJ/no dist, RMJ/dist, JMJ/dist, ARJ/dist, DRJ]
Map Join vs Reduce Join
In most situations, Default Map Join performs better than Default Reduce Join:
it eliminates the data transfer and sorting at the shuffle stage.
The gap is not significant due to the fast network.
Potential problems of Map Joins:
A job involving too many map tasks causes a large amount of data to be transferred over the network.
Distributed cache may harm performance.
Beyond Default Map Join
Multi-Phase Map Join
Succeeded in all experiment groups.
Performance comparable with DMJ when only one phase is involved.
Performance degrades sharply when the number of phases is greater than 2, due to the many more tasks we launch.
Currently no support for distributed cache, so not scalable.
Beyond Default Map Join
Reversed Map Join
Succeeded in all experiment groups.
Does not perform as well as DRJ due to the overhead of distributed cache.
Performs best when
Beyond Default Map Join
JDBM Map Join
Failed for the last two experiment groups, mainly due to improper configuration settings.
Join Plan Generator
Cost-based + rule-based
Focus on three aspects:
Whether or not to use distributed cache
Whether to use Default Map Join
Map joins or reduce-side join
Parameters
d: number of distributed files
v: network speed
m: number of map tasks
r: number of reduce tasks
n: number of working nodes
s: small table size
l: large table size
Join Plan Generator
Whether to use distributed cache
Only works for map join approaches
Cost model
With distributed cache: α·d + s·n / v,
where α is the average overhead to distribute one file
Without distributed cache: s·m / v
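Plugging in numbers from the earlier slides (n = 64 nodes, m = 277 map tasks, s = 150 MB), the model can be evaluated directly; the network speed v and per-file overhead α below are made-up values for illustration only:

```python
def dc_cost(d, s, n, v, alpha):
    # with distributed cache: copy s to each of n nodes, plus per-file overhead
    return alpha * d + s * n / v

def no_dc_cost(s, m, v):
    # without distributed cache: every map task reads s over the network
    return s * m / v

s, v, alpha = 150.0, 100.0, 2.0          # MB, MB/s, s/file (assumed)
with_dc = dc_cost(d=1, s=s, n=64, v=v, alpha=alpha)   # 64 copies
without = no_dc_cost(s=s, m=277, v=v)                 # 277 copies
# with_dc < without here: DC wins when m is much larger than n
```

This matches the earlier observation that distributed cache pays off when the number of map tasks is high relative to the number of nodes.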
Join Plan Generator
Whether to use Default Map Join
We give Default Map Join the highest priority since it usually works best.
The choice on distributed cache can ensure Default Map Join works efficiently.
Rule: if the small table can fit into memory entirely, just do it.
Join Plan Generator
Map Joins or Default Reduce-side Join
In those situations where DMJ fails, Reversed Map Join is most promising in terms of usability and scalability.
Cost model:
RMJ: s·m / v (without distributed cache)
     α·d + s·n / v (with distributed cache)
where α is the average overhead to distribute one file
DRJ: f(s + l, v, r)
Join Plan Generator
[Flowchart: first decide on distributed cache; then ask "Default Map Join?" -- if yes, do it; if no, choose between Reversed Map Join and Default Reduce-side Join]
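The flow above can be written as a tiny rule-based chooser (a sketch only; the memory-fit threshold and the DRJ cost function f are stand-ins for whatever the real planner measures, and the cost formulas are the ones from the preceding slides):

```python
def choose_plan(s, mem, d, v, m, n, r, l, alpha, drj_cost_fn):
    # Step 1: distributed cache, per the cost model
    use_dc = alpha * d + s * n / v < s * m / v
    # Step 2: Default Map Join gets the highest priority
    if s <= mem:                     # small table fits in memory: just do it
        return ("DMJ", use_dc)
    # Step 3: otherwise compare Reversed Map Join with Default Reduce Join
    rmj = alpha * d + s * n / v if use_dc else s * m / v
    drj = drj_cost_fn(s + l, v, r)
    return ("RMJ" if rmj < drj else "DRJ", use_dc)
```

For example, with a hypothetical drj_cost_fn such as lambda size, v, r: size / v / r, a 100 MB small table that fits in memory yields ("DMJ", True), while one too large for memory falls through to the RMJ-vs-DRJ comparison.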
Summary
Distributed cache is a double-edged sword.
When using distributed cache properly, Default Map Join performs best.
The three new map join approaches extend the usability of the default map join.
Future Work
SPJA workflow (selection, projection, join, aggregation)
Better optimizer
Multi-way join
Build into a hybrid system
Need a dedicated (slower) cluster
