Agenda
- Introduction of new types of joins
- Experiment results
- Join plan generator
- Summary and future work

Problem at Hand
Map join (fragment-duplicate join)
[Figure: each map task reads one split of the fragment (large table) plus a full copy of the duplicate (small table). Slide taken from project proposal.]
- Too many copies of the small table are shuffled across the network
Partially Solved
- Distributed Cache
- Doesn't work when too many nodes are involved
  Size of R | Size of S | Map tasks | Duplicate data
  150 MB    | 24 GB     | 352       | 32 GB
  150 MB    | 24 GB     | 277       | <= 64 * 150 MB

  Size of R | Size of S | Map tasks | Duplicate data
  150 MB    | 24 TB     | 277k      | TB? PB?

There are 64 nodes in our cluster, and the distributed cache copies the data at most that many times.
(Slide taken from project proposal II)

Memory Limitation
- A hash table is not memory-efficient: per-entry object overhead inflates the in-memory size well beyond the raw data
- The table size is usually larger than the heap memory assigned to a task
- Out Of Memory Exception!

Solving the Not-Enough-Memory Problem
New map joins:
- Multi-phase map join (MMJ)
- Reversed map join (RMJ)
- JDBM-based map join (JMJ)
(Terminology: small table = duplicate, large table = fragment.)

Multi-phase Map Join (n-phase map join)
[Figure: the duplicate is divided into parts 1..n; each phase runs map tasks that join the fragment splits with one duplicate part.]
- Problem? Reading the large table multiple times! (See the single-machine sketch below.)
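A minimal single-machine sketch of the multi-phase idea, assuming tab-separated key/value files (duplicate.tsv and fragment.tsv are hypothetical names); the real MMJ does the same partitioning across Hadoop map tasks:

    import java.io.*;
    import java.util.*;

    public class MultiPhaseJoinSketch {
        public static void main(String[] args) throws IOException {
            int phases = 3; // in MMJ: ceil(duplicate size / per-task memory budget)

            for (int phase = 0; phase < phases; phase++) {
                // Load only this phase's partition of the duplicate, so the
                // hash table always fits in memory.
                Map<String, String> hash = new HashMap<>();
                try (BufferedReader in = new BufferedReader(new FileReader("duplicate.tsv"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] kv = line.split("\t", 2);
                        if (Math.floorMod(kv[0].hashCode(), phases) == phase)
                            hash.put(kv[0], kv[1]);
                    }
                }
                // The fragment is re-read once per phase: the weakness noted above.
                try (BufferedReader in = new BufferedReader(new FileReader("fragment.tsv"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] kv = line.split("\t", 2);
                        String match = hash.get(kv[0]);
                        if (match != null)
                            System.out.println(kv[0] + "\t" + kv[1] + "\t" + match);
                    }
                }
            }
        }
    }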
Reversed Map Join
Default map join (in each map task):
1. Read the duplicate into memory, build a hash table
2. For each tuple in the fragment, probe the hash table
Reversed map join (in each map task):
1. Read the fragment into memory, build a hash table
2. For each tuple in the duplicate, probe the hash table
- Problem? Not really a Map job (see the mapper sketch below)
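A hedged mapper sketch of RMJ against the Hadoop 1.x API (the key TAB value record layout is an assumption). map() only accumulates the fragment split; the actual join runs in cleanup(), which is why this is "not really a Map job":

    import java.io.*;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReversedMapJoinMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> fragment = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx) {
            // Step 1: read the fragment split into memory, build the hash table.
            String[] kv = line.toString().split("\t", 2);
            fragment.put(kv[0], kv[1]);
        }

        @Override
        protected void cleanup(Context ctx)
                throws IOException, InterruptedException {
            // Step 2: stream the duplicate (shipped via distributed cache)
            // and probe the hash table built from the fragment.
            Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
            try (BufferedReader in = new BufferedReader(
                    new FileReader(cached[0].toString()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t", 2);
                    String match = fragment.get(kv[0]);
                    if (match != null)
                        ctx.write(new Text(kv[0]), new Text(kv[1] + "\t" + match));
                }
            }
        }
    }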
JDBM-based Map Join
- JDBM is a transactional persistence engine for Java
- Using JDBM, we can eliminate the OutOfMemoryException: the size of the hash table is no longer bound by the heap size
- Problem? Probing a hash table on disk can take much more time!
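A minimal sketch of the JMJ building block, assuming the JDBM 1.x API (RecordManager plus the HTree persistent hash table); exact method names should be checked against the JDBM release in use:

    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;
    import jdbm.htree.HTree;

    public class JdbmMapJoinSketch {
        public static void main(String[] args) throws Exception {
            RecordManager recman =
                    RecordManagerFactory.createRecordManager("join_cache");
            HTree table = HTree.createInstance(recman);

            // Build side: insert duplicate tuples; JDBM pages them to disk
            // instead of throwing OutOfMemoryError.
            table.put("cust_42", "Alice\tOH");
            recman.commit();

            // Probe side: each lookup may hit disk, which is the caveat above.
            String match = (String) table.get("cust_42");
            System.out.println(match);

            recman.close();
        }
    }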
Advanced Joins
Step 1: Semi join on the join key only;
Step 2: Use the result to filter the table;
Step 3: Join the filtered tables.
- Can be applied to both map-side and reduce-side joins
- Problem? Steps 1 and 2 add overhead! (A plain-Java sketch of steps 1-2 follows.)
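A plain-Java sketch of steps 1 and 2, with hypothetical file names and tab-separated records; in the real AMJ/ARJ these passes run as MapReduce jobs:

    import java.io.*;
    import java.util.*;

    public class SemiJoinFilterSketch {
        public static void main(String[] args) throws IOException {
            // Step 1: semi join on the join key only.
            Set<String> keys = new HashSet<>();
            try (BufferedReader in = new BufferedReader(new FileReader("small.tsv"))) {
                String line;
                while ((line = in.readLine()) != null)
                    keys.add(line.split("\t", 2)[0]);
            }
            // Step 2: keep only large-table tuples whose key has a join partner.
            try (BufferedReader in = new BufferedReader(new FileReader("large.tsv"));
                 PrintWriter out = new PrintWriter(new FileWriter("large_filtered.tsv"))) {
                String line;
                while ((line = in.readLine()) != null)
                    if (keys.contains(line.split("\t", 2)[0]))
                        out.println(line);
            }
            // Step 3: join small.tsv with large_filtered.tsv using any join above;
            // with low join selectivity the filtered table is far smaller, which
            // is what pays for the two extra passes.
        }
    }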
The Nine Candidates
- AMJ/no dist: advanced map join without DC
- AMJ/dist: advanced map join with DC
- DMJ/no dist: default map join without DC
- DMJ/dist: default map join with DC
- MMJ: multi-phase map join
- RMJ/dist: reversed map join with DC
- JMJ/dist: JDBM-based map join with DC
- ARJ/dist: advanced reduce join with DC
- DRJ: default reduce join

Experiment Setup
- TPC-DS benchmark
- Evaluated query: JOIN customer, web_sales ON cid
- Performed on different scales of generated data, e.g. 10 GB, 170 GB (scale factors, not actual table sizes)
- Each combination is performed five (5) times
- Results are analyzed with error bars

Hadoop Cluster
- 128 Hewlett Packard DL160 Compute Building Blocks
- Each equipped with: 2 quad-core CPUs, 16 GB RAM, 2 TB storage, a high-speed network connection
- Used in the experiment: Hadoop cluster (Altocumulus), 64 nodes
Result Analysis
[Chart: running times of all nine candidates (AMJ/no dist, AMJ/dist, DMJ/no dist, DMJ/dist, MMJ, RMJ/dist, JMJ/dist, ARJ/dist, DRJ); some results ignored.]

One small note: what does 50*200 mean?
- TABLE customer comes from the 50 GB version of TPC-DS; actual table size: about 100 MB
- TABLE web_sales comes from the 200 GB version of TPC-DS; actual table size: about 30 GB
Distributed Cache
[Chart: DMJ/no dist vs. DMJ/dist running times.]

Distributed Cache II
- The distributed cache introduces overhead when copying files from HDFS to the nodes' local disks
- The following situations favor the distributed cache (compared to non-DC); see the sketch below:
  1. The number of nodes is low
  2. The number of map tasks is high
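For reference, a sketch of how the /dist variants ship the small table with the Hadoop 1.x DistributedCache API (the HDFS path is hypothetical):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class DistCacheSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "map join with DC");
            // Driver side: register the small table before submission; it is
            // copied from HDFS to every node's local disk once per job, which
            // is the overhead discussed above.
            DistributedCache.addCacheFile(new URI("/tables/customer.tsv"),
                                          job.getConfiguration());
            // ... set mapper and input/output paths, then job.waitForCompletion(true);

            // Task side (inside the mapper):
            // Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            // then open local[0] with a BufferedReader.
        }
    }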
Advanced vs. Default
[Chart: ARJ/dist vs. DRJ across scale combinations 10*10 through 70*70.]

Advanced vs. Default II
[Chart: AMJ/dist vs. DMJ/dist.]

Advanced vs. Default III
- The overhead of the semi join and filtering is heavy
- The following situations favor advanced joins (compared to the default joins):
  1. Join selectivity gets lower
  2. The network becomes slower (true!)
  3. We need to handle skewed data

Map Join vs. Reduce Join, Part I
[Chart: DMJ/no dist, MMJ, JMJ/dist, ARJ/dist, DRJ.]

Map Join vs. Reduce Join, Part II
[Chart: DMJ/no dist, RMJ/dist, JMJ/dist, ARJ/dist, DRJ.]

Map Join vs. Reduce Join
- In most situations, Default Map Join performs better than Default Reduce Join: it eliminates the data transfer and sorting at the shuffle stage
- The gap is not significant due to the fast network

Potential problems of map joins:
- A job involving too many map tasks causes a large amount of data to be transferred over the network
- The distributed cache may harm performance

Beyond Default Map Join
Multi-Phase Map Join
- Succeeded in all experiment groups
- Performance comparable with DMJ when only one phase is involved
- Performance degrades sharply when the number of phases is greater than 2, due to the many more tasks launched
- Currently no support for the distributed cache, so not scalable

Beyond Default Map Join
Reversed Map Join
- Succeeded in all experiment groups
- Does not perform as well as DRJ, due to the overhead of the distributed cache
- Performs best when …
Beyond Default Map Join
JDBM Map Join
- Failed in the last two experiment groups, mainly due to improper configuration settings

Join Plan Generator
- Cost-based + rule-based
- Focuses on three aspects:
  1. Whether or not to use the distributed cache
  2. Whether to use Default Map Join
  3. Map joins or reduce-side join

Parameters
- Number of distributed files: d
- Network speed: v
- Number of map tasks: m
- Number of reduce tasks: r
- Number of working nodes: n
- Small table size: s
- Large table size: l

Join Plan Generator
Whether to use the distributed cache (only relevant for the map join approaches)
- Cost model with distributed cache: $c_1 \cdot d + \frac{s \cdot n}{v}$, where $c_1$ is the average overhead to distribute one file
- Cost model without distributed cache: $\frac{s \cdot m}{v}$
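A direct encoding of this comparison (a sketch; the units are assumptions: sizes in MB, network speed in MB/s, and c1 in seconds per file):

    public class DistCacheCost {
        /** True if distributing the small table once per node beats
         *  every map task pulling it from HDFS. */
        static boolean useDistributedCache(double c1, int d, double s,
                                           int n, int m, double v) {
            double withDC    = c1 * d + s * n / v; // copy once to each of n nodes
            double withoutDC = s * m / v;          // each of m map tasks reads s
            return withDC < withoutDC;
        }

        public static void main(String[] args) {
            // Slide numbers: 150 MB small table, 64 nodes, 352 map tasks;
            // c1 = 2 s/file and v = 100 MB/s are illustrative guesses.
            System.out.println(useDistributedCache(2.0, 1, 150, 64, 352, 100)); // true
        }
    }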
Join Plan Generator
Whether to use Default Map Join
- We give Default Map Join the highest priority, since it usually works best
- The choice on the distributed cache ensures Default Map Join works efficiently
- Rule: if the small table can fit into memory entirely, just do it

Join Plan Generator
Map joins or default reduce-side join
- In the situations where DMJ fails, Reversed Map Join is the most promising in terms of usability and scalability
- Cost model (see the chooser sketch below):
  - RMJ: $\frac{s \cdot m}{v}$ (without distributed cache), or $c_1 \cdot d + \frac{s \cdot n}{v}$ (with distributed cache), where $c_1$ is the average overhead to distribute one file
  - DRJ: $f(s + l, v, r)$
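A sketch that strings the three decisions together under the same assumptions; DRJ's shuffle cost f(s + l, v, r) is passed in since the slide leaves it abstract, and an RMJ-without-DC label appears for completeness even though only RMJ/dist was benchmarked:

    public class JoinPlanChooser {
        /** Returns a plan label following the slides' rules and cost models. */
        static String choosePlan(double c1, int d, double s, int n, int m,
                                 double v, double taskHeap, double drjCost) {
            boolean dc = (c1 * d + s * n / v) < (s * m / v); // distributed cache?
            if (s < taskHeap)                  // rule: small table fits in memory
                return dc ? "DMJ/dist" : "DMJ/no dist";
            double rmjCost = dc ? (c1 * d + s * n / v) : (s * m / v);
            return rmjCost < drjCost ? (dc ? "RMJ/dist" : "RMJ/no dist") : "DRJ";
        }

        public static void main(String[] args) {
            // 150 MB table, 64 nodes, 352 map tasks (numbers from the slides);
            // the heap budget and DRJ cost estimate are made up for illustration.
            System.out.println(choosePlan(2.0, 1, 150, 64, 352, 100, 200, 400));
        }
    }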
Join Plan Generator
[Decision flowchart: Distributed cache? (Y/N) → Default Map Join? If yes, do it; otherwise choose between Reversed Map Join and Default Reduce-side Join.]

Summary
- The distributed cache is a double-edged sword
- When the distributed cache is used properly, Default Map Join performs best
- The three new map join approaches extend the usability of the default map join

Future Work
- SPJA workflow (selection, projection, join, aggregation)
- Better optimizer
- Multi-way join
- Build into a hybrid system
- Need a dedicated (slower) cluster