Online Aggregation, A Survey

Preliminaries
Recent work
Our Contributions
Online Aggregation: A Survey
CS 860 Course Project Literature Survey
Jasnoor Singh Mann, Smrithi Sukesh, Vishnu Prathish
David R. Cheriton School of Computer Science
University of Waterloo
July 15, 2014
Outline
1 Preliminaries
   Introduction
   Ripple Join
2 Recent work
   Scalable Hash Ripple Join
   Online Aggregation Over Multiple Queries
3 Our Contributions
   Optimizing random sampling and Parallel online aggregation
Online Aggregation
Aggregation in traditional databases is performed in batch mode
Aggregate queries are SQL queries that use aggregate functions
such as AVG, SUM and COUNT
Online aggregation lets users observe the progress of
their query and control its execution
Online Aggregation
Interval indicates the precision parameter
Confidence indicates the confidence parameter (probability)
A key feature of online aggregation is random access to data
using heap scans, index scans or sampling from indices.
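The interval and confidence parameters can be illustrated with a running AVG. The following is a minimal sketch, not the method of any surveyed system: it assumes a normal-approximation (central limit theorem) interval without finite-population correction, and it uses an in-memory shuffle as a stand-in for random access to the table. The name online_avg and its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist

def online_avg(values, confidence=0.95, report_every=1000, seed=0):
    """Yield (n, running_avg, half_width) while scanning values in random order."""
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # assumption: stand-in for random-order access to the table
    # z-value for the requested confidence, roughly 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squared deviations
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 or n == len(order):
            var = m2 / (n - 1) if n > 1 else 0.0
            # the estimate is reported as running_avg +/- half_width
            yield n, mean, z * math.sqrt(var / n)
```

A caller would iterate over the generator and stop as soon as the half-width is small enough, which is the "control execution" part of online aggregation.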
Ripple Joins
Used for online processing (online aggregation) of multi-table
aggregation queries in a relational database management
system.
Designed to minimize the time until an acceptably precise
estimate of the query result is obtained
Generalizes traditional block nested-loops and hash joins
The basic concept: for a two-table join, one previously
unseen tuple is retrieved from each relation at each
sampling step; each new tuple is joined with the previously
seen tuples of the other relation, and the two new tuples are
joined with each other
The order in which the data is retrieved is random
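The sampling step described above can be sketched for a COUNT(*) over an equi-join. This is a minimal sketch of the nested-loops flavour with one tuple drawn per relation per step, under the assumption that both relations fit in memory and can be shuffled up front. The function name ripple_join_count and the key-extractor parameters are hypothetical.

```python
import random

def ripple_join_count(R, S, key_r, key_s, seed=0):
    """Yield (k, estimate) of COUNT(*) for R JOIN S after each sampling step."""
    rng = random.Random(seed)
    R, S = list(R), list(S)
    rng.shuffle(R)  # assumption: random retrieval order via in-memory shuffle
    rng.shuffle(S)
    seen_r, seen_s = [], []
    matches = 0
    for k in range(1, min(len(R), len(S)) + 1):
        r, s = R[k - 1], S[k - 1]  # one previously unseen tuple per relation
        # join the new R tuple with previously seen S tuples
        matches += sum(1 for t in seen_s if key_r(r) == key_s(t))
        # join the new S tuple with previously seen R tuples
        matches += sum(1 for t in seen_r if key_r(t) == key_s(s))
        # join the two new tuples with each other
        matches += int(key_r(r) == key_s(s))
        seen_r.append(r)
        seen_s.append(s)
        # scale the matches found in the k x k sample to the |R| x |S| cross product
        yield k, matches * (len(R) * len(S)) / (k * k)
```

When both relations have the same size, the last estimate equals the exact join count, because every pair has been examined by then.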
Scalable Hash Ripple Join: Introduction
Recently, a new variant of Hash Ripple Join was proposed. The
proposed variant tries to improve performance over the
traditional Hash Ripple Join by:
Executing it in a multi-processor environment
Multi-threading at every processing node
Handling hash tables which do not fit into memory (memory
overflow)
It should be noted that the traditional Hash Ripple Join simply reverts
to the less efficient Block Ripple Join in case of memory overflow.
Scalable Hash Ripple Join: Working
The technique involves a central coordinator node and worker
nodes. A worker node runs three threads:
A thread to redistribute samples from the first relation in the
join process
A second thread that does the same for the second relation
A third thread that carries out a local join operation
The worker nodes periodically send their local join results to the
central coordinator node, which generates the global join results.
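The division of work between workers and coordinator can be sketched as follows. This is a simplified, single-process illustration and not the authors' implementation: the three threads per worker are flattened into sequential calls, redistribution is assumed to hash on the join key, and only the number of matching sample pairs is tracked. Worker, add_r, add_s and coordinator_count are hypothetical names.

```python
from collections import defaultdict

class Worker:
    """Hypothetical worker holding one hash partition of each relation."""

    def __init__(self):
        self.r_table = defaultdict(int)  # join key -> number of R samples seen
        self.s_table = defaultdict(int)  # join key -> number of S samples seen
        self.local_matches = 0

    def add_r(self, key):
        # a new R sample joins with every S sample already seen for this key
        self.local_matches += self.s_table.get(key, 0)
        self.r_table[key] += 1

    def add_s(self, key):
        # a new S sample joins with every R sample already seen for this key
        self.local_matches += self.r_table.get(key, 0)
        self.s_table[key] += 1

def coordinator_count(r_keys, s_keys, num_workers=4):
    """Route samples to workers by join key and sum the local join results."""
    workers = [Worker() for _ in range(num_workers)]
    for r_key, s_key in zip(r_keys, s_keys):
        # assumption: redistribution hashes the join key, so matching
        # tuples from both relations always meet on the same worker
        workers[hash(r_key) % num_workers].add_r(r_key)
        workers[hash(s_key) % num_workers].add_s(s_key)
    # the coordinator combines local results into the global result
    return sum(w.local_matches for w in workers)
```

Because matching keys always land on the same worker, the sum of local match counts does not depend on the number of workers.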
Scalable Hash Ripple Join: Challenges and Analysis
The technique tries to overcome the following:
Hash tables eventually not fitting into memory (Adaptive
Symmetric Hash Join)
Parallelism applied to an approximation technique
The authors suggest there is a performance gain by a factor of 5
over Ripple Join operating in a uniprocessor environment.
The technique converges much faster than the
traditional Hash Ripple Join.
Multiple Queries: Introduction
Traditionally, online aggregation queries have been processed
individually.
Sai Wu introduced a technique called Continuous Sampling for
Online Aggregation over Multiple Queries (COSMOS). The basis
of the technique is the reuse of partial aggregates from
earlier queries.
The technique makes use of a Scrambler, which provides random
samples. Further, the queries are added to a dissemination graph
according to their dependencies on earlier queries.
Multiple Queries: The Scrambler
The technique advocates storing a precomputed random sample
stream to reduce the I/O overhead associated with retrieving
random samples.
The Scrambler retrieves samples in a sequential manner and
arranges them randomly to build the data stream. It runs a few
more passes to further randomize the data stream.
When data is updated or added to base tables, the scrambled
stream is updated while taking the random nature of the data
stream into consideration.
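The Scrambler's behaviour can be approximated with a buffered, multi-pass shuffle. This is a rough sketch under stated assumptions and not the COSMOS algorithm itself: the buffer size, the number of passes and the rotation between passes are illustrative choices, and the result is not guaranteed to be a perfectly uniform permutation. The names scramble and insert_random are hypothetical.

```python
import random

def scramble(rows, buffer_size=1024, passes=2, seed=0):
    """Read rows sequentially and return them in a randomized order."""
    rng = random.Random(seed)
    stream = list(rows)
    for _ in range(passes):
        out, buf = [], []
        for row in stream:  # sequential read of the current stream
            buf.append(row)
            if len(buf) == buffer_size:
                rng.shuffle(buf)  # randomize within the buffer
                out.extend(buf)
                buf = []
        rng.shuffle(buf)  # flush the last, partially filled buffer
        out.extend(buf)
        # rotate by a random offset so buffer boundaries differ on the next pass
        cut = rng.randrange(len(out)) if out else 0
        stream = out[cut:] + out[:cut]
    return stream

def insert_random(stream, row, rng):
    """Place a newly added row at a uniformly random position in the stream."""
    stream.insert(rng.randrange(len(stream) + 1), row)
```

Inserting a new row at a uniformly random position is one simple way to keep an already randomized stream random when the base table grows.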
Multiple Queries: Dissemination Graph
A dissemination graph maintains partitions of partial aggregates,
allowing a greater degree of reusability.
The nodes denote the queries. The edges denote the dependencies.
The edge labels are partition identifiers. The sample source is the
scrambled data stream.
Handling the skew in distribution
Some distributions contain outliers with large variance
These significantly change the overall aggregate
Random sampling may not capture them, giving a wildly inaccurate
aggregate
Solution: create an outlier index
Keep a separate index for non-outliers
Calculate aggregates separately for both indexes
Merge both aggregates using a weighted average
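The outlier-index idea can be sketched for AVG as follows. This is a minimal sketch, assuming that outliers are identified by a fixed magnitude threshold and that the input is non-empty; a real system may choose outliers differently. The names split_outliers and estimate_avg are hypothetical.

```python
import random

def split_outliers(values, threshold):
    """Partition values into (outliers, non_outliers) by absolute magnitude."""
    outliers = [v for v in values if abs(v) > threshold]
    rest = [v for v in values if abs(v) <= threshold]
    return outliers, rest

def estimate_avg(values, threshold, sample_size, seed=0):
    """Estimate AVG from exact outlier values plus a sample of the rest."""
    rng = random.Random(seed)
    outliers, rest = split_outliers(values, threshold)
    # outliers are assumed to be few, so they are aggregated exactly
    outlier_avg = sum(outliers) / len(outliers) if outliers else 0.0
    # non-outliers are estimated from a random sample
    sample = rng.sample(rest, min(sample_size, len(rest)))
    rest_avg = sum(sample) / len(sample) if sample else 0.0
    # weighted average, with weights proportional to the partition sizes
    return (len(outliers) * outlier_avg + len(rest) * rest_avg) / len(values)
```

With 990 values of 1.0 and 10 values of 1e6, a plain 50-row sample would most likely miss every outlier and report an average near 1, whereas the split estimate recovers the true average of 10000.99.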
Optimizing random sampling
Removing the need for purely random sampling when we
have foreknowledge of data access patterns.
Capture the partial ordering in data (other than GROUP BY)
Store relevant statistics beforehand
Fingered Scan
In Memory sampling.
Parallel sampling using multicore.
Compromise the accuracy for speed while generating random
samples from sequentially loaded pages.
Not explored in the scope of this project
Parallel Online aggregation
Formulae for calculating the global aggregate from local
aggregates.
Reused from the parallel hash join paper
Exploring parallel algorithms for other types of joins
Sort Merge Join - Not possible.
Nested Loop Join
Final result - A parallel nested loop online aggregation
algorithm with a suitable random sampling method
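One common way to compute a global aggregate from local aggregates is to have every worker report its count, sum and sum of squares, and to combine them at the coordinator. The sketch below shows this standard decomposition; it is not necessarily the set of formulae used in the project or in the parallel hash join paper. It assumes that all workers sample uniformly from the same population and that the list of aggregates is non-empty. The name merge_local is hypothetical.

```python
import math
from statistics import NormalDist

def merge_local(aggregates, confidence=0.95):
    """Combine per-worker (count, total, total_of_squares) tuples.

    aggregates must be a list (it is traversed several times).
    """
    n = sum(a[0] for a in aggregates)
    total = sum(a[1] for a in aggregates)
    squares = sum(a[2] for a in aggregates)
    mean = total / n
    # sample variance recovered from the pooled sum of squares
    var = (squares - n * mean * mean) / (n - 1) if n > 1 else 0.0
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "count": n,
        "sum": total,
        "avg": mean,
        # normal-approximation half-width of the interval around avg
        "half_width": z * math.sqrt(max(var, 0.0) / n),
    }
```

Each worker only ships three numbers per reporting round, so the coordinator's work stays constant regardless of how many tuples the workers have sampled.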
Questions ?
