
2014-05-06

Data Cleaning and Integration

Data Quality
•  Accuracy
•  Completeness
•  Consistency
•  Timeliness
•  Believability
•  Interpretability

J. Pei: Big Data Analytics -- Data Cleaning and Integration 2

Data Preprocessing
•  Processing data before an analytic task
–  Improve data quality
–  Transform data to facilitate the target task
•  Major tasks
–  Data cleaning
–  Data integration
–  Data reduction
–  Data transformation

Data Cleaning
•  The process of detecting and correcting corrupt or inaccurate records in data
•  Handling missing values
•  Smoothing data

Handling Missing Values
•  Ignore records with missing values
•  Fill in missing values
–  Manually
–  Using a global constant
–  Using a measure of central tendency for the attribute, such as mean, median, or mode
–  Using the central tendency of the class
–  Using the most probable value

Disguised Missing Data?
•  Online forms
•  Disguised missing data consists of missing data entries that are not explicitly represented as such, but instead appear as potentially valid data values
–  Information about "State" is missing
–  "Alabama" is used as the disguise
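The central-tendency strategies above can be sketched with the Python standard library. This is a minimal illustration; the `impute` helper and its strategy names are mine, not part of any particular tool.

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries using a measure of central tendency of the
    observed (non-missing) values: mean, median, or mode."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:  # "mode": the most frequent observed value
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]

ages = [30, None, 20, 30, None, 20, 30]
mean_filled = impute(ages, "mean")   # observed values average to 26
mode_filled = impute(ages, "mode")   # most frequent observed value is 30
```

Class-conditional imputation (the "central tendency of the class" bullet) would apply the same idea per class label rather than over the whole column.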


Disguised Missing Data Is Misleading
•  Unreasonable results
•  Wrong conclusions

Types of Disguised Missing Data
•  Randomly choose a valid value as disguise
•  A small number of values are chosen as disguise

[Figure: two bar charts of the number of customers per state (Alabama, Ohio, Washington) – real values versus disguised missing values]

Problem Definition
•  Cleaning disguised missing data: given a table T with attributes A and an integer k, for each attribute Ai, output k candidates of frequently used disguise values
•  Examples
–  "Alabama" in "state"
–  "0" in "blood pressure"
–  "21" in "age"

Ideas
•  Observation 1: Frequently used disguises
–  A small number of values are frequently used as the disguises
•  Observation 2: Missing at random
–  Missing data are often distributed randomly – a random subset of the whole database

[Figure: bar chart of the number of customers per state (Alabama, Ohio, Washington)]

General Framework
•  For each attribute A
–  For each frequent value v in A
•  Compute the maximal embedded unbiased sample contained in Tv
–  Return the k values with the best (in both quality and size) embedded unbiased sample

Id  State    Age  Gender
1   Alabama  30   M
2   Alabama  30   M
3   Alabama  30   F
4   Alabama  20   F
5   Ohio     20   F
6   Ohio     20   F

Smoothing Noisy Data
•  Noise: a random error or variance in a measured variable
•  Smoothing noise – removing noise


Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Regression
[Figure: smoothing data by fitting a regression function]
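The binning example above can be reproduced with a few lines of Python. The helper names are illustrative, and the equal-frequency split assumes the data divides evenly into the requested number of bins.

```python
def equal_frequency_bins(sorted_vals, n_bins):
    """Partition already-sorted data into bins of equal size."""
    size = len(sorted_vals) // n_bins
    return [sorted_vals[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace each value by its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min and max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
```

Running `smooth_by_means(bins)` and `smooth_by_boundaries(bins)` reproduces the bin values shown on the slide.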

Outlier Analysis
[Figure: detecting outliers using clustering]

Data Cleaning as a Process
•  Data discrepancy detection
–  Use metadata (e.g., domain, range, dependency, distribution)
–  Check field overloading
–  Check uniqueness rule, consecutive rule, and null rule
–  Use commercial tools
•  Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
•  Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
•  Data migration and integration
–  Data migration tools: allow transformations to be specified
–  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
•  Integration of the two processes
–  Iterative and interactive (e.g., Potter's Wheel)

Data Integration
•  Combining data from multiple (autonomous and heterogeneous) sources
•  Providing a unified view
•  Why is data integration hard?
–  Systems challenges
–  Data logical organization challenges
–  Social and administrative challenges

Data Integration System Architecture
[Figure: http://en.wikipedia.org/wiki/File:Dataintegration.png]


Wrappers
•  Computer programs that extract content from a particular data source and transform it into a target form, such as a relational table
•  Example: CMS (content management system) wrapper

<html>
<head>
<title>%page_title%</title>
</head>
<body>
%page_content%
<P>
%page_powered_by%
</body>
</html>

How to Build Wrappers?
•  Manual construction
•  Machine learning based methods: learning schemas from training data
–  Supervised learning approaches
–  Unsupervised learning approaches
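A manually constructed wrapper for the CMS template above can be as simple as a regular expression: the fixed template text around each `%placeholder%` becomes the extraction anchor. This is a sketch; the `wrap` helper and the sample page are illustrative.

```python
import re

# The template pieces surrounding each placeholder serve as anchors.
TEMPLATE_RE = re.compile(
    r"<title>\s*(?P<page_title>.*?)</title>.*?"
    r"<body>\s*(?P<page_content>.*?)\s*<P>\s*(?P<page_powered_by>.*?)\s*</body>",
    re.DOTALL | re.IGNORECASE,
)

def wrap(html):
    """Extract a relational row (a dict) from a page built from the template."""
    m = TEMPLATE_RE.search(html)
    return m.groupdict() if m else None

page = """<html><head><title> My Page</title></head>
<body>Hello world
<P>
Powered by ExampleCMS
</body></html>"""
row = wrap(page)
```

Learning-based wrapper construction generalizes this idea by inducing the anchors from labeled (supervised) or unlabeled (unsupervised) example pages.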

Schema Matching and Mapping
•  Schema matching: finding the semantic correspondences between attributes in data sources and those in the mediated schema
–  Example: attribute name in source S1 corresponds to attributes firstname and surname in the mediated schema
–  Name-based matching
–  Instance-based matching
•  Schema mapping: transforming attribute values from sources to the mediated schema
–  Example: a query or a program extracting name values from source S1 and forming firstname and surname values for the mediated schema

Entity Detection and Recognition
•  Entity detection: identify atomic elements in text or other data and classify them into predefined categories such as person names, locations, organizations, etc.
•  Entity disambiguation: identify entities carrying the same name
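The schema-mapping example above (splitting a source's `name` into the mediated schema's `firstname` and `surname`) can be sketched as a one-line transformation; the split-on-first-space rule is an assumption for illustration.

```python
def map_name(row):
    """Map source schema (name, ...) to mediated schema (firstname, surname)."""
    first, _, last = row["name"].partition(" ")
    return {"firstname": first, "surname": last}

mediated = map_name({"name": "Ada Lovelace"})
```

Real mappings are rarely this clean (middle names, single-token names, ordering conventions), which is why mapping programs are usually written and validated per source.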

Example
[Figure lost in extraction]

Data Provenance
•  The data about how a data entry came to be
–  Also known as data lineage/pedigree
•  The annotation approach: a series of annotations describing how each data item was produced
•  The graph of data relationships approach: connecting sources and deriving new data items via mappings


Deep / Hidden Web
•  Sites that are difficult for a crawler to find
–  Probably over 100 times larger than the traditionally indexed web
•  Three major categories of sites in the deep web
–  Private sites – intentionally private; no incoming links, or login may be required
–  Form results – only accessible by entering data into a form, e.g., airline ticket queries
•  Hard to detect changes behind a form
–  Scripted pages – using JavaScript, Flash, or another client-side language in the web page
•  A crawler needs to execute the script – can slow down crawling significantly
•  The deep web is different from dynamic pages
–  Wikis generate web pages dynamically but are easy to crawl
–  Private sites are static but cannot be crawled

Multidimensional Analysis

Outline
•  Why multidimensional analysis?
•  Multidimensional analysis principle
•  OLAP
•  OLAP indexes

Jian Pei: Big Data Analytics -- Multidimensional Analysis 2

Dimensions
•  "An aspect or feature of a situation, problem, or thing; a measurable extent of some kind" – Dictionary
•  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
–  Objects are compared in selected dimensions/attributes
•  More often than not, objects have more dimensions/attributes than one is interested in and can handle

Multi-dimensional Analysis
•  Find interesting patterns in multi-dimensional subspaces
–  "Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)"
•  Different patterns may be manifested in different subspaces
–  Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction – a set of features for all objects
–  Different subspaces may manifest different patterns

OLAP
•  Conceptually, we may explore all possible subspaces for interesting patterns
–  What patterns are interesting?
–  How can we explore all possible subspaces systematically and efficiently?
–  Fundamental problems in analytics and data mining
•  Aggregates and group-bys are frequently used in data analysis and summarization

SELECT time, altitude, AVG(temp)
FROM weather GROUP BY time, altitude;

–  In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times, group-bys 20 times
•  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently

OLAP Operations
•  Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
–  (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
•  Drill down (roll down): reverse of roll-up, from higher-level summary to lower-level summary or detailed data, or introducing new dimensions

Other Operations
•  Dice: pick specific values or ranges on some dimensions
•  Pivot: "rotate" a cube – changing the order of dimensions in visual analysis
[Figure: http://en.wikipedia.org/wiki/File:OLAP_pivoting.png]

Relational Representation
•  If there are n dimensions, there are 2^n possible aggregation columns
•  Roll up by model, by year, by color in a table

Difficulties
•  Many group-bys are needed
–  6 dimensions → 2^6 = 64 group-bys
•  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Dummy Value ALL
[Figure lost in extraction]

DATA CUBE

The SALES base table:

Model  Year  Color  Sales
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  blue     62
Chevy  1991  red      54
Chevy  1991  white    95
Chevy  1991  blue     49
Chevy  1992  red      31
Chevy  1992  white    54
Chevy  1992  blue     71
Ford   1990  red      64
Ford   1990  white    62
Ford   1990  blue     63
Ford   1991  red      52
Ford   1991  white     9
Ford   1991  blue     55
Ford   1992  red      27
Ford   1992  white    62
Ford   1992  blue     39

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);

The DATA CUBE result adds ALL rows for every combination of aggregated-out dimensions (excerpt):

Model  Year  Color  Sales
Chevy  1990  blue     62
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  ALL     154
Chevy  ALL   blue    182
Chevy  ALL   ALL     508
ALL    1990  ALL     343
ALL    ALL   ALL     941

CUBE Semantics of ALL
•  ALL is a set
–  Model.ALL = ALL(Model) = {Chevy, Ford}
–  Year.ALL = ALL(Year) = {1990, 1991, 1992}
–  Color.ALL = ALL(Color) = {red, white, blue}
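The CUBE operator's semantics can be sketched in plain Python: every row contributes to the 2^n group-bys obtained by keeping or aggregating out each dimension. The `data_cube` helper is illustrative, computing SUM() and using "ALL" for aggregated-out dimensions.

```python
from collections import defaultdict
from itertools import combinations

def data_cube(rows, n_dims):
    """Materialize all 2^n_dims SUM() group-bys of (dim_1, ..., dim_n, measure) rows."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        for r in range(n_dims + 1):
            for kept in combinations(range(n_dims), r):
                # Dimensions not in `kept` are aggregated out as ALL.
                key = tuple(dims[i] if i in kept else "ALL" for i in range(n_dims))
                cube[key] += measure
    return dict(cube)

sales = [("Chevy", 1990, "red", 5), ("Chevy", 1990, "blue", 62),
         ("Ford", 1990, "red", 64)]
cube = data_cube(sales, 3)
```

Each input row lands in 2^3 = 8 cube cells here, which is exactly why naive cube materialization explodes with dimensionality.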

OLTP Versus OLAP

                    OLTP                             OLAP
users               clerk, IT professional           knowledge worker
function            day-to-day operations            decision support
DB design           application-oriented             subject-oriented
data                current, up-to-date, detailed,   historical, summarized,
                    flat relational, isolated        multidimensional, integrated,
                                                     consolidated
usage               repetitive                       ad-hoc
access              read/write, index/hash on        lots of scans
                    primary key
unit of work        short, simple transaction        complex query
# records accessed  tens                             millions
# users             thousands                        hundreds
DB size             100 MB – GB                      100 GB – TB
metric              transaction throughput           query throughput, response time

What Is a Data Warehouse?
•  "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." – W. H. Inmon
•  Data warehousing: the process of constructing and using data warehouses

Subject-Oriented
•  Organized around major subjects, such as customer, product, sales
•  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
•  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Integrated
•  Integrating multiple, heterogeneous data sources
–  Relational databases, flat files, on-line transaction records
•  Data cleaning and data integration
–  Ensuring consistency in naming conventions, encoding structures, attribute measures, etc., among different data sources
•  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
–  When data is moved to the warehouse, it is converted

Time Variant
•  The time horizon for the data warehouse is significantly longer than that of operational systems
–  Operational databases: current-value data
–  Data warehouse data: provide information from a historical perspective (e.g., past 5–10 years)
•  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
–  But the key of operational data may or may not contain a time element

Nonvolatile
•  A physically separate store of data transformed from the operational environment
•  Operational updates of data do not occur in the data warehouse environment
–  Do not require transaction processing, recovery, and concurrency control mechanisms
–  Require only two operations in data accessing
•  Initial loading of data
•  Access of data

Why Separate Data Warehouse?
•  High performance for both
–  Operational DBMS: tuned for OLTP
–  Warehouse: tuned for OLAP
•  Different functions and different data
–  Historical data: data analysis often uses historical data that operational databases do not typically maintain
–  Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources

Star Schema
[Figure: a Sales Fact Table (time_key, item_key, branch_key, location_key; measures units_sold, dollars_sold, avg_sales) linked to the dimension tables time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), and location (location_key, street, city, state_or_province, country)]

Snowflake Schema
[Figure: like the star schema, but with normalized dimension tables – item (item_key, item_name, brand, type, supplier_key) references supplier (supplier_key, supplier_type), and location (location_key, street, city_key) references city (city_key, city, province_or_state, country)]

Fact Constellation
[Figure: multiple fact tables share dimension tables – a Shipping Fact Table (item_key, time_key, shipper_key, from_location, to_location; measures dollars_cost, units_shipped) shares the time, item, and location dimensions with the Sales Fact Table, and references shipper (shipper_key, shipper_name, location_key, shipper_type)]

(Good) Aggregate Functions
•  Distributive: there is a function G() such that F({X_{i,j}}) = G({F({X_{i,j} | i = 1, ..., I}) | j = 1, ..., J})
–  Examples: COUNT(), MIN(), MAX(), SUM()
–  G = SUM() for COUNT()
•  Algebraic: there is an M-tuple-valued function G() and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i = 1, ..., I}) | j = 1, ..., J})
–  Examples: AVG(), standard deviation, MaxN(), MinN()
–  For AVG(), G() records the sum and count; H() adds up the sums and counts and divides to produce the global average

Holistic Aggregate Functions
•  There is no constant bound on the size of the storage needed to describe a sub-aggregate
–  There is no constant M such that an M-tuple characterizes the computation F({X_{i,j} | i = 1, ..., I})
•  Examples: Median(), MostFrequent() (also called Mode()), and Rank()
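The algebraic case for AVG() can be made concrete: G() summarizes each partition as a 2-tuple (sum, count), and H() merges the 2-tuples into the global average. A minimal sketch; the function names are mine.

```python
def partial_avg(chunk):
    """G(): summarize one partition as the 2-tuple (sum, count)."""
    return (sum(chunk), len(chunk))

def merge_avg(partials):
    """H(): combine the partial 2-tuples into the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

chunks = [[1, 2, 3], [4, 5], [6]]
global_avg = merge_avg([partial_avg(c) for c in chunks])
```

No such constant-size summary exists for Median() – any bounded tuple loses information needed to merge sub-medians – which is what makes it holistic.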

Index Requirements in OLAP
•  Data is read-only
–  (Almost) no insertions or deletions
•  Query types
–  Point query: looking up one specific tuple (rare)
–  Range query: returning the aggregate of a (large) set of tuples, with group-by
–  Complex queries: need specific algorithms and index structures, discussed later

OLAP Query Example
•  In a table (cust, gender, ...), find the total number of male customers
•  Method 1: scan the table once
•  Method 2: build a B+-tree index on attribute gender; we still need to access all tuples of male customers
•  Can we get the count without scanning many tuples, or even without accessing any tuples of male customers?

Bitmap Index
•  For n tuples, a bitmap index has n bits and can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
•  From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j

cust   gender  ...    bitmap for M
Jack   M       ...    1
Cathy  F       ...    0
...    ...     ...    ...
Nancy  F       ...    0

Using Bitmap to Count
•  shcount[] contains the number of set bits in the entry subscript
–  shcount[01100101] = 4

count = 0;
for (i = 0; i < SHNUM; i++)
    count += shcount[B[i]];
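The same idea in Python, using an arbitrary-precision integer as the bitmap and a population count in place of the shcount[] lookup table. The helper names are illustrative.

```python
def build_bitmap(values, match):
    """One bit per tuple: bit j is set iff tuple j has the given value."""
    bm = 0
    for j, v in enumerate(values):
        if v == match:
            bm |= 1 << j
    return bm

genders = ["M", "F", "F", "M", "M", "F"]
male = build_bitmap(genders, "M")      # bits 0, 3, 4 set
female = build_bitmap(genders, "F")    # bits 1, 2, 5 set
count_male = bin(male).count("1")      # population count, like shcount[]
```

The bitmaps also compose with bitwise logic, e.g., `male & over_40` answers a conjunctive predicate without touching the tuples.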

Advantages of Bitmap Index
•  Efficient in space
•  Ready for logic composition
–  C = C1 AND C2
–  Bitmap operations can be used
•  Bitmap index only works for categorical data with low cardinality
–  Naively, we need 50 bits per entry to represent the state of a customer in the US
–  How can we represent a sale amount in dollars?

Bit-Sliced Index
•  A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits
–  24 bits is good for up to $167,772.15, appropriate for many stores
•  A bit-sliced index is N bitmaps
–  Tuple j is set in bitmap k if the k-th bit of its binary representation is on
–  The space cost of a bit-sliced index is the same as storing the data directly

Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
–  The tuples satisfying C are identified by a bitmap B
•  Direct access to rows to calculate SUM: scan the whole table once
•  B+-tree: find the tuples from the tree
•  Projection index: only scan attribute sales
•  Bit-sliced index: get the sum as Σ_k COUNT(B AND B_k) * 2^k

Cost Comparison
•  The traditional value-list index (B+-tree) is costly in both I/O and CPU time
–  Not good for OLAP
•  The bit-sliced index is efficient in I/O
•  Other case studies in [O'Neil and Quass, SIGMOD '97]
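The bit-sliced SUM formula above can be sketched directly: build one bitmap per bit position, then weight the per-slice population counts by powers of two. The helper names are mine.

```python
def bit_slices(amounts, n_bits):
    """Bitmap B_k has bit j set iff bit k of amounts[j] is on."""
    slices = []
    for k in range(n_bits):
        bm = 0
        for j, a in enumerate(amounts):
            if (a >> k) & 1:
                bm |= 1 << j
        slices.append(bm)
    return slices

def sliced_sum(slices, selection):
    """SUM over the rows selected by bitmap `selection`:
    sum_k 2^k * COUNT(selection AND B_k)."""
    return sum((1 << k) * bin(selection & bk).count("1")
               for k, bk in enumerate(slices))

amounts = [5, 3, 12, 7]            # e.g., sales in pennies
slices = bit_slices(amounts, 4)
total = sliced_sum(slices, 0b1111)  # all four rows selected
```

Only N bitmap scans are needed regardless of how many rows the selection covers, which is the I/O advantage the slide refers to.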

Horizontal or Vertical Storage
•  A fact table for data warehousing is often fat
–  Tens or even hundreds of dimensions/attributes
•  A query is often about only a few attributes
•  Horizontal storage: tuples are stored one by one
•  Vertical storage: tuples are stored by attributes

A1  A2  ...  A100
x1  x2  ...  x100
... ...  ...  ...
z1  z2  ...  z100

Horizontal Versus Vertical
•  Find the information of tuple t
–  Typical in OLTP
–  Horizontal storage: get the whole tuple in one search
–  Vertical storage: search 100 lists
•  Find SUM(a100) GROUP BY {a22, a83}
–  Typical in OLAP
–  Horizontal storage (no index): scan all tuples, O(100n), where n is the number of tuples
–  Vertical storage: search 3 lists, O(3n) – 3% of the cost of the horizontal method
•  Projection index: vertical storage

Rolling-up/Drilling-down Analysis
•  Roll up by model, by year, by color
•  The result is not a table: many NULL values, no key
[Figure: pivoting the roll-up result into cubes]

Extending GROUP BY

SELECT Manufacturer, Year, Month, Day, Color, Model, SUM(price) AS Revenue
FROM Sales
GROUP BY Manufacturer,
  ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day,
  CUBE Color, Model;

–  Manufacturer: plain group-by; Year, Month, Day: roll-up; Color × Model: cube

MOLAP
[Figure: the data cube stored as a 3-dimensional array, with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum)]

Pros and Cons
•  Easy to implement
•  Fast retrieval
•  Many entries may be empty if data is sparse
•  Costly in space

ROLAP – Data Cube in Table
•  A multi-dimensional database

Base table:
Store  Product  Season  Sales
S1     P1       Spring  6
S1     P2       Spring  12
S2     P1       Fall    9

Cubing produces the cube table (excerpt):
Store  Product  Season  AVG(Sales)
S1     P1       Spring  6
S1     P2       Spring  12
S2     P1       Fall    9
S1     *        Spring  9
...    ...      ...     ...
*      *        *       9

Observations
•  Once a base table (A, B, C) is sorted by A-B-C, the aggregates (*,*,*), (A,*,*), (A,B,*), and (A,B,C) can be computed with one scan and 4 counters
•  To compute the other aggregates, we can sort the base table in other orders

How to Sort the Base Table?
•  General sorting in main memory: O(n log n)
•  Counting in main memory: O(n), linear in the number of tuples in the base table
–  How to sort 1 million integers in the range 1 to 100?
–  Set up 100 counters, initialized to 0
–  Scan the integers once, counting the occurrences of each value in 1 to 100
–  Scan the integers again, putting each integer in its right place
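The counting idea above is counting sort. A minimal sketch (the function name and default bounds are illustrative):

```python
def counting_sort(nums, lo=1, hi=100):
    """Sort integers in [lo, hi] in O(n) time with one counter per value."""
    counts = [0] * (hi - lo + 1)
    for x in nums:                      # first scan: count occurrences
        counts[x - lo] += 1
    out = []
    for v, c in enumerate(counts):      # second scan: emit values in order
        out.extend([v + lo] * c)
    return out
```

This beats O(n log n) comparison sorting precisely because dimension values in a base table come from a small known domain.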

Iceberg Cube
•  In a data cube, many aggregate cells are trivial
–  Having an aggregate too small
•  Iceberg query: compute only the cells whose aggregates satisfy a given (iceberg) condition, such as COUNT(*) >= 100

Monotonic Iceberg Condition
•  If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
•  For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
–  (a, b, *) is an ancestor of (a, b, c)
•  An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can honor P

Pushing Monotonic Conditions
•  BUC searches the aggregates bottom-up in a depth-first manner
•  Only when a monotonic condition holds should the descendants of the current node be expanded

How to Push Non-Monotonic Ones?
•  The condition P(c) = AVG(price) >= 800 AND COUNT(*) >= 50 is not monotonic
•  BUC cannot push such a constraint

Ideas
•  Let AVG^k(price) be the average of the top-k tuples
•  AVG^k(price) >= 800 is a monotonic condition
–  If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more
•  AVG^k(price) >= 800 can be a filter for AVG(price) >= 800
–  If AVG^k(price) < 800, then AVG(price) < 800
–  Generally, AVG() <= AVG^k()

Minimal Cubing
•  Compute only a shell of a data cube
–  Only compute and materialize low-dimensional cuboids, of dimensionality < k (k << n)
–  Save space and cubing time
•  Index the shell cells as well as their covers – the tuples contributing to the shell cells
•  Query answering
–  Use the shell cells and their intersections to compute the non-materialized cells
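The top-k average filter above is easy to sketch: AVG^k takes the k largest values, so it can only shrink as a cell is refined, making it a sound pruning bound for AVG(). The helper name is mine.

```python
from heapq import nlargest

def topk_avg(prices, k=10):
    """AVG^k: average of the k largest values (all values if fewer than k)."""
    top = nlargest(k, prices)
    return sum(top) / len(top)

cell_prices = [900, 850, 100, 120, 90]
# If topk_avg(cell_prices, k) < 800, then AVG < 800 for this cell and all
# of its descendants, so the whole subtree can be pruned in BUC.
can_prune = topk_avg(cell_prices, 10) < 800
```

Here the top-10 average equals the plain average (only 5 tuples) and is below 800, so the cell and its descendants are pruned for the condition AVG(price) >= 800.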

A Data Cube Is Often Huge
•  10 dimensions, cardinality 20 for each dimension → 21^10 = 16,679,880,978,201 possible tuples in the cube (21 values per dimension, counting ALL)
•  Even if only 1/1,000 of the possible tuples are non-empty, there are still more than 16 billion tuples

Compression of Data Cubes
•  Traditional compression methods, e.g., zip
–  High compression ratio
–  The compressed data cannot be queried directly
•  Requirements for data cube compression
–  The compressed cube can be queried efficiently
–  High compression ratio
•  Lossless compression and lossy compression

Redundancy in Data Cube
•  A base table with only one tuple (a1, ..., a100, 1000) and aggregate function SUM()
–  The data cube contains 2^100 tuples!
–  Every query about SUM() returns 1000
•  A data cube or a sub-cube may be populated by a single tuple – a base single tuple

A Little More General Case
•  A base table with two tuples, t1 = (a1, a2, b3, b4, 100) and t2 = (a1, a2, c3, c4, 1000), and aggregate function SUM()
•  (a1, a2, *, *), (a1, *, *, *), (*, a2, *, *), and (*, *, *, *) all have sum 1100, since they are populated by the same group of tuples {t1, t2} – base group tuples
•  We do not need to pre-compute and store all aggregates

Semantic Compression
•  Can we summarize a data cube so that the summarization can be browsed and understood effectively?
–  The summarization itself is a compression
–  The compression preserves the roll-up/drill-down relation
–  Directly query-able and browse-able for OLAP
•  Syntactic compression
–  Does not preserve the roll-up/drill-down semantics
–  Directly query-able for some queries, but may not be directly browse-able for OLAP

Cube Cell Lattice
•  Observation: many cells may have the same aggregate values
•  Can we summarize the semantics of the cube by grouping cells by aggregate values?

[Figure: the cube cell lattice of the running example –
top level: (S1,P1,s):6  (S1,P2,s):12  (S2,P1,f):9
middle level: (S1,*,s):9  (S1,P1,*):6  (*,P1,s):6  (S1,P2,*):12  (*,P2,s):12  (S2,*,f):9  (S2,P1,*):9  (*,P1,f):9
lower level: (S1,*,*):9  (*,*,s):9  (*,P1,*):7.5  (*,P2,*):12  (*,*,f):9  (S2,*,*):9
bottom: (*,*,*):9]

A Naïve Attempt
•  Put all cells with the same aggregate value into one class
•  The result is not a lattice anymore!
–  Anomaly: the roll-up/drill-down semantics is lost
[Figure: the lattice with cells grouped purely by aggregate value into classes C1–C4]

A Better Partitioning
•  Quotient cube: a partitioning preserving the roll-up/drill-down semantics
[Figure: the same lattice partitioned into classes C1–C5, each class convex and connected]

Why Semantic Compression Useful?
•  OLAP browsing
[Figure: browsing the quotient cube by expanding a class (e.g., C3, the cells with aggregate 9 around (S2, P1, f)) into its member cells]

Goals
•  Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that
–  The partition generates a reduced lattice preserving the roll-up/drill-down semantics
–  The partition is optimal: the number of classes is as small as possible
•  Compute, index, and store quotient cubes efficiently to answer OLAP queries

Why Equivalent Aggregate Values?
•  Two cells have equivalent aggregate values if they cover the same set of tuples in the base table
[Figure: the cube cell lattice of the running example, with each cell linked to the base-table tuples it covers]

10
Cover Partition Cover Partitions & Aggregates
•  For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c
–  E.g., Cov(S1,*,spring) = {(S1,P1,spring), (S1,P2,spring)}

   Dimensions               Measure
   Store  Product  Season   Sales
   S1     P1       Spring   6
   S1     P2       Spring   12
   S2     P1       Fall     9

•  All cells in a cover partition carry the same aggregate value with respect to any aggregate function
–  But cells in a class of MIN() may have different covers
•  For COUNT() and SUM() (positive), cover equivalence coincides with aggregate equivalence

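The MIN() caveat can be checked directly. A minimal sketch over a tiny hypothetical two-attribute table (not from the slides): two cells can agree on MIN yet cover different tuple sets, so aggregate equivalence under MIN() does not imply cover equivalence.

```python
# Hypothetical base table: (Store, Product) -> Sales
base = [(("S1", "P1"), 6), (("S1", "P2"), 12)]

def cover(cell):
    """Base tuples that roll up to the cell ('*' aggregates a dimension)."""
    return frozenset(t for t, _ in base
                     if all(c == "*" or c == v for c, v in zip(cell, t)))

def agg_min(cell):
    """MIN(Sales) over the cell's cover."""
    return min(m for t, m in base if t in cover(cell))

a, b = ("S1", "P1"), ("S1", "*")
print(agg_min(a), agg_min(b))   # both 6
print(cover(a) == cover(b))     # False: same MIN, different covers
```

This is exactly why the slide singles out COUNT() and SUM() over positive measures: for those, equal aggregates force equal covers, while MIN() (and MAX()) do not.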

Quotient Cube Multi-Criteria Decision Problems


•  A quotient cube is a quotient lattice of the cube lattice such that
–  Each class is convex and connected
–  All cells in a class carry the identical aggregate value w.r.t. a given aggregate function
•  Quotient cube preserves the roll-up / drill-down semantics
•  Multidimensional decision problems have a long history – more than 2300 years
•  Multidimensional decision problems are often challenging

Skyline – Best Tradeoffs Skyline: Formal Definition


•  Two dimensions: distance to water and height
•  Skyline: the buildings that are not dominated by any other buildings in both dimensions
[Figure: a city waterfront skyline, with SFU Harbor Center labeled]
•  A set of objects S in an n-dimensional space D = (D1, …, Dn)
–  Numeric dimensions for illustration in this talk
•  For u, v ∈ S, u dominates v if
–  u is better than v in one dimension, and
–  u is not worse than v in any other dimensions
–  For illustration in this talk, the smaller the better
•  u ∈ S is a skyline object if u is not dominated by any other objects in S
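The dominance test above translates directly into code. A minimal block-nested-loops-style sketch, assuming (as the talk does) that smaller is better on every dimension:

```python
def dominates(u, v):
    """u dominates v: no worse anywhere, strictly better somewhere (smaller is better)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    """Quadratic scan: keep exactly the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (price, travel time) tuples: (3, 1), (1, 4), (2, 2) trade off;
# (4, 4) is dominated by (2, 2)
pts = [(3, 1), (1, 4), (4, 4), (2, 2)]
print(skyline(pts))  # -> [(3, 1), (1, 4), (2, 2)]
```

The quadratic scan is only a sketch; the index-based and external algorithms surveyed later exist precisely because this approach does not scale to large databases.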
Example Skyline Computation

[Figure: a scatter plot of price vs. travel time with points u and v; the skyline points are highlighted]
•  First investigated as the maximum vector problem in [Kung et al., JACM 1975]
–  An O(n log^(d-2) n) time algorithm for d ≥ 4 and an O(n log n) time algorithm for d = 2 and 3
–  Divide-and-conquer-based methods: DD&C, LD&C, FLET
•  Skyline computation in database context
–  Data cannot be held in main memory
–  External algorithms

Skyline Computation on Large DB Full Space Skyline Is Not Enough!


•  A rule of thumb in database research – scalability on large databases
•  Index-based methods
–  Using bitmaps and the relationships between the skyline and the minimum coordinates of individual points, by Tan et al.
–  Using nearest-neighbor search by Kossmann et al.
–  The progressive branch-and-bound method by Papadias et al.
•  Index-free methods
–  Divide-and-conquer and block nested loops by Borzsonyi et al.
–  Sort-first-skyline (SFS) by Chomicki et al.
•  Skylines in subspaces
–  Skyline in space (# stops, price, travel-time)
–  If one does not care about # stops, how can we derive the superior trade-offs between price and travel-time from the full space skyline?
•  Sky cube – computing skylines in all non-empty subspaces (Yuan et al., VLDB'05)
–  A database/data warehousing approach
–  Any subspace skyline query can be answered (efficiently)
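The sky cube idea can be sketched in a few lines: restrict the dominance test to a chosen subset of dimensions, then compute the skyline of every non-empty subspace. This naive enumeration is exponential in dimensionality, so it is only illustrative, not the materialization strategy of Yuan et al.

```python
from itertools import combinations

def skyline_in_subspace(points, dims):
    """Skyline w.r.t. only the coordinates in dims (smaller is better)."""
    def dom(u, v):
        return (all(u[d] <= v[d] for d in dims)
                and any(u[d] < v[d] for d in dims))
    return [p for p in points if not any(dom(q, p) for q in points if q != p)]

def sky_cube(points, n_dims):
    """Skylines of all non-empty subspaces of an n_dims-dimensional space."""
    cube = {}
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            cube[dims] = skyline_in_subspace(points, dims)
    return cube

# Three hypothetical points that mutually trade off in the full space
pts = [(1, 3, 2), (2, 1, 3), (3, 2, 1)]
cube = sky_cube(pts, 3)
print(cube[(0,)])       # -> [(1, 3, 2)]  skyline on dimension 0 alone
print(cube[(0, 1, 2)])  # -> all three points: the full space skyline
```

Note the slide's motivating point shows up here: the subspace skylines are not simply projections or subsets of the full space skyline, which is why they must be computed (or cleverly summarized) rather than derived for free.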


Sky Cube Understanding Skylines


•  Both Wilt Chamberlain and Michael Jordan
are in the full space skyline of the Great
NBA Players
•  Data mining/exploration-driven questions
–  Which merits, respectively, really make them
outstanding?
–  How are they different?

Redundancy in Sky Cube Mining Decisive Subspaces

Does it just happen that skylines in multiple subspaces are identical?
•  Decisive subspaces – the minimal combinations of factors that determine the (subspace) skyline membership of an object
•  Examples
–  Total rebounds for Chamberlain
–  For Jordan, (total points, total rebounds, total assists) and (games played, total points, total assists)
•  Details in [Pei et al., VLDB 2005]
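A simplified reading of this idea can be prototyped: for a given point, enumerate subspaces bottom-up and keep the minimal ones in which the point is a subspace-skyline member. The actual definition in [Pei et al., VLDB 2005] is more refined (it accounts for coincident points), so treat this as an illustrative approximation.

```python
from itertools import combinations

def in_subspace_skyline(p, points, dims):
    """Is p undominated when only the coordinates in dims count? (smaller is better)"""
    def dom(u, v):
        return (all(u[d] <= v[d] for d in dims)
                and any(u[d] < v[d] for d in dims))
    return not any(dom(q, p) for q in points if q != p)

def minimal_skyline_subspaces(p, points, n_dims):
    """Minimal subspaces in which p belongs to the subspace skyline."""
    found = []
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            if any(set(f) <= set(dims) for f in found):
                continue  # a subset already qualifies, so dims is not minimal
            if in_subspace_skyline(p, points, dims):
                found.append(dims)
    return found

pts = [(1, 3), (2, 2), (3, 1)]
print(minimal_skyline_subspaces((2, 2), pts, 2))  # -> [(0, 1)]
print(minimal_skyline_subspaces((1, 3), pts, 2))  # -> [(0,)]
```

Here (2, 2) needs both dimensions to stand out, while (1, 3) is already unbeatable on dimension 0 alone, mirroring the Chamberlain/Jordan contrast on the slide.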

Database & Data Mining Can Meet DB Extensions and Applications


•  Conceptually, computing skylines in all subspaces
•  Only computing skyline groups and their decisive subspaces
–  Concise representation, leading to fast algorithms
–  [Pei et al., ACM TODS 2006]
•  Improvement: borrowing frequent itemset mining techniques to speed up computation in high dimensional spaces [Pei et al., ICDE 2007]
•  Improving database query answering
–  Efficient skyline query answering in subspaces [Tao et al., ICDE 2006]
–  Effective summary of skyline: distance-based representative skyline [Tao et al., ICDE 2009]
•  Extensions in data types
–  Probabilistic skylines on uncertain data [Pei et al., VLDB 2007]
–  Interval skyline queries on time series [Jiang and Pei, ICDE 2009]


Dynamic User Preferences Personalized Recommendations


Different customers may have different preferences
Favorable Facet Mining Monotonicity of Partial Orders
•  A set of points in a multidimensional space
–  Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
–  (Categorical) Partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types
•  Some templates may apply, e.g., single houses > semi-detached houses
•  Favorable facets of a point p: the partial orders that make p in the skyline
•  Monotonicity: if p is not in the skyline with respect to partial order R, p is not in the skyline with respect to any partial order stronger than R
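Dominance with a partially ordered attribute only needs a per-dimension "better-than" relation. The sketch below uses hypothetical data and the hypothetical preference template single > semi-detached (abbreviated "semi"), with smaller-is-better price; it shows that strengthening the partial order can only shrink the skyline, as the monotonicity property states.

```python
# Each point: (price, property_type); price is fully ordered (smaller is better)
points = [(400, "single"), (450, "semi"), (430, "single")]

def dominates(u, v, po):
    """u dominates v given a partial order po = set of (preferred, less_preferred) pairs."""
    type_better = (u[1], v[1]) in po
    no_worse = u[0] <= v[0] and (u[1] == v[1] or type_better)
    strictly = u[0] < v[0] or type_better
    return no_worse and strictly

def skyline(pts, po):
    return [p for p in pts if not any(dominates(q, p, po) for q in pts if q != p)]

# No preference between types: the two types are incomparable
print(skyline(points, set()))                   # -> [(400, 'single'), (450, 'semi')]
# With the template single > semi, the semi is now dominated
print(skyline(points, {("single", "semi")}))    # -> [(400, 'single')]
```

The cheap semi-detached house is a skyline point only while the property types stay incomparable; once the order is strengthened it drops out and can never return, which is exactly the monotonicity that the MDC machinery on the next slides exploits.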

Minimal Disqualifying Conditions Skyline Warehouse on Preferences


•  For a point p, a most general partial order that disqualifies p in the skyline is a minimal disqualifying condition (MDC)
•  Any partial order stronger than an MDC cannot make p in the skyline
•  How to compute MDCs efficiently?
–  MDC-O: computing MDCs on the fly
–  MDC-M: materializing MDCs
–  Details in [Wong et al., KDD 2007]
•  Materializing all MDCs and precomputing skylines
–  Using an Implicit Preference Order tree (IPO-tree) index
•  Can answer skyline queries online with respect to any user preferences
•  Details in [Wong et al., VLDB 2008]

Learning User Preferences Mining Preferences from Examples


•  Realtors selling realties – a typical multi-criteria decision problem
–  User preferences on multiple dimensions: location, size, price, style, age, developer, …
–  Thousands of realties
•  How can a realtor learn a user's preferences on dimensions?
–  Give a user a short list of realties and ask the user to pick the ones (s)he is/is not interested in
–  An interesting realty – a skyline point in the short list
–  An uninteresting realty – a non-skyline point in the short list
•  Given a set of example points labeled skyline or non-skyline in a multidimensional space, can we learn the preferences on attributes?
–  Favorable facets are for one superior example only
•  Mining the minimal satisfying preference sets (SPS)
–  The simplest hypotheses that fit the superior and inferior examples
Learning Methods Multidimensional Analysis of Logs

•  Complexity
–  The SPS existence problem is NP-hard
–  The minimal SPS problem is NP-hard
•  A greedy approach
–  The term-based greedy algorithm
–  The condition-based greedy algorithm
–  Details in [Jiang et al., KDD'08]
•  Look-up: “What are the top-5 electronics that were most popularly searched by the users in the US in December, 2009?”
•  Reverse look-up: “What are the group-bys in time and region where Apple iPad was popularly searched for?”
•  Different users/applications may bear different concept hierarchies in mind in their multidimensional analysis

A Topic-Concept Cube Approach A Successful Case Study
