
2014-05-06

Data Cleaning and Integration

Data Quality
•  Accuracy
•  Completeness
•  Consistency
•  Timeliness
•  Believability
•  Interpretability

J. Pei: Big Data Analytics -- Data Cleaning and Integration 2

Data Preprocessing
•  Processing data before an analytic task
–  Improve data quality
–  Transform data to facilitate the target task
•  Major tasks
–  Data cleaning
–  Data integration
–  Data reduction
–  Data transformation

Data Cleaning
•  The process of detecting and correcting corrupt or inaccurate records in data
•  Handling missing values
•  Smoothing data

Handling Missing Values
•  Ignore records with missing values
•  Fill in missing values
–  Manually
–  Using a global constant
–  Using a measure of central tendency for the attribute, such as mean, median, or mode
–  Using the central tendency of the class
–  Using the most probable value

Disguised Missing Data?
•  Online forms
•  Disguised missing data consists of missing data entries that are not explicitly represented as such, but instead appear as potentially valid data values
–  Information about "State" is missing
–  "Alabama" is used as the disguise
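The central-tendency strategies above can be sketched with the Python standard library. This is a minimal illustration; the `impute` helper and its strategy names are mine, not part of any particular tool.

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries using a measure of central tendency of the
    observed (non-missing) values: mean, median, or mode."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:  # "mode": the most frequent observed value
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]

ages = [30, None, 20, 30, None, 20, 30]
mean_filled = impute(ages, "mean")   # observed values average to 26
mode_filled = impute(ages, "mode")   # most frequent observed value is 30
```

Class-conditional imputation (the "central tendency of the class" bullet) would apply the same idea per class label rather than over the whole column.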


Disguised Missing Data Is Misleading
•  Unreasonable results
•  Wrong conclusions

Types of Disguised Missing Data
•  Randomly choose a valid value as disguise
•  A small number of values are chosen as disguise

[Figure: two bar charts of the number of customers per state (Alabama, Ohio, Washington) – real values versus disguised missing values]

Problem Definition
•  Cleaning disguised missing data: given a table T with attributes A and an integer k, for each attribute Ai, output k candidates of frequently used disguise values
•  Examples
–  "Alabama" in "state"
–  "0" in "blood pressure"
–  "21" in "age"

Ideas
•  Observation 1: Frequently used disguises
–  A small number of values are frequently used as the disguises
•  Observation 2: Missing at random
–  Missing data are often distributed randomly – a random subset of the whole database

[Figure: bar chart of the number of customers per state (Alabama, Ohio, Washington)]

General Framework
•  For each attribute A
–  For each frequent value v in A
•  Compute the maximal embedded unbiased sample contained in Tv
–  Return the k values with the best (in both quality and size) embedded unbiased sample

Id  State    Age  Gender
1   Alabama  30   M
2   Alabama  30   M
3   Alabama  30   F
4   Alabama  20   F
5   Ohio     20   F
6   Ohio     20   F

Smoothing Noisy Data
•  Noise: a random error or variance in a measured variable
•  Smoothing noise – removing noise


Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Regression
[Figure: smoothing data by fitting a regression function]
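The binning example above can be reproduced with a few lines of Python. The helper names are illustrative, and the equal-frequency split assumes the data divides evenly into the requested number of bins.

```python
def equal_frequency_bins(sorted_vals, n_bins):
    """Partition already-sorted data into bins of equal size."""
    size = len(sorted_vals) // n_bins
    return [sorted_vals[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace each value by its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min and max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
```

Running `smooth_by_means(bins)` and `smooth_by_boundaries(bins)` reproduces the bin values shown on the slide.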

Outlier Analysis
[Figure: detecting outliers using clustering]

Data Cleaning as a Process
•  Data discrepancy detection
–  Use metadata (e.g., domain, range, dependency, distribution)
–  Check field overloading
–  Check uniqueness rule, consecutive rule, and null rule
–  Use commercial tools
•  Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
•  Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
•  Data migration and integration
–  Data migration tools: allow transformations to be specified
–  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
•  Integration of the two processes
–  Iterative and interactive (e.g., Potter's Wheel)

Data Integration
•  Combining data from multiple (autonomous and heterogeneous) sources
•  Providing a unified view
•  Why is data integration hard?
–  Systems challenges
–  Data logical organization challenges
–  Social and administrative challenges

Data Integration System Architecture
[Figure: http://en.wikipedia.org/wiki/File:Dataintegration.png]


Wrappers
•  Computer programs that extract content from a particular data source and transform it into a target form, such as a relational table
•  Example: CMS (content management system) wrapper

<html>
<head>
<title>%page_title%</title>
</head>
<body>
%page_content%
<P>
%page_powered_by%
</body>
</html>

How to Build Wrappers?
•  Manual construction
•  Machine learning based methods: learning schemas from training data
–  Supervised learning approaches
–  Unsupervised learning approaches
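A manually constructed wrapper for the CMS template above can be as simple as a regular expression: the fixed template text around each `%placeholder%` becomes the extraction anchor. This is a sketch; the `wrap` helper and the sample page are illustrative.

```python
import re

# The template pieces surrounding each placeholder serve as anchors.
TEMPLATE_RE = re.compile(
    r"<title>\s*(?P<page_title>.*?)</title>.*?"
    r"<body>\s*(?P<page_content>.*?)\s*<P>\s*(?P<page_powered_by>.*?)\s*</body>",
    re.DOTALL | re.IGNORECASE,
)

def wrap(html):
    """Extract a relational row (a dict) from a page built from the template."""
    m = TEMPLATE_RE.search(html)
    return m.groupdict() if m else None

page = """<html><head><title> My Page</title></head>
<body>Hello world
<P>
Powered by ExampleCMS
</body></html>"""
row = wrap(page)
```

Learning-based wrapper construction generalizes this idea by inducing the anchors from labeled (supervised) or unlabeled (unsupervised) example pages.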

Schema Matching and Mapping
•  Schema matching: finding the semantic correspondences between attributes in data sources and those in the mediated schema
–  Example: attribute name in source S1 corresponds to attributes firstname and surname in the mediated schema
–  Name-based matching
–  Instance-based matching
•  Schema mapping: transforming attribute values from sources to the mediated schema
–  Example: a query or a program extracting name values from source S1 and forming firstname and surname values for the mediated schema

Entity Detection and Recognition
•  Entity detection: identify atomic elements in text or other data and classify them into predefined categories such as person names, locations, organizations, etc.
•  Entity disambiguation: identify entities carrying the same name
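The schema-mapping example above (splitting a source's `name` into the mediated schema's `firstname` and `surname`) can be sketched as a one-line transformation; the split-on-first-space rule is an assumption for illustration.

```python
def map_name(row):
    """Map source schema (name, ...) to mediated schema (firstname, surname)."""
    first, _, last = row["name"].partition(" ")
    return {"firstname": first, "surname": last}

mediated = map_name({"name": "Ada Lovelace"})
```

Real mappings are rarely this clean (middle names, single-token names, ordering conventions), which is why mapping programs are usually written and validated per source.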

Example
[Figure lost in extraction]

Data Provenance
•  The data about how a data entry came to be
–  Also known as data lineage/pedigree
•  The annotation approach: a series of annotations describing how each data item was produced
•  The graph of data relationships approach: connecting sources and deriving new data items via mappings


Deep / Hidden Web
•  Sites that are difficult for a crawler to find
–  Probably over 100 times larger than the traditionally indexed web
•  Three major categories of sites in the deep web
–  Private sites – intentionally private; no incoming links, or login may be required
–  Form results – only accessible by entering data into a form, e.g., airline ticket queries
•  Hard to detect changes behind a form
–  Scripted pages – using JavaScript, Flash, or another client-side language in the web page
•  A crawler needs to execute the script – can slow down crawling significantly
•  The deep web is different from dynamic pages
–  Wikis generate web pages dynamically but are easy to crawl
–  Private sites are static but cannot be crawled

Multidimensional Analysis

Outline
•  Why multidimensional analysis?
•  Multidimensional analysis principle
•  OLAP
•  OLAP indexes

Jian Pei: Big Data Analytics -- Multidimensional Analysis 2

Dimensions
•  "An aspect or feature of a situation, problem, or thing; a measurable extent of some kind" – Dictionary
•  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
–  Objects are compared in selected dimensions/attributes
•  More often than not, objects have more dimensions/attributes than one is interested in and can handle

Multi-dimensional Analysis
•  Find interesting patterns in multi-dimensional subspaces
–  "Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)"
•  Different patterns may be manifested in different subspaces
–  Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction – a set of features for all objects
–  Different subspaces may manifest different patterns

OLAP
•  Conceptually, we may explore all possible subspaces for interesting patterns
–  What patterns are interesting?
–  How can we explore all possible subspaces systematically and efficiently?
–  Fundamental problems in analytics and data mining
•  Aggregates and group-bys are frequently used in data analysis and summarization

SELECT time, altitude, AVG(temp)
FROM weather GROUP BY time, altitude;

–  In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times, group-bys 20 times
•  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently

OLAP Operations
•  Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
–  (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
•  Drill down (roll down): reverse of roll-up, from higher-level summary to lower-level summary or detailed data, or introducing new dimensions

Other Operations
•  Dice: pick specific values or ranges on some dimensions
•  Pivot: "rotate" a cube – changing the order of dimensions in visual analysis
[Figure: http://en.wikipedia.org/wiki/File:OLAP_pivoting.png]

Relational Representation
•  If there are n dimensions, there are 2^n possible aggregation columns
•  Roll up by model, by year, by color in a table

Difficulties
•  Many group-bys are needed
–  6 dimensions → 2^6 = 64 group-bys
•  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!

Dummy Value ALL
[Figure lost in extraction]

DATA CUBE

The SALES base table:

Model  Year  Color  Sales
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  blue     62
Chevy  1991  red      54
Chevy  1991  white    95
Chevy  1991  blue     49
Chevy  1992  red      31
Chevy  1992  white    54
Chevy  1992  blue     71
Ford   1990  red      64
Ford   1990  white    62
Ford   1990  blue     63
Ford   1991  red      52
Ford   1991  white     9
Ford   1991  blue     55
Ford   1992  red      27
Ford   1992  white    62
Ford   1992  blue     39

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);

The DATA CUBE result adds ALL rows for every combination of aggregated-out dimensions (excerpt):

Model  Year  Color  Sales
Chevy  1990  blue     62
Chevy  1990  red       5
Chevy  1990  white    87
Chevy  1990  ALL     154
Chevy  ALL   blue    182
Chevy  ALL   ALL     508
ALL    1990  ALL     343
ALL    ALL   ALL     941

CUBE Semantics of ALL
•  ALL is a set
–  Model.ALL = ALL(Model) = {Chevy, Ford}
–  Year.ALL = ALL(Year) = {1990, 1991, 1992}
–  Color.ALL = ALL(Color) = {red, white, blue}
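The CUBE operator's semantics can be sketched in plain Python: every row contributes to the 2^n group-bys obtained by keeping or aggregating out each dimension. The `data_cube` helper is illustrative, computing SUM() and using "ALL" for aggregated-out dimensions.

```python
from collections import defaultdict
from itertools import combinations

def data_cube(rows, n_dims):
    """Materialize all 2^n_dims SUM() group-bys of (dim_1, ..., dim_n, measure) rows."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        for r in range(n_dims + 1):
            for kept in combinations(range(n_dims), r):
                # Dimensions not in `kept` are aggregated out as ALL.
                key = tuple(dims[i] if i in kept else "ALL" for i in range(n_dims))
                cube[key] += measure
    return dict(cube)

sales = [("Chevy", 1990, "red", 5), ("Chevy", 1990, "blue", 62),
         ("Ford", 1990, "red", 64)]
cube = data_cube(sales, 3)
```

Each input row lands in 2^3 = 8 cube cells here, which is exactly why naive cube materialization explodes with dimensionality.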

OLTP Versus OLAP

                    OLTP                             OLAP
users               clerk, IT professional           knowledge worker
function            day-to-day operations            decision support
DB design           application-oriented             subject-oriented
data                current, up-to-date, detailed,   historical, summarized,
                    flat relational, isolated        multidimensional, integrated,
                                                     consolidated
usage               repetitive                       ad-hoc
access              read/write, index/hash on        lots of scans
                    primary key
unit of work        short, simple transaction        complex query
# records accessed  tens                             millions
# users             thousands                        hundreds
DB size             100 MB – GB                      100 GB – TB
metric              transaction throughput           query throughput, response time

What Is a Data Warehouse?
•  "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." – W. H. Inmon
•  Data warehousing: the process of constructing and using data warehouses

Subject-Oriented
•  Organized around major subjects, such as customer, product, sales
•  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
•  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Integrated
•  Integrating multiple, heterogeneous data sources
–  Relational databases, flat files, on-line transaction records
•  Data cleaning and data integration
–  Ensuring consistency in naming conventions, encoding structures, attribute measures, etc., among different data sources
•  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
–  When data is moved to the warehouse, it is converted

Time Variant
•  The time horizon for the data warehouse is significantly longer than that of operational systems
–  Operational databases: current-value data
–  Data warehouse data: provide information from a historical perspective (e.g., past 5–10 years)
•  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
–  But the key of operational data may or may not contain a time element

Nonvolatile
•  A physically separate store of data transformed from the operational environment
•  Operational updates of data do not occur in the data warehouse environment
–  Do not require transaction processing, recovery, and concurrency control mechanisms
–  Require only two operations in data accessing
•  Initial loading of data
•  Access of data

Why Separate Data Warehouse?
•  High performance for both
–  Operational DBMS: tuned for OLTP
–  Warehouse: tuned for OLAP
•  Different functions and different data
–  Historical data: data analysis often uses historical data that operational databases do not typically maintain
–  Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources

Star Schema
[Figure: a Sales Fact Table (time_key, item_key, branch_key, location_key; measures units_sold, dollars_sold, avg_sales) linked to the dimension tables time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), and location (location_key, street, city, state_or_province, country)]

Snowflake Schema
[Figure: like the star schema, but with normalized dimension tables – item (item_key, item_name, brand, type, supplier_key) references supplier (supplier_key, supplier_type), and location (location_key, street, city_key) references city (city_key, city, province_or_state, country)]

Fact Constellation
[Figure: multiple fact tables share dimension tables – a Shipping Fact Table (item_key, time_key, shipper_key, from_location, to_location; measures dollars_cost, units_shipped) shares the time, item, and location dimensions with the Sales Fact Table, and references shipper (shipper_key, shipper_name, location_key, shipper_type)]

(Good) Aggregate Functions
•  Distributive: there is a function G() such that F({X_{i,j}}) = G({F({X_{i,j} | i = 1, ..., I}) | j = 1, ..., J})
–  Examples: COUNT(), MIN(), MAX(), SUM()
–  G = SUM() for COUNT()
•  Algebraic: there is an M-tuple-valued function G() and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i = 1, ..., I}) | j = 1, ..., J})
–  Examples: AVG(), standard deviation, MaxN(), MinN()
–  For AVG(), G() records the sum and count; H() adds up the sums and counts and divides to produce the global average

Holistic Aggregate Functions
•  There is no constant bound on the size of the storage needed to describe a sub-aggregate
–  There is no constant M such that an M-tuple characterizes the computation F({X_{i,j} | i = 1, ..., I})
•  Examples: Median(), MostFrequent() (also called Mode()), and Rank()
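The algebraic case for AVG() can be made concrete: G() summarizes each partition as a 2-tuple (sum, count), and H() merges the 2-tuples into the global average. A minimal sketch; the function names are mine.

```python
def partial_avg(chunk):
    """G(): summarize one partition as the 2-tuple (sum, count)."""
    return (sum(chunk), len(chunk))

def merge_avg(partials):
    """H(): combine the partial 2-tuples into the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

chunks = [[1, 2, 3], [4, 5], [6]]
global_avg = merge_avg([partial_avg(c) for c in chunks])
```

No such constant-size summary exists for Median() – any bounded tuple loses information needed to merge sub-medians – which is what makes it holistic.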

Index Requirements in OLAP
•  Data is read-only
–  (Almost) no insertions or deletions
•  Query types
–  Point query: looking up one specific tuple (rare)
–  Range query: returning the aggregate of a (large) set of tuples, with group-by
–  Complex queries: need specific algorithms and index structures, discussed later

OLAP Query Example
•  In a table (cust, gender, ...), find the total number of male customers
•  Method 1: scan the table once
•  Method 2: build a B+-tree index on attribute gender; we still need to access all tuples of male customers
•  Can we get the count without scanning many tuples, or even without accessing any tuples of male customers?

Bitmap Index
•  For n tuples, a bitmap index has n bits and can be packed into ⌈n/8⌉ bytes or ⌈n/32⌉ words
•  From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j

cust   gender  ...    bitmap for M
Jack   M       ...    1
Cathy  F       ...    0
...    ...     ...    ...
Nancy  F       ...    0

Using Bitmap to Count
•  shcount[] contains the number of set bits in the entry subscript
–  shcount[01100101] = 4

count = 0;
for (i = 0; i < SHNUM; i++)
    count += shcount[B[i]];
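The same idea in Python, using an arbitrary-precision integer as the bitmap and a population count in place of the shcount[] lookup table. The helper names are illustrative.

```python
def build_bitmap(values, match):
    """One bit per tuple: bit j is set iff tuple j has the given value."""
    bm = 0
    for j, v in enumerate(values):
        if v == match:
            bm |= 1 << j
    return bm

genders = ["M", "F", "F", "M", "M", "F"]
male = build_bitmap(genders, "M")      # bits 0, 3, 4 set
female = build_bitmap(genders, "F")    # bits 1, 2, 5 set
count_male = bin(male).count("1")      # population count, like shcount[]
```

The bitmaps also compose with bitwise logic, e.g., `male & over_40` answers a conjunctive predicate without touching the tuples.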

Advantages of Bitmap Index
•  Efficient in space
•  Ready for logic composition
–  C = C1 AND C2
–  Bitmap operations can be used
•  Bitmap index only works for categorical data with low cardinality
–  Naively, we need 50 bits per entry to represent the state of a customer in the US
–  How can we represent a sale amount in dollars?

Bit-Sliced Index
•  A sale amount can be written as an integer number of pennies, and then represented as a binary number of N bits
–  24 bits is good for up to $167,772.15, appropriate for many stores
•  A bit-sliced index is N bitmaps
–  Tuple j is set in bitmap k if the k-th bit of its binary representation is on
–  The space cost of a bit-sliced index is the same as storing the data directly

Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
–  The tuples satisfying C are identified by a bitmap B
•  Direct access to rows to calculate SUM: scan the whole table once
•  B+-tree: find the tuples from the tree
•  Projection index: only scan attribute sales
•  Bit-sliced index: get the sum as Σ_k COUNT(B AND B_k) * 2^k

Cost Comparison
•  The traditional value-list index (B+-tree) is costly in both I/O and CPU time
–  Not good for OLAP
•  The bit-sliced index is efficient in I/O
•  Other case studies in [O'Neil and Quass, SIGMOD '97]
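The bit-sliced SUM formula above can be sketched directly: build one bitmap per bit position, then weight the per-slice population counts by powers of two. The helper names are mine.

```python
def bit_slices(amounts, n_bits):
    """Bitmap B_k has bit j set iff bit k of amounts[j] is on."""
    slices = []
    for k in range(n_bits):
        bm = 0
        for j, a in enumerate(amounts):
            if (a >> k) & 1:
                bm |= 1 << j
        slices.append(bm)
    return slices

def sliced_sum(slices, selection):
    """SUM over the rows selected by bitmap `selection`:
    sum_k 2^k * COUNT(selection AND B_k)."""
    return sum((1 << k) * bin(selection & bk).count("1")
               for k, bk in enumerate(slices))

amounts = [5, 3, 12, 7]            # e.g., sales in pennies
slices = bit_slices(amounts, 4)
total = sliced_sum(slices, 0b1111)  # all four rows selected
```

Only N bitmap scans are needed regardless of how many rows the selection covers, which is the I/O advantage the slide refers to.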

Horizontal or Vertical Storage
•  A fact table for data warehousing is often fat
–  Tens or even hundreds of dimensions/attributes
•  A query is often about only a few attributes
•  Horizontal storage: tuples are stored one by one
•  Vertical storage: tuples are stored by attributes

A1  A2  ...  A100
x1  x2  ...  x100
... ...  ...  ...
z1  z2  ...  z100

Horizontal Versus Vertical
•  Find the information of tuple t
–  Typical in OLTP
–  Horizontal storage: get the whole tuple in one search
–  Vertical storage: search 100 lists
•  Find SUM(a100) GROUP BY {a22, a83}
–  Typical in OLAP
–  Horizontal storage (no index): scan all tuples, O(100n), where n is the number of tuples
–  Vertical storage: search 3 lists, O(3n) – 3% of the cost of the horizontal method
•  Projection index: vertical storage

Rolling-up/Drilling-down Analysis
•  Roll up by model, by year, by color
•  The result is not a table: many NULL values, no key
[Figure: pivoting the roll-up result into cubes]

Extending GROUP BY

SELECT Manufacturer, Year, Month, Day, Color, Model, SUM(price) AS Revenue
FROM Sales
GROUP BY Manufacturer,
  ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day,
  CUBE Color, Model;

–  Manufacturer: plain group-by; Year, Month, Day: roll-up; Color × Model: cube

MOLAP
[Figure: the data cube stored as a 3-dimensional array, with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum)]

Pros and Cons
•  Easy to implement
•  Fast retrieval
•  Many entries may be empty if data is sparse
•  Costly in space

ROLAP – Data Cube in Table
•  A multi-dimensional database

Base table:
Store  Product  Season  Sales
S1     P1       Spring  6
S1     P2       Spring  12
S2     P1       Fall    9

Cubing produces the cube table (excerpt):
Store  Product  Season  AVG(Sales)
S1     P1       Spring  6
S1     P2       Spring  12
S2     P1       Fall    9
S1     *        Spring  9
...    ...      ...     ...
*      *        *       9

Observations
•  Once a base table (A, B, C) is sorted by A-B-C, the aggregates (*,*,*), (A,*,*), (A,B,*), and (A,B,C) can be computed with one scan and 4 counters
•  To compute the other aggregates, we can sort the base table in other orders

How to Sort the Base Table?
•  General sorting in main memory: O(n log n)
•  Counting in main memory: O(n), linear in the number of tuples in the base table
–  How to sort 1 million integers in the range 1 to 100?
–  Set up 100 counters, initialized to 0
–  Scan the integers once, counting the occurrences of each value in 1 to 100
–  Scan the integers again, putting each integer in its right place
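The counting idea above is counting sort. A minimal sketch (the function name and default bounds are illustrative):

```python
def counting_sort(nums, lo=1, hi=100):
    """Sort integers in [lo, hi] in O(n) time with one counter per value."""
    counts = [0] * (hi - lo + 1)
    for x in nums:                      # first scan: count occurrences
        counts[x - lo] += 1
    out = []
    for v, c in enumerate(counts):      # second scan: emit values in order
        out.extend([v + lo] * c)
    return out
```

This beats O(n log n) comparison sorting precisely because dimension values in a base table come from a small known domain.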

Iceberg Cube
•  In a data cube, many aggregate cells are trivial
–  Having an aggregate too small
•  Iceberg query: compute only the cells whose aggregates satisfy a given (iceberg) condition, such as COUNT(*) >= 100

Monotonic Iceberg Condition
•  If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
•  For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
–  (a, b, *) is an ancestor of (a, b, c)
•  An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can honor P

Pushing Monotonic Conditions
•  BUC searches the aggregates bottom-up in a depth-first manner
•  Only when a monotonic condition holds should the descendants of the current node be expanded

How to Push Non-Monotonic Ones?
•  The condition P(c) = AVG(price) >= 800 AND COUNT(*) >= 50 is not monotonic
•  BUC cannot push such a constraint

Ideas
•  Let AVG^k(price) be the average of the top-k tuples
•  AVG^k(price) >= 800 is a monotonic condition
–  If the top-10 average of (Vancouver, *, *) is less than 800, the top-10 average of (Vancouver, laptop, *) cannot be 800 or more
•  AVG^k(price) >= 800 can be a filter for AVG(price) >= 800
–  If AVG^k(price) < 800, then AVG(price) < 800
–  Generally, AVG() <= AVG^k()

Minimal Cubing
•  Compute only a shell of a data cube
–  Only compute and materialize low-dimensional cuboids, of dimensionality < k (k << n)
–  Save space and cubing time
•  Index the shell cells as well as their covers – the tuples contributing to the shell cells
•  Query answering
–  Use the shell cells and their intersections to compute the non-materialized cells
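The top-k average filter above is easy to sketch: AVG^k takes the k largest values, so it can only shrink as a cell is refined, making it a sound pruning bound for AVG(). The helper name is mine.

```python
from heapq import nlargest

def topk_avg(prices, k=10):
    """AVG^k: average of the k largest values (all values if fewer than k)."""
    top = nlargest(k, prices)
    return sum(top) / len(top)

cell_prices = [900, 850, 100, 120, 90]
# If topk_avg(cell_prices, k) < 800, then AVG < 800 for this cell and all
# of its descendants, so the whole subtree can be pruned in BUC.
can_prune = topk_avg(cell_prices, 10) < 800
```

Here the top-10 average equals the plain average (only 5 tuples) and is below 800, so the cell and its descendants are pruned for the condition AVG(price) >= 800.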

A Data Cube Is Often Huge
•  10 dimensions, cardinality 20 for each dimension → 21^10 = 16,679,880,978,201 possible tuples in the cube (21 values per dimension, counting ALL)
•  Even if only 1/1,000 of the possible tuples are non-empty, there are still more than 16 billion tuples

Compression of Data Cubes
•  Traditional compression methods, e.g., zip
–  High compression ratio
–  The compressed data cannot be queried directly
•  Requirements for data cube compression
–  The compressed cube can be queried efficiently
–  High compression ratio
•  Lossless compression and lossy compression

Redundancy in Data Cube
•  A base table with only one tuple (a1, ..., a100, 1000) and aggregate function SUM()
–  The data cube contains 2^100 tuples!
–  Every query about SUM() returns 1000
•  A data cube or a sub-cube may be populated by a single tuple – a base single tuple

A Little More General Case
•  A base table with two tuples, t1 = (a1, a2, b3, b4, 100) and t2 = (a1, a2, c3, c4, 1000), and aggregate function SUM()
•  (a1, a2, *, *), (a1, *, *, *), (*, a2, *, *), and (*, *, *, *) all have sum 1100, since they are populated by the same group of tuples {t1, t2} – base group tuples
•  We do not need to pre-compute and store all aggregates

Semantic Compression
•  Can we summarize a data cube so that the summarization can be browsed and understood effectively?
–  The summarization itself is a compression
–  The compression preserves the roll-up/drill-down relation
–  Directly query-able and browse-able for OLAP
•  Syntactic compression
–  Does not preserve the roll-up/drill-down semantics
–  Directly query-able for some queries, but may not be directly browse-able for OLAP

Cube Cell Lattice
•  Observation: many cells may have the same aggregate values
•  Can we summarize the semantics of the cube by grouping cells by aggregate values?

[Figure: the cube cell lattice of the running example –
top level: (S1,P1,s):6  (S1,P2,s):12  (S2,P1,f):9
middle level: (S1,*,s):9  (S1,P1,*):6  (*,P1,s):6  (S1,P2,*):12  (*,P2,s):12  (S2,*,f):9  (S2,P1,*):9  (*,P1,f):9
lower level: (S1,*,*):9  (*,*,s):9  (*,P1,*):7.5  (*,P2,*):12  (*,*,f):9  (S2,*,*):9
bottom: (*,*,*):9]

A Naïve Attempt
•  Put all cells with the same aggregate value into one class
•  The result is not a lattice anymore!
–  Anomaly: the roll-up/drill-down semantics is lost
[Figure: the lattice with cells grouped purely by aggregate value into classes C1–C4]

A Better Partitioning
•  Quotient cube: a partitioning preserving the roll-up/drill-down semantics
[Figure: the same lattice partitioned into classes C1–C5, each class convex and connected]

Why Semantic Compression Useful?
•  OLAP browsing
[Figure: browsing the quotient cube by expanding a class (e.g., C3, the cells with aggregate 9 around (S2, P1, f)) into its member cells]

Goals
•  Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that
–  The partition generates a reduced lattice preserving the roll-up/drill-down semantics
–  The partition is optimal: the number of classes is as small as possible
•  Compute, index, and store quotient cubes efficiently to answer OLAP queries

Why Equivalent Aggregate Values?
•  Two cells have equivalent aggregate values if they cover the same set of tuples in the base table
[Figure: the cube cell lattice of the running example, with each cell linked to the base-table tuples it covers]

10
Cover Partition Cover Partitions & Aggregates
•  For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c
–  E.g., Cov(S1,*,spring) = {(S1,P1,spring), (S1,P2,spring)}

   Dimensions               Measure
   Store  Product  Season   Sales
   S1     P1       Spring   6
   S1     P2       Spring   12
   S2     P1       Fall     9

•  All cells in a cover partition carry the same aggregate value with respect to any aggregate function
–  But cells in a class of MIN() may have different covers
•  For COUNT() and SUM() (positive), cover equivalence coincides with aggregate equivalence

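The MIN() caveat can be checked directly. A minimal sketch over a tiny hypothetical two-attribute table (not from the slides): two cells can agree on MIN yet cover different tuple sets, so aggregate equivalence under MIN() does not imply cover equivalence.

```python
# Hypothetical base table: (Store, Product) -> Sales
base = [(("S1", "P1"), 6), (("S1", "P2"), 12)]

def cover(cell):
    """Base tuples that roll up to the cell ('*' aggregates a dimension)."""
    return frozenset(t for t, _ in base
                     if all(c == "*" or c == v for c, v in zip(cell, t)))

def agg_min(cell):
    """MIN(Sales) over the cell's cover."""
    return min(m for t, m in base if t in cover(cell))

a, b = ("S1", "P1"), ("S1", "*")
print(agg_min(a), agg_min(b))   # both 6
print(cover(a) == cover(b))     # False: same MIN, different covers
```

This is exactly why the slide singles out COUNT() and SUM() over positive measures: for those, equal aggregates force equal covers, while MIN() (and MAX()) do not.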

Quotient Cube Multi-Criteria Decision Problems


•  A quotient cube is a quotient lattice of the cube lattice such that
–  Each class is convex and connected
–  All cells in a class carry the identical aggregate value w.r.t. a given aggregate function
•  Quotient cube preserves the roll-up / drill-down semantics
•  Multidimensional decision problems have a long history – more than 2300 years
•  Multidimensional decision problems are often challenging

Skyline – Best Tradeoffs Skyline: Formal Definition


•  Two dimensions: distance to water and height
•  Skyline: the buildings that are not dominated by any other buildings in both dimensions
[Figure: a city waterfront skyline, with SFU Harbor Center labeled]
•  A set of objects S in an n-dimensional space D = (D1, …, Dn)
–  Numeric dimensions for illustration in this talk
•  For u, v ∈ S, u dominates v if
–  u is better than v in one dimension, and
–  u is not worse than v in any other dimensions
–  For illustration in this talk, the smaller the better
•  u ∈ S is a skyline object if u is not dominated by any other objects in S
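The dominance test above translates directly into code. A minimal block-nested-loops-style sketch, assuming (as the talk does) that smaller is better on every dimension:

```python
def dominates(u, v):
    """u dominates v: no worse anywhere, strictly better somewhere (smaller is better)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(points):
    """Quadratic scan: keep exactly the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (price, travel time) tuples: (3, 1), (1, 4), (2, 2) trade off;
# (4, 4) is dominated by (2, 2)
pts = [(3, 1), (1, 4), (4, 4), (2, 2)]
print(skyline(pts))  # -> [(3, 1), (1, 4), (2, 2)]
```

The quadratic scan is only a sketch; the index-based and external algorithms surveyed later exist precisely because this approach does not scale to large databases.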
Example Skyline Computation

[Figure: a scatter plot of price vs. travel time with points u and v; the skyline points are highlighted]
•  First investigated as the maximum vector problem in [Kung et al., JACM 1975]
–  An O(n log^(d-2) n) time algorithm for d ≥ 4 and an O(n log n) time algorithm for d = 2 and 3
–  Divide-and-conquer-based methods: DD&C, LD&C, FLET
•  Skyline computation in database context
–  Data cannot be held in main memory
–  External algorithms

Skyline Computation on Large DB Full Space Skyline Is Not Enough!


•  A rule of thumb in database research – scalability on large databases
•  Index-based methods
–  Using bitmaps and the relationships between the skyline and the minimum coordinates of individual points, by Tan et al.
–  Using nearest-neighbor search by Kossmann et al.
–  The progressive branch-and-bound method by Papadias et al.
•  Index-free methods
–  Divide-and-conquer and block nested loops by Borzsonyi et al.
–  Sort-first-skyline (SFS) by Chomicki et al.
•  Skylines in subspaces
–  Skyline in space (# stops, price, travel-time)
–  If one does not care about # stops, how can we derive the superior trade-offs between price and travel-time from the full space skyline?
•  Sky cube – computing skylines in all non-empty subspaces (Yuan et al., VLDB'05)
–  A database/data warehousing approach
–  Any subspace skyline query can be answered (efficiently)
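The sky cube idea can be sketched in a few lines: restrict the dominance test to a chosen subset of dimensions, then compute the skyline of every non-empty subspace. This naive enumeration is exponential in dimensionality, so it is only illustrative, not the materialization strategy of Yuan et al.

```python
from itertools import combinations

def skyline_in_subspace(points, dims):
    """Skyline w.r.t. only the coordinates in dims (smaller is better)."""
    def dom(u, v):
        return (all(u[d] <= v[d] for d in dims)
                and any(u[d] < v[d] for d in dims))
    return [p for p in points if not any(dom(q, p) for q in points if q != p)]

def sky_cube(points, n_dims):
    """Skylines of all non-empty subspaces of an n_dims-dimensional space."""
    cube = {}
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            cube[dims] = skyline_in_subspace(points, dims)
    return cube

# Three hypothetical points that mutually trade off in the full space
pts = [(1, 3, 2), (2, 1, 3), (3, 2, 1)]
cube = sky_cube(pts, 3)
print(cube[(0,)])       # -> [(1, 3, 2)]  skyline on dimension 0 alone
print(cube[(0, 1, 2)])  # -> all three points: the full space skyline
```

Note the slide's motivating point shows up here: the subspace skylines are not simply projections or subsets of the full space skyline, which is why they must be computed (or cleverly summarized) rather than derived for free.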


Sky Cube Understanding Skylines


•  Both Wilt Chamberlain and Michael Jordan
are in the full space skyline of the Great
NBA Players
•  Data mining/exploration-driven questions
–  Which merits, respectively, really make them
outstanding?
–  How are they different?

Redundancy in Sky Cube Mining Decisive Subspaces

Does it just happen that skylines in multiple subspaces are identical?
•  Decisive subspaces – the minimal combinations of factors that determine the (subspace) skyline membership of an object
•  Examples
–  Total rebounds for Chamberlain
–  For Jordan, (total points, total rebounds, total assists) and (games played, total points, total assists)
•  Details in [Pei et al., VLDB 2005]
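A simplified reading of this idea can be prototyped: for a given point, enumerate subspaces bottom-up and keep the minimal ones in which the point is a subspace-skyline member. The actual definition in [Pei et al., VLDB 2005] is more refined (it accounts for coincident points), so treat this as an illustrative approximation.

```python
from itertools import combinations

def in_subspace_skyline(p, points, dims):
    """Is p undominated when only the coordinates in dims count? (smaller is better)"""
    def dom(u, v):
        return (all(u[d] <= v[d] for d in dims)
                and any(u[d] < v[d] for d in dims))
    return not any(dom(q, p) for q in points if q != p)

def minimal_skyline_subspaces(p, points, n_dims):
    """Minimal subspaces in which p belongs to the subspace skyline."""
    found = []
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            if any(set(f) <= set(dims) for f in found):
                continue  # a subset already qualifies, so dims is not minimal
            if in_subspace_skyline(p, points, dims):
                found.append(dims)
    return found

pts = [(1, 3), (2, 2), (3, 1)]
print(minimal_skyline_subspaces((2, 2), pts, 2))  # -> [(0, 1)]
print(minimal_skyline_subspaces((1, 3), pts, 2))  # -> [(0,)]
```

Here (2, 2) needs both dimensions to stand out, while (1, 3) is already unbeatable on dimension 0 alone, mirroring the Chamberlain/Jordan contrast on the slide.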

Database & Data Mining Can Meet DB Extensions and Applications


•  Conceptually, computing skylines in all subspaces
•  Only computing skyline groups and their decisive subspaces
–  Concise representation, leading to fast algorithms
–  [Pei et al., ACM TODS 2006]
•  Improvement: borrowing frequent itemset mining techniques to speed up computation in high dimensional spaces [Pei et al., ICDE 2007]
•  Improving database query answering
–  Efficient skyline query answering in subspaces [Tao et al., ICDE 2006]
–  Effective summary of skyline: distance-based representative skyline [Tao et al., ICDE 2009]
•  Extensions in data types
–  Probabilistic skylines on uncertain data [Pei et al., VLDB 2007]
–  Interval skyline queries on time series [Jiang and Pei, ICDE 2009]


Dynamic User Preferences Personalized Recommendations


Different customers may have different preferences
Favorable Facet Mining Monotonicity of Partial Orders
•  A set of points in a multidimensional space
–  Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
–  (Categorical) Partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types
•  Some templates may apply, e.g., single houses > semi-detached houses
•  Favorable facets of a point p: the partial orders that make p in the skyline
•  Monotonicity: if p is not in the skyline with respect to partial order R, p is not in the skyline with respect to any partial order stronger than R
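Dominance with a partially ordered attribute only needs a per-dimension "better-than" relation. The sketch below uses hypothetical data and the hypothetical preference template single > semi-detached (abbreviated "semi"), with smaller-is-better price; it shows that strengthening the partial order can only shrink the skyline, as the monotonicity property states.

```python
# Each point: (price, property_type); price is fully ordered (smaller is better)
points = [(400, "single"), (450, "semi"), (430, "single")]

def dominates(u, v, po):
    """u dominates v given a partial order po = set of (preferred, less_preferred) pairs."""
    type_better = (u[1], v[1]) in po
    no_worse = u[0] <= v[0] and (u[1] == v[1] or type_better)
    strictly = u[0] < v[0] or type_better
    return no_worse and strictly

def skyline(pts, po):
    return [p for p in pts if not any(dominates(q, p, po) for q in pts if q != p)]

# No preference between types: the two types are incomparable
print(skyline(points, set()))                   # -> [(400, 'single'), (450, 'semi')]
# With the template single > semi, the semi is now dominated
print(skyline(points, {("single", "semi")}))    # -> [(400, 'single')]
```

The cheap semi-detached house is a skyline point only while the property types stay incomparable; once the order is strengthened it drops out and can never return, which is exactly the monotonicity that the MDC machinery on the next slides exploits.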

Minimal Disqualifying Conditions Skyline Warehouse on Preferences


•  For a point p, a most general partial order that disqualifies p in the skyline is a minimal disqualifying condition (MDC)
•  Any partial order stronger than an MDC cannot make p in the skyline
•  How to compute MDCs efficiently?
–  MDC-O: computing MDCs on the fly
–  MDC-M: materializing MDCs
–  Details in [Wong et al., KDD 2007]
•  Materializing all MDCs and precomputing skylines
–  Using an Implicit Preference Order tree (IPO-tree) index
•  Can answer skyline queries online with respect to any user preferences
•  Details in [Wong et al., VLDB 2008]

Learning User Preferences Mining Preferences from Examples


•  Realtors selling realties – a typical multi-criteria decision problem
–  User preferences on multiple dimensions: location, size, price, style, age, developer, …
–  Thousands of realties
•  How can a realtor learn a user's preferences on dimensions?
–  Give a user a short list of realties and ask the user to pick the ones (s)he is/is not interested in
–  An interesting realty – a skyline point in the short list
–  An uninteresting realty – a non-skyline point in the short list
•  Given a set of example points labeled skyline or non-skyline in a multidimensional space, can we learn the preferences on attributes?
–  Favorable facets are for one superior example only
•  Mining the minimal satisfying preference sets (SPS)
–  The simplest hypotheses that fit the superior and inferior examples
Learning Methods Multidimensional Analysis of Logs

•  Complexity
–  The SPS existence problem is NP-hard
–  The minimal SPS problem is NP-hard
•  A greedy approach
–  The term-based greedy algorithm
–  The condition-based greedy algorithm
–  Details in [Jiang et al., KDD'08]
•  Look-up: “What are the top-5 electronics that were most popularly searched by the users in the US in December, 2009?”
•  Reverse look-up: “What are the group-bys in time and region where Apple iPad was popularly searched for?”
•  Different users/applications may bear different concept hierarchies in mind in their multidimensional analysis

A Topic-Concept Cube Approach A Successful Case Study
