OLAP
Hector Garcia-Molina
Stanford University
Warehousing
Growing industry: $8 billion in 1998
Range from desktop to huge:
Walmart:
Outline
What is a data warehouse?
Why a warehouse?
Models & operations
Implementing a warehouse
Future directions
What is a Warehouse?
subject oriented
aimed at executive, decision maker
often a copy of operational data
with value-added data (e.g., summaries, history)
integrated
time-varying
non-volatile
What is a Warehouse?
Collection of tools
gathering data
cleansing, integrating, ...
querying, reporting, analysis
data mining
monitoring, administering warehouse
Warehouse Architecture
[Figure: Clients -> Query & Analysis -> Warehouse (with Metadata) <- Integration <- Sources]
Why a Warehouse?
Two Approaches:
Query-Driven (Lazy)
Warehouse (Eager)
Query-Driven Approach
[Figure: Clients -> Mediator -> one Wrapper per Source -> Sources]
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
Modify,
Advantages of Query-Driven
storage
no need to purchase data
OLTP vs. OLAP
OLTP
Mostly updates
Many small transactions
Mb-Tb of data
Raw data
Clerical users
Up-to-date data
Consistency, recoverability critical
OLAP
Mostly reads
Queries long, complex
Gb-Tb of data
Summarized, consolidated data
Decision-makers, analysts as users
Data Marts
Smaller warehouses
Spans part of organization
e.g.,
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other
Star
product
prodId  name  price
p1      bolt  10
p2      nut   5
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
store
storeId  city
c1       nyc
c2       sfo
c3       la
sale
custId  prodId  storeId  qty  amt
53      p1      c1       1    12
53      p2      c1       2    11
111     p1      c3       5    50
Star Schema
product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
store(storeId, city)
Terms
Fact table: sale
Dimension tables: product, customer, store
Measures: qty, amt
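The star schema above can be tried out directly. The following is a minimal sketch, assuming an in-memory SQLite database; the orderId and date columns are left out because the sample rows on the slide do not show them, and the query (total amt by store city and product name) is an illustrative choice, not one from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
-- Fact table: one foreign key per dimension, plus the measures qty and amt.
CREATE TABLE sale (
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT    REFERENCES product(prodId),
    storeId TEXT    REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO product  VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store    VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES (53, 'p1', 'c1', 1, 12),
                        (53, 'p2', 'c1', 2, 11),
                        (111, 'p1', 'c3', 5, 50);
""")

# Star join: the fact table is joined to the dimensions it references,
# then the measure is aggregated by dimension attributes.
star_join = """
SELECT s.city, p.name, SUM(f.amt)
FROM sale f
JOIN store s   ON f.storeId = s.storeId
JOIN product p ON f.prodId  = p.prodId
GROUP BY s.city, p.name
ORDER BY s.city, p.name
"""
rows = conn.execute(star_join).fetchall()
print(rows)
```

Every dimension is reached from the fact table in a single join step, which is what gives the schema its star shape.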
Dimension Hierarchies
store -> sType
store -> city -> region
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy
sType
tId  size   location
t1   small  downtown
t2   large  suburbs
city
cityId  pop  regId
sfo     1M   north
la      5M   south
region
regId  name
north  cold region
south  warm region
snowflake schema
constellations
Cube
Fact table view:
sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8
Multi-dimensional cube:
      c1  c2  c3
p1    12      50
p2    11  8
dimensions = 2
3-D Cube
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multi-dimensional cube:
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
dimensions = 3
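The step from the fact table view to the cube view can be sketched in a few lines. This is an illustrative sketch, assuming the cube is stored sparsely as a dictionary keyed by (prodId, storeId, date); the helper name cell is hypothetical, and empty cells are read as 0.

```python
from collections import defaultdict

# Fact table rows from the slide: (prodId, storeId, date, amt)
sale = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# Sparse 3-D cube: only non-empty cells are stored.
cube = defaultdict(int)
for prod, store, date, amt in sale:
    cube[(prod, store, date)] += amt

def cell(prod, store, date):
    """Return the amt in one cube cell, or 0 if the cell is empty."""
    return cube.get((prod, store, date), 0)

# Print one 2-D slice per day, products as rows and stores as columns.
for day in (1, 2):
    print("day", day)
    for prod in ("p1", "p2"):
        print(prod, [cell(prod, store, day) for store in ("c1", "c2", "c3")])
```

A dense array would also work for a cube this small; the sparse form is used here because most cells in a real cube are empty.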
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Result: 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
ans
date  sum
1     81
2     48
Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE
GROUP BY date, prodId
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
sale (aggregated)
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
rollup: from the detailed table to the aggregated one
drill-down: from the aggregated table back to the detail
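The two GROUP BY queries can be run side by side to see roll-up and drill-down as a change of grouping level. This is a small sketch, assuming an in-memory SQLite database loaded with the six sale rows from the slides; the ORDER BY clauses are added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Finer level: one row per (day, product).
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Roll-up: drop prodId from the grouping to get one row per day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(by_day_product)
print(by_day)
```

Going from by_day back to by_day_product is the drill-down direction: the same data, grouped on one more attribute.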
Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using dimension hierarchy
average
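Cube Aggregation
Example: computing sums
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
Sum over days:
        c1  c2  c3
p1      56  4   50
p2      11  8
Sum over days and products:
        c1  c2  c3
sum     67  12  50
Sum over days and stores:
        p1   p2
sum     110  19
Sum over all dimensions: 129
rollup: toward the coarser sums
drill-down: back toward the detailed cube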
Cube Operators
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
Sum over days:
        c1  c2  c3
p1      56  4   50
p2      11  8
Sum over days and products:
        c1  c2  c3
sum     67  12  50
Sum over days and stores:
        p1   p2
sum     110  19
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129
Extended Cube
day 1   c1  c2  c3  *
p1      12      50  62
p2      11  8       19
*       23  8   50  81
day 2   c1  c2  c3  *
p1      44  4       48
p2
*       44  4       48
*       c1  c2  c3  *
p1      56  4   50  110
p2      11  8       19
*       67  12  50  129
sale(*,p2,*) = 19
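One way to see where the "*" entries come from: every fact row contributes to each cell obtained by replacing any subset of its coordinates with "*". The sketch below assumes the coordinate order (store, product, date) used by the sale(...) notation on the slides, and stores the extended cube in a dictionary; it is an illustration, not a production cube engine.

```python
from collections import defaultdict
from itertools import product

# Fact rows in sale(store, product, date) order, with the measure last.
sale = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]

ext = defaultdict(int)
for store, prod, date, amt in sale:
    # Each coordinate is either kept or generalized to "*": 2**3 = 8 cells.
    for key in product((store, "*"), (prod, "*"), (date, "*")):
        ext[key] += amt

print(ext[("c1", "*", "*")])   # sale(c1,*,*)
print(ext[("c2", "p2", "*")])  # sale(c2,p2,*)
print(ext[("*", "p2", "*")])   # sale(*,p2,*)
print(ext[("*", "*", "*")])    # sale(*,*,*)
```

This enumerates all 2^d generalizations per row, which is fine for three dimensions but grows quickly as dimensions are added.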
Aggregation Using Hierarchies
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
customer -> region -> country
        region A  region B
p1      56        54
p2      11        8
(customer c1 in Region A;
customers c2, c3 in Region B)
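Rolling up along a hierarchy amounts to mapping each member to its parent before summing. A minimal sketch, assuming the customer-to-region assignment stated on the slide; the dictionary name region_of is hypothetical.

```python
from collections import defaultdict

# Fact rows: (customer, product, date, amt)
sale = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]

# One level of the hierarchy: customer -> region.
region_of = {"c1": "A", "c2": "B", "c3": "B"}

# Sum amt per (product, region), over all days.
by_region = defaultdict(int)
for cust, prod, date, amt in sale:
    by_region[(prod, region_of[cust])] += amt

for prod in ("p1", "p2"):
    print(prod, by_region[(prod, "A")], by_region[(prod, "B")])
```

A second mapping from region to country would give the next roll-up level in the same way.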
Pivoting
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multi-dimensional cube:
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
Pivoted view (product by store, summed over days):
        c1  c2  c3
p1      56  4   50
p2      11  8
Implementing a Warehouse
Monitoring: Sending data from sources
Integrating: Loading, cleansing,...
Processing: Query processing, indexing, ...
Managing: Metadata, Design, ...
Monitoring
Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, ...
Incremental vs. Refresh
customer
id   name   address    city
53   joe    10 main    sfo
81   fred   12 main    sfo
111  sally  80 willow  la
(the figure marks a row as "new")
Monitoring Techniques
Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application level monitoring
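Of the techniques listed, periodic snapshots are the easiest to illustrate: two successive dumps of a source table are compared to derive the incremental changes. The sketch below assumes each snapshot is a dictionary keyed by customer id; the function name diff_snapshots is hypothetical, and the changed address for customer 81 is made-up data added only to show an update.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, deletes, updates), where updates maps each key
    to an (old_row, new_row) pair.
    """
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {
        k: (old[k], v) for k, v in new.items() if k in old and old[k] != v
    }
    return inserts, deletes, updates

# Snapshot taken at the previous load.
old = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
}
# Current snapshot: customer 111 is new; 81's address change is hypothetical.
new = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "14 main", "sfo"),
    111: ("sally", "80 willow", "la"),
}

inserts, deletes, updates = diff_snapshots(old, new)
print(inserts, deletes, updates)
```

Only the three change sets need to be shipped to the warehouse, which is the point of incremental monitoring as opposed to a full refresh.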
Monitoring Issues
Frequency
periodic:
daily, weekly,
triggered: on big change, lots of changes, ...
Data transformation
convert
Integration
Data Cleaning
Data Loading
Derived Data
Data Cleaning
billing DB: customer1(Joe)
service DB: customer2(Joe)
merged_customer(Joe)
Loading Data
Incremental vs. refresh
Off-line vs. on-line
Frequency of loading
At
Parallel/Partitioned load
Derived Data
Materialized Views
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
product
id  name  price
p1  bolt  10
p2  nut   5
joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
Processing
ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms
ROLAP Server
Components: tools, utilities, ROLAP server, relational DBMS
sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48
MOLAP Server
Components: M.D. tools, utilities, multidimensional server
The multidimensional server could also sit on a relational DBMS
[Figure: cube with axes Product (milk, soda, eggs, soap), Date (2, 3, 4), and a third axis (A, B)]
Index Structures
Popular in Warehouses
inverted lists
bit map indexes
join indexes
text indexes
Inverted Lists
age index: 18, 19, 20, 21, 22, 23, 25, 26
inverted lists
20: r4, r18, r34, r35
21: r5, r19, r37, r40
data records
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...
Hector Garcia Molina: Data Warehousing and OLAP
Query: Get
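An inverted list maps each attribute value to the ids of the records holding it, so a conjunctive query becomes a list intersection. The sketch below builds two such indexes over the eight records shown on the slide; the query used (age = 20 and name = 'fred') is an assumed example, since the query text on the slide is cut off.

```python
from collections import defaultdict

# Data records from the slide: rId -> (name, age)
records = {
    "r4": ("joe", 20),
    "r18": ("fred", 20),
    "r19": ("sally", 21),
    "r34": ("nancy", 20),
    "r35": ("tom", 20),
    "r36": ("pat", 25),
    "r5": ("dave", 21),
    "r41": ("jeff", 26),
}

# Build one inverted index per attribute: value -> list of record ids.
age_index = defaultdict(list)
name_index = defaultdict(list)
for rid, (name, age) in records.items():
    age_index[age].append(rid)
    name_index[name].append(rid)

# Assumed query: age = 20 AND name = 'fred' -> intersect the two lists.
answer = sorted(set(age_index[20]) & set(name_index["fred"]))
print(age_index[20], answer)
```

Only the records in the intersection need to be fetched from the data file.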
Bit Maps
age index: 18, 19, 20, 21, 22, 23, 25, 26
bit maps (one bit per record, in id order)
20: 1 1 0 1 1 0 0 0 ...
21: 0 0 1 0 0 0 1 0 ...
data records
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...
Query: Get
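A bit map index keeps one bit vector per attribute value, with bit i set when record i has that value; conjunctions then reduce to a bitwise AND. This is a sketch using Python integers as bit vectors over the eight records shown; the helper names build_bitmaps and as_bits are hypothetical, and the query (age = 20 and name = 'fred') is an assumed example because the slide's query text is cut off.

```python
# Records in id order (id = position + 1): (name, age)
records = [
    ("joe", 20), ("fred", 20), ("sally", 21), ("nancy", 20),
    ("tom", 20), ("pat", 25), ("dave", 21), ("jeff", 26),
]

def build_bitmaps(values):
    """Return {value: int}, where bit i is set if values[i] == value."""
    bitmaps = {}
    for pos, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << pos)
    return bitmaps

def as_bits(bitmap, n):
    """Render the first n bits, record 1 first, as a string of 0s and 1s."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(n))

age_bm = build_bitmaps([age for _, age in records])
name_bm = build_bitmaps([name for name, _ in records])

# Assumed query: age = 20 AND name = 'fred' -> bitwise AND of two bit maps.
hits = age_bm[20] & name_bm["fred"]
matching_ids = [i + 1 for i in range(len(records)) if (hits >> i) & 1]

print(as_bits(age_bm[20], len(records)))
print(as_bits(age_bm[21], len(records)))
print(matching_ids)
```

Bit maps work best when the attribute has few distinct values, since one vector is kept per value.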
Join
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT
WHERE SALE.prodId = PRODUCT.id
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
product
id  name  price
p1  bolt  10
p2  nut   5
joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
Join Indexes
join index
product
id  name  price  jIndex
p1  bolt  10     r1,r3,r5,r6
p2  nut   5      r2,r4
sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
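The jIndex column can be derived from the sale table in one pass, after which the sales for a product are found without scanning. A minimal sketch, assuming both relations fit in dictionaries; the helper name sales_of is hypothetical.

```python
# Dimension table: id -> (name, price)
product = {"p1": ("bolt", 10), "p2": ("nut", 5)}

# Fact table: rId -> (prodId, storeId, date, amt)
sale = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}

# Join index: for each product id, the ids of the sale rows that join with it.
join_index = {pid: [] for pid in product}
for rid, (pid, _store, _date, _amt) in sale.items():
    join_index[pid].append(rid)

def sales_of(pid):
    """Fetch the sale rows for one product by following the join index."""
    return [sale[rid] for rid in join_index[pid]]

total_p2 = sum(amt for _pid, _store, _date, amt in sales_of("p2"))
print(join_index, total_p2)
```

The index precomputes the join once, trading extra storage and update work for faster queries.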
What to Materialize?
Store in warehouse results useful for common queries
Example: total sales
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
materialize:
Sum over days:
        c1  c2  c3
p1      56  4   50
p2      11  8
Sum over days and products:
        c1  c2  c3
sum     67  12  50
Sum over days and stores:
        p1   p2
sum     110  19
Sum over all dimensions: 129
Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost
Cube Aggregates Lattice
all
city    product    date
city, product    city, date    product, date
city, product, date
Example aggregates:
city:           c1 = 67, c2 = 12, c3 = 50
city, product:
        c1  c2  c3
p1      56  4   50
p2      11  8
city, product, date:
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
use greedy algorithm to decide what to materialize
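The slide does not say which greedy algorithm is meant. The following is a sketch of one common benefit-based heuristic, under stated assumptions: each lattice node has a known row count (the sizes below are hypothetical), a query on a node costs the size of the smallest materialized node that contains its attributes, the full cube is always materialized, and each round picks the node that saves the most total cost.

```python
dims = ("city", "product", "date")

# Hypothetical row counts for each aggregate; not taken from the slides.
size = {
    frozenset(dims): 100,
    frozenset({"city", "product"}): 50,
    frozenset({"city", "date"}): 80,
    frozenset({"product", "date"}): 90,
    frozenset({"city"}): 10,
    frozenset({"product"}): 20,
    frozenset({"date"}): 30,
    frozenset(): 1,  # the "all" node: a single grand total
}

def greedy_select(size, k):
    """Pick up to k extra views to materialize, one per round, by benefit."""
    top = max(size, key=len)  # the full cube, always materialized
    chosen = [top]

    def cost(view):
        # Cheapest materialized view that can answer queries on `view`.
        return min(size[m] for m in chosen if view <= m)

    def benefit(cand):
        # Total cost saved over every view that `cand` can answer.
        return sum(max(0, cost(w) - size[cand]) for w in size if w <= cand)

    for _ in range(k):
        candidates = [v for v in size if v not in chosen]
        if not candidates:
            break
        best = max(candidates, key=benefit)
        if benefit(best) <= 0:
            break
        chosen.append(best)
    return chosen

chosen = greedy_select(size, 2)
print([sorted(v) for v in chosen])
```

With these sizes the first round picks (city, product) and the second picks (date); different sizes or query frequencies would lead to different choices.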
Dimension Hierarchies
all -> state -> city
cities
city  state
c1    CA
c2    NY
Dimension Hierarchies
all
city
city, product
product
city, date
date
product, date
state
state, date
state, product
state, product, date
Interesting Hierarchy
time levels: all, years, quarters, months, weeks, days
conceptual dimension table
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000
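The conceptual dimension table can be generated rather than stored by hand. The sketch below assumes the numbering the table appears to use (day n of the year, weeks counted in blocks of seven days from day 1); the function name time_dimension is hypothetical. It also shows one plausible reason the hierarchy is called interesting: a week can span two months, so weeks do not roll up cleanly into months.

```python
from datetime import date, timedelta

def time_dimension(year, n_days):
    """Build rows of the time dimension for the first n_days of a year."""
    rows = []
    for n in range(1, n_days + 1):
        d = date(year, 1, 1) + timedelta(days=n - 1)
        rows.append({
            "day": n,
            "week": (n - 1) // 7 + 1,   # 7-day blocks starting at day 1
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
    return rows

rows = time_dimension(2000, 8)
for r in rows:
    print(r["day"], r["week"], r["month"], r["quarter"], r["year"])

# Week 5 covers days 29-35, which straddles January and February.
year_rows = time_dimension(2000, 366)
months_in_week5 = sorted({r["month"] for r in year_rows if r["week"] == 5})
print(months_in_week5)
```

Because of this, sums by week cannot be computed from sums by month; both have to be derived from the day level.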
Design
What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?
Tools
Development
Workflow Management
Warehouse Management