Data Warehousing and OLAP
Hector Garcia-Molina
Stanford University

Warehousing
Growing industry: $8 billion in 1998
Range from desktop to huge:
  Walmart: 900-CPU, 2,700 disk, 23TB Teradata system
Lots of buzzwords, hype:
  slice & dice, rollup, MOLAP, pivot, ...

Hector Garcia Molina: Data Warehousing and OLAP

Outline
What is a data warehouse?
Why a warehouse?
Models & operations
Implementing a warehouse
Future directions


What is a Warehouse?

Collection of diverse data
  subject oriented
  aimed at executive, decision maker
  often a copy of operational data
  with value-added data (e.g., summaries, history)
  integrated
  time-varying
  non-volatile

What is a Warehouse?

Collection of tools
  gathering data
  cleansing, integrating, ...
  querying, reporting, analysis
  data mining
  monitoring, administering warehouse

Warehouse Architecture
[Diagram, top to bottom: Client, Client; Query & Analysis; Warehouse (with Metadata); Integration; Source, Source, Source]

Why a Warehouse?

Two Approaches:
  Query-Driven (Lazy)
  Warehouse (Eager)

[Diagram: a "?" box above Source, Source]

Query-Driven Approach

[Diagram, top to bottom: Client, Client; Mediator; Wrapper, Wrapper, Wrapper (one per source); Source, Source, Source]

Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse

  Modify, summarize (store aggregates)
  Add historical information

Advantages of Query-Driven

No need to copy data
  less storage
  no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

OLTP vs. OLAP

OLTP: On Line Transaction Processing
  Describes processing at operational sites
OLAP: On Line Analytical Processing
  Describes processing at warehouse

OLTP vs. OLAP


OLTP                                    OLAP
Mostly updates                          Mostly reads
Many small transactions                 Queries long, complex
MB-TB of data                           GB-TB of data
Raw data                                Summarized, consolidated data
Clerical users                          Decision-makers, analysts as users
Up-to-date data
Consistency, recoverability critical

Data Marts
Smaller warehouses
Span part of organization
  e.g., marketing (customers, products, sales)
Do not require enterprise-wide consensus
  but long term integration problems?

Warehouse Models & Operators

Data Models
  relations
  stars & snowflakes
  cubes
Operators
  slice & dice
  roll-up, drill down
  pivoting
  other

Star
product
prodId  name  price
p1      bolt  10
p2      nut   5

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

store
storeId  city
c1       nyc
c2       sfo
c3       la

Star Schema

product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
store(storeId, city)
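The star schema can be tried out with a small in-memory database. The following is a minimal sketch, assuming SQLite through Python's standard sqlite3 module; the column types and the foreign-key declarations are assumptions, since the slides give only the column names and sample rows.

```python
import sqlite3

# In-memory database; column types are assumed, not taken from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- The central table: one row per sale, with a key into each surrounding table.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT    REFERENCES product(prodId),
    storeId TEXT    REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# A typical star query: join the central table to one surrounding table
# and aggregate a numeric column.
amt_by_product = conn.execute("""
    SELECT p.name, SUM(s.amt)
    FROM sale s JOIN product p ON s.prodId = p.prodId
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(amt_by_product)
```

Every analytical query follows the same shape: start from sale, join outward to whichever surrounding tables supply the grouping attributes, then aggregate.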

Terms
Fact table: sale
Dimension tables: product, customer, store
Measures: qty, amt

product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
store(storeId, city)

Dimension Hierarchies
Hierarchy: store -> sType; store -> city -> region

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

snowflake schema
constellations

Cube
Fact table view:

sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube:

      c1   c2   c3
p1    12        50
p2    11    8

dimensions = 2

3-D Cube
Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:

day 1     c1   c2   c3
    p1    12        50
    p2    11    8

day 2     c1   c2   c3
    p1    44    4
    p2

dimensions = 3
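The two views hold the same information: each fact-table row becomes one cell of the cube, addressed by its dimension values. A minimal sketch of that mapping, assuming a plain Python dictionary as the (sparse) cube; the function name build_cube is hypothetical.

```python
# Fact table rows from the slide: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

def build_cube(rows):
    """Map each (prodId, storeId, date) coordinate to its total amt."""
    cube = {}
    for prod_id, store_id, date, amt in rows:
        key = (prod_id, store_id, date)
        cube[key] = cube.get(key, 0) + amt
    return cube

cube = build_cube(SALE)
print(cube[("p1", "c1", 2)])       # cell for product p1, store c1, day 2
print(cube.get(("p2", "c3", 1)))   # empty cells are simply absent (None)
```

Only the non-empty cells are stored here, which matters because real cubes are usually sparse.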

ROLAP vs. MOLAP


ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result: 81

Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

ans
date  sum
1     81
2     48

Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result:

sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

rollup: move to a coarser, more aggregated view
drill-down: move back to a more detailed view
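The three aggregate queries from these slides can be run as-is against the sample fact table. A minimal sketch, assuming SQLite through Python's sqlite3 module; the column types and the ORDER BY clauses (added only to make the output order deterministic) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date").fetchall()

# Add up amounts by day and product (a drill-down from the by-day totals).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()

print(day1_total)       # 81
print(by_day)           # [(1, 81), (2, 48)]
print(by_day_product)   # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```

Going from by_day_product to by_day is a rollup (the prodId grouping column is dropped); the reverse direction is a drill-down.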

Aggregates
Operators: sum, count, max, min, median, ave
Having clause
Using dimension hierarchy
  average by region (within store)
  maximum by month (within date)

Cube Aggregation
Example: computing sums

3-D cube:

day 1     c1   c2   c3
    p1    12        50
    p2    11    8

day 2     c1   c2   c3
    p1    44    4
    p2

Sum over date:

      c1   c2   c3
p1    56    4   50
p2    11    8

Sum over date and product:

      c1   c2   c3
sum   67   12   50

Sum over date and store:

      sum
p1    110
p2    19

Sum over everything: 129

rollup: toward the more aggregated tables
drill-down: back toward the detailed cube

Cube Operators
3-D cube:

day 1     c1   c2   c3
    p1    12        50
    p2    11    8

day 2     c1   c2   c3
    p1    44    4
    p2

Sum over date:

      c1   c2   c3
p1    56    4   50
p2    11    8

Sum over date and product: c1 67, c2 12, c3 50
Sum over date and store: p1 110, p2 19
Sum over everything: 129

Cell notation ("*" means summed over that dimension):
  sale(c1,*,*) = 67
  sale(c2,p2,*) = 8
  sale(*,*,*) = 129

Extended Cube
Each dimension gets an extra "*" value that holds the sum over that dimension.

day 1     c1   c2   c3    *
    p1    12        50   62
    p2    11    8        19
    *     23    8   50   81

day 2     c1   c2   c3    *
    p1    44    4        48
    p2
    *     44    4        48

*         c1   c2   c3    *
    p1    56    4   50  110
    p2    11    8        19
    *     67   12   50  129

Example: sale(*,p2,*) = 19
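One way to read the extended cube is that every fact contributes to eight cells: each of its three coordinates is either kept or replaced by "*". The sketch below assumes that reading and a dictionary keyed in the slide's sale(store, product, date) order; the function name extended_cube is hypothetical.

```python
from itertools import product as cartesian

# Fact table rows from the slides: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

def extended_cube(rows):
    """Build every cell sale(store, product, date), where '*' means 'all'."""
    cells = {}
    for prod_id, store_id, date, amt in rows:
        # Keep or wildcard each coordinate: 2 * 2 * 2 = 8 cells per fact.
        for key in cartesian((store_id, "*"), (prod_id, "*"), (date, "*")):
            cells[key] = cells.get(key, 0) + amt
    return cells

ext = extended_cube(SALE)
print(ext[("c1", "*", "*")])   # 67
print(ext[("c2", "p2", "*")])  # 8
print(ext[("*", "p2", "*")])   # 19
print(ext[("*", "*", "*")])    # 129
```

Precomputing every "*" cell makes lookups trivial but multiplies storage, which is the trade-off behind the later question of what to materialize.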

Aggregation Using Hierarchies

Hierarchy: customer -> region -> country

3-D cube:

day 1     c1   c2   c3
    p1    12        50
    p2    11    8

day 2     c1   c2   c3
    p1    44    4
    p2

Aggregated by region:

      region A   region B
p1    56         54
p2    11         8

(customer c1 in Region A;
customers c2, c3 in Region B)
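Rolling up along a hierarchy amounts to replacing each dimension value by its parent before summing. A minimal sketch using the customer-to-region mapping stated on the slide; the names REGION_OF and rollup_by_region are hypothetical.

```python
from collections import defaultdict

# Fact rows from the slides: (product, customer, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# Customer -> region, as given on the slide.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}

def rollup_by_region(rows, region_of):
    """Sum amt per (product, region), dropping the date dimension."""
    totals = defaultdict(int)
    for prod_id, cust_id, _date, amt in rows:
        totals[(prod_id, region_of[cust_id])] += amt
    return dict(totals)

by_region = rollup_by_region(SALE, REGION_OF)
print(by_region)
```

The result reproduces the slide's table: p1 is 56 in region A and 54 in region B, p2 is 11 and 8.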

Pivoting
Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:

day 1     c1   c2   c3
    p1    12        50
    p2    11    8

day 2     c1   c2   c3
    p1    44    4
    p2

Pivoted to product by store (summed over date):

      c1   c2   c3
p1    56    4   50
p2    11    8

Implementing a Warehouse
Monitoring: Sending data from sources
Integrating: Loading, cleansing,...
Processing: Query processing, indexing, ...
Managing: Metadata, Design, ...


Monitoring
Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, ...
Incremental vs. Refresh

customer
id   name   address    city
53   joe    10 main    sfo
81   fred   12 main    sfo
111  sally  80 willow  la

(the figure marks newly added data as "new")
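The difference between incremental and refresh can be sketched by comparing two snapshots of the customer table. This is only an illustration under stated assumptions: which row is the "new" one is not recoverable from the slide, so the sketch assumes it is customer 111, and the helper names snapshot_delta and apply_delta are hypothetical.

```python
def snapshot_delta(old, new):
    """Compare two snapshots keyed by id; return (inserted, deleted, updated)."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = set(old.keys() - new.keys())
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated

def apply_delta(warehouse, delta):
    """Incremental: ship and apply only the changes."""
    inserted, deleted, updated = delta
    warehouse.update(inserted)
    warehouse.update(updated)
    for key in deleted:
        del warehouse[key]

# Assumption: customer 111 is the row that arrived since the last snapshot.
old_snapshot = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
}
new_snapshot = dict(old_snapshot)
new_snapshot[111] = ("sally", "80 willow", "la")

# Incremental: the warehouse copy receives just the delta.
warehouse = dict(old_snapshot)
delta = snapshot_delta(old_snapshot, new_snapshot)
apply_delta(warehouse, delta)

# Refresh: the warehouse copy is rebuilt from the full new snapshot.
refreshed = dict(new_snapshot)

print(delta)
print(warehouse == refreshed)
```

Both paths end in the same state; incremental ships one row here, while refresh ships the whole table.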

Monitoring Techniques

Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application level monitoring

Advantages & Disadvantages!!

Monitoring Issues

Frequency
  periodic: daily, weekly, ...
  triggered: on big change, lots of changes, ...
Data transformation
  convert data to uniform format
  remove & add fields (e.g., add date to get history)
Standards (e.g., ODBC)
Gateways

Integration
Data Cleaning
Data Loading
Derived Data

[Diagram, top to bottom: Client, Client; Query & Analysis; Warehouse (with Metadata); Integration; Source, Source, Source]

Data Cleaning

Migration (e.g., yen to dollars)
Scrubbing: use domain-specific knowledge (e.g., social security numbers)
Fusion (e.g., mail list, customer merging)
  billing DB: customer1(Joe)
  service DB: customer2(Joe)
  result: merged_customer(Joe)
Auditing: discover rules & relationships (like data mining)

Loading Data
Incremental vs. refresh
Off-line vs. on-line
Frequency of loading

  At night, 1x a week/month, continuously
Parallel/Partitioned load

Derived Data

Derived Warehouse Data
  indexes
  aggregates
  materialized views (next slide)
When to update derived data?
Incremental vs. refresh

Materialized Views

Define new warehouse relations using SQL expressions

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
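A materialized view such as joinTb can be approximated by storing the result of its defining query as an ordinary table. The sketch below assumes SQLite through Python's sqlite3 module; SQLite has no materialized-view statement, so CREATE TABLE ... AS SELECT plus a full rebuild stands in for one, and the function name refresh_join_tb is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT, price INTEGER);
""")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])

def refresh_join_tb(connection):
    """Recompute the stored join from scratch (the 'refresh' strategy)."""
    connection.executescript("""
    DROP TABLE IF EXISTS joinTb;
    CREATE TABLE joinTb AS
        SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
        FROM sale s JOIN product p ON s.prodId = p.id;
    """)

refresh_join_tb(conn)
rows = conn.execute("SELECT * FROM joinTb").fetchall()
print(len(rows))

# Queries can now read joinTb directly instead of repeating the join.
bolt_total = conn.execute(
    "SELECT sum(amt) FROM joinTb WHERE name = 'bolt'").fetchone()[0]
print(bolt_total)
```

The stored table goes stale when sale or product changes, which is why the previous slide asks when derived data should be updated.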

Processing
ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms

[Diagram, top to bottom: Client, Client; Query & Analysis; Warehouse (with Metadata); Integration; Source, Source, Source]

ROLAP Server

Relational OLAP Server

[Diagram, top to bottom: tools, utilities; ROLAP server; relational DBMS]

Special indices, tuning; schema is denormalized

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

MOLAP Server

Multi-Dimensional OLAP Server

[Diagram, top to bottom: M.D. tools, utilities; multidimensional server (could also sit on relational DBMS)]

[Cube: Sales, with axes Product (milk, soda, eggs, soap), Date (2, 3, 4), and a third axis labeled A, B]

Index Structures

Traditional Access Methods
  B-trees, hash tables, R-trees, grids, ...
Popular in Warehouses
  inverted lists
  bit map indexes
  join indexes
  text indexes

Inverted Lists
age index: 18, 19, 20, 21, 22, 23, 25, 26

inverted lists:
  age 20: r4, r18, r34, r35
  second list shown: r5, r19, r37, r40 (r5 and r19 are the age-21 records below)

data records:
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...

Using Inverted Lists

Query:
  Get people with age = 20 and name = fred
List for age = 20: r4, r18, r34, r35
List for name = fred: r18, r52
Answer is intersection: r18
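The lookup can be sketched with one inverted list per attribute value and a set intersection. The eight records are the ones shown on the slides; record r52 appears only in the list for name = fred, so its age below is a hypothetical value added purely so that both lists match the slide.

```python
from collections import defaultdict

# Data records from the slides, keyed by record id.
RECORDS = {
    "r4": {"name": "joe", "age": 20},
    "r18": {"name": "fred", "age": 20},
    "r19": {"name": "sally", "age": 21},
    "r34": {"name": "nancy", "age": 20},
    "r35": {"name": "tom", "age": 20},
    "r36": {"name": "pat", "age": 25},
    "r5": {"name": "dave", "age": 21},
    "r41": {"name": "jeff", "age": 26},
    # Hypothetical: the slide lists r52 under name = fred but not its age.
    "r52": {"name": "fred", "age": 33},
}

def build_inverted_lists(records, field):
    """For each value of `field`, list the ids of the records holding it."""
    lists = defaultdict(list)
    for rid, rec in records.items():
        lists[rec[field]].append(rid)
    return lists

age_lists = build_inverted_lists(RECORDS, "age")
name_lists = build_inverted_lists(RECORDS, "name")

# age = 20 AND name = fred: intersect the two lists, then fetch the records.
answer = sorted(set(age_lists[20]) & set(name_lists["fred"]))
print(age_lists[20])       # ['r4', 'r18', 'r34', 'r35']
print(name_lists["fred"])  # ['r18', 'r52']
print(answer)              # ['r18']
```

Only the record ids are touched until the intersection is known, so the data records themselves are read once, for the final answer.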

Bit Maps

age index: 18, 19, 20, 21, 22, 23, 25, 26

bit maps (one bit per record, in record order):
  age 20: 1101100000
  age 21: 0010001011

data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...

Using Bit Maps

Query:
  Get people with age = 20 and name = fred
List for age = 20: 1101100000
List for name = fred: 0100000001
Answer is intersection: 0100000000
Good if domain cardinality small
Bit vectors can be compressed
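With bit maps the intersection becomes a bitwise AND. A minimal sketch using only the eight records visible on the slide (so the built vectors are 8 bits long), followed by the slide's own 10-bit vectors; the helper names build_bitmap and and_bits are hypothetical.

```python
# The eight records visible on the slide, in id order 1..8.
RECORDS = [
    ("joe", 20), ("fred", 20), ("sally", 21), ("nancy", 20),
    ("tom", 20), ("pat", 25), ("dave", 21), ("jeff", 26),
]

def build_bitmap(records, predicate):
    """One bit per record, left to right: '1' where the predicate holds."""
    return "".join("1" if predicate(rec) else "0" for rec in records)

def and_bits(a, b):
    """Bitwise AND of two equal-length bit strings."""
    return format(int(a, 2) & int(b, 2), "0{}b".format(len(a)))

age_20 = build_bitmap(RECORDS, lambda rec: rec[1] == 20)
name_fred = build_bitmap(RECORDS, lambda rec: rec[0] == "fred")
print(age_20, name_fred, and_bits(age_20, name_fred))

# The slide's 10-bit vectors, which also cover two records not shown.
slide_answer = and_bits("1101100000", "0100000001")
print(slide_answer)
```

Each attribute value needs one vector as long as the table, which is why the technique suits attributes with few distinct values.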

Join
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4

Join Indexes
join index: each product row lists the sale records that join with it

product
id  name  price  jIndex
p1  bolt  10     r1,r3,r5,r6
p2  nut   5      r2,r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
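A join index precomputes, for each product, which sale records match it, so the join becomes a direct lookup. A minimal sketch over the slide's data; the dictionary layout and the function names are assumptions made for illustration.

```python
# Sale records keyed by record id: (prodId, storeId, date, amt).
SALE = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}
PRODUCT = {"p1": ("bolt", 10), "p2": ("nut", 5)}

def build_join_index(sale):
    """Map each product id to the ids of the sale records that reference it."""
    j_index = {}
    for rid, (prod_id, *_rest) in sale.items():
        j_index.setdefault(prod_id, []).append(rid)
    return j_index

def sales_for_product(prod_id, j_index, sale):
    """Follow the join index instead of scanning the whole sale table."""
    return [sale[rid] for rid in j_index.get(prod_id, [])]

j_index = build_join_index(SALE)
print(j_index)

bolt_sales = sales_for_product("p1", j_index, SALE)
bolt_total = sum(amt for _prod, _store, _date, amt in bolt_sales)
print(bolt_total)
```

The index has to be maintained whenever sale rows are added, which is one more piece of derived data to keep current.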

What to Materialize?
Store in warehouse results useful for common queries
Example: total sales

Detailed cube:

day 1     c1   c2   c3
    p1    12        50
    p2    11    8

day 2     c1   c2   c3
    p1    44    4
    p2

Candidate results to materialize:

Sum over date:

      c1   c2   c3
p1    56    4   50
p2    11    8

Sum over date and product: c1 67, c2 12, c3 50
Sum over date and store: p1 110, p2 19
Sum over everything: 129

Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost


Cube Aggregates Lattice


Lattice of group-bys, from most aggregated to most detailed:
  all
  city; product; date
  city, product; city, date; product, date
  city, product, date

Example contents:
  all: 129
  city: c1 67, c2 12, c3 50
  city, product:
          c1   c2   c3
    p1    56    4   50
    p2    11    8
  city, product, date:
    day 1     c1   c2   c3
        p1    12        50
        p2    11    8
    day 2     c1   c2   c3
        p1    44    4
        p2

use greedy algorithm to decide what to materialize
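The slide does not say which greedy algorithm is meant. The sketch below assumes one common formulation: the most detailed view is always stored, the cost of a query is the row count of the smallest stored view that can answer it, and each step adds the view with the largest total cost saving. All names (view_size, query_cost, benefit, greedy) are hypothetical, and the row counts come from the six-row sample table.

```python
from itertools import combinations

# Sample facts from the slides: (product, city, date, amt).
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
DIMS = ("product", "city", "date")

def view_size(dims):
    """Rows in the group-by over `dims` (1 for the empty group-by, 'all')."""
    idx = [DIMS.index(d) for d in dims]
    return len({tuple(row[i] for i in idx) for row in SALE})

# Every subset of the dimensions is one node of the lattice.
LATTICE = [frozenset(c) for r in range(len(DIMS) + 1)
           for c in combinations(DIMS, r)]
SIZE = {v: view_size(v) for v in LATTICE}
TOP = frozenset(DIMS)

def query_cost(query, stored):
    """Rows read when answering `query` from the smallest usable stored view."""
    return min(SIZE[v] for v in stored if query <= v)

def benefit(view, stored):
    """Total rows saved, over all queries `view` can answer, by storing it."""
    return sum(max(0, query_cost(q, stored) - SIZE[view])
               for q in LATTICE if q <= view)

def greedy(k):
    """Pick k views beyond the top one, best benefit first."""
    stored = [TOP]
    for _ in range(k):
        candidates = [v for v in LATTICE if v not in stored]
        stored.append(max(candidates, key=lambda v: benefit(v, stored)))
    return stored[1:]

first_pick = greedy(1)
print(sorted(first_pick[0]))
```

On this toy data the first view chosen is the (product, date) group-by: it has 3 rows against 6 in the full table and can answer four of the eight group-bys.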

Dimension Hierarchies
Hierarchy: all -> state -> city

cities
city  state
c1    CA
c2    NY

Dimension Hierarchies
Lattice nodes once the city -> state hierarchy is included:
  all
  city; state; product; date
  city, product; city, date; product, date
  state, product; state, date
  city, product, date; state, product, date

not all arcs shown...

Interesting Hierarchy
time hierarchy: all, years, quarters, months, weeks, days
  days roll up to months, months to quarters, quarters to years
  days also roll up to weeks, which form a separate branch

conceptual dimension table
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000
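What makes this hierarchy interesting can be checked by generating the conceptual dimension table. The sketch below assumes the slide's simple numbering, in which days are counted from the start of the year and week n covers days 7n-6 through 7n (this is not ISO week numbering); the function name time_dimension is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

def time_dimension(year):
    """One row per day of `year`: day, week, month, quarter, year."""
    rows = []
    current = date(year, 1, 1)
    while current.year == year:
        day = current.timetuple().tm_yday
        rows.append({
            "day": day,
            "week": (day - 1) // 7 + 1,          # assumed numbering, not ISO
            "month": current.month,
            "quarter": (current.month - 1) // 3 + 1,
            "year": year,
        })
        current += timedelta(days=1)
    return rows

rows = time_dimension(2000)

# Which months does each week touch?
months_of_week = defaultdict(set)
for row in rows:
    months_of_week[row["week"]].add(row["month"])

straddling = sorted(w for w, months in months_of_week.items() if len(months) > 1)
print(rows[7])                 # day 8 falls in week 2, as on the slide
print(months_of_week[5])       # week 5 covers days 29-35: January and February
print(len(straddling) > 0)
```

Because some weeks straddle two months, monthly totals cannot be derived from weekly totals; both must be computed from days, so the hierarchy branches rather than forming a single chain.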

Design
What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?


Tools

Development
  design & edit: schemas, views, scripts, rules, queries, reports
Planning & Analysis
  what-if scenarios (schema changes, refresh rates), capacity planning
Warehouse Management
  performance monitoring, usage patterns, exception reporting
System & Network Management
  measure traffic (sources, warehouse, clients)
Workflow Management
  reliable scripts for cleaning & analyzing data

Current State of Industry

Extraction and integration done off-line
  Usually in large, time-consuming, batches
Everything copied at warehouse
  Not selective about what is stored
  Query benefit vs storage & update cost
Query optimization aimed at OLTP
  High throughput instead of fast response
  Process whole query before displaying anything