
Data Warehousing

University of California, Berkeley


School of Information Management
and Systems
SIMS 257: Database Management
2004.11.15- SLIDE 1

Lecture Outline
Review
Application of Object-Relational DBMS: the
Berkeley Environmental Digital Library

Data Warehouses
Introduction to Data Warehouses
Data Warehousing
(Based on lecture notes from Joachim
Hammer, University of Florida, and Joe
Hellerstein and Mike Stonebraker of UCB)

2004.11.15- SLIDE 2


A Digital Library Infrastructure Model

Originators
Index
Services

Repositories
Network

Users
2004.11.15- SLIDE 4

UC Berkeley Digital Library Project


Focus: Work-centered digital information
services
Testbed: Digital Library for the California
Environment
Research: Technical agenda supporting
user-oriented access to large distributed
collections of diverse data types.
Part of the NSF/NASA/DARPA Digital
Library Initiative (Phases 1 and 2)
2004.11.15- SLIDE 5

The Environmental Library Contents


As of late 2002, the collection represents
over one terabyte of data, including over
183,000 digital images, about 300,000
pages of environmental documents, and
over 2 million records in geographical and
botanical databases.

2004.11.15- SLIDE 6

Botanical Data:
The CalFlora Database contains
taxonomical and distribution information
for more than 8000 native California
plants. The Occurrence Database includes
over 600,000 records of California plant
sightings from many federal, state, and
private sources. The botanical databases
are linked to the CalPhotos collection of
California plants, and are also linked to
external collections of data, maps, and
photos.
2004.11.15- SLIDE 7

Geographical Data:
Much of the geographical data in the collection
has been used to develop our web-based GIS
Viewer. The Street Finder uses 500,000 TIGER
records of S.F. Bay Area streets along with the
70,000 records from the USGS GNIS database.
California Dams is a database of information
about the 1395 dams under state jurisdiction. An
additional 11 GB of geographical data
represents maps and imagery that have been
processed for inclusion as layers in our GIS
Viewer. This includes Digital Ortho Quads and
DRG maps for the S.F. Bay Area.
2004.11.15- SLIDE 8

Documents:
Most of the 300,000 pages of digital documents are
environmental reports and plans that were provided by
California state agencies. This collection includes
documents, maps, articles, and reports on the California
environment including Environmental Impact Reports
(EIRs), educational pamphlets, water usage bulletins,
and county plans. Documents in this collection come
from the California Department of Water Resources
(DWR), California Department of Fish and Game (DFG),
San Diego Association of Governments (SANDAG), and
many other agencies. Among the most frequently
accessed documents are County General Plans for
every California county and a survey of 125 Sacramento
Delta fish species.
2004.11.15- SLIDE 9

Multivalent Documents

Valence: 2: The relative capacity to unite, react, or interact
(as with antigens or a biological substrate).
-- Webster's 7th Collegiate Dictionary

[Figure: a multivalent document built from layers over a scanned page
image: Cheshire Layer, GIS Layer, Table Layer, OCR Layer, and OCR
Mapping Layer, connected to Network Protocols & Resources.]

2004.11.15- SLIDE 10


GIS Viewer Example


http://elib.cs.berkeley.edu/annotations/gis/buildings.html

2004.11.15- SLIDE 14


Blobworld: use regions for retrieval


We want to find general objects
Represent images based on coherent
regions

2004.11.15- SLIDE 18



Overview
Data Warehouses and Merging
Information Resources
What is a Data Warehouse?
History of Data Warehousing
Types of Data and Their Uses
Data Warehouse Architectures
Data Warehousing Problems and Issues

2004.11.15- SLIDE 22

Problem: Heterogeneous Information Sources

Heterogeneities are everywhere

Personal
Databases

Scientific Databases

Digital Libraries

World
Wide
Web

Different interfaces
Different data representations
Duplicate and inconsistent information

Slide credit: J. Hammer


2004.11.15- SLIDE 23

Problem: Data Management in Large Enterprises

Vertical fragmentation of informational systems (vertical stove pipes)
Result of application (user)-driven development of operational systems
[Diagram: three vertical stove pipes, each with its own applications.]
Sales Administration: Sales Planning, Stock Mngmt, ...
Finance: Suppliers, Debt Mngmt, ...
Manufacturing: Num. Control, Inventory, ...
...
Slide credit: J. Hammer
2004.11.15- SLIDE 24

Goal: Unified Access to Data

Integration System

World
Wide
Web

Digital Libraries

Scientific Databases

Personal
Databases

Collects and combines information


Provides integrated view, uniform user interface
Supports sharing

Slide credit: J. Hammer


2004.11.15- SLIDE 25

The Traditional Research Approach


Query-driven (lazy, on-demand)
[Diagram: Clients query an Integration System (with Metadata), which
reaches each Source through its own Wrapper at query time.]

Slide credit: J. Hammer


2004.11.15- SLIDE 26

Disadvantages of Query-Driven Approach

Delay in query processing
Slow or unavailable information sources
Complex filtering and integration
Inefficient and potentially expensive for frequent queries
Competes with local processing at sources
Hasn't caught on in industry
Slide credit: J. Hammer
2004.11.15- SLIDE 27

The Warehousing Approach


Information integrated in advance
Stored in WH for direct querying and analysis

[Diagram: Clients query the Data Warehouse, which is loaded by an
Integration System (with Metadata) fed by an Extractor/Monitor at each
Source.]

Slide credit: J. Hammer


2004.11.15- SLIDE 28

Advantages of Warehousing Approach


High query performance
But not necessarily most current information

Doesn't interfere with local processing at sources
Complex queries at warehouse
OLTP at information sources

Information copied at warehouse


Can modify, annotate, summarize, restructure, etc.
Can store historical information
Security, no auditing

Has caught on in industry


Slide credit: J. Hammer
2004.11.15- SLIDE 29
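
The trade-off between the query-driven and warehousing approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Source` and `Warehouse` classes, the `query_driven` function, and the toy integer rows are all hypothetical and do not come from the lecture. The sketch shows that the warehouse answers from its stored copy without touching the sources, at the price of possibly stale results.

```python
class Source:
    """A hypothetical information source that counts how often it is scanned."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.queries_served = 0

    def scan(self):
        self.queries_served += 1
        return list(self.rows)


def query_driven(sources, predicate):
    # Lazy approach: contact every source at query time.
    return [r for s in sources for r in s.scan() if predicate(r)]


class Warehouse:
    def __init__(self, sources):
        self.sources = sources
        self.store = []

    def refresh(self):
        # Eager approach: integrate the source data in advance.
        self.store = [r for s in self.sources for r in s.scan()]

    def query(self, predicate):
        # Answered from the stored copy; no source is contacted.
        return [r for r in self.store if predicate(r)]


def greater_than_one(row):
    return row > 1


a, b = Source([1, 2]), Source([3])
wh = Warehouse([a, b])
wh.refresh()
a.rows.append(4)  # a source changes after the warehouse was loaded

fresh = query_driven([a, b], greater_than_one)  # sees the new row
stale = wh.query(greater_than_one)              # does not see it
```

In this sketch `fresh` contains the row added after the load while `stale` does not, which mirrors the "high query performance, but not necessarily most current information" point on the slide.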

Not Either-Or Decision


Query-driven approach still better for
Rapidly changing information
Rapidly changing information sources
Truly vast amounts of data from large
numbers of sources
Clients with unpredictable needs

Slide credit: J. Hammer


2004.11.15- SLIDE 30

Data Warehouse Evolution


[Timeline figure, 1960 to 2000. Eras: Prehistoric Times, Middle Ages,
Data Revolution, Information-Based Management. Milestones: Relational
Databases; PCs and Spreadsheets; End-user Interfaces; 1st DW Article;
Company DWs; Data Replication Tools; DW Confs.; "Building the DW",
Inmon (1992); Vendor DW Frameworks.]
Slide credit: J. Hammer
2004.11.15- SLIDE 31

What is a Data Warehouse?


A Data Warehouse is a
subject-oriented,
integrated,
time-variant,
non-volatile

collection of data used in support of management decision making
processes.
-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11
2004.11.15- SLIDE 32

DW Definition
Subject-Oriented:
The data warehouse is organized around the
key subjects (or high-level entities) of the
enterprise. Major subjects include

Customers
Patients
Students
Products
Etc.

2004.11.15- SLIDE 33

DW Definition
Integrated
The data housed in the data warehouse are
defined using consistent

Naming conventions
Formats
Encoding Structures
Related Characteristics

2004.11.15- SLIDE 34

DW Definition
Time-variant
The data in the warehouse contain a time
dimension so that they may be used as a
historical record of the business

2004.11.15- SLIDE 35

DW Definition
Non-volatile
Data in the data warehouse are loaded and
refreshed from operational systems, but
cannot be updated by end-users

2004.11.15- SLIDE 36

What is a Data Warehouse?


A Practitioner's Viewpoint
A data warehouse is simply a single,
complete, and consistent store of data
obtained from a variety of sources and
made available to end users in a way they
can understand and use it in a business
context.
-- Barry Devlin, IBM Consultant

Slide credit: J. Hammer
2004.11.15- SLIDE 37

A Data Warehouse is...


Stored collection of diverse data
A solution to data integration problem
Single repository of information

Subject-oriented
Organized by subject, not by application
Used for analysis, data mining, etc.

Optimized differently from transaction-oriented DB


User interface aimed at executive decision
makers and analysts
2004.11.15- SLIDE 38

Cont'd
Large volume of data (GB, TB)
Non-volatile
Historical
Time attributes are important

Updates infrequent
May be append-only
Examples
All transactions ever at WalMart
Complete client histories at insurance firm
Stockbroker financial information and portfolios
Slide credit: J. Hammer
2004.11.15- SLIDE 39

Warehouse is a Specialized DB
Standard DB                                Warehouse
Mostly updates                             Mostly reads
Many small transactions                    Queries are long and complex
MB - GB of data                            GB - TB of data
Current snapshot                           History
Index/hash on p.k.                         Lots of scans
Raw data                                   Summarized, reconciled data
Thousands of users (e.g., clerical users)  Hundreds of users (e.g., decision-makers, analysts)
Slide credit: J. Hammer
2004.11.15- SLIDE 40

Summary
[Figure: data warehouse components: Business Information Guide, Data
Warehouse Catalog, Business Information Interface, Data Warehouse,
Data Warehouse Population, Enterprise Modeling, and Operational
Systems.]
Slide credit: J. Hammer
2004.11.15- SLIDE 41

Warehousing and Industry


Warehousing is big business
$2 billion in 1995
$3.5 billion in early 1997
Predicted: $8 billion in 1998 [Metagroup]

WalMart has largest warehouse


900-CPU, 2,700 disk, 23 TB Teradata system
~7TB in warehouse
40-50GB per day
Slide credit: J. Hammer
2004.11.15- SLIDE 42

Types of Data
Business Data - represents meaning
Real-time data (ultimate source of all business data)
Reconciled data
Derived data

Metadata - describes meaning


Build-time metadata
Control metadata
Usage metadata

Data as a product* - intrinsic meaning


Produced and stored for its own intrinsic value
e.g., the contents of a text-book
Slide credit: J. Hammer
2004.11.15- SLIDE 43

Data Warehousing Architecture

2004.11.15- SLIDE 44

Ingest
[Diagram: Clients query the Data Warehouse; an Integration System (with
Metadata) loads it from Extractor/Monitors attached to file, database,
and external sources.]
2004.11.15- SLIDE 45

Data Warehouse Architectures: Conceptual View

Single-layer
Every data element is stored once only
Virtual warehouse
[Diagram: operational systems and informational systems both work on a
single layer of real-time data.]

Two-layer
Real-time + derived data
Most commonly used approach in industry today
[Diagram: operational systems work on real-time data; informational
systems work on derived data.]

Slide credit: J. Hammer


2004.11.15- SLIDE 46

Three-layer Architecture: Conceptual View

Transformation of real-time data to derived data really requires two steps

[Diagram: operational systems work on real-time data; reconciled data
is the physical implementation of the data warehouse; derived data is
the view level, serving the particular informational needs of
informational systems.]

Slide credit: J. Hammer


2004.11.15- SLIDE 47
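
The two transformation steps (real-time to reconciled, reconciled to derived) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for the example: the two operational systems, their field names, the currency units, and the per-customer total as the "informational need" are hypothetical.

```python
from collections import defaultdict

# Real-time data: hypothetical records from two operational systems that
# use different field names, spellings, and units.
sales_admin = [{"cust": "ACME", "amount_usd": 120.0}]
finance = [
    {"customer_name": "acme", "amount_cents": 3000},
    {"customer_name": "Globex", "amount_cents": 5000},
]


def reconcile():
    """Step 1: bring real-time data into one consistent representation."""
    rows = [{"customer": r["cust"].upper(), "usd": r["amount_usd"]}
            for r in sales_admin]
    rows += [{"customer": r["customer_name"].upper(),
              "usd": r["amount_cents"] / 100}
             for r in finance]
    return rows


def derive(reconciled):
    """Step 2: summarize reconciled data for one informational need."""
    totals = defaultdict(float)
    for r in reconciled:
        totals[r["customer"]] += r["usd"]
    return dict(totals)


reconciled = reconcile()
derived = derive(reconciled)
```

Keeping `reconciled` as a separate layer means several derived views can be computed from it without repeating the cleanup of the operational data.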

Issues in Data Warehousing


Warehouse Design
Extraction
Wrappers, monitors (change detectors)

Integration
Cleansing & merging

Warehousing specification & Maintenance


Optimizations
Miscellaneous (e.g., evolution)
Slide credit: J. Hammer
2004.11.15- SLIDE 48

Data Warehousing: Two Distinct Issues

(1) How to get information into warehouse
Data warehousing
(2) What to do with data once it's in warehouse
Warehouse DBMS
Both rich research areas
Industry has focused on (2)

Slide credit: J. Hammer


2004.11.15- SLIDE 49

Data Extraction
Source types
Relational, flat file, WWW, etc.

How to get data out?


Replication tool
Dump file
Create report
ODBC or third-party wrappers

Slide credit: J. Hammer


2004.11.15- SLIDE 50

Wrapper
Converts data and queries from one data model to another
[Diagram: queries and data pass between Data Model A and Data Model B.]
Extends query capabilities for sources with limited capabilities
[Diagram: Queries -> Wrapper -> Source]
Slide credit: J. Hammer
2004.11.15- SLIDE 51
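
As a rough illustration of both wrapper roles, the Python sketch below wraps a flat file so that it can be queried by attribute. The `CsvWrapper` class, the file contents, and the column names are hypothetical; a real wrapper would also translate queries into whatever the source understands natively.

```python
import csv
import io

# Hypothetical flat-file source; layout and column names are assumptions.
FLAT_FILE = """item,clerk
Computer,Mary
Printer,John
"""


class CsvWrapper:
    """Presents a flat file (data model A) as queryable rows (data model B)."""

    def __init__(self, text):
        self._text = text

    def rows(self):
        # Convert each CSV line into a dictionary keyed by column name.
        return list(csv.DictReader(io.StringIO(self._text)))

    def select(self, **equals):
        # The flat file cannot filter on its own; the wrapper extends its
        # query capabilities by scanning and applying equality predicates.
        return [r for r in self.rows()
                if all(r.get(col) == val for col, val in equals.items())]


wrapper = CsvWrapper(FLAT_FILE)
mary_sales = wrapper.select(clerk="Mary")
```

Hand-writing a class like this for each source corresponds to "Solution 1" on the next slide.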

Wrapper Generation
Solution 1: Hard code for each source
Solution 2: Automatic wrapper generation

[Diagram: Definition -> Wrapper Generator -> Wrapper]

Slide credit: J. Hammer


2004.11.15- SLIDE 52

Data Transformations
Convert data to uniform format
Byte ordering, string termination
Internal layout

Remove, add & reorder attributes


Add key
Add data to get history

Sort tuples

Slide credit: J. Hammer


2004.11.15- SLIDE 53
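
The transformations listed above can be sketched as a single Python function. The source rows, attribute names, and the `load_date` history column are hypothetical, picked only to show each bullet once.

```python
from itertools import count

# Hypothetical source rows with inconsistent case, stray whitespace, and
# an attribute the warehouse does not need.
source_rows = [
    {"CLERK": "mary ", "ITEM": "Computer", "internal_flag": "x"},
    {"CLERK": "John", "ITEM": "printer", "internal_flag": "y"},
]


def transform(rows, load_date):
    next_key = count(1)
    out = []
    for row in rows:
        out.append({
            "sale_id": next(next_key),              # add key
            "item": row["ITEM"].strip().title(),    # convert to uniform format
            "clerk": row["CLERK"].strip().title(),
            "load_date": load_date,                 # add data to get history
        })                                          # internal_flag is removed
    # Sort tuples by clerk, then item.
    return sorted(out, key=lambda r: (r["clerk"], r["item"]))


warehouse_rows = transform(source_rows, "2004-11-15")
```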

Monitors
Goal: Detect changes of interest and
propagate to integrator
How?
Triggers
Replication server
Log sniffer
Compare query results
Compare snapshots/dumps
Slide credit: J. Hammer
2004.11.15- SLIDE 54
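
Of the techniques listed, comparing snapshots is the easiest to sketch without any DBMS support. The Python example below assumes each snapshot is a dictionary keyed by primary key; the `diff_snapshots` name and the Emp data are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key and return the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserted, deleted, updated


# Hypothetical Emp(clerk, age) snapshots taken on two consecutive days.
yesterday = {"Mary": 25, "John": 41}
today = {"Mary": 26, "Ann": 33}

inserted, deleted, updated = diff_snapshots(yesterday, today)
```

A snapshot comparison only sees the net effect between two dumps, so intermediate changes are lost; triggers or a log sniffer would be needed to capture every change.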

Data Integration
Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
Rule-based
Actions

Resolve inconsistencies
Eliminate duplicates
Integrate into warehouse (may not be empty)
Summarize data
Fetch more data from sources (wh updates)
etc.
Slide credit: J. Hammer
2004.11.15- SLIDE 55

Data Cleansing
Find (& remove) duplicate tuples
e.g., Jane Doe vs. Jane Q. Doe

Detect inconsistent, wrong data


Attribute values that don't match

Patch missing, unreadable data


Notify sources of errors found

Slide credit: J. Hammer


2004.11.15- SLIDE 56
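
Duplicate detection such as "Jane Doe vs. Jane Q. Doe" needs some notion of when two values refer to the same entity. The Python sketch below uses one deliberately simple heuristic, chosen for illustration and not taken from the lecture: names are compared after lowercasing and dropping punctuation and single-letter initials. The function names and customer list are hypothetical.

```python
import re


def name_key(name):
    """Normalize a name: lowercase, drop punctuation and one-letter initials."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return tuple(t for t in tokens if len(t) > 1)


def remove_duplicates(names):
    seen = {}
    for name in names:
        # Keep the first spelling encountered for each normalized key.
        seen.setdefault(name_key(name), name)
    return list(seen.values())


customers = ["Jane Doe", "Jane Q. Doe", "John Smith"]
cleaned = remove_duplicates(customers)
```

This heuristic would also merge two different people who differ only by a middle initial, which is why real cleansing rules usually compare additional attributes before removing a tuple.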

Warehouse Maintenance
Warehouse data ≈ materialized view
Initial loading
View maintenance

Slide credit: J. Hammer


2004.11.15- SLIDE 57

Differs from Conventional View Maintenance...

Warehouses may be highly aggregated and summarized
Warehouse views may be over history of
base data
Process large batch updates
Schema may evolve

Slide credit: J. Hammer


2004.11.15- SLIDE 58

Differs from Conventional View Maintenance...

Base data doesn't participate in view maintenance
Simply reports changes
Loosely coupled
Absence of locking, global transactions
May not be queriable

Slide credit: J. Hammer


2004.11.15- SLIDE 59

Warehouse Maintenance Anomalies


Materialized view maintenance in loosely
coupled, non-transactional environment
Simple example
[Diagram: the Data Warehouse holds the view Sold(item,clerk,age),
defined as Sold = Sale ⋈ Emp. An Integrator maintains it from two
sources: Sales, with Sale(item,clerk), and Comp., with Emp(clerk,age).]

Slide credit: J. Hammer


2004.11.15- SLIDE 60

Warehouse Maintenance Anomalies


[Diagram: Data Warehouse with Sold(item,clerk,age); Integrator; sources
Sales with Sale(item,clerk) and Comp. with Emp(clerk,age).]

1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale(Computer,Mary), notify integrator
3. (1) -> integrator adds Sale ⋈ (Mary,25)
4. (2) -> integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)

Slide credit: J. Hammer


2004.11.15- SLIDE 61
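
The anomaly can be reproduced with a short Python simulation. It assumes, as a simplification, that the integrator handles each notification by joining the reported tuple against the current state of the other source; the list-based relations and the `pending` queue are hypothetical stand-ins for the real components.

```python
emp = []      # source relation Emp(clerk, age)
sale = []     # source relation Sale(item, clerk)
sold = []     # warehouse view Sold(item, clerk, age)
pending = []  # notifications waiting at the integrator

# Steps 1 and 2: both sources are updated before the integrator runs.
emp.append(("Mary", 25))
pending.append(("emp", ("Mary", 25)))
sale.append(("Computer", "Mary"))
pending.append(("sale", ("Computer", "Mary")))

# Steps 3 and 4: each notification is joined against the *current* state
# of the other source, which already contains the other update.
for source, row in pending:
    if source == "emp":
        clerk, age = row
        sold += [(item, c, age) for item, c in sale if c == clerk]
    else:
        item, clerk = row
        sold += [(item, clerk, age) for c, age in emp if c == clerk]

# Step 5: the view now holds the same tuple twice.
duplicates = len(sold) - len(set(sold))
```

Because the sources do not lock or coordinate with the integrator, the second update is already visible when the first notification is processed, so the same joined tuple is produced twice.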

Maintenance Anomaly - Solutions


Incremental update algorithms (ECA,
Strobe, etc.)
Research issues: Self-maintainable views
What views are self-maintainable
Store auxiliary views so original + auxiliary
views are self-maintainable

Slide credit: J. Hammer


2004.11.15- SLIDE 62

Self-Maintainability: Examples
Sold(item,clerk,age) =
Sale(item,clerk) ⋈ Emp(clerk,age)
Inserts into Emp
If Emp.clerk is key and Sale.clerk is
foreign key (with ref. int.) then no effect

Inserts into Sale


Maintain auxiliary view:
Emp-Δ = π clerk,age (Sold)

Deletes from Emp


Delete from Sold based on clerk
Slide credit: J. Hammer
2004.11.15- SLIDE 63

Self-Maintainability: Examples
Deletes from Sale
Delete from Sold based on {item,clerk}
Unless age at time of sale is relevant

Auxiliary views for self-maintainability


Must themselves be self-maintainable
One solution: all source data
But want minimal set
Slide credit: J. Hammer
2004.11.15- SLIDE 64
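
The insert and delete cases from the two slides above can be put together in a small Python sketch. One assumption should be stated loudly: for simplicity the auxiliary view here is a full warehouse-side copy of Emp (clerk to age), which is one possible choice and not necessarily the minimal auxiliary view the slides have in mind. The `SoldView` class and its method names are hypothetical.

```python
class SoldView:
    """Warehouse view Sold(item, clerk, age) maintained from deltas only."""

    def __init__(self):
        self.sold = set()
        self.emp_aux = {}  # auxiliary view (assumed): clerk -> age

    def insert_emp(self, clerk, age):
        # With referential integrity, no Sale row can reference a brand-new
        # clerk yet, so Sold is unaffected; only the auxiliary view grows.
        self.emp_aux[clerk] = age

    def insert_sale(self, item, clerk):
        # Join the new Sale tuple against the auxiliary view locally.
        self.sold.add((item, clerk, self.emp_aux[clerk]))

    def delete_emp(self, clerk):
        # Delete from Sold based on clerk.
        self.emp_aux.pop(clerk, None)
        self.sold = {t for t in self.sold if t[1] != clerk}

    def delete_sale(self, item, clerk):
        # Delete from Sold based on {item, clerk}.
        self.sold = {t for t in self.sold if (t[0], t[1]) != (item, clerk)}


view = SoldView()
view.insert_emp("Mary", 25)
view.insert_sale("Computer", "Mary")
view.insert_sale("Printer", "Mary")
view.delete_sale("Printer", "Mary")
after_sale_delete = set(view.sold)
view.delete_emp("Mary")
```

None of the methods queries a source, which is the defining property of a self-maintainable view; the cost is the extra storage for the auxiliary data.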

Partial Self-Maintainability
Avoid (but don't prohibit) going to sources
Sold = Sale(item,clerk) ⋈ Emp(clerk,age)

Inserts into Sale


Check if clerk already in Sold, go to source
if not
Or replicate all clerks over age 30
Or ...

Slide credit: J. Hammer


2004.11.15- SLIDE 65
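
The first option ("check if clerk already in Sold, go to source if not") can be sketched in Python as follows. The `fetch_age_from_source` function is a hypothetical stand-in for a query against the Emp source, and its data is made up for the example.

```python
source_lookups = []  # records which clerks required a trip to the source


def fetch_age_from_source(clerk):
    # Stand-in for a query against the Emp source; the data is hypothetical.
    source_lookups.append(clerk)
    return {"Mary": 25, "John": 41}[clerk]


def insert_sale(sold, item, clerk):
    # Reuse the age of a clerk that is already present in Sold ...
    for _, known_clerk, age in sold:
        if known_clerk == clerk:
            break
    else:
        # ... and go to the source only when the clerk is new to the view.
        age = fetch_age_from_source(clerk)
    sold.append((item, clerk, age))


sold = [("Computer", "Mary", 25)]
insert_sale(sold, "Printer", "Mary")  # answered from Sold
insert_sale(sold, "Monitor", "John")  # requires a source lookup
```

Only the insert for a previously unseen clerk reaches the source, so source traffic shrinks as the view fills up without the warehouse having to replicate all of Emp.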

Warehouse Specification (ideally)


[Diagram: View Definitions feed a Warehouse Configuration Module, which
produces Integration rules for the Integrator and Change Detection
Requirements for each Extractor/Monitor. The Integrator loads the
Warehouse; Metadata is kept alongside.]
Slide credit: J. Hammer
2004.11.15- SLIDE 66

Optimization
Update filtering at extractor
Similar to irrelevant updates in constraint and
view maintenance

Multiple view maintenance


If warehouse contains several views
Exploit shared sub-views

Slide credit: J. Hammer


2004.11.15- SLIDE 67
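
Update filtering at the extractor can be sketched as a relevance test applied before an update is forwarded. In this Python example the view ("clerks over age 30"), the `relevant` function, and the representation of an update as an (old row, new row) pair with `None` for an absent row are all assumptions made for illustration.

```python
def relevant(update, predicate):
    """Return True if the update could change a view defined by predicate."""
    old, new = update
    # A row matters if it satisfied the view before or satisfies it after.
    return ((old is not None and predicate(old))
            or (new is not None and predicate(new)))


def over_30(row):
    # Hypothetical view definition: clerks over age 30.
    return row["age"] > 30


updates = [
    (None, {"clerk": "Mary", "age": 25}),                          # irrelevant insert
    ({"clerk": "John", "age": 41}, {"clerk": "John", "age": 42}),  # changes a view row
    ({"clerk": "Ann", "age": 29}, {"clerk": "Ann", "age": 31}),    # enters the view
]

forwarded = [u for u in updates if relevant(u, over_30)]
```

Dropping irrelevant updates at the extractor saves both network traffic and integrator work, at the cost of the extractor needing to know the view definitions.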

Additional Research Issues

Historical views of non-historical data


Expiring outdated information
Crash recovery
Addition and removal of information
sources
Schema evolution

Slide credit: J. Hammer


2004.11.15- SLIDE 68

More Information on DW
Agosta, Lou, The Essential Guide to Data
Warehousing. Prentice Hall PTR, 1999.
Devlin, Barry, Data Warehouse, from
Architecture to Implementation. Addison-Wesley,
1997.
Inmon, W.H., Building the Data Warehouse.
John Wiley, 1992.
Widom, J., Research Problems in Data
Warehousing. Proc. of the 4th Intl. CIKM Conf.,
1995.
Chaudhuri, S., Dayal, U., An Overview of Data
Warehousing and OLAP Technology. ACM
SIGMOD Record, March 1997.
2004.11.15- SLIDE 69
