Lecture Outline
Review
Application of Object Relational DBMS: the
Berkeley Environmental Digital Library
Data Warehouses
Introduction to Data Warehouses
Data Warehousing
(Based on lecture notes from Joachim
Hammer, University of Florida, and Joe
Hellerstein and Mike Stonebraker of UCB)
2004.11.15- SLIDE 2
[Figure labels: Originators, Index, Services, Repositories, Network, Users]
2004.11.15- SLIDE 4
Botanical Data:
The CalFlora Database contains
taxonomical and distribution information
for more than 8000 native California
plants. The Occurrence Database includes
over 600,000 records of California plant
sightings from many federal, state, and
private sources. The botanical databases
are linked to the CalPhotos collection of
California plants, and are also linked to
external collections of data, maps, and
photos.
2004.11.15- SLIDE 7
Geographical Data:
Much of the geographical data in the collection
has been used to develop our web-based GIS
Viewer. The Street Finder uses 500,000 TIGER
records of S.F. Bay Area streets along with
70,000 records from the USGS GNIS database.
California Dams is a database of information
about the 1395 dams under state jurisdiction. An
additional 11 GB of geographical data
represents maps and imagery that have been
processed for inclusion as layers in our GIS
Viewer. This includes Digital Ortho Quads and
DRG maps for the S.F. Bay Area.
2004.11.15- SLIDE 8
Documents:
Most of the 300,000 pages of digital documents are
environmental reports and plans that were provided by
California state agencies. This collection includes
documents, maps, articles, and reports on the California
environment including Environmental Impact Reports
(EIRs), educational pamphlets, water usage bulletins,
and county plans. Documents in this collection come
from the California Department of Water Resources
(DWR), California Department of Fish and Game (DFG),
San Diego Association of Governments (SANDAG), and
many other agencies. Among the most frequently
accessed documents are County General Plans for
every California county and a survey of 125 Sacramento
Delta fish species.
2004.11.15- SLIDE 9
Multivalent Documents
[Figure: a Scanned Page Image ("History of The Classical World") overlaid with layers: OCR Layer, OCR Mapping Layer, Table Layer, GIS Layer, Cheshire Layer, and Network Protocols & Resources]
Valence:
2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate).
Webster's 7th Collegiate Dictionary
2004.11.15- SLIDE 10
Overview
Data Warehouses and Merging
Information Resources
What is a Data Warehouse?
History of Data Warehousing
Types of Data and Their Uses
Data Warehouse Architectures
Data Warehousing Problems and Issues
2004.11.15- SLIDE 22
[Figure labels: Personal Databases, Scientific Databases, Digital Libraries, World Wide Web]
Different interfaces
Different data representations
Duplicate and inconsistent information
Sales Administration
Finance
Manufacturing
...
Slide credit: J. Hammer
2004.11.15- SLIDE 24
[Figure: an Integration System over the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
[Figure: an Integration System with Metadata, above a set of Wrappers, each over its own Source]
[Figure: Clients, Data Warehouse, Integration System with Metadata, and a set of Extractor/Monitor components, each over its own Source]
[Figure: timeline of data warehousing history; TIME axis marked 1975, 1980, 1985, 1990, 1995, 2000]
Eras: Prehistoric Times, Middle Ages, Data Revolution
Milestones shown: PCs and Spreadsheets; End-user Interfaces; 1st DW Article; Data Replication Tools; DW Confs.; "Building the DW", Inmon (1992); Company DWs; Vendor DW Frameworks; Information-Based Management
Slide credit: J. Hammer
2004.11.15- SLIDE 31
DW Definition
Subject-Oriented:
The data warehouse is organized around the
key subjects (or high-level entities) of the
enterprise. Major subjects include
Customers
Patients
Students
Products
Etc.
2004.11.15- SLIDE 33
DW Definition
Integrated
The data housed in the data warehouse are
defined using consistent
Naming conventions
Formats
Encoding Structures
Related Characteristics
2004.11.15- SLIDE 34
DW Definition
Time-variant
The data in the warehouse contain a time
dimension so that they may be used as a
historical record of the business
2004.11.15- SLIDE 35
DW Definition
Non-volatile
Data in the data warehouse are loaded and
refreshed from operational systems, but
cannot be updated by end-users
2004.11.15- SLIDE 36
Slide credit: J. Hammer
2004.11.15- SLIDE 37
Subject-oriented
Organized by subject, not by application
Used for analysis, data mining, etc.
Cont'd
Large volume of data (GB, TB)
Non-volatile
Historical
Time attributes are important
Updates infrequent
May be append-only
Examples
All transactions ever at WalMart
Complete client histories at insurance firm
Stockbroker financial information and portfolios
Slide credit: J. Hammer
2004.11.15- SLIDE 39
Warehouse is a Specialized DB
Standard DB                  Warehouse
Mostly updates               Mostly reads
Many small transactions      Queries are long and complex
MB - GB of data              GB - TB of data
Current snapshot             History
Index/hash on p.k.           Lots of scans
Raw data                     Summarized, reconciled data
Thousands of users           Hundreds of users
(e.g., clerical users)       (e.g., decision-makers, analysts)
Slide credit: J. Hammer
2004.11.15- SLIDE 40
Summary
[Figure labels: Business Information Guide, Data Warehouse Catalog, Business Information Interface, Data Warehouse, Data Warehouse Population, Enterprise Modeling, Operational Systems]
Slide credit: J. Hammer
2004.11.15- SLIDE 41
Types of Data
Business Data - represents meaning
Real-time data (ultimate source of all business data)
Reconciled data
Derived data
2004.11.15- SLIDE 44
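The three data types above can be illustrated with a small sketch. The slide only names the types; the reading used here (reconciled = cleaned and made consistent across sources, derived = summarized for a particular need) is an assumption based on common usage, and the source systems, field names, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical real-time records from two operational systems that use
# different field names and different amount encodings.
orders_system = [{"cust": "C1", "amt_cents": 1250}, {"cust": "C2", "amt_cents": 400}]
billing_system = [{"customer_id": "C1", "amount": 7.5}]


def reconcile(orders, billing):
    """Map both sources onto one naming convention and one unit (dollars)."""
    rows = [{"customer": r["cust"], "amount": r["amt_cents"] / 100} for r in orders]
    rows += [{"customer": r["customer_id"], "amount": r["amount"]} for r in billing]
    return rows


def derive(reconciled_rows):
    """Summarize reconciled rows into a total amount per customer."""
    totals = defaultdict(float)
    for row in reconciled_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


reconciled = reconcile(orders_system, billing_system)
derived = derive(reconciled)
```

Here `reconciled` holds one row per source record in a common format, while `derived` holds one summary row per customer.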
Ingest
[Figure: Clients, Data Warehouse, Integration System with Metadata, and Extractor/Monitor components over three kinds of source: Source / File, Source / DB, Source / External]
2004.11.15- SLIDE 45
[Figure labels: Operational systems, Informational systems]
Two-layer
Real-time + derived data
Most commonly used approach in
industry today
[Figure labels: Operational systems, Informational systems, Derived Data, Real-time data]
[Figure labels: Informational systems; Derived Data; Reconciled Data; View level; Particular informational needs; Physical Implementation of the Data Warehouse; Real-time data]
Integration
Cleansing & merging
Data Extraction
Source types
Relational, flat file, WWW, etc.
Wrapper
Converts data and queries from one data model to
another
[Figure: Queries and Data pass between Data Model A and Data Model B through a Wrapper placed over a Source]
Slide credit: J. Hammer
2004.11.15- SLIDE 51
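A minimal sketch of the wrapper idea, assuming a flat-file (CSV) source and a client that asks attribute=value queries and expects dict-shaped tuples back. The class name `FlatFileWrapper`, its `query` method, and the sample data are hypothetical and not taken from the lecture.

```python
import csv
import io

# Hypothetical flat-file source: its native data model is CSV text.
FLAT_FILE = "item,clerk\nshoes,ann\nhat,bob\nscarf,ann\n"


class FlatFileWrapper:
    """Answers attribute=value queries over a CSV source."""

    def __init__(self, text):
        self._text = text

    def query(self, **conditions):
        # Query translation: an attribute=value query becomes a filtered scan.
        reader = csv.DictReader(io.StringIO(self._text))
        # Data translation: each CSV line becomes a dict-shaped tuple.
        return [dict(row) for row in reader
                if all(row[k] == v for k, v in conditions.items())]


wrapper = FlatFileWrapper(FLAT_FILE)
ann_sales = wrapper.query(clerk="ann")
```

The client never sees CSV; swapping in a wrapper for a relational or web source would leave the calling code unchanged.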
Wrapper Generation
Solution 1: Hard code for each source
Solution 2: Automatic wrapper generation
[Figure labels: Definition, Wrapper Generator, Wrapper]
Data Transformations
Convert data to uniform format
Byte ordering, string termination
Internal layout
Sort tuples
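The transformations listed above can be sketched as follows. The record layout (a big-endian 32-bit id followed by an 8-byte NUL-padded name) is an assumption chosen only for illustration.

```python
import struct


def to_uniform(raw: bytes):
    """Decode one source record: big-endian int32 id + 8-byte NUL-padded name."""
    record_id, name = struct.unpack(">i8s", raw)
    # Drop the C-style NUL terminator/padding and decode to a Python string.
    return record_id, name.split(b"\x00", 1)[0].decode("ascii")


# Two hypothetical records in the source's binary layout.
raw_records = [struct.pack(">i8s", 7, b"hat"), struct.pack(">i8s", 3, b"shoes")]

# Convert each record, then sort the tuples by id so later merge steps
# see a uniform order.
uniform = sorted(to_uniform(r) for r in raw_records)
```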
Monitors
Goal: Detect changes of interest and
propagate to integrator
How?
Triggers
Replication server
Log sniffer
Compare query results
Compare snapshots/dumps
Slide credit: J. Hammer
2004.11.15- SLIDE 54
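Of the techniques listed, comparing snapshots is the simplest to sketch. The example below assumes each snapshot is a mapping from primary key to row; the function name `diff_snapshots` and the data are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key and return the changes."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates


old = {1: ("shoes", "ann"), 2: ("hat", "bob")}
new = {1: ("shoes", "carl"), 3: ("scarf", "ann")}
inserts, deletes, updates = diff_snapshots(old, new)
```

This approach needs no cooperation from the source, but it costs a full scan of both snapshots; triggers or a log sniffer detect changes sooner at the price of tighter coupling to the source system.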
Data Integration
Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
Rule-based
Actions
Resolve inconsistencies
Eliminate duplicates
Integrate into warehouse (may not be empty)
Summarize data
Fetch more data from sources (warehouse updates)
etc.
Slide credit: J. Hammer
2004.11.15- SLIDE 55
Data Cleansing
Find (& remove) duplicate tuples
e.g., Jane Doe vs. Jane Q. Doe
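A minimal sketch of duplicate detection for the Jane Doe example, assuming a deliberately crude matching rule: two names match when their first and last names agree after dropping punctuation and single-letter initials. The rule and the helper names are illustrative assumptions; real cleansing tools use more careful matching.

```python
def name_key(name):
    """Normalize a name: lowercase, drop punctuation and single-letter initials."""
    parts = [p.strip(".,").lower() for p in name.split()]
    parts = [p for p in parts if len(p) > 1]
    return (parts[0], parts[-1]) if parts else ()


def remove_duplicates(tuples):
    """Keep the first tuple seen for each normalized name key."""
    seen, kept = set(), []
    for name, city in tuples:
        key = name_key(name)
        if key not in seen:
            seen.add(key)
            kept.append((name, city))
    return kept


customers = [("Jane Doe", "Berkeley"), ("Jane Q. Doe", "Berkeley"), ("John Roe", "Davis")]
cleaned = remove_duplicates(customers)
```

Note that this rule would also merge two different people who share a first and last name, which is why the slide says "find (& remove)": removal usually needs extra evidence, such as a matching address.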
Warehouse Maintenance
Warehouse data as a materialized view
Initial loading
View maintenance
View maintenance
Sold(item,clerk,age)
Sold = Sale ⋈ Emp
[Figure: the Integrator maintains Sold(item,clerk,age) from two sources: Sales, with Sale(item,clerk), and Comp., with Emp(clerk,age)]
Self-Maintainability: Examples
Sold(item,clerk,age) =
Sale(item,clerk) ⋈ Emp(clerk,age)
Inserts into Emp
If Emp.clerk is key and Sale.clerk is
foreign key (with referential integrity) then no effect
Self-Maintainability: Examples
Deletes from Sale
Delete from Sold based on {item,clerk}
Unless age at time of sale is relevant
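The two self-maintainability cases above can be sketched as follows, assuming the view is held as a set of (item, clerk, age) tuples and that Emp.clerk is a key with Sale.clerk as a foreign key. The function names and data are hypothetical.

```python
def join(sale, emp):
    """Initial load: Sold(item, clerk, age) = Sale joined with Emp on clerk."""
    return {(item, clerk, emp[clerk]) for item, clerk in sale if clerk in emp}


def on_emp_insert(sold, clerk, age):
    # With Emp.clerk a key and Sale.clerk a foreign key, a newly inserted
    # employee cannot have sales yet, so the view is unchanged.
    return sold


def on_sale_delete(sold, item, clerk):
    # Remove matching view tuples using only the view itself; no source query.
    return {t for t in sold if (t[0], t[1]) != (item, clerk)}


sale = {("shoes", "ann"), ("hat", "bob")}
emp = {"ann": 31, "bob": 45}  # hypothetical data
sold = join(sale, emp)
sold = on_emp_insert(sold, "carl", 28)
sold = on_sale_delete(sold, "hat", "bob")
```

Neither handler reads `sale` or `emp`, which is what makes these two updates self-maintainable. An insert into Sale is different: the warehouse would need the clerk's age, which it can get only from the Emp source or from a copy kept at the warehouse.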
Partial Self-Maintainability
Avoid (but don't prohibit) going to sources
Sold = Sale(item,clerk) ⋈ Emp(clerk,age)
[Figure labels: Integration rules, Warehouse, Change Detection Requirements, Integrator, Metadata, and a set of Extractor/Monitor components]
Slide credit: J. Hammer
2004.11.15- SLIDE 66
Optimization
Update filtering at extractor
Similar to irrelevant updates in constraint and
view maintenance
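A minimal sketch of update filtering, assuming the warehouse view selects rows by one predicate and uses only some attributes. The predicate, the attribute set, and the sample updates are all hypothetical.

```python
# Hypothetical: the attributes the warehouse view actually uses.
VIEW_ATTRS = {"item", "clerk"}


def in_view(row):
    """Hypothetical selection predicate of the warehouse view."""
    return row["store"] == "SF"


def is_relevant(old, new):
    """An update matters only if it touches a row the view selects and
    changes either a used attribute or the row's membership in the view."""
    if not (in_view(old) or in_view(new)):
        return False
    if in_view(old) != in_view(new):
        return True
    return any(old[a] != new[a] for a in VIEW_ATTRS)


updates = [
    ({"item": "hat", "clerk": "bob", "store": "LA", "note": ""},
     {"item": "hat", "clerk": "bob", "store": "LA", "note": "x"}),
    ({"item": "hat", "clerk": "bob", "store": "SF", "note": ""},
     {"item": "hat", "clerk": "ann", "store": "SF", "note": ""}),
    ({"item": "cap", "clerk": "ann", "store": "SF", "note": ""},
     {"item": "cap", "clerk": "ann", "store": "SF", "note": "gift"}),
]
forwarded = [u for u in updates if is_relevant(*u)]
```

Only the second update reaches the integrator; the other two cannot change the view, so the extractor can drop them at the source.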
More Information on DW
Agosta, Lou, The Essential Guide to Data
Warehousing. Prentice Hall PTR, 1999.
Devlin, Barry, Data Warehouse, from
Architecture to Implementation. Addison-Wesley,
1997.
Inmon, W.H., Building the Data Warehouse.
John Wiley, 1992.
Widom, J., Research Problems in Data
Warehousing. Proc. of the 4th Intl. CIKM Conf.,
1995.
Chaudhuri, S., Dayal, U., An Overview of Data
Warehousing and OLAP Technology. ACM
SIGMOD Record, March 1997.
2004.11.15- SLIDE 69