Data Warehousing
OLAP
Data Mining
Further Reading
Enroll Now
https://goo.gl/QbTVal
Data Warehousing
OLTP (online transaction processing) systems
range in size from megabytes to terabytes
high transaction throughput
Decision makers require access to all data
Historical and current
'A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of
management's decision-making process' (Inmon 1993)
Benefits
Potential high returns on investment
90% of companies surveyed in 1996 reported a return on investment
(over 3 years) of > 40%
Competitive advantage
Data can reveal previously unknown, unavailable and
untapped information
Increased productivity of corporate decision-makers
Integration allows more substantive, accurate and consistent
analysis
Typical Architecture
[Figure: typical data warehouse architecture. Data sources: mainframe operational n/w, h/w data; departmental RDBMS data; private data; external data. Warehouse components: load manager, warehouse manager, query manager, and a DBMS holding meta-data, highly summarized data, lightly summarized data and detailed data, with archive/backup. End-user access tools: OLAP tools, data-mining tools.]
Data Warehouses
Types of Data
Detailed
Summarised
Meta-data
Archive/Back-up
Information Flows
[Figure: information flows in a data warehouse. Components: operational data sources 1 to n, load manager, warehouse manager, DBMS (metadata, highly summarized data, lightly summarized data, detailed data), archive/backup, OLAP tools, data-mining tools. Labelled flows: inflow, upflow, downflow, meta-flow. Source: Connolly and Begg, p. 1162.]
Information Flow Processes
Five primary information flows
Inflow - extraction, cleansing and loading of data from
source systems into warehouse
Upflow - adding value to data in warehouse through
summarizing, packaging and distributing data
Downflow - archiving and backing up data in warehouse
Outflow - making data available to end users
Metaflow - managing the metadata
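The first two flows can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the source rows, the field names (branch, month, rent) and the cleansing rules are assumptions, not part of the slides. Inflow extracts, cleanses and loads rows into a detailed-data store; upflow then summarizes that detailed data at two levels.

```python
from collections import defaultdict

# Hypothetical raw rows from operational source systems (illustrative data only).
SOURCE_ROWS = [
    {"branch": " b001 ", "month": "2000-01", "rent": "450"},
    {"branch": "B001", "month": "2000-01", "rent": "600"},
    {"branch": "B002", "month": "2000-01", "rent": None},  # incomplete record
    {"branch": "b002", "month": "2000-02", "rent": "375"},
]


def inflow(rows):
    """Inflow: extract, cleanse and load rows into the detailed-data store."""
    detailed = []
    for row in rows:
        if row["rent"] is None:  # cleansing: reject incomplete records
            continue
        detailed.append({
            "branch": row["branch"].strip().upper(),  # homogenize the key format
            "month": row["month"],
            "rent": int(row["rent"]),
        })
    return detailed


def upflow(detailed):
    """Upflow: add value by summarizing the detailed data."""
    lightly = defaultdict(int)  # lightly summarized: total rent per (branch, month)
    highly = defaultdict(int)   # highly summarized: total rent per branch
    for row in detailed:
        lightly[(row["branch"], row["month"])] += row["rent"]
        highly[row["branch"]] += row["rent"]
    return dict(lightly), dict(highly)


detailed = inflow(SOURCE_ROWS)
lightly, highly = upflow(detailed)
```

Downflow, outflow and metaflow are left out of the sketch; in a real warehouse they would be handled by the warehouse and query managers.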
Problems of Data Warehousing
1. Underestimation of resources for data loading
2. Hidden problems with source systems
3. Required data not captured
4. Increased end-user demands
5. Data homogenization
6. High demand for resources
7. Data ownership
8. High maintenance
9. Long duration projects
10. Complexity of integration
Dimensionality Modelling
Similar to E-R modelling but with constraints
Star Schemas
The most common dimensional model
A fact table surrounded by dimension tables
Fact tables
contain a foreign key (FK) for each dimension table
large relative to dimension tables
read-only
Dimension tables
reference data
query performance is improved by denormalising reference data
into a single dimension table
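A star schema can be sketched with Python's built-in sqlite3 module. The table and column names (fact_sale, dim_branch, dim_time) and the sample rows are assumptions made up for this example. The fact table holds one foreign key per dimension, and the branch dimension is denormalised: city and region sit in the same table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: reference data, denormalised (city and region together).
CREATE TABLE dim_branch (branch_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

-- Fact table: one foreign key per dimension plus the measured value.
CREATE TABLE fact_sale (
    branch_id INTEGER REFERENCES dim_branch(branch_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    amount INTEGER
);

INSERT INTO dim_branch VALUES (1, 'Glasgow', 'Scotland'), (2, 'London', 'England');
INSERT INTO dim_time VALUES (1, 2000, 1), (2, 2000, 2);
INSERT INTO fact_sale VALUES (1, 1, 100), (1, 2, 150), (2, 1, 200), (2, 2, 250);
""")

# A typical star join: the fact table joined to each dimension, then grouped.
rows = conn.execute("""
    SELECT b.region, t.year, SUM(f.amount)
    FROM fact_sale f
    JOIN dim_branch b ON b.branch_id = f.branch_id
    JOIN dim_time t ON t.time_id = f.time_id
    GROUP BY b.region, t.year
    ORDER BY b.region
""").fetchall()
```

Each query needs only one join per dimension, which is the reason the dimension tables are kept denormalised.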
Other Schemas
Snowflake schemas
variant of star schema
each dimension can have its own dimensions
Starflake schemas
hybrid structure
contains mixture of (denormalised) star and
(normalised) snowflake schemas
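The difference from a star schema can be sketched as follows, again with sqlite3 and with hypothetical table names and data. Here the branch dimension has its own region dimension, so region names are stored once, at the cost of an extra join per query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: the branch dimension refers to its own (normalised) region dimension.
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_branch (
    branch_id INTEGER PRIMARY KEY,
    city TEXT,
    region_id INTEGER REFERENCES dim_region(region_id)
);
CREATE TABLE fact_sale (
    branch_id INTEGER REFERENCES dim_branch(branch_id),
    amount INTEGER
);

INSERT INTO dim_region VALUES (1, 'Scotland'), (2, 'England');
INSERT INTO dim_branch VALUES (1, 'Glasgow', 1), (2, 'Aberdeen', 1), (3, 'London', 2);
INSERT INTO fact_sale VALUES (1, 100), (2, 50), (3, 200);
""")

# Reaching the region now takes two joins: fact -> branch -> region.
totals = conn.execute("""
    SELECT r.region, SUM(f.amount)
    FROM fact_sale f
    JOIN dim_branch b ON b.branch_id = f.branch_id
    JOIN dim_region r ON r.region_id = b.region_id
    GROUP BY r.region
    ORDER BY r.region
""").fetchall()
```

A starflake schema would mix the two styles, keeping some dimensions denormalised and snowflaking others.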
OLAP
Online Analytical Processing
dynamic synthesis, analysis and consolidation of large
volumes of multi-dimensional data
normally implemented using specialized multidimensional DBMS
a method of visualising and manipulating data with
many inter-relationships
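Consolidation of multi-dimensional data can be sketched in plain Python. The cube below, its three dimensions (city, property_type, quarter) and its values are hypothetical. The roll_up helper sums away the dimensions that are not requested, and slice_cube fixes one dimension at a single member.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, property_type, quarter) -> revenue.
DIMENSIONS = ("city", "property_type", "quarter")
CUBE = {
    ("Glasgow", "Flat", "Q1"): 100,
    ("Glasgow", "House", "Q1"): 150,
    ("Glasgow", "Flat", "Q2"): 120,
    ("London", "Flat", "Q1"): 300,
    ("London", "House", "Q2"): 250,
}


def roll_up(cube, keep):
    """Consolidate the cube onto the dimensions in `keep`, summing over the rest."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for key, value in cube.items():
        result[tuple(key[p] for p in positions)] += value
    return dict(result)


def slice_cube(cube, dimension, member):
    """Keep only the cells where `dimension` equals `member`."""
    position = DIMENSIONS.index(dimension)
    return {key: value for key, value in cube.items() if key[position] == member}


by_city = roll_up(CUBE, ["city"])
by_city_quarter = roll_up(CUBE, ["city", "quarter"])
q1_only = slice_cube(CUBE, "quarter", "Q1")
```

The same data answers different questions depending on which dimensions are kept, which is the sense in which the analysis is dynamic.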
OLAP Tools
Categorised according to architecture of underlying database
Multi-dimensional OLAP
data typically aggregated and stored according to predicted usage
use array technology
Relational OLAP
use of relational meta-data layer with enhanced SQL
Managed Query Environment
deliver data direct from DBMS or MOLAP server to desktop in form
of a datacube
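The array-based storage used by multi-dimensional OLAP can be sketched as below. This is an illustration only, not how any particular MOLAP product works: the dimension members and values are made up, and the aggregates are simply computed once at load time on the assumption that totals per city and per quarter are the predicted queries.

```python
# Hypothetical dimension members; a cell is addressed by position, not by key lookup.
CITIES = ["Glasgow", "London"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]

# Dense 2-D array: one cell per (city, quarter) combination.
cells = [
    [100, 120, 90, 110],   # Glasgow
    [300, 250, 280, 310],  # London
]

# Aggregates stored at load time according to predicted usage.
city_totals = [sum(row) for row in cells]
quarter_totals = [sum(column) for column in zip(*cells)]


def lookup(city, quarter):
    """Return a single cell by translating member names into array offsets."""
    return cells[CITIES.index(city)][QUARTERS.index(quarter)]


def total_for_city(city):
    """Answer a predicted query from the stored aggregate, without rescanning cells."""
    return city_totals[CITIES.index(city)]
```

A relational OLAP tool would instead leave the data in relational tables and generate SQL (for example a GROUP BY query) when the request arrives.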
MOLAP
[Figure: MOLAP architecture. The RDB server loads data into the MOLAP server (database/application logic layer); the presentation layer sends requests to the MOLAP server and receives results.]
ROLAP
[Figure: ROLAP architecture. The ROLAP server (application logic layer) sends SQL to the RDB server (database layer) and receives results; the presentation layer sends requests to the ROLAP server and receives results.]
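MQE
[Figure: Managed Query Environment architecture. End-user tools send SQL to the RDB server and receive results directly; the RDB server also loads data into a MOLAP server, which answers requests from the end-user tools with results.]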
Data Mining
The process of extracting valid, previously
unknown, comprehensible and actionable
information from large databases and using it to
make crucial business decisions
focus is to reveal information which is hidden or
unexpected
patterns and relationships are identified by examining
the underlying rules and features of the data
work from data up
require large volumes of data
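[Figure: decision tree. The root test (label only partly recovered: "> 2 years") leads on No to "Rent property" and on Yes to a second test, "Customer age > 25 years?", which leads on No to "Rent property" and on Yes to "Buy property".]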
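The decision tree in the figure can be written as a pair of nested tests. The label of the first test survives only as "> 2 years", so the sketch assumes it refers to how long the customer has been renting; the parameter name years_renting is hypothetical and marks that assumption.

```python
def predict(years_renting, customer_age):
    """Classify a customer by following the tree from the root to a leaf.

    ASSUMPTION: the root test is 'renting for more than 2 years'; only the
    '> 2 years' part of that label is recoverable from the slide.
    """
    if years_renting > 2:          # root test: yes branch
        if customer_age > 25:      # second test: customer age > 25 years?
            return "Buy property"
        return "Rent property"
    return "Rent property"         # root test: no branch


examples = [predict(1, 40), predict(3, 22), predict(5, 30)]
```

A data mining tool would derive such rules from the data itself rather than have them written by hand.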
[Figure: Segmentation: Scatterplot Example]
Data mining needs a single, separate, clean, integrated, self-consistent data source
Data warehouse well equipped:
populated with clean, consistent data
contains multiple sources
utilises query capabilities
capability to go back to data source
Further Reading
Connolly and Begg, chapters 31 to 34.
W. H. Inmon, Building the Data Warehouse, New York, Wiley and Sons, 1993.
P. Beynon-Davies, Database Systems (2nd ed), Macmillan Press, 2000, ch. 34, 35 & 36.