Anda di halaman 1dari 69

CSE 530 – Database Management Systems

Data Warehousing Presentation

By
Ali Gardezi
Prashanth Janardanan
Aaron Sheffield
The Evolution Of Data Warehousing

 Since 1970s, organizations have mostly


focused new investment in new computer
systems that automate business processes.
 In recent times, Organizations are focusing
on ways to use operational data to support
decision-making, as means of gaining a
competitive edge.
The Evolution Of Data Warehousing

 However, operational systems were never


designed to support such business activities.
 Using these systems for decision making
may never be an easy solution.
 The concept of a Data Warehouse was
deemed the solution to meet the
requirements of a system capable of
supporting decision making, receiving data
from multiple operational data sources.
Data Warehousing Concepts

 Original concept of Data Warehouse was


devised by IBM as the ‘information
warehouse’.
 Due to sheer complexity and performance
problems, early attempts at creating a
information warehouse were mostly rejected.
 It is only in recent years that the potential of
Data Warehousing is seen as a valuable and
viable solution.
Data Warehousing Concepts
 The latest and most successful advocate of data warehousing
is Bill Inmon(“father of data warehousing”).
 Data Warehouse:
– A subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision making
process.
 The goal of a data warehouse is to integrate enterprise wise
data into a single repository from which users can easily run
queries, produce reports, and perform analysis.

 Data Warehouse is a data management and data analysis


technology.
Data Warehousing Concepts
 In recent years a new term associated with data warehousing
has been used, namely ‘Data Webhouse’.
 Data Webhouse:
– A distributed data warehouse that is implemented over the web
with no central data repository.
 The web is an immense source of behavioral data as
individuals interact through their web browsers with remote
websites.
 Data generated by this behavior is called clickstream.
 Using a data warehouse on the web to harness clickstream
data has led to the development of Data Webhouses.
Benefits of Data Warehousing
 Potential high returns on investments:
– An organization must commit a huge amount of resources
to implement a data warehouse.
– Costs can run anywhere from £50,000 to over £10 million
due to a variety of solutions available.
– A study by the International Data Corporation (IDC) in 1996
reported that the average three-year returns on investment
(ROI) in data warehousing reached 401%.
– 90% of companies achieved over 40% ROI.
– Half of the companies achieved over 160% ROI.
– A quarter achieved more than 600% ROI.
Benefits of Data Warehousing

 Competitive Advantage
– Huge returns of investment for companies who
have successfully implemented a data warehouse
is evidence of the enormous competitive
advantage that accompanies this technology.
– Competitive advantage is gained by allowing
decision-makers access to data that can reveal
previously unavailable, unknown and untapped
information on, for example, customers, trends
and demands.
Benefits of Data Warehousing

 Increased Productivity Of Corporate Decision


makers
– Data Warehousing improves the productivity of corporate
decision-makers by creating and integrated database of
consistent, subject oriented , historical data.
– It integrates data from multiple incompatible systems into a
form that provides one consistent view of the organization.
– By transforming data into meaningful information, a data
warehouse allows corporate decision makers to perform
more substantive, accurate and consistent analysis.
Comparison Of OLTP and Data
Warehousing Systems

 OLTP Systems  Data Warehousing Systems


– Holds current data. – Holds historical data.
– Stores detailed data. – Stores detailed, lightly, and highly
– Data is dynamic. summarized data.
– Repetitive processing. – Data is largely static.
– Ad Hoc, unstructured, and heuristic
processing.
– Medium to low level of transaction
– High level of transaction throughput. throughput.
– Predictable pattern of data usage. – Unpredictable pattern of usage.
– Transaction driven. – Analysis driven.
– Application oriented. – Subject oriented.
– Supports day to day decisions. – Supports strategic decisions.
– Serves large number of
clerical/operation users.
– Serves relatively low number of
managerial users.
Comparison Of OLTP and Data
Warehousing Systems

 Although OLTP systems and Data Warehouses have


different characteristics and are built with different
purposes in mind, these systems are closely related.
– OLTP systems provide the source data for the warehouse.
 A major problem in this relationship is that the data
held by the OLTP systems can be inconsistent,
fragmented, and subject to change, containing
duplicate or missing entries.
 Therefore, data needs to be ‘cleaned up’ before it
can be used in a data warehouse.
Comparison Of OLTP and
Data Warehousing Systems

 OLTP offers large amounts of raw data,


which is not easily analyzed.
 Data warehouse allows for more complex
queries,
– What was the total revenue for Scotland in the
third quarter of 2004?
– What was the total revenue for property sales for
each type of property in Great Britain in 2003?
Problems Of Data Warehousing
 The following are some of the problems associated
with developing and managing a data warehouse.
– Underestimation of resources of data loading.
– Hidden problems with source systems.
– Required Data not captured.
– Increase end-user demands.
– Data homogenizations.
– High demand for resources.
– Data ownership.
– High maintenance.
– Long-Duration projects.
– Complexity of integration.
Data Warehouse Architecture
Warehouse Manager

Operational Data Reporting, query, application


source 1 development, and EIS tools
Highly
Metadata Summarized Query
Load data Manager
Manager

Operational Data
source 2 Lightly Summarized Data
OLAP Tools

DBMS
Detailed Data

Warehouse Manager
Operational Data
source n

Data mining tools

Archive/backup
data
End-user access tools
Operational Data
store (ODS)
Data Warehouse Architecture

 Operational Data:
– Source of Data for the Database Warehouse.
 Operational Data Store:
– A repository of current and integrated Operational
Data used for analysis.
 Load Manager:
– Performs all the operations associated with the
extraction and loading of data into the warehouse.
Data Warehouse Architecture

 Warehouse Manager:
– Performs operations such as
 Analysis of data to ensure consistency.
 Transformation and merging of source data from
temporary storage into data warehouse tables.
 Creation of indexes and views on base tables.
 Generation of denormalization.
 Generation of aggregations.
 Backing-up and archiving data.
Data Warehouse Architecture
 Query Manager
– Performs all operations associated with the management of user
queries.
– Examples,
 Directing queries to the appropriate tables.
 Scheduling the execution of queries.
 Detailed Data
– The area that stores all the detailed data in the database schema.
– Not available online but made available by aggregating data to the
next level of detail.
– On a regular basis, detailed data is added to the warehouse to
supplement the aggregated data.
Data Warehouse Architecture

 Lightly And Highly Summarized Data


– This area stores all the predefined lightly and
highly defined summarized data generated by the
warehouse manager.
– This area of warehouse is transient in order to
respond to changing query profiles.
– Purpose is to speed up performance of queries.
– Summary data is updated continuously as new
data is loaded into the warehouse.
Data Warehouse Architecture
 Archive/Backup Data
– Area of Data warehouse stores detailed and summarized
data for the purposes of archiving and backup.
– Data is transferred to storage devices such as magnetic
tape or optical disk.
 Metadata
– Area stores all the metadata (data about data) definitions
used by all the processes in the warehouse.
– Used for a variety of purposes
 Extraction and loading processes.
 Warehouse Management process.
 As part of the query management process.
Data Warehouse Architecture

 End-User Access Tools


– Users interact with the warehouse using end-use
tools. For example,
 Reporting and query tools.
 Application development tools.
 Executive Information System (EIS) tools.
 Online Analytical Processing (OLAP) tools.
 Data mining tools.
Data Warehouse Data Flows
Warehouse Manager

Metadata
Operational Data Reporting, query, application
source 1 development, and EIS tools
Highly
Summarized Query
Inflow Manager
Load data
Manager

Outflow
Lightly Summarized Data
OLAP Tools

Upflow

DBMS
Detailed Data

Warehouse Manager
Operational Data
source n Downflow
Data mining tools

Archive/backup
data
End-user access tools
Operational Data
store (ODS)
Data Warehouse Data Flows

 Data Warehousing focuses on the


management of five primary data flows.
– Inflow.
– Upflow.
– Downflow.
– Outflow.
– Metaflow.
Data Warehouse Data Flows
 Inflow
– The processes associated with the extraction, cleansing,
and loading of the data from the source systems into the
data warehouse.
– As the source data is generated predominately by OLTP
systems, the data must be reconstructed for the purposes
of the data warehouse.
 The construction of data involves
– Cleansing dirty data
– Restructuring data to suit the new requirements of the data
warehouse including, for example, adding and/or removing fields,
and denormalizing data.
– Ensuring that the source data is consistent with itself and with the
data already in the warehouse.
Data Warehouse Data Flows
 Upflow
– The processes associated with adding value to the data in
the warehouse through summarizing, packaging, and
distribution of the data.
– Activities associated with the upflow include
 Summarizing the data by selecting, projecting, joining, and
grouping relational data into views that are more convenient
and useful to the end-users.
 Packaging the data by converting the detailed or summarized
data into more useful formats, such as spread sheets, text
documents, charts, other graphical presentations, private
databases and animation.
 Distributing the data to appropriate groups to increase its
availability and accessibility.
Data Warehouse Data Flows

 Downflow
– The processes associated with archiving and backing-up of
data in the warehouse.
– Archiving old data plays an important role in maintaining
the effectiveness and performance of the warehouse by
transferring the older data of limited value to a storage
archive such as a magnetic tape or optical disk.
– The downflow of data includes the processes to ensure that
the current state of the data warehouse can be rebuilt
following data loss, or software/hardware failures.
– Archived data should be stored in a way that allows the re-
establishment of the data in the warehouse, when required.
Data Warehouse Data Flows

 Outflow
– The processes associated with making the data
available to the end-users.
– The two key activities involved in the outflow
include:
 Accessing, which is concerned with satisfying the end-
users’ requests for the data that they need.
 Delivering, which is concerned with proactively
delivering information to the end-users’ workstations and
is referred to as a type of ‘publish-and-subscribe’
process.
Data Warehouse Data Flows

 Metaflow
– The processes associated with the management
of the metadata.
– Metaflow is the process that moves metadata
(data about other flows).
– Metadata is a description of the data contents of
the data warehouse, what is in it, where it came
from originally, and what has been done to it by
way of cleansing, integrating, and summarizing.
Data Warehouse Tools And
Technologies

 Extraction, Cleansing, and Transformation


Tools
– Can be carried out either by separate products, or
by a single integrated solution.
– Integrated solutions fall into one of the following
categories.
 Code Generators.
 Database data replication tools.
 Dynamic transformation engines.
Data Warehouse Tools And
Technologies

 Data Warehouse DBMS


– Requirements of a data warehouse DBMS
 Load performance
 Load processing
 Data quality management
 Query performance
 Terabyte scalability
 Mass user scalability
 Networked data warehouse
 Warehouse administration
 Integrated dimensional analysis
 Advanced query functionality.
Data Warehouse Tools And
Technologies

 Data Warehouse MetaData


– The management of the metadata in the warehouse is an extremely complex
and difficult task.
– The major integration issue is how to synchronize the various types of
metadata used throughout the data warehouse.
– In the beginning there were two major standards for metadata and modeling in
the areas of data warehousing and component development proposed by the
Meta Data Coalition (MDC) and the Object Management Group (OMG).
– MDC merged with OMG.
– The merger resulted in the incorporation of the best of MDC’s Open Information
Model (OIM) with the best of OMG’s Common Warehouse Metamodel (CWM).
– The new version of OPM’s CWN builds on various standards, including OMG’s
UML (Unified Modeling Language), XMI (XML Metadata Interchage), and MOF
(Meta Object Facility), and on the MDC’s OIM.
– The CVM was developed by a number of companies, including IBM, Oracle,
Unisys, Hyperion, Genesis, NCR, UBS, and Dimension EDI.
Data Warehouse Tools And
Technologies

 Administration and Management Tools


– The Data Warehouse administration and management tolls must be
capable of supporting the following tasks:
 Monitoring data loading from multiple resources.
 Data quality and integrity checks.
 Managing and updating metadata.
 Monitoring database performance to ensure efficient query response times
and resource utilization.
 Auditing data warehouse usage to provide user chargeback information.
 Replicating, subsetting and distributing data.
 Maintaining efficient data storage management.
 Purging data.
 Archiving and backing-up data.
 Implementing recovery following failure.
 Security management.
Data Marts

A subset of Data
Warehouse that supports
the requirements of a
particular department or
business function.
Date Warehouse Vs Data Mart

Focuses on the entire Focuses only on the


enterprise requirements of users
associated with one
department/business
function
Contains detailed Contains only summary
Operational Data information

More complex and tough Easily understood and


to navigate navigated
Why Data Marts?
 To give users access to date they need to analyze most often
 To provide data in a form that matches the collective view of the data
of a group of users in a department or business function
 To provide end user response time due to the reduction in the volume
of data to be accessed
 To provide appropriately structured data as dictated by the User
Requirements Source(URS) using tools like OLAP/Data mining
 To facilitate easy execution of tasks such as data cleansing, loading ,
transformation and integration
 Cost factor ( cheaper that Data Warehouses)
 To define the potential users in a clear way and their preference of
data marts over data warehousing
Approaches to building data marts

 One approach is to build several Data Marts


with a view to eventual integration into a
warehouse
 Other approach is to build infrastructure for a
corporate data warehouse while at the same
time building one or more Date Marts to
satisfy immediate business needs
Data Mart Architecture
Warehouse Manager

Operational Data
source 1 Reporting, query, application
development, and EIS tools
Metadata
Load Highly
Manager summarized
data
Query
Manager
Operational Data Lightly Summarized Data
source 2
OLAP Tools

DBMS
Detailed Data

Warehouse Manager
Operational Data
source n
Data marts
Data mining tools

End-user access tools

Operational Data
store (ODS) Archive/Backup data

Summarized data
Data Mart Issues
 Functionality  from being small, easy-to-access databases, the capabilities and complexity
of data marts have increased

 Size  deterioration of performance and increase in response times due to unprecedented


growth in size

 Load performance  faster end user response time leads to large number of summary
tables and hence increase in data loading time

 Users access to multiple data marts  this lead to creation of virtual data marts which are
nothing but views of several physical data marts

 Internet/Intranet access  web browsers such as netscape navigators and MS internet


explorers offer users low-cost access to data marts

 Administration  as the number of data marts increases, data marts administration


becomes complex and leads to data marts versioning, metadata consistency, enterprise
security and performance tuning
Data Warehousing using Oracle9i

 Data Warehousing applications require specialized


processing techniques to allow support for complex,
ad hoc queries
 Oracle9i Enterprise Edition is one of the leading
relational DBMS for such applications
 Offers query processing techniques, sophisticated
query optimization, scalable architecture etc
 Special features provided by Oracle9i are discussed
in the following slides
Main features of Oracle
 Summary management  Oracle provides features for storing multiple dimensions and
summary calculations on a table

 Analytical Functions  provides features such as ranking, moving aggregates, period-


over-period comparisons etc

 Bitmapped indexes  Oracle provides efficient storage of bitmapped indexes through


advanced data compression techniques

 Advanced Join methods  offers partition wise joins , which increases the performance
of joins and thereby reducing the query response time

 SQL optimizer  One of the most powerful features of oracle. Its cost based optimizer
determines the most efficient access paths and joins for every query.

 Resource management Oracle has resource management functionality that provides


control of system resources assigned to users.
Data Warehousing design

 The database component of the Date


Warehouse is described using a technique
called Dimensionality Modeling.
 DM has emerged as the dominant approach
for most of the data warehouse databases.
Dimensionality Modeling
 Definition  A logical design technique that aims to present the
data in a standard, intuitive form that allows for high-
performance access.

 DM uses the concept of ER modeling with some restrictions.

 Every DM is composed of one table called Fact Table, with a


composite primary key.
 Additionally, it contains a set of smaller tables called Dimension
Tables, each consisting of a simple(non- composite) primary
key as shown in the following slide.
Star Schema

TimeID(pk)
Day
Fact table week
___________
timeID(pk) propertyID(pk)
propertyID(pk) type
branchID(pk) city
Offerprice
Sellingprice branchID(pk)
Type
City
region
Unnormalized
Dimension tables
Dimensionality Modeling cont…

Formal Definition  A logical structure that


has a fact table containing factual data in the
centre, surrounded by dimension tables
containing reference data( which can be
denormalized)
Dimensionality Modeling cont…

 Snowflake Schema  A variant of the star


schema where dimension tables do not
contain denormalised table

 Starflake Schema  A hybrid structure that


contains a mixture of star and snowfake
schemas
Part of Snowflake Schema

branchID(pk)
Fact table Type
___________ city
timeID(fk)
propertyID(fk)
branchID(fk)
Offerprice
City(pk)
Sellingprice
region

Normalized
Dimension tables
Salient Features of DM
 Efficiency  The consistency of the underlying database structure allows more efficient
access to the data by various tools like report writers and query tools

 Ability to handle changing environments  This design is better able to support ad hoc
user queries

 Extensibility  This model is extensile as far as adding new facts, dimensions , attributes
and breaking existing dimension records are concerned

 Ability to model common business situations  using report writers, query tools and
other user interfaces, every situation has a well-understood set of alternatives

 Predictable query processing  Even though the overall suite of star schemas in the
enterprise dimensional model is complex, the query processing is very predictable as
every fact table has to queried independently
DM Vs ER modeling

DMs are used for the database component of


Data Warehouse whereas ER models have
been used to describe database component
of Online Transaction Processing
Systems(OLTP).
Design Methodology

 “Nine-Step Methodology” by Ralph Kimball


(1996).
 Build Data Marts first.
 Later combine to form a Data Warehouse.
Nine-Step Methodology

1. Choosing the process


2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query modes
Step 1: Choosing the process

 Refers to the subject matter of a data mart.


 Should start with the one most likely to be:
– Delivered on time
– Within budget
– Commercially important (sales).
Step 2: Choosing the grain

 Deciding exactly what a fact table record


represents.
 Need to know the grain of the fact table
before you can identify the dimensions of the
fact table.
 Also, need the fact table grain before
determining the grain of the dimension
tables.
Step 3: Identifying and conforming
the dimensions

 Crucial to the understandability and ease of


use of the data mart.
 If a dimension occurs in two or more data
marts they must be exactly the same (or a
mathematical subsets).
 If a dimension is in more than one data mart
it is referred to being conformed.
Step 4: Choosing the facts

 All facts must be expressed at the level


implied by the grain.
 Facts should be numeric and additive.
 Additional facts can be added at any time.
Step 5: Storing pre-calculations in
the fact table

 Generally profit/loss statements.


 Best to pre-calculate if a user might calculate
them wrong.
 Also, adds a little extra speed if used often.
Step 6: Rounding out the
dimension tables

 Add as many text descriptions to the


dimensions as possible.
 Should be intuitive and understandable to the
users.
Step 7: Choosing the duration of
the database

 How far back in time the fact table goes.


 The older the data the more problems a data
mart will have reading/interpreting old files.
 Run into “slowly changing dimensions.”
Step 8: Tracking slowly changing
dimensions

 Important old dimensions have to be used for


some queries.
 Thus, they must be saved and maintained.
Three types of slowly changing
dimensions

 Type 1: Where a changed dimension attribute is


overwritten.
 Type 2: Where a changed dimension attribute
causes a new dimension record to be created.
 Type 3: Where a changed dimension attribute
causes an alternate attribute to be created so that
both the old and new values of the attribute are
simultaneously accessible (in the same dimension).
Step 9: Deciding the query
priorities and the query modes

 Physical sort order of the fact table.


 Pre-stored summaries or aggregations of the
fact table.
 Also, need to decide on other storage
concerns (backups, etc).
Assessing the Dimensionality of a
Data Warehouse

 Ralph Kimball (2000a, b).


 A method to score vendor’s products.
 20 criteria each scored with a 0 or 1.
 Divided into three categories: architectural,
administration, expression.
Architectural Criteria

 Fundamental characteristics to the way the


system’s organization.
 From the back-end, through the DMBS, to
the front-end and user’s desktop.
Architectural Criteria

 Explicit declaration
 Conformed dimensions and facts
 Dimensional integrity
 Open aggregate navigation
 Dimensional symmetry
 Sparsity tolerance
Administration Criteria

 Generally affect IT personnel (so pick some


thing you can work with).
 Affects the “smooth running” of the data
warehouse.
Administration Criteria

 Graceful modification
 Dimensional replication
 Changed dimension notification
 Surrogate key administration
 International consistency
 Multiple-dimension hierarchies
 Ragged-dimension hierarchies
Expression Criteria

 Analytic capabilities.
 Generally what the user sees most often
(make sure the boss is happy).
Expression Criteria

 Multiple valued dimensions


 Slowly changing dimensions
 Roles of a dimension
 Hot-swappable dimensions
 On-the-fly fact range dimensions
 On-the-fly behavior dimensions
Wash U. and Data Warehouses

 Yixin Chen’s research involves data


warehouses
 Stream Cube for online computation of a
multi-dimensional, multi-level stream data
cube under stringent time and space
constraints.
 http://www.cse.wustl.edu/~chen/research.htm
MetLife’s Data Warehouse

 Goal was to create a centralized place for all their


different types of insurance.
 From that information figure out what other services
a customer might need.
 http://archives.cnn.com/2001/TECH/industry/12/30/m
etlife.database.idg/index.html
 Rolled out in the first quarter of 2002.
– Stock value of $30.10
– Current stock value of $50.90 (11/23/05)
Questions?

Anda mungkin juga menyukai