
Business Intelligence & Data Warehousing

ANAND.T,
Business Intelligence, Citicards, Tata Consultancy Services Ltd.,

Lecture I
Basics and Concepts

Motivation

Aims of information technology:
To help workers in their everyday business activity and improve their productivity: clerical data processing tasks
To help knowledge workers (executives, managers, analysts) make faster and better decisions: decision support systems
Two types of applications: operational applications and analytical applications

The Architecture of Data

(Diagram: the layered architecture of data.)
Business rules: what has been learned from the data
Metadata: database schema; the logical model and physical layout of data
Summary data: summaries by who, what, when, where, ...
Operational data: detailed records of who, what, when, where
Business Intelligence
Business Intelligence is a technology based on customer- and profit-oriented models that reduces operating costs and increases profitability by improving productivity, sales, and service, and that supports rapid decision making.

BI Cycle

The business intelligence cycle: ANALYSIS -> INSIGHT -> ACTION -> MEASUREMENT, and back to analysis.

Uses of Business Intelligence: Operational Efficiency

ERP Reporting
KPI Tracking
Product Profitability
Risk Management
Balanced Scorecard
Activity Based Costing
Global Sourcing
Logistics

Uses of Business Intelligence: Customer Interaction

Sales Analysis
Sales Forecasting
Segmentation
Cross-selling
CRM Analytics
Campaign Planning
Customer Profitability

(Diagram: sources feeding business intelligence, grouped under Market Research, Competitive Intelligence, Environmental Scanning, and Data Mining: focus groups, online focus groups, one-on-ones, online surveys, telephone surveys, mystery shopping, website mystery shopping, custom panels, ad scanning/tracking, news scanning services, internal scanning (Google), government reports, association stats, AC Nielsen reports, media monitoring, syndicated studies, economic reports, library sciences, POS systems, CRM, segmentation, mining customer records, predictive modelling.)

BI Tools
BI tools support business intelligence in areas such as customer profiling, customer support, market research, market segmentation, product profitability, statistical analysis, and inventory and distribution analysis.

Evolution

60s: batch reports; hard to find and analyze information; inflexible and expensive, reprogram for every new request
70s: terminal-based DSS and EIS (executive information systems); still inflexible, not integrated with desktop tools
80s: desktop data access and analysis tools; query tools, spreadsheets, GUIs; easier to use, but only access operational databases
90s: data warehousing with integrated OLAP engines and tools

Data Warehousing Market


Hardware: servers, storage, clients
Warehouse DBMS
Tools
Systems integration & consulting
Market grew from $2B in 1995 to $8B in 1998 (Meta Group). Already deployed in many industries: manufacturing, retail, financial, insurance, transportation, telecom, utilities, healthcare.

What is a Data Warehouse?

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." -- W. H. Inmon
A collection of data that is used primarily in organizational decision making
A decision support database that is maintained separately from the organization's operational database


Data Warehouse - Subject Oriented

Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model.
E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
Operational DB and applications may be organized differently
E.g. based on type of insurance: auto, life, medical, fire, ...

Data Warehouse - Integrated

Heterogeneous data sources lack consistency in encoding, naming conventions, etc.
When data is moved to the warehouse, it is converted to a consistent representation.

Data Warehouse - Nonvolatile

Operational data is regularly accessed and manipulated a record at a time, and updates are done to data in the operational environment.
Warehouse data is loaded and accessed; updates of data do not occur in the data warehouse environment.

Data Warehouse - Time Variance

The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational data: current-value data.
Data warehouse data: nothing more than a sophisticated series of snapshots, each taken at some moment in time.
The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.

Why Separate Data Warehouse?


Performance

Special data organization, access methods, and implementation methods are needed to support the multidimensional views and operations typical of OLAP
Complex OLAP queries would degrade performance for operational transactions
Concurrency control and recovery modes of OLTP are not compatible with OLAP analysis

Why Separate Data Warehouse? Function

Missing data: decision support requires historical data which operational DBs do not typically maintain
Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Advantages of Warehousing

High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse:
modify, summarize (store aggregates)
add historical information

Advantages of Mediator Systems


No need to copy data:
less storage
no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

Requirements for Data Warehousing


Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality

The Architecture of Data Warehousing

(Diagram) Operational databases and external data sources feed an extract / transform / load / refresh layer into the data warehouse, which is described by a metadata repository. An OLAP server and data marts sit on top of the warehouse, serving outputs such as reports, OLAP analysis, and data mining.

(Diagram) Typical data warehouse three-tier architecture:
First tier: operational data sources 1..n and the operational data store (ODS), feeding a load manager.
Second tier: the warehouse DBMS, run by a warehouse manager; it holds meta-data, detailed data, lightly and highly summarized data, and archive/backup data.
Third tier: a query manager serving data marts (summarized data in relational or multidimensional databases) and end-user access tools.

Data Sources

Data sources are often the operational systems, providing the lowest level of data. Data sources are designed for operational use, not for decision support, and the data reflect this fact. Multiple data sources are often from different systems, run on a wide range of hardware and much of the software is built in-house or highly customized. Multiple data sources introduce a large number of issues -- semantic conflicts.

Creating and Maintaining a Warehouse


A data warehouse needs several tools that automate or support tasks such as:

Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW)
Data cleaning (finding and resolving inconsistencies in the source data)
Integration and transformation of data (between different data formats, languages, etc.)

Creating and Maintaining a Warehouse


Data loading (loading the data into the data warehouse)
Data replication (replicating source databases into the data warehouse)
Data refreshment
Data archiving
Checking for data quality
Analyzing metadata

Physical Structure of Data Warehouse

There are three basic architectures for constructing a data warehouse:
Centralized
Distributed/Federated
Tiered
The data warehouse is distributed for: load balancing, scalability, and higher availability

Physical Structure of Data Warehouse

(Diagram) Centralized architecture: several clients query one central data warehouse, which is fed by the sources.

Physical Structure of Data Warehouse

(Diagram) Federated architecture: end users (marketing, financial, distribution) query local data marts, built over a logical data warehouse fed by the sources.

Physical Structure of Data Warehouse

(Diagram) Tiered architecture: workstations access highly summarized data in local data marts, which are derived from a physical data warehouse fed by the sources.

Physical Structure of Data Warehouse

Federated architecture: the logical data warehouse is only virtual.
Tiered architecture: the central data warehouse is physical; there exist local data marts on different tiers which store copies or summarizations of the previous tier.

Want to know more about data warehousing schemas? See Appendix A.

Related Concepts

Decision Support System
Business Modeling
OLTP/OLAP
Data Modeling
ETL
Reporting
Data Mining

Decision Support System (DSS): one of the powerful tools of BI

Information technology to help knowledge workers (executives, managers, analysts) make faster and better decisions:
What were the sales volumes by region and by product category in the last year?
How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
Will a 10% discount increase sales volume sufficiently?
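The first of these questions maps directly onto a warehouse query. A minimal sketch in SQL, assuming a hypothetical sales fact table joined to store, product, and time dimension tables (all table and column names are illustrative, not from the lecture):

  -- Sales volume by region and by product category for one year.
  SELECT st.region,
         p.category,
         SUM(f.amount) AS sales_volume
  FROM   sales_fact f
  JOIN   store st    ON f.store_id   = st.store_id
  JOIN   product p   ON f.product_id = p.product_id
  JOIN   time_dim t  ON f.date_id    = t.date_id
  WHERE  t.year = 1997            -- "the last year"
  GROUP BY st.region, p.category;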

Business Modeling

Depicts the overall picture of a business
Sub-categories:
Business Process Modeling: business processes are visually represented as diagrams of simple boxes with arrows and text labels
Process Flow Modeling: describes the various processes that happen in an organization and the relationships between them
Data Flow Modeling: focuses on the flow of data between various business processes

Business Modeling Tools

Data Processing Models


There are two basic data processing models:

OLTP (Online Transaction Processing): describes processing at operational sites; the main aim of OLTP is reliable and efficient processing of a large number of transactions while ensuring data consistency
OLAP (Online Analytical Processing): describes processing at the warehouse; the main aim of OLAP is efficient multidimensional processing of large data volumes

OLTP vs. OLAP

                    OLTP                                  OLAP
Users               Clerk, IT professional                Knowledge worker
Function            Day-to-day operations                 Decision support
DB Design           Application-oriented                  Subject-oriented
Data                Current, up-to-date, detailed,        Historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
Usage               Repetitive                            Ad-hoc
Access              Read/write, index/hash on prim. key   Lots of scans
Unit of Work        Short, simple transaction             Complex query
# Records Accessed  Tens                                  Millions
# Users             Thousands                             Hundreds
DB Size             100 MB - GB                           100 GB - TB
Metric              Transaction throughput                Query throughput, response time
OLAP Multidimensional Databases

Data Modeling

A data model is a conceptual representation of the data structures (tables) required for a database and is very powerful in expressing and communicating the business requirements. It visually represents:

the nature of the data
business rules governing the data
its organization in the database

Data Modeling

Types of data modeling:

Conceptual Data Modeling
Enterprise Data Modeling
Logical Data Modeling
Physical Data Modeling
Relational Data Modeling
Dimensional Data Modeling


ETL

ETL stands for Extraction, Transformation, Loading. Steps involved:

Mapping the data between source systems and the target database (data warehouse or data mart)
Cleansing of source data in the staging area
Transforming cleansed source data and then loading it into the target system (a sketch of the last two steps follows)
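A minimal sketch of the cleanse-and-load steps in SQL, assuming a hypothetical staging table stg_sales and target fact table fact_sales (names and cleansing rules are illustrative):

  -- Transform cleansed staging rows and load them into the target.
  INSERT INTO fact_sales (order_id, cust_id, prod_id, sale_date, amount)
  SELECT s.order_id,
         s.cust_id,
         s.prod_id,
         CAST(s.sale_date AS DATE),  -- transform: normalize the date type
         s.amount
  FROM   stg_sales s
  WHERE  s.amount IS NOT NULL        -- cleanse: reject incomplete records
    AND  s.amount >= 0;              -- cleanse: basic integrity check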

ETL Tools

Reporting

Business intelligence reporting tools provide different views of the data by pivoting or rotating it across several dimensions. Nowadays all OLAP tools support reporting. Excel sheets and flat files are the standard reporting media.

Data Mining

Data mining is a set of processes for analyzing and discovering useful, actionable knowledge buried deep beneath large volumes of data stores or data sets.
This knowledge discovery involves finding patterns or behaviors within the data that lead to some profitable business action.
Data mining life cycle: business problem analysis, knowledge discovery, implementation, results analysis.

Typical Data Warehouse

Lecture II
Design and Implementation

Database design methodology for data warehouses
Nine-step methodology proposed by Kimball:

Step 1: Choosing the process
Step 2: Choosing the grain
Step 3: Identifying and conforming the dimensions
Step 4: Choosing the facts
Step 5: Storing the pre-calculations in the fact table
Step 6: Rounding out the dimension tables
Step 7: Choosing the duration of the database
Step 8: Tracking slowly changing dimensions
Step 9: Deciding the query priorities and the query modes

Database design methodology for data warehouses

There are many approaches that offer alternative routes to the creation of a data warehouse. A typical approach decomposes the design of the data warehouse into manageable parts: data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse. The methodology specifies the steps required for the design of a data mart; however, it also ties separate data marts together so that over time they merge into a coherent overall data warehouse.

Step 1: Choosing the process

The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions. The best choice for the first data mart tends to be the one that is related to sales.

Step 2: Choosing the grain

Choosing the grain means deciding exactly what a fact table record represents. For example, the entity Sales may represent the facts about each property sale; the grain of the Property_Sales fact table is then an individual property sale. Only when the grain for the fact table is chosen can we identify the dimensions of the fact table. The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for Property_Sales is an individual property sale, then the grain of the Client dimension is the detail of the client who bought a particular property.

Step 3: Identifying and conforming the dimensions

Dimensions set the context for formulating queries about the facts in the fact table. We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two data marts can share one or more dimensions in the same application). When a dimension is used in more than one data mart, the dimension is referred to as being conformed.

Step 4: Choosing the facts

The grain of the fact table determines which facts can be used in the data mart: all facts must be expressed at the level implied by the grain. In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale (the facts should be numeric and additive).

Step 5: Storing pre-calculations in the fact table

Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations. A common example is a profit or loss statement. These types of facts are useful since they are additive quantities from which we can derive valuable information. This is particularly true for a value that is fundamental to an enterprise, or if there is any chance of a user calculating the value incorrectly.
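A minimal sketch of storing such a pre-calculation, assuming hypothetical revenue and cost columns on the fact table (names and types are illustrative):

  -- Store profit explicitly rather than letting each user derive it.
  ALTER TABLE property_sales ADD profit DECIMAL(12,2);

  UPDATE property_sales
  SET    profit = revenue - cost;  -- additive, so it aggregates correctly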

Step 6: Rounding out the dimension tables

In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible. The text descriptions should be as intuitive and understandable to the users as possible

Step 7: Choosing the duration of the data warehouse


The duration measures how far back in time the fact table goes. For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.
Very large fact tables raise at least two very significant data warehouse design issues:
The older the data, the more likely there will be problems in reading and interpreting the old files
It is mandatory that the old versions of the important dimensions be used, not the most current versions (we will discuss this issue later on)

Step 8: Tracking slowly changing dimensions

The changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse schema. Usually, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time.
There are different types of changes in dimensions (a sketch of the second type follows):
a dimension attribute is overwritten
a dimension attribute causes a new dimension record to be created
etc.
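The second type of change (Type 2 in Kimball's terminology) is the one that requires the generalized key. A minimal sketch, assuming a hypothetical client_dim table with a surrogate key and validity dates (all names are illustrative):

  -- Close the current snapshot of client 53 instead of overwriting it...
  UPDATE client_dim
  SET    valid_to = CURRENT_DATE
  WHERE  client_id = 53
    AND  valid_to IS NULL;

  -- ...and insert the new snapshot under a fresh generalized (surrogate) key.
  INSERT INTO client_dim (client_key, client_id, city, valid_from, valid_to)
  VALUES (1001, 53, 'la', CURRENT_DATE, NULL);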

Step 9: Deciding the query priorities and the query modes

In this step we consider physical design issues:
the presence of pre-stored summaries and aggregates
indices
materialized views
security issues
backup issues
archive issues

Database design methodology for data warehouses - summary

At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows the easy integration with other related data marts to ultimately form the enterprise-wide data warehouse. A dimensional model, which contains more than one fact table sharing one or more conformed dimension tables, is referred to as a fact constellation.

Implementing a Warehouse

Designing and rolling out a data warehouse is a complex process, consisting of the following activities:

Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers (ROLAP vs. MOLAP), and tools
Integrate the servers, storage, and client tools
Design the warehouse schema and views

Implementing a Warehouse

Define the physical warehouse organization, data placement, partitioning, and access methods
Connect the sources using gateways, ODBC drivers, or other wrappers
Design and implement scripts for data extraction, cleaning, transformation, load, and refresh

Implementing a Warehouse

Populate the repository with the schema and view definitions, scripts, and other metadata
Design and implement end-user applications
Roll out the warehouse and applications

Implementing a Warehouse

Monitoring: sending data from sources
Integrating: loading, cleansing, ...
Processing: query processing, indexing, ...
Managing: metadata, design, ...

Monitoring

Data Extraction

Data extraction from external sources is usually implemented via gateways and standard interfaces (such as Information Builders EDA/SQL, ODBC, JDBC, Oracle Open Connect, Sybase Enterprise Connect, Informix Enterprise Gateway, etc.)

Monitoring Techniques

Detect changes to an information source that are of interest to the warehouse:
define triggers in a full-functionality DBMS
examine the updates in the log file
write programs for legacy systems
polling (queries to source)
screen scraping
Propagate the changes in a generic form to the integrator

Integration

Integrator

Receive changes from the monitors:
make the data conform to the conceptual schema used by the warehouse
merge the data with existing data already present
resolve possible update anomalies
Integrate the changes into the warehouse:
data cleaning
data loading

Data Cleaning

Data cleaning is important for the warehouse since there is a high probability of errors and anomalies in the data:
inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries, and violations of integrity constraints
optional fields in data entry are significant sources of inconsistent data

Data Cleaning Techniques

Data migration: allows simple data transformation rules to be specified, e.g. replace the string "gender" by "sex" (Warehouse Manager from Prism is an example of this tool); a sketch follows
Data scrubbing: uses domain-specific knowledge to scrub data (e.g. postal addresses) (Integrity and Trillium fall in this category)
Data auditing: discovers rules and relationships by scanning data (detects outliers); such tools may be considered variants of data mining tools
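A minimal data-migration sketch in SQL: a single rule-based transformation applied in the staging area (table, column, and code values are illustrative assumptions):

  -- Standardize inconsistent gender encodings before loading.
  UPDATE stg_customer
  SET    gender = CASE UPPER(gender)
                    WHEN 'M'      THEN 'Male'
                    WHEN 'MALE'   THEN 'Male'
                    WHEN 'F'      THEN 'Female'
                    WHEN 'FEMALE' THEN 'Female'
                    ELSE NULL     -- surface unknown encodings for auditing
                  END;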

Data Loading

After extracting, cleaning and transforming, data must be loaded into the warehouse. Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, etc. Typically, batch load utilities are used for loading. A load utility must allow the administrator to monitor status, to cancel, suspend, and resume a load, and to restart after failure with no loss of data integrity

Data Loading Issues

The load utilities for data warehouses have to deal with very large data volumes Sequential loads can take a very long time. Full load can be treated as a single long batch transaction that builds up a new database. Using checkpoints ensures that if a failure occurs during the load, the process can restart from the last checkpoint

Data Refresh

Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse.
When to refresh:
periodically (daily or weekly) or immediately (deferred refresh vs. immediate refresh)
determined by usage, types of data source, etc.

Data Refresh

How to refresh:
data shipping
transaction shipping

Most commercial DBMSs provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas. Such replication servers can be used to incrementally refresh a warehouse when sources change.

Data Shipping

Data shipping (e.g. Oracle Replication Server): a table in the warehouse is treated as a remote snapshot of a table in the source database. An after-row trigger is used to update the snapshot log table and propagate the updated data to the warehouse.

Transaction Shipping

Transaction shipping (e.g. Sybase Replication Server, Microsoft SQL Server): the regular transaction log is used. The transaction log is checked to detect updates on replicated tables, and those log records are transferred to a replication server, which packages up the corresponding transactions to update the replicas.

Derived Data

Derived warehouse data:
indexes
aggregates
materialized views
When to update derived data? The most difficult problem is how to refresh the derived data; the problem of constructing algorithms for incrementally updating derived data has been the subject of much research.

Materialized Views

Define new warehouse relations using SQL expressions.

sale:
prodId  clientId  date  amt
p1      c1        1     12
p2      c1        1     11
p1      c3        1     50
p2      c2        1     8
p1      c1        2     44
p1      c2        2     4

product:
id  name  price
p1  bolt  10
p2  nut   5

joinTb (join of sale and product):
prodId  name  price  clientId  date  amt
p1      bolt  10     c1        1     12
p2      nut   5      c1        1     11
p1      bolt  10     c3        1     50
p2      nut   5      c2        1     8
p1      bolt  10     c1        2     44
p1      bolt  10     c2        2     4

Processing

Index structures
What to materialize?
Algorithms

Index Structures

Indexing principle: mapping key values to records for associative direct access
Most popular indexing technique in relational databases: B+-trees
For multi-dimensional data, a large number of indexing techniques have been developed: R-trees

Index Structures

Index structures applied in warehouses:
inverted lists
bit map indexes
join indexes
text indexes
(See Appendix C for details.)

What to Materialize?

Store in the warehouse results useful for common queries. Example: total sale.

Base cube (amt by product and client):
day 1:       c1   c2   c3
       p1    12        50
       p2    11    8
day 2:       c1   c2   c3
       p1    44    4

Materialize the sums over days:
       c1   c2   c3
p1     56    4   50
p2     11    8

and further aggregates: by client (c1: 67, c2: 12, c3: 50), by product (p1: 110, p2: 19), grand total: 129.

View and Materialized Views

View: a derived relation defined in terms of base (stored) relations
Materialized view: a view can be materialized by storing the tuples of the view in the database; index structures can be built on the materialized view

View and Materialized Views

Maintenance is an issue for materialized views:
recomputation
incremental updating
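A sketch of the joinTb example above kept as a materialized view, in Oracle-style syntax (an assumption: exact syntax varies by DBMS, and incremental "fast" refresh additionally requires materialized view logs on the base tables):

  -- Materialize the sale/product join; the DBMS keeps it up to date.
  -- ("date" is a reserved word, so the column is renamed saleDate here.)
  CREATE MATERIALIZED VIEW joinTb
    REFRESH FAST ON COMMIT        -- incremental updating on each commit
  AS
  SELECT s.prodId, p.name, p.price, s.clientId, s.saleDate, s.amt
  FROM   sale s, product p
  WHERE  s.prodId = p.id;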

Managing

Metadata Repository

Administrative metadata

source databases and their contents
gateway descriptions
warehouse schema, view and derived data definitions
dimensions and hierarchies
pre-defined queries and reports
data mart locations and contents

Metadata Repository

Administrative metadata

data partitions
data extraction, cleansing, and transformation rules, defaults
data refresh and purge rules
user profiles, user groups
security: user authorization, access control

Metadata Repository

Business metadata:
business terms & definitions
data ownership, charging
Operational metadata:
data layout
data currency (e.g., active, archived, purged)
use statistics, error reports, audit trails

Importance of managing metadata

Meta-data, that is, data about data, is used for a variety of purposes, and its management is a critical issue in achieving a fully integrated data warehouse.
The major purpose of meta-data is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse.
The meta-data associated with data transformation and loading must describe the source data and any changes that were made to the data.
The meta-data associated with data management describes the data as it is stored in the warehouse.
Meta-data is required by the query manager to generate appropriate queries, and is also associated with the users of queries.

State of Commercial Practice

Products and vendors [Datamation, May 15, 1996; R.C. Barquin, H.A. Edelstein: Planning and Designing the Data Warehouse, Prentice Hall, 1997]

Connectivity to sources: Apertus, CA-Ingres Gateway, Information Builders EDA/SQL, IBM Data Joiner, Informix Enterprise Gateway, Microsoft ODBC, Oracle Open Connect, Platinum InfoHub, SAS Connect, Software AG Entire, Sybase Enterprise Connect, Trinzic InfoHub

Data extract, clean, transform, refresh: CA-Ingres Replicator, Carleton Passport, Evolutionary Tech Inc. ETI-Extract, Harte-Hanks Trillium, IBM Data Joiner and Data Propagator, Oracle 7, Platinum InfoRefiner and InfoPump, Praxis OmniReplicator, Prism Warehouse Manager, Redbrick TMU, SAS Access, Software AG SourcePoint, Sybase Replication Server, Trinzic InfoPump

State of Commercial Practice


Multidimensional database engines: Arbor Essbase, Comshare Commander OLAP, Oracle IRI Express, SAS System

Warehouse data servers: CA-Ingres, IBM DB2, Information Builders Focus, Informix, Oracle, Praxis Model 204, Redbrick, Software AG ADABAS, Sybase MPP, Tandem, Teradata

ROLAP servers: HP Intelligent Warehouse, Information Advantage Axsys, Informix Metacube, MicroStrategy DSS Server

State of Commercial Practice

Query/reporting environments: Brio/Query, Business Objects, Cognos Impromptu, CA Visual Express, IBM DataGuide, Information Builders Focus Six, Informix ViewPoint, Platinum Forest & Trees, SAS Access, Software AG Esperant

Multidimensional analysis: Andyne Pablo, Arbor Essbase Analysis Server, Business Objects, Cognos PowerPlay, Dimensional Insight CrossTarget, Holistic Systems HOLOS, Information Advantage Decision Suite, IQ Software IQ/Vision, Kenan System Acumate, Lotus 1-2-3, Microsoft Excel, MicroStrategy DSS, Pilot Lightship, Platinum Forest & Trees, Prodea Beacon, SAS OLAP++, Stanford Technology Group Metacube

State of Commercial Practice

Metadata management: HP Intelligent Warehouse, Platinum Repository, IBM DataGuide, Prism Directory Manager

System management: CA Unicenter, HP OpenView, IBM DataHub and NetView, Information Builders Site Analyzer, Prism Warehouse Manager, SAS CPE, Tivoli, Software AG SourcePoint, Redbrick Enterprise Control and Coordination

Process management: AT&T TOPEND, IBM FlowMark, Prism Warehouse Manager, HP Intelligent Warehouse, Platinum Repository, Software AG SourcePoint

Systems integration and consulting

Research

Data cleaning:
focus on data inconsistencies, not schema differences
data mining techniques

Physical design:
design of summary tables, partitions, indexes
tradeoffs in use of different indexes

Query processing:
selecting appropriate summary tables
dynamic optimization with feedback
acid test for query optimization: cost estimation, use of transformations, search strategies
partitioning query processing between OLAP server and backend server

Research

Warehouse Management

detecting runaway queries
resource management
incremental refresh techniques
computing summary tables during load
failure recovery during load and refresh
process management: scheduling queries, load and refresh
use of workflow technology for process management

References

www.toug.org/files/tougpr200302_4.ppt www-db.stanford.edu/~hector/cs245/Notes12.ppt www.epa.gov/storet/conf/Wilson_Data_Warehouse.ppt www.learndatamodeling.com www.learnbi.com www.datawarehousing.ittoolbox.com www.datawarehousing.com

Thank You

QUESTIONS?

APPENDIX A
Data warehouse Schemas

Star schema

A single object (fact table) in the middle connected to a number of dimension tables:

sale(orderId, date, custId, prodId, storeId, qty, amt)
product(prodId, name, price)
customer(custId, name, address, city)
store(storeId, city)
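A minimal DDL sketch of this star schema (types and constraints are assumptions added for illustration; the lecture gives only the attribute lists):

  CREATE TABLE product  (prodId  VARCHAR(10) PRIMARY KEY,
                         name    VARCHAR(40),
                         price   DECIMAL(8,2));
  CREATE TABLE customer (custId  INT PRIMARY KEY,
                         name    VARCHAR(40),
                         address VARCHAR(60),
                         city    VARCHAR(30));
  CREATE TABLE store    (storeId VARCHAR(10) PRIMARY KEY,
                         city    VARCHAR(30));

  -- The fact table in the middle references every dimension table.
  CREATE TABLE sale (
    orderId  VARCHAR(10) PRIMARY KEY,
    saleDate DATE,       -- "date" is reserved in many DBMSs, hence renamed
    custId   INT         REFERENCES customer(custId),
    prodId   VARCHAR(10) REFERENCES product(prodId),
    storeId  VARCHAR(10) REFERENCES store(storeId),
    qty      INT,
    amt      DECIMAL(10,2)
  );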

Star schema

product:
prodId  name  price
p1      bolt  10
p2      nut   5

store:
storeId  city
c1       nyc
c2       sfo
c3       la

sale:
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50

customer:
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

Terms

Basic notion: a measure (e.g. sales, qty, etc.)
Given a collection of numeric measures, each measure depends on a set of dimensions (e.g. sales volume as a function of product, time, and location)

Terms

The relation which relates the dimensions to the measure of interest is called the fact table (e.g. sale)
Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store)
Each dimension can have a set of associated attributes

Example of Star Schema


Sales fact table: Date, Product, Store, Customer, unit_sales, dollar_sales, schilling_sales (the measurements)

Dimension tables:
Date: Date, Month, Year
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Store: StoreID, City, State, Country, Region
Customer: CustId, CustName, CustCity, CustCountry

Dimension Hierarchies
For each dimension, the set of associated attributes can be structured as a hierarchy, e.g.:

store -> city -> region (and store -> sType)
customer -> city -> state -> country

Dimension Hierarchies
store:
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType:
tId  size   location
t1   small  downtown
t2   large  suburbs

city:
cityId  pop  regId
sfo     1M   north
la      5M   south

region:
regId  name
north  cold region
south  warm region

Snowflake Schema
A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables

Example of Snowflake Schema


Sales fact table: Date, Product, Store, Customer, unit_sales, dollar_sales, schilling_sales (the measurements)

Normalized dimension tables:
Date: Date, Month -> Month: Month, Year -> Year: Year
Store: StoreID, City -> City: City, State -> State: State, Country -> Country: Country, Region
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Customer: CustId, CustName, CustCity, CustCountry

Fact constellations
Fact constellations: Multiple fact tables share dimension tables


APPENDIX B
Data Modeling & OLAP

Multidimensional Data Model

Sales of products may be represented in one dimension (as a fact relation) or in two dimensions, e.g. : clients and products

Fact relation (sale):
Product  Client  Amt
p1       c1      12
p2       c1      11
p1       c3      50
p2       c2      8

Two-dimensional cube:
      c1   c2   c3
p1    12        50
p2    11    8

Multidimensional Data Model


Fact relation (sale):
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

3-dimensional cube:
day 1:      c1   c2   c3
      p1    12        50
      p2    11    8
day 2:      c1   c2   c3
      p1    44    4

Multidimensional Data Model and Aggregates


Add up amounts for day 1. In SQL:

SELECT sum(Amt) FROM SALE WHERE Date = 1

Over the sale table above, the result is 81.

Multidimensional Data Model and Aggregates


Add up amounts by day. In SQL:

SELECT Date, sum(Amt) FROM SALE GROUP BY Date

result:
Date  sum
1     81
2     48

Multidimensional Data Model and Aggregates


Add up amounts by client and product. In SQL:

SELECT Client, Product, sum(Amt) FROM SALE GROUP BY Client, Product

Over the sale table above, the result is:

Product  Client  Sum
p1       c1      56
p1       c2      4
p1       c3      50
p2       c1      11
p2       c2      8

Multidimensional Data Model and Aggregates

In the multidimensional data model, together with the measure values, we usually store summarizing information (aggregates):

       c1   c2   c3   Sum
p1     56    4   50   110
p2     11    8         19
Sum    67   12   50   129

Aggregates

Operators: sum, count, max, min, median, avg
HAVING clause
Using the dimension hierarchy:
average by region (within store)
maximum by month (within date)

Cube Aggregation
Example: computing sums.

Base cube:
day 1:      c1   c2   c3
      p1    12        50
      p2    11    8
day 2:      c1   c2   c3
      p1    44    4

Sum over days:
      c1   c2   c3
p1    56    4   50
p2    11    8

Sum over days and products:
c1   c2   c3
67   12   50

Sum by product: p1 = 110, p2 = 19; grand total = 129.

Cube Operators

Cube cells are addressed with * meaning "all":
sale(c1,*,*) = 67: total sales to client c1
sale(c2,p2,*) = 8: sales of product p2 to client c2
sale(*,*,*) = 129: grand total

Cube
The full data cube stores all aggregates (* rows and columns):

day 1:       c1   c2   c3    *
      p1     12        50   62
      p2     11    8        19
      *      23    8   50   81
day 2:       c1   c2   c3    *
      p1     44    4        48
      *      44    4        48
all days:    c1   c2   c3    *
      p1     56    4   50  110
      p2     11    8        19
      *      67   12   50  129

e.g. sale(*,p2,*) = 19.

Aggregation Using Hierarchies


Hierarchy: customer -> region -> country
(customer c1 in Region A; customers c2, c3 in Region B)

Day 1 amounts aggregated by region:
      region A   region B
p1    12         50
p2    11         8

Aggregation Using Hierarchies


Hierarchy: client -> city -> region
(Figure: sales of CD, video, and camera by client and date of sale; clients c1, c2, c3 in Chennai, c4 in Bangalore.)

Aggregation with respect to city:
      Video  Camera  CD
CH    22     8       30
BN    23     18      22

A Sample Data Cube


(Figure: a sample data cube with dimensions Date (1Q-4Q), Product (camera, video, CD), and Country (USA, Canada, Mexico), with sum cells along each dimension.)

OLAP Servers

Relational OLAP (ROLAP):
extended relational DBMS that maps operations on multidimensional data to standard relational operations
stores all information, including fact tables, as relations

Multidimensional OLAP (MOLAP):
special-purpose server that directly implements multidimensional data and operations
stores multidimensional datasets as arrays

OLAP Servers

Hybrid OLAP (HOLAP):

Give users/system administrators freedom to select different partitions.

OLAP Queries

Roll up: summarize data along a dimension hierarchy

If we are given the total sales volume per city, we can aggregate on the location dimension to obtain sales per state
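A minimal roll-up sketch in SQL along the location hierarchy (city -> state), assuming hypothetical city_sales and city tables:

  -- Roll total sales per city up to totals per state.
  SELECT c.state,
         SUM(s.sales_volume) AS sales_volume
  FROM   city_sales s
  JOIN   city c ON s.city_id = c.city_id
  GROUP BY c.state;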

OLAP Queries
(Figure: the Chennai/Bangalore sales example above, rolled up along client -> city -> region.)

Aggregation with respect to city:
      Video  Camera  CD
CH    22     8       30
BN    23     18      22

OLAP Queries

Roll down, drill down: go from higher level summary to lower level summary or detailed data

For a particular product category, find the detailed sales data for each salesperson by date
Given total sales by state, we can ask for sales per city, or just sales by city for a selected state

OLAP Queries
Base cube:
day 1:      c1   c2   c3
      p1    12        50
      p2    11    8
day 2:      c1   c2   c3
      p1    44    4

roll up ->

Sum over days:
      c1   c2   c3
p1    56    4   50
p2    11    8

roll up ->

Sum over days and products: c1 = 67, c2 = 12, c3 = 50; by product: p1 = 110, p2 = 19; grand total = 129.

Drill-down goes in the opposite direction.

OLAP Queries

Slice and dice: select and project
e.g. sales of video in India over the last 6 months
slicing and dicing reduce the number of dimensions

Pivot: reorient the cube
the result of pivoting is called a cross-tabulation
if we pivot the Sales cube on the Client and Product dimensions, we obtain a table giving, for each client and each product, the total sales value

OLAP Queries
Pivoting can be combined with aggregation.

sale:
prodId  clientId  date  amt
p1      c1        1     12
p2      c1        1     11
p1      c3        1     50
p2      c2        1     8
p1      c1        2     44
p1      c2        2     4

Cross-tabulation by date:
       c1   c2   c3   Sum
1      23    8   50    81
2      44    4          48
Sum    67   12   50   129

Cross-tabulation by product:
       c1   c2   c3   Sum
p1     56    4   50   110
p2     11    8          19
Sum    67   12   50   129

OLAP Queries

Ranking: selection of the first n elements (e.g. select the 5 best-selling products in July)
Others: stored procedures, selection, etc.
Time functions, e.g. time average

Cube Operation

SELECT date, product, customer, SUM(amount)
FROM SALES
CUBE BY date, product, customer

This needs to compute the following group-bys:
(date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), and ()
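The CUBE BY notation above comes from the research literature; in SQL:1999 and current DBMSs (e.g. PostgreSQL, Oracle, SQL Server) the same computation is written with GROUP BY CUBE. A minimal sketch:

  -- Computes all 2^3 groupings of the three dimensions in one statement.
  SELECT date, product, customer, SUM(amount) AS total
  FROM   sales
  GROUP BY CUBE (date, product, customer);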

Cuboid Lattice

Data cube can be viewed as a lattice of cuboids:
the bottom-most cuboid is the base cube
the top-most cuboid contains only one cell

(A,B,C,D)
(A,B,C)  (A,B,D)  (A,C,D)  (B,C,D)
(A,B)  (A,C)  (A,D)  (B,C)  (B,D)  (C,D)
(A)  (B)  (C)  (D)
(all)

Cuboid Lattice
Lattice for the sale example (dimensions city, product, date):

(all): 129
(city): c1 = 67, c2 = 12, c3 = 50
(product), (date), (city, date), (product, date): analogous
(city, product):
      c1   c2   c3
p1    56    4   50
p2    11    8
(city, product, date): the base cube (day 1 / day 2 slices)

Use a greedy algorithm to decide what to materialize.

Efficient Data Cube Computation

Materialization of the data cube:
materialize every cuboid, none, or some
Algorithms for selecting which cuboids to materialize consider size, sharing, and access frequency:
type/frequency of queries
query response time
storage cost
update cost

Dimension Hierarchies

Client hierarchy: city -> state -> region

city  state  region
c1    CA     East
c2    NY     East
c3    SF     West

Dimension Hierarchies Computation


With dimension hierarchies, the lattice grows: rolling up along the client hierarchy adds cuboids such as (state), (state, date), (state, product), and (state, product, date) alongside the (city, ...) cuboids, between the base cuboid (city, product, date) and (all).

(city, product, date) -> roll-up along client hierarchy -> (state, product, date)

Cube Computation - Array Based Algorithm

An MOLAP approach:
the base cuboid is stored as a multidimensional array
read in a number of cells to compute partial cuboids

Cube computations over dimensions A, B, C: from the array {ABC} compute {AB}, {AC}, {BC}, then {A}, {B}, {C}, then {} (ALL).



APPENDIX C
Index Structures

Inverted Lists
age index (inverted lists): each age value maps to the list of ids of the records with that age, e.g.

age 20 -> r4, r18, r34, r35
age 21 -> r5, r19, r37, r40

data records:
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26

Inverted Lists

Query: get people with age = 20 and name = fred

List for age = 20: r4, r18, r34, r35
List for name = fred: r18, r52
Answer is the intersection: r18

Bitmap Indexes

Bitmap index: an indexing technique that has attracted attention in multi-dimensional database implementations.

table:
Customer  City     Car
c1        Detroit  Ford
c2        Chicago  Honda
c3        Detroit  Honda
c4        Poznan   Ford
c5        Paris    BMW
c6        Paris    Nissan

Bitmap Indexes

The index consists of bitmaps. Index on City:

rec  Chicago  Detroit  Paris  Poznan
1    0        1        0      0
2    1        0        0      0
3    0        1        0      0
4    0        0        0      1
5    0        0        1      0
6    0        0        1      0

Bitmap Indexes
Index on Car:

rec  BMW  Ford  Honda  Nissan
1    0    1     0      0
2    0    0     1      0
3    0    0     1      0
4    0    1     0      0
5    1    0     0      0
6    0    0     0      1

Bitmap Indexes

Index on a particular column
The index consists of a number of bit vectors (bitmaps)
Each value in the indexed column has a bit vector (bitmap)
The length of each bit vector is the number of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
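In DBMSs that expose bitmap indexes directly, creating one is a single statement. A sketch in Oracle-style syntax for the Customer table above (other systems choose bitmap representations internally):

  -- One bitmap per distinct city / car value is built automatically.
  CREATE BITMAP INDEX customer_city_bix ON customer (city);
  CREATE BITMAP INDEX customer_car_bix  ON customer (car);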

Bitmap Index
age index (one bitmap per age value, over the 8 data records below):
age 20 -> 11011000
age 21 -> 00100010
age 25 -> 00000100
age 26 -> 00000001

data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26

Using Bitmap indexes

Query: get people with age = 20 and name = fred

Bitmap for age = 20: 1101100000
Bitmap for name = fred: 0100000001
Answer is the intersection (bitwise AND): 0100000000
Good if domain cardinality is small
Bit vectors can be compressed

Using Bitmap indexes

Bitmap indexes allow the use of efficient bit operations to answer some queries:
How many customers from Detroit have a Ford? Perform a bit-wise AND of the two bitmaps; answer: c1
How many customers have a Honda? Count the 1s in the bitmap; answer: 2
Compression: bit vectors are usually sparse for large databases, but compression brings the need for decompression

Bitmap Index Summary

With efficient hardware support for bitmap operations (AND, OR, XOR, NOT), a bitmap index offers better access methods for certain queries:
e.g., selection on two attributes
Some commercial products have implemented bitmap indexes
Works poorly for high-cardinality domains since the number of bitmaps increases
Difficult to maintain: needs reorganization when relation sizes change (new bitmaps)

Join

Combine the SALE and PRODUCT relations. In SQL:

SELECT * FROM SALE, PRODUCT

sale:
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product:
id  name  price
p1  bolt  10
p2  nut   5

joinTb:
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4

Join Indexes
join index on product:
id  name  price  jIndex
p1  bolt  10     r1, r3, r5, r6
p2  nut   5      r2, r4

sale:
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4

Join Indexes

Traditional indexes map a value to a list of record ids. Join indexes map the tuples in the join result of two relations to the source tables. In data warehouses, join indexes relate the values of the dimensions of a star schema to rows in the fact table.
For a warehouse with a Sales fact table and dimension city, a join index on city maintains, for each distinct city, a list of RIDs of the tuples recording the sales in that city.
Join indexes can span multiple dimensions.
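Some DBMSs expose this structure directly; Oracle's bitmap join index, for example, materializes a dimension-to-fact mapping like the one above (a sketch with illustrative names; Oracle also requires a unique key on the dimension join column):

  -- Index sales fact rows by the city of the joined customer dimension.
  CREATE BITMAP INDEX sales_cust_city_bjix
    ON sales (customer.city)
    FROM sales, customer
    WHERE sales.custId = customer.custId;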
