System Processes
Data warehouses are built to support large data volumes (above 100 GB of data) cost-effectively.
A data warehouse must be architected to support three major driving factors:
- Populating the warehouse
- Day-to-day management of the warehouse
- The ability to cope with requirements evolution
The processes required to populate the warehouse focus on extracting the data, cleaning it up, and making it available for analysis.
CH#3, By: Babu Ram Dawadi
Before we create an architecture for a data warehouse, we must first understand the major processes that constitute a data warehouse. The processes are:
- Extract and load the data.
- Clean and transform the data into a form that can cope with large volumes and provide good query performance.
- Back up and archive the data.
- Manage queries, and direct them to the appropriate data sources.
Data extraction takes data from the source systems and makes it available to the data warehouse. Data load takes the extracted data and loads it into the DW. When we extract data from the physical database, whatever form it is held in, the original information content will have been modified and extended over the years. Before loading the data into the DW, the information content must be reconstructed: the extract and load process must take the data and add context and meaning, in order to convert it into value-adding business information.
Process Controlling
The mechanisms that determine when to start extracting the data, run the transformations and consistency checks, and so on are very important. To ensure that the various tools, logic modules and programs are executed in the correct sequence and at the correct time, a controlling mechanism is required to fire each module when appropriate.
Initiate extraction
Data should be in a consistent state when it is extracted from the source system. The information in a data warehouse represents a snapshot of corporate information, so that the user is looking at a single, consistent version of the truth. Guideline: start extracting data from a data source only when it represents the same snapshot of time as all the other data sources.
Extraction
Once the data is extracted from the source systems, it is typically loaded into a temporary data store so that it can be cleaned up and made consistent. Guideline: do not execute consistency checks until all the data sources have been loaded into the temporary data store. Once the data in the temporary data store has been cleaned up, it is transformed into the warehouse by the warehouse manager.
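As a rough sketch of this ordering (extract everything into the temporary data store first, check consistency only afterwards), the following uses Python's sqlite3 as a stand-in staging database; the table names and the orphan-row check are illustrative assumptions, not part of any real load manager.

```python
# Sketch: fast-load extracted rows into staging tables, then check consistency.
import sqlite3

def extract(source_rows, staging, table):
    """Fast-load extracted rows into a temporary (staging) table."""
    staging.executemany(f"INSERT INTO {table} VALUES (?, ?)", source_rows)

staging = sqlite3.connect(":memory:")  # the temporary data store
staging.execute("CREATE TABLE stg_sales (prod_id TEXT, amt REAL)")
staging.execute("CREATE TABLE stg_products (prod_id TEXT, name TEXT)")

# Extract every source first...
extract([("p1", 12.0), ("p2", 11.0)], staging, "stg_sales")
extract([("p1", "bolt"), ("p2", "nut")], staging, "stg_products")

# ...and only then run consistency checks, per the guideline above.
orphans = staging.execute(
    "SELECT COUNT(*) FROM stg_sales s "
    "WHERE NOT EXISTS (SELECT 1 FROM stg_products p WHERE p.prod_id = s.prod_id)"
).fetchone()[0]
print("orphan sales rows:", orphans)  # 0 once both sources are staged
```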
This is the system process that takes the loaded data and structures it for query performance and for minimizing operational costs. The steps for cleaning and transforming are:
- Clean and transform the loaded data into a structure that speeds up queries.
- Partition the data in order to speed up queries, optimize hardware performance, and simplify the management of the DW.
- Create aggregations to speed up the common queries.
Data needs to be cleaned and checked in the following ways:
- Make sure the data is consistent within itself.
- Make sure the data is consistent with other data within the same source.
- Make sure the data is consistent with data in the other source systems.
- Make sure the data is consistent with the information already in the DW.
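The four levels of consistency checking above can be illustrated with toy data; the row layouts and rule choices here are assumptions made for the example, not a standard cleaning API.

```python
# Toy rows from two source systems, plus keys already in the DW.
rows_src_a = [{"cust": 53, "amt": 12}, {"cust": 81, "amt": 11}]
rows_src_b = [{"cust": 53, "country": "US"}, {"cust": 81, "country": "CA"}]
dw_customers = {53, 81, 111}

# 1. Consistent within itself: no missing mandatory fields.
assert all(r["cust"] is not None for r in rows_src_a)
# 2. Consistent with other data in the same source: no duplicate keys.
keys = [r["cust"] for r in rows_src_a]
assert len(keys) == len(set(keys))
# 3. Consistent with the other source system: keys match across sources.
assert {r["cust"] for r in rows_src_a} <= {r["cust"] for r in rows_src_b}
# 4. Consistent with the information already in the DW.
assert {r["cust"] for r in rows_src_a} <= dw_customers
print("all checks passed")
```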
As in operational systems, the data within the data warehouse is backed up regularly, in order to ensure that the DW can always be recovered after data loss, software failure or hardware failure.
In archiving, older data is removed from the system in a format that allows it to be quickly restored if required.
The query management process is the system process that manages the queries and speeds them up by directing each query to the most effective data source. The query management process may also be required to monitor the actual query profiles. Unlike the other system processes, query management does not generally operate during the load of information into the DW. The query management facilities are:
Directing Queries
The query management process determines which table can deliver the answer most effectively, by calculating which table would satisfy the query in the shortest space of time.
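A minimal sketch of this kind of routing, using row counts as a stand-in cost estimate (the table names, metadata layout and `direct_query` helper are all invented for illustration):

```python
# Candidate tables with a toy cost model: fewer rows to scan ~ shorter time.
tables = {
    "sales_detail":   {"rows": 50_000_000, "answers": {"by_day", "by_month"}},
    "sales_by_month": {"rows": 1_200,      "answers": {"by_month"}},
}

def direct_query(granularity):
    """Pick the table that can satisfy the query in the least estimated time."""
    candidates = [t for t, meta in tables.items() if granularity in meta["answers"]]
    return min(candidates, key=lambda t: tables[t]["rows"])

print(direct_query("by_month"))  # sales_by_month: the small summary table wins
print(direct_query("by_day"))    # sales_detail: only the detail table can answer
```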
Query Capture
Users are exploiting the information content of the DW, which implies that query profiles change on a regular basis over the life of a DW.
At various points in time, such as the end of the week, these queries can be analyzed to capture the new query profiles and the resulting impact on the summary tables.
Query capture is typically part of the query management process.
Process Architecture
The system processes describe the major processes that constitute a data warehouse.
The process architecture now outlines a complete data warehouse architecture that encompasses these processes. The complexity of each manager in a data warehouse will vary from DW to DW.
Enterprise warehouse
- Collects all of the information about subjects spanning the entire organization.
Data Mart
- A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
- Independent vs. dependent (loaded directly from the warehouse) data marts.
Virtual warehouse
- A set of views over operational databases. Only some of the possible summary views may be materialized.
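A virtual warehouse as "a set of views over operational databases" can be sketched with Python's sqlite3 standing in for the operational system; the table and view names are illustrative.

```python
# An operational table plus a non-materialized view over it.
import sqlite3

op = sqlite3.connect(":memory:")
op.execute("CREATE TABLE orders (region TEXT, amt REAL)")
op.executemany("INSERT INTO orders VALUES (?, ?)",
               [("east", 10.0), ("east", 5.0), ("west", 7.0)])

# The view is re-evaluated against live operational data on every query;
# nothing is copied or materialized.
op.execute("""CREATE VIEW sales_by_region AS
              SELECT region, SUM(amt) AS total FROM orders GROUP BY region""")
print(op.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# [('east', 15.0), ('west', 7.0)]
```

Materializing a summary view would trade this freshness for query speed, which is exactly the summary-information trade-off discussed later.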
Process Architecture
Components of DW Architecture:
- Load Manager
- Warehouse Manager
- Query Manager
- Detailed Information
- Summary Information
- Meta Data
- Data Marting
Process Architecture
[Figure: Process architecture. Operational data and external data enter through the Load Manager; the Warehouse Manager maintains the detailed information, summary information and meta data; the Query Manager serves OLAP tools, turning data into information and decisions.]
The load manager is the system component that performs all the operations necessary to support the extract and load process. This system may be constructed using a combination of off-the-shelf tools, C programs and shell scripts. The size and complexity of the load manager will vary from DW to DW. The effort to develop the load manager should be planned within the first production phase. The architecture of the load manager is such that it performs the following operations:
- Extract the data from the source systems.
- Fast-load the extracted data into a temporary data store.
- Perform simple transformations into a structure similar to the one in the DW.
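The "simple transformations" in the last step are typically field-level normalizations rather than full restructuring. A small sketch (the raw record layout and the specific rules are invented for illustration):

```python
# Raw extracted records: padded keys, locale decimal commas, slashed dates.
raw = [("  P1 ", "12,50", "2024/01/03"), ("p2", "8,00", "2024/01/04")]

def simple_transform(rec):
    """Normalize a record toward the structure used in the DW."""
    prod, amt, day = rec
    return (prod.strip().lower(),          # normalize keys
            float(amt.replace(",", ".")),  # unify decimal format
            day.replace("/", "-"))         # unify date format

loaded = [simple_transform(r) for r in raw]
print(loaded)
# [('p1', 12.5, '2024-01-03'), ('p2', 8.0, '2024-01-04')]
```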
Warehouse Structure
In order to get hold of the source data, it has to be transferred from the source systems and made available to the DW. Data should be loaded into the warehouse in the fastest possible time, in order to minimize the total load window. The speed at which data is processed into the warehouse is affected by the kinds of transformations that are taking place. In practice, it is more effective to load the data into a relational database prior to applying transformations and checks. Before or during the load there will be an opportunity to perform simple transformations on the data.
Fast Load
Simple Transformation
The warehouse manager is the system component that performs all the operations necessary to support the warehouse management process. This system is typically constructed using a combination of third-party systems-management software, C programs and shell scripts. The architecture of the warehouse manager is such that it performs the following operations:
- Analyze the data to perform consistency and referential-integrity checks.
- Transform and merge the source data from the temporary data store into the DW.
- Generate denormalizations if appropriate.
- Back up, totally or incrementally, the data within the DW.
[Figure: Warehouse manager architecture. Stored procedures, backup/recovery tools and SQL scripts operate on the warehouse schema and the warehouse structure.]
Guideline: do not load data directly into the DW tables until it has been cleaned up. Use temporary tables that emulate the structures within the DW.
The warehouse manager has to create indexes against the information in the fact and dimension tables. With a large number of rows, the overhead of maintaining the indexes while inserting each row can be higher than the overhead of recreating the indexes once the rows have been inserted. It is therefore more effective to drop all indexes on a table prior to a large bulk insert, and recreate them afterwards. The fact tables are large, so the warehouse manager creates views that combine a number of partitions into a single fact table. It is suggested that we create a few such views, corresponding to meaningful periods of time within the business.
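The drop-then-recreate pattern can be demonstrated with sqlite3 as a stand-in relational database (table, index and column names are invented for the example):

```python
# Bulk load: drop the index, insert, then rebuild the index once.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (prod_id TEXT, amt REAL)")
dw.execute("CREATE INDEX idx_prod ON fact_sales (prod_id)")

rows = [("p%d" % (i % 100), float(i)) for i in range(10_000)]

dw.execute("DROP INDEX idx_prod")             # avoid per-row index upkeep
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
dw.execute("CREATE INDEX idx_prod ON fact_sales (prod_id)")  # rebuild once

print(dw.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 10000
```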
Summary information is necessary in any organization because the higher-level officers don't want to see the detailed information; the summary information helps them in decision making. Summaries are generated automatically by the warehouse manager, i.e. generation is executed every time data is loaded.
The actual generation of summaries is achieved through the use of embedded SQL in either stored procedures (triggers) or C programs, using
a command sequence such as:
Create table {…} as select {…} from {…} where {…}
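Filling in that command-sequence template with concrete (invented) table and column names, and running it through sqlite3:

```python
# Regenerating a summary table with CREATE TABLE ... AS SELECT.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_detail (prod_id TEXT, amt REAL)")
dw.executemany("INSERT INTO sales_detail VALUES (?, ?)",
               [("p1", 12.0), ("p1", 50.0), ("p2", 11.0)])

# The warehouse manager would re-run this every time data is loaded.
dw.execute("DROP TABLE IF EXISTS sales_by_product")
dw.execute("""CREATE TABLE sales_by_product AS
              SELECT prod_id, SUM(amt) AS total
              FROM sales_detail GROUP BY prod_id""")
print(dw.execute("SELECT * FROM sales_by_product ORDER BY prod_id").fetchall())
# [('p1', 62.0), ('p2', 11.0)]
```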
The query manager is the system component that performs all the operations necessary to support the query management process. The architecture of a query manager is such that it performs the following operations:
- Direct queries to the appropriate tables.
- Schedule the execution of user queries.
The query manager also stores query profiles to allow the warehouse manager to determine which indexes are appropriate.
[Figure: Query manager architecture. Stored procedures (which generate views) direct queries across the meta data, detailed information and summary information in the warehouse.]
This is the area of the data warehouse that stores all the detailed information in the starflake schema. All the detailed information is held online the whole time; it is then aggregated to the next level of detail, and the detailed information is offloaded to the tape archive. If the business requirement for detailed information is weak or very specific, it may be possible to satisfy it by storing a rolling three-month detailed history.
Guideline: determine what business activities require detailed transaction information, in order to determine the level at which to retain detailed information in the DW. If the detailed information is being stored offline to minimize disk storage requirements, make sure that the data has been extracted, cleaned up, and transformed into the starflake schema prior to archiving it.
Star

product:
prodId  name  price
p1      bolt  10
p2      nut   5

store:
storeId
c1
c2
c3

sale:
custId  prodId  storeId  qty  amt
53      p1      c1       1    12
53      p2      c1       2    11
111     p1      c3       5    50

customer:
custId
53
81
111
[Figure: Star schema. The central fact table sale references the dimension tables product (prodId, name, price), store and customer.]
Cube

Fact table view:

sale:
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube (dimensions = 2):

      c1   c2   c3
p1    12        50
p2    11    8
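The pivot from the fact table view into the 2-D cube can be sketched in a few lines of plain Python (the nested-dict representation is just one possible cube encoding):

```python
# Fact table rows: (prodId, storeId, amt).
fact = [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)]

# Pivot into a 2-D cube: cube[prodId][storeId] = amt.
cube = {}
for prod, store, amt in fact:
    cube.setdefault(prod, {})[store] = amt

print(cube["p1"])  # {'c1': 12, 'c3': 50}
print(cube["p2"])  # {'c1': 11, 'c2': 8}
```

Cells absent from the fact table (such as p1 at c2) are simply missing keys, matching the blanks in the cube above.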
[Figure: 3-D data cube with Date (1Qtr, 2Qtr, 3Qtr, 4Qtr) and Country (U.S.A, Canada, Mexico) dimensions, plus 'sum' cells aggregating totals (e.g. 150, 63, 37, 450) along each dimension.]
Summary information is essentially a replication of information already in the data warehouse. The implications of summary data are that it:
- Exists to speed up the performance of common queries.
- Increases operational cost.
- May have to be updated every time new data is loaded into the DW.
- May not have to be backed up, because it can be generated fresh from the detailed information.
Because the size of the data that needs to be scanned is an order of magnitude smaller, the result is an order-of-magnitude improvement in the performance of the query. On the negative side, there is an increase in operational cost for creating and updating the summary tables on a daily basis.
Guideline 1: avoid creating summaries that require more than 200 centralized summary tables to be maintained on an ongoing basis.
Summary info (contd.)
Guideline 2: inform users that summary tables accessed infrequently will be dropped on an ongoing basis.
A data mart is a subset of the information content of a DW, stored in its own database, summarized or in detail. Data marting can improve query performance simply by reducing the volume of data that needs to be scanned to satisfy a query. Data marts are created along functional or departmental lines, in order to exploit a natural break in the data.
Multi-Tiered Architecture

[Figure: Multi-tiered architecture. Data sources (operational DBs and other sources) are processed by Extract / Transform / Load / Refresh into the data warehouse and data marts (the data storage tier, described by metadata); OLAP servers (the OLAP engine tier) serve this data to front-end tools.]