ETL Implementation Strategy

Contents

Buy or Build ETL
Major Factors Involved in Evaluating an ETL
An Ideal ETL Tool
ETL Implementation
ETL Process Example
Required Features in ETL Tool

Buying ETL Tool


Advantages
Reduced development time
A wide range of features available
Reusable across future project phases that involve data transformations
Disadvantages
Time needed to learn the product
Training costs
May not do everything we need (must be supplemented with in-house development)

Building ETL Tool


Advantages
No up-front purchasing costs
No training costs
Specifically designed for the purpose of the project
Disadvantages
Time needed for design, development, testing and documentation
May not have all the features of an off-the-shelf product
High maintenance

Recommendation: buy a tool rather than build one for the project

Factors Involved in Evaluating an ETL


Ease of Use
Tool Integration
Database Connectivity
Metadata Support
Update Capabilities
Customization Methods
Surrogate Key Support
Logging
Change Data Capture
Scheduling Features
Intelligent Queries
Quality Assessment
Multi-source Joins
Tool Architecture
Aggregate Capabilities
Price

ETL Environment
Creating an ETL environment requires six basic infrastructure components:
A network environment to connect source data systems to the warehouse platform
An RDBMS for the warehouse
A sort/merge utility to integrate data from the various source systems
A method to perform calculations
A utility to schedule and run ETL batch cycles based on events or timelines
A change management utility to manage updates and version control of programs and scripts

Implementation Methodology
Define business requirements for the warehouse project
Analyse the source systems
Develop physical data model
Design ETL processes

Design Stages
The design process follows a systematic, staged approach. The stages create a
single processing method for initial and successive loads to the data
warehouse. The five stages provide a modular and adjustable transformation
process for the target table that can adapt easily to changes in the source
systems or the warehouse model design.
Stage 1. Source verification
Performs the access and extraction of data from the source system and builds a
temporal view of the data at the time of extraction.
Stage 2. Source alteration
Performs a variety of transformations unique to the source, depending on business
requirements.
Stage 3. Common interchange
Applies business rules and/or transformation logic that is common across multiple
target tables.
Stage 4. Target load determination
Performs final formatting of data to produce load-ready files for the target table;
identifies and segregates rows to be inserted vs. updated (if applicable); applies
remaining technical metadata tagging; and processes data into the RDBMS.
Stage 5. Aggregation
The final stage; uses the load-ready files from Stage 4 to build aggregation tables
needed to improve query performance against the warehouse.
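
The modular, staged design described above can be illustrated with a minimal sketch in which each stage is an independent function and a small runner chains them. The stage functions, field names (`org_id`, `region`), and sample rows below are hypothetical stand-ins, not part of the original material; only the first three stages are represented to keep the sketch short.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Stage = Callable[[List[Row]], List[Row]]


def run_stages(rows: List[Row], stages: List[Tuple[str, Stage]]) -> List[Row]:
    """Pass the working rows through each named stage in order."""
    for _name, stage in stages:
        rows = stage(rows)
    return rows


# Hypothetical stand-ins for Stages 1 to 3 (assumptions for illustration).
def verify_source(rows: List[Row]) -> List[Row]:
    # Keep only rows that carry a business key.
    return [r for r in rows if r.get("org_id")]


def alter_source(rows: List[Row]) -> List[Row]:
    # Source-specific cleanup: trim stray whitespace from every value.
    return [{k: v.strip() for k, v in r.items()} for r in rows]


def common_interchange(rows: List[Row]) -> List[Row]:
    # Rule shared across targets: region codes are stored in upper case.
    return [dict(r, region=r["region"].upper()) for r in rows]


PIPELINE: List[Tuple[str, Stage]] = [
    ("source verification", verify_source),
    ("source alteration", alter_source),
    ("common interchange", common_interchange),
]

raw = [{"org_id": "10", "region": " emea "}, {"org_id": "", "region": "apac"}]
result = run_stages(raw, PIPELINE)
print(result)
```

Because each stage only consumes and returns a list of rows, a change in one source system should require replacing only the matching entry in `PIPELINE`.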

ETL Process Example


Stage 1. Source Verification
The source system is a human resources (HR) ERP system.
The target is an organization dimension table that uses type 2 slowly
changing dimensions.
Working files: new organization records

FIGURE 1: Source verification, alteration, and common interchange stages.
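
As a rough illustration of Stage 1, the sketch below reads an organization extract and stamps every row with the extraction time, which is one simple way to record a view of the data as of that moment. The CSV layout, column names, and the `extract_organizations` helper are assumptions made for this example; a real HR ERP extract would look different.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical extract contents; in practice this would be read from the ERP system.
HR_EXTRACT = """org_id,org_name,region_id,manager_id
100,Finance,1,M7
200,Sales,2,M9
"""


def extract_organizations(source, extracted_at):
    """Read organization rows and tag each one with the extraction timestamp."""
    reader = csv.DictReader(source)
    stamp = extracted_at.isoformat()
    return [dict(row, extracted_at=stamp) for row in reader]


snapshot_time = datetime(2024, 1, 31, 23, 0, tzinfo=timezone.utc)
working_file = extract_organizations(io.StringIO(HR_EXTRACT), snapshot_time)

for row in working_file:
    print(row["org_id"], row["org_name"], row["extracted_at"])
```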

Stage 2. Source Alteration
We append data from secondary sources: in this case, the HR ERP Region table
is appended to the primary organization extract file.

Working file: new organization records
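
Appending the Region table to the organization extract amounts to a lookup join on a shared key. The sketch below assumes that key is called `region_id` and that the region table carries a `region_name` column; both names, and the sample rows, are hypothetical.

```python
def append_region(org_rows, region_rows):
    """Add region_name to each organization row via a lookup on region_id."""
    region_by_id = {r["region_id"]: r["region_name"] for r in region_rows}
    # Rows without a matching region keep region_name=None so they can be reviewed.
    return [dict(o, region_name=region_by_id.get(o["region_id"])) for o in org_rows]


orgs = [
    {"org_id": "100", "org_name": "Finance", "region_id": "1"},
    {"org_id": "200", "org_name": "Sales", "region_id": "9"},
]
regions = [{"region_id": "1", "region_name": "N. America"}]

enriched = append_region(orgs, regions)
for row in enriched:
    print(row)
```

Keeping unmatched rows instead of dropping them is a design choice in this sketch: it leaves the decision about orphaned organizations to a later stage.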

Stage 3. Common Interchange
We find that the region name values stored in the HR ERP system do not
conform to the established enterprise definitions, so we use the sort/merge
infrastructure utility to update the region names on the organization records
to reflect the enterprise versions.
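
One way to picture this conforming step is a mapping from source region names to enterprise region names. The mapping contents and the `conform_region_names` helper below are invented for illustration; the source does not list the actual enterprise definitions.

```python
# Hypothetical mapping from HR ERP region names to enterprise region names.
ENTERPRISE_REGION_NAMES = {
    "N. America": "North America",
    "EMEA-West": "Western Europe",
}


def conform_region_names(rows, mapping):
    """Replace source region names with enterprise names; report unknown names."""
    already_conformed = set(mapping.values())
    conformed, unmapped = [], set()
    for row in rows:
        name = row["region_name"]
        if name in mapping:
            conformed.append(dict(row, region_name=mapping[name]))
        else:
            if name not in already_conformed:
                unmapped.add(name)  # neither a known source name nor an enterprise name
            conformed.append(dict(row))
    return conformed, unmapped


rows = [
    {"org_id": "100", "region_name": "N. America"},
    {"org_id": "200", "region_name": "North America"},
    {"org_id": "300", "region_name": "Atlantis"},
]
conformed_rows, unknown_names = conform_region_names(rows, ENTERPRISE_REGION_NAMES)
print(conformed_rows)
print(unknown_names)
```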

Stage 4. Target Load Determination
We compare the current load of organization records against those
previously loaded in earlier batch cycles.
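
The comparison can be sketched as a keyed lookup that sorts the current load into new, changed, and unchanged rows. The business key `org_id` and the tracked columns `region_name` and `manager_id` are assumptions chosen for this example.

```python
def compare_loads(current_rows, previous_rows, key="org_id",
                  tracked=("region_name", "manager_id")):
    """Split the current load into new, changed, and unchanged rows."""
    previous_by_key = {r[key]: r for r in previous_rows}
    new, changed, unchanged = [], [], []
    for row in current_rows:
        old = previous_by_key.get(row[key])
        if old is None:
            new.append(row)        # key never loaded before
        elif any(row[c] != old[c] for c in tracked):
            changed.append(row)    # at least one tracked column differs
        else:
            unchanged.append(row)
    return new, changed, unchanged


previous = [
    {"org_id": "100", "region_name": "North America", "manager_id": "M7"},
    {"org_id": "200", "region_name": "Western Europe", "manager_id": "M9"},
]
current = [
    {"org_id": "100", "region_name": "North America", "manager_id": "M8"},
    {"org_id": "200", "region_name": "Western Europe", "manager_id": "M9"},
    {"org_id": "300", "region_name": "North America", "manager_id": "M2"},
]

new_rows, changed_rows, unchanged_rows = compare_loads(current, previous)
print(len(new_rows), len(changed_rows), len(unchanged_rows))
```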

Stage 5. Aggregation
We flag new rows for insertion: current load-cycle records whose relevant
columns do not match their corresponding organization dimension table rows,
such as records with new region names or manager IDs.
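
For a type 2 slowly changing dimension, flagging a row for insertion usually means closing the current version of that organization and adding a new version under a fresh surrogate key. The sketch below shows that idea under stated assumptions: the column names (`org_key`, `valid_from`, `valid_to`, `is_current`) and the `apply_type2` helper are hypothetical and are not taken from the source.

```python
from datetime import date


def apply_type2(dimension, incoming, load_date,
                tracked=("region_name", "manager_id")):
    """Return a copy of the dimension with type 2 changes applied."""
    result = [dict(r) for r in dimension]
    current = {r["org_id"]: r for r in result if r["is_current"]}
    next_key = max((r["org_key"] for r in result), default=0) + 1
    for row in incoming:
        old = current.get(row["org_id"])
        if old is not None and all(old[c] == row[c] for c in tracked):
            continue  # tracked columns match, nothing to insert
        if old is not None:
            # Expire the existing version instead of overwriting it.
            old["is_current"] = False
            old["valid_to"] = load_date
        new_version = {"org_key": next_key, "org_id": row["org_id"]}
        new_version.update({c: row[c] for c in tracked})
        new_version.update(valid_from=load_date, valid_to=None, is_current=True)
        result.append(new_version)
        current[row["org_id"]] = new_version
        next_key += 1
    return result


dimension = [{
    "org_key": 1, "org_id": "100", "region_name": "North America",
    "manager_id": "M7", "valid_from": date(2023, 1, 1),
    "valid_to": None, "is_current": True,
}]
incoming = [
    {"org_id": "100", "region_name": "North America", "manager_id": "M8"},
    {"org_id": "200", "region_name": "Western Europe", "manager_id": "M9"},
]

updated = apply_type2(dimension, incoming, date(2024, 1, 31))
for row in updated:
    print(row["org_key"], row["org_id"], row["manager_id"], row["is_current"])
```

The history of organization 100 is preserved: the old manager remains on the expired row, and queries against current data filter on `is_current`.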

Required Features in an ETL Tool


Architecture such as hub-and-spoke or client-server
Scalable and extensible technology that scales up as data grows
Client platform support: Windows 2000/95/98, etc.
Server platform support: Sun Solaris, HP-UX, AIX, etc.
Support for ERP sources
Support for parallelism
Code generator
Data transformation method
Support for managing and building aggregates
Support for various industry-standard data types
Data quality functionality
Exception handling capability
ETL process management
Backup and recovery feature
Metadata capture support
Viewing metadata
Security of metadata
Web integration support
Support for versioning
Installation procedure
Support for sharable repository
Support for designing data marts
Support for importing data models from modeling tools
Support for different kinds of transformations
Adaptability
Support for growth
Ability to handle various source types
Support for external loader