ETL For Type 2 Dimension

ETL for Type 2 dimension Overview
ETL Definitions and Data Flow for Type 2 dimensions Overview

The purpose of this document is to provide a detailed example of how data is extracted from the OLTP production database to the Staging database. Specifically, this document examines the steps required to extract data from a table that contains Type 1, Type 2, and inferred data.
Views and UDFs

Views and user-defined, table-valued functions are used to extract data from the production database. They provide a consistent naming convention to the Staging database and a layer of abstraction between the ETL and the production database. This minimizes downstream problems when minor changes are made to the production database.
Naming Conventions
Many of the fields in the production database do not have descriptive names. To make maintenance easier and to ease future upgrades of the BI system, fields in the production database are renamed in views and UDFs using a consistent and descriptive naming convention. The following naming convention rules are used in views and UDFs on the production database:
- Use the prefix vw for the name of a view, for example vwReservations or vwGuests. Tables and views share the same namespace, so the vw prefix allows you to easily differentiate between tables and views when a table list is presented to the user in a drop-down list.
- User-defined functions are prefixed with udf.
- Stored procedures are prefixed with usp.
- When naming the primary key field, use the singular form of the table name + ID. For example, if the table name is People, the primary key field name would be PersonID; for the table named Guests, the primary key field would be GuestID.
- Do not use just ID or Code for a primary key field name. Maintenance and troubleshooting become very difficult if the primary key field name for all the tables is either ID or Code.
- Use Pascal Case for closed, compound field names (capitalize the first character of each word in the field name). For example: PascalCaseField, or WorkPhone.
- Do not use underscores in field names unless you are separating two parts of the name that contain abbreviations. For example: CC_ID for credit card ID. For this field, however, the preferred approach is to spell out the words credit and card, so the field name would be CreditCardID.
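The prefix and primary key rules above can be checked mechanically. The following is a minimal sketch in Python, not part of the original system: the function names and the regular expressions are assumptions chosen for illustration, and the key check assumes that every primary key name is a Pascal Case name ending in ID.

```python
import re

# Prefixes taken from the naming rules above.
OBJECT_PREFIXES = {"view": "vw", "function": "udf", "procedure": "usp"}


def has_expected_prefix(object_type: str, name: str) -> bool:
    """Return True if `name` starts with the prefix required for its object type."""
    prefix = OBJECT_PREFIXES[object_type]
    # The prefix must be followed by an upper-case letter, e.g. vwGuests.
    return re.match(rf"{prefix}[A-Z]", name) is not None


def is_acceptable_key_name(field_name: str) -> bool:
    """Reject a bare 'ID' or 'Code'; require a Pascal Case name ending in ID."""
    if field_name in ("ID", "Code"):
        return False
    # One or more capitalized words followed by the literal suffix ID.
    return re.fullmatch(r"(?:[A-Z][a-z0-9]+)+ID", field_name) is not None
```

With these hypothetical helpers, vwReservations passes the view check, GuestID and CreditCardID pass the key check, and ID, Code, and CC_ID are rejected.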
2007 STATRA. All rights reserved. Proprietary & Confidential. Page 1
Three-letter prefixes on task names in SSIS:

It is easier to interpret logs of package execution if the task and transform names tell what the type of the task or transform is. The three-letter prefixes used to name tasks in SSIS are:
- EPT is an Execute Process task
- DFT is a Data Flow task
- DER is a Derived Column transform
- ASP is an Analysis Services package
- SCD represents a Slowly Changing Dimension
- SEQ represents a sequence container
Source descriptors:
- SRC is a data source in a data flow task
- STAT is a script transform that you will see in most of the data flows. It collects row count and throughput information, which is ultimately reported in the enhanced logging.
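As an illustration of how these prefixes help when reading execution logs, the sketch below maps a component name to its type. It is a hypothetical Python helper, not part of the SSIS packages; the function name and the output format are assumptions.

```python
# Prefix meanings as listed above.
TASK_PREFIXES = {
    "EPT": "Execute Process task",
    "DFT": "Data Flow task",
    "DER": "Derived Column transform",
    "ASP": "Analysis Services package",
    "SCD": "Slowly Changing Dimension",
    "SEQ": "Sequence container",
    "SRC": "Data source in a data flow task",
    "STAT": "Script transform collecting row counts and throughput",
}


def describe_task(task_name: str) -> str:
    """Split 'PREFIX Rest of name' and report the component type for the prefix."""
    prefix, _, rest = task_name.partition(" ")
    kind = TASK_PREFIXES.get(prefix.upper(), "unknown type")
    # Fall back to the full name when there is no text after the prefix.
    return f"{rest or task_name}: {kind}"
```

For example, describe_task("DFT Load Company") reports that Load Company is a Data Flow task.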
Data Extraction on the OLTP Production database

Data is pulled from the OLTP database on a nightly basis. The data is accessed through a layering of structures as diagrammed below:
Production Database: Table(s) -> View -> UDF -> Stored Procedure
SSIS: SSIS Package -> Table(s)
If a table is used in more than one UDF, a view is created to provide consistent naming for all fields. Otherwise, the UDF pulls directly from the table.
The UDF returns a table-valued result set to the stored procedure.
The stored procedure is called from the SSIS package. The date and debug parameters are passed from SSIS to the stored procedure, and a table-valued set is returned to the SSIS package.
The SSIS package populates either a dimension or a fact table.
A stored procedure is placed between the UDF and the SSIS package because a bug in UDFs causes problems when parsing parameters that come from an SSIS package.
When that bug is fixed, the stored procedure can be removed from the pipeline and the UDF can be called directly from the SSIS package.
Auditing and logging code is contained in each SSIS package.
Flow Control
In SSIS there are four levels of control. The first three levels provide modularity and determine the order in which ETL packages are executed.
Configuration Information
When a package is started, connection information and variables are set from the package configuration. For the ADE BI System, all configuration information is kept in the table admin.Configuration in the relational Staging database. Any information passed from one package to another goes through this database.
Top Level Flow Control

The daily ETL process is controlled by the package LoadGrouspFull_Daily.dtsx. There are four main packages contained within LoadGrouspFull_Daily.dtsx, as shown below:
Figure 1. LoadGrouspFull_Daily.dtsx Top level flow control
The job of LoadGrouspFull_Daily.dtsx is to specify the order of operations: first, dimension data is loaded into the relational warehouse; then fact data is loaded; then dimensions are processed in the AS cube; and finally facts are processed.
Second Level Flow Control

When you open the task EPT Dimensions, you will see that it calls a package to carry out dimension processing: LoadGrouspDimensions_Daily.dtsx. That package, in turn, calls a separate package for each of the dimensions that are processed in a daily load. For demonstration purposes, the first 7 dimensions are listed in the sequence container shown below.
Figure 2. LoadGrouspDimensions_Daily.dtsx Sequence container holds packages to load dimension data into staging db
Type 2 Dimension Logic Flow

DimCompany contains some columns that are Type 1 (data changes are overwritten) and some columns that are Type 2 (historical changes need to be tracked). Logic will also be added to allow for Type 1.5 (addition of an inferred member). We will use the Company dimension as an example of how the data flows to populate the staging table. The high-level logic control for DimCompany contains the three elements shown below:
Figure 3. Dim_Company.DTSX High level flow control to load dimCompany into staging
When you drill down into the Data Flow Task (DFT) Load Company, the following flow control is revealed:
Figure 4. Flow control to load DimCompany into staging table
SRC DimCompany
SRC DimCompany is an OLE DB source component in the data flow. This component executes the stored procedure etl.uspDimCompany to extract the data from udfDimCompany in the Source database. The stored procedure has two parameters, @logicalDate and @debug:
    exec etl.uspDimCompany @logicalDate = ? ,@debug = 0
The create script for the stored procedure is shown below:
    CREATE procedure [etl].[uspDimCompany]
         @logicalDate datetime
        ,@debug bit = 0 --Debug mode?
    with execute as caller
    as
    /* This procedure is used to extract Company information into
     * the staging database
     *
     * exec etl.uspDimCompany '2007-05-23', 1
     */
    begin
        set nocount on
        if @debug = 1 begin
            select top (100) * from etl.udfDimCompany (@logicalDate)
        end else begin
            select * from etl.udfDimCompany (@logicalDate)
        end --if
        set nocount off
    end --proc

etl.udfDimCompany is a table-valued user-defined function that returns the desired records from the Company and Address tables. The create script for etl.udfDimCompany is shown below:

    CREATE function [etl].[udfDimCompany](@logicalDate datetime)
    returns table
    as
    return (
        SELECT
              dbo.company_profile.account AS CompanyAccountID
            , dbo.company_profile.name AS CompanyName
            , dbo.company_profile.contact_name AS ContactName
            , dbo.company_profile.contact_title AS ContactTitle
            , dbo.address.address AS Address
            , dbo.address.Address_2 AS Address2
            , dbo.address.city
            , dbo.address.state
            , dbo.address.country
            , dbo.address.zip
            , dbo.address.phone
            , dbo.address.fax
            , dbo.address.email
            , dbo.company_profile.credit_limit
            , dbo.company_profile.status
            , dbo.company_profile.property AS PropertyID
            , dbo.company_profile.locale_id AS LocaleID
        FROM dbo.company_profile
        INNER JOIN dbo.address
            ON dbo.company_profile.property = dbo.address.propertyfrom
        WHERE Logical_Date > @logicalDate
    );
STAT Source
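STAT Source is a script component that runs a custom Visual Basic .NET script. Its purpose is to count the source records and measure throughput; the results are written to the audit log through the stored procedure audit.uspEvent_Package_OnCount. The script defined in STAT Source is as follows: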
    Imports System
    Imports System.Data
    Imports System.Data.OleDb
    Imports System.Collections

    Public Class ScriptMain
        Inherits UserComponent

        Private startTicks, totalTicks As Long
        Private rowCount, totalRows As Integer
        Private rps As New ArrayList() 'rps = rows per second

        Public Overrides Sub Input0_ProcessInput(ByVal Buffer As Input0Buffer)
            'Save the rate statistic for this buffer
            If startTicks <> 0 Then
                totalRows += rowCount
                Dim ticks As Long = CLng(DateTime.Now.Ticks - startTicks)
                If ticks > 0 Then
                    totalTicks += ticks
                    Dim rate As Integer = CInt(rowCount * (TimeSpan.TicksPerSecond / ticks))
                    rps.Add(rate)
                End If
            End If
            'Reinitialize the counters
            rowCount = 0
            startTicks = DateTime.Now.Ticks
            'Call the base method
            MyBase.Input0_ProcessInput(Buffer)
        End Sub

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            rowCount += 1 'No exposed Buffer.RowCount property so have to count manually
        End Sub

        Public Overrides Sub PostExecute()
            MyBase.PostExecute()
            'Define the Stored Procedure object
            With New OleDbCommand("audit.uspEvent_Package_OnCount")
                .CommandType = CommandType.StoredProcedure
                'Define the common parameters
                .Parameters.Add("@logID", OleDbType.Integer).Value = Variables.LogID
                .Parameters.Add("@componentName", OleDbType.VarChar, 50).Value = Me.ComponentMetaData.Name
                .Parameters.Add("@rows", OleDbType.Integer).Value = totalRows
                .Parameters.Add("@timeMS", OleDbType.Integer).Value = CInt(totalTicks \ TimeSpan.TicksPerMillisecond)
                'Only write the extended stats if RowCount > 0
                If rps.Count > 0 Then
                    'Calculations depend on sorted array
                    rps.Sort()
                    'Remove boundary-case statistics
                    If rps.Count >= 3 Then rps.RemoveAt(0)
                    'Calculate min & max
                    Dim min As Integer = CInt(rps.Item(0))
                    Dim max As Integer = CInt(rps.Item(rps.Count - 1))
                    'Define the statistical parameters
                    .Parameters.Add("@minRowsPerSec", OleDbType.Integer).Value = min
                    .Parameters.Add("@maxRowsPerSec", OleDbType.Integer).Value = max
                End If
                'Define and open the database connection
                .Connection = New OleDbConnection(Connections.SQLRealWarehouse.ConnectionString)
                .Connection.Open()
                Try
                    .ExecuteNonQuery() 'Execute the procedure
                Finally
                    'Always finalize expensive objects
                    .Connection.Close()
                    .Connection.Dispose()
                End Try
            End With
        End Sub

    End Class
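The throughput arithmetic in the script can be followed more easily in isolation. The sketch below restates it in Python under stated assumptions: the function name and the input format (a list of row count and elapsed tick pairs, one per buffer) are hypothetical, and the tick constants are the .NET values of TimeSpan.TicksPerSecond and TimeSpan.TicksPerMillisecond. It does not call the audit stored procedure.

```python
TICKS_PER_SECOND = 10_000_000  # .NET TimeSpan.TicksPerSecond
TICKS_PER_MILLISECOND = 10_000  # .NET TimeSpan.TicksPerMillisecond


def summarize_rates(buffers):
    """Summarize throughput from (row_count, elapsed_ticks) pairs, one per buffer."""
    total_rows = 0
    total_ticks = 0
    rates = []  # rows per second, one entry per timed buffer
    for row_count, ticks in buffers:
        total_rows += row_count
        if ticks > 0:
            total_ticks += ticks
            rates.append(round(row_count * (TICKS_PER_SECOND / ticks)))
    summary = {"rows": total_rows, "time_ms": total_ticks // TICKS_PER_MILLISECOND}
    if rates:
        rates.sort()
        # Mirror the script: with three or more samples, drop the lowest rate.
        if len(rates) >= 3:
            rates.pop(0)
        summary["min_rows_per_sec"] = rates[0]
        summary["max_rows_per_sec"] = rates[-1]
    return summary
```

Note that, as in the original script, only the lowest sample is discarded before the minimum and maximum are taken.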
DER Coalesce
DER Coalesce is a Derived Column transform that replaces any NULL values in the incoming source data with predefined "unknown" strings and numbers for the non-nullable target fields.
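The effect of this step can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the placeholder values ("Unknown" and 0) and the choice of columns that receive them are assumptions, since the actual defaults are defined inside the SSIS component.

```python
# Assumed placeholders for non-nullable target columns (illustrative values).
UNKNOWN_DEFAULTS = {
    "CompanyName": "Unknown",
    "ContactName": "Unknown",
    "credit_limit": 0,
}


def coalesce_row(row, defaults=UNKNOWN_DEFAULTS):
    """Return a copy of `row` where None is replaced by the column default, if one exists."""
    return {
        col: (defaults.get(col, val) if val is None else val)
        for col, val in row.items()
    }
```

Columns without a configured default keep their NULL value, which matches the idea that only non-nullable target fields need a placeholder.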
SCD Data Flow Component

The Slowly Changing Dimension Wizard allows you to select which columns are Type 1 (changes overwritten), Type 1.5 (inferred), and Type 2 (historical changes stored). Once you have classified each of the columns in the dimension, most of the code to support the functionality is written for you. After the wizard has generated the code for the inferred output, the Type 1 output, and the Type 2 output, you can modify it to accommodate any custom functionality that is desired.
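The routing decision that the generated component makes for each incoming row can be approximated as follows. This is a simplified Python sketch, not the wizard's generated code: the function name and output labels are hypothetical, and it assumes that a row with both Type 1 and Type 2 changes is routed to the Type 2 output.

```python
def classify_row(incoming, current_rows, key, type1_cols, type2_cols):
    """Decide which output an incoming row belongs to.

    current_rows maps each business key to its current dimension row.
    """
    existing = current_rows.get(incoming[key])
    if existing is None:
        return "new"  # business key not yet in the dimension
    if existing.get("InferredMember"):
        return "inferred"  # placeholder row waiting for real attributes
    if any(incoming[c] != existing[c] for c in type2_cols):
        return "scd2"  # a historical column changed
    if any(incoming[c] != existing[c] for c in type1_cols):
        return "scd1"  # only overwritable columns changed
    return "unchanged"
```

Which columns count as Type 1 or Type 2 is passed in by the caller here, standing in for the choices made in the wizard.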
Inferred Output
The first component in the Inferred Output branch, STAT Inferred, counts the number of records that include inferred members. Inferred members would be added if a fact includes, in this case, a Company that does not already exist in the Company table. The final step for the Inferred Output branch updates a record in the dimension table that is blank except for the business primary key that was retrieved from a fact record. The remaining data in this record will be updated with actual data in a future import.
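The life cycle of an inferred member can be sketched in Python as two steps: a placeholder row is created from the business key found in a fact record, and a later import fills in the attributes. The function names, the key column CompanyAccountID as the business key, and the attribute columns chosen are assumptions for illustration.

```python
def make_inferred_member(business_key, key_col="CompanyAccountID",
                         attribute_cols=("CompanyName", "ContactName")):
    """Create a placeholder row that is blank except for the business key."""
    row = {key_col: business_key, "InferredMember": True}
    row.update({col: None for col in attribute_cols})
    return row


def complete_inferred_member(row, source_row):
    """Fill a placeholder row with real source data and clear the inferred flag."""
    updated = dict(row)
    updated.update(source_row)
    updated["InferredMember"] = False
    return updated
```

The second function returns a new row rather than changing the placeholder in place, which keeps the example easy to test.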
New Output
The first component in the New Output branch, STAT New, counts the number of records that include new Type 2 members. New members would be added if a fact includes, in this case, a Company that does not already exist in the Company table. The next component, a Union All transform named All New SCD 2, unions updates made to Type 2 columns with new records containing Type 2 data. The next component, a Derived Column transform named DER New SCD 2, sets the values for all the derived columns, such as the surrogate key, CurrentRow, StartDate, EndDate, InferredMember, and LastModifiedDate. The final component in the New Output branch, an OLE DB Destination component named DTS New SCD-2, writes the data to the DimCompany table in the staging database.
SCD-2 Output
The first component in the SCD-2 Output branch, STAT SCD-2, counts the number of records that include Type 2 members that have updates. Historical changes will be saved with all Type 2 data columns. The next component, a Derived Column transform named DER SCD 2, sets the values for all the derived columns, such as the surrogate key, CurrentRow, StartDate, EndDate, InferredMember, and LastModifiedDate. The next component in the SCD-2 Output branch, an OLE DB Destination component named DTS New SCD-2, updates the CurrentRow field for the given business primary key. The SCD-2 Output branch then merges with the New Output branch at the Union All transform named All New SCD 2.
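The combined effect of the SCD-2 branch (expire the current row, then insert a new current row through the shared Union All path) can be sketched in Python on an in-memory list of rows. This is a simplified illustration: the function name, the surrogate key column name CompanySK, and the way the surrogate key and dates are assigned are assumptions, not the logic of the actual derived column expressions.

```python
from datetime import date


def apply_type2_change(dimension, incoming, key, load_date):
    """Expire the current row for the business key and append a new current row."""
    for row in dimension:
        if row[key] == incoming[key] and row["CurrentRow"]:
            row["CurrentRow"] = False
            row["EndDate"] = load_date
    # Assumed surrogate key scheme: next integer after the highest existing key.
    next_sk = max((r["CompanySK"] for r in dimension), default=0) + 1
    new_row = dict(
        incoming,
        CompanySK=next_sk,
        CurrentRow=True,
        StartDate=load_date,
        EndDate=None,
        InferredMember=False,
        LastModifiedDate=load_date,
    )
    dimension.append(new_row)
    return new_row
```

After the call, the dimension holds both the expired historical row and the new current row for the same business key, which is what allows historical changes to be tracked.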
SCD-1 Output
The first component in the SCD-1 Output branch, STAT SCD-1, counts the number of records that include Type 1 members that have updates. Historical values in all Type 1 data columns will be overwritten. The second and final step in this branch updates all data for Type 1 changes in the DimCompany table.
