Outline
ETL in a Data Warehouse architecture
ETL characteristics
Extraction
Transformation
Loading
Requirements for ETL
ETL metadata
Prototype systems
Robert Wrembel
ETL in DW architecture
[Figure: DW architecture with layers: data sources; intermediate layer; ETL software; data warehouse (data warehouse, data marts); BI applications (reports, financial and statistical analysis)]
ETL characteristics
Developing ETL processes
critical for DW operation
data quality
data "freshness" (up to date)
DW is refreshed in a finite time window (any delay in refreshing a DW makes it outdated, inconsistent, or unavailable for use)
ETL characteristics
Gartner report on DW projects in financial institutions (Fortune 500)
100 persons involved in a DW project
  55 ETL
  17 system administrators (DB, hardware, software)
  4 system architects
  9 consultants supporting end users in BI technology
  5 software developers
  9 managers
hardware: multiprocessor servers, TB disks (5 mln USD)
ETL software (1 mln USD)
typical number of data sources being integrated: 10-50
Technological challenges
Processing large data volumes in a limited time window
Delivering reliable (valid, true, consistent) data: data quality
Processing an ETL flow efficiently
Managing the evolution of data sources
"SSN"
format: 11 digits
22% incorrect: incorrect length, characters instead of digits, wrong checksum, wrong gender
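The four error classes listed above can be detected with a short validation routine. The sketch below is only an illustration: the slide does not say how the checksum or the gender is encoded, so the weights and the "odd 10th digit means male" rule are assumptions borrowed from the PESEL convention, and `check_id` is a hypothetical helper name.

```python
# Assumption: PESEL-style weights; the slide only says "SSN" with 11 digits.
WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def check_id(value, expected_gender=None):
    """Return a list of detected problems (an empty list means no problem found)."""
    if len(value) != 11:
        return ["incorrect length"]
    if not (value.isascii() and value.isdigit()):
        return ["characters instead of digits"]
    problems = []
    digits = [int(ch) for ch in value]
    # Control digit computed from the first ten digits.
    control = (10 - sum(w * d for w, d in zip(WEIGHTS, digits)) % 10) % 10
    if control != digits[10]:
        problems.append("wrong checksum")
    # Assumption: odd 10th digit = male ("M"), even = female ("F").
    gender = "M" if digits[9] % 2 else "F"
    if expected_gender is not None and gender != expected_gender:
        problems.append("wrong gender")
    return problems
```

Running such a check over a whole column and counting the non-empty results gives the kind of error rate quoted on the slide.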
Last names
dictionary of last names was applied
20% not in the dictionary
0.04% wrong characters
Predefined dictionaries
31 distinct values found where only 2 are correct
757 distinct values found where only 299 are correct
74 distinct values found where only 3 are correct
ETL architecture
[Figure: ETL architecture: data sources (databases) -> ETL (transformation, cleaning, integration, loading) -> data warehouse]
Data sources
Each data source supplying data to a DW has to be identified
A data source description includes, among others:
  domain of activity (HR, Payroll, Marketing, ...)
  type of applications used for data processing
  data importance for a BI user
  who is a business user of source data
  who is a user of a technical architecture
  DBMS used to manage data
  hardware and operating systems
  the number of users per day
  data volume sizes
  DB schema
  the number of transactions per day
Solutions
audit columns
  in a monitored table: date and time of operation, operation type (I, U, D)
  values provided by means of triggers or applications
snapshot log: a system-maintained log of changes
redo log: a system DB log (transaction rollback, transaction recovery, DB recovery)
  periodic analysis (log scraping) / on-line analysis (log sniffing)
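The audit-column approach can be sketched with an in-memory SQLite table. The table name `clients`, the column names `op_time` and `op_type`, and the helper `extract_changes` are hypothetical; the sketch also assumes that deleted rows stay in the table flagged with `D`, which the slide does not state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (
        id INTEGER PRIMARY KEY,
        name TEXT,
        op_time TEXT,   -- audit column: date and time of the last operation
        op_type TEXT    -- audit column: I (insert), U (update), D (delete)
    );
    INSERT INTO clients VALUES
        (1, 'Kowalski', '2024-01-01 08:00:00', 'I'),
        (2, 'Nowak',    '2024-01-02 09:30:00', 'U'),
        (3, 'Smith',    '2024-01-03 10:15:00', 'D');
""")


def extract_changes(connection, last_refresh):
    """Return rows changed after the previous refresh, oldest first."""
    return connection.execute(
        "SELECT id, name, op_time, op_type FROM clients "
        "WHERE op_time > ? ORDER BY op_time",
        (last_refresh,),
    ).fetchall()


# Only rows touched after the last DW refresh are read from the source.
changes = extract_changes(conn, "2024-01-01 12:00:00")
```

In a real source the audit columns would be filled by triggers or by the applications themselves, as listed above.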
replacing wrong values with correct ones
discovering functional dependencies between attributes
discovering potential keys
discovering business rules implicitly encoded in applications
WizRule (WizSoft), DataMiningSuite (InformationDiscovery)
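Discovering functional dependencies and potential keys can be illustrated on a small in-memory sample. This is a brute-force sketch for intuition only, not how the listed tools work; `holds_fd`, `candidate_keys`, and the sample rows are hypothetical.

```python
from itertools import combinations


def holds_fd(rows, lhs, rhs):
    """True if attributes `lhs` functionally determine attribute `rhs` in `rows`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def candidate_keys(rows, attributes, max_size=2):
    """Minimal attribute sets (up to max_size) whose values are unique in `rows`."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(attributes, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # a subset is already a key, so combo is not minimal
            values = [tuple(row[a] for a in combo) for row in rows]
            if len(set(values)) == len(values):
                keys.append(combo)
    return keys


rows = [
    {"id": 1, "zip": "60-965", "city": "Poznań"},
    {"id": 2, "zip": "60-965", "city": "Poznań"},
    {"id": 3, "zip": "00-001", "city": "Warszawa"},
]
```

A dependency that holds in a sample is only a hypothesis about the source; it still has to be confirmed against the full data or with domain experts.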
Transformation
Requirements
Interactive and iterative process
  define rules
  start the transformation
  verify results
  modify rules
Easily extensible
Optimizable
The more tasks executed automatically the better
The less data for manual transformation the better
Transformation
Transformation to a common data model
{object, O-R, semistructured, ...} -> relational
Cleaning
Extracting atomic values from strings
  Piotrowo 2, 60-965, Poznań
  ordering the values
Standardizing values
  formatting values (e.g., dates, money)
  converting currencies
  lower-upper case conversion
  consistent abbreviations
  synonym dictionaries (WordNet)
  abbreviation dictionaries
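Both cleaning steps can be sketched in a few lines. The regular expression below assumes the "street number, postal code, city" layout of the example address; the field names, the `ABBREVIATIONS` dictionary, and both helper functions are hypothetical.

```python
import re

# Assumed layout: "<street> <number>, <dd-ddd>, <city>".
ADDRESS = re.compile(
    r"^\s*(?P<street>.+?)\s+(?P<number>\d+\w*)\s*,\s*"
    r"(?P<code>\d{2}-\d{3})\s*,\s*(?P<city>.+?)\s*$"
)

# Hypothetical abbreviation dictionary used for standardization.
ABBREVIATIONS = {"st.": "street", "ave.": "avenue"}


def split_address(text):
    """Extract atomic values from an address string, or None if it does not fit."""
    match = ADDRESS.match(text)
    return match.groupdict() if match else None


def standardize(value):
    """Lower-case a value, trim whitespace, and expand known abbreviations."""
    words = value.strip().lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


parts = split_address("Piotrowo 2, 60-965, Poznań")
```

Values that do not match the assumed layout return `None` and would be routed to manual inspection rather than guessed.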
Cleaning
Merging semantically identical records
Generating artificial identifiers
Use natural identifiers (e.g., SSN, passport No, engine No, e-mail)
No natural identifiers
  sort + compare n neighbor records (window of size n)
  similarity function (e.g., if first and last names are identical then the records are identical)
  similarity weights for attributes
  approximate join
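The "sort + compare n neighbor records" idea can be sketched as follows. The function names, the sort key, and the sample records are hypothetical; the similarity function is the simple one quoted above (identical first and last names), compared case-insensitively as an added assumption.

```python
def sorted_neighborhood(records, key, similar, window=3):
    """Return index pairs of records judged identical within a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare each record only with the next window-1 records in sort order.
        for j in order[pos + 1 : pos + window]:
            if similar(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


def same_person(a, b):
    """Records are identical if first and last names match (ignoring case)."""
    return (a["first"].lower(), a["last"].lower()) == (
        b["first"].lower(),
        b["last"].lower(),
    )


records = [
    {"first": "Jan", "last": "Kowalski"},
    {"first": "Anna", "last": "Nowak"},
    {"first": "JAN", "last": "kowalski"},
    {"first": "Adam", "last": "Nowak"},
]
pairs = sorted_neighborhood(
    records,
    key=lambda r: (r["last"].lower(), r["first"].lower()),
    similar=same_person,
    window=2,
)
```

Sorting brings likely duplicates next to each other, so only neighbors inside the window are compared instead of all pairs.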
Duplicate elimination
Simple similarity measure
the number of matching atomic strings / total number of unique atomic strings
Universidad de Costa Rica, Faculdad de Ingeniería
Univ. de C. Rica Faculd. de Ingen.
similarity measure = 5/5
Universidad de Costa Rica, Faculdad de Ingeniería, Escuela de Ciencias de la Computación e Informática
Univ. de C. Rica Faculd. de Ingen.
similarity measure = 5/9
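One way to arrive at the two values above is sketched below. The slide does not say how abbreviations or short connecting words are handled, so two assumptions are made to reproduce 5/5 and 5/9: the words "de", "la", and "e" are ignored, and an abbreviation matches a word when one is a prefix of the other. The names `STOP_WORDS`, `atoms`, and `similarity` are hypothetical.

```python
import re

# Assumption: connecting words are not counted as atomic strings.
STOP_WORDS = {"de", "la", "e"}


def atoms(text):
    """Unique atomic strings of a text, lower-cased, in order of appearance."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in dict.fromkeys(words) if w not in STOP_WORDS]


def similarity(a, b):
    """Matching atomic strings divided by the total number of unique atomic strings."""
    left, right = atoms(a), atoms(b)
    unused = list(right)
    matches = 0
    for token in left:
        for other in unused:
            # Assumption: an abbreviation matches a word it is a prefix of.
            if token.startswith(other) or other.startswith(token):
                unused.remove(other)  # each atomic string is matched at most once
                matches += 1
                break
    total = len(left) + len(right) - matches
    return matches / total if total else 1.0


short = "Univ. de C. Rica Faculd. de Ingen."
full1 = "Universidad de Costa Rica, Faculdad de Ingeniería"
full2 = full1 + ", Escuela de Ciencias de la Computación e Informática"
```

With these assumptions `similarity(full1, short)` evaluates to 5/5 and `similarity(full2, short)` to 5/9.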
Duplicate elimination
Soundex
grouping entities having the same pronunciation
entities pronounced identically have the same SOUNDEX value (even if they are written differently)
soundex('Smith') = soundex('Smit') = S530
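A minimal sketch of the classic (American) Soundex coding follows; database systems ship their own `SOUNDEX` functions, whose details may differ from this illustration.

```python
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the 4-character Soundex code of a name ('' for an empty name)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (result + "000")[:4]
```

Records can then be grouped by `soundex(last_name)` so that differently spelled but identically pronounced names land in the same group.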
Levenshtein/edit distance
similarity measure of two character strings: source L1, destination L2
the distance is the minimal number of inserts and deletes (sometimes updates) of characters needed to obtain L2 from L1
L1 and L2 are identical: distance=0
ABC -> ABCDEF: distance=3
DEFCAB -> ABC: distance=5
Merge (Sagent), DataCleanser (EDD)
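The distances above can be checked with a short dynamic-programming sketch. By default it counts only inserts and deletes, as in the examples; the hypothetical `allow_update` flag additionally permits replacing one character at cost 1, which gives the usual Levenshtein distance.

```python
def edit_distance(src, dst, allow_update=False):
    """Minimal number of edit operations turning `src` into `dst`."""
    prev = list(range(len(dst) + 1))  # distances from the empty prefix of src
    for i, a in enumerate(src, 1):
        cur = [i]
        for j, b in enumerate(dst, 1):
            best = min(prev[j], cur[j - 1]) + 1  # delete a, or insert b
            if a == b:
                best = min(best, prev[j - 1])  # characters match, no cost
            elif allow_update:
                best = min(best, prev[j - 1] + 1)  # replace a with b
            cur.append(best)
        prev = cur
    return prev[-1]
```

For example, `edit_distance("ABC", "ABCDEF")` gives 3 and `edit_distance("DEFCAB", "ABC")` gives 5, matching the values listed above.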
Refreshing/loading a DW
When to refresh a DW?
  synchronously (after committing a transaction in a data source): RTDWs
  asynchronously: traditional DWs
    automatically in a given interval
    on demand
What to send?
  data (Oracle)
  transactions (Sybase, SQL Server)
How to refresh?
  incrementally
  fully
Refreshing efficiency
In a given finite time window
Read only necessary data
Avoid
  DISTINCT, set operators, NOT and non-equal joins (usually require full scans)
  function calls in the WHERE clause
  GROUP BY in queries reading source data
  sorting in a data source: it may be ineffective and may interfere with original processing in the data source
  triggers in a DW
Refreshing efficiency
Separate UPDATEs and INSERTs
  UPDATEs are not executed in a direct load path
  replacing UPDATE by DELETE and INSERT
  the number of UPDATEs > INSERTs => TRUNCATE TABLE + INSERTs
Indexes
  drop + re-create
  maintain on-line
  indexes and UPDATEs
    remove indexes not used by UPDATEs
    execute UPDATEs
    remove remaining indexes
    execute INSERTs
    re-create indexes
Integrity constraints
  turn off before loading
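The ordering of UPDATEs, INSERTs, and index maintenance can be sketched with SQLite. This only illustrates the sequence of steps: SQLite has no direct load path, and the table `sales`, the index `sales_amount_idx`, and the sample batch are hypothetical.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL);
    CREATE INDEX sales_amount_idx ON sales (amount);
    INSERT INTO sales VALUES (1, 10.0), (2, 20.0);
""")

batch = [(2, 25.0), (3, 30.0), (4, 40.0)]  # (id, amount) rows delivered by ETL

# Separate the batch into UPDATEs (key already in the DW) and INSERTs (new key).
existing = {row[0] for row in dw.execute("SELECT id FROM sales")}
updates = [(amount, key) for key, amount in batch if key in existing]
inserts = [(key, amount) for key, amount in batch if key not in existing]

with dw:  # commit everything together, roll back on error
    # The index on amount is not used by the UPDATEs (they locate rows by id),
    # so it is dropped first and re-created once after loading.
    dw.execute("DROP INDEX sales_amount_idx")
    dw.executemany("UPDATE sales SET amount = ? WHERE id = ?", updates)
    dw.executemany("INSERT INTO sales VALUES (?, ?)", inserts)
    dw.execute("CREATE INDEX sales_amount_idx ON sales (amount)")

rows = dw.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
```

Building the index once after the load avoids maintaining it row by row during the UPDATEs and INSERTs.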
Refreshing efficiency
Redo log
turn off redo log writing
  ETL software may roll back failed transactions
  data loaded in a batch mode
  failed transactions may be easily re-executed
Use direct load path
Filter data stored in files by means of an OS utility (awk)
Sort data stored in files by means of an OS utility (sort)
Sort and compute aggregates in the ETL engine (not in a DW)
Refreshing efficiency
Transformation of data
  in a DW (ELT)
  in an ETL workflow
Parallel loading (partitioned and non-partitioned tables)
Use native drivers for accessing data sources (avoid ODBC/JDBC)
Gather DW statistics after refreshing
Defragment DW
Purpose of ODS
Separating ETL processing from original processing in data sources
Re-executing failed transactions
ODS content
Original source data
Partially processed data
Storing ETL metadata
Mapping tables (EDS -> DW)
  lineage, data provenance
  DW rows and their origins in data sources + a chain of transformations
Designing ETL
Data profiling
repository
Jarke M., et al.: Improving OLTP Data Quality Using Data Warehouse Mechanisms. SIGMOD Record, 28(2), 1999
Executing ETL
Implementing ETL
ETL: workflow of transformations
Transformations
  aggregation
  filtering
  joining
  normalizing values
  lookup
  generating IDs
  sorting
  EDS connector (DB, file, ...)
  ...
  user-defined
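An ETL process seen as a workflow of transformations can be sketched as a chain of small functions, each consuming and producing a stream of rows. All names here (`filtering`, `lookup`, `generating_ids`, `run_workflow`) and the sample data are hypothetical; real ETL tools offer these steps as configurable components.

```python
from itertools import count


def filtering(predicate):
    """Transformation that keeps only rows satisfying the predicate."""
    return lambda rows: (r for r in rows if predicate(r))


def lookup(field, table, target):
    """Transformation that adds `target` by looking up `field` in a dictionary."""
    return lambda rows: ({**r, target: table.get(r[field])} for r in rows)


def generating_ids(field, start=1):
    """Transformation that assigns consecutive artificial identifiers."""
    def step(rows):
        ids = count(start)
        return ({**r, field: next(ids)} for r in rows)
    return step


def run_workflow(rows, steps):
    """Apply the transformations in order and materialize the result."""
    for step in steps:
        rows = step(rows)
    return list(rows)


countries = {"PL": "Poland", "CR": "Costa Rica"}
source = [
    {"name": "A", "country": "PL", "amount": 120},
    {"name": "B", "country": "CR", "amount": 0},
    {"name": "C", "country": "CR", "amount": 75},
]
result = run_workflow(
    source,
    [
        filtering(lambda r: r["amount"] > 0),
        lookup("country", countries, "country_name"),
        generating_ids("row_id"),
    ],
)
```

Because every step has the same shape, steps can be reordered or replaced, which is what workflow optimization relies on.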
ETL metadata
Business
dictionary of business terms
mapping of business terms into DW objects
business rules
data quality
schedules
scripts
execution logs
monitoring
ETL metadata
Technical
source description (localization, structure, content)
  source type (relational db, object db, xml, html, spreadsheet, ...)
  structure/schema
  access methods
  users and their access rights
  data profiling results
  daily increase in data volume
  total data volume
  data statistics (for access optimization)
DW description
  logical schema
  physical data structures
  various DW statistics (for query optimization)
  physical disk organization
ETL metadata
Technical
ETL descriptions
  implementations of algorithms (transforming, cleaning, integrating)
  scripts and tasks definitions
  execution schedules
  various dictionaries (countries, cities, ...)
  DW refreshing statistics (#rows loaded, #rows rejected, ...)
  refreshing logs
  workflow structure
  DS - DW mappings (schema and data)
Reliability
restart after erroneous execution
recovery after crash
Manageability
parameterized refreshing frequency
automatic start
  time-based
  token-based (data source informs ETL that data can be fetched)
Approaches
Off the shelf
  quicker deployment
  data repositories and metadata management
  built-in drivers to multiple systems
  dependency management between components
  incremental refreshing
  parallel processing
  expensive
User-defined
  longer development
  applicable to a particular solution
  cheaper
Commercial systems
Prototype systems
AJAX - Inria
Galhardas H., Florescu D., Shasha D., Simon E.: An Extensible Framework for Data Cleaning. ICDE, 2000
Galhardas H., Florescu D., Shasha D., Simon E.: AJAX: An Extensible Data Cleaning Tool. SIGMOD, 2000
AJAX
Input: a set of tables with inconsistent and duplicated rows
Output: a set of tables with consistent rows and no duplicates
Assumption
tables have defined primary keys
AJAX - components
Data transformation service
standardizing values
transformation
MAPPING macro-operator
CREATE MAPPING MG1
SELECT c.clID, c.FName, c.LName, c.Street, c.City, c.Code, c.TelNo, c.Education
INTO Clients_Clean
FROM Clients1 c
LET LName = INITCAP(c.LName)
    [Street, City, Code] = ExtractAdr(c.Address)
    Education = IF (c.Education is not null)
                THEN RETURN c.Education
                ELSE RETURN 'unknown'
AJAX
Record matching service - duplicate elimination
similarity measure in the range <0, 1>
MATCH macro-operator
CREATE MATCH MH1
FROM Clients1 c1, Clients1 c2
LET sim1 = LNameSimF(c1.LName, c2.LName)
    sim2 = AddressSimF(c1.Address, c2.Address)
    SIMILARITY = IF (sim1 > 0.9 and sim2 > 0.8) THEN RETURN MIN(sim1, sim2)
                 ELSE IF (sim1 between 0.6 and 0.89 and sim2 between 0.7 and 0.8) THEN RETURN sim1
                 ELSE RETURN 0
THRESHOLD SIMILARITY >= 0.7
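The decision rule of the MH1 operator can be restated in Python to make its thresholds easier to follow. This is not AJAX code: `pair_similarity` and `is_match` are hypothetical names, and the two input similarities are assumed to be already computed by functions such as LNameSimF and AddressSimF.

```python
def pair_similarity(sim1, sim2):
    """Combine last-name (sim1) and address (sim2) similarity as in MH1."""
    if sim1 > 0.9 and sim2 > 0.8:
        return min(sim1, sim2)
    if 0.6 <= sim1 <= 0.89 and 0.7 <= sim2 <= 0.8:
        return sim1
    return 0.0


def is_match(sim1, sim2, threshold=0.7):
    """A pair is reported as a match when the combined similarity reaches the threshold."""
    return pair_similarity(sim1, sim2) >= threshold
```

Note that a pair falling into the second branch is still rejected when its last-name similarity is below the 0.7 threshold.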
AJAX
Duplicate elimination
manual
semi-automatic
automatic (THRESHOLD > x)
CREATE MAPPING MG2
SELECT ID, LName, Address, ... INTO Clients
FROM MH1
LET id = IDGen(M.ID_Client1, M.ID_Client2)
    sim1 = LNameSimF(M.ID_Client1.LName, M.ID_Client2.LName)
    sim2 = StreetSimF(M.ID_Client1.Address, M.ID_Client2.Address)
    SIMILARITY
    LName = IF (sim1 > 0.9) THEN RETURN M.ID_Client1.LName
    Street = IF (sim2 > 0.9) THEN RETURN M.ID_Client1.Street
    .....
    Address = CONCAT(Street, City, Code)
THRESHOLD SIMILARITY >= 0.89
Potter's Wheel
Interactive and iterative process of data transformation and cleaning
a set of predefined transformations
transformations are applied to a small subset of data
transformations are visible to a user in real time
spreadsheet interface
Arktos II
Conceptual model transformed into implementation model
Unique features
  evolution of a workflow
  optimization of a workflow
Figuring out the set of correct transformations
Defining cost model of executions
Example
[Figure: example ETL workflow. Nodes: 1 Sales1 {..., total_price, s_date, ...}; 2 Sales2 {..., cost, sales_date, ...}; 4 EUR2PLN; 5 ConvertDate; 6 SUM(cost,month); NotNull(total_price); Select(total_price>9000)]
Sales1: total_price [PLN], s_date [yyyy-mm-dd], monthly sales
Sales2: cost [EUR], sales_date [dd/mm/yy], daily sales
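The transformations named in the example can be sketched in Python. Several things are assumptions, since the figure itself is not recoverable: daily Sales2 rows are taken as the input and monthly Sales1 rows as the output, the steps run in the order EUR2PLN, ConvertDate, SUM, NotNull, Select, the exchange rate is invented for illustration, and a month is represented by its first day.

```python
from collections import defaultdict
from datetime import datetime

EUR_TO_PLN = 4.0  # hypothetical exchange rate, for illustration only


def eur2pln(row):
    """EUR2PLN: convert cost from EUR to PLN."""
    return {**row, "cost": row["cost"] * EUR_TO_PLN}


def convert_date(row):
    """ConvertDate: dd/mm/yy -> yyyy-mm-dd."""
    parsed = datetime.strptime(row["sales_date"], "%d/%m/%y")
    return {**row, "sales_date": parsed.strftime("%Y-%m-%d")}


def sum_by_month(rows):
    """SUM(cost, month): aggregate daily costs into monthly totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sales_date"][:7]] += row["cost"]
    # Assumption: a month is stored as the date of its first day.
    return [
        {"s_date": month + "-01", "total_price": total}
        for month, total in sorted(totals.items())
    ]


sales2 = [
    {"cost": 1500.0, "sales_date": "03/01/24"},
    {"cost": 1000.0, "sales_date": "17/01/24"},
    {"cost": 500.0, "sales_date": "05/02/24"},
]
monthly = sum_by_month(convert_date(eur2pln(row)) for row in sales2)
# NotNull(total_price) followed by Select(total_price > 9000)
sales1 = [
    row for row in monthly
    if row["total_price"] is not None and row["total_price"] > 9000
]
```

Changing the order of such steps can change the cost of the workflow, which is the subject of the optimization discussed on these slides.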
Example
[Figure: alternative arrangement of the example workflow. Nodes: Select(total_price>9000); NotNull(total_price); 1 Sales1 {..., total_price, s_date, ...}; SUM(cost,month); Select(total_price>9000); 2 Sales2 {..., cost, sales_date, ...}; 4 EUR2PLN; 5 ConvertDate]
Problems
Tasks are often expressed as programs in procedural languages
constructing cost model
programs may have input parameters and conditional constructs
how to interpret and optimize code?
Commercial systems
???