
On Building Integrated and Distributed Database Systems

Data Integration for Warehousing - ETL


Robert Wrembel
Poznań University of Technology, Institute of Computing Science
Poznań, Poland
Robert.Wrembel@cs.put.poznan.pl www.cs.put.poznan.pl/rwrembel

Outline
ETL in a Data Warehouse architecture
ETL characteristics
Extraction
Transformation
Loading
Requirements for ETL
ETL metadata
Prototype systems


ETL in DW architecture
[Figure: DW architecture layers: DATA SOURCES; INTERMEDIATE LAYER (ETL SOFTWARE: extraction, transformation, loading (aggregation)); DATA WAREHOUSE and DATA MARTS; BI APPLICATIONS (reports, financial and statistical analysis)]


ETL characteristics
Developing ETL processes is
  critical for DW operation
    data quality
    data "freshness" (up to date)
    a DW is refreshed in a finite time window (any delay in refreshing makes the DW outdated, inconsistent, or unavailable for use)
  costly and time consuming
    up to 70% of project resources: people, hardware, software


ETL characteristics
Gartner report on DW projects in financial institutions (Fortune 500)
  100 persons involved in a DW project
    55 ETL
    17 system administrators (DB, hardware, software)
    4 system architects
    9 consultants for the end user on the BI technology
    5 software developers
    9 managers
  hardware: multiprocessor servers, TB disks (5 mln USD)
  ETL software (1 mln USD)
  typical number of data sources being integrated: 10-50


Technological challenges
Processing large data volumes in a limited time window
Delivering reliable (valid, true, consistent) data: data quality
Processing an ETL flow efficiently
Managing the evolution of data sources


Data quality - case study


Integration of dean's office databases
9 databases in total
over 70 000 students
[Figure: the nine source databases, labelled WTC, WMRiT, WIiZ, WFT, WArch, WBiI, WE, WBMiZ, WEiT]

Data quality - case study


Student number (SN)
  SN is unique within one dean's office database
  globally, SN is not unique: the problem of uniquely identifying a student
    found 49 pairs of students having the same SN; the students in a pair are physically different persons
  format: 6 digits + an optional letter {a, d, s, i}
    2.75% incorrect

"SSN"
  format: 11 digits
  22% incorrect
    incorrect length, characters instead of digits, wrong checksum, wrong gender


Data quality - case study


First names
  dictionary of first names was applied
  0.8% not in the dictionary
  0.9% incorrect
    illegal characters, illegal values

Last names
  dictionary of last names was applied
  20% not in the dictionary
  0.04% wrong characters

Predefined dictionaries
  31 values found, of which 2 are correct
  757 values found, of which 299 are correct
  74 values found, of which 3 are correct

Data quality - case study


The dictionary of cities
4.3% with wrong characters
81% with mixed lower-upper case character strings


ETL architecture
[Figure: ETL architecture: DATA SOURCES (databases, files) -> extraction (ODBC/JDBC) -> STAGING AREA/OPERATIONAL DATA STORE (ODS): transformation, cleaning, integration -> loading -> DATA WAREHOUSE]


Data sources
Each data source supplying data to a DW has to be identified
A data source description includes, among others:
  domain of activity (HR, Payroll, Marketing, ...)
  type of applications used for data processing
  data importance for a BI user
  who is a business user of source data
  who is a user of a technical architecture
  DBMS used to manage data
  hardware and operating systems
  the number of users per day
  data volume sizes
  DB schema
  the number of transactions per day

Data access technologies


Gateway
ODBC/JDBC
OLE DB (Object Linking and Embedding DataBase)
Drivers to various types of files (flat text, XML, ...)


Detecting changes in DSs


Requirements
  minimal interference with processing in data sources
  minimal (typically no) changes in data sources (structure, applications)

Solutions
  audit columns
    in a monitored table: date and time of operation, operation type (I, U, D)
    values provided by means of triggers or applications
  snapshot log: a system-maintained log of changes
  redo log: a system DB log (transaction rollback, transaction recovery, DB recovery)
    periodical analysis (log scraping) / on-line analysis (log sniffing)
  comparison of 2 consecutive snapshots
    low efficiency
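
Comparing two consecutive snapshots can be illustrated with a minimal Python sketch. It assumes each snapshot fits in memory as a dict keyed by primary key; the function name `diff_snapshots` and the reuse of the I/U/D operation codes are illustrative choices, not part of the original slides.

```python
def diff_snapshots(old, new):
    """Compare two snapshots {primary_key: row} and list the changes.

    Returns (operation, key, row) tuples, where operation is
    "I" (insert), "U" (update) or "D" (delete).
    """
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("I", key, row))   # row appeared
        elif old[key] != row:
            changes.append(("U", key, row))   # row content changed
    for key, row in old.items():
        if key not in new:
            changes.append(("D", key, row))   # row disappeared
    return changes


yesterday = {1: ("Ann",), 2: ("Bob",), 3: ("Cy",)}
today = {1: ("Ann",), 2: ("Bobby",), 4: ("Dee",)}
print(diff_snapshots(yesterday, today))
```

Both snapshots are read in full on every run, which is one reason this approach is inefficient for large tables.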

Analyzing data sources


Analytical methods (statistical, data mining) for estimating characteristics of data (data profiling)
Analytical methods
  data quality
    identifying NULL/NOT NULL columns
    for each attribute, counting the number of rows with NULL values and/or default values (a default value may denote that no value was provided during row insert)
    identifying columns with unique values
    maximum length of values
    allowed ranges/sets of values
    MIN, MAX, AVG, Variance, STDEV
    identifying not allowed values
    the number of rows with not allowed values
    attribute cardinality
    distribution of values for each attribute (histograms)
    data formats (e.g., dates, money, telephone numbers)
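
A few of these profiling measures can be sketched for a single column in plain Python. This is a minimal illustration, assuming the column is available as a list with `None` standing for NULL; the function name `profile_column` and the `default` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def profile_column(values, default=None):
    """Compute simple profiling measures for one column (None means NULL)."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    profile = {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        # Rows holding the declared default value (0 if no default is given).
        "defaults": sum(1 for v in non_null if v == default) if default is not None else 0,
        "cardinality": len(set(non_null)),
        "unique": len(set(non_null)) == len(non_null),
        "max_length": max((len(str(v)) for v in non_null), default=0),
        "histogram": Counter(non_null),
    }
    if numeric:
        profile.update(min=min(numeric), max=max(numeric),
                       avg=mean(numeric), variance=pvariance(numeric))
    return profile


print(profile_column([10, 20, 20, None, 0], default=0))
```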

Analyzing data sources


Analytical methods
  the structure and content of data sources
  daily growth of data volume
  tools: MigrationArchitect (Evoke Software), Integrity (Vality)


Analyzing data sources


Data mining methods
  association rules + domain knowledge
    Sapia C., Höfling G., et al.: On Supporting the Data Warehouse Design by Data Mining Techniques
  discovering attribute meaning
    (country='GB' -> sw=2), support 95%: sw = steering wheel; 2 = right side
  completing missing values based on rules with a high support
  replacing wrong values with correct ones
  discovering functional dependencies between attributes
  discovering potential keys
  discovering business rules implicitly encoded in applications
  tools: WizRule (WizSoft), DataMiningSuite (InformationDiscovery)


Transformation
Requirements
  Interactive and iterative process
    define rules
    start the transformation
    verify results
    modify rules
  Easily extendible
  Optimizable
  The more tasks executed automatically the better
  The less data for manual transformation the better


Transformation
Transformation to a common data model
  {object, O-R, semistructured, ...} -> relational
Transformation to a common representation
  Employee {SSN, FName, LName, Street, No, PostalCode, City}
Removing useless columns
User verification/correction is often required


Cleaning
Extracting atomic values from strings
  Piotrowo 2, 60-965, Poznań
  ordering the values
Removing NULL values
Replacing wrong values with correct ones
  spelling dictionaries
  name dictionaries (countries, cities, address codes)
Standardizing values
  formatting values (e.g., dates, money)
  converting currencies
  lower-upper case conversion
  consistent abbreviations
  synonym dictionaries (WordNet)
  abbreviation dictionaries
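
Extracting atomic values from an address string and standardizing their case can be sketched with a regular expression. This is a minimal sketch, assuming the "street number, postal code, city" layout of the example address and a two-dash-three postal code; the function name `extract_address` is hypothetical, and the field names follow the Employee schema shown earlier.

```python
import re

# Assumed layout: "<street> <number>, <dd-ddd postal code>, <city>".
ADDRESS = re.compile(
    r"^\s*(?P<Street>.+?)\s+(?P<No>\d+\w*)\s*,"
    r"\s*(?P<PostalCode>\d{2}-\d{3})\s*,"
    r"\s*(?P<City>.+?)\s*$"
)


def extract_address(text):
    """Split an address string into atomic fields; None if it does not parse."""
    match = ADDRESS.match(text)
    if match is None:
        return None  # leave the row for manual verification/correction
    fields = match.groupdict()
    # Standardize letter case of the textual fields.
    fields["Street"] = fields["Street"].title()
    fields["City"] = fields["City"].title()
    return fields


print(extract_address("piotrowo 2, 60-965, POZNAŃ"))
```

Rows that return `None` are the ones a user would have to verify or correct by hand.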

Cleaning
Merging semantically identical records
Generating artificial identifiers
Tools: IdCentric (FirstLogic), Trillium (TrilliumSoftware)



Integration - duplicate elimination


Compared records have to be cleaned first
  remove punctuation, white spaces, and special characters
  no abbreviations
Records differ slightly
  {Wrembel, Robert, ul. Wyspiańskiego, Poznań}
  {Wrbel, Robert, ul. Wyspiańskiego, Poznań}
Use natural identifiers (e.g., SSN, passport No, engine No, e-mail)
No natural identifiers
  sort + compare n neighbor records (window of size n)
  similarity function (e.g., if first and last names are identical then the records are identical)
    similarity weights for attributes
  approximate join
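
The "sort + compare n neighbor records" idea can be sketched as follows. This is a minimal illustration, assuming records are tuples held in memory; the function name `sorted_neighborhood` and its parameters are hypothetical, and the duplicate test used in the example is the simple rule quoted above (identical first and last names).

```python
def sorted_neighborhood(records, key, is_duplicate, window=3):
    """Sort records by `key`, then compare each record only with the
    next `window - 1` records instead of with every other record."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, record in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            if is_duplicate(record, other):
                pairs.append((record, other))
    return pairs


people = [
    ("Wrembel", "Robert", "Poznań"),
    ("Nowak", "Anna", "Warszawa"),
    ("Wrembel", "Robert", "Poznan"),
]
duplicates = sorted_neighborhood(
    people,
    key=lambda r: (r[0], r[1]),               # sort by last name, first name
    is_duplicate=lambda a, b: a[:2] == b[:2],  # identical names => same person
)
print(duplicates)
```

The window keeps the number of comparisons roughly linear in the number of records, at the price of missing duplicates that the sort key places far apart.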


Duplicate elimination
Simple similarity measure
  the number of matching atomic strings / total number of unique atomic strings
  Example 1
    Universidad de Costa Rica, Faculdad de Ingeniería
    Univ. de C. Rica Faculd. de Ingen.
    similarity measure = 5/5
  Example 2
    Universidad de Costa Rica, Faculdad de Ingeniería, Escuela de Ciencias de la Computación e Informática
    Univ. de C. Rica Faculd. de Ingen.
    similarity measure = 5/9
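
One way to reproduce the 5/5 and 5/9 values is sketched below. The slides do not say how atomic strings are matched, so two assumptions are made here and should be read as such: connecting words (`de`, `la`, `e`) are ignored, and two atomic strings match when one is a prefix of the other (which covers the abbreviations). The names `STOP_WORDS`, `atomic_strings`, and `similarity` are hypothetical.

```python
import re

STOP_WORDS = {"de", "la", "e"}  # assumption: connecting words are ignored


def atomic_strings(text):
    """Lower-cased unique words of `text`, without punctuation and stop words."""
    words = re.findall(r"\w+", text.lower())
    return list(dict.fromkeys(w for w in words if w not in STOP_WORDS))


def similarity(record1, record2):
    """Matching atomic strings / total number of unique atomic strings."""
    left, right = atomic_strings(record1), atomic_strings(record2)
    unmatched = list(right)
    matched = 0
    for a in left:
        for b in unmatched:
            # Assumption: an abbreviation matches the word it is a prefix of.
            if a.startswith(b) or b.startswith(a):
                unmatched.remove(b)
                matched += 1
                break
    total = len(left) + len(right) - matched
    return matched / total if total else 1.0


short = "Univ. de C. Rica Faculd. de Ingen."
print(similarity("Universidad de Costa Rica, Faculdad de Ingeniería", short))
print(similarity("Universidad de Costa Rica, Faculdad de Ingeniería, "
                 "Escuela de Ciencias de la Computación e Informática", short))
```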


Duplicate elimination
Soundex
  grouping entities having the same pronunciation
  entities pronounced identically have the same value of SOUNDEX (even if they are written differently)
  soundex('Smith') = soundex('Smit') = S530
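
For illustration, a minimal Python sketch of the commonly described American Soundex rules follows. It is an assumption that this variant is the one meant here; the `SOUNDEX` function of a particular DBMS may differ in details.

```python
# Letter groups of American Soundex; vowels and Y are not coded.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code of `name` ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets `previous`, allowing a repeat
    return (result + "000")[:4]


print(soundex("Smith"), soundex("Smit"))
```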

Levenshtein/edit distance
  similarity measure of two character strings: source L1, destination L2
  the distance is the minimal number of inserts and deletes (sometimes updates) of characters needed to obtain L2 from L1
  L1 and L2 are identical: distance=0
  ABC -> ABCDEF: distance=3
  DEFCAB -> ABC: distance=5
  Tools: Merge (Sagent), DataCleanser (EDD)
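
The distance can be computed with a standard dynamic-programming table. The sketch below follows the definition given above: inserts and deletes by default, with updates (substitutions) as an option. The function name `edit_distance` and the `allow_substitution` flag are illustrative.

```python
def edit_distance(source, destination, allow_substitution=False):
    """Minimal number of single-character edits turning source into destination.

    Inserts and deletes are always allowed; substitutions (updates)
    only when allow_substitution is True.
    """
    # previous[j] = distance between the processed prefix of source
    # and the first j characters of destination.
    previous = list(range(len(destination) + 1))
    for i, s in enumerate(source, 1):
        current = [i]
        for j, d in enumerate(destination, 1):
            if s == d:
                best = previous[j - 1]            # characters agree, no edit
            else:
                best = min(previous[j], current[j - 1]) + 1  # delete / insert
                if allow_substitution:
                    best = min(best, previous[j - 1] + 1)    # update
            current.append(best)
        previous = current
    return previous[-1]


print(edit_distance("ABC", "ABCDEF"), edit_distance("DEFCAB", "ABC"))
```

Both examples above give the same result with and without substitutions; the two variants differ on pairs such as "kitten" and "sitting".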

Refreshing/loading a DW
When to refresh a DW?
  synchronously (after committing a transaction in a data source): RTDWs
  asynchronously: traditional DWs
    automatically in a given interval
    on demand
What to send?
  data (Oracle)
  transactions (Sybase, SQL Server)
How to refresh?
  incrementally
  fully
How frequently to refresh?
  in a batch mode
  in a stream mode (RTDWs)

Refreshing efficiency
In a given finite time window
Read only necessary data
Avoid
  DISTINCT, set operators, NOT and non-equal joins (usually require full scans)
  function calls in the WHERE clause
  GROUP BY in queries reading source data
    sorting in a data source may be ineffective
    sorting may interfere with original processing in a data source
  triggers in a DW


Refreshing efficiency
Separate UPDATEs and INSERTs
  UPDATEs are not executed in a direct load path
  replace UPDATE by DELETE and INSERT
  if the number of UPDATEs > INSERTs: TRUNCATE TABLE + INSERTs
Indexes
  drop + re-create
  maintain on-line
  indexes and UPDATEs
    remove indexes not used by UPDATEs
    execute UPDATEs
    remove remaining indexes
    execute INSERTs
    re-create indexes
Integrity constraints
  turn off before loading
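
The "drop + re-create" index strategy can be sketched with Python's built-in `sqlite3` module. SQLite stands in for a real DW engine here, and the table `sales`, the index `idx_sales_date`, and the function `bulk_load` are hypothetical names used only for this illustration.

```python
import sqlite3


def bulk_load(conn, rows):
    """Load rows with the index dropped, then re-create the index once."""
    cur = conn.cursor()
    cur.execute("DROP INDEX IF EXISTS idx_sales_date")
    cur.executemany(
        "INSERT INTO sales (id, s_date, total_price) VALUES (?, ?, ?)", rows
    )
    # The index is built once over all rows instead of being
    # maintained row by row during the load.
    cur.execute("CREATE INDEX idx_sales_date ON sales (s_date)")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, s_date TEXT, total_price REAL)"
)
conn.execute("CREATE INDEX idx_sales_date ON sales (s_date)")
bulk_load(conn, [(i, "2024-01-%02d" % (i % 28 + 1), 100.0 * i)
                 for i in range(1, 1001)])
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
```

Whether dropping indexes pays off depends on the load size relative to the table size; for small increments, on-line index maintenance may be cheaper.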


Refreshing efficiency
Redo log
  turn off redo log writing
    ETL software may roll back failed transactions
    data are loaded in a batch mode
    failed transactions may be easily re-executed
  turn off redo log writing for a particular table
Use direct load path
Filter data stored in files by means of an OS utility (awk command)
Sort data stored in files by means of an OS utility (sort)
Sort and compute aggregates in the ETL engine (not in a DW)


Refreshing efficiency
Transformation of data
  in a DW (ELT)
  in an ETL workflow
Parallel loading (partitioned and non-partitioned tables)
Use native drivers for accessing data sources (avoid ODBC/JDBC)
Gather DW statistics after refreshing
Defragment DW


Purpose of ODS
Separating ETL processing from original processing in data sources
Re-executing failed transactions


ODS content
Original source data
Partially processed data
Storing ETL metadata
Mapping tables (EDS -> DW)
  lineage, data provenance
  DW rows and their origins in data sources + a chain of transformations
ODS is implemented as a database or a set of files


Designing ETL
Data profiling
Defining ETL workflows (repository)
Testing on a sample, verifying data quality
Executing ETL
Modifying EDS, improving data quality

Jarke M., et al.: Improving OLTP Data Quality Using Data Warehouse Mechanisms. SIGMOD Record, 28(2), 1999



Implementing ETL
ETL: a workflow of transformations
Transformations
  aggregation
  filtering
  joining
  normalizing values
  lookup
  generating IDs
  sorting
  EDS connector (DB, file, ...)
  ...
  user-defined


ETL metadata
Business
  dictionary of business terms
  mapping of business terms into DW objects
  business rules
  data quality
Managing ETL execution
  schedules
  scripts
  execution logs
  monitoring


ETL metadata
Technical
  source description (location, structure, content)
    source type (relational db, object db, xml, html, spreadsheet, ...)
    structure/schema
    access methods
    users and their access rights
    data profiling results
    daily increase in data volume
    total data volume
    data statistics (for access optimization)
  DW description
    logical schema
    physical data structures
    various DW statistics (for query optimization)
    physical disk organization

ETL metadata
Technical
  ETL descriptions
    implementations of algorithms (transforming, cleaning, integrating)
    script and task definitions
    execution schedules
    various dictionaries (countries, cities, ...)
    DW refreshing statistics (#rows loaded, #rows rejected, ...)
    refreshing logs
    workflow structure
    DS - DW mappings (schema and data)


Requirements for ETL


Efficiency
  finishing in a time window
  parallel executions
Reliability
  restart after erroneous execution
  recovery after crash
Manageability
  parameterized refreshing frequency
  automatic start
    time-based
    token-based (a data source informs ETL that data can be fetched)
  suspend and resume a task
Ensuring data quality
Security (access rights control)



Requirements for ETL


Data safety after a system crash
Predefined tasks
Automatic generation of executable code
Easy to modify
Extending with user-defined components
Batch execution
Monitoring execution
  processor time
  RAM
  throughput
  disk access competition
Automatic reporting about finishing, errors, ...
Metadata management



Approaches
Off the shelf
  quicker deployment
  data repositories and metadata management
  built-in drivers to all (multiple) systems
  dependency management between components
  incremental refreshing
  parallel processing
  expensive
User-defined
  longer development
  applicable to a particular solution
  cheaper


Commercial systems


Prototype systems
AJAX - Inria
  Galhardas H., Florescu D., Shasha D., Simon E.: An Extensible Framework for Data Cleaning. ICDE, 2000
  Galhardas H., Florescu D., Shasha D., Simon E.: AJAX: An Extensible Data Cleaning Tool. SIGMOD, 2000
Potter's Wheel - Berkeley
  Raman V., Hellerstein J.M.: Potter's Wheel: An Interactive Data Cleaning System. VLDB, 2001
Arktos II - National Univ. of Athens, Univ. of Ioannina
  Vassiliadis P., Simitsis A., Georgantas P., Terrovitis M.: A Framework for the Design of ETL Scenarios. CAiSE, 2003
  Simitsis A., Vassiliadis P., Skiadopoulos S., Sellis T.: Data Warehouse Refreshment. In Data Warehouses and OLAP: Concepts, Architectures and Solutions. IGI, 2007
  Simitsis A., Vassiliadis P., Sellis T.: Optimizing ETL Processes in Data Warehouses. ICDE, 2005
  Simitsis A., Vassiliadis P., Sellis T.: State-Space Optimization of ETL Workflows. IEEE TKDE (17):10, 2006
  Tziovara V., Vassiliadis P., Simitsis A.: Deciding the Physical Implementation of ETL Workflows. DOLAP, 2007

AJAX
Input: a set of tables with inconsistent and duplicated rows
Output: a set of tables with consistent, non-duplicated rows
Assumption
  tables have defined primary keys


AJAX - components
Data transformation service
  standardizing values
  transformation
  MAPPING macro-operator

  CREATE MAPPING MG1
  SELECT c.clID, c.FName, c.LName, c.Street, c.City, c.Code,
         c.TelNo, c.Education
  INTO Clients_Clean
  FROM Clients1 c
  LET LName=INITCAP(c.LName)
      [Street, City, Code]=ExtractAdr(c.Address)
      Education=IF(c.Education is not null)
                THEN RETURN c.Education
                ELSE RETURN 'unknown'


AJAX
Record matching service - duplicate elimination
  similarity measure <0, 1>
  MATCH macro-operator

  CREATE MATCH MH1
  FROM Clients1 c1, Clients1 c2
  LET sim1=LNameSimF(c1.LName, c2.LName)
      sim2=AddressSimF(c1.Address, c2.Address)
      SIMILARITY=IF (sim1>0.9 and sim2>0.8) THEN RETURN MIN(sim1,sim2)
                 ELSE IF (sim1 between 0.6 and 0.89 and
                          sim2 between 0.7 and 0.8) THEN RETURN sim1
                 ELSE RETURN 0
  THRESHOLD SIMILARITY>=0.7

Result stored in a temporary table - matching table
  M {ID_Client1, ID_Client2, similarity}
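
What the MATCH operator computes can be mimicked in a few lines of Python. This is only a sketch of the semantics, not AJAX itself: `difflib.SequenceMatcher` is used as a stand-in for the user-defined functions LNameSimF and AddressSimF, and the names `sim`, `match_similarity`, and `build_matching_table` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def sim(a, b):
    """Stand-in similarity in <0, 1> (assumption: not the AJAX functions)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_similarity(sim1, sim2):
    """The SIMILARITY rule of the MATCH MH1 example."""
    if sim1 > 0.9 and sim2 > 0.8:
        return min(sim1, sim2)
    if 0.6 <= sim1 <= 0.89 and 0.7 <= sim2 <= 0.8:
        return sim1
    return 0.0


def build_matching_table(clients, threshold=0.7):
    """Return rows (ID_Client1, ID_Client2, similarity) above the threshold."""
    rows = []
    # Each unordered pair of distinct clients is compared once.
    for c1, c2 in combinations(clients, 2):
        s = match_similarity(sim(c1["LName"], c2["LName"]),
                             sim(c1["Address"], c2["Address"]))
        if s >= threshold:
            rows.append((c1["clID"], c2["clID"], s))
    return rows


clients = [
    {"clID": 1, "LName": "Wrembel", "Address": "Piotrowo 2, Poznań"},
    {"clID": 2, "LName": "Wrembel", "Address": "Piotrowo 2, Poznań"},
    {"clID": 3, "LName": "Kowalski", "Address": "Piotrowo 2, Poznań"},
]
print(build_matching_table(clients))
```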


AJAX
Duplicate elimination
  manual
  semi-automatic
  automatic: THRESHOLD > x

  CREATE MAPPING MG2
  SELECT ID, LName, Address, ... INTO Clients
  FROM MH1
  LET id=IDGen(M.ID_Client1, M.ID_Client2)
      sim1=LNameSimF(M.ID_Client1.LName, M.ID_Client2.LName)
      sim2=StreetSimF(M.ID_Client1.Address, M.ID_Client2.Address)
      SIMILARITY
      LName=IF (sim1>0.9) THEN RETURN M.ID_Client1.LName
      Street=IF (sim2>0.9) THEN RETURN M.ID_Client1.Street
      .....
      Address=CONCAT(Street, City, Code)
  THRESHOLD SIMILARITY>=0.89

Potter's Wheel
Interactive and iterative process of data transformation and cleaning
  a set of predefined transformations
  transformations are applied to a small subset of data
  transformations are visible to a user in real time
  spreadsheet interface


Arktos II
Conceptual model transformed into implementation model
Unique features
  evolution of a workflow
  optimization of a workflow


ETL unsolved problems


Structural changes in data sources
  the Wikipedia schema changed every 9-10 days on average during the last 4 years
  telecommunication data sources changed their schemas every 7-13 days on average
  banking data sources changed their schemas every 2-4 weeks on average
  the most frequent changes: increasing the length of a column, changing the data type of a column, and adding a new column


ETL unsolved problems

ETL optimization
Workflow transformation
  reordering tasks
  parallelizing tasks
  merging and splitting tasks
Figuring out the set of correct transformations
Defining a cost model of executions


Example
[Figure: initial ETL workflow. Sources: (1) Sales1 {..., total_price, s_date, ...} and (2) Sales2 {..., cost, sales_date, ...}. Tasks shown: NotNull(total_price), Select(total_price>9000), (4) EUR2PLN, (5) ConvertDate, (6) SUM(cost,month)]

Sales1: total_price [PLN], s_date [yyyy-mm-dd], monthly sales
Sales2: cost [EUR], sales_date [dd/mm/yy], daily sales


Example
[Figure: reordered ETL workflow. Sources: (1) Sales1 {..., total_price, s_date, ...} and (2) Sales2 {..., cost, sales_date, ...}. Tasks shown: Select(total_price>9000), NotNull(total_price), SUM(cost,month), a second Select(total_price>9000), (4) EUR2PLN, (5) ConvertDate]

Minimize the amount of processed data
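
The effect of reordering can be shown on a deliberately simplified two-task workflow: a date conversion and a selection that does not depend on the date. The sketch below is an illustration under that assumption and does not reproduce the workflow of the figure; the row layout, the threshold, and all function names are hypothetical.

```python
from datetime import datetime


def convert_date(row, counter):
    """ConvertDate task: dd/mm/yy -> yyyy-mm-dd; counts processed rows."""
    counter["converted"] += 1
    parsed = datetime.strptime(row["sales_date"], "%d/%m/%y")
    return {**row, "sales_date": parsed.strftime("%Y-%m-%d")}


def convert_then_select(rows, threshold):
    """Original order: convert every row, then filter."""
    counter = {"converted": 0}
    converted = [convert_date(r, counter) for r in rows]
    return [r for r in converted if r["cost"] > threshold], counter["converted"]


def select_then_convert(rows, threshold):
    """Reordered: filter first, convert only the surviving rows."""
    counter = {"converted": 0}
    kept = [r for r in rows if r["cost"] > threshold]
    return [convert_date(r, counter) for r in kept], counter["converted"]


sales = [
    {"cost": 12000, "sales_date": "05/03/24"},
    {"cost": 100, "sales_date": "06/03/24"},
    {"cost": 9500, "sales_date": "07/03/24"},
]
print(convert_then_select(sales, 9000)[1], select_then_convert(sales, 9000)[1])
```

Both orders return the same rows, but the reordered workflow converts fewer of them. The swap is valid only because the selection condition does not use the converted attribute; a selection on a converted or aggregated value cannot be moved this freely.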


Problems
Tasks are often expressed as programs in procedural languages
  constructing a cost model
  programs may have input parameters and conditional constructs
  how to interpret and optimize code?
Commercial systems
  ???
