
Introduction to Ab Initio

Prepared By : Ashok Chanda

Accenture

Ab Initio Training

Ab Initio Session 1

Introduction to DWH
Explanation of DW Architecture
Operating System / Hardware Support
Introduction to ETL Process
Introduction to Ab Initio
Explanation of Ab Initio Architecture


What is a Data Warehouse?

A data warehouse is a copy of transaction data specifically structured for querying and reporting.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.

Data Warehouse - Definitions

A data warehouse is a database geared towards the business intelligence requirements of an organization. The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals.
Data warehouses contain historical information that enables analysis of business performance over time.
A collection of databases combined with a flexible data extraction system.

Data Warehouse

A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often changes, and data warehouses often focus on a specific activity or entity.


Why Use a Data Warehouse?


Data exploration and discovery
Integrated and consistent data
Quality-assured data
Easily accessible data
Production and performance awareness
Access to data in a timely manner


Simplified Data Warehouse Architecture


Data Warehouse Architecture

Data Warehouses can be architected in many different ways, depending on the specific needs of a business. The model shown below is the "hub-and-spoke" Data Warehousing architecture that is popular in many organizations. In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse, and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform, Load).


The ETL Process


Capture (extract)
Scrub (data cleansing)
Transform
Load and index
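These four steps can be sketched in plain Python. The records, field names, and cleansing rule below are illustrative assumptions, not part of any particular ETL product:

```python
# Minimal ETL sketch: capture -> scrub -> transform -> load.
# Source rows and field names are made up for illustration.

raw = [
    {"id": "1", "amount": " 100.50 "},
    {"id": "2", "amount": "bad"},      # will be removed by the scrub step
    {"id": "3", "amount": "75"},
]

def capture(rows):
    """Extract the fields of interest from the source records."""
    return [{"id": r["id"], "amount": r["amount"].strip()} for r in rows]

def scrub(rows):
    """Drop records whose amount is not numeric (data cleansing)."""
    def ok(r):
        try:
            float(r["amount"])
            return True
        except ValueError:
            return False
    return [r for r in rows if ok(r)]

def transform(rows):
    """Convert fields to their target types for aggregation downstream."""
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows]

warehouse = []

def load(rows):
    """Append transformed rows to the target store and build a simple index."""
    warehouse.extend(rows)
    return {r["id"]: r for r in warehouse}   # the "index" step

index = load(transform(scrub(capture(raw))))
```

Each stage feeds the next, mirroring the capture, scrub, transform, and load-and-index sequence above.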


ETL Technology

ETL Technology is an important component of the Data Warehousing Architecture. It is used to copy data from Operational Applications to the Data Warehouse Staging Area, from the DW Staging Area into the Data Warehouse and finally from the Data Warehouse into a set of conformed Data Marts that are accessible by decision makers. The ETL software extracts data, transforms values of inconsistent data, cleanses "bad" data, filters data and loads data into a target database. The scheduling of ETL jobs is critical. Should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.
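The scheduling requirement above, where a failure in one ETL job must make dependent jobs respond appropriately, can be sketched as a toy dependency-aware runner. The job names, dependencies, and the simulated failure are hypothetical:

```python
# Sketch of dependency-aware ETL scheduling: when a job fails,
# jobs that depend on it are skipped rather than run blindly.

def run_jobs(jobs, depends_on, fails=()):
    """jobs: job names in run order; depends_on: job -> prerequisite jobs;
    fails: jobs simulated to fail. Returns each job's final status."""
    status = {}
    for job in jobs:
        prereqs = depends_on.get(job, [])
        if any(status.get(p) != "ok" for p in prereqs):
            status[job] = "skipped"      # respond to an upstream failure
        elif job in fails:
            status[job] = "failed"
        else:
            status[job] = "ok"
    return status

jobs = ["extract_sales", "extract_customers", "load_staging", "load_warehouse"]
deps = {
    "load_staging": ["extract_sales", "extract_customers"],
    "load_warehouse": ["load_staging"],
}
result = run_jobs(jobs, deps, fails={"extract_customers"})
```

With `extract_customers` failing, the staging and warehouse loads are skipped instead of loading incomplete data.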


Data Warehouse Staging Area

The Data Warehouse Staging Area is a temporary location where data from source systems is copied. A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse. Due to varying business cycles, data processing cycles, hardware and network resource limitations, and geographical factors, it is not feasible to extract all the data from all operational databases at exactly the same time.


Examples - Staging Area

For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract "customer" data from a database in Singapore at noon Eastern Standard Time, but this would not be feasible for "customer" data in a Chicago database. Data in the Data Warehouse Staging Area can be either persistent (i.e. it remains around for a long period) or transient (i.e. it only remains around temporarily). Not all businesses require a Data Warehouse Staging Area. For many businesses it is feasible to use ETL to copy data directly from operational databases into the Data Warehouse.


Data Warehouse

The purpose of the Data Warehouse in the overall Data Warehousing Architecture is to integrate corporate data. It contains the "single version of the truth" for the organization, carefully constructed from data stored in disparate internal and external operational databases. The amount of data in the Data Warehouse is massive, and it is stored at a very granular level of detail. For example, every "sale" that has ever occurred in the organization is recorded and related to dimensions of interest. This allows data to be sliced and diced, summed, and grouped in ways that could not have been anticipated.


Data Warehouse

Contrary to popular opinion, the Data Warehouse does not contain all the data in the organization. Its purpose is to provide the key business metrics needed by the organization for strategic and tactical decision making. Decision makers do not access the Data Warehouse directly; they use various front-end Data Warehouse tools that read data from subject-specific Data Marts. The Data Warehouse can be either "relational" or "dimensional", depending on how the business intends to use the information.


Data Warehouse Environment


In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

Data Mart

A subset of a data warehouse, for use by a single department or function.
A repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.
A subset of the information contained in a data warehouse. Data marts have the same definition as the data warehouse, but data marts have a more limited audience and/or data content.

Data Mart

ETL (Extract, Transform, Load) jobs extract data from the Data Warehouse and populate one or more Data Marts for use by groups of decision makers in the organization. The Data Marts can be dimensional (star schemas) or relational, depending on how the information is to be used and what "front end" Data Warehousing tools will be used to present the information. Each Data Mart can contain different combinations of tables, columns, and rows from the Enterprise Data Warehouse. For example, a business unit or user group that doesn't require a lot of historical data might only need transactions from the current calendar year in the database. The Personnel Department might need to see all details about employees, whereas data such as "salary" or "home address" might not be appropriate for a Data Mart that focuses on Sales.


Star Schema

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table, and the points of the star are the dimension tables.

Star Schema - continued

A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.

Advantages of Star Schemas

Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
Provide highly optimized performance for typical star queries.
Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
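A minimal star schema and a typical star query can be sketched with Python's built-in sqlite3 module. All table and column names here are illustrative:

```python
import sqlite3

# A tiny star schema: one fact table, two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date    VALUES (10, 2004), (11, 2005);
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 150.0), (2, 11, 80.0);
""")

# A typical star query: join the fact table to its dimension tables
# and aggregate a measure grouped by dimension attributes.
rows = con.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date    d ON f.date_id    = d.date_id
    GROUP BY p.name, d.year
    ORDER BY p.name, d.year
""").fetchall()
```

Every join radiates directly from the fact table to a dimension table, which is what makes star queries easy for optimizers to recognize.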

Star Schema

Diagrammatic representation of star schema


Snowflake Schema

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy.

Snowflake Schema - Example

That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
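The normalization described above can be sketched in SQLite. To keep it short, only the category split is shown, and all names are illustrative:

```python
import sqlite3

# Snowflake sketch: the product dimension is normalized into a
# products table and a product_category table, so reaching the
# category name costs one extra foreign-key join.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES product_category(category_id)
);
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
INSERT INTO product_category VALUES (1, 'Hardware');
INSERT INTO products         VALUES (100, 'Widget', 1);
INSERT INTO fact_sales       VALUES (100, 50.0), (100, 25.0);
""")

# Sales by category now requires fact -> products -> product_category,
# one more join than the equivalent star-schema query.
rows = con.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN products p         ON f.product_id  = p.product_id
    JOIN product_category c ON p.category_id = c.category_id
    GROUP BY c.category
""").fetchall()
```

The extra join chain is exactly the complexity and performance cost the slide describes.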

Diagrammatic representation for Snowflake Schema


Fact Table
The centralized table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys.
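A fact table whose primary key is a composite of its foreign keys can be sketched in SQLite (table and column names are illustrative):

```python
import sqlite3

# Fact table sketch: the primary key is the composite of the
# foreign keys pointing at the dimension tables.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    amount     REAL,                    -- the fact (measure) column
    PRIMARY KEY (product_id, store_id)  -- composite key of foreign keys
);
INSERT INTO dim_product VALUES (1);
INSERT INTO dim_store   VALUES (7);
INSERT INTO fact_sales  VALUES (1, 7, 99.0);
""")

# The composite key rejects a second row for the same key combination.
try:
    con.execute("INSERT INTO fact_sales VALUES (1, 7, 5.0)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```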


What happens during the ETL process?

During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. After extracting data, it has to be physically transported to the target system or an intermediate system for further processing.


Examples of Second-Generation ETL Tools

Powermart 4.5 (Informatica Corporation) - pioneer due to market share; a general-purpose tool oriented to data marts
Ardent DataStage (Ardent Software, Inc.)
Sagent Data Mart Solution 3.0 (Sagent Technology)
Ab Initio 2.2 (Ab Initio Software)
Tapestry 2.1 (D2K, Inc.)

Characteristics noted across these tools include: progressive integration with Microsoft, a kit of tools that can be used to build applications, and an end-to-end data warehousing solution from a single vendor.

What to look for in ETL tools


Use an optional data cleansing tool to clean up source data
Use an extraction/transformation/load tool to retrieve, cleanse, transform, summarize, aggregate, and load data
Use modern, engine-driven technology for fast, parallel operation
Goal: define 100% of the transformation rules with a point-and-click interface
Support development of logical and physical data models
Generate and manage a central metadata repository
Open metadata exchange architecture to integrate central metadata with local metadata
Support metadata standards
Provide end users access to metadata in business terms


Operating System / Hardware Support

This section discusses how a DBMS utilizes OS/hardware features such as parallel functionality, SMP/MPP support, and clustering. These OS/hardware features greatly extend scalability and improve performance. However, managing an environment with these features is difficult and expensive.

Parallel Functionality

The introduction and maturation of parallel processing environments are key enablers of increasing database sizes, as well as providing acceptable response times for storing, retrieving, and administrating data. DBMS vendors are continually bringing products to market that take advantage of multi-processor hardware platforms. These products can perform table scans, backups, loads, and queries in parallel.

Parallel Features
An overview of typical parallel functionality is given below:
Queries - parallel queries can enhance scalability for many query operations.
Data load - performance is always a serious issue when loading large databases. Meeting response-time requirements is the overriding factor for determining the best load method and should be a key part of a performance benchmark.
Create table as select - this feature makes it possible to create aggregated tables in parallel.
Index creation - parallel index creation exploits the benefits of parallel hardware by distributing the workload of creating a large index across a large number of processors.
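The parallel-scan idea behind these features can be sketched in Python: partition the "table", let workers aggregate their partitions concurrently, then merge the partial results. The data and partitioning are made up, and CPython threads illustrate the pattern rather than true CPU parallelism:

```python
from concurrent.futures import ThreadPoolExecutor

# Parallel table-scan sketch: split the table into partitions,
# scan each partition concurrently, then merge the partial results.
table = list(range(1, 101))                    # 100 "rows"
partitions = [table[i::4] for i in range(4)]   # 4-way partitioning

def scan(partition):
    """One worker's share of the scan: aggregate its own partition."""
    return sum(partition)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(scan, partitions))

total = sum(partials)                          # the merge step
```

A parallel query, load, or index build in a real DBMS follows the same shape: independent work per partition, then a cheap combine step.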


Which parallel processor configuration: SMP or MPP?

SMP and clustered SMP environments have the flexibility and ability to scale in small increments. SMP environments are often useful for the large but static data warehouse, where the data cannot be easily partitioned due to the unpredictable nature of how the data is joined over multiple tables for complex searches and ad-hoc queries.


Which parallel processor configuration: SMP or MPP?

MPP works well in environments where growth is potentially unlimited, access patterns to the database are predictable, and the data can be easily partitioned across different MPP nodes with minimal data accesses crossing between them. This often occurs in large OLTP environments, where transactions are generally small and predictable, as opposed to decision support and data warehouse environments, where multiple tables can be joined in unpredictable ways. In fact, data warehousing and decision support are the areas most vendors of parallel hardware platforms and DBMSs are targeting. MPP does not scale well if heavy data warehouse database accesses must cross MPP nodes, causing I/O bottlenecks over the MPP interconnect, or if multiple MPP nodes are continually locked for concurrent record updates.


A Multi-CPU Computer (SMP)


A Network of Multi-CPU Nodes


A Network of Networks


Parallel Computer Architecture

Computers come in many shapes and sizes:
Single-CPU and multi-CPU computers
Networks of single-CPU computers
Networks of multi-CPU computers
Multi-CPU machines are often called SMPs (Symmetric Multi-Processors). Specially built networks of machines are often called MPPs (Massively Parallel Processors).


Introduction to Ab Initio


History of Ab Initio

Ab Initio Software Corporation was founded in the mid-1990s by Sheryl Handler, the former CEO of Thinking Machines Corporation, after TMC filed for bankruptcy. In addition to Handler, other former TMC people involved in the founding of Ab Initio included Cliff Lasser, Angela Lordi, and Craig Stanfill. Ab Initio is known for being very secretive in the way it runs its business, but its software is widely regarded as top notch.

History of Ab Initio

The Ab Initio software is a fourth-generation, GUI-based parallel processing tool for data analysis, batch processing, and data manipulation, used mainly to extract, transform, and load data. The Ab Initio software is a suite of products that together provide a platform for robust data processing applications. The core Ab Initio products are:
The Co>Operating System
The Component Library
The Graphical Development Environment

What Does Ab Initio Mean?


Ab Initio is Latin for "from the beginning." From the beginning, our software was designed to support a complete range of business applications, from the simple to the most complex. Crucial capabilities like parallelism and checkpointing can't be added after the fact. The Graphical Development Environment and a powerful set of components allow our customers to get valuable results from the beginning.


Ab Initio's Focus

Moving data
move small and large volumes of data in an efficient manner
deal with the complexity associated with business data

High performance
scalable solutions

Better productivity


Ab Initio's Software

Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:
Data warehousing
Batch processing
Click-stream analysis
Data movement
Data transformation


Applications of Ab Initio Software

Processing just about any form and volume of data
Parallel sort/merge processing
Data transformation
Rehosting of corporate data
Parallel execution of existing applications


Ab Initio Provides For:

Distribution - a platform for applications to execute across a collection of processors, within the confines of a single machine or across multiple machines.
Reduced run-time complexity - the ability for applications to run in parallel, from a single point of control, on any combination of computers where the Ab Initio Co>Operating System is installed.

Applications of Ab Initio Software in terms of Data Warehouse

Front end of Data Warehouse:
Transformation of disparate sources
Aggregation and other preprocessing
Referential integrity checking
Database loading

Back end of Data Warehouse:
Extraction for external processing
Aggregation and loading of Data Marts



Ab Initio or Informatica - Powerful ETL

Informatica and Ab Initio both support parallelism, but Informatica supports only one type of parallelism while Ab Initio supports three. In Informatica, the developer needs to define partitions in the Server Manager to achieve parallelism; in Ab Initio, the tool itself takes care of parallelism. The three types of parallelism in Ab Initio are:
1. Component parallelism
2. Data parallelism
3. Pipeline parallelism
Ab Initio has no built-in scheduler like Informatica's; you need to schedule graphs through scripts or run them manually. Ab Initio also supports different types of text files, meaning you can read the same file with different record structures, which is not possible in Informatica. Ab Initio is also generally considered more user-friendly than Informatica.
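Two of these parallelism styles can be illustrated in plain Python. Ab Initio implements them inside the Co>Operating System, so this only sketches the concepts with made-up data:

```python
# Data parallelism: the same transform applied to separate partitions
# of the data (each partition could run on its own processor).
data = list(range(10))
partitions = [data[0::2], data[1::2]]
transformed = [[x * 10 for x in part] for part in partitions]

# Pipeline parallelism: records stream through connected stages, so a
# downstream stage can start consuming before the upstream stage has
# finished producing everything. Generators model the streaming.
def reformat(records):
    for r in records:
        yield r * 10          # stage 1: transform each record

def keep_large(records):
    for r in records:
        if r >= 50:           # stage 2: filter stage 1's output lazily
            yield r

pipeline_out = list(keep_large(reformat(range(10))))
```

Component parallelism, the third style, is simply independent components of a graph running at the same time on unrelated data.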



Ab Initio or Informatica - Powerful ETL (continued)

Error handling - in Ab Initio you can attach error and reject files to each transformation and capture and analyze the messages and data separately. Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.
Robust transformation language - Informatica is very basic as far as transformations go; without going into a function-by-function comparison, Ab Initio's transformation language is much more robust.
Instant feedback - on execution, Ab Initio tells you how many records have been processed/rejected/etc. and gives detailed performance metrics for each component. Informatica has a debug mode, but it is slow and difficult to adapt to.


Both tools are fundamentally different


Which one to use depends on the work at hand and the existing infrastructure and resources available. Informatica is an engine-based ETL tool; the power of this tool is in its transformation engine, and the code that it generates after development cannot be seen or modified. Ab Initio is a code-based ETL tool; it generates ksh or bat code, which can be modified to achieve goals that cannot be met through the ETL tool itself. Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice, whereas other ETL tools do have administrative work.


Ab Initio Product Architecture


The product architecture, from top to bottom:
User applications
Development environments: GDE, Ab Initio EME, Shell
Component Library, 3rd-party components, user-defined components
The Ab Initio Co>Operating System
Native operating system (Unix, Windows, OS/390)


Ab Initio Architecture - Explanation

The Ab Initio Co>Operating System unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe-class reliability. The Co>Operating System is layered on top of the native operating systems of a collection of servers. It provides a distributed model for process execution, file management, debugging, process monitoring, and checkpointing. A user may perform all of these functions from a single point of control.


Co>Operating System Services

Parallel and distributed application execution:
Control
Data transport
Transactional semantics at the application level
Checkpointing
Monitoring and debugging
Parallel file management
Metadata-driven components


Ab Initio: What We Do

Ab Initio software helps you build large-scale data processing applications and run them in parallel environments. Ab Initio software consists of two main programs:
The Co>Operating System, which your system administrator installs on a host Unix or Windows NT server, as well as on processing computers.
The Graphical Development Environment (GDE), which you install on your PC (the GDE computer) and configure to communicate with the host.

The Ab Initio Co>Operating System

The Co>Operating System:
Runs across a variety of operating systems and hardware platforms, including OS/390 on mainframes, Unix, and Windows.
Supports distributed and parallel execution.
Can provide scalability proportional to the hardware resources provided.
Supports platform-independent data transport.

The Ab Initio Co>Operating System - continued


The Ab Initio Co>Operating System depends on parallelism to connect (i.e., cooperate with) diverse databases. It extracts, transforms and loads data to and from Teradata and other data sources.


Co>Operating System Layer


The GDE (on any OS) forms the top layer; the Co>Operating System runs underneath on Solaris, AIX, NT, Linux, or NCR. The same Co>Op commands work on any OS, and GDE graphs can be moved from one OS to another without any changes.


The Ab Initio Co>Operating System Runs on:


Sun Solaris
IBM AIX
Hewlett-Packard HP-UX
Siemens Pyramid Reliant UNIX
IBM DYNIX/ptx
Silicon Graphics IRIX
Red Hat Linux
Windows NT 4.0 (x86)
Windows 2000 (x86)
Compaq Tru64 UNIX
IBM OS/390
NCR MP-RAS


Connectivity to Other Software

Common, high-performance database interfaces:
IBM DB2, DB2/PE, DB2 EEE, UDB, IMS
Oracle, Informix XPS, Sybase, Teradata, MS SQL Server 7
OLE DB
ODBC

Other software packages:
Connectors to many other third-party products: Trillium, ERwin, Siebel, etc.


Ab Initio Co>Operating System


Ab Initio Software Corporation, headquartered in Lexington, MA, develops software solutions that process vast amounts of data (well into the terabyte range) in a timely fashion by employing many (often hundreds of) server processors in parallel. Major corporations worldwide use Ab Initio software in mission-critical, enterprise-wide data processing systems. Together, Teradata and Ab Initio deliver:
End-to-end solutions for integrating and processing data throughout the enterprise
Software that is flexible, efficient, and robust, with unlimited scalability
Professional and highly responsive support
The Co>Operating System executes your application by creating and managing the processes and data flows that the components and arrows represent.


Graphical Development Environment (GDE)


The GDE
The Graphical Development Environment (GDE) provides a graphical user interface to the services of the Co>Operating System. The GDE:
Enables you to create applications by dragging and dropping components.
Allows point-and-click operations on executable flowcharts, which the Co>Operating System can execute directly.
Provides graphical monitoring of running applications, allowing you to quantify data volumes and execution times and to spot opportunities for improving performance.


The Graph Model


The Component Library

The Component Library provides reusable software modules for sorting, data transformation, database loading, etc. The components adapt at runtime to the record formats and business rules controlling their behavior. Ab Initio products have helped reduce a project's development and research time significantly.


Components

Components may run on any computer running the Co>Operating System. Different components do different jobs. The particular work a component accomplishes depends on its parameter settings. Some parameters are data transformations, that is, business rules to be applied to one or more inputs to produce a required output.

3rd Party Components


EME

The Enterprise Meta>Environment (EME) is a high-performance, object-oriented storage system that inventories and manages various kinds of information associated with Ab Initio applications. It provides storage for all aspects of your data processing system, from design information to operations data. The EME also provides a rich store for the applications themselves, including data formats and business rules. It acts as a hub for data and definitions. Integrated metadata management provides a global, consolidated view of the structure and meaning of applications and data: information that is usually scattered throughout your business.

Benefits of EME
The Enterprise Meta>Environment provides a rich store for applications and all of their associated information, including:
Technical metadata - application-related business rules, record formats, and execution statistics.
Business metadata - user-defined documentation of job functions, roles, and responsibilities.
Metadata is data about data and is critical to understanding and driving your business processes and computational resources. Storing and using metadata is as important to your business as storing and using data.

EME - Ab Initio Relevance

By integrating technical and business metadata, you can grasp the entirety of your data processing, from operational to analytical systems. The EME is a completely integrated environment. The following figure shows how it fits into the high-level architecture of the Ab Initio software.

Stepwise explanation of Ab Initio Architecture

You construct your application from building blocks called components, manipulating them through the Graphical Development Environment (GDE). You check your applications in to the EME. The EME and the GDE use the underlying functionality of the Co>Operating System to perform many of their tasks. The Co>Operating System unites the distributed resources into a single virtual computer to run applications in parallel. Ab Initio software runs on the Unix, Windows NT, and MVS operating systems.

Stepwise explanation of Ab Initio Architecture - continued

Ab Initio connector applications extract metadata from third-party metadata sources into the EME, or extract it from the EME into a third-party destination. You view the results of project and application dependency analysis through a Web user interface. You also view and edit your business metadata through a Web user interface.


EME: Various user constituencies served


The EME addresses the metadata needs of three different constituencies:
Business users
Developers
System administrators


EME: Various user constituencies served

Business users are interested in exploiting data for analysis, in particular with regard to databases, tables, and columns. Developers tend to be oriented towards applications, needing to analyze the impact of potential program changes. System administrators and production personnel want job status information and run statistics.

EME Interfaces
We can create and manage the EME through three interfaces:
The GDE
The Web user interface
The air utility


Thank You
End of Session 1

