
Business Intelligence

Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. It involves data mining, business performance management, benchmarking, text mining, and predictive analytics.

A very broad field indeed, it contains technologies such as Decision Support Systems (DSS), Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more. BI can be broken down into four broad fields:

Multi-dimensional Analysis Tools: These are the tools that allow the user to look at the data from a number of different "angles". These tools often use a multi-dimensional database referred to as a "cube".

Query Tools: Tools that allow the user to issue SQL (Structured Query Language) queries against the warehouse and get a result set back.

Data Mining Tools: These are the tools that automatically search for patterns in data. These tools are usually driven by complex statistical formulas. The easiest way to distinguish data mining from the various forms of OLAP is that OLAP can only answer questions you know to ask, while data mining answers questions you didn't necessarily know to ask.

Data Visualization Tools: These are the tools that show graphical representations of data, including complex three-dimensional data pictures. The theory is that the user can "see" trends more effectively in this manner than when looking at complex statistical graphs. Some vendors are making progress in this area using the Virtual Reality Modeling Language (VRML).

Metadata Management

Throughout the entire process of identifying, acquiring, and querying the data, metadata management takes place. Metadata is defined as "data about data". An example is a column in a table. The data type (for instance a string or integer) of the column is one piece of metadata. The name of the column is another. The actual value in the column for a particular row is not metadata - it is data. Metadata is stored in a Metadata Repository and provides extremely useful information to all of the tools mentioned previously. Metadata management has developed into an exacting science that can provide huge returns to an organization. It can assist companies in analyzing the impact of changes to database tables, tracking owners of individual data elements ("data stewards"), and much more. It is also required to build the warehouse, since the ETL tool needs to know the metadata attributes of the sources and targets in order to "map" the data properly. The BI tools need the metadata for similar reasons.
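As a small illustration of the data-versus-metadata distinction described above, the sketch below uses Python's built-in sqlite3 module to list the column metadata of a hypothetical table and contrast it with the row values themselves. The table and column names are purely illustrative.

```python
import sqlite3

# In-memory database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, customer_name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Supply Co')")

# Metadata: the column names and data types ("data about data").
for cid, name, col_type, *_ in conn.execute("PRAGMA table_info(customer)"):
    print(f"column '{name}' has type {col_type}")

# Data: the actual values stored in the rows.
for row in conn.execute("SELECT customer_id, customer_name FROM customer"):
    print("row:", row)
```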

Data Warehouse
A Data Warehouse is a repository (a collection of resources that can be accessed to retrieve information) of an organization's electronically stored data, designed to facilitate reporting and analysis.

Data Warehousing Systems


A data warehousing system can perform advanced analyses of operational data without impacting operational systems. OLTP is very fast and efficient at recording business transactions, but not so good at providing answers to high-level strategic questions.

Component Systems

Legacy System: Any information system currently in use that was built using previous technology generations. Most legacy systems are operational in nature, largely because the automation of transaction-oriented business processes had long been the priority of IT projects.

Source Systems: Any system from which data is taken for a data warehouse. A source system is often called a legacy system in a mainframe environment.

Operational Data Stores (ODS): An ODS is a collection of integrated databases designed to support the monitoring of operations. Unlike the databases of OLTP applications (which are function oriented), the ODS contains subject-oriented, volatile, and current enterprise-wide detailed information. It serves as a system of record that provides comprehensive views of data in operational sources. Like data warehouses, ODSs are integrated and subject-oriented. However, an ODS is always current and is constantly updated. The ODS is an ideal data source for a data warehouse, since it already contains integrated operational data as of a given point in time. In short, an ODS is an integrated collection of clean data destined for the data warehouse.

Data Warehouse Definition


The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

Subject Oriented

OLTP databases usually hold information about small subsets of the organization. For example, a retailer might have separate order entry systems and databases for retail, catalog, and outlet sales. Each system will support queries about the information it captures. But if somebody wants to find out details of all sales, then these separate systems are not adequate. To address this type of situation, your data warehouse database should be subject-oriented, organized into subject areas like sales, rather than around OLTP Data Sources.

Integrated

A data warehouse is usually constructed by integrating multiple, heterogeneous sources, such as relational databases, flat files, and OLTP files. When data resides in many separate applications in the operational environment, the encoding of data is often inconsistent. For example, in the above system, the retail system uses a numeric 7-digit code for products, the outlet system code consists of 9 alphanumeric characters, and the catalog system uses 4 letters followed by 4 digits. To create a useful subject area, the source data must be integrated. There is no need to change the coding in these systems, but there must be some mechanism to modify the data coming into the data warehouse and assign a common coding scheme.

Time Variant

Data are stored in a data warehouse to provide a historical perspective. Every key structure in the data warehouse contains, implicitly or explicitly, an element of time. A data warehouse generally stores data that is 5-10 years old, to be used for comparisons, trends, and forecasting.

Nonvolatile

Unlike operational databases, warehouses primarily support reporting, not data capture. A data warehouse is always a physically separate store of data. Due to this separation, data warehouses do not require transaction processing, recovery, concurrency control, etc. The data are not updated or changed in any way once they enter the data warehouse, but are only loaded, refreshed, and accessed for queries.

Data Warehouse Architecture


In general, all data warehouse systems have the following layers:

1. Data Source Layer

This represents the different data sources that feed data into the data warehouse. The data source can be of any format: a plain text file, a relational database, another type of database, or an Excel file can all act as a data source. Many different types of data can be a data source:

Operations data -- such as sales data, HR data, product data, inventory data, marketing data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

2. Data Extraction Layer

Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely to be any major data transformation.

3. Staging Area

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.

4. ETL Layer

This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens.

5. Data Storage Layer

This is where the transformed and cleansed data sit. Based on scope and functionality, three types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

6. Data Logic Layer

This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

7. Data Presentation Layer

This refers to the information that reaches the users. This can be in the form of a tabular or graphical report in a browser, an emailed report that gets automatically generated and sent every day, or an alert that warns users of exceptions, among others.

8. Metadata Layer

This is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something that's in the metadata layer.

9. System Operations Layer

This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.

Processes Related to Data Warehouse Design

Source System Identification: In order to build the data warehouse, the appropriate data must be located. Typically, this will involve both the current OLTP (On-Line Transaction Processing) system, where the "day-to-day" information about the business resides, and historical data for prior periods, which may be contained in some form of "legacy" system. Often these legacy systems are not relational databases, so much effort is required to extract the appropriate data.

Data Acquisition: This is the process of moving company data from the source systems into the warehouse. It is often the most time-consuming and costly effort in the data warehousing project, and is performed with software products known as ETL (Extract/Transform/Load) tools. There are currently over 50 ETL tools on the market. The data acquisition phase can cost millions of dollars and take months or even years to complete. Data acquisition is then an ongoing, scheduled process, which is executed to keep the warehouse current to a pre-determined period in time (i.e. the warehouse is refreshed monthly).

Changed Data Capture: The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. This effort is referred to as "changed data capture". Changed data capture is a field of endeavor in itself, and many products are on the market to address it. Some of the technologies that are used in this area are Replication servers, Publish/Subscribe, Triggers and Stored Procedures, and Database Log Analysis.
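One of the simplest changed-data-capture approaches can be sketched as a timestamp comparison: pull only the source rows whose last-modified time is later than the previous extract. The table and column names below are hypothetical; real CDC products also rely on log analysis, triggers, or replication as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE src_orders (
    order_id INTEGER, amount REAL, last_updated TEXT)""")
conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)", [
    (1, 100.0, "2024-01-01 09:00:00"),
    (2, 250.0, "2024-01-05 14:30:00"),
    (3,  75.0, "2024-01-06 08:15:00"),
])

# Timestamp of the previous warehouse refresh (recorded by the ETL process).
last_extract = "2024-01-02 00:00:00"

# Only rows changed since the last extract are pulled into the warehouse.
changed = conn.execute(
    "SELECT order_id, amount FROM src_orders WHERE last_updated > ?",
    (last_extract,)).fetchall()
print("rows to load:", changed)   # -> [(2, 250.0), (3, 75.0)]
```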

Data Cleansing: This is typically performed in conjunction with data acquisition (it can be part of the "T" in "ETL"). A data warehouse that contains incorrect data is not only useless, but also very dangerous. The whole idea behind a data warehouse is to enable decision-making. If a high-level decision is made based on incorrect data in the warehouse, the company could suffer severe consequences, or even complete failure. Data cleansing is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these are all the same customer. Someone in the organization must make a decision as to which is correct, and then the data cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files" or "sequential files").
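The customer-name example above can be sketched as a simple standardization rule: map the known variants to the value the organization has chosen as correct. Real data-cleansing tools apply far richer matching (fuzzy matching, address validation, and so on); the mapping table here is purely illustrative.

```python
# Hypothetical cleansing rule: all known spellings map to one standard name.
name_rule = {
    "IBM": "International Business Machines",
    "I.B.M.": "International Business Machines",
    "International Business Machines": "International Business Machines",
}

incoming = ["IBM", "I.B.M.", "International Business Machines", "Acme Supply Co"]

# Values not covered by the rule pass through unchanged.
cleansed = [name_rule.get(name, name) for name in incoming]
print(cleansed)
```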

Data Aggregation: This process is often performed during the "T" phase of ETL, if it is performed at all. Data warehouses can be designed to store data at the detail level (each individual transaction), at some aggregate level (summary data), or a combination of both. The advantage of summarized data is that typical queries against the warehouse run faster. The disadvantage is that information which may be needed to answer a query is lost during aggregation. The tradeoff must be carefully weighed, because the decision cannot be undone without rebuilding and repopulating the warehouse. The safest decision is to build the warehouse with a high level of detail, but the cost in storage can be extreme. Now that the warehouse has been built and populated, it becomes possible to extract meaningful information from it that will provide a competitive advantage and a return on investment. This is done with tools that fall within the general rubric of "Business Intelligence".
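As a small illustration of the aggregation trade-off described above, the sketch below (hypothetical tables) rolls detailed daily transactions up to a monthly summary during the "T" phase: the monthly table answers month-level questions quickly, but the daily detail is no longer available in it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", [
    ("2024-01-03", "widget", 10.0),
    ("2024-01-17", "widget", 15.0),
    ("2024-02-02", "gadget", 20.0),
])

# Aggregation performed during the "T" phase: roll detail up to month level.
conn.execute("""
    CREATE TABLE sales_monthly AS
    SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS amount
    FROM sales_detail
    GROUP BY month, product
""")
print(conn.execute("SELECT * FROM sales_monthly ORDER BY month").fetchall())
# -> [('2024-01', 'widget', 25.0), ('2024-02', 'gadget', 20.0)]
```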

Difference between Data warehouse and Operational Data Store


| | Operational Data Store | Data Warehouse |
| Characteristics | Data focused; integration from transaction-processing-focused systems | Subject oriented, integrated, non-volatile, time variant |
| Age of Data | Current, near term (today, last few weeks) | Historic (months, quarters, up to five years) |
| Primary Use | Day-to-day decisions, tactical reporting, current operational results | Long-term decisions, strategic reporting, trend detection |
| Frequency of Load | Twice daily, daily, weekly | Weekly, monthly, quarterly |

Design Methodology
Top Down Design

This is Bill Inmon's approach. In this approach, data is extracted from the operational systems and loaded into a staging area. Here the data is cleansed, consolidated, and validated to ensure its accuracy, and then transferred to the Enterprise Data Warehouse (EDW). The data in the EDW is usually in normalized form to avoid redundancy and to keep a detailed and true version of the data. After the EDW is in place, subject-area-specific data marts are created, which hold data in denormalized form and in a summary format.

Bottom Up Design

Ralph Kimball, a well-known author on data warehousing, is a proponent of an approach to data warehouse design which he describes as bottom-up. In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. The bottom-up process is the result of an initial business-oriented top-down analysis of the relevant business processes to be modeled.

Advantages of Data Warehousing


Potential high return on investment.
Competitive advantage.
Increased productivity of corporate decision makers.
Large quantities of information consolidated into one location from disparate sources are available for analysis.
Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
Prior to loading data into the data warehouse, all data inconsistencies are identified and removed. This greatly simplifies reporting and analysis.
Since the data warehouse is separate from the operational systems, it does not affect or slow down the operational systems.

Problems with Data Warehousing


Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
High maintenance
Long-duration projects
Complexity of integration

ETL Process (Extraction, Transformation, Loading)


ETL technology is an important component of the Data Warehousing Architecture. It is used to copy data from Operational Applications to the Data Warehouse Staging Area, from the DW Staging Area into the Data Warehouse, and finally from the Data Warehouse into a set of conformed Data Marts that are accessible by decision makers.

The ETL software extracts data, transforms values of inconsistent data, cleanses "bad" data, filters data, and loads data into a target database. The scheduling of ETL jobs is critical: should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.

DW Staging Area: The Data Warehouse Staging Area is a temporary location where data from source systems is copied. A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse. Data in the Data Warehouse can be either persistent (i.e. remains around for a long period) or transient (i.e. only remains around temporarily). Note: Not all businesses require a Data Warehouse Staging Area. For many businesses it is feasible to use ETL to copy data directly from operational databases into the Data Warehouse. Decision makers don't access the Data Warehouse directly. This is done through various front-end Data Warehouse Tools that read data from subject-specific Data Marts. The Data Warehouse can be either "relational" or "dimensional". This depends on how the business intends to use the information.
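To make the extract-transform-load flow concrete, here is a minimal sketch under the assumption that the source holds inconsistent product codes and that the staging and warehouse tables live in the same SQLite database. All table names, codes, and transformation rules are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source (operational) data with inconsistent codes and one bad amount.
conn.execute("CREATE TABLE src_sales (product_code TEXT, amount TEXT)")
conn.executemany("INSERT INTO src_sales VALUES (?, ?)",
                 [("0001234", "100.50"), ("PRD-1234", "n/a"), ("0005678", "42")])

# Extract: copy the raw rows into a staging table.
conn.execute("CREATE TABLE stg_sales AS SELECT * FROM src_sales")

# Transform: apply a common coding scheme and reject rows that fail validation.
def transform(row):
    code, amount = row
    code = code.replace("PRD-", "").zfill(7)   # unify the product codes
    try:
        return code, float(amount)
    except ValueError:
        return None                            # cleansing: drop bad amounts

clean = [t for t in map(transform, conn.execute("SELECT * FROM stg_sales")) if t]

# Load: insert the cleansed, conformed rows into the warehouse fact table.
conn.execute("CREATE TABLE fact_sales (product_key TEXT, sales_amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", clean)
print(conn.execute("SELECT * FROM fact_sales").fetchall())
# -> [('0001234', 100.5), ('0005678', 42.0)]
```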

Data Mart
Data Mart is a subset of the data resource, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs. The concept of a data mart can apply to any data, whether operational data, evaluation data, spatial data, or metadata. A Data Mart is a specific, subject-oriented repository of data designed to answer specific questions for a specific set of users. So an organization could have multiple data marts serving the needs of marketing, sales, operations, collections, etc. A data mart is usually organized as a dimensional model, such as a star schema (OLAP cube), made of a fact table and multiple dimension tables.

Data Mart V/s Data Warehouse

The basic difference between a data warehouse and a data mart is that the former is usually designed with an enterprise perspective, while the latter is usually created with a built-in organizational or functional bias (i.e., it is designed to generate a particular set of metrics from a specific business perspective). Data marts frequently obtain all or most of their data from a data warehouse. The data warehouse is a logical concept that houses the atomic data (and some aggregated/summarized data) for strategic analysis. Data marts are fed from the data warehouse with a subset (and aggregated/summarized version) of the data warehouse data, for performance and for getting the data closer to the user.

The data warehouse is used as a back-end data store that allow data marts or cubes to be redesigned or replaced to meet changing business requirements or focus. Data marts and/or cubes can be completely regenerated from the detailed level information contained in the data warehouse.

A data warehouse consists of many different types of data structures (staging, ODS, extracts, etc.), while a data mart typically consists of a single data structure (i.e., a star schema, snowflake schema, or hypercube).

Table 2: Difference between data mart and data warehouse

| | Data Warehouse | Data Mart |
| Scope | Corporate | Line-of-Business (LoB) |
| Subjects | Multiple | Single subject |
| Data Sources | Many | Few |
| Size (typical) | 100 GB to TB+ | Less than 100 GB |
| Implementation Time | Months to years | Months |

Types of Data Marts

1. Dependent Data Mart

Dependent data marts draw data from a central data warehouse that has already been created. With dependent data marts, this process is somewhat simplified because formatted and summarized (clean) data has already been loaded into the central data warehouse. The ETL process for dependent data marts is mostly a process of identifying the right subset of data relevant to the chosen data mart subject and moving a copy of it, perhaps in a summarized form.

2. Independent Data Mart

Independent data marts are standalone systems built by drawing data directly from operational or external sources of data, or both. With independent data marts, however, you must deal with all aspects of the ETL process, much as you do with a central data warehouse. The number of sources is likely to be fewer, and the amount of data associated with the data mart is less than that of the warehouse, given the focus on a single subject.

3. Hybrid Data Mart

A hybrid data mart combines the characteristics of dependent and independent data marts. A hybrid data mart allows you to combine input from sources other than a data warehouse.

Meta Data

Metadata is information about the data. For a data mart, metadata includes:

A description of the data in business terms
Format and definition of the data in system terms
Data sources and frequency of refreshing data

The primary objective for the metadata management process is to provide a directory of technical and business views of the data mart metadata. Metadata can be categorized as technical metadata and business metadata. Technical metadata consists of metadata created during the creation of the data mart, as well as metadata to support the management of the data mart. This includes data acquisition rules, the transformation of source data into the format required by the target data mart, and schedules for backing up and refreshing data. Business metadata allows end users to understand what information is available in the data mart and how it can be accessed.

Data Warehouse Architectures


1. Virtual Data Warehouse
A virtual data warehouse provides a compact view of the data inventory. It contains metadata and uses middleware to build connections to different data sources. Virtual warehouses can be fast, as they allow users to filter the most important pieces of data from different legacy applications. A virtual warehouse enables business analysts to access and analyze data from operational systems, such as Oracle Financials or SAP R/3, without having to rely on IT personnel to extract and process the data.

2. Enterprise Data Warehouse


An Enterprise Data Warehouse is a centralized warehouse which provides service for the entire enterprise. A data warehouse is in essence a large repository of historical and current transaction data of an organization. An Enterprise Data Warehouse is a specialized data warehouse which may have several interpretations.

3. Data Marts

A Data Mart is a subset of the data resource, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.

4. Distributed Data Marts

5. Multitier Warehouse

Fact and Fact Table


Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records the manufacturing of various baked goods. Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, assembly line warehouse analysts might be very concerned with the number of defective goods that were manufactured each hour. Similarly, a marketing data warehouse might be concerned with the activity of a consumer group with a specific income level rather than purchases made by each individual.

Types of Measures

1) Additive

Additive facts are facts that can be summed up through all of the dimensions in the fact table.

2) Semi-Additive

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. Semi-additive measures (where additivity holds over certain dimensions but not over all dimensions) include:

Dirty data
Historical data
Category data
Periodic snapshots

3) Non Additive

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table. Non-additive measures include:

Ratios and percentages
Intensity measures (such as temperature)
Grades
Averages, maximums, and minimums

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

Date, Store, Product, Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

Say we are a bank with the following fact table:

Date, Account, Current_Balance, Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information).

Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.
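The two examples can be made concrete with a few rows of hypothetical data: summing Sales_Amount is meaningful along any dimension, summing Current_Balance is meaningful across accounts on one day but not across days, and Profit_Margin is averaged rather than summed. The values below are purely illustrative.

```python
# Hypothetical daily snapshot rows for two accounts.
balances = [
    {"date": "2024-01-01", "account": "A", "current_balance": 100.0, "profit_margin": 0.05},
    {"date": "2024-01-01", "account": "B", "current_balance": 200.0, "profit_margin": 0.02},
    {"date": "2024-01-02", "account": "A", "current_balance": 110.0, "profit_margin": 0.04},
    {"date": "2024-01-02", "account": "B", "current_balance": 190.0, "profit_margin": 0.03},
]

# Semi-additive: summing across accounts for a single day is meaningful.
total_on_jan1 = sum(r["current_balance"] for r in balances if r["date"] == "2024-01-01")
print("total balance on 2024-01-01:", total_on_jan1)                  # 300.0

# ...but summing one account across days is not meaningful.
meaningless = sum(r["current_balance"] for r in balances if r["account"] == "A")
print("account A summed over days (not meaningful):", meaningless)    # 210.0

# Non-additive: profit margins are averaged, never summed.
avg_margin = sum(r["profit_margin"] for r in balances) / len(balances)
print("average profit margin:", avg_margin)                           # 0.035
```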

Types of Fact Table


Based on the above classifications, there are two main types of fact tables, plus the special case of the factless fact table:

Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Factless Fact Table: A factless fact table is a table that doesn't have any facts at all. It may consist of nothing but keys. There are two types of factless fact tables: event and coverage. An event table establishes the relationship among the dimension members from various dimensions, but there is no measured value; the existence of the relationship itself is the fact. This type of fact table can be used to generate useful reports, because you can count the number of occurrences with various criteria. For example, you can have a factless fact table to capture student attendance (the example used by Ralph Kimball). The other type of factless table is called a coverage table by Kimball. It is used to support negative analysis reports, for example a store that did not sell a product for a given period. To produce such a report, you need a fact table that captures all the possible combinations; you can then figure out what is missing.
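A minimal sketch of the event-type factless fact table for student attendance (hypothetical keys): the table holds only dimension keys, and reports are produced by counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Factless fact table: only foreign keys, no measure column.
conn.execute("""CREATE TABLE fact_attendance (
    date_key INTEGER, student_key INTEGER, class_key INTEGER)""")
conn.executemany("INSERT INTO fact_attendance VALUES (?, ?, ?)", [
    (20240101, 1, 10), (20240101, 2, 10), (20240102, 1, 10),
])

# The "fact" is the existence of the relationship; reports count occurrences.
print(conn.execute("""
    SELECT class_key, COUNT(*) AS attendance_count
    FROM fact_attendance GROUP BY class_key
""").fetchall())   # -> [(10, 3)]
```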

Dimension Table
Dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information to the analyst. E.g., in a banking data warehouse, account balances will be stored in a fact table, whereas customer name, address, and mobile numbers will be stored in the dimension table. Dimension tables contain hierarchies of attributes that aid in summarization. For example, a dimension containing product information would often contain a hierarchy that separates products into categories such as food, drink, and non-consumable items, with each of these categories further subdivided a number of times until the individual product SKU is reached at the lowest level.

Types of Dimension:
1) Degenerate Dimension

An item that is in the fact table but is stripped of its description, because the description belongs in a dimension table, is referred to as a degenerate dimension. Since it looks like a dimension but actually resides in the fact table and has been stripped of its description, it is called a degenerate dimension.

2) Junk Dimension

When you consolidate lots of small dimensions, instead of having hundreds of small dimensions with only a few records each cluttering your database with mini identifier tables, all records from all these small dimension tables are loaded into one dimension table. We call this dimension table a junk dimension table (since we are storing all the "junk" in this one table). For example, a company might have a handful of manufacturing plants, a handful of order types, and so on, and these can be consolidated into one junk dimension table, as shown in the sketch below.
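One common way to build a junk dimension, sketched with hypothetical flags: enumerate the combinations of the small indicator values once and give each combination a surrogate key that the fact table references.

```python
from itertools import product

# Hypothetical low-cardinality attributes that would otherwise each be a tiny dimension.
order_types = ["retail", "catalog", "outlet"]
payment_flags = ["cash", "credit"]

# Junk dimension: one row per combination, each with its own surrogate key.
junk_dimension = [
    {"junk_key": key, "order_type": o, "payment_flag": p}
    for key, (o, p) in enumerate(product(order_types, payment_flags), start=1)
]
for row in junk_dimension:
    print(row)
# The fact table then stores only junk_key instead of several small foreign keys.
```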

3) Conformed Dimension

A conformed dimension is built once in your model and can be reused multiple times with different fact tables. For example, consider a model containing multiple fact tables, representing different data marts. Now look for a dimension that is common to these fact tables. In this example, let's say that the product dimension is common; it can then be reused by creating shortcuts and joining it to the different fact tables. Typical examples are the time dimension, customer dimension, and product dimension.

4) Role Playing Dimension

A role-playing dimension is a dimension that can play different roles in a fact table depending on the context. For example, the Date dimension can be used for the order date, scheduled shipping date, shipment date, and invoice date in an order line fact. In the data warehouse, you will have a single dimension table for the dates, with multiple foreign keys from the fact table to the same dimension.
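Role-playing is usually implemented by joining the single date dimension to the fact table several times under different aliases. The sketch below (hypothetical tables and keys) resolves both the order date and the ship date from one dim_date table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT)")
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(20240101, "2024-01-01"), (20240105, "2024-01-05")])
conn.execute("""CREATE TABLE fact_order_line (
    order_id INTEGER, order_date_key INTEGER, ship_date_key INTEGER, amount REAL)""")
conn.execute("INSERT INTO fact_order_line VALUES (1, 20240101, 20240105, 99.0)")

# The same dimension table plays two roles via two aliases.
print(conn.execute("""
    SELECT f.order_id, od.full_date AS order_date, sd.full_date AS ship_date, f.amount
    FROM fact_order_line f
    JOIN dim_date od ON f.order_date_key = od.date_key
    JOIN dim_date sd ON f.ship_date_key  = sd.date_key
""").fetchall())   # -> [(1, '2024-01-01', '2024-01-05', 99.0)]
```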

Slowly Changing Dimension

Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule. For example, you may have a dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another.

Type 0

The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time the dimension record was first entered. In certain circumstances history is preserved with a Type 0 SCD, but higher-order SCD types are more often employed to guarantee history preservation, whereas Type 0 provides the least control (or no control) over managing a slowly changing dimension.

Type 1

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name (assuming you won't ever need to know how it used to be misspelled in the past). Here is an example of a database table that keeps supplier information:

| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
| 123 | ABC | Acme Supply Co | CA |

In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, joins will perform better on an integer than on a character string. Now imagine that this supplier moves its headquarters to Illinois. The updated table would simply overwrite this record:

| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
| 123 | ABC | Acme Supply Co | IL |

The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to Type 1 SCDs is that they are very easy to maintain.
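A Type 1 change is a plain overwrite of the dimension row. The sketch below follows the supplier example above; the use of SQLite is purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_supplier (
    supplier_key INTEGER, supplier_code TEXT, supplier_name TEXT, supplier_state TEXT)""")
conn.execute("INSERT INTO dim_supplier VALUES (123, 'ABC', 'Acme Supply Co', 'CA')")

# Type 1: overwrite in place -- no history is preserved.
conn.execute("UPDATE dim_supplier SET supplier_state = 'IL' WHERE supplier_code = 'ABC'")
print(conn.execute("SELECT * FROM dim_supplier").fetchall())
# -> [(123, 'ABC', 'Acme Supply Co', 'IL')]
```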

Type 2

The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables, with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation, as a new record is inserted each time a change is made. In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:

| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Version |
| 123 | ABC | Acme Supply Co | CA | 0 |
| 124 | ABC | Acme Supply Co | IL | 1 |

Another popular method for tuple versioning is to add effective date columns.

| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date |
| 123 | ABC | Acme Supply Co | CA | 01-Jan-2000 | 21-Dec-2004 |
| 124 | ABC | Acme Supply Co | IL | 22-Dec-2004 | |

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.
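The effective-date variant of Type 2 can be sketched as two statements: expire the current row, then insert a new row with a new surrogate key. Table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_supplier (
    supplier_key INTEGER, supplier_code TEXT, supplier_name TEXT,
    supplier_state TEXT, start_date TEXT, end_date TEXT)""")
conn.execute("""INSERT INTO dim_supplier
               VALUES (123, 'ABC', 'Acme Supply Co', 'CA', '2000-01-01', NULL)""")

# Type 2: close out the current version, then insert the new version.
conn.execute("""UPDATE dim_supplier SET end_date = '2004-12-21'
                WHERE supplier_code = 'ABC' AND end_date IS NULL""")
conn.execute("""INSERT INTO dim_supplier
                VALUES (124, 'ABC', 'Acme Supply Co', 'IL', '2004-12-22', NULL)""")
for row in conn.execute("SELECT * FROM dim_supplier"):
    print(row)
```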

Type 3

The Type 3 method is usually referred to as using "history tables", where one table keeps the current data and an additional table is used to keep a record of some or all changes. Following the example above, the original table might be called Supplier and the history table might be called Supplier_History.

Supplier
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
| 123 | ABC | Acme Supply Co | IL |

Supplier_History
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date |
| 123 | ABC | Acme Supply Co | CA | 22-Dec-2004 |

This method resembles how database audit tables and change data capture techniques function.

Granularity

The granularity is the lowest level of information stored in the fact table; the depth of the data level is known as granularity. In a date dimension, the level of granularity could be year, quarter, month, period, week, or day. Determining granularity consists of the following two steps:
- Determining the dimensions that are to be included
- Determining the location to place the hierarchy of each dimension of information

Data Warehouse Dimensional Modeling (Types of Schemas)

Star Schema: The star schema is the simplest data warehouse schema, consisting of a fact table and multiple dimension tables connected to the fact table.
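A minimal star-schema sketch with one fact table and three dimension tables (all names hypothetical), using SQLite purely to illustrate the layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables describe the "who/what/when" of each fact.
conn.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER)")
conn.execute("CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT)")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT)")

# Fact table holds foreign keys to the dimensions plus the measures.
conn.execute("""CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    store_key INTEGER REFERENCES dim_store(store_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL)""")

print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```

In a snowflake variation of this sketch, the product dimension would itself be normalized, for example into a separate category table referenced by dim_product.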

Snowflake Schema: The snowflake schema is a variation of the star schema where dimension tables are normalized into multiple related tables. The fact table remains the same as in the star schema.

Galaxy Schema: A galaxy schema contains many fact tables with some common dimensions (conformed dimensions). This schema is a combination of many data marts.

Fact Constellation Schema: The dimensions in this schema are segregated into independent dimensions based on the levels of the hierarchy. For example, if geography has five levels of hierarchy, such as territory, region, country, state, and city, the constellation schema would have five dimensions instead of one.

Data Mining: Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information -- information that can be used to increase revenue, cut costs, or both.

OLAP
On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities including:

calculations and modeling applied across dimensions, through hierarchies and/or across members
trend analysis over sequential time periods
slicing subsets for on-screen viewing
drill-down to deeper levels of consolidation
reach-through to underlying detail data
rotation to new dimensional comparisons in the viewing area

OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity. OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what-if" data model scenarios. This is achieved through use of an OLAP Server.
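Drill-down and slicing can be illustrated with ordinary aggregation over a small in-memory set of consolidated facts (hypothetical values); a real OLAP server would pre-compute and index these views rather than recomputing them per request.

```python
from collections import defaultdict

# Hypothetical consolidated sales facts: (year, quarter, region, amount).
facts = [
    ("2023", "Q1", "East", 100), ("2023", "Q2", "East", 120),
    ("2023", "Q1", "West", 80),  ("2023", "Q2", "West", 90),
]

def rollup(rows, keys):
    """Aggregate the amount over the requested dimension levels."""
    totals = defaultdict(float)
    for year, quarter, region, amount in rows:
        dims = {"year": year, "quarter": quarter, "region": region}
        totals[tuple(dims[k] for k in keys)] += amount
    return dict(totals)

print(rollup(facts, ["year"]))              # consolidated view by year
print(rollup(facts, ["year", "quarter"]))   # drill-down to quarter level
# Slice: restrict to one member of a dimension before aggregating.
print(rollup([f for f in facts if f[2] == "East"], ["year", "quarter"]))
```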

OLTP
OLTP (online transaction processing) is a class of program that facilitates and manages transaction-oriented applications, typically for data entry and retrieval transactions in a number of industries, including banking, airlines, mail order, supermarkets, and manufacturing. Probably the most widely installed OLTP product is IBM's CICS (Customer Information Control System). Today's online transaction processing increasingly requires support for transactions that span a network and may include more than one company. For this reason, new OLTP software uses client/server processing and brokering software that allows transactions to run on different computer platforms in a network.

OLAP V/s OLTP


| | OLTP System (Online Transaction Processing, Operational System) | OLAP System (Online Analytical Processing, Data Warehouse) |
| Source of data | Operational data; OLTPs are the original source of the data | Consolidated data; OLAP data comes from the various OLTP databases |
| Purpose of data | To control and run fundamental business tasks | To help with planning, problem solving, and decision support |
| What the data reveals | A snapshot of ongoing business processes | Multi-dimensional views of various kinds of business activities |
| Inserts and updates | Short and fast inserts and updates initiated by end users | Periodic long-running batch jobs refresh the data |
| Queries | Relatively standardized and simple queries returning relatively few records | Often complex queries involving aggregations |
| Processing speed | Typically very fast | Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes |
| Space requirements | Can be relatively small if historical data is archived | Larger due to the existence of aggregation structures and historical data; requires more indexes than OLTP |
| Database design | Highly normalized with many tables | Typically de-normalized with fewer tables; use of star and/or snowflake schemas |
| Backup and recovery | Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability | Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method |

Types Of OLAP
1. Relational OLAP (ROLAP) Star Schema based

ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information. ROLAP depends on a specialized schema design.

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

2. Multidimensional OLAP (MOLAP) Cube based

MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in optimized multi-dimensional array storage, rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube -- the operation known as processing.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources will be needed.

3. Hybrid OLAP (HOLAP)

HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP into a single architecture. This approach tries to bridge the technology gap of both products by enabling access to both multidimensional database (MDDB) and Relational Database Management System (RDBMS) data stores. HOLAP systems store the larger quantities of detailed data in relational tables, while the aggregations are stored in pre-calculated cubes. HOLAP also has the capacity to drill through from the cube down to the relational tables for delineated data.
