
Chapter 1

Introduction

What is a Data Warehouse?

A data warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions. Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from different heterogeneous sources that support OLTP (On-Line Transaction Processing). It is designed primarily for data query and data analysis rather than for online transaction processing. It mainly contains large volumes of historical data belonging to an enterprise, which are used for creating analytical reports for knowledge workers in the enterprise. The information gathered in a data warehouse helps in making business decisions for better business prospects. The applications of a data warehouse environment include the following:

You can fine-tune the production environment based on sales analysis.

You can carry out customer analysis by finding customers' buying patterns.

You can perform operational analysis of your business through customer relationship management.

Data Warehouse Definition

Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area, such as sales or inventory.

Integrated: The data is gathered from a number of heterogeneous sources into a coherent whole. Example: Different sources may have different ways of identifying a product, but in the data warehouse the product is identified in a single, consistent way.

Time-Variant: The data is identified with a particular time period. Example: The data warehouse contains historical data, so one can retrieve data belonging to the last 6 months, 12 months, or even older periods based on requirement, whereas in a transaction processing system (OLTP) only recent data is maintained.

Non-Volatile: The data in a data warehouse is stable, meaning that data, once added, is never altered. New data can be added, but old data is not deleted.

Figure 1.1: Subject-Oriented Data Warehouse (sales, marketing, and purchasing data sources feed a data warehouse database organized around subjects such as customers, orders, products, and payments, which is then used for data mining).

Figure 1.2: Integrated Data Warehouse (flat files, Oracle data sources, legacy data sources, spreadsheets, and XML data are integrated into the data warehouse database, which serves reports, queries, and data mining).

Figure 1.3: Non-Volatile Data Warehouse (volatile OLTP data is subject to insert, change, delete, and access operations, whereas non-volatile OLAP data is only loaded and accessed).

Figure 1.4: Time-Variant Data Warehouse (the operational system keeps current data, typically for about 90 days, while the data warehouse keeps historical data for 6 months, 12 months, 3 years, and older).

OLTP versus OLAP

OLTP (Online Transaction Processing) involves recording day-to-day transactions. It maintains the detailed and current data of the operational system. Here data can be added, removed, modified, and queried. This is a transactional system containing short online transactions such as updating an account, issuing a book, or registering for a course. Concurrency of transactions is a major performance concern. Data must be up to date and consistent, and response times must be fast. The stored data is normalized, and the Entity-Relationship model is applied for designing the database; the resulting data models can be complex. The queries and updates on these databases are typically short and touch a limited number of records. Data integrity must be maintained in this environment, which is mostly multi-user, and various recovery mechanisms are applied whenever the system crashes.

OLAP (Online Analytical Processing) involves historical data. This data is mined for knowledge discovery, which helps in making managerial decisions for better business prospects. OLAP involves a low volume of transactions, with queries that are often very complex and involve aggregations. It is an analytical system, and such systems are used effectively for data mining purposes. Multidimensional schemas are applied to store data that is aggregated and consolidated. These systems involve long analytical queries such as department-wise total sales, identifying the top-selling item, or counting courses with more than 100 registrations.

OLTP Architecture

An OLTP system is typically organized in three layers: a presentation layer (the user-facing applications), an application layer, and a database layer.

OLAP Architecture

In an OLAP environment, data is extracted from the operational databases, transformed, and loaded into the data warehouse database on the data warehouse server, which clients then query for analysis.

The following table summarizes the major differences between OLTP and OLAP systems:

Criteria                | OLTP (Online Transaction Processing) / Operational System | OLAP (Online Analytical Processing) / Data Warehouse
Data source             | Operational data                                           | Consolidated data
Purpose                 | To run business tasks                                      | Planning and decision support
Data view               | A snapshot of the business process                         | Multidimensional view of the business process
Users                   | DBAs, database professionals                               | Knowledge workers (managers, executives, analysts)
DB design               | ER based, application oriented                             | Star/snowflake, subject oriented
Data storage            | Current, up-to-date data                                   | Historical data
Data access             | Read/write                                                 | Mostly read
Queries                 | Simple queries touching few records                        | Often complex queries with aggregations
No. of records accessed | Few (tens)                                                 | Many (millions)
No. of users            | Large (thousands)                                          | Small (hundreds)
DB size                 | On the order of 100 MB to GB                               | On the order of 100 GB to TB
Concurrency             | A major concern                                            | Of little concern
Data updates            | Very frequent                                              | Periodic batch loads and refreshes
Schema                  | Normalized schema                                          | De-normalized schema
Target                  | One specific business process                              | Integrates data from multiple processes

Table 1.1

Why a Separate Data Warehouse?

Generally, a data warehouse system is maintained separately from an operational system for the following reasons:

To promote high performance for both systems. An operational database optimizes query processing through indexing and hashing on primary keys, whereas queries on a data warehouse system are often complex and involve consolidated and summarized data. Processing OLAP queries on the operational database would therefore degrade the performance of operational tasks.

To ensure consistency of data, operational systems apply concurrency control and recovery mechanisms such as locking and logging. OLAP queries often need the data only for read access, with summarization and aggregation; applying concurrency control and recovery mechanisms to such OLAP operations would reduce the throughput of the OLTP system.

Operational systems (OLTP) do not maintain historical data, whereas data warehouse systems require historical data for decision support.

Since the two systems provide different functionalities and require different kinds of data, it is necessary to maintain separate databases for them.

Data Warehouse Architecture

The figure shows the overall architecture: operational databases (ERP data, SAP systems, CRM systems) and external data sources (third-party data, flat files) feed an Extract-Transform-Load (ETL) process that populates the data warehouse, which stores raw data and summary data and in turn serves one or more data marts.
Data Warehouse Components

The primary components of a data warehouse are as follows:

1. Source Data Component: Typically, the source data for the warehouse comes from the operational applications. As the data enters the warehouse, it is cleaned up and transformed into an integrated structure and format. This component contains the following types of data:
   a. Production data.
   b. Internal data.
   c. Archived data.
   d. External data.
2. Data Staging Area Component: This is the component where data extracted from the operational systems is prepared for the data warehouse. The data sourcing, cleanup, transformation, and migration tools perform all of the conversions. Here the disparate data from different data sources is summarized, changed, structured, and condensed so that it can be used by the decision support tools. The following operations are handled in this component:
   a. Data extraction.
   b. Data transformation.
   c. Data loading.
3. Data Storage Component: This component is the centralized data warehouse database, which is built for data mining tasks and knowledge discovery in support of decision making. The technological approach to the data warehouse database is driven by attributes such as large database size, ad hoc query execution, and the need for flexible user views.
4. Information Delivery Component: This component delivers data warehouse information to one or more destinations on a subscription basis. The information delivery system distributes warehouse-stored data to other data warehouses and to end-user products such as spreadsheets.
5. Metadata Component: Metadata is data about data that describes the data warehouse. This component has an important role in the data warehouse; it is used for building, maintaining, managing, and using the warehouse. Metadata can be classified into two categories: technical metadata and business metadata. This component acts as a directory that helps the decision support system locate the components of the data warehouse, supports data aggregation and summarization, and is used in query, extraction, reporting, transformation, and loading activities.
6. Management and Control Component: This component sits on top of all the other components. It coordinates the services and activities within the data warehouse and controls data transformation and transfer into the data warehouse storage. In a heterogeneous environment, the data resides on disparate systems and requires inter-networking tools, so the need for managing this environment is obvious.
7. Data Mart: A data mart is a lower-cost, scaled-down version of the data warehouse. Data marts offer a targeted and less costly way of gaining the advantages associated with data warehousing and can be scaled up to a full data warehouse environment over time.

Extract-Transform-Load (ETL)

The ETL process is the process in data warehousing responsible for pulling data out of the source systems and placing it into the data warehouse; it covers how data is loaded from the source systems into the warehouse. Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse. Many data warehouse systems also receive data from non-OLTP sources such as text files, legacy systems, and spreadsheets; such data also requires an ETL process. ETL is an important component of the data warehousing architecture.

ETL involves the following tasks:

Extracting Data: Data extracted from heterogeneous source systems is converted into one consolidated data warehouse format in preparation for transformation.

Transforming Data: Various business rules may be applied for derivations and for calculating new measures and dimensions. This step may involve the following tasks:

Cleaning (mapping data into a consolidated format)
Filtering (selecting a subset of columns)
Splitting (splitting a column into multiple columns, and vice versa)
Joining data from multiple sources
Transposing rows and columns
Simple or complex data validation

Loading Data: The transformed data is loaded into the data warehouse or data repository and other reporting applications. The data warehouse needs to be loaded regularly so that it can serve its purpose of facilitating business analysis.
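To make the three steps concrete, the following is a minimal sketch of an ETL flow in Python. It assumes a hypothetical flat-file source named sales_export.csv with product, quantity, and unit_price columns, and it uses an SQLite database as a stand-in warehouse; all file, table, and column names are illustrative, not part of any standard.

import csv
import sqlite3

def extract(path):
    # Extraction: pull raw rows out of a flat-file source system.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: filter bad rows, clean text, and derive a new measure.
    cleaned = []
    for r in rows:
        if not r.get("quantity") or not r.get("unit_price"):
            continue                                    # filtering
        product = r["product"].strip().upper()          # cleaning into one format
        quantity = float(r["quantity"])
        unit_price = float(r["unit_price"])
        sales_amount = quantity * unit_price            # derived measure
        cleaned.append((product, quantity, unit_price, sales_amount))
    return cleaned

def load(rows, conn):
    # Loading: append the transformed rows to the warehouse fact table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(product TEXT, quantity REAL, unit_price REAL, sales_amount REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(transform(extract("sales_export.csv")), warehouse)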

Fig 1.5: The ETL process (data from the source systems is extracted into a staging area, where transformation functions and mappings are applied, and then loaded into the fact and dimension tables of the data warehouse schema).


Logical (Multi-Dimensional) Data Model

The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
The multidimensional data model is composed of logical cubes, measures, dimensions,
hierarchies, levels, and attributes. Analysts know which business measures they are interested in
examining, which dimensions and attributes make the data meaningful, and how the dimensions
of their business are organized into levels and hierarchies. A multidimensional database (MDB)
is a type of database that is optimized for data warehouse and online analytical processing
(OLAP) applications. Multidimensional databases are frequently created using input from
existing relational databases. The multidimensional data model is designed to solve complex
queries in real time.

A multidimensional database uses the idea of a data cube to represent the dimensions of the data available to a user. Implementations of the multidimensional data model include the data cube, the star schema, the snowflake schema, and the fact constellation. The multidimensional data model includes the following elements:

Dimensions: Dimensions are the classes of descriptors of the facts. If the fact is sales, the dimensions might be time, geography, customer, and product. A dimension table contains tuples of dimension attributes with a primary key. Dimension tables contain descriptive attributes, which are typically static values containing textual or discrete data; their applications include query filtering and labeling of query result sets.

Measures: In a data warehouse, a measure is a property on which calculations (e.g., sum, count, average, minimum, maximum) can be made. For example, if a retail store sold a specific product, the quantity and price of each item sold could be added or averaged to find the total number of items sold and the total or average price of the goods sold.

Facts: The fact table has one tuple per recorded fact. It contains a compound primary key and a set of foreign keys, and it holds a huge number of records. Facts tend to be additive, semi-additive, or non-additive.

Cubes: When data is combined or grouped together in multidimensional matrices, the result is called a data cube. A data cube can be two-dimensional, containing data for products versus regions, or three-dimensional, containing data for products versus regions versus fiscal quarters; data cubes can also have more than three dimensions.

Operations: The basic operations on a data cube are roll-up and drill-down. The roll-up operation climbs up a concept hierarchy for a dimension or removes one or more dimensions from the data cube in order to view more summarized data; the navigation path is from more detailed data to less detailed data. The drill-down operation is the exact reverse of roll-up: it steps down a concept hierarchy or adds new dimensions to the cube, and the navigation path is from less detailed data to more detailed data.
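As a rough illustration of these ideas (not a real multidimensional database engine), the following Python/pandas sketch builds a small cube-like view from a toy fact table with time, region, and product dimensions and a sales measure; the data and column names are invented for the example.

import pandas as pd

# A tiny fact table: each row is one recorded fact with dimension values and a measure.
facts = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "product": ["TV", "TV", "Phone", "Phone", "Phone", "TV"],
    "sales":   [100, 80, 60, 90, 70, 120],
})

# A 2-D "cube" view: products versus regions, with sales aggregated over quarters.
cube = facts.pivot_table(values="sales", index="product", columns="region",
                         aggfunc="sum", fill_value=0)
print(cube)

# Removing the region dimension summarizes further (a simple roll-up).
print(facts.groupby("product")["sales"].sum())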

The following diagrams give examples of a data cube and its operations:

Fig 1.6: Operations on a data cube.

Fig 1.7: A sample data cube for sales of electronic equipment.
Schema Design

A schema is a logical description of the entire database. It includes the names and descriptions of records, together with their associated data items and aggregates. It is a collection of database objects, including tables, views, indexes, and synonyms. A data warehouse requires a schema just like any other database; however, while an operational database uses the relational model, a data warehouse uses the star, snowflake, or fact constellation schema.

The multidimensional schema used by a data warehouse can be defined using the Data Mining Query Language (DMQL); its data cube and dimension definitions are used for defining data warehouses and data marts. There are many schema models used by data warehouses, but the most commonly used are:

Star schema.
Snowflake schema.
Fact constellation schema.

Which schema model to use for a data warehouse depends on the analysis of project requirements, the data mining and ETL tools in use, and the personal preferences of the project team.
Star Schema
Star schema is the simplest and most commonly used data warehouse schema. It is named the star schema because the diagram looks like a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. Usually the fact table in a star schema is in third normal form (3NF), whereas the dimension tables are de-normalized.

A fact table normally contains two types of attributes: foreign keys to the dimension tables, and measures that contain numeric values. The numeric data in fact tables can be detailed or aggregated. The primary key of each dimension table is part of the composite primary key of the fact table. The attributes in a dimension table describe the dimensional values and are normally descriptive and textual. Dimension tables are generally much smaller than fact tables. In a star schema database design, each dimension is described by its own table and the facts are arranged in a single large table, indexed by a multi-part key that comprises the individual keys of each dimension.

The following diagram represents the star schema for a sales data warehouse environment, containing a Sales fact table and Time, Customer, Product, and Warehouse dimension tables.

Time (Dimension): TimeKey, Year, Month, Day

Customer (Dimension): CustomerKey, CustomerName, AgeGroup, Gender, Address, Phone, Email

Sales (Fact): SalesID, TimeKey, CustomerKey, WarehouseKey, ProductKey, Price, Quantity, SalesAmount

Product (Dimension): ProductKey, ProductName, UnitPrice, Brand

Warehouse (Dimension): WarehouseKey, Name, Street, City, Country

Fig 1.8: Star Schema for the Sales Data Warehouse.

Note: The fact table named Sales contains one primary key, SalesID, and four foreign keys, TimeKey, CustomerKey, WarehouseKey, and ProductKey, which are the primary keys of the Time, Customer, Warehouse, and Product dimensions.
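The schema of Fig 1.8 can be expressed directly as table definitions. The following sketch creates it in an in-memory SQLite database from Python and runs one typical star-schema query; SQLite is used purely for illustration, and the query at the end is an invented example.

import sqlite3

ddl = """
CREATE TABLE Time      (TimeKey INTEGER PRIMARY KEY, Year INTEGER, Month INTEGER, Day INTEGER);
CREATE TABLE Customer  (CustomerKey INTEGER PRIMARY KEY, CustomerName TEXT, AgeGroup TEXT,
                        Gender TEXT, Address TEXT, Phone TEXT, Email TEXT);
CREATE TABLE Product   (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, UnitPrice REAL, Brand TEXT);
CREATE TABLE Warehouse (WarehouseKey INTEGER PRIMARY KEY, Name TEXT, Street TEXT, City TEXT, Country TEXT);

-- The fact table: one primary key plus foreign keys to the four dimensions and the measures.
CREATE TABLE Sales (
    SalesID      INTEGER PRIMARY KEY,
    TimeKey      INTEGER REFERENCES Time(TimeKey),
    CustomerKey  INTEGER REFERENCES Customer(CustomerKey),
    WarehouseKey INTEGER REFERENCES Warehouse(WarehouseKey),
    ProductKey   INTEGER REFERENCES Product(ProductKey),
    Price        REAL,
    Quantity     INTEGER,
    SalesAmount  REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# A typical star-schema query: total sales amount per country, joining the fact to a dimension.
query = """
SELECT w.Country, SUM(s.SalesAmount) AS TotalSales
FROM Sales s JOIN Warehouse w ON s.WarehouseKey = w.WarehouseKey
GROUP BY w.Country;
"""
print(conn.execute(query).fetchall())   # empty list here, since no rows were loaded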

Benefits of a Star Schema

A well-designed schema allows you to quickly understand, navigate, and analyze large multidimensional data sets. The main advantages of star schemas in a decision support environment are:

The data is easier to understand and navigate, because dimensions are joined through the fact table.

Queries can run faster against a star schema, because the join logic is generally simpler than for a more normalized schema. This design also encourages accurate and consistent query results.

Business reporting logic is simplified and remains extensible, so the schema rarely needs to change to accommodate future business needs.

A star schema is designed to enforce referential integrity of loaded data through the use of primary and foreign keys: primary keys in the dimension tables become foreign keys in the fact table, linking each record across dimension and fact tables.

A star schema is widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables.

The star schema structure reduces the time required to load large batches of data into a database. By defining facts and dimensions and separating them into different tables, the impact of a load operation is reduced: dimension tables can be populated once and occasionally refreshed, while new facts can be added regularly and selectively by appending records to the fact table.

Snowflake Schema

A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. A snowflake schema is a star schema with fully normalized dimensions: a central fact table is surrounded by dimensions, one or more of which are normalized to some extent. The snowflake normalization is based on the attributes of the business hierarchy. Star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where the speed of data retrieval is more important than the efficiency of data manipulation.

Snowflake schemas are made for flexible querying over more complicated relationships and dimensions. The schema is well suited to one-to-many and many-to-many relationships among dimension levels and is typically associated with data marts and dimensional data warehouses, in which data retrieval speed is more critical than data manipulation efficiency. Snowflake schemas are most commonly used with advanced query tools, which build an abstraction layer between users and the raw tables for scenarios that involve multiple queries with elaborate specifications.

Time (Dimension): TimeID, Day, Month, Year, QuarterNumber

Customer (Dimension): CustomerID, Name, PhoneNumber, Email, Address, GroupID

Group (Lookup): GroupID, GroupName, SegmentName

Sales (Fact): ProductID, CustomerID, TimeID, ShopID, Quantity, SalesAmount

Shop (Dimension): ShopID, ShopName, BusinessType, Address

Product (Dimension): ProductID, BrandID, ProductType, Name, Description

Brand (Lookup): BrandID, BrandName, Supplier, Address

Fig 1.9: Snowflake Schema for the Sales Data Warehouse.

The above diagram depicts a snowflake schema containing Sales as the fact table; Time, Customer, Shop, and Product as dimension tables; and Group and Brand as lookup tables. Here, the Brand and Group tables are formed as a result of normalizing the Product and Customer dimension tables.
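A sketch of the normalization step just described, again using SQLite from Python for illustration: the brand attributes of the Product dimension of Fig 1.9 are moved into a separate Brand lookup table, so brand-level queries need one extra join compared with the star schema. The column names follow the figure and are illustrative.

import sqlite3

ddl = """
CREATE TABLE Brand   (BrandID INTEGER PRIMARY KEY, BrandName TEXT, Supplier TEXT);
CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT, Description TEXT,
                      ProductType TEXT, BrandID INTEGER REFERENCES Brand(BrandID));
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# Counting products per brand now requires a join through the lookup table.
print(conn.execute("""
    SELECT b.BrandName, COUNT(*) AS Products
    FROM Product p JOIN Brand b ON p.BrandID = b.BrandID
    GROUP BY b.BrandName;
""").fetchall())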

Benefits of a Snowflake Schema


Better query performance because it requires less disk storage and joins tables with
smaller sizes.
Offers better flexibility and easier implementation.
Easy to maintain because of less redundancy due to normalization.
Improves overall performance because smaller tables are coupled.
Most of the multi dimensional data modeling tools that use data marts are optimized for
snowflake schemas.

15
Star Schema vs. Snowflake Schema
The differences of star schema and snowflake schema are mentioned below:

Star Schema: Consists of single fact table surrounded by some dimensional tables.
Snowflake Schema: The dimension tables are connected with some sub dimension
tables.

Star Schema : Dimension tables are de normalized.


Snowflake Schema : Dimension tables are normalized.

Star Schema : Used for report generation.


Snowflake Schema : Used for data cube.

Star Schema : Difficult to maintain and may use more storage space.
Snowflake Schema: Easier to maintain and also saves the storage space.

Star Schema: Better and effective navigation.


Snowflake Schema : Less effective navigation due to large number of joins.

Star Schema : Hierarchies for the dimensions are stored in the dimension table itself.
Snowflake Schema : Hierarchies are broken into separate tables. These hierarchies are
used to drill down the data.

Star Schema : The Extract-Transform-Load(ETL) job is less complex.


Snowflake Schema : The ETL job is more complex , as it loads data marts.

Fact Constellation Schema

A fact constellation schema describes a logical database structure of a data warehouse or data mart. It consists of multiple fact tables sharing dimension tables and is an extension of the star schema: a fact constellation schema can be constructed from a star schema by splitting it into multiple fact tables. This schema is more complex than the star or snowflake schemas because it contains multiple fact tables. It offers a flexible solution, though it is harder to manage and support. Such a schema can be viewed as a collection of stars and is therefore also called a galaxy schema.

The following diagram shows a fact constellation schema for sales data in a retail data warehouse. It has three fact tables, named Sales, District, and Region, and three dimension tables, named Warehouse, Time, and Product. The primary key of the Sales fact table is a composite key consisting of the SalesID, WarehouseID, ProductID, and TimeID attributes. The dimension tables have WarehouseID (Warehouse), TimeID (Time), and ProductID (Product) as their primary keys; WarehouseID, ProductID, and TimeID act as foreign keys in the Sales fact table, as shown in the diagram.

Warehouse (Dimension): WarehouseID, WarehouseName, Street, DistrictID, DistrictName, RegionID, RegionName, WarehouseManager

Time (Dimension): TimeID, Day, Month, Quarter, Year

Sales (Fact): SalesID, WarehouseID, ProductID, TimeID, TotalQty, TotalPrice

District (Fact): DistrictID, ProductID, TimeID, UnitsSold, TotalPrice

Product (Dimension): ProductID, ProductName, Brand, UnitPrice, Manufacturer

Region (Fact): RegionID, ProductID, TimeID, UnitsSold, TotalPrice

Fig 1.10: A Fact Constellation Schema for Sales in a Retail Data Warehouse.

Benefits of Fact Constellation Schema

The main advantages of the fact constellation schema are as follows:

Provides a flexible schema for implementation.
Allows multiple fact tables to share dimension tables, which is needed by more refined applications.
Different fact tables can be explicitly assigned to the dimensions, which is advantageous when some facts are associated with a dimension at one level and other facts at a deeper dimension level.

Note: This schema is hard to maintain and support because of its complexity.

Fact Table

In any data warehousing environment, a fact table consists of the measurements, metrics, or facts of a business process, e.g., product-wise monthly sales revenue. It is the central table in the star or snowflake schema of a data warehouse, surrounded by dimension tables. A fact table stores quantitative information for data analysis and is often de-normalized. Fact tables are central to data warehouse design using the multidimensional data model; a fact table might, for example, store a measure giving the number of units sold by date, by warehouse, and by product.

Fact tables generally work together with dimension tables: a fact table holds the data that is analyzed, whereas a dimension table stores data about the ways in which the data in the fact table can be analyzed. The fact table consists of two types of columns: foreign keys and measure columns. In a data warehouse for an enterprise that sells products to customers, every sale is a fact that is recorded, and the quantity of products sold and the total price of the products sold are measures of the fact "sales".

Facts consist of measures, that is, numeric values, which may be additive, semi-additive, or non-additive in nature. They are described as follows (a small sketch follows this list):

Fully-Additive Measure: A fact that can be aggregated across all the dimensions in the fact table. An example is a sales measure: you can add daily sales to get the sales for a week, a month, or a quarter, and you can also add sales across warehouses or regions.

Semi-Additive Measure: A fact that can be summed up across some of the dimensions in the fact table but not others. An example is a checking account balance: you can add the balances of different accounts at a point in time, but it is not correct to add one account's balances for various months across the time dimension.

Non-Additive Measure: A fact that cannot be summed up across any of the dimensions in the fact table. Examples include ratios and percentages, such as a profit margin percentage. Rather than storing the profit margin as an aggregated measure, we make it a calculated measure that is applied at query time.
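A small pandas illustration of the difference between additive and semi-additive behaviour, using an invented account-balance data set: the balance can be summed across accounts for a given month, but summing one account's balances across months is meaningless, so a period-end value is taken instead.

import pandas as pd

balances = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "account": ["A",   "A",   "A",   "B",   "B",   "B"],
    "balance": [100,   120,   90,    200,   210,   205],
})

# Additive across the account dimension: total balance held per month is meaningful.
print(balances.groupby("month", sort=False)["balance"].sum())

# Not additive across the time dimension: take the latest balance per account instead.
print(balances.groupby("account")["balance"].last())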

Fact tables that contain no measures or facts are called factless fact tables. For example, a fact table that has only a product key and a time key is a factless fact table. Though there are no measures in such a table, you can still count its records, for instance to find the number of products sold over a period of time.
Dimension Table Characteristics

A fact table contains business facts or measures, and foreign keys that refer to primary keys in the dimension tables. Commonly used dimensions are products, period, customers, and regions. A dimension table generally contains descriptive attributes that are typically textual fields. When a dimension table is created, a system-generated (surrogate) key is usually used as its primary key. This key is placed in the fact table as a foreign key and is used to join the tables. Fact tables and dimension tables are often de-normalized, because these tables are not used for managing transactions but are built to enable users to perform data analysis.

The primary functions of dimensions are filtering, grouping, and labeling. These functions are often referred to as slice and dice: slicing is filtering the data, dicing is grouping the data, and labeling is presenting query results for the purpose of data analysis. Dimension tables are typically small, ranging from a few to several thousand rows, although dimensions can occasionally grow fairly large. Dimensions have little impact on performance.

Dimension tables often contain attributes that we summarize, filter, or aggregate on. A dimension table contains a primary key based on the business key or a surrogate key; this key is the basis for joining the dimension table with one or more fact tables. The following describes various kinds of dimensions:

Slowly Changing Dimensions: These are dimensions whose attributes undergo changes over time. Based on the business requirement, we decide whether to preserve the history of those changes in the data warehouse (a small sketch of one common way to preserve history follows this list).

Rapidly Changing Dimensions: These are dimensions whose attributes change very frequently.

Junk Dimensions: A junk dimension is a single table that combines assorted, unrelated attributes in order to avoid a large number of foreign keys in the associated fact table. Junk dimensions are often created to manage the foreign keys created by rapidly changing dimensions.

Inferred Dimensions: When fact records are being loaded, a dimension record may sometimes not yet be available. In this case we generate a surrogate key with null values for all other attributes; this is often called an inferred dimension.

Conformed Dimensions: A conformed dimension is a dimension that is used in multiple locations. It may be used with multiple fact tables in one database, or across multiple data marts or data warehouses.

Degenerate Dimension: A degenerate dimension is a dimension attribute that is stored as part of the fact table rather than in a separate dimension table. These are dimension keys for which there are no other attributes. They are often used in drill-through queries to trace the source of an aggregated number in a report.

Role-Playing Dimensions: A role-playing dimension is one whose key joins to more than one foreign key in the fact table. For example, a fact table might contain foreign keys for both the ship date and the delivery date; the same date dimension table then joins to both foreign keys. Because the date dimension plays multiple roles, it is called a role-playing dimension.

Static Dimensions: These dimensions are not extracted from the original data source but are created within the context of the data warehouse. A static dimension can be loaded manually or generated by a procedure, for example a Date or Time dimension.
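As referenced under Slowly Changing Dimensions above, the following is a minimal sketch of one common way to preserve history (often called a Type 2 change): the current dimension row is expired and a new current row with a fresh surrogate key is appended. The row layout and field names are invented for the example.

from datetime import date

# Customer dimension rows: surrogate key, business key, tracked attribute,
# validity window and a current-row flag.
dimension = [
    {"sk": 1, "customer_id": "C100", "city": "Pune",
     "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True},
]

def apply_type2_change(rows, customer_id, new_city, change_date):
    """Preserve history: expire the current row and append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return                            # nothing actually changed
            row["valid_to"] = change_date         # close out the old version
            row["is_current"] = False
    rows.append({"sk": max(r["sk"] for r in rows) + 1,
                 "customer_id": customer_id, "city": new_city,
                 "valid_from": change_date, "valid_to": None, "is_current": True})

apply_type2_change(dimension, "C100", "Mumbai", date(2021, 6, 1))
for row in dimension:
    print(row)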

OLAP Cube

An OLAP cube is a data structure that overcomes the limitations of relational databases by providing rapid analysis of data. It supports manipulating and analyzing data from multiple perspectives and can have more than three dimensions. Using these cubes, operations such as roll-up, drill-down, slice, and dice can be performed on the data. The cubes are used for querying data relevant to the various interests of users. An OLAP cube is also known as a multidimensional cube or hypercube.

An OLAP cube is a multidimensional data structure that is built and optimized for data warehouse and OLAP applications. It is a method of storing data in a multidimensional form, usually for reporting purposes. In a cube, measures (data) are categorized by dimensions. Storing data in cubes increases the speed of data retrieval when you run analysis reports on the data, and it allows managers, executives, and analysts to gain knowledge from the data through fast, consistent, and interactive access. OLAP cubes divide the data into subsets that are defined by dimensions, where a dimension is a descriptive attribute of a measure.

As a database technology, an OLAP cube stores data in an optimized way to provide quick responses to queries by dimension and measure. Most OLAP cubes pre-aggregate the measures at the different levels of the categories in the dimensions to enable quick responses.

An OLAP cube generally addresses a subject area of a business enterprise. It allows you to analyze facts and measures, such as sales transactions, in the context of dimensions such as customer, product, or region. The dimensions become qualifiers that classify or filter the data. For example, suppose we want to know the sales in the north-eastern region by customer name: sales is the fact, while region and customer are dimensions, and the need is to filter the data on region and report sales by customer name.

Uses of OLAP cubes: There are three main reasons for building an OLAP data cube, as mentioned below:

Performance: The pre-aggregation of data in OLAP cubes provides very fast responses to queries that require reading or summarizing millions of records of star schema data. For the drilling, slicing, and dicing activities an analyst performs on the data, the results are almost immediate, whereas they would take a significant amount of time against a relational data store.
Drill-down functionality: Many data mining and reporting tools automatically allow drilling up and down on dimensions when the data source is an OLAP cube.
Availability of software tools: Many data mining and reporting software tools are designed to use an OLAP data source for reporting. These tools are specially built for multidimensional data analysis.

Note: OLAP cube technology costs more in terms of development, learning, and project time, but it returns benefits in the form of fast response times for analyzing large amounts of data. This capability can produce insights that drive actions and decisions, enabling very large gains in organizational productivity, cost savings, or revenue.

Fig 1.11 An OLAP cube for sales of electronic items from a retail store.

OLAP Operations

OLAP provides a user-friendly environment for interactive data analysis. A number of OLAP data cube operations exist to materialize different views of the data, allowing interactive querying and analysis. The following OLAP operations are multidimensional, as they are intended for the multidimensional data model:

Roll-Up
Drill-Down
Slice and Dice
Pivot (Rotate)

Roll-Up: This operation, also known as the drill-up or aggregation operation, performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by removing one or more dimensions from the data cube. The roll-up operator is useful for generating reports that contain subtotals and totals; roll-up refers to the process of viewing data with decreasing detail.

Example: Consider a sales data warehouse environment where we maintain sales information region-wise. The location attribute, about which the data is stored, can be arranged in the hierarchy Street < City < State < Country. A roll-up operation moves up this hierarchy and might group the sales country-wise rather than city-wise. Roll-up can be depicted as follows:

Fig 1.12 Roll-Up operation on sales data cube

Drill-Down: This operation is the reverse of roll-up; it navigates from less detailed data to more detailed data. The drill-down operation can be realized by stepping down a concept hierarchy for a dimension.

Example: In the sales data cube, the sales are arranged quarter-wise. You can move down the hierarchy and arrange the data month-wise; this is the drill-down operation on an OLAP cube. The time dimension contains the hierarchy Day < Month < Quarter < Year, and drill-down occurs by descending the time hierarchy from the level of Quarter to the more detailed level of Month.

Fig 1.13: Drill-Down operation on the sales data cube.

Note: Drill-down refers to the process of viewing data at a level of increased detail, while roll-up refers to the process of viewing data with decreasing detail.

Slice and Dice: The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube, whereas the dice operation defines a sub-cube by performing a selection on two or more dimensions. Figure 1.14 depicts the slice operation, where the sales data are selected from the central cube for the time dimension using the criterion time = Q1.

Fig 1.14: Diagram for the Slice operation.   Fig 1.15: Pivot operation.

Pivot (Rotate): This is a visualization operation that rotates the data axes of the view in order to provide an alternative presentation of the data. Figure 1.15 shows a pivot operation in which the item and location axes of a 2-D slice are rotated. Other variations include rotating the axes in a 3-D cube or transforming a 3-D cube into a series of 2-D planes.
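The four operations can be imitated on a flat fact table with Python/pandas, which is a convenient way to see what each one does even though a real OLAP server would work on a pre-built cube. The sample data and column names below are invented.

import pandas as pd

sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":    ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "location": ["Delhi", "Delhi", "Chennai", "Chennai", "Delhi", "Chennai"],
    "item":     ["TV", "Phone", "TV", "Phone", "TV", "Phone"],
    "sales":    [50, 40, 30, 45, 60, 35],
})

# Roll-up: climb the time hierarchy (month -> quarter), i.e. view data with less detail.
rollup = sales.groupby(["quarter", "item"])["sales"].sum()

# Drill-down: descend to the month level, i.e. view data with more detail.
drilldown = sales.groupby(["quarter", "month", "item"])["sales"].sum()

# Slice: select on a single dimension (time = Q1), producing a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["item"] == "TV")]

# Pivot (rotate): swap the item and location axes of a 2-D view.
view = sales.pivot_table(values="sales", index="item", columns="location", aggfunc="sum")
rotated = view.T

print(rollup, drilldown, slice_q1, dice, rotated, sep="\n\n")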

Other OLAP operations include the following:

Drill-Across: This operation executes queries involving more than one fact table.

Drill-Through: This operation uses relational SQL (Structured Query Language) facilities to
drill through the bottom level of a data cube down to its back-end relational tables.

OLAP Server Architecture

In the OLAP world there are two main server types: MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP). A third OLAP technology, HOLAP (Hybrid OLAP), combines MOLAP and ROLAP.

ROLAP: ROLAP is a form of online analytical processing that performs multidimensional analysis of data stored in a relational database rather than in a multidimensional database (data cube). Because ROLAP uses a relational database, it requires more processing time as well as more disk space, but it supports larger user groups and greater amounts of data.

The user submits queries for multidimensional analysis, and the ROLAP server converts these requests into SQL queries that are submitted to the relational database. The ROLAP server then converts the result set from SQL into a multidimensional format, after which the data is sent to the client for viewing.
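The translation step can be pictured with a toy helper function: a multidimensional request ("sum of a measure by some dimension columns") becomes an SQL GROUP BY statement to be run against the relational store. This is only a caricature of what a real ROLAP server does; the function and the column names are hypothetical, and the joins to the dimension tables are omitted for brevity.

def rolap_query(measure, dimensions, fact_table="Sales"):
    # Build a GROUP BY query that aggregates the measure over the requested dimensions.
    cols = ", ".join(dimensions)
    return (f"SELECT {cols}, SUM({measure}) AS Total "
            f"FROM {fact_table} GROUP BY {cols};")

# "Total sales amount by country and quarter" expressed as SQL for the relational database.
print(rolap_query("SalesAmount", ["Country", "Quarter"]))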

Advantages of ROLAP:

Can handle large amounts of data. The data size is limited only by the underlying relational database; ROLAP itself places no limitation on data size.

Can use the functionalities of the relational database, because ROLAP technologies sit on top of it.

Disadvantages of ROLAP:

Performance can be slow, because each ROLAP request becomes one or more SQL queries on the relational database, and a query may take a long time if the database is quite large.

ROLAP technologies are often limited by SQL functionality, because every ROLAP request is converted into SQL statements, and SQL may not fit all data analysis needs.

MOLAP: MOLAP is a form of online analytical processing that uses a multidimensional data model for data analysis. It requires that the information first be processed into the cube before it can be queried, whereas in ROLAP the data remains directly in the relational database. This is the more traditional way of OLAP analysis: in MOLAP, data is stored in a multidimensional cube, not in the relational database but in proprietary formats. Many end users prefer MOLAP because of its better speed and responsiveness.

Advantages of MOLAP:

Good performance, because MOLAP cubes are built for fast data retrieval.
Can perform complex calculations, because all calculations are pre-generated when the cube is created.

Disadvantages of MOLAP:

There is a limit on the amount of data it can handle, because it is not practical to include very large volumes of detailed data in the cube itself; the cube typically contains only aggregated data.
Needs additional investment in human and capital resources, as cube technology is often proprietary and may not already be available within the enterprise.

HOLAP: HOLAP technologies combine ROLAP and MOLAP and inherit the advantages of both. HOLAP uses cube technology for better performance, and when detailed data is required it can drill through from the cube into the underlying relational data.

The following figures depict the OLAP architectures:

Fig 1.16: OLAP server architecture.

Fig 1.17: ROLAP and MOLAP architecture.

Chapter 2
Data Mining

What is Data Mining?

Data mining is the process of exploring huge stores of data to identify patterns and trends that cannot be found through simple data analysis. It is the computational process of discovering patterns in large data sets using methods from artificial intelligence, machine learning, statistics, and database systems. Data mining generally applies various algorithms to search the data for knowledge, which may be used to evaluate the probability of future events, such as predicting stock market movements. The key properties of data mining are as follows:

Automatic discovery of patterns.

Predicting the most likely outcomes.

Creation of actionable information.

Focusing on large data sets and databases.

Data mining can answer questions that cannot be addressed through simple query and reporting techniques. It is the analysis of data for relationships that have not been discovered before, for example hidden patterns in the sales records of a particular customer. The process of data mining involves analyzing data from multiple perspectives and aggregating it into useful information. Data mining extracts information that can be used to improve revenue, cut costs, and so on. Using data mining techniques, users can analyze data from different dimensions or angles, categorizing and summarizing the relationships identified. Data mining also identifies correlations.

Data mining in an enterprise turns raw data into useful information called knowledge. The patterns identified in large corporate data sets give company executives significant information for developing effective marketing strategies that increase sales and reduce costs. Information extracted through data mining can be converted into useful knowledge about customers' buying patterns and future business trends.

Data mining is heavily used by companies with a strong consumer focus, such as retail, financial, communication, and marketing organizations. It enables these companies to understand the relationships between internal factors such as price or product placement and external factors such as economic indicators, business competition, and customer demographics. This information greatly helps them determine the impact on sales.

Data Mining Definition

Data mining can be defined as follows:

Data mining is the process of extracting useful information from a large set of raw data. Data mining software is used to analyze the identified patterns. It helps in developing effective business strategies to run the business better in a highly competitive world, and in making better corporate decisions for business improvement. It includes effective data collection, maintenance of the data warehouse, and computer processing. Data mining involves segmenting the data and evaluating the probability of future events using various mathematical algorithms. The major applications of data mining include predicting patterns based on trends and behavior analysis, making predictions based on the most likely outcomes, and producing decision-oriented information. Data mining focuses mainly on large data sets and databases for the purpose of analysis.

Data mining is the task of analyzing data for hidden patterns from various perspectives, extracting useful information from raw data for categorization. The data is collected and assembled in a common location, the data warehouse, so that it can be analyzed efficiently with data mining algorithms to make decisions that cut business costs and improve revenue. Data mining is the practice of examining large existing databases to extract useful knowledge; it is also called data discovery or knowledge discovery.

The data mining process involves searching through very large volumes of data for useful information. It uses advanced statistical tools, artificial intelligence techniques, neural networks, and data mining algorithms to identify trends, patterns, and relationships that would otherwise have remained undetected. In contrast with expert systems, data mining discovers hidden rules underlying the data, an activity that is also called data surfing.

Data mining can be considered an integral part of Knowledge Discovery in Databases (KDD), which is the process of converting raw data into useful information; it is the extraction of implicit and previously unknown information. Data mining supports customer relationship management by enabling better and customized services. Using data mining, you can identify interesting knowledge in the form of rules, patterns, regularities, and constraints for better business prospects, which is why data mining is also called a business intelligence (BI) task. Traditional techniques may not be suitable for data mining because of the enormity, heterogeneity, high dimensionality, and distributed nature of the data involved.

KDD (Knowledge Discovery in Databases)

KDD is defined as the non-trivial, iterative, and interactive process of identifying valid, novel, potentially useful, and ultimately understandable knowledge (patterns, models, rules, relationships, etc.) in data.

KDD refers to the entire process of identifying useful knowledge from data, while data mining denotes a particular step in this process. KDD involves the evaluation and interpretation of the patterns identified in the data to make knowledgeable decisions, and it also includes the preprocessing, sampling, and projection of the data before the data mining step. The KDD process is concerned with the development of methods and techniques for finding knowledge in data; within it, specific data mining methods for pattern discovery and extraction are applied. KDD emphasizes the high-level application of data mining methods.

KDD involves multidisciplinary activities and encompasses data storage and access, scaling algorithms to massive data sets, and interpretation of results. The KDD process is facilitated by data warehousing, which involves data cleansing and data access procedures. Artificial intelligence also supports the KDD process by discovering facts from experimentation and observation. The patterns recognized must be valid and possess some degree of certainty.

Some authors do not differentiate data mining from KDD, while others view data mining as one of the essential steps in the process of knowledge discovery. The steps involved in knowledge discovery are as follows:

Data Cleaning: Noise and inconsistent data are removed.

Data Integration: Data from multiple heterogeneous data sources is combined.

Data Selection: Data relevant to the analysis task is retrieved from the database.

Data Transformation: Data is transformed or consolidated into forms suitable for mining by performing summary or aggregation operations.

Data Mining: Various algorithms and intelligent techniques are applied to extract data patterns.

Pattern Evaluation: The extracted data patterns are evaluated for their usefulness to the business.

Knowledge Presentation: The knowledge identified in the data is presented using various data mining and visualization tools so that it can be understood and used for better business decisions.
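A skeletal Python outline of the steps listed above, useful only to show how they chain together; every function body here is a placeholder to be replaced with project-specific logic, and the tiny sample records are invented.

def clean(raw):            return [r for r in raw if r is not None]         # data cleaning
def integrate(*sources):   return [r for s in sources for r in s]           # data integration
def select(data, keep):    return [{k: r[k] for k in keep} for r in data]   # data selection
def transform(data):       return data                                      # e.g. aggregation
def mine(data):            return {"pattern": "placeholder", "rows": len(data)}  # data mining
def evaluate(patterns):    return patterns                                  # pattern evaluation
def present(patterns):     print(patterns)                                  # knowledge presentation

source_a = [{"id": 1, "sales": 10, "region": "East"}, None]
source_b = [{"id": 2, "sales": 7, "region": "West"}]

data = select(integrate(clean(source_a), clean(source_b)), keep=["sales", "region"])
present(evaluate(mine(transform(data))))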

Note: The following diagram shows the KDD process, in which data mining is one of the steps.

Fig 2.1 Knowledge Discovery Process

Challenges in Data Mining

Data mining from heterogeneous data sources and global information systems poses various challenges, because data might reside in distributed computing environments such as LANs or WANs. These data sources may be structured, semi-structured, or unstructured, so mining them for knowledge poses significant challenges. The behavior of a data mining system can also be unpredictable; for example, a system that works well on a smaller data set might fail measurably on a large one.

Data mining is not an easy or simple task, because it involves complex mining algorithms and because the data may not be available in a single place; it has to be integrated from various types of data sources. The major issues a data mining system faces concern mining methodology, user interaction, performance, and diverse data types.

Though data mining is a powerful process, it encounters many challenges during implementation. These challenges may relate to performance, data storage, mining methods, and the techniques and algorithms used. Successful data mining requires identifying these issues and challenges and addressing them.

The data mining process faces the following challenges:

Poor quality data: The data can often be noisy and incomplete, containing dirty data, missing values, inadequate sample sizes, and poor representation. In many cases the data is inaccurate and unreliable, owing to human errors or errors of the measuring instruments.

Distributed data: Real-life data is often stored on different platforms in distributed computing environments: in different databases, flat files, individual systems, external data sources, and on the internet. For technical and organizational reasons, bringing and integrating data from multiple sources into a centralized repository is a very difficult process.

Complex and varied data: Real-world data is highly varied. We need to accommodate data that comes from different sources and in a variety of forms; it can be complex and may include multimedia data such as audio, video, and images, as well as temporal, spatial, and time series data. Maintaining such complex and heterogeneous data, and extracting the required useful information from it, is difficult.

Performance of data mining: The efficiency of the methods, algorithms, and techniques used has a wide impact on the data mining system. Poorly designed algorithms and techniques have an adverse impact on the performance of the system.

Data velocity: Online machine learning often requires models to be updated constantly, which means the data warehouse must also be refreshed with recent, incoming data.

Background knowledge: Collecting and incorporating background knowledge is a significantly complex process. Using background knowledge, accurate, reliable, and highly sophisticated solutions can be found: more useful findings can be made by descriptive tasks and more accurate predictions by predictive tasks.

Data visualization: To display the extracted knowledge in a presentable and easily understandable manner, we need data visualization. The extracted information should convey its intent unambiguously and clearly, yet on many occasions it is difficult to present information in an easily understandable and accurate form, so effective data visualization techniques are needed.

Data privacy and security: Data mining often raises issues concerning data security and privacy. For example, a retailer may learn customers' buying patterns from extracted information without the customers' permission or consent.

Big data: Data mining deals with huge data sets, which may require distributed approaches.

Some of the most challenging research problems in data mining are as follows:

Developing a unifying theory of data mining.
Scaling up for high-dimensional data and high-speed data streams.
Mining sequence data and time series data.
Mining complex knowledge from complex data.
Data mining in a network setting.
Distributed data mining and mining multi-agent data.
Data mining for biological and environmental problems.
Data mining process-related problems.
Security, privacy, and data integrity.
Dealing with non-static, unbalanced, and cost-sensitive data.

Data Mining Tasks

The data mining tasks can be broadly classified into two types, based on what kind of task you have and what you want to achieve: descriptive tasks and predictive tasks. Descriptive data mining tasks characterize the general properties of the data, while predictive data mining tasks predict the behavior of new data based on the available data.

Descriptive Task: This task focuses on finding human-interpretable patterns that describe the data. It can be divided into the following:

Association
Clustering
Summarization
Sequence discovery

Predictive Task: This task involves using some attributes to predict future or unknown values of other attributes. It includes:

Classification
Prediction
Time series analysis
Deviation detection

The relative importance of description and prediction varies considerably between data mining applications. In the context of KDD, description tends to be the more important consideration, whereas in pattern recognition and machine learning tasks prediction is an important goal.

Classification: Classification is learning a function that maps (classifies) a data item into one of several predefined classes. It derives a model that determines the class (category) of an object based on its attributes. A collection of records with known classes and associated attributes is available, and based on it the class or category of a new object is determined. Classification can be used to reduce marketing costs by identifying customers who are likely to buy a new product; the customers' buying preferences are learned from the available past data.
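A minimal classification sketch using scikit-learn (assumed to be installed); the training records, their encoded attributes, and the buy/no-buy labels are invented purely to show the classify-new-records idea.

from sklearn.tree import DecisionTreeClassifier

# Toy training data: each record is [age_group, income_level] encoded as integers,
# with a known class label: 1 = bought the product, 0 = did not buy.
X_train = [[1, 3], [2, 1], [3, 2], [1, 1], [3, 3], [2, 2]]
y_train = [1, 0, 1, 0, 1, 0]

# Learn a function that maps a record's attributes to one of the predefined classes.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classify new customers whose buying behaviour is unknown.
print(model.predict([[1, 2], [3, 1]]))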

Prediction: Prediction is the process of forecasting missing or future data values. A model is developed from past data, and this model is then used to predict the values of the new data of interest. This data mining task is used in medical diagnosis and fraud detection. Techniques such as regression are applied to develop a function that maps a data item to a real-valued prediction variable.

Time Series Analysis: A time series records a sequence of events, and future events are forecast from the recorded past events. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data, and time series forecasting is the use of a model to predict future values based on previously observed values. Time series methods take into account the possible internal structure of the data, which often arises when monitoring industrial processes or tracking corporate business metrics. Time series analysis includes methods to extract useful patterns, trends, rules, and statistics from time-ordered data. Stock market prediction is an important application of time series analysis.
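A very small pandas illustration of the idea, with invented monthly sales figures: a moving average exposes the trend, and a naive forecast takes the mean of the most recent observations. Real forecasting models (ARIMA, exponential smoothing, etc.) go well beyond this sketch.

import pandas as pd

# Hypothetical monthly sales figures indexed by month.
sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
                  index=pd.date_range("2023-01-01", periods=10, freq="MS"))

# A 3-month moving average smooths the series and exposes the underlying trend.
trend = sales.rolling(window=3).mean()

# A naive forecast for the next month: the mean of the last three observations.
forecast = sales.tail(3).mean()
print(trend.tail())
print("next-month forecast:", forecast)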

Deviation Detection: This task focuses on discovering the most significant changes in the data from previously measured values. Deviation detection normally identifies outliers, which express deviations from previously known expectations. The task can be performed using statistical and visualization techniques. Deviation detection is often applied to fraud detection for credit cards and insurance claims, and to quality control. It is needed for various reasons, including the following:

Knowledge discovery: Deviations can be a vital part of important business decisions and scientific discovery.

Auditing: Deviations can reveal problems and malpractice.

Fraud detection: Deviations can reveal fraud cases.

Data cleaning: Deviations caused by mistakes in data entry can be identified and corrected.

Association: Association identifies the similarity or connection among a certain set of items by discovering relationships. Association analysis is used for commodity management, advertising, and catalog design. For example, a retailer can identify the products that are purchased together by customers. Association is a data mining function that identifies the probability of the co-occurrence of items in a collection; the relationships between these items are expressed as association rules. Association rules are often used to analyze sales transactions.
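A bare-bones sketch of the idea behind association analysis, counting how often pairs of items co-occur in invented market-basket transactions and reporting the support and confidence of the resulting X -> Y rules; production systems use dedicated algorithms such as Apriori or FP-growth rather than this brute-force count.

from collections import Counter
from itertools import combinations

# Hypothetical sales transactions (market baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

# Count single items and co-occurring item pairs across all transactions.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions for pair in combinations(sorted(t), 2))

n = len(transactions)
for (x, y), count in pair_counts.items():
    support = count / n                  # fraction of baskets containing both items
    confidence = count / item_counts[x]  # fraction of baskets with x that also contain y
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")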

Clustering: Clustering is used to identify data objects that are similar. This similarity can be
decided based on factors such as purchase behavior, responsiveness to actions, and geographical
region. It is the task of grouping a set of objects in such a way that objects in the same group are
more similar to each other than to those in other groups. These groups are called clusters: similar
objects are grouped in one cluster and dissimilar objects are placed in other clusters.

Summarization: It involves methods for finding a compact description for a subset of data. It
summarizes data, including both primitive and derived data, in order to create a description that
is more general in nature. A set of relevant data is aggregated, resulting in a smaller set. Because
the data in the data warehouse is of high volume, we need mechanisms to obtain only the relevant
data in a summarized format. Data can be aggregated at different levels and from different
angles. For example, in market basket analysis, high-level summarized information can be used
for sales or for customer relationship management involving customer and purchase behavior
analysis.

Sequence Discovery: The major goal of sequence discovery is to identify useful, novel, and
unexpected patterns in databases. For this purpose, a specific type of database, called a sequence
database, is maintained, which contains sequences of events. Sequence discovery is an important
topic of data mining concerned with finding statistically relevant sequential patterns in the data
warehouse database. It is usually assumed that the values are discrete, and thus time series
mining is closely related, but usually considered a different activity.

Data Preprocessing

Data preprocessing is the set of data mining techniques used to transform raw data into a
useful and easily understandable format. Real-world data is often incomplete, inconsistent, and
erroneous; data preprocessing is the task of resolving such issues. It transforms the data into a
format that can be processed more easily and effectively for the purpose of knowledge
discovery.

Data preprocessing generally involves the following tasks:-

Data cleaning.
Data integration and transformation
Dimensionality reduction.
Feature subset selection
Discretization and binarization.

The following diagram depicts the data preprocessing task:-

Fig 2.5 Data Preprocessing Diagram

Data Cleaning

Data cleaning, which is also called data scrubbing or data cleansing, is the process of
cleaning up data in the database which is incorrect, incomplete, or duplicated. It involves the
correction or removal of erroneous or dirty data caused by contradictions, disparities and keying
mistakes. It also includes validation of the changes made, and may require normalization. This
task involves filling in missing values, smoothing noisy data, identifying and removing outliers,
and resolving inconsistencies. It is a technique which involves transformations to correct wrong,
outdated, incomplete, and useless data. This task is performed as a data preprocessing step to
prepare the data for a data warehouse.

After data cleaning is performed, a data set will be consistent with other similar data sets
in the system. The inconsistencies removed might have arisen from user entry errors, from
corruption in transmission or storage, or from different definitions of similar entities in different
data sources.

Fig 2.4 Data Cleaning Process (dirty data passes through the data cleaning step to produce cleaned data)

Missing Values: Missing values commonly occur in data mining applications. Missing data
happens because the data was not available, was not applicable, or the expected event did not
happen. Missing data can also happen because the person who entered the data did not know the
right value or missed filling it in. However, there are many data mining scenarios in which
missing values provide important information. The meaning of a missing value depends on the
context of the system. During data analysis of a data warehousing system, we may treat missing
values as useful information and take them into account in the calculations; doing so helps
maintain balance in the data mining system.

The presence of missing values in a dataset can adversely affect the performance of a
classifier constructed using that dataset. There can be various reasons for missing values in a
dataset, such as human error or hardware malfunction. Before applying any data mining
technique it is necessary to tackle these missing data; otherwise, the information extracted from a
data set containing missing values can lead to incorrect decision making. Data cleaning is the
preprocessing task used to fill in missing values, smooth out noise and correct inconsistencies.
Several techniques are available for this cleaning process, and choosing one depends on the
problem domain.

There are several techniques available to handle the issue of missing values:-

Ignore the Data row: This is done when the class label is missing (assuming the mining task
involves classification). If the percentage of such rows is high, it leads to poor performance. For
example, consider a student enrollment database with a column which specifies the student's
success rate as Poor, Medium, or High. If your target is to predict student success in college,
rows in which this particular column does not contain data are ignored and removed before
executing the algorithm.

Use a Global Constant to fill in for missing values: Decide on a new global constant value, like
"unknown", "N/A", or minus infinity, to fill all the missing values. For example, in the same
student enrollment database, if the state of residence attribute value is missing, filling it with an
arbitrary guess does not make sense; it is better to use a value such as "N/A".

Use Attribute mean: Replace missing values for an attribute with the mean value, or the median
if the data is discrete. For example, in a database of family incomes for a country, if the values
for average income are not entered for some families, you can use the mean income to replace
those missing values.

Use Attribute mean for all samples belonging to the same class: Instead of using the mean (or
median) of an attribute calculated over all the rows in a database, we can limit the calculation to
the relevant class. For example, when classifying customers according to credit risk, replace the
missing value with the average income of customers in the same credit risk category as the given
sample.

Use a data mining algorithm to predict the most probable value: The most probable value for
the missing data can be determined by using techniques such as regression, Bayesian formalism,
decision trees, and clustering algorithms. For example, we can use a decision tree to predict the
probable value of the missing attribute from the other attributes in the data.

Fill in the missing values manually: This approach may not be feasible for a large dataset with
many missing values as the method is extremely time consuming.
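
To make these strategies concrete, the following is a minimal Python sketch (assuming the pandas and numpy libraries; the table, column names and values are invented for illustration) showing how rows with a missing class label can be dropped and how a constant, the attribute mean, or the class-wise mean can be used to fill in missing values.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "success_rate": ["Poor", None, "High", "Medium"],
    "state":        ["KA", None, "TN", "KA"],
    "income":       [42000.0, np.nan, 61000.0, 38000.0],
    "credit_risk":  ["low", "high", "low", "high"],
})

# 1. Ignore the row when the class label (success_rate) is missing.
dropped = df.dropna(subset=["success_rate"])

# 2. Fill a missing categorical value with a global constant such as "N/A".
filled_const = df.assign(state=df["state"].fillna("N/A"))

# 3. Replace a missing numeric value with the attribute mean.
filled_mean = df.assign(income=df["income"].fillna(df["income"].mean()))

# 4. Replace it with the mean of the same class (credit_risk group) instead.
filled_class_mean = df.assign(
    income=df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.mean())
    )
)

Each variant is kept in a separate data frame here so the strategies can be compared; in practice only one of them would normally be applied.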

Noisy Data:

Noisy data is data that has been corrupted, entered erroneously, or damaged in some processing
step; it is data that cannot be interpreted correctly by machines. Noise is a random error or variance in a
measured variable, and noisy data is essentially meaningless data. The term has often been used as a
synonym for corrupt data, but its meaning has expanded to include any data that cannot be
understood and interpreted correctly by machines, such as unstructured text.

Noise can be an unavoidable problem which affects the data collection and data preparation processes in
data mining applications. Noise can consist of implicit errors introduced by measurement tools such as various
types of sensors, or of random errors introduced by batch processes or experts when the data are gathered. Noisy
data can be caused by hardware failures, programming errors, spelling errors and inconsistent industry abbreviations.

The presence of noise can introduce spurious properties into the problem domain. Noisy data increases the
amount of storage space required and can affect the results of data mining tasks. Noisy data can be handled by using
the following techniques:-

Binning: Sort the attribute values and partition them into bins. Then smooth the data by bin means, bin medians, or
bin boundaries. For example, suppose we have the following data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34.
We partition the values into equi-sized bins and can then smooth by bin means, by bin medians, or
by bin boundaries.

Partitioning into equi-sized bins:

Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34

Smoothing by bin means:

Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:

Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
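
The same smoothing can be reproduced with a short Python sketch; it is only an illustration of the binning logic, using the twelve sorted values from the example above.

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bin_size = 4
bins = [values[i:i + bin_size] for i in range(0, len(values), bin_size)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of the
# bin's minimum or maximum value.
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]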

Clustering: Group values in clusters and then detect and remove outliers either manually or automatically.
Outliers are the values that fall outside of the set of clusters. The following figure describes the process of clustering
to smooth noise.

Fig 2.6 Detecting outliers by cluster analysis.

Combined Computer and Human Inspection: The computer detects outliers or suspicious values, which are then
checked by a human.
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression
involves finding the best-fitting line for two attributes, so that one attribute can be predicted from the other.
Multiple linear regression is an extension of linear regression in which more than two attributes are used. The
following diagram gives the details of linear regression for two attributes X and Y; here, attribute X is
independent and attribute Y is dependent.

Fig 2.6 A graph of Linear Regression
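
As a rough illustration, the following Python sketch (assuming numpy; the x and y values are invented) smooths a noisy attribute y by replacing it with the values predicted from a linear fit on attribute x.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # noisy, roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)      # fit the best straight line
y_smoothed = slope * x + intercept              # predicted (smoothed) values
print(y_smoothed)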

Inconsistent Data:

Inconsistent data is the data containing discrepancies in codes or names.


For example:
Age = 55, but Birthday = 13/12/1990 (the two values do not agree).
A code that was previously recorded as 1, 2, 3 is now recorded as A, B, C.
Discrepancies between duplicate records.

Inconsistent data leads to poor data quality, which in turn leads to bad business decisions.
Inconsistent data may cause incorrect or even misleading statistics. From the viewpoint of
business administration, inconsistent data can lead to false conclusions and misdirected investments.

Some data inconsistencies may be corrected manually using external references. For
example, errors made at data entry may be corrected by performing a paper trace. This manual
effort can be combined with routines designed to help correct data inconsistencies. Sometimes,
knowledge engineering tools may be applied to detect data inconsistencies.

Data Integration:

Data integration involves combining data from several disparate sources, which are
stored using various technologies, to provide a unified view of the data. These multiple and
different sources of data include multiple databases, data cubes, or flat files.

The primary objective of data integration is to support the analytical processing of large
data sets by aligning and combining data from various enterprise departments and external
remote sources. Data integration is generally implemented in data warehouses using software
that hosts large data repositories built from internal and external resources. Data integration
may involve inconsistent data and therefore needs data cleaning.

There are a number of issues to be considered during data integration. They are as follows:

Schema Integration.
Redundancy.
Detection and resolution of data value conflicts.
Schema Integration:

A schema is the organization of data. When data from different sources need to be
combined to retrieve information that is not contained entirely in either one, the sources typically
do not have the same schema. For example, one database schema might store information about a
state under the name state, while another database stores the same information under the name
province. When these schemas are to be integrated, one must resolve the fact that the same
information is stored in different ways.
Different databases may also store the same concept in different ways. For example, in
database 1 a book publisher may be a separate entity, whereas in database 2 it might be just an
attribute. These things need to be taken care of while data integration is done. During schema
integration, there are two situations to be dealt with:
When different concepts are modeled in the same way. For example, in a university
database staff and students may both be represented by the entity person though they are
different concepts.
When the same concept is modeled in different ways. For example, the concept of
publisher may be an attribute in one database and an entity in some other database.
The following diagram depicts the process of schema integration:-

In schema 1 we have two entities named Employee and Organization, and the employee
works for the organization. In schema 2 we have three entities named Employee, City and
Region: the employee is born in a city, which belongs to a given region. In the third schema, the
organization belongs to some municipality. These schemas can be integrated to form a single
schema as part of schema integration.

Fig 2.7 Schema Integration


Redundancy:

Redundant data often occurs when multiple databases are integrated. It can happen in
two cases:

Object Identification: The same attribute or object may have different names in different
databases. For example, customer_id in one database and customer_number in another
database.
Derivable Data: One attribute can be derived from another attribute or table, such as annual
revenue.

Redundant attributes can often be detected by correlation analysis and covariance
analysis. Careful integration of the data from multiple sources may help reduce or avoid
redundancies and inconsistencies and improve mining speed and quality.

When two attributes are given, correlation analysis measures how strongly one attribute
implies the other, based on the available data. The correlation between attributes A and B can be
measured by the following formula:

r(A,B) = Σ (A - mean(A)) (B - mean(B)) / (n σA σB)

where n is the number of tuples, mean(A) and mean(B) are the respective mean values of A and
B, and σA and σB are the respective standard deviations of A and B. If the result is greater than
zero, the attributes A and B are positively correlated, meaning the values of A increase as the
values of B increase; the higher the value, the more strongly they are correlated, and a high value
suggests that either A or B could be removed as a redundancy. If the value is equal to zero, A and
B are independent and there is no correlation between them. If the value is less than zero, A and
B are negatively correlated, meaning each attribute discourages the other.
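
The formula above can be checked with a small Python sketch (assuming numpy; the two attributes A and B are invented). Note that the population standard deviation is used, which makes the result identical to the usual Pearson correlation coefficient.

import numpy as np

A = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
B = np.array([12.0, 24.0, 33.0, 41.0, 52.0])

n = len(A)
r_AB = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())

print(r_AB)                      # close to +1: A and B are strongly correlated
print(np.corrcoef(A, B)[0, 1])   # the same value from numpy's built-in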

Apart from detecting redundancies among attributes, duplication should also be detected
at the record level, for example when there are two or more identical records for the same unique key.

Detection and Resolution of Data Value Conflicts:

At the data value level, factual discrepancies exist among several data sources in the values for
the same objects. For example:

For the same real-world entity, say a customer name, attribute values may differ from
source to source. In one data source it may be John Pal and in another data source it may
be J. Pal, but both attribute values represent the same person.
Data value conflicts among attributes also happen because of different representations and
different scales. For example, the attribute height may be expressed in metric units, say in
centimeters, in one data source, while in another data source the same attribute may be
expressed in British (imperial) units, say in inches.

Note: Careful integration of data from multiple different data sources can reduce and avoid
redundancies and inconsistencies in the data warehouse data, resulting in a more accurate and
faster data mining process.

Data Transformation:

Data transformation converts a set of data values from the data format of a source data
system into the data format of a destination data system. It is often used in a data warehouse
system. In this task the data are transformed or consolidated so that the resulting mining process
may be more efficient, and the patterns found may be easier to understand; that is, the data are
transformed into forms suitable for effective data mining. Data transformation can involve
the following:-

Smoothing: It is the task of removing the noise from data. Such techniques involve binning,
clustering, and regression.

Aggregation: It is a type of data mining process where data is searched, gathered and presented
in a report-based, summarized format to achieve specific business objectives or conduct data
analysis. Data aggregation is any process in which data is gathered and expressed in a summary
form, for purposes such as statistical analysis. Aggregations can be based on certain variables in
the data such as age, profession, income, and sales. For example, in order to compute quarterly
or annual sales we may have to aggregate daily sales data. This helps in constructing a data cube
for data analysis across multiple dimensions.

Generalization: Data generalization summarizes data by replacing relatively low level values
with high-level concepts or by reducing the number of dimensions to summarize data. For
example, the numeric values for an attribute age can be replaced by high level concepts such as
young, middle aged, and senior citizen. When you are summarizing the behavior of a group of
students, you may remove birth_date and telephone_number dimensions.

Generalization is the process of creating successive layers of summary data in a database. It is
the process of zooming out to get a broader view of a problem, trend, or situation, and is also
known as the roll-up operation.

Normalization: This is a technique in which we fit the data into a pre-defined boundary, or to be
more specific, a predefined interval. The attribute data are scaled to fall within a small, specified
range, such as -1.0 to 1.0 or 0.0 to 1.0. There are many methods for normalization; we study
three of them:

Min-Max Normalization.
z-score Normalization.
Normalization by Decimal Scaling.
Min-Max Normalization: It performs a linear transformation on the original data. Suppose an
attribute A has minimum and maximum values minA and maxA. Min-max normalization maps a
value v of A to v' in the new range [new_minA, new_maxA] by computing

v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
z-score Normalization: The values of an attribute A are normalized using the mean and standard
deviation of A. A value v of A is normalized to v' by computing

v' = (v - meanA) / σA

where meanA and σA are the mean and standard deviation of A.

Decimal Scaling: This technique normalizes values into the range between -1 and 1. A value v of
A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
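
A minimal Python sketch of the three normalization methods is given below (assuming numpy; the attribute values and the target range [0.0, 1.0] are illustrative).

import numpy as np

A = np.array([200.0, 300.0, 400.0, 600.0, 917.0])

# Min-max normalization into the new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
minmax = (A - A.min()) / (A.max() - A.min()) * (new_max - new_min) + new_min

# z-score normalization: subtract the mean and divide by the std deviation.
zscore = (A - A.mean()) / A.std()

# Decimal scaling: divide by 10**j, where j is the smallest integer such
# that max(|v'|) < 1.
j = 0
while np.max(np.abs(A)) / (10 ** j) >= 1:
    j += 1
decimal_scaled = A / (10 ** j)

print(minmax)
print(zscore)
print(decimal_scaled)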

Attribute Construction: This constructs new attributes out of the original attributes of the data
set. Attribute construction is a process that discovers missing information about the relationships
between attributes by creating new attributes. These new attributes improve the accuracy and
understanding of the data mining process. For example, we may create an attribute area based on
the attributes height and width. Attribute construction can also handle the fragmentation problem
when decision tree algorithms are used for classification.

Dimensionality Reduction

Dimensionality reduction is about converting data of very high dimensionality into data
of much lower dimensionality such that the lower-dimensional representation still conveys as
much of the information as possible. It is the process of reducing the number of random variables
under consideration by obtaining a set of principal variables. This is typically done while solving
machine learning problems to get better features for a classification or regression task.

Because of the huge, multiple and heterogeneous datasets now collected, the data analysis task
has become a challenging one. Such colossal data collections may contain a significant amount
of inconsistency and redundancy, and when dealing with very large multi-dimensional data we
may face more noise, redundant attributes and unconnected data entities. To address these
problems we use the techniques of dimensionality reduction.

The recent trend of collecting huge and diverse datasets has created a great challenge in
data analysis. One of the characteristics of these gigantic datasets is that they often contain
significant amounts of redundancy. The use of very large multi-dimensional data results in
more noise, redundant data, and the possibility of unconnected data entities. To efficiently
manipulate data represented in a high-dimensional space and to address the impact of redundant
dimensions on the final results, dimensionality reduction techniques are applied. These
techniques effectively reduce data dimensionality for efficient data processing tasks such as
pattern recognition, machine learning, text retrieval and data mining.

Dimension reduction thus refers to the process of converting a set of data having vast
dimensions into data with fewer dimensions while ensuring that it conveys similar information
concisely. These techniques are typically used while solving machine learning problems to
obtain better features for a classification or regression task.

The following lists the benefits of dimensionality reduction:-

It helps in data compression and reduces the storage space required, as the data is downsized.
It improves the speed of data accumulation.
It helps in removing noise and redundancies.
It speeds up queries and computations, because fewer dimensions lead to less computation.
It enables the use of algorithms that are unfit for a large number of dimensions.
It increases the performance of the data mining task considerably.
It removes redundant features; for example, the same value may be stored in two different
units (meters and inches).
Reducing the dimensions of data to 2D or 3D helps immensely in data presentation and
visualization.
It helps in understanding patterns more clearly.

The following are some applications of dimensionality reduction:-

Customer relationship management.


Text mining.
Image retrieval.
Microarray data analysis.
Protein classification.
Face recognition.
Handwritten digit recognition.
Intrusion detection.

We introduce the field of dimensionality reduction by dividing it into two parts:

Feature Extraction: Feature extraction creates new features resulting from combinations of the
original features.

Feature Selection: Feature selection chooses a reduced subset of the original features; it is
primarily performed to select relevant and informative features.
The techniques for dimensionality reduction are as follows:

1. Missing Values.
2. Low Variance Filter.
3. High Correlation Filter.
4. PCA.
5. Random Forests.
6. Backward Feature Elimination.
7. Forward Feature Construction.

These techniques are described below:-

1. Missing Values: During data exploration we may encounter many missing values. Attributes
with too many missing values will not carry much useful information, so attributes or variables
whose proportion of missing values exceeds a given threshold can be dropped. The higher the
threshold, the more aggressive the reduction. The method of defining the threshold for the
missing-values percentage varies from case to case.

2. Low Variance Filter: If we come across an attribute having the same value in all records of our
data set, such an attribute has zero variance. Zero-variance attributes will not improve the power
of the model, so all attributes with variance lower than a given threshold can be dropped. Before
applying this technique, normalization is required because variance is range dependent.
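
A possible sketch of this filter with scikit-learn is shown below; the 0.01 variance threshold and the small matrix are purely illustrative, and the data is min-max scaled first because variance is range dependent.

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

X = np.array([
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 10.1],
    [3.0, 0.0,  9.9],
    [4.0, 0.0, 10.0],
])

X_scaled = MinMaxScaler().fit_transform(X)       # normalize the ranges first

selector = VarianceThreshold(threshold=0.01)     # drop near-constant columns
X_reduced = selector.fit_transform(X_scaled)

print(selector.get_support())                    # which columns were kept
print(X_reduced.shape)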

3. High Correlation Filter: The performance of a model can degrade when dimensions exhibit
high correlation with one another. It is not desirable to have multiple attributes carrying similar
information, because several attributes with similar trends carry similar data; in this situation,
only one of them is sufficient to feed the machine learning model. In order to identify
similarities we compute the correlation coefficient between numerical attributes and between
nominal attributes. Each pair of attributes with a correlation coefficient higher than the threshold
is reduced to a single attribute. Correlations are dependent on scaling, so for a meaningful
correlation comparison we need attribute normalization.
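
One way to apply such a filter is sketched below with pandas; the 0.9 threshold and the tiny table are illustrative, and for each pair of attributes whose absolute correlation exceeds the threshold one attribute is dropped.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],   # nearly the same information
    "weight_kg": [55.0, 72.0, 60.0, 88.0, 64.0],
})

corr = df.corr().abs()
# Inspect only the upper triangle so every pair is considered exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

reduced = df.drop(columns=to_drop)
print(to_drop)                 # ['height_in']
print(list(reduced.columns))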

4. PCA (Principal Component Analysis): This is a procedure which transforms the original data
into a new set of attributes called principal components. The transformation is constructed so that
the first principal component has the greatest variance, and each succeeding component has the
highest possible variance under the constraint that it is uncorrelated with the preceding
components. Keeping only the first few components reduces the data dimensionality while
retaining most of the information. The PCA transformation depends on the relative scaling of the
original attribute values, so attribute ranges need to be normalized before applying PCA.
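
The sketch below shows how PCA might be applied with scikit-learn on synthetic data; standardization is done first because PCA depends on the relative scaling of the attributes, and the 95% variance target is only illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.01, size=100)   # nearly redundant column

X_std = StandardScaler().fit_transform(X)       # normalize attribute ranges
pca = PCA(n_components=0.95)                    # keep ~95% of total variance
X_reduced = pca.fit_transform(X_std)

print(pca.n_components_)                        # number of components kept
print(pca.explained_variance_ratio_)            # variance carried by each one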

5. Random Forests: Random forests are a powerful approach to data exploration, data analysis
and predictive modeling. A random forest is a collection of decision trees, all of them different
and not influenced by each other during construction. The overall prediction of the forest is
calculated as the aggregate (majority vote or average) of the predictions made by the individual
decision trees. Random forests are well suited to the analysis of complex data structures
containing millions of attributes.

A random forest generally improves on the decisions of a single decision tree. Single
decision trees are likely to suffer from high variance or high bias, but random forests use
averaging to find a natural balance between the two extremes. One approach to dimensionality
reduction is to generate a large and carefully constructed set of trees against a target attribute and
then use each attribute's usage statistics to find the subset of features that is most informative. If
an attribute is often selected as the best split, it is an informative feature and should be retained.
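
The following is one possible sketch of this idea with scikit-learn on synthetic data: a forest is trained against the target attribute and only the features whose importance exceeds the mean importance are retained (the dataset sizes and the mean-importance rule are illustrative choices).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Keep only the features whose usage (importance) is above the average.
importances = forest.feature_importances_
keep = importances > importances.mean()
X_reduced = X[:, keep]

print(keep.sum(), "features retained out of", X.shape[1])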

The following specifies what random forests offer:

Data visualization for high-dimensional data (many attributes or columns).
Clustering.
Anomaly, outlier, and error detection.
Analysis of small sample data.
Automated identification of important predictors.
Generation of strong and accurate predictive models.
Explicit missing value imputation (for use by other tools).

Note: Random forests produce the classification (majority vote) or mean prediction (regression)
of the individual trees. A random forest is an estimator that fits a number of decision trees on
various sub-samples of the data set and uses averaging to improve the predictive accuracy and
control overfitting.

6. Backward Feature Elimination: We start with a given set of attributes and, in each round, we
try removing each remaining attribute in turn and estimate the resulting performance. The
attribute whose removal leads to the smallest decrease in performance is removed from the
selected set of attributes, and the next round starts with the new selection of attributes.

In practice, the selected classification algorithm is first trained on all n input attributes. At
a given iteration, we remove one input attribute at a time and apply the algorithm on the
remaining set of attributes. The attribute whose removal produces the smallest increase in the
error rate is removed, and the algorithm is then applied on n-1 attributes, then on n-2 attributes,
and so on. Each iteration k produces a model trained on n-k attributes with an error rate e(k).
Ultimately, we determine the smallest number of attributes necessary to reach the expected
classification performance.

7. Forward Feature Construction: This task is the reverse of the backward feature elimination
process. Here we start with one attribute only and apply the algorithm iteratively, adding one
attribute at a time; the attribute that produces the highest increase in performance is selected.
The algorithms for backward feature elimination and forward feature construction are
computationally expensive, so they are only applicable to datasets with a relatively low number
of input attributes.
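
A simple sketch of backward feature elimination is given below, assuming scikit-learn and synthetic data: in each round the attribute whose removal hurts cross-validated accuracy the least is dropped, until a target number of attributes remains. Forward feature construction is the mirror image, starting from one attribute and adding the best attribute in each round; the target size and the choice of a decision tree as the learner are illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

selected = list(range(X.shape[1]))      # start from the full attribute set
target_size = 3
model = DecisionTreeClassifier(random_state=0)

while len(selected) > target_size:
    scores = {}
    for f in selected:
        remaining = [c for c in selected if c != f]
        scores[f] = cross_val_score(model, X[:, remaining], y, cv=5).mean()
    # Drop the attribute whose removal keeps the accuracy highest.
    worst = max(scores, key=scores.get)
    selected.remove(worst)

print("Selected attribute indices:", selected)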
Feature Subset Selection

Feature subset selection is also called variable selection or attribute selection. It is the
automatic selection of the attributes in your data that are most relevant to the predictive modeling
task on which you are working. Feature selection is the process of selecting a subset of relevant
features (attributes) for use in model construction while removing redundant and irrelevant attributes.

We apply the technique of feature subset selection to make model development simpler
and easier to understand and interpret. It helps in faster model induction, and in any application it
is important to know the relevant attributes. It improves the accuracy and performance of the data
mining process. Feature subset selection has become an active research area in pattern
recognition, statistics, and data mining. The major idea of feature selection is to select a certain
subset of features and eliminate the others that have little or no predictive information. Feature
subset selection can significantly make the resulting classifiers more comprehensible.

Feature subset selection differs from dimensionality reduction. Both of these methods are
applied to reduce the number of attributes in the dataset, but dimensionality reduction does so by
creating new combinations of attributes, while feature subset selection includes and excludes
attributes present in the data without modifying them. Feature selection is useful in itself, but it
mostly acts as a filter, muting out features that are not useful in addition to your existing
features.

The feature subset selection process includes four steps, as mentioned below:

Feature subset generation.
Subset evaluation.
Stopping criterion.
Result validation.

The feature subset generation step is a search process which results in the selection of a
candidate subset for evaluation. This task uses sequential or random search strategies, based on
step-wise addition or deletion of features, to generate subsets of features. The generated subset is
evaluated for its goodness by applying an evaluation criterion; if the current subset is better than
the previous subset, it replaces the previous subset. This process continues until the stopping
criterion is reached. Finally, the best feature subset is validated by using prior knowledge or some
other tests.

The following diagram illustrates the feature subset selection process:-

Figure 2.7 Feature Subset Selection (source: http://www.enggjournals.com/ijcse/doc/IJCSE15-07-06-010.pdf). The subset generation step proposes a candidate subset, the subset is evaluated for its goodness, and if the stopping criterion is not met another subset is generated; once the criterion is met, the result is validated.

The issues that can be handled by the feature subset selection process are:

Feature subset selection can create an accurate predictive model. This process aids you
in choosing features that give better accuracy while requiring less data.

Feature subset selection techniques help in identifying and removing redundant, irrelevant
and useless data which may reduce the accuracy of the model.

It is always desirable to have fewer features rather than more. Fewer features reduce the
complexity of the model, and a simpler model is easier to understand, explain and
maintain.

Feature subset selection improves the performance of the predictors. It provides faster
and cost-effective predictors, and gives better understanding of the process that generates
this data.

Feature Subset Selection Algorithms:


On the basis of selection strategy, feature subset selection algorithms are classified into three
different categories, namely Filter methods, Wrapper methods, and Hybrid methods.

Filter Methods: These methods select the feature subset independently of the mining algorithm.
Filter feature selection methods apply a statistical measure to assign a score to each feature; the
features are ranked by the score and either kept or removed from the dataset. These methods
consider each feature independently and can be applied to data with a large number of features
or attributes. The major advantages of these methods are generality and high computational
efficiency.
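
A small sketch of a filter method with scikit-learn is shown below: each feature receives a score from a statistical measure (mutual information here), independent of any particular mining algorithm, and only the k highest-scoring features are kept; the dataset and k = 4 are illustrative.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=4, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=4)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_.round(3))   # score assigned to every feature
print(X_reduced.shape)             # (300, 4)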

Wrapper Methods: These methods require a predetermined algorithm to decide the best feature
subset. In these methods, sets of features are selected using search techniques: different
combinations are prepared, evaluated and compared with other combinations. A predictive
model is applied to evaluate each selected combination of features and assign a score based on
model accuracy. Though these methods usually give better results, they are computationally very
expensive.
Hybrid Methods: These methods combine Filter and Wrapper methods to achieve the
advantages of both. They use independent measures and mining algorithms to measure the
goodness of a newly generated subset. In this approach, the Filter method is first applied to
reduce the search space, and then the Wrapper method is applied to obtain the best feature
subset.

The following gives the wrapper approach to feature subset selection. The induction algorithm is
used as a black box by the feature subset selection algorithm (the feature selection search
proposes feature sets, the induction algorithm estimates their performance on the training set,
and the final hypothesis is evaluated on a test set to obtain the estimated accuracy).
Figure 2.8 Wrapper Method of Feature Subset Selection

The following gives the feature filter approach, in which the features are filtered independently
of the induction algorithm.

Figure 2.9 Feature Subset Selection using Filter Methods (the original input variables pass through variable subset selection, a learning machine, and performance evaluation; if performance has not improved, selection is repeated, otherwise the final subset is returned)

The following mentions the hybrid model of feature subset selection.

Figure 2.10 Feature Subset Selection using the Hybrid Model (the original feature set is first reduced by a Filter method, then further reduced by a Wrapper method, and the resulting feature subset is used for the mining tasks)

The following gives the details of various feature subset selection algorithms using the Filter
approach:

Relief Algorithm: This algorithm uses the relevance evaluation approach. Its major benefit is
that it is scalable to data sets with increasing dimensionality. Its major drawback is that it
cannot eliminate redundant features.

Correlation-based Feature Selection Algorithm: This algorithm uses symmetric uncertainty
for calculating feature-class and feature-feature correlation. Its major benefit is that it handles
both irrelevant and redundant features and prevents the re-introduction of redundant features.
Its major drawback is that it works well only on smaller datasets and cannot handle numeric
class problems.

Fast Correlation-based Filter Algorithm: It uses predominant correlation as a goodness
measure based on the symmetric uncertainty approach. Its major benefit is that it greatly
reduces the dimensionality. Its major drawback is that it cannot handle feature redundancy.

Interact Algorithm: This algorithm uses symmetric uncertainty and a backward elimination
approach. Its major advantage is that it improves the accuracy considerably. Its major drawback
is that its mining performance decreases as the dimensionality increases.

Fast Clustering-Based Feature Subset Selection Algorithm: This algorithm uses a
graph-theoretic clustering method for clustering, and the best feature is selected from each
cluster. Its major benefit is that it reduces the dimensionality considerably. Its major drawback
is that it works only for microarray data.

Condition Dynamic Mutual Information Feature Selection Algorithm: This algorithm uses
the mutual information approach. Its major advantage is better performance. Its major pitfall is
that it is very sensitive to noise.

The following gives the details of various feature subset selection algorithms using the Wrapper
approach:

Affinity Propagation Sequential Feature Selection Algorithm: Here, the Affinity Propagation
clustering algorithm is applied to obtain the clusters; then, the sequential feature selection
algorithm is applied for each cluster to get the best set. This algorithm has the advantage of
being faster than Sequential Feature Selection. Its major drawback is that its accuracy is not
better than Sequential Feature Selection.

Evolutionary Local Selection Algorithm: This algorithm uses the k-means algorithm for
clustering. Its major benefit is that it covers a large space of possible feature combinations. Its
major drawback is that cluster quality decreases as the number of features increases.

The following specifies the various feature subset selection algorithms using the Hybrid approach:

Two-Phase Feature Selection Approach Algorithm: This algorithm uses the Filter approach
with artificial neural network weight analysis to remove irrelevant features. It then applies the
Wrapper approach, where redundant features are removed using a genetic algorithm. Its major
advantage is that it handles both irrelevant and redundant features and improves the accuracy.

Hybrid Feature Selection Algorithm: It applies Filter methods based on mutual information
and also applies a Wrapper-model-based feature selection algorithm. Its major advantage is that
it improves accuracy considerably. Its major drawback is the high computation cost for
high-dimensional data.

Discretization and Binarization

In the area of data mining, many methods such as association rules, Bayesian networks and
induction rules can handle only discrete attributes. So, it is often necessary to convert continuous
attributes into discrete attributes with a set of intervals. For example, the age attribute can be
transformed into discrete values representing six intervals: less than 1 year (Infant), 1 to 12 years
(Child), 12 to 20 years (Adolescent), 20 to 40 years (Youth), 40 to 55 years (Middle Aged), and
more than 55 years (Senior Citizen). This process, in which the attribute age is transformed into
an age group, is known as discretization and is an essential preprocessing task. It is needed not
only because some learning methods cannot handle continuous data, but also because discretized
data is often more relevant for human interpretation. Discretized data is also reduced data, which
improves the computation process.

Certain data mining algorithms require categorical (discrete) data instead of numeric
data. Because of this requirement, the data must be preprocessed to map numeric ranges onto
discrete values. This technique can be applied when small differences in numeric (continuous)
values are irrelevant for the problem. Discretization is thus a process which transforms
quantitative data into qualitative data. Many data mining applications use quantitative data;
however, some data mining algorithms are designed in such a manner that they need qualitative
data, and data mining algorithms are often less efficient on quantitative data. So, discretization
has become one of the most common data preprocessing tasks in many data mining
applications. When the discretization technique is applied, it can reveal non-linear relations,
such as people in the Infant age group and people in the Senior Citizen age group being more
sensitive to illness; this relationship between age and illness is not linear, which is why data may
be discretized even when the learning method can handle continuous data. The process of
discretization can also harmonize the nature of the data if it is heterogeneous. For example, in
text categorization, the attributes are a mix of numerical values and occurrence terms.

Although many machine learning techniques can work on real-valued raw data, doing so
is often less efficient. Discretization, when applied in a data mining task, generally increases the
efficiency of machine learning by providing a definite and discrete domain for the data set. A
typical discretization process contains four steps, as follows:

Sort all continuous values of the attribute to be discretized.

Choose a cut point to split the continuous attribute values into attribute intervals.

Split or Merge the intervals of continuous attribute values.

Choose the stopping criteria of the discretization process.

There are several discretization approaches available, such as Boolean reasoning,
equal-frequency binning, entropy-based methods, and Naive Bayes. These methods are generally
developed to handle some specific problems or domains; therefore, using methods developed for
one domain in another domain may be inappropriate or cause serious problems that affect the
accuracy of the results. Regardless of how each discretization method works, the major emphasis
is on defining a cut point to split the continuous values and defining a stopping criterion for the
discretization process. The following describes the discretization process pictorially:-

Figure 2.9 Discretization Process (a continuous attribute is sorted, a cut point or pair of adjacent intervals is chosen, the evaluation measure is checked, and the attribute is split or merged repeatedly until the stopping criterion is satisfied, yielding the discretized attribute)

Over the years, many discretization algorithms have been proposed and thoroughly tested.

Time and again they have shown that these algorithms have the potential to reduce the
data considerably and improve the predictive accuracy. Discretization methods have been
developed across the following different lines due to different needs.

Supervised vs. Unsupervised.
Dynamic vs. Static.
Global vs. Local.
Splitting (top-down) vs. Merging (bottom-up).
Direct vs. Incremental.

Supervised vs. Unsupervised: Discretization methods can be supervised or unsupervised. While
unsupervised methods are independent of the target attribute, supervised ones make intensive use
of the target attribute. Supervised filters generally take the class value into consideration, while
unsupervised filters do not; that is, supervised discretization methods use the number of classes
as the discretization parameter, and unsupervised methods use the number of bins. Supervised
techniques are more accurate when applied to algorithms based on classification, because the
class information is taken into account. On the other hand, unsupervised techniques are
considerably faster. No single method, however, produces the highest accuracy for all data sets.

The unsupervised discretization methods are divided into two categories, as follows:

Equal-Width interval discretization.
Equal-Frequency interval discretization.

In Equal-Width interval discretization, the domain of attribute a is divided into k intervals of
equal width determined by

h = (amax - amin) / k

where amax = max{a1, a2, ..., an} and amin = min{a1, a2, ..., an}; the k+1 cut points are
amin, amin + h, ..., amin + kh = amax. The disadvantage of this method is an uneven
distribution of data points: some intervals may contain many more data points than others.

Equal-Frequency interval discretization overcomes the limitation of equal-width interval
discretization by dividing the domain into intervals with the same distribution of data points.
This method determines the minimum and maximum values of the attribute and sorts all the
values of the attribute in increasing order. The interval [amin, amax] is divided into k intervals
such that every interval contains the same number (n/k) of sorted values; an interval may also
contain duplicate values. It has been observed that Equal-Frequency interval discretization in
combination with the Naive Bayes learning scheme can give desirable results. This discretization
method is also called proportional k-interval discretization.
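
Both unsupervised schemes can be sketched in a few lines of Python with pandas (the age values and the choice of four bins are illustrative): pd.cut produces equal-width intervals and pd.qcut produces equal-frequency intervals.

import pandas as pd

ages = pd.Series([2, 7, 13, 19, 24, 31, 38, 45, 52, 61, 70, 83])

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal width
equal_freq = pd.qcut(ages, q=4)       # 4 intervals with roughly equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())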

Dynamic vs. Static: Discretization methods can be static or dynamic. Static attribute
discretization is carried out on one attribute at a time, without considering the other attributes,
and is repeated for the other continuous attributes as many times as needed. A dynamic
discretization method, in contrast, discretizes all attributes at the same time.

Global vs. Local: Local discretization methods produce partitions that are applied to localized
regions of the instance space. The algorithm C4.5 exemplifies these methods. Global methods
partition each attribute into regions independent of the other attributes.

Splitting vs. Merging: In the top-down approach intervals are split, while in the bottom-up
approach intervals are merged. For splitting, we evaluate the cut points and choose the best one
to split the range of continuous values into two partitions; discretization then continues on each
part until a stopping criterion is satisfied. For merging, adjacent intervals are evaluated and the
best pair of intervals is merged in each iteration; discretization continues with the remaining,
reduced number of intervals until the stopping criterion is satisfied.

Note: A stopping criterion can be as simple as fixing the number of intervals at the beginning,
or more complex, such as evaluating a function.

Direct vs. Incremental: Direct methods divide the range into k intervals simultaneously (for
example, equal-width and equal-frequency discretization), needing an additional input from the
user to determine the number of intervals. Incremental methods begin with a simple
discretization and pass through an improvement process, needing an additional criterion to know
when to stop discretizing.

Note: Data is usually of type nominal, discrete or continuous. Discrete and continuous data are
ordinal data types with an ordering among their values, whereas nominal values do not possess
any order among them. An attribute is discrete if it has a relatively small, finite number of
possible values, while a continuous attribute is considered to have a very large, sometimes
infinite, number of possible values. A discrete data attribute can be seen as a function whose
range is finite, whereas a continuous attribute is a function whose range is infinite.

Supervised Discretization Methods:
