
B.Sc. (IT) - 5th Semester
BSIT - 53 Data Warehousing & Data Mining
INFORMATION TECHNOLOGY PROGRAMMES
Bachelor of Science in Information Technology - B.Sc.(IT)
Master of Science in Information Technology - M.Sc. (IT)
In
collaboration
with
KUVEMPU UNIVERSITY
Directorate of Distance Education
Kuvempu University
Shankaraghatta, Shimoga District, Karnataka
Universal Education Trust
Bangalore
Titles in this Volume :
BSIT - 53 Data Warehousing & Data Mining
Prepared by UNIVERSAL EDUCATION TRUST (UET)
Bangalore
First Edition : May 2005
Second Edition : May 2012
Copyright by UNIVERSAL EDUCATION TRUST, Bangalore
All rights reserved
No Part of this Book may be reproduced
in any form or by any means without the written
permission from Universal Education Trust, Bangalore.
All Product names and company names mentioned
herein are the property of their respective owners.
NOT FOR SALE
For personal use of Kuvempu University
IT - Programme Students only.
Corrections & Suggestions
for Improvement of Study material
are invited by Universal Education Trust, Bangalore.
E-mail : info@uetb.org
Printed at :
Pragathi Print Communications
Bangalore - 20
Ph : 080-23340100
DATA WAREHOUSING &
DATA MINING
(BSIT - 53)
: Contributing Authors :
Dr. K. S. Shreedhara
UBDT College, Davangere
&
Indira S. P.
GMIT, Davangere
Contents
DATA WAREHOUSING
COURSE SUMMARY 1
Chapter 1
INTRODUCTION TO DATA MANAGEMENT 3
1.0 Introduction ............................................................................... 3
1.1 The Concept of Data Bases............................................................... 5
1.2 Management Information Systems..................................................... 6
1.3 The Concept of Dataware House and Data Mining............................. 6
1.4 Concept of Views.............................................................................. 7
1.5 Concept of Normalization.................................................................. 8
Chapter 2
DEFINITION OF DATA WAREHOUSING 11
2.0 Introduction...................................................................................... 11
2.1 Definition of a Data Ware House....................................................... 12
2.2 The Dataware House Delivery Process.............................................. 12
2.3 Typical Process Flow in a Data Warehouse........................................ 16
2.3.1 Extract and Load Process.................................................... 17
2.3.2. Data cleanup and transformation.......................................... 18
2.3.3 Backup and Archiving......................................................... 18
2.3.4. Query Management............................................................ 19
2.4 Architecture for Dataware House ..................................................... 19
2.4.1 The Load Manager ............................................................. 20
2.4.2 The Ware House Manager ................................................. 21
2.4.3 Query Manager ................................................................. 23
2.5 The Concept of Detailed Information ................................................. 23
2.6 Data Warehouse Schemas ................................................................ 23
2.7 Partitioning of Data ........................................................................... 24
2.8 Summary Information ........................................................................ 25
2.9 Meta Data ........................................................................................ 25
2.10 Data Marts ....................................................................................... 26
BLOCK SUMMARY ....................................................................... 27
Chapter 3
DATA BASE SCHEMA 30
3.1 Star Flake Schemas .......................................................................... 31
3.1.1 What are the Facts and what are the Dimensions? ...................... 31
3.2 Designing of Fact Tables ................................................................... 36
3.3 Designing Dimension Tables .............................................................. 39
3.4 Designing the Star-Flake Schema ....................................................... 41
3.5 Query Redirection ............................................................................. 42
3.6 Multi Dimensional Schemas ............................................................... 43
BLOCK SUMMARY ....................................................................... 44
Chapter 4
PARTITIONING STRATEGY 46
4.1 Horizontal Partitioning ....................................................................... 47
4.2 Vertical Partitioning ........................................................................... 49
4.2.1 Normalisation .......................................................................... 50
4.2.2 Row Splitting .......................................................................... 51
4.3 Hardware partitioning ........................................................................ 52
4.3.1 Maximising the processing and avoiding bottlenecks................... 52
4.3.2 Striping data across the nodes ................................................ 53
4.3.3 Horizontal hardware partitioning ................................................ 54
BLOCK SUMMARY ....................................................................... 54
Chapter 5
AGGREGATIONS 56
5.1 The Need for Aggregation ................................................................ 56
5.2 Definition of Aggregation ................................................................... 57
5.3 Aspects to be looked into while Designing the Summary Tables ........... 59
BLOCK SUMMARY ....................................................................... 63
Chapter 6
DATA MART 65
6.1 The Need for Data Marts .................................................................. 65
6.2 Identify the Splits in Data .................................................................. 68
6.3 Identify the Access Tool Requirements .............................................. 68
6.4 Role of Access Control Issues in Data Mart Design ........................... 68
BLOCK SUMMARY ....................................................................... 70
Chapter 7
META DATA 72
7.1 Data Transformation and Loading ...................................................... 72
7.2 Data Management ............................................................................ 74
7.3 Query Generation ............................................................................. 76
BLOCK SUMMARY ....................................................................... 78
Chapter 8
PROCESS MANAGERS 79
8.1 Need for Managers to a Dataware House .......................................... 80
8.2 System Management Tools ................................................................ 80
8.2.1 Configuration Manager ............................................................ 81
8.2.2 Schedule Manager .................................................................. 81
8.2.3 Event Manager ....................................................................... 82
8.2.4 Database Manager .................................................................. 83
8.2.5 Backup and recovery manager .................................................. 84
8.3 Data Warehouse Process Managers .................................................. 85
8.3.1 Load manager ......................................................................... 85
8.3.2 Ware house Manager .............................................................. 86
8.3.3 Query Manager ...................................................................... 88
BLOCK SUMMARY ....................................................................... 89
DATA MINING
COURSE SUMMARY 92
Chapter 9
INTRODUCTION TO DATA MINING 94
9.0 Introduction ...................................................................................... 94
9.1 What is Data Mining ? ...................................................................... 95
9.2 What Kind of Data can be Mined? .................................................. 98
9.3 What can Data Mining do ? ............................................................... 100
9.4 How do we Categorize Data Mining Systems? ................................ 102
9.5 What are the Issues in Data Mining ? ................................................. 102
9.6 Reasons for the Growing Popularity of Data Mining ............................ 105
9.7 Applications ..................................................................................... 105
9.8 Exercise ........................................................................................... 106
Chapter 10
DATA PREPROCESSING AND DATA MINING PRIMITIVES 108
10.0 Introduction ..................................................................................... 108
10.1 Data Preparation .............................................................................. 108
10.1.1 Select data ......................................................................... 108
10.1.2 Data Cleaning .................................................................... 109
10.1.3 New data construction ........................................................ 110
10.1.4 Data formatting .................................................................. 110
10.2 Data Mining Primitives ..................................................................... 111
10.2.1 Defining data mining primitives ............................................ 111
10.3 A Data Mining Querying Language ................................................... 113
10.3.1 Syntax for Task-relevant data specification........................... 114
10.4 Designing Graphical User Interfaces Based on a Data Mining
Query Language .............................................................................. 117
10.5 Architectures of Data Mining Systems ............................................... 119
10.6 Exercise .......................................................................................... 120
Chapter 11
DATA MINING TECHNIQUES 122
11.0 Introduction ...................................................................................... 122
11.1 Associations ..................................................................................... 122
11.1.1 Data Mining with Apriori algorithm ...................................... 123
11.1.2 Implementation Steps .......................................................... 124
11.1.3 Improving the efficiency of Apriori ....................................... 125
11.2 Data Mining with Decision Trees ....................................................... 125
11.2.1 Decision tree working concept ............................................ 126
11.2.2 Other Classification Methods .............................................. 128
11.2.3 Prediction .......................................................................... 130
11.2.4 Nonlinear Regression .......................................................... 132
11.2.5 Other Regression Models .................................................... 133
11.3 Classifier Accuracy ........................................................................... 133
11.3.1 Estimating Classifier Accuracy ............................................ 134
11.4 Bayesian Classification ..................................................................... 134
11.4.1 Bayes Theorem .................................................................. 135
11.4.2 Naive Bayesian Classification ............................................. 135
11.4.3 Bayesian Belief Networks .................................................. 137
11.4.4 Training Bayesian Belief Networks ..................................... 139
11.5 Neural Networks for Data Mining ..................................................... 140
11.5.1 Neural Network Topologies ................................................ 140
11.5.2 Feed-Forward Networks .................................................... 141
11.5.3 Classification by Backpropagation ....................................... 142
11.5.4 Backpropagation ................................................................ 142
11.5.5 Backpropagation and Interpretability .................................... 146
11.6 Clustering in Data Mining .................................................................... 147
11.6.1 Requirements for clustering ................................................ 147
11.6.2 Type of Data in Cluster Analysis ......................................... 149
11.6.3 Interval-Scaled Variables .................................................... 150
11.6.4 Binary Variables ................................................................. 152
11.6.5 Nominal, Ordinal and Ratio-Scaled Variables ....................... 154
11.6.6 Variables of Mixed Types .................................................... 156
11.7 A Categorization of Major Clustering Methods .................................... 157
11.8 Clustering Algorithm ......................................................................... 158
11.8.1 K-means algorithm ............................................................. 158
11.8.2 Important issues in automatic cluster detection ..................... 159
11.8.3 Application Issues .............................................................. 160
11.9 Genetic Algorithms ........................................................................... 160
11.10 Exercise ......................................................................................... 161
Chapter 12
GUIDELINES FOR KDD ENVIRONMENT 163
12.0 Introduction ..................................................................................... 163
12.1 Guidelines ....................................................................................... 163
12.2 Exercise .......................................................................................... 165
Chapter 13
DATA MINING APPLICATION 167
13.0 Introduction ...................................................................................... 167
13.1 Data Mining for Biomedical and DNA Data Analysis .......................... 167
13.2 Data Mining for Financial Data Analysis ............................................ 169
13.3 Data Mining for the Retail Industry .................................................... 170
13.4 Other Applications ............................................................................ 171
13.5 Exercise .......................................................................................... 171
Data Warehousing
COURSE SUMMARY
The use of computers for data storage and manipulation is a fairly old phenomenon. In fact, one of
the main reasons for the popularity of computers is their ability to store and provide data accurately
over long periods of time. Of late, computers are also being used for decision making. The main
use of historical data is to provide trends, so that future sequences can be predicted. This task can also be
done by the computers which have sufficient capabilities in terms of hardware and software.
Once this aspect was explored, it was possible to make use of computers as a storehouse or
warehouse of data. As the name suggests, huge volumes of data are collected from different sources
and are stored in a manner that is congenial for retrieval. This is the basic concept of data ware housing.
By definition, a data ware house stores huge volumes of data which pertain to various locations and
times. The primary task of the ware house manager is to properly label them and be able to
provide them on demand. Of course, at the next level, it becomes desirable that the manager is able to do
some amount of scanning, filtering etc., so that the user can ask for data that satisfies specific questions
like the number of blue colored shirts sold in a particular location last summer and get the data. Most
database management systems also provide queries to do this job, but a ware house will have to cater to
much larger and rather ad hoc types of data.
In this course, we take you through the preliminaries of a warehouse operation. You are introduced to
the concept of a data ware house as a process and how the computer looks at the entire operation. You will
be introduced to the various concepts of data extraction, loading, transformation and archiving. Also, the
process architecture, with the concept of software components called managers, is introduced.
Then the concept of actual storage of data - the database schema - is dealt with in some detail. We look
at star flake schemas, fact tables, methods of designing them and also the concept of query redirection.
There is also a concept of partitioning the data, to ease the amount of scanning to answer the queries.
We talk of horizontal and vertical partitioning as well as hardware partitioning processes. An introduction
to the concept of aggregations is also provided.
Data ware houses store huge volumes of data like the store house of a huge business organization.
There is a need for smaller retail shops which can provide more frequently used data, without going
back to the store house. This is the concept of data marts. We look at several aspects of data marting.
The data in a ware house is not static; it keeps changing. There is a need to order and maintain the
data. This leads us to the concept of meta data - data about data. The meta data helps us to have the
information about the type of data that we have in the ware house and arrive at suitable decisions about
their manipulations.
Finally, an insight into the working of the various managers - the load manager, the ware house
manager and the query manager is provided.
Chapter 1
Introduction to Data Management
BLOCK INTRODUCTION
In this introductory chapter, you will be introduced to the use of computers for data management. This
will give a sound background to initiate you into the concepts of data warehousing and datamining. In
brief, data is any useful piece of information - but most often data is normally a set of facts and figures
like the total expenditure on books during the last year, total number of computers sold in two months,
number of railway accidents in 6 months - it can be anything. The introduction in this chapter expects no
previous background on the student's part, and it starts from level zero. We start with how computers
could be used for storing and manipulating data, how this can become useful to various
applications, and what are the various terminologies used in the different contexts. It also deals with several
concepts that are used in the subsequent chapters at the conceptual level, so that the student is comfortable
as he proceeds to the subsequent chapters. He can also get back to this introductory chapter, if at a
subsequent level he finds some background is missing for him to continue his studies. For those students
who already have some background of the subject, some of the concepts dealt here may appear redundant,
but it is advised that they go through this block at least once to ensure a continuity of ideas. Though some
of the ideas, like the simple concept of the parts of a computer, are already known to the student, a
different perspective of the same topic is made available. Hence the need to begin at the beginning.
1.0 INTRODUCTION
We begin with the parts of a computer and its primary role. The computer was originally designed to be
a fast computing device - one that can perform arithmetic and logical operations fast. The concept of
using computers for data manipulation came much later.
A computer will have three basic parts:
i. A central processing unit that does all the arithmetic and logical operations. This can be
thought of as the heart of any computer and computers are identified by the type of CPU
that they use.
ii. The memory is supposed to hold the programs and data. All the computers that we come
across these days are what are known as stored program computers. The programs are to
be stored beforehand in the memory and the CPU accesses these programs line by line and
executes them.
iii. The Input/output devices: These devices facilitate the interaction of the users with the computer.
The input devices are used to send information to the computer, while the output devices
accept the processed information from the computer and make it available to the user.
One can note that the I/O devices interact with the CPU and not directly with the memory. Any
memory access is to be done through the CPU only.
We are not discussing the various features of each of these devices at this stage.
A typical computer works in the following manner - the user sends the programs as well as data,
which is stored in the memory. The CPU accesses the program line by line and executes it. Any data that
is needed to execute the program is also drawn from the memory. The output (or the results) are either
written back to the memory or sent to a designated output device as the case may be. Looking the other
way, the programs modify the data. For example, if two numbers 6 and 8 are added giving 14 as the
answer, we can consider that the two input data are modified into their sum.
The other concept that we should consider is that even though originally computers were used to
operate on numbers only, they can store and manipulate characters and sentences also, though in a
very limited sense.
Incidentally, there are two types of memories: the primary memory, which is embedded in the computer
and which is the main source of data to the computer, and the secondary memory, like floppy disks, CDs
etc., which can be carried around and used in different computers. They cost much less than the primary
memory, but the CPU can access data only from the primary memory. The main advantage of computer
memories, both primary and secondary, is that they can store data indefinitely and accurately.
The other aspect we need to consider, before moving on to the next set of aspects, is that of
communication. Computers can communicate amongst themselves through local area/wide area
networks, which means data, once entered into a computer, can be accessed by other computers either
by the use of a secondary storage device like a CD/floppy or through a network. With the advent of the
internet, the whole world has become a network. So, theoretically data on any computer can be made
available on any other computer (subject to restriction, of course).
There is one more aspect that we need to consider. With the advent of computerization, most of the
offices, including sales counters etc. use computers. It is the data generated by these computers like the
sales details, salary paid, accounts details etc., that actually form the databases, the details of which we
will consider in a short while. Even aspects like the speeds of the machines etc. are nowadays convertible
directly into data and stored in computers, which means the data need not be entered even once. As and
when the data are generated, they are stored on the computer.
One more aspect we have to consider is that the cost of memories has come down drastically
over the years. In the initial stages of the computer evolution, the memory, especially the main memory
was very costly and people were always trying to minimize the memory usage. But with the advent of
technology, the cost of the memory has become quite low and it is possible for us to store huge amounts
of data over long durations of time, at an affordable rate.
We shall now list the various aspects we have discussed, that become helpful in the context of this
course.
Computers carry memory with them, which is quite cheap.
Since secondary memories retain data until they are erased (mostly) they form a good place
for storing data.
The data stored can also be transmitted, from one computer to another, either through
secondary devices or through networks.
The computers can use this data in their calculations.
1.1 THE CONCEPT OF DATA BASES
We have seen in the previous section how data can be stored in a computer. Such stored data becomes
a database - a collection of data. For example, if all the marks scored by all the students of a class are
stored in the computer memory, it can be called a database. From such a database, we can answer
questions like: Who has scored the highest marks? In which subject have the maximum number of students
failed? Which students are weak in more than one subject? etc. Of course, appropriate programs
have to be written to do these computations. Also, as the database becomes too large and more and more
data keeps getting included at different periods of time, there are several other problems about maintaining
these data, which will not be dealt with here.
Since handling of such databases has become one of the primary jobs of the computer in recent years,
it becomes difficult for the average user to keep writing such programs. Hence, special languages
called database query languages - have been devised, which make such programming easy; these languages
help in getting specific queries answered easily.
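To make the idea concrete, a minimal sketch in Python using the built-in sqlite3 module is given below; the marks table, its columns and the pass mark of 35 are assumptions chosen purely for illustration.

```python
import sqlite3

# Build a tiny in-memory database of student marks (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (name TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO marks VALUES (?, ?, ?)",
    [("Anita", "Maths", 82), ("Anita", "Physics", 34),
     ("Ravi", "Maths", 91), ("Ravi", "Physics", 28),
     ("Sita", "Maths", 45), ("Sita", "Physics", 30)],
)

# "Who has scored the highest marks?" -- a single declarative query
# replaces a hand-written search program.
top = conn.execute("SELECT name, MAX(score) FROM marks").fetchone()
print("Highest score:", top)

# "In which subject have the maximum number of students failed?" (pass mark assumed to be 35)
worst = conn.execute(
    "SELECT subject, COUNT(*) AS failures FROM marks "
    "WHERE score < 35 GROUP BY subject ORDER BY failures DESC LIMIT 1"
).fetchone()
print("Subject with most failures:", worst)
```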
1.2 MANAGEMENT INFORMATION SYSTEMS
The other important job of a computer is in producing management reports - details that a senior
management person would be interested in knowing about the performance of his organization, without
going to the specifics. Things like the average increase/decrease in sales, employee performance details,
market conditions etc., can be made available in a concise form like tables, bar charts, pie charts etc. if the
relevant details are available. Effectively, this again boils down to handling huge amounts of data, doing
simple arithmetic/statistical operations and producing the data in relevant form. Traditionally such operations
were being undertaken by a group of statisticians and secretaries, but now the computer can do it much
faster, more conveniently, and make the results available to the manager at the click of a button.
1.3 THE CONCEPT OF DATAWARE HOUSE AND DATA
MINING
Now, we are in a position to look at these concepts. The data, when it becomes abundantly large, gives
rise to a warehouse, a storehouse of data. One common feature amongst all such warehouses is that
they have large amounts of data; otherwise the type of data can be anything, from the student data to the
details of the citizens of a city or the sales of previous years or the number of patients that came to a
hospital with different ailments. Such data becomes a storehouse of information. Most organizations tend
to predict the future course of action based on the reports of the previous years. But the queries that arise
here are many times more complicated than in a simple database. For example, instead of simply finding
the marks scored by the weak students, we would like to analyse the tendency of the student to score low
marks in some subject. Instead of simply finding out the sales generated by the different outlets, we would
like to know what are the similar and dissimilar patterns amongst these outlets and how they point to the
overall prospects of the company etc. Since the manager need not be an IT professional on one hand and
since the data to be handled is too huge on the other, these concepts are to be dealt with, with utmost care.
1.4 CONCEPT OF VIEWS
Data is normally stored in tabular form; unless storage in other formats becomes advantageous, we
store data in what are technically called relations, or in simple terms, tables.
A Simple table.
Inside the memory, the contents of a table are stored one by the side of the other, but still we can
imagine it to be a rectangular table that we are accustomed to.
However, when such tables tend to contain a large number of fields (in the above table, each column -
name, age, class etc. - is a field), several other problems crop up. When there are 100 such fields, not
everybody who wants to process the table would be interested in all the 100 fields. More importantly, we
may also not want all of them to be allowed to look into all the fields.
For example, in a table of employee details, we may not want an employee to know the salary of some
other employee, or an external person to know the salary of any employee. In the table of students
above, if someone is interested in knowing the average age of students of a class, he may not be
interested in their marks. Thus, his view will not include the marks. Thus, the same table
may look different to different people. Each person is entitled to his own logical view of the system.
Quite often, we would like to look at the data differently. For example, we would like to look at it as a
hierarchy, where one or more aspects are subordinate to yet another object. Thus the employee table
can be looked at in two ways.
The same table in hierarchical form
The problem is that the same table is sometimes looked at in a hierarchical form and sometimes as
a tabular form. It is to be noted that in all these cases, we are not going to rewrite the data to a different
place or in a different format; we simply use the same set of data, but interpret it differently. While the computer
programs are perfectly capable of doing this, we, as humans, write it in different formats for our understanding.
In succeeding chapters, we come across a number of instances where the views are represented
in different pictorial forms, but it should be remembered that this is for our convenience only.
In some cases, it may also be that the data are actually in different tables. For example, in the above
case of managers, their family details may be in an altogether different table; it is for the software to
select and combine the fields. Such a concept we call a schema - a user's view of data - which we use
extensively in the coming sections.
We also speak of dimensions, which are again similar to the fields of the above table. For example
in the student table, the student has the dimensions of name, age, class, marks1, marks2 etc. The
understanding is that whenever you represent any student in the table, he is represented by these facts. Similarly,
the entity of manager has the dimensions of name, designation, age and salary. Quite often we use the
term dimensions.
1.5 CONCEPT OF NORMALIZATION
Normalization is dealt with in several chapters of any book on database management systems. Here,
we will take the simplest definition, which suffices for our purpose - namely, no field should have subfields.
Again consider the following student table.
Here, under the field marks, we have 3 subfields: marks for subject1, marks for subject2 and marks for subject3.
However, it is preferable to split these subfields into regular fields as shown below.
Quite often, the original table which comes with subfields will have to be modified suitably, by the
process of normalization.
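The splitting of such subfields can be sketched as follows; the subject names and the sample values are assumptions used only for illustration.

```python
# Un-normalised record: the field "marks" has subfields for three subjects.
students_raw = [
    {"name": "Anita", "class": "B.Sc. 5th Sem",
     "marks": {"sub1": 82, "sub2": 34, "sub3": 67}},
]

# Normalised form: every subfield becomes a regular field of its own.
students_normalised = [
    {"name": s["name"], "class": s["class"],
     "marks1": s["marks"]["sub1"],
     "marks2": s["marks"]["sub2"],
     "marks3": s["marks"]["sub3"]}
    for s in students_raw
]
print(students_normalised[0])
```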
Self Evaluation - I
1. What are the main parts of the computer?
2. How can data be transferred form one computer to another?
3. Can the CPU access data directly from the secondary memory?
4. For how long can the secondary memories store data?
5. What is a database?
6. What is the function of a database query language?
7. What is the use of a management information system?
8. What is a relation?
9. Why are different views of a given data needed?
10. Give the simplest definition of normalized data?
Answers
1. CPU, memory, I/O devices
2. Through floppies/CDs or through network connections.
3. No.
4. Until they are erased by the user.
5. A very large collection of useful data.
6. It makes database programming easy.
7. Produces information in a form that is useful to top managers.
8. Data stored in tabular form.
9. Because not all the users need to/should know all the fields.
10. The relation should not have any subfields (Sub dimensions).
Self Evaluation - II
1. With neat diagram explain the parts of computer.
2. Explain the concept of view. Give example.
Chapter 2
Definition of Data Warehousing
BLOCK INTRODUCTION
In this chapter, you will be introduced to the fundamental concepts of a data ware house. A data ware
house is not only a repository of huge volumes of data, but is also a system from where you can get
support to draw meaningful conclusions. We begin with formal definition of a data ware house and
look at the process of evolution of a data ware house. It involves the cooperation of everybody in the
company starting from the IT strategists down to the average users. There is also scope for future
development. We look at the various components that go into the design of a ware house. Then we
look at the typical flow of process in a ware house. This would trace the movement of data starting from
its acquisition to archiving.
The next step is to study the architecture of a typical ware house. The concept of different ware
house managers and their activities are introduced. We also see the concept of ware house schemas -
methods of storing data in a ware house. We also briefly get ourselves introduced to some miscellaneous
concepts of data ware houses.
2.0 INTRODUCTION
In the last two decades, computerization has reached tremendous scales. New computer systems
have been installed to gain a competitive edge in all sorts of business applications, starting from supermarkets,
computerized billing systems, computerized manufacturing to online transactions.
However, it is also realized that enormous knowledge is also available in these systems, which can be
utilized in several other ways. In fact, in today's world, the competitive edge will come more from the
proactive use of information rather than more and more optimization. The information can be tapped for
decision making and for stealing a march over the rival organization.
However, none of the computer systems are designed to support this kind of activity - i.e. to tap the
data available and convert them into suitable decisions. They are not able to support the operational and
multidimensional requirements. Hence, a new set of systems called data ware houses are being developed.
They are able to make use of the available data, make it available in the form of information that can
improve the quality of decision making and the profitability of the organization.
2.1 DEFINITION OF A DATA WARE HOUSE
In its simplest form, a data ware house is a collection of key pieces of information used to manage and
direct the business for the most profitable outcome. It would decide the amount of inventory to be held,
the no. of employees to be hired, the amount to be procured on loan etc.
The above definition may not be precise - but that is how data ware house systems are. There are
different definitions given by different authors, but we keep this idea in mind and proceed.
It is a large collection of data and a set of process managers that use this data to make information
available. The data can be meta data, facts, dimensions and aggregations. The process managers can be
load managers, ware house managers or query managers. The information made available is such that
they allow the end users to make informed decisions.
2.2 THE DATAWARE HOUSE DELIVERY PROCESS
This section deals with the dataware house from a different view point - how the different components
that go into it enable the building of a data ware house. The study helps us in two ways:
i) To have a clear view of the data ware house building process.
ii) To understand the working of the data ware house in the context of the components.
Now we look at the concepts in detail.
i. IT Strategy
The company must have an overall IT strategy and data ware housing has to be a part
of the overall strategy. This would not only ensure that adequate backup in terms of data and investments
are available, but also will help in integrating the ware house into the strategy. In other words, a data ware
house can not be visualized in isolation.
ii. Business Case Analysis
This looks like an obvious thing, but is most often misunderstood. An overall understanding of the
business and the importance of the various components therein is a must. This will ensure that one can
clearly justify the appropriate level of investment that goes into the data ware house design and also the
amount of returns accruing.
Unfortunately, in many cases, the returns out of the ware housing activity are not quantifiable. At the
end of the year, one cannot make statements like - I have saved / generated Rs. 2.5 crore because of data ware housing.
The data ware house affects the business and strategy plans indirectly - giving scope for
undue expectations on one hand and total neglect on the other. Hence, it is essential that the designer
must have a sound understanding of the overall business, the scope for his concept (data ware house) in
the project, so that he can answer the probing questions.
iii. Education
This has two roles to play - one, to make people, especially top level policy makers, comfortable with the
concept. The second role is to aid the prototyping activity.
The data ware house delivery process (figure): IT strategy, business case analysis, education, business requirements, technical blueprint, build the vision, history load, ad hoc enquiry, automation, requirement evolution and future growth.
To take care of the education concept, an initial (usually scaled down) prototype is created and people are encouraged to interact with it. This
would help achieve both the activities listed above. The users become comfortable with the use of the
system and the ware house developer becomes aware of the limitations of his prototype, which can be
improved upon.
Normally, the prototypes can be dispensed with, once their usefulness is over.
iv. Business Requirements
As has been discussed earlier, it is essential that the business requirements are fully understood by the
data ware house planner. This would ensure that the ware house is incorporated adequately in the overall
setup of the organization. But it is equally essential that a scope of 15-25% for future enhancements,
modifications and long term planning is set apart. This is more easily said than done because future
modifications are hardly clear even to top level planners, let alone the IT professionals. However, absence
of such a leeway may end up making the system too constrained and worthless in the very near future.
Once the business requirements are understood, the following aspects also are to be decided.
i) A logical model to store the data within the data ware house.
ii) A set of mapping rules - i.e. the ways and means of putting data into and out of the model.
iii) The business rules to be applied.
iv) The format of query and the profile.
Another pitfall is that some of the data may not be available at this stage (because some
of the data may get generated as and when the system is put to use). Normally, using artificially generated
data to supplement the non-availability of data is not found to be very reliable.
v. Technical Blue Prints
This is the stage where the overall architecture that satisfies the requirements is delivered. At this
stage, the following items are decided upon.
i) The system architecture for the ware house.
ii) The server and data mart architecture.
iii) Design of the data base.
iv) Data retention strategy.
v) Data backup and recovery mechanism.
vi) Hardware and infrastructure plans.
vi. Building the Vision
Here the first physical infrastructure becomes available. The major infrastructure components are set
up, and the first stages of loading and generation of data start. Needless to say, we hasten slowly and start
with a minimal state of data. The system becomes operational gradually over 4-6 months or even more.
vii. History Load
Here the system is made fully operational by loading the required history into the ware house - i.e.
whatever data is available over the previous years is put into the data ware house to make it fully
operational. To take an example, suppose building the vision has been initiated with one year's sales
data and is operational. Then the entire previous data - maybe of the previous 5 or 10 years - is loaded.
Now the ware house becomes fully loaded and is ready to take on live queries.
viii. Adhoc Query
Now we configure a query tool to operate against the data ware house. The users can ask questions
in a typical format (like the no. of items sold last month, the stock level of a particular item during the last
fortnight etc). This is converted into a data base query and the query is answered by the database. The
answer is again converted to a suitable form to be made available to the user.
ix. Automation
This phase automates the various operational processes like
i) Extracting and loading of data from the sources.
ii) Transforming the data into a suitable form for analysis.
iii) Backing up, restoration and archiving.
iv) Generating aggregations.
v) Monitoring query profiles.
x. Extending Scope
There is no single mechanism by which this can be achieved. As and when needed, a new set of data
may be added, new formats may be included, or even major changes may be involved.
xi. Requirement Evolution
Business requirements will constantly change during the life of the ware house. Hence, the process
that supports the ware house also needs to be constantly monitored and modified. This necessitates that
to begin with, the ware house should be made capable of capturing these changes and of growing with
such changes. This would extend the life and utility of the system considerably.
In the next two sections, we look at the overall flow of processes and the architecture of a typical data
ware house. The typical data ware house process follows the delivery system we had discussed in the
previous section. We also keep in mind that typically data ware houses are built with large data volumes
(100 GB or more). Cost and time efficiency are vital factors. Hence a perfect tuning of the processes
and architecture is very essential for ensuring optimal performance of the data ware house. When you
note that typical warehouse queries may run from a few minutes to hours to elicit an answer, you can
understand the importance of these two stages. Unless a perfect fine tuning is done, the efficiencies may
become too low and the only option available may be to rebuild the system from scratch.
2.3 TYPICAL PROCESS FLOW IN A DATA WARE HOUSE
Any data ware house must support the following activities
i) Populating the ware house (i.e. inclusion of data)
ii) Day-to-day management of the ware house.
iii) Ability to accommodate the changes.
The processes to populate the ware house have to be able to extract the data, clean it up, and make it
available to the analysis systems. This is done on a daily / weekly basis depending on the quantum of the
data population to be incorporated.
The day to day management of a data ware house is not to be confused with maintenance and management
of hardware and software. When large amounts of data are stored and new data are being continually
added at regular intervals, maintenance of the quality of data becomes an important element.
Ability to accommodate changes implies the system is structured in such a way as to be able to cope
with future changes without the entire system being remodeled. Based on these, we can view the
processes that a typical data ware house scheme should support as follows.
2.3.1 Extract and Load Process
This forms the first stage of data ware house. External physical systems like the sales counters which
give the sales data, the inventory systems that give inventory levels etc. constantly feed data to the
warehouse. Needless to say, the format of these external data is to be monitored and modified before
loading it into the ware house. The data ware house must extract the data from the source systems, load
them into their data bases, remove unwanted fields (either because they are not needed or because they
are already there in the data base), add new fields / reference data and finally reconcile with the
other data. We shall see a few more details of these broad actions in the subsequent paragraphs.
A mechanism should be evolved to control the extraction of data, check their consistency etc.
For example, in some systems, the data is not authenticated until it is audited. It may also be
possible that the data is tentative and likely to change - for example estimated losses in a natural
calamity. In such cases, it is essential to draw up a set of modules which decide when and how
much of the available data will be actually extracted.
Having a set of consistent data is equally important. This especially happens when we are
having several online systems feeding the data. When data is being received from several
physical locations, unless some type of tuning up w.r.t. time is done, data becomes inconsistent.
For example, sales data from 3-5 P.M. on Thursday may be inconsistent with the corresponding data,
from the same 3-5 P.M. period, on Friday. Though it looks trivial, such inconsistencies are
thoroughly harmful to the system and more importantly, very difficult to locate, once allowed
into the system.
Once data is extracted from the source systems, it is loaded into a temporary data storage
before it is cleaned and loaded into the warehouse. The checks to find whether it is consistent
may be complex and also since data keeps changing continuously, errors may go unnoticed,
unless monitored on a regular basis. In many cases, lost data may be confused with non-existent
data. For example, if the data about purchase details of 10 customers are lost, one may always
presume that they have not made any purchases at all. Though it is impossible to make a list of
all possible inconsistencies, the system should be able to check for such eventualities and more
importantly, correct them automatically. For example, in the case of lost data, if a doubt arises,
it may ask for retransmission.
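A minimal sketch of such an extract-and-load step is given below; the source file layout, the staging structure and the rule for flagging suspect records are assumptions made only for illustration.

```python
import csv
import io

# Pretend this CSV arrived from a sales counter over the network (illustrative data).
incoming = io.StringIO(
    "item_code,item_name,qty,amount\n"
    "A101,Blue Shirt,3,1500\n"
    "A102,Red Shirt,,700\n"        # missing quantity: a suspect record
)

staging = []          # temporary storage, loaded before any cleaning
suspect = []          # records held back for re-checking / retransmission

for row in csv.DictReader(incoming):
    if row["qty"] == "" or row["amount"] == "":
        suspect.append(row)       # do not silently treat lost data as "no purchase"
    else:
        staging.append({"item_code": row["item_code"],
                        "qty": int(row["qty"]),
                        "amount": float(row["amount"])})

print(len(staging), "rows staged;", len(suspect), "rows flagged for retransmission")
```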
2.3.2 Data cleanup and Transformation
Data needs to be cleaned up and checked in the following ways
i) It should be consistent with itself.
ii) It should be consistent with other data from the same source.
iii) It should be consistent with other data from other sources.
iv) It should be consistent with the information already available in the data ware house.
While it is easy to list out the needs of clean data, it is more difficult to set up systems that
automatically clean up the data. The normal course is to suspect the quality of data if it does not meet the
normal standards of common sense, or it contradicts the data from other sources, data already
available in the data ware house etc. Normal intuition doubts the validity of the new data and effective
measures like rechecking, retransmission etc., are undertaken. When none of these are possible, one
may even resort to ignoring the entire set of data and getting on with the next set of incoming data.
Once we are satisfied with the quality of data, it is usually transformed into a structure that facilitates
its storage in the ware house. This structural transformation is done basically to ensure operational and
query performance efficiency.
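The checks listed above can be pictured as simple predicates, as in the sketch below; the field names and the tolerances used are assumptions chosen only for illustration.

```python
def clean(record, same_source_total, warehouse_average):
    """Return True if the staged record passes the basic consistency checks."""
    # (i) consistent with itself: quantity and amount must not be negative
    if record["qty"] < 0 or record["amount"] < 0:
        return False
    # (ii)/(iii) consistent with other data: the total reported by the source
    # should roughly match the running total of its detail records
    if abs(same_source_total - record["running_total"]) > 0.01 * same_source_total:
        return False
    # (iv) consistent with what is already in the ware house: values wildly
    # outside the historical average are held back for re-checking
    if record["amount"] > 10 * warehouse_average:
        return False
    return True

print(clean({"qty": 3, "amount": 1500.0, "running_total": 1500.0},
            same_source_total=1500.0, warehouse_average=900.0))
```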
2.3.3 Backup and Archiving
In a normal system, data within the ware house is normally kept backed up at regular intervals to guard
against system crashes, data losses etc. The recovery strategies depend on the type of crashes and the
amount of losses.
Even apart from that, older data needs to be archived. It is normally not required for the day to
day operations of the data ware house, but may be needed under extraordinary circumstances. For
example, if we normally take decisions based on last 5 years of data, any data pertaining to years beyond
this will be archived.
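The five-year rule mentioned above could be sketched as follows; the directory layout, the partition naming and the cut-off are assumptions made only for illustration.

```python
import shutil
from datetime import date
from pathlib import Path

RETENTION_YEARS = 5                       # data older than this is archived

def archive_old_partitions(warehouse_dir: str, archive_dir: str) -> None:
    """Move yearly partition folders (e.g. 'sales_2017') out of the live warehouse."""
    cutoff = date.today().year - RETENTION_YEARS
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    for part in Path(warehouse_dir).glob("sales_*"):
        year = int(part.name.split("_")[1])
        if year < cutoff:
            shutil.move(str(part), archive_dir)   # archived, not deleted

# archive_old_partitions("warehouse", "archive")   # example call
```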
2.3.4 Query Management
This process manages the queries and speeds up the querying by directing them to the most effective
source. It also ensures that all system resources are used effectively by proper execution. It would also
assist in ware house management.
The other aspect is to monitor query profiles. Suppose a new type of query is raised by an end user.
While the system uses the available resources and query tables to answer the query, it will also note the
possibility of such queries being raised repeatedly and prepare summary tables to answer the same.
One other aspect that the query management process should take care of is to ensure that no single
query can affect the overall system performance. Suppose a single query has asked for a piece of
information that may need exhaustive searching of a large number of tables. This would tie up most of the
system resources, thereby slowing down the performance of other queries. This is again to be monitored
and remedied by the query management process.
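One way to picture such query-profile monitoring is sketched below; the notion of a "query shape" and the threshold of three repetitions are assumptions used only for illustration.

```python
from collections import Counter

class QueryMonitor:
    """Counts repeated query shapes and suggests summary tables for frequent ones."""

    def __init__(self, repeat_threshold: int = 3):
        self.profiles = Counter()
        self.repeat_threshold = repeat_threshold

    def record(self, query_shape: str) -> None:
        self.profiles[query_shape] += 1
        if self.profiles[query_shape] == self.repeat_threshold:
            print(f"Candidate for a summary table: {query_shape}")

monitor = QueryMonitor()
for _ in range(3):
    monitor.record("sales by region by month")   # the third repetition triggers the suggestion
```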
2.4 ARCHITECTURE FOR DATA WARE HOUSE
The architecture for a data ware house is indicated below. Before we proceed further, we should be clear
about the concept of architecture. It only gives the major items that make up a data ware house. The size
and complexity of each of these items depend on the actual size of the ware house itself, the specific
requirements of the ware house and the actual details of implementation.
Architecture of a data ware house (figure), showing the load manager, the ware house manager and the query manager, together with the operational data, external data, detailed information, summary information, meta data and the users.
Before looking into the details of each of the managers, we can get a broad idea about their functionality
by mapping the processes that we studied in the previous section to the managers. The extracting and
loading processes are taken care of by the load manager. The processes of cleanup and transformation
of data, as also of backup and archiving, are the duties of the ware house manager, while the query
manager, as the name implies, takes care of query management.
2.4.1 The Load Manager
The load manager is the system component that performs the operations necessary to support the
extract and load processes. It consists of a set of programs written in a language like C, apart from
several off-the-shelf tools (readily available program segments).
It performs the following operations:
i) To extract data from the source (s)
ii) To load the data into a temporary storage device
iii) To perform simple transformations to map it to the structures of the data ware house.
Most of these are back end operations, performed normally after the end of the daily operation,
without human intervention.
Architecture of the load manager (figure), showing the data extractor, the copy management tool, the loader, file structures, the temporary data store and the ware house data.
Extracting data from the source depends on the configuration of the source systems, and normally can
be expected to be on a LAN or some other similar network. A simple File Transfer Protocol (FTP) should
be able to take care of most situations. This data is loaded into a temporary data storage device. Since
the sources keep sending data at the rates governed by the data availability and the speed of the network,
it is highly essential that the data is loaded into the storage device as fast as possible. The matter becomes
more critical, when several sources keep sending data to the same ware house. Some authors even call
the process fast load to emphasize the criticality of time.
The load manager is also expected to perform simple transformations on the data. This is essential
because the data received by the load manager from the sources have different formats and it is essential
for the data to fit a standard format before it is stored in the ware house data base. The load manager
should be able to remove unnecessary columns (for example, each source sends details like the name of the
item and the code of the item along with the data. When they are being concatenated, the name and code
should appear only once. So, the extra columns can be removed). It should also convert the data into a
standard data type of typical lengths etc.
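A minimal sketch of such simple transformations is given below; the column names, lengths and types are assumptions chosen only for illustration.

```python
def transform(record: dict) -> dict:
    """Map one raw source record to the warehouse's standard structure."""
    return {
        # keep the item code only once, even if the source repeats name and code
        "item_code": record["item_code"].strip().upper(),
        # convert to standard data types and lengths
        "item_name": record.get("item_name", "")[:40],
        "qty": int(record["qty"]),
        "amount": round(float(record["amount"]), 2),
    }

raw = {"item_code": " a101 ", "item_name": "Blue Shirt", "qty": "3", "amount": "1500.50"}
print(transform(raw))
```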
2.4.2 The Ware House Manager
The ware house manager is a component that performs all operations necessary to support the ware
house management process. Unlike the load manager, the warehouse management process is driven by
the extent to which the operational management of the data ware house has been automated.
Architecture of a ware house manager (figure), showing the controlling processes, stored processes, backup / recovery, the temporary data storage, the star flake schema and summary tables.
The ware house manager can easily be termed the most complex of the ware house components,
and performs a variety of tasks. A few of them can be listed below.
i) Analyze the data to confirm data consistency and data integrity.
ii) Transform and merge the source data from the temporary data storage into the ware house.
iii) Create indexes, cross references, partition views etc.
iv) Check for normalizations.
v) Generate new aggregations, if needed.
vi) Update all existing aggregations
vii) Create backups of data.
viii) Archive the data that needs to be archived.
The concept of consistency and integrity checks is extremely important if the data ware house is to
function satisfactorily over a period of time. The effectiveness of the information generated by the ware
house is dependent on the quality of data available. If new data that is inconsistent with the already
existing data is added, the information generated will no longer be dependable. But checking for consistency
can be a very tricky thing indeed. It largely depends on the volume and type of data being stored and at
this stage we will simply accept that the ware house manager is able to take up the same successfully.
Once the data is available in different tables, it is essential that complex transformations are done
on them to facilitate their merger with the ware house data. One common transformation is to ensure
common basis of comparison - by reconciling key reference items and rearranging the related items
accordingly. The problem is that the related items may be inter-referenced and hence rearranging them
may be a complex process.
The next stage is to transform the data into a format suitable for decision support queries. Normally
the bulk of data is arranged at the centre of the structure, surrounded by the reference data. There are
three types of schemas - the star schema, the snowflake schema and the star flake schema. They are dealt with in some
detail in a subsequent chapter.
The next stage is to create indexes for the information. This is a time consuming affair. These
indexes help to create different views of the data. For example, the same data can be viewed on a daily
basis, quarterly basis or simply on an incremental basis. The ware house manager is to create indexes to
facilitate easy accessibility for each of them. At the same time, having a very large number of views can
make the addition of new data and the successive indexing process time consuming.
The ware house manager also generates summaries automatically. Since most queries select only a
subset of the data from a particular dimension, these summaries are of vital importance in improving the
performance of the ware house.
Since the types of summaries needed also keep changing, meta data is used to effect such changes
(the concept is dealt with later).
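The generation of a summary table can be sketched as follows; the fact-table fields and the sample rows are assumptions used only for illustration.

```python
from collections import defaultdict

# A few rows of an assumed sales fact table.
fact_sales = [
    {"product": "shirt", "location": "Bangalore", "month": "2012-01", "amount": 1500},
    {"product": "shirt", "location": "Bangalore", "month": "2012-01", "amount": 700},
    {"product": "shirt", "location": "Mysore",    "month": "2012-01", "amount": 400},
]

# The ware house manager pre-computes a summary table so that common
# queries need not scan every detail row.
summary = defaultdict(float)
for row in fact_sales:
    summary[(row["product"], row["location"], row["month"])] += row["amount"]

for key, total in summary.items():
    print(key, total)
```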
The other operation that the ware house manager should take up is to provide query statistics. The
statistics are collected by the query manager as it intercepts any query hitting the database.
2.4.3 Query Manager
This performs all the operations necessary to support the query management process.
Architecture of the Query Manager
The query manager performs the following operations:
i) Directing queries to the appropriate table(s).
ii) Scheduling the execution of user queries.
2.5 THE CONCEPT OF DETAILED INFORMATION
The idea of a data ware house is to store the detailed information. But obviously all possible details of
all the available information can not be stored online - so the question to be solved is what degree of detail
is required - in other words how much detail is detailed enough for us? There are no fixed answers and
this makes the design process more complex.
2.6 DATA WARE HOUSE SCHEMAS
Star schemas are data base schemas that structure the data to exploit a typical decision support enquiry. When the components of typical enquiries are examined, a few similarities stand out.
i) The queries examine a set of factual transactions - sales for example.
ii) The queries analyze the facts in different ways - by aggregating them on different bases /
graphing them in different ways.
The central concept of most such transactions is a fact table. The surrounding references are called
dimension tables. The combination can be called a star schema.
A star schema to represent the sales analysis: the central fact data (the sales transactions) is surrounded by the dimensional data - suppliers, products, time, location and customers.
The fact table contains the factual information, as collected from the sources. The fact table is the major component of the database. Since the fact data is used by all the database components, and data keeps getting added to it, it is essential that it is maintained accurately from the beginning. Defining its contents accurately is one of the major focus areas of the business requirements stage. The dimension data is the information used to analyze the facts stored in the fact data. Structuring the information in this way helps in optimizing query performance. The dimension data normally changes with time, as and when our needs to assess the fact data change.
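As a rough sketch of what such a schema looks like in practice, the following Python snippet creates a tiny sales star schema in an in-memory SQLite database and runs a typical decision-support query against it. The table and column names are invented for illustration and are not taken from the figure.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Dimension tables: descriptive reference data, liable to change over time.
    conn.executescript("""
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, brand TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, day TEXT, week INTEGER, month INTEGER);

    -- Fact table: one row per sales transaction, read-only once loaded,
    -- holding foreign keys into every dimension plus the measures.
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        time_id     INTEGER REFERENCES dim_time(time_id),
        quantity    INTEGER,
        amount      REAL
    );
    """)

    # A typical decision-support query: aggregate the facts, constrained by dimensions.
    query = """
    SELECT p.brand, t.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.brand, t.month
    """
    print(conn.execute(query).fetchall())   # empty list until facts are loaded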
2.7 PARTITIONING OF DATA
In most ware houses, the size of the fact data tables tends to become very large. This leads to several
problems of management, backup, processing etc. These difficulties can be over come by partitioning
each fact table into separate partitions.
Most often, the queries tend to be about recent data rather than old data. We will be more interested in what happened last week or last month than in the February of two years ago.
Also the queries themselves tend to be mostly similar in nature. Most of the queries will be of the run
of the mill type.
Data ware houses tend to exploit these ideas by partitioning the large volume of data into data sets.
For example, data can be partitioned on a weekly / monthly basis, so as to minimize the amount of data scanned before answering a query. This technique allows the data to be scanned to be minimized, without the
overhead of using an index. This improves the overall efficiency of the system. However, having too
many partitions can be counter productive and an optimal size of the partitions and the number of such
partitions is of vital importance.
Partitioning generally helps in the following ways:
i) It assists in better management of the data.
ii) Backup / recovery is easier, since the volumes are smaller.
iii) Star schemas with partitions produce better performance.
iv) Since several hardware architectures operate better in a partitioned environment, the overall system performance improves.
2.8 SUMMARY INFORMATION
This area contains all the predefined aggregations generated by the ware house manager. This helps in the following ways:
i) It speeds up the performance of commonly used queries.
ii) It need not be backed up, since it can be generated afresh if the data is lost.
However, summary data tends to increase the operational costs on the one hand, and it needs to be updated every time new data is loaded on the other.
In practice, optimal performance can be achieved in the following manner. Since all types of queries cannot be anticipated beforehand, the summary information caters only to the commonly encountered queries. However, answers to other queries need not always be generated afresh; they can often be obtained by combining the existing summary information in different ways.
For example, suppose a system maintains summary information of the sale of its products, week-wise, in each of the cities. If one wants to know the combined sales in all South Indian cities during the last month, one need not start afresh. The summary data available for each South Indian city for the four weeks of the last month can be combined to answer the query. Thus, the overall performance of query processing increases many-fold.
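A toy Python sketch of the idea in this example: weekly, city-level summary rows (assumed to have been pre-computed by the ware house manager) are combined to answer a monthly, region-level question without touching the detailed transactions. The cities and figures are invented.

    # Pre-computed summary rows: (city, week, total_sales) - assumed to have been
    # built by the ware house manager from the detailed fact data.
    weekly_summary = [
        ("Bangalore", 1, 120_000), ("Bangalore", 2, 98_000),
        ("Bangalore", 3, 110_000), ("Bangalore", 4, 105_000),
        ("Chennai",   1,  80_000), ("Chennai",   2, 85_000),
        ("Chennai",   3,  79_000), ("Chennai",   4, 90_000),
        ("Delhi",     1, 150_000), ("Delhi",     2, 140_000),
    ]

    south_indian_cities = {"Bangalore", "Chennai", "Hyderabad", "Kochi"}

    # "Combined sale in all South Indian cities last month" = sum of the
    # relevant weekly summaries; no detailed transactions are scanned.
    monthly_south_total = sum(total for city, week, total in weekly_summary
                              if city in south_indian_cities)
    print(monthly_south_total)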
But this need not mean the system's performance itself improves geometrically. The increase in the operational management cost of creating and updating the summary tables eats up a large part of the gain. Thus, it is always essential to strike a balance: there is a number of summaries beyond which maintaining them becomes counterproductive.
2.9 META DATA
This area stores all the meta data definitions used by all processes within the data ware house. Now,
what is this meta data? Meta data is simply data about data. Data normally describes objects - their quantity, size, how they are stored, and so on. Similarly, meta data stores data about how the data (of the objects) is itself stored, and so on.
Meta data is useful in a number of ways. It can map data sources to the common view of information
within the warehouse. It is helpful in query management, to direct a query to the most appropriate source, and so on.
The structure of meta data is different for each process. It means for each volume of data, there are
multiple sets of meta data describing the same volume. While this is a very convenient way of managing
data, managing meta data itself is not a very easy task.
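A very small Python sketch of meta data as "data about data": a dictionary records, for each warehouse column, which source system and field it comes from and how it is transformed, so that load and query processes can look this up. The source systems, fields and transformations shown are purely hypothetical.

    # Meta data: for each warehouse column, where it comes from and how it is derived.
    column_metadata = {
        "sales_amount": [
            {"source": "pos_system", "field": "net_amt",   "transform": "as-is"},
            {"source": "web_orders", "field": "total_inr", "transform": "round to 2 decimals"},
        ],
        "sale_date": [
            {"source": "pos_system", "field": "bill_date", "transform": "DD/MM/YYYY to ISO date"},
            {"source": "web_orders", "field": "placed_at", "transform": "timestamp to date"},
        ],
    }

    def sources_for(warehouse_column):
        """Used by the load / query managers to find the feeds behind a column."""
        return [m["source"] for m in column_metadata.get(warehouse_column, [])]

    print(sources_for("sale_date"))   # ['pos_system', 'web_orders']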
2.10 DATA MARTS
A data mart is a subset of the information content of a data ware house, stored in its own data base. The data of a data mart may have been collected through the ware house or, in some cases, directly from the source. In a crude sense, if you consider a data ware house as a wholesale shop of data, a data mart can be thought of as a retailer.
They are normally created along functional or departmental lines, in order to exploit a natural break of
data. Ideally the queries raised in a data mart should not require data from outside the mart, though in
some practical cases it may need data from the central ware house (again, compare the wholesale - retail analogy).
Thus the technique to divide into data marts is to identify those subsets of data which are more or less
self contained and group them into separate marts.
However, all the specialized tools of the ware house cannot be implemented at the data mart level, and
hence the specialized operations need to be performed at the central ware house and the data populated
to the marts. Also, a single ware house may not be able to support more than a few data marts for
reasons of maintaining data consistency, lead time needed for populating the marts etc.,
BLOCK SUMMARY
In this chapter, we got an overview of what a data ware house is. We began with the definition of a data ware house and proceeded to see a typical data ware house delivery process. We saw how, starting with the IT strategy of a company, we go through the various stages of a ware house building process, and also how scope for future expansion is made available.
The next stage was to study a typical process flow in a data ware house. They are broadly studied
under the heads of extract and load processes, data clean up and transformation, backup and archiving
and query management. We then took a look at a typical ware house architecture. We got ourselves
introduced to the concepts of the load manager, the query manager and the ware house manager. We
looked into their activities in brief. The next step was to get introduced to the concept of schemas, data
partition, data summaries and meta data. We also looked at the concept of data marts.
Each of these concepts will be expanded in the coming chapters
SELF EVALUATION - I
1. Define a data ware house.
2. What are the roles of education in a data ware housing delivery process.
3. What is history load?
4. Name the 3 major activities of a data ware house?
5. What is data loading process?
6. What are the different ways in which data is to be consistent?
7. What is archiving?
8. What is the main purpose of query management?
9. Name the functions of the load manager
10. Name any 3 functions of the ware house manager.
11. Name the duties of the query manager.
12. What is meta data?
28
13. What is a data mart?
14. What is the purpose of summary information.
SELF EVALUATION - II
1. With diagram explain the dataware house delivery process.
2. With diagram explain the architecture of dataware house.
3. With diagram explain the architecture of the load manager.
4. With diagram explain the architecture of query manager.
5. With diagram explain the warehouse manager.
ANSWER TO SELF EVALUATION - I
1. Collection of key pieces of information to arrive at suitable managerial decisions.
2. a) to make people comfortable with technology
b) to aid in prototyping.
3. Loading the data of previous years into a newly operational ware house.
4. a) populating the data
b) day to day management
c) accommodating changes.
5. Collecting data from source, remove unwanted fields, adding new fields / reference data and reconciling with
other data.
6. a) consistent with itself
b) consistent with other data from same source.
c) consistent with data from other sources.
d) consistent with data already in the warehouse.
7. Removing old data, not immediately needed, from the ware house and storing it else where.
8. To direct the queries to the most effective sources.
9. a) to extract data from the source.
b) to load data into a temporary storage device.
c) to perform simple transformation
10. a) To analyze data for consistency and integrity.
Chapter 2 - Definition of Dataware Housing
29
BSIT 53 Data Warehousing and Data Mining
b) To transform and merge source data into the ware house.
c) create indexes, check normalizations etc.,.
11. a) direct queries to appropriate tables.
b) schedule the execution of queries.
12. Data about data. It helps in data management.
13. A subset of information content of a ware house, stored in its database for faster processing.
14. Speeds up the performance of commonly used queries.
Chapter 3
Data Base Schema
BLOCK INTRODUCTION
In this chapter, we look at the concept of a schema - a logical arrangement of facts to facilitate storage
and retrieval of data. We familiarize ourselves with the star flake schemas, fact tables and ability to
distinguish between facts and dimensions. The next stage is to determine the key dimensions that
apply to each fact. We also learn that a fact in one context becomes a dimension in a different context
and one has to be careful in dealing with them. The next stage is to learn to design the fact tables.
Several issues, like the cost-benefit ratio, the desirable period of retention of data, and minimizing the column sizes of the fact table, are discussed.
Next we move on to the design of dimension tables and how to represent hierarchies and networks.
The other schemas we learn to design are the star flake schema and the multi-dimensional schemas. The
other aspect we look into is the concept of query redirection.
A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval,
as described by the end users. The end user is not bothered about the overall arrangements of the data or
the fields in it. For example, a sales executive, trying to project the sales of a particular item is only
interested in the sales details of that item where as a tax practitioner looking at the same data will be
interested only in the amounts received by the company and the profits made. He is not worried about the
item numbers, part numbers etc. In other words, each of them has his own schema of the same database.
The ware house, in turn, should be able to allow each of them to work according to his own schema and
get the details needed.
The process of defining a schema involves defining a vision and building it, after a detailed requirement analysis and technical blueprint development.
3.1 STAR FLAKE SCHEMAS
One of the key factors for a data base designer is to ensure that a database should be able to answer
all types of queries, even those that are not initially visualized by the developer. To do this, it is essential
to understand how the data within the database is used.
In a decision support system, which is what a data ware house is basically supposed to provide, a large number of different questions are asked about the same set of facts. For example, given the sales data, questions like the following can be asked:
i) What is the average sales quantum of a particular item?
ii) Which were the most popular brands in the last week?
iii) Which item has the least turnaround time?
iv) How many customers returned to procure the same item within one month?
They are all based on the sales data, but the method of viewing the data to answer each question is different. The answers need to be given by rearranging or cross-referencing different facts.
The basic concept behind the schema (that we briefly introduced in the previous section) is that
regardless of how the facts are analyzed, the facts are not going to change. Hence, the facts can be kept in a read-only area, while the reference data, which keeps changing over a period of time depending on the type of queries of the customers, will be read / write. Hence the typical star schema.
Another star schema, to answer details about customers, would relate the customer details to customer events, customer accounts, customer location and time.
3.1.1 Who are the Facts and who are the Dimensions?
The star schema looks like a good solution to the problem of ware housing. It simply states that one should identify the facts and store them in the read-only area, with the dimensions surrounding that area. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does
one identify the facts and the dimensions? It is not always easy, but the following steps can help in that
direction.
i) Look for the fundamental transactions in the entire business process. These basic entities are
the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for
dimension tables.
iii) Ensure that the facts do not include candidates that are actually dimensions with a set of facts attached to them.
iv) Ensure that the dimensions do not include candidates that are actually facts.
We shall elaborate each with some detail.
LOOK FOR ELEMENTAL TRANSACTION
This step involves understanding the primary objectives from which data is being collected and which
are the transactions that define the primary objectives of the business. Depending on the primary business
of the company, the facts will change. As per the sales example, when data from the sales outlets keeps coming in, it gives details about the items, numbers, amounts sold, tax liabilities, the customers who bought the items, etc. It is essential to identify that the sales figures are of primary concern.
To give a contrast, assume that similar data may also come from a production shop. It is essential to
understand that the facts about the production numbers are the ones about which the questions are going
to be asked. Having said that, we would also note that it is not always easy to identify the facts at a
cursory glance.
The next stage is to ask the question, whether these facts are going to be modified / operated upon
during the process? Again this is a difficult question, because once the facts are in the ware house, they
may get changed not only by the other transactions within the system, but also from outside the system.
For example, things like tax liabilities are likely to change based on the government policies, which are
external to the system. But the sales volumes, do not change once the sales are made. Hence the sales
volumes are fit to be considered for including in the fact tables.
It must have become clear by now that identifying the elemental transactions needs an in-depth knowledge of the system for which the ware house is being built, as well as of the external environment and the way the data is operated upon. Also, the frequency and mode of attachment of new data to the existing ware house
facts is an important factor and needs to be identified at this stage itself.
DETERMINE THE KEY DIMENSIONS THAT APPLY TO EACH
FACT
In fact, this is a logical follow up of the previous step. Having identified the facts, one should be able
to identify in what ways these facts can be used. More technically, it means finding out which entities are
associated with the entity represented in the fact table. For example, entities like profit, turn around, tax
liabilities etc. are associated with the sales entity, in the sense, these entities, when enquired for would
make use of the sales data. But the key problem is to identify those entities that are not directly listed, but
may become applicable. For example, in the sales data, the entity like storage area or transportation costs
may not have been appearing directly, but questions like what is the storage area needed to cater to this
trade volume may be asked.
ENSURE THAT A CANDIDATE FACT IS NOT A DIMENSION
TABLE WITH DENORMALISED FACTS
One has to keep checking each of the facts and the dimensions to ensure that they match: what appears to be a candidate fact table can indeed be a combination of facts and dimensions. To clear the doubts, look at the following example. Consider the case of a Customer field in a sales company. A typical record can be of the following type:
Name of the customer
His address
Dates on which
    the customer registered with the company
    the customer requested items
    the items were sent
    bills were sent
    the customer made the payment
    payments were encashed, etc.
The name indicates that it has details about the customer, but most of the details are dates on which
certain events took place. As far as the company is concerned, this is the most natural way of storing
the facts. Hence, each of the dates should be so represented that each date becomes a row in the fact
table. This may slightly increase the size of the database, but is the most natural way of storing the data
as well as retrieving data from it. The star schema for the same would place the customer's operational events (the dates) in the fact table, with the customer address details as a dimension.
One can typically ask queries of the following nature in this type of data.
i. The mean delivery time of items
ii. The normal delay between the delivery of items and receipt of payments
iii. The normal rate of defaults etc.
The key point arising out of the above discussions is as follows.
When data is to be stored, look for items within the candidate fact tables that are actually denormalised tables, i.e. where the candidate fact table is a dimension containing repeating groups of factual attributes. For example, in the above case, the field "dates on which" was actually a set of repeating groups of attributes - the dates of operations.
When such data is encountered, one should design the tables such that the rows do not vary over time. For example, in the above case the customer might have requested items on different dates. When a new "date of item requested" comes up, it should not replace the previous value. Otherwise, as one can clearly see, reports like "how many times did the customer order the items" can never be generated. One simple way of achieving this is to make the items read-only: as and when new dates keep coming, new rows are added.
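A short Python sketch of this "each date becomes a row" rule: every new event date for a customer is appended as a read-only row rather than overwriting a single attribute, so history is never lost. The customer id, event names and dates are illustrative.

    from datetime import date

    # Fact table: one row per (customer, event, date) - rows are only ever appended.
    customer_event_facts = []

    def record_event(customer_id, event, event_date):
        customer_event_facts.append(
            {"customer_id": customer_id, "event": event, "date": event_date})

    record_event("C042", "item_requested", date(2004, 1, 10))
    record_event("C042", "item_requested", date(2004, 3, 5))   # does NOT replace the first request
    record_event("C042", "items_sent",     date(2004, 3, 9))

    # Because the history is kept, questions like "how many times did the
    # customer order?" remain answerable.
    orders = sum(1 for f in customer_event_facts
                 if f["customer_id"] == "C042" and f["event"] == "item_requested")
    print(orders)   # 2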
CHECK THAT A CANDIDATE DIMENSION IS NOT ACTUALLY
A FACT TABLE
This ensures that key dimensions are not fact tables.
Consider the following example.
Let us elaborate a little on the example. Consider a customer A. If the warehouse is building profiles of customers, then A becomes a fact: against the name A, we can list the address, purchases, debts, etc., and ask questions like "how many purchases has A made in the last 3 months?". In that situation, A is a fact. On the other hand, if the data is likely to be used to answer questions like "how many customers have made more than 10 purchases in the last 6 months?", where the data of A, as well as of other customers, is used only to build up the answer, then customer acts as a dimension of the sales facts. The rule, in such cases, is to avoid treating A as a candidate fact.
It is left to the student to think of examples that make promotions facts as well as dimensions.
The golden rule is: if an entity is viewed in three or more different ways, it is probably a fact table, and a fact table cannot become a key dimension.
Entity      Fact or dimension   Condition
Customer    Fact                If it appears in a customer profile or customer database
Customer    Dimension           If it appears in the sales analysis of a data warehouse (for example, the number of customers)
Promotion   Fact                If it appears in the promotion analysis of a data warehouse
Promotion   Dimension           In other situations
3.2 DESIGNING OF FACT TABLES
The above listed methods, when iterated repeatedly will help to finally arrive at a set of entities that go
into a fact table. The next question is how big a fact table can be? An answer could be that it should be
big enough to store all the facts, still making the task of collecting data from this table reasonably fast.
Obviously, this depends on the hardware architecture as well as the design of the database. A suitable hardware architecture can ensure that the cost of collecting data is reduced by the inherent capability of the hardware; on the other hand, the database design should ensure that whenever data is asked for, the time needed to search for it is minimal. In other words, the designer should be able to
balance the value of information made available by the database and cost of making the same data
available to the user. A larger database obviously stores more details, so is definitely useful, but the cost
of storing a larger database as well as the cost of searching and evaluating the same becomes higher.
Technologically, there is perhaps no limit on the size of the database.
How does one optimize the cost-benefit ratio? There are no standard formulae, but the following facts can be taken note of.
i. Understand the significance of the data stored with respect to time. Only those data that are
still needed for processing need to be stored. For example customer details after a period of
time may become irrelevant. Salary details paid in 1980s may be of little use in analyzing the
employee cost of the 21st century. As and when the data becomes obsolete, it can be removed.
(There may arise a special case, when somebody asks for a historical fact after, say, 50 years. But such cases are rare and do not warrant maintaining huge volumes of data.)
ii. Find out whether maintaining of statistical samples of each of the subsets could be resorted to
instead of storing the entire data. For example, instead of storing the sales details of all the 200
towns in the last 5 years, one can store details of 10 smaller towns, five metros, 10 bigger cities
and 20 villages. After all, data warehousing is most often resorted to for getting trends and not the actual figures. These sampled subsets can always be extrapolated to get the overall picture, instead of storing the entire data.
iii. Remove certain columns of the data, if you feel it is no more essential. For example, in a
railway database the column of age and sex of the passenger is stored. But to analyse the
traffic over a period of time, these two do not really mean much and can be conveniently
removed.
iv. Determine the use of intelligent and non intelligent keys.
v. Incorporate time as one of the factors in the data table. This can help in indicating the usefulness of the data over a period of time and the removal of obsolete data.
vi. Partition the fact table. A record may contain a large number of fields, only a few of which are actually needed in each case. It is desirable to group those fields which will be useful into
smaller tables and store them separately. For example, while storing data about employees, the family details can become a separate table, and the salary details can be stored in a different one. Normally, when computing salary, taxes etc., the number of children does not matter.
Now let us look into each of the above in a little more detail in the perspective of data warehousing.
IDENTIFICATION OF PERIOD OF RETENTION OF DATA
Ask a businessman, and he will say he wants the data to be retained for as long as possible - 5, 10, 15 years; the longer the better. The more data you have, the better the information generated. But such a view of things is unnecessarily simplistic.
One need not have large amounts of data to get accurate results. One should retain relevant data.
Having a larger volume of data means larger storage costs and more efforts in compilation of data. But
more accurate data? Need not always be.
The database designer should try a judicious mix of detail of data with degrees of aggregation. One
should retain only the relevant portions of the data for the appropriate time only. Consider the following.
If a company wants to have an idea of the reorder levels, details of sales of last 6 months to one year
may be enough; the sales pattern of 5 years ago is unlikely to be relevant today.
If an electricity company wants to know the daily load pattern, the loading pattern of about one month may be enough, but one should take care to look at the appropriate month: the load of the winter months may be different from the load of the summer months.
So it is essential to determine the retention period for each function; once it is drawn up, it becomes easy to decide on the optimum volume of data to be stored. After all, data warehousing deals more with patterns and statistics than with actual figures.
DETERMINE WHETHER THE SAMPLES CAN REPLACE THE
DETAILS
As discussed earlier, instead of storing the entire data, it may be sufficient to store representative
samples. An electricity company, instead of storing the load pattern of all the 5 lakh houses of a city, can
store data about a few hundred houses each of the lower income, middle and posh homes. Similar
approximation can be done about different classes of industries and the results can be scaled by a suitable
factor. If the subsets are drawn properly, the exercise should give information as accurate as the actual
data would have given, at a much lower effort.
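A rough Python sketch of replacing full detail with a representative sample, on the assumption that the sample is drawn properly: detailed sales are kept for only a sampled subset of towns and scaled up to estimate the total. The counts and figures are invented.

    # Suppose 200 towns exist but we retain detailed sales for only 20 sampled towns.
    TOTAL_TOWNS = 200
    sampled_sales = {                     # town -> last month's sales (retained detail)
        "Town_%02d" % i: 50_000 + 1_000 * i for i in range(20)
    }

    # Estimate the all-towns total by scaling up the sample mean.
    sample_mean = sum(sampled_sales.values()) / len(sampled_sales)
    estimated_total = sample_mean * TOTAL_TOWNS
    print(round(estimated_total))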
SELECT THE APPROPRIATE COLUMNS
The more the number of columns in the data, the more will be the storage
space and also the search time. Hence it is essential to identify all superfluous data and remove it. Typically, status indicators, intermediate values and bits of reference data replicated for query performance can be recommended for removal. To decide:
i. Examine each attribute of the data
ii. Find out if it is a new factual event?
iii. Whether the same data is available elsewhere, even if in a slightly different form?
iv. Is the data only control data?
In principle, all data that is present in some form elsewhere, or can be derived from the data available at other places, can be deleted. Also, all intermediate data can be deleted, since it can always be reproduced, or it may not be needed at all.
For example, in a tax data base, items like the tax liability can always be derived given the income
details. Hence the column on tax liability need not be stored in the warehouse.
MINIMISE THE COLUMN SIZES IN THE FACT TABLE
Efforts need to be made to save every byte of data while representing the facts. Since the data in a
typical ware house environment will be typically of a few million rows, even one byte of data saved in
each row could save large volumes of storage. Efforts should be made to ensure proper data representation,
removal of unnecessary accuracy and elimination of derived data.
Incorporate Time into the Fact Table
Time can be incorporated into a fact table in different ways. The most straightforward way is to create a foreign key to store actual physical dates; actual dates like 24 March 2004 can be stored. However, in some cases it may be more desirable to store the dates relative to a starting date. For example, 1 January of each year may be the start of a table, so that each date entry indicates the number of days passed since then: 24 can indicate 24th January, while 32 will indicate 1st February, and so on. Note that some computation is required, but the storage required is reduced drastically. In some cases, where the actual dates are really not needed but only a range of dates is enough, the column can be skipped altogether. For example, in cases like sales analysis over a period of time, the fact that a sale was made between 1st January and 31st January is enough and the actual dates are not important. Then all data pertaining to the period can be stored in one table, without bothering about storing the actual dates.
Thus we summarise that the possible techniques for storing dates are
i. Storing the physical date
ii. Store an offset from a given date
iii. Storing a date range.
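The three techniques can be seen side by side in a small Python sketch: the same sale date stored as a physical date, as an offset in days from 1 January (as in the example above), and as a coarse month range. Everything beyond the 1 January anchor mentioned in the text is an illustrative assumption.

    from datetime import date

    sale_date = date(2004, 2, 1)

    # (i) Store the physical date itself.
    physical = sale_date.isoformat()                             # '2004-02-01'

    # (ii) Store an offset from a given date (here, days counted from 1 January).
    offset = (sale_date - date(sale_date.year, 1, 1)).days + 1   # 32, i.e. "first February"

    # (iii) Store only a date range - here just the month bucket.
    date_range = (sale_date.year, sale_date.month)               # (2004, 2): "some day in Feb 2004"

    print(physical, offset, date_range)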
3.3 DESIGNING DIMENSION TABLES
After the fact tables have been designed, it is essential to design the dimension tables. However, the
design of dimension tables need not be considered a critical activity, though a good design helps in improving
the performance. It is also desirable to keep the volumes relatively small, so that restructuring cost will be
less.
Now we see some of the commonly used dimensions.
Star Dimension
Star dimensions speed up query performance by denormalising the reference information into a single table. They presume that the bulk of the queries coming in analyze the facts by applying a number of constraints to a single dimension.
For example, the details of sales from a store can be stored in horizontal rows, with each query selecting one or a few of the attributes. Suppose a cloth store stores the details of its sales one below the other, and questions like "how many white shirts of size 85" are sold in one week?" are asked. All that the query has to do is apply the relevant constraints to get the information.
This technique works well in situations where there are a number of entities, all related to the key dimension entity.
One example of a star dimension: a single Product table with the columns Unit, Section, Department, Name, Color, Size and Cost.
The method may not work well where most of the columns are not accessed often - for example, where each query asks only about either the name "shirt" or the colour "blue", but never about things like a blue shirt of size 90" costing less than Rs. 500. When a situation arises where only a few columns are accessed often, it is also possible to store those columns in a separate dimension.
Hierarchies and Networks
There are many instances, where it is not possible to denormalise all data into relations as detailed
above. You may note that star dimension is easy where all the entities are linked to a key by a one-to-one
relation. But, if there is a many to many relation, such single dimensional representation is not possible
and one has to resort either to multidimensional storage (Network) or a top down representation (hierarchy).
Of these, the data that is likely to be accessed more often is denormalised into a star product table, as detailed in the above section. All the rest is stored in a snowflake schema.
However, there is another aspect that should be taken care of. In many cases, the dimensions themselves may not be static. At least some of them, if not all, vary over time. This is particularly true for dimensions that use hierarchies or networks to group basic concepts, because the business will probably change the way in which it categorises the dimension over a period of time.
Consider the following example. In a retailing scenario, the product dimension typically contains a hierarchy used to categorise products into departments, sections, business units, etc. As the business changes over the years, the products are re-categorised.
Take a more specific example. A department store has taken several steps over the years to upgrade and modify the quality of its men's wear. Now it wants to know whether those efforts were successful, how far, and so on. One simple query would be to compare the sales of the present year with the sales of 10 or 15 years ago and hence draw conclusions. But the definition of men's wear itself might have changed. Things like T-shirts and jeans, which are called "uni-wear" today, were called men's wear previously. So several queries, each looking at a separate set of products, have to be raised, and they are to be combined to produce a single answer. To take care of such eventualities, the rows in the dimension table have to come with date ranges - the periods over which the rows are valid. For example, over the date range of, say, 1985-1995, T-shirts were categorised as men's wear. When, after 1995, they came to be categorised as uni-wear, a separate row is inserted, in which T-shirts are categorised as uni-wear (say from 1995 till date). This ensures that the changes made to the dimensions are reflected in the hierarchies and networks.
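A minimal Python sketch of such date-ranged dimension rows: when T-shirts are re-categorised, a new row with its own validity period is inserted instead of overwriting the old one, so a query about 1990 and a query about 2003 each see the categorisation in force at that time. The years follow the example; the row structure is an assumption.

    # Product-category dimension rows with validity ranges (valid_to=None means "till date").
    product_category_dim = [
        {"product": "T-shirt", "category": "mens wear", "valid_from": 1985, "valid_to": 1995},
        {"product": "T-shirt", "category": "uni-wear",  "valid_from": 1995, "valid_to": None},
    ]

    def category_in(product, year):
        """Return the category that was in force for the product in the given year."""
        for row in product_category_dim:
            if (row["product"] == product and row["valid_from"] <= year
                    and (row["valid_to"] is None or year < row["valid_to"])):
                return row["category"]
        return None

    print(category_in("T-shirt", 1990))   # mens wear
    print(category_in("T-shirt", 2003))   # uni-wear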
Sometimes, it may be necessary to answer queries which refer not to exact days or dates, but to definitions like "the first week of each month". One would like to compare the sales of the first week of this month with the sales in the first weeks of the previous months (obviously, sales during the first week will be more brisk than later in the month). However, in such cases it is desirable not to have ranges like "first week", but instead to raise product queries over dates 1 through 7 of each month. The above method of simply inserting new date-ranged rows to take care of modifications also leads to a problem. Over a period of time, the dimension table grows to sizes where not only are large amounts of memory involved and management becomes difficult, but, more importantly, queries start to take much longer to execute. In such cases, one should think of partitioning the table horizontally (say, the data of the previous 10 years into a separate table) and creating a combinatory view. The clear indication of the need is when the full-table scan of the dimension table starts taking an appreciable amount of time.
3.4 DESIGNING THE STAR-FLAKE SCHEMA
A starflake schema, as we have defined previously, is a schema that uses a combination of denormalised star and normalised snowflake schemas. They are most appropriate in decision support data ware houses. Generally, the detailed transactions are stored within a central fact table, which may be partitioned horizontally or vertically. A series of combinatory database views is created to allow the user access tools to treat the fact table partitions as a single, large table.
The key reference data is structured into a set of dimensions. These can be referenced from the fact table. Each dimension is stored in a series of normalised tables (snowflakes), with an additional denormalised star dimension table.
But the problem lies elsewhere. It may not be easy to structure all the entities in the model into specific sets of dimensions. A single entity / relationship can be common across more than one dimension. For example, the price of a product may be different at different locations. Then, instead of locating the product and then the price directly, one has to take the intersection of the product and the locality and then get the corresponding price. It may also be that the same product is priced differently at different times (seasonal and off-season prices, say); then the model may appear as follows.
Pricing model of a store: basket transactions and basket items related to Store, Region, Product, Department, Business Unit, Time and Price.
The basic concept behind designing star flake schemas is that entities are not strictly defined, but a
degree of crossover between dimensions is allowed. This is what one comes across in a real-world environment. But at the same time, one should not provide an all-pervasive intersection of schemas. It is
essential to keep the following in mind.
i) The number of intersecting entities should be relatively small.
ii) The intersecting entities should be clearly defined and understood within the business.
A database using a starflake schema: the sales transactions form the facts, with snowflake and star dimensions such as department, business unit, product, style, size, colour, time, location, region, week and month. It typically stores the details of a retail store's sales, possibly of dress materials.
However, a starflake schema not only takes considerable time to design, but is also likely to keep changing often.
3.5 QUERY REDIRECTION
One of the basic requirements for successful operation of star flake schema (or any schema, for that
matter) is the ability to direct a query to the most appropriate source. Note that once the available data
grows beyond a certain size, partitioning becomes essential. In such a scenario, it is essential that, in order
to optimize the time spent on querying, the queries should be directed to the appropriate partitions that
store the data required by the query.
The basic method is to design the access tool in such a way that it automatically defines the locality to
which the query is to be redirected. We discuss in more detail some of the guidelines on the style of query
formation now.
Fact tables can be combined in several ways using database views. Each of these views can represent a period of time - say one view for every month, or one for every season, etc. Each of these views should be able to get a union of the required facts from the different partitions. The query can be built by simply combining these unions.
Another aspect is the use of synonyms. The same fact may be viewed differently by different users: "sales of year 2000" is the sales data for the sales department, whereas it is viewed as the "off load" by the production department. It can also be used by the auditor under a different name. If possible, the query support system should support these synonyms to ensure proper redirection.
Views should also be able to combine vertically partitioned tables to ensure that all the columns are
made available to the query. However, the trick is to ensure that only a few of the queries would like to
see columns across the vertical partitions because it is definitely a time consuming exercise. The
partitioning (vertical) is to be done in such a way that most of the data is available in a single partition.
The same arguments hold good for queries that need to process several tables simultaneously.
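A toy Python sketch of query redirection along these lines: a synonym table maps the names used by different departments onto one fact set, and the date constraint is used to pick only the monthly partitions (or views) that can hold the answer. The synonyms and partition names are hypothetical.

    # Different user communities use different names for the same facts.
    synonyms = {"sales of year 2000": "sales", "off load": "sales", "sales": "sales"}

    # Monthly partitions of the sales fact table, named by (year, month).
    partitions = {(2000, m): "sales_2000_%02d" % m for m in range(1, 13)}
    partitions.update({(2001, m): "sales_2001_%02d" % m for m in range(1, 13)})

    def redirect(fact_name, year, months):
        """Return the list of partition tables a query should be sent to."""
        fact = synonyms[fact_name.lower()]
        if fact != "sales":
            raise ValueError("unknown fact set")
        return [partitions[(year, m)] for m in months]

    # "Off load for the first quarter of 2000" touches only three partitions.
    print(redirect("off load", 2000, [1, 2, 3]))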
3.6 MULTI DIMENSIONAL SCHEMAS
Before we close, we see the interesting concept of multi dimensions. This is a very convenient
method of analyzing data, when it goes beyond the normal tabular relations.
For example, suppose a store maintains, for each item it sells, a table of the sales over a month in each of its 10 outlets:

Sales of item 1

Outlet no. -->   1   2   3   4
Date 1
Date 2
Date 3
Date 4

This is a 2-dimensional table. On the other hand, if the company wants the data of all items sold by its outlets, it can be obtained simply by superimposing the 2-dimensional tables for each of these items, one
behind the other. Then it becomes a 3 dimensional view.
Then the query, instead of looking for a 2 dimensional rectangle of data, will look for a 3 dimensional
cuboid of data.
There is no reason why the dimensioning should stop at 3 dimensions. In fact almost all queries can be
thought of as approaching a multi-dimensioned unit of data from a multidimensioned volume of the
schema.
A lot of designing effort goes into optimizing such searches.
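A small Python sketch of the multi-dimensional view: the sales figures are keyed by (item, outlet, date), and a query carves out a cuboid by constraining each of the three dimensions. The data is generated artificially just to make the example run.

    from itertools import product

    items = ["item1", "item2", "item3"]
    outlets = [1, 2, 3, 4]
    dates = range(1, 31)

    # A 3-dimensional cube stored as a dictionary keyed by (item, outlet, date).
    cube = {(i, o, d): (items.index(i) + o + d) % 9 + 1
            for i, o, d in product(items, outlets, dates)}

    # Query: total sales of item2 in outlets 1 and 2 during the first week -
    # a 1 x 2 x 7 cuboid out of the 3 x 4 x 30 cube.
    total = sum(qty for (i, o, d), qty in cube.items()
                if i == "item2" and o in (1, 2) and 1 <= d <= 7)
    print(total)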
BLOCK SUMMARY
The chapter began with the definition of a schema - a logical arrangement of facts that helps in the storage and retrieval of data. The concepts of starflake schemas and fact tables were discussed. The basic method of distinguishing between facts and dimensions was discussed. It was also indicated that a fact in one
context can become a dimension in another case and vice versa. We also learnt to determine the key
dimensions that apply to each fact. We also touched upon the method of designing the fact-tables,
dimension tables and the methods of representing hierarchies and networks.
The concepts of star flake schema, query redirection and multi dimensional schemas were also
discussed.
SELF EVALUATION - I
1. What is a schema?
2. Distinguish between facts and dimensions?
3. Can a fact become a dimension and vice versa?
4. What basic concept defines the size of a fact table?
5. What is the importance of period of retention of data?
6. Name the 3 methods of incorporating time into the data?
7. Define a star flake schema?
8. What is query redirection?
ANSWER TO SELF EVALUATION - I
1. A logical arrangement of facts that facilitates ease of storage and retrieval.
2. A fact is a piece of data that do not change with time whereas a dimension is a description which is likely to
change.
3. Yes, if the viewers objective changes.
4. That it should be big enough to store all the facts without compromising on the speed of query processing.
5. It is the period for which data is retained in the warehouse. Later on, it is archived.
6. a) storing the physical date.
b) store an offset from a given date.
c) store a date range.
7. It is a combination of denormalized star and normalized snowflake schemas.
8. Sending the query to the most appropriate part of the ware house.
SELF EVALUATION - II
1. With diagram explain star flake schemas in detail.
2. Explain the designing of fact tables in detail.
3. With diagram explain multidimensional schemas.
Chapter 4
Partitioning Strategy
BLOCK INTRODUCTION
In this chapter, we look into the trade-offs of partitioning. Partitioning is needed in any large data ware
house to ensure that the performance and manageability is improved. It can help the query redirection
to send the queries to the appropriate partition, thereby reducing the overall time taken for query
processing.
Partitions can be horizontal or vertical. In horizontal partitioning, we simply put the first few thousand entries in one partition, the next few thousand in the next, and so on. This can be done by partitioning by time, wherein all data pertaining to the first month / first year is put in the first partition, the next in the second partition and so on. The other alternatives are partitioning into different-sized segments, partitioning on other dimensions, partitioning on the size of the table, and round robin partitioning. Each of them has certain advantages as well as disadvantages.
In vertical partitioning, some columns are stored in one partition and certain other columns of the same
row in a different partition. This can again be achieved either by normalization or row splitting. We will
look into their relative trade offs.
Partitioning can also be by hardware. This is aimed at reducing bottlenecks and maximizing CPU utilization.
We have seen in the previous chapters, in different contexts, the need for partitioning. There are a
number of performance related issues as well as manageability issues that decide the partitioning strategy.
To begin with we assume it has to be resorted to for the simple reason of the bulk of data that is normally
handled by any normal ware house.
4.1 HORIZONTAL PARTITIONING
This essentially means that the table is partitioned after the first few thousand entries, then the next few thousand entries, and so on. This is because, in most cases, not all the information in the fact table is needed all the time. Thus horizontal partitioning helps to reduce the query access time, by directly cutting down the amount of data to be scanned by the queries.
The most common methodology would be to partition based on the time factor: each year or each month, etc. can be a separate partition. There is no reason why the partitions need to be of the same size. However, if there is too much variation in size between the different partitions, it may affect the performance parameters of the warehouse; in that case, one should consider alternative ways of partitioning, and not go by the period itself as the deciding factor.
a) Partition by Time into Equal Segments:
This is the most straightforward method - partitioning by months or years, etc. This will help if the queries often come regarding the fortnightly or monthly performance / sales.
The advantage is that the slots are reusable. Suppose we are sure that we will no more need the data
of 10 years back, then we can simply delete the data of that slot and use it again.
Of course, there is a serious drawback in the scheme if the partitions tend to differ too much in size. The number of visitors visiting a hill station, say, in the summer months will be much larger than in the winter months, and hence the size of each segment should be big enough to take care of the summer rush. This, of course, would mean wastage of space in the winter-month partitions.
Partitioning tables into same-sized segments: the hill resort details are split into equal partitions, one per month (January, February, March, ..., December), for Year 1, Year 2 and Year 3.
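A rough Python sketch of horizontal partitioning by time and the pruning it allows: fact rows are appended to one list per (year, month), so a query about a single month scans only that partition, and an old slot can simply be dropped and reused. The record layout and figures are invented.

    from collections import defaultdict
    from datetime import date

    partitions = defaultdict(list)          # (year, month) -> list of fact rows

    def load(row):
        key = (row["date"].year, row["date"].month)
        partitions[key].append(row)

    # Load a few hill-resort occupancy records (invented figures).
    load({"date": date(2004, 5, 3), "rooms": 120})
    load({"date": date(2004, 5, 9), "rooms": 140})
    load({"date": date(2004, 12, 25), "rooms": 40})

    # "Occupancy in May 2004" scans one partition, not the whole fact table.
    may = partitions[(2004, 5)]
    print(sum(r["rooms"] for r in may), "room-nights, scanning", len(may), "rows")

    # Reusing an old slot: drop a partition that is no longer needed.
    partitions.pop((1994, 5), None)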
b) Partitioning by Time into Different Sized Segments
Here the partitions are again monthly, but of different sizes: since in the summer months of March, April and May more occupancy is reported, those months get much larger partitions, while the lean months get smaller ones.
This is a useful technique to keep the physical tables small and also the operating costs low.
The problem is to find a suitable partitioning strategy. In all cases, the solution may not be so obvious
as in the case of hill station occupancy. It may also happen that the sizes may have to be varied over a
period of time. This would also lead to movement of large portions of data within the warehouse over a
period of time. Hence, careful consideration about the likely increase in the overall costs due to these
factors should be given before adopting this method.
c) Partitioning on Other Dimensions
Data collection and storage need not always be partitioned based on time, though that is a very safe and relatively straightforward method. It can be partitioned based on the different regions of operation, the different items under consideration, or any other such dimension. It is beneficial to look into the possible types of queries one may encounter before deciding on the dimension. Suppose most of the queries are likely to be based on region-wise performance, region-wise sales, etc.; then having the region as the dimension of partition is worthwhile. On the other hand, if most often we are concerned with the total performance of all regions, the total sales of a month or the total sales of a product, then region-wise partitioning could be a disadvantage, since each such query will have to move across several partitions.
There is a more important problem: suppose the basis of the partition - the dimension itself - is going to change in the future. Suppose we have partitioned based on regions, but at a future date the definition of
region itself changes, with two or more regions being redefined. Then we end up rebuilding the entire fact table, moving the data in the process. This should be avoided at all costs.
d) Partition by the Size of the Table
In certain cases, we will not be sure of any dimension on which partitions can be made. Neither the
time nor the products or regions etc. serve as a good guide, nor are we sure of the type of queries that we
are likely to frequently encounter. In such cases, it is ideal to partition by size. Keep loading the data until
a prespecified amount of memory is consumed, then create a new partition. However, this creates a very complex situation, similar to simply dumping objects in a room without any labelling: we will not be able to know what data is in which partition. Normally metadata (data about data) is needed to keep track
of the identifications of data stored in each of the partitions.
e) Using Round Robin Partitions:
Once the warehouse is holding its full amount of data, if a new partition is required, it can be created only by reusing the oldest partition. Meta data is then needed to note the beginning and ending of the historical data.
This method, though simple, may land us in trouble if the sizes of the partitions are not the same. Special techniques to hold the overflowing data may become necessary.
4.2 VERTICAL PARTITIONING
As the name suggests, a vertical partitioning scheme divides the table vertically i.e. each row is
divided into 2 or more partitions.
Consider the following table:

Student name   Age   Address   Class   Fees paid   Marks scored in different subjects

Now we may need to split this table for any one of the following reasons:
i. We may not need to access all the data pertaining to a student all the time. For example, we may need either only the personal details like age, address etc., or only the examination details of
marks scored etc. Then we may choose to split them into separate tables, each containing data
only about the relevant fields. This will speed up accessing.
ii. The no. of fields in a row become inconveniently large, each field itself being made up of several
subfields etc. In such a scenario, it is always desirable to split it into two or more smaller tables.
The vertical partitioning itself can be achieved in two different ways: (i) normalization and (ii) row
splitting.
4.2.1 Normalisation
The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated, it does not lead to anomalies (the student is advised to refer to any book on database management systems for details, if interested). The idea is to ensure that, when combined, the data available is consistent.
However, in data warehousing, one may even tend to break a large table into several denormalized smaller tables. This may lead to lots of extra space being used, but it helps in an indirect way: it avoids the overheads of joining the data during queries.
To make things clear, consider the following example. The original table is as follows:

Student name   Age   Address   Class   Fees paid   Marks scored in different subjects

We may split it into two tables, vertically:

Student name   Age   Address   Class   Fees paid

Student name   Class   Marks scored in different subjects

Note that the fields of student name and class are repeated, but that helps in reducing repeated join operations, since the normally used fields of student name and marks scored are available in both the tables.
With only two tables, it may appear to be trivial case of savings, but when several large tables are to be
repeatedly joined, it can lead to large savings in computation times.
4.2.2 Row Splitting
The second method of vertical splitting is row splitting. The method involves identifying the not-so-frequently used fields and putting them into another table. This ensures that the frequently used fields can be accessed more often, at a much lower computation time.
It can be noted that row splitting does not reduce or increase the overall storage needed, but normalization may involve a change in the overall storage space needed. In row splitting, the mapping between the resulting tables is one-to-one, whereas normalization may produce one-to-many relationships.
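A small Python sketch contrasting the two approaches on the student table used above: row splitting keeps a strict 1:1 correspondence between the two halves of each row, while normalisation would instead produce one marks row per (student, subject). The sample data is invented.

    students = [
        {"name": "Asha", "age": 20, "address": "Mysore", "class": "B.Sc. V",
         "fees_paid": 12000, "marks": {"DBMS": 78, "DWDM": 81}},
    ]

    # Row splitting: two tables with a 1:1 mapping, frequently used fields kept together.
    personal = [{"name": s["name"], "age": s["age"], "address": s["address"],
                 "class": s["class"], "fees_paid": s["fees_paid"]} for s in students]
    academic = [{"name": s["name"], "class": s["class"], "marks": s["marks"]}
                for s in students]

    # Normalisation would instead produce one marks row per (student, subject).
    marks_rows = [{"name": s["name"], "subject": sub, "score": sc}
                  for s in students for sub, sc in s["marks"].items()]

    print(personal[0]["name"], academic[0]["marks"], len(marks_rows))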
4.3 HARDWARE PARTITIONING
Needless to say, the data warehouse design process should try to maximize the performance of the system. One of the ways to ensure this is to optimize the database design with respect to the specific hardware architecture. Obviously, the exact details of the optimization depend on the hardware platform. Normally the following guidelines are useful:
i. Maximize the utilization of processing, disk and I/O operations.
ii. Reduce bottlenecks at the CPU and I/O.
The following mechanisms become handy.
4.3.1 Maximising the Processing and Avoiding Bottlenecks
One of the ways of ensuring faster processing is to split the data query into several parallel subqueries, convert them into parallel threads and run them in parallel. This method will work only when there is a sufficient number of processors, or sufficient processing power, to ensure that they can actually run in parallel. (Again, note that to run five threads it is not always necessary to have five processors; but to ensure optimality, even a smaller number of processors should be able to do the job, provided they are fast enough to avoid bottlenecks at the processor.)
Shared architectures are ideal for such situations, because one can be almost sure that sufficient processing power is available most of the time. In a typical shared architecture, several processors are connected over a network to a set of shared disks; incoming queries are split into subqueries which run on different processors and access the disks they need.
Of course, in such a networked environment, where each of the processors is able to access data on several active disks, several problems of data contention and data integrity need to be resolved. Those aspects will not be discussed at this stage.
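A minimal Python sketch of splitting a query into subqueries and running them as parallel threads, as described in this section. Each subquery here just sums one slice of an in-memory list; in a real ware house each slice would live on its own disk or node, and real speed-ups would come from separate processors rather than threads sharing one interpreter. All names are illustrative.

    from concurrent.futures import ThreadPoolExecutor

    fact_rows = list(range(1_000_000))          # stand-in for a large fact table

    def subquery(chunk):
        # Each subquery scans only its own slice of the data.
        return sum(chunk)

    def parallel_query(rows, n_threads=4):
        size = (len(rows) + n_threads - 1) // n_threads
        chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            partials = pool.map(subquery, chunks)
        return sum(partials)                    # combine the partial results

    print(parallel_query(fact_rows))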
4.3.2 Striping Data Across the Nodes
This mechanism distributes the data by dividing a large table into several smaller units and storing them on each of the disks (the architecture is the same as above). These sub-tables need not be of equal size, but are distributed so as to ensure optimum query performance. The trick is to ensure that the queries are directed to the respective processors, which access the corresponding data disks to service the queries.
It may be noted that in such a scenario there is an overhead of about 5-10% to divide the queries into subqueries, transport them over the network, etc.
Also, the method is unsuitable for smaller data volumes, since, in such a situation, the overheads tend
to dominate and bring down the overall performance.
Also, if the distribution of data across the disks is not proper, it may lead to an inefficient system and
we may have to redistribute the data in such a scenario.
4.3.3 Horizontal Hardware Partitioning
This technique spreads the processing load by horizontally partitioning the fact table into smaller segments
and physically storing each segment on a different node. When a query needs to access several partitions, the accessing is done in a way similar to the above methods.
If the query is parallelized, then each subquery can run on a different node, as long as the total number of subprocesses does not exceed the number of available nodes.
This technique will minimize the traffic on the network. However, if most of the queries pertain to a single data unit or processor, we may land in another type of problem: since the data unit or the processor in question has limited capabilities, it becomes a bottleneck. This may affect the overall
performance of the system. Hence, it is essential to identify such units and try to distribute the data into
several units so as to redistribute the load.
Before we conclude, we point out that several parameters - like the size of the partitions, the key on which the partition is made, the number of parallel devices, etc. - affect the performance. Naturally, a larger number of such parallel units improves the performance, but at a much higher cost. Hence, it is essential to work out the minimum size of the partitions that brings out the best performance from the system.
BLOCK SUMMARY
In this chapter, we started familiarizing ourselves with the need for partitioning. It greatly helps in
ensuring better performance and manageability.
We looked at the concepts of horizontal and vertical partitioning. Horizontal partitioning can be done
based on time or the size of the block or both. One can think of storing the data, for example, of each
month in one block. This simple method, however, may become wasteful if the amount of data in each
month is not the same. The solution is to have different-sized partitions, if we know beforehand the amount of
data that goes into each of the partitions. Partitioning need not always be on time. It can be on other
dimensions as well. It can also be of a round-robin sort. Each of these methods has its own merits and
demerits.
Vertical partitioning can be done by either normalization or row splitting. We took examples to
understand the concepts involved. We also discussed hardware partitioning and the issues involved.
SELF EVALUATION - I
1. What is horizontal partitioning?
2. What is vertical partitioning?
3. Name one advantage and one disadvantage of equal segment partitioning?
4. What is the concept of partitioning on dimensions?
5. Name the disadvantage of partitioning by size?
6. Name the two methods of vertical splitting?
7. What is the need for hardware partitioning?
8. What is parallelizing a query?
ANSWER TO SELF EVALUATION - I
1. The first few entries are in the first block, the second few in the second block, etc.
2. A few columns are in one block, some other columns in another block, though they belong to the same row.
3. a) Slots are reusable.
b) If the amount of data is varying, it is wasteful.
4. Partitioning can be on any dimension like region, unit size, article, etc.
5. Searching for a given data becomes very cumbersome.
6. Normalization and row splitting.
7. It helps to
a) optimize processing operations
b) reduce bottlenecks at CPU and I/O.
8. Dividing a query into sub queries and running them in parallel using threads.
SELF EVALUATION - II
1. With diagrams explain the types of partitioning in detail.
2. With examples, explain the concept of normalization in detail.
Chapter 5
Aggregations
BLOCK INTRODUCTION
In this chapter, we look at the need for aggregation. Aggregation is performed to speed up common queries, and
its cost should be more than offset by the resulting speed-up. We first satisfy ourselves that whenever
we expect similar types of queries to arrive repeatedly at the warehouse, some homework can be
done beforehand, instead of processing each query on the fly (as it comes). This means we partially
process the data and create summary tables, which become usable for the commonly encountered
queries. Of course, an uncommon query still has to be processed on the fly.
Of course, the design of the summary table is a very important factor that determines the efficiency of
operation. We see several guidelines to assist us in the process of developing useful summary tables.
We also look at certain thumb rules that guide us in the process of aggregation. Aggregation is performed
to speed up the normal queries and, obviously, the cost of creating and managing the aggregations should
be less than the benefit of speeding up the queries. Otherwise, the whole exercise turns out to be
futile.
5.1 THE NEED FOR AGGREGATION
Data aggregation is an essential component of any decision-support data warehouse. It helps us to
ensure cost-effective query performance, which in other words means that the costs incurred to get the
answer to a query are more than offset by the benefits of that answer. Data aggregation
attempts to do this by reducing the processing power needed to process the queries. However, too much
aggregation would only lead to unacceptable levels of operational costs.
Too little aggregation may not improve the performance to the required levels. A fine balancing of
the two is essential to meet the requirements stated above. One thumb rule often suggested is
that about three out of every four queries should be optimized by the aggregation process, whereas the
fourth will take its own time to get processed.
The second, though minor, advantage of aggregations is that they allow us to see the overall trends in
the data. Such trends may not be obvious while looking at individual data items, whereas aggregated
data helps us draw certain conclusions easily.
5.2 DEFINITION OF AGGREGATION
Most of the common queries will analyze
i) either a subset of the available data, or
ii) a combination (aggregation) of the available data.
Most of the queries can be answered only by analyzing the data in several dimensions. Thus, simple
questions like "how many mobiles were sold last month?" are not often asked. Rather, questions like "how
many more mobiles can be sold in the next six months?" are asked. To answer them, one will have to
analyze the available database shown in Fig. 4.1 on parameters like:
i) The income of the population
ii) Their occupation and hence the need for communication
iii) Their age groups
iv) Social trends, etc.
In a simple scenario, to get the above conclusions, one should be able to get the data about the
population of the area, which may be in the following format and hence draw the necessary conclusions
by traversing along the various dimensions, selecting the relevant data and finally aggregating suitably.
One simple way to identify the number of would-be mobile users is to identify that section of the
population with income above a threshold, with a particular family size, whose professions need
frequent travelling and who are preferably of the younger age group; these could be the potential users.
Note that a simple query for each of these, run on the database, produces rather complicated sets of
data. The final answer is obtained by combining these sets of data in a suitable format.
[Figure: A typical database to process mobile sales - population data organized along dimensions such as age group, income, profession, family size, location and religion.]
But a detailed look into the setup tells us that by properly arranging the queries and processing them in
optimal formats one could greatly reduce the computations needed. For example, one need not search every
time for all citizens above 18 years of age, all families with incomes greater than 15,000 per month, etc. One
can simply prepare these tables beforehand and they can be used as and when required. As and when the
data changes, the summaries (like the number of citizens above 18 years of age, families with incomes
greater than 15,000, etc.) need to be changed.
The advantage is that the bulk of the sub-queries are carried out before the actual execution of the
query itself. This reduces the time delay between the raising of the query and the results being made
available (though the total computation time may not be less than if the entire query were to be answered
on the fly).
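A small sketch of this idea is given below (the population records and threshold values are invented for illustration): the counts are prepared once, for example in the background whenever the base data is refreshed, and the query merely reads the pre-computed values.

    # A minimal sketch of pre-computing summaries for the mobile-sales example.
    # The records and thresholds are illustrative, not taken from any real database.
    population = [
        {"age": 25, "income": 22000, "travelling_job": True},
        {"age": 40, "income": 12000, "travelling_job": False},
        {"age": 19, "income": 18000, "travelling_job": True},
    ]

    # Prepared beforehand, whenever the base data is refreshed:
    summary = {
        "adults":              sum(1 for p in population if p["age"] >= 18),
        "high_income":         sum(1 for p in population if p["income"] > 15000),
        "frequent_travellers": sum(1 for p in population if p["travelling_job"]),
        "likely_buyers":       sum(1 for p in population
                                   if p["age"] >= 18 and p["income"] > 15000
                                   and p["travelling_job"]),
    }

    # At query time only the pre-computed values are read or combined:
    print(summary["likely_buyers"])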
The drawback is that, in many cases, the summaries (or sub-queries, if you want to call them that) are
closely coupled to the type of queries being raised and will have to be changed if the queries change.
The other advantage is that these pre-aggregation summaries also allow us to look at specific trends
more easily. The summaries highlight the trends and provide an overall view of the picture at large, rather
than isolated views.
5.3 ASPECTS TO BE LOOKED INTO WHILE DESIGNING THE
SUMMARY TABLES
The main purpose of using summary tables is to cut down the time taken to execute a specific query.
The main methodology involves minimizing the volume of data being scanned each time the query is to be
answered. In other words, partial answers to the query are already made available. For example, in the
above-cited example of the mobile market, if one expects that
i) the citizens above 18 years of age,
ii) with salaries greater than 15,000, and
iii) with professions that involve travelling
are the potential customers, then every time the query is to be processed (maybe every month or every
quarter), one will have to look at the entire database to compute these values and then combine them
suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values
pertaining to each of these sub-queries, beforehand, and then combine them as and when the query is raised.
It can be noted that the summaries can be prepared in the background (or when the number of queries
running is relatively small) and only the final aggregation needs to be done on the fly.
Summary tables are designed by following the steps given below:
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate multiple facts into the summary table.
iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.
i. Determine the Aggregation Dimensions
Summary tables should be created so as to make full use of the existing schema structures. In other
words, the summary tables should continue to retain all the dimensions that are not being aggregated.
This technique is sometimes referred to as "subsuming the dimension".
The concept is to ensure that all those dimensions that do not get modified due to aggregation continue
to be available in the summary table. No doubt, this would ensure that the flexibility of the summary table
is maintained for as long as possible.
In some cases, the summarizing may also be done partially. For example, we may want to know the
number of people residing in each of the localities, but we may not be interested in all the localities,
only in a few privileged ones. In such a case, summarizing is done in respect of only those
localities. If any query pertaining to the other localities arises, one will have to go back to the primary data
and get the details on the fly.
ii. Determine the Aggregation of Multiple Values
The objective is to include into the summary table any aggregated value that can speed up the query
processing. If the query uses more than one aggregated value on the same dimension, combine these
common values into a set of columns on the same table. If this looks complicated, look at the following
example.
Suppose a summary table of sales is being set up. The data available, of course, is the daily sales.
These data can be used to create a summary table of weekly sales. Suppose the query may also need
details about the highest daily sales and the lowest daily sales.
A query like "indicate the weeks where the weekly sales were good, but one or more days registered very
low sales" would be an example.
Note that the weekly sales as well as the high/low sales details can be summarized on the same dimension
of sales.
In such a situation, the summary table can contain columns that refer to i) weekly sales, ii) the highest
daily sale of the week and iii) the lowest daily sale of the week. This would ensure that we do not go about
computing the same set of data repeatedly.
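The following minimal sketch (with invented daily figures) shows such a summary table: one row per week carrying the weekly total together with the highest and lowest daily sale, so that the example query above needs no rescanning of the daily data.

    # A minimal sketch of aggregating multiple values on the same dimension:
    # one summary row per week with the weekly total and the highest and
    # lowest daily sale.  The daily figures are invented for illustration.
    from collections import defaultdict

    daily_sales = [("2005-W01", 120), ("2005-W01", 40), ("2005-W01", 300),
                   ("2005-W02", 90),  ("2005-W02", 260)]

    weekly = defaultdict(list)
    for week, amount in daily_sales:
        weekly[week].append(amount)

    summary_table = {week: {"total": sum(v),
                            "highest_day": max(v),
                            "lowest_day": min(v)}
                     for week, v in weekly.items()}

    # "Good weekly sales but at least one very poor day":
    print([w for w, row in summary_table.items()
           if row["total"] > 400 and row["lowest_day"] < 50])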
However, too many such aggregated values in the same summary table are not desirable. Every such
new column would bring down the performance of the system, both while creating the summary table and
while operating on it. Though no precise limit on the number of such aggregated columns is available, it is
essential to weigh the amount of time needed to handle/create such new columns against the
associated improvement in performance.
iii. Aggregate the Multiple Facts into the Summary Table
This again is a very tricky issue. One should think of the possibility of amalgamating a number of
related facts into the same summary table, if such an amalgamation is desirable. The starting point is to
look at the query that is likely to come up quite often. If the query is to look at the same set of facts
repeatedly, then it is desirable to place these facts into a single summary table. Consider a case where
queries are expected to repeatedly ask for the amount of sales, their cost, the profits made and the
variation w.r.t. the previous week's sales. The most ideal method is to combine them into a single
summary table, so that the actual aggregation effort at the time of the query processing is reduced.
As discussed in the previous section, here also, one should ensure that too many facts are not combined
together, since such a move can reduce the overall performance instead of improving it.
iv. Determine the Level of Aggregation and the Extent of Embedding
Aggregating a dimension at a specific level implies that no further detail is available in that summary.
It is also essential that a huge number of summary tables is not created, as they tend to be rather
counterproductive. Since the summary tables are produced to ensure that repeated computations are reduced,
and since the creation of summary tables itself involves a certain amount of computation, not more than
250-300 such tables are recommended.
A few thumb rules could be of use.
i) As far as possible, aggregate at a level below the level required by the frequently encountered
queries. This would ensure that some flexibility is available for aggregation, while the amount of
aggregation to be done on the fly remains a minimum. However, the ratio of the number of rows in the
base data to the aggregated rows should be optimal, i.e. the number of computed (aggregated) rows
should not be high w.r.t. the independent data.
ii) Whenever the above condition cannot be satisfied, i.e. a table ends up with too many aggregated
rows, try to break the summary table into two tables.
It may be noted that the summary tables need to be recreated every time the basic data changes (since,
in such a situation, the summary also changes). This recreation of summary tables consumes
quite an amount of time.
Normally, non-intelligent keys are used to avoid the need to restructure the fact data if the key is
changed in future; put the other way, whenever the key is changed, the organization of the data table needs
to be changed if that organization depends on intelligent keys. But a summary table is, in any
case, rebuilt whenever one or more facts change, and hence nothing is gained by using
non-intelligent keys there. Hence, one can as well use intelligent keys in summary tables.
v. Design Time into the Summary Table
Remember, in the case of fact tables, it was suggested that time can be stored in them to speed up
the operation. Calculations of weekly, monthly, etc. details can straightaway take place if time is
incorporated in the summary table. Similarly, one can make use of the concept of time to speed up the
operations; the options are listed below, and a small sketch follows the list.
i) A physical date can be stored: This is the simplest and possibly a very convenient way of storing
time. The physical date is normally stored within the summary table and, preferably, intelligent
keys are used.
Whenever the dimension of time is aggregated (say daily, weekly, etc.), the actual time of day
(say 12 P.M., etc.) is lost. Suppose you store each sale with the actual time of the
sale; if the aggregation is done over, say, the total sales per week, then the value of the actual
time becomes useless. In such cases, either special care is to be taken to preserve the dates or
the dates need not be stored in the first place.
ii) Store an offset from a start date or start time: Again, if and when actual dates/times are
needed, they may be computed starting from the offset. But this may involve a considerable amount
of computation when a large number of such dates/times are to be computed.
Again, as in the previous case, the actual times of day are lost in the summary tables.
iii) Store a date range: Instead of storing the actual dates/times, one can store the range of dates
within which the values are applicable. Again, converting them to actual times is time consuming.
The quality of the access tools governs the use of date ranges within the summary; if the tools are
not very efficient, the use of date ranges within the summary tables can become counterproductive.
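Here is the promised sketch (dates and sales figures are invented) showing the three options side by side within one summary row.

    # A minimal sketch of the three ways of keeping time in a summary row.
    # The dates and sales figures are invented for illustration.
    from datetime import date, timedelta

    WAREHOUSE_START = date(2005, 1, 1)       # reference point for offsets

    summary_row = {
        "weekly_sales": 810,
        # i)  a physical date stored directly:
        "week_ending": date(2005, 1, 7),
        # ii) an offset (in days) from a start date:
        "week_ending_offset": (date(2005, 1, 7) - WAREHOUSE_START).days,
        # iii) a date range within which the values apply:
        "date_range": (date(2005, 1, 1), date(2005, 1, 7)),
    }

    # Converting the offset back to an actual date costs a little computation:
    print(WAREHOUSE_START + timedelta(days=summary_row["week_ending_offset"]))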
vi. Index the Summary Tables
One should always consider using a high level of indexation on summary tables. As the aim is to direct
as many queries as possible to summary tables, it is worth investing some effort in indexing. Since the
summary tables are normally of reasonable size, indexing is most often worthwhile. However, if most of
the queries scan all the rows of the tables, then such indexes may end up being only an overhead.
vii. Identifying the Summary Tables that need to be created
This is a very tricky, but a very important issue in summarizing. To some extent it depends on the
designer and the extent to which the summaries can be seamlessly created. But a few techniques can be
helpful.
One should examine the levels of aggregation within each key dimension and determine the most likely
combinations of interest. Then consider these combinations based on the likely queries one can expect in
the given environment. Each of these overlapping combinations becomes the content of a summary table.
This can be repeated, till a sufficient number of summary tables are created. However, most often, as
the system gets used and the normally encountered query profiles become clear, some of these summary
tables are dropped and new ones are created.
The size of the summary table also is an important factor. Since the philosophy behind a summary
table is to ensure that the amount of data to be scanned is to be kept relatively small, having a very large
summary table defeats this very purpose. Normally, summary tables with a higher degree of aggregation
tend to be smaller and vice versa. The usefulness of a summary table is limited by the average amount of
data to be scanned by the queries.
BLOCK SUMMARY
Aggregation is the concept of combining raw data to create summary tables which become useful
when processing the normally encountered queries. There is an overhead involved in creating these
summary tables. For the concept of aggregation to be beneficial, the cost saved due to the speeding up
of the queries should more than offset the cost of creating and maintaining these summary tables.
We discussed the basic process of creating summary tables as
i) Determining the dimensions to be aggregated.
ii) Determining the aggregation of multiple values.
iii) Determining the aggregation of multiple facts.
iv) Determining the level of aggregation and embedding.
v) Incorporating time into the summary table and
vi) Indexing summary tables.
SELF EVALUATION - I
1. What is the need for aggregation?
2. What is a summary table?
3. What is the trade off involved in aggregation?
4. What is subsuming a dimension?
5. What is the golden rule that determines the level of aggregation?
6. Is using intelligent keys in a summary table desirable?
7. Which method of storing time is more appropriate in aggregation?
8. What is the role of indexing?
ANSWER TO SELF EVALUATION - I
1. It helps in speeding up the processing of normal queries.
2. Partially aggregated table which helps in reducing the time of scanning of normal queries.
3. The cost of aggregation should be less than the cost saved due to reduced scanning of normal queries.
4. After aggregation, the summary table will retain all the dimensions that have not been aggregated.
5. Aggregate one level below the level required for known common queries.
6. Yes
7. Store physical dates directly into the summary table.
8. It helps in choosing the appropriate summary table.
SELF EVALUATION - II
1. With suitable example, explain the concept of aggregation.
2. Explain the design steps for summary tables in detail.
Chapter 6
Data Mart
BLOCK INTRODUCTION
In this chapter, a brief introduction to the concept of data marts is provided. The data mart stores a
subset of the data available in the warehouse, so that one need not always have to scan through the
entire content of the warehouse. It is similar to a retail outlet. A data mart speeds up the queries,
since the volume of data to be scanned is much less. It also helps to have tailor-made processes for
different access tools, to impose control strategies, etc.
The basic problem is to decide when to have a data mart and when to go back to the warehouse. The
thumb rule is to make use of the natural splits in the organization or data, or to have one data mart for each of the
different access tools, etc. Each of the data marts is to be provided with its own subset of detailed information
and also its own summary information; obviously, this is a costly affair. The cost of maintaining the
additional hardware and software is to be offset by the faster query processing of the data mart. We
also look at the concept of copy management tools.
6.1 THE NEED FOR DATA MARTS
In a crude sense, if you consider a data warehouse as a store house of data, a data mart is a retail
outlet of data. Searching for any data in a huge store house is difficult, but if the data is available, you
will positively be able to get it. On the other hand, in a retail outlet, since the volume to be searched
is small, you are able to access the data fast. But it is possible that the data you are searching for
may not be available there, in which case you have to go back to your main store house to search for the
data.
Coming back to technical terminology, one can say the following are the reasons for which data marts
are created:
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for a user access tool.
iii) Data can be segmented or partitioned so that it can be used on different platforms and so that
different control strategies become applicable.
There are certain disadvantages also:
i. The cost of setting up and operating data marts is quite high.
ii. Once a data strategy is put in place, the data mart formats become fixed. It may be fairly
difficult to change the strategy later, because the data mart formats also have to be changed.
Hence, there are two stages in setting up data marts.
i. To decide whether data marts are needed at all. The above listed facts may help you to decide
whether it is worthwhile to set up data marts or to operate from the warehouse itself. The problem
is almost similar to that of a merchant deciding whether he wants to set up retail shops or not.
ii. If you decide that setting up data marts is desirable, then the following steps have to be gone
through before you can freeze on the actual strategy of data marting.
a) Identify the natural functional splits of the organization.
b) Identify the natural splits of data.
c) Check whether the proposed access tools have any special data base structures.
d) Identify the infrastructure issues, if any, that can help in identifying the data marts.
e) Look for restrictions on access control. They can serve to demarcate the warehouse
details.
Now, we look into each one of the above in some detail.
A thorough look at the business organization helps us to know whether there is an underlying structure
that can help us decide on data marting. The business can be split based on the regional organization,
product organization, the type of data that becomes available, etc. For example, when the organization is set
up in several regions and the data warehouse gets details from each of these regions, one simple way of
splitting is to set up a data mart for each of these regions. Probably the details or forecasts of one region
are available on each of these data marts.
Similarly, if the organization is split into several departments, each of these departments can become
the subject of one data mart.
If such physical splits are not obvious, one can even think of the way data needs to be presented: one
data mart for daily reports, one for monthly reports, etc.
Once you have drawn up a basis for splitting, you should try to justify it based on the hardware costs,
business benefits and feasibility studies. For example, while it may appear most natural to split based on
the regional organization, setting up, say, 100 data marts for 100 regions and interconnecting them may not
be a very feasible proposition.
Also, there is a load window problem. The data warehouse can be thought of as a huge volume and
each data mart provides a window to it. Obviously, each window provides only a partial view of the
actual data. The greater the number of data marts, the more such windows there will be, and the greater the
problems of maintaining the overlaps, managing data consistency, etc. In a large data warehouse, these
problems are definitely not trivial and, unless managed in a professional manner, can lead to data
inconsistencies. The problem with these inconsistencies is that they are hard to trace and debug.
Then the other problem always remains: if the split of the organization changes for some reason, the
whole structure of data marting needs to be redefined.
[Figure: Organization of data marts - input data from Department (A) and Department (B) flows into the warehouse, which holds detailed information, meta data and summary information, and populates Data Mart 1 and Data Mart 2, each accessed through front-end tools.]
6.2 IDENTIFY THE SPLITS IN DATA
The issues involved here are similar to those in the splits of organizations. The type of data coming in,
or the way it is stored, helps us to identify the splits. For example, one may be storing the consumer items
data differently from the capital assets data, or the data may be collected and stored dealer-wise. In such
cases, one can set up a mart for each of the identifiable portions. The trade-offs involved are exactly
identical to what was discussed in the previous section.
6.3 IDENTIFY THE ACCESS TOOL REQUIREMENTS
Data marts are required to support internal data structures that suit the user access tools. Data
within those structures is not actually controlled by the warehouse, but the data is to be rearranged and
updated by the warehouse. This arrangement (called populating of data) is made to suit the existing
requirements of data analysis. While the requirements are few and less complicated, any populating
method may be suitable, but as the demands increase (as happens over a period of time) the populating
methods should match the tools used.
As a rule, this rearrangement (or populating) is to be done by the warehouse after acquiring the data
from the source. In other words, the data received from the source should not directly be arranged in the
form of structures needed by the access tools. This is because each piece of data is likely to be used
by several access tools, which need different populating methods. Also, additional requirements may
come up later. Hence each data mart is to be populated from the warehouse, based on the access tool
requirements of the data mart. This will ensure data consistency across the different marts.
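A minimal sketch of this populating step is given below (the region names and fields are invented): the warehouse extracts the relevant subset and builds the mart's own detailed and summary information.

    # A minimal sketch of populating a regional data mart from the warehouse:
    # the warehouse extracts and restructures the rows; the mart never loads
    # directly from the source.  Regions and fields are invented examples.
    warehouse_fact = [
        {"region": "north", "month": "2005-01", "sales": 1200},
        {"region": "south", "month": "2005-01", "sales": 800},
        {"region": "north", "month": "2005-02", "sales": 1500},
    ]

    def populate_mart(region):
        detailed = [row for row in warehouse_fact if row["region"] == region]
        summary = {}                      # the mart keeps its own summary too
        for row in detailed:
            summary[row["month"]] = summary.get(row["month"], 0) + row["sales"]
        return {"detailed": detailed, "summary": summary}

    north_mart = populate_mart("north")
    print(north_mart["summary"])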
6.4 ROLE OF ACCESS CONTROL ISSUES IN DATA MART
DESIGN
This is one of the major constraints in data mart designs. Any data warehouse, with its huge volume
of data is, more often than not, subject to various access controls as to who could access which part of
data. The easiest case is where the data is partitioned so clearly that a user of each partition cannot
access any other data. In such cases, each of these can be put in a data mart and the user of each can
access only his data.
In the data ware house, the data pertaining to all these marts are stored, but the partitioning are
retained. If a super user wants to get an overall view of the data, suitable aggregations can be generated.
However, in certain other cases the demarcation may not be so clear. In such cases, a judicious
analysis of the privacy constraints is needed, so that the privacy of each data mart is maintained.
Design of Data Mart
Design based on function
Data marts, as described in the previous sections, can be designed based on several splits noticeable
either in the data, in the organization or in privacy laws. They may also be designed to suit the user access
tools. In the latter case, there is not much choice available for the design parameters. In the other cases, it
is always desirable to design the data mart to suit the design of the warehouse itself. This helps to
maintain maximum control on the database instances, by ensuring that the same design is replicated in
each of the data marts. Similarly, the summary information on each of the data marts can be a smaller
replica of the summary of the data warehouse itself.
[Figure: a tiered data mart arrangement, showing the detailed and summary information of the warehouse, the detail and summary data of each mart, and the consoles of Mart 1 and Mart 2.]
It is a good practice to ensure that each summary is designed to utilize all the dimension data in the
starflake schema. In a simple scheme, the summary tables from the data warehouse may be directly copied
to the data mart (or the relevant portions), but the data mart is so structured that it operates only on those
dimensions that are relevant to the mart.
The second case is when we populate a database design specific to a user access tool. In such a
situation, we may probably have to transform the data into the required structure. In some cases, this
could simply be a transformation into different database tables, but in other cases new data structures
that suit each of the access tools need to be created. Such a transformation may need several degrees of
data aggregation using stored procedures.
Before we close this discussion, one warning note needs to be emphasized. You may have noticed that
data marting indirectly leads to aggregation, but it should not be used as an alternative to aggregation,
since the costs are higher and the data marts still will not be able to provide the overview capability of
aggregations.
BLOCK SUMMARY
The chapter introduced us to the concept of the data mart, which can be compared to a retail outlet. It
speeds up the queries, but can store only a subset of the data, and one will have to go back to the warehouse
for any additional data. It also helps to form data structures suitable for the user access tools and
to impose access control strategies, etc.
Normally it is possible to identify splits in the functioning of the organization or in the data collected. The
ideal method is to use these splits to divide the data between different data marts. Access tool requirements
and control strategies can also dictate the setting up of data marts. Data marts need to have their own detailed
information and summary information stored in them.
SELF EVALUATION - I
1. Define data marting
2. Name any 4 reasons for data marting
3. Name any 4 methods of splitting the data between data marts.
4. Which is the best schema for data marts?
5. Is data cleaning an important issue in data marts?
ANSWER TO SELF EVALUATION - I
1. Creating a subset of data for easy accessing
2. a) Speed up queries by reducing the data to be scanned.
b) Suit specific access tools.
c) Improve control strategies.
d) Segment data onto different platforms.
3. a) Use natural splits in the organization.
b) Natural splits in the data.
c) To suit access tools.
d) To suit access control issues.
4. Starflake schema.
5. No, it is taken care of by the main warehouse.
SELF EVALUATION - II
1. With a neat diagram, explain the organization of a data mart in detail.
Chapter 7
Meta Data
BLOCK INTRODUCTION
Meta data is data about data. Since the data in a data warehouse is both voluminous and dynamic,
it needs constant monitoring. This can be done only if a separate set of data about
the data is stored. This is the purpose of meta data.
Meta data is useful for data transformation and loading, data management and query generation.
This chapter introduces a few of the commonly used meta data functions for each of them.
Meta data, by definition, is data about data, or data that describes the data. In simple terms, the
data warehouse contains data that describes different situations. But there should also be some data that
gives details about the data stored in the data warehouse. This data is metadata. Metadata, apart from
other things, will be used for the following purposes.
1. Data transformation and Loading
2. Data Management
3. Query Generation
7.1 DATA TRANSFORMATION AND LOADING
This type of metadata is used during data transformation. In a simple data warehouse, this type of
metadata may not be very important and may not even be present. But as more and more sources start
feeding the warehouse, the necessity for metadata is felt. It is also useful in matching the formats of the
data source and the data warehouse; the greater the mismatch between the two, the greater the need for this
type of metadata. Also, when the data coming from the source changes, instead of changing
the data warehouse design itself, the metadata can capture these changes and automatically generate the
transformation programs.
For each source data field, the following information is required.
Source field
Unique identifier
Name
Type
Location
System
Object
The fields are self-evident. The type field indicates details like the storage type of the data.
The destination field needs the following meta data.
Destination
Name
Type
Table name
The other information to be stored is the transformations that need to be applied to convert the source
data into the destination data.
This needs the following fields.
Transformation(s)
Name
Language
Module name
Syntax
The attribute language is the name of the language in which the transformation program is written.
The transformation can be a simple conversion of type (from integer to real; char to integer etc) or
may involve fairly complex procedures.
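As a rough illustration (the field names, sample value and the simple record layout are assumptions, not a prescribed format), the metadata described above could be held as plain records and used to drive the transformation:

    # A minimal sketch of transformation/loading metadata held as plain records.
    # Field names, the sample value and the conversion rule are illustrative.
    source_field = {
        "unique_id": "SRC0042", "name": "cust_income", "type": "char(8)",
        "location": {"system": "billing_db", "object": "customer_master"},
    }
    destination_field = {"name": "income", "type": "integer", "table": "population"}
    transformation = {
        "name": "char_to_int", "language": "Python",
        "module": "transforms.numeric", "syntax": "int(value.strip())",
    }

    def apply_transformation(value, rule):
        # Evaluate the stored conversion rule on one incoming value.
        # (eval is acceptable only in a sketch like this one.)
        return eval(rule["syntax"], {}, {"value": value})

    print(apply_transformation("  15000", transformation))   # -> 15000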
It is evident that most of these transformations are needed to take care of the difference between the format in
which data is sent from the source and the format in which it is to be stored in the warehouse. There are other
complications like different types of mappings, the accuracy of the data available/stored and so on. The
transformation and mapping tools are able to take care of all this. But the disadvantage is that they are
quite costly on the one hand, and the resultant code need not be optimal on the other.
7.2 DATA MANAGEMENT
Meta data should be able to describe data as it resides in the data warehouse. This will help the
warehouse manager to control data movements. The purpose of the metadata is to describe the objects
in the database. Some of the descriptions are listed here.
Tables
Columns
Names
Types
Indexes
Columns
Name
Type
Views
Columns
Name
Type
Constraints
Name
Type
Table
Columns
The metadata should also allow for cross-referencing of columns of different tables which may
contain the same data, whether or not they have the same names. It is equally important to be able
to keep track of a particular column as it goes through several aggregations. To take care of such a
situation, metadata can be stored in the following format for each of the fields:
Field
Unique identifier
Field name
Description
The unique identifier helps us to distinguish a particular column from other columns of the same name.
Similarly, for each table, the following information is to be stored
Table
Table name
Columns
Column name
Reference identifier
Again, the names of the fields are self-explanatory. The reference identifier helps to uniquely identify
the table. Aggregations are similar to tables and hence the following format is used.
Aggregation
Aggregation name
Columns
Column name
Reference identifier
Aggregation.
There are certain functions that operate on the aggregations; some of them are:
Min
Max
Average
Sum etc..
Their functions are self-explanatory.
Partitions are subsets of tables. They need the following metadata associated with them
Partition
Partition name
Table name
Range allowed
Range contained
Partition key
Reference identifier
The names are again self-explanatory.
7.3 QUERY GENERATION
Meta data is also required to generate queries. The query manager uses the metadata to build a history
of all queries run and to generate a query profile for each user or group of users.
We simply list a few of the commonly used meta data items for queries. The names are self-explanatory.
Query
Table accessed
Column accessed
Name
Reference identifier
Restrictions applied
Column name
Table name
Reference identifier
Restrictions
Join criteria applied
Column name
Table name
Reference identifier
Column name
Table name
Reference identifier
Aggregate function used
Column name
Reference identifier
Aggregate function
Group by criteria
Column name
Reference identifier
Sort direction
Syntax
Resources
Disk
Read
Write
Temporary
Each of these metadata items needs to be used with specific syntax. We shall not be going into the
details here.
Before we close, one point of caution: wherever possible, these metadata should be gathered in the
background.
BLOCK SUMMARY
In this chapter, we familiarized ourselves with several meta data operations: those supporting data transformation and loading, data management and query generation.
SELF EVALUATION
1. Explain all the steps of Data transformation and loading.
2. In detail explain data management.
3. In detail explain query generation.
Chapter 8
Process Managers
BLOCK INTRODUCTION
In this chapter, we look at certain software managers that keep the data warehouse going. We have
seen on several previous occasions that the warehouse is a dynamic entity and needs constant
maintenance. This was originally done by human managers, but software managers have
taken over recently. We look at two categories of managers:
System managers and
Process managers.
The system managers themselves are divided into various categories:
A configuration manager to take care of system configurations.
A schedule manager to look after scheduling aspects.
An event manager to identify specific events and activate suitable corrective actions.
Database and system managers to handle the various user-related aspects.
A backup recovery manager to keep track of backups.
Amongst the process managers, we have the
Load manager to take care of source interaction, data transformation and data load.
Warehouse manager to take care of data movement, meta data management and performance
monitoring.
Query manager to control query scheduling and monitoring.
We understand the functions of each of them in some detail.
In this chapter, we briefly discuss the system and warehouse managers. The managers are
specific software and the underlying processes that perform certain specific tasks. A manager can also be
looked upon as a tool. Sometimes, we use the terms manager and tool interchangeably.
8.1 NEED FOR MANAGERS FOR A DATA WAREHOUSE
Data warehouses are not just large databases. They are complex environments that integrate many
technologies. They are not static, but keep changing continuously, both content-wise and structure-wise.
Thus, there is a constant need for maintenance and management. Since huge amounts of time, money
and effort are involved in the development of data warehouses, sophisticated management tools are
always justified in the case of data warehouses.
When computer systems were in their initial stages of development, there used to be an army of
human managers who went around doing all the administration and management. But such a scheme
became both unwieldy and prone to errors as the systems grew in size and complexity. Further, most of the
management principles were ad hoc in nature and were subject to human errors and fatigue.
In such a scenario, the need for complex tools which can manage without human intervention
was felt, and the concept of manager tools came up. But one major problem with such managers is that
they need to interact with humans at some stage or the other, and a lot of care has to be taken to
allow for this human intervention. Further, when different tools are used for different tasks, the tools should
be able to interact amongst themselves, which brings the concept of compatibility into the picture. Taking these
factors into account, several standard managers have been devised. They basically fall into two categories.
1. System Management Tools
2. Data Warehouse Process Management Tools.
We shall briefly look into the details of each of these categories.
8.2 SYSTEM MANAGEMENT TOOLS
The most important jobs done by this class of managers include the following:
1. Configuration Managers
2. Schedule Managers
3. Event Managers
4. Database Managers
5. Backup Recovery Managers
6. Resource and Performance Monitors
We shall look into the working of the first five classes, since the last type of manager is less critical in
nature.
8.2.1 Configuration Manager
This tool is responsible for setting up and configuring the hardware. Since several types of machines
are being addressed, several concepts like machine configuration, compatibility etc. are to be taken care
of, as also the platform on which the system operates. Most configuration managers have a single
interface to allow the control of all types of issues.
8.2.2 Schedule Manager
Scheduling is the key to successful warehouse management. Almost all operations in the ware
house need some type of scheduling. Every operating system will have its own scheduler and batch
control mechanism. But these schedulers may not be capable of fully meeting the requirements of a data
warehouse. Hence it is more desirable to have specially designed schedulers to manage the operations.
Some of the capabilities that such a manager should have include the following
Handling multiple queues
Interqueue processing capabilities
Maintain job schedules across system outages
Deal with time zone differences
Handle job failures.
Restart failed jobs
Take care of job priorities
Management of queues
Notify a user that a job is completed.
It may be noted that these features are not exhaustive. On the other hand, not all schedule managers
need to support all these features.
While supporting the above cited jobs, the manager also needs to take care of the following operations,
which may be transparent to the user
Overnight processing
Data load
Data transformation
Index creation
Aggregation creation
Data movement
Back up
Daily scheduling
Report generations
Etc
8.2.3 Event Manager
An event is defined as a measurable, observable occurrence of a defined action. If this definition is
quite vague, it is because it encompasses a very large set of operations. The event manager is a software
tool that continuously monitors the system for the occurrence of events and then takes any action that is
suitable (note that the event is a measurable and observable occurrence). The action to be taken is
also normally specific to the event.
Most often the term event refers to an error, problem or at least an uncommon event. The event
manager starts actions that either corrects the problems or limits the damage.
A partial list of the common events that need to be monitored are as follows:
Running out of memory space.
A process dying
A process using excessive resources
I/O errors
Hardware failure
Lack of space for a table
Excessive CPU usage
Buffer cache hit ratios falling below thresholds etc.
It is obvious that depending on the hardware, the platforms and the type of data being stored, these
events can keep changing.
The most common way of resolving the problem is to call a procedure that takes the corrective action
for the respective event. Most often, the problem resolution is done automatically and human intervention
is needed only in extreme cases. One golden rule while defining the procedures is that the solving of one
event should not produce side effects as for as possible. Suppose a table has run out of space, the
procedure to take care of this should provide extra space elsewhere. But the process should not end up
in snatching away the space from some other table, which may cause problems later on. However, it is
very difficult to define and implement such perfect procedures.
The other capability of the event manager is the ability to raise alarms. For example, when space
is running out, it is one thing to wait for the event to occur and then take corrective action; but the manager can
also raise an alarm after, say, 90% of the space is used up, so that suitable corrective action can be taken
early.
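A minimal sketch of such an event rule is given below (the 90% threshold and the corrective handler are illustrative assumptions): disk usage is monitored, an alarm is raised early, and a corrective procedure is called when space actually runs out.

    # A minimal sketch of an event manager rule: watch disk usage, raise an
    # alarm at 90% and call a corrective procedure when space runs out.
    import shutil

    ALARM_THRESHOLD = 0.90

    def add_space(path):
        # Placeholder corrective action; a real handler might extend a
        # tablespace or purge old archive logs.
        print(f"corrective action: allocating extra space for {path}")

    def check_disk(path="/"):
        usage = shutil.disk_usage(path)
        fraction_used = usage.used / usage.total
        if fraction_used >= 1.0:
            add_space(path)                       # the event itself
        elif fraction_used >= ALARM_THRESHOLD:
            print(f"ALARM: {path} is {fraction_used:.0%} full")   # early warning

    check_disk()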
8.2.4 Database Manager
The database manager normally also has a separate (and often independent) system manager
module. The purpose of these managers is to automate certain processes and simplify the execution of
others. Some of the operations are listed as follows:
Ability to add/remove users
o User management
o Manipulate user quotas
o Assign and deassign the user profiles
Ability to perform database space management
o Monitor and report space usage
o Garbage management
o Add and expand space
Manage summary tables
Assign or deassign space
Reclaim space from old tables
Ability to manage errors
Etc..
User management is important in a data warehouse because a large number of users, each of whom
has the potential to use large amounts of resources, are to be managed. Added to this is the complexity of
managing the access controls, and the picture is complete. The managers normally maintain profiles and
roles for each user and use them to take care of the access control aspects.
The measure of the success of a manager is its ability to manage space both inside and outside the
database. In some cases, an incremental change can trigger huge changes, and hence space management
and reclamation of unused space, as well as consolidating the fragmented chunks, is a critical factor. The
manager should be able to clearly display the quantum and location of the space used, so that proper
decisions can be taken.
The need for temporary space to take care of interim storage is another important factor. Though
it does not appear in the final tally, insufficient or ad hoc space can lead to inefficient performance of the
system. Proper utilization and tracking of such spaces is a challenging task.
In large databases, huge volumes of error logs and trace files are created. The ability to manage them
in the most appropriate form and to archive them at suitable intervals is also an important aspect. The
trade-off between archiving and deleting the files is also to be clearly understood.
8.2.5 Back Up Recovery Manager
Since the data stored in a warehouse is invaluable, the need to back up and recover lost data cannot be
overemphasized. There are three main features for the management of backups:
Scheduling
Backup Data Tracking
Database Awareness.
Since the only reason backups are taken is to recover accidentally lost data, backups are useless
unless the data can be used effectively whenever needed. This needs very efficient integration with the
schedule manager. The backup recovery manager must also be able to index and track the stored data
efficiently. An idea of the enormity of the task can be had from the fact that the data warehouses
themselves are huge and the backups will be several times bigger than the warehouse.
8.3 DATAWARE HOUSE PROCESS MANAGERS
These are responsible for the smooth flow, maintenance and upkeep of data into and out of the
database. The main types of process managers are
Load Manager
Warehouse Manager and
Query Manager
We shall look into each of them briefly. Before that, we look at a schematic diagram that defines the
boundaries of the three types of managers.
[Figure: Boundaries of process managers - operational and external data flows through the load manager into the warehouse (detailed information, summary information and meta data, managed by the warehouse manager) and out through the query manager to the front-end tools as decision information.]
8.3.1 Load Manager
This is responsible for any data transformations and for loading of data into the database. They should
effect the following
Data source interaction
Data transformation
Data load.
The actual complexity of each of these modules depends on the size of the database.
The load manager should be able to interact with the source systems to verify the received data. This is a very
important aspect, and any improper operation leads to invalid data affecting the entire warehouse. This
is normally achieved by making the source and data warehouse systems compatible.
The easiest method is to ask the source to send some control information, based on which the data can
be verified for relevance. Simple concepts like checksums can go a long way in ensuring error-free
operations.
If constant networking is possible between the source and destination systems, then message
transfers between the source and warehouse can be used to ensure that the data transfer happens
correctly. In case of errors, retransmissions can be requested.
In more complex cases, a copy management tool can be used to effect more complex tests before
admitting the data from the source. The exact nature of checks is based on the actual type of data being
transferred.
One very simple but very useful check is to make sure that a count of the number of records is
maintained, so that no data is lost and, at the same time, no data gets loaded twice.
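A minimal sketch of such a verification is shown below (the control-information format is an assumption): the load manager checks the record count and a checksum sent by the source before accepting the data.

    # A minimal sketch of verifying a load against control information sent by
    # the source: a record count and a checksum.  The format is assumed.
    import hashlib

    def verify_load(records, control):
        data = "\n".join(records).encode("utf-8")
        checks = {
            "record_count_ok": len(records) == control["record_count"],
            "checksum_ok": hashlib.md5(data).hexdigest() == control["md5"],
        }
        return all(checks.values()), checks

    records = ["1001,north,1200", "1002,south,800"]
    control = {"record_count": 2,
               "md5": hashlib.md5("\n".join(records).encode("utf-8")).hexdigest()}
    ok, detail = verify_load(records, control)
    print(ok, detail)   # if not ok, the load manager would ask for retransmission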
The amount of data transformation needed again depends on the context. In simple cases, only the
field formats may have to be changed. Single fields may have to be broken down or multiple fields
combined. Extra fields may also have to be added / deleted. In more complex transformation mappings,
separate tools to do the job need to be employed.
Data loading is also a very important aspect. The actual operation depends on the software used.
8.3.2 Ware House Manager
The warehouse manager is responsible for maintaining the data of the warehouse. It should also create
and maintain a layer of meta data. Some of the responsibilities of the warehouse manager are:
Data movement
Meta data management
Performance monitoring
Archiving.
Data movement includes the transfer of data within the warehouse, aggregation, and the creation and
maintenance of tables, indexes and other objects of importance. The manager should be able to create new
aggregations as well as remove old ones. The creation of additional rows/columns, keeping track of the
aggregation processes and creating meta data are also its functions.
Most aggregations are created by queries. But a complex query normally needs several aggregations
and needs to be broken down to describe them. Also, these may not be the most optimal way of doing things.
In such cases, the warehouse manager should be capable of breaking down the query and be able to
optimize the resultant set of aggregations. This may also need some human interaction.
The ware house manager must also be able to devise parallelisms for any given operation. This would
ensure the most optimal utilization of resources. But parallelization would also involve additional queuing
mechanisms, prioritization, sequencing etc. When data marts are being used, the warehouse manager is
also responsible for their maintenance. Scheduling their refresh sequences and clearing unwanted data
will also become its responsibilities.
The other important job of the warehouse manager is to manage the meta data. Whenever the old
data is archived or new data is loaded, the meta data needs to be updated. The manager should be able
to do it automatically. The manager is also responsible for the use of metadata in several cases, like
identifying the same data being present at different levels of aggregation. Performance monitoring and
tuning is also the responsibility of the ware house manager. This is done by maintaining statistics along
with the query history, so that suitable optimizations are done. But the amount of statistics stored and the
type of conclusions drawn are highly subjective. The aspect of tuning the system performance is a more
complex operation and as of now, no tool that can do it most effectively is available.
The last aspect of the warehouse manager is archiving. All data is susceptible to ageing; over a period
of time, the usefulness of data becomes less and it has to be removed to make way for new data.
However, it is desirable to hold the data as long as possible, since there is no guarantee that a piece of
data will not at all be needed in future. Based on the availability of storage space and previous
experience about how long data needs to be preserved, a decision to purge the data is to be taken. In
some cases, legal requirements may also need the data to be preserved, even though it is not useful from
the business point of view.
The answer is to hold data in archives, normally on tape but possibly also on disk, so that it can
be brought back as and when needed. However, one argument is that since the warehouse always gets
data from a source, and the source will anyway have archived the data, the warehouse need not do it
again.
In any case, a suitable archiving strategy, based on various factors, needs to be devised and meticulously
implemented by the manager.
When designing the archiving process, several details need to be looked into:-
Life expectancy of data
In raw form
In aggregated form
Archiving parameters
Start data
Cycle time
Work load.
It can be pointed out that once the life expectancy of the data in the warehouse is over, it is archived
in raw form for some time, and later in aggregated form, before being totally purged from the system.
The actual time frames of each of these depend on various factors and previous experience.
The cycle time indicates how often the data is archived - weekly, monthly or quarterly. It is to be noted
that archiving puts extra load on the system. If the data to be archived is small, perhaps it can be done
overnight, but if it is large it may affect the normal schedule. This is one of the factors that decide the
cycle time: the longer the cycle time, the more the load per archiving run.
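The sketch below (with illustrative time frames) shows such an archiving policy: data is kept online for its raw life expectancy, then held in aggregated form, and finally purged.

    # A minimal sketch of an archiving policy: raw data is kept for a while,
    # then held in aggregated form, then purged.  The time frames are examples.
    from datetime import date, timedelta

    RAW_LIFE = timedelta(days=365)            # keep raw detail for one year
    AGGREGATED_LIFE = timedelta(days=3 * 365)

    def archive_action(partition_date, today=None):
        age = (today or date.today()) - partition_date
        if age <= RAW_LIFE:
            return "keep online"
        if age <= AGGREGATED_LIFE:
            return "archive raw detail, keep aggregated form"
        return "purge"

    print(archive_action(date(2003, 1, 1), today=date(2005, 6, 1)))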
8.3.3 Query Manager
We shall now look at the last manager, but not one of any less importance: the query manager. The main
responsibilities include the control of the following:
Users access to data
Query scheduling
Query Monitoring
These jobs are varied in nature and have not been automated as yet.
The main job of the query manager is to control the users' access to data and also to present the results
of query processing in a format suitable to the user. The raw data, often from different
sources, needs to be compiled in a format suitable for querying. The query manager will have to act as a
mediator between the user on one hand and the meta data on the other. It is desirable that all the access
tools work through the query manager. If not, at least indirect controls need to be set up, to ensure proper
restrictions on the queries made. This will ensure proper monitoring and control, if nothing else.
Scheduling the ad hoc queries is also the responsibility of the query manager: simultaneous, large,
uncontrolled queries affect the system performance. Proper queuing mechanisms to ensure fairness to all
queries are of prime importance. The query manager should be able to create, abort and requeue jobs.
But the job of performance prediction (when a query in the queue will get completed) is often kept
outside its purview, for the simple reason that it is difficult to estimate beforehand.
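A minimal sketch of such a queuing mechanism follows (priorities and query texts are illustrative): ad hoc queries are queued with a priority and released one at a time.

    # A minimal sketch of query scheduling: ad hoc queries are queued with a
    # priority and released one at a time, so that large uncontrolled queries
    # cannot swamp the system.
    import heapq, itertools

    class QueryScheduler:
        def __init__(self):
            self._queue, self._counter = [], itertools.count()

        def submit(self, sql, priority=10):
            # Lower number = higher priority; the counter keeps FIFO order
            # among queries of equal priority.
            heapq.heappush(self._queue, (priority, next(self._counter), sql))

        def next_query(self):
            return heapq.heappop(self._queue)[2] if self._queue else None

    scheduler = QueryScheduler()
    scheduler.submit("SELECT * FROM sales", priority=20)       # large ad hoc scan
    scheduler.submit("SELECT total FROM weekly_summary", 5)    # small, urgent
    print(scheduler.next_query())   # the urgent query is released first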
The other important aspect is that the query manager should ideally be able to monitor all the queries.
This would help in gathering proper statistics, in tuning the ad hoc queries to improve the system performance,
and in controlling the type of queries made. It is for this reason that all queries should be routed through
the query manager.
BLOCK SUMMARY
We have looked at and understood the functioning of the following classes of managers:
System managers
Configuration manager
Schedule manager.
Event manager
Data base manager.
Backup recovery manager.
Process managers
Load manager.
Ware house manager
Query manager.
SELF EVALUATION - I
1. What are the 2 basic classes of managers?
2. Name any 3 duties of schedule manager.
3. What is an event?
4. Name any 4 events.
5. How does the event manager manage the events?
6. Name any 4 functions of data base manager.
7. Name the 3 process managers.
8. Name the functions of the load manager.
9. Name the functions of the warehouse manager.
10. What are the responsibilities of the query manager?
ANSWER TO SELF EVALUATION - I
1. a) Systems Managers.
b) Process Managers.
2. a) Handle multiple queues
b) Maintain job schedules across outages.
c) Support starting and stopping of queries, etc.
3. An event is a measurable, observable occurrence of action.
4. a) disk running out of space.
b) excessive CPU use
c) A dying process
d) Table reaching maximum size, etc.
5. By calling the scripts capable of handling the events.
6. a) To add / remove users.
b) To maintain roles and profiles.
c) To perform database space management.
d) To manage temporary tables, etc.
7. a) Load manager.
b) warehouse manager.
c) Query manager.
8. a) Source interaction.
b) data transformation
c) data load.
9. a) data movement
b) meta data management
c) performance monitoring.
d) data archiving.
10. a) User access to data
b) query scheduling.
c) query monitoring.
SELF EVALUATION - II
1. In detail explain the systems management tools.
2. Explain the boundaries of the process managers in detail, with a neat diagram.
Data Mining
COURSE SUMMARY
Data mining is a promising and flourishing frontier in database systems and new database applications.
Data mining, also called knowledge discovery in databases (KDD), is the automated extraction
of patterns representing knowledge implicitly stored in large databases, data warehouses and
other massive information repositories.
Data mining is a multidisciplinary field, drawing work from areas including database technology, artificial
intelligence, machine learning, neural networks, statistics, pattern recognition, knowledge acquisition,
information retrieval, high-performance computing and data visualization.
The aim of this course is to give the reader an appreciation of the importance and potential of data
mining. It is a technique not only for IT managers but for all decision makers, and we should be able to exploit
this new technology.
Data mining is, in essence, a set of techniques that allows us to access information which is hidden in our
databases. In large databases especially, it is extremely important to get appropriate, accurate and useful
information which we cannot find with standard SQL tools. In order to do this, a structured, step-by-step
approach must be adopted: goals must be identified, and data cleaned and prepared
for the queries and analyses to be made. It is essential to begin with a very good data warehouse and the
facility to clean the data.
In this course, the reader goes through the basic fundamentals of data mining: what data mining is,
issues related to data mining, approaches to data mining and applications of data mining.
In the subsequent chapters, data mining techniques are explained, including association
rules, clustering and neural networks. Advanced data mining issues are also discussed with respect to the
World Wide Web. Guidelines on data mining issues are discussed, as is how to analyze the performance of
data mining systems; here, various factors such as the size of the data, the data mining
methods and the error in the system are taken into account in analyzing the performance of the system. Finally
we see the application aspects, that is, the implementation of a data mining system.
Chapter 9
Introduction to Data Mining
9.0 INTRODUCTION
We are in an information technology age. In this information age, we believe that information
leads to power and success. With the development of powerful and sophisticated technologies
such as computers, satellites and other machines, we have been collecting
tremendous amounts of information. Initially, with the advent of computers and mass digital storage, we
started collecting and storing all sorts of data, counting on the power of computers to help sort through this
amalgam of information.
Unfortunately, these massive collections of data stored on different structures very rapidly became
overwhelming. This initial chaos has led to the creation of structured databases and database management
systems (DBMS). Efficient database management systems have been very important assets for the
management of a large corpus of data and especially for the effective and efficient retrieval of particular
information from a large collection whenever needed. The proliferation of database management systems
has also contributed to the recent massive gathering of all sorts of information. Today, we have far more
information than we can handle: from business transactions and scientific data to satellite pictures, text
reports and military intelligence.
The overall evolution of database technology is shown in Table 9.1.
The abundance of data, coupled with the need for powerful data analysis tools, has been described as
a "data rich but information poor" situation. The fast-growing, tremendous amounts of data, collected
and stored in large and numerous databases, have far exceeded our human ability to comprehend them
without powerful tools. Due to the high cost of data collection, people learned to make decisions based on
limited information; but this is no longer workable when the data set is huge. Hence the concept of data
mining was developed.
Table 9.1 Evolution of Database Technology

Data Collection (1960s)
    Business question: "What was my total revenue in the last five years?"
    Enabling technologies: computers, tapes, disks
    Characteristics: retrospective, static data delivery

Data Access (1980s)
    Business question: "What were unit sales in New Delhi last March?"
    Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
    Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
    Business question: "What were unit sales in New Delhi last March? Drill down to Mumbai."
    Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
    Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
    Business question: "What is likely to happen to Mumbai unit sales next month? Why?"
    Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
    Characteristics: prospective, proactive information delivery
9.1 WHAT IS DATA MINING?
There are many definitions of data mining. A few important definitions are given below.
Data mining refers to extracting or mining knowledge from large amounts of data.
Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of
large quantities of data in order to discover meaningful patterns and rules.
Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial
extraction of implicit, previously unknown and potentially useful information from data. This encompasses
a number of different technical approaches, such as clustering, data summarization, learning classification
rules, finding dependency networks, analyzing changes and detecting anomalies.
Data mining is the search for relationships and global patterns that exist in large databases but are
hidden among the vast amount of data, such as a relationship between patient data and medical
diagnoses. These relationships represent valuable knowledge about the database and the objects in it
and, if the database is a faithful mirror, about the real world registered by the database.
Data mining refers to using a variety of techniques to identify nuggets of information or decision-making
knowledge in bodies of data, and extracting these in such a way that they can be put to use in
areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but,
as it stands, of low value, as no direct use can be made of it; it is the hidden information in the data that is
useful.
Some people equate data mining with Knowledge Discovery in Databases, or KDD. Alternatively,
others view data mining as simply an essential step in the process of knowledge discovery in databases.
Knowledge discovery is a process, as shown in Figure 9.1. It consists of the following iterative sequence
of steps.
Figure 9.1 Data Mining: A KDD Process
1. Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant
data are removed from the collection.
2. Data integration: at this stage, multiple data sources, often heterogeneous, may be combined
in a common source.
3. Data selection: at this step, the data relevant to the analysis is decided on and retrieved from
the data collection.
4. Data transformation: also known as data consolidation, it is a phase in which the selected data
is transformed into forms appropriate for the mining procedure.
97
BSIT 53 Data Warehousing and Data Mining
5. Data mining: it is the crucial step in which clever techniques are applied to extract potentially
useful patterns.
6. Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified
based on given measures.
7. Knowledge representation: is the final phase in which the discovered knowledge is visually
represented to the user. This essential step uses visualization techniques to help users understand
and interpret the data mining results.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base. So we conclude that data
mining is a step in the knowledge discovery process: the process of discovering interesting
knowledge from large amounts of data stored in databases, data warehouses or other information
repositories.
The architecture of a typical data mining system may have the following major components as shown
in figure 9.2.
Fig. 9.2 Architecture of a typical Data Mining System
(The figure shows a graphical user interface on top of a pattern evaluation module and a data mining engine, supported by a knowledge base; these sit above a database or data warehouse server, which is fed from databases and data warehouses through data cleaning, data integration and filtering.)
1. Database, data warehouse or other information repository: this is one or a set of databases,
data warehouses, spreadsheets or other kinds of information repositories. Data cleaning and
data integration techniques may be performed on the data.
2. Database or data warehouse server: The database or data warehouse server is responsible
for fetching the relevant data, based on the user's data mining request.
3. Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the
interestingness of the resulting patterns. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction.
4. Data mining engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association, classification, cluster analysis,
and evolution and deviation analysis.
5. Pattern evaluation module: This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search towards interesting patterns. It
may use interestingness thresholds to filter out discovered patterns.
6. Graphical user interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search and performing exploratory data mining based on
the intermediate data mining results.
9.2 WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of media or data. Data mining should be applicable
to any kind of information repository. However, algorithms and approaches may differ when applied to
different types of data. Indeed, the challenges presented by different types of data vary significantly.
Data mining is being put into use and studied for databases, including relational databases, object-relational
databases and object-oriented databases, data warehouses, transactional databases, unstructured and
semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases,
multimedia databases, time-series databases and textual databases, and even flat files. Here are some of
these examples in more detail:
Flat Files: Flat files are actually the most common data source for data mining algorithms,
especially at the research level. Flat files are simple data files in text or binary format with a
structure known by the data mining algorithm to be applied. The data in these files can be
transactions, time-series data, scientific measurements.
Relational Databases: A relational database consists of a set of tables containing either values
of entity attributes, or values of attributes from entity relationships. Tables have columns and
rows, where columns represent attributes and rows represent tuples. A tuple in a relational table
corresponds to either an object or a relationship between objects and is identified by a set of
attribute values representing a unique key. The most commonly used query language for relational
database is SQL, which allows retrieval and manipulation of the data stored in the tables, as well
as the calculation of aggregate functions such as average, sum, min, max and count. Data
mining algorithms using relational databases can be more versatile than data mining algorithms
specifically written for flat files, since they can take advantage of the structure inherent to
relational databases. While data mining can benefit from SQL for data selection, transformation
and consolidation, it goes beyond what SQL could provide, such as predicting, comparing, detecting
deviations.
Data Warehouses: A data warehouse is a repository of data collected from
multiple data sources (often heterogeneous) and intended to be used as a whole under the
same unified schema. A data warehouse gives the option to analyze data from different sources
under the same roof. Data from the different stores would be loaded, cleaned, transformed and
integrated together. To facilitate decision making and multi-dimensional views, data warehouses
are usually modeled by a multi-dimensional data structure.
Multimedia Databases: Multimedia databases include video, images, audio and text media.
They can be stored on extended object-relational or object-oriented databases, or simply on a
file system. Multimedia is characterized by its high dimensionality, which makes data mining
even more challenging. Data mining from multimedia repositories may require computer vision,
computer graphics, image interpretation, and natural language processing methodologies.
Spatial Databases: Spatial databases are databases that in addition to usual data, store
geographical information like maps, and global or regional positioning. Such spatial databases
present new challenges to data mining algorithms.
Time-Series Databases: Time-series databases contain time-related data such as stock market
data or logged activities. These databases usually have a continuous flow of new data coming
in, which sometimes causes the need for a challenging real time analysis. Data mining in such
databases commonly includes the study of trends and correlations between evolutions of different
variables, as well as the prediction of trends and movements of the variables in time.
World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository
available. A very large number of authors and publishers are continuously contributing to its
growth and metamorphosis and a massive number of users are accessing its resources daily.
Data in the World Wide Web is organized in inter-connected documents. These documents can
be text, audio, video, raw data and even applications. Conceptually, the World Wide Web
comprises three major components: the content of the Web, which encompasses the documents
available; the structure of the Web, which covers the hyperlinks and the relationships between
documents; and the usage of the Web, describing how and when the resources are accessed. A
fourth dimension can be added relating the dynamic nature or evolution of the documents. Data
mining in the World Wide Web or Web Mining, tries to address all these issues and is often
divided into web content mining, web structure mining and web usage mining.
9.3 WHAT CAN DATA MINING DO?
The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and
large, there are two types of data mining tasks: descriptive data mining tasks that describe the general
properties of the existing data, and predictive data mining tasks that attempt to do predictions based on
inference on available data. The data mining functionalities and the variety of knowledge they discover
are briefly presented in the following list
a) Characterization: Data characterization is a summarization of general features of objects in a
target class, and produces what is called characteristic rules. The data relevant to a user-
specified class are normally retrieved by a database query and run through a summarization
module to extract the essence of the data at different levels of abstractions. For example, one
may want to characterize the Video store customers who regularly rent more than 30 movies a
year. Note that with a data cube containing summarization of data, simple OLAP operations fit
the purpose of data characterization.
b) Discrimination: Data discrimination produces what are called discriminant rules and is basically
the comparison of the general features of objects between two classes referred to as the target
class and the contrasting class. For example, one may want to compare the general
characteristics of the customers who rented more than 30 movies in the last year with those
whose rental count is lower than 5. The techniques used for data discrimination are very
similar to the techniques used for data characterization with the exception that data discrimination
results include comparative measures.
c) Association Analysis: Association analysis is the discovery of what are commonly called
association rules. It studies the frequency of items occurring together in transactional databases,
and, based on a threshold called support, identifies the frequent item sets. Another threshold,
confidence, which is the conditional probability that an item appears in a transaction when
another item appears, is used to pinpoint association rules. Association analysis is commonly
used for market basket analysis. For example, it could be useful for the Video store manager to
know what movies are often rented together, or whether there is a relationship between renting a
certain type of movie and buying popcorn or pop. The discovered association rules are of the
form P ⇒ Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is
the probability that P and Q appear together in a transaction, and c (for confidence) is the
conditional probability that Q appears in a transaction when P is present.
For example, consider the hypothetical association rule
RentType(X, "game") ∧ Age(X, "13-19") ⇒ Buys(X, "pop") [s = 2%, c = 55%]
would indicate that 2% of the transactions considered are of customers aged between 13 and 19
who are renting a game and buying a pop, and that there is a certainty of 55% that teenage
customers who rent a game also buy pop.
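As a small computational sketch of support and confidence (this is a brute-force illustration with an invented transaction list, not an efficient algorithm such as Apriori):

from itertools import combinations

# Hypothetical video store transactions: the items rented or bought together.
transactions = [
    {"game", "pop"},
    {"game", "pop", "popcorn"},
    {"movie", "popcorn"},
    {"game"},
    {"movie", "pop"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / n

# The rule {game} => {pop}: s = P(game and pop), c = P(pop | game).
p, q = {"game"}, {"pop"}
s = support(p | q)
c = support(p | q) / support(p)
print("support = %.2f, confidence = %.2f" % (s, c))

# Frequent 2-item sets above a minimum support threshold of 40%.
items = sorted(set().union(*transactions))
frequent_pairs = [set(pair) for pair in combinations(items, 2)
                  if support(set(pair)) >= 0.4]
print(frequent_pairs)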
d) Classification: Classification analysis is the organization of data in given classes. Also known
as supervised classification, the classification uses given class labels to order the objects in
the data collection. Classification approaches normally use a training set where all objects are
already associated with known class labels.
The classification algorithm learns from the training set and builds a model. The model is used to
classify new objects. For example, after starting a credit policy, the Video store managers could
analyze the customers' behaviour and label the customers who received credit accordingly,
with three possible labels: safe, risky and very risky. The classification analysis would
generate a model that could be used to either accept or reject credit requests in the future.
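A minimal sketch of this supervised setting, using the credit example above (it assumes the scikit-learn library; the features, figures and labels are invented for illustration):

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: [annual income in thousands, movies rented per year],
# each object already associated with a known class label.
X_train = [[20, 5], [25, 40], [60, 10], [75, 35], [30, 2], [90, 50]]
y_train = ["risky", "very risky", "safe", "safe", "risky", "safe"]

# The classification algorithm learns a model from the training set.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# The model is then used to classify a new, unlabelled credit request.
new_customer = [[55, 20]]
print(model.predict(new_customer))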
e) Prediction: Prediction has attracted considerable attention given the potential implications of
successful forecasting in a business context. There are two major types of predictions: one can
either try to predict some unavailable data values or pending trends or predict a class label for
some data. The latter is tied to classification. Once a classification model is built based on a
training set, the class label of an object can be foreseen based on the attribute values of the
object and the attribute values of the classes. Prediction, however, more often refers to the
forecast of missing numerical values, or of increase/decrease trends in time-related data. The
major idea is to use a large number of past values to estimate probable future values.
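For the forecasting flavour of prediction, a minimal sketch (again assuming scikit-learn, with invented monthly unit-sales figures) fits a simple linear trend to past values and extrapolates the next month:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past values: unit sales over the last six months.
months = np.arange(1, 7).reshape(-1, 1)            # months 1 .. 6
sales = np.array([120, 135, 128, 150, 160, 172])   # observed unit sales

# Fit a simple linear trend to the historical values.
model = LinearRegression().fit(months, sales)

# Use the trend to estimate the probable value for month 7.
forecast = model.predict([[7]])
print("forecast for month 7: %.0f units" % forecast[0])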
f) Clustering: Similar to classification, clustering is the organization of data in classes. However,
unlike classification, in clustering, class labels are unknown and it is up to the clustering algorithm
to discover acceptable classes. Clustering is also called unsupervised classification, because
the classification is not dictated by given class labels. There are many clustering approaches, all
based on the principle of maximizing the similarity between objects in the same class (intra-class
similarity) and minimizing the similarity between objects of different classes (inter-class
similarity).
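A minimal sketch of unsupervised clustering (assuming scikit-learn; the two-dimensional customer data is invented), where the algorithm discovers the groups without being given any class labels:

from sklearn.cluster import KMeans

# Hypothetical customers described by [movies rented per year, average spend].
X = [[2, 5], [3, 7], [4, 6],          # a low-activity group
     [30, 40], [35, 42], [28, 38]]    # a high-activity group

# Ask for two clusters; note that no class labels are supplied.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # the discovered cluster centres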
g) Outlier Analysis: Outliers are data elements that cannot be grouped in a given class or cluster.
Also known as exceptions or surprises, they are often very important to identify. While outliers
can be considered noise and discarded in some applications, they can reveal important knowledge
in other domains, and thus can be very significant and their analysis valuable.
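One common, simple way of flagging such exceptions is a z-score test, sketched below with invented purchase amounts; the threshold of three standard deviations is only a conventional rule of thumb, not a prescription of any particular data mining system.

import numpy as np

# Hypothetical daily purchase amounts; the last value looks exceptional.
amounts = np.array([118, 120, 121, 122, 123, 124, 125,
                    126, 127, 128, 129, 130, 950.0])

# Flag values lying more than 3 standard deviations away from the mean.
z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 3]
print(outliers)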
h) Evolution and Deviation Analysis: Evolution and deviation analysis pertain to the study of
time-related data that changes over time. Evolution analysis models evolutionary trends in data,
which allows the characterization, comparison, classification or clustering of time-related data. Deviation
analysis, on the other hand, considers differences between measured values and expected values,
and attempts to find the cause of the deviations from the anticipated values.
9.4 HOW DO WE CATEGORIZE DATA MINING SYSTEMS?
There are many data mining systems available or being developed. Some are specialized systems
dedicated to a given data source or confined to limited data mining functionalities; others are more
versatile and comprehensive. Data mining systems can be categorized according to various criteria;
among others, the following classifications can be made:
a) Classification According to the Type of Data Source Mined: this classification categorizes
data mining systems according to the type of data handled such as spatial data, multimedia data,
time-series data, text data, World Wide Web, etc.
b) Classification According to the Data Model Drawn on: this classification categorizes data
mining systems based on the data model involved such as relational database, object-oriented
database, data warehouse, transactional, etc.
c) Classification According to the Kind of Knowledge Discovered: this classification
categorizes data mining systems based on the kind of knowledge discovered or data mining
functionalities, such as characterization, discrimination, association, classification, clustering etc.
Some systems tend to be comprehensive systems offering several data mining functionalities
together.
d) Classification According to Mining Techniques Used: Data mining systems employ and
provide different techniques. This classification categorizes data mining systems according to
the data analysis approach used such as machine learning, neural networks, genetic algorithms,
statistics, visualization, database oriented or data warehouse-oriented, etc. The classification
can also take into account the degree of user interaction involved in the data mining process
such as query-driven systems, interactive exploratory systems, or autonomous systems. A
comprehensive system would provide a wide variety of data mining techniques to fit different
situations and options, and offer different degrees of user interaction.
9.5 WHAT ARE THE ISSUES IN DATA MINING?
Data mining algorithms embody techniques that have sometimes existed for many years, but have only
lately been applied as reliable and scalable tools that time and again outperform older classical statistical
methods. While data mining is still in its infancy, it is fast becoming ubiquitous. Before data mining
develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed.
Some of these issues are addressed below.
a) Security and Social Issues: Security is an important issue with any data collection that is
shared and/or is intended to be used for strategic decision-making. In addition, when data is
collected for customer profiling, user behaviour understanding, correlating personal data with
other information, etc., large amounts of sensitive and private information about individuals or
companies is gathered and stored. This becomes controversial given the confidential nature of
some of this data and the potential illegal access to the information. Moreover, data mining could
disclose new implicit knowledge about individuals or groups that could be against privacy policies,
especially if there is potential dissemination of discovered information. Another issue that arises
from this concern is the appropriate use of data mining. Due to the value of data, databases of
all sorts of content are regularly sold, and because of the competitive advantage that can be
attained from implicit knowledge discovered, some important information could be withheld,
while other information could be widely distributed and used without control.
b) User Interface Issues: The knowledge discovered by data mining tools is useful as long as it
is interesting, and above all understandable by the user. Good data visualization eases the
interpretation of data mining results, as well as helps users better understand their needs. Many
data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate
visual presentation. There are many visualization ideas and proposals for effective data graphical
presentation. However, there is still much research to accomplish in order to obtain good
visualization tools for large datasets that could be used to display and manipulate mined knowledge.
The major issues related to user interfaces and visualization are screen real-estate, information
rendering and interaction. Interactivity with the data and data mining results is crucial since it
provides means for the user to focus and refine the mining tasks, as well as to picture the
discovered knowledge from different angles and at different conceptual levels.
c) Mining Methodology Issues: These issues pertain to the data mining approaches applied and
their limitations. Topics such as versatility of the mining approaches, the diversity of data available,
the dimensionality of the domain, the broad analysis needs (when known), the assessment of the
knowledge discovered, the exploitation of background knowledge and metadata, the control and
handling of noise in data, etc. are all examples that can dictate mining methodology choices. For
instance, it is often desirable to have different data mining methods available since different
approaches may perform differently depending upon the data at hand. Moreover, different
approaches may suit and solve users' needs differently.
Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most
datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if
not obscure, the analysis process and in many cases compromise the accuracy of the results. As
a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often
seen as lost time, but data cleaning, as time consuming and frustrating as it may be, is one of the
most important phases in the knowledge discovery process. Data mining techniques should be
able to handle noise in data or incomplete information.
More than the size of the data, the size of the search space is decisive for data mining
techniques. The size of the search space often depends upon the number of dimensions in
the domain space. The search space usually grows exponentially when the number of dimensions
increases. This is known as the curse of dimensionality. This curse affects so badly the
performance of some data mining approaches that it is becoming one of the most urgent issues
to solve.
d) Performance Issues: Many artificial intelligence and statistical methods exist for data analysis
and interpretation. However, these methods were often not designed for the very large data sets
data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability
and efficiency of the data mining methods when processing considerably large data. Algorithms
with exponential and even medium-order polynomial complexity cannot be of practical use for
data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for
mining instead of the whole dataset. However, concerns such as completeness and choice of
samples may arise. Other topics in the issue of performance are incremental updating, and
parallel programming. There is no doubt that parallelism can help solve the size problem if the
dataset can be subdivided and the results can be merged later. Incremental updating is important
for merging results from parallel mining, or updating data mining results when new data becomes
available without having to re-analyze the complete dataset.
e) Data Source Issues: There are many issues related to the data sources; some are practical,
such as the diversity of data types, while others are philosophical, like the data glut problem. We
certainly have an excess of data since we already have more data than we can handle and we
are still collecting data at an even higher rate. If the spread of database management systems
has helped increase the gathering of information, the advent of data mining is certainly encouraging
more data harvesting. The current practice is to collect as much data as possible now and
process it, or try to process it, later. The concern is whether we are collecting the right data at
the appropriate amount, whether we know what we want to do with it, and whether we distinguish
between what data is important and what data is insignificant. Regarding the practical issues
related to data sources, there is the subject of heterogeneous databases and the focus on diverse
complex data types. We are storing different types of data in a variety of repositories. It is
difficult to expect a data mining system to effectively and efficiently achieve good mining results
on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms
and methodologies. Currently, there is a focus on relational databases and data warehouses, but
other approaches need to be pioneered for other specific complex data types. A versatile data
mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous
data sources, at structural and semantic levels, poses important challenges not only to the database
community but also to the data mining community.
9.6 REASONS FOR THE GROWING POPULARITY OF DATA
MINING
a) Growing Data Volume
The main reason for the necessity of automated computer systems for intelligent data analysis is the
enormous volume of existing and newly appearing data that require processing. The amount of
data accumulated each day by various business, scientific, and governmental organizations around
the world is daunting. It becomes impossible for human analysts to cope with such overwhelming
amounts of data.
b) Limitations of Human Analysis
Two other problems that surface when human analysts process data are the inadequacy of the
human brain when searching for complex multifactor dependencies in data, and the lack of
objectiveness in such an analysis. A human expert is always a hostage of the previous experience
of investigating other systems. Sometimes this helps, sometimes this hurts, but it is almost
impossible to get rid of this fact.
c) Low Cost of Machine Learning
One additional benefit of using automated data mining systems is that this process has a much
lower cost than hiring many highly trained professional statisticians. While data mining does
not eliminate human participation in solving the task completely, it significantly simplifies the job
and allows an analyst who is not a professional in statistics and programming to manage the
process of extracting knowledge from data.
9.7 APPLICATIONS
Data mining has many and varied fields of application some of which are listed below.
Retail/Marketing
Identify buying patterns from customers
Find associations among customer demographic characteristics
Predict response to mailing campaigns
Market basket analysis
Banking
Detect patterns of fraudulent credit card use
Identify loyal customers
Predict customers likely to change their credit card affiliation
Determine credit card spending by customer groups
Find hidden correlations between different financial indicators
Identify stock trading rules from historical market data
Insurance and Health Care
Claims analysis - which medical procedures are claimed together
Predict which customers will buy new policies
Identify behaviour patterns of risky customers
Identify fraudulent behaviour
Transportation
Determine the distribution schedules among outlets
Analyse loading patterns
Medicine
Characterise patient behaviour to predict office visits
Identify successful medical therapies for different illnesses
9.8 EXERCISE
I. FILL UP THE BLANKS
1. Data mining refers to __________________ knowledge from large amount of data.
2. Data cleaning step removes _____________ and ____________ data.
3. GUI module communicates between _______________ and ___________ system
4. Flat files are simple data files in _______________ format.
5. Multimedia databases include ____________, ____________, ___________ and _____________.
6. Descriptive data mining task is to describe _______________
7. Association is characterized by ____________ and ____________.
8. Clustering is called ___________________ .
9. Outliers are data elements that cannot be grouped in a given ___________
10. ______________ cost of machine learning makes data mining popular.
ANSWERS FOR FILL UP THE BLANKS.
1. Extracting
2. Noise and irrelevant
3. User and data mining
4. Text / binary
5. Video, images, audio and Text
6. General properties of the existing data.
7. Support and confidence.
8. Unsupervised classification.
9. Class / Clusters.
10. Low.
II. ANSWER THE FOLLOWING QUESTIONS
1. Explain the evolution of database technology.
2. Explain the KDD processes in detail.
3. Explain the architecture of a data mining system.
4. Explain the functions of data mining.
5. Explain the categories of data mining system.
6. Give the reasons for the growing popularity of data mining.
Chapter 10
Data Preprocessing and Data Mining Primitives
10.0 INTRODUCTION
For a successful data mining operation, the data must be consistent, reliable and appropriate. The
appropriate data must be selected, missing fields and incorrect values rectified, unnecessary
information removed and where data comes from different sources, the format of field values may
need to be altered to ensure they are interpreted correctly.
It is rather straightforward to apply DM modelling tools to data and judge the value of resulting models
based on their predictive or descriptive value. This does not diminish the role of careful attention to data
preparation efforts.
10.1 DATA PREPARATION
The data preparation process is roughly divided into data selection, data cleaning, formation of new data
and data formatting.
10.1.1 Select Data
A subset of the data acquired in previous stages is selected, based on criteria stressed in those stages:
Data quality properties: completeness and correctness
Technical constraints such as limits on data volume or data type: this is basically related to the data
mining tools planned earlier for use in modelling.
10.1.2 Data Cleaning
This step complements the previous one. It is also the most time consuming, due to the many possible
techniques that can be implemented to optimize data quality for the future modelling stage. Possible
techniques for data cleaning include:
Data Normalization. For example, decimal scaling into the range (0, 1), or standard deviation
(z-score) normalization.
Data Smoothing. Discretization of numeric attributes is one example; this is helpful or even
necessary for logic-based methods.
Treatment of Missing Values. There is no simple and safe solution for cases where
some of the attributes have a significant number of missing values. Generally, it is good to
experiment both with and without these attributes in the modelling phase, in order to find out the
importance of the missing values. Simple solutions (sketched in code after this list) are:
a) Replace all missing values with a single global constant.
b) Replace a missing value with its feature mean.
c) Replace a missing value with its feature and class mean.
The main flaw of these simple solutions is that the substituted value is not the correct value, which means
that the data will be biased. If the missing values can be isolated to only a few features, we can instead try
deleting the examples containing missing values, or deleting the attributes containing most of the missing values.
Another, more sophisticated solution is to try to predict the missing values with a data mining tool; in this
case predicting the missing values becomes a special data mining prediction problem in its own right.
Data Reduction. Reasons for data reduction are in most cases twofold: either the data may be
too big for the program, or the expected time for obtaining the solution might be too long. The
techniques for data reduction are usually effective but imperfect. The most usual step for data
dimension reduction is to examine the attributes and consider their predictive potential. Some of
the attributes can usually be discarded, either because they are poor predictors or are redundant
relative to some other good attribute. Some of the methods for data reduction through attribute
removal are
a) Attribute selection from means and variances
b) Using principal component analysis
c) Merging features using linear transform.
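As a minimal sketch of the three simple treatments of missing values listed above (assuming the pandas library; the 'income' attribute, 'class' label and figures are invented), note that none of the substitutions recovers the true value, which is exactly the bias discussed above:

import pandas as pd
import numpy as np

# Hypothetical data set with missing values in the 'income' attribute.
df = pd.DataFrame({
    "income": [25000, np.nan, 40000, np.nan, 52000, 61000],
    "class":  ["risky", "risky", "safe", "safe", "safe", "safe"],
})

# a) Replace all missing values with a single global constant.
a = df["income"].fillna(0)

# b) Replace a missing value with its feature (column) mean.
b = df["income"].fillna(df["income"].mean())

# c) Replace a missing value with the mean of its feature within the same class.
c = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(a.tolist())
print(b.tolist())
print(c.tolist())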
10.1.3 New Data Construction
This step represents constructive operations on the selected data, which include:
Derivation of new attributes from two or more existing attributes
Generation of new records (samples)
Data transformation: here data are transformed or consolidated into forms appropriate for
mining (a short sketch follows this list). Data transformation can involve the following:
i) Smoothing, which works to remove noise from the data. Such techniques include
binning, clustering and regression.
ii) Aggregation where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of the
data at multiple granularities.
iii) Generalization of the Data where low-level or primitive data are replaced by higher-
level concepts through the use of concept hierarchies. For example, numeric attributes
like age may be mapped to higher-level concepts like young, middle-aged and senior.
iv) Normalization where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0.
v) Attribute Construction where new attributes are constructed and added from the
given set of attributes to help the mining process.
Merging Tables: joining together two or more tables having different attributes for the same objects
Aggregations: operations in which new attributes are produced by summarizing information
from multiple records and/or tables into new tables with summary attributes
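A brief sketch of three of these operations (aggregation to monthly totals, generalization of a numeric attribute through a concept hierarchy, and min-max normalization into the range -1.0 to 1.0) follows; it again assumes pandas, and the column names, bins and values are invented:

import pandas as pd

# Hypothetical daily sales records.
sales = pd.DataFrame({
    "date":   pd.to_datetime(["2012-03-01", "2012-03-15", "2012-04-02", "2012-04-20"]),
    "amount": [1200.0, 800.0, 1500.0, 700.0],
    "age":    [19, 34, 52, 67],
})

# Aggregation: roll the daily amounts up into monthly totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

# Generalization: map the numeric attribute 'age' onto higher-level concepts.
sales["age_group"] = pd.cut(sales["age"], bins=[0, 30, 55, 120],
                            labels=["young", "middle-aged", "senior"])

# Normalization: min-max scale 'amount' into the range [-1.0, 1.0].
lo, hi = sales["amount"].min(), sales["amount"].max()
sales["amount_scaled"] = 2 * (sales["amount"] - lo) / (hi - lo) - 1

print(monthly)
print(sales[["age", "age_group", "amount", "amount_scaled"]])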
10.1.4 Data Formatting
This is the final data preparation step; it represents syntactic modifications to the data that do not change its
meaning but are required by the particular modelling tool chosen for the DM task. These include:
Reordering of the attributes or records: some modelling tools require reordering of the attributes
(or records) in the dataset, such as putting the target attribute at the beginning or at the end, or
randomizing the order of records (required by neural networks, for example)
Changes related to the constraints of modelling tools: removing commas, tabs or special characters,
trimming strings to the maximum allowed number of characters, or replacing special characters with
an allowed set of special characters.
There is also what DM practitioners call the standard form of data (although there is no
standard format of data that can be readily read by all modelling tools). Standard form refers primarily to
readily usable data types:
Binary variables (1 for true; 0 for false)
Ordered variables (numeric)
In standard form, a categorical variable is transformed into m binary variables, where m is
the number of possible values of that variable. Since distinct DM modelling tools usually prefer
either categorical or ordered attributes, the standard form is a data presentation that is uniform and
effective across a wide spectrum of DM modelling tools and other exploratory tools.
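The categorical-to-binary transformation described above is what is commonly called one-hot encoding. A minimal sketch with pandas (the 'city' attribute and its values are invented) is:

import pandas as pd

# Hypothetical categorical attribute with m = 3 possible values.
df = pd.DataFrame({"city": ["Bangalore", "Mumbai", "Delhi", "Mumbai"]})

# Transform the categorical variable into m binary (0/1) indicator variables.
standard_form = pd.get_dummies(df, columns=["city"]).astype(int)
print(standard_form)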
10.2 DATA MINING PRIMITIVES
Users communicate with the data mining system using a set of data mining primitives designed in
order to facilitate efficient and fruitful knowledge discovery. The primitives include the specification of
the portion of the database or the set of data in which the user is interested, the kinds of knowledge to be
mined, background knowledge useful in guiding the discovery process, interestingness measures for pattern
evaluation and how the discovered knowledge should be visualized. These primitives allow the user to
interactively communicate with the data mining system during discovery in order to examine the findings
from different angles or depths and direct the mining process.
10.2.1 Defining Data Mining Primitives
Each user will have a data mining task in mind. A data mining task can be specified in the form of a
data mining query, which is input to the data mining system. A data mining query is defined in terms of the
following primitives
a) Task-Relevant Data: This is the portion of the database to be investigated. For example, suppose a company
XYZ does business in two states, Karnataka and Tamilnadu. If the manager for Karnataka wants to
know the total number of sales in Karnataka only, then only the data related to Karnataka should be
accessed. The attributes involved are referred to as relevant attributes.
b) The Kinds of Knowledge to be Mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association, classification, clustering and evolution
analysis. For example, if studying the buying habits of customers in Karnataka, we may choose
to mine associations between customer profiles and the items that these customers like to buy.
c) Background Knowledge: Users can specify background knowledge, or knowledge about the
domain to be mined. This knowledge is useful for guiding the knowledge discovery process and
for evaluating the patterns found. There are several kinds of background knowledge. There is
one popular background knowledge known as concept hierarchies. Concept hierarchies are
useful in that they allow data to be mined at multiple levels of abstraction. These can be used to
evaluate the discovered patterns according to their degree of unexpectedness or expectedness.
d) Interestingness Measures: These functions are used to separate uninteresting patterns from
knowledge. They may be used to guide the mining process or after discovery to evaluate the
discovered patterns. Different kinds of knowledge may have different interestingness measures.
For example, interestingness measures for association rules include support (the percentage of
task-relevant data tuples for which the rule pattern appears) and confidence (an estimate of the
strength of the implication of the rule). Rules whose support and confidence values are below
user-specified thresholds are considered uninteresting.
e) Presentation and Visualization of Discovered Patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for knowledge
presentation such as rules, tables, charts, graphs, decision trees and cubes.
Figure 10.1 Primitives for specifying a data mining task
(The figure lists the five primitives: task-relevant data, specified by the database or data warehouse name, the tables or cubes, conditions for data selection, relevant attributes or dimensions, and data grouping criteria; the kind of knowledge to be mined, such as characterization, discrimination, association, classification/prediction or clustering; background knowledge, such as concept hierarchies and user beliefs about relationships in the data; pattern interestingness measures, such as simplicity, certainty (e.g. confidence), utility (e.g. support) and novelty; and the visualization of discovered patterns, as rules, tables, reports, charts, graphs, decision trees and cubes, with drill-down and roll-up.)
10.3 A DATA MINING QUERY LANGUAGE
A data mining query language helps in effective knowledge discovery from data mining systems. Designing
a comprehensive data mining language is challenging because data mining covers a wide spectrum of
tasks from data characterization to mining association rules, data classification and evolution analysis.
Each task has different requirements. The design of an effective data mining query language requires a
deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining
tasks.
10.3.1 Syntax for Task-Relevant Data Specification
The first step in defining a data mining task is the specification of the task-relevant data, that is, the
data on which mining is to be performed. This involves specifying the database and tables or data
warehouse containing the relevant data, conditions for selecting the relevant data, the relevant attributes
or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved.
The Data Mining Query Language (DMQL) provides clauses for the specification of such information, as follows:
use database (database_name) or use data warehouse (data_warehouse_name): The use
clause directs the mining task to the database or data warehouse specified.
from (relation(s)/cube(s)) [where(condition)]: The from and where clauses respectively specify
the database tables or data cubes involved, and the conditions defining the data to be retrieved.
in relevance to (attribute_or_dimension_list): This clause lists the attributes or dimensions for
exploration.
order by (order_list): The order by clause specifies the sorting order of the task relevant data.
group by (grouping_list): the group by clause specifies criteria for grouping the data.
having (condition): The having clause specifies the condition by which groups of data are
considered relevant.
Top-Level Syntax of the Data Mining Query Language, DMQL
(DMQL) ::= (DMQL_Statement); {(DMQL_Statement)}
(DMQL_Statement) ::=(Data_Mining_Statement)
| (Concept_Hierarchy_Definition_Statement)
| (Visualization_and_Presentation)
(Data_Mining_Statement) :: =
use database (Database_name) | use data warehouse (Data_warehouse_name)
{use hierarchy (hierarchy_name) for (attribute_or_dimension)|
(Mine_Knowledge_Specification)
in relevance to (attribute_or_dimension_list)
from (relation(s)/cube(s))
[where (condition)]
[order by (order_list)]
[group by (grouping_list)]
[having (condition)]
[with [(interest_measure_name)] threshold =(threshold_value)
[for(attribute(s)))]}
(Mine_Knowledge_Specification) ::= (Mine_Char) | (Mine_Discr) | (Mine_Assoc) | (Mine_Class)
(Mine_Char) ::= mine characteristics [as (pattern_name)]
analyze(measure(s))
(Mine_Discr) ::= mine comparison [as (pattern_name) ]
for (target_class) where (target_condition)
{versus (contrast_class_i) where (contrast_condition_i)}
analyze(measure(s))
(Mine_Assoc) ::= mine association [as (pattern_name)]
[matching (metapattern) ]
(Mine_Class) ::= mine classification [as (pattern_name)]
analyze (classifying_attribute_or_dimension)
(Concept_Hierarchy_Definition_Statement) ::=
define hierarchy (hierarchy_name)
[for (attribute_or_dimension)]
on (relation_or_cube_or_hierarchy)
as (hierarchy_description)
[where (condition)]
(Visualization_and_presentation) :: =
display as (result_form) | {(Multilevel_Manipulation)}
(Multilevel_Manipulation)::= roll up on (attribute_or_dimension)
| drill down on (attribute_or_dimension)
| add (attribute_or_dimension)
| drop (attribute_or_dimension)
Syntax for Specifying the Kind of Knowledge to be Mined
The (Mine_Knowledge_Specification) statement is used to specify the kind of knowledge to be mined.
In other words, it indicates the data mining functionality to be performed. Its syntax is defined below for
characterization, discrimination, association, and classification.
Characterization:
(Mine_Knowledge_Specification) ::=
mine characteristics [as (pattern_name)]
analyze (measure(s))
This specifies that characteristic descriptions are to be mined. The analyze clause, when used for
characterization, specifies aggregate measures, such as count, sum, or count % (percentage count, i.e.,
the percentage of tuples in the relevant data set with the specified characteristics). These measures are
to be computed for each data characteristic found.
Syntax for Concept Hierarchy Specification
Concept hierarchies allow the mining of knowledge at multiple levels of abstraction. In order to
accommodate the different viewpoints of users with regard to the data, there may be more than one
concept hierarchy per attribute or dimension. For instance, some users may prefer to organize branch
locations by provinces and states, while others may prefer to organize them according to languages used.
In such cases, a user can indicate which concept hierarchy is to be used with the statement
use hierarchy (hierarchy_name) for (attribute_or_dimension)
Otherwise, a default hierarchy per attribute or dimension is used.
Syntax for Interestingness Measure Specification
The user can help control the number of uninteresting patterns returned by the data mining system by
specifying measures of pattern interestingness and their corresponding thresholds. Interestingness measures
include the confidence, support, noise and novelty measures. Interestingness measures and thresholds
can be specified by the user with the statement
with [(interest_measure_name)] threshold =(threshold_value)
Syntax for Pattern Presentation and Visualization Specification.
How can users specify the forms of presentation and visualization to be used in displaying the discovered
patterns? Our data mining query language needs syntax that allows users to specify the display of
discovered patterns in one or more forms, including rules, tables, crosstabs, pie or bar charts, decision
trees, cubes, curves, or surfaces. We define the DMQL display statement for this purpose:
display as (result_form)
where (result_form) could be any of the knowledge presentation or visualization forms listed
above.
Interactive mining should allow the discovered patterns to be viewed at different concept levels or
from different angles. This can be accomplished with roll-up and drill-down operations, as described
earlier. Patterns can be rolled up, or viewed at a more general level, by climbing up the concept hierarchy
of an attribute or dimension (replacing lower-level concept values by higher-level values). Generalization
can also be performed by dropping attributes or dimensions. For example, suppose that a pattern contains
the attribute city. Given the location hierarchy city < province_or_state < country < continent, dropping
the attribute city from the patterns will generalize the data to the next highest level attribute,
province_or_state. Patterns can be drilled down on, or viewed at a less general level, by stepping down
the concept hierarchy of an attribute or dimension. Patterns can also be made less general by adding
attributes or dimensions to their description. The attribute added must be one of the attributes listed in the
in relevance to clause of the task-relevant data specification. The user can alternatively view the patterns at
different levels of abstraction with the use of the following DMQL syntax.
(Multilevel_Manipulation) ::= roll up on (attribute_or_dimension)
| drill down on (attribute_or_dimension)
| add (attribute_or_dimension)
| drop (attribute_or_dimension)
10.4 DESIGNING GRAPHICAL USER INTERFACES BASED ON
A DATA MINING QUERY LANGUAGE
A data mining query language provides necessary primitives that allow users to communicate with
data mining systems. But novice users may find a data mining query language difficult to use and its syntax
difficult to remember. Instead, users may prefer to communicate with data mining systems through a
graphical user interface (GUI). In relational database technology, SQL serves as a standard core language
for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query
language may serve as a core language for data mining system implementations, providing a basis for the
development of GUIs for effective data mining. A data mining GUI may consist of the following functional
components:
a) Data collection and data mining query composition - This component allows the user to specify
task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the
specification of relational queries.
b) Presentation of discovered patterns - This component allows the display of the discovered
patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.
c) Hierarchy specification and manipulation - This component allows for concept hierarchy
specification, either manually by the user or automatically. In addition, this component should
allow concept hierarchies to be modified by the user or adjusted automatically based on a given
data set distribution.
d) Manipulation of data mining primitives - This component may allow the dynamic adjustment of
data mining thresholds, as well as the selection, display and modification of concept hierarchies.
It may also allow the modification of previous data mining queries or conditions.
e) Interactive multilevel mining - This component should allow roll-up or drill-down operations on
discovered patterns.
f) Other miscellaneous information - This component may include on-line help manuals, indexed
search, debugging and other interactive graphical facilities.
The design of a GUI should take into consideration the different classes of users of a data mining system.
In general, the users of data mining systems can be classified into two types:
1) Business analysts
2) Business executives
A business analyst would like to have flexibility and convenience in selecting different portions of data,
manipulating dimensions and levels, setting mining parameters and tuning the data mining processes.
A business executive needs clear presentation and interpretation of data mining results, flexibility in
viewing and comparing different data mining results and easy integration of data mining results into report
writing and presentation processes. A well-designed data mining system should provide friendly user
interfaces for both kinds of users.
10.5 ARCHITECTURES OF DATA MINING SYSTEMS
A good system architecture will enable the system to make best use of the software environment,
accomplish data mining tasks in an efficient and timely manner, interoperate and exchange information
with other information systems, be adaptable to users' different requirements and evolve with time.
To identify the desired architectures for data mining systems, we consider how a data mining system may be
integrated or coupled with a database/data warehouse system, using the following schemes:
a) No-coupling
b) Loose coupling
c) Semitight coupling
d) Tight-coupling
a) No coupling - It means that the data mining system will not utilize any function of a database or data
warehouse system. Such a system fetches data from a particular source such as a file,
processes the data using some data mining algorithms and then stores the mining results in another file.
This approach has some disadvantages:
1) A database system provides a great deal of flexibility and efficiency at storing, organizing, accessing
and processing data. Without this, a data mining system working on flat files may spend a substantial amount of
time finding, collecting, cleaning and transforming data.
2) There are many tested, scalable algorithms and data structures implemented in database and
data warehousing systems. It is feasible to realize efficient, scalable implementations using
such systems. Most data have been or will be stored in database or data warehouse systems.
Without any coupling to such systems, a data mining system will need to use other tools to
extract data, making it difficult to integrate such a system into an information processing
environment. Hence, no coupling represents a poor design.
b) Loose coupling - It means that the DM system will use some facilities of a DB or DW system,
fetching data from a data repository managed by these systems, performing data mining and then storing
the mining results either in a file or in a designated place in a database or data warehouse. Loose coupling
is better than no coupling since it can fetch any portion of the data stored in databases or data warehouses by
using query processing, indexing and other system facilities. However, many loosely
coupled mining systems are main-memory based. Since the mining itself does not exploit the data structures and
query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve
high scalability and good performance with large data sets.
c) Semitight coupling - It means that besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives can be provided in the DB/DW system. These
primitives can include sorting, indexing, aggregation, histogram analysis, multiway join and precomputation
of some essential statistical measures, such as sum, count, max, min, standard deviation and so on. Some
frequently used intermediate mining results can also be precomputed and stored efficiently; this design
enhances the performance of a DM system.
d) Tight coupling - It means that a DM system is smoothly integrated into a DB/DW system. The
data mining subsystem is treated as one functional component of an information system. Data mining
queries and functions are optimized based on mining query analysis, data structures, indexing schemes
and query processing methods of a DB or DW system. This approach is highly desirable since it facilitates
efficient implementation of data mining functions, high system performance and an integrated information
processing environment.
10.6 EXERCISE
I. FILL UP THE BLANKS
1. For successful data mining operations the data must be ________ and ________.
2. The first stage of data preparation is ________.
3. The data smoothing process comes in the stage of ________.
4. ________ helps in effective knowledge discovery from data mining systems.
5. There are two types of users in data mining systems: ________ and ________.
6. With ________ coupling, a DM system will not utilize any function of a database or data warehousing system.
7. With ________ coupling, a DM system is smoothly integrated into a DB/DW system.
ANSWERS FOR FILL UP THE BLANKS
1. Consistent, reliable
2. data selection
3. Data cleaning
4. Data mining language
5. business analysts, business executives
6. no
7. tight
II. ANSWER THE FOLLOWING QUESTIONS
1. Explain the steps of data cleaning.
2. Explain the steps of new data construction.
3. Explain the data mining primitives.
4. In detail, explain the data mining querying system.
Chapter 11
Data Mining Techniques
11.0 INTRODUCTION
The discovery stage of the KDD process is fascinating. Here we discuss some of the important
methods and in this way get an idea of the opportunities that are available as well as some of the
problems that occur during the discovery stage. We shall see that some learning algorithms do
well on one part of the data set where others fail and this clearly indicates the need for hybrid learning.
Data mining is not so much a single technique as the idea that there is more knowledge hidden in the
data than shows itself on the surface. Any technique that helps extract more out of our data is useful, so
data mining techniques form quite a heterogeneous group. In this chapter we discuss some of these techniques.
11.1 ASSOCIATIONS
Given a collection of items and a set of records, each of which contains some number of items from the
given collection, an association function is an operation against this set of records which returns affinities
or patterns that exist among the collection of items. These patterns can be expressed by rules such as
"72% of all the records that contain items A, B and C also contain items D and E." The specific percentage
of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are
said to be on an opposite side of the rule to D and E. Associations can involve any number of items on
either side of the rule.
The market-basket problem assumes we have some large number of items, e.g., bread, milk.
Customers fill their market baskets with some subset of the items, and we get to know what items people
buy together, even if we don't know who they are. Marketers use this information to position items and
control the way a typical customer traverses the store.
In addition to the marketing application, the same sort of question has the following uses:
1. Baskets = documents; items = words. Words appearing frequently together in documents may
represent phrases or linked concepts, and can be used for intelligence gathering.
2. Baskets = sentences; items = documents. Two documents with many of the same sentences
could represent plagiarism or mirror sites on the Web.
In the present context, association rules are useful in data mining if we already have a rough idea of what
it is we are looking for. This illustrates the fact that there is no algorithm that will automatically give us
everything that is of interest in the database. An algorithm that finds a lot of rules will probably also find
a lot of useless rules, while an algorithm that finds only a limited number of associations, without fine
tuning, will probably also miss a lot of interesting information.
11.1.1 Data Mining with Apriori Algorithm
The Apriori data mining algorithm discovers items that are frequently associated together.
Let us look at the example of a store that sells DVDs, Videos, CDs, Books and Games. The store
owner might want to discover which of these items customers are likely to buy together. This can be used
to increase the store's cross-sell and upsell ratios. Customers in this particular store may like buying a
DVD and a Game in 10 out of every 100 transactions, or the sale of Videos may hardly ever be associated
with the sale of a DVD.
With the information above, the store could strive for more optimal placement of DVDs and Games,
as the sale of one of them may improve the chances of the sale of the other frequently associated item.
On the other hand, mailing campaigns may be fine-tuned to reflect the fact that offering discount
coupons on Videos may even negatively impact the sales of DVDs offered in the same campaign. A
better decision could be not to offer both DVDs and Videos in a campaign.
To arrive at these decisions, the store may have had to analyze 10,000 past transactions of customers
using calculations that separate frequent, and consequently important, associations from weak and unimportant
associations.
These frequently occurring associations are defined by measures known as Support Count and
Confidence. The support and confidence measures are defined so that all associations can be weighed
and only significant associations analyzed.
The Support Count is the number of transactions, or the percentage of transactions, that feature the
association of a set of items.
Assume that the dataset of 9 transactions below is selected randomly from a universe of 100,000
transactions:
Table 11.1: Market Based Analysis data (use a Minimum Support Percentage of 0.4% and a Minimum
Confidence Percentage of 50% or 100%)

Transaction ID    List of Items Purchased
1                 Books, CD, Video
2                 CD, Games
3                 CD, DVD
4                 Books, CD, Games
5                 Books, DVD
6                 CD, DVD
7                 Books, DVD
8                 Books, CD, DVD, Video
9                 Books, CD, DVD
The Apriori data mining analysis of the 9 transactions above is known as Market Based Analysis,
as it is designed to discover which items in a series of transactions are frequently associated together.
11.1.2 Implementation Steps
1. The Apriori algorithm would analyze all the transactions in a dataset for each item's support
count. Any item that has a support count less than the minimum support count required is
removed from the pool of candidate items.
2. Initially each of the items is a member of the set of 1-Candidate Itemsets. The support count of
each candidate item in the itemset is calculated, and items with a support count less than the
minimum required support count are removed as candidates. The remaining candidate items in
the itemset are joined to create 2-Candidate Itemsets that each comprise two items or
members.
3. The support count of each two-member itemset is calculated from the database of transactions,
and 2-member itemsets that occur with a support count greater than or equal to the minimum
support count are used to create 3-Candidate Itemsets. The process is
repeated, generating 4- and 5-Candidate Itemsets, until the support counts of all the itemsets are
lower than the minimum required support count.
4. All the candidate itemsets generated with a support count greater than the minimum support
count form a set of Frequent Itemsets. These frequent itemsets are then used to generate
association rules with a confidence greater than or equal to the Minimum Confidence.
5. Apriori recursively generates all the subsets of each frequent itemset and creates association
rules based on subsets with a confidence greater than the minimum confidence.
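The numbered procedure above can be sketched directly in code. The following Python fragment is a minimal illustration added for this discussion (it is not part of the original text): it counts the support of candidate itemsets over the nine transactions of Table 11.1, prunes those below a minimum support count, and joins the survivors to form the next level of candidates. The helper names (support_count, apriori) and the minimum support value are assumptions chosen for the example.

from itertools import combinations

# The nine transactions of Table 11.1, each represented as a set of items
transactions = [
    {"Books", "CD", "Video"}, {"CD", "Games"}, {"CD", "DVD"},
    {"Books", "CD", "Games"}, {"Books", "DVD"}, {"CD", "DVD"},
    {"Books", "DVD"}, {"Books", "CD", "DVD", "Video"}, {"Books", "CD", "DVD"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support=2):
    # Return every frequent itemset together with its support count
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]   # 1-Candidate Itemsets
    frequent, k = {}, 1
    while candidates:
        # prune candidates whose support count is below the minimum
        level = {c: support_count(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # join step: build (k+1)-Candidate Itemsets from the frequent k-itemsets
        keys = list(level)
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

for itemset, count in sorted(apriori(transactions).items(), key=lambda x: -x[1]):
    print(sorted(itemset), count)

Association rules would then be read off the frequent itemsets by checking, for each subset of an itemset, whether the ratio of the itemset's support to the subset's support meets the minimum confidence.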
11.1.3 Improving the Efficiency of Apriori
Many variations of the Apriori algorithm have been proposed that focus on improving the
efficiency of the original algorithm.
Hash-based technique (hashing itemset counts): A hash-based technique can be used to reduce the
size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database
to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-
itemsets for each transaction, hash them into the different buckets of a hash table structure and increase
the corresponding bucket counts. A 2-itemset whose corresponding bucket count in the hash table is below
the support threshold cannot be frequent and thus should be removed from the candidate set. Such a hash-
based technique may substantially reduce the number of candidate k-itemsets examined.
Transaction reduction: A transaction that does not contain any frequent k-itemsets cannot contain any
frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration,
since subsequent scans of the database for j-itemsets, where j > k, will not require it.
Sampling: The basic idea of the sampling approach is to pick a random sample S of the given data D
and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of
accuracy against efficiency.
Dynamic itemset counting: A dynamic itemset counting technique was proposed in which the database
is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at
any start point, unlike Apriori, which determines new candidate itemsets only immediately prior to each
complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets
that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be
frequent. The resulting algorithm requires fewer database scans than Apriori.
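As a rough illustration of the hash-based technique, the fragment below is a sketch added here (not part of the text). While the transactions are scanned, every 2-itemset is hashed into a small table of buckets; a candidate pair whose bucket count does not reach the minimum support count cannot be frequent and can be discarded early. The bucket size and function names are arbitrary choices for the example.

from itertools import combinations

def bucket_counts(transactions, num_buckets=7):
    # Hash every 2-itemset of every transaction into a bucket and count it
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def prune_pairs(candidate_pairs, buckets, min_support, num_buckets=7):
    # A 2-itemset can only be frequent if its bucket count reaches the minimum support
    return [p for p in candidate_pairs
            if buckets[hash(tuple(sorted(p))) % num_buckets] >= min_support]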
11.2 DATA MINING WITH DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of
tree-based methods is due in large part to the fact that decision trees are simple and that they represent rules.
Rules can readily be expressed so that humans can understand them, or in a database access language
like SQL so that records falling into a particular category may be retrieved.
In some applications, the accuracy of a classification or prediction is the only thing that matters; if a
direct mail firm obtains a model that can accurately predict which members of a prospect pool are most
likely to respond to a certain solicitation, they may not care how or why the model works. In other
situations, the ability to explain the reason for a decision is crucial. In health insurance underwriting, for
example, there are legal prohibitions against discrimination based on certain variables. An insurance company
could find itself in the position of having to demonstrate to the satisfaction of a court of law that it has not
used illegal discriminatory practices in granting or denying coverage. There are a variety of algorithms for
building decision trees that share the desirable trait of explicability.
11.2.1 Decision Tree Working Concept
A decision tree is a classifier in the form of a tree structure where each node is either:
a leaf node, indicating a class of instances, or
a decision node that specifies some test to be carried out on a single attribute value, with one
branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving
through it until a leaf node is reached, which provides the classification of the instance.
Example: Decision making in the Bombay stock market is shown in Figure 11.3.
Suppose that the major factors affecting the Bombay stock market are:
what it did yesterday;
what the New Delhi market is doing today;
bank interest rate;
unemployment rate;
India's prospects at cricket.
The table shown in Figure 11.2 is a small illustrative dataset of six days about the Bombay stock
market. Each row gives the data of one day according to the five factors above, and the last column shows
the observed result (Yes (Y) or No (N)) for "It rises today". Figure 11.3 illustrates a typical decision tree
learned from this data.
Figure 11.2: Decision table - examples of a small dataset on the Bombay stock market

Instance No.  It rose yesterday  NDelhi rises today  Bank rate high  Unemployment high  India is losing  It rises today
1             Y                  Y                   N               N                  Y                Y
2             Y                  N                   Y               Y                  Y                Y
3             N                  N                   N               Y                  Y                Y
4             Y                  N                   Y               N                  Y                N
5             N                  N                   N               N                  Y                N
6             N                  N                   Y               N                  Y                N
Figure 11.3: A decision tree for the Bombay stock market. (Root: "Is unemployment high?" - Yes: "The Bombay market will rise today" {2, 3}; No: "Is the New Delhi market rising today?" - Yes: "The Bombay market will rise today"; No: "The Bombay market will not rise today" {4, 5, 6}.)
The process of predicting an instance by this decision tree can also be expressed by answering the
questions in the following order:
Is unemployment high?
YES: The Bombay market will rise today
NO: Is the New Delhi market rising today?
YES: The Bombay market will rise today
NO: The Bombay market will not rise today.
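The question sequence above is exactly what the learned tree encodes. Expressed as code (a sketch added here for illustration, not taken from the text), the classifier is nothing more than nested conditions:

def bombay_market_rises(unemployment_high, new_delhi_rising):
    # Classify a day using the decision tree of Figure 11.3
    if unemployment_high:
        return True          # "The Bombay market will rise today"
    if new_delhi_rising:
        return True          # "The Bombay market will rise today"
    return False             # "The Bombay market will not rise today"

# Instance 4 of Figure 11.2: unemployment high = N, New Delhi rising = N
print(bombay_market_rises(False, False))   # False: the market will not rise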
Decision tree induction is a typical inductive approach to learn knowledge on classification. The key
requirements to do mining with decision trees are:
Attribute-value description: the object or case must be expressible in terms of a fixed collection of
properties or attributes.
Predefined classes: the categories to which cases are to be assigned must have been established
beforehand (supervised data).
Discrete classes: a case does or does not belong to a particular class, and there must be far more
cases than classes.
Sufficient data: usually hundreds or even thousands of training cases are required.
Logical classification model: a classifier that can be expressed as decision trees or sets of
production rules.
11.2.2 Other Classification Methods
Case-Based Reasoning
Case-based reasoning (CBR) classifiers are instance-based. The samples or cases stored by CBR
are complex symbolic descriptions. Business applications of CBR include problem resolution for customer
service help desks, for example, where cases describe product-related diagnostic problems. CBR has
also been applied to areas such as engineering and law, where cases are either technical designs or legal
rulings, respectively.
When given a new case to classify, a case-based reasoner will first check if an identical training case
exists. If one is found, then the accompanying solution to that case is returned. If no identical case is
found, then the case-based reasoner will search for training cases having components that are similar to
those of the new case. Conceptually, these training cases may be considered as neighbors of the new
case. If cases are represented as graphs, this involves searching for subgraphs that are similar to subgraphs
within the new case. The case-based reasoner tries to combine the solutions of the neighboring training cases
in order to propose a solution for the new case. If incompatibilities arise with the individual solutions, then
backtracking to search for other solutions may be necessary. The case-based reasoner may employ
background knowledge and problem-solving strategies in order to propose a feasible combined solution.
Rough Set Approach
Rough set theory can be used for classification to discover structural relationships within imprecise or
noisy data. It applies to discrete-valued attributes. Continuous-valued attributes must therefore be discretized
prior to its use.
Rough set theory is based on the establishment of equivalence classes within the given training data.
All of the data samples forming an equivalence class are indiscernible, that is, the samples are identical
with respect to the attributes describing the data. Given real-world data, it is common that some classes
cannot be distinguished in terms of the available attributes. Rough sets can be used to approximately or
roughly define such classes. A rough set definition for a given class C is approximated by two sets: a
lower approximation of C and an upper approximation of C. The lower approximation of C consists of all
of the data samples that, based on the knowledge of the attributes, certainly belong to C, while the upper
approximation consists of all of the samples that cannot be described as not belonging
to C. The lower and upper approximations for a class C are shown in Figure 11.4, where each rectangular
region represents an equivalence class. Decision rules can be generated for each class. Typically, a
decision table is used to represent the rules.
Fig 11.4: A rough set approximation of the set of samples of class C, using the lower and upper approximation sets
of C. The rectangular regions represent equivalence classes
Rough sets can also be used for feature reduction (where attributes that do not contribute towards the
classification of the given training data can be identified and removed) and relevance analysis (where the
contribution or significance of each attribute is assessed with respect to the classification task). The
problem of finding the minimal subsets (reducts) of attributes that can describe all of the concepts in the
given data set is NP-hard. However, algorithms to reduce the computation intensity have been proposed.
In one method, for example, a discernibility matrix is used that stores the differences between attribute
values for each pair of data samples. Rather than searching on the entire training set, the matrix is instead
searched to detect redundant attributes.
Fig 11.5: Fuzzy values for income
Fuzzy Set Approaches
Rule-based systems for classification have the disadvantage that they involve sharp cutoffs for
continuous attributes. For example, consider the following rule for customer credit application approval.
The rule essentially says that applications from customers who have had a job for two or more years and
who have a high income (i.e., of at least Rs 50K) are approved:
IF (years_employed >= 2) AND (income > 50K) THEN credit = approved
Figure 11.5 shows how values for the continuous attribute income are mapped into the discrete categories
(low, medium, high), as well as how the fuzzy membership or truth values are calculated. Fuzzy logic systems
typically provide tools to assist users in this conversion.
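As a small, hypothetical illustration of such a mapping (added here, not taken from the text), the function below computes a triangular fuzzy membership degree for the category "medium income". The breakpoints 30, 45 and 60 are arbitrary example values; with them, an income of Rs 49K belongs partly to "medium" rather than falling sharply on one side of a cutoff.

def membership_medium(income_k, low=30.0, peak=45.0, high=60.0):
    # Triangular membership degree of income_k (in Rs 1000s) in the fuzzy set "medium"
    if income_k <= low or income_k >= high:
        return 0.0
    if income_k <= peak:
        return (income_k - low) / (peak - low)     # rising edge of the triangle
    return (high - income_k) / (high - peak)       # falling edge of the triangle

print(round(membership_medium(49), 2))   # about 0.73: partly "medium", partly "high"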
11.2.3 Prediction
The prediction of continuous values can be modeled by statistical techniques of regression. For example,
we may like to develop a model to predict the salary of college graduates with 10 years of work experience
or the potential sales of a new product given its price. Many problems can be solved by linear regression
and even more can be tackled by applying transformations to the variables so that a nonlinear problem can
be converted to a linear one.
Linear and Multiple Regression
In linear regression, data are modeled using a straight line. Linear regression is the simplest form of
regression. Bivariate linear regression models a random variable Y (called a response variable) as a
linear function of another random variable X (called a predictor variable), that is,
Y = α + βX,
where the variance of Y is assumed to be constant, and α and β are regression coefficients specifying
the Y-intercept and slope of the line, respectively. These coefficients can be solved for by the method
of least squares, which minimizes the error between the actual data and the estimate of the line. Given
s samples or data points of the form (x1, y1), (x2, y2), . . ., (xs, ys), the regression coefficients can
be estimated using this method with the following equations:
β = Σ_{i=1..s} (xi − x̄)(yi − ȳ) / Σ_{i=1..s} (xi − x̄)²
α = ȳ − β x̄
where x̄ is the average of x1, x2, . . ., xs, and ȳ is the average of y1, y2, . . ., ys. The coefficients α
and β often provide good approximations to otherwise complicated regression equations.
Table 11.2 Salary Data
Linear regression using the method of least squares: Table 11.2 shows a set of paired data where X is
the number of years of work experience of a college graduate and Y is the corresponding salary of the
graduate. A plot of the data is shown in Figure 11.6, suggesting a linear relationship between the two
variables, X and Y. We model the relationship that salary may be related to the number of years of work
experience with the equation Y = α + βX.
Given the above data, we compute x̄ = 9.1 and ȳ = 55.4. Substituting these values into the above
equations, we get
β = [(3 − 9.1)(30 − 55.4) + (8 − 9.1)(57 − 55.4) + . . . + (16 − 9.1)(83 − 55.4)] / [(3 − 9.1)² + (8 − 9.1)² + . . . + (16 − 9.1)²] = 3.5
α = 55.4 − (3.5)(9.1) = 23.6
Thus, the equation of the least squares line is estimated by Y = 23.6 + 3.5X. Using this equation, we
can predict that the salary of a college graduate with, say, 10 years of experience is Rs. 58.6K.
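The same computation can be written directly from the two estimating equations. The short sketch below is added for illustration only; since the full contents of Table 11.2 are not reproduced in the text, it uses a small hypothetical set of (experience, salary) pairs that includes the points quoted in the calculation above.

def least_squares(xs, ys):
    # Estimate the coefficients of Y = alpha + beta * X by the method of least squares
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Hypothetical (years of experience, salary in Rs 1000s) pairs, not the full Table 11.2
xs = [3, 8, 9, 13, 16]
ys = [30, 57, 64, 72, 83]
alpha, beta = least_squares(xs, ys)
print(round(alpha, 1), round(beta, 1))      # estimated intercept and slope
print(round(alpha + beta * 10, 1))          # predicted salary for 10 years of experience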
Multiple regression is an extension of linear regression involving more than one predictor variable. It
allows the response variable Y to be modeled as a linear function of a multidimensional feature vector. An
example of a multiple regression model based on two predictor attributes or variables, X1 and X2, is
Y = α + β1 X1 + β2 X2
The method of least squares can also be applied here to solve for α, β1, and β2.
Figure 11.6: Plot of the data shown in Table 11.2
11.2.4 Nonlinear Regression
Polynomial regression can be modeled by adding polynomial terms to the basic linear model. By
applying transformations to the variables, we can convert the nonlinear model into a linear one that can
then be solved by the method of least squares.
Transformation of a polynomial regression model to a linear regression model: Consider a
cubic polynomial relationship given by
Y = α + β1 X + β2 X² + β3 X³
To convert this equation to linear form, we define new variables:
X1 = X,  X2 = X²,  X3 = X³
The equation can then be converted to linear form by applying the above assignments,
resulting in the equation
Y = α + β1 X1 + β2 X2 + β3 X3
which is solvable by the method of least squares.
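The transformation can be carried out mechanically: expand each value of X into the features (X, X², X³) and solve the resulting linear system. A minimal NumPy sketch, added here as an illustration under the assumption that NumPy is available (it is not part of the text):

import numpy as np

def fit_cubic(xs, ys):
    # Fit Y = alpha + b1*X + b2*X^2 + b3*X^3 by converting it to a linear model
    X = np.column_stack([np.ones_like(xs), xs, xs**2, xs**3])   # X1 = X, X2 = X^2, X3 = X^3
    coeffs, *_ = np.linalg.lstsq(X, ys, rcond=None)             # least squares solution
    return coeffs                                               # [alpha, b1, b2, b3]

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = 2 + 0.5 * xs - 0.3 * xs**2 + 0.1 * xs**3    # synthetic data used only for illustration
print(fit_cubic(xs, ys).round(2))                # recovers approximately [2.0, 0.5, -0.3, 0.1]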
Some models are intractably nonlinear (such as a sum of exponential terms, for example) and cannot
be converted to a linear model. For such cases, it may be possible to obtain least squares estimates through
extensive calculations on more complex formulae.
11.2.5 Other Regression Models
Linear regression is used to model continuous valued functions. It is widely used, owing largely to its
simplicity. Generalized linear models represent the theoretical foundation on which linear regression can
be applied to the modeling of categorical response variables. In generalized linear models, the variance of
the response variable Y is a function of the mean value of Y, unlike in linear regression, where the
variance of Y is constant. Common types of generalized linear models include logistic regression and
Poisson regression. Logistic regression models the probability of some event occurring as a linear function
of a set of predictor variables. Count data frequently exhibit a Poisson distribution and are commonly
modeled using Poisson regression.
Log-linear models approximate discrete multidimensional probability distributions. They may be used
to estimate the probability value associated with data cube cells. For example, suppose we are given data
for the attributes city, item, year, and sales. In the log-linear method, all attributes must be categorical,
so continuous-valued attributes must first be discretized. The method can then be used to estimate the
probability of each cell in the 4-D base cuboid for the given attributes,
based on the 2-D cuboids for city and item, city and year, city and sales, and the 3-D cuboid for item, year,
and sales. In this way, an iterative technique can be used to build higher-order data cubes from lower-order
ones. In addition to prediction, the log-linear model is useful for data compression (since the smaller-order cuboids together
typically occupy less space than the base cuboid) and data smoothing (since cell estimates in the smaller-
order cuboids are less subject to sampling variations than cell estimates in the base cuboid).
11.3 CLASSIFIER ACCURACY
Estimating classifier accuracy is important in that it allows one to evaluate how accurately a given
classifier will label future data, that is, data on which the classifier has not been trained. For example, if
data from previous sales are used to train a classifier to predict customer purchasing behavior, we would
like some estimate of how accurately the classifier can predict the purchasing behavior of future customers.
Accuracy estimates also help in the comparison of different classifiers.
Figure 11.7: Estimating classifier accuracy with the holdout method.
11.3.1 Estimating Classifier Accuracy
Using training data to derive a classifier and then to estimate the accuracy of the classifier can result
in misleading overoptimistic estimates due to over specialization of the learning algorithm (or model) to the
data. Holdout and cross-validation are two common techniques for assessing classifier accuracy, based
on randomly sampled partitions of the given data.
In the holdout method, the given data are randomly partitioned into two independent sets, a training
set and a test set. Typically, two thirds of the data are allocated to the training set, and the remaining one
third is allocated to the test set. The training set is used to derive the classifier, whose accuracy is
estimated with the test set. The estimate is pessimistic since only a portion of the initial data is used to
derive the classifier. Random subsampling is a variation of the holdout method in which the holdout
method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies
obtained from each iteration.
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or
folds, S1, S2, . . ., Sk, each of approximately equal size. Training and testing is performed k times. In
iteration i, the subset Si is reserved as the test set, and the remaining subsets are collectively used to train
the classifier. That is, the classifier of the first iteration is trained on subsets S2, . . ., Sk and tested on S1,
and so on. The accuracy estimate is the overall number of correct classifications from the k iterations,
divided by the total number of samples in the initial data. In stratified cross-validation, the folds are
stratified so that the class distribution of the samples in each fold is approximately the same as that in the
initial data.
Other methods of estimating classifier accuracy include bootstrapping, which samples the given training
instances uniformly with replacement, and leave-one-out, which is k-fold cross-validation with k set to s,
the number of initial samples.
In general, stratified 10-fold cross-validation is recommended for estimating classifier accuracy (even
if computation power allows using more folds) due to its relatively low bias and variance.
The use of such techniques to estimate classifier accuracy increases the overall computation time, yet
is useful for selecting among several classifiers.
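The holdout and k-fold procedures are easy to express in code. The sketch below is an illustration added here (not from the text); the train and accuracy arguments are hypothetical placeholders for whatever classifier is being evaluated.

import random

def k_fold_accuracy(samples, labels, train, accuracy, k=10, seed=0):
    # Estimate classifier accuracy by k-fold cross-validation.
    # train(X, y) must return a fitted classifier and accuracy(clf, X, y) the
    # fraction of correct predictions; both are assumed to be supplied by the caller.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal-sized folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in range(k) if f != i for j in folds[f]]
        clf = train([samples[j] for j in train_idx], [labels[j] for j in train_idx])
        scores.append(accuracy(clf,
                               [samples[j] for j in test_idx],
                               [labels[j] for j in test_idx]))
    return sum(scores) / k   # average accuracy over the k iterations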
11.4 BAYESIAN CLASSIFICATION
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such
as the probability that a given sample belongs to a particular class.
Bayesian classification is based on Bayes theorem, described below. Studies comparing classification
algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be comparable
in performance with decision tree and neural network classifiers. Bayesian classifiers have also exhibited
high accuracy and speed when applied to large databases.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent
of the values of the other attributes. This assumption is called class conditional independence. It is made
to simplify the computations involved and, in this sense, is considered "naive". Bayesian belief networks
are graphical models which, unlike naive Bayesian classifiers, allow the representation of dependencies
among subsets of attributes. Bayesian belief networks can also be used for classification.
11.4.1 Bayes Theorem
Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data
sample X belongs to a specified class C. For classification problems, we want to determine P(H|X), the
probability that the hypothesis H holds given the observed data sample X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example,
suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X
is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that
X is an apple given that we have seen that X is red and round. In contrast, P(H) is the prior probability, or
a priori probability, of H. For our example, this is the probability that any given data sample is an apple,
regardless of how the data sample looks. The posterior probability, P(H|X), is based on more information
(such as background knowledge) than the prior probability, P(H), which is independent of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is
red and round given that we know that it is true that X is an apple. P(X) is the prior probability of X.
P(X), P(H), and P(X|H) may be estimated from the given data, as we shall see below. Bayes theorem
is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and
P(X|H). Bayes theorem is
P(H|X) = P(X|H) P(H) / P(X)
11.4.2 Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, . . ., xn),
depicting n measurements made on the sample from n attributes, respectively, A1, A2, . . ., An.
2. Suppose that there are m classes, C1, C2, . . ., Cm. Given an unknown data sample, X (i.e.,
having no class label), the classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X. That is, the naive Bayesian classifier assigns an
unknown sample X to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = . . . = P(Cm), and we would therefore maximize P(X|Ci). Otherwise,
we maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci)
= si/s, where si is the number of training samples of class Ci, and s is the total number of
training samples.
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption
of class conditional independence is made. This presumes that the values of the attributes
are conditionally independent of one another, given the class label of the sample, that is, there
are no dependence relationships among the attributes. Thus,
P(X|Ci) = Π_{k=1..n} P(xk|Ci)
The probabilities P(x1|Ci), P(x2|Ci), . . ., P(xn|Ci) can be estimated from the training samples,
where
a) If Ak is categorical, then P(xk|Ci) = sik / si, where sik is the number of training samples of class
Ci having the value xk for Ak, and si is the number of training samples belonging to Ci.
b) If Ak is continuous-valued, then the attribute is typically assumed to have a Gaussian
distribution, so that
P(xk|Ci) = g(xk, μCi, σCi) = (1 / (√(2π) σCi)) exp(−(xk − μCi)² / (2σCi²)),
where g(xk, μCi, σCi) is the Gaussian (normal) density function for attribute Ak, while μCi
and σCi are the mean and standard deviation, respectively, given the values for attribute Ak
for training samples of class Ci.
5. In order to classify an unknown sample X, P(X|Ci) P(Ci) is evaluated for each class Ci.
Sample X is then assigned to the class Ci if and only if
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.
In other words, it is assigned to the class Ci for which P(X|Ci) P(Ci) is the maximum.
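For categorical attributes, the five steps above reduce to a handful of counting operations. The following is a minimal sketch added for illustration (it is not the text's own implementation, and it omits the Gaussian case for continuous attributes):

from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    # Estimate P(Ci) and P(xk|Ci) by counting (categorical attributes only)
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)       # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for k, value in enumerate(x):
            value_counts[(c, k)][value] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}   # P(Ci) = si / s
    return priors, value_counts, class_counts

def classify(x, priors, value_counts, class_counts):
    # Assign x to the class Ci maximizing P(X|Ci) P(Ci) under the naive assumption
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            score *= value_counts[(c, k)][value] / class_counts[c]   # P(xk|Ci) = sik / si
        if score > best_score:
            best_class, best_score = c, score
    return best_class

In practice a small smoothing constant is usually added to the counts so that a single unseen attribute value does not drive the whole product to zero.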
In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers.
However, in practice this is not always the case owing to inaccuracies in the assumption made for its use,
such as class conditional independence and the lack of available probability data. However, various empirical
studies of this classifier in comparison to decision tree and neural network classifiers have found it to be
comparable in some domains.
Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers
that do not explicitly use Bayes theorem. For example, under certain assumptions, it can be shown that
many neural networks and curve-fitting algorithms output the maximum posteriori hypothesis, as does the
naive Bayesian classifier.
11.4.3 Bayesian Belief Networks
The naive Bayesian classifier makes the assumption of class conditional independence, that is, given
the class label of a sample, the values of the attributes are
Figure 11.8: A simple Bayesian belief network and conditional probability table for the variable Lung Cancer
conditionally independent of one another. This assumption simplifies computation. When the assumption
holds true, then the naive Bayesian classifier is the most accurate in comparison with all other classifiers.
In practice, however, dependencies can exist between variables. Bayesian belief networks specify joint
conditional probability distributions. They allow class conditional independencies to be defined between
subsets of variables. They provide a graphical model of causal relationships, on which learning can be
performed. These networks are also known as belief networks, Bayesian networks, and probabilistic
networks. For brevity, we will refer to them as belief networks.
A belief network is defined by two components. The first is a directed acyclic graph, where each node
represents a random variable and each arc represents a probabilistic dependence. If an arc is drawn from
a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each
variable is conditionally independent of its nondescendants in the graph, given its parents. The variables
may be discrete or continuous-valued. They may correspond to actual attributes given in the data or to
hidden variables believed to form a relationship (such as medical syndromes in the case of medical
data).
Figure 11.8 shows a simple belief network for six Boolean variables. The arcs allow a representation of
causal knowledge. For example, having lung cancer is influenced by a person's family history of lung
cancer, as well as whether or not the person is a smoker. Furthermore, the arcs also show that the
variable Lung Cancer is conditionally independent of Emphysema, given its parents, Family History and
Smoker. This means that once the values of Family History and Smoker are known, then the variable
Emphysema does not provide any additional information regarding Lung Cancer.
The second component defining a belief network consists of one conditional probability table (CPT)
for each variable. The CPT for a variable Z specifies the conditional distribution P(z | Parents(Z)), where
Parents(Z) are the parents of Z. The table in Figure 11.8 shows a CPT for Lung Cancer. The conditional probability
for each value of Lung Cancer is given for each possible combination of values of its parents. For
instance, from the upper leftmost and bottom rightmost entries, respectively, we see that
P(Lung Cancer = yes | Family History = yes, Smoker = yes) = 0.8
P(Lung Cancer = no | Family History = no, Smoker = no) = 0.9
The joint probability of any tuple (z1, . . ., zn) corresponding to the variables or attributes Z1, . . .,
Zn is computed by
P(z1, . . ., zn) = Π_{i=1..n} P(zi | Parents(Zi)),
where the values for P(zi | Parents(Zi)) correspond to the entries in the CPT for Zi. A node within the
network can be selected as an output node, representing a class label attribute. There may be more
than one output node. Inference algorithms for learning can be applied on the network. The classification
process, rather than returning a single class label, can return a probability distribution for the class label
attribute, that is, predicting the probability of each class.
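The product formula above amounts to a sequence of CPT lookups, one per variable. The tiny sketch below is an illustration added here; apart from the two probabilities quoted from Figure 11.8, the CPT entries and priors are made-up values.

def joint_probability(assignment, parents, cpt):
    # P(z1,...,zn) = product over i of P(zi | Parents(Zi)), read from the CPTs
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[u] for u in parents[var])
        p *= cpt[var][(value, parent_values)]
    return p

# Fragment of the Figure 11.8 network: LungCancer with parents FamilyHistory and Smoker
parents = {"FamilyHistory": [], "Smoker": [], "LungCancer": ["FamilyHistory", "Smoker"]}
cpt = {
    "FamilyHistory": {("yes", ()): 0.1, ("no", ()): 0.9},      # made-up prior
    "Smoker":        {("yes", ()): 0.3, ("no", ()): 0.7},      # made-up prior
    "LungCancer":    {("yes", ("yes", "yes")): 0.8,            # quoted in the text
                      ("no",  ("yes", "yes")): 0.2,
                      ("yes", ("no", "no")):  0.1,
                      ("no",  ("no", "no")):  0.9},            # quoted in the text
}
print(joint_probability({"FamilyHistory": "yes", "Smoker": "yes", "LungCancer": "yes"},
                        parents, cpt))    # 0.1 * 0.3 * 0.8 = 0.024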
11.4.4 Training Bayesian Belief Networks
In the learning or training of a belief network, a number of scenarios are possible. The network
structure may be given in advance or inferred from the data. The network variables may be observable or
hidden in all or some of the training samples. The case of hidden data is also referred to as missing values
or incomplete data.
If the network structure is known and the variables are observable, then training the network is
straightforward. It consists of computing the CPT entries, as is similarly done when computing the
probabilities involved in naive Bayesian classification. When the network structure is given and some of
the variables are hidden, then a method of gradient descent can be used to train the belief network. The
object is to learn the values for the CPT entries. Let S be a set of s training samples, X1, X2, . . ., Xs.
Let wijk be a CPT entry for the variable Yi = yij having the parents Ui = uik. For example, if wijk is the
upper leftmost CPT entry of the table in Figure 11.8, then Yi is Lung Cancer; yij is its value, "yes"; Ui lists the
parent nodes of Yi, namely, {Family History, Smoker}; and uik lists the values of the parent nodes, namely,
{yes, yes}. The wijk are viewed as weights, analogous to the weights in the hidden units of neural
networks. The set of weights is collectively referred to as w. The weights are initialized to random
probability values. The gradient descent strategy performs greedy hill-climbing. At each iteration, the
weights are updated and will eventually converge to a local optimum solution.
1. Compute the gradients: For each i, j, k, compute
∂ ln Pw(S) / ∂ wijk = Σ_{d=1..s} P(Yi = yij, Ui = uik | Xd) / wijk
The probability on the right-hand side of the equation is to be calculated for each training sample Xd in S.
For brevity, let us refer to this probability simply as p. When the variables represented by Yi and Ui are
hidden for some Xd, then the corresponding probability p can be computed from the observed variables of
the sample using standard algorithms for Bayesian network inference.
2. Take a small step in the direction of the gradient: The weights are updated by
wijk ← wijk + (l) ∂ ln Pw(S) / ∂ wijk
where l is the learning rate representing the step size, and ∂ ln Pw(S) / ∂ wijk is computed from the
equation in step 1. The learning rate is set to a small constant.
3. Renormalize the weights: Because the weights wijk are probability values, they must be between
0.0 and 1.0, and Σj wijk must equal 1 for all i, k. These criteria are achieved by renormalizing the weights
after they have been updated by the equation in step 2.
Several algorithms exist for learning the network structure from the training data given observable
variables. The problem is one of discrete optimization.
11.5 NEURAL NETWORKS FOR DATA MINING
A neural processing element receives inputs from other connected processing elements. These input
signals or values pass through weighted connections, which either amplify or diminish the signals. Inside
the neural processing element, all of these input signals are summed together to give the total input to the
unit. This total input value is then passed through a mathematical function to produce an output or decision
value ranging from 0 to 1. Notice that this is a real valued (analog) output, not a digital 0/1 output. If the
input signal matches the connection weights exactly, then the output is close to 1. If the input signal totally
mismatches the connection weights, then the output is close to 0. Varying degrees of similarity are represented
by the intermediate values. Now, of course, we can force the neural processing element to make a binary
(1/0) decision, but by using analog values ranging between 0.0 and 1.0 as the outputs, we are retaining
more information to pass on to the next layer of neural processing units. In a very real sense, neural
networks are analog computers.
Each neural processing element acts as a simple pattern recognition machine. It checks the input
signals against its memory traces (connection weights) and produces an output signal that corresponds to
the degree of match between those patterns. In typical neural networks, there are hundreds of neural
processing elements whose pattern recognition and decision making abilities are harnessed together to
solve problems.
11.5.1 Neural Network Topologies
The arrangement of neural processing units and their interconnections can have a profound impact on
the processing capabilities of the neural networks. In general, all neural networks have some set of
processing units that receive inputs from the outside world, which we refer to appropriately as the input
units. Many neural networks also have one or more layers of hidden processing units that receive
inputs only from other processing units. A layer or slab of processing units receives a vector of data or
the outputs of a previous layer of units and processes them in parallel. The set of processing units that
represents the final result of the neural network computation is designated as the output units. There
are three major connection topologies that define how data flows between the input, hidden, and output
processing units. These main categories, feed-forward, limited recurrent, and fully recurrent networks, are
described in detail in the next sections.
11.5.2 Feed-Forward Networks
Feed-forward networks are used in situations when we can bring all of the information to bear on a
problem at once, and we can present it to the neural network. It is like a pop quiz, where the teacher walks
in, writes a set of facts on the board, and says, "OK, tell me the answer." You must take the data, process
it, and jump to a conclusion. In this type of neural network, the data flows through the network in one
direction, and the answer is based solely on the current set of inputs.
In Figure 11.9 we see a typical feed-forward neural network topology. Data enters the neural network
through the input units on the left. The input values are assigned to the input units as the unit activation
values. The output values of the units are modulated by the connection weights, either being magnified if
the connection weight is positive and greater than 1.0, or being diminished if the connection weight is
between 0.0 and 1.0. If the connection weight is negative, the signal is magnified or diminished in the
opposite direction.
Figure 11.9: Feed-forward neural networks.
Each processing unit combines all of the input signals coming into the unit along with a threshold value.
This total input signal is then passed through an activation function to determine the actual output of the
processing unit, which in turn becomes the input to another layer of units in a multi-layer network. The
most typical activation function used in neural networks is the S-shaped or sigmoid (also called the logistic)
function. This function converts an input value to an output ranging from 0 to 1. The effect of the threshold
weights is to shift the curve right or left, thereby making the output value higher or lower, depending on the
sign of the threshold weight. As shown in Figure 11.9, the data flows from the input layer through zero,
one, or more succeeding hidden layers and then to the output layer. In most networks, the units from one
layer are fully connected to the units in the next layer. However, this is not a requirement of feed-forward
neural networks. In some cases, especially when the neural network connections and weights are
constructed from a rule or predicate form, there could be fewer connection weights than in a fully connected
network. There are also techniques for pruning unnecessary weights from a neural network after it is
trained. In general, the fewer weights there are, the faster the network will be able to process data and the
better it will generalize to unseen inputs. It is important to remember that feed-forward is a definition of
connection topology and data flow. It does not imply any specific type of activation function or training
paradigm.
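A forward pass through such a network is just a repeated weighted sum followed by the activation function. The following minimal sketch is added here for illustration (not from the text); the topology and the weight values are arbitrary.

import math

def sigmoid(x):
    # S-shaped (logistic) activation: maps any input to an output between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights, biases):
    # Propagate an input vector through successive fully connected layers.
    # weights[l][j][i] is the weight from unit i of layer l to unit j of layer l+1;
    # biases[l][j] is the threshold (bias) of unit j of layer l+1.
    activations = inputs
    for W, b in zip(weights, biases):
        activations = [sigmoid(sum(w * a for w, a in zip(row, activations)) + bj)
                       for row, bj in zip(W, b)]
    return activations

# 2 inputs -> 2 hidden units -> 1 output unit, with arbitrary illustrative weights
weights = [[[0.5, -0.4], [0.3, 0.8]], [[1.2, -0.7]]]
biases  = [[0.1, -0.2], [0.05]]
print(forward([1.0, 0.0], weights, biases))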
11.5.3 Classification by Backpropagation
Backpropagation is a neural network learning algorithm. The field of neural networks was originally
kindled by psychologists and neurobiologists who sought to develop and test computational analogues of
neurons. Roughly speaking, a neural network is a set of connected input/output units where each connection
has a weight associated with it. During the learning phase, the network learns by adjusting the weights so
as to be able to predict the correct class label of the input samples. Neural network learning is also
referred to as connectionist learning due to the connections between units.
Neural networks involve long training times and are therefore more suitable for applications where
this is feasible. They require a number of parameters that are typically best determined empirically, such
as the network topology or structure. Neural networks have been criticized for their poor interpretability,
since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These
features initially made neural networks less desirable for data mining.
Advantages of neural networks, however, include their high tolerance to noisy data as well as their
ability to classify patterns on which they have not been trained. In addition, several algorithms have
recently been developed for the extraction of rules from trained neural networks. These factors contribute
towards the usefulness of neural networks for classification in data mining.
The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980s.
11.5.4 Backpropagation
Backpropagation learns by iteratively processing a set of training samples, comparing the network's
prediction for each sample with the actual known class label. For each training sample, the weights are
modified so as to minimize the mean squared error between the network's prediction and the actual class.
These modifications are made in the backwards direction, that is, from the output layer, through each
hidden layer, down to the first hidden layer (hence the
name backpropagation). Although it is not guaranteed, in general the weights will eventually converge,
and the learning process stops. The algorithm is summarized below, and each step is described in turn.
Initialize the weights: The weights in the network are initialized to small random numbers (e.g.,
ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The
biases are similarly initialized to small random numbers.
Each training sample, X, is processed by the following steps.
Propagate The inputs forward: In this step, the net input and output of each unit in the hidden and
output layers are computed. First, the training sample is fed to the input layer of the network. Note that
for unit j in the input layer, its output is equal to its input, that is, Oj =Ij for input unit j. The net input to each
unit in the hidden and output layers is computed as a linear combination of its inputs. To help illustrate this,
a hidden layer or output layer unit is shown in Figure 11.10. The inputs to the unit are, in fact, the outputs
of the units connected to it in the previous layer. To compute the net input to the unit, each input connected
to the unit is multiplied by its corresponding weight, and this is summed. Given a unit j in a hidden or output
layer, the net input, Ij, to unit j is
Ij = Σi wij Oi + θj,
where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of
unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves to
vary the activity of the unit.
Algorithm: Backpropagation. Neural network learning for classification, using the backpropagation
algorithm.
Input: The training samples, samples; the learning rate, l; a multilayer feed-forward network, network.
Output: A neural network trained to classify the samples.
Method:
(1) Initialize all weights and biases in network;
(2) while terminating condition is not satisfied {
(3)   for each training sample X in samples {
(4)     // Propagate the inputs forward:
(5)     for each hidden or output layer unit j {
(6)       Ij = Σi wij Oi + θj;  // compute the net input of unit j with respect to the previous layer, i
(7)       Oj = 1 / (1 + e^(-Ij)); }  // compute the output of each unit j
(8)     // Backpropagate the errors:
(9)     for each unit j in the output layer
(10)      Errj = Oj (1 - Oj) (Tj - Oj);  // compute the error
(11)     for each unit j in the hidden layers, from the last to the first hidden layer
(12)      Errj = Oj (1 - Oj) Σk Errk wjk;  // compute the error with respect to the next higher layer, k
(13)     for each weight wij in network {
(14)      Δwij = (l) Errj Oi;  // weight increment
(15)      wij = wij + Δwij; }  // weight update
(16)     for each bias θj in network {
(17)      Δθj = (l) Errj;  // bias increment
(18)      θj = θj + Δθj; }  // bias update
(19) }}
Each unit in the hidden and output layers takes its net input and then applies an activation function to
it, as illustrated in Figure 11.10. The function symbolizes the activation of the neuron represented by the
unit. The logistic, or sigmoid, function is used. Given the net input Ij to unit j, then Oj, the output of unit j,
is computed as
Oj = 1 / (1 + e^(-Ij))
This function is also referred to as a squashing function, since it maps a large input domain onto the
smaller range of 0 to 1. The logistic function is non-linear and differentiable, allowing the backpropagation
algorithm to model classification problems that are linearly inseparable.
Figure 11.10: A neural network unit showing the input layer, activation segment and output weights
Backpropagate the error: The error is propagated backwards by updating the weights and biases to
reflect the error of the network's prediction. For a unit j in the output layer, the error Errj is computed by
Errj = Oj (1 - Oj) (Tj - Oj),
where Oj is the actual output of unit j, and Tj is the true output, based on the known class label of the
given training sample. Note that Oj (1 - Oj) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to
unit j in the next layer is considered. The error of a hidden layer unit j is
Errj = Oj (1 - Oj) Σk Errk wjk,
where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the
error of unit k.
The weights and biases are updated to reflect the propagated errors. Weights are updated by the
following equations, where Δwij is the change in weight wij:
Δwij = (l) Errj Oi
wij = wij + Δwij
The variable l denotes the learning rate, a constant typically having a value between 0.0 and 1.0.
Backpropagation learns using a method of gradient descent to search for a set of weights that can model
the given classification problem so as to minimize the mean squared distance between the network's
class prediction and the actual class label of the samples. The learning rate helps to avoid getting stuck at
a local minimum in decision space (i.e., where the weights appear to converge, but are not the optimum
solution) and encourages finding the global minimum. If the learning rate is too small, then learning will
occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate solutions
may occur. A rule of thumb is to set the learning rate to 1/t,
where t is the number of iterations through the training set so far.
Biases are updated by the following equations below, where
w
uj is the change in bias uj;
A uj =(l)Errj
uj =uj +A uj
Note that here we are updating the weights and biases after the presentation of each sample. This is
referred to as case updating. Alternatively, the weight and bias increments could be accumulated in
variables, so that the weights and biases are updated after all of the samples in the training set have been
presented. This latter strategy is called epoch updating, where one iteration through the training set is an
epoch. In theory, the mathematical derivation of backpropagation employs epoch updating, yet in practice,
case updating is more common since it tends to yield more accurate results.
Terminating condition: Training stops when
- all Δw_ij in the previous epoch were so small as to be below some specified threshold, or
- the percentage of samples misclassified in the previous epoch is below some threshold, or
- a prespecified number of epochs has expired.
In practice, several hundreds of thousands of epochs may be required before the weights converge.
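To make the above steps concrete, the following is a minimal Python (NumPy) sketch of backpropagation with one hidden layer, sigmoid activations and case updating. The network size, learning rate, number of epochs and the XOR toy data are illustrative assumptions, not part of the algorithm listing above.

import numpy as np

def sigmoid(x):
    # logistic "squashing" function: maps the net input onto the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, n_hidden=4, lr=0.5, epochs=5000, seed=0):
    # train a one-hidden-layer feed-forward network with case updating
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); b2 = np.zeros(n_out)
    for _ in range(epochs):
        for x, t in zip(X, T):                           # case updating: one sample at a time
            o_hid = sigmoid(x @ W1 + b1)                 # propagate the inputs forward
            o_out = sigmoid(o_hid @ W2 + b2)
            err_out = o_out * (1 - o_out) * (t - o_out)          # Err_j for output units
            err_hid = o_hid * (1 - o_hid) * (W2 @ err_out)       # Err_j for hidden units
            W2 += lr * np.outer(o_hid, err_out); b2 += lr * err_out   # weight and bias updates
            W1 += lr * np.outer(x, err_hid);     b1 += lr * err_hid
    return W1, b1, W2, b2

# toy example: XOR, a classification problem that is not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_backprop(X, T)
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))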
11.5.5 Backpropagation and Interpretability
A major disadvantage of neural networks lies in their knowledge representation. Acquired knowledge
in the form of a network of units connected by weighted links is difficult for humans to interpret. This
factor has motivated research in extracting the knowledge embedded in trained neural networks and in
representing that knowledge symbolically. Methods include extracting rules from networks and sensitivity
analysis.
Various algorithms for the extraction of rules have been proposed. The methods typically impose
restrictions regarding procedures used in training the given neural network, the network topology, and the
discretization of input values.
Fully connected networks are difficult to articulate. Hence, often the first step towards extracting
rules from neural networks is network pruning. This consists of removing weighted links that do not result
in a decrease in the classification accuracy of the given network.
Once the trained network has been pruned, some approaches will then perform link, unit, or activation
value clustering. In one method, for example, clustering is used to find the set of common activation
values for each hidden unit in a given trained two layer neural network. The combinations of these
activation values for each hidden unit are analyzed. Rules are derived relating combinations of activation
values with corresponding output unit values. Similarly, the sets of input values and activation values are
studied to derive rules describing the relationship between the input and hidden unit layers. Finally, the two
sets of rules may be combined to form IF THEN rules. Other algorithms may derive rules of other forms,
including M-of-N rules (where M out of a given N conditions in the rule antecedent must be true in order
for the rule consequent to be applied), decision trees with M-of-N tests, fuzzy rules, and finite automata.
Sensitivity analysis is used to assess the impact that a given input variable has on a network output.
The input to the variable is varied while the remaining input variables are fixed at some value. Meanwhile,
changes in the network output are monitored. The knowledge gained from this form of analysis can be
represented in rules such as "IF X decreases 5% THEN Y increases 8%".
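As a simple illustration of this idea, the sketch below varies one input of a prediction function at a time while holding the others fixed, and reports the relative change in the output. The stand-in model and the 5% perturbation are hypothetical; in practice the model would be the trained network itself.

import numpy as np

def sensitivity(model, x_base, index, delta=0.05):
    # relative change in the model output when input `index` is increased by `delta` (a fraction)
    x_pert = x_base.copy()
    x_pert[index] *= (1.0 + delta)
    y_base, y_pert = model(x_base), model(x_pert)
    return (y_pert - y_base) / abs(y_base)

model = lambda x: 0.8 * x[0] - 1.6 * x[1] + 2.0   # stand-in for a trained network's prediction
x = np.array([10.0, 5.0])
for i in range(len(x)):
    print(f"input {i}: +5% -> {sensitivity(model, x, i) * 100:+.1f}% change in output")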
11.6 CLUSTERING IN DATA MINING
Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of
objects that are similar between themselves and dissimilar to objects of other groups. Representing data
by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves
simplification. It represents many data objects by few clusters, and hence, it models data by its clusters.
Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical
analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for
clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering
is unsupervised learning of a hidden data concept. Data mining deals with large databases that impose on
clustering analysis additional severe computational requirements.
Clustering techniques fall into a group of undirected data mining tools. The goal of undirected data
mining is to discover structure in the data as a whole. There is no target variable to be predicted, thus no
distinction is being made between independent and dependent variables.
Clustering techniques are used for combining observed examples into clusters (groups) which satisfy
two main criteria:
Each group or cluster is homogeneous; examples that belong to the same group are similar to
each other.
Each group or cluster should be different from other clusters, that is, examples that belong to
one cluster should be different from the examples of other clusters.
Depending on the clustering technique, clusters can be expressed in different ways:
Identified clusters may be exclusive, so that any example belongs to only one cluster.
They may be overlapping; an example may belong to several clusters.
They may be probabilistic, whereby an example belongs to each cluster with a certain probability.
Clusters might have hierarchical structure, having crude division of examples at highest level of
hierarchy, which is then refined to sub-clusters at lower levels.
11.6.1 Requirements for Clustering
Clustering is a challenging and interesting field whose potential applications pose their own special requirements.
The following are typical requirements of clustering in data mining.
- Scalability: Many clustering algorithms work well on small data sets containing fewer than
200 data objects. However, a large database may contain millions of objects. Clustering on a
sample of a given large data set may lead to biased results. Highly scalable clustering
algorithms are needed.
- Ability to Deal with Different types of Attributes: Many algorithms are designed to
cluster interval-based (numerical) data. However, applications may require clustering other
types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these
data types.
- Discovery of Clusters with Arbitrary Shape: Many clustering algorithms determine
clusters based on Euclidean or Manhattan distance measures. Algorithms based on such
distance measures tend to find spherical clusters with similar size and density. However, a
cluster could be of any shape. It is important to develop algorithms that can detect clusters of
arbitrary shape.
- Minimal Requirements for Domain Knowledge to Determine Input Parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis (such
as the number of desired clusters). The clustering results can be quite sensitive to input
parameters. Parameters are often hard to determine, especially for data sets containing
high-dimensional objects. This not only burdens users, but also makes the quality of clustering
difficult to control.
- Ability to Deal with Noisy Data: Most real-world databases contain outliers or missing,
unknown, erroneous data. Some clustering algorithms are sensitive to such data and may
lead to clusters of poor quality.
- Insensitivity to the Order of Input Records: Some clustering algorithms are sensitive to
the order of input data; for example, the same set of data, when presented with different
orderings to such an algorithm, may generate dramatically different clusters. It is important
to develop algorithms that are insensitive to the order of input.
- High Dimensionality: A database or a data warehouse can contain several dimensions or
attributes. Many clustering algorithms are good at handling low-dimensional data, involving
only two to three dimensions. Human eyes are good at judging the quality of clustering for up
to three dimensions. It is challenging to cluster data objects in high dimensional space,
especially considering that such data can be very sparse and highly skewed.
- Constraint-Based Clustering: Real-world applications may need to perform clustering
under various kinds of constraints. Suppose that your job is to choose the locations for a
given number of new automatic cash-dispensing machines (i.e., ATMs) in a city. To decide
upon this, we may cluster households while considering constraints such as the city's rivers
and highway networks and customer requirements per region. A challenging task is to find
groups of data with good clustering behavior that satisfy specified constraints.
- Interpretability and Usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied up with specific semantic
interpretations and applications. It is important to study how an application's goal may influence
the selection of clustering methods.
11.6.2 Type of Data in Cluster Analysis
We study the types of data that often occur in cluster analysis and how to preprocess them for such an
analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses,
documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of
the following two data structures.
- Data Matrix (or object-by-variable structure): This represents n objects, such as persons,
with p variables (also called measurements or attributes), such as age, height, weight, gender,
race, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects
x p variables):
- Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities
that are available for all pairs of n objects. It is often represented by an n-by-n table:
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is
a nonnegative number that is close to 0 when objects i and j are highly similar or near each other, and
becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix is symmetric with a zero diagonal.
The data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode
matrix, since the rows and columns of the former represent different entities, while those of the latter
represent the same entity. Many clustering algorithms operate on a dissimilarity matrix. If the data are
presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before
applying such clustering algorithms.
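As a small illustration of that transformation, the sketch below builds an n-by-n dissimilarity matrix from an n-by-p data matrix using Euclidean distance; the three sample objects are made up for illustration.

import numpy as np

def dissimilarity_matrix(data):
    # convert an n-by-p data matrix into a symmetric n-by-n dissimilarity matrix
    n = data.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            d[i, j] = d[j, i] = np.linalg.norm(data[i] - data[j])   # d(i, j) = d(j, i)
    return d                                                        # diagonal stays 0: d(i, i) = 0

data = np.array([[1.0, 2.0],     # three objects described by two interval-scaled variables
                 [2.0, 4.0],
                 [8.0, 1.0]])
print(np.round(dissimilarity_matrix(data), 2))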
11.6.3 Interval-Scaled Variables
Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples
include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather
temperature.
The measurement unit used can affect the clustering analysis. For example, changing measurement
units from meters to inches for height or from kilograms to pounds for weight, may lead to a very different
clustering structure. In general, expressing a variable in smaller units will lead to a larger range for that
variable and thus a large effect on the resulting clustering structure. To help avoid dependence on the
choice of measurement units, the data should be standardized. Standardizing measurements attempts to
give all variables an equal weight. This is particularly useful when given no prior knowledge of the data.
However, in some applications, users may intentionally want to give more weight to a certain set of
variables than to others. For example, when clustering basketball player candidates, we may prefer to
give more weight to the variable height.
To standardize measurements, one choice is to convert the original measurements to unitless variables.
Given measurements for a variable f, this can be performed as follows.
1. Calculate the mean absolute deviation, s_f:
   s_f = (1/n) ( |x_1f - m_f| + |x_2f - m_f| + . . . + |x_nf - m_f| ),
   where x_1f, . . ., x_nf are n measurements of f, and m_f is the mean value of f, that is,
   m_f = (1/n) ( x_1f + x_2f + . . . + x_nf ).
2. Calculate the standardized measurement, or z-score:
   z_if = (x_if - m_f) / s_f
The mean absolute deviation, s_f, is more robust to outliers than the standard deviation. When
computing the mean absolute deviation, the deviations from the mean (i.e., |x_if - m_f|) are not squared;
hence, the effect of outliers is somewhat reduced. There are more robust measures of dispersion, such as
the median absolute deviation. However, the advantage of using the mean absolute deviation is that the
z-scores of outliers do not become too small; hence, the outliers remain detectable.
Standardization may or may not be useful in a particular application. Thus the choice of whether and
how to perform standardization should be left to the user.
After standardization, or without standardization in certain applications, the dissimilarity (or similarity)
between the objects described by interval-scaled variables is typically computed based on the distance
between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as
   d(i, j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + . . . + (x_ip - x_jp)^2 ),
where i = (x_i1, x_i2, . . ., x_ip) and j = (x_j1, x_j2, . . ., x_jp) are two p-dimensional data objects.
Another well-known metric is Manhattan (or city block) distance, defined as
   d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + . . . + |x_ip - x_jp|.
Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of
a distance function:
1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d(j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no more than making
a detour over any other object h (triangular inequality).
Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as
   d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + . . . + |x_ip - x_jp|^q )^(1/q),
where q is a positive integer. It represents the Manhattan distance when q = 1, and Euclidean distance
when q = 2.
If each variable is assigned a weight according to its perceived importance, the weighted Euclidean
distance can be computed as
   d(i, j) = sqrt( w_1 |x_i1 - x_j1|^2 + w_2 |x_i2 - x_j2|^2 + . . . + w_p |x_ip - x_jp|^2 ).
Weighting can also be applied to the Manhattan and Minkowski distances.
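The sketch below puts the standardization and distance formulas above into code: z-scores based on the mean absolute deviation, followed by a Minkowski distance in which q = 1 gives Manhattan distance and q = 2 gives (optionally weighted) Euclidean distance. The sample values are arbitrary.

import numpy as np

def z_scores(x):
    # standardize one variable using the mean absolute deviation s_f
    m = x.mean()
    s = np.abs(x - m).mean()
    return (x - m) / s

def minkowski(i, j, q=2, w=None):
    # Minkowski distance; q = 1 -> Manhattan, q = 2 -> Euclidean; optional weights w
    i, j = np.asarray(i, float), np.asarray(j, float)
    w = np.ones_like(i) if w is None else np.asarray(w, float)
    return (w * np.abs(i - j) ** q).sum() ** (1.0 / q)

heights = np.array([150.0, 160.0, 170.0, 180.0])
print(np.round(z_scores(heights), 2))            # [-1.5 -0.5  0.5  1.5]

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(a, b, q=1))                      # Manhattan: 3 + 2 + 0 = 5
print(round(minkowski(a, b, q=2), 2))            # Euclidean: sqrt(9 + 4) ~ 3.61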
11.6.4 Binary Variables
A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means
that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient
smokes, while 0 indicates that the patient does not. Treating binary variables as if they are interval-scaled
can lead to misleading clustering results. Therefore, methods specific to binary data are necessary for
computing dissimilarities.
To compute the dissimilarity between two binary variables, one approach involves computing a
dissimilarity matrix from the given binary data. If all binary variables are thought of as having the same
weight, we have the 2-by-2 contingency table of Table 3.3, where q is the number of variables that equal
1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s
is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables
that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t.
Table 3.3 A contingency table for binary variables

                    object j
                    1        0        sum
object i    1       q        r        q + r
            0       s        t        s + t
            sum     q + s    r + t    p
A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that
is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the
attribute gender having the states male and female. Similarity that is based on symmetric binary variables
is called invariant similarity in that the result does not change when some or all of the binary variables are
coded differently. For invariant similarities, the most well-known coefficient for assessing the dissimilarity
between objects i and j is the simple matching coefficient, defined as
   d(i, j) = (r + s) / (q + r + s + t).
A binary variable is asymmetric if the outcomes of the states are not equally important, such as the
positive and negative outcomes of a disease test. By convention, we code the most important outcome,
which is usually the rarest one, as 1, and the other as 0. Given two asymmetric binary variables, the
agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative
match). Therefore, such binary variables are often considered "monary" (as if having one state). The
similarity based on such variables is called noninvariant similarity. For noninvariant similarities, the most
well-known coefficient is the Jaccard coefficient, where the number of negative matches, t, is considered
unimportant and thus is ignored in the computation:
   d(i, j) = (r + s) / (q + r + s).
Example: Dissimilarity between binary variables
Suppose that a patient record table (Table 3.4) contains the attributes name, gender, fever, cough, test-
1, test-2, test-3, and test-4, where name is an object-id, gender is a symmetric attribute, and the remaining
attributes are asymmetric binary.
For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N
(no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only
on the asymmetric variables. According to the Jaccard coefficient formula, the distance between each
pair of the three patients, Ram, Sita and Laxman, can then be computed.
Table 3.4 A relational table containing mostly binary attributes.
These measurements suggest that Laxman and Sita are unlikely to have a similar disease since they
have the highest dissimilarity value among the three pairs. Of the three patients, Ram and Sita are the
most likely to have a similar disease.
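A short sketch of the Jaccard-based dissimilarity for asymmetric binary attributes follows. The two 0/1 vectors stand for the Y/P and N values of two patients; since the values of Table 3.4 are not reproduced here, the vectors are invented purely for illustration.

def jaccard_dissimilarity(i, j):
    # d(i, j) = (r + s) / (q + r + s); negative matches t are ignored
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))   # positive matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))   # 1 for i, 0 for j
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))   # 0 for i, 1 for j
    return (r + s) / (q + r + s) if (q + r + s) else 0.0

# asymmetric binary attributes (fever, cough, test-1, ..., test-4); Y/P -> 1, N -> 0
ram  = [1, 0, 1, 0, 0, 0]
sita = [1, 0, 1, 0, 1, 0]
print(round(jaccard_dissimilarity(ram, sita), 2))      # 0.33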
11.6.5 Nominal, Ordinal and Ratio-Scaled Variables
Nominal Variables
A nominal variable is a generalization of the binary variable in that it can take on more than two states.
For example, map_color is a nominal variable that may have, say, five states: red, yellow, green, pink and
blue.
Let the number of states of a nominal variable be M. The states can be denoted by letters, symbols or
a set of integers, such as 1, 2, . . ., M. Notice that such integers are used just for data handling and
do not represent any specific ordering.
The dissimilarity between two objects i and j can be computed using the simple matching approach:
   d(i, j) = (p - m) / p,
where m is the number of matches (i.e., the number of variables for which i and j are in the same
state), and p is the total number of variables. Weights can be assigned to increase the effect of m or to
assign greater weight to the matches in variables having a larger number of states.
Nominal variables can be encoded by asymmetric binary variables by creating a new binary variable
for each of the M nominal states. For an object with a given state value, the binary variable representing
that state is set to 1, while the remaining binary variables are set to 0. For example, for the nominal
variable map_color, a binary variable can be created for each of the five colors listed above. For an
object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0.
Ordinal Variables
A discrete ordinal variable resembles a nominal variable, except that the M states of the ordinal value
are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective
assessments of qualities that cannot be measured objectively. For example, professional ranks are often
enumerated in a sequential order, such as assistant, associate, and full. A continuous ordinal variable looks
like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but
their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver,
bronze) is often more essential than the actual values of a particular measure. Ordinal variables may also
be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite
number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that
an ordinal variable f has M_f states. These ordered states define the ranking 1, . . ., M_f.
The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing
the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n
objects. The dissimilarity computation with respect to f involves the following steps:
1. The value of f for the ith object is x_if, and f has M_f ordered states, representing the ranking
1, . . ., M_f. Replace each x_if by its corresponding rank, r_if ∈ {1, . . ., M_f}.
2. Since each ordinal variable can have a different number of states, it is often necessary to
map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This
can be achieved by replacing the rank r_if of the ith object in the fth variable by
   z_if = (r_if - 1) / (M_f - 1)
3. Dissimilarity can then be computed using any of the distance measures described above for
interval-scaled variables, using z_if to represent the f value for the ith object.
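A minimal sketch of these steps follows, using a made-up ordinal scale of professional ranks: each value is replaced by its rank r_if, and the rank is then mapped onto [0.0, 1.0].

def ordinal_to_interval(values, ordered_states):
    # map each value to its rank r_if in {1, ..., M_f}, then to z_if = (r_if - 1) / (M_f - 1)
    m = len(ordered_states)
    rank = {state: r + 1 for r, state in enumerate(ordered_states)}
    return [(rank[v] - 1) / (m - 1) for v in values]

states = ["assistant", "associate", "full"]                              # hypothetical ordered states
print(ordinal_to_interval(["assistant", "full", "associate"], states))   # [0.0, 1.0, 0.5]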
Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential
scale, approximately following the formula
   A e^(Bt) or A e^(-Bt),
where A and B are positive constants. Typical examples include the growth of a bacteria population
or the decay of a radioactive element.
There are three methods to handle ratio-scaled variables when computing the dissimilarity between objects.
- Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a
good choice since it is likely that the scale may be distorted.
- Apply a logarithmic transformation to a ratio-scaled variable f having value x_if for object i by
using the formula y_if = log(x_if). The y_if values can be treated as interval-valued. Note that for
some ratio-scaled variables, log-log or other transformations may be applied, depending on
the definition and application.
- Treat x_if as continuous ordinal data and treat their ranks as interval-valued.
11.6.6 Variables of Mixed Types
In many real databases, objects are described by a mixture of variable types. One approach is to group
each kind of variable together, performing a separate cluster analysis for each variable type. This is
feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a
separate cluster analysis per variable type will generate compatible results.
A more preferable approach is to process all variable types together, performing a single cluster
analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all
of the meaningful variables onto a common scale of the interval [0.0, 1.0].
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i,j) between objects i
and j is defined as
   d(i, j) = ( Σ_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) ),
where the indicator δ_ij^(f) = 0 if either (1) x_if or x_jf is missing (i.e., there is no measurement of variable
f for object i or object j), or (2) x_if = x_jf = 0 and variable f is asymmetric binary; otherwise δ_ij^(f) = 1.
The contribution of variable f to the dissimilarity between i and j, d_ij^(f), is computed dependent on its type:
- If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1.
- If f is interval-based: d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf), where h runs over all
nonmissing objects for variable f.
- If f is ordinal or ratio-scaled: compute the ranks r_if and z_if = (r_if - 1) / (M_f - 1), and treat
z_if as interval-scaled.
Thus, the dissimilarity between objects can be computed even when the variables describing the
objects are of different types.
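The following is a compact sketch of the combined formula above, assuming each variable is tagged with its type; the tiny example record and the variable names are hypothetical.

def mixed_dissimilarity(xi, xj, types, ranges=None, m_states=None):
    # d(i, j) over variables of mixed type; missing values are given as None
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        a, b = xi[f], xj[f]
        if a is None or b is None:                    # delta = 0: missing measurement
            continue
        if t == "asym_binary" and a == 0 and b == 0:  # delta = 0: two negative matches
            continue
        if t in ("binary", "asym_binary", "nominal"):
            d = 0.0 if a == b else 1.0
        elif t == "interval":
            d = abs(a - b) / ranges[f]                # |x_if - x_jf| / (max_h x_hf - min_h x_hf)
        else:                                         # ordinal or ratio-scaled, values given as ranks r_if
            za = (a - 1) / (m_states[f] - 1)
            zb = (b - 1) / (m_states[f] - 1)
            d = abs(za - zb)
        num += d
        den += 1.0
    return num / den if den else 0.0

types = ["nominal", "interval", "ordinal"]            # colour, age, rank (3 ordered states)
print(round(mixed_dissimilarity(["red", 30, 1], ["blue", 40, 3], types,
                                ranges={1: 50.0}, m_states={2: 3}), 2))   # 0.73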
11.7 A CATEGORIZATION OF MAJOR CLUSTERING
METHODS
The choice of clustering algorithm depends both on the type of data available and on the particular
purpose and application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try
several algorithms on the same data to see what the data may disclose.
In general, major clustering methods can be classified into the following categories.
Partitioning Methods: Given a database of n objects or data tuples, a partitioning method constructs
k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data
into k groups, which together satisfy the following requirements: (1) each group must contain at least one
object, and (2) each object must belong to exactly one group. Notice that the second requirement can be
relaxed in some fuzzy partitioning techniques.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It
then uses an iterative relocation technique that attempts to improve the partitioning by moving objects
from one group to another. The general criterion of a good partitioning is that objects in the same cluster
are close or related to each other, whereas objects of different clusters are far apart or very different.
There are various other criteria for judging the quality of partitions.
To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration
of all of the possible partitions. Instead, most applications adopt one of two popular heuristic methods: (1)
the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster,
and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the
center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in
small to medium-sized databases. To find clusters with complex shapes and for clustering very large data
sets, partitioning-based methods need to be extended.
Hierarchical Methods: A hierarchical method creates a hierarchical decomposition of the given set
of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group. It successively merges the objects or groups
close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until
a termination condition holds. The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition holds. Hierarchical methods suffer
from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that
it leads to smaller computation costs by not worrying about a combinatorial number of different choices.
However, a major problem of such techniques is that they cannot correct erroneous decisions. There are
two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object
linkages at each hierarchical partitioning, such as in CURE and Chameleon, or (2) integrate hierarchical
agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then
refining the result using iterative relocation.
Density-Based Methods: Most partitioning methods cluster objects based on the distance between
objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the density (number of objects or
data points) in the neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a
method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
DBSCAN is a typical density-based method that grows clusters according to a density threshold.
OPTICS is a density-based method that computes an augmented clustering ordering for automatic and
interactive cluster analysis.
Grid-Based Methods: Grid-based methods quantize the object space into a finite number of cells
that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the
quantized space). The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in each dimension
in the quantized space.
STING is a typical example of a grid-based method. CLIQUE and Wave Cluster are two clustering
algorithms that are both grid-based and density based.
Model-Based Methods: Model based methods hypothesize a model for each of the clusters and
find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing
a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically
determining the number of clusters based on standard statistics, taking noise or outliers into account and
thus yielding robust clustering methods.
11.8 CLUSTERING ALGORITHM
We will explain here the basics of the simplest of clustering methods: k-means algorithm. There are
many other methods, like self-organizing maps (Kohonen networks), or probabilistic clustering methods
(AutoClass algorithm), which are more sophisticated and proficient, but the k-means algorithm seems the
best choice for illustrating the main principles.
11.8.1 K-Means Algorithm
This algorithm has as an input a predefined number of clusters, that is the k from its name. Means
stands for an average, an average location of all the members of a particular cluster. When dealing with
clustering techniques, one has to adopt a notion of a high dimensional space, or space in which orthogonal
dimensions are all attributes from the table of data we are analyzing. The value of each attribute of an
example represents a distance of the example from the origin along the attribute axes. Of course, in order
to use this geometry efficiently, the values in the data set must all be numeric (categorical data must be
transformed into numeric ones!) and should be normalized in order to allow fair computation of the overall
distances in a multi-attribute space.
K-means algorithm is a simple, iterative procedure, in which a crucial concept is the one of centroid.
Centroid is an artificial point in the space of records which represents an average location of the particular
cluster. The coordinates of this point are averages of attribute values of all examples that belong to the
cluster. The steps of the K-means algorithm are given below.
1. Select randomly k points (it can be also examples) to be the seeds for the centroids of k
clusters.
2. Assign each example to the centroid closest to the example, forming in this way k exclusive
clusters of examples.
3. Calculate new centroids of the clusters. For that purpose average all attribute values of the
examples belonging to the same cluster (centroid).
4. Check if the cluster centroids have changed their coordinates. If yes, start again from
step 2. If not, cluster detection is finished and all examples have their cluster memberships
defined.
Usually this iterative procedure of redefining centroids and reassigning the examples to clusters needs
only a few iterations to converge.
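A minimal NumPy implementation of these four steps is sketched below; the number of clusters, the random seed and the sample points are chosen only for illustration (and, being a sketch, it does not handle the rare case of an empty cluster).

import numpy as np

def k_means(data, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]       # step 1: random seeds
    for _ in range(max_iter):
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                                     # step 2: assign to closest centroid
        new_centroids = np.array([data[labels == c].mean(axis=0)
                                  for c in range(k)])                    # step 3: recompute centroids
        if np.allclose(new_centroids, centroids):                        # step 4: stop when centroids settle
            break
        centroids = new_centroids
    return labels, centroids

data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
labels, centroids = k_means(data, k=2)
print(labels, np.round(centroids, 2))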
11.8.2 Important Issues in Automatic Cluster Detection
Most of the issues related to automatic cluster detection are connected to the kinds of questions we
want to be answered in the data mining project, or data preparation for their successful application.
Distance Measure
Most clustering techniques use for the distance measure the Euclidean distance formula (square root
of the sum of the squares of distances along each attribute axes).
Non-numeric variables must be transformed and scaled before the clustering can take place. Depending
on these transformations, the categorical variables may dominate clustering results or they may be even
completely ignored.
Choice of the Right Number of Clusters
If the number of clusters k in the K-means method is not chosen to match the natural structure of
the data, the results will not be good. The proper way to alleviate this is to experiment with different values
for k. In principle, the best k value will exhibit the smallest intra-cluster distances and largest inter-cluster
distances. More sophisticated techniques measure these qualities automatically and optimize the number of
clusters in a separate loop (AutoClass).
Cluster Interpretation
Once the clusters are discovered they have to be interpreted in order to have some value for the data
mining project. There are different ways to utilize clustering results:
Cluster membership can be used as a label for a separate classification problem. Some
descriptive data mining techniques (like decision trees) can be used to find descriptions of clusters.
Clusters can be visualized using 2D and 3D scatter graphs or some other visualization technique.
Differences in attribute values among different clusters can be examined, one attribute at a
time.
11.8.3 Application Issues
Clustering techniques are used when we expect natural groupings in examples of the data. Clusters
should then represent groups of items (products, events, customers) that have a lot in common. Creating
clusters prior to application of some other data mining technique (decision trees, neural networks) might
reduce the complexity of the problem by dividing the space of examples. These space partitions can be
mined separately, and such a two-step procedure might exhibit improved results (descriptive or predictive)
compared to data mining without using clustering.
11.9 GENETIC ALGORITHMS
There is a fascinating interaction between technology and nature: by means of technical inventions
we can learn to understand nature better. Conversely, nature is often a source of inspiration for technical
breakthroughs. The same principle also applies to computer science, a most fertile area for the exchange
of views between biology and computer science being evolutionary computing. Evolutionary computing
occupies itself with problem solving by the application of evolutionary mechanisms. At present, genetic
algorithms are considered to be among the most successful machine learning techniques.
Genetic algorithms are inspired by Darwin's theory of evolution. A solution to a problem solved by
genetic algorithms is obtained through an evolutionary process (it is evolved).
The algorithm begins with a set of solutions (represented by chromosomes) called a population. Solutions
from one population are taken and used to form a new population. This is motivated by the hope that the
new population will be better than the old one. Solutions that are selected to form new solutions
(offspring) are selected according to their fitness: the more suitable they are, the more chances they
have to reproduce.
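A toy sketch of this loop is shown below, evolving bit-string chromosomes towards a simple fitness target (maximizing the number of 1s). The population size, mutation rate and fitness function are invented for illustration.

import random

def evolve(pop_size=20, n_bits=16, generations=40, mutation_rate=0.05, seed=1):
    random.seed(seed)
    fitness = lambda c: sum(c)                                    # toy fitness: count the 1s
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # selection: fitter chromosomes get more chances to reproduce
        parents = random.choices(pop, weights=[fitness(c) + 1 for c in pop], k=pop_size)
        next_pop = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            point = random.randrange(1, n_bits)                   # single-point crossover
            for child in (a[:point] + b[point:], b[:point] + a[point:]):
                child = [bit ^ 1 if random.random() < mutation_rate else bit for bit in child]
                next_pop.append(child)                            # offspring with occasional mutation
        pop = next_pop
    return max(pop, key=fitness)

best = evolve()
print(best, sum(best))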
Genetic algorithms can be viewed as a kind of meta-learning strategy, which means that the genetic
approach could be employed by individuals who at the moment use almost any other learning mechanism.
The past few years have seen the development of many hybrid approaches, in which neural networks
have been used to create input for genetic algorithms or alternatively genetic algorithms to optimize the
output of neural networks. At present, genetic programming is widely used in financial markets and for
insurance applications.
11.10 EXERCISE
I. FILL IN THE BLANKS
1. ________ algorithm of data mining discovers items that are frequently associated together.
2. ________ are powerful and popular tools for classification and prediction.
3. A feed-forward neural network contains 3 layers, namely ________, ________ and ________.
4. ________ is a division of data into groups of similar objects.
5. K-means algorithm is a simple ________ procedure.
6. A genetic algorithm begins with a set of solutions (represented by chromosomes) called ________.
7. ________ supervised learning algorithm is one of the popular techniques used in neural networks.
ANSWERS
1. Apriori
2. Decision trees
3. Input, Hidden, Output
4. Clustering
5. Iterative
6. Population
7. Backpropagation
II. ANSWER THE FOLLOWING QUESTIONS
1. Explain the Apriori algorithm in detail.
2. Explain the decision tree working concept of data mining.
3. Explain Bayesian classification.
4. Explain the neural network for data mining.
5. Explain the steps of backpropagation.
6. Explain the requirements for clustering.
7. Explain the categorization of clustering methods.
8. Explain the K-means algorithm in detail.
Chapter 12
Guidelines of KDD Environment
12.0 INTRODUCTION
The goal of a KDD process is to obtain an ever-increasing and better understanding of the changing
environment of the organization. A KDD environment supports the data mining process, but this
process is so involved that it is neither realistic nor desirable to try to support it with just one
generic tool. Rather, one needs a suite of tools that is carefully selected and tuned specifically for each
organization utilizing data mining. Strictly speaking, there exist no generic data mining tools; databases,
pattern recognizers, machine learning, reporting tools and statistical analysis can all be of use at
times. Still, it is clear that one nevertheless needs some guidance in how to set up a KDD environment.
12.1 GUIDELINES
It is customary in the computer industry to formulate rules of thumb that help information technology
(IT) specialists to apply new developments. In setting up a reliable data mining environment we may
follow the guidelines so that KDD system may work in a manner we desire.
1. Support Extremely Large Data Sets
Data mining deals with extremely large data sets consisting of billions of records and without
proper platforms to store and handle these volumes of data, no reliable data mining is possible.
Parallel servers with databases optimized for decision support system oriented queries are useful.
Fast and flexible access to large data sets is very important.
2. Support Hybrid Learning
Learning tasks can be divided into three areas
a. Classification tasks
b. Knowledge engineering tasks
c. Problem-solving tasks
All algorithms can not perform well in all the above areas as discussed in previous chapters.
Depending on our requirement one has to choose the appropriate one.
3. Establish a Data Warehouse
A data warehouse contains historic data and is subject oriented and static, that is , users do not
update the data but it is created on a regular time-frame on the basis of the operational data of
an organization. It is thus obvious that a data warehouse is an ideal starting point for a data
mining process, since data mining depends heavily on the permanent availability of historic data
and in this sense a data warehouse could be regarded as indispensable.
4. Introduce Data Cleaning Facilities
Even when a data warehouse is in operation, the data is certain to contain all sorts of
heterogeneous impurities. Special tools for cleaning data are necessary and some advanced tools
are available, especially in the field of de-duplication of client files. Other cleaning techniques
are only just emerging from research laboratories.
5. Facilitate Working with Dynamic Coding
Creative coding is the heart of the knowledge discovery process. The environment should enable
the user to experiment with different coding schemes, store partial results, make attributes
discrete, create time series out of historic data, select random sub-samples, separate test sets
and so on. A project management environment that keeps track of the genealogy of different
samples and tables as well as of the semantics and transformations of the different attributes is
vital.
6. Integrate with Decision Support System
Data mining looks for hidden data that cannot easily be found using normal query techniques. A
knowledge discovery process always starts with traditional decision support system activities
and from there we zoom in on interesting parts of the data set.
7. Choose Extendible Architecture
New techniques for pattern recognition and machine learning are under development and we
also see many developments in the database area. It is advisable to choose an architecture that
enables us to integrate new tools at later stages. Object-oriented technology typically helps this
kind of flexibility.
8. Support Heterogeneous Databases
Not all the necessary data is to be found in the data warehouse. Sometimes we will
need to enrich the data warehouse with information from unexpected sources, such as information
brokers, or with operational data that is not stored in our regular data warehouse. In order to
facilitate this, the data mining environment must support a variety of interfaces: hierarchical
databases, flat files, various relational databases and object-oriented database systems.
9. Introduce Client/Server Architecture
A data mining environment needs extensive reporting facilities. Some developments, such as data
landscapes, point in the direction of highly interactive graphic environments, but database servers
are not very suitable for these tasks. Discovery jobs need to be processed by large data mining
servers, while further refinement and reporting will take place on a client. Separating the data
mining activities on the servers from the clients is vital for good performance. Client/server is a
much more flexible system which moves the burden of visualization and graphical techniques
from the servers to the local machine. We can then optimize our database server completely for
data mining. Adequate parallelization of data mining algorithms on large servers is of vital
importance in this respect.
10. Introduce Cache Optimization
Learning and pattern recognition algorithms that operate on databases often need very special
and frequent access to the data. Usually it is either impossible or impractical to store the data in
separate tables or to cache large portions in internal memory. The learning algorithms in a data
mining environment should be optimized for this type of database access. A low-level integration
with the database environment is desirable.
It is very important to note that the knowledge discovery process is not a one-off activity that we
implement and then ignore; the successful organization of the future will have to keep permanently
alert both to possible new sources of information and to the technologies available for opening up
these sources. The major problem facing our information society is the enormous abundance of
data. In the future every organization will have to find its way through this enormous amount of
information, and data mining will play an active and crucially important role.
12.2 EXERCISE
I. FILL IN THE BLANKS
1. ________ servers with databases optimized for decision support system oriented queries are useful.
2. Learning tasks can be divided into three areas: ________, ________ and ________.
3. A ________ contains historic data and is subject oriented and static.
4. ________ coding is the heart of the knowledge discovery process.
5. A knowledge discovery process always starts with traditional ________ system activities.
ANSWERS
1. Parallel
2. Classification tasks, knowledge engineering tasks, problem-solving tasks
3. Data warehouse
4. Creative
5. Decision support
II. ANSWER THE FOLLOWING QUESTIONS
1. Explain the guidelines of KDD environment in detail.
Chapter 13
Data Mining Application
13.0 INTRODUCTION
In the previous chapters, we studied principles and methods for mining relational data and complex
types of data. Data mining is an interdisciplinary and upcoming field with wide and
diverse applications, yet there is still a nontrivial gap between the general principles of data mining and
domain-specific, effective data mining tools for particular applications. In this chapter we discuss some of
the applications of data mining.
13.1 DATA MINING FOR BIOMEDICAL AND DNA DATA
ANALYSIS
Recently, there has been explosive growth in the field of biomedical research, which ranges from the
development of new pharmaceuticals and advances in cancer therapies to the identification and study of
the human genome by discovering large-scale sequencing patterns and gene functions. Since a great deal
of biomedical research has focused on DNA data analysis, we study this application here.
Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and
disabilities, as well as the discovery of new medicines and approaches for disease diagnosis, prevention
and treatment.
An important focus in genome research is the study of DNA sequences, since such sequences form
the foundation of the genetic codes of all living organisms. All DNA sequences are comprised of four
basic building blocks (called nucleotides): adenine (A), cytosine (C), guanine (G) and thymine (T). These
four nucleotides combine to form long sequences or chains that resemble a twisted ladder.
Human beings have around 100,000 genes. A gene is usually comprised of hundreds of individual
nucleotides arranged in a particular code. There is an almost unlimited number of ways that the nucleotides
can be ordered and sequenced to form distinct genes. It is challenging to identify particular gene sequence
patterns that play roles in various diseases. Since many interesting sequential pattern analysis and similarity
search techniques have been developed in data mining, data mining has become a powerful tool and
contributes substantially to DNA analysis in the following ways.
a) Semantic integration of heterogeneous, distributed genome databases
Due to the highly distributed, uncontrolled generation and use of a wide variety of DNA data,
the semantic integration of such heterogeneous and widely distributed genome databases becomes
an important task for systematic and coordinated analysis of DNA databases.
Data cleaning and data integration methods developed in data mining will help the integration of
genetic data and the construction of data warehouses for genetic data analysis.
b) Similarity search and comparison among DNA sequences
One of the most important search problems in genetic analysis is similarity search and comparison
among DNA sequence. Gene sequences isolated from diseased and healthy tissues can be
compared to identify critical differences between the two classes of genes. Data transformation
methods such as scaling, normalization and window stitching, which are popularly used in the
analysis of time-series data, are ineffective for genetic data since such data are nonnumeric
and the precise interconnection between different kinds of nucleotides plays an important
role in their function. On the other hand, the analysis of frequent sequential patterns is important
in the analysis of similarity and dissimilarity in genetic sequences.
c) Association analysis
Association analysis methods can be used to help determine the kinds of genes that are likely to
co-occur in target samples. Such analysis would facilitate the discovery of groups of genes and
the study of interactions and relationships between them.
d) Path analysis
While group of genes may contribute to a disease process, different genes may become active
at different stages of the disease. If the sequence of genetic activities across the different
stages of disease development can be identified, it may be possible to develop medicines that
target different stages separately, therefore achieving more effective treatment of the disease.
e) Visualization tools and genetic data analysis
Complex structures and sequencing patterns of genes are most effectively presented in graphs,
trees and chains by various kinds of visualization tools. Such visually appealing structures and
patterns facilitate pattern understanding, knowledge discovery and interactive data exploration.
Visualization therefore plays an important role in biomedical data mining.
13.2 DATA MINING FOR FINANCIAL DATA ANALYSIS
Most banks and financial institutions offer a wide variety of banking services (for example saving,
balance checking, individual transactions), credit (such as loans, mortgage) and investment services
(mutual funds). Some also offer insurance services and stock investment services.
Financial data collected in the banking and financial industry are often relatively complete, reliable and
of high quality, which facilitates systematic data analysis and data mining. The various issues are
a) Design and construction of data warehouses for multidimensional data analysis and
data mining
Data warehouses need to be constructed for banking and financial data. Multidimensional data
analysis methods should be used to analyze the general properties of such data. Data warehouses,
data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses
and outlier analyses all play important roles in financial data analysis and mining.
b) Loan payment prediction and customer credit policy analysis
Loan payment prediction and customer credit analysis are critical to the business of a bank.
Many factors can strongly or weakly influence loan payment performance and customer credit
rating. Data mining methods, such as feature selection and attribute relevance ranking, may
help identify important factors and eliminate irrelevant ones. In some cases, analysis of the
customer payment history may find that, say, the payment-to-income ratio is the dominant factor,
while education level and debt ratio are not. The bank may then decide to adjust its loan-granting
policy so as to grant loans to those whose application was previously denied but whose profile
shows relatively low risks according to the critical factor analysis.
c) Classification and clustering of customers for targeted marketing
Classification and clustering methods can be used for customer group identification and targeted
marketing. Effective clustering and collaborative filtering methods can help identify customer
groups, associate a new customer with an appropriate customer group and facilitate targeted
marketing.
d) Detection of money laundering and other financial crimes
To detect money laundering and other financial crimes, it is important to integrate information
from multiple databases, as long as they are potentially related to the study. Multiple data
analysis tools can then be used to detect unusual patterns, such as large amounts of cash flow at
certain periods, by certain groups of people, and so on. Linkage analysis tools are used to
identify links among different people and activities, classification tools are used to group
different cases, outlier analysis tools are used to detect unusual amounts of fund transfer or
other activities, and sequential pattern analysis tools are used to characterize unusual access
sequences. These tools may identify important relationships and patterns of activities and help
investigators focus on suspicious cases for further detailed examination.
13.3 DATA MINING FOR THE RETAIL INDUSTRY
The retail industry is a major application area for data mining since it collects huge amount of data on
sales, customer shopping history, goods transportation, consumption and service records and so on. The
quantity of data collected continues to expand rapidly, due to web or e-commerce. Today, many stores
also have web sites where customers can make purchases on-line.
Retail data mining can help identify customer buying behaviours, discover customer shopping patterns
and trends, improve the quality of customer service, achieve better customer retention and satisfaction,
enhance goods consumption ratios, design more effective goods transportation and distribution policies
and reduce the cost of the business. The following are a few of the data mining activities carried out in the
retail industry.
a) Design and construction of data warehouses based on the benefits of data mining
The first aspect is to design a data warehouse. This involves deciding which dimensions and
levels to include and what preprocessing to perform in order to facilitate quality and efficient
data mining.
b) Multidimensional analysis of sales, customers, products, time and region
The retail industry requires timely information regarding customer needs, product sales, trends
and fashions as well as the quality, cost, profit and service of commodities. It is therefore
important to provide powerful multidimensional analysis and visualization tools, including the
construction of sophisticated data cubes according to the needs of data analysis.
c) Analysis of the effectiveness of sales campaigns
The retail industry conducts sales campaigns using advertisements, coupons and various kinds
of discounts and bonuses to promote products and attract customers. Careful analysis of the
effectiveness of sales campaigns can help improve company profits. Multi-dimensional analysis
can be used for this purpose by comparing the amount of sales and the number of transactions
containing the sales items during the sales period versus those containing the same items before
or after the sales campaign.
d) Customer retention analysis of customer loyalty
With customer loyalty card information, one can register sequences of purchases of particular
customers. Customer loyalty and purchase trends can be analyzed in a systematic way. Goods
purchased at different periods by the same customer can be grouped into sequences. Sequential
pattern mining can then be used to investigate changes in customer consumption or loyalty and
suggest adjustments to the pricing and variety of goods in order to help retain customers and
attract new customers.
e) Purchase recommendations and cross-reference of items
Using association mining on sales records, one may discover that a customer who buys a
particular brand of bread is likely to buy another set of items. Such information can be used to form
purchase recommendations. Purchase recommendations can be advertised on the web, in weekly
flyers or on the sales receipts to help improve customer service, aid customers in selecting items
and increase sales.
13.4 OTHER APPLICATIONS
As mentioned earlier, data mining is an interdisciplinary field. Data mining can be used in many areas.
Some of the applications are mentioned below.
- Data mining for the telecommunication industry
- Data mining system products and research prototypes
13.5 EXERCISE
1. Identify an application and also explain the techniques that can be incorporated into, in solving the problem
using data mining techniques.