MODULE-II
Data Mining and its Applications
indexing, and retrieving data from secondary or even tertiary storage systems.
Furthermore, parallel or distributed computing approaches are often necessary if the
desired data mining task is to be performed in a timely manner.
While such techniques can dramatically increase the size of the datasets that can be
handled, they often require the design of new algorithms and data structures.
2. Dimensionality
In some application domains, the number of dimensions (or attributes of a record) can
be very large, which makes the data difficult to analyze because of the curse of
dimensionality.
For example, in bioinformatics, the development of advanced microarray technologies
allows us to analyze gene expression data with thousands of attributes.
The dimensionality of a data mining problem may also increase substantially due to
the temporal, spatial, and sequential nature of the data.
The data needed for an analysis is not stored in one location or owned by one
organization.
Instead the data is geographically distributed among resources belonging to multiple
entities.
This requires the development of distributed data mining techniques. The key
challenges faced by distributed data mining algorithms include
o How to reduce the amount of communication needed to perform the
distributed computation
o How to effectively consolidate the data mining results obtained from multiple
sources
o How to address data security issues.
5. Non-traditional Analysis
The traditional statistical approach is based on a hypothesize-and-test paradigm.
In other words, a hypothesis is proposed, an experiment is designed to gather the data,
and then the data is analyzed with respect to the hypothesis.
This process is labour-intensive. The data sets analyzed in data mining are typically
not the result of a carefully designed experiment and often represent opportunistic
samples of data rather than random samples.
Data mining draws upon ideas from statistics, AI, machine learning, and pattern
recognition to address challenges such as:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
A number of other areas also play key supporting roles.
In particular database systems are needed to provide support for efficient storage,
indexing, and query processing.
Techniques from high performance computing are often important in addressing the
massive size of some data sets.
Distributed techniques can also help address the issue of size and are essential when
the data cannot be gathered in one location. Figure 2.2 shows the relationship of data
mining to other areas.
Figure 2.2: Data mining draws on AI, machine learning, statistics, and pattern recognition.
Data mining tasks are generally divided into two major categories:
• Predictive tasks:
Use some variables to predict unknown or future values of other variables.
Ex: from the observed behaviour of one variable, we can estimate the value of
another variable.
The attribute to be predicted is called the target or dependent variable.
The attributes used for making the prediction are called the explanatory or
independent variables.
• Descriptive tasks:
Here the objective is to derive patterns (correlations, anomalies, clusters, etc.)
that summarize the underlying relationships in the data.
Descriptive tasks often require post-processing of the results to validate and
explain them.
Figure: Four of the core data mining tasks: predictive modelling, cluster analysis,
association analysis, and anomaly detection.
For example, association analysis of market basket data may find a rule that
suggests that customers who buy diapers also tend to buy milk.
Example (document data represented by word counts):
Article   Words
1         dollar:1, industry:4, country:2, loan:3, government:2
2         machinery:2, labor:3, market:4, industry:2, work:3, country:1
3         job:5, inflation:3, rise:2, jobless:2, market:3, country:2
4         patient:4, symptoms:2, drug:3, health:2, clinic:2, doctor:2
5         death:2, cancer:4, drug:3, public:4, health:4, director:1
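Such document data can be represented in code as a sparse term-count vector per article, storing only the words that actually occur in each document. A minimal sketch in Python, using the counts from the table above (the dictionary-of-counts layout is just one possible representation):

# Each article is represented as a sparse vector: only words that actually
# occur in the article are stored, together with their counts.
articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2},
    4: {"patient": 4, "symptoms": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    5: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 4, "director": 1},
}

# A simple query: in which articles does the word "country" appear, and how often?
for article_id, counts in articles.items():
    if "country" in counts:
        print(article_id, counts["country"])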
Data mining uses the power of machine learning, statistics, and database techniques to
mine large databases and come up with patterns.
As a highly application-driven domain, data mining has incorporated many techniques
from other domains such as statistics, machine learning, pattern recognition, database
and data warehouse systems, information retrieval, visualization, algorithms, high
performance computing, and many application domains.
Statistics is the base of all data mining and machine learning algorithms.
Statistics is the study of collecting, analyzing, and interpreting data in order to draw
inferences and make predictions about the future.
A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions. Statistical models are widely used to model data and data classes. For
example, in data mining tasks like data characterization and classification, statistical
models of target classes can be built. In other words, such statistical models can be the
outcome of a data mining task.
Alternatively, data mining tasks can be built on top of statistical models. For example,
we can use statistics to model noise and missing data values. Then, when mining
patterns in a large data set, the data mining process can use the model to help identify
and handle noisy or missing values in the data.
Statistics research develops tools for prediction and forecasting using data and
statistical models. Statistical methods can be used to summarize or describe a collection
of data.
Statistics is useful for mining various patterns from data as well as for understanding
the underlying mechanisms generating and affecting the patterns.
Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing.
A statistical hypothesis test (sometimes called confirmatory data analysis) makes
statistical decisions using experimental data. A result is called statistically significant
if it is unlikely to have occurred by chance. If the classification or prediction model
holds true, then the descriptive statistics of the model increases the soundness of the
model.
Machine learning is a part of data science that focuses mainly on writing
algorithms in such a way that machines (computers) are able to learn on their own
and use what they have learned to make decisions about new data whenever it arrives.
Machine learning uses the power of statistics and learns from a training dataset.
Information retrieval is the extraction of important patterns, features, and knowledge from data.
Data analytics needs this information for processing and visualization. Raw data is
not directly useful; once the important information is extracted from it, it can give
much better insight.
Pattern recognition in data mining is the extraction of information pieces from
structured and unstructured databases. It involves the task of processing unstructured
data, such as web-pages, free-form documents, and e-mail, for extracting named
entities such as people, places, organisations, and their relationships. Algorithms for
data mining have a close relationship with methods of pattern recognition and
machine learning.
For example, to characterize software products whose sales increased by 10% in the last year,
the data related to such products can be collected by executing an SQL query.
Data discrimination: It is a comparison of the general features of target class data objects with
the general features of objects from one or a set of contrasting classes. The target and
contrasting classes can be specified by the user, and the corresponding data objects retrieved
through database queries. For example, the user may like to compare the general features of
software products whose sales increased by 10% in the last year with those whose sales
decreased by at least 30% during the same period. The methods used for data discrimination
are similar to those used for data characterization.
The following decision tree in Figure 2.5 is for the concept buys computer; it indicates whether
a customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
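A small sketch of how such a decision tree could be induced from training data, using scikit-learn; the tiny buys_computer training set below and its integer encoding of the categorical attributes are invented for illustration and are not the data behind Figure 2.5:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training records: [age_group, student, credit_rating] -> buys_computer
# Categorical values are encoded as small integers purely for illustration.
# age_group:     0 = youth, 1 = middle_aged, 2 = senior
# student:       0 = no, 1 = yes
# credit_rating: 0 = fair, 1 = excellent
X = [[0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0], [2, 1, 0],
     [2, 1, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [2, 1, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Each internal node tests one attribute; each leaf predicts a class label.
print(export_text(tree, feature_names=["age_group", "student", "credit_rating"]))
print(tree.predict([[1, 0, 1]]))   # classify a new, unseen customer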
4. Cluster Analysis
Clustering is the process of grouping a set of abstract objects into classes of similar
objects. In general, the class labels are not present in the training data simply because they are
not known to begin with.
It can be used to generate such labels. The objects are clustered or grouped based on the
principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Figure 2.6: A 2-D plot of customer data with respect to customer locations in a city, showing
three data clusters. Each cluster “center” is marked with a “+”.
That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are very dissimilar to objects in other clusters.
Each cluster that is formed can be viewed as a class of objects, from which rules can be
derived. Clustering can also facilitate taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.
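A minimal clustering sketch, assuming scikit-learn and a handful of made-up 2-D customer locations in the spirit of Figure 2.6; the reported cluster centers correspond to the "+" markers in such a plot:

from sklearn.cluster import KMeans

# Made-up (x, y) customer locations; no class labels are supplied.
locations = [[1, 2], [2, 1], [1, 1],        # a group near the lower left
             [8, 8], [9, 7], [8, 9],        # a group near the upper right
             [1, 9], [2, 8], [1, 8]]        # a group near the upper left

# Group the points so that intra-cluster similarity is high and
# inter-cluster similarity is low.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)

print(kmeans.labels_)           # cluster label assigned to each customer
print(kmeans.cluster_centers_)  # the cluster "centers" (the '+' marks)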
5. Outlier Analysis − Outliers may be defined as data objects that do not comply with the
general behavior or model of the data. Most data mining methods discard outliers as noise or
exceptions. Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that are a substantial
distance from any cluster are considered outliers.
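A small sketch of the statistical approach mentioned above, flagging values that lie far from the bulk of the data; the purchase amounts and the 2.5-standard-deviation threshold are invented for illustration:

import numpy as np

# Monthly purchase amounts for a set of customers (made-up numbers);
# one value is far outside the general behaviour of the data.
amounts = np.array([220, 230, 210, 250, 240, 225, 235, 245, 215, 5000])

# Statistical approach: flag values more than 2.5 standard deviations
# from the mean (a simple z-score test).
z_scores = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z_scores) > 2.5])   # -> [5000]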
1. Financial Data Analysis
Design and construction of data warehouses for multidimensional data analysis and
data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
2. Retail Industry
Data mining has great application in the retail industry because it collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue to
expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and trends that lead
to improved quality of customer service and good customer retention and satisfaction. Here is
the list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
3. Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web
data transmission. Due to the development of new computer and communication
technologies, the telecommunication industry is expanding rapidly. This is why
data mining has become very important in helping to understand the business.
is a very important part of Bioinformatics. Following are the aspects in which data mining
contributes to biological data analysis −
6. Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration. Here is the list of areas in which data mining
technology may be applied for intrusion detection −
Data integration
Data transformation
Data reduction
These data preprocessing steps are shown in Figure 2.7.
Missing values
Noisy data
1. Missing Values
Imagine that you need to analyze All Electronics sales and customer data. You note that
many tuples have no recorded value for several attributes, such as customer income. How
can you go about filling in the missing values for this attribute?
Let’s look at the following methods:
Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the
tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like "Unknown" or −∞. If missing values
are replaced by, say, "Unknown," then the mining program may mistakenly think that
they form an interesting concept, since they all have the value "Unknown" in
common. Hence, although this method is simple, it is not foolproof.
Use the attribute mean to fill in the missing value: For example, suppose that the
average income of All Electronics customers is $56,000. Use this value to replace the
missing value for income.
Use the attribute mean for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category as
that of the given tuple.
Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
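A short sketch of three of these filling strategies using pandas; the column names and values below are invented for illustration:

import pandas as pd
import numpy as np

# Toy customer table with missing income values (all values are invented).
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, 61000, np.nan, 40000, np.nan],
})

# Option 1: fill with a global constant (a sentinel value standing in for "Unknown").
filled_const = df["income"].fillna(-1)

# Option 2: fill with the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Option 3: fill with the mean of the same class (credit_risk category).
filled_class_mean = df.groupby("credit_risk")["income"] \
                      .transform(lambda s: s.fillna(s.mean()))

print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist(), sep="\n")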
2. Noisy Data
Noise is a random error or variance in a measured variable. Given a numerical attribute
such as, say, price, how can we “smooth” out the data to remove the noise?
Binning: In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified
as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general,
the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal-
width, where the interval range of values in each bin is constant.
Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the "best" line to fit two attributes (or
variables), so that one attribute can be used to predict the other, as shown in Figure 2.10.
Clustering: Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers.
Figure 2.9: A 2-D plot of customer data with respect to customer locations in a city,
showing three data clusters. Each cluster “center” is marked with a “+”.
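To illustrate the binning approach described above, here is a small sketch that partitions sorted price values into equal-frequency bins and smooths them by bin means and by bin boundaries; the price values are illustrative:

import numpy as np

# Sorted values of the attribute price (illustrative numbers).
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into equal-frequency (equal-depth) bins of 3 values each.
bins = prices.reshape(3, 3)

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value is replaced by the closer of the
# bin's minimum and maximum value.
by_boundaries = np.where(bins - bins.min(axis=1, keepdims=True)
                         < bins.max(axis=1, keepdims=True) - bins,
                         bins.min(axis=1, keepdims=True),
                         bins.max(axis=1, keepdims=True)).ravel()

print(by_means)        # -> [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_boundaries)   # -> [ 4  4 15 21 21 24 25 25 34]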
Data integration combines data from multiple sources into a coherent data
store, as in data warehousing. These sources may include multiple databases, data cubes,
or flat files.
Data integration has to deal with the following issues:
1. Schema integration
The same name or value in different sources can mean different things.
• E.g., the column 'Title' in one database means 'Job Title' while in another database it
means 'Person Title'.
2. Detecting and resolving data value conflicts
Another important issue in data integration is the detection and resolution of data value
conflicts.
For example, for the same real-world entity, attribute values from different sources may
differ. This may be due to differences in representation, scaling, or encoding. For instance, a
weight attribute may be stored in metric units in one system and British imperial units in
another.
3. Redundancy: Redundant data often occur when multiple databases are integrated.
Some redundancies can be detected by correlation analysis.
For two categorical (nominal) attributes, the correlation can be evaluated by a chi-square
test (Equation 1):

χ² = Σ (oij − eij)² / eij,   summed over all cells          (Equation 1)
where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is
the expected frequency of (Ai, Bj), which can be computed as Equation 2:

eij = count(A = ai) × count(B = bj) / N          (Equation 2)

where N is the number of data tuples, count(A = ai) is the number of tuples having value ai
for A, and count(B = bj) is the number of tuples having value bj for B. The sum in Equation 1
is computed over all of the r × c cells.
Table 2.1: 2×2 contingency table of observed (O) and expected (Ex) frequencies

                 Male              Female             Total
                 O       Ex        O        Ex
Fiction          250     90        200      360       450
Non_fiction      50      210       1000     840       1050
Total            300               1200               1500
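A short sketch that reproduces the chi-square computation (Equations 1 and 2) for the observed counts in Table 2.1; a library routine such as scipy's contingency-table test could also be used:

import numpy as np

# Observed counts from Table 2.1 (rows: fiction / non_fiction, columns: male / female).
observed = np.array([[250.0, 200.0],
                     [ 50.0, 1000.0]])

N = observed.sum()                                  # total number of tuples (1500)
row_totals = observed.sum(axis=1, keepdims=True)    # 450, 1050
col_totals = observed.sum(axis=0, keepdims=True)    # 300, 1200

# Equation 2: eij = count(A = ai) * count(B = bj) / N
expected = row_totals * col_totals / N              # [[90, 360], [210, 840]]

# Equation 1: chi-square = sum over all r x c cells of (oij - eij)^2 / eij
chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)                                   # about 507.93, a very large value
                                                    # indicating the attributes are correlated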
For two numeric attributes A and B, correlation can be evaluated by computing the correlation
coefficient rA,B. If rA,B > 0, A and B are positively correlated (A's values increase as B's do);
the higher the value, the stronger the correlation. If rA,B = 0, A and B are independent;
if rA,B < 0, they are negatively correlated.
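For numeric attributes, the correlation coefficient can be computed directly; a minimal sketch with invented values:

import numpy as np

# Two numeric attributes measured over the same tuples (invented values).
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])

# rA,B = sum((ai - mean_A) * (bi - mean_B)) / (N * std_A * std_B)
r = ((A - A.mean()) * (B - B.mean())).sum() / (len(A) * A.std() * B.std())
print(r)                          # close to +1: A and B are positively correlated

# The same value via numpy's built-in correlation matrix.
print(np.corrcoef(A, B)[0, 1])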
In data mining pre-processing, and especially in metadata and data warehousing, data
transformation is used to convert data from a source format into the format required at the destination.
Smoothing
• It works to remove noise from the data.
• It is a form of data cleaning where users specify transformations to correct data
inconsistencies. Such techniques include binning, regression, and clustering.
Aggregation
Aggregation is a form of data reduction, in which summary or aggregation operations are
applied to the data. This step is typically used in constructing a data cube for analysis of
the data at multiple granularities.
Generalization
Here low-level or "primitive" (raw) data are replaced by higher-level concepts through
the use of concept hierarchies. For example, an attribute like age may be mapped to
higher-level concepts such as youth, middle-aged, and senior.
Normalization
Here the attribute data are scaled so as to fall within a small specified range, such as
[-1.0 to 1.0] or [0.0 to 1.0].
There are three methods for data normalization:
o Min-Max normalization: It performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an
attribute A. Min-max normalization maps a value v of A to v' in the range
[new_minA, new_maxA] by computing

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Ex: Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
mapped by the above formula to

v' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

o Normalization by decimal scaling: a value v of A is normalized to v' = v / 10^j, where j
is the smallest integer such that max(|v'|) < 1.
Ex: v' = 986 / 1000 = 0.986 (here j = 3).
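A small sketch of min-max normalization and normalization by decimal scaling, reproducing the two examples above; the helper function names are my own:

import numpy as np

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example: $73,600 in the range [$12,000, $98,000] maps to about 0.716.
print(min_max(73600, 12000, 98000))

def decimal_scaling(values):
    """Normalize by decimal scaling: v' = v / 10^j, using the smallest
    integer j such that max(|v'|) < 1."""
    values = np.asarray(values, dtype=float)
    j = max(0, int(np.floor(np.log10(np.abs(values).max()))) + 1)
    return values / 10 ** j

# 986 is scaled by 10^3, giving 0.986.
print(decimal_scaling([-986, 917]))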
2.7.4 Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the original
data. That is, mining on the reduced data set should be more efficient yet produce the
same analytical results.
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or by higher conceptual levels.
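As a small illustration of numerosity reduction by sampling (strategy 4 in the list above), the following sketch keeps a 1% simple random sample of a synthetic transaction table; the column names and value distributions are invented:

import pandas as pd
import numpy as np

# A large table of transactions (synthetic data standing in for the full set).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "customer_id": rng.integers(1, 1000, size=100_000),
    "amount":      rng.normal(50, 15, size=100_000),
})

# Numerosity reduction by simple random sampling without replacement:
# keep 1% of the tuples and mine the sample instead of the full data set.
sample = data.sample(frac=0.01, random_state=0)

# The sample should preserve the overall behaviour of the data reasonably well.
print(len(sample), round(data["amount"].mean(), 2), round(sample["amount"].mean(), 2))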
1    Yes    Single      125K    No
2    No     Married     100K    No
3    No     Single      70K     No
4    Yes    Married     120K    No
5    No     Divorced    95K     Yes
6    No     Married     60K     No
7    Yes    Divorced    220K    No
8    No     Single      85K     Yes
9    No     Married     75K     No
10   No     Single      90K     Yes
The values used to represent an attribute may have properties that are not properties of
the attribute itself, and vice versa.
Example: employee age and ID number are both represented as integers, yet it is
meaningful to compute the average age of employees, while the average employee ID is
meaningless.
The following properties (operations) of numbers are typically used to describe attributes:
– Distinctness: = and ≠
– Order: <, ≤, >, ≥
– Addition: + and −
– Multiplication: * and /
2] Sparsity
For some data sets, such as those with asymmetric features, most attributes of an
object have values of 0; in many cases, fewer than 1% of the entries are non-zero.
Sparsity is an advantage because usually only the non-zero values need to be stored
and manipulated.
This results in significant savings with respect to computation time and storage.
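A brief sketch of this saving, assuming scipy is available: a sparse matrix format such as CSR stores only the non-zero entries of a mostly-zero data matrix:

import numpy as np
from scipy.sparse import csr_matrix

# A data matrix with asymmetric features: most entries are zero.
dense = np.zeros((1000, 1000))
dense[0, 5] = 3.0
dense[2, 17] = 1.0
dense[999, 0] = 7.0

# A compressed sparse row (CSR) matrix stores only the non-zero values,
# giving large savings in storage and computation time.
sparse = csr_matrix(dense)
print(dense.nbytes)        # 8,000,000 bytes for the dense array
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)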
3] Resolution
It is frequently possible to obtain data at different levels of resolution, and often the
properties of the data are different at different resolutions.
For instance, the surface of the earth seems very uneven at a resolution of a few
meters, but is relatively smooth at a resolution of tens of kilometers.
The patterns in the data also depend on the level of resolution.
1. Record Data
A record data set is a collection of records, each of which consists of a fixed set of data fields (attributes).
In the most basic form of record data, there is no explicit relationship among records or data
fields, and every record has the same set of attributes. Record data is usually stored either in flat
files or in relational databases.
Relational databases are certainly more than a collection of records, but data mining
often does not use any of the additional information available in a relational database.
Types of record data are:
1. Transaction or market basket data
2. The data matrix
3. The sparse data matrix
Transaction or Market Basket Data
Transaction data is a special type of record data, where each record involves a set of
items.
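A minimal sketch of how transaction data can be represented and converted into a binary data matrix; the shopping baskets below are invented for illustration:

import pandas as pd

# Each transaction is a set of items bought together (invented shopping baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# One-hot encode the transactions into a binary (asymmetric) data matrix:
# rows are transactions, columns are items, 1 means the item was purchased.
items = sorted(set().union(*transactions))
matrix = pd.DataFrame([[int(item in t) for item in items] for t in transactions],
                      columns=items)
print(matrix)

# How often do diapers and beer appear in the same transaction?
print(((matrix["diapers"] == 1) & (matrix["beer"] == 1)).sum())   # -> 3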
3. Ordered Data
For some types of data, the attributes have relationships that involve order in time or
space. They include:
Sequential Data
It is also referred to as temporal data and can be thought of as an extension of record
data, where each record has a time associated with it. For example, consider a data set
with five different times (t1, t2, t3, t4, and t5) and three different customers (C1, C2, and C3).
Spatial Data
Some objects have spatial attributes, such as positions or areas, as well as other
types of attributes. An example of spatial data is weather data that is collected
for a variety of geographical locations, as shown in Figure 2.13.
********************************************************