MODULE-II
Data Mining and its Applications
indexing, and retrieving data from secondary or even tertiary storage systems.
Furthermore, parallel or distributed computing approaches are often necessary if the
desired data mining task is to be performed in a timely manner.
While such techniques can dramatically increase the size of the datasets that can be
handled, they often require the design of new algorithms and data structures.
2. Dimensionality
In some application domains, the number of dimensions (or attributes of a record) can
be very large, which makes the data difficult to analyze because of the curse of
dimensionality.
For example, in bioinformatics, the development of advanced microarray technologies
allows us to analyze gene expression data with thousands of attributes.
The dimensionality of a data mining problem may also increase substantially due to
the temporal, spatial, and sequential nature of the data.
The data needed for an analysis is not stored in one location or owned by one
organization.
Instead the data is geographically distributed among resources belonging to multiple
entities.
This requires the development of distributed data mining techniques. The key
challenges faced by distributed data mining algorithms include
o How to reduce the amount of communication needed to perform the
distributed computation
o How to effectively consolidate the data mining results obtained from multiple
sources
o How to address data security issues.
5. Non-traditional Analysis
The traditional statistical approach is based on a hypothesize-and-test paradigm.
In other words, a hypothesis is proposed, an experiment is designed to gather the data,
and then the data is analyzed with respect to the hypothesis.
This process is labour-intensive. The data sets analyzed in data mining are typically
not the result of a carefully designed experiment and often represent opportunistic
samples of data rather than random samples.
Data mining draws upon ideas from statistics, AI, machine learning, and pattern
recognition to address challenges such as:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
A number of other areas also play key supporting roles.
In particular database systems are needed to provide support for efficient storage,
indexing, and query processing.
Techniques from high performance computing are often important in addressing the
massive size of some data sets.
Distributed techniques can also help address the issue of size and are essential when
the data cannot be gathered in one location. Figure 2.2 shows the relationship of data
mining to other areas.
Figure 2.2: Data mining draws on AI, machine learning, statistics, and pattern recognition.
Data mining tasks are generally divided into two major categories:
• Predictive tasks:
Use some variables to predict unknown or future values of other variables.
Ex: from the observed behaviour of one variable, we can estimate the value of
another variable.
The attribute to be predicted is called the target or dependent variable.
The attributes used for making the prediction are called the explanatory or
independent variables.
• Descriptive tasks:
Here the objective is to derive patterns (correlations, anomalies, clusters, etc.)
that summarize the underlying relationships in the data.
Descriptive tasks often require post-processing of the results to validate and
explain them.
Figure: Four of the core data mining tasks: predictive modelling, cluster analysis,
association analysis, and anomaly detection.
For example, association analysis of market basket data may find a rule that
suggests that customers who buy diapers also tend to buy milk.
Example (document data represented by word counts):
Article   Words
1         dollar:1, industry:4, country:2, loan:3, government:2
2         machinery:2, labor:3, market:4, industry:2, work:3, country:1
3         job:5, inflation:3, rise:2, jobless:2, market:3, country:2
4         patient:4, symptoms:2, drug:3, health:2, clinic:2, doctor:2
5         death:2, cancer:4, drug:3, public:4, health:4, director:1
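Such document data can be represented in code as a sparse term-count vector per article, storing only the words that actually occur in each document. A minimal sketch in Python, using the counts from the table above (the dictionary-of-counts layout is just one possible representation):

# Each article is represented as a sparse vector: only words that actually
# occur in the article are stored, together with their counts.
articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2},
    4: {"patient": 4, "symptoms": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    5: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 4, "director": 1},
}

# A simple query: in which articles does the word "country" appear, and how often?
for article_id, counts in articles.items():
    if "country" in counts:
        print(article_id, counts["country"])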
Data mining uses the power of machine learning, statistics, and database techniques to
mine large databases and come up with patterns.
As a highly application-driven domain, data mining has incorporated many techniques
from other domains such as statistics, machine learning, pattern recognition, database
and data warehouse systems, information retrieval, visualization, algorithms, high
performance computing, and many application domains.
Statistics is the base of all data mining and machine learning algorithms.
Statistics is the study of collecting, analyzing, and interpreting data in order to draw
inferences and make predictions about the future.
A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions. Statistical models are widely used to model data and data classes. For
example, in data mining tasks like data characterization and classification, statistical
models of target classes can be built. In other words, such statistical models can be the
outcome of a data mining task.
Alternatively, data mining tasks can be built on top of statistical models. For example,
we can use statistics to model noise and missing data values. Then, when mining
patterns in a large data set, the data mining process can use the model to help identify
and handle noisy or missing values in the data.
Statistics research develops tools for prediction and forecasting using data and
statistical models. Statistical methods can be used to summarize or describe a collection
of data.
Statistics is useful for mining various patterns from data as well as for understanding
the underlying mechanisms generating and affecting the patterns.
Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing.
A statistical hypothesis test (sometimes called confirmatory data analysis) makes
statistical decisions using experimental data. A result is called statistically significant
if it is unlikely to have occurred by chance. If the classification or prediction model
holds true, then the descriptive statistics of the model increases the soundness of the
model.
Machine learning is a part of data science that focuses mainly on writing
algorithms in such a way that machines (computers) are able to learn on their own
and use what they have learned to make decisions about new data whenever it arrives.
Machine learning uses the power of statistics and learns from a training dataset.
Information retrieval is the extraction of important patterns, features, and knowledge from data.
Data analytics needs this information for processing and visualization. Raw data is
not directly useful; once the important information is extracted from it, it can give
much better insight.
Pattern recognition in data mining is the extraction of information pieces from
structured and unstructured databases. It involves the task of processing unstructured
data, such as web-pages, free-form documents, and e-mail, for extracting named
entities such as people, places, organisations, and their relationships. Algorithms for
data mining have a close relationship with methods of pattern recognition and
machine learning.
For example, to characterize software products whose sales increased by 10% in the last year,
the data related to such products can be collected by executing an SQL query.
Data discrimination: It is a comparison of the general features of target class data objects with
the general features of objects from one or a set of contrasting classes. The target and
contrasting classes can be specified by the user, and the corresponding data objects retrieved
through database queries. For example, the user may like to compare the general features of
software products whose sales increased by 10% in the last year with those whose sales
decreased by at least 30% during the same period. The methods used for data discrimination
are similar to those used for data characterization.
The following decision tree in Figure 2.5 is for the concept buys computer; it indicates whether
a customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
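A small sketch of how such a decision tree could be induced from training data, using scikit-learn; the tiny buys_computer training set below and its integer encoding of the categorical attributes are invented for illustration and are not the data behind Figure 2.5:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training records: [age_group, student, credit_rating] -> buys_computer
# Categorical values are encoded as small integers purely for illustration.
# age_group:     0 = youth, 1 = middle_aged, 2 = senior
# student:       0 = no, 1 = yes
# credit_rating: 0 = fair, 1 = excellent
X = [[0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0], [2, 1, 0],
     [2, 1, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [2, 1, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Each internal node tests one attribute; each leaf predicts a class label.
print(export_text(tree, feature_names=["age_group", "student", "credit_rating"]))
print(tree.predict([[1, 0, 1]]))   # classify a new, unseen customer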
4. Cluster Analysis
Clustering is the process of grouping a set of abstract objects into classes of similar
objects. In general, the class labels are not present in the training data simply because they are
not known to begin with.
It can be used to generate such labels. The objects are clustered or grouped based on the
principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Figure 2.6: A 2-D plot of customer data with respect to customer locations in a city, showing
three data clusters. Each cluster “center” is marked with a “+”.
That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are very dissimilar to objects in other clusters.
Each cluster that is formed can be viewed as a class of objects, from which rules can be
derived. Clustering can also facilitate taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.
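A minimal clustering sketch, assuming scikit-learn and a handful of made-up 2-D customer locations in the spirit of Figure 2.6; the reported cluster centers correspond to the "+" markers in such a plot:

from sklearn.cluster import KMeans

# Made-up (x, y) customer locations; no class labels are supplied.
locations = [[1, 2], [2, 1], [1, 1],        # a group near the lower left
             [8, 8], [9, 7], [8, 9],        # a group near the upper right
             [1, 9], [2, 8], [1, 8]]        # a group near the upper left

# Group the points so that intra-cluster similarity is high and
# inter-cluster similarity is low.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)

print(kmeans.labels_)           # cluster label assigned to each customer
print(kmeans.cluster_centers_)  # the cluster "centers" (the '+' marks)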
5. Outlier Analysis − Outliers may be defined as data objects that do not comply with the
general behavior or model of the data. Most data mining methods discard outliers as noise or
exceptions. Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that are a substantial
distance from any cluster are considered outliers.
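A small sketch of the statistical approach mentioned above, flagging values that lie far from the bulk of the data; the purchase amounts and the 2.5-standard-deviation threshold are invented for illustration:

import numpy as np

# Monthly purchase amounts for a set of customers (made-up numbers);
# one value is far outside the general behaviour of the data.
amounts = np.array([220, 230, 210, 250, 240, 225, 235, 245, 215, 5000])

# Statistical approach: flag values more than 2.5 standard deviations
# from the mean (a simple z-score test).
z_scores = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z_scores) > 2.5])   # -> [5000]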
1. Financial Data Analysis
Design and construction of data warehouses for multidimensional data analysis and
data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
2. Retail Industry
Data mining has great application in the retail industry because it collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue to
expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and trends that lead
to improved quality of customer service and good customer retention and satisfaction. Here is
the list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
3. Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web
data transmission. Due to the development of new computer and communication
technologies, the telecommunication industry is expanding rapidly. This is why
data mining has become very important in helping to understand the business.
is a very important part of Bioinformatics. Following are the aspects in which data mining
contributes to biological data analysis −
6. Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration. Here is the list of areas in which data mining
technology may be applied for intrusion detection −
Data integration
Data transformation
Data reduction
These data preprocessing steps are shown in Figure 2.7.
Missing values
Noisy data
1. Missing Values
Imagine that you need to analyze All Electronics sales and customer data. You note that
many tuples have no recorded value for several attributes, such as customer income. How
can you go about filling in the missing values for this attribute?
Let’s look at the following methods:
Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the
tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like "Unknown" or −∞. If missing values
are replaced by, say, "Unknown," then the mining program may mistakenly think that
they form an interesting concept, since they all have the value "Unknown" in
common. Hence, although this method is simple, it is not foolproof.
Use the attribute mean to fill in the missing value: For example, suppose that the
average income of All Electronics customers is $56,000. Use this value to replace the
missing value for income.
Use the attribute mean for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category as
that of the given tuple.
Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
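A short sketch of three of these filling strategies using pandas; the column names and values below are invented for illustration:

import pandas as pd
import numpy as np

# Toy customer table with missing income values (all values are invented).
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, 61000, np.nan, 40000, np.nan],
})

# Option 1: fill with a global constant (a sentinel value standing in for "Unknown").
filled_const = df["income"].fillna(-1)

# Option 2: fill with the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Option 3: fill with the mean of the same class (credit_risk category).
filled_class_mean = df.groupby("credit_risk")["income"] \
                      .transform(lambda s: s.fillna(s.mean()))

print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist(), sep="\n")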
2. Noisy Data
Noise is a random error or variance in a measured variable. Given a numerical attribute
such as, say, price, how can we “smooth” out the data to remove the noise?
Binning: In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified
as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general,
the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal-
width, where the interval range of values in each bin is constant.
Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the "best" line to fit two attributes (or
variables), so that one attribute can be used to predict the other, as shown in Figure 2.10.
Clustering: Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers.
Figure 2.9: A 2-D plot of customer data with respect to customer locations in a city,
showing three data clusters. Each cluster “center” is marked with a “+”.
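To illustrate the binning approach described above, here is a small sketch that partitions sorted price values into equal-frequency bins and smooths them by bin means and by bin boundaries; the price values are illustrative:

import numpy as np

# Sorted values of the attribute price (illustrative numbers).
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into equal-frequency (equal-depth) bins of 3 values each.
bins = prices.reshape(3, 3)

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value is replaced by the closer of the
# bin's minimum and maximum value.
by_boundaries = np.where(bins - bins.min(axis=1, keepdims=True)
                         < bins.max(axis=1, keepdims=True) - bins,
                         bins.min(axis=1, keepdims=True),
                         bins.max(axis=1, keepdims=True)).ravel()

print(by_means)        # -> [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_boundaries)   # -> [ 4  4 15 21 21 24 25 25 34]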
Data integration combines data from multiple sources into a coherent data
store, as in data warehousing. These sources may include multiple databases, data cubes,
or flat files.
Data integration has to deal with the following issues:
1. Schema integration
The same name or value in different sources can mean different things.
• E.g., the column 'Title' in one database means 'Job Title' while in another database it
means 'Person Title'.
2. Detecting and resolving data value conflicts
Another important issue in data integration is the detection and resolution of data value
conflicts.
For example, for the same real-world entity, attribute values from different sources may
differ. This may be due to differences in representation, scaling, or encoding. For instance, a
weight attribute may be stored in metric units in one system and British imperial units in
another.
3. Redundancy: Redundant data often occur when multiple databases are integrated.
Some redundancies can be detected by correlation analysis.
For two categorical (nominal) attributes, the correlation can be evaluated by a chi-square
test (Equation 1):

χ² = Σ (oij − eij)² / eij,   summed over all cells          (Equation 1)
where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is
the expected frequency of (Ai, Bj), which can be computed as Equation 2:

eij = count(A = ai) × count(B = bj) / N          (Equation 2)

where N is the number of data tuples, count(A = ai) is the number of tuples having value ai
for A, and count(B = bj) is the number of tuples having value bj for B. The sum in Equation 1
is computed over all of the r × c cells.
Table 2.1: 2×2 contingency table of observed (O) and expected (Ex) frequencies

                 Male              Female             Total
                 O       Ex        O        Ex
Fiction          250     90        200      360       450
Non_fiction      50      210       1000     840       1050
Total            300               1200               1500
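A short sketch that reproduces the chi-square computation (Equations 1 and 2) for the observed counts in Table 2.1; a library routine such as scipy's contingency-table test could also be used:

import numpy as np

# Observed counts from Table 2.1 (rows: fiction / non_fiction, columns: male / female).
observed = np.array([[250.0, 200.0],
                     [ 50.0, 1000.0]])

N = observed.sum()                                  # total number of tuples (1500)
row_totals = observed.sum(axis=1, keepdims=True)    # 450, 1050
col_totals = observed.sum(axis=0, keepdims=True)    # 300, 1200

# Equation 2: eij = count(A = ai) * count(B = bj) / N
expected = row_totals * col_totals / N              # [[90, 360], [210, 840]]

# Equation 1: chi-square = sum over all r x c cells of (oij - eij)^2 / eij
chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)                                   # about 507.93, a very large value
                                                    # indicating the attributes are correlated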
For two numeric attributes A and B, correlation can be evaluated by computing the correlation
coefficient rA,B. If rA,B > 0, A and B are positively correlated (A's values increase as B's do);
the higher the value, the stronger the correlation. If rA,B = 0, A and B are independent;
if rA,B < 0, they are negatively correlated.
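For numeric attributes, the correlation coefficient can be computed directly; a minimal sketch with invented values:

import numpy as np

# Two numeric attributes measured over the same tuples (invented values).
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])

# rA,B = sum((ai - mean_A) * (bi - mean_B)) / (N * std_A * std_B)
r = ((A - A.mean()) * (B - B.mean())).sum() / (len(A) * A.std() * B.std())
print(r)                          # close to +1: A and B are positively correlated

# The same value via numpy's built-in correlation matrix.
print(np.corrcoef(A, B)[0, 1])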
In data mining pre-processing, and especially in metadata and data warehousing, data
transformation is used to convert data from a source format into the format required at the destination.
Smoothing
• It works to remove noise from the data.
• It is a form of data cleaning where users specify transformations to correct data
inconsistencies. Such techniques include binning, regression, and clustering.
Aggregation
Aggregation is a form of data reduction, in which summary or aggregation operations are
applied to the data. This step is typically used in constructing a data cube for analysis of
the data at multiple granularities.
Generalization
Here low-level or "primitive" (raw) data are replaced by higher-level concepts through
the use of concept hierarchies. For example, an attribute like age may be mapped to
higher-level concepts such as youth, middle-aged, and senior.
Normalization
Here the attribute data are scaled so as to fall within a small specified range, such as
[-1.0 to 1.0] or [0.0 to 1.0].
There are three methods for data normalization:
o Min-Max normalization: It performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an
attribute A. Min-max normalization maps a value v of A to v' in the range
[new_minA, new_maxA] by computing

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Ex: Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
mapped by the above formula to

v' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

o Normalization by decimal scaling: a value v of A is normalized to v' = v / 10^j, where j
is the smallest integer such that max(|v'|) < 1.
Ex: v' = 986 / 1000 = 0.986 (here j = 3).
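A small sketch of min-max normalization and normalization by decimal scaling, reproducing the two examples above; the helper function names are my own:

import numpy as np

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example: $73,600 in the range [$12,000, $98,000] maps to about 0.716.
print(min_max(73600, 12000, 98000))

def decimal_scaling(values):
    """Normalize by decimal scaling: v' = v / 10^j, using the smallest
    integer j such that max(|v'|) < 1."""
    values = np.asarray(values, dtype=float)
    j = max(0, int(np.floor(np.log10(np.abs(values).max()))) + 1)
    return values / 10 ** j

# 986 is scaled by 10^3, giving 0.986.
print(decimal_scaling([-986, 917]))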
2.7.4 Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the original
data. That is, mining on the reduced data set should be more efficient yet produce the
same analytical results.
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or by higher conceptual levels.
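As a small illustration of numerosity reduction by sampling (strategy 4 in the list above), the following sketch keeps a 1% simple random sample of a synthetic transaction table; the column names and value distributions are invented:

import pandas as pd
import numpy as np

# A large table of transactions (synthetic data standing in for the full set).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "customer_id": rng.integers(1, 1000, size=100_000),
    "amount":      rng.normal(50, 15, size=100_000),
})

# Numerosity reduction by simple random sampling without replacement:
# keep 1% of the tuples and mine the sample instead of the full data set.
sample = data.sample(frac=0.01, random_state=0)

# The sample should preserve the overall behaviour of the data reasonably well.
print(len(sample), round(data["amount"].mean(), 2), round(sample["amount"].mean(), 2))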
1    Yes    Single      125K    No
2    No     Married     100K    No
3    No     Single      70K     No
4    Yes    Married     120K    No
5    No     Divorced    95K     Yes
6    No     Married     60K     No
7    Yes    Divorced    220K    No
8    No     Single      85K     Yes
9    No     Married     75K     No
10   No     Single      90K     Yes
The values used to represent an attribute may have properties that are not properties of
the attribute itself, and vice versa.
Example: employee age and ID number are both represented as integers, yet it is
meaningful to compute the average age of employees, while the average employee ID is
meaningless.
The following properties (operations) of numbers are typically used to describe attributes:
– Distinctness: = and ≠
– Order: <, ≤, >, ≥
– Addition: + and −
– Multiplication: * and /
2] Sparsity
For some data sets, such as those with asymmetric features, most attributes of an
object have values of 0; in many cases, fewer than 1% of the entries are non-zero.
Sparsity is an advantage because usually only the non-zero values need to be stored
and manipulated.
This results in significant savings with respect to computation time and storage.
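A brief sketch of this saving, assuming scipy is available: a sparse matrix format such as CSR stores only the non-zero entries of a mostly-zero data matrix:

import numpy as np
from scipy.sparse import csr_matrix

# A data matrix with asymmetric features: most entries are zero.
dense = np.zeros((1000, 1000))
dense[0, 5] = 3.0
dense[2, 17] = 1.0
dense[999, 0] = 7.0

# A compressed sparse row (CSR) matrix stores only the non-zero values,
# giving large savings in storage and computation time.
sparse = csr_matrix(dense)
print(dense.nbytes)        # 8,000,000 bytes for the dense array
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)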
3] Resolution
It is frequently possible to obtain data at different levels of resolution, and often the
properties of the data are different at different resolutions.
For instance, the surface of the earth seems very uneven at a resolution of a few
meters, but is relatively smooth at a resolution of tens of kilometers.
The patterns in the data also depend on the level of resolution.
1. Record Data
A record data set is a collection of records, each of which consists of a fixed set of data fields (attributes).
In the most basic form of record data, there is no explicit relationship among records or data
fields, and every record has the same set of attributes. Record data is usually stored either in flat
files or in relational databases.
Relational databases are certainly more than a collection of records, but data mining
often does not use any of the additional information available in a relational database.
Types of record data are:
1. Transaction or market basket data
2. The data matrix
3. The sparse data matrix
Transaction or Market Basket Data
Transaction data is a special type of record data, where each record involves a set of
items.
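A minimal sketch of how transaction data can be represented and converted into a binary data matrix; the shopping baskets below are invented for illustration:

import pandas as pd

# Each transaction is a set of items bought together (invented shopping baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# One-hot encode the transactions into a binary (asymmetric) data matrix:
# rows are transactions, columns are items, 1 means the item was purchased.
items = sorted(set().union(*transactions))
matrix = pd.DataFrame([[int(item in t) for item in items] for t in transactions],
                      columns=items)
print(matrix)

# How often do diapers and beer appear in the same transaction?
print(((matrix["diapers"] == 1) & (matrix["beer"] == 1)).sum())   # -> 3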
3. Ordered Data
For some types of data, the attributes have relationships that involve order in time or
space. They include:
Sequential Data
It is also referred to as temporal data and can be thought of as an extension of record
data, where each record has a time associated with it. For example, consider a data set
with five different times (t1, t2, t3, t4, and t5) and three different customers (C1, C2, and C3).
Spatial Data
Some objects have spatial attributes, such as positions or areas, as well as other
types of attributes. An example of spatial data is weather data that is collected
for a variety of geographical locations, as shown in Figure 2.13.
********************************************************