Part A
(Compulsory)
1. a. Define Data mining and list out its functionalities.
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data
from different perspectives and summarizing it into useful information that supports
decision making. Technically, data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases.
There are two types of data mining tasks: descriptive tasks, which describe the general
properties of the existing data, and predictive tasks, which make predictions based on
inference from the available data.
The data mining functionalities and the variety of knowledge they discover are briefly
presented in the following list:
Characterization: Data characterization is a summarization of general features of objects in
a target class, and produces what is called characteristic rules.
Discrimination: Data discrimination produces what are called discriminant rules and is
basically the comparison of the general features of objects between two classes referred to as
the target class and the contrasting class.
Association analysis: Association analysis is the discovery of what are commonly called
association rules.
Classification: Classification analysis is the organization of data in given classes. Also
known as supervised classification, the classification uses given class labels to order the
objects in the data collection.
Prediction: Prediction most often refers to forecasting missing numerical values, or
increase/decrease trends in time-related data.
Clustering: Clustering is the organization of data in classes. However, unlike classification,
in clustering, class labels are unknown and it is up to the clustering algorithm to discover
acceptable classes. Clustering is also called unsupervised classification.
Outlier analysis: Outliers are data elements that cannot be grouped into a given class or
cluster.
Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of
time-related data. Evolution analysis models evolutionary trends in the data, which allows
characterizing, comparing, classifying or clustering of time-related data.
Deviation analysis, on the other hand, considers differences between measured values and
expected values, and attempts to find the cause of the deviations from the anticipated values.
b. Explain Principal Component analysis in detail.
Principal component analysis (PCA) is a technique used to emphasize variation and bring out
strong patterns in a dataset. It's often used to make data easy to explore and visualize.
Objectives of principal component analysis
To discover or reduce the dimensionality of the data set.
To identify new meaningful underlying variables.
PCA takes a data matrix of n objects by p variables, which may be correlated, and
summarizes it by uncorrelated axes (principal components, or principal axes) that are linear
combinations of the original p variables. The first k components display as much as
possible of the variation among the objects.
The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new
positions (principal axes) that have the following properties:
They are ordered such that principal axis 1 has the highest variance, axis 2 has the next
highest variance, ..., and axis p has the lowest variance.
The covariance between each pair of principal axes is zero (the principal axes are
uncorrelated).
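A minimal sketch of this rotation, assuming NumPy is available (the function name pca and the random toy data are illustrative, not from the source): the principal axes are the eigenvectors of the covariance matrix, sorted by descending eigenvalue (variance).

```python
import numpy as np

def pca(X, k):
    """Project an (n x p) data matrix onto its first k principal axes."""
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder axes by variance, descending
    axes = eigvecs[:, order[:k]]            # first k principal axes (p x k)
    return Xc @ axes                        # component scores (n x k)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
scores = pca(X, 2)
print(scores.shape)  # (100, 2)
```

The scores satisfy both properties above: their covariance matrix is diagonal (uncorrelated axes), and the first column has the largest variance.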
Grain
Identifying the grain also means deciding the level of detail that will be made available in
the dimensional model.
Granularity is defined as the level of detail of the information stored in a table.
The more the detail, the lower is the level of granularity.
The lesser the detail, the higher is the level of granularity.
Dimensions/Dimension Tables
The dimensional tables contain attributes (descriptive) which are typically static values
containing textual data or discrete numbers which behave as text values.
Main functionalities :
Query filtering/constraining
Query result set labeling
Star Schema
The basic star schema contains four components.
These are:
Fact table, Dimension tables, Attributes and Dimension hierarchies
(or)
3. a. Write a short note on discretization and binarization.
Discretization is the process of converting a continuous attribute into a discrete attribute. A
common example is rounding off real numbers to integers. Some data mining algorithms
require that the data be in the form of categorical or binary attributes. Thus, it is often
necessary to convert continuous attributes into categorical attributes and/or binary
attributes. It is fairly straightforward to convert categorical attributes into discrete or binary
attributes.
Discretization of Continuous Attributes - Transformation of a continuous attribute to a
categorical attribute involves:
Deciding how many categories to have.
Deciding how to map the values of the continuous attribute to these categories.
A basic distinction between discretization methods for classification is whether class
information is used (supervised) or not (unsupervised).
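The two decisions above can be sketched with NumPy (the bin count and the toy values are illustrative assumptions): an unsupervised equal-width discretization, followed by binarization of the resulting categories via one-hot encoding.

```python
import numpy as np

# Continuous attribute values (hypothetical toy data)
values = np.array([2.0, 3.5, 7.1, 8.8, 4.2, 9.9, 0.5, 6.6])

# Decision 1: how many categories to have
n_bins = 3

# Decision 2: how to map values to categories -- here, equal-width intervals
edges = np.linspace(values.min(), values.max(), n_bins + 1)
# digitize against the interior edges yields a bin index 0..n_bins-1
codes = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

# Binarization: one asymmetric binary attribute per category
onehot = np.eye(n_bins, dtype=int)[codes]
print(codes)
print(onehot)
```

A supervised method would instead choose the cut points using the class labels (e.g. entropy-based splitting), but the mapping step is the same.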
b. Describe various OLAP operations in detail.
OLAP operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
The following diagram illustrates how roll-up works.
Drill-down
Drill-down is performed by stepping down a concept hierarchy for a dimension, for
example the dimension time. Initially the concept hierarchy was "day < month < quarter <
year." On drilling down, the time dimension is descended from the level of quarter to the
level of month. When drill-down is performed, one or more dimensions are added to the
data cube. It navigates from less detailed data to more detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by selecting a single value along one dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.
Here the item and location axes of the 2-D slice are rotated.
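The four operations can be sketched on a tiny cube using pandas (the dimension values match the examples above; the sales figures are hypothetical):

```python
import pandas as pd

# Tiny sales cube: dimensions (location, quarter, item), measure "sales"
df = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [605, 825, 1087, 978],
})

# Roll-up by dimension reduction: aggregate away the item dimension
rollup = df.groupby(["location", "quarter"], as_index=False)["sales"].sum()

# Slice: fix a single value on one dimension, time = "Q1"
slice_q1 = df[df["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once
dice = df[df["location"].isin(["Toronto", "Vancouver"])
          & df["quarter"].isin(["Q1", "Q2"])
          & df["item"].isin(["Mobile", "Modem"])]

# Pivot (rotate): items become columns for an alternative presentation
pivot = df.pivot_table(index="location", columns="item",
                       values="sales", aggfunc="sum")
print(pivot)
```

Drill-down is the inverse of the roll-up shown here: it would re-introduce the item dimension (or descend a hierarchy such as quarter to month) to recover the more detailed rows.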
First, a training set consisting of records whose class labels are known must be provided. The
training set is used to build a classification model, which is subsequently applied to the test
set, which consists of records with unknown class labels.
Information gain
ID3 uses information gain as its attribute selection measure. Let node N represent or hold
the tuples of partition D. The attribute with the highest information gain is chosen as the
splitting attribute for node N. This attribute minimizes the information needed to classify the
tuples in the resulting partitions and reflects the least randomness or impurity in these
partitions. Such an approach minimizes the expected number of tests needed to classify a
given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by

Info(D) = - sum(i = 1 to m) pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated
by |Ci,D| / |D|. The information still required after partitioning D on attribute A into v
partitions is Info_A(D) = sum(j = 1 to v) (|Dj| / |D|) x Info(Dj), and the information gain
is Gain(A) = Info(D) - Info_A(D).
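A small self-contained sketch of these formulas in Python (the function names and the toy "outlook" data are illustrative, not from the source):

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy) Info(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj) for attribute A."""
    n = len(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# Toy data: one attribute per row; class labels "yes"/"no"
rows = [("sunny",), ("sunny",), ("overcast",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, 0))  # 1.0: every partition is pure
```

Here the base entropy is 1.0 (two classes, evenly split) and every partition induced by the attribute is pure, so the gain equals the full 1.0 bits.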
Gain ratio
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there are values), each one containing
just one tuple. Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info_product_ID(D) = 0. Therefore, the information
gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless
for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias. It applies a kind of normalization to information
gain using a split information value defined analogously with Info(D) as

SplitInfo_A(D) = - sum(j = 1 to v) (|Dj| / |D|) x log2(|Dj| / |D|)

The gain ratio is then defined as GainRatio(A) = Gain(A) / SplitInfo_A(D), and the
attribute with the maximum gain ratio is selected as the splitting attribute.
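A short sketch of the normalizing term (the function names and partition sizes are illustrative): split information is just the entropy of the partition sizes, so a product-ID-like attribute that splits D into many one-tuple partitions gets a large denominator, shrinking its gain ratio.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D): entropy of the partition sizes |Dj| themselves."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s)

# Product-ID-like attribute: 4 partitions of 1 tuple each
print(split_info([1, 1, 1, 1]))  # 2.0

# A balanced binary split of 8 tuples
print(split_info([4, 4]))        # 1.0
```

Dividing Gain(A) by these values penalizes the many-valued identifier attribute relative to the binary split, which is exactly the bias correction described above.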
Gini index
The Gini index is used in CART. Using the notation described above, the Gini index
measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - sum(i = 1 to m) pi^2

where pi is the probability that a tuple in D belongs to class Ci.
When considering a binary split, we compute a weighted sum of the impurity of each
resulting partition. For example, if a binary split on A partitions D into D1 and D2, the
Gini index of D given that partitioning is

Gini_A(D) = (|D1| / |D|) x Gini(D1) + (|D2| / |D|) x Gini(D2)
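Both formulas can be sketched directly in Python (the function names and the 9-yes/5-no toy label set are illustrative assumptions):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i pi^2 over the class probabilities in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))  # 0.459

# One candidate binary split of the 14 tuples into two partitions of 7
d1 = ["yes"] * 6 + ["no"] * 1
d2 = ["yes"] * 3 + ["no"] * 4
print(round(gini_split(d1, d2), 3))  # 0.367
```

CART would evaluate gini_split for every candidate binary split and choose the one with the smallest weighted impurity, i.e. the largest reduction Gini(D) - Gini_A(D).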