
Data Mining - Classification & Prediction

There are two forms of data analysis that can be used to extract models describing
important classes or to predict future data trends: classification and prediction.
Classification models predict categorical class labels, while prediction models predict
continuous-valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditure in dollars of potential customers on computer equipment, given their income
and occupation.
What is classification?
The following are examples of cases where the data analysis task is classification:
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for the loan application data, and yes or no for the marketing data.
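As a toy illustration of the loan classifier described above, the following Python sketch uses a 1-nearest-neighbour rule; the features (income and debt, in thousands of dollars) and all training records are invented for illustration, not taken from any real dataset, and a real classifier would use a proper learning algorithm and far more data.

```python
import math

# Hypothetical training data: (income in $1000s, existing debt in $1000s) -> label.
training = [
    ((85, 5), "safe"),
    ((90, 10), "safe"),
    ((30, 40), "risky"),
    ((25, 35), "risky"),
]

def classify(applicant):
    """1-nearest-neighbour: assign the label of the closest training example."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], applicant))
    return nearest[1]

print(classify((80, 8)))   # near the "safe" applicants -> safe
print(classify((28, 38)))  # near the "risky" applicants -> risky
```

The classifier outputs a categorical label (safe/risky) rather than a number, which is exactly the distinction the text draws between classification and prediction.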
What is prediction?
The following are examples of cases where the data analysis task is prediction:
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example, we are asked to predict a numeric value.
Therefore, the data analysis task is an example of numeric prediction. In this case, a model or
predictor is constructed that predicts a continuous-valued function, or ordered value.
Note: Regression analysis is a statistical methodology that is most often used for numeric prediction.
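Since regression is the usual tool for numeric prediction, here is a minimal sketch of simple (one-variable) linear regression fitted by least squares; the income and spend figures are made up, and deliberately chosen to lie on a line so the result is easy to check.

```python
# Toy data (hypothetical): customer income in $1000s -> spend in dollars.
incomes = [30, 50, 70, 90]
spends = [300, 500, 700, 900]

# Least-squares estimates for slope and intercept of spend = a*income + b.
n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(spends) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, spends)) \
        / sum((x - mean_x) ** 2 for x in incomes)
intercept = mean_y - slope * mean_x

def predict(income):
    """Predict a continuous value (dollars) from income."""
    return slope * income + intercept

print(predict(60))  # -> 600.0 on this toy data
```

Unlike the classifier above, the output here is a continuous number, which is what makes this a prediction task in the document's terminology.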
Classification and Prediction Issues
The major issue is preparing the data for classification and prediction. Preparing the data
involves the following activities:
Data Cleaning - Data cleaning involves removing the noise and treating missing
values. Noise is removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a missing value with the most
commonly occurring value for that attribute.
Relevance Analysis - A database may also contain irrelevant attributes. Correlation
analysis is used to determine whether any two given attributes are related.
Data Transformation and Reduction - The data can be transformed by any of the
following methods:
o Normalization - The data is transformed using normalization. Normalization
involves scaling all values for a given attribute so that they fall
within a small specified range. Normalization is used when, in the learning
step, neural networks or methods involving distance measurements are used.
o Generalization - The data can also be transformed by generalizing it to a
higher-level concept. For this purpose, we can use concept hierarchies.
Note: Data can also be reduced by other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
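One common instance of the scaling described above is min-max normalization, which maps an attribute's values into a chosen range (the function below defaults to [0, 1]); this is a standard technique, sketched here in Python on made-up income values.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale all values of an attribute into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]  # hypothetical attribute values
print(min_max_normalize(incomes))  # -> [0.0, 0.25, 0.5, 1.0]
```

After normalization, attributes measured on very different scales (e.g. income in dollars vs. age in years) contribute comparably to distance-based methods such as the neural-network training mentioned above.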
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and prediction:
Accuracy - The accuracy of a classifier refers to its ability to predict the
class label correctly; the accuracy of a predictor refers to how well it
can guess the value of the predicted attribute for new data.
Speed - This refers to the computational cost of generating and using the classifier or
predictor.
Robustness - This refers to the ability of the classifier or predictor to make correct
predictions from given noisy data.
Scalability - This refers to the ability to construct the classifier or predictor
efficiently, given a large amount of data.
Interpretability - This refers to the extent to which the classifier or predictor can be understood.
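For the accuracy criterion, the classifier case reduces to a simple proportion: the fraction of test records whose predicted label matches the true label. A minimal sketch (labels are hypothetical):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true class labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

print(accuracy(["safe", "risky", "safe", "safe"],
               ["safe", "risky", "risky", "safe"]))  # 3 of 4 correct -> 0.75
```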
Data Generalization and Summarization-Based Characterization
Data Generalization is the process of creating successive layers of summary data in an
evolutional database. It is a process of zooming out to get a broader view of a problem, trend,
or situation. It is also known as rolling up data.
Data and objects in databases often contain detailed information at primitive concept levels.
For example, the item relation in a sales database may contain attributes describing low-level
item information such as item_ID, name, brand, category, supplier, place_made, and price.
It is useful to be able to summarize a large set of data and present it at a high conceptual
level. For example, summarizing a large set of items relating to Christmas season sales
provides a general description of such data, which can be very helpful for sales and
marketing managers. This requires an important functionality in data mining: data
generalization. Data generalization is a process that abstracts a large set of task-relevant data
in a database from a relatively low conceptual level to higher conceptual levels. Methods for
the efficient and flexible generalization of large data sets can be categorized according to two
approaches: (1) the data cube (or OLAP) approach and (2) the attribute-oriented induction
approach. In this section, we describe the attribute-oriented induction approach.
Attribute-Oriented Induction
The attribute-oriented induction (AOI) approach to data generalization and summarization-
based characterization was first proposed in 1989, a few years prior to the introduction of the
data cube approach. The data cube approach can be considered a data warehouse-based,
pre-computation-oriented, materialized-view approach. It performs off-line aggregation
before an OLAP or data mining query is submitted for processing. In contrast, the
attribute-oriented induction approach, at least in its initial proposal, is a relational database
query-oriented, generalization-based, on-line data analysis technique. However, there is no
inherent barrier distinguishing the two approaches based on on-line aggregation versus
off-line pre-computation. Some aggregations in the data cube can be computed on-line, while
off-line pre-computation of multidimensional space can speed up attribute-oriented
induction as well.
Association Rules
Association rules are if/then statements that help uncover relationships between seemingly
unrelated data in a relational database or other information repository. An example of an
association rule would be "If a customer buys a dozen eggs, he is 80% likely to also purchase
milk."
An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent
is an item found in the data. A consequent is an item that is found in combination with the
antecedent.
Association rules are created by analyzing data for frequent if/then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an
indication of how frequently the items appear in the database. Confidence indicates the
number of times the if/then statements have been found to be true.
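Support and confidence can be computed directly from a transaction list. The sketch below evaluates the rule {eggs} -> {milk} over a small hypothetical basket dataset (the transactions are invented for illustration).

```python
# Each transaction is the set of items in one shopping basket (hypothetical).
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "butter"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"eggs"}))               # eggs appear in 3 of 4 baskets -> 0.75
print(confidence({"eggs"}, {"milk"}))  # 2 of the 3 egg baskets have milk -> 0.666...
```

Rules whose support and confidence both exceed user-chosen thresholds are the "most important relationships" the text refers to.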
In data mining, association rules are useful for analyzing and predicting customer behavior.
They play an important part in shopping basket data analysis, product clustering, catalog
design and store layout.
Programmers use association rules to build programs capable of machine learning. Machine
learning is a type of artificial intelligence (AI) that seeks to build programs with the ability to
become more efficient without being explicitly programmed.
Machine learning
Machine learning is a method of data analysis that automates analytical model building.
Using algorithms that iteratively learn from data, machine learning allows computers to find
hidden insights without being explicitly programmed where to look.
Machine learning today makes it possible to quickly and automatically produce models that
can analyse bigger, more complex data, and deliver faster, more accurate
results. By building precise models, an organisation has a better chance of identifying
profitable opportunities and safer outcomes, or avoiding unknown risks. (2)
Machine learning can be applied in cases where the desired outcome is known (guided
learning), where the data is unlabelled (unguided learning), or where learning is the
result of interaction between the environment and a model (reinforcement learning). (2)
Importantly, it reduces the need for human intervention: the machine improves on its own.

1) Basic parameters of a mapper
Ans: HADOOP - The four basic parameters of a mapper are LongWritable, Text, Text,
and IntWritable. The first two represent input parameters and the second two
represent intermediate output parameters.
The four parameters for reducers are:
Text (intermediate output)
IntWritable (intermediate output)
Text (final output)
IntWritable (final output)

How many map tasks in Hadoop?
The number of map tasks in a program is determined by the total number of blocks of the input
files. For maps, the right level of parallelism seems to be around 10-100 maps per node, although
it has been set up to 300 maps for CPU-light map tasks. Since task setup takes some time,
it is better if the maps take at least a minute to execute.
If we have a block size of 128 MB and we expect 10 TB of input data, we will have about 82,000
maps. Ultimately, the number of maps is determined by the InputFormat.
Mappers = (total data size) / (input split size)
Example: if the data size is 1 TB and the input split size is 100 MB:
Mappers = (1,000 * 1,000) / 100 = 10,000
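The formula above can be written as a small helper (rounding up, since a partial last split still gets its own map task); the 1 TB figure is taken, as in the example, as 1,000 * 1,000 MB.

```python
import math

def num_mappers(total_data_mb, split_size_mb):
    """One map task per input split, rounding up for a partial last split."""
    return math.ceil(total_data_mb / split_size_mb)

print(num_mappers(1_000_000, 100))  # 1 TB at 100 MB splits -> 10000
print(num_mappers(1_000, 128))      # 1000 MB at 128 MB splits -> 8
```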
Mapper stage: It is the first stage in the process which splits out each word into a separate
string (i.e. tokenizing the string) and for each word seen, it will output the word and a 1
(which is the count value) to indicate that it has seen the word one time.

Shuffle / Combiner stage: The shuffle phase uses the word as the key, hashing the
records so that all pairs for the same word are routed to the same reducer.

Reducer phase: The reduce phase will then sum up the number of times each word was seen
and write that sum count together with the word as output.
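The three word-count stages above can be sketched as a single-process Python simulation (this mimics what the framework does; it is not actual Hadoop code, which would be written in Java against the MapReduce API).

```python
from collections import defaultdict

def mapper(line):
    """Map stage: tokenize the line and emit (word, 1) for each word seen."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    """Shuffle stage: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce stage: sum the counts for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(mapped).items())
print(result["the"], result["fox"])  # -> 3 2
```

In real Hadoop, the map and reduce functions run as distributed tasks and the shuffle is performed by the framework between them; the data flow, however, is the same.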

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate
records. The transformed intermediate records do not need to be of the same type as
the input records. A given input pair may map to zero or many output pairs.
The Hadoop Map/Reduce framework spawns one map task for each
InputSplit generated by the InputFormat for the job.


Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).
Overall, Reducer implementations are passed the JobConf for the job via the
JobConfigurable.configure(JobConf) method and can override it to initialize
themselves. The framework then calls reduce(WritableComparable, Iterator,
OutputCollector, Reporter) method for each <key, (list of values)> pair in the
grouped inputs. Applications can then override the Closeable.close() method to
perform any required cleanup.
Reducer has 3 primary phases: shuffle, sort and reduce.

What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on
data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.
Applications of Cluster Analysis
Cluster analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base
and characterize those groups based on their purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures inherent
to populations.
Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a city
according to house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit
card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining:
Scalability - We need highly scalable clustering algorithms to deal with large
databases.
Ability to deal with different kinds of attributes - Algorithms should be capable of
being applied to any kind of data, such as interval-based (numerical), categorical,
and binary data.
Discovery of clusters with arbitrary shape - The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to
distance measures that tend to find only small spherical clusters.
High dimensionality - The clustering algorithm should be able to handle not only
low-dimensional data but also high-dimensional space.
Ability to deal with noisy data - Databases contain noisy, missing, or erroneous
data. Some algorithms are sensitive to such data and may lead to poor-quality
clusters.
Interpretability - The clustering results should be interpretable, comprehensible,
and usable.
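The section does not commit to a particular algorithm, but k-means is one common partitioning method and makes the partition-then-label process concrete. A minimal one-dimensional sketch (the data points are invented, and a production implementation would handle convergence checks and multiple restarts):

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means: alternate point assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups of points; the centroids land near 1.0 and 10.0.
points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(k_means(points, 2))
```

Once the groups are found, labels are assigned to them afterwards, which is the "partition first, label second" workflow described under Points to Remember.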
Fraud detection in data mining
In highly regulated sectors like financial, healthcare, insurance, retail, and social security,
combating fraud is essential as there is a multitude of compliance, regulations, risk
management measures, and monetary consequences to be dealt with. The proliferation of
modern technology has produced more sophisticated fraud techniques, but technology
advancements have also enabled smarter approaches to detect fraud. In a world where
transactions and documents are digitally recorded in one way or another, evidence is out there
to aid investigators in the battle against damaging fraudulent schemes. The more difficult
question is "how to easily and quickly find that evidence?"
Massive Data, Legacy Data Warehouse, and SQL Queries = Delays and Headaches

Take our insurance agency customer for example.

Traditionally, their fraud investigation team had relied on data analysts to execute SQL
queries against a data warehouse that stores massive amounts of claims, billings, and other
information. Due to the volume, velocity, and variety of data in the warehouse, the
process could take weeks or months before enough evidence for a legal case was
developed. As with any other business, the longer it takes to detect fraud, the greater the
losses the organization suffers.
Shedding New Light on Fraud Detection Techniques with Predictive Analytics and
Machine Learning

Given the vast amount of data that our investigators need to sift through to find fraudulent
patterns, an integrated big data and search architecture emerged as the most feasible solution:
Public data, such as providers' information, codes for healthcare procedures, etc., is
aggregated and processed through the big data framework, which performs massive
denormalization to distribute data into multiple tables and fields.
The processed data is then loaded into a search engine.
Machine learning and predictive analytics work to pinpoint fraud red flags and proactively
detect suspicious fraud schemes.
A search-based graphical user interface is provided to investigators for analysis and
evidence documentation.
The big data architecture enables the insurance agency's fraud detection effort to be
more scalable, faster, and more accurate. Because the system processes and analyzes
every record of the available data, it also gives investigators more confidence in their findings
(we like this better than sampling techniques and plain hunches!).

Hadoop Architecture Overview

Apache Hadoop is an open-source software framework for storage and large-scale
processing of data sets on clusters of commodity hardware. There are mainly five building
blocks inside this runtime environment (from bottom to top):
the cluster is the set of host machines (nodes). Nodes may be partitioned in racks.
This is the hardware part of the infrastructure.
the YARN Infrastructure (Yet Another Resource Negotiator) is the framework
responsible for providing the computational resources (e.g., CPUs, memory, etc.)
needed for application executions. Two important elements are:
o the Resource Manager (one per cluster) is the master. It knows where the
slaves are located (Rack Awareness) and how many resources they have. It
runs several services, the most important is the Resource Scheduler which
decides how to assign the resources.

o the Node Manager (many per cluster) is the slave of the infrastructure. When
it starts, it announces itself to the Resource Manager and periodically sends
a heartbeat to the Resource Manager. Each Node Manager offers some
resources to the cluster; its resource capacity is the amount of memory and the
number of vcores. At run time, the Resource Scheduler decides how to use
this capacity: a Container is a fraction of the NM capacity, and it is used by
the client for running a program.

the HDFS Federation is the framework responsible for providing permanent, reliable,
and distributed storage. This is typically used for storing inputs and outputs (but not
intermediate ones).
alternative storage solutions may also be used. For instance, Amazon uses the Simple
Storage Service (S3).
the MapReduce Framework is the software layer implementing the MapReduce paradigm.
The YARN infrastructure and the HDFS federation are completely decoupled and
independent: the first provides resources for running an application, while the second
provides storage. The MapReduce framework is only one of many possible frameworks that
run on top of YARN (although it is currently the only one implemented).
YARN: Application Startup

In YARN, there are at least three actors:

the Job Submitter (the client)
the Resource Manager (the master)
the Node Manager (the slave)
The application startup process is the following:
1. a client submits an application to the Resource Manager
2. the Resource Manager allocates a container
3. the Resource Manager contacts the related Node Manager
4. the Node Manager launches the container
5. the Container executes the Application Master

The Application Master is responsible for the execution of a single application. It requests
containers from the Resource Scheduler (Resource Manager) and executes specific programs
(e.g., the main of a Java class) on the obtained containers. The Application Master knows the
application logic and thus it is framework-specific. The MapReduce framework provides its
own implementation of an Application Master.
The Resource Manager is a single point of failure in YARN. By using Application Masters,
YARN spreads the metadata related to running applications over the cluster. This
reduces the load on the Resource Manager and makes it fast to recover.
Need for YARN
YARN took over the task of cluster management from MapReduce, and MapReduce is
streamlined to perform data processing only, which is what it does best.

YARN has a central Resource Manager component which manages resources and allocates
them to applications. Multiple applications can run on Hadoop via YARN, and all
applications share common resource management.

Advantages of YARN:
1. YARN utilizes resources efficiently.
There are no more fixed map-reduce slots. YARN provides a central resource manager. With
YARN, you can now run multiple applications in Hadoop, all sharing a common resource pool.
2. YARN can even run applications that do not follow the MapReduce model.
YARN decouples MapReduce's resource management and scheduling capabilities from the
data processing component, enabling Hadoop to support more varied processing approaches
and a broader array of applications. For example, Hadoop clusters can now run interactive
querying and streaming data applications simultaneously with MapReduce batch jobs. This
also streamlines MapReduce to do what it does best: process data.