
Applications of GA

Presented By
Dr. Bhupendra Verma
Director,
Technocrats Institute of Technology (Excellence), Bhopal

GA-Based Neural Network Hybrid System

Neural networks can learn various tasks from training
examples, and can classify and model non-linear relationships.
GAs have been used to optimize the parameters of NNs.
The GA encodes the parameters of the NN as a string
representing a chromosome.
GA-NN hybrids have the ability to locate the neighborhood
of the optimal solution quickly.
A large amount of memory is required to handle and
manipulate the chromosomes.

Example: GA-Based Backpropagation Network

Network size: 2 x 2 x 2
Number of weights: 8
Input to hidden layer: 4
Hidden layer to output layer: 4
Conventional NNs make use of gradient descent
learning to obtain their weights.
Conventional NNs suffer from the problem of
local minima.
Although the GA does not guarantee the global
optimum, it has been found to obtain acceptably
good solutions.

Chromosome Representation

{ W11, W12, W21, W22, V11, V12, V21, V22 }
Each gene is a real value coded in decimal digits.
We consider weights up to three decimal places, so four
digits are required for the magnitude; one more digit is
required for the sign (+/-), giving five digits per gene.
Example:
{ 84321 46242 34523 76587 34276 98767 23443 92313 }
Fitness Function

F = 1 / RMSE

where RMSE is the root mean squared error of the network
on the training examples.
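As a minimal illustrative sketch (not part of the original slides), the Python code below decodes the five-digit decimal genes above into signed weights for the 2 x 2 x 2 network and scores a chromosome with F = 1/RMSE. The sign convention for the first digit (0-4 positive, 5-9 negative), the sigmoid activations, and the toy training data are all assumptions made for illustration.

```python
import numpy as np

def decode_gene(gene):
    """Decode one 5-digit gene: the first digit encodes the sign
    (assumed convention: 0-4 -> +, 5-9 -> -); the remaining four
    digits encode a magnitude with three decimal places."""
    sign = 1.0 if int(gene[0]) < 5 else -1.0
    return sign * int(gene[1:]) / 1000.0

def decode_chromosome(chromosome):
    """Chromosome = 8 genes: 4 input-to-hidden (W) and
    4 hidden-to-output (V) weights of a 2x2x2 network."""
    w = [decode_gene(g) for g in chromosome]
    W = np.array(w[:4]).reshape(2, 2)   # input -> hidden
    V = np.array(w[4:]).reshape(2, 2)   # hidden -> output
    return W, V

def fitness(chromosome, X, T):
    """F = 1 / RMSE of the network's output over training set (X, T)."""
    W, V = decode_chromosome(chromosome)
    H = 1.0 / (1.0 + np.exp(-X @ W))    # sigmoid hidden layer
    Y = 1.0 / (1.0 + np.exp(-H @ V))    # sigmoid output layer
    rmse = np.sqrt(np.mean((T - Y) ** 2))
    return 1.0 / (rmse + 1e-9)          # guard against division by zero

# Example with the chromosome from the slide and toy targets:
chrom = "84321 46242 34523 76587 34276 98767 23443 92313".split()
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]])  # assumed
print(fitness(chrom, X, T))
```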

Need for Data Mining

Traditional statistical techniques and data management tools are no longer
adequate for analyzing the vast collections of data being generated.
Example domains:
Financial investment: stock indexes and prices, interest rates, credit card
data, fraud detection.
Health care: diagnostic information stored by hospital
management systems.
Manufacturing and production: process optimization and troubleshooting.
Telecommunication networks: calling patterns and fault management
systems.
Scientific domains: astronomical observations [6], genomic data, biological
data.
The World Wide Web.

Knowledge Discovery in Databases (KDD)

The term KDD refers to the overall process of knowledge
discovery in databases. Data mining is a particular step in
this process, involving the application of specific algorithms
for extracting patterns (models) from data.
The additional steps in the KDD process, such as data
preparation, data selection, data cleaning, data integration
and proper interpretation of the results of mining, ensure
that useful knowledge is derived from the data.

The Common Functions of Data Mining

Classification: classifies a data item into one of several predefined
categorical classes.
Regression: maps a data item to a real-valued prediction variable.
Clustering: maps a data item into one of several clusters, where
clusters are natural groupings of data items based on similarity
metrics or probability density models.
Rule generation: extracts classification rules from the data.

The Common Functions of Data Mining

Discovering association rules: describes association
relationships among different attributes.
Summarization: provides a compact description for a subset
of data.
Dependency modeling: describes significant dependencies
among variables.
Sequence analysis: models sequential patterns, like time-series
analysis. The goal is to model the states of the process
generating the sequence, or to extract and report deviations
and trends over time.

Challenges of Data Mining

Massive data sets and high dimensionality. Huge data sets create a
combinatorially explosive search space and increase the chance that a data
mining algorithm will find spurious patterns that are not generally valid.
Possible solutions include robust and efficient algorithms, sampling and
approximation methods, and parallel processing.
User interaction and prior knowledge. Data mining is inherently an
interactive and iterative process. User interaction is required at various
stages, and domain knowledge may be used either in the form of a high-level
specification of the model, or at a more detailed level.
Overfitting and assessing statistical significance. Data sets used for
mining are usually huge and available from distributed sources. As a result,
the presence of spurious data points often leads to overfitting of the models.
Regularization and re-sampling methodologies need to be emphasized in
model design.

Challenges of Data Mining

Understandability of patterns. It is necessary to make the
discoveries more understandable to humans.
Possible solutions include rule structuring, natural language
representation, and the visualization of data and knowledge.
Nonstandard and incomplete data. The data can be missing
and/or noisy.
Mixed-media data. Learning from data that is represented by a
combination of various media, like numeric, symbolic, image and
text data.
Management of changing data and knowledge. Rapidly
changing data, in a database that is modified/deleted/augmented,
may make previously discovered patterns invalid.
Possible solutions include incremental methods for updating the
patterns.
Integration. Data mining tools are often only a part of the entire
decision-making system. It is desirable that they integrate
smoothly, both with the database and with the final decision-making
procedure.

GA for Classification Rule Discovery

Michigan approach: the population consists of individual chromosomes,
where each individual encodes a single prediction rule.
Pittsburgh approach: each individual encodes a set of prediction rules.
Pluses and minuses:
- The Pittsburgh approach directly takes rule interaction into account when
computing the fitness function of an individual.
- This approach leads to syntactically longer individuals.
- In the Michigan approach the individuals are simpler and syntactically
shorter.
- This simplifies the design of genetic operators.
Consider a rule of the form: IF cond#1 AND cond#2 AND ... AND cond#n THEN class = c(i)
We need a representation of the rule antecedent (the IF part) and a
representation of the rule consequent (the THEN part).

The Rule Antecedent (Using GA)

Often the antecedent is a conjunction of conditions.
Binary encoding is usually used.
If a given attribute can take on k discrete values, its encoding
can consist of k bits, one per value (turned on or off):
0 0 1 1 1 0 1 1 0 0 0
All bits can be set to 1 in order to turn off the corresponding
condition, since it then matches every value of the attribute.
Non-binary encodings are possible, but variable-length
individuals will arise, and crossover may have to be modified to
be able to cope with variable-length individuals.
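To make the encoding concrete, here is a small Python sketch (an illustration, not from the slides) of a k-bit condition for one discrete attribute, including the all-ones convention for turning a condition off:

```python
# A k-bit condition for one attribute: bit j = 1 means "value j is
# allowed". An all-ones mask imposes no constraint on the attribute.

def condition_matches(bits, value_index):
    """True if the attribute value is allowed by this condition."""
    return bits[value_index] == 1

def condition_is_off(bits):
    """An all-ones mask matches every value, disabling the condition."""
    return all(b == 1 for b in bits)

# Attribute with k = 4 discrete values; only values 1 and 2 allowed:
cond = [0, 1, 1, 0]
print(condition_matches(cond, 2))      # True
print(condition_is_off([1, 1, 1, 1]))  # True
```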

Representing the Rule Consequent (Predicted Class)

There are three ways of representing the predicted class (the THEN part):
First, encode it in the genome of the individual, possibly making it
subject to evolution.
Second, associate all individuals of the population with the same
predicted class, which is never modified during the running of
the algorithm.
Third, choose the predicted class most suitable for a rule in a
deterministic way as soon as the corresponding rule antecedent
is formed (e.g. the class that maximizes fitness).

Genetic Operators for Rule Discovery

Selection: each individual represents a single rule (Michigan approach).
Individuals to be mated are selected via the training examples, i.e. the
probability of mating is proportional to the fitness value. Only those
rules that cover the same examples compete.

Generalizing/specializing-condition operator: the generalization or
specialization of a rule can be done in a way independent of crossover.
Example (sketched in code below):
A given individual represents the rule antecedent
(Age > 25) AND (Marital_Status = single).
We can generalize, say, the first condition by using a kind of mutation
operator that subtracts a small, randomly-generated value from 25:
(Age > 21) AND (Marital_Status = single)
The rule antecedent now tends to cover more examples.
Conversely, we could specialize the first condition of the rule antecedent
by adding a small, randomly-generated value to 25.
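A minimal Python sketch of this mutation-based generalization/specialization, assuming a condition is stored as an (attribute, operator, threshold) triple; the triple format and the step-size range are illustrative assumptions:

```python
import random

def generalize(condition, max_step=5):
    """Generalize a '>' condition by lowering its threshold, so the
    antecedent covers more examples, e.g. Age > 25 -> Age > 21."""
    attr, op, threshold = condition   # assumed (attribute, '>', value)
    return (attr, op, threshold - random.randint(1, max_step))

def specialize(condition, max_step=5):
    """Specialize a '>' condition by raising its threshold, so the
    antecedent covers fewer examples."""
    attr, op, threshold = condition
    return (attr, op, threshold + random.randint(1, max_step))

cond = ("Age", ">", 25)
print(generalize(cond))    # e.g. ('Age', '>', 21)
print(specialize(cond))    # e.g. ('Age', '>', 28)
```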

Generalizing/Specializing Crossover

The basic idea of this special kind of crossover is to
generalize or specialize a given rule, depending
on whether it is currently overfitting or underfitting
the data.
Example:
With the Michigan approach - where each individual represents a
single rule - and a binary encoding, the generalizing and
specializing crossover operators can be implemented as the
logical OR and the logical AND, respectively.
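Under the k-bit encoding sketched earlier, the two crossover variants reduce to bitwise operations, as in this illustrative snippet:

```python
# Generalizing crossover = bitwise OR (more 1s -> the condition allows
# more values); specializing crossover = bitwise AND (fewer values).

def generalizing_crossover(parent1, parent2):
    return [a | b for a, b in zip(parent1, parent2)]

def specializing_crossover(parent1, parent2):
    return [a & b for a, b in zip(parent1, parent2)]

p1 = [0, 1, 1, 0]
p2 = [0, 0, 1, 1]
print(generalizing_crossover(p1, p2))  # [0, 1, 1, 1] - covers more values
print(specializing_crossover(p1, p2))  # [0, 0, 1, 0] - covers fewer values
```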

Fitness Function for Rule Discovery

Let a rule be: IF A THEN C,
where A is the rule antecedent and C is the predicted class.

The predictive performance of a rule can be summarized by
a 2 x 2 matrix, sometimes called a confusion matrix:
TP = True Positives = number of examples satisfying A and C
FP = False Positives = number of examples satisfying A but not C
FN = False Negatives = number of examples not satisfying A but satisfying C
TN = True Negatives = number of examples satisfying neither A nor C

The confusion matrix:

                         Actual Class
                         C          not C
  Predicted   C          TP         FP
  Class       not C      FN         TN

Calculate the confidence factor:
CF = TP / (TP + FP)
and the completeness measure:
COMP = TP / (TP + FN)

Fitness = CF * COMP = TP^2 / ((TP + FP)(TP + FN))

A simplicity term can be added:
Fitness = w1 * (CF * COMP) + w2 * Simp
where Simp is a measure of rule simplicity (0 < Simp < 1) and
w1 and w2 are user-defined weights.
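An illustrative Python sketch of this fitness computed from the confusion-matrix counts; the weight values and the simplicity score passed in are placeholder assumptions:

```python
def rule_fitness(tp, fp, fn, w1=0.8, w2=0.2, simp=0.5):
    """Fitness of a classification rule from its confusion-matrix counts.
    CF = confidence factor, COMP = completeness; simp (0 < simp < 1) is a
    rule-simplicity measure supplied by the caller. The weights w1, w2
    and the simp value here are illustrative placeholders."""
    cf = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    comp = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return w1 * (cf * comp) + w2 * simp

# Example: a rule with 40 true positives, 10 false positives,
# and 20 false negatives:
print(rule_fitness(tp=40, fp=10, fn=20))
```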

GA for Clustering

A crucial issue in the design of a GA for clustering is
deciding what kind of individual representation will
be used to specify the clusters.
- Cluster-description-based representation:
in this case each individual explicitly represents the parameters
necessary to precisely specify each cluster. The nature of the
parameters depends on the shape of the cluster.

Centroid/Medoid-Based Representation

In this case each individual represents the coordinates of each
cluster's centroid or medoid.
A centroid is simply a point in the data space whose coordinates
specify the centre of the cluster.
A medoid is the data instance which is nearest to the cluster's
centroid.
The positions of the centroids/medoids and the procedure used to
assign instances to clusters implicitly determine the precise shape
and size of the clusters.

Instance-Based Representation

In this case each individual consists of a string of n
elements (genes), where n is the number of data
instances. Each gene i, i = 1, ..., n, represents the index
(id) of the cluster to which the i-th data instance is
assigned. Hence, each gene i can take one out of K
values, where K is the number of clusters.
Example:
Suppose that n = 10 and K = 3. The individual <2 1 2 3
3 2 1 1 2 3> corresponds to a candidate clustering
where the second, seventh and eighth instances are
assigned to cluster 1, the first, third, sixth and ninth
instances are assigned to cluster 2, and the other
instances are assigned to cluster 3.
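A small Python sketch (illustrative) that decodes this instance-based representation into explicit clusters, reproducing the example above:

```python
# Decode an instance-based chromosome into clusters; instance indices
# are 1-based, matching the text.

def decode_clustering(chromosome, K):
    clusters = {k: [] for k in range(1, K + 1)}
    for instance, cluster_id in enumerate(chromosome, start=1):
        clusters[cluster_id].append(instance)
    return clusters

individual = [2, 1, 2, 3, 3, 2, 1, 1, 2, 3]
print(decode_clustering(individual, K=3))
# {1: [2, 7, 8], 2: [1, 3, 6, 9], 3: [4, 5, 10]}
```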

Comparison of the Various Representations

In both the centroid/medoid-based and the instance-based representations,
clusters are mutually exclusive and exhaustive.
Cluster descriptions, by contrast, may overlap, so that an instance
may be located within two or more clusters.
The instance-based representation has the disadvantage that it does not
scale very well for large data sets, since each individual's length is directly
proportional to the number of instances being clustered.
This representation also involves a considerable degree of redundancy,
which may lead to problems in the application of conventional genetic
operators.
For instance, let n = 4 and K = 2, and consider the
individuals <1 2 1 2> and <2 1 2 1>. These two individuals have different
gene values in all four genes, but they represent the same candidate
clustering solution, i.e., assigning the first and third instances to
one cluster and assigning the second and fourth instances to another
cluster. Yet they produce very different results under crossover.

Fitness Evaluation for Clustering

The fitness of an individual is a
measure of the quality of the clustering
represented by the individual.
Fitness measures usually build on
the following principles:
- The smaller the intra-cluster (within-cluster)
distance, the better the fitness.
- The larger the inter-cluster (between-cluster)
distance, the better the fitness.
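A minimal sketch of such a fitness in Python, assuming a centroid-based decoding; combining the two principles as the ratio inter/intra is one common, illustrative choice rather than the only one:

```python
import numpy as np

def clustering_fitness(X, labels, K):
    """Reward large inter-cluster (between-centroid) distances and
    penalize large intra-cluster (point-to-centroid) distances."""
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Mean distance of each point to its own cluster's centroid:
    intra = np.mean(np.linalg.norm(X - centroids[labels], axis=1))
    # Mean pairwise distance between centroids:
    dists = [np.linalg.norm(centroids[i] - centroids[j])
             for i in range(K) for j in range(i + 1, K)]
    inter = np.mean(dists)
    return inter / (intra + 1e-9)   # higher is better

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
print(clustering_fitness(X, labels, K=2))
```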

Genetic Algorithms (GAs) for Preprocessing

The use of GAs for attribute selection seems natural. The
main reason is that the major source of difficulty in
attribute selection is attribute interaction, and one of the
strengths of GAs is that they usually cope well with
attribute interactions.
The standard individual representation for attribute selection
consists simply of a string of N bits, where N is the number of
original attributes and the i-th bit, i = 1, ..., N, can take the value 1
or 0, indicating whether or not, respectively, the i-th attribute is
selected.
- This individual representation is simple, and traditional
crossover and mutation operators can be easily applied.
- However, it has the disadvantage that it does not scale very
well with the number of attributes.

An alternative individual representation lets each
individual represent a candidate attribute subset directly: a
string of M genes, where M is a user-defined maximum number of
selected attributes, and each gene contains either the index of
an attribute or a 0 denoting "no attribute".
For instance, the individual (3 0 8 3 0), where M = 5,
represents a candidate solution where only the 3rd and
the 8th attributes are selected (the duplicate index 3 counts once).
One advantage of this representation is that it scales up
better with respect to a large number of original attributes.
Crossover and mutation procedures are applied as usual.
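For illustration, a short Python sketch decoding both representations, including the (3 0 8 3 0) individual from the slide (duplicate indices collapse):

```python
def selected_from_bits(bits):
    """Standard N-bit representation: bit i = 1 selects attribute i+1."""
    return {i + 1 for i, b in enumerate(bits) if b == 1}

def selected_from_indices(genes):
    """Index-based representation: each gene is an attribute index,
    0 means 'no attribute'; duplicates collapse in the set."""
    return {g for g in genes if g != 0}

print(sorted(selected_from_bits([0, 0, 1, 0, 0, 0, 0, 1])))  # [3, 8]
print(sorted(selected_from_indices([3, 0, 8, 3, 0])))        # [3, 8]
```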

Fitness Function for Attribute Selection

GAs for attribute selection can be
roughly divided into two approaches:
- Wrapper approach: the GA uses the
classification algorithm to compute
the fitness of individuals (sketched below).
- Filter approach: the GA does not
use the classification algorithm.
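A minimal sketch of a wrapper-style fitness, assuming scikit-learn is available; the choice of a decision tree and 5-fold cross-validation is illustrative, not prescribed by the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(bits, X, y):
    """Wrapper approach: fitness = cross-validated accuracy of a
    classifier trained only on the attributes selected by the bits."""
    selected = [i for i, b in enumerate(bits) if b == 1]
    if not selected:
        return 0.0   # an empty subset gets the worst fitness
    scores = cross_val_score(DecisionTreeClassifier(),
                             X[:, selected], y, cv=5)
    return scores.mean()
```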

Genetic Algorithms (GAs) for Postprocessing

GAs can be used in the post-processing step when an
ensemble of classifiers (e.g. rule sets) has been created. Generating an
ensemble of classifiers is a relatively recent trend in machine
learning when the primary goal is to maximize predictive accuracy.

Generating an ensemble of classifiers is useful since it has been
shown that in several cases an ensemble of classifiers has better
predictive accuracy than a single classifier.

A fitness function may be created using weights for each classifier
in the ensemble (a user may help). There are also GA schemes to
optimize the weights of the classifiers.

There is a risk of generating too many classifiers, which end up
overfitting the training data; hence pruning is sometimes used.
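As one illustrative scheme (an assumption, not the slides' prescription), the fitness of a candidate weight vector can be the weighted-majority-vote accuracy of the ensemble on a validation set:

```python
import numpy as np

def ensemble_fitness(weights, predictions, y_true):
    """predictions: (n_classifiers, n_examples) array of class labels;
    weights: one non-negative weight per classifier. Returns the
    accuracy of the weighted majority vote."""
    n_classes = int(predictions.max()) + 1
    votes = np.zeros((len(y_true), n_classes))
    for w, preds in zip(weights, predictions):
        for i, p in enumerate(preds):
            votes[i, p] += w          # each classifier casts a weighted vote
    return np.mean(votes.argmax(axis=1) == y_true)

preds = np.array([[0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 1, 0]])
y = np.array([0, 1, 1, 0])
print(ensemble_fitness([0.5, 0.2, 0.3], preds, y))   # 1.0
```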

Research Problems

Discovering surprising rules:
Evolutionary algorithms seem to have good potential to
discover truly surprising rules, due to their ability to cope
well with attribute interaction.
- An interesting research direction is to design new
surprisingness measures to evaluate the rules produced
by evolutionary algorithms.

Scaling up Evolutionary Algorithms with Parallel Processing:

In the context of mining very large databases, the vast
majority of the processing time of an evolutionary algorithm
is spent on evaluating an individual's fitness.
- One strategy is to distribute the population's individuals across the
available processors and compute their fitness in parallel. However,
this strategy scales poorly for large databases, since each processor
needs access to the entire data set.
- Alternatively, the fitness of each individual can be computed in
parallel by all processors, with the data being mined partitioned
across the processors.
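An illustrative sketch of the first strategy using Python's multiprocessing module; the fitness function here is a trivial placeholder standing in for a pass over the data being mined:

```python
from multiprocessing import Pool

def fitness_of(individual):
    # Placeholder: in data mining this would scan the data set,
    # which is why each worker needs access to all the data.
    return sum(individual)

if __name__ == "__main__":
    population = [[0, 1, 1], [1, 1, 0], [1, 1, 1]]
    # Distribute individuals across worker processes:
    with Pool() as pool:
        fitnesses = pool.map(fitness_of, population)
    print(fitnesses)   # [2, 2, 3]
```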

EA for KDD May Be Applied to Other Domains

KDD has a very interdisciplinary nature and uses many
different paradigms of knowledge discovery algorithms.
This motivates the integration of evolutionary algorithms
with other knowledge discovery paradigms.
KDD tasks involve some kind of prediction, where
generalization performance on a separate test set is
much more important than performance on the training
set. This principle may be applied to other domains as well.

References

1. Alex A. Freitas, "A Review of Evolutionary Algorithms for Data Mining,"
Computing Laboratory, University of Kent, UK.
2. Sushmita Mitra, Sankar K. Pal, and Pabitra Mitra, "Data Mining in Soft
Computing Framework: A Survey," IEEE Transactions on Neural Networks,
Vol. 13, No. 1, January 2002.
3. Behrouz Minaei-Bidgoli and William F. Punch, "Using Genetic Algorithms for
Data Mining Optimization in an Educational Web-Based System," GARAGe,
Department of Computer Science & Engineering, Michigan State University,
http://garage.cse.msu.edu
4. Alex Alves Freitas, "Evolutionary Computation," http://www.ppgia.pucpr.br/~alex
5. Sid Bhattacharyya, "Genetic Algorithms for Data Mining,"
www.uic.edu/classes/idsc/ids572cna/GADataMining_CNA.pdf
6. www.site.uottawa.ca/~nat/Courses/csi5388/.../Jim_Slides.ppt
