
Mining Association Rules Using Genetic Algorithm: The Role of Estimation Parameters
Abstract: Genetic Algorithms (GAs) have emerged as practical, robust optimization and search methods for generating accurate and reliable association rules. The main motivation for using GAs in the discovery of high-level prediction rules is that they perform a global search and cope better with attribute interaction than the greedy rule induction algorithms often used in data mining. The objective of this paper is to compare the performance of the genetic algorithm in mining association rules by varying the GA parameters. The fitness function and population size are the key parameters. The crossover rate mainly affects how quickly the system converges. The size of the dataset and the relationship between its attributes also play a role in achieving the optimum accuracy.
Keywords: Association rules, Genetic Algorithm, GA parameters.
1. Introduction
Data mining, also referred to as knowledge discovery in databases, is the process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints and regularities) from data in databases. Data mining combines the theory and technology of several domains, including artificial intelligence, machine learning, statistics and neural networks. Association rule mining is a major area in data mining that discovers the relations between different attributes by analyzing and processing the data in the database.
Many algorithms for generating association rules have been developed over time. Some of the well-known algorithms are Apriori, Dataminer and FP-Growth tree. Many existing algorithms traverse the database many times, so the I/O overhead and computational complexity become very high and cannot meet the requirements of large-scale database mining. The genetic algorithm is a global random search method based on the biological theory of evolution and molecular genetics. The algorithm offers strong randomness, robustness and implicit parallelism, can search quickly and effectively for a global optimum, and is an effective way to deal with large-scale data sets. At present, genetic algorithm-based data mining methods have made some progress, and classification systems based on genetic algorithms have also yielded some results.
This paper analyses the mining of association rules by applying Genetic Algorithms. There have been several attempts at mining association rules using a Genetic Algorithm. Robert Cattral et al. [1] describe the evolution of a hierarchy of rules using a genetic algorithm with chromosomes of varying length and macro mutations. The initial population is seeded rather than randomly selected. Manish Saggar et al. [2] propose an algorithm with binary encoding in which the fitness function is based on the confusion matrix. The individuals are represented using the Michigan approach. Roulette Wheel selection is done by first normalizing the values of all candidates.
A genetic algorithm based on the concept of strength of implication of rules was presented by Zhou et al. [3]. The properties of independence and correlation of descriptions in rules are used for the fitness calculation. Genxiang et al. [4] introduced dynamic immune evolution and the biometric mechanisms of Engineering Immune Computing, namely immune recognition, immune memory and immune regulation, into the GA for mining association rules.
Gonzales et al. [5] introduced the Genetic Relation Algorithm (GRA), which is based on evaluating the distances between rules. The distance is calculated using two matching criteria, namely complete match and partial match. A genetic algorithm can easily converge prematurely or take too much time to converge during the evolution process. Hong Guo et al. [6] propose a GA with adaptive mutation and individual-based selection methods to avoid premature convergence.
In Haiying Ma et al. [7] the data is encoded using a gene string structure in which complex concepts are mapped to linear symbols. The fitness function measures the overall performance of the process, rather than that of individual rules, when the bit strings are interpreted as a complex process. Adaptive exchange probability (Pc) and mutation probability (Pm) are adopted in that work. Hong Guo et al. [8] adopt an adaptive mutation rate to avoid excessive variation that causes non-convergence or convergence to a local optimal solution. An individual-based selection method is applied to the evolution in the genetic algorithm in order to prevent premature convergence caused by rapid growth in the number of high-fitness individuals.
As the parameters of the genetic algorithm and the fitness function are found to be the major area of interest in the above studies, this paper explores the effects of the genetic parameters and the controlling variables of the fitness function on three different datasets.
A brief introduction to Association Rule Mining and GA is given in Section 2, followed by the methodology in Section 3, which describes the basic implementation details of Association Rule Mining with GA. Section 4 presents the parameters that decide the efficiency of the algorithm. Section 5 presents the experimental results, followed by the conclusion in the last section.
2. Association Rules and Genetic Algorithms
Association Rules
Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. The discovered association rules are of the form P ⇒ Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present.
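As an illustration of these two measures, the following minimal sketch computes support and confidence for a rule over a small set of transactions. The function names and the toy transactions are hypothetical and are not taken from the datasets used in this paper.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(p: FrozenSet[str], q: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Conditional probability that Q appears given that P appears."""
    supp_p = support(p, transactions)
    if supp_p == 0:
        return 0.0
    return support(p | q, transactions) / supp_p


# Toy transactions (illustrative only).
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]
P = frozenset({"bread"})
Q = frozenset({"milk"})
s = support(P | Q, transactions)      # P and Q together: 2 of 4 transactions
c = confidence(P, Q, transactions)    # 2 of the 3 transactions containing P
```

Here the rule bread ⇒ milk has support 0.5 and confidence 2/3.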
Genetic Algorithm

Genetic Algorithm (GA) is a procedure used to find approximate solutions to search problems through the application of the principles of evolutionary biology. Genetic algorithms use biologically inspired techniques such as genetic inheritance, natural selection, mutation, and sexual reproduction (recombination, or crossover).
Genetic algorithms are typically implemented using computer simulations in which an
optimization problem is specified. For this problem, members of a space of candidate solutions,
called individuals, are represented using abstract representations called chromosomes. The GA
consists of an iterative process that evolves a working set of individuals called a population
toward an objective function, or fitness function. Traditionally, solutions are represented using
fixed length strings, especially binary strings, but alternative encodings have also been
developed.
3. Methodology
The evolutionary process of a GA is a highly simplified and stylized simulation of the biological version. It starts from a population of individuals randomly generated according to some probability distribution, usually uniform, and updates this population in steps called generations. In each generation, multiple individuals are selected from the current population on the basis of their fitness, recombined through crossover, and modified through mutation to form a new population.
A. [Start] Generate random population of n chromosomes.
B. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
C. [New population] Create a new population by repeating the following steps until the new population is complete.
   i. [Selection] Select two parent chromosomes from a population according to their fitness.
   ii. [Crossover] With a crossover probability alter the parents to form a new offspring (children).
   iii. [Mutation] With a mutation probability mutate new offspring at each locus.
   iv. [Accepting] Place new offspring in a new population.
D. [Replace] Use newly generated population for a further run of the algorithm.
E. [Test] If the end condition is satisfied, stop, and return the best solution in current population.
F. [Loop] Go to step B.
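Steps A to F can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in this paper: the function name `run_ga`, binary tournament selection, single-point crossover, a fixed generation count as the end condition, and the toy fitness (the number of 1 bits) are all choices made for the example.

```python
import random
from typing import Callable, List

Chromosome = List[int]


def run_ga(fitness: Callable[[Chromosome], float], n_bits: int,
           pop_size: int = 20, p_crossover: float = 0.5,
           p_mutation: float = 0.01, max_generations: int = 50,
           seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    # A. [Start] random population of pop_size chromosomes
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    # E./F. [Test]/[Loop] assumed end condition: fixed number of generations
    for _ in range(max_generations):
        # B. [Fitness] evaluate every chromosome
        scores = [fitness(c) for c in population]
        new_population: List[Chromosome] = []
        # C. [New population] fill the next generation
        while len(new_population) < pop_size:
            # i. [Selection] two parents, each by binary tournament
            parents = []
            for _ in range(2):
                a, b = rng.sample(range(pop_size), 2)
                parents.append(population[a] if scores[a] >= scores[b]
                               else population[b])
            child = list(parents[0])
            # ii. [Crossover] single-point, applied with probability p_crossover
            if rng.random() < p_crossover:
                point = rng.randint(1, n_bits - 1)
                child = parents[0][:point] + parents[1][point:]
            # iii. [Mutation] flip each locus with probability p_mutation
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            # iv. [Accepting] place the offspring in the new population
            new_population.append(child)
        # D. [Replace] the new population becomes the current one
        population = new_population
        best = max(population + [best], key=fitness)
    return best


# Toy usage: maximise the number of 1 bits in a 12-bit chromosome.
best = run_ga(fitness=sum, n_bits=12)
```

Passing a seed keeps the run reproducible, which makes it easier to compare parameter settings against each other.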
4. Parameters in Genetic Algorithm
The GA parameters are the key components that enable the system to reach a good enough solution within the chosen terminating conditions.
4.1 Encoding
Encoding is the process of representing individual solutions. The most common way of encoding is binary encoding. Here each chromosome is a binary string in which each bit represents some characteristic of the solution. Other encoding schemes are octal, hexadecimal, permutation, value and tree encoding.
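One possible fixed-length binary encoding of a record with categorical attributes is sketched below. The schema, attribute names and helper functions are hypothetical; the paper does not specify the exact bit layout it uses.

```python
from math import ceil, log2
from typing import Dict, List

# Hypothetical schema: attribute name -> list of allowed values.
SCHEMA: Dict[str, List[str]] = {
    "age": ["young", "middle", "old"],
    "tear_rate": ["reduced", "normal"],
}


def bits_needed(n_values: int) -> int:
    """Number of bits required to index n_values distinct values."""
    return max(1, ceil(log2(n_values)))


def encode(record: Dict[str, str], schema: Dict[str, List[str]]) -> List[int]:
    """Concatenate the fixed-width binary index of each attribute value."""
    bits: List[int] = []
    for attr, values in schema.items():
        width = bits_needed(len(values))
        index = values.index(record[attr])
        bits.extend(int(b) for b in format(index, f"0{width}b"))
    return bits


def decode(bits: List[int], schema: Dict[str, List[str]]) -> Dict[str, str]:
    """Inverse of encode; raises IndexError for an unused bit pattern."""
    record: Dict[str, str] = {}
    pos = 0
    for attr, values in schema.items():
        width = bits_needed(len(values))
        index = int("".join(str(b) for b in bits[pos:pos + width]), 2)
        record[attr] = values[index]
        pos += width
    return record


chromosome = encode({"age": "old", "tear_rate": "normal"}, SCHEMA)
```

With this layout "old" is index 2 (bits 1, 0) and "normal" is index 1 (bit 1), so the chromosome is [1, 0, 1].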
4.2 Population
Population refers to the number of chromosomes taken up for optimization. Chromosomes are the raw genetic information that the GA deals with. If there are too few chromosomes, the GA has few possibilities to perform crossover and only a small part of the search space is explored. On the other hand, if there are too many chromosomes, the GA slows down. The initial population generation and the population size are the two aspects of population. The initial population is either selected randomly from the data or selected with prior knowledge of the data.

The population size (equation 1) is calculated in terms of the number of chromosomes in the data and $2^{k}$, where k is the average size of the schema of interest. If uniform crossover is adopted, a population size at least twice as small as the number of instances in the dataset is most likely sufficient.
4.3 Selection
During each successive generation, a proportion of the existing population is selected to breed a new generation. Individuals are selected through a fitness-based process, where fitter solutions, as measured by a fitness function, are typically more likely to be selected. Tournament, Roulette Wheel, Random, Rank and Boltzmann selection are the commonly used selection methods. Elitism and stochastic universal sampling significantly improve the GA's performance.
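Two of the selection methods named above can be sketched as follows. The function names are illustrative, and the roulette wheel version assumes non-negative fitness values.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(population: Sequence[T], fitnesses: List[float],
                      k: int, rng: random.Random) -> T:
    """Draw k distinct contenders at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def roulette_select(population: Sequence[T], fitnesses: List[float],
                    rng: random.Random) -> T:
    """Pick an individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # No fitness information: fall back to a uniform choice.
        return rng.choice(list(population))
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f
        if pick <= running:
            return individual
    return population[-1]


rng = random.Random(0)
pop = ["a", "b", "c"]
fits = [1.0, 3.0, 2.0]
winner = tournament_select(pop, fits, k=3, rng=rng)
spin = roulette_select(pop, fits, rng)
```

A larger tournament size k raises the selection pressure; with k equal to the population size the fittest individual always wins.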
4.4 Fitness Function
A fitness function is a particular type of objective function that quantifies the optimality of a chromosome in a genetic algorithm, so that the particular chromosome may be ranked against all the other chromosomes [9, 10]. An ideal fitness function correlates closely with the algorithm's goal and yet can be computed quickly. Speed of execution is very important, as the typical genetic algorithm must be iterated many times in order to produce a usable result for a non-trivial problem.
This paper adopts minimum support and minimum confidence for filtering rules. The correlative degree is then determined for the rules that satisfy the minimum support and minimum confidence. Taking support and confidence into account together, the fitness function is defined as follows.

$$f(x) = R_s \cdot \frac{\mathrm{Supp}(x)}{\mathrm{Supp}_{min}} + R_c \cdot \frac{\mathrm{Conf}(x)}{\mathrm{Conf}_{min}} \qquad (2)$$

In the above formula, $R_s + R_c = 1$ ($R_s \ge 0$, $R_c \ge 0$), and Suppmin and Confmin are the respective values of minimum support and minimum confidence. By all appearances, if Suppmin and Confmin are set to higher values, the value of the fitness function is also found to be high.
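A fitness of this kind can be sketched as a weighted sum of support and confidence, each normalized by its minimum threshold. The exact form below, the zero score for rules that fail the filter, and the default weights Rs = Rc = 0.5 are assumptions made for illustration.

```python
def rule_fitness(supp: float, conf: float, supp_min: float, conf_min: float,
                 rs: float = 0.5, rc: float = 0.5) -> float:
    """Weighted, threshold-normalized fitness of a rule (assumed form).

    rs and rc are the weights of support and confidence; they are assumed
    to be non-negative and to sum to 1.
    """
    if rs < 0 or rc < 0 or abs(rs + rc - 1.0) > 1e-9:
        raise ValueError("rs and rc must be non-negative and sum to 1")
    # Assumption: rules below either threshold are filtered out (score 0).
    if supp < supp_min or conf < conf_min:
        return 0.0
    return rs * (supp / supp_min) + rc * (conf / conf_min)


# A rule sitting exactly on both thresholds scores 1.0.
on_threshold = rule_fitness(0.2, 0.8, supp_min=0.2, conf_min=0.8)
# A rule with twice the minimum support scores higher.
stronger = rule_fitness(0.4, 0.9, supp_min=0.2, conf_min=0.8)
# A rule below the minimum confidence is filtered out.
filtered = rule_fitness(0.4, 0.5, supp_min=0.2, conf_min=0.8)
```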
4.5 Crossover Operator
Crossover entails choosing two individuals to swap segments of their code, producing artificial "offspring" that are combinations of their parents. This process is intended to simulate the analogous process of recombination that occurs to chromosomes during sexual reproduction. Common forms of crossover include single-point crossover and uniform crossover. In single-point crossover, a point of exchange is set at a random location in the two individual genomes; one individual contributes all its code up to the point of crossover and the second individual contributes all its code after that point to produce an offspring. In uniform crossover, the value at any given location in the offspring's genome is either the value of one parent's genome at that location or the value of the other parent's genome at that location, chosen with 50/50 probability [8].
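The two crossover forms described above can be sketched on list-encoded chromosomes as follows. The function names are illustrative, and both parents are assumed to have the same length of at least two genes.

```python
import random
from typing import List


def single_point_crossover(a: List[int], b: List[int],
                           rng: random.Random) -> List[int]:
    """Take a's genes up to a random point and b's genes after it."""
    point = rng.randint(1, len(a) - 1)  # never at the very ends
    return a[:point] + b[point:]


def uniform_crossover(a: List[int], b: List[int],
                      rng: random.Random) -> List[int]:
    """At each locus, take the gene from either parent with 50/50 probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


rng = random.Random(1)
parent_a = [0] * 8
parent_b = [1] * 8
child_sp = single_point_crossover(parent_a, parent_b, rng)
child_un = uniform_crossover(parent_a, parent_b, rng)
```

With an all-zero and an all-one parent, the single-point child is always a run of zeros followed by a run of ones, which makes the crossover point visible.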
4.6 Mutation Operator
Partial gene values of individuals are adjusted by the mutation operation [5]. This part of the genetic algorithm requires great care. There are two probabilities involved. The first, usually called Pm, is used to judge whether mutation has to be done or not. When a candidate fulfills this criterion, it is passed to the second, the locus probability, which determines the point of the candidate at which the mutation is done.
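This two-stage mutation can be sketched on a binary chromosome as follows. The function name is illustrative, and choosing the locus uniformly at random is an assumption, since the locus probability distribution is not specified here.

```python
import random
from typing import List


def mutate(chromosome: List[int], p_m: float, rng: random.Random) -> List[int]:
    """Two-stage mutation: decide whether to mutate, then pick the locus."""
    child = list(chromosome)
    # Stage 1: Pm decides whether this candidate is mutated at all.
    if rng.random() < p_m:
        # Stage 2: choose the locus (assumed uniform) and flip that bit.
        locus = rng.randrange(len(child))
        child[locus] = 1 - child[locus]
    return child


rng = random.Random(2)
original = [0, 1, 1, 0, 1]
unchanged = mutate(original, p_m=0.0, rng=rng)   # Pm = 0: never mutated
flipped = mutate(original, p_m=1.0, rng=rng)     # Pm = 1: exactly one bit flips
```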
4.7 Number of Generations
The generational process of mining association rules by the genetic algorithm is repeated until a termination condition has been reached. Common terminating conditions are:
• A solution is found that satisfies minimum criteria.
• Fixed number of generations reached.
• Allocated budget (computation time/money) reached.
• The highest ranking solution's fitness is reaching or has reached a plateau such that successive iterations no longer produce better results.
• Manual inspection.
• Combinations of the above.
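Several of these conditions can be combined in one check on the history of best fitness values. The function name, the `patience` window and the tolerance are hypothetical parameters chosen for this sketch.

```python
from typing import List, Optional


def should_stop(best_history: List[float], max_generations: int,
                target_fitness: Optional[float] = None,
                patience: int = 10, tol: float = 1e-9) -> bool:
    """Return True if any of the combined termination conditions holds.

    best_history holds the best fitness observed in each generation so far.
    """
    generation = len(best_history)
    if generation == 0:
        return False
    # A solution satisfying the minimum criteria has been found.
    if target_fitness is not None and best_history[-1] >= target_fitness:
        return True
    # The fixed number of generations has been reached.
    if generation >= max_generations:
        return True
    # Plateau: no improvement over the last `patience` generations.
    if generation > patience:
        if best_history[-1] - best_history[-1 - patience] <= tol:
            return True
    return False
```

For example, a history of twelve identical best values stops a run with `patience=10`, while a history of two improving values does not.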

5. Experimental Studies
The objective of this study is to compare the accuracy achieved on the datasets by varying the GA parameters. The encoding of the chromosome is binary encoding with fixed length. As the crossover is performed at the attribute level, the mutation rate is set to zero so as to retain the original attribute values. The selection method used is tournament selection. The fitness function adopted is as given in equation (2).

Three datasets, namely the Lenses, Haberman's Survival and Iris data sets from the UCI Machine Learning Repository, have been taken up for experimentation. The Lenses dataset has 4 attributes and 24 instances, the Haberman's Survival dataset has 3 attributes and 306 instances, and the Iris dataset has 5 attributes and 150 instances. The algorithm is implemented using the MATLAB R2008a simulation package. The flow of the system is as shown in the flowchart below.

Figure 1. Flow chart of the GA


The default values set for the GA parameters are given in Table 1.

Table 1. Default GA Parameters.

Parameter             Value
Population Size       Instances * 1.5
Crossover Rate        0.5
Mutation Rate         0.0
Selection Method      Tournament Selection
Minimum Support       0.2
Minimum Confidence    0.8

The accuracy and the convergence rate obtained by controlling the GA parameters are recorded in the tables below. Accuracy is the number of instances that match between the original dataset and the resulting population, divided by the number of instances in the dataset. The convergence rate is the generation at which the fitness value becomes fixed. The population size is varied for the three datasets, from the size of the dataset to one and a half times the dataset size, while the other parameters are kept fixed.
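One way to read these two measures is sketched below. The function names are hypothetical, and two interpretations are assumed: an instance "matches" when an identical record appears in the final population, and the fitness "becomes fixed" at the first generation after which the best fitness no longer changes.

```python
from typing import Hashable, List, Sequence


def accuracy_percent(dataset: Sequence[Hashable],
                     population: Sequence[Hashable]) -> float:
    """Percentage of dataset instances that also appear in the population."""
    final = set(population)
    matches = sum(1 for instance in dataset if instance in final)
    return 100.0 * matches / len(dataset)


def convergence_generation(best_fitness: List[float], tol: float = 1e-9) -> int:
    """1-based generation from which the best fitness stays unchanged."""
    last = best_fitness[-1]
    generation = len(best_fitness)
    # Walk backwards while the previous generation already had the final value.
    while generation > 1 and abs(best_fitness[generation - 2] - last) <= tol:
        generation -= 1
    return generation


# Toy data (illustrative only): 3 of the 4 instances survive in the population.
dataset = [(1, 0, 1), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
population = [(1, 0, 1), (0, 0, 1), (1, 1, 0), (1, 1, 1), (1, 0, 1)]
acc = accuracy_percent(dataset, population)
gen = convergence_generation([0.1, 0.4, 0.7, 0.7, 0.7])
```

In this toy case the accuracy is 75% and the fitness becomes fixed at generation 3.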

Table 2: Comparison based on variation in population size.

            Population = No. of Instances    Population = No. of Instances * 1.25    Population = No. of Instances * 1.5
Dataset     Accuracy %    No. of Generations    Accuracy %    No. of Generations    Accuracy %    No. of Generations
Lenses      75            7                     82            12                    95            17
Haberman    71            114                   68            88                    64            70
Iris        77            88                    87            53                    82            45

It can be seen from Table 2 that for the Lenses dataset, which is small, the best accuracy is achieved when the population size is one and a half times the size of the dataset, whereas for the larger Haberman dataset the accuracy is highest when the population size equals the dataset size. For the moderately sized Iris dataset, the population has to be set to 1.25 times the size of the dataset to achieve the optimum result.
As the fitness function is considered to be the crucial factor for the GA, variations are
introduced in the fitness function while other parameters remain unchanged. In Table 3 the
minimum confidence and support values are altered when others are at default values and the
results are recorded.
Table 3: Comparison based on variation in Minimum Support and Minimum Confidence.

            Sup = 0.4 & con = 0.4        Sup = 0.9 & con = 0.9        Sup = 0.9 & con = 0.2        Sup = 0.2 & con = 0.9
Dataset     Accuracy %    Generations    Accuracy %    Generations    Accuracy %    Generations    Accuracy %    Generations
Lenses      22            20             49            11             70            21             95            18
Haberman    45            68             58            83             71            90             62            75
Iris        40            28             59            37             78            48             87            55

From Table 3 it is clear that variation in minimum support and confidence brings greater changes in accuracy. When both the minimum support and the minimum confidence are set low, the accuracy is found to be low regardless of the size of the dataset. The same is noted when both values are set high. Optimum accuracy is achieved when a tradeoff value between minimum confidence and minimum support is set.
When the parameters Rs and Rc in the fitness function are altered, only minimal changes in accuracy are noted, and hence their impact is not taken up for analysis.
In the table below the crossover probability is altered when other GA parameters are set
to default values and the results observed are recorded.
Table 4: Comparison based on variation in Crossover Probability.

            Pc = 0.25                           Pc = 0.5                            Pc = 0.75
Dataset     Accuracy %    No. of Generations    Accuracy %    No. of Generations    Accuracy %    No. of Generations
Lenses      95            8                     95            16                    95            13
Haberman    69            77                    71            83                    70            80
Iris        84            45                    86            51                    87            55

From Table 4 it is evident that, for each of the three datasets, the accuracy achieved is almost the same whatever crossover probability is adopted. The convergence rate depends on the data size and the population size being set.
The results observed are compared for the three datasets as shown in the figures below.

Figure 2: Population Size vs Accuracy. [Chart: accuracy (0 to 100) of the Lenses, Haberman and Iris datasets for population sizes of No. of Instances, No. of Instances * 1.25 and No. of Instances * 1.5.]


Figure 3: Minimum Support and Confidence vs Accuracy. [Chart: accuracy (0 to 100) of the Lenses, Haberman and Iris datasets for Sup = .2 & con = .5, Sup = .5 & con = .5, Sup = .75 & con = .75 and Sup = .8 & con = .8.]

The values of the GA parameters set for the three datasets when maximum accuracy is achieved are shown in the table below.
Table 5. Comparison of the optimum value of parameters for maximum accuracy achieved.

Dataset     No. of Instances    No. of Attributes    Population Size    Minimum Support    Minimum Confidence    Crossover Rate    Accuracy in %
Lenses      24                  4                    36                 0.2                0.9                   0.25              95
Haberman    306                 3                    306                0.9                0.2                   0.5               71
Iris        150                 5                    225                0.2                0.9                   0.75              87

It is observed from the experimental analysis that the choice of optimum population size for better accuracy depends upon the number of instances in the dataset. If the dataset is larger, a population size equal to the number of instances in the dataset is found to produce better accuracy.
Setting the values for minimum support and confidence depends on the dataset and the relationship between its attributes. A tradeoff between minimum confidence and minimum support has to be struck to attain optimum results. The crossover rate mainly affects the convergence rate of the system and has minimal effect on its accuracy.
6. Conclusion
Genetic Algorithms have been used to solve difficult optimization problems in a number of fields and have proved to produce optimum results in mining association rules. When a genetic algorithm is used for mining association rules, the GA parameters decide the efficiency of the system. Minimum support, minimum confidence and population size affect the accuracy of the system more than the other GA parameters. The setting of the population size depends upon the size of the problem, whereas the minimum confidence and minimum support values are to be set based on the problem under study. The optimum value of the crossover rate leads to earlier convergence while playing a minimal role in achieving better accuracy. The optimum values of the GA parameters vary from dataset to dataset, and the fitness function plays a major role in optimizing the results. The size of the dataset and the relationship between the attributes in the data contribute to the setting of the parameters. The efficiency of the methodology could be further explored on more datasets with varying attribute sizes.
References
[1]. Cattral, R., Oppacher, F., Deugo, D., Rule Acquisition with a Genetic Algorithm, Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, 1999.
[2]. Saggar, M., Agrawal, A.K., Lad, A., Optimization of Association Rule Mining, IEEE International Conference on Systems, Man and Cybernetics, Vol. 4, Page(s): 3725-3729, 2004.
[3]. Zhou Jun, Li Shu-you, Mei Hong-yan, Liu Hai-xia, A Method for Finding Implicating Rules Based on the Genetic Algorithm, Third International Conference on Natural Computation, Volume: 3, Page(s): 400-405, 2007.
[4]. Genxiang Zhang, Haishan Chen, Immune Optimization Based Genetic Algorithm for Incremental Association Rules Mining, International Conference on Artificial Intelligence and Computational Intelligence, AICI '09, Volume: 4, Page(s): 341-345, 2009.
[5]. Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K., Mining Multi-class Datasets using Genetic Relation Algorithm for Rule Reduction, IEEE Congress on Evolutionary Computation, CEC '09, Page(s): 3249-3255, 2009.
[6]. Xian-Jun Shi, Hong Lei, A Genetic Algorithm-Based Approach for Classification Rule Discovery, International Conference on Information Management, Innovation Management and Industrial Engineering, ICIII '08, Volume: 1, Page(s): 175-178, 2008.
[7]. Haiying Ma, Xin Li, Application of Data Mining in Preventing Credit Card Fraud, International Conference on Management and Service Science, MASS '09, Page(s): 1-6, 2009.
[8]. Hong Guo, Ya Zhou, An Algorithm for Mining Association Rules Based on Improved Genetic Algorithm and its Application, 3rd International Conference on Genetic and Evolutionary Computing, WGEC '09, Page(s): 117-120, 2009.
[9]. Hua Tang, Jun Lu, A Hybrid Algorithm Combined Genetic Algorithm with Information Entropy for Data Mining, 2nd IEEE Conference on Industrial Electronics and Applications, Page(s): 753-757, 2007.
[10]. Wenxiang Dou, Jinglu Hu, Hirasawa, K., Gengfeng Wu, Quick Response Data Mining Model using Genetic Algorithm, SICE Annual Conference, Page(s): 1214-1219, 2008.
