CHAPTER TWO

BASIC CONCEPTS AND DATA MINING TECHNIQUES


2.1 Concept of Data Mining
Database technology has been used with great success in
traditional business data processing. There is an increasing
desire to use this technology in new application domains. One
such application domain that is likely to acquire considerable
significance in the near future is database mining. An
increasing number of organizations are creating ultra large
databases (measured in gigabytes and even terabytes) of
business data, such as consumer data, transaction histories,
sales records, etc.; such data forms a potential gold mine of
valuable business information.
Data mining is a relatively new and promising technology. It
can be defined as the process of discovering meaningful new
correlations, patterns, and trends by digging into (mining) large
amounts of data stored in warehouses. The tools used can include
statistical models, mathematical algorithms, and machine
learning methods such as neural networks or decision trees.
Consequently, data mining consists of more than collecting
and managing data; it also includes analysis and prediction.
The objective of data mining is to identify valid, novel,
potentially useful, and understandable correlations and
patterns in existing data. Finding useful patterns in data is
known by different names (e.g., knowledge extraction,
information discovery, information harvesting, data
archeology, and data pattern processing) [1,2].
The term data mining is primarily used by statisticians,
database researchers, and the business communities. The
term KDD (Knowledge Discovery in Databases) refers to the
overall process of discovering useful knowledge from data,
where data mining is a particular step in this process. The
steps in the KDD process, such as data preparation, data
selection, data cleaning, and proper interpretation of the
results of the data mining process, ensure that useful
knowledge is derived from the data [3, 4].
Data mining is an extension of traditional data analysis and
statistical approaches, as it incorporates analytical techniques
drawn from various disciplines such as AI, machine learning,
OLAP, and data visualization, as shown in Figure 2.1.

Figure 2.1. Intersection of Technologies Related to Data Mining

This new technology should enable the discovery of trends
and predictive patterns in data, the creation and testing of
hypotheses, and the generation of insight-provoking visualizations.
With the enormous amount of data stored in files, databases,
and other repositories, it is increasingly important to develop
powerful tools for the analysis and interpretation of such data and
for the extraction of interesting knowledge that could help in
decision-making [1,2,3,4].

2.1 Data Mining Definition


Briefly speaking, data mining refers to extracting useful
information from vast amounts of data. Many other terms are
used to refer to data mining, such as knowledge
mining from databases, knowledge extraction, data analysis,
and data archaeology. Nowadays, it is commonly agreed that
data mining is an essential step in the process of knowledge
discovery in databases.
Data Mining is a multidisciplinary field, encompassing areas
like information technology, machine learning, statistics,
pattern recognition, data retrieval, neural networks,
information based systems, artificial intelligence and data
visualization. Data mining is the extraction of hidden
predictive information from large databases; it is a powerful
technology with great potential to help organizations focus on
the most important information in their data warehouses
[1,2,3].
Based on a broad view of data mining functionality, data
mining is the process of discovering interesting knowledge
from large amounts of data stored either in databases, data
warehouses, or other information repositories. Data mining is
the analysis of observational data sets to find unsuspected
relationships and to summarize the data in novel ways that
are both understandable and useful to the data owner. The
relationships and summaries derived through a data mining
exercise are often referred to as models or patterns. In short, data
mining applies analytical methods to data with
the intention of uncovering hidden patterns [2,3].
Data mining technology has been used for many years in
many fields, including business, science, and government.
It is used to sift through volumes of data such as airline
passenger trip information, population data, and marketing
data to generate market research reports, although that
reporting is sometimes not considered to be data mining.
Data mining, popularly known as Knowledge Discovery in
Databases (KDD), is the nontrivial extraction of implicit,
previously unknown, and potentially useful information from
data in databases. Although data mining and knowledge
discovery in databases (KDD) are frequently treated as
synonyms, data mining is actually part of the knowledge
discovery process [3, 5].
2.2 The Data Mining Tasks:
The data mining tasks are of different types depending on the
use of data mining result. The Data Mining tasks are classified
as [1, 2]:
1. Exploratory Data Analysis: This is simply exploring the data
without any clear idea of what we are looking for. These
techniques are interactive and visual.
2. Descriptive Modeling: This describes all of the data. It includes
models for the overall probability distribution of the data,
partitioning of the p-dimensional space into groups, and
models describing the relationships between the variables.
3. Predictive Modeling: This permits the value of one
variable to be predicted from the known values of other
variables.
4. Discovering Patterns and Rules: This is concerned with pattern
detection. One example is spotting fraudulent behavior by
detecting regions of the space defining the different types of
transactions where the data points differ significantly from
the rest.
5. Retrieval by Content: This is finding patterns similar to a
pattern of interest in the data set. This task is most commonly
used for text and image data sets.

2.3. Types of Data Mining Systems:


Data mining systems can be categorized according to various
criteria. The classification is as follows [3]:
1. Classification of data mining systems according to the type
of data source mined: This classification is according to the
type of data handled such as spatial data, multimedia data,
time-series data, text data, World Wide Web, etc.
2. Classification of data mining systems according to the data
model: This classification is based on the data model involved,
such as relational database, object-oriented database, data
warehouse, transactional database, etc.
3. Classification of data mining systems according to the kind
of knowledge discovered:
This classification is based on the kind of knowledge discovered
or data mining functionalities, such as characterization,
discrimination, association, classification, clustering, etc.
Some systems tend to be comprehensive systems offering
several data mining functionalities together.
4. Classification of data mining systems according to mining
techniques used: This classification is according to the data
analysis approach used such as machine learning, neural
networks, genetic algorithms, statistics, visualization,
database oriented or data warehouse-oriented, etc.
The classification can also take into account the degree of
user interaction involved in the data mining process such as
query-driven systems, interactive exploratory systems, or
autonomous systems. A comprehensive system would provide
a wide variety of data mining techniques to fit different
situations and options, and offer different degrees of user
interaction.
2.4. Data Mining Life Cycle:
The life cycle of a data mining project consists of six phases
[2,4]. The sequence of the phases is not rigid. Moving back
and forth between different phases is always required,
depending on the outcome of each phase, as shown in Figure
2.2.

Figure 2.2. Data Mining Life Cycle
Source: Fayyad (1996); Han and Kamber (2006)
The main phases are:
1. Exploring and Preprocessing: the initial steps of
exploring, visualizing, and querying the data, to gain insight
into the data in an interactive manner. Preprocessing steps
such as variable selection, data focusing, and data validation can
also be included in these initial steps.
2. Modeling: the steps involved in (a) selecting the model
representations that we seek to fit to the data (e.g., a tree, a linear
function, a probability density model, etc.), (b) selecting the
score functions that score different models with respect to the
data, and (c) specifying the computational methods and
algorithms to optimize the score function (e.g., greedy local
search). These components combined together specify the
data mining algorithm to be used. The components may be
precompiled into a specific algorithm (e.g., CART or C4.5
decision tree implementations) or may be integrated in a
customized manner for a specific application (much more
common in the sciences).
3. Mining: the step (often repeated) of actually running a
particular data mining algorithm on a particular data set.
4. Evaluating: the step (often ignored) of critically evaluating
the quality of the output of the data mining algorithm from
step 3, both the predictions of the model and the
interpretation of the fitted model itself.
5. Deploying: the step (rarely achieved) of putting a model
from a data mining algorithm into routine predictive use, e.g.,
using the model continuously in real-time for scoring
customers visiting an ecommerce Web site. A challenging
(and under-appreciated) technical issue in this context is how
and when models should be updated for such continuous "data
stream" applications.
2.5. The Data Mining Models:
The data mining models are of two types [1,2,6]: Predictive
and Descriptive.
2.5.1 Descriptive Models
The descriptive model identifies the patterns or relationships
in data and explores the properties of the data examined. Examples
include clustering, summarization, association rules, and sequence
discovery.
Clustering is similar to classification except that
the groups are not predefined, but are defined by the data
alone. It is also referred to as unsupervised learning or
segmentation. It is the partitioning or segmentation of the
data into groups or clusters. Domain experts define the clusters
by studying the behavior of the data. The
term segmentation is used in a very specific context: it is the
process of partitioning a database into disjoint groupings of
similar tuples.
Summarization is the technique of presenting
summarized information from the data.
Association rules capture associations between different attributes.
Association rule mining is a two-step process: (1) finding all
frequent item sets, and (2) generating strong association rules from
the frequent item sets.
Sequence discovery is the process of
finding sequence patterns in data. These sequences can be
used to understand trends.
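The two-step association rule process described above can be illustrated with a minimal Python sketch. It is not taken from the cited sources: the transactions, the function names, and the thresholds are hypothetical, and the frequent item sets are found by brute-force enumeration rather than by an optimized algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: enumerate all item sets whose support meets min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def strong_rules(freq, min_confidence):
    """Step 2: keep rules A -> B with support(A and B) / support(A) >= min_confidence."""
    rules = []
    for itemset, s in freq.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent item set is itself frequent,
                # so freq[a] is always available here.
                confidence = s / freq[a]
                if confidence >= min_confidence:
                    rules.append((set(a), set(itemset - a), confidence))
    return rules


freq = frequent_itemsets(transactions, min_support=0.6)
rules = strong_rules(freq, min_confidence=0.7)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

A rule is called strong here when it passes both the support and the confidence threshold; both threshold values are arbitrary choices for the example.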
2.5.2 Predictive Models
The predictive model makes predictions about unknown data
values by using known values. Examples include classification,
regression, time series analysis, and prediction. Many
data mining applications aim to predict the future state
of the data.
Prediction is the process of analyzing the current and past
states of an attribute and predicting its future state.
Classification is a technique for mapping the target data to
predefined groups or classes. It is a form of supervised learning
because the classes are defined before the target data are
examined. Regression involves learning a
function that maps a data item to a real-valued prediction variable.
In time series analysis, the value of an attribute is
examined as it varies over time. Distance measures are used
to determine the similarity
between different time series, the structure of the line is
examined to determine its behavior, and the historical time
series plot is used to predict future values of the variable.
2.6. Data Mining Methods:
The data mining methods are broadly categorized as: On-Line
Analytical Processing (OLAP), Classification, Clustering,
Association Rule Mining, Temporal Data Mining, Time Series
Analysis, Spatial Mining, Web Mining, etc. These methods use
different types of algorithms and data. The data source can be
a data warehouse, database, flat file, or text file. The algorithms
may be statistical, decision tree based, nearest
neighbor, neural network based, genetic algorithm based,
rule based, support vector machine, etc. The selection of a data
mining algorithm
depends mainly on the type of data used for mining and the
expected outcome of the mining process. Domain experts
play a significant role in the selection of the algorithm for data
mining.
2.7 Data Mining and Statistics
The disciplines of statistics and data mining both aim to
discover structure in data. So much do their aims overlap,
that some people regard data mining as a subset of statistics.
But that is not a realistic assessment as data mining also
makes use of ideas, tools, and methods from other areas
particularly database technology and machine learning, and is
not heavily concerned with some areas in which statisticians
are interested. Statistical procedures do, however, play a
major role in data mining, particularly in the processes of
developing and assessing models. Most of the learning
algorithms use statistical tests when constructing rules or
trees and also for correcting models that are overfitted.
Statistical tests are also used to validate machine learning
models and to evaluate machine learning algorithms.
Some of the commonly used statistical analysis techniques
are discussed below.
Descriptive and Visualization Techniques include simple
descriptive statistics such as averages and measures of
variation, counts and percentages, and cross-tabs and simple
correlations. They are useful for understanding the structure
of the data.
Visualization is primarily a discovery technique and is useful
for interpreting large amounts of data; visualization tools
include histograms, box plots, scatter diagrams, and multi-
dimensional surface plots.
Cluster Analysis seeks to organize information about
variables so that relatively homogeneous groups, or
"clusters," can be formed. The clusters formed with this family
of methods should be highly internally homogenous
(members are similar to one another) and highly externally
heterogeneous (members are not like members of other
clusters).
Correlation Analysis measures the relationship between
two variables. The resulting correlation coefficient shows whether
changes in one variable are associated with changes in the other.
When comparing the correlation between two variables, the
goal is to see if a change in the independent variable is
accompanied by a change in the dependent variable.
This information helps in understanding an independent
variable's predictive abilities.
Correlation findings, just as regression findings, can be useful
in analyzing causal relationships, but they do not by
themselves establish causal patterns.
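As an illustration of the correlation coefficient discussed above, the following Python sketch computes the Pearson coefficient, which is one common choice; the text does not say which coefficient is meant, so this is an assumption. The function name and the sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical data: weekly study hours and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
print(round(pearson_r(hours, scores), 3))
```

A value close to +1 or -1 indicates a strong linear association, and a value near 0 indicates little linear association. As noted above, a high value does not by itself establish a causal pattern.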
Discriminant Analysis is used to predict membership in
two or more mutually exclusive groups from a set of
predictors, when there is no natural ordering on the groups.
Factor Analysis is useful for understanding the underlying
reasons for the correlations among a group of variables. The
main applications of factor analytic techniques are to reduce
the number of variables and to detect structure in the
relationships among variables; that is to classify variables.
Therefore, factor analysis can be applied as a data reduction
or structure detection method. In an exploratory factor
analysis, the goal is to explore or search for a factor structure.
Confirmatory factor analysis, on the other hand, assumes the
factor structure is known a priori and the objective is to
empirically verify or confirm that the assumed factor structure
is correct.
Regression Analysis is a statistical tool that uses the
relation between two or more quantitative variables so that
one variable (dependent variable) can be predicted from the
other(s) (independent variables). But no matter how strong
the statistical relations are between the variables, no cause-
and-effect pattern is necessarily implied by the regression
model.
Regression analysis comes in many flavors, including simple
linear, multiple linear, curvilinear, and multiple curvilinear
regression models.
2.8 Data Mining Techniques and Algorithms
This section provides an overview of some of the most
common data mining algorithms in use today. The section has
been divided into two broad categories:
Classical Techniques: Statistics, Neighborhoods and
Clustering
Next Generation Techniques: Trees, Networks and Rules
These categories describe a number of data mining
algorithms at a high level and should help in understanding how
each algorithm fits into the landscape of data mining
techniques. Overall, six broad classes of data mining
algorithms are covered, although there are a number of other
algorithms and many variations of the techniques described.
2.8.1 Classical Techniques: Statistics, Neighborhoods
and Clustering
2.8.1.1 Statistics
By strict definition, statistics or statistical techniques are not
data mining. They were being used long before the term data
mining was coined to apply to business applications. However,
statistical techniques are driven by the data and are used to
discover patterns and build predictive models. This is why it is
important to have an idea of how statistical techniques work
and how they can be applied.
2.8.1.1.1 Prediction using Statistics
The term prediction is used for a variety of types of analysis
that may elsewhere be more precisely called regression.
Nonetheless, regression is a powerful and commonly used tool
in statistics.
2.8.1.1.2 Linear Regression
In statistics, prediction is usually synonymous with regression
of some form. There are a variety of different types of
regression in statistics, but the basic idea is that a model is
created that maps values from predictors in such a way that
the lowest error occurs in making a prediction. The simplest
form of regression is simple linear regression, which contains
just one predictor and a prediction.
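Simple linear regression with one predictor can be sketched in Python as follows. The sketch assumes that "lowest error" means the least-squares criterion, which the text does not state explicitly; the function names and sample data are hypothetical.

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, value):
    """Apply a fitted (slope, intercept) pair to a new predictor value."""
    slope, intercept = model
    return slope * value + intercept


# Hypothetical data: the predictor x and the value to be predicted y.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
model = fit_simple_linear(x, y)
print(model, predict(model, 5))
```

The closed-form slope and intercept used here minimize the sum of squared differences between the observed and predicted values.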
2.8.1.2 Nearest Neighbor
Clustering and the Nearest Neighbor prediction technique are
among the oldest techniques used in data mining. Most
people understand clustering intuitively: like records are grouped
together. Nearest neighbor is a prediction technique that is
quite similar to clustering. Its essence is that, in order to
predict the prediction value for one record, one looks for
records with similar predictor values in the historical database
and uses the prediction value from the record that is nearest
to the unclassified record. Nearness need not be physical: a
better definition of near
might in fact be the people that you graduated from college
with rather than the people that you live next to. Nearest
Neighbor techniques are easy to use and understand because
they work in a way similar to the way that people think - by
detecting closely matching examples.
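The nearest neighbor idea described above can be sketched in a few lines of Python. The sketch assumes Euclidean distance as the measure of nearness and uses a single nearest record; the historical records, the function name, and the labels are hypothetical.

```python
import math


def nearest_neighbor_predict(history, query):
    """Return the prediction value of the historical record nearest to query.

    history is a list of (predictor_values, prediction_value) pairs;
    nearness is measured with Euclidean distance (an assumption).
    """
    nearest = min(history, key=lambda record: math.dist(record[0], query))
    return nearest[1]


# Hypothetical historical database: two predictor values and a known outcome.
history = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 4.5), "high"),
]
print(nearest_neighbor_predict(history, (4.5, 5.2)))
```

In practice the predictor values are usually rescaled first so that no single predictor dominates the distance, and several neighbors may be consulted instead of one.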
2.8.1.3 Clustering
Clustering can be described as the identification of similar classes of
objects. By using clustering techniques we can further identify
dense and sparse regions in object space and can discover
the overall distribution pattern and correlations among data
attributes. Classification can also be used as an
effective means of distinguishing groups or classes of objects,
but it becomes costly, so clustering can be used as a
preprocessing approach for attribute subset selection and
classification.
2.8.2 Next Generation Techniques: Trees, Networks and
Rules
2.8.2.1 Decision Trees
A decision tree is a predictive model that, as its name implies,
can be viewed as a tree. Specifically each branch of the tree is
a classification question and the leaves of the tree are
partitions of the dataset with their classification.
There are some interesting things about the tree:
It divides up the data on each branch point without losing
any of the data (the number of total records in a given
parent node is equal to the sum of the records contained
in its two children).
The number of churners and non-churners is conserved
as you move up or down the tree.
It is pretty easy to understand how the model is being
built (in contrast to the models from neural networks or
from standard statistics).
It would also be pretty easy to use this model if you
actually had to target those customers that are likely to
churn with a targeted marketing offer.
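The conservation properties listed above can be checked with a small Python sketch of a single branch point. The customer records, the predictor (monthly minutes of use), and the threshold are hypothetical and serve only to show that a split loses no records.

```python
from collections import Counter

# Hypothetical customer records: (monthly_minutes, churned).
records = [
    (120, True), (300, False), (80, True), (450, False),
    (200, True), (350, False), (90, False),
]


def split(records, threshold):
    """One branch point: the question 'monthly_minutes < threshold?'."""
    left = [r for r in records if r[0] < threshold]
    right = [r for r in records if r[0] >= threshold]
    return left, right


def class_counts(records):
    """Count churners (True) and non-churners (False) in a node."""
    return Counter(churned for _, churned in records)


left, right = split(records, threshold=250)
print("parent:", class_counts(records))
print("left:  ", class_counts(left))
print("right: ", class_counts(right))
```

Each record answers the branch question with either yes or no, so it ends up in exactly one child. This is why the record totals and the per-class counts of the children add up to those of the parent.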
2.8.2.2 Neural Networks
Neural networks are an approach to computing that involves
developing mathematical structures with the ability to learn.
The methods are the result of academic investigations to
model nervous system learning. Neural networks have the
remarkable ability to derive meaning from complicated or
imprecise data. This can be used to extract patterns and
detect trends that are too complex to be noticed by either
humans or other computer techniques. A trained neural
network can be thought of as an "expert" in the category of
information it has been given to analyze. This expert can then
be used to provide projections given new situations of interest
and answer "what if" questions. Neural networks have already
been successfully applied in many industries. Since neural
networks are best at identifying patterns or trends in data,
they are well suited for prediction or forecasting needs.

2.8.2.3 Rule Induction
Rule induction is one of the major forms of data mining and is
the most common form of knowledge discovery in
unsupervised learning systems. Rule induction on a database
can be a massive undertaking in which all possible patterns are
systematically pulled out of the data and then assigned an accuracy
and a significance that tell the user how
strong the pattern is and how likely it is to occur again.
2.9 Educational Data Mining
2.9.1 EDM Basics
Applying data mining (DM) in education is an emerging
interdisciplinary research field also known as Educational Data
Mining (EDM). It is concerned with developing methods for
exploring the unique types of data that come from
educational environments. Its goal is to better understand
how students learn and identify the settings in which they
learn to improve educational outcomes and to gain insights
into and explain educational phenomena. Educational
information systems can store a huge amount of potential
data from multiple sources coming in different formats and at
different granularity levels. Each particular educational
problem has a specific objective with special characteristics
that require a different treatment of the mining problem. These
issues mean that traditional DM techniques cannot be applied
directly to these types of data and problems. As a
consequence, the knowledge discovery process has to be
adapted and some specific DM techniques are needed [7].
Educational Data Mining is a new and growing research area that
can be defined as the application of data mining techniques
to raw data from educational systems in order to respond to
educational questions and problems, and also to discover
the information hidden behind this data. Over the last few years,
the popularity of this field has produced so many
research studies that it is difficult to survey them all and to identify the
contribution of data mining techniques in educational
systems. The aim is to exploit and understand the raw data
collected from educational systems in order to help the designers and
the users of these systems improve their performance and
to extract useful information on the behaviors of students in
the learning process.

2.9.2 Educational Data Mining Definitions:


Different definitions have been provided for the term
Educational Data Mining. Educational Data Mining can be defined
as an emerging discipline, with a suite of computational and
psychological methods and research approaches for
understanding how students learn, and the settings in which
they learn [7].
Educational data mining (EDM) is also described as an emerging discipline
which focuses on applying data mining tools and techniques to
educationally related data.
The first definition does not mention data mining, leaving it open to
exploring and developing other analytical methods that can
be applied to educationally related data. However, in [8] the
authors specify that EDM is both a learning science and
a rich application area for data mining, due to the growing
availability of educational data. It enables data-driven
decision making for improving current educational
practice and learning material.
In the same way, Romero and Ventura [9, 10] define EDM as
the application of data mining (DM) techniques to the specific
type of dataset that comes from educational environments to
address important educational questions.
Although different in some details, these definitions share an
emphasis on discovering knowledge based on educational
data to improve educational systems.
Educational data mining can be applied to assess students'
learning performance, to improve the learning process and
guide students' learning, to provide feedback and adapt
learning recommendations based on students' learning
behaviors, to evaluate learning materials and courseware, to
detect abnormal learning behaviors and problems, and to
achieve a deeper understanding of educational phenomena
[2], as shown in Figure 2.3.
Figure 2.3. The Cycle of Applying Data Mining in Educational
Systems
Source: Romero and Ventura, 2007, p. 136
[7] Romero, C. and Ventura, S., "Educational data mining: A survey from 1995 to 2005," Expert Systems with
Applications, 33, pp. 135-146, 2007.

EDM is concerned with developing methods to explore the
unique types of data in educational settings and, using these
methods, to better understand students and the settings in
which they learn. Whether educational data is taken from
students' use of interactive learning environments, computer-
supported collaborative learning, or administrative data from
schools and universities, it often has multiple levels of
meaningful hierarchy, which often need to be determined by
properties in the data itself, rather than in advance. Issues of
time, sequence, and context also play important roles in the
study of educational data. The main objective of educational
institutes is to provide quality education to their students and to
improve the quality of managerial decisions. One way to
achieve the highest level of quality in a higher education system is
by discovering knowledge from educational data to study the
main attributes that may affect students' performance.
The discovered knowledge can be used to offer helpful and
constructive recommendations to the academic planners in
higher education institutes to enhance their decision-making
process, to improve students' academic performance and
reduce the failure rate, to better understand students'
behavior, to assist instructors, to improve teaching, and for many
other benefits.
2.9.3 Process of Applying the Educational Data Mining:
This process starts with collecting or choosing the data to
study from the educational environment. The obtained raw
data require cleaning and preprocessing (heterogeneous data
fusion, treatment of missing and incorrect values, converting
the data to an appropriate form, feature selection, etc.). This
phase often requires the use of some data mining techniques.
Once the data are preprocessed, the appropriate EDM
method/technique is applied.
Finally, the last step is the interpretation and the assessment
of the obtained results. To apply this process, which is often
difficult given the heterogeneity of the data in the educational
context, several tools are used, as shown in Figure 2.4.
Figure 2.4. The Process of Applying Data Mining in Educational Data
Mining
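The cleaning and preprocessing step of this process can be illustrated with a short Python sketch. It is only one possible treatment: the field names, the valid grade range of 0 to 100, the pass mark of 50, and the choice of mean imputation are all assumptions made for the example, not details from the cited sources.

```python
def preprocess(rows):
    """Clean hypothetical student rows with 'attendance' and 'grade' fields.

    Assumptions: grades are valid only in the range 0..100, a missing
    attendance value is replaced by the mean of the known values, and
    a grade of 50 or more counts as a pass.
    """
    # Treatment of incorrect values: drop rows with a missing or
    # out-of-range grade.
    valid = [r for r in rows
             if r["grade"] is not None and 0 <= r["grade"] <= 100]

    # Treatment of missing values: mean imputation for attendance.
    known = [r["attendance"] for r in valid if r["attendance"] is not None]
    mean_attendance = sum(known) / len(known)

    cleaned = []
    for r in valid:
        attendance = (r["attendance"] if r["attendance"] is not None
                      else mean_attendance)
        # Conversion to an appropriate form: rescale attendance to 0..1
        # and turn the grade into a pass/fail label.
        cleaned.append({"attendance": attendance / 100.0,
                        "passed": r["grade"] >= 50})
    return cleaned


raw = [
    {"attendance": 80, "grade": 70},
    {"attendance": None, "grade": 40},   # missing attendance
    {"attendance": 60, "grade": 150},    # incorrect grade
    {"attendance": 100, "grade": 90},
]
cleaned = preprocess(raw)
print(cleaned)
```

The cleaned rows can then be passed to whichever EDM method is selected in the next step.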

2.9.4 Goals for Educational Data Mining in the Educational Field
In the last several years, EDM has been applied to address a
wide range of goals that are all part of the general
objective of improving learning.
Romero and Ventura [10] proposed classifying EDM objectives
according to the viewpoint of the final user (learner,
educator, administrator, and researcher) and the problem to
resolve:
Learners. To support a learner's reflections on the
situation, to provide adaptive feedback or
recommendations to learners, to respond to students'
needs, to improve learning performance, etc.
Educators. To understand their students' learning
processes and reflect on their own teaching methods, to
improve teaching performance, to understand social,
cognitive and behavioral aspects, etc.
Researchers. To develop and compare data mining
techniques to be able to recommend the most useful one
for each specific educational task or problem, to evaluate
learning effectiveness when using different settings and
methods, etc.
Administrators. To evaluate the best way to organize
institutional resources (human and material) and their
educational offer.
Student modeling. User modeling in the educational
domain incorporates such detailed information as
students' characteristics or states such as knowledge,
skills, motivation, satisfaction, meta-cognition, attitudes,
experiences and learning progress, or certain types of
problems that negatively impact their learning outcomes
(making too many errors, misusing or under-using help,
gaming the system, inefficiently exploring learning
resources, etc.), affect, learning styles, and preferences.
The common objective here is to create or improve a
student model from usage information.
Predicting students' performance and learning
outcomes. The objective is to predict a student's final
grades or other types of learning outcomes (such as
retention in a degree program or future ability to learn),
based on data from course activities.
Generating recommendations. The objective is to
recommend to students which content is the most
appropriate for them at the current time.
Analyzing learners' behavior. This takes several
forms, including applying educational data mining to answer
questions in any of the three areas previously discussed
(student modeling, prediction, and generating
recommendations). It is also used to group students
according to their profile, and for adaptation and
personalization purposes.
Communicating to stakeholders. The objective is to
help course administrators and educators in analyzing
students' activities and usage information in courses.
Domain structure analysis. The objective is to
determine domain structure and improve domain
models that characterize the content to be learned and
optimal instructional sequences, using the ability to
predict students' performance as a quality measure of
a domain structure model. Performance on tests or within
a learning environment is utilized for this goal.
Maintaining and improving courses. It is related to
the two previous goals. The objective here is to determine
how to improve courses (contents, activities, links, etc.),
using information (in particular) about student usage and
learning.
Studying the effects of different kinds of pedagogical
support that can be provided by learning software.
Advancing scientific knowledge about learning and
learners through building, discovering or improving
models of the student, the domain, and the pedagogical
support.
2.9.5 Predicting Student Performance
In educational systems, data mining techniques are used to
predict student performance scores; the predicted
values can be numerical or categorical. Regression analysis
finds the relationship between a dependent (numerical) variable and one
or more independent (numerical) variables. Classification
is used to assign individual items to classes. Different data
mining techniques, such as neural networks, rule-based systems,
regression analysis, and correlation analysis, are applied to
educational data.

Grouping Students:
Clustering and classification techniques are used to build
groups of students based on their characteristics,
performance, etc. Common clustering algorithms include
K-means, hierarchical, and model-based clustering.
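The K-means idea can be sketched in a few lines of Python. This is
a simplified illustration under stated assumptions: the initial
centroids are simply the first k points (real implementations usually
choose them randomly or with a seeding heuristic), and the student
records are hypothetical (CGPA, attendance fraction) pairs.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Return (centroids, assignment) after alternating assign/update steps."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic init: first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: no point changed group
            break
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

# Hypothetical students described by (CGPA, attendance fraction).
students = [(3.6, 0.95), (3.4, 0.90), (1.9, 0.55), (2.1, 0.60), (3.5, 0.85)]
centroids, groups = kmeans(students, k=2)
print(groups)
```

Because the features here are on different scales, a real study would
normally normalize them before clustering.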
Enrollment Management:
Enrollment management is required in an educational system to
shape the enrollment strategy of an institution and fulfill all
its goals. It is a set of activities such as marketing,
retention programs, and the admission process.
2.9.5.1 Important Factors in Predicting Students'
Performance
There are two main factors in predicting students'
performance: attributes and prediction methods. This section
first considers the important attributes used in predicting
student performance, and then the prediction methods used.
The attributes that have been used most frequently are
cumulative grade point average (CGPA) and internal
assessment. Most data mining studies have used CGPA as their
main attribute to predict students' performance. The main
reason most researchers use CGPA is that it has a tangible
value for future educational and career mobility. It can also be
considered an indication of realized academic potential.
The next most frequently used attributes are students'
demographics and external assessments. Students'
demographics include gender, age, family background, and
disability. Most researchers use demographic attributes such
as gender because male and female students have different
styles in their learning process.
The three other attributes most often used in predicting
students' performance are extra-curricular activities, high
school background, and social interaction network. Each of
these attributes was used in five out of thirty studies.
Several researchers in another study have also used
psychometric factors to predict students' performance.
Psychometric factors are identified as student interest, study
behavior, engagement time, and family support. They used
these attributes to make the system very clear, simple, and
user friendly.
2.9.5.2 The Prediction Methods Used for Student
Performance
In educational data mining, predictive modeling is usually
used to predict student performance. Several tasks are used
to build a predictive model: classification, regression, and
categorization. The most popular task for predicting students'
performance is classification. Several algorithms under the
classification task have been applied to predict students'
performance.
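To make the classification task concrete, the following is a small
sketch of one possible classifier, k-nearest neighbours. The choice of
this algorithm, the feature pair (normalized CGPA, normalized internal
assessment), and the labelled records are all assumptions made for
illustration; they are not taken from the studies reviewed in this
chapter.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Predict the majority label among the k training records nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled records: (CGPA / 4, internal mark / 100) -> outcome.
train = [
    ((0.95, 0.90), "pass"),
    ((0.80, 0.75), "pass"),
    ((0.85, 0.80), "pass"),
    ((0.40, 0.45), "fail"),
    ((0.35, 0.30), "fail"),
    ((0.50, 0.40), "fail"),
]

print(knn_predict(train, (0.90, 0.85)))
print(knn_predict(train, (0.30, 0.35)))
```

The same interface (training records in, predicted label out) applies
to the other classification algorithms mentioned in the literature,
such as decision trees and Naive Bayes.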

2.11 RELATED WORK


Data mining in higher education is a recent research field, and
it is gaining popularity because of its potential benefits to
educational institutions. Various previous studies have been
conducted to predict students' academic performance using
data mining techniques.
Following is a brief description of some of the most relevant
studies found in the related literature.
Delavari and Beikzadeh [2] describe how data mining methods
can be used in higher learning institutions and how data
mining can be applied to educational data. Their study aimed
to show how data mining can be used in higher education to
improve students' performance.
R. R. Kabra and R. S. Bichkar [3] show that a model can
be created from students' past academic performance with
the help of a decision tree algorithm, and that this model can
predict students' performance in the first-year engineering
exam. Their study aimed to discover knowledge that describes
students' performance in the end-semester examination. The
student data they used consist of the students' previous
records, including attendance, class test, seminar, and
assignment marks.
Oladipupo and Oyelade [4] present a study in which
students' failure patterns are identified using association
rule mining. Their approach is intended to help reduce the
failure rate and improve academic performance.
Zlatko J. Kovacic [5] presents a case study on student
success prediction using educational data mining. To classify
successful and unsuccessful students, the CHAID and CART
algorithms were applied to enrollment data from the Open
Polytechnic (New Zealand).
Nguyen et al. [6] predict the performance of undergraduate
and postgraduate students at two different institutes using
decision tree and Bayesian network algorithms. This study
focused on comparing students' performance at different
institutes and helping failing students. In this study, the
decision tree gives better accuracy than the Bayesian network.

Hijazi and Naqvi [7] present a case study on student
performance. In this study, a sample of 300 students is taken
from colleges of Punjab University (Pakistan), and they
determine that a high correlation is present between certain
factors (mother's education and student's family income) and
student performance. Their study aimed to discover
knowledge that describes the high correlation between family
education, family income, and student performance.
T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum and
Dr. V. Prasanna Venkatesan [8] conduct a case study on
students' qualitative data using decision tree algorithms to
identify the effect of qualitative data on student
performance.
Mohammed M. Abu Tair and Alaa M. El-Halees [9] conduct a
case study on graduate students' data using data mining
techniques to improve performance and extract useful
knowledge from this data. Their study aimed to show how data
mining can be used to improve students' performance.
Sunita B. Aher and L.M.R.J. Lobo [10] conduct a comparative
study to predict course selection using association rule
algorithms. This method allows the management to prepare the
necessary resources for newly enrolled students and
indicates at an early stage which types of courses will
potentially be selected.
Umesh Kumar Pandey and S. Pal [11] conducted a study in
which a sample of 600 students was taken from colleges of Dr.
M.L. Awadh University, Faizabad, to predict whether a new
student will be a performer or an underperformer.
Bhise R. B., Thorat S. S. and Supekar A. K. [12] performed a
data mining process on a student database using the K-means
clustering algorithm.
K. Shanmuga Priya and A. V. Senthil Kumar [13] use a
classification method that helps to improve performance and
extract knowledge from students' final semester marks.
Varun Kumar and Anupama Chadha [14] used the association
rule technique to improve the performance of postgraduate
students. Using association rule mining, they focus on
factors such as students' interest, teaching methodologies,
and curriculum design that can affect postgraduate students'
performance.
Abeer Badr El Din Ahmed and Ibrahim Sayed Elaraby
[15] use the decision tree technique for data classification
to help predict students' final grades. This study focused on
how well the decision tree works as a classification algorithm
for predicting students' final grades.
In 1986, J. R. Quinlan summarized an approach to decision
tree induction and described ID3; this paper is the standard
reference for the ID3 algorithm [16].
Anand Bahety implemented the ID3 algorithm on the Play
Tennis dataset to classify whether the weather is suitable
for playing tennis. The results showed that ID3 does not work
well with continuous attributes but gives good results for
missing values [17].
Mary Slocum presents an implementation of ID3 in the
medical area. She transforms the data into information to
make a decision and performs the task of collecting and
cleaning the data. Entropy and information gain concepts are
used in this study [18].
Kumar Ashok et al. applied ID3 classification to the 2011
Census of India data to help improve or implement policies for
the right people. Concepts from information theory are used
here. In the decision tree, the attribute selected on the basis
of this calculation becomes the root of the tree, and the
steps of this process are repeated [19].
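The entropy and information gain calculation that ID3 uses to choose
the root attribute can be sketched as follows. The 14-row table is the
Play Tennis example as it is commonly reproduced in textbooks; that it
matches the exact data used in the studies above is an assumption, and
the function names are chosen here for illustration only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

ATTRS = ("outlook", "temperature", "humidity", "wind")
RAW = [  # textbook Play Tennis table (assumed): attributes, then the class
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
rows = [dict(zip(ATTRS + ("play",), r)) for r in RAW]

# ID3 picks the attribute with the highest information gain as the root,
# then repeats the same calculation on each branch's subset of rows.
root = max(ATTRS, key=lambda a: information_gain(rows, a, "play"))
for a in ATTRS:
    print(a, round(information_gain(rows, a, "play"), 3))
print("root:", root)
```

On this table the class entropy is about 0.940 bits and the outlook
attribute has the largest gain (about 0.247), so it becomes the root.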
Sonika Tiwari used the ID3 algorithm to detect network
anomalies with a horizontal-partitioning-based decision tree
and applied different clustering algorithms. She checks the
network anomalies from the decision tree and then presents a
comparative analysis of different clustering algorithms and
the existing ID3 decision tree [20].
Yadav, Bharadwaj and Pal [21] used student data such as
attendance, seminar, assignment marks, and class test marks
to predict end-semester performance using three algorithms
(ID3, C4.5, and CART); the results show that CART gives
better results for classifying the data.
Pandey and Pal [22] present a study using association rule
analysis to find students' interest in the choice of class
language. In this paper they use seven different
interestingness measures. Their results showed that students
are interested in a mixed-mode class language.
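Interestingness measures for association rules can be illustrated
with three common ones: support, confidence, and lift. This sketch
does not claim to reproduce the seven measures used in the cited
paper; the transactions and item names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )

# Hypothetical records: each set lists the courses a student failed.
records = [
    {"maths", "physics"},
    {"maths"},
    {"physics"},
    {"maths", "physics"},
]

rule = ({"maths"}, {"physics"})
print(support(records, rule[0] | rule[1]))
print(confidence(records, *rule))
print(lift(records, *rule))
```

A lift above 1 suggests that the antecedent and consequent occur
together more often than expected if they were independent; a lift
below 1 suggests the opposite.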
Bharadwaj and Pal [23] use the decision tree classification
technique to evaluate students' end-semester performance;
this study helps to identify dropouts and students who
require special attention and teacher advising.
Al-Radaideh et al. [24] present a classification-based
model for student performance prediction using the ID3,
C4.5, and Naive Bayes algorithms; the decision trees gave
better results.

K. S. Priya and A. V. S. Kumar [25] use a classification
approach that extracts knowledge from students'
end-semester marks.
Data mining can be used in the educational field to enhance
our understanding of the learning process by focusing on
identifying and extracting the variables related to students'
learning, as described by Alaa el-Halees [26].
Han and Kamber [27] describe data mining software that
allows users to analyze data from different views and to
summarize the relationships identified during the mining
process.
Divakar, R. C. Jain [28] applied four classification methods
to student academic data: decision tree (ID3), multilayer
perceptron, decision table, and Naive Bayes.
Shaeela Ayesha, Tasleem Mustafa, Ahsan Raza Sattar,
and M. Inayat Khan [29] applied K-means clustering to
analyze the learning behavior of students, which helps the
tutor to improve students' performance and reduce the
dropout ratio to a significant level.

REFERENCES Related works


[1] C. Romero and S. Ventura, Educational data mining: A
survey from 1995 to 2005. Expert Systems with Applications
33 (2007) 135-146.
[2] Delavari N., Beikzadeh M. R., Data Mining Application in
Higher Learning Institutions, Informatics in Education, 2008,
Vol. 7, No. 1, 31-54.
[3] R. R. Kabra and R. S. Bichkar, Performance Prediction of
Engineering Students using Decision Trees, International
Journal of Computer Applications (0975-8887), Volume 36,
No. 11, December 2011.
[4] Oladipupo O. O. and Oyelade O. J., Knowledge Discovery
from Students' Result Repository: Association Rule Mining
Approach, International Journal of Computer Science &
Security (IJCSS), Volume (4): Issue (2).
[5] Zlatko J. Kovacic, Early Prediction of Student Success:
Mining Students Enrollment Data, Proceedings of Informing
Science & IT Education Conference (InSITE) 2010.
[6] Nguyen Thai Nghe, Paul Janecek, and Peter Haddawy, A
Comparative Analysis of Techniques for Predicting Academic
Performance, in Proceedings of the 37th ASEE/IEEE Frontiers
in Education Conference, pp. 7-12, 2007.
[7] Syed Tahir Hijazi and S. M. M. Raza Naqvi, Factors
Affecting Students' Performance: A Case of Private Colleges,
Bangladesh e-Journal of Sociology, Volume 3, Number 1,
January 2006.
[8] T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum and
Dr. V. Prasanna Venkatesan, An Analysis on Performance of
Decision Tree Algorithms using Students' Qualitative Data,
I.J. Modern Education and Computer Science, 2013, 5, 18-27.
[9] Mohammed M. Abu Tair, Alaa M. El-Halees, Mining
Educational Data to Improve Students' Performance: A Case
Study, International Journal of Information and Communication
Technology Research, Volume 2, No. 2, February 2012.
[10] Sunita B. Aher, L.M.R.J. Lobo, A Comparative Study of
Classification Algorithms, International Journal of
Information Technology and Knowledge Management,
July-December 2012, Volume 5, No. 2, pp. 239-243.
[11] Umesh Kumar Pandey and S. Pal, Data Mining: A
prediction of performer or underperformer using
classification, (IJCSIT) International Journal of Computer
Science and Information Technologies, Vol. 2 (2), 2011,
686-690.
[12] Bhise R. B., Thorat S. S., Supekar A. K., Importance of
Data Mining in Higher Education System, IOSR Journal of
Humanities and Social Science (IOSR-JHSS), ISSN: 2279-0837,
ISBN: 2279-0845, Volume 6, Issue 6 (Jan. - Feb. 2013),
pp. 18-21.
[13] K. Shanmuga Priya and A. V. Senthil Kumar, Improving the
Students' Performance Using Educational Data Mining, Int. J.
Advanced Networking and Applications, Volume 04, Issue 04,
pp. 1680-1685 (2013), ISSN: 0975-0290.
[14] Varun Kumar, Anupama Chadha, Mining Association
Rules in Students' Assessment Data, IJCSI International
Journal of Computer Science Issues, Vol. 9, Issue 5, No. 3,
September 2012.
[15] Abeer Badr El Din Ahmed and Ibrahim Sayed Elaraby,
Data Mining: A Prediction for Student's Performance Using
Classification Method, World Journal of Computer Application
and Technology 2(2): 43-47, 2014.
[16] Quinlan, J.R. 1986, Induction of Decision trees, Machine
Learning.
[17] Anand Bahety, Extension and Evaluation of ID3
Decision Tree Algorithm. University of Maryland, College Park.
[18] Mary Slocum, Decision Making Using ID3, Rivier
Academic Journal, Volume 8, Number 2, Fall 2012.
[19] Kumar Ashok, Taneja H. C., Chitkara Ashok K. and Kumar
Vikas, Classification of Census Using Information Theoretic
Measure Based ID3 Algorithm, Int. Journal of Math. Analysis,
Vol. 6, 2012, no. 51, 2511-2518.
[20] Sonika Tiwari and Prof. Roopali Soni, Horizontal
partitioning ID3 algorithm - A new approach of detecting
network anomalies using decision tree, International Journal
of Engineering Research & Technology (IJERT), Vol. 1, Issue 7,
September 2012.
[21] S. K. Yadav, B. K. Bharadwaj and S. Pal, Data Mining
Applications: A Comparative Study for Predicting Students'
Performance, International Journal of Innovative Technology
and Creative Engineering (IJITCE), Vol. 1, No. 12, pp. 13-19,
2011.
[22] U. K. Pandey and S. Pal, A Data Mining View on Class
Room Teaching Language, IJCSI, Vol. 8, Issue 2, March 2011,
pp. 277-282, ISSN 1694-0814.
[23] B. K. Bharadwaj and S. Pal, Mining Educational Data to
Analyze Students' Performance, International Journal of
Advance Computer Science and Applications (IJACSA), Vol. 2,
No. 6, pp. 63-69, 2011.
[24] Q. A. Al-Radaideh, E. W. Al-Shawakfa, and M. I.
Al-Najjar, Mining Student Data Using Decision Trees,
International Arab Conference on Information Technology
(ACIT2006), Yarmouk University, Jordan, 2006.
[25] K. Shanmuga Priya and A. V. Senthil Kumar, Improving the
Students' Performance Using Educational Data Mining, Int. J.
Advanced Networking and Applications, Volume 04, Issue 04,
pp. 1680-1685 (2013), ISSN: 0975-0290.
[26] Alaa el-Halees, Mining Students Data to Analyze
e-Learning Behavior: A Case Study, 2009.
[27] J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Morgan Kaufmann, 2000.
[28] Varsha, Anuj, Divakar, R. C. Jain, Result Analysis Using
Classification Techniques, International Journal of Computer
Applications (0975-8887), Volume 1, No. 22, 2010.
[29] S. Ayesha, T. Mustafa, A. R. Sattar, and M. I. Khan, Data
Mining Model for Higher Education System, European Journal
of Scientific Research, Vol. 43, No. 1, pp. 24-29, 2010.

References: chapter two


[1] Jiawei Han, Micheline Kamber, Data Mining: Concepts and
Techniques, London: Academic Press, 5, 2001.
[2] Introduction to Data Mining and Knowledge Discovery,
Third Edition, ISBN: 1-892095-02-5, Two Crows Corporation,
10500 Falls Road, Potomac, MD 20854 (U.S.A.), 1999.
[3] Larose, D. T., Discovering Knowledge in Data: An
Introduction to Data Mining, ISBN 0-471-66657-2, John Wiley
& Sons, Inc., 2005.
[4] Dunham, M. H., Sridhar S., Data Mining: Introductory and
Advanced Topics, Pearson Education, New Delhi, ISBN:
81-7758-785-4, 1st Edition, 2006.
[4] Chapman, P., Clinton, J., Kerber, R., Khabaza, T.,
Reinartz, T., Shearer, C. and Wirth, R., CRISP-DM 1.0:
Step-by-step data mining guide, NCR Systems Engineering
Copenhagen (USA and Denmark), DaimlerChrysler AG
(Germany), SPSS Inc. (USA) and OHRA Verzekeringen en Bank
Groep B.V. (The Netherlands), 2000.

[5] Tan Pang-Ning, Steinbach, M., Vipin Kumar, Introduction
to Data Mining, Pearson Education, New Delhi, ISBN:
978-81-317-1472-0, 3rd Edition, 2009.
[6] Parack, S., Zahid, Z., Merchant, F.: Application of data
mining in educational databases for predicting academic
trends and patterns. In: Proceedings of 2012 IEEE
International Conference on Technology Enhanced Education,
pp. 1-4. IEEE Press, Piscataway (2012)
[7] Huebner, R.A.: A survey of educational data-mining
research. Res. High. Educ. J. 19, 1-13 (2013)
[8] Calders, T., Pechenizkiy, M.: Introduction to the special
section on educational data mining. ACM SIGKDD Explor.
13(2), 3-6 (2011)
[9] Romero, C., Ventura, S.: Educational data mining: a review
of the state of the art. IEEE Trans. Syst. Man Cybern. Part C
Appl. Rev. 40(6), 601-618 (2010)
[10] Romero, C., Ventura, S.: Data mining in education. Wiley
Interdisc. Rev.: Data Min. Knowl. Discovery 3(1), 12-27 (2013)
9. CONCLUSION:
Most of the previous studies on data mining applications in
various fields use a variety of data types, ranging from text
to images, stored in a variety of databases and data
structures. Different data mining methods are used to
extract patterns, and thus knowledge, from these varied
databases. Selecting the data and methods for data mining is
an important task in this process and requires knowledge of
the domain. Several attempts have been made to design and
develop a generic data mining system, but no system has been
found to be completely generic. Thus, for every domain, the
assistance of domain experts is mandatory. The domain
experts should be guided by the system to effectively apply
their knowledge when using data mining systems to generate
the required knowledge. Domain experts are required to
determine the variety of data that should be collected in the
specific problem domain, select the specific data for data
mining, clean and transform the data, extract patterns for
knowledge generation, and finally interpret the patterns and
generate knowledge.