Uploaded by Sulieman Khalifa Arafa

2.1 Concept of Data Mining

Database technology has been used with great success in

traditional business data processing. There is an increasing

desire to use this technology in new application domains. One

such application domain that is likely to acquire considerable

significance in the near future is database mining. An

increasing number of organizations are creating ultra large

databases (measured in gigabytes and even terabytes) of

business data, such as consumer data, transaction histories,

sales records, etc.; such data forms a potential gold mine of

valuable business information.

Data mining is a relatively new and promising technology. It
can be defined as the process of discovering meaningful new
correlations, patterns, and trends by digging into (mining) large
amounts of data stored in warehouses. Data mining tools can include
statistical models, mathematical algorithms, and machine
learning methods such as neural networks or decision trees.
Consequently, data mining consists of more than collecting
and managing data; it also includes analysis and prediction.

The objective of data mining is to identify valid, novel,

potentially useful, and understandable correlations and

patterns in existing data. Finding useful patterns in data is

known by different names (e.g., knowledge extraction,

information discovery, information harvesting, data

archeology, and data pattern processing) [1,2].

The term data mining is primarily used by statisticians,

database researchers, and the business communities. The

term KDD (Knowledge Discovery in Databases) refers to the

overall process of discovering useful knowledge from data,

where data mining is a particular step in this process. The

steps in the KDD process, such as data preparation, data

selection, data cleaning, and proper interpretation of the

results of the data mining process, ensure that useful

knowledge is derived from the data [3, 4].

Data mining is an extension of traditional data analysis and

statistical approaches as it incorporates analytical techniques

drawn from various disciplines such as AI, machine learning,
OLAP, and data visualization, as shown in Figure 2.1.

[...] and predictive patterns in data, the creation and testing of
hypotheses, and the generation of insight-provoking visualizations.

With the enormous amount of data stored in files, databases,
and other repositories, it is increasingly important to develop
powerful tools for the analysis and interpretation of such data and
for the extraction of interesting knowledge that could help in
decision-making [1,2,3,4].

Briefly speaking, data mining refers to extracting useful

information from vast amounts of data. Many other terms are

being used to interpret data mining, such as knowledge

mining from databases, knowledge extraction, data analysis,

and data archaeology. Nowadays, it is commonly agreed that

data mining is an essential step in the process of knowledge

discovery in databases.

Data Mining is a multidisciplinary field, encompassing areas

like information technology, machine learning, statistics,

pattern recognition, data retrieval, neural networks,

information based systems, artificial intelligence and data

visualization. Data mining is the extraction of hidden

predictive information from large databases; it is a powerful

technology with great potential to help organizations focus on

the most important information in their data warehouses

[1,2,3].

Based on a broad view of data mining functionality, data

mining is the process of discovering interesting knowledge

from large amounts of data stored either in databases, data

warehouses, or other information repositories. Data mining is

the analysis of observational data sets to find unsuspected

relationships and to summarize the data in novel ways that

are both understandable and useful to the data owner. The

relationships and summaries derived through a data mining

exercise are often referred to as models or patterns. Data

mining is the process of applying these methods to data with

the intention of uncovering hidden patterns [2,3].

Data mining technology has been used for many years in
many fields by businesses, scientists, and governments.

It is used to sift through volumes of data such as airline

passenger trip information, population data and marketing

data to generate market research reports, although that

reporting is sometimes not considered to be data mining.

Data mining, popularly known as Knowledge Discovery in
Databases (KDD), is the nontrivial extraction of implicit,
previously unknown, and potentially useful information from
data in databases. Although data mining and knowledge
discovery in databases (KDD) are frequently treated as
synonyms, data mining is actually part of the knowledge
discovery process [3, 5].

2.2 The Data Mining Tasks:

The data mining tasks are of different types depending on the

use of data mining result. The Data Mining tasks are classified

as [1, 2]:

1. Exploratory Data Analysis: It is simply exploring the data

without any clear ideas of what we are looking for. These

techniques are interactive and visual.

2. Descriptive Modeling: It describes all of the data. It includes

models for overall probability distribution of the data,

partitioning of the p-dimensional space into groups and

models describing the relationships between the variables.

3. Predictive Modeling: This model permits the value of one

variable to be predicted from the known values of other

variables.

4. Discovering Patterns and Rules: It is concerned with pattern
detection. For example, the aim may be to spot fraudulent behavior by
detecting regions of the space defining the different types of
transactions where the data points are significantly different from
the rest.

5. Retrieval by Content: It is finding patterns similar to the

pattern of interest in the data set. This task is most commonly

used for text and image data sets.

Data mining systems can be categorized according to various

criteria. The classification is as follows [3]:

1. Classification of data mining systems according to the type

of data source mined: This classification is according to the

type of data handled such as spatial data, multimedia data,

time-series data, text data, World Wide Web, etc.

2. Classification of data mining systems according to the data

model: This classification is based on the data model involved

such as relational database, object-oriented database, data

warehouse, transactional database, etc.

3. Classification of data mining systems according to the kind

of knowledge discovered:

This classification is based on the kind of knowledge discovered

or data mining functionalities, such as characterization,

discrimination, association, classification, clustering, etc.

Some systems tend to be comprehensive systems offering

several data mining functionalities together.

4. Classification of data mining systems according to mining

techniques used: This classification is according to the data

analysis approach used such as machine learning, neural

networks, genetic algorithms, statistics, visualization,

database oriented or data warehouse-oriented, etc.

The classification can also take into account the degree of

user interaction involved in the data mining process such as

query-driven systems, interactive exploratory systems, or

autonomous systems. A comprehensive system would provide

a wide variety of data mining techniques to fit different

situations and options, and offer different degrees of user

interaction.

2.4. Data Mining Life Cycle:

The life cycle of a data mining project consists of six phases

[2,4]. The sequence of the phases is not rigid. Moving back

and forth between different phases is always required. It

depends on the outcome of each phase, as shown in Figure 2.2.

Fig 2.2: Data Mining Life Cycle

Source: Fayyad (1996); Han and Kamber (2006)

The main phases are:

1. Exploring and Preprocessing: the initial steps of

exploring, visualizing, and querying the data, to gain insight

into the data in an interactive manner. Preprocessing steps

such as variable selection, data focusing, and data validation can

also be included in these initial steps.

2. Modeling: the steps involved in (a) selecting the model

representations that we seek to fit to the data (e.g., a tree, a linear

function, a probability density model, etc.), (b) selecting the

score functions that score different models with respect to the

data, and (c) specifying the computational methods and

algorithms to optimize the score function (e.g., greedy local

search). These components combined together specify the

data mining algorithm to be used. The components may be

precompiled into a specific algorithm (e.g., CART or C4.5

decision tree implementations) or may be integrated in a

customized manner for a specific application (much more

common in the sciences).

3. Mining: the step (often repeated) of actually running a

particular data mining algorithm on a particular data set.

4. Evaluating: the step (often ignored) of critically evaluating

the quality of the output of the data mining algorithm from

step 3, both the predictions of the model and the

interpretation of the fitted model itself.

5. Deploying: the step (rarely achieved) of putting a model

from a data mining algorithm into routine predictive use, e.g.,

using the model continuously in real-time for scoring

customers visiting an ecommerce Web site. A challenging

(and under-appreciated) technical issue in this context is how

and when models should be updated for such continuous "data
stream" applications.
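The three components named in the Modeling phase can be illustrated with a minimal sketch. It assumes a linear function as the model representation, the sum of squared errors as the score function, and a simple greedy local search as the optimization method. The function names and the toy data are hypothetical and chosen only for illustration; they are not taken from the cited sources.

```python
# Three components of a data mining algorithm on toy data:
# (a) model representation, (b) score function, (c) search method.

def predict(params, x):
    """(a) Model representation: a linear function y = slope * x + intercept."""
    slope, intercept = params
    return slope * x + intercept


def score(params, data):
    """(b) Score function: sum of squared errors (lower is better)."""
    return sum((y - predict(params, x)) ** 2 for x, y in data)


def greedy_local_search(data, start=(0.0, 0.0), step=1.0, min_step=1e-6):
    """(c) Search: change one parameter at a time, keep any change that
    lowers the score, and halve the step size when no change helps."""
    params = list(start)
    best = score(params, data)
    while step > min_step:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                candidate = list(params)
                candidate[i] += delta
                s = score(candidate, data)
                if s < best:
                    params, best, improved = candidate, s, True
        if not improved:
            step /= 2
    return tuple(params), best


# Hypothetical toy data lying exactly on the line y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
fitted_params, fitted_score = greedy_local_search(data)
print(fitted_params, fitted_score)
```

Swapping any one component (for example, a tree instead of a line, or absolute error instead of squared error) yields a different algorithm, which is the sense in which the components "combined together specify the data mining algorithm".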

2.5. The Data Mining Models:

The data mining models are of two types [1,2,6]: Predictive

and Descriptive.

2.5.1 Descriptive Models

The descriptive model identifies the patterns or relationships
in data and explores the properties of the data examined.
Examples include clustering, summarization, association rules,
and sequence discovery.

Clustering is similar to classification except that
the groups are not predefined, but are defined by the data
alone. It is also referred to as unsupervised learning or
segmentation. It is the partitioning or segmentation of the
data into groups or clusters. The clusters are defined by
domain experts studying the behavior of the data. The
term segmentation is used in a very specific context: it is the
process of partitioning a database into disjoint groupings of
similar tuples.

Summarization is the technique of presenting
summarized information from the data.

Association rules capture associations between different attributes.
Association rule mining is a two-step process: (1) finding all
frequent item sets, and (2) generating strong association rules from
the frequent item sets.

Sequence discovery is the process of
finding sequence patterns in data. These sequences can be
used to understand trends.
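The two-step process of association rule mining can be sketched as follows. This is a brute-force illustration under stated assumptions, not an efficient algorithm such as Apriori: "frequent" is taken to mean that an item set's support (the fraction of transactions containing it) reaches a minimum threshold, and "strong" is taken to mean that a rule's confidence reaches a minimum threshold. The function names, thresholds, and transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step 1: return {itemset: support} for every item set whose support
    is at least min_support. Brute force, suitable only for tiny examples."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                result[itemset] = support
                found = True
        if not found:
            break  # no frequent item set of this size, so none larger
    return result


def strong_rules(itemsets, min_confidence):
    """Step 2: generate rules A -> B from the frequent item sets, keeping
    those with confidence = support(A and B) / support(A) >= min_confidence."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent item set is itself frequent,
                # so itemsets[a] is always present.
                confidence = support / itemsets[a]
                if confidence >= min_confidence:
                    rules.append((set(a), set(itemset - a), confidence))
    return rules


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = strong_rules(itemsets, min_confidence=0.8)
print(rules)
```

With these toy transactions, the only rule that passes both thresholds is "butter implies bread": every basket containing butter also contains bread.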

2.5.2 Predictive Models

The predictive model makes predictions about unknown data
values by using the known values. Examples include classification,
regression, time series analysis, and prediction. Many
data mining applications aim to predict the future state
of the data.

Prediction is the process of analyzing the current and past
states of an attribute and predicting its future state.
Classification is a technique for mapping the target data to
predefined groups or classes; it is supervised learning
because the classes are defined before the target data
are examined. Regression involves learning a
function that maps a data item to a real-valued prediction variable.
In time series analysis, the value of an attribute is
examined as it varies over time: distance measures are used
to determine the similarity between different time series,
the structure of the line is examined to determine its
behavior, and the historical time series plot is used to
predict future values of the variable.

2.6. Data Mining Methods:

The data mining methods are broadly categorized as: On-Line

Analytical Processing (OLAP), Classification, Clustering,

Association Rule Mining, Temporal Data Mining, Time Series

Analysis, Spatial Mining, Web Mining etc. These methods use

different types of algorithms and data. The data source can be

data warehouse, database, flat file or text file. The algorithms

may be Statistical Algorithms, Decision Tree based, Nearest

Neighbor, Neural Network based, Genetic Algorithms based,

Rule based, Support Vector Machine, etc. The selection of a data
mining algorithm
mainly depends on the type of data used for mining and the

expected outcome of the mining process. The domain experts

play a significant role in the selection of algorithm for data

mining.

3.7 Data Mining and Statistics

The disciplines of statistics and data mining both aim to

discover structure in data. So much do their aims overlap,

that some people regard data mining as a subset of statistics.

But that is not a realistic assessment as data mining also

makes use of ideas, tools, and methods from other areas

particularly database technology and machine learning, and is

not heavily concerned with some areas in which statisticians

are interested. Statistical procedures do, however, play a

major role in data mining, particularly in the processes of

developing and assessing models. Most of the learning

algorithms use statistical tests when constructing rules or

trees and also for correcting models that are overfitted.

Statistical tests are also used to validate machine learning

models and to evaluate machine learning algorithms.

Some of the commonly used statistical analysis techniques

are discussed below.

Descriptive and Visualization Techniques include simple

descriptive statistics such as averages and measures of

variation, counts and percentages, and cross-tabs and simple

correlations. They are useful for understanding the structure

of the data.

Visualization is primarily a discovery technique and is useful

for interpreting large amounts of data; visualization tools

include histograms, box plots, scatter diagrams, and multi-

dimensional surface plots.

Cluster Analysis seeks to organize information about

variables so that relatively homogeneous groups, or

"clusters," can be formed. The clusters formed with this family

of methods should be highly internally homogeneous

(members are similar to one another) and highly externally

heterogeneous (members are not like members of other

clusters).

Correlation Analysis measures the relationship between
two variables. The resulting correlation coefficient shows whether
changes in one variable are associated with changes in the other.
When comparing the correlation between two variables, the
goal is to see whether a change in the independent variable is
accompanied by a change in the dependent variable.

This information helps in understanding an independent
variable's predictive abilities.

Correlation findings, just as regression findings, can be useful

in analyzing causal relationships, but they do not by

themselves establish causal patterns.
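As an illustration, the sketch below computes one common correlation coefficient, Pearson's r, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The choice of Pearson's r is an assumption, since the text does not name a specific coefficient, and the variable names and values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: hours studied and exam score for five students.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 64, 70]
r = pearson_r(hours_studied, exam_score)
print(round(r, 3))
```

A value of r close to +1 here indicates only that the two variables move together in this sample; as noted above, it does not by itself establish that studying causes the higher scores.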

Discriminant Analysis is used to predict membership in

two or more mutually exclusive groups from a set of

predictors, when there is no natural ordering on the groups.

Factor Analysis is useful for understanding the underlying

reasons for the correlations among a group of variables. The

main applications of factor analytic techniques are to reduce

the number of variables and to detect structure in the

relationships among variables; that is, to classify variables.

Therefore, factor analysis can be applied as a data reduction

or structure detection method. In an exploratory factor

analysis, the goal is to explore or search for a factor structure.

Confirmatory factor analysis, on the other hand, assumes the

factor structure is known a priori and the objective is to

empirically verify or confirm that the assumed factor structure

is correct.

Regression Analysis is a statistical tool that uses the

relation between two or more quantitative variables so that

one variable (dependent variable) can be predicted from the

other(s) (independent variables). But no matter how strong

the statistical relations are between the variables, no cause-

and-effect pattern is necessarily implied by the regression

model.

Regression analysis comes in many flavors, including simple

linear, multiple linear, curvilinear, and multiple curvilinear

regression models.

3.8 Data Mining Techniques and Algorithms

This section provides an overview of some of the most

common data mining algorithms in use today. The section has

been divided into two broad categories:

Classical Techniques: Statistics, Neighborhoods and

Clustering

Next Generation Techniques: Trees, Networks and Rules

These categories will describe a number of data mining

algorithms at a high level and shall help to understand how

each algorithm fits into the landscape of data mining

techniques. Overall, six broad classes of data mining

algorithms are covered. Although there are a number of other

algorithms and many variations of the techniques described,

3.8.1 Classical Techniques: Statistics, Neighborhoods

and Clustering

3.8.1.1 Statistics

By strict definition, statistics or statistical techniques are not
data mining. They were being used long before the term data
mining was coined to apply to business applications. However,

statistical techniques are driven by the data and are used to

discover patterns and build predictive models. This is why it is

important to have the idea of how statistical techniques work

and how they can be applied.

3.8.1.1.1 Prediction using Statistics

The term prediction is used for a variety of types of analysis

that may elsewhere be more precisely called regression.

Nonetheless regression is a powerful and commonly used tool

in statistics.

3.8.1.1.2 Linear Regression

In statistics prediction is usually synonymous with regression

of some form. There are a variety of different types of

regression in statistics but the basic idea is that a model is

created that maps values from predictors in such a way that

the lowest error occurs in making a prediction. The simplest

form of regression is simple linear regression that just

contains one predictor and a prediction.
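A minimal sketch of simple linear regression follows. It assumes that "lowest error" means the smallest sum of squared errors (ordinary least squares), for which the slope and intercept have a closed-form solution. The function name and the data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: one predictor (xs) and the value to predict (ys).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_simple_linear_regression(xs, ys)

# Use the fitted model to make a prediction for an unseen predictor value.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```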

3.8.1.2 Nearest Neighbor

Clustering and the Nearest Neighbor prediction technique are

among the oldest techniques used in data mining. Most
people have an intuitive notion of clustering: like records are
grouped together. Nearest neighbor is a prediction technique that is
quite similar to clustering. Its essence is that, in order to
predict the prediction value for one record, you look for
records with similar predictor values in the historical database
and use the prediction value from the record that is nearest
to the unclassified record. What counts as "near" depends on the
problem: a better definition of near
might in fact be the other people that you graduated from college
with rather than the people that you live next to. Nearest

Neighbor techniques are easy to use and understand because

they work in a way similar to the way that people think - by

detecting closely matching examples.
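The nearest neighbor idea can be sketched in a few lines. The sketch assumes numeric predictors and Euclidean distance as the definition of "near"; both are illustrative choices, and in practice predictors usually need to be scaled so that no single one dominates the distance. The records and field meanings are hypothetical.

```python
from math import dist


def nearest_neighbor_predict(history, query):
    """Return the prediction value of the historical record whose
    predictor values are closest (Euclidean distance) to the query."""
    _, value = min(history, key=lambda record: dist(record[0], query))
    return value


# Hypothetical historical database: (age, income in thousands) -> responded?
history = [
    ((25, 30), "no"),
    ((40, 80), "yes"),
    ((35, 60), "yes"),
    ((22, 20), "no"),
]

print(nearest_neighbor_predict(history, (38, 70)))
print(nearest_neighbor_predict(history, (24, 28)))
```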

3.8.1.3 Clustering

Clustering can be described as the identification of similar classes of
objects. By using clustering techniques we can further identify
dense and sparse regions in object space and can discover
the overall distribution pattern and correlations among data
attributes. A classification approach can also be an
effective means of distinguishing groups or classes of objects,
but it becomes costly, so clustering can be used as a
preprocessing approach for attribute subset selection and
classification.

3.8.2 Next Generation Techniques: Trees, Networks and

Rules

3.8.2.1 Decision Trees

A decision tree is a predictive model that, as its name implies,

can be viewed as a tree. Specifically each branch of the tree is

a classification question and the leaves of the tree are

partitions of the dataset with their classification.

There are some interesting things about the tree:

It divides up the data on each branch point without losing

any of the data (the number of total records in a given

parent node is equal to the sum of the records contained

in its two children).

The number of churners and non-churners is conserved
as you move up or down the tree.

It is pretty easy to understand how the model is being

built (in contrast to the models from neural networks or

from standard statistics).

It would also be pretty easy to use this model if you

actually had to target those customers that are likely to

churn with a targeted marketing offer.
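The conservation properties listed above can be checked on a single branch point. The sketch below assumes a hypothetical customer table with a "tenure_months" predictor and a "churned" label, and a hypothetical classification question; it shows one split only, not a full tree-building algorithm such as CART or C4.5.

```python
from collections import Counter


def split(records, question):
    """Divide records into those that answer yes and those that answer no."""
    yes = [r for r in records if question(r)]
    no = [r for r in records if not question(r)]
    return yes, no


def class_counts(records):
    """Count churners (True) and non-churners (False) in a set of records."""
    return Counter(r["churned"] for r in records)


# Hypothetical customer records.
customers = [
    {"tenure_months": 3, "churned": True},
    {"tenure_months": 5, "churned": True},
    {"tenure_months": 8, "churned": False},
    {"tenure_months": 14, "churned": False},
    {"tenure_months": 20, "churned": True},
    {"tenure_months": 36, "churned": False},
]

# One branch point: a classification question on a single predictor.
left, right = split(customers, lambda r: r["tenure_months"] < 12)

# No records are lost, and the class counts of the children add up
# to the class counts of the parent node.
print(len(left), len(right), len(customers))
print(class_counts(left), class_counts(right), class_counts(customers))
```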

3.8.2.2 Neural Networks

Neural networks are an approach to computing that involves

developing mathematical structures with the ability to learn.

The methods are the result of academic investigations to

model nervous system learning. Neural networks have the

remarkable ability to derive meaning from complicated or

imprecise data. This can be used to extract patterns and

detect trends that are too complex to be noticed by either

humans or other computer techniques. A trained neural

network can be thought of as an "expert" in the category of

information it has been given to analyze. This expert can then

be used to provide projections given new situations of interest

and answer "what if" questions. Neural networks have already

been successfully applied in many industries. Since neural

networks are best at identifying patterns or trends in data,

they are well suited for prediction or forecasting needs.

3.8.2.3 Rule Induction

Rule induction is one of the major forms of data mining and is

the most common form of knowledge discovery in

unsupervised learning systems. Rule induction on a data base

can be a massive undertaking where all possible patterns are

systematically pulled out of the data and then an accuracy

and significance are added to them that tell the user how

strong the pattern is and how likely it is to occur again.

2.9 Educational Data Mining

2.9.1 EDM Basics

Applying data mining (DM) in education is an emerging

interdisciplinary research field also known as Educational Data

Mining (EDM). It is concerned with developing methods for

exploring the unique types of data that come from

educational environments. Its goal is to better understand

how students learn and identify the settings in which they

learn to improve educational outcomes and to gain insights

into and explain educational phenomena. Educational

information systems can store a huge amount of potential

data from multiple sources coming in different formats and at

different granularity levels. Each particular educational

problem has a specific objective with special characteristics

that require a different treatment of the mining problem. These
issues mean that traditional DM techniques cannot be applied

directly to these types of data and problems. As a

consequence, the knowledge discovery process has to be

adapted and some specific DM techniques are needed [7].

Educational Data Mining is a new and growing research area that
can be defined as the application of data mining techniques
to raw data from educational systems in order to respond to
educational questions and problems, and also to discover
the information hidden behind this data. Over the last few years,
the popularity of this field has produced a large number of
research studies, so many that it is difficult to survey them and
to identify the contribution of data mining techniques to educational
systems. Exploiting and understanding the raw data
collected from educational systems can help the designers and
the users of these systems improve their performance and
extract useful information about the behavior of students in
the learning process.

Different definitions have been provided for the term
Educational Data Mining. Educational Data Mining can be defined
as an emerging discipline, with a suite of computational and
psychological methods and research approaches for
understanding how students learn, and the settings which
they learn in [7].

Educational data mining (EDM) is also described as an emerging
discipline which focuses on applying data mining tools and
techniques to educationally related data.

The first definition does not mention data mining; it is open to
exploring and developing other analytical methods that can
be applied to educationally related data. However, in [8] the
authors specify that EDM is both a learning science and
a rich application area for data mining, due to the growing
availability of educational data. It enables data-driven
decision making for improving current educational
practice and learning material.

In the same way, Romero and Ventura [9, 10] define EDM as
the application of data mining (DM) techniques to the specific
types of datasets that come from educational environments in order to
address important educational questions.

Although different in some details, these definitions share an

emphasis on discovering knowledge based on educational

data to improve educational systems.

Educational data mining can be applied to assess students'
learning performance, to improve the learning process and
guide students' learning, to provide feedback and adapt
learning recommendations based on students' learning
behaviors, to evaluate learning materials and courseware, to
detect abnormal learning behaviors and problems, and to
achieve a deeper understanding of educational phenomena
[2], as shown in Figure 2.3.

Fig 2.3: The Cycle of Applying Data Mining in Educational Systems

Source: Romero and Ventura, 2007, p. 136

[7] Romero C. and Ventura S., "Educational data mining: A survey from 1995 to 2005", Expert Systems with Applications 33 (2007), pp. 135-146.

unique types of data in educational settings and, using these

methods, to better understand students and the settings in

which they learn. Whether educational data is taken from

students' use of interactive learning environments, computer-

supported collaborative learning, or administrative data from

schools and universities, it often has multiple levels of

meaningful hierarchy, which often need to be determined by

properties in the data itself, rather than in advance. Issues of

time, sequence, and context also play important roles in the

study of educational data. The main objective of educational
institutes is to provide quality education to their students and to
improve the quality of managerial decisions. One way to
achieve the highest level of quality in a higher education system is
to discover knowledge from educational data in order to study the
main attributes that may affect students' performance.
The discovered knowledge can be used to offer helpful and
constructive recommendations to academic planners in
higher education institutes: to enhance their decision-making
process, to improve students' academic performance and
reduce the failure rate, to better understand students'
behavior, to assist instructors, to improve teaching, and many
other benefits.

2.9.3 Process of Applying the Educational Data Mining:

This process starts with collecting or choosing the data to

study from the educational environment. The obtained raw

data require cleaning and preprocessing (heterogeneous data

fusion, treatment of missing and incorrect values, converting

the data to an appropriate form, feature selection, etc.). This

phase often requires the use of some data mining techniques.

Once the data are preprocessed, the appropriate EDM

method/technique is applied.

Finally, the last step is the interpretation and the assessment

of the obtained results. To apply this process, which is often
difficult given the heterogeneity of the data in the educational
context, several tools are used, as shown in Figure 2.4:

Figure 2.4: The Process of Applying Data Mining in the Educational Data Mining Field

In the last several years, EDM has been applied to address a
wide range of goals that are all part of the general
objective of improving learning.

Romero and Ventura [10] proposed to classify EDM objectives

depending on the viewpoint of the final user (learner,

educator, administrator, and researcher) and the problem to

resolve:

Learners. To support a learner's reflections on the
situation, to provide adaptive feedback or
recommendations to learners, to respond to students'
needs, to improve learning performance, etc.

Educators. To understand their students' learning
processes and reflect on their own teaching methods, to
improve teaching performance, to understand social,
cognitive and behavioral aspects, etc.

Researchers. To develop and compare data mining

techniques to be able to recommend the most useful one

for each specific educational task or problem, to evaluate

learning effectiveness when using different settings and

methods, etc.

Administrators. To evaluate the best way to organize

institutional resources (human and material) and their

educational offer.

Student modeling. User modeling in the educational
domain incorporates such detailed information as
students' characteristics or states such as knowledge,
skills, motivation, satisfaction, meta-cognition, attitudes,
experiences and learning progress, or certain types of
problems that negatively impact their learning outcomes
(making too many errors, misusing or under-using help,
gaming the system, inefficiently exploring learning
resources, etc.), affect, learning styles, and preferences.
The common objective here is to create or improve a
student model from usage information.

Predicting students' performance and learning
outcomes. The objective is to predict a student's final
grades or other types of learning outcomes (such as
retention in a degree program or future ability to learn),
based on data from course activities.

Generating recommendations. The objective is to
recommend to students which content is the most
appropriate for them at the current time.

Analyzing learners' behavior. This takes on several
forms: applying educational data mining to answer
questions in any of the three areas previously discussed
(student modeling, prediction, generating
recommendations). It is also used to group students
according to their profile, and for adaptation and
personalization purposes.

Communicating to stakeholders. The objective is to
help course administrators and educators in analyzing
students' activities and usage information in courses.

Domain structure analysis. The objective is to
determine domain structure and improve domain
models that characterize the content to be learned and
optimal instructional sequences, using the ability to
predict the students' performance as a quality measure of
a domain structure model. Performance on tests or within
a learning environment is utilized for this goal.

Maintaining and improving courses. It is related to

the two previous goals. The objective here is to determine

how to improve courses (contents, activities, links, etc.),

using information (in particular) about student usage and

learning.

Studying the effects of different kinds of pedagogical

support that can be provided by learning software.

Advancing scientific knowledge about learning and

learners through building, discovering or improving

models of the student, the domain, and the pedagogical

support.

2.9.5 Predicting Student Performance

In educational systems, data mining techniques are used to
predict student performance scores, and these values can be
numerical or categorical. Regression analysis finds the
relationship between a numerical dependent variable and one
or more numerical independent variables. Classification
techniques are used to assign individual items to classes.
Different data mining techniques, such as neural networks,
rule-based systems, regression analysis, and correlation
analysis, are applied to educational data.
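As a minimal sketch of the regression and classification ideas above, the following Python fragment fits a least-squares line that predicts a final exam score from an internal assessment mark and then maps the numerical prediction to a categorical label. The data, the function names `fit_line` and `to_grade`, and the pass mark of 50 are illustrative assumptions, not values taken from any of the studies reviewed here.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def to_grade(score, pass_mark=50):
    """Turn a numerical prediction into a categorical one (assumed pass mark)."""
    return "pass" if score >= pass_mark else "fail"

# Hypothetical data: internal assessment marks and final exam scores
internal = [40, 50, 60, 70, 80]
final = [45, 55, 65, 75, 85]

slope, intercept = fit_line(internal, final)
predicted = slope * 65 + intercept  # numerical prediction for a new student
label = to_grade(predicted)         # categorical prediction for the same student
```

The same fitted line yields both kinds of output mentioned in the text: a numerical score and, after thresholding, a categorical value.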

Grouping Students:

Clustering and classification techniques are used to build
groups of students based on their characteristics,
performance, etc. Common clustering algorithms include
k-means, hierarchical, and model-based clustering.
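To make the k-means idea concrete, here is a small sketch in plain Python that groups students by two numerical features. The student records (attendance percentage, average mark), the choice of two clusters, and the starting centroids are assumptions made only for illustration; real studies would normally use a library implementation and several random restarts.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical students: (attendance %, average mark)
students = [(90, 85), (85, 80), (95, 90), (50, 40), (45, 35), (55, 45)]

# Starting centroids are picked by hand here so the run is deterministic
centers, groups = kmeans(students, [students[0], students[3]])
```

With this data the two groups separate cleanly into high-attendance, high-mark students and low-attendance, low-mark students.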

Enrollment Management:

Enrollment management is required in an educational system
to shape the enrollment strategy of an institution and meet
its goals. It is a set of activities such as marketing, retention
programs, and the admission process.

2.9.5.1 Important Factors in Predicting Students' Performance

There are two main factors in predicting students'
performance: the attributes used and the prediction methods.
This section first covers the important attributes used in
predicting student performance and then the prediction
methods.

The attributes most frequently used are cumulative grade
point average (CGPA) and internal assessment. Most data
mining studies have used CGPA as their main attribute for
predicting students' performance. Researchers favor CGPA
mainly because it has a tangible value for future educational
and career mobility. It can also be considered an indication
of realized academic potential.

The next most frequently used attributes are students'
demographics and external assessments. Student
demographics include gender, age, family background, and
disability. Researchers often use demographic attributes such
as gender because female and male students have different
styles in their learning process.

Three other attributes often used in predicting students'
performance are extra-curricular activities, high school
background, and social interaction network. Each of these
attributes was used in five out of thirty studies. Several
researchers in another study used psychometric factors to
predict students' performance. Psychometric factors are
identified as student interest, study behavior, engage time,
and family support. They used these attributes to make the
system clear, simple, and user friendly.

2.9.5.2 The Prediction Methods Used for Student Performance

In educational data mining, predictive modeling is usually
used to predict student performance. Several tasks are used
to build a predictive model: classification, regression, and
categorization. The most popular task for predicting students'
performance is classification, and several classification
algorithms have been applied to it.

Data mining in higher education is a recent research field,
and it is gaining popularity because of its potential benefits
for educational institutes. Various previous studies have been
conducted to predict students' academic performance using
data mining techniques.

Following is a brief description of some of the most relevant
studies found in the related literature.

Delavari and Beikzadeh [1] describe how data mining methods
can be used in higher learning institutions and how data
mining can be applied to educational data. Their study aimed
to show how data mining can be used in higher education to
improve students' performance.

R. R. Kabra and R. S. Bichkar [2] show that a model built
from students' past academic performance with a decision
tree algorithm can predict students' performance in the
first-year engineering exam. Their study aimed to discover
knowledge that describes students' performance in the
end-semester examination. The student data they used
consisted of the students' previous records, including
attendance, class test, seminar, and assignment marks.

Oladipupo and Oyelade [3] present a study in which students'
failure patterns are identified using association rule mining.
Their study aims to reduce the failure rate and improve
academic performance.
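Association rule mining of this kind rests on two standard measures, support and confidence. The sketch below computes both for a rule of the form "students who fail course A also fail course B". The course codes and failure records are hypothetical and are not taken from the cited study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical records: the set of courses each student failed
failed = [
    {"MATH101", "PHY101"},
    {"MATH101", "PHY101", "CSC101"},
    {"MATH101"},
    {"PHY101"},
    {"CSC101"},
]

# Rule under test: failing MATH101 -> failing PHY101
rule_support = support(failed, {"MATH101", "PHY101"})
rule_confidence = confidence(failed, {"MATH101"}, {"PHY101"})
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst; those thresholds are a design decision and are left out of this sketch.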

Zlatko J. Kovacic [4] presents a case study on predicting
student success using educational data mining. To classify
successful and unsuccessful students, the CHAID and CART
algorithms were applied to enrollment data from the Open
Polytechnic (New Zealand).

Nguyen et al. [5] predict the performance of undergraduate
and postgraduate students at two different institutes using
decision tree and Bayesian network algorithms. The study
focused on comparing students' performance across institutes
and helping failing students. In this study, the decision tree
gave better accuracy than the Bayesian network.

Hijazi and Naqvi [6] study the factors affecting students'
performance. In this study, a sample of 300 students was
taken from colleges of Punjab University (Pakistan), and the
authors found a high correlation between some factors
(mother's education and the student's family income) and
student performance. Their study aimed to discover
knowledge that describes the high correlation between family
education, the student's family income, and student
performance.

T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum and
V. Prasanna Venkatesan [7] conduct a case study on
students' qualitative data using decision tree algorithms to
identify the effect of qualitative data on student
performance.

Mohammed M. Abu Tair and Alaa M. El-Halees [8] conduct a
case study on graduate students' data, using data mining
techniques to improve performance and extract useful
knowledge from the data. Their study aimed to show how
data mining can be used to improve students' performance.

Sunita B. Aher and L. M. R. J. Lobo [9] conduct a comparative
study to predict course selection using association rule
algorithms. This method allows management to prepare the
necessary resources for newly enrolled students and
indicates at an early stage which types of courses are likely
to be selected.

Umesh Kumar Pandey and S. Pal [10] conducted a study in
which a sample of 600 students was taken from colleges of
Dr. M.L. Awadh University, Faizabad, to predict whether a
new student will be a performer or not.

Bhise R. B., Thorat S. S. and Supekar A. K. [11] applied a
data mining process to a students' database using the
k-means clustering algorithm.

K. Shanmuga Priya and A. V. Senthil Kumar [12] use a
classification method that helps to improve performance
and extract knowledge from students' final-semester marks.

Varun Kumar and Anupama Chadha [13] used the association
rule technique to improve the performance of postgraduate
students. Using association rule mining, they examine
factors such as students' interest, teaching methodologies,
and curriculum design that can affect postgraduate students'
performance.

Abeer Badr El Din Ahmed and Ibrahim Sayed Elaraby
[14] use the decision tree technique for data classification
to predict students' final grades. The study focused on the
decision tree as a classification algorithm for predicting
students' final grades.

In 1986, J. R. Quinlan summarized an approach and described
ID3; this was the first research work on the ID3 algorithm [15].

Anand Bahety implemented the ID3 algorithm on the Play
Tennis database to classify whether the weather is suitable
for playing tennis. The results showed that ID3 does not work
well with continuous attributes but gives good results for
missing values [16].

Mary Slocum presents an implementation of ID3 in the
medical area. She transforms data into information to
support decision making and performs the tasks of collecting
and cleaning the data. Entropy and information gain concepts
are used in this study [17].

Kumar Ashok et al. applied ID3 classification to the 2011
Census of India data to help improve or implement policy for
the right people. The concept of information theory is used
here: an attribute is selected as the root of the decision tree
on the basis of a calculation, and the steps of this process
are then repeated [18].
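The calculation that ID3 uses to pick the root is information gain, which is based on entropy. The sketch below shows that selection step only, on a tiny hypothetical table of student records; the attribute names, values, and helper names such as `choose_root` are assumptions for illustration and do not come from the census study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def choose_root(rows, attributes, target):
    """ID3 root selection: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical student records
rows = [
    {"attendance": "high", "assignment": "done",   "result": "pass"},
    {"attendance": "high", "assignment": "missed", "result": "pass"},
    {"attendance": "low",  "assignment": "done",   "result": "fail"},
    {"attendance": "low",  "assignment": "missed", "result": "fail"},
]

root = choose_root(rows, ["attendance", "assignment"], "result")
```

In a full ID3 implementation the same selection would be repeated on each subset of rows produced by the split, until the subsets are pure or no attributes remain.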

Sonika Tiwari used the ID3 algorithm to detect network
anomalies with a horizontal-partitioning-based decision tree
and applied different clustering algorithms. She detects
network anomalies with the decision tree and then presents
a comparative analysis of the clustering algorithms and the
existing ID3 decision tree [19].

Yadav, Bharadwaj and Pal [20] used student data such as
attendance, seminar, assignment marks, and class test to
predict end-semester performance using three algorithms
(ID3, C4.5, and CART); the results show that CART gives
better results for classification of the data.

Pandey and Pal [21] use association rule analysis to find
students' interest in choosing a class language. In this paper
they use seven different interestingness measures. Their
results show that students are interested in a mixed-mode
class language.

Bharadwaj and Pal [22] use the decision tree classification
technique to evaluate students' end-semester performance.
This study helps to identify dropouts and students who
require special attention and teacher advising.

Al-Radaideh et al. [23] present a classification-based
model for student performance prediction using the ID3,
C4.5, and Naïve Bayes algorithms; the decision tree had
better results.

K. Shanmuga Priya and A. V. Senthil Kumar [24] use an
approach that extracts knowledge from students'
end-semester marks.

Alaa el-Halees [25] aimed to show how data mining can be
used in the educational field to enhance our understanding of
the learning process, focusing on extracting and identifying
the variables of students' learning process.

Han and Kamber [26] describe data mining software that
allows users to analyze data from different views and to
summarize the relationships identified during the mining
process.

Varsha, Anuj, Divakar and R. C. Jain [27] applied four
classification methods to student academic data: decision
tree (ID3), multilayer perceptron, decision table, and Naïve
Bayes.

Shaeela Ayesha, Tasleem Mustafa, Ahsan Raza Sattar,
and M. Inayat Khan [28] applied k-means clustering to
analyze students' learning behavior, which helps tutors
improve students' performance and reduce the dropout ratio
to a significant level.

[1] C. Romero and S. Ventura, "Educational data mining: A
survey from 1995 to 2005," Expert Systems with Applications
33 (2007) 135-146.

[2] Delavari N., Beikzadeh M. R., "Data Mining Application in
Higher Learning Institutions," Informatics in Education, 2008,
Vol. 7, No. 1, 31-54.

[3] R. R. Kabra and R. S. Bichkar, "Performance Prediction of
Engineering Students using Decision Trees," International
Journal of Computer Applications (0975-8887), Volume 36,
No. 11, December 2011.

[4] Oladipupo O. O. and Oyelade O. J., "Knowledge Discovery
from Students Result Repository: Association Rule Mining
Approach," International Journal of Computer Science &
Security (IJCSS), Volume (4): Issue (2).

[5] Zlatko J. Kovacic, "Early Prediction of Student Success:
Mining Students Enrollment Data," Proceedings of Informing
Science & IT Education Conference (InSITE) 2010.

[6] Nguyen Thai Nghe, Paul Janecek, and Peter Haddawy, "A
Comparative Analysis of Techniques for Predicting Academic
Performance," in Proceedings of the 37th ASEE/IEEE Frontiers
in Education Conference, pp. 7-12, 2007.

[7] Syed Tahir Hijazi and S. M. M. Raza Naqvi, "Factors
affecting students performance: A Case of Private Colleges,"
Bangladesh e-Journal of Sociology, Volume 3, Number 1,
January 2006.

[8] T. Miranda Lakshmi, A. Martin, R. Mumtaj Begum and
V. Prasanna Venkatesan, "An Analysis on Performance of
Decision Tree Algorithms using Students Qualitative Data,"
I.J. Modern Education and Computer Science, 2013, 5, 18-27.

[9] Mohammed M. Abu Tair and Alaa M. El-Halees, "Mining
Educational Data to Improve Students Performance: A Case
Study," International Journal of Information and Communication
Technology Research, Volume 2, No. 2, February 2012.

[10] Sunita B. Aher and L. M. R. J. Lobo, "A Comparative study
of classification algorithms," International Journal of
Information Technology and Knowledge Management,
July-December 2012, Volume 5, No. 2, pp. 239-243.

[11] Umesh Kumar Pandey and S. Pal, "Data Mining: A
prediction of performer or underperformer using
classification," (IJCSIT) International Journal of Computer
Science and Information Technologies, Vol. 2 (2), 2011,
686-690.

[12] Bhise R. B., Thorat S. S., Supekar A. K., "Importance of
Data Mining in Higher Education System," IOSR Journal of
Humanities and Social Science (IOSR-JHSS), ISSN: 2279-0837,
ISBN: 2279-0845, Volume 6, Issue 6 (Jan.-Feb. 2013),
pp. 18-21.

[13] K. Shanmuga Priya and A. V. Senthil Kumar, "Improving the
Students Performance Using Educational Data Mining," Int. J.
Advanced Networking and Applications, Volume 04, Issue 04,
pp. 1680-1685 (2013), ISSN: 0975-0290.

[14] Varun Kumar and Anupama Chadha, "Mining Association
Rules in Students Assessment Data," IJCSI International
Journal of Computer Science Issues, Vol. 9, Issue 5, No. 3,
September 2012.

[15] Abeer Badr El Din Ahmed and Ibrahim Sayed Elaraby,
"Data Mining: A prediction for Student's Performance Using
Classification Method," World Journal of Computer Application
and Technology 2(2): 43-47, 2014.

[16] J. R. Quinlan, "Induction of Decision Trees," Machine
Learning, 1986.

[17] Anand Bahety, "Extension and Evaluation of ID3
Decision Tree Algorithm," University of Maryland, College Park.

[18] Mary Slocum, "Decision making using ID3," Rivier
Academic Journal, Volume 8, Number 2, Fall 2012.

[19] Kumar Ashok, Taneja H. C., Chitkara Ashok K. and Kumar
Vikas, "Classification of Census Using Information Theoretic
Measure Based ID3 Algorithm," Int. Journal of Math. Analysis,
Vol. 6, 2012, no. 51, 2511-2518.

[20] Sonika Tiwari and Roopali Soni, "Horizontal
partitioning ID3 algorithm - A new approach of detecting
network anomalies using decision tree," International Journal
of Engineering Research & Technology (IJERT), Vol. 1, Issue 7,
September 2012.

[21] S. K. Yadav, B. K. Bharadwaj and S. Pal, "Data Mining
Applications: A comparative study for Predicting Students
Performance," International Journal of Innovative Technology
and Creative Engineering (IJITCE), Vol. 1, No. 12, pp. 13-19,
2011.

[22] U. K. Pandey and S. Pal, "A Data Mining view on Class
Room Teaching Language," IJCSI, Vol. 8, Issue 2, March 2011,
pp. 277-282, ISSN 1694-0814.


[23] B. K. Bharadwaj and S. Pal, "Mining Educational Data to
Analyze Students Performance," International Journal of
Advance Computer Science and Applications (IJACSA), Vol. 2,
No. 6, pp. 63-69, 2011.

[24] Q. A. Al-Radaideh, E. W. Al-Shawakfa, and M. I. Al-Najjar,
"Mining student data using decision trees," International
Arab Conference on Information Technology (ACIT2006),
Yarmouk University, Jordan, 2006.

[25] K. Shanmuga Priya and A. V. Senthil Kumar, "Improving the
Students Performance Using Educational Data Mining," Int. J.
Advanced Networking and Applications, Volume 04, Issue 04,
pp. 1680-1685 (2013), ISSN: 0975-0290.

[26] Alaa el-Halees, "Mining students data to analyze
e-Learning behavior: A Case Study," 2009.

[27] J. Han and M. Kamber, "Data Mining: Concepts and
Techniques," Morgan Kaufmann, 2000.

[28] Varsha, Anuj, Divakar, R. C. Jain, "Result analysis using
classification techniques," International Journal of Computer
Applications (0975-8887), Volume 1, No. 22, 2010.

[29] S. Ayesha, T. Mustafa, A. R. Sattar, and M. I. Khan, "Data
Mining Model for Higher Education System," European Journal
of Scientific Research, Vol. 43, No. 1, pp. 24-29, 2010.

1. Jiawei Han, Micheline Kamber, Data Mining: Concepts and

Techniques, London: Academic Press, 5, 2001.

[2]Introduction to Data Mining and Knowledge Discovery,

Third Edition ISBN: 1-892095-02-5, Two Crows Corporation,

10500 Falls Road, Potomac, MD 20854 (U.S.A.), 1999.

[3]Larose, D. T., Discovering Knowledge in Data: An

Introduction to Data Mining, ISBN 0-471-66657-2, John Wiley

& Sons, Inc, 2005.

International Journal of Distributed and Parallel systems

(IJDPS) Vol.1, No.1, September 2010

[4]Dunham, M. H., Sridhar S., Data Mining: Introductory and

Advanced Topics, Pearson Education, New Delhi, ISBN: 81-

7758-785-4, 1st Edition, 2006.

[4].Chapman, P., Clinton, J., Kerber, R., Khabaza, T.,Reinartz, T.,

Shearer, C. and Wirth, R.. CRISP-DMm1.0 : Step-by-step data

mining guide, NCR Systems Engineering Copenhagen (USA

and Denmark),

DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA

Verzekeringenen Bank Group B.V (The

Netherlands), 2000.

to Data Mining, Pearson Education, New Delhi, ISBN: 978-81-

317-1472-0, 3rd Edition, 2009.

6. Parack, S., Zahid, Z., Merchant, F.: Application of data

mining in educational databases for predicting academic

trends and patterns. In: Proceedings of 2012 IEEE

International Conference on Technology Enhanced Education,

pp. 14. IEEE Press, Piscataway (2012)

research. Res. High. Educ. J. 19, 113 (2013)

section on educational data mining. ACM SIGKDD Explor.

13(2), 36 (2011)

of the state of the art. IEEE Trans. Syst. Man Cybern. Part C

Appl. Rev. 40(6), 601618 (2010)

Interdisc. Rev.: Data Min. Knowl. Discovery 3(1), 1227 (2013)

9. CONCLUSION:

Most previous studies on data mining applications in various
fields use a variety of data types, ranging from text to
images, stored in a variety of databases and data structures.
Different data mining methods are used to extract patterns,
and thus knowledge, from these databases. Selecting the
data and methods for data mining is an important task in this
process and requires knowledge of the domain. Several
attempts have been made to design and develop a generic
data mining system, but no system has been found to be
completely generic. Thus, for every domain, the assistance of
a domain expert is mandatory. The domain experts shall be
guided by the system to apply their knowledge effectively
when using data mining systems to generate the required
knowledge. Domain experts are required to determine the
variety of data that should be collected in the specific problem
domain, select specific data for data mining, clean and
transform the data, extract patterns for knowledge
generation, and finally interpret the patterns and generate
knowledge.
