
Final Year Project 2

Development of Data Mining Algorithms for Analysing Shopping Centre Dataset
by
Name: Khoo See Jun
ID: SN089817

Project Supervisor: Alicia Tang Yee Chong, Dr.


DECLARATION

I hereby declare that this report, submitted to University Tenaga Nasional as a
partial fulfilment of the requirements for the Bachelor of Computer Science (System
and Networking) has not been submitted as an exercise for a degree at any other
university. I also certify that the work described here is entirely my own except for
excerpts and summaries whose sources are appropriately cited in the references.
This report may be made available within the university library and may be
photocopied or loaned to other libraries for the purposes of consultation.

23 February 2015
KHOO SEE JUN
SN089817


APPROVAL SHEET

This thesis entitled:
Development of Data Mining algorithms for analysing public domain databases
Submitted by:
KHOO SEE JUN (SN089817)
In requirement for the degree of Bachelor of Computer Science (System and
Networking), College of Information Technology, University Tenaga Nasional has
been accepted.

Supervisor: Alicia Tang Yee Chong, Dr.

Signature: .

Date:


ABSTRACT

The objective of this project is to develop data mining algorithms for analysing
public domain databases. The public domain chosen for this project is the shopping
centre. My task is to identify and perform an association rule mining task, which
involves selecting an appropriate data set, preparing and preprocessing the data,
finding rules (including appropriate parameter settings), determining which of the
resulting rules are interesting, and working out how the interesting rules could be
useful.
This work analyses well-known data mining (DM) techniques in the Weka workbench and
reports the simulation results obtained on sample data by applying four selected DM
techniques and classifiers in the open source workbench to Customer Relationship
Management (CRM) in a shopping centre.
The design of the data mining process is presented in Chapter 4, which shows the
data mining workflow.
The next part is the implementation. The implementation began with the data mining
process (methodology) and then proceeded to building a model set for prediction. The
modified code was then compiled using Apache Ant so that it could be used by WEKA.
Lastly, the best rules were generated by importing the dataset into the WEKA
software, and the run time of the original code and the modified code was compared in
different WEKA builds. The results show that the run time improved with the modified
code.

TABLE OF CONTENTS

DECLARATION
APPROVAL SHEET
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: INTRODUCTION
1.0 INTRODUCTION
1.1 Project Introduction
1.2 Problem Statement
1.3 Objectives
1.4 Benefits
1.5 Project Scope
1.6 Expected Outcome
1.7 Gantt Chart

CHAPTER 2: RESEARCH AND LITERATURE REVIEW
2.0 LITERATURE REVIEW
2.1 Data Mining Techniques
2.1.1 Association
2.1.2 Classification
2.1.3 Prediction
2.1.4 Sequential Patterns (Long-term data)
2.1.5 Clustering
2.1.6 Decision Trees (J48)
2.1.7 Decision Table
2.2 Customer Relationship Management (CRM)
2.2.1 What is Customer Relationship Management (CRM)?
2.2.2 How CRM is Used Today
2.2.3 The CRM Strategy
2.2.4 The Impact of Technology on CRM
2.2.5 The Benefits of CRM
2.2.6 Data Mining and Customer Relationship Management
2.2.7 Review of Data Mining Tools in CRM
2.2.8 Data Mining Tools Applications in CRM
2.3 Data Mining Applications
2.3.1 Banking/Finance (Financial Data Analysis)
2.3.2 Retail/Marketing Industry
2.3.3 Telecommunication Industry
2.3.4 Biological Data Analysis
2.3.5 Medical/Pharma
2.3.6 Insurance and Health Care
2.3.7 Other Scientific Applications
2.3.8 Intrusion Detection
2.4 Data Mining Systems
2.4.1 Data Mining System Classification
2.4.2 Data Mining System Products
2.4.3 Choosing Data Mining System
2.4.4 Trends in Data Mining
2.5 Data Mining Process Model
2.5.1 Overview of Data Mining Life Cycle

CHAPTER 3: ANALYSIS
3.0 ANALYSIS
3.1 Data Mining for Shopping Centers
3.1.1 Free Sample Data for Testing Purpose
3.1.2 Related Work
3.1.3 Methods
3.1.4 Result and Discussion
3.1.5 Comparison between Naïve Bayes (NB), Decision Table (DT) and Decision Tree (J48)
3.1.6 Comparison between classifiers with time taken to build a model
3.2 Association Rules Apriori Algorithm
3.2.1 Apriori Algorithm
3.2.2 Limitations of Apriori Algorithm

CHAPTER 4: DESIGN
4.0 DESIGN
4.1 Data Mining Process
4.1.1 Step One: Translate the business into a data mining problem
4.1.2 Step Two: Select appropriate data
4.1.3 Step Three: Analyze the data
4.1.4 Step Four: Create a Model Set for Prediction
4.1.5 Step Five: Fix Problem with the Data
4.1.6 Step Six: Transform Data to Bring Information to the Surface
4.1.7 Step Seven: Build Models
4.1.8 Step Eight: Deploy Models

CHAPTER 5: IMPLEMENTATION
5.0 IMPLEMENTATION
5.1 Data Mining Process
5.1.1 Translate the business into a data mining problem
5.1.2 Select appropriate data
5.1.3 Analyze the data
5.1.4 Create a Model Set for Prediction
5.1.5 Fix Problem with the Data
5.1.6 Transform Data to Bring Information to the Surface
5.1.7 Build Models
5.1.8 Deploy Models
5.2 Apriori Algorithm Source Code
5.3 Import dataset into WEKA

CHAPTER 6: CONCLUSION
6.0 CONCLUSION
6.1 Progress and Outcome
6.2 Problems Encountered
6.3 Future Planning

REFERENCES

APPENDIX

LIST OF TABLES
TABLE 3.1
TABLE 3.2

LIST OF FIGURES
Figure 2.1 Clustering (Sample Diagram)
Figure 2.2 Decision Tree (J48)
Figure 2.3 Data Mining Applications Useful For Companies
Figure 2.4 Data Mining System Classification
Figure 2.5 Data Mining Process Model
Figure 3.1 Sample Data (CSV format)
Figure 3.2 Sample Data (Notepad format)
Figure 3.3 Block Diagram
Figure 3.4 Results returned by the Naïve Bayes classifier
Figure 3.5 The decision table of data analysis
Figure 3.6 J48 pruned tree of sex analysis
Figure 3.7 Associate Rules
Figure 3.8 Sample Data (CSV format)
Figure 3.9 Sample Data (Notepad format)
Figure 3.10 Associate Rules
Figure 4.1 Data Mining is not a linear process
Figure 4.2 Sample data in ARFF Viewer
Figure 4.3 Data Visualize
Figure 4.4 Visualization of data by age and sex
Figure 4.5 Data from the past mimics data from the past, present, and future
Figure 4.6 Sample data
Figure 4.7 Data Mining Model
Figure 4.8 Data Mining Process Model
Figure 4.9 Data Mining Scoring Process Model
Figure 4.10 Scoring Prediction
Figure 5.1 Appropriate data
Figure 5.2 Analyse data in ARFF Viewer
Figure 5.3 Data Visualize
Figure 5.4 Visualization of data by smoker and drink_level
Figure 5.5 Prediction Model
Figure 5.6 Fixed dataset
Figure 5.7 Transformed data
Figure 5.8 Data Mining Model
Figure 5.9 Model 1
Figure 5.10 Model 2
Figure 5.11 Model 3
Figure 5.12 Original Code Part 1
Figure 5.13 Modified Code Part 1
Figure 5.14 Original Code Part 2
Figure 5.15 Modified Code Part 2
Figure 5.16 Result of original code
Figure 5.17 Result of modified code

CHAPTER 1
1.0 INTRODUCTION

1.1 Project Introduction


Far too many companies sit on loads of good customer data and do nothing with it.
They do not realise that this data is a gold mine of insight that can increase
customer loyalty, unlock hidden profitability and reduce client churn. Data mining
(knowledge discovery) is the process used by companies to turn raw data
into useful information. Computer-assisted software digs through and analyses
enormous sets of data to extract hidden
predictive information from large databases. It is a powerful new technology with
great potential to help companies focus on the most important information in their
data warehouses. Businesses can learn more about their customers and develop
more effective marketing strategies, as well as increase sales and decrease costs.
Grocery stores are well-known users of data mining techniques. Many supermarkets
offer free loyalty cards to customers that give them access to reduced prices not
available to non-members. The cards make it easy for stores to track who is buying
what, when they are buying it, and at what price. The stores can then use this data,
after analyzing it, for multiple purposes, such as offering customers coupons that are
targeted to their buying habits and deciding when to put items on sale and when to sell
them at full price. Data mining tools predict behaviours and future trends, allowing
businesses to make proactive, knowledge-driven decisions. Data mining tools can
answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may
miss because it lies outside their expectations.

1.2 Problem Statement


Most companies waste large amounts of useful customer data by doing nothing
with it. They do not know exactly what their customers need or what they are missing. By
engaging in data mining, a company can gain greater insight into external conditions, internal
processes, its market and its customers. It also gains predictive capabilities
that can be used both in strategic planning and in daily interactions. These insights and
predictive capabilities can take a company's business results to the next level by
improving its marketing campaign management, up-sell and cross-sell
activities, customer retention, risk analysis and fraud detection efforts. This project
aims to help a company create powerful strategies, make fast and feasible decisions and
achieve competitive advantage in future.

1.3 Objectives
The objectives of this project are:
1. To identify parameters of the algorithm
2. To design new data mining algorithms
3. To develop new data mining algorithms

1.4 Benefits
The benefits of data mining in businesses are:
1. More Money. Money is always a good thing in business. When data is mined
that unearths the kinds of projects past donors contributed to, types of products
customers have purchased in the past, or a not-for-profit can put a number on
statistics for a grant proposal, it can result in serious cash. Once a business
knows who the top donors are or what their customers want, they can
customize approaches and outreach.
2. Improve Branding and Marketing. Data can reveal a number of things like
what direction the marketing department should take. For example, there might
have been a recent customer survey asking about what services or products
consumers want to see. That kind of information is gold, and a marketing
department can do wonders with it. If a survey or any feedback is being
collected, put it to use.

3. Streamline Outreach. Whether a business depends on e-mail blasts, print ads
or social media, knowing how customers want to be approached is important.
Data that includes relevant e-mail addresses, mailing addresses or social media
pages can help streamline any mailers or outreach. It also saves money,
whether it's in postage or time, by keeping consumer information updated.
4. Tap into New Markets. There are some databases available that businesses
can purchase, or the databases might be available to the public free of charge.
Business owners can use the databases of others to find out more information
about potential consumers and identify any holes in the current tactics.
However, when handling outside databases, it's especially important to
practice caution. Privacy is a big legal issue, and sometimes it's easy to overstep
boundaries.
5. Share and Share Alike. Sharing information is largely illegal, but it all
depends on what the customer has signed. For example, some coalitions may
share information on consumers in order to provide better services. This can be
dangerous grounds, but if it's legally acceptable, some business owners can
access the data of other partner organizations, too. This largely expands the
availability of information and can provide more data--and likely in turn more
accurate data--to improve the bottom line, services and research.
6. Learn from the Past. Data mining past information and comparing it to the
current situation can reveal a lot. Graphs can easily show any troubling sales
years, spikes or other trends that should be taken into consideration. Seeing the
ebb and flow of a business via data can provide insight that otherwise might be
overlooked. For example, a business that knows there's a history of high sales
in July can work on maximizing that month, while giving extra attention to
periods where sales slack.

1.5 Project Scope


Given databases of sufficient size and quality, data mining technology can generate
new business opportunities by providing these capabilities:
1. Selecting an appropriate data set
2. Preparing and pre-processing the data
3. Finding rules and identifying parameters for the algorithm

1.6 Expected Outcome
The expected outcomes of the project are:
1. A critical review of data mining techniques
2. A dataset from a company
3. The design of a new data mining algorithm

1.7 Gantt Chart


CHAPTER 2
2.0 LITERATURE REVIEW

2.1 Data Mining Techniques [6], [10], [11]


2.1.1 Association
Association (or relation) is probably the best-known, most familiar and most
straightforward data mining technique. It makes a simple correlation between two or more
items, often of the same type, to identify patterns. For example, when tracking people's
buying habits, we might identify that a customer always buys cream when they buy
strawberries, and therefore suggest that the next time they buy strawberries they
might also want to buy cream.
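
The strawberries-and-cream example can be made concrete with the two measures usually attached to an association rule, support and confidence. The sketch below is illustrative only: the baskets are invented, and the function names `support` and `confidence` are assumptions made for this example, not part of WEKA or of this project's code.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated share of transactions with `antecedent` that also hold `consequent`."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so no confidence can be measured
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"strawberries", "cream"},
    {"strawberries", "cream", "bread"},
    {"strawberries", "milk"},
    {"bread", "milk"},
]

# Rule under test: strawberries -> cream
rule_support = support(baskets, {"strawberries", "cream"})
rule_confidence = confidence(baskets, {"strawberries"}, {"cream"})
print(rule_support, rule_confidence)
```

A rule is normally kept only when both values exceed thresholds chosen by the analyst; those thresholds are among the parameters an association rule miner expects.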

2.1.2 Classification
We can use classification to build up an idea of the type of customer, item, or object
by describing multiple attributes to identify a particular class. For example, we can
easily classify cars into different types (sedan, 4x4, convertible) by identifying
different attributes (number of seats, car shape, driven wheels). Given a new car, we
might assign it to a particular class by comparing its attributes with our known
definitions. We can apply the same principles to customers, for example by classifying
them by age and social group.
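
As a minimal sketch of the car example, the code below assigns a new car to the class whose definition it matches on the most attributes. The attribute values in `KNOWN_CLASSES` are hypothetical, and simple attribute matching is an assumption chosen for brevity; it stands in for the learned classifiers discussed later in this report.

```python
# Hypothetical class definitions: number of seats, car shape, driven wheels.
KNOWN_CLASSES = {
    "sedan": {"seats": 5, "shape": "saloon", "driven_wheels": 2},
    "4x4": {"seats": 5, "shape": "box", "driven_wheels": 4},
    "convertible": {"seats": 2, "shape": "open-top", "driven_wheels": 2},
}


def classify(car):
    """Return the class whose definition shares the most attribute values with `car`."""
    def matches(definition):
        return sum(1 for key, value in definition.items() if car.get(key) == value)

    # On a tie, max() keeps the class that was defined first.
    return max(KNOWN_CLASSES, key=lambda name: matches(KNOWN_CLASSES[name]))


new_car = {"seats": 2, "shape": "open-top", "driven_wheels": 2}
print(classify(new_car))
```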

2.1.3 Prediction
Prediction is a wide topic and runs from predicting the failure of components or
machinery, to identifying fraud and even the prediction of company profits. Used in
combination with the other data mining techniques, prediction involves analyzing
trends, classification, pattern matching, and relation. By analyzing past events or
instances, we can make a prediction about an event. For example, in credit card
authorization, we can combine decision trees and classification to analyse an individual's
past transactions and identify whether a transaction is fraudulent by matching it against the
historical pattern of the individual.
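
The idea of matching a transaction against an individual's historical pattern can be illustrated with a much simpler stand-in than the decision tree and classification combination described above: a standard-deviation check on the transaction amount. The function name `looks_unusual`, the threshold of three standard deviations and the sample amounts are all assumptions made for this sketch.

```python
from statistics import mean, pstdev


def looks_unusual(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard deviations
    from the mean of the customer's past amounts."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Every past amount was identical, so any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical past transaction amounts for one cardholder.
past_amounts = [42.0, 38.5, 51.0, 45.25, 40.0]

print(looks_unusual(past_amounts, 47.0))
print(looks_unusual(past_amounts, 900.0))
```

A real fraud model would use many more attributes (merchant, location, time of day); the point here is only that the prediction is made by comparing a new event with past events.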

2.1.4 Sequential patterns (Long-term data)
Sequential patterns are a useful method for identifying trends, or regular occurrences
of similar events. For example, with customer data we can identify that customers buy
a particular collection of products together at different times of the year. In an online
shopping website, we can use this information to automatically suggest that certain
items be added to their shopping cart based on their frequency and past purchasing
history.

2.1.5 Clustering
By examining one or more attributes or classes, we can group individual pieces of
data together to form a structured view. Clustering uses one or more attributes as the
basis for identifying a cluster of correlating results. Clustering is useful for identifying
distinct groups of information because it correlates each example with others, so we can see where
the similarities and ranges agree.

Clustering can work both ways. We can assume that there is a cluster at a certain point
and then use our identification criteria to see if we are correct. The graph in Figure
2.1 shows a good example. In this example, a sample of sales data compares the age
of the customer to the size of the sale. It is not unreasonable to expect that people in
their twenties (before marriage and kids), fifties, and sixties (when the children have
left home), have more disposable income.
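
The age-versus-sale-size example can be sketched with k-means, one common clustering algorithm; the text above does not name a specific algorithm, so its use here is an assumption. The sample points and the starting centroids are invented for illustration, and supplying the starting centroids corresponds to assuming where the clusters are and then checking that assumption.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; leave it alone if the group is empty.
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups


# Hypothetical (customer age, sale size) pairs.
sales = [(22, 180), (25, 200), (27, 190), (41, 60), (44, 70), (58, 210), (62, 220)]

# Starting guesses: one cluster each for the twenties, forties and sixties.
centres, clusters = kmeans(sales, centroids=[(20, 200), (40, 50), (60, 200)])
print(centres)
```

Because age and sale size are on different scales, a real analysis would normally rescale the attributes before measuring distance.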


Figure 2.1 Clustering (Sample Diagram).


(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)

2.1.6 Decision trees (J48)


Related to most of the other techniques (primarily classification and prediction), the
decision tree can be used either as a part of the selection criteria, or to support the use
and selection of specific data within the overall structure. Within the decision tree, we
start with a simple question that has two (or sometimes more) answers. Each answer
leads to a further question to help classify or identify the data so that it can be
categorized, or so that a prediction can be made based on each answer.
Figure 2.2 shows an example where you can classify an incoming error condition.


Figure 2.2 Decision Tree (J48).


(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)

Decision trees are often used with classification systems to attribute type information,
and with predictive systems, where different predictions might be based on past
historical experience that helps drive the structure of the decision tree and the output.
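
The question-and-answer structure of a decision tree can be sketched as nested dictionaries that are walked from the root to a leaf. The tree below is hand-written and hypothetical: the questions, labels and the `decide` function are assumptions for illustration. J48 would learn such a tree from training data, and this sketch shows only how a finished tree is used.

```python
# Hypothetical tree: internal nodes ask a yes/no question, leaves hold a label.
TREE = {
    "question": "age_under_30",
    "yes": {"label": "likely buyer"},
    "no": {
        "question": "has_loyalty_card",
        "yes": {"label": "likely buyer"},
        "no": {"label": "unlikely buyer"},
    },
}


def decide(node, record):
    """Follow the yes/no answers in `record` from the root until a leaf is reached."""
    while "label" not in node:
        node = node["yes"] if record[node["question"]] else node["no"]
    return node["label"]


customer = {"age_under_30": False, "has_loyalty_card": True}
print(decide(TREE, customer))
```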
2.1.7 Decision Table
Decision tables, like decision trees, are classification models used for prediction. They
are induced by machine learning algorithms. A decision table consists of a hierarchical
table in which each entry in a higher level table gets broken down by the values of a
pair of additional attributes to form another table. The structure is similar to
dimensional stacking.


2.2 Customer Relationship Management (CRM) [15], [16], [17], [18]


2.2.1 What is Customer Relationship Management (CRM)?
CRM (customer relationship management) is an information industry term for
methodologies, software, and usually Internet capabilities that help a
company manage customer relationships in an organized way. For example, a
company might build a database about its customers that described relationships in
sufficient detail so that management, salespeople, people providing service, and
perhaps the customer directly could access information, match customer needs with
product plans and offerings, remind customers of service requirements, and know
what other products a customer had purchased.

2.2.2 How CRM is Used Today


CRM solutions provide a company with the customer business data it needs to offer
services or products that customers want, provide better customer service, cross-sell
and up-sell more effectively, close deals, retain current customers and better
understand the customer.

2.2.3 The CRM Strategy


Customer relationship management is often thought of as a business strategy that
enables businesses to improve in a number of areas. The CRM strategy allows a
company to do the following:
1) Understand the customer
2) Retain customers through better customer experience
3) Attract new customers
4) Win new clients and contracts
5) Increase profitability
6) Decrease customer management costs

2.2.4 The Impact of Technology on CRM


Technology and the Internet have changed the way companies approach customer
relationship strategies. Advances in technology have changed consumer buying
behaviour, and today there are many ways for companies to communicate with
customers and to collect data about them. With each new advance in technology,
especially the proliferation of self-service channels like the Web and smartphones,
customer relationships are being managed electronically.
Many aspects of customer relationship management rely heavily on technology;
however, the strategies and processes of a good CRM system will collect, manage and
link information about the customer with the goal of letting you market and sell
services effectively.

2.2.5 The Benefits of CRM


The biggest benefit most businesses realize when moving to a CRM system comes
directly from having all the business data stored and accessed from a single location.
Before CRM systems, customer data was spread out over office productivity suite
documents, email systems, mobile phone data and even paper note cards and Rolodex
entries. Storing all the data from all departments (e.g., sales, marketing, customer
service and HR) in a central location gives management and employees immediate
access to the most recent data when they need it. Departments can collaborate with
ease, and CRM systems help organizations to develop efficient automated processes to
improve business processes.

2.2.6 Data Mining and Customer Relationship Management [17]

Customer relationship management (CRM) is a process that manages the interactions
between a company and its customers. The primary users of CRM software
applications are database marketers who are looking to automate the process of
interacting with customers.

To be successful, database marketers must first identify market segments containing
customers or prospects with high-profit potential. They then build and execute
campaigns that favourably impact the behaviour of these individuals.

The first task, identifying market segments, requires significant data about prospective
customers and their buying behaviours. In theory, the more data the better. In practice,
however, massive data stores often impede marketers, who struggle to sift through the
minutiae to find the nuggets of valuable information.

Recently, marketers have added a new class of software to their targeting arsenal.
Data mining applications automate the process of searching the mountains of data to
find patterns that are good predictors of purchasing behaviours.

After mining the data, marketers must feed the results into campaign management
software that, as the name implies, manages the campaign directed at the defined
market segments.

In the past, the link between data mining and campaign management software was
mostly manual. In the worst cases, it involved "sneaker net," creating a physical file
on tape or disk, which someone then carried to another computer and loaded into the
marketing database.

This separation of the data mining and campaign management software introduces
considerable inefficiency and opens the door for human errors. Tightly integrating the
two disciplines presents an opportunity for companies to gain competitive advantage.

2.2.7 Review of Data Mining Tools in CRM [18]

Data mining uses a combination of an explicit knowledge base, sophisticated
analytical skills, and domain knowledge to uncover hidden trends and patterns. These
trends and patterns form the basis of predictive models that enable analysts to produce
new observations from existing data. There are a number of data mining tools available
in the market that can give firms the cutting edge needed to achieve
profitable CRM.
Data mining tools help CRM by providing a complete framework, which covers:
• Analyzing the business problem.
• Preparing the data requirements.
• Building a suitable model with respect to the business problem.
• Validating and evaluating the designed model.

Model building is the next phase of the Data mining tool, which builds the various
models according to the data given in the data preparation phase. The last phase is the
evaluation of the model, so that the proper results in the form of useful patterns can be
drawn from the models built by the tools.
The tools of data mining for CRM should be able to detect the necessary information
from the available data. To achieve this, data mining tools should have
characteristics such as:
• User-friendly environment
• Efficiency of the tool
• Basic tasks should be accomplished
• Low cost of implementation

2.2.8 Data Mining Tools Applications in CRM [18]


Virtually any process from pharmacology to customer service can be studied,
understood, and improved using data mining. The top three end uses of data mining
are, not surprisingly, in the marketing area.

Figure 2.3 Data Mining Applications Useful For Companies.
(Adopted from http://www.informationweek.com/673/73iudat.htm)

Figure 2.3 shows that customer demographics is one of the most important
applications for companies. The applications of data mining tools are in:

• Customer Profiling: In customer profiling, characteristics of good customers
are identified with the goals of predicting who will become one and helping
marketers target new prospects. Data mining can find patterns in a customer
database that can be applied to a prospective database so that customer
acquisition can be appropriately targeted. For example, by identifying good
candidates for mail offers or catalogues, direct-mail marketers can reduce
expenses and increase their sales.
• Targeted Marketing: Targeting specific promotions to existing and potential
customers offers similar benefits.
• Market-basket analysis: Market-basket analysis helps retailers understand
which products are purchased together or by an individual over time. With data
mining, retailers can determine which products to stock in which stores, and
even how to place them within a store. Data mining can also help assess the
effectiveness of promotions and coupons.
• Manage customer relationship: Another common use of data mining in many
organizations is to help manage customer relationships. By determining
characteristics of customers who are likely to leave for a competitor, a
company can take action to retain that customer, because doing so is usually far
less expensive than acquiring a new customer.
• Fraud detection: Fraud detection is of great interest to telecommunications
firms, credit-card companies, insurance companies, stock exchanges, and
government agencies, some of which also use data mining to identify and track
individual terrorists, for example through travel and immigration records.
• Anticipate and prevent customer attrition: The data mining tool can help to
find the customers who are not satisfied with the firm's services. This helps
firms to offer promotional services to the group of customers who are likely to
leave.
• Mine unstructured data, such as text: Text data is always unstructured, so
data mining tools can help to mine it and help organizations get value out of
the data.

2.3 Data Mining Applications


Data mining is a data analysis approach that has been quickly adopted and used in a
large number of domains that were already using statistics. Here is a list of areas
where data mining is widely used:
• Banking/Finance
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Medical/Pharma
• Insurance and Health Care
• Other Scientific Applications
• Intrusion Detection

2.3.1 BANKING/FINANCE (FINANCIAL DATA ANALYSIS)


The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Here are a
few typical cases:
• Design and construction of data warehouses for multidimensional data analysis
and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
• Detection of fraudulent credit card usage patterns.
• Risk management related to attribution of loans using scorecards.
• Finding hidden correlations between different financial indicators.
• Identification of stock trading rules from historical market data.


2.3.2 RETAIL/MARKETING INDUSTRY


Data mining has great application in the retail industry because the industry collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and
trends. That leads to improved quality of customer service and good customer
retention and satisfaction. Here is a list of examples of data mining in the retail industry:
• Design and construction of data warehouses based on benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer retention.
• Product recommendation and cross-referencing of items.
• Discovery of buying behaviour patterns.
• Detection of associations among customer characteristics.
• Prediction of the probability that clients answer to mailing.


2.3.3 TELECOMMUNICATION INDUSTRY


Today the telecommunication industry is one of the most rapidly emerging industries,
providing various services such as fax, pager, cellular phone, Internet messenger,
images, e-mail and web data transmission. Due to the development of new computer
and communication technologies, the telecommunication industry is rapidly
expanding. This is why data mining has become very important in helping to
understand the business. Data mining in the telecommunication industry helps in
identifying telecommunication patterns, catching fraudulent activities, making better
use of resources, and improving quality of service. Here is a list of examples in which
data mining improves telecommunication services:
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.


2.3.4 BIOLOGICAL DATA ANALYSIS


Nowadays we see vast growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data mining is
a very important part of bioinformatics. The following are the aspects in which data mining
contributes to biological data analysis:
• Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
• Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein
pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.

2.3.5 MEDICAL/PHARMA
Data mining plays a very important part in the medical field. With data mining,
the success rate of research into new cures for rare diseases can be higher. Below are
the aspects in which data mining contributes to the medical field:
• Computer Assisted Diagnosis (expert systems learning)
• Characterization/prediction of patient's response to product dosage
• Identification of successful medical therapies (successful prescription
patterns).
• Study of relations between dosage and potentially related adverse events

2.3.6 INSURANCE AND HEALTH CARE


The following is how insurance companies manage their businesses and customers with
the help of data mining:
• Discovery of medical procedures that are claimed together through claims
analysis.
• Identification of customers that are potential buyers for new policies.
• Detection of behaviour patterns capable of identifying risky customers.
• Detection of fraudulent behaviour.

2.3.7 OTHER SCIENTIFIC APPLICATIONS


The applications discussed above tend to handle relatively small and homogeneous
data sets for which statistical techniques are appropriate. Huge amounts of data
have been collected from scientific domains such as geosciences and astronomy.
Large data sets are also being generated by fast numerical
simulations in various fields such as climate and ecosystem modelling, chemical
engineering and fluid dynamics. The following are the applications of data mining in the field
of scientific applications:
• Data warehouses and data pre-processing.
• Graph-based mining.
• Visualization and domain-specific knowledge.

2.3.8 INTRUSION DETECTION


Intrusion refers to any kind of action that threatens the integrity, confidentiality
or availability of network resources. In today's connected world, security has
become a major issue. The increased use of the Internet and the availability of
tools and tricks for intruding into and attacking networks have made intrusion
detection a critical component of network administration. Data mining technology
may be applied to intrusion detection in the following areas:

Development of data mining algorithm for intrusion detection.

Association and correlation analysis, aggregation to help select and build
discriminating attributes.

Analysis of Stream data.

Distributed data mining.

Visualization and query tools.

2.4 Data Mining Systems [13]


A large variety of data mining systems is available. A data mining system may
integrate techniques from the following:

Spatial Data Analysis

Information Retrieval

Pattern Recognition

Image Analysis

Signal Processing

Computer Graphics

Web Technology

Business

Bioinformatics

2.4.1 Data Mining System Classification [12]

The data mining system can be classified according to the following criteria:

Database Technology

Statistics

Machine Learning

Information Science

Visualization

Other Disciplines

Figure 2.4 Data Mining System Classification.


(Adopted from http://www.tutorialspoint.com/data_mining/dm_systems.htm)

2.4.2 Data Mining System Products [13]


Many data mining system products and domain-specific data mining applications are
available. New data mining systems and applications are continually being added to
the existing ones, and efforts are being made towards the standardization of data
mining languages.

2.4.3 Choosing Data Mining System

The choice of a data mining system depends on the following features:

Data Types - The data mining system may handle formatted text, record-based
data and relational data. The data could also be in ASCII text, relational database
data or data warehouse data. We should therefore check exactly which formats the
data mining system can handle.

System Issues - We must consider the compatibility of the data mining system
with different operating systems. A data mining system may run on only one
operating system or on several. There are also data mining systems that provide
web-based user interfaces and allow XML data as input.

Data Sources - Data sources refers to the data formats on which the data mining
system will operate. Some data mining systems work only on ASCII text files,
while others work on multiple relational sources. A data mining system should also
support ODBC connections or OLE DB for ODBC connections.

Data Mining functions and methodologies - Some data mining systems provide only
one data mining function, such as classification, while others provide multiple
data mining functions such as concept description, discovery-driven OLAP analysis,
association mining, linkage analysis, statistical analysis, classification,
prediction, clustering, outlier analysis and similarity search.

Coupling data mining with databases or data warehouse systems - A data mining
system needs to be coupled with a database or data warehouse system. The coupled
components are integrated into a uniform information processing environment. The
types of coupling are listed below:

No coupling

Loose Coupling

Semi tight Coupling

Tight Coupling

Scalability - There are two scalability issues in data mining:

Row (database size) scalability - A data mining system is considered row scalable
if, when the number of rows is enlarged 10 times, it takes no more than 10 times as
long to execute the same query.

Column (dimension) scalability - A data mining system is considered column scalable
if the mining query execution time increases linearly with the number of columns.

Visualization Tools - Visualization in data mining can be categorized as follows:

Data Visualization

Mining Results Visualization

Mining process visualization

Visual data mining

Data mining query language and graphical user interface - An easy-to-use graphical
user interface is required to promote user-guided, interactive data mining. Unlike
relational database systems, data mining systems do not share an underlying data
mining query language.

2.4.4 Trends in Data Mining [25]


The following trends in data mining reflect the pursuit of challenges such as the
construction of integrated and interactive data mining environments and the design
of data mining languages:

Application Exploration

Scalable and Interactive data mining methods

Integration of data mining with database systems, data warehouse systems and
web database systems.

Standardization of data mining query language

Visual Data Mining

New methods for mining complex types of data

Biological data mining

Data mining and software engineering

Web mining

Distributed Data mining

Real time data mining

Multi Database data mining

Privacy protection and Information Security in data mining

2.5 Data Mining Process Model [23]


CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The CRISP-DM
methodology provides a structured approach to planning a data mining project. It is
a robust and well-proven methodology.

2.5.1 Overview of Data Mining Life Cycle

Figure 2.5 Data Mining Process Model.


(Adopted from http://www.rithme.eu/?m=resources&p=dmmethod&lang=en)

Starting from the knowledge discovery processes used in early data mining projects,
CRISP-DM defined and validated a data mining process that is applicable in any
industry sector. This methodology should make large data mining projects faster,
cheaper, more reliable and more manageable. However, even small-scale data mining
investigations can benefit from using it.

This process model provides a simple overview of the life cycle of a data mining
project. The phases of a data mining project are clearly identified through their
tasks and the relationships between these tasks. Even where the model does not show
them, relationships may exist between any of the data mining tasks, depending mainly
on the analysis goals and on the data to be analysed.

Six main phases can be distinguished in this process model:

Business understanding - concerns the definition of the data mining problem
based on the business objectives.

Data understanding - this phase aims at getting a precise idea about data
available, identifying possible data quality issues, etc.

Data preparation - covers all activities meant to build the dataset to analyse
from the initial raw data. This includes cleaning, feature selection, sampling, etc.

Modeling - is the phase where several data mining techniques are parameterized
and tested with the objective of optimizing the obtained data model or knowledge.

Evaluation - aims at verifying that the obtained model properly answers the
initially formulated business objectives and contributes to deciding whether the model
will be deployed or, on the contrary, will be rebuilt.

Deployment - is the final step of the cyclic data mining process model. Its
target is to take the obtained knowledge, put it in a convenient form and integrate
it into the business decision process. Depending on the objectives, it can range
from generating a report describing the obtained knowledge to creating a specific
application that will use the obtained model to predict unknown values of a desired
parameter.

Chapter 3
3.0 ANALYSIS

3.1 Data Mining for Shopping Centres


With the majority of large retailers offering a loyalty card scheme, the collection of
customer data is now routine commercial practice. Whilst loyalty schemes were
originally introduced to reward loyal customers and to encourage them to increase
their overall spend, retailers have been finding more and more sophisticated ways to
use customer data to their advantage.

Due to high competition in the business field, it is essential to consider the
customer relationship management of the shopping centre. Here we analyse the
massive volume of customer data and classify customers based on their behaviour,
for the purpose of prediction. Customer relationship management is mainly used in
sales forecasting and banking. Data mining provides the technology to analyse large
volumes of data and detect hidden patterns that convert raw data into valuable
information.

This work analyses DM techniques in Weka workbench, and reports the simulation
results of applying four DM techniques and classifiers in the open source workbench
to the Customer Relationship Management (CRM) for a shopping centre.
We propose that data mining techniques be used to aid the salespeople and
management of the shopping centre in effective decision making.
This approach was applied to 100 pre-processed records. Simulation results show that
a large volume of historical customer data can play a value-added role in shopping
centre development: the mined data helps the centre to study customer behaviour so
that personalized services can be provided.

Our aim is to demonstrate the possibilities and draw attention to the possible
implications of improving customer satisfaction. The objectives of this work could
include increasing rental income and bringing new life back into the shopping centre.

3.1.1 Free Sample Data for Testing Purpose

Figure 3.1 Sample Data (CSV format).

Above is the sample data for testing purposes. The test data consist of 100
pre-processed customer records. Included fields are:

Sex

Age

Channel

Transportation

All files are provided as CSV (comma-delimited).


Sex is the customer's gender, and Age is random. Channel is the way the customer
makes payment, by credit card or cash. Transportation is how the customer travels to
their destination.

Figure 3.2 Sample Data (Notepad format).

3.1.2 Related Work


Before data mining can be performed, the data must be prepared and cleaned.
Incomplete data were found in some of the records, meaning that some records lack
attribute values, so data preparation is needed. Noisy data contain errors, and
inconsistent data contain discrepancies in codes or names. During data preparation,
only the wanted fields are selected from each table for mining. Data pre-processing
techniques such as data cleaning and data reduction were applied. Data cleaning
fills in missing values, smooths noisy data, identifies or removes outliers and
resolves inconsistencies. Additional data cleaning can be performed to detect and
remove redundancies that still occur after data integration. Data reduction
produces a reduced representation of the data set that is much smaller in volume
and that should produce the same result.

Figure 3.3 Block Diagram.

The customer data may contain attributes that take much larger values than others.
If such attributes are left unnormalized, they need to be normalized. Furthermore,
aggregate information would be useful for analysis. Data transformation operations
such as normalization and aggregation are additional data pre-processing procedures
that contribute toward the success of the mining process.
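
The normalization step described above can be sketched as follows. The report does not state which normalization was used, so this minimal example assumes min-max scaling to the range [0, 1]; the function name and the age values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute cannot be rescaled; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 30, 42, 66]  # hypothetical customer ages
normalized_ages = min_max_normalize(ages)
print(normalized_ages)
```

After scaling, an attribute with a large range such as age no longer dominates attributes with a small range.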

Evaluation criteria: A rich set of evaluation criteria is available in Weka.
Only the following seven criteria are used:

Correctly Classified

Incorrectly Classified

Kappa Statistic

Mean Absolute Error

Root Mean Squared Error

Relative Absolute Error

Root Relative Squared Error

We will show the results of the above evaluation criteria applied to two scenarios
based on the customer data records maintained by the shopping centre.
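
To make the seven criteria concrete, the sketch below computes them by hand. It is a simplified illustration, not Weka's implementation: it assumes a two-class problem in which each prediction is a probability for the positive class, and it takes the baseline for the relative errors to be a constant predictor. The function names, the confusion matrix and the probability values are all hypothetical.

```python
import math


def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] counts instances of actual class i predicted as class j.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    return observed, (observed - expected) / (1 - expected)


def error_measures(actual, predicted, baseline):
    """Return (MAE, RMSE, relative absolute error, root relative squared error).

    actual holds 0/1 labels, predicted holds probabilities for class 1, and
    baseline is the constant prediction the relative errors are measured against.
    """
    n = len(actual)
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    base_abs = sum(abs(baseline - a) for a in actual)
    base_sq = sum((baseline - a) ** 2 for a in actual)
    return (abs_err / n, math.sqrt(sq_err / n),
            abs_err / base_abs, math.sqrt(sq_err / base_sq))


# Hypothetical confusion matrix: 60 of 100 instances correctly classified.
accuracy, kappa = accuracy_and_kappa([[50, 10], [30, 10]])

# Hypothetical labels and predicted probabilities; baseline = share of class 1.
labels = [1, 1, 1, 0, 0]
probabilities = [0.8, 0.6, 0.4, 0.3, 0.5]
mae, rmse, rae, rrse = error_measures(labels, probabilities, baseline=0.6)
print(accuracy, kappa, mae, rmse, rae, rrse)
```

A relative error close to 100% means the classifier is barely better than the constant baseline, and a kappa close to 0 means agreement is close to what chance alone would give.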

3.1.3 Methods
Four DM algorithms were tested, as follows:

Naive Bayes Algorithm: Naive Bayes is a well-known method in machine learning. It
is a simple and efficient learning method. The Naive Bayes classifier is an
approximation to an ideal Bayesian classifier, which would classify an example
based on the probability of each class given the example's feature variables.
The main assumption is that the different features are independent of each
other given the class of the example.

Decision Table: A decision table is based on logical relationships, just like a
truth table. It is a tool that helps us check combinations of conditions for
completeness and inconsistency.

Decision Tree (J48): J48 attempts to account for noise and missing data. It
also deals with numeric attributes by determining where thresholds for
decision splits should be placed. The main parameters that can be set for this
algorithm are the confidence threshold, the minimum number of instances per
leaf and the number of folds for reduced error pruning.

Association: This technique finds groups of items that tend to occur together
in a transaction by searching for relationships between variables. For example, a
supermarket might gather data on customer purchasing habits. Using association
rule learning, the supermarket can determine which products are frequently bought
together and use this information for marketing purposes. This is sometimes
referred to as market basket analysis. We also identified and performed an
association rule mining task. This involves:
(1) Finding rules, including appropriate parameter setting,
(2) Determining which of the resulting rules are interesting,
(3) Figuring out how the interesting rules could be useful.
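
Step (2) above usually relies on numeric measures of how interesting a rule is. The sketch below computes three common ones (support, confidence and lift) for a single rule. The function name and the basket contents are hypothetical and are not taken from the project data.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent ==> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n                                 # share of all baskets
    confidence = n_both / n_ante if n_ante else 0.0      # P(consequent | antecedent)
    lift = confidence / (n_cons / n) if n_cons else 0.0  # > 1: positive association
    return support, confidence, lift


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence, lift = rule_measures(baskets, {"bread"}, {"milk"})
print(support, confidence, lift)
```

Here the rule bread ==> milk has a lift below 1, so buying bread makes milk slightly less likely than average; a rule like this would normally not be considered interesting.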

3.1.4 Result and Discussion


This section provides the simulation results produced by Weka. As noted earlier,
three classifiers were selected under the classification technique, namely the
Naive Bayes algorithm, the Decision Table algorithm and the J48 algorithm (Decision
Tree), together with association rules.

Naive Bayes: Fig. 3.4 shows the output of the Naive Bayes algorithm used to
analyze the data.

Figure 3.4 Results returned by the Naive Bayes classifier.

Fig. 3.4 shows the result of the analysis of transportation based on Naive
Bayes. The result reveals that both male and female customers prefer to use
private transport when travelling to the shopping centre.
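
For readers who want to see how a Naive Bayes prediction of the transportation attribute is formed, here is a minimal sketch for nominal attributes. It assumes Laplace smoothing and uses five hypothetical records; it is not Weka's implementation and does not reproduce the figures in Fig. 3.4.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row in rows:
        for attr, value in row.items():
            if attr != target:
                value_counts[(attr, row[target])][value] += 1
                values_seen[attr].add(value)
    return class_counts, value_counts, values_seen


def predict(model, record):
    """Return the class with the highest posterior score for one record."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, value in record.items():
            # Laplace-smoothed estimate of P(value | class).
            count = value_counts[(attr, cls)][value]
            score *= (count + 1) / (n_cls + len(values_seen[attr]))
        scores[cls] = score
    return max(scores, key=scores.get)


# Hypothetical records using the field names of the sample data.
rows = [
    {"Sex": "female", "Channel": "cash", "Transportation": "private"},
    {"Sex": "female", "Channel": "credit card", "Transportation": "public"},
    {"Sex": "male", "Channel": "credit card", "Transportation": "private"},
    {"Sex": "male", "Channel": "cash", "Transportation": "private"},
    {"Sex": "female", "Channel": "cash", "Transportation": "private"},
]
model = train_naive_bayes(rows, target="Transportation")
prediction = predict(model, {"Sex": "male", "Channel": "cash"})
print(prediction)
```

Each attribute contributes one independent factor to the score, which is exactly the independence assumption described in the Methods section.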

Decision Table: Fig. 3.5 shows the output for the case study, which uses 100
training instances and produces 1 rule; non-matches are covered by the majority
class.

Figure 3.5 The decision table of data analysis.

Decision Tree (J48): Fig. 3.6 shows the output produced by the J48 algorithm.

Figure 3.6 J48 pruned tree of sex analysis.

The software listed all the possible rules of the decision tree. Below are some of
the simulation results:

If Sex = female and transportation = private then cash
If Sex = female and transportation = public and age less than or equal to 66 then
credit card
If Sex = female and transportation = public and age greater than 66 then cash
If Sex = male then credit card
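
The four rules listed above can be written as a small function, which makes it easy to check how a given customer would be classified. The function name and the lowercase value spellings are assumptions made for this sketch.

```python
def predict_channel(sex, transportation, age):
    """Apply the J48 rules listed above to one customer record."""
    if sex == "male":
        return "credit card"
    # The remaining rules apply to female customers.
    if transportation == "private":
        return "cash"
    # Public transport: the tree splits on age at 66.
    return "credit card" if age <= 66 else "cash"


print(predict_channel("female", "public", 40))
```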

Association: Fig. 3.7 shows the results of running the Apriori algorithm for
association rules. The algorithm produces many rules, but only a few are useful for
effective decision making. It cannot generate the best rules because the data are
insufficient.

Figure 3.7 Associate Rules

To make sure that the Apriori algorithm for association rules works well, some new
fields were added to the sample data: relationship, region, brand and race. Age was
removed because the Apriori algorithm does not support numeric data.

Figure 3.8 Sample Data (CSV format).

Figure 3.9 Sample Data (Notepad format).

Figure 3.10 Associate Rules.

Below is the result of the simulation:


Channel=Cash Transportation=Private Race=Chinese ==> Sex=Female

3.1.5 Comparison between Naive Bayes (NB), Decision Table (DT) and Decision
Tree (J48)

Table 3.1 shows the comparison results of Naive Bayes (NB), Decision Table (DT) and
J48. Overall, J48 gives better results than DT and NB under cross validation, since
it misclassifies the fewest instances (41, against 42 for NB and 44 for DT),
although NB has the lowest mean absolute error and root mean squared error.

                                Use Training   Cross        Percentage
                                Set            Validation   Split

Naive Bayes (NB)
Correctly Classified            60             58           19
Incorrectly Classified          40             42           15
Kappa Statistic                 0.0909         0.0455       0.0449
Mean Absolute Error             0.4562         0.4671       0.4831
Root Mean Squared Error         0.4777         0.4897       0.5114
Relative Absolute Error         94.97%         97.23%       99.37%
Root Relative Squared Error     97.52%         99.97%       102.28%

Decision Table (DT)
Correctly Classified            60             56           19
Incorrectly Classified          40             44           15
Kappa Statistic                 0              -0.0577      0
Mean Absolute Error             0.4812         0.4855       0.4868
Root Mean Squared Error         0.4899         0.4963       0.4994
Relative Absolute Error         100.17%        101.05%      100.14%
Root Relative Squared Error     100.01%        101.31%      99.87%

Decision Tree (J48)
Correctly Classified            60             59           19
Incorrectly Classified          40             41           15
Kappa Statistic                 0              -0.0199      0
Mean Absolute Error             0.48           0.4827       0.4857
Root Mean Squared Error         0.4899         0.4954       0.5004
Relative Absolute Error         99.9184%       100.4763%    99.9137%
Root Relative Squared Error     99.9992%       101.1204%    100.0864%

TABLE 3.1 COMPARISON BETWEEN NB, DT, J48 BY DATA NUMERIC.

3.1.6 Comparison between classifiers with time taken to build a model


The results in Table 3.2 show that J48 has the highest number of correctly
classified instances (59), followed by Naive Bayes (58) and the Decision Table
algorithm (56). The Decision Table takes the longest time to build a model,
followed by Naive Bayes and the J48 algorithm.

Algorithm                          Naive Bayes   J48   Decision Table
Correctly classified instances     58            59    56

Time taken to build (seconds): 0.03

TABLE 3.2 COMPARISON BETWEEN CLASSIFIERS WITH TIME TAKEN TO BUILD A MODEL.

3.2 Association Rules Apriori Algorithm [24]


3.2.1 Apriori Algorithm
The Apriori algorithm mines for associations among items in a large database of
sales transactions. This is an important database mining function. For example, it
can reveal that a customer who purchases a keyboard also tends to buy a mouse at the
same time.
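
The level-wise search behind Apriori can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Weka source code that is modified in Chapter 5: minimum support is given as an absolute count, and the four transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # One pass over the database counts each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical sales transactions.
sales = [
    {"keyboard", "mouse", "monitor"},
    {"keyboard", "mouse"},
    {"keyboard", "monitor"},
    {"mouse"},
]
frequent_itemsets = apriori(sales, min_support=2)
print(frequent_itemsets[frozenset({"keyboard", "mouse"})])
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before the database is scanned.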

3.2.2 Limitations of Apriori Algorithm


The Apriori algorithm is simple and easy to execute, but it has some limitations.
The main limitation is the cost of handling a huge number of candidate sets when
there are many frequent itemsets, a low minimum support or large itemsets. For
example, if there are 10^4 frequent 1-itemsets, it needs to generate more than 10^7
candidate 2-itemsets and accumulate and test their occurrence frequencies.
Moreover, to discover a frequent pattern of size 100, for example {v1, v2, v3, ...,
v100}, it must generate on the order of 2^100 candidate itemsets in total, so
candidate generation is costly and time-consuming. In addition, the algorithm
repeatedly scans the database and checks a large set of candidates by pattern
matching. The Apriori algorithm is therefore very inefficient when memory capacity
is limited and the number of transactions is large.
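
The two figures quoted above can be checked with a short calculation: the number of candidate 2-itemsets is the number of pairs that can be formed from the frequent 1-itemsets, and the number of candidates for a pattern of size 100 is the number of its non-empty subsets. The variable names are chosen for this sketch only.

```python
from math import comb

n_frequent_items = 10 ** 4
# Every pair of frequent 1-itemsets is a candidate 2-itemset.
candidate_pairs = comb(n_frequent_items, 2)
print(candidate_pairs)  # 49995000, i.e. more than 10^7

pattern_length = 100
# Every non-empty subset of a frequent 100-itemset is itself frequent.
subsets = 2 ** pattern_length - 1
print(subsets > 10 ** 30)  # True
```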

Chapter 4
4.0 DESIGN

4.1 Data Mining Process


The data mining process has 8 steps.
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Analyze the data.
4. Create a model set.
5. Fix problems with the data.
6. Transform data.
7. Build models.
8. Deploy models.

Figure 4.1 Data Mining is not a linear process.

As shown in Figure 4.1, the data mining process is best thought of as a set of
nested loops rather than a straight line. The steps have a natural order, but it is
not necessary to finish one step completely before moving on to the next, and a
later step may lead back to an earlier one.

4.1.1 Step One: Translate the business problem into a data mining problem

The first step is to explore the available data and make a list of candidate
business problems. A well-defined business problem points the data mining project
in the right direction and makes it possible to solve the problem. The data mining
goals for a particular project should be specific rather than broad and general.
This makes it easier to monitor progress in achieving them. Examples of specific
goals:

Identify customers who are unlikely to renew their subscriptions.

Forecast customer population in future months.

List products whose sales are at risk if we discontinue wine and beer sales.

4.1.2 Step Two: Select appropriate data


Data mining requires data. Ideally, the data are already resident in a corporate
data warehouse: cleansed, available, historically accurate and frequently updated.
The data sources that are useful and valuable vary from problem to problem and from
industry to industry. A few examples of useful data:

Point of sale data (coupons, discount)

Credit card charge records

Direct mail response records

4.1.3 Step Three: Analyze the data


This step examines the dataset in order to understand a data file from a new source.
Data visualization is one of the best ways to get to know the data.

Figure 4.2 Sample data in ARFF Viewer.

Figure 4.3 Data Visualize.

Figure 4.4 Visualization of data by age and sex.

4.1.4 Step Four: Create a Model Set for Prediction


Creating a model set for prediction requires assembling data from different sources.
When making a prediction, the predictive model uses data from the past, finding
patterns to make predictions about the future. Time can always be divided into three
periods: the past, present, and future.

Figure 4.5 Data from the past mimics data from the past, present, and future.

4.1.5 Step Five: Fix Problem with the Data

Figure 4.6 Sample data.

Variables such as address, post, telephone number and email are useful information,
but not all data mining algorithms can handle them. We therefore have to fix the
data by replacing these variables with other attributes.

4.1.6 Step Six: Transform Data to Bring Information to the Surface


Once all the steps above have been completed, it is time to bring the information to
the surface by adding derived fields, combining multiple variables, creating ratios
and applying formulas such as logarithms. Different people spend different amounts
of money on a product; some buy more and some buy less. It is therefore wiser to
convert the money values to proportions of each person's spending.
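
As an illustration of this transformation, the sketch below converts absolute spending per category into each customer's share of their own total. The function name, category names and amounts are hypothetical.

```python
def spending_proportions(spending):
    """Convert absolute amounts per category into shares of the customer's total."""
    total = sum(spending.values())
    if total == 0:
        return {category: 0.0 for category in spending}
    return {category: amount / total for category, amount in spending.items()}


# Two hypothetical customers with very different budgets.
customer_a = {"alcohol": 50.0, "juice": 25.0, "food": 25.0}
customer_b = {"alcohol": 500.0, "juice": 250.0, "food": 250.0}
shares_a = spending_proportions(customer_a)
shares_b = spending_proportions(customer_b)
print(shares_a)
print(shares_b)
```

Although customer B spends ten times as much as customer A, both have the same spending profile once the amounts are expressed as proportions, which is the pattern the mining step should see.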

4.1.7 Step Seven: Build Models


A sample model based on the sample data used in Chapter 3 is shown below.

Figure 4.7 Data Mining Model.

The diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed.

4.1.8 Step Eight: Deploy Models


Deploying a model means moving it from the data mining environment to the scoring
environment. Once a model has been created, it can be used to make predictions for
new data. The model is built using historical customer data. This process is
illustrated below:

Figure 4.8 Data Mining Process Model.

The process of making predictions for data is called scoring. The process of using
the model is different from the process that creates it. After it is created, a
model is used multiple times to score different databases. For example, it can be
used to predict the probability that a customer will purchase an item during a sale.

Figure 4.9 Data Mining Scoring Process Model.

In the end, the model generates a prediction between 0 and 1 as its output, which is
also known as a score.

Figure 4.10 Scoring Prediction.

Chapter 5
5.0 IMPLEMENTATION

5.1 Data Mining Process


The data mining process has 8 steps.
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Analyse the data.
4. Create a model set.
5. Fix problems with the data.
6. Transform data.
7. Build models.
8. Deploy models.
5.1.1 Translate the business problem into a data mining problem

Example Scenario

A shopping centre wants to know about its sales for the past 5 months so that it
can forecast and achieve its target sales for the coming months. Below are the
specific goals:

List products whose sales are at risk if we discontinue beer sales.

Identify which products should be promoted in the coming months.

5.1.2. Select appropriate data


Data Cleaning Process (Before)

Figure 5.1 Appropriate data.

Above is a CSV file that contains 1000 customer profiles for testing purposes.
These data contain errors and inconsistencies, and some records lack attribute
values. A data cleaning procedure is needed before testing to fill in the missing
values, smooth noisy data, identify or remove outliers and resolve inconsistencies.

Included fields are:

userID

smoker

drink_level

dress_preference

ambience

transport

marital_status

hijos

birth_year

interest

personality

religion

activity

color

weight

budget

height

Upayment

Fcuisine

5.1.3 Analyse the data


This step examines the dataset in order to understand a data file from a new
source, using the Weka ARFF Viewer and the Visualize tab of the Weka Explorer.

Figure 5.2 Analyse data in ARFF Viewer.

Figure 5.3 Data Visualize

Figure 5.4 Visualization of data by smoker and drink_level.

5.1.4 Create a model set

Figure 5.5 Prediction Model (bar chart titled "Amount of drinks sold for the past 5
months"; x-axis: Month; y-axis: Amount of drinks).

The model set is created to predict the amount of drinks sold, based on the sales
for the past 5 months in the data set. When making a prediction, the predictive
model uses data from the past, finding patterns to make predictions about the
future. From the model set, we found that alcoholic drinks had the highest sales
during the 5-month period. Thus, we should not discontinue beer sales. We can run
promotions for non_alcohol and juice during the 3rd and 4th months to boost their
sales.

5.1.5 Fix problems with the data


Data Cleaning Process (After)

Figure 5.6 Fixed dataset.

The figure above shows the fixed dataset after the data cleaning process. Variables
such as address, post, telephone number and email are useful information, but not
all the data mining algorithms used in this project can handle them. We therefore
have to choose attributes that can be used with association rules and fix the data
by replacing the unsuitable variables with other attributes.

5.1.6 Transform data

Figure 5.7 Transformed data.

Comparing Figure 5.7 with the previous Figure 5.6, there are some changes to the
income attribute. Association rules are unable to read numeric data, so we have to
convert the numeric data into nominal data. The income attribute values are
converted to low, medium or high instead of numbers.
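
The conversion from numeric to nominal values can be sketched as a simple binning function. The report does not give the cut-off values that were used, so the thresholds below are hypothetical, as is the function name.

```python
def discretize_income(income, low_upper=2000, medium_upper=5000):
    """Map a numeric income to a nominal label.

    The two thresholds are hypothetical; the report does not state the cut-offs.
    """
    if income < low_upper:
        return "low"
    if income < medium_upper:
        return "medium"
    return "high"


incomes = [1500, 3200, 8000]  # hypothetical numeric values
income_labels = [discretize_income(value) for value in incomes]
print(income_labels)
```

Once every numeric attribute has been replaced by labels like these, the dataset contains only nominal attributes and can be passed to the Apriori algorithm.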

5.1.7 Build models

Figure 5.8 Data Mining Model.

The diagram illustrates the flow of data when a mining structure is processed and
when a mining model is processed. The data are filtered to create three models.
When building a model, we can use parameters to adjust the algorithm and apply
filters to the dataset to produce different results. The mining model object
contains summaries and patterns that can be used for prediction. The three models
are shown in the figures below:

Figure 5.9 Model 1 (bar chart of drinks for religion = non_muslim; categories:
alcohol, non_alcohol, juice; data labels: 374, 155, 107).

Figure 5.10 Model 2 (bar chart of shopping_cart for food_preference = non_halal;
categories: beef, chicken, pork; data labels: 104, 96, 84).

Figure 5.11 Model 3 (bar chart of payment for 1000 customers; categories: cash,
credit card, debit card; data labels: 618, 349, 33).


5.1.8 Deploy model


The last step in the data mining process is to deploy the models that performed
best to a production environment. The models are used to create predictions, which
can then be used to make business decisions. The models are updated dynamically as
more data comes into the organization.

5.2 Apriori Algorithm Source Code

Figure 5.12 Original Code Part 1

Figure 5.13 Modified Code Part 1

Figure 5.14 Original Code Part 2

Figure 5.15 Modified Code Part 2

5.3 Import dataset into WEKA


The same dataset was imported into Weka to compare the results of the original
Apriori algorithm and the modified Apriori algorithm. The figures below show the
results:

Original

Figure 5.16 Result of original code.

Ten association rules were generated. The total time is 47 ms.

Modified

Figure 5.17 Result of modified code.

Ten association rules were generated. The total time is 44 ms. The runtime of the
Apriori algorithm improved by 3 ms (about 6%) compared with the original code.

Chapter 6
6.0 CONCLUSION

6.1 Progress and Outcome


The first phase of this project has been completed successfully. First, the
problem was identified and a list of objectives to be achieved was set up. Then the
research stage started. This stage involved conducting a literature review as well
as a review of data mining techniques. Furthermore, research on Customer
Relationship Management (CRM) in data mining and on data mining application areas
was performed.
The analysis section began by analysing the current situation of the shopping
centre and the way people spend their money in the past, present and future. The
most important part was testing several data mining algorithms and comparing them
to find which method is best. Finally, the design section started by designing a
data mining process and model set.
The next section after design is implementation. The implementation began with
the data mining process (methodology), followed by building a model set for
prediction. The modified code was then compiled using Apache Ant so that it could
be used by WEKA. Lastly, the best rules were generated by importing the dataset
into WEKA, and the run times of the original and modified code were compared in
separate WEKA builds. The result shows that the run time improved with the modified
code.

6.2 Problems Encountered

Difficulty with using data mining tool, WEKA.

Difficulty with obtaining datasets.

Lack of experience in designing a model.

Lack of Internet resources.

Difficulty in modifying source code.

Time limitation, as several course requirements were due at the same time.

6.3 Future Planning


There is a new data mining software package named SPMF. SPMF is an open-source
data mining library written in Java, specialized in pattern mining. It offers
implementations of 86 data mining algorithms for sequential pattern mining,
association rule mining, itemset mining, sequential rule mining and clustering. I
hope to do some research on this software in the future and compare it with my
current project.

REFERENCES
Online Research
1) Data Mining: What is Data Mining
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
Date extracted: 24/6/2014
2) Definition of Data Mining
http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/
Date extracted: 24/6/2014
3) Investopedia explains Data Mining
http://www.investopedia.com/terms/d/datamining.asp
Date extracted: 24/6/2014
4) Oracle Data Mining Concepts
http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/process.htm#CHDFGCIJ
Date extracted: 30/6/2014
5) Resources of Data Mining
http://www.rithme.eu/?m=resources&p=resources&lang=en
Date extracted: 30/6/2014
6) Data Mining Techniques
http://www.ibm.com/developerworks/opensource/library/ba-data-miningtechniques/index.html
Date extracted: 6/7/2014
7) Carry Out Data Mining and Machine Learning with Weka
http://www.opensourceforu.com/2014/03/carry-data-mining-machine-learningweka/
Date extracted: 10/7/2014
8) An Introduction to Data Mining
http://www.thearling.com/text/dmwhite/dmwhite.htm
Date extracted: 27/6/2014
9) How Business Can Benefit from Data Mining
http://www.tmcnet.com/topics/articles/2013/03/21/331429-how-businessesbenefit-from-data-mining.htm
Date extracted: 27/6/2014
10) An Overview of Data Mining Techniques
http://www.thearling.com/text/dmtechniques/dmtechniques.htm
Date extracted:
11) Data Mining Techniques
http://www.uta.edu/faculty/sawasthi/Statistics/stdatmin.html#index
Date extracted: 7/7/2014
12) Data Mining Classification
http://www.tutorialspoint.com/data_mining/dm_classification_prediction.htm
Date extracted: 17/7/2014
13) Data Mining System
http://www.tutorialspoint.com/data_mining/dm_systems.htm
Date extracted:
14) Data Mining Process Model
http://www.rithme.eu/?m=resources&p=dmmethod&lang=en
Date extracted:
15) CRM Customer Relationship Management
http://www.webopedia.com/TERM/C/CRM.html
Date extracted: 1/8/2014
16) What is CRM?
http://searchcrm.techtarget.com/definition/CRM, posted by Margaret Rouse.
Date extracted: 11/8/2014
17) Data Mining and Customer Relationships
http://www.thearling.com/text/whexcerpt/whexcerpt.htm, by Kurt Thearling.
Date extracted: 11/8/2014
18) A Review of Data Mining Tools in Customer Relationship Management
http://www.tlainc.com/articl149.htm, Journal of Knowledge Management
Practice, Vol. 9, No. 1, March 2008 - Jayanthi Ranjan, Institute of
Management Technology, Ghaziabad, Vishal Bhatnagar, Indraprastha
University, Delhi.
Date extracted: 19/8/2014
19) Data Mining for Shopping Centres Customer Knowledge Management Framework
http://bura.brunel.ac.uk/bitstream/2438/1471/1/KMSCBasedOnChapshortV5.pdf
Date extracted: 30/8/2014
20) Customer Classification And Prediction Based On Data Mining Technique
http://www.ijetae.com/files/Volume2Issue12/IJETAE_1212_58.pdf
Date extracted: 14/8/2014
21) Data Mining Techniques: For Marketing, Sales, and Customer Relationship
Management
http://books.google.com.my/books?id=AyQfVTDJypUC&pg=PA162&lpg=PA162&dq=Membership+Supermarket%27s+Customer+in+data+mining&source=bl&ots=KWFyqsQYyK&sig=UyhkDWZ2kHDBxXVtW9nx5SnTIo&hl=en&sa=X&ei=cZ_8U5_2KoWE8gW9_4CADA&redir_esc=y#v=onepage&q=Membership%20Supermarket's%20Customer%20in%20data%20mining&f=false
Date extracted: 13/8/2014
22) How Do Supermarkets Use Your Data?
http://www.select-statistics.co.uk/article/blog-post/how-do-supermarkets-useyour-data
Date extracted: 29/8/2014
23) What is the CRISP-DM methodology?

76

http://www.sv-europe.com/crisp-dm-methodology/ Date

extracted:21/8/2014
24) Association Rules Apriori Algorithm
https://fenix.tecnico.ulisboa.pt/downloadFile/3779571250083/licao_9.pdfDate
extracted: 29/9/14
25) Data Mining Applications & Trends
http://www.tutorialspoint.com/data_mining/dm_applications_trends.htm Date
extracted: 10/7/2014
26) GitHub
https://github.com/jashmenn/apriori

Date extracted: 12/1/2015


27) Association Mining with Weka
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/associate.html
Date extracted: 12/1/2015
28) Association Mining with Weka
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/associate.html
Date extracted: 20/1/2015
29) AprioriItemset Generation
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html
Date extracted: 20/1/2015
30) Pentaho Data Mining
http://wiki.pentaho.com/display/DATAMINING/Apriori
Date extracted: 20/1/2015

76

31) SPMF
http://www.philippe-fournier-viger.com/spmf/index.php?link=download.php
Date extracted: 21/1/2015
32) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
Date extracted: 20/1/2015
33) All My Brain
http://allmybrain.com/2007/11/12/implementing-the-apriori-data-miningalgorithm-with-javascript/Date extracted: 12/1/2015
34) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
Date extracted: 18/1/2015
35) stackoverflow
http://stackoverflow.com/questions/17125742/creating-k-itemsets-from-2itemsetsDate extracted: 16/1/2015
36) compilr
https://compilr.com/soniaj/apriori/Project.java
Date extracted: 22/1/2015
37) Apache Ant - Tutorial
http://www.vogella.com/tutorials/ApacheAnt/article.html
Date extracted: 23/1/2015
38) Uregina

76

http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/Apriori.javaDate
extracted: 22/1/2015

Reference Book
1) Data Mining Practical Machine Leaning Tools and Techniques Second Edition
by Ian H. Witten, Department of Computer Science, University of Waikato and
Eibe Frank, Department of Computer Science, University of Waikato.

APPENDIX
Project 1 Gantt Chart

Semester 2

Weekly columns: 21-Nov, 28-Nov, 5-Dec, 12-Dec, 19-Dec, 26-Dec, 2-Jan,
9-Jan, 16-Jan, 23-Jan, 30-Jan

No.  Activities
1    In-depth research on the Apriori algorithm
2    Select appropriate data
3    Analyse and prepare dataset for simulation
4    Modify Apriori algorithm
5    Validate model
6    Documentation

Project 2 Gantt Chart
