
References

1. Witten, Ian H. and Eibe Frank. Data mining: practical
machine learning tools and techniques, 2nd edition.
Morgan Kaufmann Publishers. 2005.

2. Han, Jiawei, Micheline Kamber, and Jian Pei. Data
mining: concepts and techniques, 3rd edition. Morgan
Kaufmann Publishers. 2012.

3. Liu, Bing. Web data mining: exploring hyperlinks,
contents, and usage data. Springer. 2007.

Lecture plan
RPKPS (Rencana Program Kegiatan
Pembelajaran Semester: Semester Learning Activity Program Plan)

General Instructional Objectives

To introduce students to the basic concepts and techniques of data
mining.
To develop skills in using recent data mining software to solve
practical problems.
To gain experience in independent study and research.

Lecture plan
Specific Instructional Objectives per Topic (Subject Matter)

Understand the elements detailed below.

What is data mining
Input: Concepts, instances, attributes
Output: Knowledge representation
Algorithms: The basic methods
Credibility: Evaluating what has been learned
Advanced Data Mining: Implementation
Data Transformation
WEKA Data Mining Implementation

Today's topic
What is data mining:
(1) data mining and machine learning;
(2) simple examples;
(3) machine learning and statistics;
(4) generalization as search.

What Is Data Mining?

Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown,
and potentially useful) information or patterns from data in large
databases

Alternative names:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.

Source: Jiawei Han's slide
9/13/2013
Data Mining: Concepts and Techniques

Why data mining?

The motivation: the data explosion problem
Automated data collection tools and mature database technology
lead to big data in databases and other repositories.
Data rich but information poor!

Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing (OLAP)
Data mining: extraction of interesting knowledge (rules, patterns,
constraints) from data in large databases

Evolution of Database Technology

1960s:
Data collection, change from primitive file processing to database
systems

1970s:
Relational data model, relational DBMS implementation

1980s:
RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial, scientific,
engineering, etc.)

1990s-2000s:
Data mining and data warehousing, multimedia databases, and
Web databases

How about machine learning?

Data mining is defined as the process of
discovering useful patterns, automatically or
semi-automatically, in large quantities of data.
Machine learning, by contrast, builds on the notion of learning:
Learning (noun): the cognitive process of acquiring skill
or knowledge (WordWeb 6.6)
Thus, machine learning can be thought of as the
machine (i.e., a computer) going through a process of
acquiring skill or knowledge.

So, how are data mining and machine learning related?

Potential Applications (1)

Data mining can be applied across multidisciplinary fields, involving:
machine learning,
statistics,
databases,
artificial intelligence, and
pattern recognition

Web usage mining
Text mining

Simple example
Contact lens prescription

The patterns can be:
Classification, presented in a decision tree
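The decision tree drawn on the slide did not survive extraction. As a rough sketch only, a classification tree can be read as a chain of nested tests. The attribute names and the tree shape below are assumptions taken from the well-known contact lens example in Witten and Frank's textbook, not from this slide, and the function name is hypothetical.

```python
def recommend_lens(tear_rate: str, astigmatism: str, prescription: str) -> str:
    """Classify a patient as 'none', 'soft', or 'hard' by walking the tree."""
    # Root test: a reduced tear production rate rules out contact lenses.
    if tear_rate == "reduced":
        return "none"
    # Tear production is normal: split on astigmatism next.
    if astigmatism == "no":
        return "soft"
    # Astigmatic patients: the spectacle prescription decides.
    if prescription == "myope":
        return "hard"
    return "none"


# One patient: normal tear production, no astigmatism, myopic prescription.
EXAMPLE = recommend_lens("normal", "no", "myope")
```

Each path from the root to a leaf corresponds to one classification rule, which is why a tree is a compact way to present this kind of pattern.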

More realistic example: vertebral column

pelvic_incidence  pelvic_tilt  lumbar_lordosis_angle  sacral_slope  pelvic_radius  degree_spondylolisthesis  class
63.0278175        22.55258597  39.60911701            40.47523153   98.67291675    -0.254399986              Hernia
39.05695098       10.06099147  25.01537822            28.99595951   114.4054254    4.564258645               Hernia
68.83202098       22.21848205  50.09219357            46.61353893   105.9851355    -3.530317314              Hernia
69.29700807       24.65287791  44.31123813            44.64413017   101.8684951    11.21152344               Hernia
49.71285934       9.652074879  28.317406              40.06078446   108.1687249    7.918500615               Hernia
40.25019968       13.92190658  25.1249496             26.32829311   130.3278713    2.230651729               Hernia
53.43292815       15.86433612  37.16593387            37.56859203   120.5675233    5.988550702               Hernia
45.36675362       10.75561143  29.03834896            34.61114218   117.2700675    -10.67587083              Hernia
43.79019026       13.5337531   42.69081398            30.25643716   125.0028927    13.28901817               Hernia
36.68635286       5.010884121  41.9487509             31.67546874   84.24141517    0.664437117               Hernia
49.70660953       13.04097405  31.33450009            36.66563548   108.6482654    -7.825985755              Hernia

The last column (class) is the class attribute.

Data mining process

As a process, data mining encompasses three main steps:

Pre-processing: dealing with unsuitable raw data
Data mining: applying a data mining method
Post-processing: interpreting the mined patterns
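The three steps can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the records are invented, the mining step is a simple majority-class-per-value rule learner chosen for brevity, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical raw records: (outlook, play); None marks a missing value.
RAW = [
    ("sunny", "no"),
    ("sunny", "no"),
    ("overcast", "yes"),
    (None, "yes"),
    ("rainy", "yes"),
    ("rainy", "no"),
    ("rainy", "yes"),
]


def preprocess(records):
    """Pre-processing: drop records that contain missing values."""
    return [r for r in records if None not in r]


def mine(records):
    """Data mining: for each attribute value, find its majority class."""
    counts = {}
    for value, label in records:
        counts.setdefault(value, Counter())[label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def postprocess(rules):
    """Post-processing: render the mined patterns as readable rules."""
    return [
        f"IF outlook = {value} THEN play = {label}"
        for value, label in sorted(rules.items())
    ]


RULES = postprocess(mine(preprocess(RAW)))
```

Keeping the steps as separate functions mirrors the slide: the mining method can be swapped without touching the cleaning or the presentation of results.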

Architecture of a Typical Data Mining System

[Figure: layered architecture, listed from top to bottom]
Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Data cleaning & data integration, filtering
Databases, Data Warehouse

Side component: Knowledge base

Another example: Directed marketing

(S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.)

Problem:
A vast and growing number of marketing campaigns
A globally competitive world
Mass campaigns are ineffective

Solution:
Directed campaigns with a strict and rigorous selection of
contacts.
Focus on targets that will presumably be keener on that specific
product/service.
More efficient, with reduced cost and time.

The dataset:
A Portuguese marketing
campaign related to bank
deposit subscription.

The collected dataset covers
17 campaigns that occurred
between May 2008 and
November 2010,
corresponding to a total of
79354 contacts.
For each contact, a large number of attributes
was recorded, together with
the target variable (class attribute).
There were 6499 successes (8%
success rate).

Steps
1. Goal definition
To predict whether a client will subscribe to the deposit
(a classification task)

2. Simple data pre-processing (Data Preparation phase)
Non-conclusive instances were discarded, leaving a total of 55817
contacts.
Attribute reduction, leaving 29 attributes and 1 class attribute.
Instances with missing values were discarded, leaving 45211
instances (5289 of which were successful, an 11.7% success rate).

3. Data mining step (Modeling phase), using Naive Bayes (NB),
decision trees (DT), and support vector machines (SVM)
The dataset was randomly divided into training (2/3) and test (1/3) sets.

4. Evaluation of the model
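The success rates and the random 2/3 to 1/3 split in the steps above can be checked with a few lines. This is a minimal sketch, not the authors' code: the function names, the fixed seed, and the use of instance ids in place of the real records are all assumptions made for illustration.

```python
import random


def success_rate(successes: int, total: int) -> float:
    """Return the percentage of contacts that ended in a subscription."""
    return 100.0 * successes / total


def split_train_test(instances, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the instances and cut it into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Figures quoted on the slides.
RAW_RATE = success_rate(6499, 79354)    # before pre-processing
CLEAN_RATE = success_rate(5289, 45211)  # after pre-processing

# Instance ids stand in for the real records, which are not reproduced here.
TRAIN, TEST = split_train_test(range(45211))
```

Shuffling before cutting is what makes the division random; fixing the seed only makes the illustration repeatable.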

Conclusion
Call duration is the most
relevant feature, meaning
that longer calls tend to
increase successes.
In second place comes the
month of contact.
Success is most likely to
occur in the last month of
each quarter (March, June,
September and December).
Such knowledge can be
used to shift campaigns to
occur in those months.

Data Mining: On What Kind of Data?

Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories:
  Object-oriented and object-relational databases
  Spatial databases
  Time-series data and temporal data
  Text databases and multimedia databases
  Heterogeneous and legacy databases
  WWW

Functionality
Knowledge produced by data mining

In data mining terms, knowledge means a useful pattern.

The pattern should be:
Useful
Valid
Understandable

Pattern types that can be produced by data mining methods:
Frequent patterns, associations, correlations
Data characterization and discrimination
Classification and prediction
Cluster analysis

Frequent pattern, association, correlation

Patterns that occur frequently in data:
Frequent itemsets
Frequent subsequences
Frequent substructures

These lead to associations and correlations within the data.
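To make "frequent itemset" concrete, the sketch below counts every itemset of one or two items in a handful of invented shopping baskets and keeps those that reach a minimum count. It is a brute-force illustration, not an efficient algorithm such as Apriori; the data, the threshold, and the function name are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions, invented for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "eggs"},
    {"bread", "butter", "milk"},
]


def frequent_itemsets(transactions, min_count, max_size=2):
    """Count every itemset up to max_size and keep those seen often enough."""
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives each itemset a single canonical tuple form.
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


FREQUENT = frequent_itemsets(TRANSACTIONS, min_count=3)
```

Pairs such as bread with butter fall below the threshold and are dropped, while pairs that survive are the candidates from which association rules are later derived.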

Classification and prediction

Cluster analysis

Are All the Discovered Patterns Interesting?

A data mining system may generate thousands of patterns, and not all of them
are interesting.

Suggested approach: human-centered, query-based, focused mining

Interestingness measures: A pattern is interesting if it is easily
understood by humans, valid on new or test data with some degree of
certainty, potentially useful, novel, or validates some hypothesis that a user
seeks to confirm.

Objective vs. subjective interestingness measures:

Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.

Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty,
actionability, etc.
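The two objective measures named above can be computed directly from transaction counts. The sketch below uses the usual definitions: the support of a rule A => B is the fraction of transactions containing both A and B, and its confidence is that support divided by the support of A. The transactions and function names are invented for illustration.

```python
# Hypothetical transactions, invented for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "eggs"},
    {"bread", "butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as a ratio of supports."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Rule: butter => milk
RULE_SUPPORT = support({"butter", "milk"}, TRANSACTIONS)
RULE_CONFIDENCE = confidence({"butter"}, {"milk"}, TRANSACTIONS)
```

A user would typically set minimum thresholds on both values and discard rules that fall below either one.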


Can We Find All and Only Interesting Patterns?

Search for only interesting patterns: Optimization

Can a data mining system find only the interesting patterns?

Approaches:
First generate all the patterns and then filter out the uninteresting
ones.
Generate only the interesting patterns (mining query optimization).

Machine learning and statistics

Both lie on a continuum of data analysis
techniques.
Some derive from the skills taught in standard
statistics courses;
others are more closely associated with algorithms
that have arisen out of computer science.

Generalization as search

One way of visualizing the problem of learning
(and one that distinguishes it from statistical
approaches) is to imagine a search through a
space of possible concept descriptions for one
that fits the data.
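The idea of searching a space of concept descriptions can be sketched on a toy problem. Below, a description is a conjunction of attribute tests in which each attribute is either fixed to a value or left as a wildcard; the search simply enumerates every description and keeps those that cover all positive examples and no negative ones. The attributes, examples, and names are hypothetical, and exhaustive enumeration is only feasible because the space is tiny.

```python
from itertools import product

# Hypothetical attribute domains and labeled examples, invented for illustration.
DOMAINS = {"outlook": ["sunny", "rainy"], "windy": ["yes", "no"]}
EXAMPLES = [
    ({"outlook": "sunny", "windy": "no"}, True),
    ({"outlook": "sunny", "windy": "yes"}, True),
    ({"outlook": "rainy", "windy": "yes"}, False),
]
ANY = "?"  # wildcard: the attribute may take any value


def covers(description, instance):
    """A description covers an instance if every non-wildcard test matches."""
    return all(v == ANY or instance[a] == v for a, v in description.items())


def consistent_descriptions(domains, examples):
    """Enumerate the whole description space and keep those that fit the data."""
    names = list(domains)
    choices = [domains[name] + [ANY] for name in names]
    fits = []
    for values in product(*choices):
        description = dict(zip(names, values))
        if all(covers(description, x) == label for x, label in examples):
            fits.append(description)
    return fits


FITS = consistent_descriptions(DOMAINS, EXAMPLES)
```

With two attributes there are only nine candidate descriptions; practical learners cannot enumerate the space and instead search it heuristically, for example from general to specific.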
