
Data Mining Term Project

Machine Learning with WEKA

Weka Explorer Tutorial for Version 3.4.3
Svetlana S. Aksenova
Department of Computer Science
California State University, Sacramento
Fall 2004

Machine learning methods for data mining
Use techniques from computer science, statistics
and probability, and data visualization to search for
patterns and relationships in large data sets
Allow large amounts of data to be analyzed
automatically
The results of the analysis help make predictions
faster and more accurately
The results of the analysis help make decisions
faster and more accurately

About WEKA
Developed by the University of Waikato in New
Zealand
Open source software issued under the GNU
General Public License
WEKA is a data mining system written in Java
Implements data mining algorithms
Compatible with most computer platforms
Algorithms are applied to a dataset through either
the command line or the graphical user interface

Introduction to the Tutorial


Created to help in the learning process
Consists of 8 parts:
Introduction
Launching WEKA
Preprocessing Data
Building Classifiers
Clustering Data
Finding Associations
Attribute Selection
Data Visualization

Launching WEKA
GUI Chooser: the main menu

Preprocessing
Data can be read from a
Local filesystem (in ARFF, CSV, C4.5, binary formats)
URL
SQL database (using JDBC)

File conversion
Preprocessing window
Preprocessing tools - filters

File Conversion

Excel -> CSV -> ARFF
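
The CSV-to-ARFF step can be illustrated outside WEKA. The sketch below is not WEKA's converter; it is a minimal Python illustration of what the conversion involves, assuming a header row, no missing values, and no values that need quoting. The function name csv_to_arff and the three sample rows are hypothetical.

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list distinct values in order of first appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(distinct)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

# Illustrative rows only, not the contents of any file used in the tutorial.
sample = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
arff = csv_to_arff(sample, "weather")
print(arff)
```

The header section declares one attribute per CSV column, and the rows are copied unchanged under @data.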

Open File (from the local filesystem)

Open File (from a website)

http://gaia.ecs.csus.edu/~aksenovs/weather.arff

Preprocessing Window

Setting Filters
WEKA contains filters for discretization,
normalization, resampling, attribute selection,
transformation and combination of attributes.
Some techniques, such as association rule mining,
can only be performed on categorical data.
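
To make the idea of discretization concrete, here is a minimal sketch of equal-width binning, one common way to turn a numeric attribute into a categorical one. It is an illustration in plain Python, not the code of any WEKA filter; the function name, the bin labels, and the sample temperatures are assumptions made for the example.

```python
def equal_width_bins(values, bins):
    """Map each numeric value to one of `bins` equal-width interval labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in one bin
        else:
            # The maximum value would land in bin `bins`, so clamp it.
            index = min(int((v - lo) / width), bins - 1)
        labels.append("bin%d" % (index + 1))
    return labels

# Hypothetical temperature readings used only for illustration.
temperatures = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
labels = equal_width_bins(temperatures, 3)
print(list(zip(temperatures, labels)))
```

After this step each value is a category label, which is the form that association rule mining requires.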

Filter Configuration Options


Right-click on the filter

Building Classifiers
Choosing a classifier: J48 (C4.5)

Setting Test Options

Output the Result


The weather data in weather.arff was used for classification

Analyzing Results

Visualizing Results

Tree Visualizer

Error Visualizer

Error Visualizer (contd)

Exercise
Given at the end of the section
Classification Exercise
Use the ID3 algorithm to classify the weather data
from the weather.arff file. Perform initial
preprocessing and create a version of the
initial dataset in which all numeric attributes
are converted to categorical data.

Clustering Data
The clustering schemes available in WEKA are
k-Means, EM, Cobweb, X-means, FarthestFirst.
The customer data in customers.arff was used for clustering

Clustering Data (contd)


Choosing clustering scheme
k-means
5 clusters
Setting test options
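
For readers who want to see what the k-means scheme does, the following is a minimal sketch in plain Python, not WEKA's implementation. It assumes squared Euclidean distance and seeds the centroids with the first k points so that the run is repeatable; the points are hypothetical, and k is set to 2 only to keep the example small (the slides use 5 clusters).

```python
def kmeans(points, k, iterations=100):
    """Basic k-means: alternate assignment and centroid update until stable."""
    # Assumption for this sketch: the first k points seed the centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

The returned assignment gives the cluster index of every instance, which corresponds to the extra cluster attribute that appears when clustering results are saved to a file.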

Analyzing results

Visualizing Results

Results of Clustering in ARFF File

Exercise
Given at the end of the section
Clustering Exercise
Apply the k-means algorithm to the bank data from
the bank.arff file. Perform initial
preprocessing and create a version of the
initial data set in which the ID field is
removed and the "children" attribute
is converted to categorical data.

Finding Associations
Apriori
Works only with discrete data
Identifies statistical dependencies between
groups of attributes
Grocery store data from the grocery.arff file
was used, with minimum confidence 40% and
minimum support 30%.
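
Support and confidence can be worked through on a toy example. The sketch below is a simplified level-wise search in the spirit of Apriori, written in plain Python; it is not WEKA's implementation, and it omits the subset-pruning optimization. The five baskets are hypothetical and are not taken from grocery.arff. The thresholds match the 30% support and 40% confidence quoted above.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    while current:
        level = {s: support(s) for s in current}
        level = {s: sup for s, sup in level.items() if sup >= min_support}
        frequent.update(level)
        # Join step: merge frequent sets of size k into candidates of size k + 1.
        size = len(next(iter(level))) + 1 if level else 0
        current = list({a | b for a in level for b in level if len(a | b) == size})
    return frequent

def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = sup / frequent[lhs]  # support(lhs and rhs) / support(lhs)
                if confidence >= min_confidence:
                    found.append((set(lhs), set(itemset - lhs), sup, confidence))
    return found

# Hypothetical shopping baskets used only for illustration.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]
frequent = frequent_itemsets(baskets, min_support=0.3)
found = association_rules(frequent, min_confidence=0.4)
for lhs, rhs, sup, conf in found:
    print(sorted(lhs), "=>", sorted(rhs), "support=%.2f confidence=%.2f" % (sup, conf))
```

Support is the fraction of baskets containing the whole itemset; confidence is that support divided by the support of the rule's left-hand side.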

Setting test options


Analyzing Results

Exercise
Given at the end of the section
Association Rules Exercise
Use the Apriori algorithm to generate association
rules for the Iris data from the iris.arff file.
Perform initial preprocessing and create a
version of the initial data set in which the
numeric attributes are converted to
categorical data.

Attribute Selection
searches through all possible combinations of
attributes
finds which subset of attributes works best for
prediction.
Consists of two parts:
a search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
an evaluation method: correlation-based, wrapper,
information gain, chi-squared
The weather data from the weather.arff file was used
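
As one concrete pairing of a search method and an evaluation method, the sketch below ranks attributes by information gain. It is a plain Python illustration, not WEKA's evaluator. The 14 rows are the nominal weather data as it is commonly distributed, reproduced here on the assumption that it matches; treat the rows as illustrative rather than as the exact contents of weather.arff.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

names = ["outlook", "temperature", "humidity", "windy", "play"]
raw = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
rows = [dict(zip(names, r)) for r in raw]

# Ranking search: score every attribute independently, then sort by score.
ranking = sorted(
    ((a, information_gain(rows, a, "play")) for a in names[:-1]),
    key=lambda pair: pair[1],
    reverse=True,
)
for attribute, gain in ranking:
    print("%-12s %.3f" % (attribute, gain))
```

Ranking scores each attribute on its own, so unlike a subset search such as best-first it does not account for interactions between attributes.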

Attribute Selection (contd)

Data Visualization
visualize a 2-D plot of the current working relation
determine difficulty of the learning problem

Data Visualization (contd)

Selecting Instances
A group of points on the graph can be selected in
four ways:
1. Select Instance
2. Rectangle
3. Polygon
4. Polyline

Select Instance

Rectangle

Polygon

Polyline

Why should we use WEKA?


You can solve a machine learning
problem with minimal programming
WEKA includes
reading of data,
implementation of filtering,
result evaluation

Performance
Has not been evaluated in this project
Can it process large ARFF files (GB)?

An answer was found on wekalist:
it can, for schemes that are either
incrementally trainable or can be made to be.

Future Work
Has not been done due to time constraints
Simple CLI provides a simple command-line
interface and allows direct execution of
Weka commands.
KnowledgeFlow is a Java-Beans-based
interface for setting up and running machine
learning experiments.

References
1. I. Witten, E. Frank, Data Mining: Practical Machine
Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann Publishers, 2000.
2. R. Kirkby, WEKA Explorer User Guide for version 3-3-4,
University of Waikato, 2002.
3. Weka Machine Learning Project,
http://www.cs.waikato.ac.nz/~ml/index.html.
4. E. Frank, Machine Learning with WEKA, University of
Waikato, New Zealand.
5. B. Mobasher, Data Preparation and Mining with WEKA,
http://maua.cs.depaul.edu/~classes/ect584/WEKA/association_rules.html,
DePaul University, 2003.
6. M. H. Dunham, Data Mining: Introductory and Advanced
Topics, Prentice Hall, 2002.
