
Short manual for WEKA

This document is a short manual for installing and using the basic functionality of the machine learning tool WEKA. It covers only a small part of WEKA's functionality; more information can be found online.

Installing and running WEKA


Download the appropriate WEKA software from the WEKA webpage and install it on
your laptop or desktop.

Mac users
For people using a Mac laptop: you also have to copy the folder weka-X-X-XX to the Applications folder.
You can run WEKA by (double-)clicking the WEKA icon.

Figure 1: WEKA startup screen: the Weka GUI Chooser

WEKA data format


Before actually using WEKA, one should know a little about its data format, ARFF. An ARFF file is a comma-separated file with a header. The header defines the relation (the name of the data set) together with the attributes (also called features). The last attribute is the class attribute and defines the class of each data sample. For an example, see Figure 2.

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa

Figure 2: WEKA arff format

Explorer
One of the basic applications for exploring data and classifiers is the Explorer. Click on the Explorer button in the Weka GUI Chooser window, see Figure 1. Then the following screen will appear:

Figure 3: WEKA Explorer window

Loading data
Click on the Open file button and select the arff data file you want to analyze. For this manual we select the iris.arff file, which can be found in the Weka data folder. On my Windows machine this is the folder C:\Program Files (x86)\Weka-3-6\data.
After opening the file the following screen appears, see Figure 4.

Figure 4: Initial Explorer screen Iris data file
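
Loading can also be scripted with WEKA's Java API. The following is a minimal sketch, assuming weka.jar is on the classpath; the file path is an example and should point at your own data folder:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadIris {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; adjust the path to your own Weka data folder.
        Instances data = DataSource.read("C:/Program Files (x86)/Weka-3-6/data/iris.arff");
        // ARFF does not mark which attribute is the class, so set it
        // explicitly; by convention it is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances loaded from relation "
                + data.relationName());
    }
}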

Exploring attributes (aka features)


In the bottom-left window you see the different attributes; for the selected attribute sepalwidth you see on the right some statistics and the class labels for different values of the attribute. This visualization on the bottom right gives an indication of the discriminating power of the selected attribute.
Exercise 1
Which of the attributes (except class) has the most discriminating power? For an
overview of all attributes click the Visualize all button.

A single attribute usually does not have much discriminating power on its own. To explore the discriminating power of two attributes, click on the Visualize tab of the Explorer screen, see Figure 4. The following window should appear.

Figure 5: Data visualization.


Clicking on a sub-window in the top window will enlarge it.
Exercise 2
Which two attributes together have the most discriminating power and why?
It follows from the data visualizations that the class Iris-setosa is very different from (can be separated very well from) the other two classes, but that there is some mixing of (confusion between) the classes Iris-versicolor and Iris-virginica. You should keep this in mind when applying classifiers to this problem.

Exploring classifiers
One of the basic classifiers we consider in the course is the Decision Tree. Click on the Classify tab in the Explorer window. The following window should appear,

Figure 6: Exploring classifiers


Click on the Choose button and select the classifier you want to explore, in this case J48, which is in the trees subfolder. Under Test options you see different options for evaluating the performance of the decision tree classifier J48; these options will be explained in the Machine Learning lectures. For this manual we go for the standard option, 10-fold cross-validation. Above the Start button the class attribute is selected, in this case the attribute class. Click on the Start button and Weka starts to evaluate the decision tree classifier J48 on the data using the 10-fold cross-validation evaluation strategy. When the evaluation is finished, the Classifier output window on the right is filled with evaluation metrics, see also Figure 7.

Figure 7: Evaluation metrics


Important aspects of this evaluation are the performance (correctly classified instances), 96% in this case, the precision and recall for the different classes, and the confusion matrix. In the confusion matrix the rows are the actual classes and the columns are the classes predicted by the classifier. For example, 49 instances of Iris-setosa are classified as Iris-setosa and 1 instance of Iris-setosa is classified as Iris-versicolor.
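
The same evaluation can be reproduced outside the GUI. The sketch below uses WEKA's Java API; the file path and the random seed are assumptions, so the exact fold assignment (and thus the confusion matrix) may differ slightly from the Explorer:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // example path, adjust as needed
        data.setClassIndex(data.numAttributes() - 1);  // last attribute is the class

        // 10-fold cross-validation of a J48 decision tree, as in the Explorer.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());      // correctly classified instances, etc.
        System.out.println(eval.toClassDetailsString()); // precision and recall per class
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}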

Exercise 3
Is the confusion in line with what was observed in the data visualization concerning the separation of the classes? Explain why!
Exercise 4
Apply a Naïve Bayes classifier (in the subfolder bayes) to the above data and give the performance and the confusion matrix. Which classifier is better, the decision tree J48 or the Naïve Bayes classifier? Explain why! Is there a difference between the two classifiers? (Hint: look at the confusion matrix.)
One problem with the Weka Explorer is that you have to remember and repeat all the steps by selecting tabs and clicking buttons. An alternative is to use the Knowledge Flow functionality of Weka.
Attribute selection
In some cases, for instance if one wants to classify emails or texts, one has a lot of features. In that case a classifier can benefit from attribute selection. This can be done by manually removing attributes in the Preprocess tab: select the attributes you want to remove and click the Remove button, see Figure 4. Another option is to apply an automatic feature selection method, which can be done in the Select attributes tab.

Figure 8: Attribute selection tab

Choose GainRatioAttributeEval as the Attribute Evaluator; this evaluates attributes based on their gain ratio, a normalized form of information gain. This choice will lead to Ranker being used as the search method. Clicking the Start button will lead to the following output:

Figure 9: Attributes ranked based on Information Gain


So attribute 4, petalwidth, has the highest score, 0.871, and attribute 2, sepalwidth, the lowest.
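
The same ranking can be computed from code. Here is a minimal sketch, again assuming weka.jar is on the classpath and using an example file path:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // example path, adjust as needed
        data.setClassIndex(data.numAttributes() - 1);

        // Rank all attributes with the gain ratio criterion, mirroring
        // GainRatioAttributeEval plus Ranker in the Select attributes tab.
        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new GainRatioAttributeEval());
        selection.setSearch(new Ranker());
        selection.SelectAttributes(data);
        System.out.println(selection.toResultsString());
    }
}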
Exercise 5
Based on this ranking, which two attributes would be the best to select? Compare this with the decision tree of Figure 7. What do you observe?
There are many more options for selecting attributes, but we will not discuss them in this manual.
More information
More information on Weka and the Explorer can be found in the Weka MOOC:
http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
Class 1 in particular is an introduction to the Explorer.

Knowledge flow
To start the Knowledge Flow, click the KnowledgeFlow button in the Weka GUI Chooser. The following window should appear,

Figure 10: Weka Knowledge Flow window


With the Knowledge Flow one can design and store machine learning pipelines for later reuse. The main part is the canvas, on which the pipeline is constructed using drag and drop and mouse clicks.
Arff Loader
The first component of the ML pipeline is the ArffLoader. Double-click the ArffLoader button and afterwards click on the canvas; the ArffLoader will appear there. Right-click the ArffLoader icon on the canvas, select Configure..., and select the iris flower data file iris.arff from the Weka data folder.
Class assigner
Under the Evaluation tab, select ClassAssigner and put it on the canvas, to the right of the ArffLoader. Right-click the ArffLoader and select dataSet. Connect the arrow to the ClassAssigner by moving the cursor to the ClassAssigner and then left-clicking. The canvas should now look similar to the canvas in Figure 11. Next, right-click the ClassAssigner, select Configure, and select (Nom)class as the class label.

Figure 11: Look of the canvas


Cross Validation
Next, put the CrossValidationFoldMaker on the canvas. Right-click the ClassAssigner, select dataSet, and connect it to the CrossValidationFoldMaker. By right-clicking the CrossValidationFoldMaker and selecting Configure, one can set the number of folds and the random seed. We leave these at their defaults.
Classification
Next, select your favorite classifier from the Classifiers tab, for instance J48, and put it on the canvas. Connect both the trainingSet and the testSet from the CrossValidationFoldMaker to the classifier.
Evaluation
From the Evaluation tab, select ClassifierPerformanceEvaluator, put it on the canvas, and connect the classifier to it by selecting batchClassifier. From the Visualization tab, select TextViewer, put it on the canvas, and connect the ClassifierPerformanceEvaluator to it by selecting text.
The complete pipeline should look similar to the pipeline in Figure 12.

Figure 12: Complete pipeline.


Now one can start the pipeline by right-clicking the ArffLoader and selecting Start loading. When the pipeline is finished, one can view the results by right-clicking the TextViewer and selecting Show results.
Saving
The pipeline can be saved for later use by clicking the save button at the top left of the window.
Extensions
The pipeline can be extended in many ways; see for instance Figure 13.

Figure 13: Extended pipeline, which can be used to compare a Naïve Bayes classifier with a decision tree.
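
A rough programmatic analogue of such an extended pipeline is sketched below, under the same assumptions as the earlier examples (example file path, assumed seed): it cross-validates both classifiers on the same folds and prints their accuracies and confusion matrices.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // example path, adjust as needed
        data.setClassIndex(data.numAttributes() - 1);

        // Using the same seed gives both classifiers identical cross-validation folds.
        for (Classifier c : new Classifier[] { new J48(), new NaiveBayes() }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
            System.out.println(eval.toMatrixString());
        }
    }
}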
