CASE STUDY Classification and Evaluation with a Decision Tree
Using RapidMiner Data Mining Software Application (open source)

1. Create Three Data Sets in your NewLocalRepository: Add three data
sets to your NewLocalRepository based on the three Excel files on the class
website titled Data_CR_FullSet.xlsx, Data_CR_Train.xlsx, and Data_CR_Test.xlsx, as
described in the Tutorial. The full data set contains 425 rows of data. The
train data set includes 400 rows from the full data set, and the test data set
includes the remaining 25 rows from the full data set.
2. Create a process model in RapidMiner using the Retrieve operator (with the
FullDataSet) and the Decision Tree operator as described in the Tutorial.
3. Questions: Using the two operators described above (Retrieve and Decision
Tree with the FullDataSet), run your model. View the decision tree and
example set results and answer the questions below.
a. Can you appropriately assess the accuracy of the model that you have
just generated?
No. We created the decision tree using the full data set, so no rows were
held out for testing. Scoring the model on the same data it was trained on
gives an optimistically biased accuracy that says little about how well it
predicts the credit standing of new customers.
b. Based on the decision tree results and the book terminology, Checking
Account is the root node.
c. If a customer has a high balance in their checking account, is older than
24, owns their home, and has had an account open for 12 months or fewer,
should they have a good or bad credit standing? Good
d. In finding the answer to c above, Own would be described as a
branch while Months Acct Open < 12 would be described as a leaf node.
(Hint: these are specific components of the tree, such as branch and leaf
node; see the class notes for details.)
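The path traced in question c can be read as a single if-then rule. The following is a minimal Python sketch of that one path only, not of the full tree that RapidMiner generates. The field names (checking, age, housing, months_open) and the value labels are hypothetical and chosen for illustration.

```python
def path_from_question_c(customer):
    """Return "Good" if the customer follows the path from question c, else None.

    Field names and value labels are hypothetical; only this single
    root-to-leaf path is encoded.
    """
    if (
        customer["checking"] == "High"        # root node test: Checking Account
        and customer["age"] > 24              # older than 24
        and customer["housing"] == "Own"      # follows the "Own" branch
        and customer["months_open"] <= 12     # account open 12 months or fewer
    ):
        return "Good"
    return None  # the other paths of the tree are not modeled here


example = {"checking": "High", "age": 30, "housing": "Own", "months_open": 10}
print(path_from_question_c(example))  # prints Good
```

Each condition corresponds to one split along the path, and the returned value corresponds to the class stored at the end of that path.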
4. Train and Test: Now we will create a new process where we train and test
the decision tree model. Create the model shown below. Drag both the Train
and the Test data sets to the process window. Then add the Decision Tree
operator as you did in the tutorial. The Apply Model operator can be found at
Modeling > Model Application > Apply Model.
[Figure: process window showing the Data_CR_Train data set and the Data_CR_Test data set]
5. Questions: In the results window, view the ExampleSet Results, then
select the Data View radio button.
a) Create a Confusion Matrix as shown in the book, in the class slides, and in
the Confusion Matrix example spreadsheet on the class website. Use the data
in the Credit Standing Column and the Prediction Column shown on the
ExampleSet Results tab, using the Data View radio button. The
spreadsheet shown should include 25 rows. The Credit Standing Column
and the Prediction Column are separated by two other columns that
include percentages. You can ignore these other two columns. See the
Confusion Matrix Example provided on the class website (Content Area >
Tutorials > Decision Tree) for a detailed example of how to create a
Confusion Matrix.
b) Calculate the True Positive Rate
50.0% Accuracy with good customers
c) Calculate the True Negative Rate
92.3% Accuracy with bad customers
d) Calculate the Accuracy of the model
72.0% Accuracy with all customers
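The three rates follow directly from the four cells of the confusion matrix. The sketch below shows the arithmetic, treating Good as the positive class. The cell counts are an assumption: they are not listed in this document, but they are the whole-number counts for 25 rows that are consistent with the reported 50.0%, 92.3%, and 72.0%.

```python
# Cell counts inferred from the reported rates (assumption, not copied from the source)
tp, fn = 6, 6    # good customers predicted Good / predicted Bad
tn, fp = 12, 1   # bad customers predicted Bad / predicted Good


def rate(numerator, denominator):
    """Return a percentage."""
    return 100.0 * numerator / denominator


tpr = rate(tp, tp + fn)                      # accuracy with good customers
tnr = rate(tn, tn + fp)                      # accuracy with bad customers
accuracy = rate(tp + tn, tp + fn + tn + fp)  # accuracy with all customers

print(f"TPR {tpr:.1f}%  TNR {tnr:.1f}%  Accuracy {accuracy:.1f}%")
# prints: TPR 50.0%  TNR 92.3%  Accuracy 72.0%
```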
e) Examine the two Excel data sets used in this model and comment on
whether the size of the Test data set is appropriate (based on information
in the textbook).
The Training data set has 400 rows and the Test data set has 25 rows.
According to the book, for a simple split it is a good idea to designate two-thirds of the data as the training set and the remaining one-third as the
test set. Here about 94% of the data is in the training set and only about
6% is in the test set. With so few test rows the accuracy estimate is
unreliable, so the size of the Test data set is inappropriate.
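The arithmetic behind this answer can be checked quickly. The sketch below computes the split shares from the row counts given above. It also adds a rough binomial standard error for an accuracy measured on 25 rows. That formula is a standard approximation brought in for illustration and is not taken from the textbook.

```python
import math

train_rows, test_rows = 400, 25
total = train_rows + test_rows

train_share = train_rows / total   # fraction of rows used for training
test_share = test_rows / total     # fraction of rows used for testing

# Each test row moves the measured accuracy by this many percentage points
step = 100.0 / test_rows

# Rough binomial standard error at the observed 72% accuracy (approximation)
p = 0.72
std_err = math.sqrt(p * (1 - p) / test_rows)

print(f"train {100 * train_share:.1f}%, test {100 * test_share:.1f}%")
print(f"one row = {step:.1f} points, standard error = {100 * std_err:.1f} points")
```

Under these assumptions a single misclassified row shifts the accuracy by 4 percentage points, which illustrates why 25 rows give a coarse estimate.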
6. Use Cross Validation to Evaluate a Decision Tree: Start by dragging the
FullDataSet to a new process window. Next drag the Validation operator to the
Process window. The Validation operator can be found at
Evaluation > Validation > X-Validation. Connect the Model and Averageable
output ports for the Validation operator to the Result ports.
[Figure: process window showing the Data_CR_FullSet data set]
Next, double-click on the blue icon in the bottom right corner of the Validation
operator (the subprocess icon). A split process window appears as shown
below, one for Training and one for Testing. Add the operators shown below
to the two windows. The Performance operator can be found at Evaluation >
Performance and Measurement > Classification and Regression > Performance
(Classification).
Be sure to connect the ports as shown in the screen prints above and below.
You can toggle back and forth between the Main Process and the Train/Test
split screen processes, by selecting the Process drop down box and selecting
Validation (or Retrieve).
7. Questions: Run the model. View the Tree and the Performance Vector results.
a) Using the Performance Vector tab, copy the Confusion Matrix results and
the Accuracy ratio below.
              true Good   true Bad   class precision
pred. Good    116         47         71.17%
pred. Bad     98          164        62.60%
class recall  54.21%      77.73%
Accuracy ratio = 65.86% +/- 5.20% (mikro: 65.88%)
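The derived figures in the Performance Vector can be rechecked from the four counts in the confusion matrix (116, 47, 98, 164). The following is a small sketch of that check. The dictionary layout is just one convenient representation, not RapidMiner output.

```python
# Rows are the predicted class, columns are the true class
matrix = {
    "Good": {"Good": 116, "Bad": 47},
    "Bad": {"Good": 98, "Bad": 164},
}
classes = ["Good", "Bad"]


def precision(cls):
    """Correct predictions of cls divided by all predictions of cls."""
    row = matrix[cls]
    return 100.0 * row[cls] / sum(row.values())


def recall(cls):
    """Correct predictions of cls divided by all rows that truly are cls."""
    column_total = sum(matrix[pred][cls] for pred in classes)
    return 100.0 * matrix[cls][cls] / column_total


total = sum(sum(row.values()) for row in matrix.values())
accuracy = 100.0 * sum(matrix[c][c] for c in classes) / total

for c in classes:
    print(f"{c}: precision {precision(c):.2f}%, recall {recall(c):.2f}%")
print(f"rows {total}, accuracy {accuracy:.2f}%")
```

The pooled accuracy of 280 correct rows out of 425 reproduces the 65.88% shown in parentheses. The 65.86% figure with its +/- 5.20% spread is an average over the individual validation folds, so it cannot be recomputed from the pooled matrix alone.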

b) How do these results compare to the Confusion Matrix and Accuracy result
that you generated in the Train and Test model above?
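The Train and Test model generated an accuracy of 72.0%, while cross
validation gave an accuracy of 65.86% +/- 5.20%. The two accuracies are
roughly equal: the Train and Test result sits just above the upper end of
the cross validation range (71.06%), so it looks only marginally better.
In both confusion matrices the model identifies bad customers more
accurately than good customers (92.3% versus 50.0% in Train and Test,
77.73% versus 54.21% in cross validation).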
c) Which model results would you rely upon and why?
We would rely on the cross validation results. Cross validation reduces the
bias associated with a simple split, where the data set is divided once into
training and test data. In cross validation the complete data set is split
into separate subsets of the same size, and each subset serves once as test
data, so every row is used for both training and testing. The resulting
accuracy estimate is therefore more reliable than the one from the Train
and Test model, which depends on only 25 test rows. In our outputs the two
accuracies are more or less equal, but on a new real data set we would expect
the cross validation estimate to be closer to the actual performance.
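The way cross validation partitions the data can be sketched in a few lines of Python. This is an illustration under stated assumptions: the number of folds (10) is not given in this document, and the shuffled partition below is a generic approach that may differ from the sampling RapidMiner uses. The function name k_fold_indices is hypothetical.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of (nearly) equal size
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(425, 10))  # 425 rows as in the full data set; 10 folds assumed
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)  # prints [43, 43, 43, 43, 43, 42, 42, 42, 42, 42]
```

Every row lands in exactly one test fold, so all 425 rows contribute to the accuracy estimate, compared with 25 rows in the simple Train and Test split.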
