CT5813701
Assignment #5
Lecturer:
Prof. Jui-Sheng Chou
By:
Richard Antoni Gosno
M10605820
1. Introduction
This assignment continues the previous assignment, which demonstrated baseline
models in WEKA. That demonstration showed how to operate 4 baseline models:
Artificial Neural Network (ANN), Linear Regression (LR), Support Vector
Regression (SVR), and Classification and Regression Tree (CART). This
assignment demonstrates how to build ensemble models from those 4 baseline
models in WEKA. Four kinds of ensemble model are demonstrated:
Voting
Bagging
Stacking
Tiering
2. Problem Statement
This assignment uses the “Heating load data” as the dataset. The dataset contains
768 datapoints, each with 8 input variables (X1-X8) and 1 output variable (Y).
The inputs and output of the dataset are shown in Table 1. The statistical
description of the dataset is shown in Table 2.
Table 1. Dataset Input and Output Variables
Data no. X1 X2 X3 X4 X5 X6 X7 X8 Y1
1 0.9 563.5 318.5 122.5 7 2 0.4 1 36.47
2 0.76 661.5 416.5 122.5 7 5 0.4 1 40.43
3 0.76 661.5 416.5 122.5 7 5 0.1 4 32.31
4 0.86 588 294 147 7 5 0.1 2 25.36
5 0.9 563.5 318.5 122.5 7 4 0.1 5 28.03
6 0.86 588 294 147 7 5 0.25 5 29.39
7 0.62 808.5 367.5 220.5 3.5 4 0.1 2 12.97
8 0.82 612.5 318.5 147 7 2 0 0 17.05
… … … … … … … … … …
… … … … … … … … … …
… … … … … … … … … …
764 0.86 588 294 147 7 3 0.1 5 26.45
765 0.62 808.5 367.5 220.5 3.5 4 0.25 2 14.6
766 0.82 612.5 318.5 147 7 3 0.4 5 29.5
767 0.82 612.5 318.5 147 7 3 0.4 2 29.49
768 0.9 563.5 318.5 122.5 7 5 0.1 4 29.79
Artificial Intelligence for Project Management
3. Preparation
After getting information about the dataset, we can start the demonstration in
WEKA. Before anything else, please make sure to convert the “Heating load data”
from Excel format (*.xls) to csv format (*.csv), because WEKA can only read a
limited set of file formats. For this demonstration, it is safe to use the csv
format of the dataset. You can convert the file by opening it in Excel and
saving it as a csv file (*.csv), as seen in Figure 1.
After loading the dataset into WEKA, we are ready to create the ensemble models.
As mentioned before, this assignment creates 4 ensemble models: Voting,
Stacking, Bagging and Tiering.
After determining all the combinations for the voting model, we start by
creating the voting model for ANN+CART+SVR+LR. In the Classifier panel, click
“Choose”; the voting model is located in classifiers/meta/Vote (Figure 5).
By default, the voting model contains ZeroR as its baseline model. Since we do
not need ZeroR, click ZeroR in the list and delete it. Click “Choose” to insert
a baseline model into the voting model. The locations of the baseline models are
the same as in the previous assignment: ANN is located in
classifiers/functions/MultilayerPerceptron, LR is located in
classifiers/functions/LinearRegression, SVR is located in
classifiers/functions/SMOreg and CART is located in classifiers/trees/REPTree.
If you decide to use RBFKernel as the SVR kernel, click the SMOreg entry beside
the “Choose” button and change the kernel to RBFKernel (Figure 7). Click “Add”
after choosing each baseline model you want to include in the voting model. Once
all the baseline models (ANN+CART+SVR+LR) have been added, the classifier box
should look like Figure 8.
Click “OK” to close the object editor. Set the number of folds for
cross-validation to 10 in the Test options panel, and do not forget to click
“More options” and choose PlainText or CSV for the output predictions. After
that, the ensemble model is ready to run. The result of the voting model is
shown in Figure 9.
Save the model if you want to use it for future testing. The steps for the
other voting combinations are the same; the only difference is which baseline
models are added to the voting model. Repeat the steps above for the 10 other
voting models.
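The count of 11 voting models (ANN+CART+SVR+LR plus the 10 others) matches taking every subset of two or more of the 4 baseline models. That reading is an assumption, since the list of combinations is not reproduced in the text. The minimal Python sketch below, which runs outside WEKA, enumerates such subsets and illustrates how a voting ensemble that averages numeric predictions combines its members. The prediction values are made-up placeholders, not results from this assignment.

```python
from itertools import combinations

BASELINES = ["ANN", "CART", "SVR", "LR"]


def voting_combinations(models, min_size=2):
    """Return every subset of `models` with at least `min_size` members.

    Assumption: the voting models are all subsets of size >= 2.
    """
    combos = []
    for size in range(min_size, len(models) + 1):
        combos.extend(combinations(models, size))
    return combos


def vote_average(predictions_per_model):
    """Average the numeric predictions of several models, row by row."""
    n_models = len(predictions_per_model)
    return [sum(row) / n_models for row in zip(*predictions_per_model)]


combos = voting_combinations(BASELINES)
for combo in combos:
    print("+".join(combo))

# Placeholder predictions for three datapoints from two hypothetical models.
averaged = vote_average([[10.0, 20.0, 30.0], [14.0, 22.0, 26.0]])
print(averaged)
```

Listing the combinations up front makes it easier to check that no voting model is skipped or run twice when repeating the steps in the WEKA Explorer.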
Since we use the default settings of bagging, click “OK” to close the object
editor. By default, WEKA’s bagging model uses a bag size of 100 percent of the
training set. Set the number of folds for cross-validation to 10 in the Test
options panel, and do not forget to click “More options” and choose PlainText or
CSV for the output predictions. After that, the ensemble model is ready to run.
The result of bagging is shown in Figure 12.
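To illustrate what a bag size of 100 means, here is a minimal Python sketch of bootstrap aggregation. It is not WEKA's implementation. The base learner is replaced by a trivial mean predictor as a stand-in, the number of bags (10) and the seed are assumptions, and the target values are placeholders.

```python
import random


def bootstrap_sample(rows, bag_size_percent=100, rng=None):
    """Draw a sample with replacement whose size is a percentage of len(rows)."""
    rng = rng or random.Random()
    size = int(len(rows) * bag_size_percent / 100)
    return [rng.choice(rows) for _ in range(size)]


def bagged_prediction(train_targets, n_bags=10, seed=1):
    """Average the predictions of one stand-in learner per bag.

    Each stand-in learner simply predicts the mean target of its bag;
    a real run would train CART (or another baseline model) on each bag.
    """
    rng = random.Random(seed)
    bag_means = []
    for _ in range(n_bags):
        bag = bootstrap_sample(train_targets, 100, rng)
        bag_means.append(sum(bag) / len(bag))
    return sum(bag_means) / len(bag_means)


# Placeholder target values, not taken from the heating load dataset.
targets = [10.0, 20.0, 30.0, 40.0]
print(bagged_prediction(targets))
```

With a 100 percent bag size, each bag has as many rows as the training set, but because rows are drawn with replacement, some rows repeat and others are left out of a given bag.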
Repeat the steps above for the 3 other stacking models. A total of 4 stacking
results are collected in this assignment.
$$T = \frac{\max(Y) + \min(Y)}{n} \qquad (4.1)$$
Where:
T: threshold value
Max(Y): Maximum value of output attribute in dataset
Min(Y): Minimum value of output attribute in dataset
n: number of tiers
Since we use 3-tier classification, n is equal to 3. After determining the
threshold value, we can build a classification rule to classify our dataset.
The classification rule is shown in the equation below:
Apply this classification rule to the dataset. Open the dataset and calculate
the threshold value (Figure 17). The IF function in Excel makes it easy to
classify the dataset based on the classification rule above.
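The threshold calculation of equation (4.1) and an IF-style classification can be sketched in Python as follows. The cut-offs used here (Low below T, Medium from T up to 2T, High from 2T upward) are an assumption made for illustration, because the classification rule itself is not reproduced in the text. The sample values are placeholders.

```python
def tier_threshold(values, n_tiers=3):
    """Threshold T = (max(Y) + min(Y)) / n, as in equation (4.1)."""
    return (max(values) + min(values)) / n_tiers


def assign_tier(y, threshold):
    """Assign a tier label to one output value.

    ASSUMED rule (not confirmed by the source):
    Low if y < T, Medium if T <= y < 2T, High otherwise.
    """
    if y < threshold:
        return "Low"
    if y < 2 * threshold:
        return "Medium"
    return "High"


# Placeholder output values, not rows of the heating load dataset.
sample_y = [6.0, 12.0, 30.0, 42.0]
T = tier_threshold(sample_y)
labels = [assign_tier(y, T) for y in sample_y]
print(T, labels)
```

If the rule used in the assignment has different boundaries, only `assign_tier` needs to change; the threshold calculation stays the same.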
After determining the output class of every datapoint, we have to split the
dataset into training and testing subsets. Since the Excel file for generating
the cross-validation index was not provided in class, the writer uses a MATLAB
program to split the dataset automatically into training and testing subsets.
This assignment uses 10-fold cross-validation, which means that the dataset is
split into 10 parts. In each fold, one part is used as the testing subset while
the remaining datapoints form the training subset. This is repeated for all 10
parts, so in the end each part is used as the testing subset exactly 1 time and
as part of the training subset exactly 9 times. The MATLAB program interface for
splitting the dataset is shown in Figure 19.
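For readers without the MATLAB program, the same kind of split can be sketched in Python. This is not the writer's program. The shuffling seed and the function name `k_fold_indices` are assumptions for illustration.

```python
import random


def k_fold_indices(n_points, k=10, seed=0):
    """Split row indices 0..n_points-1 into k (train, test) pairs.

    Every index lands in exactly one testing subset and in the
    training subsets of the other k-1 folds.
    """
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k parts of near-equal size.
    parts = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(parts[i])
        train = sorted(
            idx for j, part in enumerate(parts) if j != i for idx in part
        )
        splits.append((train, test))
    return splits


splits = k_fold_indices(768, k=10)
for fold_no, (train, test) in enumerate(splits, start=1):
    print(f"Fold {fold_no}: {len(train)} training rows, {len(test)} testing rows")
```

Each `(train, test)` pair corresponds to one TrainN/TestN file pair; writing the selected rows to csv is left out to keep the sketch short.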
Now, we have to save each sheet (training and testing) manually in csv format,
because we want to run these subsets in WEKA and WEKA can only read them once
they are converted to csv format. You can name the training subsets by fold
number (Example: Train1, Train2, etc.). Do the same for the testing subsets
(Example: Test1, Test2, etc.). After that, you can optionally move the training
files into a training folder and the testing files into a testing folder to
avoid loading the wrong files when running WEKA. The training and testing
folders are shown in Figure 20 and Figure 21.
Delete Y1, which is the actual output of the dataset. This has to be done
because we will do classification, not regression, so we use Y1Class as the
output.
Save the model under a specific name, for example Train1.model, then go back
to the Preprocess tab. Now, load the testing data (Test1) and remove the Y1
attribute as well, using the same steps as before (Figure 26).
After that, go back to the Classify tab and make sure the classifier is still
SMO. In the Test options panel, click “Supplied test set” and open the same
testing subset (Test1.csv); the selection is shown in Figure 27.
Figure 28. Run Classification for Testing Subset Based on Training Subset
Copy the predicted result for the testing subset. You can either make a new
column or overwrite the previous classification in the testing subset (we
suggest creating a new column called “Predicted”). The predicted result of the
classification model is shown in Figure 29.
We have now finished running the classification for the first fold. Repeat the
same steps for the other folds until you have the results for all 10 folds.
After that, combine the testing files into 1 file again. Since every fold has a
different testing subset, combining them all reassembles the complete original
dataset (Figure 30).
If you follow the steps correctly, the number of datapoints in the combined
testing subsets will be the same as in the original dataset. You can save it as
a new file named “Combined Result”. The next step is to use the filter in Excel
to separate the classification results. Copy the “High” filter result into a new
Excel file and save it in csv format with the file name “High” (the name is
optional, but choose one that avoids confusion, because we operate on so many
files). Do the same with the “Medium” and “Low” filter results. In the end we
have 3 files, one per tier: High, Medium and Low (Figure 31).
If everything was done correctly, the total number of datapoints in the High,
Medium and Low files will equal that of the original dataset. In this
assignment, we have 268 low-tier datapoints, 142 medium-tier datapoints, and 358
high-tier datapoints. The total is 768 datapoints, which is the same as the
original dataset.
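The filtering step and the sanity check on the totals can also be sketched in Python. This is an illustration, not the Excel procedure itself. The rows below are placeholders, the helper name `split_by_tier` is hypothetical, and the only figures taken from the assignment are the three tier counts.

```python
from collections import defaultdict


def split_by_tier(rows, tier_key="Predicted"):
    """Group rows by the value of their predicted-tier column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[tier_key]].append(row)
    return dict(groups)


# Placeholder rows standing in for the combined testing subsets.
rows = [
    {"Y1": 12.0, "Predicted": "Low"},
    {"Y1": 25.0, "Predicted": "Medium"},
    {"Y1": 36.0, "Predicted": "High"},
    {"Y1": 9.5, "Predicted": "Low"},
]
groups = split_by_tier(rows)

# No row may be lost or duplicated by the split.
assert sum(len(group) for group in groups.values()) == len(rows)

# Tier counts reported in this assignment; their sum should equal 768.
reported_counts = {"Low": 268, "Medium": 142, "High": 358}
print(sum(reported_counts.values()))
```

Running the same check on the real High, Medium and Low files catches rows dropped or copied twice while filtering.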
Now we are ready for the final step of tiering. Run the 3 files (High, Medium
and Low) in WEKA using the 4 baseline models: ANN, LR, CART and SVR. The steps
are the same as for running the baseline models in the previous assignment; the
only difference is that we now use different datasets (Low, Medium, High). Each
tier is run with the 4 baseline models, so in the end we have a total of 12
results for Tiering.
5. Result Analysis
Now we have the results of all the ensemble models: Bagging, Stacking, Voting
and Tiering. It is time to analyze the result of each ensemble model. This
assignment uses 4 statistical performance measures to compare the results of the
models: coefficient of correlation (R), root mean squared error (RMSE), mean
absolute percentage error (MAPE) and mean absolute error (MAE). The formulation
of the performance measures is shown in Table 5.
Root mean squared error (RMSE): $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y - y')^2}$
Mean absolute percentage error (MAPE): $MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y - y'}{y}\right|$
Mean absolute error (MAE): $MAE = \frac{1}{n}\sum_{i=1}^{n}|y - y'|$
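The measures can be computed outside Excel with a short Python sketch. It assumes that y is the actual value and y' the predicted value. Because the formula for R does not appear in the text above, the sketch uses the Pearson correlation coefficient for R, which is an assumption. The numbers are placeholders, not results from this assignment.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (actual values must be non-zero)."""
    n = len(actual)
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / n


def correlation(actual, predicted):
    """Pearson correlation coefficient (assumed definition of R)."""
    n = len(actual)
    mean_y = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(actual, predicted))
    var_y = sum((y - mean_y) ** 2 for y in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_y * var_p)


# Placeholder values for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(correlation(actual, predicted), rmse(actual, predicted),
      mae(actual, predicted), mape(actual, predicted))
```

MAPE is returned as a fraction here; multiply by 100 if a percentage is wanted.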
All the results of the ensemble models are summarized in a single table to make
them easier to analyze and compare (Table 37).
If you run the same steps in WEKA, you may find that the performance measure
values are slightly different from WEKA’s own output. This is because we
calculate the statistical performance measures manually with the equations in
Table 5, not with WEKA’s. We also include the results of the baseline models
from the previous assignment for additional comparison. Since there are 4
statistical performance measures used in this assignment, we need to rank the
models based on each measure. For R, the higher the value the better; for the
rest, the lower the value the better. This assignment assumes the priority
R>RMSE>MAE>MAPE, which means that R has the highest priority and MAPE the
lowest. We rank all the methods using a multiple-level sort in Excel, shown in
Figure 32. The result of the ranking is shown in Table 38.
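The multiple-level sort can be expressed in Python as a single sort with a compound key. This is a sketch of the same priority order, not the Excel procedure. The model names and all values below are made-up placeholders, not entries of Table 38.

```python
def rank_models(results):
    """Sort by R descending, then RMSE, MAE and MAPE ascending."""
    return sorted(
        results,
        key=lambda r: (-r["R"], r["RMSE"], r["MAE"], r["MAPE"]),
    )


# Hypothetical models with placeholder scores.
results = [
    {"model": "model_a", "R": 0.998, "RMSE": 0.60, "MAE": 0.40, "MAPE": 0.020},
    {"model": "model_b", "R": 0.998, "RMSE": 0.55, "MAE": 0.45, "MAPE": 0.025},
    {"model": "model_c", "R": 0.990, "RMSE": 0.50, "MAE": 0.30, "MAPE": 0.010},
]

ranked = rank_models(results)
for rank, row in enumerate(ranked, start=1):
    print(rank, row["model"])
```

In this placeholder data, model_a and model_b tie on R, so RMSE decides between them, while model_c ranks last despite its lower errors because R has the highest priority.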
From Table 38, we can see that the top 5 models with the highest R value are
Bagging-CART, Baseline-CART, Stacking-ANN+CART+SVR, Voting-ANN+CART and
Stacking-CART+SVR+LR. All of them have an R value of 0.998. The difference lies
in the other performance measures (RMSE, MAE and MAPE). For Rank 1-3, the other
measures all give the same order, so there is no problem ranking the top 3. For
Rank 4-5, the order is decided by the priority assumed earlier
(R>RMSE>MAE>MAPE). In the end, Stacking-CART+SVR+LR is placed at Rank 4 and
Voting-ANN+CART at Rank 5. The radar plot of the top 5 models is shown in
Figure 33.
6. Conclusion
From the demonstration of all the baseline and ensemble models in WEKA, the
writer concludes that CART (Classification and Regression Tree) is the best
method for this dataset. It can be seen clearly from the top 5 results that CART
always appears among the contributing models. As further analysis, the writer
concludes that across Assignment 4 and Assignment 5, the best model is the
Bagging model applied to CART.
The results of all baseline and ensemble models in this assignment were obtained
using the default settings of WEKA. It is still possible that another ensemble
model or other parameter settings could outperform the best model found in this
assignment. For that reason, the writer suggests that future trials include more
comparisons to give better insight into this data science practice.
APPENDICES
(Since this ensemble model trial runs the program many times over many folds,
it is not possible to print the full output of every method used in the
assignment).