L. Torgo
ltorgo@fc.up.pt
Faculdade de Ciências / LIAAD-INESC TEC, LA
Universidade do Porto
Nov, 2014
Introduction
What is Prediction?
Definition
Prediction (forecasting) is the ability to anticipate the future.
Prediction is possible if we assume that there is some regularity in
what we observe, i.e. if the observed events are not random.
Example
Medical Diagnosis: given a historical record containing the symptoms
observed in several patients and the respective diagnoses, try to
forecast the correct diagnosis for a new patient for whom we know the
symptoms.
Prediction
Nov, 2014
2 / 167
Introduction
Prediction Models
Are obtained on the basis of the assumption that there is an
unknown mechanism that maps the characteristics of the
observations into conclusions/diagnoses. The goal of prediction
models is to discover this mechanism.
Going back to the medical diagnosis example, what we want is to know
how the symptoms influence the diagnosis.
Prediction
use the obtained models to make predictions regarding the target
variable of new cases given their descriptors.
Comprehensibility
use the models to better understand which factors influence the
conclusions.
Classification Task
Regression Task

season  size   speed   mxPH  mnO2  Cl      Chla  ...    a1
winter  small  medium  8.00   9.8  60.800  50.0  ...   0.0
spring  small  medium  8.35   8.0  57.750   1.3  ...   1.4
autumn  small  medium  8.10  11.4  40.020  15.6  ...   3.3
spring  small  medium  8.07   4.8  77.364   1.4  ...   3.1
autumn  small  medium  8.06   9.0  55.350  10.5  ...   9.2
winter  small  high    8.25  13.1  65.750  28.4  ...  15.1
...
Types of Models
Evaluation Metrics

Classification Problems
The setting
Given a data set {⟨x_i, y_i⟩}^N_{i=1}, where x_i is a feature vector
⟨x_1, x_2, ..., x_p⟩ and y_i ∈ 𝒴 is the value of the nominal target variable Y
There is an unknown function Y = f(x)

The approach
Assume a functional form h_θ(x) for the unknown function f(),
where θ is a set of parameters
Assume a preference criterion over the space of possible
parameterizations of h_θ()
Search for the optimal h_θ() according to the criterion and the data
set
Classification Error

Error Rate
Given a set of N_test test cases we can obtain the predictions for
these cases using some classification model.
The Error Rate (L_{0/1}) measures the proportion of these
predictions that are incorrect.
In order to calculate the error rate we need the information on the
true class values of the N_test cases.
Given a test set for which we know the true classes, the error rate
can be calculated as follows,

L_{0/1} = (1 / N_test) · Σ^{N_test}_{i=1} I(h_θ(x_i), y_i)

where I() is an indicator function that returns 1 when the predicted
and true classes differ and 0 otherwise.
Confusion Matrices

A square n_c × n_c matrix, where n_c is the number of class values
of the problem
The matrix contains the number of times each pair
(ObservedClass, PredictedClass) occurred when testing a
classification model on a set of cases

                      Pred.
Obs.       c1          c2          c3
c1       n_{c1,c1}   n_{c1,c2}   n_{c1,c3}
c2       n_{c2,c1}   n_{c2,c2}   n_{c2,c3}
c3       n_{c3,c1}   n_{c3,c2}   n_{c3,c3}

The error rate can be calculated from the information on this table.
An Example in R
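A minimal sketch of how the error rate can be computed in R, using made-up vectors of true and predicted classes (illustrative values, not the slide's original data):

```r
# Hypothetical true classes and model predictions
trueVals <- c("c1","c1","c2","c1","c3","c1","c2","c3","c2","c3")
preds    <- c("c1","c2","c1","c3","c3","c1","c1","c3","c1","c2")

# L0/1: the proportion of predictions that differ from the true class
errRate <- mean(trueVals != preds)
errRate
## [1] 0.6
```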
Cost-Sensitive Applications
In the error rate one assumes that all errors and correct
predictions have the same value
This may not be adequate for some applications
Cost/benefit Matrices

                    Pred.
Obs.       c1        c2        c3
c1       B_{1,1}   C_{1,2}   C_{1,3}
c2       C_{2,1}   B_{2,2}   C_{2,3}
c3       C_{3,1}   C_{3,2}   B_{3,3}
An Example in R
trueVals <- c("c1","c1","c2","c1","c3","c1","c2","c3","c2","c3")
preds <- c("c1","c2","c1","c3","c3","c1","c1","c3","c1","c2")
confMatrix <- table(trueVals,preds)
costMatrix <- matrix(c(10,-2,-4,-2,30,-3,-5,-6,12),ncol=3)
colnames(costMatrix) <- c("predC1","predC2","predC3")
rownames(costMatrix) <- c("obsC1","obsC2","obsC3")
costMatrix
##        predC1 predC2 predC3
## obsC1      10     -2     -5
## obsC2      -2     30     -6
## obsC3      -4     -3     12
utilityPreds <- sum(confMatrix*costMatrix)
utilityPreds
## [1] 28
                  Obs.
Preds.    Pos                     Neg
Pos       True Positives (TP)     False Positives (FP)
Neg       False Negatives (FN)    True Negatives (TN)
               Preds.
Obs.        P       N
P           TP      FN
N           FP      TN

Prec = TP / (TP + FP)        Rec = TP / (TP + FN)
Example

               Preds.
Obs.        P       N
P           2       2
N           1       1

Precision = TP / (TP + FP) = 2 / (2 + 1) = 0.667
Recall = TP / (TP + FN) = 2 / (2 + 2) = 0.5
ErrorRate = (FN + FP) / (TP + FN + FP + TN) = (2 + 1) / (2 + 2 + 1 + 1) = 0.5
The F-Measure

Combining Precision and Recall into a single measure,

F_β = ((β² + 1) · Prec · Rec) / (β² · Prec + Rec)

where β controls the relative importance of Recall; for β = 1 this is
the harmonic mean of Precision and Recall.
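A short sketch of these measures in R, using the counts TP = 2, FN = 2, FP = 1, TN = 1 from the worked example; the F1 computation is the β = 1 case:

```r
# Counts taken from the worked example
TP <- 2; FN <- 2; FP <- 1; TN <- 1

prec <- TP / (TP + FP)                 # 0.667
rec  <- TP / (TP + FN)                 # 0.5
f1   <- 2 * prec * rec / (prec + rec)  # harmonic mean of the two
round(f1, 3)
## [1] 0.571
```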
Regression Problems

The setting
Given a data set {⟨x_i, y_i⟩}^N_{i=1}, where x_i is a feature vector
⟨x_1, x_2, ..., x_p⟩ and y_i ∈ ℝ is the value of the numeric target variable Y
There is an unknown function Y = f(x)

The approach
Assume a functional form h_θ(x) for the unknown function f(),
where θ is a set of parameters
Assume a preference criterion over the space of possible
parameterizations of h_θ()
Search for the optimal h_θ() according to the criterion and the data
set
Given a set of N_test test cases we can obtain the predictions for
these cases using some regression model.
The Mean Squared Error (MSE) measures the average squared
deviation between the predictions and the true values.
In order to calculate the value of MSE we need to have both the
predictions and the true values of the N_test cases.
MSE = (1 / N_test) · Σ^{N_test}_{i=1} (ŷ_i − y_i)²
The Mean Absolute Error (MAE) uses the absolute instead of the
squared deviation,

MAE = (1 / N_test) · Σ^{N_test}_{i=1} |ŷ_i − y_i|
Relative error metrics are unitless, which means that their scores
can be compared across different domains.
They are calculated by comparing the scores of the model under
evaluation against the scores of some baseline model.
The relative score is expected to be a value between 0 and 1, with
values near (or even above) 1 representing performances as bad as
the baseline model, which is usually chosen to be something naive.
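A sketch of this idea in R, using the mean of the true values as the naive baseline (the vectors are made-up illustrations):

```r
y     <- c(10, 12, 9, 14, 11)    # true values (illustrative)
preds <- c(11, 11, 10, 13, 12)   # model predictions (illustrative)

# Relative scores against the "always predict the mean" baseline
nmse <- sum((preds - y)^2) / sum((mean(y) - y)^2)
nmae <- sum(abs(preds - y)) / sum(abs(mean(y) - y))
c(nmse = nmse, nmae = nmae)   # both below 1, i.e. better than the baseline
```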
NMSE = Σ^{N_test}_{i=1} (ŷ_i − y_i)² / Σ^{N_test}_{i=1} (ȳ − y_i)²

NMAE = Σ^{N_test}_{i=1} |ŷ_i − y_i| / Σ^{N_test}_{i=1} |ȳ − y_i|

where ȳ is the average value of the target variable (the naive
baseline prediction).
MAPE = (1 / N_test) · Σ^{N_test}_{i=1} |y_i − ŷ_i| / y_i

The Correlation between the predictions and the true values (ŷ, y)
is given by,

ρ_{ŷ,y} = Σ^{N_test}_{i=1} (ŷ_i − mean(ŷ))(y_i − mean(y)) /
          √( Σ^{N_test}_{i=1} (ŷ_i − mean(ŷ))² · Σ^{N_test}_{i=1} (y_i − mean(y))² )
An Example in R
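A minimal sketch of computing the regression metrics of this section in R, with made-up prediction and true-value vectors (not the slide's original data):

```r
trueVals <- c(23.4, 10.1, 30.2, 17.8, 25.0)   # illustrative true values
preds    <- c(21.0, 12.3, 28.4, 19.0, 26.1)   # illustrative predictions

mae  <- mean(abs(preds - trueVals))     # mean absolute error
mse  <- mean((preds - trueVals)^2)      # mean squared error
rmse <- sqrt(mse)                       # root mean squared error
corr <- cor(preds, trueVals)            # correlation of preds and truth
round(c(mae = mae, mse = mse, rmse = rmse), 3)
```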
Linear Discriminant

Cov(X, Y) = (1/n) · Σ^{n}_{i=1} (x_i − x̄)(y_i − ȳ)
Linear Discriminant
wT Cw
Given this score the goal is to find the vector w that maximizes it.
There is a solution for this maximization problem given by,
1 (1 2 )
lda = C
w
Canonical discriminant functions extend the idea for more than
two classes
Obtain a random split of the data into two subsets using the
proportion 80%-20%.
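A split like the one asked for can be obtained with sample(); iris is used here as a stand-in data set:

```r
set.seed(1234)                              # for reproducibility
idx   <- sample(1:nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]                        # 80% of the cases
test  <- iris[-idx, ]                       # the remaining 20%
c(nrow(train), nrow(test))
## [1] 120  30
```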
summary(final.la1)

## 
## Call:
## lm(formula = a1 ~ size + mxPH + Cl + NO3 + PO4, data = clean.algae[,
##     1:12])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28.87 -12.73  -3.74   8.42  62.93 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.2855    20.9613    2.73   0.0069 ** 
## sizemedium    2.8005     3.4019    0.82   0.4114    
## sizesmall    10.4064     3.8224    2.72   0.0071 ** 
## mxPH         -3.9708     2.4820   -1.60   0.1113    
## Cl           -0.0523     0.0317   -1.65   0.1003    
## NO3          -0.8953     0.3515   -2.55   0.0116 *  
## PO4          -0.0591     0.0112   -5.29  3.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.5 on 191 degrees of freedom
## Multiple R-squared: 0.353, Adjusted R-squared: 0.332
## F-statistic: 17.3 on 6 and 191 DF, p-value: 5.55e-16
plot(algae.sols$a1,preds,main='Errors Scatterplot',
     ylab='Predicted Values',xlab='True Values')
abline(0,1,col='red',lty=2)

##     mae      mse    rmse   mape   nmse   nmae
## 12.7843 296.0934 17.2074    Inf 0.7048 0.7975

[Figure: "Errors Scatterplot" of the predicted vs. true values of a1,
with the dashed line y = x.]
Obtain a random split of the data into two subsets using the
proportion 70%-30%.
Tree-based Models
Tree-based models (both classification and regression trees) are
models whose predictions result from logical tests on the input
variables
These models can be seen as a partitioning of the input space
defined by the input variables
This partitioning is defined based on carefully chosen logical tests
on these variables
Within each partition all cases are assigned the same prediction
(either a class label or a numeric value)
Tree-based models are known for their (i) computational efficiency;
(ii) interpretable models; (iii) embedded variable selection; (iv)
embedded handling of unknown variable values; and (v) few
assumptions on the unknown function being approximated
[Figure: an example tree with tests on X_1 and X_2 (e.g. X_1 >= 2,
X_2 >= 5, X_1 >= 1) and the corresponding partitioning of the input
space.]
Tree-based Models
benign
|
458/241
Cell.size< 2.5
Cell.size>=2.5
benign
417/12
malignant
41/229
Cell.shape< 2.5
Bare.nuclei< 5.5
Bare.nuclei>=5.5
benign
416/5
malignant
1/7
Cell.shape>=2.5
benign
18/5
malignant
23/224
Cl.thickness< 5.5
Cl.thickness>=5.5
benign
18/1
Cell.size< 4.5
Cell.size>=4.5
malignant
0/4
malignant
18/52
malignant
5/172
Bare.nuclei< 2.5
Bare.nuclei>=2.5
benign
10/4
malignant
8/48
Marg.adhesion< 3.5
Marg.adhesion>=3.5
benign
10/1
Prediction
malignant
0/3
Tree-based Models

[Figure: regression tree for the Boston housing data (root: 22.5,
n=506). Splits use rm, lstat, crim, dis, nox and ptratio; each node
shows the average of the target and the number of cases (e.g.
rm < 6.941 leads to 19.9, n=430, and rm >= 6.941 to 37.2, n=76).]
Tree-based Models

Most tree-based models are binary trees with a logical test on each
node
Tests on numerical predictors take the form x_i < α, with α ∈ ℝ
Tests on nominal predictors take the form x_j ∈ {v1, ..., vm}
Each path from the top (root) node to a leaf can be seen as a
logical condition defining a region of the predictor space.
All observations falling in a leaf get the same prediction:
the majority class of the training cases in that leaf for classification
trees
the average value of the target variable for regression trees
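The two leaf-prediction rules can be sketched directly in R (the vectors below are made-up examples of the training cases falling in a leaf):

```r
# Classification tree: the leaf predicts the majority class of its cases
leafClasses   <- c("benign", "benign", "malignant", "benign")
majorityClass <- names(which.max(table(leafClasses)))
majorityClass
## [1] "benign"

# Regression tree: the leaf predicts the average target value of its cases
leafTargets    <- c(12.5, 10.0, 14.5)
leafPrediction <- mean(leafTargets)
```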
[Table: the 28-case weather data set (Weather, Temp, Humidity, Wind,
Decision; shown in full in the Naive Bayes example) split by the
candidate test Weather ∈ {rain} vs. Weather ∈ {overcast, sun}, with
the resulting Decision values on each side.]
Which value?
Should be the value that best represents the cases in the leaves
Tree-based Models

The Gini index of a data set D with c classes, where p_i is the
proportion of cases of class i,

Gini(D) = 1 − Σ^{c}_{i=1} p_i²

The Gini score of a binary split test T that divides D into D_T and
D_{¬T},

Gini_T(D) = (|D_T| / |D|) · Gini(D_T) + (|D_{¬T}| / |D|) · Gini(D_{¬T})
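The two formulas can be written as small R functions; this is a sketch assuming a binary split given as a logical vector, checked on the 16 play / 12 dontPlay class distribution of the weather data:

```r
# Gini index of a vector of class labels
gini <- function(classes) 1 - sum(prop.table(table(classes))^2)

# Weighted Gini of a binary split (inLeft marks the cases going left)
giniSplit <- function(classes, inLeft) {
  n <- length(classes)
  sum(inLeft) / n * gini(classes[inLeft]) +
    sum(!inLeft) / n * gini(classes[!inLeft])
}

# The 28-case weather data: 16 play vs 12 dontPlay overall
dec <- c(rep("play", 16), rep("dontPlay", 12))
round(gini(dec), 2)
## [1] 0.49
```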
Tree-based Models

For the 28-case weather data set (16 play, 12 dontPlay; the full
table is shown in the Naive Bayes example),

Gini(D) = 1 − ((16 / (16 + 12))² + (12 / (16 + 12))²) = 0.49

For the candidate test Weather ∈ {rain}, which sends 10 cases
(6 play, 4 dontPlay) to one branch and 18 (10 play, 8 dontPlay) to
the other,

Gini(D_{Weather∈{rain}}) = 1 − ((4 / (4 + 6))² + (6 / (4 + 6))²) = 0.48
Gini(D_{Weather∉{rain}}) = 1 − ((8 / (8 + 10))² + (10 / (8 + 10))²) = 0.49

Gini_{Weather∈{rain}}(D) = (10/28) · Gini(D_{Weather∈{rain}}) +
                           (18/28) · Gini(D_{Weather∉{rain}}) = 0.489
For a numerical variable the candidate tests are thresholds between
consecutive observed values, e.g. Temp < 10.5, Temp < 11.5,
Temp < 13, Temp < 14.5, ...
Choose the test with the best score to be the best test on variable
Temp
Given a set of data D and a nominal variable A, let V_{A,D} be the set
of values of A occurring in D
Evaluate all possible subsets of values in V_{A,D}
Note that there are some optimizations that reduce the
computational complexity of this search
Choose the test with the best score to be the best test on variable
Weather
The error of a node t can be measured as
Err(t) = (1/n_t) · Σ_{⟨x_i,y_i⟩ ∈ D_t} (y_i − k_t)², where k_t is the
constant prediction at node t (the average target value of its cases)
Tree-based Models

The error of a binary split s of node t into branches t_L and t_R,

Err(t, s) = (n_{t_L} / n_t) · Err(t_L) + (n_{t_R} / n_t) · Err(t_R)

where n_{t_L}/n_t (n_{t_R}/n_t) is the proportion of cases going to the left
(right) branch of t
We can estimate the value of the split s by the respective error
reduction,

Δ(s, t) = Err(t) − Err(t, s)

Finding the best split test for a node t involves evaluating all
possible tests for this node using the above equations
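The equations above can be sketched in R for a node whose cases have target values y and a candidate split given as a logical vector (the values are made-up):

```r
# Error of a node: average squared deviation from its constant prediction
nodeErr <- function(y) mean((y - mean(y))^2)

# Error of a split: weighted errors of the two resulting branches
splitErr <- function(y, goesLeft) {
  n <- length(y)
  sum(goesLeft) / n * nodeErr(y[goesLeft]) +
    sum(!goesLeft) / n * nodeErr(y[!goesLeft])
}

# Made-up target values; the split separates the small from the large ones
y <- c(1, 2, 1.5, 10, 11, 9.5)
s <- y < 5
delta <- nodeErr(y) - splitErr(y, s)   # error reduction of the split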
[Figure: tree error as a function of tree size. The error measured on
the training data keeps decreasing as the tree grows, while the error
measured on test (new) data eventually starts to increase.]
(Post-)Pruning a Tree

Cost-complexity and Error-complexity Pruning
Tree-based Models
2
|
70/76/17/13/9/29
Ba< 0.335
Ba>=0.335
2
69/75/17/12/9/3
Al< 1.42
1
63/31/13/1/3/2
library(DMwR)
data(Glass,package='mlbench')
ac <- rpartXse(Type ~ .,Glass)
prettyTree(ac)
Ca< 10.48
Ca>=10.48
1
63/21/13/0/2/2
Al>=1.42
2
6/44/4/11/6/1
Mg>=2.26
Mg< 2.26
2
2
0/10/0/1/1/0
6/41/4/0/1/0
RI>=1.517
RI< 1.517
1
60/17/6/0/1/1
7
1/1/0/1/0/26
3
3/4/7/0/1/1
5
0/3/0/11/5/1
Na< 13.5
Na>=13.5
5
6
0/1/0/11/0/00/2/0/0/5/1
Mg< 3.865
Mg>=3.865
1
2
59/11/5/0/1/1
1/6/1/0/0/0
Prediction
Tree-based Models
Nov, 2014
73 / 167
Nov, 2014
74 / 167
Predicted class probabilities for the test cases 201-214: every case
except 202 gets P(5) = 0.0909 and P(7) = 0.9091; case 202 gets
P(1) = 0.2, P(2) = 0.2667, P(3) = 0.4667 and P(6) = 0.0667.

##     
## ps    1  2  3  5  6  7
##   1   0  0  0  0  0  0
##   2   0  0  0  0  0  0
##   3   0  0  0  0  0  1
##   5   0  0  0  0  0  0
##   6   0  0  0  0  0  0
##   7   0  0  0  0  0 13

mc <- table(ps,ts$Type)
err <- 100*(1-sum(diag(mc))/sum(mc))
err

## [1] 7.143
library(DMwR)
data(algae)
d <- algae[,1:12]
ar <- rpartXse(a1 ~ .,d)

[Figure: the resulting regression tree. The root (16.9, n=200) is
split on PO4: PO4 >= 43.82 leads to a leaf with average 8.92 (n=148)
and PO4 < 43.82 to a leaf with average 39.7 (n=52).]
tr <- d[1:150,]
ts <- d[151:200,]
ar <- rpartXse(a1 ~ .,tr)
preds <- predict(ar,ts)
mae <- mean(abs(preds-ts$a1))
mae
## [1] 12.28
cr <- cor(preds,ts$a1)
cr
## [1] 0.5124
Apply the tree to the data used to obtain the model and calculate
the mean squared error of the predictions
Split the data set in two parts: 70% of the cases and the remaining
30%. Use the larger part to obtain a regression tree and apply it
to the other part. Calculate again the mean squared error.
Compare with the previous scores and comment.
Bayesian Classification

Naive Bayes
Bayes Theorem

P(H|x) = P(x|H) · P(H) / P(x)
How it works?
We have a data set D with cases belonging to one of m classes
c1, c2, ..., cm
Given a new test case x this classifier produces as prediction the
class that has the highest estimated probability, i.e.
argmax_{i ∈ {1,2,...,m}} P(c_i | x)
Given that P(x) is constant for all classes, according to the Bayes
Theorem the class with the highest probability is the one
maximizing the quantity P(x | c_i) · P(c_i)
The "naive" part is the assumption that the predictors are
conditionally independent given the class, which allows estimating

P(x | c_i) = ∏^{p}_{k=1} P(x_k | c_i)

Note that the quantities P(x_1 | c_i), P(x_2 | c_i), ..., P(x_p | c_i)
can be easily estimated from the training data
For a numeric predictor, P(x_k | c_i) is typically estimated assuming
a Gaussian distribution,

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
An Illustrative Example

Weather    Temp  Humidity  Wind  Decision
rain        26   high       15   dontPlay
rain        35   normal    102   dontPlay
rain        12   high      120   play
rain        11   normal     95   play
rain        15   high       94   play
rain        29   high       79   dontPlay
rain        26   normal     75   dontPlay
rain        33   high       96   play
rain        28   normal     44   play
rain        21   high       84   play
overcast    27   high       99   play
overcast    26   normal     97   play
overcast    21   normal     74   play
overcast    10   normal     70   play
overcast    30   high      108   play
overcast    30   normal     16   play
overcast    30   normal     13   play
overcast    14   normal     32   play
sun         30   normal     89   dontPlay
sun         19   high      111   dontPlay
sun         14   normal     81   play
sun         19   high       41   dontPlay
sun         35   normal     38   dontPlay
sun         32   normal     55   dontPlay
sun         11   high      108   dontPlay
sun         33   normal    103   play
sun         29   high      105   dontPlay
sun         15   normal     63   dontPlay

Repeat for
P(⟨sun, 16, normal, 70⟩ | dontPlay) · P(dontPlay)
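A sketch of how the nominal conditional probabilities and the class priors can be estimated from this table in R (only the Weather variable is shown; the numeric variables would need the Gaussian estimate of the previous slide):

```r
# The Weather and Decision columns of the 28-case table above
weather  <- c(rep("rain", 10), rep("overcast", 8), rep("sun", 10))
decision <- c("dontPlay","dontPlay","play","play","play","dontPlay","dontPlay",
              "play","play","play",                       # rain cases
              rep("play", 8),                             # overcast cases
              "dontPlay","dontPlay","play","dontPlay","dontPlay","dontPlay",
              "dontPlay","play","dontPlay","dontPlay")    # sun cases

# P(Weather = sun | play) and the prior P(play)
pSunGivenPlay <- sum(weather == "sun" & decision == "play") / sum(decision == "play")
pPlay         <- mean(decision == "play")
c(pSunGivenPlay, pPlay)
## [1] 0.1250000 0.5714286
```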
Naive Bayes in R
library(e1071)
set.seed(1234)
sp <- sample(1:150,100)
tr <- iris[sp,]
ts <- iris[-sp,]
nb <- naiveBayes(Species ~ ., tr)
(mtrx <- table(predict(nb,ts),ts$Species))
## (3 × 3 contingency table of predicted vs. true Species)
Laplace Correction

What if one of the P(x_k | c_i)s is equal to zero? This can easily
happen with nominal variables if one of the values does not occur in
a class.
This would make the product P(x | c_i) = ∏^{p}_{k=1} P(x_k | c_i) = 0
This zero probability would cancel the effects of all the other
P(x_k | c_i)s
The Laplace correction avoids this by adding a constant to the
counts used in the estimates, e.g. estimating P(x_k | c_i) as
(n_{k,i} + 1) / (n_i + q), where q is an integer (typically the
number of values of variable x_k)
Laplace Correction in R
library(e1071)
set.seed(1234)
sp <- sample(1:150,100)
tr <- iris[sp,]
ts <- iris[-sp,]
nb <- naiveBayes(Species ~ ., tr,laplace=1)
(mtrx <- table(predict(nb,ts),ts$Species))
## (3 × 3 contingency table of predicted vs. true Species)
k-Nearest Neighbors
Main Characteristics

Distance Functions
Distance Functions

ID  Income  Sex  Position  Age
1   2500    f    manager   35
2   2750    f    manager   30
3   4550    m    director  50
Distance Functions

The Euclidean distance,

d(x, y) = √( Σ^{p}_{i=1} (x_i − y_i)² )

Example
Given two cases x = ⟨3, 5, 1⟩ and y = ⟨12, 5.4, −3⟩ their
Euclidean distance is given by

d(x, y) = √((3 − 12)² + (5 − 5.4)² + (1 − (−3))²) = 9.85697
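The example can be checked directly in R:

```r
x <- c(3, 5, 1)
y <- c(12, 5.4, -3)
d <- sqrt(sum((x - y)^2))   # Euclidean distance, ~9.85697
```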
Distance Functions

The Minkowski distance generalizes this idea,

d(x, y) = ( Σ^{p}_{i=1} |x_i − y_i|^r )^(1/r)

where if
r = 1 we have what is known as the Manhattan distance (or
L1-norm)
r = 2 we have the Euclidean distance
etc.
Distance Functions

For cases with mixed variable types the distance can be obtained as

d(x, y) = Σ^{p}_{i=1} δ_i(x_i, y_i)

where

δ_i(v1, v2) = 1                        if i is nominal and v1 ≠ v2
            = 0                        if i is nominal and v1 = v2
            = |v1 − v2| / range(i)     if i is numeric

The distance between ⟨2500, f, director, 35⟩ and ⟨2750, f, director,
30⟩ would be given by
|2500 − 2750| / range(Income) + 0 + 0 + |35 − 30| / range(Age)
Method
Search for the training case most similar to the test case
Predict for the test case the class of this nearest neighbor
Very simple method
May suffer from the presence of outliers
Frequently achieves good results
k-nearest neighbors in R

Package class contains function knn()

library(class)
set.seed(1234)
sp <- sample(1:150,100)
tr <- iris[sp,]
ts <- iris[-sp,]
nn3 <- knn(tr[,-5],ts[,-5],tr[,5],k=3)
(mtrx <- table(nn3,ts$Species))

##             
## nn3          setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         20         1
##   virginica       0          1        16

(err <- 1-sum(diag(mtrx))/sum(mtrx))

## [1] 0.04
k-nearest neighbors in R - 2

Package DMwR has a wrapper function with a standard
interface,

library(class)
library(DMwR)
set.seed(1234)
sp <- sample(1:150,100)
tr <- iris[sp,]
ts <- iris[-sp,]
nn3 <- kNN(Species ~ .,tr,ts,k=3,norm=TRUE)
(mtrx <- table(nn3,ts$Species))

##             
## nn3          setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         20         3
##   virginica       0          1        14

(err <- 1-sum(diag(mtrx))/sum(mtrx))

## [1] 0.08
avg and std of the estimated error for different values of k:

        k=1     k=3     k=5     k=7    k=11    k=13    k=15
avg   0.082   0.050   0.056   0.052   0.040   0.046   0.070
std 0.02741 0.02708 0.01265 0.03676 0.02108 0.02503 0.02539
Method (for regression)
Search for the k training cases most similar to the test case
Predict for the test case the average of the target variable values
of the neighbors
A Bit of History...
SVMs were introduced in 1992 at the COLT-92 conference
They gave origin to a new class of algorithms named kernel
machines
Since then there has been a growing interest in these methods
More information may be obtained at
www.kernel-machines.org
A good reference on SVMs:
N. Cristianini and J. Shawe-Taylor: An Introduction to Support
Vector Machines. Cambridge University Press, 2000.
SVMs have been applied with success in a wide range of areas
like bio-informatics, text mining, hand-written character
recognition, etc.
[Figure: two classes of points (y = +1, y = −1) in the space (X1, X2),
separated by a linear boundary with normal vector w.]
Map the original data into a new space of variables with very high
dimension.
Use a linear approximation in this new input space.
[Figure: a two-class data set (y = +1, y = −1) that is not linearly
separable in the original space (X1, X2) becomes linearly separable
after the mapping.]

[Figure: among the many hyperplanes separating the two classes, e.g.
H1 and H2, SVMs search for the one with the maximal margin.]
The optimal hyperplane is found by solving the dual problem,

Maximize: Σ^{n}_{i=1} α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)

Subject to:
α_i ≥ 0
Σ_i α_i y_i = 0
The Algorithm
Given an m-class task
Obtain m SVM classifiers, one for each class
Given a test case, assign it to the class whose separating
hyperplane is most distant from the test case
SVMs in R

Obtaining an SVM in R
The package e1071

library(e1071)
data(Glass,package='mlbench')
tr <- Glass[1:200,]
ts <- Glass[201:214,]
s <- svm(Type ~ .,Glass)
predict(s,ts)

## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 
##   7   7   7   7   7   7   7   7   7   7   7   7   7   7 
## Levels: 1 2 3 5 6 7
SVMs in R

ps <- predict(s,ts)
table(ps,ts$Type)

##     
## ps    1  2  3  5  6  7
##   1   0  0  0  0  0  0
##   2   0  0  0  0  0  0
##   3   0  0  0  0  0  0
##   5   0  0  0  0  0  0
##   6   0  0  0  0  0  0
##   7   0  0  0  0  0 14

mc <- table(ps,ts$Type)
error <- 100*(1-sum(diag(mc))/sum(mc))
error

## [1] 0
ε-SV Regression

Minimize: (1/2)‖w‖² + C · Σ^{n}_{i=1} (ξ_i + ξ*_i)

Subject to:
y_i − w·x_i − b ≤ ε + ξ_i
w·x_i + b − y_i ≤ ε + ξ*_i
ξ_i, ξ*_i ≥ 0
##    mae     mse   rmse   mape
## 2.6678 13.8211 3.7177 0.1634

[Figure: scatterplot of the predictions vs. the true values.]
Hands On SVMs

Hands on SVMs
The file Wine.Rdata contains 2 data frames with data about the
quality of green wines: i) redWine and ii) whiteWine. Each of
these data sets has information on a series of wine tasting sessions
of green wines (both red and white). For each wine sample, several
physico-chemical properties of the sample are given, together with a
quality score assigned by a committee of wine experts (variable
quality).

1. Obtain an SVM for forecasting the quality of the red variant of
green wines
2. Split the data set in two parts: one with 70% of the samples and
the other with the remaining 30%. Obtain an SVM with the first
part and apply it to the second. What was the resulting mean
absolute error?
3. Using the round() function, round the predictions obtained in the
previous question to the nearest integer. Calculate the error rate
of the resulting integers when compared to the true values
Model Ensembles and Random Forests

Motivation

What?
Ensembles are collections of models that are used together to
address a certain prediction problem

Random Forests
Random Forests in R

The package randomForest
library(randomForest)
data(Boston, package = "MASS")
samp <- sample(1:nrow(Boston), 354)
tr <- Boston[samp, ]
ts <- Boston[-samp, ]
m <- randomForest(medv ~ ., tr)
ps <- predict(m, ts)
mean(abs(ts$medv - ps))
## [1] 1.983
Random Forests in R
A classification example
data(Glass, package = "mlbench")
set.seed(1234)
sp <- sample(1:nrow(Glass), 150)
tr <- Glass[sp, ]
ts <- Glass[-sp, ]
m <- randomForest(Type ~ ., tr, ntree = 3000)
ps <- predict(m, ts)
table(ps, ts$Type)
##
## ps
1 2
##
1 13 5
##
2 2 18
##
3 0 0
##
5 0 0
##
6 0 1
##
7 0 0
3
3
0
1
0
0
0
5
0
3
0
4
0
0
6
0
0
0
0
3
0
7
1
2
0
0
0
8
Random Forests in R
data(Boston,package='MASS')
library(randomForest)
m <- randomForest(medv ~ ., Boston, importance=T)
importance(m)

##         %IncMSE IncNodePurity
## crim     16.553        2490.4
## zn        5.574         284.8
## indus    12.182        2666.7
## chas      3.594         220.5
## nox      19.958        2714.2
## rm       36.698       11845.0
## age      15.072        1205.1
## dis      18.597        2439.8
## rad       8.003         374.0
## tax      12.757        1466.6
## ptratio  16.285        2684.0
## black    10.275         737.8
## lstat    30.584       12723.0

[Figure: variable importance plots, with the variables ordered by
%IncMSE (rm and lstat on top) and by IncNodePurity (lstat and rm on
top).]
Load in the data set algae and answer the following questions:
Evaluation Methodologies and Comparison of Models

Performance Estimation

The setting
We have a predictive task where there is an unknown function
Y = f(x) that maps a feature vector into a target variable value
(can be a classification or a regression problem)
Given a data set {⟨x_i, y_i⟩}^N_{i=1}, where x_i is a feature vector
⟨x_1, x_2, ..., x_p⟩ describing the case i and y_i is the target
variable value (either a nominal or a numeric variable) of the case i
Given a predictive performance evaluation criterion - a metric of
predictive performance (e.g. error rate or mean squared error)
How to obtain a reliable estimate of the predictive performance of
any models that we consider to obtain with the given data set?
Reliability of Estimates

Resubstitution estimates
##    mae     mse   rmse   mape   nmse   nmae
## 3.2320 22.1313 4.7044 0.1732 0.2370 0.4530
Cross Validation

A Simple Example
Boston
*Workflow: standardWF

##             mse
## avg      21.024
## std      10.157
## min       9.406
## max      40.724
## invalid   0.000
##
## == Cross Validation Performance Estimation Experiment ==
##
## 1 x 10 - Fold Cross Validation run with seed = 1234
## 3 workflows
## tested on 1 predictive tasks

getWorkflow("rpartXse.v1",res2)

## Workflow Object:
## Workflow ID       :: rpartXse.v1
## Workflow Function :: standardWF
##      Parameter values:
##        learner.pars -> se=0
##        evaluator.pars -> stats=c("mse", "mae", "mape")
##        learner -> rpartXse
## ...

topPerformers(res2)

## $Boston
##         Workflow Estimate
## mse  rpartXse.v1   19.241
## mae  rpartXse.v1    2.961
## mape rpartXse.v1    0.157
[Figure: box plots of the cross-validation mse, mae and mape scores
of the alternative workflows rpartXse.v1, rpartXse.v2 and
rpartXse.v3 on the Boston task.]
$Boston
$Boston$mse
     Workflow Estimate
1 rpartXse.v1    19.24
2 rpartXse.v2    20.36
3 rpartXse.v3    21.02

$Boston$mae
     Workflow Estimate
1 rpartXse.v1    2.961
2 rpartXse.v2    3.105
3 rpartXse.v3    3.176

$Boston$mape
     Workflow Estimate
1 rpartXse.v1   0.1573
2 rpartXse.v2   0.1660
3 rpartXse.v3   0.1750
A larger comparison
library(randomForest)
library(e1071)
res3 <- performanceEstimation(
PredTask(medv ~ ., Boston),
workflowVariants("standardWF",
learner=c("rpartXse","svm","randomForest")),
CvSettings(nReps=2,nFolds=5))
The Results

plot(res3)

[Figure: box plots of the cross-validation mse of the rpartXse, svm
and randomForest workflow variants on the Boston task.]
data(Glass,package='mlbench')
res4 <- performanceEstimation(
PredTask(Type ~ ., Glass),
workflowVariants("standardWF",learner="svm",
learner.pars=list(cost=c(1,10),
gamma=c(0.1,0.01))),
HldSettings(nReps=5,hldSz=0.3))
The Results

plot(res4)

[Figure: box plots of the hold-out error (err) of the four svm
variants (svm.v1 to svm.v4) on the Glass task.]
data(Glass,package='mlbench')
data(iris)
res5 <- performanceEstimation(
c(PredTask(Type ~ ., Glass),PredTask(Species ~.,iris)),
c(workflowVariants("standardWF",learner="svm",
learner.pars=list(cost=c(1,10),
gamma=c(0.1,0.01))),
workflowVariants("standardWF",learner="rpartXse",
learner.pars=list(se=c(0,0.5,1)),
predictor.pars=list(type="class"))),
CvSettings(nReps=3,nFolds=10))
The Results

plot(res5)

[Figure: box plots of the cross-validation error (err) of the svm
(svm.v1 to svm.v4) and rpartXse (rpartXse.v1 to rpartXse.v3)
workflow variants on the Glass and iris tasks.]
$Glass
    Workflow Estimate
err   svm.v1    0.294

$iris
    Workflow Estimate
err   svm.v2     0.04

getWorkflow(topPerformers(res5)$Glass["err","Workflow"],res5)

## Workflow Object:
## Workflow ID       :: svm.v1
## Workflow Function :: standardWF
##      Parameter values:
##        learner.pars -> cost=1 gamma=0.1
##        learner -> svm
Load in the data set algae and answer the following questions:
A Simple Example
library(performanceEstimation)
library(DMwR) # because of rpartXse
data(Boston,package="MASS")
res2 <- performanceEstimation(
PredTask(medv ~ .,Boston),
workflowVariants("standardWF",
learner="rpartXse",
learner.pars=list(se=c(0,0.5,1)),
evaluator.pars=list(stats=c("mse","mae","mape"))),
CvSettings(nReps=1,nFolds=10))
, , mse, Boston

            AvgScore SdScore   Diff p.value
rpartXse.v1    19.24  10.328     NA      NA
rpartXse.v2    20.36   9.158 -1.124  0.8457
rpartXse.v3    21.02  10.157 -1.783  0.2754

, , mae, Boston

            AvgScore SdScore    Diff p.value
rpartXse.v1    2.961  0.5625      NA      NA
rpartXse.v2    3.105  0.4992 -0.1444 0.27539
rpartXse.v3    3.176  0.5712 -0.2148 0.08398

, , mape, Boston

            AvgScore SdScore      Diff p.value
rpartXse.v1   0.1573 0.03700        NA      NA
rpartXse.v2   0.1660 0.04242 -0.008724 0.55664
rpartXse.v3   0.1750 0.05198 -0.017689 0.06445
p <- pairedComparisons(res2)
signifDiffs(p,p.limit=0.1)

## $Boston
## $Boston$mae
##             AvgScore SdScore    Diff p.value
## rpartXse.v1    2.961  0.5625      NA      NA
## rpartXse.v3    3.176  0.5712 -0.2148 0.08398
##
## $Boston$mape
##             AvgScore SdScore     Diff p.value
## rpartXse.v1   0.1573 0.03700       NA      NA
## rpartXse.v3   0.1750 0.05198 -0.01769 0.06445
Check which workflow achieved the best score and carry out a
paired comparison of the other alternatives against this workflow
Obtain a bar plot with the scores of the workflows and put a * on
top of the bars that correspond to a statistically significant
difference with respect to the baseline