

Tree-Based Methods
Tree-based methods for regression and classification involve stratifying or segmenting the predictor space into a
number of simple regions. Since the set of splitting rules used to segment the predictor space can be summarized
in a tree, these types of approaches are known as decision-tree methods.
Tree-based methods are easy to interpret, but not competitive with the best supervised learning approaches in
terms of prediction accuracy. They can be made competitive when they are combined in ensembles of trees
(bagging, random forests, and boosting). Combining a large number of trees can often result in dramatic
improvements in prediction accuracy, at the expense of some loss of interpretability.
Example of how we would stratify some data and the resulting tree:
Building a tree
The final nodes (red and grey) are called terminal nodes (Red, Grey). The points along the tree where the predictor
space is split are referred to as internal nodes (X<60, Y<20, X<85). The variable that is split on first is the most
important factor.
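Read as code, such a tree is just a set of nested if/else rules. The sketch below is only illustrative: the split values come from the example above, but the nesting and the Red/Grey labels at the leaves are an assumed arrangement, not taken from the source figure.

```python
def predict_region(x, y):
    """Pass one observation down the example tree, obeying each split in turn."""
    if x < 60:              # internal node: first (most important) split
        if y < 20:          # internal node
            return "Red"    # terminal node (illustrative label)
        return "Grey"       # terminal node (illustrative label)
    if x < 85:              # internal node
        return "Grey"       # terminal node (illustrative label)
    return "Red"            # terminal node (illustrative label)

print(predict_region(50, 10), predict_region(90, 30))
```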
How does the automatic tree growing algorithm work?
Regression Tree
The idea is that we want to divide the predictor space into J distinct, non-overlapping regions $R_1, \dots, R_J$. For every observation
that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the
training observations in $R_j$.
So each region $R_j$ represents one of the terminal leaves, where we represent the observations by an average; we
have in this case J such terminal leaves, and we are going to choose the set of boxes so that the total variation of
observations around their mean in a box is as small as possible, i.e. we want to minimize the RSS

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2,$$

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$-th box.
It is computationally infeasible to consider every possible partition of the feature space into J boxes, so trees use
an approximation known as a top-down, greedy approach.
It is top-down because it starts at the top with the whole set of observations and then splits them into two pieces,
one split at a time at each level. It is greedy because it does not find the best split with respect to the whole tree, but only the
best split at that particular step.
Tree-growing process:
We start at the top with the full set of data and all the predictors, and we look for the predictor and the split that
lead to the greatest possible reduction in RSS, and make a split to produce two nodes.
Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as
to minimize the RSS within each of the resulting regions. The process continues until a stopping criterion is
reached; for instance, we may continue until no region contains more than five observations.
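A minimal sketch of the single greedy step described above: for each predictor and each candidate cutpoint, compute the RSS of the two resulting regions and keep the best pair. The function name `best_split` and the exhaustive scan over observed values are illustrative choices, not from the source.

```python
import numpy as np

def best_split(X, y):
    """Greedy step: find the (predictor, cutpoint) pair with the smallest total RSS."""
    n, p = X.shape
    best = (None, None, np.inf)          # (predictor index, cutpoint, RSS)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                 # skip splits that leave a region empty
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

# Repeating this inside each resulting region, until e.g. no region has more than
# five observations, grows the full tree top-down.
```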
Prediction
Once a tree is built, we predict the response for a test observation by passing it down the tree, obeying each of the splits. It will
end up in a terminal node, and then we use the mean of the training observations in that region to make the
prediction.
One question that arises is how large should the tree be? When should we stop growing the tree? One possibility
would just be to grow as large a tree as possible. At the extreme we could have a tree that has one observation in
each terminal node. But that would overfit the data, producing poor test set performance. If you have a tree that is so
large that each observation has its own terminal node, it is going to have a training error of 0. But it is going to be
adapted to that data too much and it is going to overfit.
One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS
due to each split exceeds some (high) threshold. This strategy will result in smaller trees, but is too short-sighted:
a seemingly worthless split early on in the tree might be followed by a very good split, that is, a split that leads
to a large reduction in RSS later on.
A better strategy is to grow a very large tree $T_0$, and then prune it back in order to obtain a subtree.
Cost-complexity (weakest-link) pruning is used for this.
We consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$. For each value of $\alpha$ there
corresponds a subtree $T \subset T_0$ such that

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|$$

is as small as possible.
$|T|$ - number of terminal nodes in the tree $T$.
$R_m$ - subset of predictor space corresponding to the $m$-th terminal node.
$\hat{y}_{R_m}$ - mean of the training observations in $R_m$.
The tuning parameter $\alpha$ controls a trade-off between the subtree's complexity and its fit to the training data.
We select an optimal value $\hat{\alpha}$ using cross-validation. We then return to the full data set and obtain the subtree
corresponding to $\hat{\alpha}$.
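In scikit-learn, cost-complexity pruning is exposed through the `ccp_alpha` parameter of the tree estimators. A rough sketch of choosing $\alpha$ by cross-validation, using synthetic data from `make_regression` as a stand-in for a real training set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for a real training set.
X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Candidate values of alpha along the weakest-link pruning path of the large tree T0.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate each alpha; a larger alpha gives a smaller subtree.
cv_scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```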
Classification Tree
A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative
one. For a classification tree, we predict that each observation belongs to the most commonly occurring class of
training observations in the region to which it belongs.
We grow the tree in the very same way as we did for regression trees, but we do not use the residual sum of squares as
a criterion for making the splits. We need a criterion that is more geared towards classification.
One such measure is the Gini index,

$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right),$$

where $\hat{p}_{mk}$ is the proportion of training observations in the $m$-th region that are from the $k$-th class.
It is a variance measure across the classes. If the Gini index is small, that means that one class is favored. In the
extreme case, if the region is pure, so that all observations are in one class, the Gini index is zero. For this reason the
Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly
observations from a single class.
An alternative to the Gini index is cross-entropy, which gives very similar results. Another is the classification error
rate, but in practice the two other measures are preferable.
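A small sketch computing the three candidate splitting criteria from the class proportions $\hat{p}_{mk}$ of one region; the function name `node_measures` is illustrative.

```python
import numpy as np

def node_measures(labels):
    """Gini index, cross-entropy, and classification error rate for one region."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p_hat_mk
    gini = np.sum(p * (1 - p))           # small when the node is nearly pure
    entropy = -np.sum(p * np.log2(p))    # cross-entropy (deviance)
    error = 1 - p.max()                  # classification error rate
    return gini, entropy, error

print(node_measures(["Red", "Red", "Red", "Grey"]))  # impure node
print(node_measures(["Red", "Red", "Red", "Red"]))   # pure node: Gini = 0
```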
Trees are easy to interpret, but generally do not have the same level of predictive accuracy as some of the other
regression and classification approaches. However, by aggregating many decision trees, the predictive
performance of trees can be substantially improved.
Bagging
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical
learning method (it is frequently used in the context of decision trees).
At the heart of the bagging idea is that averaging a set of observations reduces variance. But in reality we are
likely to have access to only one training set and will not be able to resample.
Bagging achieves variance reduction by taking bootstrap samples from the training set, each of the same
size as our original sample, to create pseudo training sets. Then we grow a tree on each one and take the
average for regression trees, and the majority vote for classification trees.
The trees do not need to be pruned, because bagging them reduces variance. The whole idea of pruning a tree back
was to reduce the variance, but pruning increases bias. Bagging has low bias and reduces variance by averaging.
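A minimal from-scratch sketch of the procedure just described, using bootstrap samples and unpruned regression trees; the synthetic data and names such as `bagged_predict` are for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
rng = np.random.default_rng(0)

# Grow one deep (unpruned) tree on each bootstrap pseudo training set.
trees = []
for _ in range(100):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample, same size as the data
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

def bagged_predict(X_new):
    """Average the trees' predictions (for classification, take a majority vote instead)."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)

print(bagged_predict(X[:3]))
```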
Out-of-Bag Error Estimation
There is a very straightforward way to estimate the test error of a bagged model. On average, each bagged tree
makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a
given bagged tree are referred to as the out-of-bag (OOB) observations.
We can predict the response for the $i$-th observation using each of the trees in which that observation was OOB.
This will yield around B/3 predictions for the $i$-th observation (where B is the number of trees), which we average. This estimate is
essentially the leave-one-out (LOO) cross-validation error for bagging, if the number of trees is large.
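scikit-learn can report this OOB estimate directly via `oob_score=True`; a sketch with the same kind of synthetic data as above (the default base estimator of `BaggingRegressor` is a decision tree):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# With oob_score=True, each observation is evaluated only with the trees
# for which it was out of bag (roughly one third of the trees).
bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)            # OOB R^2: an almost-free estimate of test performance
print(bag.oob_prediction_[:5])   # OOB predictions for the first five observations
```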
Random Forests
The idea of random forests is to build trees in such a way as to actually make the correlation between trees smaller.
This reduces the variance when we average the trees.
In a random forest, when we build the trees and make splits, every time we consider a split in a tree we do not
consider all the possible predictors as we normally do, but rather we select at random m predictors among the p
total predictors, and m is typically about $\sqrt{p}$. So, if there are 100 predictors in our data set, every time we
go to make a split we do not consider all 100, but rather we take a random subset of 10 of them; the other 90 are
not considered. A new selection is made at every possible split.
Random forests improve the bagging test error by one or two percent. By throwing away predictors we have
decorrelated the trees, and the resulting average is a little better. The out-of-bag estimate is also a little better than it
was for bagging.
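In scikit-learn the number of predictors considered at each split is controlled by `max_features`. A sketch (again with synthetic data) comparing bagging, which uses all p predictors at every split, with a random forest that samples about $\sqrt{p}$ of them; whether the forest actually wins on a given data set is not guaranteed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# max_features=None   -> all p predictors considered at every split (this is bagging)
# max_features="sqrt" -> a fresh random subset of about sqrt(p) predictors per split
bagged = RandomForestRegressor(n_estimators=200, max_features=None,
                               oob_score=True, random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                               oob_score=True, random_state=0).fit(X, y)

print(bagged.oob_score_, forest.oob_score_)
```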
Boosting
Like bagging, boosting is a general approach that can be applied to many statistical learning methods for
regression or classification. And similar to bagging and random forests, it gives prediction models that are
averages over trees.
But there is a fundamental difference. For random forests and bagging, the trees that are averaged are all
equivalent, each grown independently on its own bootstrap sample, and the averaging is just used to reduce variance.
With boosting, trees are grown sequentially: each tree is grown using information from previously grown trees.
And each of the trees that is added into the mix is added to improve on the performance of the previous collection
of trees.
Boosting Algorithm for Regression Trees
$\hat{f}(x)$ is a function that is an average of trees. We start with $\hat{f}(x) = 0$, so the residuals are just the data observations and
we have $r_i = y_i$ for all i in the training set.
For B trees ($b = 1, 2, \dots, B$), repeat:
1) fit a tree $\hat{f}^b$ with d splits (d+1 terminal nodes) to the training data (X, r), where r is the current
residual;
2) update $\hat{f}$ by adding in a shrunken version of the new tree (shrunken by some factor $\lambda$): $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$;
3) update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$.
Output the boosted model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$.
$\hat{f}$ is the sum of B shrunken trees.
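A direct sketch of this algorithm with shallow scikit-learn regression trees; the values of B, lam, and d and the synthetic data are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

B, lam, d = 1000, 0.01, 1          # number of trees B, shrinkage lambda, splits per tree d
r = y.astype(float).copy()         # f_hat(x) = 0 at the start, so residuals r_i = y_i
trees = []

for _ in range(B):
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)  # a d-split tree fit to (X, r)
    r -= lam * tree.predict(X)     # update the residuals with the shrunken tree
    trees.append(tree)

def boosted_predict(X_new):
    """f_hat(x) = sum over b of lam * f_hat_b(x): the sum of B shrunken trees."""
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)

print(boosted_predict(X[:3]), y[:3])
```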
Idea behind this procedure?
With a single tree, we can fit a large tree to the data, and we can overfit. In contrast, the idea of boosting is to learn
more slowly. So we start with a tree, but rather than accept the full tree we shrink it back by, for example, a factor
of 0.01. And then we take the residuals and repeat. The idea is that instead of overfitting, we are fitting very
slowly.
At each try we pick up a small piece of the signal; instead of trying to grab a lot of signal with a large amount of
fitting, it fits very slowly in small parts, shrinking each time in order to approximate the signal without
overfitting. Boosting also works for classification. It is similar in spirit but is slightly more complex.
Tuning Parameters
1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although
this overfitting tends to occur slowly if at all. We use cross-validation to select B.
2. The shrinkage parameter $\lambda$, a small positive number. This controls the rate at which boosting learns.
Typical values are 0.01 or 0.001, and the right choice can depend on the problem. A very small $\lambda$ can require
using a very large value of B in order to achieve good performance.
3. The number of splits d in each tree, which controls the complexity of the boosted ensemble. Often d = 1
(single split) works well, in which case each tree involves a single variable, resulting in an additive model,
so there are no interactions.
If d = 2, each tree can involve at most two variables, so that allows pairwise interactions.
More generally d is the interaction depth, and controls the interaction order of the boosted model,
since d splits can involve at most d variables. Typically you would try a few values of d (say 1, 2, 4,
and 8); see the sketch after this list.
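A sketch of how these three tuning parameters map onto scikit-learn's gradient boosting implementation, with B chosen by cross-validation; the parameter values and synthetic data are illustrative, and `max_depth` is used here as a stand-in for the interaction depth d.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# n_estimators = B, learning_rate = lambda, max_depth ~ d (interaction depth)
grid = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.01, max_depth=1, random_state=0),
    param_grid={"n_estimators": [200, 500, 1000, 2000]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)   # cross-validated choice of B
```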
Variable importance measure
For bagged and random forest regression trees, we use the total drop in RSS (residual sum of squares) for a given
predictor over all splits in the tree. We look at all the splits to see whether the variable was involved in that split.
If it was, we measure how much it dropped the RSS, and we average over all the trees.
Similarly, for bagged/RF classification trees, we add up the total amount that the Gini index is decreased by splits
over a given predictor, averaged over all B trees.
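scikit-learn exposes this averaged impurity-decrease measure as the `feature_importances_` attribute; a quick sketch with the synthetic data used above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in node impurity (RSS for regression, Gini for classification),
# accumulated over every split on each predictor and averaged over all trees.
for name, imp in zip([f"X{j}" for j in range(X.shape[1])], forest.feature_importances_):
    print(name, round(imp, 3))
```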
Tree-Based Methods
Gabriela Hromis
Notes are based on different books and class notes from different universities, especially from Statistical Learning,
Hastie & Tibshirani