University of Pennsylvania
The Wharton School
Department of Statistics

Random Forests
Stat 900, November 26, 2007
Random Forests
- Ensemble classification (and regression) algorithm
- Proposed by Leo Breiman in 1999
- Easy to implement
- Very effective in applications, has good generalization properties
- The algorithm outputs more information than just a class label
Classification or Regression Problem
We are given $S_n = \{(X_i, Y_i)\}_{i=1}^n$, a set of i.i.d. observations distributed as $P$.
- $X_i \in \mathcal{X}$ — predictors
- $Y_i \in \mathcal{Y}$ — responses
Goal: find $f_n = A(S_n)$ such that $E(\ell(f_n(X), Y))$ is minimized.
Abstract Definition
Breiman (2001) defines a random forest as follows.
Definition 1. A random forest is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, \ldots\}$ where the $\Theta_k$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input $x$.
The Random Forests Algorithm
1. Choose $T$, the number of trees to grow.
2. Choose $m$, the number of variables used to split each node, with $m \ll M$, where $M$ is the number of input variables. $m$ is held constant while growing the forest.
3. Grow $T$ trees. When growing each tree, do the following.
   (a) Construct a bootstrap sample of size $n$ drawn from $S_n$ with replacement, and grow a tree from this bootstrap sample.
   (b) At each node, select $m$ variables at random and use them to find the best split.
   (c) Grow the tree to its maximal extent. There is no pruning.
4. To classify a point $X$, collect the votes from every tree in the forest and use majority voting to decide the class label.
A minimal code sketch of these steps follows below.
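A minimal sketch of the algorithm, assuming integer class labels 0..K-1 and using scikit-learn's DecisionTreeClassifier as the base tree; the helper names grow_forest and predict_forest are illustrative, not part of any library:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, T=100, m=None, random_state=0):
    """Steps 1-3: grow T unpruned trees, each on a bootstrap sample,
    considering m randomly chosen variables at each node."""
    rng = np.random.default_rng(random_state)
    n, M = X.shape
    m = m or max(1, int(np.sqrt(M)))  # a common default when m is not given
    forest = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)               # bootstrap: sample n with replacement
        tree = DecisionTreeClassifier(max_features=m)  # m random variables per split
        forest.append(tree.fit(X[idx], y[idx]))        # grown to maximal extent, no pruning
    return forest

def predict_forest(forest, X):
    """Step 4: each tree casts a unit vote; return the most popular class."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)  # (T, n_points)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Setting max_features=m makes the tree consider m randomly chosen variables at each node, which corresponds to the per-node randomization of step 3(b).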
Compare to: Bagging (Breiman, 1996)
- Works with any classification algorithm
- Like Random Forests, uses bootstrapping
- Treats the underlying classification algorithm as a "black box"
- A variance reduction technique
A sketch follows below.
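Because bagging treats the base learner as a black box, a sketch needs nothing tree-specific; the only difference from the forest sketch above is that no variables are subsampled at the nodes (the name bag is illustrative):

```python
import numpy as np
from sklearn.base import clone

def bag(base_estimator, X, y, B=50, random_state=0):
    """Bagging: fit B clones of an arbitrary 'black box' classifier,
    each on its own bootstrap sample; aggregate by majority vote."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample
        models.append(clone(base_estimator).fit(X[idx], y[idx]))
    return models  # predict with the same majority vote as predict_forest above
```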
Compare to: Random Split Selection (Dietterich, 2000)
- Grow multiple trees
- When splitting, choose the split uniformly at random from the K best splits
- Can be used with or without pruning
A sketch of the node-level choice follows below.
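A sketch of just the node-level split choice under this scheme; the tree grower around it is omitted, and the names gini and random_split are illustrative:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def random_split(X, y, K=20, rng=None):
    """Score every (variable, threshold) candidate by weighted child
    impurity, then pick uniformly at random among the K best."""
    rng = rng or np.random.default_rng(0)
    candidates = []
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # thresholds between distinct values
            mask = X[:, j] <= t
            score = (mask.sum() * gini(y[mask])
                     + (~mask).sum() * gini(y[~mask])) / len(y)
            candidates.append((score, j, t))
    if not candidates:                             # no valid split at this node
        return None
    candidates.sort(key=lambda c: c[0])            # lower impurity = better split
    return candidates[rng.integers(0, min(K, len(candidates)))][1:]
```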
Compare to: Random Subspace (Ho, 1998)
- Grow multiple trees
- Each tree is grown using a fixed subset of variables
- Combine the trees' votes by majority vote or averaging
A sketch follows below.
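A sketch of the subspace idea, again with illustrative names; each tree stores the feature subset it was trained on so that prediction can slice the inputs the same way:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_forest(X, y, T=100, d=5, random_state=0):
    """Ho-style random subspace method: each tree sees one fixed
    random subset of d variables (no bootstrapping needed)."""
    rng = np.random.default_rng(random_state)
    M = X.shape[1]
    forest = []
    for _ in range(T):
        feats = rng.choice(M, size=min(d, M), replace=False)  # fixed per tree
        forest.append((feats, DecisionTreeClassifier().fit(X[:, feats], y)))
    return forest  # predict by majority vote over tree.predict(X[:, feats])
```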
RF and Error Estimation
1. For each pair $(x_i, y_i)$ in the training sample:
   - Select only the trees whose bootstrap sample does not contain the pair
   - Classify the pair with each of the selected trees
   - Compute the misclassification rate for the pair
2. Average over the computed estimates
This is the out-of-bag (OOB) error estimate; a sketch follows below.
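A sketch of the estimate, assuming integer labels; it reuses the bootstrap loop from the forest sketch above and records which points each tree saw (the name oob_error is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, T=100, random_state=0):
    """Grow a forest while recording each tree's in-bag indices, then
    classify every point using only the trees that never saw it."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees, inbag = [], []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier(max_features='sqrt').fit(X[idx], y[idx]))
        inbag.append(set(idx.tolist()))
    errors = []
    for i in range(n):
        votes = [t.predict(X[i:i + 1])[0]
                 for t, bag in zip(trees, inbag) if i not in bag]
        if votes:                                    # skip points that are in every bag
            pred = max(set(votes), key=votes.count)  # majority vote
            errors.append(pred != y[i])
    return float(np.mean(errors))
```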
RF and Variable Selection
1. For each tree in the forest:
   - Classify the out-of-bag cases and count the number of correct votes
   - Permute variable m in the out-of-bag sample
   - Classify the permuted out-of-bag sample and count the number of correct votes
   - Compute the difference between the unpermuted and permuted counts
2. Compute the average and standard deviation of the differences
3. Compute the z-statistic
A sketch follows below.
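A sketch of this permutation importance for one variable, reusing the trees and in-bag index sets from the OOB sketch above (all names illustrative):

```python
import numpy as np

def permutation_importance(trees, inbag, X, y, var, rng=None):
    """Per-tree drop in correct out-of-bag votes when column `var` is
    permuted; returns the z-statistic mean(diff) / (sd(diff)/sqrt(T))."""
    rng = rng or np.random.default_rng(0)
    diffs = []
    for tree, bag in zip(trees, inbag):
        oob = np.array([i for i in range(len(X)) if i not in bag])
        if oob.size == 0:
            continue
        correct = np.sum(tree.predict(X[oob]) == y[oob])   # unpermuted count
        Xp = X[oob].copy()
        Xp[:, var] = rng.permutation(Xp[:, var])           # permute one variable
        correct_perm = np.sum(tree.predict(Xp) == y[oob])  # permuted count
        diffs.append(correct - correct_perm)
    diffs = np.asarray(diffs, dtype=float)
    # sketch: assumes at least two trees contribute and the sd is nonzero
    return diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))
```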
RF and Interactions
- Compute the Gini importance of each variable
- Rank the Gini importance scores within each tree
- For each pair of variables, compute the average rank difference over all trees
A sketch follows below.
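One heuristic reading of this recipe, assuming the trees were grown on the full variable set so that scikit-learn's impurity-based feature_importances_ plays the role of Gini importance:

```python
import numpy as np
from scipy.stats import rankdata

def interaction_scores(trees):
    """Average over trees of the absolute rank difference between
    each pair of variables' Gini importance scores."""
    ranks = np.stack([rankdata(-t.feature_importances_) for t in trees])   # (T, M)
    return np.mean(np.abs(ranks[:, :, None] - ranks[:, None, :]), axis=0)  # (M, M)
```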
Unsupervised Learning: (Dis)similarity Measure
- For each tree, put the whole training sample down the tree
- For each pair of observations, compute the fraction of trees $s_{ij}$ in which they end up in the same terminal node
- Compute the dissimilarity as $d_{ij} = \sqrt{1 - s_{ij}}$
A sketch follows below.
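A sketch using scikit-learn's tree method apply, which returns the leaf index of each sample; memory grows as $T n^2$, so this form suits modest $n$:

```python
import numpy as np

def dissimilarity(trees, X):
    """Proximity s_ij = fraction of trees in which points i and j fall
    in the same leaf; dissimilarity d_ij = sqrt(1 - s_ij)."""
    leaves = np.stack([t.apply(X) for t in trees])       # (T, n) leaf indices
    same = (leaves[:, :, None] == leaves[:, None, :])    # (T, n, n) co-occurrence
    s = same.mean(axis=0)
    return np.sqrt(1.0 - s)
```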
Unsupervised Learning: Synthetic Datasets
- Mark the observed data as "observed"
- Generate a synthetic sample from the product of the marginals of the observed data
- Mark the generated data as "unobserved"
A sketch follows below.
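Permuting each column independently draws from the product of the empirical marginals; a sketch (synthetic_labels is an illustrative name):

```python
import numpy as np

def synthetic_labels(X, rng=None):
    """Build the two-class problem: observed rows labeled 1, and a
    'synthetic' copy drawn from the product of marginals (each column
    permuted independently) labeled 0."""
    rng = rng or np.random.default_rng(0)
    X_syn = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([np.ones(len(X), dtype=int), np.zeros(len(X), dtype=int)])
    return X_all, y_all
```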
Unsupervised Learning: Clustering
- Train a random forest on the combined observed and synthetic data
- Use the forest to compute the dissimilarity measure for the observed data only
- Use any clustering algorithm with the computed dissimilarity measure
A sketch of the last step follows below.
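A sketch of the final step using SciPy's hierarchical clustering on the precomputed dissimilarity, reusing the dissimilarity function from the sketch above (rf_cluster is an illustrative name):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def rf_cluster(trees, X, n_clusters=2):
    """Hierarchical clustering of the observed data with the
    forest-based dissimilarity as a precomputed distance."""
    D = dissimilarity(trees, X)   # from the earlier sketch
    np.fill_diagonal(D, 0.0)      # squareform expects exact zeros on the diagonal
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```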
Universal Consistency
Assume i.i.d. data $(X, Y)$, $S_n = \{(X_i, Y_i)\}_{i=1}^n$ from $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{Y} = \{-1, 1\}$.
Consider a method $f_n = A(S_n)$, for example $f_n = \mathrm{AdaBoost}(S_n, t_n)$.
Definition 2. A method is universally consistent if for any distribution $P$
$$L(f_n) \xrightarrow{\ a.s.\ } L^*,$$
where $L(f) = P(f(X) \neq Y)$ is the risk and $L^* = \inf_f L(f)$ is the Bayes risk.
Is Random Forests Consistent?
Breiman (2001) wrote:
"Section 2 gives some theoretical background for random forests. Use of the Strong Law of Large Numbers shows that they always converge so that overfitting is not a problem."
···
"This result explains why random forests do not overfit as more trees are added, but produce a limiting value of the generalization error."
One-Dimensional Case
Theorem 3. Consider a binary classification problem. If $\mathcal{X} = \mathbb{R}$, then the classification Random Forests algorithm is equivalent to the 1-nearest-neighbor classifier and hence is not consistent.
Theorem 4. Consider a binary classification problem. If $\mathcal{X} = \mathbb{R}$ and the bootstrap sample size $k \to \infty$ such that $k = o(n)$, then the classification Random Forests algorithm is consistent.
One-Dimensional Case
$\mathcal{X} = [0, 1]$, $\eta(x) = P(Y = 1 \mid x) = 0.25 + 0.5\,I\{x \geq 0.5\}$, $L_{1NN} = 0.375$
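The value 0.375 is the classical Cover-Hart asymptotic risk of the 1-nearest-neighbor rule; since $\eta(x)$ equals 0.25 or 0.75 everywhere, a short check:

```latex
L^* = E\bigl[\min(\eta(X),\, 1 - \eta(X))\bigr] = 0.25,
\qquad
L_{1NN} = E\bigl[\,2\,\eta(X)\,(1 - \eta(X))\,\bigr] = 2 \cdot 0.25 \cdot 0.75 = 0.375 .
```

So 1-NN stays bounded away from the Bayes risk of 0.25, consistent with the inconsistency claim of Theorem 3.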
One-Dimensional Case
[figure]
Two-Dimensional Case
[figure]
Two-Dimensional Case
[figure]
Four-Dimensional Case
[figure]
Eight-Dimensional Case
[figure]
Four-Dimensional Case
Decision boundary: hyperplane
Other Versions of Ensemble Classifiers
Biau et al. (2007) prove:
- Consistency of purely random forests
- Consistency of bagged nearest-neighbor rules
- Consistency of forests consisting of trees based on partitioning the space into nested rectangles