
DECISION TREE

Swaraj Kumar, 381CO15


Netaji Subhas Institute of Technology
Sector 3 Dwarka, Delhi
swarajkumarnsit@gmail.com

Vibhav Agarwal, 400CO15


Netaji Subhas Institute of Technology
Sector 3 Dwarka, Delhi
vibhav.agarwal2@gmail.com
Abstract

The decision tree is a supervised machine learning algorithm widely used in classification and
regression problems. A key property of decision trees is that they are readily interpretable by
humans, because they correspond to a sequence of binary decisions applied to individual input
variables. This paper establishes how to find an efficient structure for a decision tree using
information theory, with special emphasis on the ID3 algorithm, which is covered in detail. The
paper concludes with a discussion of new avenues opening up for decision trees as a supervised
learning algorithm.

1. Introduction

Fig 1: A decision tree showing the feature numbers and split points

This decision tree takes a set of features as input and produces a classification of either
Benign or Cancer as output. To decide the class of a new input vector, we start at the root
node and continue down the tree, repeatedly applying decision rules. Each decision rule
compares a particular input feature with a fixed threshold, and depending on the feature
value we continue down either the left or the right subtree. When we reach a leaf node we stop,
and its label becomes the classification of our input vector.

The decision function at each decision node in our tree can therefore be described by two
parameters: a feature number and a split point. The number in brackets next to the feature
description at each decision node is the feature number as given in the available data. The
number next to each edge leaving a decision node is the split point [1].
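
To make this concrete, a decision node of this kind can be represented by just a feature index and a threshold. The following is a minimal Python sketch; the class and function names are illustrative and not taken from any cited implementation.

class DecisionNode:
    """Internal node: compares one input feature against a fixed threshold."""
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # feature number (index into the input vector)
        self.threshold = threshold  # split point
        self.left = left            # subtree followed when x[feature] <= threshold
        self.right = right          # subtree followed otherwise

class Leaf:
    """Leaf node: stores the final classification, e.g. 'Benign' or 'Cancer'."""
    def __init__(self, label):
        self.label = label

def classify(node, x):
    """Start at the root and repeatedly apply decision rules until a leaf is reached."""
    while isinstance(node, DecisionNode):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

For example, classify(DecisionNode(0, 2.5, Leaf("Benign"), Leaf("Cancer")), [3.1]) follows the right branch and returns "Cancer".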

In order to learn such a model from a training set, we have to determine the structure of the
tree, including which input variable is chosen at each node to form the split criterion, as well
as the value of the threshold parameter for the split [2].

Information theory provides an efficient method for selecting these nodes and is covered in
detail below.

2. Introduction to Trees

A tree is a widely used abstract data type in computer science that simulates a hierarchical
structure. It comprises a root node and subtrees of children, each with a parent node,
represented as a set of linked nodes [3].
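
As an illustrative sketch only (the names below are ours, not from the cited sources), such a tree can be written in Python as a node holding a value and an ordered list of child subtrees:

class TreeNode:
    """A node in a general tree: a value plus an ordered list of child subtrees."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children if children is not None else []

# A two-level example: a root with two children, one of which has its own child.
root = TreeNode("root", [
    TreeNode("left child", [TreeNode("grandchild")]),
    TreeNode("right child"),
])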

Applications of trees:

1. representing family genealogies
2. as the underlying structure in decision-making algorithms
3. to represent priority queues (a special kind of tree called a heap)
4. to provide fast access to information in a database (a special kind of tree called a B-tree)

The important aspects of a tree required for decision making are:

1. The tree must be as short as possible.
2. The tree must be as symmetric as possible.

Why do we need a short tree?

By short, we mean that the average number of nodes on a path from the root to a leaf is the least
among all possible trees. Short trees help to overcome the problem of overfitting, because a
decision tree of greater depth may lose some of its ability to classify unseen data correctly.
Short trees also make decisions more time efficient, because fewer tests (hypothesis evaluations)
have to be performed before a leaf is reached.

Why do we need a symmetric tree?

The more symmetric the tree, the shorter it tends to be and hence the more time efficient it is.
Symmetry also leads to stability, which further helps to overcome the overfitting problem.

3. Information Theory

The amount of information can be viewed as the degree of surprise on learning the value of a
random variable. If we are told that a highly improbable event has just occurred, we receive
more information than if we were told that some very likely event has just occurred, and if we
knew that the event was certain to happen we would receive no information. If the possible
events are equally likely, observing any one of them still conveys a positive amount of
information. A measure of information content will therefore depend on the probability
distribution p(x), so we look for a quantity h(x) that is a monotonic function of the probability
p(x) and that expresses the information content. The form of h(.) can be found by noting that if
we have two unrelated events x and y, then the information gained from observing both of them
should be the sum of the information gained from each of them separately [6]:

h(x, y) = h(x) + h(y)
p(x, y) = p(x) p(y)

From these two relationships, it is easily shown that h(x) must be given by the logarithm of p(x):

h(x) = - log2 p(x)

where the negative sign ensures that the information is positive or zero. The entropy of a random
variable is obtained as the expectation of h(x) with respect to the probability distribution p(x):

H(x) = Σ_x p(x) h(x) = - Σ_x p(x) log2 p(x)

where H(x) is the entropy of the random variable.
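
The definition above can be sketched in a few lines of Python; this assumes a discrete distribution given as a list of probabilities, and the function name is purely illustrative.

import math

def entropy(probs):
    """H(x) = - sum_x p(x) * log2 p(x), in bits; terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin carries 1.0 bit of information
print(entropy([1.0]))        # a certain event carries 0.0 bits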

4. ID3 Algorithm

[Data set from: Tom Mitchell. 1997. Machine Learning. McGraw-Hill. Chapter 3]

The ID3 algorithm builds decision trees using a top-down, greedy approach. Briefly, the steps
of the algorithm are:

1. Start with a training data set, which we'll call S. It should have attributes and classifications.
The attributes of Play Tennis are outlook, temperature, humidity, and wind, and the classification
is whether or not to play tennis. There are 14 observations.

2. Determine the best attribute in the data set S. The first attribute ID3 picks in our example is
outlook. We'll go over the definition of 'best' attribute shortly.

3. Split S into subsets that correspond to the possible values of the best attribute. Under
outlook, the possible values are sunny, overcast, and rain, so the data is split into three subsets
(rows 1, 2, 8, 9, and 11 for sunny; rows 3, 7, 12, and 13 for overcast; and rows 4, 5, 6, 10, and
14 for rain).

4. Make a decision tree node that contains the best attribute. The outlook attribute takes its
rightful place at the root of the Play Tennis decision tree.

5. Recursively make new decision tree nodes with the subsets of data created in step 3.
Attributes can't be reused. If a subset of data agrees on the classification, choose that
classification. If there are no more attributes to split on, choose the most popular classification.
The sunny data is split further on humidity because ID3 decides that, within the set of sunny
rows (1, 2, 8, 9, and 11), humidity is the best attribute. The two paths result in consistent
classifications (sunny/high humidity always leads to no, and sunny/normal humidity always
leads to yes), so the tree ends after that. The rain data behaves in a similar manner, except with
the wind attribute instead of the humidity attribute. On the other hand, the overcast data always
leads to yes without the help of an additional attribute, so the tree ends immediately [4].

4.1 Pseudocode
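
The following is a minimal Python sketch of the ID3 procedure described above, written under the assumption that the training data is a list of attribute-to-value dictionaries with a parallel list of class labels; all names here are illustrative rather than a reference implementation.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy(S) minus the weighted entropy of the partitions induced by `attribute`."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Recursively build a decision tree as nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:              # all examples agree on the classification
        return labels[0]
    if not attributes:                     # no attributes left: pick the most popular class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        sub_rows = [row for row in rows if row[best] == value]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[best] == value]
        tree[best][value] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return tree

Called as id3(rows, labels, ['outlook', 'temperature', 'humidity', 'wind']) on the Play Tennis data, this returns a nested dictionary with outlook at the root.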

5. Choosing the best attribute

For ID3, we think of 'best' in terms of which attribute has the highest information gain, a measure
that expresses how well an attribute splits the data into groups based on their classification.

Fig 2: Effect of the attributes' gain on the data classification achieved

Since the ID3 algorithm deals with the case where classifications are either positive or negative,
we can simplify the entropy formula to:

Entropy(S) = - p+ log2 p+ - p- log2 p-

Here, p+ is the proportion of examples with a positive classification and p- is the proportion of
examples with a negative classification. A plot of Entropy(S) against p+ shows that entropy falls
to 0 as the proportion of either positive or negative examples reaches 100%, and peaks at 1 when
the classes are evenly mixed, i.e., when the examples are most heterogeneous.
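
For example, the Play Tennis training set of 14 examples contains 9 positive and 5 negative classifications, so Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits.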

Fig 3: Variation of Entropy(S) with the proportion of positive examples p+

Information gain measures the reduction in entropy that results from partitioning the data on an
attribute A; in other words, it represents how effective an attribute is at classifying the data.
Given a set of training data S and an attribute A, the formula for information gain is:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

where Sv is the subset of S for which attribute A takes the value v.

The entropies of the partitions, when weighted and summed, can be compared to the entropy of
the entire data set. The first term corresponds to the entropy of the data before the partitioning,
whereas the second term corresponds to the entropy afterwards. We want to maximize
information gain, so we want the entropies of the partitioned data to be as low as possible,
which explains why attributes that exhibit high information gain split the training data into
relatively homogeneous groups [4].
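
As a quick numerical check of this formula on the Play Tennis data (where Wind = Weak covers 8 examples with 6 yes and 2 no, and Wind = Strong covers 6 examples with 3 yes and 3 no), a short Python sketch reusing the entropy function from earlier gives:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

total  = entropy([9/14, 5/14])   # entropy of the full data set, about 0.940
weak   = entropy([6/8, 2/8])     # Wind = Weak subset, about 0.811
strong = entropy([3/6, 3/6])     # Wind = Strong subset, exactly 1.0
gain_wind = total - (8/14) * weak - (6/14) * strong
print(round(gain_wind, 3))       # about 0.048

Repeating this for the other attributes on the full data set gives gains of roughly 0.246 for Outlook, 0.151 for Humidity, 0.048 for Wind, and 0.029 for Temperature.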

The Outlook attribute wins fairly comfortably, so it is placed at the root of the decision tree.
Finally, after the recursive calls, ID3 decides which attributes occupy the remaining split points.

Fig 4: Effect of selecting different attributes and choosing the best one [7]

Fig 5: The final decision tree [6]

6. Applications of Decision Tree

6.1 Business Management

Over the past decades, many organizations have created their own databases to enhance their
customer services. Decision trees are one way to extract useful information from these
databases, and they have already been employed in many applications in the business and
management domain. In particular, decision tree modelling is widely used in customer
relationship management and fraud detection.

6.2 Fault Diagnosis

Another widely used application in the engineering domain is the detection of faults, especially
the identification of a faulty bearing in rotary machines. To detect a faulty bearing, engineers
tend to measure the vibration and acoustic emission (AE) signals emanating from the machine.
However, the measurement involves a number of variables, some of which may be less relevant
to the investigation. Decision trees are a possible tool to remove such irrelevant variables, since
they can be used for feature selection. Through feature selection, three attributes were chosen to
discriminate the faulty conditions of a bearing, namely the minimum value of the vibration
signal, the standard deviation of the vibration signal, and the kurtosis. The chosen attributes
were subsequently used to build another decision tree model. Evaluations of this model show
that more than 95% of the testing data set was correctly classified. Such high accuracy suggests
that the removal of insignificant attributes from a data set is another contribution of decision
trees.
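
As an illustrative sketch only (using scikit-learn on synthetic data rather than the vibration measurements described above; all names are our own), decision-tree-based feature selection might look like this:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 6 candidate features, binary fault label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # only features 0 and 3 actually matter here

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Rank features by how much they reduce impurity in the fitted tree,
# then keep only the most informative ones for building a second model.
importances = tree.feature_importances_
selected = np.argsort(importances)[::-1][:3]
print("selected feature indices:", selected)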

6.3 Energy Consumption

Energy consumption concerns how much electricity is used by individuals. The investigation of
energy consumption is an important issue because it helps utility companies identify the amount
of energy needed. Although many existing methods can be used to investigate energy
consumption, decision trees appear to be preferred, because their hierarchical structure is well
suited to presenting information and insight at several levels of detail.

References

1. Art of Computing, CECS, Australian National University
2. Christopher Bishop. Pattern Recognition and Machine Learning.
3. https://en.wikipedia.org/wiki/Decision_tree
4. Udacity resource on decision trees
5. https://en.wikipedia.org/wiki/Tree_(data_structure)
6. Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Chapter 18.
7. Lecture 4 slides by Marina Santini

