• The algorithm operates over a set of training instances, C.
• If all instances in C are in class P, create a node P and stop; otherwise select a feature F and create a decision node.
• Partition the training instances in C into subsets according to the values of F.
• Apply the algorithm recursively to each of the subsets of C.
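• A minimal Python sketch of this recursion, assuming examples are stored as (feature-dict, class-label) pairs; the attribute chooser is passed in as a function because selecting the most informative feature is covered below:

def build_tree(examples, features, choose_feature):
    """examples: list of (feature_dict, class_label) pairs; features: list of feature names."""
    classes = {label for _, label in examples}
    # If all instances are in one class P, create a leaf node P and stop.
    if len(classes) == 1:
        return classes.pop()
    # Practical guard (not part of the outline above): no features left, return the majority class.
    if not features:
        labels = [label for _, label in examples]
        return max(set(labels), key=labels.count)
    # Otherwise select a feature F and create a decision node.
    feature = choose_feature(examples, features)
    node = {"split_on": feature, "children": {}}
    # Partition the instances according to the values of F and recurse on each subset.
    for value in {inst[feature] for inst, _ in examples}:
        subset = [(inst, label) for inst, label in examples if inst[feature] == value]
        remaining = [f for f in features if f != feature]
        node["children"][value] = build_tree(subset, remaining, choose_feature)
    return node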
Attributes and their values:
  size:   small, medium, large
  colour: red, blue, green
  shape:  brick, wedge, sphere, pillar

Training instances:
  size     colour   shape    class
  medium   blue     brick    yes
  small    red      sphere   yes
  large    green    pillar   yes
  large    green    sphere   yes
  small    red      wedge    no
  large    red      wedge    no
  large    red      pillar   no
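• The same training set written as Python data for the sketches that follow (the name EXAMPLES and the dict encoding are just one convenient representation):

EXAMPLES = [
    # ({feature values}, class)
    ({"size": "medium", "colour": "blue",  "shape": "brick"},  "yes"),
    ({"size": "small",  "colour": "red",   "shape": "sphere"}, "yes"),
    ({"size": "large",  "colour": "green", "shape": "pillar"}, "yes"),
    ({"size": "large",  "colour": "green", "shape": "sphere"}, "yes"),
    ({"size": "small",  "colour": "red",   "shape": "wedge"},  "no"),
    ({"size": "large",  "colour": "red",   "shape": "wedge"},  "no"),
    ({"size": "large",  "colour": "red",   "shape": "pillar"}, "no"),
]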
This can easily be expressed as a nested if-statement:
if (shape == wedge)
    return no;
if (shape == brick)
    return yes;
if (shape == pillar)
{
    if (colour == red)
        return no;
    if (colour == green)
        return yes;
}
if (shape == sphere)
    return yes;
• ID3 uses entropy to determine the most informative attribute:
  entropy(S) = - Σi (pi log pi), where pi is the proportion of examples in S belonging to class i (all logs are base 2).
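• A minimal entropy helper matching this definition (base-2 logs, with 0 log 0 taken as 0):

import math

def entropy(class_counts):
    """Entropy of a node, given a list of per-class example counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:                     # 0 * log(0) is treated as 0
            p = count / total         # relative frequency used as a probability estimate
            result -= p * math.log2(p)
    return result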
• Frequency can be used as a probability estimate.
• E.g. if there are 5 +ve examples and 3 -ve examples in a node, the estimated probability of +ve is 5/8 = 0.625.
• Work out the entropy based on the distribution of classes.
• Try splitting on each attribute.
• Work out the expected information gain for each attribute.
• Choose the best attribute.
• There are 4 +ve examples and 3 -ve examples.
• i.e. the probability of +ve is 4/7 = 0.57; the probability of -ve is 3/7 = 0.43.
• Entropy is: - (0.57 x log 0.57) - (0.43 x log 0.43) = 0.99
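• The same calculation in Python:

import math
p_pos, p_neg = 4 / 7, 3 / 7
root_entropy = -(p_pos * math.log2(p_pos)) - (p_neg * math.log2(p_neg))
print(round(root_entropy, 2))    # 0.99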
• Evaluate the possible ways of splitting.
• Try a split on size, which has three values: large, medium and small.
• There are four instances with size = large.
• There are two large positive examples and two large negative examples.
• The probability of +ve is 0.5.
• The entropy is: - (0.5 x log 0.5) - (0.5 x log 0.5) = 1
• There is one small +ve and one small -ve.
• Entropy is: - (0.5 x log 0.5) - (0.5 x log 0.5) = 1
• There is only one medium +ve and no medium -ves, so entropy is 0.
• The expected information for a split on size is: (4/7 x 1) + (2/7 x 1) + (1/7 x 0) = 0.86
• The expected information gain is: 0.99 - 0.86 = 0.13
• Now try splitting on colour and shape.
• Colour has an information gain of 0.52.
• Shape has an information gain of 0.7.
• Therefore split on shape (see the sketch below).
• Repeat for all subtrees.
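• A sketch that reproduces these gains from the EXAMPLES list defined above; the expected information after a split is the size-weighted average of the subset entropies:

import math
from collections import Counter

def entropy(class_counts):
    # Same measure as above: - sum of p_i log2 p_i over the class distribution.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c)

def information_gain(examples, feature):
    """examples: (feature_dict, class_label) pairs, e.g. the EXAMPLES list above."""
    before = entropy(Counter(label for _, label in examples).values())
    expected = 0.0
    # Expected information after the split: subset entropies weighted by subset size.
    for value in {inst[feature] for inst, _ in examples}:
        subset = [label for inst, label in examples if inst[feature] == value]
        expected += (len(subset) / len(examples)) * entropy(Counter(subset).values())
    return before - expected

for feature in ("size", "colour", "shape"):
    print(feature, round(information_gain(EXAMPLES, feature), 2))
# size 0.13, colour 0.52, shape 0.7  ->  split on shape
# To plug into build_tree above:
#   choose_feature = lambda ex, fs: max(fs, key=lambda f: information_gain(ex, f))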
2. Use the induction algorithm to form a rule to explain the current window.
3. Scan through all of the training instances looking for exceptions to the rule.
4. Add the exceptions to the window.
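• A sketch of this windowing loop; the initial window (step 1) and the stopping condition are not given above, so a random starting sample and "stop when there are no exceptions" are assumed here:

import random

def train_with_window(examples, induce_tree, window_size=20):
    """examples: (feature_dict, class_label) pairs; induce_tree builds a classifier from such a list."""
    # Assumed step 1: start from a small random sample of the training instances.
    window = random.sample(examples, min(window_size, len(examples)))
    while True:
        classify = induce_tree(window)                    # step 2: form a rule for the current window
        exceptions = [(inst, label) for inst, label in examples
                      if classify(inst) != label]         # step 3: scan for exceptions to the rule
        if not exceptions:                                # assumed stop: the rule covers everything
            return classify
        window.extend(exceptions)                         # step 4: add the exceptions to the window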
• If S were replaced by a leaf, we expect the error rate to increase.
• The complexity of S can be measured by the average path length from the root to the leaves.
• C4 repeatedly finds the subtree, S, with the least ratio of increased error rate to complexity.
• If it is significantly worse than the average over all subtrees, S is replaced by a leaf.
• Approximate the backed-up error from the children, assuming we did not prune.
• If the expected error is less than the backed-up error, prune.
• What will be the expected classification error at this leaf?
  S is the set of examples in a node
  k is the number of classes
  N is the number of examples in S
  C is the majority class in S
  n out of the N examples in S belong to C
• The expected error if the node is made a leaf is: E(S) = (N - n + k - 1) / (N + k)
• Let the children of Node be Node1, Node2, etc.
• The probabilities can be estimated by the relative frequencies of attribute values in the sets of examples that fall into the child nodes.
• The right child has an error of 0.333.
• The static error estimate E(b) is 0.375.
• The backed-up error is the weighted sum of the children's errors: BackedUpError(Node) = Σi pi x Error(Nodei).
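• A sketch of both estimates, using the expected-error formula above. The class counts ([4, 2] at the node, [3, 2] and [1, 0] at its children) are assumptions chosen only so the static errors match the 0.375 and 0.333 quoted above; they are not given in the text:

def expected_error(class_counts):
    """Static (expected) error if the node becomes a leaf: (N - n + k - 1) / (N + k)."""
    N = sum(class_counts)
    n = max(class_counts)        # examples in the majority class
    k = len(class_counts)        # number of classes (zero counts must be listed)
    return (N - n + k - 1) / (N + k)

def backed_up_error(child_counts):
    """Weighted sum of the children's static errors, weighted by relative frequency."""
    N = sum(sum(counts) for counts in child_counts)
    return sum((sum(counts) / N) * expected_error(counts) for counts in child_counts)

# Assumed example node: [4, 2] examples, split into children with [3, 2] and [1, 0].
node, children = [4, 2], [[3, 2], [1, 0]]
static = expected_error(node)          # 0.375
backed = backed_up_error(children)     # approx. 0.413
print(static, backed, "prune" if static < backed else "keep")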