COMP9416

Machine Learning

Solutions to Problems on ID3 and Pruning

  1. [Describe for the given table of examples how the information gain calculations for the first split in the decision tree are done if the split is based on the texture attribute.]

    There are 7 positive examples and 4 negative ones overall, so the overall entropy is -(7/11)*log2(7/11) - (4/11)*log2(4/11) which is about 0.94566.
    There are 3 values for the texture attribute: smooth, wavy and rough.
    For smooth, there are 4 positive examples and 1 negative, so the entropy is -(4/5)*log2(4/5) - (1/5)*log2(1/5) which is about 0.7219.
    For wavy there is one positive and one negative example, so the entropy is -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1.
    For rough there are two positive and two negative examples, so the entropy is -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1.
    Expected information for a split on texture is thus:

    0.7219*(5/11) + 1*(2/11) + 1*(4/11) = 0.8736
    Expected information gain for a split on texture is thus 0.94566 - 0.8736 = 0.07206.


    The question as originally posed ends here. What follows is the rest of the information gain calculations for the other two attributes.
    Attribute size:
    There are 3 values for the size attribute: small, medium and large.
    For small, there are two positive and two negative examples - the entropy is 1 as in previous such cases.
    For medium, there is one positive and one negative example - again the entropy is 1.
    For large, there are 4 positive and one negative example - the entropy is 0.7219 as in the case of smooth above.
    Thus the expected information for a split on size is:
    0.7219*(5/11) + 1*(2/11) + 1*(4/11) = 0.8736 exactly as for texture, so as for texture the expected information gain for a split on size is 0.07206.

    Attribute temperature:
    There are 4 values for the temperature attribute: cold, cool, warm and hot.
    For cold, there is 1 positive and 3 negative examples, so the entropy is -(1/4)*log2(1/4) - (3/4)*log2(3/4) which is about 0.811278.
    For cool, there are 3 positive and no negative examples, so the entropy is -(0/3)*log2(0/3) - (3/3)*log2(3/3) = 0.
    For warm, there is one positive and no negative example - again the entropy is 0.
    For hot, there are 2 positive and 1 negative examples, so the entropy is -(1/3)*log2(1/3) - (2/3)*log2(2/3) which is about 0.918296.
    Thus the expected information for a split on temperature is:
    0.811278*(4/11) + 0*(3/11) + 0*(1/11) + 0.918296*(3/11) which is about 0.54545.
    Thus expected information gain for a split on temperature is about 0.94566 - 0.54545 = 0.40021.


    The information gain for a split on temperature is clearly larger than that for the other attributes, so we would split on temperature in the first instance.

  2. [Given the subtree of a decision tree shown below, and using the Laplace error estimate and the criterion defined in lectures, should the the children of the subtree be pruned or not?]

                     P: ( [8, 5] )
                         /      \
                        /        \
                       /          \
                 L: [6,2]     R: [4,1]
    
    The nodes have been labelled P(parent), L(left) and R(right) for ease of reference.

    The Laplace (static) error for L is 0.3 (obtained by putting N = 8, n = 6, and k = 2 into the Laplace error estimate formula.)
    The Laplace error for R is 0.285714 (obtained by setting N = 5, n = 4, & k = 2).
    Hence BackedUpError(P) = (8/13)*0.3 + (5/13)*0.285714 = 0.2945 or so.
    The Laplace (static) error for P (set N=13, n=8, k=2) is 0.4. This is larger than the backed up error, so we do not prune