Solution for hw4 Q1

1. a) First split on D and then on B. Splitting first on B and then on D also works.

b) Splitting on A gives:
     +: 3 Red, 1 Green
     -: 3 Green, 1 Red
   Gain = 1 - (1/2 * Entropy(3,1) + 1/2 * Entropy(3,1))
        = 1 - Entropy(3,1)
        = 1 - (-(1/4 log 1/4 + 3/4 log 3/4))
        = 1 - 0.81
        = 0.19

   Splitting on B gives:
     +: 2 Red, 2 Green
     -: 2 Red, 2 Green
   Gain = 1 - (1/2 * Entropy(2,2) + 1/2 * Entropy(2,2))
        = 1 - Entropy(2,2)
        = 1 - 1
        = 0
   The same holds for D.

   Splitting on C gives:
     +: 2 Green, 3 Red
     -: 2 Green, 1 Red
   Gain = 1 - (5/8 * Entropy(2,3) + 3/8 * Entropy(2,1))
        = 1 - (5/8 * (-(2/5 log 2/5 + 3/5 log 3/5)) + 3/8 * (-(2/3 log 2/3 + 1/3 log 1/3)))
        = 1 - 0.95
        = 0.05

   So the information-gain algorithm picks A, which has the highest information gain.

c) No, the tree learned by this algorithm will not be as small as the tree we found in part (a). The reason is that the algorithm does not consider the possibility of using multiple attributes together; it tests only one variable at a time and then greedily picks the one with the highest information gain. But local optimality does not lead to global optimality in this case.

Some people wrote that the tree in (a) actually overfits. That is a possibility: in general, finding the smallest tree means making the tree very dependent on the training data, since the tree is smallest with respect to THAT data only. But this is not the answer to this question, and besides, the relationship between B and D might be a genuine pattern rather than an irregularity. This means that if we were to get more examples, we might see the same behavior, with B and D separating the set when used together. A decision tree, in general, tries to find such patterns, but as you can see, considering one variable at a time and picking the current best does not always yield the best overall result.
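For reference, below is a minimal Python sketch (not part of the original solution) that recomputes the gains in part (b) from the per-branch Red/Green counts. The helper names entropy and info_gain are my own choices for illustration, not anything specified in the assignment.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-count sequence, e.g. (3, 1)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(branches):
    """Information gain of a split, given per-branch (Red, Green) counts.

    The parent set is the union of all branches (here 4 Red / 4 Green,
    so its entropy is 1).
    """
    total = sum(sum(b) for b in branches)
    parent = entropy([sum(b[i] for b in branches) for i in range(2)])
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    return parent - remainder

# Per-branch (Red, Green) counts from part (b):
print(round(info_gain([(3, 1), (1, 3)]), 2))  # split on A -> 0.19
print(round(info_gain([(2, 2), (2, 2)]), 2))  # split on B (or D) -> 0.0
print(round(info_gain([(3, 2), (1, 2)]), 2))  # split on C -> 0.05
```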