Solution for hw4 Q1

1. a) First split on D and then on B. Splitting first on B and then on D also works.

b) Splitting on A gives:
     +: 3 Red, 1 Green
     -: 3 Green, 1 Red
   Gain = 1 - (1/2 * Entropy(3,1) + 1/2 * Entropy(3,1))
        = 1 - Entropy(3,1)
        = 1 - (-(1/4 log 1/4 + 3/4 log 3/4))
        = 1 - 0.81
        = 0.19

   Splitting on B gives:
     +: 2 Red, 2 Green
     -: 2 Red, 2 Green
   Gain = 1 - (1/2 * Entropy(2,2) + 1/2 * Entropy(2,2))
        = 1 - Entropy(2,2)
        = 1 - 1
        = 0
   The same holds for D.

   Splitting on C gives:
     +: 2 Green, 3 Red
     -: 2 Green, 1 Red
   Gain = 1 - (5/8 * Entropy(2,3) + 3/8 * Entropy(2,1))
        = 1 - (5/8 * (-(2/5 log 2/5 + 3/5 log 3/5)) + 3/8 * (-(2/3 log 2/3 + 1/3 log 1/3)))
        = 1 - 0.95
        = 0.05

   So the information-gain algorithm picks A, which has the highest information gain.

c) No, the tree learned by this algorithm will not be as small as the tree we found in part (a). The reason is that the algorithm does not consider the possibility of using multiple attributes together; it tests only one variable at a time and then greedily picks the one with the highest information gain. But local optimality does not lead to global optimality in this case.

Some people wrote that the tree in (a) actually overfits. That is a possibility: in general, finding the smallest tree means making the tree very dependent on the training data, since the tree is smallest with respect to THAT data only. But this is not the answer to this question, and besides, the relationship between B and D might be a genuine pattern rather than an irregularity. This means that if we were to get more examples, we might see the same behavior, with B and D separating the set when used together. A decision tree, in general, tries to find such patterns, but as you can see, considering one variable at a time and picking the current best does not always yield the best overall result.
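For reference, below is a minimal Python sketch (not part of the original solution) that recomputes the gains in part (b) from the per-branch Red/Green counts. The helper names entropy and info_gain are my own choices for illustration, not anything specified in the assignment.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-count sequence, e.g. (3, 1)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(branches):
    """Information gain of a split, given per-branch (Red, Green) counts.

    The parent set is the union of all branches (here 4 Red / 4 Green,
    so its entropy is 1).
    """
    total = sum(sum(b) for b in branches)
    parent = entropy([sum(b[i] for b in branches) for i in range(2)])
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    return parent - remainder

# Per-branch (Red, Green) counts from part (b):
print(round(info_gain([(3, 1), (1, 3)]), 2))  # split on A -> 0.19
print(round(info_gain([(2, 2), (2, 2)]), 2))  # split on B (or D) -> 0.0
print(round(info_gain([(3, 2), (1, 2)]), 2))  # split on C -> 0.05
```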