3: Discuss the tradeoff between generalization error and training error for a machine learning method in the context of the expressiveness of the hypothesis class of the learning method.

Any machine learning procedure can be understood as performing a search through a particular hypothesis space defined by that procedure. Ultimately, the goal of learning is to find the best possible hypothesis within that space with respect to the training data, which in turn represents the target function. The question then becomes: how do we define "best," and moreover, how do we define best with respect to a particular hypothesis space?

Given any particular hypothesis, there are two quantities of interest here: training error and generalization error. Training error is the error that the hypothesis makes on the training set used to find it. This quantity is easily measurable, and is typically minimized during training. Generalization error, on the other hand, is a measure of the true error between the hypothesis and the target function. Typically we cannot measure it directly; instead, we approximate it by measuring the performance of the learned hypothesis on an independent test set.

A more powerful learning method will explore a larger hypothesis space. This, however, presents the drawback of over-fitting. Given the increased size of the hypothesis space, there is a greater chance that a hypothesis within that space closely matches the target function. However, there is also a greater chance of finding a hypothesis within that larger space that matches the training data well but does not match the target function. The hypothesis space contains both good approximations of the target function and hypotheses which fit the training data well but do not match the target function.
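This tradeoff can be illustrated numerically with a small sketch: fitting polynomials of increasing degree (an increasingly expressive hypothesis class) to noisy samples of a target function. The dataset and degrees below are invented for illustration; the training error can only fall as the hypothesis class grows, while the most expressive fit interpolates the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# noisy samples of a smooth target function (the "true generating function")
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=x_test.shape)

train_errs, test_errs = [], []
for degree in [1, 3, 9]:  # increasingly expressive hypothesis classes
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errs.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_errs.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

# training error is non-increasing as the hypothesis space grows
# (the degree-9 polynomial interpolates all 10 training points exactly),
# but the test error reflects how much noise has been memorized
print("train MSE:", [round(e, 4) for e in train_errs])
print("test  MSE:", [round(e, 4) for e in test_errs])
```

Note that the degree-9 fit drives training error essentially to zero, which says nothing about its error on the independent test set.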
In effect, the hypothesis space contains hypotheses which 'memorize' the training data independent of the true generating function, fitting both the noise and the underlying structure of the training set. Furthermore, given that most data sets are noisy, it is likely that even the optimal hypothesis, or the true generating function itself, has nonzero generalization error. Nevertheless, as a powerful learning method searches a larger and larger hypothesis space, it will likely discover hypotheses that match the training data better than the true generating function itself does. In this case the over-fit hypothesis will have a lower training error than even the optimal hypothesis, i.e., the one that produces the lowest generalization error. Thus, we can see that as a powerful learning technique considers more hypotheses (typically achieved by extending the training period), it makes a tradeoff between generalization error and training error. Training error is typically monotonically non-increasing, whereas after a point generalization error will begin to increase. Thus, the 'best' hypothesis in the space is the one that minimizes generalization error. Unfortunately, we rarely know this point, so we approximate it by using an independent validation set to stop a powerful learning technique early, once we believe we have reached the best generalization error.

Conversely, a less powerful learning method will explore a smaller hypothesis space. This limits its tendency to over-fit, since there will likely be fewer hypotheses in the space that memorize both the noise and the structure of the training set. However, it is also likely that there are fewer hypotheses in the smaller space that approximate the target function well. Thus, in this case we are likely to find that generalization error is more closely approximated by training error, but neither error is likely to be small.

5: We never test the same attribute twice along one path in a decision tree. Why not?
For boolean or nominal attributes, testing the same attribute twice along one path in a decision tree would be redundant. Consider that the first test along a single path for a given attribute divides all instances according to the different values of that attribute; thus, further down that path, all remaining instances have only a single value of that attribute. If tested again, there would be no division of those instances. Rather, they would all travel down the same branch. Thus, the information gain would be 0, since the sole child node of the second attribute test would have exactly the same entropy as its parent, because it contains exactly the same instances. Interestingly, this does not apply if the attribute is continuous, as typically a single node divides a continuous feature according to a threshold value. Thus, testing such an attribute later in the tree can apply a second threshold value and further divide the instances which reach the second test.
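The redundancy argument for nominal attributes can be checked directly by computing information gain. The sketch below uses an invented toy dataset with a single nominal attribute; after the first split, every instance in a child node shares the same attribute value, so re-testing that attribute yields gain exactly 0.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(instances, labels, attr):
    """Information gain of splitting `instances` on nominal attribute `attr`."""
    n = len(labels)
    groups = {}
    for inst, y in zip(instances, labels):
        groups.setdefault(inst[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# toy data: one nominal attribute 'wind' with values in {weak, strong}
data = [{"wind": "weak"}, {"wind": "weak"}, {"wind": "strong"}, {"wind": "strong"}]
labels = ["yes", "no", "no", "no"]

# splitting at the root on 'wind' yields positive gain
gain_root = info_gain(data, labels, "wind")

# the 'weak' child now holds only instances with wind == weak;
# re-testing 'wind' there produces a single group with the parent's entropy
child = [d for d in data if d["wind"] == "weak"]
child_labels = [y for d, y in zip(data, labels) if d["wind"] == "weak"]
gain_child = info_gain(child, child_labels, "wind")

print(gain_root)   # positive: the first test is informative
print(gain_child)  # 0.0: the repeated test separates nothing
```

The child's gain is zero by construction: the "split" produces one group containing every instance, so the weighted child entropy equals the parent entropy exactly.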