3: Discuss the tradeoff between generalization error and training error for a machine learning method in the context of the expressiveness of the hypothesis class of the learning method.

Any machine learning procedure can be understood as performing a search through a particular hypothesis space defined by that procedure. Ultimately, the goal of learning is to find the best possible hypothesis within that space with respect to the training data, which in turn represents the target function. The question then becomes: how do we define "best," and moreover, how do we define best with respect to a particular hypothesis space?

Given any particular hypothesis, there are two quantities of interest here: training error and generalization error. Training error is the error that the hypothesis makes on the training set used to find it. This quantity is easily measurable, and is typically minimized during training. Generalization error, on the other hand, is a measure of the true error between the hypothesis and the target function. Typically we cannot measure it directly; instead, we approximate it by measuring the performance of the learned hypothesis on an independent test set.

A more powerful learning method will explore a larger hypothesis space. This, however, presents the drawback of over-fitting. Given the increased size of the hypothesis space, there is a greater chance that a hypothesis within that space closely matches the target function. However, there is also a greater chance of finding a hypothesis within that larger space that matches the training data well but does not match the target function. The hypothesis space contains both good approximations of the target function and hypotheses which fit the training data well but do not match the target function.
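This tradeoff can be illustrated numerically with a small sketch: fitting polynomials of increasing degree (an increasingly expressive hypothesis class) to noisy samples of a target function. The dataset and degrees below are invented for illustration; the training error can only fall as the hypothesis class grows, while the most expressive fit interpolates the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# noisy samples of a smooth target function (the "true generating function")
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=x_test.shape)

train_errs, test_errs = [], []
for degree in [1, 3, 9]:  # increasingly expressive hypothesis classes
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errs.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_errs.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

# training error is non-increasing as the hypothesis space grows
# (the degree-9 polynomial interpolates all 10 training points exactly),
# but the test error reflects how much noise has been memorized
print("train MSE:", [round(e, 4) for e in train_errs])
print("test  MSE:", [round(e, 4) for e in test_errs])
```

Note that the degree-9 fit drives training error essentially to zero, which says nothing about its error on the independent test set.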
In effect, the hypothesis space contains hypotheses which 'memorize' the training data independent of the true generating function, fitting both the noise and the underlying structure of the training set. Furthermore, given that most data sets are noisy, it is likely that even the optimal hypothesis, or the true generating function itself, has nonzero generalization error. Nevertheless, as a powerful learning method searches a larger and larger hypothesis space, it will likely discover hypotheses that match the training data better than the true generating function itself does. In this case the over-fit hypothesis will have a lower training error than even the optimal hypothesis, i.e., the one that produces the lowest generalization error. Thus, we can see that as a powerful learning technique considers more hypotheses (typically achieved by extending the training period), it makes a tradeoff between generalization error and training error. Training error is typically monotonically non-increasing, whereas after a point generalization error will begin to increase. Thus, the 'best' hypothesis in the space is the one that minimizes generalization error. Unfortunately, we rarely know this point, so we approximate it by using an independent validation set to stop a powerful learning technique early, once we believe we have reached the best generalization error.

Conversely, a less powerful learning method will explore a smaller hypothesis space. This limits its tendency to over-fit, since there will likely be fewer hypotheses in the space that memorize both the noise and the structure of the training set. However, it is also likely that there are fewer hypotheses in the smaller space that approximate the target function well. Thus, in this case we are likely to find that generalization error is more closely approximated by training error, but neither error is likely to be small.

5: We never test the same attribute twice along one path in a decision tree. Why not?
For boolean or nominal attributes, testing the same attribute twice along one path in a decision tree would be redundant. Consider that the first test along a single path for a given attribute divides all instances according to the different values of that attribute; thus, further down that path, all remaining instances have only a single value of that attribute. If tested again, there would be no division of those instances. Rather, they would all travel down the same branch. Thus, the information gain would be 0, since the sole child node of the second attribute test would have exactly the same entropy as its parent, because it contains exactly the same instances. Interestingly, this does not apply if the attribute is continuous, as typically a single node divides a continuous feature according to a threshold value. Thus, testing such an attribute later in the tree can apply a second threshold value and further divide the instances which reach the second test.
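The redundancy argument for nominal attributes can be checked directly by computing information gain. The sketch below uses an invented toy dataset with a single nominal attribute; after the first split, every instance in a child node shares the same attribute value, so re-testing that attribute yields gain exactly 0.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(instances, labels, attr):
    """Information gain of splitting `instances` on nominal attribute `attr`."""
    n = len(labels)
    groups = {}
    for inst, y in zip(instances, labels):
        groups.setdefault(inst[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# toy data: one nominal attribute 'wind' with values in {weak, strong}
data = [{"wind": "weak"}, {"wind": "weak"}, {"wind": "strong"}, {"wind": "strong"}]
labels = ["yes", "no", "no", "no"]

# splitting at the root on 'wind' yields positive gain
gain_root = info_gain(data, labels, "wind")

# the 'weak' child now holds only instances with wind == weak;
# re-testing 'wind' there produces a single group with the parent's entropy
child = [d for d in data if d["wind"] == "weak"]
child_labels = [y for d, y in zip(data, labels) if d["wind"] == "weak"]
gain_child = info_gain(child, child_labels, "wind")

print(gain_root)   # positive: the first test is informative
print(gain_child)  # 0.0: the repeated test separates nothing
```

The child's gain is zero by construction: the "split" produces one group containing every instance, so the weighted child entropy equals the parent entropy exactly.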