CS578 Empirical Methods in Machine Learning and Data Mining Mid-Term Exam Due: Tuesday, November 2, 2004 at 2:55pm Please keep answers clear and concise. Some questions can be answered with a few words. Some questions admit alternate answers depending on how you think about the problem. Better answers receive more credit. All work must be done individually. Discussion is not allowed. If you have questions see the instructor or one of the TAs. You may use references. The text for this mid-term is available through the course web page. Handwritten exams are acceptable if legible. Sign your exam before turning it in indicating that you did the work without external help. Exams will not be accepted without signatures. (20 points) COMPUTATIONAL COMPLEXITY OF LEARNING ALGORITHMS For the following, assume a 2-class problem with N training cases and A attributes. Express answers with big-O notation in terms of A and N. (4) What is the computational complexity of finding the best binary split for one continuous attribute (i.e., finding the threshold for that split)? (4) What is the computational complexity of growing a decision tree to depth D if all the attributes are binary (with the standard recursive partitioning algorithm we discussed in class). (4) What is the complexity of calculating the distance between two cases for k-nearest neighbor? (4) What is the complexity for T test cases of weighted kNN when k is much smaller than N? Assume a straightforward implementation of kNN that doesn't use fancy data structures. State any assumptions. (4) What is the complexity if we switch to 1-NN and do LOOCV (leave-one-out cross validation) on the entire data set? (8 points) OVERFITTING AND GENERALIZATION (4) Training with more training data usually improves generalization performance. Suppose we have a training set containing M training cases. If we duplicate the training cases D times each so that the training set now contains D*M cases (but still only M unique cases), will this improve the generalization performance of a neural net? Why? (4) Suppose instead of a neural net we use k-NN on the duplicated training set. Will the optimal value of k be larger than, smaller than, or the same size as the optimal value before the training set was duplicated? Why? (25 points) DECISION TREES (3) Briefly explain why decision trees are not as well suited for regression problems as they are for classification problems. One nice property of decision trees is that they can handle missing values. When a case is missing an attribute value that is tested at a node, we can handle the missing value several ways: 1) let the case can go down all branches 2) let the case go down all branches weighted by 1/attribute_arity 3) let the case go down the branch most other cases went down 4) let the case go down each branch weighted by the fraction of other cases that went down each branch 5) let the case go down one branch selected at random 6) create a new branch just for the missing values (4) Which approach do you think is best? Briefly explain why. (4) Which approach do you think is worst? Briefly explain why. We want to grow decsion trees to minimize squared error instead of using information gain or gain ratio. Let us define RMS_Gain as: / |Sv| \ RMS_Gain(Attribute) = RMSE(S) - SUM | ------ * RMSE(Sv) | v \ |S | / where S is the node being split, v ranges over the values of the Attribute, Sv is the subset of cases from S with Attribute value v, and |S| and |Sv| are the number of cases at the node S or in the subset Sv. RMS_Gain is similar to InfoGain, but we've replaced the entropy terms with RMSE terms. The RMSE(S) for node S is calculated by predicting probabilities for the node in the usual way (i.e. by dividing the fraction of cases in each class in the node by the total number of cases in the node) and then computing the root-mean-squared-error of these predictions for the node. Calculate the Info_Gain, Gain_Ratio, and Gini_Score for the following two attributes: Attribute True A B Class --- --- ----- 1 1 1 1 2 2 2 1 1 2 2 2 3 1 2 3 2 2 3 2 1 (2) Info_Gain ___ ___ (2) Gain_Ratio ___ ___ (2) RMS_Gain ___ ___ (2) What is the largest and smallest possible Info_Gain? Do large or small values of Info_Gain represent good splits? (2) What is the largest and smallest possible Gain_Ratio? Do large or small values of Gain_Ratio represent good splits? (2) What is the largest and smallest possible RMS_Gain? Do large or small values of RMS_Gain represent good splits? (2) Which attribute test above, A or B, should we install? (15 points) K-NEAREST NEIGHBOR (3) Does k-nearest neighbor converge to baseline performance as k approaches 1 or as k approaches the size of the training set? Why? (3) If a data set has significant class noise, do we want to use a smaller or larger value of k? Why? (3) Does kNN tend to overfit with smaller values of k or with larger values of k? Why? (3) The optimal value of k for a data set is K_OPT. If noise is added to the distance function, do you expect the optimal value of k to increase, decrease, or stay the same? Why? (3) The optimal value of k for unweighted kNN for a data set is K_OPT. If we switch to weighted kNN (where the weight of a case is inversely proportional to the distance to that case), do you expect the optimal value of k to increase, decrease, or stay the same? Why? (10 points) PERCEPTRONS In the following, assume inputs are +1 or -1 and that the threshold unit outputs +1 for inputs > 0 and -1 otherwise. (2) Briefly explain why the threshold unit in perceptrons is both a blessing and a curse. (2) Briefly explain why replacing threshold units with sigmoids (or any well-behaved non-linear squashing/activation function) makes neural nets more useful than perceptron nets. (2) Draw a perceptron that computes the NAND (NOT AND) function for two inputs A and B. Be sure to show the weights for the inputs to A and B, as well as the bias weight. (This is not a network -- just a single perceptron.) (2) Draw a perceptron that computes the OR function for three inputs A, B, and C. Be sure to show the weights for the inputs to A, B and C, as well as the bias weight. (This is not a network -- just a single perceptron.) (2) Draw a network of perceptrons that computes the XOR function for three inputs A, B, and C. Be sure to show the weights for the inputs to A, B and C, as well as all bias weights and weights joining the perceptrons. (12 points) ARTIFICIAL NEURAL NETS (ANNs) (3) Show that a neural net with two hidden layers consisting only of linear units is equivalent to a network with one hidden layer of linear units. That is, if we remove the non-linear activation functions (e.g., the sigmoids) from the nodes so that each node's activation is just the weighted sum of it's inputs, show that the expressive power of a network with two hidden layers is the same as an ANN with one hidden layer. To keep it simple, it is sufficient to show this for a simple ANN that has 2 inputs, 2 hidden units in each layer, and one output unit. (3) Are the bias weights in an ANN really necessary? Suppose we eliminate the bias weight going to each node. This means each node's activation is the sigmoid of just the weighted sum of it's inputs, without an extra weight going to a constant "1" activation. The net is trained as usual with backpropagation. How well do you think it will perform compared to a regular ANN trained with bias weights. Why? (3) Consider a neural net with N hidden units trained with per epoch updating on a data set with fixed learning rate L and no momentum. Suppose the optimal early stopping point is E epochs. If we train the same net from the same data using per pattern updating instead of batch updating using a learning rate of L/E, will training the net probably require E passes through the train set, more than E passes, or fewer than E passes? Why? (3) Consider a neural net with N hidden units trained with per epoch updating on a data set D with learning rate L and momentum M. Suppose the optimal early stopping point is E epochs. If we add noise to the attributes in D, and train the same net with per epoch updating, will training the net probably require E epochs, more than E epochs, or fewer than E epochs to reach optimal performance? Why? (10 points) CROSS VALIDATION (3) Briefly explain why the performance of a model estimated with m-fold cross validation might consistently be a little worse than the true performance. (3) Suppose you are doing m-fold cross-validation. Show that the average accuracy of the m folds is equal to the accuracy of the union of the m-fold test sets. In other words, compute the accuracy on each fold individually, and then take the average of these m accuracies, and compare this to calculating the accuracy on all the folds concatenated together after doing prediction on each of them with m-fold CV. Is the same true for RMSE (root-mean-squared error)? (2) List as many reasons as you can why we usually prefer 10-fold cross validation to 2-fold cross validation? (2) List as many reasons as you can why we might prefer 10-fold cross validation to 100-fold cross validation?