CS578 Empirical Methods in Machine Learning and Data Mining
Mid-Term Exam
Due: Thursday, November 6, 2003 at 2:55pm

Please keep answers clear and concise. Some questions can be answered with a few words. Some questions admit alternate answers depending on how you think about the problem. Better answers receive more credit. All work must be done individually. Discussion is not allowed. If you have questions, see the instructor or TA. You may use references. The text for this mid-term is available through the course web page. Handwritten exams are acceptable if legible. Sign your exam before turning it in, indicating that you did the work without external help.

(20 points) COMPUTATIONAL COMPLEXITY OF LEARNING ALGORITHMS

(2) What is the computational complexity (big-O notation) of installing the root attribute in a decision tree for a training set with C cases and V boolean (2-valued) variables? Express the complexity in terms of C and V.

(2) What is the computational complexity of installing the root attribute if all V attributes are continuous instead of boolean?

(2) What is the computational complexity of growing a complete decision tree to depth D with the standard recursive partitioning algorithm we discussed in class, given a training set with C cases and V boolean variables?

(2) What is the complexity of growing the same complete tree to depth D with boolean variables if you do 1-step lookahead when growing the tree?

(2) Given a fully-connected feedforward neural net with I inputs, O outputs, and one hidden layer with H hidden units, how many floating point multiplications must be done to make the predictions for a test set containing C cases? (This is just forward propagation; the net was already trained on a train set with backpropagation.)

(2) What is the complexity of calculating the Euclidean distance between two cases for k-nearest neighbor with A attributes?

(2) What is the complexity of 1-NN for regression (predicting a continuous value) for a data set with A attributes and C cases?

(2) If we move from 1-NN to k-NN, how does the complexity change?

(2) If we switch to locally weighted averaging instead of k-NN, what is the complexity?

(2) What is the complexity if we use 1-NN and do LOOCV (leave-one-out cross validation)?
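As a concrete reference for the operation counting in this section, here is a minimal sketch of installing a root split on boolean attributes with information gain. The helper names (info_gain, pick_root) are illustrative, not course code; the loops show which work is repeated per case and per attribute.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(cases, labels, attr):
        """Score one candidate split: a single pass over all C cases."""
        branch = {0: [], 1: []}
        for x, y in zip(cases, labels):   # C iterations per candidate attribute
            branch[x[attr]].append(y)
        n = len(labels)
        return entropy(labels) - sum(len(s) / n * entropy(s)
                                     for s in branch.values() if s)

    def pick_root(cases, labels, num_attrs):
        """Install the root: score each of the V candidate attributes."""
        return max(range(num_attrs), key=lambda a: info_gain(cases, labels, a))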
(7 points) OVERFITTING AND GENERALIZATION

(2) Training with more training data usually improves generalization performance. Suppose we have a training set containing M training cases. If we duplicate the training cases D times each, so that the training set now contains D*M cases (but still only M unique cases), will this improve the generalization performance of a neural net? Why?

(2) Suppose instead of a neural net we use k-NN on the duplicated training set. Will the optimal value of k be larger than, smaller than, or the same size as the optimal value before the training set was duplicated? Why?

(2) Suppose we train a linear SVM on the duplicated training set. Will the new best value C' with the duplicated data be larger than, smaller than, or the same as the best C for the unduplicated training set? Why?

(1) What is the new best value C'? Express your answer in terms of D, M, and C (the best C with the original unduplicated data set).

(25 points) DECISION TREES

(2) Briefly explain why growing full decision trees to maximum depth using randomly installed splits and no pruning often yields trees that perform reasonably well.

(2) Why does lookahead often reduce the size/depth of a decision tree?

We want to grow decision trees to maximize accuracy instead of using information gain or gain ratio. Let us define AccuracyGain as:

    AccGain(Attribute) = Accuracy(S) - SUM_v [ (|Sv|/|S|) * Accuracy(Sv) ]

where S is the node being split, v ranges over the values of the Attribute, Sv is the subset of cases from S with Attribute value v, and |S| and |Sv| are the number of cases at the node S or in the subset Sv. AccGain is similar to InfoGain, but we've replaced the entropy terms with accuracy terms. Accuracy(S) is the baseline accuracy of S: the fraction of instances in S classified correctly if you predict the majority class for all cases in S. (A short computational sketch of these definitions follows the questions below.)

Compute the InfoGain, GainRatio, and AccGain for these two attributes:

    Attribute    True
     A    B     Class
    ---  ---    -----
     1    1       1
     1    2       2
     2    1       1
     2    2       2
     3    1       1
     3    1       2
     3    2       2

(2) InfoGain:   A _?_   B _?_
(2) GainRatio:  A _?_   B _?_
(2) AccGain:    A _?_   B _?_

(2) When using InfoGain, we install the split (attribute test) that has the highest InfoGain. Should we install the split that has the highest or lowest AccGain?

(2) Does AccGain have the same problem with high-arity attributes that InfoGain has? What is this problem, and why does AccGain have this problem or not have it?

(2) Optimizing accuracy is appealing, but AccGain has a weakness compared to InfoGain for problems with more than two classes. What is it?
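A minimal sketch of computing all three quantities exactly as defined above, assuming each attribute is given as a column of values with a parallel column of class labels (the function names are illustrative, not course code):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def accuracy(labels):
        """Fraction correct when predicting the majority class."""
        return Counter(labels).most_common(1)[0][1] / len(labels)

    def gains(values, labels):
        """InfoGain, GainRatio, and AccGain for one attribute."""
        n = len(labels)
        subsets = {v: [y for x, y in zip(values, labels) if x == v]
                   for v in set(values)}
        info = entropy(labels) - sum(len(s) / n * entropy(s)
                                     for s in subsets.values())
        split_info = entropy(values)  # entropy of the attribute's own values
        ratio = info / split_info if split_info else 0.0
        acc = accuracy(labels) - sum(len(s) / n * accuracy(s)
                                     for s in subsets.values())
        return info, ratio, acc

For example, calling gains with attribute A's column and the Class column from the table fills in one column of the answer grid.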
One nice property of decision trees is that they can handle missing values. When a case is missing an attribute value that is tested at a node, we can handle the missing value several ways:

1) let the case go down all branches
2) let the case go down all branches weighted by 1/attribute_arity
3) let the case go down the branch most other cases went down
4) let the case go down each branch weighted by the fraction of other cases that went down each branch
5) let the case go down one branch selected at random
6) create a new branch just for the missing values
7) ignore the case

(2) Which approach do you think is best? Briefly explain why.

(2) Which approach do you think is worst? Briefly explain why.

(1) Just between 1) and 2), which one do you think is more appropriate? Why?

(2) Briefly describe a situation where 6) would be very inappropriate to use.

(2) Briefly describe a situation where 7) would be very inappropriate to use.

(10 points) K-NEAREST NEIGHBOR

(2) If a data set has significant class noise, do we want to use a smaller or larger value of k? Why?

(2) If a data set has significant attribute noise, do we want to use a smaller or larger value of k? Why?

(2) Does kNN tend to overfit with smaller values of k or with larger values of k? Why?

(2) Does Locally Weighted Averaging tend to overfit with smaller values of kernel width, or with larger values of kernel width? Why?

(2) A training set has M instances, and the optimal value of k for this training set is K_OPT. If the training set gets larger, do you expect the optimal value of k to increase, decrease, or stay the same? Why?

(12 points) ARTIFICIAL NEURAL NETS

(2) If you train a neural net with backpropagation for many epochs, which is more likely to have the lowest RMSE on the *train* set: a neural net with 1, 10, or 100 hidden units?

(2) If you train the same neural net with backpropagation for many epochs, which is more likely to have the lowest RMSE on the *test* set: a neural net with 1, 10, or 100 hidden units?

(2) When training a neural net with backpropagation on a training set with 0/1 targets, which do you think is likely to reach its early stopping point first, RMSE or accuracy? Why?

(2) Backpropagation is done with error functions on the outputs such as squared error or cross-entropy, even when the goal is to maximize classification accuracy. Why is it difficult to do backpropagation to maximize classification accuracy directly?

(4) Derive backprop for just the output unit of a neural net trained with -1*[target*log(output) + (1-target)*log(1-output)] instead of the usual squared error (Target - Output)^2. Assume a sigmoid output unit and 0/1 targets.

(8 points) SUPPORT VECTOR MACHINES

(2) If we remove one of the support vectors from the train set, does the size of the maximum margin decrease, stay the same, or increase for that data set?

(2) If we double the size of the training set by drawing additional points from the same distribution, do you expect the size of the margin to decrease, stay the same, or increase?

Suppose instead of linear separating hyperplanes, we could train SVMs using separating V-hyperplanes -- two linear semi-hyperplanes that bend where they join. The SVM chooses the angle at the bend as part of maximizing the margin.

[Figure: two panels, LINEAR HYPERPLANE and V-HYPERPLANE, showing the same two classes of points separated by a straight hyperplane on the left and, on the right, by a V-hyperplane that bends where its two semi-hyperplanes join.]

(2) Do you expect the size of the maximum margin to decrease, stay the same, or increase with V-hyperplanes compared to linear hyperplanes?

(2) For the same value of C, are SVM models with V-hyperplanes less complex, more complex, or the same complexity as SVM models with linear hyperplanes?

(8 points) CROSS VALIDATION

(4) List as many reasons as you can why we usually prefer 10-fold cross validation to 2-fold cross validation.

(4) Suppose we use 3-fold cross validation to train neural nets with early stopping. In trial 1 the net is trained on fold 1, early stopped on fold 2, and tested on fold 3. 3-fold CV yields three neural nets, N1, N2, and N3. Suppose we want to combine the three nets into one model M by averaging the predictions from the three nets: if Ni(Cj) is the prediction net Ni makes on test case Cj, the average prediction of the three nets is M(Cj) = (N1(Cj)+N2(Cj)+N3(Cj))/3 for case Cj. If we compare M to Ni on Ni's test fold, why isn't this a fair comparison? Design an experiment where we can fairly compare M to each of the Ni's. (One possible design is sketched at the end of the exam.)

(10 points) COMPARISON OF ALGORITHMS

(2) List 2 advantages of decision trees over neural nets.

(2) List 2 advantages of neural nets over decision trees.

(2) List 2 advantages of k-NN over decision trees or neural nets.

(2) List 2 disadvantages of k-NN compared to decision trees or neural nets.

(2) List 2 advantages of SVMs over decision trees and neural nets.
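Returning to the cross validation question above, here is a minimal sketch of one possible fair design. It assumes a hypothetical trainer train_net(train_fold, stop_fold) that returns a prediction function; the point of this particular design is that the three nets and the averaged model M are all scored on held-out cases that no net used for training or early stopping:

    from math import sqrt

    def rmse(preds, targets):
        """Root mean squared error of parallel prediction/target lists."""
        return sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

    def compare_ensemble_to_members(folds, holdout, train_net):
        """folds: the three CV folds; holdout: (cases, targets) reserved before
        CV so that no net ever sees them; train_net: hypothetical trainer."""
        nets = []
        for i in range(3):
            # Same rotation as in the question: train on fold i, stop on fold i+1.
            nets.append(train_net(folds[i], folds[(i + 1) % 3]))

        x, y = holdout
        member_rmse = [rmse(net(x), y) for net in nets]  # each Ni on unseen cases
        avg = [sum(ps) / 3.0 for ps in zip(*(net(x) for net in nets))]
        return member_rmse, rmse(avg, y)                 # M on the same unseen cases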