CS578 Empirical Methods in Machine Learning and Data Mining Mid-Term Exam Due: Tuesday, November 6, 2007 at 2:55pm Please keep answers clear and concise. Some questions can be answered with a few words. Some questions admit alternate answers depending on how you think about the problem. Better answers receive more credit. All work must be done individually. Discussion is not allowed. If you have questions see the instructor or one of the TAs. You may use references. The text for this mid-term is available through the course web page. Handwritten exams are acceptable if legible. Sign your exam before turning it in indicating that you did the work without external help. Exams will not be accepted without signatures. (18 points) COMPUTATIONAL COMPLEXITY OF LEARNING ALGORITHMS For the following, assume a 2-class problem with N training cases and A attributes. Where appropriate, express answers with big-O notation in terms of A and N. (3) What is the computational complexity of finding the best binary split for one continuous attribute (i.e., finding the threshold for that split)? (3) Suppose attribute A has V discrete values (i.e., A is arity V), but we want to do a binary split on A. One way to do this is subsetting, where one subset of the values of A go down one branch, and the remaining values of A go down the other branch. What is the complexity of finding the optimal partitioning of the values V into the subsets for the two branches branches? (3) What is the complexity of doing 1-step lookahead when installing the root attribute in a decision tree? (3) How many multiplies are required to do forward propagation in a standard fully-connected feedforward neural net with A inputs, H hidden units, and O outputs? (3) What is the complexity of finding the single nearest neighbor to a test point in 1-NN? (3) What is the complexity of LOOCV (leave-one-out cross validation) using 1-NN? (6 points) OVERFITTING AND GENERALIZATION (2) Training with more training data usually improves generalization. Suppose we have a training set containing M training cases. If we duplicate the training cases D times each so that the training set now contains D*M cases (but still only M unique cases), will this improve the generalization performance of a neural net? Why? (2) Suppose instead of a neural net we use k-NN on the duplicated training set. Will the optimal value of k be larger than, smaller than, or the same size as the optimal value before the training set was duplicated? Why? (2) Suppose instead of a neural net or k-NN we prune a decision tree using the duplicated training set. Will the tree pruned using the duplicated data likely be larger, smaller, or the same size as the tree pruned on the original unduplicated data? Why? State any assumptions you make. (21 points) DECISION TREES We want to grow decsion trees to minimize squared error instead of using information gain or gain ratio. Let us define RMS_Gain as: / |Sv| \ RMS_Gain(Attribute) = RMSE(S) - SUM | ------ * RMSE(Sv) | v \ |S | / where S is the node being split, v ranges over the values of the Attribute, Sv is the subset of cases from S with Attribute value v, and |S| and |Sv| are the number of cases at the node S or in the subset Sv. RMS_Gain is similar to InfoGain, but we've replaced the entropy terms with RMSE terms. The RMSE(S) for node S is calculated by predicting probabilities for the node in the usual way (i.e. by dividing the fraction of cases in each class in the node by the total number of cases in the node) and then computing the root-mean-squared-error of these predictions for the node. Calculate the Info_Gain, Gain_Ratio, and Gini_Score for the following two attributes: Attribute True A B Class --- --- ----- 1 1 2 1 2 1 2 1 1 2 2 2 3 1 1 3 2 2 3 2 1 3 3 2 (2) Info_Gain ___ ___ (2) Gain_Ratio ___ ___ (2) RMS_Gain ___ ___ (2) What is the largest and smallest possible Info_Gain that could be calculated for *any* data set? Do large or small values of Info_Gain represent good splits? (2) What is the largest and smallest possible Gain_Ratio that could be calculated on any data set? Do large or small values of Gain_Ratio represent good splits? (2) What is the largest and smallest possible RMS_Gain that could be calculated on any data set? Do large or small values of RMS_Gain represent good splits? (3) Which attribute test above, A or B, should we install for each splitting criterion? Info_Gain? _____ Gain_ration? _____ RMS_Gain? _____ One nice property of decision trees is that they can handle missing values. When a case is missing an attribute value that is tested at a node, we can handle the missing value several ways: 1) let the case can go down all branches 2) let the case go down all branches weighted by 1/attribute_arity 3) let the case go down the branch most other cases went down 4) let the case go down each branch weighted by the fraction of other cases that went down each branch 5) let the case go down one branch selected at random 6) create a new branch just for the missing values (2) Which approach do you think is best? Briefly explain why. (2) Which approach do you think is worst? Briefly explain why. (1) Just between 1) and 2), which one do you think is more appropriate? why? (1) Briefly describe a situation where 6) would be very inappropriate to use. (18 points) K-NEAREST NEIGHBOR (2) Does k-nearest neighbor converge to baseline performance as k approaches 1 or as k approaches the size of the training set? Why? (3) If a data set has significant class noise, do we want to use a smaller or larger value of k? Why? (3) Does kNN tend to overfit with smaller values of k or with larger values of k? Why? (3) A training set has M instances and the optimal value of k for this training set is K_OPT. If the training set gets larger, do you expect the optimal value of k to increase, decrease, or stay the same? Why? (3) The optimal value of k for unweighted kNN for a data set is K_OPT. If we switch to weighted kNN (where the weight of a case is inversely proportional to the distance to that case), do you expect the optimal value of k to increase, decrease, or stay the same? Why? (2) Suppose we perform kNN on a large data set with many features, but give weight 1 to all features used in a decision tree trained on this data, and give weight 0 to all features not used in the decision tree. Do you think kNN will perform better, worse, or about the same with the features weighted this way? Why? (2) Suppose a data set contains both nominal attributes and continuous attributes. Calculating distance on continuous attributes is easy. One way to calculate distance on nominal attributes is to define the distance between two nominal attributes as 0 when the attributes have the same value, and 1 whenever the attribute have different values. Another approach is to first convert each nominal attribute of arity V to V binary attributes (as we did with neural nets), and then calculate distance on these binary attributes. Does it matter which approach we use? If it does, what should we do to make the two approaches yield the same answers? (10 points) PERCEPTRONS In the following, assume inputs are +1 or -1 and that the threshold unit outputs +1 for inputs > 0 and -1 otherwise. (2) Briefly explain why the threshold unit in perceptrons is both a blessing and a curse. (2) Briefly explain why replacing threshold units with sigmoids (or any well-behaved non-linear squashing/activation function) makes neural nets more useful than perceptron nets. (2) Draw a perceptron that computes the NAND (NOT AND) function for two inputs A and B. Be sure to show the weights for the inputs to A and B, as well as the bias weight. (This is not a network -- just a single perceptron.) (2) Draw a perceptron that computes the OR function for three inputs A, B, and C. Be sure to show the weights for the inputs to A, B and C, as well as the bias weight. (This is not a network -- just a single perceptron.) (2) Draw a network of perceptrons that computes the XOR function for three inputs A, B, and C. Be sure to show the weights for the inputs to A, B and C, as well as all bias weights and weights joining the perceptrons. (19 points) ARTIFICIAL NEURAL NETS (ANNs) (2) Backpropagation is done with error functions on the outputs such as squared error or cross-entropy, even when the goal is to maximize classification accuracy. Why is it difficult to do backpropagation to maximize classification accuracy directly? (2) When training a neural net with backpropagation on a training set with 0/1 targets, which do you think is likely to reach its early stopping point first, RMSE or accuracy? Why? (2) Consider a neural net with one hidden layer containing N hidden units, and another neural net with N hidden layers each containing 1 hidden unit. If there are I inputs and O outputs, how many weights are in each neural net? Which net can learn more complex functions? (3) Show that a neural net with two hidden layers consisting only of linear units is equivalent to a network with one hidden layer of linear units. That is, if we remove the non-linear activation functions (e.g., the sigmoids) from the nodes so that each node's activation is just the weighted sum of it's inputs, show that the expressive power of a network with two hidden layers is the same as an ANN with one hidden layer. To keep it simple, it is sufficient to show this for a simple ANN that has 2 inputs, 2 hidden units in each layer, and one output unit. (3) Are the bias weights in an ANN really necessary? Suppose we eliminate the bias weight going to each node. This means each node's activation is the sigmoid of just the weighted sum of it's inputs, without an extra weight going to a constant "1" activation. The net is trained as usual with backpropagation. How well do you think it will perform compared to a regular ANN trained with bias weights. Why? (3) If a neural net with one hidden unit is trained long enough on a large consistent training set containing N cases, will the training set error always go to zero? If a neural net with 2^N hidden units is trained on the same training set, will the training set error always go to zero? Explain your reasoning? (4) Derive backprop for just the output unit of a neural net trained with -1*[target*log(output) + (1-target)*log(1-output)] instead of the usual squared error (Target - Output)^2. Assume a sigmoid output unit and 0/1 targets. (8 points) CROSS VALIDATION (3) Suppose you are doing m-fold cross-validation. Show that the average accuracy of the m folds is equal to the accuracy of the union of the m-fold test sets. In other words, compute the accuracy on each fold individually, and then take the average of these m accuracies, and compare this to calculating the accuracy on all the folds concatenated together after doing prediction on each of them with m-fold CV. Is the same true for RMSE (root-mean-squared error)? (3) List as many reasons as you can why we usually prefer 10-fold cross validation to 2-fold cross validation? (2) Suppose we use 3-fold cross validation to train neural nets. In trial 1 the net is trained on fold 1, early stopped on fold 2, and tested on fold 3. 3-fold CV yields three neural nets, N1, N2, and N3. Suppose we want to combine the three nets into one model M by averaging the predictions from the three nets. If Ni(Cj) is the prediction net Ni makes on test case Cj, the average prediction of the three nets is M(Cj) = (N1(Cj)+N2(Cj)+N3(Cj))/3 for case Cj. If we compare M to Ni on Ni's test fold, why isn't this a fair comparison?