CS578 Empirical Methods in Machine Learning and Data Mining
Mid-Term Exam
Due: Thursday, November 16, 2006 at 2:55pm

Please keep answers clear and concise. Some questions can be answered with a few words. Some questions admit alternate answers depending on how you think about the problem. Better answers receive more credit. All work must be done individually; discussion is not allowed. If you have questions, see the instructor or one of the TAs. You may use references. The text for this mid-term is available through the course web page. Handwritten exams are acceptable if legible. Sign your exam before turning it in, indicating that you did the work without external help. Exams will not be accepted without signatures.

(20 points) COMPUTATIONAL COMPLEXITY OF LEARNING ALGORITHMS

For the following problems, more credit will be given for analyzing more efficient algorithms, for example using O(n*logn) sort algorithms instead of O(n^2) sort algorithms where appropriate.

(2) What is the computational complexity (big-O) of installing the root test in a decision tree given a training set with C cases and V variables? Assume all variables are nominals with 3 values. Express the complexity in terms of C and V.
(2) What is the computational complexity of growing a decision tree (with the standard recursive partitioning algorithm discussed in class) to depth D given a training set with C cases and V variables? Again assume all variables are nominals with 3 values, and express the complexity in terms of C, D, and V.
(2) What is the complexity if all the variables are continuous instead of nominal?
(2) What is the complexity of calculating the distance between two cases for k-nearest neighbor if all the attributes are continuous?
(2) What is the complexity of 1-NN for regression (predicting a continuous value) on the data set above? Assume a straightforward implementation that doesn't use any fancy data structures.
(2) If we move from 1-NN to k-NN, how does the complexity change? Again assume a straightforward implementation of k-NN that doesn't use any fancy data structures.
(2) If we switch to locally weighted averaging instead of k-NN, what is the complexity?
(2) What is the complexity if we use 1-NN and do LOOCV (leave-one-out cross-validation)?
(2) How many multiplications and additions/subtractions are necessary to make predictions on the data set above (with continuous attributes) for a neural net containing H hidden units in each of L hidden layers?
(2) What is the complexity of forward stepwise feature selection in the worst case if there are K features and we assume that the cost of training/testing the model on a subset of features is a constant O(1)?
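
For concreteness, a minimal straightforward implementation of 1-NN regression is sketched below (a sketch only; function names are illustrative, not from the course). Counting the work it does per prediction is one way to ground the k-NN complexity analysis above.

    # A straightforward 1-NN regression: no fancy data structures, just a
    # linear scan. With C training cases and V continuous attributes,
    # count the work done per prediction.

    def squared_distance(x, y):
        """Squared Euclidean distance between two cases with V continuous attributes."""
        return sum((a - b) ** 2 for a, b in zip(x, y))

    def predict_1nn(train_x, train_y, query):
        """Predict the target value of the single nearest training case."""
        best_index, best_dist = 0, float("inf")
        for i, case in enumerate(train_x):       # one pass over all C cases...
            d = squared_distance(case, query)    # ...each costing O(V)
            if d < best_dist:
                best_index, best_dist = i, d
        return train_y[best_index]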
(5 points) OVERFITTING AND GENERALIZATION

(3) Training with a larger training set usually improves generalization performance. Suppose we have a training set containing M training cases. If we duplicate the training cases D times each, so that the training set now contains D*M cases (but still only M unique cases), will the optimal value of k in k-nearest neighbor be larger than, smaller than, or the same as the optimal value of k before the training set was duplicated? Why?
(2) Suppose we are pruning decision trees with the duplicated training set. Will the duplicates cause more pruning, less pruning, or the same amount of pruning as with the unduplicated training set? Why?

(20 points) DECISION TREES

(2) Briefly explain why decision trees are not as well suited for regression problems as they are for classification problems.

One nice property of decision trees is that they can handle missing values. When a case is missing an attribute value that is tested at a node, we can handle the missing value several ways:
1) let the case go down all branches
2) let the case go down all branches weighted by 1/attribute_arity
3) let the case go down the branch most other cases went down
4) let the case go down each branch weighted by the fraction of other cases that went down each branch
5) let the case go down one branch selected at random
6) create a new branch just for the missing values

(2) Which approach do you think is best? Briefly explain why.
(2) Which approach do you think is worst? Briefly explain why.

Calculate the Info_Gain, Gain_Ratio, and Gini_Score for the following two attributes:

    Attribute   True
    A    B      Class
    ---  ---    -----
    1    1      1
    1    2      2
    2    1      1
    2    2      1
    3    1      2
    3    2      2
    3    2      2

                     A       B
(2) Info_Gain       ___     ___
(2) Gain_Ratio      ___     ___
(2) Gini_Score      ___     ___

(2) What are the largest and smallest possible values of Info_Gain? Do large or small values of Info_Gain represent good splits?
(2) What are the largest and smallest possible values of Gain_Ratio? Do large or small values of Gain_Ratio represent good splits?
(2) What are the largest and smallest possible values of Gini_Score? Do large or small values of Gini_Score represent good splits?
(2) Which attribute test above, A or B, should we install?
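
For reference, the split criteria can be computed as sketched below, assuming the standard textbook definitions; in particular, the Gini score here is taken as the reduction in Gini impurity, which may differ from the exact definition used in class.

    from collections import Counter
    from math import log2

    def entropy(values):
        """Entropy (in bits) of the distribution of values."""
        n = len(values)
        return -sum((c / n) * log2(c / n) for c in Counter(values).values())

    def gini(values):
        """Gini impurity of the distribution of values."""
        n = len(values)
        return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

    def split_scores(attr, labels):
        """Info_Gain, Gain_Ratio, and Gini score for splitting on one attribute."""
        n = len(labels)
        branches = {}
        for v, y in zip(attr, labels):
            branches.setdefault(v, []).append(y)
        def weighted(impurity):
            # Impurity of the children, weighted by branch size.
            return sum(len(b) / n * impurity(b) for b in branches.values())
        info_gain = entropy(labels) - weighted(entropy)
        split_info = entropy(attr)   # entropy of the split proportions themselves
        gain_ratio = info_gain / split_info if split_info > 0 else 0.0
        gini_score = gini(labels) - weighted(gini)   # impurity reduction
        return info_gain, gain_ratio, gini_score

    # Example: attribute A and the class column from the table above.
    # print(split_scores([1, 1, 2, 2, 3, 3, 3], [1, 2, 1, 1, 2, 2, 2]))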
(10 points) K-NEAREST NEIGHBOR

(2) Does k-nearest neighbor converge to baseline performance as k approaches 1 or as k approaches the size of the training set? Why?
(2) Does kNN tend to overfit with smaller values of k or with larger values of k? Why?
(2) The optimal value of k for a data set is K_OPT. If noise is added to the distance function, do you expect the optimal value of k to increase, decrease, or stay the same? Why?
(2) The optimal value of k for unweighted kNN for a data set is K_OPT. If we switch to weighted kNN (where the weight of a case is inversely proportional to the distance to that case), do you expect the optimal value of k to increase, decrease, or stay the same? Why?
(2) k-nearest neighbor is sometimes called "lazy learning". Why?

(10 points) ARTIFICIAL NEURAL NETS

(4) Show that a neural net with two hidden layers consisting only of linear units is equivalent to a network with one hidden layer of linear units. That is, if we remove the non-linear activation functions (e.g., the sigmoids) from the nodes so that each node's activation is just the weighted sum of its inputs, show that the expressive power of a network with two hidden layers is the same as that of an ANN with one hidden layer. To keep it simple, it is sufficient to show this for a simple ANN that has 2 inputs, 2 hidden units in each layer, and one output unit.
(4) Derive backprop for the output layer of a neural net using exponential squared error e^((Target - Output)^2) instead of the usual squared error (Target - Output)^2.
(2) Does changing the error function at the output of a neural net alter the derivation of backprop for the hidden layers? Why?

(10 points) FEATURE SELECTION

(2) Suppose we have 10^6 features, most of which are noisy and irrelevant to the problem we want to learn. Unfortunately, we have a small training set of only 100 cases. How many different subsets of features (different combinations of features) could we test, i.e., how many combinations are there?
(2) Besides computational cost, what is a potential problem of trying all possible subsets of features to determine which subset works best?
(2) Forward stepwise selection and backward stepwise elimination are both greedy feature selection methods: once they add or delete a feature, that feature is permanently added to or removed from the feature set. (A sketch of the basic greedy procedure appears at the end of the exam.) Describe a way to make these methods less greedy.
(2) Briefly describe a situation where forward stepwise selection does not find the best subset of features.
(2) Suppose the goal is to find a small subset of features that performs well. Briefly describe a situation where backward elimination can fail to find some good small subsets of features.

(10 points) CROSS-VALIDATION

(2) Suppose you are doing m-fold cross-validation. Is the average root-mean-squared-error (RMSE) of the m folds always equal to the RMSE of the union of the m test sets? In other words, if you compute the RMSE on each fold individually and then take the average of these m RMSEs, and compare this to the RMSE computed on all the folds concatenated together (after doing prediction on each of them with m-fold CV), will you always get the same answer? Prove your answer.
(1) Is the same true for accuracy? (No proof necessary.)
(1) Is the same true for ROC area? (No proof necessary.)
(2) List as many reasons as you can why we usually prefer 10-fold cross-validation to 2-fold cross-validation.
(2) In k-fold cross-validation, which are more independent, the train sets or the test sets? Why?
(2) Briefly explain why the performance of a model estimated with m-fold cross-validation might consistently be a little worse than the true performance.

(5 points) MISSING VALUES

(4) Consider a decision tree where missing values are handled by sending a case down each branch with a weight equal to the fraction of cases with no missing values that go down that branch. Prove that if a test case is missing *all* attribute values, this decision tree will predict for this case the baseline probability (from the training set) of each class. That is, if the training set contains 20% positives, the tree will predict p(+)=0.20 and p(-)=0.80 for any case missing all attribute values.
(1) Is this a good or a bad property?

(10 points) COMPARISON OF ALGORITHMS

(2) List 3 advantages of decision trees over neural nets.
(2) List 3 advantages of neural nets over decision trees.
(2) List 3 advantages of k-NN over decision trees or neural nets.
(2) List 3 disadvantages of k-NN relative to decision trees or neural nets.
(2) List 3 disadvantages of perceptrons (with threshold units) relative to neural nets trained with backprop.
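
For reference, the greedy forward stepwise procedure from the FEATURE SELECTION section is sketched below. Here evaluate is a hypothetical caller-supplied scoring function (e.g., cross-validated accuracy on a feature subset; higher is better), not part of the exam. Counting its calls in the worst case connects to the complexity question in the first section.

    def forward_stepwise(all_features, evaluate):
        """Greedy forward stepwise feature selection (sketch).

        evaluate is a hypothetical caller-supplied function scoring a
        feature subset; it must handle the empty set (baseline score).
        Once a feature is added it is never removed -- this is the
        greediness the feature selection questions ask about.
        """
        selected = []
        best_score = evaluate(selected)
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                return selected
            scored = [(evaluate(selected + [f]), f) for f in candidates]
            score, feature = max(scored, key=lambda sf: sf[0])
            if score <= best_score:    # no single addition helps: stop
                return selected
            selected.append(feature)
            best_score = score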