CS578 Empirical Methods in Machine Learning and Data Mining
Mid-Term Exam
Due: Thursday, November 16, 2006 at 2:55pm

Please keep answers clear and concise. Some questions can be answered with a few words. Some questions admit alternate answers depending on how you think about the problem. Better answers receive more credit. All work must be done individually; discussion is not allowed. If you have questions, see the instructor or one of the TAs. You may use references. The text for this mid-term is available through the course web page. Handwritten exams are acceptable if legible. Sign your exam before turning it in, indicating that you did the work without external help. Exams will not be accepted without signatures.

(20 points) COMPUTATIONAL COMPLEXITY OF LEARNING ALGORITHMS

For the following problems, more credit will be given for analyzing more efficient algorithms, for example using O(n*logn) sort algorithms instead of O(n^2) sort algorithms where appropriate.

(2) What is the computational complexity (big-O) of installing the root test in a decision tree given a training set with C cases and V variables? Assume all variables are nominals with 3 values. Express the complexity in terms of C and V.
(2) What is the computational complexity of growing a decision tree (with the standard recursive partitioning algorithm discussed in class) to depth D given a training set with C cases and V variables? Again assume all variables are nominals with 3 values, and express the complexity in terms of C, D, and V.
(2) What is the complexity if all the variables are continuous instead of nominal?
(2) What is the complexity of calculating the distance between two cases for k-nearest neighbor if all the attributes are continuous?
(2) What is the complexity of 1-NN for regression (predicting a continuous value) on the data set above? Assume a straightforward implementation that doesn't use any fancy data structures.
(2) If we move from 1-NN to k-NN, how does the complexity change? Again assume a straightforward implementation of k-NN that doesn't use any fancy data structures.
(2) If we switch to locally weighted averaging instead of k-NN, what is the complexity?
(2) What is the complexity if we use 1-NN and do LOOCV (leave-one-out cross-validation)?
(2) How many multiplications and additions/subtractions are necessary to make predictions on the data set above (with continuous attributes) for a neural net containing H hidden units in each of L hidden layers?
(2) What is the complexity of forward stepwise feature selection in the worst case if there are K features and we assume that the cost of training/testing the model on a subset of features is a constant O(1)?
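
For concreteness, a minimal straightforward implementation of 1-NN regression is sketched below (a sketch only; function names are illustrative, not from the course). Counting the work it does per prediction is one way to ground the k-NN complexity analysis above.

    # A straightforward 1-NN regression: no fancy data structures, just a
    # linear scan. With C training cases and V continuous attributes,
    # count the work done per prediction.

    def squared_distance(x, y):
        """Squared Euclidean distance between two cases with V continuous attributes."""
        return sum((a - b) ** 2 for a, b in zip(x, y))

    def predict_1nn(train_x, train_y, query):
        """Predict the target value of the single nearest training case."""
        best_index, best_dist = 0, float("inf")
        for i, case in enumerate(train_x):       # one pass over all C cases...
            d = squared_distance(case, query)    # ...each costing O(V)
            if d < best_dist:
                best_index, best_dist = i, d
        return train_y[best_index]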
(5 points) OVERFITTING AND GENERALIZATION

(3) Training with a larger training set usually improves generalization performance. Suppose we have a training set containing M training cases. If we duplicate the training cases D times each, so that the training set now contains D*M cases (but still only M unique cases), will the optimal value of k in k-nearest neighbor be larger than, smaller than, or the same as the optimal value of k before the training set was duplicated? Why?
(2) Suppose we are pruning decision trees with the duplicated training set. Will the duplicates cause more pruning, less pruning, or the same amount of pruning as with the unduplicated training set? Why?

(20 points) DECISION TREES

(2) Briefly explain why decision trees are not as well suited for regression problems as they are for classification problems.

One nice property of decision trees is that they can handle missing values. When a case is missing an attribute value that is tested at a node, we can handle the missing value several ways:
1) let the case go down all branches
2) let the case go down all branches weighted by 1/attribute_arity
3) let the case go down the branch most other cases went down
4) let the case go down each branch weighted by the fraction of other cases that went down each branch
5) let the case go down one branch selected at random
6) create a new branch just for the missing values

(2) Which approach do you think is best? Briefly explain why.
(2) Which approach do you think is worst? Briefly explain why.

Calculate the Info_Gain, Gain_Ratio, and Gini_Score for the following two attributes:

    Attribute   True
    A    B      Class
    ---  ---    -----
    1    1      1
    1    2      2
    2    1      1
    2    2      1
    3    1      2
    3    2      2
    3    2      2

                     A       B
(2) Info_Gain       ___     ___
(2) Gain_Ratio      ___     ___
(2) Gini_Score      ___     ___

(2) What are the largest and smallest possible values of Info_Gain? Do large or small values of Info_Gain represent good splits?
(2) What are the largest and smallest possible values of Gain_Ratio? Do large or small values of Gain_Ratio represent good splits?
(2) What are the largest and smallest possible values of Gini_Score? Do large or small values of Gini_Score represent good splits?
(2) Which attribute test above, A or B, should we install?
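
For reference, the split criteria can be computed as sketched below, assuming the standard textbook definitions; in particular, the Gini score here is taken as the reduction in Gini impurity, which may differ from the exact definition used in class.

    from collections import Counter
    from math import log2

    def entropy(values):
        """Entropy (in bits) of the distribution of values."""
        n = len(values)
        return -sum((c / n) * log2(c / n) for c in Counter(values).values())

    def gini(values):
        """Gini impurity of the distribution of values."""
        n = len(values)
        return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

    def split_scores(attr, labels):
        """Info_Gain, Gain_Ratio, and Gini score for splitting on one attribute."""
        n = len(labels)
        branches = {}
        for v, y in zip(attr, labels):
            branches.setdefault(v, []).append(y)
        def weighted(impurity):
            # Impurity of the children, weighted by branch size.
            return sum(len(b) / n * impurity(b) for b in branches.values())
        info_gain = entropy(labels) - weighted(entropy)
        split_info = entropy(attr)   # entropy of the split proportions themselves
        gain_ratio = info_gain / split_info if split_info > 0 else 0.0
        gini_score = gini(labels) - weighted(gini)   # impurity reduction
        return info_gain, gain_ratio, gini_score

    # Example: attribute A and the class column from the table above.
    # print(split_scores([1, 1, 2, 2, 3, 3, 3], [1, 2, 1, 1, 2, 2, 2]))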
(10 points) K-NEAREST NEIGHBOR

(2) Does k-nearest neighbor converge to baseline performance as k approaches 1 or as k approaches the size of the training set? Why?
(2) Does kNN tend to overfit with smaller values of k or with larger values of k? Why?
(2) The optimal value of k for a data set is K_OPT. If noise is added to the distance function, do you expect the optimal value of k to increase, decrease, or stay the same? Why?
(2) The optimal value of k for unweighted kNN for a data set is K_OPT. If we switch to weighted kNN (where the weight of a case is inversely proportional to the distance to that case), do you expect the optimal value of k to increase, decrease, or stay the same? Why?
(2) k-nearest neighbor is sometimes called "lazy learning". Why?

(10 points) ARTIFICIAL NEURAL NETS

(4) Show that a neural net with two hidden layers consisting only of linear units is equivalent to a network with one hidden layer of linear units. That is, if we remove the non-linear activation functions (e.g., the sigmoids) from the nodes so that each node's activation is just the weighted sum of its inputs, show that the expressive power of a network with two hidden layers is the same as that of an ANN with one hidden layer. To keep it simple, it is sufficient to show this for a simple ANN that has 2 inputs, 2 hidden units in each layer, and one output unit.
(4) Derive backprop for the output layer of a neural net using exponential squared error e^((Target - Output)^2) instead of the usual squared error (Target - Output)^2.
(2) Does changing the error function at the output of a neural net alter the derivation of backprop for the hidden layers? Why?

(10 points) FEATURE SELECTION

(2) Suppose we have 10^6 features, most of which are noisy and irrelevant to the problem we want to learn. Unfortunately, we have a small training set of only 100 cases. How many different subsets of features (different combinations of features) could we test, i.e., how many combinations are there?
(2) Besides computational cost, what is a potential problem of trying all possible subsets of features to determine which subset works best?
(2) Forward stepwise selection and backward stepwise elimination are both greedy feature selection methods: once they add or delete a feature, that feature is permanently added to or removed from the feature set. (A sketch of the basic greedy procedure appears at the end of the exam.) Describe a way to make these methods less greedy.
(2) Briefly describe a situation where forward stepwise selection does not find the best subset of features.
(2) Suppose the goal is to find a small subset of features that performs well. Briefly describe a situation where backward elimination can fail to find some good small subsets of features.

(10 points) CROSS-VALIDATION

(2) Suppose you are doing m-fold cross-validation. Is the average root-mean-squared-error (RMSE) of the m folds always equal to the RMSE of the union of the m test sets? In other words, if you compute the RMSE on each fold individually and then take the average of these m RMSEs, and compare this to the RMSE computed on all the folds concatenated together (after doing prediction on each of them with m-fold CV), will you always get the same answer? Prove your answer.
(1) Is the same true for accuracy? (No proof necessary.)
(1) Is the same true for ROC area? (No proof necessary.)
(2) List as many reasons as you can why we usually prefer 10-fold cross-validation to 2-fold cross-validation.
(2) In k-fold cross-validation, which are more independent, the train sets or the test sets? Why?
(2) Briefly explain why the performance of a model estimated with m-fold cross-validation might consistently be a little worse than the true performance.

(5 points) MISSING VALUES

(4) Consider a decision tree where missing values are handled by sending a case down each branch with a weight equal to the fraction of cases with no missing values that go down that branch. Prove that if a test case is missing *all* attribute values, this decision tree will predict for this case the baseline probability (from the training set) of each class. That is, if the training set contains 20% positives, the tree will predict p(+)=0.20 and p(-)=0.80 for any case missing all attribute values.
(1) Is this a good or a bad property?

(10 points) COMPARISON OF ALGORITHMS

(2) List 3 advantages of decision trees over neural nets.
(2) List 3 advantages of neural nets over decision trees.
(2) List 3 advantages of k-NN over decision trees or neural nets.
(2) List 3 disadvantages of k-NN relative to decision trees or neural nets.
(2) List 3 disadvantages of perceptrons (with threshold units) relative to neural nets trained with backprop.
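
For reference, the greedy forward stepwise procedure from the FEATURE SELECTION section is sketched below. Here evaluate is a hypothetical caller-supplied scoring function (e.g., cross-validated accuracy on a feature subset; higher is better), not part of the exam. Counting its calls in the worst case connects to the complexity question in the first section.

    def forward_stepwise(all_features, evaluate):
        """Greedy forward stepwise feature selection (sketch).

        evaluate is a hypothetical caller-supplied function scoring a
        feature subset; it must handle the empty set (baseline score).
        Once a feature is added it is never removed -- this is the
        greediness the feature selection questions ask about.
        """
        selected = []
        best_score = evaluate(selected)
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                return selected
            scored = [(evaluate(selected + [f]), f) for f in candidates]
            score, feature = max(scored, key=lambda sf: sf[0])
            if score <= best_score:    # no single addition helps: stop
                return selected
            selected.append(feature)
            best_score = score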