CS578 Empirical Methods in Machine Learning and Data Mining
Mid-Term Exam     Due: Tuesday, November 2, 2004 at 2:55pm

Please keep answers clear and concise.  Some questions can be answered
with a few words.  Some questions admit alternate answers depending on
how you think about the problem.  Better answers receive more credit.

All work must be done individually.  Discussion is not allowed.  If
you have questions see the instructor or one of the TAs.  You may use
references.  The text for this mid-term is available through the
course web page.  Handwritten exams are acceptable if legible.

Sign your exam before turning it in indicating that you did the work
without external help.  Exams will not be accepted without signatures.


(20 points) COMPUTATIONAL COMPLEXITY OF LEARNING ALGORITHMS

For the following, assume a 2-class problem with N training cases and
A attributes.  Express answers with big-O notation in terms of A and
N.

(4) What is the computational complexity of finding the best binary
split for one continuous attribute (i.e., finding the threshold for
that split)?

(4) What is the computational complexity of growing a decision tree to
depth D if all the attributes are binary (with the standard recursive
partitioning algorithm we discussed in class).

(4) What is the complexity of calculating the distance between two
cases for k-nearest neighbor?

(4) What is the complexity for T test cases of weighted kNN when k is
much smaller than N?  Assume a straightforward implementation of kNN
that doesn't use fancy data structures.  State any assumptions.

(4) What is the complexity if we switch to 1-NN and do LOOCV
(leave-one-out cross validation) on the entire data set?


(8 points) OVERFITTING AND GENERALIZATION

(4) Training with more training data usually improves generalization
performance.  Suppose we have a training set containing M training
cases.  If we duplicate the training cases D times each so that the
training set now contains D*M cases (but still only M unique cases),
will this improve the generalization performance of a neural net?
Why?

(4) Suppose instead of a neural net we use k-NN on the duplicated
training set.  Will the optimal value of k be larger than, smaller
than, or the same size as the optimal value before the training set
was duplicated?  Why?


(25 points) DECISION TREES

(3) Briefly explain why decision trees are not as well suited for
regression problems as they are for classification problems.


One nice property of decision trees is that they can handle missing
values.  When a case is missing an attribute value that is tested at 
a node, we can handle the missing value several ways:

1) let the case can go down all branches
2) let the case go down all branches weighted by 1/attribute_arity
3) let the case go down the branch most other cases went down
4) let the case go down each branch weighted by the fraction of
   other cases that went down each branch
5) let the case go down one branch selected at random
6) create a new branch just for the missing values

(4) Which approach do you think is best?  Briefly explain why.

(4) Which approach do you think is worst?  Briefly explain why.


We want to grow decsion trees to minimize squared error instead of
using information gain or gain ratio.  Let us define RMS_Gain as:

                                     /  |Sv|               \
RMS_Gain(Attribute) = RMSE(S) - SUM |  ------ *  RMSE(Sv)   |
                                 v   \  |S |               /

where S is the node being split, v ranges over the values of the
Attribute, Sv is the subset of cases from S with Attribute value v,
and |S| and |Sv| are the number of cases at the node S or in the
subset Sv.  RMS_Gain is similar to InfoGain, but we've replaced the
entropy terms with RMSE terms.  The RMSE(S) for node S is calculated
by predicting probabilities for the node in the usual way (i.e. by
dividing the fraction of cases in each class in the node by the total
number of cases in the node) and then computing the
root-mean-squared-error of these predictions for the node.


Calculate the Info_Gain, Gain_Ratio, and Gini_Score for the following
two attributes:

            Attribute    True
             A     B     Class
            ---   ---    -----
             1     1       1
             1     2       2
             2     1       1
             2     2       2
             3     1       2
             3     2       2
             3     2       1

(2)  Info_Gain  ___   ___

(2) Gain_Ratio  ___   ___

(2)   RMS_Gain  ___   ___
  

(2) What is the largest and smallest possible Info_Gain?  Do large or
small values of Info_Gain represent good splits?

(2) What is the largest and smallest possible Gain_Ratio?  Do large or
small values of Gain_Ratio represent good splits?

(2) What is the largest and smallest possible RMS_Gain?  Do large or
small values of RMS_Gain represent good splits?

(2) Which attribute test above, A or B, should we install?


(15 points) K-NEAREST NEIGHBOR

(3) Does k-nearest neighbor converge to baseline performance as k
approaches 1 or as k approaches the size of the training set?  Why?

(3) If a data set has significant class noise, do we want to use a
smaller or larger value of k?  Why?

(3) Does kNN tend to overfit with smaller values of k or with larger
values of k?  Why?

(3) The optimal value of k for a data set is K_OPT. If noise is added
to the distance function, do you expect the optimal value of k to
increase, decrease, or stay the same?  Why?

(3) The optimal value of k for unweighted kNN for a data set is
K_OPT. If we switch to weighted kNN (where the weight of a case is
inversely proportional to the distance to that case), do you expect
the optimal value of k to increase, decrease, or stay the same?  Why?


(10 points) PERCEPTRONS

In the following, assume inputs are +1 or -1 and that the threshold
unit outputs +1 for inputs > 0 and -1 otherwise.

(2) Briefly explain why the threshold unit in perceptrons is both a
blessing and a curse.

(2) Briefly explain why replacing threshold units with sigmoids (or
any well-behaved non-linear squashing/activation function) makes
neural nets more useful than perceptron nets.

(2) Draw a perceptron that computes the NAND (NOT AND) function for
two inputs A and B.  Be sure to show the weights for the inputs to A
and B, as well as the bias weight.  (This is not a network -- just a
single perceptron.)

(2) Draw a perceptron that computes the OR function for three inputs
A, B, and C.  Be sure to show the weights for the inputs to A, B and
C, as well as the bias weight.  (This is not a network -- just a
single perceptron.)

(2) Draw a network of perceptrons that computes the XOR function for
three inputs A, B, and C.  Be sure to show the weights for the inputs
to A, B and C, as well as all bias weights and weights joining the
perceptrons.


(12 points) ARTIFICIAL NEURAL NETS (ANNs)

(3) Show that a neural net with two hidden layers consisting only of
linear units is equivalent to a network with one hidden layer of
linear units.  That is, if we remove the non-linear activation
functions (e.g., the sigmoids) from the nodes so that each node's
activation is just the weighted sum of it's inputs, show that the
expressive power of a network with two hidden layers is the same as an
ANN with one hidden layer.  To keep it simple, it is sufficient to
show this for a simple ANN that has 2 inputs, 2 hidden units in each
layer, and one output unit.

(3) Are the bias weights in an ANN really necessary?  Suppose we
eliminate the bias weight going to each node.  This means each node's
activation is the sigmoid of just the weighted sum of it's inputs,
without an extra weight going to a constant "1" activation.  The net
is trained as usual with backpropagation.  How well do you think it
will perform compared to a regular ANN trained with bias weights.
Why?

(3) Consider a neural net with N hidden units trained with per epoch
updating on a data set with fixed learning rate L and no momentum.
Suppose the optimal early stopping point is E epochs.  If we train the
same net from the same data using per pattern updating instead of
batch updating using a learning rate of L/E, will training the net
probably require E passes through the train set, more than E passes,
or fewer than E passes?  Why?

(3) Consider a neural net with N hidden units trained with per epoch
updating on a data set D with learning rate L and momentum M.  Suppose
the optimal early stopping point is E epochs.  If we add noise to the
attributes in D, and train the same net with per epoch updating, will
training the net probably require E epochs, more than E epochs, or
fewer than E epochs to reach optimal performance?  Why?


(10 points) CROSS VALIDATION

(3) Briefly explain why the performance of a model estimated with
m-fold cross validation might consistently be a little worse than the
true performance.

(3) Suppose you are doing m-fold cross-validation.  Show that the
average accuracy of the m folds is equal to the accuracy of the union
of the m-fold test sets.  In other words, compute the accuracy on each
fold individually, and then take the average of these m accuracies,
and compare this to calculating the accuracy on all the folds
concatenated together after doing prediction on each of them with
m-fold CV.  Is the same true for RMSE (root-mean-squared error)?

(2) List as many reasons as you can why we usually prefer 10-fold
cross validation to 2-fold cross validation?

(2) List as many reasons as you can why we might prefer 10-fold cross
validation to 100-fold cross validation?