CS578 Empirical Methods in Machine Learning and Data Mining
Mid-Term Exam
Due: Thursday, November 6, 2003 at 2:55pm

Please keep answers clear and concise. Some questions can be answered with a few words. Some questions admit alternate answers depending on how you think about the problem. Better answers receive more credit. All work must be done individually. Discussion is not allowed. If you have questions, see the instructor or TA. You may use references. The text for this mid-term is available through the course web page. Handwritten exams are acceptable if legible. Sign your exam before turning it in, indicating that you did the work without external help.

(20 points) COMPUTATIONAL COMPLEXITY OF LEARNING ALGORITHMS

(2) What is the computational complexity (big-O notation) of installing the root attribute in a decision tree for a training set with C cases and V boolean (2-valued) variables? Express the complexity in terms of C and V.

(2) What is the computational complexity of installing the root attribute if all V attributes are continuous instead of boolean?

(2) What is the computational complexity of growing a complete decision tree to depth D with the standard recursive partitioning algorithm we discussed in class, given a training set with C cases and V boolean variables?

(2) What is the complexity of growing the same complete tree to depth D with boolean variables if you do 1-step lookahead when growing the tree?

(2) Given a fully-connected feedforward neural net with I inputs, O outputs, and one hidden layer with H hidden units, how many floating point multiplications must be done to make the predictions for a test set containing C cases? (This is just forward propagation; the net was already trained on a train set with backpropagation.)

(2) What is the complexity of calculating the Euclidean distance between two cases for k-nearest neighbor with A attributes?

(2) What is the complexity of 1-NN for regression (predicting a continuous value) for a data set with A attributes and C cases?

(2) If we move from 1-NN to k-NN, how does the complexity change?

(2) If we switch to locally weighted averaging instead of k-NN, what is the complexity?

(2) What is the complexity if we use 1-NN and do LOOCV (leave-one-out cross validation)?
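As a concrete reference for the operation counting in this section, here is a minimal sketch of installing a root split on boolean attributes with information gain. The helper names (info_gain, pick_root) are illustrative, not course code; the loops show which work is repeated per case and per attribute.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(cases, labels, attr):
        """Score one candidate split: a single pass over all C cases."""
        branch = {0: [], 1: []}
        for x, y in zip(cases, labels):   # C iterations per candidate attribute
            branch[x[attr]].append(y)
        n = len(labels)
        return entropy(labels) - sum(len(s) / n * entropy(s)
                                     for s in branch.values() if s)

    def pick_root(cases, labels, num_attrs):
        """Install the root: score each of the V candidate attributes."""
        return max(range(num_attrs), key=lambda a: info_gain(cases, labels, a))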
(7 points) OVERFITTING AND GENERALIZATION

(2) Training with more training data usually improves generalization performance. Suppose we have a training set containing M training cases. If we duplicate the training cases D times each, so that the training set now contains D*M cases (but still only M unique cases), will this improve the generalization performance of a neural net? Why?

(2) Suppose instead of a neural net we use k-NN on the duplicated training set. Will the optimal value of k be larger than, smaller than, or the same size as the optimal value before the training set was duplicated? Why?

(2) Suppose we train a linear SVM on the duplicated training set. Will the new best value C' with the duplicated data be larger than, smaller than, or the same as the best C for the unduplicated training set? Why?

(1) What is the new best value C'? Express your answer in terms of D, M, and C (the best C with the original unduplicated data set).

(25 points) DECISION TREES

(2) Briefly explain why growing full decision trees to maximum depth using randomly installed splits and no pruning often yields trees that perform reasonably well.

(2) Why does lookahead often reduce the size/depth of a decision tree?

We want to grow decision trees to maximize accuracy instead of using information gain or gain ratio. Let us define AccuracyGain as:

    AccGain(Attribute) = Accuracy(S) - SUM_v [ (|Sv|/|S|) * Accuracy(Sv) ]

where S is the node being split, v ranges over the values of the Attribute, Sv is the subset of cases from S with Attribute value v, and |S| and |Sv| are the number of cases at the node S or in the subset Sv. AccGain is similar to InfoGain, but we've replaced the entropy terms with accuracy terms. Accuracy(S) is the baseline accuracy of S: the fraction of instances in S classified correctly if you predict the majority class for all cases in S. (A short computational sketch of these definitions follows the questions below.)

Compute the InfoGain, GainRatio, and AccGain for these two attributes:

    Attribute    True
     A    B     Class
    ---  ---    -----
     1    1       1
     1    2       2
     2    1       1
     2    2       2
     3    1       1
     3    1       2
     3    2       2

(2) InfoGain:   A _?_   B _?_
(2) GainRatio:  A _?_   B _?_
(2) AccGain:    A _?_   B _?_

(2) When using InfoGain, we install the split (attribute test) that has the highest InfoGain. Should we install the split that has the highest or lowest AccGain?

(2) Does AccGain have the same problem with high-arity attributes that InfoGain has? What is this problem, and why does AccGain have this problem or not have it?

(2) Optimizing accuracy is appealing, but AccGain has a weakness compared to InfoGain for problems with more than two classes. What is it?
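A minimal sketch of computing all three quantities exactly as defined above, assuming each attribute is given as a column of values with a parallel column of class labels (the function names are illustrative, not course code):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def accuracy(labels):
        """Fraction correct when predicting the majority class."""
        return Counter(labels).most_common(1)[0][1] / len(labels)

    def gains(values, labels):
        """InfoGain, GainRatio, and AccGain for one attribute."""
        n = len(labels)
        subsets = {v: [y for x, y in zip(values, labels) if x == v]
                   for v in set(values)}
        info = entropy(labels) - sum(len(s) / n * entropy(s)
                                     for s in subsets.values())
        split_info = entropy(values)  # entropy of the attribute's own values
        ratio = info / split_info if split_info else 0.0
        acc = accuracy(labels) - sum(len(s) / n * accuracy(s)
                                     for s in subsets.values())
        return info, ratio, acc

For example, calling gains with attribute A's column and the Class column from the table fills in one column of the answer grid.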
One nice property of decision trees is that they can handle missing values. When a case is missing an attribute value that is tested at a node, we can handle the missing value several ways:

1) let the case go down all branches
2) let the case go down all branches weighted by 1/attribute_arity
3) let the case go down the branch most other cases went down
4) let the case go down each branch weighted by the fraction of other cases that went down each branch
5) let the case go down one branch selected at random
6) create a new branch just for the missing values
7) ignore the case

(2) Which approach do you think is best? Briefly explain why.

(2) Which approach do you think is worst? Briefly explain why.

(1) Just between 1) and 2), which one do you think is more appropriate? Why?

(2) Briefly describe a situation where 6) would be very inappropriate to use.

(2) Briefly describe a situation where 7) would be very inappropriate to use.

(10 points) K-NEAREST NEIGHBOR

(2) If a data set has significant class noise, do we want to use a smaller or larger value of k? Why?

(2) If a data set has significant attribute noise, do we want to use a smaller or larger value of k? Why?

(2) Does kNN tend to overfit with smaller values of k or with larger values of k? Why?

(2) Does Locally Weighted Averaging tend to overfit with smaller values of kernel width, or with larger values of kernel width? Why?

(2) A training set has M instances, and the optimal value of k for this training set is K_OPT. If the training set gets larger, do you expect the optimal value of k to increase, decrease, or stay the same? Why?

(12 points) ARTIFICIAL NEURAL NETS

(2) If you train a neural net with backpropagation for many epochs, which is more likely to have the lowest RMSE on the *train* set: a neural net with 1, 10, or 100 hidden units?

(2) If you train the same neural net with backpropagation for many epochs, which is more likely to have the lowest RMSE on the *test* set: a neural net with 1, 10, or 100 hidden units?

(2) When training a neural net with backpropagation on a training set with 0/1 targets, which do you think is likely to reach its early stopping point first, RMSE or accuracy? Why?

(2) Backpropagation is done with error functions on the outputs such as squared error or cross-entropy, even when the goal is to maximize classification accuracy. Why is it difficult to do backpropagation to maximize classification accuracy directly?

(4) Derive backprop for just the output unit of a neural net trained with -1*[target*log(output) + (1-target)*log(1-output)] instead of the usual squared error (Target - Output)^2. Assume a sigmoid output unit and 0/1 targets.

(8 points) SUPPORT VECTOR MACHINES

(2) If we remove one of the support vectors from the train set, does the size of the maximum margin decrease, stay the same, or increase for that data set?

(2) If we double the size of the training set by drawing additional points from the same distribution, do you expect the size of the margin to decrease, stay the same, or increase?

Suppose instead of linear separating hyperplanes, we could train SVMs using separating V-hyperplanes -- two linear semi-hyperplanes that bend where they join. The SVM chooses the angle at the bend as part of maximizing the margin.

[Figure: two panels, LINEAR HYPERPLANE and V-HYPERPLANE, showing the same two classes of points separated by a straight hyperplane on the left and, on the right, by a V-hyperplane that bends where its two semi-hyperplanes join.]

(2) Do you expect the size of the maximum margin to decrease, stay the same, or increase with V-hyperplanes compared to linear hyperplanes?

(2) For the same value of C, are SVM models with V-hyperplanes less complex, more complex, or the same complexity as SVM models with linear hyperplanes?

(8 points) CROSS VALIDATION

(4) List as many reasons as you can why we usually prefer 10-fold cross validation to 2-fold cross validation.

(4) Suppose we use 3-fold cross validation to train neural nets with early stopping. In trial 1 the net is trained on fold 1, early stopped on fold 2, and tested on fold 3. 3-fold CV yields three neural nets, N1, N2, and N3. Suppose we want to combine the three nets into one model M by averaging the predictions from the three nets: if Ni(Cj) is the prediction net Ni makes on test case Cj, the average prediction of the three nets is M(Cj) = (N1(Cj)+N2(Cj)+N3(Cj))/3 for case Cj. If we compare M to Ni on Ni's test fold, why isn't this a fair comparison? Design an experiment where we can fairly compare M to each of the Ni's. (One possible design is sketched at the end of the exam.)

(10 points) COMPARISON OF ALGORITHMS

(2) List 2 advantages of decision trees over neural nets.

(2) List 2 advantages of neural nets over decision trees.

(2) List 2 advantages of k-NN over decision trees or neural nets.

(2) List 2 disadvantages of k-NN compared to decision trees or neural nets.

(2) List 2 advantages of SVMs over decision trees and neural nets.
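Returning to the cross validation question above, here is a minimal sketch of one possible fair design. It assumes a hypothetical trainer train_net(train_fold, stop_fold) that returns a prediction function; the point of this particular design is that the three nets and the averaged model M are all scored on held-out cases that no net used for training or early stopping:

    from math import sqrt

    def rmse(preds, targets):
        """Root mean squared error of parallel prediction/target lists."""
        return sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

    def compare_ensemble_to_members(folds, holdout, train_net):
        """folds: the three CV folds; holdout: (cases, targets) reserved before
        CV so that no net ever sees them; train_net: hypothetical trainer."""
        nets = []
        for i in range(3):
            # Same rotation as in the question: train on fold i, stop on fold i+1.
            nets.append(train_net(folds[i], folds[(i + 1) % 3]))

        x, y = holdout
        member_rmse = [rmse(net(x), y) for net in nets]  # each Ni on unseen cases
        avg = [sum(ps) / 3.0 for ps in zip(*(net(x) for net in nets))]
        return member_rmse, rmse(avg, y)                 # M on the same unseen cases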