CS578 Fall 2001
Empirical Methods in Machine Learning and Data Mining 
Homework Assignment #2
Due: Tuesday, October 23 2001

The goal of this assignment is to implement k-nearest neighbor (kNN)
and run a few experiments with kNN and cross validation.  You can use
any programming language you want.

Here are some features the code should support:

 - should work for different values of k
 - should work for both classification and regression problems.
   regression is very easy -- just average (or weighted average)
   of the values; classification requires a little more work since
   you have to count how many of the nearest neighbors are in each
   class.
 - the distance function should support feature weights, but it
   does not have to optimize those feature weights -- see extra
   credit below.  for most of your experiments you should set 
   the feature weights to 1.0
 - should support leave-one-out-cross-validation (LOOCV) and
   report the LOOCV accuracy and RMSE
 - should support kNN (prediction is the average or predominant
   class of the k-nearest neighbors), weighted kNN (prediction
   is based on the k-nearest neighbors, but weighted by their
   distance to the test case), and locally weighted averaging
   (prediction depends on all cases in the train set, not just
   the nearest k, weighted by their distance to the test case).
 - in classification problems, should calculate a probability that 
   the test case is in each of the classes

EXPERIMENTS:

 1: We will put a dataset on the web for you to use.  It contains
    about 20,000 cases and about 150 attributes.  From this data
    you will predict a boolean variable (just two classes).

 2: Using LOOCV, experiment with different values of k to find
    which k works best.  Do this using unweighted kNN (i.e., the
    distance is not taken into account)

 3: Vary the size of the training set to create learning curves
    for different values of k.  Does the best value of k change
    when the train set size changes?

 4: Instead of kNN, switch to locally weighted averaging where you
    us all cases in the train set, but their predictions are weighted
    inversely by distance.  Use weights that fall of exponentially 
    with distance.  Experiment with different values for the kernel 
    width which controls how quickly the vote of a training case falls
    off with distance.  Which is best?  Does the best kernel width
    change with the size of the training set?

 5: Weight features with weights equal to 1/variance or 1/(max-min).  
    This helps compensate for features that have unusually small or 
    large ranges or which vary much more or much less than other 
    features.  Does performance improve when you do this?
 

EXTRA CREDIT -- do one or more of the following::

 - try to optimize the feature weights to improve performance by 
   applying some form of numerical optimization to the weights or
   by using something like information gain to scale the weights
 - do feature selection to find a subset of the features that seems
   to perform better than using all the features
 - implement and test locally weighted regression using simple local
   models such as linear regression.
 - experiment with different distance functions for locally weighted
   averaging or weighted kNN
 - do experiments with weighted kNN to find good values for k and the
   kernel width
 - make the code more efficient, document the methods you use and
   the speed improvement
 - implement and experiment with n-fold cross validation
 - modify your code so it uses a final test set that is held out of
   the train set used for LOOCV.  once you find the best value of k 
   or kernel width using LOOCV on the train set, report the
   performance of this model on the final test set
 - vary the coding of different types of attributes to improve the 
   distance metric

Hand in a brief summary of the results with enough documentation so
that we can see what you did and how you did it.  Do not write a paper
or anything like that.  This is homework, not a class project.  You'll
be using the kNN code you write later in the class project, so effort
spent now to become familiar with kNN and write good code should pay
off later.  In the class project you'll be comparing the performance
of kNN with other learning methods such as IND decision trees, so it
might be easier for you later if the kNN code you write now will be
able to read data files similar to those used by IND.

Have fun.