CS578 Fall 2001 Empirical Methods in Machine Learning and Data Mining Homework Assignment #2 Due: Tuesday, October 23 2001 The goal of this assignment is to implement k-nearest neighbor (kNN) and run a few experiments with kNN and cross validation. You can use any programming language you want. Here are some features the code should support: - should work for different values of k - should work for both classification and regression problems. regression is very easy -- just average (or weighted average) of the values; classification requires a little more work since you have to count how many of the nearest neighbors are in each class. - the distance function should support feature weights, but it does not have to optimize those feature weights -- see extra credit below. for most of your experiments you should set the feature weights to 1.0 - should support leave-one-out-cross-validation (LOOCV) and report the LOOCV accuracy and RMSE - should support kNN (prediction is the average or predominant class of the k-nearest neighbors), weighted kNN (prediction is based on the k-nearest neighbors, but weighted by their distance to the test case), and locally weighted averaging (prediction depends on all cases in the train set, not just the nearest k, weighted by their distance to the test case). - in classification problems, should calculate a probability that the test case is in each of the classes EXPERIMENTS: 1: We will put a dataset on the web for you to use. It contains about 20,000 cases and about 150 attributes. From this data you will predict a boolean variable (just two classes). 2: Using LOOCV, experiment with different values of k to find which k works best. Do this using unweighted kNN (i.e., the distance is not taken into account) 3: Vary the size of the training set to create learning curves for different values of k. Does the best value of k change when the train set size changes? 4: Instead of kNN, switch to locally weighted averaging where you us all cases in the train set, but their predictions are weighted inversely by distance. Use weights that fall of exponentially with distance. Experiment with different values for the kernel width which controls how quickly the vote of a training case falls off with distance. Which is best? Does the best kernel width change with the size of the training set? 5: Weight features with weights equal to 1/variance or 1/(max-min). This helps compensate for features that have unusually small or large ranges or which vary much more or much less than other features. Does performance improve when you do this? EXTRA CREDIT -- do one or more of the following:: - try to optimize the feature weights to improve performance by applying some form of numerical optimization to the weights or by using something like information gain to scale the weights - do feature selection to find a subset of the features that seems to perform better than using all the features - implement and test locally weighted regression using simple local models such as linear regression. - experiment with different distance functions for locally weighted averaging or weighted kNN - do experiments with weighted kNN to find good values for k and the kernel width - make the code more efficient, document the methods you use and the speed improvement - implement and experiment with n-fold cross validation - modify your code so it uses a final test set that is held out of the train set used for LOOCV. once you find the best value of k or kernel width using LOOCV on the train set, report the performance of this model on the final test set - vary the coding of different types of attributes to improve the distance metric Hand in a brief summary of the results with enough documentation so that we can see what you did and how you did it. Do not write a paper or anything like that. This is homework, not a class project. You'll be using the kNN code you write later in the class project, so effort spent now to become familiar with kNN and write good code should pay off later. In the class project you'll be comparing the performance of kNN with other learning methods such as IND decision trees, so it might be easier for you later if the kNN code you write now will be able to read data files similar to those used by IND. Have fun.