The archive contains two files: 

  1. hw2.dta - the data set. It contains 22176 cases, each with 144 columns. The first column is the one to be predicted: a binary label with values 1 and 2. The remaining 143 columns are the input variables for k-nn.
  2. hw2.attr - the attribute file. It contains the type and range of each attribute. You will need this file if you want to go beyond treating each attribute as continuous. 

          Because the data set is large and most implementations of k-nn are O(n^2) in the number of cases in the training set, we strongly suggest you develop and debug your code using a smaller sample until you are sure it is working, then run the final experiments with the entire data set. If you have performance problems with such a large data set, you should do experiments using a sample from the data set, such as 5000 or 10000 cases. 
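
One way to follow this advice is sketched below: draw a random sample of the cases, then run a brute-force k-nn on the sample only. This is a minimal illustration, not a reference solution. It assumes the cases have already been read into memory as lists of numbers with the label in the first position, and it treats every attribute as continuous with Euclidean distance. The function names (`sample_rows`, `knn_predict`, `holdout_accuracy`) and the 80/20 holdout split are hypothetical choices made for the example; the on-disk layout of hw2.dta is not assumed here.

```python
import math
import random
from collections import Counter


def sample_rows(rows, n, seed=0):
    """Return a random sample of at most n rows, drawn without replacement."""
    rng = random.Random(seed)
    if n >= len(rows):
        return list(rows)
    return rng.sample(rows, n)


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest rows.

    Each training row is [label, x1, x2, ...]; `query` holds only the
    inputs. All attributes are treated as continuous (an assumption).
    """
    # Brute force: one distance per training row, so cost grows linearly
    # per query and quadratically when every case is classified in turn.
    dists = sorted((math.dist(row[1:], query), row[0]) for row in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def holdout_accuracy(rows, k=3, test_fraction=0.2, seed=0):
    """Shuffle rows, hold out a fraction for testing, return accuracy."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(knn_predict(train, r[1:], k) == r[0] for r in test)
    return correct / len(test)
```

With the full data set in a list called `rows` (a hypothetical name), a debugging run might use `holdout_accuracy(sample_rows(rows, 5000), k=5)` and switch to `holdout_accuracy(rows, k=5)` only for the final experiments.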

          We apologize for the delay in making the dataset available. We had trouble dealing with missing values. 

          Download the archive