
Project 1: k-Nearest Neighbors

So many points,
some near, some far,
- who are my true neighbors?

Introduction

In this project, you will build a k-nearest neighbor classifier.

In the same directory as the last project, simply call

svn update
The directory should then be populated with the code for this project, which consists of several Octave files: some you will need to read and understand in order to complete the assignment, and some you can ignore.

Files you'll edit:
  findknn.m        Finds the k nearest neighbors of an input vector.
  knnclassifier.m  The final k-NN classifier.
  analyze.m        Computes the classification error.
  competition.m    Outputs a classification for a data set.

Files you might want to look at:
  innerproduct.m   Computes all pairwise inner products between two sets of vectors.
  l2distance.m     Computes all pairwise squared l2 distances between vectors.
  hw1tests.m       Runs a bunch of unit tests to find obvious bugs in your code.
  hw1tictoc.m      Runs and times your code.
  visualhw1.m      Visualizes the classifier. If the images shown by the interface are upside down, try uncommenting line 8 in visualhw1.m.

Supporting files you can ignore:
  digits.mat       A data set of handwritten digits. It is located under Resources on Piazza.
  faces.mat        A data set of human faces.

How to submit: You can commit your code with Subversion, using the command line

svn commit -m "some meaningful comment"
where you should substitute "some meaningful comment" with something that describes what you did. You can submit as often as you want until the deadline.

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But we don't know when or how to help unless you ask.

k-Nearest Neighbors implementation in Octave

Our first goal is to build a k-NN classifier for handwritten digit classification and face recognition.

Data

We first need some data for testing your code. The data resides in the files faces.mat and digits.mat, which hold the datasets for the experiments that follow. First, load the faces dataset:

>> load faces

Now, let us check which variables were loaded by the previous statement:

>> whos

Name    Size       Bytes     Class    Attributes
xTe     1178x28     263872   double
xTr     1178x252   2374848   double
yTe     1x28            224  double
yTr     1x252          2016  double
Here, xTr are the training vectors with labels yTr and xTe are the testing vectors with labels yTe.

As a reminder, to predict the label or class of an image in xTe, we will look for the k-nearest neighbors in xTr and predict a label based on their labels in yTr. For evaluation, we will compare these labels against the true labels provided in yTe.

Visualizing data

You can visualize one of the faces by running

>> load faces
>> figure(1);
>> clf;
>> imagesc(reshape(xTr(:,1),38,31));
>> colormap gray;
Note that 38x31 is the size of the image: we reshape it from the flat vector stored in the dataset into a matrix that can be visualized. Note: If your images appear upside down, use imagesc(flipud(reshape(xTr(:,1),38,31))).

There is also a little k-NN classifier visualization tool, called visualhw1.m. Run the following commands to test it. Since you have not yet implemented the k-NN classifier, the tool should show random predictions, as in the figure at the top of the page:

>> visualhw1 faces
>> visualhw1 digits


Note: The tool shows you all the images in the dataset as you keep pressing Enter in the Octave interpreter. You can exit it by simply pressing Ctrl+C. Use the tool again once you are done with the questions below to see what your code predicts for various test cases. (Due to differences in supporting libraries, the images may appear upside down. In that case you need to uncomment the line %flip=@(M) flipud(M); at the beginning of visualhw1.m.)

Implementation

The following questions ask you to finish these functions in a pre-defined order.
As a general rule, you should avoid tight loops at all costs; vectorized matrix operations are much faster in Octave.

(a) Implement the functions innerproduct.m and l2distance.m. You may reuse your own code from the previous project.
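As a reference point, here is one minimal vectorized sketch of the two functions, assuming (as in this dataset) that each example is stored as a column and that l2distance should return squared distances. It uses the expansion ||x-z||^2 = ||x||^2 - 2*x'*z + ||z||^2, so no loops are needed; the one-argument default Z = X is an assumption:

 function G = innerproduct(X, Z)
   % All pairwise inner products: G(i,j) = X(:,i)' * Z(:,j).
   if (nargin < 2)
     Z = X;              % assumption: with one argument, compare X with itself
   end
   G = X' * Z;
 end

 function D = l2distance(X, Z)
   % All pairwise squared l2 distances between the columns of X and Z.
   if (nargin < 2)
     Z = X;
   end
   sX = sum(X .^ 2, 1)';                      % squared column norms of X (n x 1)
   sZ = sum(Z .^ 2, 1);                       % squared column norms of Z (1 x m)
   D = bsxfun(@plus, sX, sZ) - 2 * (X' * Z);  % expand ||x - z||^2 term by term
   D = max(D, 0);                             % clip tiny negatives from round-off
 end

Remember that Octave expects each function in its own file of the matching name.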

(b) Implement the function findknn.m, which should find the k nearest neighbors of a set of vectors within a given training data set. The call of

 [I,D]=findknn(xTr,xTe,k);
should result in two matrices I and D, both of dimensions k×n, where n is the number of input vectors in xTe. The entry I(i,j) is the index of the ith nearest neighbor of the vector xTe(:,j). So, for example, if we set i=I(1,3), then xTr(:,i) is the first nearest neighbor of vector xTe(:,3). The second matrix D contains the corresponding distances, i.e., D(i,j) is the distance from xTe(:,j) to its ith nearest neighbor.
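One possible vectorized sketch, building on l2distance from part (a) (assuming it returns squared distances; drop the sqrt if squared distances suffice for your purposes):

 function [I, D] = findknn(xTr, xTe, k)
   % For every column of xTe, find the indices and distances of the
   % k nearest columns of xTr.
   dists = l2distance(xTr, xTe);   % dists(i,j): squared distance xTr(:,i) to xTe(:,j)
   [S, idx] = sort(dists, 1);      % sort every column in ascending order
   I = idx(1:k, :);                % indices of the k nearest neighbors (k x n)
   D = sqrt(S(1:k, :));            % the corresponding distances (k x n)
 end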

(c) The function analyze.m should compute various metrics to evaluate a classifier. The call of

  result=analyze(kind,truth,preds);
should output the accuracy or absolute loss in the variable result. The type of output is specified by the input argument kind, either "abs" or "acc". The input variables truth and preds should contain vectors of the true and predicted labels, respectively. For example, the call
>> analyze('acc',[1 2 1 2],[1 2 1 1])
should return an accuracy of 0.75. Here, the true labels are 1,2,1,2 and the predicted labels are 1,2,1,1. So the first three examples are classified correctly, and the last one is wrong --- 75% accuracy.
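A minimal sketch of analyze, under the assumption that the absolute loss is the mean absolute difference between the two label vectors:

 function result = analyze(kind, truth, preds)
   truth = truth(:)';                       % normalize both inputs to row vectors
   preds = preds(:)';
   switch kind
     case 'acc'
       result = mean(truth == preds);       % fraction of correct predictions
     case 'abs'
       result = mean(abs(truth - preds));   % mean absolute loss (an assumption)
   end
 end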

(d) Take a look at hw1tictoc.m. This script runs the (not yet implemented) k-nearest neighbor classifier over the faces and digits data sets. In its current implementation the results are picked at random. The faces data set has 40 classes; the digits data set has 10. What classification accuracy would you expect from a random classifier?

(e) Implement the function knnclassifier.m, which should perform k nearest neighbor classification on a given test data set. The call

preds=knnclassifier(xTr,yTr,xTe,k)
should output the predictions for the data in xTe, i.e., preds(i) will contain the prediction for xTe(:,i).
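A possible sketch of the majority vote, combining findknn from (b) with Octave's mode (note that on a draw, mode simply returns the smallest tied label; part (f) below asks you to handle draws more carefully):

 function preds = knnclassifier(xTr, yTr, xTe, k)
   [I, D] = findknn(xTr, xTe, k);   % I is k x n: neighbor indices per test vector
   labels = yTr(I);                 % k x n matrix of the neighbors' training labels
   preds = mode(labels, 1);         % column-wise majority vote (1 x n)
 end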

You can compute the actual classification accuracy on the test set by calling

>> analyze('acc',yTe,knnclassifier(xTr,yTr,xTe,3))

Test your function on both data sets with hw1tictoc.m. You can now also run visualhw1 faces or visualhw1 digits. Hopefully you will get better results than in the figure at the top of the page.

(f) Sometimes a k-NN classifier can result in a draw, when the majority vote is not clearly defined. Can you improve your accuracy by falling back on k-NN with a lower k in such a case? (You should edit the file knnclassifier.m.)
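One hedged way to implement this fallback: mode's third output is a cell array listing every label tied for the highest frequency, so a draw in column j shows up as numel(C{j}) > 1, and you can recurse with k-1 for just those test points. A sketch:

 function preds = knnclassifier(xTr, yTr, xTe, k)
   [I, D] = findknn(xTr, xTe, k);
   labels = yTr(I);
   [preds, F, C] = mode(labels, 1);   % C{j}: all labels tied for most frequent
   for j = 1:size(labels, 2)
     if (numel(C{j}) > 1)             % a draw among the k neighbors
       preds(j) = knnclassifier(xTr, yTr, xTe(:, j), k - 1);
     end
   end
 end

The recursion terminates because a single nearest neighbor (k = 1) can never produce a draw.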

(g) (optional) You can use the command randperm and modify visualhw1 to shuffle the dimensions of the images in random order. This should not affect your classification error.
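The reason: a shared permutation of the rows (pixels) of xTr and xTe leaves every pairwise distance unchanged, so the neighbors, and hence the predictions, stay the same. For example:

 p = randperm(size(xTr, 1));   % a random permutation of the feature dimensions
 xTrShuf = xTr(p, :);          % apply the same permutation to the training data ...
 xTeShuf = xTe(p, :);          % ... and the test data; all distances are preserved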

(h) Edit the function competition.m, which reads in a training and testing set and makes predictions. Inside this function you are free to use any combination or variation of the k-nearest neighbor classifier. Can you beat my submission on our secret training and testing set?

Evaluation

For this project, you will be ranked on the following measures:

Hints:

Octave functions you might want to use:
  sort     Sorts a matrix column-wise and returns indices and sorted values.
  help     Provides detailed information about any Octave command, e.g. help sort.
  mode     Returns the most common item in a list. Check help mode to understand the second and third outputs of this function.
  cellfun  Applies a function to every element of a cell array.