*(Figure: visualhw1 output showing random k-NN predictions.)*
*So many points,*
*some near, some far,*
*who are my true neighbors?*
In this project, you will build a k-nearest neighbor classifier.
In the same directory as the last project, simply call

```
svn update
```

The directory should be populated with the code for this project. The code consists of several Octave files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore.
Files you'll edit:

| File | Description |
| --- | --- |
| `findknn.m` | Finds the k nearest neighbors of an input vector. |
| `knnclassifier.m` | The final k-NN classifier. |
| `analyze.m` | Computes the classification error. |
| `competition.m` | Outputs a classification for a data set. |

Files you might want to look at:

| File | Description |
| --- | --- |
| `innerproduct.m` | Computes all pairwise inner products between two sets of vectors. |
| `l2distance.m` | Computes all pairwise squared l2 distances between vectors. |
| `hw1tests.m` | Runs a set of unit tests to catch obvious bugs in your code. |
| `hw1tictoc.m` | Runs and times your code. |
| `visualhw1.m` | Visualizes the classifier. If the images shown by the interface are upside down, try uncommenting line 8 in `visualhw1.m`. |

Supporting files you can ignore:

| File | Description |
| --- | --- |
| `digits.mat` | A data set of handwritten digits. It is located under Resources on Piazza. |
| `faces.mat` | A data set of human faces. |
How to submit: You can commit your code with Subversion from the command line:

```
svn commit -m "some meaningful comment"
```

where you should substitute "some meaningful comment" with a description of what you did. You can submit as often as you want until the deadline.
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and the Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.
Our first goal is to build a k-NN classifier for handwritten digit classification and face recognition.
We first obtain some data for testing your code. The data resides in the files `faces.mat` and `digits.mat`, which hold the data sets for the experiments that follow. First, load the faces data set:
```
>> load faces
```
Now, let us check which variables were loaded by the previous statement:
```
>> whos
Name      Size        Bytes     Class    Attributes

xTe       1178x28      263872   double
xTr       1178x252    2374848   double
yTe       1x28            224   double
yTr       1x252          2016   double
```
You can visualize one of the faces by running:

```
>> load faces
>> figure(1);
>> clf;
>> imagesc(reshape(xTr(:,1),38,31));
>> colormap gray;
```

Note that 38x31 is the size of the image: we reshape it from the flat vector form stored in the data set into a matrix that can be visualized. Note: If your images appear upside down, use

```
imagesc(flipud(reshape(xTr(:,1),38,31)))
```
There is also a small k-nearest neighbor visualization tool called `visualhw1.m`. Run the following commands to test it. Since you have not implemented the k-NN classifier yet, the tool should show random predictions, as in the figure at the top of the page:

```
>> visualhw1 faces
>> visualhw1 digits
```

Note: The tool steps through all the images in the data set as you keep pressing Enter in the Octave interpreter; you can exit it by pressing Ctrl+C. Use the tool again once you are done with the questions below to see what your code predicts for various test cases. (Due to differences in supporting libraries, the images may appear upside down. In that case, uncomment the line

```
%flip=@(M) flipud(M);
```

at the beginning of `visualhw1.m`.)
The following questions will ask you to finish these functions in a pre-defined order.
As a general rule, you should avoid tight loops at all costs.
(a) Implement the functions `innerproduct.m` and `l2distance.m`. You may reuse your code from the previous project; a sketch of the loop-free approach follows below.
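As a refresher, all pairwise squared distances can be computed without loops by expanding ||x - z||^2 = ||x||^2 - 2*x'*z + ||z||^2; the inner-product term is exactly what `innerproduct.m` computes. A minimal sketch, assuming examples are stored as columns (the variable names S, R, and G are illustrative; your actual signatures are fixed by the provided files):

```
function D = l2distance(X, Z)
  % X: d x n, Z: d x m, one example per column.
  % Expand ||x - z||^2 = ||x||^2 - 2*x'*z + ||z||^2 for all pairs at once.
  S = sum(X.^2, 1)';                        % n x 1 squared norms of X's columns
  R = sum(Z.^2, 1);                         % 1 x m squared norms of Z's columns
  G = X' * Z;                               % n x m matrix of inner products
  D = bsxfun(@plus, S, bsxfun(@minus, R, 2*G));
  D = max(D, 0);                            % clip tiny negatives from round-off
end
```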
(b) Implement the function `findknn.m`, which should find the k nearest neighbors of a set of vectors within a given training data set. The call

```
[I,D]=findknn(xTr,xTe,k);
```

should result in two matrices I and D, both of dimensions k×n, where n is the number of input vectors in `xTe`. The entry I(i,j) is the index of the ith nearest neighbor of the vector xTe(:,j). So, for example, if we set i=I(1,3), then xTr(:,i) is the first nearest neighbor of vector xTe(:,3). The second matrix D contains the corresponding distances, i.e., D(i,j) is the distance of xTe(:,j) to its ith nearest neighbor.
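One possible skeleton, a sketch assuming the `l2distance` above (check `hw1tests.m` to see whether D should hold squared or plain Euclidean distances; the sketch simply passes through what `l2distance` returns):

```
function [I, D] = findknn(xTr, xTe, k)
  dists = l2distance(xTr, xTe);   % n_train x n_test squared distances
  [S, idx] = sort(dists, 1);      % sort each column in ascending order
  I = idx(1:k, :);                % indices of the k nearest neighbors
  D = S(1:k, :);                  % corresponding distances; sqrt() if needed
end
```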
(c) The function `analyze.m` should compute various metrics to evaluate a classifier. The call

```
result=analyze(kind,truth,preds);
```

should output the accuracy or absolute loss in the variable `result`. The type of output is specified by the input argument `kind`, which is either "abs" or "acc". The input variables `truth` and `preds` should contain vectors of the true and predicted labels, respectively.
For example, the call

```
>> analyze('acc',[1 2 1 2],[1 2 1 1])
```

should return an accuracy of 0.75. Here, the true labels are 1,2,1,2 and the predicted labels are 1,2,1,1, so the first three examples are classified correctly and the last one is wrong: 75% accuracy.
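A minimal sketch of the two cases (assuming "abs" means the mean absolute difference between labels; verify against `hw1tests.m`):

```
function result = analyze(kind, truth, preds)
  truth = truth(:)'; preds = preds(:)';     % treat both as row vectors
  switch kind
    case 'acc'
      result = mean(truth == preds);        % fraction classified correctly
    case 'abs'
      result = mean(abs(truth - preds));    % mean absolute loss
  end
end
```

On the example above, `truth == preds` is `[1 1 1 0]`, whose mean is 0.75.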
(d) Take a look at `hw1tictoc.m`. This script runs the (not yet implemented) k-nearest neighbor classifier over the faces and digits data sets. In its current implementation the predictions are picked at random. The faces data set has 40 classes, the digits data set 10. What classification accuracy would you expect from a random classifier?
(e) Implement the function `knnclassifier.m`, which should perform k-nearest neighbor classification on a given test data set. The call

```
preds=knnclassifier(xTr,yTr,xTe,k)
```

should output the predictions for the data in `xTe`, i.e., preds(i) will contain the prediction for xTe(:,i).
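A minimal sketch, assuming the `findknn` above (the `mode` hint in the table at the bottom of the page points in the same direction):

```
function preds = knnclassifier(xTr, yTr, xTe, k)
  I = findknn(xTr, xTe, k);   % k x n indices into the columns of xTr
  labels = yTr(I);            % neighbor labels, same shape as I
  preds = mode(labels, 1);    % majority vote within each column
end
```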
You can compute the actual classification accuracy on the test set by calling

```
>> analyze('acc',yTe,knnclassifier(xTr,yTr,xTe,3))
```

Test your function on both data sets with `hw1tictoc.m`. You can now also run `visualhw1 faces` or `visualhw1 digits`. Hopefully you will get better results than in the figure at the top of the page.
(f) Sometimes a k-NN classifier can result in a draw, when the majority vote is not clearly defined. Can you improve your accuracy by falling back onto k-NN with a lower k in such a case? (You should edit the file `knnclassifier.m`.)
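One possible approach, sketched only: the third output of `mode` lists all tied values, and `cellfun` can count them (both appear in the hints table below). The fragment below would replace the `preds = mode(labels, 1);` line in the sketch from part (e), re-classifying just the tied test points with k-1:

```
[M, F, C] = mode(labels, 1);       % C{j} holds all tied labels in column j
ties = cellfun(@numel, C) > 1;     % columns where the vote is ambiguous
if k > 1 && any(ties)
  M(ties) = knnclassifier(xTr, yTr, xTe(:, ties), k - 1);
end
preds = M;
```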
(g) (optional) You can use the command `randperm` and modify `visualhw1` to shuffle the dimensions of the images in a random order. This should not affect your classification error: the Euclidean distance is invariant under any fixed permutation of the input dimensions, as long as it is applied consistently to training and test points.
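For instance, a consistent shuffle of the pixel dimensions might look like this (a sketch; the variable names are illustrative, and `visualhw1` itself would need the same permutation applied before reshaping):

```
p = randperm(size(xTr, 1));    % one random permutation of the d pixel indices
xTrShuffled = xTr(p, :);       % apply the SAME permutation to
xTeShuffled = xTe(p, :);       % both training and test data
```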
(h) Edit the function `competition.m`, which reads in a training and a testing set and makes predictions. Inside this function you are free to use any combination or variation of the k-nearest neighbor classifier. Can you beat my submission on our secret training and testing set?
For this project, you will be ranked on the following measures:
Octave functions you might want to use:

| Function | Description |
| --- | --- |
| `sort` | Sorts a matrix column-wise and returns both the sorted values and their indices. |
| `help` | Provides detailed information about any Octave command, e.g., `help sort`. |
| `mode` | Returns the most common item in a list. Check `help mode` to understand the second and third outputs of this function. |
| `cellfun` | Applies a function to every element of a cell array. |