![]()
Optimize your breakfast:
ditch Fruit Loops for Chex!
In this project, you will write a function to compute Euclidean distances between sets of vectors. But before you get started, you need to check out your code onto whatever computer you want to use.
How to check out your code: The first thing you need to do is obtain your code from the server. We use subversion, which should be installed on your Mac OSX or Linux computer as svn
. (If you have Windows you can use TortoiseSVN or install Linux in a Virtual Box.)
To check out your code with the following command line call:
svn checkout svn+ssh://YOURNETID@en-cs-cs4780.coecis.cornell.edu:/opt/cs4780/repos/YOURNETIDOf course you must substitute YOURNETID with your Cornell NetID. This should prompt you for your password and then populate your directory with files. If you are unable to login using your Cornell NetID, post it on Piazza. In the future whenever you want to see if any new projects are available, simply call
svn update
The code for this project (project0
) consists of several text files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore.
Files you'll edit: | |
codename.txt | This file contains some codename that represents you on the leaderboard. You can change your codename anytime you want, but please avoid offensive or annoying names (e.g. no names of other class mates, no swear words, no inappropriate body parts, javascripts,...) |
email.txt | You can (but don't have to) specify an email address. The autograder will use this email address to send you an email when your grading job has finished. |
partners.txt | If you work in a group, this file should contain the two NetIDs of you and your group partner. These should be in two separate lines. There should be nothing else in this file. Please make sure that your partner also puts you in his/her partners.txt file, as project partnerships must be reciprocal. If you don't have a partner then just leave the file blank. |
innerproduct.m | Computes all pairwise inner-products between two sets of vectors. |
l2distance.m | Computes all pairwise squared l2 distances between vectors. |
Files you might want to look at: | |
l2distanceSlow.m | A horrible implementation of l2distance.m (please do not show to minors or people who get traumatized easily.) |
hw0tests.m | Runs several unit tests to find obvious bugs in your implementation. |
hw0tictoc.m | Runs and times your code. It performs binary search to find out how many distance computations you can perform in 1s. |
How to submit: You can commit your code with subversion, with the command line
svn commit -m "some meaningful comment"where you should substitute "some meaningful comment" with something that describes what you did. You can submit as often as you want until the deadline. Please be aware that the last submission determines your grade.
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and the Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.
Many machine learning algorithms access their input data primarily through pairwise distances. It is therefore important that we have a fast function that computes pairwise (Euclidean) distances of input vectors. Assume we have n data vectors →x1,…,→xn∈Rd and m vectors →z1,…,zm∈Rd. And let us define two matrices X=[→x1,…,→xn]∈Rd×n, where the ith column is a vector →xi and similarly Z=[→z1,…,→zm]. Our distance function takes as input the two matrices X and Z and outputs a matrix D∈Rn×m, where Dij=√(→xi−→zj)⊤(→xi−→zj).
A naive Octave implementation of such a distance function would look like this:
function D=l2distanceSlow(X,Z) [d,n]=size(X); % dimension of X [~,m]=size(Z); % dimension of Z D=zeros(n,m); % allocate memory for the output matrix for i=1:n % loop over vectors in X for j=1:m % loop over vectors in Z D(i,j)=0.0; for k=1:d % loop over dimensions D(i,j)=D(i,j)+(X(k,i)-Z(k,j))^2; % compute l2-distance between the ith and jth vector end; D(i,j)=sqrt(D(i,j)); % take square root end; end;Please read through the code carefully and make sure you understand it. It is perfectly correct and will produce the correct result ... eventually. To see what is wrong, run the following program in the Octave console:
X=rand(100,100); Z=rand(100,50); tic;D=l2distanceSlow(X,Z);tocThis code defines some random data in X and Z and computes the corresponding distance matrix D. The tic,toc statements time how long this takes. On my laptop it takes about 30 seconds. This is way too slow to compute 5000 distances!!! If I were to compute the distances between all 2007 test and 7291 train images in our digit classification data set (next homework), it would take over a day!! The problem is that the distance function uses a programming style that is well suited for C++ or Java, but not Octave!!
What is the problem? Octave is an interpreted language with a lot of slow execution overhead. This means you have to pay dearly for every command you execute. In our example we have three nested loops and call the inner most line many many times. This means almost our entire running time is spent in execution overhead. As a general rule, you should avoid tight loops at all cost.
Although there is an execution overhead per line in Octave, matrix operations are extremely optimized and very fast. In order to become a successful Octave programmer, you need to free yourself from the "for-loop" thinking and start thinking in terms of matrix operations. Octave can be very fast, if almost all the time is spent in a few heavy duty matrix operations. In this assignment you will do this, and transform the function above into a few matrix operations without any loops at all.
The key to efficient programming in Octave and machine learning in general is to think about it in terms of mathematics, and not in terms of Loops.
(a) Show that the Gram matrix (aka inner-product matrix) Gij=→x⊤i→zj
innerproduct.m
.
My Answer to (a): (Please edit the .html file and "svn commit".)
(b) Let us define two new matrices S,R∈Rn×m Sij=→x⊤i→xi, Rij=→z⊤j→zj.
My Answer to (b):
(c) Implement the function l2distance.m
, which computes the Euclidean distance matrix D without a single loop. Test the distance function again:
X=rand(100,100); Z=rand(100,50); tic;D=l2distance(X,Z);tocHow much faster is your code now? With this implementation you should easily be able to compute the distances between many more vectors. Call the following function
hw0tictocto see how many distance computations you can perform within one second. With the loopy implementations, the same computation might have taken you several days.
My Answer to (c):
Octave functions you might want to use: | |
size | Returns the dimensions of a matrix or vector. |
repmat | Replicates a vector to create a matrix with identical rows or columns. |
sqrt | Computes the element-wise square-root of a matrix (or scalar). |
keyboard | This is very handy if you want to debug your code. Wherever you place keyboard, you get the interactive interpreter to debug your program. |
dbstep | If you interrupted your code with keyboard you can take a step with dbstep . |
dbcont | If you interrupted your code with keyboard you can continue it with dbcont . |
Tests. To test your code you can run hw0tests
, which runs several unit tests. As your project progresses, slowly but surely these tests should stop failing.
Timing: To evaluate how fast your final code is, run hw0tictoc
.