Project 4: Empirical Risk Minimization

"One person's spam is another person's dinner."
-- ancient German wisdom

Introduction

In this project you will be building an email spam filter. First perform an "svn update" in your svn root directory.

The code for this project (project04) consists of several files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. Ignore vishw4.m as this function does not work.

Files you'll edit:
codename.txt: This file contains a codename that represents you on the leaderboard. You can change your codename anytime you want, but please avoid offensive or annoying names (e.g. no names of other classmates, no swear words, no inappropriate body parts, URLs, JavaScript, ...).
partner.txt: If you work in a group, this file should contain the NetID of your partner. There should be nothing else in this file. Please make sure that your partner also puts you in his/her partner.txt file, as project partnerships must be mutual.
ridge.m: Computes the ridge regression loss and gradient.
hinge.m: Computes the hinge loss and gradient.
grdescent.m: Performs gradient descent.
linclassify.m: Returns the predictions for a weight vector and a data set.
logistic.m: Computes the logistic regression loss and gradient.
spamupdate.m: (Optional) Allows you to update the spam filter when you make a mistake.
trainspamfilter.m: Trains your spam filter and saves the final weight vector in the file w0.mat.
Files you might want to look at:
tokenize.py: A simple Python script that turns raw emails into bag-of-words vectors.
tokenizedata.m: Calls tokenize.py to tokenize the raw data into vectorial format.
valsplit.m: Takes the data and splits it into 80% training (xTr,yTr) and 20% validation (xTv,yTv). The split is not random but by time (i.e. the training data consists of emails that were received before the validation data).
hw04tests.m: Runs several unit tests to find obvious bugs in your implementation.
spamfilter.m: Loads the file w0.mat and applies the corresponding spam filter to whatever test set you pass as an argument.

How to submit: You can commit your code with Subversion, using the command line

svn commit -m "some insightful comment"
where you should substitute "some insightful comment" with something that describes what you did. You can submit as often as you want until the deadline. Please be aware that the last submission determines your grade.

Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and the Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.

Computing derivatives

Before you dive into the programming part of this assignment, you will need to derive the gradients for several loss functions. You do not have to hand this part in, but save your derivations, as they are part of written homework 2.

Derive the gradient function for each of the following loss functions with respect to the weight vector $w$. Write down the gradient update (with stepsize $c$).
(Note that: $\|w\|_2^2=w^\top w$ and $\lambda$ is a non-negative constant.)

  1. Ridge Regression: ${\cal L}(w)=\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda \|w\|_2^2$
  2. Logistic Regression: ($y_i\in\{+1,-1\}$): ${\cal L}(w)=\sum_{i=1}^n \log(1+\exp{(-y_i w^\top x_i)})$
  3. Hinge loss: ($y_i\in\{+1,-1\}$): ${\cal L}(w)=\sum_{i=1}^n \max \left(1-y_i(w^\top x_i+b),0\right)+\lambda \|w\|_2^2$
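
As a sanity check on the form of the answer, the gradient for the first loss is a standard result: $\nabla {\cal L}(w)=\sum_{i=1}^n 2(w^\top x_i-y_i)x_i+2\lambda w$, so the corresponding gradient update with stepsize $c$ is $w \leftarrow w - c\left(\sum_{i=1}^n 2(w^\top x_i-y_i)x_i+2\lambda w\right)$. The logistic and hinge gradients follow from the chain rule in the same way (the hinge loss requires a case distinction at the kink of the max).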

Building an email spam filter

You will now implement these functions and their gradient updates. Before we get started, please download the dataset for this project. You can find it on Piazza:

https://piazza.com/cornell/fall2015/cs47805780/resources

The file data_train.mat contains the pre-processed email data, where emails are represented as bag-of-words vectors. You will only need this file for now, but to compete in the spam filtering competition you might want to download spam_train.zip, which contains the raw email data, so that you can invent your own features.

Type in

    >> load data_train
    >> valsplit
    >> whos
	Variables in the current scope:

   Attr Name        Size                     Bytes  Class
   ==== ====        ====                     =====  ===== 
        X        1024x5000                15006996  double
        Y           1x5000                   40000  double
        xTr      1024x4000                12310784  double
        xTv      1024x1000                 2696216  double
        yTr         1x4000                   32000  double
        yTv         1x1000                    8000  double

	Total is 2507832 elements using 30093996 bytes
    
This should generate a training data set xTr, yTr and a validation set xTv, yTv for you. (You can check which variables are in your environment with the whos command.)
It is now time to implement your classifiers. We will always use gradient descent, but with various loss functions.

  1. Implement the function ridge.m, which computes the loss and gradient for a particular data set xTr, yTr and a weight vector w. Make sure you don't forget to incorporate your regularization constant $\lambda$. (A minimal sketch of ridge.m appears after this list.) You can check your gradient with the following expression (it compares your analytical gradient against a finite-difference approximation of the true gradient):
     
             err= checkgrad('ridge', zeros(size(xTr,1),1), 1e-5, xTr, yTr, 10);
            
    Here 1e-5 is Octave code for $10^{-5}$. The return value should be very small (less than $10^{-8}$). (You can also use the checkgrad function for the later functions in this assignment and throughout your entire life whenever you have to implement functions and their gradients.)
  2. Implement the function grdescent.m, which performs gradient descent. Make sure to include the tolerance variable to stop early if the norm of the gradient is less than the tolerance value (you can use the function norm(x)). When the norm of the gradient is tiny, it means that you have arrived at a minimum.
    The first parameter of grdescent is a function handle which takes a weight vector and returns the loss and gradient. In Octave you can create such anonymous functions, e.g. with the first line of the following code:
            f=@(w) ridge(w,xTr,yTr,0.1);
            w=grdescent(f,zeros(size(xTr,1),1),1e-6,1000);
            
    You can choose what kind of step-size to implement (e.g. constant, decreasing, line search, ...). [HINT: Personally, I increase the stepsize by a factor of 1.01 in each iteration where the loss goes down, and decrease it by a factor of 0.5 if the loss went up; if you are smart, you also undo the last update in that case, to make sure the loss decreases in every iteration.] A sketch of this scheme appears after this list.
  3. Write the (almost trivial) function linclassify, which returns the predictions for a vector w and a data set xTv. (You can take it from a previous project; a one-line version appears after this list.)
  4. Now call:

    	>> trainspamfilter(xTr,yTr);
    	>> spamfilter(xTv,yTv,0.1);
    	False positive rate: 0.78%
    	True positive rate: 77.39%
    	AUC: 98.50%        
    The first command trains a spam filter with ridge regression and saves the resulting weight vector in w0.mat.
    The second command runs your spam filter with the weights in w0.mat over the validation data set.
    The outputs of spamfilter.m are the false positive rate, the true positive rate, and the AUC (area under the ROC curve), as shown above.
  5. Now implement the function hinge.m, which is the equivalent of ridge.m but with the hinge loss (see the sketch after this list). Take a look at trainspamfilter.m; you can change it to use the hinge loss instead of ridge regression to train the classifier.
  6. Now implement the function logistic.m, which is the equivalent of ridge.m but with the log-loss (logistic regression); see the sketch after this list. [By default the logistic loss does not take a regularization constant, but feel free to incorporate regularization if you want to.]
  7. You can now run hw04rocs to see if your algorithms all work. You might have to fiddle with the STEPSIZE parameter at the very top (maybe set it to something very small initially, e.g. 1e-08, and work your way up). If you want to, change trainspamfilter.m to use a different loss function with different parameters.
  8. (Optional) You can implement the function spamupdate to make small gradient steps during test time (i.e. you keep correcting the classifier after it makes a mistake).
  9. (Optional) If you take a look at the script tokenizedata.m, you can get an idea of how the tokenization is done, and you can modify it if you want to change that. For example, by default the data uses $2^{10}=1024$-dimensional features; you could change this by increasing the $10$ to $11$. You could also change the tokenize.py script, e.g. to include bi-grams. Another common trick is to remove stopwords (full lists are easy to find online). You can also add feature re-weighting with TF-IDF.
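
The sketches referenced in the steps above follow. They are minimal starting points under assumed calling conventions, not definitive solutions. First, ridge.m, assuming the signature [loss, gradient] = ridge(w, xTr, yTr, lambda) implied by the checkgrad call in step 1:

        function [loss, gradient] = ridge(w, xTr, yTr, lambda)
        % Ridge regression loss and gradient (sketch).
        % w: d x 1 weights, xTr: d x n data (one example per column),
        % yTr: 1 x n labels, lambda: non-negative regularization constant.
          residual = w' * xTr - yTr;                        % 1 x n, entries w'x_i - y_i
          loss     = sum(residual .^ 2) + lambda * (w' * w);
          gradient = 2 * xTr * residual' + 2 * lambda * w;  % d x 1
        end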
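
Next, grdescent.m, using the adaptive step-size heuristic from the hint in step 2 (grow the step by 1.01 on improvement, halve it and undo the update otherwise). The gradient-norm tolerance is hard-coded here; adapt it to however your signature receives it:

        function w = grdescent(func, w0, stepsize, maxiter)
        % Gradient descent with an adaptive step size (sketch).
        % func returns [loss, gradient] for a weight vector.
          w = w0;
          [loss, gradient] = func(w);
          for iter = 1:maxiter
            if norm(gradient) < 1e-8      % stop early once the gradient is tiny
              break;
            end
            wnew = w - stepsize * gradient;
            [newloss, newgradient] = func(wnew);
            if newloss < loss             % progress: accept the step and grow it
              stepsize = stepsize * 1.01;
              w = wnew; loss = newloss; gradient = newgradient;
            else                          % loss went up: shrink the step, retry from w
              stepsize = stepsize * 0.5;
            end
          end
        end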
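
A one-line linclassify (step 3), assuming labels in $\{+1,-1\}$:

        function preds = linclassify(w, xTv)
        % Predict a +1/-1 label for each column of xTv (sketch).
          preds = sign(w' * xTv);
        end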
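
A sketch of hinge.m (step 5), under the same signature assumption as ridge.m. The bias $b$ from the derivation section is assumed folded into w here (e.g. via a constant feature); only margin-violating examples contribute to the gradient:

        function [loss, gradient] = hinge(w, xTr, yTr, lambda)
        % Hinge loss and gradient (sketch), labels in {+1,-1}.
          margins = yTr .* (w' * xTr);   % 1 x n, entries y_i * w'x_i
          active  = margins < 1;         % examples that violate the margin
          loss     = sum(max(1 - margins, 0)) + lambda * (w' * w);
          gradient = -xTr(:, active) * yTr(active)' + 2 * lambda * w;
        end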
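
And a sketch of logistic.m (step 6), with no regularization by default:

        function [loss, gradient] = logistic(w, xTr, yTr)
        % Logistic (log-)loss and gradient (sketch), labels in {+1,-1}.
          margins  = yTr .* (w' * xTr);            % 1 x n, entries y_i * w'x_i
          loss     = sum(log(1 + exp(-margins)));
          % d/dw log(1 + exp(-m_i)) = -y_i x_i / (1 + exp(m_i))
          gradient = -xTr * (yTr ./ (1 + exp(margins)))';
        end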

Feature Extraction (Competition)

(Optional) This programming project also includes a competition. We will rank all submissions by how well your spam classifier performs on a secret test set of emails. If you want to improve your classifier, modify tokenize.py or trainspamfilter.m. You must train your weight vector locally and commit the changes to it (the file w0.mat is written by trainspamfilter.m and contains the weight vector learned by your algorithm). The autograder will evaluate this classifier on emails from the same authors as the ones in your dataset, but emails that arrived later on. You can also modify spamupdate.m; look at step 8 for tips to get you started. Also consider changing the default threshold in spamfilter.m, which is currently set to 0.3. We will use your modified tokenizers to tokenize the test emails as well.

Hints

Tests. To test your code you can run hw04tests, which runs several unit tests. As your project progresses, slowly but surely these tests should stop failing.

Credits

Parts of this webpage were copied from or heavily inspired by John DeNero's and Dan Klein's (awesome) Pacman class.