"One person's spam is another person's dinner."
-- ancient German wisdom
In this project you will build an email spam filter. First perform an "svn update" in your svn root directory. The code for this project (project04) consists of several files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. Ignore vishw4.m, as this function does not work.
Files you'll edit:

| File | Description |
| --- | --- |
| codename.txt | This file contains a codename that represents you on the leaderboard. You can change your codename anytime you want, but please avoid offensive or annoying names (e.g. no names of other classmates, no swear words, no inappropriate body parts, URLs, JavaScript, ...). |
| partner.txt | If you work in a group, this file should contain the NetID of your partner. There should be nothing else in this file. Please make sure that your partner also puts you in his/her partner.txt file, as project partnerships must be mutual. |
| ridge.m | Computes the ridge regression loss and gradient. |
| hinge.m | Computes the hinge loss and gradient. |
| grdescent.m | Performs gradient descent. |
| linclassify.m | Returns the predictions for a weight vector and a data set. |
| logistic.m | Computes the logistic regression loss and gradient. |
| spamupdate.m | (Optional) Allows you to update the spam filter when it makes a mistake. |
| trainspamfilter.m | Trains your spam filter and saves the final weight vector in a file w0.mat. |
Files you might want to look at:

| File | Description |
| --- | --- |
| tokenize.py | A simple Python script that turns raw emails into bag-of-words vectors. |
| tokenizedata.m | Calls tokenize.py to tokenize the raw data into vectorial format. |
| valsplit.m | Takes the data and splits it into 80% training (xTr, yTr) and 20% validation (xTv, yTv). The split is not random but by time (i.e. the training data consists of emails that were received before those in the validation data). |
| hw04tests.m | Runs several unit tests to find obvious bugs in your implementation. |
| spamfilter.m | Loads the file w0.mat and applies the corresponding spam filter to whatever test set you pass in as an argument. |
How to submit: You can commit your code with subversion, using the command line

```
svn commit -m "some insightful comment"
```

where you should substitute "some insightful comment" with something that describes what you did. You can submit as often as you want until the deadline. Please be aware that the last submission determines your grade.
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But we don't know when or how to help unless you ask.
Before you dive into the programming part of this assignment you will need to derive the gradients for several loss functions. You do not have to hand this part in, but save your derivations, as they are part of written homework 2.

Derive the gradient of each of the following loss functions with respect to the weight vector w, and write down the gradient update (with stepsize c). (Note that ‖w‖₂² = w⊤w and λ is a non-negative constant.)
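For reference, standard formulations of the three losses implemented in this project are shown below (a sketch; the exact constants and forms on the homework handout take precedence). Here the $x_i$ are the training vectors with labels $y_i$, $i = 1, \dots, n$:

$$
\begin{aligned}
\text{ridge:} \quad & \mathcal{L}(w) = \sum_{i=1}^{n} \left(w^\top x_i - y_i\right)^2 + \lambda \|w\|_2^2 \\
\text{hinge:} \quad & \mathcal{L}(w) = \sum_{i=1}^{n} \max\left(1 - y_i\, w^\top x_i,\; 0\right) + \lambda \|w\|_2^2 \\
\text{logistic:} \quad & \mathcal{L}(w) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i\, w^\top x_i}\right)
\end{aligned}
$$

In each case the gradient update with stepsize $c$ is $w \leftarrow w - c\, \nabla_w \mathcal{L}(w)$.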
You will now implement these functions and their gradient updates. Before we get started, please download the dataset for this project. You can find it on Piazza:
https://piazza.com/cornell/fall2015/cs47805780/resources
The file data_train.mat contains the pre-processed email data, where emails are represented as bag-of-words vectors. You will only need this file for now, but to compete in the spam filtering competition you might want to download spam_train.zip, which contains the raw email data, so that you can invent your own features.
Type in:

```
>> load data_train
>> valsplit
>> whos
Variables in the current scope:

   Attr Name      Size                   Bytes  Class
   ==== ====      ====                   =====  =====
        X      1024x5000             15006996  double
        Y         1x5000                40000  double
        xTr    1024x4000             12310784  double
        xTv    1024x1000              2696216  double
        yTr       1x4000                32000  double
        yTv       1x1000                 8000  double

Total is 2507832 elements using 30093996 bytes
```

This should generate a training data set xTr, yTr and a validation set xTv, yTv for you. (You can check which variables are in your environment with the who command.)

Now implement ridge.m, which computes the loss and gradient for a particular data set xTr, yTr and a weight vector w.
Make sure you don't forget to incorporate your regularization constant λ.
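As a starting point, here is a minimal sketch of ridge.m, assuming the squared loss plus λ‖w‖₂² formulation above and the argument order implied by the checkgrad call below; the signature fixed by the starter code is authoritative:

```
function [loss, gradient] = ridge(w, xTr, yTr, lambda)
% Ridge regression loss and gradient (a sketch, not the official solution).
% xTr is d x n (one email per column), yTr is 1 x n, w is d x 1.
residual = w' * xTr - yTr;                        % 1 x n: w'x_i - y_i
loss     = sum(residual .^ 2) + lambda * (w' * w);
gradient = 2 * xTr * residual' + 2 * lambda * w;  % d x 1
end
```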
You can check your gradient with the following expression (it checks the difference between the interpolated and the actual function value):

```
err = checkgrad('ridge', zeros(size(xTr,1),1), 1e-5, xTr, yTr, 10);
```

Here 1e-5 is Octave code for 10⁻⁵. The return value err should be very small (less than 10⁻⁸). (You can also use the checkgrad function for the later functions in this assignment, and throughout your entire life whenever you have to implement functions and their gradients.)

Next, implement grdescent.m, which performs gradient descent.
Make sure to include the tolerance variable and stop early if the norm of the gradient falls below the tolerance value (you can use the function norm(x)). When the norm of the gradient is tiny, it means that you have arrived at a minimum. The first input to grdescent is a function handle f which takes a weight vector and returns the corresponding loss and gradient.
In Octave you can create such inline functions, e.g. with the following code (first line):

```
f = @(w) ridge(w, xTr, yTr, 0.1);
w = grdescent(f, zeros(size(xTr,1),1), 1e-6, 1000);
```

You can choose what kind of step-size you implement (e.g. constant, decreasing, line search, ...). [HINT: Personally, I increase the stepsize by a factor of 1.01 on each iteration where the loss goes down, and decrease it by a factor of 0.5 if the loss went up. If you are smart, you also undo the last update in that case, to make sure the loss decreases on every iteration.]
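Here is one possible sketch of grdescent.m implementing the hint above; the initial step size is an arbitrary assumption, and the starter code's exact signature takes precedence:

```
function w = grdescent(func, w0, tolerance, maxiter)
% Gradient descent with an adaptive step size (a sketch).
% func takes a weight vector and returns [loss, gradient].
stepsize = 0.01;                       % initial step size (an assumption)
w = w0;
[loss, grad] = func(w);
for iter = 1:maxiter
    if norm(grad) < tolerance          % gradient is tiny: (near-)minimum
        break;
    end
    wOld = w; lossOld = loss; gradOld = grad;
    w = w - stepsize * grad;           % take a gradient step
    [loss, grad] = func(w);
    if loss < lossOld
        stepsize = stepsize * 1.01;    % loss went down: grow the step
    else
        w = wOld; loss = lossOld; grad = gradOld;  % undo the step
        stepsize = stepsize * 0.5;     % loss went up: shrink the step
    end
end
end
```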
Now implement linclassify.m, which returns the predictions for a weight vector w and a data set xTv. (You can take it from a previous project.) Then try:

```
>> trainspamfilter(xTr,yTr);
>> spamfilter(xTv,yTv,0.1);
False positive rate: 0.78%
True positive rate: 77.39%
AUC: 98.50%
```

The first command trains a spam filter with ridge regression and saves the resulting weight vector in w0.mat. The second command runs the spam filter stored in w0.mat over the validation data set. Among the statistics reported by spamfilter.m is the AUC; as the name suggests, it computes the area under the ROC curve and is a good measure for comparing spam filters.

Next, implement hinge.m, which is the equivalent of ridge but with the hinge loss. Take a look at trainspamfilter.m; you can change it to use the logistic loss instead of ridge regression to train the classifier.

Then implement logistic.m, which is the equivalent of ridge but with the log-loss (logistic regression). [By default the logistic loss does not take a regularization constant, but feel free to incorporate regularization if you want to.]
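If you want a starting point for these two, here are sketches under the same conventions as the ridge sketch above, assuming labels in {−1, +1}; each function belongs in its own .m file, and the formulas and signatures on the official handout take precedence:

```
function [loss, gradient] = hinge(w, xTr, yTr, lambda)
% Hinge loss with L2 regularization (a sketch).
margin   = yTr .* (w' * xTr);               % 1 x n: y_i * w'x_i
active   = margin < 1;                      % points that violate the margin
loss     = sum(max(1 - margin, 0)) + lambda * (w' * w);
gradient = -xTr(:, active) * yTr(active)' + 2 * lambda * w;
end
```

```
function [loss, gradient] = logistic(w, xTr, yTr)
% Log-loss / logistic regression (a sketch; unregularized by default,
% as noted above).
margin   = yTr .* (w' * xTr);
loss     = sum(log(1 + exp(-margin)));
gradient = -xTr * (yTr ./ (1 + exp(margin)))';
end
```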
(Optional) Implement spamupdate to make small gradient steps at test time (basically, you keep correcting the classifier after it makes a mistake).

If you take a look at tokenizedata.m, you can get an idea of how the tokenization is done, and you can modify it if you want to change that.
For example, by default the data uses 2¹⁰ = 1024 dimensional features. You could change this by increasing the exponent from 10 to 11.
You could also change the tokenize.py script (e.g. to include bigrams). A common trick is to remove stopwords; a full list is here. You can also try feature re-weighting with TF-IDF.
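As an illustration of TF-IDF re-weighting, here is a sketch in Octave (tfidfweight is a hypothetical helper, not part of the provided code):

```
function xw = tfidfweight(x)
% TF-IDF re-weighting of a d x n bag-of-words count matrix (a sketch;
% tfidfweight is a hypothetical helper, not part of the starter code).
% Each column of x is one email; each row is one token dimension.
[d, n] = size(x);
tf  = bsxfun(@rdivide, x, max(sum(x, 1), 1));  % term frequency per email
df  = sum(x > 0, 2);                           % # emails containing each token
idf = log(n ./ max(df, 1));                    % inverse document frequency
xw  = bsxfun(@times, tf, idf);                 % scale each row by its idf
end
```

You could, for instance, apply such a transformation to xTr and xTv after loading the data; whether it actually improves the AUC is an empirical question.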
(Optional) This programming project also includes a competition. We will rank all submissions by how well your spam classifier performs on a secret test set of emails. If you want to improve your classifier, modify tokenize.py or trainspamfilter. You must train your weight vector locally and commit the changes (the file w0.mat is written by trainspamfilter and contains the weight vector learned by your algorithm). The autograder will evaluate this classifier on emails from the same authors as the ones in your dataset, but emails that arrived later on. You can also modify spamupdate.m; look at task 8 for tips to get you started. Also consider changing the default threshold in spamfilter.m, which is currently set to 0.3. We will use your modified tokenizer to tokenize the test emails as well.
Tests: To test your code you can run hw04tests, which runs several unit tests. As your project progresses, these tests should, slowly but surely, stop failing.