Lecture 1: Supervised Learning



The goal in supervised learning is to make predictions from data. For example, one popular application of supervised learning is email spam filtering. Here, an email (the data instance) needs to be classified as spam or not-spam. Following the approach of traditional computer science, one might be tempted to write a carefully designed program that follows some rules to decide whether an email is spam. Although such a program might work reasonably well for a while, it has significant drawbacks. As email spam changes, it would have to be rewritten. Spammers could attempt to reverse engineer the software and design messages that circumvent it. And even if it is successful, it probably could not easily be applied to different languages.

Machine learning uses a different approach to generate a program that can make predictions from data. Instead of being programmed by hand, the program is learned from past data. This process works if we have data instances for which we know exactly what the right prediction would have been. For example, past data might be user-annotated as spam or not-spam. A machine learning algorithm can utilize such data to learn a program, a classifier, that predicts the correct label of each annotated data instance.

Other successful applications of machine learning include web-search ranking (predict which web page the user will click on based on their search query), placement of online advertisements (predict the expected revenue of an ad when it is placed on a homepage seen by a specific user), visual object recognition (predict which object is in an image, e.g., from a camera mounted on a self-driving car), and face detection (predict whether an image patch contains a human face).


Let us formalize the supervised machine learning setup. Our training data comes in pairs $(\mathbf{x},y)$, where $\mathbf{x}\in\mathbb{R}^d$ is the input instance and $y$ its label. The entire training data is denoted as $$ D=\left\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\right\}\subseteq \mathbb{R}^d\times \mathcal{C} $$ where:
$\mathbb{R}^d$: $d$-dimensional feature space
$\mathbf{x}_i$: input vector of the $i$-th sample
$y_i$: label of the $i$-th sample
$\mathcal{C}$: label space
There are multiple scenarios for the label space $\mathcal{C}$:
Binary classification, i.e., $\mathcal{C}=\{0,1\}$ or $\mathcal{C}=\{-1,+1\}$. Example: Spam filtering. An email is either spam ($+1$) or not ($-1$).
Multi-class classification, i.e., $\mathcal{C}=\{1,2,\cdots,K\}$ $(K\ge2)$. Example: Face classification. A person can be exactly one of $K$ identities (e.g., 1="Barack Obama", 2="George W. Bush", etc.).
Regression, i.e., $\mathcal{C}=\mathbb{R}$. Example: Predict a future temperature or the height of a person.
The goal of supervised learning is to find a function $h:\mathbb{R}^d\to\mathcal{C}$, such that
    $h(\mathbf{x}_i)\approx y_i$ for all $(\mathbf{x}_i,y_i)\in D$ (training); $h(\mathbf{x}_i)\approx y_i$ for all $(\mathbf{x}_i,y_i)\not\in D$ (testing).

Examples of feature vectors

We call $\mathbf{x}_i$ a feature vector and its $d$ dimensions the features describing the $i$-th sample.
Patient data in a hospital. $\mathbf{x}_i=(x_i^1,x_i^2,\cdots,x_i^d)$, where $x_i^1=0$ or $1$ encodes patient $i$'s gender, $x_i^2$ is the height of patient $i$ in cm, $x_i^3$ is the age of patient $i$ in years, etc. In this case, $d\le100$ and the feature vector is dense, i.e., the number of nonzero coordinates in $\mathbf{x}_i$ is large relative to $d$.
Text document in bag-of-words format. $\mathbf{x}_i=(x_i^1,x_i^2,\cdots,x_i^d)$, where $x_i^j$ is the number of occurrences of the $j$-th word (of, e.g., a dictionary) in document $i$. In this case, $d$ ranges from roughly $100{,}000$ to $10\mathrm{M}$ and the feature vector is sparse, i.e., $\mathbf{x}_i$ consists mostly of zeros.
Image. $\mathbf{x}_i=(x_i^1,x_i^2,\cdots,x_i^{3k})$, where $x_i^{3j-2}$, $x_i^{3j-1}$, and $x_i^{3j}$ refer to the red, green, and blue values of the $j$-th pixel in the image. In this case, $d$ ranges from roughly $100{,}000$ to $10\mathrm{M}$ and the feature vector is dense. A $7\mathrm{MP}$ camera results in $7\mathrm{M}\times 3=21\mathrm{M}$ features.
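The bag-of-words representation above can be sketched in a few lines of Python. This is only an illustration: the function name `bag_of_words`, the six-word vocabulary, and the whitespace tokenization are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Map a document to a count vector x, where x[j] is the number of
    occurrences of vocabulary[j] in the document."""
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy dictionary; a realistic one has 100,000 to 10M entries.
vocabulary = ["buy", "cheap", "hello", "meeting", "now", "pills"]

x = bag_of_words("Buy cheap pills now buy now", vocabulary)
print(x)  # [2, 1, 0, 0, 2, 1]

# With a realistic dictionary, almost all entries of x would be zero (sparse).
num_nonzero = sum(1 for value in x if value != 0)
print(num_nonzero, "of", len(x), "features are nonzero")
```

With a dictionary of realistic size, one would typically store only the nonzero entries instead of the full list.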

Hypothesis classes and No Free Lunch

Before we can find a function $h$, we must specify what type of function we are looking for. It could be an artificial neural network, a decision tree, or one of many other types of classifiers. We call the set of possible functions the hypothesis class. By specifying the hypothesis class, we are encoding important assumptions about the type of problem we are trying to learn. The No Free Lunch Theorem states that every successful ML algorithm must make assumptions. This also means that there is no single ML algorithm that works for every setting.

Loss Functions

We want to find the best $h(\cdot)$ for a data set $D$. To quantify "best", we introduce the concept of a loss function (also called a risk function). A loss function measures how wrong your classifier $h$ still is, or how many mistakes it still makes. Common examples:
Zero-one loss: $$\mathcal{L}_{0/1}(h)=\frac{1}{n}\sum^n_{i=1}\delta_{h(\mathbf{x}_i)\ne y_i}, \mbox{ where }\delta_{h(\mathbf{x}_i)\ne y_i}=\begin{cases} 1,&\mbox{ if $h(\mathbf{x}_i)\ne y_i$}\\ 0,&\mbox{ otherwise} \end{cases} $$ This loss function returns the error rate on the data set $D$. For every example that the classifier misclassifies (i.e., gets wrong) a loss of 1 is suffered, whereas correctly classified samples lead to 0 loss.
Squared loss: $$\mathcal{L}_{sq}(h)=\frac{1}{n}\sum^n_{i=1}(h(\mathbf{x}_i)-y_i)^2.$$ The squared loss function is typically used in regression, and it gives magnified penalties if $|h(\mathbf{x}_i)-y_i|$ is large.
Absolute loss: $$\mathcal{L}_{abs}(h)=\frac{1}{n}\sum^n_{i=1}|h(\mathbf{x}_i)-y_i|.$$ The absolute loss function is also typically used in regression. Its penalty grows only linearly in $|h(\mathbf{x}_i)-y_i|$, so large errors are not magnified. Thus, absolute loss is better suited to noisy data, where a few labels may be far off.
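The three losses above translate directly into code. The following is a minimal sketch in plain Python; the function names and the tiny data set (with one deliberately noisy label) are assumptions chosen for illustration.

```python
def zero_one_loss(h, data):
    """Fraction of examples (x, y) in data with h(x) != y."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def squared_loss(h, data):
    """Mean of (h(x) - y)^2 over the data set."""
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

def absolute_loss(h, data):
    """Mean of |h(x) - y| over the data set."""
    return sum(abs(h(x) - y) for x, y in data) / len(data)

# Hypothetical 1-d regression data; the last label is a noisy outlier.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 14.0)]

def h(x):
    # A hypothetical predictor: the identity function.
    return x

print(zero_one_loss(h, data))  # 0.25
print(squared_loss(h, data))   # 25.0
print(absolute_loss(h, data))  # 2.5
```

The single outlier contributes $10^2=100$ to the squared-loss sum but only $10$ to the absolute-loss sum, which is the magnification effect described above.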


Given a loss function, we can then attempt to find the function $h$ in the hypothesis class $\mathcal{H}$ that minimizes the loss: $$ h=\textrm{argmin}_{h\in{\mathcal{H}}}\mathcal{L}(h) $$ A big part of machine learning focuses on the question of how to do this minimization efficiently.
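To make the argmin concrete, here is a brute-force sketch over a small, finite hypothesis class. Everything in it is an assumption for illustration: the class consists of the functions $h_w(x)=wx$ for slopes $w$ on a coarse grid, the data is made up, and real learning algorithms use far more efficient optimization than exhaustive search.

```python
def squared_loss(h, data):
    """Mean of (h(x) - y)^2 over the data set."""
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

def make_linear(w):
    """Return the hypothesis h_w(x) = w * x."""
    return lambda x: w * x

# Hypothetical training data, roughly following y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Hypothetical hypothesis class: slopes 0.0, 0.1, ..., 4.0.
candidate_slopes = [i / 10 for i in range(41)]

# argmin over the hypothesis class: try every candidate, keep the best.
best_w = min(candidate_slopes, key=lambda w: squared_loss(make_linear(w), data))
print(best_w)  # 2.0
```

Exhaustive search only works because this class has 41 members; it is meant to show what the argmin means, not how to compute it in practice.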

If you find a function $h(\cdot)$ with low loss on your data $D$, how do you know whether it will still get examples right that are not in $D$?

Bad example: "memorizer" $h(\cdot)$ $$h(\mathbf{x})=\begin{cases} y_i,&\mbox{ if $\exists (\mathbf{x}_i,y_i)\in D$, s.t., $\mathbf{x}=\mathbf{x}_i$},\\ 0,&\mbox{ otherwise} \end{cases}$$ This $h(\cdot)$ achieves $0\%$ error on the training data $D$ but does horribly on samples not in $D$, i.e., it suffers from overfitting.
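The memorizer is easy to write down as code, which makes its failure mode easy to see. This is a sketch under assumptions: the helper names and the toy data are hypothetical, and the training inputs are assumed to be distinct.

```python
def make_memorizer(train):
    """Build h(x) that returns the stored label for x if x was seen in
    training, and 0 otherwise."""
    table = {x: y for x, y in train}  # assumes distinct training inputs
    def h(x):
        return table.get(x, 0)
    return h

def zero_one_loss(h, data):
    """Fraction of examples (x, y) in data with h(x) != y."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

# Hypothetical binary data with labels in {-1, +1}; inputs are tuples.
train = [((1, 0), 1), ((0, 1), -1), ((1, 1), 1)]
unseen = [((2, 0), 1), ((0, 2), -1)]

h = make_memorizer(train)
print(zero_one_loss(h, train))   # 0.0
print(zero_one_loss(h, unseen))  # 1.0
```

Because the default output 0 is not even a valid label here, every unseen point is misclassified.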

Train / Test splits

To resolve the overfitting issue, we usually split $D$ into three subsets: $D_\mathrm{TR}$ as the training data, $D_\mathrm{VA}$ as the validation data, and $D_\mathrm{TE}$ as the test data. A common split is $80\%$, $10\%$, and $10\%$. Then, we choose $h(\cdot)$ based on $D_\mathrm{TR}$, and evaluate $h(\cdot)$ on $D_\mathrm{TE}$.
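A random 80/10/10 split might be sketched as follows. The function name `split_dataset`, its parameters, and the toy data are assumptions for illustration; shuffling uniformly at random is only appropriate when the data is i.i.d.

```python
import random

def split_dataset(data, frac_train=0.8, frac_val=0.1, seed=0):
    """Shuffle a copy of data and cut it into train / validation / test."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(frac_train * len(shuffled))
    n_val = round(frac_val * len(shuffled))
    d_tr = shuffled[:n_train]
    d_va = shuffled[n_train:n_train + n_val]
    d_te = shuffled[n_train + n_val:]  # whatever remains is the test set
    return d_tr, d_va, d_te

# Hypothetical data set of 100 (x, y) pairs.
data = [(i, i % 2) for i in range(100)]
d_tr, d_va, d_te = split_dataset(data)
print(len(d_tr), len(d_va), len(d_te))  # 80 10 10
```

The three subsets are disjoint and together cover all of $D$, so no test example is ever seen during training.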

Quiz: Why do we need $D_\mathrm{VA}$?
$D_\mathrm{VA}$ is used to check whether the $h(\cdot)$ obtained from $D_\mathrm{TR}$ suffers from overfitting. $h(\cdot)$ is validated on $D_\mathrm{VA}$; if the loss is too large, $h(\cdot)$ is revised based on $D_\mathrm{TR}$ and validated again on $D_\mathrm{VA}$. This process goes back and forth until $h(\cdot)$ gives low loss on $D_\mathrm{VA}$. There is a trade-off between the sizes of $D_\mathrm{TR}$ and $D_\mathrm{VA}$: a larger $D_\mathrm{TR}$ gives better training results, whereas a larger $D_\mathrm{VA}$ gives a more reliable (less noisy) validation estimate.

How to Split the Data?

By time, if the data is temporally collected.
In general, if the data has a temporal component, we must split it by time.
Uniformly at random, if (and, in general, only if) the data is $i.i.d.$.
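For temporally collected data, a split by time might look like the sketch below: everything before a cutoff is used for training and everything after it for testing, so the model is never evaluated on data that is older than what it was trained on. The record layout `(timestamp, x, y)` and the function name are assumptions for illustration.

```python
def split_by_time(records, frac_train=0.8):
    """Sort records by timestamp and use the earliest part for training
    and the most recent part for testing."""
    ordered = sorted(records, key=lambda record: record[0])
    cutoff = round(frac_train * len(ordered))
    return ordered[:cutoff], ordered[cutoff:]

# Hypothetical records of the form (timestamp, x, y), in arbitrary order.
records = [(5, 0.3, 1), (1, 0.1, -1), (4, 0.9, 1), (2, 0.4, -1), (3, 0.7, 1)]

d_tr, d_te = split_by_time(records)
print([t for t, _, _ in d_tr])  # [1, 2, 3, 4]
print([t for t, _, _ in d_te])  # [5]
```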
As a general rule of thumb, $D_\mathrm{TE}$ and $D_\mathrm{TR}$ should be independent of each other, so that the test error (or test loss) approximates the true generalization error/loss. Putting everything together: $$\mbox{Learning: }h^*(\cdot)=\textrm{argmin}_{h(\cdot)\in\mathcal{H}}\frac{1}{|D_\mathrm{TR}|}\sum_{(\mathbf{x},y)\in D_\mathrm{TR}}\ell(\mathbf{x},y|h(\cdot)),$$ where $\mathcal{H}$ is the hypothesis class (i.e., the set of all possible classifiers $h(\cdot)$). In other words, $h^*(\cdot)$ is the classifier that minimizes the training loss. $$\mbox{Evaluation: }\epsilon_\mathrm{TE}=\frac{1}{|D_\mathrm{TE}|}\sum_{(\mathbf{x},y)\in D_\mathrm{TE}} \ell (\mathbf{x},y|h^*(\cdot)),$$ i.e., $\epsilon_\mathrm{TE}$ is the testing loss. $$\mbox{Generalization: }\epsilon=\mathbb{E}_{(\mathbf{x},y)\sim \mathcal{P}}[\ell(\mathbf{x},y|h^*(\cdot))],$$ where $\mathcal{P}$ is the "true" distribution from which the data, including $D_\mathrm{TE}$, is drawn. Thus, $\epsilon$ is the generalization loss, which cannot be computed exactly because $\mathcal{P}$ is unknown.

Quiz: Why does $\epsilon_\mathrm{TE}\to\epsilon$ as $|D_\mathrm{TE}|\to +\infty$? This is due to the weak law of large numbers. Thus, the testing data set $D_\mathrm{TE}$ should consist of $i.i.d.$ data points.
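The convergence of the test error can be illustrated with a small simulation. It rests on a strong simplifying assumption: instead of a real classifier and data distribution, each test point is simply misclassified independently with a fixed probability `TRUE_ERROR`, which plays the role of $\epsilon$. The names and the value 0.2 are hypothetical.

```python
import random

TRUE_ERROR = 0.2  # assumed generalization error of a hypothetical classifier

def estimate_test_error(n, rng):
    """Average zero-one loss over n i.i.d. test points, each of which is
    misclassified with probability TRUE_ERROR."""
    mistakes = sum(1 for _ in range(n) if rng.random() < TRUE_ERROR)
    return mistakes / n

rng = random.Random(0)  # fixed seed so the run is reproducible
for n in [10, 100, 1000, 10000, 100000]:
    # The estimate fluctuates for small n and settles near TRUE_ERROR.
    print(n, estimate_test_error(n, rng))
```

The standard deviation of the estimate shrinks like $1/\sqrt{|D_\mathrm{TE}|}$, so each tenfold increase in test set size narrows the fluctuation by roughly a factor of three.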

No free lunch. Every ML algorithm has to make assumptions. Which hypothesis class $\mathcal{H}$ should you choose? This choice depends on the data, and encodes your assumptions about the data set/distribution $\mathcal{P}$. Clearly, there is no one perfect $\mathcal{H}$ for all problems.

Example. Assume that $(\mathbf{x}_1,y_1)=(1,1)$, $(\mathbf{x}_2,y_2)=(2,2)$, $(\mathbf{x}_3,y_3)=(3,3)$, $(\mathbf{x}_4,y_4)=(4,4)$, and $(\mathbf{x}_5,y_5)=(5,5)$.

Question: what is the value of $y$ if $\mathbf{x}=2.5$? It is impossible to know the answer without assumptions.
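To see why, consider three hypotheses that all fit the five training points exactly yet disagree at $\mathbf{x}=2.5$. The three functions below are arbitrary choices made for illustration; each one corresponds to a different assumption about the data.

```python
import math

train = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

def h_linear(x):
    """Assumes the relationship is linear: y = x."""
    return x

def h_step(x):
    """Assumes y is piecewise constant: y = floor(x)."""
    return math.floor(x)

def h_poly(x):
    """Assumes a degree-5 polynomial whose extra term vanishes at 1, ..., 5."""
    return x + (x - 1) * (x - 2) * (x - 3) * (x - 4) * (x - 5)

for h in (h_linear, h_step, h_poly):
    # Every hypothesis has zero training error on all five points.
    assert all(h(x) == y for x, y in train)

print(h_linear(2.5))  # 2.5
print(h_step(2.5))    # 2
print(h_poly(2.5))    # 1.09375
```

The training data alone cannot distinguish between these hypotheses; only an assumption about which kind of function is plausible can.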