| Symbol | Meaning |
|---|---|
| $\mathbb{R}^d$ | $d$-dimensional feature space |
| $\mathbf{x}_i$ | input vector of the $i^{th}$ sample |
| $y_i$ | label of the $i^{th}$ sample |
| $\mathcal{C}$ | label space |
| Setting | Label space | Example |
|---|---|---|
| Binary classification | $\mathcal{C}=\{0,1\}$ or $\mathcal{C}=\{-1,+1\}$ | E.g., spam filtering: an email is either spam ($+1$) or not ($-1$). |
| Multi-class classification | $\mathcal{C}=\{1,2,\cdots,K\}$ $(K\ge2)$ | E.g., face classification: a person is exactly one of $K$ identities (e.g., 1="Barack Obama", 2="George W. Bush", etc.). |
| Regression | $\mathcal{C}=\mathbb{R}$ | E.g., predicting future temperature or the height of a person. |
- Patient data in a hospital. $\mathbf{x}_i=(x_i^1,x_i^2,\cdots,x_i^d)$, where $x_i^1=0$ or $1$ encodes patient $i$'s gender, $x_i^2$ is the height of patient $i$ in $\mathrm{cm}$, $x_i^3$ is the age of patient $i$ in years, etc. In this case, $d\le100$ and the feature vector is dense, i.e., the number of nonzero coordinates in $\mathbf{x}_i$ is large relative to $d$.
- Text document in bag-of-words format. $\mathbf{x}_i=(x_i^1,x_i^2,\cdots,x_i^d)$, where $x_i^j$ counts the occurrences of the $j$th word (of, e.g., a dictionary) in document $i$; see the sketch after this list. In this case, $d\sim 100{,}000$ to $10\mathrm{M}$ and the feature vector is sparse, i.e., $\mathbf{x}_i$ consists of mostly zeros.
- Image. $\mathbf{x}_i=(x_i^1,x_i^2,\cdots,x_i^{3k})$, where $x_i^{3j-2}$, $x_i^{3j-1}$, and $x_i^{3j}$ refer to the red, green, and blue values of the $j$th pixel in the image. In this case, $d\sim 100{,}000$ to $10\mathrm{M}$ and the feature vector is dense. A $7\mathrm{MP}$ camera results in $7\mathrm{M}\times 3=21\mathrm{M}$ features.
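To make the bag-of-words representation concrete, here is a minimal Python sketch; the tiny `vocabulary` and document are made up for illustration (a real dictionary with $100{,}000+$ entries is what makes the vectors sparse):

```python
from collections import Counter

# Toy dictionary; a real one would contain 100,000+ words.
vocabulary = ["free", "money", "meeting", "tomorrow"]

doc = "free money free money now"
counts = Counter(doc.split())          # word -> number of occurrences
x = [counts[w] for w in vocabulary]    # Counter returns 0 for absent words
print(x)                               # [2, 2, 0, 0] -- mostly zeros
```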
Before we can find a function $h$, we must specify what type of function it is that we are looking for. It could be an artificial neural network, a decision tree, or one of many other types of classifiers. We call the set of possible functions the hypothesis class. By specifying the hypothesis class, we are encoding important assumptions about the type of problem we are trying to learn. The No Free Lunch Theorem states that every successful ML algorithm must make assumptions. This also means that there is no single ML algorithm that works for every setting.
There are typically two steps involved in learning a hypothesis function $h(\cdot)$. First, we select the type of machine learning algorithm that we think is appropriate for this particular learning problem. This defines the hypothesis class $\mathcal{H}$, i.e., the set of functions we can possibly learn. The second step is to find the best function within this class, $h\in\mathcal{H}$. This second step is the actual learning process and often, but not always, involves an optimization problem. Essentially, we try to find a function $h$ within the hypothesis class that makes the fewest mistakes within our training data. (If several functions tie for the fewest mistakes, we typically try to choose the "simplest" by some notion of simplicity, but we will cover this in more detail in a later class.)

How can we find the best function? For this we need some way to evaluate what it means for one function to be better than another. This is where the loss function (aka risk function) comes in. A loss function evaluates a hypothesis $h\in{\mathcal{H}}$ on our training data and tells us how bad it is. The higher the loss, the worse it is; a loss of zero means it makes perfect predictions. It is common practice to normalize the loss by the total number of training samples, $n$, so that the output can be interpreted as the average loss per sample (and is independent of $n$).
Zero-one loss:
The simplest loss function is the zero-one loss. It literally counts how many mistakes a hypothesis function $h$ makes on the training set. For every single example it suffers a loss of $1$ if it is mispredicted, and $0$ otherwise. The normalized zero-one loss returns the fraction of misclassified training samples, also often referred to as the training error. The zero-one loss is often used to evaluate classifiers in multi-class/binary classification settings, but it is rarely useful to guide optimization procedures because the function is non-continuous and non-differentiable. Formally, the zero-one loss can be stated as: $$\mathcal{L}_{0/1}(h)=\frac{1}{n}\sum^n_{i=1}\delta_{h(\mathbf{x}_i)\ne y_i}, \mbox{ where }\delta_{h(\mathbf{x}_i)\ne y_i}=\begin{cases} 1,&\mbox{ if $h(\mathbf{x}_i)\ne y_i$}\\ 0,&\mbox{ o.w.} \end{cases} $$ This loss function returns the error rate on the data set $D$. For every example that the classifier misclassifies (i.e., gets wrong), a loss of $1$ is suffered, whereas correctly classified samples lead to $0$ loss.
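As a sanity check, the zero-one loss is a one-liner in numpy. A minimal sketch (the function name and toy data are our own, purely for illustration):

```python
import numpy as np

def zero_one_loss(h, X, y):
    """(1/n) * sum of indicators that h(x_i) != y_i, i.e., the error rate."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# A classifier that always predicts +1 misclassifies half of this toy set:
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([+1, -1, +1, -1])
print(zero_one_loss(lambda x: +1, X, y))  # 0.5
```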
Squared loss:
The squared loss function is typically used in regression settings. It iterates over all training samples and suffers the loss $\left(h(\mathbf{x}_i)-y_i\right)^2$. The squaring has two effects: first, the loss suffered is always nonnegative; second, the loss suffered grows quadratically with the absolute mispredicted amount. The latter property discourages predictions that are very far off (or the penalty would be so large that a different hypothesis function is likely better suited). On the flip side, if a prediction is very close to correct, its square will be tiny and little attention will be given to that example to obtain zero error. For example, if $|h(\mathbf{x}_i)-y_i|=0.001$ the squared loss will be even smaller, $0.000001$, and will likely never be fully corrected. If, given an input $\mathbf{x}$, the label $y$ is probabilistic according to some distribution $P(y|\mathbf{x})$, then the optimal prediction to minimize the squared loss is to predict the expected value, i.e., $h(\mathbf{x})=\mathbf{E}_{P(y|\mathbf{x})}[y]$. Formally, the squared loss is: $$\mathcal{L}_{sq}(h)=\frac{1}{n}\sum^n_{i=1}(h(\mathbf{x}_i)-y_i)^2.$$
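The quadratic growth is easy to see numerically. A minimal sketch (again with toy data and our own naming):

```python
import numpy as np

def squared_loss(h, X, y):
    """(1/n) * sum of (h(x_i) - y_i)^2."""
    predictions = np.array([h(x) for x in X])
    return np.mean((predictions - y) ** 2)

# Being off by 9 costs 81; being off by 0.001 costs only 0.000001.
X = np.array([[0.0], [1.0]])
y = np.array([10.0, 1.001])
print(squared_loss(lambda x: 1.0, X, y))  # (81 + 1e-6) / 2 ~= 40.5
```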
Absolute loss:
Similar to the squared loss, the absolute loss function is also typically used in regression settings. It suffers the penalty $|h(\mathbf{x}_i)-y_i|$ for each sample. Because the suffered loss grows only linearly with the mispredictions, it is more suitable for noisy data (when some mispredictions are unavoidable and shouldn't dominate the loss). If, given an input $\mathbf{x}$, the label $y$ is probabilistic according to some distribution $P(y|\mathbf{x})$, then the optimal prediction to minimize the absolute loss is to predict the median value, i.e., $h(\mathbf{x})=\textrm{MEDIAN}_{P(y|\mathbf{x})}[y]$. Formally, the absolute loss can be stated as: $$\mathcal{L}_{abs}(h)=\frac{1}{n}\sum^n_{i=1}|h(\mathbf{x}_i)-y_i|.$$
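Analogously for the absolute loss; the sketch below (toy data, our own naming) shows that a single outlier contributes only linearly to the loss:

```python
import numpy as np

def absolute_loss(h, X, y):
    """(1/n) * sum of |h(x_i) - y_i|."""
    predictions = np.array([h(x) for x in X])
    return np.mean(np.abs(predictions - y))

# The outlier y = 100 contributes 99 to the sum, not 99^2 = 9801.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 1.0, 100.0])
print(absolute_loss(lambda x: 1.0, X, y))  # (0 + 0 + 99) / 3 = 33.0
```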
Given a loss function, we can then attempt to find the function $h$ that minimizes the loss: $$ h=\textrm{argmin}_{h\in{\mathcal{H}}}\mathcal{L}(h) $$ A big part of machine learning focuses on the question of how to perform this minimization efficiently.
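To make the argmin concrete, here is a toy sketch in which the hypothesis class is restricted to constant predictors $h_c(\mathbf{x})=c$ and the minimization is done by grid search (both choices are ours, purely for illustration). It also confirms the mean/median claims above:

```python
import numpy as np

# Toy hypothesis class H = {h_c : h_c(x) = c}; grid search approximates
# h = argmin_{h in H} L(h) for both losses.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
candidates = np.linspace(0.0, 110.0, 11001)  # grid with step 0.01

best_sq  = min(candidates, key=lambda c: np.mean((c - y) ** 2))
best_abs = min(candidates, key=lambda c: np.mean(np.abs(c - y)))

print(best_sq,  np.mean(y))    # ~22.0 -- the squared loss picks the mean
print(best_abs, np.median(y))  # ~3.0  -- the absolute loss picks the median
```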
If you find a function $h(\cdot)$ with low loss on your data $D$, how do you know whether it will still get examples right that are not in $D$?
Bad example: the "memorizer" $h(\cdot)$ $$h(x)=\begin{cases} y_i,&\mbox{ if $\exists (\mathbf{x}_i,y_i)\in D$, s.t., $\mathbf{x}=\mathbf{x}_i$},\\ 0,&\mbox{ o.w.} \end{cases}$$ For this $h(\cdot)$, we get $0\%$ error on the training data $D$, but it does horribly on samples not in $D$, i.e., this function overfits the training data.
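The memorizer is easy to implement, which makes its failure mode easy to see. A minimal sketch in plain Python (our own naming):

```python
def memorizer(D):
    """h(x) = y_i if x equals some training input x_i, and 0 otherwise."""
    table = {tuple(x): y for x, y in D}
    def h(x):
        return table.get(tuple(x), 0)
    return h

D = [((1.0,), 1), ((2.0,), -1)]
h = memorizer(D)
print(h((1.0,)))  # 1 -> 0% training error
print(h((1.5,)))  # 0 -> (almost) always wrong off the training set
```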
To estimate how well $h(\cdot)$ generalizes, we split the data $D$ into a training set $D_\mathrm{TR}$ and a test set $D_\mathrm{TE}$ (a minimal splitting helper is sketched after this list):
- By time, if the data is temporally collected. In general, if the data has a temporal component, we must split it by time.
- Uniformly at random, if (and, in general, only if) the data is $i.i.d.$
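A sketch covering both cases (the function name, the `temporal` flag, and the 80/20 default are our own choices):

```python
import numpy as np

def split_data(X, y, test_fraction=0.2, temporal=False, seed=0):
    """Split D into D_TR and D_TE: by time if temporal, else uniformly at random."""
    n = len(y)
    n_test = int(round(test_fraction * n))
    if temporal:
        idx = np.arange(n)  # assumes the rows are already sorted by time
    else:
        idx = np.random.default_rng(seed).permutation(n)
    train, test = idx[: n - n_test], idx[n - n_test :]
    return X[train], y[train], X[test], y[test]
```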
Quiz: Why does $\epsilon_\mathrm{TE}\to\epsilon$ as $|D_\mathrm{TE}|\to +\infty$? This is due to the weak law of large numbers: the average loss over $i.i.d.$ test samples converges in probability to its expectation, the true generalization error $\epsilon$. Thus, the testing data set $D_\mathrm{TE}$ should consist of $i.i.d.$ data points.
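A quick simulation illustrates the convergence. Here the true error rate $\epsilon=0.3$ of a hypothetical fixed classifier is made up, and each test point independently incurs a zero-one loss with that probability:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.3  # hypothetical true error rate of some fixed classifier h

for n_te in [10, 100, 10_000, 1_000_000]:
    losses = rng.random(n_te) < epsilon  # i.i.d. 0/1 losses on test points
    print(n_te, losses.mean())           # epsilon_TE approaches 0.3 as n_te grows
```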
No free lunch. Every ML algorithm has to make assumptions: which hypothesis class $\mathcal{H}$ should you choose? This choice depends on the data and encodes your assumptions about the data set/distribution $\mathcal{P}$. Clearly, there's no one perfect $\mathcal{H}$ for all problems.
Example. Assume that $(\mathbf{x}_1,y_1)=(1,1)$, $(\mathbf{x}_2,y_2)=(2,2)$, $(\mathbf{x}_3,y_3)=(3,3)$, $(\mathbf{x}_4,y_4)=(4,4)$, and $(\mathbf{x}_5,y_5)=(5,5)$. Question: what is the value of $y$ if $\mathbf{x}=2.5$? It is impossible to know the answer without assumptions: a linear hypothesis class suggests $y=2.5$, whereas, e.g., a nearest-neighbor hypothesis class suggests $y=2$, and the data alone cannot arbitrate between them.
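This is easy to demonstrate: different hypothesis classes, fit to the same five points, disagree at $\mathbf{x}=2.5$ (the nearest-neighbor tie-breaking below is just an implementation detail of `argmin`):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Linear hypothesis class: the least-squares line is y = x.
w, b = np.polyfit(X, y, deg=1)
print(w * 2.5 + b)                    # 2.5

# 1-nearest-neighbor hypothesis class: label of the closest training point.
print(y[np.argmin(np.abs(X - 2.5))])  # 2.0 (argmin breaks the tie toward x=2)

# The "memorizer" from above: 2.5 is not in D, so it would predict 0.
```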