Question: Why is Naïve Bayes called "generative" and what is its discriminative counterpart?
Naive Bayes is called "generative" because it models the joint distribution of the input \( x \) and the label \( y \). Its discriminative counterpart is Logistic Regression.
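One way to make the contrast concrete (a standard way to write it, with \( \mathbf{w}, b \) denoting logistic regression's parameters):
\[
\underbrace{P(\mathbf{x}, y) = P(\mathbf{x} \mid y)\,P(y)}_{\text{generative: Naive Bayes models both factors}}
\qquad \text{vs.} \qquad
\underbrace{P(y = +1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}}_{\text{discriminative: logistic regression models only the conditional}}
\]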
True or False, and Explain Why: Naive Bayes assumes that the features are independent \( P(\mathbf{x}) = \prod_{\alpha=1}^d P(\mathbf{x}_{\alpha}) \).
False! Naive Bayes assumes that the features are independent given the label: \( P(\mathbf{x} \mid y) = \prod_{\alpha=1}^d P(\mathbf{x}_{\alpha} \mid y) \).
True or False, and Explain Why: If the training dataset is linearly separable, Naive Bayes will obtain zero training error.
False! Naive Bayes maximizes likelihood under its modeling assumptions rather than minimizing training error, so its decision boundary is not necessarily a separating hyperplane even when one exists.
Question: MAP estimation with +1 smoothing in Naive Bayes with \( c \) classes and a categorical distribution with \( k \) categories per feature is equivalent to adding how many hallucinated samples to the training set?
\( ck \)
When is the Naïve Bayes decision boundary identical to logistic regression?
In the limit as the dataset size becomes large, if the Naive Bayes modeling assumption holds on the source distribution.
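A sketch of why, in the setting where the equivalence is usually shown (e.g. Gaussian class-conditional distributions with shared variance): plugging the Naive Bayes model into Bayes' rule gives a posterior of exactly logistic form,
\[
P(y = +1 \mid \mathbf{x})
= \frac{P(\mathbf{x} \mid y = +1)\,P(y = +1)}{P(\mathbf{x} \mid y = +1)\,P(y = +1) + P(\mathbf{x} \mid y = -1)\,P(y = -1)}
= \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
\]
for some \( \mathbf{w}, b \) determined by the class-conditional parameters, so with enough data both methods recover the same linear boundary.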
True or False, and Explain Why: kNN is a generative classifier.
False. kNN doesn't model a distribution over examples \(x\).
Question: What is the main assumption behind nearest-neighbor classification?
Similar inputs have similar outputs.
Name one advantage of kNN over logistic regression, and vice versa.
kNN can produce sophisticated nonlinear decision boundaries. Logistic regression doesn't suffer from the curse of dimensionality.
Question: Why does kNN still tend to perform well on high dimensional faces and handwritten digits data?
Those data tend to lie close to a low-dimensional subspace or manifold, so their effective dimensionality is much lower than the ambient dimension.
Question: Assume you have points in 2 dimensions sampled from a normal distribution and a hyper-plane that intersects the origin. As you add dimensions with random Gaussian feature values, how are the pairwise point-distances affected? How are the distances to the hyper-plane affected?
Pairwise distances become large and concentrated (they all grow at roughly the same rate). Distances to the hyperplane stay the same, since they depend only on the coordinates along the hyperplane's normal.
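A minimal simulation sketch of this effect (numpy/scipy only; the hyperplane is taken to be the one through the origin with normal \( e_1 \), an assumption made purely for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200

for d in [2, 10, 100, 1000]:
    X = rng.standard_normal((n, d))   # n points with i.i.d. Gaussian features
    dists = pdist(X)                  # all pairwise Euclidean distances
    # Distance to the hyperplane through the origin with normal e_1:
    # only the first coordinate matters, so extra dimensions don't change it.
    plane_dists = np.abs(X[:, 0])
    print(f"d={d:4d}  pairwise mean={dists.mean():6.2f}  "
          f"relative spread={dists.std() / dists.mean():.3f}  "
          f"plane-distance mean={plane_dists.mean():.2f}")
```

The pairwise distances grow roughly like \( \sqrt{d} \) while their relative spread shrinks; the hyperplane distances don't move.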
True or False, and Explain Why: Given a data set with two classes, the Perceptron algorithm always finds a separating hyper-plane within finitely many iterations - but it does not necessarily have a large margin.
False. The Perceptron algorithm won't terminate if the dataset isn't linearly separable.
Why do SVMs maximize the margin?
This brings the training examples as far from the classification boundary as possible, so a "noisy version" of those examples will still be classified correctly.
What is the Perceptron update after a positive input point \( x \) is misclassified?
\( w \leftarrow w + y x \); since the point is positive (\( y = +1 \)), this is simply \( w \leftarrow w + x \).
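A minimal sketch of the full algorithm around that update (plain numpy; the bias is assumed to be folded into \( w \) via a constant feature):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n, d) inputs; y: (n,) labels in {-1, +1}. Assumes linear separability."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or on the boundary)
                w = w + y_i * x_i      # the update; for a positive point this is w += x
                mistakes += 1
        if mistakes == 0:              # no mistakes in a full pass: done
            return w
    return w
```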
Why is the SVM margin exactly \(\frac{1}{\| w \|} \)?
Because we rescale \( w \) and \( b \) in the hyperplane equation so that the training points closest to the boundary satisfy \( |w^\top x + b| = 1 \), which makes the margin exactly \( \frac{1}{\|w\|} \).
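Concretely, using the point-to-hyperplane distance formula: the distance from \( x \) to the hyperplane \( w^\top x + b = 0 \) is \( \frac{|w^\top x + b|}{\|w\|} \), and under the rescaling above the closest training points satisfy \( |w^\top x + b| = 1 \), so
\[
\text{margin} = \min_i \frac{|w^\top x_i + b|}{\|w\|} = \frac{1}{\|w\|}.
\]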
Why shouldn't you incorporate the bias \(b\) as a constant feature when working with SVMs?
Because the bias would then be included in the objective \( \|w\|^2 \); penalizing the offset changes the geometry, and hence the semantics, of the SVM.
True or False, and Explain Why: Newton’s method always converges, but sometimes it is slower than Gradient Descent.
False. Newton's method can fail to converge, e.g. on the function \( f(w) = w^2 + |w| \), where the iterates oscillate between \( -\frac{1}{2} \) and \( +\frac{1}{2} \) forever.
Name two assumptions that the Perceptron makes about the data.
That the task is binary classification and that the data is linearly separable.
In regression, give one reason to prefer the squared loss over the absolute loss, and one reason to prefer the absolute loss over the squared loss.
The squared loss is smooth (differentiable everywhere), which makes optimization easier; the absolute loss is less sensitive to noise and outliers.
Give two distinct reasons why you might not be able to use Newton’s method on a particular data set (independent of convergence).
The loss might not be twice differentiable, or the Hessian matrix might not be positive definite.
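Both reasons can be read off the Newton step itself (writing \( \ell \) for the training loss), which requires second derivatives and, in effect, inverting the Hessian:
\[
w \leftarrow w - \big[ \nabla^2 \ell(w) \big]^{-1} \nabla \ell(w),
\]
so the step is undefined without a twice-differentiable loss and unreliable when \( \nabla^2 \ell(w) \) is not positive definite (the inverse may not exist, or the step may not be a descent direction).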
In the soft-margin SVM decision boundaries below, which corresponds to what value of C, for \( C \in \{ 0.0001, 1, 100 \} \)?
A large C behaves like a hard-margin SVM; a small C permits some margin violations (training errors). So it's: C = 0.0001, 1, 100.
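For reference, C enters through the soft-margin objective
\[
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i
\quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,
\]
so a large C makes margin violations expensive while a small C trades them for a wider margin.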
Question: What is the difference between supervised and unsupervised learning?
Supervised learning trains on labeled examples \( (x, y) \); unsupervised learning only has the inputs \( x \) and looks for structure in them.
Question: Why is your test error typically higher than your validation error?
Because the model and its hyperparameters are chosen to minimize validation error, you overfit to the validation set; distribution shift between validation and test data can also contribute.
Classify the following models as discriminative or generative:
All of them are discriminative except Naive Bayes.
Classify these models as classification, regression, or both:
Linear regression is regression; kNN is both; the other two are classification.
What is the difference between linear regression and ridge regression? When might you want to use one over the other?
Ridge regression adds an \( \ell_2 \) regularizer \( \lambda \|w\|^2 \) to the squared loss: it is the MAP estimate (under a Gaussian prior) to linear regression's MLE. Prefer ridge when overfitting is a concern (few samples, many or correlated features); plain linear regression is fine when data is plentiful and you don't want the bias introduced by regularization.
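Written side by side (with \( \lambda > 0 \) as the regularization strength):
\[
\underbrace{\min_{w} \sum_{i=1}^n (w^\top x_i - y_i)^2}_{\text{linear regression (MLE)}}
\qquad \text{vs.} \qquad
\underbrace{\min_{w} \sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda \|w\|_2^2}_{\text{ridge regression (MAP with a Gaussian prior)}}
\]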
True or False, and Explain Why: K-means and PCA require labeled data
False. They're unsupervised learning methods and as such don't require labels.
Give an example of how to choose k, the number of clusters, when using k-means.
Plot the k-means loss (objective) against k and pick the value where the curve has a "knee" (the elbow method).
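A sketch of that procedure with scikit-learn (the data matrix X here is synthetic, purely for illustration; plotting is left out):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_losses(X, k_values):
    """Return the k-means objective (inertia) for each candidate k."""
    losses = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        losses.append(km.inertia_)   # sum of squared distances to the closest centroid
    return losses

# Plot (or print) loss versus k and pick the k where the curve flattens out.
X = np.random.default_rng(0).standard_normal((300, 2))
for k, loss in zip(range(1, 9), kmeans_losses(X, range(1, 9))):
    print(k, round(loss, 1))
```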
Why do we remove the mean from data before computing principal components (i.e., equating them to singular vectors)?
Otherwise the top singular vectors would tend to point toward the mean of the data rather than along the directions of maximum variance, which is not what we want.
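A small numpy illustration (synthetic data with a large mean offset along \( e_1 \) but most of its variance along \( e_2 \)):

```python
import numpy as np

rng = np.random.default_rng(0)
# Variance mostly along e_2, but the whole cloud is shifted far along e_1.
X = rng.standard_normal((1000, 2)) * np.array([0.1, 1.0]) + np.array([10.0, 0.0])

def top_singular_vector(A):
    return np.linalg.svd(A, full_matrices=False)[2][0]   # first right singular vector

print("uncentered:", top_singular_vector(X))              # points along e_1, the mean direction
print("centered:  ", top_singular_vector(X - X.mean(0)))  # points along e_2, the max-variance direction
```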
Does Lloyd’s algorithm (the one we discussed to practically solve K-means) always yield the optimal clustering for a given k? If it doesn’t, why not?
It doesn't: Lloyd's algorithm only converges to a local optimum, so different initializations can end up in different stable clusterings.
Give two motivations why someone might apply PCA to their data.
It reduces the computational cost of working in high dimensions, removes noise that might negatively impact learning, facilitates visualization, etc.