### CS 4/5780 Jeopardy

| Naive Bayes | kNN | SVM | General ML | More General | Unsupervised |
| --- | --- | --- | --- | --- | --- |
| 100 | 100 | 100 | 100 | 100 | 100 |
| 200 | 200 | 200 | 200 | 200 | 200 |
| 300 | 300 | 300 | 300 | 300 | 300 |
| 400 | 400 | 400 | 400 | 400 | 400 |
| 500 | 500 | 500 | 500 | 500 | 500 |

## Naive Bayes — 100

Question: Why is Naïve Bayes called "generative" and what is its discriminative counterpart?

Naive Bayes is called "generative" because it models the joint distribution of the input $$x$$ and the label $$y$$. Its discriminative counterpart is Logistic Regression.
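Concretely, in a standard formulation (not specific to this course's notation):

```latex
% Generative (Naive Bayes): model the joint P(x, y) = P(x | y) P(y),
% then classify via Bayes' rule:
P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\, P(y)}{P(\mathbf{x})}
% Discriminative (logistic regression): model the conditional directly:
P(y = +1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}
```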

## Naive Bayes — 200

True or False, and Explain Why: Naive Bayes assumes that the features are independent $$P(\mathbf{x}) = \prod_{\alpha=1}^d P(\mathbf{x}_{\alpha})$$.

False! Naive Bayes assumes that the features are independent given the label: $$P(\mathbf{x} \mid y) = \prod_{\alpha=1}^d P(\mathbf{x}_{\alpha} \mid y)$$.

## Naive Bayes — 300

True or False, and Explain Why: If the training dataset is linearly separable, Naive Bayes will obtain zero training error.

False! Naive Bayes's decision boundary won't necessarily be a separating hyperplane.

## Naive Bayes — 400

Question: MAP estimation with +1 smoothing in Naive Bayes with $$c$$ classes and a categorical distribution with $$k$$ categories per feature is equivalent to adding how many hallucinated samples to the training set?

$$ck$$: one hallucinated sample for each of the $$k$$ categories in each of the $$c$$ classes.
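A sketch of where those hallucinated samples come from, for a single categorical feature (the toy data and variable names below are made up for illustration):

```python
import numpy as np

# Toy data: one categorical feature with k=3 categories, c=2 classes.
x = np.array([0, 1, 1, 2, 0, 0])   # feature values in {0, ..., k-1}
y = np.array([0, 0, 0, 1, 1, 1])   # class labels in {0, ..., c-1}
k, c = 3, 2

def smoothed_estimates(x, y, k, c):
    """MAP estimate of P(x = j | y = cls) with +1 (Laplace) smoothing."""
    probs = np.zeros((c, k))
    for cls in range(c):
        counts = np.bincount(x[y == cls], minlength=k)
        # +1 smoothing pretends we saw one extra sample of each of the k
        # categories in each of the c classes: c*k hallucinated samples total.
        probs[cls] = (counts + 1) / (counts.sum() + k)
    return probs

probs = smoothed_estimates(x, y, k, c)
print(probs)   # every entry is nonzero; each row sums to 1
```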

## Naive Bayes — 500

When is the Naïve Bayes decision boundary identical to logistic regression?

In the limit as the dataset size becomes large, if the Naive Bayes modeling assumption holds on the source distribution.

## kNN — 100

True or False, and Explain Why: kNN is a generative classifier.

False. kNN doesn't model a distribution over examples $$x$$.

## kNN — 200

Question: What is the main assumption behind nearest-neighbor classification?

Similar inputs have similar outputs.
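That assumption is essentially all a kNN classifier needs. A minimal sketch on made-up data (names and points are illustrative, not from the course code):

```python
import numpy as np

# Toy training set: two tight groups with different labels.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x, X_train, y_train, k=1):
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    # majority vote among the k nearest training labels
    return np.bincount(y_train[nearest]).argmax()

print(knn_predict(np.array([0.05, 0.1]), X_train, y_train))  # → 0
print(knn_predict(np.array([5.1, 5.0]), X_train, y_train))   # → 1
```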

## kNN — 300

Name one advantage of kNN over logistic regression, and vice versa.

kNN can produce sophisticated nonlinear decision boundaries. Logistic regression doesn't suffer from the curse of dimensionality.

## kNN — 400

Question: Why does kNN still tend to perform well on high-dimensional face and handwritten-digit data?

Those data tend to lie close to a low-dimensional subspace (or manifold).

## kNN — 500

Question: Assume you have points in 2 dimensions sampled from a normal distribution and a hyper-plane that intersects the origin. As you add dimensions with random Gaussian feature values, how are the pairwise point-distances affected? How are the distances to the hyper-plane affected?

Pairwise distances grow (on the order of $$\sqrt{d}$$) and concentrate around their mean. Distances to the hyperplane stay the same, because they depend only on each point's component along the hyperplane's normal, which lies in the original two dimensions.
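A quick simulation of this effect (the setup is made up: the hyperplane is taken to be $$\{x : x_1 = 0\}$$, so only the first coordinate matters for the distance to it):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of sample points

def distance_stats(d):
    """Mean pairwise distance, its relative spread, and mean distance to the plane."""
    X = rng.standard_normal((n, d))
    # pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None)
    pair = np.sqrt(d2[np.triu_indices(n, 1)])
    plane = np.abs(X[:, 0])                 # distance to {x : x_1 = 0}
    return pair.mean(), pair.std() / pair.mean(), plane.mean()

lo, hi = distance_stats(2), distance_stats(1000)
print("d=2:   ", lo)   # small mean, large relative spread
print("d=1000:", hi)   # large mean, tiny relative spread, same plane distance
```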

## SVM — 100

True or False, and Explain Why: Given a data set with two classes, the Perceptron algorithm always finds a separating hyper-plane within finitely many iterations - but it does not necessarily have a large margin.

False. The Perceptron algorithm won't terminate if the dataset isn't linearly separable. (If the data is linearly separable, the statement is true: finitely many updates, but with no guarantee of a large margin.)

## SVM — 200

Why do SVMs maximize the margin?

This brings the training examples as far from the classification boundary as possible, so a "noisy version" of those examples will still be classified correctly.

## SVM — 300

What is the Perceptron update after a positive input point $$x$$ is misclassified?

$$w \leftarrow w + x$$, since $$y = +1$$ for a positive point. (The general update for any misclassified point is $$w \leftarrow w + y x$$.)
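The full algorithm just repeats that update until nothing is misclassified. A minimal sketch on assumed toy data (linearly separable, so the loop terminates):

```python
import numpy as np

# Toy linearly separable data with labels in {+1, -1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * w.dot(xi) <= 0:   # misclassified (or on the boundary)
            w = w + yi * xi       # the Perceptron update
            converged = False

# the learned w now separates the training data
print(all(yi * w.dot(xi) > 0 for xi, yi in zip(X, y)))  # → True
```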

## SVM — 400

Why is the SVM margin exactly $$\frac{1}{\| w \|}$$?

Because rescaling $$w$$ and $$b$$ doesn't change the hyperplane $$\{x : w^\top x + b = 0\}$$, the SVM fixes the scale by requiring $$\min_i |w^\top x_i + b| = 1$$. The distance from the closest point to the hyperplane is then $$\frac{|w^\top x + b|}{\|w\|} = \frac{1}{\|w\|}$$.

## SVM — 500

Why shouldn't you incorporate the bias $$b$$ as a constant feature when working with SVMs?

Because the bias would then be penalized as part of the regularizer $$\|w\|^2$$, which changes the semantics of the SVM: it would prefer hyperplanes that pass close to the origin.

## General — 100

True or False, and Explain Why: Newton’s method always converges, but sometimes it is slower than Gradient Descent.

False. Newton's method can fail to converge, e.g. on the function $$f(w) = w^2 + |w|$$.

## General — 200

Name two assumptions that the Perceptron makes about the data.

The labels are binary, and the data is linearly separable.

## General — 300

In regression, give one reason to prefer the squared loss over the absolute loss, and one reason to prefer the absolute loss over the squared loss.

The squared loss is smooth (differentiable everywhere), which makes optimization easier. The absolute loss is less sensitive to noise and outliers.

## General — 400

Give two distinct reasons why you might not be able to use Newton’s method on a particular data set (independent of convergence).

The loss might not be twice differentiable, or the Hessian might not be positive definite (so the Newton step need not be a descent direction).

## General — 500

In the soft-margin SVM decision boundaries below, which corresponds to what value of C, for $$C \in \{ 0.0001, 1, 100 \}$$?

Large C penalizes slack heavily, so the SVM behaves like a hard-margin SVM; small C tolerates some training errors in exchange for a wider margin. So the boundaries correspond, in order, to C = 0.0001, 1, 100.

## More General — 100

Question: What is the difference between supervised and unsupervised learning?

Supervised learning has labels. Unsupervised learning doesn't.

## More General — 200

Question: Why is your test error typically higher than your validation error?

Distribution shift between the validation and test sets, and overfitting to the validation set during model selection (the hyperparameters were chosen to minimize validation error).

## More General — 300

Classify the following models as discriminative or generative:

• SVM
• Logistic regression
• Perceptron
• Naïve Bayes

All of them are discriminative except Naive Bayes.

## More General — 400

Classify these models as classification, regression, or both:

• kNN
• SVM
• Logistic regression
• Linear regression

Linear regression is regression; kNN is both; the other two are classification.

## More General — 500

What is the difference between linear regression and ridge regression? When might you want to use one over the other?

Ridge regression adds an $$\ell_2$$ regularizer $$\lambda \|w\|_2^2$$ to the ordinary least-squares objective: it's the MAP estimate (with a Gaussian prior on $$w$$) to linear regression's MLE. Prefer ridge when overfitting is a concern, e.g. with few samples or highly correlated features; plain linear regression suffices when data is plentiful.
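Both have closed-form solutions, which makes the difference easy to see side by side (a sketch with made-up data; $$\lambda$$ is written `lam`):

```python
import numpy as np

# OLS:   w = (X^T X)^{-1} X^T y            (the MLE)
# Ridge: w = (X^T X + lam I)^{-1} X^T y    (MAP with a Gaussian prior)
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
lam = 10.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# the regularizer shrinks the weights toward zero
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # → True
```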

## Unsupervised — 100

True or False, and Explain Why: K-means and PCA require labeled data.

False. They're unsupervised learning methods and as such don't require labels.

## Unsupervised — 200

Give an example of how to choose k, the number of clusters, when using k-means.

Look at loss versus k and see where the graph has a "knee".
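A sketch of that "knee" heuristic, using a minimal Lloyd's implementation on made-up data (the data and the simple deterministic initialization are illustrative assumptions):

```python
import numpy as np

# Two well-separated clusters, so the k-means loss should knee at k = 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)),
               rng.standard_normal((50, 2)) + 10.0])

def lloyds_loss(X, k, iters=50):
    centers = X[:: len(X) // k][:k].copy()   # simple deterministic init
    for _ in range(iters):
        # assign each point to its nearest center, then recompute the means
        labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return ((X - centers[labels]) ** 2).sum()

losses = {k: lloyds_loss(X, k) for k in (1, 2, 3, 4)}
print(losses)  # sharp drop from k=1 to k=2, then it levels off
```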

## Unsupervised — 300

Why do we remove the mean from data before computing principal components (i.e., equating them to singular vectors)?

Otherwise the first principal component would tend to point toward the data's mean rather than the direction of greatest variance, which is what PCA is after.
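A quick sketch of this effect (the data is made up: low variance along the first axis, high variance along the second, shifted far from the origin):

```python
import numpy as np

rng = np.random.default_rng(0)
# variance mostly along e_2, but the data sits far out along e_1
X = rng.standard_normal((200, 2)) * np.array([0.1, 1.0]) + np.array([10.0, 0.0])

def first_pc(A):
    """Top right singular vector of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[0]

v_raw = first_pc(X)                        # chases the mean direction e_1
v_centered = first_pc(X - X.mean(axis=0))  # recovers the variance direction e_2

print(np.abs(v_raw[0]) > 0.9)       # → True: dominated by the mean
print(np.abs(v_centered[1]) > 0.9)  # → True: direction of greatest variance
```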

## Unsupervised — 400

Does Lloyd’s algorithm (the one we discussed to practically solve K-means) always yield the optimal clustering for a given k? If it doesn’t, why not?

It doesn't: Lloyd's algorithm only converges to a local minimum of the k-means objective, and different initializations can lead to different stable clusterings.

## Unsupervised — 500

Give two motivations for applying PCA to your data.

It reduces the computational cost of working in high dimensions, removes noise that might negatively impact learning, facilitates visualization, etc.