CS 4782 Jeopardy

Basics | CNNs | NLP | Modern Vision | Gen. Models | Misc.

Basics — 100

Name three regularization techniques.

Dropout, weight decay, L2/L1 regularization, early stopping, data augmentation, and batch normalization.

Basics — 200

True or False, and Explain Why: Different training runs of a ResNet model using SGD converge to the same solution.

False. Due to the stochastic nature of SGD and the non-convexity of the loss landscape, different training runs can converge to different local minima.

Basics — 300

What is an adaptive optimizer? Name three kinds of adaptive optimizers.

Adaptive optimizers adapt the learning rate for each individual parameter based on its historical gradients. Adagrad, RMSProp, and Adam are adaptive optimizers.
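
As a rough illustration of per-parameter adaptation, here is a minimal sketch of one Adam update on a single scalar parameter. It assumes the standard Adam rule with its commonly used defaults; the helper name `adam_step` and the example gradient values are illustrative only.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction, step t >= 1
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters with very different gradient scales take similar-sized steps,
# because each step is divided by that parameter's own gradient magnitude.
p_small, _, _ = adam_step(1.0, 0.01, 0.0, 0.0, t=1)
p_large, _, _ = adam_step(1.0, 100.0, 0.0, 0.0, t=1)
```

Plain SGD with the same learning rate would move these two parameters by amounts that differ by a factor of 10,000.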

Basics — 400

True or False, and Explain Why: Gradient descent is preferable to stochastic optimizers for deep learning because it provides more stable estimates of the gradient.

False. Full-batch gradient descent is computationally expensive for large datasets, while stochastic optimizers like SGD use cheap, noisy gradient estimates that can help escape suboptimal solutions. The noise also has a regularizing effect.

Basics — 500

A student is training a ResNet with a learning rate of 1e-4. They add batch norm layers and re-train the model with the same hyperparameters. Will they see a large performance improvement?

No. Adding batch norm layers without adjusting the learning rate is unlikely to result in a large performance improvement. Batch normalization allows for higher learning rates, so the student should increase the learning rate to observe performance improvements.

CNN — 100

True or False: Deep convolutional networks are always preferable to shallow networks.

False. Deep convolutional networks can be harder to optimize and may be more susceptible to overfitting.

CNN — 200

What are residual connections and how are they helpful for CNNs?

Residual connections, or skip connections, in convolutional neural networks (CNNs) add a layer's input directly to its output, so data can bypass some layers. This helps mitigate vanishing gradients in deep networks.
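
One way to see the effect on gradients is to write a residual block as the input plus a learned branch. The symbols $F$ (the bypassed layers) and $\mathcal{L}$ (the loss) are notation assumed here for illustration.

```latex
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial F}{\partial x} \right)
```

The identity term gives the gradient a direct path to earlier layers even when $\partial F / \partial x$ is small.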

CNN — 300

An input of shape (h, w, c) is convolved with two 3x3 filters with a padding of 1 around the entire image. What is the shape of the output?

It will be of size (h, w, 2), assuming a stride of 1. The padding ensures that the spatial size is preserved, and each filter produces one output channel.
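
The shape arithmetic can be checked with a small helper. This is a sketch that assumes a stride of 1 by default and uses the standard output-size formula; the function name and the 32x32 example input are illustrative.

```python
def conv_output_shape(h, w, num_filters, kernel=3, padding=1, stride=1):
    """Output shape (h_out, w_out, channels) of a 2D convolution.

    Each filter spans all input channels, so the output channel count
    equals the number of filters, not the number of input channels.
    """
    h_out = (h + 2 * padding - kernel) // stride + 1
    w_out = (w + 2 * padding - kernel) // stride + 1
    return (h_out, w_out, num_filters)

print(conv_output_shape(32, 32, num_filters=2))  # (32, 32, 2)
```

Without padding (`padding=0`), the same 3x3 filters would shrink each spatial dimension by 2.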

CNN — 400

True or False: 1x1 convolutions are not very useful because they can't incorporate spatial information.

False. One of the main uses of 1x1 convolutions is to reduce the number of channels in feature maps, acting as a cross-channel linear transformation, prior to aggregating spatial information.

CNN — 500

A 5x5 kernel is applied to a feature map with 4 channels and produces a feature map with 8 channels. How many parameters does it have? Ignore bias terms.

It has (5x5)x4x8 = 800 parameters: each of the 8 output channels has its own 5x5 kernel over all 4 input channels.
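
The same count as a short sketch, assuming a standard (non-grouped) convolution; the helper name `conv_param_count` is illustrative.

```python
def conv_param_count(kernel_h, kernel_w, in_channels, out_channels, bias=False):
    """Number of learnable parameters in a standard 2D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)

print(conv_param_count(5, 5, 4, 8))  # 800
```

With bias terms included, there would be one extra parameter per output channel.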

NLP — 100

What are two reasons transformers are better than RNNs?

RNNs are hard to train because of the vanishing gradient problem and because all of the context needs to be compressed into a single vector. Transformers avoid both issues because attention gives every token direct access to every other token.

NLP — 200

What is prompt engineering?

Prompt engineering refers to the process of carefully designing the prompts or instructions given to large language models in order to elicit desired outputs or behaviors.

NLP — 300

What are the training objectives for BERT and GPT?

BERT is trained with masked language modeling and next sentence prediction. GPT is trained with next word prediction.

NLP — 400

What are the dimensions of the query, key, value projection matrices for self-attention? Assume that the input is n x d, where n is the number of tokens and d is the dimension.

The matrices are all d x d, assuming the projections keep the model dimension d.
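
A shape-only sketch of single-head self-attention, assuming d x d projections. The helper `matmul_shape` and the sizes n = 10, d = 64 are arbitrary choices for illustration.

```python
def matmul_shape(a, b):
    """Shape of A @ B for 2D shapes a = (rows, inner) and b = (inner, cols)."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions do not match: {a} @ {b}")
    return (a[0], b[1])

n, d = 10, 64                 # example sizes, chosen arbitrarily
X = (n, d)                    # input tokens
W_q = W_k = W_v = (d, d)      # projection matrices

Q = matmul_shape(X, W_q)      # (n, d)
K = matmul_shape(X, W_k)      # (n, d)
V = matmul_shape(X, W_v)      # (n, d)
scores = matmul_shape(Q, (K[1], K[0]))   # Q @ K^T -> (n, n)
out = matmul_shape(scores, V)            # (n, d)
```

The projection matrices do not depend on n, which is why the same layer handles sequences of any length.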

NLP — 500

State and explain what the three different types of attention in a transformer are.

The encoder has self-attention layers where each input token can attend to all other tokens. The decoder has cross-attention layers where each token in the decoder attends to all tokens in the output of the encoder (the queries come from the decoder and the keys and values come from the encoder). The decoder also has masked self-attention layers where attention is computed only over past tokens.
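
The masking in the decoder's self-attention can be sketched as a lower-triangular boolean matrix. This assumes the usual convention that a position may attend to itself and to earlier positions; the helper name `causal_mask` is illustrative.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Print the allowed (x) and blocked (.) positions for a 4-token sequence.
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
```

Encoder self-attention and cross-attention correspond to using no such mask: every entry is allowed.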

Modern Vision — 100

How do vision transformers process images prior to the transformer?

They split the image into patches, linearly project each flattened patch into a token embedding, and add positional embeddings.
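
A small sketch of the resulting sequence length, assuming non-overlapping square patches. The 224x224 image and 16x16 patch size are a common configuration used here only as an example.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patches (tokens) for an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch_size) * (width // patch_size)

print(num_patches(224, 224, 16))  # 196
```

Each of these patches becomes one token, so the transformer sees a sequence rather than a pixel grid.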

Modern Vision — 200

What is the objective of contrastive loss functions?

Contrastive loss functions encourage positive instances to be close and negative instances to be far away.
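
One common contrastive objective is InfoNCE, sketched below for a single anchor. This is an assumption about which loss to illustrate, not necessarily the one used in the course; the similarity values and the temperature of 0.1 are made up for the example.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive's similarity."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

good = info_nce(sim_pos=0.9, sim_negs=[0.1, -0.2])  # positive close, negatives far
bad = info_nce(sim_pos=0.1, sim_negs=[0.9, 0.8])    # positive far, negatives close
```

The loss is small when the positive is more similar to the anchor than every negative, and large otherwise.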

Modern Vision — 300

Name three self-supervised vision algorithms.

SimCLR, MoCo, DINO, Masked Autoencoders (MAE).

Modern Vision — 400

What is CLIP and how is it trained?

CLIP (Contrastive Language-Image Pre-training) consists of twin text and image towers. They are trained jointly with a contrastive learning objective to embed matching text-image pairs close together and mismatched pairs far apart.

Modern Vision — 500

True or False, and explain why: A benefit of self-supervised learning over supervised learning algorithms is that they do not require making assumptions about the data.

False. Different self-supervised learning algorithms make different assumptions. SimCLR, for instance, requires selecting augmentations that should lead to similar representations.

Generative Models — 100

In the context of VAEs, what does ELBO stand for, and what does it represent? What are the two terms (informally)?

ELBO stands for Evidence Lower Bound. It is a lower bound on the log-likelihood of the data and is used as the objective function for training VAEs. It can be decomposed to a reconstruction term and a prior matching term.
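
Written out, with $q_\phi(z \mid x)$ as the encoder, $p_\theta(x \mid z)$ as the decoder, and $p(z)$ as the prior (standard VAE notation, assumed here):

```latex
\log p_\theta(x) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction term}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)}_{\text{prior matching term}}
```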

Generative Models — 200

What is the key difference between the generator's objective in a GAN compared to a VAE?

In a GAN, the generator is trained to fool the discriminator, while in a VAE, the generator (decoder) is trained to reconstruct the input.

Generative Models — 300

Name two ways diffusion models differ from traditional hierarchical VAEs.

(1) Diffusion models use a fixed, predefined noising process as the encoder, while hierarchical VAEs learn the encoder for the latent variables at each level. (2) Diffusion models fix the latents to have the same dimensionality as the data.

Generative Models — 400

Provide one pro and one con for each of the following generative models: diffusion models, GANs, and VAEs.

Diffusion models: Pro: generate high-quality samples. Con: slow sampling.
GANs: Pros: generate realistic samples, fast sampling. Cons: unstable training, mode collapse.
VAEs: Pro: provide a probabilistic latent space for encoding and generating data. Con: generated samples can be blurry.

Generative Models — 500

In diffusion models, what does the score function represent, and how is it used in the generation process?

The score function in diffusion models represents the gradient of the log-density of the data distribution with respect to the input. It is used to guide the denoising process during generation by providing the direction of steepest ascent towards more probable data points at each step.
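
A toy sketch of how a score steers samples, using a 1D Gaussian whose score is known in closed form. Real diffusion models instead learn a score network that depends on the noise level; the Langevin-style update and the helper names below are simplifying assumptions for illustration.

```python
import math

def gaussian_score(x, mu=0.0, sigma=1.0):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_step(x, score_fn, step_size, noise):
    """One Langevin update; `noise` is a standard normal draw supplied by the caller."""
    return x + step_size * score_fn(x) + math.sqrt(2 * step_size) * noise

# With the noise switched off, the step is plain gradient ascent on log p(x),
# so a point at x = 3 moves toward the mode at 0.
x_next = langevin_step(3.0, gaussian_score, step_size=0.1, noise=0.0)
```

Passing `random.gauss(0, 1)` as `noise` and repeating the step many times would turn this into a sampler rather than a mode finder.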

Misc. — 100

What is the over-smoothing problem in Graph Neural Networks (GNNs)?

Over-smoothing refers to the phenomenon where node features become indistinguishable as the number of GNN layers increases, leading to a loss of discriminative power and poor performance on downstream tasks.

Misc. — 200

What is the purpose of stochastic depth in deep neural networks?

Stochastic depth is a regularization technique that randomly drops out entire layers during training, which helps to reduce overfitting and improve the network's ability to generalize.
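
A minimal sketch on scalars, assuming the common formulation in which a residual branch is skipped with some probability during training and scaled by its survival probability at inference. The function name and the `double` stand-in layer are hypothetical.

```python
import random

def stochastic_depth_block(x, f, survival_prob, training, rng=random):
    """Residual block y = x + f(x) whose branch f is randomly skipped in training."""
    if training:
        keep = rng.random() < survival_prob
        return x + f(x) if keep else x
    # At inference, keep the branch but scale it by its survival probability.
    return x + survival_prob * f(x)

double = lambda x: 2 * x  # stand-in for a layer's residual branch
y_eval = stochastic_depth_block(1.0, double, survival_prob=0.5, training=False)
```

Because the skip connection always passes `x` through, dropping the branch leaves a valid, shallower network.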

Misc. — 300

What do scaling laws in deep learning refer to, and why are they important?

Scaling laws refer to the predictable relationships between model performance, size, and training data. They suggest that performance improves following a power law as model size and data increase. Scaling laws are important because they guide the efficient scaling of models and help estimate the resources needed for future advances.

Misc. — 400

What are two types of normalization layers used in deep neural networks, and what is their purpose?

Two types of normalization layers are Batch Normalization and Layer Normalization. They are used to normalize the activations within a network, which stabilizes training and allows for faster convergence and better generalization.
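
The two layers differ mainly in which axis they normalize over. The sketch below works on a tiny batch of plain Python lists and leaves out the learnable scale and shift parameters and batch norm's running statistics; all names and values are illustrative.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the examples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each example (row) across its own features."""
    return [normalize(row) for row in batch]

batch = [[1.0, 10.0, 100.0],
         [3.0, 30.0, 300.0]]   # 2 examples, 3 features
bn = batch_norm(batch)
ln = layer_norm(batch)
```

Layer norm gives the same result for an example regardless of what else is in the batch, which is one reason it is the usual choice in transformers.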

Misc. — 500

Consider an image captioning model with cross-attention between the image features and the text tokens.
True or False: Assuming the image resolution is fixed, the computational complexity of cross-attention in this model is quadratic with respect to the length of the text sequence.

False. The computational complexity of this cross-attention operation is O(nk), where n is the number of image features and k is the number of text tokens. With the image resolution fixed, n is constant, so the cost is linear in k.
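
A quick way to see the linear growth is to count entries in the score matrix, which has one row per text-token query and one column per image-feature key. The value n = 196 is an arbitrary example, and the helper name is illustrative.

```python
def cross_attention_score_count(num_text_tokens, num_image_features):
    """Entries in the cross-attention score matrix (text queries x image keys)."""
    return num_text_tokens * num_image_features

n = 196  # image features, fixed by the image resolution (example value)
for k in (10, 20, 40):
    print(k, cross_attention_score_count(k, n))
```

Doubling k doubles the count. The text decoder's own self-attention, by contrast, has a k x k score matrix and is quadratic in k.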