Name three regularization techniques.

Any three of the following: dropout, weight decay, L2/L1 regularization, early stopping, data augmentation, and batch normalization.

*True or False, and Explain Why:* Different training runs of a ResNet model using SGD converge to the same solution.

False. Due to random initialization, the stochastic nature of SGD (for example, mini-batch sampling), and the non-convexity of the loss landscape, different training runs can converge to different local minima.

What is an adaptive optimizer? Name three kinds of adaptive optimizers.

Adaptive optimizers adapt the learning rate for each individual parameter based on its historical gradients. Adagrad, RMSProp, and Adam are adaptive optimizers.
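
As a rough illustration of the per-parameter adaptation described above, here is a minimal sketch of one Adam update for a single scalar parameter. The function name `adam_step` and its default hyperparameters are assumptions chosen for illustration, not taken from any particular library.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # running average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    # The step is scaled by this parameter's own gradient history.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# A large and a small gradient produce nearly the same step size on the first update.
p_large, _, _ = adam_step(1.0, 10.0, 0.0, 0.0, t=1)
p_small, _, _ = adam_step(1.0, 0.1, 0.0, 0.0, t=1)
print(1.0 - p_large, 1.0 - p_small)
```

Dividing by the root of the squared-gradient average is what makes the effective learning rate differ from parameter to parameter.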

*True or False, and Explain Why:* Gradient descent is preferable to stochastic optimizers for deep learning because it provides more stable estimates of the gradient.

False. Full-batch gradient descent is computationally expensive for large datasets, while stochastic optimizers like SGD provide noisy gradient estimates that can help escape suboptimal solutions. The noise actually has a regularizing effect.

A student is training a ResNet with a learning rate of 1e-4. They add batch norm layers and re-train the model with the same hyperparameters. Will they see a large performance improvement?

No. Adding batch norm layers without adjusting the learning rate is unlikely to result in a large performance improvement. Batch normalization allows for higher learning rates, so the student should increase the learning rate to observe performance improvements.

True or False: Deep convolutional networks are always preferable to shallow networks.

False. Deep convolutional networks can be harder to optimize and may be more susceptible to overfitting.

What are residual connections and how are they helpful for CNNs?

Residual connections, or skip connections, add a block's input directly to its output, giving data a shortcut that bypasses some layers. This helps mitigate vanishing gradients, because gradients can flow back through the shortcut unchanged.
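
The effect on gradients can be written out in one line. In this sketch, $F$ denotes the layers being bypassed and $\mathcal{L}$ the loss; both symbols are notation introduced here for illustration.

```latex
y = x + F(x)
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
```

The identity term $I$ passes the upstream gradient through even when $\partial F / \partial x$ is very small.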

An input of shape (h, w, c) is convolved with two 3x3 filters with a padding of 1 around the entire image. What is the dimension of the output?

It will be of size (h, w, 2), assuming a stride of 1. The padding ensures that the spatial size is preserved, and each filter produces one output channel.
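
The output shape follows from the standard convolution size formula. The helper below is a small sketch; the function name and the stride default of 1 are assumptions, since the question does not state a stride.

```python
def conv2d_output_shape(h, w, num_filters, kernel=3, padding=1, stride=1):
    """Output (height, width, channels) of a 2D convolution over an (h, w, c) input."""
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    # The number of input channels c does not affect the output shape:
    # each filter spans all input channels and yields one output channel.
    return (out_h, out_w, num_filters)

print(conv2d_output_shape(32, 32, num_filters=2))             # padding 1 keeps 32 x 32
print(conv2d_output_shape(32, 32, num_filters=2, padding=0))  # no padding shrinks it
```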

True or False: 1x1 convolutions are not very useful because they can't incorporate spatial information.

False. One of the main uses of 1x1 convolutions is to reduce the number of channels in feature maps, acting as a cross-channel linear transformation, prior to aggregating spatial information.

A 5x5 kernel is applied to a feature map with 4 channels and produces a feature map with 8 channels. How many parameters does it have? Ignore bias terms.

It has (5x5)x4x8 = 800 parameters.
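
The same count as a short sketch, with a hypothetical helper name. The optional bias term is included only to show what the question asks to ignore.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, out_channels, bias=False):
    """Number of learnable parameters in a standard 2D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    # One bias per output channel, if biases are used.
    return weights + (out_channels if bias else 0)

print(conv2d_param_count(5, 5, 4, 8))             # weights only
print(conv2d_param_count(5, 5, 4, 8, bias=True))  # weights plus 8 biases
```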

What are two reasons why transformers are better than RNNs?

RNNs are hard to train because of the vanishing gradient problem and because all of the context needs to be compressed into a single vector. Transformers avoid both issues: attention lets every token access every other token directly.

What is prompt engineering?

Prompt engineering refers to the process of carefully designing the prompts or instructions given to large language models in order to elicit desired outputs or behaviors.

What is the training objective for BERT and GPT?

BERT is trained with masked language modeling and next sentence prediction. GPT is trained with next word prediction.

What are the dimensions of the query, key, value projection matrices for self-attention? Assume that the input is n x d, where n is the number of tokens and d is the dimension.

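The matrices are all d x d, assuming a single head whose projection dimension equals d. More generally, the query and key matrices are d x d_k and the value matrix is d x d_v.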

State and explain the three different types of attention in a transformer.

The encoder has self-attention layers where each input token can attend to all other tokens. The decoder has cross-attention layers where each token in the decoder attends to all tokens in the output of the encoder (the queries come from the decoder and the keys and values come from the encoder). The decoder also has masked self-attention layers where attention is computed only over past tokens.
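
All three variants can be expressed with one scaled dot-product routine; only the source of the queries, keys, and values and the presence of a causal mask differ. The sketch below uses plain Python lists and omits the learned query, key, and value projections for brevity. The function names are assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0 for masked entries
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention. Returns (outputs, attention weights)."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if causal:
            # Masked self-attention: position i may not look at positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder states
dec = [[0.5, 0.5], [1.0, 0.0]]               # toy decoder states

_, w_self = attention(enc, enc, enc)                  # encoder self-attention
_, w_masked = attention(dec, dec, dec, causal=True)   # decoder masked self-attention
_, w_cross = attention(dec, enc, enc)                 # cross-attention: queries from decoder
```

Here `w_cross` has one row per decoder token and one column per encoder token, while the first row of `w_masked` places all of its weight on the first position.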

How do vision transformers process images prior to the transformer?

They split each image into fixed-size patches, flatten and linearly project each patch into a token embedding, and add positional embeddings to those tokens.
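
The patch-splitting step can be sketched as follows for a single-channel image stored as a list of rows. The linear projection and positional embeddings are left out, and the `patchify` name is an assumption for illustration.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    if h % patch != 0 or w % patch != 0:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches  # one flattened vector per patch, in row-major patch order

image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4 x 4 toy image, values 0..15
tokens = patchify(image, patch=2)
print(len(tokens), tokens[0])
```

Each flattened patch then plays the role that a word token plays in a text transformer.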

What is the objective of contrastive loss functions?

Contrastive loss functions encourage positive instances to be close and negative instances to be far away.
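
One common form of this idea is an InfoNCE-style loss, sketched below for a single anchor. This is one possible contrastive objective rather than the only one; the function names and the temperature value are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among positive + negatives."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    return log_denominator - pos

# Low loss when the positive is close to the anchor and the negative is far away.
good = info_nce([1.0, 0.0], positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
# High loss when the roles are reversed.
bad = info_nce([1.0, 0.0], positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(good, bad)
```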

Name three self-supervised vision algorithms.

SimCLR, MoCo, DINO, Masked Autoencoders (MAE).

What is CLIP and how is it trained?

CLIP (Contrastive Language-Image Pre-training) consists of twin text and image towers. They are trained jointly with a contrastive learning objective that embeds matching text-image pairs close together and mismatched pairs far apart.

True or False, and explain why: A benefit of self-supervised learning algorithms over supervised ones is that they do not require making assumptions about the data.

False. Different self-supervised learning algorithms make different assumptions. SimCLR, for instance, requires selecting augmentations that should lead to similar representations.

In the context of VAEs, what does ELBO stand for, and what does it represent? What are the two terms (informally)?

ELBO stands for Evidence Lower Bound. It is a lower bound on the log-likelihood of the data and is used as the objective function for training VAEs. It can be decomposed into a reconstruction term and a prior matching term.
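
Written out, the bound and its two terms take the following standard form. The notation is an assumption introduced here: $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ the decoder, and $p(z)$ the prior.

```latex
\log p_\theta(x) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction term}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{prior matching term}}
```

The first term rewards decoding latents back into the input; the second keeps the encoder's distribution close to the prior.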

What is the key difference between the generator's objective in a GAN compared to a VAE?

In a GAN, the generator is trained to fool the discriminator, while in a VAE, the generator (decoder) is trained to reconstruct the input.

Name two ways diffusion models differ from traditional hierarchical VAEs.

(1) Diffusion models use a fixed, predefined noising process as the encoder, while hierarchical VAEs learn the encoder for the latent variables at each level. (2) Diffusion models fix the latents to have the same dimensionality as the data.

Provide one pro and one con for each of the following generative models: diffusion models, GANs, and VAEs.

Diffusion Models:
- Pro: Generate high-quality samples
- Con: Slow sampling

GANs:
- Pro: Generates realistic samples, Fast sampling
- Con: Unstable training, Mode collapse

VAEs:
- Pro: Provide a probabilistic latent space for encoding and generating data
- Con: Generated samples can be blurry

In diffusion models, what does the score function represent, and how is it used in the generation process?

The score function in diffusion models represents the gradient of the log-density of the (noise-perturbed) data distribution with respect to the input. It is used to guide the denoising process during generation by providing the direction of steepest ascent towards more probable data points at each step.
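
To make the role of the score concrete, the sketch below runs Langevin dynamics on a 1D Gaussian whose score is known in closed form. This is a stand-in under stated assumptions: a real diffusion model uses a learned score network conditioned on the noise level, and the function names, step size, and target distribution here are illustrative.

```python
import math
import random

def gaussian_score(x, mu=2.0, sigma=1.0):
    """Gradient of log N(x; mu, sigma^2) with respect to x."""
    return -(x - mu) / sigma ** 2

def langevin_sample(score_fn, x0, step=0.1, n_steps=200, rng=None):
    """Follow the score uphill while injecting noise at every step."""
    rng = rng or random.Random(0)
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * score_fn(x) + math.sqrt(step) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(gaussian_score, x0=-5.0, rng=rng) for _ in range(500)]
print(sum(samples) / len(samples))  # should land near mu = 2.0
```

The score points from low-density regions toward the mode, so samples that start far away at `x0 = -5.0` drift toward the target distribution.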

What is the over-smoothing problem in Graph Neural Networks (GNNs)?

Over-smoothing refers to the phenomenon where node features become indistinguishable as the number of GNN layers increases, leading to a loss of discriminative power and poor performance on downstream tasks.

What is the purpose of stochastic depth in deep neural networks?

Stochastic depth is a regularization technique that randomly drops out entire layers during training, which helps to reduce overfitting and improve the network's ability to generalize.
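
A minimal sketch of the idea, assuming the dropped layers are residual blocks so that skipping one reduces to the identity. The function name, the `survival_prob` parameter, and the scalar toy blocks are assumptions for illustration.

```python
import random

def stochastic_depth_forward(x, blocks, survival_prob=0.8, training=True, rng=random):
    """Apply residual blocks, randomly skipping each one during training."""
    for block in blocks:
        if training:
            if rng.random() < survival_prob:
                x = x + block(x)   # block kept: usual residual update
            # otherwise the block is skipped and x passes through unchanged
        else:
            # At test time every block runs, scaled by its survival probability.
            x = x + survival_prob * block(x)
    return x

blocks = [lambda x: 1.0 for _ in range(4)]  # toy blocks that each add 1
print(stochastic_depth_forward(0.0, blocks, training=True, rng=random.Random(0)))
print(stochastic_depth_forward(0.0, blocks, training=False))
```

Scaling by `survival_prob` at test time keeps the expected contribution of each block the same as during training.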

What do scaling laws in deep learning refer to, and why are they important?

Scaling laws refer to the predictable relationships between model performance and model size, training data, and compute. They suggest that performance improves following a power law as model size and data increase. Scaling laws are important because they guide the efficient scaling of models and help estimate the resources needed for future advances.
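
As an illustration of the power-law form, the loss as a function of parameter count $N$ is often modeled as below. The symbols $N_c$ and $\alpha_N$ are fitted constants introduced here as assumptions; their values depend on the model family and dataset.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
\qquad\Longleftrightarrow\qquad
\log L(N) \approx \alpha_N \left(\log N_c - \log N\right)
```

On a log-log plot this is a straight line with slope $-\alpha_N$, which is what makes extrapolation to larger models possible.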

What are two types of normalization layers used in deep neural networks, and what is their purpose?

Two types of normalization layers are Batch Normalization and Layer Normalization. They are used to normalize the activations within a network, which stabilizes training and allows for faster convergence and better generalization.
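
The difference between the two comes down to which axis the statistics are computed over. The sketch below works on a batch stored as a list of examples, each a list of features. It omits the learnable scale and shift and the running statistics used at inference, and the function names are assumptions.

```python
import math

def _normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature across the examples in the batch."""
    columns = [_normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each example across its own features."""
    return [_normalize(row) for row in batch]

batch = [[1.0, 10.0], [3.0, 30.0]]
print(batch_norm(batch))  # statistics per column (feature)
print(layer_norm(batch))  # statistics per row (example)
```

Because layer normalization never looks across examples, its output for one example does not depend on the rest of the batch.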

Consider an image captioning model with cross-attention between the image features and the text tokens.

True or False: Assuming the image resolution is fixed, the computational complexity of cross-attention in this model is quadratic with respect to the length of the text sequence.

False. The computational complexity of this cross-attention operation is O(nk), where n is the number of image features and k is the number of text tokens. With n fixed by the image resolution, the cost is linear in k.
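
A quick way to see the difference is to count query-key dot products, one per entry of the attention score matrix. The helper name and the value of 196 image features are hypothetical and chosen only for illustration.

```python
def attention_score_entries(num_queries, num_keys):
    """Number of query-key dot products in one attention head."""
    return num_queries * num_keys

n_image_features = 196  # hypothetical value, fixed by the image resolution
for k_text_tokens in (16, 32, 64):
    cross = attention_score_entries(k_text_tokens, n_image_features)  # grows linearly in k
    self_attn = attention_score_entries(k_text_tokens, k_text_tokens) # grows quadratically in k
    print(k_text_tokens, cross, self_attn)
```

Doubling the text length doubles the cross-attention count but quadruples the text self-attention count.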