CS 4787/5777 Jeopardy

Linear Algebra | GD  | Acceleration | DNNs | Transformers | HPO
100            | 100 | 100          | 100  | 100          | 100
200            | 200 | 200          | 200  | 200          | 200
300            | 300 | 300          | 300  | 300          | 300
400            | 400 | 400          | 400  | 400          | 400
500            | 500 | 500          | 500  | 500          | 500

Linear Algebra — 100

Question: What is the dimension of the vector space of linear maps from $\R^m$ to $\R^n$?

$mn$

Linear Algebra — 200

Consider the following Torch code.

x = torch.ones(4,5,1,3)
y = torch.ones(p,5,3)
z = x + y

For what values of $p$, if any, will this code not raise an exception? What would be the shape of $z$?

$p \in \{1,5\}$, and the shape of $z$ is $(4,5,5,3)$.
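
A quick way to check the broadcasting rule (a sketch mirroring the shapes in the question):

import torch

x = torch.ones(4, 5, 1, 3)
for p in (1, 5):
    y = torch.ones(p, 5, 3)      # y is broadcast against the trailing dims of x
    print(p, (x + y).shape)      # torch.Size([4, 5, 5, 3]) in both cases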

Linear Algebra — 300

Consider the following Torch code, given $X \in \R^{B \times m \times d}$ and $Y \in \R^{B \times n \times d}$

Z = X @ Y.transpose(-2, -1)

What is the shape of $Z$? How many floating point operations will this run (ignoring non-dominant factors)?

$Z \in \R^{B \times m \times n}$, and there will be $B m n d$ multiplies and $B m n d$ adds.
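
A quick shape check with small illustrative values of $B$, $m$, $n$, $d$:

import torch

B, m, n, d = 2, 3, 4, 5
X = torch.randn(B, m, d)
Y = torch.randn(B, n, d)
Z = X @ Y.transpose(-2, -1)      # batched matmul over the leading B dimension
print(Z.shape)                   # torch.Size([2, 3, 4]), i.e. B x m x n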

Linear Algebra — 400

What's the difference between eager execution and graph execution? What are some advantages of one vs the other?

Eager execution runs each operation immediately as the compute graph is being constructed; graph execution builds the whole graph beforehand in a separate step and then runs it. Eager execution is easier to debug and more flexible; graph execution lets the framework optimize and schedule the whole graph (e.g., fuse operators), which is typically faster.
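
A minimal illustration of the two modes (assuming PyTorch 2.x, where torch.compile does the graph capture):

import torch

def f(x):
    return torch.relu(x @ x.T).sum()

x = torch.randn(8, 8)
y_eager = f(x)                   # eager: each op runs immediately as the line executes

f_compiled = torch.compile(f)    # graph mode: f will be traced into a graph...
y_graph = f_compiled(x)          # ...compiled and optimized on the first call, then reused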

Linear Algebra — 500

What are the core components of a modern deep learning framework?

(1) A numerical linear algebra library. (2) Hardware support for running on devices, e.g. GPUs. (3) A backpropagation engine. (4) A library for expressing deep neural networks in an object-oriented way. (5) Embedding in a high level language like Python.

Gradient Descent — 100

In the expression $\text{runtime} \approx \mathcal{O}(\kappa nd \log(1/\epsilon))$ what do these variables mean?

This gives an approximate runtime for gradient descent to reach loss gap $\epsilon$. $\kappa$ is the condition number, $n$ is the number of training examples, and $d$ is the dimension of the examples.

Gradient Descent — 200

What's the difference between forward mode and reverse mode automatic differentiation?

Forward mode tracks the derivative of an intermediate with respect to an input; reverse mode tracks the derivative of an output with respect to an intermediate. Forward mode is best for functions like $\R \rightarrow \R^d$; reverse mode is best for functions like $\R^d \rightarrow \R$.
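
A small sketch of the two modes using PyTorch's torch.func API (assuming PyTorch 2.x):

import torch

def f(x):                        # f : R^d -> R, so reverse mode is the natural fit
    return (x ** 2).sum()

x = torch.randn(5)
v = torch.randn(5)

# Forward mode: one pass gives the directional derivative of the output along v.
_, dfdv = torch.func.jvp(f, (x,), (v,))

# Reverse mode: one backward pass gives the full gradient with respect to all d inputs.
grad = torch.func.grad(f)(x)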

Gradient Descent — 300

What is the first step (initial condition) of backpropagation?

Set the derivative of the loss with respect to itself to $1$.

Gradient Descent — 400

Question: What is minibatching?

Use an average of $B$ example gradients in each step of stochastic gradient descent, rather than just $1$.
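
A minimal minibatch SGD sketch on a toy least-squares problem (all names and sizes here are illustrative, not from the slides):

import torch

torch.manual_seed(0)
X, y = torch.randn(1000, 10), torch.randn(1000)   # toy dataset
w = torch.zeros(10, requires_grad=True)
B, lr = 64, 0.1                                   # batch size and step size

for step in range(100):
    idx = torch.randint(0, X.shape[0], (B,))      # sample a minibatch of B examples
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()    # average loss over the minibatch
    loss.backward()                               # gradient = average of B example gradients
    with torch.no_grad():
        w -= lr * w.grad
    w.grad.zero_()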

Gradient Descent — 500

Question: What is the relationship between the number of FLOPs in the forward pass and the backward pass of a typical deep neural network training task?

The backward pass is roughly $2 \times$ the cost of the forward pass. This happens when we're bound by matrix multiply, because if the forward pass involves a matrix multiply $X Y$ then the backward pass will usually need two matrix multiplies, $G Y^T$ and $X^T G$.
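
The counting behind the $2 \times$: if the forward pass computes $Z = XY$, the backward pass needs the two products below, each costing the same as the forward one (illustrative shapes):

import torch

X = torch.randn(32, 256)         # forward: Z = X @ Y costs ~ 32*256*128 multiply-adds
Y = torch.randn(256, 128)
G = torch.randn(32, 128)         # upstream gradient dL/dZ

dX = G @ Y.T                     # same cost as the forward matmul
dY = X.T @ G                     # same cost again  ->  backward ~ 2x forward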

Acceleration — 100

True or False, and Explain Why: Using momentum accelerates gradient descent by replacing the factor of $\kappa$ in the convergence rate with $\kappa^2$.

False. It replaces that with $\sqrt{\kappa}$.

Acceleration — 200

True or False, and Explain Why: The loss function $\ell_1(w) = 2 \| w \|^2$ has a lower condition number than the loss function $\ell_2(w) = 5 \| w \|^2$.

False. They both have condition number $\kappa = 1$: each has the same curvature in every direction, so $L = \mu$.

Acceleration — 300

True or False, and Explain Why: Adding $\ell_2$ regularization (or increasing the $\ell_2$ regularization constant $\lambda$) can help to decrease the condition number $\kappa$ of a strongly convex loss function.

True. The condition number is $\kappa = \frac{L}{\mu}$, and adding $\ell_2$ regularization replaces this with $\frac{L + \lambda}{\mu + \lambda}$, which is smaller whenever $L > \mu$ (i.e., whenever $\kappa > 1$).
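
A quick worked example with illustrative numbers (not from the slides): with $L = 10$, $\mu = 1$, and $\lambda = 1$,
\[ \kappa = \frac{L}{\mu} = 10 \quad \longrightarrow \quad \frac{L + \lambda}{\mu + \lambda} = \frac{11}{2} = 5.5. \]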

Acceleration — 400

True or False, and Explain Why: Sparsity, autoencoders, and dimension reduction all help to reduce the effect of a large number of examples $n$ on the overall cost of learning.

False. These things all target the dimension $d$, reducing the cost of operating in large dimensions.

Acceleration — 500

Consider the equation \[ \operatorname{Adam} = X + Y \] where $X$ and $Y$ are other optimization methods we've seen. What should $X$ and $Y$ be here? Explain.

$X = $ RMSProp, $Y = $ momentum. Adam combines momentum's running average of the gradient (the first moment) with RMSProp's running average of the squared gradient (the second moment), which provides adaptive per-coordinate learning rates.
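
A minimal sketch of the resulting update (standard Adam formulas; the hyperparameter values are the usual defaults, not from the slides):

import torch

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # momentum: running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSProp: running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)    # adaptive per-coordinate step
    return w, m, v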

Deep Neural Networks — 100

What is a ReLU nonlinearity? How does it operate on a vector?

$\operatorname{ReLU}(x) = \max(0, x)$. The ReLU operates elementwise on a vector.

Deep Neural Networks — 200

What is a softmax nonlinearity? What is it used for?

$\operatorname{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^d \exp(x_j)}$. This is usually used to get a probability distribution from a logit vector. It also appears in attention!

Deep Neural Networks — 300

What is the meaning of the term bias in the context of a neural network linear layer?

It means the layer computes an affine transformation $f(X) = X W^T + b$ with additional bias weights $b$, rather than just computing a linear transformation $f(X) = X W^T$.
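
In Torch terms (illustrative sizes):

import torch.nn as nn

affine = nn.Linear(10, 4, bias=True)    # f(X) = X W^T + b
linear = nn.Linear(10, 4, bias=False)   # f(X) = X W^T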

Deep Neural Networks — 400

What is a residual connection in a neural network? Where is it used in transformers?

A residual connection just means we add the input of a layer to its output. It makes learning easier by avoiding "vanishing gradients." It's used around each attention block and each MLP block in a standard transformer.
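
A sketch of where the residuals sit in a pre-norm transformer block (structure only; the dimensions and module choices are illustrative):

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d=512, h=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x):
        a, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = x + a                           # residual connection around the attention block
        x = x + self.mlp(self.norm2(x))     # residual connection around the MLP block
        return x

y = Block()(torch.randn(2, 16, 512))        # (batch, sequence, d) in, same shape out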

Deep Neural Networks — 500

What is a recurrent neural network (RNN)? How does it process sequences?

An RNN processes sequences like a state machine, taking as input a current state vector and an input embedding and outputting a next state vector. This leads to very deep compute graphs, which makes training expensive!
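
The state-machine view in Torch (a sketch with illustrative sizes):

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=16, hidden_size=32)
x = torch.randn(10, 1, 16)       # a length-10 sequence of input embeddings (batch size 1)
h = torch.zeros(1, 32)           # initial state
for t in range(x.shape[0]):
    h = cell(x[t], h)            # next state = f(current state, current input)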

Transformers — 100

In the formula \[ \operatorname{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} + M \right) V \] what does the variable $M$ represent? What is it used for?

That's the mask. It's used (among other things) to encode positional information in attention by allowing a token to only depend on its past, not the future.
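
One standard way to build such a mask (a sketch; the sequence length is illustrative):

import torch

T = 5
M = torch.full((T, T), float("-inf")).triu(diagonal=1)   # -inf strictly above the diagonal, 0 elsewhere
scores = torch.randn(T, T) + M                            # stand-in for Q K^T / sqrt(d_k) + M
weights = torch.softmax(scores, dim=-1)                   # each row puts zero weight on future positions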

Transformers — 200

What is multi-head attention?

It's $h$ independent attention operations ("heads") running in parallel, each with its own $Q$, $K$, and $V$ projections; their outputs are concatenated and projected back down.

Transformers — 300

What are the three main approaches to encoding position in transformers?

Absolute positional encoding. Relative positional encoding. Causal attention masks.

Transformers — 400

What is the function of the KV cache in a transformer?

Since models are causal and MLP layers are per-token, later token predictions depend on earlier ones only through the attention mechanism, and the attention mechanism only depends on the $K$ and $V$ tensors. To make future predictions during inference, we only need to save the keys and values from the past! We can get rid of all the other intermediates.
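
A toy single-head decode step showing what gets cached (illustrative; real implementations batch this and keep a cache per layer and per head):

import torch

d = 64
K_cache = torch.empty(0, d)                      # keys of all past tokens
V_cache = torch.empty(0, d)                      # values of all past tokens

def decode_step(q_t, k_t, v_t):
    """q_t, k_t, v_t: (d,) projections of the newest token."""
    global K_cache, V_cache
    K_cache = torch.cat([K_cache, k_t[None]])    # append the new key; never recompute old ones
    V_cache = torch.cat([V_cache, v_t[None]])
    attn = torch.softmax(q_t @ K_cache.T / d ** 0.5, dim=-1)
    return attn @ V_cache                        # attention output for the newest token only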

Transformers — 500

What is the purpose of grouped-query attention (GQA)? How does it reduce the cost of transformer inference? Hint:

LlamaModel(
  (layers): ModuleList(
    (0-31): 32 x LlamaDecoderLayer(
      (self_attn): LlamaAttention(
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (o_proj): Linear(in_features=4096, out_features=4096, bias=False)

GQA restricts many heads of multi-head attention to share the same $K$ and $V$ values. It reduces the cost of the KV cache by reusing each $K$ and $V$ vector multiple times across multiple queries.
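
The head counts implied by the hint (assuming the conventional head dimension of 128, which the printout doesn't show):

head_dim   = 128
n_q_heads  = 4096 // head_dim    # 32 query heads
n_kv_heads = 1024 // head_dim    # 8 key/value heads, each shared by 4 query heads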

Hyperparameter Opt. — 100

True or False, and Explain Why: Random search helps solve the curse of dimensionality of grid search.

True. Grid search requires at least $2^{\#\text{hyperparameters}}$ runs (two or more values per hyperparameter), so its cost grows exponentially with the number of hyperparameters; random search can use any budget of runs, no matter how many hyperparameters there are.

Hyperparameter Opt. — 200

What is the usual functional form of a neural network scaling law?

It's a power law, like $f(x) = C x^{-\gamma}$.

Hyperparameter Opt. — 300

What is the point of using a scaling law instead of traditional hyperparameter optimization?

HPO is expensive. Sometimes you can afford only one big run!

Hyperparameter Opt. — 400

What does the term epoch mean in the context of machine learning training?

One pass through the training dataset, using each example once.

Hyperparameter Opt. — 500

What is the significance of the term Chinchilla in deep learning?

It's a popular scaling law for neural language models! For loss, it predicts \[ L(N,D) = \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}} + 1.69. \]
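
Plugging in numbers close to the original Chinchilla run ($N \approx 70$B parameters, $D \approx 1.4$T tokens) as a sanity check:

def chinchilla_loss(N, D):
    return 406.4 / N ** 0.34 + 410.7 / D ** 0.28 + 1.69

print(chinchilla_loss(70e9, 1.4e12))   # ~ 1.94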