Question: What is the dimension of the vector space of linear maps from $\R^m$ to $\R^n$?
$mn$
Consider the following Torch code.
x = torch.ones(4,5,1,3)
y = torch.ones(p,5,3)
z = x + y
For what values of $p$, if any, will this code not raise an exception? What would be the shape of $z$?
$p \in \{1,5\}$, and the shape of $z$ is $(4,5,5,3)$.
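For example, a quick check of the broadcast with $p = 5$ (a sketch; $p = 1$ would also work):
import torch
x = torch.ones(4, 5, 1, 3)
y = torch.ones(5, 5, 3)   # p = 5
z = x + y                 # shapes align from the right: (4,5,1,3) + (1,5,5,3)
print(z.shape)            # torch.Size([4, 5, 5, 3])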
Consider the following Torch code, given $X \in \R^{B \times m \times d}$ and $Y \in \R^{B \times n \times d}$:
Z = X @ Y.transpose(-2, -1)
What is the shape of $Z$? How many floating point operations will this run (ignoring non-dominant factors)?
$Z \in \R^{B \times m \times n}$, and there will be $B m n d$ multiplies and $B m n d$ adds.
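A sketch to verify the shape and the dominant FLOP count, using hypothetical sizes:
import torch
B, m, n, d = 8, 32, 16, 64
X, Y = torch.randn(B, m, d), torch.randn(B, n, d)
Z = X @ Y.transpose(-2, -1)   # batched matmul over the last two dimensions
print(Z.shape)                # torch.Size([8, 32, 16])
print(2 * B * m * n * d)      # dominant FLOPs: B*m*n*d multiplies plus B*m*n*d adds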
What's the difference between eager execution and graph execution? What are some advantages of one vs the other?
Eager execution runs each operation immediately as the compute graph is being constructed, which makes debugging and Python control flow easy. Graph execution constructs the graph beforehand in a separate step, which enables whole-graph optimizations (such as operator fusion) and simplifies deployment.
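A minimal sketch of both modes, assuming PyTorch 2.x where torch.compile captures and optimizes a graph:
import torch

def f(x):
    return torch.relu(x @ x.T).sum()

x = torch.randn(8, 8)
y_eager = f(x)              # eager: each op runs immediately as Python executes it
f_graph = torch.compile(f)  # graph: capture and optimize the whole function up front
y_graph = f_graph(x)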
What are the core components of a modern deep learning framework?
(1) A numerical linear algebra library. (2) Hardware support for running on devices, e.g. GPUs. (3) A backpropagation engine. (4) A library for expressing deep neural networks in an object-oriented way. (5) Embedding in a high level language like Python.
In the expression $\text{runtime} \approx \mathcal{O}(\kappa nd \log(1/\epsilon))$ what do these variables mean?
This gives an approximate runtime for gradient descent to reach loss gap $\epsilon$. $\kappa$ is the condition number; $n$ is the size of the training set; $d$ is the dimension of the examples.
What's the difference between forward mode and reverse mode automatic differentiation?
Forward mode tracks the derivative of an intermediate with respect to an input; reverse mode tracks the derivative of an output with respect to an intermediate. Forward is best for functions like $\R \rightarrow \R^d$; backward is best for functions like $\R^d \rightarrow \R$.
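In PyTorch these correspond roughly to Jacobian-vector products (forward mode) and vector-Jacobian products (reverse mode); a small sketch:
import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    return x ** 2   # f: R^3 -> R^3, elementwise square

x = torch.ones(3)
_, Jv = jvp(f, (x,), (torch.tensor([1.0, 0.0, 0.0]),))   # forward mode: J @ v
_, vJ = vjp(f, (x,), torch.ones(3))                       # reverse mode: v^T @ J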
What is the first step (initial condition) of backpropagation?
Set the derivative of the loss with respect to itself to $1$.
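In PyTorch this seed is the gradient argument of backward, which defaults to $1$ for a scalar loss; a sketch:
import torch
w = torch.ones(3, requires_grad=True)
loss = (w ** 2).sum()
loss.backward(gradient=torch.tensor(1.0))   # seed d(loss)/d(loss) = 1, then work backwards
# equivalent to the usual loss.backward()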
Question: What is minibatching?
Use an average of $B$ example gradients in each step of stochastic gradient descent, rather than just $1$.
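A sketch of one minibatch SGD step on a hypothetical linear least-squares objective:
import torch
B, d = 32, 10                        # batch size and feature dimension
w = torch.zeros(d, requires_grad=True)
X, y = torch.randn(B, d), torch.randn(B)
loss = ((X @ w - y) ** 2).mean()     # average loss over the B examples
loss.backward()                      # w.grad is the average of the B example gradients
with torch.no_grad():
    w -= 0.1 * w.grad                # one SGD step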
Question: What is the relationship between the number of FLOPs in the forward pass and the backward pass of a typical deep neural network training task?
The backward pass is roughly $2 \times$ the cost of the forward pass. This happens when we're bound by matrix multiplies, because if the forward pass involves a matrix multiply $X Y$ then the backward pass will usually need two matrix multiplies, $G Y^T$ and $X^T G$.
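For example, a linear layer computing $X W$ with $X \in \R^{B \times d}$ and $W \in \R^{d \times k}$ costs about $2Bdk$ FLOPs in the forward pass, while the backward pass computes both $G W^T$ (gradient with respect to $X$) and $X^T G$ (gradient with respect to $W$), for about $4Bdk$ FLOPs.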
True or False, and Explain Why: Using momentum accelerates gradient descent by replacing the factor of $\kappa$ in the convergence rate with $\kappa^2$.
False. It replaces that with $\sqrt{\kappa}$.
True or False, and Explain Why: The loss function $\ell_1(w) = 2 \| w \|^2$ has a lower condition number than the loss function $\ell_2(w) = 5 \| w \|^2$.
False. They both have condition number $\kappa = 1$: their directional second derivative is always the same.
True or False, and Explain Why: Adding $\ell_2$ regularization (or increasing the $\ell_2$ regularization constant $\lambda$) can help to decrease the condition number $\kappa$ of a strongly convex loss function.
True. The condition number is $\kappa = \frac{L}{\mu}$, and adding $\ell_2$ regularization replaces this with $\frac{L + \lambda}{\mu + \lambda}$, which never increases $\kappa$ and strictly decreases it whenever $L > \mu$.
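To see why: since $\mu \le L$, we have $\mu (L + \lambda) = \mu L + \mu \lambda \le \mu L + L \lambda = L (\mu + \lambda)$, and so $\frac{L + \lambda}{\mu + \lambda} \le \frac{L}{\mu}$.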
True or False, and Explain Why: Sparsity, autoencoders, and dimension reduction all help to reduce the effect of a large number of examples $n$ on the overall cost of learning.
False. These things all target the dimension $d$, reducing the cost of operating in large dimensions.
Consider the equation \[ \operatorname{Adam} = X + Y \] where $X$ and $Y$ are other optimization methods we've seen. What should $X$ and $Y$ be here? Explain.
$X = $ RMSProp, $Y = $ momentum. Adam combines the first-moment running average of the gradient from momentum with the second-moment running average of squared gradients that RMSProp uses for adaptive learning rates.
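As a sketch (ignoring bias correction), the combined update with gradient $g_t$ is \[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \qquad w_{t+1} = w_t - \eta \, \frac{m_t}{\sqrt{v_t} + \epsilon}. \]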
What is a ReLU nonlinearity? How does it operate on a vector?
$\operatorname{ReLU}(x) = \max(0, x)$. The ReLU operates elementwise on a vector.
What is a softmax nonlinearity? What is it used for?
$\operatorname{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^d \exp(x_j)}$. This is usually used to get a probability distribution from a logit vector. It also appears in attention!
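For example:
import torch
logits = torch.tensor([1.0, 2.0, 3.0])
probs = torch.softmax(logits, dim=0)   # entries are positive and sum to 1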
What is the meaning of the term bias in the context of a neural network linear layer?
It means the layer computes an affine transformation $f(X) = X W^T + b$ with additional bias weights $b$, rather than just computing a linear transformation $f(X) = X W^T$.
What is a residual connection in a neural network? Where is it used in transformers?
A residual connection just means we add the input to the output of a layer. It makes learning easier by avoiding "vanishing gradients." It's used across each Attention and each MLP block in a standard Transformer.
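A minimal sketch of wrapping an arbitrary sublayer with a residual connection:
import torch

class Residual(torch.nn.Module):
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer
    def forward(self, x):
        return x + self.sublayer(x)   # add the input back onto the sublayer's output

block = Residual(torch.nn.Linear(16, 16))
y = block(torch.randn(4, 16))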
What is a recurrent neural network (RNN)? How does it process sequences?
An RNN processes sequences like a state machine, taking as input a current state vector and an input embedding, and outputting a next state vector. This leads to very deep compute graphs, which makes training expensive!
In the formula \[ \operatorname{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} + M \right) V \] what does the variable $M$ represent? What is it used for?
That's the mask. It's used (among other things) to encode positional information in attention by allowing a token to only depend on its past, not the future.
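For example, a causal mask for a length-$T$ sequence puts $-\infty$ above the diagonal so token $t$ can only attend to tokens $\le t$:
import torch
T = 5
M = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)   # 0 on/below the diagonal, -inf above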
What is multi-head attention?
It's $h$ independent attention operations running in parallel on separate projections of the input; their outputs are concatenated and projected back to the model dimension.
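A sketch using PyTorch's built-in module, with hypothetical sizes:
import torch
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)        # (batch, seq, embed)
out, attn_weights = mha(x, x, x)  # 8 heads run in parallel; outputs are concatenated and projected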
What are the three main approaches to encoding position in transformers?
Absolute positional encoding. Relative positional encoding. Causal attention masks.
What is the function of the KV cache in a transformer?
Since models are causal and MLP layers are per-token, later token predictions depend on earlier ones only through the attention mechanism. The attention mechanism only depends on the $K$ and $V$ tensors. To make future predictions during inference, we only need to save the keys and values from the past! We can get rid of all the other intermediates.
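A minimal single-head sketch of a KV cache (hypothetical sizes, no batching):
import torch
d_k = 64
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))
K_cache = torch.empty(0, d_k)   # keys of all past tokens
V_cache = torch.empty(0, d_k)   # values of all past tokens

def decode_step(x_t):           # x_t: embedding of the newest token, shape (d_k,)
    global K_cache, V_cache
    K_cache = torch.cat([K_cache, (x_t @ W_k).unsqueeze(0)])   # append this token's key
    V_cache = torch.cat([V_cache, (x_t @ W_v).unsqueeze(0)])   # append this token's value
    attn = torch.softmax((x_t @ W_q) @ K_cache.T / d_k ** 0.5, dim=-1)
    return attn @ V_cache       # attend over everything cached so far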
What is the purpose of grouped-query attention (GQA)? How does it reduce the cost of transformer inference? Hint:
LlamaModel(
  (layers): ModuleList(
    (0-31): 32 x LlamaDecoderLayer(
      (self_attn): LlamaAttention(
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
GQA restricts many heads of multi-head attention to share the same $K$ and $V$ values. It reduces the cost of the KV cache by reusing each $K$ and $V$ vector across multiple queries. (In the hint, k_proj and v_proj project to 1024 dimensions while q_proj projects to 4096, so there are $4 \times$ fewer KV heads than query heads.)
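A sketch matching the hint's shapes, assuming a head dimension of 128 (so 32 query heads but only 8 KV heads):
import torch
B, T, n_q, n_kv, d_h = 2, 16, 32, 8, 128
q = torch.randn(B, n_q, T, d_h)
k = torch.randn(B, n_kv, T, d_h)
v = torch.randn(B, n_kv, T, d_h)
k = k.repeat_interleave(n_q // n_kv, dim=1)   # each KV head is shared by 4 query heads
v = v.repeat_interleave(n_q // n_kv, dim=1)
out = torch.softmax(q @ k.transpose(-2, -1) / d_h ** 0.5, dim=-1) @ v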
True or False, and Explain Why: Random search helps solve the curse of dimensionality of grid search.
True. Grid search requires at minimum $2^{\text{\# hyperparams}}$ runs (at least two values per hyperparameter); random search can use any number of runs, regardless of how many hyperparameters there are.
What is the usual functional form of a neural network scaling law?
It's a power law, like $f(x) = C x^{-\gamma}$.
What is the point of using a scaling law instead of traditional Hyperparameter Optimization?
HPO is expensive. Sometimes you can afford only one big run!
What does the term epoch mean in the context of machine learning training?
One pass through the training dataset, using each example once.
What is the significance of the term Chinchilla in deep learning?
It's a popular scaling law for neural language models! For loss, it predicts \[ L(N,D) = \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}} + 1.69. \]
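A sketch evaluating the fit at a hypothetical parameter count $N$ and token count $D$:
def chinchilla_loss(N, D):
    return 406.4 / N ** 0.34 + 410.7 / D ** 0.28 + 1.69

print(chinchilla_loss(N=70e9, D=1.4e12))   # roughly Chinchilla's own scale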