Question: Does a simple matrix multiply in PyTorch running on a CPU use any parallelism? If so, what kind?
Yes! It's always going to use SIMD parallelism and it often will use multicore parallelism as well.
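As a quick check (assuming a local PyTorch install), you can inspect the intraop thread pool PyTorch uses to split CPU kernels across cores; the SIMD use happens inside each kernel's inner loops and isn't directly visible from Python:

```python
import torch

# Number of threads PyTorch's CPU backend uses for intraop parallelism
# (e.g. splitting one matmul across cores); typically defaults to the
# number of physical cores on the machine.
print(torch.get_num_threads())

a = torch.randn(512, 512)
b = torch.randn(512, 512)
c = a @ b  # each thread's inner loop uses vectorized (SIMD) instructions
```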
What is Amdahl's law?
Tells you how much parallel speedup you can expect: $$\text{speedup} = \frac{1}{s + \frac{1 - s}{m}},$$ where $s$ is the serial (non-parallelizable) fraction of the work and $m$ is the number of parallel workers.
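A one-function sketch of the formula, showing how the serial fraction caps the achievable speedup:

```python
def amdahl_speedup(s, m):
    """Predicted speedup with serial fraction s on m parallel workers."""
    return 1.0 / (s + (1.0 - s) / m)

# With 10% serial work, even a huge number of workers caps out near 1/s = 10:
print(amdahl_speedup(0.1, 8))      # modest speedup on 8 workers
print(amdahl_speedup(0.1, 10**9))  # approaches 10, never reaches it
```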
What is the main way (from a programmer's perspective) that multiple cores of a multicore CPU communicate with each other?
Via a shared memory abstraction.
What does an AllReduce operation do?
Each distributed worker has a vector. The AllReduce operation sums the vectors, then replicates the result across all workers.
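A toy simulation of the semantics (not a real collective; a real system would use something like torch.distributed.all_reduce over a network):

```python
import numpy as np

def all_reduce(worker_vectors):
    """Simulate AllReduce: sum every worker's vector elementwise, then
    give each worker its own copy of the summed result."""
    total = np.sum(worker_vectors, axis=0)
    return [total.copy() for _ in worker_vectors]

vecs = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
out = all_reduce(vecs)
# every worker now holds [9., 12.]
```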
What is the Fully Sharded Data Parallel (FSDP) distribution strategy?
Like data parallel, but it splits the weights for each layer among all machines, then uses an all-gather to get them whenever they're needed.
What are the two types of memory locality?
Temporal locality: if a location in memory is accessed, it is likely that that location will be accessed again in the near future. Spatial locality: if a location in memory is accessed, it is likely that other nearby locations will be accessed in the near future.
What's the difference between memory latency and memory bandwidth (throughput)?
Latency: how much time does it take to access data at a new address in memory? Throughput: how much data total can we access in a given length of time?
What is NUMA (non uniform memory access)?
Multiple CPU chips run on a single motherboard; each CPU can read/write memory attached to another CPU, but the cost of accessing that remote memory is higher than the cost of accessing local memory.
Question: What is prefetching?
Prefetching loads data into the cache before it is ever accessed. Can be done automatically by the hardware or manually by the programmer.
Question: How does the use of sparsity impact the memory subsystem?
Positive: sparsity lowers the total amount of memory in use by the program. Negative: sparsity lowers memory locality, since the nonzero entries are scattered in memory.
True or False, and Explain Why: There is exactly one bit pattern that represents each of +Inf, -Inf, and NaN in standard floating-point formats.
False. There are many NaN bit patterns.
True or False, and Explain Why: The number $3/8$ can be represented exactly as a half-precision floating-point number.
True. It's a rational number with denominator a power of 2.
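You can verify this directly with numpy's float16 (a sketch contrasting it with $1/10$, which has no finite binary expansion):

```python
import numpy as np

# 3/8 = 3 * 2**-3: a small integer times a power of two, well within
# fp16's mantissa and exponent range, so it round-trips exactly.
print(float(np.float16(3 / 8)) == 0.375)  # True

# By contrast, 0.1 cannot be represented exactly in any binary format:
print(float(np.float16(0.1)) == 0.1)      # False
```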
True or False, and Explain Why: Half precision floating-point numbers have half the machine epsilon of full-precision floating point numbers.
False. The machine epsilon represents the relative error of the format, so half-precision numbers should have a larger machine epsilon, not a smaller one!
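The numbers bear this out: fp16 has 10 mantissa bits versus fp32's 23, so its machine epsilon is much larger:

```python
import numpy as np

print(np.finfo(np.float16).eps)  # 2**-10 ~ 9.77e-04 (10 mantissa bits)
print(np.finfo(np.float32).eps)  # 2**-23 ~ 1.19e-07 (23 mantissa bits)
```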
True or False, and Explain Why: Since e2m1 fp4 has 4-bits, it can represent $2^{4} = 16$ distinct real numbers.
False. The codes for $+0$ and $-0$ are distinct, so it represents only $15$ distinct numbers.
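You can enumerate all 16 codes to see the collision. This decoder assumes the OCP MX e2m1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, no Inf/NaN codes):

```python
def e2m1_value(code):
    """Decode a 4-bit e2m1 code: sign | 2 exponent bits | 1 mantissa bit.
    Exponent bias is 1; exponent field 0 is subnormal (no implicit 1)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:                              # subnormal: 0 or 0.5
        return sign * (man / 2)
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)

values = [e2m1_value(c) for c in range(16)]
print(sorted(set(values)))  # 15 distinct values: +0 and -0 collide
```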
What is block floating point?
Group floating-point numbers into blocks and have them share an exponent. Common block sizes include 16 and 64.
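A rough sketch of the idea (a hypothetical quantizer, not any specific hardware format): pick one exponent per block from the block's largest magnitude, then round every element's mantissa to a fixed width.

```python
import numpy as np

def block_fp_quantize(x, block_size=16, mantissa_bits=7):
    """Sketch of block floating point: each block of values shares one
    exponent, taken from the block's largest magnitude; mantissas are
    rounded to mantissa_bits of precision at that shared scale."""
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        max_mag = np.max(np.abs(block))
        if max_mag == 0:
            out[start:start + block_size] = 0.0
            continue
        shared_exp = np.floor(np.log2(max_mag))     # one exponent per block
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        out[start:start + block_size] = np.round(block / scale) * scale
    return out

# A large outlier in the block forces a big shared exponent, crushing the
# small entries -- the characteristic failure mode of block formats:
y = block_fp_quantize(np.array([1.0, 0.5, 0.25, 100.0]), block_size=4)
print(y)
```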
How many threads are in an NVIDIA GPU warp?
32.
What is a thread block?
A collection of threads that run together on the same SM and can communicate with each other efficiently via shared memory.
What feature contains most of the FLOPS capability of a modern NVIDIA GPU?
The tensor cores! They accelerate matrix-matrix multiplies.
True or False, and Explain Why: GPUs and CPUs are always located on separate chips in practice, even though in theory they can be put together as a single hybrid system.
False. Modern hybrid systems exist! E.g., Apple's M-series Macs put the CPU and GPU on the same chip.
How can a machine learning practitioner go about using a TPU?
Just call .to(device) in PyTorch as you would with any other device! Machine learning frameworks support multiple accelerators.
What is transfer learning?
A model trained on one task is used to improve performance on another task.
What is a foundation model?
A "foundation model" is a large general-purpose model that can be adapted to a wide variety of downstream tasks ("foundation" in the sense of being a support for future development).
When might you want to use a state space model or linearized attention?
The main benefit of these approaches is they "solve" the quadratic runtime scaling of SDPA (scaled dot product attention). You might want to use these for long context applications.
What is in-context learning?
Learn without gradient descent by passing example-label pairs as history in the context passed to a large language model.
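A sketch of what such a prompt might look like (the review texts and labels here are made-up examples): the example-label pairs in the context do the "learning"; no weights are updated.

```python
# Hypothetical few-shot sentiment-classification prompt.
examples = [
    ("The movie was fantastic", "positive"),
    ("I wasted two hours of my life", "negative"),
]
query = "An absolute delight from start to finish"

prompt = "\n".join(f"Review: {text}\nLabel: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nLabel:"  # the model completes the label
print(prompt)
```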
Why might you want to use a diffusion model instead of an autoregressive one?
Diffusion models can generate in fewer steps than the sequence length, and they are often better suited to the image modality.
True or False, and Explain Why: Increasing the batch size is always good for inference.
False. Having a large inference batch size can increase latency!
True or False, and Explain Why: Taking advantage of neural network compression is usually annoying because you have to compress the model yourself, which takes time and implementation effort.
False. You can find lots of compressed open-source models online these days!
What does it mean to say a model is W4A16?
4 bits are used to store the weights. 16 bits are used to store the activations.
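A back-of-the-envelope calculation of why this matters, using a hypothetical 7B-parameter model:

```python
# Weight memory for a hypothetical 7B-parameter model.
n_params = 7e9

w4_bytes = n_params * 4 / 8    # W4: 4-bit weights
w16_bytes = n_params * 16 / 8  # 16-bit weights for comparison

print(w4_bytes / 1e9)   # 3.5 GB of weights at 4 bits
print(w16_bytes / 1e9)  # 14.0 GB at 16 bits
```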
What is a straight-through estimator (used in QAT)?
Just pretend the quantization is not there in the backward pass.
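In PyTorch this is the classic one-liner trick: the forward pass sees the quantized value, while `.detach()` makes the backward pass treat quantization as the identity. A minimal sketch with a made-up uniform quantizer:

```python
import torch

def quantize_ste(x, scale=0.1):
    """Round-to-nearest quantization with a straight-through estimator:
    forward uses the quantized value; backward pretends rounding isn't
    there, so gradients flow through x unchanged."""
    xq = torch.round(x / scale) * scale
    return x + (xq - x).detach()  # value == xq, gradient == identity

x = torch.tensor([0.23, 0.87], requires_grad=True)
y = quantize_ste(x).sum()
y.backward()
print(x.grad)  # all ones, as if quantization were the identity
```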
What is a Mixture of Experts (MoE) layer? How can it improve performance?
A MoE layer uses only a subset of the weights in the MLP block for each token. It reduces the FLOPs needed for inference. It can also reduce the memory bandwidth needed in the low-batch-size setting.
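A toy numpy sketch of the routing idea (made-up sizes, a softmax router, and two-layer ReLU MLP experts; real MoE layers batch this and add load-balancing losses):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# One tiny MLP "expert" per slot; a learned router scores experts per token.
experts = [(rng.standard_normal((d, d)), rng.standard_normal((d, d)))
           for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    """Top-k MoE layer for a single token vector x: only the top_k
    highest-scoring experts run, so most expert weights are untouched."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                 # chosen experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros(d)
    for g, i in zip(gates, top):
        w1, w2 = experts[i]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)     # only top_k run
    return out

y = moe_forward(rng.standard_normal(d))
```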