CS 4787/5777 Final Jeopardy

| Parallelism | Memory | Quantization | Accelerators | Modern ML | Inference |
|---|---|---|---|---|---|
| 100 | 100 | 100 | 100 | 100 | 100 |
| 200 | 200 | 200 | 200 | 200 | 200 |
| 300 | 300 | 300 | 300 | 300 | 300 |
| 400 | 400 | 400 | 400 | 400 | 400 |
| 500 | 500 | 500 | 500 | 500 | 500 |

Parallelism — 100

Does a simple matrix multiply in PyTorch running on a CPU use any parallelism? If so, what kind?

Yes! It's always going to use SIMD parallelism and it often will use multicore parallelism as well.

Parallelism — 200

What is Amdahl's law?

Amdahl's law tells you how much parallel speedup you can expect when a fraction $s$ of the work is inherently serial and the rest is spread across $m$ workers: $$\text{speedup} = \frac{1}{s + \frac{1 - s}{m}}.$$
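
The formula is easy to sanity-check numerically; a minimal sketch (the function name is my own):

```python
def amdahl_speedup(s: float, m: int) -> float:
    """Amdahl's law: s is the serial fraction, m is the number of parallel workers."""
    return 1.0 / (s + (1.0 - s) / m)

# With a 10% serial fraction, no number of workers can beat 1/s = 10x.
print(amdahl_speedup(0.1, 8))      # ~4.7x on 8 workers
print(amdahl_speedup(0.1, 10**6))  # just under 10x even with a million workers
```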

Parallelism — 300

What is the main way (from a programmer's perspective) that multiple cores of a multicore CPU communicate with each other?

Via a shared memory abstraction.

Parallelism — 400

What does an AllReduce operation do?

Each distributed worker has a vector. The AllReduce operation sums the vectors, then replicates the result across all workers.
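
A toy sequential simulation of those semantics (not an efficient implementation like ring-allreduce, which communicates in a ring to balance bandwidth):

```python
def all_reduce(worker_vectors):
    """Simulate AllReduce: every worker ends up with the elementwise sum."""
    total = [sum(coords) for coords in zip(*worker_vectors)]
    return [list(total) for _ in worker_vectors]  # replicate to all workers

# Three workers, each holding a length-2 vector:
print(all_reduce([[1, 2], [3, 4], [5, 6]]))  # every worker now holds [9, 12]
```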

Parallelism — 500

What is the Fully Sharded Data Parallel (FSDP) distribution strategy?

Like data parallel, but it splits the weights for each layer among all machines, then uses an all-gather to get them whenever they're needed.

Memory — 100

What are the two types of memory locality?

Temporal locality: if a location in memory is accessed, it is likely that the same location will be accessed again in the near future. Spatial locality: if a location in memory is accessed, it is likely that other nearby locations will be accessed in the near future.

Memory — 200

What's the difference between memory latency and memory bandwidth (throughput)?

Latency: how much time does it take to access data at a new address in memory? Throughput: how much data total can we access in a given length of time?

Memory — 300

What is NUMA (non-uniform memory access)?

Multiple CPU chips running on a single motherboard; each CPU can read/write memory attached to another CPU, but the cost of accessing that remote memory is higher than the cost of accessing local memory.

Memory — 400

What is prefetching?

Prefetching loads data into the cache before the program actually requests it. It can be done automatically by the hardware or manually by the programmer.

Memory — 500

How does the use of sparsity impact the memory subsystem?

Positive: sparsity lowers the total amount of memory in use by the program. Negative: sparsity reduces memory locality, since sparse access patterns are more irregular.

Quantization — 100

True or False, and Explain Why: There is exactly one bit pattern that represents each of +Inf, -Inf, and NaN in standard floating-point formats.

False. +Inf and -Inf each have exactly one bit pattern, but there are many NaN bit patterns: any pattern with an all-ones exponent field and a nonzero mantissa encodes a NaN.

Quantization — 200

True or False, and Explain Why: The number $3/8$ can be represented exactly as a half-precision floating-point number.

True. It's a rational number whose denominator is a power of 2, and it fits easily within half precision's exponent range and 10-bit mantissa: $3/8 = 1.1_2 \times 2^{-2}$.
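
You can check this from the standard library: Python's `struct` format `'e'` is IEEE 754 half precision, so an exact round trip means the number is exactly representable.

```python
import struct

def to_half_and_back(x: float) -> float:
    """Round-trip x through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_half_and_back(3 / 8) == 3 / 8)  # True: 3/8 = 1.1_2 x 2^-2 is exact
print(to_half_and_back(0.1) == 0.1)      # False: 0.1 is not exact in binary
```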

Quantization — 300

True or False, and Explain Why: Half precision floating-point numbers have half the machine epsilon of full-precision floating point numbers.

False. The machine epsilon represents the relative error of the format, so half-precision numbers should have a larger machine epsilon, not a smaller one!
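
Using the convention that machine epsilon is the gap between $1$ and the next representable number ($2^{-10}$ for half precision with its 10 mantissa bits, $2^{-23}$ for single precision with 23), a quick stdlib check:

```python
import struct

def half(x: float) -> float:
    """Round x to IEEE 754 half precision and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

eps16 = 2.0 ** -10  # half-precision machine epsilon
eps32 = 2.0 ** -23  # single-precision machine epsilon
assert eps16 > eps32  # lower precision means larger relative error

print(half(1.0 + eps16))      # 1.0009765625: distinguishable from 1.0
print(half(1.0 + eps16 / 2))  # 1.0: rounds back down (round-to-even)
```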

Quantization — 400

True or False, and Explain Why: Since e2m1 fp4 has 4 bits, it can represent $2^{4} = 16$ distinct real numbers.

False. The codes for $+0$ and $-0$ are distinct, so it represents only $15$ distinct numbers.
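
Enumerating all 16 codes makes this concrete. A decoder sketch for e2m1 (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, following the usual FP4 layout):

```python
def e2m1_value(code: int) -> float:
    """Decode one 4-bit e2m1 code to the real number it represents."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                 # subnormal: 0.m x 2^0
        return sign * (man / 2)
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)

values = {e2m1_value(c) for c in range(16)}
print(sorted(values))  # 15 distinct values, since +0 and -0 collapse
```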

Quantization — 500

What is block floating point?

Group floating-point numbers into blocks and have them share an exponent. Common block sizes include 16 and 64.
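
A minimal sketch of the quantization step (my own simplified version: signed integer mantissas at a shared scale, with no clamping or bit packing):

```python
import math

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats so all entries share one exponent."""
    # The shared exponent is set by the largest magnitude in the block.
    shared_exp = max(math.frexp(x)[1] for x in block)
    scale = 2.0 ** (shared_exp - mantissa_bits)
    # Each value keeps only an integer mantissa at the shared scale.
    return [round(x / scale) * scale for x in block]

# Small values lose precision because they inherit the block's large exponent:
print(bfp_quantize([1.0, 0.5, 0.01, -0.75]))
```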

Accelerators — 100

How many threads are in an NVIDIA GPU warp?

32.

Accelerators — 200

What is a thread block?

A collection of threads that run together on the same SM and can communicate with each other efficiently via shared memory.

Accelerators — 300

What feature contains most of the FLOPS capability of a modern NVIDIA GPU?

The tensor cores! They accelerate matrix-matrix multiplies.

Accelerators — 400

True or False, and Explain Why: GPUs and CPUs are always located on separate chips in practice, even though in theory they can be put together as a single hybrid system.

False. Modern hybrid systems exist! E.g., Apple's M-series chips put the CPU and GPU on a single chip.

Accelerators — 500

How can a machine learning practitioner go about using a TPU?

Just call `.to(device)` in PyTorch as you would with any other device! Machine learning frameworks support multiple accelerator backends.

Modern ML Practice — 100

What is transfer learning?

A model trained on one task is reused to improve performance on another task.

Modern ML Practice — 200

What is a foundation model?

A "foundation model" is a large general-purpose model that can be adapted to a wide variety of downstream tasks ("foundation" in the sense of being a support for future development).

Modern ML Practice — 300

When might you want to use a state space model or linearized attention?

The main benefit of these approaches is that they "solve" the quadratic scaling of SDPA (scaled dot-product attention) runtime with sequence length. You might want to use them for long-context applications.

Modern ML Practice — 400

What is in-context learning?

Learn without gradient descent by passing example-label pairs as history in the context passed to a large language model.

Modern ML Practice — 500

Why might you want to use a diffusion model instead of an autoregressive one?

Diffusion models can generate an entire output in fewer steps than the sequence length, and they are better suited to the image modality.

Inference — 100

True or False, and Explain Why: Increasing the batch size is always good for inference.

False. A large inference batch size can increase latency, since each request must wait for the rest of its batch!

Inference — 200

True or False, and Explain Why: Taking advantage of neural network compression is usually annoying because you have to compress the model yourself, which takes time and implementation effort.

False. You can find lots of compressed open-source models online these days!

Inference — 300

What does it mean to say a model is W4A16?

4 bits are used to store the weights. 16 bits are used to store the activations.

Inference — 400

What is a straight-through estimator (used in QAT)?

In the backward pass, just pretend the (non-differentiable) quantization operation is the identity, passing gradients straight through it.
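
A scalar sketch of the idea (function names are mine; real QAT frameworks apply this inside the autograd engine):

```python
def quantize(x: float) -> float:
    """Forward pass: a non-differentiable rounding step."""
    return round(x)

def quantize_backward_ste(grad_out: float) -> float:
    # The true derivative of round() is 0 almost everywhere, which would
    # block all learning. The straight-through estimator instead pretends
    # quantize() was the identity, passing the gradient through unchanged.
    return grad_out

print(quantize(0.7))                # 1
print(quantize_backward_ste(0.25))  # 0.25
```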

Inference — 500

What is a Mixture of Experts (MoE) layer? How can it improve performance?

A MoE layer uses only a subset of the weights in the MLP block for each token. It reduces the FLOPs needed for inference. It can also reduce the memory bandwidth needed in the low-batch-size setting.
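
A top-1 routing sketch in plain Python (names and the top-1 choice are mine; real MoE layers typically route each token to the top-k experts using a learned router):

```python
def moe_layer(token, experts, router_scores):
    """Apply only the highest-scoring expert to the token."""
    scores = router_scores(token)
    best = max(range(len(experts)), key=lambda i: scores[i])
    # Only experts[best]'s weights are read and used: FLOPs and memory
    # traffic scale with one expert, not with the full expert count.
    return experts[best](token)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2]
scores = lambda x: [0.1, 0.7, 0.2]   # toy router: always picks expert 1
print(moe_layer(3, experts, scores))  # 6
```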