Warm-up with Menti (5 minutes)

Question of the day on Menti:

  • What’s on your bucket list?

Logistics (5 minutes)

We pointed out

  • The Slack group
  • Gather.town office hours
  • Resource links (missing semester, SWC, and Unix resources)

Also, some brief suggestions for breakouts: share a common doc, feel free to pop out and ask me questions (or use Slack).

From last time (15 min)

Groups 6-10 from last time (whoever was in the lead). One of the interesting notes that came up is related to a point in the slides about CPU vs GPU performance. Modern high-end CPUs have theoretical peak flop rates that are not necessarily so much lower than those of GPUs, but people often report order of magnitude speedups when moving to GPUs. Why? Partly because those same folks often spend more time tuning the GPU code than the CPU code.

Breakout (40 minutes)

  1. Name and your one wished-for superpower
  2. Suppose we have n tasks in a p-stage pipeline (and n > p), where each pipeline stage takes one clock cycle to complete. Including time to drain and fill the pipeline, what is the overall number of cycles required? What fraction of the pipeline cycles are empty?
  3. Suppose you had a 16-entry cache that was four-way set associative, and that the eviction policy (how you kick something out of the cache) is least recently used (LRU). For the following sequence of reads, label whether each would be a cache hit or miss, and the type of miss involved (compulsory, capacity, or conflict): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 20, 24, 28, 32, 16, 20, 24, 28, 32
  4. Consider the teaser from the 9/10 lecture slides (computing the centroid of a million points in the plane). Which do you think would be fastest, and why? If you have time, consider writing a short C program to test your hypothesis.
  5. Comment on factors that go into the speed advantage of single precision compared to double precision (on Intel CPUs).

Discussion (10 minutes)

We had an interesting discussion about problem 4 (the centroid). There are two distinct issues at play here: the memory access pattern, and the computational pattern and opportunities for instruction-level parallelism. The second method reads data with a stride of 2, which may end up involving twice as many cache line loads. However, this computation is not naturally set up to expose the maximum possible instruction-level parallelism (though it could be rewritten to do so), and so it ends up being limited by flop rates rather than memory. For this reason, the first arrangement – which involves to independent sums computed simultaneously – ends up running about twice as fast.