Warm-up with Menti (5 minutes)

Several people arrived a few minutes late, whether because of the hour or because of trouble finding the Zoom link. While we waited for people to join, we asked a question with Menti:

  • What pets do you have?

Thanks to those who showed up from awkward time zones! I hope it was worth it.

Logistics (5 minutes)

We pointed out:

  • The narrated slides
  • HW 0
  • Piazza
  • Gather.town for office hours and meetups
  • CS partner-finding event
  • Missing semester, SWC, and Unix resources

Breakout (45 minutes)

  1. Name and favorite cartoon character
  2. Look up system X on the Top 500 list. Can you explain the theoretical peak flop rate?
  3. Google to find the Gordon Bell prize for 2020-X where X is your group number. Try to find an article that gives the science and some performance numbers.
  4. Discuss point X in Bailey’s article (where X is your group number)
  5. Toy analysis from 9/3 notes.
  6. What is the effective flop rate of matmul on my system?
  7. Repeat the experiment with Python, MATLAB, or Julia on your system. This will require figuring out how to do timings; there is a sketch of one approach after this list. What is the effective flop rate?
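
For items 6-7, here is a minimal sketch of the kind of measurement I have in mind, using NumPy; the matrix size and number of trials are arbitrary choices for illustration, and your BLAS and hardware will give different numbers.

    import time
    import numpy as np

    n = 1000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    # Warm-up run so one-time setup costs are not counted.
    A @ B

    # Keep the best of several trials to reduce timing noise.
    best = float("inf")
    for _ in range(5):
        start = time.perf_counter()
        C = A @ B
        best = min(best, time.perf_counter() - start)

    # An n-by-n matmul does about 2*n^3 flops (n^3 multiplies and n^3 adds).
    gflops = 2.0 * n**3 / best / 1e9
    print(f"Best time: {best:.4f} s, effective rate: {gflops:.1f} Gflop/s")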

Afternotes

In the discussion after the breakout, we got through groups 1-4. We’ll have to pick up with groups 5-10 next time.

  • The theoretical peak flop rate is not so easy to compute for many of the Top 500 machines. It seems like it should be simple, but it’s actually complicated by variable clock rates, different types of processors (e.g. GPUs or other accelerators), and other factors. There is a back-of-the-envelope sketch of the basic arithmetic after these notes.
  • I talk a bit about different floating point formats and what constitutes a flop in the 9/8 slide deck. One can achieve higher rates with narrower precisions, but the effects of rounding error and dynamic range restrictions are also greater for those formats, so there is a tradeoff. There is a lot of interest now in mixed precision computations that try to get the best of both worlds: the accuracy associated with the higher precision and the speed associated with the lower precision. This came up in some of the Gordon Bell problem discussions. A small illustration of the accuracy and range side of the tradeoff appears below these notes.
  • Gordon Bell prize computations may get up to a few percent of theoretical peak, but it’s quite difficult to go higher than that.
  • Bailey mentions taking the performance of a kernel and presenting it as the performance of a whole application. The word “kernel” in this setting describes any piece of functionality (matrix multiply, linear solve, etc.) that is widely used enough to be worth tuning for different architectures. GPU programming didn’t really exist in its modern incarnation when Bailey wrote the article.
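
As a back-of-the-envelope illustration of the peak flop rate arithmetic (every number below is invented for the example, not taken from any particular Top 500 machine):

    # Theoretical peak = sockets * cores/socket * clock rate * flops/cycle/core.
    sockets = 2
    cores_per_socket = 24
    clock_ghz = 2.5           # nominal clock; turbo and throttling muddy this
    flops_per_cycle = 16      # e.g. two AVX2 FMA units: 2 * 4 doubles * 2 flops

    peak_gflops = sockets * cores_per_socket * clock_ghz * flops_per_cycle
    print(f"Theoretical peak: {peak_gflops:.0f} Gflop/s per node")   # 1920

For an accelerated machine, you would have to repeat this sort of accounting for the GPUs as well, which is part of why the Top 500 entries are harder to pin down than they look.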
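
And a tiny illustration of the accuracy and range side of the precision tradeoff (this does not measure speed; it just shows what the narrower formats give up), again with NumPy:

    import numpy as np

    # Machine epsilon and largest representable value for common IEEE formats:
    # narrower formats give up both accuracy (larger eps) and range (smaller max).
    for dtype in (np.float16, np.float32, np.float64):
        info = np.finfo(dtype)
        print(f"{np.dtype(dtype).name:8s} eps = {info.eps:.1e}  max = {info.max:.1e}")

    # Rounding: an update that survives in double precision is lost in single.
    print(np.float64(1.0) + np.float64(1e-8))   # 1.00000001
    print(np.float32(1.0) + np.float32(1e-8))   # 1.0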

From a tactical perspective, I recommend that each breakout group use a shared document (e.g. a Google Doc) for its notes, URLs, and computations.