Logistics (5 minutes)

  • Please use n2-highcpu-2 for timing and tuning on matmul
  • Accounts set up soon on Graphite (CPU and GPU
    • Priority partitions: cs5220 and cs5220-gpu

Quick discussion of matrix multiply status (10 minutes)

Examples: 1D waves in C and Python (10 minutes)

I will probably run this part locally on my laptop, but it is also possible to run it remotely on a Google Compute Engine machine. I do this with SSH tunneling, which is a technique to make traffic from a port on my local machine go across SSH and magically arrive at a port on the remote machine. This is a standard bit of SSH trickery, and can be accomplished with the Google systems by running the gcloud command reported by clicking the “SSH” menu to the right of the instance information in the Compute Engine instance list. For me, this brings up something like:

gcloud beta compute ssh --zone "us-central1-a" "instance-1" --project "cs-5220-289518"

Typing that command at the command line of my Mac gets me an ordinary ssh session to a Google VM. In order to add tunneling, I run

gcloud beta compute ssh --zone "us-central1-a" "instance-1" --project "cs-5220-289518" -- -L 8888:localhost:8888

Then when I run jupyter notebook (after installing all the packages to ensure that I can do such a thing!), if it sends traffic to port 8888 on the Google VM, I can pick up that traffic on port 8888 of my local host. That is, I can access the Jupyter notebook on my local machine by pointing my web browser to

http://localhost:8888/?token=RANDOM-LOOKING-STUFF-HERE

and interact with the remote Jupyter notebook from the comfort of my local web browser.

Breakout (35 minutes)

  1. Suppose you have a tuned single-core dot product that is limited by memory bandwidth (with memory at 12.4 GB/s for one core), and sending a message between processors takes 10 microseconds. If a parallel dot product implementation requires p-1 messages, what is the speedup curve for running a dot product on double precision vectors of dimension one million?
  2. Consider a spatial decomposition of “Game of Life” on an n-by-n grid with periodic boundary conditions in distributed memory. Assume we have a p-by-q grid of processors, and exchange a “halo” of d layers of boundary cells every d steps of the simulation. How would we model the communication and computation costs at each step? Under what circumstances is it possible to “hide” the communication under the computation. Use a simple model of the type discussed toward the end of the particle lecture.
  3. Address the questions at the end of the waves demo.

Report out (15 minutes)

Afternotes