CS 5220 architecture intro

A memory benchmark (membench)

for array A of length L from 4KB to 8MB by 2x
  for stride s from 4 bytes to L/2 by 2x
    time the following loop
    for i = 0 to L by s
      load A[i]
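
The pseudocode above could be sketched in C roughly as follows. This is a minimal sketch, not the actual membench source: the names `ns_per_load` and `membench_sweep`, the fixed repetition count, and the use of `clock()` as the timer are all assumptions made for illustration, and a 4-byte `int` is assumed.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: read every (stride_bytes / sizeof(int))-th int of A,
   repeat the pass `reps` times, and return the average time per load in ns.
   The running sum is stored in *checksum so the caller can inspect it. */
double ns_per_load(const int *A, size_t len_bytes, size_t stride_bytes,
                   int reps, long *checksum)
{
    size_t n = len_bytes / sizeof(int);
    size_t s = stride_bytes / sizeof(int);
    volatile long sum = 0;   /* volatile keeps the loads from being optimized away */
    size_t loads = 0;

    if (s == 0)
        s = 1;               /* guard against strides smaller than one int */

    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        for (size_t i = 0; i < n; i += s) {
            sum += A[i];
            ++loads;
        }
    clock_t stop = clock();

    *checksum = sum;
    double secs = (double)(stop - start) / CLOCKS_PER_SEC;
    return loads ? 1e9 * secs / (double)loads : 0.0;
}

/* Hypothetical driver: array lengths 4KB..8MB and strides 4 bytes..L/2,
   both doubling. Prints "length stride ns_per_load" on each line. */
void membench_sweep(void)
{
    const size_t max_len = (size_t)8 << 20;
    int *A = calloc(max_len / sizeof(int), sizeof(int));
    if (!A)
        return;
    for (size_t len = (size_t)4 << 10; len <= max_len; len *= 2)
        for (size_t s = sizeof(int); s <= len / 2; s *= 2) {
            long sum;
            double ns = ns_per_load(A, len, s, 10, &sum);
            printf("%zu %zu %.3f\n", len, s, ns);
        }
    free(A);
}
```

A real benchmark would likely pick the repetition count adaptively so that each timing is long compared with the clock resolution; the fixed count of 10 here only keeps the sketch short.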

Membench in pictures

Size = 64 bytes (16 ints)
Strides of 4, 8, 16, 32 bytes

Membench on totient CPU

Vertical features (fixed stride): 64B line size, 4K page size
Horizontal features (fixed array size): 64KB L1, 256KB L2, 15MB L3
Diagonal features: 8-way cache associativity, 512 entry L2 TLB

Note on storage

Two standard layouts:
- Column major (Fortran): A(i,j) at A+i+j*n
- Row-major (C): A(i,j) at A+i*n+j
I default to column major
Also note: C has poor language support for matrices
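
The two layouts can be written as index helpers. This is a small sketch; the names `idx_colmajor` and `idx_rowmajor` are hypothetical, and the matrix is assumed to be stored in one flat array.

```c
#include <stddef.h>

/* Column-major (Fortran style): entries of one column are contiguous.
   nrows is the number of rows (n for an n-by-n matrix). */
static inline size_t idx_colmajor(size_t i, size_t j, size_t nrows)
{
    return i + j * nrows;
}

/* Row-major (C style): entries of one row are contiguous.
   ncols is the number of columns (n for an n-by-n matrix). */
static inline size_t idx_rowmajor(size_t i, size_t j, size_t ncols)
{
    return i * ncols + j;
}
```

For a 3-by-3 matrix, entry (1,2) sits at offset 7 in column-major storage and at offset 5 in row-major storage.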

Matrix multiply

How fast can naive matrix multiply run?

#define A(i,j) AA[(i)+(j)*n]
#define B(i,j) BB[(i)+(j)*n]
#define C(i,j) CC[(i)+(j)*n]

/* AA, BB, CC hold column-major n-by-n matrices; clear the output first */
memset(CC, 0, n*n*sizeof(double));
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            C(i,j) += A(i,k) * B(k,j);

One row in naive

Access $A$ and $C$ with stride $8n$ bytes
Access all $8n^2$ bytes of $B$ before first re-use
Poor arithmetic intensity
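
One way to see how much the access pattern matters is to reorder the loops so the innermost loop walks down columns with unit stride. The sketch below is an illustration of that idea under the column-major layout used here, not a method taken from the slides; the name `matmul_jki` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* C = A*B for column-major n-by-n matrices, with loop order j, k, i.
   The inner loop reads A(:,k) and updates C(:,j) with stride 1,
   and B(k,j) is held in a scalar for the whole inner loop. */
void matmul_jki(int n, const double *A, const double *B, double *C)
{
    memset(C, 0, (size_t)n * (size_t)n * sizeof(double));
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            double bkj = B[k + j * n];
            for (int i = 0; i < n; ++i)
                C[i + j * n] += A[i + k * n] * bkj;
        }
}
```

The flop count is unchanged, so any speed difference relative to the i, j, k order comes from memory access alone. The arithmetic intensity is still poor for large n, because each column of C is revisited n times.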

Matrix multiply compared (Totient + ICC)

Hmm...

Compiler makes some difference
Naive Fortran is faster than naive C
Local instruction mix sets the "speed of light" (the best rate we could hope for)
Access pattern determines how close we get to limit

Engineering strategy

Start with small kernel multiply
- Maybe odd sizes, strange layouts -- just go fast!
- May play with AVX intrinsics, compiler flags, etc
- Deserves its own timing rig
Use blocking based on kernel to improve access pattern
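
A blocked multiply built around a small kernel might look like the sketch below. The name `matmul_blocked` and the block size parameter `bs` are hypothetical, the kernel is left as plain loops rather than tuned code, and `bs` is assumed to be positive.

```c
#include <stddef.h>
#include <string.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* C = A*B for column-major n-by-n matrices, processed in bs-by-bs blocks.
   bs is a tuning parameter: the aim is for one block each of A, B, and C
   to fit in fast memory at the same time. */
void matmul_blocked(int n, int bs, const double *A, const double *B, double *C)
{
    memset(C, 0, (size_t)n * (size_t)n * sizeof(double));
    for (int jj = 0; jj < n; jj += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int ii = 0; ii < n; ii += bs) {
                /* Clip the block at the matrix edge when bs does not divide n. */
                int jend = min_int(jj + bs, n);
                int kend = min_int(kk + bs, n);
                int iend = min_int(ii + bs, n);
                /* Small kernel multiply: the part that deserves its own tuning. */
                for (int j = jj; j < jend; ++j)
                    for (int k = kk; k < kend; ++k) {
                        double bkj = B[k + j * n];
                        for (int i = ii; i < iend; ++i)
                            C[i + j * n] += A[i + k * n] * bkj;
                    }
            }
}
```

Keeping the kernel as a separate inner loop nest makes it easy to swap in a hand-tuned version later without touching the blocking logic.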

Simple model

Two types of memory (fast+slow)
- $m$ = words read from slow memory
- $t_m$ = slow memory op time
- $f$ = number of flops
- $t_f$ = time per flop
- $q = f/m$ = average flops/slow access
Time: $ft_f + mt_m = ft_f \left( 1 + \frac{t_m/t_f}{q} \right)$
Larger $q$ means better time
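
The model is simple enough to evaluate directly. The sketch below uses hypothetical function names and made-up inputs purely to show that the two forms of the time formula agree.

```c
/* Total time in the two-level model: f flops at t_f each,
   plus m slow-memory accesses at t_m each. */
double model_time(double f, double m, double t_f, double t_m)
{
    return f * t_f + m * t_m;
}

/* Slowdown factor relative to pure compute time f*t_f,
   written in terms of q = f/m: 1 + (t_m/t_f)/q. */
double model_slowdown(double q, double t_f, double t_m)
{
    return 1.0 + (t_m / t_f) / q;
}
```

For example, with f = 100, m = 10 (so q = 10), t_f = 1, and t_m = 20 in arbitrary units, `model_time` gives 300 and `model_slowdown` gives 3, which matches 100 * 1 * 3 = 300. Doubling q to 20 would cut the slowdown to 2.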

How big can $q$ be?

Level 1/2/3 Basic Linear Algebra Subprograms (BLAS)

Dot product: $n$ data, $2n$ flops
Matrix-vector multiply: $n^2$ data, $2n^2$ flops
Matrix-matrix multiply: $2n^2$ data, $2n^3$ flops

We like to build on level 3 BLAS (like matrix multiplication)!
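
Taking the data and flop counts above at face value (they are rough counts that ignore constant factors, which is an assumption of this sketch), the ratio $q = f/m$ from the simple model works out to:

$$
\begin{aligned}
q_{\text{dot}} &= \frac{2n}{n} = 2 \\
q_{\text{matvec}} &= \frac{2n^2}{n^2} = 2 \\
q_{\text{matmul}} &= \frac{2n^3}{2n^2} = n
\end{aligned}
$$

Only the level 3 ratio grows with $n$, which is why matrix-matrix multiply has room to run near the flop rate while level 1 and level 2 operations stay limited by memory.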

Tuning matrix multiply

Matmul assignment is up
You will get email with group assignments
Goal is single core performance analysis and tuning
Deliverables
- Report describing strategy and performance results
- Pointer to a repository so we can run a competition

Possible tactics

Manually tune some small kernels
Write an auto-tuner to sweep parameters
Try different compilers (and flags)
Try different layouts
Copy optimization
Study strategies from past/present classes!
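
A parameter sweep for an auto-tuner can be as small as the sketch below. Everything here is hypothetical: the names `cost_fn`, `sweep_best`, and `toy_cost` are made up for illustration, and `toy_cost` is a stand-in for a real measurement such as the timed run of a blocked multiply.

```c
#include <stddef.h>

/* A cost function maps one parameter value (say, a block size)
   to a measured cost (say, seconds per multiply). */
typedef double (*cost_fn)(int param, void *ctx);

/* Try every candidate and return the one with the lowest cost.
   Returns -1 if the candidate list is empty. */
int sweep_best(const int *candidates, size_t count, cost_fn cost, void *ctx)
{
    size_t best = 0;
    double best_cost = 0.0;

    if (count == 0)
        return -1;
    for (size_t i = 0; i < count; ++i) {
        double c = cost(candidates[i], ctx);
        if (i == 0 || c < best_cost) {
            best = i;
            best_cost = c;
        }
    }
    return candidates[best];
}

/* Stand-in cost with a minimum at 32; a real tuner would time the kernel here. */
double toy_cost(int param, void *ctx)
{
    (void)ctx;
    double d = (double)param - 32.0;
    return d * d;
}
```

With candidates 8, 16, 32, and 64 and the stand-in cost, the sweep picks 32. With real timings it is worth repeating each measurement, since a single noisy run can pick the wrong winner.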

Warning

Tuning can be like video games!
Do spend the time to do a good job
Don't get so sucked in you neglect more important things

CS 5220: Applications of Parallel Computers

Matmul and tiling

08 Sep 2015
