CS 5220: Applications of Parallel Computers
Memory matters
01 Sep 2015
- Theoretical peak: roughly 100 GFlop/s
- Peak memory bandwidth: 25.6 GB/s
- Arithmetic intensity = flops/memory access
- Low arithmetic intensity is bad news...
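A standard way to quantify this is the roofline bound (my gloss, not from the slide; here intensity I is measured in flops per byte, and β is the peak bandwidth):

    attainable flop rate ≤ min( peak flop rate, I × β )

Compute-bound kernels hit the first term; memory-bound kernels are capped by the second.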
How long will it take?
Consider my machine (peak 100 GFlop/s, peak memory bandwidth 25.6 GB/s).
Suppose I have a code doing double-precision arithmetic with intensity one (one flop per 8-byte access).
What is the max flop rate?
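A back-of-envelope sketch, assuming every flop needs one fresh 8-byte double from memory (so I = 1/8 flop per byte):

    min( 100 GFlop/s, (1/8 flop/byte) × 25.6 GB/s ) = 3.2 GFlop/s

That is about 3% of peak: this code is memory-bound.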
Memory basics
- Memory latency = how long to get a requested item (see the sketch after this list)
- Memory bandwidth = rate memory can provide data
- Bandwidth improving faster than latency
- Processor demand growing faster than either!
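Latency is easiest to isolate when loads are serialized. A minimal C sketch (my construction, not a standard tool): chase a pointer chain through a random permutation, so each load must wait for the previous one and cannot be overlapped.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t n = 1 << 24;                      /* 16M entries = 128 MB */
        size_t* next = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {     /* Sattolo's algorithm: */
            size_t j = rand() % i;               /* one cycle through all */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t k = 0;
        clock_t t0 = clock();
        for (size_t it = 0; it < n; it++) k = next[k];  /* dependent loads */
        double ns = 1e9 * (clock() - t0) / CLOCKS_PER_SEC / n;
        printf("~%.1f ns per dependent load (end = %zu)\n", ns, k);
        free(next);
        return 0;
    }

A streaming loop over the same array would instead be limited by bandwidth, since independent loads can overlap.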
Cache basics
- Programs usually have locality
- Spatial: nearby items accessed consecutively
- Temporal: use a small "working set" repeatedly
- Cache hierarchy built to use locality
- Cache = small, fast memory
- Several types of cache on modern chips
Caches help...
- Hide memory cost by reusing data
  - Exploits temporal locality
- Use bandwidth to fetch a cache line all at once
  - Exploits spatial locality (see the example below)
- Use bandwidth to support multiple outstanding reads
- Overlap computation and communication with memory
This is (mostly) automatic and implicit.
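A small illustration of spatial locality (my example, not from the slides): in C's row-major layout, summing a matrix row-by-row walks memory with unit stride, while column-by-column touches only one element per cache line before moving on.

    #include <stdio.h>
    #define N 2048

    static double a[N][N];          /* row-major: a[i][j], a[i][j+1] adjacent */

    double sum_by_rows(void) {      /* unit stride: uses each line fully */
        double s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_by_cols(void) {      /* stride of N*8 bytes: poor line reuse */
        double s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%g %g\n", sum_by_rows(), sum_by_cols());
        return 0;
    }

Both functions compute the same sum; only the access order differs.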
Cache organization
- Store cache lines of several bytes
- Cache hit = copy of needed data in cache
- Cache miss otherwise. Three basic types:
  - Compulsory: data has never been used before
  - Capacity: working set too big, so data was evicted to make room
  - Conflict: insufficient associativity for the access pattern
- Cache hit rate = cache hits / memory accesses attempted
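A textbook way to turn the hit rate into time (standard model, not on the slide): with hit rate h, hit time t_hit, and miss penalty t_miss,

    average access time = t_hit + (1 − h) × t_miss

Even a few percent of misses can dominate when t_miss ≫ t_hit.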
Cache associativity
- Where can data for a given main memory address go?
  - Direct-mapped: only one cache location
  - n-way set associative: n possible cache locations
  - Fully associative: anywhere in cache
- Example: 8-bit address 10011101₂ (coded up after this list)
  - Cache location is based on the low-order bits of the address
  - Direct mapped (16 entries): can only be stored in entry 1101₂
  - 4-way set associative (64 entries = 16 sets): four possible locations, all in set 1101₂
  - In either case, address 10111101₂ would conflict
- High associativity is more expensive
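The toy mapping above is easy to code. A sketch (the function set_of and the one-item "entries" are mine, mirroring the slide's example rather than a real cache with multi-byte lines):

    #include <stdio.h>

    /* Which set does an address map to? Low-order bits pick the set. */
    unsigned set_of(unsigned addr, unsigned entries, unsigned ways) {
        unsigned nsets = entries / ways;   /* direct-mapped: ways = 1 */
        return addr % nsets;
    }

    int main(void) {
        unsigned a = 0x9D;   /* 10011101 in binary */
        unsigned b = 0xBD;   /* 10111101 in binary */
        printf("direct mapped, 16 entries: %u vs %u\n",
               set_of(a, 16, 1), set_of(b, 16, 1));   /* both 13 = 1101 */
        printf("4-way, 64 entries: set %u vs set %u\n",
               set_of(a, 64, 4), set_of(b, 64, 4));   /* both set 13 */
        return 0;
    }

Both addresses land in the same place; the direct-mapped cache can hold only one of them at a time, while the 4-way cache has four slots per set.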
Caches on my laptop
Multiple levels of cache with different sizes and latencies.
Cache lines are 64B in all cases, I think.
- Data caches:
- L1 cache: 64 KB/core, 8-way (4 clocks)
- L2 cache: 256 KB/core, 8-way (12 clocks)
- L3 cache: 3 MB (shared), direct mapped (21 clocks?)
- Also have instruction caches for code (less of a worry for us)
A miss in a smaller, faster cache may still hit in a larger, slower level.
Modeling question
Consider my machine (100 GFlop/s peak, 25.6 GB/s memory bandwidth).
Suppose a workload of mostly double precision fused multiply-adds.
What is the minimum cache hit rate needed to maintain half peak?
(We'll talk more about this in class)
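One possible setup (a sketch under strong assumptions: each FMA counts as 2 flops and reads one fresh 8-byte double, hits are free, and a miss moves exactly 8 bytes):

    half peak = 50 GFlop/s ⇒ 25 × 10⁹ reads/s ⇒ 200 GB/s demanded
    (1 − h) × 200 GB/s ≤ 25.6 GB/s ⇒ h ≥ 1 − 0.128 ≈ 87%

Different (equally defensible) assumptions give different numbers; that is part of the discussion.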
Modeling question
We have N = 10⁶ two-dimensional coordinates, and want the centroid.
Which of these is faster and why?
- Store an array of (xᵢ, yᵢ) coordinates. Loop over i and
  simultaneously sum the xᵢ and yᵢ.
- Store an array of (xᵢ, yᵢ) coordinates. Loop over i and
  sum the xᵢ, then sum the yᵢ in a separate loop.
- Store the xᵢ in one array, the yᵢ in a second array.
  Sum the xᵢ, then sum the yᵢ.
- Other methods?
Try it out and see!
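A minimal C sketch of the three options (the function names and point_t are mine):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    typedef struct { double x, y; } point_t;

    /* Option 1: array of structs, one pass summing both coordinates */
    void centroid_aos_fused(const point_t* p, double* cx, double* cy) {
        double sx = 0, sy = 0;
        for (int i = 0; i < N; i++) { sx += p[i].x; sy += p[i].y; }
        *cx = sx / N; *cy = sy / N;
    }

    /* Option 2: array of structs, two separate passes */
    void centroid_aos_split(const point_t* p, double* cx, double* cy) {
        double sx = 0, sy = 0;
        for (int i = 0; i < N; i++) sx += p[i].x;
        for (int i = 0; i < N; i++) sy += p[i].y;
        *cx = sx / N; *cy = sy / N;
    }

    /* Option 3: struct of arrays (x and y stored separately) */
    void centroid_soa(const double* x, const double* y, double* cx, double* cy) {
        double sx = 0, sy = 0;
        for (int i = 0; i < N; i++) sx += x[i];
        for (int i = 0; i < N; i++) sy += y[i];
        *cx = sx / N; *cy = sy / N;
    }

    int main(void) {
        point_t* p = malloc(N * sizeof(point_t));
        double* x = malloc(N * sizeof(double));
        double* y = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { x[i] = p[i].x = i; y[i] = p[i].y = 2.0 * i; }
        double cx, cy;
        centroid_aos_fused(p, &cx, &cy); printf("fused: (%g, %g)\n", cx, cy);
        centroid_aos_split(p, &cx, &cy); printf("split: (%g, %g)\n", cx, cy);
        centroid_soa(x, y, &cx, &cy);    printf("soa:   (%g, %g)\n", cx, cy);
        free(p); free(x); free(y);
        return 0;
    }

All three compute the same centroid; they differ only in layout and access pattern.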
A memory benchmark (membench)
for array A of length L from 4 KB to 8 MB by 2x
    for stride s from 4 bytes to L/2 by 2x
        time the following loop
            for i = 0 to L by s
                load A[i]
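A simplified C version (one timing pass with clock(); the real benchmark repeats each experiment many times and subtracts loop overhead):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t max_bytes = 8 << 20;                 /* 8 MB */
        volatile char* A = malloc(max_bytes);
        for (size_t i = 0; i < max_bytes; i++) A[i] = 1;
        for (size_t L = 4 << 10; L <= max_bytes; L *= 2) {   /* 4 KB..8 MB */
            for (size_t s = 4; s <= L / 2; s *= 2) {         /* stride (bytes) */
                clock_t t0 = clock();
                for (size_t i = 0; i < L; i += s)
                    (void) A[i];                             /* load A[i] */
                double ns = 1e9 * (clock() - t0) / CLOCKS_PER_SEC;
                printf("L=%zu s=%zu: %.2f ns/access\n", L, s, ns / (L / s));
            }
        }
        return 0;
    }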
Membench on my laptop
(figures: timing results from the benchmark above)