CS 5220: Applications of Parallel Computers
Memory matters
01 Sep 2015
- Theoretical peak: roughly 100 GFlop/s
- Peak memory bandwidth: 25.6 GB/s
- Arithmetic intensity = flops/memory access
- Low arithmetic intensity is bad news...
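A standard way to quantify this is the roofline bound (my gloss, not from the slide; here intensity I is measured in flops per byte, and β is the peak bandwidth):

    attainable flop rate ≤ min( peak flop rate, I × β )

Compute-bound kernels hit the first term; memory-bound kernels are capped by the second.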
How long will it take?
Consider my machine (peak 100 GFlop/s, peak memory bandwidth 25.6 GB/s).
Suppose I have a code doing double-precision arithmetic with intensity one (one flop per 8-byte access).
What is the max flop rate?
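A back-of-envelope sketch, assuming every flop needs one fresh 8-byte double from memory (so I = 1/8 flop per byte):

    min( 100 GFlop/s, (1/8 flop/byte) × 25.6 GB/s ) = 3.2 GFlop/s

That is about 3% of peak: this code is memory-bound.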
Memory basics
- Memory latency = how long to get a requested item (see the sketch after this list)
- Memory bandwidth = rate memory can provide data
- Bandwidth improving faster than latency
- Processor demand growing faster than either!
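Latency is easiest to isolate when loads are serialized. A minimal C sketch (my construction, not a standard tool): chase a pointer chain through a random permutation, so each load must wait for the previous one and cannot be overlapped.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t n = 1 << 24;                      /* 16M entries = 128 MB */
        size_t* next = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {     /* Sattolo's algorithm: */
            size_t j = rand() % i;               /* one cycle through all */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t k = 0;
        clock_t t0 = clock();
        for (size_t it = 0; it < n; it++) k = next[k];  /* dependent loads */
        double ns = 1e9 * (clock() - t0) / CLOCKS_PER_SEC / n;
        printf("~%.1f ns per dependent load (end = %zu)\n", ns, k);
        free(next);
        return 0;
    }

A streaming loop over the same array would instead be limited by bandwidth, since independent loads can overlap.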
Cache basics
- Programs usually have locality
- Spatial: nearby items accessed consecutively
- Temporal: use a small "working set" repeatedly
- Cache hierarchy built to use locality
- Cache = small, fast memory
- Several types of cache on modern chips
Caches help...
- Hide memory cost by reusing data
  - Exploits temporal locality
- Use bandwidth to fetch a cache line all at once
  - Exploits spatial locality (see the example below)
- Use bandwidth to support multiple outstanding reads
- Overlap computation and communication with memory
This is (mostly) automatic and implicit.
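A small illustration of spatial locality (my example, not from the slides): in C's row-major layout, summing a matrix row-by-row walks memory with unit stride, while column-by-column touches only one element per cache line before moving on.

    #include <stdio.h>
    #define N 2048

    static double a[N][N];          /* row-major: a[i][j], a[i][j+1] adjacent */

    double sum_by_rows(void) {      /* unit stride: uses each line fully */
        double s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_by_cols(void) {      /* stride of N*8 bytes: poor line reuse */
        double s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%g %g\n", sum_by_rows(), sum_by_cols());
        return 0;
    }

Both functions compute the same sum; only the access order differs.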
Cache organization
- Store cache lines of several bytes
- Cache hit = copy of needed data in cache
- Cache miss otherwise. Three basic types:
  - Compulsory: data has never been used before
  - Capacity: working set too big, so data was evicted to make room
  - Conflict: insufficient associativity for the access pattern
- Cache hit rate = cache hits / memory accesses attempted
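A textbook way to turn the hit rate into time (standard model, not on the slide): with hit rate h, hit time t_hit, and miss penalty t_miss,

    average access time = t_hit + (1 − h) × t_miss

Even a few percent of misses can dominate when t_miss ≫ t_hit.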
Cache associativity
- Where can data for a given main memory address go?
  - Direct-mapped: only one cache location
  - n-way set associative: n possible cache locations
  - Fully associative: anywhere in cache
- Example: 8-bit address 10011101₂ (coded up after this list)
  - Cache location is based on the low-order bits of the address
  - Direct mapped (16 entries): can only be stored in entry 1101₂
  - 4-way set associative (64 entries = 16 sets): four possible locations, all in set 1101₂
  - In either case, address 10111101₂ would conflict
- High associativity is more expensive
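The toy mapping above is easy to code. A sketch (the function set_of and the one-item "entries" are mine, mirroring the slide's example rather than a real cache with multi-byte lines):

    #include <stdio.h>

    /* Which set does an address map to? Low-order bits pick the set. */
    unsigned set_of(unsigned addr, unsigned entries, unsigned ways) {
        unsigned nsets = entries / ways;   /* direct-mapped: ways = 1 */
        return addr % nsets;
    }

    int main(void) {
        unsigned a = 0x9D;   /* 10011101 in binary */
        unsigned b = 0xBD;   /* 10111101 in binary */
        printf("direct mapped, 16 entries: %u vs %u\n",
               set_of(a, 16, 1), set_of(b, 16, 1));   /* both 13 = 1101 */
        printf("4-way, 64 entries: set %u vs set %u\n",
               set_of(a, 64, 4), set_of(b, 64, 4));   /* both set 13 */
        return 0;
    }

Both addresses land in the same place; the direct-mapped cache can hold only one of them at a time, while the 4-way cache has four slots per set.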
Caches on my laptop
Multiple levels of cache with different sizes and latencies.
Cache lines are 64B in all cases, I think.
- Data caches:
- L1 cache: 64 KB/core, 8-way (4 clocks)
- L2 cache: 256 KB/core, 8-way (12 clocks)
- L3 cache: 3 MB (shared), direct mapped (21 clocks?)
- Also have instruction caches for code (less of a worry for us)
A miss in a smaller, faster cache may still hit in a larger, slower level.
Modeling question
Consider my machine (100 GFlop/s peak, 25.6 GB/s memory bandwidth).
Suppose a workload of mostly double precision fused multiply-adds.
What is the minimum cache hit rate needed to maintain half peak?
(We'll talk more about this in class)
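One possible setup (a sketch under strong assumptions: each FMA counts as 2 flops and reads one fresh 8-byte double, hits are free, and a miss moves exactly 8 bytes):

    half peak = 50 GFlop/s ⇒ 25 × 10⁹ reads/s ⇒ 200 GB/s demanded
    (1 − h) × 200 GB/s ≤ 25.6 GB/s ⇒ h ≥ 1 − 0.128 ≈ 87%

Different (equally defensible) assumptions give different numbers; that is part of the discussion.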
Modeling question
We have N = 10⁶ two-dimensional coordinates, and want the centroid.
Which of these is faster and why?
- Store an array of (xᵢ, yᵢ) coordinates. Loop over i and
  simultaneously sum the xᵢ and yᵢ.
- Store an array of (xᵢ, yᵢ) coordinates. Loop over i and
  sum the xᵢ, then sum the yᵢ in a separate loop.
- Store the xᵢ in one array, the yᵢ in a second array.
  Sum the xᵢ, then sum the yᵢ.
- Other methods?
Try it out and see!
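A minimal C sketch of the three options (the function names and point_t are mine):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    typedef struct { double x, y; } point_t;

    /* Option 1: array of structs, one pass summing both coordinates */
    void centroid_aos_fused(const point_t* p, double* cx, double* cy) {
        double sx = 0, sy = 0;
        for (int i = 0; i < N; i++) { sx += p[i].x; sy += p[i].y; }
        *cx = sx / N; *cy = sy / N;
    }

    /* Option 2: array of structs, two separate passes */
    void centroid_aos_split(const point_t* p, double* cx, double* cy) {
        double sx = 0, sy = 0;
        for (int i = 0; i < N; i++) sx += p[i].x;
        for (int i = 0; i < N; i++) sy += p[i].y;
        *cx = sx / N; *cy = sy / N;
    }

    /* Option 3: struct of arrays (x and y stored separately) */
    void centroid_soa(const double* x, const double* y, double* cx, double* cy) {
        double sx = 0, sy = 0;
        for (int i = 0; i < N; i++) sx += x[i];
        for (int i = 0; i < N; i++) sy += y[i];
        *cx = sx / N; *cy = sy / N;
    }

    int main(void) {
        point_t* p = malloc(N * sizeof(point_t));
        double* x = malloc(N * sizeof(double));
        double* y = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { x[i] = p[i].x = i; y[i] = p[i].y = 2.0 * i; }
        double cx, cy;
        centroid_aos_fused(p, &cx, &cy); printf("fused: (%g, %g)\n", cx, cy);
        centroid_aos_split(p, &cx, &cy); printf("split: (%g, %g)\n", cx, cy);
        centroid_soa(x, y, &cx, &cy);    printf("soa:   (%g, %g)\n", cx, cy);
        free(p); free(x); free(y);
        return 0;
    }

All three compute the same centroid; they differ only in layout and access pattern.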
A memory benchmark (membench)
for array A of length L from 4 KB to 8 MB by 2x
    for stride s from 4 bytes to L/2 by 2x
        time the following loop
            for i = 0 to L by s
                load A[i]
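A simplified C version (one timing pass with clock(); the real benchmark repeats each experiment many times and subtracts loop overhead):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t max_bytes = 8 << 20;                 /* 8 MB */
        volatile char* A = malloc(max_bytes);
        for (size_t i = 0; i < max_bytes; i++) A[i] = 1;
        for (size_t L = 4 << 10; L <= max_bytes; L *= 2) {   /* 4 KB..8 MB */
            for (size_t s = 4; s <= L / 2; s *= 2) {         /* stride (bytes) */
                clock_t t0 = clock();
                for (size_t i = 0; i < L; i += s)
                    (void) A[i];                             /* load A[i] */
                double ns = 1e9 * (clock() - t0) / CLOCKS_PER_SEC;
                printf("L=%zu s=%zu: %.2f ns/access\n", L, s, ns / (L / s));
            }
        }
        return 0;
    }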
Membench on my laptop
(figures: timing results from the benchmark above)