# Caches

Kevin Walsh
CS 3410, Spring 2010
Computer Science
Cornell University

#### CPU clock rates ~0.2ns - 2ns (5GHz-500MHz)

| Technology | Capacity | \$/GB | Latency |
|---------------|----------|-------|---------------------------|
| Tape | 1 TB | \$.17 | 100s of seconds |
| Disk | 1 TB | \$.08 | Millions of cycles (ms) |
| SSD (Flash) | 128 GB | \$3 | Thousands of cycles (us) |
| DRAM | 4 GB | \$25 | 50-300 cycles (10s of ns) |
| SRAM off-chip | 4 MB | \$4k | 5-15 cycles (few ns) |
| SRAM on-chip | 256 KB | 5562 | 1-3 cycles (ns) |

These are rough numbers: mileage may vary for latest/greatest.

Caches are usually made of SRAM (or eDRAM).

Others: eDRAM aka 1-T SRAM, FeRAM, CD, DVD, ...

Q: Can we create the illusion of cheap + large + fast?

Memory closer to processor:

- small & fast
- stores active data

Memory farther from processor:

- big & slow
- stores inactive data

Assumption: Most data is not active.

Q: How to decide what is active?

- A: Some committee decides
- A: Programmer decides
- A: Compiler decides
- A: OS decides at run-time
- A: Hardware decides at run-time

A: Data that will be used soon

If Mem[x] was accessed *recently*...

- ... then Mem[x] is likely to be accessed soon
  - Caches exploit temporal locality by putting recently accessed Mem[x] higher in the pyramid
- ... then Mem[$x \pm \varepsilon$] is likely to be accessed soon
  - Caches exploit spatial locality by putting an entire block containing Mem[x] higher in the pyramid

```
/* Memory trace:
 *   0x7c9a2b18, 0x7c9a2b19, 0x7c9a2b1a, 0x7c9a2b1b,
 *   0x7c9a2b1c, 0x7c9a2b1d, 0x7c9a2b1e, ...
 *   0x00400318, 0x0040031c, ...
 */
#include <stdio.h>

int n = 4;
int k[] = { 3, 14, 0, 10 };

int fib(int i) {
    if (i <= 2) return i;
    else return fib(i-1) + fib(i-2);
}

int main(int ac, char **av) {
    for (int i = 0; i < n; i++) {
        printf("%d\n", fib(k[i]));  /* print each result on its own line */
    }
    return 0;
}
```

# Memory closer to processor is fast and small

- usually stores a subset of memory farther from the processor
  - "strictly inclusive"
- alternatives:
  - strictly exclusive
  - mostly inclusive
- Transfer whole blocks (cache lines), e.g.:
  - 4 KB: disk ↔ RAM
  - 256 B: RAM ↔ L2
  - 64 B: L2 ↔ L1

# Processor tries to access Mem[x]

#### Check: is block containing x in the cache?

- Yes: cache hit
  - return requested data from cache line
- No: cache miss
  - read block from memory (or lower level cache)
  - (evict an existing cache line to make room)
  - place new block in cache
  - return requested data
  - → and stall the pipeline while all of this happens

# Cache has to be fast and dense

- Gain speed by performing lookups in parallel
  - but requires die real estate for lookup logic
- Reduce lookup logic by limiting where in the cache a block might be placed
  - but might reduce cache effectiveness

# A given data block can be placed...

- ... in any cache line → Fully Associative
- ... in exactly one cache line → Direct Mapped
- ... in a small set of cache lines → Set Associative

# **Direct Mapped Cache**

- Each block number mapped to a single cache line index
- Simplest hardware

(Figure: memory words at addresses 0x000000 through 0x00004c mapping onto cache lines 0 through 3.)

# Assume sixteen 64-byte cache lines

n bit index, m bit offset

Q: How big is cache (data only)?

Q: How much SRAM needed (data + overhead)?
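As a rough sketch of how a direct-mapped cache might split an address, the C fragment below assumes 32-bit byte addresses, sixteen 64-byte lines (so n = 4 index bits and m = 6 offset bits), and one tag plus one valid bit of overhead per line. All function and constant names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical parameters: 2^INDEX_BITS lines of 2^OFFSET_BITS bytes each,
 * with ADDR_BITS-wide byte addresses (an assumption of this sketch). */
enum { OFFSET_BITS = 6, INDEX_BITS = 4, ADDR_BITS = 32 };

/* Byte position within the cache line: the low OFFSET_BITS bits. */
uint32_t addr_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}

/* Cache line selected by the address: the next INDEX_BITS bits. */
uint32_t addr_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

/* Remaining high bits, kept alongside the line to identify the block. */
uint32_t addr_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

/* Data capacity in bytes: number of lines times bytes per line. */
uint32_t data_bytes(void) {
    return (1u << INDEX_BITS) << OFFSET_BITS;
}

/* Overhead in bits, assuming one tag and one valid bit per line. */
uint32_t overhead_bits(void) {
    uint32_t tag_bits = ADDR_BITS - INDEX_BITS - OFFSET_BITS;
    return (tag_bits + 1) << INDEX_BITS;
}
```

Under these assumptions, `data_bytes()` gives 1024 and `overhead_bits()` gives 368, and an address such as 0x7c9a2b1e would land in line 12 at offset 30.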
#### Using byte addresses in this example! Addr Bus = 5 bits

n bit index, m bit offset

Q: How big is cache (data only)?

Q: How much SRAM needed (data + overhead)?

Overhead for 32-bit addresses, with a $(32 - n - m)$-bit tag plus one valid bit for each of the $2^n$ lines:

$$(32 - n - m + 1) \cdot 2^n \text{ bits}$$

# Cache Performance (very simplified)

L1 (SRAM): 512 x 64 byte cache lines, direct mapped

- Data cost: 3 cycles per word access
- Lookup cost: 2 cycles

Mem (DRAM): 4GB

- Data cost: 50 cycles per word, plus 3 cycles per consecutive word

Cache hit: 2 + 3 = 5 cycles

Cache miss: 5 + 50 + 3 · 15 = 100 cycles (a 64-byte line holds 16 words: 50 cycles for the first, 3 cycles for each of the other 15)

Performance depends on: access time for hit, miss penalty, hit rate

# Cache misses: classification

The line is being referenced for the first time

- Cold (aka Compulsory) Miss

The line was in the cache, but has been evicted

#### Q: How to avoid...

#### **Cold Misses**

- Unavoidable? The data was never in the cache...
- Prefetching!

#### **Other Misses**

- Buy more SRAM
- Use a more flexible cache design

# Bigger cache doesn't always help...

Memcpy access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, ...

Hit rate with four direct-mapped 2-byte cache lines?

With eight 2-byte cache lines?

# Cache misses: classification

The line is being referenced for the first time

- Cold (aka Compulsory) Miss

The line was in the cache, but has been evicted...

- ... because of some other access with the same index
  - Conflict Miss
- ... because the cache is too small
  - i.e. the working set of the program is larger than the cache
  - Capacity Miss

#### Q: How to avoid...

#### **Cold Misses**

- Unavoidable? The data was never in the cache...
- Prefetching!

#### **Capacity Misses**

- Buy more SRAM

#### **Conflict Misses**

- Use a more flexible cache design

# A given data block can be placed...

- ... in any cache line → Fully Associative
- ... in exactly one cache line → Direct Mapped
- ... in a small set of cache lines → Set Associative

#### Using byte addresses in this example! Addr Bus = 5 bits

Address fields: Tag, Offset

*m* bit offset, 2<sup>n</sup> cache lines

Q: How big is cache (data only)?
Q: How much SRAM needed (data + overhead)?

# Fully-associative reduces conflict misses...

... assuming good eviction strategy

Memcpy access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, ...

Hit rate with four fully-associative 2-byte cache lines?

# ... but large block size can still reduce hit rate

Vector add access trace: 0, 100, 200, 1, 101, 201, 2, 102, 202, ...

Hit rate with four fully-associative 2-byte cache lines?

With two 4-byte cache lines?

# Cache misses: classification

#### Cold (aka Compulsory)

- The line is being referenced for the first time

#### Capacity

- The line was evicted because the cache was too small
- i.e. the working set of the program is larger than the cache

#### Conflict

- The line was evicted because of another access whose index conflicted

# Summary

#### Caching assumptions

- small working set: 90/10 rule
- can predict future: spatial & temporal locality

#### **Benefits**

- big & fast memory built from (big & slow) + (small & fast)

#### **Tradeoffs:**

associativity, line size, hit cost, miss penalty, hit rate

- Fully Associative → higher hit cost, higher hit rate
- Larger block size → lower hit cost, higher miss penalty

Next up: other designs; writing to caches
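The hit-rate questions posed for the memcpy and vector add traces can be explored with a small trace simulator. The sketch below is one possible way to do it, not the course's reference solution: it assumes byte addresses, a cache that starts empty, and LRU as the "good eviction strategy" for the fully associative case. The function names and the `MAX_LINES` bound are hypothetical.

```c
#include <stddef.h>

#define MAX_LINES 64  /* hypothetical upper bound for this sketch */

/* Count hits for a direct-mapped cache: each block maps to exactly one line. */
int direct_mapped_hits(const unsigned *trace, size_t n,
                       unsigned num_lines, unsigned line_bytes) {
    unsigned block_in_line[MAX_LINES];
    int valid[MAX_LINES] = {0};
    int hits = 0;
    if (num_lines == 0 || num_lines > MAX_LINES || line_bytes == 0) return -1;
    for (size_t i = 0; i < n; i++) {
        unsigned block = trace[i] / line_bytes;  /* strip the offset */
        unsigned index = block % num_lines;      /* the only legal line */
        if (valid[index] && block_in_line[index] == block) {
            hits++;
        } else {                                 /* miss: replace the line */
            valid[index] = 1;
            block_in_line[index] = block;
        }
    }
    return hits;
}

/* Count hits for a fully associative cache, assuming LRU eviction. */
int fully_assoc_lru_hits(const unsigned *trace, size_t n,
                         unsigned num_lines, unsigned line_bytes) {
    unsigned block_in_line[MAX_LINES];
    size_t last_used[MAX_LINES];
    unsigned used = 0;                 /* lines currently holding a block */
    int hits = 0;
    if (num_lines == 0 || num_lines > MAX_LINES || line_bytes == 0) return -1;
    for (size_t i = 0; i < n; i++) {
        unsigned block = trace[i] / line_bytes;
        unsigned line = num_lines;     /* sentinel: block not found yet */
        for (unsigned j = 0; j < used; j++)
            if (block_in_line[j] == block) { line = j; break; }
        if (line != num_lines) {
            hits++;
        } else if (used < num_lines) { /* miss, but a line is still empty */
            line = used++;
        } else {                       /* miss: evict least recently used */
            line = 0;
            for (unsigned j = 1; j < num_lines; j++)
                if (last_used[j] < last_used[line]) line = j;
        }
        block_in_line[line] = block;
        last_used[line] = i;           /* record this access for LRU */
    }
    return hits;
}
```

On the first ten memcpy accesses (0, 16, 1, 17, 2, 18, 3, 19, 4, 20), `direct_mapped_hits` returns 0 with either four or eight 2-byte lines, while `fully_assoc_lru_hits` returns 4 with four 2-byte lines. On the vector add pattern 0, 100, 200, 1, 101, 201, 2, 102, 202, the fully associative cache gets 3 hits with four 2-byte lines and none with two 4-byte lines, since three interleaved streams cannot all fit in two lines.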