## Caches

Hakim Weatherspoon
CS 3410, Spring 2012
Computer Science, Cornell University

See P&H 5.1, 5.2 (except writes)

# Big Picture: Memory

Memory: big & slow vs Caches: small & fast

## Goals for Today: caches

## Examples of caches:

- Direct Mapped
- Fully Associative
- N-way set associative

### Performance and comparison

- Hit ratio (conversely, miss ratio)
- Average memory access time (AMAT)
- Cache size

## Cache Performance

Average Memory Access Time (AMAT)

Cache Performance (very simplified):

L1 (SRAM): 512 x 64 byte cache lines, direct mapped

- Data cost: 3 cycles per word access
- Lookup cost: 2 cycles

Mem (DRAM): 4GB

- Data cost: 50 cycles per word, plus 3 cycles per consecutive word
- Fetching one 64-byte line (16 words) therefore adds $50 + 15 \times 3 = 95$ cycles

Performance depends on: access time for hit, miss penalty, hit rate

## Misses

Cache misses: classification

The line is being referenced for the first time

- Cold (aka Compulsory) Miss

The line was in the cache, but has been evicted

# **Avoiding Misses**

Q: How to avoid...

#### **Cold Misses**

- Unavoidable? The data was never in the cache...
- Prefetching!

#### **Other Misses**

- Buy more SRAM
- Use a more flexible cache design

# Bigger cache doesn't always help...

Mem access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, ...

Hit rate with four direct-mapped 2-byte cache lines?

With eight 2-byte cache lines?

With four 4-byte cache lines?

## Misses

Cache misses: classification

The line is being referenced for the first time

- Cold (aka Compulsory) Miss

The line was in the cache, but has been evicted...

- ... because of some other access with the same index: Conflict Miss
- ... because the cache is too small, i.e. the working set of the program is larger than the cache: Capacity Miss

## **Avoiding Misses**

Q: How to avoid...

#### **Cold Misses**

- Unavoidable? The data was never in the cache...
- Prefetching!

#### **Capacity Misses**

- Buy more SRAM

#### **Conflict Misses**

- Use a more flexible cache design

# Three common designs

A given data block can be placed...
- ... in any cache line → Fully Associative
- ... in exactly one cache line → Direct Mapped
- ... in a small set of cache lines → Set Associative

# Comparison: Direct Mapped

Using byte addresses in this example! Addr Bus = 5 bits

Cache: 4 cache lines, 2 word block

- 2 bit tag field
- 2 bit index field
- 1 bit block offset field

Processor (the first access, M[ 1 ], is address 00001):

| Access | Result |
|--------|--------|
| LB \$1 ← M[ 1 ] | M |
| LB \$2 ← M[ 5 ] | M |
| LB \$3 ← M[ 1 ] | H |
| LB \$3 ← M[ 4 ] | H |
| LB \$2 ← M[ 0 ] | H |
| LB \$2 ← M[ 12 ] | M |
| LB \$2 ← M[ 5 ] | M |
| LB \$2 ← M[ 12 ] | M |
| LB \$2 ← M[ 5 ] | M |
| LB \$2 ← M[ 12 ] | M |
| LB \$2 ← M[ 5 ] | M |

Misses: 8

Hits: 3

Cache contents after the last access:

| Index | Valid | Tag | Data |
|-------|-------|-----|------|
| 0 | 1 | 00 | 100, 110 |
| 1 | 0 | | |
| 2 | 1 | 00 | 140, 150 |
| 3 | 0 | | |

Memory:

| Addr | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|------|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|
| Data | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 | 190 | 200 | 210 | 220 | 230 | 240 | 250 |

# Comparison: Fully Associative

Using byte addresses in this example!
Addr Bus = 5 bits

Cache: 4 cache lines, 2 word block

- 4 bit tag field
- 1 bit block offset field

Processor (the first access, M[ 1 ], is address 00001):

| Access | Result |
|--------|--------|
| LB \$1 ← M[ 1 ] | M |
| LB \$2 ← M[ 5 ] | M |
| LB \$3 ← M[ 1 ] | H |
| LB \$3 ← M[ 4 ] | H |
| LB \$2 ← M[ 0 ] | H |
| LB \$2 ← M[ 12 ] | M |
| LB \$2 ← M[ 5 ] | H |
| LB \$2 ← M[ 12 ] | H |
| LB \$2 ← M[ 5 ] | H |
| LB \$2 ← M[ 12 ] | H |
| LB \$2 ← M[ 5 ] | H |

Misses: 3

Hits: 8

Cache contents after the last access:

| Line | Valid | Tag | Data |
|------|-------|-----|------|
| 0 | 1 | 0000 | 100, 110 |
| 1 | 1 | 0010 | 140, 150 |
| 2 | 1 | 0110 | 220, 230 |
| 3 | 0 | | |

Memory: same contents as in the direct-mapped example (addresses 0 through 15 hold 100 through 250, in steps of 10).

## Comparison: 2 Way Set Assoc

Using byte addresses in this example! Addr Bus = 5 bits
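The access trace from these comparison slides can be replayed in a few lines of code. The following is a minimal sketch, not the course's reference implementation: the function name `simulate` and its parameters are made up for illustration, and LRU replacement is an assumption (the slides only ask for a "good eviction strategy"). Setting `ways` to 1 gives a direct-mapped cache, and setting it to the number of lines gives a fully associative one.

```python
from collections import OrderedDict

def simulate(trace, num_lines, block_bytes, ways):
    """Replay byte addresses through a set-associative cache.

    Hypothetical helper: LRU replacement is assumed.
    Returns a (misses, hits) tuple.
    """
    num_sets = num_lines // ways
    # One ordered dict per set; keys are tags, ordered from LRU to MRU.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = hits = 0
    for addr in trace:
        block = addr // block_bytes   # drop the block offset bits
        index = block % num_sets      # which set the block maps to
        tag = block // num_sets       # remaining high-order bits
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)    # mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:    # set is full: evict the LRU line
                lines.popitem(last=False)
            lines[tag] = True
    return misses, hits

# Byte addresses of the LB instructions on the comparison slides.
TRACE = [1, 5, 1, 4, 0, 12, 5, 12, 5, 12, 5]

# 4 cache lines of 2 bytes each, varying only the associativity.
direct_mapped = simulate(TRACE, num_lines=4, block_bytes=2, ways=1)
two_way = simulate(TRACE, num_lines=4, block_bytes=2, ways=2)
fully_assoc = simulate(TRACE, num_lines=4, block_bytes=2, ways=4)

print("direct mapped   (misses, hits):", direct_mapped)
print("2-way set assoc (misses, hits):", two_way)
print("fully assoc     (misses, hits):", fully_assoc)
```

Under these assumptions the direct-mapped and fully associative counts agree with the slides (8 misses / 3 hits and 3 misses / 8 hits). The 2-way count comes only from this sketch and depends on the LRU assumption.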
## Cache Size

# Direct Mapped Cache (Reading)

# Direct Mapped Cache Size

Address fields: Tag | Index | Offset

n bit index, m bit offset

Q: How big is cache (data only)?

Q: How much SRAM needed (data + overhead)?

- Cache of size $2^n$ blocks
- Block size of $2^m$ bytes
- Tag field: $32 - (n + m)$ bits
- Valid bit: 1

Bits in cache:

$2^n \times (\text{block size} + \text{tag size} + \text{valid bit size}) = 2^n \times (2^m \text{ bytes} \times 8 \text{ bits per byte} + (32 - n - m) + 1)$

# Fully Associative Cache (Reading)

# Fully Associative Cache Size

Address fields: Tag | Offset

m bit offset, $2^n$ cache lines

Q: How big is cache (data only)?

- Number of cache lines × block size = $2^n \times 2^m$ bytes = $2^{n+m}$ bytes

Q: How much SRAM needed (data + overhead)?

- Cache of size $2^n$ blocks
- Block size of $2^m$ bytes
- Tag field: $32 - m$ bits
- Valid bit: 1

Bits in cache:

$2^n \times (\text{block size} + \text{tag size} + \text{valid bit size}) = 2^n \times (2^m \text{ bytes} \times 8 \text{ bits per byte} + (32 - m) + 1)$

## Fully-associative reduces conflict misses...

... assuming good eviction strategy

Mem access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, ...

Hit rate with four fully-associative 2-byte cache lines?

... but large block size can still reduce hit rate

Vector add trace: 0, 100, 200, 1, 101, 201, 2, 102, 202, ...

Hit rate with four fully-associative 2-byte cache lines?

With two fully-associative 4-byte cache lines?

## Misses

Cache misses: classification

#### Cold (aka Compulsory)

- The line is being referenced for the first time

#### Capacity

- The line was evicted because the cache was too small
- i.e.
the working set of the program is larger than the cache

#### Conflict

- The line was evicted because of another access whose index conflicted

# Cache Tradeoffs

| Direct Mapped | | Fully Associative |
|---------------|----------------------|-------------------|
| + Smaller | Tag Size | Larger - |
| + Less | SRAM Overhead | More - |
| + Less | Controller Logic | More - |
| + Faster | Speed | Slower - |
| + Less | Price | More - |
| + Very | Scalability | Not Very - |
| - Lots | # of conflict misses | Zero + |
| - Low | Hit rate | High + |
| - Common | Pathological Cases? | ? |

### Administrivia

Prelim2 *today*, Thursday, March 29th, at 7:30pm

- Location is Phillips 101

Project2 due next Monday, April 2nd

# Summary

#### Caching assumptions

- small working set: 90/10 rule
- can predict future: spatial & temporal locality

#### Benefits

- big & fast memory built from (big & slow) + (small & fast)

#### **Tradeoffs:**

associativity, line size, hit cost, miss penalty, hit rate

- Fully Associative → higher hit cost, higher hit rate
- Larger block size → lower hit cost, higher miss penalty

Next up: other designs; writing to caches
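The cache size formulas and the hit cost / miss penalty / hit rate tradeoff can be tried out on the simplified L1 from the Cache Performance slide (512 lines of 64 bytes, so n = 9 and m = 6). This is a rough sketch under stated assumptions: `cache_bits` and `amat` are illustrative names, the AMAT expression is the standard textbook form (hit time + miss rate × miss penalty), the hit time of 2 + 3 cycles is one possible reading of the slide's lookup and data costs, and the 10% miss rate is an arbitrary example value.

```python
ADDRESS_BITS = 32  # address width used in the slides' size formulas

def cache_bits(n, m, fully_associative=False):
    """Total SRAM bits for 2**n lines of 2**m bytes each.

    Each line stores its data, a tag, and one valid bit.
    """
    if fully_associative:
        tag_bits = ADDRESS_BITS - m        # no index field
    else:
        tag_bits = ADDRESS_BITS - n - m    # direct mapped
    data_bits = (2 ** m) * 8
    return (2 ** n) * (data_bits + tag_bits + 1)

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles (standard textbook form)."""
    return hit_time + miss_rate * miss_penalty

N, M = 9, 6                            # 512 lines x 64 bytes
data_only_bits = 2 ** (N + M) * 8      # data without tag/valid overhead
dm_bits = cache_bits(N, M)
fa_bits = cache_bits(N, M, fully_associative=True)

print("data only:        ", data_only_bits, "bits")
print("direct mapped:    ", dm_bits, "bits")
print("fully associative:", fa_bits, "bits")

HIT_TIME = 2 + 3            # assumption: lookup cost + one word of data
MISS_PENALTY = 50 + 15 * 3  # first word + 15 consecutive words of a line
EXAMPLE_MISS_RATE = 0.10    # assumption: illustrative value only
example_amat = amat(HIT_TIME, EXAMPLE_MISS_RATE, MISS_PENALTY)
print("AMAT at 10% miss rate:", example_amat, "cycles")
```

With these numbers the fully associative tags cost 9 more bits per line than the direct-mapped ones (26 versus 17), which is the "SRAM Overhead" row of the tradeoffs table in concrete terms.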