(jinsoo.kim@snu.ac.kr)
Systems Software &
Architecture Lab.
Seoul National University Cache Memories
Fall 2018
▪ Small, fast SRAM-based memories managed automatically in hardware
• Hold frequently accessed blocks of main memory
▪ CPU looks first for data in cache
▪ Typical system structure:
CPU chip
Register file
Cache
ALU
Memory
System bus Memory bus
I/O Main
Bus interface
bridge memory
set
line or block
S = 2s sets
Cache size:
v tag 0 1 2 B-1 C = S x E x B data bytes
Address of word:
t bits s bits b bits
S = 2s sets
tag set block
index offset
v tag 0 1 2 B-1
valid bit
B = 2b bytes per cache block (the data)
4190.308: Computer Architecture | Fall 2018 | Jin-Soo Kim (jinsoo.kim@snu.ac.kr) 4
▪ Direct mapped: one line per set
▪ Assume: cache block size 8 bytes
Address of int:
v tag 0 1 2 3 4 5 6 7
t bits 0…01 100
v tag 0 1 2 3 4 5 6 7
find set
S= 2s sets
v tag 0 1 2 3 4 5 6 7
v tag 0 1 2 3 4 5 6 7
Address of int:
valid? + match: assume yes = hit
t bits 0…01 100
v tag 0 1 2 3 4 5 6 7
block offset
Address of int:
valid? + match: assume yes = hit
t bits 0…01 100
v tag 0 1 2 3 4 5 6 7
block offset
v tag 0 1 2 3 4 5 6 7 v tag 0 1 2 3 4 5 6 7
v tag 0 1 2 3 4 5 6 7 v tag 0 1 2 3 4 5 6 7
v tag 0 1 2 3 4 5 6 7 v tag 0 1 2 3 4 5 6 7
v tag 0 1 2 3 4 5 6 7 v tag 0 1 2 3 4 5 6 7
L1 L1 L1 L1 L2 unified cache:
d-cache i-cache d-cache i-cache
… 256 KB, 8-way,
Access: 10 cycles
L2 unified cache L2 unified cache
L3 unified cache:
8 MB, 16-way,
Access: 40-75 cycles
L3 unified cache
(shared by all cores) Block size:
64 bytes for all caches
Main memory
▪ Memory mountain:
Measured read throughput as a function of spatial and temporal locality
• Compact way to characterize memory system performance
12000
32 KB L1 d-cache
256 KB L2 cache
10000 8 MB L3 cache
8000 Ridges 64 B block size
L2 of temporal
6000
locality
4000
2000 L3
Slopes
0
of spatial 32k
s1
locality s3 128k
s5 Mem 512k
2m
s7
Stride (x8 bytes) 8m
s9 Size (bytes)
32m
s11
128m
4190.308: Computer Architecture | Fall 2018 | Jin-Soo Kim (jinsoo.kim@snu.ac.kr) 18
▪ Make the common case go fast
• Focus on the inner loops of the core functions
A B C A B C A B C
0.25 1.0 0.0 0.0 0.25 0.25 1.0 0.0 1.0
4190.308: Computer Architecture | Fall 2018 | Jin-Soo Kim (jinsoo.kim@snu.ac.kr) 27
▪ Performance in Core i7
100
jki / kji
Cycles per inner loop iteration
kij / ikj
1
50 100 150 200 250 300 350 400 450 500 550 600 650 700
Array size (n)
4190.308: Computer Architecture | Fall 2018 | Jin-Soo Kim (jinsoo.kim@snu.ac.kr) 28
▪ Cache memories can have significant performance impact