
Cache Memories

Jin-Soo Kim (jinsoo.kim@snu.ac.kr)
Systems Software & Architecture Lab.
Seoul National University
Fall 2018
▪ Small, fast SRAM-based memories managed automatically in hardware
• Hold frequently accessed blocks of main memory
▪ CPU looks first for data in cache
▪ Typical system structure:
[Figure: the CPU chip contains the register file, ALU, and cache; the bus interface connects through the system bus, I/O bridge, and memory bus to main memory.]
4190.308: Computer Architecture | Fall 2018 | Jin-Soo Kim (jinsoo.kim@snu.ac.kr) 2

▪ General cache organization (S, E, B)
  • S = 2^s sets
  • E = 2^e lines (blocks) per set
  • Each line holds a valid bit, a tag, and a block of B = 2^b data bytes (bytes 0 .. B-1)
  • Cache size: C = S x E x B data bytes
▪ Cache read
  • Locate the set using the s set-index bits
  • Check if any line in the set has a matching tag
  • Yes + line valid: hit
  • Locate the data starting at the block offset

▪ Address of word: t tag bits | s set-index bits | b block-offset bits
  • The data begins at the block offset within the B = 2^b bytes of the matching line
▪ Direct mapped: one line per set
▪ Assume: cache block size 8 bytes
  • Address of int: t bits | 0…01 | 100
  • Use the set-index bits (0…01) to find the set among the S = 2^s sets


▪ Valid? + tag match: assume yes = hit
  • The block offset (100) locates the data within the 8-byte block


▪ Within the matching line, the block offset (100) locates the data: the int (4 bytes) starts at byte 4 of the block
▪ No match: the old line is evicted and replaced
▪ Direct-mapped cache simulation
  • t=1, s=2, b=1; M=16 bytes (4-bit addresses)
  • B=2 bytes/block, S=4 sets, E=1 line/set

▪ Address trace (reads, one byte per read):
  • 0 [0000₂]: miss (cold)
  • 1 [0001₂]: hit
  • 7 [0111₂]: miss (cold)
  • 8 [1000₂]: miss (cold), evicting M[0-1] from set 0
  • 0 [0000₂]: miss (conflict)

▪ Final contents: set 0 holds tag 0, M[0-1]; set 3 holds tag 0, M[6-7]; sets 1 and 2 are empty


▪ Here E = 2: two lines per set
▪ Assume: cache block size 8 bytes
  • Address of short int: t bits | 0…01 | 100
  • Use the set-index bits to find the set among the S = 2^s sets


▪ Compare both lines in the set: valid? + tag match: yes = hit
  • The block offset (100) locates the short int (2 bytes) within the block
▪ No match: one line in the set is selected for eviction and replacement
  • Replacement policies: random, LRU, …
▪ 2-way set-associative cache simulation
  • t=2, s=1, b=1; M=16 bytes (4-bit addresses)
  • B=2 bytes/block, S=2 sets, E=2 lines/set

▪ Address trace (reads, one byte per read):
  • 0 [0000₂]: miss
  • 1 [0001₂]: hit
  • 7 [0111₂]: miss
  • 8 [1000₂]: miss
  • 0 [0000₂]: hit

▪ Final contents: set 0 holds tag 00, M[0-1] and tag 10, M[8-9]; set 1 holds tag 01, M[6-7]


▪ Multiple copies of data exist: L1, L2, L3, Main Memory, Disk
▪ What to do on a write-hit?
• Write-through: write immediately to memory
• Write-back: defer write to memory until replacement of line
– Need a dirty bit (line different from memory or not)
▪ What to do on a write-miss?
• Write-allocate: load into cache, update line in cache
– Good if more writes to the location follow
• No-write-allocate: writes straight to memory, does not load into cache
▪ Typical: Write-through + No-write-allocate
Write-back + Write-allocate
▪ Intel Core i7 cache hierarchy (processor package)
  • Each core (core 0 … core 3): registers, L1 i-cache and d-cache, and an L2 unified cache
  • L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
  • L2 unified cache: 256 KB, 8-way, access: 10 cycles
  • L3 unified cache (shared by all cores): 8 MB, 16-way, access: 40-75 cycles
  • Block size: 64 bytes for all caches
  • The L3 cache connects to main memory


▪ Miss rate
• Fraction of memory references not found in cache (misses / accesses) = 1 – hit rate
• Typically, 3 – 10% for L1
• Can be quite small (e.g., < 1%) for L2 depending on size, etc.
▪ Hit time
• Time to deliver a line in the cache to the processor
• Includes time to determine whether the line is in the cache
• Typically, 4 clock cycles for L1 and 10 clock cycles for L2
▪ Miss penalty
• Additional time required because of a miss
• Typically, 50 – 200 cycles for main memory (Trend: increasing!)
▪ Huge difference between a hit and a miss
• Could be 100x, if just L1 and main memory

▪ Would you believe 99% hits is twice as good as 97%?


• Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
• Average access time:
  – 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
  – 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles

▪ This is why “miss rate” is used instead of “hit rate”


▪ Read throughput (read bandwidth)
• Number of bytes read from memory per second (MB/s)

▪ Memory mountain:
Measured read throughput as a function of spatial and temporal locality
• Compact way to characterize memory system performance



▪ Call test() with many combinations of elems and stride
▪ For each elems and stride:
  1. Call test() once to warm up the caches.
  2. Call test() again and measure the read throughput (MB/s).

long data[MAXELEMS];   /* Global array to traverse */

/* test - Iterate over first "elems" elements of array "data"
 * with stride of "stride", using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2 = stride*2, sx3 = stride*3, sx4 = stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems, limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}



▪ The memory mountain (Core i7 Haswell, 2.1 GHz; 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache, 64 B block size)
  • [Figure: read throughput (MB/s, 0-16000) as a function of working-set size (32 KB - 128 MB) and stride (s1-s11, x8 bytes)]
  • Ridges of temporal locality: throughput plateaus where the working set fits in L1, L2, L3, or main memory
  • Slopes of spatial locality: throughput falls as the stride grows
  • Aggressive prefetching keeps throughput high at small strides
▪ Make the common case go fast
• Focus on the inner loops of the core functions

▪ Minimize the misses in the inner loops


• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)

▪ Key idea: our qualitative notion of locality is quantified through our understanding of cache memories



▪ Description
  • Multiply N x N matrices: C = A x B
  • O(N^3) total operations
  • Variable sum held in register

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

▪ Assumptions
  • Line size = 32 bytes (big enough for 4 64-bit words)
  • Matrix dimension (N) is very large


▪ Matrix multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

▪ Inner loop access patterns: A row-wise, B column-wise, C fixed
▪ Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0


▪ Matrix multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

▪ Inner loop access patterns: A row-wise, B column-wise, C fixed
▪ Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0



▪ Matrix multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

▪ Inner loop access patterns: A fixed, B row-wise, C row-wise
▪ Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25



▪ Matrix multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

▪ Inner loop access patterns: A fixed, B row-wise, C row-wise
▪ Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25



▪ Matrix multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

▪ Inner loop access patterns: A column-wise, B fixed, C column-wise
▪ Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0



▪ Matrix multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

▪ Inner loop access patterns: A column-wise, B fixed, C column-wise
▪ Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0



▪ Summary
  • ijk (& jik): 2 loads, 0 stores; misses/iter = 0.25 (A) + 1.0 (B) + 0.0 (C) = 1.25
  • kij (& ikj): 2 loads, 1 store; misses/iter = 0.0 (A) + 0.25 (B) + 0.25 (C) = 0.5
  • jki (& kji): 2 loads, 1 store; misses/iter = 1.0 (A) + 0.0 (B) + 1.0 (C) = 2.0

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
▪ Performance on Core i7
  • [Figure: cycles per inner loop iteration (log scale, 1-100) vs. array size n (50-700) for all six orderings]
  • jki / kji: slowest
  • ijk / jik: in between
  • kij / ikj: fastest, consistent with the miss-per-iteration analysis
▪ Cache memories can have significant performance impact

▪ You can write your programs to exploit this!
  • Focus on the inner loops, where the bulk of computations and memory accesses occur
  • Try to maximize spatial locality by reading data objects sequentially with stride 1
  • Try to maximize temporal locality by using a data object as often as possible once it's read from memory

