Computer Architecture-Cache Microarchitecture

ECE 4750 Computer Architecture
Topic 6: Cache Microarchitecture

Christopher Batten
School of Electrical and Computer Engineering
Cornell University
http://www.csl.cornell.edu/courses/ece4750
slide revision: 2013-10-01-22-23
Single-Bank Cache uArch
Multi-Bank Cache uArch
Basic Optimizations
Cache Examples
Agenda
Single-Bank Cache Microarchitecture

Multi-Bank Cache Microarchitecture
Basic Optimizations
Cache Examples
ECE 4750
T06: Cache Microarchitecture
2 / 36
Direct Mapped Cache Microarchitecture
Set Associative Cache Microarchitecture
Fully Associative Cache Microarchitecture
Synchronous SRAMs
Direct-Mapped Parallel Read Hit Path
Direct Mapped Pipelined Write Hit Path
Set-Associative Parallel Read Hit Path
Processor-Cache Interaction
[Figure: pipelined processor datapath (PC, IR, Decode/Register Fetch, ALU) with a Primary Instruction Cache (addr, inst, hit?) and a Primary Data Cache (addr, we, wdata, rdata, hit?). Annotations: "Stall entire CPU on data cache miss"; "To Memory Control"; "Cache Refill Data from Lower Levels of Memory Hierarchy".]
Zero-Cycle Hit Latency with Tightly Coupled Interface

[Figure: pipeline datapath and control with stages Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M), Writeback (W). imem and dmem are connected directly through addr, rdata, and wdata ports. Labels on the memory path: Tag Check, Data Access.]
Two-Cycle Hit Latency with Val/Rdy Interface

[Figure: pipeline datapath and control with stages Fetch (F0/F1), Decode & Reg Read (D), Execute (X), Memory (M0/M1), Writeback (W). imem and dmem are reached through memreq (val/rdy) and memresp (val) interfaces; signals shown include !rdy. Labels on the memory path: Tag Check, Data Access.]
Parallel Read, Pipelined Write Hit Path

[Figure: pipeline datapath and control with stages Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M), Writeback (W). imem and dmem are reached through memreq (val/rdy) and memresp (val) interfaces. Labels on the data memory path: Tag Check, Read Access, Write Access. Annotation: "New Hazards?"]
Multicore PARC w/ Shared Unified I/D $
Multicore PARC w/ Shared Single-Bank I$ and D$
Multicore PARC w/ Shared Multi-Bank I$ and D$
Multicore PARC w/ Private I$ and D$
Multicore PARC w/ Private I$ and Shared D$
Reduce Average Memory Access Time
Avg Mem Access Time = Hit Time + ( Miss Rate × Miss Penalty )
- Reduce hit time
  - Small and simple caches
- Reduce miss rate
  - Large block size
  - Large cache size
  - High associativity
  - Compiler optimizations
- Reduce miss penalty
  - Multi-level cache hierarchy
  - Prioritize reads
Reduce Hit Time: Small & Simple Caches
Reduce Miss Rate: Large Block Size
- Less tag overhead
- Exploit fast burst transfers from DRAM
- Exploit fast burst transfers over wide on-chip busses
- Can waste bandwidth if data is not used
- Fewer blocks, more conflicts
Reduce Miss Rate: Large Cache Size
Empirical Rule of Thumb:
If cache size is doubled, miss rate usually drops by about a factor of √2
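
One way to read this rule as a formula, writing m(S) for the miss rate of a cache of size S (notation introduced here for illustration, not from the slides), is that each doubling divides the miss rate by about the square root of two:

```latex
\[
  m(2^{k} S) \;\approx\; \frac{m(S)}{(\sqrt{2})^{k}}, \qquad k = 0, 1, 2, \dots
\]
```

For instance, assuming a hypothetical 8% miss rate at 16KB, the rule suggests roughly 5.7% at 32KB and 4% at 64KB. It is an empirical trend, not a guarantee for any particular workload.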
Reduce Miss Rate: High Associativity
Empirical Rule of Thumb:

Direct-mapped cache of size N has about the same miss rate as a two-way
set-associative cache of size N/2
Reduce Miss Rate: Compiler Optimizations

- Restructuring code affects the data block access sequence
  - Group data accesses together to improve spatial locality
  - Re-order data accesses to improve temporal locality
- Prevent data from entering the cache
  - Useful for variables that will only be accessed once before eviction
  - Needs mechanism for software to tell hardware not to cache data (no-allocate instruction hints or page table bits)
- Kill data that will never be used again
  - Streaming data exploits spatial locality but not temporal locality
  - Replace into dead-cache locations
Loop Interchange
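    /* Before interchange: j is the outer loop, i is the inner loop */
    for(j=0; j < N; j++) {
        for(i=0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }

    /* After interchange: i is the outer loop, j is the inner loop */
    for(i=0; i < M; i++) {
        for(j=0; j < N; j++) {
            x[i][j] = 2 * x[i][j];
        }
    }

What type of locality does this improve?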

Loop Fusion
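    /* Before fusion: two separate passes over the arrays */
    for(i=0; i < N; i++)
        a[i] = b[i] * c[i];
    for(i=0; i < N; i++)
        d[i] = a[i] * c[i];

    /* After fusion: one pass, a[i] and c[i] are reused right away */
    for(i=0; i < N; i++)
    {
        a[i] = b[i] * c[i];
        d[i] = a[i] * c[i];
    }

What type of locality does this improve?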

Reduce Miss Penalty: Multi-Level Caches

[Figure: Processor, L1 Cache, L2 Cache, and Main Memory in sequence, with the cases "Hit" and "L1 Miss -- L2 Hit" marked.]

Avg Mem Access Time = Hit Time of L1 + ( Miss Rate of L1 × Miss Penalty of L1 )
Miss Penalty of L1 = Hit Time of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )

- Local miss rate = misses in cache / accesses to cache
- Global miss rate = misses in cache / processor memory accesses
- Misses per instruction = misses in cache / number of instructions
Reduce Miss Penalty: Multi-Level Caches

- Use smaller L1 if there is also an L2
  - Trade increased L1 miss rate for reduced L1 hit time & L1 miss penalty
  - Reduces average access energy
- Use simpler write-through L1 with on-chip L2
  - Write-back L2 cache absorbs write traffic, doesn't go off-chip
  - Simplifies processor pipeline
  - Simplifies on-chip coherence issues
- Inclusive Multilevel Cache
  - Inner cache holds copy of data in outer cache
  - External coherence is simpler
- Exclusive Multilevel Cache
  - Inner cache holds data not in outer cache
  - Swap lines between inner/outer cache on miss
Reduce Miss Penalty: Prioritize Reads

[Figure: CPU with RF, Data Cache, Write buffer, and Unified L2 Cache. The write buffer sits between the data cache and the L2 and holds evicted dirty lines for a writeback cache OR all writes in a writethrough cache.]

- Processor not stalled on writes, and read misses can go ahead of writes to main memory
- Write buffer may hold updated value of location needed by read miss
  - On read miss, wait for write buffer to be empty
  - Check write buffer addresses and bypass
Cache Optimizations Impact on Average Memory Access Time

Technique                 HW
Parallel read hit         0
Pipelined write hit       1
Smaller caches            0
Large block size          0
Large cache size          1
High associativity        1
Compiler optimizations    0
Multi-level cache         2
Prioritize reads          1

[The table also has Hit Time, Miss Rate, and Miss Penalty columns holding +/- impact marks; their per-row entries are not recoverable here.]
Itanium-2 On-Chip Caches

Intel Itanium-2 (Intel/HP, 2002)
From CS152, Spring 2010 (February 9, 2010)

On-Chip Caches
- Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load + 2 store), single cycle latency
- Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency
- Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve cycle latency
IBM Power-7 On-Chip Caches
Power 7 On-Chip Caches [IBM 2009]
- 32KB L1 I$/core; 32KB L1 D$/core; 3-cycle latency
- 256KB Unified L2$/core; 8-cycle latency
- 32MB Unified Shared L3$; Embedded DRAM; 25-cycle latency to local slice
Acknowledgements
Some of these slides contain material developed and copyrighted by:

Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT)
James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from courses CS152 and CS252