
ECE 4750 Computer Architecture

Topic 6: Cache Microarchitecture


Christopher Batten
School of Electrical and Computer Engineering
Cornell University
http://www.csl.cornell.edu/courses/ece4750
slide revision: 2013-10-01-22-23

Single-Bank Cache uArch

Multi-Bank Cache uArch

Basic Optimizations

Cache Examples

Agenda

Single-Bank Cache Microarchitecture


Multi-Bank Cache Microarchitecture
Basic Optimizations
Cache Examples

ECE 4750

T06: Cache Microarchitecture


Direct Mapped Cache Microarchitecture


Set Associative Cache Microarchitecture


Fully Associative Cache Microarchitecture


Synchronous SRAMs


Direct-Mapped Parallel Read Hit Path


Direct Mapped Pipelined Write Hit Path


Set-Associative Parallel Read Hit Path


Processor-Cache Interaction

[Figure: pipeline datapath with a primary instruction cache (addr, inst, hit?) in the fetch stage and a primary data cache (we, addr, wdata, rdata, hit?) in the memory stage. The entire CPU stalls on a data cache miss. Both caches connect to memory control, and cache refill data comes from lower levels of the memory hierarchy.]


Zero-Cycle Hit Latency with Tightly Coupled Interface


[Figure: five-stage pipeline datapath with stages Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M), and Writeback (W). imem and dmem are accessed directly through addr, wdata, and rdata ports. The dmem tag check and data access both happen in the M stage.]

Two-Cycle Hit Latency with Val/Rdy Interface


[Figure: pipeline datapath with stages Fetch (F0/F1), Decode & Reg Read (D), Execute (X), Memory (M0/M1), and Writeback (W). imem and dmem are accessed through memreq (val/rdy) and memresp (val) interfaces, and !rdy feeds the stall logic. The dmem tag check and data access span M0/M1.]

Parallel Read, Pipelined Write Hit Path


[Figure: pipeline datapath with stages Fetch (F), Decode & Reg Read (D), Execute (X), Memory (M), and Writeback (W), using memreq (val/rdy) and memresp (val) interfaces to imem and dmem. Tag check and read access happen in M; write access happens in W. New hazards?]


Multicore PARC w/ Shared Unified I/D $


Multicore PARC w/ Shared Single-Bank I$ and D$


Multicore PARC w/ Shared Multi-Bank I$ and D$


Multicore PARC w/ Private I$ and D$


Multicore PARC w/ Private I$ and Shared D$


Reduce Average Memory Access Time

Avg Mem Access Time = Hit Time + ( Miss Rate × Miss Penalty )

- Reduce hit time
  - Small and simple caches

- Reduce miss rate
  - Large block size
  - Large cache size
  - High associativity
  - Compiler optimizations

- Reduce miss penalty
  - Multi-level cache hierarchy
  - Prioritize reads
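
As a small worked example of the average memory access time formula, the sketch below computes it for a single cache level. The function name `amat` and the sample numbers in the note that follows are illustrative assumptions, not values from the course.

```c
#include <assert.h>

/* Average memory access time for one cache level, in cycles.
 * hit_time and miss_penalty are in cycles; miss_rate is a fraction in [0, 1].
 * The name and signature are hypothetical, chosen for this sketch. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With an assumed 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty, `amat(1.0, 0.05, 100.0)` works out to 1 + 0.05 × 100 = 6 cycles. Each of the three terms corresponds to one group of optimizations in the list above.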

Reduce Hit Time: Small & Simple Caches


Reduce Miss Rate: Large Block Size

- Less tag overhead
- Exploit fast burst transfers from DRAM
- Exploit fast burst transfers over wide on-chip buses

- Can waste bandwidth if data is not used
- Fewer blocks -> more conflicts


Reduce Miss Rate: Large Cache Size

Empirical Rule of Thumb:

If cache size is doubled, miss rate usually drops by about a factor of √2
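
To make the rule of thumb concrete, here is a minimal sketch that scales a starting miss rate by 1/√2 per doubling of capacity. The function name and the sample sizes in the note are assumptions for illustration, and the rule itself is only an empirical approximation.

```c
#include <assert.h>

/* Estimate the miss rate after growing a cache, assuming each doubling of
 * capacity divides the miss rate by about sqrt(2) (empirical rule of thumb).
 * Sizes are in bytes; new_size is assumed to be base_size times a power of
 * two. Hypothetical helper, not part of any library. */
double scaled_miss_rate(double base_rate, unsigned long base_size,
                        unsigned long new_size)
{
    const double SQRT2 = 1.4142135623730951;
    double rate = base_rate;
    unsigned long size;

    assert(base_size > 0 && new_size >= base_size);
    for (size = base_size; size < new_size; size *= 2) {
        rate /= SQRT2;  /* one doubling of capacity */
    }
    return rate;
}
```

For instance, going from an assumed 16 KB cache with a 10% miss rate to 64 KB is two doublings, so the estimate is 0.10 / 2 = 0.05. Quadrupling the capacity is needed just to halve the miss rate.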


Reduce Miss Rate: High Associativity

Empirical Rule of Thumb:


Direct-mapped cache of size N has about the same miss rate as a two-way
set-associative cache of size N/2

Reduce Miss Rate: Compiler Optimizations


- Restructuring code affects the data block access sequence
  - Group data accesses together to improve spatial locality
  - Re-order data accesses to improve temporal locality

- Prevent data from entering the cache
  - Useful for variables that will only be accessed once before eviction
  - Needs mechanism for software to tell hardware not to cache data
    (no-allocate instruction hints or page table bits)

- Kill data that will never be used again
  - Streaming data exploits spatial locality but not temporal locality
  - Replace into dead-cache locations


Loop Interchange
/* before: inner loop walks down a column */
for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++) {
        x[i][j] = 2 * x[i][j];
    }
}

/* after: inner loop walks along a row */
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        x[i][j] = 2 * x[i][j];
    }
}

What type of locality does this improve?
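
One way to explore this question is to count misses for both loop orders in a toy direct-mapped cache model. The sketch below assumes a 64x64 array of 4-byte elements stored in row-major order and a 4 KB direct-mapped cache with 64-byte blocks. All of these parameters and the function name `count_misses` are hypothetical choices for illustration.

```c
#include <assert.h>

enum { M = 64, N = 64 };     /* array dimensions (assumed values) */
enum { ELEM_BYTES = 4 };     /* assumed size of one array element */
enum { BLOCK_BYTES = 64 };   /* assumed cache block size */
enum { NUM_SETS = 64 };      /* assumed direct-mapped cache: 64 blocks, 4 KB */

/* Count cache misses for x[i][j] = 2 * x[i][j] over the whole array.
 * interchanged == 0: j is the outer loop (the first version above).
 * interchanged != 0: i is the outer loop (the second version above).
 * Each element counts as one access, since the write that follows the
 * read of the same element always hits. */
long count_misses(int interchanged)
{
    long resident[NUM_SETS];   /* block number held by each set, -1 if empty */
    long misses = 0;
    int outer_limit = interchanged ? M : N;
    int inner_limit = interchanged ? N : M;
    int outer, inner, set;

    for (set = 0; set < NUM_SETS; set++) {
        resident[set] = -1;
    }

    for (outer = 0; outer < outer_limit; outer++) {
        for (inner = 0; inner < inner_limit; inner++) {
            int i = interchanged ? outer : inner;
            int j = interchanged ? inner : outer;
            long addr = ((long)i * N + j) * ELEM_BYTES;  /* row-major layout */
            long block = addr / BLOCK_BYTES;

            set = (int)(block % NUM_SETS);
            assert(set >= 0 && set < NUM_SETS);
            if (resident[set] != block) {   /* miss: refill this set */
                misses++;
                resident[set] = block;
            }
        }
    }
    return misses;
}
```

Under these assumptions, `count_misses(0)` is 4096 (every access misses) and `count_misses(1)` is 256 (one miss per 64-byte block). The model stores the full block number in place of a tag and ignores everything except this one array, so it is a rough sketch, not a cache simulator.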



Loop Fusion
/* before: two separate passes over the arrays */
for (i = 0; i < N; i++)
    a[i] = b[i] * c[i];
for (i = 0; i < N; i++)
    d[i] = a[i] * c[i];

/* after: one fused pass */
for (i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
    d[i] = a[i] * c[i];
}

What type of locality does this improve?



Reduce Miss Penalty: Multi-Level Caches


[Figure: Processor -> L1 Cache -> L2 Cache -> Main Memory, showing the path for an L1 hit and the path for an L1 miss that hits in L2.]

Avg Mem Access Time = Hit Time of L1 + ( Miss Rate of L1 × Miss Penalty of L1 )

Miss Penalty of L1 = Hit Time of L2 + ( Miss Rate of L2 × Miss Penalty of L2 )

- Local miss rate = misses in cache / accesses to cache
- Global miss rate = misses in cache / processor memory accesses
- Misses per instruction = misses in cache / number of instructions
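
The two-level formula and the local versus global miss rate distinction can be sketched as follows. The function names and the sample numbers in the note are illustrative assumptions, and the global-rate helper assumes every L2 access is caused by an L1 miss.

```c
#include <assert.h>

/* Average memory access time with an L1 and an L2 cache, in cycles.
 * Both miss rates are local miss rates (misses / accesses to that cache).
 * Hypothetical helper written for this sketch. */
double amat_two_level(double l1_hit_time, double l1_miss_rate,
                      double l2_hit_time, double l2_miss_rate,
                      double l2_miss_penalty)
{
    double l1_miss_penalty;

    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_miss_rate >= 0.0 && l2_miss_rate <= 1.0);

    /* Miss Penalty of L1 = Hit Time of L2 + (Miss Rate of L2 x Miss Penalty of L2) */
    l1_miss_penalty = l2_hit_time + l2_miss_rate * l2_miss_penalty;
    return l1_hit_time + l1_miss_rate * l1_miss_penalty;
}

/* Global L2 miss rate: L2 misses per processor memory access, assuming the
 * L2 only sees accesses that missed in the L1. */
double l2_global_miss_rate(double l1_miss_rate, double l2_local_miss_rate)
{
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_local_miss_rate >= 0.0 && l2_local_miss_rate <= 1.0);
    return l1_miss_rate * l2_local_miss_rate;
}
```

With an assumed 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% local L2 miss rate, and 100-cycle L2 miss penalty, the L1 miss penalty is 10 + 0.2 × 100 = 30 cycles and the average access time is 1 + 0.05 × 30 = 2.5 cycles. The same L2 has a global miss rate of 0.05 × 0.2 = 1%, which is why a local L2 miss rate can look high without hurting much.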

Reduce Miss Penalty: Multi-Level Caches


- Use smaller L1 if there is also an L2
  - Trade increased L1 miss rate for reduced L1 hit time & L1 miss penalty
  - Reduces average access energy

- Use simpler write-through L1 with on-chip L2
  - Write-back L2 cache absorbs write traffic, doesn't go off-chip
  - Simplifies processor pipeline
  - Simplifies on-chip coherence issues

- Inclusive Multilevel Cache
  - Inner cache holds copy of data in outer cache
  - External coherence is simpler

- Exclusive Multilevel Cache
  - Inner cache holds data not in outer cache
  - Swap lines between inner/outer cache on miss


Reduce Miss Penalty: Prioritize Reads


[Figure: CPU and register file (RF) connected to a data cache, with a write buffer between the data cache and a unified L2 cache. The write buffer holds evicted dirty lines for a writeback cache, or all writes for a writethrough cache.]

- Processor not stalled on writes, and read misses can go ahead of writes to main memory

- Write buffer may hold updated value of location needed by read miss
  - Simple option: on read miss, wait for write buffer to be empty
  - Faster option: check write buffer addresses and bypass
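
The address check on a read miss can be sketched with a small software model of the write buffer. The depth, the one-word-per-entry layout, and the names `write_buffer_t`, `wb_push`, and `wb_bypass` are all assumptions made for this sketch; real write buffers hold whole lines and do the comparison in parallel hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { WB_ENTRIES = 4 };   /* hypothetical write buffer depth */

typedef struct {
    uint32_t addr[WB_ENTRIES];   /* address of each buffered write */
    uint32_t data[WB_ENTRIES];   /* one word per entry, to keep the sketch small */
    int count;                   /* entries 0..count-1 are valid, oldest first */
} write_buffer_t;

/* Queue a write. Returns false when the buffer is full, in which case the
 * caller would have to stall until an entry drains. */
bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data)
{
    assert(wb->count >= 0 && wb->count <= WB_ENTRIES);
    if (wb->count == WB_ENTRIES) {
        return false;
    }
    wb->addr[wb->count] = addr;
    wb->data[wb->count] = data;
    wb->count++;
    return true;
}

/* On a read miss, compare the miss address against buffered writes, newest
 * first. On a match, return the buffered value through *data. With no match,
 * the read miss can safely go ahead of the buffered writes. */
bool wb_bypass(const write_buffer_t *wb, uint32_t addr, uint32_t *data)
{
    int i;

    assert(wb->count >= 0 && wb->count <= WB_ENTRIES);
    for (i = wb->count - 1; i >= 0; i--) {
        if (wb->addr[i] == addr) {
            *data = wb->data[i];
            return true;
        }
    }
    return false;
}
```

Searching newest to oldest matters when the same address was written twice: the read must see the most recent value. The simple option from the list above needs none of this, since it drains the buffer before the read miss is serviced.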


Cache Optimizations Impact on Average Memory Access Time

[Table columns: Technique, Hit Time, Miss Rate, Miss Penalty, HW. The +/- entries of the Hit Time, Miss Rate, and Miss Penalty columns are not recoverable here; the HW column is listed below.]

Technique                  HW
Parallel read hit          0
Pipelined write hit        1
Smaller caches             0
Large block size           0
Large cache size           1
High associativity         1
Compiler optimizations     0
Multi-level cache          2
Prioritize reads           1


Itanium-2 On-Chip Caches


Intel Itanium-2 (Intel/HP, 2002)

On-Chip Caches
- Level 1: 16KB, 4-way set-associative, 64B line, quad-port (2 load + 2 store), single-cycle latency
- Level 2: 256KB, 4-way set-associative, 128B line, quad-port (4 load or 4 store), five-cycle latency
- Level 3: 3MB, 12-way set-associative, 128B line, single 32B port, twelve-cycle latency

(Slide from CS152, Spring 2010, February 9, 2010)

IBM Power-7 On-Chip Caches

Power 7 On-Chip Caches [IBM 2009]

- L1: 32KB I$/core and 32KB D$/core, 3-cycle latency
- L2: 256KB unified L2$/core, 8-cycle latency
- L3: 32MB unified shared L3$, embedded DRAM, 25-cycle latency to local slice

Acknowledgements

Some of these slides contain material developed and copyrighted by:


Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT)
James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from courses CS152 and CS252
