
Cache Memory

CSE 410, Spring 2008 Computer Systems


http://www.cs.washington.edu/410

Reading and References


Reading
Computer Organization and Design, Patterson and Hennessy
Section 7.1: Introduction
Section 7.2: The Basics of Caches
Section 7.3: Measuring and Improving Cache Performance

Reference
OSC (The dino book), Chapter 8: focus on paging

CA:AQA (Computer Architecture: A Quantitative Approach), Chapter 5
See MIPS Run, D. Sweetman, Chapter 4
IBM and Cell chips: http://www.blachford.info/computer/Cell/Cell0_v2.html

Arthur W. Burks, Herman H. Goldstine, John von Neumann


"We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
This team recognized the need for a memory hierarchy as early as 1946
Cache: a safe place to store something (Webster)
Cache (CSE & EE): a small and fast local memory
Caches: some consider this to be ...


The Quest for Speed - Memory


If all memory accesses (IF/lw/sw) went to main memory, programs would run dramatically slower
Compare memory access times: 2^4 to 2^6 times slower, and it could be even worse depending on pipeline length!

And it's getting worse


processors speed up by ~50% annually
memory accesses speed up by ~9% annually
it's becoming harder and harder to keep these processors fed

A Solution: Memory Hierarchy*


Keep copies of the active data in the small, fast, expensive storage
Keep all data in the big, slow, cheap storage
[Diagram: fast, small, expensive storage at the top; slow, large, cheap storage at the bottom]


Memory Hierarchy**
Memory Level   Fabrication Tech   Access Time (ns)   Typical Size (bytes)   $/MB (circa ?)
Registers      Registers          < 0.5              256                    1000
L1 Cache       SRAM               0.5 - 5            64K                    100 (25)
L2 Cache       SRAM               10                 1M                     100 (5)
Memory         DRAM               100                512M                   100 (0.020)
Disk           Magnetic Disk      10M                100G                   0.0035 (0)

What is a Cache?
A subset* of a larger memory
A small and fast place to store frequently accessed items
Can be an instruction cache, a video cache, a streaming buffer/cache
Like a fisheye view of memory, where magnification == speed


What is a Cache?
A cache allows for fast accesses to a subset of a larger data store
Your web browser's cache gives you fast access to pages you visited recently
faster because it's stored locally
subset because the web won't fit on your disk

The memory cache gives the processor fast access to memory that it used recently
faster because it's fancy and usually located on the CPU chip
subset because the cache is smaller than main memory

Memory Hierarchy
[Diagram: CPU with registers and L1 cache, backed by the L2 cache, backed by main memory]



IBM's Cell Chips


Locality of reference
Temporal locality - nearness in time
Data being accessed now will probably be accessed again soon
Useful data tends to continue to be useful

Spatial locality - nearness in address


Data near the data being accessed now will probably be needed soon
Useful data is often accessed sequentially


Memory Access Patterns

Memory accesses don't usually look like this:
random accesses

Memory accesses do usually look like this:
hot variables
stepping through arrays
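As a concrete illustration (not from the original slides), the C sketch below contrasts a sequential array walk, which caches handle well, with a large-stride walk that touches a different block on almost every access; the array size and stride are arbitrary values chosen for the example.

```c
#include <stddef.h>

#define N (1 << 20)   /* 1M ints -- larger than a typical L1/L2 cache */

static int data[N];

/* Good spatial locality: consecutive words share a cache block,
 * so most accesses after the first one in a block are hits. */
long sum_sequential(void) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += data[i];
    return sum;
}

/* Poor spatial locality: a stride of 16 ints (64 bytes) lands each
 * access in a different block, so nearly every access misses. */
long sum_strided(void) {
    long sum = 0;
    for (size_t start = 0; start < 16; start++)
        for (size_t j = start; j < N; j += 16)
            sum += data[j];
    return sum;
}
```

Both functions sum the same data; only the access order, and therefore the cache behavior, differs.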

Cache Terminology
Hit and Miss
the data item is in the cache or the data item is not in the cache

Hit rate and Miss rate


the percentage of references for which the data item is in the cache (hit rate) or not in the cache (miss rate)

Hit time and Miss time


the time required to access data in the cache (cache access time) and the time required to access data not in the cache (memory access time)


Effective Access Time


t_effective = h * t_cache + (1 - h) * t_memory

where h is the cache hit rate, (1 - h) is the cache miss rate, t_cache is the cache access time, and t_memory is the memory access time

aka, Average Memory Access Time (AMAT)
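A minimal sketch of this calculation in C (the hit rate and access times below are assumed example values, not figures from the slides):

```c
#include <stdio.h>

/* Effective (average) memory access time:
 * t_effective = h * t_cache + (1 - h) * t_memory */
double effective_access_time(double hit_rate, double t_cache_ns, double t_memory_ns) {
    return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_memory_ns;
}

int main(void) {
    /* Assumed example values: 97% hit rate, 1 ns cache, 100 ns memory. */
    double amat = effective_access_time(0.97, 1.0, 100.0);
    printf("AMAT = %.2f ns\n", amat);   /* 0.97*1 + 0.03*100 = 3.97 ns */
    return 0;
}
```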



# of bits in a Cache
Caches are expensive memories that carry additional metadata. The number of bits needed for a cache is a function of the number of cache blocks as well as the address size (which enters through the tag-size calculation):
total bits = 2^n * (block size + tag size + valid bit)
where the cache has 2^n blocks and 2^m words per block; with 32-bit byte addresses, tag size = 32 - n - m - 2
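A small worked example of this formula (the cache parameters are assumptions chosen for illustration; the slide itself does not fix them):

```c
#include <stdio.h>

/* Total storage bits for a direct-mapped cache with 32-bit byte addresses:
 * 2^n blocks, 2^m words per block (a word is 4 bytes = 32 bits). */
unsigned long cache_total_bits(unsigned n, unsigned m) {
    unsigned long blocks     = 1UL << n;
    unsigned long data_bits  = (1UL << m) * 32;   /* block size in bits     */
    unsigned long tag_bits   = 32 - n - m - 2;    /* 2 bits of byte offset  */
    unsigned long valid_bits = 1;
    return blocks * (data_bits + tag_bits + valid_bits);
}

int main(void) {
    /* Example: 1024 blocks (n = 10), 4 words per block (m = 2). */
    printf("%lu bits\n", cache_total_bits(10, 2));  /* 1024 * (128 + 18 + 1) = 150528 */
    return 0;
}
```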

Cache Contents
When do we put something in the cache?
when it is used for the first time

When do we remove something from the cache?


when we need the space in the cache for some other entry
all of memory won't fit on the CPU chip, so not every location in memory can be cached


A small two-level hierarchy


[Diagram: an 8-word cache backed by a 32-word memory (128 bytes); word addresses with their 7-bit binary forms, e.g. 0 = 0000000, 4 = 0000100, 8 = 0001000, 12 = 0001100, 16 = 0010000, 20 = 0010100, ..., 116 = 1110100, 120 = 1111000, 124 = 1111100]

Fully Associative Cache


In a fully associative cache,
any memory word can be placed in any cache line
each cache line stores an address and a data value
accesses are slow (but not as slow as you might think)
Why? Check all addresses at once.

Address   Valid   Value
0010100   Y       0x00000001
0000100   N       0x09D91D11
0100100   Y       0x00000410
0101100   Y       0x00012D10
0001100   N       0x00000005
1101100   Y       0x0349A291
0100000   Y       0x000123A8
1111100   N       0x00000200


Direct Mapped Caches


Fully associative caches are often too slow
With direct-mapped caches, the address of the item determines where in the cache to store it
In our example, the lowest-order two bits are the byte offset within the word stored in the cache
The next three bits of the address dictate the location (index) of the entry within the cache
The remaining higher-order bits record the rest of the original address as a tag for this entry

Address Tags
A tag is a label for a cache entry indicating where it came from
The upper bits of the data item's address
7-bit address: 1011101

Tag (2)   Index (3)   Byte Offset (2)
10        111         01
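A small C sketch of how the tag, index, and byte offset can be pulled out of such an address with shifts and masks; the field widths match the 7-bit example above, and the function names are purely illustrative:

```c
#include <stdio.h>

/* Field widths for the 7-bit example address: tag(2) | index(3) | offset(2). */
#define OFFSET_BITS 2
#define INDEX_BITS  3

unsigned byte_offset(unsigned addr) { return addr & ((1u << OFFSET_BITS) - 1); }
unsigned cache_index(unsigned addr) { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
unsigned cache_tag(unsigned addr)   { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    unsigned addr = 0x5D;  /* 1011101 in binary */
    printf("tag=%u index=%u offset=%u\n",
           cache_tag(addr), cache_index(addr), byte_offset(addr));
    /* Prints tag=2 (10), index=7 (111), offset=1 (01). */
    return 0;
}
```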


Direct Mapped Cache


Memory Address   Cache Index   Tag   Valid   Value
1100000          000 = 0       11    Y       0x00000001
1000100          001 = 1       10    N       0x09D91D11
0101000          010 = 2       01    Y       0x00000410
0001100          011 = 3       00    Y       0x00012D10
1010000          100 = 4       10    N       0x00000005
1110100          101 = 5       11    Y       0x0349A291
0011000          110 = 6       00    Y       0x000123A8
1011100          111 = 7       10    N       0x00000200
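A sketch of the lookup this table implies, assuming the same 2-bit tag / 3-bit index / 2-bit byte-offset split; the struct layout and names are illustrative, not from the slides:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 8   /* 2^3 lines, one word each */

struct line {
    bool     valid;
    uint32_t tag;     /* upper 2 bits of the 7-bit address */
    uint32_t value;   /* the cached word */
};

static struct line cache[NUM_LINES];

/* Returns true on a hit and fills *out; a miss would go to memory instead. */
bool dm_lookup(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) & 0x7;  /* bits [4:2] pick the line   */
    uint32_t tag   = addr >> 5;          /* bits [6:5] must also match */
    if (cache[index].valid && cache[index].tag == tag) {
        *out = cache[index].value;
        return true;                     /* hit */
    }
    return false;                        /* miss: fetch from memory, fill the line */
}
```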


N-way Set Associative Caches


Direct-mapped caches cannot store more than one address with the same index
A direct-mapped cache is the simple case of 1-way set associative
Compare to fully associative, where N = the number of blocks
If two addresses collide, then you overwrite the older entry

2-way set associative caches can store two different addresses with the same index
There are 3-way, 4-way, and 8-way set associative designs too
Reduces misses due to conflicts/competition/thrashing over the same cache block, because each set can hold more blocks
Larger sets imply slower accesses: you need to compare N items, where N is 1 for direct-mapped
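A minimal sketch of an N-way set-associative lookup (here 2-way, with the same 7-bit address split as the earlier examples); the structure and names are assumptions for illustration. The only change from the direct-mapped case is that every way of the selected set is compared:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 8
#define WAYS     2

struct way {
    bool     valid;
    uint32_t tag;
    uint32_t value;
};

static struct way cache[NUM_SETS][WAYS];

/* The index selects a set; every way in that set is checked for a matching tag. */
bool sa_lookup(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) & (NUM_SETS - 1);
    uint32_t tag   = addr >> 5;
    for (int w = 0; w < WAYS; w++) {
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *out = cache[index][w].value;
            return true;               /* hit in way w */
        }
    }
    return false;  /* miss: choose a victim way (e.g., LRU) and refill it */
}
```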


2-way Set Associative Cache


Index   Tag   Valid   Value        Tag   Valid   Value
000     11    Y       0x00000001   00    Y       0x00000002
001     10    N       0x09D91D11   10    N       0x0000003B
010     01    Y       0x00000410   11    Y       0x000000CF
011     00    Y       0x00012D10   10    N       0x000000A2
100     10    N       0x00000005   11    N       0x00000333
101     11    Y       0x0349A291   10    Y       0x00003333
110     00    Y       0x000123A8   01    Y       0x0000C002
111     10    N       0x00000200   10    N       0x00000005

The highlighted cache entry (index 101) contains values for addresses 10101xx and 11101xx (binary).

Associativity Spectrum

Direct Mapped: fast to access, but conflict misses
N-way Associative: slower to access, fewer conflict misses
Fully Associative: slow to access, no conflict misses


Spatial Locality
Using the cache improves performance by taking advantage of temporal locality
When a word in memory is accessed, it is loaded into cache memory
It is then available quickly if it is needed again soon

This does nothing for spatial locality


What's the solution here?


Memory Blocks
Divide memory into blocks
If any word in a block is accessed, then load an entire block into the cache
Block 0: 0x00000000 - 0x0000003F
Block 1: 0x00000040 - 0x0000007F
Block 2: 0x00000080 - 0x000000BF
Cache line for a 16-word block size:
tag | valid | w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15


Address Tags Revisited


A cache block size > 1 word requires the address to be divided differently
Instead of a byte offset into a word, we need a byte offset into the block
Assume we have 10-bit addresses, 8 cache lines, and 4 words (16 bytes) per cache line block
10-bit address: 0101100111

Tag (3)   Index (3)   Block Offset (4)
010       110         0111

The Effects of Block Size


Big blocks are good
Fewer first-time misses
Exploits spatial locality
Counter: the law of diminishing returns applies if block size becomes a significant fraction of overall cache size

Small blocks are good


Don't evict as much data when bringing in a new entry
More likely that all items in the block will turn out to be useful

Reads vs. Writes


Caching is essentially making a copy of the data
When you read, the copies still match when you're done (the read op is immutable)
When you write, the results must eventually propagate to both copies (a hierarchy)
Especially at the lowest level of the hierarchy, which is in some sense the permanent copy

Cache Misses
Reads are easier, due to constraints
A dedicated memory unit (or, memory controller) has the job of fetching and filling cache lines

Writes are trickier as we need to maintain consistent memory!


What if memory changed at a top level but not at either main memory or the disk?
Write-through: write to your cache, and percolate that write down (slow and straightforward; roughly a factor of 10)
Use a cache or buffer to percolate the write down: faster, but more complex to implement

Optimization: Write-back (or, write-on-replace)


Write-Through Caches
Write all updates to both cache and memory
Advantages
The cache and the memory are always consistent
Evicting a cache line is cheap because no data needs to be written out to memory at eviction
Easy to implement

Disadvantages
Runs at memory speeds when writing
One solution: use a cache; a write buffer / victim buffer can mask this delay

Write-Back Caches
Write the update to the cache only. Write to memory only when the cache block is evicted
Advantage
Runs at cache speed rather than memory speed
Some writes never go all the way to memory
When a whole block is written back, can use a high-bandwidth transfer

Disadvantage
complexity required to maintain consistency

Dirty bit
When evicting a block from a write-back cache, we could
always write the block back to memory, or
write it back only if we changed it

Caches use a dirty bit to mark if a line was changed


the dirty bit is 0 when the block is loaded
it is set to 1 if the block is modified
when the line is evicted, it is written back only if the dirty bit is 1
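A minimal sketch of how the dirty bit drives a write-back eviction; the structure, the 4-word block, and the helper write_block_to_memory are hypothetical names used only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    bool     valid;
    bool     dirty;   /* set on any store into this line */
    uint32_t tag;
    uint32_t data[4]; /* assumed 4-word block */
};

/* Hypothetical helper that copies a block back to main memory. */
void write_block_to_memory(uint32_t tag, uint32_t index, const uint32_t *data);

/* Called when a line is chosen as the victim for replacement. */
void evict(struct cache_line *line, uint32_t index) {
    if (line->valid && line->dirty)
        write_block_to_memory(line->tag, index, line->data);  /* only dirty lines go back */
    line->valid = false;
    line->dirty = false;   /* a newly loaded block starts clean */
}
```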

i-Cache and d-Cache


There usually are two separate caches for instructions and data.
Avoids structural hazards in pipelining
The combined capacity is twice as big, but each cache still has the access time of a small cache
Allows both caches to operate in parallel, for twice the bandwidth
But the miss rate will increase slightly, due to the extra partition that a shared cache wouldn't have

Latency vs. Throughput


Latency: the time to get the first requested word from the cache (if present)
The whole point to caches: this should be relatively small (compared to other memories)

Throughput: The time it takes to fetch the rest of the block (could be many words)
Defined by your hardware
Wide bandwidths preferred: at what cost?
DDR: dual read/write on the clock edges

Cache Line Replacement


How do you decide which cache block to replace?
If the cache is direct-mapped, it's easy
only one slot per index, so only one choice!

Otherwise, common strategies:


Random
Why might this be worthwhile?

Least Recently Used (LRU)


FIFO approximation works reasonably well here

LRU Implementations
LRU is very difficult to implement for high degrees of associativity
4-way approximation:
1 bit to indicate the least recently used pair
1 bit per pair to indicate the least recently used item in that pair

Another approximation: FIFO queuing
We will see this again at the operating system level
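A sketch of the 4-way approximation described above, using three bits per set: one bit picks the less recently used pair, and one bit per pair picks the less recently used way within it. The exact bit encoding here is an assumption; real designs vary:

```c
#include <stdint.h>

/* Tree pseudo-LRU state for one 4-way set:
 *   bit 0: which pair (ways 0-1 vs 2-3) was used less recently
 *   bit 1: within ways 0-1, which way was used less recently
 *   bit 2: within ways 2-3, which way was used less recently */
struct plru { uint8_t bits; };

/* On a hit (or fill) of `way`, point the tree away from it. */
void plru_touch(struct plru *s, int way) {
    if (way < 2) {
        s->bits |= 1u;                                        /* pair 2-3 is now older     */
        s->bits = (s->bits & ~2u) | ((way == 0) ? 2u : 0u);   /* other way of pair 0-1 older */
    } else {
        s->bits &= ~1u;                                       /* pair 0-1 is now older     */
        s->bits = (s->bits & ~4u) | ((way == 2) ? 4u : 0u);   /* other way of pair 2-3 older */
    }
}

/* Pick an approximately least recently used victim. */
int plru_victim(const struct plru *s) {
    if (s->bits & 1u)                 /* pair 2-3 older */
        return (s->bits & 4u) ? 3 : 2;
    else                              /* pair 0-1 older */
        return (s->bits & 2u) ? 1 : 0;
}
```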

Multi-Level Caches
Use each level of the memory hierarchy as a cache over the next lowest level
(or, simply recurse again on our current abstraction)

Inserting level 2 between levels 1 and 3 allows:


level 1 to have a higher miss rate (so it can be smaller and cheaper)
level 3 to have a larger access time (so it can be slower and cheaper)
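A sketch of how the one-level AMAT formula from earlier extends to two cache levels; the composition of the formula is standard, but the numbers below are illustrative assumptions:

```c
#include <stdio.h>

/* AMAT with an L2 between L1 and memory:
 * amat = t_L1 + miss_L1 * (t_L2 + miss_L2 * t_memory)
 * where miss_L2 is the local miss rate of L2 (L2 misses / L2 accesses). */
double amat_two_level(double t_l1, double miss_l1,
                      double t_l2, double miss_l2, double t_mem) {
    return t_l1 + miss_l1 * (t_l2 + miss_l2 * t_mem);
}

int main(void) {
    /* Assumed example: 1 ns L1 with 5% misses, 10 ns L2 with 20% local misses,
     * 100 ns main memory. */
    printf("AMAT = %.2f ns\n", amat_two_level(1.0, 0.05, 10.0, 0.20, 100.0));
    /* 1 + 0.05 * (10 + 0.20 * 100) = 2.50 ns */
    return 0;
}
```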

Summary: Classifying Caches


Where can a block be placed?
Direct mapped, N-way Set or Fully associative

How is a block found?


Direct mapped: one choice -> by index
Set associative: by index and search
Fully associative: by search (all at once)

What happens on a write access?


Write-back or Write-through

Which block should be replaced?


Random
LRU (Least Recently Used)

Further: Tuning Multi-Level Caches


When we add a second cache, this increases the power and complexity of our design
The presence of this other cache now changes what we should tune the first cache for.
If you only have one cache, it has to do it all

L1 and L2 caches have different influences on AMAT
If L1 is a subset of L2, what does this imply about L2's size?
L1 is usually tuned for fast hit time
Directly affects the clock cycle
Smaller cache, smaller block size, low associativity

L2 is frequently tuned for miss rate reduction


Larger cache (10), larger block size, higher associativity


Being Explored
Cache Coherency in multiprocessor systems
Consider keeping multiple L1s in sync when they are separated by a slow bus

Want each processor to have its own cache


Fast local access
No interference with/from other processors
No problem if processors operate in isolation (a thread tree per processor, for example)

But: now what happens if more than one processor accesses a cache line at the same time?
How do we keep multiple copies consistent?
Consider multiple writes to the same data
Overhead in all this synchronization/message passing on a possibly slow bus

What about synchronization with main storage?


