Reference
OSC (The dino book), Chapter 8: focus on paging
CDA&AQA, Chapter 5
See MIPS Run, D. Sweetman, Chapter 4
IBM and Cell chips: http://www.blachford.info/computer/Cell/Cell0_v2.html
Memory Hierarchy
Memory Level   Fabrication Tech   Access Time (ns)   Typ. Size (bytes)   $/MB (circa ?)
Registers      Registers          <0.5               256                 1000
L1 Cache       SRAM               0.5-5              64K                 100 (25)
L2 Cache       SRAM               10                 1M                  100 (5)
Memory         DRAM               100                512M                100 (0.020)
Disk           Magnetic Disk      10M                100G                0.0035 (0)
What is a Cache?
- A subset of a larger memory
- A small and fast place to store frequently accessed items
- Can be an instruction cache, a video cache, a streaming buffer/cache
- Like a fisheye view of memory, where magnification == speed
What is a Cache?
- A cache allows fast access to a subset of a larger data store
- Your web browser's cache gives you fast access to pages you visited recently
  - faster because it's stored locally
  - a subset because the whole web won't fit on your disk
- The memory cache gives the processor fast access to memory that it used recently
  - faster because it's built from fast, expensive memory and usually located on the CPU chip
  - a subset because the cache is smaller than main memory
Memory Hierarchy
[Figure: the memory hierarchy, from CPU registers through the L1 and L2 caches to main memory]
Locality of reference
Temporal locality: nearness in time
- Data being accessed now will probably be accessed again soon
- Useful data tends to continue to be useful
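A minimal C sketch of temporal locality (a hypothetical example, not from the slides): the accumulator `sum` is touched on every iteration, so once cached it is hit again and again.

```c
#include <stdio.h>

int main(void) {
    int data[1024];
    for (int i = 0; i < 1024; i++) data[i] = i;

    /* 'sum' is accessed on every iteration: once it is cached
     * (or register-allocated), each reuse is a fast hit. */
    long sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += data[i];

    printf("sum = %ld\n", sum);
    return 0;
}
```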
Cache Terminology
- Hit: the data item is in the cache
- Miss: the data item is not in the cache
# of bits in a Cache
- Caches are expensive memories that carry additional metadata
- The bits needed for a cache depend on the number of cache blocks as well as the address size (which enters through the tag-size calculation):

  total bits = 2^n * (block size + tag size + valid bit)

  where 2^n = number of blocks (n index bits)
        2^m = words per block (m block-offset bits)
        tag size = 32 - n - m - 2 for 32-bit byte addresses (2 bits of byte offset)
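A small C sketch of this calculation (the function name and example parameters are mine, assuming 32-bit addresses and 32-bit words as above):

```c
#include <stdio.h>

/* Total cache bits = 2^n blocks * (data bits + tag bits + 1 valid bit),
 * with tag = 32 - n - m - 2 for a byte-addressed 32-bit machine. */
static long cache_bits(int n, int m) {
    long blocks    = 1L << n;
    long data_bits = (1L << m) * 32;   /* 2^m words of 32 bits */
    long tag_bits  = 32 - n - m - 2;
    return blocks * (data_bits + tag_bits + 1);
}

int main(void) {
    /* e.g. 1024 one-word blocks (n = 10, m = 0): tag = 20 bits */
    printf("%ld bits\n", cache_bits(10, 0));  /* 1024 * (32+20+1) = 54272 */
    return 0;
}
```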
Cache Contents
When do we put something in the cache?
when it is used for the first time
32-word memory (128 bytes)

Address   Valid   Value
0010100   Y       0x00000001
0000100   N       0x09D91D11
0100100   Y       0x00000410
0101100   Y       0x00012D10
0001100   N       0x00000005
1101100   Y       0x0349A291
0100000   Y       0x000123A8
1111100   N       0x00000200
Address Tags
- A tag is a label for a cache entry indicating where it came from
  - the upper bits of the data item's address
- Example: 7-bit address 1011101
  - Tag (2 bits): 10
  - Index (3 bits): 111
  - Byte offset (2 bits): 01
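A hedged C sketch of that field split for the slide's 7-bit example (the variable names are mine):

```c
#include <stdio.h>

int main(void) {
    unsigned addr   = 0x5D;              /* 1011101 in binary        */
    unsigned offset =  addr       & 0x3; /* low 2 bits (byte):   01  */
    unsigned index  = (addr >> 2) & 0x7; /* next 3 bits:        111  */
    unsigned tag    = (addr >> 5) & 0x3; /* top 2 bits:          10  */
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```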
[Figure: a direct-mapped cache with 8 entries, indices 0-7]
Set Associative Caches
- In a direct-mapped cache, if two addresses collide, the new entry overwrites the older one
- A 2-way set associative cache can store two different addresses with the same index
  - there are 3-way, 4-way, and 8-way set associative designs too
- Associativity reduces misses due to conflicts/competition/thrashing for the same cache block, since each set can hold more blocks
- Larger sets imply slower accesses: a lookup must compare N tags, where N = 1 for a direct-mapped cache
The highlighted cache entry contains values for addresses 10101xx₂ and 11101xx₂.
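A minimal sketch of the 2-way lookup, with my own struct and function names: both ways of the indexed set are compared, so two addresses that share an index can coexist.

```c
#include <stdbool.h>

#define NSETS 8

/* One set of a 2-way cache: two (valid, tag) pairs per index. */
struct way { bool valid; unsigned tag; };
static struct way cache[NSETS][2];

/* Hit if either way in the indexed set holds a matching tag;
 * a direct-mapped cache would compare just one (N = 1). */
bool lookup(unsigned tag, unsigned index) {
    for (int i = 0; i < 2; i++)
        if (cache[index][i].valid && cache[index][i].tag == tag)
            return true;
    return false;
}
```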
Associativity Spectrum
[Figure: the associativity spectrum, from direct mapped (1-way) through N-way set associative to fully associative]
Spatial Locality
- The cache as described so far improves performance by taking advantage of temporal locality
  - when a word in memory is accessed, it is loaded into the cache
  - it is then available quickly if it is needed again soon
- Spatial locality: words near a recently accessed word are also likely to be needed soon (sketched below)
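A hypothetical C contrast of good vs. poor spatial locality: row-major traversal touches adjacent words in the same cache block, while column-major traversal jumps a whole row between accesses.

```c
#define N 512

/* Row-major order: consecutive j touches consecutive addresses,
 * so many accesses land in the block that was just fetched. */
long sum_rows(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major order: each access is N*sizeof(int) bytes from the
 * last, so nearly every access touches a different block. */
long sum_cols(int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```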
Memory Blocks
- Divide memory into blocks
- If any word in a block is accessed, then load the entire block into the cache

  Block 0: 0x00000000-0x0000003F
  Block 1: 0x00000040-0x0000007F
  Block 2: 0x00000080-0x000000BF
Cache line for 16-word block size:
  tag | valid | w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
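A small sketch of the block arithmetic (helper names mine), using the slide's 16-word, 64-byte blocks:

```c
#include <stdio.h>

#define BLOCK_BYTES 64   /* 16 words * 4 bytes, as on the slide */

int main(void) {
    unsigned addr  = 0x0000004C;           /* an address in block 1 */
    unsigned block = addr / BLOCK_BYTES;   /* block number          */
    unsigned lo    = block * BLOCK_BYTES;  /* 0x00000040            */
    unsigned hi    = lo + BLOCK_BYTES - 1; /* 0x0000007F            */
    printf("block %u: 0x%08X-0x%08X\n", block, lo, hi);
    return 0;
}
```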
Cache Misses
- Read misses are the easier case to handle
- A dedicated memory unit (the memory controller) has the job of fetching data and filling cache lines
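A hedged C sketch of a read miss on a direct-mapped cache; `fetch_block` is a hypothetical stand-in for the memory controller, and the geometry (8 lines of 16 words) is made up for illustration:

```c
#define NLINES 8    /* made-up geometry: 8 lines of 16 words */
#define WORDS  16

struct line { int valid; unsigned tag; unsigned data[WORDS]; };
static struct line cache[NLINES];

/* Hypothetical stand-in for the memory controller's block fill. */
extern void fetch_block(unsigned block_addr, unsigned *dst);

unsigned read_word(unsigned addr) {
    unsigned offset = (addr >> 2) % WORDS;       /* word within block */
    unsigned index  = (addr / (WORDS * 4)) % NLINES;
    unsigned tag    =  addr / (WORDS * 4 * NLINES);
    struct line *l  = &cache[index];
    if (!l->valid || l->tag != tag) {            /* miss: fill the line */
        fetch_block(addr & ~(unsigned)(WORDS * 4 - 1), l->data);
        l->tag = tag;
        l->valid = 1;
    }
    return l->data[offset];                      /* now a guaranteed hit */
}
```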
Write-Through Caches
- Write all updates to both the cache and memory
- Advantages
  - the cache and the memory are always consistent
  - evicting a cache line is cheap because no data needs to be written out to memory at eviction
  - easy to implement
- Disadvantages
  - writes run at memory speed rather than cache speed
  - one solution: use a buffer; a write buffer (victim buffer) can mask this delay (sketched below)
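A minimal sketch of the write-buffer idea (all names hypothetical): the store updates the cache and queues the memory write, and the processor stalls only when the buffer is full.

```c
/* Hypothetical hooks standing in for the cache array and the memory bus. */
extern void update_cache(unsigned addr, unsigned value);
extern void memory_write(unsigned addr, unsigned value);   /* slow */

#define BUF_SLOTS 4
static struct pending { unsigned addr, value; } buf[BUF_SLOTS];
static int head, tail, count;

static void drain_one(void) {              /* drains "in the background" */
    memory_write(buf[head].addr, buf[head].value);
    head = (head + 1) % BUF_SLOTS;
    count--;
}

/* Write-through store: update the cache, then queue the memory update
 * instead of stalling for it. The CPU stalls only if the buffer is full. */
void store_word(unsigned addr, unsigned value) {
    update_cache(addr, value);             /* fast: cache speed */
    while (count == BUF_SLOTS)
        drain_one();                       /* rare stall */
    buf[tail] = (struct pending){addr, value};
    tail = (tail + 1) % BUF_SLOTS;
    count++;
}
```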
Write-Back Caches
- Write the update to the cache only; write to memory only when the cache block is evicted
- Advantages
  - runs at cache speed rather than memory speed
  - some writes never go all the way to memory
  - when a whole block is written back, a high-bandwidth transfer can be used
- Disadvantage
  - complexity required to maintain consistency between the cache and memory
Dirty bit
- When evicting a block from a write-back cache, we could
  - always write the block back to memory, or
  - write it back only if we changed it; a per-block dirty bit records this (sketched below)
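A small sketch of the dirty-bit policy (struct and function names mine): a write-back store marks the line dirty, and a clean eviction skips the write-back.

```c
struct wb_line { int valid, dirty; unsigned tag; unsigned data[16]; };

extern void write_block_to_memory(const struct wb_line *l);  /* slow */

/* Evict: only a modified ("dirty") block must be written back. */
void evict(struct wb_line *l) {
    if (l->valid && l->dirty)
        write_block_to_memory(l);   /* slow path, taken only when needed */
    l->valid = 0;
    l->dirty = 0;
}

/* A write-back store hits the cache and marks the line dirty
 * instead of touching memory. */
void store_hit(struct wb_line *l, unsigned offset, unsigned value) {
    l->data[offset] = value;
    l->dirty = 1;
}
```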
Throughput: the time it takes to fetch the rest of the block (could be many words)
- defined by your hardware; wide buses are preferred, but at what cost?
- DDR (double data rate) memory transfers data on both clock edges
LRU Implementations
- True LRU is very difficult to implement for high degrees of associativity
- 4-way approximation (tree pseudo-LRU, sketched below):
  - 1 bit to indicate the least recently used pair
  - 1 bit per pair to indicate the least recently used item in that pair
- Another approximation: FIFO queueing
- We will see this again at the operating system level
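A hedged C sketch of the 4-way approximation described above (tree pseudo-LRU; the encoding conventions in the comments are my choice):

```c
#include <stdio.h>

typedef struct {
    unsigned root;  /* 0: pair {0,1} is the LRU pair, 1: pair {2,3} is */
    unsigned left;  /* within {0,1}: 0 means way 0 is LRU */
    unsigned right; /* within {2,3}: 0 means way 2 is LRU */
} plru_t;

/* Record an access: flip the bits so they point away from way w. */
static void plru_touch(plru_t *p, int w) {
    if (w < 2) { p->root = 1; p->left  = (w == 0); }
    else       { p->root = 0; p->right = (w == 2); }
}

/* Pick a victim by following the LRU bits down the tree. */
static int plru_victim(const plru_t *p) {
    if (p->root == 0) return p->left  ? 1 : 0;
    else              return p->right ? 3 : 2;
}

int main(void) {
    plru_t p = {0, 0, 0};
    int refs[] = {0, 1, 2, 0, 3};
    for (int i = 0; i < 5; i++) plru_touch(&p, refs[i]);
    printf("victim: way %d\n", plru_victim(&p)); /* a not-recently-used way */
    return 0;
}
```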
Multi-Level Caches
- Use each level of the memory hierarchy as a cache over the next lowest level
  - (or, simply recurse again on our current abstraction)
- L1 and L2 caches have different influences on AMAT (average memory access time; see the sketch below)
- If L1 is a subset of L2, what does this imply about L2's size?
- L1 is usually tuned for fast hit time
  - directly affects the clock cycle
  - smaller cache, smaller block size, low associativity
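A worked sketch of two-level AMAT; the latencies and miss rates below are made-up illustrative numbers, not measurements:

```c
#include <stdio.h>

int main(void) {
    double l1_hit       = 1.0;    /* cycles */
    double l1_miss_rate = 0.05;
    double l2_hit       = 10.0;
    double l2_miss_rate = 0.20;   /* fraction of L1 misses */
    double mem          = 100.0;

    /* AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * memory time) */
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem);
    printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.05 * (10 + 20) = 2.50 */
    return 0;
}
```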
Being Explored
- Cache coherency in multiprocessor systems
  - consider keeping multiple L1 caches in sync when they are separated by a slow bus
- But: what happens if more than one processor accesses a cache line at the same time?
  - how do we keep multiple copies consistent?
  - consider multiple writes to the same data
- There is overhead in all this synchronization/message passing on a possibly slow bus