Reference
OSC (The dino book), Chapter 8: focus on paging
CDA&AQA, Chapter 5
See MIPS Run, D. Sweetman, Chapter 4
IBM and Cell chips: http://www.blachford.info/computer/Cell/Cell0_v2.html
Memory Hierarchy
Memory Level   Fabrication Tech   Access Time (ns)   Typ. Size (bytes)   $/MB (circa ?)
Registers      Registers          <0.5               256                 1000
L1 Cache       SRAM               0.5-5              64K                 100 (25)
L2 Cache       SRAM               10                 1M                  100 (5)
Memory         DRAM               100                512M                100 (0.020)
Disk           Magnetic Disk      10M                100G                0.0035 (0)
What is a Cache?
- A subset of a larger memory
- A small and fast place to store frequently accessed items
- Can be an instruction cache, a video cache, a streaming buffer/cache
- Like a fisheye view of memory, where magnification == speed
What is a Cache?
- A cache allows fast access to a subset of a larger data store
- Your web browser's cache gives you fast access to pages you visited recently
  - faster because it's stored locally
  - a subset because the whole web won't fit on your disk
- The memory cache gives the processor fast access to memory that it used recently
  - faster because it's built from fast, expensive memory and usually located on the CPU chip
  - a subset because the cache is smaller than main memory
Memory Hierarchy
[Figure: the memory hierarchy, from CPU registers through the L1 and L2 caches to main memory]
Locality of reference
Temporal locality: nearness in time
- Data being accessed now will probably be accessed again soon
- Useful data tends to continue to be useful
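A minimal C sketch of temporal locality (a hypothetical example, not from the slides): the accumulator `sum` is touched on every iteration, so once cached it is hit again and again.

```c
#include <stdio.h>

int main(void) {
    int data[1024];
    for (int i = 0; i < 1024; i++) data[i] = i;

    /* 'sum' is accessed on every iteration: once it is cached
     * (or register-allocated), each reuse is a fast hit. */
    long sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += data[i];

    printf("sum = %ld\n", sum);
    return 0;
}
```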
Cache Terminology
- Hit: the data item is in the cache
- Miss: the data item is not in the cache
# of bits in a Cache
- Caches are expensive memories that carry additional metadata
- The bits needed for a cache depend on the number of cache blocks as well as the address size (which enters through the tag-size calculation):

  total bits = 2^n * (block size + tag size + valid bit)

  where 2^n = number of blocks (n index bits)
        2^m = words per block (m block-offset bits)
        tag size = 32 - n - m - 2 for 32-bit byte addresses (2 bits of byte offset)
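A small C sketch of this calculation (the function name and example parameters are mine, assuming 32-bit addresses and 32-bit words as above):

```c
#include <stdio.h>

/* Total cache bits = 2^n blocks * (data bits + tag bits + 1 valid bit),
 * with tag = 32 - n - m - 2 for a byte-addressed 32-bit machine. */
static long cache_bits(int n, int m) {
    long blocks    = 1L << n;
    long data_bits = (1L << m) * 32;   /* 2^m words of 32 bits */
    long tag_bits  = 32 - n - m - 2;
    return blocks * (data_bits + tag_bits + 1);
}

int main(void) {
    /* e.g. 1024 one-word blocks (n = 10, m = 0): tag = 20 bits */
    printf("%ld bits\n", cache_bits(10, 0));  /* 1024 * (32+20+1) = 54272 */
    return 0;
}
```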
Cache Contents
When do we put something in the cache?
when it is used for the first time
32-word memory (128 bytes)

Address   Valid   Value
0010100   Y       0x00000001
0000100   N       0x09D91D11
0100100   Y       0x00000410
0101100   Y       0x00012D10
0001100   N       0x00000005
1101100   Y       0x0349A291
0100000   Y       0x000123A8
1111100   N       0x00000200
Address Tags
- A tag is a label for a cache entry indicating where it came from
  - the upper bits of the data item's address
- Example: 7-bit address 1011101
  - Tag (2 bits): 10
  - Index (3 bits): 111
  - Byte offset (2 bits): 01
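A hedged C sketch of that field split for the slide's 7-bit example (the variable names are mine):

```c
#include <stdio.h>

int main(void) {
    unsigned addr   = 0x5D;              /* 1011101 in binary        */
    unsigned offset =  addr       & 0x3; /* low 2 bits (byte):   01  */
    unsigned index  = (addr >> 2) & 0x7; /* next 3 bits:        111  */
    unsigned tag    = (addr >> 5) & 0x3; /* top 2 bits:          10  */
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```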
[Figure: a direct-mapped cache with 8 entries, indices 0-7]
Set Associative Caches
- In a direct-mapped cache, if two addresses collide, the new entry overwrites the older one
- A 2-way set associative cache can store two different addresses with the same index
  - there are 3-way, 4-way, and 8-way set associative designs too
- Associativity reduces misses due to conflicts/competition/thrashing for the same cache block, since each set can hold more blocks
- Larger sets imply slower accesses: a lookup must compare N tags, where N = 1 for a direct-mapped cache
The highlighted cache entry contains values for addresses 10101xx₂ and 11101xx₂.
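A minimal sketch of the 2-way lookup, with my own struct and function names: both ways of the indexed set are compared, so two addresses that share an index can coexist.

```c
#include <stdbool.h>

#define NSETS 8

/* One set of a 2-way cache: two (valid, tag) pairs per index. */
struct way { bool valid; unsigned tag; };
static struct way cache[NSETS][2];

/* Hit if either way in the indexed set holds a matching tag;
 * a direct-mapped cache would compare just one (N = 1). */
bool lookup(unsigned tag, unsigned index) {
    for (int i = 0; i < 2; i++)
        if (cache[index][i].valid && cache[index][i].tag == tag)
            return true;
    return false;
}
```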
Associativity Spectrum
[Figure: the associativity spectrum, from direct mapped (1-way) through N-way set associative to fully associative]
Spatial Locality
- The cache as described so far improves performance by taking advantage of temporal locality
  - when a word in memory is accessed, it is loaded into the cache
  - it is then available quickly if it is needed again soon
- Spatial locality: words near a recently accessed word are also likely to be needed soon (sketched below)
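A hypothetical C contrast of good vs. poor spatial locality: row-major traversal touches adjacent words in the same cache block, while column-major traversal jumps a whole row between accesses.

```c
#define N 512

/* Row-major order: consecutive j touches consecutive addresses,
 * so many accesses land in the block that was just fetched. */
long sum_rows(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major order: each access is N*sizeof(int) bytes from the
 * last, so nearly every access touches a different block. */
long sum_cols(int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```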
Memory Blocks
- Divide memory into blocks
- If any word in a block is accessed, then load the entire block into the cache

  Block 0: 0x00000000-0x0000003F
  Block 1: 0x00000040-0x0000007F
  Block 2: 0x00000080-0x000000BF
Cache line for 16-word block size:
  tag | valid | w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
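A small sketch of the block arithmetic (helper names mine), using the slide's 16-word, 64-byte blocks:

```c
#include <stdio.h>

#define BLOCK_BYTES 64   /* 16 words * 4 bytes, as on the slide */

int main(void) {
    unsigned addr  = 0x0000004C;           /* an address in block 1 */
    unsigned block = addr / BLOCK_BYTES;   /* block number          */
    unsigned lo    = block * BLOCK_BYTES;  /* 0x00000040            */
    unsigned hi    = lo + BLOCK_BYTES - 1; /* 0x0000007F            */
    printf("block %u: 0x%08X-0x%08X\n", block, lo, hi);
    return 0;
}
```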
Cache Misses
- Read misses are the easier case to handle
- A dedicated memory unit (the memory controller) has the job of fetching data and filling cache lines
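A hedged C sketch of a read miss on a direct-mapped cache; `fetch_block` is a hypothetical stand-in for the memory controller, and the geometry (8 lines of 16 words) is made up for illustration:

```c
#define NLINES 8    /* made-up geometry: 8 lines of 16 words */
#define WORDS  16

struct line { int valid; unsigned tag; unsigned data[WORDS]; };
static struct line cache[NLINES];

/* Hypothetical stand-in for the memory controller's block fill. */
extern void fetch_block(unsigned block_addr, unsigned *dst);

unsigned read_word(unsigned addr) {
    unsigned offset = (addr >> 2) % WORDS;       /* word within block */
    unsigned index  = (addr / (WORDS * 4)) % NLINES;
    unsigned tag    =  addr / (WORDS * 4 * NLINES);
    struct line *l  = &cache[index];
    if (!l->valid || l->tag != tag) {            /* miss: fill the line */
        fetch_block(addr & ~(unsigned)(WORDS * 4 - 1), l->data);
        l->tag = tag;
        l->valid = 1;
    }
    return l->data[offset];                      /* now a guaranteed hit */
}
```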
Write-Through Caches
- Write all updates to both the cache and memory
- Advantages
  - the cache and the memory are always consistent
  - evicting a cache line is cheap because no data needs to be written out to memory at eviction
  - easy to implement
- Disadvantages
  - writes run at memory speed rather than cache speed
  - one solution: use a buffer; a write buffer (victim buffer) can mask this delay (sketched below)
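A minimal sketch of the write-buffer idea (all names hypothetical): the store updates the cache and queues the memory write, and the processor stalls only when the buffer is full.

```c
/* Hypothetical hooks standing in for the cache array and the memory bus. */
extern void update_cache(unsigned addr, unsigned value);
extern void memory_write(unsigned addr, unsigned value);   /* slow */

#define BUF_SLOTS 4
static struct pending { unsigned addr, value; } buf[BUF_SLOTS];
static int head, tail, count;

static void drain_one(void) {              /* drains "in the background" */
    memory_write(buf[head].addr, buf[head].value);
    head = (head + 1) % BUF_SLOTS;
    count--;
}

/* Write-through store: update the cache, then queue the memory update
 * instead of stalling for it. The CPU stalls only if the buffer is full. */
void store_word(unsigned addr, unsigned value) {
    update_cache(addr, value);             /* fast: cache speed */
    while (count == BUF_SLOTS)
        drain_one();                       /* rare stall */
    buf[tail] = (struct pending){addr, value};
    tail = (tail + 1) % BUF_SLOTS;
    count++;
}
```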
Write-Back Caches
- Write the update to the cache only; write to memory only when the cache block is evicted
- Advantages
  - runs at cache speed rather than memory speed
  - some writes never go all the way to memory
  - when a whole block is written back, a high-bandwidth transfer can be used
- Disadvantage
  - complexity required to maintain consistency between the cache and memory
Dirty bit
- When evicting a block from a write-back cache, we could
  - always write the block back to memory, or
  - write it back only if we changed it; a per-block dirty bit records this (sketched below)
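A small sketch of the dirty-bit policy (struct and function names mine): a write-back store marks the line dirty, and a clean eviction skips the write-back.

```c
struct wb_line { int valid, dirty; unsigned tag; unsigned data[16]; };

extern void write_block_to_memory(const struct wb_line *l);  /* slow */

/* Evict: only a modified ("dirty") block must be written back. */
void evict(struct wb_line *l) {
    if (l->valid && l->dirty)
        write_block_to_memory(l);   /* slow path, taken only when needed */
    l->valid = 0;
    l->dirty = 0;
}

/* A write-back store hits the cache and marks the line dirty
 * instead of touching memory. */
void store_hit(struct wb_line *l, unsigned offset, unsigned value) {
    l->data[offset] = value;
    l->dirty = 1;
}
```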
Throughput: the time it takes to fetch the rest of the block (could be many words)
- defined by your hardware; wide buses are preferred, but at what cost?
- DDR (double data rate) memory transfers data on both clock edges
LRU Implementations
- True LRU is very difficult to implement for high degrees of associativity
- 4-way approximation (tree pseudo-LRU, sketched below):
  - 1 bit to indicate the least recently used pair
  - 1 bit per pair to indicate the least recently used item in that pair
- Another approximation: FIFO queueing
- We will see this again at the operating system level
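A hedged C sketch of the 4-way approximation described above (tree pseudo-LRU; the encoding conventions in the comments are my choice):

```c
#include <stdio.h>

typedef struct {
    unsigned root;  /* 0: pair {0,1} is the LRU pair, 1: pair {2,3} is */
    unsigned left;  /* within {0,1}: 0 means way 0 is LRU */
    unsigned right; /* within {2,3}: 0 means way 2 is LRU */
} plru_t;

/* Record an access: flip the bits so they point away from way w. */
static void plru_touch(plru_t *p, int w) {
    if (w < 2) { p->root = 1; p->left  = (w == 0); }
    else       { p->root = 0; p->right = (w == 2); }
}

/* Pick a victim by following the LRU bits down the tree. */
static int plru_victim(const plru_t *p) {
    if (p->root == 0) return p->left  ? 1 : 0;
    else              return p->right ? 3 : 2;
}

int main(void) {
    plru_t p = {0, 0, 0};
    int refs[] = {0, 1, 2, 0, 3};
    for (int i = 0; i < 5; i++) plru_touch(&p, refs[i]);
    printf("victim: way %d\n", plru_victim(&p)); /* a not-recently-used way */
    return 0;
}
```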
Multi-Level Caches
- Use each level of the memory hierarchy as a cache over the next lowest level
  - (or, simply recurse again on our current abstraction)
- L1 and L2 caches have different influences on AMAT (average memory access time; see the sketch below)
- If L1 is a subset of L2, what does this imply about L2's size?
- L1 is usually tuned for fast hit time
  - directly affects the clock cycle
  - smaller cache, smaller block size, low associativity
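A worked sketch of two-level AMAT; the latencies and miss rates below are made-up illustrative numbers, not measurements:

```c
#include <stdio.h>

int main(void) {
    double l1_hit       = 1.0;    /* cycles */
    double l1_miss_rate = 0.05;
    double l2_hit       = 10.0;
    double l2_miss_rate = 0.20;   /* fraction of L1 misses */
    double mem          = 100.0;

    /* AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * memory time) */
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem);
    printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.05 * (10 + 20) = 2.50 */
    return 0;
}
```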
Being Explored
- Cache coherency in multiprocessor systems
  - consider keeping multiple L1 caches in sync when they are separated by a slow bus
- But: what happens if more than one processor accesses a cache line at the same time?
  - how do we keep multiple copies consistent?
  - consider multiple writes to the same data
- There is overhead in all this synchronization/message passing on a possibly slow bus