
Memory Hierarchy - Introduction

Computer programmers want an unlimited amount of fast memory.
An economical solution is a memory hierarchy, which takes advantage of the principle of locality and the cost-performance of memory technologies.
Principle of Locality: most programs do not access all code or data uniformly.

Multilevel Memory hierarchy

Since fast memory is expensive, a memory hierarchy is organized into several levels, each smaller, faster, and more expensive per byte than the next lower level.
The goal is to provide a memory system whose cost per byte is almost as low as the cheapest level of memory and whose speed is almost as high as the fastest level.
Each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy.

Cache: the name given to the highest or first level of the memory hierarchy encountered once the address leaves the processor. A cache is a temporary storage area where frequently accessed data can be stored for rapid access.
Cache hit: when the processor finds the requested data in the cache.
Cache miss: when the processor does not find the needed data item in the cache.

Three categories of cache misses
1. Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that would occur even with an infinite cache.
2. Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved.
3. Conflict: if the block placement strategy is not fully associative, conflict misses occur because a block may be discarded and later retrieved when conflicting blocks map to its set.



The time required for a cache miss depends on both the latency and the bandwidth of the memory. Latency determines the time to retrieve the first word of the block; bandwidth determines the time to retrieve the rest of the block.
Cache misses are handled by hardware and cause processors using in-order execution to stall. With out-of-order execution, an instruction using the result must wait, but other instructions may proceed during the miss.

Block: a fixed-size collection of data containing the requested word, also known as a line.

Virtual memory: not all objects referenced by a program need to reside in main memory; some objects may reside on disk.
Pages: the address space is usually broken into fixed-size blocks called pages.
Page fault: at any time a page resides either in memory or on disk. When the processor references an item within a page that is not present in memory, a page fault occurs, and the entire page is moved from disk to memory. Since page faults take so long, they are handled in software and the processor is not stalled; it usually switches to some other task.

Cache performance
The memory hierarchy can substantially improve performance because of locality of reference and the higher speed of smaller memories.
The processor execution time equation, when we include the number of cycles during which the processor is stalled waiting for memory access (the memory stall cycles), is:
CPU execution time = (CPU clock cycles + Memory stall clock cycles) * Clock cycle time
This equation assumes that the CPU clock cycles include the time to handle a cache hit, and that the processor is stalled during a cache miss.

The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:
Memory stall cycles = Number of misses * Miss penalty
  = IC * (Misses / Instruction) * Miss penalty
  = IC * (Memory accesses / Instruction) * Miss rate * Miss penalty
The miss rate is simply the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the total number of accesses).
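As a sketch, the stall-cycle and execution-time equations can be computed directly; the workload numbers below (instruction count, accesses per instruction, miss rate, penalty, clock) are hypothetical:

```python
def memory_stall_cycles(instruction_count, accesses_per_instruction,
                        miss_rate, miss_penalty):
    """Memory stall cycles = IC * (memory accesses / instruction)
    * miss rate * miss penalty."""
    return instruction_count * accesses_per_instruction * miss_rate * miss_penalty

def cpu_execution_time(cpu_cycles, stall_cycles, clock_cycle_time):
    """CPU execution time = (CPU clock cycles + memory stall cycles)
    * clock cycle time."""
    return (cpu_cycles + stall_cycles) * clock_cycle_time

# Hypothetical workload: 1M instructions, 1.5 accesses each,
# 2% miss rate, 100-cycle miss penalty, 1 ns clock cycle.
stalls = memory_stall_cycles(1_000_000, 1.5, 0.02, 100)
print(stalls)                                        # 3,000,000 stall cycles
print(cpu_execution_time(2_000_000, stalls, 1e-9))   # total time in seconds
```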

The formula above is an approximation, since miss rates and miss penalties differ for reads and writes. Memory stall cycles can instead be defined in terms of the number of memory accesses per instruction, the miss penalty (in clock cycles) for reads and writes, and the miss rate for reads and writes:
Memory stall cycles = IC * Reads per instruction * Read miss rate * Read miss penalty
  + IC * Writes per instruction * Write miss rate * Write miss penalty

Some designers measure miss rate as misses per instruction rather than misses per memory reference. The two are related:
Misses / Instruction = Miss rate * (Memory accesses / Instruction count) = Miss rate * Memory accesses per instruction
Misses per instruction is often reported as misses per 1000 instructions to show integers instead of fractions.

Four memory hierarchy questions
The cache is the first level of the memory hierarchy. Answering the following questions helps us understand the trade-offs of memories at different levels of the hierarchy:
1. Where can a block be placed in the upper level? (block placement)
2. How is a block found if it is in the upper level? (block identification)
3. Which block should be replaced on a miss? (block replacement)
4. What happens on a write? (write strategy)

Where can a block be placed in a cache?

Three categories of cache organization
1. Direct mapped: each block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache).
2. Fully associative: a cache block can be placed anywhere in the cache.
3. Set associative: a block can be placed in a restricted set of places in the cache. A set is a group of blocks in the cache. A block is first mapped onto a set, and then it can be placed anywhere within that set. The set is usually chosen by bit selection: (Block address) MOD (Number of sets in cache).
If there are n blocks in a set, the placement is called n-way set associative. Direct mapped is simply one-way set associative, and a fully associative cache with m blocks could be called m-way set associative. Equivalently, direct mapped can be thought of as having m sets and fully associative as having one set. The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative.
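A minimal sketch of the bit-selection mapping above; the 8-block cache and block address 12 are illustrative:

```python
def cache_set(block_address, num_blocks, associativity):
    """Return the set index a block maps to under bit selection:
    (block address) MOD (number of sets)."""
    num_sets = num_blocks // associativity
    return block_address % num_sets

# An 8-block cache:
print(cache_set(12, 8, 1))  # direct mapped (1-way, 8 sets): 12 mod 8 = 4
print(cache_set(12, 8, 2))  # 2-way set associative (4 sets): 12 mod 4 = 0
print(cache_set(12, 8, 8))  # fully associative (8-way, 1 set): 12 mod 1 = 0
```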

How is a block found if it is in the cache?

Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired data is checked to see whether it matches the block address from the processor. All possible tags are searched in parallel because speed is critical. A valid bit is added to the tag to indicate whether the entry contains a valid address; if this bit is not set, there cannot be a match on this address.

Relationship of a processor address to the cache
The first division is between the block address and the block offset. The block address is further divided into the tag field and the index field. The block offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against the stored tag for a hit.

Memory Hierarchy - Review
By Chandru, 1RV08SCS05

Objectives
> Basic Information
> Four Memory Hierarchy Questions
  Q1: Where can a block be placed in the upper level? (block placement)
  Q2: How is a block found if it is in the upper level? (block identification)
  Q3: Which block should be replaced on a miss? (block replacement)
  Q4: What happens on a write? (write strategy)
> An Example: The Opteron Data Cache

Memory Hierarchy
The hierarchical arrangement of storage in current computer architectures is called the memory hierarchy. It is designed to take advantage of memory locality in computer programs.

Most modern CPUs are so fast that, for most program workloads, the locality of reference of memory accesses and the efficiency of the caching and memory transfer between different levels of the hierarchy are the practical limitations on processing speed.

An Example Memory Hierarchy
(smaller, faster, and costlier per byte toward the top; larger, slower, and cheaper per byte toward the bottom)
L0: CPU registers - hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM) - holds cache lines retrieved from the L2 cache.
L2: off-chip L2 cache (SRAM) - holds cache lines retrieved from main memory.
L3: main memory (DRAM) - holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks) - holds files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers)

The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

Cache Algorithm (Read)
Look at the processor address and search the cache tags to find a match. Then either:
HIT - found in cache: return a copy of the data from the cache.
MISS - not in cache: read the block of data from main memory, wait, then return the data to the processor and update the cache.
Hit rate = fraction of accesses found in the cache
Miss rate = 1 - Hit rate
Hit time = RAM access time + time to determine HIT/MISS
Miss time = time to replace a block in the cache + time to deliver the block to the processor
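The read algorithm can be sketched as a toy direct-mapped model that tracks only block addresses, not data; the 4-block size and the access sequence are illustrative:

```python
class DirectMappedCache:
    """Minimal read-only direct-mapped cache model: each block address
    maps to frame (address MOD num_blocks); one stored address per frame
    plays the role of the tag, and None means the valid bit is clear."""
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.frames = [None] * num_blocks

    def read(self, block_address):
        """Return 'hit' or 'miss'; on a miss the block is brought in."""
        index = block_address % self.num_blocks
        if self.frames[index] == block_address:
            return 'hit'
        self.frames[index] = block_address  # fetch from memory, update cache
        return 'miss'

cache = DirectMappedCache(4)
# Blocks 0 and 8 conflict (both map to frame 0):
print([cache.read(b) for b in [0, 8, 0, 0, 8]])
# ['miss', 'miss', 'miss', 'hit', 'miss']
```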

Caching in a Memory Hierarchy
A smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1. Data is copied between levels in block-sized transfer units. The larger, slower, cheaper storage device at level k+1 is partitioned into blocks (in the figure, blocks 0-15, with a subset such as 4, 9, 10, and 14 cached at level k).

Types of cache misses:
- Cold (compulsory) miss: cold misses occur because the cache is empty.
- Conflict miss: if the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved when conflicting blocks map to the same set. Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
- Capacity miss: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur.

Q3: Which block should be replaced on a miss?
Easy for direct mapped: there is only one candidate. For set associative or fully associative caches there are several strategies:
- Random: candidate blocks are randomly selected; some systems generate pseudorandom block numbers.
- Least Recently Used (LRU): relies on a corollary of locality: if recently used blocks are likely to be used again, then the least recently used block is a good candidate for eviction.
- First In, First Out (FIFO): used in highly associative caches.
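An LRU policy can be sketched with an ordered dictionary as a toy fully associative model; the capacity and access pattern below are illustrative:

```python
from collections import OrderedDict

def simulate_lru(capacity, accesses):
    """Count (hits, misses) for a fully associative cache with LRU
    replacement; the OrderedDict keeps the oldest entry first."""
    cache = OrderedDict()
    hits = misses = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits, misses

print(simulate_lru(2, [1, 2, 1, 3, 2]))  # (1, 4)
```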

General Caching Concepts
The program needs object d, which is stored in some block b.
Cache hit: the program finds b in the cache at level k (e.g., block 14).
Cache miss: b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12). If the level k cache is full, some current block must be replaced (evicted). Which one is the victim?
- Placement policy: where can the new block go? E.g., b mod 4.
- Replacement policy: which block should be evicted? E.g., LRU.

Q4: What happens on a write?
Cache hit:
- Write through: write both the cache and memory. Generally higher traffic, but it simplifies cache coherence.
- Write back: write the cache only; memory is written only when the entry is evicted. A dirty bit per block can further reduce the traffic.

Q4 continued - cache miss:
- No-write allocate: only write to main memory (the lower-level memory); the block is not allocated in the cache.
- Write allocate: the block is allocated on a write miss, so write misses act like read misses.

Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
WriteMem[100]; WriteMem[100]; ReadMem[200]; WriteMem[200]; WriteMem[100].
What are the number of hits and misses when using no-write allocate versus write allocate?

Answer For no-write allocate, the address 100 is not in the cache, and

there is no allocation on write, so the first two writes will result in misses. Address 200 is also not in the cache, so the read is also a miss. The subsequent write to address 200 is a hit. The last write to 100 is still a miss. The result for no-write allocate is four misses and one hit. For write allocate, the first accesses to 100 and 200 are misses, and the rest are hits since 100 and 200 are both found in the cache. Thus, the result for write allocate is two misses and three hits.

Six Basic Cache Optimizations

Sandeep Singh M.Tech (CSE) 2nd Sem

Average memory access time = Hit time + Miss rate * Miss penalty
Three categories of cache optimizations:
- Reducing the miss rate: larger block size, larger cache size, and higher associativity
- Reducing the miss penalty: multilevel caches and giving reads priority over writes
- Reducing the time to hit in the cache: avoiding address translation when indexing the cache
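The average memory access time formula can be sketched as a one-line helper; the cycle counts and miss rate below are hypothetical:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))  # 6.0 cycles
```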

Three categories of misses
- Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses. Compulsory misses are those that occur even in an infinite cache.
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks discarded and later retrieved. Capacity misses are those that occur in a fully associative cache.
- Conflict: if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses. Conflict misses are those that occur going from fully associative to eight-way associative, four-way associative, and so on.

Four divisions of conflict misses
- Eight-way: conflict misses due to going from fully associative to eight-way associative.
- Four-way: conflict misses due to going from eight-way associative to four-way associative.
- Two-way: conflict misses due to going from four-way associative to two-way associative.
- One-way: conflict misses due to going from two-way associative to one-way associative (direct mapped).

First Optimization: Larger Block Size to Reduce Miss Rate
The simplest way to reduce the miss rate is to increase the block size; larger blocks reduce compulsory misses. This reduction occurs because the principle of locality has two components, temporal locality and spatial locality, and larger blocks take advantage of spatial locality. However, larger blocks also increase the miss penalty, and because they reduce the number of blocks in the cache, they may increase conflict misses and even capacity misses if the cache is small.
The selection of block size depends on both the latency and the bandwidth of the lower-level memory. High latency and high bandwidth encourage large block sizes, since the cache gets many more bytes per miss for a small increase in miss penalty. Low latency and low bandwidth encourage small block sizes, since there is little time saved by a larger block.

Second Optimization: Larger Caches to Reduce Miss Rate
The obvious way to reduce capacity misses is to increase the capacity of the cache. The drawbacks are a potentially longer hit time and higher cost and power. This technique has been especially popular in off-chip caches.

Third Optimization: Higher Associativity to Reduce Miss Rate
There are two rules of thumb. The first is that eight-way set associative is, for practical purposes, as effective in reducing misses for these sized caches as fully associative. The second, called the 2:1 cache rule of thumb, is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2. The drawback is that higher associativity can increase the hit time and thus the average memory access time.

Fourth Optimization: Multilevel Caches to Reduce Miss Penalty
Due to the performance gap between processor and memory, designers added another level of cache between the original cache and memory. The first-level cache can be small enough to match the clock cycle time of the fast processor, while the second-level cache can be large enough to capture many accesses that would otherwise go to main memory, thereby lessening the effective miss penalty.

Average memory access time for a two-level cache:
Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
So:
Average memory access time = Hit time(L1) + Miss rate(L1) * (Hit time(L2) + Miss rate(L2) * Miss penalty(L2))

Terms adopted for a two-level cache system:
- Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache. For the first-level cache it equals Miss rate(L1), and for the second-level cache it is Miss rate(L2).
- Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the processor. The global miss rate of the first-level cache is still just Miss rate(L1), but for the second-level cache it is Miss rate(L1) * Miss rate(L2).

Fifth Optimization: Giving Priority to Read Misses over Writes to Reduce Miss Penalty
This optimization serves reads before writes have been completed. With a write-through cache the most important improvement is a write buffer of the proper size. Write buffers, however, complicate memory accesses because they might hold the updated value of a location needed on a read miss. The simplest way out is for the read miss to wait until the write buffer is empty. The alternative is to check the contents of the write buffer on a read miss, and if there are no conflicts and the memory system is available, let the read miss continue.

Sixth Optimization: Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time
We can use virtual addresses for the cache, since hits are much more common than misses. Such caches are termed virtual caches, with physical cache used to identify the traditional caches that use physical addresses. Two tasks are important: indexing the cache and comparing addresses. Full virtual addressing for both indices and tags eliminates address translation time from a cache hit.

Some reasons for not building virtually addressed caches:
- Protection: page-level protection is checked as part of the virtual-to-physical address translation, and it must still be enforced. One solution is to copy the protection information from the TLB on a miss, add a field to hold it, and check it on every access to the virtually addressed cache.
- Process switches: every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. One solution is to widen the cache address tag with a process-identifier tag (PID). If the operating system assigns these tags to processes, it need only flush the cache when a PID is recycled; the PID distinguishes whether or not the data in the cache are for this program.
- Synonyms or aliases: operating systems and user programs may use different virtual addresses for the same physical address. These duplicate addresses could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this cannot happen, since the accesses would first be translated to the same physical cache block. Hardware solutions to the synonym problem, called antialiasing, guarantee every cache block a unique physical address. Software can make the problem much easier by forcing aliases to share some address bits; this restriction is called page coloring.

- I/O: I/O typically uses physical addresses and thus would require mapping to virtual addresses to interact with a virtual cache.
One alternative is to use part of the page offset (the part that is identical in both virtual and physical addresses) to index the cache. While the cache is being read using that index, the virtual part of the address is translated, and the tag match uses physical addresses. This alternative allows the cache read to begin immediately, yet the tag comparison still uses physical addresses. The limitation of this virtually indexed, physically tagged alternative is that a direct-mapped cache can be no bigger than the page size.
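The size limit on a virtually indexed, physically tagged cache follows because the index and block-offset bits must come entirely from the page offset. As a sketch, generalizing beyond the direct-mapped case in the text (an assumption here, not stated above): with n-way associativity the limit becomes page size times associativity, since associativity shrinks the number of index bits needed.

```python
def max_vipt_cache_size(page_size, associativity):
    """Largest virtually indexed, physically tagged cache whose index
    and offset bits fit within the page offset:
    cache size <= page size * associativity."""
    return page_size * associativity

# With 4 KiB pages: a direct-mapped VIPT cache is limited to 4 KiB,
# while 8-way associativity allows up to 32 KiB.
print(max_vipt_cache_size(4096, 1))  # 4096
print(max_vipt_cache_size(4096, 8))  # 32768
```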