
VLSI Project

Least Recently Frequently Used Caching Algorithm with Filtering Policies

Alexander Zlotnik Marcel Apfelbaum

Supervised by: Michael Behar, Winter 2005/2006

VLSI Project Winter 2005/2006 1

Introduction (cont.)
Cache definition: a memory chip that is part of the processor
Built with the same technology as the processor
Speed: same order of magnitude as accessing registers
Relatively small and expensive
Acts like a hash function: holds part of the address space


Introduction (cont.)
Cache memories: main idea
When the processor needs an instruction or data, it first looks for it in the cache. If that fails, it brings the data from main memory into the cache and uses it from there.
The address space is partitioned into blocks; the cache holds lines, and each line holds one block.
A block may not be present in the cache -> a cache miss

If we miss the Cache

The entire block is fetched into a line buffer and then placed into the cache.
Before the new block is placed in the cache, another block may need to be evicted from the cache (to make room for the new block).

Introduction (cont.)
Cache aims
Fast access time
Fast search mechanism
High hit ratio
Highly effective replacement mechanism
High adaptability: fast replacement of lines that are no longer needed
Long-sightedness: estimating whether a block will be used in the future

Project Objective
Develop an LRFU caching mechanism
Implement a cache-entrance filtering technique
Compare and analyze against LRU
Research various configurations of LRFU in order to achieve the maximum hit rate


Project Requirements
Develop for the SimpleScalar platform to simulate processor caches
Run the developed caching & filtering mechanisms on accepted benchmarks
C language
No hardware equivalent needed; software implementation only

Background and Theory

Cache Replacement options:
FIFO, LRU, Random, Pseudo LRU, LFU

Currently used algorithms:

LRU (2-way: requires one bit per set to mark the latest accessed way)
Pseudo LRU (4 ways and more, fully associative)

Pseudo LRU (4-way example)

Bit 0 specifies whether the last access was to ways (0,1) or to ways (2,3)
Bit 1 specifies which of ways 0 and 1 was accessed more recently
Bit 2 specifies which of ways 2 and 3 was accessed more recently


Background and Theory (cont)

LRU:
+ High adaptability
+ 1-cycle algorithm
+ Low memory usage
- Short-sighted

LFU:
+ Long-sighted ("smarter")
- Cache pollution
- Requires many cycles
- More memory needed


Background and Theory (cont)

Both recency and frequency affect the likelihood of future references

A replacement algorithm that allows a flexible trade-off between recency and frequency

The idea: LRFU (Least Recently/Frequently Used)

Subsumes both the LRU and LFU algorithms
Overcomes the cycles spent by LFU by filtering cache entrances
Yields better performance than both

Development Stages
1. Studying the background
2. Learning the SimpleScalar sim-cache platform
3. Developing the LRFU caching algorithm for SimpleScalar
4. Developing the filtering policy
5. Benchmarking (smart environment)
6. Analyzing various LRFU configurations and comparing with the LRU algorithm



Principles
The LRFU policy associates a value with each block. This value quantifies the likelihood that the block will be referenced in the near future. Each past reference to the block adds a contribution to this value, and that contribution is determined by a weighing function F.

For a block referenced at times t1, t2, t3, with current time tc:

C_tc(block) = F(tc - t1) + F(tc - t2) + F(tc - t3)

Principles (cont)
Weighing function: F(x) = (1/2)^(λx), where x = current time - reference time

Monotonically decreasing
Subsumes both LRU and LFU:
When λ = 0 (i.e. F(x) = 1), it becomes LFU
When λ = 1 (i.e. F(x) = (1/2)^x), it becomes LRU
When 0 < λ < 1, it lies on the LRU/LFU spectrum, with F(x) = 1 as the LFU extreme and F(x) = (1/2)^x as the LRU extreme


Principles (cont)
Update of C(block) over time
Only two counters per block are needed to maintain C(block): the value of C at the last reference and the time of that reference. Let δ = t2 - t1, and let block b have been referenced δ1, δ2, δ3 time units before t1. Then:

C_t2(b) = F(δ1 + δ) + F(δ2 + δ) + F(δ3 + δ)
        = (1/2)^(λ(δ1+δ)) + (1/2)^(λ(δ2+δ)) + (1/2)^(λ(δ3+δ))
        = ((1/2)^(λδ1) + (1/2)^(λδ2) + (1/2)^(λδ3)) × (1/2)^(λδ)
        = C_t1(b) × F(δ)

Design and Implementation

On each data-address access:
1. If the block is in the cache: END (hit).
2. Otherwise, if it is in the victims cache: END (hit in the victims cache).
3. Otherwise (not in either cache), the filter decides: either insert the block into the cache, or filter it out and insert it into the victims cache.
4. A block removed from the cache by LRFU replacement is inserted into the victims cache.



Design and Implementation (cont)

Data structure

LRFU uses two BOUNDED counters for each block



Hardware budget
Each block in the cache requires two bounded counters:
Previous C(t)
Time elapsed since the previous access

Victims cache
Its size will be determined by empirical analysis



Design and Implementation (cont)

Filtering: we implemented a very simple filtering algorithm whose single task is to cause fewer changes in the cache.
After a cache miss, the fetched block is inserted into the cache with probability 0 < p < 1, where p is configurable. If the block is not inserted into the cache, it is automatically inserted into the victims cache.

Replacement: after a cache miss, C(t) is calculated for each block in the set, and the block with the smallest C(t) is selected for replacement.




Results

[Figure: hit rate vs. cache size (# of blocks)]


Results (cont)

[Figure: hit rate]


Special Problems
Software simulation of hardware
Utilizing existing data structures of SimpleScalar

Finding the perfect C(t)

Applying mathematical theory in practice



Conclusions
We implemented a different cache replacement mechanism and obtained exciting results
Hardware implementation of the mechanism is hard, but possible
The implementation achieved its goals:
Subsumes both the LRU and LFU algorithms
Yields better performance than both (up to 30%!)



Future Research
Implementation of better filtering techniques
A dynamic version of the LRFU algorithm: adjust λ periodically depending on the evolution of the workload
Research into the hardware needed for LRFU
