
VLSI Project

Least Recently Frequently Used Caching Algorithm with Filtering Policies


Alexander Zlotnik Marcel Apfelbaum

Supervised by: Michael Behar, Winter 2005/2006


VLSI Project Winter 2005/2006 1

Introduction (cont.)
Cache definition
  Memory chip that is part of the processor
  Same technology
  Speed: same order of magnitude as accessing registers
  Relatively small and expensive
  Acts like a hash function: holds part of the address space


Introduction (cont.)
Cache memories
Main idea
  When the processor needs an instruction or data, it first looks for it in the cache.
  If that fails, it brings the data from main memory into the cache and uses it from there.
  The address space is partitioned into blocks.
  The cache holds lines; each line holds a block.
  A block may not exist in the cache -> cache miss

On a cache miss
  The entire block is fetched into a line buffer and then put into the cache.
  Before the new block is put into the cache, another block may need to be evicted to make room for it.
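
The lookup, miss and eviction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not specify the cache organization, so a direct-mapped cache is assumed here, and the class name `DirectMappedCache` and its parameters are hypothetical.

```python
class DirectMappedCache:
    """Toy direct-mapped cache (assumed organization, for illustration only)."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [None] * num_lines  # each line holds one block number or None

    def access(self, addr):
        block = addr // self.block_size   # the address space is split into blocks
        index = block % self.num_lines    # the line this block maps to
        if self.lines[index] == block:
            return "hit"
        evicted = self.lines[index]       # block that must leave to make room
        self.lines[index] = block         # fetched block is placed in the line
        return "miss" if evicted is None else "miss+evict"


cache = DirectMappedCache(num_lines=4, block_size=16)
for addr in (0, 8, 64, 0):
    print(addr, cache.access(addr))
```

Addresses 0 and 8 fall in the same block, so the second access hits; address 64 maps to the same line as block 0 and forces an eviction.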

Introduction (cont.)
Cache aim
  Fast access time
    Fast search mechanism
  High hit ratio
    Highly effective replacement mechanism
    High adaptability: fast replacement of lines that are no longer needed
    Long-sighted: estimation of whether a block will be used in the future

Project Objective
  Develop an LRFU caching mechanism
  Implement a cache entrance filtering technique
  Compare and analyze against LRU
  Research various configurations of LRFU in order to achieve the maximum hit rate


Project Requirements
  Develop for the SimpleScalar platform to simulate processor caches
  Run the developed caching and filtering mechanisms on accepted benchmarks
  C language
  No hardware component equivalence needed; software implementation only

Background and Theory


Cache Replacement options:
FIFO, LRU, Random, Pseudo LRU, LFU

Currently used algorithms:


LRU (2 ways: requires 1 bit per set to mark the latest accessed way)
Pseudo LRU (4 ways and more, fully associative)

Pseudo LRU (4-way example)
  Bit 0 specifies whether the latest access was to ways (0,1) or to ways (2,3)
  Bit 1 specifies which of ways 0 and 1 was accessed last
  Bit 2 specifies which of ways 2 and 3 was accessed last
[Figure: Bit 0, Bit 1, Bit 2]
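
A minimal sketch of the 4-way pseudo-LRU scheme described above, assuming one particular bit encoding that the slides do not spell out: bit 0 is 0 when the last access was to ways (0,1) and 1 when it was to ways (2,3); bit 1 is 0 or 1 for way 0 or way 1; bit 2 is 0 or 1 for way 2 or way 3. The function names `plru_touch` and `plru_victim` are hypothetical.

```python
def plru_touch(bits, way):
    """Return the updated (bit0, bit1, bit2) after an access to `way` (0..3)."""
    b0, b1, b2 = bits
    if way < 2:
        return (0, way, b2)        # last access in pair (0,1); remember which way
    return (1, b1, way - 2)        # last access in pair (2,3); remember which way


def plru_victim(bits):
    """Pick the way to evict: the other pair, then the less recent way in it."""
    b0, b1, b2 = bits
    if b0 == 0:                    # pair (0,1) was used last, evict from (2,3)
        return 2 if b2 == 1 else 3
    return 0 if b1 == 1 else 1     # pair (2,3) was used last, evict from (0,1)


bits = (0, 0, 0)
for way in (0, 1, 2, 3, 0):
    bits = plru_touch(bits, way)
print(bits, plru_victim(bits))
```

After the access sequence 0, 1, 2, 3, 0 the true LRU way is 1, but this scheme selects way 2, which shows why it is only an approximation of LRU with three bits per set.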



Background and Theory (cont)


LRU
  Advantages
    High adaptability
    1-cycle algorithm
    Low memory usage
  Disadvantage
    Short-sighted

LFU
  Advantages
    Long-sighted
    Smarter
  Disadvantages
    Cache pollution
    Requires many cycles
    More memory needed

Background and Theory (cont)


Observation
Both recency and frequency affect the likelihood of future references

Goal
A replacement algorithm that allows a flexible trade-off between recency and frequency

The idea: LRFU (Least Recently/Frequently Used)


  Subsumes both the LRU and LFU algorithms
  Overcomes the cycles used by LFU by filtering cache entrances
  Yields better performance than either of them

Development Stages
1. Studying the background
2. Learning the SimpleScalar sim-cache platform
3. Developing the LRFU caching algorithm for SimpleScalar
4. Developing the filtering policy
5. Benchmarking (smart environment)
6. Analyzing various LRFU configurations and comparing them with the LRU algorithm

Principles
The LRFU policy associates a value with each block. This value quantifies the likelihood that the block will be referenced in the near future. Each reference to a block in the past adds a contribution to this value and its contribution is determined by a weighing function F.
Example: a block referenced at times $t_1$, $t_2$ and $t_3$, evaluated at the current time $t_c$:

$$C_{t_c}(\text{block}) = F(\delta_1) + F(\delta_2) + F(\delta_3), \qquad \delta_i = t_c - t_i$$

Principles (cont)
Weighing function $F(x) = (1/2)^{\lambda x}$
  Monotonically decreasing
  Subsumes LRU and LFU
    When $\lambda = 0$ (i.e. $F(x) = 1$), it becomes LFU
    When $\lambda = 1$ (i.e. $F(x) = (1/2)^{x}$), it becomes LRU
    When $0 < \lambda < 1$, it is between LFU and LRU

[Figure: F(x) plotted against x = current time - reference time. F(x) = 1 is the LFU extreme, F(x) = (1/2)^x is the LRU extreme, and the region between the two curves is the LRU/LFU spectrum.]
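
To make the spectrum concrete, the sketch below evaluates the value C(block) as a sum of F over past references and shows how the choice of victim changes with the parameter λ (written `lam` in the code). The reference times, the block names and the helper names `weighing`, `crf` and `victim` are made up for illustration.

```python
def weighing(x, lam):
    """F(x) = (1/2) ** (lam * x), where x = current time - reference time."""
    return 0.5 ** (lam * x)


def crf(reference_times, now, lam):
    """C(block) at time `now`: one contribution F(now - t) per past reference."""
    return sum(weighing(now - t, lam) for t in reference_times)


def victim(blocks, now, lam):
    """Name of the block with the smallest value, i.e. the replacement candidate."""
    return min(blocks, key=lambda name: crf(blocks[name], now, lam))


# Block A was referenced three times long ago, block B once very recently.
blocks = {"A": [1, 2, 3], "B": [9]}
for lam in (0.0, 1.0):
    print(lam, victim(blocks, now=10, lam=lam))
```

With `lam = 0.0` every reference counts as 1, so B (one reference) is the victim, as under LFU. With `lam = 1.0` old references decay quickly, so A is the victim, as under LRU.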


Principles (cont)
Update of C(block) over time
Only two counters for each block are needed to calculate C(block)
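Proof:
Let $\delta_1, \delta_2, \delta_3$ be the ages of the past references at time $t_1$, let $\delta = t_2 - t_1$, and take $F(x) = (1/2)^{\lambda x}$.

$$
\begin{aligned}
C_{t_2}(b) &= F(\delta_1 + \delta) + F(\delta_2 + \delta) + F(\delta_3 + \delta) \\
&= (1/2)^{\lambda(\delta_1 + \delta)} + (1/2)^{\lambda(\delta_2 + \delta)} + (1/2)^{\lambda(\delta_3 + \delta)} \\
&= \left( (1/2)^{\lambda\delta_1} + (1/2)^{\lambda\delta_2} + (1/2)^{\lambda\delta_3} \right) (1/2)^{\lambda\delta} \\
&= C_{t_1}(b) \times F(\delta)
\end{aligned}
$$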
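
The identity can be checked numerically. The sketch below compares the full sum over all past references with the incremental form that needs only the previous value and the elapsed time; the reference times and the value of `lam` (the parameter of the weighing function) are arbitrary, and the helper names are hypothetical.

```python
def weighing(x, lam):
    """F(x) = (1/2) ** (lam * x)."""
    return 0.5 ** (lam * x)


def crf_full(reference_times, now, lam):
    """C(b) at time `now`, recomputed from the complete reference history."""
    return sum(weighing(now - t, lam) for t in reference_times)


def crf_decay(previous_crf, elapsed, lam):
    """C(b) after `elapsed` time units with no new reference: C_t1(b) * F(elapsed)."""
    return previous_crf * weighing(elapsed, lam)


refs, lam = [2, 5, 7], 0.5
c_t1 = crf_full(refs, 8, lam)              # value at t1 = 8
c_t2_full = crf_full(refs, 12, lam)        # value at t2 = 12, from the history
c_t2_incremental = crf_decay(c_t1, 12 - 8, lam)
print(abs(c_t2_full - c_t2_incremental) < 1e-12)
```

This is why the history itself never has to be stored: the previous value and the time since it was computed are enough.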

Design and Implementation


Filtering
[Flowchart of the filtering path. Recoverable labels: Data address; In cache? (in cache: END; not in cache: continue); In victims cache? (in cache / not in cache); Filter; Filter out: insert data into victims cache, END; Insert into cache; Insert data removed from cache by LRFU.]

Design and Implementation (cont)


Data structure

LRFU uses two BOUNDED counters for each block

Hardware budget
Counters
  Each block in the cache requires two bounded counters
    Previous C(t)
    Time that has passed since the previous access

Victims cache
The size will be based on empirical analysis


Algorithms
Filtering
We implemented a very simple filtering algorithm, whose single task is to cause fewer changes in the cache.
After a cache miss, the fetched block is entered into the cache with a configurable probability p, 0 < p < 1. If the block is not entered into the cache, it is automatically entered into the victims cache.
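
The probabilistic admission step can be sketched as follows. This is not the project's SimpleScalar code: the function name `place_after_miss`, the use of a Python set as a stand-in for the cache, and the victims-cache size of 4 are all assumptions made for the example.

```python
import random
from collections import deque


def place_after_miss(tag, cache, victims, p, rng):
    """Place a block fetched after a miss; return "cache" or "victims"."""
    if rng.random() < p:          # admitted into the cache with probability p
        cache.add(tag)
        return "cache"
    victims.append(tag)           # filtered out: goes to the victims cache
    return "victims"


rng = random.Random(2006)         # seeded only to make runs repeatable
cache = set()                     # stand-in for the main cache contents
victims = deque(maxlen=4)         # assumed size; the oldest entry drops out when full
for tag in range(10):
    place_after_miss(tag, cache, victims, p=0.5, rng=rng)
print(len(cache), list(victims))
```

Passing the random source in as a parameter keeps the policy deterministic under test; a lower `p` means fewer changes in the main cache per miss.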

Replacement
After a cache miss, C(t) is calculated for each block in the set, and the block with the smallest C(t) is selected for replacement.
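
A minimal sketch of this replacement step for a single set, assuming the weighing function F(x) = (1/2)^(λx) with λ written as `lam`. Each block keeps two fields, its value at the last access and the time of that access. Counter bounding and filtering are left out, and the class name `LrfuSet` is hypothetical.

```python
class LrfuSet:
    """One cache set under LRFU replacement (illustrative sketch only)."""

    def __init__(self, ways, lam):
        self.ways = ways
        self.lam = lam
        self.blocks = {}  # tag -> (C at last access, time of last access)

    def _value_now(self, tag, now):
        crf, last = self.blocks[tag]
        return crf * 0.5 ** (self.lam * (now - last))   # C decays by F(elapsed)

    def access(self, tag, now):
        """Reference `tag` at time `now`; return (hit, evicted_tag)."""
        if tag in self.blocks:
            # A new reference contributes F(0) = 1 on top of the decayed value.
            self.blocks[tag] = (1.0 + self._value_now(tag, now), now)
            return True, None
        evicted = None
        if len(self.blocks) >= self.ways:
            evicted = min(self.blocks, key=lambda t: self._value_now(t, now))
            del self.blocks[evicted]
        self.blocks[tag] = (1.0, now)
        return False, evicted


for lam in (0.0, 1.0):
    s = LrfuSet(ways=2, lam=lam)
    for now, tag in enumerate(["A", "A", "B"], start=1):
        s.access(tag, now)
    print(lam, s.access("C", 4))
```

With `lam = 0.0` the set evicts B, the block with fewer references; with `lam = 1.0` it evicts A, whose references are older.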


Results

[Figure: hit rate versus cache size (number of blocks)]

Results (cont)

[Figure: hit rate]

Special Problems
Software simulation of hardware
Utilizing existing data structures of SimpleScalar

Finding the perfect C(t)


Applying mathematical theory in practice

Conclusions
We implemented a different cache replacement mechanism and obtained exciting results
Hardware implementation of the mechanism is hard, but possible
The implementation achieved its goals
  Subsumes both the LRU and LFU algorithms
  Yields better performance than them (up to 30%)

Future Research
Implementation of better filtering techniques
Dynamic version of the LRFU algorithm
  Adjust $\lambda$ periodically depending on the evolution of the workload
Research into the hardware needed for LRFU