Abstract—Registers within a processor, cache within, on, or outside the processor, and virtual memory on the disk drive build the memory hierarchy in modern computer systems. The principle of locality of reference makes this memory hierarchy work efficiently. In recent years, cache organizations and designs have witnessed several advances that have not only improved performance measures such as hit rate, speed, latency and energy consumption, but have also produced new designs and organizations for chip multi-processors, such as multilevel caches, Non-Uniform Cache Access (NUCA) and hybrid caches. This paper presents a study of current competing processors in terms of the various factors determining the performance and throughput of cache organization and design. To evaluate their performance and viability, it reviews recent cache trends that include hybrid cache memory, non-uniform cache architecture, energy efficient replacement algorithms, cache memory programming, software defined caches and emerging techniques for making caches reliable against soft errors. It discusses the pros and cons of emerging cache architectures and designs.

Keywords – Cache Memory; Cache Design; Hybrid Cache Memory; Cache Performance; Memory Programming; Software Defined Cache

I. INTRODUCTION

To reduce memory access time, it is desirable to have a memory with the least possible access time whose capacity is never exhausted. An efficient and economical way to approach this ideal is through a few levels of memory constituting a memory hierarchy. Levels close to the processor are fast, smaller in capacity and expensive in comparison to those farther away. The memory hierarchy works in such a way that the low capacity, more expensive and faster memories are supplemented by high capacity, cheaper and slower memories. The principle of locality of reference is the key to the success of this organization, because as one moves down the memory hierarchy, the frequency of access decreases.

Memory access has temporal and spatial characteristics. The processor often accesses instructions and data (the loci of reference) from locations that are already in use or near the current address. Temporal locality is due to the use of loops, which reuse data and instructions. Over a short duration, a program distributes its memory references non-uniformly over its address space, and the favored portions remain largely the same over a long duration. Spatial locality arises because the address space is divided into small contiguous segments. Cache memories are used to keep the data and instructions from the portions of memory currently accessed by the processor close to it, and they deliver data to the processor at a much faster speed than main memory.
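As a concrete illustration of both kinds of locality (a generic sketch, not an example from the surveyed literature, with an arbitrary matrix size), the loop below reuses the accumulator and loop variables on every iteration (temporal locality), and its row-major sweep touches consecutive addresses that share cache lines (spatial locality); the column-major sweep of the same array strides across cache lines and forfeits most of that benefit.

    /* Illustrative only: a cache-friendly row-major sweep versus a
     * strided column-major sweep of the same array. */
    #include <stdio.h>

    #define N 1024
    static double a[N][N];              /* C stores this array row-major */

    int main(void)
    {
        double sum = 0.0;               /* reused every iteration: temporal locality */

        /* Row-major: consecutive elements share cache lines (spatial locality). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-major: successive accesses are N*sizeof(double) bytes apart,
         * so nearly every reference touches a different cache line. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }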
II. CACHE DESIGN STRATEGY

The processor has three main units, namely the instruction unit, the execution unit, and the storage unit. Instruction fetch and decode are performed by the instruction unit. The execution unit is responsible for logical and arithmetic operations and for executing instructions. The storage unit establishes an interface, through a temporary store, between the other two units. The major components of the storage unit are the cache memory, the translator, and the Translation Look-aside Buffer (TLB). An Address Space Identifier Table (ASIT), a Buffer Invalidation Address Stack (BIAS) and write-through buffers may also be present in the storage unit.

Technology has permitted the fabrication of chips containing over two million transistors, of which only some portion is required to build a powerful processor. To reduce inter-chip data transfers, which could lead to higher memory access time, on-chip memory is placed in the processor. Table 1 shows the specifications and cache design strategy of a few recent processors produced by Intel and AMD.

The cache organization is determined by the mapping technique. A mapping technique maps the larger number of main memory blocks onto the fewer lines of cache, and the tag bits within each cache line determine which mapped block of main memory is currently present in that line. Of the three possible approaches, i.e. direct, associative and set associative mapping, set associative caches are considered best due to their better hit rate and lower access time. However, it has been reported that beyond a certain point, increasing cache size has more of an impact than increasing associativity.
The replacement algorithm plays a vital role in the design of a cache memory because it decides which line of the cache is to be replaced with the desired block of main memory. Least Frequently Used (LFU), First In First Out (FIFO) and Least Recently Used (LRU) are algorithms which may be used for making such decisions. LRU is the most effective algorithm because, besides being easy to implement, more recently used words are more likely to be referenced again.

Write Caching, Write Back and Write Through are the write policies that decide how consistency is maintained between cache lines and their corresponding blocks in main memory. Considerable memory traffic is generated under the write-through policy, while under the write-back policy a single-bit error of any type cannot be tolerated unless ECC (Error Correcting Code) protection is provided. Write caching is a combination of write-through and write-back caching in which a small fully-associative cache is placed behind a write-through cache.
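The following sketch ties the three design axes above together: a set-associative cache with LRU replacement and a selectable write-back or write-through policy. It is a minimal illustrative model under assumed parameters (geometry, write-allocate on miss), not the design of any processor in Table 1.

    /* Minimal set-associative cache model with LRU replacement and a
     * selectable write policy. The geometry and the write-allocate-on-miss
     * behaviour are simplifying assumptions. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS  64
    #define WAYS      4
    #define LINE_SIZE 64

    struct line {
        bool     valid;
        bool     dirty;        /* meaningful only under write-back */
        uint32_t tag;
        uint64_t last_use;     /* timestamp used to pick the LRU victim */
    };

    static struct line cache[NUM_SETS][WAYS];
    static uint64_t ticks;
    static unsigned hits, misses, mem_writes;

    static void cache_access(uint32_t addr, bool is_write, bool write_back)
    {
        uint32_t block = addr / LINE_SIZE;
        uint32_t set   = block % NUM_SETS;
        uint32_t tag   = block / NUM_SETS;
        struct line *victim = &cache[set][0];

        ticks++;
        for (int w = 0; w < WAYS; w++) {
            struct line *l = &cache[set][w];
            if (l->valid && l->tag == tag) {           /* hit */
                hits++;
                l->last_use = ticks;
                if (is_write) {
                    if (write_back) l->dirty = true;   /* defer the memory write */
                    else            mem_writes++;      /* write through at once  */
                }
                return;
            }
            /* Remember the least recently used (or an invalid) line. */
            if (!l->valid || (victim->valid && l->last_use < victim->last_use))
                victim = l;
        }

        misses++;                                      /* miss: fill the victim  */
        if (victim->valid && victim->dirty)
            mem_writes++;                              /* flush the dirty victim */
        victim->valid    = true;
        victim->dirty    = is_write && write_back;     /* write-allocate on miss */
        victim->tag      = tag;
        victim->last_use = ticks;
        if (is_write && !write_back)
            mem_writes++;
    }

    int main(void)
    {
        /* Six writes that overflow set 0 and force one dirty eviction.
         * Under write-through the same trace would cost six memory writes. */
        uint32_t trace[] = { 0x0000, 0x0000, 0x1000, 0x2000, 0x3000, 0x4000 };
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
            cache_access(trace[i], true, true);
        printf("write-back: hits=%u misses=%u memory writes=%u\n",
               hits, misses, mem_writes);  /* hits=1 misses=5 memory writes=1 */
        return 0;
    }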
TABLE 1. SPECIFICATIONS OF AMD AND INTEL PROCESSORS

Processor | Clock Speed (MHz) | Clock Multiplier | Bus Speed | Micro-Architecture | Data Width (bits) | Cores | Threads | L1 Cache (KB) | L2 Cache (KB/MB) | L3 Cache (MB) | Mapping, Replacement Algorithm & Write Policy | Caches | Thermal Design Power (W)

AMD Processors
Opteron 6180 | 2500 | X | 3200 MHz | K10 | 64 | 12 | 12 | I: 12x64, D: 12x64 | 12x512 KB | 2x6 | NPA | 3 | 140
Phenom X4 9950 | 2600 | 13 | 533 MHz* | K10 | 64 | 4 | 4 | I: 4x64, D: 4x64 | 4x512 KB | 2 shared | NPA | 3 | 140
Athlon X4 760 | 3800 | X | 4100 MHz | Piledriver | 64 | 4 | 4 | I: 2x64, D: 2x16 | 2x2 MB | Nil | NPA | 2 | 100
A10-6800K | 4100 | X | 4400 MHz | Piledriver | 64 | 4 | 4 | I: 2x64, D: 4x16 | 2x2 MB | Nil | NPA | 2 | 100
Legend: I: instruction cache; D: data cache; *: 533 MHz memory controller with one 2000 MHz 16-bit HyperTransport link.

Intel Processors
Core i3-2100T | 2500 | 25 | 5 GT/s DMI | Sandy Bridge | 64 | 2 | 4 | I: 2x32, D: 2x32 | 2x256 KB | 3 shared | NPA | 3 | 35
Core i5-760 | 2800 | 21 | 2.5 GT/s DMI | Nehalem | 64 | 4 | 4 | I: 4x32, D: 4x32 | 4x256 KB | 8 shared | NPA | 3 | 95
Core i7-875K | 2933 | 22 | 2.5 GT/s DMI | Nehalem | 64 | 4 | 8 | I: 4x32, D: 4x32 | 4x256 KB | 8 shared | NPA | 3 | 95
Core i7-990X* | 3467 | 26 | 6.4 GT/s DMI | Westmere# | 64 | 6 | 12 | I: 6x32, D: 6x32 | 6x256 KB | 12 shared | NPA | 3 | 130
Legend: I: instruction cache; D: data cache; #: Nehalem (Westmere); *: Extreme Edition; NPA: not publicly available.
More often, the L1 cache consists of an L1 data cache and an L1 instruction cache. Due to its ability to balance the load between fetched instructions and data, a unified cache exhibits a higher hit rate as compared to a split cache. On the other hand, the split cache design eliminates the problem of contention between the fetch/decode and execution units; this contention can degrade performance by interfering with efficient use of the instruction pipeline. Unfortunately, many implementation details of processors, such as the mapping function, replacement algorithm and write policy, are not publicly available. In general, AMD processors and x86 processors from Intel employ a direct-mapped L1 cache, and the L2 cache is usually 2- to 4-way set associative. The L3 and higher caches could be between 16-way and 64-way set associative. Most of them use a least recently used replacement policy, without much variation, and a write-back cache.

III. HYBRID CACHE MEMORY BASED ON SRAM-MRAM

Magnetic Random Access Memory (MRAM) and Phase Change Random Access Memory (PRAM) are non-volatile and energy efficient, which makes them suitable for future computing. However, they suffer from the drawbacks of limited endurance and long write latency. To overcome these issues, hybrid cache memory systems consisting of both volatile and non-volatile memory are being actively investigated.

Notable features of SRAM, embedded Dynamic RAM (eDRAM), MRAM and PRAM are presented in Table 2. It is evident that the different cache technologies excel in one field but fall behind in others. Consequently, a cache designed by employing different technologies may outperform its counterparts. As the capacity of caches becomes larger, a more sophisticated hybrid memory system architecture and operating mechanism can deliver superior performance and energy efficiency compared to a conventional cache memory system with only SRAM and/or MRAM. Limited studies [1], [2], [3] have been conducted to design efficient hybrid cache memory systems with non-volatile memory.

TABLE 2. COMPARISON OF VARIOUS MEMORY TECHNOLOGIES

Feature       | SRAM      | eDRAM  | MRAM                  | PRAM
Density       | Low       | High   | High                  | Very High
Speed         | Very Fast | Fast   | Fast Read, Slow Write | Slow Read, Very Slow Write
Dynamic Power | Low       | Medium | Low Read, High Write  | Medium Read, High Write
Leak Power    | High      | Medium | Low                   | Low
Non-volatile  | No        | No     | Yes                   | Yes

Wu et al [1] evaluated several possibilities for accommodating on-chip cache hierarchies based on the Hybrid Cache Architecture (HCA). The study evaluated the Level Hybrid Cache Architecture (LHCA) and Region based HCA (RHCA). LHCA (inter cache level) used disparate memory technologies to build the levels of a cache hierarchy and provided a 7% geometric mean IPC (instructions per cycle) improvement over the baseline cache, whereas RHCA (intra cache level) partitioned a single level of cache into multiple regions and provided a 12% IPC improvement over the baseline cache under the same area constraints. To alleviate the performance degradation caused by the long access latency of MRAM with a small SRAM cache, a way-partitioned hybrid cache memory system constructed with SRAM and MRAM was proposed by Lee et al [2]. SRAM technology was used to build the level 1 cache, whereas the level 2 caches were manufactured using S, D and MRAM technologies. Using various benchmark programs, it was found that there is a significant reduction in the average memory access time (AMAT) and power consumption for hybrid caches in comparison to a homogeneous SRAM architecture.
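The AMAT reductions reported for such hybrid caches follow directly from the standard relation AMAT = hit time + miss rate x miss penalty, applied level by level (the L1 miss penalty is the AMAT of L2, and so on). The sketch below works this through for a two-level hierarchy; every latency and miss rate in it is invented for illustration and is not a measurement from [2].

    /* Two-level AMAT calculation. A denser hybrid L2 trades a slightly
     * higher hit latency for a lower miss rate; all numbers here are
     * invented for illustration, not measurements from Lee et al [2]. */
    #include <stdio.h>

    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        double mem = 200.0;                        /* main memory latency, cycles */

        /* Homogeneous SRAM L2: fast but small, so it misses more often. */
        double sram_l2   = amat(10.0, 0.40, mem);  /* 10 + 0.40*200 = 90 cycles */
        /* Hybrid SRAM/MRAM L2: denser, hence a lower miss rate, at the
         * cost of a higher average hit latency (MRAM writes are slow). */
        double hybrid_l2 = amat(14.0, 0.25, mem);  /* 14 + 0.25*200 = 64 cycles */

        printf("AMAT with SRAM L2:   %.1f cycles\n", amat(2.0, 0.10, sram_l2));
        printf("AMAT with hybrid L2: %.1f cycles\n", amat(2.0, 0.10, hybrid_l2));
        return 0;                                  /* 11.0 versus 8.4 cycles */
    }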
The energy efficiency and performance cannot be improved simply by adopting MRAM as the L2 or L3 memory in the conventional memory hierarchy and the best cache
An extensive study, the first of its kind, comparing the performance of each of the above discussed cache schemes on a set of 16 benchmarks was carried out by Kim et al [5]. Figure 2 shows the comparative evaluation of the IPC obtained at 16 MB/50 nm from UCA, S-NUCA-1, ML-UCA, S-NUCA-2, DN-best (the best D-NUCA policy), and an ideal D-NUCA upper bound. When the delay to route across a cache is significant, performance can be improved considerably by partitioning the cache into more banks. However, the significant area overhead present in S-NUCA-1 for larger numbers of banks can restrict the partitioning of banks. This problem is addressed by the S-NUCA-2 cache architecture shown in figure 1 (d); a 2D switched network is used in static NUCA instead of private per-bank channels. The Dynamic NUCA (D-NUCA) scheme improves cache performance over S-NUCA as it adaptively places data close to the requesting core [4]. It involves a combination of placement, migration, and replication strategies. Placement and migration dynamically place data close to the cores that use it, reducing access latency. Replication makes multiple copies of frequently used lines, reducing latency for widely read-shared lines (e.g., hot code), at the expense of some capacity loss, as depicted in figure 1 (e).

It was found that DN-best outperformed the other schemes on all benchmarks except mgrid, gcc and bt, and that DN-best offers more stable performance. The IPC of DN-best was found to be only 16% worse than the ideal D-NUCA upper bound, which signifies that the performance of DN-best is close to ideal and is unlikely to benefit from better migration policies or compiler support. The flexibility and scalability of NUCA memory systems will benefit emerging chip multiprocessor (CMP) architectures, because a NUCA memory system provides relatively faster cache access as compared to UCA only when the cache size exceeds some limit. Therefore, cache access techniques can be modified to work faster at the lower levels (L1) that have smaller sizes.
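The gradual migration at the heart of D-NUCA can be sketched compactly: banks are ordered by their distance from the requesting core, and every hit promotes the accessed line one bank closer, so heavily used data settles in the nearest, fastest banks. The toy single-core model below illustrates only that promotion policy; the bank count, bank size and linear search are assumptions, not Kim et al's implementation [5], which searches banks over a 2D switched network.

    /* Toy sketch of D-NUCA-style gradual migration: a chain of banks
     * ordered by distance from the core; each hit swaps the line one
     * bank closer. Sizes are assumptions; tag 0 marks an empty slot. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define BANKS          8
    #define LINES_PER_BANK 4

    static uint32_t bank[BANKS][LINES_PER_BANK];   /* stored tags */

    /* Returns the bank that served the request (-1 on a miss);
     * access latency would grow with the bank index. */
    static int lookup(uint32_t tag)
    {
        for (int b = 0; b < BANKS; b++)
            for (int l = 0; l < LINES_PER_BANK; l++)
                if (bank[b][l] == tag) {
                    if (b > 0) {                   /* hit: migrate one bank closer */
                        uint32_t demoted = bank[b - 1][0];
                        bank[b - 1][0] = tag;      /* swap with an upstream line */
                        bank[b][l] = demoted;
                    }
                    return b;
                }
        /* Miss: place the line in the farthest bank; it must then earn
         * its way inward through repeated hits. */
        memmove(&bank[BANKS - 1][1], &bank[BANKS - 1][0],
                (LINES_PER_BANK - 1) * sizeof(uint32_t));
        bank[BANKS - 1][0] = tag;
        return -1;
    }

    int main(void)
    {
        /* Repeated accesses pull tag 42 from the farthest bank to bank 0. */
        for (int i = 0; i < 9; i++)
            printf("access %d served by bank %d\n", i, lookup(42));
        return 0;
    }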
V. ENERGY EFFICIENT REPLACEMENT ALGORITHM

Storage systems consume significant energy in data centers, raising energy consumption concerns and requirement