
A Study of Recent Advances in Cache Memories

M. Tariq Banday
Department of Electronics & Inst. Technology
University of Kashmir, Srinagar, India
E-mail: sgrmtb@yahoo.com

Munis Khan
School of Electronics and Communication Engineering
SMVD University, Katra, India
E-mail: munis819@hotmail.com

Abstract—Registers within a processor; cache within, on, or outside the processor; and virtual memory on the disk drive build the memory hierarchy in modern computer systems. The principle of locality of reference makes this memory hierarchy work efficiently. In recent years, cache organizations and designs have witnessed several advances that have improved their performance in terms of hit rate, speed, latency, energy consumption, etc.; in addition, various new designs and organizations for chip multi-processors, such as multilevel caches, Non-Uniform Cache Access (NUCA) and hybrid caches, have emerged. This paper presents a study of current competing processors in terms of various factors determining the performance and throughput of cache organization and design. To evaluate their performance and viability, it reviews recent cache trends that include hybrid cache memory, non-uniform cache architecture, energy efficient replacement algorithms, cache memory programming, software defined caches and emerging techniques for making caches reliable against soft errors. It discusses the pros and cons of emerging cache architectures and designs.

Keywords – Cache Memory; Cache Design; Hybrid Cache Memory; Cache Performance; Memory Programming; Software Defined Cache

I. INTRODUCTION

To reduce memory access time, it is desirable to have memory with the least possible access time that never exhausts. An efficient and economical way to approach this is through a few levels of memory constituting a memory hierarchy. Levels close to the processor are fast, smaller in capacity and expensive in comparison to those farther away. The memory hierarchy works in such a way that the low-capacity, more expensive and faster memories are supplemented by high-capacity, cheaper and slower memories. The principle of locality of reference is the key to the success of this organization because, as one moves down the memory hierarchy, the frequency of access decreases.

Memory access has temporal and local (spatial) characteristics. The processor often accesses instructions and data (loci of reference) from locations that are already in use or are near the current address. Temporal locality is due to the use of loops, which reuse data and instructions. Over a short duration a program distributes memory references non-uniformly throughout its address space, and this distribution remains much the same over long durations. The local characteristic arises because the address space is divided into small segments. Cache memories are used to keep data and instructions from the portions of memory currently being accessed by the processor close to it, and they deliver data to the processor at a much faster speed than main memory.

II. CACHE DESIGN STRATEGY

The processor has three main units, namely the instruction unit, the execution unit, and the storage unit. Instruction fetch and decode are performed by the instruction unit. The execution unit is responsible for logical and arithmetic operations and for executing instructions. The storage unit establishes an interface, through a temporary store, between the other two units. The major components of the storage unit are the cache memory, the translator, and the Translation Look-aside Buffer (TLB). An Address Space Identifier Table (ASIT), a Buffer Invalidation Address Stack (BIAS) and write-through buffers may also be present in the storage unit.

Technology has permitted the fabrication of chips containing over two million transistors, of which only some portion is required to build a powerful processor. To reduce inter-chip data transfers, which could lead to higher memory access time, on-chip memory is placed in the processor. Table 1 shows the specifications and cache design strategy of a few recent processors produced by Intel and AMD.

The cache organization is determined by the mapping technique. A mapping technique maps a larger number of main memory blocks into fewer lines of cache, and the tag bits within each cache line determine which mapped block of main memory is currently present in a particular cache line. Of the three possible approaches, i.e., direct, associative and set associative, set-associative caches are considered best due to their better hit rate and lower access time. However, it has been reported that beyond a certain point, increasing cache size has more of an impact than increasing associativity.

The replacement algorithm plays a vital role in the design of cache memory because it decides which line of cache is to be replaced with the desired block of main memory. Least Frequently Used (LFU), First In First Out (FIFO) and Least Recently Used (LRU) are algorithms that may be used for making such decisions. LRU is the most effective algorithm because, besides its easy implementation, more recently used words are more likely to be referenced again.

Write Caching, Write Back and Write Through are the write policies that decide how consistency is maintained between cache lines and their corresponding blocks in main memory. Considerable traffic is generated by the write-through policy, and in the write-back policy a single-bit error of any type cannot be tolerated unless ECC (Error Correcting Code) is provided. Write caching is a combination of write-through and write-back caching in which a small fully-associative cache is placed behind a write-through cache.
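To make the mapping, replacement and write-policy choices above concrete, the following minimal Python sketch models a 4-way set-associative cache with LRU replacement and a write-back dirty bit. The block size, set count and associativity are illustrative assumptions, not parameters of any processor in Table 1.

```python
# Minimal sketch of a set-associative cache with LRU replacement and a
# write-back policy.  Sizes and the address split are illustrative only.

from collections import OrderedDict

BLOCK_SIZE = 64          # bytes per cache line (assumed)
NUM_SETS   = 128         # lines are grouped into sets (assumed)
WAYS       = 4           # 4-way set associative (assumed)

def split_address(addr):
    """Decompose an address into (tag, set index, block offset)."""
    offset = addr % BLOCK_SIZE
    index  = (addr // BLOCK_SIZE) % NUM_SETS
    tag    = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, offset

class SetAssociativeCache:
    def __init__(self):
        # One ordered dict per set: tag -> dirty flag; insertion order tracks recency.
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]
        self.write_backs = 0

    def access(self, addr, is_write=False):
        tag, index, _ = split_address(addr)
        ways = self.sets[index]
        hit = tag in ways
        if hit:
            ways.move_to_end(tag)              # refresh LRU position on a hit
            if is_write:
                ways[tag] = True               # write-back: only mark the line dirty
        else:
            if len(ways) >= WAYS:              # set full: evict the LRU victim
                _, victim_dirty = ways.popitem(last=False)
                if victim_dirty:
                    self.write_backs += 1      # dirty victim must go back to memory
            ways[tag] = is_write               # allocate the missing block
        return hit

cache = SetAssociativeCache()
hits = sum(cache.access(a, is_write=(a % 3 == 0))
           for a in [0, 64, 0, 8192, 0, 64])
print("hits:", hits, "write-backs:", cache.write_backs)
```

Under a write-through policy the dirty bit would be dropped and every write would also be sent to memory immediately, which is where the extra traffic mentioned above comes from.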

TABLE 1. SPECIFICATIONS OF AMD AND INTEL PROCESSORS

Processor | Clock Speed (MHz) | Bus Speed | Multiplier | Micro-Architecture | Data Width (bits) | Cores | Threads | L1 Cache (KB) | L2 Cache (KB/MB) | L3 Cache (MB) | Mapping, Replacement Algorithm & Write Policy | Caches | Thermal Design Power (W)

AMD Processors:
Opteron 6180 | 2500 | 3200 MHz | X | K10 | 64 | 12 | 12 | I: 12x64, D: 12x64 | 12x512 | 2x6 | NPA | 3 | 140
Phenom X4 9950 | 2600 | 533* MHz | 13 | K10 | 64 | 4 | 4 | I: 4x64, D: 4x64 | 4x512 | 2 shared | NPA | 3 | 140
Athlon X4 760 | 3800 | 4100 MHz | X | Piledriver | 64 | 4 | 4 | I: 2x64, D: 2x16 | 2x2MB | Nil | NPA | 2 | 100
A10 6800K | 4100 | 4400 MHz | X | Piledriver | 64 | 4 | 4 | I: 2x64, D: 4x16 | 2x2MB | Nil | NPA | 2 | 100
Legends: I-Instruction Cache; D-Data Cache; *: 533 MHz memory controller; one 2000 MHz 16-bit HyperTransport link.

Intel Processors:
Core i3 2100T | 2500 | 5 GT/s DMI | 25 | Sandy Bridge | 64 | 2 | 4 | I: 2x32, D: 2x32 | 2x256 | 3 shared | NPA | 3 | 35
Core i5 760 | 2800 | 2.5 GT/s DMI | 21 | Nehalem | 64 | 4 | 4 | I: 4x32, D: 4x32 | 4x256 | 8 shared | NPA | 3 | 95
Core i7 875K | 2933 | 2.5 GT/s DMI | 22 | Nehalem | 64 | 4 | 8 | I: 4x32, D: 4x32 | 4x256 | 8 shared | NPA | 3 | 95
Core i7 990X* | 3467 | 6.4 GT/s DMI | 26 | Westmere# | 64 | 6 | 12 | I: 6x32, D: 6x32 | 6x256 | 12 shared | NPA | 3 | 130
Legends: I-Instruction Cache; D-Data Cache; #: Nehalem (Westmere); *: Extreme Edition; NPA: Not Publicly Available.

More often than not, the L1 cache consists of an L1 data cache and an L1 instruction cache. Due to its ability to balance load between fetched instructions and data, a unified cache exhibits a higher hit rate than a split cache. On the other hand, the split cache design eliminates the problem of contention between the fetch/decode and execution units; this contention can degrade performance by interfering with efficient use of the instruction pipeline. Unfortunately, many implementation details of processors, such as the mapping function, replacement algorithm and write policy, are not publicly available. In general, AMD processors and x86 processors from Intel employ a direct-mapped L1 cache, and the L2 cache is usually 2- to 4-way set-associative mapped. The L3 and higher caches could be between 16-way and 64-way set associative. Most of them use a least recently used replacement policy, without much variation, and a write-back cache.

III. HYBRID CACHE MEMORY BASED ON SRAM-MRAM

Magnetic Random Access Memory (MRAM) and Phase-change Random Access Memory (PRAM) are non-volatile and energy efficient, which makes them suitable for future computing. However, they suffer from the drawbacks of limited endurance and long write latency. To overcome these issues, hybrid cache memory systems consisting of both volatile and non-volatile memory are being actively investigated.

Notable features of SRAM, embedded Dynamic RAM (eDRAM), MRAM and PRAM are presented in Table 2. It is evident that different cache technologies excel in one field but fall behind in others. Consequently, a cache designed by employing different technologies may outperform its counterparts. As cache capacities become larger, a more sophisticated hybrid memory system architecture and operation mechanism can deliver superior performance and energy efficiency compared to a conventional cache memory system with only SRAM and/or MRAM. Limited studies [1], [2], [3] have been conducted to design efficient hybrid cache memory systems with non-volatile memory.

TABLE 2. COMPARISON OF VARIOUS MEMORY TECHNOLOGIES

Feature | SRAM | eDRAM | MRAM | PRAM
Density | Low | High | High | Very High
Speed | Very Fast | Fast | Fast Read, Slow Write | Slow Read, Very Slow Write
Dynamic Power | Low | Medium | Low Read, High Write | Medium Read, High Write
Leak Power | High | Medium | Low | Low
Non-volatile | No | No | Yes | Yes

Wu et al [1] evaluated several possibilities for building on-chip cache hierarchies based on Hybrid Cache Architecture (HCA). The study evaluated Level Hybrid Cache Architecture (LHCA) and Region-based HCA (RHCA). LHCA (inter cache level) used disparate memory technologies to build the levels of a cache hierarchy and provided a 7% geometric mean IPC (instructions per cycle) improvement over a baseline cache, whereas RHCA (intra cache level) partitioned a single cache level into multiple regions and provided a 12% IPC improvement over the baseline cache under the same area constraints. To alleviate the performance degradation caused by the long access latency of MRAM with a small SRAM cache, a way-partitioned hybrid cache memory system constructed with SRAM and MRAM was proposed by Lee et al [2]. SRAM technology was used to build the level 1 cache, whereas level 2 caches were manufactured using SRAM, DRAM and MRAM technologies. Using various benchmark programmes it was found that there is a significant reduction in the average memory access time (AMAT) and power consumption for hybrid caches in comparison to a homogeneous SRAM architecture.
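As a rough back-of-the-envelope illustration of why a way-partitioned SRAM/MRAM L2 can reduce AMAT despite MRAM's slower writes, the sketch below plugs assumed latencies, hit rates and a 70/30 read/write mix into the usual AMAT recurrence; none of these numbers are taken from [1]-[3].

```python
# Illustrative AMAT comparison; all latencies, hit rates and the read/write
# mix are assumed values chosen only to show the shape of the trade-off.

def amat(hit_time, hit_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + (1.0 - hit_rate) * miss_penalty

L1_HIT, L1_RATE = 2, 0.90        # cycles, assumed SRAM L1
MEM_PENALTY     = 200            # cycles to main memory, assumed
READ_FRACTION   = 0.7            # assumed read/write mix

# Homogeneous SRAM L2: symmetric read/write latency, smaller capacity for a
# given area, hence a lower assumed hit rate.
sram_l2 = amat(L1_HIT, L1_RATE, amat(12, 0.93, MEM_PENALTY))

# Way-partitioned SRAM + MRAM L2: the denser MRAM ways raise capacity and
# hence the hit rate, but MRAM reads are slightly and writes markedly slower.
hybrid_read  = amat(14, 0.98, MEM_PENALTY)
hybrid_write = amat(30, 0.98, MEM_PENALTY)
hybrid_l2 = amat(L1_HIT, L1_RATE,
                 READ_FRACTION * hybrid_read +
                 (1 - READ_FRACTION) * hybrid_write)

print(f"SRAM-only L2 AMAT    ~ {sram_l2:.2f} cycles")
print(f"Hybrid SRAM/MRAM L2  ~ {hybrid_l2:.2f} cycles")
```

With these assumed figures the hybrid L2 comes out ahead because the higher hit rate more than offsets the slower MRAM writes, which is the qualitative effect reported for the way-partitioned design.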



The energy efficiency and performance cannot be improved simply by adopting MRAM as L2 or L3 memory in the conventional memory hierarchy; the best cache organization depends on the capacity of the L2 cache system [3]. When the capacity of L2 is less than 2MB, an L2 cache system built with SRAM delivers the best performance and ED product. MRAM shows the best energy efficiency and ED product for L2 cache capacities from 2MB to 4MB. The simplified way-partitioned hybrid cache memory system provides the best ED product when the L2 cache capacity is over 4MB. The adoption of non-volatile memory such as MRAM in the cache memory system is accelerating due to the increase in cache capacity of current systems. Issues that need further investigation include the study of SRAM-MRAM cache performance in the context of single-threaded execution, and improving thermal reliability and endurance.

IV. NON-UNIFORM CACHE ACCESS

In a larger cache, the processor is able to access data close to it much more quickly than data far from it. The wire delay caused by data residing far away from the processor is addressed through Non-Uniform Cache Access (NUCA). It uses a switched network to allow data to migrate to different cache regions based on their frequency of access. Figure 1, adapted from Kim et al [4], shows different L2 cache architectures. Figure 1(a) shows the uniform cache architecture, also called the traditional cache; the performance of this architecture is poor when the cache size is over 4MB because of internal wire delays and limited ports. Figure 1(b) shows ML-UCA, a multi-level (L2 and L3) cache; to support multiple parallel accesses, both levels of the cache are aggressively banked. In this design, extra space is consumed because inclusion is enforced. To overcome this, a cache supporting non-uniform access, shown in figure 1(c), was proposed. In it, placement of data in more than one bank is avoided by statically determining banks. This type of cache is called S-NUCA-1.

Fig. 1. Architectures of Level-2 Caches, adapted from [4]

An extensive study, the first of its kind to compare the performance of each of the above discussed cache schemes over a set of 16 benchmarks, was carried out by Kim et al [5]. Figure 2 shows the comparative evaluation of IPC, for a 16 MB cache at 50nm, obtained from UCA, S-NUCA-1, ML-UCA, S-NUCA-2, DN-best (the best D-NUCA policy), and an ideal D-NUCA upper bound. When the delay to route across a cache is significant, performance can be improved considerably by partitioning the cache into more banks. However, the significant area overhead present in S-NUCA-1 for larger numbers of banks can restrict such partitioning. This problem is addressed by the S-NUCA-2 cache architecture shown in figure 1(d): a 2D switched network is used in static NUCA instead of private per-bank channels. The Dynamic NUCA (D-NUCA) scheme improves cache performance over S-NUCA as it adaptively places data close to the requesting core [4]. It involves a combination of placement, migration, and replication strategies. Placement and migration dynamically place data close to the cores that use it, reducing access latency. Replication makes multiple copies of frequently used lines, reducing latency for widely read-shared lines (e.g., hot code) at the expense of some capacity loss, as depicted in figure 1(e).

Fig. 2: IPC, 16 MB cache performance (benchmarks: 256.bzip2, 176.gcc, 197.parser, 300.twolf, 179.art, 181.mcf)

It was found that DN-best outperformed the other schemes on all benchmarks except mgrid, gcc and bt, and that DN-best offers more stable performance. The IPC of DN-best was found to be only 16% worse than the ideal D-NUCA upper bound, which signifies that the performance of DN-best is close to ideal and is unlikely to benefit much from better migration policies or compiler support. The flexibility and scalability of NUCA memory systems will benefit emerging chip multiprocessor (CMP) architectures, because a NUCA memory system provides relatively faster cache access than UCA only when the cache size exceeds some limit. Therefore, cache access techniques can be modified to work faster at lower levels (L1) that have smaller sizes.
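The gradual-promotion flavour of D-NUCA migration can be pictured with the toy sketch below: banks are ordered by distance from the requesting core, and a block that hits is swapped one bank closer, so hot blocks drift toward the processor. This is an assumed illustration of the idea only, not the lookup, replication or replacement machinery of [4], [5].

```python
# Toy sketch of D-NUCA-style gradual migration within one "way" of banks.
# Eviction from the farthest bank and multicast search are omitted.

class DNucaWay:
    def __init__(self, num_banks, lines_per_bank):
        # banks[0] is closest to the core; each bank holds a set of block tags.
        self.banks = [set() for _ in range(num_banks)]
        self.lines_per_bank = lines_per_bank

    def lookup(self, tag):
        for i, bank in enumerate(self.banks):
            if tag in bank:
                if i > 0:                              # hit: promote one bank closer
                    self.banks[i].remove(tag)
                    if len(self.banks[i - 1]) >= self.lines_per_bank:
                        # swap: demote a victim from the closer bank
                        victim = next(iter(self.banks[i - 1]))
                        self.banks[i - 1].remove(victim)
                        self.banks[i].add(victim)
                    self.banks[i - 1].add(tag)
                return i                               # bank index ~ access latency
        self.banks[-1].add(tag)                        # miss: place in the farthest bank
        return None

way = DNucaWay(num_banks=4, lines_per_bank=2)
for t in ["A", "A", "A", "A"]:
    print(t, "->", way.lookup(t))                      # A drifts from bank 3 toward bank 0
```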



V. ENERGY EFFICIENT REPLACEMENT ALGORITHM

Storage systems consume significant energy in data centers, raising energy consumption concerns and the need for power management schemes for disks [6]. PA-LRU (power-aware replacement algorithm) [7] was proposed to selectively keep blocks of "inactive" disks in the storage cache for longer periods in order to extend the idle period lengths of these disks. It was found that this reduced the energy consumed by the disks significantly, owing to their ability to stay in low-power modes for long periods and to spin up and down only a few times. However, PA-LRU required cumbersome tuning for each workload. To make tuning simple, Zhu et al [8] proposed PB-LRU (partition-based LRU), whose replacement algorithm divides the cache into separate partitions that are managed independently. New designs of energy-efficient replacement algorithms for cache memories for single disks, and modification of existing ones to work with multi-speed disks, are highly desired.
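A minimal sketch of the partition-per-disk structure behind PB-LRU follows; the partition sizes are fixed constants here purely for illustration, whereas PB-LRU sizes them automatically per workload, and the class below is an assumption for exposition rather than the authors' implementation.

```python
# Minimal sketch of a storage cache split into one LRU partition per disk,
# so that eviction decisions on one disk's blocks never disturb another's.

from collections import OrderedDict

class PartitionedLRUCache:
    def __init__(self, partition_sizes):
        # One LRU list (OrderedDict of block ids) per disk.
        self.parts = {disk: OrderedDict() for disk in partition_sizes}
        self.sizes = dict(partition_sizes)

    def access(self, disk, block):
        part = self.parts[disk]
        if block in part:
            part.move_to_end(block)         # hit within this disk's partition
            return True
        if len(part) >= self.sizes[disk]:   # evict only within the partition
            part.popitem(last=False)
        part[block] = True
        return False

cache = PartitionedLRUCache({"disk0": 2, "disk1": 4})
print([cache.access("disk0", b) for b in [1, 2, 1, 3, 2]])
```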
VI. REPLACEMENT ALGORITHM FOR FLASH MEMORY

Since the inception of mobile embedded systems, the use of flash memories has become more and more prevalent due to their small and lightweight form factor and low power consumption. For disk-based storage systems, the number of cache hits is the only concern in most operating systems; a replacement algorithm for flash memories should also consider the replacement cost caused by selecting dirty victims, besides the hit count. A Clean-First LRU (CFLRU) replacement policy was proposed by Park et al [9], in which the LRU list is split into a working region and a clean-first region, and clean pages are preferentially evicted from the clean-first region. When evaluated in a file-system-based buffer cache using block access, a 28.4% reduction in average replacement cost for the swap system and a 26.2% reduction for the buffer cache were found in comparison to LRU. A serious problem may arise when illegal shutdowns occur, because the CFLRU and LRU algorithms retain dirty pages in the SDRAM cache; it has therefore been suggested to use the CFLRU algorithm together with a journaling file system. Future algorithms need to be developed for better replacement performance in flash memories.
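The clean-first choice in CFLRU can be sketched as below: within a window at the LRU end of the list (the clean-first region, whose size is an assumed parameter here), clean pages are evicted before dirty ones, since evicting a dirty page costs an extra flash write.

```python
# Minimal sketch of CFLRU victim selection; the window size is an assumed
# parameter, and the rest of the buffer-cache machinery is omitted.

def choose_victim(lru_list, dirty, window=4):
    """lru_list is ordered from most- to least-recently-used page ids."""
    clean_first_region = lru_list[-window:]           # the LRU end of the list
    for page in reversed(clean_first_region):         # scan from the LRU end
        if not dirty[page]:
            return page                                # evict a clean page first
    return lru_list[-1]                                # all dirty: plain LRU victim

pages = ["a", "b", "c", "d", "e", "f"]                 # "a" is MRU, "f" is LRU
dirty = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 0, "f": 1}
print(choose_victim(pages, dirty))                     # evicts "e", sparing dirty "f"
```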
VII. ADAPTIVE MULTI-LEVEL CACHE HIERARCHY

In order to minimize latency and optimize performance, an efficient cache hierarchy is necessary, and different architectures implement different cache topologies. A cache topology for a 16-core CMP in which each of Y L2 slices is shared by X cores and their L1 caches, and each of Z L3 slices is shared by Y L2 slices, is represented as (X:Y:Z). To obtain the highest throughput, the best configuration varies with time during the execution of a workload [10]. In order to deal with a diverse range of applications, a "one cache-topology-fits-all" philosophy is inadequate. To cater to this requirement, Srikantaiah et al [10] proposed a reconfigurable adaptive multi-level cache hierarchy called MorphCache. It permits different cache topologies to co-exist within the same architecture by dynamically tuning the multi-level cache topology of a CMP. The average throughput and harmonic mean of multithreaded and multi-programmed workloads, when evaluated on a 16-core CMP, improved significantly using MorphCache. Although the reconfigurable MorphCache does not work properly beyond 16 cores, it shows good results with processors having fewer cores. Therefore, dynamic cache architectures need to be developed to cater to the needs of multi-core processors having more than 16 cores.

VIII. MEMORY HIERARCHY PROGRAMMING

Owing to the increase in the number of parallel processing units and concerns about efficient utilization of memory bandwidth, the need to develop new programming abstractions for memory management in both stream architectures and multi-core processors has gained focus. To facilitate program correctness and performance, explicit memory management is required to transfer data between memories on the chip and off the chip. Fatahalian et al [11] proposed Sequoia, a programming model that places the programmer in explicit control of the movement and placement of data at all levels of the memory hierarchy through first-class language mechanisms. Sequoia provides a limited set of abstractions by taking a practical approach to portable parallel programming that can be implemented efficiently. Sequoia requires hierarchical organization in programs, which encourages hierarchy-aware, parallel divide-and-conquer programs: each function in the call chain accepts its arguments in a space-limited procedure and thus occupies comparatively less storage than its calling functions. Sequoia is still under development, but has great scope for future memory hierarchies.
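The hierarchy-aware divide-and-conquer style that Sequoia encourages can be illustrated with the conceptual Python sketch below; it is not Sequoia syntax, and the capacity figures are arbitrary assumptions. Each task tiles its working set so that every child task touches only as much data as fits the next, smaller memory level.

```python
# Conceptual sketch of hierarchy-aware divide-and-conquer: a task keeps
# splitting its working set until a piece fits the capacity of the next
# (smaller, faster) memory level, so each call works on less data than its caller.

# Assumed capacities, in elements, from largest/slowest to smallest/fastest.
LEVEL_CAPACITY = [1 << 20, 1 << 14, 1 << 8]

def hierarchical_sum(data, level=0):
    """Sum `data`, recursively tiling it to fit each memory level."""
    if level == len(LEVEL_CAPACITY) - 1 or len(data) <= LEVEL_CAPACITY[level + 1]:
        return sum(data)                       # leaf task: fits the fastest level
    tile = LEVEL_CAPACITY[level + 1]           # child tasks get smaller tiles
    return sum(hierarchical_sum(data[i:i + tile], level + 1)
               for i in range(0, len(data), tile))

print(hierarchical_sum(list(range(100_000))))  # 4999950000
```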
IX. SCALABLE SOFTWARE-DEFINED CACHES

Shared last-level caches, widely used in chip multiprocessors (CMPs), face two fundamental limitations: firstly, the latency and energy of shared caches degrade as the system scales up, and secondly, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other. Ideally, a cache should both store data close to where it is used and, at the same time, allow its capacity to be partitioned, enabling software to provide isolation, prioritize competing applications, and increase cache utilization. NUCA techniques reduce access latency but are prone to hotspots and interference, while cache partitioning techniques only provide isolation and do not reduce access latency. Beckmann and Sanchez [12] proposed Jigsaw, a technique that jointly addresses the scalability and interference problems of shared caches. Jigsaw implements efficient hardware support for shared cache management, monitoring, and adaptation, and its resource-management algorithms are used by a system-level runtime that leverages Jigsaw to both maximize cache utilization and place data close to where it is used. Jigsaw lets software combine multiple bank partitions into a logical, software-defined cache called a share. By mapping data to shares and configuring the locations and sizes of the individual bank partitions that compose each share, software has full control over both where data is placed in the cache and the capacity allocated to it. Jigsaw efficiently supports reconfiguring shares dynamically and moving data across shares, and implements monitoring hardware to let software find the optimal share configuration efficiently.
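As a hedged illustration of the share abstraction (this is not Jigsaw's actual hardware or runtime interface), the sketch below maps pages to software-defined shares and each share to slices of physical bank partitions, so a page's placement and capacity both follow from the share it belongs to.

```python
# Illustrative "share" mapping: a share is a logical cache built from slices
# of physical bank partitions; data is mapped to a share, which determines
# both where its lines live and how much capacity they can occupy.

# Software-defined shares: bank id -> KB of that bank allocated to the share
# (bank names and sizes are assumptions for this sketch).
shares = {
    "app0_private": {"bank0": 256, "bank1": 256},
    "shared_code":  {"bank2": 128, "bank3": 128},
}

page_to_share = {0x1000: "app0_private", 0x2000: "shared_code"}

def locate(page, line_hash):
    """Pick the bank that will hold a line of this page, within its share."""
    share = shares[page_to_share[page]]
    banks = sorted(share)                      # deterministic spread over the share's banks
    return banks[line_hash % len(banks)]

print(locate(0x1000, 7))    # lands in bank0 or bank1, capped at 512 KB total
print(locate(0x2000, 7))    # lands in bank2 or bank3
```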



An evaluation of Jigsaw using extensive simulations of 16- and 64-core tiled CMPs showed that it improves cache performance by up to 2.2x (18% on average) over a conventional shared cache, and significantly outperforms state-of-the-art NUCA and partitioning techniques. Table 3 compares Jigsaw with competing cache partitioning techniques in terms of capacity, latency, capacity control, isolation, and directory requirements. It is evident that Jigsaw outperforms the other cache partitioning schemes.

TABLE 3. COMPARISON OF CACHE PARTITIONING SCHEMES

Scheme | Capacity | Latency | Capacity Control | Isolation | Directory-less
Jigsaw | High | Low | Yes | Yes | Yes
Private Caches | Low | Low | No | Yes | No
Shared Caches | High | High | Yes | No | Yes
Partitioned Shared Caches | High | High | Yes | Yes | Yes
Private-Based D-NUCA | Intermediate | Low | Yes | No | No
Shared-Based D-NUCA | Intermediate | Low | No | No | Yes

X. CACHE RELIABILITY AGAINST SOFT ERRORS

The susceptibility of on-chip caches to soft errors is increasing rapidly as microprocessors are continuously scaled down, and because of energetic particle strikes such as alpha particles and high-energy neutrons from decaying radioactive impurities in interconnect materials and packaging [13]. In order to protect information integrity, various coding schemes [14] are used in register files, latches, and on-chip caches for reliable computing. These schemes provide different degrees of reliability at different energy, hardware, and performance costs.

Wang et al [15] conducted a comprehensive study and characterization of the reliability behavior of cache memories with respect to soft errors. The study proposed a framework for developing new lifetime models for the tag arrays and the data residing in both the instruction and data caches. These models facilitated characterization of cache vulnerability. The proposed lifetime model classifies the phases of each data item into vulnerable and non-vulnerable phases, based on the item's previous and current activity. An error may propagate to the level 2 cache or the processor through dirty-line write-back or load operations. The Temporal Vulnerability Factor (TVF) was proposed as a measure of cache vulnerability: it is the percentage of data items present in vulnerable phases and is indicative of cache reliability. During the study, the major contributors to vulnerable phases were identified in both instruction and data caches, and cache reliability schemes for these phases were proposed. These include Dead-Time-Based Early Write-Back (DTEWB), Multiple Dirty Bits (MDB), Clean Cache-Line Invalidation (CCI), Narrow-Width Value Compression (NWVC), Cache-line Scrubbing (CS), and combined CS-CCI. Two main vulnerable phases are WPL and FWPL: WPL (write-replace) is the lifetime phase between the last write and the replacement without any read in between, and FWPL is the lifetime phase between the first write and the replacement of a dirty cache line.

While DTEWB reduces the WPL and FWPL vulnerable phases of the data and tag arrays in the write-back data cache, it increases the energy consumption of the L2 cache. Reducing the vulnerable phases (mainly WPL) of the data array in the data cache by preventing clean data items from being written back to the L2 cache incurs the overhead of additional dirty tag bits to reduce TVF in the MDB scheme. Reducing the RR and RH vulnerable phases of the data and tag arrays in both the data and instruction caches in the CCI scheme may result in performance loss when the invalidated cache lines are accessed again later by the CPU. In order to mask leading zeros, additional narrow tag bits are needed in the NWVC scheme to reduce all vulnerable phases by exploiting narrow-width values in the data array of the data cache. To reduce the RR and RH vulnerable phases of the data and tag arrays in both the data and instruction caches, the scrubbing scheme dramatically increases the accesses to the L2 cache, which results in performance loss in the instruction cache. By combining the DTEWB, MDB, CCI and NWVC schemes, there is only a minor performance loss and energy overhead; by combining the CS and CCI schemes, there is a minor performance loss and a moderate energy increase in the L2 cache.
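A simplified reading of TVF can be expressed as the sketch below, which treats TVF as the fraction of data-item lifetime spent in vulnerable phases such as WPL and FWPL; the numbers are invented and this is only an approximation of the metric defined in [15].

```python
# Simplified TVF illustration: vulnerable time over total lifetime, aggregated
# across cache lines.  Inputs are made-up numbers for demonstration.

def tvf(lines):
    """lines: iterable of (vulnerable_time, total_lifetime) per data item."""
    vulnerable = sum(v for v, _ in lines)
    lifetime   = sum(t for _, t in lines)
    return vulnerable / lifetime if lifetime else 0.0

sample = [(120, 1000),   # line mostly in safe phases
          (800, 1000),   # dirty line waiting long after its last write (WPL-like)
          (0,   500)]    # read-only line, never vulnerable
print(f"TVF = {tvf(sample):.2%}")   # 36.80% of lifetime is vulnerable
```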
XI. DISCUSSIONS

The innovations in micro-architecture facilitated by semiconductor technology have improved the performance of processors many-fold; however, there has not been an equal performance improvement in memory. The floating-point arithmetic capability of microprocessors has increased significantly due to high levels of integration and superscalar architectural designs, but a comparable increase in memory speed has not been achieved. Consequently, cache memories have long been used to reduce memory access time, average bandwidth demand and memory latency. Many memory-intensive commercial workloads can be organized effectively by limiting the on-chip cache resources.

This paper presented some of the recent advances made in cache memory to address latency and energy consumption and to improve the access time of the memory management unit. The MAID approach [6] employs cache drives to reduce the spin-up of data drives and thereby conserve energy; it lets a cluster of networked servers dynamically reconfigure or shrink the cache so that it operates with only a few nodes under light load. Soft errors, as discussed earlier, pose a serious threat, and various schemes have been proposed to preserve the integrity of information. A fast soft-error estimation approach, called CLASS, was proposed by Ebrahimi et al [17]; it was developed to handle both regular and irregular structures. For high-end embedded systems, chip multiprocessors (CMPs) have emerged as a dominant architectural alternative. In chip multiprocessors, either all processors may share the same L2 cache or each processor may possess its own private L2 cache.

To use on-chip cache memories more efficiently in chip multiprocessors, many on-chip cache organizations have been proposed. These include CMP-SNUCA [18], CMP-NuRAPID [20], Victim Replication [19], and CMP-CC [21]. The NUCA technique is applied to the CMP architecture in the CMP-SNUCA scheme. In order to reduce wire delays, these on-chip cache organizations migrate blocks close to the requestor; in another, similar technique [22], this is achieved by selectively writing back L2 victims to a peer L2 cache. Recently proposed hybrid caches address the placement requirements of only a subset of data and require complex lookup and coherence mechanisms that increase latency and fail to scale to high core counts; however, in comparison to conventional designs they offer lower latency [24].



To improve application performance on current microprocessors, the memory subsystem plays a vital role [23]. The flexibility and scalability of NUCA memory systems are likely to benefit emerging chip multiprocessors (CMPs). Hardavellas et al [24] found that most applications have a few distinct classes of accesses (instructions, private data, read-shared data, and write-shared data), and proposed Reactive NUCA (R-NUCA), which specializes placement and replication policies for each class of accesses on a per-page basis and significantly outperforms NUCA schemes without access differentiation. R-NUCA supports intelligent migration, placement and replication for the on-chip last-level cache and does not need any explicit coherence mechanism. For scientific, server and multi-programmed workloads, cache performance improves if the cache is designed using R-NUCA. Sequoia focuses on regular applications; however, support for irregular computations on machines with multi-level memory hierarchies, through programming constructs and features such as spawn and call-up, has been designed by Bauer et al [25].

The capacity of a cache memory can be increased by compressing the data/instructions residing inside it, thereby increasing its efficiency by reducing long off-chip misses. However, it is likely that compressing the cache would increase the cache hit time because of the additional overhead of decompressing data/instructions; therefore, it may either improve or deteriorate the overall system. CacheRAID [26] is an adaptive write-cache policy that works on an energy-efficient RAID storage architecture to balance cost and performance. In comparison to RAID 0 and RAID 5, it consumes only 30-40% of the power and is 5% more efficient than RAID 5.

The hardware design of flash-memory-based SSDs [16] has revolutionized secondary storage systems because an SSD is much faster and its performance in random data access is superb. The cache organization implemented to support the basic main memory system can no longer be used to support the modern SSD-based main memory. Although the Smart Response Technology (SRT) introduced by Intel with its Z68 chipset, which allows a SATA solid-state drive (SSD) to function as a cache for a conventional magnetic hard disk drive, helped to access data faster, a better cache organization needs to be implemented on processors to compensate for the slower speed of main memory.
REFERENCES

[1] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony and Y. Xie, "Hybrid Cache Architecture with Disparate Memory Technology," Proc. 36th Int'l Symp. Computer Architecture (ISCA 2009), pp. 34-45, 2009.
[2] S. Lee, J. Jung, and C.M. Kyung, "Hybrid Cache Architecture Replacing SRAM Cache with Future Memory Technology," Proc. Int'l Symp. Circuits and Systems (ISCAS 2012), pp. 2481-2484, May 2012.
[3] B.M. Lee and G.H. Park, "Performance and energy-efficiency analysis of hybrid cache memory based on SRAM-MRAM," Proc. Int'l Conf. SoC Design (ISOCC 2012), pp. 247-250, Nov. 2012.
[4] C. Kim, D. Burger and S.W. Keckler, "Nonuniform cache architectures for wire-delay dominated on-chip caches," IEEE Micro, vol. 23, no. 6, pp. 99-107, Nov.-Dec. 2003.
[5] C. Kim, D. Burger and S.W. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches," Proc. 10th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 2002), pp. 211-222, 2002.
[6] D. Colarelli and D. Grunwald, "Massive Arrays of Idle Disks For Storage Archives," Proc. ACM/IEEE Conf. Supercomputing, p. 47, Nov. 2002.
[7] Q. Zhu, F.M. David, C.F. Devaraj, Z. Li, Y. Zhou and P. Cao, "Reducing energy consumption of disk storage using power-aware cache management," Proc. 10th Int'l Symp. High Performance Computer Architecture, p. 18, 2004.
[8] Q. Zhu, A. Shankar and Y. Zhou, "PB-LRU: a self-tuning power aware storage cache replacement algorithm for conserving disk energy," Proc. 18th Int'l Conf. Supercomputing (ICS '04), pp. 79-88, 2004.
[9] S.Y. Park, D. Jung, J.U. Kang, J.S. Kim and J. Lee, "CFLRU: a replacement algorithm for flash memory," Proc. Int'l Conf. Compilers, Architecture and Synthesis for Embedded Systems (CASES 2006), pp. 234-241, 2006.
[10] S. Srikantaiah, E. Kultursay, T. Zhang, M. Kandemir, M.J. Irwin and Y. Xie, "MorphCache: A Reconfigurable Adaptive Multi-level Cache Hierarchy," Proc. 17th Int'l Symp. High Performance Computer Architecture (HPCA 2011), pp. 231-242, Feb. 2011.
[11] K. Fatahalian, T.J. Knight, M. Houston, M. Erez, D.R. Horn, L. Leem, J.Y. Park, M. Ren, A. Aiken, W.J. Dally, and P. Hanrahan, "Sequoia: Programming the memory hierarchy," Proc. ACM/IEEE Conf. Supercomputing, p. 83, 2006.
[12] N. Beckmann and D. Sanchez, "Jigsaw: scalable software-defined caches," Proc. 22nd Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 2013), 2013.
[13] C. Weaver, J. Emer, S.S. Mukherjee and S.K. Reinhardt, "Techniques to reduce the soft error rate of a high-performance microprocessor," Proc. 31st Int'l Symp. Computer Architecture, pp. 264-275, June 2004.
[14] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J.C. Hoe, "Multi-Bit Error Tolerant Caches Using Two-Dimensional Error Coding," Proc. 40th IEEE/ACM Int'l Symp. Microarchitecture, pp. 197-209, Dec. 2007.
[15] S. Wang et al., "On the Characterization and Optimization of On-Chip Cache Reliability against Soft Errors," IEEE Trans. Computers, vol. 58, no. 9, pp. 1171-1184, Sept. 2009.
[16] N. Agrawal, V. Prabhakaran, T. Wobber, J.D. Davis, M. Manasse, and R. Panigrahy, "Design tradeoffs for SSD performance," Proc. USENIX Annual Technical Conference (USENIX '08), pp. 57-70, 2008.
[17] M. Ebrahimi, L. Chen, H. Asadi and M.B. Tahoori, "CLASS: Combined logic and architectural soft error sensitivity analysis," Proc. 18th Asia and South Pacific Design Automation Conference (ASP-DAC 2013), pp. 601-607, Jan. 2013.
[18] B.M. Beckmann and D.A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor Caches," Proc. 37th Int'l Symp. Microarchitecture (MICRO 37), pp. 319-330, 2004.
[19] M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 336-345, 2005.
[20] Z. Chishti, M.D. Powell and T.N. Vijaykumar, "Optimizing Replication, Communication and Capacity Allocation in CMPs," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 357-368, 2005.
[21] J. Chang and G.S. Sohi, "Cooperative Caching for Chip Multiprocessors," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 264-276, 2006.
[22] E. Speight, H. Shafi, L. Zhang and R. Rajamony, "Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors," Proc. Int'l Symp. Computer Architecture (ISCA), pp. 346-356, 2005.
[23] D. Hackenberg, D. Molka and W.E. Nagel, "Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems," Proc. 42nd Ann. IEEE/ACM Int'l Symp. Microarchitecture, pp. 413-422, Dec. 2009.
[24] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Reactive NUCA: near-optimal block placement and replication in distributed caches," Proc. 36th Int'l Symp. Computer Architecture (ISCA '09), pp. 184-195, 2009.
[25] M. Bauer, J. Clark, E. Schkufza and A. Aiken, "Programming the memory hierarchy revisited: Supporting irregular parallelism in Sequoia," Proc. 16th ACM Symp. Principles and Practice of Parallel Programming (PPoPP 2011), pp. 13-24, Feb. 2011.
[26] T.Y. Chen, T.T. Yeh, H.W. Wei, Y.X. Fang, W.K. Shih and T.S. Hsu, "CacheRAID: An Efficient Adaptive Write Cache Policy to Conserve RAID Disk Array Energy," Proc. 5th IEEE Int'l Conf. Utility and Cloud Computing (UCC 2012), Nov. 2012.

