
Embedded Memories

Introduction

Embedded systems have three functionality aspects:
- Processing: processors transform data
- Storage: memory retains data
- Communication: buses transfer data

Memory: basic concept
- Stores a large number of bits
- m x n memory: m words of n bits each
- k = log2(m) address input signals, i.e., m = 2^k words
- e.g., a 4,096 x 8 memory: 32,768 bits, 12 address input signals, 8 input/output data signals
- Memory access:
  - r/w: selects read or write
  - enable: read or write occurs only when asserted
  - multiport: multiple accesses to different locations simultaneously
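To make the sizing arithmetic concrete, here is a small C sketch (our own illustration, not from the notes) that computes the address-line count and total bit count for a given organization:

    #include <stdio.h>

    /* Number of address lines k needed for m words: the smallest k
     * with 2^k >= m. */
    static unsigned address_lines(unsigned long m) {
        unsigned k = 0;
        while ((1UL << k) < m)
            k++;
        return k;
    }

    int main(void) {
        unsigned long words = 4096;  /* m */
        unsigned width = 8;          /* n bits per word */
        printf("%lu x %u memory: %lu bits, %u address lines, %u data lines\n",
               words, width, words * width, address_lines(words), width);
        /* prints: 4096 x 8 memory: 32768 bits, 12 address lines, 8 data lines */
        return 0;
    }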


1. Write Ability and Storage Permanence


Traditional ROM/RAM distinctions:
- ROM: read only, bits stored without power
- RAM: read and write, loses stored bits without power
These traditional distinctions have blurred:
- Advanced ROMs can be written to, e.g., EEPROM
- Advanced RAMs can hold bits without power, e.g., NVRAM

Write ability: the manner and speed with which a memory can be written.

Storage permanence: the ability of a memory to hold its stored bits after they are written.

Write ability
Ranges of write ability:
- High end: processor writes to memory simply and quickly, e.g., RAM
- Middle range: processor writes to memory, but more slowly, e.g., FLASH, EEPROM
- Lower range: special equipment (a programmer) must be used to write to memory, e.g., EPROM, OTP ROM
- Low end: bits stored only during fabrication, e.g., mask-programmed ROM

In-system programmable memory: can be written to by a processor in the embedded system using the memory; this covers memories in the high end and middle range of write ability.

Storage permanence
Range of storage permanence:
- High end: essentially never loses bits, e.g., mask-programmed ROM
- Middle range: holds bits for days, months, or years after the memory's power source is turned off, e.g., NVRAM
- Lower range: holds bits as long as power is supplied to the memory, e.g., SRAM
- Low end: begins to lose bits almost immediately after they are written, e.g., DRAM


2. ROM: Read-Only Memory

ROM is nonvolatile memory:
- Holds bits after power is no longer supplied
- High end and middle range of storage permanence
- Can be read from, but not written to, by a processor in an embedded system
- Traditionally written to ("programmed") before being inserted into the embedded system
Uses:
- Store the software program for a general-purpose processor; program instructions can occupy one or more ROM words
- Store constant data needed by the system
- Implement a combinational circuit

Example: 8 x 4 ROM


- Horizontal lines = words; vertical lines = data
- Lines are connected only at circles (programmed connections)
- The decoder sets word 2's line to 1 if the address input is 010
- Data lines Q3 and Q1 are set to 1 because there is a programmed connection with word 2's line
- Word 2 is not connected to data lines Q2 and Q0
- Output is 1010
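The decode-and-select behavior of this ROM can be modeled in a few lines of C. The sketch below is illustrative: the connection table encodes word 2 as 1010 to match the example, and the remaining words hold arbitrary contents.

    #include <stdio.h>

    /* 8 x 4 ROM modeled as a connection table: one 4-bit pattern per
     * word line. A 1 bit means a programmed connection between the word
     * line and that data line. Word 2 is programmed as 1010 to match
     * the example; the other entries are arbitrary illustrative data. */
    static const unsigned char rom[8] = {
        0x0, 0x5, 0xA /* word 2 -> Q3..Q0 = 1010 */, 0x3,
        0xF, 0x1, 0x6, 0x9
    };

    /* The decoder selects exactly one word line from the 3-bit address;
     * the selected word's connections drive the data lines Q3..Q0. */
    static unsigned char rom_read(unsigned addr) {
        return rom[addr & 0x7];
    }

    int main(void) {
        unsigned addr = 2; /* address input 010 */
        printf("ROM[%u] = 0x%X\n", addr, rom_read(addr)); /* 0xA = 1010 */
        return 0;
    }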


Mask-programmed ROM
- Connections are programmed at fabrication time by a set of masks
- Lowest write ability: programmed only once
- Highest storage permanence: bits never change unless damaged
- Typically used for the final design of high-volume systems: spreads the NRE cost out for a low unit cost

OTP ROM: One-Time Programmable ROM


- Connections programmed after manufacture by the user:
  - the user provides a file of the desired ROM contents
  - the file is input to a machine called a ROM programmer
  - each programmable connection is a fuse
  - the ROM programmer blows fuses where connections should not exist
- Very low write ability: typically written only once, and requires a ROM programmer device
- Very high storage permanence: bits don't change unless the device is reconnected to the programmer and more fuses are blown
- Commonly used in final products: cheaper, and harder to inadvertently modify
EPROM: Erasable Programmable ROM
- The programmable component is a MOS transistor with a floating gate surrounded by an insulator
- (a) Negative charges form a channel between source and drain, storing a logic 1
- (b) A large positive voltage at the gate causes negative charges to move out of the channel and become trapped in the floating gate, storing a logic 0
- (c) (Erase) Shining UV light on the surface of the floating gate causes the negative charges to return to the channel from the floating gate, restoring the logic 1
- (d) An EPROM package has a quartz window through which the UV light can pass
- Better write ability: can be erased and reprogrammed thousands of times
- Reduced storage permanence: a program lasts about 10 years, and is susceptible to radiation and electrical noise
- Typically used during design development


EEPROM: Electrically Erasable Programmable ROM
- Programmed and erased electronically, typically by using a higher-than-normal voltage
- Can program and erase individual words
- Better write ability:
  - can be in-system programmable, with a built-in circuit to provide the higher-than-normal voltage
  - a built-in memory controller is commonly used to hide details from the memory user
  - writes are very slow due to erasing and programming; a "busy" pin indicates to the processor that the EEPROM is still writing
  - can be erased and programmed tens of thousands of times
- Similar storage permanence to EPROM (about 10 years)
- Far more convenient than EPROM, but more expensive
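As a rough sketch of how a processor might honor the busy indication during an in-system write, consider the following C fragment. The register addresses and the busy-bit position are hypothetical placeholders for whatever a particular EEPROM's datasheet specifies.

    #include <stdint.h>

    /* Hypothetical memory-mapped EEPROM interface; real addresses and
     * bit assignments come from the specific device's datasheet. */
    #define EEPROM_BASE   ((volatile uint8_t *)0x40001000u)
    #define EEPROM_STATUS (*(volatile uint8_t *)0x40001800u)
    #define EEPROM_BUSY   (1u << 0)  /* assumed: set while writing */

    /* Write one byte, then poll the busy flag until the slow internal
     * erase/program cycle completes. */
    static void eeprom_write_byte(uint16_t offset, uint8_t value) {
        EEPROM_BASE[offset] = value;
        while (EEPROM_STATUS & EEPROM_BUSY) {
            /* spin: the EEPROM is still erasing/programming internally */
        }
    }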

Flash Memory
- An extension of EEPROM: same floating-gate principle, same write ability and storage permanence
- Fast erase: large blocks of memory are erased at once, rather than one word at a time; blocks are typically several thousand bytes
- Writes to single words may be slower: the entire block must be read, the word updated, then the entire block written back
- Used in embedded systems that store large data items in nonvolatile memory, e.g., digital cameras, TV set-top boxes, cell phones
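The block-level read-modify-write cost described above can be sketched as follows; the block size is illustrative, and the flash is simulated with a RAM array so the fragment stands alone.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_WORDS 1024u  /* illustrative block size */

    /* Simulated flash block so the sketch is self-contained; on real
     * hardware these helpers would talk to the device instead. */
    static uint32_t flash_block[BLOCK_WORDS];

    static void flash_read_block(uint32_t *buf) {
        memcpy(buf, flash_block, sizeof flash_block);
    }
    static void flash_erase_block(void) {        /* erase = all bits set */
        memset(flash_block, 0xFF, sizeof flash_block);
    }
    static void flash_program_block(const uint32_t *buf) {
        memcpy(flash_block, buf, sizeof flash_block);
    }

    /* Updating one word (offset < BLOCK_WORDS assumed) forces a
     * read-modify-write of the entire block, which is why single-word
     * flash writes can be slow. */
    void flash_write_word(uint32_t offset, uint32_t value) {
        static uint32_t buf[BLOCK_WORDS];
        flash_read_block(buf);        /* 1. read the whole block out    */
        buf[offset] = value;          /* 2. update the one word in RAM  */
        flash_erase_block();          /* 3. erase the block in one shot */
        flash_program_block(buf);     /* 4. write the whole block back  */
    }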

3. RAM


- Typically volatile memory: bits are not held without a power supply
- Read and written easily by the embedded system during execution
- Internal structure is more complex than ROM:
  - a word consists of several memory cells, each storing 1 bit
  - each input and output data line connects to each cell in its column
  - rd/wr is connected to every cell
  - when a row is enabled by the decoder, each cell has logic that stores the input data bit when rd/wr indicates write, or outputs the stored bit when rd/wr indicates read

Basic Types of RAM


SRAM: Static RAM
- Memory cell uses a flip-flop to store the bit; requires 6 transistors
- Holds data as long as power is supplied

DRAM: Dynamic RAM
- Memory cell uses a MOS transistor and a capacitor to store the bit
- More compact than SRAM
- Refresh is required because the capacitor leaks; a word's cells are refreshed when read
- Typical refresh interval: 15.625 microseconds per row (e.g., 4,096 rows refreshed every 64 ms)
- Slower to access than SRAM

RAM variations:
- PSRAM: Pseudo-static RAM
  - DRAM with a built-in memory refresh controller
  - Popular low-cost, high-density alternative to SRAM
- NVRAM: Nonvolatile RAM
  - Holds data after external power is removed
  - Battery-backed RAM: SRAM with its own permanently connected battery; writes as fast as reads; no limit on the number of writes, unlike nonvolatile ROM-based memory
  - SRAM with EEPROM or flash: stores the complete RAM contents to EEPROM or flash before power is turned off

4. Scratchpad Memory
Embedded processor-based system:
> Processor core
> Embedded memory: instruction and data cache, embedded SRAM, embedded DRAM, scratch-pad memory
> Design problems:
1. How much on-chip memory?
2. How should the on-chip memory be partitioned between cache and scratchpad?
3. Which variables/arrays should go in the scratchpad?

Goals
> Improve performance
> Save power

Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications


Abstract
Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on microprocessor cores. In addition to a data cache that interfaces with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often used in several applications. This paper presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.

> Introduction
Complex embedded system applications typically use heterogeneous chips consisting of microprocessor cores, along with on-chip memory and co-processors. Flexibility and short design time considerations drive the use of CPU cores as instantiable modules in system designs [5]. The integration of processor cores and memory in the same chip effects a reduction in the chip count, leading to cost-effective solutions. Examples of commercial microprocessor cores commonly used in system design are LSI Logic's CW33000 series [3] and the ARM series from Advanced RISC Machines [10].


Typical examples of optional modules integrated with the processor on the same chip are an instruction cache, a data cache, and on-chip SRAM. The instruction and data caches are fast local memories serving as an interface between the processor and the off-chip memory. The on-chip SRAM, termed Scratch-Pad memory, is a small, high-speed data memory that is mapped into an address space disjoint from the off-chip memory, but connected to the same address and data buses. Both the cache and the Scratch-Pad SRAM have a single-processor-cycle access latency, whereas an access to the off-chip memory (usually DRAM) takes several (typically 10-20) processor cycles. The main difference between the Scratch-Pad SRAM and the data cache is that the SRAM guarantees a single-cycle access time, whereas an access to the cache is subject to compulsory, capacity, and conflict misses. When an embedded application is compiled, the accessed data can be stored either in the Scratch-Pad memory or in off-chip memory; in the second case, it is accessed by the processor through the data cache. We present a technique for minimizing the total execution time of an embedded application by a careful partitioning of the scalar and array variables used in the application into off-chip DRAM (accessed through the data cache) and Scratch-Pad SRAM. Optimization techniques for improving the data cache performance of programs have been reported [4, 7, 9]. The analysis in [9] is limited to scalars, and hence not generally applicable. Iteration space blocking for improving data locality is studied in [4]; this technique is also limited to the type of code that yields naturally to blocking. In [7], a data layout strategy for avoiding conflict misses is presented. However, array access patterns in some applications are too complex to be statically analyzable using this method. The availability of an on-chip SRAM with guaranteed fast access time creates an opportunity for overcoming some of the cache conflict problems (Section 2). The problem of partitioning data into SRAM and cache with the objective of maximizing performance, which we address in this paper, has, to our knowledge, not been attempted before.

> Problem Description

Figure 1(a) shows the architectural block diagram of an application employing a typical embedded core processor (e.g., the LSI Logic CW33000 RISC microprocessor core [3]), where the parts enclosed in the dotted rectangle are implemented in one chip, which interfaces with an off-chip memory, usually realized with DRAM. The address and data buses from the CPU core connect to the Data Cache, Scratch-Pad memory, and External Memory Interface (EMI) blocks. On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the Scratch-Pad memory determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the S_HIT signal. If both the cache and the SRAM report misses, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM. The data address space mapping is shown in Figure 1(b) for a memory of size N data words. Memory addresses 0 ... P-1 map into the Scratch-Pad memory and have a single-processor-cycle access time; thus, in Figure 1(a), S_HIT would be asserted whenever the processor attempts to access any address in the range 0 ... P-1. Memory addresses P ... N-1 map into the off-chip DRAM and are accessed by the CPU through the data cache. A cache hit for an address in the range P ... N-1 results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip and cache memory, results in a delay of 10-20 processor cycles. Suppose a histogram evaluation code (sketched after this paragraph) is executed on a processor configured with a data cache of size 1 KByte. The performance is degraded by the conflict misses in the cache between elements of the two arrays, Hist and BrightnessLevel. Data layout techniques such as [7] are not effective in eliminating this type of conflict, because the accesses to Hist are data-dependent. Note that this problem occurs in both direct-mapped and set-associative caches. However, the conflict problem can be solved elegantly if we include a Scratch-Pad SRAM in the architecture. Since the Hist array is relatively small, we can store it in the SRAM so that it does not conflict with BrightnessLevel in the data cache. This storage assignment improves the performance of the histogram evaluation code significantly.
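The histogram evaluation fragment referred to above is not reproduced in these notes; the C sketch below is a hedged reconstruction of its essential shape, with illustrative array sizes. The key point is that the index into Hist depends on data read from BrightnessLevel.

    #define NUM_PIXELS 2048   /* illustrative image size */
    #define LEVELS     256    /* illustrative number of brightness levels */

    unsigned char BrightnessLevel[NUM_PIXELS]; /* large: maps to DRAM/cache */
    unsigned int  Hist[LEVELS];                /* small: candidate for SRAM */

    void histogram_evaluation(void) {
        for (int i = 0; i < LEVELS; i++)
            Hist[i] = 0;

        /* The index into Hist depends on the data read from
         * BrightnessLevel, so the Hist accesses are data-dependent and
         * cannot be statically analyzed; in a 1 KByte cache the two
         * arrays conflict repeatedly. */
        for (int i = 0; i < NUM_PIXELS; i++)
            Hist[BrightnessLevel[i]]++;
    }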

We present a strategy for partitioning the scalar and array variables in an application code into Scratch-Pad memory and off-chip DRAM (accessed through the data cache), to maximize performance by selectively mapping to the SRAM those variables that are estimated to cause the maximum number of conflicts in the data cache.

> The Partitioning Strategy


The overall approach in partitioning program variables into Scratch-Pad memory and DRAM is to minimize the cross-interference between different variables in the data cache. We first outline the different features of the code affecting the partitioning.

5. Cache

We want inexpensive, fast memory:
- Main memory: large, inexpensive, slow memory that stores the entire program and data
- Cache: small, expensive, fast memory that stores a copy of likely-accessed parts of the larger memory
- There can be multiple levels of cache

>Introduction to Memory Hierarchy

- Usually designed with SRAM: faster but more expensive than DRAM
- Usually on the same chip as the processor: space is limited, so the cache is much smaller than off-chip main memory, but access is faster (1 cycle vs. several cycles for main memory)
- Cache operation: on a request for main memory access (read or write), first check the cache for a copy
  - cache hit: the copy is in the cache; quick access
  - cache miss: the copy is not in the cache; read the address (and possibly its neighbors) into the cache
- Several cache design choices: cache mapping, replacement policy, and write technique

>Different Mapping Techniques

Direct Mapping


- The main memory address is divided into fields:
  - Index: the cache address; the number of bits is determined by the cache size
  - Tag: compared with the tag stored in the cache at the address indicated by the index; if the tags match, check the valid bit
  - Offset: used to find a particular word within the cache line
- Valid bit: indicates whether the data in the slot has been loaded from memory
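A minimal sketch of a direct-mapped lookup, showing how the tag, index, and offset fields are extracted and checked. The geometry (64 lines of 16 bytes) is an assumption chosen for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BYTES 16u   /* assumed line size: 4-bit offset */
    #define NUM_LINES  64u   /* assumed cache size: 6-bit index */

    struct cache_line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    };

    static struct cache_line cache[NUM_LINES];

    /* Direct-mapped lookup: split the address into offset, index, and
     * tag, then hit only if the indexed line is valid and tags match. */
    static bool cache_lookup(uint32_t addr, uint8_t *out) {
        uint32_t offset = addr % LINE_BYTES;
        uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;
        uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);

        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data[offset];  /* cache hit */
            return true;
        }
        return false;                          /* cache miss */
    }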

Fully associative mapping


- The complete main memory address is stored in each cache entry
- All addresses stored in the cache are compared simultaneously with the desired address
- Valid bit and offset work the same as in direct mapping

Set associative Mapping


- A compromise between direct mapping and fully associative mapping
- The index is used as in direct mapping, but each cache address holds the contents and tags of 2 or more memory locations
- The tags of that set are compared simultaneously, as in fully associative mapping
- A cache with set size N is called N-way set-associative; 2-way, 4-way, and 8-way are common

>Cache Replacement Policy

Technique for choosing which block to replace:
- when a fully associative cache is full, or when a set-associative cache's set is full (a direct-mapped cache has no choice)
- Random: the block to replace is chosen at random
- LRU (least-recently used): replace the block not accessed for the longest time
- FIFO (first-in-first-out): push a block onto a queue when it is placed in the cache; choose the block to replace by popping the queue
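As a concrete illustration of LRU bookkeeping, the following sketch (our own, with illustrative sizes) keeps a last-used timestamp per block of a small fully associative cache and evicts the stalest block:

    #include <stdint.h>

    #define NUM_BLOCKS 4  /* illustrative fully associative cache size */

    struct block {
        int      valid;
        uint32_t tag;
        uint64_t last_used;  /* timestamp of the most recent access */
    };

    static struct block blocks[NUM_BLOCKS];
    static uint64_t now;  /* monotonically increasing access counter */

    /* Return the index of the block to evict: an invalid block if one
     * exists, otherwise the block not accessed for the longest time. */
    static int lru_victim(void) {
        int victim = 0;
        for (int i = 0; i < NUM_BLOCKS; i++) {
            if (!blocks[i].valid)
                return i;
            if (blocks[i].last_used < blocks[victim].last_used)
                victim = i;
        }
        return victim;
    }

    /* On every hit or fill, refresh the block's timestamp. */
    static void touch(int i) {
        blocks[i].last_used = ++now;
    }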

>Cache Write Techniques


When written to, the data cache must update main memory:
- Write-through:
  - write to main memory whenever the cache is written to
  - easiest to implement
  - the processor must wait for the slower main memory write
  - potential for unnecessary writes
- Write-back:
  - main memory is written only when a dirty block is replaced
  - an extra dirty bit per block is set when the cache block is written to
  - reduces the number of slow main memory writes

>Cache Impact on System Performance


Most important parameters in terms of performance:
- Total cache size: the total number of data bytes the cache can hold (tag, valid, and other housekeeping bits are not included in the total)
- Degree of associativity
- Data block size

Larger caches achieve lower miss rates but higher access cost, e.g.:
- 2 Kbyte cache: miss rate = 15%, hit cost = 2 cycles, miss cost = 20 cycles; avg. cost of memory access = (0.85 * 2) + (0.15 * 20) = 4.7 cycles
- 4 Kbyte cache: miss rate = 6.5%, hit cost = 3 cycles, miss cost unchanged; avg. cost of memory access = (0.935 * 3) + (0.065 * 20) = 4.105 cycles (an improvement)
- 8 Kbyte cache: miss rate = 5.565%, hit cost = 4 cycles, miss cost unchanged; avg. cost of memory access = (0.94435 * 4) + (0.05565 * 20) = 4.8904 cycles (worse)
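All three averages follow one formula, avg = (1 - miss_rate) * hit_cost + miss_rate * miss_cost; the small helper below reproduces the three cases:

    #include <stdio.h>

    /* Average memory access cost in cycles:
     * (1 - miss_rate) * hit_cost + miss_rate * miss_cost */
    static double avg_access_cost(double miss_rate, double hit_cost,
                                  double miss_cost) {
        return (1.0 - miss_rate) * hit_cost + miss_rate * miss_cost;
    }

    int main(void) {
        printf("2 KB: %.4f\n", avg_access_cost(0.15,    2, 20)); /* 4.7000 */
        printf("4 KB: %.4f\n", avg_access_cost(0.065,   3, 20)); /* 4.1050 */
        printf("8 KB: %.4f\n", avg_access_cost(0.05565, 4, 20)); /* 4.8904 */
        return 0;
    }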


6. Advanced RAM
- DRAM is commonly used as main memory in processor-based embedded systems: high capacity, low cost
- Many DRAM variations have been proposed, needed to keep pace with processor speeds:
  - FPM DRAM: fast page mode DRAM
  - EDO DRAM: extended data out DRAM
  - SDRAM/ESDRAM: synchronous and enhanced synchronous DRAM
  - RDRAM: Rambus DRAM

6.1 Basic DRAM


- The address bus is multiplexed between row and column components
- Row and column addresses are latched in, sequentially, by strobing the ras (row address strobe) and cas (column address strobe) signals, respectively
- Refresh circuitry can be external or internal to the DRAM device: it strobes consecutive memory addresses periodically, causing the memory content to be refreshed
- Refresh circuitry is disabled during read or write operations
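The ras/cas sequencing can be illustrated with a bit-banged sketch; the pin-control helpers are hypothetical stubs (a real memory controller performs this sequence in hardware):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical pin-control helpers, stubbed so the sketch compiles. */
    static void set_address(uint16_t a) { printf("addr <- %u\n", a); }
    static void assert_ras(void)        { printf("ras asserted\n"); }
    static void assert_cas(void)        { printf("cas asserted\n"); }
    static void deassert_strobes(void)  { printf("ras/cas released\n"); }
    static uint8_t sample_data_bus(void){ return 0xA5; /* dummy data */ }

    /* Read one word: present the row address and strobe ras, then
     * present the column address and strobe cas, then sample the bus. */
    static uint8_t dram_read(uint16_t row, uint16_t col) {
        set_address(row);
        assert_ras();                    /* latch the row address    */
        set_address(col);
        assert_cas();                    /* latch the column address */
        uint8_t data = sample_data_bus();
        deassert_strobes();
        return data;
    }

    int main(void) {
        printf("data = 0x%X\n", dram_read(42, 7));
        return 0;
    }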

Fast Page Mode DRAM (FPM DRAM)


- Each row of the memory bit array is viewed as a page
- A page contains multiple words; individual words are addressed by the column address
- Timing: the row (page) address is sent once, then 3 words are read consecutively by sending only a column address for each
- An extra cycle is eliminated on each read/write of words from the same page

Extended data out DRAM (EDO DRAM)
- An improvement of FPM DRAM
- An extra latch before the output buffer allows cas to be strobed before the data read operation has completed
- Reduces read/write latency by an additional cycle

Synchronous and Enhanced Synchronous (ES) DRAM


- SDRAM latches data on the active edge of the clock; eliminates the time needed to detect the ras/cas and rd/wr signals
- A counter is initialized to the column address, then incremented on each active clock edge to access consecutive memory locations
- ESDRAM improves on SDRAM: added buffers enable overlapping of column addressing; faster clocking and lower read/write latency are possible

Rambus DRAM (RDRAM)
- More a bus interface architecture than a DRAM architecture
- Data is latched on both the rising and falling edges of the clock
- Memory is broken into 4 banks, each with its own row decoder, so 4 pages can be open at a time
- Capable of very high throughput

6.2 DRAM Integration Problem


- SRAM is easily integrated on the same chip as the processor; DRAM is more difficult
- The chip-making processes for DRAM and conventional logic differ:
  - goal of conventional logic (IC) designers: minimize parasitic capacitance to reduce signal propagation delays and power consumption
  - goal of DRAM designers: create capacitor cells to retain stored information
- Integration processes are beginning to appear

6.3 Memory Management Unit (MMU)


Duties of an MMU:
- Handles DRAM refresh, bus interface, and arbitration
- Takes care of memory sharing among multiple processors
- Translates logical memory addresses from the processor to physical memory addresses of the DRAM
Modern CPUs often come with a built-in MMU; a single-purpose processor can also be used.


7. Cache Coherence Protocols

The presence of caches in current-generation distributed shared-memory multiprocessors improves performance by reducing the processor's memory access time and by decreasing the bandwidth requirements of both the local memory module and the global interconnect. Unfortunately, the local caching of data introduces the cache coherence problem. Early distributed shared-memory machines left it to the programmer to deal with the cache coherence problem, and consequently these machines were considered difficult to program [5][38][54]. Today's multiprocessors solve the cache coherence problem in hardware by implementing a cache coherence protocol. This chapter outlines the cache coherence problem and describes how cache coherence protocols solve it. In addition, this chapter discusses several different varieties of cache coherence protocols, including their advantages and disadvantages, their organization, their common protocol transitions, and some examples of machines that implement each protocol. Ultimately a designer has to choose a protocol to implement, and this should be done carefully. Protocol choice can lead to differences in cache miss latencies and in the number of messages sent through the interconnection network, both of which can lead to differences in overall application performance. Moreover, some protocols have high-level properties like automatic data distribution or distributed queueing that can help application performance. Before discussing specific protocols, however, let us examine the cache coherence problem in distributed shared-memory machines in detail.

7.1 The Cache Coherence Problem


Figure 2.1 depicts an example of the cache coherence problem. Memory initially contains the value 0 for location x, and processors 0 and 1 both read location x into their caches. If processor 0 writes location x in its cache with the value 1, then processor 1's cache now contains the stale value 0 for location x. Subsequent reads of location x by processor 1 will continue to return the stale, cached value of 0. This is likely not what the programmer expected when she wrote the program. The expected behavior is for a read by any processor to return the most up-to-date copy of the datum. This is exactly what a cache coherence protocol does: it ensures that requests for a certain datum always return the most recent value.


The coherence protocol achieves this goal by taking action whenever a location is written. More precisely, since the granularity of a cache coherence protocol is a cache line, the protocol takes action whenever any cache line is written. Protocols can take two kinds of actions when a cache line L is written: they may either invalidate all copies of L in the other caches in the machine, or they may update those lines with the new value being written. Continuing the earlier example, in an invalidation-based protocol, when processor 0 writes x = 1, the line containing x is invalidated from processor 1's cache. The next time processor 1 reads location x, it suffers a cache miss and goes to memory to retrieve the latest copy of the cache line. In systems with write-through caches, memory can supply the data because it was updated when processor 0 wrote x. In the more common case of systems with writeback caches, the cache coherence protocol has to ensure that processor 1 asks processor 0 for the latest copy of the cache line. Processor 0 then supplies the line from its cache, and processor 1 places that line into its cache, completing its cache miss. In update-based protocols, when processor 0 writes x = 1, it sends the new copy of the datum directly to processor 1 and updates the line in processor 1's cache with the new value. In either case, subsequent reads by processor 1 now see the correct value of 1 for location x, and the system is said to be cache coherent.

Most modern cache-coherent multiprocessors use the invalidation technique rather than the update technique, since it is easier to implement in hardware. As cache line sizes continue to increase, invalidation-based protocols remain popular because of the increased number of updates required when writing a cache line sequentially under an update-based coherence protocol. There are times, however, when using an update-based protocol is superior, including accesses to heavily contended lines and some types of synchronization variables. Typically, designers choose an invalidation-based protocol and add some special features to handle heavily contended synchronization variables. All the protocols presented in this paper are invalidation-based cache coherence protocols, and a later section is devoted to the discussion of synchronization primitives.

8. Directory-Based Coherence


The previous section describes the cache coherence problem and introduces the cache coherence protocol as the agent that solves it. But the question remains: how do cache coherence protocols work? There are two main classes of cache coherence protocols, snoopy protocols and directory-based protocols. Snoopy protocols require the use of a broadcast medium in the machine and hence apply only to small-scale bus-based multiprocessors. In these broadcast systems, each cache snoops on the bus and watches for transactions that affect it. Any time a cache sees a write on the bus, it invalidates that line from its cache if the line is present. Any time a cache sees a read request on the bus, it checks whether it has the most recent copy of the data and, if so, responds to the bus request. These snoopy bus-based systems are easy to build, but unfortunately, as the number of processors on the bus increases, the single shared bus becomes a bandwidth bottleneck, and the snoopy protocol's reliance on a broadcast mechanism becomes a severe scalability limitation. To address these problems, architects have adopted the distributed shared-memory (DSM) architecture. In a DSM multiprocessor, each node contains the processor and its caches, a portion of the machine's physically distributed main memory, and a node controller which manages communication within and between nodes (see Figure 2.2). Rather than being connected by a single shared bus, the nodes are connected by a scalable interconnection network. The DSM architecture allows multiprocessors to scale to thousands of nodes, but the lack of a broadcast medium creates a problem for the cache coherence protocol. Snoopy protocols are no longer appropriate, so instead designers must use a directory-based cache coherence protocol. The first description of directory-based protocols appears in Censier and Feautrier's 1978 paper [9]. The directory is simply an auxiliary data structure that tracks the caching state of each cache line in the system. For each cache line, the directory needs to track which caches, if any, have read-only copies of the line, or which cache has the latest copy of the line if the line is held exclusively. A directory-based cache-coherent machine works by consulting the directory on each cache miss and taking the appropriate action based on the type of request and the current state of the directory. Figure 2.3 shows a directory-based DSM machine. Just as main memory is physically distributed throughout the machine to improve aggregate memory bandwidth, so the directory is distributed to eliminate the bottleneck that would be caused by a single monolithic directory. If each node's main memory is divided into cache-line-sized blocks, then the directory can be thought of as extra bits of state for each block of main memory.

Any time a processor wants to read cache line L, it must send a request to the node that holds the directory for line L; this node is called the home node for L. The home node receives the request, consults the directory, and takes the appropriate action. On a cache read miss, for example, if the directory shows that the line is currently uncached or cached read-only, the home node can supply the data directly from its memory.
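A directory entry and the home node's handling of a read miss can be sketched as follows; the entry layout (state, sharer bit vector, owner field) and the node limit are illustrative assumptions, not a specific machine's format.

    #include <stdint.h>

    #define MAX_NODES 64  /* illustrative machine size */

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

    /* One directory entry per cache-line-sized block of this node's
     * memory: the caching state plus a bit vector of sharing nodes. */
    struct dir_entry {
        dir_state_t state;
        uint64_t    sharers;  /* bit i set => node i has a copy */
        int         owner;    /* valid when state == DIR_EXCLUSIVE */
    };

    /* Home-node handling of a read miss, as described above: uncached
     * or read-only lines are supplied from memory and the requester is
     * added to the sharer set; an exclusive line must first be fetched
     * from its owner (message passing elided here). */
    void handle_read_miss(struct dir_entry *e, int requester) {
        switch (e->state) {
        case DIR_UNCACHED:
        case DIR_SHARED:
            e->sharers |= 1ULL << requester;  /* memory supplies data */
            e->state = DIR_SHARED;
            break;
        case DIR_EXCLUSIVE:
            /* forward the request to e->owner, retrieve the dirty line,
             * write it back, then treat the line as shared */
            e->sharers = (1ULL << e->owner) | (1ULL << requester);
            e->state = DIR_SHARED;
            e->owner = -1;
            break;
        }
    }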


9. MESI Cache Coherence


Abstract

Nowadays, computational systems (multiprocessor and uniprocessor) need to avoid the cache coherence problem. There are several techniques to solve this problem; the MESI cache coherence protocol is one of them. This paper presents a simulator of the MESI protocol, which is used for teaching cache memory coherence in computer systems with a hierarchical memory system and for explaining the process of cache memory allocation in multilevel cache memory systems. The paper gives a description of the course in which the simulator is used, a short explanation of the MESI protocol, and an account of how the simulator works. Then, some experimental results in a real teaching environment are described.
Keywords: Cache memory, Coherence protocol, MESI, Simulator, Teaching tool.

9.1 Introduction
In multiprocessor systems, the memory should provide a set of locations that hold values, and when a location is read it should return the latest value written to that location. This property must be established to communicate data between threads or processes running on one processor: a read returns the latest value written to the location regardless of which process wrote it. This question is known as the cache coherence problem. This kind of problem arises even in uniprocessors when I/O operations occur. Most I/O transfers are performed by direct memory access (DMA) devices that move data between the memory and the peripheral component without involving the processor [5]. When the DMA device writes to a location in main memory, unless special action is taken, the processor may continue to see the old value if that location was previously present in its cache [1]. The techniques and support used to solve the multiprocessor cache coherence problem also solve the I/O coherence problem. Essentially all microprocessors today provide support for multiprocessor cache coherence. The MESI cache coherence protocol is a technique to maintain the coherence of the cache memory content in hierarchical memory systems [2], [7]. It is based on four possible states of the cache blocks: Modified, Exclusive, Shared, and Invalid. Each accessed block lies in one of these states, and the transitions among them define the MESI protocol. Nowadays, most processors (Intel, AMD) use this protocol or versions of it. Knowing how these processors maintain cache coherence is very important for students. This paper presents a simulator of the MESI cache coherence protocol [1], [6]. The MESI simulator is a software tool implemented in the JAVA language. It has been developed specifically for teaching purposes, and is designed to show how the MESI protocol works to maintain cache memory coherence in a multi-user system with a single processor. The simulator allows the cache memory parameters and the statistics of the studied memory accesses to be configured, as well as how these statistics are displayed. The sections of this paper are organised as follows: Section 2 presents work related to the MESI protocol. Section 3 describes the educational objectives of the simulator. Section 4 explains the MESI protocol. Section 5 shows the main characteristics of the MESI simulator, a description of pedagogical issues, and some performance examples. Section 6 describes the experimental results in a real teaching environment. Section 7 indicates our future work on cache memory coherence protocols. Finally, Section 8 concludes the paper.

10. MESI Protocol


The MESI protocol makes it possible to maintain coherence in cached systems. It is based on the four states that a block in the cache memory can have; these four states give the protocol its name: Modified, Exclusive, Shared, and Invalid. The states are explained below:
- Invalid: a non-valid state. The data being looked for is not in the cache, or the local copy of the data is not correct because another processor has updated the corresponding memory position.
- Shared: shared without having been modified. Another processor may also hold the data in its cache memory, and both copies are current.
- Exclusive: exclusive without having been modified; this cache is the only one that holds the correct value of the block. The data block matches the one in main memory.
- Modified: in effect, an exclusive-modified state. This cache has the only correct copy in the whole system; the data in main memory is stale.

The state of each cache memory block changes depending on the actions taken by the CPU [3]. Figure 1 presents these transitions; briefly: at the beginning, when the cache is empty and a block of memory is loaded into the cache by the processor, the block takes the Exclusive state, because no other cache holds a copy of that block. If this block is then written, it changes to the Modified state: the block is still in only one cache, but it has been modified and now differs from the copy in main memory. If a block is in the Exclusive state and another CPU reads it, missing in its own cache, it fetches the block from main memory and loads it into its cache; the block is then in two different caches, so its state becomes Shared. If a CPU wants to write into a block that is held in the Modified state by another cache and is not in its own cache, that block must first be flushed from the cache where it was and written back to main memory, because it was the most current copy of the block in the system; the CPU then writes the block and holds it in its cache memory in the Modified state, since it now has the most current version. If a CPU wants to read a block and does not find it in its cache, and a more recent copy exists elsewhere, the system flushes the block from the cache where it was and writes it back to main memory; from there the block is read, and the new state is Shared, because there are now two current copies in the system. Finally, if a CPU writes into a Shared block, the other copies are invalidated and the block changes its state to Modified.

Figure 1: Transitions from CPU bus

It should be taken into account that the state of a cache memory block can also change because of the actions of another CPU, an I/O interrupt, or a DMA transfer. These transitions are shown in Figure 2. Hence, the processor always uses valid data in its operations: we do not have to worry if a processor has changed data from main memory and holds the most current value of that data in its cache. With the MESI protocol, the processor obtains the most current value every time it is required.
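The transitions described in this section can be condensed into a small next-state function. The sketch below models standard MESI reactions to local reads/writes and snooped bus events; the event names are our own labels, not from the paper.

    #include <assert.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

    typedef enum {
        LOCAL_READ,   /* this CPU reads the block                   */
        LOCAL_WRITE,  /* this CPU writes the block                  */
        BUS_READ,     /* another CPU's read of the block is snooped */
        BUS_WRITE     /* another CPU's write/upgrade is snooped     */
    } mesi_event_t;

    /* Next-state function for one cache block. shared_elsewhere tells
     * a read miss whether any other cache currently holds the block. */
    mesi_state_t mesi_next(mesi_state_t s, mesi_event_t e,
                           int shared_elsewhere) {
        switch (e) {
        case LOCAL_READ:
            if (s == INVALID)       /* read miss: E if sole copy, else S */
                return shared_elsewhere ? SHARED : EXCLUSIVE;
            return s;               /* read hit: state unchanged         */
        case LOCAL_WRITE:
            return MODIFIED;        /* the writer now holds the only
                                       up-to-date copy                   */
        case BUS_READ:
            if (s == MODIFIED || s == EXCLUSIVE)
                return SHARED;      /* supply/write back; now shared     */
            return s;
        case BUS_WRITE:
            return INVALID;         /* another writer invalidates us     */
        }
        assert(0);
        return INVALID;
    }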

11. References
[1] Culler, D.E., Singh, J.P., and Gupta, A. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.
[2] Hamacher, C., Vranesic, Z., and Zaky, S. Computer Organization. McGraw-Hill, 2003.
[3] Handy, J. The Cache Memory Book. Academic Press, 1998.
[4] McGettrick, A., Theys, M.D., Soldan, D.L., and Srimani, P.K. Computer Engineering Curriculum in the New Millennium. IEEE Transactions on Education, vol. 46, no. 4, November 2003.
[5] Patterson, D.A., and Hennessy, J.L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, Inc., 2004.
[6] Stallings, W. Computer Organization and Architecture. Prentice-Hall, 2006.
[7] Tanenbaum, A.S. Structured Computer Organization. Prentice-Hall, 2006.

