Introduction
Embedded systems functionality aspects:
- Processing: processors transform data
- Storage: memory retains data
- Communication: buses transfer data
Memory: basic concept
- Stores a large number of bits: an m x n memory holds m words of n bits each
- k = log2(m) address input signals; equivalently, m = 2^k words
- e.g., a 4,096 x 8 memory: 32,768 bits, 12 address input signals, 8 input/output data signals
Memory access:
- r/w: selects read or write
- enable: read or write occurs only when asserted
- multiport: multiple simultaneous accesses to different locations
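The sizing arithmetic above can be checked with a few lines of code; `address_lines` and `total_bits` are illustrative helper names, not part of any real API.

```python
import math

def address_lines(words: int) -> int:
    """Number of address input signals k needed for m words: k = log2(m)."""
    return int(math.log2(words))

def total_bits(words: int, word_size: int) -> int:
    """Total storage of an m x n memory, in bits."""
    return words * word_size

# The 4,096 x 8 example from the text:
k = address_lines(4096)     # 12 address input signals
bits = total_bits(4096, 8)  # 32,768 bits
```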
Embedded Memories
Storage permanence: the ability of a memory to hold its stored bits after they are written
Write ability: the manner and speed with which a memory can be written
Ranges of write ability
- High end: processor writes to memory simply and quickly, e.g., RAM
- Middle range: processor writes to memory, but more slowly, e.g., FLASH, EEPROM
- Lower range: special equipment (a programmer) must be used to write to memory, e.g., EPROM, OTP ROM
- Low end: bits stored only during fabrication, e.g., mask-programmed ROM
In-system programmable memory
- Can be written by a processor in the embedded system using the memory
- Covers memories in the high end and middle range of write ability
Ranges of storage permanence
- High end: essentially never loses bits, e.g., mask-programmed ROM
- Middle range: holds bits for days, months, or years after the memory's power source is turned off, e.g., NVRAM
- Lower range: holds bits as long as power is supplied to the memory, e.g., SRAM
- Low end: begins to lose bits almost immediately after they are written, e.g., DRAM
Nonvolatile memory
- Holds bits after power is no longer supplied
- High end and middle range of storage permanence
ROM (read-only memory)
- Can be read from, but not written to, by a processor in an embedded system
- Traditionally written to ("programmed") before insertion into the embedded system
- Uses: store the software program for a general-purpose processor (program instructions can occupy one or more ROM words); store constant data needed by the system; implement a combinational circuit
Mask-programmed ROM
- Connections programmed at fabrication time by a set of masks
- Lowest write ability: programmed only once
- Highest storage permanence: bits never change unless damaged
- Typically used for the final design of high-volume systems: spreads out NRE cost for a low unit cost
EPROM: Erasable programmable ROM
- The programmable component is a MOS transistor with a floating gate surrounded by an insulator
- (a) Negative charges form a channel between source and drain, storing a logic 1
- (b) A large positive voltage at the gate causes negative charges to move out of the channel and become trapped in the floating gate, storing a logic 0
- (c) (Erase) Shining UV rays on the surface of the floating gate causes the negative charges to return to the channel from the floating gate, restoring the logic 1
- (d) An EPROM package has a quartz window through which UV light can pass
- Better write ability: can be erased and reprogrammed thousands of times
- Reduced storage permanence: a program lasts about 10 years but is susceptible to radiation and electrical noise
- Typically used during design development
OTP ROM: One-time programmable ROM
- Connections programmed after manufacture by the user: the user provides a file of the desired ROM contents, which is input to a machine called a ROM programmer; each programmable connection is a fuse, and the ROM programmer blows the fuses where connections should not exist
- Very low write ability: typically written only once, and requires a ROM programmer device
- Very high storage permanence: bits don't change unless the device is reconnected to the programmer and more fuses are blown
- Commonly used in final products: cheaper, and harder to inadvertently modify
EEPROM: Electrically Erasable Programmable ROM
- Programmed and erased electronically, typically by using a higher-than-normal voltage; can program and erase individual words
- Better write ability:
  - Can be in-system programmable, with a built-in circuit to provide the higher-than-normal voltage; a built-in memory controller is commonly used to hide these details from the memory user
  - Writes are very slow due to erasing and programming; a busy pin indicates to the processor that the EEPROM is still writing
  - Can be erased and programmed tens of thousands of times
- Similar storage permanence to EPROM (about 10 years)
- Far more convenient than EPROM, but more expensive
Flash Memory
- Extension of EEPROM: same floating-gate principle, same write ability and storage permanence
- Fast erase: large blocks of memory are erased at once rather than one word at a time; blocks are typically several thousand bytes
- Writes to single words may be slower: the entire block must be read, the word updated, and then the entire block written back
- Used in embedded systems that store large data items in nonvolatile memory, e.g., digital cameras, TV set-top boxes, cell phones
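The read-modify-write cost of a single-word update can be sketched as follows. This is a simplified model, not a real flash driver API: the block size, the bytearray standing in for the flash array, and `write_word` are all illustrative assumptions.

```python
BLOCK_SIZE = 4096  # bytes per erase block (illustrative; "several thousand bytes")

def write_word(flash: bytearray, addr: int, value: int) -> None:
    """Update a single byte in flash: the entire block containing addr
    must be read, the word updated, the block erased, and then the
    entire block written back."""
    start = (addr // BLOCK_SIZE) * BLOCK_SIZE
    block = bytearray(flash[start:start + BLOCK_SIZE])      # read entire block
    block[addr - start] = value                             # update one word
    flash[start:start + BLOCK_SIZE] = b'\xff' * BLOCK_SIZE  # erase block to all 1s
    flash[start:start + BLOCK_SIZE] = block                 # write block back
```

This is why flash write latency for scattered single-word updates is much worse than for sequential bulk writes, which amortize one erase over a whole block.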
RAM variations
PSRAM: pseudo-static RAM
- DRAM with a built-in memory refresh controller
- Popular low-cost, high-density alternative to SRAM
NVRAM: nonvolatile RAM
- Holds data after external power is removed
- Battery-backed RAM: SRAM with its own permanently connected battery; writes as fast as reads; no limit on the number of writes, unlike nonvolatile ROM-based memory
- SRAM with EEPROM or flash: stores the complete RAM contents in EEPROM or flash before power is turned off
4. Scratchpad Memory
Embedded processor-based system
> Processor core
> Embedded memory: instruction and data cache, embedded SRAM, embedded DRAM, scratch-pad memory
> Design problems:
1. How much on-chip memory?
2. How should on-chip memory be partitioned between cache and scratchpad?
3. Which variables/arrays should be placed in the scratchpad?
Goals
> Improve performance > Save power
Abstract
Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on microprocessor cores. In addition to a data cache that interfaces with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often used in several applications. This paper presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning an application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.
> Introduction
Complex embedded system applications typically use heterogeneous chips consisting of microprocessor cores, along with on-chip memory and co-processors. Flexibility and short design time considerations drive the use of CPU cores as instantiable modules in system designs [5]. The integration of processor cores and memory in the same chip effects a reduction in the chip count, leading to cost-effective solutions. Examples of commercial microprocessor cores commonly used in system design are LSI Logic's CW33000 series [3] and the ARM series from Advanced RISC Machines [10].
Typical examples of optional modules integrated with the processor on the same chip are the instruction cache, data cache, and on-chip SRAM. The instruction and data caches are fast local memories serving as an interface between the processor and the off-chip memory. The on-chip SRAM, termed Scratch-Pad memory, is a small, high-speed data memory that is mapped into an address space disjoint from the off-chip memory, but connected to the same address and data buses. Both the cache and Scratch-Pad SRAM have a single-processor-cycle access latency, whereas an access to the off-chip memory (usually DRAM) takes several (typically 10-20) processor cycles. The main difference between the Scratch-Pad SRAM and the data cache is that the SRAM guarantees a single-cycle access time, whereas an access to the cache is subject to compulsory, capacity, and conflict misses. When an embedded application is compiled, the accessed data can be stored either in the Scratch-Pad memory or in off-chip memory; in the second case, it is accessed by the processor through the data cache. We present a technique for minimizing the total execution time of an embedded application by a careful partitioning of the scalar and array variables used in the application into off-chip DRAM (accessed through the data cache) and Scratch-Pad SRAM. Optimization techniques for improving the data cache performance of programs have been reported [4, 7, 9]. The analysis in [9] is limited to scalars, and hence, not generally applicable. Iteration space blocking for improving data locality is studied in [4]. This technique is also
Dept of ECE,RVCE Bangalore. Page 11
limited to the type of code that yields naturally to blocking. In [7], a data layout strategy for avoiding conflict misses is presented. However, array access patterns in some applications are too complex to be statically analyzable using this method. The availability of an on-chip SRAM with guaranteed fast access time creates an opportunity for overcoming some of the cache conflict problems (Section 2). The problem of partitioning data into SRAM and cache with the objective of maximizing performance, which we address in this paper, has, to our knowledge, not been attempted before.
Figure 1(a) shows the architectural block diagram of an application employing a typical embedded core processor (e.g., the LSI Logic CW33000 RISC Microprocessor core [3]), where the parts enclosed in the dotted rectangle are implemented in one chip, which interfaces with an off-chip memory, usually realized with DRAM. The address and data buses from the CPU core connect to the Data Cache, Scratch-Pad memory, and External Memory Interface (EMI) blocks. On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the Scratch-Pad memory determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the S_HIT signal. If both the cache and SRAM report misses, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM. The data address space mapping is shown in Figure 1(b). The lowest addresses map into the Scratch-Pad memory and have a single-processor-cycle access time; in Figure 1(a), S_HIT is asserted whenever the processor accesses any address in this range. The remaining addresses map into the off-chip DRAM and are accessed by the CPU through the data cache: a cache hit results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip memory and the cache, results in a delay of 10-20 processor cycles. Suppose the Histogram Evaluation code is executed on a processor configured with a data cache of size 1 KByte. The performance is degraded by conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel. Data layout techniques such as [7] are not effective in eliminating this type of conflict, because the accesses to Hist are data-dependent.
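The S_HIT/C_HIT address decoding just described can be modeled in a few lines. The scratch-pad and total memory sizes below (`SRAM_SIZE`, `MEM_SIZE`) are illustrative placeholders, since the concrete constants from Figure 1(b) did not survive in these notes.

```python
SRAM_SIZE = 1024   # words mapped to the on-chip scratch-pad (illustrative)
MEM_SIZE = 65536   # total data address space in words (illustrative)

def route_access(addr: int) -> str:
    """Decide which memory services an address, mirroring the S_HIT /
    C_HIT decoding of Figure 1: low addresses go to the scratch-pad,
    the rest go through the data cache to off-chip DRAM."""
    if 0 <= addr < SRAM_SIZE:
        return "scratchpad"   # single-cycle access; S_HIT asserted
    elif addr < MEM_SIZE:
        return "cache/DRAM"   # through data cache; a miss costs 10-20 cycles
    raise ValueError("address out of range")
```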
Note that this problem occurs in both direct-mapped and set-associative caches. However, the conflict problem can be solved elegantly if we include a Scratch-Pad SRAM in the architecture. Since the Hist array is relatively small, we can store it in the SRAM, where it does not conflict with BrightnessLevel in the data cache. This storage assignment improves the performance of the Histogram Evaluation code significantly.
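The Histogram Evaluation code itself is not reproduced in these notes; the following is a plausible sketch of its core loop, assuming Hist counts the occurrences of each brightness level in an image (the array names follow the text, but the loop structure is an assumption).

```python
LEVELS = 256  # one Hist bin per brightness level (assumed)

def histogram(BrightnessLevel):
    """Histogram Evaluation sketch: the index into Hist depends on the
    image data itself, which is why static data-layout techniques cannot
    predict (and so cannot avoid) the cache conflicts between the small
    Hist array and the large BrightnessLevel array."""
    Hist = [0] * LEVELS                  # small array: scratch-pad candidate
    for pixel in BrightnessLevel:        # large array: cached DRAM
        Hist[pixel] += 1                 # data-dependent access
    return Hist
```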
We present a strategy for partitioning the scalar and array variables in application code into Scratch-Pad memory and off-chip DRAM (accessed through the data cache), maximizing performance by selectively mapping to the SRAM those variables that are estimated to cause the maximum number of conflicts in the data cache.
5. Cache Memory
- Want inexpensive, fast memory
- Main memory: large, inexpensive, slow memory; stores the entire program and data
- Cache: small, expensive, fast memory; stores a copy of likely-accessed parts of the larger memory; there can be multiple levels of cache
- Usually designed with SRAM: faster but more expensive than DRAM
- Usually on the same chip as the processor: space is limited, so the cache is much smaller than off-chip main memory; faster access (1 cycle vs. several cycles for main memory)
Cache operation:
- On a request for main memory access (read or write), first check the cache for a copy
- Cache hit: the copy is in the cache; quick access
- Cache miss: the copy is not in the cache; read the address (and possibly its neighbors) into the cache
Several cache design choices: cache mapping, replacement policies, and write techniques
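As a sketch of the hit/miss check just described, here is a minimal direct-mapped cache model. The class name and sizes are illustrative, and a real cache also tracks valid bits and the data itself, which are omitted here.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model: on an access, check for a copy
    first (hit), and fetch the block into the cache on a miss."""
    def __init__(self, lines: int, block_size: int):
        self.lines = lines
        self.block_size = block_size
        self.tags = [None] * lines           # one tag per cache line

    def access(self, addr: int) -> bool:
        block = addr // self.block_size      # which memory block
        index = block % self.lines           # which cache line it maps to
        tag = block // self.lines            # identifies the block in that line
        if self.tags[index] == tag:
            return True                      # cache hit: quick access
        self.tags[index] = tag               # cache miss: load block, evict old
        return False
```

Neighboring addresses in the same block hit after the first miss, which is how the cache exploits spatial locality.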
Cache replacement policy
- Technique for choosing which block to replace when a fully associative cache is full, or when a set-associative cache's line is full (a direct-mapped cache has no choice)
- Random: replace a block chosen at random
- LRU (least-recently used):
replace the block not accessed for the longest time
- FIFO (first-in-first-out): push a block onto a queue when it enters the cache; choose the block to replace by popping the queue
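The LRU policy can be sketched with an ordered dictionary, where insertion order doubles as recency order; `LRUCache` is an illustrative software model, not a hardware description.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with least-recently-used replacement:
    on a miss when the cache is full, evict the block that has gone
    unused for the longest time."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()          # oldest entry = least recently used

    def access(self, block: int) -> bool:
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return True                      # hit
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[block] = True
        return False                         # miss
```

A FIFO variant would simply skip the `move_to_end` call, so eviction order depends only on when a block entered the cache, not on later accesses.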
6. Advanced RAM
- DRAMs are commonly used as main memory in processor-based embedded systems: high capacity, low cost
- Many variations of DRAM have been proposed, needed to keep pace with processor speeds:
- FPM DRAM: fast page mode DRAM
- EDO DRAM: extended data out DRAM
- SDRAM/ESDRAM: synchronous and enhanced synchronous DRAM
- RDRAM: Rambus DRAM
Fast page mode DRAM (FPM DRAM)
- Each row (page) contains multiple words; individual words are addressed by a column address
- Timing: the row (page) address is sent once, then several words are read consecutively by sending only a column address for each
- An extra cycle is eliminated on each read/write of words from the same page
Extended data out DRAM (EDO DRAM)
- An improvement of FPM DRAM: an extra latch before the output buffer allows the CAS signal to be strobed before the data read operation has completed
- Reduces read/write latency by an additional cycle
Rambus DRAM (RDRAM)
- More of a bus interface architecture than a DRAM architecture
- Data is latched on both the rising and falling edges of the clock
- Memory is broken into 4 banks, each with its own row decoder, so 4 pages can be open at a time
- Capable of very high throughput
The presence of caches in current-generation distributed shared-memory multiprocessors improves performance by reducing the processors' memory access time and by decreasing the bandwidth requirements of both the local memory module and the global interconnect. Unfortunately, the local caching of data introduces the cache coherence problem. Early distributed shared-memory machines left it to the programmer to deal with the cache coherence problem, and consequently these machines were considered difficult to program [5][38][54]. Today's multiprocessors solve the cache coherence problem in hardware by implementing a cache coherence protocol. This chapter outlines the cache coherence problem and describes how cache coherence protocols solve it. In addition, this chapter discusses several different varieties of cache coherence protocols, including their advantages and disadvantages, their organization, their common protocol transitions, and some examples of machines that implement each protocol. Ultimately a designer has to choose a protocol to implement, and this should be done carefully. Protocol choice can lead to differences in cache miss latencies and differences in the number of messages sent through the interconnection network, both of which can lead to differences in overall application performance. Moreover, some protocols have high-level properties like automatic data distribution or distributed queueing that can help application performance. Before discussing specific protocols, however, let us examine the cache coherence problem in distributed shared-memory machines in detail.
The coherence protocol achieves this goal by taking action whenever a location is written. More precisely, since the granularity of a cache coherence protocol is a cache line, the protocol takes action whenever any cache line is written. Protocols can take two kinds of actions when a cache line L is written: they may either invalidate all copies of L in the other caches in the machine, or they may update those lines with the new value being written. Continuing the earlier example, in an invalidation-based protocol when processor 0 writes x = 1, the line containing x is invalidated in processor 1's cache. The next time processor 1 reads location x it suffers a cache miss, and goes to memory to retrieve the latest copy of the cache line. In systems with write-through caches, memory can supply the data because it was updated when processor 0 wrote x. In the more common case of systems with writeback caches, the cache coherence protocol has to ensure that processor 1 asks processor 0 for the latest copy of the cache line. Processor 0 then supplies the line from its cache and processor 1 places that line into its cache, completing its cache miss. In update-based protocols, when processor 0 writes x = 1, it sends the new copy of the datum directly to processor 1 and updates the line in processor 1's cache with the new value. In either case, subsequent reads by processor 1 now see the correct value of 1 for location x, and the system is said to be cache coherent.
Most modern cache-coherent multiprocessors use the invalidation technique rather than the update technique since it is easier to implement in hardware. As cache line sizes continue to increase, the invalidation-based protocols remain popular because of the increased
number of updates required when writing a cache line sequentially with an update-based coherence protocol. There are times, however, when using an update-based protocol is superior: these include accesses to heavily contended lines and some types of synchronization variables. Typically, designers choose an invalidation-based protocol and add some special features to handle heavily contended synchronization variables. All the protocols presented in this paper are invalidation-based cache coherence protocols, and a later section is devoted to the discussion of synchronization primitives.
of nodes, but the lack of a broadcast medium creates a problem for the cache coherence protocol. Snoopy protocols are no longer appropriate, so instead designers must use a directory-based cache coherence protocol. The first description of directory-based protocols appears in Censier and Feautrier's 1978 paper [9]. The directory is simply an auxiliary data structure that tracks the caching state of each cache line in the system. For each cache line in the system, the directory needs to track which caches, if any, have read-only copies of the line, or which cache has the latest copy of the line if the line is held exclusively. A directory-based cache-coherent machine works by consulting the directory on each cache miss and taking the appropriate action based on the type of request and the current state of the directory. Figure 2.3 shows a directory-based DSM machine. Just as main memory is physically distributed throughout the machine to improve aggregate memory bandwidth, so the directory is distributed to eliminate the bottleneck that would be caused by a single monolithic directory. If each node's main memory is divided into cache-line-sized blocks, then the directory can be thought of as extra bits of state for each block of main memory. Any time
a processor wants to read cache line L, it must send a request to the node that has the directory
for line L. This node is called the home node for L. The home node receives the request, consults the directory, and takes the appropriate action. On a cache read miss, for example, if the directory shows that the line is currently uncached or is cached read-only
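The consult-and-act step at the home node can be sketched as follows. The state names and the `handle_read_miss` helper are illustrative, and the reply in the uncached/read-only case (data supplied from memory) follows the standard directory scheme described above.

```python
def handle_read_miss(directory: dict, line: int, requester: int):
    """Home node consults the directory for `line` on a read miss.
    Directory entries map a line to (state, sharer set), with illustrative
    states 'uncached', 'shared', and 'exclusive'."""
    state, sharers = directory.get(line, ("uncached", set()))
    if state in ("uncached", "shared"):
        # Memory holds a valid copy: add the requester as a read-only sharer.
        directory[line] = ("shared", sharers | {requester})
        return "supply data from memory"
    # Exclusive: the owner holds the latest copy; it must supply the line,
    # after which both owner and requester hold read-only copies.
    owner = next(iter(sharers))
    directory[line] = ("shared", {owner, requester})
    return f"fetch latest copy from node {owner}"
```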
Abstract
Nowadays, computational systems (both multiprocessor and uniprocessor) need to avoid the cache coherence problem. There are several techniques to solve this problem; the MESI cache coherence protocol is one of them. This paper presents a simulator of the MESI protocol, which is used for teaching cache memory coherence on computer systems with a hierarchical memory system and for explaining the process of cache memory placement in multilevel cache memory systems. The paper gives a description of the course in which the simulator is used, a short explanation of the MESI protocol, and how the simulator works. Then, some experimental results in a real teaching environment are described.
Keywords: Cache memory, Coherence protocol, MESI, Simulator, Teaching tool.
9.1 Introduction
In multiprocessor systems, the memory should provide a set of locations that hold values, and when a location is read it should return the latest value written to that location. This property must hold to communicate data between threads or processes running on one processor: a read returns the latest value written to the location regardless of which process wrote it. This issue is known as the cache coherence problem. This kind of problem arises even in uniprocessors when I/O operations occur. Most I/O transfers are performed by direct memory access (DMA) devices that move data between the memory and the peripheral component without involving the processor [5]. When the DMA device writes to a location in main memory, unless special action is taken, the processor may continue to see the old value if that location was previously present in its cache [1]. The techniques and support used to solve the multiprocessor cache coherence problem also solve the I/O coherence problem. Essentially all microprocessors today provide support for multiprocessor cache coherence. The MESI cache coherence protocol is a technique to maintain the coherence of the cache memory contents in hierarchical memory systems [2], [7]. It is based on four possible states of the cache blocks: Modified, Exclusive, Shared, and Invalid. Each accessed block lies in one of these states, and the transitions among them define the MESI protocol. Nowadays, most processors (Intel, AMD) use this protocol or versions of it. Knowing how these processors maintain cache coherence is very important for the students. This paper presents a simulator of the MESI cache coherence protocol [1], [6]. The MESI simulator is a software tool which has been implemented in the JAVA language. It has been developed
specifically for teaching purposes. It has been designed to show how the MESI protocol works to maintain cache memory coherence in a multi-user system for a single processor. The simulator permits configuring the cache memory parameters and the statistics of the memory accesses under study; it also permits determining how these statistics are shown. The sections in this paper are organised as follows: Section 2 presents some works related to the MESI protocol. Section 3 describes the educational objectives for the simulator. Section 4 explains the MESI protocol. Section 5 shows the main characteristics of the MESI simulator, a description of pedagogical issues, and some performance examples. Section 6 describes the experimental results in a real teaching environment. Section 7 indicates our future work on cache memory coherence protocols. Finally, section 8 concludes this paper.
are two current copies in the system. Another option is that a CPU writes into a shared block; in this case the block changes its state to Modified, and the other copies are invalidated.
It should be taken into account that the state of a cache memory block can change because of the actions of another CPU, an input/output interrupt, or a DMA transfer. These transitions are shown in Figure 2. Hence, the processor always uses valid data in its operations. We do not have to worry if a processor has changed data from the main memory and holds the most current value of these data in its cache: with the MESI protocol, the processor obtains the most current value every time it is required.
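A few of the transitions described above can be sketched as simple state functions; this is a teaching-style simplification of the MESI state machine (function names are illustrative, and bus signalling is omitted).

```python
# States: 'M'odified, 'E'xclusive, 'S'hared, 'I'nvalid

def on_local_write(state: str) -> str:
    """This processor writes the block: it always ends up Modified.
    From S, the other shared copies are invalidated first; from I,
    the line is fetched on the write miss."""
    return "M"

def on_remote_write(state: str) -> str:
    """Another CPU (or a DMA/I/O transfer) writes the same block:
    the local copy becomes Invalid."""
    return "I"

def on_remote_read(state: str) -> str:
    """Another CPU reads the block: a Modified or Exclusive copy is
    downgraded to Shared (a Modified copy is written back first);
    Shared and Invalid copies are unaffected."""
    return "S" if state in ("M", "E") else state
```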
11. References
[1] Culler, D.E., Singh, J.P., and Gupta, A. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.
[2] Hamacher, C., Vranesic, Z., and Zaky, S. Computer Organization. McGraw-Hill, 2003.
[3] Handy, J. The Cache Memory Book. Academic Press, 1998.
[4] McGettrick, A., Theys, M.D., Soldan, D.L., and Srimani, P.K. Computer Engineering Curriculum in the New Millennium. IEEE Transactions on Education, vol. 46, no. 4, November 2003.
[5] Patterson, D.A., and Hennessy, J.L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, Inc., 2004.
[6] Stallings, W. Computer Organization and Architecture. Prentice-Hall, 2006.
[7] Tanenbaum, A.S. Structured Computer Organization. Prentice-Hall, 2006.
CLEI ELECTRONIC JOURNAL, VOLUME 12, NUMBER 1, PAPER 5, APRIL 2009