
Embedded Computer: Memory System, Input/Output

Ingo Sander ingo@imit.kth.se
September 4, 2007
IL2206 Embedded Systems

Outline

Memory System
  Types of memory
  Caches
Input/Output
  Memory Mapped I/O
  Polling
  Interrupt
Memory System

The memory bottleneck

Most instructions in a RISC processor can execute in a single clock cycle
BUT access to the main memory (typically in SDRAM) is slow
If the memory access time can be shortened, the system performs considerably better
Memory Performance

Latency
  Time between the following two time instances:
    the time instance where the processor issues a request to the memory
    the time instance where the requested data arrives and is available for use by the processor

Memory Bandwidth
  Rate at which information can be transferred from the memory system
  If R is the number of requests that the memory can serve simultaneously and L is the latency, then BW = R/L
  Example: a 32-bit memory with a latency of 20 ns has a bandwidth BW = 32 bit / 20 ns = 1.6 Gbit/s = 200 MByte/s
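The bandwidth example can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and 1 MByte is assumed to be 10^6 bytes.

```c
/* Bandwidth in bits per second for a memory that delivers
 * bits_per_access bits every latency_ns nanoseconds (BW = R/L). */
double bandwidth_bits_per_s(double bits_per_access, double latency_ns)
{
    return bits_per_access / (latency_ns * 1e-9);
}

/* Convert bits per second to MByte/s, assuming 1 MByte = 10^6 bytes. */
double bits_per_s_to_mbyte_per_s(double bits_per_s)
{
    return bits_per_s / 8.0 / 1e6;
}
```

For 32 bits and 20 ns the first function gives about 1.6e9 bit/s, which the second converts to about 200 MByte/s.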
Types of memory

ROM (Read Only Memory)
  Mask-programmable
  Flash programmable (can be reprogrammed, but has long access times)
RAM (Random Access Memory)
  SRAM (Static RAM)
  DRAM (Dynamic RAM)

SRAM vs. DRAM

SRAM
  Faster
  Easier to integrate with logic
  Higher power consumption
DRAM
  Denser
  Must be refreshed
Synchronous DRAM

Clock signal is used internally to pipeline accesses
  Memory must be fast enough to respond to request
  Request takes multiple clock cycles
Provides burst mode access:
  1, 2, 4, 8 locations

Flash issues

Flash is programmed at system voltages
Erasure time is long
Must be erased in blocks
Limited number of erasures
A Flash memory is very useful in combination with SRAM or SDRAM devices, since it can load these devices at power-on
Memory Access Times and Costs

Memory Technology   Typical Access Time             $ per GB in 2004
SRAM                0.5 ns - 5 ns                   $4000 - $10000
DRAM                50 ns - 70 ns                   $100 - $200
Magnetic disk       5,000,000 ns - 20,000,000 ns    $0.5 - $2

Source: Patterson and Hennessy, 2004

Embedded system memories

Large fast memories are very expensive
Embedded systems have to be produced at a low cost
  a single SRAM main memory is in general too expensive
  a combination of fast and slow memories is often still feasible
Caches

Large fast memories are too expensive, but small fast memories are feasible
A cache memory is a small but fast memory that is located near the CPU to reduce memory access times
Ideally the processor only needs to access the cache and not the main memory

Memory is a bottleneck

While the CPU is fast, each memory access takes a long time and slows down the system
[Figure: CPU (fast) connected via the bus (slow) to the memory (very slow)]
Caches can increase the performance if most memory requests do not need to access the main memory
[Figure: CPU (fast) with a cache (fast), connected via the bus (slow) to the memory (very slow)]
Caches and CPUs

[Figure: the CPU sends addresses to the cache controller and exchanges data with the cache; the cache controller passes addresses and data between the cache and the main memory. 2000 Wolf (Morgan Kaufman)]

Cache operation

Many main memory locations are mapped onto one cache entry
May have caches for:
  instructions;
  data;
  data + instructions (unified).
Memory access time is no longer deterministic!
Terms

Cache hit: required location is in cache.
Cache miss: required location is not in cache.
Working set: set of locations used by program in a time interval.

Types of misses

Compulsory (cold): location has never been accessed.
Capacity: working set is too large.
Conflict: multiple locations in working set map to same cache entry.
Memory system performance

h = cache hit rate.
t_cache = cache access time, t_main = main memory access time.
Average memory access time:
  $t_{av} = h \, t_{cache} + (1 - h) \, t_{main}$

Write operations

Write-through: immediately copy write to main memory
  Causes unnecessary memory communication
  Memory always has a valid copy of the cache block
Write-back: write to main memory only when location is removed from cache
  Tries to minimize communication with memory
  Memory may have an invalid copy of the cache block; it must be updated when a cache block is replaced
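The average memory access time formula translates directly into C. This is a minimal sketch; the function name and the numbers in the note below are illustrative assumptions, not measurements.

```c
/* Average memory access time: t_av = h * t_cache + (1 - h) * t_main.
 * h is the cache hit rate (0.0 to 1.0); both times use the same unit. */
double average_access_time(double h, double t_cache, double t_main)
{
    return h * t_cache + (1.0 - h) * t_main;
}
```

With an assumed hit rate of 0.9, a cache access time of 5 ns and a main memory access time of 70 ns, the result is about 11.5 ns. With h = 1 the result equals the cache access time.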
Replacement

Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
  Random.
  Least-recently used (LRU).
In a write-back cache, replacing a modified (dirty) cache entry also means writing its contents back to the memory. Thus a cache miss can be expensive!

Cache performance benefits

Keep frequently-accessed locations in fast cache.
Cache retrieves more than one word at a time.
  Sequential accesses are faster after first access.
Data Transfer to Cache

Words are transferred between cache and processor
Blocks (of multiple words, given by the block size) are transferred between cache and memory
[Figure: word transfer between CPU and cache, block transfer between cache and main memory]

Cache organizations

Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).
Direct-mapped: each memory location maps onto exactly one cache entry.
N-way set-associative: each memory location can go into one of N entries.
Direct-mapped cache

A direct-mapped cache consists of several cache lines, where each cache line has a status bit, a tag and data (cache block)
There is a given mapping for each memory location!
[Figure: cache lines 0 to 7, each with a status bit, a tag and a cache block of four words (Wd 0 to Wd 3), beside main memory blocks of four words at addresses 0, 10, 20, ..., FF0]

Example Direct Mapped Cache

Cache has 2 KBytes (512 words), organized as 64 cache lines with a block size of 8 words
Memory has 64 KBytes (16 KWords), which can be seen as 2048 blocks of 8 words
Address size is 16 bits
The direct-map technique uses the modulo (remainder) operation to map a memory block onto a cache block
  Block 0, 64, 128, ... is mapped on Block 0 in the cache
  Block 1, 65, 129, ... is mapped on Block 1 in the cache
Example Direct Mapped Cache

Memory address: Tag (5 bits) | Block (6 bits) | Word (3 bits) | Byte offset (2 bits)
[Figure: main memory blocks 0 to 2047 (Block 0 at 0x0000, Block 1 at 0x0020, ..., Block 2047 at 0xFFE0) mapped onto cache lines 0 to 63. Each cache line holds 1 valid bit, a 5-bit tag and 8 data words of 32 bits. A block has 8 words.]

Direct-mapped cache

[Figure: a cache line with valid bit (1), tag (0xabcd) and data bytes. The address is split into tag, index and offset: the index selects the cache line, the stored tag is compared (=) with the address tag to produce the hit signal, and the offset selects the value (byte, or halfword/word).]
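The address split of the example direct-mapped cache (16-bit address with a 5-bit tag, 6-bit block index, 3-bit word and 2-bit byte offset) can be sketched in C with shifts and masks. The type and function names are made up for illustration.

```c
#include <stdint.h>

/* Fields of a 16-bit address in the example direct-mapped cache. */
typedef struct {
    unsigned tag;   /* 5 bits, address bits 15..11 */
    unsigned line;  /* 6 bits, address bits 10..5 (cache line index) */
    unsigned word;  /* 3 bits, address bits 4..2 (word within block) */
    unsigned byte;  /* 2 bits, address bits 1..0 (byte within word) */
} dm_address;

/* Split an address into tag, cache line, word and byte offset. */
dm_address split_address(uint16_t addr)
{
    dm_address a;
    a.byte = addr & 0x3u;
    a.word = (addr >> 2) & 0x7u;
    a.line = (addr >> 5) & 0x3Fu;
    a.tag  = (addr >> 11) & 0x1Fu;
    return a;
}
```

Address 0x0820 is the first byte of memory block 65. It splits into tag 1 and cache line 1, which agrees with the modulo rule (65 mod 64 = 1).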
Direct-mapped cache locations

Many locations map onto the same cache block.
Conflict misses are easy to generate:
  Array a[] uses locations 0, 1, 2, ...
  Array b[] uses locations 1024, 1025, 1026, ...
  Operation a[i] + b[i] generates conflict misses.

Example 2-way set-associative cache

Memory address: Tag (6 bits) | Set (5 bits) | Offset (5 bits)
[Figure: main memory blocks (Block 0, Block 1, ..., Block 31, Block 32, Block 33, ..., Block 127, ...) mapped onto sets 0 to 31, each set with Way 0 and Way 1. Each way holds 1 valid bit, a 6-bit tag and 8 data words of 32 bits. A block has 8 words.]
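The conflict misses caused by a[i] + b[i] can be reproduced with a small simulation of a direct-mapped cache. This is a sketch under assumptions: the cache has 64 lines with 8 words per block (as in the example cache), locations are word addresses, and all names are illustrative.

```c
#include <stdbool.h>

#define NUM_LINES   64u  /* cache lines, as in the example cache */
#define BLOCK_WORDS 8u   /* words per block */

typedef struct {
    bool     valid[NUM_LINES];
    unsigned tag[NUM_LINES];
} dm_cache;

/* Access one word address; returns true on a hit, false on a miss.
 * On a miss the line is replaced by the newly accessed block. */
bool dm_cache_access(dm_cache *c, unsigned word_addr)
{
    unsigned block = word_addr / BLOCK_WORDS;
    unsigned line  = block % NUM_LINES;   /* modulo mapping */
    unsigned tag   = block / NUM_LINES;

    if (c->valid[line] && c->tag[line] == tag)
        return true;
    c->valid[line] = true;
    c->tag[line]   = tag;
    return false;
}

/* Count the misses of the access pattern a[0], b[0], a[1], b[1], ... */
unsigned count_misses(unsigned a_base, unsigned b_base, unsigned n)
{
    dm_cache c;
    unsigned misses = 0;

    for (unsigned i = 0; i < NUM_LINES; i++)
        c.valid[i] = false;               /* start with an empty cache */

    for (unsigned i = 0; i < n; i++) {
        if (!dm_cache_access(&c, a_base + i)) misses++;
        if (!dm_cache_access(&c, b_base + i)) misses++;
    }
    return misses;
}
```

With a[] at location 0 and b[] at location 1024, both arrays map to the same cache lines, so `count_misses(0, 1024, 16)` reports a miss on all 32 accesses. Moving b[] by one block to location 1032 leaves only 4 misses in this model.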
Set-Associative Caches

[Figure: a cache with eight blocks in different organizations, each element consisting of a tag and data]
One-way set associative (direct-mapped): blocks (sets) 0 to 7, 1 element per set
Two-way set associative: sets 0 to 3, 2 elements per set
Eight-way set associative (fully associative): 8 elements per set

Fully associative cache

There is complete freedom where to place a block in the cache
But all blocks have to be searched for the correct tag pattern
In order to achieve acceptable performance, the tags must be searched in parallel

Example caches

StrongARM:
  16 Kbyte, 32-way, 32-byte block instruction cache.
  16 Kbyte, 32-way, 32-byte block data cache (write-back).
Nios II:
  512 Bytes to 64 KBytes direct-mapped I- and D-cache with a cache block size of 4 (D), 16 (D) or 32 (I&D) Bytes

Summary Memory Systems

Memory is a bottleneck in the system
Different memories exist
  Cost increases with memory performance
A cache memory can significantly decrease execution time at low cost
  Execution time is very hard to predict
  Problem for design of real-time systems
  Locality is important to utilize caches efficiently
There can be several levels of different caches
  Embedded systems usually have only one cache level
Input/Output

Input and Output Devices

Input/Output devices are used to communicate with the environment
An example is a UART (Universal Asynchronous Receiver/Transmitter)
These devices (like other peripheral devices) are controlled by reading and writing to registers
[Figure: I/O device with status register, mode register and data register, connected to the processor through register select, data bus and control signals, and to the environment through input and output]
Serial communication

Characters are transmitted separately
[Figure: signal over time: no char, start bit, bit 0, bit 1, ..., bit n-1, stop. 2000 Morgan Kaufman (Wayne Wolf)]

Universal Asynchronous Receiver/Transmitter (UART)

Component for serial to parallel conversion
Has a serial receiver/transmitter
Many parameters can be configured
  Baud rate
  Number of bits per character
  Parity bits
  Length of stop bit
Memory-Mapped I/O

Peripheral components can be connected to the processor by memory-mapped I/O
The components are reached through a dedicated range of the memory address space
Memory-mapped I/O requires extra hardware for address decoding
[Figure: the CPU address bus feeds a decoder that drives the Chip Enable input of the peripheral; the remaining address bits drive Register Select; the data bus and Read/Write connect CPU and peripheral; the peripheral provides the interface to the environment]
The chip-enable output of the decoder has to be active when the decoder input is a correct address
Other address bits are used for register select
The decoder can be implemented with a small block of programmable logic or custom hardware (VHDL)
Example Memory-Mapped I/O

A device with 8 8-bit registers shall be connected to the address 0x1000
[Figure: example address 0x00001002 on the address bus (ADR31-ADR0). ADR31-ADR3 feed the decoder, whose output CE is 1; it is active when ADR12 = 1 and all others are 0! ADR2, ADR1, ADR0 = 0, 1, 0 drive RS2, RS1, RS0 and select one of the registers R0, R1, ..., R7. The device uses the lines D7-D0 of the data bus (D31-D0).]
The registers can now be accessed in the address space 0x1000 (R0) until 0x1007 (R7)

    movia r1, 0x1002
    movi r3, 0x08
    stb r3, (r1)

This sets bit 3 and clears all other bits of device register R2.

Accessing Memory Locations in C

Symbolic names can be defined for memory locations

    #define MEM_LOCATION 0x18

Functions can be defined to access memory
peek can be used to read a memory location (byte)

    char peek(volatile char *location) {return *location;}

poke can be used to write to a memory location (byte)

    void poke(volatile char *location, char newval) {*location = newval;}

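The address decoding of the example device (8 registers at 0x1000 to 0x1007) can be modelled in C to see which addresses activate the chip enable. This is a software sketch of the decoder logic only, with illustrative names; real hardware would implement it in logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEVICE_BASE 0x00001000u  /* base address of the example device */

/* Chip enable: active when ADR31..ADR3 match the base address,
 * that is, ADR12 = 1 and all other upper address bits are 0. */
bool chip_enable(uint32_t addr)
{
    return (addr >> 3) == (DEVICE_BASE >> 3);
}

/* Register select: ADR2..ADR0 choose one of the registers R0..R7. */
unsigned register_select(uint32_t addr)
{
    return addr & 0x7u;
}
```

For address 0x00001002 the chip enable is active and register R2 is selected. Address 0x1008 is already outside the device.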
Don't do this!

Memory locations shouldn't be accessed directly!

Software shall be flexible
  Hardware could change
Programmers may make mistakes that the compiler would not make (e.g. memory alignment)
HAL (Hardware Abstraction Layer) offers optimized device drivers to access peripheral devices and memory

Busy Wait I/O

Busy Wait I/O is the most basic way to communicate with an I/O device
The processor waits until the I/O device has completed its current task
Disadvantage: the processor cannot be used for other tasks during the waiting period!
This method is also often called polling!

Example: Sending string via serial link

Busy Wait I/O Pseudo Code:

    Characters = String;
    While not all characters sent
        Send next character;
        While Sender = Busy
            Wait;
    Done!
C-Programming Testing of Bits

In order to test specific bits, the other bits need to be masked
Example: Busy Flag (BF): Busy = 1; Non-Busy = 0
[Figure: Status Sender register at 0x1000 with the busy flag BF in bit 5 (bits numbered 7 to 0); Sender Buffer register at 0x1001]

    #define Status  ((char *) 0x1000)
    #define SendBuf ((char *) 0x1001)
    char *myString = "Hello World";
    char *current_char;

    current_char = myString;
    while (*current_char != '\0') {
        poke(SendBuf, *current_char++);
        while ((peek(Status) & 0x20) != 0)
            ;  /* Mask needed, since other bits */
               /* in status register may not be zero */
    }

Here you should use HAL functions!

Simultaneous busy/wait input and output

Example: Copying Characters from Input to Output

Busy Wait I/O Pseudo Code:

    Loop
        While inBuffer busy
            Wait;
        Read Character
        Copy Character to Output Buffer
        Send Character
        While outBuffer busy
            Wait;
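The busy-wait sender with bit masking can be written as a function that takes the register addresses as parameters, which makes it testable without hardware. This is a sketch under assumptions: the busy flag is bit 5 of the status register as in the example, the function and macro names are illustrative, and on a real system the HAL functions would be used instead.

```c
#include <stddef.h>
#include <stdint.h>

#define BUSY_FLAG 0x20u  /* bit 5 (BF) of the status register */

/* Send a string character by character with busy waiting.
 * Returns the number of characters written to the send buffer. */
size_t send_string(volatile uint8_t *status, volatile uint8_t *send_buf,
                   const char *s)
{
    size_t sent = 0;

    while (*s != '\0') {
        *send_buf = (uint8_t)*s++;
        sent++;
        /* Mask the busy flag: other status bits may be non-zero. */
        while ((*status & BUSY_FLAG) != 0) {
            /* busy wait (polling) */
        }
    }
    return sent;
}
```

On the example device the call would pass `(volatile uint8_t *)0x1000` and `(volatile uint8_t *)0x1001`. In a test on a PC, two ordinary variables can stand in for the registers; a status value of 0xDF (all bits set except BF) shows why the mask is needed, since the register is non-zero although the sender is not busy.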
Interrupt I/O

Busy/wait is very inefficient.
  CPU can't do other work while testing device.
  Hard to do simultaneous I/O.
Interrupts allow a device to change the flow of control in the CPU.
  Causes subroutine call to handle device.

Interrupt Scheme

[Figure: the device sends Interrupt Request to the CPU, the CPU answers with Interrupt Acknowledge; Data/Address lines connect CPU and device]
Interrupt physical interface

CPU and device are connected by CPU bus
CPU and device handshake:
  device asserts interrupt request;
  CPU asserts interrupt acknowledge when it can handle the interrupt.

Interrupt behavior

Based on subroutine call mechanism
Interrupt forces next instruction to be a subroutine call to a predetermined location
  Return address is saved to resume executing foreground program
Programming Interrupt

[Figure: the foreground program does something until an interrupt event occurs; control passes through the interrupt vector (branch to interrupt handler) to the interrupt handler]
Interrupt Handler
  Save Registers
  Handle Interrupt
  Restore Registers
  Restore PC
  Clear interrupt disable flag

Receive-Send with Polling

Assume a program that, as part of its duties, receives characters and sends them on to another device
Solution with polling:

    loop
        Wait for new character;
        Do something;
        Send character;
    end loop;

System cannot do anything while it waits for a new character or until the sender is ready
System resources are utilized very inefficiently!
Better Receive-Send Implementation with Interrupt

Parallelization of duties
  Wait for new character (interrupt)
    If a character is received, it is stored in a buffer
  Do something (foreground program)
    Work with the stored buffer elements
  Send character if transmitter ready (interrupt)
    Check if transmitter is ready and send the first character of the buffer

System can do other things while waiting for receiver or sender
Buffer is needed to store elements
Size of buffer must be chosen carefully
  too small => buffer overflow
  too large => too expensive design
Typical Embedded Design Problems

Embedded systems are inherently parallel (concurrent), since they interact with a heterogeneous environment
Parallelization allows for faster processing, since work can be done in parallel
  Waiting times can be avoided
The need for buffers is a logical consequence of parallelization
The system designer needs to find the right amount of parallelization and the right buffer size!

Send-Receive with Circular Buffer (Wolf)

Independent receive and send, realized by two interrupt routines
  Receive-interrupt routine: puts a character into the queue
  Send-interrupt routine: sends a character when the sender is ready
A circular buffer can be realised in memory with one pointer for the head and one for the tail
If a pointer is at the end of the buffer, the next position is the start of the buffer
[Figure: circular buffer with head and tail pointers; in the empty buffer head and tail point to the same position]

Send-Receive sequence diagram (Wolf)

[Figure: sequence diagram with the lifelines :foreground, :input, :output and :queue, showing the queue contents as characters are received and sent]
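A circular buffer with head and tail indices can be sketched in C as follows. The names, the capacity of 8 and the convention of leaving one slot unused to tell "full" from "empty" are assumptions of this sketch. Real interrupt routines would also have to protect the shared indices (for example with volatile and by disabling interrupts), which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>

#define BUF_SIZE 8u  /* assumed size; one slot always stays unused */

typedef struct {
    char   data[BUF_SIZE];
    size_t head;  /* index of the oldest stored character */
    size_t tail;  /* index of the next free slot */
} ring_buffer;

void rb_init(ring_buffer *rb)
{
    rb->head = 0;
    rb->tail = 0;
}

bool rb_is_empty(const ring_buffer *rb)
{
    return rb->head == rb->tail;
}

bool rb_is_full(const ring_buffer *rb)
{
    return (rb->tail + 1) % BUF_SIZE == rb->head;
}

/* Store a character (receive side); returns false if the buffer is full. */
bool rb_put(ring_buffer *rb, char c)
{
    if (rb_is_full(rb))
        return false;
    rb->data[rb->tail] = c;
    rb->tail = (rb->tail + 1) % BUF_SIZE;  /* wrap around at the end */
    return true;
}

/* Fetch the oldest character (send side); returns false if empty. */
bool rb_get(ring_buffer *rb, char *out)
{
    if (rb_is_empty(rb))
        return false;
    *out = rb->data[rb->head];
    rb->head = (rb->head + 1) % BUF_SIZE;  /* wrap around at the end */
    return true;
}
```

In the receive-send scheme the receive routine would call `rb_put` and the send routine `rb_get`. A failing `rb_put` corresponds to the buffer overflow of a buffer that is too small.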
Debugging interrupt code

What if you forget to save and restore registers?
  Foreground program can exhibit mysterious bugs
  Bugs will be hard to repeat: they depend on interrupt timing
It is difficult to debug an interrupt routine!

Prioritized Interrupts

Some CPUs (such as Nios II) support several interrupt levels in hardware
Otherwise extra hardware (priority decoder) can be used to create several levels of interrupt
Interrupt prioritization

Masking: an interrupt with priority lower than the current priority is not recognized until the pending interrupt is complete.
Non-maskable interrupt (NMI): highest priority, never masked.
  Often used for power-down.

Example: Prioritized I/O

[Figure: sequence diagram with the lifelines :interrupts, :foreground, :A, :B and :C; the interrupts arrive in the order B, C, A, and then A and B together]
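The masking rule can be illustrated with a small C function that picks the next interrupt to serve. This is only a sketch under assumptions not taken from the slides: each bit n of `pending` stands for a pending interrupt of priority n, a larger number means a higher priority, and level -1 stands for the foreground program. The function name is illustrative.

```c
#include <stdint.h>

/* Return the highest-priority pending interrupt whose priority is
 * above current_level, or -1 if all pending interrupts are masked. */
int next_interrupt(uint32_t pending, int current_level)
{
    for (int level = 31; level > current_level; level--) {
        if (pending & (UINT32_C(1) << level))
            return level;
    }
    return -1;
}
```

With interrupts 0 and 2 pending (`pending = 0x5`), the foreground program is interrupted by level 2. While level 2 is being handled, interrupt 0 stays masked until the handler completes.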
Sources of interrupt overhead

Handler execution time
Interrupt mechanism overhead
Register save/restore
Pipeline-related penalties
Cache-related penalties

Summary

Peripherals can be made accessible to software by memory-mapped I/O
Two basic approaches for communication with an I/O device
  polling: processor checks if data has arrived
  interrupt: processor is notified if data has arrived
Interrupt is not always better than polling!

