
EE (CE) 6304 Computer Architecture Lecture #7 (9/18/13)

Myoungsoo Jung Assistant Professor Department of Electrical Engineering University of Texas at Dallas

Virtual Memory Review

Views of Memory
Real machines have limited amounts of memory

640KB? A few GB? (This laptop = 2GB)


Programmer doesn't want to be bothered

Do you think, "oh, this computer only has 128MB so I'll write my code this way"? What happens if you run on a different machine?

Programmer's View


Example 32-bit memory

When programming, you don't care about how much real memory there is. Even if you use a lot, memory can always be paged to disk.

0-2GB

Kernel

Text Data Heap

Stack — A.K.A. Virtual Addresses (up to 4GB)

Programmer's View


Really, the Program's View: each program/process gets its own 4GB space

Or much, much more with a 64-bit processor


[Diagram: several processes side by side, each with its own Kernel / Text / Data / Heap / Stack address-space layout]

CPU's View


At some point, the CPU is going to have to load-from/store-to memory — all it knows is the real, A.K.A. physical, memory

which unfortunately is often < 4GB, is almost never 4GB per process, and is never 16 exabytes per process

Pages
Memory is divided into pages, which are nothing more than fixed-size, aligned regions of memory

Typical size: 4KB/page (but not always)


Address ranges map to pages: 0-4095 → Page 0; 4096-8191 → Page 1; 8192-12287 → Page 2; 12288-16383 → Page 3
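As a minimal sketch (in Python, not part of the original slides), the page/offset split above is just integer division and remainder, assuming 4KB pages:

```python
PAGE_SIZE = 4096  # 4KB pages, as in the slide's example

def page_of(addr: int) -> int:
    """Page number containing this byte address."""
    return addr // PAGE_SIZE

def offset_of(addr: int) -> int:
    """Offset of the address within its page."""
    return addr % PAGE_SIZE

# The ranges from the slide: 0-4095 -> page 0, 4096-8191 -> page 1, ...
assert page_of(0) == 0 and page_of(4095) == 0
assert page_of(4096) == 1 and page_of(8191) == 1
assert page_of(12288) == 3 and offset_of(12288) == 0
```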

Page Table
Map from virtual addresses to physical locations
[Diagram: the Page Table implements the V→P mapping — virtual addresses 0K/4K/8K/12K map to scattered physical addresses in 0K-28K. Each entry includes permissions (e.g., read-only).]

Physical Location may include hard-disk

Need for Translation


Virtual Address 0xFC51908B splits into Virtual Page Number 0xFC519 and Page Offset 0x08B. The Page Table (in Main Memory) maps VPN 0xFC519 → physical page 0x00152, giving Physical Address 0x0015208B.
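The worked example above can be checked with a short sketch; the single-entry page table here is hypothetical, containing only the one mapping from the slide:

```python
PAGE_SHIFT = 12  # 4KB pages: the low 12 bits are the page offset

# Hypothetical page table with just the slide's mapping:
page_table = {0xFC519: 0x00152}

def translate(vaddr: int) -> int:
    vpn = vaddr >> PAGE_SHIFT               # virtual page number
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    ppn = page_table[vpn]                   # a real MMU would page-fault if missing
    return (ppn << PAGE_SHIFT) | offset

assert translate(0xFC51908B) == 0x0015208B  # matches the slide
```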

Page Tables
[Diagram: two processes, each with its own page table (virtual 0K-12K), mapping into one shared Physical Memory (0K-28K)]

What is in a Page Table Entry (or PTE)?
- Pointer to the actual page
- Permission bits: valid, read-only, read-write, write-only

Example: Intel x86 architecture PTE:
- Address in the same format as the previous slide (10-bit, 10-bit, 12-bit offset)
- Intermediate page tables are called Directories

Layout: Page Frame Number (Physical Page Number) in bits 31-12; Free for OS use in bits 11-9; L bit 7; D bit 6; A bit 5; PCD bit 4; PWT bit 3; U bit 2; W bit 1; P bit 0

- P: Present (same as valid bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent — external cache write-through
- PCD: Page cache disabled (page cannot be cached)
- A: Accessed — page has been accessed recently
- D: Dirty (PTE only) — page has been modified recently
- L: 4MB page (directory only); the bottom 22 bits of the virtual address serve as the offset
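A hedged sketch of decoding these fields, assuming the bit positions shown on the slide (PFN in bits 31-12, P in bit 0, and so on):

```python
def decode_pte(pte: int) -> dict:
    """Unpack an x86-style PTE per the slide's layout."""
    return {
        "pfn": pte >> 12,        # page frame number, bits 31-12
        "P":   (pte >> 0) & 1,   # present (valid)
        "W":   (pte >> 1) & 1,   # writeable
        "U":   (pte >> 2) & 1,   # user accessible
        "PWT": (pte >> 3) & 1,   # page write transparent
        "PCD": (pte >> 4) & 1,   # page cache disabled
        "A":   (pte >> 5) & 1,   # accessed recently
        "D":   (pte >> 6) & 1,   # dirty (modified)
        "L":   (pte >> 7) & 1,   # 4MB page (directory entries only)
    }

pte = (0x00152 << 12) | 0b0100011   # present, writeable, accessed
f = decode_pte(pte)
assert f["pfn"] == 0x00152
assert f["P"] == 1 and f["W"] == 1 and f["A"] == 1 and f["U"] == 0
```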

Three Advantages of Virtual Memory


Translation: a program can be given a consistent view of memory, even though physical memory is scrambled. Makes multithreading reasonable (now used a lot!). Only the most important part of a program (the Working Set) must be in physical memory. Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.
Protection: different threads (or processes) are protected from each other. Different pages can be given special behavior (read-only, invisible to user programs, etc.). Kernel data is protected from user programs. Very important for protection from malicious programs.
Sharing: can map the same physical page to multiple users (shared memory).

Large Address Space Support


Virtual Address: P1 index (10 bits) | P2 index (10 bits) | Offset (12 bits)
Physical Address: Physical Page # | Offset

[Diagram: PageTablePtr points to the first-level table; the P1 index selects a 4-byte entry pointing to a second-level table; the P2 index selects the 4-byte PTE; pages are 4KB.]
Single-Level Page Table: a large 32-bit address space with 4KB pages needs 1M entries — and each process needs its own page table!
Multi-Level Page Table: can allow sparseness of the page table; portions of the table can be swapped to disk
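A sketch of the two-level walk under the 10/10/12 split above; the nested dict stands in for the hardware directory/table structure, and the specific indices are made up for illustration:

```python
PAGE_SHIFT = 12

def walk(directory: dict, vaddr: int) -> int:
    """Two-level lookup: 10-bit P1 index, 10-bit P2 index, 12-bit offset."""
    p1 = (vaddr >> 22) & 0x3FF          # index into the directory
    p2 = (vaddr >> 12) & 0x3FF          # index into the second-level table
    offset = vaddr & 0xFFF
    second_level = directory[p1]        # an absent slot = a whole unmapped 4MB region
    ppn = second_level[p2]
    return (ppn << PAGE_SHIFT) | offset

# Sparseness: only one of the 1024 directory slots is populated.
directory = {0x3F1: {0x119: 0x00152}}
vaddr = (0x3F1 << 22) | (0x119 << 12) | 0x08B   # same VPN 0xFC519 as before
assert walk(directory, vaddr) == 0x0015208B
```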

TLB Review

Translation Look-Aside Buffers


Translation Look-Aside Buffers (TLB)
A cache on translations: Fully Associative, Set Associative, or Direct Mapped

[Diagram: CPU issues VA → TLB; on a hit, the PA goes to the Cache; on a TLB miss, Translation walks the page table and refills the TLB; on a Cache miss, data comes from Main Memory.]

Translation with a TLB

TLBs are:

Small — typically not more than 128-256 entries; Fully Associative

Caching Applied to Address Translation

[Diagram: CPU issues Virtual Address → TLB. Cached? Yes: Physical Address goes straight to Physical Memory. No: Translate (MMU) first, then access Physical Memory.]
Data Read or Write (untranslated)
The question is one of page locality: does it exist?
- Instruction accesses spend a lot of time on the same page (since accesses are sequential)
- Stack accesses have definite locality of reference
- Data accesses have less page locality, but still some
Can we have a TLB hierarchy? Sure: multiple levels at different sizes/speeds
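The TLB is just a cache on translations. A toy sketch in Python (FIFO replacement for brevity, though the slides note real TLBs are typically fully associative and small):

```python
class TLB:
    """Tiny translation cache. Capacity and FIFO eviction are illustrative."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}        # vpn -> ppn
        self.order = []          # FIFO queue of cached vpns
        self.hits = self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1                     # walk the page table to refill
            if len(self.order) == self.capacity:
                self.entries.pop(self.order.pop(0))
            self.entries[vpn] = page_table[vpn]
            self.order.append(vpn)
        return self.entries[vpn]

page_table = {v: v + 100 for v in range(16)}
tlb = TLB()
for vpn in [1, 1, 1, 2, 1, 3]:   # sequential code: mostly the same page
    tlb.lookup(vpn, page_table)
assert (tlb.hits, tlb.misses) == (3, 3)
```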

What Actually Happens on a TLB Miss?


Hardware-traversed page tables: on a TLB miss, hardware in the MMU looks at the current page table to fill the TLB (it may walk multiple levels). If the PTE is valid, the hardware fills the TLB and the processor never knows. If the PTE is marked invalid, it causes a Page Fault, after which the kernel decides what to do.
Software-traversed page tables (like MIPS): on a TLB miss, the processor receives a TLB fault. The kernel traverses the page table to find the PTE. If the PTE is valid, it fills the TLB and returns from the fault. If the PTE is marked invalid, it internally calls the Page Fault handler.
Most chip sets provide hardware traversal. Modern operating systems tend to have more TLB faults since they use translation for many things. Examples: shared segments, user-level portions of an operating system.

Implementing LRU
Have an LRU counter for each line in a set. When a line is accessed:
- Get the old value X of its counter
- Set its counter to the max value
- For every other line in the set: if its counter is larger than X, decrement it
When replacement is needed: select the line whose counter is 0
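The counter scheme above, sketched in Python for one cache set (the way indices and initial counter values are illustrative):

```python
def touch(counters, way):
    """LRU counter update for one set, as described above."""
    max_val = len(counters) - 1
    x = counters[way]                  # old value X
    for w in range(len(counters)):
        if w != way and counters[w] > x:
            counters[w] -= 1           # everyone above X slides down one
    counters[way] = max_val            # accessed line becomes most-recent

def victim(counters):
    return counters.index(0)           # counter 0 = least recently used

c = [3, 2, 1, 0]       # way 0 most recent, way 3 least recent
touch(c, 3)            # access way 3
assert c == [2, 1, 0, 3]
assert victim(c) == 2  # way 2 is now the LRU line
```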

Clock Algorithm: Not Recently Used


Single Clock Hand: Advances only on page fault!

Set of all pages in Memory

Check for pages not used recently; mark pages as not recently used (the Page Table tracks "used" and "dirty" bits)

Replace an old page, not the oldest page. Details: a hardware "use" bit per physical page. Hardware sets the use bit on each reference; if the use bit isn't set, the page has not been referenced in a long time. On a page fault: advance the clock hand (not real time) and check the use bit — 1 means used recently, so clear it and leave the page alone; 0 means the page is a selected candidate for replacement.

[Diagram: a clock hand sweeping over pages with use bits 1 1 0 1 0 0 0 1 1 0 ... — the Clock Algorithm approximates LRU (which itself approximates MIN)]
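A sketch of one clock-hand sweep, assuming a simple list of per-page use bits:

```python
def clock_replace(use_bits, hand):
    """Advance the clock hand until a page with use bit 0 is found.
    Pages with use bit 1 get a second chance: the bit is cleared."""
    n = len(use_bits)
    while use_bits[hand] == 1:
        use_bits[hand] = 0            # used recently: clear and move on
        hand = (hand + 1) % n
    victim = hand                     # use bit 0: candidate for replacement
    return victim, (hand + 1) % n     # hand advances past the victim

use = [1, 1, 0, 1, 0]
victim, hand = clock_replace(use, 0)
assert victim == 2                    # first page found with use bit 0
assert use == [0, 0, 0, 1, 0]         # pages 0 and 1 had their bits cleared
```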

Example: R3000 pipeline


MIPS R3000 Pipeline: Inst Fetch (TLB, I-Cache) | Dcd/Reg (RF) | ALU / E.A. (Operation, E.A.) | Memory (TLB, D-Cache) | Write Reg (WB)

TLB: 64-entry, on-chip, fully associative; software TLB fault handler

Virtual Address Space: ASID (6 bits) | V. Page Number (20 bits) | Offset (12 bits)

0xx: User segment (caching based on PT/TLB entry); 100: Kernel physical space, cached; 101: Kernel physical space, uncached; 11x: Kernel virtual space. Allows context switching among 64 user processes without a TLB flush.

As described, TLB lookup is in series with cache lookup:

[Diagram: Virtual Address = V page no. | offset → TLB Lookup (checks V bit and Access Rights) → Physical Address = P page no. | offset]

Reducing translation time further

Machines with TLBs go one step further: they overlap TLB lookup with cache access.

Works because offset available early

Here is how this might work with a 4K cache:


Overlapping TLB & Cache Access

[Diagram: 4K cache, 4-byte lines, 1K sets. A 32-entry associative TLB looks up the 20-bit page # while the cache is indexed with the 10-bit index and 2-bit disp from the offset; the FN from the TLB is compared against the cache tag to produce Hit/Miss and Data.]

What if the cache size is increased to 8KB? The overlap is no longer complete — need to do something else. Another option: Virtual Caches — tags in the cache are virtual addresses, and translation only happens on cache misses.
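The size limit can be checked with a little arithmetic: the overlap is only complete if the cache index bits all fall inside the page offset. A sketch (direct-mapped cache and 4-byte lines assumed, as on the slide):

```python
PAGE_OFFSET_BITS = 12   # 4KB pages

def index_bits(cache_bytes, line_bytes=4):
    """Address bits consumed by cache indexing (set index + byte-in-line)."""
    sets = cache_bytes // line_bytes
    return (sets.bit_length() - 1) + (line_bytes.bit_length() - 1)

# 4KB cache, 4-byte lines: 10-bit index + 2-bit disp = 12 bits,
# all inside the page offset, so indexing can start before translation.
assert index_bits(4096) == 12
assert index_bits(4096) <= PAGE_OFFSET_BITS

# 8KB cache: needs 13 bits -- one bit must come from the (translated)
# page number, so the overlap is no longer complete.
assert index_bits(8192) == 13
assert index_bits(8192) > PAGE_OFFSET_BITS
```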

Summary: TLB, Virtual Memory


Page tables map virtual addresses to physical addresses. TLBs are important for fast translation.

TLB misses are significant in processor performance: most systems can't access all of the 2nd-level cache without TLB misses!
Caches, TLBs, and Virtual Memory are all understood by examining how they deal with 4 questions:
1) Where can a block be placed?
2) How is a block found?
3) What block is replaced on a miss?
4) How are writes handled?
Today, VM allows many processes to share a single memory without having to swap all processes to disk.

Exceptions: Traps and Interrupts

(Hardware)

Exceptions: Traps and Interrupts

Exception vs. Interrupt


Exception: an unusual event that happens to an instruction during its execution. Examples: divide by zero, undefined opcode.
Interrupt: a hardware signal to switch the processor to a new instruction stream. Example: a sound card interrupts when it needs more audio output samples (an audio click happens if it is left waiting).

Problems with Pipelining


Problem: the exception or interrupt must appear between 2 instructions (Ii and Ii+1):
- The effect of all instructions up to and including Ii is complete
- No effect of any instruction after Ii can take place
The interrupt (exception) handler either aborts the program or restarts at instruction Ii+1

Example: Device Interrupt
(Say, arrival of a network message)

Background program, hit by an external interrupt at the "Hiccup":
  add   r1,r2,r3
  subi  r4,r1,#4
  slli  r4,r4,#2
  Hiccup(!)
  lw    r2,0(r4)
  lw    r3,4(r4)
  add   r2,r2,r3
  sw    8(r4),r2

Interrupt handler:
  Raise priority
  Reenable All Ints
  Save registers
  lw    r1,20(r0)
  lw    r2,0(r1)
  addi  r3,r0,#5
  sw    0(r1),r3
  Restore registers
  Clear current Int
  Disable All Ints
  Restore priority
  RTE

Alternative: Polling
(again, for arrival of a network message)

  Disable Network Intr
  subi  r4,r1,#4
  slli  r4,r4,#2
  lw    r2,0(r4)
  lw    r3,4(r4)
  add   r2,r2,r3
  sw    8(r4),r2
  lw    r1,12(r0)     ; polling point (check device register)
  beq   r1,no_mess
  lw    r1,20(r0)     ; handler
  lw    r2,0(r1)
  addi  r3,r0,#5
  sw    0(r1),r3
  Clear Network Intr
no_mess:

Polling is faster/slower than Interrupts.


Polling is faster than interrupts because:
- The compiler knows which registers are in use at the polling point, hence it does not need to save and restore registers (or not as many)
- Other interrupt overhead is avoided (pipeline flush, trap priorities, etc.)
Polling is slower than interrupts because:
- The overhead of polling instructions is incurred regardless of whether or not the handler is run; this could add to inner-loop delay
- The device may have to wait a long time for service
When to use one or the other? It's a multi-axis tradeoff:
- Frequent/regular events are good for polling, as long as the device can be controlled at user level
- Interrupts are good for infrequent/irregular events
- Interrupts are good for ensuring regular/predictable service of events

Trap/Interrupt classifications
Traps: relevant to the current process. Faults, arithmetic traps, and synchronous traps; they invoke software on behalf of the currently executing process.
Interrupts: caused by asynchronous, outside events. I/O devices requiring service (disk, network); clock interrupts (real-time scheduling).
Machine Checks: caused by serious hardware failure. Not always restartable; they indicate that bad things have happened: a non-recoverable ECC error, a machine-room fire, a power outage.

A related classification: Synchronous vs. Asynchronous


Synchronous: related to the instruction stream, i.e., occurring during the execution of an instruction. Must stop the instruction that is currently executing: a page fault on a load or store instruction, an arithmetic exception, software trap instructions.
Asynchronous: unrelated to the instruction stream, i.e., caused by an outside event. Does not have to disrupt instructions that are already executing: interrupts are asynchronous; machine checks are asynchronous.
Semi-synchronous (or high-availability interrupts): caused by an external event but may have to disrupt current instructions in order to guarantee service.

Interrupt Priorities Must be Handled


[Diagram: a network interrupt handler (Raise priority / Reenable All Ints / Save registers / lw r1,20(r0) / lw r2,0(r1) / addi r3,r0,#5 / sw 0(r1),r3 / Restore registers / Clear current Int / Disable All Ints / Restore priority / RTE) runs in the middle of the background code (add / subi / slli / Hiccup(!) / lw / lw / add / sw) — and the handler itself could be interrupted by disk.]

Note that priority must be raised to avoid recursive interrupts!

Interrupt Controller
[Diagram: Timer, Network, and Software Interrupt lines feed an Interrupt Mask and Priority Encoder in the controller, which presents IntID and an Interrupt signal to the CPU; the CPU has an internal Int Disable flag; the NMI line bypasses the mask.]

Interrupts are invoked with interrupt lines from devices. The interrupt controller chooses which interrupt request to honor: the mask enables/disables interrupts; a priority encoder picks the highest enabled interrupt; a software interrupt is set/cleared by software; the interrupt identity is specified with ID lines. The CPU can disable all interrupts with an internal flag. A non-maskable interrupt line (NMI) can't be disabled.
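A sketch of the controller's selection logic; the line names and numbering (higher index = higher priority) are invented for illustration:

```python
# Hypothetical interrupt lines, lowest to highest priority.
LINES = ["software", "network", "sound", "disk", "clock"]

def pick_interrupt(pending, mask, int_disable=False, nmi=False):
    """Pick the highest-priority pending, enabled interrupt (or NMI)."""
    if nmi:
        return "nmi"                 # NMI bypasses mask and disable flag
    if int_disable:
        return None                  # CPU-internal flag blocks everything else
    enabled = pending & mask         # per-line mask bits
    if enabled == 0:
        return None
    return LINES[enabled.bit_length() - 1]   # priority encoder

pending = 0b01010   # network and disk both pending
assert pick_interrupt(pending, mask=0b11111) == "disk"
assert pick_interrupt(pending, mask=0b00010) == "network"   # disk masked off
assert pick_interrupt(pending, mask=0b11111, int_disable=True) is None
assert pick_interrupt(pending, mask=0b11111, int_disable=True, nmi=True) == "nmi"
```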

Interrupt controller hardware and mask levels


The operating system constructs a hierarchy of masks that reflects some form of interrupt priority. For instance:
Priority — Examples
0 — Software interrupts
2 — Network interrupts
4 — Sound card
5 — Disk interrupt
6 — Real-time clock
Non-Maskable Ints (power)
This reflects an order of urgency to interrupts. For instance, this ordering says that disk events can interrupt the interrupt handlers for network interrupts.

Can we have fast interrupts?


[Diagram: fine-grain interrupt — the background code (add r1,r2,r3 / subi r4,r1,#4 / slli r4,r4,#2 / Hiccup(!) / lw / lw / add / sw) is interrupted by the handler (Raise priority ... RTE), which could itself be interrupted by disk.]

Costs of taking an interrupt:
- Pipeline drain: can be very expensive
- Priority manipulations
- Register save/restore: 128 registers + cache misses + etc.

Precise Interrupts/Exceptions

An interrupt or exception is considered precise if there is a single instruction (the interrupt point) for which:
- All instructions before it have committed their state
- No following instructions (including the interrupting instruction) have modified any state

This means that you can restart execution at the interrupt point and get the right answer. This was implicit in our previous example of a device interrupt: the interrupt point is at the first lw instruction.
[Diagram: the external interrupt arrives after add/subi/slli; the interrupt handler runs; execution restarts precisely at lw r2,0(r4), followed by lw r3,4(r4) / add r2,r2,r3 / sw 8(r4),r2.]

Precise Exceptions in Static Pipelines

Key observation: architected state changes only in the memory and register-write stages.

A precise interrupt point may require multiple PCs (with delayed branches):

  addi  r4,r3,#4
  sub   r1,r2,r3
  PC:   bne r1,there
  PC+4: and r2,r3,r5
  <other insts>

If the interrupt arrives before the branch, the interrupt point is described as <PC,PC+4>. If it arrives in the delay slot, the interrupt point is described as <PC+4,there> (branch was taken) or <PC+4,PC+8> (branch was not taken).

On SPARC, interrupt hardware produces pc and npc (next pc). On MIPS, only pc — software must fix up the interrupt point.

Why are precise interrupts desirable?


Many types of interrupts/exceptions need to be restartable, and preciseness makes it easier to figure out what actually happened:
- E.g., TLB faults: need to fix the translation, then restart the load/store
- IEEE gradual underflow, illegal operation, etc. For example, suppose you are computing f(x) = sin(x)/x. Then, for x = 0, the hardware produces f(0) = NaN and raises an illegal-operation exception. We want to take the exception, replace NaN with 1, then restart.

Restartability doesn't require preciseness. However, preciseness makes it a lot easier to restart.
It also simplifies the task of the operating system a lot: less state needs to be saved away when unloading a process, and restart is quick (making for fast interrupts).

Precise Exceptions in a simple 5-stage pipeline:


Exceptions may occur at different stages in the pipeline (i.e., out of order):
- Arithmetic exceptions occur in the execution stage
- TLB faults can occur in the instruction fetch or memory stage
What about interrupts? The doctor's mandate of "do no harm" applies here: try to interrupt the pipeline as little as possible.
All of this is solved by tagging instructions in the pipeline as "causes exception or not" and waiting until the end of the memory stage to flag the exception.
Interrupts become marked NOPs (like bubbles) that are placed into the pipeline instead of an instruction. Assume the interrupt condition persists in case the NOP is flushed.
Clever instruction fetch might start fetching instructions from the interrupt vector, but this is complicated by the need for a supervisor-mode switch, saving one or more PCs, etc.
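The "tag and wait until the memory stage" idea can be sketched as follows; the stage names and instruction list are illustrative:

```python
def commit_order(instrs):
    """Sketch of 'tag and wait': each instruction may be tagged with the
    stage where it would raise an exception, but nothing is flagged until
    commit time, so the oldest excepting instruction in PROGRAM order
    wins -- even if a younger instruction faulted in an earlier stage."""
    committed, flagged = [], None
    for i, (name, exc_stage) in enumerate(instrs):
        if exc_stage is not None:      # tagged as 'causes exception'
            flagged = (i, name, exc_stage)
            break                      # flush this and all younger instructions
        committed.append(name)
    return committed, flagged

# A younger instruction faults in an earlier stage (IF) than an older
# one (MEM) -- program order, not stage order, decides the flagged one.
program = [("add", None), ("lw", "MEM"), ("and", "IF")]
done, exc = commit_order(program)
assert done == ["add"]
assert exc == (1, "lw", "MEM")
```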

Summary: Interrupts
Interrupts and Exceptions either interrupt the current instruction or happen between instructions

Possibly large quantities of state must be saved before interrupting


Machines with precise exceptions provide one single point in the program to restart execution

All instructions before that point have completed. No instructions after or including that point have completed.
