Designing safety-critical operating systems for memory protection and fault tolerance

By David Kleidermacher
Director of Engineering
E-mail: davek@ghs.com

Mark Griglock
Safety-Critical Engineering Manager
E-mail: markg@ghs.com

Green Hills Software

Whether you are designing a telecom switch, a piece of medical equipment or one of the many complex systems aboard an aircraft, certain critical parts of the application must be able to operate under all conditions. Indeed, given the steadily increasing speed of processors and the economically driven desire to run multiple applications, at varying levels of criticality, on the same processor, the risks continue to grow.

Consider a blood gas analyzer used in an intensive care unit. The analyzer may serve two distinct purposes. First, it monitors the level of oxygen and other gases in the patient's bloodstream, in real time. If any monitored gas reaches a dangerously low or high level, the analyzer should produce an audible alarm or take some more direct, interventionary action. But the device may have a second use, offering a historical display of gas levels for "offline" analysis. In such a system, data logging, data display and user interface threads may compete with the critical monitoring and alarm threads for use of the processor and other resources.

In order for threads of varying importance to safely coexist in the same system, the OS that manages the processor and other resources must be able to properly partition the software to guarantee resource availability. The key word here is guarantee. Post-design, post-implementation testing cannot be counted on. Safety-critical systems must be safe at all times.

Terminology
The following terms are used in this article:
• Thread: A lightweight unit of program execution.
• Process: A heavyweight unit consisting primarily of a distinct address space, within which one or more threads execute.
• Kernel: The portion of an OS that provides core system services such as scheduling, thread synchronization and inter-process communication.

Memory protection
Fault tolerance begins with memory protection. For many years, microprocessors have included on-chip memory management units (MMU) that enable individual threads of software to run in hardware-protected address spaces. But many commercial RTOS never enable the MMU, even if such hardware is present in the system.

When all of an application's threads share the same memory space, any thread could, intentionally or unintentionally, corrupt the code, data or stack of another thread. A misbehaved thread could even corrupt the kernel's own code or internal data structures. It is easy to see how a single errant pointer in one thread could bring down the entire system or at least cause it to behave unexpectedly.

For safety and reliability, a process-based RTOS is preferable. To create processes with individual address spaces, the RTOS need only create some RAM-based data structures and enable the MMU to enforce the protections described therein. The basic idea is that a new set of logical addresses is "switched in" at each context switch.

The MMU maps a logical address used during an instruction fetch or a data read or write to a physical address in memory through the current mapping. It also flags attempts to access illegal logical addresses, which have not been "mapped" to any physical address. The cost of processes is the overhead inherent in memory access through a look-up table. But the payoff is huge. Careless or malicious corruption across process boundaries is rendered impossible. A bug in a user interface thread cannot corrupt the code or data of a more critical thread. It is truly a wonder that operating systems without memory protection are still used in complex embedded systems where reliability, safety or security is important.

Enabling the MMU has other benefits as well. One big advantage stems from the ability to selectively map and unmap pages into a logical address space. Physical memory pages are mapped into the logical space to hold the current process' code; others are mapped for data. Likewise, physical memory pages are mapped in to hold the stacks of threads that are part of the process. An RTOS can easily provide the ability to leave a page's worth of the logical addresses after each thread's stack unmapped. That way, if any thread overflows its assigned stack, a hardware memory protection fault will occur. The kernel will suspend the thread instead of allowing it to corrupt other important memory areas within the address space (like another thread's stack). This adds a level of protection between threads, even within the same address space.

Memory protection, including this kind of stack overflow detection, is often helpful during the development of an application. Programming errors will generate exceptions that are immediately detected and easily traceable to the source code. Without memory protection, bugs can cause subtle corruptions that are very difficult to track down. In fact, since RAM is often located at physical address zero in a flat memory model, even NULL pointer dereferences will go undetected! (Clearly, logical page zero is a good one to add to the "unmap list.")

System call
Another issue is that the kernel must protect itself against improper system calls. Many kernels return the actual pointer to a newly created kernel object, such as a semaphore, to the thread that created it, as a handle. When that pointer is passed back to the kernel in subsequent system calls, it may be dereferenced directly. But what if the thread uses that pointer to modify the kernel object directly or simply overwrites its handle with a pointer to some other memory? The results may be disastrous.

A bad system call should never be able to take down the kernel. An RTOS should, therefore, employ opaque handles for kernel objects. It should also validate the parameters to all system calls.

Fault tolerance and high availability
Even the best software has latent bugs. As applications become more complex, performing more functions for a software-hungry world, the number of bugs in fielded systems will continue to rise. System designers must, therefore, plan for failures and employ fault recovery techniques.

Figure 1: Redundancy via system heartbeats (two active nodes and one redundant node).

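As an illustration of the opaque-handle recommendation in the System call section, the following is a minimal sketch rather than any particular kernel's implementation. The names `ToyKernel`, `HandleError`, `create_semaphore` and `semaphore_take` are hypothetical. The sketch assumes that handles are small integers looked up in a kernel-private table, so a caller never holds a reference to the kernel object itself, and that every call checks the handle's existence, owner and object type before touching any state.

```python
import itertools


class HandleError(Exception):
    """Raised when a system call is given a handle the kernel cannot trust."""


class ToyKernel:
    """Hypothetical kernel model: callers see integers, never object references."""

    def __init__(self):
        self._next_handle = itertools.count(1)
        self._table = {}  # handle -> (owner_pid, kind, state); private to the kernel

    def create_semaphore(self, pid, count=1):
        # Validate parameters before allocating anything.
        if type(count) is not int or count < 0:
            raise ValueError("count must be a non-negative integer")
        handle = next(self._next_handle)
        self._table[handle] = (pid, "semaphore", {"count": count})
        return handle

    def _resolve(self, pid, handle, kind):
        # Reject anything that is not a known integer handle.
        entry = self._table.get(handle) if type(handle) is int else None
        if entry is None:
            raise HandleError("unknown handle")
        owner, actual_kind, state = entry
        if owner != pid:
            raise HandleError("handle belongs to another process")
        if actual_kind != kind:
            raise HandleError("handle refers to a different object type")
        return state

    def semaphore_take(self, pid, handle):
        state = self._resolve(pid, handle, "semaphore")
        if state["count"] == 0:
            return False  # a real kernel would block the caller here
        state["count"] -= 1
        return True


kernel = ToyKernel()
sem = kernel.create_semaphore(pid=1)
kernel.semaphore_take(1, sem)  # valid call from the owning process
```

A forged or overwritten handle in this model produces an error return for the offending caller; it cannot redirect the kernel to arbitrary memory, which is the property the section asks for.
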
The effect of fault recovery is application-dependent: a user interface can restart itself in the face of a fault; a flight-control system probably cannot.

One way to do fault recovery is to have a supervisor thread in an address space all on its own. When a thread faults (for example, due to a stack overflow), the kernel should provide some mechanism whereby notification can be sent to the supervisor thread. If necessary, the supervisor can then make a system call to close down the faulted thread or the entire process and restart it. The supervisor might also be hooked into a software "watchdog" setup, whereby thread deadlocks and starvation can be detected as well.

In many critical systems, high availability is assured by employing multiple redundant nodes in the system. In such a system, the kernel running on a redundant node must have the ability to detect a failure in one of the operating nodes. One method is to provide a built-in heartbeat in the interprocessor message passing mechanism of the RTOS (Figure 1). Upon system startup, a communications channel is opened between the redundant nodes and each of the operating nodes. During normal operation, the redundant nodes continually receive heartbeat messages from the operating nodes. If the heartbeat fails to arrive, the redundant node can take control automatically.

Mandatory vs. discretionary access control
An example of a discretionary access control is a Unix file: a process or thread can, at its sole discretion, modify the permissions on a file, thereby permitting access to the file by another process in the system. Discretionary access controls are useful for some objects in some systems.

An RTOS that is used in a safety- or security-critical system must be able to go one big step further and provide mandatory access control of critical system objects. For example, consider an aircraft sensor device, access to which is controlled by a flight control program. The system designer must be able to set up the system statically such that the flight control program and only the flight control program has access to this device. Another application in the system cannot dynamically request and obtain access to this device. And the flight control program cannot dynamically provide access to the device to any other application in the system. The access control is enforced by the kernel, is not circumventable by application code and is thus mandatory. Mandatory access control provides guarantees. Discretionary access controls are only as effective as the applications using them, and these applications must be assumed to have bugs in them.

Guaranteed resource availability: Space domain
In safety-critical systems, a critical application cannot, as a result of malicious or careless execution of another application, run out of memory resources. In most RTOS, memory used to hold thread control blocks and other kernel objects comes from a central store.

When a thread creates a new thread, semaphore or other kernel object, the kernel carves off a chunk of memory from this central store to hold the data for this object. A bug in one thread could, therefore, result in a situation where this program creates too many kernel objects and the central store is exhausted (Figure 2a). A more critical thread could fail as a result, perhaps with disastrous effects.

In order to guarantee that this scenario cannot occur, the RTOS can provide a memory quota system wherein the system designer statically defines how much physical memory each process has (Figure 2b). For example, a user interface process might be provided a maximum of 128KB and a flight control program a maximum of 196KB. If a thread within the user interface process encounters the aforementioned failure scenario, the process may exhaust its own 128KB of memory. But the flight control program and its 196KB of memory are wholly unaffected.

In a safety-critical system, memory should be treated as a hard currency: when a thread wants to create a kernel object, its parent process must provide a portion of its memory quota to satisfy the request. This kind of space domain protection should be part of the RTOS design. Central memory stores and discretionarily assigned limits are insufficient when guarantees are required.

If an RTOS provides a memory quota system, dynamic loading of low criticality applications can be tolerated. High criticality applications already running are guaranteed to have the physical memory they will require to run. In addition, the memory used to hold any new processes should come from the memory quota of the creating process. If this memory comes from a central store, then process creation can fail if a malicious or carelessly written application attempts to create too many new processes. (Most programmers have either mistakenly executed or at least heard of a Unix "fork bomb," which can easily take down an entire system.) In most safety-critical systems, dynamic process creation will simply not be tolerated at all and the RTOS should be configurable such that this capability can be removed from the system.

Figure 2: a) Before memory quotas (central store, memory starved) and b) after (memory guaranteed).

Guaranteed resource availability: Time domain
The vast majority of RTOS employ priority-based, preemptive schedulers. Under this scheme, the highest priority ready thread in the system always gets to use the processor (execute). If multiple threads are at that same highest priority level, they generally share the processor equally, via timeslicing.

Figure 3: Traditional scheduler (Thread B is starved) versus scheduler with weights (Thread B's CPU resource is guaranteed).

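The memory quota scheme from the space domain section can be modeled in a few lines. This is only a sketch of the accounting idea, not an RTOS interface: the names `Process`, `QuotaExceeded`, `create_kernel_object` and `runaway` are hypothetical, and the 256-byte object size is an assumption chosen for the arithmetic. The quotas of 128KB and 196KB are the figures used in the text.

```python
class QuotaExceeded(Exception):
    """Raised when a process tries to spend more memory than its quota."""


class Process:
    """Hypothetical process model with a statically assigned memory quota."""

    def __init__(self, name, quota_bytes):
        self.name = name
        self.quota_bytes = quota_bytes
        self.used_bytes = 0

    def create_kernel_object(self, size_bytes):
        # Memory is charged to the creating process, never to a central store.
        if size_bytes <= 0:
            raise ValueError("size_bytes must be positive")
        if self.used_bytes + size_bytes > self.quota_bytes:
            raise QuotaExceeded(f"{self.name}: quota of {self.quota_bytes} bytes exhausted")
        self.used_bytes += size_bytes


def runaway(process, object_size):
    """Model a buggy thread: create objects until the quota runs out."""
    created = 0
    while True:
        try:
            process.create_kernel_object(object_size)
        except QuotaExceeded:
            return created
        created += 1


ui = Process("user_interface", 128 * 1024)
flight = Process("flight_control", 196 * 1024)

created = runaway(ui, 256)        # the buggy process exhausts only its own quota
flight.create_kernel_object(256)  # the critical process can still allocate
```

With these assumed sizes the user interface process stops after 512 objects, having consumed exactly its own 128KB, while the flight control process's allocation still succeeds.
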
The problem with timeslicing (or even run-to-completion) within a given priority level is that there is no provision for guaranteeing processor time for critical threads.

Consider the following scenario: the system includes two threads at the same priority level. Thread A is a non-critical, background thread. Thread B is a critical thread that needs at least 40 percent of the processor time to get its work done. Because Thread A and B are assigned the same priority level, the typical scheduler will time slice them so that both threads get 50 percent of the processor. At this point, Thread B is able to get its work done. Now suppose Thread A creates a new thread at the same priority level. Consequently, there are three highest priority threads sharing the processor. Suddenly, Thread B is only getting 33 percent of the processor and cannot get its critical work done. For that matter, if the code in Thread A has a bug or virus, it may create dozens or even hundreds of "confederate" threads, causing Thread B to get a tiny fraction of the runtime.

One solution to this problem is to enable the system designer to inform the scheduler of a thread's maximum "weight" within the priority level (Figure 3). When a thread creates another equal priority thread, the creating thread must give up part of its own weight to the new thread. In our previous example, suppose the system designer had assigned weight to Thread A and Thread B such that Thread A has 60 percent of the runtime and Thread B has 40 percent of the runtime. When Thread A creates the third thread, it must provide part of its own weight, say 30 percent. Now Thread A and the new thread each get 30 percent of the processor time but critical Thread B's 40 percent remains inviolate. Thread A can create many confederate threads without affecting the ability of Thread B to get its work done; Thread B's processor reservation is thus guaranteed. A scheduler that provides this kind of guaranteed resource availability in addition to the standard scheduling techniques is required in some critical embedded systems, particularly avionics.

The problem inherent in all thread schedulers is that they are ignorant of the process in which threads reside. Continuing our previous example, suppose that Thread A executes in a user interface process while critical Thread B executes in a flight control process. The two applications are partitioned and protected in the space domain but not in the time domain. Designers of safety-critical systems require the ability to guarantee that the run-time characteristics of the user interface cannot possibly affect the run-time characteristics of the flight control system. Thread schedulers simply cannot make this guarantee.

Consider a situation in which Thread B normally gets all the runtime it needs because it has been given a higher priority than Thread A or any of the other threads in the user interface. Due to a bug, poor design or improper testing, Thread B may lower its own priority (the ability to do so is available in practically all kernels), causing the thread in the user interface to gain control of the processor. Similarly, Thread A may raise its priority above the priority of Thread B with the same effect.

A convenient way to guarantee that the threads in processes of different criticality cannot affect each other is to provide a process-level scheduler. Designers of safety-critical software have noted this requirement for a long time. The process, or partition, scheduling concept is a major part of ARINC Specification 653, an Avionics Application Software Standard Interface.

The ARINC 653 partition scheduler runs partitions, or processes, according to a timeline established by the system designer. Each process is provided one or more windows of execution within the repeating timeline. During each window, all the threads in the other processes are not runnable; only the threads within the currently active process are runnable (and typically are scheduled according to the standard thread scheduling rules). When the flight control application's window is active, its processing resource is guaranteed; a user interface application cannot run and take away processing time from the critical application during this window.

Although not specified in ARINC 653, a prudent addition to the implementation is to apply the concept of a background partition. When there are no runnable threads within the active partition, the partition scheduler should be able to run background threads, if any, in the background partition, instead of idling. An example background thread might be a low priority diagnostic agent that runs occasionally but does not have hard real-time deadlines.

Attempts have been made to add partition scheduling on top of commercial off-the-shelf OS by selectively halting all the threads in the active partition and then running all the threads in the next partition. Thus, partition switching time is linear with the number of threads in the partitions, an unacceptably poor implementation. The RTOS must implement the partition scheduler within the kernel and ensure that partition switching takes constant time and is as fast as possible.

Figure 4: Priority inversion (high, medium and low priority threads).

Figure 5: Priority inheritance (high, medium and low priority threads).

Schedulability
Meeting hard deadlines is one of the most fundamental requirements of an RTOS and is especially important in safety-critical systems. Depending on the system and the thread, missing a deadline can be a critical fault.

Rate monotonic analysis (RMA) is frequently used by system designers to analyze and predict the timing behavior of systems. In doing so, the system designer is relying on the underlying OS to provide fast and temporally deterministic system services. Not only must the designer understand how long it takes to execute the thread's code, but any overhead associated with the thread must also be determined. Overhead typically includes context switch time, the time required to execute kernel system calls, and the overhead of interrupts and interrupt handlers firing and executing.

All RTOS incur the overhead of context switching. Lower context switching time implies lower overhead, more efficient use of available processing resources and increased likelihood of meeting deadlines. An RTOS's context switching code is usually hand-optimized for execution speed.

Interrupt latency
A typical embedded system has several types of interrupts resulting from the use of various kinds of devices. Some interrupts are higher priority and require a faster response time than others.

An interrupt that signals the kernel to read a sensor that is critical to an aircraft's flight control, for example, should be handled with the minimum possible latency. On the other hand, a typical timer tick interrupt frequency may be 60Hz or 100Hz. A tick period of ten milliseconds or more is an eternity in hardware terms, so interrupt latency for the timer tick interrupt is not as critical as for most other interrupts.

Most kernels disable all interrupts while manipulating internal data structures during system calls. Interrupts are disabled so that the timer tick interrupt cannot occur (a timer tick may cause a context switch) at a time when internal kernel data structures are being changed. The system's interrupt latency is directly related to the length of the longest critical section in the kernel.

In effect, most kernels increase the latency of all interrupts just to avoid a low priority timer interrupt. A better solution is to never disable interrupts in kernel system calls and instead to postpone the handling of an intervening timer tick until the system call completes. This strategy depends on all kernel system calls being short (or at least on calls that are not short being restartable), so that scheduling events can preempt the completion of the system call. Therefore, the time to get back to the scheduler may vary by a few instructions (insignificant for a 60Hz scheduler) but will always be short and bounded. It is much more difficult to engineer a kernel that has preemptible system calls in this manner, which is why most kernels do not do it this way.

Bounded execution times
In order to allow computation of the overhead of system calls that a thread will execute while doing its work, an RTOS should provide bounded execution times for all such calls. Two major problems involve the timing of message transfers and the timing of mutex take operations.

A thread may spend time performing a variety of activities. Of course, its primary activity is executing code. Other activities include sending and receiving messages. Message transfer times vary with the size of the data. How can the system designer account for this time? The RTOS can provide a capability for controlling whether transfer times are attributed to the sending thread or to the receiving thread or shared between them. Indeed, the kernel's scheduler should treat all activities, not just the primary activity, as prioritized units of execution so that the system designer can properly control and account for them.

Priority inversion
Priority inversion has long been the bane of system designers attempting to perform rate monotonic analysis, since RMA depends on higher priority threads running before lower priority threads. Priority inversion occurs when a high priority thread is unable to run because a mutex (or binary semaphore) it attempts to obtain is owned by a low priority thread, but the low priority thread is unable to execute and release the mutex because a medium priority thread is also runnable (Figure 4). The most common RTOS solution to the priority inversion problem is to support the priority inheritance protocol.

A mutex that supports priority inheritance works as follows: if a high priority thread attempts to take a mutex already owned by a low priority thread, the kernel automatically elevates the low priority thread to the priority of the high priority thread. Once the low priority thread releases the mutex, its priority will be returned to normal and the high priority thread will run. The dynamic priority elevation prevents a medium priority thread from running while the high priority thread is waiting; priority inversion is avoided (Figure 5). In this example, the critical section execution time (the time the low priority thread holds the mutex) is added to the overhead of the high priority thread.

A weakness of the priority inheritance protocol is that it does not prevent chained blocking. Suppose a medium priority thread attempts to take a mutex owned by a low priority thread, but while the low priority thread's priority is elevated to medium by priority inheritance, a high priority thread becomes runnable and attempts to take another mutex already owned by the medium priority thread. The medium priority thread's priority is increased to high, but the high priority thread now must wait for both the low priority thread and the medium priority thread to complete before it can run again.

The chain of blocking critical sections can extend to include the critical sections of any threads that might access the same mutex. Not only does this make it much more difficult for the system designer to compute overhead, but since the system designer must compute the worst case overhead, the chained blocking phenomenon may result in a much less efficient system (Figure 6). These blocking factors are added into the computation time for tasks in the rate monotonic analysis, potentially rendering the system unschedulable. This may force the designer to resort to a faster CPU or to remove functionality from the system.

Figure 6: Chained blocking caused by priority inheritance.

A solution called the priority ceiling protocol not only solves the priority inversion problem but also prevents chained blocking (Figure 7). In one implementation scheme (called the highest locker), each semaphore has an associated priority, which is assigned by the system designer to be the priority of the highest priority thread that might ever try to acquire that object. When a thread takes such a semaphore, it is immediately elevated to the priority of the semaphore. When the semaphore is released, the thread reverts to its original priority. Because of this priority elevation, no other threads that might contend for the same semaphore can run until the semaphore is released. It is easy to see how this prevents chained blocking.

Figure 7: Priority ceilings.

Several RTOS provide support for both priority inheritance and priority ceilings, leaving the decision up to the system designer.

Changing requirements
Many of the RTOS in use today were originally designed for software systems that were smaller, simpler and ran on processors without memory protection hardware. With the ever-increasing complexity of applications in today's embedded systems, fault tolerance and high availability features have become increasingly important. Especially stringent are the requirements for safety-critical systems.

Fault tolerance begins with processes and memory protection but extends to much more, especially the need to guarantee resource availability in the time and space domains. Kernel support for features like the priority ceiling protocol gives safety-critical system designers the capabilities needed to maximize efficiency and guarantee schedulability in their systems.
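
The highest locker scheme described under Priority inversion can be sketched as a small model. This is an illustration under stated assumptions, not the interface of any RTOS: the names `Thread`, `CeilingSemaphore` and `pick_next` are hypothetical, a larger number means a higher priority, the system is a uniprocessor, and nested semaphores are released in the reverse order in which they were taken.

```python
class Thread:
    """Hypothetical thread model; a larger number means a higher priority."""

    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority
        self.priority = priority
        self.saved = []  # priorities to restore, one per semaphore held


class CeilingSemaphore:
    """Binary semaphore using the highest locker form of priority ceilings."""

    def __init__(self, ceiling):
        # Ceiling: priority of the highest priority thread that may ever take it.
        self.ceiling = ceiling
        self.owner = None

    def take(self, thread):
        if thread.base_priority > self.ceiling:
            raise ValueError("ceiling is lower than the caller's priority")
        if self.owner is not None:
            # On a uniprocessor the holder runs at the ceiling, so no
            # contender should ever get far enough to observe this state.
            raise RuntimeError("semaphore already held")
        self.owner = thread
        thread.saved.append(thread.priority)
        thread.priority = max(thread.priority, self.ceiling)  # immediate elevation

    def release(self, thread):
        if self.owner is not thread:
            raise RuntimeError("only the owner may release")
        self.owner = None
        thread.priority = thread.saved.pop()  # revert to the earlier priority


def pick_next(ready):
    """Preemptive priority scheduling: run the highest priority ready thread."""
    return max(ready, key=lambda t: t.priority)


low, medium, high = Thread("L", 1), Thread("M", 2), Thread("H", 3)
sem = CeilingSemaphore(ceiling=3)  # H is the highest thread that uses sem

sem.take(low)
running = pick_next([low, medium])  # L holds sem at priority 3, so M cannot preempt
sem.release(low)
```

While the low priority thread holds the semaphore it runs at the ceiling, so neither the medium priority thread nor any other contender can run inside its critical section. A high priority thread is therefore delayed by at most one critical section.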
