RTOS for Fault Tolerant Application

Abstract: The increasing complexity of safety-critical systems that support real-time multitasking applications requires the concurrency management offered by real-time operating systems (RTOS). Real-time systems can suffer severe consequences if either the functional or the timing specifications are not met. In addition, real-time systems are subject to transient errors originating from several sources, including the impact of high-energy particles on sensitive areas of integrated circuits. Therefore, evaluating the sensitivity of RTOS to transient faults is a major issue. This paper explores the sensitivity of RTOS kernels in safety-critical systems. We characterize and analyze the consequences of transient faults on key components of the kernel of MicroC, a popular RTOS. We specifically focus on its task scheduling and context switching modules. Classes of fault syndromes specific to safety-critical real-time systems are identified. Results reported in this paper demonstrate that 34% of the faults that affect the scheduling and context switching functions lead to scheduling dysfunctions. This represents an important fraction of faults that cannot be ignored during the design phase of safety-critical applications running under an RTOS.

Index Terms: Context switch, fault injection, fault syndromes, real-time operating systems (RTOS), scheduler, safety-critical systems.

Introduction: Today, many safety-critical embedded systems support real-time multitasking applications (e.g., nuclear power station applications, aerospace applications, traffic control, or medical life support). The complexity of these systems requires real-time operating systems (RTOS). Because of the time criticality factor, the design of real-time systems is challenging. In real-time systems, critical tasks must never miss their deadlines and never produce incorrect output results.
If their response times exceed a given time period (deadline) or if they provide incorrect results, the consequences can be catastrophic (e.g., loss of human lives or economic disaster). Therefore, the correct real-time functionality of safety-critical systems is mandatory in order to guarantee the correctness of output results and the required response time of critical tasks, even in the worst situations. Real-time systems, like all electronic systems, are subject to transient errors due to cosmic rays and alpha particles. These errors can cause undesired modifications of memory storage cells. The consequences of transient errors are a well-known concern in microelectronic systems. The International Technology Roadmap for Semiconductors (ITRS) predicts increasing system failure rates due to transient errors for future generations.

These errors affect applications running on embedded systems as well as the RTOS under which they execute. Consequently, they affect both the correctness of output results and the timing of the tasks' responses. In real-time applications, time correctness can be more important than the correctness of output results. For instance, if a system provides correct output results, but later than some deadline, the system behaviour may be incorrect, with consequences more significant than if a result with a minor error is provided on time. The main services provided by an RTOS kernel are task scheduling (taking into account several factors such as task priorities, resources, and time management) and context switching. The scheduler decides which task is to be executed, while the context switch module loads the context (variables, stack, etc.) of the selected task. If these two services do not work properly, the task execution order may be affected, and some critical tasks could miss their deadlines or provide incorrect output results. A real-time scheduler must be extremely reliable and safe in order to ensure the correctness of the real-time system response. This is a major concern for RTOS providers, and several standards for safe and reliable implementations have been proposed. For instance, RTCA DO-178B is a standard for software used in avionics equipment. This standard approaches reliability and safety from the software development perspective, ensuring RTOS fault tolerance in case of software bugs. However, implementations respecting this standard may also be subject to transient errors, and the study of their sensitivity to these errors becomes an important issue for safety-critical real-time applications. The majority of existing works propose fault injection techniques to evaluate the robustness of kernels that are not real-time. In one such work, a fault injection tool was developed to study error propagation in UNIX systems.
Reported results show that most injected faults lead to system failure. A similar result has been reported elsewhere, where the authors propose a fault injection tool that corrupts system call parameters. The results show a high failure rate of POSIX functions (POSIX, the Portable Operating System Interface, is a set of standards specified by the IEEE to define the application program interface for software designed to run on variants of the UNIX OS). Other representative studies propose the MAFALDA tool to inject faults in the microkernel object code and the application data segment. The results report not only system crashes, but also error propagation to the application level. However, none of the cited works addresses the real-time aspect, which is the key reason for using an RTOS in safety-critical real-time systems.

There is a lack of contributions in the specialized literature that consider the sensitivity of the real-time features of RTOSs subject to transient faults. To our knowledge, only one existing work investigates the temporal aspects of injected faults. In that work, the authors propose a tool that aims at evaluating the time correctness of the Chorus microkernel. They study the consequences of faults injected in the scheduler code. Experimental results show that about 7% of injected faults are propagated to the application level. It is of interest that modern RTOSs are PROMable, which means that a CPU can execute the RTOS services directly from the PROM. Since PROMs are less sensitive to transient errors than RAMs, faults in the scheduler code are less of a concern. However, PROMable RTOSs are still subject to transient errors during their execution, as the CPU registers are intensively used. Therefore, to assess the robustness of RTOSs to transient faults, it is mandatory to investigate their sensitivity to faults injected in CPU registers.

With respect to the presented state of the art, the main contributions of this paper are: 1) the definition of different types of syndromes caused by transient faults occurring in safety-critical systems that include an RTOS; 2) the proposal of a fault injection methodology to assess the sensitivity of the MicroC RTOS to register-level transient faults, taking into account both functional correctness and real-time aspects; and 3) a detailed analysis of the reasons for scheduling dysfunctions caused by transient errors.

The choice of MicroC to evaluate the sensitivity of RTOS kernels in safety-critical systems was motivated by several aspects. MicroC is an open source kernel and it is widely used in real-time applications. In addition, MicroC was certified for use in safety-critical systems (in conformity with RTCA DO-178B). Moreover, the current trend in real-time systems is to adopt less complex RTOSs running on multiprocessor-based architectures, instead of using a complex RTOS running on a single processor.

The transient fault model considered in our experiments is bit-flips in the processor registers while the key components of the MicroC kernel (task scheduling and context switching) are active. Comparing our results to those previously reported for faults injected in the scheduler code, we observed that faults corrupting the CPU registers during the execution of the scheduling and context switching functions have a significant impact on real-time system reliability. In our experiments, we recorded that 34% of injected faults caused scheduling dysfunctions, while an additional 17% led to system crashes. This represents an important fraction of faults that cannot be ignored during the design stage of safety-critical applications running under an RTOS.

The paper is structured as follows. Section II identifies fault syndromes for safety-critical systems including an RTOS. Section III briefly describes the main features of the MicroC kernel. The conceptual framework of the proposed fault injection technique is depicted in Section IV. Fault injection results are analyzed and discussed in Section V. Section VI provides some lessons learned concerning the fault injection experiments and results analysis. Finally, Section VII presents our concluding remarks.

Fault Syndromes for Safety-Critical Systems Including an RTOS: Transient faults in the RTOS kernel of a safety-critical system may cause several syndromes. The main classes of syndromes caused by transient faults occurring in an RTOS kernel are presented in Fig. 1. As illustrated in the figure, an RTOS affected by transient faults may present two main classes of syndromes.

1) Syndromes that may also be observed in classical systems: effect-less (no observable effect on system functionality); application hang (the system application stops responding, e.g., it enters an infinite loop); exception (the program triggers some exception routine, e.g., illegal instruction or division by zero); memory access dysfunction (the system tries to access a non-valid physical memory address); system crash (the system stops functioning, possibly as a consequence of a memory access dysfunction); incorrect output results (the system provides results, but they differ from the expected ones).

2) Syndromes specific to real-time systems using an RTOS: real-time problem (the real-time constraints specified for the system are not respected); scheduling dysfunction (the scheduling of the tasks composing the application running on the system is not correct). A scheduling dysfunction may in turn cause real-time problems, incorrect output results, or system crashes.

MicroC/OS-II Real-Time Kernel: Basic Considerations: MicroC is a reliable, flexible, pre-emptive, real-time multitasking kernel. It has been certified by the Federal Aviation Administration for use in commercial aircraft. The source code of the MicroC kernel is mainly written in standard C, which makes it portable to different processor architectures. Only a small portion of the code has to be adapted to the target processor. The main services offered by MicroC include task scheduling, intertask communication through semaphores, message mailboxes and message queues, and time management functions. MicroC can manage up to 64 tasks; each task is associated with a unique priority. A task can be in one of five states (dormant, ready to run, running, waiting, and interrupted). The dormant state corresponds to a task that has not been made available to the multitasking kernel. The waiting state corresponds to a task that waits for the occurrence of an event. A task is in the interrupted state when an interrupt has occurred and the CPU is handling the interrupt service routine (ISR). A task is running when it has exclusive control of the CPU. The ready to run state corresponds to a task that can be executed once the CPU becomes available (the running task terminates). Generally, a task is an infinite loop function that executes user code. MicroC associates with each task a task control block (TCB) that contains essential information about the task (e.g., delay, state, priority, and the address of the current top of the stack). MicroC uses the TCB to preserve the task's state when it is suspended, and to resume its execution exactly where it left off when the task becomes ready to run again. All TCBs are located in RAM. Another characteristic of the considered multitasking application is that each task has its own stack, which contains the task's variables and the task's running context (the content of all the CPU registers).
The scheduling function is activated every time a task calls the kernel's services and when the system returns from an interrupt service routine. When invoked, the scheduling function verifies whether a task with a higher priority than the currently running task is ready to run. In this case, a context switch is performed. The context switch saves the context of the task being suspended and loads into the CPU the register values of the task to resume. The ready to run tasks are placed in the ready list, which is stored in memory in two structures: OSRdyGrp and OSRdyTbl. OSRdyTbl is a table in which each bit is associated with a priority level. OSRdyGrp is an 8-bit vector. Each bit in OSRdyGrp corresponds to a row in OSRdyTbl. If at least one of the tasks whose priorities are grouped in a row is ready to run, the corresponding bit in OSRdyGrp is set to 1. The scheduler uses the OSRdyGrp and OSRdyTbl structures to determine the highest priority task allowed to run. The value of OSRdyGrp and the value of the OSRdyTbl row corresponding to the first 1 in OSRdyGrp are used as indexes into a lookup table that helps determine the highest priority task. This operation is deterministic (its execution time is constant for all contexts). Given this functionality, transient faults occurring in the OSRdyGrp and OSRdyTbl structures may have major implications for the correct behavior of the MicroC RTOS and consequently for the global system (as explained in Section II in the definition of scheduling dysfunction syndromes).
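The ready-list lookup described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the kernel's source: the names rdy_grp, rdy_tbl, unmap_tbl and the rdy_* functions are hypothetical, the table is assumed to have 8 rows of 8 bits (matching 64 priority levels), and the lowest set bit is assumed to denote the highest priority.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 8                        /* assumed: 8 rows x 8 bits = 64 priorities */

static uint8_t rdy_grp;               /* bit y set => row y holds a ready task    */
static uint8_t rdy_tbl[ROWS];         /* bit x of row y => priority 8*y + x ready */
static uint8_t unmap_tbl[256];        /* byte value -> index of its lowest set bit */

/* Clear the ready list and build the lookup table once at start-up. */
void rdy_init(void)
{
    rdy_grp = 0;
    for (int y = 0; y < ROWS; y++) {
        rdy_tbl[y] = 0;
    }
    unmap_tbl[0] = 0;                 /* no bit set: entry is meaningless */
    for (int v = 1; v < 256; v++) {
        uint8_t bit = 0;
        while (((v >> bit) & 1) == 0) {
            bit++;
        }
        unmap_tbl[v] = bit;
    }
}

/* Mark the task with the given priority as ready to run. */
void rdy_set(uint8_t prio)
{
    assert(prio < 64);
    rdy_grp |= (uint8_t)(1u << (prio >> 3));
    rdy_tbl[prio >> 3] |= (uint8_t)(1u << (prio & 7));
}

/* Remove a task from the ready list; clear the group bit if its row empties. */
void rdy_clear(uint8_t prio)
{
    assert(prio < 64);
    uint8_t y = prio >> 3;
    rdy_tbl[y] &= (uint8_t)~(1u << (prio & 7));
    if (rdy_tbl[y] == 0) {
        rdy_grp &= (uint8_t)~(1u << y);
    }
}

/* Two table lookups and no loop, so the execution time is constant.
 * The caller must ensure that at least one task is ready. */
uint8_t rdy_highest(void)
{
    uint8_t y = unmap_tbl[rdy_grp];       /* first row with a ready task */
    uint8_t x = unmap_tbl[rdy_tbl[y]];    /* first ready task in that row */
    return (uint8_t)((y << 3) + x);
}
```

In this sketch, if only priority 37 is ready and bit 0 of rdy_grp is flipped, rdy_highest() returns 0, a priority at which no task is ready. This illustrates how a single corrupted bit in these structures can produce a scheduling dysfunction.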

Fault Injection Framework: In order to assess the robustness of the MicroC RTOS kernel scheduler, we developed an environment able to inject faults that corrupt CPU registers at random instants while the scheduler and the context switch functions are executing. The studied system architecture is organized as illustrated in Fig. 3. The adopted system architecture is simulated by an Instruction Set Simulator (ISS) tool. The fault injection tool uses the temporal breakpoint features available in the ISS to inject faults by software means. Once a temporal breakpoint is reached, global execution is suspended and the ISS tool activates a Fault Injection Manager (FIM) that comprises three modules: a fault parameters generator, a fault tracer, and a results analyzer. After the fault has been injected, global execution is resumed.
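The idea of choosing a random injection instant and corrupting one register bit can be sketched in C as follows. This is only an illustrative model, not the authors' tool: the type fault_t, the functions fault_generate and fault_inject, the register count NUM_REGS, the 32-bit register width, and the simple linear congruential generator are all assumptions introduced for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 16                   /* hypothetical number of CPU registers */

typedef struct {
    uint32_t time;                    /* injection instant, in simulated cycles */
    uint8_t  reg;                     /* index of the register to corrupt       */
    uint8_t  bit;                     /* bit position to flip (0..31)           */
} fault_t;

/* Small linear congruential generator so a campaign can be replayed
 * from its seed; any other pseudo-random source would do. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Draw fault parameters whose instant lies in [start, end), the window
 * during which the scheduler or context switch is assumed to be active. */
fault_t fault_generate(uint32_t *seed, uint32_t start, uint32_t end)
{
    assert(start < end);
    fault_t f;
    f.time = start + next_rand(seed) % (end - start);
    f.reg  = (uint8_t)(next_rand(seed) % NUM_REGS);
    f.bit  = (uint8_t)(next_rand(seed) % 32);
    return f;
}

/* Apply the fault: invert exactly one bit of one register. */
void fault_inject(uint32_t regs[NUM_REGS], fault_t f)
{
    assert(f.reg < NUM_REGS && f.bit < 32);
    regs[f.reg] ^= (uint32_t)1 << f.bit;
}
```

In a simulator-based setup, fault_inject would be called from the breakpoint handler once simulated time reaches f.time. Applying the same fault twice restores the original register value, which is a convenient sanity check for the injector itself.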

The injection process is depicted in Fig. 4. The fault parameters generator calculates when and where the fault will be injected. In our experiments, faults consist of single bit-flips affecting only the main MicroC kernel features: task scheduling and context switching. Accordingly, the fault instant must coincide with the time intervals when these functions are active, as illustrated.

Conclusion: Today, many safety-critical embedded systems execute real-time multitasking applications. The complexity of these systems typically calls for an RTOS. These systems are subject to transient errors induced by parasitic phenomena that may affect both the correctness of logical results and the timing of the tasks' responses. In this paper, we analyzed the sensitivity of the MicroC RTOS to transient faults. We presented a classification of the syndromes caused by transient faults occurring in safety-critical systems that include an RTOS. We identified syndromes specific to real-time systems that include an RTOS: real-time problems (when the real-time constraints specified for the system are not respected) and scheduling dysfunctions (when the scheduling of the different tasks composing the application running on the system is not correct). We also presented a methodology based on fault injection that allows assessing the sensitivity of the MicroC RTOS to transient faults, taking into account both logical correctness and real-time aspects. In addition, this paper presents an analysis of the reasons for scheduling dysfunctions, which may allow designers to improve RTOS robustness to transient faults.