Self Healing OS

How to Realize Self-Healing Operating Systems?
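Hossein Momeni
Computer Engineering Department
Iran University of Science and Technology
momeni@iust.ac.ir

Omid Kashefi
Computer Engineering Department
Iran University of Science and Technology
kashefi@iust.ac.ir

Hadi Sharifi
Computer Engineering Department
Iran University of Science and Technology
hsharifi@comp.iust.ac.ir
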
Abstract: Operating systems serve as execution platforms and
resource managers for applications. As computer systems and
applications become more complex, the operating systems they
require become more complex too. Proper management of such
complex operating systems by human beings has proven
impractical. Self-managing concepts now provide the basis for
developing mechanisms that handle complex systems with less
human intervention. Although the implications of deploying
self-managing and autonomic attributes and concepts at the
application level have been studied, their deployment at the
system software level, such as in operating systems, has not
been fully studied. Self-managed applications may not enjoy
the full benefit of self-management if the platform on which
they run, especially its operating system, is not self-managed.
Given this requirement, this paper highlights the most
commonly experienced faults and anomalies of operating
systems, and proposes a tiered operating system architecture
and a corresponding self-healing mechanism to show how
self-management can be realized at the operating system level.
The main objective is to make the operating system resilient to
operating system faults without restarting the operating
system.
Keywords: Operating System; Autonomic; Self Healing;
Kernel; Supervisor; Hypervisor
I. INTRODUCTION
The need for more powerful, complex and reliable systems
and applications is rapidly increasing. With the introduction of
such applications, keeping them healthy, secure and
performing well, and maintaining them properly, has become
more difficult and complicated. This complication is mostly
attributed to the desire for ubiquity of heterogeneous and
distributed computing resources, services and mechanisms, or
in a nutshell, the heterogeneity challenge. Full human control
over such a heterogeneous and ubiquitous world is not
feasible, or at least not practical. So what is the solution?
Autonomic computing presents the idea of reducing
(ideally, removing) human interventions and making the
system management autonomous and more accurate [1].
IBM [2] defines an autonomic system as a system with four
basic features: self-healing, self-protecting, self-optimizing,
and self-configuring.
A self-healing system is characterized by awareness of its
own state and by the ability to detect and recover from
known and unknown faults.
A self-protecting system is a system that can detect,
identify and protect the system from arbitrary attacks.
Self-optimization is the ability to monitor and control the
resources of a system, to get the optimal performance based
on performance requirements.
Self-configuration is the autonomic (re)configuration of
system components.
Autonomic computing presents a promising prospect for
management of modern complex systems, such as operating
systems and applications, by lowering or removing the
necessity of human intervention in their management.
Self-managing applications can be immune to faults of their
execution platforms only if the operating systems under which
they run are self-managed too [3]. Otherwise, applications are
vulnerable to all kinds of platform faults and anomalies. It is
thus necessary to have self-healing support at the operating
system level as well.
The deployment of self-managing and autonomic
attributes and concepts at the application level has been
studied, but their deployment at the system software level,
such as in operating systems, has not been fully studied.
In this paper, we investigate how one can apply
autonomic properties in general, and specifically self healing
properties to operating systems to reduce human
involvement in their management and configuration. We
further propose a new three layered operating system
architecture in support of self healing attributes.
The rest of the paper is organized as follows. Section II
presents the common operating system faults. Section III
presents some notable related work. Section IV presents our
proposed architecture. Section V discusses the prototyping of
this architecture, and finally, Section VI concludes the paper.
II. OPERATING SYSTEMS FAULTS
Generally speaking, all types of software, including
operating systems and user applications, commonly suffer
from the following faults [4, 5]:
Syntactic faults: input parameter faults.
Semantic faults: inconsistent behavior and incorrect
results.
Service faults: QoS faults like real-time violations.
Communication and interaction faults: time-out and
service unavailability.
Exceptions: I/O related exceptions and security-
related exceptions.
Anomalies in operating systems, as execution platforms
for user applications, may well affect the behavior and
correctness of some or even all user applications, while
anomalies of an application are usually confined to that
application and are not propagated to other applications or
the operating system; this is in fact ensured by the operating
system [6]. Therefore, the fault resiliency of operating
systems is paramount to the reliability of systems.
Application crashes are caused by faults that can be
considered soft faults. The system running such an
application can, in most cases, recover from these faults
without restarting the operating system. On the other hand,
the most critical faults in operating systems take the form of
exceptions and system call faults when accessing I/O. These
faults are mostly caused by the failure of device drivers and
are considered hard faults that require a restart of the
operating system.
Our primary aim is to propose an operating system
architecture that avoids the occurrence of such hard faults in
the first place, and that detects and automatically heals hard
faults that do occur without restarting the operating system.
III. RELATED WORK
There have been three notable efforts on introducing
autonomic operating systems. Chronologically, the first one
is the Minix3 [7] operating system, developed at Vrije
Universiteit by Andrew S. Tanenbaum. Minix3 focuses on
reliability and self-healing features by reducing the size of
the kernel and moving device drivers to the application
layer [8]. The Minix3 architecture is shown in Figure 1. It
recovers from some failures in device drivers and
applications with minimal user intervention. Although this
technique recovers from most operating system faults and,
by moving code up to the application layer, reduces the size
of the operating system kernel code by nearly 70% [8, 12],
many operating system kernel faults remain uncontrolled and
need to be recovered by human intervention. Examples are
resource manager faults in lock management, kernel panics,
and processes in harmful states (e.g., zombies).

Figure 1. The Minix3 architecture

The second work is the predictive self-healing
mechanism included in Solaris 10 [9] by Sun
Microsystems. The fault manager and service manager in
Solaris 10 are responsible for predicting the failures of
components before they occur, and reconfiguration agents
take proactive actions to handle failures. Simplified
administration, fast and easy repair, and maximum
availability are reported as the strengths of this
mechanism.
Roy Campbell et al. have taken a different approach to
improving the reliability of the Choices operating system [10]
via a special exception handling technique. An error handler
manages the kernel errors [3, 11]. Component isolation
using a wrapper prevents error propagation, and with a
speculative recovery plan, recovery orders are directed only
to specific components, as shown in Figure 2. Code
reloading is used for fixing transient faults and recovering
faulty components.

Figure 2. The Choices component isolation using wrappers
The work done so far on self-managed operating systems
reduces the human intervention needed to prevent and
remedy operating system faults, but it falls short of
presenting a comprehensive architecture that realizes all
autonomic features. Therefore, the challenges of developing
autonomic platforms remain.

IV. PROPOSED ARCHITECTURE
As mentioned earlier, hard faults are harmful and can
prevent the system from functioning as required. To decrease
the probability of operating system hard faults, we opt to
minimize the functionality of the core of the operating system
kernel, as is done in Minix3 [7], and then try to make this
minimal core as stable as possible. This core is placed in a
three-layered architecture in support of self-healing
attributes. The architecture has a user layer, a supervisor
layer and a hypervisor layer, as shown in Figure 3. Any
operating system conforming to this architecture is intended
to heal operating system faults and anomalies by itself,
without human intervention or a restart of the operating
system. To achieve this, the operating system must be able to
perceive when it is performing healthily and when it is
performing unhealthily due to operating system faults. When
it is unhealthy, the operating system must recover from its
current unhealthy state to a healthy state by making the
necessary adjustments.
To have an operating system with the above features, we
propose a self healing mechanism that works in four phases:
1. Monitoring
2. Analysis
3. Detection
4. Healing
In the monitoring phase, the two healing units shown in
Figure 3 periodically monitor system states. The
super-healing unit in the supervisor layer monitors the activity
states of the application layer, and the hyper-healing unit in
the hypervisor layer monitors the activity states of the kernel.
In the analysis phase, system states are analyzed and
classified as healthy or unhealthy. The classification is
induced using learning methods: during execution of the
operating system, the healing units train a classifier function
with healthy system states, and states that do not comply
with the healthy criteria are considered unhealthy.
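
The analysis phase can be illustrated with a minimal sketch. The paper does not specify the classifier, so the following assumes a simple one-class model that learns a per-metric mean and standard deviation from healthy samples only and flags any state that deviates too far. The class name, the metric names (run_queue, blocked_procs, free_mem_mb) and the tolerance value are all hypothetical.

```python
from statistics import mean, stdev


class HealthyStateModel:
    """One-class model: learns per-metric ranges from healthy samples only."""

    def __init__(self, tolerance=3.0):
        self.tolerance = tolerance  # allowed deviation, in standard deviations
        self.profile = {}           # metric name -> (mean, standard deviation)

    def train(self, healthy_states):
        """Fit the profile from dicts sampled while the system was healthy."""
        for name in healthy_states[0]:
            values = [state[name] for state in healthy_states]
            self.profile[name] = (mean(values), stdev(values))

    def is_healthy(self, state):
        """A state is healthy if every metric stays within the learned range."""
        for name, (mu, sigma) in self.profile.items():
            spread = sigma if sigma > 0 else 1e-9  # guard constant metrics
            if abs(state[name] - mu) > self.tolerance * spread:
                return False
        return True


# Hypothetical training data: samples taken during normal operation.
healthy_samples = [
    {"run_queue": 2, "blocked_procs": 0, "free_mem_mb": 510},
    {"run_queue": 3, "blocked_procs": 1, "free_mem_mb": 495},
    {"run_queue": 2, "blocked_procs": 1, "free_mem_mb": 505},
    {"run_queue": 4, "blocked_procs": 0, "free_mem_mb": 490},
]
model = HealthyStateModel()
model.train(healthy_samples)

normal = {"run_queue": 3, "blocked_procs": 1, "free_mem_mb": 500}
suspect = {"run_queue": 3, "blocked_procs": 40, "free_mem_mb": 500}
print(model.is_healthy(normal), model.is_healthy(suspect))
```

Training on healthy states only matches the description above: anything outside the learned envelope is treated as unhealthy, so no fault examples are needed in advance.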
In the detection phase, system faults and anomalies are
detected and appropriate plans for handling them are
prepared.
In the healing phase, the healing units heal the system by
moving it from unhealthy states to healthy states, as shown
in Figure 4. Since most operating system faults are
attributed to device drivers [12], we have chosen to move
parts of the kernel, such as device drivers, to the user layer in
our operating system architecture. With this approach we
have minimized the kernel core and made the super-healing
unit responsible for healing device driver faults. Kernel faults
are healed by the hyper-healing unit in the hypervisor layer.
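
The four phases can be sketched as a single periodic step of a healing unit. This is an illustrative sketch under stated assumptions, not the prototype's code: the HealingUnit class, its callbacks, and the fault names (driver_crashed, deadlock) are hypothetical, and the health check is passed in as a plain function rather than a trained classifier.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


class HealingUnit:
    """Runs monitoring, analysis, detection and healing over one layer."""

    def __init__(self, name, probe, is_healthy, plans):
        self.name = name
        self.probe = probe            # callable returning the current state
        self.is_healthy = is_healthy  # callable: state -> bool
        self.plans = plans            # fault name -> healing callable
        self.health = Health.HEALTHY
        self.log = []                 # (unit name, fault) pairs that were healed

    def detect(self, state):
        """Map an unhealthy state to the known faults it exhibits."""
        return [fault for fault in self.plans if state.get(fault)]

    def step(self):
        state = self.probe()                 # 1. monitoring
        if self.is_healthy(state):           # 2. analysis
            self.health = Health.HEALTHY
            return self.health
        self.health = Health.UNHEALTHY
        for fault in self.detect(state):     # 3. detection
            self.plans[fault]()              # 4. healing
            self.log.append((self.name, fault))
        if self.is_healthy(self.probe()):    # re-check after healing
            self.health = Health.HEALTHY
        return self.health


# Hypothetical user-layer state: one driver daemon has crashed.
layer_state = {"driver_crashed": True, "deadlock": False}


def restart_driver():
    layer_state["driver_crashed"] = False


super_unit = HealingUnit(
    name="super-healing",
    probe=lambda: dict(layer_state),
    is_healthy=lambda state: not any(state.values()),
    plans={"driver_crashed": restart_driver, "deadlock": lambda: None},
)
result = super_unit.step()
print(result, super_unit.log)
```

A hyper-healing unit for the kernel would be a second instance with its own probe and plans, which reflects the split of responsibilities between the supervisor and hypervisor layers.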

Figure 3. Proposed architecture for a self-healing operating system.
The transition in the healing phase may be achieved by a
number of recovery and repair mechanisms, such as
microreboot, i.e., the individual rebooting of fine-grained
application components without disturbing the rest of the
application. It can achieve many of the same benefits as
whole-process restarts, but an order of magnitude faster and
with an order of magnitude less lost work [13]. It is a
fine-grained technique for surgically recovering faulty
application components. This low-cost recovery technique
makes some operating system faults disappear, such as
deadlocks and environment-dependent errors. Microreboot
can be applied at multiple levels until the faults disappear [13].
Since our proposed architecture for a self-healing operating
system is modular, only crashed modules need to be rebooted
to recover the system.
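
The multi-level use of microreboot can be sketched as an escalation loop: reboot the faulty module first, and widen the scope to the enclosing module only if the fault persists. The Module class, the module names and the fault_cleared callback are hypothetical and serve only to illustrate the idea described in [13].

```python
class Module:
    """A restartable component; `parent` is the enclosing, coarser module."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.reboots = 0

    def reboot(self):
        # Stand-in for the real work of reinitializing the module.
        self.reboots += 1


def microreboot(module, fault_cleared):
    """Reboot `module`, then escalate to enclosing modules until the fault clears.

    `fault_cleared` is a callable returning True once the fault is gone.
    Returns the names of the rebooted modules, finest-grained first.
    """
    rebooted = []
    current = module
    while current is not None:
        current.reboot()
        rebooted.append(current.name)
        if fault_cleared():
            return rebooted
        current = current.parent  # widen the recovery scope by one level
    raise RuntimeError("fault persists after rebooting every enclosing module")


# Hypothetical hierarchy: a disk driver inside an I/O manager inside a layer.
top_layer = Module("user-layer")
io_manager = Module("io-manager", parent=top_layer)
disk_driver = Module("disk-driver", parent=io_manager)

# Simulated fault that disappears only once the I/O manager has been rebooted.
path = microreboot(disk_driver, fault_cleared=lambda: io_manager.reboots > 0)
print(path)
```

Starting at the finest granularity keeps recovery cheap in the common case, and the coarser, more expensive reboots are reached only when the cheap ones fail.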

Deploying the above self-healing mechanism at the
operating system level provides a more stable environment
for running applications. It reduces the cases in which
applications that execute correctly, without any faults of
their own, are stopped because of operating system faults.

Figure 4. State chart of a self-healing operating system
V. IMPLEMENTATION
We implemented a prototype self-healing operating
system based on the proposed architecture, using the Linux
kernel as our kernel and Xen, a hypervisor virtual machine
on Linux, as the microkernel of the hypervisor layer
[14, 15].
We moved device drivers up to the user layer as daemons
and placed a device driver polling middleware to handle
devices and communicate with the kernel. We added all
operating system resource control mechanisms, such as the
deadlock manager, paging manager and lock manager, to the
super-healing unit. The system state is monitored
periodically in the monitoring phase, and in each period the
health of the state is analyzed. Unhealthy states are made
healthy by recovering from the detected faults using the
common resource control mechanisms. When the control
mechanisms cannot resolve a fault, recovery is achieved by
restarting the super-healing unit. The hyper-healing unit
monitors the kernel state. When a kernel hard fault such as a
kernel panic is detected, only the kernel is rerun, instead of
restarting the whole system.
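
The recovery policy of the super-healing unit described above can be sketched as a two-step fallback: try the matching resource control mechanism, and restart the unit only if that does not help. The function name, the fault labels and the return strings are hypothetical and do not reflect the prototype's actual interfaces.

```python
def recover(fault, mechanisms, restart_unit):
    """Try the matching resource control mechanism, else restart the unit.

    `mechanisms` maps a fault kind to a callable returning True on success.
    Returns a short label describing the action taken.
    """
    handler = mechanisms.get(fault)
    if handler is not None and handler():
        return f"healed:{fault}"
    restart_unit()
    return "restarted:super-healing-unit"


actions = []


def restart():
    # Stand-in for restarting the super-healing unit.
    actions.append("restart")


# Hypothetical mechanisms: the deadlock manager succeeds, the lock manager fails.
mechanisms = {
    "deadlock": lambda: True,
    "lock": lambda: False,
}

results = [recover(fault, mechanisms, restart)
           for fault in ("deadlock", "lock", "unknown")]
print(results)
```

In this sketch both a failed mechanism and a fault with no registered mechanism fall through to the restart, which corresponds to the case where the control mechanisms cannot resolve the fault.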
Due to the minimal size and functionality of the kernel in
the hypervisor layer, we experienced noticeably fewer faults in
the microkernel, even in the presence of kernel faults.
VI. CONCLUSION
In this paper, we investigated how one can apply
autonomic properties in general, and specifically self healing
properties to operating systems to reduce human
involvement in the management and configuration of
operating systems. A new three layered operating system
architecture in support of self healing attributes, as well as an
overview of a self healing mechanism, was proposed.
We argued that adding the self-healing property to
operating systems offers a more stable execution
environment for running applications. This is because
correctly behaving applications are no longer unnecessarily
and adversely affected by operating system faults, or at
least not by all sorts of them.
Although self-healing has been supported at the operating
system level, a comprehensive autonomic operating system
additionally requires support for self-protection, self-
configuration and self-optimization features.
ACKNOWLEDGMENT
We would like to acknowledge the Iran Telecommunication
Research Center (ITRC) for partially funding the research
reported in this paper. We would also like to thank our
colleagues in the distributed systems research laboratory of
the Computer Engineering Department of IUST, who helped
us in prototyping the ideas reported in this paper.
REFERENCES
[1] J. O. Kephart and D. M. Chess, "The Vision of Autonomic
Computing", IEEE Computer, Vol. 36, No. 1, pp. 41-50, 2003.
[2] M. R. Nami and M. Sharifi, "A Survey of Autonomic Computing
Systems", IFIP, Springer Boston, pp. 101-110, 2007.
[3] F. M. David and R. H. Campbell, "Building a Self-Healing
Operating System", The 3rd IEEE International Symposium on
Dependable, Autonomic and Secure Computing, Columbia, MD,
USA, 2007.
[4] A. Avizienis, J. C. Laprie, B. Randell and C. Landwehr, "Basic
Concepts and Taxonomy of Dependable and Secure Computing", IEEE
Transactions on Dependable and Secure Computing, Vol. 1, No. 1,
pp. 11-33, 2004.
[5] L. Mariani, "A Fault Taxonomy for Component-based Software",
Proceedings of International Workshop on Test and Analysis of
Component-Based Systems (TACoS), Electronic Notes in Theoretical
Computer Science, Vol. 82, No. 6, pp. 55-65, Elsevier Science, 2003.
[6] H. Bos, G. Homburg and A. S. Tanenbaum, "Construction of a
Highly Dependable Operating System", The 6th European
Dependable Computing Conference, Coimbra, Portugal, 18-20 Oct.
2006.
[7] MINIX3 Operating System,
URL: http://www.minix3.org, Last Accessed on November 2007.
[8] A. S. Tanenbaum, J. N. Herder and H. Bos, "Can We Make Operating
Systems Reliable and Secure?", IEEE Computer, Vol. 39, No. 5, pp.
44-51, 2006.
[9] Sun Microsystems, "Predictive Self-Healing in the Solaris 10
Operating System", A Technical Introduction, Santa Clara, USA,
2004.
[10] R. H. Campbell, N. Islam, R. Johnson and P. Kougiouris, "Choices,
Frameworks and Refinement", IEEE Computer, pp. 9-15, 1991.
[11] F. M. David, J. C. Carlyle, E. M. Chan, D. K. Raila and R. H.
Campbell, "Exception Handling in the Choices Operating System",
LNCS, No. 4119, pp. 42-61, 2006.
[12] A. Chou, J. Yang, B. Chelf, S. Hallem and D. Engler, "An Empirical
Study of Operating System Errors", In Proceedings of the 18th ACM
Symposium on Operating Systems Principles, New York, NY, USA,
pp. 73-88, 2001.
[13] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman and A. Fox,
"Microreboot: A Technique for Cheap Recovery", In Proceedings of
the 6th ACM Symposium on Operating Systems Design &
Implementation, San Francisco, CA, pp. 3-3, 2004.
[14] I. Habib, "Xen", Specialized Systems Consultants, Inc., Linux
Journal, Vol. 2006, No. 145, May 2006.
[15] Xen, The Official Project Site,
URL: http://www.xen.org, Last Accessed on November 2007.
