
International Journal of Computer Systems (ISSN: 2394-1065), Volume 03 Issue 02, February, 2016

Available at http://www.ijcsonline.com/

Perspectives on Safety Critical Computing Systems


Kadupukotla Satish Kumar and Panchumarthy Seetha Ramaiah

Dept of Computer Science, JNTU Kakinada, India, satishkmca@yahoo.com


Dept of Computer Science and Systems Engineering, AU Visakhapatnam, India, psrama@gmail.com

Abstract
Computer systems have become an integral part of our lives. They are used in systems ranging from basic utility services to complex scientific research and defense applications. Any system presents some risk to its owners, users, and environment. Some present more risk than others, and those that present the most risk are what we call safety-critical systems. Safety-critical systems are those systems whose failure could result in loss of life, loss of revenue, significant property damage, or damage to the environment. This paper reviews nine system failures, from various domains, caused by software bugs. It then discusses regulatory standards and guidelines for proper testing of such software, and recommends guidelines to be followed during testing to reduce the instances of software failure in safety-critical systems.
Keywords: Safety Critical Systems, SIL, Safety Standards

I. INTRODUCTION

Most software project failures can be attributed to the fact that the project fails to fully meet its objectives, whether those objectives concern cost, schedule, quality, or requirements. Studies of software failures report alarming results: around 50% to 80% of projects end in failure. The causes are varied, but the studies agree that the most common are lack of client participation, inadequately trained developers, continuously changing client requirements, unrealistic project objectives, inaccurate estimates of resources, poorly defined system requirements, poor reporting, and inappropriate development practices. All software projects should be tested properly and should maintain accuracy at all times. For every software project that succeeds, there are projects that fail [1][2][3].
This paper is organized as follows. The next section highlights some cases of software failures. Section III discusses how to overcome software failures. Section IV describes safety-critical systems. Section V discusses software engineering for safety-critical systems. Section VI presents a discussion of the various standards and safety integrity levels for safety-critical systems. Section VII gives our conclusion.
II. SOFTWARE FAILURES

As the complexity of a system grows, errors may be ignored or remain undetected until a catastrophe occurs, resulting in huge loss of wealth or, in some cases, in human casualties. We discuss some failures of complex systems by citing well-known software errors that have led, or could have led, to huge losses in the space, transportation, communication, government, and health care industries, including:

Disintegration of the Mars Orbiter (1998)
The Mariner 1 spacecraft (1962)
Ariane 5, Flight 501 (1996)
Almost a WW-III (1983)
Patriot Missile Defense System (1991)
Iran Air Flight 655 (1988)
AT&T breakdown (1990)
Therac-25 (1985-1987)
Power blackout in the USA and Canada (2003)

A. Space
1. Disintegration of MARS Orbiter [19]
NASA launched a mission to study the Mars environment. An orbiter was launched in 1998 amid much fanfare, but the mission ended in disaster. The investigation attributed the root cause of the failure to a software error involving a unit calculation. The report issued by NASA states that the root cause was the failure to use metric units in the coding of a ground software file, "Small Forces", used in trajectory models. The investigation revealed that the navigation team was calculating in metric units while the ground software produced its results in imperial units. The computer systems involved could not reconcile the difference, resulting in a navigation error.
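The mechanism is easy to illustrate: one module produced thruster impulse data in pound-force-seconds while another consumed it as newton-seconds, so every figure was off by a factor of about 4.45. The following C sketch is purely illustrative; the function names, mass, and impulse values are assumptions, not the actual MCO software.

#include <stdio.h>

#define LBF_S_TO_N_S 4.44822  /* 1 pound-force-second = 4.44822 newton-seconds */

/* Hypothetical producer: reports thruster impulse in pound-force-seconds. */
static double thruster_impulse_lbfs(void) { return 100.0; }

/* Hypothetical consumer: trajectory model expects newton-seconds. */
static double trajectory_delta_v(double impulse_newton_s, double mass_kg) {
    return impulse_newton_s / mass_kg;   /* delta-v in m/s */
}

int main(void) {
    double mass_kg = 629.0;              /* assumed spacecraft mass */
    double raw = thruster_impulse_lbfs();

    /* Bug: raw value passed straight through, as if already in N*s. */
    double wrong = trajectory_delta_v(raw, mass_kg);

    /* Fix: convert at the interface between the two teams' software. */
    double right = trajectory_delta_v(raw * LBF_S_TO_N_S, mass_kg);

    printf("delta-v without conversion: %.4f m/s\n", wrong);
    printf("delta-v with conversion:    %.4f m/s (a factor of ~4.45 larger)\n", right);
    return 0;
}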
2. The Mariner 1 spacecraft [12]
In the era of punch cards, a program was transferred to punch cards by a programmer or operator using a keypunch. Mistakes, whether typographical or an incorrect command, could not be caught simply by looking at the punch card. Often, verification was carried out by re-punching the code onto a second card and comparing it with the first using a card verifier. An unverified bug introduced via a punch card is generally regarded as one of the most expensive software bugs in history, as it resulted in the destruction of the Mariner 1 spacecraft in 1962 (cost in 1962 dollars: $18.5 million; cost in today's dollars: about $135 million) before it could complete its mission of flying by Venus. The Mariner 1 spacecraft was launched on July 22, 1962 from Cape Canaveral, Florida. Soon after the launch, an onboard guidance antenna failed, causing a fallback to a backup radar system that should have been able to guide the spacecraft. However, there was a fatal flaw in the software of that guidance system. When the equations used to process and translate tracking data into flight instructions were encoded onto punch cards, one critical symbol was left out: an overbar or overline, often confused in later years with a hyphen. The missing overbar caused the guidance computer to compensate incorrectly for otherwise normal movement of the spacecraft.
3. Ariane 5, Flight 501 Failure [18]
The European Space Agency spent around 10 years and $7 billion to produce Ariane 5, a giant rocket capable of placing a pair of three-ton satellites into orbit with each launch, and intended to give Europe supremacy in the commercial space business. Less than a minute into its maiden voyage, the rocket exploded because a small computer program tried to stuff a 64-bit floating-point number into a 16-bit integer.
B. Defense
1. Almost WW-III
At the height of the Cold War, when peaceniks around the world were leaving no stone unturned to prevent a war between the USA and the USSR that could have escalated into a nuclear world war, their efforts were almost undone by a software error. On 26 September 1983, the early warning system of the USSR raised a false alarm that the USA had launched a missile attack. It raised the alarm twice: the first alarm stated that the USA had launched one missile, and a later alarm reported an attack by five missiles. The officer on duty, based on his own judgment, declared it a false alarm.
2. Patriot Missile Defense System [15]
During Operation Desert Storm, a software error in the US Patriot Missile Defense System contributed to the death of 28 US soldiers. The system failed to intercept an incoming Scud missile, which struck military barracks. The failure was attributed to an error in calculation. The system's internal clock counted time in tenths of a second, and the actual time was obtained by multiplying the clock's value by 1/10, a value that cannot be represented exactly in the 24-bit fixed-point register used. The tiny rounding error accumulated over roughly 100 hours of continuous operation, so that parts of the system which were supposed to share a common time were effectively out of sync, causing the tracking failure.
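The arithmetic behind the drift is easy to reproduce: 1/10 has no exact binary representation, and its 24-bit fixed-point approximation is off by roughly 9.5 x 10^-8 per tick. The C sketch below works through the commonly quoted figures; the Scud velocity is an approximate assumption.

#include <stdio.h>

int main(void) {
    /* 1/10 truncated to the 24-bit fixed-point form used by the system. */
    double one_tenth_24bit = 209715.0 / 2097152.0;       /* 0x33333 / 2^21 */
    double error_per_tick  = 0.1 - one_tenth_24bit;      /* ~9.5e-8 seconds */

    double hours   = 100.0;                 /* continuous operation before the failure */
    double ticks   = hours * 3600.0 * 10.0; /* clock counts tenths of a second */
    double drift_s = ticks * error_per_tick;

    double scud_speed_mps = 1676.0;         /* approximate Scud velocity (assumption) */
    printf("clock drift after %.0f h: %.3f s\n", hours, drift_s);
    printf("tracking displacement:   ~%.0f m\n", drift_s * scud_speed_mps);
    return 0;
}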
3. Aegis Combat System, Iran Air Flight 655 [16]
On July 3, 1988, the Aegis combat system used by the U.S. Navy failed to carry out a proper calculation, because of which the USS Vincennes mistakenly shot down a passenger aircraft, Iran Air Flight 655, resulting in 290 civilian casualties. Using the missile guidance system, the Vincennes' commanding officer believed the Iran Air Airbus A300B2 was a much smaller Iranian Air Force F-14A Tomcat jet fighter descending on an attack vector, when in fact the Airbus was transporting civilians on its normal civilian flight path. The radar system temporarily lost Flight 655 and reassigned its track number to an F-14A Tomcat fighter it had previously seen. During this critical period the decision to fire was made, and the civilian plane was shot down.
C. Telecommunications
1. AT&T Breakdown [17]
In January 1990, particular combinations of calls caused malfunctions across 114 switching centers of the AT&T network across the United States. Because of the malfunction, around 65 million calls could not be connected nationwide. The cause was traced to a sequence of events that triggered a latent fault in the switching software.
D. Health Care and Medicine
1. Therac-25 [14]
Therac-25 was a radiation therapy machine developed by Atomic Energy of Canada for cancer treatment. Between 1985 and 1987, Therac-25 machines at four medical centers gave massive overdoses of radiation to six patients. An extensive investigation revealed that in some instances operators repeated overdoses because the machine display indicated that no dose had been administered. Some patients received between 13,000 and 25,000 rads when 100-200 rads had been prescribed. The excessive radiation exposure caused severe injuries, and three patients lost their lives. Failure to adhere to good safety design was the underlying cause of the errors, and the investigation also found calculation errors. For example, the set-up test used a one-byte flag variable that was incremented on each run; when the routine was called for the 256th time, the flag overflowed to zero, the corresponding safety check was skipped, and a high-power electron beam was erroneously turned on. The investigation also showed that although some latent errors could be traced back several years, the inadequate system of incident reporting made it hard to pinpoint the root cause of the failures. The final investigation report indicates that during real-time operation the software recorded only certain parts of operator input and editing; only a careful reconstruction by a physicist at one of the cancer centers eventually revealed what exactly went wrong.
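A much simplified C sketch of the roll-over mechanism follows; the variable name and the check are illustrative assumptions, not the actual Therac-25 code.

#include <stdio.h>
#include <stdint.h>

static uint8_t class3 = 0;   /* one-byte flag: non-zero means "perform the check" */

/* Called on every pass of the set-up test. */
static void setup_test(int *check_performed) {
    class3++;                /* wraps to 0 on the 256th call */
    if (class3 != 0) {
        *check_performed = 1;      /* safety check carried out */
    } else {
        /* Roll-over: the flag reads as "nothing to check", so the check is
           skipped and the beam can be enabled in an unsafe configuration. */
        *check_performed = 0;
    }
}

int main(void) {
    for (int call = 1; call <= 256; call++) {
        int checked = 0;
        setup_test(&checked);
        if (!checked) {
            printf("call %d: safety check silently skipped (flag wrapped to 0)\n", call);
        }
    }
    return 0;
}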
E. Public Utilities
1. Power Blackout in USA & Canada [13]
A software bug in an alarm system in the control room of an energy company contributed to an electrical power blackout in the Northeastern and Midwestern USA and in Ontario, Canada, in August 2003. The outage affected over 50 million people in the two nations.

III. HOW TO OVERCOME SOFTWARE FAILURES

The software process is complex and should be handled with care and proper understanding [5]. Capturing client and user requirements properly and transforming them into a working software product is as much an art as a technique. The US government itself spends around 60 billion dollars on testing. Most projects fail because developers do not capture requirements properly, end users cannot articulate their requirements properly, parameters are neglected, software professionals lack a proper understanding of the technology, security measures are not followed, and so on. Conducting a post-mortem after each project, and feeding the lessons into the next one, builds the needed understanding. Where failures cannot be tolerated, the discipline of safety-critical systems engineering plays a key role.
A. Abbreviations and Acronyms
FMEA: Failure Mode and Effects Analysis
LOPA: Layers Of Protection Analysis
PFD: Probability of Failure on Demand
SIF: Safety Instrumented Function
SIL: Safety Integrity Level.
IV. ABOUT SAFETY-CRITICAL SYSTEMS

A life-critical or safety-critical system [1] is a system whose failure or malfunction may result in one or more of the following outcomes:
Death or serious injury to people
Loss of or damage to equipment or property
Environmental harm
Risks of this sort are usually managed with the methods and tools of safety engineering. A life-critical system is typically designed to lose no more than one life per billion (10^9) hours of operation. Typical design methods include probabilistic risk assessment, which combines failure mode and effects analysis (FMEA) with fault tree analysis.
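As a small illustration of the quantitative side of such an assessment, the probability of a fault-tree top event can be combined from basic-event probabilities through AND and OR gates, assuming independent events. The failure probabilities in the C sketch below are invented for the example.

#include <stdio.h>

/* OR gate: the event occurs if any independent input fails. */
static double or_gate(const double p[], int n) {
    double none_fail = 1.0;
    for (int i = 0; i < n; i++) none_fail *= (1.0 - p[i]);
    return 1.0 - none_fail;
}

/* AND gate: the event occurs only if all independent inputs fail. */
static double and_gate(const double p[], int n) {
    double all_fail = 1.0;
    for (int i = 0; i < n; i++) all_fail *= p[i];
    return all_fail;
}

int main(void) {
    /* Invented per-hour failure probabilities for two redundant sensors
       and a single processing channel. */
    double sensors[] = { 1e-4, 1e-4 };               /* both must fail: AND */
    double sensor_subsystem = and_gate(sensors, 2);

    double branches[] = { sensor_subsystem, 1e-6 };  /* either branch fails: OR */
    double top_event = or_gate(branches, 2);

    printf("P(top event per hour) = %.3e\n", top_event);
    return 0;
}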
Safety-critical systems are increasingly computer-based. Any system presents some risk to its owners, users, and environment. Some present more risk than others, and those that present the most risk are what we call safety-critical systems.
A risk is a threat to something of value. All systems either contain something of value that may be jeopardized, or their use may jeopardize some value outside them. A system should be built to protect those values both from the consequences of ordinary use and from malicious attacks of various kinds. A typical categorization considers values concerning:
Safety
Economy
Security
Environment
V. SOFTWARE ENGINEERING FOR SAFETY-CRITICAL SYSTEMS

Software engineering for safety-critical systems is very difficult [8]. Three aspects can be applied to help the software engineering process for safety-critical systems. The first is process engineering and management. The second is selecting the appropriate resources and environment for the system. The third is that developers should address any legal or regulatory requirements for the system; for example, the Federal Aviation Administration has issued guidelines to be followed by systems used in aviation. Setting up a standard to which a system must adhere forces the developers to take the necessary precautions. The aviation industry has been successful in laying down standards for producing safety-critical avionics software, and similar standards are in place for the automotive (ISO 26262), medical (IEC 62304), and nuclear (IEC 61513) industries. The standard approach is to carefully code, inspect, document, test, verify, and analyze the system. Another approach is to certify a production tool, such as a compiler, and then generate the system's code from specifications. Yet another approach uses formal methods to generate proofs that the code meets its requirements. All of these approaches improve software quality in safety-critical systems by testing for or eliminating manual steps in the development process, because people make mistakes, and these mistakes are the most common cause of accidents. Many regulatory standards address how to determine the safety criticality of systems and provide guidelines for the corresponding testing. Some of them (but probably not all) are:

CEI/IEC 61508: Functional safety of electrical/electronic/programmable electronic safety-related systems
DO-178B: Software considerations in airborne systems and equipment certification
prEN 50128: Software for railway control and protection systems
Def Stan 00-55: Requirements for safety-related software in defense equipment
IEC 880: Software for computers in the safety systems of nuclear power stations
MISRA (Motor Industry Software Reliability Association): Development guidelines for vehicle-based software
FDA (Food and Drug Administration): American pharmaceutical standards
VI. DISCUSSION ON SIL

A SIL [20] is a measure of safety system performance, expressed as the probability of failure on demand (PFD) of a safety-critical system. There are four discrete integrity levels. The higher the SIL, the lower the probability of failure on demand and the higher the required system reliability and performance; SIL is therefore directly proportional to system complexity and cost of development. A SIL applies to an entire system; individual products or components do not have SIL ratings of their own. SILs are used when implementing a safety-critical system that must reduce an existing, intolerable process risk to a tolerable range.
Safety integrity level (SIL) is defined as a relative level of risk reduction provided by a safety function, or as a specification of a target level of risk reduction. In simple terms, SIL is a measure of the performance required of a safety instrumented function (SIF). The requirements for a given SIL are not consistent among all of the functional safety standards. In the European functional safety standards based on IEC 61508, four SILs are defined, with SIL 4 the most dependable and SIL 1 the least. A SIL is determined from a number of quantitative factors in combination with qualitative factors such as the development process and safety life cycle management.
Assignment of SIL is an exercise in risk analysis where
the risk associated with a specific hazard, that is intended
to be protected against by a SIF, is calculated without the
beneficial risk reduction effect of the SIF. That
"unmitigated" risk is then compared against a tolerable
risk target. The difference between the "unmitigated" risk
and the tolerable risk, if the "unmitigated" risk is higher
than tolerable, must be addressed through risk reduction of
the SIF. This amount of required risk reduction is correlated with the SIL target: in essence, each additional order of magnitude of required risk reduction corresponds to an increase of one in the required SIL.
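For low-demand operation, IEC 61508 is commonly summarized as relating SIL n to an average PFD between 10^-(n+1) and 10^-n, i.e. a risk reduction factor (RRF = 1/PFD) between 10^n and 10^(n+1). The C sketch below simply picks the SIL implied by a required risk reduction; the frequencies are invented, and the sketch is no substitute for the standard itself.

#include <stdio.h>

/* Commonly quoted IEC 61508 low-demand bands: SIL n covers
   PFDavg in [10^-(n+1), 10^-n), i.e. RRF in (10^n, 10^(n+1)]. */
static int sil_for_required_rrf(double rrf) {
    if (rrf <= 10.0)      return 0;   /* no SIL-rated function required */
    if (rrf <= 100.0)     return 1;
    if (rrf <= 1000.0)    return 2;
    if (rrf <= 10000.0)   return 3;
    if (rrf <= 100000.0)  return 4;
    return -1;                         /* beyond SIL 4: re-design the process */
}

int main(void) {
    double unmitigated = 1e-2;   /* hazardous-event frequency per year (assumed) */
    double tolerable   = 1e-5;   /* tolerable frequency per year (assumed) */

    double rrf = unmitigated / tolerable;   /* required risk reduction */
    printf("required RRF = %.0f -> SIL %d\n", rrf, sil_for_required_rrf(rrf));
    return 0;
}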
There are several methods used to assign a SIL. These
are normally used in combination, and may include:
Risk matrices
Risk graphs
Layers Of Protection Analysis (LOPA)
Of the methods listed above, LOPA is one of the most commonly used in large industries. The assignment may be tested using both pragmatic and controllability approaches, applying the guidance on SIL assignment published by the UK HSE. SIL assignment processes that use the HSE guidance to ratify assignments developed from risk matrices have been certified to meet IEC EN 61508.
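The LOPA arithmetic itself is straightforward: the initiating-event frequency is multiplied by the PFDs of the existing independent protection layers, and any remaining gap to the tolerable frequency must be closed by the SIF, whose required PFD then determines its SIL. The numbers in the C sketch below are invented for the example.

#include <stdio.h>

int main(void) {
    /* Invented figures for a single hazard scenario. */
    double initiating_freq = 0.1;        /* initiating events per year */
    double ipl_pfd[] = { 0.1, 0.01 };    /* PFDs of existing independent protection layers */
    double tolerable_freq = 1e-5;        /* tolerable event frequency per year */

    double mitigated = initiating_freq;
    for (int i = 0; i < 2; i++) mitigated *= ipl_pfd[i];

    if (mitigated > tolerable_freq) {
        /* The SIF must close the remaining gap; its required PFD sets the SIL
           (e.g. a required PFD between 1e-1 and 1e-2 would point to SIL 1). */
        double required_sif_pfd = tolerable_freq / mitigated;
        printf("mitigated frequency = %.2e /yr, required SIF PFD = %.2e\n",
               mitigated, required_sif_pfd);
    } else {
        printf("existing layers already meet the target\n");
    }
    return 0;
}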
The standards are application-specific, which can make it difficult to determine what to do when dealing with multidisciplinary products. Nonetheless, the standards do provide useful guidance. The most generic of the standards listed above is IEC 61508; it may always be used if a system does not fit any of the other types. All of the standards operate with the safety integrity levels (SILs) introduced above.

The concept of SILs allows a standard to define a hierarchy of levels of testing (and development). A SIL is normally applied to a subsystem; that is, we can operate with different SILs within a single system or within a system of systems. The determination of the SIL for a system under test is based on a risk analysis. The standards concerning safety-critical systems deal with both development processes and supporting processes, that is, project management, configuration management, and product quality assurance.
Take as an example CEI/IEC 61508, which recommends test case design techniques depending on the SIL of a system. The standard defines four integrity levels, SIL4, SIL3, SIL2 and SIL1, where SIL4 is the most critical. For a system classified as SIL4, the standard says that the use of equivalence partitioning is highly recommended as part of the functional testing [Fig. 1]. Furthermore, the use of boundary value analysis is highly recommended, while the use of cause-effect graphing and error guessing are only recommended. For white-box testing, a level of structural coverage is highly recommended, though the standard does not say which coverage criterion or which level. The recommendations become less strict as we move down the SILs in the standard. For highly safety-critical systems, the testers may be required to deliver a compliance statement or matrix explaining how the pertinent regulations have been followed and fulfilled.
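As a small illustration of equivalence partitioning and boundary value analysis, consider a hypothetical routine that accepts a prescribed dose only within a valid range; the range, the routine, and the test values in the C sketch below are assumptions chosen for the example, not taken from the standard.

#include <stdio.h>

/* Hypothetical function under test: accepts a prescribed dose only if it
   lies in the valid range [1, 200] (units arbitrary for the example). */
static int dose_is_valid(int dose) {
    return dose >= 1 && dose <= 200;
}

int main(void) {
    /* Equivalence partitions: below range, in range, above range.
       Boundary values: just outside, on, and just inside each boundary. */
    int cases[]    = { 0, 1, 2, 100, 199, 200, 201 };
    int expected[] = { 0, 1, 1,   1,   1,   1,   0 };

    for (int i = 0; i < 7; i++) {
        int got = dose_is_valid(cases[i]);
        printf("dose %3d -> %s (%s)\n", cases[i],
               got ? "accepted" : "rejected",
               got == expected[i] ? "as expected" : "UNEXPECTED");
    }
    return 0;
}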

Table 1. Classification of SIL

Value:       100,000,000 | 100,000 | 100 | 1
Safety:      Many people killed | Human lives in danger | Damage to physical objects, risk of personal injury | Insignificant damage to things, no risk to people
Economy:     Financial catastrophe | Great financial loss | Significant financial loss | Insignificant financial loss
Security:    Destruction/disclosure of strategic data | Destruction/disclosure of critical data | Faults in data | No risk to data
Environment: Extensive and irreparable damage to the environment and services | Reparable but comprehensive damage to the environment and services | Local damage to the environment | No environmental risk

Fig. 1. Graph of Various SILs


VII. CONCLUSION
This paper has reviewed nine system failures caused by software bugs. The failures resulted in loss of money and effort and, in some cases, loss of life. Though in some cases the loss may have been aggravated by human error or hardware failure, software as one of the causes of the systems' failure cannot be ruled out. The dependability of software systems becomes even more imperative with their increasing use in everyday life. The paper has discussed various standards and guidelines that should be adhered to while developing such software, as well as the classification of software into categories based on the probability of failure, and it has recommended precautions and measures that need to be taken during testing, based on that classification, to avoid software failures.


REFERENCES
[1] J. C. Knight, "Safety critical systems: challenges and directions", Proceedings of the 24th International Conference on Software Engineering (ICSE 2002), pp. 547-550, 2002.
[2] W. R. Dunn, "Designing safety-critical computer systems", IEEE Transactions on Computers, Vol. 36, No. 11, pp. 40-46, November 2003.
[3] M. Ben Swarup and P. S. Ramaiah, "An Approach to Modeling Software Safety in Safety-Critical Systems", Journal of Computer Science, Vol. 5, No. 4, pp. 311-320, 2009, ISSN: 1549-3636.
[4] G. Raj Kumar and K. Alagarswamy, "The most common factors for the failure of Software Development Project", TIJCSA, Vol. 1, No. 11, pp. 74-77, January 2013.
[5] T. R. S. P. Babu, D. S. Rao and P. Ratna, "Negative Testing is Trivial for Better Software Products", IJRAET, Vol. 3, No. 1, pp. 36-41, 2015.
[6] T. O. A. Lehtinen, M. V. Mantyla, J. Vanhanen, J. Itkonen and C. Lassenius, "Perceived causes of software project failures - An analysis of their relationships", Information and Software Technology, Vol. 56, No. 6, pp. 623-643, June 2014.
[7] E. E. Ogheneovo, "Software Dysfunction: Why Do Software Fail?", Journal of Computer and Communications, Vol. 2, No. 6, pp. 25-35, April 2014.
[8] R. Kaur and J. Sengupta, "Software Process Models and Analysis on Failure of Software Development Projects", International Journal of Scientific & Engineering Research, Vol. 2, No. 2, pp. 1-4, February 2011.
[9] L. J. May, "Major Causes of Software Project Failures". [Available at] http://www.cic.unb.br/~genaina/ES/ManMonth/SoftwareProjectFailures.pdf
[10] P. Dorsey, "Top 10 Reasons Why Systems Projects Fail". [Available at] http://www.ksg.harvard.edu/mrcbg/ethiopia/Publications/Top%2010%20Reasons%20Why%20Systems%20Projects%20Fail.pdf
[11] A. Short, "Reasons for software failures". [Available at] https://indico.cern.ch/event/276139/contribution/49/attachments/500995/691988/Reasons_for_software_failures.pdf
[12] http://www.itworld.com/article/2717299/it-management/mariner-1s-135-million-software-bug.html?page=2
[13] https://reports.energy.gov/BlackoutFinal-Web.pdf
[14] http://sunnyday.mit.edu/papers/therac.pdf
[15] http://www.gao.gov/products/IMTEC-92-26
[16] http://ocw.mit.edu/courses/aeronautics-and-astronautics/16422human-supervisory-control-of-automated-systemsspring2004/projects/vincennes.pdf
[17] http://www.mit.edu/hacker/part1.html
[18] Ariane 501 Inquiry Board report. [Available at] https://www.ima.umn.edu/~arnold/disasters/ariane5rep.html
[19] http://mars.jpl.nasa.gov/msp98/news/mco991110.html
[20] http://uspas.fnal.gov/materials/12UTA/11_integrity%20.pdf
