
Safety

Overview

• Case studies: root causes of accidents in complex systems
• Key issues in System Safety Engineering
• Risk Analysis
• Hazard Analysis
• System Architectures
Safety – Complexity – Responsibility: how does it fit?

Examples: Space
• 1986: Space shuttle Challenger explodes shortly after takeoff
• 1996: Ariane 5 is destroyed because of untested re-use of a software component of the navigation system
• 1999: Mars Climate Orbiter crashes on Mars due to SW interface problems
• 2003: Space shuttle Columbia burns up during re-entry
• 2004: Mars rover Spirit fails due to an unexpected storage overflow
Examples: Military
• 1991: 28 US soldiers die because the Patriot anti-missile system misses a Scud missile due to a SW error
• 1994: Two US Air Force F-15 fighters shoot down two US Army Black Hawk helicopters over the no-fly zone in northern Iraq, 26 deaths (friendly fire)
• 2001: After the reset button is pushed, an Osprey aircraft crashes; all four soldiers die
• 2006: A Predator drone gets out of control and crashes in New Mexico close to a civil airport
• 2007: In South Africa an automated anti-aircraft cannon fires uncontrollably at a parade and kills 9 soldiers
Examples: Road
• 2003: Toll system "Toll Collect" cannot start service, partly due to SW problems in the vehicle units
• 2004: DaimlerChrysler recalls 680,000 cars because of problems with the electronically controlled brake system
• 2004: Chrysler recalls 2.7 million cars due to a defect in the automated gear-shift system
• 2009: Toyota recalls 3.8 million cars due to problems with the gas pedal and with ABS SW
Examples: Energy
• 2003: Complete blackout in the US Northeast, partly due to failure of alarm and warning systems (configuration and SW errors)
• 2006: Incident at the Forsmark nuclear power plant (Sweden)
• 2010: Offshore platform Deepwater Horizon sinks
• 2011: Nuclear catastrophe at the Fukushima nuclear power plant
Examples: Healthcare
• 1985–1987: Due to SW errors, five patients die after radiation overdoses applied by the Therac-25
• 2001: In Panama five patients die from radiation overdoses because of incomplete and misunderstood SW application conditions
• 2005: Recall of diabetes measurement devices in the US due to the possibility of "mode errors" (e.g. related to measurement units)
• 2010: The NY Times reports that radiation overdoses were applied after computer crashes, due to deleted or outdated information
Examples: Maritime
• 1987: The ferry Herald of Free Enterprise leaves harbour with its bow doors open and capsizes, 193 fatalities
• 1994: The ferry Estonia sinks in the Baltic Sea due to a damaged bow door, 852 fatalities
• 1995: The autopilot-steered cruise ship Royal Majesty runs aground
• 2006: Ferry accident in Canada due to the crew's lack of understanding of the automation system
• 2012: Costa Concordia sinks (32 fatalities)
Examples: Civil Aviation
• 1991: Lauda Air crash due to in-flight activation of a thrust reverser
• 2000: Concorde catches fire during takeoff in Paris
• 2001: At Milan Linate a passenger aircraft collides with a Cessna during takeoff in heavy fog, 118 fatalities
• 2002: Mid-air collision close to Überlingen, 71 fatalities
• 2005: Crash of Helios Airways flight 522 after loss of cabin pressure, 121 fatalities
• 2008: A computer problem leads to a severe incident over Australia
Examples: Railway
• 1998: 101 passengers die in the Eschede ICE derailment
• 1999: Train collision after a red signal is passed at Ladbroke Grove (near Paddington), 31 fatalities
• 2000: 9 passengers die after a derailment due to overspeed at Brühl railway station
• 2005: Derailment of a commuter train near Osaka due to overspeed (107 fatalities, 540 injuries)
• 2009: Washington Metro crash due to a failure in the track clearance (train detection) system
• 2011: In China a train falls from a bridge after a signal failure
Complexity: Some trends
• Increasingly short cycles of technology innovation
• Increasing performance and complexity of technical systems
• Shortened product development cycles due to tightened international competition and time pressure
• Ease of modification and complex interaction of components often lead to unexpected emergent behaviour, particularly in SW-intensive systems
• Accidents arise at system level even though the components are checked, tested, assessed, certified, proven in use, ... ("system accidents")
• Superficially, the causes often seem to be SW or human errors...
What is safety culture?

Safety culture is:
• a state in which each member of staff
  – is always concerned to improve safety,
  – is conscious of what can go wrong,
  – feels personally responsible for safety;
• a disciplined, consistent manner of working by competent personnel, who
  – are sure of themselves, but not self-satisfied,
  – follow defined processes,
  – produce good teamwork and
  – communicate efficiently with each other;
• the insistence on a sound technical basis for actions and on consistent analysis and elimination of problems.

Adapted from the US Nuclear Regulatory Commission (NRC), 1990
Overview

• Case studies: root causes of accidents in complex systems
• Key issues in System Safety Engineering
• Risk Analysis
• Hazard Analysis
• System Architectures
Goals, objectives and restrictions
• Awareness – of common concepts and issues associated with achieving and assuring system safety
• Understanding – of the role of safety analysis techniques in the achievement and assurance of system safety
• Working knowledge – initial ability to understand and apply key system safety analysis techniques
• Technical systems – only technical systems are considered here; human and management aspects are covered by "Risk Analysis of Technical Systems"
Safety
• There is no commonly accepted set of terms; we define key terms consistent with some of the more common usage.
• Safety is concerned with physical systems:
  – a system is unsafe if it causes unacceptable harm, e.g. loss of life or environmental damage;
  – only physical systems can cause this sort of harm;
  – information (computer) systems can only cause harm indirectly.
• Hence the physical system context must be considered (risk analysis).
Safety vs. RAM
• Reliability: the probability that an item can perform a required function under given conditions for a given time interval (t1, t2).
• Availability: the ability of a product to be in a state to perform a required function under given conditions at a given instant of time, or over a given time interval, assuming that the required external resources are provided.
• Maintainability: the probability that a given active maintenance action for an item under given conditions of use can be carried out within a stated time interval, when the maintenance is performed under stated conditions and using stated procedures and resources.
• Reliability is mainly concerned with component failures; many accidents occur without any component failure.
• Safety is not reliability: a very reliable component may be unsafe, and a very safe system may be very unreliable.
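To make the contrast concrete, here is a minimal numerical sketch of the RAM side under the usual constant-failure-rate assumption; the rates and times are illustrative, not taken from the lecture:

```python
import math

def reliability(failure_rate: float, t1: float, t2: float) -> float:
    """Probability of surviving the interval (t1, t2), assuming a constant
    failure rate (exponential model)."""
    return math.exp(-failure_rate * (t2 - t1))

def steady_state_availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the item is able to perform its function."""
    return mttf / (mttf + mttr)

# Example: failure rate 1e-4 per hour, 1000 h mission, 8 h mean repair time
print(reliability(1e-4, 0.0, 1000.0))        # ~0.905
print(steady_state_availability(1e4, 8.0))   # ~0.9992
```

The slide's point survives the arithmetic: an item can score well on all three RAM measures and still contribute to an unsafe system, because safety depends on system-level behaviour rather than on component survival.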
HW vs. SW Safety
• SW has brought additional flexibility into systems design.
• This flexibility – the ease of changing SW – is probably the greatest source of risk in safety-related computer systems.
• SW allows more complex systems to be built than would have been possible with HW alone.
• A crucial point is that usually the systems expert is not a SW expert (and vice versa).
• SW fails differently from HW, e.g.
  – the risk related to SW usually does not decrease with operational experience;
  – two different SW modules do not fail independently.
• The same discipline as in system and HW engineering must be applied to SW engineering.
• It is crucial that the correct requirements are defined and enforced at every system level.
• As a result, safety must be built into the system; in complex systems, safety cannot be claimed on the basis of testing or field experience alone.
Safety is a system property

Terminology
• We wish to prevent accidents or reduce their frequency.
• accident – unintended event or sequence of events leading to harm: death, injury, environmental or material damage. Accidents happen in the real world!
• Accidents arise from hazards.
• hazard – physical condition of the system that threatens the safety of personnel or of the system, i.e. can lead to an accident.
• hazard identification – identifying those situations (hazards) which could lead to an accident under credible conditions.
Terminology
• Hazards are caused by failures, or failure conditions.
• fault – abnormal condition that may cause a reduction in, or loss of, the capability of a functional unit to perform its intended function.
• failure – inability of an item to perform its intended function.
• failure condition or failure mode – often used to identify a specific failure.
• Note: failure is defined vis-à-vis intent – not the specification.
• Different classes of failure:
  – systematic – failures due to flaws in design, manufacture, installation or maintenance. Items subjected to the same conditions fail consistently. Can either be avoided or tolerated.
  – random – failures due to physical causes, i.e. a variety of degradation mechanisms. Cannot be avoided, only tolerated or controlled.
Key issues in system safety

Three main issues in achieving safety:
• determine the requirements – hazard analysis and risk analysis techniques
• design the system to be safe – design is a creative activity, but structured causal analyses help guide the design process
• provide evidence of safety – role of techniques and, ultimately, production of a convincing safety case
Compliance vs. Assurance
• Compliance: a demonstration that a characteristic or property of a product satisfies the stated requirements.
• Compliance with requirements, standards, laws etc. is necessary, but is not sufficient to achieve system safety.
• Assurance means producing conclusive arguments and evidence that the system is safe.
• Only assurance can be sufficient for system safety: we have to ensure that the system behaves safely, which means we have to be convinced that the system is safe.
• Usually assurance implies compliance, but not necessarily vice versa.
Overview

• Case studies: root causes of accidents in complex systems
• Key issues in System Safety Engineering
• Risk Analysis
• Hazard Analysis
• System Architectures
Motivation

Assumptions:
• A zero-risk (complex) technological system is not feasible.
• Any risk analysis makes decisions (implicitly or explicitly) about the optimal use of limited financial resources as well as about minimizing the expected damage resulting from use of the system.
• Decisions are also taken without any risk analysis.
• Maximization of technical safety generally does not lead to minimal risk.
• A system is safe if the risk associated with its usage is below the tolerable risk.
• A safe system is not a zero-risk system.
Normative Requirements – IEC 62278 (EN 50126)
• Risk analysis is mandatory (in phase 3 and ongoing).
• Goals and methods have to be agreed (62278 gives only an example).
• Risk = function(accident severity, accident frequency).
• Risk analysis is a means to clearly delineate responsibility between operator and supplier.

[Risk matrix, reconstructed from the slide figure: frequency of occurrence of a hazardous event (Frequent, Probable, Occasional, Remote, Improbable, Incredible) against severity levels of the hazard consequence (Insignificant, Marginal, Critical, Catastrophic); each cell is assigned a risk level ranging from Tolerable to Intolerable. Tolerable risk and the risk categories have to be individually defined.]
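The matrix can be read as a simple lookup from frequency and severity categories to a risk level. The sketch below is only illustrative: the cell-to-level mapping, and the intermediate level names "undesirable" and "negligible", are assumptions in the spirit of the EN 50126 example and must be agreed per application, as the slide notes.

```python
FREQUENCIES = ["incredible", "improbable", "remote", "occasional", "probable", "frequent"]
SEVERITIES = ["insignificant", "marginal", "critical", "catastrophic"]

def risk_class(frequency: str, severity: str) -> str:
    """Illustrative frequency x severity -> risk class lookup.
    The actual class boundaries must be agreed per project."""
    f = FREQUENCIES.index(frequency)   # 0 = least frequent
    s = SEVERITIES.index(severity)     # 0 = least severe
    score = f + s                      # crude ordering, an assumption only
    if score >= 7:
        return "intolerable"
    if score >= 5:
        return "undesirable"
    if score >= 3:
        return "tolerable"
    return "negligible"

print(risk_class("frequent", "catastrophic"))   # intolerable
print(risk_class("improbable", "marginal"))     # negligible
```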
Risk: Definition
Example: risk of casualty of a 22-year-old male Swiss car driver
• 2.5 × 10^-4 per year
• 1 : 4000 per year
• 1 : 1 000 000 per operating hour (250 hours/year and 10 000 km/year)
• 0.5 % throughout life (over 40 years, assuming risk reduction by a factor of 2 for senior drivers)
• 2.5 × 10^-8 per kilometre
• mean reduction of life expectancy by 77 days
• 20 % increase of risk of casualty
• etc.

A concise definition of units and assumptions is necessary! These are statistical results only.
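The figures above are mutually consistent once the units and exposure assumptions are fixed. A small arithmetic check, using the assumptions stated on the slide (250 driving hours and 10 000 km per year; averaging over 40 years with a factor-of-2 risk reduction assumed for older drivers):

```python
risk_per_year = 2.5e-4                     # casualty risk per year

per_year_odds = 1 / risk_per_year          # -> 1 : 4000 per year
per_hour = risk_per_year / 250             # -> 1e-6 per operating hour
per_km = risk_per_year / 10_000            # -> 2.5e-8 per kilometre
lifetime = risk_per_year * 40 / 2          # -> 0.5 % over 40 years (factor-2
                                           #    reduction assumed for senior drivers)

print(per_year_odds, per_hour, per_km, lifetime)
# 4000.0 1e-06 2.5e-08 0.005
```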
Terminology

[Figure: individual risks r_i plotted per individual, with the tolerable individual risk and the mean individual risk marked; the collective risk aggregates over all individuals.]

• Risk – a (monotone) function of accident severity and frequency; usually the expected harm per time unit (for a particular population).
• Individual risk – related to a particular individual.
• Collective risk – related to a particular group of people or to society.
• Risk aversion – increased weighting of high damages or repeated accidents.
• Tolerable risk – the maximum acceptable level of risk.
• Residual risk – the risk which remains after implementation of all safety measures; it consists of consciously accepted, falsely estimated and unidentified risks.
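A minimal sketch to make the individual/collective distinction and the risk-aversion idea concrete; all numbers, the tolerability criterion and the weighting exponent are illustrative assumptions:

```python
# Illustrative individual annual risks r_i for a small group (assumed values)
individual_risks = [1e-5, 2e-5, 5e-6, 8e-5, 1e-4]

collective_risk = sum(individual_risks)               # expected harm per year for the group
mean_individual_risk = collective_risk / len(individual_risks)

TOLERABLE_INDIVIDUAL_RISK = 1e-4                      # assumed tolerability criterion
within_tolerable = all(r <= TOLERABLE_INDIVIDUAL_RISK for r in individual_risks)

# Risk aversion: weight rare, high-consequence events more than proportionally,
# e.g. expected value of fatalities**alpha with alpha > 1 (illustrative only)
scenarios = [(1e-4, 1), (1e-6, 50)]                   # (frequency per year, fatalities)
alpha = 1.2
risk_neutral = sum(f * n for f, n in scenarios)           # plain expected fatalities/year
risk_averse = sum(f * n ** alpha for f, n in scenarios)   # aversion-weighted measure

print(collective_risk, mean_individual_risk, within_tolerable)
print(risk_neutral, risk_averse)
```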
System/Function
Problem: the definitions for system, function etc. are arbitrary...
• system – set of interrelated elements considered in a defined context as a whole and separated from their environment (IEC AC/7/2004); set of sub-systems or elements which interact according to a design.
• sub-system – portion of a system which fulfils a specialised function.
• function – a mode of action or activity by which a product fulfils its purpose.
• element – a part of a product that has been determined to be a basic unit or building block. An element may be simple or complex.
... and there will probably never be a general and unambiguous definition.
Overview

• Case studies: root causes of accidents in complex systems
• Key issues in System Safety Engineering
• Risk Analysis
• Hazard Analysis
• System Architectures
Hazard
"Hazard: a physical situation with a potential for human injury." (IEC 62278)
• Unspecific! Anything may be considered a hazard!

Alternative:
"A hazard is a state or set of conditions of a system (or an object) that, together with other conditions in the environment of the system (or object), will lead inevitably to an accident (loss event). [...] A hazard is defined with respect to the environment of the system or component. [...] What constitutes a hazard depends upon where the boundaries of the system are drawn. [...]"
Leveson: Safeware, 1995
Hazards, Causes and Accidents

[Diagram: causes at subsystem level propagate across the subsystem boundary to a hazard defined at the system boundary; together with conditions in the environment, the hazard can lead to different accidents (accident k, accident l, ...). Causes lie inside the system boundary, consequences outside it.]

Hazards should be defined at a unique (high) level with respect to the interface between the system and its environment!
Basic process (based on EN 50129)

Risk analysis:
1. System definition
2. Hazard identification
3. Consequence analysis
4. Loss analysis
5. Risk assessment
Result: hazards and tolerable hazard rates (H, THR)

Hazard control – apportionment of qualitative and quantitative requirements:
6. Causal analysis
7. CCF (common-cause failure) analysis
8. SIL allocation (using the SIL table)
Result: functions with THR and SIL; after design and implementation, components with failure rates (FR) and SIL.
Definition of safety requirements

[Process flow, reconstructed from the slide figure: starting from the system definition and an analysis of operation, hazards are identified and recorded in the hazard log; hazard rates are estimated and hazard consequences are identified (accidents, near misses, safe states); from these the risk is determined and compared against risk tolerability criteria, yielding tolerable hazard rates (THR); the results feed the system requirements specification (safety requirements) and the subsequent system design analysis.]
Process view

[Process flow, reconstructed from the slide figure: risk analysis yields hazards and THRs; for each hazard, a causal analysis of the system functions is performed, producing safety-related application conditions; for each function, the contributions to the hazards are collected; for each subsystem, the system architecture description determines which function is realised by which subsystem; THR and SIL are then determined via the SIL table and written into the subsystem requirements specification, and failure rates (FR) for system elements are derived from the system design description.]
Independence
• Functional independence implies that there are neither systematic nor random faults which cause a set of functions to fail simultaneously.
• Physical independence implies that there are no random faults or external influences which cause a set of functions to fail simultaneously.
• Interpretation: if system failure is implied by the failure of two units (A AND B), the safety requirements for the units may be reduced (in comparison to the system safety requirements), depending on the type of independence. In the case of functional independence, both SIL and failure-rate requirements are reduced; in the case of physical independence, only the failure-rate requirements are reduced.
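A hedged numerical illustration of the AND-gate argument for random failures; the rates, the exposure time, the beta factor and the independence assumption itself are all illustrative, and the systematic (SIL) side of functional independence cannot be captured by rates at all:

```python
# System hazard requires BOTH unit A and unit B to fail (logical AND).
# If A and B are independent, their dangerous-failure probabilities multiply,
# so each unit can carry a weaker requirement than the system-level target.

lambda_a = 1e-4   # dangerous failure rate of unit A, per hour (assumed)
lambda_b = 1e-4   # dangerous failure rate of unit B, per hour (assumed)
exposure = 10.0   # window in which both failures must coincide, hours (assumed)

p_a = lambda_a * exposure            # small-probability approximation
p_b = lambda_b * exposure
p_system_independent = p_a * p_b     # ~1e-6: far below each unit's own probability

# With a common-cause (dependent) fraction beta, the benefit is limited:
beta = 0.05
p_system_with_ccf = beta * p_a + (1 - beta) * p_a * p_b

print(p_system_independent, p_system_with_ccf)   # ~1e-6 vs ~5.1e-5
```

The second figure shows why the independence claim has to be justified: even a small dependent fraction dominates the combined probability.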
Apportionment of safety requirements

[Fault-tree example, reconstructed from the slide figure, for a level crossing (LC): starting from a hazard and its THR obtained from the risk analysis, the top events (e.g. "late or no switch-in", "LC set back to normal position") are broken down via OR-gates into undetected failures of the switch-in function, the distant signal, the power supply, the road-side warnings and the LC controller (apportioned rates of 1E-7 per hour each) and undetected failures of the light signals and the barriers (7E-6 per hour each), under independence and system-architecture assumptions; the apportionment is checked against the THR. The SIL table is then used to determine THR + SIL for subsystems and to apportion SIL + FR to elements.]
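A minimal sketch of the check step in the apportionment above: rates combined through OR-gates simply add up (small-rate approximation), and the sum must stay below the hazard's THR. The element names and values only loosely mirror the level-crossing figure, and the top-level THR is an assumption:

```python
# Apportioned undetected-failure rates per hour (illustrative values, loosely
# following the level-crossing figure above)
element_rates = {
    "switch-in function": 1e-7,
    "distant signal":     1e-7,
    "power supply":       1e-7,
    "road-side warnings": 1e-7,
    "LC controller":      1e-7,
    "light signals":      7e-6,
    "barriers":           7e-6,
}

THR_HAZARD = 2e-5   # assumed tolerable hazard rate for the top event, per hour

# OR-gate combination: for small rates the element contributions simply add up
combined_rate = sum(element_rates.values())

print(f"combined rate = {combined_rate:.2e} /h, hazard THR = {THR_HAZARD:.0e} /h")
print("apportionment OK" if combined_rate <= THR_HAZARD else "apportionment violates THR")
```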
Safety Integrity Levels

[Figure: a system failure results (OR-gate) from systematic faults – human and physical – which are hardly quantifiable and are addressed by qualitative levels (SILs), and from random failures, which are quantifiable by failure rates.]

• A means to create a balance between systematic and random failures.
• A heuristic only.
• State of the art, used by most safety standards.
• Requirements must be defined for both systematic and random failures.
SIL table

Safety Integrity Level | Tolerable Hazard Rate (THR) per hour and per function
4 | 10^-9 ≤ THR < 10^-8
3 | 10^-8 ≤ THR < 10^-7
2 | 10^-7 ≤ THR < 10^-6
1 | 10^-6 ≤ THR < 10^-5

• SILs are assigned to safety functions and are "inherited" by the components implementing the safety functions.
• The balance between systematic and random failures is achieved by means of a SIL allocation table.
• The procedure is described in detail in EN 50129.
• Same table as in IEC 61508 (but for the continuous mode of operation only).
• The approach is based on heuristics and experience rather than on scientific evidence.
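The table translates directly into a lookup from a function's THR to its SIL. A sketch, assuming the usual reading that the lower bound of each band is inclusive:

```python
def sil_for_thr(thr_per_hour: float) -> int:
    """Map a tolerable hazard rate (per hour, per safety function) to a SIL,
    following the EN 50129 / IEC 61508 style table above."""
    bands = [
        (1e-9, 1e-8, 4),
        (1e-8, 1e-7, 3),
        (1e-7, 1e-6, 2),
        (1e-6, 1e-5, 1),
    ]
    for low, high, sil in bands:
        if low <= thr_per_hour < high:
            return sil
    raise ValueError("THR outside the SIL 1-4 range; needs project-specific treatment")

print(sil_for_thr(3e-9))   # 4
print(sil_for_thr(5e-7))   # 2
```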
Conclusions
• Production of a safety case is the principal objective of all the safety lifecycle activities.
• The objective of the safety case is to 'pull together' many forms of information and present a coherent argument of safety.
• A standard structure is helpful.
• There are problems of scale and complexity; the 'chain of argument' can often get lost.
• Goal structuring can be a useful means of presenting safety arguments and defining re-usable safety patterns.
Overview

• Case studies: root causes of accidents in complex systems
• Key issues in System Safety Engineering
• Risk Analysis
• Hazard Analysis
• System Architectures
Architectures

Single board Controller Model

1oo1 Architecture

[Figures: 1oo1 PFD fault tree; 1oo1 PFS fault tree]
1oo1 Architecture

[Figure: 1oo1 Markov model]
1oo1 Architecture

[Figure: 1oo1 architecture with watchdog timer]
1oo2 Architecture

[Figure: 1oo2 architecture]
1oo2 Architecture

[Figure: 1oo2 PFD fault tree]
2oo2 Architecture

Note: 2oo2 is used where it is undesirable for the system to fail with its outputs de-energized, i.e. where spurious (false) trips must be avoided.
2oo2 Architecture

[Figure: 2oo2 PFD fault tree]
1oo1D Architecture

2oo3 Architecture

[Figure: 2oo3 architecture single-fault degradation models]
2oo3 Architecture
[Figures: 2oo3 architecture single-fault degradation models; 2oo3 architecture dual-fault modes]
2oo2D Architecture / 1oo2D Architecture

[Figure: 1oo2D with a DU (dangerous undetected) failure in one unit]
1oo2D Architecture with Comparison

Comparing Architectures
Single Board Controller Model Results
• Highest safety (lowest PFD): 1oo2, followed by 2oo3 (3 × 1oo2)
• Lowest PFS: 2oo2

Single Safety Controller Model Results
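The ranking above can be reproduced with the simplest combinatorial approximations for the probability of failure on demand (PFD) and the probability of a spurious/safe trip (PFS), ignoring diagnostics and common-cause failures. The formulas and channel probabilities below are textbook simplifications, not the models behind the lecture figures:

```python
# Simplified, common-cause-free comparison of voting architectures.
# q_d: probability that a single channel has failed dangerously (undetected)
# q_s: probability that a single channel produces a spurious (safe) trip
# Both values are illustrative assumptions.

q_d = 1e-3
q_s = 1e-2

architectures = {
    #        PFD (fails to act)          PFS (acts spuriously)
    "1oo1": (q_d,                        q_s),
    "1oo2": (q_d**2,                     1 - (1 - q_s)**2),    # either channel trips
    "2oo2": (1 - (1 - q_d)**2,           q_s**2),              # both must trip
    "2oo3": (3*q_d**2 - 2*q_d**3,        3*q_s**2 - 2*q_s**3), # any two of three
}

for name, (pfd, pfs) in architectures.items():
    print(f"{name}: PFD={pfd:.2e}  PFS={pfs:.2e}")
```

Running it gives 1oo2 the lowest PFD and 2oo2 the lowest PFS, with 2oo3 a compromise on both, matching the bullets above.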
Comparing Architectures
Single Safety Controller Model Results for All Architectures
Legend: ft – fault tree; mm – Markov model

Conclusions:
• ft and mm results are practically identical.
• 1oo1D is safer than 1oo1, but the cost is a higher false-trip rate (PFS).
• Lowest PFD: 1oo2DComp (used in many commercial safety PLC implementations), but with a higher PFS compared with 1oo2D.
• An optimal architecture design with good self-diagnostics would not need comparison.
• 2oo3, 1oo2D and 1oo2DComp provide excellent safety.
• 2oo2, 2oo2D, 2oo3, 1oo2 and 1oo2DComp provide excellent operation with a low PFS.
• 2oo2D is the BEST COMPROMISE if the automatic diagnostics are excellent (microprocessors specifically designed for use in IEC 61508-certified systems); it is then better than the traditional 2oo3.
Comparing Architectures
Single Safety Controller Model Results for All Architectures
• 1oo1D offers better safety than 1oo1, but the cost is a higher false-trip rate.
• Lowest PFD: 1oo2D with comparison diagnostics (many implementations); the drawback is a higher PFS.
• 2oo3, 1oo2D and 1oo2D with comparison diagnostics provide excellent safety.
• 2oo2, 2oo2D, 2oo3, 1oo2D and 1oo2D with comparison diagnostics provide excellent operation without an excessive false-trip rate (low PFS).
• 2oo2D is the best compromise if the automatic diagnostics are excellent (IEC 61508); it is then better than 2oo3.
