Anda di halaman 1dari 108

Probability and Statistics with

Reliability, Queuing and Computer


Science Applications: Chapter 8 on
Continuous-Time Markov Chains

Kishor Trivedi

Non-State Space Models


Recall that non-state-space models like RBDs
and FTs can easily be formulated and (assuming
statistical independence) solved for system
reliability, system availability and system MTTF
Each component can have attached to it

A probability of failure
A failure rate
A distribution of time to failure
Steady-state and instantaneous unavailability

Markov chain
To model complex interactions between components,
use other kinds of models like Markov chains or
more generally state space models.
Many examples of dependencies among system
components have been observed in practice and
captured by Markov models.

MARKOV CHAINS
State-space based model
States represent various conditions of the system
Transitions between states indicate occurrences of
events

State-Space-Based Models
States and labeled state transitions
State can keep track of:
Number of functioning resources of each type
States of recovery for each failed resource
Number of tasks of each type waiting at each
resource
Allocation of resources to tasks

A transition:
Can occur from any state to any other state
Can represent a simple or a compound event

State-Space-Based Model (Continued)


Drawn as a directed graph
Transition label:
Probability: homogeneous discrete-time Markov chain (DTMC)
Rate: homogeneous continuous-time Markov chain (CTMC)
Time-dependent rate: non-homogeneous CTMC
Distribution function: semi-Markov process (SMP)
Two distribution functions; Markov regenerative process (MRGP)

MARKOV CHAINS

(Continued)

For continuous-time Markov chains (CTMCs)


the time variable associated with the system
evolution is continuous
We will mean a CTMC whenever we speak of
Markov model (chain)

Chapter 8
Continuous Time Markov Chains

Formal Definition
A discrete-state continuous-time stochastic process
is called a Markov chain if
for t0 < t1 < t2 < . < tn < t , the conditional pmf
satisfies the following Markov property:

A CTMC is characterized by state changes that can


occur at any arbitrary time
Index space is continuous.
The state space is discrete valued.

Continuous Time Markov Chain


(CTMC)
A CTMC can be completely described by:
Initial state probability vector for X(t0):
Transition probability functions (over an
interval)

pmf of X(t)
Using the theorem of total probability

If v = 0 in the above equation, we get

Homogenous CTMCs

is a (time-)homogenous CTMC iff

Or, the conditional pmf satisfies:

A CTMC is said to be irreducible if every state can


be reached from every other state, with a non-zero
probability.
A state is said to be absorbing if no other state can be
reached from it with non-zero probability.
Notion of transient, recurrent non-null, recurrent null
are the same as in a DTMC. There is no notion of
periodicity in a CTMC, however.

CTMC Dynamics
Chapman-Kolmogorov Equation

Note that these transition probabilities are functions of


elapsed time and not of the number of elapsed steps
The direct use of the this equation is difficult unlike the
case of DTMC where we could anchor on one-step
transition probabilities
Hence the notion of rates of transitions which follows
next

Transition Rates
Define the rates (probabilities per unit time):
net rate out of state j at time t:

the rate from state i to state j at time t:

Kolmogorov Differential
Equation
The transition probabilities and transition rates are,

Dividing both sides by h and taking the limit,

Kolmogorov Differential Equation


(contd.)
Kolmogorovs backward equation,
Writing these eqs. in the matrix form,

Homogeneous CTMC
Specialize to HCTMC (Kolmogorov diff. eqn) :

In the matrix form, (Matrix Q is called the infinitesimal


generator matrix (or simply Generator
Matrix))

CTMC Steady-state Solution

Steady state solution of CTMC obtained by solving the


following balance equations:

Irreducible CTMCs with all states recurrent non-null will have


+ve steady-state {j} values that are unique and independent of
the initial probability vector. All states of a finite irreducible
CTMC will be recurrent non-null.
Measures of interest may be computed by assigning reward
rates to states and computing expected steady state reward rate:

CTMC Measures

Measures of interest may be computed by assigning reward rates to states and


computing expected reward rate at time t:

Expected accumulated reward (over an interval of time)

Lj(t) is the expected time spent in state j during (0,t)

(LTODE)

Markov Availability Model

2-State Markov Availability Model

UP
1

DN
0

1
MTTF

1
MTTR

1) Steady-state balance equations for each state:


Rate of flow IN = rate of flow OUT
State1:
State0:

0 1

1
0
2 unknowns, 2 equations, but there is only
one independent

equation.

2-State Markov Availability Model


(Continued)

0 1 1

Need an additional equation:

1
1 1 1 1

1
MTTF
1 Ass

1 1 MTTF MTTR
1

MTTR
1 Ass
MTTF MTTR

Downtime in minutes per year =

MTTR
MTTF MTTR

* 8760*60

Ass 0.99999 1 Ass 10 5 DTMY 5.356 min

2-State Markov Availability Model


(Continued)

2) Transient Availability for each state:


Rate of buildup = rate of flow IN - rate of flow OUT

d 1
0 (t ) 1 (t )
dt
d 1
(1 1 (t )) 1 (t )
since 0 (t ) 1 (t ) 1 we have
dt
d 1
( ) 1 (t )
dt
This equation can be solved to obtain assuming 1(0)=1

( ) t
1 (t ) A(t )

2-State Markov Availability Model


(Continued)

3)

R (t ) e t

4) Steady State Availability:

lim A(t ) Ass


t

Markov availability model


Assume we have a two-component parallel
redundant system with repair rate .
Assume that the failure rate of both the components
is .
When both the components have failed, the system
is considered to have failed.

Markov availability model


(Continued)

Let the number of properly functioning components be the


state of the system. The state space is {0,1,2} where 0 is
the system down state.
We wish to examine effects of shared vs. non-shared repair.

Markov availability model


(Continued)

2
2

Non-shared (independent)
repair

Shared repair

Markov availability model


(Continued)

Note: Non-shared case can be modeled & solved using a RBD


or a FTREE but shared case needs the use of Markov chains.

Steady-state balance equations


For any state:
Rate of flow in = Rate of flow out
Consider the shared case

i: steady state probability that system is in state i

2 2 1
( ) 1 2 2 0

1 0

Steady-state balance equations


(Continued)

Hence

Since
We have
or

2
1 1 0

2
0 1 2 1


0 0
1
0
2

1 2
2

0 1
2

Steady-state balance equations


(Continued)
Steady-state unavailability = 0= 1 - Ashared
Similarly for non-shared case,
steady-state unavailability = 1 - Anon-shared

Downtime in minutes per year = (1 - A)* 8760*60

1 Anon shared

2
2
1
2

Steady-state balance equations

A larger example
Return to the 2 control and 3 voice channels example and
assume that the control channel failure rate is c, voice channel
failure rate is v.
Repair rates are c and v, respectively. Assuming a single
shared repair facility and control channel having preemptive
repair priority over voice channels, draw the state diagram of a
Markov availability model. Using SHARPE GUI, solve the
Markov chain for steady-state and instantaneous availability.

WFS Example

A Workstations-Fileserver
Example
Computing system consisting of:
A file-server
Two workstations
Computing network connecting them

System operational as long as:


One of the Workstations
and
The file-server are operational

Computer network is assumed to be faultfree

The WFS Example

Markov Chain for WFS Example


Assuming exponentially distributed times to
failure
w : failure rate of workstation
f : failure rate of file-server

Assume that components are repairable


w: repair rate of workstation
f: repair rate of file-server

File-server has (preemptive) priority for


repair over workstations (such repair priority
cannot be captured by non-state-space
models)

Markov Availability Model for


WFS
w

2w
2,1

1,1

0,1

w
f

w
f

2w
2,0

1,0

0,0

Since all states are reachable from every other states, the
CTMC is irreducible. Furthermore, all states are positive
recurrent.

Markov Availability Model for


WFS (Continued)
In this figure, the label (i,j) of each state is
interpreted as follows: i represents the number of
workstations that are still functioning and j is 1
or 0 depending on whether the file-server is up
or down respectively.

Markov Model
Let {X(t), t > 0} represent a finite-state
Continuous Time Markov Chain (CTMC)
with state space .
Infinitesimal Generator Matrix Q = [qij]:
qij (i !j) : transition rate from state i to state j
qii = - qi=

j i qij , the diagonal element

Markov Availability Model for


WFS (Continued)
For the example problem, with the states ordered
as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0) the Q
matrix is given by:

f
2w
0
0
0
( f 2 w )

)
0
2

0
0
f
f
w
w

0
( w f w )
f
w
0
= w

0
0

)
0

f
f
w
w

0
0
w
0
( w f ) f

0
0
0
0

f
f

Markov Model (steady-state)


: Steady-state probability vector

Q 0,

( ( 2,1) , ( 2, 0) , (1,1) , (1,0 ) , ( 0,1) , ( 0, 0) )


These are called steady-state balance
equations
rate of flow
, in = rate of flow out

after solving A
for
SS
availability

obtain
( 2,1) Steady-state
(1,1)

Markov Model (transient)


(t):transient state probability vector
(0): initial probability vector of the CTMC
Transient behavior described by the
Kolmogorov differential equation (KDE):
d
(t ) (t )Q,
dt

given (0)

Markov Availability Model


We compute the availability of the system:System is
available as long as it is in states (2,1) and (1,1).
Instantaneous availability of the system:

A (t ) ( 2 ,1) (t ) (1,1) (t )
Ass
lim
A
(
t
)

Availability (Continued)
Interval Availability:

Steady-State Availability:

There are three kinds of Availabilities!


Instantaneous, Interval & Steady-state

A (t )
I

A( x) dx
t

Expected uptime in (0, t ]

ASS lim A(t ) lim AI (t )


t

Markov Availability Model


(Continued)
t

Define L(t ) (u )du


o

L(i,j)(t): Expected Total Time Spent in State (i,j)


during (0,t)
Integrating the KDE, we get the LTODE:

d
L(t ) L(t )Q (0) ,
dt

L ( 0) 0

Interval availability

AI (t )

L( 2,1) (t ) L(1,1) (t )
t

Markov Availability Model Results

w 0.0001 hr 1 , f 0.00005 hr 1 , w 1.0 hr 1 , f 0.5 hr 1

Ass 0.9999

2-component Availability model with


finite Detection delay
2-component availability model

Steady state availability Ass = 1-0

Failure detection stage takes random time, EXP()

Down states are 0 and 1D Ass = 1- 0- 1D


Therefore, steady state unavailability U() is given by

Redundant System with Finite


Detection Switchover Time
After solving the Markov model, we obtain steady-state probabilities:

Can solve in closed-form or using SHARPE

2 , 1D , 1 , 0
Asys 2 1 (or 1D )

Closed-form


2
2

0 [1

] 1

( ) 22
0

1
E


1
E
2
1
1D
( ) E
2

1
2
22 E
1

A 2 1 r 1 D

2

2
( 2

r
)/ E
2
( )

2-component availability model


with imperfect coverage
Coverage factor = c (conditional probability
that the fault is correctly handled)

1C state is a reboot (down) state.

2-components availability
model : delay + imperfect
coverage

Model has detection delay + imperfect


coverage
Down states are 0, 1C and 1D.

Modeling Software Faults


Operating System Failure
Availability model with hardware and software (OS)
redundancy; operational phase; Heisenbugs
Probability & Statistics with Reliability,
Queuing and Computer Science Applications
(2nd ed.)
K. S. Trivedi
John Wiley, 2001.

Assumptions

Hardware failures are permanent


A repair or replacement action
while OS failures are cleared by a
reboot
Repair or reboot takes place at
rates and for the hardware
and OS, respectively.

Webserver Availability Model


with warm Replication
Two nodes for hardware redundancy
Each node has a copy of the webserver (software
redundancy replication)
Primary node can fail
Secondary node can fail
Primary process can fail
Secondary process can fail
Failures may have imperfect coverage
Time delay for fault detection
Model of a real system developed at Avaya Labs

Modeling Software Faults


Application Failure
Availability model with passive redundancy
(warm replication) of application; Operational phase;
Heisenbugs or hardware transients
Assumptions

A web server software,


that fails at the rate p
running on a machine
that fails at the rate m

Mean time to detect


server process failure 1 and the mean time to
p
detect machine failure
-1m

The mean restart time


of a machine -1m

The mean restart time


of a server -1p

Performance and Reliability Evaluation of Passive Replication Schemes in Application Lev


Fault-Tolerance
S. Garg, Y. Huang, C. Kintala, K. S. Trivedi and S. Yagnik
Proc. of the 29th Intl. Symp. On Fault-Tolerant Computing, FTCS-29, June 1999.

Parameters
Process MTTF = 10 days (1/p)
Node MTTF = 20 days (1/n)
Process polling interval = 2 seconds (1/p)
Mean process restart time = 30 seconds (1/p)
Mean process failover time = 2 minutes (1/n)
Switching time with mean 1/ s
C = 0.95

Solution for warm replication

Modeling an N+1 Protection


System

Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model - approximation
method using a 3-stage Erlang distribution
Using equations of the underlying SemiMarkov Process
Solutions for the models

Description of the system


N = Number of protected units (we use N=1)
= Unit failure rate
= Unit restoration rate
T = deterministic time between routine diagnostics
c = Probability that a protection switch
successfully restores service
d = Probability that a failure in the standby unit is
detected

Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model - approximation
method using a 3-stage Erlang distribution
Using equations of the underlying SemiMarkov Process
Solutions for the models

Hot Standby with different coverages


Normal
(1+1)
(1-d)

(1-c)

(c+d)

Protection
Switch

Failure

Failure to
Detect
Protection
Fault

Simplex
(1)
2

Failed
(0)

Normal:
1
Protection Switch Failure:
2
Simplex:
3
Failure to detect protection fault: 4
Failed:
5

N=1

Diagnostics; Using a rate


approximation
Normal
(1+1)
(1-d)

(1-c)

(c+d)

Protection
Switch

Failure

Simplex
(1)
2

Failed
(0)
Time to diagnostic is
exponentially distributed
with mean T/2

2/T

Failure to
Detect
Protection
Fault

Normal:
1
Protection Switch Failure:
2
Simplex:
3
Failure to detect protection fault: 4
Failed:
5

N=1

Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
Using equations of the underlying SemiMarkov Process
Solutions for the models

1.8

Comparison of
probability density functions (pdf)

1.6
1.4
1.2
1
pdf

3-stage Erlang pdf


U(0,1) pdf

0.8
0.6
0.4
0.2
0

T=1
time

1.2

Comparison of cumulative distribution


functions (cdf)

cdf

0.8

3-stage Erlang cdf

0.6

U(0,1) cdf

0.4

0.2

T=1
time

Using a 3-stage Erlang approximation to a


uniform distribution
Normal
(1+1)

(1-d)

(1-c)

(c+d)

Protection
Switch
Failure

Time to
diagnostic is
uniformly
distributed over
(0,T) approximated
by a 3-stage
Erlang with
mean T/2

Simplex
(1)

s1

s2

6/T

Failure to
Detect
Protection
Fault

6/T

Failed
(0)

6/T

Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
Using equations of the underlying SemiMarkov Process
Solutions for the models

Using a Semi-Markov model approximation method using an Erlang


distribution (N=1) E(t) -> 3-stage Erlang distribution
given by,
31

Normal
(1+1)
(1-d)

(1-c)

(c+d)

Time to
diagnostic is
uniformly
distributed over
(0,T) approximated
by a 3-stage
Erlang
distribution
with mean T/2

Protection
Switch

Failure

Simplex
(1)

Failure to
Detect
Protection
E(t)
Fault

Failed
(0)

k 0

( T6 t ) k
k!

T6 t

Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
Using equations of the underlying SemiMarkov Process
Solutions for the models

Using Equations of the underlying


Semi-Markov Process
Steady state solution
One step transition probability matrix, P of the
embedded DTMC

0
0

1-c
2

0
0
0
0

cd
2

0
T
1
(
1

e
)
T
1

1-d
2

0
0
0
0

1
T

(1 e
0

Using Equations of the underlying


Semi-Markov Process (Continued)
Solve v vP to obtain,

v v,

1 c
1 2

v,

1 d
1 2

v,

v1 , (

(1 c )
2( )

1 d
2

(1

where v1 1 1c 1d (1c ) 1 1d (1 1 (1e T ))


2

2 ( )

1
T

(1 e

)))v1

Using Equations of the underlying


Semi-Markov Process (Continued)
Time to the next diagnostic is uniformly distributed over (0,T)

H i (t ) : CDF of the sojourn time in state i


H1 (t ) 1 e 2 t ,
H 3 (t ) 1 e

( )t

H 2 (t ) 1 e ( )t ,
,

H 5 (t ) 1 e
tT

1,

tT

H 4 (t )

1 (1 Tt )e t ,

2 t

Using Equations of the underlying


Semi-Markov Process (Continued)

hi : mean sojourn time in state i [1-H i(t)]dt


0

h1

1
2

, h2

, h3

, h4
1

1
T 2

(1 e

State probabilities of the SMP are given by,

vi hi
5

v jhj
j 1

Unavailability 2 5

), h5

1
2

Outline
Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a
uniform distribution
Using a Semi-Markov model approximation method using a 3-stage
Erlang distribution
Using equations of the underlying SemiMarkov Process
Solutions for the models

Solutions for the models


Parameter values assumed:
N=1
c = 0.9
d = 0.9
= 0.0001 / hour
= 1 / hour
T = 1 hour

Results obtained
Steady state availability

Probability of being in states Normal,


Simplex, or Failure to Detect Protection Fault

Steady state unavailability

Probability of being in states Protection Switch


Failure, or Failed (0)

Average downtime in steady state

Steady state unavailability * Number of minutes


in a year

Average #units available


2*PNormal + 1*PSimplex +1*PFailuretoDetectProtectionFault

Diagnostic start time


approximation

Steady state availability

Steady state
unavailability

Avg. downtime in
steady state
(Minutes/year)

Avg. #units available (out of 1 + 1


spare)

Exp(2/T)

9.99989992e-01

1.00075983e-05

5.25999

1.99977503e+00

3-stage Erlang with mean


T/2

9.99989992e-01

1.00075983e-05

5.25999

1.99977503e+00

Semi-Markov (3-stage
Erlang approx. with mean
T/2)

9.99989992e-01

1.00075983e-05

5.25999

1.99977503e+00

Semi-Markov Process
equations-U([0,T])

9.99989992e-01

1.00075983e-05

5.25999

1.99977503e+00

Markov Reliability Model

Markov reliability model


with repair

Consider the 2-component parallel system (no delay +


perfect cov) but disallow repair from system down state
Note that state 0 is now an absorbing state. The state
diagram is given in the following figure.
This reliability model with repair cannot be modeled using
a reliability block diagram or a fault tree. We need to
resort to Markov chains. (This is a form of dependency
since in order to repair a component you need to know the
status of the other component).

Markov reliability model


with repair (Continued)

Absorbing state

Markov chain has an absorbing state. In the


steady-state, system will be in state 0 with
probability 1. Hence transient analysis is of
interest. States 1 and 2 are transient states.

Markov reliability model


with repair (Continued)
Assume that the initial state of the Markov chain
is 2, that is, 2(0) = 1, k (0) = 0 for k = 0, 1.
Then the system of differential Equations is written
based on:
rate of buildup = rate of flow in - rate of flow out
for each state

Markov reliability model


with repair (Continued)
d 2 (t )
2 2 (t ) 1 (t )
dt
d 1 (t )
2 2 (t ) ( ) 1 (t )
dt

d 0 (t )
1 (t )
dt

Markov reliability model


with repair (Continued)
After solving these equations, we get
R(t) = 2(t) +1(t)

Recalling that MTTF R(t ) dt


0

MTTF
2
2 2

, we get:

Markov reliability model


with repair (Continued)
Note that the MTTF of the two component
parallel redundant system, in the absence
of a repair facility (i.e., = 0), would have
been equal to the first term,
3 / ( 2* ), in the above expression.
Therefore, the effect of a repair facility is to
increase the mean life by / (2*2), or by a

factor (1 22 ) 1
3

Markov Reliability Model with


Repair ( WFS Example)
Assume that the computer system does not recover if
both workstations fail, or if the file-server fails

Markov Reliability Model with Repair

States (0,1), (1,0) and (2,0) become absorbing states while (2,1) and (1,1) are transient states.
Note: we have made a simplification that, once the CTMC reaches a system failure state, we
do not allow any more transitions.

Markov Model with Absorbing


States
If we solve for (2,1)(t) and 1,1)(t) then
R(t)= (2,1)(t) + 1,1)(t)
For a Markov chain with absorbing states:
A: the set of absorbing states
B = - A: the set of remaining states
i,j): Mean time spent in state i,j until absorption

(i , j )

( i , j ) ( x ) dx ,

QB B (0)

QB B (0)

(i, j ) B

Markov Model with Absorbing


States (Continued)
QB derived from Q by restricting it to only
states in B

Mean time to absorption MTTA is given as:

MTTA

( i , j )B

(i , j )

Markov Reliability Model with


Repair (Continued)

QB

( f 2w )

2w

( w f w )

First Solve

Markov Reliability Model with


Repair (Continued)
Then :
next solve
Then :
Mean time to failure is 19992 hours.

Markov Reliability Model


without Repair
Assume that neither workstations nor fileserver is repairable

Markov Reliability Model


without Repair (Continued)

States (0,1), (1,0) and (2,0)


become absorbing states

Markov Reliability Model


without Repair (Continued)
QB

( f 2 w )

2w

( f w )

Mean time to failure is 9333 hours.

Markov Reliability Model


with Imperfect Coverage

Markov model
with imperfect coverage
Next consider a modification of the above
example proposed by Arnold as a model of
duplex processors of an electronic
switching system. We assume that not all
faults are recoverable and that c is the
coverage factor which denotes the
conditional probability that the system
recovers given that a fault has occurred.
The state diagram is now given by the
following picture:

Now allow for Imperfect


coverage
c

Markov model
with imperfect coverage

(Continued)

Assume that the initial state is 2 so that:

2 (0) 1, 0 (0) 1 (0) 0

Then the system of differential equations are:


d2 (t )
2c2 (t ) 2 (1 c) 2 (t ) 1 (t )
dt
d1 (t )
2c2 (t ) ( ) 1 (t )
dt
d0 (t )
2 (1 c) 2 (t ) 1 (t )
dt

Markov model
with imperfect coverage

(Continued)

After solving the differential equations we obtain:


R(t)=2(t) + 1(t)
From R(t), we can system MTTF:

(1 2c)
MTTF
2[ (1 c)]
It should be clear that the system MTTF and system reliability are
critically dependent on the coverage factor.

Anda mungkin juga menyukai