Evaluating The Fault Tolerance of Stateful TMR

2010 13th International Conference on Network-Based Information Systems

Katsuyoshi Matsumoto, Minoru Uehara and Hideki Mori
Dept. of Open Information Systems, Graduate School of Engineering
Toyo University, Kawagoe, Saitama 350-8585, Japan
Email: {dz080001x,uehara,mori}@toyonet.toyo.ac.jp
Abstract: Module redundancy is often used in the construction of reliable
systems. Triple Modular Redundancy (TMR) is a method for improving
reliability through module redundancy, although it does not give the
correct result when two out of three modules fail. We therefore proposed a
new voting architecture known as Stateful TMR, which uses both the results
of TMR and the history of states to select the most reliable module.
Through simulations, we evaluate the reliability of a module using both
TMR and Stateful TMR, and show that for both transient and permanent
failures, Stateful TMR achieves higher reliability than TMR.
Keywords: fault tolerance; module redundancy; majority voting; reliability; permanent failure; transient failure; Stateful TMR
978-0-7695-4167-9/10 $26.00 © 2010 IEEE
DOI 10.1109/NBiS.2010.86

I. INTRODUCTION

In recent years, the design of hardware systems has become remarkably
complicated, while the role and importance of these systems have grown
rapidly. Under these circumstances, there is a real danger that problems
in a system will cause serious damage. To reduce this risk, it is
necessary to improve the reliability of the system. Module redundancy is
often used in the construction of reliable systems, and TMR (Triple
Modular Redundancy) is one method of improving reliability through module
redundancy. Since TMR uses majority voting among three modules, it cannot
decide the correct result when two out of the three modules fail. Stateful
TMR makes use of the results of TMR together with the history of states to
select the most reliable module.

In this paper, we evaluate the reliability of both TMR and Stateful TMR
under permanent and transient failures. The remainder of the paper is
organized as follows. In Section II hardware redundancy is described,
while in Section III we explain the proposed architecture, Stateful TMR.
In Section IV we describe the simulations and present our results.
Section V gives details of the evaluation. Finally, in Section VI we
present our conclusions.

II. HARDWARE REDUNDANCY

In this section, we describe hardware redundancy [1], [2], [3], which
includes an active technique, a passive technique, and a hybrid technique.

The passive technique uses fault masking, which hides the effects of
faulty parts. This approach is designed to implement fault tolerance
without requiring any action on the part of the system. N-Modular
Redundancy (NMR) is an example of this passive technique, in which
majority voting is performed to select a reliable module.

The active approach achieves fault tolerance by detecting the existence of
faults and performing some action to remove the faulty hardware from the
system.

The hybrid approach combines the attractive features of both the passive
and active approaches. Fault masking is used in the hybrid approach to
prevent erroneous results from being generated, while fault detection,
fault location, and fault recovery are used to improve fault tolerance by
removing faulty hardware and replacing it with spare components.

Next we explain TMR, which is the most popular passive approach. Fig. 1
shows the outline of TMR. The basic concept of TMR involves triplicating
the hardware and performing a majority vote to determine the outputs of
the system. If only one of the modules is faulty, the two remaining
fault-free modules mask the results of the faulty module during the voting
process. However, TMR cannot mask the results when two or more modules are
faulty.

Figure 1. TMR

III. STATEFUL TMR

Here, we describe Stateful TMR [4], [5]. The underlying concept of
Stateful TMR is that each module state is evaluated by TMR as illustrated
in Fig. 2. The output from Stateful TMR is decided by the state evaluation
unit and the voting result. The state evaluation unit shows the failure
state of each module by keeping a state evaluation register for each
module, (As, Bs, Cs), where As, Bs, and Cs are the state evaluation
registers for modules A, B, and C, respectively. If module A is faulty,
state evaluation register As records the fault.

Figure 2. Stateful TMR

State decision rules (table from Fig. 3):

NoF   Condition    State update
0     VOTER        MIs sets F
1     Consensus    All clear (all set N)
1     MIs has N    MAs sets F; MIs resets (sets N)
1     MIs has F    MIs sets F; MAs resets (sets N)
2     Consensus    remain
2     MIs has N    MAs sets F
2     MIs has F    MIs sets F
Next, we describe the decision process of a failure state in Stateful TMR.
Fig. 3 illustrates how decisions are made during state evaluation. Here,
MAs denotes the state group of the modules contributing to the majority
voting result, while MIs denotes the state group of the modules
contributing to the minority voting result (only a single element). F
denotes the failure state and N the normal state in the state evaluation
register. NoF denotes the number of failure states; for example, NoF = 1
means that the state evaluation unit has one failure state.
1) State evaluation unit detects no failure
   a) minority member state determines the failure state
   b) majority member state determines the normal state
2) State evaluation unit detects a failure state
   a) only one module has failed
      i) if the failed module belongs to the majority group
         A) majority member state determines the failure state
         B) minority member state determines the normal state
      ii) if the failed module belongs to the minority group
         A) majority member state determines the normal state
         B) minority member state determines the failure state
   b) two modules have failed
      i) if one failed module has contributed to the majority vote
         A) majority member state determines the normal state
         B) minority member state determines the failure state
      ii) if two failed modules have contributed to the majority vote
         A) majority member state determines the failure state
         B) minority member state determines the normal state
Figure 3. Decision of state
If there are no failure states in the state evaluation unit and the voting
result reveals a disagreement, the state of the minority member is set to
the failure state.
First we consider the state evaluation unit with one failure state. If the
failed module has contributed to the majority voting result, the states of
the majority members are set to the failure state, while the state of the
minority member is set to the normal state. If the failed module is in the
minority, the minority member state is set to the failure state.
Next we consider the state evaluation unit with two failed modules. If a
module in the normal state is included in the majority group, the majority
member states are set to the normal state, while the minority member state
is set to the failure state. If the module in the normal state is in the
minority according to the majority vote, the majority member states are
set to the failure state.
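The state decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `split_vote` and `update_states` are hypothetical, module outputs are compared by equality, exactly three modules are assumed, and the cases that the text does not cover (three distinct outputs, or three failure states) are handled by choices of ours that are marked in the comments.

```python
from collections import Counter

N, F = "N", "F"  # normal / failure marks, as used in the text


def split_vote(outputs):
    """Return (majority_indices, minority_index); minority_index is None on consensus."""
    counts = Counter(outputs)
    if len(counts) == 1:
        return [0, 1, 2], None
    if len(counts) != 2:
        # Assumption: three distinct outputs are outside the scope of the rules.
        raise ValueError("no majority: all three outputs differ")
    majority_value = counts.most_common(1)[0][0]
    majority = [i for i, o in enumerate(outputs) if o == majority_value]
    minority = next(i for i, o in enumerate(outputs) if o != majority_value)
    return majority, minority


def update_states(outputs, states):
    """Apply one state decision step to the registers (As, Bs, Cs)."""
    majority, minority = split_vote(outputs)
    nof = list(states).count(F)  # NoF: number of failure states
    new = list(states)
    if minority is None:
        # Consensus: NoF = 1 clears all registers; otherwise they remain.
        return [N, N, N] if nof == 1 else new
    if nof == 0 or states[minority] == F:
        # Minority is blamed, majority is trusted.
        # Assumption: the same rule is applied when NoF = 3.
        new[minority] = F
        for i in majority:
            new[i] = N
    else:
        # A failure is recorded and the minority register is N,
        # so the failed module(s) sit in the majority group.
        new[minority] = N
        for i in majority:
            new[i] = F
    return new
```

For example, with registers (F, N, N) and outputs (1, 1, 0), the failed module A belongs to the majority group, so the sketch marks A and B as failed and C as normal.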
Fig. 4 illustrates how the output decision is reached. Here, MAo denotes
the output group of the modules contributing to the majority vote, while
MIo denotes the output group of the modules contributing to the minority
vote (only a single module).
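As a rough sketch of the output decision that Fig. 4 and the rules in this section describe, the following Python compares plain majority voting with a choice between MAo and MIo based on the state registers. The function names `tmr_output` and `stateful_output` are hypothetical, exactly three modules are assumed, and the handling of three failure states is our assumption because the text does not cover that case.

```python
from collections import Counter

N, F = "N", "F"  # normal / failure marks, as used in the text


def tmr_output(outputs):
    """Plain TMR: return the majority value of the three outputs.

    With three distinct outputs there is no true majority; this sketch then
    simply returns the first value.
    """
    return Counter(outputs).most_common(1)[0][0]


def stateful_output(outputs, states):
    """Choose between the majority output (MAo) and the minority output (MIo)."""
    counts = Counter(outputs)
    if len(counts) == 1:
        return outputs[0]  # consensus result
    if len(counts) != 2:
        # Assumption: three distinct outputs are outside the scope of the rules.
        raise ValueError("no majority: all three outputs differ")
    majority_value = counts.most_common(1)[0][0]
    minority_index = next(i for i, o in enumerate(outputs) if o != majority_value)
    nof = list(states).count(F)  # NoF: number of failure states
    if nof >= 1 and states[minority_index] == N:
        # Every recorded failure sits in the majority group: trust the minority.
        return outputs[minority_index]
    # NoF = 0, or the minority module is itself marked as failed
    # (assumption: also used when NoF = 3).
    return majority_value
```

For example, with outputs (0, 0, 1) and registers (F, F, N), plain majority voting returns 0, whereas the sketch returns 1, the value of the only module in the normal state.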
1) All modules are normal,
   a) the majority result determines the output.
2) State evaluation unit detects a failure state,
   a) one module has failed,
      i) if the failed module is included in the majority group,
         A) the output is selected according to the minority result.
      ii) if the failed module is not included in the majority group,
         A) the output is selected according to the majority result.
   b) two modules have failed,
      i) if the majority group includes only one failed module,
         A) the output is selected according to the majority result.
      ii) if the majority group includes two failure states,
         A) the output is selected according to the minority result.

Condition             Output
Consensus             Consensus result
NoF = 0, VOTER        MAo
NoF = 1, MIs has N    MIo
NoF = 1, MIs has F    MAo
NoF = 2, MIs has N    MIo
NoF = 2, MIs has F    MAo
Figure 4. Decision of the output

The output is determined by the number of failure states together with the
majority vote. If all modules are in the normal state, the output is
determined by the majority vote. If one module is in the failure state,
and the failed module is included in the majority group, the output is
determined by the minority result. If the failed module is in the
minority, the output is determined by the majority result. If two modules
are in a failure state, and a normal state module is included in the
majority group, the output is the majority result. If a normal state
module is included in the minority group, the output is determined by the
minority vote.

IV. SIMULATION

We have evaluated both TMR and Stateful TMR. Section IV-A explains the
evaluation method, while Sections IV-B and IV-C describe the simulation
results.

A. Evaluation method
In this study, we evaluated both permanent and transient failures through
our simulations. Permanent failure occurs if a failed module never
recovers, while transient failure occurs if a failed module can recover to
some degree. Taking this into consideration, the failure rate (p, 0 ≤ p ≤
1) is defined as the number of failed modules per unit time. The recovery
rate (q, 0 ≤ q ≤ 1) is defined as the number of failed modules recovered
per unit time. Figs. 5 and 6 illustrate the definition of permanent and
transient failure, respectively.

Figure 5. Permanent failure
Figure 6. Transient failure

B. Permanent failure
Figs. 7 to 9 show the simulation results for permanent failure. Fig. 7
shows the simulation results for a failure rate of 0.001. In this
simulation, TMR initially has higher reliability than the single module,
but this decreases over time. Stateful TMR has higher reliability and
better fault tolerance than both TMR and the single module.
Fig. 8 shows the simulation results for a failure rate of 0.005. In this
simulation, the reliability of TMR is initially slightly higher than that
of the single module, but in the end the reliability is lower. Stateful
TMR has higher reliability than the single module and TMR.
Fig. 9 shows the simulation results for a failure rate of 0.01. In this
simulation, TMR initially has higher reliability than the single module,
but in the end, the reliability of TMR is lower. Stateful TMR has higher
reliability than both the single module and TMR.
According to the simulation results, TMR has low fault tolerance in each
pattern, but towards the end, it has lower
fault tolerance than a single module. With high failure rates, the fault
tolerance of TMR is lower than that of a single module. Stateful TMR has
greater fault tolerance than both the single module and TMR.

Figure 7. Failure Rate: 0.001
Figure 8. Failure Rate: 0.005
Figure 9. Failure Rate: 0.01
(Figs. 7 to 9 plot Reliability (R) against Time (t), from 100 to 1000, for NONE, TMR, and Stateful TMR.)

C. Transient failure
In this section, we evaluate transient failure, with the simulation
results shown in Figs. 10 to 12. For this simulation, the failure rate is
set to 0.01, while the recovery rate varies between 0.001 and 0.1.
Fig. 10 illustrates the performance with a recovery rate of 0.001.
According to the simulation results, TMR initially achieves better
reliability than the single module, but this changes over time and, in the
end, its reliability is lower. Stateful TMR initially achieves higher
reliability than both the single module and TMR, but in the end, has lower
reliability than the single module.
Fig. 11 shows the simulation results for a recovery rate of 0.01. TMR has
slightly higher reliability than the single module, while Stateful TMR has
higher reliability than both the single module and TMR. For this recovery
rate, convergence occurs at 0.5 for all three methods.
Fig. 12 shows the simulation results for a recovery rate of 0.1. In this
simulation, TMR and Stateful TMR have higher reliability than the single
module. Stateful TMR has slightly higher reliability than TMR.

Figure 10. Failure Rate: 0.01, Recovery Rate: 0.001
Figure 11. Failure Rate: 0.01, Recovery Rate: 0.01
(Figs. 10 and 11 plot Reliability (R) against Time (t), from 100 to 1000, for NONE, TMR, and Stateful TMR.)

For this simulation, we used a failure rate of 0.01 and varied the
recovery rate between 0.001 and 0.1. For low recovery rates, TMR has lower
fault tolerance than the single module approach. Stateful TMR has lower
fault tolerance than the single module in the end. With a recovery rate of
0.01, TMR has slightly higher fault tolerance than the single module.
Stateful TMR has higher fault tolerance than both the other methods. For
high recovery rates, both TMR and Stateful TMR have higher fault tolerance
than the single module, with Stateful TMR having the highest fault
tolerance.
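The simulation setup of this section can be outlined in Python, with a closed-form cross-check for the single module and for plain TMR. This is a minimal sketch under assumptions of ours, not the authors' simulator: p and q are read as per-step failure and recovery probabilities of independent modules, TMR is counted as correct when at most one module is failed, and all function names are hypothetical. Stateful TMR is not modelled here.

```python
import random


def step(failed, p, q, rng):
    """Advance one module by one time unit and return its new failed flag."""
    if failed:
        return rng.random() >= q  # stays failed unless it recovers (probability q)
    return rng.random() < p       # a healthy module fails with probability p


def simulate(p, q, steps, trials, seed=0):
    """Monte Carlo reliability over time for a single module and for TMR.

    q = 0 models permanent failure; q > 0 models transient failure.
    """
    rng = random.Random(seed)
    single_ok = [0] * steps
    tmr_ok = [0] * steps
    for _ in range(trials):
        failed = [False, False, False]
        for t in range(steps):
            failed = [step(f, p, q, rng) for f in failed]
            single_ok[t] += not failed[0]
            # Assumption: majority voting is correct with at most one failed module.
            tmr_ok[t] += failed.count(True) <= 1
    return [c / trials for c in single_ok], [c / trials for c in tmr_ok]


def single_reliability(p, q, t):
    """Closed form for the same two-state model after t steps, starting healthy."""
    if p + q == 0:
        return 1.0
    steady = q / (p + q)  # long-run fraction of time a module is healthy
    return steady + (1 - steady) * (1 - p - q) ** t


def tmr_reliability(r):
    """Textbook 2-out-of-3 reliability for independent modules of reliability r."""
    return 3 * r**2 - 2 * r**3
```

Under these assumptions, 3r^2 - 2r^3 exceeds r only when r > 0.5, which is consistent with TMR starting above the single module and ending below it for permanent failures. With p = q = 0.01 the steady state q / (p + q) is 0.5, and 3(0.5)^2 - 2(0.5)^3 = 0.5, which agrees with the convergence at 0.5 reported for the single module and TMR at a recovery rate of 0.01.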
Figure 12. Failure Rate: 0.01, Recovery Rate: 0.1
(Fig. 12 plots Reliability (R) against Time (t), from 100 to 1000, for NONE, TMR, and Stateful TMR.)

V. EVALUATION

In this section, we evaluate fault tolerant performance with permanent and
transient faults. Table I gives an evaluation of the simulation results
for permanent faults. In evaluating permanent failures, the failure rate
is varied, while the recovery rate remains fixed. For permanent failures,
TMR has lower fault tolerance than the single module with a high failure
rate. With a low failure rate, TMR initially has higher fault tolerance
than the single module, but in the end, the fault tolerance of TMR is
lower than that of the single module. With low recovery rates, TMR has
lower fault tolerance than the single module. With a high recovery rate,
TMR has higher fault tolerance than the single module; according to these
results, TMR then has the same level of fault tolerance as Stateful TMR.
For permanent failures, Stateful TMR has higher fault tolerance than the
single module and TMR in all cases.

Table I
PERMANENT FAILURES
permanent       Failure Rate: high    Failure Rate: low
TMR             middle                low
Stateful TMR    high                  high

Table II gives an evaluation of the simulation results for transient
faults. In the evaluation of transient failure, the failure rate is fixed
while the recovery rate is changed. For transient failure, Stateful TMR
achieved higher fault tolerance than TMR. With a low recovery rate,
Stateful TMR initially has higher fault tolerance than the single module,
but in the end, its fault tolerance is lower. With a high recovery rate,
Stateful TMR exhibits the best fault tolerant performance.

Table II
TRANSIENT FAILURES
transient       Recovery Rate: low    Recovery Rate: high
TMR             low                   high
Stateful TMR    middle                high

VI. CONCLUSION

In this paper, we evaluated the fault tolerance of TMR and Stateful TMR
for both transient and permanent failures. For permanent failures,
Stateful TMR maintains high reliability regardless of the failure rate.
The reliability of TMR decreases as time increases, until it is less than
that of the single module, although TMR does achieve high reliability for
a certain period of time. For permanent failures, with a low failure rate,
TMR obtains relatively high reliability, but, with a high failure rate,
the reliability of TMR decreases and in the end is below that of the
single module. Stateful TMR obtains higher fault tolerance compared with
both TMR and the single module.
For transient failure, with a high recovery rate, TMR achieves higher
fault tolerance compared with the single module without redundancy,
although it initially has lower fault tolerance compared with both the
single module without redundancy and TMR. But Stateful TMR has lower fault
tolerance than a single module without redundancy during the steady
period. With a high recovery rate, Stateful TMR has good fault tolerance
at the same level as TMR.
Regarding Stateful TMR in transient failures, when the recovery rate is
high, very good reliability is achieved, as with TMR. Even with a low
recovery rate, Stateful TMR still has better reliability than TMR.

REFERENCES
[1] D. K. Pradhan, Fault-Tolerant Computer System Design, Prentice Hall,
New Jersey, 1996.
[2] M. Abd-El-Barr, Design and Analysis of Reliable and Fault-Tolerant
Computer Systems, Imperial College Press, London, 2007.
[3] J. Vial, A. Bosio, P. Girard, C. Landrault, S. Pravossoudovitch, and
A. Virazel, Using TMR Architectures for Yield Improvement, Proceedings of
IEEE Defect and Fault Tolerance in VLSI Systems, pp. 7-15, 2008.
[4] K. Matsumoto, M. Uehara, and H. Mori, Proposal of Stateful Reliability
Counter Small-World Cellular Neural Networks, In Proc. of the 3rd
International Conference on Complex, Intelligent and Software Intensive
Systems (CISIS2009), pp. 154-161, 2009.
[5] K. Matsumoto, M. Uehara, and H. Mori, Evaluation of Stateful
Reliability Counter in Small-World Cellular Neural Networks, In Proc. of
2009 International Conference on Network-Based Information Systems
(NBiS2009), pp. 417-423, 2009.
