
FAULT TOLERANCE IN

DISTRIBUTED SYSTEMS



CMPE 516
TERM PROJECT





NAM AKSU
2003701303
CMPE MS








1. INTRODUCTION
In this paper, the concept of fault tolerance in distributed systems and the subjects related to this
area will be discussed in a detailed manner. Firstly, some basic descriptions and concepts
about fault tolerance in distributed systems will be given as a first introduction. The basic
differences between faults, errors and failures will be discussed, and fault classifications will
be given. After giving the detailed information about the necessary concepts, some failure models
in distributed systems will be explained with some example cases.
A reliable client-server model will be explained as an example for the failure models in
distributed systems. Then, the main hardware reliability models, that is, the series and parallel
models, will be discussed in a detailed manner. After giving the models, another important issue in
distributed systems will be discussed: agreement in faulty distributed systems. After giving
enough information about that subject, two important cases will be examined to make the
subject clear. These two examples are the Two Army Problem and the Byzantine Generals Problem.
Then, another important subject, replication of data in fault tolerant distributed systems, will
be discussed. Active and passive replication will be explained with the advantages and
disadvantages of the models, and then the Gossip Architecture, which is a mixture of these two
models, will be discussed. After all of that, some information about recovery in distributed
systems will be given, and the paper will be finished with a conclusion part.

2. FAULT TOLERANCE BASICS
In this part of the paper, some basic concepts and definitions about fault tolerance will be
given for a better understanding in the following parts of the paper.
In the past, fault-tolerant computing was the exclusive domain of very specialized
organizations such as telecom companies and financial institutions. With business-to-business
transactions taking place over the Internet, however, we are interested not only in making sure
that things work as intended, but also, when the inevitable failures do occur, that the damage
is minimal.
Unfortunately, fault-tolerant computing is extremely hard, involving intricate algorithms for
coping with the inherent complexity of the physical world. As it turns out, that world
conspires against us and is constructed in such a way that, generally, it is simply not possible
to devise absolutely foolproof, 100% reliable software. No matter how hard we try, there is
always a possibility that something can go wrong. The best we can do is to reduce the
probability of failure to an "acceptable" level. Unfortunately, the more we strive to reduce this
probability, the higher the cost.
There is much confusion about the terminology used with fault tolerance. For example, the
terms "reliability" and "availability" are often used interchangeably, but do they always mean
the same thing? What about "faults" and "errors"? In this section, we introduce the basic
concepts behind fault tolerance.
Fault tolerance is the ability of a system to perform its function correctly even in the presence
of internal faults. The purpose of fault tolerance is to increase the dependability of a system.
A complementary but separate approach to increasing dependability is fault prevention. This
consists of techniques, such as inspection, whose intent is to eliminate the circumstances by
which faults arise.
2.1. Faults, Errors, and Failures
Implicit in the definition of fault tolerance is the assumption that there is a specification of
what constitutes correct behavior. A failure occurs when an actual running system deviates
from this specified behavior. The cause of a failure is called an error. An error represents an
invalid system state, one that is not allowed by the system behavior specification. The error
itself is the result of a defect in the system or fault. In other words, a fault is the root cause of
a failure. That means that an error is merely the symptom of a fault. A fault may not
necessarily result in an error, but the same fault may result in multiple errors. Similarly, a
single error may lead to multiple failures.
For example, in a software system, an incorrectly written instruction in a program may
decrement an internal variable instead of incrementing it. Clearly, if this statement is
executed, it will result in the incorrect value being written. If other program statements then
use this value, the whole system will deviate from its desired behavior. In this case, the
erroneous statement is the fault, the invalid value is the error, and the failure is the behavior
that results from the error. Note that if the variable is never read after being written, no failure
will occur. Or, if the invalid statement is never executed, the fault will not lead to an error.
Thus, the mere presence of errors or faults does not necessarily imply system failure.
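
The fault/error/failure chain described above can be made concrete with a toy Python sketch; the function names and values here are invented purely for illustration:

```python
# The FAULT: process_item decrements the counter instead of incrementing it.
def process_item(count):
    return count - 1  # defect in the code; the specification says count + 1

def run(items):
    count = 0
    for _ in items:
        # Each execution of the faulty statement creates an ERROR:
        # count now holds an invalid value.
        count = process_item(count)
    # The FAILURE: the returned total deviates from the specified behavior
    # (it should equal len(items)).
    return count

print(run(["a", "b", "c"]))  # specified behavior: 3; actual: -3
```

Note that if `run` never returned the variable, the errors would produce no failure; and if the faulty statement were never executed, the fault would produce no error, exactly as described above.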
At the heart of all fault tolerance techniques is some form of masking redundancy. This means
that components that are prone to defects are replicated in such a way that if a component
fails, one or more of the non-failed replicas will continue to provide service with no
appreciable disruption. There are many variations on this basic theme.
2.2. Fault Classifications
Based on duration, faults can be classified as transient or permanent. A transient fault will
eventually disappear without any apparent intervention, whereas a permanent one will remain
unless it is removed by some external agency. While it may seem that permanent faults are
more severe, from an engineering perspective, they are much easier to diagnose and handle. A
particularly problematic type of transient fault is the intermittent fault that recurs, often
unpredictably.
A different way to classify faults is by their underlying cause. Design faults are the result of
design failures, like our coding example above. While it may appear that in a carefully
designed system all such faults should be eliminated through fault prevention, this is usually
not realistic in practice. For this reason, many fault-tolerant systems are built with the
assumption that design faults are inevitable, and that mechanisms need to be put in place to
protect the system against them. Operational faults, on the other hand, are faults that occur
during the lifetime of the system and are invariably due to physical causes, such as processor
failures or disk crashes.
Finally, based on how a failed component behaves once it has failed, faults can be classified
into the following categories:
Crash faults -- the component either completely stops operating or never returns to a
valid state;
Omission faults -- the component completely fails to perform its service;
Timing faults -- the component does not complete its service on time;
Byzantine faults -- these are faults of an arbitrary nature.


3. FAILURE MODELS IN DISTRIBUTED SYSTEMS

In this part of the paper, some failure models in distributed systems will be given. In all of
these scenarios, clients use a collection of servers.

Crash: Server halts, but was working ok until then, e.g. O.S. failure.
Omission: Server fails to receive a request or to reply to it, e.g. the server is not listening, or a buffer
overflow occurs.
Timing: Server response time is outside its specification, client may give up.
Response: Incorrect response or incorrect processing due to control flow out of
synchronization.
Arbitrary value (or Byzantine): Server behaving erratically, for example providing arbitrary
responses at arbitrary times. Server output is inappropriate, but it is not easy to determine this
to be incorrect. A duplicated message due to a buffering problem may be given as an example.
Alternatively, there may be a malicious element involved.

After giving the concepts about the failure models, some of the examples about failure models
are shown below:

Case: Client unable to locate server, e.g. server down, or server has changed.
Solution: Use an exception handler, but this is not always possible in the programming
language used.

Case: Client request to server is lost.
Solution: Use a timeout to await the server reply, then re-send, but be careful with operations
that are not idempotent. If multiple requests appear to get lost, assume a "cannot locate server" error.

Case: Server crash after receiving client request. The problem is that the client may not be able
to tell whether the request was carried out (e.g. the client requests that a page be printed; the
server may stop before or after printing, but before the acknowledgement).
Solutions: Rebuild the server and retry the client request (assuming at-least-once semantics for
the request), or give up and report request failure (assuming at-most-once semantics). What is
usually required is exactly-once semantics, but this is difficult to guarantee.

Case: Server reply to client is lost.
Solution: The client can simply set a timer and, if no reply arrives in time, assume that the server
is down, the request was lost, or the server crashed while processing the request.
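
The timeout-and-resend strategy above can be sketched in Python; `send` is a hypothetical transport function, assumed to raise `TimeoutError` when no reply arrives in time:

```python
import uuid

MAX_RETRIES = 3  # after this many lost attempts, report "cannot locate server"

def call_server(send, request):
    """Send `request`, re-sending on timeout.

    The unique request id is reused on every re-send, so a server that
    remembers executed ids can discard duplicates, which makes retries
    safe even for non-idempotent operations.
    """
    request_id = uuid.uuid4().hex  # same id on every re-send
    for _attempt in range(MAX_RETRIES):
        try:
            return send(request_id, request)
        except TimeoutError:
            continue  # request (or its reply) was lost: re-send
    raise ConnectionError("cannot locate server")
```

If every attempt times out, the client falls back to the "cannot locate server" case described above.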


4. HARDWARE RELIABILITY MODELLING

In this section of the paper, the hardware reliability modelling in the distributed systems will
be explained, and the two types series and parallel modelling will be discussed.

4.1. Series Model

In the series model, failure of any component 1..N will lead to system failure. If component i
has reliability Ri, the system reliability is:

R = R1 * R2 * R3 * ... * Rn

E.g. a system has 100 components, and failure of any component will cause system failure. If the
individual components have reliability 0.999, what is the system reliability?

R = R1 * R2 * ... * R100 = (0.999)^100 ≈ 0.905

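The series formula can be checked with a short Python sketch:

```python
def series_reliability(reliabilities):
    """System reliability when every component must work (series model):
    R = R1 * R2 * ... * Rn."""
    r = 1.0
    for ri in reliabilities:
        r *= ri
    return r

# The worked example above: 100 components, each with reliability 0.999.
print(series_reliability([0.999] * 100))  # ~0.905
```

Even highly reliable components multiply down quickly in series, which is why large systems with no redundancy are fragile.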

4.2. Parallel Model

In the parallel model, the system works unless all components fail. This property is the main
difference between the two models. Connecting components in parallel provides redundancy
and thus enhances system reliability.

If we write R for reliability and Q = 1 - R for unreliability, the system unreliability is:

Q = Q1 * Q2 * Q3 * ... * Qn

1 - R = (1 - R1) * (1 - R2) * (1 - R3) * ... * (1 - Rn)

E.g. a system consists of 3 components with reliability 0.9, 0.95 and 0.98, connected in parallel.
What is the overall system reliability?

R = 1 - (1 - 0.9) * (1 - 0.95) * (1 - 0.98) = 1 - 0.1 * 0.05 * 0.02 = 1 - 0.0001
R = 0.9999
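
A corresponding sketch for the parallel model:

```python
def parallel_reliability(reliabilities):
    """System reliability when the system fails only if every component
    fails (parallel model): 1 - R = (1-R1) * (1-R2) * ... * (1-Rn)."""
    q = 1.0
    for ri in reliabilities:
        q *= (1.0 - ri)  # unreliabilities Qi = 1 - Ri multiply
    return 1.0 - q

# The worked example above: reliabilities 0.9, 0.95 and 0.98 in parallel.
print(parallel_reliability([0.9, 0.95, 0.98]))  # ~0.9999
```

The overall reliability exceeds that of the best individual component, which is the redundancy benefit the parallel model provides.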


5. AGREEMENT IN FAULTY DISTRIBUTED SYSTEMS

The various failure scenarios in distributed systems and transmission delays in particular have
instigated important work on the foundations of distributed software. Much of this work has
focused on the central issue of distributed agreement. There are many variations of this
problem, including time synchronization, consistent distributed state, distributed mutual
exclusion, distributed transaction commit, distributed termination, distributed election, etc.
However, all of these reduce to the common problem of reaching agreement in a distributed
environment in the presence of failures.

Agreement in faulty distributed systems is used, for example, to elect a coordinator process or to
decide whether to commit a transaction. A majority voting mechanism can tolerate K faulty
processes out of 2K+1 (K fail, the K+1 majority is OK).
The goal is to have all non-faulty processes agree, and to reach agreement in a finite number of
operations. In the following sections, two common examples will be given for a better
understanding of the subject.

5.1. Two Army Problem

In this example, the enemy Red Army has 5000 troops. The Blue Army has two separate encampments,
Blue(1) and Blue(2), each of 3000 troops. Alone, either Blue force will lose; together, as a
coordinated attack, Blue can win. Communication is by an unreliable channel (a messenger is sent
who may be captured by the Red Army, so the message may not arrive).

Scenario:
Blue(1) sends to Blue(2): "let's attack tomorrow at dawn". Later, Blue(2) sends confirmation to
Blue(1): "splendid idea, see you at dawn". But Blue(1) realizes that Blue(2) does not know whether
this message arrived, so Blue(1) sends to Blue(2): "message arrived, battle set". Then Blue(2)
realizes that Blue(1) does not know whether that message arrived, and so on.
The two blue armies can never be sure because of the unreliable communication. No certain
agreement can be reached using this method.

5.2. Byzantine Generals Problem

In this example, there is an enemy Red Army as before, but the Blue Army is under the control of N
generals (encamped separately). M (unknown) out of the N generals are traitors and will try to
prevent the N-M loyal generals from reaching agreement. Communication is reliable, by one-to-one
telephone between pairs of generals, and is used to exchange troop strength information.

Problem:
How can the blue army loyal generals reach agreement on troop strength of all other loyal
generals?

Postcondition:
If the ith general is loyal then troops[i] is troop strength of general i. If the ith general is not
loyal then troops[i] is undefined (and is probably incorrect).

Algorithm:
Each general sends a message to the N-1 (here 3) other generals. Loyal generals tell the truth,
traitors lie. The results of the message exchanges are collated by each general to give vector[N].
Each general then sends its vector[N] to all the other N-1 (3) generals. Each general examines
each element received from the other N-1 and takes the majority response for each blue general.
The algorithm works since the traitor generals are unable to affect the messages exchanged between
loyal generals. Overcoming M traitor generals requires a minimum of 2M+1 loyal generals (3M+1
generals in total).
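
The two rounds of exchange can be sketched as a small simulation. The choice of N = 4 with one traitor, the troop strengths, and the particular lies the traitor tells are invented values for illustration, not part of the general algorithm:

```python
from collections import Counter

N = 4
loyal = [True, True, True, False]   # general 3 is the traitor (M = 1)
troops = [1000, 2000, 3000, 0]      # the traitor's own entry is irrelevant

def reported(sender, receiver):
    """Strength that `sender` claims for itself when telling `receiver`."""
    if loyal[sender]:
        return troops[sender]
    return 9000 + receiver  # a traitor may lie differently to each general

# Round 1: each general g collates the claims into a vector of length N.
vectors = {g: [troops[g] if s == g else reported(s, g) for s in range(N)]
           for g in range(N)}

# Round 2: each general relays its vector; a traitor may relay garbage.
def received_vector(sender, receiver):
    if loyal[sender]:
        return vectors[sender]
    return [7777] * N

def decide(g):
    """Majority vote, element by element, over the N-1 vectors g received."""
    result = []
    for i in range(N):
        values = [received_vector(s, g)[i] for s in range(N) if s != g]
        result.append(Counter(values).most_common(1)[0][0])
    return result

# Every loyal general agrees on every loyal general's strength; the
# traitor's entry, as the postcondition says, is undefined.
for g in range(N):
    if loyal[g]:
        print(g, decide(g)[:3])
```

With one traitor among four generals (3M+1 with M = 1), the three truthful values always outvote the single lie, which is exactly the 2M+1 loyal majority the analysis above requires.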



6. REPLICATION OF DATA

In this section, the replication of data in distributed systems will be discussed in a detailed
manner with its different models. The main goal of replication of data in distributed systems
is maintaining copies on multiple computers (e.g. DNS)

The main benefits of replication of data can be classified as follows:
1. Performance enhancement
2. Reliability enhancement
3. Data closer to client
4. Share workload
5. Increased availability
6. Increased fault tolerance

The constraints are classified below:
1. How to keep data consistency (need to ensure a satisfactorily consistent image for
clients)
2. Where to place replicas and how updates are propagated
3. Scalability

Fault Tolerant System Architectures:
Client (C)
Front End (FE) = client interface
Replica Manager (RM) = service provider

6.1. Passive Replication

In passive replication, all client requests (via front end processes) are directed to a nominated
primary replica manager (RM). The single primary RM communicates with one or more secondary
replica managers (operating as backups). The primary RM is responsible for all front end
communication and for updating the backup RMs: it executes update requests from the client
interface and distributes the resulting up-to-date state to each backup RM. If the primary replica
manager fails, a secondary replica manager observes this and is promoted to act as the primary RM.
To tolerate n process failures, n+1 RMs are needed. Passive replication cannot tolerate Byzantine
failures.




The request sequence is as follows:
1. A request is issued to the primary RM, each request with a unique id.
2. The primary RM receives the request and checks the request id, in case the request has
already been executed.
3. If the request is an update, the primary RM sends the updated state and the unique request
id to all backup RMs.
4. Each backup RM sends an acknowledgment to the primary RM.
5. When acknowledgments have been received from all backup RMs, the primary RM sends the
request acknowledgment to the front end (client interface).
All requests to the primary RM are processed in the order of receipt.
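
A minimal sketch of this update path, using invented class names and direct method calls in place of real network messages:

```python
class BackupRM:
    """Backup replica manager: installs the state pushed by the primary."""
    def __init__(self):
        self.state = {}

    def apply(self, request_id, state):
        self.state = dict(state)  # install the primary's updated state
        return "ack"

class PrimaryRM:
    """Primary replica manager: the sole contact point for front ends."""
    def __init__(self, backups):
        self.state = {}
        self.executed = {}   # request id -> reply, for duplicate detection
        self.backups = backups

    def handle(self, request_id, key, value):
        if request_id in self.executed:        # already executed:
            return self.executed[request_id]   # re-reply, do not re-apply
        self.state[key] = value                # execute the update
        acks = [b.apply(request_id, self.state) for b in self.backups]
        assert all(a == "ack" for a in acks)   # wait for all backup acks
        self.executed[request_id] = "done"
        return "done"                          # ack to the front end

backups = [BackupRM(), BackupRM()]
primary = PrimaryRM(backups)
primary.handle("req-1", "x", 42)
primary.handle("req-1", "x", 42)  # duplicate id: answered from cache
print(backups[0].state)           # backups hold the updated state
```

The request-id check is what makes a front end's re-sent request harmless: a duplicate is answered without being applied a second time.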

6.2. Active Replication

In the active replication model, there are multiple (a group of) replica managers (RMs), each with
equivalent roles. The RMs operate as a group, and each front end (client interface) multicasts
requests to the group of RMs. Requests are processed by all RMs independently (and identically).
The client interface compares all replies received and can tolerate N failures out of 2N+1 RMs,
i.e. consensus is reached when N+1 identical responses are received. This model can also tolerate
Byzantine failures.



A client request is sent to the group of RMs using a totally ordered reliable multicast, each
request with a unique id. Each RM processes the request and sends the response/result back to the
front end. The front end collects (gathers) the responses from each RM. Fault tolerance:
individual RM failures have little effect on performance. To tolerate n process failures, 2n+1
RMs are needed (to leave a majority of n+1 operating).
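
The front end's vote over the gathered replies can be sketched as follows:

```python
from collections import Counter

def majority_reply(replies):
    """Return the reply seen by a majority of the 2N+1 RMs, or None if no
    value reaches the required N+1 identical responses."""
    value, count = Counter(replies).most_common(1)[0]
    needed = len(replies) // 2 + 1  # N+1 out of 2N+1
    return value if count >= needed else None

# 2N+1 = 5 RMs with N = 2 faulty: one crashed (None reply), one Byzantine.
print(majority_reply(["ok", "ok", "ok", None, "garbage"]))  # "ok"
```

Because N+1 identical answers can only come from correct replicas when at most N are faulty, this vote masks both crash and Byzantine failures, as the section above states.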

6.3. Gossip Architectures

In gossip architectures, the main concept is to replicate data close to the points where clients
need it. The aim is to provide high availability at the expense of weaker data consistency.
It is a framework for providing highly available services through replication: RMs exchange (or
"gossip") updates in the background from time to time. There are multiple replica managers (RMs),
and a front end (FE) sends a query or update to any (one) RM. A given RM may be
unavailable, but the system is to guarantee a service.



In the Gossip Architecture, clients request service operations that are initially processed by a
front end, which normally communicates with only one replica manager at a time, although
free to communicate with others if its usual manager is heavily loaded.
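
A toy sketch of the background exchange; a single per-key timestamp is used here for brevity, where a real gossip system would use vector timestamps:

```python
class GossipRM:
    """Replica manager holding a (timestamp, value) pair per key."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def update(self, key, timestamp, value):
        """Apply an update (from a front end or a gossiping peer),
        keeping only the entry with the newer timestamp."""
        current_ts = self.store.get(key, (-1, None))[0]
        if timestamp > current_ts:
            self.store[key] = (timestamp, value)

    def query(self, key):
        """Answer a front-end query; the reply may be stale."""
        return self.store.get(key, (None, None))[1]

    def gossip_with(self, peer):
        """Background exchange: merge states in both directions."""
        for key, (ts, val) in list(peer.store.items()):
            self.update(key, ts, val)
        for key, (ts, val) in list(self.store.items()):
            peer.update(key, ts, val)

rms = [GossipRM(), GossipRM(), GossipRM()]
rms[0].update("x", 1, "hello")  # the front end talks to one RM only
print(rms[2].query("x"))        # not yet propagated: None
rms[0].gossip_with(rms[2])      # one background gossip round
print(rms[2].query("x"))        # now "hello"; rms[1] is still stale
```

The stale reply from rms[1] illustrates the trade-off stated above: availability is high (any RM answers), but consistency is only eventual, converging as gossip rounds spread the updates.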


7. RECOVERY

In this part of the paper, some information about recovery in distributed systems will be given
in a short manner. Once a failure has occurred, in many cases it is important to recover critical
processes to a known state in order to resume processing. The problem is compounded in
distributed systems. There are two approaches to recovery in distributed environments.

Backward recovery uses checkpointing (a global snapshot of the distributed system status) to
record the system state, but checkpointing is costly (performance degradation).
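
A toy sketch of checkpoint-and-rollback for a single process; a real distributed checkpoint must additionally snapshot all processes consistently:

```python
import copy

class CheckpointedProcess:
    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self.checkpoint = None

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)  # costly in practice

    def rollback(self):
        self.state = copy.deepcopy(self.checkpoint)  # restore known state

    def step(self, value, fail=False):
        self.take_checkpoint()
        self.state["step"] += 1
        if fail:  # a fault mid-update would leave an invalid partial state
            raise RuntimeError("transient fault")
        self.state["total"] += value

p = CheckpointedProcess()
p.step(10)
try:
    p.step(20, fail=True)   # failure after a partial update
except RuntimeError:
    p.rollback()            # backward recovery: restore the checkpoint
p.step(20)                  # the retry succeeds from the known state
print(p.state)              # -> {'step': 2, 'total': 30}
```

Note the cost the text mentions: a deep copy of the state is taken before every step, whether or not a failure ever occurs.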

Forward recovery attempts to bring the system to a new stable state from which it is possible to
proceed (applied in situations where the nature of the errors is known and a reset can be applied).

Backward recovery is the most extensively used approach in distributed systems and is generally
the safest; it can be incorporated into middleware layers, although it is complicated in the case
of process, machine or network failure. It gives no guarantee that the same fault will not occur
again (which affects failure transparency properties), and it cannot be applied to irreversible
(non-idempotent) operations, e.g. an ATM withdrawal.


8. CONCLUSION

In this conclusion section, the overall summary is given. Hardware, software and networks
cannot be totally free from failures. Fault tolerance is a non-functional requirement that
requires a system to continue to operate, even in the presence of faults. Distributed systems
can be more fault tolerant than centralized systems. Agreement in faulty systems and reliable
group communication are important problems in distributed systems. Replication of Data is a
major fault tolerance method in distributed systems. Recovery is another property to consider
in faulty distributed environments.
