Evangelos Kotsakis, European Commission, 1998
TABLE OF CONTENTS
LIST OF FIGURES................................................................................................................................... V
LIST OF TABLES ................................................................................................................................. VII
ACKNOWLEDGMENTS..................................................................................................................... VIII
ABBREVIATIONS .................................................................................................................................. IX
ABSTRACT ............................................................................................................................................... X
1. INTRODUCTION .................................................................................................................................. 1
1.1 DISTRIBUTED MANAGEMENT SYSTEMS ............................................................................................... 1
1.2 REPLICATION ON A DISTRIBUTED MIB ................................................................................................. 2
1.3 THE WORK........................................................................................................................................... 4
1.4 ROAD MAP OF THE THESIS .................................................................................................................. 5
2. REPLICATION MANAGEMENT SYSTEM ARCHITECTURE .................................................... 8
2.1 MANAGEMENT FUNCTIONAL AREAS ................................................................................................... 8
2.2 MANAGEMENT ARCHITECTURAL MODEL .......................................................................................... 10
2.3 PROTOCOLS FOR CONTROLLING MANAGEMENT INFORMATION......................................................... 11
2.3.1 OSI Management Framework ................................................................................................... 11
2.3.2 Internet Network Management .................................................................................................. 12
2.4 OBJECT ORIENTED MIB MODELLING ................................................................................................. 13
2.5 DISTRIBUTED MANAGEMENT INFORMATION BASE (MIB) ................................................................. 15
2.6 DISTRIBUTED NETWORK MANAGEMENT ........................................................................................... 18
2.7 CORBA SYSTEM .............................................................................................................................. 22
2.8 IMPLEMENTING OSI MANAGEMENT SERVICES FOR TMN ................................................................. 23
2.9 REPLICATION IN A MANAGEMENT SYSTEM ....................................................................................... 26
2.10 NEED FOR REPLICATION TECHNIQUES IN A MANAGEMENT SYSTEM .................................................. 29
2.11 SYNCHRONOUS AND ASYNCHRONOUS REPLICA MODELS ................................................................. 33
2.12 REPLICATION TRANSPARENCY AND ARCHITECTURAL MODEL ........................................................ 33
2.13 SUMMARY ....................................................................................................................................... 37
3. FAILURES IN A MANAGEMENT SYSTEM .................................................................................. 38
3.1 DEPENDABILITY BETWEEN AGENTS .................................................................................................. 38
3.2 FAILURE CLASSIFICATION .................................................................................................................. 39
3.3 FAULTY AGENT BEHAVIOUR ............................................................................................................. 40
3.4 FAILURE SEMANTICS ......................................................................................................................... 41
3.5 FAILURE MASKING ............................................................................................................................ 42
3.6 ARCHITECTURAL ISSUES ................................................................................................................... 46
3.7 GROUP SYNCHRONISATION ............................................................................................................... 47
3.7.1 Close Synchronisation ............................................................................................................... 47
3.7.2 Loose synchronisation ............................................................................................................... 48
3.8 GROUP SIZE....................................................................................................................................... 49
3.9 GROUP COMMUNICATION .................................................................................................................. 49
3.10 AVAILABILITY POLICY ..................................................................................................................... 50
3.11 GROUP MEMBER AGREEMENT ........................................................................................................ 51
3.12 SUMMARY ....................................................................................................................................... 53
4. REPLICA CONTROL PROTOCOLS ............................................................................................... 55
4.1 PARTITIONING IN A REPLICATION SYSTEM ......................................................................................... 55
4.2 CORRECTNESS IN REPLICATION ......................................................................................................... 56
4.3 TRANSACTION PROCESSING DURING PARTITIONING .......................................................................... 59
4.4 PARTITION PROCESSING STRATEGY................................................................................................... 60
4.5 AN ABSTRACT MODEL FOR STUDYING REPLICATION ALGORITHMS .................................................. 62
4.6 PRIMARY SITE PROTOCOL ................................................................................................................. 66
LIST OF FIGURES
FIGURE 2-1: BASIC MANAGEMENT MODEL.................................................................................................... 9
FIGURE 2-2: VIEWS OF SHARED MANAGEMENT KNOWLEDGE....................................................................... 17
FIGURE 2-3: SIMPLIFIED MANAGEMENT SYSTEM. ....................................................................................... 17
FIGURE 2-4 NETWORK MANAGEMENT APPROACHES (A) CENTRALISED (B) PLATFORM BASED (C)
HIERARCHICAL (D) DISTRIBUTED ......................................................................................................... 21
FIGURE 2-5: INTER-WORKING TMN ........................................................................................................... 25
FIGURE 2-6: (A) REPLICATION (B) NO REPLICATION ..................................................................................... 28
FIGURE 2-7: NETWORK MANAGEMENT REPLICATION EXAMPLE .................................................................. 31
FIGURE 2-8: SYNCHRONOUS REPLICATION .................................................................................................. 33
FIGURE 2-9: ARCHITECTURAL MODEL FOR REPLICATION. (A) NON TRANSPARENT SYSTEM (B) TRANSPARENT
REPLICATION SYSTEM (C ) LAZY REPLICATION (D) PRIMARY COPY MODEL. ......................................... 35
FIGURE 3-1: RELATIONSHIP BETWEEN USER AND RESOURCE. ...................................................................... 39
FIGURE 3-2: FAILURE MASKING ................................................................................................................... 42
FIGURE 3-3: GROUP MASKING ..................................................................................................................... 43
FIGURE 4-1 REPLICATION ANOMALY CAUSED BY CONFLICT WRITE OPERATIONS. A) BEFORE ISOLATION B)
AFTER ISOLATION ................................................................................................................................ 57
FIGURE 4-2: LOGICAL AND PHYSICAL OBJECTS OF THE SENSOR ENTITY. ..................................................... 64
FIGURE 4-3. REPLICATION USING PRIMARY SITE ALGORITHM. ..................................................................... 66
FIGURE 4-4. READ IN A PRIMARY SITE PROTOCOL ...................................................................................... 68
FIGURE 4-5. WRITE IN A PRIMARY SITE PROTOCOL..................................................................................... 68
FIGURE 4-6. MAKE CURRENT IN A PRIMARY SITE PROTOCOL ..................................................................... 69
FIGURE 4-7. READ IN A MAJORITY CONSENSUS ALGORITHM ...................................................................... 71
FIGURE 4-8. WRITE IN A MAJORITY CONSENSUS ALGORITHM ....................................................................... 72
FIGURE 4-9. MAKE CURRENT IN A MAJORITY CONSENSUS ALGORITHM ..................................................... 72
FIGURE 4-10. ISMAJORITY IN THE DYNAMIC VOTING PROTOCOL ............................................................... 75
FIGURE 4-11. READ FUNCTION IN THE DYNAMIC VOTING PROTOCOL ......................................................... 75
FIGURE 4-12. WRITE (UPDATE) IN THE DYNAMIC VOTING PROTOCOL ......................................................... 76
FIGURE 4-13 UPDATE IN THE DYNAMIC VOTING PROTOCOL ...................................................................... 78
FIGURE 4-14 MAKE CURRENT IN THE DYNAMIC VOTING PROTOCOL ......................................................... 78
FIGURE 4-15 READPERMITTED IN THE DMCA ........................................................................................... 82
FIGURE 4-16 WRITEPERMITTED FUNCTION IN THE DMCA.......................................................................... 83
FIGURE 4-17 DOREAD FUNCTION IN THE DMCA........................................................................................ 84
FIGURE 4-18 DOWRITE FUNCTION IN THE DMCA ...................................................................................... 86
FIGURE 4-19 MAKE CURRENT FUNCTION IN DMCA................................................................................... 88
FIGURE 4-20: SEQUENCE DIAGRAM FOR DOREAD OPERATION .................................................................... 89
FIGURE 4-21: SEQUENCE DIAGRAM FOR DOWRITE OPERATION ................................................................... 90
FIGURE 4-22: SEQUENCE DIAGRAM FOR MAKECURRENT OPERATION ......................................................... 90
FIGURE 5-1. ATS PROCESS DIAGRAM ........................................................................................................ 102
FIGURE 5-2. ATS OBJECT MODEL .............................................................................................................. 106
FIGURE 5-3. ATS DYNAMIC MODEL .......................................................................................................... 107
FIGURE 6-1: NETWORK MODEL................................................................................................................. 112
FIGURE 6-2: FAULT INJECTION SYSTEM .................................................................................................... 113
FIGURE 6-3: COMPONENTS OF THE SIMULATION MODEL ............................................................................ 118
FIGURE 6-4: AVAILABILITY CURVE ............................................................................................................ 120
FIGURE 6-5: TOTAL AVAILABILITY =4. ................................................................................................... 121
FIGURE 6-6: BOUNDARIES OF TOTAL AVAILABILITY .................................................................................. 122
FIGURE 6-7: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.1 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY. .................. 126
FIGURE 6-8: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.2 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 127
FIGURE 6-9: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.3 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 128
FIGURE 6-10: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.4 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 129
FIGURE 6-11: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 130
FIGURE 6-12: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=1.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 131
FIGURE 6-13: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=1.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 132
FIGURE 6-14: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=2.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 133
FIGURE 6-15: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=2.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 134
FIGURE 6-16: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=3.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 135
LIST OF TABLES
TABLE 2-1: CMISE SERVICES AND FUNCTIONS ........................................................................................... 12
TABLE 2-2: SNMP SERVICES AND FUNCTIONS ............................................................................................ 13
TABLE 4-1: DMCA MAPPING ...................................................................................................................... 91
TABLE 6-1: SIMULATION PARAMETERS ..................................................................................................... 123
ACKNOWLEDGEMENTS
I would like to thank my supervisor Dr. B. H. Pardoe, who has assisted and
guided me in preparing this thesis. His kind assistance in my struggles with the English
language was very helpful. He has been willing to answer any technical question and to
provide me with the information and knowledge needed to cope with the difficult task of
preparing a Ph.D. thesis. Without his kind help and encouragement, this research
would never have been done.
I would especially like to express sincere appreciation for the financial support,
encouragement and love given to me by my parents Grigorios and Dimitra Kotsakis.
They have given me much more than moral and material support throughout my university
studies. They provided me with a rock-solid support system which proved helpful during
my studies. Without their support and love, this research would never have been done.
Special thanks are due to my wife Chaido for encouraging me over the last three
years. I would like to thank her for her unconditional love, unfailing enthusiasm,
unending optimism and confidence in my abilities. Her patience and support are
boundless.
Finally, I would like to thank little Dimitra, whose birth two years ago gave me
great joy, for being quiet while I was writing this thesis.
ABBREVIATIONS
ANSA     Advanced Networked Systems Architecture
ATM      Asynchronous Transfer Mode
ATS      Availability Testing System
CMIS     Common Management Information Service
CMISE    Common Management Information Service Element
DMCA     Dynamic Majority Consensus Algorithm
FE       Front End
IP       Internet Protocol
ISO      International Organization for Standardization
LAN      Local Area Network
MIB      Management Information Base
OMT      Object Modelling Technique
OSF/DCE  Open Software Foundation / Distributed Computing Environment
OSI      Open Systems Interconnection
ROSE     Remote Operations Service Element
SNA      Systems Network Architecture
SNMP     Simple Network Management Protocol
TCP      Transmission Control Protocol
UDP      User Datagram Protocol
WAN      Wide Area Network
ABSTRACT
Systems management is concerned with supervising and controlling a system so
that it fulfils its operational requirements. The management of a system may be
performed by a mixture of human and automated components. These components operate
on abstract representations of network resources known as managed objects. A
distributed management system may be viewed as a collection of such objects located at
different sites in a network. Replication is a technique used in distributed systems to
improve the availability of vital data components and to provide higher system
performance, since access to a particular object may be accomplished at multiple sites
concurrently. By applying replication in a distributed management system, we can locate
certain management objects at multiple sites by copying their internal data and the
operations used to access or update those data. This is a great advantage, since it
increases reliability and availability, provides higher fault tolerance and allows data
sharing, thereby improving system performance.
This thesis is concerned with methods that may be used to apply replication in
such a system, as well as with certain replica control algorithms that may be used to
control operations over a replicated managed object. Certain replication architectures
are examined and the availability provided by each of them is discussed. A new replica
control algorithm is proposed as an alternative means of providing higher availability. A
tool for evaluating the availability provided by a replica control algorithm is designed
and proposed as a benchmark utility for examining the suitability of certain replica
control algorithms.
1. INTRODUCTION
individual network resource, an attribute used to represent a network activity. When the
MIB is distributed over sites, one site may fail, while other sites continue to operate.
Distributed MIBs may also increase performance since different managed objects
located at different hosts may be accessed concurrently.
A fundamental problem with a distributed MIB is data availability. Since managed
objects are stored on separate machines, a server crash or a network failure that
partitions a client from a server can prevent a manager from accessing managed objects.
Such situations are very frustrating to a manager because they impede computation even
though client resources are still available. The problem of object availability will grow
over time for two reasons:
1. The frequency of network failures will increase. Networks are getting larger: they
cover wider geographical areas, encompass multiple administrative boundaries and
consist of multiple sub-networks connected via routers and bridges. Furthermore, there
is an increasing need for better network resource management that increases the
availability of managed objects and improves management performance.
2. The introduction of mobile network managers will increase the number of
occasions on which management agents are inaccessible. Wireless technologies
such as packet radio suffer from inherent limitations such as short range and line of
sight. Due to these limitations, the network connections between management
agents and mobile managers will exhibit frequent partitions.
1.2 Replication on a distributed MIB
Replication is a technique used in distributed operating systems and distributed
databases to improve the availability of system resources. In the case of the MIB,
replication can be used to increase the performance of management activities and to
provide high availability of management objects. Replicating the same management
object at different sites can improve the availability remarkably because the system can
continue to operate as long as at least one site is up. It also improves performance of
global retrieval queries, because the result of such a query can be obtained locally from
any site; hence a retrieval query can be processed at the local site where it is submitted.
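The availability gain from replication can be made concrete with a small calculation. The sketch below is illustrative only (the 90% site availability is an assumed figure, not taken from the thesis) and treats site failures as independent: a read succeeds as long as at least one replica is reachable.

```python
# Illustrative sketch: read availability of a replicated object when any
# single live replica can serve the read, assuming independent site failures.
def read_availability(site_availability: float, n_replicas: int) -> float:
    """P(at least one replica up) = 1 - P(all replicas down)."""
    return 1.0 - (1.0 - site_availability) ** n_replicas

# A single copy on a site that is up 90% of the time, versus three copies:
single = read_availability(0.90, 1)   # 0.9
triple = read_availability(0.90, 3)   # 1 - 0.1**3, approximately 0.999
print(f"one copy: {single:.3f}, three copies: {triple:.3f}")
```

Three copies already push read availability from 90% to roughly 99.9%, which is the effect the paragraph above describes; write availability behaves differently, since writes must keep the replicas consistent.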
To deal with replicated objects in a management information base, a control
method is needed to keep all the replicas in a consistent state even during partitioning.
The proposed techniques used to assure consistency may be divided into two families:
those based on a distinguished copy and those based on voting. The former technique is
based on the idea of using a designated copy of each replicated object, in such a way
that requests are sent to the site that contains that copy
(ALSBERG 1976, BERNSTEIN 1987, GARCIA 1982).
Voting replica control algorithms are more promising. They do not use a
distinguished copy; rather, a request is sent to all sites that hold a copy of the
replicated object. An access to a particular copy is granted if a majority of votes is
collected (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b). Voting algorithms
are fully distributed concurrency control algorithms and they exhibit higher flexibility
than those based on a distinguished copy. Although voting algorithms pass many
messages among sites, good performance can be expected if the round trip time becomes
shorter. Today's technology can improve the round trip time through the use of high
speed networks such as ATM, so messages may be transferred from one machine to
another faster and more reliably.
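The voting idea described above reduces to a few lines. The sketch below is a hypothetical illustration of the majority rule in the style of GIFFORD 1979, not code from the thesis; `majority_granted` and its parameters are invented names.

```python
# Minimal sketch of majority-vote replica access: each site holding a copy
# casts one vote, and an operation proceeds only if a strict majority of
# all copies respond.
def majority_granted(votes_collected: int, total_copies: int) -> bool:
    """Access is granted when more than half of all copies vote for it."""
    return votes_collected > total_copies // 2

# With 5 copies, reaching 3 sites is enough; reaching only 2 is not:
assert majority_granted(3, 5)
assert not majority_granted(2, 5)

# Any two majorities of the same copy set intersect, so two conflicting
# writes can never both be granted in disjoint partitions; this overlap is
# what lets voting preserve consistency during partitioning.
```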
In the voting scheme, replicated objects can be accessed in the partition group that
obtains a majority vote. In the distinguished (primary) copy scheme, availability is
significantly limited in the case of a network or link failure. Primary copy algorithms
exhibit good behaviour only for site failures. Voting algorithms, on the other hand,
provide higher availability, tolerating both network and site failures, but they
guarantee consistency at the expense of availability. To provide higher availability,
one may either use a consistency relaxation technique that allows concurrent access to
replicated objects across different partitions (an optimistic control algorithm) or improve
the existing pessimistic voting algorithms by forming more sophisticated schemes.
Optimistic control algorithms must be supported by an extra mechanism to detect and
resolve diverging replicas once the partition groups are reconnected. This complicates
the replication control task and allows, at least for a short interval of time,
inconsistency between replicas. Such an approach requires a long time to retrieve the
state of the database after a site failure, and it does not seem appropriate for databases
such as those used to store management information. Therefore, the invention of a more
sophisticated replica control algorithm based on voting seems to be a promising
approach that may provide higher availability while preserving strong consistency
between replicated objects.
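To make the contrast between the two families concrete, the following hypothetical sketch models a primary-copy object: an update succeeds only in the partition that contains the designated primary, even when another partition holds a majority of the copies. The class and parameter names are invented for illustration.

```python
# Sketch of the primary-copy limitation: all updates are funnelled through
# one designated site, so a partition that isolates the primary blocks
# writes even though most copies are still alive.
class PrimaryCopy:
    def __init__(self, primary, copies):
        self.primary = primary    # site holding the distinguished copy
        self.copies = copies      # all sites holding a copy
        self.value = None

    def write(self, reachable, value):
        # The update must reach the designated primary copy.
        if self.primary not in reachable:
            return False          # primary partitioned away: write blocked
        self.value = value        # primary applies the update, then propagates
        return True

obj = PrimaryCopy("A", {"A", "B", "C"})
assert obj.write({"A", "B"}, 42)      # partition containing the primary: ok
assert not obj.write({"B", "C"}, 7)   # majority of copies, but no primary
assert obj.value == 42
```

A voting scheme would accept the second write, since `{"B", "C"}` is a majority of the three copies; this is exactly the availability difference argued above.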
Finally, the following questions are addressed in this thesis:
Can we improve further the availability of managed objects by utilising voting
techniques?
Can replication be used effectively in a distributed MIB in order to ensure fault
tolerance in a management system?
1. There is a great proliferation of different network technologies and a great need for
network management. Keeping management information available and in a
consistent state is of great importance, since the network operability depends on
management activities.
2. Availability of network information may be obtained only by applying redundancy.
Replication is one of the most widely used techniques that ensure high availability
while keeping the replicated objects in a consistent state.
3. The development of replication schemes for ensuring higher object availability and
for tolerating site and communication failures is very promising and should be
studied further.
4. Failures always happen. No system can work forever within its specifications.
Exogenous or endogenous factors can affect the operability of the system causing
temporary or permanent failures.
5. The need for developing fault tolerant techniques for network management systems is
of great importance.
Chapter 4 presents the correctness criteria that should be taken into account when
designing a replication system. It also introduces an abstract model in order to study
formally certain replica control algorithms. It then presents a variety of replica control
algorithms. It ends with a thorough discussion of the DMCA (Dynamic Majority
Consensus Algorithm), a novel approach that enriches current knowledge of
replication techniques and improves the overall management of replicated objects by
providing higher availability.
Chapter 5 presents the object-oriented development process of the ATS (Availability
Testing System) tool. It first discusses the advantages of using object-oriented
technology to develop such a complex system and then presents the static object model
and the dynamic model of the ATS.
Chapter 6 evaluates the DMCA algorithm and presents quantitative results regarding
the availability provided by the DMCA. It starts by specifying the way in which one can
measure the performance of certain replica control protocols and introduces the
simulation model on which the ATS benchmark utility is built. It also describes the fault
injection mechanism for generating faults and repairs. It ends with a thorough discussion
of the simulation results, justifying the superiority of the DMCA.
The thesis concludes with Chapter 7, which presents the contributions and includes a
discussion of future work and a summary of key results.
This chapter introduces the fundamental idea behind systems management and
illustrates the main features of a management system. It provides a brief discussion
about the architectural model of a management system and introduces the concept of
distributed MIB as a naturally distributed database. It highlights issues related to the
object-oriented MIB modelling and the definition of managed objects. It also justifies the
use of object replication in terms of system performance, data reliability and availability.
Finally it discusses the type of failures that may occur in a management system as well
as the replication architectural models that may be used to maintain multiple replicas.
2.1 Management Functional Areas
Management of a system is concerned with supervising and controlling the
system so that it fulfils the operational requirements. To facilitate the management task,
Open Systems Interconnection (OSI) divides the management design process into five
areas known as the OSI management functional areas (ISO 1989). The fundamental
objectives of the OSI functional areas are to fulfil the following goals.
1. To maintain proper operation of a complex network (fault management).
2. To maintain internal accounting procedures (accounting management).
3. To maintain procedures regarding the configuration of a network or a
distributed processing system (configuration management).
4. To provide the capability of performance evaluation (performance management).
5. To allow authorised access-control information to be maintained and
distributed across a management domain (security management).
In other words, a network that has a management system must be able to manage
its own operations, performance, failures, modifications, security and hardware/software
configuration. To fulfil the above requirements, it is necessary to develop a management
model that is capable of incorporating a vast amount of services covered under the
specifications of the OSI functional areas. The actual architecture of the network
management model varies greatly, depending on the functionality of the platform and
the details of the network management capability. A management architectural model
that has been proposed by OSI defines the fundamental concepts of systems
management (ISO 1992). This model describes the information, functional,
communication and organisational aspect of systems management.
The database of management information, known as the Management
Information Base (MIB), is associated with both the manager and the agent. The MIB is
the conceptual repository of the management information stored in an OSI-based
network management system. The definition of the MIB describes the conceptual
schema containing information about managed objects and relations between them. It
actually defines the set of all managed objects visible to a network management entity.
The MIB may be viewed as the interface definition - it defines a conceptual schema
which contains information about specific managed objects, which are instantiations of
managed object classes. The schema also embodies relationships between these
managed objects, specifies the operations which may be performed on them and
describes the notifications which they may emit (ISO 1993).
The Common Management Information Service Element (CMISE) is the standardised
application service element that is used to exchange management information in the
form of requests and/or request responses
(ISO 1991). The CMISE is a basic vehicle that provides individual management
applications with the means of executing management operations on objects and issuing
notifications. The CMISE provides the means of supporting distributed management
operations using application associations. The CMISE services shown in Table 2-1
constitute the kernel functional unit of the CMISE. A system supporting the CMIP must
implement the kernel functional units of the CMISE.
Service           Type   Function
Notifications
M_EVENT_REPORT    C/NC   Reports the occurrence of an event
Operations
M_GET             C      Retrieves attribute values
M_SET             C/NC   Modifies attribute values
M_ACTION          C/NC   Requests an action to be performed on a managed object
M_CREATE          C      Creates a new managed object instance
M_DELETE          C      Deletes a managed object instance

C = Confirmed
NC = Not Confirmed
the occurrence of an event. Table 2-2 lists the SNMP request and response messages
along with their types and functions.
Service           Type   Function
Notifications
Trap              NC     Reports the occurrence of an extraordinary event
Operations
GetRequest        C      Retrieves the values of the specified variables
GetNextRequest    C      Retrieves the value of the lexicographically next variable
GetResponse       NC     Returns the values requested by a manager
SetRequest        C      Modifies the values of the specified variables

C = Confirmed
NC = Not Confirmed
Behaviour, that specifies how the object reacts to operations performed on it.
The managed object class provides a way to specify a family of managed objects. A
managed object class is a template for managed objects that share the same attributes,
operations, notifications and behaviour. A managed object is an instantiation of the
managed object class.
The MIB is the conceptual repository containing all the related information about
managed objects. The MIB modelling encompasses an abstract model and an
implementation model (KOTSAKIS 1995). The abstract model defines concepts
related to managed object classes and the relationships between them.
The implementation model (BABAT 1991, KOTSAKIS 1995) defines the following
The Management Information Model (ISO 1993) defines two types of management
operations: operations applied to the attributes of a managed object, such as add
member and remove member, and operations applied to the managed object as a
whole.
Any operation may affect the state of one or more attributes. The operations may also be
performed atomically (either all operations succeed or none is performed). Operations
that may be applied to the managed object as a whole are the following:
create
delete
action
An action operation requests the managed object to perform the specified action and to
indicate the result of this action.
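As a hypothetical illustration of the template idea and of atomic operations, the sketch below models a managed object whose attribute updates either all succeed or leave the object untouched. The class and attribute names are invented, not taken from any MIB definition.

```python
# Sketch of a managed object: the class fixes the attributes and operations
# its instances share, and multi-attribute updates are applied atomically.
class ManagedObject:
    def __init__(self, **attributes):
        self.attributes = dict(attributes)

    def get(self, name):
        return self.attributes[name]

    def set_atomic(self, updates):
        # Validate every update before applying any, so a bad attribute
        # name leaves the object's state completely untouched.
        unknown = set(updates) - set(self.attributes)
        if unknown:
            raise KeyError(f"unknown attributes: {unknown}")
        self.attributes.update(updates)

    def action(self, name, *args):
        # Dispatch a named action and return its result.
        return getattr(self, name)(*args)

link = ManagedObject(operational_state="enabled", error_count=0)
link.set_atomic({"error_count": 3})
assert link.action("get", "error_count") == 3
try:
    link.set_atomic({"error_count": 9, "no_such_attr": 1})
except KeyError:
    pass
assert link.get("error_count") == 3   # atomic: the failed update changed nothing
```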
2.5 Distributed Management Information Base (MIB)
Roles are not permanently assigned to a management entity. Some management
entities may be restricted to only taking an agent role, some to only taking a manager
role while other are allowed to take an agent role in one interaction and to take a
manager role in a separate interaction. In order to perform system management and
share management knowledge, it is sometimes necessary to embody manager and agent
within a single open system (see Figure 2-2). Shared management knowledge is implied
by the nature of the management framework since the management applications are
distributed across a network. Therefore the management information base may be
naturally viewed as a distributed database containing the managed objects that belong
to the same management system but is physically spread over multiple sites (hosts) of a
computer network. The MIB can be considered a superset of managed objects. Each
subset of this superset may constitute a set of objects associated with a device physically
separated from any other managed device (ARPEGE 1994). Therefore the managed
objects in each location may be viewed as a local management description of the
corresponding device.
Because the MIB is
distributed over several sites, one site may fail while other sites continue to operate.
Only the objects associated with the failed site cannot be accessed. This improves
both reliability and availability. On the other hand, a failure in a centralised MIB
may make the whole system unavailable to all users.
A typical arrangement of a management system is shown in Figure 2-3. The nodes may
be located in physical proximity and connected via a Local Area Network (LAN), or
they may be geographically distributed over an interconnected network (Internet). It is
possible to connect a number of diskless workstations or personal computers as
managers to a set of agents that maintain the managed objects. As illustrated in Figure
2-3, some nodes may run as managers (such as the diskless node 1, or the node 2 with
disks), while other nodes are dedicated to run only agent software, such as the node 3.
Still other nodes may support both manager and agent roles, such as the node 4.
Interaction between manager and agent might proceed as follows:
1. The manager parses a user query and decomposes it into a number of
independent queries that are sent separately to independent management agent
nodes.
2. Each agent node processes the local query and sends a response to the manager
node.
3. The manager node combines the results of the subqueries to produce the result of
the original submitted query.
4. If something occurs in an agent that changes its operational state, a notification
may be generated from the agent and an associated message is sent urgently to
the manager for further processing.
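The interaction steps above can be sketched as follows. The object names, the location catalogue and the query format are illustrative assumptions, not part of any standard:

```python
# Sketch of the manager-side decompose/combine cycle (steps 1-3 above).

LOCATION = {                    # the manager keeps track of object locations
    "ifInOctets.eth0": "agent-1",
    "ifInOctets.eth1": "agent-2",
}

AGENT_MIBS = {                  # each agent only sees its own local objects
    "agent-1": {"ifInOctets.eth0": 1200},
    "agent-2": {"ifInOctets.eth1": 3400},
}

def decompose(query):
    """Step 1: split a global query into independent per-agent subqueries."""
    subqueries = {}
    for obj in query:
        subqueries.setdefault(LOCATION[obj], []).append(obj)
    return subqueries

def local_query(agent, objects):
    """Step 2: an agent node processes its local subquery."""
    return {obj: AGENT_MIBS[agent][obj] for obj in objects}

def global_query(query):
    """Step 3: the manager combines the subquery results."""
    result = {}
    for agent, objects in decompose(query).items():
        result.update(local_query(agent, objects))
    return result

print(global_query(["ifInOctets.eth0", "ifInOctets.eth1"]))
# → {'ifInOctets.eth0': 1200, 'ifInOctets.eth1': 3400}
```

Because the location catalogue lives inside the manager, the caller of `global_query` never names an agent: this is exactly the MIB transparency property discussed below.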
The agent software is responsible for local access of managed objects while the manager
software is responsible for most of the distribution functions; it processes all the user
requests that require access to more than one management node, and it keeps track of
where each managed object is located. An important function of the manager is to hide
the details of data distribution from the user, that is, the user should write global queries
as though the MIB were not distributed. This property is called MIB transparency. A
management system that does not provide distribution transparency makes it the
responsibility of the user to specify the managed node associated with a managed object.
Systems are becoming increasingly complex and distributed. As a result, they are
exposed to problems such as failures, performance inefficiencies and resource allocation
conflicts. An efficient integrated network management system is therefore required to
monitor, interpret and control the behaviour of their hardware and software resources.
This task is currently carried out by centralised network management systems in which a
single management system monitors the whole network. Most existing management
systems are platform-centred: the applications are separated from the data they
require and from the devices they need to control. Although some experts believe that
most network management problems can be solved with a centralised management
system, there are real network management problems that cannot be adequately
addressed by the centralised approach (MEYER 1995).
There are four basic approaches to network management systems:
centralised, platform based, hierarchical and distributed (LEINWARD 1993).
applications use the services offered by the management platform to handle decision
support. The advantage of this approach is that applications do not need to worry about
protocol complexity and heterogeneity.
The hierarchical architecture (Figure 2-4.c) uses the concept of a Manager Of
Managers (MOM) and the manager-per-domain paradigm (LEINWARD 1993). Each
domain manager is responsible only for the management of its own domain and is
unaware of other domains. The manager of managers sits at a higher level and requests
information from the domain managers.
Figure 2-4 Network management approaches (a) centralised (b) platform based (c) hierarchical (d)
distributed
while the network management cost in communication and computation decreases. This
approach has also been adopted by ISO standards and the Telecommunication
Management Network (TMN) architecture (ITU 1995).
The Common Object Request Broker Architecture (CORBA) (OMG 1997) is also an
important standard for distributed object-oriented systems. It is aimed at the
management of objects in distributed heterogeneous systems. CORBA addresses two
challenges in developing distributed systems (OMG 1997):
1. making the design of a distributed system no more difficult than that of a
centralised one, and
2. providing an infrastructure to integrate application components into a distributed
system.
Network (TMN) framework (ITU 1995). On the other hand, in the Internet community, the
Simple Network Management Protocol (SNMP) has gained widespread acceptance due
to its simplicity of implementation. Thus, TMN and Internet management will co-exist
in the future.
The aim of the TMN is to enhance interoperability of management software and to
provide an architecture for management systems. A TMN is a logically distinct network
from the telecommunication network that it manages. It interfaces with the
telecommunication network at several different points and controls its operation. The
TMN information architecture is based on an object oriented approach and the
agent/manager concepts that underlie the Open Systems Interconnection (OSI) systems
management.
The Telecommunication Management Network (TMN) is a framework for the
management of telecommunication networks and the services provided on those
networks. The Open Systems Interconnection (OSI) management framework is an
essential component of the TMN architecture. Each TMN function block can play the
role of an OSI manager, an OSI agent or both.
A managed object instance can represent a resource, and thus there is a
requirement for communication between managed object instances in an OSI agent and
the resources they represent. Examples of resources include telecommunication
switches, bridges, gateways etc. If a new interface card is added to a switch, the switch
may send a create request to the agent for the creation of the corresponding managed
object instance.
Figure 2-5 illustrates how TMN systems can inter-work within the TMN logical layers.
The management information base (MIB) is the managed object repository and
may be implemented by using C++ objects through a MIB composer tool (FERIDUM
1996). In (BAN 1995) a uniform generic object model is presented.
1. Performance: persistent managed objects ensure fast restart after agent failures. For
example the instance representing a leased line between two communication nodes
may need to be persistent, whereas an instance representing a connection does not
(since after an agent failure, the connection will be terminated). Object oriented
messages. Unfortunately, the current CORBA standard makes no provision for fault
tolerance.
To provide fault tolerance, objects should be replicated across multiple processors
within the distributed system (ADAMEC 1995). The motivations for applying object
replication in a distributed network management system are of several types:
One is performance enhancement. Management information that is shared by
a large manager community should not be held at a single server, since this
computer would act as a bottleneck that slows down responses.
Another is improved fault tolerance. When the computer holding one
replica crashes, the system can continue the management computation with another replica.
A third motivation is the use of replicas to access remote objects.
When a remote object is to be accessed, a local replica reflecting the remote object's
state is created and used instead of the remote object.
Figure 2-6
information. The agent updates the MIBs located at the manager sites by exchanging
messages with the managers. Each manager gets the management information locally
without the need to issue remote requests. This yields a performance increase, since
two additional instances of the MIB are used to provide information about the same
resources.
(MAFFEIS 1997b) discusses a CORBA-based fault-tolerant system that monitors remote
objects; if some of them fail, it automatically restarts the failed objects and replicates
stateful objects on the fly, migrating objects from one host to another. In
(NARASIMHAN 1997) a similar system is discussed, which provides fault-tolerant
services under CORBA to applications with no modification to the existing ORB.
current state of the resources. The manager and the agent of each domain could be
accommodated by the same host computer. However, we use the most general case in
which manager and agent reside at different computers. This could be the case where
the manager runs on a diskless machine. There are two possible scenarios:
1. No replication: Each MIB stores information about managed resources of its own
domain.
2. Replication: Each MIB contains replicated information which is associated with
resources of other domains.
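The two scenarios can be contrasted with a toy sketch; the domain names and MIB entries are illustrative:

```python
# Toy sketch of the two scenarios above: per-domain MIBs without and
# with replication of other domains' information.
local_info = {
    "A": {"linkA.status": "up"},
    "B": {"linkB.status": "up"},
}

def build_mibs(replicate):
    """Scenario 1 (replicate=False): each MIB holds only its own domain.
    Scenario 2 (replicate=True): each MIB also holds copies of the others."""
    mibs = {}
    for domain in local_info:
        if replicate:
            merged = {}
            for info in local_info.values():
                merged.update(info)        # replicate every domain's entries
            mibs[domain] = merged
        else:
            mibs[domain] = dict(local_info[domain])
    return mibs

no_repl = build_mibs(replicate=False)
repl = build_mibs(replicate=True)
print("linkB.status" in no_repl["A"], "linkB.status" in repl["A"])  # False True
```

In the replicated scenario, manager A can answer queries about domain B locally, at the cost of keeping B's entries up to date.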
[Figure: three management domains A, B and C, each with its own manager, agent and
MIB, attached to networks A, B and C respectively; the networks are interconnected
through bridges AB, AC and BC, and each agent manages its local network resources.]
control it. At the other, the system does everything without users noticing anything. The
ANSA reference manual (ANSA 1989) and the International Standard Organisation
Reference Model for Open Distributed Processing (ISO 1992a) provide definitions
related to replication transparency. Among other things, these standards state that a
replication system is transparent if it enables multiple instances of an information
object (in our case, a managed object) to be replicated without knowledge of the
replicas by users or application programs.
A basic architectural model for controlling replicated objects may involve distinct
agencies located across a network. Figure 2-9(a) shows how a manager may control the
entire process in a non-transparent system. When a manager creates or updates an object,
it does so on one agency and then takes responsibility for making copies or completing
the update on the other agencies.
An agency is a process that contains replicas and performs operations upon them
directly. An agency usually maintains a physical copy of every logical item; however,
there are cases in which it may not. For example, a managed object needed mostly by a
manager on one LAN may never be used by a manager on another LAN. In this case the
agency on the second LAN need not contain a physical copy of that object, and if that
manager ever requests information about the object, the local agency may obtain the
information by calling another agency that actually holds a physical copy. The general
model for a transparent replication system is shown in Figure 2-9(b). A manager's
request is first handled by a Front End (FE) component.
Figure 2-9: Architectural model for replication. (a) non-transparent system (b)
transparent replication system (c) lazy replication (d) primary copy model.
The FE component passes messages to at least one agency. This hides the
details of how, and to which agency, a message is forwarded. The manager does
not need to determine a specific agency for service; it just sends the message, and the
FE component takes responsibility for determining which agency will receive the request.
The FE component may be implemented as part of the manager application, or it may be
implemented as a separate process invoked by a manager application using some kind of
Interprocess Communication (PRESOTTO 1990). Figure 2-9(c) shows a specialisation of
the architectural model in Figure 2-9(b). The model in (c) is called a lazy replication
model and implements what is known as the gossip architecture (LADIN 1992). Here the
manager creates or updates only one copy at one agency. Later the agency itself makes
replicas at other agencies automatically, without the manager's knowledge. The
replication server runs in the background all the time, scanning the managed object
hierarchy. Whenever it finds a managed object with fewer replicas than expected,
the replication server arranges to make the additional copies. The
replication server works best for immutable objects, since such objects cannot change
during the replication process. This architecture is called the gossip architecture
because the replica agencies exchange gossip messages in order to convey the updates
they have each received. In the gossip architecture the FE component communicates directly
either with an individual agency or with more than one agency. Figure 2-9(d)
shows another replication architectural model, known as the primary copy model
(LISKOV 1991). In this model all front ends communicate with the same primary
agency when updating a particular data item. The primary agency propagates the
updates to the other agencies, called slaves. Front ends may also read objects from a
slave. If the primary agency fails, one of the slaves can be promoted to act as the
primary. Front ends may communicate with either a primary or a slave agency to
retrieve information. In that case, however, front ends may not perform updates;
updates are made only to the primary copy of an object.
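The primary copy model above can be sketched as follows. This is a minimal illustration with hypothetical class names; a real system would add failure detection and atomic update propagation:

```python
# Minimal primary-copy sketch: updates go through the primary, which
# propagates them to the slaves; reads may be served by any live replica.
class Agency:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

class PrimaryCopyGroup:
    def __init__(self, agencies):
        self.agencies = agencies           # agencies[0] starts as the primary

    def primary(self):
        for a in self.agencies:
            if a.alive:
                return a                   # first live agency acts as primary:
        raise RuntimeError("no live agency")   # a slave is promoted on failure

    def write(self, key, value):
        p = self.primary()
        p.store[key] = value
        for slave in self.agencies:        # the primary propagates the update
            if slave is not p and slave.alive:
                slave.store[key] = value

    def read(self, key):
        for a in self.agencies:            # reads may come from any replica
            if a.alive:
                return a.store.get(key)

group = PrimaryCopyGroup([Agency("primary"), Agency("s1"), Agency("s2")])
group.write("x", 1)
group.agencies[0].alive = False            # primary fails; s1 takes over
group.write("x", 2)
print(group.read("x"))  # 2
```

The promotion rule here (first live agency in a fixed ranking) mirrors the primary/first back-up/second back-up ranking discussed in the next chapter's failure-masking material.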
2.13 Summary
This chapter has set the background for a replication management system. It has
shown the need for using a replication scheme in a real time application. It has
examined the distributed aspect of a network management system describing the
distributed nature of the MIB. It has briefly discussed two major protocols (CMIP and
SNMP) for exchanging management messages. It has also examined design aspects of
the MIB discussing the significance of the managed object as an autonomous entity for
performing operations related to incoming messages. The concepts of object availability
and performance have been defined and used as a measure of the quality of service of
the system. Synchronous and asynchronous replica models have been examined and
finally various architectural models for replication have been discussed as ways to
maintain multiple replicas transparently.
In the following chapters we will discuss the internal mechanisms (algorithms) used
to obtain transparent updates to replicated objects. We will discuss a variety of solutions
that may be applied to ensure consistency among multiple replicas in the presence of node
or communication link failures.
can be a user at another level of abstraction. The relationship between user and resource
is a "depends on" relationship, as shown in Figure 3-1.
A distributed management system consists of many agents. The management
services provided by those agents may depend on other secondary low level
management services associated with operating system components as well as
communication components. The union of all these management services is provided as
a distributed management system service. To ensure correctness and management
service availability, the classes of possible failures in the lower levels of abstraction
should be studied and redundancy in particular management services should be
introduced to prevent system crashes.
3.2 Failure Classification
An agent designed to provide a certain management service works correctly if in
1. Omission Failure: It happens when the agent receiving a request omits to respond
to that request. This failure occurs either because the queue of incoming messages
at the agent is full, so that any additional request is lost, or because an internal
failure (e.g. a memory allocation failure) is experienced due to a temporary lack of
physical resources for handling the incoming request. A communication service that
occasionally loses messages but does not delay them is an example of a
service that suffers omission failures.
2. Timing Failure: It happens when the agent response is functionally correct but
untimely. The response occurs outside the real-time interval specified. The most
frequent timing failure is the performance (late timing) failure in which the
response reaches the manager after the elapse of the time interval during which the
manager is expecting the response. This failure occurs because either the network
is too slow or the agent is overloaded and responds late to the
manager. Excessive message transmission or message processing delay due to
an overload is an example of a performance failure.
3. Response Failure: It happens when the agent responds incorrectly: either the value
of its output is incorrect (value failure) or the state transition that takes place is
incorrect (state failure). A search procedure that "finds" a key that is not an entry
of a routing table is an example of a response failure.
4. Crash Failure: It happens when, after the first omission to produce a response to a
request, an agent omits to produce outputs for subsequent requests
handling the failure. The behaviour of an agent under the occurrence of a failure may be
classified as follows:
Fail-stop behaviour
Byzantine behaviour
With fail-stop behaviour, a faulty agency just stops and does not respond to subsequent
requests or produce further output, except perhaps to announce that it is no longer
functioning. With Byzantine behaviour, a faulty agency continues to run, issuing wrong
responses to requests and possibly working together maliciously with other faulty
managers or agencies to give the impression that they are all working correctly when
they are not. In our study we assume only fail-stop behaviour.
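A fail-stop agency, as assumed in this study, can be caricatured in a few lines (illustrative only):

```python
class FailStopAgency:
    """A fail-stop agency simply stops responding after it fails; it never
    issues wrong (Byzantine) responses after the failure."""
    def __init__(self):
        self.crashed = False

    def request(self, query):
        if self.crashed:
            return None            # no response at all: the silence can be
        return f"result:{query}"   # detected by callers, e.g. via a timeout

a = FailStopAgency()
print(a.request("status"))  # result:status
a.crashed = True
print(a.request("status"))  # None
```

The value of the fail-stop assumption is visible even in this toy: a caller only has to detect silence, never to judge whether a received answer is a lie.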
and thus F/G is a weaker failure semantics than F. An agent that can exhibit any
failure behaviour has the weakest failure semantics, called arbitrary failure semantics.
Arbitrary failure semantics therefore includes all the previously defined failure
semantics. It is the responsibility of the agent designer to ensure that the agent properly
implements its specified failure semantics. In general, the stronger a failure semantics is,
the more expensive and complex it is to build an agent that implements it.
3.5 Failure Masking
A failure behaviour can be classified only with respect to a certain agent
specification, at a certain level of abstraction. If a management agent depends on
lower-level agents to correctly provide management services, then a failure of a certain
type at a lower level of abstraction can result in a failure of a different type at a higher
level of abstraction.
Let us consider the example in Figure 3-2. A manager M sends a request to
agent A, which in turn uses agent B to get some information necessary to build a
response to the manager's request. Suppose that B is unable to provide the
necessary information to agent A due to either a communication failure (omission
or performance failure) or a site failure (crash, value failure etc.). Agent A is actually
built one layer above B, and it may hide the failure of B either by using another agent,
say C, that provides exactly the same information as B, or by trying to resolve the
problem itself, playing the role of B as well (it may directly access the managed object
hosted at B's site). Agent A may also change the failure semantics: a crash
failure in agent B may be propagated by agent A as an omission failure to the manager.
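The masking scenario above might be sketched as follows; the agents are modelled as plain records and the names are hypothetical:

```python
# Sketch of hierarchical masking: agent A masks B's crash by falling back
# to an equivalent agent C; if both fail, A propagates an omission (no
# answer) to the manager, changing the failure semantics as described.
def query_agent(agent):
    if agent["alive"]:
        return agent["data"]
    raise ConnectionError(f"{agent['name']} unreachable")  # lower-level crash

def agent_a_respond(agent_b, agent_c):
    for lower in (agent_b, agent_c):     # masking attempt: try B, then C
        try:
            return query_agent(lower)
        except ConnectionError:
            continue
    return None   # masking failed: B's crash surfaces as an omission from A

b = {"name": "B", "alive": False, "data": "ifTable"}
c = {"name": "C", "alive": True, "data": "ifTable"}
print(agent_a_respond(b, c))   # ifTable (the failure of B is masked)
c["alive"] = False
print(agent_a_respond(b, c))   # None (propagated as an omission failure)
```

Note how the exception carries the failure information across the abstraction boundary, exactly the role assigned to exception handling in the text below.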
Failure propagation among managers and agents situated at different abstraction
levels of the "depends on" hierarchy can be a complex phenomenon. The task of
checking the correctness of results provided by lower-level servers is very cumbersome
and for this reason, designers prefer to use agents with as strong as possible failure
semantics. Exception handling provides a convenient way to propagate information
about failure detection across abstraction levels and replication of certain services
provide the mechanism for masking lower level failures. An agent A that is able to
provide certain services despite the failure of an underlying component, it is said to
mask the component's failure. If the masking attempts of an agent do not succeed, a
consistent state must be recovered for the agent before information about the failure is
propagated to the next level of abstraction, where further masking attempts can take
place. In this way information about the failure of lower level components can either be
hidden from the human users by successful masking attempt or can be propagated to
human users as a failure of a higher-level service they requested. The programming of
masking and consistent-state recovery actions is usually simpler when the designer
knows that the components do not change their state when they cannot provide their
services. Agents which either provide their standard service or signal an exception
without changing their state (called atomic agents (CRISTIAN 1989)) simplify fault
tolerance because they provide their users with simple-to-understand omission failure
semantics.
To ensure that a service remains available to managers despite agent failures, one
can implement the service by a group of redundant, physically independent components,
so that if some of these fail, the remaining ones provide the service. We say that a group
masks the failure of a member m whenever the group (as a whole) responds as specified
to users despite the failure of m. While hierarchical masking requires users to implement
any resource failure-masking attempts as exception handling mechanisms, with group
masking, individual member failures are entirely hidden from users by the group
management mechanisms.
The group output is a function of the outputs of individual group members. For
example, the group output can be the output generated by the fastest member of the
group, the output generated by some distinguished member of the group, or the result of
a majority vote on the group members' outputs. A group G has failure semantics F if the
failures that are likely to be observed by users are in class F. An agent group able to
mask any k concurrent member failures from its managers is termed k-fault
tolerant; when k equals one, the group is single-fault tolerant, and when k is greater
than one, the group is multiple-fault tolerant. For example, if the k members of an
agent group have crash/performance failure semantics, with members ranked as primary,
first back-up, second back-up, and so on, up to k-1 concurrent member failures may be
masked. A group of 2k+1 members with arbitrary failure semantics, whose output is the
result of a majority vote among outputs computed in parallel by all members, can mask
up to k member failures. When a majority of members fail in an arbitrary way,
the entire group can fail in an arbitrary way. Hierarchical and group masking are two
of group members becomes weaker. In (CRISTIAN 1985,
amounts of failure detection, recovery and masking redundancy used at various levels of
a management system in order to obtain the best overall cost/performance/dependability
result. Recent research has shown that a small investment at a lower level of abstraction,
ensuring that lower-level components have stronger failure semantics, can often
contribute substantial cost savings at higher levels of
abstraction and can result in lower overall cost (CRISTIAN 1991). On the other hand,
using too much redundancy, especially masking redundancy, at the lower
levels of abstraction of a system might be wasteful from an overall cost-effectiveness
point of view, since such lower-level redundancy can duplicate the masking redundancy
that higher levels of abstraction might use to satisfy their own dependability
requirements (SALTZER 1984).
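Returning to group masking: the 2k+1 majority vote described above can be sketched as follows (illustrative; member outputs are assumed to be directly comparable values):

```python
from collections import Counter

def group_output(member_outputs, k):
    """Majority vote over 2k+1 member outputs: masks up to k arbitrary
    (even Byzantine) member failures, as described in the text."""
    assert len(member_outputs) == 2 * k + 1
    value, votes = Counter(member_outputs).most_common(1)[0]
    if votes >= k + 1:        # a strict majority cannot be produced by
        return value          # the at-most-k faulty members alone
    raise RuntimeError("no majority: more than k members failed")

# k = 1: three members, one of which answers arbitrarily.
print(group_output(["up", "up", "garbage"], k=1))  # up
```

With k+1 matching votes out of 2k+1, at least one vote must come from a correct member, which is why the group output is trustworthy as long as at most k members misbehave.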
3.6 Architectural Issues
A prerequisite for the implementation of a management service by an agent group
capable of masking low-level component failures is the existence of multiple hosts with
access to the physical resources used by the service. For example, if a disk containing
managed object instances can be accessed from four different agents, then all four
agents can host management services for that management database. A four-member
agent group can then be organised to mask up to three concurrent processor failures.
Therefore, replication of the resources needed by a service is a prerequisite for making
that service available despite individual resource failures. The use of agent groups raises
a number of novel issues.
Group synchronisation. How should group members running on different processors
(or machines) maintain consistency of their local states in the presence of member
failures, member joins, and communication failures?
Group size. How many members should a group have?
restore old backups. For on-line transaction processing environments, such delays are
considered critical. For real time applications, if the response time required is smaller
than the time needed to detect a member failure and to restore old backups, close
synchronisation has to be used.
advantages of this approach are obvious only if we have to implement different services
on different sites. The different services are provided through a service availability
manager which is used to forward requests on different services to the appropriate
member or subgroup. When we need to implement just one service, this approach is not
satisfactory because it increases the total overhead. Another drawback is that no group
availability policy will be enforced when the availability manager is down. In the case of
a network management fault-tolerant system we have to implement just one service
associated with the management of certain network resources. Therefore the objective is
to have a specified management service availability policy enforced whenever at least
one site works in the system. This results in a need to replicate the global state on all
working members of the group.
To maintain the consistency of these replicated global managed objects at all sites
in the presence of random communication or site failures, each replica of a managed
object should be updated in such a way that all these sites see the same version of the
managed object. If different availability managers see different sequences of updates,
then their local views of the global system state will diverge. This might lead to
violations of a specified agent group availability policy.
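The need for a single agreed sequence of updates can be demonstrated with a small sketch; the attribute names are illustrative, and real protocols would use logical or synchronised timestamps:

```python
def apply_updates(initial, updates):
    """Each update overwrites an attribute; the final state depends on order."""
    state = dict(initial)
    for attr, value in updates:
        state[attr] = value
    return state

u1 = ("policy", "restart-member")
u2 = ("policy", "remove-member")
site_a = apply_updates({}, [u1, u2])   # site A sees u1 then u2
site_b = apply_updates({}, [u2, u1])   # site B sees u2 then u1
print(site_a == site_b)  # False: the local views of the global state diverge

def apply_in_total_order(initial, updates):
    # Sorting by an agreed timestamp gives every site the same sequence,
    # whatever order the update messages actually arrived in.
    state = dict(initial)
    for ts, attr, value in sorted(updates):
        state[attr] = value
    return state

t1 = (1, "policy", "restart-member")
t2 = (2, "policy", "remove-member")
ordered_a = apply_in_total_order({}, [t1, t2])  # received in order
ordered_b = apply_in_total_order({}, [t2, t1])  # received out of order
print(ordered_a == ordered_b)  # True: both sites converge
```

This is exactly the divergence the availability managers must avoid: without an agreed ordering, identical updates delivered in different orders yield different views of the replicated managed object.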
3.12 Summary
This chapter has proposed a number of concepts that are fundamental in designing
fault tolerant network management systems. Some of the concepts such as the notion of
the dependency between management agents and the hierarchical structure of agents
are fundamental to any fault tolerant distributed system. Dependability has been
examined as a way to form a hierarchy of co-operative agents that work together to
provide higher service availability.
Failure classification provides a way to understand the behaviour of certain
failures and to set the background for a possible recovery technique. The most frequent
failures have been discussed and the causes of such failures have been examined. The
behaviour of a faulty agent may determine the actions adopted to recover from a
particular anomaly. Fail stop and Byzantine behaviour have been stated as two different
ways with which a faulty agent may interact with other agents.
A study of failure semantics has been presented, and the distinction between weak
and strong failure semantics has been examined in terms of failure behaviour.
Hierarchical failure masking is a technique used to hide the effects of a failure, either by
calling replicated agents to provide certain services or by disguising a failure and letting
a higher abstraction layer handle it.
Concepts such as group synchronisation, group size, group communication and
availability policy relate to the architectural aspects of the network management
system, and they have been discussed from the designer's perspective. The architectural
aspects are used, first, to formulate fault tolerance issues that arise in designing a
replication management system and, second, to describe various design choices.
The next chapter will examine certain quorum consensus replica control protocols
used as the core mechanism for ensuring consistency among replicas in a
group of agents. Replica availability of the proposed protocols will be used as a measure
for examining the suitability of the protocols. In a following chapter we measure
replica availability by simulating membership changes in a group of agents.
This chapter presents the correctness criteria that should be taken into account
when designing a replication system. It discusses the difference between the logical and
the physical entity of a replicated object and explains the transaction processing
strategy that satisfies the correctness criteria. An abstract model is introduced in order to
study certain replication algorithms formally. A survey of a variety of replica control
algorithms is presented and each replica control algorithm is discussed thoroughly.
These algorithms constitute the internal mechanism of a replication scheme and they are
used basically to ensure consistency among multiple copies of an object in the presence
of network failures.
4.1 Partitioning in a Replication System
As shown in previous chapters, the technique of data replication in distributed
database systems (such as a distributed MIB) is typically used to increase the availability
and reliability of stored data in the presence of node failures and network partitions
(DAVIDSON 1985, GIFFORD 1979, BERNSTEIN 1987, JOSEPH 1987,
JAJODIA 1989, SARIN 1985). The idea of replicating an object at those sites that
frequently access it may be implemented by storing copies of the object where
access to it seems inevitable. By storing copies of a critical object on many nodes, the
probability that at least one copy of the object will be accessible in the presence of
failures increases.
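Under the simplifying assumption that the copies fail independently, each being accessible with probability p, this improvement can be quantified:

```python
def availability(p_copy_up, n_copies):
    """Probability that at least one of n independent copies is accessible,
    given that each copy is up with probability p_copy_up."""
    return 1.0 - (1.0 - p_copy_up) ** n_copies

for n in (1, 2, 3):
    print(n, round(availability(0.9, n), 3))
# 1 0.9
# 2 0.99
# 3 0.999
```

Each additional copy multiplies the probability of total inaccessibility by (1 - p); the independence assumption is of course optimistic when copies share a network segment or power source.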
All reads and writes are accomplished via transactions. Transactions are assumed to be
correct. More precisely, a transaction transforms an initially correct database state into
another correct state. Transactions may interact with one another indirectly by reading
and writing the same data items. Two operations on the same object are said to conflict
if at least one of them is a write (BERNSTEIN 1987). Conflicts are often labelled
either read-write, write-read or write-write, depending on the types of data operations
involved and their order of execution (BERNSTEIN 1981).
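The conflict rule just cited can be written directly as a predicate; operations are modelled here as (kind, object) pairs, an illustrative encoding:

```python
def conflicts(op1, op2):
    """Two operations conflict iff they touch the same object and at
    least one of them is a write (BERNSTEIN 1987)."""
    same_object = op1[1] == op2[1]
    one_is_write = "write" in (op1[0], op2[0])
    return same_object and one_is_write

print(conflicts(("read", "x"), ("write", "x")))   # True  (read-write)
print(conflicts(("write", "x"), ("write", "x")))  # True  (write-write)
print(conflicts(("read", "x"), ("read", "x")))    # False (reads never conflict)
print(conflicts(("write", "x"), ("write", "y")))  # False (different objects)
```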
A generally accepted notion of correctness for a database system is that it executes
transactions so that they appear to users as isolated actions on the database. This
property, referred to as atomicity, is achieved by the all-or-nothing execution of the
transaction operations: either all writes succeed (committed transactions)
or none are performed (aborted transactions). Correctness and consistency between
operations performed in different transactions are ensured by assigning to any set of
concurrent operations a serial execution that produces the same effect
(serialisability). That is, a serialisable execution is a concurrent execution of many
transactions that produces the same effects on the database as some serial execution of
the same transactions. Other correctness criteria may be expressed in the form of
integrity constraints. Such criteria may range from simple constraints (e.g. a particular
object cannot take a negative value) to more complex constraints that involve many
replicas (e.g. all replicas must present the same view of a particular object whenever
they are accessed).
In a system with integrity constraints, an operation is allowed only if its execution
is atomic and its results satisfy the integrity constraints. In a replicated database the
value of each logical object X is expressed by one or more physical instances, which are
referred to as the copies of X. Each read and write operation issued by a transaction on
some logical data item must be mapped by the database system to corresponding
operations on physical copies. The mapping must ensure that the concurrent execution
of transactions on replicated objects is equivalent to a serial execution on non-replicated
objects, a property known as one-copy serialisability (DAVIDSON 1985). The part of the
replication system that is responsible for this mapping is called the replica control protocol
(algorithm).
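As a minimal illustration of such a mapping, the following C sketch implements the simplest replica control rule, read one/write all, over an in-memory array of copies (illustrative only; there is no failure handling or concurrency control):

```c
#define N_COPIES 3

static int copies[N_COPIES];  /* physical copies of one logical object */

/* a logical read is mapped to a read of a single physical copy */
int logical_read(void)
{
    return copies[0];
}

/* a logical write is mapped to writes of all physical copies */
void logical_write(int v)
{
    for (int i = 0; i < N_COPIES; i++)
        copies[i] = v;
}
```

Because every write reaches every copy, reading any single copy returns the latest committed value.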
4.3 Transaction Processing During Partitioning
In a partitioned network, where the communication connectivity of the system is
broken by failures or by communication shutdowns, each partition must determine
which transactions it can execute without violating the correctness criteria. It is assumed
that the network is cleanly partitioned (that is, any two sites in the same partition can
communicate and any two sites in different partitions cannot) and that
one-copy serialisability is the correctness criterion. Addressing the correctness criteria
implies satisfying the following propositions:
1. Correctness must be maintained within a single partition by assigning a
single view to all the replicas in the partition.
2. Each partition must make sure that its actions do not conflict with the actions
of other partitions.
Correctness within a single partition can be maintained by adopting one of the replica
control algorithms. For example, the sites in a partition can implement a write operation
on a logical object by writing all copies in the partition. The problem of ensuring one-copy
serialisability across partitions becomes more difficult as the number of partitions
increases. In theory, a replication scheme contains two algorithms: one to ensure
correctness across partitions and a replica control algorithm to ensure one-copy
behaviour. In practice many replication schemes combine both algorithms into a single
solution.
4.4 Partition Processing Strategy
Solving the problem of global correctness requires dealing with two issues:
1. When a partition occurs, sites executing transactions may find themselves in
different partitions and thus unable to take a decision as to whether to commit
the transaction or abort it.
2. When partitions are reconnected (reunited) mutual consistency between
copies in different partitions must be re-established. By mutual consistency, it
is meant that the copies have the same state (or value). The updates made to a
logical object in one partition must be propagated to its copies in all the other
partitions.
Partition processing strategies can basically be divided into two classes. The first one is
called optimistic and allows updates in all partitions in the network. The second one is
called pessimistic and allows updates to take place only in one partition.
Optimistic protocols (BLAUSTEIN 1985, DAVIDSON 1984, SARIN 1985)
hope that conflicts among transactions are rare. These algorithms take the
approach that any copy of the replicated object must be available even when the network
partitions. Optimistic algorithms require a mechanism for conflict detection and
resolution. To preserve consistency, conflicting transactions are rolled back when
partitions are reunited.
Pessimistic protocols (GIFFORD 1979, ABBADI 1986, PRIS 1986a,
JAJODIA 1989, KOTSAKIS 1996a) maintain the consistency of the replicated object
even in the presence of network partitioning. Replicated objects are updated only in a
single partition at any given time. Thus only one partition holds the most recent copy,
preventing in that way any possible conflict.
Optimistic protocols are useful in situations in which the number of replicated objects is
large and the probability of partitioning is small. Pessimistic protocols prevent
inconsistency by limiting availability. Each partition makes worst-case assumptions
about what other partitions are doing and operates under the assumption that if an
inconsistency can occur, it will occur. Optimistic protocols do not limit availability and
allow any transaction to be executed in a partition that contains copies of an object.
Optimistic protocols operate under the optimistic assumption that inconsistencies, even
if possible, rarely occur. Optimistic protocols allow conflicts among the transactions and
try to resolve them when the conflicts occur. Pessimistic protocols do not allow conflicts
and prevent any inconsistency by allowing updates only in a single partition. As a
consequence, a pessimistic protocol is more suitable for real-time applications (network
management applications etc.) than an optimistic one. In a critical real-time application
like that of managing the operations of a satellite (or a nuclear reactor), the replicated
data must be consistent at all times and any possible conflict should be prevented. Real-time
processes interact dynamically with the external world (i.e. network resources).
When a stimulus appears, the system must respond to it in a certain way before a certain
deadline, taking into account all the current information. The time limit sometimes does
not allow conflict resolution. If, for instance, the response is not delivered within a pre-specified time interval, the service may be considered unavailable (performance failure).
The advantages of using a pessimistic protocol over an optimistic one in a distributed
database system that is used as a repository for real-time applications are the following:
1. A pessimistic protocol prevents any inconsistency, whereas an optimistic one
allows inconsistency and tries to resolve it later.
2. A pessimistic protocol has faster response, since all the information needed
by the protocol is available locally at the site. The protocol may decide to
allow (or not allow) a particular update by using a local record kept in each
site.
3. Optimistic protocols are useful in situations in which the number of
replicated copies is large and the probability of partitioning is small. This may
be the case when applying replication over a Local Area Network (LAN), where
the probability of a connection break is very small. When applying
replication across interconnected networks that encompass
different technologies (like the satellite network presented in chapter
2), the probability of link failures increases.
When we design a replica control algorithm, the competing goals of availability and
correctness must be seriously considered. Correctness can be achieved simply by
suspending operations in all but one of the partition groups. On the other hand,
availability can be achieved simply by allowing all nodes to process updates. It is
obvious that it is impossible to satisfy both goals simultaneously; one or both must be
relaxed to some extent depending on how critical the application is.
Considering read and
write operations on managed objects, we can classify certain protocol operations into
read and write activities. For instance, the M_GET operation of the CMIP and the GetRequest
(or GetNext) operation of the SNMP may be considered read-class operations, since
they do not affect the state of the managed object. On the other hand, the M_SET operation
of the CMIP and the SetRequest operation of the SNMP may be considered write-class
operations, since their aim is to change the current state of the managed object.
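This classification can be sketched as a small lookup routine. The following C fragment is illustrative only; it hard-codes the operation names mentioned in the text, and treats M_CREATE and M_DELETE as write-class since they also change managed-object state:

```c
#include <string.h>

typedef enum { READ_CLASS, WRITE_CLASS, UNKNOWN_CLASS } OpClass;

OpClass classify(const char *op)
{
    /* read-class: operations that do not affect the managed object's state */
    if (strcmp(op, "M_GET") == 0 || strcmp(op, "GetRequest") == 0 ||
        strcmp(op, "GetNextRequest") == 0)
        return READ_CLASS;
    /* write-class: operations that change the managed object's state */
    if (strcmp(op, "M_SET") == 0 || strcmp(op, "M_CREATE") == 0 ||
        strcmp(op, "M_DELETE") == 0 || strcmp(op, "SetRequest") == 0)
        return WRITE_CLASS;
    return UNKNOWN_CLASS;
}
```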
The replicated data items are physically stored at different sites. Each item is
conceptually an object that encapsulates some internal data and provides a well-defined
interface for accessing and updating the state of the object. The size of the
objects is not important. An object may be as simple as a single variable holding a
single value (known as a fine-grain object) or as complex as a
subordinate database (known as a large-grain object) (CHIN 1991).
Objects are physically stored at different sites. The state of an object is determined by
the current values of the variables used to describe its attributes, that is, by giving a
value to each of its variables.
The state of the entire distributed database is composed of the individual states of all
logical objects. The term logical object is used to distinguish the logical view of the
object from its physical representation. Figure 4-2 shows three different sites holding a
copy of a sensor object. This object has a single attribute called temperature and two
operations to read and update the temperature. The object has been instantiated with a
temperature value equal to 25. In a replication scheme, each site must keep a copy
(physical object) of the logical view of the sensor object. To ensure consistency, all the
physical copies must adhere to the same logical view. Accessing any of the physical
copies yields exactly the same data. The logical object provides a user-oriented
view of the entity; it shows how the user expects to see the entity. In a
replicated database, a logical object is assumed valid if all its physical representations
are consistent and consequently have exactly the same state (same temperature).
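The sensor object of Figure 4-2 and its mutual-consistency requirement can be sketched as follows (a fine-grain object; all names are illustrative):

```c
/* A fine-grain sensor object: one attribute and two operations */
typedef struct {
    int temperature;
} Sensor;

int read_temperature(const Sensor *s)
{
    return s->temperature;
}

void update_temperature(Sensor *s, int value)
{
    s->temperature = value;
}

/* Mutual consistency check: all physical copies must have the same state */
int mutually_consistent(const Sensor *copies, int n)
{
    for (int i = 1; i < n; i++)
        if (copies[i].temperature != copies[0].temperature)
            return 0;
    return 1;
}
```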
Each read and write operation issued on a logical object must be mapped to
corresponding operations on physical copies. A transaction is a process that issues read
and write operations on the objects. Each of these operations may trigger a sequence of
other operations in order to provide a particular access or update. For instance, the read
operation may trigger an interrupt and read the temperature from a hardware device. The
duration of the transaction is the time interval between the time a read or write operation
is issued and the time the operation terminates. Transactions interact with one another
indirectly by reading and writing the same logical object. As already noted, operations
on the same logical object are said to conflict if at least one of them is a write
(BERNSTEIN 1987). Therefore a conflict can occur in the following sequences of
operations:
read-write
write-read
write-write
Transactions guarantee correctness only if they are executed as isolated actions. This
property is referred to as atomic execution and it has the following effects:
1. The execution of each transaction is all or nothing: either all of the
operations are performed or none are performed (atomic commitment).
2. Executing multiple transactions concurrently produces the same result as if
they were executed serially, one after another (serialisability).
In a replicated database, logical operations issued by a transaction are mapped to
corresponding physical ones. The mapping must ensure that the concurrent execution of
transactions is equivalent to a serial execution on non-replicated data, a property known
as one-copy serialisability. The mechanism that performs this mapping is called the
replica control protocol.
When the system is partitioned, each partition must determine which transactions it can
execute without violating the correctness criteria (atomic commitment and
serialisability). This can be accomplished by considering the following statements:
1. Each partition must maintain correctness within its region.
2. Each partition must make sure that its actions do not conflict with the actions
of other partitions.
Most of the proposed replica control protocols fulfil the conditions above in order to
ensure consistency. The following sections examine thoroughly some replica control
protocols and explain how they ensure consistency under network partitioning whilst
providing at the same time a tolerable object availability.
1982). Another very similar approach is that in (MINOURA 1982). It supports the
primary copy notion, except that the primary copy can change for reasons other than site
failures; accessing a copy, however, requires the use of a token. In principle, this approach
uses the notion of the primary copy to keep consistency among distributed copies of a
logical object.
The following shows how a primary site algorithm operates under partitioning. Let
us consider a replication scheme that has n copies of a logical object X (Figure 4-3). The
copies, named X1, X2, ..., Xn, depict physical replicated entities of the object X located at
different sites connected via communication links. The X1 copy is hosted at the primary
site P and all the others at secondary sites. Whenever a site wants to read the object X, it
accesses the physical entity Xi nearest to the site. To avoid any inconsistency, this Xi copy
should be in the same partition as the primary site. Therefore, each read(X) is translated
to a read(Xi). Whenever a site wants to update the state of the object X, it broadcasts the
update to all accessible sites; that is, each write(X) operation is translated to write(X1),
write(X2), write(X3), ..., write(Xn). This approach is often called the read one, write all
mechanism (BERNSTEIN 1987). When a partitioning occurs, sites that are members
of a partition that does not contain the primary site cannot access the X object, that is,
they cannot perform a read or write operation on it. However, sites that belong to the
same partition as the primary site can fully perform any operation. Write operations
performed by these sites update only those Xi entities that are in the primary partition
(the primary partition is the partition that contains the primary site). When a reunion occurs, two or
more partitions are united into one single partition. If this unified partition contains the
primary site, all those sites that have missed previous updates become current by getting the
latest version of the primary copy. In the case of a failure of the primary site, a new
primary site may be elected (GARCIA 1982). When partitioning occurs, copies that
are not found in the primary partition are registered as unavailable or not current.
These copies cannot be accessed either for read or write. The major functions that
describe the behaviour of a primary site algorithm are as follows:
PrimaryPartitionMember(): It returns TRUE if the site is a member of the primary
partition, otherwise it returns FALSE. The primary partition is that partition which
contains the primary site. The following data structures are used to describe certain
concepts:
Object: A logical object
MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-6). The aim of this
function is to update those copies that missed some updates due to partitioning. This
function is executed by a site that becomes aware of a reunion occurrence.
Boolean MakeCurrent()
{
    if (PrimaryPartitionMember())
    {
        /* Get the value of the object either from the primary copy or any
           other copy that resides in the primary partition */
        ObjectValue v = GetLatestCopyValue(X);
        /* Update all the copies in the partition that have missed previous
           updates. If this succeeds, all these copies become current and can
           be accessed normally from that time on. */
        if (Update(X, v))
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}
Figure 4-6. Make Current in a Primary Site Protocol
current copy of the object. In a partitioned system, this constraint guarantees that an
object cannot be read in one partition and written in another. Hence read-write conflicts
cannot occur between partitions.
The second constraint ensures that two writes cannot happen in parallel or, if the
system is partitioned, that writes cannot occur in two different partitions on the same
logical object. Hence write-write conflicts cannot occur between partitions.
Each site that holds replicated objects maintains its own connection vector. A
connection vector is continuously maintained and indicates the connectivity of the site; it
provides a mechanism by which the site knows which sites it can talk
to. Communication failures and repairs are recorded in the appropriate connection
vectors, so that all connection vectors in a single partition are identical.
Each physical copy i is associated with a version number (VNi). The version
number of a copy is an integer that counts the number of successful updates to the
copy. This number is initially set to zero and is incremented by one each time an
update to the copy occurs. The current version number of a replicated object is the
maximum taken over the version numbers of all copies of the object. A copy is said to
be current if its version number is equal to the current version number.
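These definitions translate directly into code; a minimal sketch:

```c
/* Current version number of a replicated object: the maximum taken over
   the version numbers of all n copies */
int current_version(const int vn[], int n)
{
    int max = vn[0];
    for (int i = 1; i < n; i++)
        if (vn[i] > max)
            max = vn[i];
    return max;
}

/* A copy is current iff its version number equals the current version number */
int is_current(int vn_i, const int vn[], int n)
{
    return vn_i == current_version(vn, n);
}
```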
Throughout this section, we assume that there is a logical object that is stored
redundantly at n sites in a distributed system. Initially these sites are all connected and
all physical copies are mutually consistent. Since the following protocols do not depend
on the number of logical objects that are replicated, it is assumed for ease of
exposition that there is just one logical object replicated at n sites.
4.7.1 Majority Consensus Algorithm
The first voting approach was the majority consensus algorithm (THOMAS 1979).
What will be described here is the generalisation of that algorithm proposed by Gifford
(GIFFORD 1979). We simplify the discussion of this protocol by assuming only one type
of replicated physical object. Weak copies are not considered, and we assume that all
the replicated copies are assigned the same number of votes. The following functions describe
the behaviour of Gifford's approach.
DoRead(X): It reads the current state of the logical object X (Figure 4-7). This is
translated to a physical read of the nearest copy of X within the partition. The following
function shows the implementation of DoRead(X). It returns TRUE if it
succeeds, otherwise FALSE. The function FindNearestCopy(X) returns the
address of the nearest copy Xi of the logical object X. The function CollectReadVotes(X)
gathers all the read votes assigned to the logical object within a certain partition; r is the
read quorum, that is, the threshold for performing a read operation in the partition.
Boolean DoRead(Object X)
{
    if (CollectReadVotes(X) >= r)
    {
        ObjectCopy x_copy = FindNearestCopy(X);
        read(x_copy);  /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}
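In Gifford's scheme the quorums must overlap; with one vote per copy the usual constraints are r + w > n (so a read quorum always intersects a write quorum) and 2w > n (so two write quorums always intersect). A minimal sanity check, assuming one vote per copy:

```c
/* Quorum sanity check for n copies with one vote each:
   r + w > n  prevents read-write conflicts across partitions
   2w > n     prevents write-write conflicts across partitions */
int quorums_consistent(int r, int w, int n)
{
    return (r + w > n) && (2 * w > n);
}
```

For example, with n = 3 copies, r = w = 2 is consistent, while r = w = 1 is not.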
MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-9). The aim of this
function is to update those copies that missed some updates due to partitioning. This
function is executed by a site that becomes aware of a reunion occurrence.
GetLatestCopyValue(X) returns the instance of X with the greatest version number.
Boolean MakeCurrent()
{
    if (CollectWriteVotes(X) >= w)
    {
        /* Get the value of the object */
        ObjectValue v = GetLatestCopyValue(X);
        /* Update all the copies in the partition that have missed previous
           updates. If this succeeds, all these copies become current and can
           be accessed normally from that time on. */
        if (Update(X, v))
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}
Figure 4-9. Make Current in a Majority Consensus Algorithm
some additional attributes which are associated with each physical copy. These are the
following:
Update Site Cardinality (SC), which reflects the number of sites participating in the most
recent update to the object. Each site initially sets the Site Cardinality equal to the
number of replicated copies n. Whenever an update is made to the object, the Site
Cardinality is set to the number of physical copies that were updated during this
update.
Among all the sites of the network that hold a copy, there is a privileged site, called the
Distinguished site, that identifies one of the sites that participated in the last update. If the
sites are ordered (1, 2, 3, etc.), this could be the site with the greatest number that has
participated in the last update.
A partition P is said to be a majority partition if either of the following two conditions
holds:
1. The partition P contains more than half of the current copies of the object.
2. The partition P contains exactly half of the current copies and moreover
contains the Distinguished site.
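The two conditions can be written as a single predicate (a sketch; parameter names are illustrative):

```c
/* P is a majority partition if it holds more than half of the current
   copies, or exactly half of them together with the Distinguished site.
   current_in_p:      number of current copies inside partition P
   current_total:     total number of current copies of the object
   has_distinguished: non-zero if P contains the Distinguished site */
int is_majority_partition(int current_in_p, int current_total,
                          int has_distinguished)
{
    if (2 * current_in_p > current_total)
        return 1;
    if (2 * current_in_p == current_total && has_distinguished)
        return 1;
    return 0;
}
```

Multiplying by two instead of dividing avoids integer-division ambiguity when the total is odd.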
A copy is said to be current if its version number is equal to the current version number
(the maximum taken over the version numbers).
Update Site Cardinality is an attribute similar to the connection vector, but it specifies
which nodes participated in the last update. What follows are the procedures that
implement Jajodia's algorithm.
IsMajority(): This determines whether a site is a member of the majority partition or
not. Figure 4-10 shows the pseudo-code for the IsMajority() function.
#define AND &&

Boolean IsMajority()
{
    /* NOfOnes(SV) returns how many sites are working and can communicate
       with each other, determined by the number of flags that are up in
       the site vector SV */
    int n = NOfOnes(SV);
    if (n > SC/2)
        return TRUE;
    else if ((n == SC/2) AND Is_The_Distinguished_Site_In_The_Partition())
        return TRUE;
    else
        return FALSE;
}
Figure 4-10. IsMajority in the Dynamic Voting Protocol
DoRead(X): It reads the current state of the logical object X. This is translated to a
physical read of the nearest copy of X. The function in Figure 4-11 shows the
implementation of DoRead(X). It returns TRUE if it succeeds, otherwise it returns
FALSE. The function FindNearestCopy(X) returns the address of the nearest
copy Xi of the logical object X. Here a read is permitted only if the site belongs to the
majority partition, as determined by IsMajority().
Boolean DoRead(Object X)
{
    if (IsMajority())
    {
        ObjectCopy x_copy = FindNearestCopy(X);
        read(x_copy);  /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}
TRUE if it succeeds. DoWrite is executed by a site only if the site is a member of the
majority partition.
MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-14). The aim of this
function is to update those copies that missed some updates due to partitioning. This
function is executed by a site that becomes aware of a reunion occurrence. The function
Update() in Figure 4-13 updates a physical copy to the most recent value.
The MakeCurrent() function describes all the steps to update a reunited partition. The
first site S that becomes aware of the reunion sends a request to each site in the partition
P and asks them to determine locally whether they belong to a majority partition or not.
If at least one site sends a positive response, the site S obtains the Version Number (VN)
and Site Cardinality (SC) from that site and executes the Update() function; otherwise it
does the following:
It finds the maximum VN of all the copies in the partition (MAX_VN) and the set I of
sites that hold the most recent copy (the one with the maximum version number). Let C be
the Site Cardinality of any of the sites that are members of I, and N the cardinality of I. An
update of the reunited partition can take place only if
1) either N>C/2 or
2) N=C/2 and the Distinguished site is in the current partition
The first requirement makes sure that the sites with the greatest VN in the partition form
a majority. Recall that the Site Cardinality (SC) indicates the number of sites
participating in the last update. Thus the Site Cardinality (C) of any of the sites in the set
I determines the number of sites participating in the most recent update, and the
cardinality (N) of I indicates how many of those participating sites are
present. Therefore, if N is greater than half of C, more than half of the sites participating
in the most recent update are present and an update is allowed, making the partition
current. The second requirement makes sure that in case N=C/2 and the Distinguished
site is in the partition, the partition is allowed to be updated. The requirement for the
Distinguished site is used to break ties when a partition decomposes
into two sub-partitions with an equal number of sites; the
partition that contains the Distinguished site is eligible to apply any update.
BOOL UpdateObject(Object X)
{
    /* Get the value of the object */
    ObjectValue v = GetLatestCopyValue(X);
    /* Update all the copies in the partition that have missed previous
       updates. If this succeeds, all these copies become current and can
       be accessed normally from that time on. */
    if (Update(X, v))
        return TRUE;
    else
        return FALSE;
}
6. The algorithm is applicable to a set of copies (replicas) of a single data item spread
across the network at different sites. The data item is stored redundantly at n sites
(n>2).
7. Each replicated data item is associated with a set of variables used by the algorithm
to ensure consistency and availability. These variables are discussed in the next
section.
4.7.4.2 DMCA Maintenance Variables
Site Vector (SV). A sequence of bits, similar to CV, that indicates which sites
participated in the most recent update (write operation) of the data item. When
partitioning occurs and the data item in the main partition is current, the SV is
assigned the value of the CV. A data item is assumed current if either a write
operation has occurred or the MakeCurrent procedure has been performed after
the occurrence of a reunion. The MakeCurrent routine is explained in the
following section.
Site Cardinality (SC). An integer that denotes the number of sites
participating in the most recent update of the data item.
Read Quorum (r). Determines the minimum number of sites that must be up to
allow a Read operation.
Write Quorum (w). Determines the minimum number of sites that must be up to
allow a Write operation.
Current (CUR). A Boolean variable that indicates whether the data item is
current or not. It is TRUE if the data item is current, otherwise it is FALSE.
Version Number (VN). An integer that indicates how many times the
data item has been updated. Each time an update is successfully performed, the VN
increases by one. It is initially zero.
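The maintenance variables above can be grouped in a per-item structure. The following C sketch is illustrative (bit vectors are held in a 32-bit word, so at most 32 sites are assumed):

```c
#include <stdint.h>

typedef struct {
    uint32_t cv;   /* Connection Vector: sites this site can talk to      */
    uint32_t sv;   /* Site Vector: sites in the most recent update        */
    int      sc;   /* Site Cardinality: #sites in the most recent update  */
    int      r;    /* Read Quorum                                         */
    int      w;    /* Write Quorum                                        */
    int      cur;  /* TRUE (non-zero) if the local copy is current        */
    int      vn;   /* Version Number: #successful updates, initially zero */
} DmcaState;

/* bitcnt: number of bits set in a vector (used by the quorum tests) */
int bitcnt(uint32_t v)
{
    int n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}
```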
BOOL ReadPermitted(Object X)
{
    if (bitcnt(SV) >= r)
        return TRUE;
    else
        return FALSE;
}
BOOL WritePermitted()
{
    if (bitcnt(SV) >= w)
        return TRUE;
    else
        return FALSE;
}
DoRead is used when the site intends to read a replicated object. The only
condition that must be satisfied in order to perform the DoRead is that bitcnt(SV) be
greater than or equal to r.
BOOL DoRead(Object X)
{
    /* It returns TRUE if a read operation may be accomplished */
    if (ReadPermitted(X))
    {
        ObjectCopy x_copy = FindNearestCopy(X);
        read(x_copy);
        return TRUE;
    }
    else
        return FALSE;
}
DoWrite is used when the site intends to change the state of the replicated object.
This routine first checks the write quorum w to see if a write operation is permitted. If
so, it proceeds; otherwise it rejects the write operation. If the Write Quorum is satisfied,
the site broadcasts the INTENTION_TO_WRITE message to all other sites in the
partition. Each site, upon receiving this message, sends an acknowledgement. If the
originator receives acknowledgements from all the sites in the
partition, it performs the write operation and broadcasts the COMMIT message to all the sites in the
partition; otherwise it broadcasts the ABORT message. If the connection vector changes
during the operation of the algorithm, we follow a similar approach as in (JAJODIA
1989). That is, if the Connection Vector changes after the issue of the
INTENTION_TO_WRITE message but before the sending of the COMMIT message,
the originator sends the ABORT message instead of COMMIT. Any site that has
r = round(SC * RW / (RW + WW))
w = round(SC * WW / (RW + WW))
RW is the Read Weight and WW is the Write Weight. Because r and w are integers,
we use the function round to round the result to the nearest integer. If the sum (r+w) is
found equal to SC, we increase w by one to ensure consistency (r+w>SC). The Read
and Write Weights are associated with the probability of having a read and a write
respectively, as follows:
RW = 1 / ReadProb
WW = 1 / WriteProb
BOOL DoWrite(Object X, ObjectValue v)
{
    if (WritePermitted())
    {
        Broadcast(INTENTION_TO_WRITE);
        WaitForAck();
        if (AllAckReceived())
        {
            Update(X, v);
            Broadcast(COMMIT);
            return TRUE;
        }
        else
        {
            Broadcast(ABORT);
            return FALSE;
        }
    }
    else
        return FALSE;
}
If the WriteProb is much less than the ReadProb, r is approximately zero and w
is approximately equal to SC. This means that if read operations occur very frequently,
they are very likely to be executed, since they require a small quorum. On the other hand,
write operations are very unlikely to be executed in case of partitioning, since they
require a quorum approximately equal to SC. In most practical applications involving
distributed management databases, the ReadProb is approximately four or five times
greater than the WriteProb. This, of course, may vary depending upon what policy we
follow in replicating managed objects. For instance, if we choose to replicate all the
objects that are updated very frequently, we should expect the WriteProb to be greater
than the ReadProb; but such a choice does not increase the performance of the system,
since each write is translated into a set of physical write operations that require extra
network bandwidth.
MakeCurrent is performed after the occurrence of a reunion. This routine aims
to update those copies of the object that came from a subpartition in which write
operations were not allowed due to a large Write Quorum. The MakeCurrent is said
to be successful if it sets the variable CUR=TRUE. If the variable CUR is TRUE, the
object is considered current. The site that performs this routine broadcasts a request for
quorum and waits for responses. If it receives all the expected responses from all the
sites in the partition, it proceeds; otherwise it sends an ABORT message and the
MakeCurrent is considered to have failed. Each site in the partition, upon receiving
the request for quorum, sends back to the originator the VN of its copy, the w of its
copy and the state of the object. The originator receives all the responses and finds the
maximum VN, the w corresponding to that VN (MWQ), the number of
nodes that hold the maximum VN (MC) and the state of the object corresponding to
the copy with the maximum VN. If MC > MWQ, the state of the local copy is assigned the state
of the copy with the maximum VN and the following instructions are executed to update the
maintenance variables:
VN = Maximum VN
CUR = TRUE
SC = bitcnt(CV)
SV = CV
r = round(SC * RW / (RW + WW))
w = round(SC * WW / (RW + WW))
If MC is less than MWQ the site should wait for some period of time and try again.
#define AND &&

Boolean MakeCurrent(Object X)
{
    Broadcast(REQUEST_FOR_QUORUM);
    WaitForResponse();
    if (AllResponseReceived())
    {
        MVN = max{VNi : Si in P};  /* finds the maximum Version Number in the partition */
        /* finds the sites holding the maximum Version Number and gets the
           write quorum (MWQ) of these sites */
        MWQ = WriteQuorum(Si : MVN == VersionNumber(Si));
        MC = the number of sites with version number equal to MVN;
        if (MC > MWQ)
        {
            /* v corresponds to a copy with the largest Version Number */
            ObjectValue v = GetLatestCopyValue(X);
            Update(X, v);  /* updates all the copies in the partition */
            Broadcast(COMMIT);
            return TRUE;
        }
        else
        {
            WaitAndTryLater();
            return FALSE;
        }
    }
    else
        return FALSE;
}
Figure 4-19 Make Current function in DMCA
aligned with its initiation time. A message is shown as a horizontal solid arrow from the
lifeline of one object to the lifeline of another object.
The DMCA algorithm involves two main objects: the user object, which issues read
and write requests, and the replication manager object, which accepts and handles these
requests and provides response messages to the user according to the DMCA replica
control protocol. The user object can be a part of a network management application. An
instance of a replication manager object resides at many sites that accommodate
managed objects. A replication manager object may be seen as the object that a user can
communicate with, in order to collect or set management information associated with
network resources.
In the following diagrams, for simplicity, one object lifeline is drawn for all the
replication manager objects that are involved in read and write operations and one
lifeline is drawn for all user objects. Three diagrams are presented: one describing
the DoRead function, one the DoWrite function and one the MakeCurrent function.
[Sequence diagram: the DoRead function — the user object issues a read, the
replication manager finds the nearest copy (FindNearestCopy) and, if one is
available, performs the read.]
[Sequence diagram: the DoWrite function — the user object issues a write, the
replication manager counts all responses and, if all sites responded and the write
is permitted, performs the write and broadcasts a commit, returning a write
response to the user.]
[Sequence diagram: the MakeCurrent function — the replication manager broadcasts
a request for quorum and collects the QuorumResponse messages; when all responses
are received it computes MVN (the maximum VN), MWQ (the write quorum of the sites
holding MVN) and MC (the number of sites with VN = MVN); if MC > MWQ it calculates
the latest copy value, updates the copies and broadcasts a commit.]
The following table shows the mapping between the DMCA protocol operations and the
CMIP and SNMP operations:

DMCA    CMIP                         SNMP
Read    M_GET                        GetRequest, GetNextRequest
Write   M_SET, M_CREATE, M_DELETE    SetRequest
4.8 Summary
This chapter presents a thorough approach to replica control algorithms, especially
those replication algorithms that use voting techniques. Studying first the criteria for
achieving correctness, it sets the background for understanding the internal mechanisms
used to ensure consistency among multiple replicas in a distributed database.
Pessimistic and optimistic processing strategies have been discussed as two
alternatives for establishing a replication scheme. However, pessimistic strategies have
some advantages over the optimistic ones in distributed database systems that are used
as a repository for demanding applications. Pessimistic algorithms provide faster
response, higher availability, and prevent any temporary inconsistency.
Certain replication protocols have been discussed. The primary site protocol is a
static protocol that introduces the notion of the primary partition. Only the operations
that are submitted from sites of the primary partition are allowed to execute. From the
rank of voting algorithms, the following have been examined: the classic Gifford's
approach; Jajodia's dynamic voting technique, which enhances Gifford's approach by
introducing a mechanism to dynamically change the read and write quorum; and a novel
approach called DMCA, which is an improvement on Jajodia's technique since it is able
to dynamically change the read and write quorum by taking into account the read and
write ratio. Jajodia's algorithm implies that reads and writes execute with the same
probability (since they have the same occurrence rates). It cannot distinguish a
possible difference between read rate and write rate. Adjusting the read and write
quorum according to the read and write occurrence rate may increase the availability of
the replicated object and make the system more fault tolerant.
Chapter 6 provides a quantitative comparison between the approaches presented in
this chapter and draws some conclusions about the availability provided by each replica
control protocol. The next chapter discusses the model used to simulate certain replica
control protocols.
This chapter presents the object oriented development of the Availability Testing
System (ATS). It discusses first the advantages of using the object oriented paradigm for
developing such a complex system. It then presents the simulation modelling process
and presents briefly the Object Modelling Technique (OMT), which has been used to
construct the ATS system. Following the development process imposed by the OMT, it
discusses the requirements of the ATS system, then its analysis and design through a
static and a dynamic object model.
5.1 Introduction to simulation modelling
Simulation should be understood as the process of designing a model of a real
system and conducting experiments with this model for the purpose of understanding
the behaviour of the system or of evaluating various strategies for the operation of the
system (SHANNON 1975).
Simulation is classified based on the type of system studied, and it can be
either continuous or discrete. In the case of studying replication algorithms, discrete
simulation seems adequate to describe the behaviour of each algorithm. There are two
approaches for discrete simulation: event driven and process driven. Under event driven
discrete simulation, the modeller has to think in terms of the events that may change the
status of the system (LAW 1991). In a replication system, for example, the status may
change by the occurrence of events that cause partitions and reunions. The status of the
system is defined by a set of variables being observed. On the other hand under the
process driven approach, the modeller thinks in terms of processes that the dynamic
entity will experience as it moves through the system.
The simulation system that has been used to test the availability of the replication
algorithms is a system consisting of certain dynamic entities. Dynamic entities are the
objects that interchange information providing in that way certain services by using the
system resources. Entities may experience events which result in an instantaneous
change of the system state. Some events are endogenous and occur within the system
(replica updates) and some events are exogenous and occur outside the system (read,
write, partition and reunion operations). The aim of the simulation is to model the
random behaviour of the system, over time, by utilising an internal simulation clock and
sampling from a stream of random numbers.
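The event-driven view with an internal simulation clock can be sketched as a minimal loop; this is illustrative only — the class and method names are hypothetical, not those of ATS:

```cpp
#include <queue>
#include <vector>
#include <functional>

// Minimal event-driven simulation skeleton (hypothetical names). Events
// are ordered by timestamp; the simulation clock jumps from one event
// to the next rather than advancing continuously.
struct Event {
    double time;
    std::function<void()> action;
    bool operator>(const Event& o) const { return time > o.time; }
};

class Simulator {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> agenda;
    double clock = 0.0;
public:
    void Schedule(double t, std::function<void()> a) { agenda.push({t, a}); }
    double Now() const { return clock; }
    void Run(double endTime) {
        while (!agenda.empty() && agenda.top().time <= endTime) {
            Event e = agenda.top(); agenda.pop();
            clock = e.time;   // advance the simulation clock to the event
            e.action();       // handle the event (may schedule new ones)
        }
        clock = endTime;
    }
};
```

Handlers for partitions, reunions, reads and writes would be scheduled through Schedule() and executed in timestamp order, which is precisely what makes the simulation discrete.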
5.2 Using an Object-Oriented Technique for Modelling a Simulation
System
Simulation is a useful and essential technique for verifying the operability of
systems with a large number of entities. The object oriented paradigm has become
popular in software engineering communities due to its modularity, reusability and its
support for iterative design techniques. The idea of an object-oriented simulation has
great intuitive appeal in the application development process because it is very easy to
view the real world as being composed of objects. An object oriented technique
introduces (1) information hiding, (2) abstraction and (3) polymorphism. Both
information hiding and data abstraction allow the simulation modeller to focus on those
mechanisms that are important, discarding any irrelevant implementation details. This
gives the modeller the freedom to change implementation details of a system component
at a later stage of the development without the need to redesign or affect other
components. The
flexible behaviour of objects is realised through polymorphism and dynamic binding of
methods. The binding to an actual function takes place at run-time and not at
compile-time. In this way, inheritance provides a flexible mechanism by which you can
reuse code, since a derived class may specialise or override parts of the inherited
specification.
Object-oriented techniques offer encapsulation and inheritance as the major
abstraction mechanisms to be used in system development. Encapsulation promotes
modularity, meaning that objects must be regarded as the building blocks of a complex
system. Once a proper modularisation has been achieved, the object implementor may
postpone any final decisions concerning the implementation.
Another advantage of an object-oriented approach, often considered the main
advantage, is the reuse of code. Inheritance is an invaluable mechanism in this respect,
since the code that is reused offers all that is needed. The inheritance mechanism
enables the developer to modify (or refine) the behaviour of a class of objects without
requiring access to the source code.
5.3 Object Oriented Discrete Event Simulation
In the object oriented paradigm, a program is described as a collection of
communicating objects that represent separate activities in the real world and which are
able to exchange messages with each other. An object is an abstract data type that
defines a set of operations which operate on the internal data that represent the object.
Each object is an instance of a class. A class can be thought of as a template which
produces objects. The object oriented paradigm has been successfully applied to a
variety of fields of computer science and engineering. In distributed algorithms, the
global system is decomposed into a set of communicating logical processes. These
logical processes work concurrently to accomplish the objective of the distributed task.
This concurrency is realised in a simulation system by sequential simulation of the
execution time. The
function of the inter-arrival time of all of the events handled by the system is given by
an exponential distribution (Poisson arrivals imply exponential inter-arrival times). The
occurrence rate of the events and the simulation time determine the number of events
that occur during the observation time interval, which is equal to the simulation time
interval. The ATS simulation system is a terminating system (SADOWSKI 1993).
Terminating systems are systems that have a clear point in time when they start
operations and a clear point in time when they end operations. ATS specifies a random
event sample size and a simulation time length.
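Exponential inter-arrival times can be drawn from a uniform random stream by inverse-transform sampling; the sketch below is illustrative (the generator seed and the event rate are arbitrary, not taken from ATS):

```cpp
#include <cmath>
#include <random>
#include <limits>

// Inverse-transform sampling: if U is uniform on (0,1), then -ln(U)/rate
// is exponentially distributed with the given event rate, matching the
// Poisson arrival assumption in the text.
double NextInterArrival(double rate, std::mt19937& gen) {
    std::uniform_real_distribution<double> u(
        std::numeric_limits<double>::min(), 1.0);  // avoid log(0)
    return -std::log(u(gen)) / rate;
}
```

Averaged over many draws, the samples approach the mean inter-arrival time 1/rate, which is how the event generator would space partitions, reunions, reads and writes over the simulation interval.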
5.4.2 Model Implementation
The ATS simulation model is implemented using the C++ programming language
(STROUSTRUP 1991). C++ is a good tool that supports program organisation through
classes and class hierarchies. Classes help the developer to decompose a complex
solution into simpler ones. Each class has its own internal data that may be updated
through a set of defined operations. The encapsulation of code (operations) and data
(variables) into a single entity helps the developer to focus on the design and
implementation of smaller pieces of software structure and then unify all the separate
components to form the complete solution.
during the design of the system. The OMT methodology has basically three stages:
Analysis, Design and Implementation.
5.5.1 Analysis
Analysis is the first step of the OMT methodology; it starts from the statement
of the problem and builds a model focusing on the properties of particular objects that
are used to abstractly represent real world concepts. The analysis model is a precise
abstraction of what the desired system must do, not how it will be done. The analysis
clarifies the requirements and sets the basis for later design and implementation. The
output of the analysis phase consists of two models, named the object and dynamic models.
The object model describes the static structure of the objects in a system and their
relationships. The object model contains object diagrams. An object diagram is a graph
whose nodes are object classes and whose arcs are relationships among classes. An
object model captures the structural aspect of the system by showing the objects
participating in the system as well as the relationships among them.
The dynamic model describes the behavioural aspect of the system over time. The
dynamic model is used to specify and implement the control aspects of the system. The
dynamic model contains state diagrams. A state diagram is a graph whose nodes are
states and whose arcs are transitions between states. Transitions are caused by events.
An event is something that happens at a point in time and it represents external stimuli.
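A state diagram of this kind maps directly onto a transition function; the following is a minimal sketch with hypothetical states and events, not taken from the ATS models:

```cpp
// Hypothetical two-state machine: transitions between states are caused
// by events, mirroring the state-diagram notation described in the text.
enum class State { Idle, Busy };
enum class EventKind { Start, Finish };

State Transition(State s, EventKind e) {
    switch (s) {
        case State::Idle: return (e == EventKind::Start)  ? State::Busy : s;
        case State::Busy: return (e == EventKind::Finish) ? State::Idle : s;
    }
    return s;  // unrecognised event: remain in the same state
}
```

Each node of a state diagram becomes an enumerator and each arc a case in the transition function; events that do not label an outgoing arc leave the state unchanged.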
5.5.2 Design
Design emphasises a proper and effective structuring of the complex system
allowing an object oriented decomposition. During the design phase, high level
decisions are made about the overall architecture of the system. The analysis phase
determines what the implementation must do, and the design phase determines the full
definitions of the objects and associations used in the implementation, as well as the
methods used to implement all the operations. During the design phase the development
of the system moves from the application domain concepts toward computer concepts.
The classes, attributes and associations from analysis must be implemented as specific
data structures.
5.5.3 Implementation
During implementation, all the design objects and associations are explicitly defined
using a programming language (preferably an object-oriented one). The implementation
language should provide facilities that help the developer to realise the concepts as
defined during the design phase. One can fake an object oriented implementation using
a non-object-oriented language, but it is horribly ungainly to do so. To have a smooth
transition from the design phase to the implementation phase, the language should
support the following features (CARDELLI 1985):
1. Objects that are data abstractions with an interface of named operations and a
hidden local state.
2. Objects that have an associated type (class).
3. Types (classes) that may inherit attributes from super-types (super-classes).
According to the Cardelli and Wegner definition (CARDELLI 1985), a language is said to
be object oriented if it supports inheritance. Under this definition, Smalltalk
(GOLDBERG 1983), C++ (STROUSTRUP 1991), Eiffel (MEYER 1992) and CLOS
(KEENE 1989) are all object oriented languages and can be used to implement an
object-oriented concept.
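The inheritance-based refinement that this definition requires can be illustrated in C++; the class names below are illustrative only:

```cpp
#include <string>

// A derived class specialises part of the inherited behaviour without
// access to the base-class source code, as described in the text.
class Shape {
public:
    virtual std::string Name() const { return "shape"; }  // default behaviour
    virtual ~Shape() = default;
};

class Circle : public Shape {
public:
    std::string Name() const override { return "circle"; }  // refinement
};
```

A call through a base-class pointer is bound at run-time to the overriding method, which is the dynamic binding the text refers to.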
5.6 ATS Requirements
The OMT methodology will be used to develop a simulation for testing different
replica control protocols. The final tool is called Availability Testing System (ATS) and
algorithm under testing. Each Node object is an autonomous entity simulating the
behaviour of a separate site. The number of read and write operations performed are
registered and finally a statistical object is called to measure the availability of each
replica control protocol. All measurements are stored in a file for further processing.
The main objective of the simulation is to get a practical estimate of the availability
provided by certain replica protocols in order to draw useful conclusions about their
effectiveness. During the evaluation of the replica control protocols all the relevant
rates with which the events are generated are taken into account. The rate with which an
event is generated may affect the effectiveness of a certain protocol.
Figure 5-1
instance of an algorithm. Each instance runs independently. ATS supervises all the
actions that should be taken and it handles all the events generated by the ATS event
generator. Events may affect the state of the algorithm or the state of the whole system.
protocol is a specialisation of the Algorithm super class. Any message used by the
replica control protocol is a specialisation of the Message super class. Algorithm super
class is an abstract class that provides the basic functionality that may be found useful to
a replica control protocol, among others, it provides a service for sending and receiving
messages. Sending a message is implemented by forwarding the message to the local
Node object. The Node
the number of read and write operations issued and the read and write operations
executed, respectively. These variables provide an estimate of the availability.
[Object model diagram: the static structure of ATS. A Network consists of one or
more Partitions, and each Partition consists of one or more Nodes. Each Node holds
protocol instances (Alg1, Alg2, ...) that specialise the abstract Algorithm class and
exchange specialised Message objects (Msg1, Msg2, ...). NetworkEventGenerator
(with ReunionRate and PartitionRate) and NodeEventGenerator (with ReadRate and
WriteRate) specialise UniformRandGenerator and deliver events through
GetNextEvent(). A Vector class implements the connectivity bit vector (Set, Reset,
IsSet, SetAll, ResetAll, Count, Size). A Message carries source and destination node
and algorithm identifiers. The Algorithm class keeps the counters read_performed and
write_performed and declares DoRead(), DoWrite() and HandleMessage(Message) as
abstract operations, alongside concrete services such as Send(Message),
Receive(Message), CollectStat(Statistics), HandleEvent(SystemEvent),
NetSize(), PartSize() and ConnectionVector().]
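The abstract Algorithm superclass in the object model can be sketched as follows; the method names follow the diagram, but the bodies and the concrete protocol below are illustrative assumptions:

```cpp
// Sketch of the Algorithm superclass from the ATS object model: concrete
// replica control protocols override the abstract operations.
class Message {
public:
    int msg_id = 0;
    virtual ~Message() = default;
};

class Algorithm {
protected:
    int read_performed = 0;   // successful reads (per the diagram)
    int write_performed = 0;  // successful writes
public:
    virtual bool DoRead() = 0;                 // {abstract}
    virtual bool DoWrite() = 0;                // {abstract}
    virtual void HandleMessage(Message&) = 0;  // {abstract}
    int ReadsPerformed() const { return read_performed; }
    virtual ~Algorithm() = default;
};

// A hypothetical protocol specialisation (stands in for Alg1/Alg2 in the
// diagram); real protocols would gather quorums before succeeding.
class MajorityProtocol : public Algorithm {
public:
    bool DoRead() override { read_performed++; return true; }
    bool DoWrite() override { write_performed++; return true; }
    void HandleMessage(Message&) override {}
};
```

Porting a new protocol then amounts to deriving a new subclass of Algorithm and of Message, leaving the rest of the system untouched.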
The dynamic model describes the flow of control, interactions and sequencing of
operations in the system. The dynamic model for the ATS is shown in figure 2. It
basically consists of four state diagrams. Each state diagram describes an interactive
aspect of the associated object. Network, Partition, Node and Algorithm are four
fundamental state diagrams that depict the sequencing of operations in the associated
objects. Each of these diagrams includes sub-diagrams that refine the interactions and
provide more details about the sequencing of the operations. The states of a sub-diagram
[Dynamic model state diagrams (four in total). The Network diagram, driven by the
NetworkEventGenerator, alternates between Simulate (MakePartition, MakeReunion —
when Np > 1, two partitions are chosen randomly and sent Join) and CollectStat,
until EndOfSimulation and EndOfCollection. The Partition diagram handles Break,
Join, Simulate and CollectStat. The Node diagram, driven by the
NodeEventGenerator, sits idle, forwards Send(msg)/Receive(msg) to the destination
node and increases readissued/writeissued on Read/Write events, sending Read to
each algorithm. The Algorithm diagram handles system events (HandleEvent(vnt)),
sends and receives messages, performs Read and Write (increasing read_performed
and write_performed by 1 on success) and registers those counters on CollectStat.
Legend: Nn is the number of nodes, Np the number of partitions; dashed arrows
denote shared events.]
are totally determined by shared events and conditions. For instance, the Network state
diagram includes two sub-diagrams; one for simulation (Simulate) and one for
collection of the statistics (CollectStatistics).
NetworkEventGenerator and NodeEventGenerator objects deliver events to both
Network and Node. Each of those events triggers a sequence of operations. Dashed lines
represent transitions between objects. Shared events carry information transferred from
one object to another and trigger activities in those objects.
5.9 Evaluation of the System
In the ATS system, the protocol is defined as a separate module and is then
compiled with the rest of the system to form an executable module. More than one
protocol may be tested at the same time (in a single run). All the protocols respond to
the same set of events, allowing us to draw conclusions about the suitability of a certain
replica control protocol. The ATS model incorporates most of the characteristics of an
object oriented system (classification, polymorphism and inheritance) in order to define
highly reusable components. It allows the testing of the availability of multiple replica
control protocols without the need to modify our testing system. Porting a new protocol
is fairly easy, since all it requires is the definition of the protocol and the definition of
any related messages it uses. The ATS system remains unchanged, providing higher
reusability with less effort.
5.10 Summary
Discrete event simulation using the object oriented paradigm has been shown to be a
suitable approach for building complex simulation systems. The modularity and
reusability help to decompose the system into co-operative processes that are related to
independent simulation entities. The Object Modelling Technique (OMT) supports all of
the necessary facilities for expressing object oriented concepts. OMT has been used
extensively for analysing the requirements of the ATS system as well as for designing
the ATS system. The whole ATS system is summarised in two diagrams, named the
object model and the dynamic model. The object model describes the static structure of
the system, whereas the dynamic model describes the behavioural aspect of the system.
ATS allows the testing of the availability of multiple replica control protocols without
the need to modify the main procedures of the testing system. This provides extra
flexibility, since it makes it easy to port a new protocol without disturbing the core
procedures of the simulation.
This chapter presents how the measurements regarding the performance of certain
replica control algorithms have been obtained. It introduces the simulation model used to
build a benchmark test utility for estimating the effectiveness of algorithms. It discusses
a fault injection mechanism for generating faults and repairs and it specifies the
environment in which the simulation evolves. It also defines the functional components
of the simulation, as well as the parameters used to estimate the availability of the tested
algorithms. The chapter ends with a thorough discussion about the contribution of the
DMCA algorithm. It shows why the DMCA provides higher total availability and
presents the results of the benchmark test. The DMCA is compared with two other
representative voting algorithms (GIFFORD 1979, JAJODIA 1989).
6.1 Performance Evaluation
The evaluation of the performance of replica control protocols has become an area of
great practical interest. In most cases, the most important aspect of this performance is
the availability of replicated objects managed by the protocol. The availability of the
replicated data objects represents the steady-state probability that the object is available
at any given moment. Several techniques have been used to evaluate the availability of
replicated data. Combinatorial models are very simple to use (PU 1988) but cannot
represent complex recovery modes like those found in voting protocols (GIFFORD
1979), (PARIS 1986b), (JAJODIA 1989) and (KOTSAKIS 1996b). Stochastic
models have been extensively used to study replication protocols (JAJODIA 1990),
(PARIS 1991) but suffer from two important limitations:
1. Stochastic models quickly become intractable, unless all failures and repair
processes have exponential distributions.
2. Stochastic models do not describe communication failures well, since the
number of distinct states in a model increases exponentially with the number of
failure modes being considered.
Discrete event simulation does not suffer from these limitations. Simulation models
allow the relaxation of most assumptions that are required for stochastic models. They
can also represent systems with communication failures. For all its advantages,
simulation has one major disadvantage: it provides only numerical results. This makes it
more difficult to predict how the modelled system would behave when some of its
parameters are modified. Each time the parameters change, the simulated system must
be run again to obtain the results.
6.2 The Simulation Model
Most studies of replicated data availability have depended on probabilistic models
to evaluate the availability of replica control protocols (JAJODIA 1990). These models
do not generally consider the effect of network partitioning, because of the
enormous complexity that would be involved. As a result, the data that they present are
for ideal environments that are unlikely to exist under actual conditions. Discrete event
simulation has been used to observe the behaviour of three replica control protocols
under more realistic conditions. Many parameters can affect the availability of replicated
data. The simulation model considers the following types of failures:
1. Hardware failures, which result in a site being down for hours or even days
Figure 6-1 shows a typical network model that may be considered for hosting a
replication network. It consists of two carrier-sense segments (IEEE 802.3 based LANs)
and two token ring segments (IEEE 802.5 based LANs). The repeater and the gateways link
together all of
(SCHLICHTING 1983). The network will be partitioned into one or more partitions
if the repeater or a gateway fails. Sites attached to a local area network can communicate
even after the repeater or gateway failure but they are not able to communicate with a
site attached to another LAN. All replicated objects are assumed to be available as long
as they can be accessed from any site in the network.
Figure 6-2
system with the target replication algorithms, a fault injector, a repair injector, a work
load generator, a data collector and a data analyser. The replication system comprises all
the groups of replicated nodes that can host the replication algorithms as well as the
necessary facilities for interchanging messages between nodes of the same group. The
fault injector injects partitioning faults into the target system, by injecting faults into
certain gateways or repeaters. The repair injector injects reunions into the system. Each
reunion corresponds to a repeater or gateway repair. The system also executes read and
write operations on replicated objects. The read and write operations are generated from
the work-load generator. The controller is physically the simulation program that runs
and controls all the parts of the testing system. It also tracks the execution of read and
write operations and initiates data collection. The data collector performs on-line data
collection and the data analyser performs data processing and analysis. The injection of
faults is done during run-time by using the time-out technique. A timer expires at a
predefined time, triggering injection. The inter-arrival time between faults follows the
exponential distribution. When the timer expires, the fault operations occur by
interrupting the normal operation of the system.
6.4 Simulated Algorithms
The three algorithms (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b)
presented in the previous chapter have been tested. Each algorithm has been tested
under exactly the same sequence of events. When an event (partition, reunion, read or
write) occurs, it is inserted into a queue and then each algorithm performs the necessary
housekeeping operations to reflect any change in the replication system. Each algorithm
keeps its own record and when the simulation finishes and each algorithm reaches a
steady-state condition, the simulation control unit counts the percentage of read and
write operations that have been performed giving in that way an estimation of the
3. Repair delay
4. Read rate
5. Write rate
The mean inter-arrival time of read operations is 1/(read rate) and the mean
inter-arrival time of write operations is 1/(write rate). The read and write operations
generated in each site realise the periodical access process to each replicated object.
Each site provides a process which calculates the percentage of successful accesses. An
access is considered successful if the relative operation (read or write) can be performed.
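Since every tested protocol is driven by the identical event sequence and keeps its own success record, the bookkeeping can be sketched as follows (all names here are hypothetical, not those of the simulator):

```cpp
#include <vector>

// Each tested algorithm receives exactly the same events, so the
// measured availabilities are directly comparable.
enum class Ev { Read, Write, Partition, Reunion };

struct Proto {
    int handled = 0;
    void Handle(Ev) { handled++; }  // placeholder for protocol housekeeping
};

void Replay(const std::vector<Ev>& events, std::vector<Proto>& protos) {
    for (Ev e : events)
        for (Proto& p : protos)
            p.Handle(e);  // the same event is delivered to every protocol
}
```

In the real simulator each Handle call would attempt the read or write under that protocol's quorum rules and record success or failure separately per protocol.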
6.5 The Protocol Routines
A general form for accessing the replicated objects is used in a common way to all
of the tested protocols. When a read or write occurs the read or write routine of each
protocol is activated. Each protocol has its own view of the network state and according
to that view it executes all the necessary subroutines needed to accomplish a read or
115
write access. The results of a successful or unsuccessful access are recorded separately
for each protocol. These results are gathered during execution and later they are
compared in order to draw useful conclusions about the availability provided by each
protocol. A critical part of the model is to determine whether two sites can
communicate. Since all the protocols rely on communication between sites to determine
the status of the replicated data, a fast simple means is needed to determine
communication links. For sites on the same LAN, the solution is simple. If any two sites
are up and running, it can be assumed that they can communicate. For sites not on the
same network segment, the assumption cannot be made since they may be separated by
one or more gateway sites or repeaters. A solution is found by viewing the
network as a tree structure, whose nodes consist of the different network segments and
their respective sites. One segment is chosen as the basis. Communication is determined
by traversing the tree between two sites. The tree structure is conceptual and is
represented by doubly linked lists1. The connectivity between sites is shown by a
communication vector which is realised through an array of bits. Each site is assigned a
unique identity number. Given the identity numbers of two sites, the communication
routines determine
if they can communicate by checking the connectivity vector.
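The connectivity check through the bit vector can be sketched as follows; this is a sketch only, with the vector size and helper name chosen for illustration (site identities index the bits, as the text describes):

```cpp
#include <bitset>

// Connectivity vector: bit i is set when site i is reachable. Two sites
// can communicate when both of their bits are set in the vector.
const int MaxSites = 64;  // illustrative upper bound on site identities

bool CanCommunicate(const std::bitset<MaxSites>& conn, int a, int b) {
    return conn.test(a) && conn.test(b);
}
```

Representing connectivity as bits makes the check a constant-time operation, which matters because every simulated read and write consults it.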
6.6 Implementing Group Communication
Sites in a communicating group exchange messages by using the multicast model.
A simple implementation is based on simple message sending which might take the
form:
void multicast(PortID *destination, Message msg)
{
    for (int i = 0; i <= HigherDestination; i++)
        Send(destination[i], msg);
}
A thorough study of the organisation of the simulation is presented in the next chapter.
Here, destination is an array of destination ports which will receive the message msg,
and HigherDestination is the highest index, identifying the last destination. Destination
i is identified by the destination[i] ID. The multicast procedure simply sends the
message to each destination port. It has been used in the simulation program because
of its simplicity and ease of handling. In practice, this multicast procedure may be
replaced by other more sophisticated and efficient procedures like those provided in
broadcast packet switching network technologies.
6.7 Functional Components of the Simulation.
The functional components of the simulation represent the main modules of the
simulation model. Figure 6-3 illustrates the modules that compose the simulation model.
Each module is briefly described as follows:
1. Transaction generator: This module generates transactions and distributes them to
the relevant sites. A transaction is a sequence of read and write operations which are
generated through independent generators.
2. Repair/Failure generator: This module generates site and communication failures
and distributes them to the relevant sites. This module also generates repairs,
triggering in that way the recovery system.
The read availability Ar is the ratio of the reads performed to the reads issued:

Ar = Rp / Ri

where Rp is the number of reads performed and Ri the number of reads issued. The
write availability Aw is similarly the ratio of the writes performed Wp to the writes
issued Wi:

Aw = Wp / Wi

The total availability A is the ratio of the operations performed to the operations
issued:

A = (Rp + Wp) / (Ri + Wi)

Letting α = Ri / Wi denote the ratio of reads issued to writes issued, the total
availability can be written as

A = (α Ar + Aw) / (α + 1)

so that A tends to Ar for large α and A tends to Aw when Ri << Wi.

[Figure: the total availability curve against α, bounded between the asymptotes
Ar and Aw.]

The figure shows the availability curve and points out that the total availability is
limited between the asymptotes Ar and Aw. Calculating the derivative of A, we find
that

dA/dα = (Ar − Aw) / (α + 1)²
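The relation between the total availability, its asymptotes and the read-to-write ratio α, as reconstructed above, can be checked numerically with a small sketch (the symbol α and the function names are illustrative):

```cpp
// Total availability as a function of the read-to-write ratio alpha,
// reconstructed from A = (Rp+Wp)/(Ri+Wi), Ar = Rp/Ri, Aw = Wp/Wi and
// alpha = Ri/Wi.
double TotalAvailability(double alpha, double Ar, double Aw) {
    return (alpha * Ar + Aw) / (alpha + 1.0);
}

// Derivative of A with respect to alpha: (Ar - Aw) / (alpha + 1)^2.
double Derivative(double alpha, double Ar, double Aw) {
    return (Ar - Aw) / ((alpha + 1.0) * (alpha + 1.0));
}
```

At α = 4 the derivative equals (Ar − Aw)/25, which is at most 1/25 = 0.04 since 0 < Ar − Aw < 1, consistent with the discussion that follows.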
The derivative of A indicates that as the difference (Ar − Aw) approaches 1, the slope
of the total availability A approaches its maximum value. This implies that the total
availability grows faster for small values of α, as shown in the availability curves.

[Figure: total availability curves for two pairs of asymptotes (Ar, Aw) and
(A'r, A'w), plotted against α over the range 0 to 20.]
Empirical measurements deem that the read-to-write ratio is α = 4 (BAKER 1991). For
α = 4, the derivative of A is equal to (Ar − Aw) / 25. Considering that 0 < Ar − Aw < 1,
its maximum value becomes 1/25 = 0.04. Therefore, as shown in Figure 6-5, at α = 4,
the total availability curve tends to be parallel to the horizontal axis, reaching in that
way a steady-state condition. Therefore, for a given α > 4, we may increase the total
availability if we can somehow increase the difference (Ar − Aw). This is what the
DMCA algorithm (described in the previous chapter) achieves by exploiting such large
values of α. The DMCA algorithm exploits the difference between the rates of read and
write operations by changing dynamically the read and write quorums and allowing the
execution of the read operations rather than
121
the write operations. Read operations are actually the majority of the operations
performed in a replication system.
6.10 Results of the Simulation
To simplify the analysis of the simulation results, it is assumed that all sites have
identical mean time to fail and mean time to repair. If individual site attributes had
been considered, it would have been difficult to determine whether the availability was
being affected by partitioning or by some other factor, such as a site failing frequently
for short periods of time.
The simulation utilises 20 replicas (one in each node). The partitioning rate is
equal to 2 partitions per time unit and the simulation interval is 100 time units. These
parameters are shown in Table 6-1. The repair delay is measured in time units². The
availability provided by the tested replica control protocols has been estimated for
various values of the repair delay. The chosen values for the mean repair delay are
² A time unit could be a day or a month or any other time interval defined by the user. The choice of a
particular time unit does not affect the results of the simulation as long as we use the same time unit for all
the simulation parameters (simulation time interval, partitioning rate and repair delay).
Table 6-1: Simulation parameters

Parameter              Value
Number of replicas     20
Partition rate         2 partitions per time unit
Simulation interval    100 time units
consistent with the measurements made using the Internet (LONG 1991) and range
from 0.1 to 3.
The rest of this chapter presents the availability results provided by the
tested replica control protocols, as measured during the simulation. The
following figures present the read availability, the write availability and the total
availability provided by each replica control protocol for certain values of the mean
repair delay. The figures presented in this chapter have been created by taking into
account the results obtained by using the Availability Testing System (ATS), which is
discussed in the next chapter. In Appendix B, one can find the tables associated with
the figures as well as the raw values regarding the read and write operations that
occurred during the simulation. The tables in Appendix B show the number of read and
write operations issued during a simulation as well as the number of read and write
operations performed by each replica control protocol during the same interval of time.
They also show how many partitions and reunions occurred as a result of failures and
repairs respectively. All of the following figures show the availability with respect to
the parameter ρ (the ratio of reads issued to writes issued). The replica control protocols
that have been tested are: the voting algorithm (GIFFORD 1979), the dynamic voting
algorithm (JAJODIA 1989) and the DMCA (KOTSAKIS 1996b).
Figure 6-7 shows the read availability, the write availability and the total availability
provided by the tested replica control algorithms for a mean repair delay equal to 0.1.
This means that during the time interval of 100 time units, 200 partitioning events are
expected (since the rate is 2 partitions per time unit), and each failure takes
approximately 0.1 time units to be repaired. Because the repair delay is very short
compared with the inter-arrival time of the partitions, the read availability is very high,
around 0.95. This means that approximately 95% of the read operations issued are
performed. The write availability is expected to be smaller than this, due to the higher
cost of the write operations. However, Jajodia's algorithm provides higher write
availability than the other two. Jajodia's algorithm keeps a balance between the
execution of read and write operations. It provides an almost constant read and write
availability regardless of the rates of read and write operations issued. DMCA tries to
exploit the difference between the read and write rates in order to increase the total
availability. The DMCA approach provides higher availability than that of the other
two algorithms. The availability of the DMCA increases as the factor ρ increases. This
is because the aim of the DMCA algorithm is to allow low-cost operations (like reads)
with large occurrence rates to be performed easily, whereas high-cost operations (like
writes) are performed rarely. Under this strategy the majority of the operations issued
are executed. This approach makes the DMCA algorithm suitable for ρ > 1.
The availability provided by Gifford's algorithm follows that of DMCA, but
this is due to the fact that in Figure 6-7 the repair delay is very short. As we see later,
when the repair delay increases, the total availability of Gifford's algorithm becomes
much smaller than that of DMCA, especially for ρ < 6 (in practice ρ < 10). Figure 6-8
shows the availability results for a repair delay equal to 0.2. These availability curves
are similar to the ones presented in Figure 6-7, except that the availability level of each
algorithm is a bit smaller. This is expected, since repairs occur less frequently and it
takes more time to have a reunion. This also means that the replicated objects are
unavailable for longer
intervals of time. As the repair delay increases, the availability level of all of the tested
algorithms gradually decreases. Figure 6-9 illustrates the availability results for a mean
repair delay equal to 0.3. The DMCA algorithm provides higher total availability than
that of the other two algorithms. Figure 6-10 illustrates the availability results for a
mean repair delay equal to 0.4 and Figure 6-11 for 0.5. As the ratio ρ increases, it
becomes more difficult for write operations to be performed, since the read and write
quorums are reconfigured according to the current read and write rates. However,
DMCA eases the execution of the read operations and thus increases the possibility of
performing the majority of operations issued, which results in a higher total availability.
Figure 6-12, Figure 6-13, Figure 6-14, Figure 6-15 and Figure 6-16 illustrate the
availability results for mean repair delays equal to 1.0, 1.5, 2.0, 2.5 and 3.0
respectively. It is observed that there is not much difference between the slopes of the
illustrated availability curves for the tested algorithms. What is very obvious in these
figures is the gradual decrease of the availability provided by the algorithms as the
mean repair delay increases. For instance, for a mean repair delay equal to 1.0, the
DMCA algorithm provides a total availability around 0.62 and Jajodia's algorithm
around 0.58 (see Figure 6-12). When the repair delay becomes equal to 3.0, the total
availability decreases further; DMCA still provides the highest total availability, since it
reconfigures dynamically the read and write quorums according to the read and write
occurrence rates.
Figure 6-7: Availability provided by the tested replica control protocols for mean repair delay=0.1 (a)
read availability (b) write availability (c) total availability.
Figure 6-8: Availability provided by the tested replica control protocols for mean repair delay=0.2 (a)
read availability (b) write availability (c) total availability
Figure 6-9: Availability provided by the tested replica control protocols for mean repair delay=0.3 (a)
read availability (b) write availability (c) total availability
Figure 6-10: Availability provided by the tested replica control protocols for mean repair delay=0.4 (a)
read availability (b) write availability (c) total availability
Figure 6-11: Availability provided by the tested replica control protocols for mean repair delay=0.5 (a)
read availability (b) write availability (c) total availability
Figure 6-12: Availability provided by the tested replica control protocols for mean repair delay=1.0 (a)
read availability (b) write availability (c) total availability
Figure 6-13: Availability provided by the tested replica control protocols for mean repair delay=1.5 (a)
read availability (b) write availability (c) total availability
Figure 6-14: Availability provided by the tested replica control protocols for mean repair delay=2.0 (a)
read availability (b) write availability (c) total availability
Figure 6-15: Availability provided by the tested replica control protocols for mean repair delay=2.5 (a)
read availability (b) write availability (c) total availability
Figure 6-16: Availability provided by the tested replica control protocols for mean repair delay=3.0 (a)
read availability (b) write availability (c) total availability
6.11 Summary
This chapter has described the simulation model that has been used to
implement a benchmark test utility for measuring the availability provided by
some representative replica control algorithms. It has specified the parameters of the
simulation which have been used to describe the behaviour of the tested algorithms. The
replicated objects are accessed through special routines specific to the particular
algorithm. The simulation has been implemented using event-driven
mechanisms. This chapter has also shown the advantages of using discrete event
simulation over a continuous one and explained the suitability of event-driven
simulation over process-driven simulation. It has presented the results of the benchmark
test through a series of illustrations for various mean repair delays and it has provided a
thorough comparison between the DMCA algorithm and two other voting algorithms.
The DMCA achieves greater total availability than that of (GIFFORD 1979) and
(JAJODIA 1989). The higher availability is obtained by exploiting the difference
between the read rate and the write rate and by utilising a dynamic adjustment scheme
which allows the reconfiguration of the read and write quorums.
The DMCA approach provides higher availability than that of the other two
algorithms. The availability of the DMCA increases as the factor ρ increases. This is
because the aim of the DMCA algorithm is to allow low-cost operations (like reads) with
large occurrence rates to be performed easily, whereas high-cost operations (like writes)
are performed rarely. Under this strategy the majority of the operations issued are
executed. This approach makes the DMCA algorithm suitable for ρ > 1.
DMCA tries to moderate the differences between Jajodia's and Gifford's
algorithms in order to increase the total availability. DMCA inherits the dynamic
adjustment of Jajodia's algorithm, reconfiguring dynamically the read and write
quorums according to the read and write occurrence rates.
DMCA is an algorithm that can safely be used in a network management system
for replicating managed objects. It requires no modification to the voting scheme; it
adjusts the read and write quorums dynamically and improves the availability of
network managed objects.
7. CONCLUSIONS
in the case of network partitioning. The design of a novel replica control protocol is
followed by a study of the criteria for achieving correctness and high availability.
2. Development of a benchmark test utility (ATS) that has been used for estimating the
availability of certain replica control algorithms. The development of such a tool is
very important because it allows the designer to evaluate the
suitability of a representative set of replica control algorithms. The ATS tool is built
using the object-oriented paradigm and designed as an event simulation system.
The main advantage of this tool is that it may be extended to accommodate more
voting algorithms with little effort.
3. Quantitative analysis based on experiments constructed by using the ATS tool.
The aim of this analysis is to prove the suitability of the DMCA algorithm by
comparing its simulation results with those of other voting algorithms. This analysis
has also shown how the availability is affected when the repair delay gets longer.
remains fixed until the designer manually changes the number of replicas or their
locations. If the reads and writes are fixed and are known a priori, then this is a
reasonable solution. However, if the read and write patterns change dynamically, in
unpredictable ways, such a replication scheme may lead to severe performance
problems.
In situations where read and write patterns change dynamically, there is a need to
develop a replication algorithm that incorporates dynamic adaptation features. Such an
algorithm may have the ability to learn characteristics of its environment and use them
to re-adjust the read and write quorums and re-order the placement of the objects to suit
future access patterns. DMCA does perform the dynamic adjustment of the read and
write quorums but not dynamic replacement of replicated objects.
Such a dynamic algorithm should incorporate an automaton that takes into account
historical and statistical data regarding the read and write access of replicated objects.
Dynamic replacement of replicated objects may decrease the transmission time and
increase in that way the performance of the read and write operations.
The communication cost of a replication scheme is the average number of interprocessor messages required for a read or a write of the object. Placing the replicated
objects at those locations that minimise the number of messages passed to access them
forms an optimum replication scheme that can ensure high availability, good
performance and strong consistency. Adaptive replication techniques that encompass
dynamic object replacement may work together with the DMCA to increase further the
performance. Recent work (WOLFSON 1997) proves that an adaptive replication
algorithm improves the performance by exercising dynamic object replacement. In
(WOLFSON 1997), a method of coping with storage space limitations at the various
processors in the network is also proposed, in order to compare a
static scheme (without object replacement) with a dynamic one that exercises
replicated object replacement. Applying dynamic replacement to the DMCA algorithm
may further increase the performance and improve the usability of the algorithm.
7.3 Concluding Remarks
Data availability is a fundamental problem in a network management system. As
argued earlier in this thesis, this problem will become more acute as different network
technologies evolve. Replication of managed objects is the key to solving this
problem. Voting replica control algorithms exhibit a suitable behaviour that ensures
consistency and provides high availability.
This thesis has shown that voting replication techniques can safely be used in
distributed MIBs and that the object availability provided by voting algorithms may be
improved further by adjusting the read and write quorums according to the read and
write occurrence rates. It has also been shown that applying replication techniques to
network management makes the management activities more robust and fault tolerant.
Finally, it proposes the DMCA voting algorithm, which provides higher availability
than that provided by other similar algorithms.
APPENDIX-A
PAPERS
This appendix includes the papers related to this thesis, authored by myself and
co-authored by my supervisor Dr. B. H. Pardoe. These papers have been published in
the proceedings of the associated conferences.
APPENDIX-B
TABLES
This appendix contains the tables describing the results of the simulation discussed in
Chapter 6. All the figures illustrating quantitative results about the availability provided
by each tested algorithm have been constructed by using these tables. The values
presented in these tables are mean values obtained over extensive runs of the ATS
simulation tool.
APPENDIX-C
SOURCE CODE
This appendix includes the C++ code of the ATS system. The program has been
tested under Win32 (Windows 95 or NT).
LIST OF REFERENCES
[ABBADI 1985]
EL ABBADI, A.; SKEEN, D.; and CRISTIAN, F., An Efficient Fault Tolerant
Protocol for Replicated Data Management, In Proceedings of the 4th ACM
Symposium on Principles of Database Systems (1985), ACM, New York,1985, pp.
215-228.
[ABBADI 1986]
EL ABBADI, A.; TOUEG, S. Availability in Partitioned Replicated Databases,
In Proceedings of the 5th ACM Symposium on Principles of Database Systems
(1986), ACM, New York, 1986, pp. 240-251.
[ADAMEC 1995]
ADAMEC, JAROMIR, MICHAEL GROF, JAN KLEINDIENST, FRANTISEK
PLASIL and PETR TUMA Supporting Interoperability in CORBA via Object
Services, Tech. Report, No. 114, Department of software engineering, Charles
University Prague, October 1995, also available at
http://www.cs.wustl.edu/~schmidt/CORBA-docs/interoperability.ps.gz.
[ALSBERG 1976]
ALSBERG, P. A. and DAY, J. D., A principle for resilient sharing of
distributed resources, In Proceedings of the 2nd International Conference
on Software Engineering, IEEE Computer Society, San Francisco,
California, October 1976, pp. 627-644.
[ANSA 1989]
The Advanced Network System Architecture (ANSA) Reference Manual, Castle
Hill, Cambridge England, Architecture Project Management, 1989.
[ARPEGE 1994]
ARPEGE Group (1994), Network Management Concepts and Tools, Chapman
Hall, 1994, ISBN 0-412-57810-7.
[BABAT 1991]
BABAT, SUBODH, OSI Management Information Base Implementation, In
proceedings of the IFIP Symposium on Integrated Network Management II,
Amsterdam, 1991
[BAKER 1991]
BAKER, M. G.; HARTMAN, J. H.; KUPFER, M. D.; SHIRRIFF, K. W.;
and OUSTERHOUT, J. K., Measurements of a Distributed File System,
Proceedings of the 13th ACM Symposium on Operating System Principles,
(1991), pp. 198-212.
[BAN 1995]
BAN, BELA, Towards an Object-Oriented Framework for Multi-Domain
Management, IBM Zurich Research Laboratory, Rueschlikon, Dec 18, 1995,
also available at http://simon.cs.cornell.edu/Info/People/bba/GOM.ps.gz.
[BEAR 1988]
BEAR, D. Principles of Telecommunication Traffic Engineering, 3rd edition, IEE
Telecommunication Series 2, IEE, Peter Peregrinus Ltd., London, 1988.
[BERNSTEIN 1981]
BERNSTEIN, P. A. and GOODMAN, N., Concurrency Control in
Distributed Database Systems, ACM Computing Survey, Vol. 13 No 2,
June 1981, pp. 185-221.
[BERNSTEIN 1987]
BERNSTEIN, P. A. ; HADZILACOS, V and GOODMAN, N.,
Concurrency Control and Recovery in Database Systems, Addison Wesley,
1987.
[BEVER 1993]
BEVER, M ; GEIHS,K. ; HEUSER, L. ; MUHLHAUSER, M. ; and SCHILL,A.,
Distributed Systems, OSF/DCE and Beyond, In DCE - The OSF Distributed
Computing Environment, edited by SCHILL, A, Springer - Verlag, 1993, pp. 1-20.
[BIRMAN 1987]
BIRMAN, K., and JOSHEPH, T., Reliable communication in the presence of
failures., ACM Transactions on Computer Systems, Vol. 5, No. 1, February 1987.
[BLAUSTEIN 1985]
BLAUSTEIN, B.T., and KAUFMAN, C. W. Updating Replicated Data
During Communication Failures In Proceedings of the 11th International
Conference on Very Large Data Bases, 1985, pp. 49-58.
[BONG 1989]
BONG, A. BLAU, W., GRAETSCH, W., HERRMANN, F. and OBERLE, W.,
Fault-tolerance under Unix, ACM Transactions on Computer Systems, Vol. 7,
No. 1, February 1989.
[BUDHIRAJA 1993]
BUDHIRAJA, N. ; MARZULLO, K. ; SCHNEIDER, I.B.; and TOUEG,S. The
Primary - Backup Approach, In Distributed Systems, 2nd ed., MULLENDER, S
(ed.), pp. 199-216, ACM press, 1993.
[CARDELLI 1985]
CARDELLI, L. and WEGNER, P. On Understanding Types, Data Abstraction,
and Polymorphism, ACM Computing Surveys vol. 17, No. 4, December 1985.
[CARR 1985]
CARR, R., The Tandem global update protocol, Tandem System Review, Vol. 1,
No. 2, June 1985.
[CASE 1990]
CASE, J.D.; FEDOR, M.; SCHOFFSTALL,M.L.; and DAVIN, C., Simple
Network Management Protocol (SNMP), Request For Comments, RFC 1157,
1990.
[CCITT 1993]
CCITT Recommendation X.700 Management Frameworks Definition for
Open Systems Interconnection (OSI) for CCITT Applications, 1993.
[CERF 1988]
CERF, V. IAB Recommendations for the Development of Internet Network
Management Standards, Request For Comments, RFC 1052, April 1988.
[CHANG 1984]
CHANG, J. M., and MAXEMCHUCK, N., Reliable Broadcast Protocols, ACM
Transactions on Computer Systems, Vol. 2, No. 3, August 1984.
[CHIN 1991]
CHIN, R. S. and CHANSON, S. T., Distributed object based programming
systems, ACM Computing Surveys, Vol. 23, No. 1, March 1991, pp. 91-124.
[COOPER 1985]
COOPER, E., Replicated Distributed Programs, Ph.D. dissertation, UC Berkley,
1985.
[CRISTIAN 1985]
CRISTIAN, F., AGHILI, H., STRONG, R., and DOLEV, D. Atomic broadcast:
From simple diffusion to Byzantine agreement, 15th International Conference on
Fault-tolerant computing , Ann Arbor, Michigan, 1985.
[CRISTIAN 1988]
CRISTIAN, F. Agreeing on who is present and who is absent in a synchronous
distributed system, In Proceedings of the 18th International Conference on Fault
Tolerant Computing, Tokyo, June 1988.
[CRISTIAN 1989]
CRISTIAN, FLAVIN "Exception handling". In Dependability of Resilient
Computers, T. Anderson (ed.), Blackwell Scientific Publications, Oxford, 1989.
[CRISTIAN 1990]
CRISTIAN, F., DANCEY, R., and DEHN, J., Fault tolerance in the advanced
automation system, 20th International Conference on Fault-tolerant Computing ,
Newcastle upon Tyne, England, June 1990.
[CRISTIAN 1991]
CRISTIAN, FLAVIN Understanding Fault-Tolerant Distributed Systems,
Communications of the ACM (February 1991), Vol. 34, No. 2, pp. 57-78.
[DAVCEN 1985]
DAVCEV, D. and BURKHARD,W., Consistency and Recovery Control for
Replicated files, In Proceedings of the 10th ACM Symposium on Operating
Systems Principles (1985). ACM, New York, 1985, pp. 87-96.
[DAVIDSON 1984]
DAVIDSON, S. B. Optimism and Consistency in Partitioned Distributed
Database Systems, ACM Transactions on Database Systems, 1984, Vol. 9,
No 3, pp. 456-481.
[DAVIDSON 1985]
DAVIDSON, S. B.; GARCIA-MOLINA, H. and SKEEN, D. Consistency
in Partitioned Networks, ACM Computer Survey, 1985, Vol. 17, No 3, pp.
341-370
[ESWARAN 1976]
ESWARAN, K. P.; GRAY, J. N.; LORIE, R. A. and TRAIGER, I. L. The
Notions of Consistency and Predicate Locks in a Database System,
Communications of the ACM, Nov. 1976, Vol. 19, No 11, pp. 624-633.
[EZHILCHELVAN 1986]
EZHILCHELVAN, P., and SHRIVASTAVA, S. A characterisation of faults in
systems, Fifth Symposium on Reliability in Distributed Software and Database
systems, Los Angeles, January 1986
[FERIDUM 1996]
FERIDUM, M., HEUSLER, L., and NIELSEN, R, Implementing OSI
Agent/Managers for TMN, IEEE Communications Magazine, September
1996.
[GARCIA 1982]
GARCIA, H., Elections in a distributed computing system. IEEE
Transactions on Computers, Vol. 31, No. 1, January 1982, pp. 48-59.
[GIFFORD 1979]
GIFFORD, D. K. Weighted Voting for Replicated Data, In Proceedings of the
7th Symposium on Operating Systems Principles (Pacific Grove, CA,
[HARPER 1988]
HARPER, R., LALA, L., DEYST, J., Fault tolerant parallel processor
architecture overview , 18th International Conference on Fault Tolerant
Computing, Tokyo, June 1988.
[HOPKINS 1978]
HOPKINS, A., SMITH, B., LALA, J., FTMP-A highly reliable fault tolerant
multi-processor for aircraft , Proceedings IEEE, Vol. 66, Oct. 1978.
[ISO 1988]
ISO/IEC 9072:1988 (CCITT Recommendation X.211), International Organisation
for Standardisation, Information Processing Systems: Open Systems
Interconnection, Text Communication - Remote Operations Part 1: Model,
Notation and Service Definition.
[ISO 1989]
ISO/IEC 7498-4: 1989, Information processing systems, Open Systems
Interconnection, Basic Reference Model , Part 4: Management framework.
(CCITT Recommendation X.700: 1992, Management Framework for Open
systems Interconnection -OSI- for CCITT Applications)
[ISO 1991]
ISO/IEC 9595:1991 (CCITT Recommendation X.710), International Organisation
for Standardisation, Information Processing Systems: Open Systems
Interconnection, Common Management Information Service Definition
[ISO 1991a]
ISO/IEC 9596-1:1991 (CCITT Recommendation X.711), International
Organisation for Standardisation, Information Processing Systems: Open Systems
Interconnection, Common Management Information Protocol Specification.
[ISO 1992]
ISO/IEC 10040:1992, (CCITT Recommendation X.701: 1992) Information
technology, Open Systems Interconnection, Systems management overview.
[ISO 1992a]
ISO/IEC 10746-1:1992 , International Organisation for Standardisation, Basic
Reference Model of Open Distributed Processing, Part 1: Overview and Guide to
Use, JTC1/SC212/WG7CD 10746-1, ISO 1992.
[ISO 1993]
ISO/IEC 10165-1: 1993, (CCITT Recommendation X.720: 1992), Information
technology, Open Systems Interconnection, Structure of management information:
Management information model.
[ITU 1995]
ITU-T Recommendation M.3010 Principles and Architecture for the TMN
[JAJODIA 1987a]
JAJODIA, S. and MUTCHLER, D. Dynamic Voting In Proceedings of the
ACM SIGMOD Intl Conference on Management of Data, May 1987.
[JAJODIA 1987b]
JAJODIA, S. and MUTCHLER, D. Enhancement to the voting algorithm In
Proceedings of the 13th Intl conference on Very Large Databases (VLDB),
September 1987.
[JAJODIA 1989]
JAJODIA, S. and MUTCHLER, D. A pessimistic Consistency Control
Algorithm for Replicated Files Which Achieves High Availability, IEEE
Transactions on Software Engineering, (January 1989), Vol. 15, No 1, pp.
39-46.
[JAJODIA 1990]
JAJODIA, SUSHIL and MUTCHLER, DAVID, Dynamic Voting Algorithms for
Maintaining the Consistency of a Replicated Database, ACM Transactions on
Database Systems, Vol. 15, No. 2, June 1990, pp. 230-280.
[JOSEPH 1987]
JOSEPH, T. A. and BIRMAN, K. P. Low Cost Management of Replicated
Data in Fault Tolerant Distributed Systems, ACM Transactions in
Computer Systems, 1987, Vol. 4, No 1
[KAHANI 1997]
KAHANI, M. and BEADLE, P., H., W., Decentralised Approaches for
Network Management, ACM SIGCOM Computer Communications Review,
Vol. 27, No. 3, July 1997, pp. 36-47.
[KEENE 1989]
KEENE, S. E., Object-Oriented Programming in Common LISP, Addison-Wesley, 1989
[KERNIGHAN 1988]
KERNIGHAN, B., W., and RITCHIE, D. M., The C Programming Language,
Second Edition. Prentice Hall, Englewood Cliffs, N.J., 1988
[KOTSAKIS 1995]
KOTSAKIS, E.G. and PARDOE, B.H., Modelling OSI Management Information
Base With Object Oriented Analysis, In Proceedings of the 1995 International
Symposium on Communications, Taipei, Taiwan (December 27-29, 1995), pp.
143-149
[KOTSAKIS 1996a]
E.G.KOTSAKIS and B.H.PARDOE Replication of Management Objects in
Distributed MIB In Proceedings of the ICT96 International Conference on
Telecommunications, April 14-17, 1996, pp. 545-549.
[KOTSAKIS 1996b]
E.G.KOTSAKIS and B.H.PARDOE Dynamic Quorum Adjustment: A
consistency scheme for Replicated Objects, In Proceedings of the Third
Communication Networks Symposium, Manchester, July 8-9, 1996, pp. 197-200
[KOTSAKIS 1996c]
E.G.KOTSAKIS and B.H.PARDOE Simulation IASTED
[KRIEGER 1998]
KRIEGER, D., and ADLER, R., M., The Emergence of Distributed
Component Platform, IEEE Computer, March 1998, pp. 43-53.
[LADIN 1992]
LADIN, R ; LISKOV, B. ; SHRINA, L. ; and GHEMAWAT, S. Providing
Availability Using Lazy Replication, ACM Transactions on Computer Systems,
Vol. 10, No. 4, pp. 360-391.
[LAMPORT 1984]
LAMPORT, L., Using time instead of time-outs in fault tolerant systems, ACM
Transactions on Programming Languages and Systems, Vol. 6, No. 2, 1984
[LAW 1991]
LAW, A. M. and KELTON, W. D., Simulation Modelling and Analysis, 2nd
edition, McGraw Hill, 1991
[LEINWARD 1993]
LEINWARD, A., and FANG, K., Network Management: A practical
perspective, Addison Wesley, 1993
[LEPPINEN 1997]
LEPPINEN, MIKA., PULKKINEN, PEKKA., RAUTIAINEN, AAPO, Java
and CORBA Based Network Management, IEEE Computer, June 1997, pp.
83-87.
[LEWIS 1995]
LEWIS, G., R., CORBA 2.0 Universal Networked Objects ACM Standard
View Vol. 3, No. 3, September 1995.
[LISKOV 1991]
LISKOV, B ; GHEMAWAT, S ; GRUBER, R ; JOHNSON, P. ; SHRINA, L. and
WILLIAMS, M. Replication in the HARP file System In Proceedings of the
13th ACM Symposium on Operating Systems Principles, pp. 226-238, 1991.
[LONG 1991]
LONG, D.D.E., CARROLL, J.L. and PARK, C.J., A study of the
reliability of the Internet sites, Proceedings of the 10th Symposium on
Reliable Distributed Systems, 1991, pp. 177-186.
[MAFFEIS 1997a]
MAFFEIS, S. and SCHMIDT, D. C., Constructing Reliable Distributed
Communication Systems with CORBA, IEEE Communications Magazine,
February 1997.
[MAFFEIS 1997b]
MAFFEIS, S., Piranha: A CORBA Tool for High Availability, IEEE
Computer, April 1997, pp. 59-66.
[MEYER 1992]
MEYER, B., Eiffel: The Language, Prentice Hall, 1992.
[MEYER 1995]
MEYER, K., ERLINGER, M., BETSER, J., SUNSHINE, C., GOLDSZMIDT,
G. and YEMINI, Y., Decentralizing Control and Intelligence in Network
Management, In Proceedings of the International Symposium on Integrated
Network Management, May 1995.
[MINOURA 1982]
MINOURA, T. and WIEDERHOLD, G., Resilient Extended True-Copy Token
Scheme for a Distributed Database System, IEEE Transactions on Software
Engineering, Vol. SE-8, No. 3, May 1982, pp. 173-189.
[MISRA 1986]
MISRA, J., Distributed Discrete-Event Simulation, ACM Computing Surveys,
Vol. 18, No. 1, March 1986, pp. 36-65.
[MULLENDER 1990]
MULLENDER, S. J., VAN ROSSUM, G., TANENBAUM, A. S., VAN
RENESSE, R. and VAN STAVEREN, H., Amoeba: A Distributed Operating
System for the 1990s, IEEE Computer, Vol. 23, No. 5, May 1990, pp. 44-53.
[NARASIMHAN 1997]
NARASIMHAN, P., MOSER, L. E. and MELLIAR-SMITH, P. M., Replica
Consistency of CORBA Objects in Partitionable Distributed Systems,
Distributed Systems Engineering, Vol. 4, No. 3, September 1997, pp. 139-150.
[NELSON 1981]
NELSON, B., Remote Procedure Call, Technical Report CSL-81-9, Xerox
Palo Alto Research Center, 1981.
[OKI 1988]
OKI, B. and LISKOV, B., Viewstamped Replication: A New Primary Copy Method
to Support Highly Available Distributed Systems, In Proceedings of the Seventh
ACM Symposium on Principles of Distributed Computing, August 1988.
[OMG 1997]
Object Management Group, A Discussion of the Object Management
Architecture (OMA), OMG, January 1997.
[OSF 1993]
Open Software Foundation (OSF), Introduction to OSF DME, Distributed
Management Environment (DME) 1.0, H001, 1993.
[PALUMBO 1985]
PALUMBO, D. and BUTLER, R., Measurement of SIFT Operating System
Overhead, NASA Technical Memorandum 86322, 1985.
[PARKER 1983]
PARKER, D. S. JR., POPEK, G. J., RUDISIN, G., STOUGHTON, A., WALKER,
B. J., WALTON, E., CHOW, J. M., EDWARDS, D., KISER, S. and KLINE, C.,
Detection of Mutual Inconsistency in Distributed Systems, IEEE Transactions
on Software Engineering, Vol. SE-9, No. 3, May 1983, pp. 240-247.
[PAVON 1998]
PAVON, J. and TOMAS, J., CORBA for Network and Service Management
in the TINA Framework, IEEE Communications Magazine, March 1998, pp.
72-79.
[PRESOTTO 1990]
PRESOTTO, D. L. and RITCHIE, D. M., Interprocess Communication in the
Ninth Edition UNIX System, Software: Practice and Experience, Vol. 20, No.
S1, June 1990, pp. 3-17.
[PU 1988]
PU, C., NOE, J. D. and PROUDFOOT, A. B., Regeneration of Replicated
Objects: A Technique and Its Eden Implementation, IEEE Transactions on
Software Engineering, Vol. SE-14, No. 7, July 1988, pp. 936-945.
[PRIS 1986a]
PARIS, J.-F., Voting With a Variable Number of Copies, In Proceedings
of the IEEE International Symposium on Fault-Tolerant Computing, IEEE,
NY, 1986, pp. 50-55.
[PRIS 1986b]
PARIS, J.-F., Voting With Witnesses: A Consistency Scheme for Replicated
Files, In Proceedings of the IEEE International Conference on Distributed
Computing Systems, IEEE, NY, 1986, pp. 606-621.
[PRIS 1991]
PARIS, J.-F. and LONG, D. D. E., Voting With Regenerable Volatile
Witnesses, In Proceedings of the 7th International Conference on Data
Engineering, 1991, pp. 112-119.
[RAHKILA 1997]
RAHKILA, S. and STENBERG, S., Experiences on Integration of Network
Management and a Distributed Computing Platform, In Proceedings of the
30th Hawaii International Conference on System Sciences, IEEE CS Press, 1997,
pp. 140-149; also appears in Distributed Systems Engineering, Vol. 4, No. 3,
September 1997.
[RAMAN 1998]
RAMAN, L., OSI Systems and Network Management, IEEE
Communications Magazine, March 1998, pp. 10-17.
[RUMBAUGH 1991]
RUMBAUGH, J., BLAHA, M., PREMERLANI, W., EDDY, F. and LORENSEN,
W., Object-Oriented Modelling and Design, Prentice Hall, 1991.
[SADOWSKI 1993]
SADOWSKI, R., Selling Simulation and Simulation Results, In Proceedings of
the 1993 Winter Simulation Conference, 1993, pp. 65-68.
[SALTZER 1984]
SALTZER, J., REED, D. and CLARK, D., End-to-End Arguments in System
Design, ACM Transactions on Computer Systems, Vol. 2, No. 4, November 1984.
[SARIN 1985]
SARIN, S. K., BLAUSTEIN, B. T. and KAUFMAN, C. W., System
Architecture for Partition-Tolerant Distributed Databases, IEEE
Transactions on Computers, Vol. C-34, No. 12, 1985, pp. 1158-1163.
[SCHLICHTING 1983]
SCHLICHTING, R. D. and SCHNEIDER, F. B., Fail-Stop Processors: An
Approach to Designing Fault-Tolerant Computing Systems, ACM
Transactions on Computer Systems, Vol. 1, No. 3, August 1983, pp. 222-238.
[SCHNEIDER 1990]
SCHNEIDER, F. B., Implementing Fault-Tolerant Services Using the State
Machine Approach: A Tutorial, ACM Computing Surveys, Vol. 22, No. 4,
December 1990, pp. 299-319.
[SHANNON 1975]
SHANNON, R. E., Systems Simulation: The Art and the Science, Prentice Hall,
1975.
[SIDOR 1998]
SIDOR, D. J., TMN Standards: Satisfying Today's Needs While Preparing for
Tomorrow, IEEE Communications Magazine, March 1998, pp. 54-64.
[STROUSTRUP 1991]
STROUSTRUP, B., The C++ Programming Language, 2nd edition,
Addison-Wesley, 1991.
[THOMAS 1979]
THOMAS, R. H., A Majority Consensus Approach to Concurrency Control for
Multiple Copy Databases, ACM Transactions on Database Systems, Vol. 4,
No. 2, June 1979, pp. 180-209.
[UML 1997]
Unified Modelling Language (UML), Notation Guide, Version 1.1, September
1, 1997, Rational Software Corporation, Santa Clara, CA. Also available at
http://www.rational.com/uml/ad970805_UML11_Notation2.zip.
[VINOSKI 1997]
VINOSKI, S., CORBA: Integrating Diverse Applications within Distributed
Heterogeneous Environments, IEEE Communications Magazine, February 1997, pp.
46-55.
[WENSLEY 1978]
WENSLEY, J., LAMPORT, L., GOLDBERG, J., GREEN, M., LEVITT, K.,
MELLIAR-SMITH, M., SHOSTAK, R. and WEINSTOCK, C., SIFT: Design
and Analysis of a Fault-Tolerant Computer for Aircraft Control, Proceedings
of the IEEE, Vol. 66, No. 10, October 1978.
[WOLFSON 1997]
WOLFSON, O., JAJODIA, S. and HUANG, Y., An Adaptive Data Replication
Algorithm, ACM Transactions on Database Systems, Vol. 22, No. 2, June 1997,
pp. 255-314.