Evangelos Kotsakis, European Commission, 1998
TABLE OF CONTENTS
LIST OF FIGURES................................................................................................................................... V
LIST OF TABLES ................................................................................................................................. VII
ACKNOWLEDGMENTS..................................................................................................................... VIII
ABBREVIATIONS .................................................................................................................................. IX
ABSTRACT ............................................................................................................................................... X
1. INTRODUCTION .................................................................................................................................. 1
1.1 DISTRIBUTED MANAGEMENT SYSTEMS ............................................................................................... 1
1.2 REPLICATION ON A DISTRIBUTED MIB ................................................................................................. 2
1.3 THE WORK........................................................................................................................................... 4
1.4 ROAD MAP OF THE THESIS .................................................................................................................. 5
2. REPLICATION MANAGEMENT SYSTEM ARCHITECTURE .................................................... 8
2.1 MANAGEMENT FUNCTIONAL AREAS ................................................................................................... 8
2.2 MANAGEMENT ARCHITECTURAL MODEL .......................................................................................... 10
2.3 PROTOCOLS FOR CONTROLLING MANAGEMENT INFORMATION......................................................... 11
2.3.1 OSI Management Framework ................................................................................................... 11
2.3.2 Internet Network Management .................................................................................................. 12
2.4 OBJECT ORIENTED MIB MODELLING ................................................................................................. 13
2.5 DISTRIBUTED MANAGEMENT INFORMATION BASE (MIB) ................................................................. 15
2.6 DISTRIBUTED NETWORK MANAGEMENT ........................................................................................... 18
2.7 CORBA SYSTEM .............................................................................................................................. 22
2.8 IMPLEMENTING OSI MANAGEMENT SERVICES FOR TMN ................................................................. 23
2.9 REPLICATION IN A MANAGEMENT SYSTEM ....................................................................................... 26
2.10 NEED FOR REPLICATION TECHNIQUES IN A MANAGEMENT SYSTEM .................................................. 29
2.11 SYNCHRONOUS AND ASYNCHRONOUS REPLICA MODELS ................................................................. 33
2.12 REPLICATION TRANSPARENCY AND ARCHITECTURAL MODEL ........................................................ 33
2.13 SUMMARY ....................................................................................................................................... 37
3. FAILURES IN A MANAGEMENT SYSTEM .................................................................................. 38
3.1 DEPENDABILITY BETWEEN AGENTS .................................................................................................. 38
3.2 FAILURE CLASSIFICATION .................................................................................................................. 39
3.3 FAULTY AGENT BEHAVIOUR ............................................................................................................. 40
3.4 FAILURE SEMANTICS ......................................................................................................................... 41
3.5 FAILURE MASKING ............................................................................................................................ 42
3.6 ARCHITECTURAL ISSUES ................................................................................................................... 46
3.7 GROUP SYNCHRONISATION ............................................................................................................... 47
3.7.1 Close Synchronisation ............................................................................................................... 47
3.7.2 Loose synchronisation ............................................................................................................... 48
3.8 GROUP SIZE....................................................................................................................................... 49
3.9 GROUP COMMUNICATION .................................................................................................................. 49
3.10 AVAILABILITY POLICY ..................................................................................................................... 50
3.11 GROUP MEMBER AGREEMENT ........................................................................................................ 51
3.12 SUMMARY ....................................................................................................................................... 53
4. REPLICA CONTROL PROTOCOLS ............................................................................................... 55
4.1 PARTITIONING IN A REPLICATION SYSTEM ......................................................................................... 55
4.2 CORRECTNESS IN REPLICATION ......................................................................................................... 56
4.3 TRANSACTION PROCESSING DURING PARTITIONING .......................................................................... 59
4.4 PARTITION PROCESSING STRATEGY................................................................................................... 60
4.5 AN ABSTRACT MODEL FOR STUDYING REPLICATION ALGORITHMS .................................................. 62
4.6 PRIMARY SITE PROTOCOL ................................................................................................................. 66
LIST OF FIGURES
FIGURE 2-1: BASIC MANAGEMENT MODEL.................................................................................................... 9
FIGURE 2-2: VIEWS OF SHARED MANAGEMENT KNOWLEDGE....................................................................... 17
FIGURE 2-3: SIMPLIFIED MANAGEMENT SYSTEM. ....................................................................................... 17
FIGURE 2-4 NETWORK MANAGEMENT APPROACHES (A) CENTRALISED (B) PLATFORM BASED (C)
HIERARCHICAL (D) DISTRIBUTED ......................................................................................................... 21
FIGURE 2-5: INTER-WORKING TMN ........................................................................................................... 25
FIGURE 2-6: (A) REPLICATION (B) NO REPLICATION ..................................................................................... 28
FIGURE 2-7: NETWORK MANAGEMENT REPLICATION EXAMPLE .................................................................. 31
FIGURE 2-8: SYNCHRONOUS REPLICATION .................................................................................................. 33
FIGURE 2-9: ARCHITECTURAL MODEL FOR REPLICATION. (A) NON TRANSPARENT SYSTEM (B) TRANSPARENT
REPLICATION SYSTEM (C ) LAZY REPLICATION (D) PRIMARY COPY MODEL. ......................................... 35
FIGURE 3-1: RELATIONSHIP BETWEEN USER AND RESOURCE. ...................................................................... 39
FIGURE 3-2: FAILURE MASKING ................................................................................................................... 42
FIGURE 3-3: GROUP MASKING ..................................................................................................................... 43
FIGURE 4-1 REPLICATION ANOMALY CAUSED BY CONFLICT WRITE OPERATIONS. A) BEFORE ISOLATION B)
AFTER ISOLATION ................................................................................................................................ 57
FIGURE 4-2: LOGICAL AND PHYSICAL OBJECTS OF THE SENSOR ENTITY. ..................................................... 64
FIGURE 4-3. REPLICATION USING PRIMARY SITE ALGORITHM. ..................................................................... 66
FIGURE 4-4. READ IN A PRIMARY SITE PROTOCOL ...................................................................................... 68
FIGURE 4-5. WRITE IN A PRIMARY SITE PROTOCOL..................................................................................... 68
FIGURE 4-6. MAKE CURRENT IN A PRIMARY SITE PROTOCOL ..................................................................... 69
FIGURE 4-7. READ IN A MAJORITY CONSENSUS ALGORITHM ...................................................................... 71
FIGURE 4-8. WRITE IN A MAJORITY CONSENSUS ALGORITHM ....................................................................... 72
FIGURE 4-9. MAKE CURRENT IN A MAJORITY CONSENSUS ALGORITHM ..................................................... 72
FIGURE 4-10. ISMAJORITY IN THE DYNAMIC VOTING PROTOCOL ............................................................... 75
FIGURE 4-11. READ FUNCTION IN THE DYNAMIC VOTING PROTOCOL ......................................................... 75
FIGURE 4-12. WRITE (UPDATE) IN THE DYNAMIC VOTING PROTOCOL ......................................................... 76
FIGURE 4-13 UPDATE IN THE DYNAMIC VOTING PROTOCOL ...................................................................... 78
FIGURE 4-14 MAKE CURRENT IN THE DYNAMIC VOTING PROTOCOL ......................................................... 78
FIGURE 4-15 READPERMITTED IN THE DMCA ........................................................................................... 82
FIGURE 4-16 WRITEPERMITTED FUNCTION IN THE DMCA.......................................................................... 83
FIGURE 4-17 DOREAD FUNCTION IN THE DMCA........................................................................................ 84
FIGURE 4-18 DOWRITE FUNCTION IN THE DMCA ...................................................................................... 86
FIGURE 4-19 MAKE CURRENT FUNCTION IN DMCA................................................................................... 88
FIGURE 4-20: SEQUENCE DIAGRAM FOR DOREAD OPERATION .................................................................... 89
FIGURE 4-21: SEQUENCE DIAGRAM FOR DOWRITE OPERATION ................................................................... 90
FIGURE 4-22: SEQUENCE DIAGRAM FOR MAKECURRENT OPERATION ......................................................... 90
FIGURE 5-1. ATS PROCESS DIAGRAM ........................................................................................................ 102
FIGURE 5-2. ATS OBJECT MODEL .............................................................................................................. 106
FIGURE 5-3. ATS DYNAMIC MODEL .......................................................................................................... 107
FIGURE 6-1: NETWORK MODEL................................................................................................................. 112
FIGURE 6-2: FAULT INJECTION SYSTEM .................................................................................................... 113
FIGURE 6-3: COMPONENTS OF THE SIMULATION MODEL ............................................................................ 118
FIGURE 6-4: AVAILABILITY CURVE ............................................................................................................ 120
FIGURE 6-5: TOTAL AVAILABILITY =4. ................................................................................................... 121
FIGURE 6-6: BOUNDARIES OF TOTAL AVAILABILITY .................................................................................. 122
FIGURE 6-7: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.1 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY. .................. 126
FIGURE 6-8: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.2 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 127
FIGURE 6-9: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.3 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 128
FIGURE 6-10: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.4 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 129
FIGURE 6-11: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=0.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 130
FIGURE 6-12: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=1.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 131
FIGURE 6-13: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=1.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 132
FIGURE 6-14: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=2.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 133
FIGURE 6-15: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=2.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 134
FIGURE 6-16: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR
DELAY=3.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ................... 135
LIST OF TABLES
TABLE 2-1: CMISE SERVICES AND FUNCTIONS ........................................................................................... 12
TABLE 2-2: SNMP SERVICES AND FUNCTIONS ............................................................................................ 13
TABLE 4-1: DMCA MAPPING ...................................................................................................................... 91
TABLE 6-1: SIMULATION PARAMETERS ..................................................................................................... 123
ACKNOWLEDGEMENTS
I would like to thank my supervisor Dr. B. H. Pardoe, who has assisted and
guided me in preparing this thesis. His kind assistance in my struggles with the English
language was very helpful. He has been willing to answer any technical question and to
provide me with the information and knowledge needed to cope with the difficult task of
preparing a Ph.D. thesis. Without his kind help and encouragement, this research
would never have been done.
I would especially like to express sincere appreciation for the financial support,
encouragement and love given to me by my parents Grigorios and Dimitra Kotsakis.
They have given me much more than moral and material support throughout my university
studies. They provided me with a rock-solid support system which proved helpful during
my studies. Without their support and love, this research would never have been done.
Special thanks are due to my wife Chaido for encouraging me over the last three
years. I would like to thank her for her unconditional love, unfailing enthusiasm,
unending optimism and confidence in my abilities. Her patience and support are
boundless.
Finally, I would like to thank little Dimitra, whose birth two years ago gave me
great joy, for being quiet while I was writing this thesis.
ABBREVIATIONS
ANSA     Advanced Networked Systems Architecture
ATM      Asynchronous Transfer Mode
ATS      Availability Testing System
CMIS     Common Management Information Service
CMISE    Common Management Information Service Element
DMCA     Dynamic Majority Consensus Algorithm
FE       Front End
IP       Internet Protocol
ISO      International Organization for Standardization
LAN      Local Area Network
MIB      Management Information Base
OMT      Object Modelling Technique
OSF/DCE  Open Software Foundation / Distributed Computing Environment
OSI      Open Systems Interconnection
ROSE     Remote Operations Service Element
SNA      Systems Network Architecture
SNMP     Simple Network Management Protocol
TCP      Transmission Control Protocol
UDP      User Datagram Protocol
WAN      Wide Area Network
ABSTRACT
Systems management is concerned with supervising and controlling a system so
that it fulfils its operational requirements. The management of a system may be
performed by a mixture of human and automated components. These components operate
on abstract representations of network resources known as managed objects. A
distributed management system may be viewed as a collection of such objects located at
different sites in a network. Replication is a technique used in distributed systems to
improve the availability of vital data components and to provide higher system
performance, since access to a particular object may be accomplished at multiple sites
concurrently. By applying replication in a distributed management system, we can locate
certain management objects at multiple sites by copying their internal data and the
operations used to access or update those data. This is a great advantage, since it
increases reliability and availability, provides higher fault tolerance and allows data
sharing, thereby improving system performance.
This thesis is concerned with methods that may be used to apply replication in
such a system, as well as with certain replica control algorithms that may be used to
control operations over a replicated managed object. Certain replication architectures
are examined and the availability provided by each of them is discussed. A new replica
control algorithm is proposed as an alternative means of providing higher availability. A
tool for evaluating the availability provided by a replica control algorithm is designed
and proposed as a benchmark utility for examining the suitability of certain replica
control algorithms.
1. INTRODUCTION
individual network resource, an attribute used to represent a network activity. When the
MIB is distributed over sites, one site may fail, while other sites continue to operate.
Distributed MIBs may also increase performance since different managed objects
located at different hosts may be accessed concurrently.
A fundamental problem with a distributed MIB is data availability. Since managed
objects are stored on separate machines, a server crash or a network failure that
partitions a client from a server can prevent a manager from accessing managed objects.
Such situations are very frustrating to a manager because they impede computation even
though client resources are still available. The problem of object availability will grow
over time for two reasons:
1. The frequency of network failures will increase. Networks are getting larger: they
cover wider geographical areas, encompass multiple administrative boundaries and
consist of multiple sub-networks connected via routers and bridges. Furthermore, there
is an increasing need for better network resource management that increases the
availability of managed objects and improves management performance.
2. The introduction of mobile network managers will increase the number of
occasions on which management agents are inaccessible. Wireless technologies
such as packet radio suffer from inherent limitations such as short range and line of
sight. Due to these limitations, the network connections between management
agents and mobile managers will exhibit frequent partitions.
1.2 Replication on a distributed MIB
Replication is a technique used in distributed operating systems and distributed
databases to improve the availability of system resources. In the case of the MIB,
replication can be used to increase the performance of management activities and to
provide high availability of management objects. Replicating the same management
object at different sites can improve the availability remarkably because the system can
continue to operate as long as at least one site is up. It also improves performance of
global retrieval queries, because the result of such a query can be obtained locally from
any site; hence a retrieval query can be processed at the local site where it is submitted.
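The availability gain from replication can be made concrete with a small calculation. The sketch below is illustrative only (the 90% site availability is an assumed figure, not taken from the thesis) and treats site failures as independent: a read succeeds as long as at least one replica is reachable.

```python
# Illustrative sketch: read availability of a replicated object when any
# single live replica can serve the read, assuming independent site failures.
def read_availability(site_availability: float, n_replicas: int) -> float:
    """P(at least one replica up) = 1 - P(all replicas down)."""
    return 1.0 - (1.0 - site_availability) ** n_replicas

# A single copy on a site that is up 90% of the time, versus three copies:
single = read_availability(0.90, 1)   # 0.9
triple = read_availability(0.90, 3)   # 1 - 0.1**3, approximately 0.999
print(f"one copy: {single:.3f}, three copies: {triple:.3f}")
```

Three copies already push read availability from 90% to roughly 99.9%, which is the effect the paragraph above describes; write availability behaves differently, since writes must keep the replicas consistent.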
To deal with replicated objects in a management information base, a control
method is needed to keep all the replicas in a consistent state even during partitioning.
The proposed techniques used to assure consistency may be divided into two families:
those based on a distinguished copy and those based on voting. The former technique is
based on the idea of using a designated copy of each replicated object, in such a way
that requests are sent to the site that contains that copy
(ALSBERG 1976, BERNSTEIN 1987, GARCIA 1982).
Voting replica control algorithms are more promising. They do not use a
distinguished copy; rather, a request is sent to all sites that hold a copy of the
replicated object. An access to a particular copy is granted if a majority of votes is
collected (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b). Voting algorithms
are fully distributed concurrency control algorithms and they exhibit higher flexibility
than those based on a distinguished copy. Although voting algorithms pass many
messages among sites, good performance can be expected if the round trip time becomes
shorter. Today's technology can improve the round trip time through the use of high
speed networks such as ATM, so messages may be transferred from one machine to
another faster and more reliably.
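The voting idea described above reduces to a few lines. The sketch below is a hypothetical illustration of the majority rule in the style of GIFFORD 1979, not code from the thesis; `majority_granted` and its parameters are invented names.

```python
# Minimal sketch of majority-vote replica access: each site holding a copy
# casts one vote, and an operation proceeds only if a strict majority of
# all copies respond.
def majority_granted(votes_collected: int, total_copies: int) -> bool:
    """Access is granted when more than half of all copies vote for it."""
    return votes_collected > total_copies // 2

# With 5 copies, reaching 3 sites is enough; reaching only 2 is not:
assert majority_granted(3, 5)
assert not majority_granted(2, 5)

# Any two majorities of the same copy set intersect, so two conflicting
# writes can never both be granted in disjoint partitions; this overlap is
# what lets voting preserve consistency during partitioning.
```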
In the voting scheme, replicated objects can be accessed in the partition group that
obtains a majority vote. In the distinguished (primary) copy scheme, availability is
significantly limited in the case of a network or link failure. Primary copy algorithms
exhibit good behaviour only for site failures. Voting algorithms, on the other hand,
provide higher availability, tolerating both network and site failures, but they
guarantee consistency at the expense of availability. To provide higher availability,
one may either use a consistency relaxation technique that allows concurrent access to
replicated objects across different partitions (an optimistic control algorithm) or improve
the existing pessimistic voting algorithms by forming more sophisticated schemes.
Optimistic control algorithms must be supported by an extra mechanism to detect and
resolve diverging replicas once the partition groups are reconnected. This complicates
the replication control task and allows, at least for a short interval of time,
inconsistency between replicas. Such an approach requires a long time to retrieve the
state of the database after a site failure, and it does not seem appropriate for databases
such as those used to store management information. Therefore, the invention of a more
sophisticated replica control algorithm based on voting seems to be a promising
approach that may provide higher availability while preserving strong consistency
between replicated objects.
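To make the contrast between the two families concrete, the following hypothetical sketch models a primary-copy object: an update succeeds only in the partition that contains the designated primary, even when another partition holds a majority of the copies. The class and parameter names are invented for illustration.

```python
# Sketch of the primary-copy limitation: all updates are funnelled through
# one designated site, so a partition that isolates the primary blocks
# writes even though most copies are still alive.
class PrimaryCopy:
    def __init__(self, primary, copies):
        self.primary = primary    # site holding the distinguished copy
        self.copies = copies      # all sites holding a copy
        self.value = None

    def write(self, reachable, value):
        # The update must reach the designated primary copy.
        if self.primary not in reachable:
            return False          # primary partitioned away: write blocked
        self.value = value        # primary applies the update, then propagates
        return True

obj = PrimaryCopy("A", {"A", "B", "C"})
assert obj.write({"A", "B"}, 42)      # partition containing the primary: ok
assert not obj.write({"B", "C"}, 7)   # majority of copies, but no primary
assert obj.value == 42
```

A voting scheme would accept the second write, since `{"B", "C"}` is a majority of the three copies; this is exactly the availability difference argued above.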
Finally, the following questions are addressed in this thesis:
Can we improve further the availability of managed objects by utilising voting
techniques?
Can replication be used effectively in a distributed MIB in order to ensure fault
tolerance in a management system?
1. There is a great proliferation of different network technologies and a great need for
network management. Keeping management information available and in a
consistent state is of great importance, since the network operability depends on
management activities.
2. Availability of network information may be obtained only by applying redundancy.
Replication is one of the most widely used techniques that ensure high availability
while keeping the replicated objects in a consistent state.
3. The development of replication schemes for ensuring higher object availability and
for tolerating site and communication failures is very promising and should be
studied further.
4. Failures always happen. No system can work forever within its specifications.
Exogenous or endogenous factors can affect the operability of the system causing
temporary or permanent failures.
5. The need for developing fault tolerant techniques for network management systems is
of great importance.
Chapter 4 presents the correctness criteria that should be taken into account when
designing a replication system. It also introduces an abstract model in order to study
formally certain replica control algorithms. It then presents a variety of replica control
algorithms. It ends with a thorough discussion of the DMCA (Dynamic Majority
Consensus Algorithm), a novel approach that enriches current knowledge of
replication techniques and improves the overall management of replicated objects by
providing higher availability.
Chapter 5 presents the object-oriented development process of the ATS (Availability
Testing System) tool. It first discusses the advantages of using object-oriented
technology to develop such a complex system and then presents the static object model
and the dynamic model of the ATS.
Chapter 6 evaluates the DMCA algorithm and presents quantitative results regarding
the availability provided by the DMCA. It starts by specifying the way in which one can
measure the performance of certain replica control protocols and introduces the
simulation model on which the ATS benchmark utility is built. It also describes the fault
injection mechanism for generating faults and repairs. It ends with a thorough discussion
of the simulation results, justifying the superiority of the DMCA.
The thesis concludes with Chapter 7, which presents the contributions and includes a
discussion of future work and a summary of key results.
This chapter introduces the fundamental idea behind systems management and
illustrates the main features of a management system. It provides a brief discussion
about the architectural model of a management system and introduces the concept of
distributed MIB as a naturally distributed database. It highlights issues related to the
object-oriented MIB modelling and the definition of managed objects. It also justifies the
use of object replication in terms of system performance, data reliability and availability.
Finally it discusses the type of failures that may occur in a management system as well
as the replication architectural models that may be used to maintain multiple replicas.
2.1 Management Functional Areas
Management of a system is concerned with supervising and controlling the
system so that it fulfils the operational requirements. To facilitate the management task,
Open Systems Interconnection (OSI) divides the management design process into five
areas known as the OSI management functional areas (ISO 1989). The fundamental
objectives of the OSI functional areas are to fulfil the following goals.
1. To maintain proper operation of a complex network (fault management).
2. To maintain internal accounting procedures (accounting management).
3. To maintain procedures regarding the configuration of a network or a
distributed processing system (configuration management).
4. To provide the capability of performance evaluation (performance management).
5. To allow authorised access-control information to be maintained and
distributed across a management domain (security management).
In other words, a network that has a management system must be able to manage
its own operations, performance, failures, modifications, security and hardware/software
configuration. To fulfil the above requirements, it is necessary to develop a management
model that is capable of incorporating a vast amount of services covered under the
specifications of the OSI functional areas. The actual architecture of the network
management model varies greatly, depending on the functionality of the platform and
the details of the network management capability. A management architectural model
that has been proposed by OSI defines the fundamental concepts of systems
management (ISO 1992). This model describes the information, functional,
communication and organisational aspect of systems management.
The database of management information, known as the Management
Information Base (MIB), is associated with both the manager and the agent. The MIB is
the conceptual repository of the management information stored in an OSI-based
network management system. The definition of the MIB describes the conceptual
schema containing information about managed objects and relations between them. It
actually defines the set of all managed objects visible to a network management entity.
The MIB may be viewed as the interface definition - it defines a conceptual schema
which contains information about specific managed objects, which are instantiations of
managed object classes. The schema also embodies relationships between these
managed objects, specifies the operations which may be performed on them and
describes the notifications which they may emit (ISO 1993).
The Common Management Information Service Element (CMISE) is the standardised
application service element that is used to exchange management information in the
form of requests and/or request responses
(ISO 1991). The CMISE is a basic vehicle that provides individual management
applications with the means of executing management operations on objects and issuing
notifications. The CMISE provides the means of supporting distributed management
operations using application associations. The CMISE services shown in Table 2-1
constitute the kernel functional unit of the CMISE. A system supporting the CMIP must
implement the kernel functional units of the CMISE.
Service           Type   Function
Notifications
M_EVENT_REPORT    C/NC   Reports the occurrence of an event
Operations
M_GET             C      Retrieves attribute values
M_SET             C/NC   Modifies attribute values
M_ACTION          C/NC   Requests an action to be performed on a managed object
M_CREATE          C      Creates a new managed object instance
M_DELETE          C      Deletes a managed object instance

C = Confirmed
NC = Not Confirmed
the occurrence of an event. Table 2-2 lists the SNMP request and response messages
along with their types and functions.
Service           Type   Function
Notifications
Trap              NC     Reports the occurrence of an extraordinary event
Operations
GetRequest        C      Retrieves the values of the specified variables
GetNextRequest    C      Retrieves the value of the lexicographically next variable
GetResponse       NC     Returns the values requested by a manager
SetRequest        C      Modifies the values of the specified variables

C = Confirmed
NC = Not Confirmed
Behaviour, that specifies how the object reacts to operations performed on it.
The managed object class provides a way to specify a family of managed objects. A
managed object class is a template for managed objects that share the same attributes,
operations, notifications and behaviour. A managed object is an instantiation of the
managed object class.
The MIB is the conceptual repository containing all the related information about
managed objects. The MIB modelling encompasses an abstract model and an
implementation model (KOTSAKIS 1995). The abstract model defines concepts
related to managed object classes and the relationships between them.
The implementation model (BABAT 1991, KOTSAKIS 1995) defines the following
The Management Information Model (ISO 1993) defines two types of management
operations: operations applied to the attributes of a managed object, such as add
member and remove member, and operations applied to the managed object as a
whole.
Any operation may affect the state of one or more attributes. The operations may also be
performed atomically (either all operations succeed or none is performed). Operations
that may be applied to the managed object as a whole are the following:
create
delete
action
An action operation requests the managed object to perform the specified action and to
indicate the result of this action.
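As a hypothetical illustration of the template idea and of atomic operations, the sketch below models a managed object whose attribute updates either all succeed or leave the object untouched. The class and attribute names are invented, not taken from any MIB definition.

```python
# Sketch of a managed object: the class fixes the attributes and operations
# its instances share, and multi-attribute updates are applied atomically.
class ManagedObject:
    def __init__(self, **attributes):
        self.attributes = dict(attributes)

    def get(self, name):
        return self.attributes[name]

    def set_atomic(self, updates):
        # Validate every update before applying any, so a bad attribute
        # name leaves the object's state completely untouched.
        unknown = set(updates) - set(self.attributes)
        if unknown:
            raise KeyError(f"unknown attributes: {unknown}")
        self.attributes.update(updates)

    def action(self, name, *args):
        # Dispatch a named action and return its result.
        return getattr(self, name)(*args)

link = ManagedObject(operational_state="enabled", error_count=0)
link.set_atomic({"error_count": 3})
assert link.action("get", "error_count") == 3
try:
    link.set_atomic({"error_count": 9, "no_such_attr": 1})
except KeyError:
    pass
assert link.get("error_count") == 3   # atomic: the failed update changed nothing
```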
2.5 Distributed Management Information Base (MIB)
Roles are not permanently assigned to a management entity. Some management
entities may be restricted to only taking an agent role, some to only taking a manager
role while other are allowed to take an agent role in one interaction and to take a
manager role in a separate interaction. In order to perform system management and
share management knowledge, it is sometimes necessary to embody manager and agent
within a single open system (see Figure 2-2). Shared management knowledge is implied
by the nature of the management framework since the management applications are
distributed across a network. Therefore the management information base may be
naturally viewed as a distributed database containing the managed objects that belong
to the same management system but is physically spread over multiple sites (hosts) of a
computer network. The MIB can be considered a superset of managed objects. Each
subset of this superset may constitute a set of objects associated with a device physically
separated from any other managed device (ARPEGE 1994). Therefore the managed
objects in each location may be viewed as a local management description of the
corresponding device.
Because the MIB is
distributed over several sites, one site may fail while other sites continue to operate.
Only the objects associated with the failed site cannot be accessed. This improves
both reliability and availability. On the other hand, a failure in a centralised MIB
may make the whole system unavailable to all users.
A typical arrangement of a management system is shown in Figure 2-3. The nodes may
be located in physical proximity and connected via a Local Area Network (LAN), or
they may be geographically distributed over an interconnected network (Internet). It is
possible to connect a number of diskless workstations or personal computers as
managers to a set of agents that maintain the managed objects. As illustrated in Figure
2-3, some nodes may run as managers (such as the diskless node 1, or the node 2 with
disks), while other nodes are dedicated to run only agent software, such as the node 3.
Still other nodes may support both manager and agent roles, such as the node 4.
Interaction between manager and agent might proceed as follows:
1. The manager parses a user query and decomposes it into a number of
independent queries that are sent separately to independent management agent
nodes.
2. Each agent node processes the local query and sends a response to the manager
node.
3. The manager node combines the results of the subqueries to produce the result of
the original submitted query.
4. If something occurs in an agent that changes its operational state, a notification
may be generated from the agent and an associated message is sent urgently to
the manager for further processing.
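The interaction steps above can be sketched as follows. The object names, the location catalogue and the query format are illustrative assumptions, not part of any standard:

```python
# Sketch of the manager-side decompose/combine cycle (steps 1-3 above).

LOCATION = {                    # the manager keeps track of object locations
    "ifInOctets.eth0": "agent-1",
    "ifInOctets.eth1": "agent-2",
}

AGENT_MIBS = {                  # each agent only sees its own local objects
    "agent-1": {"ifInOctets.eth0": 1200},
    "agent-2": {"ifInOctets.eth1": 3400},
}

def decompose(query):
    """Step 1: split a global query into independent per-agent subqueries."""
    subqueries = {}
    for obj in query:
        subqueries.setdefault(LOCATION[obj], []).append(obj)
    return subqueries

def local_query(agent, objects):
    """Step 2: an agent node processes its local subquery."""
    return {obj: AGENT_MIBS[agent][obj] for obj in objects}

def global_query(query):
    """Step 3: the manager combines the subquery results."""
    result = {}
    for agent, objects in decompose(query).items():
        result.update(local_query(agent, objects))
    return result

print(global_query(["ifInOctets.eth0", "ifInOctets.eth1"]))
# → {'ifInOctets.eth0': 1200, 'ifInOctets.eth1': 3400}
```

Because the location catalogue lives inside the manager, the caller of `global_query` never names an agent: this is exactly the MIB transparency property discussed below.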
The agent software is responsible for local access of managed objects while the manager
software is responsible for most of the distribution functions; it processes all the user
requests that require access to more than one management node, and it keeps track of
where each managed object is located. An important function of the manager is to hide
the details of data distribution from the user, that is, the user should write global queries
as though the MIB were not distributed. This property is called MIB transparency. A
management system that does not provide distribution transparency makes it the
responsibility of the user to specify the managed node associated with a managed object.
Systems are becoming increasingly complex and distributed. As a result, they are
exposed to problems such as failures, performance inefficiencies and resource allocation
conflicts. An efficient integrated network management system is therefore required to
monitor, interpret and control the behaviour of their hardware and software resources.
This task is currently carried out by centralised network management systems in which a
single management system monitors the whole network. Most existing management
systems are platform-centred: the applications are separated from the data they
require and from the devices they need to control. Although some experts believe that
most network management problems can be solved with a centralised management
system, there are real network management problems that cannot be adequately
addressed by the centralised approach (MEYER 1995).
There are four basic approaches to network management systems:
centralised, platform based, hierarchical and distributed (LEINWARD 1993).
applications use the services offered by the management platform to handle decision
support. The advantage of this approach is that applications do not need to worry about
protocol complexity and heterogeneity.
The hierarchical architecture (Figure 2-4.c) uses the concept of a Manager Of
Managers (MOM) and the manager-per-domain paradigm (LEINWARD 1993). Each
domain manager is responsible only for the management of its own domain and is
unaware of other domains. The manager of managers sits at a higher level and requests
information from the domain managers.
Figure 2-4 Network management approaches (a) centralised (b) platform based (c) hierarchical (d)
distributed
while the network management cost in communication and computation decreases. This
approach has also been adopted by ISO standards and the Telecommunication
Management Network (TMN) architecture (ITU 1995).
The Common Object Request Broker Architecture (CORBA) (OMG 1997) is also an
important standard for distributed object-oriented systems. It is aimed at the
management of objects in distributed heterogeneous systems. CORBA addresses two
challenges in developing distributed systems (OMG 1997):
1. making the design of a distributed system no more difficult than that of a
centralised one, and
2. providing an infrastructure to integrate application components into a distributed
system.
Network (TMN) framework (ITU 1995). On the other hand, in the Internet community, the
Simple Network Management Protocol (SNMP) has gained widespread acceptance due
to its simplicity of implementation. Thus, TMN and Internet management will co-exist
in the future.
The aim of the TMN is to enhance interoperability of management software and to
provide an architecture for management systems. A TMN is a logically distinct network
from the telecommunication network that it manages. It interfaces with the
telecommunication network at several different points and controls its operation. The
TMN information architecture is based on an object oriented approach and the
agent/manager concepts that underlie the Open Systems Interconnection (OSI) systems
management.
The Telecommunication Management Network (TMN) is a framework for the
management of telecommunication networks and the services provided on those
networks. The Open Systems Interconnection (OSI) management framework is an
essential component of the TMN architecture. Each TMN function block can play the
role of an OSI manager, an OSI agent or both.
A managed object instance can represent a resource, and thus there is a
requirement for communication between managed object instances in an OSI agent and
the resources they represent. Examples of resources include telecommunication
switches, bridges, gateways etc. If a new interface card is added to a switch, the switch
may send a create request to the agent for the creation of the corresponding managed
object instance.
Figure 2-5 illustrates how TMN systems can inter-work within the TMN logical layers.
The management information base (MIB) is the managed object repository and
may be implemented by using C++ objects through a MIB composer tool (FERIDUM
1996). In (BAN 1995) a uniform generic object model is presented.
1. Performance: persistent managed objects ensure fast restart after agent failures. For
example the instance representing a leased line between two communication nodes
may need to be persistent, whereas an instance representing a connection does not
(since after an agent failure, the connection will be terminated). Object oriented
messages. Unfortunately, the current CORBA standard makes no provision for fault
tolerance.
To provide fault tolerance, objects should be replicated across multiple processors
within the distributed system (ADAMEC 1995). The motivations for applying object
replication in a distributed network management system are of several types:
One is performance enhancement. Management information that is shared by
a large manager community should not be held at a single server, since this
computer would act as a bottleneck that slows down responses.
Another is improved fault tolerance. When the computer holding one
replica crashes, the system can continue the management computation with another replica.
A third motivation is the use of replicas to access remote objects.
When a remote object is to be accessed, a local replica reflecting the remote object's
state is created and used instead of the remote object.
Figure 2-6
information. The agent updates the MIBs located at the manager sites by exchanging
messages with the managers. Each manager gets the management information locally
without the need to issue remote requests. This yields a performance increase, since
two additional instances of the MIB are used to provide information about the same
resources.
(MAFFEIS 1997b) discusses a CORBA-based fault-tolerant system that monitors remote
objects; if some of them fail, it automatically restarts the failed objects and replicates
stateful objects on the fly, migrating objects from one host to another. In
(NARASIMHAN 1997) a similar system is discussed, which provides fault-tolerant
services under CORBA to applications with no modification to the existing ORB.
current state of the resources. The manager and the agent of each domain could be
accommodated by the same host computer. However, we use the most general case in
which manager and agent reside at different computers. This could be the case where
the manager runs on a diskless machine. There are two possible scenarios:
1. No replication: Each MIB stores information about managed resources of its own
domain.
2. Replication: Each MIB contains replicated information which is associated with
resources of other domains.
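The two scenarios can be contrasted with a toy sketch; the domain names and MIB entries are illustrative:

```python
# Toy sketch of the two scenarios above: per-domain MIBs without and
# with replication of other domains' information.
local_info = {
    "A": {"linkA.status": "up"},
    "B": {"linkB.status": "up"},
}

def build_mibs(replicate):
    """Scenario 1 (replicate=False): each MIB holds only its own domain.
    Scenario 2 (replicate=True): each MIB also holds copies of the others."""
    mibs = {}
    for domain in local_info:
        if replicate:
            merged = {}
            for info in local_info.values():
                merged.update(info)        # replicate every domain's entries
            mibs[domain] = merged
        else:
            mibs[domain] = dict(local_info[domain])
    return mibs

no_repl = build_mibs(replicate=False)
repl = build_mibs(replicate=True)
print("linkB.status" in no_repl["A"], "linkB.status" in repl["A"])  # False True
```

In the replicated scenario, manager A can answer queries about domain B locally, at the cost of keeping B's entries up to date.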
[Figure: three management domains A, B and C, each with its own manager, agent and
MIB, attached to networks A, B and C respectively; the networks are interconnected
through bridges AB, AC and BC, and each agent manages its local network resources.]
control it. At the other, the system does everything without users noticing anything. The
ANSA reference manual (ANSA 1989) and the International Standard Organisation
Reference Model for Open Distributed Processing (ISO 1992a) provide definitions
related to replication transparency. Among other things, these standards state that a
replication system is transparent if it enables multiple instances of an information
object (in our case, a managed object) to be replicated without knowledge of the
replicas by users or application programs.
A basic architectural model for controlling replicated objects may involve distinct
agencies located across a network. Figure 2-9(a) shows how a manager may control the
entire process in a non-transparent system. When a manager creates or updates an object,
it does so on one agency and then takes responsibility for making copies or completing
the update on the other agencies.
An agency is a process that contains replicas and performs operations upon them
directly. An agency usually maintains a physical copy of every logical item; however,
there are cases in which it may not. For example, a managed object needed mostly by a
manager on one LAN may never be used by a manager on another LAN. In this case the
agency on the second LAN need not contain a physical copy of that object, and if that
manager ever requests information about the object, the local agency may obtain the
information by calling another agency that actually holds a physical copy. The general
model for a transparent replication system is shown in Figure 2-9(b). A manager's
request is first handled by a Front End (FE) component.
Figure 2-9: Architectural model for replication. (a) non-transparent system (b)
transparent replication system (c) lazy replication (d) primary copy model.
The FE component passes messages to at least one agency. This hides the
details of how, and to which agency, a message is forwarded. The manager does
not need to determine a specific agency for service; it just sends the message, and the
FE component takes responsibility for determining which agency will receive the request.
The FE component may be implemented as part of the manager application, or it may be
implemented as a separate process invoked by a manager application using some kind of
Interprocess Communication (PRESOTTO 1990). Figure 2-9(c) shows a specialisation of
the architectural model in Figure 2-9(b). The model in (c) is called a lazy replication
model and implements what is known as the gossip architecture (LADIN 1992). Here the
manager creates or updates only one copy at one agency. Later the agency itself makes
replicas at other agencies automatically, without the manager's knowledge. The
replication server runs in the background all the time, scanning the managed object
hierarchy. Whenever it finds a managed object with fewer replicas than expected,
the replication server arranges to make the additional copies. The
replication server works best for immutable objects, since such objects cannot change
during the replication process. This architecture is called the gossip architecture
because the replica agencies exchange gossip messages in order to convey the updates
they have each received. In the gossip architecture the FE component communicates directly
either with an individual agency or with more than one agency. Figure 2-9(d)
shows another replication architectural model, known as the primary copy model
(LISKOV 1991). In this model all front ends communicate with the same primary
agency when updating a particular data item. The primary agency propagates the
updates to the other agencies, called slaves. Front ends may also read objects from a
slave. If the primary agency fails, one of the slaves can be promoted to act as the
primary. Front ends may communicate with either a primary or a slave agency to
retrieve information. In that case, however, front ends may not perform updates;
updates are made only to the primary copy of an object.
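The primary copy model above can be sketched as follows. This is a minimal illustration with hypothetical class names; a real system would add failure detection and atomic update propagation:

```python
# Minimal primary-copy sketch: updates go through the primary, which
# propagates them to the slaves; reads may be served by any live replica.
class Agency:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

class PrimaryCopyGroup:
    def __init__(self, agencies):
        self.agencies = agencies           # agencies[0] starts as the primary

    def primary(self):
        for a in self.agencies:
            if a.alive:
                return a                   # first live agency acts as primary:
        raise RuntimeError("no live agency")   # a slave is promoted on failure

    def write(self, key, value):
        p = self.primary()
        p.store[key] = value
        for slave in self.agencies:        # the primary propagates the update
            if slave is not p and slave.alive:
                slave.store[key] = value

    def read(self, key):
        for a in self.agencies:            # reads may come from any replica
            if a.alive:
                return a.store.get(key)

group = PrimaryCopyGroup([Agency("primary"), Agency("s1"), Agency("s2")])
group.write("x", 1)
group.agencies[0].alive = False            # primary fails; s1 takes over
group.write("x", 2)
print(group.read("x"))  # 2
```

The promotion rule here (first live agency in a fixed ranking) mirrors the primary/first back-up/second back-up ranking discussed in the next chapter's failure-masking material.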
2.13 Summary
This chapter has set the background for a replication management system. It has
shown the need for using a replication scheme in a real time application. It has
examined the distributed aspect of a network management system describing the
distributed nature of the MIB. It has briefly discussed two major protocols (CMIP and
SNMP) for exchanging management messages. It has also examined design aspects of
the MIB discussing the significance of the managed object as an autonomous entity for
performing operations related to incoming messages. The concepts of object availability
and performance have been defined and used as a measure of the quality of service of
the system. Synchronous and asynchronous replica models have been examined and
finally various architectural models for replication have been discussed as ways to
maintain multiple replicas transparently.
In the following chapters we will discuss the internal mechanisms (algorithms) used
to obtain transparent updates to replicated objects. We will discuss a variety of solutions
that may be applied to ensure consistency among multiple replicas in the presence of node
or communication link failures.
can be a user at another level of abstraction. The relationship between user and resource
is a "depends on" relationship, as shown in Figure 3-1.
A distributed management system consists of many agents. The management
services provided by those agents may depend on other secondary low level
management services associated with operating system components as well as
communication components. The union of all these management services is provided as
a distributed management system service. To ensure correctness and management
service availability, the classes of possible failures in the lower levels of abstraction
should be studied and redundancy in particular management services should be
introduced to prevent system crashes.
3.2 Failure Classification
An agent designed to provide a certain management service works correctly if in
1. Omission Failure: It happens when the agent receiving a request omits to respond
to that request. This failure occurs either because the queue of incoming messages
at the agent is full, so that any additional request is lost, or because an internal
failure (e.g. a memory allocation failure) is experienced due to a temporary lack of
physical resources for handling the incoming request. A communication service that
occasionally loses messages but does not delay them is an example of a
service that suffers omission failures.
2. Timing Failure: It happens when the agent response is functionally correct but
untimely. The response occurs outside the real-time interval specified. The most
frequent timing failure is the performance (late timing) failure in which the
response reaches the manager after the elapse of the time interval during which the
manager is expecting the response. This failure occurs because either the network
is too slow or the agent is overloaded and responds late to the
manager. Excessive message transmission or message processing delay due to
an overload is an example of a performance failure.
3. Response Failure: It happens when the agent responds incorrectly: either the value
of its output is incorrect (value failure) or the state transition that takes place is
incorrect (state failure). A search procedure that "finds" a key that is not an entry
of a routing table is an example of a response failure.
4. Crash Failure: It happens when, after the first omission to produce a response to a
request, an agent omits to produce outputs for subsequent requests
handling the failure. The behaviour of an agent under the occurrence of a failure may be
classified as follows:
Fail-stop behaviour
Byzantine behaviour
With fail-stop behaviour, a faulty agency just stops and does not respond to subsequent
requests or produce further output, except perhaps to announce that it is no longer
functioning. With Byzantine behaviour, a faulty agency continues to run, issuing wrong
responses to requests and possibly working together maliciously with other faulty
managers or agencies to give the impression that they are all working correctly when
they are not. In our study we assume only fail-stop behaviour.
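A fail-stop agency, as assumed in this study, can be caricatured in a few lines (illustrative only):

```python
class FailStopAgency:
    """A fail-stop agency simply stops responding after it fails; it never
    issues wrong (Byzantine) responses after the failure."""
    def __init__(self):
        self.crashed = False

    def request(self, query):
        if self.crashed:
            return None            # no response at all: the silence can be
        return f"result:{query}"   # detected by callers, e.g. via a timeout

a = FailStopAgency()
print(a.request("status"))  # result:status
a.crashed = True
print(a.request("status"))  # None
```

The value of the fail-stop assumption is visible even in this toy: a caller only has to detect silence, never to judge whether a received answer is a lie.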
and thus F/G is a weaker failure semantics than F. An agent that can exhibit any
failure behaviour has the weakest failure semantics, called arbitrary failure semantics.
Arbitrary failure semantics therefore includes all the previously defined failure
semantics. It is the responsibility of the agent designer to ensure that the agent properly
implements its specified failure semantics. In general, the stronger a failure semantics is,
the more expensive and complex it is to build an agent that implements it.
3.5 Failure Masking
A failure behaviour can be classified only with respect to a certain agent
specification, at a certain level of abstraction. If a management agent depends on
lower-level agents to correctly provide management services, then a failure of a certain
type at a lower level of abstraction can result in a failure of a different type at a higher
level of abstraction.
Let us consider the example in Figure 3-2. A manager M sends a request to
agent A, which in turn uses agent B to get some information necessary to build a
response to the manager's request. Suppose that B is unable to provide the
necessary information to agent A due to either a communication failure (omission
or performance failure) or a site failure (crash, value failure etc.). Agent A is actually
built one layer above B, and it may hide the failure of B either by using another agent,
say C, that provides exactly the same information as B, or by trying to resolve the
problem itself, playing the role of B as well (it may directly access the managed object
hosted at B's site). Agent A may also change the failure semantics: a crash
failure in agent B may be propagated by agent A as an omission failure to the manager.
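The masking scenario above might be sketched as follows; the agents are modelled as plain records and the names are hypothetical:

```python
# Sketch of hierarchical masking: agent A masks B's crash by falling back
# to an equivalent agent C; if both fail, A propagates an omission (no
# answer) to the manager, changing the failure semantics as described.
def query_agent(agent):
    if agent["alive"]:
        return agent["data"]
    raise ConnectionError(f"{agent['name']} unreachable")  # lower-level crash

def agent_a_respond(agent_b, agent_c):
    for lower in (agent_b, agent_c):     # masking attempt: try B, then C
        try:
            return query_agent(lower)
        except ConnectionError:
            continue
    return None   # masking failed: B's crash surfaces as an omission from A

b = {"name": "B", "alive": False, "data": "ifTable"}
c = {"name": "C", "alive": True, "data": "ifTable"}
print(agent_a_respond(b, c))   # ifTable (the failure of B is masked)
c["alive"] = False
print(agent_a_respond(b, c))   # None (propagated as an omission failure)
```

Note how the exception carries the failure information across the abstraction boundary, exactly the role assigned to exception handling in the text below.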
Failure propagation among managers and agents situated at different abstraction
levels of the "depends on" hierarchy can be a complex phenomenon. The task of
checking the correctness of results provided by lower-level servers is very cumbersome
and for this reason, designers prefer to use agents with as strong as possible failure
semantics. Exception handling provides a convenient way to propagate information
about failure detection across abstraction levels and replication of certain services
provide the mechanism for masking lower level failures. An agent A that is able to
provide certain services despite the failure of an underlying component, it is said to
mask the component's failure. If the masking attempts of an agent do not succeed, a
consistent state must be recovered for the agent before information about the failure is
propagated to the next level of abstraction, where further masking attempts can take
place. In this way information about the failure of lower level components can either be
hidden from the human users by successful masking attempt or can be propagated to
human users as a failure of a higher-level service they requested. The programming of
masking and consistent-state recovery actions is usually simpler when the designer
knows that the components do not change their state when they cannot provide their
services. Agents which either provide their standard service or signal an exception
without changing their state (called atomic agents (CRISTIAN 1989)) simplify fault
tolerance because they provide their users with simple-to-understand omission failure
semantics.
To ensure that a service remains available to managers despite agent failures, one
can implement the service by a group of redundant, physically independent components,
so that if some of these fail, the remaining ones provide the service. We say that a group
masks the failure of a member m whenever the group (as a whole) responds as specified
to users despite the failure of m. While hierarchical masking requires users to implement
any resource failure-masking attempts as exception handling mechanisms, with group
masking, individual member failures are entirely hidden from users by the group
management mechanisms.
The group output is a function of the outputs of individual group members. For
example, the group output can be the output generated by the fastest member of the
group, the output generated by some distinguished member of the group, or the result of
a majority vote on the group members' outputs. A group G has failure semantics F if the
failures that are likely to be observed by users are in class F. An agent group able to
mask any k concurrent member failures from its managers is termed k-fault
tolerant; when k equals one, the group is single-fault tolerant, and when k is greater
than one, the group is multiple-fault tolerant. For example, if the k members of an
agent group have crash/performance failure semantics, with members ranked as primary,
first back-up, second back-up, and so on, up to k-1 concurrent member failures may be
masked. A group of 2k+1 members with arbitrary failure semantics, whose output is the
result of a majority vote among outputs computed in parallel by all members, can mask
up to k member failures. When a majority of members fail in an arbitrary way,
the entire group can fail in an arbitrary way. Hierarchical and group masking are two
of group members becomes weaker. In (CRISTIAN 1985,
amounts of failure detection, recovery and masking redundancy used at various levels of
a management system in order to obtain the best overall cost/performance/dependability
result. Recent research has shown that a small investment at a lower level of abstraction,
ensuring that lower-level components have stronger failure semantics, can often
contribute substantial cost savings at higher levels of
abstraction and can result in lower overall cost (CRISTIAN 1991). On the other hand,
using too much redundancy, especially masking redundancy, at the lower
levels of abstraction of a system might be wasteful from an overall cost-effectiveness
point of view, since such lower-level redundancy can duplicate the masking redundancy
that higher levels of abstraction might use to satisfy their own dependability
requirements (SALTZER 1984).
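Returning to group masking: the 2k+1 majority vote described above can be sketched as follows (illustrative; member outputs are assumed to be directly comparable values):

```python
from collections import Counter

def group_output(member_outputs, k):
    """Majority vote over 2k+1 member outputs: masks up to k arbitrary
    (even Byzantine) member failures, as described in the text."""
    assert len(member_outputs) == 2 * k + 1
    value, votes = Counter(member_outputs).most_common(1)[0]
    if votes >= k + 1:        # a strict majority cannot be produced by
        return value          # the at-most-k faulty members alone
    raise RuntimeError("no majority: more than k members failed")

# k = 1: three members, one of which answers arbitrarily.
print(group_output(["up", "up", "garbage"], k=1))  # up
```

With k+1 matching votes out of 2k+1, at least one vote must come from a correct member, which is why the group output is trustworthy as long as at most k members misbehave.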
3.6 Architectural Issues
A prerequisite for the implementation of a management service by an agent group
capable of masking low-level component failures is the existence of multiple hosts with
access to the physical resources used by the service. For example, if a disk containing
managed object instances can be accessed from four different agents, then all four
agents can host management services for that management database. A four-member
agent group can then be organised to mask up to three concurrent processor failures.
Therefore, replication of the resources needed by a service is a prerequisite for making
that service available despite individual resource failures. The use of agent groups raises
a number of novel issues.
Group synchronisation. How should group members running on different processors
(or machines) maintain consistency of their local states in the presence of member
failures, member joins, and communication failures?
Group size. How many members should a group have?
restore old backups. For on-line transaction processing environments, such delays are
considered critical. For real time applications, if the response time required is smaller
than the time needed to detect a member failure and to restore old backups, close
synchronisation has to be used.
advantages of this approach are obvious only if we have to implement different services
on different sites. The different services are provided through a service availability
manager which is used to forward requests on different services to the appropriate
member or subgroup. When we need to implement just one service, this approach is not
satisfactory because it increases the total overhead. Another drawback is that no group
availability policy will be enforced when the availability manager is down. In the case of
a network management fault-tolerant system we have to implement just one service
associated with the management of certain network resources. Therefore the objective is
to have a specified management service availability policy enforced whenever at least
one site works in the system. This results in a need to replicate the global state on all
working members of the group.
To maintain the consistency of these replicated global managed objects at all sites
in the presence of random communication or site failures, each replica of a managed
object should be updated in such a way that all these sites see the same version of the
managed object. If different availability managers see different sequences of updates,
then their local views of the global system state will diverge. This might lead to
violations of a specified agent group availability policy.
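The need for a single agreed sequence of updates can be demonstrated with a small sketch; the attribute names are illustrative, and real protocols would use logical or synchronised timestamps:

```python
def apply_updates(initial, updates):
    """Each update overwrites an attribute; the final state depends on order."""
    state = dict(initial)
    for attr, value in updates:
        state[attr] = value
    return state

u1 = ("policy", "restart-member")
u2 = ("policy", "remove-member")
site_a = apply_updates({}, [u1, u2])   # site A sees u1 then u2
site_b = apply_updates({}, [u2, u1])   # site B sees u2 then u1
print(site_a == site_b)  # False: the local views of the global state diverge

def apply_in_total_order(initial, updates):
    # Sorting by an agreed timestamp gives every site the same sequence,
    # whatever order the update messages actually arrived in.
    state = dict(initial)
    for ts, attr, value in sorted(updates):
        state[attr] = value
    return state

t1 = (1, "policy", "restart-member")
t2 = (2, "policy", "remove-member")
ordered_a = apply_in_total_order({}, [t1, t2])  # received in order
ordered_b = apply_in_total_order({}, [t2, t1])  # received out of order
print(ordered_a == ordered_b)  # True: both sites converge
```

This is exactly the divergence the availability managers must avoid: without an agreed ordering, identical updates delivered in different orders yield different views of the replicated managed object.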
3.12 Summary
This chapter has proposed a number of concepts that are fundamental in designing
fault tolerant network management systems. Some of the concepts such as the notion of
the dependency between management agents and the hierarchical structure of agents
are fundamental to any fault tolerant distributed system. Dependability has been
examined as a way to form a hierarchy of co-operative agents that work together to
provide higher service availability.
Failure classification provides a way to understand the behaviour of certain
failures and to set the background for a possible recovery technique. The most frequent
failures have been discussed and the causes of such failures have been examined. The
behaviour of a faulty agent may determine the actions adopted to recover from a
particular anomaly. Fail stop and Byzantine behaviour have been stated as two different
ways with which a faulty agent may interact with other agents.
A study of failure semantics has been presented, and the distinction between weak
and strong failure semantics has been examined in terms of failure behaviour.
Hierarchical failure masking is a technique used to hide the effects of a failure, either by
calling replicated agents to provide certain services or by disguising a failure and letting
a higher abstraction layer handle it.
Concepts such as group synchronisation, group size, group communication and
availability policy relate to the architectural aspects of the network management
system, and they have been discussed from the designer's perspective. The architectural
aspects are used, first, to formulate fault tolerance issues that arise in designing a
replication management system and, second, to describe various design choices.
The next chapter will examine certain quorum consensus replica control protocols
used as the core mechanism for ensuring consistency among replicas in a
group of agents. Replica availability of the proposed protocols will be used as a measure
for examining the suitability of the protocols. In a following chapter we measure
replica availability by simulating membership changes in a group of agents.
This chapter presents the correctness criteria that should be taken into account
when designing a replication system. It discusses the difference between the logical and
the physical entity of a replicated object and explains the transaction processing
strategy that satisfies the correctness criteria. An abstract model is introduced in order to
study certain replication algorithms formally. A survey of a variety of replica control
algorithms is presented and each replica control algorithm is discussed thoroughly.
These algorithms constitute the internal mechanism of a replication scheme and they are
used basically to ensure consistency among multiple copies of an object in the presence
of network failures.
4.1 Partitioning in a Replication System
As shown in previous chapters, the technique of data replication in distributed
database systems (such as a distributed MIB) is typically used to increase the availability
and reliability of stored data in the presence of node failures and network partitions
(DAVIDSON 1985, GIFFORD 1979, BERNSTEIN 1987, JOSEPH 1987,
JAJODIA 1989, SARIN 1985). The idea of replicating an object at those sites that
frequently access it may be implemented by storing copies of the object where
access to it seems inevitable. By storing copies of a critical object on many nodes, the
probability that at least one copy of the object will be accessible in the presence of
failures increases.
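Under the simplifying assumption that the copies fail independently, each being accessible with probability p, this improvement can be quantified:

```python
def availability(p_copy_up, n_copies):
    """Probability that at least one of n independent copies is accessible,
    given that each copy is up with probability p_copy_up."""
    return 1.0 - (1.0 - p_copy_up) ** n_copies

for n in (1, 2, 3):
    print(n, round(availability(0.9, n), 3))
# 1 0.9
# 2 0.99
# 3 0.999
```

Each additional copy multiplies the probability of total inaccessibility by (1 - p); the independence assumption is of course optimistic when copies share a network segment or power source.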
All reads and writes are accomplished via transactions. Transactions are assumed to be
correct. More precisely, a transaction transforms an initially correct database state into
another correct state. Transactions may interact with one another indirectly by reading
and writing the same data items. Two operations on the same object are said to conflict
if at least one of them is a write (BERNSTEIN 1987). Conflicts are often labelled
either read-write, write-read or write-write, depending on the types of data operations
involved and their order of execution (BERNSTEIN 1981).
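The conflict rule just cited can be written directly as a predicate; operations are modelled here as (kind, object) pairs, an illustrative encoding:

```python
def conflicts(op1, op2):
    """Two operations conflict iff they touch the same object and at
    least one of them is a write (BERNSTEIN 1987)."""
    same_object = op1[1] == op2[1]
    one_is_write = "write" in (op1[0], op2[0])
    return same_object and one_is_write

print(conflicts(("read", "x"), ("write", "x")))   # True  (read-write)
print(conflicts(("write", "x"), ("write", "x")))  # True  (write-write)
print(conflicts(("read", "x"), ("read", "x")))    # False (reads never conflict)
print(conflicts(("write", "x"), ("write", "y")))  # False (different objects)
```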
A generally accepted notion of correctness for a database system is that it executes
transactions so that they appear to users as isolated actions on the database. This
property, referred to as atomicity, is achieved by the all-or-nothing execution of the
transaction operations: either all writes succeed (committed transactions)
or none are performed (aborted transactions). Correctness and consistency between
operations performed in different transactions are ensured by assigning to any set of
concurrent operations a serial execution that produces the same effect
(serialisability). That is, a serialisable execution is a concurrent execution of many
transactions that produces the same effects on the database as some serial execution of
the same transactions. Other correctness criteria may be expressed in the form of
integrity constraints. Such criteria may range from simple constraints (e.g. a particular
object cannot take a negative value) to more complex constraints that involve many
replicas (e.g. all replicas must present the same view of a particular object whenever
they are accessed).
In a system with integrity constraints, an operation is allowed only if its execution
is atomic and its results satisfy the integrity constraints. In a replicated database the
value of each logical object X is expressed by one or more physical instances, which are
referred to as the copies of X. Each read and write operation issued by a transaction on
some logical data item must be mapped by the database system to corresponding
operations on physical copies. The mapping must ensure that the concurrent execution
of transactions on replicated objects is equivalent to a serial execution on non-replicated
objects, a property known as one-copy serialisability (DAVIDSON 1985). The part of the
replication system that is responsible for this mapping is called the replica control protocol
(algorithm).
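As a minimal illustration of such a mapping, the following C sketch implements the simplest replica control rule, read one/write all, over an in-memory array of copies (illustrative only; there is no failure handling or concurrency control):

```c
#define N_COPIES 3

static int copies[N_COPIES];  /* physical copies of one logical object */

/* a logical read is mapped to a read of a single physical copy */
int logical_read(void)
{
    return copies[0];
}

/* a logical write is mapped to writes of all physical copies */
void logical_write(int v)
{
    for (int i = 0; i < N_COPIES; i++)
        copies[i] = v;
}
```

Because every write reaches every copy, reading any single copy returns the latest committed value.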
4.3 Transaction Processing During Partitioning
In a partitioned network, where the communication connectivity of the system is
broken by failures or by communication shutdowns, each partition must determine
which transactions it can execute without violating the correctness criteria. It is assumed
that the network is cleanly partitioned (that is, any two sites in the same partition can
communicate and any two sites in different partitions cannot) and that
one-copy serialisability is the correctness criterion. Addressing the correctness criteria
implies satisfying the following propositions:
1. Correctness must be maintained within a single partition by assigning a
single view to all the replicas in the partition.
2. Each partition must make sure that its actions do not conflict with the actions
of other partitions.
Correctness within a single partition can be maintained by adopting one of the replica
control algorithms. For example, the sites in a partition can implement a write operation
on a logical object by writing all copies in the partition. The problem of ensuring one-copy
serialisability across partitions becomes more difficult as the number of partitions
increases. In theory, a replication scheme contains two algorithms: one to ensure
correctness across partitions and a replica control algorithm to ensure one-copy
behaviour. In practice many replication schemes combine both algorithms into a single
solution.
4.4 Partition Processing Strategy
Solving the problem of global correctness requires dealing with two issues:
1. When a partition occurs, sites executing transactions may find themselves in
different partitions and thus unable to take a decision as to whether to commit
the transaction or abort it.
2. When partitions are reconnected (reunited) mutual consistency between
copies in different partitions must be re-established. By mutual consistency, it
is meant that the copies have the same state (or value). The updates made to a
logical object in one partition must be propagated to its copies in all the other
partitions.
Partition processing strategies can basically be divided into two classes. The first one is
called optimistic and allows updates in all partitions in the network. The second one is
called pessimistic and allows updates to take place only in one partition.
Optimistic protocols (BLAUSTEIN 1985, DAVIDSON 1984, SARIN 1985)
hope that conflicts among transactions are rare. These algorithms take the
approach that any copy of the replicated object must be available even when the network
partitions. Optimistic algorithms require a mechanism for conflict detection and
resolution. To preserve consistency, conflicting transactions are rolled back when
partitions are reunited.
Pessimistic protocols (GIFFORD 1979, ABBADI 1986, PRIS 1986a,
JAJODIA 1989, KOTSAKIS 1996a) maintain the consistency of the replicated object
even in the presence of network partitioning. Replicated objects are updated only in a
single partition at any given time. Thus only one partition holds the most recent copy,
preventing in that way any possible conflict.
Optimistic protocols are useful in situations in which the number of replicated objects is
large and the probability of partitioning is small. Pessimistic protocols prevent
inconsistency by limiting availability. Each partition makes worst-case assumptions
about what other partitions are doing and operates under the assumption that if an
inconsistency can occur, it will occur. Optimistic protocols do not limit availability and
allow any transaction to be executed in a partition that contains copies of an object.
Optimistic protocols operate under the optimistic assumption that inconsistencies, even
if possible, rarely occur. Optimistic protocols allow conflicts among the transactions and
try to resolve them when the conflicts occur. Pessimistic protocols do not allow conflicts
and prevent any inconsistency by allowing updates only in a single partition. As a
consequence, a pessimistic protocol is more suitable for real-time applications (network
management applications etc.) than an optimistic one. In a critical real-time application
like that of managing the operations of a satellite (or a nuclear reactor), the replicated
data must be consistent at all times and any possible conflict should be prevented. Real-time
processes interact dynamically with the external world (i.e. network resources).
When a stimulus appears, the system must respond to it in a certain way before a certain
deadline, taking into account all the current information. The time limit sometimes does
not allow conflict resolution. If, for instance, the response is not delivered within a pre-specified time interval, the service may be considered unavailable (performance failure).
The advantages of using a pessimistic protocol over an optimistic one in a distributed
database system that is used as a repository for real-time applications are the following:
1. A pessimistic protocol prevents any inconsistency, whereas an optimistic one
allows inconsistency and tries to resolve it later.
2. A pessimistic protocol has faster response, since all the information needed
by the protocol is available locally at the site. The protocol may decide to
allow (or not allow) a particular update by using a local record kept in each
site.
3. Optimistic protocols are useful in situations in which the number of
replicated copies is large and the probability of partitioning is small. This may
be the case when applying replication over a Local Area Network (LAN), where
the probability of a connection break is very small. When applying
replication across interconnected networks that encompass
different technologies (like the satellite network presented in chapter
2), the probability of link failures increases.
When we design a replica control algorithm, the competing goals of availability and
correctness must be seriously considered. Correctness can be achieved simply by
suspending operations in all but one of the partition groups. On the other hand,
availability can be achieved simply by allowing all nodes to process updates. It is
obvious that it is impossible to satisfy both goals simultaneously; one or both must be
relaxed to some extent depending on how critical the application is.
Considering read and
write operations on managed objects, we can classify certain protocol operations into
read and write activities. For instance, the M_GET operation of the CMIP and the GetRequest
(or GetNext) operation of the SNMP may be considered read-class operations, since
they do not affect the state of the managed object. On the other hand, the M_SET operation
of the CMIP and the SetRequest operation of the SNMP may be considered write-class
operations, since their aim is to change the current state of the managed object.
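This classification can be sketched as a small lookup routine. The following C fragment is illustrative only; it hard-codes the operation names mentioned in the text, and treats M_CREATE and M_DELETE as write-class since they also change managed-object state:

```c
#include <string.h>

typedef enum { READ_CLASS, WRITE_CLASS, UNKNOWN_CLASS } OpClass;

OpClass classify(const char *op)
{
    /* read-class: operations that do not affect the managed object's state */
    if (strcmp(op, "M_GET") == 0 || strcmp(op, "GetRequest") == 0 ||
        strcmp(op, "GetNextRequest") == 0)
        return READ_CLASS;
    /* write-class: operations that change the managed object's state */
    if (strcmp(op, "M_SET") == 0 || strcmp(op, "M_CREATE") == 0 ||
        strcmp(op, "M_DELETE") == 0 || strcmp(op, "SetRequest") == 0)
        return WRITE_CLASS;
    return UNKNOWN_CLASS;
}
```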
The replicated data items are physically stored at different sites. Each item is
conceptually an object that encapsulates some internal data and provides a well-defined
interface for accessing and updating the state of the object. The size of the
objects is not important. An object may be as simple as a single variable holding a
single value (known as a fine-grain object) or as complex as a
subordinate database (known as a large-grain object) (CHIN 1991).
Objects are physically stored at different sites. The state of an object is determined by
the current values of the variables used to describe its attributes, that is, by giving a
value to each of its variables.
The state of the entire distributed database is composed of the individual states of all
logical objects. The term logical object is used to distinguish the logical view of the
object from its physical representation. Figure 4-2 shows three different sites holding a
copy of a sensor object. This object has a single attribute called temperature and two
operations to read and update the temperature. The object has been instantiated with a
temperature value equal to 25. In a replication scheme, each site must keep a copy
(physical object) of the logical view of the sensor object. To ensure consistency, all the
physical copies must adhere to the same logical view. Accessing any of the physical
copies yields exactly the same data. The logical object provides a user-oriented
view of the entity; it shows how the user expects to see the entity. In a
replicated database, a logical object is assumed valid if all its physical representations
are consistent and consequently have exactly the same state (same temperature).
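The sensor object of Figure 4-2 and its mutual-consistency requirement can be sketched as follows (a fine-grain object; all names are illustrative):

```c
/* A fine-grain sensor object: one attribute and two operations */
typedef struct {
    int temperature;
} Sensor;

int read_temperature(const Sensor *s)
{
    return s->temperature;
}

void update_temperature(Sensor *s, int value)
{
    s->temperature = value;
}

/* Mutual consistency check: all physical copies must have the same state */
int mutually_consistent(const Sensor *copies, int n)
{
    for (int i = 1; i < n; i++)
        if (copies[i].temperature != copies[0].temperature)
            return 0;
    return 1;
}
```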
Each read and write operation issued on a logical object must be mapped to
corresponding operations on physical copies. A transaction is a process that issues read
and write operations on the objects. Each of these operations may trigger a sequence of
other operations in order to provide a particular access or update. For instance, the read
operation may trigger an interrupt and read the temperature from a hardware device. The
duration of the transaction is the time interval between the time a read or write operation
is issued and the time the operation terminates. Transactions interact with one another
indirectly by reading and writing the same logical object. As already noted, operations
on the same logical object are said to conflict if at least one of them is a write
(BERNSTEIN 1987). Therefore a conflict can occur in the following sequences of
operations:
read-write
write-read
write-write
Transactions guarantee correctness only if they are executed as isolated actions. This
property is referred to as atomic execution and it has the following effects:
1. The execution of each transaction is all or nothing: either all of the
operations are performed or none are performed (atomic commitment).
2. Executing multiple transactions concurrently produces the same result as if
they were executed serially, one after another (serialisability).
In a replicated database, logical operations issued by a transaction are mapped to
corresponding physical ones. The mapping must ensure that the concurrent execution of
transactions is equivalent to a serial execution on non-replicated data, a property known
as one-copy serialisability. The mechanism that performs this mapping is called the
replica control protocol.
When the system is partitioned, each partition must determine which transactions it can
execute without violating the correctness criteria (atomic commitment and
serialisability). This can be accomplished by considering the following statements:
1. Each partition must maintain correctness within its region.
2. Each partition must make sure that its actions do not conflict with the actions
of other partitions.
Most of the proposed replica control protocols fulfil the conditions above in order to
ensure consistency. The following sections examine thoroughly some replica control
protocols and explain how they ensure consistency under network partitioning whilst
providing at the same time a tolerable object availability.
1982). Another very similar approach is that in (MINOURA 1982). It supports the
primary copy notion, except that the primary copy can change for reasons other than site
failures; accessing a copy, however, requires the use of a token. In principle, this approach
uses the notion of the primary copy to keep consistency among distributed copies of a
logical object.
The following shows how a primary site algorithm operates under partitioning. Let
us consider a replication scheme that has n copies of a logical object X (Figure 4-3). The
copies, named X1, X2, ..., Xn, depict physical replicated entities of the object X located at
different sites connected via communication links. The X1 copy is hosted at the primary
site P and all the others at secondary sites. Whenever a site wants to read the object X, it
accesses the physical entity Xi nearest to the site. To avoid any inconsistency, this Xi copy
should be in the same partition as the primary site. Therefore, each read(X) is translated
to a read(Xi). Whenever a site wants to update the state of the object X, it broadcasts the
update to all accessible sites; that is, each write(X) operation is translated to write(X1),
write(X2), write(X3), ..., write(Xn). This approach is often called the read one, write all
mechanism (BERNSTEIN 1987). When a partitioning occurs, sites that are members
of a partition that does not contain the primary site cannot access the X object, that is,
they cannot perform a read or write operation on it. However, sites that belong to the
same partition as the primary site can fully perform any operation. Write operations
performed by these sites update only those Xi entities that are in the primary partition
(the primary partition is the partition that contains the primary site). When a reunion occurs, two or
more partitions are united into one single partition. If this unified partition contains the
primary site, all those sites that have missed previous updates become current by getting the
latest version of the primary copy. In the case of a failure of the primary site, a new
primary site may be elected (GARCIA 1982). When partitioning occurs, copies that
are not found in the primary partition are registered as unavailable or not current.
These copies cannot be accessed either for read or write. The major functions that
describe the behaviour of a primary site algorithm are as follows:
PrimaryPartitionMember(): It returns TRUE if the site is a member of the primary
partition, otherwise it returns FALSE. The primary partition is that partition which
contains the primary site. The following data structures are used to describe certain
concepts:
Object: A logical object
MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-6). The aim of this
function is to update those copies that missed some updates due to partitioning. This
function is executed by a site that becomes aware of a reunion occurrence.
Boolean MakeCurrent()
{
    if (PrimaryPartitionMember())
    {
        /* Get the value of the object either from the primary copy or any
           other copy that resides in the primary partition */
        ObjectValue v = GetLatestCopyValue(X);
        /* Update all the copies in the partition that have missed previous
           updates. If this succeeds, all these copies become current and can
           be accessed normally from that time on. */
        if (Update(X, v))
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}
Figure 4-6. Make Current in a Primary Site Protocol
current copy of the object. In a partitioned system, this constraint guarantees that an
object cannot be read in one partition and written in another. Hence read-write conflicts
cannot occur between partitions.
The second constraint ensures that two writes cannot happen in parallel or, if the
system is partitioned, that writes cannot occur in two different partitions on the same
logical object. Hence write-write conflicts cannot occur between partitions.
Each site that holds replicated objects maintains its own connection vector. A
connection vector is continuously maintained and indicates the connectivity of the site; it
provides a mechanism by which the site knows which sites it can talk
to. Communication failures and repairs are recorded in the appropriate connection
vectors, so that all connection vectors in a single partition are identical.
Each physical copy i is associated with a version number (VNi). The version
number of a copy is an integer that counts the number of successful updates to the
copy. This number is initially set to zero and is incremented by one each time an
update to the copy occurs. The current version number of a replicated object is the
maximum taken over the version numbers of all copies of the object. A copy is said to
be current if its version number is equal to the current version number.
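These definitions translate directly into code; a minimal sketch:

```c
/* Current version number of a replicated object: the maximum taken over
   the version numbers of all n copies */
int current_version(const int vn[], int n)
{
    int max = vn[0];
    for (int i = 1; i < n; i++)
        if (vn[i] > max)
            max = vn[i];
    return max;
}

/* A copy is current iff its version number equals the current version number */
int is_current(int vn_i, const int vn[], int n)
{
    return vn_i == current_version(vn, n);
}
```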
Throughout this section, we assume that there is a logical object that is stored
redundantly at n sites in a distributed system. Initially these sites are all connected and
all physical copies are mutually consistent. Since the following protocols do not depend
on the number of logical objects that are replicated, it is assumed for ease of
exposition that there is just one logical object replicated at n sites.
4.7.1 Majority Consensus Algorithm
The first voting approach was the majority consensus algorithm (THOMAS 1979).
What will be described here is the generalisation of that algorithm proposed by Gifford
(GIFFORD 1979). We simplify the discussion of this protocol by assuming only one type
of replicated physical object. Weak copies are not considered, and we assume that all
the replicated copies are assigned the same number of votes. The following functions describe
the behaviour of Gifford's approach.
DoRead(X): It reads the current state of the logical object X (Figure 4-7). This is
translated to a physical read of the nearest copy of X within the partition. The following
function shows the implementation of DoRead(X). It returns TRUE if it
succeeds, otherwise FALSE. The function FindNearestCopy(X) returns the
address of the nearest copy Xi of the logical object X. The function CollectReadVotes(X)
gathers all the read votes assigned to the logical object within a certain partition; r is the
read quorum, that is, the threshold for performing a read operation in the partition.
Boolean DoRead(Object X)
{
    if (CollectReadVotes(X) >= r)
    {
        ObjectCopy x_copy = FindNearestCopy(X);
        read(x_copy);  /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}
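In Gifford's scheme the quorums must overlap; with one vote per copy the usual constraints are r + w > n (so a read quorum always intersects a write quorum) and 2w > n (so two write quorums always intersect). A minimal sanity check, assuming one vote per copy:

```c
/* Quorum sanity check for n copies with one vote each:
   r + w > n  prevents read-write conflicts across partitions
   2w > n     prevents write-write conflicts across partitions */
int quorums_consistent(int r, int w, int n)
{
    return (r + w > n) && (2 * w > n);
}
```

For example, with n = 3 copies, r = w = 2 is consistent, while r = w = 1 is not.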
MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-9). The aim of this
function is to update those copies that missed some updates due to partitioning. This
function is executed by a site that becomes aware of a reunion occurrence.
GetLatestCopyValue(X) returns the instance of X with the greatest version number.
Boolean MakeCurrent()
{
    if (CollectWriteVotes(X) >= w)
    {
        /* Get the value of the object */
        ObjectValue v = GetLatestCopyValue(X);
        /* Update all the copies in the partition that have missed previous
           updates. If this succeeds, all these copies become current and can
           be accessed normally from that time on. */
        if (Update(X, v))
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}
Figure 4-9. Make Current in a Majority Consensus Algorithm
some additional attributes which are associated with each physical copy. These are the
following:
Update Site Cardinality (SC), which reflects the number of sites participating in the most
recent update to the object. Each site initially sets the Site Cardinality equal to the
number of replicated copies n. Whenever an update is made to the object, the Site
Cardinality is set to the number of physical copies that were updated during this
update.
Among all the sites of the network that hold a copy, there is a privileged site, called the
Distinguished site, that identifies one of the sites that participated in the last update. If the
sites are ordered (1, 2, 3, etc.), this could be the site with the greatest number that has
participated in the last update.
A partition P is said to be a majority partition if either of the following two conditions
holds:
1. The partition P contains more than half of the current copies of the object.
2. The partition P contains exactly half of the current copies and moreover
contains the Distinguished site.
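The two conditions can be written as a single predicate (a sketch; parameter names are illustrative):

```c
/* P is a majority partition if it holds more than half of the current
   copies, or exactly half of them together with the Distinguished site.
   current_in_p:      number of current copies inside partition P
   current_total:     total number of current copies of the object
   has_distinguished: non-zero if P contains the Distinguished site */
int is_majority_partition(int current_in_p, int current_total,
                          int has_distinguished)
{
    if (2 * current_in_p > current_total)
        return 1;
    if (2 * current_in_p == current_total && has_distinguished)
        return 1;
    return 0;
}
```

Multiplying by two instead of dividing avoids integer-division ambiguity when the total is odd.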
A copy is said to be current if its version number is equal to the current version number
(the maximum taken over the version numbers).
Update Site Cardinality is an attribute similar to the connection vector, but it specifies
which nodes participated in the last update. What follows are the procedures that
implement Jajodia's algorithm.
IsMajority(): This determines whether a site is a member of the majority partition or
not. Figure 4-10 shows the pseudo-code for the IsMajority() function.
#define AND &&

Boolean IsMajority()
{
    /* NOfOnes(SV) returns how many sites are working and can communicate
       with each other, determined by the number of flags that are up in
       the site vector SV */
    int n = NOfOnes(SV);
    if (n > SC/2)
        return TRUE;
    else if ((n == SC/2) AND Is_The_Distinguished_Site_In_The_Partition())
        return TRUE;
    else
        return FALSE;
}
Figure 4-10. IsMajority in the Dynamic Voting Protocol
DoRead(X): It reads the current state of the logical object X. This is translated to a
physical read of the nearest copy of X. The function in Figure 4-11 shows the
implementation of DoRead(X). It returns TRUE if it succeeds, otherwise it returns
FALSE. The function FindNearestCopy(X) returns the address of the nearest
copy Xi of the logical object X. Here a read is permitted only if the site belongs to the
majority partition, as determined by IsMajority().
Boolean DoRead(Object X)
{
    if (IsMajority())
    {
        ObjectCopy x_copy = FindNearestCopy(X);
        read(x_copy);  /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}
TRUE if it succeeds. DoWrite is executed by a site only if the site is a member of the
majority partition.
MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-14). The aim of this
function is to update those copies that missed some updates due to partitioning. This
function is executed by a site that becomes aware of a reunion occurrence. The function
Update() in Figure 4-13 updates a physical copy to the most recent value.
The MakeCurrent() function describes all the steps to update a reunited partition. The
first site S that becomes aware of the reunion sends a request to each site in the partition
P and asks them to determine locally whether they belong to a majority partition or not.
If at least one site sends a positive response, the site S obtains the Version Number (VN)
and Site Cardinality (SC) from that site and executes the Update() function; otherwise it
does the following:
It finds the maximum VN of all the copies in the partition (MAX_VN) and the set I of
sites that hold the most recent copy (the one with the maximum version number). Let C be
the Site Cardinality of any of the sites that are members of I, and N the cardinality of I. An
update of the reunited partition can take place only if
1) either N>C/2 or
2) N=C/2 and the Distinguished site is in the current partition
The first requirement makes sure that the sites with the greatest VN in the partition form
a majority. Recall that the Site Cardinality (SC) indicates the number of sites
participating in the last update. Thus the Site Cardinality (C) of any of the sites in the set
I determines the number of sites participating in the most recent update, and the
cardinality (N) of I indicates how many of those participating sites are
present. Therefore, if N is greater than half of C, more than half of the sites participating
in the most recent update are present and an update is allowed, making the partition
current. The second requirement makes sure that in case N=C/2 and the Distinguished
site is in the partition, the partition is allowed to be updated. The requirement for the
Distinguished site is used to break ties when a partition decomposes
into two sub-partitions with an equal number of sites; the
partition that contains the Distinguished site is eligible to apply any update.
BOOL UpdateObject(Object X)
{
    /* Get the value of the object */
    ObjectValue v = GetLatestCopyValue(X);
    /* Update all the copies in the partition that have missed previous
       updates. If this succeeds, all these copies become current and can
       be accessed normally from that time on. */
    if (Update(X, v))
        return TRUE;
    else
        return FALSE;
}
6. The algorithm is applicable to a set of copies (replicas) of a single data item spread
across the network at different sites. The data item is stored redundantly at n sites
(n>2).
7. Each replicated data item is associated with a set of variables used by the algorithm
to ensure consistency and availability. These variables are discussed in the next
section.
4.7.4.2 DMCA Maintenance Variables
Site Vector (SV). A sequence of bits, similar to CV, that indicates which sites
participated in the most recent update (write operation) of the data item. When
partitioning occurs and the data item in the main partition is current, the SV is
assigned the value of the CV. A data item is assumed current if either a write
operation has occurred or the MakeCurrent procedure has been performed after
the occurrence of a reunion. The MakeCurrent routine is explained in the
following section.
Site Cardinality (SC). An integer that denotes the number of sites
participating in the most recent update of the data item.
Read Quorum (r). Determines the minimum number of sites that must be up to
allow a Read operation.
Write Quorum (w). Determines the minimum number of sites that must be up to
allow a Write operation.
Current (CUR). A Boolean variable that indicates whether the data item is
current or not. It is TRUE if the data item is current, otherwise it is FALSE.
Version Number (VN). An integer that indicates how many times the
data item has been updated. Each time an update is successfully performed, the VN
increases by one. It is initially zero.
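The maintenance variables above can be grouped in a per-item structure. The following C sketch is illustrative (bit vectors are held in a 32-bit word, so at most 32 sites are assumed):

```c
#include <stdint.h>

typedef struct {
    uint32_t cv;   /* Connection Vector: sites this site can talk to      */
    uint32_t sv;   /* Site Vector: sites in the most recent update        */
    int      sc;   /* Site Cardinality: #sites in the most recent update  */
    int      r;    /* Read Quorum                                         */
    int      w;    /* Write Quorum                                        */
    int      cur;  /* TRUE (non-zero) if the local copy is current        */
    int      vn;   /* Version Number: #successful updates, initially zero */
} DmcaState;

/* bitcnt: number of bits set in a vector (used by the quorum tests) */
int bitcnt(uint32_t v)
{
    int n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}
```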
BOOL ReadPermitted(Object X)
{
    if (bitcnt(SV) >= r)
        return TRUE;
    else
        return FALSE;
}
BOOL WritePermitted()
{
    if (bitcnt(SV) >= w)
        return TRUE;
    else
        return FALSE;
}
DoRead is used when the site intends to read a replicated object. The only
condition that must be satisfied in order to perform the DoRead is that bitcnt(SV) be
greater than or equal to r.
BOOL DoRead(Object X)
{
    /* It returns TRUE if a read operation may be accomplished */
    if (ReadPermitted(X))
    {
        ObjectCopy x_copy = FindNearestCopy(X);
        read(x_copy);
        return TRUE;
    }
    else
        return FALSE;
}
DoWrite is used when the site intends to change the state of the replicated object.
This routine first checks the write quorum w to see if a write operation is permitted. If
so, it proceeds; otherwise it rejects the write operation. If the Write Quorum is satisfied,
the site broadcasts the INTENTION_TO_WRITE message to all other sites in the
partition. Each site, upon receiving this message, sends an acknowledgement. If the
originator receives acknowledgements from all the sites in the
partition, it performs the write operation and broadcasts the COMMIT message to all the sites in the
partition; otherwise it broadcasts the ABORT message. If the connection vector changes
during the operation of the algorithm, we follow a similar approach as in (JAJODIA
1989). That is, if the Connection Vector changes after the issue of the
INTENTION_TO_WRITE message but before the sending of the COMMIT message,
the originator sends the ABORT message instead of COMMIT. Any site that has
r = round(SC * RW / (RW + WW))
w = round(SC * WW / (RW + WW))
RW is the Read Weight and WW is the Write Weight. Because r and w are integers,
we use the function round to round the result to the nearest integer. If the sum (r+w) is
found equal to SC, we increase w by one to ensure consistency (r+w>SC). The Read
and Write Weights are associated with the probability of having a read and a write
respectively, as follows:
RW = 1 / ReadProb
WW = 1 / WriteProb
BOOL DoWrite(Object X, ObjectValue v)
{
    if (WritePermitted())
    {
        Broadcast(INTENTION_TO_WRITE);
        WaitForAck();
        if (AllAckReceived())
        {
            Update(X, v);
            Broadcast(COMMIT);
            return TRUE;
        }
        else
        {
            Broadcast(ABORT);
            return FALSE;
        }
    }
    else
        return FALSE;
}
If the WriteProb is much less than the ReadProb, r is approximately zero and w
is approximately equal to SC. This means that if read operations occur very frequently,
they are very likely to be executed, since they require a small quorum. On the other hand,
write operations are very unlikely to be executed in case of partitioning, since they
require a quorum approximately equal to SC. In most practical applications involving
distributed management databases, the ReadProb is approximately four or five times
greater than the WriteProb. This, of course, may vary depending upon what policy we
follow in replicating managed objects. For instance, if we choose to replicate all the
objects that are updated very frequently, we should expect the WriteProb to be greater
than the ReadProb; but such a choice does not increase the performance of the system,
since each write is translated into a set of physical write operations that require extra
network bandwidth.
MakeCurrent is performed after the occurrence of a reunion. This routine aims
to update those copies of the object that came from a subpartition in which write
operations were not allowed due to a large Write Quorum. The MakeCurrent is said
to be successful if it sets the variable CUR=TRUE. If the variable CUR is TRUE, the
object is considered current. The site that performs this routine broadcasts a request for
quorum and waits for responses. If it receives all the expected responses from all the
sites in the partition, it proceeds; otherwise it sends an ABORT message and the
MakeCurrent is considered to have failed. Each site in the partition, upon receiving
the request for quorum, sends back to the originator the VN of its copy, the w of its
copy and the state of the object. The originator receives all the responses and finds the
maximum VN, the w corresponding to that VN (MWQ), the number of
nodes that hold the maximum VN (MC) and the state of the object corresponding to
the copy with the maximum VN. If MC > MWQ, the state of the local copy is assigned the state
of the copy with the maximum VN and the following instructions are executed to update the
maintenance variables:
VN = Maximum VN
CUR = TRUE
SC = bitcnt(CV)
SV = CV
r = round(SC * RW / (RW + WW))
w = round(SC * WW / (RW + WW))
If MC is less than MWQ the site should wait for some period of time and try again.
#define AND &&

Boolean MakeCurrent(Object X)
{
    Broadcast(REQUEST_FOR_QUORUM);
    WaitForResponse();
    if (AllResponseReceived())
    {
        MVN = max{VNi : Si in P};  /* finds the maximum Version Number in the partition */
        /* finds the sites holding the maximum Version Number and gets the
           write quorum (MWQ) of these sites */
        MWQ = WriteQuorum(Si : MVN == VersionNumber(Si));
        MC = the number of sites with version number equal to MVN;
        if (MC > MWQ)
        {
            /* v corresponds to a copy with the largest Version Number */
            ObjectValue v = GetLatestCopyValue(X);
            Update(X, v);  /* updates all the copies in the partition */
            Broadcast(COMMIT);
            return TRUE;
        }
        else
        {
            WaitAndTryLater();
            return FALSE;
        }
    }
    else
        return FALSE;
}
Figure 4-19 Make Current function in DMCA
aligned with its initiation time. A message is shown as a horizontal solid arrow from the
lifeline of one object to the lifeline of another object.
The DMCA algorithm involves two main objects: the user object, which issues read
and write requests, and the replication manager object, which accepts and handles these
requests and provides response messages to the user according to the DMCA replica
control protocol. The user object can be a part of a network management application. An
instance of a replication manager object resides at many sites that accommodate
managed objects. A replication manager object may be seen as the object that a user can
communicate with, in order to collect or set management information associated with
network resources.
In the following diagrams, for simplicity, one object lifeline is drawn for all the
replication manager objects that are involved in read and write operations and one
lifeline is drawn for all user objects. Three diagrams are presented: one describing
the DoRead function, one the DoWrite function and one the MakeCurrent function.
[Sequence diagram: the DoRead function — the user object issues a read, the
replication manager finds the nearest copy (FindNearestCopy) and, if one is
available, performs the read.]
[Sequence diagram: the DoWrite function — the user object issues a write, the
replication manager counts all responses and, if all sites responded and the write
is permitted, performs the write and broadcasts a commit, returning a write
response to the user.]
[Sequence diagram: the MakeCurrent function — the replication manager broadcasts
a request for quorum and collects the QuorumResponse messages; when all responses
are received it computes MVN (the maximum VN), MWQ (the write quorum of the sites
holding MVN) and MC (the number of sites with VN = MVN); if MC > MWQ it calculates
the latest copy value, updates the copies and broadcasts a commit.]
The following table shows the mapping between the DMCA protocol operations and the
CMIP and SNMP operations:

DMCA    CMIP                         SNMP
Read    M_GET                        GetRequest, GetNextRequest
Write   M_SET, M_CREATE, M_DELETE    SetRequest
4.8 Summary
This chapter presents a thorough approach to replica control algorithms, especially
those replication algorithms that use voting techniques. Studying first the criteria for
achieving correctness, it sets the background for understanding the internal mechanisms
used to ensure consistency among multiple replicas in a distributed database.
Pessimistic and optimistic processing strategies have been discussed as two
alternatives for establishing a replication scheme. However, pessimistic strategies have
some advantages over the optimistic ones in distributed database systems that are used
as a repository for demanding applications. Pessimistic algorithms provide faster
response, higher availability, and prevent any temporary inconsistency.
Certain replication protocols have been discussed. The primary site protocol is a
static protocol that introduces the notion of the primary partition. Only the operations
that are submitted from sites of the primary partition are allowed to execute. From the
rank of voting algorithms, the following have been examined: the classic Gifford's
approach; Jajodia's dynamic voting technique, which enhances Gifford's approach by
introducing a mechanism to dynamically change the read and write quorum; and a novel
approach called DMCA, which is an improvement on Jajodia's technique since it is able
to dynamically change the read and write quorum by taking into account the read and
write ratio. Jajodia's algorithm implies that reads and writes execute with the same
probability (since they have the same occurrence rates). It cannot distinguish a
possible difference between read rate and write rate. Adjusting the read and write
quorum according to the read and write occurrence rate may increase the availability of
the replicated object and make the system more fault tolerant.
Chapter 6 provides a quantitative comparison between the approaches presented in
this chapter and draws some conclusions about the availability provided by each replica
control protocol. The next chapter discusses the model used to simulate certain replica
control protocols.
This chapter presents the object oriented development of the Availability Testing
System (ATS). It discusses first the advantages of using the object oriented paradigm for
developing such a complex system. It then presents the simulation modelling process
and presents briefly the Object Modelling Technique (OMT), which has been used to
construct the ATS system. Following the development process imposed by the OMT, it
discusses the requirements of the ATS system, then its analysis and design through a
static and a dynamic object model.
5.1 Introduction to simulation modelling
Simulation should be understood as the process of designing a model of a real
system and conducting experiments with this model for the purpose of understanding
the behaviour of the system or of evaluating various strategies for the operation of the
system (SHANNON 1975).
Simulation is classified based on the type of system studied, and it can be
either continuous or discrete. In the case of studying replication algorithms, discrete
simulation seems adequate to describe the behaviour of each algorithm. There are two
approaches for discrete simulation: event driven and process driven. Under event driven
discrete simulation, the modeller has to think in terms of the events that may change the
status of the system (LAW 1991). In a replication system, for example, the status may
change by the occurrence of events that cause partitions and reunions. The status of the
system is defined by a set of variables being observed. On the other hand under the
process driven approach, the modeller thinks in terms of processes that the dynamic
entity will experience as it moves through the system.
The simulation system that has been used to test the availability of the replication
algorithms is a system consisting of certain dynamic entities. Dynamic entities are the
objects that interchange information providing in that way certain services by using the
system resources. Entities may experience events which result in an instantaneous
change of the system state. Some events are endogenous and occur within the system
(replica updates) and some events are exogenous and occur outside the system (read,
write, partition and reunion operations). The aim of the simulation is to model the
random behaviour of the system, over time, by utilising an internal simulation clock and
sampling from a stream of random numbers.
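The event-driven view with an internal simulation clock can be sketched as a minimal loop; this is illustrative only — the class and method names are hypothetical, not those of ATS:

```cpp
#include <queue>
#include <vector>
#include <functional>

// Minimal event-driven simulation skeleton (hypothetical names). Events
// are ordered by timestamp; the simulation clock jumps from one event
// to the next rather than advancing continuously.
struct Event {
    double time;
    std::function<void()> action;
    bool operator>(const Event& o) const { return time > o.time; }
};

class Simulator {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> agenda;
    double clock = 0.0;
public:
    void Schedule(double t, std::function<void()> a) { agenda.push({t, a}); }
    double Now() const { return clock; }
    void Run(double endTime) {
        while (!agenda.empty() && agenda.top().time <= endTime) {
            Event e = agenda.top(); agenda.pop();
            clock = e.time;   // advance the simulation clock to the event
            e.action();       // handle the event (may schedule new ones)
        }
        clock = endTime;
    }
};
```

Handlers for partitions, reunions, reads and writes would be scheduled through Schedule() and executed in timestamp order, which is precisely what makes the simulation discrete.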
5.2 Using an Object-Oriented Technique for Modelling a Simulation
System
Simulation is a useful and essential technique for verifying the operability of
systems with a large number of entities. The object oriented paradigm has become
popular in software engineering communities due to its modularity, reusability and its
support for iterative design techniques. The idea of an object-oriented simulation has
great intuitive appeal in the application development process because it is very easy to
view the real world as being composed of objects. An object oriented technique
introduces (1) information hiding, (2) abstraction and (3) polymorphism. Both
information hiding and data abstraction allow the simulation modeller to focus on those
mechanisms that are important, discarding any irrelevant implementation details. This
gives the modeller the freedom to change implementation details of a system component
at a later stage of the development without the need to redesign or affect other
components. The
flexible behaviour of objects is realised through polymorphism and dynamic binding of
methods. The binding to an actual function takes place at run-time and not at
compile-time. In this way, inheritance provides a flexible mechanism by which you can
reuse code, since a derived class may specialise or override parts of the inherited
specification.
Object-oriented techniques offer encapsulation and inheritance as the major
abstraction mechanisms to be used in system development. Encapsulation promotes
modularity, meaning that objects must be regarded as the building blocks of a complex
system. Once a proper modularisation has been achieved, the object implementor may
postpone any final decisions concerning the implementation.
Another advantage of an object-oriented approach, often considered the main
advantage, is the reuse of code. Inheritance is an invaluable mechanism in this respect,
since the code that is reused offers all that is needed. The inheritance mechanism
enables the developer to modify (or refine) the behaviour of a class of objects without
requiring access to the source code.
5.3 Object Oriented Discrete Event Simulation
In the object oriented paradigm, a program is described as a collection of
communicating objects that represent separate activities in the real world and which are
able to exchange messages with each other. An object is an abstract data type that
defines a set of operations which operate on the internal data that represent the object.
Each object is an instance of a class. A class can be thought of as a template which
produces objects. The object oriented paradigm has been successfully applied to a
variety of fields of computer science and engineering. In distributed algorithms, the
global system is decomposed into a set of communicating logical processes. These
logical processes work concurrently to accomplish the objective of the distributed task.
This concurrency is realised in a simulation system by sequential simulation of the
execution time. The
function of the inter-arrival time of all of the events handled by the system is given by
an exponential distribution (Poisson arrivals imply exponential inter-arrival times). The
occurrence rate of the events and the simulation time determine the number of events
that occur during the observation time interval, which is equal to the simulation time
interval. The ATS simulation system is a terminating system (SADOWSKI 1993).
Terminating systems are systems that have a clear point in time when they start
operations and a clear point in time when they end operations. ATS specifies a random
event sample size and a simulation time length.
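Exponential inter-arrival times can be drawn from a uniform random stream by inverse-transform sampling; the sketch below is illustrative (the generator seed and the event rate are arbitrary, not taken from ATS):

```cpp
#include <cmath>
#include <random>
#include <limits>

// Inverse-transform sampling: if U is uniform on (0,1), then -ln(U)/rate
// is exponentially distributed with the given event rate, matching the
// Poisson arrival assumption in the text.
double NextInterArrival(double rate, std::mt19937& gen) {
    std::uniform_real_distribution<double> u(
        std::numeric_limits<double>::min(), 1.0);  // avoid log(0)
    return -std::log(u(gen)) / rate;
}
```

Averaged over many draws, the samples approach the mean inter-arrival time 1/rate, which is how the event generator would space partitions, reunions, reads and writes over the simulation interval.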
5.4.2 Model Implementation
The ATS simulation model is implemented using the C++ programming language
(STROUSTRUP 1991). C++ is a good tool that supports program organisation through
classes and class hierarchies. Classes help the developer to decompose a complex
solution into simpler ones. Each class has its own internal data that may be updated
through a set of defined operations. The encapsulation of code (operations) and data
(variables) into a single entity helps the developer to focus on the design and
implementation of smaller pieces of software structure and then unify all the separate
components to form the complete solution.
during the design of the system. The OMT methodology has basically three stages:
Analysis, Design and Implementation.
5.5.1 Analysis
Analysis is the first step of the OMT methodology; it starts from the statement
of the problem and builds a model focusing on the properties of particular objects that
are used to abstractly represent real world concepts. The analysis model is a precise
abstraction of what the desired system must do, not how it will be done. The analysis
clarifies the requirements and sets the basis for later design and implementation. The
output of the analysis phase consists of two models, named the object and dynamic models.
The object model describes the static structure of the objects in a system and their
relationships. The object model contains object diagrams. An object diagram is a graph
whose nodes are object classes and whose arcs are relationships among classes. An
object model captures the structural aspect of the system by showing the objects
participating in the system as well as the relationships among them.
The dynamic model describes the behavioural aspect of the system over time. The
dynamic model is used to specify and implement the control aspects of the system. The
dynamic model contains state diagrams. A state diagram is a graph whose nodes are
states and whose arcs are transitions between states. Transitions are caused by events.
An event is something that happens at a point in time and it represents external stimuli.
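A state diagram of this kind maps directly onto a transition function; the following is a minimal sketch with hypothetical states and events, not taken from the ATS models:

```cpp
// Hypothetical two-state machine: transitions between states are caused
// by events, mirroring the state-diagram notation described in the text.
enum class State { Idle, Busy };
enum class EventKind { Start, Finish };

State Transition(State s, EventKind e) {
    switch (s) {
        case State::Idle: return (e == EventKind::Start)  ? State::Busy : s;
        case State::Busy: return (e == EventKind::Finish) ? State::Idle : s;
    }
    return s;  // unrecognised event: remain in the same state
}
```

Each node of a state diagram becomes an enumerator and each arc a case in the transition function; events that do not label an outgoing arc leave the state unchanged.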
5.5.2 Design
Design emphasises a proper and effective structuring of the complex system
allowing an object oriented decomposition. During the design phase, high level
decisions are made about the overall architecture of the system. The analysis phase
determines what the implementation must do, and the design phase determines the full
definitions of the objects and associations used in the implementation, as well as the
methods used to implement all the operations. During the design phase the development
of the system moves from the application domain concepts toward computer concepts.
The classes, attributes and associations from analysis must be implemented as specific
data structures.
5.5.3 Implementation
During implementation, all the design objects and associations are explicitly defined
using a programming language (preferably an object-oriented one). The implementation
language should provide facilities that help the developer to realise the concepts as
defined during the design phase. One can fake an object oriented implementation using
a non-object-oriented language, but it is horribly ungainly to do so. To have a smooth
transition from the design phase to the implementation phase, the language should
support the following features (CARDELLI 1985):
1. Objects that are data abstractions with an interface of named operations and a
hidden local state.
2. Objects that have an associated type (class).
3. Types (classes) that may inherit attributes from super-types (super-classes).
According to the Cardelli and Wegner definition (CARDELLI 1985), a language is said to
be object oriented if it supports inheritance. Under this definition, Smalltalk
(GOLDBERG 1983), C++ (STROUSTRUP 1991), Eiffel (MEYER 1992) and CLOS
(KEENE 1989) are all object oriented languages and can be used to implement an
object-oriented concept.
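The inheritance-based refinement that this definition requires can be illustrated in C++; the class names below are illustrative only:

```cpp
#include <string>

// A derived class specialises part of the inherited behaviour without
// access to the base-class source code, as described in the text.
class Shape {
public:
    virtual std::string Name() const { return "shape"; }  // default behaviour
    virtual ~Shape() = default;
};

class Circle : public Shape {
public:
    std::string Name() const override { return "circle"; }  // refinement
};
```

A call through a base-class pointer is bound at run-time to the overriding method, which is the dynamic binding the text refers to.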
5.6 ATS Requirements
The OMT methodology will be used to develop a simulation for testing different
replica control protocols. The final tool is called Availability Testing System (ATS) and
algorithm under testing. Each Node object is an autonomous entity simulating the
behaviour of a separate site. The number of read and write operations performed are
registered and finally a statistical object is called to measure the availability of each
replica control protocol. All measurements are stored in a file for further processing.
The main objective of the simulation is to get a practical estimate of the availability
provided by certain replica protocols in order to draw useful conclusions about their
effectiveness. During the evaluation of the replica control protocols all the relevant
rates with which the events are generated are taken into account. The rate with which an
event is generated may affect the effectiveness of a certain protocol.
Figure 5-1
instance of an algorithm. Each instance runs independently. ATS supervises all the
actions that should be taken and it handles all the events generated by the ATS event
generator. Events may affect the state of the algorithm or the state of the whole system.
protocol is a specialisation of the Algorithm super class. Any message used by the
replica control protocol is a specialisation of the Message super class. Algorithm super
class is an abstract class that provides the basic functionality that may be found useful to
a replica control protocol, among others, it provides a service for sending and receiving
messages. Sending a message is implemented by forwarding the message to the local
Node object. The Node
the number of read and write operations issued and the read and write operations
executed, respectively. These variables provide an estimate of the availability.
[Object model diagram: the static structure of ATS. A Network consists of one or
more Partitions, and each Partition consists of one or more Nodes. Each Node holds
protocol instances (Alg1, Alg2, ...) that specialise the abstract Algorithm class and
exchange specialised Message objects (Msg1, Msg2, ...). NetworkEventGenerator
(with ReunionRate and PartitionRate) and NodeEventGenerator (with ReadRate and
WriteRate) specialise UniformRandGenerator and deliver events through
GetNextEvent(). A Vector class implements the connectivity bit vector (Set, Reset,
IsSet, SetAll, ResetAll, Count, Size). A Message carries source and destination node
and algorithm identifiers. The Algorithm class keeps the counters read_performed and
write_performed and declares DoRead(), DoWrite() and HandleMessage(Message) as
abstract operations, alongside concrete services such as Send(Message),
Receive(Message), CollectStat(Statistics), HandleEvent(SystemEvent),
NetSize(), PartSize() and ConnectionVector().]
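The abstract Algorithm superclass in the object model can be sketched as follows; the method names follow the diagram, but the bodies and the concrete protocol below are illustrative assumptions:

```cpp
// Sketch of the Algorithm superclass from the ATS object model: concrete
// replica control protocols override the abstract operations.
class Message {
public:
    int msg_id = 0;
    virtual ~Message() = default;
};

class Algorithm {
protected:
    int read_performed = 0;   // successful reads (per the diagram)
    int write_performed = 0;  // successful writes
public:
    virtual bool DoRead() = 0;                 // {abstract}
    virtual bool DoWrite() = 0;                // {abstract}
    virtual void HandleMessage(Message&) = 0;  // {abstract}
    int ReadsPerformed() const { return read_performed; }
    virtual ~Algorithm() = default;
};

// A hypothetical protocol specialisation (stands in for Alg1/Alg2 in the
// diagram); real protocols would gather quorums before succeeding.
class MajorityProtocol : public Algorithm {
public:
    bool DoRead() override { read_performed++; return true; }
    bool DoWrite() override { write_performed++; return true; }
    void HandleMessage(Message&) override {}
};
```

Porting a new protocol then amounts to deriving a new subclass of Algorithm and of Message, leaving the rest of the system untouched.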
The dynamic model describes the flow of control, interactions and sequencing of
operations in the system. The dynamic model for the ATS is shown in figure 2. It
basically consists of four state diagrams. Each state diagram describes an interactive
aspect of the associated object. Network, Partition, Node and Algorithm are four
fundamental state diagrams that depict the sequencing of operations in the associated
objects. Each of these diagrams includes sub-diagrams that refine the interactions and
provide more details about the sequencing of the operations. The states of a sub-diagram
[Dynamic model state diagrams (four in total). The Network diagram, driven by the
NetworkEventGenerator, alternates between Simulate (MakePartition, MakeReunion —
when Np > 1, two partitions are chosen randomly and sent Join) and CollectStat,
until EndOfSimulation and EndOfCollection. The Partition diagram handles Break,
Join, Simulate and CollectStat. The Node diagram, driven by the
NodeEventGenerator, sits idle, forwards Send(msg)/Receive(msg) to the destination
node and increases readissued/writeissued on Read/Write events, sending Read to
each algorithm. The Algorithm diagram handles system events (HandleEvent(vnt)),
sends and receives messages, performs Read and Write (increasing read_performed
and write_performed by 1 on success) and registers those counters on CollectStat.
Legend: Nn is the number of nodes, Np the number of partitions; dashed arrows
denote shared events.]
are totally determined by shared events and conditions. For instance, the Network state
diagram includes two sub-diagrams; one for simulation (Simulate) and one for
collection of the statistics (CollectStatistics).
NetworkEventGenerator and NodeEventGenerator objects deliver events to both
Network and Node. Each of those events triggers a sequence of operations. Dashed lines
represent transitions between objects. Shared events carry information transferred from
one object to another and trigger activities in those objects.
5.9 Evaluation of the System
In the ATS system, the protocol is defined as a separate module and is then
compiled with the rest of the system to form an executable module. More than one
protocol may be tested at the same time (in a single run). All the protocols respond to
the same set of events, allowing us to draw conclusions about the suitability of a certain
replica control protocol. The ATS model incorporates most of the characteristics of an
object oriented system (classification, polymorphism and inheritance) in order to define
highly reusable components. It allows the testing of the availability of multiple replica
control protocols without the need to modify our testing system. Porting a new protocol
is fairly easy, since all it requires is the definition of the protocol and the definition of
any related messages it uses. The ATS system remains unchanged, providing higher
reusability with less effort.
5.10 Summary
Discrete event simulation using the object oriented paradigm has been shown to be a
suitable approach for building complex simulation systems. The modularity and
reusability help to decompose the system into co-operative processes that are related to
independent simulation entities. The Object Modelling Technique (OMT) supports all of
the necessary facilities for expressing object oriented concepts. OMT has been used
extensively for analysing the requirements of the ATS system as well as for designing
the ATS system. The whole ATS system is summarised in two diagrams, named the
object model and the dynamic model. The object model describes the static structure of
the system, whereas the dynamic model describes the behavioural aspect of the system.
ATS allows the testing of the availability of multiple replica control protocols without
the need to modify the main procedures of the testing system. This provides extra
flexibility, since it makes it easy to port a new protocol without disturbing the core
procedures of the simulation.
This chapter presents how the measurements regarding the performance of certain
replica control algorithms have been obtained. It introduces the simulation model used to
build a benchmark test utility for estimating the effectiveness of algorithms. It discusses
a fault injection mechanism for generating faults and repairs and it specifies the
environment in which the simulation evolves. It also defines the functional components
of the simulation, as well as the parameters used to estimate the availability of the tested
algorithms. The chapter ends with a thorough discussion about the contribution of the
DMCA algorithm. It shows why the DMCA provides higher total availability and
presents the results of the benchmark test. The DMCA is compared with two other
representative voting algorithms (GIFFORD 1979, JAJODIA 1989).
6.1 Performance Evaluation
The evaluation of the performance of replica control protocols has become an area of
great practical interest. In most cases, the most important aspect of this performance is
the availability of replicated objects managed by the protocol. The availability of the
replicated data objects represents the steady-state probability that the object is available
at any given moment. Several techniques have been used to evaluate the availability of
replicated data. Combinatorial models are very simple to use (PU 1988) but cannot
represent complex recovery modes like those found in voting protocols (GIFFORD
1979), (PARIS 1986b), (JAJODIA 1989) and (KOTSAKIS 1996b). Stochastic
models have been extensively used to study replication protocols (JAJODIA 1990),
(PARIS 1991) but suffer from two important limitations:
1. Stochastic models quickly become intractable, unless all failures and repair
processes have exponential distributions.
2. Stochastic models do not describe communication failures well, since the
number of distinct states in a model increases exponentially with the number of
failure modes being considered.
Discrete event simulation does not suffer from these limitations. Simulation models
allow the relaxation of most assumptions that are required for stochastic models. They
can also represent systems with communication failures. For all its advantages,
simulation has one major disadvantage: it provides only numerical results. This makes it
more difficult to predict how the modelled system would behave when some of its
parameters are modified. Each time the parameters change, the simulated system must
be run again to obtain the results.
6.2 The Simulation Model
Most studies of replicated data availability have depended on probabilistic models
to evaluate the availability of replica control protocols (JAJODIA 1990). These models
do not generally consider the effect of network partitioning, because of the
enormous complexity that would be involved. As a result, the data that they present are
for ideal environments that are unlikely to exist under actual conditions. Discrete event
simulation has been used to observe the behaviour of three replica control protocols
under more realistic conditions. Many parameters can affect the availability of replicated
data. The simulation model considers the following types of failures:
1. Hardware failures, which result in a site being down for hours or even days
Figure 6-1 shows a typical network model that may be considered for hosting a
replication network. It consists of two carrier-sense segments (IEEE 802.3 based LANs)
and two token ring segments (IEEE 802.5 based LANs). The repeater and the gateways link
together all of
(SCHLICHTING 1983). The network will be partitioned into one or more partitions
if the repeater or a gateway fails. Sites attached to a local area network can communicate
even after the repeater or gateway failure but they are not able to communicate with a
site attached to another LAN. All replicated objects are assumed to be available as long
as they can be accessed from any site in the network.
Figure 6-2
system with the target replication algorithms, a fault injector, a repair injector, a work
load generator, a data collector and a data analyser. The replication system comprises all
the groups of replicated nodes that can host the replication algorithms as well as the
necessary facilities for interchanging messages between nodes of the same group. The
fault injector injects partitioning faults into the target system, by injecting faults into
certain gateways or repeaters. The repair injector injects reunions into the system. Each
reunion corresponds to a repeater or gateway repair. The system also executes read and
write operations on replicated objects. The read and write operations are generated from
the work-load generator. The controller is physically the simulation program that runs
and controls all the parts of the testing system. It also tracks the execution of read and
write operations and initiates data collection. The data collector performs on-line data
collection and the data analyser performs data processing and analysis. The injection of
faults is done during run-time by using the time-out technique. A timer expires at a
predefined time, triggering injection. The inter-arrival time between faults follows the
exponential distribution. When the timer expires, the fault operations occur by
interrupting the normal operation of the system.
6.4 Simulated Algorithms
The three algorithms (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b)
presented in the previous chapter have been tested. Each algorithm has been tested
under exactly the same sequence of events. When an event (partition, reunion, read or
write) occurs, it is inserted into a queue and then each algorithm performs the necessary
housekeeping operations to reflect any change in the replication system. Each algorithm
keeps its own record and when the simulation finishes and each algorithm reaches a
steady-state condition, the simulation control unit counts the percentage of read and
write operations that have been performed giving in that way an estimation of the
3. Repair delay
4. Read rate
5. Write rate
The mean inter-arrival time of read operations is 1/(read rate) and the mean
inter-arrival time of write operations is 1/(write rate). The read and write operations
generated in each site realise the periodical access process to each replicated object.
Each site provides a process which calculates the percentage of successful accesses. An
access is considered successful if the relative operation (read or write) can be performed.
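Since every tested protocol is driven by the identical event sequence and keeps its own success record, the bookkeeping can be sketched as follows (all names here are hypothetical, not those of the simulator):

```cpp
#include <vector>

// Each tested algorithm receives exactly the same events, so the
// measured availabilities are directly comparable.
enum class Ev { Read, Write, Partition, Reunion };

struct Proto {
    int handled = 0;
    void Handle(Ev) { handled++; }  // placeholder for protocol housekeeping
};

void Replay(const std::vector<Ev>& events, std::vector<Proto>& protos) {
    for (Ev e : events)
        for (Proto& p : protos)
            p.Handle(e);  // the same event is delivered to every protocol
}
```

In the real simulator each Handle call would attempt the read or write under that protocol's quorum rules and record success or failure separately per protocol.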
6.5 The Protocol Routines
A general form for accessing the replicated objects is used in a common way to all
of the tested protocols. When a read or write occurs the read or write routine of each
protocol is activated. Each protocol has its own view of the network state and according
to that view it executes all the necessary subroutines needed to accomplish a read or
115
write access. The results of a successful or unsuccessful access are recorded separately
for each protocol. These results are gathered during execution and later they are
compared in order to draw useful conclusions about the availability provided by each
protocol. A critical part of the model is to determine whether two sites can
communicate. Since all the protocols rely on communication between sites to determine
the status of the replicated data, a fast simple means is needed to determine
communication links. For sites on the same LAN, the solution is simple. If any two sites
are up and running, it can be assumed that they can communicate. For sites not on the
same network segment, the assumption cannot be made since they may be separated by
one or more gateway sites or repeaters. A solution is found by viewing the
network as a tree structure, whose nodes consist of the different network segments and
their respective sites. One segment is chosen as the basis. Communication is determined
by traversing the tree between two sites. The tree structure is conceptual and is
represented by doubly linked lists1. The connectivity between sites is shown by a
communication vector which is realised through an array of bits. Each site is assigned a
unique identity number. Given the identity numbers of two sites, the communication
routines determine
if they can communicate by checking the connectivity vector.
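The connectivity check through the bit vector can be sketched as follows; this is a sketch only, with the vector size and helper name chosen for illustration (site identities index the bits, as the text describes):

```cpp
#include <bitset>

// Connectivity vector: bit i is set when site i is reachable. Two sites
// can communicate when both of their bits are set in the vector.
const int MaxSites = 64;  // illustrative upper bound on site identities

bool CanCommunicate(const std::bitset<MaxSites>& conn, int a, int b) {
    return conn.test(a) && conn.test(b);
}
```

Representing connectivity as bits makes the check a constant-time operation, which matters because every simulated read and write consults it.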
6.6 Implementing Group Communication
Sites in a communicating group exchange messages by using the multicast model.
A simple implementation is based on simple message sending which might take the
form:
void multicast(PortID *destination, Message msg)
{
    for (int i = 0; i <= HigherDestination; i++)
        Send(destination[i], msg);
}
A thorough study of the organisation of the simulation is presented in the next chapter.
Here, destination is an array of destination ports which will receive the message msg,
and HigherDestination is the highest index, identifying the last destination. Destination
i is identified by the destination[i] ID. The multicast procedure simply sends the
message to each destination port. It has been used in the simulation program because
of its simplicity and ease of handling. In practice, this multicast procedure may be
replaced by other more sophisticated and efficient procedures like those provided in
broadcast packet switching network technologies.
6.7 Functional Components of the Simulation.
The functional components of the simulation represent the main modules of the
simulation model. Figure 6-3 illustrates the modules that compose the simulation model.
Each module is briefly described as follows:
1. Transaction generator: This module generates transactions and distributes them to
the relevant sites. A transaction is a sequence of read and write operations which are
generated through independent generators.
2. Repair/Failure generator: This module generates site and communication failures
and distributes them to the relevant sites. This module also generates repairs,
triggering in that way the recovery system.
The read availability Ar is the ratio of the reads performed to the reads issued:

Ar = Rp / Ri

where Rp is the number of reads performed and Ri the number of reads issued. The
write availability Aw is similarly the ratio of the writes performed Wp to the writes
issued Wi:

Aw = Wp / Wi

The total availability A is the ratio of the operations performed to the operations
issued:

A = (Rp + Wp) / (Ri + Wi)

Letting α = Ri / Wi denote the ratio of reads issued to writes issued, the total
availability can be written as

A = (α Ar + Aw) / (α + 1)

so that A tends to Ar for large α and A tends to Aw when Ri << Wi.

[Figure: the total availability curve against α, bounded between the asymptotes
Ar and Aw.]

The figure shows the availability curve and points out that the total availability is
limited between the asymptotes Ar and Aw. Calculating the derivative of A, we find
that

dA/dα = (Ar − Aw) / (α + 1)²
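The relation between the total availability, its asymptotes and the read-to-write ratio α, as reconstructed above, can be checked numerically with a small sketch (the symbol α and the function names are illustrative):

```cpp
// Total availability as a function of the read-to-write ratio alpha,
// reconstructed from A = (Rp+Wp)/(Ri+Wi), Ar = Rp/Ri, Aw = Wp/Wi and
// alpha = Ri/Wi.
double TotalAvailability(double alpha, double Ar, double Aw) {
    return (alpha * Ar + Aw) / (alpha + 1.0);
}

// Derivative of A with respect to alpha: (Ar - Aw) / (alpha + 1)^2.
double Derivative(double alpha, double Ar, double Aw) {
    return (Ar - Aw) / ((alpha + 1.0) * (alpha + 1.0));
}
```

At α = 4 the derivative equals (Ar − Aw)/25, which is at most 1/25 = 0.04 since 0 < Ar − Aw < 1, consistent with the discussion that follows.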
The derivative of A indicates that as the difference (Ar − Aw) approaches 1, the slope
of the total availability A approaches its maximum value. This implies that the total
availability grows faster for small values of α, as shown in the availability curves.

[Figure: total availability curves for two pairs of asymptotes (Ar, Aw) and
(A'r, A'w), plotted against α over the range 0 to 20.]
Empirical measurements deem that the read-to-write ratio is α = 4 (BAKER 1991). For
α = 4, the derivative of A is equal to (Ar − Aw) / 25. Considering that 0 < Ar − Aw < 1,
its maximum value becomes 1/25 = 0.04. Therefore, as shown in Figure 6-5, at α = 4,
the total availability curve tends to be parallel to the horizontal axis, reaching in that
way a steady-state condition. Therefore, for a given α > 4, we may increase the total
availability if we can somehow increase the difference (Ar − Aw). This is what the
DMCA algorithm (described in the previous chapter) achieves by exploiting such large
values of α. The DMCA algorithm exploits the difference between the rates of read and
write operations by changing dynamically the read and write quorums and allowing the
execution of the read operations rather than
121
the write operations. Read operations are actually the majority of the operations
performed in a replication system.
6.10 Results of the Simulation
To simplify the analysis of the simulation results, it is assumed that all sites have
identical mean time to fail and mean time to repair. If individual site attributes had
been considered, it would have been difficult to determine whether the availability was
being affected by partitioning or by some other factor, such as a site failing frequently
for short periods of time.
The simulation utilises 20 replicas (one in each node). The partitioning rate is
equal to 2 partitions per time unit and the simulation interval is 100 time units. These
parameters are shown in Table 6-1. The repair delay is measured in time units². The
availability provided by the tested replica control protocols has been estimated for
various values of the repair delay. The chosen values for the mean repair delay are
² A time unit could be a day or a month or any other time interval defined by the user. The choice of a
particular time unit does not affect the results of the simulation as long as we use the same time unit for all
the simulation parameters (simulation time interval, partitioning rate and repair delay).
Table 6-1: Simulation parameters

Parameter              Value
Number of replicas     20
Partition rate         2 partitions per time unit
Simulation interval    100 time units
consistent with the measurements made using the Internet (LONG 1991) and range
from 0.1 to 3.
The rest of this chapter presents the availability results provided by the
tested replica control protocols, as measured during the simulation. The
following figures present the read availability, the write availability and the total
availability provided by each replica control protocol for certain values of the mean
repair delay. The figures presented in this chapter have been created by taking into
account the results obtained by using the Availability Testing System (ATS), which is
discussed in the next chapter. In Appendix B, one can find the tables associated with
the figures as well as the raw values regarding the read and write operations that
occurred during the simulation. The tables in Appendix B show the number of read and
write operations issued during a simulation as well as the number of read and write
operations performed by each replica control protocol during the same interval of time.
They also show how many partitions and reunions occurred as a result of failures and
repairs respectively. All of the following figures show the availability with respect to
the parameter ρ (the ratio of reads issued to writes issued). The replica control protocols
that have been tested are: the voting algorithm (GIFFORD 1979), the dynamic voting
algorithm (JAJODIA 1989) and the DMCA (KOTSAKIS 1996b).
Figure 6-7 shows the read availability, the write availability and the total availability
provided by the tested replica control algorithms for a mean repair delay equal to 0.1.
This means that during the time interval of 100 time units, 200 partitioning events are
expected (since the rate is 2 partitions per time unit), and each failure takes
approximately 0.1 time units to be repaired. Because the repair delay is very short
compared with the inter-arrival time of the partitions, the read availability is very high,
around 0.95. This means that approximately 95% of the read operations issued are
performed. The write availability is expected to be smaller than this, due to the higher
cost of the write operations. However, Jajodia's algorithm provides higher write
availability than the other two. Jajodia's algorithm keeps a balance between the
execution of read and write operations. It provides an almost constant read and write
availability regardless of the rates of read and write operations issued. DMCA tries to
exploit the difference between the read and write rates in order to increase the total
availability. The DMCA approach provides higher availability than that of the other
two algorithms. The availability of the DMCA increases as the factor ρ increases. This
is because the aim of the DMCA algorithm is to allow low-cost operations (like reads)
with large occurrence rates to be performed easily, whereas high-cost operations (like
writes) are performed rarely. Under this strategy the majority of the operations issued
are executed. This approach makes the DMCA algorithm suitable for ρ > 1.
The availability provided by Gifford's algorithm follows that of DMCA, but
this is due to the fact that in Figure 6-7 the repair delay is very short. As we see later,
when the repair delay increases, the total availability of Gifford's algorithm becomes
much smaller than that of DMCA, especially for ρ < 6 (in practice ρ < 10). Figure 6-8
shows the availability results for a repair delay equal to 0.2. These availability curves
are similar to the ones presented in Figure 6-7, except that the availability level of each
algorithm is a bit smaller. This is expected, since repairs occur less frequently and it
takes more time to have a reunion. This also means that the replicated objects are
unavailable for longer
intervals of time. As the repair delay increases, the availability level of all of the tested
algorithms gradually decreases. Figure 6-9 illustrates the availability results for a mean
repair delay equal to 0.3. The DMCA algorithm provides higher total availability than
that of the other two algorithms. Figure 6-10 illustrates the availability results for a
mean repair delay equal to 0.4 and Figure 6-11 for 0.5. As the ratio ρ increases, it
becomes more difficult for write operations to be performed, since the read and write
quorums are reconfigured according to the current read and write rates. However,
DMCA eases the execution of the read operations and thus increases the possibility of
performing the majority of operations issued, which results in a higher total availability.
Figure 6-12, Figure 6-13, Figure 6-14, Figure 6-15 and Figure 6-16 illustrate the
availability results for mean repair delays equal to 1.0, 1.5, 2.0, 2.5 and 3.0
respectively. It is observed that there is not much difference between the slopes of the
illustrated availability curves for the tested algorithms. What is very obvious in these
figures is the gradual decrease of the availability provided by the algorithms as the
mean repair delay increases. For instance, for a mean repair delay equal to 1.0, the
DMCA algorithm provides a total availability around 0.62 and Jajodia's algorithm
around 0.58 (see Figure 6-12). When the repair delay becomes equal to 3.0, the total
availability decreases further; DMCA still provides the highest total availability, since it
reconfigures dynamically the read and write quorums according to the read and write
occurrence rates.
Figure 6-7: Availability provided by the tested replica control protocols for mean repair delay=0.1 (a)
read availability (b) write availability (c) total availability.
Figure 6-8: Availability provided by the tested replica control protocols for mean repair delay=0.2 (a)
read availability (b) write availability (c) total availability
Figure 6-9: Availability provided by the tested replica control protocols for mean repair delay=0.3 (a)
read availability (b) write availability (c) total availability
Figure 6-10: Availability provided by the tested replica control protocols for mean repair delay=0.4 (a)
read availability (b) write availability (c) total availability
Figure 6-11: Availability provided by the tested replica control protocols for mean repair delay=0.5 (a)
read availability (b) write availability (c) total availability
Figure 6-12: Availability provided by the tested replica control protocols for mean repair delay=1.0 (a)
read availability (b) write availability (c) total availability
Figure 6-13: Availability provided by the tested replica control protocols for mean repair delay=1.5 (a)
read availability (b) write availability (c) total availability
Figure 6-14: Availability provided by the tested replica control protocols for mean repair delay=2.0 (a)
read availability (b) write availability (c) total availability
Figure 6-15: Availability provided by the tested replica control protocols for mean repair delay=2.5 (a)
read availability (b) write availability (c) total availability
Figure 6-16: Availability provided by the tested replica control protocols for mean repair delay=3.0 (a)
read availability (b) write availability (c) total availability
6.11 Summary
This chapter has described the simulation model that has been used to
implement a benchmark test utility for measuring the availability provided by
some representative replica control algorithms. It has specified the parameters of the
simulation which have been used to describe the behaviour of the tested algorithms. The
replicated objects are accessed through special routines specific to the particular
algorithm. The simulation has been implemented using event-driven
mechanisms. This chapter has also shown the advantages of using discrete event
simulation over a continuous one and explained the suitability of event-driven
simulation over process-driven simulation. It has presented the results of the benchmark
test through a series of illustrations for various mean repair delays and it has provided a
thorough comparison between the DMCA algorithm and two other voting algorithms.
The DMCA achieves greater total availability than that of (GIFFORD 1979) and
(JAJODIA 1989). The higher availability is obtained by exploiting the difference
between the read rate and the write rate and by utilising a dynamic adjustment scheme
which allows the reconfiguration of the read and write quorums.
The DMCA approach provides higher availability than that of the other two
algorithms. The availability of the DMCA increases as the factor ρ increases. This is
because the aim of the DMCA algorithm is to allow low-cost operations (like reads) with
large occurrence rates to be performed easily, whereas high-cost operations (like writes)
are performed rarely. Under this strategy the majority of the operations issued are
executed. This approach makes the DMCA algorithm suitable for ρ > 1.
DMCA tries to moderate the differences between Jajodia's and Gifford's
algorithms in order to increase the total availability. DMCA inherits the dynamic
adjustment of Jajodia's algorithm, reconfiguring dynamically the read and write
quorums according to the read and write occurrence rates.
DMCA is an algorithm that can safely be used in a network management system
for replicating managed objects. It requires no modification to the voting scheme; it
adjusts the read and write quorums dynamically and improves the availability of
network managed objects.
7. CONCLUSIONS
in the case of network partitioning. The design of a novel replica control protocol is
followed by a study of the criteria for achieving correctness and high availability.
2. Development of a benchmark test utility (ATS) that has been used for estimating the
availability of certain replica control algorithms. The development of such a tool is
very important because it allows the designer to evaluate the
suitability of a representative set of replica control algorithms. The ATS tool is built
using the object-oriented paradigm and designed as an event simulation system.
The main advantage of this tool is that it may be extended to accommodate more
voting algorithms with little effort.
3. Quantitative analysis based on experiments constructed by using the ATS tool.
The aim of this analysis is to prove the suitability of the DMCA algorithm by
comparing its simulation results with those of other voting algorithms. This analysis
has also shown how the availability is affected when the repair delay gets longer.
remains fixed until the designer manually changes the number of replicas or their
locations. If the reads and writes are fixed and are known a priori, then this is a
reasonable solution. However, if the read and write patterns change dynamically, in
unpredictable ways, such a replication scheme may lead to severe performance
problems.
In situations where read and write patterns change dynamically, there is a need to
develop a replication algorithm that incorporates dynamic adaptation features. Such an
algorithm may have the ability to learn characteristics of its environment and use them
to re-adjust the read and write quorums and re-order the placement of the objects to suit
future access patterns. DMCA does perform the dynamic adjustment of the read and
write quorums but not dynamic replacement of replicated objects.
Such a dynamic algorithm should incorporate an automaton that takes into account
historical and statistical data regarding the read and write access of replicated objects.
Dynamic replacement of replicated objects may decrease the transmission time and
increase in that way the performance of the read and write operations.
The communication cost of a replication scheme is the average number of interprocessor messages required for a read or a write of the object. Placing the replicated
objects at those locations that minimise the number of messages passed to access them
forms an optimum replication scheme that can ensure high availability, good
performance and strong consistency. Adaptive replication techniques that encompass
dynamic object replacement may work together with the DMCA to increase further the
performance. Recent work (WOLFSON 1997) proves that an adaptive replication
algorithm improves the performance by exercising dynamic object replacement. In
(WOLFSON 1997), a method of coping with storage space limitations at the various
processors in the network is also proposed, in order to compare a
static scheme (without object replacement) with a dynamic one that exercises
replicated object replacement. Applying dynamic replacement to the DMCA algorithm
may further increase the performance and improve the usability of the algorithm.
7.3 Concluding Remarks
Data availability is a fundamental problem in a network management system. As
argued earlier in this thesis, this problem will become more acute as different network
technologies evolve. Replication of managed objects is the key to solving this
problem. Voting replica control algorithms exhibit a suitable behaviour that ensures
consistency and provides high availability.
This thesis has shown that voting replication techniques can safely be used in
distributed MIBs and that the object availability provided by voting algorithms may be
improved further by adjusting the read and write quorums according to the read and
write occurrence rates. It has also been shown that applying replication techniques to
network management makes the management activities more robust and fault tolerant.
Finally, it proposes the DMCA voting algorithm, which provides higher availability
than that provided by other similar algorithms.
APPENDIX-A
PAPERS
This appendix includes the papers related to this thesis, authored by myself and
co-authored by my supervisor Dr. B. H. Pardoe. These papers have been published in
the proceedings of the associated conferences.
APPENDIX-B
TABLES
This appendix contains the tables describing the results of the simulation discussed in
Chapter 6. All the figures illustrating quantitative results about the availability provided
by each tested algorithm have been constructed by using these tables. The values
presented in these tables are mean values obtained over extensive runs of the ATS
simulation tool.
APPENDIX-C
SOURCE CODE
This appendix includes the C++ code of the ATS system. The program has been
tested under Win32 (Windows 95 or NT).
LIST OF REFERENCES
[ABBADI 1985]
EL ABBADI, A.; SKEEN, D.; and CRISTIAN, F., An Efficient Fault Tolerant
Protocol for Replicated Data Management, In Proceedings of the 4th ACM
Symposium on Principles of Database Systems (1985), ACM, New York,1985, pp.
215-228.
[ABBADI 1986]
EL ABBADI, A.; TOUEG, S. Availability in Partitioned Replicated Databases,
In Proceedings of the 5th ACM Symposium on Principles of Database Systems
(1986), ACM, New York, 1986, pp. 240-251.
[ADAMEC 1995]
ADAMEC, JAROMIR, MICHAEL GROF, JAN KLEINDIENST, FRANTISEK
PLASIL and PETR TUMA Supporting Interoperability in CORBA via Object
Services, Tech. Report, No. 114, Department of software engineering, Charles
University Prague, October 1995, also available at
http://www.cs.wustl.edu/~schmidt/CORBA-docs/interoperability.ps.gz.
[ALSBERG 1976]
ALSBERG, P. A. and DAY, J. D., A principle for resilient sharing of
distributed resources, In Proceedings of the 2nd International Conference
on Software Engineering, IEEE Computer Society, San Francisco,
California, October 1976, pp. 627-644.
[ANSA 1989]
The Advanced Network System Architecture (ANSA) Reference Manual, Castle
Hill, Cambridge England, Architecture Project Management, 1989.
[ARPEGE 1994]
ARPEGE Group (1994), Network Management Concepts and Tools, Chapman
Hall, 1994, ISBN 0-412-57810-7.
[BABAT 1991]
BABAT, SUBODH, OSI Management Information Base Implementation, In
proceedings of the IFIP Symposium on Integrated Network Management II,
Amsterdam, 1991
[BAKER 1991]
BAKER, M. G.; HARTMAN, J. H.; KUPFER, M. D.; SHIRRIFF, K. W.;
and OUSTERHOUT, J. K., Measurements of a Distributed File System,
Proceedings of the 13th ACM Symposium on Operating System Principles,
(1991), pp. 198-212.
[BAN 1995]
BAN, BELA, Towards an Object-Oriented Framework for Multi-Domain
Management, IBM Zurich Research Laboratory, Rueschlikon, Dec 18, 1995,
also available at http://simon.cs.cornell.edu/Info/People/bba/GOM.ps.gz.
[BEAR 1988]
BEAR, D. Principles of Telecommunication Traffic Engineering, 3rd edition, IEE
Telecommunication Series 2, IEE, Peter Peregrinus Ltd., London, 1988.
[BERNSTEIN 1981]
BERNSTEIN, P. A. and GOODMAN, N., Concurrency Control in
Distributed Database Systems, ACM Computing Survey, Vol. 13 No 2,
June 1981, pp. 185-221.
[BERNSTEIN 1987]
BERNSTEIN, P. A. ; HADZILACOS, V and GOODMAN, N.,
Concurrency Control and Recovery in Database Systems, Addison Wesley,
1987.
[BEVER 1993]
BEVER, M ; GEIHS,K. ; HEUSER, L. ; MUHLHAUSER, M. ; and SCHILL,A.,
Distributed Systems, OSF/DCE and Beyond, In DCE - The OSF Distributed
Computing Environment, edited by SCHILL, A, Springer - Verlag, 1993, pp. 1-20.
[BIRMAN 1987]
BIRMAN, K., and JOSHEPH, T., Reliable communication in the presence of
failures., ACM Transactions on Computer Systems, Vol. 5, No. 1, February 1987.
[BLAUSTEIN 1985]
BLAUSTEIN, B.T., and KAUFMAN, C. W. Updating Replicated Data
During Communication Failures In Proceedings of the 11th International
Conference on Very Large Data Bases, 1985, pp. 49-58.
[BONG 1989]
BONG, A. BLAU, W., GRAETSCH, W., HERRMANN, F. and OBERLE, W.,
Fault-tolerance under Unix, ACM Transactions on Computer Systems, Vol. 7,
No. 1, February 1989.
[BUDHIRAJA 1993]
BUDHIRAJA, N. ; MARZULLO, K. ; SCHNEIDER, I.B.; and TOUEG,S. The
Primary - Backup Approach, In Distributed Systems, 2nd ed., MULLENDER, S
(ed.), pp. 199-216, ACM press, 1993.
[CARDELLI 1985]
CARDELLI, L. and WEGNER, P. On Understanding Types, Data Abstraction,
and Polymorphism, ACM Computing Surveys vol. 17, No. 4, December 1985.
[CARR 1985]
CARR, R., The Tandem global update protocol, Tandem System Review, Vol. 1,
No. 2, June 1985.
[CASE 1990]
CASE, J.D.; FEDOR, M.; SCHOFFSTALL,M.L.; and DAVIN, C., Simple
Network Management Protocol (SNMP), Request For Comments, RFC 1157,
1990.
[CCITT 1993]
CCITT Recommendation X.700 Management Frameworks Definition for
Open Systems Interconnection (OSI) for CCITT Applications, 1993.
[CERF 1988]
CERF, V. IAB Recommendations for the Development of Internet Network
Management Standards, Request For Comments, RFC 1052, April 1988.
[CHANG 1984]
CHANG, J. M., and MAXEMCHUCK, N., Reliable Broadcast Protocols, ACM
Transactions on Computer Systems, Vol. 2, No. 3, August 1984.
[CHIN 1991]
CHIN, R. S. and CHANSON, S. T., Distributed object based programming
systems, ACM Computing Surveys, Vol. 23, No. 1, March 1991, pp. 91-124.
[COOPER 1985]
COOPER, E., Replicated Distributed Programs, Ph.D. dissertation, UC Berkley,
1985.
[CRISTIAN 1985]
CRISTIAN, F., AGHILI, H., STRONG, R., and DOLEV, D. Atomic broadcast:
From simple diffusion to Byzantine agreement, 15th International Conference on
Fault-tolerant computing , Ann Arbor, Michigan, 1985.
[CRISTIAN 1988]
CRISTIAN, F. Agreeing on who is present and who is absent in a synchronous
distributed system, In Proceedings of the 18th International Conference on Fault
Tolerant Computing, Tokyo, June 1988.
[CRISTIAN 1989]
CRISTIAN, FLAVIN "Exception handling". In Dependability of Resilient
Computers, T. Anderson (ed.), Blackwell Scientific Publications, Oxford, 1989.
[CRISTIAN 1990]
CRISTIAN, F., DANCEY, R., and DEHN, J., Fault tolerance in the advanced
automation system, 20th International Conference on Fault-tolerant Computing ,
Newcastle upon Tyne, England, June 1990.
[CRISTIAN 1991]
CRISTIAN, FLAVIN Understanding Fault-Tolerant Distributed Systems,
Communications of the ACM (February 1991), Vol. 34, No. 2, pp. 57-78.
[DAVCEN 1985]
DAVCEV, D. and BURKHARD,W., Consistency and Recovery Control for
Replicated files, In Proceedings of the 10th ACM Symposium on Operating
Systems Principles (1985). ACM, New York, 1985, pp. 87-96.
[DAVIDSON 1984]
DAVIDSON, S. B. Optimism and Consistency in Partitioned Distributed
Database Systems, ACM Transactions on Database Systems, 1984, Vol. 9,
No 3, pp. 456-481.
[DAVIDSON 1985]
DAVIDSON, S. B.; GARCIA-MOLINA, H. and SKEEN, D. Consistency
in Partitioned Networks, ACM Computer Survey, 1985, Vol. 17, No 3, pp.
341-370
[ESWARAN 1976]
ESWARAN, K. P.; GRAY, J. N.; LORIE, R. A. and TRAIGER, I. L. The
Notions of Consistency and Predicate Locks in a Database System,
Communications of the ACM, Nov. 1976, Vol. 19, No 11, pp. 624-633.
[EZHILCHELVAN 1986]
EZHILCHELVAN, P., and SHRIVASTAVA, S. A characterisation of faults in
systems, Fifth Symposium on Reliability in Distributed Software and Database
systems, Los Angeles, January 1986
[FERIDUM 1996]
FERIDUM, M., HEUSLER, L., and NIELSEN, R, Implementing OSI
Agent/Managers for TMN, IEEE Communications Magazine, September
1996.
[GARCIA 1982]
GARCIA, H., Elections in a distributed computing system. IEEE
Transactions on Computers, Vol. 31, No. 1, January 1982, pp. 48-59.
[GIFFORD 1979]
GIFFORD, D. K. Weighted Voting for Replicated Data, In Proceedings of the
7th Symposium on Operating Systems Principles (Pacific Grove, CA,
[HARPER 1988]
HARPER, R., LALA, L., DEYST, J., Fault tolerant parallel processor
architecture overview , 18th International Conference on Fault Tolerant
Computing, Tokyo, June 1988.
[HOPKINS 1978]
HOPKINS, A., SMITH, B., LALA, J., FTMP-A highly reliable fault tolerant
multi-processor for aircraft , Proceedings IEEE, Vol. 66, Oct. 1978.
[ISO 1988]
ISO/IEC 9072:1988 (CCITT Recommendation X.211), International Organisation
for Standardisation, Information Processing Systems: Open Systems
Interconnection, Text Communication - Remote Operations Part 1: Model,
Notation and Service Definition.
[ISO 1989]
ISO/IEC 7498-4: 1989, Information processing systems, Open Systems
Interconnection, Basic Reference Model , Part 4: Management framework.
(CCITT Recommendation X.700: 1992, Management Framework for Open
systems Interconnection -OSI- for CCITT Applications)
[ISO 1991]
ISO/IEC 9595:1991 (CCITT Recommendation X.710), International Organisation
for Standardisation, Information Processing Systems: Open Systems
Interconnection, Common Management Information Service Definition
[ISO 1991a]
ISO/IEC 9596-1:1991 (CCITT Recommendation X.711), International
Organisation for Standardisation, Information Processing Systems: Open Systems
Interconnection, Common Management Information Protocol Specification.
[ISO 1992]
ISO/IEC 10040:1992, (CCITT Recommendation X.701: 1992) Information
technology, Open Systems Interconnection, Systems management overview.
[ISO 1992a]
ISO/IEC 10746-1:1992 , International Organisation for Standardisation, Basic
Reference Model of Open Distributed Processing, Part 1: Overview and Guide to
Use, JTC1/SC212/WG7CD 10746-1, ISO 1992.
[ISO 1993]
ISO/IEC 10165-1: 1993, (CCITT Recommendation X.720: 1992), Information
technology, Open Systems Interconnection, Structure of management information:
Management information model.
[ITU 1995]
ITU-T Recommendation M.3010 Principles and Architecture for the TMN
[JAJODIA 1987a]
JAJODIA, S. and MUTCHLER, D. Dynamic Voting In Proceedings of the
ACM SIGMOD Intl Conference on Management of Data, May 1987.
[JAJODIA 1987b]
JAJODIA, S. and MUTCHLER, D. Enhancement to the voting algorithm In
Proceedings of the 13th Intl conference on Very Large Databases (VLDB),
September 1987.
[JAJODIA 1989]
JAJODIA, S. and MUTCHLER, D. A pessimistic Consistency Control
Algorithm for Replicated Files Which Achieves High Availability, IEEE
Transactions on Software Engineering, (January 1989), Vol. 15, No 1, pp.
39-46.
[JAJODIA 1990]
JAJODIA, SUSHIL and MUTCHLER, DAVID, Dynamic Voting Algorithms for
Maintaining the Consistency of a Replicated Database, ACM Transactions on
Database Systems, Vol. 15, No. 2, June 1990, pp. 230-280.
[JOSEPH 1987]
JOSEPH, T. A. and BIRMAN, K. P. Low Cost Management of Replicated
Data in Fault Tolerant Distributed Systems, ACM Transactions in
Computer Systems, 1987, Vol. 4, No 1
[KAHANI 1997]
KAHANI, M. and BEADLE, P., H., W., Decentralised Approaches for
Network Management, ACM SIGCOM Computer Communications Review,
Vol. 27, No. 3, July 1997, pp. 36-47.
[KEENE 1989]
KEENE, S. E., Object-Oriented Programming in Common LISP, Addison-Wesley, 1989
[KERNIGHAN 1988]
KERNIGHAN, B., W., and RITCHIE, D. M., The C Programming Language,
Second Edition. Prentice Hall, Englewood Cliffs, N.J., 1988
[KOTSAKIS 1995]
KOTSAKIS, E.G. and PARDOE, B.H., Modelling OSI Management Information
Base With Object Oriented Analysis, In Proceedings of the 1995 International
Symposium on Communications, Taipei, Taiwan (December 27-29, 1995), pp.
143-149
[KOTSAKIS 1996a]
E.G.KOTSAKIS and B.H.PARDOE Replication of Management Objects in
Distributed MIB In Proceedings of the ICT96 International Conference on
Telecommunications, April 14-17, 1996, pp. 545-549.
[KOTSAKIS 1996b]
E.G.KOTSAKIS and B.H.PARDOE Dynamic Quorum Adjustment: A
consistency scheme for Replicated Objects, In Proceedings of the Third
Communication Networks Symposium, Manchester, July 8-9, 1996, pp. 197-200
[KOTSAKIS 1996c]
E.G.KOTSAKIS and B.H.PARDOE Simulation IASTED
[KRIEGER 1998]
KRIEGER, D., and ADLER, R., M., The Emergence of Distributed
Component Platform, IEEE Computer, March 1998, pp. 43-53.
[LADIN 1992]
LADIN, R ; LISKOV, B. ; SHRINA, L. ; and GHEMAWAT, S. Providing
Availability Using Lazy Replication, ACM Transactions on Computer Systems,
Vol. 10, No. 4, pp. 360-391.
[LAMPORT 1984]
LAMPORT, L., Using time instead of time-outs in fault tolerant systems, ACM
Transactions on Programming Languages and Systems, Vol. 6, No. 2, 1984
[LAW 1991]
LAW, A. M. and KELTON, W. D., Simulation Modelling and Analysis, 2nd
edition, McGraw Hill, 1991
[LEINWARD 1993]
LEINWARD, A., and FANG, K., Network Management: A practical
perspective, Addison Wesley, 1993
[LEPPINEN 1997]
LEPPINEN, MIKA., PULKKINEN, PEKKA., RAUTIAINEN, AAPO, Java
and CORBA Based Network Management, IEEE Computer, June 1997, pp.
83-87.
[LEWIS 1995]
LEWIS, G., R., CORBA 2.0 Universal Networked Objects ACM Standard
View Vol. 3, No. 3, September 1995.
[LISKOV 1991]
LISKOV, B ; GHEMAWAT, S ; GRUBER, R ; JOHNSON, P. ; SHRINA, L. and
WILLIAMS, M. Replication in the HARP file System In Proceedings of the
13th ACM Symposium on Operating Systems Principles, pp. 226-238, 1991.
[LONG 1991]
LONG, D.D.E., CARROLL, J.L. and PARK, C.J., A study of the
reliability of the Internet sites, Proceedings of the 10th Symposium on
Reliable Distributed Systems, 1991, pp. 177-186.
[MAFFEIS 1997a]
MAFFEIS, S. and SCHMIDT, D. C., Constructing Reliable Distributed
Communication Systems with CORBA, IEEE Communications Magazine,
February 1997.
[MAFFEIS 1997b]
MAFFEIS, S., Piranha: A CORBA Tool for High Availability, IEEE
Computer, April 1997, pp. 59-66.
[MEYER 1992]
MEYER, B., Eiffel: The Language, Prentice Hall, 1992.
[MEYER 1995]
MEYER, K., ERLINGER, M., BETSER, J., SUNSHINE, C., GOLDSZMIDT,
G. and YEMINI, Y., Decentralizing Control and Intelligence in Network
Management, In Proceedings of the International Symposium on Integrated
Network Management, May 1995.
[MINOURA 1982]
MINOURA, T. and WIEDERHOLD, G., Resilient Extended True-Copy Token
Scheme for a Distributed Database System, IEEE Transactions on Software
Engineering, Vol. SE-8, No. 3, May 1982, pp. 173-189.
[MISRA 1986]
MISRA, J., Distributed Discrete-Event Simulation, ACM Computing Surveys,
Vol. 18, No. 1, March 1986, pp. 36-65.
[MULLENDER 1990]
MULLENDER, S. J., VAN ROSSUM, G., TANENBAUM, A. S., VAN
RENESSE, R. and VAN STAVEREN, H., Amoeba: A Distributed Operating
System for the 1990s, IEEE Computer, Vol. 23, No. 5, May 1990, pp. 44-53.
[NARASIMHAN 1997]
NARASIMHAN, P., MOSER, L. E. and MELLIAR-SMITH, P. M., Replica
Consistency of CORBA Objects in Partitionable Distributed Systems,
Distributed Systems Engineering, Vol. 4, No. 3, September 1997, pp. 139-150.
[NELSON 1981]
NELSON, B., Remote Procedure Call, Technical Report CSL-81-9, Xerox
Palo Alto Research Center, 1981.
[OKI 1988]
OKI, B. and LISKOV, B., Viewstamped Replication: A New Primary Copy Method
to Support Highly Available Distributed Systems, In Proceedings of the Seventh
ACM Symposium on Principles of Distributed Computing, August 1988.
[OMG 1997]
Object Management Group, A Discussion of the Object Management
Architecture (OMA), OMG, January 1997.
[OSF 1993]
Open Software Foundation (OSF), Introduction to OSF DME, Distributed
Management Environment (DME) 1.0, H001, 1993.
[PALUMBO 1985]
PALUMBO, D. and BUTLER, R., Measurement of SIFT Operating System
Overhead, NASA Technical Memorandum 86322, 1985.
[PARKER 1983]
PARKER, D. S. JR., POPEK, G. J., RUDISIN, G., STOUGHTON, A., WALKER,
B. J., WALTON, E., CHOW, J. M., EDWARDS, D., KISER, S. and KLINE, C.,
Detection of Mutual Inconsistency in Distributed Systems, IEEE Transactions
on Software Engineering, Vol. SE-9, No. 3, May 1983, pp. 240-247.
[PAVON 1998]
PAVON, J. and TOMAS, J., CORBA for Network and Service Management
in the TINA Framework, IEEE Communications Magazine, March 1998, pp.
72-79.
[PRESOTTO 1990]
PRESOTTO, D. L. and RITCHIE, D. M., Interprocess Communication in the
Ninth Edition UNIX System, Software: Practice and Experience, Vol. 20, No.
S1, June 1990, pp. 3-17.
[PU 1988]
PU, C., NOE, J. D. and PROUDFOOT, A. B., Regeneration of Replicated
Objects: A Technique and Its Eden Implementation, IEEE Transactions on
Software Engineering, Vol. SE-14, No. 7, July 1988, pp. 936-945.
[PRIS 1986a]
PARIS, J.-F., Voting With a Variable Number of Copies, In Proceedings
of the IEEE International Symposium on Fault-Tolerant Computing, IEEE,
NY, 1986, pp. 50-55.
[PRIS 1986b]
PARIS, J.-F., Voting With Witnesses: A Consistency Scheme for Replicated
Files, In Proceedings of the IEEE International Conference on Distributed
Computing Systems, IEEE, NY, 1986, pp. 606-621.
[PRIS 1991]
PARIS, J.-F. and LONG, D. D. E., Voting With Regenerable Volatile
Witnesses, In Proceedings of the 7th International Conference on Data
Engineering, 1991, pp. 112-119.
[RAHKILA 1997]
RAHKILA, S. and STENBERG, S., Experiences on Integration of Network
Management and a Distributed Computing Platform, In Proceedings of the
30th Hawaii International Conference on System Sciences, IEEE CS Press, 1997,
pp. 140-149; also appears in Distributed Systems Engineering, Vol. 4, No. 3,
September 1997.
[RAMAN 1998]
RAMAN, L., OSI Systems and Network Management, IEEE
Communications Magazine, March 1998, pp. 10-17.
[RUMBAUGH 1991]
RUMBAUGH, J., BLAHA, M., PREMERLANI, W., EDDY, F. and LORENSEN,
W., Object-Oriented Modelling and Design, Prentice Hall, 1991.
[SADOWSKI 1993]
SADOWSKI, R., Selling Simulation and Simulation Results, In Proceedings of
the 1993 Winter Simulation Conference, 1993, pp. 65-68.
[SALTZER 1984]
SALTZER, J., REED, D. and CLARK, D., End-to-End Arguments in System
Design, ACM Transactions on Computer Systems, Vol. 2, No. 4, November 1984.
[SARIN 1985]
SARIN, S. K., BLAUSTEIN, B. T. and KAUFMAN, C. W., System
Architecture for Partition-Tolerant Distributed Databases, IEEE
Transactions on Computers, Vol. C-34, No. 12, 1985, pp. 1158-1163.
[SCHLICHTING 1983]
SCHLICHTING, R. D. and SCHNEIDER, F. B., Fail-Stop Processors: An
Approach to Designing Fault-Tolerant Computing Systems, ACM
Transactions on Computer Systems, Vol. 1, No. 3, August 1983, pp. 222-238.
[SCHNEIDER 1990]
SCHNEIDER, F. B., Implementing Fault-Tolerant Services Using the State
Machine Approach: A Tutorial, ACM Computing Surveys, Vol. 22, No. 4,
December 1990, pp. 299-319.
[SHANNON 1975]
SHANNON, R. E., Systems Simulation: The Art and the Science, Prentice Hall,
1975.
[SIDOR 1998]
SIDOR, D. J., TMN Standards: Satisfying Today's Needs While Preparing for
Tomorrow, IEEE Communications Magazine, March 1998, pp. 54-64.
[STROUSTRUP 1991]
STROUSTRUP, B., The C++ Programming Language, 2nd edition,
Addison-Wesley, 1991.
[THOMAS 1979]
THOMAS, R. H., A Majority Consensus Approach to Concurrency Control for
Multiple Copy Databases, ACM Transactions on Database Systems, Vol. 4,
No. 2, June 1979, pp. 180-209.
[UML 1997]
Unified Modelling Language (UML), Notation Guide, Version 1.1, September
1, 1997, Rational Software Corporation, Santa Clara, CA. Also available at
http://www.rational.com/uml/ad970805_UML11_Notation2.zip.
[VINOSKI 1997]
VINOSKI, S., CORBA: Integrating Diverse Applications within Distributed
Heterogeneous Environments, IEEE Communications Magazine, February 1997, pp.
46-55.
[WENSLEY 1978]
WENSLEY, J., LAMPORT, L., GOLDBERG, J., GREEN, M., LEVITT, K.,
MELLIAR-SMITH, M., SHOSTAK, R. and WEINSTOCK, C., SIFT: Design
and Analysis of a Fault-Tolerant Computer for Aircraft Control, Proceedings
of the IEEE, Vol. 66, No. 10, October 1978.
[WOLFSON 1997]
WOLFSON, O., JAJODIA, S. and HUANG, Y., An Adaptive Data Replication
Algorithm, ACM Transactions on Database Systems, Vol. 22, No. 2, June 1997,
pp. 255-314.