Lightweight Causal and Atomic Group Multicast

KENNETH BIRMAN
Cornell University

ANDRÉ SCHIPER
École Polytechnique Fédérale de Lausanne, Switzerland

and

PAT STEPHENSON
Cornell University
The ISIS toolkit is a distributed programming environment based on virtually synchronous process groups and group communication. We present a new family of protocols in support of this model. Our approach revolves around a multicast primitive, called CBCAST, which implements fault-tolerant, causally ordered message delivery. This primitive may be used directly, or extended into a totally ordered multicast primitive, called ABCAST. It normally delivers messages immediately upon reception and imposes a space overhead proportional to the size of the process groups to which the sender belongs, usually a small number. This version of the protocols has been implemented as part of a recent release of ISIS. We discuss the pragmatic issues that arose in the implementation, the performance achieved, and scaling considerations. Our work leads us to conclude that this style of causally and totally ordered communication can be implemented with performance comparable to that of a raw message transport layer, contradicting the widespread concern that it may be unacceptably costly.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design (network communications); C.2.2 [Computer-Communication Networks]: Network Protocols (protocol architecture); C.2.4 [Computer-Communication Networks]: Distributed Systems (distributed applications; network operating systems); D.4.1 [Operating Systems]: Process Management (concurrency; synchronization); D.4.4 [Operating Systems]: Communications Management (message sending; network communication); D.4.7 [Operating Systems]: Organization and Design (distributed systems)

General Terms: Algorithms, Design, Reliability

Additional Key Words and Phrases: Fault-tolerant process groups, message ordering, multicast communication
This work was supported by the Defense Advanced Research Projects Agency (DoD) under DARPA/NASA subcontract NAG2-593, administered by the NASA Ames Research Center, and by grants from GTE, IBM, and Siemens, Inc. The views, opinions, and findings contained in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.

Authors' addresses: K. Birman and P. Stephenson, Department of Computer Science, Cornell University, 4130 Upson Hall, Ithaca, NY 14853-7501; A. Schiper, École Polytechnique Fédérale de Lausanne, Switzerland.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1991 ACM 0734-2071/91/0800-0272 $01.50
1. INTRODUCTION
The ISIS toolkit [8] provides a variety of tools for building software in loosely
coupled distributed
environments.
The system has been successful in addressing problems of distributed
consistency, cooperative distributed
algorithms
and fault-tolerance.
At the time of this writing,
Version 2.1 of the Toolkit
was in use at several hundred locations worldwide.
Two aspects of ISIS are key to its overall approach: an implementation of virtually synchronous process groups, and a range of multicast protocols for communicating with them. Such a group consists of a set of processes cooperating to execute a distributed algorithm, to replicate data, to provide a service fault-tolerantly, or otherwise to exploit distribution.
Although
ISIS supports a wide range of multicast
protocols, a protocol
called CBCAST accounts for the majority of communication
in the system. In
fact, many of the ISIS tools are little more than invocations of this communication primitive.
For example, the ISIS replicated data tool uses a single (asynchronous) CBCAST to perform each update and locking operation; reads require no communication at all. A consequence is that the cost of CBCAST represents the dominant performance bottleneck in the ISIS system.
The original ISIS CBCAST protocol was costly, in part for structural reasons and in part because of the protocol used [6]. The implementation was within a protocol server, hence all CBCAST communication was via an indirect path. Independent of the cost of the protocol itself, this indirection was expensive. Furthermore, the protocol server proved difficult to scale, limiting the initial versions of ISIS to networks of a few hundred nodes. With respect to the protocol used, our initial implementation favored generality over specialization, thereby permitting extremely flexible destination addressing. It used a piggybacking algorithm that achieved the CBCAST ordering property but required periodic garbage collection.
The case for flexibility in addressing seems weaker today. Experience with ISIS has left us with substantial insight into how the system is used, permitting us to focus on core functionality. The protocols described in this paper support highly concurrent applications, scale to systems with large numbers of potentially overlapping process groups, and bound the overhead associated with piggybacked information in proportion to the size of the process groups to which the sender of a message belongs. Although slightly less general than the earlier solution, the new protocols are able to support the ISIS toolkit and all ISIS applications with which we are familiar. The benefit of this reduction in generality has been a substantial increase in the
ACM Transactionson ComputerSystems,Vol 9,No. 3, August 1991
performance of the system, with no evident loss of functionality: the new protocols support virtually all of the communication patterns observed in ISIS applications, a message carries only a small amount of piggybacked information, proportional to the number of groups to which the sender belongs, and delivery of a multicast is delayed only if it actually arrives out of order.

The paper is structured as follows. Section 2 discusses the types of process groups supported by ISIS and the patterns of group usage and communication observed in current applications. Section 3 surveys prior work on multicast. Section 4 formalizes the virtually synchronous execution model and the ordering properties that a CBCAST or ABCAST protocol must satisfy. Section 5 introduces our new protocol suite in the case of a single process group, and Section 6 extends the technique to multiple, potentially overlapping groups. Section 7 considers the performance and scalability of our implementation, together with a discussion of ISIS-specific issues. The paper concludes with Section 8.

2. EXPERIENCE WITH PROCESS GROUPS
We begin by reviewing the types of process groups supported by ISIS and the ways in which existing applications use them; this material is discussed more fully by Birman and Cooper [3]. ISIS supports four styles of group usage, illustrated in Figure 1.

In a peer group, processes cooperate closely: they may replicate data, subdivide tasks among themselves, and monitor one another's status in order to take coordinated action. In a client/server group, the servers form a small peer group, while a potentially large number of clients register with the group and interact with it through RPC-style request/reply multicasts. A diffusion group is a type of client/server group in which the servers broadcast messages to the full set of clients; the clients are passive, simply receiving the data as it is disseminated. Finally, hierarchical groups are tree-structured sets of groups, used when an application must scale to large numbers of processes or sites. A client makes an initial connection to a root group, which maps it to an appropriate subgroup; data and group membership are partitioned among the subgroups. For example, in a brokerage application, a large hierarchical group might be used to disseminate quotes on a trading floor, with each client picking one subgroup through which to register interest in a favorite set of stocks. Many ISIS applications mix these styles, overlaying groups on one another when a larger structure is desired.
Fig. 1. Types of process groups: peer groups, client/server groups, diffusion groups, and hierarchical groups.

Process groups generally contain a small number of processes (e.g., 3-5 processes), although a client/server or diffusion group may have large numbers of clients. In studies of existing ISIS applications [3, 4] we concluded that groups are used in just the ways discussed above, that membership changes are fairly infrequent relative to the rate of communication, and that the number of groups in a system can be large (hundreds). Part of this pattern reflects artifacts of the current ISIS implementation: group operations were heavy-weight enough that acceptable performance was obtained only by using groups sparingly.

Looking to the future, we expect these characteristics to change. As the protocols described here move closer to the operating system, groups will get much cheaper, and we expect applications to employ very large numbers of groups, joining and leaving them dynamically. A network information service might use a group for each data object it manages; a load-sharing service might use groups to represent sets of cooperating entities. To illustrate, consider a scientific computing application simulating an n-dimensional grid on hundreds of sites: each grid element, or small set of elements, might naturally be represented by a process group, with neighbors multicasting to one another as the computation proceeds. Individual groups would remain small, but the number of groups could be huge, membership could change dynamically, and the costs of joining, leaving, and multicasting to groups would become a dominant performance factor. The protocols of this paper were designed with both the current usage patterns and these anticipated ones in mind.
The grid simulation could also be built using a single large process group, with individual data items replicated within small subsets of its members and communication directed to dynamically changing subsets of the membership. Such an architecture would require multicast protocols that scale well and impose overhead proportional only to the size of the destination set; the earlier ISIS protocols were too inflexible and costly to support it. The protocols developed here respond directly to these needs, and could also enable new classes of applications, for example in parallel processing, that have until now proven difficult to support; indeed, enabling such applications was a primary motivation for us. Issues such as realtime deadlines and message priorities, however, are not considered here.

3. PRIOR WORK ON GROUP COMMUNICATION

Our communication protocols evolved from the earlier ISIS protocols [6] and from an exploration of multicast ordering by Schiper et al. [25]; they were also influenced by the vector-time work of Fidge [13] and Mattern [19]. Peterson's Psync protocol [20] addresses causal message delivery, but only in the context of a single process group; it does not treat large numbers of dynamic, overlapping groups, and does not optimize for the asynchronous, group-structured client/server communication patterns that are central to ISIS. Ladin's work [16] uses a causal multicast as part of a replication method; her solution, however, uses stable storage as part of its notion of message stability and fault-tolerance, whereas our protocol requires no external storage. Elsewhere [4] we argue that causal, fault-tolerant multicast is a central feature of systems supporting group-structured distributed applications; the protocols reported here generalize this style of communication to multiple, overlapping process groups while preserving performance.

Our ABCAST protocol provides a totally ordered message delivery that extends the causal order, thereby permitting it to be used asynchronously. A delivery order might be total without being causal [6, 29]; indeed, several of the protocols cited above would not provide this guarantee. Treating causal delivery as the basic primitive, and deriving the total order from it, is a central contribution of this paper.
4. EXECUTION MODEL

We now formalize the execution model and the ordering properties that our multicast protocols must satisfy.

4.1 Basic System Model

The system is composed of a set P of processes with disjoint memory spaces. Initially, we assume that this set is static and known in advance; later we relax this assumption. Processes fail by crashing detectably (a fail-stop assumption); a failure detection mechanism, described below, provides notification of failures to the operational processes. Processes that need to cooperate, e.g., to subdivide a computation or to manage replicated data, can monitor one another's status by joining a common process group. The membership of a group g is presented to its members as a sequence of views, where

(1) view_0(g) ⊆ P gives the initial membership of g, and view_{i+1}(g) differs from view_i(g) by the addition or subtraction of exactly one process.
Processes learn of the failure of other group members only through this
view mechanism, never through any sort of direct observation.
We assume that direct communication between processes is always possible; the software implementing this is called the message transport layer.
Within our protocols, processes always communicate
using point-to-point
and
multicast
messages; the latter may be transmitted
using multiple
point-to-point messages if no more efficient alternative
is available.
The transport
communication
primitives
must provide lossless, uncorrupted,
sequenced
message delivery.
The transport layer also interacts with the failure detection mechanism. A process that hangs for an extended period (e.g., while waiting for a paging store to respond) cannot be distinguished from one that has failed. The system treats both possibilities the same way, as permanent failures: once a failure detection has been made, the transport layer intercepts and discards messages from the faulty process, which can resume participation in the system only by running a recovery protocol. Obviously, this choice might exclude a process that could perhaps have resumed normal operation, but it guards the protocols against transient problems and permits a simple architecture. Our transport architecture also permits application builders to define new transport protocols, perhaps to take advantage of special hardware. The implementation described in this paper uses a transport layer built over an unreliable datagram layer.
The execution of a process is a partially ordered sequence of events, each corresponding to the execution of an indivisible action. An acyclic event order, denoted →, reflects the dependence of events occurring at process p upon one another. The event send_p(m) denotes the transmission of m by process p to a set of one or more destinations, dests(m); the reception of message m by process p is denoted rcv_p(m). We omit the subscript when the process is clear from the context. If |dests(m)| > 1 we will assume that send puts messages into all communication channels in a single action that might be interrupted by failure, but not by other send or rcv actions. We denote by rcv_p(view_i(g)) the event by which a process p belonging to g learns of view_i(g).
We distinguish the event of receiving a message from the event of delivery, since this allows us to model protocols that delay message delivery until some condition is satisfied. The delivery event is denoted deliver_p(m), where rcv_p(m) → deliver_p(m).
The causal order → on events is the transitive closure of the following relations:

(1) if e →_p e′ for some process p, then e → e′;
(2) ∀m: send(m) → rcv(m).

For messages m and m′, the notation m → m′ will be used as a shorthand for send(m) → send(m′). Finally, for demonstrating liveness, we assume that any message sent is eventually received unless the sender or destination fails, and that process failures are detected and eventually reflected in new group views omitting the failed process.

4.2 Virtual Synchrony Properties Required of Multicast Protocols
The multicast protocols must provide two properties:

(1) Address expansion. All destinations of a multicast are in identical group views when the message arrives, and the message is delivered to the membership list of that view.

(2) Delivery atomicity and order. This involves delivery of messages fault-tolerantly (either all operational destinations eventually receive a message, or, and only if the sender fails, none do). Furthermore, when multiple destinations receive the same message, they observe consistent delivery orders, in one of the two senses detailed below. The causal delivery order requires that if m → m′, then

∀p ∈ dests(m) ∩ dests(m′): deliver_p(m) → deliver_p(m′).
CBCAST provides only the causal delivery ordering. If two CBCASTS are
concurrent,
the protocol places no constraints
on their relative
delivery
ordering at overlapping
destinations.
ABCAST extends the causal ordering
into a total one, by ordering concurrent messages m and m′ such that

∀m, m′; ∀p, q ∈ g: deliver_p(m, g) → deliver_p(m′, g) implies deliver_q(m, g) → deliver_q(m′, g).
The two orderings are closely related; in fact, ABCAST can be implemented in terms of CBCAST, and our protocols do exactly this [5, 26]. Schmuck [26] demonstrates that many algorithms specified in terms of a totally ordered multicast can be modified to use an ordering property slightly weaker than the causal CBCAST ordering, without compromising correctness. Further, he shows that for this class of algorithms CBCAST can emulate ABCAST, and that asynchronous delivery orderings permit such algorithms to run almost entirely without blocking: a process can receive and act on a message without first observing the complete delivery ordering of prior multicasts. Fault tolerance in our model additionally requires that any multicasts still in progress from a faulty sender at the time of its failure be flushed from the system before the failure handling is considered complete.
4.3 Vector Time

Our delivery protocol is based on a type of logical clock called a vector clock. The vector time protocol maintains sufficient information to represent → precisely. A vector time for a process p_i, denoted VT(p_i), is a vector of length n (where n = |P|), indexed by process-id.

(1) When p_i starts execution, VT(p_i) is initialized to zeros.
(2) For each event send(m) at p_i, VT(p_i)[i] is incremented by 1.
(3) Each message m sent by process p_i is timestamped with the incremented value of VT(p_i), denoted VT(m).
(4) When process p_j delivers a message m carrying VT(m), it modifies its vector clock as follows: ∀k ∈ 1..n: VT(p_j)[k] = max(VT(p_j)[k], VT(m)[k]).

Vector times are compared as follows: VT_1 ≤ VT_2 iff ∀i: VT_1[i] ≤ VT_2[i], and VT_1 < VT_2 iff VT_1 ≤ VT_2 and VT_1 ≠ VT_2. Vector times represent causality precisely: for messages m and m′, m → m′ iff VT(m) < VT(m′).

Vector times were proposed in this form by Fidge [13] and Mattern [19]; other researchers have also used vector times for this purpose [30]. As noted earlier, our work is an outgrowth of the protocol in [25], which uses vector times as the basis of a causal multicast.
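The vector-time rules above can be sketched in a few lines. This is an illustrative sketch under our own naming and list representation, not ISIS code:

```python
def new_vt(n):
    """Rule (1): a process starts with its vector time at all zeros."""
    return [0] * n

def on_send(vt, i):
    """Rules (2)-(3): sending process p_i increments VT(p_i)[i]; the
    message is timestamped with the incremented vector."""
    vt[i] += 1
    return list(vt)          # VT(m) is a copy of the sender's clock

def on_deliver(vt, vt_m):
    """Rule (4): on delivery, merge VT(m) into the local clock, entrywise max."""
    for k in range(len(vt)):
        vt[k] = max(vt[k], vt_m[k])

def vt_lt(a, b):
    """VT1 < VT2 iff VT1 <= VT2 entrywise and the vectors differ;
    by the precision property, m -> m' iff vt_lt(VT(m), VT(m'))."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```

For example, with n = 3, if p_0 multicasts m_1, and p_1 delivers m_1 and then multicasts m_2, then VT(m_1) = [1, 0, 0] and VT(m_2) = [1, 1, 0], so vt_lt reports m_1 → m_2 but not the converse.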
5. THE CBCAST AND ABCAST PROTOCOLS
This section presents our new CBCAST and ABCAST protocols. We initially
consider the case of a single process group with fixed membership;
multiple
group issues are addressed in the next section. This section first introduces
the causal delivery protocol, then extends it to a totally ordered ABCAST
protocol, and finally considers view changes.
5.1 CBCAST Protocol
The protocol proceeds as follows:

(1) Before sending m, process p_i increments VT(p_i)[i] and timestamps m with the resulting vector time, VT(m).

(2) On reception of message m sent by p_i and timestamped with VT(m), process p_j ≠ p_i delays delivery of m until, for all k in 1..n:

    VT(m)[k] = VT(p_j)[k] + 1   if k = i
    VT(m)[k] ≤ VT(p_j)[k]       otherwise
Process pj need not delay messages received from itself. Delayed messages are maintained
on a queue, the CBCAST delay queue. This queue is
sorted by vector time, with concurrent
messages ordered by time of
receipt (however, the queue order will not be used until later in the
paper).
(3) When a message m is delivered, VT(p_j) is updated in accordance with the vector time protocol from Section 4.3.
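The delivery rule of step (2), together with the delay queue, can be sketched directly; the function names and queue representation here are our own, and group addressing and fault handling are omitted:

```python
def can_deliver(vt_j, vt_m, i, j):
    """Step (2) of Section 5.1: p_j may deliver m from p_i only when m is
    the next message expected from p_i (VT(m)[i] == VT(p_j)[i] + 1) and
    p_j has delivered everything m causally depends on
    (VT(m)[k] <= VT(p_j)[k] for k != i). p_j never delays its own messages."""
    if i == j:
        return True
    return all(
        vt_m[k] == vt_j[k] + 1 if k == i else vt_m[k] <= vt_j[k]
        for k in range(len(vt_j))
    )

def drain_delay_queue(vt_j, queue, j):
    """Deliver every queued (sender, VT(m)) pair the rule admits, merging
    timestamps as in rule (4); returns the resulting delivery order."""
    delivered = []
    progress = True
    while progress:
        progress = False
        for i, vt_m in list(queue):
            if can_deliver(vt_j, vt_m, i, j):
                queue.remove((i, vt_m))
                for k in range(len(vt_j)):
                    vt_j[k] = max(vt_j[k], vt_m[k])
                delivered.append((i, vt_m))
                progress = True
    return delivered
```

If p_2 receives m_2 = [1, 1, 0] from p_1 before m_1 = [1, 0, 0] from p_0, the rule holds m_2 on the delay queue until m_1 has been delivered, yielding the causal order m_1, m_2.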
Fig. 2. Using the VT rule to delay message delivery.

Safety. Consider a process p_j that receives two messages m_1 and m_2, where m_1 → m_2 and p_i is the sender of m_1. We argue that p_j delivers m_1 before m_2. Initially, p_j has delivered no messages, hence

    VT(p_j)[i] < VT(m_1)[i].   (1)

Inductive step. Suppose p_j has received k messages, none of which is a message m such that m_1 → m. If m_1 has not yet been delivered, then

    VT(p_j)[i] < VT(m_1)[i].   (2)
This follows because the only way to assign a value to VT( p~)[ i] greater than
VI( ml)[ i] is to deliver a message from p, that was sent subsequent
to ml,
and such a message would be causally dependent on ml. From relations
1
and 2 it follows that
    VT(p_j)[i] < VT(m_2)[i],

and by application of the delivery rule, the next message delivered by p_j cannot be m_2.

Liveness. Suppose there is some message m, sent by p_i and not transmitted by process p_j, that can never be delivered at p_j; that is,

    VT(m)[k] ≠ VT(p_j)[k] + 1 for k = i, or
    ∃k ∈ 1..n, k ≠ i: VT(m)[k] > VT(p_j)[k].

We consider these cases in turn.
5.2 Causal ABCAST Protocol

We now extend the CBCAST protocol into a causal, totally ordered ABCAST protocol. Assume that each message is uniquely identified by uid(m), and that within each group a single process holds a token. To ABCAST m, a process holding the token uses CBCAST to transmit m in the normal manner. If the sender is not holding the token, the ABCAST is done in stages:
(1) The sender transmits m using CBCAST, but with a flag marking it as an undeliverable ABCAST message. A process that receives such a message places it on the CBCAST delay queue in the normal manner, but continues to delay it even after the VT rule of Section 5.1 is satisfied (liveness of the basic CBCAST protocol is not affected, since deliverability is re-evaluated below).

(2) Eventually the message is received by the token holder, where it is delayed in the same manner.

(3) The token holder, when desired, uses CBCAST to transmit a sets-order message giving an order for the undeliverable ABCAST messages it has received; a batch of messages may be ordered at once, and the sets-order message may be piggybacked on any other outgoing message. All CBCAST messages that causally precede the sets-order message will be received before it.

(4) On receipt of a sets-order message, a process reorders the delayed ABCAST messages in its delay queue into the given order, placing them after any messages that must be delivered first, and marks them as deliverable. Recall that our protocol places only a partial order on concurrent messages; step (4) reorders the concurrent ABCAST messages in the delay queue.

(5) Deliverable ABCAST messages may be delivered off the front of the delay queue.
Step 4 is the key to the protocol: because the sets-order message is itself a CBCAST, it is ordered consistently with causality, and all processes that receive it will deliver these ABCAST messages in the same order. The token holder may also move the token to a new location; such a transfer is treated as if it were done by a sets-order message, so the ordering guarantee is preserved. If all multicasts originate at the token holder, CBCAST can be used directly and an ABCAST costs no more than a CBCAST.¹ If multicasts originate randomly

¹ It might appear that the cost of an ABCAST is double that of a CBCAST, because one sets-order message is sent per ABCAST. One solution would be to forward each ABCAST to the token holder, which multicasts it directly; but funneling all such traffic repeatedly through one site creates a likely bottleneck, roughly doubling the I/O load at the token holder while reducing the I/O load on the other participants only to a minor degree.
and the token is not moved, the cost is 1 + 1/k CBCASTS per
ABCAST, where we assume that one sets-order
message is sent for ordering
purposes after k ABCASTS.
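The five steps above can be sketched as follows; the class and function names are ours, message payloads stand in for real multicasts, and token movement and failures are omitted:

```python
class AbcastEndpoint:
    """Sketch of steps (1)-(5): non-token ABCASTs arrive via CBCAST but
    are held undeliverable until a sets-order message fixes their order."""
    def __init__(self):
        self.delayed = []        # (uid, payload), in order of receipt
        self.delivered = []

    def receive_abcast(self, uid, payload):
        # Steps (1)-(2): the message stays on the delay queue, undeliverable.
        self.delayed.append((uid, payload))

    def receive_sets_order(self, ordered_uids):
        # Step (4): reorder the delayed ABCASTs as dictated and mark them
        # deliverable; step (5): deliver them off the queue.
        by_uid = dict(self.delayed)
        for uid in ordered_uids:
            if uid in by_uid:
                self.delivered.append((uid, by_uid.pop(uid)))
        self.delayed = [(u, m) for (u, m) in self.delayed if u in by_uid]

def token_holder_order(delayed):
    # Step (3): any deterministic rule will do for concurrent messages;
    # here the token holder simply sorts the pending uids.
    return sorted(u for (u, _) in delayed)
```

Since the sets-order message is itself sent by CBCAST, every group member applies the same reordering, so all endpoints deliver the batch identically.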
5.3 Multicast Stability
The knowledge that a multicast has reached all its destinations
will be useful below. Accordingly, we will say that a multicast m is k-stable if it is known that the multicast has been received at k destinations. When k = |dests(m)| we will say that m is (fully) stable.
Recall that our model assumes a reliable transport
layer. Since messages
can be lost in transport
for a variety of reasons (buffer overflows, noise on
communication
lines, etc.), such a layer normally
uses some sort of positive
or negative
acknowledgement
scheme to confirm
delivery.
Our liveness assumption is strong enough to imply that any multicast m will eventually become stable, if the sender does not fail. It is thus reasonable to assume that
this information
is available to the process that sent a CBCAST message.
To propagate stability information among the members of a group, we introduce the following convention. Each process p_i maintains a stability sequence number, stable(p_i). This number will be the largest integer n such that all multicasts by sender p_i having VT(p_i)[i] ≤ n are stable.
When sending a multicast m, process p_i piggybacks its current value of stable(p_i); recipients make note of these incoming values. If stable(p_i) changes and p_i has no outgoing messages then, when necessary (see below), p_i can send a stability message containing the updated value.
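The stability counter can be maintained with a short loop. Here `acks` is an assumed bookkeeping map, not part of the paper's protocol, from each of p_i's multicast sequence numbers to the count of destinations known to have received that multicast:

```python
def advance_stable(stable, acks, n_dests):
    """stable(p_i) is the largest n such that every multicast numbered <= n
    has been received at all n_dests destinations; advance it past each
    fully acknowledged multicast, in sequence order, stopping at a gap."""
    while acks.get(stable + 1, 0) >= n_dests:
        stable += 1
    return stable
```

Note that a gap blocks advancement: if multicast 3 is not yet fully acknowledged, stable(p_i) stays below 3 even when multicast 4 is, matching the "all multicasts having VT(p_i)[i] ≤ n" definition.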
5.4 VT Compression
It is not always necessary to transmit the full vector timestamp on each message.
LEMMA 1. A multicast message m need carry only those vector timestamp fields that have changed since the sender's last multicast.
PROOF.
Consider two multicasts m and m′ such that m → m′. If p_i is the sender of m and p_j is the sender of m′, then there are two cases. If i = j then VT(m)[i] < VT(m′)[i], hence (step 2, Section 5.1) m′ cannot be delivered until after m is delivered. Now, if i ≠ j but m′ carries the field for p_i, then VT(m)[i] ≤ VT(m′)[i], and again, m′ will be delayed under step 2 until after m is delivered. But, if this field is omitted, there must be some earlier message m″, also multicast by p_j, that did carry the field. Then m will be delivered before m″ and, under the first case, m″ will be delivered before m′.
If a process sends a sequence of multicasts without receiving any intervening messages, all the messages but the first will just contain one field. Moreover, in this case, the
value of the field can be inferred from the FIFO property of the channels, so
such messages need not contain any timestamp.
We will make further use of
this idea below.
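The compression of Lemma 1 and its inverse can be sketched with a dictionary of changed fields (our representation); decompression relies on the FIFO channels having delivered the sender's prior timestamp first:

```python
def compress_vt(vt_m, prev_sent):
    """Lemma 1: keep only the fields that have changed since the sender's
    last multicast; index -> value for each changed entry."""
    return {k: v for k, v in enumerate(vt_m) if v != prev_sent[k]}

def decompress_vt(fields, prev_seen):
    """Rebuild VT(m) by overlaying the changed fields on the last full
    timestamp reconstructed for this sender (available because channels
    are lossless and FIFO)."""
    vt = list(prev_seen)
    for k, v in fields.items():
        vt[k] = v
    return vt
```

In the common case of a burst of multicasts from one sender, only that sender's own entry changes between messages, so each compressed timestamp is a single field.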
5.5 Delivery Atomicity and Group Membership Changes
We now consider four problems raised by group membership changes:

(1) Virtually synchronous addressing.
(2) Reinitializing VT timestamps.
(3) Delivery atomicity when failures occur.
(4) Handling delayed ABCAST messages when the token holder fails without sending a sets-order message.
Virtually synchronous addressing. To achieve virtually synchronous addressing when group membership changes while multicasts are active, we introduce the notion of flushing the communication in a process group. Initially, we will assume that processes do not fail or leave the group (we treat these cases in a subsequent subsection). Consider a process group in view_i. Say that view_{i+1} now becomes defined. We can flush communication by having all the processes in view_{i+1} send a message flush_{i+1} to all other processes in this view. After sending such messages, and before receiving such a flush message from all members of view_{i+1}, a process will accept and deliver messages but will not initiate new multicasts. Because communication is FIFO, if process p has received a flush message from all processes in view_{i+1}, it will first have received all messages that were sent by members of view_i. In the absence of failures, this establishes that multicasts will be virtually synchronous in the sense of Section 4.
A disadvantage of this flush protocol is that it sends n² messages. Fortunately, the solution is readily modified into one that uses a linear number of messages. Still deferring
the issue of failures, say that we designate one member of view_{i+1}, p_c, as the flush coordinator. Any deterministic rule can be used for this purpose. A process p, other than the coordinator, flushes by first waiting until all multicasts that it sent in view_i have stabilized, and then sending a view_{i+1} flush message to the coordinator.² The coordinator, p_c, waits until flush messages have been received from all other members of view_{i+1}, and then multicasts its own flush message to the members of view_{i+1}.

² Actually, it need not wait until the coordinator has received its active multicasts, since channels are FIFO.
Reinitializing VT fields. After executing the flush protocol, all processes reset the fields of VT to zero. This is clearly useful. Accordingly, we will assume that the vector timestamp sent in a message, VT(m), includes the index i of the view view_i in which m was transmitted. A vector timestamp carrying an expired view index may be ignored, since the flush protocol used to install the new view will have forced the delivery of all multicasts that could have been sent in the prior view.
Delivery atomicity and virtual synchrony when failures occur. We now consider the case in which processes fail during an execution. We discuss processes that voluntarily leave a group at the end of this subsection.
Fig. 3. Group flush algorithm: virtually synchronous addressing versus physical message timing.
(2) On receiving a message m sent in view(m): if view(m) < i, p ignores m as a duplicate. If m was sent in view_i, p applies the normal delivery rule, identifying and discarding duplicates in the obvious manner. If m was sent in view(m) > i, p saves m until view(m) has been installed.
(3) On receiving flush_{i+k} messages from all processes in view_{i+1} ∩ view_{i+k} (for any k ≥ 1), p installs view_{i+1} by delivering it to the application layer. It decrements the inhibit_sends counter and, if the counter is now zero, permits new messages to be initiated.
(4) Any message m that was sent in view_i and has not yet been delivered may be discarded. This can only occur if the sender of m has failed and a message m′ (m → m′) that some process had previously received and delivered has now been lost. In such a situation, m is an orphan of a system execution that has been erased by the failure.⁴
⁴ Notice that in our failure model, a process accepts and delivers a message as soon as the delivery rule permits, even before the message becomes stable; a message forwarded during the flush can be discarded under step (1) if its sender has failed. Multiple failures could thus lead to a situation in which a message m is delivered at some destinations but not at others, once every process holding a copy has failed. Our definition of delivery atomicity could be changed to exclude such executions, but atomicity in that stronger sense could be achieved only by using a much more costly two-phase delivery protocol.
(5) A process p that sent a message m in view_i will now consider m to have stabilized if delivery confirmation has been obtained from all processes in view_i ∩ view_{i+k}, for any value of k, or if view_{i+1} has been installed.
LEMMA 2. The flush algorithm is safe and live.
PROOF. A participant in the protocol, p, delays installation of view_{i+1} until it has received flush messages sent in view_{i+k} (k ≥ 1) from all members of view_{i+1} that are still operational in view_{i+k}. These are the only processes that could have a copy of an unstable message sent in view_i. This follows because messages from failed processes are discarded, and because messages are only forwarded to processes that were alive in the view in which they were originally sent. Because participants forward unstable messages before sending the flush, and channels are FIFO, p will have received all messages sent in view_i before installing view_{i+1}. It follows that the protocol is safe.
Liveness follows because the protocol to install view_{i+1} can only be delayed by the failure of a member of view_{i+1}. Since view_{i+1} has a finite number of members, any delay will be finite. □
This is illustrated in Figure 3, in which the flush protocol converts an execution containing nonatomic multicasts into one in which all multicasts are delivered atomically, at linear cost in the number of messages needed to terminate the protocol for view_{i+1}.
A slight change to the flush algorithm is needed for the case in which a process leaves a group voluntarily, rather than by failing. In ISIS, a departure triggers a view change much as a failure does, but the departing member first runs the flush protocol and only then terminates. In this way, the case where the departing process is the only member to have received some multicast is treated similarly to a failure in which the multicast was interrupted, and delivery atomicity will still be preserved.
ABCAST ordering when the token holder fails.
The atomicity mechanism of the preceding subsection requires a small modification of the ABCAST protocol. Consider an ABCAST that is sent in view_i and for which the token holder fails before sending the sets-order message. Delivery of any such message must be delayed until completion of the flush protocol, by which point view_{i+1} will have been installed. Notice that this is a well-known problem: the set of delayed messages within any ABCAST ordering queue is only partially ordered. We can solve this by ordering any concurrent messages within the set using a deterministic rule; for example, they can be sorted by uid. The resulting total order on ABCAST messages will be the same at all processes and consistent with causality.

6. EXTENSIONS TO THE BASIC PROTOCOL
Neither
of the protocols in Section 5 is suitable for use in a setting with
multiple
process groups. We first introduce
the modifications
needed to
extend CBCAST
to a multigroup
setting.
We then briefly
examine the
problem of ABCAST in this setting.
The CBCAST solution we arrive at initially
could be costly in systems with
very large numbers of process groups or groups that change dynamically.
This has not been a problem in the current ISIS system because current
applications
use comparatively
few process groups, and processes tend to
multicast for extended periods in the same group. However, these characteristics will not necessarily hold in future ISIS applications.
Accordingly, the second part of the section explores additional extensions of the protocol that would permit its use in settings with very large numbers of very dynamic process groups. The resulting protocol is interesting because it exploits properties of what we call the communication structure of the system.

6.1 Extension of CBCAST to Multiple Groups
Fig. 4. Messages sent within process groups. (The figure shows processes p1 through p4 exchanging messages m1 and m2, with timestamps ((1,0,0,*),(*,0,0,0)) and ((1,0,0,*),(*,1,0,1)).)
Suppose that a process p_j receives a copy of a message m sent in group g_a by process p_i. Then p_j delays delivery of m until:

2.1 VT_a(m)[i] = VT_a(p_j)[i] + 1, and
2.2 ∀k (k ≠ i): VT_a(m)[k] ≤ VT_a(p_j)[k], and
2.3 ∀g (g ∈ G_j): VT_g(m) ≤ VT_g(p_j).
This is just the original protocol modified to iterate over the set of groups to
which a process belongs. As in the original
protocol, pJ does not delay
messages received from itself.
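Conditions 2.1 through 2.3 can be read as a single delivery predicate. The sketch below assumes vector timestamps represented as dicts keyed by process name, one per group; the names and data layout are illustrative assumptions, not the ISIS implementation.

```python
def can_deliver(m_vts, sender, group, p_vts):
    """Multigroup delivery rule, sketched. `m_vts` maps each group named in
    the message to its vector timestamp; `p_vts` maps each of p_j's groups
    to p_j's current VT. Missing entries are treated as zero."""
    vt_a, my_a = m_vts[group], p_vts[group]
    # 2.1: m is the next message p_j expects from its sender in g_a
    if vt_a.get(sender, 0) != my_a.get(sender, 0) + 1:
        return False
    # 2.2: p_j already saw every other g_a event that causally precedes m
    if any(t > my_a.get(k, 0) for k, t in vt_a.items() if k != sender):
        return False
    # 2.3: for each other group g of p_j, VT_g(m) <= VT_g(p_j)
    return all(t <= p_vts[g].get(k, 0)
               for g in p_vts if g != group
               for k, t in m_vts.get(g, {}).items())
```

A message whose timestamps for a second group run ahead of the receiver's state for that group is held back, even though its entries for the sending group alone would be deliverable.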
Figure 4 illustrates
the application
of this rule in an example with four
processes identified as PI . . . P4. processes
P1, PZ and P3 belong
to grOUP Gl,
5 Clearly, if p_i is not a member of g_a, then VT_a[i] = 0, allowing a sparse representation of the timestamp. For clarity, our figures will continue to depict each VT timestamp as a vector of length n, with a special entry (*) for each process that is not a member of g_a.
Since this is done within groups, these entries are irrelevant to p_j.

6.2 Multiple-Group ABCAST
When run over this extended CBCAST protocol, our ABCAST solution will continue to provide a total, causal delivery ordering within any single process group. However, it will not order multicasts to different groups, even if those groups have members in common. In a previous article [4] we examine the need for a global ABCAST ordering property and suggest that it may be of little practical importance, since the single-group protocol satisfies the requirements of all existing ISIS applications of which we know. We have extended our ABCAST solution to a solution that provides a global total ordering; the resulting protocol, in any single group g, has a cost proportional to the number of other groups that overlap with g. Details are presented by Stephenson [28]. This extended, global-ordering ABCAST protocol could be implemented if the need arises.
6.3 Extended VT Compression

A process not in g_a will include the updated value of VT_a, received from some process in a message m, when it multicasts. We will say that a group g is active for a process p if p is the initiator of a multicast to g that is not stable, or has received an unstable multicast to g.
Activity
is a local property; i.e., process p can compute whether or not some
group g is active for it by examining
local state. Moreover, g may be active for p at a time it is inactive for q (in fact, by delaying the delivery of messages, a process may force a group to become inactive just by waiting until any active multicasts stabilize).
Now, we modify step 2 of the extended protocol as follows. A process p that receives a multicast m in group g_a must delay m until any multicasts m' apparently concurrent with m, and previously received in other groups g_b (b ≠ a), have been delivered locally. For example, say that process p receives m1 in group g1, then m2 in group g2, and then m3 in group g1. Under the rule, m2 must not be delivered until after m1. Similarly, m3 must not be delivered until after m2. Since the basic protocol is live in any single group, no message will be delayed indefinitely under this modified rule.
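The modified step 2 can be sketched with a vector-timestamp concurrency test; the representations and names below are illustrative assumptions.

```python
def leq(a, b):
    """Pointwise <= on vector timestamps represented as dicts (missing = 0)."""
    return all(t <= b.get(k, 0) for k, t in a.items())

def concurrent(x, y):
    """Two events are apparently concurrent if neither VT dominates."""
    return not leq(x, y) and not leq(y, x)

def must_delay(vt_m, received_other, delivered):
    """Modified step 2, sketched: m must wait while some message previously
    received in a different group is apparently concurrent with m and is
    still undelivered. `received_other` maps message id -> its VT."""
    return any(mid not in delivered and concurrent(vt, vt_m)
               for mid, vt in received_other.items())
```

A causally prior message is handled by the basic delivery rule; only apparently concurrent cross-group messages trigger this extra delay.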
Then, when sending messages (in any group), timestamps corresponding to
inactive groups can be omitted from a message. The intuition
here is that it
is possible for a stable message to have reached its destinations,
but still be
blocked on some delivery queues. Our change ensures that such a message
will be delivered before any subsequent messages received in other groups.
Knowing that this will be the case, the vector timestamp
can be omitted.
It is appealing
to ask how effective timestamp
compression
will be in
typical
ISIS applications.
In particular,
if the compression
dramatically
reduces the number of timestamps
sent on a typical message, we will have
arrived at the desired, low overhead, protocol. On the other hand, if compression is ineffective,
measures may be needed to further reduce the number of
vector timestamps
transmitted.
Unfortunately,
we lack sufficient experience
to answer this question experimentally.
At this stage, any discussion must
necessarily be speculative.
Recall from Section 2 that ISIS applications
are believed to exhibit communication locality. In our setting, locality would mean that a process that most
recently received a message in group g, will probably
send and receive
several times in g, before sending or receiving in some other group. It would
be surprising
if distributed
systems did not exhibit communication
locality,
since analogous properties are observed in almost all settings, reflecting the
fact that most computations involve some form of looping [12]. Process-group-based systems would also be expected to exhibit locality for a second reason:
when a distributed
algorithm
is executed in a group g there will often be a
flurry of message exchange by participants.
For example, if a process group
were used to manage transactionally
replicated data, an update transaction
might multicast
to request a lock, to issue each update, and to initiate
the commit protocol.
Such sequences of multicasts
arise in many ISIS
algorithms.
The extended compression rule benefits from communication
locality, since
few vector timestamps
would be transmitted
during a burst of activity.
Most
groups are small, hence those timestamps that do need to be piggybacked on
a message will be small. Moreover, in a system with a high degree of locality,
each process group through which a vector timestamp passes will delay the
timestamp briefly.
For example, suppose that a process that sends one multicast
in group g,
will, on the average, send and receive a total of n multicasts
in g, before
sending or receiving in some other group. Under our extended rule, only the
first of these multicasts
will carry vector timestamps
for groups other than
g,. Subsequent multicasts
need carry no vector timestamps
at all, since the
sender's timestamp can be deduced using the method of Section 5.4. Moreover, if the vector timestamp for a group g1 changes k times per second, members of an adjacent group g2 (that are not also members of g1) will see a rate of change of k/n. A group at distance d would see every n^d-th value, giving a rate of k/n^d per second. Thus, the combination of compression and communication locality can substantially reduce the vector timestamp overhead on messages. In fact, if most messages are sent in bursts, the average multicast may not carry any timestamps at all!
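The rates quoted above can be checked with a toy calculation; the values of k and n below are assumptions chosen for illustration.

```python
k = 100.0  # changes per second to VT_g1 (assumed)
n = 10     # multicasts a process sends in a group before switching (assumed)

def observed_rate(d):
    # a group at distance d sees only every (n**d)-th value of the timestamp
    return k / (n ** d)

print(observed_rate(1))  # adjacent group sees k/n = 10.0 changes per second
print(observed_rate(2))  # a group at distance 2 sees k/n^2 = 1.0 per second
```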
6.4 Garbage Collection of Vector Timestamps
6.5 Atomicity and Group Membership Changes
Recall that virtually synchronous addressing is implemented using a group flush protocol. In the absence of failures, the protocol of Section 5.5 continues to provide this property; when failures occur, atomicity and addressing need to be reconsidered.
Although it may now be the case that some messages arrive carrying stale
vector timestamps corresponding
to group views that are no longer installed,
the convention of tagging the vector timestamp
with the view index number
for which it was generated permits the identification
of these timestamps,
which may be safely ignored: any messages to which they refer were delivered during an earlier view flush operation.
Atomicity and virtual synchrony when failures occur.
When
the flush
algorithm
of Section 5.5 is applied in a setting with multiple,
overlapping
groups, failures introduce problems not seen in single-group settings.
Consider two processes p1, p2 and two groups g1, g2, such that p1 belongs to g1 and p2 to both g1 and g2, and suppose the following event sequence occurs:

(1) p1 multicasts m1 to g1 in view(g1).
(2) p2 receives and delivers m1, while still in (view(g1), view(g2)).
(3) p2 multicasts m2 to g2, in (view(g1), view(g2)).
(4) Both p1 and p2 fail, causing the installation of view'(g1) and view'(g2).
Let q be a member of both groups that remained operational:

(1) If q delivers m2 before installing view'(g2), causality will be violated, because m1 was not delivered first.
(2) If q delivers m2 after installing view'(g2), atomicity will be violated, because m2 was sent in an earlier view.
Even worse, q may not be able to detect any problem in the first case. Here, although m2 will carry a vector timestamp reflecting the transmission of m1, the timestamp will be ignored as stale. In general, q will only recognize a problem if m2 is received before the flush of g1 has completed.
There are several ways to solve this problem. Since the problem does not
occur when a process communicates
only within
a single group, our basic
approach will be to intervene
when a process begins communicating
with
another group, delaying
communication
in group gz until causally prior
messages to group g1 are no longer at risk of loss. Any of the following rules
would have this effect:
One can build a k-resilient
protocol that operates by delaying communication outside a group until all causally previous messages are k-stable; as k
approaches n, this becomes a conservative
but completely
safe approach.
Here, the sender of a message may have to wait before being permitted to
transmit it.
Rely on the underlying
message transport
protocol to deliver the message
to all destinations
despite failures. For example, a token ring or ethernet
might implement a reliable multicast transport directly at the link level, so that messages are never lost at this stage.
Construct a special message-logging server which is always guaranteed to have a copy of any messages that have not yet been delivered to all their destinations. Such a facility would resemble the associative message store in the original ISIS system [7] and also recalls work by Powell and Presotto [23].
Our implementation uses the first of these schemes, and as initially structured, operates completely safely (i.e., with k = n). That is, a process p that has sent or received multicasts in group g1 will delay the initiation of a multicast in group g2 until the g1 multicasts are all fully stable. Our future system will be somewhat more flexible, allowing the process that creates a group to specify a value of k appropriate to the application.

6.6 Use of Communication Structure

Until the present, we have argued that the vector timestamps associated with each message could be linear in size in the total number of groups comprising the system, which in many systems is a large number; on the other hand, we have seen that compression could drastically reduce the vector size.9

9 Barry Gleeson is with the UNISYS Corporation.
Moreover, similar
constraints
arise in many published CBCAST protocols.
However, one can still imagine executions in which vector sizes would grow
to dominate message sizes. A substantial
reduction in the number of vector
timestamps that each process must maintain
and transmit
is possible in the
case of certain communication
patterns, which are defined precisely below.
Even if communication
does not always follow these patterns,
our new
solution can form the basis of other slightly more costly solutions which are
also described below.
Our approach will be to construct
an abstract description
of the group
overlap structure for the system. This structure will not be physically
maintained in any implementation,
but will be used to reason about communication properties of the system as it executes. Initially,
we will assume that
group membership
is frozen
and that the communication
structure of the
system is static. Later, in Section 6.7, we will extend these to systems with
dynamically
changing communication
structure.
For clarity, we will present
our algorithms
and proofs in a setting where timestamp
compression rules
are not in use; the algorithms
can be shown to work when timestamp
compression is in use, but the proofs are more complicated.
Define the communication structure of a system to be an undirected graph CG = (G, E) where the nodes, G, correspond to process groups and edge (g1, g2) belongs to E iff there exists a process p belonging to both g1 and g2.
If the graph so obtained has no biconnected component7 containing more than
k nodes, we will say that the communication
structure
of the system is
k-bounded. In a k-bounded communication
structure, the length of the largest
simple cycle is k.8 A 0-bounded communication
structure is a tree (we neglect
the uninteresting
case of a forest). Clearly, such a communication
structure
is acyclic.
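Footnote 7's definition gives a direct, brute-force way to test whether two groups share a biconnected component of CG, and hence to check k-boundedness on small graphs; the adjacency-dict layout is an illustrative assumption.

```python
def connected(adj, a, b, removed):
    """DFS reachability from a to b in the graph minus vertex `removed`."""
    seen, stack = {a}, [a]
    while stack:
        v = stack.pop()
        if v == b:
            return True
        for w in adj.get(v, ()):
            if w not in seen and w != removed:
                seen.add(w)
                stack.append(w)
    return False

def same_biconnected(adj, a, b):
    """Footnote 7, literally: a and b share a biconnected component iff a
    path between them survives removal of any other single vertex."""
    if a == b:
        return True
    others = [v for v in adj if v not in (a, b)]
    return all(connected(adj, a, b, r) for r in [None] + others)
```

In a cycle of four groups, opposite groups remain connected whichever single vertex is removed; in a tree, removing the interior vertex separates the leaves.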
Notice that causal communication cycles can arise even if CG is acyclic. For example, in Figure 4, messages m1, m2, m3 and m4 form a causal cycle spanning both g1 and g2. However, the acyclic structure restricts such communication cycles in a useful way. Below, we demonstrate that it is unnecessary to transport all vector timestamps on each message in the k-bounded case. If a given group is in a biconnected component of size k, processes in this group need only maintain and transmit timestamps for other groups in this biconnected component. We can also show that they need to maintain at least these timestamps. As a consequence, if the communication structure is acyclic, processes need only maintain the timestamps for the groups to which they belong.
We proceed to the proof of our main result in stages. First we address the special case of an acyclic communication structure, and show that if a system has an acyclic communication structure, each process in the system only maintains and multicasts the VT timestamps of groups to which it belongs.
7 Two vertices are in the same biconnected component of a graph if there is still a path between them after any other vertex has been removed.
8 The nodes of a simple cycle (other than the starting node) are distinct; a complex cycle may contain arbitrary repeated nodes.
Notice that this bounds the overhead on a message in proportion to the size
and number of groups to which a process belongs.
We will wish to show that if message m1 is sent (causally) before message mk, then m1 will be delivered before mk at all overlapping sites. Consider the chain of messages below.
Let the term message chain denote such a sequence of messages, and let the notation m_i → m_k mean that the sender of m_k transmits m_k using a timestamp that reflects the transmission of m_i. For example, let m be the k-th message transmitted by process p_i in group g_a. So m → m_j iff VT_a(p_j)[i] ≥ k, and consequently VT_a(m_j)[i] ≥ k. Our proof will show that if m_i → m_j and the destinations of m_i and m_j overlap, then m_i → m_j where p_j is the sender of m_j. Consequently, m_i will be delivered before m_j at any overlapping destinations.
We now note some simple facts about this message chain that we will use in the proof. Recall that a multicast to a group g_a can only be performed by a process p_i belonging to g_a. Also, since the communication structure is acyclic, processes can be members of at most two groups. Since mk and m1 have overlapping destinations, and the destination of m1 is either g1 or g2, p2, the destination of the final broadcast, is a member of g1 and of gk.
Since CG is acyclic, the message chain m1 ... mk simply traverses part of a tree, reversing itself at one or more distinguished groups. We will denote such a group g_r. Although causality information is lost as a message chain traverses the tree, we will show that when the chain reverses itself at some group g_r, the relevant information will be recovered on the way back.
LEMMA 3. If a system has an acyclic communication structure, each process maintains and multicasts only the VT timestamps of groups to which it belongs, and messages will be delivered in causal order at all overlapping destinations.

PROOF. The proof is by induction on l, the length of the message chain m1 ... mk. Recall that we must show that if m1 and mk have overlapping destinations, they will be delivered in causal order at all such destinations.
Base case. When l = 1, causal delivery is achieved with g1's timestamp, since pk = p2; the message will therefore be delivered correctly at any overlapping destinations.
Inductive
step.
Suppose that our algorithm
delivers all pairs of causally
related messages correctly if there is a message chain between them of length
l < k. We show that causality is not violated for message chains where l = k.
Consider a point in the causal chain where it reverses itself. We represent this by m_{r-1} → m_r → m_{r'} → m_{r+1}, where m_{r-1} and m_{r+1} are sent in g_{r-1} and g_{r+1} by p_r and p_{r+1} respectively, and m_r and m_{r'} are sent in g_r by p_r and p_{r'}. Note that p_r and p_{r+1} are members of both groups. This is illustrated in Figure 5. Now, m_{r'} will not be delivered at p_{r+1} until m_r has been delivered there, since they are both broadcast in g_r. We now have m_{r-1} → m_r → m_{r+1}.
We have now established a message chain between m1 and mk where l < k. So, by the induction hypothesis, m1 will be delivered before mk at any overlapping destinations, which is what we set out to prove. □
THEOREM 1. Each process p_i in a system needs only to maintain and multicast the VT timestamps of groups in the biconnected components of CG to which p_i belongs.
PROOF.
As with Lemma 3, our proof will focus on the message chain that
established a causal link between the sending of two messages with overlapping destinations.
This sequence may contain simple cycles of length up to k,
where k is the size of the largest biconnected component of CG. Consider the
simple cycle illustrated
below, contained in some arbitrary
message chain.
(a message chain in which p1, p2 and p3 multicast in a simple cycle of groups beginning and ending at g1)
Now, since pl, p2 and p3 are all in groups in a simple cycle of CG, all the
groups are in the same biconnected component of CG, and all processes on
the message chain will maintain
and transmit
the timestamps
of all the
groups. In particular,
when m_l arrives at p3, it will carry a copy of VT_{g1} indicating that m1 was sent. This means that m_l will not be delivered at p3 until m1 has been delivered there. So m_{l+1} will not be transmitted by p3 until m1 has been delivered there. Thus m1 → m_{l+1}. We may repeat this
process for each simple cycle of length greater than 2 in the causal chain,
reducing it to a chain within one group. We now apply Lemma 3, completing
the proof.
THEOREM 2. If a system is to deliver messages causally, it is both necessary and sufficient for each process p_i to maintain the VT timestamps corresponding to those groups in the biconnected component of CG to which p_i belongs.

Fig. 5. Causal reversal.

6.7 Extensions to Arbitrary, Dynamic Communication Structures
group g1 iff g1 is the only active group for process p, or p has no active group (the notion of an active group was defined in Section 6.3). If p attempts to multicast when this rule is not satisfied, it is simply delayed.
During this delay, incoming messages are not delivered. This means that all
groups will eventually
become inactive and the rule above will eventually
be
satisfied. At this point, the message is sent. It is now immediate
from the
extended compression rule of Section 6.3 that when a message m is multicast
in group g, only the sender's timestamp
for group g need be transmitted.
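The sending-side compression can be sketched as a filter over the sender's per-group timestamps; the active-group bookkeeping is assumed to exist, and the layout is illustrative.

```python
def timestamps_to_send(group, vts, active_groups):
    """Extended compression rule, sketched: when multicasting in `group`,
    piggyback only the timestamps of groups still active for the sender;
    inactive groups' messages are stable, so their timestamps are omitted."""
    keep = {group} | set(active_groups)
    return {g: vt for g, vt in vts.items() if g in keep}
```

With no other active groups, only the current group's timestamp is transmitted, matching the conservative rule's claim above.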
The conservative rule imposes a delay only when two causally successive messages are sent to different groups. Thus, the rule would be inexpensive in systems with a high degree of locality.
On the other hand, the overhead
imposed would be substantial
if processes multicast
to several different
groups in quick succession.
Multicast
epochs.
We now introduce an approach capable of overcoming
the delays associated with the conservative
rule but at the cost of additional
group flushes. We will develop this approach by first assuming that a process
leaves a group only because of a failure and then extending
the result to
cover the case of a process that leaves a group but remains active (as will be
seen below, this can create a form of phantom cycle).
Assume that CG contains cycles but that some mechanism has been used to select a subset of edges X such that CG' = (G, E − X) is known to be acyclic. We extend our solution to use the acyclic protocol proved by Lemma 3 for most communication within groups. If there is some edge (g, g') ∈ X, we will say that one of the two groups, say g, must be designated as an excluded group. In this case, all multicasts to or from g will be done using the protocol described below.
Keeping track of excluded groups could be difficult;
however, it is easy to
make pessimistic
estimates (and we will derive a protocol that works correctly with such pessimistic
estimates).
For example, in ISIS, a process p
might assume that it is in an excluded group if there is more than one other
neighboring
group. This is a safe assumption; any group in a cycle in CG will
certainly have two neighboring
groups. This subsection develops solutions for
arbitrary
communication
structures, assuming that some method such as the
previous is used to safely identify excluded groups.
We will define a notion of multicast epoch, to be associated with messages such that if for two messages m1 and m2, epoch(m1) < epoch(m2), then m1 will always be delivered before m2. In situations where a causal ordering problem could arise, our solution will increment the epoch counter.
problem could arise, our solution will increment the epoch counter.
p maintains
a local variable,
epochP. When
Specifically,
each
process
a multicast,
it increments
its
epoch variable
if the
process p initiates
epochP on the outgoing
condition
given below holds. It then piggybacks
group.
message.
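The increment condition can be sketched as a small sender-side state machine; the class and field names below are illustrative assumptions.

```python
class EpochSender:
    """Sketch of the multicast-epoch rule: a process increments its epoch
    when it multicasts to a group g and the last group it used was some
    other group g', where g or g' is excluded. The epoch is piggybacked
    on every outgoing message."""
    def __init__(self, excluded):
        self.excluded = set(excluded)
        self.epoch = 0
        self.last_group = None

    def send(self, group, payload):
        if (self.last_group is not None and group != self.last_group
                and (group in self.excluded or self.last_group in self.excluded)):
            self.epoch += 1          # causal-ordering hazard: advance epoch
        self.last_group = group
        return {"epoch": self.epoch, "group": group, "payload": payload}
```

Successive sends within one group never advance the epoch; crossing into or out of an excluded group does.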
On reception of a message carrying a new epoch number, p will initiate a flush of each group to which it belongs, using the same flush protocol as for a view change (Section 5.5); the flush completes just as it would for the installation of a new view. Reception of this flush message triggers execution of the flush protocol within each group, and completing it clears the vector timestamps for all views. In our ISIS implementation this is done by having each group install a new minor view number (group views carry two numbers, a major one incremented by membership changes and a minor one incremented by flushes).
The epoch number is incremented if the last message was sent to or received from some other group g', and either g or g' is an excluded group. That is, the epoch changes if a process sends to two different groups in immediate succession, where one of the groups is excluded.
THEOREM 3. The VT protocol, extended to implement multicast epochs, delivers messages causally within arbitrary communication structures.

PROOF. Consider an arbitrary message chain where the first and last messages have overlapping destinations. For example, in the chain shown below, p_{k+1} might be a member of both g1 and gk and hence a destination of both m1 and mk. Without loss of generality, we will assume that g1 ... gk are distinct. We wish to show that the last message will be delivered after the first at all such destinations.
This resembles the protocols of Cristian [17, 27] as well as ours; in fact, periodic synchronization (on which the Δ-T scheme is based) is done using such a protocol.
7. APPLYING THE PROTOCOLS TO ISIS
7.1 Client/Server Group Styles

Groups are often used in settings where a set of servers provides a service to a large number of clients, each client forming a group with the set of servers. Where s is the size of the server set and c is the number of clients, the servers form a peer group of size s, and each client an additional group containing the servers and that client. In this case, every server receives the timestamps of the entire set of client groups through the multicasts it receives, and must maintain O(s*c) timestamps in total. Fortunately, an optimization can be applied to reduce this state to O(s + c). The optimization essentially collapses all of the clients into one subgroup: each client needs to maintain only the timestamp of the group containing the servers, and each server the timestamps of the groups with which it can be communicating (see Theorem 1). Such considerations are described by Stephenson [28].
For example, a stock quote and trading service might support one group per stock, formed of the brokers trading that stock.10 Our protocol can be extended to such destination sets either by implementing an ability to multicast to only part of a group, or by simply discarding unwanted messages. Alternatively, the compression scheme can be extended directly, by having a client actually drop out of a group after some period of inactivity.

10 For example, the brokers trading IBM stock might form one such group.
Table I. Overhead of a Group Multicast for Different Process Group Styles

Fixed overhead (28 bytes total): sender.id, uid, dest_group, epoch_p, stable_dest(uid); VT timestamps as needed (Section 6.3).

Resulting variable VT timestamp sizes:
Peer group:     k, k ≤ n      (n is the number of group members)
Client/server:  k, k ≤ s + c  (s is the number of servers and c is the number of clients)
Diffusion:      k, k ≤ b      (b is the number of members broadcasting into the group)
Hierarchical:   k, k ≤ n      (n is the size of a subgroup)

7.2 Point-to-Point Messages

7.3 System Interface Issues
One question raised by our protocols concerns the mechanism by which the
system would actually be informed about special application
structure,
such
as an acyclic communication
structure.
This is not an issue in the current
ISIS implementation,
which uses the conservative
rule, excluding
groups
adjacent to more than one neighboring
group. In the current system, the only
interface of this kind is a routine, pg_exclude, which the application invokes to designate a group as an excluded group. A richer interface could be provided, in which assertions about the communication structure are made on a per-domain basis: an application could declare that the communication structure of a domain is acyclic, or that all cycles in it are of length k or less, and causality would then be enforced within each domain. Causality between domains would not be enforced; a message may be delivered before causally prior messages sent in a different domain, so interdomain causality can be broken. Enough safety remains for many applications, however, because a process switching from one domain to another can block, using a group flush, until prior messages have achieved stability. By designing the application so that causality is enforced within each domain and a flush is used when switching between domains, safety can be achieved for both intra- and interdomain operations. In the case of a physics simulation, for example, each domain might be created as a separate set of groups.
7.4 Group View Management

The ISIS architecture is difficult to scale because of its reliance on a central service with which every group would be registered.11 Although this service need not reside directly on a single node, it limits the scalability of the architecture to LAN networks with at most a few hundred workstations; we defer further discussion of this issue to a future paper.12 By reimplementing the multicast protocol in terms of our new module, and separating the failure detection mechanism into a free-standing service, this limit to scalability can be eliminated.

11 There are a number of possible approaches to this problem.
12 As in conventional distributed computing environments, some sort of location service is assumed: initial connection to a group as a new member or client could be performed via a forwarded request to the service, causing the caller to be added directly.

8. PERFORMANCE
We implemented the new CBCAST and ABCAST protocols within the ISIS Toolkit, Versions 2.1 and 3.0 (respectively available since October 1990 and March 1991); the figures below are from the V3.0 implementation, evaluated using SUN 4/60 workstations on a lightly loaded 10Mbit Ethernet. The implementation is less ambitious than what we plan for the future ISIS system, which will support the new protocols directly in the operating system. Nonetheless,
Table
II
grows
the
roughly
table
that
linearly
shows
the
process;
subsequent
Figures
for
did
not
we
and
4 5ms
sort
of lightweight
was
We
cost
24. 5ms
Figure
size
number
ABCAST
size
is
half
reflects
an initial
not
move,
tions
such
13 It
and
SUN
the
relative
the
as TCP:
should
be noted
and
plan
to compare
process
a lk
SUN
RPC
do not
nor
lk
call
(over
does
RPC,
null
are
when
Thus,
to
in
7 successive
3. 5ms
of any
time
both.
our
a null
creation
a substantial
multicast
protocol;
using
TCP)
unrelated
the
which
and
wall-clock
ISIS
ISIS
using
with
message
include
is the
environment
our
achieved
machines
procedure
with
although
the
ISIS
cost
of CBCAST
7 compares
grows.
in
determining
From
these
with
of the
part
the
is
multicast
which
all
null
RPCS
7k
packets
that
although
sent
the
as a function
of a lk
graphs
and
small
in which
scheduling
compare
cost
cost,
in tests
synchronous
RPCS
the
implementation
higher
more
commercial
UNIX
systems,
faster.
We are now working
of
same
destinations.
multicasting
be
the
remote
side,
of sites
figures
for
with
of CBCAST
the
can
for
figures
remote
Figure
ABCAST
throughput
line
destinawould
protocol.
factor
speed
what
for
SUN
using
the
also
figures
a 7-destination
number
the
running
ISIS
that
a dominant
roughly
seen
of sites;
as the
first
in the
remote
the
CBCAST
The
thread
or more
these,
6.51ms;
recorded,
of the
22.9ms
6 shows
and
of
The
on the
to
2-process
13 These
cost
using
one
in
using
group.
to another
mechanism
a null
aspects
note
to
III;
compared
or received
costs
cost
Table
latency
task
to
reply
be
7. 86ms.
additional
attributable
the
in
destination
a CBCAST
procedure
example,
sent
of the
encouraging.
a message
token.
respectively.
tions
show
was
transmitting
of sending
can
measured
protocol.
size
a round-trip
reply
3.36ms
the
ordering
For
had
message
with
remote
worked.
facility
of
appear
performance
performance
cost
lines
the
vendor-supplied
we
measured
the
cost
ABCAST
hold
ISIS
the
shows
and
IO
we
that
the
ABCAST
ordering
incurred
of the
packet
runs
This
last
token
by
at
point
does
applica-
protocol.
well
to
SUN
with
a single
RPC
figUre
traditional
remote
is about
streaming
resulting
ISIS
RPC
with
protocols
destination,
as good
that
CBCAST
as one
performance
to a lk
see that
packets.
costs
of packet
CBCAST
of the
can ind
is quite a bit
Mach [1, 211,
Mach/x-Kernel
RPC.
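The round-trip latencies quoted above were gathered with ISIS test programs written in C, but the measurement pattern itself is simple: time a long run of back-to-back calls and divide by the run length. A minimal sketch of such a harness (in Python, purely illustrative):

```python
import time

def time_call(call, runs=1000):
    """Average latency of `call` in milliseconds over `runs` back-to-back
    invocations; running many calls amortizes the timer overhead."""
    start = time.perf_counter()
    for _ in range(runs):
        call()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs

# Stand-in for a null RPC: a local no-op call.
avg_ms = time_call(lambda: None)
```

A real harness would substitute a multicast-and-reply round trip for the no-op and would report throughput (msg/sec) as runs divided by elapsed time.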
ACM Transactions on Computer Systems, Vol. 9, No. 3, August 1991.

308  K. Birman et al.
Table II. Multicast Performance (CBCAST). All Figures Measured on SUN 4/60s
Running SUNOS 4.1.1. S: Null Packets; M: 1K Packets; L: 7K Packets

                  Causal multi-rpc (ms),        Asynchronous
                  all dests reply
Destinations      S        M        L          msg/sec    kb/sec
Self              1.51     1.52     1.51         4240      16706
1                 6.51     7.86     17.0         1019        550
2                 8.67     10.1     23.4          572        361
3                 10.7     12.2     34.9          567        249
4                 12.9     15.2     43.4          447        180
5                 16.3     18.8     54.1          352        143
6                 19.6     21.9     64.6          305        118
7                 22.9     25.7     75.7          253        101
Table III. Multicast Performance (ABCAST). S: Null Packets; M: 1K Packets;
L: 7K Packets

                  Ordered multi-rpc (ms),       Asynchronous
                  all dests reply
Destinations      S        M        L          msg/sec    kb/sec
Self              1.52     1.56     1.55         3300      20000
1                 --       14.0     25.5          341        560
2                 18.5     19.7     30.7          273        317
3                 24.6     25.6     42.6          207        175
4                 31.4     32.5     53.5          197        152
5                 39.1     39.2     70.3          352        115
6                 44.3     42.1     77.5          133       98.9
7                 55.2     48.3     97.0           97       79.7
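In ISIS, ABCAST's total order is obtained by funneling ordering decisions through a token holder, and the extra interaction with that process is what makes the ABCAST rows above more expensive than their CBCAST counterparts. A toy sketch of sequencer-style total ordering (illustrative only; the real protocol layers the token mechanism over CBCAST and permits the token to move between group members):

```python
class TokenSequencer:
    """Illustrative token holder: assigns each multicast a global sequence
    number, fixing a single total delivery order for the whole group."""
    def __init__(self):
        self.next_seq = 0

    def order(self, msg):
        s = self.next_seq
        self.next_seq += 1
        return (s, msg)            # (sequence number, payload) sent to all

class Receiver:
    """Delivers messages strictly in sequence order, buffering gaps."""
    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
```

Even if the network reorders messages, every receiver applying this rule delivers them in the same order, which is the property ABCAST pays for.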
The asynchronous throughput figures deserve comment. The 550kb/sec (550k bytes per second) achieved when streaming 7k messages to a single destination is essentially identical to the data rate achieved by the vendor-supplied TCP implementation on the same hardware. As the number of destinations increases, the rate delivered to each destination drops, but the total rate of data sent increases, approaching 750kb/sec: recall that an n-destination multicast currently involves sending the same message n times, because our implementation makes no use of hardware multicast.
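The bookkeeping behind that observation is simple arithmetic: without hardware multicast, the sender transmits each byte once per destination, so the per-destination delivery rate and the transmitted rate diverge as the group grows. Using the asynchronous 7k figures from Table II:

```python
def total_transmitted_kb_per_sec(per_dest_kb_per_sec, n_dests):
    # Each multicast is sent n times, once per destination, so the rate of
    # bytes placed on the wire is n times the rate delivered to any one
    # destination.
    return per_dest_kb_per_sec * n_dests

# One destination: 550 kb/sec delivered means 550 kb/sec transmitted.
one_dest = total_transmitted_kb_per_sec(550, 1)    # 550
# Seven destinations: 101 kb/sec per destination means 707 kb/sec transmitted.
seven_dest = total_transmitted_kb_per_sec(101, 7)  # 707
```

With hardware multicast, the transmitted rate would instead stay close to the per-destination rate, which is one motivation for the kernel-level implementations discussed later.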
Fig. 6. CBCAST RPC timing as a function of the number of remote destinations.
The three curves show null, 1 kbyte, and 7 kbyte packets; each RPC waits for
replies from all destinations.

Fig. 7. CBCAST vs. ABCAST RPC timing (1k packets) as a function of group size.
The engineering of UNIX UDP leads to flow control problems when large volumes of data are transmitted. When ISIS streams messages to, say, 7 destinations, the sending process generates large numbers of UDP packets. Each packet typically contains 4000-8000 bytes of data and may carry as many as 25 ISIS messages, small messages being piggybacked into a single large packet; 100 or more such packets may be in transit at one time. Faced with such a high volume of UDP traffic, UNIX begins to lose packets silently.
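The piggybacking just described (many small ISIS messages packed into one large UDP datagram) can be sketched as a greedy batcher. The byte limit below comes from the text; the actual ISIS packing policy is more involved:

```python
def pack(messages, max_bytes=8000):
    """Greedily pack messages (byte strings) into datagram-sized batches:
    start a new batch whenever adding a message would exceed max_bytes."""
    batches, cur, cur_len = [], [], 0
    for m in messages:
        if cur and cur_len + len(m) > max_bytes:
            batches.append(cur)
            cur, cur_len = [], 0
        cur.append(m)
        cur_len += len(m)
    if cur:
        batches.append(cur)
    return batches

# Twenty-five 300-byte messages (7500 bytes) fit in a single 8000-byte datagram.
batches = pack([b"x" * 300] * 25)
```

Each batch then costs one system call and one packet on the wire, which is why a stream of small multicasts can sustain the message rates shown in Table II.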
Table IV. Null Causal Multicast to Another Local Process

    Operation          Causal cost    Total cost
    send                  86 us         406 us
    receive               31 us         394 us
    reply (send)          86 us         406 us
    receive (reply)       31 us         224 us
    Total                234 us        1.43 mS
    (Measured)                         1.51 mS
One consequence is that ISIS must implement flow control itself: a heuristic mechanism imposes limits on both outgoing and incoming traffic, so that an application cannot overload UNIX. One might wonder why we didn't simply build ISIS over TCP and let the kernel handle flow control. Unfortunately, most UNIX implementations limit the number of file descriptors a process may hold open, and hence the number of TCP connections, to a small number; kernel (mbuf) buffering is also limited, and when these limits are reached packets are again dropped silently, with no detailed control available to the application. Within the x-Kernel, where better statistics and far more detailed control over buffering are available, it should be possible to implement flow control directly.
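The text does not spell out the ISIS flow control heuristic; the general shape of such a scheme, however, is to cap the number of unacknowledged packets outstanding to a destination and queue the excess in the sender. A minimal credit-based sketch of that idea (illustrative, not the ISIS mechanism):

```python
class CreditFlowControl:
    """At most `credits` unacknowledged packets may be outstanding; further
    sends are queued and released as acknowledgements arrive."""
    def __init__(self, credits=100):
        self.credits = credits
        self.in_transit = 0
        self.backlog = []

    def send(self, pkt, transmit):
        if self.in_transit < self.credits:
            self.in_transit += 1
            transmit(pkt)          # safe: below the loss threshold
        else:
            self.backlog.append(pkt)

    def acked(self, transmit):
        self.in_transit -= 1
        if self.backlog:
            self.send(self.backlog.pop(0), transmit)
```

Keeping the outstanding-packet count below the point at which UNIX starts discarding UDP datagrams converts silent loss into bounded sender-side queueing.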
Table IV shows where the time goes in the simplest case: a null, causally ordered multicast sent by one process to a second process on the same machine, each running in its own address space, with a null reply sent back. The figures were obtained by profiling a test program that sent such multicasts in a loop; the costs shown are per-operation averages. The first column gives the costs attributable to causality, that is, to the causal ordering protocol itself; the second gives the total cost of each operation. Causal ordering accounts for only 234us of the 1.43ms computed round-trip total, and it involves no communication of its own, since the ordering information travels on messages that would be sent in any event. The measured round-trip time, 1.51ms, is slightly higher than the computed total, the difference being attributable to costs, such as scheduling delays, that the profile does not capture.

The asynchronous figures may seem impressive, or even odd, in light of the RPC times: a single sender can stream over 1000 causally ordered multicasts per second to a remote destination. The explanation is that an asynchronous sender never blocks, so transport costs overlap the generation of new messages; this is one obvious way in which applications can use our protocols to accelerate communication.

Perhaps the best published performance for a protocol comparable to ours is that of the Amoeba system [15], which achieves the lowest latency and highest throughput we know of for a reliable, FIFO-ordered multicast. Amoeba obtains this result by hardwiring its multicast protocol into a special device driver implemented within the kernel, while ISIS runs entirely outside the kernel; both systems were measured on comparable hardware, SUN processors on a single 10Mbit ethernet. Because the ordering properties differ, and because our emphasis is on scale and on large numbers of possibly overlapping process groups, we see no obvious way to compare the two sets of figures directly.
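The "causal cost" rows of Table IV are the bookkeeping for the vector-timestamp delivery rule at the heart of CBCAST. A simplified, single-group sketch of that rule follows; the real protocol also compresses timestamps and handles group membership changes:

```python
class CausalProcess:
    """Single-group causal delivery with vector timestamps: deliver a message
    from sender j once it is the next message from j and every multicast it
    causally depends on has already been delivered here."""

    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n          # vc[k]: multicasts from process k delivered
        self.delayed = []          # messages awaiting causally prior ones
        self.delivered = []

    def multicast(self, payload):
        # Sending counts as delivery to self; the timestamp travels with it.
        self.vc[self.pid] += 1
        return (self.pid, list(self.vc), payload)

    def receive(self, msg):
        self.delayed.append(msg)
        progress = True
        while progress:            # deliver everything newly enabled
            progress = False
            for m in list(self.delayed):
                j, ts, payload = m
                if ts[j] == self.vc[j] + 1 and all(
                        ts[k] <= self.vc[k] for k in range(len(ts)) if k != j):
                    self.vc[j] = ts[j]
                    self.delivered.append(payload)
                    self.delayed.remove(m)
                    progress = True
```

Note that the rule involves only comparisons against local state: a message can be delivered immediately on reception unless it arrived out of causal order, in which case it waits in the delayed queue, exactly the behavior the latency figures reflect.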
Table V. 1k Causal Multicast to a Remote Process

    Layer              Cost
    Causal             0.23 mS
    ISIS               1.21 mS
    Transport calls    1.5 mS
    Reply              1.59 mS
    Wire time          3.6 mS
    Total              approximately 8.1 mS
    (Measured)         7.86 mS
through
may
shows
the
process.
implementation
costs,
with
its
spent
with
because
assumption.
protocols
for
would
improved
to the
physical
parallelism
a serialized
multiple
The
using
of
a good
spent
transporting
us that
header
cost
our
sequencing
new
reflects
in
that
as
The
ISIS
(which
call
an avenue
maybe
closed
running
much
on a
benefit
from
application
it
could
might
from
the
FIFO
would
accept
using
multithread
system
listing
the
format
acknowledgment
the
such
of the
data
our
data
as the
sender,
structure,
etc.
One reason the ISIS layer is costly is the message format. ISIS messages are transmitted as self-describing data structures: each message carries a header identifying the sender and destinations and containing acknowledgement and sequencing information, a description of the enclosed data, and so forth. Each intersite message has a 64-byte ISIS header, and the enclosing transport protocol adds a header of its own, so for small messages this overhead is large in comparison with the data actually transmitted. In reimplementing the system we will be using a simpler, more compact representation, and this component of the cost should be much reduced.
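For concreteness, a fixed 64-byte header might be laid out as follows. The field names here are hypothetical, since the text does not enumerate the actual ISIS header fields; only the 64-byte size is given, and the sizing arithmetic is the point:

```python
import struct

# Hypothetical 64-byte layout: four 32-bit fields, two 64-bit identifiers,
# and a 32-byte slot for sequencing/acknowledgement data (4*4 + 2*8 + 32 = 64).
HEADER = struct.Struct("!4I 2Q 32s")

def make_header(sender, group, seqno, flags, msg_id, ack_id, causal_data):
    """Pack one fixed-size header; causal_data is zero-padded to 32 bytes."""
    return HEADER.pack(sender, group, seqno, flags, msg_id, ack_id, causal_data)
```

Against a null or very small payload, 64 bytes of header (plus the transport's own) dominates the bytes on the wire, which is why a more compact representation is attractive.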
It is also interesting to compare the cost of a CBCAST with that of an unordered remote procedure call over the same transport. A null RPC using the ISIS transport but no ordering protocol at all costs nearly as much as a null CBCAST:14 the major costs lie in the transport, in system calls, and in time on the wire, not in the ordering protocol. This supports our major conclusion from these measurements: for applications such as those ISIS supports, the overhead imposed by causally ordered multicast is low, and efforts to make the system faster should focus on the transport and on the operating system rather than on the ordering protocols.

14 Most of this additional cost is time spent in the operating system and in the transport; the figure is an estimate, taken from the profile described above.
We are now exploring implementations in which parts of the ISIS protocol code would be moved into the kernel, so that the protocol modules sit closer to the network device and the communication component of an application can run with less overhead. Both Mach and Chorus, for example, permit code to be moved into the kernel in this way. The most obvious speedup would result from the use of hardware multicast, but moving the communication component closer to the hardware should also yield a significant speedup; we plan to evaluate the idea experimentally.
9. CONCLUSIONS
We have presented a new family of multicast protocols, built around a reliable, causally ordered multicast primitive, CBCAST, from which a totally ordered atomic multicast, ABCAST, is easily constructed. The protocols have been implemented as part of versions 2.1 and 3.0 of the ISIS Toolkit, in support of its virtually synchronous process group programming model. CBCAST is fast: a multicast can be sent asynchronously, without blocking the sender, and a message can be delivered immediately upon reception unless it arrives out of order, in which case delivery is delayed only until the causally prior messages have been received. The overhead carried on each message is small, typically proportional to the size of the process group to which it is sent, and the approach extends to large numbers of possibly overlapping process groups. The protocols are layered over a reliable, FIFO message transport and could exploit inexpensive hardware multicast where network interfaces and device drivers make it available; because interprocess communication in distributed systems is typically bursty, streaming multicasts over such channels performs well. Our basic conclusion is that causally and totally ordered multicast can be implemented with performance competitive with the best existing communication facilities, and that there is no evident reason such primitives cannot be used even by applications for which fault-tolerance is not a concern. This contradicts the widespread belief that the overhead needed to achieve ordered, fault-tolerant group communication must be unacceptably costly.
ACKNOWLEDGMENTS
Shivakant Misra, Aleta Ricciardi, Mark Wood, and Guerney Hunt made numerous suggestions concerning the protocols and presentation, which we greatly appreciate.
REFERENCES
1. ACCETTA, M., BARON, R., GOLUB, D., RASHID, R., TEVANIAN, A., AND YOUNG, M. Mach: A new kernel foundation for UNIX development. Tech. rep., Carnegie Mellon Univ., Pittsburgh, Pa., Aug. 1986. Also in Proceedings of the Summer 1986 USENIX Conference (July 1986), pp. 93-112.
2. ACM SIGOPS. Proceedings of the Ninth ACM Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 1983).
3. BIRMAN, K., AND COOPER, R. The ISIS project: Real experience with a fault tolerant programming system. In ACM SIGOPS European Workshop (Sept. 1990). To appear in Operating Systems Review, April 1991; also available as Tech. Rep. TR90-1138, Computer Science Dept., Cornell Univ.
4. BIRMAN, K., COOPER, R., AND GLEESON, B. Programming with process groups: Group and multicast semantics. Tech. Rep. TR91-1185, Computer Science Dept., Cornell Univ., Feb. 1991.
5. BIRMAN, K., AND JOSEPH, T. Exploiting replication in distributed systems. In Distributed Systems, S. Mullender, Ed., ACM Press, New York, 1989, pp. 319-368.
6. BIRMAN, K., AND JOSEPH, T. Exploiting virtual synchrony in distributed systems. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (Austin, Tex., Nov. 1987), ACM, New York, 1987, pp. 123-138.
7. BIRMAN, K., AND JOSEPH, T. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Feb. 1987), 47-76.
8. BIRMAN, K., JOSEPH, T., KANE, K., AND SCHMUCK, F. ISIS - A Distributed Programming Environment: User's Guide and Reference Manual, first edition. Dept. of Computer Science, Cornell Univ., March 1988.
9. CHANG, J. M., AND MAXEMCHUK, N. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984), 251-273.
11. CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. Atomic broadcast: From simple message diffusion to Byzantine agreement. Tech. Rep. RJ5244, IBM Research Laboratory, 1985. An earlier version appeared in Proceedings of the International Conference on Fault-Tolerant Computing, IEEE Computer Society Press.
12. DENNING, P. Working sets past and present. IEEE Trans. Softw. Eng. SE-6, 1 (Jan. 1980), 64-84.
13. FIDGE, C. Timestamps in message-passing systems that preserve the partial ordering. In Proceedings of the 11th Australian Computer Science Conference (Feb. 1988), pp. 56-66.
14. GARCIA-MOLINA, H., AND SPAUSTER, A. Message ordering in a multicast environment. In Proceedings of the 9th International Conference on Distributed Computing Systems (June 1989), IEEE, New York, pp. 354-361.
15. KAASHOEK, M. F., TANENBAUM, A. S., HUMMEL, S. F., AND BAL, H. E. An efficient reliable broadcast protocol. Oper. Syst. Rev. 23, 4 (Oct. 1989), 5-19.
16. LADIN, R., LISKOV, B., AND SHRIRA, L. Lazy replication: Exploiting the semantics of distributed services. In Proceedings of the ACM Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990), ACM, New York, 1990, pp. 43-58.
17. LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.
18. MARZULLO, K. Maintaining the time in a distributed system. PhD thesis, Dept. of Electrical Engineering, Stanford Univ., 1984.
19. MATTERN, F. Virtual time and global states of distributed systems. In Proceedings of the International Workshop on Parallel and Distributed Algorithms, North-Holland, Amsterdam, 1989.
20. PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHTING, R. D. Preserving and using context information in interprocess communication. ACM Trans. Comput. Syst. 7, 3 (Aug. 1989), 217-246.
21. PETERSON, L. L., HUTCHINSON, N., O'MALLEY, S., AND ABBOTT, M. RPC in the x-Kernel: Evaluating new design techniques. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (Litchfield Park, Ariz., Dec. 1989), ACM, New York, 1989, pp. 91-101.
22. PITTELLI, F., AND GARCIA-MOLINA, H. Data processing with triple modular redundancy. Tech. Rep. TR-002-85, Princeton Univ., June 1985.
23. POWELL, M. L., AND PRESOTTO, D. L. Publishing: A reliable broadcast communication mechanism. In Proceedings of the Ninth ACM Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 1983), ACM, New York, 1983, pp. 100-109. Published as Operating Systems Review 17, 5.
24. RICCIARDI, A., AND BIRMAN, K. Using process groups to implement failure detection in asynchronous environments. Tech. Rep. TR91-1188, Computer Science Dept., Cornell Univ., Feb. 1991.
25. SCHIPER, A., EGGLI, J., AND SANDOZ, A. A new algorithm to implement causal ordering. In Proceedings of the 3rd International Workshop on Distributed Algorithms (1989), Lecture Notes in Computer Science 392, Springer-Verlag, New York, 1989, pp. 219-232.
26. SCHMUCK, F. The use of efficient broadcast primitives in asynchronous distributed systems. PhD thesis, Dept. of Computer Science, Cornell Univ., 1988.
27. SRIKANTH, T. K., AND TOUEG, S. Optimal clock synchronization. J. ACM 34, 3 (July 1987), 626-645.
28. STEPHENSON, P. Fast causal multicast. PhD thesis, Computer Science Dept., Cornell Univ., Feb. 1991.
29. VERISSIMO, P., RODRIGUES, L., AND BAPTISTA, M. AMp: A highly parallel atomic multicast protocol. In Proceedings of the SIGCOMM '89 Symposium on Communications Architectures & Protocols (Austin, Tex., Sept. 1989), ACM, New York, 1989, pp. 83-93.
30. WALKER, B., POPEK, G., ENGLISH, R., KLINE, C., AND THIEL, G. The LOCUS distributed operating system. In Proceedings of the Ninth ACM Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 1983), ACM, New York, 1983, pp. 49-70.

Received February 1990; revised April 1991; accepted May 1991.