Benjamin Atkin
Cornell University
batkin@cs.cornell.edu
Information sharing is an important task in distributed systems. Processes and users require access to shared data for meaningful cooperation, and expect certain guarantees regarding the way in which the data they share is managed.
An ideal system for data sharing among processes operating on different computers would satisfy a number of requirements. Firstly, a client of the system should always be able to access the most recent copy of an object, if so desired. The system should also provide a high degree of availability, so that, barring a major failure of components within the distributed system, a client can always reach a copy of a desired object. Furthermore, it should provide access to remotely-stored objects that is comparable in speed to local accesses.
Beyond these primary (though optimistic) requirements, there exist other features which may be useful. A client may want to be able to modify a series of objects in such a way that the whole series of modifications is reflected consistently in subsequent accesses, as in a transactional database system. It may be important for updates to objects to be persistent over failures and restarts of the critical components of the system.
Other requirements concern the nature of the shared data and how it can be accessed. Does it consist of files, database records, or objects such as might be defined in an object-oriented programming language? How big are they? Should the system provide a way to encapsulate operations with the objects, or leave this up to the application designer? Should the system allow access to just one class of objects, or should it be open-ended?
Existing systems take a variety of approaches. The LOCUS operating system [15] implements a distributed file system with nestable transactions. The Argus programming language [7] provides an object-oriented environment for distributed programming. The QuickSilver operating system [4] has a facility by which clients can access data spread across multiple servers using transactions. These and other systems are described further in the next section.
One feature common to most of these systems is a somewhat heavy-weight notion of access to shared data, emphasising consistency and transactional properties over speed and scalability. This means that application designers who are willing to settle for weaker semantics must either accept performance decreased by unnecessary overheads, or implement a new system from scratch.
This paper outlines the design of a distributed shared object system which provides these weaker guarantees. In the style of the Lightweight Recoverable Virtual Memory system [11], which provides a minimal facility for transactional access to local virtual memory, a system for data sharing among clients in a distributed system will be implemented. This system will emphasise speed of access and scalability, at the cost of transient inconsistencies which may be observable between the copies of data seen by clients.
The remainder of this paper outlines the characteristics of this system and the intended implementation. Section 2 reviews some existing systems for distributed data sharing in detail. Section 3 describes the probabilistic broadcast protocol, which underlies the design. Section 4 gives an outline of the system. Section 5 details the work that will be undertaken, intended applications, and additional features which are being considered.
Related work

A number of systems address the requirements described in the previous section. Here some of the most relevant are briefly described.
LOCUS

The LOCUS distributed operating system [15] incorporates a Unix-compatible file system with replication and transactional features. Replication is intended to increase availability and provide more efficient access. Files are assigned to file groups, which may be wholly or partially replicated at a site. The file creation operation specifies how many times a file should be replicated.
Each file group has a designated current synchronisation site
(CSS), which is free to implement its own synchronisation policy. If
the network becomes partitioned, a file group has a CSS in each
section of the partition. All file open operations are directed to the
CSS of the group a file belongs to, after which the CSS forwards the
request to a server in the group. The client then interacts with this
server. Changes made to a single file are atomic, and are temporarily
recorded using a shadow-page mechanism. Upon a file close, the changes
are committed by writing the file's inode to disk. Though this
facility only permits modification to a single file within a
transaction, it makes updates atomic across machine
failures. Partitions of the network introduce the possibility of
inconsistent versions of files arising in disjoint parts of the
partition, which are resolved automatically when possible, and with
user intervention in other cases. Upon a commit, the server must
notify all other replica servers and the CSS, which decides when the
update is complete. Since only one server modifies the file,
individual replica servers do not take any commit or abort
decisions. There is therefore no need for an expensive 2-phase
commit. A more advanced facility for nested transactions is described
in [8], but is not discussed here.
The system proposed in this paper differs from LOCUS in allowing
servers holding a replica of an object to handle updates autonomously,
without reference to any centralised synchronisation manager (a
potential bottleneck in a large system). Propagation of updates among
replica servers is also managed asynchronously, rather than
synchronously as in LOCUS. While allowing some clients to see the
effects of updates faster, it introduces the possibility of clients
observing transient inconsistencies.
QuickSilver
The QuickSilver distributed system [4] provides
an example of a general mechanism for controlling data sharing in a
distributed environment. QuickSilver is an operating system based on a
``lean'' kernel, over which system services are implemented as
user-level servers. Viewing replication as too expensive and
insufficiently general to provide an effective basis for resilience to
failures, the designers of QuickSilver chose to implement a
transaction-based recovery service for use by servers. A server can
declare itself to be stateless or recoverable; a recoverable server is
included in a transaction when the client owning the transaction sends
it an IPC request. This IPC message registers the participation of the
server with the transaction manager running on the server's machine.
Transactions are terminated when the client commits or aborts, or when
a participating server aborts. 2-phase commit is employed, with the
option that servers may cast a ``read-only'' vote after the
preparation phase. This indicates that they have not modified any
resources and wish to be excluded from the second round. Each
transaction manager maintains its own log, which is shared with
servers at that machine. Servers define their own recovery procedures
using the log. The authors state that services which need to manage
replicas, such as the naming services, do not use the full
capabilities of the transaction mechanism, relying on atomic updates
only. This is a service which has been handled with lazy updates in
some other systems, such as Grapevine [2].
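The read-only vote described above can be made concrete with a toy tally in Python; this is a simplification for illustration, not QuickSilver's actual protocol code, and the function and vote names are invented:

```python
def two_phase_commit(votes):
    """Tally prepare-phase votes: server name -> 'yes', 'no', or 'read-only'.

    Returns the decision and the servers that must see the second phase;
    'read-only' voters are excluded from it.
    """
    if all(v in ("yes", "read-only") for v in votes.values()):
        decision = "commit"
    else:
        decision = "abort"
    # Servers that modified nothing need not be contacted again.
    second_phase = [s for s, v in votes.items() if v != "read-only"]
    return decision, second_phase
```

A server that casts a "read-only" vote has declared that it modified no resources, so the coordinator can omit it from the second round without affecting atomicity.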
Argus

The Argus programming language [7] is
an environment for writing distributed applications, supporting
atomic actions, which are serialisable and can be nested. Programs
access guardians (servers) by invoking handlers which the guardians
define; individual guardians implement their own concurrency control
for access to the objects they manage. Transaction nesting requires
lock propagation among invocations at the same node, and two-phase
commit. A log is used for recovery, with periodic snapshots of the
state of guardians.
Clients accessing an Argus guardian are only able to modify objects
through the interface the guardian provides. This provides a much
higher degree of control over modifications to objects than the system
described in this paper. The price paid is in the need to make a
call to a guardian for every access, potentially requiring
inter-machine communication each time. Mapping an object directly into
the client's address space permits access by making a local procedure
call.
Bayou
The Bayou storage system [13] is designed for a
network of computers in which machines may be only intermittently
connected. The assumption of weak connectivity leads to a scheme
incorporating the idea of eventual consistency between data replicas,
which may require rollback of updates when the propagation of changes
between replicas reveals inconsistencies. Updates are propagated by
asynchronous, pair-wise communication between replicas. In the absence
of further updates, replicas will eventually become consistent,
provided that the network is not indefinitely partitioned. However, there
is no upper bound on the time which an update will take to be
reflected in all replicas.
Updates which have not attained stability are marked as
tentative. Application-specific checks supplied with updates allow
servers to detect possible conflicts. Each update specifies a
``dependency check'' to be made against the replica's data set. The
update also includes a ``merge procedure'' to resolve any conflicts
which are detected. The order in which updates are finalised is
decided by the primary server of the data set. If a replica server is
unable to communicate with the primary server, it must mark all
updates as tentative. Update rollback is achieved using a log of
tentative and committed updates maintained with each replica. The
status of an update can be queried by the client.
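Bayou's per-update conflict handling can be sketched as follows; this is a simplified illustration of the dependency-check and merge-procedure idea, and the dictionary layout and meeting-room example are illustrative rather than Bayou's actual interfaces:

```python
def apply_bayou_update(data, update):
    """Apply one Bayou-style update to a replica's data set (a dict).

    The update carries a dependency check (a predicate over the data),
    the intended write, and a merge procedure invoked on conflict.
    """
    if update["dependency_check"](data):
        data.update(update["write"])
    else:
        update["merge_procedure"](data)
    return data

# Illustrative update: book a meeting room if the slot is still free,
# otherwise the merge procedure falls back to a later slot.
booking = {
    "dependency_check": lambda d: d.get("room_10am") is None,
    "write": {"room_10am": "alice"},
    "merge_procedure": lambda d: d.setdefault("room_11am", "alice"),
}
```

Because the check and the merge procedure travel with the update, any replica can detect and resolve the conflict without contacting the update's originator.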
Updates are communicated between pairs of servers using an
anti-entropy [3] session, in which one
server tells the other which updates it has seen. This algorithm,
elaborated in [9], is similar to the gossip
rounds of the probabilistic broadcast protocol, described in the next
section. Unlike Bayou, the system proposed in this paper has no notion
of committed writes -- there is no global order for updates and so no
need for a primary server, a possible impediment to scalability.
Swallow
Swallow [12,10] is a
reliable distributed data repository, which permits clients to access
data in the framework of atomic actions. Each object in the repository
is represented by a history of states, consisting of a series of
immutable versions of the object. A repository stores its data on
append-only optical disks, which provide stable storage. The existence
of time-stamped versions permits accounting and consistent snapshots
of the repository state.
Atomic transactions are implemented by a system of object headers and
commit records. Each object has an object header with a pointer to the
most recent version of the object; when a transaction modifies an
object, it creates a tentative version, referenced by a separate
pointer in the header. Each transaction has a commit record which the
headers of modified objects and the tentative versions also reference.
In this way, committing a transaction merely requires updating this
record and writing it to stable storage. A protocol similar to
two-phase commit is used for transactions that access more than one
repository.
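The header and commit-record arrangement can be sketched as follows (class and field names are invented for illustration; stable-storage writes are elided):

```python
class CommitRecord:
    """Shared by all objects a transaction modified; committing is a
    single update to this record (followed by a stable-storage write)."""
    def __init__(self):
        self.state = "pending"      # -> "committed" or "aborted"

class VersionedObject:
    """An object as a history of immutable versions plus, at most,
    one tentative version referencing its transaction's commit record."""
    def __init__(self, initial):
        self.history = [initial]
        self.tentative = None       # (value, commit_record)

    def write_tentative(self, value, commit_record):
        self.tentative = (value, commit_record)

    def current(self):
        # A tentative version joins the history once its record commits.
        if self.tentative is not None and self.tentative[1].state == "committed":
            self.history.append(self.tentative[0])
            self.tentative = None
        return self.history[-1]
```

However many objects a transaction touched, the commit itself is a single-record update, which is what makes the scheme cheap.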
The notion of object versions incorporated into Swallow is used in the
system presented here, but with a somewhat different purpose --
allowing different clients to see object versions with different
degrees of stability. This is further elaborated upon in section 4.
Probabilistic broadcast
The probabilistic broadcast protocol [1,6] is a mechanism for one-to-many communication among a
group of cooperating processes, which can reside on distinct
machines. It is intended to be resilient to transient degradation of
network performance and to group members with potentially slow response
times. It also aims to be highly scalable, eventually accommodating
groups whose sizes number in the thousands. Both these goals address
shortcomings which the authors of [1]
identify in current heavy-weight group communication protocols (for
instance, the Horus system [14]).
The protocol consists of two complementary components: a message is
sent by an initial, potentially unreliable multicast to the members of
the group, followed by ``gossip'' to propagate the message to group
members which did not receive the initial multicast. Gossip
communication proceeds in rounds, which are not synchronised between
processes. In each round, a process randomly selects another process
in the group and sends it a digest of its most recently-received
messages. The recipient of the digest may then ask the digest's sender
to retransmit any messages which have not been received. After holding
each message for a pre-specified number of rounds, a process may
garbage-collect it to reclaim memory.
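The gossip phase described above might be sketched as follows; this is a minimal in-memory simulation with invented class and method names, and a real implementation would send digests over the network rather than calling peers directly:

```python
import random

class GossipProcess:
    """One group member in the gossip phase (in-memory sketch)."""

    def __init__(self, name, group, hold_rounds=10):
        self.name = name
        self.group = group          # shared list of all processes
        self.hold_rounds = hold_rounds
        self.messages = {}          # msg_id -> (payload, rounds_held)

    def receive(self, msg_id, payload):
        if msg_id not in self.messages:
            self.messages[msg_id] = (payload, 0)

    def request_missing(self, digest):
        # Ask the digest's sender for anything we have not yet received.
        return [m for m in digest if m not in self.messages]

    def gossip_round(self):
        # Age held messages; garbage-collect those held long enough.
        self.messages = {m: (p, r + 1)
                         for m, (p, r) in self.messages.items()
                         if r + 1 <= self.hold_rounds}
        peers = [p for p in self.group if p is not self]
        if not peers or not self.messages:
            return
        # Send a digest of recent message ids to one random peer,
        # then retransmit whatever it reports missing.
        peer = random.choice(peers)
        for msg_id in peer.request_missing(set(self.messages)):
            peer.receive(msg_id, self.messages[msg_id][0])
```

Even if the initial multicast misses a process entirely, repeated random digest exchanges carry the message to it with high probability before the senders garbage-collect their copies.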
The theory underlying this gossip strategy is described in
[3]. Messages spread through the group
in a similar manner to an epidemic. In each new gossip round, a
message ``infects'' processes which have not yet received it, until
all group members have received a copy of the message. Given certain
assumptions regarding the behaviour of the underlying network, the
epidemic model allows the number of receivers of a message to be
characterised by a bimodal distribution -- that is, a small
probability exists that only a few processes have received the
message, a large probability that all or nearly all processes have
received the message, and intervening events have a negligible
probability. The protocol can thus offer a probabilistic guarantee of
the success of a broadcast.
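A small Monte Carlo sketch (with arbitrary parameters, and the sender itself never failing, so only the high-probability mode of the distribution appears) illustrates this behaviour: in virtually every trial, gossip carries the message to all processes despite the lossy initial multicast.

```python
import random

def simulate_broadcast(n=50, loss=0.05, rounds=8, trials=2000, seed=1):
    """Monte Carlo estimate of how many processes receive a broadcast.

    Process 0 multicasts (each copy independently lost with probability
    `loss`), then every infected process gossips to one random peer per
    round, with the same loss probability. Returns a histogram mapping
    number-of-receivers -> number of trials with that outcome.
    """
    rng = random.Random(seed)
    histogram = {}
    for _ in range(trials):
        infected = {0}
        for p in range(1, n):              # unreliable initial multicast
            if rng.random() > loss:
                infected.add(p)
        for _ in range(rounds):            # gossip rounds
            newly = set()
            for p in infected:
                q = rng.randrange(n)       # random gossip target
                if rng.random() > loss:
                    newly.add(q)
            infected |= newly
        k = len(infected)
        histogram[k] = histogram.get(k, 0) + 1
    return histogram

hist = simulate_broadcast()
```

With these parameters, nearly all of the probability mass sits at "every process received the message", matching the epidemic model's prediction.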
Outline of implementation

The system proposed exploits the features of the probabilistic broadcast protocol to create a toolkit for implementing weakly-consistent shared objects in a distributed system. The object management system makes some sacrifices in the consistency of the objects, in exchange for minimal client overhead and improved scalability.
An object server runs on each host, responsible for caching replicas
of objects and propagating local updates of an object to the other
replicas which cache it. A client interested in accessing an object
contacts its local server. The server obtains a replica of the object
and maps it into the client's address space; the client then accesses
the object through read and write library calls. Figure
1 illustrates a configuration of the components on a
single machine. When the client updates the object, the server is
alerted. It uses a probabilistic broadcast to notify the other servers
maintaining the object -- that is, it uses an initial broadcast,
followed by gossip rounds.
A consequence of the probabilistic broadcast algorithm is that a
message sent out to a group of servers has an associated bimodal
probability distribution, which specifies the distribution of the
number of servers reached by the message. As the number of rounds
since the message was sent increases, the parameters of the
distribution change and the probability that all or almost all the
servers have received the message increases. We can view the
probability of the occurrence ``all servers have received a copy of
the update message'' as a measure of the stability of the update. It
follows that the more rounds an update message is ``delayed'' by a
server before being reflected in the client's copy of an object, the
more stable the associated update is.
This feature can be exploited by a client interested in messages which
have achieved a certain stability within the group of servers. By
specifying a delay in terms of the number of rounds for which its
local server should hold back an update, it can assure itself that the
copy of the object that it sees has attained a particular level of
stability. This means that a client which is principally interested in
urgent updates can set a minimal delay, while one with a more
long-term view would opt for a greater update delay.
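A client-side view of this delay parameter might look as follows; the interface and the round-counting rule are hypothetical, sketching the intended usage rather than a settled API:

```python
class SharedObjectView:
    """A client's view of a shared object: an update becomes visible
    only after it has been held for `delay_rounds` gossip rounds."""

    def __init__(self, delay_rounds, initial=None):
        self.delay_rounds = delay_rounds
        self.value = initial
        self.pending = []           # [rounds_held, new_value], oldest first

    def deliver_update(self, new_value):
        # Called by the local server when an update message arrives.
        self.pending.append([0, new_value])

    def end_of_round(self):
        # Each gossip round, age pending updates; apply those that have
        # reached the client's chosen stability level.
        remaining = []
        for entry in self.pending:
            entry[0] += 1
            if entry[0] >= self.delay_rounds:
                self.value = entry[1]
            else:
                remaining.append(entry)
        self.pending = remaining

    def read(self):
        return self.value
```

An urgent client might choose a delay of one round; a client wanting near-certain stability might choose a delay close to the protocol's garbage-collection horizon.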
A consequence of different clients at a host wishing to see different
views is that the server must maintain a sequence of temporally
distinct versions of each object it replicates. Each object version
has a queue of pending updates, sorted by age, which must be applied
at the start of each new round of gossip. The client, however, sees
only the version it is interested in, and accesses it by a call within
its own address space. All updates to the object are handled
asynchronously by the server, with the client accessing the object
using an in-process procedure call. Figure 2 gives a
conceptual example of object management by a server.
Further details
This section identifies the areas of the implementation which have
still to be decided, as well as possible applications and experiments
that will be conducted to evaluate the system.
Issues in implementation
It is intended that the system implementation will follow the
description in the preceding section, but the details of how an
object is mapped from a server to a client were left deliberately
vague. It is currently planned that the Unix mmap facility will
be used, with a backing file created by the server. An unresolved
problem concerns the most appropriate method of mapping objects and
performing updates on them -- in particular, a question arises
regarding whether the server should maintain a separate copy of the
object for each client, or let clients share and have a single copy
per version. The latter option would appear to require less space and
work to apply updates, but the former may result in less contention
between clients accessing the same shared memory region. It would also
prevent a disastrous memory reference in one client destroying another
client's copy of the object.
Given that a client reads and updates the object by making calls to
local procedures, another question arises in the behaviour of
updates. An update operation alerts the server so that it will
propagate the update to other servers, but it is unclear whether the
effect of a client's update should be immediately reflected in the
client's copy of the object or not. In this situation, the correct
behaviour is not obvious. Taking the attitude that the results of the
update should be deferred until it satisfies the client's desired
update stability seems preferable, since it is consistent with the
treatment of incoming updates made remotely. However, it leads to a
situation where local updates have a delayed effect, and the client
may have to be designed to block until the results of its own writes
become visible.
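The two candidate write behaviours can be contrasted in a sketch (names are hypothetical; neither behaviour is yet the chosen design):

```python
class LocalCopy:
    """Sketch of the two candidate behaviours for a client's own writes."""

    def __init__(self, value, defer_local_writes):
        self.value = value
        self.defer_local_writes = defer_local_writes
        self.outgoing = []          # updates handed to the server

    def write(self, new_value):
        self.outgoing.append(new_value)       # server will propagate this
        if not self.defer_local_writes:
            self.value = new_value            # immediate local echo

    def update_stabilised(self, new_value):
        # Called when an update (possibly the client's own) has reached
        # the client's desired stability level.
        self.value = new_value
```

The deferred variant treats the client's own writes exactly like remote ones, at the cost of a client not immediately reading back what it wrote.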
Finally, the issue of concurrency control between the client and
server to control changes to the shared object copy must be
considered. It seems desirable that a client read should block until
the server has finished writing a remote update to the object. The
most effective scheme for implementing this has not yet been decided.
Although it is a stated design goal that the system should be as
simple as possible, and have a minimum of overhead, the addition of
features such as causal ordering may be investigated to see if their
complexity is warranted. It would be interesting to see how
complicated a facility providing ``probabilistic transactions'' would
be, in which updates forming the transaction are seen as an atomic
group by recipients.
Applications and experiments
This section addresses what sort of applications a system with the
guarantees described might be suited to, and how its performance might
be examined.
The probabilistic broadcast protocol only ensures a local ordering --
that is, if message m precedes message m' from a server S, then all
other servers will report m before m'.
However, if we have two servers, each sending updates, a
case may arise where clients receive their messages in an incorrect
order, as depicted in Figure 3. As the diagram
illustrates, multicasts from distinct servers are not guaranteed to
be received in a consistent order by different receiver
processes. This fact has repercussions if probabilistic broadcast is
used for data sharing, since two clients can view inconsistent copies
of a shared data object at the same time. Applications which seek to
share data using the system described here might be partitioned into
two classes:
Inconsistencies might arise in such a system when an object moves
from the area managed by one server to another. In this case, there is
a period after the handoff from one server to another where a message
from the old server might be received late by the monitoring process. However, the volume of
updates to the object would ensure that such an inconsistency would be
quickly rectified, since more updates from the new server would
quickly arrive to overwrite the incorrect value.
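This rapid-overwrite behaviour can be illustrated with a last-writer-wins sketch (the server names and the simple overwrite rule are invented for illustration; the system's actual update rule is still to be decided):

```python
def apply_updates(replica, updates):
    """Apply location updates in arrival order; each overwrites the last."""
    for u in updates:
        replica["location"] = u["location"]
        replica["from_server"] = u["server"]
    return replica

# After a handoff from cell 7 to cell 8, one stale update from the old
# server arrives late, but fresh updates from the new server soon
# overwrite the incorrect value.
arrivals = [
    {"server": "cell-8", "location": "edge of cell 8"},
    {"server": "cell-7", "location": "centre of cell 7"},   # late, stale
    {"server": "cell-8", "location": "centre of cell 8"},
]
```

A monitoring client briefly observes the stale cell-7 location, but the next update from the new server corrects it.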
Applications belonging to the first class might include white-pages or
a distributed naming service. An example of an application in the
second, more interesting, class would be a monitoring service for a
cellular telephone network. Each mobile telephone is represented by an
object, and each cell contains a host. A client at a cell can monitor
a telephone by requesting the server running at its host to map the
appropriate object into its address space. Objects contain the spatial
location of the telephone, which must be rapidly updated, and a
transition from one cell to another changes the server which does the
updates.
Experiments will be designed to simulate such a system in a network of
computers, each running a server, and with objects which ``move'' from
the control of one server to another in a random, but repeatable
manner. Clients will watch the updates to objects and log them to
allow analysis and detection of inconsistencies. The following
experiments are being considered:
References
[1] | Kenneth Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu, and Yaron Minsky. Bimodal multicast. Technical Report TR98-1683, Department of Computer Science, Cornell University, May 1998. |
[2] | Andrew D. Birrell, Roy Levin, Roger M. Needham, and Michael D. Schroeder. Grapevine: An exercise in distributed computing. Communications of the ACM, 25(4):260--274, 1982. |
[3] | Alan Demers, Dan Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. Epidemic algorithms for replicated database management. In Proceedings of the Sixth Annual ACM Symposium on Principles of distributed computing, pages 1--12, Vancouver, British Columbia, August 1987. |
[4] | Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan. Recovery management in QuickSilver. ACM Transactions on Computer Systems, 6(1):82--108, 1988. |
[5] | Mark Hayden. The Ensemble System. PhD thesis, Cornell University, Ithaca, NY 14853, January 1998. Cornell University Computer Science Department Technical Report TR98-1662. |
[6] | Mark Hayden and Kenneth Birman. Probabilistic broadcast. Technical Report TR96-1606, Department of Computer Science, Cornell University, September 1996. |
[7] | Barbara Liskov, Dorothy Curtis, Paul Johnson, and Robert Scheifler. Implementation of Argus. In Proceedings of the Eleventh ACM Symposium on Operating systems principles, pages 111--122, Austin, Texas, November 1987. |
[8] | Erik T. Mueller, Johanna D. Moore, and Gerald H. Popek. A nested transactional mechanism for LOCUS. In Proceedings of the Ninth ACM Symposium on Operating systems principles, pages 71--89, Bretton Woods, New Hampshire, October 1983. |
[9] | Karin Petersen, Mike Spreitzer, Douglas B. Terry, Marvin Theimer, and Alan J. Demers. Flexible update propagation for weakly consistent replication. In Proceedings of the Sixteenth ACM Symposium on Operating System Principles, pages 288--301, St. Malo, France, October 1997. |
[10] | David P. Reed. Implementing atomic actions on decentralised data. ACM Transactions on Computer Systems, 1(1):3--23, 1983. |
[11] | M. Satyanarayanan, Henry H. Mashburn, Puneet Kumar, David C. Steere, and James J. Kistler. Lightweight recoverable virtual memory. ACM Transactions on Computer Systems, 12(1):33--57, 1994. |
[12] | Liba Svobodova. A reliable object-oriented data repository for a distributed computer system. In Proceedings of the Eighth ACM Symposium on Operating systems principles, pages 47--58, Pacific Grove, California, December 1981. |
[13] | Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the Fifteenth ACM Symposium on Operating systems principles, pages 172--188, Copper Mountain Resort, Colorado, December 1995. |
[14] | Robbert van Renesse, Kenneth P. Birman, and Silvano Maffeis. Horus, a flexible group communication system. Communications of the ACM, 39(4):76--83, 1996. |
[15] | Bruce Walker, Gerald Popek, Robert English, Charles Kline, and Greg Thiel. The LOCUS distributed operating system. In Proceedings of the Ninth ACM Symposium on Operating systems principles, pages 49--70, Bretton Woods, New Hampshire, October 1983. |